Interconnect DS Project
Project Description
The telecom operator Interconnect would like to forecast client churn. If a user is found to be planning to leave, they will be offered promotional codes and special plan options. Interconnect's marketing team has collected some of its clients' personal data, including information about their plans and contracts.
Plan
- Data Preprocessing
  - Look at the data
  - Change data types
  - Join the tables
  - Make a decision about NaN values
- EDA
  - Figure out which data is most relevant to the prediction task and what the target value is
  - Try to find patterns or new information that can be used
- Model training
  - Use cross-validation for better training
  - Hyperparameter tuning
  - Choose the best model based on the chosen metric (AUC-ROC)
- Test the model
- Conclusion
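The model-training steps of the plan (cross-validation, hyperparameter tuning, selection by AUC-ROC) can be sketched as follows. This is a minimal illustration, not the project's final pipeline: the data is synthetic (generated with `make_classification`), the grid values are arbitrary, and it uses a classifier with scikit-learn's built-in `"roc_auc"` scorer so that the metric is computed from predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the churn table (assumption, not project data).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.75, 0.25],
                           random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345)

# Cross-validated grid search; each candidate is scored by AUC-ROC on the held-out folds.
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=12345),
    param_grid={"n_estimators": [20, 50], "max_depth": [5, 12]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# Final check on the untouched test split, using the probability of the positive class.
test_scores = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)
print(search.best_params_, round(search.best_score_, 3), round(test_auc, 3))
```

The test split is used only once, after the grid search has picked its parameters on the training folds.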
!pip3 install catboost
import pandas as pd
import numpy as np
from lightgbm import LGBMRegressor, cv, Dataset
from catboost import Pool, CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
contract = pd.read_csv(path_to_contract)
internet = pd.read_csv(path_to_internet)
personal = pd.read_csv(path_to_personal)
phone = pd.read_csv(path_to_phone)
contract.info()
internet.info()
personal.info()
phone.info()
contract.head()
internet.head()
personal.head()
phone.head()
internet['FiberOptic'] = (internet['InternetService'] == 'Fiber optic').astype(int)
internet = internet.drop(columns=['InternetService'])
df = contract.merge(phone, on='customerID', how='left')
df = df.merge(internet, on='customerID', how='left')
df = df.merge(personal, on='customerID', how='left')
df.head()
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'],errors='coerce')
df['TotalCharges'] = df['TotalCharges'].round(2)
columns = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
           'StreamingMovies', 'FiberOptic', 'SeniorCitizen', 'PaperlessBilling', 'Partner', 'Dependents', 'TotalCharges','MonthlyCharges']
# Map Yes/No to 1/0, fill the gaps left by the joins with 0, and cast to int.
# Note: the int cast also truncates TotalCharges and MonthlyCharges to whole numbers.
for i in columns:
    df[i] = df[i].replace('No', 0)
    df[i] = df[i].replace('Yes', 1)
    df[i] = df[i].fillna(0)
    df[i] = df[i].astype(int)
df['BeginDate'] = pd.to_datetime(df['BeginDate'], format='%Y-%m-%d')
df['Woman'] = (df['gender'] == 'Female').astype(int)
df = df.drop(columns=['gender'])
We need to explore the ex-users and find what they have in common.
For clients who do not appear in the phone or internet tables, NaN values should be filled with False (0 in this case): they probably did not use those services.
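As a small illustration of that decision, the sketch below uses two made-up tables (the rows are hypothetical, only the column names follow the notebook). A left join keeps every contract, leaves NaN where a customer has no phone record, and the NaN is then treated as "service not used".

```python
import pandas as pd

# Toy tables (hypothetical rows): customer B has a contract but no phone record.
contract = pd.DataFrame({"customerID": ["A", "B", "C"],
                         "MonthlyCharges": [29.85, 56.95, 70.70]})
phone = pd.DataFrame({"customerID": ["A", "C"],
                      "MultipleLines": ["No", "Yes"]})

# how='left' keeps all contracts; missing phone data shows up as NaN.
merged = contract.merge(phone, on="customerID", how="left")
missing_before = int(merged["MultipleLines"].isna().sum())

# Encode Yes/No as 1/0 and treat a missing record as "did not use the service".
merged["MultipleLines"] = (
    merged["MultipleLines"].map({"Yes": 1, "No": 0}).fillna(0).astype(int)
)
print(merged)
```

One design consequence worth noting: after filling, a customer without phone service is indistinguishable from one who has phone service but a single line.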
df['EndDate'].value_counts()
# Work on copies so the column assignments below do not write into views of df.
current_users = df.loc[df['EndDate'] == 'No'].copy()
ex_users = df.loc[df['EndDate'] != 'No'].copy()
ex_users['BeginDate'] = ex_users['BeginDate'].dt.year
current_users['BeginDate'] = current_users['BeginDate'].dt.year
ex_users['BeginDate'].hist()
Many of the ex-users joined in 2019. They probably decided to try the service, but it did not suit them for some reason.
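# numeric_only=True skips text columns, which recent pandas versions refuse to average.
arg_difference = {'Ex-Users': ex_users.mean(numeric_only=True), 'Current-Users': current_users.mean(numeric_only=True)}
arg_difference_df = pd.DataFrame(data=arg_difference)
arg_difference_df
ex_users.groupby('BeginDate').mean(numeric_only=True)
current_users.groupby('BeginDate').mean(numeric_only=True)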
Conclusion of EDA:
Our ex-clients usually had Fiber Optic internet, used Streaming TV and Streaming Movies and, unsurprisingly, paid more. They also used Tech Support less, more of them used Multiple Lines, and there are more SeniorCitizen customers among the ex-users.
Partner and gender information do not seem to be relevant, but Dependents may be, as there is some difference on this parameter.
Most of the ex-users joined in 2019.
Our target column is EndDate. Since the target should be boolean (still with us or not), the column values need to be converted.
It is better to keep the remaining features, as any of them may be relevant to the customer's decision to leave or to stay.
This is a classification task, and since there are only two target classes, logistic regression can be used.
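A minimal sketch of that idea, assuming synthetic data from `make_classification` rather than the project table: a logistic regression is fitted, and AUC-ROC is computed from the predicted probability of the positive class. The same metric computed from hard 0/1 labels is included for comparison, since labels discard the ranking information that AUC-ROC is meant to measure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data (assumption; stands in for the churn features).
X, y = make_classification(n_samples=500, n_features=6, weights=[0.7, 0.3],
                           random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=12345)

# Scale the features, then fit a logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]   # continuous scores in [0, 1]
labels = clf.predict(X_test)              # hard 0/1 decisions at the 0.5 threshold

auc_from_proba = roc_auc_score(y_test, proba)
auc_from_labels = roc_auc_score(y_test, labels)
print(round(auc_from_proba, 3), round(auc_from_labels, 3))
```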
The AUC-ROC metric should be used to select the best model because our data is unbalanced, and the metric takes both the false-positive and false-negative rates into account.
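To make the argument concrete, AUC-ROC can be read as the share of (positive, negative) pairs that a model ranks in the right order. The sketch below implements that definition in plain Python on eight made-up labels and scores (all values are hypothetical). A constant prediction reaches 75% accuracy on this unbalanced sample but only 0.5 AUC-ROC.

```python
from itertools import product


def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Hypothetical unbalanced sample: 6 current clients (1) and 2 ex-clients (0).
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
constant = [1.0] * 8                                   # "everyone stays"
ranked = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.5, 0.2]      # an informative model

accuracy_constant = sum(int(s >= 0.5) == t for s, t in zip(constant, y_true)) / len(y_true)
print(accuracy_constant)            # 0.75
print(auc_roc(y_true, constant))    # 0.5
print(auc_roc(y_true, ranked))      # 10 of 12 pairs ordered correctly
```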
df['IsClient'] = (df['EndDate'] == 'No').astype(int)
df['EndDate'] = df['EndDate'].where(df['EndDate'] != 'No', "2020-02-01")
df['EndDate'] = pd.to_datetime(df['EndDate'], format='%Y-%m-%d')
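df["AllTime"] = (df['EndDate'] - df['BeginDate'])
# Tenure in whole months. Recent pandas versions reject np.timedelta64(1, "M") as an
# ambiguous unit, so convert via days using the average month length (365.2425 / 12 days).
df["AllTime"] = (df["AllTime"].dt.days / (365.2425 / 12)).astype(int)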
df.head()
df.columns
X = df.drop(['customerID', 'BeginDate', 'EndDate','Partner', 'Woman', 'IsClient'], axis=1)
y = df['IsClient']
X.info()
encoder = OrdinalEncoder()
X = pd.DataFrame(encoder.fit_transform(X), columns = X.columns)
X.head()
for i in X.columns:
    X[i] = X[i].astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12345)
X.head()
scorer_for_cv = make_scorer(roc_auc_score)
dummy_regr = DummyRegressor(strategy="constant", constant=0)
dummy_regr.fit(X_train, y_train)
dummy_predict = dummy_regr.predict(X_test)
dummy_auc = roc_auc_score(y_test, dummy_predict)
print("AUC-ROC =", dummy_auc)
model_lr = LinearRegression()
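# The `normalize` option was removed from LinearRegression in scikit-learn 1.2,
# so the grid tunes `fit_intercept` instead.
param_grid = [{'fit_intercept': [True, False]}]
model_lr_cv = GridSearchCV(estimator=model_lr,
                           scoring=scorer_for_cv,
                           param_grid=param_grid,
                           cv=3)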
model_lr_cv = model_lr_cv.fit(X_train, y_train)
model_lr_cv.best_score_
model_rfr = RandomForestRegressor()
param_grid = [
    {'n_estimators': [20], 'max_depth': [12], 'random_state': [12345]},
    {'n_estimators': [50], 'max_depth': [5], 'random_state': [12345]},
    {'n_estimators': [50], 'max_depth': [50], 'random_state': [12345]}]
model_rfr_cv = GridSearchCV(estimator=model_rfr,
                            param_grid=param_grid,
                            scoring=scorer_for_cv,
                            cv=5)
model_rfr_cv = model_rfr_cv.fit(X_train, y_train)
model_rfr_cv.best_score_
train_pool = Pool(X_train,
                  y_train,
                  cat_features=['PaperlessBilling', 'TotalCharges', 'MultipleLines', 'MonthlyCharges', 'AllTime',
                                'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
                                'StreamingTV', 'StreamingMovies', 'FiberOptic', 'SeniorCitizen',
                                'Dependents'])
model_CatBoostRegressor = CatBoostRegressor(eval_metric='AUC')
param_grid = [
    {'iterations': [5, 10, 100], 'depth': [5, 10, 15], 'learning_rate': [1]}]
model_cb_cv = model_CatBoostRegressor.grid_search(param_grid,
                                                  X=train_pool,
                                                  cv=3)
print("The best parameters are", model_cb_cv['params'])
model_cb_cv = CatBoostRegressor(iterations=100, depth=5,
                                learning_rate=1, eval_metric='AUC')
model_cb_cv.fit(train_pool)
y_pred = model_lr_cv.predict(X_test)
lr_auc = roc_auc_score(y_test, y_pred)
print("AUC-ROC =", lr_auc)
y_pred = model_rfr_cv.predict(X_test)
rfr_auc = roc_auc_score(y_test, y_pred)
print("AUC-ROC =", rfr_auc)
test_pool = Pool(X_test,
                 cat_features=['PaperlessBilling', 'MonthlyCharges', 'TotalCharges', 'MultipleLines',
                               'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
                               'StreamingTV', 'StreamingMovies', 'FiberOptic', 'SeniorCitizen',
                               'Dependents', 'AllTime'])
y_pred = model_cb_cv.predict(test_pool)
print(y_pred)
cbc_auc = roc_auc_score(y_test, y_pred)
print("AUC-ROC =", cbc_auc)
Solution Report
- Question: What steps of the plan were performed and what steps were skipped (explain why)?
- Answer: There was no need to skip any steps in the plan, so all of them were performed.
- Question: What difficulties did you encounter and how did you manage to solve them?
- Answer: The difficult part, which did not succeed, was raising the AUC-ROC metric above 0.88.
- Question: What were some of the key steps to solving the task?
- Answer: The key steps were to prepare data for model training and to find the best hyperparameters.
- Question: What is your final model and what quality score does it have?
- Answer: The Random Forest model gave the best result: its AUC-ROC is 0.853. It is the best model for predicting client churn.