Project Description

The telecom operator Interconnect would like to be able to forecast their churn of clients. If it's discovered that a user is planning to leave, they will be offered promotional codes and special plan options. Interconnect's marketing team has collected some of their clientele's personal data, including information about their plans and contracts.

Plan

Data Preprocessing
- Look at the data
- Change data types
- Join the tables
- Make decition about NaN values
EDA
- Figure out what data is more relevant for prediction task and what is the target value
- Try to find some pattern or new info that can be used
Models training
- Use cross-validation for better traning
- Hyperparameter tuning
- Choose the best model rest on the choosen metric (AUC-ROC)
Test the model
Conclusion

!pip3 install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/96/3b/bb419654adcf7efff42ed8a3f84e50c8f236424b7ed1cc8ccd290852e003/catboost-0.24.4-cp37-none-manylinux1_x86_64.whl (65.7MB)
     |████████████████████████████████| 65.7MB 71kB/s 
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.7/dist-packages (from catboost) (1.1.5)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from catboost) (1.19.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from catboost) (1.4.1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from catboost) (1.15.0)
Requirement already satisfied: graphviz in /usr/local/lib/python3.7/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from catboost) (3.2.2)
Requirement already satisfied: plotly in /usr/local/lib/python3.7/dist-packages (from catboost) (4.4.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->catboost) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->catboost) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.7/dist-packages (from plotly->catboost) (1.3.3)
Installing collected packages: catboost
Successfully installed catboost-0.24.4

import pandas as pd
import numpy as np
from lightgbm import LGBMRegressor, cv, Dataset
from catboost import Pool, CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor

Data Preprocessing

contract =  pd.read_csv(path_to_contract)
internet =  pd.read_csv(path_to_internet)
personal =  pd.read_csv(path_to_personal)
phone =  pd.read_csv(path_to_phone)

contract.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
dtypes: float64(1), object(7)
memory usage: 440.3+ KB

internet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5517 entries, 0 to 5516
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   customerID        5517 non-null   object
 1   InternetService   5517 non-null   object
 2   OnlineSecurity    5517 non-null   object
 3   OnlineBackup      5517 non-null   object
 4   DeviceProtection  5517 non-null   object
 5   TechSupport       5517 non-null   object
 6   StreamingTV       5517 non-null   object
 7   StreamingMovies   5517 non-null   object
dtypes: object(8)
memory usage: 344.9+ KB

personal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     7043 non-null   object
 1   gender         7043 non-null   object
 2   SeniorCitizen  7043 non-null   int64 
 3   Partner        7043 non-null   object
 4   Dependents     7043 non-null   object
dtypes: int64(1), object(4)
memory usage: 275.2+ KB

phone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6361 entries, 0 to 6360
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customerID     6361 non-null   object
 1   MultipleLines  6361 non-null   object
dtypes: object(2)
memory usage: 99.5+ KB

contract.head()

internet.head()

personal.head()

phone.head()

internet['FiberOptic'] = (internet['InternetService'] == 'Fiber optic').astype(int)
internet = internet.drop(columns=['InternetService'])

df = contract.merge(phone, on='customerID', how='left')
df = df.merge(internet, on='customerID', how='left')
df = df.merge(personal, on='customerID', how='left')

df.head()

df['TotalCharges'] = pd.to_numeric(df['TotalCharges'],errors='coerce')
df['TotalCharges'] = df['TotalCharges'].round(2)

columns = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 
           'StreamingMovies', 'FiberOptic', 'SeniorCitizen', 'PaperlessBilling', 'Partner',	'Dependents', 'TotalCharges','MonthlyCharges']
for i in columns:
  df[i] = df[i].replace('No', 0)
  df[i] = df[i].replace('Yes', 1)
  df[i] =  df[i].fillna(0)
  df[i] = df[i].astype(int)

df['BeginDate'] = pd.to_datetime(df['BeginDate'], format='%Y-%m-%d')

df['Woman'] = (df['gender'] == 'Female').astype(int)
df = df.drop(columns=['gender'])

Need to explore ex-users and find something common between them.

For client who didn't exist in phone or internet tables NaN values should be fiiled with False value or 0 in the case. Probably they didn't use those services.

EDA

df['EndDate'].value_counts()

No                     5174
2019-11-01 00:00:00     485
2019-12-01 00:00:00     466
2020-01-01 00:00:00     460
2019-10-01 00:00:00     458
Name: EndDate, dtype: int64

current_users = df.loc[df['EndDate'] == 'No']

ex_users = df.loc[df['EndDate'] != 'No']

ex_users['BeginDate'] = ex_users['BeginDate'].dt.year
current_users['BeginDate'] = current_users['BeginDate'].dt.year

/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

ex_users['BeginDate'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x7ff9c9338dd0>

There are a lot of new users from 2019 year who probably decided to try, but the service didn't sute them for some reason.

arg_difference = {'Ex-Users': ex_users.mean(), 'Current-Users': current_users.mean()}
arg_difference_df = pd.DataFrame(data=arg_difference)

arg_difference_df

ex_users.groupby('BeginDate').mean()

current_users.groupby('BeginDate').mean()

Conclution of EDA:

Our ex-clients usually had internet with Fiber Optic, used Streaming TV and Streaming Movies and as an obvious fact paid more. They also less used Tech Support, more of them used Multiple Lines of the phone and there are more SeniorCitizen among ex-users.

Partner and Gender information don't seem to be relevant, but Dependents can be as there are some difference by this parameter.

Most of the ex-users came to us in 2019 year.

Questions:

What do Partner and Dependents columns mean? Are they relevant?
What the most relevant columns would be the best to take as the features for machine leaning model?
What is the target value?
It is a classification task: whether or not the user will quit, isn't it?
What metric to use?

Our target column is EndDate. As far as it should has boolean type (with us or not), so the column values should be changed.

It's better to remain all features as all of them can be relevant to the customer decision to leave or to stay.

It is a classification task, but Logistic regression can be used as there are only two target options.

AUC-ROC metric should be used here to find the best model, because our data is unballanced and it needs to take in account the rate of false-positive and false-negative values.

Model Training

Creating Features

df['IsClient'] = (df['EndDate'] == 'No').astype(int)

df['EndDate'] = df['EndDate'].where(df['EndDate'] != 'No', "2020-02-01")
df['EndDate'] = pd.to_datetime(df['EndDate'], format='%Y-%m-%d')

df["AllTime"] = (df['EndDate'] - df['BeginDate']) 
df["AllTime"] = (df["AllTime"] / np.timedelta64(1, "M"))
df["AllTime"] = df["AllTime"].astype(int)

df.head()

df.columns

Index(['customerID', 'BeginDate', 'EndDate', 'Type', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'MultipleLines',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'FiberOptic', 'SeniorCitizen',
       'Partner', 'Dependents', 'Woman', 'IsClient', 'AllTime'],
      dtype='object')

X = df.drop(['customerID', 'BeginDate', 'EndDate','Partner', 'Woman', 'IsClient'], axis=1)
y = df['IsClient']

X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Type              7043 non-null   object
 1   PaperlessBilling  7043 non-null   int64 
 2   PaymentMethod     7043 non-null   object
 3   MonthlyCharges    7043 non-null   int64 
 4   TotalCharges      7043 non-null   int64 
 5   MultipleLines     7043 non-null   int64 
 6   OnlineSecurity    7043 non-null   int64 
 7   OnlineBackup      7043 non-null   int64 
 8   DeviceProtection  7043 non-null   int64 
 9   TechSupport       7043 non-null   int64 
 10  StreamingTV       7043 non-null   int64 
 11  StreamingMovies   7043 non-null   int64 
 12  FiberOptic        7043 non-null   int64 
 13  SeniorCitizen     7043 non-null   int64 
 14  Dependents        7043 non-null   int64 
 15  AllTime           7043 non-null   int64 
dtypes: int64(14), object(2)
memory usage: 935.4+ KB

encoder = OrdinalEncoder()
X = pd.DataFrame(encoder.fit_transform(X), columns = X.columns)

X.head()

for i in X.columns:
  X[i] = X[i].astype('int')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12345)

X.head()

Dummy model

scorer_for_cv = make_scorer(roc_auc_score)

dummy_regr = DummyRegressor(strategy="constant", constant=0)
dummy_regr.fit(X_train, y_train)
dummy_predict = dummy_regr.predict(X_test)

dummy_auc = roc_auc_score(y_test, dummy_predict)
print("AUC-ROC =", dummy_auc)

AUC-ROC = 0.5

Linear regression

model_lr = LinearRegression()

param_grid = [{'normalize': [True, False]}]
model_lr_cv = GridSearchCV(estimator=model_lr,
                           scoring=scorer_for_cv,
                           param_grid=param_grid,
                           cv=3)
model_lr_cv = model_lr_cv.fit(X_train, y_train)

model_lr_cv.best_score_

0.9134962791064755

Random forrest

model_rfr = RandomForestRegressor()

param_grid = [
   {'n_estimators':[20], 'max_depth':[12], 'random_state':[12345]},
  {'n_estimators':[50], 'max_depth':[5],'random_state':[12345]},
  {'n_estimators':[50],'max_depth':[50], 'random_state':[12345]}]


model_rfr_cv = GridSearchCV(estimator=model_rfr,
                          param_grid=param_grid,
                           scoring=scorer_for_cv,
                          cv=5)
model_rfr_cv = model_rfr_cv.fit(X_train, y_train)

model_rfr_cv.best_score_

0.9228055321689848

Gradient boosting

train_pool = Pool(X_train, 
                  y_train, 
                  cat_features=['PaperlessBilling',  'TotalCharges', 'MultipleLines', 'MonthlyCharges','AllTime', 
                  'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
                  'StreamingTV', 'StreamingMovies', 'FiberOptic', 'SeniorCitizen',
                    'Dependents']) 
model_CatBoostRegressor = CatBoostRegressor(eval_metric='AUC')

param_grid = [
  {'iterations': [5, 10, 100], 'depth':[5, 10, 15], 'learning_rate':[1]}]

model_cb_cv = model_CatBoostRegressor.grid_search(param_grid, 
                                                   X=train_pool,
                                                   cv=3)

bestTest = 0.8371061337
bestIteration = 4

Metric AUC is not calculated on train by default. To calculate this metric on train, add hints=skip_train~false to metric parameters.
0:	loss: 0.8371061	best: 0.8371061 (0)	total: 76.8ms	remaining: 615ms

bestTest = 0.8401220378
bestIteration = 6

Metric AUC is not calculated on train by default. To calculate this metric on train, add hints=skip_train~false to metric parameters.
1:	loss: 0.8401220	best: 0.8401220 (1)	total: 102ms	remaining: 358ms

bestTest = 0.8451820289
bestIteration = 14

Metric AUC is not calculated on train by default. To calculate this metric on train, add hints=skip_train~false to metric parameters.
2:	loss: 0.8451820	best: 0.8451820 (2)	total: 316ms	remaining: 633ms

bestTest = 0.826336716
bestIteration = 3

Metric AUC is not calculated on train by default. To calculate this metric on train, add hints=skip_train~false to metric parameters.
3:	loss: 0.8263367	best: 0.8451820 (2)	total: 374ms	remaining: 468ms

bestTest = 0.8264395044
bestIteration = 8

Metric AUC is not calculated on train by default. To calculate this metric on train, add hints=skip_train~false to metric parameters.
4:	loss: 0.8264395	best: 0.8451820 (2)	total: 475ms	remaining: 380ms

bestTest = 0.8264395044
bestIteration = 8

Metric AUC is not calculated on train by default. To calculate this metric on train, add hints=skip_train~false to metric parameters.
5:	loss: 0.8264395	best: 0.8451820 (2)	total: 1.7s	remaining: 852ms

bestTest = 0.8060803992
bestIteration = 1

Metric AUC is not calculated on train by default. To calculate this metric on train, add hints=skip_train~false to metric parameters.
6:	loss: 0.8060804	best: 0.8451820 (2)	total: 3.31s	remaining: 945ms

bestTest = 0.8060803992
bestIteration = 1

Metric AUC is not calculated on train by default. To calculate this metric on train, add hints=skip_train~false to metric parameters.
7:	loss: 0.8060804	best: 0.8451820 (2)	total: 5.42s	remaining: 678ms

bestTest = 0.8060803992
bestIteration = 1

Metric AUC is not calculated on train by default. To calculate this metric on train, add hints=skip_train~false to metric parameters.
8:	loss: 0.8060804	best: 0.8451820 (2)	total: 34.2s	remaining: 0us
Estimating final quality...

print("The best parameters is", model_cb_cv['params'])

The best parameters is {'depth': 5, 'learning_rate': 1, 'iterations': 100}

model_cb_cv = CatBoostRegressor(iterations=100, depth=5, 
                                        learning_rate=1, eval_metric='AUC')
model_cb_cv.fit(train_pool)

0:	total: 3.66ms	remaining: 363ms
1:	total: 6ms	remaining: 294ms
2:	total: 8.94ms	remaining: 289ms
3:	total: 11.6ms	remaining: 279ms
4:	total: 13.9ms	remaining: 264ms
5:	total: 16.3ms	remaining: 255ms
6:	total: 18.7ms	remaining: 248ms
7:	total: 21ms	remaining: 242ms
8:	total: 23.4ms	remaining: 237ms
9:	total: 25.7ms	remaining: 231ms
10:	total: 28.1ms	remaining: 227ms
11:	total: 30.4ms	remaining: 223ms
12:	total: 32.7ms	remaining: 219ms
13:	total: 34.9ms	remaining: 215ms
14:	total: 37.2ms	remaining: 211ms
15:	total: 39.4ms	remaining: 207ms
16:	total: 41.6ms	remaining: 203ms
17:	total: 44.2ms	remaining: 201ms
18:	total: 46.5ms	remaining: 198ms
19:	total: 48.7ms	remaining: 195ms
20:	total: 51ms	remaining: 192ms
21:	total: 53.3ms	remaining: 189ms
22:	total: 55.6ms	remaining: 186ms
23:	total: 57.9ms	remaining: 183ms
24:	total: 60.6ms	remaining: 182ms
25:	total: 63ms	remaining: 179ms
26:	total: 65.5ms	remaining: 177ms
27:	total: 67.9ms	remaining: 175ms
28:	total: 70.3ms	remaining: 172ms
29:	total: 72.6ms	remaining: 169ms
30:	total: 75.1ms	remaining: 167ms
31:	total: 77.3ms	remaining: 164ms
32:	total: 79.8ms	remaining: 162ms
33:	total: 82.1ms	remaining: 159ms
34:	total: 84.5ms	remaining: 157ms
35:	total: 86.9ms	remaining: 155ms
36:	total: 89.3ms	remaining: 152ms
37:	total: 91.7ms	remaining: 150ms
38:	total: 94ms	remaining: 147ms
39:	total: 96.4ms	remaining: 145ms
40:	total: 98.7ms	remaining: 142ms
41:	total: 101ms	remaining: 140ms
42:	total: 103ms	remaining: 137ms
43:	total: 106ms	remaining: 135ms
44:	total: 108ms	remaining: 132ms
45:	total: 111ms	remaining: 130ms
46:	total: 113ms	remaining: 127ms
47:	total: 115ms	remaining: 125ms
48:	total: 118ms	remaining: 123ms
49:	total: 120ms	remaining: 120ms
50:	total: 123ms	remaining: 118ms
51:	total: 125ms	remaining: 116ms
52:	total: 128ms	remaining: 113ms
53:	total: 130ms	remaining: 111ms
54:	total: 132ms	remaining: 108ms
55:	total: 135ms	remaining: 106ms
56:	total: 137ms	remaining: 103ms
57:	total: 140ms	remaining: 101ms
58:	total: 142ms	remaining: 98.7ms
59:	total: 144ms	remaining: 96.2ms
60:	total: 147ms	remaining: 93.8ms
61:	total: 149ms	remaining: 91.3ms
62:	total: 151ms	remaining: 89ms
63:	total: 154ms	remaining: 86.5ms
64:	total: 156ms	remaining: 84.1ms
65:	total: 159ms	remaining: 81.7ms
66:	total: 161ms	remaining: 79.2ms
67:	total: 167ms	remaining: 78.4ms
68:	total: 169ms	remaining: 76ms
69:	total: 172ms	remaining: 73.5ms
70:	total: 176ms	remaining: 72ms
71:	total: 182ms	remaining: 70.9ms
72:	total: 186ms	remaining: 68.8ms
73:	total: 190ms	remaining: 66.6ms
74:	total: 195ms	remaining: 65ms
75:	total: 197ms	remaining: 62.3ms
76:	total: 200ms	remaining: 59.6ms
77:	total: 202ms	remaining: 57ms
78:	total: 204ms	remaining: 54.3ms
79:	total: 207ms	remaining: 51.7ms
80:	total: 209ms	remaining: 49ms
81:	total: 211ms	remaining: 46.4ms
82:	total: 214ms	remaining: 43.8ms
83:	total: 216ms	remaining: 41.1ms
84:	total: 218ms	remaining: 38.5ms
85:	total: 221ms	remaining: 35.9ms
86:	total: 223ms	remaining: 33.3ms
87:	total: 225ms	remaining: 30.7ms
88:	total: 228ms	remaining: 28.1ms
89:	total: 230ms	remaining: 25.5ms
90:	total: 232ms	remaining: 23ms
91:	total: 235ms	remaining: 20.4ms
92:	total: 237ms	remaining: 17.8ms
93:	total: 239ms	remaining: 15.3ms
94:	total: 242ms	remaining: 12.7ms
95:	total: 244ms	remaining: 10.2ms
96:	total: 246ms	remaining: 7.62ms
97:	total: 249ms	remaining: 5.08ms
98:	total: 251ms	remaining: 2.54ms
99:	total: 254ms	remaining: 0us

<catboost.core.CatBoostRegressor at 0x7ff9bcacaed0>

Test the model

Linear regression

y_pred = model_lr_cv.predict(X_test)
lr_auc = roc_auc_score(y_test, y_pred)
print("AUC-ROC =", lr_auc)

AUC-ROC = 0.8293950540785984

Random forrest

y_pred = model_rfr_cv.predict(X_test)
rfr_auc = roc_auc_score(y_test, y_pred)
print("AUC-ROC =", rfr_auc)

AUC-ROC = 0.8531534101154354

CatBoostClassifier

test_pool = Pool(X_test, 
                  cat_features=['PaperlessBilling', 'MonthlyCharges', 'TotalCharges', 'MultipleLines',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'FiberOptic', 'SeniorCitizen',
        'Dependents', 'AllTime']) 
y_pred = model_cb_cv.predict(test_pool)
print(y_pred)

cbc_auc = roc_auc_score(y_test, y_pred)
print("AUC-ROC =", cbc_auc)

[0.88648241 0.46205313 0.10287211 ... 0.79784994 1.16184092 1.02592976]
AUC-ROC = 0.8272001088456784

Solution Report

Question: What steps of the plan were performed and what steps were skipped (explain why)?
Answer: It was no need to skip any steps in the plan, so all steps were performed.

Question: What difficulties did you encounter and how did you manage to solve them?
Answer: The difficult and unsucceded part was to increase the AUC-ROC metric more than 0.88.

Question: What were some of the key steps to solving the task?
Answer: The key steps were to prepare data for model training and to find the best hyperparameters.

Question: What is your final model and what quality score does it have?
Answer: The Random Forest model gave the best result - it's AUC-ROC metric has value 0,853. It is the best model to predict the churn of clients.

	customerID	BeginDate	EndDate	Type	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges
0	7590-VHVEG	2020-01-01	No	Month-to-month	Yes	Electronic check	29.85	29.85
1	5575-GNVDE	2017-04-01	No	One year	No	Mailed check	56.95	1889.5
2	3668-QPYBK	2019-10-01	2019-12-01 00:00:00	Month-to-month	Yes	Mailed check	53.85	108.15
3	7795-CFOCW	2016-05-01	No	One year	No	Bank transfer (automatic)	42.30	1840.75
4	9237-HQITU	2019-09-01	2019-11-01 00:00:00	Month-to-month	Yes	Electronic check	70.70	151.65

	Ex-Users	Current-Users
BeginDate	2017.834671	2016.533243
PaperlessBilling	0.749064	0.535562
MonthlyCharges	73.979668	60.797835
TotalCharges	1531.323167	2549.443564
MultipleLines	0.454789	0.409934
OnlineSecurity	0.157838	0.333204
OnlineBackup	0.279829	0.368380
DeviceProtection	0.291600	0.362775
TechSupport	0.165864	0.335137
StreamingTV	0.435527	0.365868
StreamingMovies	0.437667	0.369927
FiberOptic	0.693954	0.347700
SeniorCitizen	0.254682	0.128721
Partner	0.357945	0.528218
Dependents	0.174425	0.344801
Woman	0.502408	0.492656

	PaperlessBilling	MonthlyCharges	TotalCharges	MultipleLines	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	FiberOptic	SeniorCitizen	Partner	Dependents	Woman
BeginDate
2013	0.750000	100.125000	7137.750000	1.000000	0.625000	0.625000	0.875000	0.750000	0.875000	0.875000	0.750000	0.250000	0.875000	0.250000	0.500000
2014	0.775510	95.724490	6316.387755	0.826531	0.326531	0.734694	0.663265	0.387755	0.836735	0.846939	0.846939	0.255102	0.724490	0.295918	0.367347
2015	0.756098	88.000000	4646.040650	0.747967	0.325203	0.560976	0.512195	0.243902	0.731707	0.699187	0.796748	0.308943	0.601626	0.203252	0.471545
2016	0.813793	83.882759	3441.834483	0.600000	0.262069	0.475862	0.510345	0.317241	0.613793	0.627586	0.744828	0.365517	0.482759	0.172414	0.503448
2017	0.804020	82.884422	2375.693467	0.643216	0.190955	0.381910	0.361809	0.185930	0.537688	0.562814	0.824121	0.331658	0.492462	0.236181	0.532663
2018	0.790476	78.352381	1254.796825	0.517460	0.187302	0.269841	0.314286	0.190476	0.498413	0.501587	0.752381	0.279365	0.387302	0.184127	0.517460
2019	0.711519	65.162080	252.173293	0.296636	0.084608	0.149847	0.168196	0.094801	0.287462	0.286442	0.612640	0.207951	0.231397	0.142712	0.508665

	PaperlessBilling	MonthlyCharges	TotalCharges	MultipleLines	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	FiberOptic	SeniorCitizen	Partner	Dependents	Woman
BeginDate
2014	0.581059	74.090690	5121.858748	0.662921	0.542536	0.605939	0.600321	0.539326	0.561798	0.570626	0.418941	0.155698	0.763242	0.413323	0.483949
2015	0.578875	67.499314	3785.851852	0.508916	0.393690	0.462277	0.452675	0.406036	0.462277	0.475995	0.407407	0.153635	0.639232	0.370370	0.511660
2016	0.550162	62.250809	2746.218447	0.428803	0.330097	0.385113	0.399676	0.317152	0.401294	0.401294	0.364078	0.119741	0.548544	0.360841	0.525890
2017	0.518576	60.730650	1913.739938	0.380805	0.297214	0.329721	0.349845	0.326625	0.349845	0.345201	0.362229	0.147059	0.487616	0.329721	0.470588
2018	0.511888	54.158042	1074.416783	0.309091	0.236364	0.236364	0.236364	0.237762	0.261538	0.254545	0.318881	0.114685	0.446154	0.335664	0.492308
2019	0.498975	48.607582	347.054303	0.175205	0.177254	0.178279	0.148566	0.170082	0.184426	0.191598	0.267418	0.099385	0.309426	0.272541	0.501025
2020	0.397541	37.610656	35.766393	0.086066	0.094262	0.081967	0.049180	0.094262	0.061475	0.065574	0.131148	0.049180	0.168033	0.233607	0.422131

	customerID	gender	Partner	Dependents
0	7590-VHVEG	Female	Yes	No
1	5575-GNVDE	Male	No	No
2	3668-QPYBK	Male	No	No
3	7795-CFOCW	Male	No	No
4	9237-HQITU	Female	No	No

	customerID	BeginDate	EndDate	Type	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	FiberOptic	Partner	Woman	IsClient	AllTime
0	7590-VHVEG	2020-01-01	2020-02-01	Month-to-month	1	Electronic check	29	29	0	1	0	0	0	1	1	1	1
1	5575-GNVDE	2017-04-01	2020-02-01	One year	0	Mailed check	56	1889	1	0	1	0	0	0	0	1	34
2	3668-QPYBK	2019-10-01	2019-12-01	Month-to-month	1	Mailed check	53	108	1	1	0	0	0	0	0	0	2
3	7795-CFOCW	2016-05-01	2020-02-01	One year	0	Bank transfer (automatic)	42	1840	1	0	1	1	0	0	0	1	45
4	9237-HQITU	2019-09-01	2019-11-01	Month-to-month	1	Electronic check	70	151	0	0	0	0	1	0	1	0	2

	Type	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	FiberOptic	AllTime
0	0.0	1.0	2.0	10.0	10.0	0.0	1.0	0.0	0.0	0.0	1.0
1	1.0	0.0	3.0	36.0	1539.0	1.0	0.0	1.0	0.0	0.0	34.0
2	0.0	1.0	3.0	33.0	89.0	1.0	1.0	0.0	0.0	0.0	2.0
3	1.0	0.0	0.0	22.0	1503.0	1.0	0.0	1.0	1.0	0.0	45.0
4	0.0	1.0	2.0	50.0	131.0	0.0	0.0	0.0	0.0	1.0	2.0

	Type	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	FiberOptic	AllTime
0	0	1	2	10	10	0	1	0	0	0	1
1	1	0	3	36	1539	1	0	1	0	0	34
2	0	1	3	33	89	1	1	0	0	0	2
3	1	0	0	22	1503	1	0	1	1	0	45
4	0	1	2	50	131	0	0	0	0	1	2