Task statement

Prepare a prototype of a machine learning model for Zyfra. The company develops efficiency solutions for heavy industry.

The model should predict the amount of gold recovered from gold ore. There are the data on extraction and purification.

The model will help to optimize the production and eliminate unprofitable parameters.

import pandas as pd
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_squared_error
from numpy.random import RandomState
from scipy import stats as st
import matplotlib.pyplot as plt
from sklearn import preprocessing
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor 
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer

Data Preprocessing

Look at the data

train = pd.read_csv("/datasets/gold_recovery_train.csv")
test = pd.read_csv("/datasets/gold_recovery_test.csv")
full = pd.read_csv("/datasets/gold_recovery_full.csv")
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 87 columns):
date                                                  16860 non-null object
final.output.concentrate_ag                           16788 non-null float64
final.output.concentrate_pb                           16788 non-null float64
final.output.concentrate_sol                          16490 non-null float64
final.output.concentrate_au                           16789 non-null float64
final.output.recovery                                 15339 non-null float64
final.output.tail_ag                                  16794 non-null float64
final.output.tail_pb                                  16677 non-null float64
final.output.tail_sol                                 16715 non-null float64
final.output.tail_au                                  16794 non-null float64
primary_cleaner.input.sulfate                         15553 non-null float64
primary_cleaner.input.depressant                      15598 non-null float64
primary_cleaner.input.feed_size                       16860 non-null float64
primary_cleaner.input.xanthate                        15875 non-null float64
primary_cleaner.output.concentrate_ag                 16778 non-null float64
primary_cleaner.output.concentrate_pb                 16502 non-null float64
primary_cleaner.output.concentrate_sol                16224 non-null float64
primary_cleaner.output.concentrate_au                 16778 non-null float64
primary_cleaner.output.tail_ag                        16777 non-null float64
primary_cleaner.output.tail_pb                        16761 non-null float64
primary_cleaner.output.tail_sol                       16579 non-null float64
primary_cleaner.output.tail_au                        16777 non-null float64
primary_cleaner.state.floatbank8_a_air                16820 non-null float64
primary_cleaner.state.floatbank8_a_level              16827 non-null float64
primary_cleaner.state.floatbank8_b_air                16820 non-null float64
primary_cleaner.state.floatbank8_b_level              16833 non-null float64
primary_cleaner.state.floatbank8_c_air                16822 non-null float64
primary_cleaner.state.floatbank8_c_level              16833 non-null float64
primary_cleaner.state.floatbank8_d_air                16821 non-null float64
primary_cleaner.state.floatbank8_d_level              16833 non-null float64
rougher.calculation.sulfate_to_au_concentrate         16833 non-null float64
rougher.calculation.floatbank10_sulfate_to_au_feed    16833 non-null float64
rougher.calculation.floatbank11_sulfate_to_au_feed    16833 non-null float64
rougher.calculation.au_pb_ratio                       15618 non-null float64
rougher.input.feed_ag                                 16778 non-null float64
rougher.input.feed_pb                                 16632 non-null float64
rougher.input.feed_rate                               16347 non-null float64
rougher.input.feed_size                               16443 non-null float64
rougher.input.feed_sol                                16568 non-null float64
rougher.input.feed_au                                 16777 non-null float64
rougher.input.floatbank10_sulfate                     15816 non-null float64
rougher.input.floatbank10_xanthate                    16514 non-null float64
rougher.input.floatbank11_sulfate                     16237 non-null float64
rougher.input.floatbank11_xanthate                    14956 non-null float64
rougher.output.concentrate_ag                         16778 non-null float64
rougher.output.concentrate_pb                         16778 non-null float64
rougher.output.concentrate_sol                        16698 non-null float64
rougher.output.concentrate_au                         16778 non-null float64
rougher.output.recovery                               14287 non-null float64
rougher.output.tail_ag                                14610 non-null float64
rougher.output.tail_pb                                16778 non-null float64
rougher.output.tail_sol                               14611 non-null float64
rougher.output.tail_au                                14611 non-null float64
rougher.state.floatbank10_a_air                       16807 non-null float64
rougher.state.floatbank10_a_level                     16807 non-null float64
rougher.state.floatbank10_b_air                       16807 non-null float64
rougher.state.floatbank10_b_level                     16807 non-null float64
rougher.state.floatbank10_c_air                       16807 non-null float64
rougher.state.floatbank10_c_level                     16814 non-null float64
rougher.state.floatbank10_d_air                       16802 non-null float64
rougher.state.floatbank10_d_level                     16809 non-null float64
rougher.state.floatbank10_e_air                       16257 non-null float64
rougher.state.floatbank10_e_level                     16809 non-null float64
rougher.state.floatbank10_f_air                       16802 non-null float64
rougher.state.floatbank10_f_level                     16802 non-null float64
secondary_cleaner.output.tail_ag                      16776 non-null float64
secondary_cleaner.output.tail_pb                      16764 non-null float64
secondary_cleaner.output.tail_sol                     14874 non-null float64
secondary_cleaner.output.tail_au                      16778 non-null float64
secondary_cleaner.state.floatbank2_a_air              16497 non-null float64
secondary_cleaner.state.floatbank2_a_level            16751 non-null float64
secondary_cleaner.state.floatbank2_b_air              16705 non-null float64
secondary_cleaner.state.floatbank2_b_level            16748 non-null float64
secondary_cleaner.state.floatbank3_a_air              16763 non-null float64
secondary_cleaner.state.floatbank3_a_level            16747 non-null float64
secondary_cleaner.state.floatbank3_b_air              16752 non-null float64
secondary_cleaner.state.floatbank3_b_level            16750 non-null float64
secondary_cleaner.state.floatbank4_a_air              16731 non-null float64
secondary_cleaner.state.floatbank4_a_level            16747 non-null float64
secondary_cleaner.state.floatbank4_b_air              16768 non-null float64
secondary_cleaner.state.floatbank4_b_level            16767 non-null float64
secondary_cleaner.state.floatbank5_a_air              16775 non-null float64
secondary_cleaner.state.floatbank5_a_level            16775 non-null float64
secondary_cleaner.state.floatbank5_b_air              16775 non-null float64
secondary_cleaner.state.floatbank5_b_level            16776 non-null float64
secondary_cleaner.state.floatbank6_a_air              16757 non-null float64
secondary_cleaner.state.floatbank6_a_level            16775 non-null float64
dtypes: float64(86), object(1)
memory usage: 11.2+ MB

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5856 entries, 0 to 5855
Data columns (total 53 columns):
date                                          5856 non-null object
primary_cleaner.input.sulfate                 5554 non-null float64
primary_cleaner.input.depressant              5572 non-null float64
primary_cleaner.input.feed_size               5856 non-null float64
primary_cleaner.input.xanthate                5690 non-null float64
primary_cleaner.state.floatbank8_a_air        5840 non-null float64
primary_cleaner.state.floatbank8_a_level      5840 non-null float64
primary_cleaner.state.floatbank8_b_air        5840 non-null float64
primary_cleaner.state.floatbank8_b_level      5840 non-null float64
primary_cleaner.state.floatbank8_c_air        5840 non-null float64
primary_cleaner.state.floatbank8_c_level      5840 non-null float64
primary_cleaner.state.floatbank8_d_air        5840 non-null float64
primary_cleaner.state.floatbank8_d_level      5840 non-null float64
rougher.input.feed_ag                         5840 non-null float64
rougher.input.feed_pb                         5840 non-null float64
rougher.input.feed_rate                       5816 non-null float64
rougher.input.feed_size                       5834 non-null float64
rougher.input.feed_sol                        5789 non-null float64
rougher.input.feed_au                         5840 non-null float64
rougher.input.floatbank10_sulfate             5599 non-null float64
rougher.input.floatbank10_xanthate            5733 non-null float64
rougher.input.floatbank11_sulfate             5801 non-null float64
rougher.input.floatbank11_xanthate            5503 non-null float64
rougher.state.floatbank10_a_air               5839 non-null float64
rougher.state.floatbank10_a_level             5840 non-null float64
rougher.state.floatbank10_b_air               5839 non-null float64
rougher.state.floatbank10_b_level             5840 non-null float64
rougher.state.floatbank10_c_air               5839 non-null float64
rougher.state.floatbank10_c_level             5840 non-null float64
rougher.state.floatbank10_d_air               5839 non-null float64
rougher.state.floatbank10_d_level             5840 non-null float64
rougher.state.floatbank10_e_air               5839 non-null float64
rougher.state.floatbank10_e_level             5840 non-null float64
rougher.state.floatbank10_f_air               5839 non-null float64
rougher.state.floatbank10_f_level             5840 non-null float64
secondary_cleaner.state.floatbank2_a_air      5836 non-null float64
secondary_cleaner.state.floatbank2_a_level    5840 non-null float64
secondary_cleaner.state.floatbank2_b_air      5833 non-null float64
secondary_cleaner.state.floatbank2_b_level    5840 non-null float64
secondary_cleaner.state.floatbank3_a_air      5822 non-null float64
secondary_cleaner.state.floatbank3_a_level    5840 non-null float64
secondary_cleaner.state.floatbank3_b_air      5840 non-null float64
secondary_cleaner.state.floatbank3_b_level    5840 non-null float64
secondary_cleaner.state.floatbank4_a_air      5840 non-null float64
secondary_cleaner.state.floatbank4_a_level    5840 non-null float64
secondary_cleaner.state.floatbank4_b_air      5840 non-null float64
secondary_cleaner.state.floatbank4_b_level    5840 non-null float64
secondary_cleaner.state.floatbank5_a_air      5840 non-null float64
secondary_cleaner.state.floatbank5_a_level    5840 non-null float64
secondary_cleaner.state.floatbank5_b_air      5840 non-null float64
secondary_cleaner.state.floatbank5_b_level    5840 non-null float64
secondary_cleaner.state.floatbank6_a_air      5840 non-null float64
secondary_cleaner.state.floatbank6_a_level    5840 non-null float64
dtypes: float64(52), object(1)
memory usage: 2.4+ MB

Check that recovery is calculated correctly

rougher_table = train[['rougher.input.feed_au', 'rougher.output.concentrate_au', 
                       'rougher.output.tail_au', 'rougher.output.recovery']]
rougher_table.columns = ['feed', 'concentrate', 'tail', 'recovery']
rougher_table.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16860 entries, 0 to 16859
Data columns (total 4 columns):
feed           16777 non-null float64
concentrate    16778 non-null float64
tail           14611 non-null float64
recovery       14287 non-null float64
dtypes: float64(4)
memory usage: 527.0 KB
rougher_table[['feed', 'concentrate', 'tail', 'recovery']].isnull().mean()
feed           0.004923
concentrate    0.004864
tail           0.133393
recovery       0.152610
dtype: float64
rougher_table = rougher_table.dropna().reset_index()
rougher_table.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14287 entries, 0 to 14286
Data columns (total 5 columns):
index          14287 non-null int64
feed           14287 non-null float64
concentrate    14287 non-null float64
tail           14287 non-null float64
recovery       14287 non-null float64
dtypes: float64(4), int64(1)
memory usage: 558.2 KB
rougher_table.head()
index feed concentrate tail recovery
0 0 6.486150 19.793808 1.170244 87.107763
1 1 6.478583 20.050975 1.184827 86.843261
2 2 6.362222 19.737170 1.162562 86.842308
3 3 6.118189 19.320810 1.079755 87.226430
4 4 5.663707 19.216101 1.012642 86.688794
def recovery_cal(table):
    r = []
    for i in range(len(table)):
        c = table.loc[i, 'concentrate']
        f = table.loc[i, 'feed']
        t = table.loc[i, 'tail']
        if (f*(c-t)) == 0:
            r.append(0)
        else:
            result = (c*(f-t)) / (f*(c-t)) * 100
            r.append(result)
    return r
rougher_table['r_pred'] = recovery_cal(rougher_table)
rougher_table['r_pred'] = rougher_table['r_pred'].fillna(value=0)
mae = mean_squared_error(rougher_table['recovery'], rougher_table['r_pred'])
mae
2.0435431534920925e-28

To re-calculate recovery it's better to drop all NaN value from the table to avoid errors.

In general original recovery value is not so different from re-calculated ones.

Analyze the test set

not_in_test = [x for x in train.columns if x not in test.columns]
not_in_test

['final.output.concentrate_ag',
 'final.output.concentrate_pb',
 'final.output.concentrate_sol',
 'final.output.concentrate_au',
 'final.output.recovery',
 'final.output.tail_ag',
 'final.output.tail_pb',
 'final.output.tail_sol',
 'final.output.tail_au',
 'primary_cleaner.output.concentrate_ag',
 'primary_cleaner.output.concentrate_pb',
 'primary_cleaner.output.concentrate_sol',
 'primary_cleaner.output.concentrate_au',
 'primary_cleaner.output.tail_ag',
 'primary_cleaner.output.tail_pb',
 'primary_cleaner.output.tail_sol',
 'primary_cleaner.output.tail_au',
 'rougher.calculation.sulfate_to_au_concentrate',
 'rougher.calculation.floatbank10_sulfate_to_au_feed',
 'rougher.calculation.floatbank11_sulfate_to_au_feed',
 'rougher.calculation.au_pb_ratio',
 'rougher.output.concentrate_ag',
 'rougher.output.concentrate_pb',
 'rougher.output.concentrate_sol',
 'rougher.output.concentrate_au',
 'rougher.output.recovery',
 'rougher.output.tail_ag',
 'rougher.output.tail_pb',
 'rougher.output.tail_sol',
 'rougher.output.tail_au',
 'secondary_cleaner.output.tail_ag',
 'secondary_cleaner.output.tail_pb',
 'secondary_cleaner.output.tail_sol',
 'secondary_cleaner.output.tail_au']

Nearly all this values that are not in the test table can be predicted from existing values, so they are targets. Test table consist of features values.

Deal with NaN values

target_columns = ['rougher.output.recovery', 'final.output.recovery']
features_columns = [x for x in train.columns if x in test.columns]
features_columns = features_columns[1:]
for i in features_columns:
    i.replace('\\n       ', '')
target_features_colums = target_columns + features_columns
print("We will lost {0:.2f}% of test data".format((len(train) - len(train.dropna(subset=target_features_colums))) / len(train)))
We will lost 0.25% of test data
print("We will lost {0:.2f}% of test data".format((len(test) - len(test.dropna())) / len(test)))
We will lost 0.08% of test data
train = train.dropna(subset=features_columns)
train = train.dropna(subset=target_columns)
test = test.dropna()

EDA

Dependence between concentrations of metals (Au, Ag, Pb) and purification stage

concent_au = full[['rougher.output.concentrate_au', 
                   'primary_cleaner.output.concentrate_au', 
                   'final.output.concentrate_au']]
concent_ag = full[['rougher.output.concentrate_ag', 
                   'primary_cleaner.output.concentrate_ag', 
                   'final.output.concentrate_ag']]
concent_pd = full[['rougher.output.concentrate_pb', 
                   'primary_cleaner.output.concentrate_pb', 
                   'final.output.concentrate_pb']]
plt.hist(concent_au.iloc[:, 0], bins=20, density=True, label='rough', histtype = 'step')
plt.hist(concent_au.iloc[:, 1], bins=20, density=True, label='primary', histtype = 'step')
plt.hist(concent_au.iloc[:, 2], bins=20, density=True, label='final', histtype = 'step')
plt.legend()
plt.show()
plt.hist(concent_ag.iloc[:, 0], bins=20, density=True, label='rough', histtype = 'step')
plt.hist(concent_ag.iloc[:, 1], bins=20, density=True, label='primary', histtype = 'step')
plt.hist(concent_ag.iloc[:, 2], bins=20, density=True, label='final', histtype = 'step')
plt.legend()
plt.show()
plt.hist(concent_pd.iloc[:, 0], bins=20, density=True, label='rough', histtype = 'step')
plt.hist(concent_pd.iloc[:, 1], bins=20, density=True, label='primary', histtype = 'step')
plt.hist(concent_pd.iloc[:, 2], bins=20, density=True, label='final', histtype = 'step')
plt.legend()
plt.show()

By the histograms we can say that concentration of gold is increasing on each cleaner step and it's expected. Obviouse all purification processes filter a lot off non-related elements, like Silver. Concentrate of Plumbum is encrease as well, it's interesting why.

Compare the feed particle size distributions in the train and in the test sets

feed_columns = ['primary_cleaner.input.feed_size', 'rougher.input.feed_size']
train_feed = train[feed_columns]
test_feed = test[feed_columns]
plt.hist(train_feed.iloc[:, 0], bins=20, density=True, label='train', histtype = 'step')
plt.hist(test_feed.iloc[:, 0], bins=20, density=True, label='test', histtype = 'step')
plt.legend()
plt.show()
plt.hist(train_feed.iloc[:, 1], bins=20, density=True, label='train', histtype = 'step')
plt.hist(test_feed.iloc[:, 1], bins=20, density=True, label='test', histtype = 'step')
plt.legend()
plt.show()

There is almost no difference and we can consider these distribution the same.

Consider the total concentrations of all substances at different stages

raw_feed = ['rougher.input.feed_ag', 'rougher.input.feed_pb', 
            'rougher.input.feed_sol', 'rougher.input.feed_au'] 
rougher_concentrate = ['rougher.output.concentrate_ag', 'rougher.output.concentrate_pb', 
                       'rougher.output.concentrate_sol', 'rougher.output.concentrate_au']
final_concentrate = ['final.output.concentrate_ag', 'final.output.concentrate_pb', 
                       'final.output.concentrate_sol', 'final.output.concentrate_au']
full[raw_feed].sum(1).hist(bins=50)
<matplotlib.axes._subplots.AxesSubplot at 0x7f5425f91950>
full[rougher_concentrate].sum(1).hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f5425e6b810>
full[final_concentrate].sum(1).hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f5425fe9ed0>
train = train.loc[train[raw_feed].sum(1) > 1]
test = test.loc[test[raw_feed].sum(1) > 1]
train = train.loc[train[rougher_concentrate].sum(1) > 1]
train = train.loc[train[final_concentrate].sum(1) > 1]

Train the model

features_r_columns = [x for x in features_columns if 'rougher.input' in x]
features_f_columns = [x for x in features_columns if 'primary' in x]
features_r = train[features_r_columns]
features_f = train[features_f_columns]
target_r = train[target_columns[0]]
target_f = train[target_columns[1]]
def smape(target, prediction):
    s = sum(np.abs(prediction - target) / ((np.abs(target) + np.abs(prediction)) / 2))
    return s * 100 / len(target) 
my_scorer = make_scorer(smape, greater_is_better=False)
model = LinearRegression()
print("LinearRegression:", cross_val_score(model, features_r, target_r, cv=3, scoring=my_scorer).mean())
LinearRegression: -7.73921166706288
for i in range(1,10):
    model = DecisionTreeRegressor(max_depth = i, random_state=12345)
    print("DecisionTreeRegressor:", cross_val_score(model, features_r, target_r, cv=3,  scoring=my_scorer).mean())
DecisionTreeRegressor: -8.309019116513007
DecisionTreeRegressor: -8.414436821521468
DecisionTreeRegressor: -8.694769910780858
DecisionTreeRegressor: -9.063014057160933
DecisionTreeRegressor: -9.06968621135918
DecisionTreeRegressor: -9.205584731735364
DecisionTreeRegressor: -9.940410399198106
DecisionTreeRegressor: -10.229794791833923
DecisionTreeRegressor: -10.344914895926372
for i in range(20,50,10):
    model = RandomForestRegressor(n_estimators=i, max_depth=3, random_state=12345)
    print("RandomForestRegressor with", i, "estimators:", cross_val_score(model, features_r, target_r, cv=3,  scoring=my_scorer).mean())
RandomForestRegressor with 20 estimators: -8.586403536006658
RandomForestRegressor with 30 estimators: -8.596271567881844
RandomForestRegressor with 40 estimators: -8.584381133310744
model = LinearRegression()
print("LinearRegression:", cross_val_score(model, features_f, target_f, cv=3, scoring=my_scorer).mean())
LinearRegression: -9.838455498334676
for i in range(1,10):
    model = DecisionTreeRegressor(max_depth = i, random_state=12345)
    print("DecisionTreeRegressor:", i, "max_depth:", cross_val_score(model, features_f, target_f, cv=3,  scoring=my_scorer).mean())
DecisionTreeRegressor: 1 max_depth: -9.547204412179097
DecisionTreeRegressor: 2 max_depth: -10.022658944341572
DecisionTreeRegressor: 3 max_depth: -10.349518009534277
DecisionTreeRegressor: 4 max_depth: -10.388885177403747
DecisionTreeRegressor: 5 max_depth: -10.738721462654391
DecisionTreeRegressor: 6 max_depth: -11.254489636909733
DecisionTreeRegressor: 7 max_depth: -11.172120146436209
DecisionTreeRegressor: 8 max_depth: -11.796373327634669
DecisionTreeRegressor: 9 max_depth: -12.106619581483434
for i in range(20,50,10):
    model = RandomForestRegressor(n_estimators=20, max_depth=1, random_state=12345)
    print("RandomForestRegressor with", i, "estimators:", cross_val_score(model, features_f, target_f, cv=3,  scoring=my_scorer).mean())
RandomForestRegressor with 20 estimators: -9.3756776856768
RandomForestRegressor with 30 estimators: -9.3756776856768
RandomForestRegressor with 40 estimators: -9.3756776856768

The most better model for rougher target is LinearRegression, for final recovery target is RandomForestRegressor with max_depth=1. Will test them

Test the model

targets_all = full[['date', 'rougher.output.recovery', 'final.output.recovery']]
test_new = test.set_index('date').join(targets_all.set_index('date'), on='date', how='left')
test_new = test_new.dropna()
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(features_r, target_r)
dummy_predict_r = dummy_regr.predict(test_new[features_r_columns])

dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(features_f, target_f)
dummy_predict_f = dummy_regr.predict(test_new[features_f_columns])
final_smape_dummy = 0.25*smape(test_new['rougher.output.recovery'], dummy_predict_r) + 0.75*smape(test_new['final.output.recovery'], dummy_predict_f)
final_smape_dummy
model = LinearRegression()
model.fit(features_r, target_r)
predict_r = model.predict(test_new[features_r_columns])
rough_smape = smape(test_new['rougher.output.recovery'], predict_r)
rough_smape
model = RandomForestRegressor(n_estimators=20, max_depth=1, random_state=12345)
model.fit(features_f, target_f)
predict_f = model.predict(test_new[features_f_columns])
final_smape = smape(test_new['final.output.recovery'], predict_f)
final_smape
final_smape = 0.25*smape(test_new['rougher.output.recovery'], predict_r) + 0.75*smape(test_new['final.output.recovery'], predict_f)
final_smape

Conclusion

The Final sMAPE of the LinearRegression and RandomForestRegressor is better than final sMAPE of DummyRegression created by mean, but not too much. We can use our model, but the dummy regressor show, that it can be omitted.