Today, we will have a look at the Scania APS Failure dataset as part of my “Exploring Less Known Datasets for Machine Learning” series.
Exploring the dataset
This dataset was released by Scania CV AB on the UCI Machine Learning Repository as part of the Industrial Challenge 2016 at the 15th International Symposium on Intelligent Data Analysis (IDA). The challenge was about predicting failures of the Air Pressure System (APS) in Scania trucks to enable preventive maintenance and thereby reduce costs. The dataset is anonymized and contains binned/encoded values for proprietary reasons.
Let us look at the dataset by loading and displaying a subset of the training data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
input_data_train = pd.read_csv('./data/aps_failure_training_set.csv',
                               skiprows=20, keep_default_na=False)
input_data_test = pd.read_csv('./data/aps_failure_test_set.csv',
                              skiprows=20, keep_default_na=False)
display(input_data_train.head(3))
display(input_data_train.tail(3))
index | class | aa_000 | ab_000 | ac_000 | ad_000 | ae_000 | af_000 | ag_000 | ag_001 | ag_002 | ... | ee_002 | ee_003 | ee_004 | ee_005 | ee_006 | ee_007 | ee_008 | ee_009 | ef_000 | eg_000 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | neg | 76698 | na | 2130706438 | 280 | 0 | 0 | 0 | 0 | 0 | ... | 1240520 | 493384 | 721044 | 469792 | 339156 | 157956 | 73224 | 0 | 0 | 0 |
1 | neg | 33058 | na | 0 | na | 0 | 0 | 0 | 0 | 0 | ... | 421400 | 178064 | 293306 | 245416 | 133654 | 81140 | 97576 | 1500 | 0 | 0 |
2 | neg | 41040 | na | 228 | 100 | 0 | 0 | 0 | 0 | 0 | ... | 277378 | 159812 | 423992 | 409564 | 320746 | 158022 | 95128 | 514 | 0 | 0 |
index | class | aa_000 | ab_000 | ac_000 | ad_000 | ae_000 | af_000 | ag_000 | ag_001 | ag_002 | ... | ee_002 | ee_003 | ee_004 | ee_005 | ee_006 | ee_007 | ee_008 | ee_009 | ef_000 | eg_000 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
59997 | neg | 112 | 0 | 2130706432 | 18 | 0 | 0 | 0 | 0 | 0 | ... | 792 | 386 | 452 | 144 | 146 | 2622 | 0 | 0 | 0 | 0 |
59998 | neg | 80292 | na | 2130706432 | 494 | 0 | 0 | 0 | 0 | 0 | ... | 699352 | 222654 | 347378 | 225724 | 194440 | 165070 | 802280 | 388422 | 0 | 0 |
59999 | neg | 40222 | na | 698 | 628 | 0 | 0 | 0 | 0 | 0 | ... | 440066 | 183200 | 344546 | 254068 | 225148 | 158304 | 170384 | 158 | 0 | 0 |
#missing values as strings ?!?
input_data_train['ab_000'].values[-2]
'na'
We have to do some data juggling since every column that contains 'na' is read as a string column. First, we have to replace 'na' with a number. It looks like all values are non-negative, hence we may introduce a negative value as a placeholder for our missing values.
# replacing 'na' strings
input_data_train.replace('na','-1', inplace=True)
input_data_test.replace('na','-1', inplace=True)
display(input_data_test.tail(3))
index | class | aa_000 | ab_000 | ac_000 | ad_000 | ae_000 | af_000 | ag_000 | ag_001 | ag_002 | ... | ee_002 | ee_003 | ee_004 | ee_005 | ee_006 | ee_007 | ee_008 | ee_009 | ef_000 | eg_000 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15997 | 0 | 79636 | -1 | 1670 | 1518 | 0 | 0 | 0 | 0 | 0 | ... | 806832 | 449962 | 778826 | 581558 | 375498 | 222866 | 358934 | 19548 | 0 | 0 |
15998 | 0 | 110 | -1 | 36 | 32 | 0 | 0 | 0 | 0 | 0 | ... | 588 | 210 | 180 | 544 | 1004 | 1338 | 74 | 0 | 0 | 0 |
15999 | 0 | 8 | 0 | 6 | 4 | 2 | 2 | 0 | 0 | 0 | ... | 46 | 10 | 48 | 14 | 42 | 46 | 0 | 0 | 0 | 0 |
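As an aside, the same placeholder handling could also be done while parsing. The following is a minimal sketch on a tiny made-up CSV (the inline data and the names `csv_text`, `features` and `all_non_negative` are assumptions for illustration, not the real files): it lets pandas treat 'na' as missing, checks that the observed values are indeed non-negative, and only then fills the gaps with -1.

```python
import io

import pandas as pd

# Tiny made-up sample that mimics the layout of the real files.
csv_text = """class,aa_000,ab_000
neg,76698,na
neg,33058,0
pos,na,2
"""

# na_values makes pandas parse 'na' as NaN, so the feature columns become numeric.
df = pd.read_csv(io.StringIO(csv_text), na_values=["na"])
features = df.drop(columns=["class"])

# Sanity check for the assumption that all observed values are non-negative.
# DataFrame.min() skips NaN by default.
all_non_negative = bool((features.min() >= 0).all())

# Only if that holds is -1 an unambiguous placeholder for missing entries.
features = features.fillna(-1)

print(all_non_negative)
print(features.dtypes)
```

With this variant the feature columns are numeric right after loading, so no later string-to-float conversion would be needed.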
Since we want to predict failures of a system, we should assume that no company publishes such a dataset if the classes (no failure/failure) are distributed equally ;):
#categorical encoding
input_data_train['class'] = pd.Categorical(input_data_train['class']).codes
input_data_test['class'] = pd.Categorical(input_data_test['class']).codes
print(['neg', 'pos'])
print(np.bincount(input_data_train['class'].values))
print(np.bincount(input_data_test['class'].values))
['neg', 'pos']
[59000 1000]
[15625 375]
or as a histogram:
plt.close('all')
bins = np.bincount(input_data_train['class'].values)
plt.bar([0,1], bins, color='black')
plt.xticks([0,1])
plt.xlabel('Classes')
plt.ylabel('Count')
plt.title('Histogram of target classes [train set]')
plt.show()
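To put the class counts into perspective, here is a small sketch that computes the imbalance ratio and the accuracy of a trivial model that always predicts the majority class. The counts are the ones printed above; the helper names `majority_baseline` and `imbalance_ratio` are made up for this illustration.

```python
import numpy as np

# Class counts as printed above: [neg, pos]
train_counts = np.array([59000, 1000])
test_counts = np.array([15625, 375])


def majority_baseline(counts):
    """Accuracy of a model that always predicts the most frequent class."""
    return counts.max() / counts.sum()


def imbalance_ratio(counts):
    """Number of majority-class samples per minority-class sample."""
    return counts.max() / counts.min()


print(majority_baseline(train_counts), imbalance_ratio(train_counts))
print(majority_baseline(test_counts), imbalance_ratio(test_counts))
```

On the test set, always predicting 'neg' already yields 15625 / 16000 ≈ 0.977 accuracy, so plain accuracy says little about how well failures are detected.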
Next, we have to split our data into X and y. We have to remember that the columns that contained 'na' (now '-1') are still strings. Therefore, we have to convert them. A conversion to floating point is okay since some values are floats and we will scale the data anyhow.
# split data into X and y
y_train = input_data_train['class'].copy(deep=True)
X_train = input_data_train.copy(deep=True)
X_train.drop(['class'], inplace=True, axis=1)
y_test = input_data_test['class'].copy(deep=True)
X_test = input_data_test.copy(deep=True)
X_test.drop(['class'], inplace=True, axis=1)
# strings to float
X_train = X_train.astype('float64')
X_test = X_test.astype('float64')
[Figure: zoomed in after the 2nd value]
We still have to scale the datasets:
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
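To make the effect of this scaling step more tangible: MaxAbsScaler divides each feature by the maximum absolute value it saw during fitting. The sketch below re-implements that idea in plain NumPy on made-up toy data; `max_abs_scale` is a hypothetical helper and the handling of all-zero columns is an assumption meant to mirror the scaler, not a copy of its code.

```python
import numpy as np


def max_abs_scale(X_fit, X_apply):
    """Divide each column of X_apply by the max absolute value found in X_fit."""
    scale = np.abs(X_fit).max(axis=0)
    scale[scale == 0] = 1.0  # assumption: leave all-zero columns unchanged
    return X_apply / scale


# Made-up toy data; -1 plays the role of the missing-value placeholder.
X_train_toy = np.array([[-1.0, 10.0, 0.0],
                        [2.0, 5.0, 0.0],
                        [4.0, -1.0, 0.0]])
X_test_toy = np.array([[8.0, -1.0, 0.0]])

X_train_scaled = max_abs_scale(X_train_toy, X_train_toy)
X_test_scaled = max_abs_scale(X_train_toy, X_test_toy)

print(X_train_scaled)
print(X_test_scaled)
```

Two things are worth noticing: the -1 placeholder ends up as a small negative number whose size depends on the column maximum, and test values can exceed 1 because the scale is taken from the training set only.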
Applying ML algorithms
Let us throw some ML algorithms into the game and see what they come up with.
Examples of two classical ML algorithms:
import time

import catboost
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB


def train_test_Gaussian_NB_classification(X_train, X_test,
                                          y_train, y_test,
                                          scorer, dataset_id):
    Gaussian_NB_classification = GaussianNB()
    # empty parameter grid: GridSearchCV is only used for 4-fold cross-validation
    grid_obj = GridSearchCV(Gaussian_NB_classification,
                            param_grid={}, cv=4, n_jobs=-1,
                            scoring=scorer, verbose=1)
    start_time = time.time()
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    # predict with the refitted best estimator
    prediction = grid_fit.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    return {'Classification type': 'Gaussian Naive Bayes Classification',
            'model': grid_fit,
            'Predictions': prediction,
            'Accuracy': accuracy,
            'Classification Report': classification_rep,
            'Training time': training_time,
            'dataset': dataset_id}


def catboost_classification(X_train, X_test,
                            y_train, y_test,
                            scorer, dataset_id):
    catboost_classifier = catboost.CatBoostClassifier(random_state=42)
    # only the learning rate is tuned
    grid_parameters_catboost_classification = {'learning_rate': [0.001, 0.01, 0.1, 1.0]}
    start_time = time.time()
    grid_obj = GridSearchCV(catboost_classifier,
                            param_grid=grid_parameters_catboost_classification,
                            cv=4, n_jobs=-1,
                            scoring=scorer, verbose=1)
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    best_gradient_boosting_classification = grid_fit.best_estimator_
    prediction = best_gradient_boosting_classification.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    return {'Classification type': 'CatBoost',
            'model': grid_fit,
            'Predictions': prediction,
            'Accuracy': accuracy,
            'Classification Report': classification_rep,
            'Training time': training_time,
            'dataset': dataset_id}
Simple NN made with Keras:
from keras import callbacks
from keras.layers import Dense, Dropout
from keras.models import Sequential


def build_baseline_model_1(input_dim):
    # fully connected network with a 2-unit softmax output (neg/pos)
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(75, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model


def run_baseline_model_1(X_train, X_test,
                         y_train, y_test,
                         dataset_id, epochs=2000,
                         validation_split=0.2,
                         batch_size=256):
    model = build_baseline_model_1(X_train.shape[1])
    callback_file_path = 'keras_models/baseline_model_1_'+str(dataset_id)+'_best.hdf5'
    # keep only the weights of the epoch with the lowest validation loss
    checkpoint = callbacks.ModelCheckpoint(callback_file_path,
                                           monitor='val_loss',
                                           save_best_only=True,
                                           save_weights_only=True)
    [...]
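The elided part of the training routine is not reproduced here. One detail follows from the model definition above, though: a 2-unit softmax output combined with 'categorical_crossentropy' expects one-hot encoded targets rather than the integer codes 0/1. A minimal NumPy sketch of such an encoding (`to_one_hot` is a hypothetical helper, not the author's code):

```python
import numpy as np


def to_one_hot(labels, num_classes=2):
    """Turn integer class codes into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[labels]


y_codes = [0, 1, 0]
y_one_hot = to_one_hot(y_codes)
print(y_one_hot)
```

Keras ships a comparable utility (`keras.utils.to_categorical`); whether the original notebook used it is not visible from the excerpt.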
We will use ‘neg_log_loss’ and not a loss function that is tailored for imbalanced datasets.
The metric proposed for the competition was:
Cost-metric of misclassification:
Predicted class | True class: pos | True class: neg |
---|---|---|
pos | - | Cost_1 |
neg | Cost_2 | - |
Cost_1 = 10 and Cost_2 = 500
The total cost of a prediction model is the sum of ‘Cost_1’ multiplied by the number of instances with type 1 failure and ‘Cost_2’ multiplied by the number of instances with type 2 failure, resulting in a ‘Total_cost’.
In this case, Cost_1 refers to the cost of an unnecessary check that needs to be done by a mechanic at a workshop, while Cost_2 refers to the cost of missing a faulty truck, which may cause a breakdown.
Total_cost = Cost_1 * No_Instances + Cost_2 * No_Instances.
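The competition metric can be sketched in a few lines of NumPy. This assumes the class encoding used in this post (0 = neg, 1 = pos); the function name `total_cost` and the toy labels are made up for illustration.

```python
import numpy as np

COST_1 = 10   # false positive: unnecessary workshop check
COST_2 = 500  # false negative: missed faulty truck


def total_cost(y_true, y_pred, cost_fp=COST_1, cost_fn=COST_2):
    """Total_cost = Cost_1 * (#false positives) + Cost_2 * (#false negatives)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_positives = int(np.sum((y_true == 0) & (y_pred == 1)))
    false_negatives = int(np.sum((y_true == 1) & (y_pred == 0)))
    return cost_fp * false_positives + cost_fn * false_negatives


# Toy example: one false positive and one false negative.
y_true_toy = [0, 0, 0, 1, 1]
y_pred_toy = [0, 1, 0, 0, 1]
print(total_cost(y_true_toy, y_pred_toy))  # 10 * 1 + 500 * 1 = 510
```

Because a missed failure is 50 times as expensive as a false alarm, this metric rewards models that catch class 1 even at the price of some extra false positives.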
Results
index | Classifier | Accuracy | Accuracy class 0 | Accuracy class 1 |
---|---|---|---|---|
0 | Gaussian Naive Bayes Classification | 0.964875 | 0.966208 | 0.909333 |
1 | Decision Tree Classification | 0.985375 | 0.998016 | 0.458667 |
3 | Random Forest Classification | 0.990375 | 0.998912 | 0.634667 |
4 | AdaBoost Classification | 0.976562 | 1.000000 | 0.000000 |
5 | XGBoost Classification | 0.993563 | 0.998976 | 0.768000 |
6 | CatBoost | 0.993938 | 0.998912 | 0.786667 |
7 | Baseline_model_1 | 0.991313 | 0.996480 | 0.776000 |
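The evaluation code behind this table is not shown. Assuming that "Accuracy class k" means the fraction of samples whose true class is k and that were predicted as k (i.e. the per-class recall), the columns could be computed roughly as follows; `per_class_accuracy` and the toy labels are hypothetical.

```python
import numpy as np


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples within each true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {int(c): float(np.mean(y_pred[y_true == c] == c))
            for c in np.unique(y_true)}


# Toy labels: 4 negatives (one misclassified), 2 positives (one misclassified).
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 0]

overall_accuracy = float(np.mean(np.asarray(y_true_toy) == np.asarray(y_pred_toy)))
class_accuracy = per_class_accuracy(y_true_toy, y_pred_toy)
print(overall_accuracy, class_accuracy)
```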
Since we want to predict failures (class 1), it is tempting to say that the Gaussian Naive Bayes classifier produces the most reasonable results. A more balanced answer would be that CatBoost leads to the best results: it has the highest overall accuracy and the second-highest class 1 accuracy. However, none of these performances is acceptable (IMHO). I may write a longer tutorial or book on how to deal with imbalanced datasets more effectively - especially if there is an underlying engineering/physics model.
update:
I tested automated machine learning toolkits on this dataset as well: