Today, we will have a look at the Scania APS Failure dataset as part of my “Exploring Less Known Datasets for Machine Learning” series.


Exploring the dataset

This dataset was released by Scania CV AB on the UCI Machine Learning Repository as part of the Industrial Challenge 2016 at the 15th International Symposium on Intelligent Data Analysis (IDA). The challenge was about predicting failures of the APS (Air Pressure System) in Scania trucks to enable preventive maintenance and thereby reduce costs. The dataset is anonymized and contains binned/encoded values for proprietary reasons.

Let us look at the dataset by loading and displaying a subset of the training data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

input_data_train = pd.read_csv('./data/aps_failure_training_set.csv',
                               skiprows=20,keep_default_na=False)
input_data_test = pd.read_csv('./data/aps_failure_test_set.csv',
                             skiprows=20,keep_default_na=False)

display(input_data_train.head(3))
display(input_data_train.tail(3))                             
class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 ee_009 ef_000 eg_000
0 neg 76698 na 2130706438 280 0 0 0 0 0 ... 1240520 493384 721044 469792 339156 157956 73224 0 0 0
1 neg 33058 na 0 na 0 0 0 0 0 ... 421400 178064 293306 245416 133654 81140 97576 1500 0 0
2 neg 41040 na 228 100 0 0 0 0 0 ... 277378 159812 423992 409564 320746 158022 95128 514 0 0
class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 ee_009 ef_000 eg_000
59997 neg 112 0 2130706432 18 0 0 0 0 0 ... 792 386 452 144 146 2622 0 0 0 0
59998 neg 80292 na 2130706432 494 0 0 0 0 0 ... 699352 222654 347378 225724 194440 165070 802280 388422 0 0
59999 neg 40222 na 698 628 0 0 0 0 0 ... 440066 183200 344546 254068 225148 158304 170384 158 0 0
#missing values as strings ?!?
input_data_train['ab_000'].values[-2]
'na'

We have to do some data juggling since all columns that contain 'na' are read as strings. First, we have to replace 'na' with a number. It looks like all values are non-negative, hence we may use a negative value as a placeholder for our missing values.

# replacing 'na' strings
input_data_train.replace('na','-1', inplace=True)
input_data_test.replace('na','-1', inplace=True)

display(input_data_test.tail(3))
class aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000 ag_001 ag_002 ... ee_002 ee_003 ee_004 ee_005 ee_006 ee_007 ee_008 ee_009 ef_000 eg_000
15997 0 79636 -1 1670 1518 0 0 0 0 0 ... 806832 449962 778826 581558 375498 222866 358934 19548 0 0
15998 0 110 -1 36 32 0 0 0 0 0 ... 588 210 180 544 1004 1338 74 0 0 0
15999 0 8 0 6 4 2 2 0 0 0 ... 46 10 48 14 42 46 0 0 0 0
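
As an aside, pandas can also do the 'na' handling while reading the file. The following is only a sketch on a tiny made-up CSV string (the values and the two feature columns are stand-ins, not the real file), assuming the missing-value marker is exactly the string 'na':

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file (hypothetical rows)
csv_text = "class,aa_000,ab_000\nneg,76698,na\npos,33058,0\n"

# Tell pandas that the string 'na' marks a missing value
df = pd.read_csv(io.StringIO(csv_text), na_values=['na'])
print(df.dtypes)  # ab_000 is parsed as a float column containing NaN

# Same placeholder idea as above: replace missing values by -1
df_filled = df.fillna(-1)
print(df_filled)
```

With this variant the feature columns are numeric right away, so the later string-to-float conversion would not be needed.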

Since we want to predict failures of a system, we should assume that no company publishes such a dataset if the classes (no failure/failure) are distributed equally ;):

#categorical encoding
input_data_train['class'] = pd.Categorical(input_data_train['class']).codes
input_data_test['class'] = pd.Categorical(input_data_test['class']).codes

print(['neg', 'pos'])
print(np.bincount(input_data_train['class'].values))
print(np.bincount(input_data_test['class'].values))
['neg', 'pos']
[59000  1000]
[15625   375]
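
To put these counts into perspective, the share of the positive class can be computed directly from them. A small sketch using the counts printed above (the helper name `positive_share` is made up for this example):

```python
import numpy as np

# Class counts as printed above: [neg, pos]
train_counts = np.array([59000, 1000])
test_counts = np.array([15625, 375])


def positive_share(counts):
    """Fraction of samples that belong to the positive class."""
    return counts[1] / counts.sum()


print(f'train: {positive_share(train_counts):.2%} positive')
print(f'test:  {positive_share(test_counts):.2%} positive')
```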

or as a histogram:

plt.close('all')
bins = np.bincount(input_data_train['class'].values)
plt.bar([0,1], bins, color='black')
plt.xticks([0,1])
plt.xlabel('Classes')
plt.ylabel('Count')
plt.title('Histogram of target classes [train set]')
plt.show()
[Figure: histogram of target classes, training set]

[Figure: histogram of target classes, test set]

Next, we have to split our data into X and y. We have to remember that columns containing 'na' (now -1) are strings. Therefore, we have to convert them. A conversion to floating point is okay since some values are float and we will scale the data anyhow.

# split data into X and y
y_train = input_data_train['class'].copy(deep=True)
X_train = input_data_train.copy(deep=True)
X_train.drop(['class'], inplace=True, axis=1)

y_test = input_data_test['class'].copy(deep=True)
X_test = input_data_test.copy(deep=True)
X_test.drop(['class'], inplace=True, axis=1)

# strings to float
X_train = X_train.astype('float64')
X_test = X_test.astype('float64')
[Figure: example plot]

Zoomed in after the 2nd value:

[Figure: example plot, zoomed in]

We still have to scale the datasets:

from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
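
To see what MaxAbsScaler does with the -1 placeholder, here is a minimal sketch on made-up toy data (not the Scania data). Each column is divided by its maximum absolute value in the training set, so the placeholder stays negative and test values can end up outside [-1, 1]:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Toy data: second column contains the -1 placeholder for missing values
X_toy_train = np.array([[100.0, -1.0],
                        [50.0, 4.0],
                        [0.0, 2.0]])
X_toy_test = np.array([[200.0, -1.0]])

# Fit on the training data only, then apply the same scaling to both sets
toy_scaler = MaxAbsScaler().fit(X_toy_train)
X_toy_train_scaled = toy_scaler.transform(X_toy_train)
X_toy_test_scaled = toy_scaler.transform(X_toy_test)

print(X_toy_train_scaled)
print(X_toy_test_scaled)
```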

Applying ML algorithms

Let us throw some ML algorithms into the mix and see what they come up with.

Examples of two classical ML algorithms:

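import time

from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB


def train_test_Gaussian_NB_classification(X_train, X_test,
                                          y_train, y_test,
                                          scorer, dataset_id):
    Gaussian_NB_classification = GaussianNB()
    # empty parameter grid: GridSearchCV only runs the 4-fold cross-validation
    grid_obj = GridSearchCV(Gaussian_NB_classification,
                            param_grid={}, cv=4, n_jobs=-1,
                            scoring=scorer, verbose=1)
    start_time = time.time()
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    # predict() uses the estimator refitted on the full training set
    prediction = grid_fit.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)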

    return {'Classification type' : 'Gaussian Naive Bayes Classification',
            'model' : grid_fit,
            'Predictions' : prediction,
            'Accuracy' : accuracy,
            'Classification Report':classification_rep,
            'Training time' : training_time,
            'dataset' : dataset_id}


import catboost


def catboost_classification(X_train, X_test,
                            y_train, y_test,
                            scorer, dataset_id):
    catboost_classifier = catboost.CatBoostClassifier(random_state=42)
    grid_parameters_catboost_classification = {'learning_rate': [0.001, 0.01, 0.1, 1.0]}
    start_time = time.time()
    grid_obj = GridSearchCV(catboost_classifier,
                            param_grid=grid_parameters_catboost_classification,
                            cv=4, n_jobs=-1,
                            scoring=scorer, verbose=1)
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    # best parameter set, refitted on the full training data by GridSearchCV
    best_catboost_classification = grid_fit.best_estimator_
    prediction = best_catboost_classification.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    
    return {'Classification type' : 'CatBoost', 
            'model' : grid_fit,
            'Predictions' : prediction,
            'Accuracy' : accuracy,
            'Classification Report':classification_rep,
            'Training time' : training_time,
            'dataset' : dataset_id}
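
The Gaussian Naive Bayes function passes an empty parameter grid to GridSearchCV. The sketch below, on synthetic imbalanced data from `make_classification` (a stand-in, not the Scania dataset), illustrates that this amounts to a plain cross-validation of a single candidate:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (roughly 5 % positives)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.95, 0.05], random_state=42)

# Empty grid: exactly one candidate (the default parameters) is evaluated
grid = GridSearchCV(GaussianNB(), param_grid={}, cv=4, scoring='neg_log_loss')
grid.fit(X_demo, y_demo)

print(grid.cv_results_['params'])  # list with a single empty dict
print(grid.best_score_)            # mean negative log loss over the 4 folds
```

The same pattern keeps the interface identical for models with and without hyperparameters to tune.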

Simple NN made with Keras:

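from keras import callbacks
from keras.layers import Dense, Dropout
from keras.models import Sequential


def build_baseline_model_1(input_dim):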
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(75, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def run_baseline_model_1(X_train, X_test,
                         y_train, y_test,
                         dataset_id, epochs=2000,
                         validation_split=0.2,
                         batch_size=256):
    model = build_baseline_model_1(X_train.shape[1])
    
    callback_file_path = 'keras_models/baseline_model_1_'+str(dataset_id)+'_best.hdf5'
    checkpoint = callbacks.ModelCheckpoint(callback_file_path,
                                           monitor='val_loss',
                                           save_best_only=True,
                                           save_weights_only=True)
    [...]
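
Since the network ends in a two-unit softmax layer and is compiled with 'categorical_crossentropy', it expects one-hot encoded labels. How the original code prepares them is not shown above. One possible way, given here purely as an assumption, is a NumPy lookup into an identity matrix:

```python
import numpy as np

# Integer class labels as produced by the categorical encoding (0 = neg, 1 = pos)
y_demo = np.array([0, 1, 0, 0])

# Row i of the 2x2 identity matrix is the one-hot vector for class i
y_demo_one_hot = np.eye(2)[y_demo]
print(y_demo_one_hot)
```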

We will use scikit-learn's ‘neg_log_loss’ as the scorer and not a loss function that is tailored to imbalanced datasets.

The proposed metric for the competition was:

Cost metric of misclassification:

                   True class
Predicted class    pos       neg
pos                -         Cost_1
neg                Cost_2    -

Cost_1 = 10 and Cost_2 = 500

The total cost of a prediction model is the sum of ‘Cost_1’ multiplied by the number of instances with type 1 failure and ‘Cost_2’ multiplied by the number of instances with type 2 failure, resulting in a ‘Total_cost’.

In this case, Cost_1 refers to the cost of an unnecessary check that needs to be done by a mechanic at a workshop (a false positive), while Cost_2 refers to the cost of missing a faulty truck, which may cause a breakdown (a false negative).

Total_cost = Cost_1 * No_Instances + Cost_2 * No_Instances.
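
The cost metric can be sketched in a few lines with scikit-learn's confusion matrix. The function name `total_cost` and the toy labels are made up for this illustration, and the classes are assumed to be encoded as 0 = neg and 1 = pos as above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer

COST_1 = 10    # false positive: unnecessary check at the workshop
COST_2 = 500   # false negative: faulty truck was missed


def total_cost(y_true, y_pred):
    """Cost_1 * number of false positives + Cost_2 * number of false negatives."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(COST_1 * fp + COST_2 * fn)


# Lower cost is better, hence greater_is_better=False
cost_scorer = make_scorer(total_cost, greater_is_better=False)

# Toy example: one false positive and one false negative
y_true_demo = np.array([0, 0, 0, 1, 1])
y_pred_demo = np.array([0, 1, 0, 1, 0])
print(total_cost(y_true_demo, y_pred_demo))
```

An object like `cost_scorer` could be passed as the `scoring` argument of GridSearchCV instead of ‘neg_log_loss’, which is not what was done for the results below.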

Results

   Classifier                            Accuracy  Accuracy class 0  Accuracy class 1
0  Gaussian Naive Bayes Classification   0.964875  0.966208          0.909333
1  Decision Tree Classification          0.985375  0.998016          0.458667
3  Random Forest Classification          0.990375  0.998912          0.634667
4  AdaBoost Classification               0.976562  1.000000          0.000000
5  XGBoost Classification                0.993563  0.998976          0.768000
6  CatBoost                              0.993938  0.998912          0.786667
7  Baseline_model_1                      0.991313  0.996480          0.776000

Since we want to predict failures (class 1), it is tempting to say that the Gaussian Naive Bayes classifier produces the most reasonable results, because it has by far the highest accuracy on class 1. A more reasonable answer would be that CatBoost leads to the best results, with the highest overall accuracy and far fewer false alarms on class 0. However, none of these performances is acceptable (IMHO). I may write a longer tutorial or book on how to deal with imbalanced datasets more effectively, especially if there is an underlying engineering/physics model.
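
To relate the table to the challenge's cost metric, the per-class accuracies can be converted back into approximate error counts. This is only a rough sketch: it assumes the test-set class counts printed earlier (15625 neg, 375 pos) and rounds the counts, since the accuracies in the table are rounded themselves.

```python
COST_1, COST_2 = 10, 500
N_NEG, N_POS = 15625, 375  # test-set class counts printed earlier

# (accuracy class 0, accuracy class 1), copied from the results table
results = {
    'Gaussian Naive Bayes': (0.966208, 0.909333),
    'Decision Tree': (0.998016, 0.458667),
    'Random Forest': (0.998912, 0.634667),
    'AdaBoost': (1.000000, 0.000000),
    'XGBoost': (0.998976, 0.768000),
    'CatBoost': (0.998912, 0.786667),
    'Baseline_model_1': (0.996480, 0.776000),
}


def cost_from_accuracies(acc_neg, acc_pos):
    """Approximate total cost from the per-class accuracies."""
    false_pos = round(N_NEG * (1 - acc_neg))  # negatives predicted as pos
    false_neg = round(N_POS * (1 - acc_pos))  # positives predicted as neg
    return COST_1 * false_pos + COST_2 * false_neg


costs = {name: cost_from_accuracies(*accs) for name, accs in results.items()}
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f'{name:<22}{cost:>8}')
```

Because Cost_2 is 50 times Cost_1, this metric weights missed failures much more heavily than plain accuracy does.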




Update:

I tested automated machine learning toolkits on this dataset as well: