Revisiting Machine Learning Datasets

Today, we will have a look at the Scania APS Failure dataset as part of my “Exploring Less Known Datasets for Machine Learning” series.

Contents

Exploring the dataset
Applying ML algorithms
Results

Exploring the dataset

This dataset was released by Scania CV AB on the UCI Machine Learning Repository as part of the Industrial Challenge 2016 at The 15th International Symposium on Intelligent Data Analysis (IDA) in 2016. The challenge was about predicting failure of Scania’s APS (Air Pressure System) in trucks to enable preventive maintenance and therefore reduce costs. The dataset is anonymized and contains binned/encoded values for proprietary reasons.

Let us look at the dataset by loading and displaying a subset of the training data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

input_data_train = pd.read_csv('./data/aps_failure_training_set.csv',
                               skiprows=20,keep_default_na=False)
input_data_test = pd.read_csv('./data/aps_failure_test_set.csv',
                             skiprows=20,keep_default_na=False)

display(input_data_train.head(3))
display(input_data_train.tail(3))                             

#missing values as strings ?!?
input_data_train['ab_000'].values[-2]

'na'

We have to do some data juggling since every column that contains 'na' is read as a column of strings. First, we have to replace 'na' with a numeric placeholder (still stored as a string at this point; the conversion to numbers follows later). It looks like all values are non-negative, hence we may use a negative value to mark our missing values.

# replacing 'na' strings
input_data_train.replace('na','-1', inplace=True)
input_data_test.replace('na','-1', inplace=True)

display(input_data_test.tail(3))
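As an aside, the string columns can be avoided at load time. The following is a minimal sketch, assuming a tiny made-up CSV snippet in place of the real files (the three rows are invented for illustration); it uses the `na_values` argument of `pd.read_csv` so that 'na' is parsed as NaN and the column stays numeric, and then fills the gaps with the same -1 placeholder:

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV files (values invented for illustration).
csv_text = (
    "class,aa_000,ab_000\n"
    "neg,76698,na\n"
    "neg,112,0\n"
    "pos,80292,na\n"
)

# Treat the literal string 'na' as a missing value while parsing.
toy = pd.read_csv(io.StringIO(csv_text), na_values='na')

# Count missing entries per column before filling them.
n_missing = toy.isna().sum()

# Use the same negative placeholder as in the text above.
toy_filled = toy.fillna(-1)

print(toy.dtypes)
print(n_missing)
print(toy_filled)
```

With this approach the missing values can be counted per column before deciding on a placeholder, which is harder to do once everything has been replaced by '-1'.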

Since we want to predict failures of a system, we should expect the classes (no failure/failure) to be heavily imbalanced; no company would publish such a dataset if failures were as common as normal operation ;):

#categorical encoding
input_data_train['class'] = pd.Categorical(input_data_train['class']).codes
input_data_test['class'] = pd.Categorical(input_data_test['class']).codes

print(['neg', 'pos'])
print(np.bincount(input_data_train['class'].values))
print(np.bincount(input_data_test['class'].values))

['neg', 'pos']
[59000  1000]
[15625   375]

or as a histogram:

plt.close('all')
bins = np.bincount(input_data_train['class'].values)
plt.bar([0,1], bins, color='black')
plt.xticks([0,1])
plt.xlabel('Classes')
plt.ylabel('Count')
plt.title('Histogram of target classes [train set]')
plt.show()
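To put the imbalance into numbers, here is a small arithmetic sketch based on the class counts printed above. It also computes the accuracy of a trivial model that always predicts 'neg', which is a useful reference point when reading accuracy values later on:

```python
import numpy as np

# Class counts as printed above: [neg, pos]
train_counts = np.array([59000, 1000])
test_counts = np.array([15625, 375])

# Share of the positive (failure) class in each set
train_pos_share = train_counts[1] / train_counts.sum()
test_pos_share = test_counts[1] / test_counts.sum()

# Accuracy of a model that always predicts 'neg' on the test set
majority_accuracy = test_counts[0] / test_counts.sum()

print(f'positive share (train): {train_pos_share:.4f}')
print(f'positive share (test):  {test_pos_share:.4f}')
print(f'always-neg accuracy:    {majority_accuracy:.6f}')
```

Roughly 1.7 % of the training rows and 2.3 % of the test rows are failures, so a classifier that never predicts a failure still reaches an accuracy of about 0.977 on the test set.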

Next, we have to split our data into X and y. We have to remember that columns containing 'na' (now -1) are strings. Therefore, we have to convert them. A conversion to floating point is okay since some values are float and we will scale the data anyhow.

# split data into X and y
y_train = input_data_train['class'].copy(deep=True)
X_train = input_data_train.copy(deep=True)
X_train.drop(['class'], inplace=True, axis=1)

y_test = input_data_test['class'].copy(deep=True)
X_test = input_data_test.copy(deep=True)
X_test.drop(['class'], inplace=True, axis=1)

# strings to float
X_train = X_train.astype('float64')
X_test = X_test.astype('float64')



We still have to scale the datasets:

from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
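To see what this scaling does to the -1 placeholder, here is a toy sketch with made-up numbers (the arrays are assumptions, not taken from the dataset). `MaxAbsScaler` divides every column by the largest absolute value found in the training data:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up data: the first column contains a -1 placeholder for a missing value.
X_train_toy = np.array([[-1.0, 10.0],
                        [50.0, 0.0],
                        [200.0, 40.0]])
X_test_toy = np.array([[400.0, 20.0]])

# Fit on the training data only, then apply the same factors to the test data.
toy_scaler = MaxAbsScaler().fit(X_train_toy)
train_scaled = toy_scaler.transform(X_train_toy)
test_scaled = toy_scaler.transform(X_test_toy)

print(train_scaled)
print(test_scaled)
```

Two things are worth noticing in this sketch: the placeholder ends up very close to 0 when the column maximum is large, so it may be hard to tell apart from a real 0, and test values can exceed 1 if they are larger than anything seen during training.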

Applying ML algorithms

Let us throw some ML algorithms in the game and see what they come up with:

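Examples of two classical ML algorithms:

# imports needed by the two functions below
import time

import catboost
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

def train_test_Gaussian_NB_classification(X_train, X_test,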
                                          y_train, y_test,
                                          scorer, dataset_id):
    Gaussian_NB_classification = GaussianNB()
    grid_obj = GridSearchCV(Gaussian_NB_classification,
                            param_grid={}, cv=4, n_jobs=-1,
                            scoring=scorer, verbose=1)
    start_time = time.time()
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    prediction = grid_fit.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    

    return {'Classification type' : 'Gaussian Naive Bayes Classification',
            'model' : grid_fit,
            'Predictions' : prediction,
            'Accuracy' : accuracy,
            'Classification Report':classification_rep,
            'Training time' : training_time,
            'dataset' : dataset_id}


def catboost_classification(X_train, X_test,
                           y_train, y_test,
                           scorer, dataset_id):
    catboost_classification = catboost.CatBoostClassifier(random_state=42)
    grid_parameters_catboost_classification = {'learning_rate' :[0.001, 0.01, 0.1, 1.0]}
    start_time = time.time()
    grid_obj = GridSearchCV(catboost_classification,
                            param_grid=grid_parameters_catboost_classification,
                            cv=4, n_jobs=-1,
                            scoring=scorer, verbose=1)
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    best_gradient_boosting_classification = grid_fit.best_estimator_
    prediction = best_gradient_boosting_classification.predict(X_test)
    accuracy = accuracy_score(y_true=y_test, y_pred=prediction)
    classification_rep = classification_report(y_true=y_test, y_pred=prediction)
    
    return {'Classification type' : 'CatBoost', 
            'model' : grid_fit,
            'Predictions' : prediction,
            'Accuracy' : accuracy,
            'Classification Report':classification_rep,
            'Training time' : training_time,
            'dataset' : dataset_id}

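Simple NN made with Keras:

# imports needed by the Keras code below
from keras import callbacks
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_baseline_model_1(input_dim):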
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(input_dim*2, activation='relu'))
    model.add(Dropout(0.25))
    model.add(Dense(75, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

def run_baseline_model_1(X_train, X_test,
                         y_train, y_test,
                         dataset_id, epochs=2000,
                         validation_split=0.2,
                         batch_size=256):
    model = build_baseline_model_1(X_train.shape[1])
    
    callback_file_path = 'keras_models/baseline_model_1_'+str(dataset_id)+'_best.hdf5'
    checkpoint = callbacks.ModelCheckpoint(callback_file_path,
                                           monitor='val_loss',
                                           save_best_only=True,
                                           save_weights_only=True)
    [...]
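The network above ends in a two-unit softmax layer and is compiled with 'categorical_crossentropy', which expects one-hot encoded targets rather than the 0/1 codes we created earlier. The elided part of run_baseline_model_1 is not shown, so how the labels are prepared there is unknown. The following is only a minimal NumPy sketch of such an encoding; `to_one_hot` is a hypothetical helper and not part of the original code:

```python
import numpy as np

def to_one_hot(labels, num_classes=2):
    """Turn integer class codes (0, 1, ...) into one-hot rows (hypothetical helper)."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype='float32')[labels]

y_toy = [0, 1, 0, 0]
y_toy_one_hot = to_one_hot(y_toy)
print(y_toy_one_hot)
```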

We will use ‘neg_log_loss’ as the scorer for the grid searches and not a loss function that is tailored for imbalanced datasets.
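For readers who have not used this scorer before, here is a small sketch of how the string 'neg_log_loss' can be passed to `GridSearchCV`. It runs on synthetic, imbalanced toy data from `make_classification` (an assumption made for illustration, not the Scania data) and uses an empty parameter grid as in the Gaussian Naive Bayes function above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with roughly 2 % positives (not the Scania data).
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.98, 0.02], random_state=0)

# An empty grid means: just cross-validate the default estimator.
grid = GridSearchCV(GaussianNB(), param_grid={}, cv=4,
                    scoring='neg_log_loss')
grid.fit(X_toy, y_toy)

# scikit-learn maximizes scores, so the log loss is reported with a minus sign.
mean_neg_log_loss = grid.best_score_
print(mean_neg_log_loss)
```

The score is the negated log loss averaged over the folds: values closer to 0 are better, and every misclassified row counts the same regardless of its class.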

The metric proposed for the competition was the cost of misclassification:

Predicted class	True class: pos	True class: neg
pos	-	Cost_1
neg	Cost_2	-

Cost_1 = 10 and Cost_2 = 500

The total cost of a prediction model is the sum of ‘Cost_1’ multiplied by the number of instances with type 1 failure and ‘Cost_2’ multiplied by the number of instances with type 2 failure, resulting in a ‘Total_cost’.

In this case, Cost_1 refers to the cost of an unnecessary check by a mechanic at a workshop, while Cost_2 refers to the cost of missing a faulty truck, which may cause a breakdown.

Total_cost = Cost_1 * No_Instances (type 1 failure) + Cost_2 * No_Instances (type 2 failure)
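The cost metric is easy to compute from predictions. Below is a minimal sketch, assuming the label encoding used earlier (0 = neg, 1 = pos); `total_cost` is a hypothetical helper name and the toy labels are made up. The last line wraps the function with scikit-learn's `make_scorer` so that it could be passed as the `scoring` argument of a grid search:

```python
import numpy as np
from sklearn.metrics import make_scorer

COST_1 = 10   # unnecessary workshop check (predicted pos, truly neg)
COST_2 = 500  # missed faulty truck (predicted neg, truly pos)

def total_cost(y_true, y_pred, cost_1=COST_1, cost_2=COST_2):
    """Challenge cost for labels encoded as 0 = neg and 1 = pos."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n_type_1 = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    n_type_2 = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    return cost_1 * n_type_1 + cost_2 * n_type_2

# Toy example: two false positives and one false negative
y_true_toy = [0, 0, 0, 1, 1, 0]
y_pred_toy = [0, 1, 0, 0, 1, 1]
toy_cost = total_cost(y_true_toy, y_pred_toy)
print(toy_cost)

# Lower cost is better, hence greater_is_better=False.
cost_scorer = make_scorer(total_cost, greater_is_better=False)
```

In the toy example the two unnecessary checks cost 20 and the single missed failure costs 500, which shows how strongly the metric penalizes missed failures.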

Results


  
    
      

	Classifier	Accuracy	Accuracy class 0	Accuracy class 1
0	Gaussian Naive Bayes Classification	0.964875	0.966208	0.909333
1	Decision Tree Classification	0.985375	0.998016	0.458667
3	Random Forest Classification	0.990375	0.998912	0.634667
4	AdaBoost Classification	0.976562	1.000000	0.000000
5	XGBoost Classification	0.993563	0.998976	0.768000
6	CatBoost	0.993938	0.998912	0.786667
7	Baseline_model_1	0.991313	0.996480	0.776000

Since we want to predict failure (class 1), it is tempting to say that the Gaussian Naive Bayes Classifier produces the most reasonable results. A more reasonable answer would be that CatBoost leads to the best results. However, none of these performances is acceptable (IMHO). I may write a longer tutorial or book on how to deal with imbalanced datasets more effectively - especially if there is an underlying engineering/physics model.
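The table can also be translated into the challenge cost metric. The following is a rough sketch under two assumptions: that 'Accuracy class 0' and 'Accuracy class 1' are the fractions of correctly classified rows within each class, and that the test set contains 15625 negative and 375 positive rows as counted earlier. The error counts are recovered by rounding, so the result is an estimate and not a reported score:

```python
N_NEG, N_POS = 15625, 375  # test set class counts from above
COST_1, COST_2 = 10, 500   # cost of a false positive / false negative

# (accuracy class 0, accuracy class 1) as listed in the results table
results = {
    'Gaussian Naive Bayes Classification': (0.966208, 0.909333),
    'Decision Tree Classification': (0.998016, 0.458667),
    'Random Forest Classification': (0.998912, 0.634667),
    'AdaBoost Classification': (1.000000, 0.000000),
    'XGBoost Classification': (0.998976, 0.768000),
    'CatBoost': (0.998912, 0.786667),
    'Baseline_model_1': (0.996480, 0.776000),
}

def cost_from_class_accuracy(acc_neg, acc_pos):
    """Estimate the challenge cost from per-class accuracies."""
    n_false_pos = round((1.0 - acc_neg) * N_NEG)
    n_false_neg = round((1.0 - acc_pos) * N_POS)
    return COST_1 * n_false_pos + COST_2 * n_false_neg

costs = {name: cost_from_class_accuracy(*accs)
         for name, accs in results.items()}

# Print the classifiers from lowest to highest estimated cost.
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f'{name}: {cost}')
```

By this estimate, the ranking under the cost metric differs from the ranking by accuracy, since a missed failure is weighted 50 times more heavily than a false alarm.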

update:

I tested automated machine learning toolkits on this dataset as well:

Data previews referred to in the data exploration part:

First three rows of the training set (raw, missing values shown as 'na'):

	class	aa_000	ab_000	ac_000	ad_000	...	ee_002	ee_003	ee_004	ee_005	ee_006	ee_007	ee_008	ee_009
0	neg	76698	na	2130706438	280	...	1240520	493384	721044	469792	339156	157956	73224	0
1	neg	33058	na	0	na	...	421400	178064	293306	245416	133654	81140	97576	1500
2	neg	41040	na	228	100	...	277378	159812	423992	409564	320746	158022	95128	514

Last three rows of the training set (raw, missing values shown as 'na'):

	class	aa_000	ab_000	ac_000	ad_000	...	ee_002	ee_003	ee_004	ee_005	ee_006	ee_007	ee_008	ee_009
59997	neg	112	0	2130706432	18	...	792	386	452	144	146	2622	0	0
59998	neg	80292	na	2130706432	494	...	699352	222654	347378	225724	194440	165070	802280	388422
59999	neg	40222	na	698	628	...	440066	183200	344546	254068	225148	158304	170384	158

Last three rows of the test set after replacing 'na' with -1:

	aa_000	ab_000	ac_000	ad_000	ae_000	af_000	...	ee_002	ee_003	ee_004	ee_005	ee_006	ee_007	ee_008	ee_009
15997	79636	-1	1670	1518	0	0	...	806832	449962	778826	581558	375498	222866	358934	19548
15998	110	-1	36	32	0	0	...	588	210	180	544	1004	1338	74	0
15999	8	0	6	4	2	2	...	46	10	48	14	42	46	0	0