The original publication of this dataset is: I. Ortigosa, R. Lopez and J. Garcia. A neural networks approach to residuary resistance of sailing yachts prediction. In Proceedings of the International Conference on Marine Engineering MARINE 2007, 2007.

It seems that the original publication works with two different output variables and only 5 input variables, whereas according to the dataset description we have 6 input variables and a single target variable:
Variations concern hull geometry coefficients and the Froude number:

- Longitudinal position of the center of buoyancy, adimensional.
- Prismatic coefficient, adimensional.
- Length-displacement ratio, adimensional.
- Beam-draught ratio, adimensional.
- Length-beam ratio, adimensional.
- Froude number, adimensional.

The measured variable is the residuary resistance per unit weight of displacement:

- Residuary resistance per unit weight of displacement, adimensional.
```python
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# load the dataset and display a portion and basic statistics of it
filepath_input_data = './data/yacht_hydrodynamics.data'
input_data_df = pd.read_csv(filepath_input_data, delim_whitespace=True,
                            names=['Long pos', 'Prismatic coeff',
                                   'Length-displacement ratio', 'Beam-draught ratio',
                                   'Length-beam ratio', 'Froude number',
                                   'Residuary resistance'])
display(input_data_df.head(3))
display(input_data_df.tail(3))
display(input_data_df.describe())
```
| | Long pos | Prismatic coeff | Length-displacement ratio | Beam-draught ratio | Length-beam ratio | Froude number | Residuary resistance |
|---|---|---|---|---|---|---|---|
| 0 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.125 | 0.11 |
| 1 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.150 | 0.27 |
| 2 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.175 | 0.47 |
| | Long pos | Prismatic coeff | Length-displacement ratio | Beam-draught ratio | Length-beam ratio | Froude number | Residuary resistance |
|---|---|---|---|---|---|---|---|
| 305 | -2.3 | 0.6 | 4.34 | 4.23 | 2.73 | 0.400 | 19.59 |
| 306 | -2.3 | 0.6 | 4.34 | 4.23 | 2.73 | 0.425 | 30.48 |
| 307 | -2.3 | 0.6 | 4.34 | 4.23 | 2.73 | 0.450 | 46.66 |
| | Long pos | Prismatic coeff | Length-displacement ratio | Beam-draught ratio | Length-beam ratio | Froude number | Residuary resistance |
|---|---|---|---|---|---|---|---|
| count | 308.000000 | 308.000000 | 308.000000 | 308.000000 | 308.000000 | 308.000000 | 308.000000 |
| mean | -2.381818 | 0.564136 | 4.788636 | 3.936818 | 3.206818 | 0.287500 | 10.495357 |
| std | 1.513219 | 0.023290 | 0.253057 | 0.548193 | 0.247998 | 0.100942 | 15.160490 |
| min | -5.000000 | 0.530000 | 4.340000 | 2.810000 | 2.730000 | 0.125000 | 0.010000 |
| 25% | -2.400000 | 0.546000 | 4.770000 | 3.750000 | 3.150000 | 0.200000 | 0.777500 |
| 50% | -2.300000 | 0.565000 | 4.780000 | 3.955000 | 3.150000 | 0.287500 | 3.065000 |
| 75% | -2.300000 | 0.574000 | 5.100000 | 4.170000 | 3.510000 | 0.375000 | 12.815000 |
| max | 0.000000 | 0.600000 | 5.140000 | 5.350000 | 3.640000 | 0.450000 | 62.420000 |
Boxplots make these distributions a bit more accessible:
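The original boxplot figure is not reproduced here; the following is a minimal sketch of how such boxplots could be generated (the plotting details are an assumption, not the original code):

```python
# Sketch: one boxplot per column, since the variables live on very
# different scales and a single shared axis would hide most of them.
fig, axes = plt.subplots(1, input_data_df.shape[1], figsize=(16, 4))
for ax, column in zip(axes, input_data_df.columns):
    input_data_df.boxplot(column=column, ax=ax)
plt.tight_layout()
plt.show()
```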
Next, we have to rescale our dataset:
```python
from sklearn.preprocessing import MaxAbsScaler

input_data_scaled_df = input_data_df.copy()
scaler = MaxAbsScaler()
input_data_scaled = scaler.fit_transform(input_data_df)
input_data_scaled_df.loc[:, :] = input_data_scaled
scaler_params = scaler.get_params()

# We are dealing with physics here, hence we need the unscaled values
extract_scaling_function = np.ones((1, input_data_scaled_df.shape[1]))
extract_scaling_function = scaler.inverse_transform(extract_scaling_function)

display(input_data_scaled_df.head(3))
```
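MaxAbsScaler divides each column by its maximum absolute value, so pushing a row of ones through `inverse_transform` recovers exactly those per-column maxima. A small sketch (the usage shown is assumed, not from the original notebook) of how these factors take us back to true-scale values:

```python
# Sketch: undo the scaling for the target column by multiplying with
# the extracted max-abs factor (last entry of extract_scaling_function).
true_resistance = (input_data_scaled_df['Residuary resistance'].values
                   * extract_scaling_function[0, -1])
print(np.allclose(true_resistance, input_data_df['Residuary resistance'].values))  # True
```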
Next, we have to choose our machine learning algorithms.
In this case we are going to evaluate some classical ones as well as two neural networks.
An example of a classical machine learning algorithm is decision tree regression. It is trained and tested as follows:
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (r2_score, mean_squared_error,
                             mean_absolute_error, median_absolute_error)

def train_test_decision_tree_regression(X_train, X_test, y_train, y_test,
                                        scorer, dataset_id, kfold_vs_size):
    decision_tree_regression = DecisionTreeRegressor(random_state=42)
    grid_parameters_decision_tree_regression = {'max_depth': [None, 3, 5, 7, 9, 10, 11]}
    start_time = time.time()
    grid_obj = GridSearchCV(decision_tree_regression,
                            param_grid=grid_parameters_decision_tree_regression,
                            cv=kfold_vs_size, n_jobs=-1, scoring=scorer, verbose=1)
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    best_decision_tree_regression = grid_fit.best_estimator_
    prediction = best_decision_tree_regression.predict(X_test)
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    mae = mean_absolute_error(y_true=y_test, y_pred=prediction)

    # metrics for true values:
    # r2 remains unchanged; mse and mae will change and cannot simply be rescaled
    # because there is some physical meaning behind them
    # (`datasets` and the loop index `i` come from the surrounding notebook)
    prediction_true_scale = prediction * datasets[dataset_id]['scaler_array'][:, -(i + 1)]
    y_test_true_scale = y_test * datasets[dataset_id]['scaler_array'][:, -(i + 1)]
    mae_true_scale = mean_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    medae_true_scale = median_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    mse_true_scale = mean_squared_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)

    return {'Regression type': 'Decision Tree Regression',
            'model': grid_fit,
            'Predictions': prediction,
            'R2': r2, 'MSE': mse, 'MAE': mae,
            'MSE_true_scale': mse_true_scale,
            'RMSE_true_scale': np.sqrt(mse_true_scale),
            'MAE_true_scale': mae_true_scale,
            'MedAE_true_scale': medae_true_scale,
            'Training time': training_time,
            'dataset': str(dataset_id) + str(-(i + 1))}
```
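A hypothetical call of this function; the `datasets` dictionary, the loop index `i`, the train/test splits, and the scorer are all assumed to be defined in the surrounding notebook:

```python
from sklearn.metrics import make_scorer

# hypothetical usage, following the conventions of the function above
scorer = make_scorer(mean_squared_error, greater_is_better=False)
dt_result = train_test_decision_tree_regression(X_train, X_test, y_train, y_test,
                                                scorer, dataset_id, kfold_vs_size=5)
print(dt_result['R2'], dt_result['RMSE_true_scale'])
```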
And two neural networks using Keras (which spares us the detailed low-level work that pure TensorFlow or PyTorch would require):
```python
from keras import callbacks
from keras.models import Sequential
from keras.layers import Dense, Dropout

# simple model close to Ortigosa et al. (2007)
def build_baseline_model(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='sigmoid'))
    model.add(Dense(9, activation='sigmoid'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    return model

def run_baseline_model(X_train, X_test, y_train, y_test, dataset_id,
                       epochs=1000, validation_split=0.2, batch_size=16):
    model = build_baseline_model(datasets[dataset_id]['X_train'].shape[1])
    callback_file_path = 'keras_models/model_baseline_model_' + str(dataset_id) + '_best.hdf5'
    checkpoint = callbacks.ModelCheckpoint(callback_file_path, monitor='val_loss',
                                           save_best_only=True, save_weights_only=True)
    start_time = time.time()
    model.fit(X_train, y_train, callbacks=[checkpoint], batch_size=batch_size,
              epochs=epochs, validation_split=validation_split)
    training_time = time.time() - start_time
    history = model.history.history

    # load best model
    model.load_weights(callback_file_path)
    prediction = model.predict(X_test)
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    mae = mean_absolute_error(y_true=y_test, y_pred=prediction)

    # metrics for true values:
    # r2 remains unchanged; mse and mae will change and cannot simply be rescaled
    # because there is some physical meaning behind them
    prediction_true_scale = prediction * datasets[dataset_id]['scaler_array'][:, -1]
    y_test_true_scale = y_test * datasets[dataset_id]['scaler_array'][:, -1]
    mae_true_scale = mean_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    medae_true_scale = median_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    mse_true_scale = mean_squared_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)

    return {'Regression type': 'Baseline NN',
            'model': [callback_file_path, history],
            'Predictions': prediction,
            'R2': r2, 'MSE': mse, 'MAE': mae,
            'MSE_true_scale': mse_true_scale,
            'RMSE_true_scale': np.sqrt(mse_true_scale),
            'MAE_true_scale': mae_true_scale,
            'MedAE_true_scale': medae_true_scale,
            'Training time': training_time,
            'dataset': str(dataset_id) + str(-(i + 1))}
```
```python
def build_deeper_model(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    return model

def run_deeper_model(X_train, X_test, y_train, y_test, dataset_id,
                     epochs=1000, validation_split=0.2, batch_size=16):
    model = build_deeper_model(datasets[dataset_id]['X_train'].shape[1])
    callback_file_path = 'keras_models/model_deeper_model_' + str(dataset_id) + '_best.hdf5'
    checkpoint = callbacks.ModelCheckpoint(callback_file_path, monitor='val_loss',
                                           save_best_only=True, save_weights_only=True)
    start_time = time.time()
    model.fit(X_train, y_train, callbacks=[checkpoint], batch_size=batch_size,
              epochs=epochs, validation_split=validation_split)
    training_time = time.time() - start_time
    history = model.history.history

    # load best model
    model.load_weights(callback_file_path)
    prediction = model.predict(X_test)
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    mae = mean_absolute_error(y_true=y_test, y_pred=prediction)

    # metrics for true values:
    # r2 remains unchanged; mse and mae will change and cannot simply be rescaled
    # because there is some physical meaning behind them
    prediction_true_scale = prediction * datasets[dataset_id]['scaler_array'][:, -1]
    y_test_true_scale = y_test * datasets[dataset_id]['scaler_array'][:, -1]
    mae_true_scale = mean_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    medae_true_scale = median_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    mse_true_scale = mean_squared_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)

    return {'Regression type': 'Deeper NN',
            'model': [callback_file_path, history],
            'Predictions': prediction,
            'R2': r2, 'MSE': mse, 'MAE': mae,
            'MSE_true_scale': mse_true_scale,
            'RMSE_true_scale': np.sqrt(mse_true_scale),
            'MAE_true_scale': mae_true_scale,
            'MedAE_true_scale': medae_true_scale,
            'Training time': training_time,
            'dataset': str(dataset_id) + str(-(i + 1))}
```
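A hypothetical driver loop for the two networks; the structure of `datasets` (ids mapping to splits and scaler arrays) is assumed from the functions above:

```python
# hypothetical driver loop; `i` is also consumed by the functions above
results = []
for i, dataset_id in enumerate(datasets):
    d = datasets[dataset_id]
    results.append(run_baseline_model(d['X_train'], d['X_test'],
                                      d['y_train'], d['y_test'], dataset_id))
    results.append(run_deeper_model(d['X_train'], d['X_test'],
                                    d['y_train'], d['y_test'], dataset_id))
```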
Results
| Regression type | R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale |
|---|---|---|---|---|---|---|---|
| Linear Regression | 0.543486 | 0.017410 | 0.108639 | 67.832007 | 8.236019 | 6.781256 | 6.518309 |
| Decision Tree Regression | 0.998138 | 0.000071 | 0.005219 | 0.276711 | 0.526033 | 0.325739 | 0.170000 |
| SVM Regression | 0.851072 | 0.005680 | 0.065834 | 22.128809 | 4.704127 | 4.109343 | 4.367709 |
| Random Forest Regression | 0.996198 | 0.000145 | 0.005843 | 0.564943 | 0.751627 | 0.364702 | 0.125792 |
| AdaBoost Regression | 0.989979 | 0.000382 | 0.015930 | 1.488950 | 1.220226 | 0.994369 | 0.871679 |
| XGBoost Regression | 0.998399 | 0.000061 | 0.003839 | 0.237850 | 0.487699 | 0.239658 | 0.125082 |
| Baseline NN | 0.763068 | 0.009036 | 0.074620 | 35.205081 | 5.933387 | 4.657783 | 4.176649 |
| Deeper NN | 0.942808 | 0.002181 | 0.028645 | 8.497950 | 2.915124 | 1.788016 | 1.370933 |
For visual assessment, we can plot predictions vs true values:
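The figure itself is not reproduced here; a sketch of how such a plot could look, assuming the `results` list of dictionaries from the driver loop above:

```python
# Sketch: predicted vs. true residuary resistance (scaled units).
plt.figure(figsize=(6, 6))
for result in results:
    plt.scatter(y_test, result['Predictions'], s=10, label=result['Regression type'])
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, 'k--')  # identity line: perfect predictions would lie on it
plt.xlabel('true residuary resistance (scaled)')
plt.ylabel('predicted residuary resistance (scaled)')
plt.legend()
plt.show()
```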
This tells us that there are clearly regions in which the behavior is less linear.
Discussion of results
If we have a look at the decision tree (at the end of this page), it almost seems to be heavily overfitted, since it branches out so much.
This concern seems reasonable. We certainly need more data for training, cross-validation, and final testing, and probably more robust models.
Acknowledgements
I would like to thank Roberto Lopez for making this dataset available.
By the way, this is what the decision tree model looks like:
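The tree figure itself is not included here; a sketch of how it could be rendered from the fitted grid search, assuming the `dt_result` dictionary from the usage example above:

```python
from sklearn import tree

# sketch: visualize the best estimator found by the grid search
best_tree = dt_result['model'].best_estimator_
plt.figure(figsize=(24, 12))
tree.plot_tree(best_tree, feature_names=list(input_data_df.columns[:-1]), filled=True)
plt.show()
```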