The original publication of this dataset is: I. Ortigosa, R. Lopez and J. Garcia. A neural networks approach to residuary resistance of sailing yachts prediction. In Proceedings of the International Conference on Marine Engineering MARINE 2007, 2007.

It seems that the original publication works with two different output variables and only 5 input variables, whereas according to the dataset description we have 6 input variables and a single target variable:
Variations concern hull geometry coefficients and the Froude number:

- Longitudinal position of the center of buoyancy, adimensional.
- Prismatic coefficient, adimensional.
- Length-displacement ratio, adimensional.
- Beam-draught ratio, adimensional.
- Length-beam ratio, adimensional.
- Froude number, adimensional.

The measured variable is the residuary resistance per unit weight of displacement:

- Residuary resistance per unit weight of displacement, adimensional.
```python
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# load the dataset and display a portion and basic statistics of it
filepath_input_data = './data/yacht_hydrodynamics.data'
input_data_df = pd.read_csv(filepath_input_data, delim_whitespace=True,
                            names=['Long pos', 'Prismatic coeff',
                                   'Length-displacement ratio', 'Beam-draught ratio',
                                   'Length-beam ratio', 'Froude number',
                                   'Residuary resistance'])
display(input_data_df.head(3))
display(input_data_df.tail(3))
display(input_data_df.describe())
```
| | Long pos | Prismatic coeff | Length-displacement ratio | Beam-draught ratio | Length-beam ratio | Froude number | Residuary resistance |
|---|---|---|---|---|---|---|---|
| 0 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.125 | 0.11 |
| 1 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.150 | 0.27 |
| 2 | -2.3 | 0.568 | 4.78 | 3.99 | 3.17 | 0.175 | 0.47 |
| | Long pos | Prismatic coeff | Length-displacement ratio | Beam-draught ratio | Length-beam ratio | Froude number | Residuary resistance |
|---|---|---|---|---|---|---|---|
| 305 | -2.3 | 0.6 | 4.34 | 4.23 | 2.73 | 0.400 | 19.59 |
| 306 | -2.3 | 0.6 | 4.34 | 4.23 | 2.73 | 0.425 | 30.48 |
| 307 | -2.3 | 0.6 | 4.34 | 4.23 | 2.73 | 0.450 | 46.66 |
| | Long pos | Prismatic coeff | Length-displacement ratio | Beam-draught ratio | Length-beam ratio | Froude number | Residuary resistance |
|---|---|---|---|---|---|---|---|
| count | 308.000000 | 308.000000 | 308.000000 | 308.000000 | 308.000000 | 308.000000 | 308.000000 |
| mean | -2.381818 | 0.564136 | 4.788636 | 3.936818 | 3.206818 | 0.287500 | 10.495357 |
| std | 1.513219 | 0.023290 | 0.253057 | 0.548193 | 0.247998 | 0.100942 | 15.160490 |
| min | -5.000000 | 0.530000 | 4.340000 | 2.810000 | 2.730000 | 0.125000 | 0.010000 |
| 25% | -2.400000 | 0.546000 | 4.770000 | 3.750000 | 3.150000 | 0.200000 | 0.777500 |
| 50% | -2.300000 | 0.565000 | 4.780000 | 3.955000 | 3.150000 | 0.287500 | 3.065000 |
| 75% | -2.300000 | 0.574000 | 5.100000 | 4.170000 | 3.510000 | 0.375000 | 12.815000 |
| max | 0.000000 | 0.600000 | 5.140000 | 5.350000 | 3.640000 | 0.450000 | 62.420000 |
Boxplots make these distributions a bit more accessible:
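The original boxplot figure is not reproduced here; the following is a minimal sketch of how such boxplots could be generated (the plotting details are an assumption, not the original code):

```python
# Sketch: one boxplot per column, since the variables live on very
# different scales and a single shared axis would hide most of them.
fig, axes = plt.subplots(1, input_data_df.shape[1], figsize=(16, 4))
for ax, column in zip(axes, input_data_df.columns):
    input_data_df.boxplot(column=column, ax=ax)
plt.tight_layout()
plt.show()
```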
Next, we have to rescale our dataset:
```python
from sklearn.preprocessing import MaxAbsScaler

input_data_scaled_df = input_data_df.copy()
scaler = MaxAbsScaler()
input_data_scaled = scaler.fit_transform(input_data_df)
input_data_scaled_df.loc[:, :] = input_data_scaled
scaler_params = scaler.get_params()

# We are dealing with physics here, hence we need the unscaled values
extract_scaling_function = np.ones((1, input_data_scaled_df.shape[1]))
extract_scaling_function = scaler.inverse_transform(extract_scaling_function)

display(input_data_scaled_df.head(3))
```
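MaxAbsScaler divides each column by its maximum absolute value, so pushing a row of ones through `inverse_transform` recovers exactly those per-column maxima. A small sketch (the usage shown is assumed, not from the original notebook) of how these factors take us back to true-scale values:

```python
# Sketch: undo the scaling for the target column by multiplying with
# the extracted max-abs factor (last entry of extract_scaling_function).
true_resistance = (input_data_scaled_df['Residuary resistance'].values
                   * extract_scaling_function[0, -1])
print(np.allclose(true_resistance, input_data_df['Residuary resistance'].values))  # True
```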
Next, we have to choose our machine learning algorithms.
In this case we are going to evaluate some classical ones as well as two neural networks.
An example of a classical machine learning algorithm is decision tree regression. It is trained and tested as follows:
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (r2_score, mean_squared_error,
                             mean_absolute_error, median_absolute_error)

def train_test_decision_tree_regression(X_train, X_test, y_train, y_test,
                                        scorer, dataset_id, kfold_vs_size):
    decision_tree_regression = DecisionTreeRegressor(random_state=42)
    grid_parameters_decision_tree_regression = {'max_depth': [None, 3, 5, 7, 9, 10, 11]}
    start_time = time.time()
    grid_obj = GridSearchCV(decision_tree_regression,
                            param_grid=grid_parameters_decision_tree_regression,
                            cv=kfold_vs_size, n_jobs=-1, scoring=scorer, verbose=1)
    grid_fit = grid_obj.fit(X_train, y_train)
    training_time = time.time() - start_time
    best_decision_tree_regression = grid_fit.best_estimator_
    prediction = best_decision_tree_regression.predict(X_test)
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    mae = mean_absolute_error(y_true=y_test, y_pred=prediction)

    # metrics for true values:
    # r2 remains unchanged; mse and mae will change and cannot simply be rescaled
    # because there is some physical meaning behind them
    # (`datasets` and the loop index `i` come from the surrounding notebook)
    prediction_true_scale = prediction * datasets[dataset_id]['scaler_array'][:, -(i + 1)]
    y_test_true_scale = y_test * datasets[dataset_id]['scaler_array'][:, -(i + 1)]
    mae_true_scale = mean_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    medae_true_scale = median_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    mse_true_scale = mean_squared_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)

    return {'Regression type': 'Decision Tree Regression',
            'model': grid_fit,
            'Predictions': prediction,
            'R2': r2, 'MSE': mse, 'MAE': mae,
            'MSE_true_scale': mse_true_scale,
            'RMSE_true_scale': np.sqrt(mse_true_scale),
            'MAE_true_scale': mae_true_scale,
            'MedAE_true_scale': medae_true_scale,
            'Training time': training_time,
            'dataset': str(dataset_id) + str(-(i + 1))}
```
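A hypothetical call of this function; the `datasets` dictionary, the loop index `i`, the train/test splits, and the scorer are all assumed to be defined in the surrounding notebook:

```python
from sklearn.metrics import make_scorer

# hypothetical usage, following the conventions of the function above
scorer = make_scorer(mean_squared_error, greater_is_better=False)
dt_result = train_test_decision_tree_regression(X_train, X_test, y_train, y_test,
                                                scorer, dataset_id, kfold_vs_size=5)
print(dt_result['R2'], dt_result['RMSE_true_scale'])
```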
And two neural networks using Keras (which spares us the detailed low-level work that pure TensorFlow or PyTorch would require):
```python
from keras import callbacks
from keras.models import Sequential
from keras.layers import Dense, Dropout

# simple model close to Ortigosa et al. (2007)
def build_baseline_model(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='sigmoid'))
    model.add(Dense(9, activation='sigmoid'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    return model

def run_baseline_model(X_train, X_test, y_train, y_test, dataset_id,
                       epochs=1000, validation_split=0.2, batch_size=16):
    model = build_baseline_model(datasets[dataset_id]['X_train'].shape[1])
    callback_file_path = 'keras_models/model_baseline_model_' + str(dataset_id) + '_best.hdf5'
    checkpoint = callbacks.ModelCheckpoint(callback_file_path, monitor='val_loss',
                                           save_best_only=True, save_weights_only=True)
    start_time = time.time()
    model.fit(X_train, y_train, callbacks=[checkpoint], batch_size=batch_size,
              epochs=epochs, validation_split=validation_split)
    training_time = time.time() - start_time
    history = model.history.history

    # load best model
    model.load_weights(callback_file_path)
    prediction = model.predict(X_test)
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    mae = mean_absolute_error(y_true=y_test, y_pred=prediction)

    # metrics for true values:
    # r2 remains unchanged; mse and mae will change and cannot simply be rescaled
    # because there is some physical meaning behind them
    prediction_true_scale = prediction * datasets[dataset_id]['scaler_array'][:, -1]
    y_test_true_scale = y_test * datasets[dataset_id]['scaler_array'][:, -1]
    mae_true_scale = mean_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    medae_true_scale = median_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    mse_true_scale = mean_squared_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)

    return {'Regression type': 'Baseline NN',
            'model': [callback_file_path, history],
            'Predictions': prediction,
            'R2': r2, 'MSE': mse, 'MAE': mae,
            'MSE_true_scale': mse_true_scale,
            'RMSE_true_scale': np.sqrt(mse_true_scale),
            'MAE_true_scale': mae_true_scale,
            'MedAE_true_scale': medae_true_scale,
            'Training time': training_time,
            'dataset': str(dataset_id) + str(-(i + 1))}
```
```python
def build_deeper_model(input_dim):
    model = Sequential()
    model.add(Dense(input_dim, input_dim=input_dim, activation='relu'))
    model.add(Dense(20, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(20, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
    return model

def run_deeper_model(X_train, X_test, y_train, y_test, dataset_id,
                     epochs=1000, validation_split=0.2, batch_size=16):
    model = build_deeper_model(datasets[dataset_id]['X_train'].shape[1])
    callback_file_path = 'keras_models/model_deeper_model_' + str(dataset_id) + '_best.hdf5'
    checkpoint = callbacks.ModelCheckpoint(callback_file_path, monitor='val_loss',
                                           save_best_only=True, save_weights_only=True)
    start_time = time.time()
    model.fit(X_train, y_train, callbacks=[checkpoint], batch_size=batch_size,
              epochs=epochs, validation_split=validation_split)
    training_time = time.time() - start_time
    history = model.history.history

    # load best model
    model.load_weights(callback_file_path)
    prediction = model.predict(X_test)
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    mae = mean_absolute_error(y_true=y_test, y_pred=prediction)

    # metrics for true values:
    # r2 remains unchanged; mse and mae will change and cannot simply be rescaled
    # because there is some physical meaning behind them
    prediction_true_scale = prediction * datasets[dataset_id]['scaler_array'][:, -1]
    y_test_true_scale = y_test * datasets[dataset_id]['scaler_array'][:, -1]
    mae_true_scale = mean_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    medae_true_scale = median_absolute_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)
    mse_true_scale = mean_squared_error(y_true=y_test_true_scale, y_pred=prediction_true_scale)

    return {'Regression type': 'Deeper NN',
            'model': [callback_file_path, history],
            'Predictions': prediction,
            'R2': r2, 'MSE': mse, 'MAE': mae,
            'MSE_true_scale': mse_true_scale,
            'RMSE_true_scale': np.sqrt(mse_true_scale),
            'MAE_true_scale': mae_true_scale,
            'MedAE_true_scale': medae_true_scale,
            'Training time': training_time,
            'dataset': str(dataset_id) + str(-(i + 1))}
```
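A hypothetical driver loop for the two networks; the structure of `datasets` (ids mapping to splits and scaler arrays) is assumed from the functions above:

```python
# hypothetical driver loop; `i` is also consumed by the functions above
results = []
for i, dataset_id in enumerate(datasets):
    d = datasets[dataset_id]
    results.append(run_baseline_model(d['X_train'], d['X_test'],
                                      d['y_train'], d['y_test'], dataset_id))
    results.append(run_deeper_model(d['X_train'], d['X_test'],
                                    d['y_train'], d['y_test'], dataset_id))
```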
Results
| Regression type | R2 | MSE | MAE | MSE_true_scale | RMSE_true_scale | MAE_true_scale | MedAE_true_scale |
|---|---|---|---|---|---|---|---|
| Linear Regression | 0.543486 | 0.017410 | 0.108639 | 67.832007 | 8.236019 | 6.781256 | 6.518309 |
| Decision Tree Regression | 0.998138 | 0.000071 | 0.005219 | 0.276711 | 0.526033 | 0.325739 | 0.170000 |
| SVM Regression | 0.851072 | 0.005680 | 0.065834 | 22.128809 | 4.704127 | 4.109343 | 4.367709 |
| Random Forest Regression | 0.996198 | 0.000145 | 0.005843 | 0.564943 | 0.751627 | 0.364702 | 0.125792 |
| AdaBoost Regression | 0.989979 | 0.000382 | 0.015930 | 1.488950 | 1.220226 | 0.994369 | 0.871679 |
| XGBoost Regression | 0.998399 | 0.000061 | 0.003839 | 0.237850 | 0.487699 | 0.239658 | 0.125082 |
| Baseline NN | 0.763068 | 0.009036 | 0.074620 | 35.205081 | 5.933387 | 4.657783 | 4.176649 |
| Deeper NN | 0.942808 | 0.002181 | 0.028645 | 8.497950 | 2.915124 | 1.788016 | 1.370933 |
For visual assessment, we can plot predictions vs true values:
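The figure itself is not reproduced here; a sketch of how such a plot could look, assuming the `results` list of dictionaries from the driver loop above:

```python
# Sketch: predicted vs. true residuary resistance (scaled units).
plt.figure(figsize=(6, 6))
for result in results:
    plt.scatter(y_test, result['Predictions'], s=10, label=result['Regression type'])
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, 'k--')  # identity line: perfect predictions would lie on it
plt.xlabel('true residuary resistance (scaled)')
plt.ylabel('predicted residuary resistance (scaled)')
plt.legend()
plt.show()
```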
This tells us that there are clearly regions in which the behavior is less linear.
Discussion of results
If we have a look at the decision tree (at the end of this page), it almost seems to be heavily overfitted, since it branches out so much.
This concern seems reasonable. We certainly need more data for training, cross-validation, and final testing, and probably more robust models.
Acknowledgements
I would like to thank Roberto Lopez for making this dataset available.
By the way, this is what the decision tree model looks like:
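The tree figure itself is not included here; a sketch of how it could be rendered from the fitted grid search, assuming the `dt_result` dictionary from the usage example above:

```python
from sklearn import tree

# sketch: visualize the best estimator found by the grid search
best_tree = dt_result['model'].best_estimator_
plt.figure(figsize=(24, 12))
tree.plot_tree(best_tree, feature_names=list(input_data_df.columns[:-1]), filled=True)
plt.show()
```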