Today, we will have a look at the NASA Airfoil Noise dataset as part of my “Exploring Less Known Datasets for Machine Learning” series.


Contents

Exploring the dataset
Applying ML algorithms
Results
References


Exploring the dataset

This dataset contains results of NASA airfoil testing from 1989 [1] and is published in the UCI Machine Learning Repository [2]. It deals with self-induced (self-caused) noise due to airflow over an airfoil. In this blog post we will look at the dataset only and not investigate aeroacoustics itself; Lighthill (1992) [3] wrote a short introduction to aeroacoustics that is highly recommended. The results are based on scaled wind tunnel experiments. (Scaled experiments in scaled wind tunnels are extremely difficult to "back-scale" to full scale and involve some nasty sides of fluid dynamics and thermodynamics (gas dynamics). Pure CFD was not feasible back then, but nowadays it replaces a lot of wind tunnel testing; wind tunnel tests remain essential mainly for transient and hypersonic experiments, since those are the areas that are still difficult to model numerically.) Another approach to estimate airfoil-induced noise is to perform numerical simulations that cover fluid-structure interaction.

Let’s load the dataset and have a look at it. If we load a .dat file with pandas.read_csv, we have to set delim_whitespace to True (in newer pandas versions, sep=r'\s+' achieves the same).

import pandas as pd

filepath_input_data = "./data/airfoil_self_noise.dat"
input_data_df = pd.read_csv(filepath_input_data, delim_whitespace=True,
                            names=['Frequency (Hz)',
                                   'Angle of Attack (deg)',
                                   'Chord length (m)',
                                   'Free-stream velocity (m/s)',
                                   'Suction side displacement thickness (m)',
                                   'Noise (dB)'])

display(input_data_df.head(3))
display(input_data_df.tail(3))
input_data_df.describe()
head(3):
      Frequency (Hz)  Angle of Attack (deg)  Chord length (m)  Free-stream velocity (m/s)  Suction side displacement thickness (m)  Noise (dB)
0     800             0.0                    0.3048            71.3                        0.002663                                 126.201
1     1000            0.0                    0.3048            71.3                        0.002663                                 125.201
2     1250            0.0                    0.3048            71.3                        0.002663                                 125.951

tail(3):
      Frequency (Hz)  Angle of Attack (deg)  Chord length (m)  Free-stream velocity (m/s)  Suction side displacement thickness (m)  Noise (dB)
1500  4000            15.6                   0.1016            39.6                        0.052849                                 106.604
1501  5000            15.6                   0.1016            39.6                        0.052849                                 106.224
1502  6300            15.6                   0.1016            39.6                        0.052849                                 104.204

describe():
       Frequency (Hz)  Angle of Attack (deg)  Chord length (m)  Free-stream velocity (m/s)  Suction side displacement thickness (m)  Noise (dB)
count  1503.000000     1503.000000            1503.000000       1503.000000                 1503.000000                              1503.000000
mean   2886.380572     6.782302               0.136548          50.860745                   0.011140                                 124.835943
std    3152.573137     5.918128               0.093541          15.572784                   0.013150                                 6.898657
min    200.000000      0.000000               0.025400          31.700000                   0.000401                                 103.380000
25%    800.000000      2.000000               0.050800          39.600000                   0.002535                                 120.191000
50%    1600.000000     5.400000               0.101600          39.600000                   0.004957                                 125.721000
75%    4000.000000     9.900000               0.228600          71.300000                   0.015576                                 129.995500
max    20000.000000    22.200000              0.304800          71.300000                   0.058411                                 140.987000

The original publication [1] contains detailed information on all features and experimental settings used for these measurements.

Boxplots are a bit better for visual understanding than a summary table:

The dataset shows some indications of outliers for several features.

Let’s see if any of the features are correlated directly to the noise level:

And as a correlation matrix:
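Such a matrix is one pandas call. A minimal sketch with a small stand-in DataFrame (same column names as above); on the real data this is simply input_data_df.corr():

```python
import pandas as pd

# Illustrative subset from the rows shown above.
sample_df = pd.DataFrame({
    'Frequency (Hz)': [800, 1000, 1250, 4000, 5000, 6300],
    'Free-stream velocity (m/s)': [71.3, 71.3, 71.3, 39.6, 39.6, 39.6],
    'Noise (dB)': [126.201, 125.201, 125.951, 106.604, 106.224, 104.204],
})

# Pairwise Pearson correlation of all columns; the 'Noise (dB)' column
# of the result shows how each feature correlates with the target.
corr = sample_df.corr()
print(corr['Noise (dB)'])
```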

It looks like there is no single variable short-cut here ;).

Next, we have to rescale the dataset and perform train-test splitting:

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

input_data_scaled_df = input_data_df.copy()
scaler = MaxAbsScaler()
input_data_scaled = scaler.fit_transform(input_data_df)
input_data_scaled_df.loc[:, :] = input_data_scaled
# We are dealing with physics here, hence we need a way back to the
# unscaled values: inverse-transforming a row of ones recovers the
# per-column scaling factors
extract_scaling_function = np.ones((1, input_data_scaled_df.shape[1]))
extract_scaling_function = scaler.inverse_transform(extract_scaling_function)
display(input_data_scaled_df.head(3))
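The ones-row trick deserves a quick sanity check: MaxAbsScaler divides each column by its maximum absolute value, so inverse-transforming a row of ones multiplies by exactly those factors, which scikit-learn also exposes directly as scale_. A small self-contained demo (the toy array is for illustration only):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

data = np.array([[1.0, -4.0],
                 [2.0,  2.0],
                 [-3.0, 1.0]])
scaler = MaxAbsScaler()
scaler.fit(data)

# Column-wise max abs values are 3 and 4; inverse_transform multiplies
# by them, so a row of ones comes back as those factors.
ones_back = scaler.inverse_transform(np.ones((1, data.shape[1])))
print(ones_back)       # [[3. 4.]]
print(scaler.scale_)   # [3. 4.]
```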

from sklearn.model_selection import train_test_split
y = input_data_scaled_df['Noise (dB)'].values.reshape(-1,1)
X_df = input_data_scaled_df.copy()
X_df.drop(['Noise (dB)'], axis=1, inplace=True)
X = X_df.values

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    shuffle=True)

Applying ML algorithms

Let’s throw a small set of ML algorithms at the dataset with hyperparameter optimization using a full grid search:

grid_parameters_linear_regression = {'fit_intercept' : [False, True]}
grid_parameters_decision_tree_regression = {'max_depth' : [None, 3,5,7,9,10,11]}
grid_parameters_SVR_regression = {'C' : [1, 5, 7, 10, 30, 50],
                                     'epsilon' : [0.001, 0.01, 0.1, 0.2, 0.5, 0.6,0.8],
                                     'kernel' : ['rbf', 'linear'],
                                     'shrinking' : [False, True],
                                     'tol' : [0.001, 0.0001, 0.00001]}
grid_parameters_random_forest_regression = {'n_estimators' : [3,5,10,15,18],
                                                'max_depth' : [None, 2,3,5,7,9]}
grid_parameters_adaboost_regression = {'n_estimators' : [3,5,10,15,18,20,25,50,60,80,100,120],
                                           'loss' : ['linear', 'square', 'exponential'],
                                           'learning_rate' : [0.001, 0.01, 0.1, 0.8, 1.0]}
grid_parameters_xgboost_regression = {'n_estimators' : [3,5,10,15,18,20,25,50,60,80,100,120,150,200,300],
                                          'max_depth' : [1, 2, 10, 15],
                                          'learning_rate' :[ 0.0001, 0.001, 0.01, 0.1, 0.15, 0.2, 0.8, 1.0]}
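Each of these grids is then typically handed to scikit-learn's GridSearchCV. A minimal sketch using the random forest grid from above; the synthetic training data here merely stands in for the X_train/y_train produced by the split earlier, and the scoring choice is my assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in training data (replace with X_train, y_train from the split).
rng = np.random.default_rng(42)
X_train = rng.random((200, 5))
y_train = X_train.sum(axis=1) + 0.1 * rng.standard_normal(200)

grid_parameters_random_forest_regression = {'n_estimators': [3, 5, 10, 15, 18],
                                            'max_depth': [None, 2, 3, 5, 7, 9]}

# Exhaustive search over all 30 parameter combinations with 5-fold CV.
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           grid_parameters_random_forest_regression,
                           cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```

After fitting, grid_search.best_estimator_ is refit on the full training set and can be used like any other regressor.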

Furthermore, we are going to run two simple neural networks on it.
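As a rough sketch of what a simple neural network on a regression task like this could look like (the layer sizes and the stand-in data are my assumptions, not the architectures actually used):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Stand-in data with 5 features, mimicking the shape of X_train above.
rng = np.random.default_rng(0)
X_train = rng.random((300, 5))
y_train = X_train @ np.array([1.0, 2.0, 0.5, -1.0, 0.3]) \
          + 0.05 * rng.standard_normal(300)

# Two hidden layers of 32 ReLU units; a small fully connected net.
mlp = MLPRegressor(hidden_layer_sizes=(32, 32), activation='relu',
                   max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_train, y_train))  # R^2 on the training data
```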

Results

Well, the results contain some ups and downs. Random Forests and XGBoost show quite good results. At this point I would like to compare them to the original publication as well as to a PhD [4] and a master's [5] thesis; however, it is somewhat unclear how they achieved their results (train-validation-test splitting etc.), and I did not do any manual feature engineering/selection.

References

[1] Brooks, T.F.; Pope, D.S. and Marcolini, M.A. (1989): Airfoil Self-Noise and Prediction. NASA Reference Publication 1218. Available online at: https://ntrs.nasa.gov/search.jsp?R=19890016302

[2] Dua, D. and Taniskidou, K.E. (2018): UCI Machine Learning Repository, http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

[3] Lighthill, J. (1992): A General Introduction to Aeroacoustics and Atmospheric Sound. NASA Technical Report 92-52/189717. Available online at: https://www.archive.org/details/DTIC_ADA257887

[4] Lopez, R. (2008): Neural Networks for Variational Problems in Engineering. PhD Thesis. Available online at: https://www.cimne.com/flood/docs/PhDThesis.pdf

[5] Errasquin, L. (2014): Airfoil Self-Noise Prediction Using Neural Networks for Wind Turbines. Master's Thesis. Available online at: https://vtechworks.lib.vt.edu/handle/10919/35193