Today, we will have a look at the NASA Airfoil Noise dataset as part of my “Exploring Less Known Datasets for Machine Learning” series.


Contents

Exploring the dataset
Applying ML algorithms
Results
References


Exploring the dataset

This dataset contains results of NASA airfoil testing from 1989 [1] and is published in the UCI Machine Learning Repository [2]. It deals with self-induced (self-caused) noise due to airflow over an airfoil. In this blog post we will look at the dataset only and not investigate aeroacoustics itself; Lighthill (1992) [3] wrote a short introduction to aeroacoustics that is highly recommended. The results are based on scaled wind tunnel experiments. (Scaled experiments in scaled wind tunnels are extremely difficult to "back-scale" to full scale and involve some nasty sides of fluid dynamics and thermodynamics (gas dynamics). Pure CFD was not feasible back then, but nowadays it replaces a lot of wind tunnel testing; wind tunnel tests remain essential mainly for transient and hypersonic experiments, since those are the areas that are still difficult to model numerically.) Another approach to estimate airfoil-induced noise is to perform numerical simulations that cover fluid-structure interaction.

Let’s load the dataset and have a look at it. If we load a .dat file with pandas.read_csv, we have to set delim_whitespace to True (in newer pandas versions, sep=r'\s+' achieves the same).

import pandas as pd

filepath_input_data = "./data/airfoil_self_noise.dat"
input_data_df = pd.read_csv(filepath_input_data, delim_whitespace=True,
                            names=['Frequency (Hz)',
                                   'Angle of Attack (deg)',
                                   'Chord length (m)',
                                   'Free-stream velocity (m/s)',
                                   'Suction side displacement thickness (m)',
                                   'Noise (dB)'])

display(input_data_df.head(3))
display(input_data_df.tail(3))
input_data_df.describe()
head(3):
      Frequency (Hz)  Angle of Attack (deg)  Chord length (m)  Free-stream velocity (m/s)  Suction side displacement thickness (m)  Noise (dB)
0     800             0.0                    0.3048            71.3                        0.002663                                 126.201
1     1000            0.0                    0.3048            71.3                        0.002663                                 125.201
2     1250            0.0                    0.3048            71.3                        0.002663                                 125.951

tail(3):
      Frequency (Hz)  Angle of Attack (deg)  Chord length (m)  Free-stream velocity (m/s)  Suction side displacement thickness (m)  Noise (dB)
1500  4000            15.6                   0.1016            39.6                        0.052849                                 106.604
1501  5000            15.6                   0.1016            39.6                        0.052849                                 106.224
1502  6300            15.6                   0.1016            39.6                        0.052849                                 104.204

describe():
       Frequency (Hz)  Angle of Attack (deg)  Chord length (m)  Free-stream velocity (m/s)  Suction side displacement thickness (m)  Noise (dB)
count  1503.000000     1503.000000            1503.000000       1503.000000                 1503.000000                              1503.000000
mean   2886.380572     6.782302               0.136548          50.860745                   0.011140                                 124.835943
std    3152.573137     5.918128               0.093541          15.572784                   0.013150                                 6.898657
min    200.000000      0.000000               0.025400          31.700000                   0.000401                                 103.380000
25%    800.000000      2.000000               0.050800          39.600000                   0.002535                                 120.191000
50%    1600.000000     5.400000               0.101600          39.600000                   0.004957                                 125.721000
75%    4000.000000     9.900000               0.228600          71.300000                   0.015576                                 129.995500
max    20000.000000    22.200000              0.304800          71.300000                   0.058411                                 140.987000

The original publication [1] contains detailed information on all features and experimental settings used for these measurements.

Boxplots are a bit better for visual understanding than a summary table:

The dataset shows some indications of outliers for several features.

Let’s see if any of the features are correlated directly to the noise level:

And as a correlation matrix:
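Such a matrix is one pandas call. A minimal sketch with a small stand-in DataFrame (same column names as above); on the real data this is simply input_data_df.corr():

```python
import pandas as pd

# Illustrative subset from the rows shown above.
sample_df = pd.DataFrame({
    'Frequency (Hz)': [800, 1000, 1250, 4000, 5000, 6300],
    'Free-stream velocity (m/s)': [71.3, 71.3, 71.3, 39.6, 39.6, 39.6],
    'Noise (dB)': [126.201, 125.201, 125.951, 106.604, 106.224, 104.204],
})

# Pairwise Pearson correlation of all columns; the 'Noise (dB)' column
# of the result shows how each feature correlates with the target.
corr = sample_df.corr()
print(corr['Noise (dB)'])
```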

It looks like there is no single variable short-cut here ;).

Next, we have to rescale the dataset and perform train-test splitting:

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

input_data_scaled_df = input_data_df.copy()
scaler = MaxAbsScaler()
input_data_scaled = scaler.fit_transform(input_data_df)
input_data_scaled_df.loc[:, :] = input_data_scaled
# We are dealing with physics here, hence we need a way back to the
# unscaled values: inverse-transforming a row of ones recovers the
# per-column scaling factors
extract_scaling_function = np.ones((1, input_data_scaled_df.shape[1]))
extract_scaling_function = scaler.inverse_transform(extract_scaling_function)
display(input_data_scaled_df.head(3))
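The ones-row trick deserves a quick sanity check: MaxAbsScaler divides each column by its maximum absolute value, so inverse-transforming a row of ones multiplies by exactly those factors, which scikit-learn also exposes directly as scale_. A small self-contained demo (the toy array is for illustration only):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

data = np.array([[1.0, -4.0],
                 [2.0,  2.0],
                 [-3.0, 1.0]])
scaler = MaxAbsScaler()
scaler.fit(data)

# Column-wise max abs values are 3 and 4; inverse_transform multiplies
# by them, so a row of ones comes back as those factors.
ones_back = scaler.inverse_transform(np.ones((1, data.shape[1])))
print(ones_back)       # [[3. 4.]]
print(scaler.scale_)   # [3. 4.]
```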

from sklearn.model_selection import train_test_split
y = input_data_scaled_df['Noise (dB)'].values.reshape(-1,1)
X_df = input_data_scaled_df.copy()
X_df.drop(['Noise (dB)'], axis=1, inplace=True)
X = X_df.values

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    shuffle=True)

Applying ML algorithms

Let’s throw a small set of ML algorithms at the dataset with hyperparameter optimization using a full grid search:

grid_parameters_linear_regression = {'fit_intercept' : [False, True]}
grid_parameters_decision_tree_regression = {'max_depth' : [None, 3,5,7,9,10,11]}
grid_parameters_SVR_regression = {'C' : [1, 5, 7, 10, 30, 50],
                                     'epsilon' : [0.001, 0.01, 0.1, 0.2, 0.5, 0.6,0.8],
                                     'kernel' : ['rbf', 'linear'],
                                     'shrinking' : [False, True],
                                     'tol' : [0.001, 0.0001, 0.00001]}
grid_parameters_random_forest_regression = {'n_estimators' : [3,5,10,15,18],
                                                'max_depth' : [None, 2,3,5,7,9]}
grid_parameters_adaboost_regression = {'n_estimators' : [3,5,10,15,18,20,25,50,60,80,100,120],
                                           'loss' : ['linear', 'square', 'exponential'],
                                           'learning_rate' : [0.001, 0.01, 0.1, 0.8, 1.0]}
grid_parameters_xgboost_regression = {'n_estimators' : [3,5,10,15,18,20,25,50,60,80,100,120,150,200,300],
                                          'max_depth' : [1, 2, 10, 15],
                                          'learning_rate' :[ 0.0001, 0.001, 0.01, 0.1, 0.15, 0.2, 0.8, 1.0]}
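Each of these grids is then typically handed to scikit-learn's GridSearchCV. A minimal sketch using the random forest grid from above; the synthetic training data here merely stands in for the X_train/y_train produced by the split earlier, and the scoring choice is my assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in training data (replace with X_train, y_train from the split).
rng = np.random.default_rng(42)
X_train = rng.random((200, 5))
y_train = X_train.sum(axis=1) + 0.1 * rng.standard_normal(200)

grid_parameters_random_forest_regression = {'n_estimators': [3, 5, 10, 15, 18],
                                            'max_depth': [None, 2, 3, 5, 7, 9]}

# Exhaustive search over all 30 parameter combinations with 5-fold CV.
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           grid_parameters_random_forest_regression,
                           cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```

After fitting, grid_search.best_estimator_ is refit on the full training set and can be used like any other regressor.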

Furthermore, we are going to run two simple neural networks on it.
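As a rough sketch of what a simple neural network on a regression task like this could look like (the layer sizes and the stand-in data are my assumptions, not the architectures actually used):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Stand-in data with 5 features, mimicking the shape of X_train above.
rng = np.random.default_rng(0)
X_train = rng.random((300, 5))
y_train = X_train @ np.array([1.0, 2.0, 0.5, -1.0, 0.3]) \
          + 0.05 * rng.standard_normal(300)

# Two hidden layers of 32 ReLU units; a small fully connected net.
mlp = MLPRegressor(hidden_layer_sizes=(32, 32), activation='relu',
                   max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_train, y_train))  # R^2 on the training data
```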

Results

Well, the results contain some ups and downs. Random Forests and XGBoost show quite good results. At this point I would like to compare them to the original publication as well as to a PhD [4] and a master's [5] thesis; however, it is somewhat unclear how they achieved their results (train-validation-test splitting etc.), and I did not do any manual feature engineering/selection.

References

[1] Brooks, T.F.; Pope, D.S. and Marcolini, M.A. (1989): Airfoil Self-Noise and Prediction. NASA Reference Publication 1218. Available online at: https://ntrs.nasa.gov/search.jsp?R=19890016302

[2] Dua, D. and Taniskidou, K.E. (2018): UCI Machine Learning Repository, http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.

[3] Lighthill, J. (1992): A General Introduction to Aeroacoustics and Atmospheric Sound. NASA Technical Report 92-52/189717. Available online at: https://www.archive.org/details/DTIC_ADA257887

[4] Lopez, R. (2008): Neural Networks for Variational Problems in Engineering. PhD Thesis. Available online at: https://www.cimne.com/flood/docs/PhDThesis.pdf

[5] Errasquin, L. (2014): Airfoil Self-Noise Prediction Using Neural Networks for Wind Turbines. Master's Thesis. Available online at: https://vtechworks.lib.vt.edu/handle/10919/35193