Today, we will have a look at the “Hill and Valley detection” dataset as part of my “Exploring Less Known Datasets for Machine Learning” series. Why hill and valley detection? Well, there are many applications for detecting local or global minima and maxima, ranging all the way from topography to signal processing. Furthermore, it is a benchmark dataset for the Waikato Environment for Knowledge Analysis (WEKA) - that could be interesting.


Dataset exploration and preprocessing

The aim of the dataset is to classify a series of 100 points (one row = one signal) as either a hill or a valley. The dataset consists of two separate subsets: one contains clean surfaces, while the other contains many noisy datapoints.

Both sub-datasets are divided into training and testing sets with 606 datapoints each. We have to scale this data per row, since the absolute values differ drastically between rows and we are only interested in relative changes within a row (signal).
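As a minimal sketch of the loading and per-row scaling step (file and column names are assumptions based on the UCI version of the dataset, not necessarily the exact ones used here):

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Hypothetical file names for the clean (noise-free) subset -- adjust to
# wherever the CSV files are stored; the noisy subset works the same way.
train = pd.read_csv("Hill_Valley_without_noise_Training.data")
test = pd.read_csv("Hill_Valley_without_noise_Testing.data")

# The label column is assumed to be called "class" (0 = valley, 1 = hill).
X_train, y_train = train.drop(columns="class"), train["class"]
X_test, y_test = test.drop(columns="class"), test["class"]

# Scale each row (signal) independently to [0, 1]: only the relative shape
# of the 100-point profile matters, not its absolute values.
X_train = minmax_scale(X_train, axis=1)
X_test = minmax_scale(X_test, axis=1)
```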

Applying classical machine learning algorithms

Since this data does not require any obvious feature engineering, we can simply throw some classifiers at it and see what happens. In this case we can try something simple such as:

  • Gaussian Naive Bayes
  • Decision Trees
  • Support Vector Machines

as well as some more complex algorithms such as:

  • Random Forest
  • AdaBoost
  • XGBoost

All classifiers are trained using grid search (GridSearchCV) with 5-fold cross-validation.
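A minimal sketch of this setup could look as follows; the models and parameter grids shown are illustrative placeholders rather than the exact grids behind the results below:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Illustrative models and parameter grids (X_train, y_train, X_test, y_test
# come from the loading/scaling sketch above).
models = {
    "Gaussian Naive Bayes": (GaussianNB(), {}),
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "Random Forest": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
}

results = {}
for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    # Test-set accuracy of the best hyper-parameter combination.
    results[name] = search.best_estimator_.score(X_test, y_test)

print(results)
```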

Scaled input data

    Classifier                  Type            Accuracy  Dataset
 0  Gaussian Naive Bayes        Classification  1.00000   clean
 1  Decision Tree               Classification  1.00000   clean
 2  SVM                         Classification  1.00000   clean
 3  Random Forest               Classification  1.00000   clean
 4  AdaBoost                    Classification  1.00000   clean
 5  eXtreme Gradient Boosting   Classification  1.00000   clean
 6  Gaussian Naive Bayes        Classification  1.00000   noisy
 7  Decision Tree               Classification  1.00000   noisy
 8  SVM                         Classification  1.00000   noisy
 9  Random Forest               Classification  1.00000   noisy
10  AdaBoost                    Classification  1.00000   noisy
11  eXtreme Gradient Boosting   Classification  0.99835   noisy

That is almost too simple for a benchmark. However, XGBoost seems to have some trouble on the noisy subset: it requires much more training time and probably overfits to the training data.

Combination of noisy and clean

What happens if we combine both datasets?
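Assuming the clean and noisy subsets were loaded and row-scaled into arrays with hypothetical names such as X_train_clean and X_train_noisy, combining them is a simple concatenation:

```python
import numpy as np

# Stack the clean and noisy subsets into one combined training and test set.
X_train_all = np.vstack([X_train_clean, X_train_noisy])
y_train_all = np.concatenate([y_train_clean, y_train_noisy])
X_test_all = np.vstack([X_test_clean, X_test_noisy])
y_test_all = np.concatenate([y_test_clean, y_test_noisy])
```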

   Classifier                  Type            Accuracy
0  Gaussian Naive Bayes        Classification  1.000000
1  Decision Tree               Classification  1.000000
2  SVM                         Classification  1.000000
3  Random Forest               Classification  1.000000
4  AdaBoost                    Classification  1.000000
5  eXtreme Gradient Boosting   Classification  0.999175


It is not surprising that XGBoost performs slightly better here, since more training data is available. Otherwise, no surprises.

Unscaled input data

Let us try to use unscaled data to see if we can make at least one classifier perform badly!

    Classifier                  Type            Accuracy  Dataset
 0  Gaussian Naive Bayes        Classification  0.521452  clean
 1  Decision Tree               Classification  0.566007  clean
 2  SVM                         Classification  1.000000  clean
 3  Random Forest               Classification  0.557756  clean
 4  AdaBoost                    Classification  0.562706  clean
 5  eXtreme Gradient Boosting   Classification  0.597360  clean
 6  Gaussian Naive Bayes        Classification  0.490099  noisy
 7  Decision Tree               Classification  0.509901  noisy
 8  SVM                         Classification  0.950495  noisy
 9  Random Forest               Classification  0.516502  noisy
10  AdaBoost                    Classification  0.514851  noisy
11  eXtreme Gradient Boosting   Classification  0.514851  noisy


Okay, that works for every classifier but SVM ;). It is a clear indicator that the SVM approach of maximizing the distance (margin) between the two classes is much less sensitive to unscaled inputs.

Scaled and swapped datasets

There is one thing remaining: what happens if we train our models on the clean dataset and test them on the noisy dataset, and vice versa?
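A minimal sketch of this swapped setup, shown here for the SVM only and reusing the hypothetical variable names from the sketches above:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Train on the clean subset, evaluate on the noisy test set.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train_clean, y_train_clean)
print("clean -> noisy:", search.best_estimator_.score(X_test_noisy, y_test_noisy))

# And the other way around.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train_noisy, y_train_noisy)
print("noisy -> clean:", search.best_estimator_.score(X_test_clean, y_test_clean))
```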

    Classifier                  Type            Accuracy  Setup
 0  Gaussian Naive Bayes        Classification  1.0000    train on clean, test on noisy
 1  Decision Tree               Classification  1.0000    train on clean, test on noisy
 2  SVM                         Classification  1.0000    train on clean, test on noisy
 3  Random Forest               Classification  1.0000    train on clean, test on noisy
 4  AdaBoost                    Classification  1.0000    train on clean, test on noisy
 5  eXtreme Gradient Boosting   Classification  0.9967    train on clean, test on noisy
 6  Gaussian Naive Bayes        Classification  1.0000    train on noisy, test on clean
 7  Decision Tree               Classification  1.0000    train on noisy, test on clean
 8  SVM                         Classification  1.0000    train on noisy, test on clean
 9  Random Forest               Classification  1.0000    train on noisy, test on clean
10  AdaBoost                    Classification  1.0000    train on noisy, test on clean
11  eXtreme Gradient Boosting   Classification  1.0000    train on noisy, test on clean


It is no surprise that training on noisy data yields good results on the clean set; the other way around is a bit more surprising, but not too much ;).

Discussion of results

It seems like this dataset is a bit too easy for a benchmark. There are some really useful applications of hill and valley detection, but in a different form than presented here. Let us think about a digital elevation model of a mountain range. In such a case, we would have many possible cross-sections across the 2D plane. Applying traditional terrain analysis algorithms lets us analyse all kinds of surface features and derive some sort of valley and ridge separation (e.g. for watersheds). However, there are some open questions where machine learning could be useful:

  • What do locals and tourists perceive as a valley or a hill, and can that be utilized for improved touristic marketing?
  • What is the transition point between a hill and a valley?
  • Is it faster than classical methods?

And if we think of signal processing, we may want to:

  • segment signals
  • find characteristic signals and classify them (a quick sketch of classical peak and valley detection follows below)
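As a point of comparison, a classical (non-machine-learning) way to locate hills and valleys in a one-dimensional signal is simple peak detection, sketched here on a toy signal with scipy:

```python
import numpy as np
from scipy.signal import find_peaks

# Toy signal: a noisy sine wave standing in for one 100-point profile.
rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 4 * np.pi, 100)) + 0.1 * rng.normal(size=100)

# Hills are local maxima; valleys are local minima, found by negating the
# signal. The prominence threshold filters out small noise bumps.
hills, _ = find_peaks(signal, prominence=0.5)
valleys, _ = find_peaks(-signal, prominence=0.5)

print("hill indices:", hills)
print("valley indices:", valleys)
```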

Acknowledgements

I would like to thank Lee Graham and Franz Oppacher for making this dataset available.