Exploring Less Known Datasets for Artificial Intelligence, Machine Learning and Data Science

This site and its content are under reconstruction! (Most links are broken.)

Are you sick and tired of seeing the same benchmark datasets in recent machine learning publications again and again, and do you feel that much of the work is fitted to the benchmarks just to get it accepted by some journals? This is exactly how I feel. Hence, I will explore some lesser-known datasets.

I do not want to complain too much about baseline datasets. There are some good ones out there, such as the LibriSpeech ASR corpus, ImageNet or GTSDB (German Traffic Sign Detection Benchmark). These datasets can be put to good use. However, they might be a bit overused, and some datasets are really overused, such as MNIST, Iris and Titanic. MNIST was a good dataset in the 1990s and early 2000s but not today ;). Fisher’s Iris dataset and the Titanic survivors dataset are completely overused, though I have some ideas on how to do something useful with the Titanic dataset that could teach data scientists as well as machine learning engineers that applications of machine learning and statistics to the physical world do not only focus on correlation but on cause and effect as well ;).

The idea of machine learning, and especially of deep learning, is to generalize more and therefore move towards Artificial General Intelligence. In my opinion, using the same datasets over and over again contradicts this idea a bit. Furthermore, machine learning has so many different applications that it is useful to think and test a bit more broadly. Perhaps I can inspire someone to apply it to more “real-world/physical-world” problems instead of simply trying to optimize e-commerce and detecting cats and dogs in images ;).

From time to time, I am going to explore some of the lesser-known datasets and build machine learning models on them. Many of these datasets are from well-known dataset repositories. I am aiming to revisit, with a reasonable amount of work, the baseline models that were published for these datasets. Because some of the datasets are a bit older, it is interesting to see how newer approaches such as deep learning or XGBoost perform on them. Further, I am planning to focus on problems from the physical world and analyse the datasets in a broader way than purely throwing machine learning at them. I am aiming to explore at least one dataset every 1-2 weeks.

Contents

Audio
Autonomous systems, robotics, self-driving cars
Civil engineering
Computer vision
Cybersecurity, security and safety engineering
Geoscience and mining
Medical and health
Science and engineering
Miscellaneous

Audio

Explored

Unexplored

Autonomous systems, robotics, self-driving cars

Everything related to robotics, self-driving cars and other autonomous (physical) agents.

(A list of literature references for these datasets can be found here.)

Explored

nuScenes (includes RADAR data as well)
Lyft Level 5 dataset
Argoverse

Unexplored

Civil engineering

Explored

Unexplored

Pittsburgh Bridges Dataset

Computer vision

Everything with pictures that does not belong in any other category.

Explored

Unexplored

Cybersecurity, security and safety engineering

Explored

Unexplored

Geoscience and mining

Everything that covers geophysics, geospatial data, geoscience and mining-related issues.

Explored

Unexplored

Medical and health

Explored

Unexplored

Science and engineering

Everything related to science and engineering that is not (yet) part of any other category.

Explored

Unexplored

Miscellaneous

Explored

Unexplored

If you know of more datasets that would be worth exploring, please drop me a note.