This site and its content are under reconstruction! (Most links are broken.)
Are you sick and tired of seeing the same benchmark datasets in recent machine learning publications again and again, and do you feel that papers are fitted to the benchmarks mainly to get accepted in some journals? This is exactly how I feel. Hence, I will explore some lesser-known datasets.
I do not want to complain too much about baseline datasets. There are some good ones out there, such as the LibriSpeech ASR corpus, ImageNet or GTSDB (German Traffic Sign Detection Benchmark). These datasets can be put to good use. However, they might be a bit overused, and some datasets are really overused, such as MNIST, Iris and Titanic. MNIST was a good dataset in the 1990s and early 2000s, but not today ;). Fisher’s Iris dataset and the Titanic survivors dataset are completely overused, though I have some ideas for how to make something useful with the Titanic dataset: it could teach data scientists and machine learning engineers that applications of machine learning and statistics in the physical world are not only about correlation but about cause and effect as well ;).
The idea of machine learning, and especially of deep learning, is to generalize more and therefore move towards Artificial General Intelligence. In my opinion, using the same datasets over and over again contradicts this idea a bit. Furthermore, machine learning has so many different applications that it is useful to think and test a bit more broadly. Perhaps I can inspire someone to apply it to more “real-world/physical-world” problems instead of simply trying to optimize e-commerce and detecting cats and dogs in images ;).
From time to time, I am going to explore some of the lesser-known datasets and build machine learning models on them. Many of these datasets are from well-known dataset repositories. I am aiming to revisit the baseline models published for these datasets with a reasonable amount of work. Because some of the datasets are a bit older, it is interesting to see how newer approaches such as deep learning or XGBoost perform on them. Further, I am planning to focus on problems from the physical world and analyse the datasets more broadly than by purely throwing machine learning at them. I am aiming to explore at least one dataset every 1-2 weeks.
Contents
- Audio
- Autonomous systems, robotics, self-driving cars
- Civil engineering
- Computer vision
- Cybersecurity, security and safety engineering
- Geoscience and mining
- Medical and health
- Science and engineering
- Miscellaneous
Audio
Explored
Unexplored
- LibriSpeech
- Mozilla Common Voice
- Bird Audio Detection Challenge
- DCASE - Detection and Classification of Acoustic Scenes and Events
Autonomous systems, robotics, self-driving cars
Everything related to robotics, self-driving cars and other autonomous (physical) agents.
(You can find a list with references (literature) on these datasets here.)
Explored
- nuScenes (includes RADAR data as well)
- Lyft Level5 dataset
- Argoverse
Unexplored
- Anticipating Accidents in Dashcam Videos
- ApolloScape
- Berkeley DeepDrive
- CARLA Imitation learning
- CamVid
- Chilean Underground Mine Dataset
- Cityscapes
- comma2k19
- Daimler Urban Segmentation (DUS)
- KITTI Vision Benchmark Suite
- Mapillary Vistas
- The Oxford RobotCar Dataset
- Underwater caves sonar and vision data set
Civil engineering
Explored
Unexplored
Computer vision
Everything with pictures that doesn’t belong in any other category.
Explored
- CMNIST - a more challenging approach
- EMNIST
- Kuzushiji-MNIST, Kuzushiji-49, Kuzushiji-Kanji
- Public traffic surveillance cameras
Unexplored
- Yet Another Computer Vision Index To Datasets (YACVID)
- FieldSAFE
- CADP: A Novel Dataset for CCTV Traffic Camera based Accident Analysis
- UCF-Crime
- Omniglot
Cybersecurity, security and safety engineering
Explored
Unexplored
Geoscience and mining
Everything that covers geophysics, geospatial data, geoscience and mining-related issues.
Explored
- Forest Fires Data Set
- Forest type classification
- Ionosphere radar signal classification
- Rock permeability
- Sonar: Mines vs. Rocks
- Seismic Bumps Forecasting in a Coal Mine
- Urban land cover classification
Unexplored
- OpenStreetMap
- SEG (Society of Exploration Geophysicists) Open data (contains a lot of seismic data from real reservoirs)
- USGS (US Geological Survey) dataset collection
- BGS (British Geological Survey) open geoscience collection
- BODC (British Oceanographic Data Centre)
- Digital Rocks
- Geoscience Australia
- NOAA datasets
- AODN (Australian Ocean Data Network)
- Smithsonian Institution Global Volcanism Program
- Deutsche Bahn Open Data Portal
- NYC Taxi trips from 2013 to today
- Landsat 8
- Sentinel-2
- US Department of Energy - Geothermal Data Repository
- INGV - Long-term seismological data
- GPS Trajectories
- Taxi Service Trajectory - Prediction Challenge, ECML PKDD 2015
- Pixel Level Land Classification
- DOTA: A Large-scale Dataset for Object Detection in Aerial Images
- ISPRS Urban Classification datasets
Medical and health
Explored
Unexplored
- Breast Tissue Classification by Electrical Impedance Spectroscopy
- Breast Cancer Wisconsin (Diagnostic)
Science and engineering
Everything related to science and engineering that is not (yet) part of any other category.
Explored
- APS Failure at Scania Trucks
- Condition Based Maintenance of Naval Propulsion Plants
- Energy Efficiency (HVAC)
- Glass Identification
- ISAR Aircraft RADAR Signatures
- NASA Airfoil Self-Noise
- QSAR Aquatic Toxicity
- QSAR Fish Toxicity
- Sensorless Drive Diagnosis
- Steel Plates Faults Dataset
- Superconductivity
- Yacht Hydrodynamics
Unexplored
- Cheminformatics
- https://data.mendeley.com/datasets/journals/09215093
- https://www.research.manchester.ac.uk/portal/en/facultiesandschools/materials-science-centre(0f869022-4ac9-4902-9019-23a61c3127c2)/datasets.html
- Container Crane Controller
- Ultrasonic Flowmeter Diagnostics
- Ultrahigh Carbon Steel Microconstituent Annotations
- NIST Materials Data Repository
Miscellaneous
Explored
- Housing Prices - More realistic views
- Hill Valley detection
- Metro Interstate Traffic Volume
- Questionable/useless datasets in RDatasets
Unexplored
- http://deeplearning.net/datasets/
- https://github.com/niderhoff/nlp-datasets
- UCI Machine Learning Repository
- https://blog.bigml.com/list-of-public-data-sources-fit-for-machine-learning/
- List of datasets for machine learning research on Wikipedia
- Kaggle Datasets
- The R Datasets Package
- Mulan Datasets
- WEKA - Collection of Datasets
- Data in Brief Journal
- https://caffe2.ai/docs/datasets.html
- CSIRO data access portal
- Open Data on AWS
- Disaster Risk datasets
- University of Strathclyde Research Data Collection
- Wireless Indoor Localization
- English plaintext jokes
- Educational Process Mining (EPM): A Learning Analytics Data Set
- Goodbooks-10k
If you know of more datasets that would be worth exploring, please drop me a note.