Research and Practical Aspects of AI/ML
Common Mistakes in Machine Learning and Data Science
Provides a more holistic approach to minimizing errors. Unfortunately, I see these very basic mistakes repeated over and over again. New mistakes are added from time to time.
Concatenated MNIST (CMNIST) Dataset
MNIST-like datasets are still used for a lot of fundamental research, especially with respect to shallow models that may end up running on tiny micro-controllers. For anything outside such experiments the dataset should not be taken too seriously ;). CMNIST turns 784 pixels into something truly ridiculous and challenging for such models by concatenating various permutations of MNIST-like datasets into well-specified, new, large datasets.
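To make the concatenation idea a bit more concrete, here is a minimal sketch of what "concatenating permutations" could look like. It is not the actual CMNIST construction: the function name `concat_permutations`, the pairing of samples by index, and the toy data are all assumptions for illustration, and labels are left out entirely.

```python
from itertools import permutations


def concat_permutations(datasets, k):
    """Yield (names, rows) for every ordered choice of k distinct datasets.

    datasets: dict mapping a name to a list of flattened images.
    Assumption: sample i of each chosen dataset is joined end to end,
    so k datasets of 784 pixels would give rows of k * 784 pixels.
    """
    for names in permutations(sorted(datasets), k):
        columns = [datasets[name] for name in names]
        # zip pairs up sample i across the chosen datasets;
        # sum(..., []) joins the per-dataset pixel lists into one row.
        rows = [sum(parts, []) for parts in zip(*columns)]
        yield names, rows


# Toy stand-ins: 2 samples of 4 "pixels" each instead of 784.
toy = {
    "a": [[0, 0, 0, 0], [1, 1, 1, 1]],
    "b": [[2, 2, 2, 2], [3, 3, 3, 3]],
    "c": [[4, 4, 4, 4], [5, 5, 5, 5]],
}
variants = dict(concat_permutations(toy, 2))
```

With three source datasets and `k = 2` this gives six ordered variants, each with rows twice as wide as the originals, which is how the feature count gets out of hand quickly.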
Revisiting ML/BruteForceML
There are quite a few blog posts that fall into this category. The idea behind this micro project is to deploy a simple brute-force approach following best practices, with set splits and pre-processing and some strict time limits, and compare it against the baseline from the publication a given dataset originates from. Relatively few AutoML tactics were deployed, but it often outperformed AutoML frameworks such as auto-sklearn, auto-keras, and TPOT by quite a margin. For practical deployments the results of this micro project have some nice implications. Most of the experiments were conducted in 2018/2019. As of December 2022 most of the code has been migrated to newer versions of some of the libraries used and to a proper structure instead of independent notebooks. Some time in 2023 I'm most likely going to release either a technical report on this or turn it into a proper peer-reviewed paper or so ;).
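The core of a time-limited brute-force search can be sketched in a few lines. This is only an illustration under assumptions, not the project's code: `brute_force_search`, `toy_evaluate`, and the parameter grid are hypothetical, and `toy_evaluate` stands in for "train on the fixed train split, score on the fixed validation split".

```python
import itertools
import time


def brute_force_search(grid, evaluate, time_limit_s):
    """Try every combination in `grid` until the time budget is used up.

    grid: dict mapping a parameter name to a list of candidate values.
    evaluate: callable taking a params dict, returning a score (higher is better).
    Returns (best_params, best_score, n_tried).
    """
    deadline = time.monotonic() + time_limit_s
    names = list(grid)
    best_params, best_score, n_tried = None, float("-inf"), 0
    for values in itertools.product(*(grid[n] for n in names)):
        # The budget is only checked between evaluations.
        if time.monotonic() >= deadline:
            break
        params = dict(zip(names, values))
        score = evaluate(params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried


# Hypothetical scoring function with a known optimum at depth=3, lr=0.1.
def toy_evaluate(params):
    return -(params["depth"] - 3) ** 2 - abs(params["lr"] - 0.1)


grid = {"depth": [1, 2, 3, 4], "lr": [0.01, 0.1, 1.0]}
best_params, best_score, n_tried = brute_force_search(
    grid, toy_evaluate, time_limit_s=5.0
)
```

Because the deadline is checked only before each evaluation, a single slow fit can still overrun the budget; a stricter limit would need per-fit timeouts on top of this.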
Software
- zenodo-dl
  - CLI to download Zenodo records
- various utils