Introduction

A word to the wise up front. This list is debatable and extendable. In case you have some suggestions, please let me know. Further, I’ll focus on the big picture and on factors with a high potential to mess things up.

(Modern) software engineering is a mess and could be considered broken beyond repair. Unfortunately, many data science and machine learning projects follow in these chaotic footsteps. Some say that a lot of machine learning and data science projects get messed up because there are too many people involved who either do not really know what they are doing and/or are bound too tightly by project structures and constraints such as budgets. Let us have a holistic look at some of the common mistakes in machine learning and data science and how to avoid them.

Before you continue reading, have a look at this Data Fallacies to Avoid poster. Ideally, you print it and hang it next to your computer and on your office door - especially when working at a university ;). Michael Lones’ “How to avoid machine learning pitfalls: a guide for academic researchers” is also an interesting guide - especially for academics.

General Management Issues

Misuse of Data and Results

Another issue that seems to arise is “using” data to justify decisions that were made already - often with questionable methods. The Data Fallacies to Avoid poster mentioned earlier seems to cover most misuse-of-data-and-results mistakes, which in turn lead to more follow-up chaos. Misusing data seems to cause significantly higher mid- and long-term damage than (potential) short-term gains.

Misuse of Skills

People and their skills are often used incorrectly in any kind of software project. Things get worse in data science and machine learning projects because the required skill sets are even more in-depth. Given the value a well-utilized ML engineer can provide, it is simply not acceptable to use them for anything else, such as general programming tasks (not talking about a 2-person startup here). Foremost, everyone’s skill is misused by meetings. Avoid meetings at all costs! People do communicate with each other without forced meetings. It is hard to meet anyone who is not pissed off by too many (useless) meetings which lead to nothing but endless repetition without any progress. Too many meetings are a direct consequence of bad/unqualified management and bad project management and planning.

Back to data science/machine learning related personnel. Considering a large enough team common roles are:

  • (data) sysadmin/MLOps (there is some overlap with data engineering with respect to setting up infrastructure for acquisition, management, training and deployment)
  • data engineering (certainly a lot of overlap with MLOps nowadays; often used synonymously for people working on the whole infrastructure and sometimes even for data science)
  • data analytics (primarily aiming to present data to some human audience)
  • data science (nowadays often expected to do what ML engineers and researchers did 2-3 years ago)
  • machine learning (nowadays often expected to do all the MLOps as well, whereas a lot of the proper ML work seems to move to “data science roles”)

Boundaries between these roles are not necessarily sharp, and whether this role segregation makes sense or not is debatable. More importantly, the tasks of certain roles seem to shift constantly between them, so that it gets really tricky to infer what someone does based on their job title. As a rule of thumb, we can assume that programming skills (beyond a bit of Python - so C, C++, CUDA, etc.) and math skills increase going down the list above, whereas the amount of manual data manipulation and acquisition drops significantly. As a consequence, it seems like data scientists and especially machine learning engineers (less so the researchers) are used for an awful lot of software engineering, whereas people doing mainly data engineering and analytics are pushed into roles way above their skills. Yes, everyone can make colorful graphs, but it requires a certain skill set to build truly functional systems. The result of really experienced machine learning engineers, especially those with a deep knowledge of AutoML, doing data engineering is a lot better than the result of data engineers or data scientists doing some “copy&paste ML”. In a broader picture, truly great ML makes manual work on a large part of the “earlier steps” obsolete. There is nothing wrong with bringing people to another level and gaining more understanding to improve more fundamental tasks, but wasting “higher skilled” staff on unrelated tasks (a lot of general programming) is not just demotivating for those involved - a lot of business potential is wasted as well.

Oh, did I mention to stop wasting people’s time in pointless meetings? ;)

Project Management and Planning Issues

Understanding the Project/Project Purpose

Nowadays there seems to be a more fundamental issue in most projects - no matter if commercial or academic, and no matter if it is a data project or not. More than ever there is a lack of project goals. Often the problem that is supposed to be solved is not defined properly, and consequently the “solutions” developed might not solve it or might even make it worse. This sounds like a mistake that should not be possible, but it seems to be one of the most fundamental issues across almost every industry. However, clean problem identification and design goal definitions do not mean that all the steps to solve these problems or reach these goals should be defined before the project starts. A clear understanding of what should be achieved also makes it a lot easier to estimate whether a project is a solution in search of a problem and therefore could be killed much earlier. To be clear: this lack of understanding covers not only product development but also sales strategies etc.

However, it is possible and from time to time very useful to do a complete exploratory (research) project to find new insights without pre-defining what we are looking for.

Progress Indicators and KPIs

Just as not every correlation is a causation, not every progress indicator indicates real progress. There seems to be a tendency, not just in data-related projects, that people make up progress indicators and KPIs to give the illusion of progress without really doing much and blabbering instead. If real progress is desired, it is highly recommended to ensure that progress indicators suit the problem and the project.

Choosing the Right Hardware

Choosing the right hardware/infrastructure requires understanding a project’s purpose. Frequently, it boils down to the following options for deployment:

  • cloud
  • on-premise
  • or a hybrid version

In case of tight hardware constraints (edge devices) and other constraints such as latency limits, it is even more important to develop for the final product version that will be sold/deployed. The training side of a machine learning project can be handled more flexibly.

The choice of hardware/infrastructure should fit a project and its economics. However, it is important to remember that migrating from on-premise to a cloud infrastructure is much easier than the other way around - especially when the cloud side is built entirely as “Infrastructure as Code”.

Pre-project/planning stage

A common mistake is to exclude the machine learning and data science teams from initial project planning and product development. If we are lucky, we are involved in the planning stage of a project and do not join at a later stage to “do the data stuff”. In that case, we can have an enormous impact on structuring the project in a suitable way. This, however, is quite seldom the case. Often it would be helpful to treat the planning stage as a separate project because it is so important.

Data Science/Machine Learning Issues

Machine Learning vs. Data Science

First, we should point out some differences between machine learning and data science. We may consider data science as good old statistics. We can define machine learning as a subset of Artificial Intelligence. This alone adds some controversy because AI has really bad public relations ;). However, the borders between the two are fuzzy. We should think of it as a continuous space in which we can move from one to the other. The situation is similar with respect to “data engineering” and “analytics”. In practice it is often more about “politics” and “people being afraid of not being important” than about building something integrated that works. And yes, if done correctly and if we think from a “final outcome perspective”, machine learning makes these other “disciplines” obsolete. That, more or less, is the essence of AutoML.

If we move away from tools and algorithms, we may end up with the following table focusing on what machine learning engineers and data scientists mostly do. Again, this is kind of fuzzy, and it seems like job titles are used more synonymously nowadays.

Table of overgeneralized differences between machine learning and data science

Misuse of skills
  • Machine learning: ML researchers (the ones capable of doing more fundamental/long-term work) are often misused for short-term ML engineering/deployment work. ML engineers are often misused as software developers for completely different things or as data scientists. ML engineers waste too much of their valuable time in meetings and useless discussions.
  • Data science: Data scientists are often misused for manual data crunching and database administration. Data scientists waste too much of their valuable time in meetings and useless discussions.

Pre project
  • Machine learning: May be involved in project planning; may join at a later stage.
  • Data science: Seldom involved in the basic planning of a project; joins a project for data analysis.

Integration
  • Machine learning: Focus on a full pipeline from data acquisition to output.
  • Data science: The project is most likely disconnected.

Input
  • Machine learning: Often given; hopefully obtained as part of a suitable machine learning pipeline.
  • Data science: Often given; manual data collection.

Model
  • Machine learning: The model has a predictive character; supervised, unsupervised, reinforcement learning; feedback loops/self-learning models; the model should be robust enough to handle completely unseen events without causing major damage.
  • Data science: Often purely descriptive but misused for predictions; more “traditional” methods/algorithms and classical statistical testing; the model update process is more manual.

Output
  • Machine learning: Ideally, the output is the input of another machine and receives feedback from it. At the end of the day it is about building “software 2.0”.
  • Data science: The output has to be converted into some kind of presentation.

Result presentation
  • Machine learning: For debugging; not necessarily as input to a human decision maker.
  • Data science: Colorful presentations to decision makers.


Simplified observation:

Data analytics

  • Plain old statistics with results that describe datasets with statistical methods and derive relationships that are misused as models for prediction
  • Often resulting in “using questionable data to back bad decisions made in the past”

Data science

  • Building predictive models that are the input for human decision makers

Machine learning

  • Building intelligent machines which work on their own. Predictive models serve as the input for another machine and not for people.


I really want to emphasize the fact that both machine learning and data science are not about the ability to use existing tools but about problem solving. “Data science” (head-hunting and recruiting, and therefore the people doing data science) is so much worse regarding this problem. It is not about memorizing SQL cheat sheets or the knowledge of some crappy programming language like Java and its derivatives. It is all about using appropriate tools to solve problems and hopefully doing so by applying the scientific method.

Holistic pipeline vs. patchwork

This is a key difference between machine learning and data science. However, depending on the time we join a project, we can influence this. In general, we can assume that machine learning projects are more likely to consist of a holistic pipeline from data acquisition to final model output. In data science projects it is more likely that we are facing a patchwork project that is somewhat disconnected from the rest of product development. Since we are still in a pre-project phase, we should demand to be integrated into a holistic pipeline.

However, this is not required if we are doing an exploratory project. Even in that case, it is still useful and time-saving in the long run to build a consistent pipeline for our part of the project.

In both cases, we should be aware that such a pipeline should be flexible enough to be adapted to intermediate results of the project.
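
To make this a bit more concrete, here is a minimal sketch of what a consistent pipeline for one project part could look like using scikit-learn - the step names and the choice of scaler/model are just placeholders, not a recommendation:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# One object couples preprocessing and model, so every stage is trained,
# versioned and deployed together instead of as disconnected patches.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# pipeline.fit(X_train, y_train)
# pipeline.predict(X_new)
```

The point is less the specific estimators and more that preprocessing and model form one unit that can be refitted as a whole when intermediate results change.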

Previous work and domain knowledge

We can sort out a lot of errors in the planning stage if we do proper research on previous work and include a lot of domain knowledge. It helps a lot with understanding why and how data was collected. If done incorrectly, we must sort these errors out at a later stage, when we are facing the correlation and causation analysis.

One of the main errors in the project planning phase is insufficient research on previous work and domain knowledge. This kind of research is not finished within a day. It is more likely to cost us a week or two of work up front and a great amount of work during a project. Depending on the project size, this might increase drastically. However, we must be careful with advice from people who have been in our field for too long. It is highly likely that they have developed blindness towards new ideas or will tell us that they tried to use machine learning algorithms 20+ years ago and failed miserably.

Personally, I experienced the latter often in the fields of numerical simulations and machine learning/statistical modeling. It is often a lot of work to convince people that some things have changed and are possible now.

Input data

We minimized potential mistakes in the planning phase. Now, we must avoid mistakes in dealing with input data. As a rule of thumb, we should:

  • understand where our input data comes from
  • understand our input data
  • minimize our manual data cleaning if we are doing machine learning

Data mining and datasets

Let us assume that we must implement some data mining strategy. If we did our research on previous work and domain knowledge well enough, then we can decide better what should and should not be logged. This is important to make sure that we mine the really important data. If possible, we can acquire additional data that may lead to different insights and better performing models. However, we must be careful to avoid running into the correlation and causation problem.

If we deploy a fully automatic data mining system for a specific task, then we should not forget to implement a warning system that triggers an alert if errors occur - especially if we mine data from websites or any other source, such as documents, that we cannot control. Here we may run into problems of changing or malfunctioning character encodings. If our input comes from sensors, then it might be useful to implement sanity checks that raise alerts.
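
A minimal sketch of such a sanity check is shown below; the function name, the plausibility bounds and the logging setup are purely illustrative - in a real system the warning would feed into whatever monitoring/alerting infrastructure is in place:

```python
import logging

logger = logging.getLogger("data_mining")

# Hypothetical plausibility bounds for a single sensor channel.
PLAUSIBLE_RANGE = (-40.0, 85.0)

def check_reading(value):
    """Return True if the reading passes the sanity check, otherwise raise an alert."""
    if value is None or not (PLAUSIBLE_RANGE[0] <= value <= PLAUSIBLE_RANGE[1]):
        logger.warning("Suspicious sensor reading: %r", value)
        return False
    return True
```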

If we simply get some datasets, then we have no influence on their acquisition. In both cases we have to check the data carefully before further analysis. This is often neglected and causes a lot of problems when people are confronted with messy real-world data instead of carefully curated data.

Undocumented Datasets

We may run into undocumented datasets. This is one of the biggest challenges we face. Unlike in a Kaggle competition or in some university/online course, we may receive completely undocumented or even misdocumented datasets. If we could choose, undocumented is still better than misdocumented data. The latter can cause much more damage. Therefore, we must double-check what we get.

Quality checks and messy data

Let us assume that we end up with correctly documented datasets. This is less of a problem if we get our data from a well-designed data mining pipeline with extremely strict schemas/data contracts. However, if we just receive the data and have little to no impact on data acquisition, then we must perform extensive quality checks to deal with potentially messy data. This is particularly true when the input data originates from people entering data manually into a database.
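
As a rough illustration of such quality checks with pandas (the helper names and the example column are made up), a simple per-column report plus a check for spelling variants already catches a surprising amount of manual-entry mess:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary of common data-entry problems."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique_values": df.nunique(),
    })

def spelling_variants(series: pd.Series) -> pd.Series:
    """Count how many raw spellings map to the same normalized category."""
    normalized = series.astype(str).str.strip().str.lower()
    return series.groupby(normalized).nunique()

# Usage (hypothetical dataframe/column):
# quality_report(df)
# spelling_variants(df["country"])[lambda s: s > 1]   # e.g. "Germany", "germany ", "GERMANY"
```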

If we deal with “machine input”, then it makes more sense to focus less on standard data hygiene such as handling outliers, as this is taken care of at an earlier step and enforced using strict pipelines. Other things, like data consistency (aligned image color spaces, text encodings, etc.), are more important in such cases. This is often neglected.
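
A small, hedged example of such a consistency check for image data, assuming Pillow is available and the images live as PNG files in a single folder:

```python
from pathlib import Path
from PIL import Image

def inconsistent_images(folder, expected_mode="RGB"):
    """Return image files whose color mode deviates from the expected one."""
    bad = []
    for path in sorted(Path(folder).glob("*.png")):
        with Image.open(path) as img:
            if img.mode != expected_mode:
                bad.append(path)
    return bad

# inconsistent_images("data/images")  # e.g. grayscale or RGBA files that slipped in
```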

There is a second point to this: if we want to train self-learning systems, then we should avoid manual data cleaning at all costs, because the model will produce unusable/dangerous results if confronted with real-world data. Manual cleaning is more common in academia, to be able to publish “working results”, as well as in situations where classical statisticians work on machine learning tasks in industry. Therefore, we should always remember that our final goal is to build systems that deliver human performance or surpass it, and therefore dealing with messy data should be handled by our model and not be hand-engineered.

Handling outliers

This is a science of its own. And yes, science involves experiments ;). Again: if we are building really intelligent systems, then we should avoid cleaning up data. If our work is more on the classic data science side of life, then we should ask ourselves whether the outliers come from faulty sensors and therefore are wrong, or whether they are extreme values that make sense on the basis of physics/cause and effect.
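
One possible way to flag candidates - and only flag them, not delete them - is the classic IQR/Tukey fence rule; the function below is a sketch, and the threshold k is a tunable assumption:

```python
import numpy as np

def flag_outliers_iqr(values, k=1.5):
    """Boolean mask of points outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    values = np.asarray(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Flagging is only step one: whether a flagged point is a faulty sensor reading
# to discard or a rare but physically plausible event to keep is a domain
# decision, not a statistical one.
```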

Imbalanced datasets

One of the key problems of many algorithms is that they assume more or less evenly distributed data. Otherwise they are heavily biased. A simple example are self-driving cars. If our training input consists mainly of video streams of cars driving straight ahead, then the final model will be biased towards driving straight and will have trouble driving curves correctly. Balancing out datasets in a useful way is often trickier than just doing some data augmentation.
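
One simple alternative to augmentation/resampling is to reweight the classes during training; the sketch below uses scikit-learn’s “balanced” heuristic on a toy dataset with made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# A toy 95/5 class imbalance with random placeholder features.
y = np.array([0] * 950 + [1] * 50)
X = np.random.randn(len(y), 3)

# Option 1: reweight the loss instead of resampling the data.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: inspect the weights the "balanced" heuristic would assign.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # roughly {0: 0.53, 1: 10.0}
```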

Normalization

Many machine learning algorithms are designed in a way that requires the dataset to be normalized, e.g. to a range of [-1, 1]. Further, many of them assume that we provide data that follows a normal distribution. A common mistake is to re-fit the normalization on production data instead of reusing the normalization parameters obtained on the training data.
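
A minimal sketch of doing this correctly with scikit-learn (the file name and the scaler choice are arbitrary; joblib is used here just as one way to persist the fitted scaler alongside the model):

```python
import numpy as np
import joblib
from sklearn.preprocessing import MinMaxScaler

X_train = np.random.rand(100, 4) * 10        # placeholder training data

# Fit the scaler on the training data only and persist it with the model.
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
joblib.dump(scaler, "scaler.joblib")

# In production: reuse the stored training statistics, never re-fit on live data.
scaler = joblib.load("scaler.joblib")
X_live = np.random.rand(5, 4) * 10
X_live_scaled = scaler.transform(X_live)     # transform only, no fit
```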

Model building

The next stage where we encounter mistakes is the model building stage. Some important points are that we:

  • have proper understanding of what we are doing
  • should embrace a minimization of manual interference
  • remember the correlation and causation problem
  • have a proper feedback loop for failure analysis

Initial considerations and feature engineering

The final purpose of the model is often neglected. When building models, it is often very useful to step back and remember what we want to achieve with the model. Depending on our task, we may want to deploy several (combined) models instead of a single end-to-end model. This can reduce computational complexity dramatically.

Many people are taught that clever feature engineering is one of the key skills of machine learning engineers. This is correct when it comes to hand-engineering the obvious features. However, we may also enhance our automatic tool set to enable automated feature engineering and see what the algorithms come up with.
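
As a very small taste of this idea (dedicated AutoML/feature-generation tools go much further), scikit-learn can already generate polynomial and interaction features automatically:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# Generate interaction and polynomial terms automatically, then let feature
# selection or the model itself decide what is actually useful.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_aug = poly.fit_transform(X)
print(poly.get_feature_names_out())  # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
```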

Another point here is that we may want to remove datapoints or features that we believe will not contribute to the result we want and therefore just add dimensionality. If we want to do sentiment analysis, then we are interested in subjective/emotional information and less in “hard facts/information extraction”. Therefore, we may consider removing stop words. Again, what we remove can be somewhat experimental, hence we should simply try different options on smaller subsets before committing to a final big clean-up and getting into trouble later.
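
A hedged example of such an experiment for the stop word case, using scikit-learn’s built-in English stop word list on a few made-up review snippets - note that standard lists drop sentiment-bearing words like “not”, which is exactly why the effect should be measured before a big clean-up:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The battery is absolutely terrible",
    "I really love the new camera",
    "It is not bad at all",
]

# Compare vocabularies with and without the built-in English stop word list,
# then compare downstream model quality on a small subset before deciding.
vec_raw = CountVectorizer().fit(docs)
vec_nostop = CountVectorizer(stop_words="english").fit(docs)

print(sorted(vec_raw.vocabulary_))
print(sorted(vec_nostop.vocabulary_))   # "not" is gone, flipping the last review's meaning
```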

Lack of understanding of algorithms

It seems like we can identify two big groups in both machine learning and data science. One group simply applies algorithms without knowing what they are actually doing (cause and effect); they develop some gut feeling for what could work or not, but often lack a deep understanding of what the algorithm actually does. The other group has a more academic approach and lacks knowledge of applicability. Some people in this group only know the mathematical formulation of an algorithm, usually written in a very pure style, and have no understanding whatsoever of the differences to the actual algorithmic implementation.

I personally believe the situation in data science is much worse than in machine learning. And yes: this is independent of the certification or degree a person obtained.

An interesting project to tackle this problem is the book project by Andrew Ng called Machine Learning Yearning. A good portion of his deeplearning.ai courses is dedicated to this as well. A good understanding of the algorithms allows us to exclude algorithms that will have severe performance issues.

Choosing the right metric

I already wrote two blog posts on Metrics for Regression and Classification.

First of all, we should avoid blindly using predefined metrics. In machine learning this is less of a problem because we want the best-functioning model. In data science this is a bit trickier. We may face predefined (required) metrics that have only limited suitability.

Furthermore, we must remember that if we are doing clustering with imbalanced datasets and data augmentation is not helpful, then we may have to assess each category individually.
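
To illustrate the per-category assessment (here for a simple classification setting with made-up labels), overall accuracy can look fine while the minority class is barely recovered:

```python
from sklearn.metrics import classification_report, confusion_matrix

# 90/10 imbalance: a model that mostly predicts the majority class still
# reaches 90% accuracy, but the per-class report exposes the weak minority recall.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 88 + [1] * 2 + [0] * 8 + [1] * 2

print(classification_report(y_true, y_pred, digits=3))
print(confusion_matrix(y_true, y_pred))
```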

Too much and too little prototyping

We are facing a trade-off between building either too many or not enough prototypes. A suitable metric and a deep understanding of our algorithms helps us to tackle this problem.

Denial of automatic tools

Here we must distinguish between two problems:

First, the use of tools that are called “automatic machine learning” or “autofit functions”. These tools are often considered to be really bad, especially in academia. However, they will do much better than many people who do not have real knowledge of statistics and machine learning algorithms. Depending on our project and our team, this may lead to much better results than assigning people to the project who have close to no idea what they are actually doing.

Second, depending on the project size it might be very useful to have a look into test automation. This means we should deploy a framework in which we track all pipelines and models.
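
A dedicated experiment-tracking framework is usually the better choice, but even a few lines like the following sketch (all names and values are made up) capture the core idea of keeping every pipeline/model run traceable:

```python
import json
import time
from pathlib import Path

def log_run(params, metrics, run_dir="runs"):
    """Write one experiment record so every pipeline/model run stays traceable."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "metrics": metrics,
    }
    out_dir = Path(run_dir)
    out_dir.mkdir(exist_ok=True)
    out_file = out_dir / f"run_{int(time.time())}.json"
    out_file.write_text(json.dumps(record, indent=2))
    return out_file

# log_run({"model": "gradient_boosting", "max_depth": 6}, {"auc": 0.87})
```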

Correlation and causation

Correlation and causation is the biggest problem in machine learning and statistics. We must always double-check this! Is it possible that x causes y? Debugging a machine learning model or a statistical model is not that difficult - not even deep neural networks. However, we may encounter results that go against all researchers’ opinions. We should consider this as a potential new insight. We can find an example of this in Jeremy Howard’s TED talk.

No failure analysis

Another big mistake in model building is the lack of a proper failure analysis. Perhaps we figure out that machine learning algorithms are not (yet) suitable for this problem. Perhaps we chose the wrong algorithms. Perhaps we need different data. A proper failure analysis helps us to build better models, even if only in the next project.

Final model output

A common mistake that we should avoid is waiting too long before a model is used in production. Often it is finished and locked away for some time before it may be used in production.

Another common mistake is to ship models that have not been optimized and therefore are still (working) prototypes. We can achieve performance improvements and a reduction in model failures if we optimize them before shipping.

Data presentation

This section is very important and perhaps one of the most neglected, although everyone talks about it. Perhaps because it is so much easier to talk about it than to do it ;). Instead of a “one size fits all” approach, we should adapt our presentation to our target audience.

We can apply a few basic rules:

  • KISS (Keep It Simple Stupid)
  • adapting presentation to our target audience
  • simplicity through complexity
  • selecting suitable colors
  • choosing diagrams and categories carefully

Simplicity and complexity

Most simple problems with low complexity are already solved. Most complex problems, however, are presented in a complicated way.

A brief search with our web search engine of choice will yield a lot of advice on “how to present to decision makers”. Most of these tips probably result in extreme oversimplification that leads to wrong decisions. Instead, we should embrace complexity because it can lead to simplicity. Eric Berlow gave a great talk on how complexity leads to simplicity.

A simple figure with a lot of complex relations might be extremely understandable when viewed and explained inside a text document. However, the same figure is most likely completely overloaded for presentations and should be presented step-wise, in a way that is intuitively understandable without losing any information.

Diagrams

If possible, we should not only select suitable colors but suitable diagram types as well. (3D) pie charts and other 3D diagrams that merely represent 2D information should be avoided. Diagrams should be readable and understandable even if displayed/printed in grayscale. This means every element should have unique symbols in addition to unique colors.
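
A small matplotlib sketch of this rule - the data and labels are placeholders - distinguishing series by marker and line style so the figure still works in grayscale:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 20)

fig, ax = plt.subplots()
# Distinguish series by marker and line style, not by color alone, so the
# figure survives grayscale printing and photocopying.
ax.plot(x, np.sin(x), marker="o", linestyle="-", label="series A")
ax.plot(x, np.cos(x), marker="s", linestyle="--", label="series B")
ax.plot(x, 0.5 * np.sin(x), marker="^", linestyle=":", label="series C")
ax.legend()
fig.savefig("comparison.png", dpi=150)
```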

Color selection

Color selection is another point. We must make sure that everyone can separate the colors in our figures. A good tool for color selection (especially for maps) is ColorBrewer by Cynthia Brewer, Mark Harrower and The Pennsylvania State University. If possible, our color selection should be colorblind safe, print friendly and suitable for photocopying (yes, people still do this).

Categorization of continuous values problems

In general, this is a classical “how to lie with maps” or “how to lie with statistics” problem. Whenever we deal with continuous values that we want to categorize, we are facing the problem of selecting appropriate intervals for each category. We will hardly find a universal solution to this problem. This talk covers this problem for heat maps and hot spot analysis.
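
A quick way to see why interval selection matters is to compare equal-width and quantile binning on skewed (here synthetically generated) data with pandas:

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.lognormal(mean=0.0, sigma=1.0, size=1000))

# Equal-width bins split the value range into equal intervals; quantile bins
# split the data into equally populated groups. For skewed data the two
# schemes tell very different stories - which is exactly how honest numbers
# can be made to "lie" on a map or in a chart.
equal_width = pd.cut(values, bins=5)
quantile = pd.qcut(values, q=5)

print(equal_width.value_counts().sort_index())
print(quantile.value_counts().sort_index())
```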