Contents
- Terminology and Fundamentals of Outliers
- Detecting Outliers
- General Ideas on Outliers in Machine Learning and Data Science
- General Guidelines
Terminology and Fundamentals of Outliers
Dealing with outliers in machine learning and data science is a science of its own. Most academic papers, presentations and blog posts on this topic are very theoretical and have only limited applicability to complex (physical/real-world) datasets. The other extreme, used to get seemingly functioning models, is to delete everything that looks weird. Let us take a broader view on this.
If we are interested in outliers it is mainly for 2 reasons:
- We believe that they screw up our models and therefore want to know what to do with them. scikit-learn has a great example on the effect of different scalers on outliers.
- They are more interesting than the standard, well understood data (e.g. for fraud detection, engine failure, new insights into physical phenomena, network intrusion detection). This is more commonly known as “anomaly detection and analysis”.
In this blog post we will have a look at the first.
There is one more thing that we have to remember: Most algorithms assume that each datapoint in our dataset is the mean of a normally distributed measurement. This might be an additional error source.
Definitions
There are tons of descriptions of the relationship between machine learning and data science out there. For the sake of this blog post we define data science as something more descriptive, and machine learning as something where the output is the input to another machine and which is therefore more general and predictive/intelligent.
There is no clear definition of an outlier. Grubbs (1969) [1] is often used for defining outliers. He defines an outlier as “an outlying observation […] that appears to deviate markedly from other members of the sample in which it occurs”. Hence, it is more or less a common sense definition. We may say that outliers are very specific to each dataset. However, there are some methods to detect them using statistics and/or machine learning.
Types of Outliers
We can distinguish 4 types of outliers:
- The outlier is “real”, meaning that there are no measurement errors involved, and the value itself is “valid”, meaning that nothing went wrong during data transformation. In simple words: everything is correct, but the value in question is detected as an outlier.
- The outlier is caused by faulty measurements (e.g. faulty sensors).
- The outlier is caused by wrong number crunching, e.g. a mistake made by someone who typed numbers from handwritten measurement sheets into a computer.
- The outlier is caused by wrong conversion/data handling and affects only a subset of the dataset.
Detecting Outliers
If we want to do anything to or with outliers, then we have to detect them first.
This is a very shallow introduction to outlier detection to highlight the problem of univariate thinking with multivariate datasets.
Example Dataset
Let us build an example dataset to explore different methods of outlier detection.
We create our simple dataset with 2D coordinates and have a look at it:
// base dataset
const X = math.round(math.range(-6,9,0.03),3);
const noise = math.random(X.size(), -0.3, 0.4); // one noise value per entry of X
const y = math.round(math.add(X,noise),3);
// 3 sets of outliers
const outliers_set_1_X = math.round(math.range(-7,-5,1),3);
const outliers_set_1_y = math.round(math.abs(outliers_set_1_X),3);
const outliers_set_2_X = math.round(math.range(4.1,4.3,0.1),3);
const outliers_set_2_y = math.round(math.square(outliers_set_2_X),3);
const outliers_set_3_X = math.round(math.range(9.1,9.3,0.1),3);
const outliers_set_3_y = math.round(math.subtract(1,outliers_set_3_X),3);
// concatenate base dataset with sets of outliers
const X_c1c2c3 = math.concat(X, outliers_set_1_X, outliers_set_2_X, outliers_set_3_X);
const y_c1c2c3 = math.concat(y, outliers_set_1_y, outliers_set_2_y, outliers_set_3_y);
Sanity Checks
The simplest form of detecting outliers is to perform sanity checks, e.g. whether a value lies within the range and resolution of valid sensor outputs. Sanity checks can be used for detecting outlier types 2 - 4. This may lead to changing, deleting or transforming values.
From personal experience I can tell you that this is perhaps a bigger problem than many people think. I have seen it quite often that sensors were used that were not suitable in terms of their specs and calibration, and later everyone wondered why they faced faulty data or seemingly unexplainable outliers.
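A minimal sketch of such a check (the sensor range and the error-code value below are made-up examples, not taken from the dataset in this post):
// flag readings outside a sensor's specified output range
// (the valid range of -40 to 125 and the -999 error code are hypothetical spec values)
const SENSOR_MIN = -40;
const SENSOR_MAX = 125;
const readings = [21.3, 22.1, -999.0, 23.4, 310.7];
const suspicious = readings.filter(v => v < SENSOR_MIN || v > SENSOR_MAX);
// suspicious: [-999, 310.7] -> candidates for outlier types 2 - 4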
Visual Approach
The visual approach is useful for low-dimensional and not too big datasets. Quite frankly it is almost unusable for data consisting of more than 4 - 5 dimensions because we have to check everything manually and this is not feasible for higher dimensional data without techniques such as PCA (Principal Component Analysis).
Statistical Approaches
The standard statistical approaches are often called extreme value statistics. Whether we call them outliers or extreme values does not matter here. We can find many discussions on whether an outlier is an extreme value or the other way around [2]; for the purpose of this blog post it makes no difference.
The Boxplot Approach
The simplest approach is a boxplot applied to each feature (dimension). The box itself ranges from Q1 (25th percentile) to Q3 (75th percentile) and has the median marked inside as a separating line. The IQR is the interquartile range, defined as Q3 - Q1. In this case we are using the plotly.js default boxplot parameters. Hence, the whiskers (the extensions outside the box) are bounded by the so-called lower and upper fences at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR; each whisker ends at the most extreme data point of the dataset that still falls within its fence. If we apply this kind of statistical approach to detect outliers, then all data points outside the fences are considered outliers.
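The same fences can be computed directly in code. A minimal sketch using math.js, reusing y_c1c2c3 from the example dataset above:
// boxplot (IQR) rule on one feature: flag everything outside the fences
const yVals = y_c1c2c3.toArray();
const q1 = math.quantileSeq(yVals, 0.25);
const q3 = math.quantileSeq(yVals, 0.75);
const iqr = q3 - q1;
const lowerFence = q1 - 1.5 * iqr;
const upperFence = q3 + 1.5 * iqr;
const yOutliers = yVals.filter(v => v < lowerFence || v > upperFence);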
However, this approach does not work well for multi-dimensional data since we can only analyse one feature at a time and therefore cannot identify higher dimensional outliers (e.g. clusters in 150 D space).
Let us treat our dataset as a composite of two independent univariate datasets first and see what is going to happen. This is a boxplot analysis of the y values of our example dataset. The red dots are detected as outliers.
Let us have a look at the X values as well:
Treating multivariate (multidimensional) data feature by feature is a common mistake, since most people are not able to think beyond 2 - 3 dimensions. With our twice-univariate approach we ended up detecting only a subset of the outliers. This kind of outlier detection is more common in (low quality) data science than in machine learning, since there seems to be a greater desire to understand everything step by step and therefore to use a (pure) whitebox instead of a graybox approach.
If we apply this approach using 2D histograms to our multivariate example dataset, then we can see that all three clusters of outliers are detected:
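A minimal sketch of this binning idea in plain JavaScript/math.js (the 20 x 20 grid and the count threshold of 3 are arbitrary illustration values):
// bin the (X, y) pairs into a coarse 2D grid and flag points in sparsely populated bins
const xs = X_c1c2c3.toArray();
const ys = y_c1c2c3.toArray();
const binCount = 20;
const xMin = math.min(xs), xMax = math.max(xs);
const yMin = math.min(ys), yMax = math.max(ys);
const binOf = (v, lo, hi) => Math.min(binCount - 1, Math.floor(((v - lo) / (hi - lo)) * binCount));
const key = i => binOf(xs[i], xMin, xMax) + ',' + binOf(ys[i], yMin, yMax);
const counts = {};
xs.forEach((x, i) => { counts[key(i)] = (counts[key(i)] || 0) + 1; });
const sparseIdx = xs.map((x, i) => i).filter(i => counts[key(i)] <= 3); // indices of suspected outliers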
Common Outlier Detection Algorithms
There are a few common tests to detect outliers:
- z-scores
- DBSCAN
- Local Outlier Factor (LOF)
- k-Nearest Neighbor Clustering [3]
- k-Means Clustering
- Principal Component Analysis [4]
- Isolation Forest
- and many more classification algorithms …
scikit-learn has a nice example of outlier detection as well.
All these methods have their own advantages and disadvantages.
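As a minimal example for the first item on that list, a univariate z-score check on the y values could look like this (the |z| > 3 threshold is just a common rule of thumb):
// z-scores: how many standard deviations each value lies away from the mean
const vals = y_c1c2c3.toArray();
const mu = math.mean(vals);
const sigma = math.std(vals);
const zOutliers = vals.filter(v => math.abs((v - mu) / sigma) > 3);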
General Ideas on Outliers in Machine Learning and Data Science
So far, we only detected outliers. But what should we do about them?
Physical Meaning
First, we have to understand what we have detected. Is there any physical meaning behind it? Do our outliers represent phenomena such as engine failure, structural failure, (financial) fraud etc.? Depending on what we want to model, it may be more useful to leave them in.
Another useful way to think about physical meaning is to ask what the physical meaning of all the points in between the “normal” points and the outliers is. Can they occur in terms of a continuum approach, and how likely are they to occur? The latter is important because such values may occur often in real-life situations but seldom in lab tests. In that case we would be dealing with highly imbalanced datasets - but that is a completely different story ;)
Impact on Algorithms
To understand the impact of outliers on algorithms, we have to try them out. In this example we will have a look at the effect of the outliers on linear regression, since this is an algorithm that everyone understands intuitively.
We can see that using different sets of outliers leads to models with higher errors when scored against the full dataset. If we score the models against the base dataset only, the error is much smaller and we could say that the outliers have only minor effects on model performance. Therefore, we should remember that choosing the right metric for regression can be challenging.
NB!: The linear regression on these datasets is purely descriptive.
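For reference, a simple least-squares fit and the metrics used in the table below can be sketched as follows (this is an illustration; the fits and plots in this post may have been produced with different tooling):
// ordinary least squares fit y = a * x + b and the usual regression metrics
const fitOLS = (xs, ys) => {
  const mx = math.mean(xs), my = math.mean(ys);
  const a = math.sum(xs.map((x, i) => (x - mx) * (ys[i] - my))) /
            math.sum(xs.map(x => (x - mx) ** 2));
  return { a, b: my - a * mx };
};
const metrics = (xs, ys, { a, b }) => {
  const res = ys.map((v, i) => v - (a * xs[i] + b));
  const mse = math.mean(res.map(r => r * r));
  const ssTot = math.sum(ys.map(v => (v - math.mean(ys)) ** 2));
  return { r2: 1 - math.sum(res.map(r => r * r)) / ssTot, mse, rmse: math.sqrt(mse) };
};
const model = fitOLS(X_c1c2c3.toArray(), y_c1c2c3.toArray());
const scoreFull = metrics(X_c1c2c3.toArray(), y_c1c2c3.toArray(), model); // full dataset
const scoreBase = metrics(X.toArray(), y.toArray(), model);               // base data only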
| Base Data | Outliers 1 | Outliers 2 | Outliers 3 | $$R^{2}$$ | $$R^{2}$$ (base data only) | MSE | MSE (base data only) | RMSE | RMSE (base data only) |
|---|---|---|---|---|---|---|---|---|---|
| x | - | - | - | | | | | | |
| x | x | - | - | | | | | | |
| x | - | x | - | | | | | | |
| x | - | - | x | | | | | | |
| x | x | x | - | | | | | | |
| x | x | - | x | | | | | | |
| x | - | x | x | | | | | | |
| x | x | x | x | | | | | | |
The plot below is interactive. Click on the legend to activate or deactivate predictions.
If we end up with a case like the one above, where the detected outliers have close to no effect on model performance, then we have to decide whether it is worth spending time (and therefore money) on cleaning the dataset. We can see a typical shortcoming of textbooks here. Textbook examples are typically “doable”, meaning they are relatively simple and do not require much more than 1 - 2 h of manual work or computation. In times of big datasets with plenty of features we will find some outliers in there, and they might even be of types 2 - 4. Finding the errors in the dataset may come at the cost of a large amount of manual work if the outliers are very specific and cannot be identified by computers easily (e.g. falsely labeled images, or labeled data where even experts do not know how to label them correctly), or at the cost of high computational effort.
It is a different story if our outliers have significant impact on model performance:
This is a good example to see how good or how bad some metrics for regression analysis are ;)
| $$R^{2}$$ | $$R^{2}$$ (base data only) | MSE | MSE (base data only) | RMSE | RMSE (base data only) |
|---|---|---|---|---|---|
In such a case we have to check whether it is feasible to detect and correct the errors (if there is no physical meaning behind them), delete them, or simply choose less susceptible algorithms and loss functions.
Using Less Susceptible Algorithms and Loss Functions
Before we do anything to our data we should aim for algorithms and loss functions that are less susceptible to outliers unless we find errors in the dataset that we can and should correct first.
Deep neural networks are often less susceptible to outliers than other algorithms because they tend to fit towards the data that occurs more frequently. Due to the media hype around deep learning, the social acceptance of such “graybox” algorithms is much higher than it was a few years ago.
We can control machine learning algorithms by controlling their training. Depending on which loss function or metric we choose, we can influence the outcome of our models. It might be useful to train a variety of different algorithms with different metrics and assess their performance at a later stage.
The Minkowski error is sometimes claimed to be a good loss function for reducing the influence of outliers [5].
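A minimal sketch of the idea (the residuals and the exponent of 1.5 are just illustrative choices; the point is that exponents below 2 weight large residuals less strongly than the squared error does):
// squared error vs. Minkowski error on the same residuals
const residuals = [0.1, -0.2, 0.15, 8.0]; // one large residual caused by an outlier
const squaredError = math.mean(residuals.map(r => math.abs(r) ** 2));     // dominated by the outlier
const minkowskiError = math.mean(residuals.map(r => math.abs(r) ** 1.5)); // outlier contributes less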
Data Transformation
Data transformation is useful if we deal with highly skewed datasets. However, this is again a very one-dimensional view on things and we may face different challenges with high-dimensional data.
Perhaps we can just apply some form of data transformation, e.g. a log transform. This is also useful for certain physical phenomena that cannot have negative values: by modelling the log-transformed values we easily prevent negative predictions.
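A minimal sketch (the values are made up; math.log is the natural logarithm):
// log transform of a strictly positive, right-skewed feature
const skewed = [0.5, 1.2, 0.9, 1.1, 250.0];
const logTransformed = skewed.map(v => math.log(v));
// model in log space, then transform predictions back with math.exp to keep them positive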
Changing Values
If our sanity checks detect abnormal values, then we should change them if and only if we know what went wrong.
We have to be aware of pseudo domain knowledge. I have experienced it often that people with seemingly a lot of domain knowledge (20+ years of experience) change values to something that they believe is correct but can neither prove it nor provide any real reasoning. As a consequence, the models failed for these values in both cases (unchanged and changed).
Deleting Values
In general, we should only delete data if it is wrong, or if there is no underlying causal relation with our output and it simply messes up our model.
Another approach that somewhat fits under “deleting values” is to use only a subset of the data, build different models for different subsets, and ensemble them later. In this case we do not cap our data but preserve it and build our general model using ensemble methods.
Adding Values
Adding values is something we observe mainly in geospatial analyses. In geospatial problems we are often dealing with a subset of a feature that we assume forms a spatial continuum, and we want to predict the continuum from this subset. Adding values is often done because the available data is sometimes too sparse for algorithms to behave well, e.g. it causes numerical errors.
I would recommend this if and only if the added values are backed by numerical simulations. Using “personal experience” as added domain knowledge is bad advice. If no numerics are available, it is possible to add other variables (continuous data) that correlate with and have a causal relation to what we want to predict.
Adding numerics-backed values can be useful for all kinds of analyses of things in the physical world as well. Think of materials science, robotics or chemistry. Quite often we will get data from experiments or from sensor readings for a subset of certain environmental conditions. In such cases extensive numerical simulations can provide the basis for much simpler statistical models. Depending on our data, we might initially guess that the behavior is more or less linear, but with additional data from numerics we may end up with something exponential or wavy. If we want to use such a model for controlling actuators to provide some form of intelligent behavior of our machine, then this combined numerics approach might be perfect for us.
Software for Automated Data Cleaning
Again from personal experience: many manual data cleaning attempts make the data worse instead of improving it.
If we are totally uncertain how to clean our data we may utilize automated tools such as datacleaner and dplyr. Another idea is to build many models, compare them and perhaps use them as an ensemble.
General Guidelines
Data Science vs. Machine Learning Issue
In machine learning there is a much higher acceptance of “dark grey box” algorithms where we have only a very limited understanding of how and why they yield the results they do. Whiteboxing is not feasible with increasing model size and complexity.
| | Machine Learning | Data Science |
|---|---|---|
| Unreasonable/wrong value in the dataset due to manual number crunching or a faulty input program | Change the value if we know how to, delete it otherwise | |
| Outliers represent physically correct values | Embrace the outliers and build models that can deal with them | Delete or change values and build models; keep a record of what was changed; perhaps use ensemble methods to combine different models |
| Outliers are seldom but do occur in the physical world (e.g. perceived by a self-driving car) | Ensemble models | Do not do data science ;) |
| Outliers impact algorithms | Train different algorithms with different loss functions and compare their performance | |
| Dataset is too sparse/uncertain | Try to enhance the dataset with numerical simulations (if possible) | |
| Do not know what to do? | Use data cleaning algorithms, e.g. those used for automated machine learning | |
This table is 100% debatable ;)
References
[1] Grubbs, F. (1969): Procedures for Detecting Outlying Observations in Samples. Technometrics 11 (1), 1-21. doi:10.1080/00401706.1969.10490657.
[2] Dixon, W. J. (1950): Analysis of Extreme Values. The Annals of Mathematical Statistics 21 (4), 488 - 506.
[3] Wang, X.; Wang, X. Li.; Ma, Y.; D. M. Wilkes (2015): A fast MST-inspired kNN-based outlier detection method. Information Systems 48, 89 - 112. doi:10.1016/j.is.2014.09.002
[4] Saha, P.; Roy, N.; Mukherjee, D.; A. K. Sarkar (2016): Application of Principal Component Analysis for Outlier Detection in Heterogeneous Traffic Data. Procedia Computer Science 83, 107 - 114. doi:10.1016/j.procs.2016.04.105.
[5] Quesada, A. (): 3 methods to deal with outliers. https://www.neuraldesigner.com/blog/3_methods_to_deal_with_outliers.