Once in a while, a new dataset is added to the UCI Machine Learning Repository. Let’s have a look at the latest addition: the Metro Interstate Traffic Volume dataset.

The dataset combines traffic data from the Minnesota DoT with weather data from OpenWeatherMap.

Hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN. Hourly weather features and holidays included for impacts on traffic volume.

https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume

Let’s load the dataset and have a look at it.

import time
sys_start = time.time()
import pandas as pd
import numpy as np
import seaborn as sns
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

# not very gentlemanly, but it keeps the notebook output clean ;)
import warnings
warnings.simplefilter('ignore')


from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from skgarden import MondrianForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, make_scorer, mean_absolute_error, median_absolute_error
import xgboost


input_data = pd.read_csv("./data/Metro_Interstate_Traffic_Volume.csv")
display(input_data.sample(10))
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
4457 None 268.730 0.00 0.0 1 Clear sky is clear 2013-03-22 14:00:00 5807
31822 None 279.220 0.00 0.0 90 Clouds overcast clouds 2017-03-17 13:00:00 5857
46859 None 293.550 0.00 0.0 90 Rain heavy intensity rain 2018-08-20 08:00:00 5349
31249 None 272.950 0.00 0.0 90 Clouds overcast clouds 2017-02-23 17:00:00 6257
12900 None 278.970 0.00 0.0 20 Clouds few clouds 2014-03-14 00:00:00 673
31078 None 269.979 0.00 0.0 0 Clear Sky is Clear 2017-02-17 05:00:00 2577
26327 None 292.150 10.16 0.0 90 Rain heavy intensity rain 2016-09-05 04:00:00 278
33307 None 283.260 0.00 0.0 75 Clouds broken clouds 2017-05-02 12:00:00 4693
32796 None 281.890 0.00 0.0 1 Clear sky is clear 2017-04-18 00:00:00 596
20656 None 255.510 0.00 0.0 90 Haze haze 2016-01-16 12:00:00 4601

We need to encode the categorical columns as integer codes and extract calendar features from the timestamp:

# categorical encoding: map each string category to an integer code
input_data['holiday'] = pd.Categorical(input_data['holiday']).codes
input_data['weather_main'] = pd.Categorical(input_data['weather_main']).codes
input_data['weather_description'] = pd.Categorical(input_data['weather_description']).codes

# split the 'YYYY-MM-DD HH:MM:SS' timestamp into calendar features
date_time = pd.to_datetime(input_data['date_time'])
input_data['year'] = date_time.dt.year
input_data['month'] = date_time.dt.month
input_data['day'] = date_time.dt.day
input_data['hour'] = date_time.dt.hour
input_data.drop(['date_time'], axis=1, inplace=True)

display(input_data.sample(10))
holiday temp rain_1h snow_1h clouds_all weather_main weather_description traffic_volume year month day hour
10428 7 262.96 0.0 0.0 64 8 16 508 2013 12 5 0
34836 7 288.89 0.0 0.0 40 1 24 2042 2017 6 25 22
35513 7 292.28 0.0 0.0 20 3 5 6446 2017 7 20 7
30541 7 272.65 0.0 0.0 1 0 27 6054 2017 1 27 17
42067 7 274.79 0.0 0.0 90 8 15 3586 2018 3 5 10
19578 7 276.49 0.0 0.0 90 2 3 538 2015 11 19 0
26579 7 288.13 0.0 0.0 1 0 27 4405 2016 9 13 11
19624 7 265.84 0.0 0.0 40 1 24 299 2015 11 22 4
5113 7 271.15 0.0 0.0 90 8 10 459 2013 4 14 5
13139 7 265.16 0.0 0.0 20 1 4 903 2014 3 23 23

It’s time to visualize the distribution of the features.
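A quick sketch of what that could look like (the original plots are not reproduced here; plain pandas/seaborn calls stand in):

# Histograms for every column; after the encoding step all columns are numeric.
input_data.hist(bins=50, figsize=(16, 12))
plt.tight_layout()
plt.show()

# Box plots of the target per hour, to check how traffic volume varies with the time of day.
plt.figure(figsize=(14, 5))
sns.boxplot(x='hour', y='traffic_volume', data=input_data)
plt.show()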

Let’s throw a bunch of algorithms at it and see what happens ;).
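A minimal sketch of that brute-force pipeline, using the models imported above (default hyperparameters stand in for the actual grid search; SVR is skipped here since it scales poorly to ~48k rows):

# Hold out a test set, fit each regressor, and compare R^2 / MAE on held-out data.
X = input_data.drop(['traffic_volume'], axis=1)
y = input_data['traffic_volume']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tree-based models don't need scaling, but the linear ones benefit from it.
scaler = MaxAbsScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    'LinearRegression': LinearRegression(),
    'BayesianRidge': BayesianRidge(),
    'DecisionTree': DecisionTreeRegressor(),
    'RandomForest': RandomForestRegressor(n_estimators=100),
    'MondrianForest': MondrianForestRegressor(n_estimators=100),
    'XGBoost': xgboost.XGBRegressor(),
}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    print(f"{name:16s} R2={r2_score(y_test, pred):.3f} "
          f"MAE={mean_absolute_error(y_test, pred):.1f}")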

So far we predicted traffic volume based on measurements/predictions of various environmental conditions. However, we could treat this as a time series as well.
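One way to do that (a sketch, not a full treatment: the hourly index has gaps and duplicates that a serious attempt would have to handle) is to add lagged copies of the target as features and keep the train/test split chronological:

# Lag features turn the problem into autoregression.
# Assumes the rows are in chronological order, as in the raw CSV.
ts_data = input_data.copy()
for lag in (1, 2, 3, 24):           # previous hours and the same hour yesterday
    ts_data[f'lag_{lag}'] = ts_data['traffic_volume'].shift(lag)
ts_data = ts_data.dropna()

# No shuffling for time series: train on the past, test on the future.
split = int(len(ts_data) * 0.8)
X_ts = ts_data.drop(['traffic_volume'], axis=1)
y_ts = ts_data['traffic_volume']
model = RandomForestRegressor(n_estimators=100)
model.fit(X_ts[:split], y_ts[:split])
print("R2:", r2_score(y_ts[split:], model.predict(X_ts[split:])))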

These results are not satisfactory, but I have no intention of digging deeper: I basically wanted a quick look at the dataset and ended up throwing my brute-force pipeline at it ;).