Revisiting Machine Learning Datasets - (Boston) Housing prices

Today, we will have a look at the famous Boston Housing Prices dataset as part of my “Exploring Less Known Datasets for Machine Learning” series. Why housing prices? Aren’t these datasets among the best-known machine learning datasets? As with the “Titanic survivor” dataset, I want to look at certain parts of housing price prediction from a different perspective. Hence, this post is about thoughts on the topic rather than a modeling walkthrough.

The Boston Housing Prices dataset is probably the most famous of all housing price datasets. As a brief reminder, it consists of the following features (referring to sklearn.datasets.load_boston, which was removed in scikit-learn 1.2):

CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: median value of owner-occupied homes in $1000s (price), used as the target variable
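
B is the only feature in the list that is defined by a formula, so here is a minimal sketch of it. The function name b_feature is my own and not part of any library. The sketch only evaluates the formula as stated above and shows one consequence: two proportions that lie symmetrically around 0.63 end up with the same B value.

```python
import math


def b_feature(bk: float) -> float:
    """Evaluate B = 1000 * (Bk - 0.63)^2 as given in the feature description."""
    return 1000.0 * (bk - 0.63) ** 2


# Two different proportions, symmetric around 0.63, map to the same B value,
# so Bk cannot always be recovered from B alone.
low = b_feature(0.53)
high = b_feature(0.73)
same_value = math.isclose(low, high)
```

In other words, the transformation is not one-to-one, which is worth keeping in mind before interpreting B in a model.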

Time series

While some of the features reflect spatial information, the dataset lacks timestamps. A general problem with housing prices is that they are spatio-temporal in nature, meaning that predicting them is a spatial time series problem. “Time series” is a bit relative when it comes to housing prices. While most people will insist that no “data from the future” must be used, I would argue that the market behaves more like a quasi-static system with (in many cases) slow price changes over the period of a day or even a month. If you are a real bean counter and allow only exactly time-stamped series data, then you’ll run out of data fast and put yourself at a big disadvantage.
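
To make the difference between the strict and the quasi-static view concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the Listing type, the two function names, the 30-day tolerance and the toy data are all made up and do not come from the Boston dataset (which has no timestamps anyway).

```python
from datetime import date, timedelta
from typing import NamedTuple


class Listing(NamedTuple):
    """Hypothetical record: when a dwelling was listed and at what price."""
    listed_on: date
    price: float


def strict_history(listings, target_day):
    """Keep only listings strictly before the target day (no 'data from the future')."""
    return [item for item in listings if item.listed_on < target_day]


def quasi_static_window(listings, target_day, tolerance_days=30):
    """Also keep listings up to tolerance_days after the target day,
    assuming prices barely move within that window."""
    cutoff = target_day + timedelta(days=tolerance_days)
    return [item for item in listings if item.listed_on <= cutoff]


# Toy data, invented for this sketch.
listings = [
    Listing(date(2020, 1, 5), 300.0),
    Listing(date(2020, 1, 20), 310.0),
    Listing(date(2020, 2, 10), 305.0),
    Listing(date(2020, 6, 1), 350.0),
]
target_day = date(2020, 1, 25)

strict = strict_history(listings, target_day)
relaxed = quasi_static_window(listings, target_day)
```

With this toy data the strict view keeps two listings and the relaxed view keeps three. How large the tolerance can be before it turns into real leakage depends on how fast prices move in the market at hand.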

Further, there is a problem with the availability of data. Not every dwelling is offered on the same platform, or online at all (believe it or not!). We also usually don’t know the rental or purchase price of every property. Hence, we have to remind ourselves that we are dealing with partial data and that our models are built on top of it. This matters more for proper failure analysis than for initial model building and refinement.

Geospatial

More important than the temporal dimension is the geospatial dimension of price distributions. N.B.: depending on where you work, rules on redlining may apply. Pretty much all features of the Boston housing dataset would make much more sense if they were attached to coordinates. Everything that happens on this planet is geospatial! With that in mind, we can save ourselves a lot of work. However, I want to point out that geospatial does not only mean 2D coordinates (long/lat, x/y) but also 3D! Some examples are:

field of view (what can/can’t we see from each window)
fire safety per floor (lower floors might be better?)
air quality as a function of 2D and 3D position (e.g. particulate matter from railways (iron on iron) or highways)
main wind directions (may transport pollutants)
fluid dynamic properties around a building to estimate air quality for each dwelling (yes, there exists a world without aircon and hazmat filters ;))

Let’s get back to 2D geospatial features again. We may come across the following features (some of which are part of the Boston Housing Prices dataset - but less distinct and without coordinates):

Infrastructure:

road network (incl. type of road)
parking spaces
distance to public transport (incl. their importance within the public transport hierarchy)
distance to airports as a travel time function for different methods of transport
drinking water source (though all sources have to meet requirements and limits, there are differences between water that originates from surface water (rivers, lakes) and e.g. groundwater)
fail safety of electricity supply
infrastructure age and condition
distance to waterways
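
Several of the infrastructure features above are plain distances, so here is a minimal sketch of how such a feature could be derived once coordinates are available. It uses the haversine formula on a spherical Earth, which is an approximation. The function names and the coordinates are hypothetical, and a straight-line distance is of course not the travel time mentioned for airports.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumes a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_nearest_km(dwelling, points):
    """Distance from a dwelling to the closest point of interest (e.g. a stop)."""
    lat, lon = dwelling
    return min(haversine_km(lat, lon, p_lat, p_lon) for p_lat, p_lon in points)


# Hypothetical coordinates, invented for this sketch.
dwelling = (0.0, 0.0)
stops = [(1.0, 0.0), (0.0, 0.5)]
nearest_km = distance_to_nearest_km(dwelling, stops)
```

For anything beyond a first feature, a routing engine on the actual road or transit network would be closer to what a resident experiences than this straight-line distance.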

Socio-economic:

office vs. retail vs. domestic (housing) area
crime rates
accident hotspots
distance to recreational areas
income levels
tax rates
educational infrastructure (less important with internet access nowadays if home schooling is legal)

Finally, with all geospatial features we have to think about scale:

depending on what kind of information we want, we have to apply different scales
coarser scales mean that much of the data is aggregated (mean, median values)
aggregation means information loss!!
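
The effect of scale can be sketched with a toy aggregation. The function aggregate_to_grid, the square grid and the four data points are assumptions made up for illustration. The sketch bins point values into grid cells and keeps only the median per cell.

```python
from collections import defaultdict
from statistics import median


def aggregate_to_grid(points, cell_size):
    """Median value per square grid cell; points are (x, y, value) tuples."""
    cells = defaultdict(list)
    for x, y, value in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append(value)
    return {key: median(values) for key, values in cells.items()}


# Toy data: two cheap dwellings close together, two expensive ones further away.
points = [
    (0.2, 0.3, 100.0),
    (0.8, 0.4, 120.0),
    (1.5, 0.2, 300.0),
    (1.7, 1.6, 310.0),
]

fine = aggregate_to_grid(points, cell_size=1.0)
coarse = aggregate_to_grid(points, cell_size=2.0)
```

At a cell size of 1.0 the cheap and the expensive dwellings still end up in separate cells. At a cell size of 2.0 everything collapses into a single cell with a median of 210.0, a value that describes none of the four dwellings.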

Other stuff

Some other stuff that comes to my mind when thinking about estimating prices:

estimating the quality of life by using image classification (using real images of a dwelling not the advertising pictures)
distance to acoustic pollution sources (if hilly: acoustic wave propagation simulation may help here)
quality of building material
whether the apartments were visited before buying or not (a huge problem in many major European cities)