We all know the Titanic survival dataset. It is a great playground for learning some machine learning, but like a few other datasets it is totally overused.
Still, I have some ideas for how to make something more sophisticated out of it (and yes, it is overkill for a small data science/machine learning intro, but it could make a great PhD thesis ;)).
Let us recall what features the dataset contains:
PassengerId
Survived
Pclass
Name
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked
This allows us to predict whether a passenger survived with reasonable accuracy. That works mainly because of location on the ship (which is partly a function of ticket fare) and some information that could be related to health (sex, age and, to some degree, ticket fare).
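The dataset has no explicit location column, so location has to be inferred. One rough proxy is the deck letter at the start of the Cabin value. The helper below is only a sketch of that idea: the function name `deck_from_cabin` is my own, and it assumes that Cabin values look like "C85" or "B96 B98" and are missing for many passengers.

```python
def deck_from_cabin(cabin):
    """Return the deck letter of a Cabin value, or None if it is missing.

    Assumption: the first character of the Cabin string is the deck letter.
    """
    # Missing cabins usually arrive as None or NaN (a float), not as a string.
    if not isinstance(cabin, str):
        return None
    cabin = cabin.strip()
    if not cabin or not cabin[0].isalpha():
        return None
    return cabin[0].upper()


# Illustrative values in the format used by the Cabin column.
print(deck_from_cabin("C85"))
print(deck_from_cabin("B96 B98"))
print(deck_from_cabin(None))
```

A deck letter is still a very coarse stand-in for where a passenger actually was when the ship hit the iceberg, which is exactly the gap a physical model could fill.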
But what can we do with it? The ship sank and it is staying down there. Can we still put it to good use for today’s cruise ships (yes, they are designed differently)? I am thinking of something like a detailed model of the ship (including simplified furniture as rigid-body models, which would need a lot of computation) to simulate the sinking process and see whether we can trace the results in the Titanic survival dataset back to some cause and effect. This kind of causal view is often neglected by data scientists and machine learning engineers when they work on things that happened in the physical world.
J.W. Stettler and B.S. Thomas published a paper called “Flooding and Structural Forensic Analysis of the Sinking of the RMS Titanic” (available online at: https://www.encyclopedia-titanica.org/community/attachments/study-titanic-pdf.39309/). It still seems a bit oversimplified for my ideas, but even the flooding process alone may be a good indicator for survival prediction. Deck plans are available at https://www.encyclopedia-titanica.org/titanic-deckplans/. And what about the evacuation process at each stage of the sinking? How could passengers and crew move around? Which paths were still free, which lifeboats were already unreachable, and so on? There are so many ways to gather insights from this that may help explain the survival dataset.
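To make the evacuation question a bit more concrete, here is a minimal sketch of how "which lifeboats are still reachable" could be asked of a deck plan once it is turned into a graph. Everything in it is an assumption for illustration: the function `reachable_boats`, the tiny `toy_ship` layout and the room names are made up and are not taken from the real deck plans or from the paper. It uses a plain breadth-first search and treats flooded spaces as blocked.

```python
from collections import deque


def reachable_boats(graph, start, flooded, boats):
    """Return the lifeboat stations reachable from `start`
    without passing through any flooded space (breadth-first search)."""
    if start in flooded:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen and neighbour not in flooded:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen & set(boats)


# Hypothetical toy layout, NOT the real Titanic deck plan.
toy_ship = {
    "cabin": ["corridor"],
    "corridor": ["cabin", "fore_stairs", "aft_stairs"],
    "fore_stairs": ["corridor", "boat_fore"],
    "aft_stairs": ["corridor", "boat_aft"],
    "boat_fore": ["fore_stairs"],
    "boat_aft": ["aft_stairs"],
}
boats = {"boat_fore", "boat_aft"}

# Early stage: nothing is flooded yet.
early = reachable_boats(toy_ship, "cabin", flooded=set(), boats=boats)
# Later stage: the forward staircase is under water.
late = reachable_boats(toy_ship, "cabin", flooded={"fore_stairs"}, boats=boats)
print(sorted(early), sorted(late))
```

Running this per passenger and per flooding stage would give a feature like "number of lifeboats still reachable at time t", which could then be compared with the Survived column. A real version would also need crowding, locked gates and lifeboat launch times, none of which this sketch models.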
I hope I have made at least one person rethink the Titanic survival dataset. If you are interested in such a project, please drop me a note, because I would love to hear about it.