
Sunday, January 18, 2015

Titanic: Machine Learning from Disaster


For those of you who are not familiar with Kaggle.com, it is a competition website for data science and machine learning problems. Several commercial and non-profit companies and organizations open their problems to the public in the hope of finding solutions to their data problems or improving the performance of their existing ones.

No matter what your level of expertise with machine learning is, there are many beginner-level problems to learn from and solve. There are also several real-world competitive problems that, if solved efficiently, grant the winning teams up to $100,000 as well as reputation points on the leaderboard. It is also a great source for learning machine learning approaches to problem solving. For more information on Kaggle, I suggest browsing their homepage.

The "Titanic: Machine Learning from Disaster" problem is considered a getting started problem for beginners to familiarize themselves with the basic concepts and techniques of machine learning. The problem is as follows, given two datasets a training dataset (891 records) and a testing dataset (417 records) both containing information of 1,309 passengers who were onboard of the Titanic (Name, Sex, Ticket, Fare, Class, etc.). The training dataset contains each passenger information along with a binary column "Survived" indicating wither this particular passenger has survived (1) or not(0). The testing dataset; however, only contains information of the passenger without the "Survived" column. Now, using Machine Learning algorithms and techniques can you predict who survived in the testing dataset given that you have learned from the data given in the training set.

In this post I will explain how you would go about loading, preprocessing, cleaning and scaling your data; how to decide which features are most important for your machine learning algorithm, which in our case is the Random Forest classifier; and how to optimize the classifier's hyperparameters, such as the number of trees in the forest, the number of features to consider during splits, and so on. Finally, once we have clean, preprocessed data, the most important features to use and a fine-tuned classifier, we will fit the data to the classifier and come up with a prediction. The complete Python code is available on GitHub.

This solution requires you to be familiar with the Python programming language. We will use the famous data manipulation library Pandas to easily manipulate and massage our datasets. Moreover, we will use Matplotlib to plot figures that give us insight into our data and classifier. Finally, and most importantly, we will rely on the power of Scikit-Learn for the machine learning part of this solution.
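
For reference, here is a minimal sketch of the imports these libraries contribute; the actual scripts organize their own imports, so treat this only as an overview.

import pandas as pd                                   # dataframes for loading and massaging data
import matplotlib.pyplot as plt                       # plotting learning and ROC curves
from sklearn.ensemble import RandomForestClassifier   # the classifier we will tune and fit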

The starting execution point of the program is main.py. The following code snippet from main.py shows how we broke the problem down into six major sub-problems.

# load data
print("Loading Data...")
DIR = "./data"
train_df, test_df = load_data(DIR)

# preprocess, massage, scale, merge and clean data
print("Preprocessing Data...")
train_df, test_df = preprocess_data(train_df, test_df)

# use only most important features
print("Extracting Most Important Features...")
train_df, test_df = use_most_important_features(train_df, test_df)

# optimize hyperparameters
print("Optimizing Hyperparameters...")
optimize_hyperparameters(train_df)

# plot learning curves
print("Plot Learning Curves...")
plot_learning_curves(train_df)
plot_ROC_curve(train_df)

# predict survival
print("Predict Survival...")
predict_survival(train_df, test_df)

The sub-problems are:
1. Loading the datasets.
2. Preprocessing the data.
3. Extracting most important features.
4. Optimizing the classifier's hyperparameters.
5. Plotting learning curves.
6. Making a prediction.


1. Loading the Datasets

The training and testing csv files are located in the data folder. I originally obtained them from Kaggle's Titanic page. Our aim in this section is to read both the training and testing datasets properly from csv and load them into memory. Because the datasets are fairly small, we can load them directly into memory. The script that defines the function load_data(directory) is located in utils/load.py.

from os.path import join
import pandas as pd

# utility function to load data files into two pandas dataframes
def load_data(directory):
    train_df = pd.read_csv(join(directory, 'train.csv'), header=0)
    test_df = pd.read_csv(join(directory, 'test.csv'), header=0)
    return train_df, test_df

Note that both train_df and test_df are pandas DataFrames. A dataframe is a very flexible data structure that allows us to easily manipulate tabular, heterogeneous data. If you are running the code in a Python shell (hopefully IPython), you can do something like train_df.info() to get information on the training dataframe, or train_df.head(10) to print the first 10 records in the dataframe. Learning pandas will serve you well if you are serious about your data endeavors.

Once we load the data, running train_df.info() and test_df.info() gives us a bunch of interesting information about each dataframe. The training dataset contains 891 entries indexed from 0 to 890. This index is not to be confused with PassengerId; it is automatically assigned by pandas and is not part of the dataset. The columns are (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked). If you look closely you will notice that the Age, Cabin and Embarked columns have missing values. Moreover, you can see that the columns' datatypes are heterogeneous, meaning they are not all of the same type, which is a handy property of pandas dataframes.

test_df, on the other hand, has 418 entries and one column fewer: the Survived column. In fact, we have to come up with the values of the Survived column ourselves and submit them as our result to the Kaggle website. Notice that there are missing values here as well (Age, Fare and Cabin).
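
If you want to quantify those missing values rather than read them off the info() output, a quick check like the following will do; this is just a convenience and not part of the original scripts.

# count missing values per column in both dataframes
print(train_df.isnull().sum())   # Age, Cabin and Embarked have missing entries
print(test_df.isnull().sum())    # Age, Fare and Cabin have missing entries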

The title of each column is pretty much self-explanatory, except for a few that might look confusing:
Pclass = Passenger Class
SibSp = Siblings and/or Spouse
Parch = Parents and/or Children

I encourage you to explore the dataframes and discover what kind of data you have. For example, typing train_df.head() will print the first five entries of the training set. Note that the very first column in the output is pandas' internal index, which is not part of the dataset.

Now that you have successfully loaded the data in Python, it is time to preprocess, massage, scale and clean it.


2. Preprocessing the Data

What most people don't realize is that, for almost all machine learning problems, the practitioner will spend most of their time preprocessing and massaging the data. Most machine learning algorithms don't accept non-numerical datatypes, so we have to convert all non-numerical data into meaningful numerical representations.
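
To make this concrete, here is a minimal, hypothetical sketch of one such conversion, mapping the Sex column to integers. The Gender column name and the tiny example dataframe are my own; the repository's features/*.py scripts may do this differently.

import pandas as pd

# illustrative only: map the non-numerical Sex column to integers
df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)
print(df[['Sex', 'Gender']])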

In machine learning terminology we refer to each row in our dataset as a "sample" and each column as a "feature". Statisticians use different terms, but they nonetheless refer to the same concepts. Here we will stick to the machine learning terminology.

So what do we mean by preprocessing the data? Simply put, we have to transform the data into a form acceptable to our machine learning algorithm. Furthermore, we have to sit down and think about what each feature really means, and it is critical to find a way to fill in the missing values in our datasets; as you recall from the previous section, both the training and testing datasets have a lot of missing data. We must also make sure we don't confuse samples from the training dataset with samples from the testing dataset, and always keep in mind which is which, because we are going to concatenate the two datasets into one big dataset for the sake of preprocessing. It makes more sense to preprocess one combined dataset than two separate ones, and this will become clearer when we try to fill in the missing data fields.

The script that deals with preprocessing our data is utils/preprocess.py. The method preprocess_data(train_df, test_df) in this script takes the training and testing dataframes as input parameters, then calls functions defined in other scripts under features/feature_name.py. For example, when we want to preprocess the Age column we invoke functions in the features/age.py script. I followed this structure in hopes of making the solution modular and easy to trace and follow.
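
As an illustration of what one of these feature modules might contain, here is a hypothetical sketch for the Age column. The function name and the median-fill strategy are my assumptions, not necessarily what features/age.py actually does.

# hypothetical sketch of a feature module such as features/age.py
def fill_missing_ages(df):
    """Fill missing Age values with the median age of the combined dataframe."""
    df['Age'] = df['Age'].fillna(df['Age'].median())
    return df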

OK, first things first. We concatenate the two dataframes into one dataframe while keeping track of the split point at which we can later split it back into the two original dataframes. Because we are appending test_df to train_df, we only need to know the number of samples in train_df to split the data back again.


# remember the shape of the training set so we know where to split later
train_shape = train_df.shape

# stack the testing set below the training set into one combined dataframe
df = pd.concat([train_df, test_df], axis=0)
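
For completeness, here is a sketch of how the combined dataframe can later be split back into the original two sets using the remembered training shape; this illustrates the idea described above rather than quoting the repository's exact code.

# split the combined dataframe back after preprocessing
n_train = train_shape[0]          # number of training samples (891)
train_df = df.iloc[:n_train]      # first 891 rows -> training set
test_df = df.iloc[n_train:]       # remaining 418 rows -> testing set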

2 comments:

  1. Hi, would you be kind enough to let me play around with/see your solution?

     Reply: You can clone or fork my repo on GitHub: https://github.com/moeabdol/titanic