Data Science 101: A Walk in a Random Forest

Random Forest using Python. Is a forest always better than a single tree?

Janaki Viswanathan
Jul 29, 2020

Collective wisdom is the idea that the wisdom of many is better than the wisdom of one: aggregating many opinions weeds out the bias and idiosyncrasies of any individual opinion and brings out the best possible outcome. This is a central theme in statistics too, where an average of many observations is more reliable than any single observation. Taking a similar approach with data analysis, what if we had multiple decision trees providing an output on the same target variable? We would end up with a forest: a random forest.

A random forest, as the name implies, is a collection of decision trees working on the dataset. Each tree is trained on a bootstrap sample of the data (rows drawn at random with replacement), explores the relationship between the target and predictor variables in its sample, and provides an output. The final output of the random forest is the average of the outputs of the individual trees, which is how the model arrives at collective wisdom. This is known as “ensemble modeling”, and the resampling scheme behind it is called bagging (bootstrap aggregating).
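To make this concrete, here is a minimal sketch, not taken from this article's notebook, of bagging by hand: each tree fits a bootstrap sample, and the forest's prediction is the average of the trees' predictions. The synthetic dataset and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for any regression dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(10):
    # Each tree sees a bootstrap sample: n rows drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx]))

# The forest's prediction is the average of the individual trees' predictions
forest_pred = np.mean([t.predict(X) for t in trees], axis=0)
```

A real random forest adds one more twist on top of bagging: it also considers only a random subset of features at each split, which further decorrelates the trees.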

In my previous article, I discussed how to use decision trees with Python to analyze the real estate prices of single-family homes in Boston. In the article below, I will explain in a simple manner how to model random forests and compare the output of the random forest model to the individual decision tree model for predicting housing prices for single-family homes. My aim with this article is to give you an easy Python template which you can use to develop a similar modeling approach, just by replacing the dataset used in the example with your own dataset. Using this template, I was able to create a predictive model for other applications, like sales, in 5 minutes.

Random Forest Modeling:

In Python, for fitting a random forest model on interval (continuous) targets we use sklearn.ensemble.RandomForestRegressor, and for categorical targets we use sklearn.ensemble.RandomForestClassifier. As mentioned above, a random forest is a collection of decision trees; in sklearn, the random forest methods use the parameter n_estimators to control the number of individual decision trees constructed in this manner. The random forest averages the predictions over all the trees to produce the output of the model.
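For instance, a regressor can be created like this (a minimal sketch; the parameter values are illustrative, not the tuned values found later in this article):

```python
from sklearn.ensemble import RandomForestRegressor

# n_estimators sets how many decision trees the forest builds;
# the final prediction is averaged over all of them
model = RandomForestRegressor(n_estimators=100, random_state=42)
```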

There are 3 steps to building a Random Forest Model:

STEP-1: Split the data into training and validation sets.

We will train the model using the training set and validate the results of the model using the validation set.

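The original article shows this step as a code screenshot; here is a minimal sketch using scikit-learn's train_test_split, assuming the predictors X and target y have already been prepared as in the previous article (the 70/30 split ratio is an assumption):

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for validation; random_state makes the split repeatable
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```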

STEP-2: Hyperparameter optimization & Cross validation

The main hyperparameters for random forests are:
1. The tree depth (max_depth)
2. The number of trees (n_estimators)
3. The minimum number of samples required to split an internal node (min_samples_split)
4. The minimum number of samples in a leaf (min_samples_leaf)
5. The maximum number of features considered at each split (max_features)

In this case we will tune the tree depth (max_depth) and the number of trees (n_estimators) and use the default values for the rest. For your application, you may tune the other hyperparameters if needed. For optimization we will use MAE (mean absolute error) as the goodness-of-fit criterion to identify the best combination of tree depth and number of trees. A goodness-of-fit criterion measures how well the model configuration is able to forecast the target variable. We will use 10-fold cross-validation on the training set (you can choose to fit on the training set or the entire dataset) to identify the best tree depth and the optimal number of trees.

The random forest model draws multiple bootstrap samples from the dataset to fit several decision tree models

K-fold validation and Hyperparameter Optimization
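The tuning code appears as a screenshot in the original; here is a sketch of the same idea using scikit-learn's GridSearchCV, with MAE as the scoring criterion and 10-fold cross-validation (the candidate grid values below are assumptions):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Candidate values for the two hyperparameters being tuned (illustrative grid)
param_grid = {
    "max_depth": [4, 8, 12, 16],
    "n_estimators": [50, 100, 200],
}

# scikit-learn maximizes scores, so MAE is supplied as a negated score;
# cv=10 gives the 10-fold cross-validation described above
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=10,
)
search.fit(X_train, y_train)
print(search.best_params_)
```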

The complete Python application is available in my GitHub repository (reference 4 below).

Hyperparameter Optimization Results: Number of Trees = 100 and Best Depth = 12

As you can see in the results above, the best depth is 12 and the optimal number of trees is 100 for our dataset.

STEP-3: Hold-out validation.

The last and most important step is model validation. Here we fit the random forest model with the best depth (12) and number of trees (100) obtained above on the training set, then compare R2, MAE, and MSE (mean squared error) between the training and validation sets. These metrics should be approximately similar for the model to be considered valid. If the training model has a high R2 (and low error) while the validation model has a much lower R2 (and higher error), this indicates an overfit, and the model cannot be used for prediction.
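A minimal sketch of this comparison, reusing the variable names from the split above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Refit the forest with the tuned hyperparameters from Step 2
model = RandomForestRegressor(max_depth=12, n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Compare goodness-of-fit on the training vs. validation sets
for name, X_part, y_part in [("train", X_train, y_train), ("valid", X_valid, y_valid)]:
    pred = model.predict(X_part)
    print(
        f"{name}: R2={r2_score(y_part, pred):.3f}  "
        f"MAE={mean_absolute_error(y_part, pred):.3f}  "
        f"MSE={mean_squared_error(y_part, pred):.3f}"
    )
```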

Holdout Validation Results

As you can see, the training and validation models give similar results (for R2, MAE, MSE, and root mean squared error). I look for this because it gives confidence in the results of the model.

Once you have a valid model, you can compute the feature importance, that is, identify which variables are most significant and important for predicting the single-family home price. In our case, the number of floors and the living room area came out as the most important features according to the model.
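A sketch of how to read the impurity-based importances off the fitted model (feature_names is an assumed list holding the column names of X):

```python
import pandas as pd

# feature_importances_ gives each predictor's impurity-based importance
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```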

Which Model is better?

A comparison of the results from the random forest model and the decision tree model is shown below:

The key advantages of Random Forest are:

  • Random forests are self-validating: each tree can be scored on the out-of-bag rows it never saw during training (see the sketch after this list)
  • Random forests improve forecasts by averaging out the variance of the individual trees
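In scikit-learn, this out-of-bag self-validation is a one-flag option; a minimal sketch:

```python
from sklearn.ensemble import RandomForestRegressor

# oob_score=True scores each tree on the rows left out of its bootstrap
# sample (requires the default bootstrap=True)
model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
model.fit(X_train, y_train)
print(model.oob_score_)  # R2 estimated from the out-of-bag samples
```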

At the same time, some of the disadvantages of using Random forest are:

  • They can require longer computation time, especially for large models, and consequently more computing power.
  • Interpretability is hard. Since there is not just one tree and each tree works on its own bootstrap sample of the dataset, following a tree and trying to interpret the results becomes tedious.

Wrapping it up:

Comparing the decision tree results with the random forest results, the R2 is greatly improved in the output of the random forest. This indicates better accuracy. However, this may not always be the case. For other applications, choosing a random forest may not show any significant increase in the accuracy of the results. It could even show a decrease in accuracy if the random forest overfits the data. The choice of model, decision tree or random forest, depends on the application, the complexity of the data, and the accuracy and interpretability that your application needs.

REFERENCES:

  1. Dr. Edward R. Jones, “The Art and Science of Data Analytics”, Texas A&M Analytics Program
  2. Barry de Ville and Padraic Neville, Decision Trees for Analytics Using SAS Enterprise Miner
  3. Texas A&M Analytics Program
  4. GitHub application: https://github.com/evjanaki/BostonHousing2019PythonModel/blob/master/BostonHousingProperty_RandomForestModel.ipynb
