Today we’ll use XGBoost Boosted Trees for regression over the official Human Development Index dataset. Who said Supervised Learning was all about classification?

### XGBoost: What is it?

XGBoost is a library, with a convenient Python interface, that allows us to train Gradient Boosted Trees efficiently, exploiting multicore parallelism. It is also available in R, though we won’t be covering that here.

### The task: Regression

Boosted Trees are a Machine Learning model we can use for regression. That is, given a set of inputs and numeric labels, they will estimate the function that maps each input to its corresponding label.

Unlike classification though, we are interested in continuous values, and not a discrete set of classes, for the labels.

For instance, we may want to predict a person’s height given their weight and age, as opposed to, say, labeling them as male, female or other.

For each decision tree, we will start at the root, and move to the left or right child depending on the result of the decision at each node. When we reach a leaf, we’ll return its value.
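That traversal can be sketched in a few lines of plain Python. The class and attribute names here are illustrative, not XGBoost internals, and the tree is hand-built just for the example:

```python
# A minimal sketch of prediction with a single regression tree:
# internal nodes compare one feature against a threshold, leaves hold a value.

class Leaf:
    def __init__(self, value):
        self.value = value

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature = feature      # index of the feature to compare
        self.threshold = threshold  # go left if the feature value <= threshold
        self.left = left
        self.right = right

def predict(tree, x):
    """Walk from the root down to a leaf and return the leaf's value."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.value

# Tiny hand-built tree: predicts height (cm) from [weight_kg, age_years].
tree = Node(feature=0, threshold=60,
            left=Leaf(160.0),
            right=Node(feature=1, threshold=18,
                       left=Leaf(170.0), right=Leaf(178.0)))

print(predict(tree, [55, 30]))  # 160.0
print(predict(tree, [80, 30]))  # 178.0
```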

### XGBoost’s model: What are Gradient Boosted Trees?

Boosted trees are similar to random forests: they are an ensemble of decision trees. However, each leaf will return a number (or vector) in the space we are predicting.

For classification, we will usually return the most common class among the training set elements that fall on each leaf. In regression, we will usually return the average of their labels.

At each non-leaf node, however, the tree will be making a decision: a numerical comparison between a certain feature’s value and a threshold.

So far, this would just be a regression forest. Where’s the difference?

### Boosted Trees vs Random Forest: The difference

When training a Boosted Tree, unlike with random forests, we change the labels every time we add a new tree.

For every new tree, we update the labels by subtracting the previous trees’ predictions, multiplied by a certain learning rate. In other words, each new tree is trained on the residuals of the ensemble so far.

This way, each tree will effectively be learning to correct the previous trees’ mistakes.

Consequently, in the prediction phase, we will simply return the sum of the predictions of all of the trees, multiplied by the learning rate.
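The training and prediction loops described above can be sketched in plain Python, using single-feature “stumps” as the base trees. All names here are illustrative, and this is a toy version of the idea, not how XGBoost is implemented:

```python
import numpy as np

def fit_stump(X, residuals):
    """Fit a depth-one tree on one feature: pick the threshold that
    minimizes squared error, returning (threshold, left_value, right_value)."""
    best = None
    for t in np.unique(X):
        left, right = residuals[X <= t], residuals[X > t]
        if len(left) == 0 or len(right) == 0:
            continue
        lv, rv = left.mean(), right.mean()
        err = ((left - lv) ** 2).sum() + ((right - rv) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return t, lv, rv

def predict_stump(stump, X):
    t, lv, rv = stump
    return np.where(X <= t, lv, rv)

def fit_boosted(X, y, n_trees=300, learning_rate=0.1):
    stumps, residuals = [], y.astype(float).copy()
    for _ in range(n_trees):
        stump = fit_stump(X, residuals)
        # subtract this tree's scaled predictions from the working labels,
        # so the next tree learns to correct what is still unexplained
        residuals -= learning_rate * predict_stump(stump, X)
        stumps.append(stump)
    return stumps

def predict_boosted(stumps, X, learning_rate=0.1):
    # the sum of all trees' predictions, multiplied by the learning rate
    return learning_rate * sum(predict_stump(s, X) for s in stumps)

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 6.1])
model = fit_boosted(X, y)
preds = predict_boosted(model, X)
```

Notice how the residuals shrink as trees are added: that is exactly the “each tree corrects the previous trees’ mistakes” behavior.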

This also means that, unlike random forests or bagged trees, this model **will** overfit if we keep increasing the quantity of trees arbitrarily. However, we will learn how to account for that.

To learn more about Boosted Trees, I strongly recommend reading the official XGBoost documentation. It taught me a lot, and explains the basics better and with nicer pictures.

If you want to dive even deeper into the model, the book An Introduction to Statistical Learning was the one with which these concepts finally clicked for me, and I can’t recommend it enough.

### Using XGBoost with Python

XGBoost’s API is pretty straightforward, and we will also learn a bit about its hyperparameters. First, though, let me show you today’s mission.

### Today’s Dataset: HDI Public Data

The HDI Dataset contains a wealth of information about most countries’ development level, tracking many metrics and areas over the decades.

For today’s article, I decided to only look at the data from the latest available year: 2017. This is just to keep things recent.

I also had to perform a bit of reshaping and cleaning to turn the original dataset into a more manageable and, in particular, consumable one.

The GitHub repository for this article is available here, and I encourage you to follow along with the Jupyter Notebook. However, as always, I will be adding the most relevant snippets here.

### Preprocessing the Data with Pandas

First of all, we will read the dataset into memory. Since it contains a whole column for every year, and a row for each country and metric pair, it is pretty cumbersome to manage.

We will reshape it into something along the lines of:

{country: {metric1: value1, metric2: value2, ...}
 for country in countries}

so that we can feed it into our XGBoost model. Also, since all of these metrics are numerical, no further preprocessing will be needed before training.
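With pandas, that reshaping boils down to filtering the year column and pivoting. The column names below are illustrative stand-ins for the real HDI file, not its actual headers:

```python
import pandas as pd

# Toy stand-in for the HDI file: one row per (country, metric),
# one column per year. Column names here are assumptions for the sketch.
raw = pd.DataFrame({
    "Country": ["Norway", "Norway", "Chile", "Chile"],
    "Indicator": ["Life expectancy", "Mean years of schooling"] * 2,
    "2016": [82.3, 12.5, 79.5, 10.2],
    "2017": [82.4, 12.6, 79.7, 10.3],
})

# Keep only 2017 and pivot so each country becomes a row
# and each metric becomes a numeric feature column.
wide = raw.pivot(index="Country", columns="Indicator", values="2017")
print(wide)
```

After this, `wide` has one row per country and one numeric column per metric, which is exactly the shape XGBoost expects.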