Regression with SkLearn


Scikit-Learn provides an easy implementation for most regression algorithms. We can check out the major ones here. As with most machine learning code, these modules are concept heavy and code light. The code itself is trivially small - just a couple of lines will do the job. But it is very important to understand what happens underneath, so that we can use these modules effectively.

Insurance Dataset


The insurance industry was one of the early adopters of analytics and machine learning techniques. In order to price their products, medical insurance companies need an estimate of the medical cost an individual could incur. This data is easy to collect and is continuously enriched with newer observations. Hence it is a useful data set for learning the techniques of regression.

This data set contains a good amount of detail about various parameters that could have an impact on the health and medical expenses of an individual. Along with this, it contains the actual expenses incurred. Based on this data, we can make a reasonable prediction of the charges a new candidate is likely to incur, given the available information about these parameters.

Let us now try to analyze the data to understand more.

Analyze the Data


Before jumping into the implementation of the machine learning model, it is always good practice to peep into the data (at least a small sample) to understand how it looks. This can be a great help in deciding which algorithm could be more suitable. So, let us work on that.

Let us start with importing the necessary Python modules

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

We need to configure Seaborn and Matplotlib so that we can see the output in the Jupyter Notebook that we use for this work. We can also set the Seaborn palette to get the colors of our choice.

sns.set_palette('husl')
%matplotlib inline

With this in place, let us load the Insurance data into a Pandas Data Frame. We need to give the appropriate path to the CSV file so that it can be loaded correctly.

data = pd.read_csv('Insurance.csv')

The next step is to look at the data. The dataset is pretty big and we cannot possibly look at each record there. But peeping into the first few records is quite helpful.

data.head()
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520

This gives us an idea about what we have in the data frame. The insurance data has several records containing these aspects of different individuals, along with the charges they incurred. Now it is easier to understand the problem at hand.

Note that it is not really essential to understand all these aspects about the data. A good algorithm should be able to generate a good prediction model irrespective of the actual meaning of each of these input features - that is the whole point behind using machine learning.

But it always helps to use what we already know, so that we can get more accuracy with what we do not know.

Next, we can check out the data in a wider scope. What are the data types? Do we have any missing values? With this information, we can take some implementation-level decisions.

data.info()

This generates output:


RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
age         1338 non-null int64
sex         1338 non-null object
bmi         1338 non-null float64
children    1338 non-null int64
smoker      1338 non-null object
region      1338 non-null object
charges     1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.2+ KB

This helps us with implementation-specific details. We have 1338 records. Moreover, we are lucky that all the values are present. We often find ourselves in a situation where some records have missing fields; in such cases, we would need to fix that before proceeding.
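In this dataset we are spared that effort, but a quick check like the one below is a good habit. This is just an optional sketch; the commented fillna line is a hypothetical example of how one might fill in a missing numeric field:

data.isnull().sum()

# If some values were missing, we could fill a numeric column with its median
# (hypothetical example - not needed for this dataset):
# data["bmi"] = data["bmi"].fillna(data["bmi"].median())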

But one piece of bad news is that we have some text fields. Regression requires numbers, so we need to convert the available data into numbers. There is a catch here, though: the text data we have is not ordinal - there is no particular sorting order to its values. For a linear regression to work well, we need ordinal inputs - the output should consistently increase or decrease as the value of the input increases.

We don't have a problem with the binary fields, because with only two values they are trivially ordered. But we have a real problem with the regions. A simple solution is to split that field into a set of binary fields. For example, we can split the four regions into two fields of two values each (north/south and east/west). This may not always work - especially when we have many possible values. In such a case, we either forego the information in that field, or accept it as a limitation of the linear regression model and try something better.

Let's work on that

data["north_south"] = np.where(((data["region"] == "northeast") | (data["region"] == "northwest")), 1, 0)
data["east_west"] = np.where(((data["region"] == "northeast") | (data["region"] == "southeast")), 1, 0)
data["sex"] = np.where(data["sex"] == "female", 1, 0)
data["smoker"] = np.where(data["smoker"] == "yes", 1, 0)
data = data.drop(["region"], axis=1)
data.head()

Here, we have split the region field into two fields: one representing north/south and the other representing east/west. We have also changed all the binary fields into numeric values represented by 0 and 1. The actual numbers do not really matter - so long as they are different. Commonly, developers use -1/1 or 0/1.

Now, this is what we have:

age sex bmi children smoker charges north_south east_west
0 19 1 27.900 0 1 16884.92400 0 0
1 18 0 33.770 1 0 1725.55230 0 1
2 28 0 33.000 3 0 4449.46200 0 1
3 33 0 22.705 0 0 21984.47061 1 0
4 32 0 28.880 0 0 3866.85520 1 0
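As an aside, a similar effect can be achieved with the one-hot encoding helper that Pandas provides. This is only an alternative sketch, applied to the original data frame before the region column is dropped; the manual np.where approach above works equally well for this dataset:

# Alternative: one 0/1 column per region value instead of the two combined fields above
raw = pd.read_csv('Insurance.csv')
encoded = pd.get_dummies(raw, columns=["region"], prefix="region")
encoded.head()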

The transformed data looks a lot more manageable. Next, we can look up some statistical information about the available data. What kind of values do we have? What are the mean, standard deviation, min/max values, and quartiles? These give us some more ideas about how the model should look.

data.describe()

It generates this output:

age sex bmi children smoker charges north_south east_west
count 1338.000000 1338.000000 1338.000000 1338.000000 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 0.494768 30.663397 1.094918 0.204783 13270.422265 0.485052 0.514200
std 14.049960 0.500160 6.098187 1.205493 0.403694 12110.011237 0.499963 0.499985
min 18.000000 0.000000 15.960000 0.000000 0.000000 1121.873900 0.000000 0.000000
25% 27.000000 0.000000 26.296250 0.000000 0.000000 4740.287150 0.000000 0.000000
50% 39.000000 0.000000 30.400000 1.000000 0.000000 9382.033000 0.000000 1.000000
75% 51.000000 1.000000 34.693750 2.000000 0.000000 16639.912515 1.000000 1.000000
max 64.000000 1.000000 53.130000 5.000000 1.000000 63770.428010 1.000000 1.000000

Plotting the Data


With this in place, let us dig deeper. Numbers alone don't speak enough. To get a better feel of the data, we can use plots - visual representations that help us develop a mental picture of the data. In a real case study, we would work on a small sample of the available data; but in this case, we can use the entire data set. Let us start with age:

plt.scatter(data["age"], data["charges"], alpha=0.5)

There is some correlation between the age and the expenditure. Similarly, we can try the same plot for all the other fields.

Sex


plt.scatter(data["sex"], data["charges"], alpha=0.5)

BMI


plt.scatter(data["bmi"], data["charges"], alpha=0.5)

Children


plt.scatter(data["children"], data["charges"], alpha=0.5)

Smoker


plt.scatter(data["smoker"], data["charges"], alpha=0.5)

North-South


plt.scatter(data["north_south"], data["charges"], alpha=0.5)

East-West


plt.scatter(data["east_west"], data["charges"], alpha=0.5)

We can note in each of these plots that there is some correlation between the field and the charges. Thus, each field is important - but some are less important than others. The region and sex have only a minor impact.
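If we want to put numbers behind these visual impressions, we can compute the correlation of each field with the charges. This is only a rough sketch - correlation captures linear relationships, so it will not tell the whole story:

# Correlation of every field with the charges column
data.corr()["charges"].sort_values(ascending=False)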

Regression


Let us now take the mathematical route. Here, we can try out many different algorithms and check how they perform. Scikit-Learn allows us to configure hyperparameters for each of these algorithms - which can greatly impact performance. But for now, we will just use the defaults and see how this works.

First, we need to rearrange and split the data to get things set up. Let us start with importing the necessary modules:

from sklearn import metrics
from sklearn.model_selection import train_test_split

Now we should separate the input and output values. Charges define the output while the others are input features.

X = data.drop(['charges'], axis=1)
y = data['charges']

Next, we need to split the data into the train and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)
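A quick, optional sanity check confirms the 60/40 split:

# Verify the sizes of the train and test subsets
print(X_train.shape, X_test.shape)   # roughly 60% / 40% of the 1338 rows
print(y_train.shape, y_test.shape)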

Let us now start with individual algorithms

Linear Regression


from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))

This reports the training set accuracy as 0.757 and the test set accuracy as 0.737. (The "accuracy" returned by score() for a regressor is actually the R² value.)
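Since we have already imported sklearn.metrics, we can also look at the learned coefficients and at error measures that are easier to interpret than the R² score. This is a minimal sketch, not part of the original workflow:

# Inspect the learned model: one coefficient per input feature
print(dict(zip(X.columns, reg.coef_)))
print('Intercept:', reg.intercept_)

# Errors in the same unit as the charges (dollars)
y_pred = reg.predict(X_test)
print('MAE :', metrics.mean_absolute_error(y_test, y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))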

Ridge Regression


from sklearn import linear_model
reg = linear_model.Ridge(alpha=0.5)
reg.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))

This reports the training set accuracy as 0.757 and the test set accuracy as 0.737.

Lasso Regression


from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.5)
reg.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))

This reports the training set accuracy as 0.757 and the test set accuracy as 0.737.

Bayesian Regression


from sklearn import linear_model
reg = linear_model.BayesianRidge(compute_score=True)
reg.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))

This reports the training set accuracy as 0.757 and the test set accuracy as 0.737.
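Since the code for each of these models is identical except for the constructor, we could also compare them in a single loop. This is just a compact sketch of the same experiments, using the same default-ish settings as above:

from sklearn import linear_model

models = {
    'Linear':   linear_model.LinearRegression(),
    'Ridge':    linear_model.Ridge(alpha=0.5),
    'Lasso':    linear_model.Lasso(alpha=0.5),
    'Bayesian': linear_model.BayesianRidge(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print('{:<9} train: {:.3f}  test: {:.3f}'.format(
        name, model.score(X_train, y_train), model.score(X_test, y_test)))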

Higher Order


Note that the accuracy is quite similar in all the above models. That is because they are all linear models and cannot do much when the relation is far from linear. Let us now try some higher orders.

Polynomial Regression


from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

reg = Pipeline([('poly', PolynomialFeatures(degree=2)),
                ('linear', LinearRegression(fit_intercept=False))])
reg.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))

This reports the training set accuracy 0.851 and test set accuracy as 0.833. That is a good jump. But again, there is a limit. We can try playing around with the degree - to see how soon it overfits the data.
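To see the overfitting for ourselves, we can sweep the degree and watch the training score climb while the test score eventually drops. A rough sketch:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

for degree in range(1, 5):
    model = Pipeline([('poly', PolynomialFeatures(degree=degree)),
                      ('linear', LinearRegression(fit_intercept=False))])
    model.fit(X_train, y_train)
    print('degree {}: train {:.3f}  test {:.3f}'.format(
        degree, model.score(X_train, y_train), model.score(X_test, y_test)))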

Neural Network


from sklearn.neural_network import MLPRegressor
reg = MLPRegressor(hidden_layer_sizes=(10, 6), activation='relu', solver='lbfgs')
reg.fit(X_train, y_train)
print('Accuracy on the training subset: {:.3f}'.format(reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))

This reports the training set accuracy 0.839 and test set accuracy as 0.831. We can further play around with the network structure and different hyper-parameters to improve this further.

If we want to improve this further, we can also try to scale the input parameters. We will discuss that in the following blogs.
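As a quick preview of that idea, the sketch below wraps the neural network in a pipeline with a StandardScaler. The exact numbers will vary from run to run, but standardizing the inputs usually helps solvers like lbfgs converge better:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

reg = Pipeline([('scale', StandardScaler()),
                ('mlp', MLPRegressor(hidden_layer_sizes=(10, 6),
                                     activation='relu', solver='lbfgs',
                                     max_iter=1000))])
reg.fit(X_train, y_train)
print('Accuracy on the test subset: {:.3f}'.format(reg.score(X_test, y_test)))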