Identifying Linear Relationships Between Variables In Machine Learning

Linear models assume that the independent variables, X, take on a linear relationship with the dependent variable, Y. This relationship can be described by the following equation (the equation of a straight line):

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$


Here, X denotes the independent variables and the β are the coefficients, which indicate the change in Y for a unit change in X. If this assumption is not met, the performance of the model may be poor. Linear relationships can be evaluated using scatter plots and residual plots. Scatter plots display the relationship between the independent variable X and the target Y.

The residuals (the error) are the difference between the linear estimate of Y obtained from X and the real target:

$$\varepsilon_i = y_i - \hat{y}_i$$

Linear models assume that the independent variables X take on a linear relationship with the dependent variable Y. If the assumption is not true, the model may show poor performance. Let's visualize the linear relationships between X and Y. We start by importing the following libraries: pandas, numpy, matplotlib, seaborn, and scikit-learn's LinearRegression.
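Using the conventional aliases:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
```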

Let's import the Boston Houses dataset from scikit-learn.

This is how we load the dataset from scikit-learn:
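The loader below assumes an older scikit-learn; load_boston was removed in version 1.2:

```python
# load_boston ships with scikit-learn < 1.2; it was removed in 1.2
from sklearn.datasets import load_boston

boston_dataset = load_boston()
```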

Then we create the dataframe with the independent variables as follows:
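A sketch, naming the dataframe boston since that is what the text calls it later:

```python
# the feature matrix and its column names come from the Bunch object
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head()
```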

The values of y can be accessed as boston_dataset.target. Create a new column called MEDV with the attribute we just showed you, and display the boston dataframe again.
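Which would look like:

```python
# the target (house prices) becomes the MEDV column
boston['MEDV'] = boston_dataset.target
boston.head()
```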

Here is the information about the data set. Familiarize yourself with the variables before continuing with the exercise.

The objective is to predict the median house value, the MEDV column in this dataset; the other variables describe features of the houses and neighborhoods. Run the following line:
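The line in question is presumably the dataset's description:

```python
# DESCR contains the full documentation of the dataset
print(boston_dataset.DESCR)
```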

Now create a dataframe with a variable x that follows a normal distribution and shows a linear relationship with y. Set a random seed of 29 to ensure reproducibility.

We define a variable n with the value 200, then a variable x drawn with NumPy's randn for n samples. Finally, we create the variable y by multiplying x by 10 and adding another randn draw of n samples multiplied by 2.
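Putting those steps into code:

```python
np.random.seed(29)

n = 200
x = np.random.randn(n)                # n samples from a standard normal
y = x * 10 + np.random.randn(n) * 2   # linear in x, plus Gaussian noise
```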

Now we create a Pandas dataframe with the values of x and y.
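A minimal version, assuming the dataframe is named data:

```python
data = pd.DataFrame({'x': x, 'y': y})
data.head()
```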

Then we create the scatter plot with Seaborn, passing x, y, the data, and order=1 so that a first-degree (straight) regression line is fitted.
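Seaborn's scatterplot has no order argument, so the plot described here is presumably lmplot, which overlays a fitted line on the scatter plot:

```python
# order=1 requests a first-degree (straight-line) fit
sns.lmplot(x='x', y='y', data=data, order=1)
plt.show()
```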

As you saw, so far we have generated x and y randomly with NumPy, but we have not used the Boston House Prices dataset. We now know that in that dataset y is MEDV, because it is the variable we want to predict, that is, the price! The values of x are all the other columns of the dataset. Scatter plots only let us compare two variables at a time, so we must make one scatter plot per x variable. Let's look at two of them.

Draw a scatter plot with Seaborn that maps x to the LSTAT column and y to MEDV.
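With the same lmplot approach as above:

```python
sns.lmplot(x='LSTAT', y='MEDV', data=boston, order=1)
plt.show()
```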

Although not perfect, the relationship is quite linear. But notice that it is a negative linear relationship: as LSTAT increases, the MEDV price decreases.

Now draw the relationship between CRIM and MEDV.
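And for CRIM:

```python
sns.lmplot(x='CRIM', y='MEDV', data=boston, order=1)
plt.show()
```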

As we have already seen, linear relationships can also be assessed by examining the residuals. The residuals are the difference between the estimated (predicted) and real values. If the relationship is linear, the residuals should be normally distributed and centered around zero.

Create the model by instantiating scikit-learn's LinearRegression() and assigning it to the variable linreg.
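That is a one-liner:

```python
linreg = LinearRegression()
```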

Let's continue working with the dataframe we created with NumPy, the one with only two columns, x and y. Train the model with scikit-learn's fit method. Remember to pass the values of x as a DataFrame and not as a Series.
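The double brackets below keep x as a single-column DataFrame:

```python
# data[['x']] is a DataFrame; data['x'] would be a Series
linreg.fit(data[['x']], data['y'])
```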

Let's get the predictions by calling scikit-learn's predict method, passing the values of x as a dataframe. Assign the result to the variable pred.
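For example:

```python
pred = linreg.predict(data[['x']])
```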

Calculate the residual values, and store them in a variable called error
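Applying the residual definition from earlier:

```python
# residual = real value - predicted value
error = data['y'] - pred
```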

Now evaluate the fit with a Matplotlib scatter plot between pred and y.
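A minimal sketch of the plot:

```python
plt.scatter(x=pred, y=data['y'])
plt.xlabel('Predictions')
plt.ylabel('y')
plt.show()
```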

Let's now see how the residuals are distributed along x, with another Matplotlib scatter plot between error and x.
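Similarly:

```python
plt.scatter(x=data['x'], y=error)
plt.xlabel('x')
plt.ylabel('Residuals')
plt.show()
```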

Let's now plot the distribution of the errors by drawing a histogram with Seaborn's displot, using 30 bins.
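displot draws a histogram by default, so passing the errors and the bin count is enough:

```python
sns.displot(error, bins=30)
plt.show()
```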

Very well, we have completed the analysis of variables with a linear relationship on the dataset we created with NumPy. Now let's repeat the same steps with the Boston Houses dataset, taking into account only one variable/column: LSTAT. Follow all the previous steps: train the model, get the predictions, and plot the relationship and the residuals.
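Putting the whole workflow together for LSTAT, under the same naming assumptions as before:

```python
# train on LSTAT only
linreg = LinearRegression()
linreg.fit(boston[['LSTAT']], boston['MEDV'])

# predictions and residuals
pred = linreg.predict(boston[['LSTAT']])
error = boston['MEDV'] - pred

# predictions vs. real values
plt.scatter(x=pred, y=boston['MEDV'])
plt.xlabel('Predictions')
plt.ylabel('MEDV')
plt.show()

# residuals along LSTAT
plt.scatter(x=boston['LSTAT'], y=error)
plt.xlabel('LSTAT')
plt.ylabel('Residuals')
plt.show()

# distribution of the residuals
sns.displot(error, bins=30)
plt.show()
```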

Conclusion

In this particular case, the residuals are centered around zero, but they are not distributed homogeneously across the LSTAT values: the larger and smaller LSTAT values show higher residuals. Furthermore, the histogram shows that the residuals do not follow a strictly Gaussian distribution.