Building A Linear Regression Model That Predicts Retail Customer Spending

We will create a complete project trying to predict customer spending using linear regression with Python.

In this exercise, we have some historical transaction data from 2010 and 2011. For each transaction, we have a customer identifier (CustomerID), the number of units purchased (Quantity), the date of purchase (InvoiceDate) and the unit cost (UnitPrice), as well as some other information about the purchased item.

You'll find the dataset here

We want to prepare this data for a regression of 2010 customer transaction data against 2011 expenses. Therefore, we will create features from the 2010 data and calculate the target (the amount of money spent) for 2011.

When we create this model, it should generalize to future years for which we do not yet have the result. Therefore, we could use 2020 data to predict 2021 spending behavior in advance, unless the market or business has changed significantly since the time period to which the data used to fit the model refers:

Convert the InvoiceDate column to date format using the following code:

Calculate the revenue for each row by multiplying the quantity by the unit price:

You will notice that each invoice is spread over several rows, one for each type of product purchased. These can be combined in such a way that the data for each transaction is in a single row. To do this, we can perform a grouped operation on InvoiceNo. However, before that, we need to specify how to combine those rows that are grouped. Use the following code:

In the preceding code snippet, we first specify the aggregation functions we will use for each column, and then perform the grouping and apply those functions. InvoiceDate and CustomerID will be the same for all rows of the same invoice, so we can only take the first entry for them. For revenue, we sum the revenue for all items on the same invoice to get the total revenue for that invoice.

Since we will be using the year to decide which rows are being used for prediction and which ones we are predicting, create a separate column called year for the year, as follows:

Transaction dates can also be an important source of characteristics. The days from a customer's last transaction to the end of the year, or how early a customer had their first transaction, can tell us a bit about the customer's purchase history, which could be important. Therefore, for each transaction, we will calculate how many days difference there is between the last day of 2010 and the date of the invoice:

Currently, we have the data grouped by invoice, but we really want it to be grouped by customer.

We'll start by calculating all of our predictors. We will again define a set of aggregation functions for each of our variables and apply them using groupby. We will calculate the sum of the revenues.

For days_since, we will calculate the maximum and minimum number of days (giving us features that tell us how long this customer has been active in 2010, and how recently), as well as the number of unique values (giving us how many days apart this customer made a purchase). Since these are for our forecasters, we will only apply these functions to our data from 2010, and store them in a variable, X, and use the head function to see the results:

As you can see in the figure above, since we perform multiple types of aggregations on the days_since column, we end up with multi-level column labels. To simplify this, we can rescale the column names for easy reference later. Use the following code and print the results:

Let's calculate one more characteristic: the average expense per order. We can calculate this by dividing the sum of the revenue by days_since_nunique (this is actually the average spend per day, not per order, but we are assuming that if two orders were placed on the same day, we can treat them as part of the same order for our purposes):

Now that we have our forecasters, we need the result we will predict, which is just the sum of the revenues for 2011. We can calculate it with a simple groupby and store the values in the variable y, as follows:

Now we can put our predictors and results into a single DataFrame, wrangled_df, and rename the columns to have more intuitive names. Finally, look at the resulting DataFrame, using the head function:

Note that many of the values in our DataFrame are NaN. This is caused by clients that were active only in 2010 or only in 2011, so there is no data for the other year. Later we will work on predicting which of our customers will churn, but for now, we will just drop all customers that are not active in both years. Note that this means that our model will predict customer spending in the next year assuming they are still active customers. To remove customers with no values, we will remove rows where any of the revenue columns are null, as follows:

As a final data cleaning step, it is often a good idea to get rid of outliers. A standard definition is that an outlier is any data point that is more than three standard deviations above the median, so we will use this to remove clients that are outliers in terms of 2010 or 2011 revenue:

It is often a good idea, after you have done the data cleanup and feature engineering, to save the new data as a new file so that, as you develop the model, you do not need to run the data through the entire feature engineering and cleanup pipeline every time you want to rerun the code. We can do this using the to_csv function.

Examining the relationships between the predictors and the outcome.

In this exercise, we will use the characteristics we calculated in the previous exercise and see if these variables have any relationship with our outcome of interest (customer sales revenue in 2011):

Using pandas to import the data you saved at the end of the last exercise, using CustomerID as the index:

The seaborn library has a number of plotting features. Its pair plot feature will plot histograms and pairwise scatter plots of all our variables on one line, allowing us to easily examine both the distributions of our data and the relationships between data points. Use the following code:

In the diagram above, the diagonal shows a histogram for each variable, while each row shows the scatter plot between one variable and the other. The bottom row of figures shows the scatter plots of 2011 income (our outcome of interest) against each of the other variables. Because the data points overlap and there is a fair amount of variation, the relationships do not appear very clear in the visualizations.

Therefore, we can use correlations to help us interpret the relationships. The corr function of pandas will generate correlations between all the variables in a DataFrame:

Again, we can look at the last row to see the relationships between our forecasters and the interest result (2011 revenue). Positive numbers indicate a positive relationship, e.g., the higher a client's 2010 income, the higher their expected income in 2011. Negative numbers mean the opposite, e.g., the more days since a customer's last purchase, the lower the revenue expectation for 2011. Also, the higher the absolute number, the stronger the relationship.

The resulting correlations should make sense. The more competitors in the area, the lower a location's revenue, while median income, loyalty members and population density are all positively related. The age of a place is also positively correlated with revenue, indicating that the longer a place is open, the better known it is and the more customers it attracts (or perhaps, only places that do well last a long time).

Building a linear model that predicts customer spending.

In this exercise, we will build a linear model on customer spending using the features created in the previous exercise:

Recall that there is only a weak relationship between days_since_first_purchase and 2011 revenue-so we will not include that predictor in our model.

Store the predictor columns and the outcome columns in the X and y variables, respectively:

We use sklearn to perform a split of the data, so that we can evaluate the model on a dataset on which it was not trained, as shown here:

We import LinearRegression from sklearn, create a LinearRegression model and adjust the training data:

We examine the coefficients of the model by checking the coef_ property. Note that these are in the same order as our X columns: 2010 revenue, days since last purchase, number of purchases and average order cost:

Check the intercept term of the model by checking the intercept_ property:

Now we can use the fitted model to make predictions about a customer outside our data set.

Make a DataFrame containing a customer's data, where the 2010 revenue is 1,000, the number of days since last purchase is 20, the number of purchases is 2, and the average order cost is 500. Have the model make a prediction on this customer's data:

We can plot the model predictions in the test set against the actual value. First, we import matplotlib, and make a scatter plot of the model predictions in X_test against y_test.

Constrain the x and y axes to a maximum value of 10,000 so that we have a better view of where most of the data points are located.

Finally, add a line with slope 1, which will serve as our reference: if all points lie on this line, it means that we have a perfect relationship between our predictions and the true response:

In the graph above, the red line indicates where the points would be if the prediction were the same as the actual value. Since many of our points are quite far from the red line, this indicates that the model is not completely accurate. However, there does appear to be some relationship, as higher model predictions have higher true values.

To further examine the relationship, we can use correlation. From scipy, we can import the pearsonr function, which calculates the correlation between two matrices, just as we did with Pandas for our entire DataFrame. We can use it to calculate the correlation between our model predictions and the actual value as follows:

You should have two numbers returned: (0.612574007666680493, 1.934002067463782e-20). The first number is the correlation, which is close to 0.6, indicating a strong relationship. The second number is the p-value, which indicates the probability of seeing such a strong relationship if the two sets of numbers were unrelated; the very low number here means that this relationship is unlikely to be due to chance.


We have constructed a simple example of linear regression. You could try this same one with Decision Trees and review the differences in the models. Later we will create another article to understand how to do this.