We will build a complete project that predicts customer spending using linear regression with Python. In this exercise, we have historical transaction data from 2010 and 2011. For each transaction, we have a customer identifier (CustomerID), the number of units purchased (Quantity), the date of purchase (InvoiceDate) and the unit price (UnitPrice), as well as some other information about the purchased item.

You can find the dataset here

We want to prepare this data for a regression of 2011 customer spending on 2010 transaction data. Therefore, we will create features from the 2010 data and calculate the target (the amount of money spent) for 2011.

Once fitted, this model should generalize to future years for which we do not yet have the outcome. For example, we could use 2020 data to predict 2021 spending in advance, provided the market or business has not changed significantly since the period covered by the data used to fit the model. Start by loading the data:

import pandas as pd

df = pd.read_csv('datasets/retail_transactions.csv')
df.head()

Result

Convert the InvoiceDate column to date format using the following code:

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df.head()

Result

Calculate the revenue for each row by multiplying the quantity by the unit price:

df['revenue'] = df['UnitPrice']*df['Quantity']
df.head()

Result

You will notice that each invoice is spread over several rows, one for each type of product purchased. These can be combined so that the data for each transaction is in a single row. To do this, we can perform a groupby operation on InvoiceNo. Before that, however, we need to specify how to combine the rows that are grouped together. Use the following code:

operations = {'revenue':'sum',
              'InvoiceDate':'first',
              'CustomerID':'first' 
             }

df = df.groupby('InvoiceNo').agg(operations)
df.head()

Result

In the preceding code snippet, we first specify the aggregation function to use for each column, and then perform the grouping and apply those functions. InvoiceDate and CustomerID are the same for all rows of the same invoice, so we can simply take the first entry for them. For revenue, we sum the revenue of all items on the same invoice to get the total revenue for that invoice.
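
To see what this aggregation does on a small scale, here is a minimal sketch on a three-row toy table. The invoice numbers, customer IDs, dates and amounts are invented for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy data: invoice 'A1' spans two rows, invoice 'B2' one row (all values invented).
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'B2'],
    'CustomerID': [17850, 17850, 13047],
    'InvoiceDate': pd.to_datetime(['2010-12-01', '2010-12-01', '2010-12-02']),
    'revenue': [15.0, 20.0, 7.5],
})

# Same aggregation spec as in the article: sum revenue, keep the first date and customer.
toy_ops = {'revenue': 'sum', 'InvoiceDate': 'first', 'CustomerID': 'first'}
toy_invoices = toy.groupby('InvoiceNo').agg(toy_ops)

print(toy_invoices)
```

The two rows of invoice A1 collapse into one row whose revenue is the sum of both items, while the date and customer are carried over unchanged.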

Since we will be using the year to decide which rows are being used for prediction and which ones we are predicting, create a separate column called year for the year, as follows:

df['year'] = df['InvoiceDate'].dt.year
df.head()

Result

Transaction dates can also be an important source of features. The number of days from a customer's last transaction to the end of the year, or how early in the year a customer made their first transaction, can tell us something about the customer's purchase history, which could be important. Therefore, for each transaction, we will calculate the number of days between the invoice date and the last day of 2010:

# pd.datetime has been removed from recent pandas versions; pd.Timestamp is the supported equivalent
df['days_since'] = (pd.Timestamp(year=2010, month=12, day=31) - 
                    df['InvoiceDate']).dt.days
df.head()

Result
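
The subtraction above yields one time difference per row, and taking its days component keeps only the whole days. A minimal standard-library sketch of the same arithmetic, using invented invoice timestamps:

```python
from datetime import datetime

end_of_2010 = datetime(2010, 12, 31)

# Hypothetical invoice timestamps, chosen only for illustration.
invoice_dates = [datetime(2010, 12, 1, 8, 26), datetime(2010, 12, 30, 17, 5)]

# .days keeps the whole-day part of each timedelta and drops the time-of-day remainder.
days_since = [(end_of_2010 - d).days for d in invoice_dates]
print(days_since)  # [29, 0]
```

One consequence worth knowing: because the reference point is midnight at the start of 31 December, an invoice stamped later that same day produces a negative difference, and its days component is -1. This only matters if InvoiceDate carries a time of day.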

Currently, we have the data grouped by invoice, but we really want it to be grouped by customer.

We'll start by calculating all of our predictors. We will again define a set of aggregation functions for each of our variables and apply them using groupby. We will calculate the sum of the revenues.

For `days_since`, we will calculate the maximum and minimum number of days (features that tell us how long ago this customer first purchased in 2010, and how recently they last purchased), as well as the number of unique values (the number of distinct days on which this customer made a purchase). Since these are our predictors, we will apply these functions only to the 2010 data, store the result in a variable, X, and use the `head` function to see the results:

operations = {'revenue':'sum',
              'days_since':['max','min','nunique'],
             }

X = df[df['year'] == 2010].groupby('CustomerID').agg(operations)
X.head()

Result

As you can see in the figure above, because we perform several types of aggregation on the `days_since` column, we end up with multi-level column labels. To simplify this, we can flatten the column names for easier reference later. Use the following code and print the results:

X.columns = [' '.join(col).strip() for col in X.columns.values]
X.head()

Result
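
The flattening step is plain string handling: each multi-level label is a tuple of (column, aggregation), and the list comprehension joins the parts with a space. A small sketch with the labels written out by hand (the `('CustomerID', '')` label is a hypothetical case, added only to show why `strip` is there):

```python
# Column labels after a multi-function agg: one (column, aggregation) tuple each.
multi_level = [('revenue', 'sum'),
               ('days_since', 'max'),
               ('days_since', 'min'),
               ('days_since', 'nunique')]

flat = [' '.join(col).strip() for col in multi_level]
print(flat)
# ['revenue sum', 'days_since max', 'days_since min', 'days_since nunique']

# strip() removes the trailing space left behind when a level is empty.
print(repr(' '.join(('CustomerID', '')).strip()))  # 'CustomerID'
```

Note that the resulting names contain a space, not an underscore, which is why later code refers to `'days_since nunique'`.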

Let's calculate one more feature: the average spend per order. We can calculate this by dividing the revenue sum by `days_since nunique` (strictly speaking, this is the average spend per purchase day, not per order, but we assume that two orders placed on the same day can be treated as part of the same order for our purposes):

X['avg_order_cost'] = X['revenue sum']/X['days_since nunique']
X.head()

Result

Now that we have our predictors, we need the outcome we will predict, which is simply the sum of the revenue for 2011. We can calculate it with a simple groupby and store the values in the variable y, as follows:

y = df[df['year'] == 2011].groupby('CustomerID')['revenue'].sum()
y

Result

Now we can put our predictors and results into a single DataFrame, `wrangled_df`, and rename the columns to have more intuitive names. Finally, look at the resulting DataFrame, using the `head` function:

wrangled_df = pd.concat([X,y], axis=1)
wrangled_df.columns = ['2010 revenue',
                       'days_since_first_purchase',
                       'days_since_last_purchase',
                       'number_of_purchases',
                       'avg_order_cost',
                       '2011 revenue']
wrangled_df.head()

Result

Note that many of the values in our DataFrame are `NaN`. This is caused by customers who were active only in 2010 or only in 2011, so there is no data for the other year. Later we will work on predicting which of our customers will churn, but for now, we will simply drop all customers who are not active in both years. Note that this means our model will predict customer spending in the next year assuming they are still active customers. To remove customers with missing values, we will remove rows where either of the revenue columns is null, as follows:

wrangled_df = wrangled_df[~wrangled_df['2010 revenue'].isnull()]
wrangled_df = wrangled_df[~wrangled_df['2011 revenue'].isnull()]
wrangled_df.head()

Result
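
The `NaN` values come from how `pd.concat` aligns rows on the index. A minimal sketch with invented per-customer totals shows the effect. It filters with `dropna`, which is an alternative to the boolean masks used above, not the article's own code:

```python
import pandas as pd

# Invented totals: customer 2 bought only in 2010, customer 3 only in 2011.
rev_2010 = pd.Series({1: 100.0, 2: 50.0}, name='2010 revenue')
rev_2011 = pd.Series({1: 120.0, 3: 80.0}, name='2011 revenue')

# axis=1 aligns on the index (an outer join), so missing customers become NaN.
both = pd.concat([rev_2010, rev_2011], axis=1)
print(both)

# Keep only customers with revenue in both years.
active_both = both.dropna(subset=['2010 revenue', '2011 revenue'])
print(active_both)
```

Only customer 1 survives the filter, because it is the only one with a value in both columns.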

As a final data cleaning step, it is often a good idea to remove outliers. A common rule of thumb treats any data point more than three standard deviations from the mean as an outlier. Here we use a variant of that rule and remove customers whose 2010 or 2011 revenue is more than three standard deviations above the median:

wrangled_df = wrangled_df[wrangled_df['2011 revenue'] 
                          < ((wrangled_df['2011 revenue'].median()) 
                             + wrangled_df['2011 revenue'].std()*3)]

wrangled_df = wrangled_df[wrangled_df['2010 revenue'] 
                          < ((wrangled_df['2010 revenue'].median()) 
                             + wrangled_df['2010 revenue'].std()*3)]

wrangled_df.head()

Result
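
The cutoff used above can be reproduced with the standard library alone. The following sketch applies the same "median plus three standard deviations" threshold to a list of invented revenues, one of which is deliberately extreme:

```python
import statistics

# Invented yearly revenues; the last value is deliberately extreme.
revenues = [120.0, 150.0, 180.0, 200.0, 210.0, 250.0,
            300.0, 320.0, 400.0, 450.0, 500.0, 50000.0]

# statistics.stdev is the sample standard deviation (n - 1 in the denominator),
# which matches the default of pandas' Series.std().
threshold = statistics.median(revenues) + 3 * statistics.stdev(revenues)

kept = [r for r in revenues if r < threshold]
print(len(revenues), len(kept))
```

Keep in mind that an extreme value inflates the standard deviation itself, so with very small samples this rule can fail to flag even an obvious outlier.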

It is often a good idea, after you have done the data cleanup and feature engineering, to save the new data as a new file so that, as you develop the model, you do not need to run the data through the entire feature engineering and cleanup pipeline every time you want to rerun the code. We can do this using the `to_csv` function.

wrangled_df.to_csv('datasets/wrangled_transactions.csv')

Examining the relationships between the predictors and the outcome.

In this exercise, we will use the features we calculated in the previous exercise and see whether these variables have any relationship with our outcome of interest (customer sales revenue in 2011):

Use pandas to import the data you saved at the end of the last exercise, using CustomerID as the index:

df = pd.read_csv('datasets/wrangled_transactions.csv', index_col='CustomerID')

The seaborn library has a number of plotting functions. Its `pairplot` function plots histograms and pairwise scatter plots of all our variables in a single line of code, allowing us to easily examine both the distributions of our data and the relationships between variables. Use the following code:

import seaborn as sns
%matplotlib inline

sns.pairplot(df)

Result

In the diagram above, the diagonal shows a histogram for each variable, while each off-diagonal panel shows the scatter plot of one variable against another. The bottom row shows the scatter plots of 2011 revenue (our outcome of interest) against each of the other variables. Because the data points overlap and there is a fair amount of variation, the relationships do not appear very clearly in the visualizations.

Therefore, we can use correlations to help us interpret the relationships. The `corr` function of pandas will generate correlations between all the variables in a DataFrame:

df.corr()

Result

Again, we can look at the last row to see the relationships between our predictors and the outcome of interest (2011 revenue). Positive numbers indicate a positive relationship, e.g., the higher a customer's 2010 revenue, the higher their expected revenue in 2011. Negative numbers mean the opposite, e.g., the more days since a customer's last purchase, the lower the expected revenue for 2011. Also, the larger the absolute value, the stronger the relationship.

The resulting correlations should make sense. Customers who spent more in 2010 tend to spend more in 2011, while customers whose last purchase was longer ago tend to spend less.
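
To make the sign and size of a correlation concrete, here is a minimal sketch that computes the Pearson coefficient from its definition. The helper name `pearson` and all the numbers are invented for illustration. They are not values from the dataset:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented values: 2011 spending rises with 2010 revenue
# and falls as the days since the last purchase increase.
revenue_2010 = [100, 200, 300, 400, 500]
revenue_2011 = [110, 190, 320, 410, 480]
days_since_last = [40, 30, 25, 10, 5]

print(pearson(revenue_2010, revenue_2011))     # close to +1
print(pearson(days_since_last, revenue_2011))  # close to -1
```

Real customer data is much noisier than this toy example, so coefficients well below 1 in absolute value are to be expected.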

Building a linear model that predicts customer spending.

In this exercise, we will build a linear model of customer spending using the features created in the previous exercise:

Recall that there is only a weak relationship between `days_since_first_purchase` and 2011 revenue, so we will not include that predictor in our model.

Store the predictor columns and the outcome columns in the X and y variables, respectively:

X = df[['2010 revenue',
       'days_since_last_purchase',
       'number_of_purchases',
       'avg_order_cost'
       ]]

y = df['2011 revenue']

We use sklearn to perform a split of the data, so that we can evaluate the model on a dataset on which it was not trained, as shown here:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100)
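
The idea behind the split can be sketched without sklearn: shuffle the row positions reproducibly and hold out a fraction for testing. The function `simple_split` below is a hypothetical helper written for illustration. It is not how `train_test_split` is implemented and it will not select the same rows. It only mirrors sklearn's default of holding out 25% of the data:

```python
import math
import random

def simple_split(rows, test_fraction=0.25, seed=100):
    """Shuffle row positions with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

# Invented customer IDs standing in for the rows of the DataFrame.
customers = [101, 102, 103, 104, 105, 106, 107, 108]
train_rows, test_rows = simple_split(customers)
print(len(train_rows), len(test_rows))  # 6 2
```

Fixing the seed plays the same role as `random_state` above: rerunning the code gives the same split, so results are comparable between runs.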

We import LinearRegression from sklearn, create a LinearRegression model and fit it to the training data:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)

We examine the coefficients of the model by checking the coef_ property. Note that these are in the same order as our X columns: 2010 revenue, days since last purchase, number of purchases and average order cost:

model.coef_
>> array([  5.78799016,   7.47737544, 336.60769871,  -2.0558923 ])

Check the intercept term of the model by checking the intercept_ property:

model.intercept_
>> 264.8693265705956

Now we can use the fitted model to make predictions about a customer outside our data set.

Make a DataFrame containing a customer's data, where the 2010 revenue is 1,000, the number of days since last purchase is 20, the number of purchases is 2, and the average order cost is 500. Have the model make a prediction on this customer's data:

single_customer = pd.DataFrame({
    '2010 revenue': [1000],
    'days_since_last_purchase': [20],
    'number_of_purchases': [2],
    'avg_order_cost': [500]
})

single_customer

Result

model.predict(single_customer)
>> array([5847.67624446])
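
A linear model's prediction is just the intercept plus each coefficient multiplied by its feature value, so the number above can be checked by hand. The sketch below uses the coefficients and intercept as printed earlier. Because those printed values are rounded, the result agrees with the model's output only to about five decimal places:

```python
# Coefficients and intercept as printed above, in the order of the X columns.
coef = [5.78799016, 7.47737544, 336.60769871, -2.0558923]
intercept = 264.8693265705956

# 2010 revenue, days since last purchase, number of purchases, average order cost.
customer = [1000, 20, 2, 500]

prediction = intercept + sum(c * x for c, x in zip(coef, customer))
print(round(prediction, 2))  # 5847.68
```

Most of the predicted amount comes from the first term: 1,000 of 2010 revenue multiplied by a coefficient of about 5.79.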

We can plot the model predictions in the test set against the actual value. First, we import matplotlib, and make a scatter plot of the model predictions in X_test against y_test.

Constrain the x and y axes to a maximum value of 10,000 so that we have a better view of where most of the data points are located.

Finally, add a line with slope 1, which will serve as our reference: if all points lie on this line, it means that we have a perfect relationship between our predictions and the true response:

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(model.predict(X_test),y_test)
plt.xlim(0,10000)
plt.ylim(0,10000)
plt.plot([0, 10000], [0, 10000], 'r-')  # red reference line with slope 1
plt.xlabel('Model Predictions')
plt.ylabel('True Value')
plt.show()

Result

In the graph above, the red line indicates where the points would be if the prediction were the same as the actual value. Since many of our points are quite far from the red line, this indicates that the model is not completely accurate. However, there does appear to be some relationship, as higher model predictions have higher true values.

To examine the relationship further, we can use correlation. From scipy, we can import the `pearsonr` function, which calculates the correlation between two arrays, just as we did with pandas for our entire DataFrame. We can use it to calculate the correlation between our model predictions and the actual values as follows:

from scipy.stats import pearsonr
pearsonr(model.predict(X_test),y_test)
>> (0.6125740076680493, 1.934002067463782e-20)

You should have two numbers returned: (0.6125740076680493, 1.934002067463782e-20). The first number is the correlation, which is close to 0.6, indicating a moderately strong positive relationship. The second number is the p-value, which indicates the probability of seeing a relationship at least this strong if the two sets of numbers were unrelated; the very low number here means that this relationship is unlikely to be due to chance.
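
One way to build intuition for the p-value is a permutation test: shuffle one of the two sequences many times and count how often a shuffled pairing is at least as strongly correlated as the real one. This is an illustrative stand-in, not what `pearsonr` does, since scipy derives its p-value analytically. The helper names and the data below are invented:

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Share of random re-pairings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_shuffles

# Invented predictions and actual values that clearly move together.
predicted = [100, 250, 300, 420, 510, 640, 700, 850, 900, 1000]
actual = [130, 220, 340, 400, 560, 600, 760, 800, 950, 980]

p_value = permutation_p_value(predicted, actual)
print(p_value)
```

When the two sequences are genuinely related, almost no random re-pairing matches the observed correlation, so the estimated p-value is close to zero.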

Conclusion

We have built a simple example of linear regression. You could try the same exercise with decision trees and compare the resulting models. We will cover how to do this in a later article.

Most Related Articles

10 Highly Probable Data Scientist Interview Questions

The popularity of data science attracts a lot of people from a wide range of professions to make a career change with the goal of becoming a data scientist.Despite the high demand for data scientists, it is a highly challenging task to find your first job. Unless you have a solid prior job experience, interviews are where you can show you skills and impress your potential employer.Data science is an interdisciplinary field which covers a broad range of topics and concepts. Thus, the number of questions that you might be asked at an interview is very high.However, there are some questions about the fundamentals in data science and machine learning. These are the ones you do not want to miss. In this article, we will go over 10 questions that are likely to be asked at a data scientist interview.The questions are grouped into 3 main categories which are machine learning, Python, and SQL. I will try to provide a brief answer for each question. However, I suggest reading or studying each one in more detail afterwards.Machine Learning1. What is overfitting?Overfitting in machine learning occurs when your model is not generalized well. The model is too focused on the training set. It captures a lot of detail or even noise in the training set. Thus, it fails to capture the general trend or the relationships in the data. If a model is too complex compared to the data, it will probably be overfitting.A strong indicator of overfitting is the high difference between the accuracy of training and test sets. Overfit models usually have very high accuracy on the training set but the test accuracy is usually unpredictable and much lower than the training accuracy.2. How can you reduce overfitting?We can reduce overfitting by making the model more generalized which means it should be more focused on the general trend rather than specific details.If it is possible, collecting more data is an efficient way to reduce overfitting. 
You will be giving more juice to the model so it will have more material to learn from. Data is always valuable especially for machine learning models.Another method to reduce overfitting is to reduce the complexity of the model. If a model is too complex for a given task, it will likely result in overfitting. In such cases, we should look for simpler models.3. What is regularization?We have mentioned that the main reason for overfitting is a model being more complex than necessary. Regularization is a method for reducing the model complexity.It does so by penalizing higher terms in the model. With the addition of a regularization term, the model tries to minimize both loss and complexity.Two main types of regularization are L1 and L2 regularization. L1 regularization subtracts a small amount from the weights of uninformative features at each iteration. Thus, it causes these weights to eventually become zero.On the other hand, L2 regularization removes a small percentage from the weights at each iteration. These weights will get closer to zero but never actually become 0.4. What is the difference between classification and clustering?Both are machine learning tasks. Classification is a supervised learning task so we have labelled observations (i.e. data points). We train a model with labelled data and expect it to predict the labels of new data.For instance, spam email detection is a classification task. We provide a model with several emails marked as spam or not spam. After the model is trained with those emails, it will evaluate the new emails appropriately.Clustering is an unsupervised learning task so the observations do not have any labels. The model is expected to evaluate the observations and group them into clusters. Similar observations are placed into the same cluster.In the optimal case, the observations in the same cluster are as close to each other as possible and the different clusters are as far apart as possible. 
An example of a clustering task would be grouping customers based on their shopping behavior.PythonThe built-in data structures are of crucial importance. Thus, you should be familiar with what they are and how to interact with them. List, dictionary, set, and tuple are 4 main built-in data structures in Python.5. What is the difference between lists and tuplesThe main difference between lists and tuples is mutability. Lists are mutable so we can manipulate them by adding or removing items.mylist = [1,2,3] mylist.append(4) mylist.remove(1) print(mylist) [2,3,4]On the other hand, tuples are immutable. Although we can access each element in a tuple, we cannot modify its content.mytuple = (1,2,3) mytuple.append(4) AttributeError: 'tuple' object has no attribute 'append'One important point to mention here is that although tuples are immutable, they can contain mutable elements such as lists or sets.mytuple = (1,2,["a","b","c"]) mytuple[2] ['a', 'b', 'c'] mytuple[2][0] = ["A"] print(mytuple) (1, 2, [['A'], 'b', 'c'])6. What is the difference between lists and setsLet’s do an example to demonstrate the main difference between lists and sets.text = "Python is awesome!" mylist = list(text) myset = set(text) print(mylist) ['P', 'y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '!'] print(myset) {'t', ' ', 'i', 'e', 'm', 'P', '!', 'y', 'o', 'h', 'n', 'a', 's', 'w'}As we notice in the resulting objects, the list contains all the characters in the string whereas the set only contains unique values.Another difference is that the characters in the list are ordered based on their location in the string. However, there is no order associated with the characters in the set.Here is a table that summarizes the main characteristics of lists, tuples, and sets.(image by author)7. What is a dictionary and what are the important features of dictionaries?A dictionary in Python is a collection of key-value pairs. 
It is similar to a list in the sense that each item in a list has an associated index starting from 0.mylist = ["a", "b", "c"] mylist[1] "b"In a dictionary, we have keys as the index. Thus, we can access a value by using its key.mydict = {"John": 24, "Jane": 26, "Ashley": 22} mydict["Jane"] 26The keys in a dictionary are unique which makes sense because they act like an address for the values.SQLSQL is an extremely important skill for data scientists. There are quite a number of companies that store their data in a relational database. SQL is what is needed to interact with relational databases.You will probably be asked a question that involves writing a query to perform a specific task. You might also be asked a question about general database knowledge.8. Query example 1Consider we have a sales table that contains daily sales quantities of products.SELECT TOP 10 * FROM SalesTable(image by author)Find the top 5 weeks in terms of total weekly sales quantities.SELECT TOP 5 CONCAT(YEAR(SalesDate), DATEPART(WEEK, SalesDate)) AS YearWeek, SUM(SalesQty) AS TotalWeeklySales FROM SalesTable GROUP BY CONCAT(YEAR(SalesDate), DATEPART(WEEK, SalesDate)) ORDER BY TotalWeeklySales DESC (image by author)We first extract the year and week information from the date column and then use it in the aggregation. The sum function is used to calculate the total sales quantities.9. Query example 2In the same sales table, find the number of unique items that are sold each month.SELECT MONTH(SalesDate) AS Month, COUNT(DISTINCT(ItemNumber)) AS ItemCount FROM SalesTable GROUP BY MONTH(SalesDate) Month ItemCount 1 9 1021 2 8 102110. What is normalization and denormalization in a database?These terms are related to database schema design. Normalization and denormalization aim to optimize different metrics.The goal of normalization is to reduce data redundancy and inconsistency by increasing the number of tables. On the other hand, denormalization aims to speed up the query execution. 
Denormalization decreases the number of tables but at the same time, it adds some redundancy.ConclusionIt is a challenging task to become a data scientist. It requires time, effort, and dedication. Without having prior job experience, the process gets harder.Interviews are very important to demonstrate your skills. In this article, we have covered 10 questions that you are likely to encounter in a data scientist interview.Thank you for reading. Please let me know if you have any feedback.Soner Yıldırım

Daniel Morales

Jan 26, 2021

Data Science

Machine Learning

Model Evaluation Metrics in Machine Learning

CreditsPredictive models have become a trusted advisor to many businesses and for a good reason. These models can “foresee the future”, and there are many different methods available, meaning any industry can find one that fits their particular challenges.When we talk about predictive models, we are talking either about a regression model (continuous output) or a classification model (nominal or binary output). In classification problems, we use two types of algorithms (dependent on the kind of output it creates):Class output: Algorithms like SVM and KNN create a class output. For instance, in a binary classification problem, the outputs will be either 0 or 1. However, today we have algorithms that can convert these class outputs to probability.Probability output: Algorithms like Logistic Regression, Random Forest, Gradient Boosting, Adaboost, etc. give probability outputs. Converting probability outputs to class output is just a matter of creating a threshold probability.IntroductionWhile data preparation and training a machine learning model is a key step in the machine learning pipeline, it’s equally important to measure the performance of this trained model. 
How well the model generalizes on the unseen data is what defines adaptive vs non-adaptive machine learning models.By using different metrics for performance evaluation, we should be in a position to improve the overall predictive power of our model before we roll it out for production on unseen data.Without doing a proper evaluation of the ML model using different metrics, and depending only on accuracy, it can lead to a problem when the respective model is deployed on unseen data and can result in poor predictions.This happens because, in cases like these, our models don’t learn but instead memorize;hence, they cannot generalize well on unseen data.Model Evaluation MetricsLet us now define the evaluation metrics for evaluating the performance of a machine learning model, which is an integral component of any data science project. It aims to estimate the generalization accuracy of a model on the future (unseen/out-of-sample) data.Confusion MatrixA confusion matrix is a matrix representation of the prediction results of any binary testing that is often used to describe the performance of the classification model (or “classifier”) on a set of test data for which the true values are known.The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.Confusion matrix with 2 class labels.Each prediction can be one of the four outcomes, based on how it matches up to the actual value:True Positive (TP): Predicted True and True in reality.True Negative (TN): Predicted False and False in reality.False Positive (FP): Predicted True and False in reality.False Negative (FN): Predicted False and True in reality.Now let us understand this concept using hypothesis testing.A Hypothesis is speculation or theory based on insufficient evidence that lends itself to further testing and experimentation. 
With further testing, a hypothesis can usually be proven true or false.A Null Hypothesis is a hypothesis that says there is no statistical significance between the two variables in the hypothesis. It is the hypothesis that the researcher is trying to disprove.We would always reject the null hypothesis when it is false, and we would accept the null hypothesis when it is indeed true.Even though hypothesis tests are meant to be reliable, there are two types of errors that can occur.These errors are known as Type 1 and Type II errors.For example, when examining the effectiveness of a drug, the null hypothesis would be that the drug does not affect a disease.Type I Error:- equivalent to False Positives(FP).The first kind of error that is possible involves the rejection of a null hypothesis that is true.Let’s go back to the example of a drug being used to treat a disease. If we reject the null hypothesis in this situation, then we claim that the drug does have some effect on a disease. But if the null hypothesis is true, then, in reality, the drug does not combat the disease at all. The drug is falsely claimed to have a positive effect on a disease.Type II Error:- equivalent to False Negatives(FN).The other kind of error that occurs when we accept a false null hypothesis. This sort of error is called a type II error and is also referred to as an error of the second kind.If we think back again to the scenario in which we are testing a drug, what would a type II error look like? 
A type II error would occur if we accepted that the drug hs no effect on disease, but in reality, it did.A sample python implementation of the Confusion matrix.import warnings import pandas as pd from sklearn import model_selection from sklearn.linear_model import LogisticRegression from sklearn.metrics import confusion_matrix import matplotlib.pyplot as plt %matplotlib inline #ignore warnings warnings.filterwarnings('ignore') # Load digits dataset url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" df = pd.read_csv(url) # df = df.values X = df.iloc[:,0:4] y = df.iloc[:,4] #test size test_size = 0.33 #generate the same set of random numbers seed = 7 #Split data into train and test set. X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed) #Train Model model = LogisticRegression() model.fit(X_train, y_train) pred = model.predict(X_test) #Construct the Confusion Matrix labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] cm = confusion_matrix(y_test, pred, labels) print(cm) fig = plt.figure() ax = fig.add_subplot(111) cax = ax.matshow(cm) plt.title('Confusion matrix') fig.colorbar(cax) ax.set_xticklabels([''] + labels) ax.set_yticklabels([''] + labels) plt.xlabel('Predicted Values') plt.ylabel('Actual Values') plt.show()Confusion matrix with 3 class labels.The diagonal elements represent the number of points for which the predicted label is equal to the true label, while anything off the diagonal was mislabeled by the classifier. Therefore, the higher the diagonal values of the confusion matrix the better, indicating many correct predictions.In our case, the classifier predicted all the 13 setosa and 18 virginica plants in the test data perfectly. However, it incorrectly classified 4 of the versicolor plants as virginica.There is also a list of rates that are often computed from a confusion matrix for a binary classifier:1. 
AccuracyOverall, how often is the classifier correct?Accuracy = (TP+TN)/totalWhen our classes are roughly equal in size, we can use accuracy, which will give us correctly classified values.Accuracy is a common evaluation metric for classification problems. It’s the number of correct predictions made as a ratio of all predictions made.Misclassification Rate(Error Rate): Overall, how often is it wrong. Since accuracy is the percent we correctly classified (success rate), it follows that our error rate (the percentage we got wrong) can be calculated as follows:Misclassification Rate = (FP+FN)/totalWe use the sklearn module to compute the accuracy of a classification task, as shown below.#import modules import warnings import pandas as pd import numpy as np from sklearn import model_selection from sklearn.linear_model import LogisticRegression from sklearn import datasets from sklearn.metrics import accuracy_score #ignore warnings warnings.filterwarnings('ignore') # Load digits dataset iris = datasets.load_iris() # # Create feature matrix X = iris.data # Create target vector y = iris.target #test size test_size = 0.33 #generate the same set of random numbers seed = 7 #cross-validation settings kfold = model_selection.KFold(n_splits=10, random_state=seed) #Model instance model = LogisticRegression() #Evaluate model performance scoring = 'accuracy' results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring) print('Accuracy -val set: %.2f%% (%.2f)' % (results.mean()*100, results.std())) #split data X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed) #fit model model.fit(X_train, y_train) #accuracy on test set result = model.score(X_test, y_test) print("Accuracy - test set: %.2f%%" % (result*100.0))The classification accuracy is 88% on the validation set.2. 
PrecisionWhen it predicts yes, how often is it correct?Precision=TP/predicted yesWhen we have a class imbalance, accuracy can become an unreliable metric for measuring our performance. For instance, if we had a 99/1 split between two classes, A and B, where the rare event, B, is our positive class, we could build a model that was 99% accurate by just saying everything belonged to class A. Clearly, we shouldn’t bother building a model if it doesn’t do anything to identify class B; thus, we need different metrics that will discourage this behavior. For this, we use precision and recall instead of accuracy.3. Recall or SensitivityWhen it’s actually yes, how often does it predict yes?True Positive Rate = TP/actual yesRecall gives us the true positive rate (TPR), which is the ratio of true positives to everything positive.In the case of the 99/1 split between classes A and B, the model that classifies everything as A would have a recall of 0% for the positive class, B (precision would be undefined — 0/0). Precision and recall provide a better way of evaluating model performance in the face of a class imbalance. They will correctly tell us that the model has little value for our use case.Just like accuracy, both precision and recall are easy to compute and understand but require thresholds. Besides, precision and recall only consider half of the confusion matrix:4. F1 ScoreThe F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.Why harmonic mean? Since the harmonic mean of a list of numbers skews strongly toward the least elements of the list, it tends (compared to the arithmetic mean) to mitigate the impact of large outliers and aggravate the impact of small ones.An F1 score punishes extreme values more. 
Ideally, an F1 Score could be an effective evaluation metric in the following classification scenarios:

When FP and FN are equally costly, meaning that missing true positives and raising false positives affect the model in almost the same way, as in our cancer detection classification example

When adding more data doesn't effectively change the outcome

When TN is high (like with flood predictions, cancer predictions, etc.)

A sample Python implementation of the F1 score.

import warnings
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
# assumes the CSV has no header row
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:,:-1]
y = dat[:,-1]
test_size = 0.33
seed = 7
model = LogisticRegression()

# split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)
# predict class labels for the test set
pred = model.predict(X_test)

# precision: tp / (tp + fp)
precision = precision_score(y_test, pred)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_test, pred)
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_test, pred)
print('F1 score: %f' % f1)

5. Specificity

When it's no, how often does it predict no?

True Negative Rate = TN/actual no

It is the true negative rate, or the proportion of true negatives to everything that should have been classified as negative.

Note that, together, specificity and sensitivity consider the full confusion matrix.

6. Receiver Operating Characteristics (ROC) Curve

Measuring the area under the ROC curve is also a very useful method for evaluating a model.
By plotting the true positive rate (sensitivity) versus the false positive rate (1 - specificity), we get the Receiver Operating Characteristic (ROC) curve. This curve allows us to visualize the trade-off between the true positive rate and the false positive rate.

In a ROC plot, the dashed diagonal line represents random guessing (no predictive value) and is used as a baseline; anything below it is considered worse than guessing. We want to be toward the top-left corner.

A sample Python implementation of the ROC curve.

# Classification: area under the ROC curve
import warnings
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
# assumes the CSV has no header row
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:,:-1]
y = dat[:,-1]
test_size = 0.33
seed = 7
model = LogisticRegression()

# split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)

# predict probabilities
probs = model.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
auc = roc_auc_score(y_test, probs)
print('AUC - Test Set: %.2f%%' % (auc*100))

# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
# plot no skill
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
plt.xlabel('False positive rate')
plt.ylabel('Sensitivity / Recall')
# show the plot
plt.show()

In the example above, the AUC is relatively close to 1 and greater than 0.5.
A perfect classifier's ROC curve runs up the Y-axis and then horizontally along the top of the plot.

Log Loss

Log Loss is one of the most important classification metrics based on probabilities.

As the predicted probability of the true class gets closer to zero, the loss grows rapidly and without bound.

It measures the performance of a classification model where the prediction input is a probability value between 0 and 1. Log loss increases as the predicted probability diverges from the actual label. The goal when training a model is to minimize this value. As such, a smaller log loss is better, with a perfect model having a log loss of 0.

A sample Python implementation of the Log Loss.

# Classification: log loss
import warnings
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
# assumes the CSV has no header row
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:,:-1]
y = dat[:,-1]
test_size = 0.33
seed = 7
model = LogisticRegression()

# split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)

# predict class probabilities and compute log loss
probs = model.predict_proba(X_test)
loss = log_loss(y_test, probs)
print("Logloss: %.2f" % loss)

Log loss is meant to be computed on predicted probabilities. When hard 0/1 labels from model.predict are passed instead, the reported value was Logloss: 8.02, which mostly reflects the heavy penalty for fully confident wrong predictions.

Jaccard Index

The Jaccard Index is one of the simplest ways to calculate the accuracy of a classification ML model. Let's understand it with an example. Suppose we have a labeled test set, with labels:

y = [0,0,0,0,0,1,1,1,1,1]

And our model has predicted the labels as:

y1 = [1,1,0,0,0,1,1,1,1,1]

A Venn diagram of the test-set labels and the predicted labels shows their intersection and union.

The Jaccard Index, or Jaccard similarity coefficient, is a statistic used in understanding the similarities between sample sets.
The measurement emphasizes the similarity between finite sample sets and is formally defined as the size of the intersection divided by the size of the union of the two labeled sets:

Jaccard Index (Intersection over Union, IoU) = |y ∩ y1| / |y ∪ y1| = |y ∩ y1| / (|y| + |y1| - |y ∩ y1|)

So, for our example, the intersection of the two sets is equal to 8 (since eight values are predicted correctly) and the union is 10 + 10 - 8 = 12. The Jaccard index therefore gives us:

Jaccard Index = 8/12 ≈ 0.67

So, the accuracy of our model, according to the Jaccard Index, is about 0.67, or 67%.

The higher the Jaccard index, the higher the accuracy of the classifier.

A sample Python implementation of the Jaccard index. Note that this helper treats its inputs as sets of distinct values, so it illustrates the set-based definition rather than the element-wise label comparison above.

import numpy as np

def compute_jaccard_similarity_score(x, y):
    # size of the intersection of the two sets of values
    intersection_cardinality = len(set(x).intersection(set(y)))
    # size of the union of the two sets of values
    union_cardinality = len(set(x).union(set(y)))
    return intersection_cardinality / float(union_cardinality)

score = compute_jaccard_similarity_score(np.array([0, 1, 2, 5, 6]), np.array([0, 2, 3, 5, 7, 9]))
print("Jaccard Similarity Score : %s" % score)

Jaccard Similarity Score : 0.375

Kolmogorov-Smirnov chart

The K-S or Kolmogorov-Smirnov chart measures the performance of classification models. More accurately, K-S is a measure of the degree of separation between the positive and negative distributions.

In a K-S chart, the cumulative frequency for the observed and hypothesized distributions is plotted against the ordered frequencies, and a vertical double arrow indicates the maximal vertical difference.

The K-S is 100 if the scores partition the population into two separate groups in which one group contains all the positives and the other all the negatives. On the other hand, if the model cannot differentiate between positives and negatives, then it is as if the model selects cases randomly from the population.
The K-S would be 0.

In most classification models the K-S will fall between 0 and 100, and the higher the value, the better the model is at separating the positive from the negative cases.

The K-S may also be used to test whether two underlying one-dimensional probability distributions differ. It is a very efficient way to determine if two samples are significantly different from each other.

A sample Python implementation of the Kolmogorov-Smirnov test.

from scipy.stats import kstest
import random

# number of random values to draw
N = 10
# draw N values uniformly from [0, 1)
actual = [random.random() for _ in range(N)]
print(actual)

# one-sample K-S test against the standard normal distribution
x = kstest(actual, "norm")
print(x)

The null hypothesis used here assumes that the numbers follow the standard normal distribution. kstest returns the test statistic and the p-value. If the p-value is < alpha, we reject the null hypothesis.

Alpha is defined as the probability of rejecting the null hypothesis given that the null hypothesis (H0) is true. For most practical applications, alpha is chosen as 0.05.

Gain and Lift Chart

Gain or Lift is a measure of the effectiveness of a classification model, calculated as the ratio between the results obtained with and without the model. Gain and lift charts are visual aids for evaluating the performance of classification models. However, in contrast to the confusion matrix, which evaluates models on the whole population, a gain or lift chart evaluates model performance in a portion of the population.

The higher the lift (i.e. the further up it is from the baseline), the better the model.

In the example gains chart, run on a validation set, with 50% of the data the model captures 90% of targets, and adding more data yields only a negligible increase in the percentage of targets captured.

Lift charts are often shown as a cumulative lift chart, which is also known as a gains chart.
Therefore, gains charts are sometimes (perhaps confusingly) called "lift charts", but they are more accurately cumulative lift charts.

One of their most common uses is in marketing, to decide whether a prospective client is worth calling.

Gini Coefficient

The Gini coefficient or Gini Index is a popular metric for imbalanced class values. The coefficient ranges from 0 to 1, where 0 represents perfect equality and 1 represents perfect inequality. Here, if the value of the index is higher, the data will be more dispersed.

The Gini coefficient can be computed from the area under the ROC curve (AUC) using the following formula:

Gini Coefficient = (2 * AUC) - 1

Conclusion

Understanding how well a machine learning model is going to perform on unseen data is the ultimate purpose behind working with these evaluation metrics. Metrics like accuracy, precision, and recall are good ways to evaluate classification models for balanced datasets, but if the data is imbalanced and there's a class disparity, then other methods like ROC/AUC and the Gini coefficient perform better in evaluating the model performance.

Well, this concludes this article. I hope you have enjoyed reading it. Feel free to share your comments, thoughts, and feedback in the comment section.

Thanks for reading!

Juan Guillermo Gómez Ramírez

Jan 26, 2021

Python

Top 10 Python Extensions for Visual Studio Code

In this new post we want to talk about the most useful Python extensions for Visual Studio Code. Visual Studio Code is an integrated development environment created by Microsoft for Windows, Linux and macOS. Among its features are debugging, syntax highlighting, smart code completion, snippets, code refactoring and integrated Git. Users can change the theme, keyboard shortcuts and preferences, and install extensions that add additional functionality.

Here we are going to talk about the extensions you can install for VS Code. This is a list of our favorites.

1- Python

Link: https://github.com/Microsoft/vscode-python

Python extension for Visual Studio Code

A Visual Studio Code extension with rich support for the Python language (for all actively supported versions of the language: >=3.6), including features such as IntelliSense (Pylance), linting, debugging, code navigation, code formatting, refactoring, variable explorer, test explorer, and more!

NOTE: Web support (e.g., github.dev) is limited.

Installed extensions

The Python extension will automatically install the Pylance and Jupyter extensions to give you the best experience when working with Python files and Jupyter notebooks. However, Pylance is an optional dependency, which means that the Python extension will remain fully functional if it is not installed. You can also uninstall it, at the expense of some features, if you are using a different language server.

2- Python Indent

Link: https://github.com/kbrose/vsc-python-indent

It is used to correct Python indentation in Visual Studio Code.

How it works

Every time you press the Enter key in a Python context, this extension will parse your Python file down to the location of your cursor, and determine exactly how much to indent the next line (or two in the case of hanging indents) and how much to indent nearby lines.

There are three main cases when determining the correct indent.
Review the documentation here: https://github.com/kbrose/vsc-python-indent

3- Python Docstring Generator

Link: https://github.com/NilsJPWerner/autoDocstring

Visual Studio Code extension to quickly generate docstrings for Python functions.

Features

Quickly generate a docstring snippet that can be tabbed through.

Choose from several types of docstring formats.

Infer parameter types via PEP 484 type hints, default values and variable names.

Support for args, kwargs, decorators, errors and parameter types.

Docstring Formats

Google (default)

docBlockr

Numpy

Sphinx

PEP0257 (coming soon)

Usage

The cursor must be on the line directly below the definition to generate a complete auto-populated docstring.

Press enter after opening the docstring with triple quotes (""" or ''')

Keyboard shortcut: ctrl+shift+2, or cmd+shift+2 for Mac. It can be changed in Preferences -> Keyboard shortcuts -> extension.generateDocstring

Command: Generate Docstring

Right-click menu: Generate Docstring

Also read: 4 Must-Know Python Pandas Functions for Time Series Analysis

4- Python Extended

Link: https://github.com/tushortz/vscode-Python-Extended

Python Extended is a VS Code snippet extension that makes it easy to write Python code by providing completion options along with all arguments.

Usage

Run VS Code and, in a Python file, type the name of the method to complete and press tab or enter on the selection.

How to install

Open VS Code. Press F1 and type "ext install" followed by the extension name, in this case "ext install Python Extended" (without the ">"). Or, if you prefer, type ">ext install", press enter, and search for "Python Extended".

5- Python Preview

Link: https://github.com/dongli0x00/python-preview

A Visual Studio Code extension with debug preview support for the Python language.

Requirements

Install a version of Python 3.6 or Python 2.7.
Make sure that the location of your Python interpreter is included in your PATH environment variable.

It is best to install the Python extension for Python IntelliSense.

6- AREPL for Python

Link: https://github.com/almenon/arepl-vscode

AREPL automatically evaluates Python code in real time as you type.

Usage

First, make sure you have Python 3.7 or higher installed.

Open a Python file and click on the cat in the top right bar to open AREPL. You can click the cat again to close it.

Or run AREPL via the command search (control-shift-p), or use the shortcuts: control-shift-a (current document) / control-shift-q (new document)

Features

Real-time evaluation: no need to run; AREPL evaluates your code automatically. You can control this (or even disable it) in the settings.

Variable display: the final state of your local variables is displayed in a collapsible JSON format.

Error display: the moment you make a mistake, an error is displayed with the stack trace.

Settings: AREPL offers many settings to suit your user experience.
Customize the look and feel, debounce time, Python options and much more.

Also read: 3 Python Tricks That Will Improve Your Code

7- Python Path

Link: https://github.com/mgesbert/vscode-python-path

This extension adds a set of tools to help generate internal import statements in a Python project.

Features

"Copy Python Path" is accessible from:

Command line

Explorer context menu

Editor context menu

Editor title context menu

8- Python Test Explorer

Link: https://github.com/kondratyev-nv/vscode-python-test-adapter

This extension allows you to run your Python Unittest, Pytest or Testplan tests with the Test Explorer user interface.

How to get started

Install the extension

Configure Visual Studio Code to discover your tests (see the Configuration section and the documentation for the test framework of your choice: Unittest documentation, Pytest documentation, Testplan documentation)

Open the sidebar of the test view

Execute your tests via the Run icon in the Test Explorer

Features

Displays a Test Explorer in the test view in the VS Code sidebar with all detected tests and suites and their status

Convenient error reporting during test detection

Unittest, Pytest and Testplan debugging

Displays the log of a failed test when the test is selected in the explorer

Test rerun when saving tests

Supports multi-root workspaces

Supports Unittest, Pytest and Testplan test frameworks and their plugins

9- Python Snippets

Link: https://github.com/ylcnfrht/vscode-python-snippet-pack

A snippet package to make working with Python more productive. This snippet package contains all of the following Python methods:

All built-in Python snippets, with at least one example for each method

All Python string snippets, with at least one example for each method

All Python list snippets, with at least one example for each method

All Python set snippets, with at least one example for each method

All Python tuple snippets, with at least one example for each method

All Python dictionary snippets, with at least one example for each method

It also contains many other code snippets (such as if/else, for, while, while/else, try/catch, file process) and class snippets and class examples for OOP (polymorphism, encapsulation, inheritance, etc.).

If you have never used a method, don't worry: this extension contains a lot of code examples for each Python method.

This extension is not just a code snippet pack; it will also be useful for learning the Python programming language.

You will learn all Python methods with a lot of code examples.

For example, if you want to use the string replacement method, you just need to use .replace.

But if you don't know how to use the replace method, then use string.replace=>

10- Jupyter

Link: https://github.com/Microsoft/vscode-jupyter

A Visual Studio Code extension that provides basic notebook support for language kernels that are compatible with Jupyter Notebooks today. Many language kernels will work without any modifications. To enable advanced features, modifications to the VS Code language extensions may be necessary.

Notebook support

The Jupyter extension uses VS Code's built-in notebook support. This interface offers a number of advantages to notebook users:

Out-of-the-box support for VS Code's wide range of basic code editing functions, such as hot exit, search and replace, and code folding.

Editor extensions such as VIM, bracket coloring, linters and many more are available while editing a cell.

Deep integration with the general workbench and file-based features of VS Code, such as outline view (table of contents), breadcrumbs, and other operations.

Fast load times for Jupyter notebook (.ipynb) files. Any notebook file is loaded and rendered as quickly as possible, while execution-related operations are initialized behind the scenes.

Includes a notebook diff tool, which makes it easy to compare and visualize differences between code cells, results and metadata.

Extensibility beyond what the Jupyter extension provides.
Extensions can now add their own specific language or runtime to notebooks, such as the .NET and Gather interactive notebooks.

Although the Jupyter extension comes with a comprehensive set of the most commonly used renderers for output, the marketplace supports installable custom renderers to make working with your notebooks even more productive. To get started writing your own, check out the VS Code renderer API documentation.

You can also read data science posts in Spanish here.

Conclusion

There are many extensions that you can use with Visual Studio Code, and deciding which ones to use will involve testing, reviewing utilities, use cases and so on, in order to make your work easier while coding!

Also read: Why Decorators In Python Are Pure Genius?

Daniel Morales

Jan 26, 2021

Machine Learning

The Role of AI in Unstructured Data Mining: Challenges and Opportunities

In our fast-paced digital world, we're producing staggering volumes of data every day. This data falls into two key categories: structured, known for its order and efficiency, and unstructured, a captivating puzzle brimming with untapped potential.

In this article, we will uncover how AI confronts the complexities of unstructured data, the hurdles it faces, and the intriguing opportunities it opens up to businesses in any kind of industry.

Understanding Unstructured Data

Unstructured data mining is the technique of extracting valuable and meaningful insights from an abundant well of unstructured data. It uncovers hidden gems of knowledge, making it a crucial pursuit in our data-rich era.

In today's digital realm, unstructured data is generated in unprecedented quantities. Billions of text documents, images, and videos come to life daily, creating a treasure trove of information just waiting for organizations to explore.

Unlocking the insights hidden within unstructured data can provide organizations with a competitive edge. This data can reveal customer sentiments, emerging trends, and valuable feedback that might otherwise go unnoticed.

The Basics of Data Mining

Data mining works by discovering patterns, trends, and valuable information within a dataset. It involves various techniques to extract knowledge from raw data. While it's exceptionally effective with structured data, applying data mining to unstructured data requires a unique set of skills and tools.

Unstructured Data Mining

Unstructured data mining is a method focused on the extraction of valuable information from the vast, unstructured data available. This process uncovers hidden insights, making it a valuable endeavor in today's data-driven world.

The AI Revolution

The AI revolution has given rise to an exciting era of possibilities in unstructured data mining.
AI's remarkable capabilities are instrumental in taming the unstructured data landscape, and it involves a multitude of components, including:

Machine learning enables AI systems to learn from data, make predictions, and identify patterns, enhancing data mining capabilities.

Deep learning uses neural networks to model complex patterns in unstructured data, which is particularly valuable in image and speech recognition.

Sentiment analysis gauges emotional tones within textual data, helping to understand public opinion and tailor strategies.

Pattern recognition identifies recurring structures in data, aiding in image processing and text mining.

Knowledge graphs structure data relationships, improving contextual understanding and data retrieval.

Anomaly detection identifies outliers in data, which is essential for fraud detection and data security.

Challenges in Unstructured Data Mining

As promising as AI is at handling unstructured data, it's not without its set of challenges. Here, we delve into some of the major hurdles:

Data Quality

Unstructured data is inherently messy. It's laden with errors, inconsistencies, and biases, which makes it a challenge to extract meaningful insights from this data. AI systems need to be trained rigorously to navigate and decipher this diversity in data quality. Techniques like data cleansing, normalization, and the use of context are essential in ensuring that AI systems provide accurate results.

Scalability

As the volume of unstructured data grows, AI systems must scale to handle the data influx effectively. Traditional hardware and algorithms might not be sufficient to handle it. Scalable infrastructure and distributed computing become crucial to ensuring that AI systems can process and analyze vast amounts of data efficiently.

Privacy Concerns

Mining unstructured data often raises ethical questions regarding privacy and data protection.
That's why it's essential to strike the right balance between data utilization and respecting individual privacy. It's a challenge to ensure that AI systems are used responsibly and in compliance with data protection laws and regulations, such as GDPR in Europe. Techniques like anonymization and consent management play a vital role in addressing these privacy concerns.

Opportunities and Applications

AI's role in unstructured data mining has opened up a world of opportunities across various industries. Let's explore some of the most promising applications:

Customer Insights

Unstructured data, particularly data sourced from social media and customer reviews, serves as a goldmine of information on customer behavior and preferences. By leveraging AI algorithms, companies can analyze sentiments, spot emerging trends, and even forecast future buying patterns. With these insights, they can fine-tune their marketing strategies, product development, and customer service to align with their ever-evolving audience's demands.

Healthcare Diagnosis

The abundance of unstructured data found in medical records, radiological images, and wearable device data holds the key to transformative advancements. AI-powered systems, known for their proficiency in the analysis of this data, not only facilitate early disease detection but also provide highly individualized treatment plans, ultimately raising the standard of patient care. For example, AI expedites the process of analyzing medical images for anomalies, resulting in a significant reduction in the time required for diagnosing and treating severe conditions.

Fraud Detection

When it comes to financial institutions, AI is a vital tool for exposing fraudulent activities that often hide within the vast volumes of unstructured transaction data. Through a meticulous examination of transaction patterns and anomalies, AI systems can rapidly pinpoint fraudulent actions, providing businesses with a robust defense against significant financial losses.
The ability to detect and thwart fraud in real time provides a critical advantage, resulting in annual savings of billions of dollars for businesses.

Conclusion

The future belongs to those who embrace the AI revolution in unstructured data mining. In this future, data isn't just information; it's the key to success. So, let's move forward, embracing this tomorrow, where possibilities are limitless and opportunities are endless.

nikos_datasource

Jan 26, 2021

Building A Linear Regression Model With Python To Predict Retail Customer Spending

Contents Outline

Daniel Morales

Building A Linear Regression Model With Python To Predict Retail Customer Spending

Examining the relationships between the predictors and the outcome.

Building a linear model that predicts customer spending.

Conclusion

