Deep learning is a hot topic these days. But what is it that makes it special and sets it apart from other aspects of machine learning? That is a deep question (pardon the pun). To even begin to answer it, we will need to learn the basics of neural networks.

Neural networks are the workhorses of deep learning. And while they may look like black boxes, deep down (sorry, I will stop the terrible puns) they are trying to accomplish the same thing as any other model — to make good predictions.

In this post, we will explore the ins and outs of a simple neural network. And by the end, hopefully you (and I) will have gained a deeper and more intuitive understanding of how neural networks do what they do.

The 30,000 Feet View

Let’s start with a high-level overview so we know what we are working with. Neural networks are multi-layer networks of neurons (the blue and magenta nodes in the chart below) that we use to classify things, make predictions, etc. Below is the diagram of a simple neural network with five inputs, five outputs, and two hidden layers of neurons.

Neural network with two hidden layers

Starting from the left, we have:

The input layer of our model in orange.
Our first hidden layer of neurons in blue.
Our second hidden layer of neurons in magenta.
The output layer (a.k.a. the prediction) of our model in green.

The arrows that connect the dots show how all the neurons are interconnected and how data travels from the input layer all the way through to the output layer.

Later we will calculate each output value step by step. We will also watch how the neural network learns from its mistakes using a process known as backpropagation.

Getting our Bearings

But first let’s get our bearings. What exactly is a neural network trying to do? Like any other model, it’s trying to make a good prediction. We have a set of inputs and a set of target values — and we are trying to get predictions that match those target values as closely as possible.

Forget for a second the more complicated looking picture of the neural network I drew above and focus on this simpler one below.

Logistic regression (with only one feature) implemented via a neural network

This is a single feature logistic regression (we are giving the model only one X variable) expressed through a neural network (if you need a refresher on logistic regression, I wrote about that here). To see how they connect we can rewrite the logistic regression equation using our neural network color codes.

Logistic regression equation

Let’s examine each element:

X (in orange) is our input, the lone feature that we give to our model in order to calculate a prediction.
B1 (in turquoise, a.k.a. blue-green) is the estimated slope parameter of our logistic regression — B1 tells us by how much the Log_Odds change as X changes. Notice that B1 lives on the turquoise line, which connects the input X to the blue neuron in Hidden Layer 1.
B0 (in blue) is the bias, which is very similar to the intercept term from regression. The key difference is that in neural networks, every neuron has its own bias term (while in regression, the model has a single intercept term).
The blue neuron also includes a sigmoid activation function (denoted by the curved line inside the blue circle). Remember the sigmoid function is what we use to go from log-odds to probability (do a control-f search for “sigmoid” in my previous post).
And finally we get our predicted probability by applying the sigmoid function to the quantity (B1*X + B0).
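
To make the pieces above concrete, here is a minimal sketch of that single-neuron network in Python. The function names and the values chosen for B1 and B0 are my own assumptions, picked only for illustration. Nothing here is fitted to data.

```python
import math


def sigmoid(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, b1, b0):
    """One connection (weight b1) feeding one neuron (bias b0, sigmoid activation)."""
    log_odds = b1 * x + b0
    return sigmoid(log_odds)


# Hypothetical parameter values, chosen only for illustration.
b1, b0 = 0.8, -1.0

for x in (-2.0, 0.0, 1.25, 4.0):
    probability = predict_probability(x, b1, b0)
    print(f"X = {x:5.2f} -> predicted probability = {probability:.3f}")
```

With these made-up parameters, X = 1.25 gives log-odds of exactly zero, so the predicted probability is 0.5. Larger values of X push the prediction towards 1.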

Not too bad, right? So let’s recap. A super simple neural network consists of just the following components:

A connection with a weight “living inside it” that transforms your input (using B1) and passes it to the neuron. In practice, a neuron will generally have multiple incoming connections, each with its own weight.
A neuron that includes a bias term (B0) and an activation function (sigmoid in our case).

And these two objects are the fundamental building blocks of the neural network. More complex neural networks are just models with more hidden layers and that means more neurons and more connections between neurons. And this more complex web of connections (and weights and biases) is what allows the neural network to “learn” the complicated relationships hidden in our data.

Let’s Add a Bit of Complexity Now

Now that we have our basic framework, let’s go back to our slightly more complicated neural network and see how it goes from input to output. Here it is again for reference:

Our slightly more complicated neural network

The first hidden layer consists of two neurons. So to connect all five inputs to the neurons in Hidden Layer 1, we need ten connections. The next image (below) shows just the connections between Input 1 and Hidden Layer 1.

The connections between Input 1 and Hidden Layer 1

Note our notation for the weights that live in the connections — W1,1 denotes the weight that lives in the connection between Input 1 and Neuron 1 and W1,2 denotes the weight in the connection between Input 1 and Neuron 2. So the general notation that I will follow is Wa,b denotes the weight on the connection between Input a (or Neuron a) and Neuron b.

Now let’s calculate the outputs of each neuron in Hidden Layer 1 (known as the activations). We do so using the following formulas (W denotes weight, In denotes input).

Z1 = W1,1*In1 + W2,1*In2 + W3,1*In3 + W4,1*In4 + W5,1*In5 + Bias_Neuron1
Neuron 1 Activation = Sigmoid(Z1)

We can use matrix math to summarize this calculation (remember our notation rules — for example, W4,2 denotes the weight that lives in the connection between Input 4 and Neuron 2):

Matrix math makes our life easier

For any layer of a neural network where the prior layer is m elements deep and the current layer is n elements deep, this generalizes to:

[W] @ [X] + [Bias] = [Z]

Where [W] is your n by m matrix of weights (the connections between the prior layer and the current layer), [X] is your m by 1 matrix of either starting inputs or activations from the prior layer, [Bias] is your n by 1 matrix of neuron biases, and [Z] is your n by 1 matrix of intermediate outputs. In the previous equation, I follow Python notation and use @ to denote matrix multiplication. Once we have [Z], we can apply the activation function (sigmoid in our case) to each element of [Z] and that gives us our neuron outputs (activations) for the current layer.
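
Here is a rough NumPy sketch of that formula for Hidden Layer 1 (m = 5 inputs, n = 2 neurons). The random values stand in for real weights, biases, and inputs, and the variable names are my own. It also checks the matrix result for Neuron 1 against the element-by-element formula.

```python
import numpy as np


def sigmoid(z):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
m, n = 5, 2                      # five inputs, two neurons in Hidden Layer 1

# Placeholder values; in a real network these would be learned or observed.
W = rng.normal(size=(n, m))      # row b holds the weights W1,b ... W5,b going into Neuron b
X = rng.normal(size=(m, 1))      # column of inputs In1 ... In5
bias = rng.normal(size=(n, 1))   # one bias per neuron

Z = W @ X + bias                 # n by 1 matrix of intermediate outputs
activations = sigmoid(Z)         # n by 1 matrix of neuron outputs

# The same Z for Neuron 1, written out term by term.
z1_by_hand = sum(W[0, a] * X[a, 0] for a in range(m)) + bias[0, 0]

print("Z shape:", Z.shape)
print("Z1 from matrix math:", Z[0, 0])
print("Z1 term by term:    ", z1_by_hand)
```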

Finally before we move on, let’s visually map each of these elements back onto our neural network chart to tie it all up ([Bias] is embedded in the blue neurons).

Visualizing [W], [X], and [Z]

By repeatedly calculating [Z] and applying the activation function to it for each successive layer, we can move from input to output. This process is known as forward propagation. Now that we know how the outputs are calculated, it’s time to start evaluating the quality of the outputs and training our neural network.
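
As a sketch, forward propagation is just a loop over the layers. The layer sizes below follow our chart (five inputs, two neurons in Hidden Layer 1, five outputs). The size of Hidden Layer 2 (three neurons) and all the random parameter values are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Element-wise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))


def forward_propagate(x, weights, biases):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b      # [W] @ [X] + [Bias] = [Z]
        activations.append(sigmoid(z))   # activation feeds the next layer
    return activations


# Assumed layer sizes: input, Hidden Layer 1, Hidden Layer 2, output.
layer_sizes = [5, 2, 3, 5]
rng = np.random.default_rng(1)

# One n by m weight matrix and one n by 1 bias matrix per non-input layer.
weights = [rng.normal(size=(n, m)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.normal(size=(n, 1)) for n in layer_sizes[1:]]

x = rng.normal(size=(layer_sizes[0], 1))
activations = forward_propagate(x, weights, biases)

for layer_index, a in enumerate(activations):
    print(f"layer {layer_index}: shape {a.shape}")
```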

Time for the Neural Network to Learn

This is going to be a long post so feel free to take a coffee break now. Still with me? Awesome! Now that we know how a neural network’s output values are calculated, it is time to train it.

The training process of a neural network, at a high level, is like that of many other data science models — define a cost function and use gradient descent optimization to minimize it.

First let’s think about what levers we can pull to minimize the cost function. In traditional linear or logistic regression we are searching for beta coefficients (B0, B1, B2, etc.) that minimize the cost function. For a neural network, we are doing the same thing but at a much larger and more complicated scale.

In traditional regression, we can change any particular beta in isolation without impacting the other beta coefficients. So by applying small isolated shocks to each beta coefficient and measuring its impact on the cost function, it is relatively straightforward to figure out in which direction we need to move to reduce and eventually minimize the cost function.

Five feature logistic regression implemented via a neural network

In a neural network, changing the weight of any one connection (or the bias of a neuron) has a reverberating effect across all the other neurons and their activations in the subsequent layers.

That’s because each neuron in a neural network is like its own little model. For example, if we wanted a five feature logistic regression, we could express it through a neural network, like the one on the left, using just a single neuron!

So each hidden layer of a neural network is basically a stack of models (each individual neuron in the layer acts like its own model) whose outputs feed into even more models further downstream (each successive hidden layer of the neural network holds yet more neurons).

The Cost Function

So given all this complexity, what can we do? It’s actually not that bad. Let’s take it step by step. First, let me clearly state our objective. Given a set of training inputs (our features) and outcomes (the target we are trying to predict):

We want to find the set of weights (remember that each connecting line between any two elements in a neural network houses a weight) and biases (each neuron houses a bias) that minimize our cost function — where the cost function is an approximation of how wrong our predictions are relative to the target outcome.

For training our neural network, we will use Mean Squared Error (MSE) as the cost function:

MSE = Sum [ ( Prediction - Actual )² ] * (1 / num_observations)

The MSE of a model tells us on average how wrong we are but with a twist: by squaring the errors of our predictions before averaging them, we punish predictions that are way off much more severely than ones that are just slightly off. The cost functions of linear regression and logistic regression operate in a very similar manner.
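
A small sketch of the MSE formula in plain Python shows the “twist” in action. The two sets of predictions are made up. Both have the same total absolute error (0.4), but one spreads it evenly while the other concentrates it in a single bad miss.

```python
def mean_squared_error(predictions, actuals):
    """MSE = Sum[(Prediction - Actual)^2] * (1 / num_observations)."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sum(squared_errors) / len(squared_errors)


# Made-up targets and predictions, for illustration only.
actuals = [1.0, 0.0, 1.0, 0.0]
slightly_off_everywhere = [0.9, 0.1, 0.9, 0.1]   # four errors of 0.1
one_big_miss = [1.0, 0.0, 1.0, 0.4]              # one error of 0.4

print(f"Evenly spread errors: {mean_squared_error(slightly_off_everywhere, actuals):.4f}")
print(f"One big miss:         {mean_squared_error(one_big_miss, actuals):.4f}")
```

The single large miss produces an MSE four times larger than the evenly spread errors, even though the total absolute error is identical.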

OK cool, we have a cost function to minimize. Time to fire up gradient descent, right?

Not so fast: to use gradient descent, we need to know the gradient of our cost function, the vector that points in the direction of steepest increase (we want to repeatedly take steps in the opposite direction of the gradient to eventually arrive at the minimum).

Except in a neural network we have so many changeable weights and biases that are all interconnected. How will we calculate the gradient of all of that? In the next section, we will see how backpropagation helps us deal with this problem.

Quick Review of Gradient Descent

The gradient of a function is the vector whose elements are its partial derivatives with respect to each parameter. For example, if we were trying to minimize a cost function, C(B0, B1), with just two changeable parameters, B0 and B1, the gradient would be:

Gradient of C(B0, B1) = [ [dC/dB0], [dC/dB1] ]

So each element of the gradient tells us how the cost function would change if we applied a small change to that particular parameter — so we know what to tweak and by how much. To summarize, we can march towards the minimum by following these steps:

Illustration of Gradient Descent

Compute the gradient of our “current location” (calculate the gradient using our current parameter values).
Modify each parameter by an amount proportional to its gradient element and in the opposite direction of its gradient element. For example, if the partial derivative of our cost function with respect to B0 is positive but tiny and the partial derivative with respect to B1 is negative and large, then we want to decrease B0 by a tiny amount and increase B1 by a large amount to lower our cost function.
Recompute the gradient using our new tweaked parameter values and repeat the previous steps until we arrive at the minimum.
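
The three steps above can be sketched in a few lines of Python. The cost function here is a toy bowl-shaped function of B0 and B1 that I made up so the gradient is easy to write by hand. The learning rate and the number of steps are likewise arbitrary choices for the example.

```python
def cost(b0, b1):
    """Toy cost function with its minimum at B0 = 3, B1 = -1 (an assumption for this sketch)."""
    return (b0 - 3.0) ** 2 + 2.0 * (b1 + 1.0) ** 2


def gradient(b0, b1):
    """Partial derivatives [dC/dB0, dC/dB1] of the toy cost function."""
    return 2.0 * (b0 - 3.0), 4.0 * (b1 + 1.0)


b0, b1 = 0.0, 0.0        # starting guess
learning_rate = 0.1      # how large a step we take each time

for step in range(200):
    d_b0, d_b1 = gradient(b0, b1)    # step 1: gradient at our current location
    b0 -= learning_rate * d_b0       # step 2: move against each gradient element,
    b1 -= learning_rate * d_b1       #         by an amount proportional to it
    # step 3: the loop repeats with the updated parameters

print(f"B0 = {b0:.4f}, B1 = {b1:.4f}, cost = {cost(b0, b1):.6f}")
```

At the starting point the partial derivative for B0 is negative, so B0 gets increased. The one for B1 is positive, so B1 gets decreased. That is exactly the “opposite direction” rule from step two.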

Backpropagation

I will defer to this great textbook (online and free!) for the detailed math (if you want to understand neural networks more deeply, definitely check it out). Instead we will do our best to build an intuitive understanding of how and why backpropagation works.

Remember that forward propagation is the process of moving forward through the neural network (from inputs to the ultimate output or prediction). Backpropagation is the reverse. Except instead of signal, we are moving error backwards through our model.

Some simple visualizations helped a lot when I was trying to understand the backpropagation process. Below is my mental picture of a simple neural network as it forward propagates from input to output. The process can be summarized by the following steps:

Inputs are fed into the blue layer of neurons and modified by the weights, bias, and sigmoid in each neuron to get the activations. For example: Activation_1 = Sigmoid( Bias_1 + W1*Input_1 )
Activation 1 and Activation 2, which come out of the blue layer, are fed into the magenta neuron, which uses them to produce the final output activation.

And the objective of forward propagation is to calculate the activations at each neuron for each successive hidden layer until we arrive at the output.

Forward propagation in a neural network

Now let’s just reverse it. If you follow the red arrows (in the picture below), you will notice that we are now starting at the output of the magenta neuron. That is our output activation, which we use to make our prediction, and the ultimate source of error in our model. We then move this error backwards through our model via the same weights and connections that we use for forward propagating our signal (so instead of Activation 1, now we have Error1 — the error attributable to the top blue neuron).

Remember we said that the goal of forward propagation is to calculate neuron activations layer by layer until we get to the output? We can now state the objective of backpropagation in a similar manner:

We want to calculate the error attributable to each neuron (I will just refer to this error quantity as the neuron’s error because saying “attributable” again and again is no fun) starting from the layer closest to the output all the way back to the starting layer of our model.

Backpropagation in a neural network

So why do we care about the error for each neuron? Remember that the two building blocks of a neural network are the connections that pass signals into a particular neuron (with a weight living in each connection) and the neuron itself (with a bias). These weights and biases across the entire network are also the dials that we tweak to change the predictions made by the model.

This part is really important:

The magnitude of the error of a specific neuron (relative to the errors of all the other neurons) is directly proportional to the impact of that neuron’s output (a.k.a. activation) on our cost function.

So the error of each neuron is a proxy for the partial derivative of the cost function with respect to that neuron’s inputs. This makes intuitive sense — if a particular neuron has a much larger error than all the other ones, then tweaking the weights and bias of our offending neuron will have a greater impact on our model’s total error than fiddling with any of the other neurons.

And the partial derivatives with respect to each weight and bias are the individual elements that compose the gradient vector of our cost function. So basically backpropagation allows us to calculate the error attributable to each neuron and that in turn allows us to calculate the partial derivatives and ultimately the gradient so that we can utilize gradient descent. Hurray!
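
For readers who like to see it in code, here is a rough sketch of that chain for a tiny network shaped like my mental picture (two blue neurons feeding one magenta neuron). It follows the standard backpropagation equations from the Nielsen textbook rather than anything specific to this post. The two-input setup, the squared-error cost on a single observation, and the random parameter values are assumptions. The last few lines compare one backpropagated partial derivative against a brute-force numerical estimate.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(x, y, W1, b1, W2, b2):
    """Squared error of the network's output for a single observation."""
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return float(np.sum((a2 - y) ** 2))


def backprop(x, y, W1, b1, W2, b2):
    """Return the partial derivatives of the cost for every weight and bias."""
    # Forward propagation: compute the activations layer by layer.
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)

    # Error of the output (magenta) neuron: dC/dZ at the output.
    error_out = 2.0 * (a2 - y) * a2 * (1.0 - a2)
    # Move the error backwards through the same weights to the blue neurons.
    error_hidden = (W2.T @ error_out) * a1 * (1.0 - a1)

    # Each neuron's error turns into partial derivatives for its weights and bias.
    return {
        "W1": error_hidden @ x.T, "b1": error_hidden,
        "W2": error_out @ a1.T, "b2": error_out,
    }


rng = np.random.default_rng(42)
x, y = rng.normal(size=(2, 1)), np.array([[1.0]])
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=(2, 1))
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=(1, 1))

grads = backprop(x, y, W1, b1, W2, b2)

# Brute-force check: nudge one weight up and down and watch the cost.
eps = 1e-6
W1_up, W1_down = W1.copy(), W1.copy()
W1_up[0, 1] += eps
W1_down[0, 1] -= eps
numeric = (cost(x, y, W1_up, b1, W2, b2) - cost(x, y, W1_down, b1, W2, b2)) / (2 * eps)

print("backprop dC/dW1[0,1]: ", grads["W1"][0, 1])
print("numerical dC/dW1[0,1]:", numeric)
```

The two printed numbers should agree to several decimal places, which is a handy sanity check whenever you write backpropagation by hand.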

An Analogy that Helps — The Blame Game

That’s a lot to digest so hopefully this analogy will help. Almost everyone has had a terrible colleague at some point in his or her life — someone who would always play the blame game and throw coworkers or subordinates under the bus when things went wrong.

Well neurons, via backpropagation, are masters of the blame game. When the error gets backpropagated to a particular neuron, that neuron will quickly and efficiently point the finger at the upstream colleague (or colleagues) who is most at fault for causing the error (e.g. layer 4 neurons would point the finger at layer 3 neurons, layer 3 neurons at layer 2 neurons, and so forth).

Neurons blame the most active upstream neurons

And how does each neuron know who to blame, as the neurons cannot directly observe the errors of other neurons? They just look at who sent them the most signal in terms of the highest and most frequent activations. Just like in real life, the lazy ones that play it safe (low and infrequent activations) skate by blame free while the neurons that do the most work get blamed and have their weights and biases modified. Cynical yes but also very effective for getting us to the optimal set of weights and biases that minimize our cost function. To the left is a visual of how the neurons throw each other under the bus.

And that in a nutshell is the intuition behind the backpropagation process. In my opinion, these are the three key takeaways for backpropagation:

It is the process of shifting the error backwards layer by layer and attributing the correct amount of error to each neuron in the neural network.
The error attributable to a particular neuron is a good approximation for how changing that neuron’s weights (from the connections leading into the neuron) and bias will affect the cost function.
When looking backwards, the more active neurons (the non-lazy ones) are the ones that get blamed and tweaked by the backpropagation process.
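
The third takeaway has a simple formula behind it. In the standard backpropagation equations, the partial derivative for a weight is the error of the neuron it feeds into multiplied by the activation of the upstream neuron it comes from. Here is a tiny sketch with made-up numbers; the error value and the two activations are assumptions for illustration.

```python
def weight_gradients(neuron_error, upstream_activations):
    """Partial derivative of the cost for each incoming weight: error * upstream activation."""
    return [neuron_error * activation for activation in upstream_activations]


# Made-up values: one downstream neuron with a given error, two upstream colleagues.
neuron_error = 0.5
upstream = {"busy neuron": 0.95, "lazy neuron": 0.05}

gradients = weight_gradients(neuron_error, list(upstream.values()))

for name, grad in zip(upstream, gradients):
    print(f"{name}: activation = {upstream[name]:.2f}, weight gradient = {grad:.3f}")
```

The connection from the busy neuron gets a gradient 19 times larger than the one from the lazy neuron, so it is the one that gets tweaked the most on the next gradient descent step.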

Tying it All Together

If you have read all the way here, then you have my gratitude and admiration (for your persistence).

We started with a question, “What makes deep learning special?” I will attempt to answer that now (mainly from the perspective of basic neural networks and not their more advanced cousins like CNNs, RNNs, etc.). In my humble opinion, the following aspects make neural networks special:

Each neuron is its own miniature model with its own bias and set of incoming features and weights.
Each individual model/neuron feeds into numerous other individual neurons across all the hidden layers of the model. So we end up with models plugged into other models in a way where the sum is greater than its parts. This allows neural networks to fit all the nooks and crannies of our data including the nonlinear parts (but beware overfitting — and definitely consider regularization to protect your model from underperforming when confronted with new and out of sample data).
The versatility of the many interconnected models approach and the ability of the backpropagation process to efficiently set the weights and biases of each model let the neural network robustly “learn” from data in ways that many other algorithms cannot.

Author’s Note: Neural networks and deep learning are extremely complicated subjects. I am still early in the process of learning about them. This blog was written as much to develop my own understanding as it was to help you, the reader. I look forward to all of your comments, suggestions, and feedback. Cheers!

More by me on Data Science Topics:

Understanding the Random Forest Algorithm

Principal Components Analysis

Logistic Regression

A/B Testing

The Binomial Distribution

Are Data Scientists at Risk of Being Automated?

Sources:

Neural Networks and Deep Learning by Michael A. Nielsen

Wikipedia: Backpropagation

Model Evaluation Metrics in Machine Learning

Predictive models have become a trusted advisor to many businesses, and for good reason. These models can “foresee the future”, and there are many different methods available, meaning any industry can find one that fits its particular challenges.

When we talk about predictive models, we are talking either about a regression model (continuous output) or a classification model (nominal or binary output). In classification problems, we use two types of algorithms (depending on the kind of output they create):

Class output: Algorithms like SVM and KNN create a class output. For instance, in a binary classification problem, the outputs will be either 0 or 1. However, today we have algorithms that can convert these class outputs to probability.
Probability output: Algorithms like Logistic Regression, Random Forest, Gradient Boosting, Adaboost, etc. give probability outputs. Converting probability outputs to class output is just a matter of creating a threshold probability.

Introduction

While data preparation and training a machine learning model are key steps in the machine learning pipeline, it’s equally important to measure the performance of this trained model.
How well the model generalizes to unseen data is what defines adaptive vs non-adaptive machine learning models.

By using different metrics for performance evaluation, we should be in a position to improve the overall predictive power of our model before we roll it out for production on unseen data.

Depending only on accuracy, without a proper evaluation of the ML model using different metrics, can lead to poor predictions when the model is deployed on unseen data. This happens because, in cases like these, our models don’t learn but instead memorize; hence, they cannot generalize well to unseen data.

Model Evaluation Metrics

Let us now define the evaluation metrics for evaluating the performance of a machine learning model, which is an integral component of any data science project. Model evaluation aims to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.

Confusion Matrix

A confusion matrix is a matrix representation of the prediction results of any binary testing that is often used to describe the performance of the classification model (or “classifier”) on a set of test data for which the true values are known.

The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

Confusion matrix with 2 class labels.

Each prediction can be one of the four outcomes, based on how it matches up to the actual value:

True Positive (TP): Predicted True and True in reality.
True Negative (TN): Predicted False and False in reality.
False Positive (FP): Predicted True and False in reality.
False Negative (FN): Predicted False and True in reality.

Now let us understand this concept using hypothesis testing.

A hypothesis is speculation or theory based on insufficient evidence that lends itself to further testing and experimentation.
With further testing, a hypothesis can usually be proven true or false.

A Null Hypothesis is a hypothesis that says there is no statistical significance between the two variables in the hypothesis. It is the hypothesis that the researcher is trying to disprove.

Ideally, we would always reject the null hypothesis when it is false, and we would accept the null hypothesis when it is indeed true. Even though hypothesis tests are meant to be reliable, there are two types of errors that can occur. These errors are known as Type I and Type II errors.

For example, when examining the effectiveness of a drug, the null hypothesis would be that the drug does not affect a disease.

Type I Error: equivalent to False Positives (FP).

The first kind of error that is possible involves the rejection of a null hypothesis that is true. Let’s go back to the example of a drug being used to treat a disease. If we reject the null hypothesis in this situation, then we claim that the drug does have some effect on a disease. But if the null hypothesis is true, then, in reality, the drug does not combat the disease at all. The drug is falsely claimed to have a positive effect on a disease.

Type II Error: equivalent to False Negatives (FN).

The other kind of error occurs when we accept a false null hypothesis. This sort of error is called a type II error and is also referred to as an error of the second kind.

If we think back again to the scenario in which we are testing a drug, what would a type II error look like?
A type II error would occur if we accepted that the drug has no effect on the disease, but in reality, it did.

A sample Python implementation of the confusion matrix.

```python
import warnings

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Ignore warnings
warnings.filterwarnings('ignore')

# Load the iris dataset (the file has no header row)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url, header=None)
X = df.iloc[:, 0:4]
y = df.iloc[:, 4]

# Test size
test_size = 0.33
# Seed, so that the split is reproducible
seed = 7

# Split data into train and test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Construct the confusion matrix (labels is passed by keyword)
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
cm = confusion_matrix(y_test, pred, labels=labels)
print(cm)

# Plot the matrix with one tick per class
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm)
plt.title('Confusion matrix')
fig.colorbar(cax)
ax.set_xticks(range(len(labels)))
ax.set_yticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()
```

Confusion matrix with 3 class labels.

The diagonal elements represent the number of points for which the predicted label is equal to the true label, while anything off the diagonal was mislabeled by the classifier. Therefore, the higher the diagonal values of the confusion matrix the better, indicating many correct predictions.

In the run reported here, the classifier predicted all the 13 setosa and 18 virginica plants in the test data perfectly. However, it incorrectly classified 4 of the versicolor plants as virginica.

There is also a list of rates that are often computed from a confusion matrix for a binary classifier:

1. Accuracy

Overall, how often is the classifier correct?

Accuracy = (TP + TN) / total

When our classes are roughly equal in size, we can use accuracy, which will give us correctly classified values. Accuracy is a common evaluation metric for classification problems. It’s the number of correct predictions made as a ratio of all predictions made.

Misclassification Rate (Error Rate): Overall, how often is it wrong? Since accuracy is the percent we correctly classified (success rate), it follows that our error rate (the percentage we got wrong) can be calculated as follows:

Misclassification Rate = (FP + FN) / total

We use the sklearn module to compute the accuracy of a classification task, as shown below.

```python
import warnings

from sklearn import datasets, model_selection
from sklearn.linear_model import LogisticRegression

# Ignore warnings
warnings.filterwarnings('ignore')

# Load the iris dataset
iris = datasets.load_iris()
# Create feature matrix
X = iris.data
# Create target vector
y = iris.target

# Test size
test_size = 0.33
# Seed, so that the split is reproducible
seed = 7

# Cross-validation settings (no shuffling, so no random_state is needed)
kfold = model_selection.KFold(n_splits=10)

# Model instance
model = LogisticRegression()

# Evaluate model performance
scoring = 'accuracy'
results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print('Accuracy - val set: %.2f%% (%.2f)' % (results.mean() * 100, results.std()))

# Split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)

# Fit model
model.fit(X_train, y_train)

# Accuracy on test set
result = model.score(X_test, y_test)
print("Accuracy - test set: %.2f%%" % (result * 100.0))
```

The classification accuracy is 88% on the validation set.

2. Precision

When it predicts yes, how often is it correct?

Precision = TP / predicted yes

When we have a class imbalance, accuracy can become an unreliable metric for measuring our performance. For instance, if we had a 99/1 split between two classes, A and B, where the rare event, B, is our positive class, we could build a model that was 99% accurate by just saying everything belonged to class A. Clearly, we shouldn’t bother building a model if it doesn’t do anything to identify class B; thus, we need different metrics that will discourage this behavior. For this, we use precision and recall instead of accuracy.

3. Recall or Sensitivity

When it’s actually yes, how often does it predict yes?

True Positive Rate = TP / actual yes

Recall gives us the true positive rate (TPR), which is the ratio of true positives to everything positive.

In the case of the 99/1 split between classes A and B, the model that classifies everything as A would have a recall of 0% for the positive class, B (precision would be undefined: 0/0). Precision and recall provide a better way of evaluating model performance in the face of a class imbalance. They will correctly tell us that the model has little value for our use case.

Just like accuracy, both precision and recall are easy to compute and understand but require thresholds. Besides, precision and recall only consider half of the confusion matrix.

4. F1 Score

The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.

Why harmonic mean? Since the harmonic mean of a list of numbers skews strongly toward the least elements of the list, it tends (compared to the arithmetic mean) to mitigate the impact of large outliers and aggravate the impact of small ones.

An F1 score punishes extreme values more.
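
The 99/1 example and the harmonic-mean argument can be checked with a few lines of plain Python. This is only a sketch: the helper name is hypothetical, the counts are the ones implied by the 99/1 split with a model that labels everything as class A, and the precision/recall pair at the end is made up to show an extreme case.

```python
def classification_rates(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    A rate is None when its denominator is zero (that is, undefined).
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else None,
    }


# 99 observations of class A (negative), 1 of class B (positive);
# the model predicts A for everything.
always_a = classification_rates(tp=0, fp=0, fn=1, tn=99)
print(always_a)

# Made-up extreme case: perfect precision but very poor recall.
precision, recall = 1.0, 0.1
arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 * precision * recall / (precision + recall)
print(f"arithmetic mean = {arithmetic_mean:.3f}, harmonic mean (F1) = {harmonic_mean:.3f}")
```

The always-A model scores 99% accuracy while its recall and F1 are both zero. In the second case the arithmetic mean still looks respectable, whereas the harmonic mean stays close to the weaker of the two numbers.
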
Ideally, an F1 Score could be an effective evaluation metric in the following classification scenarios:

When FP and FN are equally costly (meaning that missing true positives and finding false positives both impact the model almost the same way), as in our cancer detection classification example
Adding more data doesn’t effectively change the outcome
TN is high (like with flood predictions, cancer predictions, etc.)

A sample Python implementation of the F1 score.

```python
import warnings

import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score

warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
# The CSV has no header row
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:, :-1]
y = dat[:, -1]

test_size = 0.33
seed = 7
model = LogisticRegression()

# Split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)

# Predict class labels for the test set
pred = model.predict(X_test)

# precision: tp / (tp + fp)
precision = precision_score(y_test, pred)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_test, pred)
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_test, pred)
print('F1 score: %f' % f1)
```

5. Specificity

When it’s no, how often does it predict no?

True Negative Rate = TN / actual no

It is the true negative rate or the proportion of true negatives to everything that should have been classified as negative.

Note that, together, specificity and sensitivity consider the full confusion matrix.

6. Receiver Operating Characteristics (ROC) Curve

Measuring the area under the ROC curve is also a very useful method for evaluating a model.
By plotting the true positive rate (sensitivity) versus the false-positive rate (1 - specificity), we get the Receiver Operating Characteristic (ROC) curve. This curve allows us to visualize the trade-off between the true positive rate and the false positive rate.

The following are examples of good ROC curves. The dashed line would be random guessing (no predictive value) and is used as a baseline; anything below that is considered worse than guessing. We want to be toward the top-left corner.

A sample Python implementation of the ROC curves.

```python
# Classification area under curve
import warnings

import matplotlib.pyplot as plt
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
# The CSV has no header row
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:, :-1]
y = dat[:, -1]

test_size = 0.33
seed = 7
model = LogisticRegression()

# Split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)

# Predict probabilities
probs = model.predict_proba(X_test)
# Keep probabilities for the positive outcome only
probs = probs[:, 1]

auc = roc_auc_score(y_test, probs)
print('AUC - Test Set: %.2f%%' % (auc * 100))

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
# Plot no skill
plt.plot([0, 1], [0, 1], linestyle='--')
# Plot the ROC curve for the model
plt.plot(fpr, tpr, marker='.')
plt.xlabel('False positive rate')
plt.ylabel('Sensitivity/ Recall')
# Show the plot
plt.show()
```

In the example above, the AUC is relatively close to 1 and greater than 0.5.
A perfect classifier will have the ROC curve go up along the Y-axis and then along the top of the plot, parallel to the X-axis.

Log Loss

Log Loss is the most important classification metric based on probabilities.

As the predicted probability of the true class gets closer to zero, the loss increases sharply and without bound.

It measures the performance of a classification model where the prediction input is a probability value between 0 and 1. Log loss increases as the predicted probability diverges from the actual label. The goal of any machine learning model is to minimize this value. As such, smaller log loss is better, with a perfect model having a log loss of 0.

A sample Python implementation of the Log Loss.

```python
# Classification: log loss
import warnings
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
# the CSV has no header row, so read the first line as data
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:, :-1]
y = dat[:, -1]
test_size = 0.33
seed = 7

model = LogisticRegression()
# split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)

# predict class probabilities and compute log loss
probs = model.predict_proba(X_test)
loss = log_loss(y_test, probs)
print("Logloss: %.2f" % loss)
```

Note that log_loss expects probabilities. Passing hard 0/1 labels from model.predict instead treats every mistake as a fully confident one and inflates the value; a run done that way reported Logloss: 8.02.

Jaccard Index

Jaccard Index is one of the simplest ways to calculate the accuracy of a classification ML model. Let's understand it with an example. Suppose we have a labeled test set, with labels:

y = [0,0,0,0,0,1,1,1,1,1]

And our model has predicted the labels as:

y1 = [1,1,0,0,0,1,1,1,1,1]

A Venn diagram of the labels of the test set and the labels of the predictions shows their intersection and union.

The Jaccard Index, or Jaccard similarity coefficient, is a statistic used to measure the similarity between sample sets.
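For the two label vectors y and y1 given here, the index can be checked position by position. This is a minimal sketch under one assumed counting convention: the "intersection" is the number of positions where the prediction matches the true label, and the "union" is the total number of labels in both vectors minus those matches. The helper name `jaccard_from_labels` is hypothetical.

```python
# label vectors taken from the example in the text
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y1 = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]


def jaccard_from_labels(actual, predicted):
    # "intersection": positions where the prediction matches the true label
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    # "union": all labels in both vectors minus the matched positions
    union = len(actual) + len(predicted) - matches
    return matches, union, matches / union


matches, union, jaccard = jaccard_from_labels(y, y1)
print(matches, union, round(jaccard, 2))  # prints: 8 12 0.67
```

Other conventions exist (for instance, computing the index on the set of positions predicted positive versus actually positive), and they give different numbers for the same vectors, so it is worth stating which one is being used.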
The measurement emphasizes the similarity between finite sample sets and is formally defined as the size of the intersection divided by the size of the union of the two labeled sets:

J(y, y1) = |y ∩ y1| / |y ∪ y1|

Jaccard Index or Intersection over Union (IoU)

So, for our example, the intersection of the two sets is equal to 8 (since eight values are predicted correctly) and the union is 10 + 10 - 8 = 12. So, the Jaccard index gives us the accuracy as:

J(y, y1) = 8 / 12 ≈ 0.67

So, the accuracy of our model, according to the Jaccard Index, is about 0.67, or 67%.

The higher the Jaccard index, the higher the accuracy of the classifier.

A sample Python implementation of the Jaccard index.

```python
import numpy as np


def compute_jaccard_similarity_score(x, y):
    # treat both inputs as sets of distinct values
    intersection_cardinality = len(set(x).intersection(set(y)))
    union_cardinality = len(set(x).union(set(y)))
    return intersection_cardinality / float(union_cardinality)


score = compute_jaccard_similarity_score(np.array([0, 1, 2, 5, 6]), np.array([0, 2, 3, 5, 7, 9]))
print("Jaccard Similarity Score : %s" % score)
```

Jaccard Similarity Score : 0.375

Kolmogorov-Smirnov Chart

The K-S or Kolmogorov-Smirnov chart measures the performance of classification models. More accurately, K-S is a measure of the degree of separation between the positive and negative distributions.

The cumulative frequency for the observed and hypothesized distributions is plotted against the ordered frequencies. The vertical double arrow indicates the maximal vertical difference.

The K-S is 100 if the scores partition the population into two separate groups in which one group contains all the positives and the other all the negatives. On the other hand, if the model cannot differentiate between positives and negatives, then it is as if the model selects cases randomly from the population.
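The two extremes described here can be checked with a small sketch that computes the K-S value as the largest vertical gap between the cumulative score distributions of the positive and negative cases, scaled to 0-100. The function name `ks_statistic` and the score lists are made up for illustration, and two identical score lists stand in for a model that cannot tell the classes apart.

```python
def ks_statistic(pos_scores, neg_scores):
    """Largest gap between the two empirical CDFs, on a 0-100 scale."""
    thresholds = sorted(set(pos_scores) | set(neg_scores))
    max_gap = 0.0
    for t in thresholds:
        # fraction of each class scoring at or below the threshold
        cdf_pos = sum(s <= t for s in pos_scores) / len(pos_scores)
        cdf_neg = sum(s <= t for s in neg_scores) / len(neg_scores)
        max_gap = max(max_gap, abs(cdf_pos - cdf_neg))
    return 100 * max_gap


# hypothetical model scores, made up for illustration
separated = ks_statistic([0.8, 0.9, 0.95], [0.1, 0.2, 0.3])  # classes fully separated
identical = ks_statistic([0.2, 0.5, 0.8], [0.2, 0.5, 0.8])   # scores carry no information
print(separated, identical)  # prints: 100.0 0.0
```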
In that case, the K-S would be 0.

In most classification models the K-S will fall between 0 and 100, and the higher the value, the better the model is at separating the positive from the negative cases.

The K-S may also be used to test whether two underlying one-dimensional probability distributions differ. It is a very efficient way to determine if two samples are significantly different from each other.

A sample Python implementation of the Kolmogorov-Smirnov test.

```python
from scipy.stats import kstest
import random

# N = int(input("Enter number of random numbers: "))
N = 10

actual = []
for i in range(N):
    # x = float(input("Outcomes of class " + str(i + 1) + ": "))
    actual.append(random.random())

print(actual)
# one-sample K-S test of the values against the standard normal distribution
x = kstest(actual, "norm")
print(x)
```

The null hypothesis used here assumes that the numbers follow the normal distribution. kstest returns the test statistic and the p-value. If the p-value is < alpha, we reject the null hypothesis.

Alpha is defined as the probability of rejecting the null hypothesis given that the null hypothesis (H0) is true. For most practical applications, alpha is chosen as 0.05.

Gain and Lift Chart

Gain or Lift is a measure of the effectiveness of a classification model, calculated as the ratio between the results obtained with and without the model. Gain and lift charts are visual aids for evaluating the performance of classification models. However, in contrast to the confusion matrix, which evaluates models on the whole population, a gain or lift chart evaluates model performance in a portion of the population.

The higher the lift (i.e. the further up it is from the baseline), the better the model.

The following gains chart, run on a validation set, shows that with 50% of the data, the model captures 90% of targets. Adding more data yields only a negligible increase in the percentage of targets included in the model.

Gain/lift chart

Lift charts are often shown as a cumulative lift chart, which is also known as a gains chart.
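A gains chart of this kind can be tabulated by hand before plotting. The sketch below ranks cases by model score, then reports for each slice of the population the share of targets captured (gain) and the ratio of that share to what random selection would give (lift). The function name `cumulative_gains`, the number of bins, and the data are hypothetical.

```python
def cumulative_gains(y_true, scores, n_bins=5):
    """Return a list of (population_share, share_of_targets_captured, lift)."""
    # rank cases from highest to lowest score
    ranked = [label for _, label in
              sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)]
    total_targets = sum(ranked)
    rows = []
    for b in range(1, n_bins + 1):
        cutoff = len(ranked) * b // n_bins
        captured = sum(ranked[:cutoff])
        population_share = cutoff / len(ranked)
        gain = captured / total_targets
        # lift compares the model against selecting the same share at random
        rows.append((population_share, gain, gain / population_share))
    return rows


# hypothetical labels (1 = target) and model scores, made up for illustration
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

for share, gain, lift in cumulative_gains(y_true, scores):
    print("top %3.0f%% of cases: %3.0f%% of targets, lift %.2f" % (100 * share, 100 * gain, lift))
```

With this toy data the top 20% of cases already contain half of the targets, and the lift falls back to 1 once the whole population is included, which is the baseline a lift chart is drawn against.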
Because a gains chart is a cumulative lift chart, gains charts are sometimes (perhaps confusingly) called “lift charts”, although “cumulative lift chart” is the more accurate name.

One of their most common uses is in marketing, to decide if a prospective client is worth calling.

Gini Coefficient

The Gini coefficient or Gini Index is a popular metric for imbalanced class values. The coefficient ranges from 0 to 1, where 0 represents perfect equality and 1 represents perfect inequality. Here, if the value of an index is higher, then the data will be more dispersed.

The Gini coefficient can be computed from the area under the ROC curve using the following formula:

Gini Coefficient = (2 * AUC) - 1, where AUC is the area under the ROC curve

Conclusion

Understanding how well a machine learning model is going to perform on unseen data is the ultimate purpose behind working with these evaluation metrics. Metrics like accuracy, precision, and recall are good ways to evaluate classification models for balanced datasets, but if the data is imbalanced and there's a class disparity, then other methods like ROC/AUC and the Gini coefficient perform better in evaluating the model performance.

Well, this concludes this article. I hope you guys have enjoyed reading it; feel free to share your comments/thoughts/feedback in the comment section.

Thanks for reading!

Juan Guillermo Gómez Ramírez

Aug 13, 2020

Data Science

DataSource AI Hosts KTM AG's Inaugural AI Challenge: "Code the Light Fantastic"

DataSource AI announces the launch of the KTM AG inaugural AI Challenge, an unprecedented 3-month online competition that aims to revolutionise two-wheeler innovation through artificial intelligence and deep learning. KTM AG is a global frontrunner in two-wheeler innovation, known for pushing the boundaries of what's possible in the world of motorcycles. With a rich history of groundbreaking engineering and a commitment to cutting-edge technology, KTM AG has set new standards in performance, design, and safety. As a global leader in two-wheeler innovation, KTM AG invites participants to embark on this groundbreaking innovation journey. At the core of this competition lies a challenge set to redefine the future of motorcycle lighting systems. Participants are tasked with developing an algorithm for a high-beam lighting system utilizing a pixel matrix. Participants can find detailed guidelines in the Datathon competition. The datathon unfolds in a 3-tiered cascade model: This Code Challenge by KTM AG promises not only substantial rewards but also an exciting opportunity to shape the future of two-wheeler technology, along with supporting the participants to upscale and test their knowledge in a global AI competition. The cumulative budget for this remarkable Code Challenge by KTM AG is a substantial €24,000, motivating participants with not only the opportunity to push the boundaries of two-wheeler technology but also significant rewards for those who rise to the occasion. With cumulative prizes, contestants have the chance to potentially take home a maximum reward of €10,800 in addition to contributing to cutting-edge advancements in the field. We invite all aspiring innovators, data scientists, and AI enthusiasts to join us in this journey to "Code the Light Fantastic." 
For more information, rules, and registration details, please register here.

About DataSource: At DataSource AI, we are driven by a singular mission: to democratise the immense power of data science and AI/ML for businesses of all sizes and budgets. We facilitate AI competitions for businesses of all sizes and budgets by harnessing our extensive data expert community, which collaborates over our intelligent AI algorithm crowdsourcing platform. Our community is at the heart of what we do. We've built a diverse and talented pool of data experts who are passionate about solving real-world problems. They collaborate, ideate, and innovate, driving forward the frontiers of data science.

nikos_datasource

Aug 13, 2020

Machine Learning

The Role of AI in Unstructured Data Mining: Challenges and Opportunities

In our fast-paced digital world, we're producing staggering volumes of data every day. This data falls into two key categories: structured, known for its order and efficiency, and unstructured, a captivating puzzle brimming with untapped potential.

In this article, we will uncover how AI confronts the complexities of unstructured data, the hurdles it faces, and the intriguing opportunities it opens up to businesses from any kind of industry.

Understanding Unstructured Data

Unstructured data mining is the technique of extracting valuable and meaningful insights from an abundant well of unstructured data. It uncovers hidden gems of knowledge, making it a crucial pursuit in our data-rich era.

In today's digital realm, unstructured data is generated in unprecedented quantities. Billions of text documents, images, and videos come to life daily, creating a treasure trove of information just waiting for organizations to explore.

Unlocking the insights hidden within unstructured data can provide organizations with a competitive edge. This data can reveal customer sentiments, emerging trends, and valuable feedback that might otherwise go unnoticed.

The Basics of Data Mining

How data mining works is that it discovers patterns, trends, and valuable information within a dataset. It involves various techniques to extract knowledge from raw data. While it's exceptionally effective with structured data, applying data mining to unstructured data requires a unique set of skills and tools.

Unstructured Data Mining

Unstructured data mining is a method focused on the extraction of valuable information from the vast, unstructured data available. This process uncovers hidden insights, making it a valuable endeavor in today's data-driven world.

The AI Revolution

The AI revolution has given rise to an exciting era of possibilities in unstructured data mining. AI's remarkable capabilities are instrumental in taming the unstructured data landscape, and it involves a multitude of components, including:

Machine learning enables AI systems to learn from data, make predictions, and identify patterns, enhancing data mining capabilities.
Deep learning uses neural networks to model complex patterns in unstructured data, which is particularly valuable in image and speech recognition.
Sentiment analysis gauges emotional tones within textual data, helping to understand public opinion and tailor strategies.
Pattern recognition identifies recurring structures in data, aiding in image processing and text mining.
Knowledge graphs structure data relationships, improving contextual understanding and data retrieval.
Anomaly detection identifies outliers in data, which is essential for fraud detection and data security.

Challenges in Unstructured Data Mining

As promising as AI is at handling unstructured data, it's not without its set of challenges. Here, we delve into some of the major hurdles:

Data Quality

Unstructured data is inherently messy. It's laden with errors, inconsistencies, and biases, which makes it a challenge to extract meaningful insights from this data. AI systems need to be trained rigorously to navigate and decipher this diversity in data quality. Techniques like data cleansing, normalization, and the use of context are essential in ensuring that AI systems provide accurate results.

Scalability

As the volume of unstructured data grows, AI systems must scale to handle the data influx effectively. Traditional hardware and algorithms might not be sufficient to handle this data influx. Scalable infrastructure and distributed computing become crucial to ensuring that AI systems can process and analyze vast amounts of data efficiently.

Privacy Concerns

Mining unstructured data often raises ethical questions regarding privacy and data protection. That’s why it’s essential to strike the right balance between data utilization and respecting individual privacy. It's a challenge to ensure that AI systems are used responsibly and in compliance with data protection laws and regulations, such as GDPR in Europe. Techniques like anonymization and consent management play a vital role in addressing these privacy concerns.

Opportunities and Applications

AI's role in unstructured data mining has opened up a world of opportunities across various industries. Let's explore some of the most promising applications:

Customer Insights

Unstructured data, particularly sourced from social media and customer reviews, serves as a goldmine of information on customer behavior and preferences. By leveraging AI algorithms, companies can analyze sentiments, spot emerging trends, and even forecast future buying patterns. With these insights, they can fine-tune their marketing strategies, product development, and customer service to align with their ever-evolving audience's demands.

Healthcare Diagnosis

The abundance of unstructured data found in medical records, radiological images, and wearable device data holds the key to transformative advancements. AI-powered systems, known for their proficiency in the analysis of this data, not only facilitate early disease detection but also provide highly individualized treatment plans, ultimately raising the standard of patient care. For example: AI expedites the process of analyzing medical images for anomalies, resulting in a significant reduction in the time required for diagnosing and treating severe conditions.

Fraud Detection

When it comes to financial institutions, AI is a vital tool for exposing fraudulent activities that often hide within the vast volumes of unstructured transaction data. Through a meticulous examination of transaction patterns and anomalies, AI systems can rapidly pinpoint fraudulent actions, providing businesses with a robust defense against significant financial losses. The ability to detect and thwart fraud in real-time provides a critical advantage, resulting in annual savings of billions of dollars for businesses.

Conclusion

The future belongs to those who embrace the AI revolution in unstructured data mining. In this future, data isn't just information; it's the key to success. So, let's move forward, embracing this tomorrow, where possibilities are limitless and opportunities are endless.

nikos_datasource

Aug 13, 2020

Data Science

Data-Driven Creativity: Enhancing Video Content through Data Science

In the age of digital marketing and content creation, data-driven creativity is becoming an increasingly important concept. It's the fusion of artistic vision with the insights gleaned from data science to enhance the impact and effectiveness of video content. This 2500-word blog will explore how data science can be leveraged to elevate video content creation, ensuring that it not only engages but also resonates with the intended audience.

Introduction to Data-Driven Creativity

Data-driven creativity marks a groundbreaking shift in video content creation, blending artistic vision with the insights provided by data science. This combination allows creators to break free from conventional creative limits, using data analytics to develop content that is both visually captivating and strategically significant. By delving into viewer behavior, preferences, and interactions, creators can refine their stories and visuals, achieving a deeper connection with their audience. This technique effectively transforms data into a guide for storytelling, steering content towards increased relevance and attractiveness. Consequently, video content becomes a more potent medium for engaging viewers and delivering impactful messages. Fundamentally, data-driven creativity is about converting data points into compelling stories and turning analytical insights into creative masterpieces, thereby redefining the standards of digital video content.

Understanding the Role of Data in Video Content Creation

Exploring the Role of Data in Video Content Creation ventures into the rapidly growing realm of data-driven creativity, where data science emerges as a key instrument in enriching video content. In this realm, data transcends mere figures to become a narrative element, providing rich insights into what audiences prefer, how they behave, and emerging trends. Utilizing data, video creators can break free from conventional creative constraints, shaping their stories to more deeply connect with viewers. This process involves a detailed examination of viewer interactions, demographics, and feedback to hone storytelling skills, aiming to create videos that are not only watched but also emotionally impactful and memorable. Data-driven creativity is a fusion of art and science, where each view, reaction, and comment plays a role in directing the trajectory of video content, enhancing its relevance, engagement, and effect. This marks a transformative phase in content creation, where data equips creators to weave narratives that are not just creatively rich but also finely tuned to the dynamic preferences and interests of their audience.

The Process of Gathering and Analyzing Data

Collecting and analyzing data forms the foundation of data-driven creativity, especially in the realm of video content enhancement. This process involves the acquisition of key information, including audience demographics, interaction metrics, and performance measures, utilizing sophisticated tools and technologies. These range from social media analytics to advanced data mining applications designed to track a broad spectrum of viewer interactions. Once collected, this data undergoes thorough analysis to identify trends, preferences, and behaviors within the target audience. Such analysis equips content creators with insightful knowledge, allowing them to adjust their video content for greater appeal and connection with their audience. Leveraging these insights, creators can modify elements such as the tone, style, and themes of their content, revolutionizing storytelling methods and ensuring their content is both captivating and impactful. This integration of data science with creative storytelling heralds a transformative phase in video content production, where analytical findings significantly enhance artistic expression.

Tailoring Content to Audience Preferences

Adapting content to audience preferences through data-driven creativity signifies a vital evolution in video content production. By incorporating data science, creators gain profound insights into audience behaviors, likes, and engagement patterns. This approach facilitates the creation of content that better resonates with viewers, ensuring everything from the plot to visual elements aligns with their interests. Utilizing analytics such as viewer habits and interaction rates, creators can pinpoint engaging aspects for better video content. Using a high-quality video editor tool is important to make the video look better. This knowledge allows precise adjustments, making the content not only captivating but also highly relevant. Ultimately, incorporating data in video content creation leads to more impactful and resonant viewer experiences, forging a deeper bond between the audience and the content.

Enhancing Storytelling with Data Insights

Utilizing data insights to enhance storytelling is a groundbreaking method in video content production. Termed data-driven creativity, this technique blends the storytelling craft with data science accuracy. Content creators leverage analysis of viewer engagement, preferences, and behavior to fine-tune their narratives, ensuring a deeper connection with their audience. This integration results in not only engaging narratives but also ones that are in tune with audience interests and emerging trends. Insights from data grant a clearer understanding of what truly engages viewers, empowering creators to optimize their storytelling for the greatest effect. This modern approach reinvents traditional storytelling into an experience that's both more impactful and centered around the audience, with each creative decision being shaped and enriched by data.

Using Data to Predict Future Trends

Utilizing data for future trends in data-driven creativity marks a revolutionary step in improving video content via data science. This technique focuses on analyzing viewer interactions, demographic information, and behavioral tendencies to predict future content direction. Using data enables creators to be proactive, crafting video content that resonates with emerging audience preferences and interests. Such a forward-thinking approach guarantees ongoing relevance in a dynamic digital world and fosters innovation and leadership among content creators. The blend of data analytics and artistic insight leads to the production of not just captivating but also pioneering videos, demonstrating the significant role of data in shaping the future of video content creation.

Balancing Creativity and Data

Achieving a harmonious blend of creativity and data in video content production is both subtle and potent. Data-driven creativity embodies the convergence of artistic flair and data analytics, providing an innovative method to boost video effectiveness. By weaving in data analysis, video creators unlock insights into what their audience prefers and how they behave, guiding their artistic choices. This integration results in content that is not only enthralling but also deeply meaningful to viewers. It is essential, however, to ensure that data serves as a guide, not a ruler, in the creative journey. This equilibrium keeps the content fresh and appealing while aligning it thoughtfully with data-driven knowledge. In essence, data-driven creativity in video content merges the narrative craft with analytical insights, culminating in videos that are both compelling and influential.

Overcoming Challenges in Data-Driven Creativity

Overcoming hurdles in data-driven creativity necessitates a nuanced integration of data science into the creation of video content. It involves striking a delicate balance between analytical methodologies and artistic expression, ensuring that data serves as an informative tool rather than a constraint on creativity. Accurate interpretation of data empowers content creators to avoid formulaic outputs, utilizing insights to enrich storytelling and enhance audience engagement. This intricate process demands a comprehensive understanding of the artistry of video creation and the scientific principles behind data analysis. Ethical considerations, including respecting audience privacy and obtaining data consent, are pivotal in this approach. Innovative strategies within data-driven creativity empower creators to produce content that forges deeper connections with viewers, setting new benchmarks in the digital landscape. Embracing these challenges is essential for unlocking the full potential of data-enhanced video content.

Ethical Considerations in Data-Driven Creativity

In the domain of data-driven creativity, ethical considerations play a crucial role, especially when utilizing data science to enhance video content. While utilizing data insights can enhance creative processes, it is essential to address privacy concerns and ensure transparent, responsible data usage. Achieving the right equilibrium between creativity and ethical considerations becomes paramount as brands employ data to customize video content. Upholding user privacy and securing informed consent are fundamental principles in ethical data-driven creativity, fostering trust among audiences. Moreover, there is an obligation to avoid perpetuating biases and stereotypes in content creation, championing inclusivity and diversity. Ethical practices not only maintain brand integrity but also contribute to a positive and respectful digital environment for consumers.

Tools and Resources for Data-Driven Video Creation

Explore the potential of data-driven creativity using state-of-the-art tools and resources for crafting videos. In the current digital landscape, integrating data science and video content is transforming the landscape of creative processes. Immerse yourself in a domain where insights derived from data direct every facet of video production. These tools empower creators to customize content according to audience preferences, ensuring that each video is not only visually captivating but also strategically aligned. From scriptwriting informed by analytics to incorporating personalized visual elements, the utilization of data science takes video content to unprecedented levels. Delve into the crossroads of technology and creativity, where strategies driven by data redefine storytelling, captivating audiences in a personalized and meaningful manner.

The Future of Data-Driven Creativity in Video Content

The evolution of data-driven creativity in video content is set to transform our interaction with digital media. Through the incorporation of data science, creators gain valuable insights into viewer preferences, behavior, and trends. This collaboration enables a personalized and captivating viewing experience, heightening audience engagement. With the utilization of data-driven creativity, content producers can shape videos to suit the unique preferences of their target audience, resulting in more impactful storytelling and brand communication. As technology progresses, we anticipate a shift towards highly personalized content, driven by data insights, leading to innovative approaches in video production. This convergence of creativity and data science holds significant promise for the future development of video content within the digital landscape.

Conclusion

In summary, the convergence of data-driven insights and creative components represents a transformative shift in the realm of video content creation. The fusion of Data Science and creativity provides content producers with the tools to precisely tailor videos to audience preferences, resulting in more impactful and engaging content. Leveraging the potential of data facilitates a deeper comprehension of viewer behavior, enabling targeted storytelling. Amidst the digital landscape, the symbiosis of data and creativity not only elevates video content but also fosters innovation and personalized experiences. Looking ahead, embracing Data-Driven Creativity becomes crucial for maintaining a leading edge in the continually evolving landscape of video content creation.

nikos_datasource

Aug 13, 2020

Understanding Neural Networks

Contents Outline

Tony Yiu

Understanding Neural Networks

The 30,000 Feet View

Getting our Bearings

Let’s Add a Bit of Complexity Now

Time for the Neural Network to Learn

Backpropagation

Tying it All Together

