Random Forests is one of my favorite data mining algorithms. Invented by Leo Breiman and Adele Cutler back in the last century, it has retained its authenticity up to this day: the core algorithm has not changed since its invention.

Without any exaggeration, it is one of the few universal algorithms. Random forests can solve both regression and classification problems. They are also good for detecting anomalies and selecting predictors. What is more, the algorithm is hard to apply incorrectly. It is surprisingly simple in its essence. Unlike many other algorithms, it has few configurable parameters. And at the same time, it is amazingly accurate.

Wow, so many advantages of using Random forests! It seems like a miracle for machine learning engineers ;) So, if you don’t know yet how it works, it’s the right time to fix this situation. Here is a learning adventure for beginners, where we see things in terms of branches, leaves, and Random forests, of course.

Without further ado, let’s get started!

Decision Trees in a Nutshell

Let’s first start with Decision Trees, because logically, there is no forest without trees.

Decision Trees is a non-parametric supervised learning algorithm that builds classification or regression models in the form of a tree structure. It breaks down a data set into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed.

A Random Forest is actually just a bunch of Decision Trees. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from the root to the leaf represent classification rules.

A decision tree consists of three types of nodes:

Decision nodes — typically represented by squares
Chance nodes — typically represented by circles
End nodes — typically represented by triangles

So all-in-all, the learned tree can also be represented as a nested if-else rule to improve human readability. Trees have a high risk of overfitting the training data, as well as becoming computationally complex, if they are not constrained and regularized properly during the growing stage. An overfitted tree has low bias but high variance. To deal with this problem, we use Ensemble Learning, an approach that allows us to correct this overlearning habit and, hopefully, arrive at better, stronger results.
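To make the nested if-else view of a tree concrete, here is a minimal sketch of a tiny hand-written tree. The feature names (petal_length, petal_width), the thresholds and the class labels are hypothetical and chosen only for illustration; a real tree would learn them from data.

```python
def classify_flower(petal_length: float, petal_width: float) -> str:
    """A hand-written decision tree expressed as nested if-else rules.

    Thresholds and labels are illustrative assumptions, not learned values.
    """
    if petal_length < 2.5:         # root node: test on the first feature
        return "setosa"            # leaf
    else:
        if petal_width < 1.7:      # internal node: test on the second feature
            return "versicolor"    # leaf
        else:
            return "virginica"     # leaf


print(classify_flower(1.4, 0.2))
print(classify_flower(4.5, 1.3))
print(classify_flower(5.8, 2.2))
```

Each path from the first test down to a return statement corresponds to one classification rule of the tree.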

What is an Ensemble Method?

Ensemble methods train many classifiers and then classify new data points by voting on, or averaging, their predictions. The original ensemble method is nothing but Bayesian averaging, but later algorithms include error-correcting output coding, bagging and boosting. Boosting aims to turn weak models into a strong one by building the ensemble sequentially, with each new classifier focusing on the mistakes of the previous ones. Bagging also aggregates base classifiers, but it trains them in parallel and independently of each other. In the language of mathematical logic, bagging is an improving union, and boosting is an improving intersection.

In our case, a Random Forest (strong learner) is built as an ensemble of Decision Trees (weak learners) to perform different tasks such as regression and classification.

What Is the Idea of Random Forest?

The idea is simple: let’s say we have some very weak algorithm, say, a decision tree. If we make a lot of different models using this weak algorithm and average the result of their predictions, then the final result will be much better. This is the so-called Ensemble learning in action.
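Why averaging many weak models helps can be checked with a little arithmetic. The sketch below rests on a strong simplifying assumption that is not part of the algorithm itself: a binary classification task in which every model is correct with the same probability and the models make their errors independently. Under that assumption, the chance that a majority of the models is right follows from the binomial distribution.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct."""
    needed = n_models // 2 + 1  # smallest number of votes that forms a majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# A single weak model that is right 60% of the time, versus ensembles of them.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

The printed accuracy grows with the number of models. Real trees trained on the same data are never fully independent, so the gain in practice is smaller than this idealized calculation suggests.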

This is why Random forest got its name: it creates many decision trees for the data and then averages their predictions. The number of trees is a parameter of the method, and each tree is built on a sample drawn from the original training set using bootstrap (sampling with replacement).

An important point here is the element of randomness in the creation of each tree. After all, it is clear that if we create many identical trees, then the result of their averaging will have the accuracy of one tree.
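Bootstrap sampling is one source of that randomness, and it can be sketched in a few lines. The list of 1000 row indices below is just a stand-in for a real training set.

```python
import random

rng = random.Random(0)               # fixed seed so the run is repeatable
training_set = list(range(1000))     # stand-in for 1000 training rows

# Bootstrap: draw as many rows as the original set has, with replacement.
bootstrap_sample = rng.choices(training_set, k=len(training_set))

unique_fraction = len(set(bootstrap_sample)) / len(training_set)
print(f"rows in sample: {len(bootstrap_sample)}")
print(f"fraction of distinct original rows: {unique_fraction:.2f}")
```

Because rows are drawn with replacement, some appear several times and roughly a third are left out of any given sample, so every tree sees a slightly different data set. The left-out rows are what the out-of-bag (oob) estimate is computed on.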

Simple explanation

A random forest is a collection of random decision trees (their number is n_estimators in sklearn). So first you need to understand how to create one random decision tree.

Roughly speaking, to build a random decision tree you start with a subset of your training samples. At each node, you randomly draw a subset of features (their number is determined by max_features in sklearn). For each of these features, you test different threshold values and see how well they separate your samples according to a given criterion (usually entropy or gini, set by criterion in sklearn). Then you keep the feature and threshold that separate your data best and store them in the node. When the construction of the tree finishes (this can happen for various reasons: the maximum depth is reached (max_depth in sklearn), the minimum number of samples is reached (min_samples_leaf in sklearn), etc.), you look at the samples in each leaf and save the frequency of the labels. As a result, the tree gives you a partition of your training samples according to meaningful features.
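The split search at a single node can be sketched as follows. This is a simplified illustration, not the scikit-learn implementation: it tries every observed value as a threshold and scores a split by the size-weighted Gini impurity of the two children. The function names and the toy data are made up for this example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, max_features, rng):
    """Return (impurity, feature_index, threshold) of the best split found
    among a random subset of `max_features` features."""
    n_features = len(rows[0])
    candidates = rng.sample(range(n_features), max_features)
    best = None
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            # Impurity of the two children, weighted by their sizes.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 9.0], [8.0, 4.0], [9.0, 8.0], [10.0, 2.0]]
labels = ["a", "a", "a", "b", "b", "b"]

print(best_split(rows, labels, max_features=2, rng=random.Random(0)))  # (0.0, 0, 3.0)
```

With max_features=1 the node would only see one randomly chosen feature, so it might be forced to split on the noisy one. That is exactly the randomness that makes the trees of a forest differ from each other.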

Since each node is built from a random subset of features, each tree constructed in this way will be different. This contributes to a good compromise between bias and variance.

Then, in test mode, the test sample will go through each tree, giving you labels for each tree. The most represented label is usually the final result of the classification.
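The aggregation step is simple enough to sketch with the standard library. The per-tree predictions below are hypothetical values for a single test sample.

```python
from collections import Counter
from statistics import mean

# Hypothetical predictions of five trees for one test sample (classification).
tree_labels = ["cat", "dog", "cat", "cat", "dog"]
forest_label = Counter(tree_labels).most_common(1)[0][0]
print(forest_label)  # cat

# For regression, the forest averages the trees' numeric outputs instead.
tree_values = [3.1, 2.9, 3.4, 3.0, 3.1]
forest_value = mean(tree_values)
print(forest_value)
```

Note that scikit-learn's classifier combines the trees by averaging their predicted class probabilities rather than counting hard votes, which often, but not always, gives the same label.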

How does it work?

Suppose we have some input data. Each column corresponds to a certain parameter, each row corresponds to a certain data element.

We can randomly select from the entire data set a certain number of columns and rows and build a decision tree from them.

Then we can repeat this process many times and get a lot of different trees. The process of the tree-building algorithm is very fast. And therefore, it will not be difficult for us to make as many trees as we need. At the same time, all of these trees are, in a sense, random, because we chose a random subset of data to create each of them.

The number of trees grown is often an important factor. This number may influence the achieved level of classification error. In addition, with sharply unbalanced classes (for example, a lot of 0 and only a small amount of 1), it is important to perform stratified sampling to even out the levels of classification error in each of these classes.

In the original version of the algorithm, a random subset of features is selected at each split during tree construction rather than once per tree. But this does not change the essence, and the results are comparable.

The algorithm is surprisingly simple: the most difficult step in its implementation is the construction of a decision tree. And despite its simplicity, it gives very good results in real tasks. From a practical point of view, it has one huge advantage: it requires almost no configuration. If we take any other machine learning algorithm, be it regression or a neural network, they all have a bunch of parameters, and you need to know which algorithms are better suited to a specific task.

The random forest algorithm has essentially only one parameter: the size of the random subset selected at each step of the tree construction. This parameter is important, but even the default values provide very acceptable results.

Random Forests vs Decision Trees

Both random forests and decision trees are supervised learning algorithms that can be used for classification as well as regression.

A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. It’s called a decision tree because it starts with a single box (or root), which then branches off into a number of solutions, just like a tree.

Random forests involve building several decision trees based on sampling features and then making predictions based on majority voting among trees for classification problems or averaging for regression problems. This greatly reduces the overfitting that single Decision Trees suffer from.

In a forest, each tree is built so that only a fixed number of randomly selected features of the training set is considered when splitting a node (the second parameter of the method), and a complete tree is constructed (without pruning). In other words, each leaf of the tree contains observations of only one class.

Random Forest Algorithm with Python and Scikit-Learn

The scikit-learn library provides the following implementation of RF (shown here for the classification task only). This is the signature from an older release; newer releases default to n_estimators=100 and no longer accept min_impurity_split:

class sklearn.ensemble.RandomForestClassifier(n_estimators=10,
criterion='gini', max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0,
max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07,
bootstrap=True, oob_score=False, n_jobs=1,
random_state=None, verbose=0, warm_start=False,
class_weight=None)

You work with the algorithm following the standard scikit-learn pattern:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score

# (X, y) is the training set, (X2, y2) is the validation set;
# y and y2 are assumed to hold binary 0/1 labels so that AUC-ROC is defined
# model: here (for contrast) we use the regressor
model = RandomForestRegressor(n_estimators=10,
                              oob_score=True,
                              random_state=1)
model.fit(X, y)        # training
a = model.predict(X2)  # prediction
print("AUC-ROC (oob) = ", roc_auc_score(y, model.oob_prediction_))
print("AUC-ROC (test) = ", roc_auc_score(y2, a))
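The snippet above assumes that X, y, X2 and y2 already exist. For readers who want something they can run end to end, here is a self-contained sketch of the same pattern using the classifier. The synthetic data from make_classification, the train/test split and the choice of 200 trees are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption made for this sketch).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
model.fit(X_train, y_train)

# Probability of the positive class: out-of-bag estimates for the training rows,
# ordinary predictions for the held-out test rows.
oob_scores = model.oob_decision_function_[:, 1]
test_scores = model.predict_proba(X_test)[:, 1]

auc_oob = roc_auc_score(y_train, oob_scores)
auc_test = roc_auc_score(y_test, test_scores)
print("AUC-ROC (oob)  =", round(auc_oob, 3))
print("AUC-ROC (test) =", round(auc_test, 3))
```

The out-of-bag score is computed only from trees that did not see a given row during training, so it usually lands close to the score on held-out data without needing a separate validation set.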

Let’s have a look at what the main parameters mean:

n_estimators — Number of trees

The more trees, the better the quality, but the training and prediction time of RF also increases proportionally. Note that as n_estimators grows, quality on the training sample often keeps increasing (it can even reach 100%), while quality on the test set levels off at an asymptote (from which you can estimate how many trees are enough for you).

max_features — The number of features to consider for each split

As max_features increases, the time taken to build the forest increases, and the trees become "more uniform". The commonly recommended values are sqrt(n) in classification problems and n / 3 in regression problems, where n is the number of features; note that scikit-learn's regressor uses all n features by default. This is the most important parameter! It is tuned first of all (with a sufficient number of trees in the forest).

min_samples_split — The minimum number of objects at which splitting is performed

This parameter, as a rule, is not very important, and you can leave the default value. The quality curve on the validation set may look like a "comb" (there is no clear optimum). As the parameter increases, the quality on the training set decreases, and so does the RF construction time.

min_samples_leaf — Limit on the number of objects in leaves

Everything that has been said about min_samples_split also applies to this parameter. Often you can leave the default value. By the way, it is usually recommended to use the value 5 in regression tasks (this is the default in the randomForest library for R, while the default in sklearn is 1).

max_depth — Maximum tree depth

It is clear that the smaller the depth, the faster RF is built and works. With increasing depth, quality on the training set rises sharply, and quality on the validation set, as a rule, rises as well. It is recommended to use the maximum depth (except in cases where there are too many objects and very deep trees are obtained, the construction of which takes considerable time).

When using shallow trees, changing the parameters that limit the number of objects in a leaf and the number required for a split has no significant effect (the leaves are "large" anyway). Shallow trees are recommended for problems with a large number of noisy objects (outliers).

criterion — Split criterion

In terms of meaning, this is a very important parameter, but in fact, there are few choices here. Two criteria are implemented in the sklearn library for regression: "mse" and "mae" (renamed "squared_error" and "absolute_error" in newer releases), which correspond to the error functions that they minimize. Most tasks require using mse. For classification, the "gini" and "entropy" criteria are implemented, which correspond to the classical splitting criteria. Simply trying each of them will help you choose what to use in a particular task.
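To get a feeling for the two classification criteria, here is a small sketch that computes Gini impurity and entropy for the labels in a node. The three example nodes are made up; both measures are zero for a pure node and largest for an even class mix.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


# A pure node, a nearly pure node and an evenly mixed node.
for node in (["a"] * 10, ["a"] * 9 + ["b"], ["a"] * 5 + ["b"] * 5):
    print(dict(Counter(node)), round(gini(node), 3), round(entropy(node), 3))
```

The two measures rank candidate splits very similarly, which is one reason the choice between them rarely changes the result much.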

Older versions of the sklearn implementation of a random forest had no sample size parameter to regulate how many objects are selected to build each tree (releases from 0.22 onward provide max_samples for this). There is such a parameter in the R implementation, but, in fact, it is often optimal to draw a sample as large as the entire training set. It is also recommended that you draw the subsample with replacement: bootstrap = True (this is bagging, short for bootstrap aggregating).

Wrapping it up…

So, random forest is a very easy to use algorithm. It has many advantages, and here are the most important points for me:

random forests are robust against overfitting even when the number of features significantly exceeds the number of observations. By the way, few other algorithms cope with this situation as well.
building random forests is very simple: you only need two parameters, and they require minimal tuning
random forests can be used not only for classification and regression tasks but also for identifying the most informative features, clustering, highlighting anomalous observations and determining class prototypes

…………………………………….

Hope this post was useful and interesting for you! If you do anything cool with this information, leave a response in the comments below or reach out at any time on my Instagram and Medium blog.

Thanks for reading!

Most Related Articles

Python

Programming

Why Decorators In Python Are Pure Genius?

Analyze, test, and re-use your code with little more than an @ symbolIf there’s one thing that makes Python incredibly successful, that would be its readability. Everything else hinges on that: if code is unreadable, it’s hard to maintain. It’s also not beginner-friendly then — a novice getting boggled by unreadable code won’t attempt writing its own one day.Python was already readable and beginner-friendly before decorators came around. But as the language started getting used for more and more things, Python developers felt the need for more and more features, without cluttering the landscape and making code unreadable.Decorators are a prime-time example of a perfectly implemented feature. It does take a while to wrap your head around, but it’s worth it. As you start using them, you’ll notice how they don’t overcomplicate things and make your code neat and snazzy.Before anything else: higher-order functionsIn a nutshell, decorators are a neat way to handle higher-order functions. So let’s look at those first!Functions returning functionsSay you have one function, greet() — it greets whatever object you pass it. And let’s say you have another function, simon() — it inserts “Simon” wherever appropriate. How can we combine the two? Think about it a minute before you look below.def greet(name): return f"Hello, {name}!" def simon(func): return func("Simon") simon(greet)The output is 'Hello, Simon!'. Hope that makes sense to ya!Of course we could have just called greet("Simon"). However, the whole point is that we might want to put “Simon” into many different functions. And if we don’t use “Simon” but something more complicated, we can save a whole lot of lines of code by packing it into a function like simon().Functions inside other functionsWe can also define functions inside other functions. That’s important because decorators will do that, too! Without decorators it looks like this:def respect(maybe): def congrats(): return "Congrats, bro!" 
def insult(): return "You're silly!" if maybe == "yes": return congrats else: return insultThe function respect() returns a function; respect("yes") returns the congrats function, respect("brother") (or some other argument instead of "brother") returns the insult function. To call the functions, enter respect("yes")() and respect("brother")(), just like a normal function.Also read: Python Books You Must Read in 2020Got it? Then you’re all set for decorators!Code is beautifully nerdy. Image by author.The ABC of Python decoratorsFunctions with an @ symbolLet’s try a combination of the two previous concepts: a function that takes another function and defines a function. Sounds mind-boggling? Consider this:def startstop(func): def wrapper(): print("Starting...") func() print("Finished!") return wrapper def roll(): print("Rolling on the floor laughing XD") roll = startstop(roll)The last line ensures that we don’t need to call startstop(roll)() anymore; roll() will suffice. Do you know what the output of that call is? Try it yourself if you’re unsure!Now, as a very good alternative, we could insert this right after defining startstop():@startstop def roll(): print("Rolling on the floor laughing XD")This does the same, but glues roll() to startstop() at the onset.Added flexibilityWhy is that useful? Doesn’t that consume exactly as many lines of code as before?In this case, yes. But once you’re dealing with slightly more complicated stuff, it gets really useful. For once, you can move all decorators (i.e. the def startstop() part above) into its own module. That is, you write them into a file called decorators.py and write something like this into your main file:from decorators import startstop @startstop def roll(): print("Rolling on the floor laughing XD")In principle, you can do that without using decorators. 
But this way it makes life easier because you don’t have to deal with nested functions and endless bracket-counting anymore.You can also nest decorators:from decorators import startstop, exectime @exectime @startstop def roll(): print("Rolling on the floor laughing XD")Note that we haven’t defined exectime() yet, but you’ll see it in the next section. It’s a function that can measure how long a process takes in Python.This nesting would be equivalent to a line like this:roll = exectime(startstop(roll))Bracket counting is starting! Imagine you had five or six of those functions nested inside each other. Wouldn’t the decorator notation be much easier to read than this nested mess?You can even use decorators on functions that accept arguments. Now imagine a few arguments in the line above and your chaos would be complete. Decorators make it neat and tidy.Also read: How to Get a Job With PythonFinally, you can even add arguments to your decorators — like @mydecorator(argument). Yeah, you can do all of this without decorators. But then I wish you a lot of fun understanding your decorator-free code when you re-read it in three weeks…Decorators make everything easier. Image by author.Applications: where decorators cut the creamNow that I’ve hopefully convinced you that decorators make your life three times easier, let’s look at some classic examples where decorators are basically indispensable.Measuring execution timeLet’s say we have a function called waste time() and we want to know how long it takes. Well, just use a decorator!import time def measuretime(func): def wrapper(): starttime = time.perf_counter() func() endtime = time.perf_counter() print(f"Time needed: {endtime - starttime} seconds") return wrapper @measuretime def wastetime(): sum([i**2 for i in range(1000000)]) wastetime()A dozen lines of code and we’re done! 
Plus, you can use measuretime() on as many functions as you want.Slowing code downSometimes you don’t want to execute code immediately but wait a while. That’s where a slow-down decorator comes in handy:import time def sleep(func): def wrapper(): time.sleep(300) return func() return wrapper @sleep def wakeup(): print("Get up! Your break is over.") wakeup()Calling wakeup() makes lets you take a 5-minute break, after which your console reminds you to get back to work.Also read: Building A Linear Regression Model With Python To Predict Retail Customer SpendingTesting and debuggingSay you have a whole lot of different functions that you call at different stages, and you’re losing the overview over what’s being called when. With a simple decorator for every function definition, you can bring more clarity. Like so:def debug(func): def wrapper(): print(f"Calling {func.__name__}") return wrapper @debug def scare(): print("Boo!") scare()There is an a lot more elaborate example here. Note, though, that to understand that example, you’ll have to check how to decorate functions with arguments. Still, it’s worth the read!Reusing codeThis kinda goes without saying. If you’ve defined a function decorator(), you can just sprinkle @decorator everywhere in your code. To be honest, I don’t think it gets any simpler than that!Handling loginsIf you have functionalities that should only be accessed if a user is logged in, that’s also fairly easy with decorators. I’ll refer you to the full example for reference, but the principle is quite simple: first you define a function like login_required(). Before any function definition that needs logging in, you pop @login_required. Simple enough, I’d say.Syntactic sugar — or why Python is so sweetIt’s not like I’m not critical of Python or not using alternative languages where it’s appropriate. 
But there’s a big allure to Python: it’s so easy to digest, even when you’re not a computer scientist by training and just want to make things work.If C++ is an orange, then Python is a pineapple: similarly nutritious, but three times sweeter. Decorators are just one factor in the mix.But I hope you’ve come to see why it’s such a big sweet-factor. Syntactic sugar to add some pleasure to your life! Without health risks, except for having your eyes glued on a screen.I wish you lots of sweet code!Also Read: How to Use Python Datetimes Correctly?

Daniel Morales

May 14, 2020

Python

SQL

Programming

How to Use Python Datetimes Correctly?

Datetime is basically a python object that represents a point in time, like years, days, seconds, milliseconds. This is very useful to create our programs.The datetime module provides classes to manipulate dates and times in a simple and complex way. While date and time arithmetic is supported, the application focuses on the efficient extraction of attributes for formatting and manipulating output

Daniel Morales

May 14, 2020

Data Science

Programming

6 Advanced Statistical Concepts in Data Science

The article contains some of the most commonly used advanced statistical concepts along with their Python implementation.In my previous articles Beginners Guide to Statistics in Data Science and The Inferential Statistics Data Scientists Should Know we have talked about almost all the basics(Descriptive and Inferential) of statistics which are commonly used in understanding and working with any data science case study. In this article, lets go a little beyond and talk about some advance concepts which are not part of the buzz.Concept #1 - Q-Q(quantile-quantile) PlotsBefore understanding QQ plots first understand what is a Quantile?A quantile defines a particular part of a data set, i.e. a quantile determines how many values in a distribution are above or below a certain limit. Special quantiles are the quartile (quarter), the quintile (fifth), and percentiles (hundredth).An example:If we divide a distribution into four equal portions, we will speak of four quartiles. The first quartile includes all values that are smaller than a quarter of all values. In a graphical representation, it corresponds to 25% of the total area of distribution. The two lower quartiles comprise 50% of all distribution values. The interquartile range between the first and third quartile equals the range in which 50% of all values lie that are distributed around the mean. In Statistics, A Q-Q(quantile-quantile) plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight(y=x).Q-Q plotFor example, the median is a quantile where 50% of the data fall below that point and 50% lie above it. The purpose of Q Q plots is to find out if two sets of data come from the same distribution. 
A 45-degree angle is plotted on the Q Q plot; if the two data sets come from a common distribution, the points will fall on that reference line.It’s very important for you to know whether the distribution is normal or not so as to apply various statistical measures on the data and interpret it in much more human-understandable visualization and their Q-Q plot comes into the picture. The most fundamental question answered by the Q-Q plot is if the curve is Normally Distributed or not.Normally distributed, but why?The Q-Q plots are used to find the type of distribution for a random variable whether it is a Gaussian Distribution, Uniform Distribution, Exponential Distribution, or even Pareto Distribution, etc. You can tell the type of distribution using the power of the Q-Q plot just by looking at the plot. In general, we are talking about Normal distributions only because we have a very beautiful concept of the 68–95–99.7 rule which perfectly fits into the normal distribution So we know how much of the data lies in the range of the first standard deviation, second standard deviation and third standard deviation from the mean. So knowing if a distribution is Normal opens up new doors for us to experiment with Types of Q-Q plots. Source Skewed Q-Q plotsQ-Q plots can find skewness(measure of asymmetry) of the distribution. 
If the bottom end of the Q-Q plot deviates from the straight line but the upper end is not, then the distribution is Left skewed(Negatively skewed).Now if upper end of the Q-Q plot deviates from the staright line and the lower is not, then the distribution is Right skewed(Positively skewed).Tailed Q-Q plotsQ-Q plots can find Kurtosis(measure of tailedness) of the distribution.The distribution with the fat tail will have both the ends of the Q-Q plot to deviate from the straight line and its centre follows the line, where as a thin tailed distribution will term Q-Q plot with very less or negligible deviation at the ends thus making it a perfect fit for normal distribution.Q-Q plots in Python(Source)Suppose we have the following dataset of 100 values:import numpy as np #create dataset with 100 values that follow a normal distribution np.random.seed(0) data = np.random.normal(0,1, 1000) #view first 10 values data[:10] array([ 1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799, -0.97727788, 0.95008842, -0.15135721, -0.10321885, 0.4105985 ])To create a Q-Q plot for this dataset, we can use the qqplot() function from the statsmodels library:import statsmodels.api as sm import matplotlib.pyplot as plt #create Q-Q plot with 45-degree line added to plot fig = sm.qqplot(data, line='45') plt.show()In a Q-Q plot, the x-axis displays the theoretical quantiles. This means it doesn’t show your actual data, but instead, it represents where your data would be if it were normally distributed.The y-axis displays your actual data. This means that if the data values fall along a roughly straight line at a 45-degree angle, then the data is normally distributed.We can see in our Q-Q plot above that the data values tend to closely follow the 45-degree, which means the data is likely normally distributed. 
This shouldn’t be surprising since we generated the 100 data values by using the numpy.random.normal() function.Consider instead if we generated a dataset of 100 uniformly distributed values and created a Q-Q plot for that dataset:#create dataset of 100 uniformally distributed values data = np.random.uniform(0,1, 1000) #generate Q-Q plot for the dataset fig = sm.qqplot(data, line='45') plt.show()The data values clearly do not follow the red 45-degree line, which is an indication that they do not follow a normal distribution.Concept #2- Chebyshev's InequalityIn probability, Chebyshev’s Inequality, also known as “Bienayme-Chebyshev” Inequality guarantees that, for a wide class of probability distributions, only a definite fraction of values will be found within a specific distance from the mean of a distribution.Source: https://www.thoughtco.com/chebyshevs-inequality-3126547 Chebyshev’s inequality is similar to The Empirical rule(68-95-99.7); however, the latter rule only applies to normal distributions. Chebyshev’s inequality is broader; it can be applied to any distribution so long as the distribution includes a defined variance and mean.So Chebyshev’s inequality says that at least (1-1/k^2) of data from a sample must fall within K standard deviations from the mean (or equivalently, no more than 1/k^2 of the distribution’s values can be more than k standard deviations away from the mean).Where K --> Positive real numberIf the data is not normally distributed then different amounts of data could be in one standard deviation. 
Chebyshev’s inequality provides a way to know what fraction of data falls within K standard deviations from the mean for any data distribution.Also read: 22 Statistics Questions to Prepare for Data Science InterviewsCredits: https://calcworkshop.com/joint-probability-distribution/chebyshev-inequality/ Chebyshev’s inequality is of great value because it can be applied to any probability distribution in which the mean and variance are provided.Let us consider an example, Assume 1,000 contestants show up for a job interview, but there are only 70 positions available. In order to select the finest 70 contestants amongst the total contestants, the proprietor gives tests to judge their potential. The mean score on the test is 60, with a standard deviation of 6. If an applicant scores an 84, can they presume that they are getting the job?The results show that about 63 people scored above a 60, so with 70 positions available, a contestant who scores an 84 can be assured they got the job.Chebyshev's Inequality in Python(Source) Create a population of 1,000,000 values, I use a gamma distribution(also works with other distributions) with shape = 2 and scale = 2.import numpy as np import random import matplotlib.pyplot as plt #create a population with a gamma distribution shape, scale = 2., 2. #mean=4, std=2*sqrt(2) mu = shape*scale #mean and standard deviation sigma = scale*np.sqrt(shape) s = np.random.gamma(shape, scale, 1000000)Now sample 10,000 values from the population.#sample 10000 values rs = random.choices(s, k=10000)Count the sample that has a distance from the expected value larger than k standard deviation and use the count to calculate the probabilities. 
I want to depict a trend of probabilities when k is increasing, so I use a range of k from 0.1 to 3.#set k ks = [0.1,0.5,1.0,1.5,2.0,2.5,3.0] #probability list probs = [] #for each k for k in ks: #start count c = 0 for i in rs: # count if far from mean in k standard deviation if abs(i - mu) > k * sigma : c += 1 probs.append(c/10000)Plot the results:plot = plt.figure(figsize=(20,10)) #plot each probability plt.xlabel('K') plt.ylabel('probability') plt.plot(ks,probs, marker='o') plot.show() #print each probability print("Probability of a sample far from mean more than k standard deviation:") for i, prob in enumerate(probs): print("k:" + str(ks[i]) + ", probability: " \ + str(prob)[0:5] + \ " | in theory, probability should less than: " \ + str(1/ks[i]**2)[0:5])From the above plot and result, we can see that as the k increases, the probability is decreasing, and the probability of each k follows the inequality. Moreover, only the case that k is larger than 1 is useful. If k is less than 1, the right side of the inequality is larger than 1 which is not useful because the probability cannot be larger than 1.Concept #3- Log-Normal DistributionIn probability theory, a Log-normal distribution also known as Galton's distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed.Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Equivalently, if Y has a normal distribution, then the exponential function of Y i.e, X = exp(Y), has a log-normal distribution. Skewed distributions with low mean and high variance and all positive values fit under this type of distribution. A random variable that is log-normally distributed takes only positive real values. 
The general formula for the probability density function of the lognormal distribution is:

$$f(x) = \frac{1}{(x - \theta)\,\sigma\sqrt{2\pi}} \exp\left(-\frac{\left[\ln\left((x - \theta)/m\right)\right]^2}{2\sigma^2}\right), \qquad x > \theta;\ m, \sigma > 0$$

The shape of the lognormal distribution is defined by 3 parameters:

σ is the shape parameter (and is the standard deviation of the log of the distribution)
θ is the location parameter (it shifts the distribution along the x-axis)
m is the scale parameter (and is also the median of the distribution when θ = 0)

In the common two-parameter form (θ = 0), the distribution is usually written in terms of μ = ln(m) and σ, which are the mean and standard deviation of the logarithm of the random variable.

If x = θ, then f(x) = 0. The case where θ = 0 and m = 1 is called the standard lognormal distribution. The case where θ equals zero is called the 2-parameter lognormal distribution.

[Figure: effect of the μ and σ parameters on the probability density function of the lognormal distribution. Source: https://www.sciencedirect.com/topics/mathematics/lognormal-distribution]

Log-Normal Distribution in Python (Source)

Let us consider an example that generates random numbers from a log-normal distribution using scipy.stats.lognorm, with shape s = 0.5, loc = 1 and scale = 1000 (that is, σ = 0.5, θ = 1 and m = 1000 in the notation above).

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import lognorm

    np.random.seed(42)

    # draw 1000 samples from the log-normal distribution
    data = lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000)

    plt.figure(figsize=(10, 6))
    ax = plt.subplot(111)
    plt.title('Generate random numbers from a Log-normal distribution')
    ax.hist(data, bins=np.logspace(0, 5, 200), density=True)
    ax.set_xscale("log")

    # fit a log-normal distribution to the samples and overlay its PDF
    shape, loc, scale = lognorm.fit(data)

    x = np.logspace(0, 5, 200)
    pdf = lognorm.pdf(x, shape, loc, scale)
    ax.plot(x, pdf, 'y')
    plt.show()

Concept #4- Power Law distribution

In statistics, a power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities:
one quantity varies as a power of another. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four.

A power law has the form Y = k X^α, where:

X and Y are variables of interest,
α is the law's exponent,
k is a constant.

Source: https://en.wikipedia.org/wiki/Power_law

The power-law distribution is just one of many probability distributions, but it is considered a valuable tool for assessing uncertainty that the normal distribution cannot handle, for example heavy-tailed data in which extreme values are far more likely than a normal distribution allows.

Many processes have been found to follow power laws over substantial ranges of values: the distribution of incomes, the sizes of meteoroids, earthquake magnitudes, the spectral density of weight matrices in deep neural networks, word usage, the number of neighbors in various networks, and so on. (Note: the power law here is a continuous distribution. The last two examples are discrete, but on a large scale can be modeled as if continuous.)

Power-law distribution in Python (Source)

Let us plot the Pareto distribution, which is one form of a power-law probability distribution. The Pareto distribution is closely associated with the Pareto principle, or "80–20" rule, which states that 80% of society's wealth is held by 20% of its population. The Pareto principle is not a law of nature, but an observation. It is useful in many real-world problems.
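The 80–20 rule corresponds to one particular Pareto shape. For a Pareto distribution with shape α > 1, the share of the total held by the top fraction p of the population is p^(1 - 1/α); this closed form is a standard result that the text does not state, so treat it as an added assumption. The sketch below (function and variable names are illustrative) evaluates it for a few shapes, including α = log(5)/log(4) ≈ 1.16, where the top 20% hold 80%.

```python
import math


def top_share(p, alpha):
    """Share of the total held by the top fraction p under a Pareto(alpha) law.

    Uses the closed form p ** (1 - 1/alpha), valid for alpha > 1
    (for alpha <= 1 the mean is infinite and the share is undefined).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    return p ** (1 - 1 / alpha)


# Shape for which the top 20% hold 80% of the total
alpha_8020 = math.log(5) / math.log(4)

for alpha in (alpha_8020, 2.0, 3.0):
    share = top_share(0.2, alpha)
    print(f"alpha = {alpha:.3f}: top 20% hold {share:.1%} of the total")
```

Larger values of α give a thinner tail, so the top 20% hold a smaller share.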
The Pareto distribution is skewed and heavy-tailed.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import pareto

    x_m = 1  # scale
    alpha = [1, 2, 3]  # list of values of the shape parameter

    plt.figure(figsize=(10, 6))
    samples = np.linspace(start=0, stop=5, num=1000)

    # plot the PDF for each shape parameter
    for a in alpha:
        output = pareto.pdf(x=samples, b=a, loc=0, scale=x_m)
        plt.plot(samples, output, label='alpha {0}'.format(a))

    plt.xlabel('samples', fontsize=15)
    plt.ylabel('PDF', fontsize=15)
    plt.title('Probability Density function', fontsize=15)
    plt.legend(loc='best')
    plt.show()

Concept #5- Box cox transformation

The Box-Cox transformation transforms our data so that it closely resembles a normal distribution.

In many statistical techniques, we assume that the errors are normally distributed. This assumption allows us to construct confidence intervals and conduct hypothesis tests. By transforming our target variable, we can (hopefully) normalize our errors (if they are not already normal).

Additionally, transforming our variables can improve the predictive power of our models because transformations can cut away white noise.

[Figure: original distribution (left) and near-normal distribution after applying the Box-Cox transformation. Source]

At the core of the Box-Cox transformation is an exponent, lambda (λ), which typically varies from -5 to 5. All values of λ in this range are considered and the optimal value for your data is selected; the "optimal value" is the one that results in the best approximation of a normal distribution curve. The one-parameter Box-Cox transformations are defined as:

$$y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln y_i & \text{if } \lambda = 0 \end{cases}$$

and the two-parameter Box-Cox transformations as:

$$y_i^{(\lambda_1, \lambda_2)} = \begin{cases} \dfrac{(y_i + \lambda_2)^{\lambda_1} - 1}{\lambda_1} & \text{if } \lambda_1 \neq 0 \\ \ln (y_i + \lambda_2) & \text{if } \lambda_1 = 0 \end{cases}$$

Moreover, the one-parameter Box-Cox transformation holds for y > 0, i.e. only for positive values, and the two-parameter Box-Cox transformation holds for y > -λ_2, i.e. it also accommodates some negative values.
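As a sketch of the definitions, the one-parameter transform is (y^λ - 1)/λ for λ ≠ 0 and ln(y) for λ = 0, and the two-parameter version applies the same formula to y + λ_2. The helper names below are illustrative and not part of any library, and the code works on single values for clarity.

```python
import math


def boxcox_one_param(y, lam):
    """One-parameter Box-Cox transform of a single positive value y."""
    if y <= 0:
        raise ValueError("one-parameter Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam


def boxcox_two_param(y, lam1, lam2):
    """Two-parameter Box-Cox: shift y by lam2, then apply the one-parameter form."""
    if y + lam2 <= 0:
        raise ValueError("two-parameter Box-Cox requires y > -lam2")
    return boxcox_one_param(y + lam2, lam1)


print(boxcox_one_param(4.0, 0.5))        # (sqrt(4) - 1) / 0.5
print(boxcox_one_param(4.0, 1e-6))       # close to the log transform
print(boxcox_one_param(4.0, 0))          # log transform
print(boxcox_two_param(-1.0, 0.5, 5.0))  # negative y is allowed once shifted
```

The λ = 0 case is the limit of the power form as λ approaches zero, which is why the log transform fills that slot.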
The parameter λ is estimated using the profile likelihood function and goodness-of-fit tests.

The Box-Cox transformation has some drawbacks. If interpretation is your goal, Box-Cox is not recommended: if λ is some non-zero number, the transformed target variable may be more difficult to interpret than if we had simply applied a log transform.

A second stumbling block is that the Box-Cox transformation usually gives the median of the forecast distribution when we revert the transformed data to its original scale. Occasionally, we want the mean and not the median.

Box-Cox transformation in Python (Source)

SciPy's stats package provides a function called boxcox for performing the Box-Cox power transformation. It takes the original non-normal data as input and returns the transformed data along with the lambda value that was used to bring the non-normal distribution close to a normal distribution.

    # load necessary packages
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.stats import boxcox

    # make this example reproducible
    np.random.seed(0)

    # generate dataset
    data = np.random.exponential(size=1000)

    fig, ax = plt.subplots(1, 2)

    # plot the distribution of data values
    sns.kdeplot(x=data, fill=True, linewidth=2,
                label="Non-Normal", color="red", ax=ax[0])

    # perform Box-Cox transformation on original data
    transformed_data, best_lambda = boxcox(data)

    sns.kdeplot(x=transformed_data, fill=True, linewidth=2,
                label="Normal", color="red", ax=ax[1])

    # add a legend to each subplot
    ax[0].legend(loc="upper right")
    ax[1].legend(loc="upper right")

    # rescale the subplots
    fig.set_figheight(5)
    fig.set_figwidth(10)

    # display optimal lambda value
    print(f"Lambda value used for Transformation: {best_lambda}")
    plt.show()

Concept #6- Poisson distribution

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a
fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

In very simple terms, a Poisson distribution can be used to estimate how likely it is that something will happen "X" number of times. Some examples of Poisson processes are customers calling a help center, radioactive decay in atoms, visitors to a website, photons arriving at a space telescope, and movements in a stock price. Poisson processes are usually associated with time, but they do not have to be.

The formula for the Poisson distribution is:

$$P(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

Where:

e is Euler's number (e = 2.71828...)
k is the number of occurrences
k! is the factorial of k
λ is the expected value of the number of occurrences, which for a Poisson distribution is also equal to its variance

Lambda (λ) can be thought of as the expected number of events in the interval. As we change the rate parameter, λ, we change the probability of seeing different numbers of events in one interval.

[Figure: probability mass function of the Poisson distribution with varying rate parameters, showing the probability of a number of events occurring in an interval. Source]

The Poisson distribution is also commonly used to model financial count data where the tally is small and is often zero.
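To make the "small and often zero" point concrete, here is a minimal sketch of the Poisson probability mass function, P(k; λ) = λ^k e^(-λ) / k!, written with the standard library. The rate of 0.5 events per interval is an assumed value chosen for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    if k < 0 or lam <= 0:
        raise ValueError("k must be >= 0 and lam must be > 0")
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 0.5  # assumed average number of events per interval

for k in range(4):
    print(f"P({k} events) = {poisson_pmf(k, lam):.4f}")

# The probabilities sum to 1, and the mean and variance both equal lam
ks = range(50)
total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(total, mean, var)
```

With a rate this low, a count of zero is the single most likely outcome.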
For example, in finance, it can be used to model the number of trades that a typical investor will make in a given day, which can be 0 (often), or 1, or 2, etc.

As another example, this model can be used to predict the number of "shocks" to the market that will occur in a given time period, say over a decade.

Poisson distribution in Python

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    lam_list = [1, 4, 9]  # list of lambda values

    plt.figure(figsize=(10, 6))

    # draw samples for each lambda and plot their estimated density;
    # a larger sample gives a smoother curve
    for lam in lam_list:
        sns.kdeplot(x=np.random.poisson(lam=lam, size=1000),
                    label='lambda {0}'.format(lam))

    plt.xlabel('Poisson Distribution', fontsize=15)
    plt.ylabel('Density', fontsize=15)
    plt.legend(loc='best')
    plt.show()

As λ becomes bigger, the graph looks more like a normal distribution.

I hope you have enjoyed reading this article. If you have any questions or suggestions, please leave a comment.

Feel free to connect with me on LinkedIn for any query.

Thanks for reading!!!

References

https://calcworkshop.com/joint-probability-distribution/chebyshev-inequality/
https://corporatefinanceinstitute.com/resources/knowledge/data-analysis/chebyshevs-inequality/
https://www.itl.nist.gov/div898/handbook/eda/section3/eda3669.htm
https://www.statology.org/q-q-plot-python/
https://gist.github.com/chaipi-chaya/9eb72978dbbfd7fa4057b493cf6a32e7
https://stackoverflow.com/a/41968334/7175247

Daniel Morales

May 14, 2020

Random Forests for Complete Beginners

Contents Outline

Oleksii Kharkovyna

Random Forests for Complete Beginners

Decision Trees in a Nutshell

What is an Ensemble Method?

What Is the Idea of Random Forest?

Random Forests vs Decision Trees

Random Forest Algorithm with Python and Scikit-Learn

Wrapping it up…
