This blog post was originally intended to be a side note in my Pandas Join vs. Merge post, but it turned out to be long enough to warrant its own post (and way too verbose for a side note). It’s not meant to be a full-on primer on SQL joins, but rather an example to help those new to SQL and relational databases begin to grasp what it means to join two tables.

Why Do We Join?

Why bother with joining at all? Can’t we just dump everything into a spreadsheet and sort things out there? Perhaps… but it would be incredibly time-consuming, tedious, and error-prone.

Relational databases are designed to be joined. Each table in the database contains data of a specific form or function. For example, one table might have basic data on a company’s customers such as customer ID (a unique ID that can be used to identify each customer), name, age, gender, date of first purchase, and address. A separate, much larger table might store detailed transaction-level data: transaction ID, date of transaction, customer ID, product category, product ID, units sold, and price.

A given customer (or customer ID) could have hundreds or even thousands of transactions, so it would be extremely redundant to store that customer’s basic information over and over again for each row in the transactions table. The transactions table should be only for data relevant to transactions. Having too much overlapping data between tables is wasteful and can negatively impact system performance.

But that doesn’t mean we don’t care about the linkages between tables. Given how specific each table is, analyses that involve only a single table are often of limited use. The interesting analyses come from datasets that combine multiple tables. For example, we might want to segment transactions by age or geography. To do this, we would need data from both tables. And that’s where joins come in.

How Do We Join?

When we join two tables, we are linking them together via a selected characteristic. Let’s say we have two tables. The first one, Employee, lists out an employee’s unique ID number, name, and job title. The second one, Sale, lists out data on who made what sale by attaching the employee’s ID number and the units sold to a unique sales number:

SELECT * FROM Employee;
SELECT * FROM Sale;

Our 2 tables, Employee (left) and Sale (right)

(I omitted the underscores from the column names in my graphics for legibility)

Now let’s join the two tables. To link them, we need to pick a column (or combination of columns) that serves as the point of intersection. Let’s call the chosen column the join index (you will also see it called the join key). Table entries that share the same value for the join index are joined together. Note that the match does not have to be one-to-one. For example, Tony has made 2 sales, so upon joining the tables, both of his sales will be linked to Tony (a.k.a. Employee ID 1).

When we join tables, we generally want the join index to uniquely identify the rows in at least one of the tables (here, each employee in the Employee table). If it does not, quirky stuff might occur. For example, let’s say we had a second employee named Tony (shown in the tables below), and he was a megastar salesman. If instead of joining on “Employee ID” we joined on “Name”, then we would mistakenly link Tony the Megastar’s sales to me, making my bonus way too high:

Joining on non-unique columns is not recommended

And Tony the Megastar would get credit for my pitiful sales as well (not that he needs it). So to avoid this, we join on a column with unique values such as “Employee ID” (I removed Tony the Megastar as he was only there to illustrate what not to do, and his incredible successes made me feel unworthy):

The “Employee ID” column provides the link between the 2 tables
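To make the difference concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The rows are made up for illustration (they are not the tables from the figures), the Employee ID of 7 for Tony the Megastar is an arbitrary choice, and I am assuming the Sale table also carries a “Name” column so that a join on name is possible at all:

```python
import sqlite3

# Hypothetical data (not the tables from the figures): two employees share the
# name "Tony", and Sale carries a Name column so that we can join on it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Employee_ID INTEGER PRIMARY KEY, Name TEXT, Title TEXT);
    CREATE TABLE Sale (Sale_Number INTEGER PRIMARY KEY, Employee_ID INTEGER,
                       Name TEXT, Units_Sold INTEGER);
    INSERT INTO Employee VALUES (1, 'Tony', 'Analyst'), (2, 'Lisa', 'Manager'),
                                (7, 'Tony', 'Megastar');
    INSERT INTO Sale VALUES (101, 1, 'Tony', 5), (102, 2, 'Lisa', 20),
                            (103, 7, 'Tony', 500);
""")

def units_per_employee(join_column):
    """Total units sold per employee when joining on the given column."""
    # join_column is one of our own hard-coded column names, never user input.
    query = f"""
        SELECT e.Employee_ID, SUM(s.Units_Sold)
        FROM Employee AS e
            LEFT JOIN Sale AS s ON e.{join_column} = s.{join_column}
        GROUP BY e.Employee_ID
        ORDER BY e.Employee_ID
    """
    return conn.execute(query).fetchall()

by_name = units_per_employee("Name")         # non-unique join index
by_id = units_per_employee("Employee_ID")    # unique join index
print("Joined on Name:       ", by_name)
print("Joined on Employee_ID:", by_id)
```

Joining on “Name” credits both Tonys with all 505 units, while joining on “Employee ID” keeps the 5 units and the 500 units with their rightful owners.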

There are various types of SQL joins and I will not go into the details of all of them here. In this example, we will use a left join, meaning that we prioritize the rows in the left table. So our output will include every row in the left table (Employee, the one with “Name” and “Title”) regardless of whether there is a match in the right table (Sale). Employees that have not made a sale will therefore still have a row in our output, but the “Sale Number” and “Units Sold” columns will hold no values (NULLs, to be exact).

Let’s take a look at our output (we are selecting only “Sale Number” and “Units Sold” from the right table and sorting by “Employee ID”):

SELECT e.*, s.Sale_Number, s.Units_Sold
FROM Employee AS e
    LEFT JOIN Sale AS s ON e.Employee_ID = s.Employee_ID
ORDER BY e.Employee_ID;

The result of our left join
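If you would like to run the left join yourself, the sketch below does so with Python’s built-in sqlite3 module. The rows, titles, and unit counts are hypothetical stand-ins for the tables in the figures. I have also added an employee with no sales (Maya, an assumption that is not in the original tables) to show the NULLs, and a secondary sort on “Sale Number” so the row order is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Employee_ID INTEGER PRIMARY KEY, Name TEXT, Title TEXT);
    CREATE TABLE Sale (Sale_Number INTEGER PRIMARY KEY, Employee_ID INTEGER,
                       Units_Sold INTEGER);
    -- Hypothetical rows; Maya (ID 4) has no sales.
    INSERT INTO Employee VALUES (1, 'Tony', 'Analyst'), (2, 'Lisa', 'Manager'),
                                (4, 'Maya', 'Intern');
    -- Sale 4 belongs to Employee ID 3, who is not in the Employee table.
    INSERT INTO Sale VALUES (1, 1, 10), (2, 2, 8), (3, 1, 4), (4, 3, 12);
""")

rows = conn.execute("""
    SELECT e.*, s.Sale_Number, s.Units_Sold
    FROM Employee AS e
        LEFT JOIN Sale AS s ON e.Employee_ID = s.Employee_ID
    ORDER BY e.Employee_ID, s.Sale_Number
""").fetchall()

for row in rows:
    print(row)  # SQL NULLs come back as Python None
```

Tony appears twice (once per sale), Maya appears once with None in the two sale columns, and the sale belonging to Employee ID 3 does not appear at all.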

The output of our join now includes data from both tables. Tony’s sales data has been linked with his employee data (thanks to Employee ID) as has Lisa’s. Notice 2 things:

It looks a bit repetitive because “Employee ID”, “Name”, and “Title” are repeated once for every sale the employee has made. In reality, we wouldn’t stop here though. Next, we would most likely do a group by in order to count up how many sales each employee made, or calculate the average number of units sold each time an employee makes a sale.
Employee ID 3, which appears only in the Sale table, is missing from the output. Because we did a left join and there was no entry for Employee ID 3 in the left table (Employee), that sale was omitted.
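The group by step mentioned in the first point could look something like the sketch below, again using Python’s built-in sqlite3 module. The data is hypothetical, including Maya, an employee with no sales whom I added to show how the aggregates treat NULLs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Employee_ID INTEGER PRIMARY KEY, Name TEXT, Title TEXT);
    CREATE TABLE Sale (Sale_Number INTEGER PRIMARY KEY, Employee_ID INTEGER,
                       Units_Sold INTEGER);
    -- Hypothetical rows; Maya (ID 4) has no sales.
    INSERT INTO Employee VALUES (1, 'Tony', 'Analyst'), (2, 'Lisa', 'Manager'),
                                (4, 'Maya', 'Intern');
    INSERT INTO Sale VALUES (1, 1, 10), (2, 2, 8), (3, 1, 4), (4, 3, 12);
""")

summary = conn.execute("""
    SELECT e.Employee_ID,
           e.Name,
           COUNT(s.Sale_Number) AS Num_Sales,   -- counts only non-NULL values
           AVG(s.Units_Sold)    AS Avg_Units    -- NULL when there are no sales
    FROM Employee AS e
        LEFT JOIN Sale AS s ON e.Employee_ID = s.Employee_ID
    GROUP BY e.Employee_ID, e.Name
    ORDER BY e.Employee_ID
""").fetchall()

for row in summary:
    print(row)
```

Counting “Sale Number” rather than using COUNT(*) is a deliberate choice here: COUNT(*) would report 1 for an employee with no sales, because the left join still produces one row for that employee.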

And that concludes our brief example. Hopefully, this gives you a rudimentary idea of why we need joins and how they work.

Key Takeaways:

Database tables generally contain very specific information. Therefore, meaningful analyses usually combine data from multiple tables.
This is accomplished via the join operation, which combines two tables by matching them up based on a specified column.
The column used to combine the tables should uniquely identify the rows in at least one of the tables (like “Employee ID” in the Employee table).
There are various types of joins. The one in this example is a left join, which returns every row in the left table whether or not there is a match.

Most Related Articles

Python

Programming

Why Decorators In Python Are Pure Genius?

Analyze, test, and re-use your code with little more than an @ symbolIf there’s one thing that makes Python incredibly successful, that would be its readability. Everything else hinges on that: if code is unreadable, it’s hard to maintain. It’s also not beginner-friendly then — a novice getting boggled by unreadable code won’t attempt writing its own one day.Python was already readable and beginner-friendly before decorators came around. But as the language started getting used for more and more things, Python developers felt the need for more and more features, without cluttering the landscape and making code unreadable.Decorators are a prime-time example of a perfectly implemented feature. It does take a while to wrap your head around, but it’s worth it. As you start using them, you’ll notice how they don’t overcomplicate things and make your code neat and snazzy.Before anything else: higher-order functionsIn a nutshell, decorators are a neat way to handle higher-order functions. So let’s look at those first!Functions returning functionsSay you have one function, greet() — it greets whatever object you pass it. And let’s say you have another function, simon() — it inserts “Simon” wherever appropriate. How can we combine the two? Think about it a minute before you look below.def greet(name): return f"Hello, {name}!" def simon(func): return func("Simon") simon(greet)The output is 'Hello, Simon!'. Hope that makes sense to ya!Of course we could have just called greet("Simon"). However, the whole point is that we might want to put “Simon” into many different functions. And if we don’t use “Simon” but something more complicated, we can save a whole lot of lines of code by packing it into a function like simon().Functions inside other functionsWe can also define functions inside other functions. That’s important because decorators will do that, too! Without decorators it looks like this:def respect(maybe): def congrats(): return "Congrats, bro!" 
def insult(): return "You're silly!" if maybe == "yes": return congrats else: return insultThe function respect() returns a function; respect("yes") returns the congrats function, respect("brother") (or some other argument instead of "brother") returns the insult function. To call the functions, enter respect("yes")() and respect("brother")(), just like a normal function.Also read: Python Books You Must Read in 2020Got it? Then you’re all set for decorators!Code is beautifully nerdy. Image by author.The ABC of Python decoratorsFunctions with an @ symbolLet’s try a combination of the two previous concepts: a function that takes another function and defines a function. Sounds mind-boggling? Consider this:def startstop(func): def wrapper(): print("Starting...") func() print("Finished!") return wrapper def roll(): print("Rolling on the floor laughing XD") roll = startstop(roll)The last line ensures that we don’t need to call startstop(roll)() anymore; roll() will suffice. Do you know what the output of that call is? Try it yourself if you’re unsure!Now, as a very good alternative, we could insert this right after defining startstop():@startstop def roll(): print("Rolling on the floor laughing XD")This does the same, but glues roll() to startstop() at the onset.Added flexibilityWhy is that useful? Doesn’t that consume exactly as many lines of code as before?In this case, yes. But once you’re dealing with slightly more complicated stuff, it gets really useful. For once, you can move all decorators (i.e. the def startstop() part above) into its own module. That is, you write them into a file called decorators.py and write something like this into your main file:from decorators import startstop @startstop def roll(): print("Rolling on the floor laughing XD")In principle, you can do that without using decorators. 
But this way it makes life easier because you don’t have to deal with nested functions and endless bracket-counting anymore.You can also nest decorators:from decorators import startstop, exectime @exectime @startstop def roll(): print("Rolling on the floor laughing XD")Note that we haven’t defined exectime() yet, but you’ll see it in the next section. It’s a function that can measure how long a process takes in Python.This nesting would be equivalent to a line like this:roll = exectime(startstop(roll))Bracket counting is starting! Imagine you had five or six of those functions nested inside each other. Wouldn’t the decorator notation be much easier to read than this nested mess?You can even use decorators on functions that accept arguments. Now imagine a few arguments in the line above and your chaos would be complete. Decorators make it neat and tidy.Also read: How to Get a Job With PythonFinally, you can even add arguments to your decorators — like @mydecorator(argument). Yeah, you can do all of this without decorators. But then I wish you a lot of fun understanding your decorator-free code when you re-read it in three weeks…Decorators make everything easier. Image by author.Applications: where decorators cut the creamNow that I’ve hopefully convinced you that decorators make your life three times easier, let’s look at some classic examples where decorators are basically indispensable.Measuring execution timeLet’s say we have a function called waste time() and we want to know how long it takes. Well, just use a decorator!import time def measuretime(func): def wrapper(): starttime = time.perf_counter() func() endtime = time.perf_counter() print(f"Time needed: {endtime - starttime} seconds") return wrapper @measuretime def wastetime(): sum([i**2 for i in range(1000000)]) wastetime()A dozen lines of code and we’re done! 
Plus, you can use measuretime() on as many functions as you want.Slowing code downSometimes you don’t want to execute code immediately but wait a while. That’s where a slow-down decorator comes in handy:import time def sleep(func): def wrapper(): time.sleep(300) return func() return wrapper @sleep def wakeup(): print("Get up! Your break is over.") wakeup()Calling wakeup() makes lets you take a 5-minute break, after which your console reminds you to get back to work.Also read: Building A Linear Regression Model With Python To Predict Retail Customer SpendingTesting and debuggingSay you have a whole lot of different functions that you call at different stages, and you’re losing the overview over what’s being called when. With a simple decorator for every function definition, you can bring more clarity. Like so:def debug(func): def wrapper(): print(f"Calling {func.__name__}") return wrapper @debug def scare(): print("Boo!") scare()There is an a lot more elaborate example here. Note, though, that to understand that example, you’ll have to check how to decorate functions with arguments. Still, it’s worth the read!Reusing codeThis kinda goes without saying. If you’ve defined a function decorator(), you can just sprinkle @decorator everywhere in your code. To be honest, I don’t think it gets any simpler than that!Handling loginsIf you have functionalities that should only be accessed if a user is logged in, that’s also fairly easy with decorators. I’ll refer you to the full example for reference, but the principle is quite simple: first you define a function like login_required(). Before any function definition that needs logging in, you pop @login_required. Simple enough, I’d say.Syntactic sugar — or why Python is so sweetIt’s not like I’m not critical of Python or not using alternative languages where it’s appropriate. 
But there’s a big allure to Python: it’s so easy to digest, even when you’re not a computer scientist by training and just want to make things work.If C++ is an orange, then Python is a pineapple: similarly nutritious, but three times sweeter. Decorators are just one factor in the mix.But I hope you’ve come to see why it’s such a big sweet-factor. Syntactic sugar to add some pleasure to your life! Without health risks, except for having your eyes glued on a screen.I wish you lots of sweet code!Also Read: How to Use Python Datetimes Correctly?

Daniel Morales

May 18, 2020

Data Science

Programming

6 Advanced Statistical Concepts in Data Science

The article contains some of the most commonly used advanced statistical concepts along with their Python implementation.In my previous articles Beginners Guide to Statistics in Data Science and The Inferential Statistics Data Scientists Should Know we have talked about almost all the basics(Descriptive and Inferential) of statistics which are commonly used in understanding and working with any data science case study. In this article, lets go a little beyond and talk about some advance concepts which are not part of the buzz.Concept #1 - Q-Q(quantile-quantile) PlotsBefore understanding QQ plots first understand what is a Quantile?A quantile defines a particular part of a data set, i.e. a quantile determines how many values in a distribution are above or below a certain limit. Special quantiles are the quartile (quarter), the quintile (fifth), and percentiles (hundredth).An example:If we divide a distribution into four equal portions, we will speak of four quartiles. The first quartile includes all values that are smaller than a quarter of all values. In a graphical representation, it corresponds to 25% of the total area of distribution. The two lower quartiles comprise 50% of all distribution values. The interquartile range between the first and third quartile equals the range in which 50% of all values lie that are distributed around the mean. In Statistics, A Q-Q(quantile-quantile) plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight(y=x).Q-Q plotFor example, the median is a quantile where 50% of the data fall below that point and 50% lie above it. The purpose of Q Q plots is to find out if two sets of data come from the same distribution. 
A 45-degree angle is plotted on the Q Q plot; if the two data sets come from a common distribution, the points will fall on that reference line.It’s very important for you to know whether the distribution is normal or not so as to apply various statistical measures on the data and interpret it in much more human-understandable visualization and their Q-Q plot comes into the picture. The most fundamental question answered by the Q-Q plot is if the curve is Normally Distributed or not.Normally distributed, but why?The Q-Q plots are used to find the type of distribution for a random variable whether it is a Gaussian Distribution, Uniform Distribution, Exponential Distribution, or even Pareto Distribution, etc. You can tell the type of distribution using the power of the Q-Q plot just by looking at the plot. In general, we are talking about Normal distributions only because we have a very beautiful concept of the 68–95–99.7 rule which perfectly fits into the normal distribution So we know how much of the data lies in the range of the first standard deviation, second standard deviation and third standard deviation from the mean. So knowing if a distribution is Normal opens up new doors for us to experiment with Types of Q-Q plots. Source Skewed Q-Q plotsQ-Q plots can find skewness(measure of asymmetry) of the distribution. 
If the bottom end of the Q-Q plot deviates from the straight line but the upper end is not, then the distribution is Left skewed(Negatively skewed).Now if upper end of the Q-Q plot deviates from the staright line and the lower is not, then the distribution is Right skewed(Positively skewed).Tailed Q-Q plotsQ-Q plots can find Kurtosis(measure of tailedness) of the distribution.The distribution with the fat tail will have both the ends of the Q-Q plot to deviate from the straight line and its centre follows the line, where as a thin tailed distribution will term Q-Q plot with very less or negligible deviation at the ends thus making it a perfect fit for normal distribution.Q-Q plots in Python(Source)Suppose we have the following dataset of 100 values:import numpy as np #create dataset with 100 values that follow a normal distribution np.random.seed(0) data = np.random.normal(0,1, 1000) #view first 10 values data[:10] array([ 1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799, -0.97727788, 0.95008842, -0.15135721, -0.10321885, 0.4105985 ])To create a Q-Q plot for this dataset, we can use the qqplot() function from the statsmodels library:import statsmodels.api as sm import matplotlib.pyplot as plt #create Q-Q plot with 45-degree line added to plot fig = sm.qqplot(data, line='45') plt.show()In a Q-Q plot, the x-axis displays the theoretical quantiles. This means it doesn’t show your actual data, but instead, it represents where your data would be if it were normally distributed.The y-axis displays your actual data. This means that if the data values fall along a roughly straight line at a 45-degree angle, then the data is normally distributed.We can see in our Q-Q plot above that the data values tend to closely follow the 45-degree, which means the data is likely normally distributed. 
This shouldn’t be surprising since we generated the 100 data values by using the numpy.random.normal() function.Consider instead if we generated a dataset of 100 uniformly distributed values and created a Q-Q plot for that dataset:#create dataset of 100 uniformally distributed values data = np.random.uniform(0,1, 1000) #generate Q-Q plot for the dataset fig = sm.qqplot(data, line='45') plt.show()The data values clearly do not follow the red 45-degree line, which is an indication that they do not follow a normal distribution.Concept #2- Chebyshev's InequalityIn probability, Chebyshev’s Inequality, also known as “Bienayme-Chebyshev” Inequality guarantees that, for a wide class of probability distributions, only a definite fraction of values will be found within a specific distance from the mean of a distribution.Source: https://www.thoughtco.com/chebyshevs-inequality-3126547 Chebyshev’s inequality is similar to The Empirical rule(68-95-99.7); however, the latter rule only applies to normal distributions. Chebyshev’s inequality is broader; it can be applied to any distribution so long as the distribution includes a defined variance and mean.So Chebyshev’s inequality says that at least (1-1/k^2) of data from a sample must fall within K standard deviations from the mean (or equivalently, no more than 1/k^2 of the distribution’s values can be more than k standard deviations away from the mean).Where K --> Positive real numberIf the data is not normally distributed then different amounts of data could be in one standard deviation. 
Chebyshev’s inequality provides a way to know what fraction of data falls within K standard deviations from the mean for any data distribution.Also read: 22 Statistics Questions to Prepare for Data Science InterviewsCredits: https://calcworkshop.com/joint-probability-distribution/chebyshev-inequality/ Chebyshev’s inequality is of great value because it can be applied to any probability distribution in which the mean and variance are provided.Let us consider an example, Assume 1,000 contestants show up for a job interview, but there are only 70 positions available. In order to select the finest 70 contestants amongst the total contestants, the proprietor gives tests to judge their potential. The mean score on the test is 60, with a standard deviation of 6. If an applicant scores an 84, can they presume that they are getting the job?The results show that about 63 people scored above a 60, so with 70 positions available, a contestant who scores an 84 can be assured they got the job.Chebyshev's Inequality in Python(Source) Create a population of 1,000,000 values, I use a gamma distribution(also works with other distributions) with shape = 2 and scale = 2.import numpy as np import random import matplotlib.pyplot as plt #create a population with a gamma distribution shape, scale = 2., 2. #mean=4, std=2*sqrt(2) mu = shape*scale #mean and standard deviation sigma = scale*np.sqrt(shape) s = np.random.gamma(shape, scale, 1000000)Now sample 10,000 values from the population.#sample 10000 values rs = random.choices(s, k=10000)Count the sample that has a distance from the expected value larger than k standard deviation and use the count to calculate the probabilities. 
I want to depict a trend of probabilities when k is increasing, so I use a range of k from 0.1 to 3.#set k ks = [0.1,0.5,1.0,1.5,2.0,2.5,3.0] #probability list probs = [] #for each k for k in ks: #start count c = 0 for i in rs: # count if far from mean in k standard deviation if abs(i - mu) > k * sigma : c += 1 probs.append(c/10000)Plot the results:plot = plt.figure(figsize=(20,10)) #plot each probability plt.xlabel('K') plt.ylabel('probability') plt.plot(ks,probs, marker='o') plot.show() #print each probability print("Probability of a sample far from mean more than k standard deviation:") for i, prob in enumerate(probs): print("k:" + str(ks[i]) + ", probability: " \ + str(prob)[0:5] + \ " | in theory, probability should less than: " \ + str(1/ks[i]**2)[0:5])From the above plot and result, we can see that as the k increases, the probability is decreasing, and the probability of each k follows the inequality. Moreover, only the case that k is larger than 1 is useful. If k is less than 1, the right side of the inequality is larger than 1 which is not useful because the probability cannot be larger than 1.Concept #3- Log-Normal DistributionIn probability theory, a Log-normal distribution also known as Galton's distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed.Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Equivalently, if Y has a normal distribution, then the exponential function of Y i.e, X = exp(Y), has a log-normal distribution. Skewed distributions with low mean and high variance and all positive values fit under this type of distribution. A random variable that is log-normally distributed takes only positive real values. 
The general formula for the probability density function of the lognormal distribution is:The location and scale parameters are equivalent to the mean and standard deviation of the logarithm of the random variable.The shape of Lognormal distribution is defined by 3 parameters:σis the shape parameter, (and is the standard deviation of the log of the distribution)θ or μ is the location parameter (and is the mean of the distribution)m is the scale parameter (and is also the median of the distribution)The location and scale parameters are equivalent to the mean and standard deviation of the logarithm of the random variable as explained above.If x = θ, then f(x) = 0. The case where θ = 0 and m = 1 is called the standard lognormal distribution. The case where θ equals zero is called the 2-parameter lognormal distribution.The following graph illustrates the effect of the location(μ) and scale(σ) parameter on the probability density function of the lognormal distribution: Source: https://www.sciencedirect.com/topics/mathematics/lognormal-distribution Log-Normal Distribution in Python(Source)Let us consider an example to generate random numbers from a log-normal distribution with μ=1 and σ=0.5 using scipy.stats.lognorm function.import numpy as np import matplotlib.pyplot as plt from scipy.stats import lognorm np.random.seed(42) data = lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000) plt.figure(figsize=(10,6)) ax = plt.subplot(111) plt.title('Generate wrandom numbers from a Log-normal distribution') ax.hist(data, bins=np.logspace(0,5,200), density=True) ax.set_xscale("log") shape,loc,scale = lognorm.fit(data) x = np.logspace(0, 5, 200) pdf = lognorm.pdf(x, shape, loc, scale) ax.plot(x, pdf, 'y') plt.show()Concept #4- Power Law distributionIn statistics, a Power Law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities: 
one quantity varies as a power of another. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four.A power law distribution has the form Y = k Xα, where:X and Y are variables of interest,α is the law’s exponent,k is a constant.Source: https://en.wikipedia.org/wiki/Power_law Power-law distribution is just one of many probability distributions, but it is considered a valuable tool to assess uncertainty issues that normal distribution cannot handle when they occur at a certain probability.Many processes have been found to follow power laws over substantial ranges of values. From the distribution in incomes, size of meteoroids, earthquake magnitudes, the spectral density of weight matrices in deep neural networks, word usage, number of neighbors in various networks, etc. (Note: The power law here is a continuous distribution. The last two examples are discrete, but on a large scale can be modeled as if continuous).Also read: Statistical Measures of Central TendencyPower-law distribution in Python(Source) Let us plot the Pareto distribution which is one form of a power-law probability distribution. Pareto distribution is sometimes known as the Pareto Principle or ‘80–20’ rule, as the rule states that 80% of society’s wealth is held by 20% of its population. Pareto distribution is not a law of nature, but an observation. It is useful in many real-world problems. 
It is a skewed heavily tailed distribution.import numpy as np import matplotlib.pyplot as plt from scipy.stats import pareto x_m = 1 #scale alpha = [1, 2, 3] #list of values of shape parameters plt.figure(figsize=(10,6)) samples = np.linspace(start=0, stop=5, num=1000) for a in alpha: output = np.array([pareto.pdf(x=samples, b=a, loc=0, scale=x_m)]) plt.plot(samples, output.T, label='alpha {0}' .format(a)) plt.xlabel('samples', fontsize=15) plt.ylabel('PDF', fontsize=15) plt.title('Probability Density function', fontsize=15) plt.legend(loc='best') plt.show()Concept #5- Box cox transformationThe Box-Cox transformation transforms our data so that it closely resembles a normal distribution.The one-parameter Box-Cox transformations are defined as In many statistical techniques, we assume that the errors are normally distributed. This assumption allows us to construct confidence intervals and conduct hypothesis tests. By transforming your target variable, we can (hopefully) normalize our errors (if they are not already normal).Additionally, transforming our variables can improve the predictive power of our models because transformations can cut away white noise.Original distribution(Left) and near-normal distribution after applying Box cox transformation. Source At the core of the Box-Cox transformation is an exponent, lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; The “optimal value” is the one that results in the best approximation of a normal distribution curve. The one-parameter Box-Cox transformations are defined as:and the two-parameter Box-Cox transformations as:Moreover, the one-parameter Box-Cox transformation holds for y > 0, i.e. only for positive values and two-parameter Box-Cox transformation for y > -λ, i.e. negative values. 
The parameter λ is estimated using the profile likelihood function and using goodness-of-fit tests.If we talk about some drawbacks of Box-cox transformation, then if interpretation is what you want to do, then Box-cox is not recommended. Because if λ is some non-zero number, then the transformed target variable may be more difficult to interpret than if we simply applied a log transform.A second stumbling block is that the Box-Cox transformation usually gives the median of the forecast distribution when we revert the transformed data to its original scale. Occasionally, we want the mean and not the median.Box-Cox transformation in Python(Source)SciPy’s stats package provides a function called boxcox for performing box-cox power transformation that takes in original non-normal data as input and returns fitted data along with the lambda value that was used to fit the non-normal distribution to normal distribution.#load necessary packages import numpy as np from scipy.stats import boxcox import seaborn as sns #make this example reproducible np.random.seed(0) #generate dataset data = np.random.exponential(size=1000) fig, ax = plt.subplots(1, 2) #plot the distribution of data values sns.distplot(data, hist=False, kde=True, kde_kws = {'shade': True, 'linewidth': 2}, label = "Non-Normal", color ="red", ax = ax[0]) #perform Box-Cox transformation on original data transformed_data, best_lambda = boxcox(data) sns.distplot(transformed_data, hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 2}, label = "Normal", color ="red", ax = ax[1]) #adding legends to the subplots plt.legend(loc = "upper right") #rescaling the subplots fig.set_figheight(5) fig.set_figwidth(10) #display optimal lambda value print(f"Lambda value used for Transformation: {best_lambda}") Concept #6- Poisson distributionIn probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a 
fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.In very simple terms, A Poisson distribution can be used to estimate how likely it is that something will happen "X" number of times. Some examples of Poisson processes are customers calling a help center, radioactive decay in atoms, visitors to a website, photons arriving at a space telescope, and movements in a stock price. Poisson processes are usually associated with time, but they do not have to be. The Formula for the Poisson Distribution Is:Where:e is Euler's number (e = 2.71828...)k is the number of occurrencesk! is the factorial of kλ is equal to the expected value of kwhen that is also equal to its varianceLambda(λ) can be thought of as the expected number of events in the interval. As we change the rate parameter, λ, we change the probability of seeing different numbers of events in one interval. The below graph is the probability mass function of the Poisson distribution showing the probability of a number of events occurring in an interval with different rate parameters. Probability Mass function for Poisson Distribution with varying rate parameters.Source The Poisson distribution is also commonly used to model financial count data where the tally is small and is often zero. 
For example, in finance, the Poisson distribution can be used to model the number of trades that a typical investor will make in a given day, which can be 0 (often), or 1, or 2, etc.

As another example, this model can be used to predict the number of "shocks" to the market that will occur in a given time period, say over a decade.

Poisson distribution in Python

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

lam_list = [1, 4, 9]  # list of lambda values

plt.figure(figsize=(10, 6))
for lam in lam_list:
    # kdeplot replaces the deprecated distplot(hist=False); 1000 draws
    # (instead of 10) give a curve stable enough to compare shapes
    sns.kdeplot(np.random.poisson(lam=lam, size=1000), label='lambda {0}'.format(lam))

plt.xlabel('Poisson Distribution', fontsize=15)
plt.ylabel('Density', fontsize=15)
plt.legend(loc='best')
plt.show()
```

As λ becomes bigger, the graph looks more like a normal distribution.

I hope you have enjoyed reading this article. If you have any questions or suggestions, please leave a comment.

Also read: False Positives vs. False Negatives

Feel free to connect with me on LinkedIn for any query.

Thanks for reading!!!

References

* https://calcworkshop.com/joint-probability-distribution/chebyshev-inequality/
* https://corporatefinanceinstitute.com/resources/knowledge/data-analysis/chebyshevs-inequality/
* https://www.itl.nist.gov/div898/handbook/eda/section3/eda3669.htm
* https://www.statology.org/q-q-plot-python/
* https://gist.github.com/chaipi-chaya/9eb72978dbbfd7fa4057b493cf6a32e7
* https://stackoverflow.com/a/41968334/7175247

Daniel Morales

May 18, 2020

SQL

Pandas

Pandas vs SQL. When Data Scientists Should Use One Over the Other

A deep dive into the benefits of each tool

Table of Contents

* Introduction
* Pandas
* SQL
* Summary
* References

Introduction

Both of these tools are important not only to data scientists, but also to those in similar positions like data analytics and business intelligence. With that being said, when should data scientists specifically use pandas over SQL and vice versa? In some situations, you can get away with just using SQL, and at other times pandas is much easier to use, especially for data scientists who focus on research in a Jupyter Notebook setting. Below, I will discuss when you should use SQL and when you should use pandas. Keep in mind that both of these tools have specific use cases, but there are many times where their functionality overlaps, and that is what I will be comparing below as well.

Pandas

Photo by Kalen Kemp on Unsplash [2].

Pandas [3] is an open-source data analysis tool in the Python programming language. The benefit of pandas starts when you already have your main dataset, usually from a SQL query. This main difference can mean that the two tools are separate; however, you can also perform several of the same functions in each tool. For example, you can create new features from existing columns in pandas, perhaps more easily and faster than in SQL.

It is important to note that I am not comparing what pandas does that SQL cannot do and vice versa. I will be picking the tool that does the job more efficiently or is preferable for data science work, in my opinion and from personal experience.

Here are times where using pandas is more beneficial than SQL, while also having the same functionality as SQL:

creating calculated fields from existing features

When writing a more complex SQL query, you often need subqueries in order to divide values from different columns. In pandas you can divide features much more easily, like the following:

```python
df["new_column"] = df["first_column"] / df["second_column"]
```

The code above shows how you can divide two separate columns and assign the result to a new column. In this case, you are performing the feature creation on the entire dataset or dataframe. You can use this in both feature exploration and feature engineering in the data science process.

grouping by

Also referring to subqueries, grouping by in SQL can become quite complex and require lines and lines of code that can be visually overwhelming. In pandas, you can group by in one line of code. I am not referring to the group by at the end of a simple select-from-table query, but one where there are multiple subqueries involved.

```python
df.groupby(by="first_column").mean()
```

This returns the mean of each remaining column for every distinct value of first_column; the remaining columns should be numeric for the mean to be defined. There are many other ways to use this grouping function, which are outlined nicely in the pandas documentation linked below.

checking data types

In SQL, you will often have to cast types, but sometimes it can be a little clearer to see the way pandas lays out data types in a vertical format, rather than scrolling through a horizontal output in SQL. Some of the data types you can expect to see returned are int64, float64, datetime64[ns], and object.

```python
df.dtypes
```

While these are all fairly simple functions in pandas and SQL, in SQL they are particularly tricky, and sometimes they are just much easier to implement on a pandas dataframe. Now, let’s look at what SQL is better at performing.

SQL

Photo by Caspar Camille Rubin on Unsplash [4].

SQL is probably the language that is used by the widest range of positions. For example, a data engineer, a Tableau developer, or a product manager could use SQL. With that being said, data scientists tend to use SQL frequently.
It is important to note that there are several different versions of SQL, usually all having similar functionality, just with slightly different syntax.

Here are times where using SQL is more beneficial than pandas, while also having the same functionality as pandas:

WHERE clause

This clause in SQL is used frequently and can also be performed in pandas. In pandas, however, it is slightly more difficult, or less intuitive. For example, you have to write out redundant code, whereas in SQL you simply need the WHERE.

```sql
SELECT ID
FROM TABLE
WHERE ID > 100
```

In pandas, it would be something like:

```python
df[df["ID"] > 100]["ID"]
```

Yes, both are simple; one is just a little more intuitive.

JOINS

Pandas has a few ways to join, which can be a little overwhelming, whereas in SQL you can perform simple joins like the following: INNER, LEFT, RIGHT

```sql
SELECT
    one.column_A,
    two.column_B
FROM FIRST_TABLE one
INNER JOIN SECOND_TABLE two
    ON two.ID = one.ID
```

In this code, joining is slightly easier to read than in pandas, where you have to merge dataframes; especially as you merge more than two dataframes, it can become quite complex in pandas. SQL can perform multiple joins, whether INNER or otherwise, all in the same query.

All of these examples, whether in SQL or pandas, can be used in at least the exploratory data analysis portion of the data science process, as well as in feature engineering and in querying model results once they are stored in a database.

Summary

This comparison of pandas versus SQL is more of a personal preference. With that being said, you may feel the opposite of my opinion. However, I hope it still sheds light on the differences between pandas and SQL, as well as on what you can do in both tools using slightly different coding techniques and a different language altogether.

To summarize, we have compared the benefits of using pandas over SQL and vice versa for a few of their shared functions:

* creating calculated fields from existing features
* grouping by
* checking data types
* WHERE clause
* JOINS

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree with these comparisons. Why or why not? Do you think one tool, in particular, is better than the other? What other data science tools can you think of that would have a similar comparison? What other functions of pandas and SQL could we compare?
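For readers who want to see the JOINS comparison side by side, here is a rough pandas counterpart to an inner join on an ID column, sketched with DataFrame.merge. The two small dataframes and their values are invented for illustration; only the column names ID, column_A, and column_B mirror the SQL query in the JOINS section.

```python
import pandas as pd

# Made-up illustration data standing in for FIRST_TABLE and SECOND_TABLE
first_table = pd.DataFrame({"ID": [1, 2, 3], "column_A": ["a", "b", "c"]})
second_table = pd.DataFrame({"ID": [2, 3, 4], "column_B": [20, 30, 40]})

# how="inner" keeps only IDs present in both frames, like INNER JOIN ... ON
inner = first_table.merge(second_table, on="ID", how="inner")
joined = inner[["column_A", "column_B"]]
print(joined)

# how="left" keeps every row of first_table, like LEFT JOIN;
# rows without a match get NaN in column_B
left = first_table.merge(second_table, on="ID", how="left")
print(left)
```

Switching the how argument between "inner", "left", and "right" plays the same role as choosing the join type in SQL, and further dataframes can be merged by chaining additional merge calls.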

Daniel Morales

May 18, 2020

Machine Learning

Programming

Building a Product Recommendation System with Collaborative Filtering

Daniel Morales

May 18, 2020

SQL Joins: A Brief Example Understand The Why And How Of SQL Joins


Tony Yiu

