Data Science & Machine Learning Articles

Programming Data Science

6 Advanced Statistical Concepts in Data Science

The article covers some of the most commonly used advanced statistical concepts, along with their Python implementation.

In my previous articles, Beginners Guide to Statistics in Data Science and The Inferential Statistics Data Scientists Should Know, we covered almost all the basics (descriptive and inferential) of statistics that are commonly used in understanding and working with any data science case study. In this article, let's go a little further and talk about some advanced concepts that are not part of the buzz.

Concept #1 - Q-Q (quantile-quantile) Plots

Before looking at Q-Q plots, we first need to understand what a quantile is.

A quantile marks a cut point in a data set: it tells us what share of the values in a distribution lie below (or above) a certain limit. Special quantiles are the quartiles (quarters), the quintiles (fifths), and the percentiles (hundredths).

An example: if we divide a distribution into four equal portions, we speak of quartiles. The first quartile is the value below which a quarter of all values lie. In a graphical representation, the region below it corresponds to 25% of the total area of the distribution. The two lower quarters together comprise 50% of all values. The interquartile range, between the first and third quartiles, is the range that contains the middle 50% of all values, centred on the median. As another example, the median is the quantile where 50% of the data fall below that point and 50% lie above it.

In statistics, a Q-Q (quantile-quantile) plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, the points should form a roughly straight line (y = x). The purpose of Q-Q plots is to find out whether two sets of data come from the same distribution.
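The quantile vocabulary used here (quartiles, median, interquartile range) can be checked numerically. The snippet below is a minimal sketch on a small made-up sample; the array `values` is an assumption for illustration and does not come from the article's data.

```python
import numpy as np

# Small hypothetical sample, already sorted for readability.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quartiles: the points below which 25%, 50% and 75% of the values lie.
q1, median, q3 = np.quantile(values, [0.25, 0.5, 0.75])

# Interquartile range: the width of the middle 50% of the data.
iqr = q3 - q1

print(q1, median, q3, iqr)  # 3.0 5.0 7.0 4.0
```

A Q-Q plot does the same thing for many quantiles at once, and for two distributions, then plots the pairs against each other.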
A 45-degree reference line is drawn on the Q-Q plot; if the two data sets come from a common distribution, the points will fall along that line.

It is important to know whether a distribution is normal before applying various statistical measures to the data and interpreting them, and this is where the Q-Q plot comes into the picture. The most fundamental question answered by the Q-Q plot is whether the data are normally distributed or not.

Normally distributed, but why?

Q-Q plots can be used to identify the type of distribution of a random variable, whether it is Gaussian, uniform, exponential, or even Pareto: you compare the data against the theoretical quantiles of the candidate distribution and read the answer off the shape of the plot. In practice we mostly talk about the normal distribution because of the 68-95-99.7 rule, which tells us how much of the data lies within one, two, and three standard deviations of the mean. So knowing whether a distribution is normal opens up new doors for us to experiment with.

Types of Q-Q plots

Skewed Q-Q plots

Q-Q plots can reveal the skewness (a measure of asymmetry) of a distribution.
If the bottom end of the Q-Q plot deviates from the straight line but the upper end does not, then the distribution is left skewed (negatively skewed). If the upper end of the Q-Q plot deviates from the straight line and the lower end does not, then the distribution is right skewed (positively skewed).

Tailed Q-Q plots

Q-Q plots can also reveal the kurtosis (a measure of tailedness) of a distribution. A distribution with fat tails will have both ends of the Q-Q plot deviating from the straight line while its centre follows the line, whereas a thin-tailed distribution will show little or no deviation at the ends, making it a good fit for the normal distribution.

Q-Q plots in Python (Source)

Suppose we have the following dataset of 1,000 values:

    import numpy as np

    # create dataset with 1,000 values that follow a normal distribution
    np.random.seed(0)
    data = np.random.normal(0, 1, 1000)

    # view first 10 values
    data[:10]

    array([ 1.76405235,  0.40015721,  0.97873798,  2.2408932 ,  1.86755799,
           -0.97727788,  0.95008842, -0.15135721, -0.10321885,  0.4105985 ])

To create a Q-Q plot for this dataset, we can use the qqplot() function from the statsmodels library:

    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    # create Q-Q plot with 45-degree line added to plot
    fig = sm.qqplot(data, line='45')
    plt.show()

In a Q-Q plot, the x-axis displays the theoretical quantiles. This means it doesn't show your actual data; instead, it represents where your data would be if it were normally distributed.

The y-axis displays the quantiles of your actual data. This means that if the data values fall along a roughly straight line at a 45-degree angle, then the data is normally distributed.

We can see in our Q-Q plot above that the data values closely follow the 45-degree line, which means the data is likely normally distributed.
This shouldn't be surprising, since we generated the 1,000 data values using the numpy.random.normal() function.

Consider instead what happens if we generate a dataset of 1,000 uniformly distributed values and create a Q-Q plot for that dataset:

    # create dataset of 1,000 uniformly distributed values
    data = np.random.uniform(0, 1, 1000)

    # generate Q-Q plot for the dataset
    fig = sm.qqplot(data, line='45')
    plt.show()

The data values clearly do not follow the red 45-degree line, which is an indication that they do not follow a normal distribution.

Concept #2 - Chebyshev's Inequality

In probability, Chebyshev's inequality, also known as the Bienaymé-Chebyshev inequality, guarantees that, for a wide class of probability distributions, no more than a certain fraction of values can lie more than a given distance from the mean.

Source: https://www.thoughtco.com/chebyshevs-inequality-3126547

Chebyshev's inequality is similar to the empirical rule (68-95-99.7); however, the latter only applies to normal distributions. Chebyshev's inequality is broader: it can be applied to any distribution that has a defined mean and variance.

Chebyshev's inequality says that at least 1 - 1/k^2 of the data must fall within k standard deviations of the mean (or, equivalently, no more than 1/k^2 of the distribution's values can be more than k standard deviations away from the mean), where k is a positive real number.

If the data is not normally distributed, then different amounts of data could be within one standard deviation.
Chebyshev's inequality provides a lower bound on the fraction of data that falls within k standard deviations of the mean, for any data distribution.

Credits: https://calcworkshop.com/joint-probability-distribution/chebyshev-inequality/

Chebyshev's inequality is of great value because it can be applied to any probability distribution for which the mean and variance are known.

Let us consider an example. Assume 1,000 contestants show up for a job interview, but there are only 70 positions available. In order to select the best 70 contestants, the proprietor gives tests to judge their potential. The mean score on the test is 60, with a standard deviation of 6. If an applicant scores 84, can they presume that they are getting the job?

A score of 84 is (84 - 60) / 6 = 4 standard deviations above the mean. With k = 4, Chebyshev's inequality says that no more than 1/4^2 = 1/16 = 6.25% of scores can lie 4 or more standard deviations from the mean, which is at most about 63 of the 1,000 contestants. So at most about 63 people scored 84 or higher, and with 70 positions available, a contestant who scores 84 can be assured they got the job.

Chebyshev's Inequality in Python (Source)

Create a population of 1,000,000 values. I use a gamma distribution (other distributions work too) with shape = 2 and scale = 2.

    import numpy as np
    import random
    import matplotlib.pyplot as plt

    # create a population with a gamma distribution
    shape, scale = 2., 2.

    # mean and standard deviation: mean = 4, std = 2*sqrt(2)
    mu = shape * scale
    sigma = scale * np.sqrt(shape)

    s = np.random.gamma(shape, scale, 1000000)

Now sample 10,000 values from the population.

    # sample 10,000 values
    rs = random.choices(s, k=10000)

Count the sampled values whose distance from the expected value is larger than k standard deviations, and use the count to calculate the probabilities.
I want to show how the probabilities change as k increases, so I use a range of k from 0.1 to 3.

    # set k
    ks = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

    # probability list
    probs = []

    # for each k
    for k in ks:
        # start count
        c = 0
        for i in rs:
            # count if farther than k standard deviations from the mean
            if abs(i - mu) > k * sigma:
                c += 1
        probs.append(c / 10000)

Plot the results:

    plt.figure(figsize=(20, 10))

    # plot each probability
    plt.xlabel('K')
    plt.ylabel('probability')
    plt.plot(ks, probs, marker='o')
    plt.show()

    # print each probability
    print("Probability of a sample far from mean more than k standard deviation:")
    for i, prob in enumerate(probs):
        print("k:" + str(ks[i]) + ", probability: "
              + str(prob)[0:5]
              + " | in theory, probability should less than: "
              + str(1 / ks[i]**2)[0:5])

From the above plot and result, we can see that as k increases, the probability decreases, and the probability for each k respects the inequality. Moreover, only the case where k is larger than 1 is useful. If k is less than 1, the right side of the inequality is larger than 1, which tells us nothing because a probability cannot be larger than 1.

Concept #3 - Log-Normal Distribution

In probability theory, a log-normal distribution, also known as Galton's distribution, is a continuous probability distribution of a random variable whose logarithm is normally distributed.

Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Equivalently, if Y has a normal distribution, then the exponential function of Y, i.e. X = exp(Y), has a log-normal distribution.

Skewed data with a low mean, a high variance, and only positive values are often a good fit for this type of distribution. A random variable that is log-normally distributed takes only positive real values.
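The relationship X = exp(Y) can be checked with a few lines of NumPy. This is only a sketch: the values mu = 1.0 and sigma = 0.5, the sample size, and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed parameters of the underlying normal distribution.
mu, sigma = 1.0, 0.5

# Draw Y from a normal distribution, then exponentiate it.
y = rng.normal(mu, sigma, size=100_000)
x = np.exp(y)  # X = exp(Y) is log-normally distributed

# X is strictly positive, and ln(X) recovers the normal parameters
# (up to sampling noise).
log_x = np.log(x)
print(x.min() > 0)
print(round(log_x.mean(), 2), round(log_x.std(), 2))
```

The mean and standard deviation printed on the last line should land close to the assumed mu and sigma, which is the sense in which those two numbers describe the logarithm of the variable rather than the variable itself.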
The general formula for the probability density function of the lognormal distribution is:

$$f(x) = \frac{\exp\left(-\dfrac{\left(\ln\left((x-\theta)/m\right)\right)^{2}}{2\sigma^{2}}\right)}{(x-\theta)\,\sigma\sqrt{2\pi}}, \qquad x > \theta;\ m, \sigma > 0$$

The shape of the lognormal distribution is defined by 3 parameters:

σ is the shape parameter (and is the standard deviation of the log of the distribution)
θ is the location parameter (it shifts the distribution along the x-axis)
m is the scale parameter (and is also the median of the distribution)

In the common two-parameter form, μ = ln(m) and σ are the mean and standard deviation of the logarithm of the random variable.

If x = θ, then f(x) = 0. The case where θ = 0 and m = 1 is called the standard lognormal distribution. The case where θ equals zero is called the 2-parameter lognormal distribution.

The following graph illustrates the effect of the parameters μ and σ on the probability density function of the lognormal distribution:

Source: https://www.sciencedirect.com/topics/mathematics/lognormal-distribution

Log-Normal Distribution in Python (Source)

Let us consider an example that generates random numbers from a log-normal distribution with shape σ = 0.5, location θ = 1, and scale m = 1000 using the scipy.stats.lognorm function.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import lognorm

    np.random.seed(42)

    data = lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000)

    plt.figure(figsize=(10, 6))
    ax = plt.subplot(111)
    plt.title('Generate random numbers from a Log-normal distribution')
    ax.hist(data, bins=np.logspace(0, 5, 200), density=True)
    ax.set_xscale("log")

    # fit a lognormal to the sample and overlay its PDF
    shape, loc, scale = lognorm.fit(data)
    x = np.logspace(0, 5, 200)
    pdf = lognorm.pdf(x, shape, loc, scale)
    ax.plot(x, pdf, 'y')
    plt.show()

Concept #4 - Power Law distribution

In statistics, a power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities:
one quantity varies as a power of another. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four.

A power law distribution has the form Y = k X^α, where:

X and Y are the variables of interest,
α is the law's exponent,
k is a constant.

Source: https://en.wikipedia.org/wiki/Power_law

The power-law distribution is just one of many probability distributions, but it is a valuable tool for modelling heavy-tailed phenomena, where extreme events occur far more often than a normal distribution would predict.

Many processes have been found to follow power laws over substantial ranges of values. Examples include the distribution of incomes, the size of meteoroids, earthquake magnitudes, the spectral density of weight matrices in deep neural networks, word usage, and the number of neighbors in various networks. (Note: the power law here is a continuous distribution. The last two examples are discrete, but on a large scale can be modeled as if continuous.)

Power-law distribution in Python (Source)

Let us plot the Pareto distribution, which is one form of a power-law probability distribution. The Pareto distribution is associated with the Pareto principle, or "80-20" rule, which states that 80% of society's wealth is held by 20% of its population. The Pareto distribution is not a law of nature, but an observation. It is useful in many real-world problems.
It is a skewed, heavy-tailed distribution.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import pareto

    x_m = 1              # scale
    alpha = [1, 2, 3]    # list of values of shape parameters

    plt.figure(figsize=(10, 6))
    samples = np.linspace(start=0, stop=5, num=1000)
    for a in alpha:
        # the PDF is zero for x below the scale x_m
        output = pareto.pdf(x=samples, b=a, loc=0, scale=x_m)
        plt.plot(samples, output, label='alpha {0}'.format(a))

    plt.xlabel('samples', fontsize=15)
    plt.ylabel('PDF', fontsize=15)
    plt.title('Probability Density function', fontsize=15)
    plt.legend(loc='best')
    plt.show()

Concept #5 - Box-Cox transformation

The Box-Cox transformation transforms our data so that it closely resembles a normal distribution.

In many statistical techniques, we assume that the errors are normally distributed. This assumption allows us to construct confidence intervals and conduct hypothesis tests. By transforming the target variable, we can (hopefully) normalize our errors, if they are not already normal.

Additionally, transforming our variables can improve the predictive power of our models, because a suitable transformation can reduce the influence of noise.

Figure: original distribution (left) and near-normal distribution after applying the Box-Cox transformation (right). Source

At the core of the Box-Cox transformation is an exponent, lambda (λ), which typically varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; the "optimal value" is the one that results in the best approximation of a normal distribution curve.

The one-parameter Box-Cox transformations are defined as:

$$y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln y_i & \text{if } \lambda = 0 \end{cases}$$

and the two-parameter Box-Cox transformations as:

$$y_i^{(\boldsymbol{\lambda})} = \begin{cases} \dfrac{(y_i + \lambda_2)^{\lambda_1} - 1}{\lambda_1} & \text{if } \lambda_1 \neq 0 \\ \ln (y_i + \lambda_2) & \text{if } \lambda_1 = 0 \end{cases}$$

The one-parameter Box-Cox transformation holds for y > 0, i.e. only for positive values, whereas the two-parameter transformation holds for y > -λ2, so it can also handle negative values.
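The one-parameter transformation is short enough to write out by hand. The following is a minimal sketch, assuming the standard definition ((y^λ - 1)/λ for λ ≠ 0, and ln y for λ = 0); the helper name `box_cox` and the sample array are hypothetical and are not part of SciPy.

```python
import numpy as np

def box_cox(y, lam):
    """One-parameter Box-Cox transform for strictly positive y (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("the one-parameter Box-Cox transform needs y > 0")
    if lam == 0:
        return np.log(y)              # limiting case as lambda -> 0
    return (y ** lam - 1.0) / lam

y = np.array([1.0, 2.0, 4.0])
print(box_cox(y, 1))    # lambda = 1 only shifts the data: y - 1
print(box_cox(y, 0))    # lambda = 0 is the natural log
print(box_cox(y, 0.5))  # lambda = 0.5 is a scaled square root
```

Note that λ = 1 leaves the shape of the data unchanged, and very small values of λ give results close to the log transform, which is why the λ = 0 case is defined as ln y.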
The parameter λ is estimated using the profile likelihood function and goodness-of-fit tests.

As for drawbacks: if interpretation is your goal, Box-Cox is not recommended. If λ is some non-zero number, the transformed target variable may be more difficult to interpret than if we had simply applied a log transform.

A second stumbling block is that the Box-Cox transformation usually gives the median of the forecast distribution when we revert the transformed data to its original scale. Occasionally, we want the mean and not the median.

Box-Cox transformation in Python (Source)

SciPy's stats package provides a function called boxcox for performing the Box-Cox power transformation. It takes the original non-normal data as input and returns the transformed data along with the lambda value that was used to bring the non-normal distribution closer to a normal distribution.

    # load necessary packages
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.stats import boxcox

    # make this example reproducible
    np.random.seed(0)

    # generate dataset
    data = np.random.exponential(size=1000)

    fig, ax = plt.subplots(1, 2)

    # plot the distribution of data values
    # (kdeplot with fill=True replaces the deprecated distplot / shade options)
    sns.kdeplot(x=data, fill=True, linewidth=2,
                label="Non-Normal", color="red", ax=ax[0])

    # perform Box-Cox transformation on original data
    transformed_data, best_lambda = boxcox(data)

    sns.kdeplot(x=transformed_data, fill=True, linewidth=2,
                label="Normal", color="red", ax=ax[1])

    # adding legends to the subplots
    ax[0].legend(loc="upper right")
    ax[1].legend(loc="upper right")

    # rescaling the subplots
    fig.set_figheight(5)
    fig.set_figwidth(10)

    # display optimal lambda value
    print(f"Lambda value used for Transformation: {best_lambda}")
    plt.show()

Concept #6 - Poisson distribution

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a
fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

In very simple terms, a Poisson distribution can be used to estimate how likely it is that something will happen "X" number of times. Some examples of Poisson processes are customers calling a help center, radioactive decay in atoms, visitors to a website, photons arriving at a space telescope, and movements in a stock price. Poisson processes are usually associated with time, but they do not have to be.

The formula for the Poisson distribution is:

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

Where:

e is Euler's number (e = 2.71828...)
k is the number of occurrences
k! is the factorial of k
λ is the expected value of k, which is also equal to its variance

Lambda (λ) can be thought of as the expected number of events in the interval. As we change the rate parameter, λ, we change the probability of seeing different numbers of events in one interval. The graph below is the probability mass function of the Poisson distribution, showing the probability of a number of events occurring in an interval for different rate parameters.

Figure: probability mass function of the Poisson distribution with varying rate parameters. Source

The Poisson distribution is also commonly used to model financial count data where the tally is small and is often zero.
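The probability mass function P(X = k) = λ^k e^(-λ) / k! can be evaluated with the standard library alone. The snippet below is a sketch: the helper name `poisson_pmf` is hypothetical, λ = 4 is an assumed rate, and the sum is truncated at k = 49, beyond which the remaining probability is negligible for this rate.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam (illustrative helper)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 4  # assumed expected number of events per interval

# Probabilities for k = 0..49; the tail beyond that is negligible for lam = 4.
probs = [poisson_pmf(k, lam) for k in range(50)]

total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))

print(round(total, 6), round(mean, 6), round(variance, 6))
```

The probabilities sum to (essentially) 1, and both the mean and the variance come out at λ, which is the property listed in the definition of λ.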
For example, in finance, it can be used to model the number of trades that a typical investor will make in a given day, which can be 0 (often), or 1, or 2, etc.

As another example, this model can be used to predict the number of "shocks" to the market that will occur in a given time period, say over a decade.

Poisson distribution in Python

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    lam_list = [1, 4, 9]  # list of lambda values

    plt.figure(figsize=(10, 6))
    for lam in lam_list:
        # kdeplot replaces the deprecated distplot(hist=False); a larger
        # sample than the original size=10 gives a smoother curve
        sns.kdeplot(x=np.random.poisson(lam=lam, size=1000),
                    label='lambda {0}'.format(lam))

    plt.xlabel('Poisson Distribution', fontsize=15)
    plt.ylabel('Density', fontsize=15)
    plt.legend(loc='best')
    plt.show()

As λ becomes bigger, the graph looks more like a normal distribution.

I hope you have enjoyed reading this article. If you have any questions or suggestions, please leave a comment.

Feel free to connect with me on LinkedIn for any query.

Thanks for reading!

References

https://calcworkshop.com/joint-probability-distribution/chebyshev-inequality/
https://corporatefinanceinstitute.com/resources/knowledge/data-analysis/chebyshevs-inequality/
https://www.itl.nist.gov/div898/handbook/eda/section3/eda3669.htm
https://www.statology.org/q-q-plot-python/
https://gist.github.com/chaipi-chaya/9eb72978dbbfd7fa4057b493cf6a32e7
https://stackoverflow.com/a/41968334/7175247

Nagesh Singh Chauhan October 16, 2021

Programming Python

Why Decorators In Python Are Pure Genius?

Analyze, test, and re-use your code with little more than an @ symbol

If there's one thing that makes Python incredibly successful, that would be its readability. Everything else hinges on that: if code is unreadable, it's hard to maintain. It's also not beginner-friendly then: a novice boggled by unreadable code won't attempt writing their own one day.

Python was already readable and beginner-friendly before decorators came around. But as the language started getting used for more and more things, Python developers felt the need for more and more features, without cluttering the landscape and making code unreadable.

Decorators are a prime example of a perfectly implemented feature. They do take a while to wrap your head around, but it's worth it. As you start using them, you'll notice how they don't overcomplicate things and make your code neat and snazzy.

Before anything else: higher-order functions

In a nutshell, decorators are a neat way to handle higher-order functions. So let's look at those first!

Functions returning functions

Say you have one function, greet(), which greets whatever object you pass it. And let's say you have another function, simon(), which inserts "Simon" wherever appropriate. How can we combine the two? Think about it a minute before you look below.

    def greet(name):
        return f"Hello, {name}!"

    def simon(func):
        return func("Simon")

    simon(greet)

The output is 'Hello, Simon!'. Hope that makes sense to ya!

Of course we could have just called greet("Simon"). However, the whole point is that we might want to put "Simon" into many different functions. And if we don't use "Simon" but something more complicated, we can save a whole lot of lines of code by packing it into a function like simon().

Functions inside other functions

We can also define functions inside other functions. That's important because decorators will do that, too!
Without decorators it looks like this:

    def respect(maybe):
        def congrats():
            return "Congrats, bro!"
        def insult():
            return "You're silly!"
        if maybe == "yes":
            return congrats
        else:
            return insult

The function respect() returns a function: respect("yes") returns the congrats function, while respect("brother") (or any other argument instead of "brother") returns the insult function. To call the returned functions, enter respect("yes")() and respect("brother")(), just like a normal function.

Got it? Then you're all set for decorators!

The ABC of Python decorators

Functions with an @ symbol

Let's try a combination of the two previous concepts: a function that takes another function and defines a function. Sounds mind-boggling? Consider this:

    def startstop(func):
        def wrapper():
            print("Starting...")
            func()
            print("Finished!")
        return wrapper

    def roll():
        print("Rolling on the floor laughing XD")

    roll = startstop(roll)

The last line ensures that we don't need to call startstop(roll)() anymore; roll() will suffice. Do you know what the output of that call is? Try it yourself if you're unsure!

Now, as a very good alternative, we could insert this right after defining startstop():

    @startstop
    def roll():
        print("Rolling on the floor laughing XD")

This does the same, but glues roll() to startstop() at the outset.

Added flexibility

Why is that useful? Doesn't that consume exactly as many lines of code as before?

In this case, yes. But once you're dealing with slightly more complicated stuff, it gets really useful. For one, you can move all decorators (i.e. the def startstop() part above) into their own module. That is, you write them into a file called decorators.py and write something like this into your main file:

    from decorators import startstop

    @startstop
    def roll():
        print("Rolling on the floor laughing XD")

In principle, you can do that without using decorators.
But this way life is easier, because you don't have to deal with nested functions and endless bracket-counting anymore.

You can also nest decorators:

    from decorators import startstop, exectime

    @exectime
    @startstop
    def roll():
        print("Rolling on the floor laughing XD")

Note that we haven't defined exectime() yet; you'll see a function like it, called measuretime(), in the next section. It's a function that can measure how long a process takes in Python.

This nesting would be equivalent to a line like this:

    roll = exectime(startstop(roll))

Bracket counting is starting! Imagine you had five or six of those functions nested inside each other. Wouldn't the decorator notation be much easier to read than this nested mess?

You can even use decorators on functions that accept arguments. Now imagine a few arguments in the line above and your chaos would be complete. Decorators make it neat and tidy.

Finally, you can even add arguments to your decorators, like @mydecorator(argument). Yeah, you can do all of this without decorators. But then I wish you a lot of fun understanding your decorator-free code when you re-read it in three weeks…

Applications: where decorators are the cream of the crop

Now that I've hopefully convinced you that decorators make your life three times easier, let's look at some classic examples where decorators are basically indispensable.

Measuring execution time

Let's say we have a function called wastetime() and we want to know how long it takes. Well, just use a decorator!

    import time

    def measuretime(func):
        def wrapper():
            starttime = time.perf_counter()
            func()
            endtime = time.perf_counter()
            print(f"Time needed: {endtime - starttime} seconds")
        return wrapper

    @measuretime
    def wastetime():
        sum([i**2 for i in range(1000000)])

    wastetime()

A dozen lines of code and we're done!
Plus, you can use measuretime() on as many functions as you want.

Slowing code down

Sometimes you don't want to execute code immediately but wait a while. That's where a slow-down decorator comes in handy:

    import time

    def sleep(func):
        def wrapper():
            time.sleep(300)
            return func()
        return wrapper

    @sleep
    def wakeup():
        print("Get up! Your break is over.")

    wakeup()

Calling wakeup() lets you take a 5-minute break, after which your console reminds you to get back to work.

Testing and debugging

Say you have a whole lot of different functions that you call at different stages, and you're losing track of what's being called when. With a simple decorator for every function definition, you can bring more clarity. Like so:

    def debug(func):
        def wrapper():
            print(f"Calling {func.__name__}")
            # run the decorated function after logging the call
            return func()
        return wrapper

    @debug
    def scare():
        print("Boo!")

    scare()

Much more elaborate debugging decorators exist. Note, though, that to understand them, you'll have to check how to decorate functions with arguments. Still, it's worth the read!

Reusing code

This kinda goes without saying. If you've defined a function decorator(), you can just sprinkle @decorator everywhere in your code. To be honest, I don't think it gets any simpler than that!

Handling logins

If you have functionalities that should only be accessed if a user is logged in, that's also fairly easy with decorators. I'll refer you to the full example for reference, but the principle is quite simple: first you define a function like login_required(). Before any function definition that needs logging in, you pop @login_required. Simple enough, I'd say.

Syntactic sugar, or why Python is so sweet

It's not like I'm not critical of Python or not using alternative languages where it's appropriate.
But there's a big allure to Python: it's so easy to digest, even when you're not a computer scientist by training and just want to make things work.

If C++ is an orange, then Python is a pineapple: similarly nutritious, but three times sweeter. Decorators are just one factor in the mix.

But I hope you've come to see why they're such a big sweet-factor. Syntactic sugar to add some pleasure to your life! Without health risks, except for having your eyes glued to a screen.

I wish you lots of sweet code!
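The article mentions two things without showing them: decorating functions that accept arguments, and decorators that take arguments themselves, like @mydecorator(argument). The following is one possible sketch of both, not the author's code: the names `debug`, `repeat`, and `greet` are chosen here for illustration, and `functools.wraps` from the standard library is used so that the decorated function keeps its original name.

```python
import functools

def debug(func):
    """Print each call, then run the wrapped function unchanged."""
    @functools.wraps(func)            # keep func.__name__ and its docstring
    def wrapper(*args, **kwargs):     # accept whatever arguments func accepts
        print(f"Calling {func.__name__} with {args} {kwargs}")
        return func(*args, **kwargs)
    return wrapper

def repeat(times):
    """Decorator factory: call the function `times` times and collect the results."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return [func(*args, **kwargs) for _ in range(times)]
        return wrapper
    return decorator

@repeat(times=2)
@debug
def greet(name):
    return f"Hello, {name}!"

results = greet("Simon")
print(results)         # ['Hello, Simon!', 'Hello, Simon!']
print(greet.__name__)  # greet
```

A decorator with arguments is simply one more level of nesting: repeat(times=2) runs first and returns the actual decorator, which is then applied to the function.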

Rhea Moutafis June 4, 2021

Programming Machine Learning

Building a Product Recommendation System with Collaborative Filtering

Daniel Morales April 6, 2021

Python SQL Programming

How to Use Python Datetimes Correctly?

A datetime is a Python object that represents a point in time, with components such as years, days, seconds, and milliseconds. This is very useful when building our programs.

The datetime module provides classes to manipulate dates and times in both simple and complex ways. While date and time arithmetic is supported, the implementation focuses on the efficient extraction of attributes for formatting and manipulating output.

Daniel Morales February 23, 2021

Python Programming Libraries

Understanding Python's Collections Module

The Python collections module has different specialized data types that function as containers and can be used to replace the general purpose Python containers (`dict`, `tuple`, `list` and `set`). We will study the following parts of this module:

- `ChainMap`
- `Counter`
- `defaultdict`
- `deque`

There is a submodule of collections called abc, or Abstract Base Classes. These will not be covered in this post. Let's start with the ChainMap container!

ChainMap

A ChainMap is a class that provides the ability to link multiple mappings together so that they end up as a single unit. If you look at the documentation, you will notice that it accepts `*maps`, which means that a ChainMap will accept any number of mappings or dictionaries and combine them into a single view that you can update. Let's see an example so you can see how it works:

    from collections import ChainMap

    car_parts = {'hood': 500, 'engine': 5000, 'front_door': 750}
    car_options = {'A/C': 1000, 'Turbo': 2500, 'rollbar': 300}
    car_accessories = {'cover': 100, 'hood_ornament': 150, 'seat_cover': 99}
    car_pricing = ChainMap(car_accessories, car_options, car_parts)

    car_pricing
    >> ChainMap({'cover': 100, 'hood_ornament': 150, 'seat_cover': 99}, {'A/C': 1000, 'Turbo': 2500, 'rollbar': 300}, {'hood': 500, 'engine': 5000, 'front_door': 750})

    car_pricing['hood']
    >> 500

Here we import ChainMap from the collections module. Next we create three dictionaries. Then we create an instance of ChainMap by passing it the three dictionaries we just created.

Finally, we try to access one of the keys of our ChainMap. When we do this, the ChainMap goes through each map in order to see if that key exists, and returns the first value it finds for that key.

This is especially useful if you want to set default values. Suppose we want to create an application that has some default values. The application will also know the operating system environment variables.
If there is an environment variable that matches one of the keys we have as a default in our application, the environment will override our default value. In addition, let's assume that we can pass arguments to our application. These arguments take precedence over the environment and the defaults. This is one place where a ChainMap can really come in handy. Let's look at a simple example that is based on one from the Python documentation.

Note: do not run this code from a Jupyter Notebook. Save it from your favorite IDE and call it from a terminal with this command: `python chain_map.py -u daniel`

    import argparse
    import os
    from collections import ChainMap

    def main():
        app_defaults = {'username': 'admin', 'password': 'admin'}

        parser = argparse.ArgumentParser()
        parser.add_argument('-u', '--username')
        parser.add_argument('-p', '--password')
        args = parser.parse_args()

        # keep only the options that were actually given on the command line
        command_line_arguments = {key: value for key, value
                                  in vars(args).items() if value}

        # lookup order: command line, then environment, then defaults
        chain = ChainMap(command_line_arguments, os.environ, app_defaults)
        print(chain['username'])

    if __name__ == '__main__':
        main()
        os.environ['username'] = 'test'
        main()

    $ python3 chain_map.py -u daniel
    daniel
    daniel

Let's break this down a bit. Here we import the Python `argparse` module along with the `os` module. We also import `ChainMap`.

Next we have a simple function that has some defaults. I have seen these defaults used for some popular routers. We then set up our argument parser and tell it how to handle certain command line options. You'll notice that argparse doesn't provide a way to get a dictionary object from its arguments, so we use a dict comprehension to extract what we need. The other interesting piece here is the use of Python's built-in vars. If you call it without an argument, it behaves like Python's built-in locals. But if you pass it an object, then vars is the equivalent of the `__dict__` property of that object. In other words, vars(args) is equal to `args.__dict__`.
Finally we create our ChainMap by passing our command line arguments (if any), then the environment variables and finally the default values.

At the end of the code, we call our function, then set an environment variable and call it again. Try it and you will see that it prints admin and then test, as expected. Now let's try calling the script with a command line argument:

    python chain_map.py -u daniel

When I run this on my machine, it prints daniel twice. This is because our command line argument overrides everything else. It doesn't matter what we set in the environment, because our ChainMap looks at the command line arguments before anything else. If you run it without `-u daniel`, it falls back to the other maps and prints, in my case, `"admin"` and then `"test"`.

Now that you know how to use ChainMaps, we can move on to Counter!

Counter

The collections module also provides a small tool for convenient and fast counting. This tool is called `Counter`. You can run it against most iterables. Let's try it with a string:

    from collections import Counter

    Counter('superfluous')
    >> Counter({'s': 2, 'u': 3, 'p': 1, 'e': 1, 'r': 1, 'f': 1, 'l': 1, 'o': 1})

    counter = Counter('superfluous')
    counter['u']
    >> 3

In this example, we import `Counter` from `collections` and pass it a string. This returns a Counter object, which is a subclass of the Python dictionary. We then run the same command but assign the result to the `counter` variable so that we can access the dictionary more easily. In this case, we see that the letter `"u"` appears three times in the example string.

The counter provides some methods that you may be interested in. For example, you can call `elements`, which returns an iterator over the elements in the dictionary, each repeated as many times as its count, in the order the elements were first encountered.
The output in this case is the letters of the original string, grouped by letter:

    list(counter.elements())
    >> ['s', 's', 'u', 'u', 'u', 'p', 'e', 'r', 'f', 'l', 'o']

Another useful method is `most_common`. You can ask the Counter for its most common elements by passing a number `n`, the number of top elements you want back:

    counter.most_common(2)
    >> [('u', 3), ('s', 2)]

Here we just asked our Counter which were the two most recurring items. As you can see, it produced a list of tuples that tells us that `"u"` occurred 3 times and `"s"` occurred twice.

The other method I want to cover is the `subtract` method. The `subtract` method accepts an iterable or a mapping and subtracts its counts from the counter. It's a little easier to explain if you see some code:

    counter_one = Counter('superfluous')
    counter_one
    >> Counter({'s': 2, 'u': 3, 'p': 1, 'e': 1, 'r': 1, 'f': 1, 'l': 1, 'o': 1})
    counter_two = Counter('super')
    counter_one.subtract(counter_two)
    counter_one
    >> Counter({'s': 1, 'u': 2, 'p': 0, 'e': 0, 'r': 0, 'f': 1, 'l': 1, 'o': 1})

So here we recreate our first counter and print it out so we know what's in it. Then we create our second Counter object. Finally we subtract the second counter from the first. If you look closely at the output at the end, you will notice that the counts for five of the letters have each been decreased by one.

As I mentioned at the beginning of this section, you can use Counter against any iterable or mapping, so you don't have to use only strings. You can also pass tuples, dictionaries and lists to it. Try it on your own to see how it works with those other data types. Now we are ready to move on to `defaultdict`!

defaultdict

The collections module has a handy tool called `defaultdict`. The `defaultdict` is a subclass of the Python dict that accepts a `default_factory` as its main argument.
The `default_factory` is usually a Python data type, such as `int` or `list`, but you can also use a function or a lambda. Let's start by creating a regular Python dictionary that counts the number of times each word is used in a sentence:

    sentence = "The red for jumped over the fence and ran to the zoo for food"
    words = sentence.split(' ')
    words
    >> ['The', 'red', 'for', 'jumped', 'over', 'the', 'fence', 'and', 'ran', 'to', 'the', 'zoo', 'for', 'food']

    reg_dict = {}
    for word in words:
        if word in reg_dict:
            reg_dict[word] += 1
        else:
            reg_dict[word] = 1

    print(reg_dict)
    >> {'The': 1, 'red': 1, 'for': 2, 'jumped': 1, 'over': 1, 'the': 2, 'fence': 1, 'and': 1, 'ran': 1, 'to': 1, 'zoo': 1, 'food': 1}

Now let's try to do the same with defaultdict!

    from collections import defaultdict

    sentence = "The red for jumped over the fence and ran to the zoo for food"
    words = sentence.split(' ')

    d = defaultdict(int)
    for word in words:
        d[word] += 1

    print(d)
    >> defaultdict(<class 'int'>, {'The': 1, 'red': 1, 'for': 2, 'jumped': 1, 'over': 1, 'the': 2, 'fence': 1, 'and': 1, 'ran': 1, 'to': 1, 'zoo': 1, 'food': 1})

You will notice right away that the code is much simpler. The defaultdict automatically assigns zero to any key it doesn't already contain, because calling `int()` returns 0. We then add one, so each word starts at a count of one and is incremented every time it appears again in the sentence.

Now let's try using a Python list type as our `default_factory`.
We will start with a regular dictionary, as before.

    my_list = [(1234, 100.23), (345, 10.45), (1234, 75.00), (345, 222.66), (678, 300.25), (1234, 35.67)]

    reg_dict = {}
    for acct_num, value in my_list:
        if acct_num in reg_dict:
            reg_dict[acct_num].append(value)
        else:
            reg_dict[acct_num] = [value]

If you run this code, you should get output similar to the following:

    print(reg_dict)
    >> {1234: [100.23, 75.0, 35.67], 345: [10.45, 222.66], 678: [300.25]}

Now let's reimplement this code using defaultdict:

    from collections import defaultdict

    my_list = [(1234, 100.23), (345, 10.45), (1234, 75.00), (345, 222.66), (678, 300.25), (1234, 35.67)]

    d = defaultdict(list)
    for acct_num, value in my_list:
        d[acct_num].append(value)

Again, this eliminates the if/else conditional logic and makes the code easier to follow. Here is the output of the above code:

    print(d)
    >> defaultdict(<class 'list'>, {1234: [100.23, 75.0, 35.67], 345: [10.45, 222.66], 678: [300.25]})

This is a very good thing! Let's also try using a `lambda` as our `default_factory`!

    from collections import defaultdict

    animal = defaultdict(lambda: "Monkey")
    animal
    >> defaultdict(<function __main__.<lambda>()>, {})
    animal['Sam'] = 'Tiger'
    print(animal['Nick'])
    >> Monkey
    animal
    >> defaultdict(<function __main__.<lambda>()>, {'Sam': 'Tiger', 'Nick': 'Monkey'})

Here we create a `defaultdict` that will assign `Monkey` as the default value to any missing key. The first key is set to `Tiger`, and the second key is never set. If you print the second key, you will see that it has 'Monkey' assigned to it. In case you haven't noticed yet, it is basically impossible to cause a KeyError when indexing, as long as you set the `default_factory` to something that makes sense.
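One detail is worth checking by hand: the default value is only created when a missing key is looked up with square brackets. The short sketch below reuses the `animal` example from above to show that membership tests and `.get()` leave the dictionary untouched.

```python
from collections import defaultdict

animal = defaultdict(lambda: "Monkey")
animal['Sam'] = 'Tiger'

# Membership tests and .get() do not call default_factory,
# so they do not add the missing key.
print('Nick' in animal)    # False
print(animal.get('Nick'))  # None
print(len(animal))         # 1

# Indexing a missing key calls default_factory and stores the result.
print(animal['Nick'])      # Monkey
print(len(animal))         # 2
```

This is why simply reading `d[key]` while inspecting a defaultdict can make it grow, whereas `d.get(key)` will not.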
The documentation mentions that if you set the `default_factory` to `None`, then you will receive a KeyError. Let's see how that works:

    from collections import defaultdict

    x = defaultdict(None)
    x['Mike']
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-30-d21c3702d01d> in <module>
          1 from collections import defaultdict
          2 x = defaultdict(None)
    ----> 3 x['Mike']

    KeyError: 'Mike'

In this case, we created a `defaultdict` with no `default_factory`. It can no longer assign a default value to our key, so it raises a `KeyError` instead. Of course, since it is a subclass of `dict`, we can simply set the key to some value and it will work. But that defeats the purpose of `defaultdict`.

deque

According to the Python documentation, deques "are a generalization of stacks and queues". The name is pronounced "deck" and is short for "double-ended queue". They can be used as a replacement container for the Python list. Deques are thread-safe and support memory-efficient appends and pops from either side of the deque. A list is optimized for fast fixed-length operations and is slow at inserting or removing items at the front. Full details can be found in the Python documentation. A deque accepts a `maxlen` argument that sets the maximum length of the deque. Otherwise, the deque will grow to an arbitrary size. When a bounded deque is full, any new elements added will cause the same number of elements to come out the other end.

As a general rule, if you need to add or remove elements quickly at either end, use a deque. If you need quick random access, use a list. Let's take a moment to see how a deque can be created and used.

    from collections import deque
    import string

    d = deque(string.ascii_lowercase)
    for letter in d:
        print(letter)
    >> a b c d e f g h i j k l m n o p q r s t u v w x y z

Here we import deque from the collections module and we also import the string module. To create an instance of a deque, we pass it an iterable.
In this case, we pass `string.ascii_lowercase`, which is a string containing all the lowercase letters of the alphabet. Finally, we loop over our deque and print each element. Now let's look at some of the methods that deque has.

    d.append('bye')
    d
    >> deque(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'bye'])
    d.appendleft('hello')
    d
    >> deque(['hello', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'bye'])
    d.rotate(1)
    d
    >> deque(['bye', 'hello', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'])

Let's break this down a bit. First we append a string to the right end of the deque. Then we add another string to the left side of the deque. Finally, we call `rotate` on our deque and pass it a one, which causes it to rotate once to the right. In other words, it takes the element at the far right and moves it to the front. You can pass it a negative number to make the deque rotate to the left instead. Let's end this section with an example based on one from the Python documentation:

    from collections import deque


    def get_last(filename, n=5):
        """
        Returns the last n lines from the file
        """
        try:
            with open(filename) as f:
                return deque(f, n)
        except OSError:
            print("Error opening file: {}".format(filename))
            raise

This code works much like the Linux tail program does. Here we pass a `filename` to our function along with the number of lines `n` we want it to return. The deque is limited to whatever number we pass as `n`. This means that once the deque is full, when new lines are read and added to the deque, the oldest lines are pushed out of the other end and discarded. I have also wrapped the file opening with a simple exception handler because it is very easy to pass a malformed path.
This will catch files that do not exist.

Conclusion

We've covered a lot of ground in this post. You learned how to use a ChainMap, a Counter and a defaultdict. We also learned about the deque, a list-like container that is efficient at both ends. Finally, we saw how to use them to perform various activities. I hope you found each of these collections interesting. They may be of great use to you in your day-to-day work.
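As a closing hands-on check, the bounded-deque behaviour described in the deque section can be seen directly. This is a minimal sketch; the variable name `recent` and the values are only illustrative.

```python
from collections import deque

# A deque bounded to three elements.
recent = deque(maxlen=3)
for number in range(5):
    recent.append(number)  # once full, the oldest item falls off the left
print(recent)              # deque([2, 3, 4], maxlen=3)

# appendleft on a full deque discards from the right instead.
recent.appendleft(99)
print(recent)              # deque([99, 2, 3], maxlen=3)

# A negative rotation moves elements to the left.
recent.rotate(-1)
print(recent)              # deque([2, 3, 99], maxlen=3)
```

The same discard rule is what lets `deque(f, n)` keep only the last `n` lines of a file.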

Daniel Morales February 16, 2021

Machine Learning Programming Deep learning

21 Resources for Learning Math for Data Science

This is probably one of the biggest worries of those starting in the area of data science: learning or refreshing math.

Let’s be honest, most people didn’t do very well in math in school, maybe not even in college, and this is very scary and creates a barrier for those who want to explore this discipline called data science.

A few days ago I published a post in Towards Data Science and right here on our blog called “Study Plan for Learning Data Science Over the Next 12 Months”, where I gave some quarterly recommendations and put an emphasis on studying mathematics and statistics during the first quarter. I received many questions about exactly which materials I recommend. Well, this post answers those questions. But before that, I want to give you some context.

Leaving aside the factors or reasons that have led most people to hate math, the reality is that we need it in data science. For me, one of the biggest shortcomings of mathematics was its apparent lack of applicability in the real world. I didn’t see a reason for intermediate and advanced mathematics, such as multivariate calculus. I confess that in school and college I didn’t like these subjects for that reason, although I always did well and got scores and averages above the majority (especially in statistics). But I still didn’t see how I could use a derivative or a matrix in the real world. I finally ended up as a software engineer, and once I entered the world of data science I was able to make the connection between mathematics, statistics, and the real world.

On the other hand, it is important to clarify that we do not need a master’s degree in pure mathematics to do data science projects.
As I mentioned in previous posts, there is a big debate in the community about how much math we need to do a good job as data scientists.

We could say that data science is divided into two major fields of work: research and production.

By research, we mean research and development, which normally takes place within a large company (usually a tech company), within an organization focused on cutting-edge technological issues (such as medical research), or within universities. This sector has very limited job offers. The great advantage is the deep knowledge of algorithms and their implementations, as well as the ability to create variations of existing algorithms to improve them, or even to create new machine learning algorithms. The disadvantage is the impractical nature of the work. It is very theoretical work, in which often the only objective is to publish papers, and it is far from business use cases in general. For reference on this, I recently read this post on Reddit, which I recommend you read.

By production, we refer to the practical side of this discipline, where in your day-to-day job you will generally use libraries such as scikit-learn, TensorFlow, Keras, PyTorch, and others. These libraries operate like a black box: you enter data and you get an output, but you don’t know in detail what happened in the process. This also has its advantages and disadvantages, but it certainly makes life much easier when putting useful models into production. What I don’t recommend is using them blindly, without the minimum foundations in mathematics to understand a little of their fundamentals. That is the objective of this post: to guide you and recommend some valuable resources so that you have the necessary foundations and do not operate those libraries blindly.

So if you decide to focus on research and development, you are going to need mathematics and statistics in depth (very in-depth).
If you are going to go for the practical part, the libraries will help you deal with most of it, under the hood. It should be noted that most job offers are on the practical side.

Well, after the previous remarks, it is time to define the specific topics needed to have an initial basis in mathematics for data science.

- Linear Algebra: This subject is important for the fundamentals of working with data in vector and matrix form, for acquiring the skills to solve systems of linear algebraic equations, and for learning the basic matrix decompositions and a general understanding of their applicability.
- Calculus: Here it is important to study functional maps, limits (for sequences and for functions of one and several variables), differentiation (from a single variable to multiple variables), and integration, thus sequentially building a foundation for basic optimization. It is also important here to study gradient descent.
- Probability theory: Here you should learn about random variables, i.e. variables whose values are determined by a random experiment. Random variables are used as a model for the data generation processes we want to study. The properties of the data are deeply linked to the corresponding properties of the random variables, such as expected value, variance, and correlations.

Note: these subjects are much deeper than what I just mentioned. This is simply a guide to the subjects and resources recommended to approach mathematics in the field of data science.

Now that we have a better idea of the path we should take, let’s examine the recommended resources to address this topic. We will divide them into basic, intermediate, and advanced. In the advanced ones, we’ll have resources focused on deep learning.

Basics: in this first section of resources we’ll recommend the mathematical basics.
These cover mathematical thinking, algebra, and how to implement math with Python.

1- Introduction to mathematical thinking
Price: Free
Description: Learn how to think the way mathematicians do, a powerful cognitive process developed over thousands of years. Mathematical thinking is not the same as doing mathematics, at least not as mathematics is typically presented in our school system. School math typically focuses on learning procedures to solve highly stereotyped problems. Professional mathematicians think a certain way to solve real problems, problems that can arise from the everyday world, or from science, or from within mathematics itself. The key to success in school math is to learn to think inside-the-box. In contrast, a key feature of mathematical thinking is thinking outside-the-box, a valuable ability in today’s world. This course helps to develop that crucial way of thinking.
Link: https://www.coursera.org/learn/mathematical-thinking#

2- Mathematical Foundation for AI and Machine Learning
Price: $46.99 USD
Description: Artificial Intelligence has gained importance in the last decade with a lot depending on the development and integration of AI in our daily lives. The progress that AI has already made is astounding with innovations like self-driving cars, medical diagnosis and even beating humans at strategy games like Go and Chess. The future for AI is extremely promising and it isn’t far from when we have our own robotic companions. This has pushed a lot of developers to start writing code and developing AI and ML programs. However, learning to write algorithms for AI and ML isn’t easy and requires extensive programming and mathematical knowledge. Mathematics plays an important role as it builds the foundation for programming for these two streams. And in this course, we’ve covered exactly that.
We designed a complete course to help you master the mathematical foundation required for writing programs and algorithms for AI and ML.
Link: https://www.packtpub.com/product/mathematical-foundation-for-ai-and-machine-learning-video/9781789613209

3- Math for Programmers
Price: $47.99
Description: In Math for Programmers you’ll explore important mathematical concepts through hands-on coding. Filled with graphics and more than 300 exercises and mini-projects, this book unlocks the door to interesting (and lucrative!) careers in some of today’s hottest fields. As you tackle the basics of linear algebra, calculus, and machine learning, you’ll master the key Python libraries used to turn them into real-world software applications.
Link: https://www.manning.com/books/math-for-programmers

4- Algebra 1
Price: Free
Link: https://www.khanacademy.org/math/algebra

5- Algebra 2
Price: Free
Link: https://www.khanacademy.org/math/algebra2

6- Master Math by Coding in Python
Price: $12.99
Description: You can learn a lot of math with a bit of coding! Many people don’t know that Python is a really powerful tool for learning math. Sure, you can use Python as a simple calculator, but did you know that Python can help you learn more advanced topics in algebra, calculus, and matrix analysis? That’s exactly what you’ll learn in this course. This course is a perfect supplement to your school/university math course, or for your post-school return to mathematics. Let me guess what you are thinking: “But I don’t know Python!” That’s okay! This course is aimed at complete beginners; I take you through every step of the code. You don’t need to know anything about Python, although it’s useful if you already have some programming experience. “But I’m not good at math!” You will be amazed at how much better you can learn math by using Python as a tool to help with your courses or your independent study.
And that’s exactly the point of this course: Python programming as a tool to learn mathematics. This course is designed to be the perfect addition to any other math course or textbook that you are going through.
Link: https://www.udemy.com/course/math-with-python/

7- Introduction to Linear Models and Matrix Algebra
Price: Free
Description: Matrix Algebra underlies many of the current tools for experimental design and the analysis of high-dimensional data. In this introductory online course in data analysis, we will use matrix algebra to represent the linear models that are commonly used to model differences between experimental units. We perform statistical inference on these differences. Throughout the course we will use the R programming language to perform matrix operations. Given the diversity in educational background of our students we have divided the series into seven parts. You can take the entire series or individual courses that interest you. If you are a statistician you should consider skipping the first two or three courses; similarly, if you are a biologist you should consider skipping some of the introductory biology lectures. Note that the statistics and programming aspects of the class ramp up in difficulty relatively quickly across the first three courses. You will need to know some basic stats for this course. By the third course we will be teaching advanced statistical concepts such as hierarchical models, and by the fourth, advanced software engineering skills such as parallel computing and reproducible research concepts.
Link: https://www.edx.org/course/introduction-to-linear-models-and-matrix-algebra

8- Applying Math with Python
Price: $20.99
Description: Python, one of the world’s most popular programming languages, has a number of powerful packages to help you tackle complex mathematical problems in a simple and efficient way.
These core capabilities help programmers pave the way for building exciting applications in various domains, such as machine learning and data science, using knowledge in the computational mathematics domain. The book teaches you how to solve problems faced in a wide variety of mathematical fields, including calculus, probability, statistics and data science, graph theory, optimization, and geometry. You’ll start by developing core skills and learning about packages covered in Python’s scientific stack, including NumPy, SciPy, and Matplotlib. As you advance, you’ll get to grips with more advanced topics of calculus, probability, and networks (graph theory). After you gain a solid understanding of these topics, you’ll discover Python’s applications in data science and statistics, forecasting, geometry, and optimization. The final chapters will take you through a collection of miscellaneous problems, including working with specific data formats and accelerating code. By the end of this book, you’ll have an arsenal of practical coding solutions that can be used and modified to solve a wide range of practical problems in computational mathematics and data science.
Link: https://www.packtpub.com/product/applying-math-with-python/9781838989750

Intermediate: in this second section we will recommend resources focused on calculus and probability.

9- Calculus 1
Price: Free
Link: https://www.khanacademy.org/math/calculus-1

10- Calculus 2
Price: Free
Link: https://www.khanacademy.org/math/calculus-2

11- Multivariable calculus
Price: Free
Link: https://www.khanacademy.org/math/multivariable-calculus

12- Mathematics for Data Science Specialization
Price: Free
Description: Behind numerous standard models and constructions in Data Science there is mathematics that makes things work. It is important to understand it to be successful in Data Science.
In this specialisation we will cover a wide range of mathematical tools and see how they arise in Data Science. We will cover such crucial fields as Discrete Mathematics, Calculus, Linear Algebra and Probability. To make your experience more practical we accompany mathematics with examples and problems arising in Data Science and show how to solve them in Python. Each course of the specialisation ends with a project that gives an opportunity to see how the material of the course is used in Data Science. Each project is directed at solving a practical problem in Data Science. In particular, in your projects you will analyse social graphs, predict estate prices and uncover hidden relations in the data.
Link: https://www.coursera.org/specializations/mathematics-for-data-science

13- Practical Discrete Mathematics
Price: $24.99
Description: Discrete mathematics is a field of math that deals with studying finite and distinct elements. The theories and principles of discrete math are widely used in solving complexities and building algorithms in computer science and computing data in data science. It helps you to understand algorithms, binary, and general mathematics that is commonly used in data-driven tasks. Learn Discrete Mathematics is a comprehensive introduction for those who are new to the mathematics of countable objects. This book will help you get up-to-speed with implementing discrete math principles to take your programming skills to another level. You’ll learn the discrete math language and methods crucial to studying and describing objects and functions in branches of computer science and machine learning.
Complete with real-world examples, the book covers the internal workings of memory and CPUs, analyzes data for useful patterns, and shows you how to solve problems in network routing, encryption, and data science. By the end of this book, you’ll have a deeper understanding of discrete mathematics and its applications in computer science, and get ready to work on real-world algorithm development and machine learning.
Link: https://www.packtpub.com/product/practical-discrete-mathematics/9781838983147

14- Math for Data Science and Machine Learning: University Level
Price: $12.99
Description: In this course we will learn math for data science and machine learning. We will also discuss the importance of math for data science and machine learning in the practical world. Moreover, this course is a bundle of two courses: linear algebra, and probability and statistics. So, students will learn the complete contents of probability and statistics and of linear algebra. You will be able to cover all of this content in this 7-hour video course. This is a beautiful course and I have designed it according to the needs of the students. Linear algebra and probability and statistics are usually offered to students of data science, machine learning, Python and IT. That’s why I have prepared this dual course for different sciences. I have taught this course multiple times in my university classes. It is usually offered in two different modes: linear algebra as one 100-mark paper and probability and statistics as another 100-mark paper, in two different semesters or in the same semester. I usually focus on the method and examples while teaching this course. Examples clarify the concepts for students in a variety of ways; for instance, they can grasp the main idea the instructor wants to deliver when they find the method of a subject or topic difficult.
So, focusing on examples makes the course easy and understandable for the students.
Link: https://www.udemy.com/course/master-linear-algebra-and-probability-2-in-1-bundle/

15- Data Science Math Skills
Price: Free
Description: Data science courses contain math, no avoiding that! This course is designed to teach learners the basic math you will need in order to be successful in almost any data science math course and was created for learners who have basic math skills but may not have taken algebra or pre-calculus. Data Science Math Skills introduces the core math that data science is built upon, with no extra complexity, introducing unfamiliar ideas and math symbols one-at-a-time. Learners who complete this course will master the vocabulary, notation, concepts, and algebra rules that all data scientists must know before moving on to more advanced material.
Link: https://www.coursera.org/learn/datasciencemathskills

Advanced: in this last section we will focus on the statistical part (probability theory) and the application of mathematics to deep learning algorithms.

16- Statistics and probability
Price: Free
Link: https://www.khanacademy.org/math/statistics-probability

17- Intro to Inferential Statistics
Price: Free
Description: Inferential statistics allows us to draw conclusions from data that might not be immediately obvious. This course focuses on enhancing your ability to develop hypotheses and use common tests such as t-tests, ANOVA tests, and regression to validate your claims.
Link: https://www.udacity.com/course/intro-to-inferential-statistics--ud201

18- Statistical Methods and Applied Mathematics in Data Science
Price: $124.99
Description: Machine learning and data analysis are the center of attraction for many engineers and scientists. The reason is quite obvious: its vast application in numerous fields and booming career options.
And Python is one of the leading open source platforms for data science and numerical computing. IPython, and its associated Jupyter Notebook, provide Python with efficient interfaces for data analysis and interactive visualization, and they constitute an ideal gateway to the platform. If you are among those seeking to enhance their capabilities in machine learning, then this course is the right choice. Statistical Methods and Applied Mathematics in Data Science provides many easy-to-follow, ready-to-use, and focused recipes for data analysis and scientific computing. This course tackles data science, statistics, machine learning, signal and image processing, dynamical systems, and pure and applied mathematics. You will apply state-of-the-art methods to various real-world examples, illustrating topics in applied mathematics, scientific modeling, and machine learning. In short, you will be well versed with the standard methods in data science and mathematical modeling.
Link: https://www.packtpub.com/product/statistical-methods-and-applied-mathematics-in-data-science-video/9781789539219

19- Exploring Math for Programmers and Data Scientists
Price: Free
Description: Exploring Math for Programmers and Data Scientists showcases chapters from three Manning books, chosen by author and master-of-math Paul Orland. You’ll start with a look at the nearest neighbor search problem, common with multidimensional data, and walk through a real-world solution for tackling it. Next, you’ll delve into a set of methods and techniques integral to Principal Component Analysis (PCA), an underlying technique in Latent Semantic Analysis (LSA) for document retrieval. In the last chapter, you’ll work with digital audio data, using mathematical functions in different and interesting ways. Begin sharpening your competitive edge with the fun and fascinating math in this (free!)
practical guide!
Link: https://www.manning.com/books/exploring-math-for-programmers-and-data-scientists

20- Hands-On Mathematics for Deep Learning
Price: $27.99
Description: Most programmers and data scientists struggle with mathematics, having either overlooked or forgotten core mathematical concepts. This book uses Python libraries to help you understand the math required to build deep learning (DL) models. You’ll begin by learning about core mathematical and modern computational techniques used to design and implement DL algorithms. This book will cover essential topics, such as linear algebra, eigenvalues and eigenvectors, the singular value decomposition concept, and gradient algorithms, to help you understand how to train deep neural networks. Later chapters focus on important neural networks, such as the linear neural network and multilayer perceptrons, with a primary focus on helping you learn how each model works. As you advance, you will delve into the math used for regularization, multi-layered DL, forward propagation, optimization, and backpropagation techniques to understand what it takes to build full-fledged DL models. Finally, you’ll explore CNN, recurrent neural network (RNN), and GAN models and their application. By the end of this book, you’ll have built a strong foundation in neural networks and DL mathematical concepts, which will help you to confidently research and build custom models in DL.
Link: https://www.packtpub.com/product/hands-on-mathematics-for-deep-learning/9781838647292

21- Math and Architectures of Deep Learning
Price: $39.99
Description: Math and Architectures of Deep Learning sets out the foundations of DL in a way that’s both useful and accessible to working practitioners. Each chapter explores a new fundamental DL concept or architectural pattern, explaining the underpinning mathematics and demonstrating how they work in practice with well-annotated Python code.
You’ll start with a primer of basic algebra, calculus, and statistics, working your way up to state-of-the-art DL paradigms taken from the latest research. By the time you’re done, you’ll have a combined theoretical insight and practical skills to identify and implement DL architecture for almost any real-world challenge.Link: https://www.manning.com/books/math-and-architectures-of-deep-learningConclusionThis is an extensive recommendation on resources for learning mathematics for data science, following the previous post about the path to follow in this year 2021 to learn data science. When we have limited time for study, we should select those that we feel best and those that fit our style. For example, you might prefer videos about books, so go ahead and choose what suits you best. This material is sufficient whether you want to take a brief look at the mathematics, or if you want to go deeper into it. I hope you find it useful.If you have other recommendations for courses, books or videos, please leave them in the comments so that we can all create links of interest.Note: we are building a private community in Slack of data scientist, if you want to join us you can register here: https://www.datasource.ai/en#slackI hope you enjoyed this reading! you can follow me on twitter or linkedinThanks for reading!

Daniel Morales January 19, 2021

Python Programming

Python Books You Must Read in 2020

Python is one of the top programming languages for a diverse range of tasks and domains. Python’s user-friendliness, high-level nature, and emphasis on simplicity and code readability make it a favorable choice for many developers around the world. If that doesn’t sell Python to you, I’m sure that its exhaustive ecosystem of more than 255 thousand third-party packages will.

Features like these have skyrocketed the demand for Python everywhere, be it application development, Data Science, Artificial Intelligence, or any other industry. The goal behind this write-up is to round up some of the best Python books around, to help you gain knowledge of and confidence with this amazing programming language.

According to Stack Overflow, Python is among the most preferred languages, which suggests that a large share of developers use it or want to use it.

Python Books
Books are quite possibly one of the top sources of information about almost any topic, and in this section, we have gathered more than ten top books to help you familiarize yourself with Python and gain some practical knowledge of it. Some of these books cover the programming language comprehensively, while some are excellent at giving you hands-on experience with it.

Regardless of your previous experience with Python or any programming language for that matter, we’re sure you will find some handy tips in these books for your next project.

Disclaimer: Those affiliate links are for information purposes only.

1. Python Crash Course
Author: Eric Matthes
Publisher: No Starch Press
Difficulty Level: Beginners
Get Both Books here: Amazon, Amazon
Cover of the book “Python Crash Course”

As the name suggests, the author has written this book to act as a quick crash course for readers with little to no programming exposure. The author has made all the introductory concepts as easy as ABC for beginners so that they can start applying their knowledge to fun projects.
The introductory nature of this book also makes it a fitting choice for academics.

This two-part book covers the introduction to programming in its first part, whereas the second part takes a project-driven approach where the readers are encouraged to complete any or all of three programming projects. The projects include coding a 2D game, creating a data visualization program, and lastly, an online Learning Log for note-taking.

2. Learning Python
Author: Mark Lutz
Publisher: O’Reilly Media
Difficulty Level: Beginners
Get Book here: Amazon
Cover of the book “Learning Python”

“In the Python way of thinking, explicit is better than implicit, and simple is better than complex.” - Mark Lutz

Learning Python covers all the fundamentals of the programming language and aims to be a one-stop solution for beginners in search of an in-depth introduction to Python. Like the previous one, this is a two-part book: the author, Mark Lutz, builds a solid foundation in Python in this part, while the other part focuses more on real-life examples and situations for better practical programming exposure.

The latest edition of the book covers Python v3.3 and all its latest improvements along with the older v2.7. On a side note, if you have zero proficiency with programming, it would be a good idea to supplement this book with additional introductory references.

3. Python Tricks: A Buffet of Awesome Python Features
Author: Dan Bader
Publisher: Dan Bader (dbader.org)
Difficulty Level: Beginners
Get Book here: Amazon
Cover of the book “Python Tricks: A Buffet of Awesome Python Features”

“There should be one-- and preferably only one --obvious way to do it.” - Dan Bader

As the title suggests, Python Tricks brings together a collection of convenient features, tips, and tricks to make you efficient with Python.

In the words of the author: “What started out as a fun twitter experiment, turned into a series of noteworthy and useful tricks accompanied by a clear code example, that helped hundreds of Python developers understand the idea behind the various aspects of Python.”

The book covers a large collection of tricks from a variety of topics in Python, and they’re presented in a well-explained style, but to make full use of this book, you’d still need a strong foundation in Python.

4. Learn Python the Hard Way
Author: Zed Shaw
Publisher: Addison-Wesley
Difficulty Level: Beginners
Get Book here: Amazon
Cover of the book “Learn Python the Hard Way”

“Just take it slow and do not hurt your brain.”

If you’re intimidated by the “Hard” in the title, don’t be. That’s just the author’s way of using instructions to make you go through the chapters thoroughly and practice what you’ve learned.

Putting the title aside, the author has done an amazing job presenting the fundamental concepts of Python at a beginner-friendly pace to prepare you for complex topics. The book also includes plenty of instructional videos and exercises to enhance your knowledge of Python. In case you’re wondering, yes, the book has been updated with a newer edition that supports Python 3.

5. Automate the Boring Stuff with Python
Author: Al Sweigart
Publisher: No Starch Press
Difficulty Level: Beginners
Get Book here: Amazon
Cover of the book “Automate the Boring Stuff with Python”

According to the author: “Don’t spend your time doing work a well-trained monkey could do. Even if you’ve never written a line of code, you can make your computer do the grunt work. Learn how in Automate the Boring Stuff with Python.”

The title says it all. Every now and then, you must’ve come across a boring or repetitive task that makes you say, “not this again.” It is these moments that this book intends to eliminate. The author covers the necessary basics of the programming language to help you create some nifty snippets of Python that can automate a simple but boring task, so that it is done in seconds instead of hours.

While the book does wonders for anyone without a programming background who wants to get the boring tasks out of the way quickly, it does not comprehensively cover each aspect of Python. It is good enough for creating handy throwaway code but not enough for a thorough introduction.

6. Python for Data Analysis
Author: Wes McKinney
Publisher: O’Reilly
Difficulty Level: Intermediate
Get Book here: Amazon
Cover of the book “Python for Data Analysis”

Python can be used for a variety of tasks, and one of them is data analysis. If you constantly find yourself occupied with analyzing and manipulating structured data or are simply keen on learning how efficient Python can be for data analysis tasks, you might find this book useful.

The author has explained the fundamentals of working with data in a very comprehensive manner while also touching upon the topic of scientific computing. Python for Data Analysis also covers some of the most popular libraries for data analysis, such as NumPy, pandas, matplotlib, IPython, and SciPy.

“Act without doing; work without effort. Think of the small as large and the few as many. Confront the difficult while it is still easy; accomplish the great task by a series of small acts. - Laozi” - Wes McKinney

7. Introduction to Machine Learning with Python
Author: Andreas C. Müller and Sarah Guido
Publisher: O’Reilly
Difficulty Level: Intermediate
Get Book here: Amazon
Cover of the book “Introduction to Machine Learning with Python”

The rate at which Machine Learning is advancing is fascinating. To be able to make the most out of this technology, Python is among the top choices for a glue language. Targeted towards aspiring Machine Learning professionals in search of solutions to real-world machine learning problems, this introductory book requires zero prior experience with Machine Learning.

Instead of diving into the mathematics behind the algorithms and models being used throughout, the book takes a gentler approach and explains their background and importance. It does, however, require some knowledge of Python to implement the vast collection of algorithms and models covered by the libraries used in the book, such as scikit-learn, NumPy, and matplotlib.

8. Python Data Science Handbook
Author: Jake VanderPlas
Publisher: O’Reilly Media
Difficulty Level: Intermediate
Get Book here: Amazon
GitHub: https://github.com/jakevdp/PythonDataScienceHandbook
Cover of the book “Python Data Science Handbook”

Data Science is becoming an essential skill in various domains, as the benefits it offers are invaluable. The author of this handbook puts more emphasis on learning Data Science as a skill than as a new domain of knowledge, as this can prove advantageous when applying the skill to a problem at hand.

The contents of the book are structured around five libraries from the Python ecosystem for extensive coverage, i.e., IPython, NumPy, Pandas, Matplotlib, and Scikit-Learn. The handbook does require a certain degree of proficiency in Python to follow as intended by the author, and would not make a suitable choice for beginners.

9. Head First Python: A Brain-Friendly Guide
Author: Paul Barry
Publisher: O’Reilly Media
Difficulty Level: Intermediate
Get Book here: Amazon
Cover of the book “Head First Python: A Brain-Friendly Guide”

“code is read more than it’s written. This” - Paul Barry

Although this book doesn’t cover the programming language in its entirety, it still manages to teach you Python in a more practical and fun way. What separates this book from others is the fun and casual style the author uses to build a connection with the readers, and more importantly, the comical use of pictures to keep the learning process intuitive.

For those of you who have just started learning Python, or any other programming language for that matter, you might have a rough time getting your head around the topics covered. If you do know the basics of programming, hop on.

10. Fluent Python
Author: Luciano Ramalho
Publisher: O’Reilly Media
Difficulty Level: Experts
Get Book here: Amazon
Cover of the book “Fluent Python”

Python can be a very versatile and powerful programming language when used efficiently, and this is the driving idea behind the book Fluent Python. Learning Python and achieving fluency in Python are two very different things. Most developers will achieve what they want with Python anyway, but without using its full potential.

The author highlights some of the less utilized features and techniques to make it possible to get the most out of Python. If you have recently started learning Python, this might not be the right book for you, as you might find it hard to follow.

11. Effective Python
Author: Brett Slatkin
Publisher: Addison-Wesley
Difficulty Level: Experts
Get Book here: Amazon
Cover of the book “Effective Python”

Python is a programming language that puts a high emphasis on creating clear and extremely readable code, but there can still be some situations where that isn’t achieved.
This is where the book Effective Python comes into the picture. The author has covered some common mistakes and provided invaluable insights and practices on how to avoid them in the first place, to write cleaner, reusable, and more effective Python code. If you’re constantly struggling to find ways to optimize your code, this could be the right book for you.

12. Python Cookbook
Author: David Beazley & Brian K. Jones
Publisher: O’Reilly Media
Difficulty Level: Experts
Get Book here: Amazon
Cover of the book “Python Cookbook”

Python has an excellent community, and this book takes its inspiration from the many unique challenges faced by the community, which are referred to as recipes in the Python Cookbook. The recipes come with relevant examples and detailed background studies on the problems from some of the most insightful members of the community.

To make it absolutely clear, the authors have targeted this book towards experienced Python developers looking to strengthen their understanding of the various modern techniques in Python. Beginners are advised to pick up something that covers the introductory parts of Python before starting with the Python Cookbook.

More Python Books to Read:
Python Pocket Reference
Python Machine Learning
Deep Learning with Python
Python Programming: An Introduction to Computer Science
Natural Language Processing with Python
Python in a Nutshell
Think Python: How to Think Like a Computer Scientist
Django for Beginners: Build websites with Python and Django

Conclusion
Python is an elegant and powerful programming language that can do wonders if utilized correctly. The books covered in this write-up should provide you with sufficient knowledge to get started with Python, along with some additional tips and tricks to write clear and optimized code that works beautifully. To tie things up here, if you are keen on learning Python from scratch or would simply like a brush-up, we highly recommend you go through these books.

Note: This article represents just my personal opinion, and you have every right to disagree with it.

Claire D October 6, 2020

Programming Data Science Python

How to Get a Job With Python

There are so many websites out there offering job listings for different fields. Even if you already have a position, it pays to keep an eye on the market, and going through all those listings can get boring. But here comes a simple solution to get through so many job offers with ease!

We are going to build an easy Python script to get job listings and filter them to your liking.

It is a simple use of Python and you do not need any specific skills to do this with me. We will go step by step and build everything together.

Let’s just jump right into it!

Coding

Plan the process
First, we have to find the job listing website that we are going to get the offers from.

I chose a website called Indeed. (It is just an example for this tutorial, but if you have a website you prefer to use for job hunting, please feel free to use it!)

Here is what we are going to do:
Filter the criteria of jobs that fit us and perform scraping on those.

Here is what Indeed’s website looks like after I search for Data Science in United States.
Search on the Website example

In the end, once we have our data, we are going to pack it into a DataFrame and get a CSV file, which can be opened easily with Excel or LibreOffice.

Setting up the environment
You are going to have to install ChromeDriver and use it with Selenium, which enables us to control the browser and send commands to it, first for testing and later for real use.

Open the link and download the file for your operating system. I recommend the latest stable release unless you know what you are doing already.

Next up, you need to unzip that file.
I would recommend going into Files and doing it manually by right-clicking and then “Extract here”.

Inside the folder, there is a file called “chromedriver”, which we have to move to a specific folder on your computer.

Open the terminal and type these commands:

    sudo su    # enter root mode
    cd    # go back to the home directory from the current location
    mv /home/*your_pc_name*/Downloads/chromedriver /usr/local/bin    # move the file to the right location

Replace *your_pc_name* with your actual user name (the name of your home folder).

There are a few more libraries needed for this to work. In that terminal you should install these:

    pip3 install pandas

Pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool, built on top of the Python programming language.

    sudo pip3 install beautifulsoup4

Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages.

Once we are done with that, we open the editor. My personal choice is Visual Studio Code. It is straightforward to use, customizable, and light on your computer.

Open a new project wherever you like and create two new files. This is an example of what mine looks like, to help you:
Visual Studio Code: Project setup

In VS Code, there is a “Terminal” tab with which you can open an internal terminal, which is very useful for having everything in one place.

When you have that open, there are a few more things we need to install: the virtual environment and Selenium for the web driver.
Type these commands into your terminal.

    pip3 install virtualenv
    virtualenv venv    # create the environment that the next line activates
    source venv/bin/activate
    pip3 install selenium pandas beautifulsoup4    # pandas and beautifulsoup4 must also be available inside the environment

After activating the virtual environment, we are completely ready to go.

Creating the Tool
We have everything set up and now we are going to code!

First, as mentioned before, we have to import the installed libraries.

    from time import sleep

    import pandas as pd
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By

Create your tool with any name and start the driver for Chrome.

    class FindJob():
        def __init__(self):
            self.driver = webdriver.Chrome()

That is all we need to start developing. Now go to your terminal and type:

    python -i findJob.py

This command lets us use our file as an interactive playground. A new browser window will open and we can start issuing commands to it.

If you want to experiment, you can type commands in this interactive session instead of putting them directly into your source file. Just use bot instead of self.

For Terminal:

    bot = FindJob()
    bot.driver.get('https://www.indeed.com/jobs?q=data+science&l=United+States&start=')

And now for the source code:

    self.driver.get('https://www.indeed.com/jobs?q=data+science&l=United+States&start=')

Creating the DataFrame we are going to be using is easy, so let’s begin there!

For this data frame, we need to have “Title”, “Location”, “Company”, “Salary”, and “Description”, all related to the jobs we are going to scrape.

    dataframe = pd.DataFrame(
        columns=["Title", "Location", "Company", "Salary", "Description"])

The columns of this DataFrame will become the column names of our CSV file.

The thing with this website is that on every page there are 10 job offers and the link changes as you go to the next page. Once I figured that out, I made a for loop that processes a page and, after it is done, goes to the next one.
Here is what that looks like:

    for cnt in range(0, 50, 10):
        self.driver.get("https://www.indeed.com/jobs?q=data+science&l=United+States&start=" + str(cnt))

I set up a counter variable ‘cnt’ and add that number, converted to a string, to my link. The for loop begins with 0 and goes up to (but not including) 50 in steps of 10, because that is how many jobs per page Indeed shows us.

When we get to the first page, we need to scrape the table of offers one by one, and we are going to do it this way:

In the image above, where you see offers, they are packed in a table and we can find that table by pressing F12 on the keyboard or right-click -> Inspect.

Here is what that looks like:

We will find the table by the class name and we enter this line:

    jobs = self.driver.find_elements(By.CLASS_NAME, 'result')

That saves all web elements that it has found by the class name result. This uses By, imported with from selenium.webdriver.common.by import By next to the other imports; the older find_elements_by_class_name helpers are no longer available in recent Selenium 4 releases.

Once we have those saved, we can create another for loop, go through every element inside that table, and use the data we found inside.

Before I show you more code on scraping those offers, we should go through a couple of things.

For this part, we are going to use BeautifulSoup since I find it to work way faster.

We have to set up a few things for BeautifulSoup: the actual data that we give it to perform its search on, and the parser it should use:

    result = job.get_attribute('innerHTML')
    soup = BeautifulSoup(result, 'html.parser')

Once we have those, we just have to find elements inside the ‘soup’ variable, which is simply the data prepared by BeautifulSoup.

We get the data for the DataFrame that we want:

    title = soup.find("a", class_="jobtitle").text.replace('\n', '')
    location = soup.find(class_="location").text
    employer = soup.find(class_="company").text.replace('\n', '').strip()
    try:
        salary = soup.find(class_="salary").text.replace('\n', '').strip()
    except AttributeError:
        # soup.find returned None: this offer lists no salary
        salary = 'None'

I did the salary part in this manner because sometimes it is not defined, and we have to set None or empty for that particular cell.

Since I am working in the terminal for testing my code, you can also print what you found so far:

    print(title, location, employer, salary)

Once this script is done, it will look like this:

The last thing missing for the DataFrame is the description of the job. I left this out because, in order to get the text of the job description, you have to click the job offer first. I do it this way:

    summ = job.find_elements(By.CLASS_NAME, "summary")[0]
    summ.click()
    job_desc = self.driver.find_element(By.ID, 'vjs-desc').text

After we have all the elements that should go into the DataFrame, we fill it. The keys have to match the column names defined earlier:

    new_row = pd.DataFrame([{'Title': title, 'Location': location, 'Company': employer,
                             'Salary': salary, 'Description': job_desc}])
    dataframe = pd.concat([dataframe, new_row], ignore_index=True)

One more thing that I have to mention before you start testing this for yourself.

Once you go to the second page of the website, there is a popup that blocks you from clicking further on anything!

I also thought of that and created a try-except, which will close the popup and continue scraping data!

    try:
        pop_up = self.driver.find_element(By.XPATH, '/html/body/table[2]/tbody/tr/td/div[2]/div[2]/div[4]/div[3]/div[2]/a')
        pop_up.click()
    except Exception:
        # no popup on this page, keep scraping
        pass

Once the for loop finishes, we write the DataFrame to a CSV file called ‘jobs’:

    dataframe.to_csv("jobs.csv", index=False)

We are done!

The complete code is here on my GitHub account:
lazargugleta/findJob

Next steps
You can take this script to another level by implementing a comparison between different websites and getting the best offers on the internet overall.

You can find a job in Python here.

Until then, follow me for more! 😎

Thanks for reading!
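As a recap of the pagination and CSV steps from the tutorial, here is a minimal sketch that uses only the Python standard library, so it runs without a browser or Selenium. The helper names page_urls and rows_to_csv and the sample rows are made up for illustration; they are not part of the author's script, which uses pandas for the CSV step.

```python
import csv
import io
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"  # same site as in the tutorial
COLUMNS = ["Title", "Location", "Company", "Salary", "Description"]


def page_urls(query, location, pages=5, per_page=10):
    """Build one search URL per results page (start=0, 10, 20, ...)."""
    urls = []
    for start in range(0, pages * per_page, per_page):
        params = urlencode({"q": query, "l": location, "start": start})
        urls.append(f"{BASE_URL}?{params}")
    return urls


def rows_to_csv(rows):
    """Serialize scraped rows to CSV text, writing 'None' for missing fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="None")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


urls = page_urls("data science", "United States")

# Hypothetical rows standing in for scraped offers (illustration only).
sample_rows = [
    {"Title": "Data Scientist", "Location": "Austin, TX", "Company": "Example Corp",
     "Salary": "$100,000 a year", "Description": "Build models."},
    {"Title": "ML Engineer", "Location": "Remote", "Company": "Sample LLC",
     "Description": "Deploy models."},  # no salary listed for this offer
]
csv_text = rows_to_csv(sample_rows)
print(urls[0])
print(csv_text)
```

Collecting plain dictionaries and writing them once at the end mirrors what the DataFrame does in the tutorial, and the restval argument plays the role of the salary fallback.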

Lazar Gugleta June 24, 2020

Programming Data Science News

Data Science Trends for 2020

Crucial Data Science Trends for the New Decade

Data science is the discipline of making data useful.

There is absolutely no doubt that this decade has brought loads of innovation in Artificial Intelligence. Besides Artificial Intelligence, we are witnessing a massive boost in the data generated from thousands of sources. The fact that millions of devices are responsible for this enormous spike in data brings us to the topic of its smart utilization.

The domain of Data Science brings with it a variety of scientific tools, processes, algorithms, and knowledge extraction systems for identifying meaningful patterns in structured and unstructured data alike.

Data Science also benefits data mining and big data. Brought into the mainstream in the year 2001, Data Science has been evolving ever since and is rated as one of the most exciting career paths of all time.

Towards Data Science reports:
Currently, the daily data output is more than 2.5 quintillion bytes.
In the near future, “1.7 MB of data will be created every second for every person on the planet.”
A wide variety of Data Science roles will drive these massive data loads.

Google search popularity of “Data Science” over the past 5 years. Generated by Google Trends.

Trends in Data Science
With the diversity in data problems and requirements comes a broad range of innovative solutions. These solutions often bring with them a host of data science trends, granting businesses the agility they require while offering them deeper insights into their data. A few of these top Data Science trends are briefly explained below:

1. Graph Analytics
With data flowing in from all directions, it becomes harder to analyze.

Graph Analytics aims to solve this problem by acting as a flexible yet powerful tool that analyzes complicated data points and relationships using graphs.
The intention behind using graphs is to represent complex data abstractly and in a visual format that is easier to digest and offers maximum insights. Graph Analytics is applied in a plethora of areas such as:
Filtering out bots on social media to reduce false information
Identifying fraud in the banking industry
Preventing financial crime
Analyzing power and water grids to find flaws

2. Data Fabric
Data Fabric is a relatively new trend, and at its core, it encapsulates an organization’s data collected from a vast number of sources such as APIs, reusable data services, pipelines, and semantic tiers, providing transformable access to data.

Created to preserve the business context of data and keep data intelligible not just for users but also for applications, Data Fabrics enable you to scale your data while staying agile.

By doing so, you get unparalleled access to process, manage, store, and share the data as needed. Business Intelligence and Data Science rely heavily upon Data Fabrics due to their smooth and clean access to enormous amounts of data.

3. Data Privacy by Design
The trend of Data Privacy by Design incorporates a safer and more proactive approach to collecting and handling user data while training your machine learning model on it.

Corporations need user data to train their models on real-world scenarios, and they collect data from various sources such as browsing patterns and devices.

The idea behind Federated Learning is to collect as little data as possible, keeping the user in the loop by also giving them the option to opt out and erase all collected data at any time.

While the data may come from an enormous audience, for privacy reasons, it must be guaranteed that reverse-engineering the original data to identify the user isn’t possible.

4. Augmented Analytics
Augmented Analytics refers to deriving better insights from the data at hand by excluding incorrect conclusions and bias, for optimized decisions.
By infusing Artificial Intelligence and Machine Learning, Augmented Analytics aids users in planning a new model.

With reduced dependency on data scientists and machine learning experts, Augmented Analytics aims to deliver relatively better insights on data to aid the entire Business Intelligence process.

This subtle introduction of Artificial Intelligence & Machine Learning has a significant impact on the traditional insight discovery process by automating many aspects of data science. Augmented Analytics is gaining a stronghold in providing better decisions, free of errors and bias in the analysis.

5. Python as the De-Facto Language for Data Science
Photo by Hitesh Choudhary on Unsplash

Python is an absolute all-rounder programming language and is considered a worthy entry point if you’re interested in getting into the world of Artificial Intelligence and Data Science.

With a supportive online community, you can get support almost instantly, and the integrations in Python are just the tip of the iceberg.

“The joy of coding Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code -- not in reams of trivial code that bores the reader to death.” - Guido van Rossum

Python comes stacked with integrations for numerous programming languages and libraries, making it an excellent option for, say, creating a quick prototype for the problem at hand or going in-depth into large datasets.

Some of its most popular libraries are:
TensorFlow, for machine learning workloads and working with datasets.
Scikit-learn, for training machine learning models.
PyTorch, for computer vision and natural language processing.
Keras, a high-level interface for building and training neural networks.
Spark MLlib, Apache Spark’s Machine Learning library, making machine learning easy for everyone with tools like algorithms and utilities.

6. Widespread Automation in Data Science
Time is a critical component, and none of it should be spent on performing repetitive tasks.

As Artificial Intelligence advanced, its automation capabilities expanded as well. Various innovations in automation are making many complex Artificial Intelligence tasks easier.

Automation in the field of Data Science is already simplifying much of the process, if not all. The entire process of Data Science includes identification of the problem, data collection, processing, exploration, analysis, and sharing of processed information with others.

7. Conversational Analytics and Natural Language Processing
Natural Language Processing and Conversational Analytics are already making big waves in the digital world by simplifying the way we interact with machines and look up information online.

NLP has hugely helped us progress into an era where computers and humans can communicate in common natural language, enabling a constant and fluent conversation between the two.

The applications of NLP and conversational systems can be seen everywhere, such as chatbots and smart digital assistants. It has been predicted that the usage of voice-based searches will exceed the more commonly used text-based searches in a very short time.

8. Super-sized Data Science in the Cloud
Since the onset of Artificial Intelligence, the amount of data generated has skyrocketed. The size of data grew tremendously, from a few gigabytes to a few hundred, as businesses grew their online presence.

This increased requirement for data storage and processing capabilities gave rise to Data Science for a controlled and precise utilization of data, and pushed organizations working on a global scale to opt for cloud solutions.

Various cloud solutions providers such as Google, Amazon, and Microsoft offer vast cloud computing options that include enterprise-grade cloud server capabilities ensuring high scalability and zero downtime.

9. Mitigate Model Biases and Discrimination
No model is entirely immune to biases, and models can begin to exhibit discriminatory behavior at any stage due to factors such as lack of sufficient data, historical bias, and incorrect data collection practices. Bias and discrimination are a common problem with models, and mitigating them is an emerging trend. If detected in time, these biases can be mitigated at three stages:
Pre-Processing Stage
In-Processing Stage
Post-Processing Stage

Each stage comes with its own set of corrective measures, including algorithms and techniques to optimize the model for fairness and to increase its accuracy, reducing the chance of bias.

10. In-Memory Computing
In-Memory computing is an emerging trend that is vastly different from how we traditionally process data.

In-Memory computing processes data stored in an in-memory database, as opposed to the traditional methods using hard drives and relational databases with a querying language. This technique allows for processing and querying of data in real time for instant decision making and reporting.

With memory becoming cheaper and businesses relying on real-time results, In-Memory computing enables them to have applications with richer, more interactive dashboards that can be supplied with newer data and be ready for reporting almost instantly.

11. Blockchain in Data and Analytics
Blockchain, in simpler terms, is a time-stamped collection of immutable data managed by a cluster of computers, and not by any single entity. The chain here refers to the connection between each of these blocks, bound together using cryptographic algorithms.

Evolving gradually, much like Data Science, Blockchain is crucial for maintaining and validating records, while Data Science works on collecting data and extracting information from it.
Data Science and Blockchain are related as they both use algorithms to govern various segments of their processing.ConclusionAs businesses begin to grow, they generate more data, and Data Science can help them analyze their areas of improvement. With several of the noteworthy Data Science trends mentioned above, some have begun to consider Data Science as the fourth paradigm of science next to Empirical, Theoretical, Computational. Keeping up to date with newer trends is an absolute must for businesses to achieve maximum efficiency and stay at the forefront of the competition.

Claire D May 27, 2020

SQL Programming

SQL Joins: A Brief Example

Understand The Why And How Of SQL Joins

This blog post was originally intended to be a side-note in my Pandas Join vs. Merge post. But it turned out to be long enough to warrant its own post (and way too verbose for a side-note). It’s not meant to be a full-on primer on SQL joins, but rather an example to help those new to SQL and relational databases begin to grasp what it means to join 2 tables.

Why Do We Join?

Why bother with joining at all? Can’t we just dump everything into a spreadsheet and sort things out there? Perhaps… but it would be incredibly time consuming, tedious, and error prone.

Relational databases are designed to be joined. Each table in the database contains data of a specific form or function. For example, one table might have basic data on a company’s customers such as customer ID (a unique ID that can be used to identify each customer), name, age, gender, date of first purchase, and address. A separate, much larger table stores detailed transaction-level data: transaction ID, date of transaction, customer ID, product category, product ID, units sold, and price.

A given customer (or customer ID) could have hundreds or even thousands of transactions, so it would be extremely redundant to store that customer’s basic information over and over again for each row in the transactions table. The transactions table should hold only data relevant to transactions. Having too much overlapping data between tables is wasteful and can negatively impact system performance.

But that doesn’t mean we don’t care about the linkages between tables. Given how specific each table is, analyses that involve only a single table are generally not useful. The interesting analyses come from datasets that combine multiple tables. For example, we might want to segment transactions by age or geography. To do this, we would need data from both tables. 
And that’s where join comes in.

How Do We Join?

When we join two tables, we are linking them together via a selected characteristic. Let’s say we have two tables. The first one, Employee, lists out an employee’s unique ID number, name, and job title. The second one, Sale, lists out data on who made what sale by attaching the employee’s ID number and the units sold to a unique sales number:

SELECT * FROM Employee;
SELECT * FROM Sale;

Our 2 tables, Employee (left) and Sale (right)
(I omitted the underscores from the column names in my graphics for legibility)

Now let’s join the two tables. To link the two tables, we need to pick a column (or combination of columns) that serves as the point of intersection; let’s call the chosen column the join index. Table entries that share the same value for the join index are joined together. Note that the intersection does not have to be one to one. For example, Tony has made 2 sales, so upon joining the tables, both of his sales will be linked to Tony (a.k.a. Employee ID 1).

When we join tables, we generally want the join index to uniquely identify each row in at least one of the tables. If the join index were not unique, quirky stuff might occur. For example, let’s say we had a second employee named Tony (along with the tables below), and he was a megastar salesman. If instead of joining on “Employee ID” we joined on “Name”, then we would mistakenly link Tony the Megastar’s sales to me, making my bonus way too high:

Joining on non-unique columns is not recommended

And Tony the Megastar would get credit for my pitiful sales as well (not that he needs it). So to avoid this, we join on a column that uniquely identifies each employee, such as “Employee ID” (I removed Tony the Megastar as he was only there to illustrate what not to do, and his incredible successes made me feel unworthy):

The “Employee ID” column provides the link between the 2 tables

There are various types of SQL joins and I will not go into the details of all of them here. 
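To make the point about non-unique join columns concrete, here is a minimal sketch using Python's built-in sqlite3 module. The rows, the job titles, the extra `Name` column on `Sale`, and the helper `units_by_employee` are all made up for illustration (the article's graphics are not reproduced here); only the table and column names follow the text above.

```python
import sqlite3

# Hypothetical data standing in for the article's graphics: two employees
# named Tony, one of whom sells far more than the other.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Employee_ID INTEGER PRIMARY KEY, Name TEXT, Title TEXT);
    -- Assumption: Sale also carries a Name column so a join on Name is possible.
    CREATE TABLE Sale (Sale_Number INTEGER PRIMARY KEY, Employee_ID INTEGER,
                       Name TEXT, Units_Sold INTEGER);
    INSERT INTO Employee VALUES (1, 'Tony', 'Data Scientist'),
                                (2, 'Lisa', 'Sales Associate'),
                                (5, 'Tony', 'Megastar Salesman');
    INSERT INTO Sale VALUES (101, 1, 'Tony', 3),
                            (102, 1, 'Tony', 2),
                            (103, 5, 'Tony', 500),
                            (104, 2, 'Lisa', 10);
""")

def units_by_employee(join_condition):
    """Total units credited to each Employee_ID under the given join condition."""
    query = f"""
        SELECT e.Employee_ID, SUM(s.Units_Sold)
        FROM Employee AS e
        LEFT JOIN Sale AS s ON {join_condition}
        GROUP BY e.Employee_ID
        ORDER BY e.Employee_ID
    """
    return dict(conn.execute(query).fetchall())

# Joining on Name credits every Tony with every Tony's sales.
units_by_name = units_by_employee("e.Name = s.Name")
# Joining on Employee_ID credits each sale to exactly one employee.
units_by_id = units_by_employee("e.Employee_ID = s.Employee_ID")

print(units_by_name)  # {1: 505, 2: 10, 5: 505}
print(units_by_id)    # {1: 5, 2: 10, 5: 500}
```

With the join on `Name`, both Tonys are credited with 505 units; with the join on `Employee_ID`, each sale is counted once, for the employee who actually made it.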
In this example, we will use a left join, meaning that we prioritize the rows in the left table. So our output will include every row in the left table (the one with “Name” and “Title”) regardless of whether there is a match with the right table. Thus employees that have not made a sale will still have a row in our output, but there will be no values (NULLs to be exact) for the “Sale Number” and “Units Sold” columns.

Let’s take a look at our output (we are selecting only “Sale Number” and “Units Sold” from the right table and sorting by “Employee ID”):

SELECT e.*, s.Sale_Number, s.Units_Sold
FROM Employee AS e
LEFT JOIN Sale AS s
ON e.Employee_ID = s.Employee_ID
ORDER BY e.Employee_ID;

The result of our left join

The output of our join now includes data from both tables. Tony’s sales data has been linked with his employee data (thanks to Employee ID) as has Lisa’s. Notice 2 things:

It looks a bit repetitive because “Employee ID”, “Name”, and “Title” are repeated as many times as the employee has sales. In reality, we wouldn’t stop here though. Next, we would most likely do a group by in order to count up how many sales each employee made, or calculate the average number of units sold each time an employee makes a sale.

The sale recorded under Employee ID 3 is missing from the output because we did a left join and there was no entry for Employee ID 3 in the left table (Employee). Thus, it was omitted.

And that concludes our brief example. Hopefully, this gives you a rudimentary idea of why we need joins and how they work.

Key Takeaways:

Database tables generally contain very specific information. Therefore, meaningful analyses usually combine data from multiple tables.
This is accomplished via the join operation, which combines two tables by matching them up based on a specified column.
The column used to combine the tables should uniquely identify each row in at least one of the tables (here, each employee in Employee).
There are various types of joins. The one in this example is a left join, which returns every row in the left table whether or not there is a match.
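The left join from this example, together with the group by mentioned as the likely next step, can be tried out with Python's built-in sqlite3 module. This is only a sketch: the rows and job titles are invented stand-ins for the article's graphics, and the employee "Sam" (who has no sales) is added purely to show the NULL behaviour. `Sale_Number` is also added to the `ORDER BY` so the row order is deterministic.

```python
import sqlite3

# Hypothetical rows; only the table and column names come from the article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Employee_ID INTEGER PRIMARY KEY, Name TEXT, Title TEXT);
    CREATE TABLE Sale (Sale_Number INTEGER PRIMARY KEY, Employee_ID INTEGER,
                       Units_Sold INTEGER);
    INSERT INTO Employee VALUES (1, 'Tony', 'Data Scientist'),
                                (2, 'Lisa', 'Sales Associate'),
                                (4, 'Sam',  'Analyst');   -- no sales yet
    INSERT INTO Sale VALUES (101, 1, 3),
                            (102, 1, 2),
                            (103, 2, 10),
                            (104, 3, 7);   -- Employee ID 3 is not in Employee
""")

# The left join from the article: every Employee row is kept.
joined = conn.execute("""
    SELECT e.*, s.Sale_Number, s.Units_Sold
    FROM Employee AS e
    LEFT JOIN Sale AS s ON e.Employee_ID = s.Employee_ID
    ORDER BY e.Employee_ID, s.Sale_Number
""").fetchall()
for row in joined:
    print(row)  # Sam's row ends in (None, None); sale 104 never appears

# The follow-up group by: number of sales and average units per employee.
summary = conn.execute("""
    SELECT e.Employee_ID, e.Name,
           COUNT(s.Sale_Number) AS num_sales,
           AVG(s.Units_Sold)    AS avg_units
    FROM Employee AS e
    LEFT JOIN Sale AS s ON e.Employee_ID = s.Employee_ID
    GROUP BY e.Employee_ID, e.Name
    ORDER BY e.Employee_ID
""").fetchall()
print(summary)
```

Counting `s.Sale_Number` rather than `*` matters here: `COUNT(column)` skips NULLs, so an employee with no sales gets a count of 0 instead of 1.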

Tony Yiu May 18, 2020

DataSource.ai Blogs
