The best data science and machine learning articles. Written by data scientists, for data scientists (and business people).

Data Science

Data-Driven Creativity: Enhancing Video Content through Data Science
In the age of digital marketing and content creation, data-driven creativity is becoming an increasingly important concept. It's the fusion of artistic vision with the insights gleaned from data science to enhance the impact and effectiveness of video content. This blog will explore how data science can be leveraged to elevate video content creation, ensuring that it not only engages but also resonates with the intended audience.

Introduction to Data-Driven Creativity

Data-driven creativity marks a groundbreaking shift in video content creation, blending artistic vision with the insights provided by data science. This combination allows creators to break free from conventional creative limits, using data analytics to develop content that is both visually captivating and strategically significant. By delving into viewer behavior, preferences, and interactions, creators can refine their stories and visuals, achieving a deeper connection with their audience. This technique effectively transforms data into a guide for storytelling, steering content towards increased relevance and attractiveness. Consequently, video content becomes a more potent medium for engaging viewers and delivering impactful messages. Fundamentally, data-driven creativity is about converting data points into compelling stories and turning analytical insights into creative masterpieces, thereby redefining the standards of digital video content.

Understanding the Role of Data in Video Content Creation

Understanding the role of data in video content creation means venturing into the rapidly growing realm of data-driven creativity, where data science emerges as a key instrument in enriching video content. In this realm, data transcends mere figures to become a narrative element, providing rich insights into what audiences prefer, how they behave, and emerging trends.
Utilizing data, video creators can break free from conventional creative constraints, shaping their stories to more deeply connect with viewers. This process involves a detailed examination of viewer interactions, demographics, and feedback to hone storytelling skills, aiming to create videos that are not only watched but also emotionally impactful and memorable. Data-driven creativity is a fusion of art and science, where each view, reaction, and comment plays a role in directing the trajectory of video content, enhancing its relevance, engagement, and effect. This marks a transformative phase in content creation, where data equips creators to weave narratives that are not just creatively rich but also finely tuned to the dynamic preferences and interests of their audience.

The Process of Gathering and Analyzing Data

Collecting and analyzing data forms the foundation of data-driven creativity, especially in the realm of video content enhancement. This process involves the acquisition of key information, including audience demographics, interaction metrics, and performance measures, utilizing sophisticated tools and technologies. These range from social media analytics to advanced data mining applications designed to track a broad spectrum of viewer interactions. Once collected, this data undergoes thorough analysis to identify trends, preferences, and behaviors within the target audience. Such analysis equips content creators with insightful knowledge, allowing them to adjust their video content for greater appeal and connection with their audience. Leveraging these insights, creators can modify elements such as the tone, style, and themes of their content, revolutionizing storytelling methods and ensuring their content is both captivating and impactful.
This integration of data science with creative storytelling heralds a transformative phase in video content production, where analytical findings significantly enhance artistic expression.

Tailoring Content to Audience Preferences

Adapting content to audience preferences through data-driven creativity signifies a vital evolution in video content production. By incorporating data science, creators gain profound insights into audience behaviors, likes, and engagement patterns. This approach facilitates the creation of content that better resonates with viewers, ensuring everything from the plot to visual elements aligns with their interests. Utilizing analytics such as viewing habits and interaction rates, creators can pinpoint which aspects of a video engage viewers most, and a high-quality video editing tool then helps turn those insights into polished output. This knowledge allows precise adjustments, making the content not only captivating but also highly relevant. Ultimately, incorporating data in video content creation leads to more impactful and resonant viewer experiences, forging a deeper bond between the audience and the content.

Enhancing Storytelling with Data Insights

Utilizing data insights to enhance storytelling is a groundbreaking method in video content production. Termed data-driven creativity, this technique blends the storytelling craft with data science accuracy. Content creators leverage analysis of viewer engagement, preferences, and behavior to fine-tune their narratives, ensuring a deeper connection with their audience. This integration results in narratives that are not only engaging but also in tune with audience interests and emerging trends. Insights from data grant a clearer understanding of what truly engages viewers, empowering creators to optimize their storytelling for the greatest effect.
This modern approach reinvents traditional storytelling into an experience that's both more impactful and centered around the audience, with each creative decision being shaped and enriched by data.

Using Data to Predict Future Trends

Utilizing data to predict future trends marks a revolutionary step in improving video content via data science. This technique focuses on analyzing viewer interactions, demographic information, and behavioral tendencies to predict future content direction. Using data enables creators to be proactive, crafting video content that resonates with emerging audience preferences and interests. Such a forward-thinking approach guarantees ongoing relevance in a dynamic digital world and fosters innovation and leadership among content creators. The blend of data analytics and artistic insight leads to the production of videos that are not just captivating but also pioneering, demonstrating the significant role of data in shaping the future of video content creation.

Balancing Creativity and Data

Achieving a harmonious blend of creativity and data in video content production is both subtle and potent. Data-driven creativity embodies the convergence of artistic flair and data analytics, providing an innovative method to boost video effectiveness. By weaving in data analysis, video creators unlock insights into what their audience prefers and how they behave, guiding their artistic choices. This integration results in content that is not only enthralling but also deeply meaningful to viewers. It is essential, however, to ensure that data serves as a guide, not a ruler, in the creative journey. This equilibrium keeps the content fresh and appealing while aligning it thoughtfully with data-driven knowledge.
In essence, data-driven creativity in video content merges the narrative craft with analytical insights, culminating in videos that are both compelling and influential.

Overcoming Challenges in Data-Driven Creativity

Overcoming hurdles in data-driven creativity necessitates a nuanced integration of data science into the creation of video content. It involves striking a delicate balance between analytical methodologies and artistic expression, ensuring that data serves as an informative tool rather than a constraint on creativity. Accurate interpretation of data empowers content creators to avoid formulaic outputs, utilizing insights to enrich storytelling and enhance audience engagement. This intricate process demands a comprehensive understanding of the artistry of video creation and the scientific principles behind data analysis. Ethical considerations, including respecting audience privacy and obtaining data consent, are pivotal in this approach. Innovative strategies within data-driven creativity empower creators to produce content that forges deeper connections with viewers, setting new benchmarks in the digital landscape. Embracing these challenges is essential for unlocking the full potential of data-enhanced video content.

Ethical Considerations in Data-Driven Creativity

In the domain of data-driven creativity, ethical considerations play a crucial role, especially when utilizing data science to enhance video content. While data insights can enhance creative processes, it is essential to address privacy concerns and ensure transparent, responsible data usage. Achieving the right equilibrium between creativity and ethical considerations becomes paramount as brands employ data to customize video content. Upholding user privacy and securing informed consent are fundamental principles in ethical data-driven creativity, fostering trust among audiences.
Moreover, there is an obligation to avoid perpetuating biases and stereotypes in content creation, championing inclusivity and diversity. Ethical practices not only maintain brand integrity but also contribute to a positive and respectful digital environment for consumers.

Tools and Resources for Data-Driven Video Creation

Explore the potential of data-driven creativity using state-of-the-art tools and resources for crafting videos. In the current digital landscape, integrating data science and video content is transforming creative processes. Immerse yourself in a domain where insights derived from data direct every facet of video production. These tools empower creators to customize content according to audience preferences, ensuring that each video is not only visually captivating but also strategically aligned. From scriptwriting informed by analytics to incorporating personalized visual elements, the utilization of data science takes video content to unprecedented levels. Delve into the crossroads of technology and creativity, where strategies driven by data redefine storytelling, captivating audiences in a personalized and meaningful manner.

The Future of Data-Driven Creativity in Video Content

The evolution of data-driven creativity in video content is set to transform our interaction with digital media. Through the incorporation of data science, creators gain valuable insights into viewer preferences, behavior, and trends. This collaboration enables a personalized and captivating viewing experience, heightening audience engagement. With the utilization of data-driven creativity, content producers can shape videos to suit the unique preferences of their target audience, resulting in more impactful storytelling and brand communication. As technology progresses, we anticipate a shift towards highly personalized content, driven by data insights, leading to innovative approaches in video production.
This convergence of creativity and data science holds significant promise for the future development of video content within the digital landscape.

Conclusion

In summary, the convergence of data-driven insights and creative components represents a transformative shift in the realm of video content creation. The fusion of data science and creativity provides content producers with the tools to precisely tailor videos to audience preferences, resulting in more impactful and engaging content. Leveraging the potential of data facilitates a deeper comprehension of viewer behavior, enabling targeted storytelling. Amidst the digital landscape, the symbiosis of data and creativity not only elevates video content but also fosters innovation and personalized experiences. Looking ahead, embracing data-driven creativity becomes crucial for maintaining a leading edge in the continually evolving landscape of video content creation.

Data Science
Machine Learning

Model Evaluation Metrics in Machine Learning
Predictive models have become a trusted advisor to many businesses, and for a good reason. These models can "foresee the future", and there are many different methods available, meaning any industry can find one that fits their particular challenges.

When we talk about predictive models, we are talking about either a regression model (continuous output) or a classification model (nominal or binary output). In classification problems, we use two types of algorithms, depending on the kind of output they create:

Class output: Algorithms like SVM and KNN create a class output. For instance, in a binary classification problem, the outputs will be either 0 or 1. However, today we have algorithms that can convert these class outputs to probability.

Probability output: Algorithms like Logistic Regression, Random Forest, Gradient Boosting, AdaBoost, etc. give probability outputs. Converting probability outputs to class outputs is just a matter of choosing a threshold probability.

Introduction

While data preparation and training a machine learning model are key steps in the machine learning pipeline, it's equally important to measure the performance of this trained model.
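Before moving on, the probability-to-class conversion mentioned above can be sketched in a few lines (the probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical probability outputs from a classifier such as logistic regression.
probs = np.array([0.15, 0.40, 0.55, 0.80, 0.95])

# A threshold converts probability outputs into class outputs.
threshold = 0.5
classes = (probs >= threshold).astype(int)
print(classes.tolist())  # -> [0, 0, 1, 1, 1]
```

Raising or lowering the threshold trades false positives for false negatives, which is exactly the trade-off several of the metrics below make visible.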
How well the model generalizes on unseen data is what defines adaptive vs non-adaptive machine learning models. By using different metrics for performance evaluation, we should be in a position to improve the overall predictive power of our model before we roll it out for production on unseen data.

Without a proper evaluation of the ML model using different metrics, and depending only on accuracy, we can run into problems when the model is deployed on unseen data, resulting in poor predictions. This happens because, in cases like these, our models don't learn but instead memorize; hence, they cannot generalize well on unseen data.

Model Evaluation Metrics

Let us now define the evaluation metrics for evaluating the performance of a machine learning model, which is an integral component of any data science project. The aim is to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.

Confusion Matrix

A confusion matrix is a matrix representation of the prediction results of any binary testing. It is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

Confusion matrix with 2 class labels.

Each prediction can be one of four outcomes, based on how it matches up to the actual value:

True Positive (TP): Predicted True and True in reality.
True Negative (TN): Predicted False and False in reality.
False Positive (FP): Predicted True and False in reality.
False Negative (FN): Predicted False and True in reality.

Now let us understand this concept using hypothesis testing. A hypothesis is speculation or theory based on insufficient evidence that lends itself to further testing and experimentation.
With further testing, a hypothesis can usually be proven true or false. A null hypothesis is a hypothesis that says there is no statistical significance between the two variables in the hypothesis. It is the hypothesis that the researcher is trying to disprove. Ideally, we would reject the null hypothesis when it is false and accept it when it is indeed true.

Even though hypothesis tests are meant to be reliable, there are two types of errors that can occur. These are known as Type I and Type II errors. For example, when examining the effectiveness of a drug, the null hypothesis would be that the drug does not affect a disease.

Type I Error: equivalent to False Positives (FP).

The first kind of error involves the rejection of a null hypothesis that is true. Let's go back to the example of a drug being used to treat a disease. If we reject the null hypothesis in this situation, then we claim that the drug does have some effect on the disease. But if the null hypothesis is true, then, in reality, the drug does not combat the disease at all. The drug is falsely claimed to have a positive effect on the disease.

Type II Error: equivalent to False Negatives (FN).

The other kind of error occurs when we accept a false null hypothesis. This is called a Type II error and is also referred to as an error of the second kind. If we think back again to the scenario in which we are testing a drug, what would a Type II error look like?
A Type II error would occur if we accepted that the drug has no effect on the disease, but in reality, it did.

A sample Python implementation of the confusion matrix:

```python
import warnings
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# ignore warnings
warnings.filterwarnings('ignore')

# Load the iris dataset (the file has no header row)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url, header=None)
X = df.iloc[:, 0:4]
y = df.iloc[:, 4]

# test size
test_size = 0.33
# generate the same set of random numbers
seed = 7

# Split data into train and test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Construct the confusion matrix
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
cm = confusion_matrix(y_test, pred, labels=labels)
print(cm)

# Visualize the matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm)
plt.title('Confusion matrix')
fig.colorbar(cax)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()
```

Confusion matrix with 3 class labels.

The diagonal elements represent the number of points for which the predicted label is equal to the true label, while anything off the diagonal was mislabeled by the classifier. Therefore, the higher the diagonal values of the confusion matrix the better, indicating many correct predictions.

In our case, the classifier predicted all 13 setosa and 18 virginica plants in the test data perfectly. However, it incorrectly classified 4 of the versicolor plants as virginica.

There is also a list of rates that are often computed from a confusion matrix for a binary classifier:

1. Accuracy

Overall, how often is the classifier correct?

Accuracy = (TP + TN) / total

When our classes are roughly equal in size, we can use accuracy, which will give us correctly classified values. Accuracy is a common evaluation metric for classification problems: it's the number of correct predictions made as a ratio of all predictions made.

Misclassification Rate (Error Rate): Overall, how often is the classifier wrong? Since accuracy is the percentage we correctly classified (the success rate), it follows that our error rate (the percentage we got wrong) can be calculated as follows:

Misclassification Rate = (FP + FN) / total

We use the sklearn module to compute the accuracy of a classification task, as shown below.

```python
# import modules
import warnings
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

# ignore warnings
warnings.filterwarnings('ignore')

# Load iris dataset
iris = datasets.load_iris()
X = iris.data    # feature matrix
y = iris.target  # target vector

test_size = 0.33
seed = 7

# cross-validation settings (shuffle so random_state takes effect)
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)

model = LogisticRegression()

# Evaluate model performance across the cross-validation folds
scoring = 'accuracy'
results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print('Accuracy - val set: %.2f%% (%.2f)' % (results.mean() * 100, results.std()))

# split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)

# fit model and score it on the held-out test set
model.fit(X_train, y_train)
result = model.score(X_test, y_test)
print("Accuracy - test set: %.2f%%" % (result * 100.0))
```

The classification accuracy is 88% on the validation set.

2. Precision

When it predicts yes, how often is it correct?

Precision = TP / predicted yes

When we have a class imbalance, accuracy can become an unreliable metric for measuring our performance. For instance, if we had a 99/1 split between two classes, A and B, where the rare event, B, is our positive class, we could build a model that was 99% accurate by just saying everything belonged to class A. Clearly, we shouldn't bother building a model if it doesn't do anything to identify class B; thus, we need different metrics that will discourage this behavior. For this, we use precision and recall instead of accuracy.

3. Recall or Sensitivity

When it's actually yes, how often does it predict yes?

True Positive Rate = TP / actual yes

Recall gives us the true positive rate (TPR), which is the ratio of true positives to everything positive.

In the case of the 99/1 split between classes A and B, the model that classifies everything as A would have a recall of 0% for the positive class, B (and precision would be undefined, 0/0). Precision and recall provide a better way of evaluating model performance in the face of a class imbalance. They will correctly tell us that the model has little value for our use case.

Just like accuracy, both precision and recall are easy to compute and understand, but they require thresholds. Besides, precision and recall only consider half of the confusion matrix:

4. F1 Score

The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0.

Why the harmonic mean? Since the harmonic mean of a list of numbers skews strongly toward the smallest elements of the list, it tends (compared to the arithmetic mean) to mitigate the impact of large outliers and aggravate the impact of small ones. An F1 score punishes extreme values more.
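To see why the harmonic mean matters, compare the two means when precision and recall are far apart (the numbers below are made up to illustrate the skew):

```python
# Precision and recall for a hypothetical, badly skewed classifier.
precision, recall = 1.0, 0.01

arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(arithmetic_mean, 3))  # -> 0.505
print(round(f1, 3))               # -> 0.02
```

The arithmetic mean still looks respectable, while the F1 score correctly flags the model as nearly useless.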
Ideally, an F1 score can be an effective evaluation metric in the following classification scenarios:

When FP and FN are equally costly, i.e. missing a true positive and raising a false positive hurt the model in much the same way, as in our cancer detection classification example.
When adding more data doesn't effectively change the outcome.
When TN is high (as with flood predictions, cancer predictions, etc.).

A sample Python implementation of the F1 score:

```python
import warnings
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

warnings.filterwarnings('ignore')

# Load the Pima Indians diabetes dataset (no header row)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:, :-1]
y = dat[:, -1]

test_size = 0.33
seed = 7
model = LogisticRegression()

# split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)
pred = model.predict(X_test)  # this line was missing from the original listing

# precision: tp / (tp + fp)
precision = precision_score(y_test, pred)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(y_test, pred)
print('Recall: %f' % recall)
# f1: 2*tp / (2*tp + fp + fn)
f1 = f1_score(y_test, pred)
print('F1 score: %f' % f1)
```

5. Specificity

When it's actually no, how often does it predict no?

True Negative Rate = TN / actual no

It is the true negative rate, or the proportion of true negatives to everything that should have been classified as negative. Note that, together, specificity and sensitivity consider the full confusion matrix:

6. Receiver Operating Characteristic (ROC) Curve

Measuring the area under the ROC curve is also a very useful method for evaluating a model.
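Concretely, each point on a ROC curve comes from one probability threshold. A minimal sketch of computing the true and false positive rates at a single threshold (the labels and scores below are made up):

```python
import numpy as np

# Made-up true labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.6, 0.2, 0.8, 0.4, 0.9, 0.7])

threshold = 0.5
pred = (scores >= threshold).astype(int)

tp = np.sum((pred == 1) & (y_true == 1))
fp = np.sum((pred == 1) & (y_true == 0))
fn = np.sum((pred == 0) & (y_true == 1))
tn = np.sum((pred == 0) & (y_true == 0))

tpr = tp / (tp + fn)  # sensitivity / recall
fpr = fp / (fp + tn)  # 1 - specificity
print(tpr, fpr)  # -> 0.75 0.25
```

Sweeping the threshold from 1 down to 0 traces out the full curve.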
By plotting the true positive rate (sensitivity) versus the false positive rate (1 - specificity), we get the Receiver Operating Characteristic (ROC) curve. This curve allows us to visualize the trade-off between the true positive rate and the false positive rate.

The following are examples of good ROC curves. The dashed line would be random guessing (no predictive value) and is used as a baseline; anything below that is considered worse than guessing. We want to be toward the top-left corner.

A sample Python implementation of ROC curves:

```python
# Classification Area Under Curve
import warnings
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:, :-1]
y = dat[:, -1]

test_size = 0.33  # this definition was missing from the original listing
seed = 7
model = LogisticRegression()  # so was this one

# split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)

# predict probabilities and keep those for the positive outcome only
probs = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, probs)
print('AUC - Test Set: %.2f%%' % (auc * 100))

# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
# plot no skill
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
plt.xlabel('False positive rate')
plt.ylabel('Sensitivity / Recall')
# show the plot
plt.show()
```

In the example above, the AUC is relatively close to 1 and greater than 0.5.
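AUC also has a handy probabilistic reading: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A quick check on made-up labels and scores:

```python
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.2, 0.5, 0.3, 0.4, 0.9])

# Compare every (positive, negative) pair; ties would count as half.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p, n in product(pos, neg)]

print(np.mean(pairs))              # pairwise estimate (5 of 6 pairs ordered correctly)
print(roc_auc_score(y_true, scores))  # the same value from sklearn
```

This interpretation is why AUC is threshold-free: it only depends on how the scores rank the examples.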
A perfect classifier's ROC curve goes straight up the Y-axis and then along the X-axis.

Log Loss

Log loss is the most important classification metric based on probabilities. It measures the performance of a classification model where the prediction input is a probability value between 0 and 1. As the predicted probability of the true class gets closer to zero, the loss increases exponentially; log loss increases as the predicted probability diverges from the actual label. The goal of any machine learning model is to minimize this value. As such, smaller log loss is better, with a perfect model having a log loss of 0.

A sample Python implementation of log loss:

```python
# Classification Log Loss
import warnings
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

warnings.filterwarnings('ignore')

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:, :-1]
y = dat[:, -1]

test_size = 0.33  # this definition was missing from the original listing
seed = 7
model = LogisticRegression()  # so was this one

# split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)
model.fit(X_train, y_train)

# Log loss is defined on probabilities, not hard class predictions,
# so score the predicted probabilities here.
probs = model.predict_proba(X_test)
print("Logloss: %.2f" % log_loss(y_test, probs))
```
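For a single binary prediction, the formula is simply -(y*log(p) + (1-y)*log(1-p)), averaged over all examples. A sketch comparing a hand-rolled version with sklearn (labels and probabilities are made up):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.3])  # predicted probability of class 1

# Mean negative log-likelihood of the true labels.
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(manual)
print(log_loss(y_true, p))  # matches the manual computation
```

Note how the last example (true label 1, predicted probability 0.3) contributes -log(0.3) ≈ 1.20, far more than the confident correct predictions; that is the "blow-up near zero" behavior described above.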
Jaccard Index

The Jaccard index is one of the simplest ways to calculate the accuracy of a classification ML model. Let's understand it with an example. Suppose we have a labeled test set, with labels:

y = [0,0,0,0,0,1,1,1,1,1]

And our model has predicted the labels:

y1 = [1,1,0,0,0,1,1,1,1,1]

The above Venn diagram shows us the labels of the test set and the labels of the predictions, and their intersection and union.

The Jaccard index, or Jaccard similarity coefficient, is a statistic used in understanding the similarities between sample sets. The measurement emphasizes the similarity between finite sample sets and is formally defined as the size of the intersection divided by the size of the union of the two labeled sets:

J(A, B) = |A ∩ B| / |A ∪ B|

Jaccard Index, or Intersection over Union (IoU)

So, for our example, we can see that the intersection of the two sets is equal to 8 (since eight values are predicted correctly) and the union is 10 + 10 - 8 = 12. The Jaccard index therefore gives us an accuracy of 8 / 12 = 0.66, or 66%. The higher the Jaccard index, the higher the accuracy of the classifier.

A sample Python implementation of the Jaccard index:

```python
import numpy as np

def compute_jaccard_similarity_score(x, y):
    # size of the intersection divided by the size of the union
    intersection_cardinality = len(set(x).intersection(set(y)))
    union_cardinality = len(set(x).union(set(y)))
    return intersection_cardinality / float(union_cardinality)

score = compute_jaccard_similarity_score(
    np.array([0, 1, 2, 5, 6]), np.array([0, 2, 3, 5, 7, 9]))
print("Jaccard Similarity Score : %s" % score)
```

Jaccard Similarity Score : 0.375

Note that this implementation compares the two arrays as sets of values, ignoring positions; to score predictions element-wise against true labels, as in the worked example above, use sklearn.metrics.jaccard_score.

Kolmogorov-Smirnov Chart

The K-S or Kolmogorov-Smirnov chart measures the performance of classification models. More accurately, K-S is a measure of the degree of separation between the positive and negative distributions. The cumulative frequency for the observed and hypothesized distributions is plotted against the ordered frequencies.
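That maximal vertical gap between the two cumulative distributions is the K-S statistic itself. A sketch using SciPy's two-sample test on made-up model scores, where positives tend to score higher than negatives:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Made-up score distributions for the two classes.
scores_pos = rng.normal(0.7, 0.1, 500)
scores_neg = rng.normal(0.3, 0.1, 500)

stat, p_value = ks_2samp(scores_pos, scores_neg)
print(stat)     # close to 1: the two distributions are well separated
print(p_value)  # tiny: the separation is statistically significant
```

A statistic near 1 (or 100, on the percentage scale used below) means the scores separate the classes almost perfectly.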
The vertical double arrow indicates the maximal vertical difference.

The K-S is 100 if the scores partition the population into two separate groups, one containing all the positives and the other all the negatives. On the other hand, if the model cannot differentiate between positives and negatives, then it is as if the model selects cases randomly from the population, and the K-S would be 0. In most classification models the K-S will fall between 0 and 100; the higher the value, the better the model is at separating the positive from the negative cases.

The K-S may also be used to test whether two underlying one-dimensional probability distributions differ. It is a very efficient way to determine if two samples are significantly different from each other.

A sample Python implementation of the Kolmogorov-Smirnov test:

```python
from scipy.stats import kstest
import random

# N = int(input("Enter number of random numbers: "))
N = 10
actual = []
for i in range(N):
    # x = float(input("Outcomes of class " + str(i + 1) + ": "))
    actual.append(random.random())
print(actual)

# Null hypothesis: the sample follows a standard normal distribution
x = kstest(actual, "norm")
print(x)
```

The null hypothesis used here assumes that the numbers follow the normal distribution. The test returns the statistic and a p-value; if the p-value is less than alpha, we reject the null hypothesis. Alpha is defined as the probability of rejecting the null hypothesis given that the null hypothesis (H0) is true. For most practical applications, alpha is chosen as 0.05.

Gain and Lift Chart

Gain or lift is a measure of the effectiveness of a classification model, calculated as the ratio between the results obtained with and without the model. Gain and lift charts are visual aids for evaluating the performance of classification models. However, in contrast to the confusion matrix, which evaluates the model on the whole population, a gain or lift chart evaluates model performance on a portion of the population. The higher the lift (i.e.
the further up it is from the baseline), the better the model.

The following gains chart, run on a validation set, shows that with 50% of the data, the model captures 90% of the targets; adding more data yields a negligible increase in the percentage of targets included.

Gain/lift chart

Lift charts are often shown as cumulative lift charts, which are also known as gains charts. Therefore, gains charts are sometimes (perhaps confusingly) called "lift charts", but they are more accurately cumulative lift charts. One of their most common uses is in marketing, to decide whether a prospective client is worth calling.

Gini Coefficient

The Gini coefficient, or Gini index, is a popular metric for imbalanced class values. The coefficient ranges from 0 to 1, where 0 represents perfect equality and 1 represents perfect inequality; the higher the index, the more dispersed the data. The Gini coefficient can be computed from the area under the ROC curve using the following formula:

Gini Coefficient = (2 * AUC) - 1

Conclusion

Understanding how well a machine learning model is going to perform on unseen data is the ultimate purpose behind working with these evaluation metrics. Metrics like accuracy, precision, and recall are good ways to evaluate classification models for balanced datasets, but if the data is imbalanced and there's a class disparity, then other methods like ROC/AUC and the Gini coefficient perform better in evaluating model performance.

Well, this concludes this article. I hope you have enjoyed reading it; feel free to share your comments, thoughts, and feedback in the comment section.

Thanks for reading!

Data Science

DataSource AI Hosts KTM AG's Inaugural AI Challenge: "Code the Light Fantastic"
DataSource AI announces the launch of the KTM AG inaugural AI Challenge, an unprecedented 3-month online competition that aims to revolutionise two-wheeler innovation through artificial intelligence and deep learning.

KTM AG is a global frontrunner in two-wheeler innovation, known for pushing the boundaries of what's possible in the world of motorcycles. With a rich history of groundbreaking engineering and a commitment to cutting-edge technology, KTM AG has set new standards in performance, design, and safety, and now invites participants to embark on this innovation journey.

At the core of this competition lies a challenge set to redefine the future of motorcycle lighting systems. Participants are tasked with developing an algorithm for a high-beam lighting system utilizing a pixel matrix. Participants can find detailed guidelines in the Datathon competition. The datathon unfolds in a 3-tiered cascade model:

This Code Challenge by KTM AG promises not only substantial rewards but also an exciting opportunity to shape the future of two-wheeler technology, while supporting participants in upskilling and testing their knowledge in a global AI competition. The cumulative budget for this remarkable Code Challenge is a substantial €24,000, motivating participants with not only the opportunity to push the boundaries of two-wheeler technology but also significant rewards for those who rise to the occasion. With cumulative prizes, contestants have the chance to take home a maximum reward of €10,800, in addition to contributing to cutting-edge advancements in the field. We invite all aspiring innovators, data scientists, and AI enthusiasts to join us in this journey to "Code the Light Fantastic."
For more information, rules, and registration details, please register here.

About DataSource: At DataSource AI, we are driven by a singular mission - to democratise the immense power of data science and AI/ML for businesses of all sizes and budgets. We facilitate AI competitions by harnessing our extensive data expert community, which collaborates on our intelligent AI algorithm crowdsourcing platform. Our community is at the heart of what we do. We've built a diverse and talented pool of data experts who are passionate about solving real-world problems. They collaborate, ideate, and innovate, driving forward the frontiers of data science.

Data Science

Top 2 Online Data Science Courses to Improve your Career in 2023
The discipline of Data Science is expanding quickly and holds enormous promise. It is used in various sectors, including manufacturing, retail, healthcare, and finance. Today, a wide variety of online Data Science courses are accessible, and with so many choices, you might need help picking the best one. This article will summarize the best Data Science programs so you can choose the one that's right for you.

What Is a Data Science Course?

A Data Science course teaches novices the theoretical ideas of data science. Additionally, you'll learn about the steps involved in Data Science, such as mathematical and statistical analysis, data preparation and staging, data interpretation, data visualization, and methods for presenting data insights in an organizational context. Advanced subjects, such as employing neural networks to develop recommendation engines, are covered in more specialized courses.

Why Data Science?

A data scientist holds one of the best jobs in the expanding field of data science. Data science gained popularity and started to be utilised in a growing number of applications when big data appeared and the necessity to manage these massive volumes of data arose. Data science, which enables companies to derive conclusions from data and take measures based on those conclusions, is one of the primary applications of artificial intelligence. Data Scientists are in great demand due to Data Science's importance for all industries. There is fierce rivalry everywhere, but if you can get an advantage over your competitors, you may land lucrative positions that are in demand. Taking data science courses online might give you that advantage.

Data science involves:
● Analytical capabilities.
● A foundational understanding of the field.
● Practical abilities to produce outcomes.

To understand data science, you don't need to spend years working with big data or have a tonne of expertise in the software sector.
You may always study from the greatest online Data Science courses and create a way to join this area while working. These are the top data science programs you can take to further your career and understand the subject. Let's discover more about the top data science courses available online.

2 Online Data Science Courses for 2023 to Advance Your Career

1. Program for Business Analytics Certification

This online Data Science course lasts three months and calls for 8 to 10 hours of study per week. Created for analytics aspirants, it is one of the greatest data science courses in India and one of the market's top data science courses, with more than 100 hours of material. The course was built with the help of business professionals from organizations like Flipkart, Gardener, and Actify. This is one of the finest online courses for learning the fundamentals of data science, since it offers committed mentor assistance, prompt doubt-resolution services, and live sessions with subject matter specialists. Students will gain knowledge in statistics, optimization, business problem-solving, and predictive modeling via this course. It was created for managers, engineers, recent graduates, software and IT workers, and marketing and salespeople.

Students will concentrate on corporate problem-solving, insights, and narrative for the first 3 weeks of the course. In this portion, you will discover how to formulate hypotheses, comprehend business issues, and concentrate on narrative. The following four weeks will be devoted to understanding statistics, optimization, and exploratory data analysis; a case study assignment will also be included. The last five weeks will be devoted to predictive analysis, where you will study several machine learning approaches to evaluate data and derive insights.
There will be three industry-level projects: the Uber supply-demand gap, customer creditworthiness, and market mix modeling for e-commerce. Students who take this business analyst course have access to various options, including the ability to apply for managerial, business analyst, and data analyst employment.

2. Data Science Master's Degree

This is among the top online courses for Data Science. This master's in Data Science program lasts 18 months and is delivered online. If you engage in expert online data science courses from a recognised and trusted provider, you can put your talents to the test on real assignments. This course offers a variety of distinctive characteristics, with several specialization options available: business intelligence/data analytics, Natural Language Processing, Deep Learning, a Business Analyst track, and Data Engineering. In addition to these specializations, the course offers its students a platform to study more than 14 programming languages and technologies used in the diverse area of data science, as well as industry mentoring and committed career assistance. One of the greatest data science courses in India, it includes tools like Python, Tableau, Hadoop, MySQL, Hive, Excel, PowerBI, MongoDB, Shiny, Keras, TensorFlow, PySpark, HBase, and Apache Airflow.

More than 400 hours of learning material are planned for the online data science course. You will get a thorough understanding of data science and related topics through these videos and publications, enabling you to succeed in any data science interview. So that students gain a practical, hands-on understanding of all the tools and ideas covered in the course, the online data course includes more than ten industrial projects and case studies. Students may study various subjects, languages, and tools throughout the course.
The first four weeks will be devoted to learning the fundamentals of Python and how to use Excel to deal with data. The next 11 weeks will be devoted to teaching students how to utilize all the tools needed for data science and how to prepare and work with the provided data; you will acquire in-depth knowledge of Python, Excel, and SQL in this part. Learning about machine learning and its many algorithms will be the main emphasis of the following nine weeks.

Conclusion

The best online Data Science courses provide a good introduction to the subject. They go through the fundamentals of data science, such as handling data, cleansing data, and doing statistical analysis, and they also provide a more thorough examination of Data Science and machine learning. Anyone interested in pursuing a career in data science should take these courses.

Data Science

5 Tips To Ace Your Job Interview For A Data Scientist Opening
Image Source

Aspiring data scientists have a bright future ahead of them. They're about to enter a field that's exponentially expanding in terms of job growth and career opportunities. Reports say that the sector has seen a 650% job growth since 2012 and, according to predictions, there will be an estimated 11.5 million new jobs by 2026. All that's left for data scientist hopefuls is to develop their skills and ace their job interviews. While that may be the most daunting part, we're here to give you five tips on how to impress your interviewer and grab that opportunity.
#1- Prepare answers for potential interview questions
In our list of 10 highly probable data scientist interview questions, we highlight some of the most asked questions that you’ll want to prepare for. These include situations related to machine learning, Python, and SQL. For example, you could get asked about the difference between classification and clustering, or the important features of dictionaries. While these questions can depend on your interviewer and the company you’re trying for, it won’t hurt to have prepared answers for these basic questions. Brush up on these topics and do your own research.
#2- Recall your technical abilities
Companies often have a separate technical screening portion prepared for you, but it would also be helpful to run over your technical abilities. This also depends on the specifications of the position you’re applying for; as a data scientist, they might inquire about your efficiency in developing algorithms for the collation and cleaning of datasets. Your interviewer could ask if you’ve had the chance to create an original algorithm of your own. If you’ve done any data projects, they might also inquire about the challenges you faced and how you were able to deal with them.
#3- Communicate your strengths
It would be helpful if you could confidently explain and articulate what kind of data scientist you are. To do this, you must know where your strengths lie and what your niche is. In any job interview, companies often ask about an applicant’s strengths because the way they approach this question says a lot about them. Think about what you could contribute to a team and what type of role you see yourself thriving in. Then, figure out a way to communicate why you think your unique strengths are an asset to the company.
#4- When prompted, ask questions of your own
Interviewers love when an applicant shows engagement and interest in the company. Throughout the process, jot down the questions that might come to you, and don’t be afraid to ask away when prompted. The questions you ask in an interview could be a chance for you to learn more about your potential employer and the work environment. You could ask them simple questions like, “what is the most enjoyable part of working here?”, or “what are the company’s goals over the next few years?” This shows them your passion and dedication for the role.
#5- Stay updated on trends in the data science space
The data science industry is always changing, and there’s always something new to learn every day. If you want to gain an edge against your competitors, make sure you’re on top of the latest data science trends and news. One way to do so is to always be on the lookout for upcoming data science conferences and seminars you can attend. Attending events could earn you connections and help you learn things you won’t find in textbooks. You could also do some supplementary reading of the latest research papers. This will show your interviewers that you’re a motivated self-starter.With time, effort, and these five tips in mind, you’ll be ready to answer any question thrown at you. Interviews are just the first step towards the career of your dreams, so make sure you prepare for every opportunity presented to you.

Programming
Data Science

6 Advanced Statistical Concepts in Data Science
This article covers some of the most commonly used advanced statistical concepts along with their Python implementations.

In my previous articles, Beginners Guide to Statistics in Data Science and The Inferential Statistics Data Scientists Should Know, we covered almost all the basics (descriptive and inferential) of statistics that are commonly used in understanding and working with any data science case study. In this article, let's go a little beyond and talk about some advanced concepts which are not part of the buzz.

Concept #1 - Q-Q (Quantile-Quantile) Plots

Before understanding Q-Q plots, first understand what a quantile is: a quantile defines a particular part of a data set, i.e. a quantile determines how many values in a distribution are above or below a certain limit. Special quantiles are the quartile (quarter), the quintile (fifth), and percentiles (hundredth).

An example: if we divide a distribution into four equal portions, we speak of four quartiles. The first quartile includes all values that are smaller than a quarter of all values. In a graphical representation, it corresponds to 25% of the total area of the distribution. The two lower quartiles comprise 50% of all distribution values. The interquartile range between the first and third quartiles equals the range in which 50% of all values lie, distributed around the mean.

In statistics, a Q-Q (quantile-quantile) plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that's roughly straight (y = x).

Q-Q plot

For example, the median is a quantile where 50% of the data fall below that point and 50% lie above it. The purpose of Q-Q plots is to find out whether two sets of data come from the same distribution.
A 45-degree reference line is plotted on the Q-Q plot; if the two data sets come from a common distribution, the points will fall on that reference line.

It's very important to know whether a distribution is normal or not in order to apply various statistical measures to the data and interpret it in much more human-understandable visualizations, and here the Q-Q plot comes into the picture. The most fundamental question answered by a Q-Q plot is whether the curve is normally distributed or not.

Normally distributed, but why? Q-Q plots can be used to find the type of distribution of a random variable, whether it is a Gaussian distribution, uniform distribution, exponential distribution, or even a Pareto distribution, etc. You can tell the type of distribution just by looking at the plot. In general, we talk about normal distributions because of the very useful 68-95-99.7 rule, which fits the normal distribution perfectly, so we know how much of the data lies within one, two, and three standard deviations from the mean. Knowing whether a distribution is normal thus opens up new doors for us to experiment with.

Types of Q-Q plots. Source

Skewed Q-Q plots

Q-Q plots can reveal skewness (a measure of asymmetry) of the distribution.
If the bottom end of the Q-Q plot deviates from the straight line but the upper end does not, the distribution is left-skewed (negatively skewed). If the upper end of the Q-Q plot deviates from the straight line and the lower end does not, the distribution is right-skewed (positively skewed).

Tailed Q-Q plots

Q-Q plots can also reveal kurtosis (a measure of tailedness) of the distribution. A fat-tailed distribution will have both ends of the Q-Q plot deviating from the straight line while its center follows the line, whereas a thin-tailed distribution will produce a Q-Q plot with little or negligible deviation at the ends, making it a close fit to the normal distribution.

Q-Q Plots in Python (Source)

Suppose we have the following dataset of 1,000 values:

import numpy as np
#create dataset with 1,000 values that follow a normal distribution
np.random.seed(0)
data = np.random.normal(0,1, 1000)
#view first 10 values
data[:10]

# array([ 1.76405235,  0.40015721,  0.97873798,  2.2408932 ,  1.86755799,
#        -0.97727788,  0.95008842, -0.15135721, -0.10321885,  0.4105985 ])

To create a Q-Q plot for this dataset, we can use the qqplot() function from the statsmodels library:

import statsmodels.api as sm
import matplotlib.pyplot as plt
#create Q-Q plot with 45-degree line added to plot
fig = sm.qqplot(data, line='45')
plt.show()

In a Q-Q plot, the x-axis displays the theoretical quantiles. This means it doesn't show your actual data, but instead represents where your data would be if it were normally distributed. The y-axis displays your actual data. If the data values fall along a roughly straight line at a 45-degree angle, the data is normally distributed.

We can see in our Q-Q plot above that the data values tend to closely follow the 45-degree line, which means the data is likely normally distributed. This shouldn't be surprising, since we generated the 1,000 data values using the numpy.random.normal() function.

Consider instead if we generated a dataset of 1,000 uniformly distributed values and created a Q-Q plot for that dataset:

#create dataset of 1,000 uniformly distributed values
data = np.random.uniform(0,1, 1000)
#generate Q-Q plot for the dataset
fig = sm.qqplot(data, line='45')
plt.show()

The data values clearly do not follow the red 45-degree line, which is an indication that they do not follow a normal distribution.

Concept #2 - Chebyshev's Inequality

In probability, Chebyshev's inequality, also known as the Bienayme-Chebyshev inequality, guarantees that, for a wide class of probability distributions, only a definite fraction of values will be found within a specific distance from the mean of the distribution.

Source: https://www.thoughtco.com/chebyshevs-inequality-3126547

Chebyshev's inequality is similar to the empirical rule (68-95-99.7); however, the latter only applies to normal distributions. Chebyshev's inequality is broader: it can be applied to any distribution, so long as the distribution has a defined variance and mean. Chebyshev's inequality says that at least 1 - 1/k^2 of the data from a sample must fall within k standard deviations of the mean (or, equivalently, no more than 1/k^2 of the distribution's values can be more than k standard deviations away from the mean), where k is any positive real number.

If the data is not normally distributed, then different amounts of data could lie within one standard deviation. Chebyshev's inequality provides a way to know what fraction of data falls within k standard deviations of the mean for any data distribution.

Also read: 22 Statistics Questions to Prepare for Data Science Interviews

Credits: https://calcworkshop.com/joint-probability-distribution/chebyshev-inequality/

Chebyshev's inequality is of great value because it can be applied to any probability distribution for which the mean and variance are provided.

Let us consider an example: assume 1,000 contestants show up for a job interview, but there are only 70 positions available. In order to select the finest 70 contestants, the proprietor gives tests to judge their potential. The mean score on the test is 60, with a standard deviation of 6.
If an applicant scores an 84, can they presume that they are getting the job? A score of 84 is k = (84 - 60)/6 = 4 standard deviations above the mean. By Chebyshev's inequality, no more than 1/k^2 = 1/16 of the 1,000 contestants, i.e. at most about 63 people, can score that far from the mean, so with 70 positions available, a contestant who scores an 84 can be reasonably assured of getting the job.

Chebyshev's Inequality in Python (Source)

Create a population of 1,000,000 values. I use a gamma distribution (this also works with other distributions) with shape = 2 and scale = 2.

import numpy as np
import random
import matplotlib.pyplot as plt
#create a population with a gamma distribution
shape, scale = 2., 2. #mean=4, std=2*sqrt(2)
mu = shape*scale #mean and standard deviation
sigma = scale*np.sqrt(shape)
s = np.random.gamma(shape, scale, 1000000)

Now sample 10,000 values from the population.

#sample 10000 values
rs = random.choices(s, k=10000)

Count the samples whose distance from the expected value is larger than k standard deviations, and use the count to calculate the probabilities. I want to depict the trend of these probabilities as k increases, so I use a range of k from 0.1 to 3.

#set k
ks = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

#probability list
probs = []

#for each k
for k in ks:
    #start count
    c = 0
    for i in rs:
        #count if farther from the mean than k standard deviations
        if abs(i - mu) > k * sigma:
            c += 1
    probs.append(c / 10000)

Plot the results:

plot = plt.figure(figsize=(20,10))
#plot each probability
plt.xlabel('K')
plt.ylabel('probability')
plt.plot(ks, probs, marker='o')
plot.show()

#print each probability
print("Probability of a sample far from mean more than k standard deviation:")
for i, prob in enumerate(probs):
    print("k:" + str(ks[i]) + ", probability: " \
          + str(prob)[0:5] + \
          " | in theory, probability should be less than: " \
          + str(1/ks[i]**2)[0:5])

From the above plot and result, we can see that as k increases, the probability decreases, and the probability for each k obeys the inequality. Moreover, only the case where k is larger than 1 is useful: if k is less than 1, the right-hand side of the inequality is larger than 1, which tells us nothing, because a probability cannot be larger than 1.

Concept #3 - Log-Normal Distribution

In probability theory, a log-normal distribution, also known as Galton's distribution, is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Equivalently, if Y has a normal distribution, then the exponential function of Y, i.e. X = exp(Y), has a log-normal distribution. Skewed distributions with low mean, high variance, and all-positive values fit under this type of distribution. A random variable that is log-normally distributed takes only positive real values.

The general formula for the probability density function of the lognormal distribution is:

f(x) = exp(-(ln((x - θ)/m))^2 / (2σ^2)) / ((x - θ) * σ * sqrt(2π)), for x > θ

The shape of the lognormal distribution is defined by 3 parameters:

σ is the shape parameter (and is the standard deviation of the log of the distribution)
θ or μ is the location parameter
m is the scale parameter (and is also the median of the distribution)

The location and scale parameters are equivalent to the mean and standard deviation of the logarithm of the random variable. If x = θ, then f(x) = 0. The case where θ = 0 and m = 1 is called the standard lognormal distribution.
The case where θ equals zero is called the 2-parameter lognormal distribution.

The following graph illustrates the effect of the location (μ) and scale (σ) parameters on the probability density function of the lognormal distribution:

Source: https://www.sciencedirect.com/topics/mathematics/lognormal-distribution

Log-Normal Distribution in Python (Source)

Let us consider an example that generates random numbers from a log-normal distribution with shape σ = 0.5 using the scipy.stats.lognorm function (in SciPy's parametrization, s is the shape σ, scale equals exp(μ), and loc shifts the distribution):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import lognorm
np.random.seed(42)
data = lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000)
plt.figure(figsize=(10,6))
ax = plt.subplot(111)
plt.title('Generate random numbers from a log-normal distribution')
ax.hist(data, bins=np.logspace(0,5,200), density=True)
ax.set_xscale("log")
shape,loc,scale = lognorm.fit(data)
x = np.logspace(0, 5, 200)
pdf = lognorm.pdf(x, shape, loc, scale)
ax.plot(x, pdf, 'y')
plt.show()

Concept #4 - Power Law Distribution

In statistics, a power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities: one quantity varies as a power of another. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four.

A power law distribution has the form Y = k * X^α, where:

X and Y are variables of interest,
α is the law's exponent,
k is a constant.

Source: https://en.wikipedia.org/wiki/Power_law

The power-law distribution is just one of many probability distributions, but it is considered a valuable tool for assessing uncertainty issues that the normal distribution cannot handle when they occur at a certain probability. Many processes have been found to follow power laws over substantial ranges of values: the distribution of incomes, sizes of meteoroids, earthquake magnitudes, the spectral density of weight matrices in deep neural networks, word usage, the number of neighbors in various networks, etc. (Note: the power law here is a continuous distribution. The last two examples are discrete, but on a large scale can be modeled as if continuous.)

Also read: Statistical Measures of Central Tendency

Power-Law Distribution in Python (Source)

Let us plot the Pareto distribution, which is one form of a power-law probability distribution. The Pareto distribution is sometimes known as the Pareto principle or the '80-20' rule, which states that 80% of society's wealth is held by 20% of its population. The Pareto distribution is not a law of nature but an observation; it is useful in many real-world problems. It is a skewed, heavy-tailed distribution.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto
x_m = 1 #scale
alpha = [1, 2, 3] #list of values of shape parameters
plt.figure(figsize=(10,6))
samples = np.linspace(start=0, stop=5, num=1000)
for a in alpha:
    output = np.array([pareto.pdf(x=samples, b=a, loc=0, scale=x_m)])
    plt.plot(samples, output.T, label='alpha {0}'.format(a))
plt.xlabel('samples', fontsize=15)
plt.ylabel('PDF', fontsize=15)
plt.title('Probability Density function', fontsize=15)
plt.legend(loc='best')
plt.show()

Concept #5 - Box-Cox Transformation

The Box-Cox transformation transforms our data so that it closely resembles a normal distribution. In many statistical techniques, we assume that the errors are normally distributed. This assumption allows us to construct confidence intervals and conduct hypothesis tests. By transforming the target variable, we can (hopefully) normalize our errors (if they are not already normal). Additionally, transforming our variables can improve the predictive power of our models because transformations can cut away white noise.

Original distribution (left) and near-normal distribution after applying the Box-Cox transformation. Source

At the core of the Box-Cox transformation is an exponent, lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; the "optimal value" is the one that results in the best approximation of a normal distribution curve. The one-parameter Box-Cox transformation is defined as:

y(λ) = (y^λ - 1) / λ   if λ ≠ 0
y(λ) = ln(y)           if λ = 0

and the two-parameter Box-Cox transformation as:

y(λ) = ((y + λ2)^λ1 - 1) / λ1   if λ1 ≠ 0
y(λ) = ln(y + λ2)               if λ1 = 0

Moreover, the one-parameter Box-Cox transformation holds only for y > 0, i.e. only for positive values, while the two-parameter version holds for y > -λ2, i.e. it accommodates negative values. The parameter λ is estimated using the profile likelihood function and goodness-of-fit tests.

As for drawbacks of the Box-Cox transformation: if interpretation is what you want to do, then Box-Cox is not recommended, because if λ is some non-zero number, the transformed target variable may be more difficult to interpret than if we had simply applied a log transform. A second stumbling block is that the Box-Cox transformation usually gives the median of the forecast distribution when we revert the transformed data to its original scale; occasionally, we want the mean and not the median.

Box-Cox Transformation in Python (Source)

SciPy's stats package provides a function called boxcox for performing the Box-Cox power transformation. It takes the original non-normal data as input and returns the fitted data along with the lambda value that was used to fit the non-normal distribution to a normal distribution.

#load necessary packages
import numpy as np
from scipy.stats import boxcox
import seaborn as sns
import matplotlib.pyplot as plt
#make this example reproducible
np.random.seed(0)
#generate dataset
data = np.random.exponential(size=1000)
fig, ax = plt.subplots(1, 2)
#plot the distribution of data values
sns.distplot(data, hist=False, kde=True,
             kde_kws={'shade': True, 'linewidth': 2},
             label="Non-Normal", color="red", ax=ax[0])
#perform Box-Cox transformation on original data
transformed_data, best_lambda = boxcox(data)
sns.distplot(transformed_data, hist=False, kde=True,
             kde_kws={'shade': True, 'linewidth': 2},
             label="Normal", color="red", ax=ax[1])
#adding legends to the subplots
plt.legend(loc = "upper right")
#rescaling the subplots
fig.set_figheight(5)
fig.set_figwidth(10)
#display optimal lambda value
print(f"Lambda value used for Transformation: {best_lambda}")
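One caveat noted above is that reverting Box-Cox-transformed data to the original scale typically yields the median of the forecast distribution rather than the mean. Whenever you do need to revert, SciPy's inv_boxcox undoes the transform with the fitted lambda. Here is a minimal round-trip sketch on toy exponential data:

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

np.random.seed(0)
data = np.random.exponential(size=1000)

# fit the transformation, then invert it using the fitted lambda
transformed, fitted_lambda = boxcox(data)
restored = inv_boxcox(transformed, fitted_lambda)

print(np.allclose(restored, data))  # True
```

The round trip recovers the original values exactly (up to floating-point error); the median-vs-mean caveat only bites when you invert aggregated or predicted values on the transformed scale.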
Concept #6 - Poisson Distribution

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant mean rate and independently of the time since the last event. In very simple terms, a Poisson distribution can be used to estimate how likely it is that something will happen "X" number of times. Some examples of Poisson processes are customers calling a help center, radioactive decay in atoms, visitors to a website, photons arriving at a space telescope, and movements in a stock price. Poisson processes are usually associated with time, but they do not have to be.

The formula for the Poisson distribution is:

P(X = k) = (λ^k * e^(-λ)) / k!

where:

e is Euler's number (e = 2.71828...)
k is the number of occurrences
k! is the factorial of k
λ is equal to the expected value of k, which is also equal to its variance

Lambda (λ) can be thought of as the expected number of events in the interval. As we change the rate parameter λ, we change the probability of seeing different numbers of events in one interval. The graph below is the probability mass function of the Poisson distribution, showing the probability of a number of events occurring in an interval with different rate parameters.

Probability mass function for the Poisson distribution with varying rate parameters. Source

The Poisson distribution is also commonly used to model financial count data where the tally is small and often zero. For example, in finance, it can be used to model the number of trades that a typical investor will make in a given day, which can be 0 (often), or 1, or 2, etc. As another example, this model can be used to predict the number of "shocks" to the market that will occur in a given time period, say over a decade.

Poisson Distribution in Python

import numpy as np
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
lam_list = [1, 4, 9] #list of Lambda values
plt.figure(figsize=(10,6))
samples = np.linspace(start=0, stop=5, num=1000)
for lam in lam_list:
sns.distplot(random.poisson(lam=lam, size=10), hist=False, label='lambda {0}'.format(lam))
plt.xlabel('Poisson Distribution', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.legend(loc='best')
plt.show()As λ becomes bigger, the graph looks more like a normal distribution.I hope you have enjoyed reading this article, If you have any questions or suggestions, please leave a comment. Also read: False Positives vs. False NegativesFeel free to connect me on LinkedIn for any query.Thanks for reading!!!Referenceshttps://calcworkshop.com/joint-probability-distribution/chebyshev-inequality/ https://corporatefinanceinstitute.com/resources/knowledge/data-analysis/chebyshevs-inequality/ https://www.itl.nist.gov/div898/handbook/eda/section3/eda3669.htm https://www.statology.org/q-q-plot-python/ https://gist.github.com/chaipi-chaya/9eb72978dbbfd7fa4057b493cf6a32e7 https://stackoverflow.com/a/41968334/7175247
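As a footnote to the Poisson section above, the PMF formula can be sanity-checked numerically. The following is a quick sketch (not part of the original article) that compares a manual computation of P(X = k) against SciPy's `poisson.pmf`, assuming SciPy is installed:

```python
import math
from scipy.stats import poisson

lam = 4  # expected number of events per interval (rate parameter)
k = 2    # number of occurrences we ask about

# manual Poisson probability: P(X = k) = lambda^k * e^(-lambda) / k!
manual = lam**k * math.exp(-lam) / math.factorial(k)

# the same probability from SciPy's probability mass function
library = poisson.pmf(k, mu=lam)

print(round(manual, 6), round(library, 6))  # both ≈ 0.146525
```

Agreement between the two confirms the formula as written; varying `lam` and `k` is an easy way to build intuition for how the rate parameter shifts the distribution.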

Data Science

Data Scientists are Really Just Product Managers. Here’s Why.
Unpopular opinion?

Table of Contents
Introduction
Business and Product Understanding
Stakeholder Collaboration
Summary
References

Introduction

As mentioned above, the title of this article may be an unpopular opinion, but let me lay it out for you to see what you think from a new perspective. If you are still in school, this statement might sound quite surprising; however, if you have been in the tech business for a few years, it may come as no surprise to you. Of course, this statement will not be true for everyone all of the time, but it gets pretty close when you point out the similarities between these two product roles. With that being said, I will outline some of the similarities between data science and product manager roles, the skills required, and their goals below.

Business and Product Understanding

Photo by Hugo Rocha on Unsplash [2].

When you work as a data scientist, it is essential that you know the business, its goals, and its shortcomings; the same can be said about product managers. The main difference between these roles is the method by which their specific goals are reached, but overall, the main goals are shared. Here is an example:

Data Science Goals:
Identify problems in the product
Use data science models as a solution to a product problem
Use data science models as a product
Analyze data to predict future data with algorithms

Product Manager Goals:
Identify problems in the product
Rank which problems are of most concern
Look into data to learn about future data trends

Of course, data science can have a focus on coding, but without an understanding of the business and product, a data science model can be completely useless. It is essential to have this understanding in both roles, as this knowledge is how you come up with next steps, a process, a problem diagnosis, and eventual solutions. At the company structure level, you will see that data scientists are often the bridge between departments, or a part of both engineering and product.
That statement alone shows that there is considerable overlap.

Here are some skills that both roles share:
Data exploration
Problem diagnosis
Visualization tools (Tableau, Looker, Lucidchart, etc.)
Querying (SQL)

In addition to the skills shared between roles, there is also a similar process:
Product problem isolation or product improvement
Data analysis
Current process analysis
Solutions (this is where data science differs by being the one to create the model; product managers will organize this step, with things like results, costs, timeline, and how it will ultimately affect the business, but data scientists can also share some of that work)
Results presentation
Testing
Executive approval/approval in general
Implementation

As you can see, some skills, process steps, and goals are shared between data science and product management. Sometimes there are no product managers at smaller companies, and therefore data scientists will have to fill that role as well.

Stakeholder Collaboration

Photo by Mimi Thian on Unsplash [3].

When you work as a data scientist, you will have to collaborate with several verticals of the business and their respective stakeholders. The same can be said for product managers.
Both data scientists and product managers can also be stakeholders themselves.

Here is where these roles share stakeholder collaboration:
Proof of concepts with software engineers, data analysts, executives, etc.
Level-of-effort analysis with the same roles as above, and more
Setting up meetings with those roles
Updating those roles on next steps
Allocating work to others

Overall, both data scientists and product managers can prove to be cross-functional in their work, as well as in who they collaborate with, whether that be a data analyst, a salesperson, or a software engineer.

Summary

The goal of this article is not to say one role is better than the other, but to highlight that both roles have a considerable amount of overlap in their daily tasks, the skills required, who they work with, and their overall goals. To be more specific, maybe data scientists are just product managers who have special skills and a focus on algorithms.

To summarize, here are some ways that data scientists and product managers are similar:
* Business and Product Understanding
* Stakeholder Collaboration

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree or disagree with these comparisons between roles. Why or why not? What other comparisons (or differences) do you think are important to point out? These can certainly be clarified even further, but I hope I was able to shed some light on some of the common similarities between data scientists and product managers. Thank you for reading!

I am not affiliated with any of these companies.

Please feel free to check out my profile, Matt Przybyla, and other articles, as well as subscribe to receive email notifications for my blogs by following the link below, or by clicking on the subscribe icon on the left of the screen, and reach out to me on LinkedIn if you have any questions or comments.

Pandas
Data Science

A Better Way for Data Preprocessing: Pandas Pipe
Real-life data is usually messy. It requires a lot of preprocessing to be ready for use. Pandas, being one of the most widely used data analysis and manipulation libraries, offers several functions to preprocess raw data.

In this article, we will focus on one particular function that organizes multiple preprocessing operations into a single one: the pipe function. When it comes to software tools and packages, I learn best by working through examples. I keep this in mind when creating content, and I will do the same in this article.

Let's start with creating a data frame with mock data.

import numpy as np
import pandas as pd
df = pd.DataFrame({
    "id": [100, 100, 101, 102, 103, 104, 105, 106],
    "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
    "B": [45, 56, 48, 47, 62, 112, 54, 49],
    "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5]
})
df

(image by author)

Our data frame contains some missing values indicated by the standard missing value representation (i.e. NaN). The id column includes duplicate values. Last but not least, 112 in column B looks like an outlier. These are some of the typical issues in real-life data. We will be creating a pipe that handles the issues we have just described.

For each task, we need a function. Thus, the first step is to create the functions that will be placed in the pipe. It is important to note that the functions used in the pipe need to take a data frame as an argument and return a data frame.

The first function handles the missing values.

def fill_missing_values(df):
    for col in df.select_dtypes(include=["int", "float"]).columns:
        val = df[col].mean()
        df[col] = df[col].fillna(val)
    return df

I prefer to replace the missing values in the numerical columns with the mean value of the column. Feel free to customize this function. It will work in the pipe as long as it takes a data frame as an argument and returns a data frame.

The second function will help us remove the duplicate values.

def drop_duplicates(df, column_name):
    df = df.drop_duplicates(subset=column_name)
    return df

I have got some help from the built-in drop_duplicates function of Pandas. It eliminates the duplicate values in the given column or columns. In addition to the data frame, this function also takes a column name as an argument. We can pass such additional arguments to the pipe as well.

Also read: Pandas vs SQL. When Data Scientists Should Use One Over the Other

The last function in the pipe will be used for eliminating the outliers.

def remove_outliers(df, column_list):
    for col in column_list:
        avg = df[col].mean()
        std = df[col].std()
        low = avg - 2 * std
        high = avg + 2 * std
        df = df[df[col].between(low, high, inclusive="both")]
    return df

What this function does is as follows:

It takes a data frame and a list of columns
For each column in the list, it calculates the mean and standard deviation
It calculates a lower and upper bound using the mean and standard deviation
It removes the values that are outside the range defined by the lower and upper bounds

Just like the previous functions, you can choose your own way of detecting outliers.

We now have 3 functions that each handle a data preprocessing task. The next step is to create a pipe with these functions.

df_processed = (df.
    pipe(fill_missing_values).
    pipe(drop_duplicates, "id").
    pipe(remove_outliers, ["A", "B"]))

This pipe executes the functions in the given order. We can pass the arguments to the pipe along with the function names.

One thing to mention here is that some functions in the pipe modify the original data frame in place. Thus, using the pipe as indicated above will update df as well. One option to overcome this issue is to use a copy of the original data frame in the pipe. If you do not care about keeping the original data frame as is, you can just use it in the pipe.

I will update the pipe as below:

my_df = df.copy()
df_processed = (my_df.
    pipe(fill_missing_values).
    pipe(drop_duplicates, "id").
    pipe(remove_outliers, ["A", "B"]))

Let's take a look at the original and processed data frames:

df (image by author)

df_processed (image by author)

Conclusion

You can, of course, accomplish the same tasks by applying these functions separately. However, the pipe function offers a structured and organized way of combining several functions into a single operation. Depending on the raw data and the tasks, the preprocessing may include more steps. You can add as many steps as you need in the pipe. As the number of steps increases, the syntax becomes cleaner with the pipe function compared to executing the functions separately.

Thank you for reading. Please let me know if you have any feedback.

Also read:
- Using Python And Pandas Datareader to Analyze Financial Data
- Using Pandas Profiling to Accelerate Our Exploratory Analysis
- Pandas Essentials For Data Science
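As a footnote to the pipe discussion above: a related pattern is to make the copy itself the first step of the pipe. This is a sketch of that idea rather than something from the original article; the `start_pipeline` helper is named here purely for illustration.

```python
import numpy as np
import pandas as pd

def start_pipeline(df):
    # work on a copy so the original data frame is left untouched
    return df.copy()

def fill_missing_values(df):
    # replace NaN in numeric columns with the column mean
    for col in df.select_dtypes(include=["int", "float"]).columns:
        df[col] = df[col].fillna(df[col].mean())
    return df

df = pd.DataFrame({"id": [100, 100, 101], "A": [1, np.nan, 3]})

df_processed = (df.
    pipe(start_pipeline).
    pipe(fill_missing_values))

print(df["A"].isna().sum())            # the original still has its missing value
print(df_processed["A"].isna().sum())  # the processed copy does not
```

With this pattern the copy is documented inside the pipe itself, so every pipeline built from these functions is non-destructive by construction.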

Data Science

How To Write The Perfect Data Science CV
These tips are also applicable to Software Engineers. Make a few changes in your CV and land that job!

Writing a good CV can be one of the toughest challenges of job searching. Most employers spend just a few seconds scanning each CV before sticking it in the Yes or No pile.

Photo by Christina @ wocintechchat.com on Unsplash

Here are the top 5 tips that will increase the chances that your CV lands in the Yes pile.

1. Beautiful Design

Photo by Neven Krcmarek on Unsplash

Your CV should reflect your future potential. I always thought a nicely designed CV was not that important: we aren't designers, so no one expects a polished CV from a Data Scientist, right? Well, I was wrong! A polished CV becomes important when applying for a remote position or a position in a bigger company, which gets thousands of applications.

Also read: 3 Ways to Get Real-Life Data Science Experience Before Your First Job

In Europe, we have a standardized template for CVs called Europass. While it has an online editor and a surprisingly good user experience, the end result is not satisfying.

My 2-page Europass CV.

The boring design of the Europass CV doesn't catch the eye. It doesn't reflect your future potential.

A CV that catches the eye

I wrote my CV with the Awesome-CV template. You don't need design skills to write an eye-catching CV. Awesome-CV is a beautiful CV template in which you only need to change the content, and you can personalize it by changing the colors to your liking. This template is made with LaTeX, a high-quality typesetting system that is the de facto standard for the communication and publication of scientific documents. A prerequisite to creating a CV with Awesome-CV is to install LaTeX on your computer. After you're done editing, you compile the CV with LaTeX and it outputs a PDF.

2. Less is More

Photo by Prateek Katyal on Unsplash

Many applicants write a CV that spans multiple pages. Employers don't have the time to review long-form CVs, so many automatically put them in the "No" pile.

Also read: How to Get a Job With Python

Don't believe me? Take a look at Joseph Redmon's CV, the mind behind YOLO (a unified, real-time object detection engine). It's a one-page CV that looks like it came from a cartoon, but it catches the eye. I recommend writing a one-page CV, as it forces you to condense your work experience.

3. Optimize for Applicant Tracking Systems

Photo by Kelly Sikkema on Unsplash

Big companies like Google and Apple get thousands of applications for each job vacancy. Recruiters cannot efficiently review all of them, so they use Applicant Tracking Systems (ATS). An ATS is software that scans and analyzes your CV; the recruiters only see the scorecard of your resume, which is the end result of the scan. It is important that your resume contains keywords related to the position you are applying for. E.g., important keywords for a Big Data Engineer position are Big Data, Hadoop, Spark, Redshift, etc.

4. Put the most significant achievements first

Photo by Fauzan Saari on Unsplash

Put yourself in the role of a recruiter. How would you review a CV? Most probably you would skim the first few words of each bullet point. Well, recruiters do the same.

Read also: How to keep your skills sharp while data science job hunting

Make sure you list your most significant achievements first, and put tedious work last. Don't write about the work you did; instead, describe the results you got. Write about the business impact of your work. Even better if you can quantify the work you did. Use phrases like "reduced the costs", "automated processes", "optimized"... For example:

Initiated analytics tools to make data-driven decisions in the marketing department
Automated management of the distributed nodes with statistical distributions
Maintained and optimized the Ad network platform to spend less on the Cloud infrastructure

5. Don't go too deep into details

Photo by Octavian Dan on Unsplash

When describing your projects, don't go too much into detail; list only the most significant points. Describe briefly what the project is about and what problems it solves, and mention interesting facts, like "our app was the iPhone Business App of the Month in the UK". Describe your role in the project, the challenges, and the solution. E.g., recruited to improve the accuracy of classification for one of the most popular business apps for the self-employed. Try to quantify the results: categorization accuracy improved by 30% using a Machine Learning model. Add a tech stack with the technologies that you've used: Python, sklearn, etc.

An example of a good project description.

Conclusion

Photo by Andrey Grinkevich on Unsplash

Having a polished CV will increase your chances of being seen by the employer. You're going to stand out from the others. Hopefully, these tips will help you land that job! Let me know in the comments if you would add any other tips.

Follow me on Twitter, where I regularly tweet about Data Science and Machine Learning.

Photo by Courtney Hedger on Unsplash

Read also: Land Your First Data Science Job
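To make the keyword idea from tip #3 concrete, here is a toy sketch. This is my own illustration, not how any real ATS works; real systems are far more sophisticated, but the core idea of matching a CV's text against role keywords can be shown in a few lines:

```python
# Toy keyword matching, loosely in the spirit of an ATS scorecard.
# This only counts exact keyword hits in the lowercased CV text.
cv_text = """
Data engineer with 5 years of experience building Hadoop and Spark
pipelines, loading results into Redshift for analytics teams.
""".lower()

keywords = ["big data", "hadoop", "spark", "redshift", "airflow"]

matched = [kw for kw in keywords if kw in cv_text]
score = len(matched) / len(keywords)

print(matched)         # ['hadoop', 'spark', 'redshift']
print(f"{score:.0%}")  # 60%
```

Even this crude matcher shows why naming the exact technologies from the job listing matters: synonyms and paraphrases ("columnar warehouse" instead of "Redshift") simply do not score.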

Machine Learning
Data Science

11 Kaggle Alternatives in Data Science Competitions
Data science competitions are a very particular field of applied machine learning, or what is commonly known as applied AI. From the point of view of a data scientist, they have the particularity of simulating a real environment for the solution of a machine learning problem. From the point of view of a company, they have the particularity of solving a problem in a collaborative way (following the wisdom of crowds) and obtaining the benefits derived from it, such as benchmarking, new ideas, and different solutions.

As we mentioned in a previous blog post about data science competitions, some time ago several platforms were born, among which is Kaggle, in which large technology companies in Silicon Valley and even multinationals try to solve really complex problems with the help of outsiders, through something called "data science competitions". This was because internal talent could not solve these problems, as they did not have the time, resources, or capabilities. Obviously, these were really complex problems.

These data science competition platforms allow a company to access a global talent pool of data science specialists, ranging from PhDs to self-taught individuals, who set out on an adventure to solve the challenge posed by the company sponsoring a competition. The prizes are obviously exorbitant, with Netflix even paying $1 million for a machine learning solution. Sponsoring a competition there requires an investment ranging from $20,000 USD to $100,000 USD on average. A luxury that only the big tech companies (or multinationals) can afford.

What about startups? Well, unless you've raised a Series B or Series C, you might be able to afford to sponsor a $20,000 USD competition, or even have an internal team of data scientists to help you experiment or solve a problem with machine learning. But what about companies that are at an earlier stage and don't have these funds or that in-house talent?
That is why we have dedicated this blog post to showing other options for sponsoring data science competitions.

1- DataSource.ai
Company Targets: Startups & SMBs
Type of Competitions: Cash Prizes
Competition System: Competitions usually last 2-3 months each
Average Opened Competitions per Month: 1
Average Prize Money: Starting at USD $3,000
Link: https://www.datasource.ai/

The focus of this platform is to democratize data science competitions. Other data science competition platforms are focused on very large companies, very high prizes, and very complex problems. This translates into competitions that can only be paid for by companies with deep pockets, competitions that take months to complete, and that are made for "super-senior" data scientists and teams. At the end of the day, sponsoring a $20,000 USD (or $1 million USD) competition is not for every type of company. So they decided to rethink the way data science competitions are built and to focus on startups of any size and from any part of the world: competitions that are in line with a startup budget, that don't take so long to be solved (8 weeks), that let a company launch more than one, two, or three competitions (because they can afford it), and in which all kinds of data science talent can participate, at any level and from any part of the world.

2- Numerai
Company Targets: Finance, Crypto
Type of Competitions: Crypto Prizes
Competition System: Continuous
Average Opened Competitions per Month: Continuous
Link: https://numer.ai/

In the Numerai Tournament you build machine learning models on abstract financial data to predict the stock market. Your models can be staked with the NMR cryptocurrency to earn performance-based rewards. Numerai's staked models combine to form the Meta Model, which controls Numerai's hedge fund capital in the global stock market. Here companies are not allowed to sponsor competitions; Numerai is the sponsor itself and is the one who delivers the rewards.

3- International Data Analysis Olympiad (IDAO)
Company Targets: Made by Yandex only
Type of Competitions: Money Prizes and Internships
Competition System: Competitions usually last 1 year each
Average Opened Competitions per Year: 1
Average Prize Money: $10,000
Link: https://idao.world/

IDAO is an annual competition organized by the Higher School of Economics and Yandex. This event is open to all teams and individuals, be they undergraduate, postgraduate, or Ph.D. students, company employees, researchers, or new data scientists.

4- DrivenData
Company Targets: Social Companies
Type of Competitions: Money Prizes and Kudos
Competition System: Competitions usually last 2-4 months each
Average Opened Competitions per Month: 2
Average Prize Money: $17,000
Link: https://www.drivendata.org/competitions/

DrivenData brings cutting-edge practices in data science and crowdsourcing to some of the world's biggest social challenges and the organizations taking them on.

5- CodaLab
Company Targets: Social and Big Companies
Type of Competitions: Knowledge
Competition System: Competitions usually last 3-6 months each
Average Opened Competitions per Month: 1
Average Prize Money: Almost all are for knowledge
Link: https://competitions.codalab.org/

CodaLab is an open-source platform for computational research. The competitions are held for the sake of collaborative research and code testing.

6- DataHack & DSAT
Company Targets: Social and Big Companies
Type of Competitions: Kudos
Competition System: Competitions usually last 2-4 months each
Average Opened Competitions per Month: 1
Average Prize Money: All are hackathons without a final prize
Link: https://datahack.analyticsvidhya.com/

This platform basically allows you to compete with the best in the world on real-life data science problems, learn by working on real-world problems, showcase your expertise and get hired by top firms, build your profile, and be at the top of competitions and win lucrative prizes.

7- MachineHack
Company Targets: Social and Big Companies
Type of Competitions: Money Prizes, Kudos, and Interviews
Competition System: Competitions usually last 2-4 months each
Average Opened Competitions per Month: 1
Average Prize Money: All are hackathons without a final prize
Link: https://machinehack.com/

MachineHack is an online platform for machine learning competitions. At MachineHack, you get to test and practice your ML skills. On this platform, you have the opportunity to compete against hundreds of data scientists in industry-curated hackathons.

8- Tianchi
Company Targets: Big Companies
Type of Competitions: Money Prizes and Kudos
Competition System: Competitions usually last 3-6 months each
Average Opened Competitions per Month: 1
Average Prize Money: $100,000
Link: https://tianchi.aliyun.com/competition/gameList/activeList

Tianchi is a crowdsourcing community of global data scientists that hosts big data competitions in various industries, with million-dollar prize pools and real business test cases. You have the chance to compete against AI elites from around the world.

9- KDD Cup
Company Targets: Organized by the ACM Special Interest Group
Type of Competitions: Money Prizes
Competition System: Competitions usually last 1 year each
Average Opened Competitions per Month: 1
Average Prize Money: $12,000
Link: https://www.kdd.org/kdd-cup

The KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners. Year-to-year archives, including datasets, instructions, and winners, are available for most years.

10- ViZDoom
Company Targets: Organized by ViZDoom
Type of Competitions: Money Prizes
Competition System: Competitions usually last 1 year each
Average Opened Competitions per Month: 1
Average Prize Money: Not applicable
Link: http://vizdoom.cs.put.edu.pl/competitions/vdaic-2017-cig

ViZDoom allows developing AI bots that play DOOM using visual information (the screen buffer).

11- CrowdANALYTIX
Company Targets: Medium-Size Companies
Type of Competitions: Money Prizes and Kudos
Competition System: Competitions usually last 2-4 months each
Average Opened Competitions per Month: 1
Average Prize Money: $7,000
Link: https://www.crowdanalytix.com/community

Data experts collaborate and compete to build and optimize AI, ML, NLP, and deep learning algorithms.

Conclusion

If you are a data scientist, you definitely have a lot of learning and participation options. But make sure you also have a real chance to win! If you are a company thinking about solving a data science problem, don't hesitate to make the decision for the best!

