Let’s look at some of Uber’s top machine learning open source projects

Artificial intelligence (AI) has been an atypical technology trend. In a traditional technology cycle, innovation typically begins with startups trying to disrupt industry incumbents. In the case of AI, most of the innovation has come from the big corporate labs of companies like Google, Facebook, Uber or Microsoft. Those companies are not only pursuing impressive lines of research but also regularly open sourcing new frameworks and tools that streamline the adoption of AI technologies. In that context, Uber has emerged as one of the most active contributors to open source AI technologies in the current ecosystem. In just a few years, Uber has open sourced projects across different areas of the AI lifecycle. Today, I would like to review a few of my favorites.

Uber is a near-perfect playground for AI technologies. The company combines all the traditional AI requirements of a large scale tech company with a front row seat to AI-first transportation scenarios. As a result, Uber has been building machine/deep learning applications across highly diverse scenarios ranging from customer classification to self-driving vehicles. Many of the technologies used by Uber teams have been open sourced and have received accolades from the machine learning community. Let’s look at some of my favorites:

Note: I am not covering technologies like Michelangelo or PyML, as they are well documented but have not been open sourced.

Ludwig: A Toolbox for No-Code Machine Learning Models

Ludwig is a TensorFlow-based toolbox that allows users to train and test deep learning models without writing code. Conceptually, Ludwig was built on five fundamental principles:

No coding required: no coding skills are required to train a model and use it for obtaining predictions.
Generality: a new data type-based approach to deep learning model design that makes the tool usable across many different use cases.
Flexibility: experienced users have extensive control over model building and training, while newcomers will find it easy to use.
Extensibility: easy to add new model architecture and new feature data types.
Understandability: deep learning model internals are often considered black boxes, but we provide standard visualizations to understand their performance and compare their predictions.

Using Ludwig, a data scientist can train a deep learning model by simply providing a CSV file that contains the training data as well as a YAML file that declares the inputs and outputs of the model. Using those two inputs, Ludwig performs a multi-task learning routine to predict all outputs simultaneously and evaluate the results. Under the covers, Ludwig provides a series of deep learning models that are constantly evaluated and can be combined in a final architecture. The Uber engineering team explains this process by using the following analogy: “if deep learning libraries provide the building blocks to make your building, Ludwig provides the buildings to make your city, and you can choose among the available buildings or add your own building to the set of available ones.”
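To make the "CSV plus YAML" workflow concrete, here is a minimal sketch that prepares the two files using only the Python standard library. It does not call Ludwig itself. The column names (text, sentiment), the file names and the toy rows are assumptions made up for illustration; the input_features / output_features layout with a name and a type per feature follows Ludwig's declarative format. The definition is written as JSON, which YAML parsers also accept.

```python
import csv
import json
import os
import tempfile

# Hypothetical training data: one input column and one output column.
rows = [
    {"text": "great ride, friendly driver", "sentiment": "positive"},
    {"text": "driver was late and rude", "sentiment": "negative"},
    {"text": "smooth trip to the airport", "sentiment": "positive"},
]

# Declarative model definition: which columns are inputs, which are outputs,
# and the data type of each one.
model_definition = {
    "input_features": [{"name": "text", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "train.csv")
definition_path = os.path.join(workdir, "model_definition.yaml")

# Write the CSV file that holds the training data.
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "sentiment"])
    writer.writeheader()
    writer.writerows(rows)

# Write the definition file (JSON syntax, readable by YAML parsers).
with open(definition_path, "w") as f:
    json.dump(model_definition, f, indent=2)

# Sanity check: every declared feature must exist as a CSV column.
with open(csv_path, newline="") as f:
    columns = csv.DictReader(f).fieldnames
declared = [
    feature["name"]
    for feature in model_definition["input_features"]
    + model_definition["output_features"]
]
missing = [name for name in declared if name not in columns]
print("columns:", columns, "missing:", missing)
```

The two resulting files are what would then be handed to Ludwig for training; the exact command-line flags differ between Ludwig versions, so they are left out here.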

Pyro: A Native Probabilistic Programming Language

Pyro is a deep probabilistic programming language (PPL) released by Uber AI Labs. Pyro is built on top of PyTorch and is based on four fundamental principles:

Universal: Pyro is a universal PPL — it can represent any computable probability distribution. How? By starting from a universal language with iteration and recursion (arbitrary Python code), and then adding random sampling, observation, and inference.
Scalable: Pyro scales to large data sets with little overhead above hand-written code. How? By building modern black box optimization techniques, which use mini-batches of data, to approximate inference.
Minimal: Pyro is agile and maintainable. How? Pyro is implemented with a small core of powerful, composable abstractions. Wherever possible, the heavy lifting is delegated to PyTorch and other libraries.
Flexible: Pyro aims for automation when you want it and control when you need it. How? Pyro uses high-level abstractions to express generative and inference models, while allowing experts to easily customize inference.

These principles often pull Pyro’s implementation in opposite directions. Being universal, for instance, requires allowing arbitrary control structure within Pyro programs, but this generality makes it difficult to scale. In general, however, Pyro strikes a good balance between these capabilities, making it one of the best PPLs for real world applications.
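The "universal" principle (arbitrary Python code with iteration and recursion, plus random sampling) can be illustrated without Pyro. The sketch below is plain Python with the standard library only, not Pyro's API: a recursive function defines a geometric distribution, and a naive Monte Carlo loop stands in for inference. The function names and the choice of p = 0.5 are assumptions for illustration.

```python
import random


def geometric(p, rng, failures=0):
    """Sample the number of failures before the first success.

    The distribution is defined by ordinary recursion plus a random choice,
    which is the kind of program a universal PPL has to support.
    """
    if rng.random() < p:
        return failures
    return geometric(p, rng, failures + 1)


def estimate_mean(p, num_samples, seed=0):
    """Estimate the mean of the distribution by simple Monte Carlo."""
    rng = random.Random(seed)
    total = sum(geometric(p, rng) for _ in range(num_samples))
    return total / num_samples


p = 0.5
estimated = estimate_mean(p, num_samples=20000)
exact = (1 - p) / p  # closed-form mean of this geometric distribution
print(f"estimated mean: {estimated:.3f}, exact mean: {exact:.3f}")
```

In Pyro, the random choice would be a named sample statement and the averaging loop would be replaced by one of the library's inference algorithms; the control flow would stay ordinary Python.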

Manifold: A Debugging and Interpretation Toolset for Machine Learning Models

Manifold is Uber’s technology for debugging and interpreting machine learning models at scale. With Manifold, the Uber engineering team wanted to accomplish some very tangible goals:

· Debug code errors in a machine learning model.

· Understand the strengths and weaknesses of one model, both in isolation and in comparison with other models.

· Compare and ensemble different models.

· Incorporate insights gathered through inspection and performance analysis into model iterations.

To accomplish those goals, Manifold segments the machine learning analysis process into three main phases: Inspection, Explanation and Refinement.

· Inspection: In the first part of the analysis process, the user designs a model and attempts to investigate and compare the model outcome with other existing ones. During this phase, the user compares typical performance metrics, such as accuracy, precision/recall, and receiver operating characteristic curve (ROC), to have coarse-grained information of whether the new model outperforms the existing ones.

· Explanation: This phase of the analysis process attempts to explain the different hypotheses formulated in the previous phase. This phase relies on comparative analysis to explain some of the symptoms of the specific models.

· Refinement: In this phase, the user attempts to verify the explanations generated from the previous phase by encoding the knowledge extracted from the explanation into the model and testing the performance.
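The coarse-grained comparison described in the Inspection phase can be sketched in a few lines of plain Python. This is not Manifold's API; the labels and the two sets of predictions are made-up toy data, chosen so that both models reach the same accuracy while differing in precision and recall.

```python
def summarize(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# Hypothetical ground truth and predictions from an existing and a new model.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
existing_model = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
new_model = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

existing_summary = summarize(y_true, existing_model)
new_summary = summarize(y_true, new_model)

for name, summary in [("existing", existing_summary), ("new", new_summary)]:
    print(name, summary)
```

Identical accuracy with different precision and recall is the kind of symptom that the Explanation phase would then try to account for.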

Plato: A Framework for Building Conversational Agents at Scale

Uber built the Plato Research Dialogue System (PRDS) to address the challenges of building large scale conversational applications. Conceptually, PRDS is a framework to create, train and evaluate conversational AI agents in diverse environments. From a functional standpoint, PRDS includes the following building blocks:

Speech recognition (transcribe speech to text)
Language understanding (extract meaning from that text)
State tracking (aggregate information about what has been said and done so far)
API call (search a database, query an API, etc.)
Dialogue policy (generate abstract meaning of agent’s response)
Language generation (convert abstract meaning into text)
Speech synthesis (convert text into speech)

PRDS was designed with modularity in mind in order to incorporate state-of-the-art research in conversational systems as well as continuously evolve every component of the platform. In PRDS, each component can be trained either online (from interactions) or offline and then incorporated into the core engine. From the training standpoint, PRDS supports interactions with human and simulated users. The latter are commonly used to jumpstart conversational AI agents in research scenarios, while the former are more representative of live interactions.
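The modular design described above can be sketched as a chain of replaceable components. The code below is a toy illustration in plain Python, not Plato's API: the component functions, the tiny restaurant "database" and the response templates are all assumptions, and the speech recognition and synthesis stages are left out so the example works on text only.

```python
# Toy "database" queried by the API-call stage (hypothetical data).
RESTAURANTS = {"italian": "Luigi's", "thai": "Bangkok Garden"}


def understand(text):
    """Language understanding: extract a cuisine slot from the user text."""
    meaning = {}
    for word in text.lower().replace(",", " ").split():
        if word in RESTAURANTS:
            meaning["cuisine"] = word
    return meaning


def track(state, meaning):
    """State tracking: merge the new meaning into what is known so far."""
    state.update(meaning)
    return state


def lookup(state):
    """API call: search the database with the tracked slots."""
    return RESTAURANTS.get(state.get("cuisine"))


def decide(state, result):
    """Dialogue policy: choose an abstract action for the agent."""
    if result is None:
        return {"act": "request", "slot": "cuisine"}
    return {"act": "inform", "name": result}


def generate(action):
    """Language generation: turn the abstract action into text."""
    if action["act"] == "request":
        return "What kind of food would you like?"
    return f"How about {action['name']}?"


class Agent:
    """Wires the stages together; any stage can be swapped for a trained one."""

    def __init__(self, nlu=understand, tracker=track, api=lookup,
                 policy=decide, nlg=generate):
        self.nlu, self.tracker, self.api = nlu, tracker, api
        self.policy, self.nlg = policy, nlg
        self.state = {}

    def respond(self, text):
        meaning = self.nlu(text)
        self.state = self.tracker(self.state, meaning)
        result = self.api(self.state)
        action = self.policy(self.state, result)
        return self.nlg(action)


agent = Agent()
first_reply = agent.respond("hi there")
second_reply = agent.respond("something thai, please")
print(first_reply)
print(second_reply)
```

Because every stage is passed in as a plain callable, a rule-based component can be replaced by a learned one without touching the rest of the pipeline, which is the point of the modular design.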

Horovod: A Framework for Training Deep Learning at Scale

Horovod is one of the Uber ML stacks that has become extremely popular within the community and has been adopted by research teams at AI-powerhouses like DeepMind or OpenAI. Conceptually, Horovod is a framework for running distributed deep learning training jobs at scale.

Horovod leverages message passing interface (MPI) stacks such as OpenMPI to enable a training job to run on a highly parallel and distributed infrastructure with only minimal code changes. Running a distributed TensorFlow training job in Horovod is accomplished in four simple steps:

hvd.init() initializes Horovod.
config.gpu_options.visible_device_list = str(hvd.local_rank()) assigns a GPU to each of the TensorFlow processes.
opt = hvd.DistributedOptimizer(opt) wraps any regular TensorFlow optimizer with a Horovod optimizer, which takes care of averaging gradients using ring-allreduce.
hvd.BroadcastGlobalVariablesHook(0) broadcasts variables from the first process to all other processes to ensure consistent initialization.
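The gradient averaging mentioned in the third step relies on ring-allreduce. The sketch below simulates that idea in a single Python process, with lists standing in for per-worker gradient buffers. It is not Horovod's implementation; the function name and the toy gradients are assumptions for illustration. Each worker only ever "sends" one chunk to its right-hand neighbour per step: first the chunks are summed around the ring, then the finished chunks are passed around once more.

```python
def ring_allreduce_mean(worker_grads):
    """Average equally sized gradient vectors with a simulated ring-allreduce."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split every vector into n contiguous chunks (some may be empty).
    bounds = [(k * length // n, (k + 1) * length // n) for k in range(n)]
    bufs = [list(g) for g in worker_grads]

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        messages = []
        for i in range(n):
            chunk = (i - step) % n
            lo, hi = bounds[chunk]
            messages.append((chunk, bufs[i][lo:hi]))  # snapshot before applying
        for i, (chunk, payload) in enumerate(messages):
            dst = (i + 1) % n
            lo, _ = bounds[chunk]
            for offset, value in enumerate(payload):
                bufs[dst][lo + offset] += value

    # Phase 2 (allgather): pass each completed chunk around the ring.
    for step in range(n - 1):
        messages = []
        for i in range(n):
            chunk = (i + 1 - step) % n
            lo, hi = bounds[chunk]
            messages.append((chunk, bufs[i][lo:hi]))
        for i, (chunk, payload) in enumerate(messages):
            dst = (i + 1) % n
            lo, hi = bounds[chunk]
            bufs[dst][lo:hi] = payload

    # Every worker now holds the full sum; divide to get the average.
    return [[value / n for value in buf] for buf in bufs]


# Hypothetical gradients computed by three workers on different mini-batches.
grads = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 400.0],
]
averaged = ring_allreduce_mean(grads)
print(averaged[0])
```

Each worker exchanges data only with its neighbours, so the per-worker traffic stays roughly constant as workers are added, which is why this pattern suits gradient averaging at scale.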

Uber AI Research: A Regular Source of AI Research

Last but not least, we should mention Uber’s active contributions to AI research. Many of Uber’s open source releases are inspired by its research efforts. The Uber AI Research website is a phenomenal catalog of papers that highlight Uber’s latest efforts in AI research.

These are some of the contributions of the Uber engineering team that have seen regular adoption by the AI research and development community. As Uber continues implementing AI solutions at scale, we should see new and innovative frameworks that simplify the adoption of machine learning by data scientists and researchers.

Most Related Articles

10 Highly Probable Data Scientist Interview Questions

The popularity of data science attracts a lot of people from a wide range of professions to make a career change with the goal of becoming a data scientist.

Despite the high demand for data scientists, it is a highly challenging task to find your first job. Unless you have solid prior job experience, interviews are where you can show your skills and impress your potential employer.

Data science is an interdisciplinary field which covers a broad range of topics and concepts. Thus, the number of questions that you might be asked at an interview is very high.

However, there are some questions about the fundamentals in data science and machine learning. These are the ones you do not want to miss. In this article, we will go over 10 questions that are likely to be asked at a data scientist interview.

The questions are grouped into 3 main categories which are machine learning, Python, and SQL. I will try to provide a brief answer for each question. However, I suggest reading or studying each one in more detail afterwards.

Machine Learning

1. What is overfitting?

Overfitting in machine learning occurs when your model does not generalize well. The model is too focused on the training set. It captures a lot of detail or even noise in the training set. Thus, it fails to capture the general trend or the relationships in the data. If a model is too complex compared to the data, it will probably overfit.

A strong indicator of overfitting is a large difference between the accuracy on the training and test sets. Overfit models usually have very high accuracy on the training set, but the test accuracy is usually unpredictable and much lower than the training accuracy.

2. How can you reduce overfitting?

We can reduce overfitting by making the model more generalized, which means it should be more focused on the general trend rather than specific details.

If it is possible, collecting more data is an efficient way to reduce overfitting.
You will be giving more juice to the model so it will have more material to learn from. Data is always valuable, especially for machine learning models.

Another method to reduce overfitting is to reduce the complexity of the model. If a model is too complex for a given task, it will likely result in overfitting. In such cases, we should look for simpler models.

3. What is regularization?

We have mentioned that the main reason for overfitting is a model being more complex than necessary. Regularization is a method for reducing the model complexity.

It does so by penalizing large weights in the model. With the addition of a regularization term, the model tries to minimize both loss and complexity.

Two main types of regularization are L1 and L2 regularization. L1 regularization subtracts a small amount from the weights of uninformative features at each iteration. Thus, it causes these weights to eventually become zero.

On the other hand, L2 regularization removes a small percentage from the weights at each iteration. These weights will get closer to zero but never actually become 0.

4. What is the difference between classification and clustering?

Both are machine learning tasks. Classification is a supervised learning task so we have labelled observations (i.e. data points). We train a model with labelled data and expect it to predict the labels of new data.

For instance, spam email detection is a classification task. We provide a model with several emails marked as spam or not spam. After the model is trained with those emails, it will evaluate the new emails appropriately.

Clustering is an unsupervised learning task so the observations do not have any labels. The model is expected to evaluate the observations and group them into clusters. Similar observations are placed into the same cluster.

In the optimal case, the observations in the same cluster are as close to each other as possible and the different clusters are as far apart as possible.
An example of a clustering task would be grouping customers based on their shopping behavior.

Python

The built-in data structures are of crucial importance. Thus, you should be familiar with what they are and how to interact with them. List, dictionary, set, and tuple are the 4 main built-in data structures in Python.

5. What is the difference between lists and tuples?

The main difference between lists and tuples is mutability. Lists are mutable so we can manipulate them by adding or removing items.

mylist = [1,2,3]
mylist.append(4)
mylist.remove(1)
print(mylist)
[2, 3, 4]

On the other hand, tuples are immutable. Although we can access each element in a tuple, we cannot modify its content.

mytuple = (1,2,3)
mytuple.append(4)
AttributeError: 'tuple' object has no attribute 'append'

One important point to mention here is that although tuples are immutable, they can contain mutable elements such as lists or sets.

mytuple = (1,2,["a","b","c"])
mytuple[2]
['a', 'b', 'c']
mytuple[2][0] = ["A"]
print(mytuple)
(1, 2, [['A'], 'b', 'c'])

6. What is the difference between lists and sets?

Let’s do an example to demonstrate the main difference between lists and sets.

text = "Python is awesome!"
mylist = list(text)
myset = set(text)
print(mylist)
['P', 'y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '!']
print(myset)
{'t', ' ', 'i', 'e', 'm', 'P', '!', 'y', 'o', 'h', 'n', 'a', 's', 'w'}

As we notice in the resulting objects, the list contains all the characters in the string whereas the set only contains unique values.

Another difference is that the characters in the list are ordered based on their location in the string. However, there is no order associated with the characters in the set.

Here is a table that summarizes the main characteristics of lists, tuples, and sets.

(image by author)

7. What is a dictionary and what are the important features of dictionaries?

A dictionary in Python is a collection of key-value pairs.
It is similar to a list in the sense that each item in a list has an associated index starting from 0.

mylist = ["a", "b", "c"]
mylist[1]
"b"

In a dictionary, we have keys as the index. Thus, we can access a value by using its key.

mydict = {"John": 24, "Jane": 26, "Ashley": 22}
mydict["Jane"]
26

The keys in a dictionary are unique, which makes sense because they act like an address for the values.

SQL

SQL is an extremely important skill for data scientists. There are quite a number of companies that store their data in a relational database. SQL is what is needed to interact with relational databases.

You will probably be asked a question that involves writing a query to perform a specific task. You might also be asked a question about general database knowledge.

8. Query example 1

Consider a sales table that contains daily sales quantities of products.

SELECT TOP 10 * FROM SalesTable

(image by author)

Find the top 5 weeks in terms of total weekly sales quantities.

SELECT TOP 5
  CONCAT(YEAR(SalesDate), DATEPART(WEEK, SalesDate)) AS YearWeek,
  SUM(SalesQty) AS TotalWeeklySales
FROM SalesTable
GROUP BY CONCAT(YEAR(SalesDate), DATEPART(WEEK, SalesDate))
ORDER BY TotalWeeklySales DESC

(image by author)

We first extract the year and week information from the date column and then use it in the aggregation. The sum function is used to calculate the total sales quantities.

9. Query example 2

In the same sales table, find the number of unique items that are sold each month.

SELECT
  MONTH(SalesDate) AS Month,
  COUNT(DISTINCT(ItemNumber)) AS ItemCount
FROM SalesTable
GROUP BY MONTH(SalesDate)

   Month  ItemCount
1  9      1021
2  8      1021

10. What is normalization and denormalization in a database?

These terms are related to database schema design. Normalization and denormalization aim to optimize different metrics.

The goal of normalization is to reduce data redundancy and inconsistency by increasing the number of tables. On the other hand, denormalization aims to speed up the query execution.
Denormalization decreases the number of tables but at the same time, it adds some redundancy.

Conclusion

It is a challenging task to become a data scientist. It requires time, effort, and dedication. Without having prior job experience, the process gets harder.

Interviews are very important to demonstrate your skills. In this article, we have covered 10 questions that you are likely to encounter in a data scientist interview.

Thank you for reading. Please let me know if you have any feedback.

Soner Yıldırım

Daniel Morales

Apr 16, 2020

Deep learning

When to Avoid Deep Learning

Introduction

This article is intended for data scientists who may consider using deep learning algorithms and want to know more about the cons of implementing these types of models in their work. Deep learning algorithms have many benefits, are powerful, and can be fun to show off. However, there are a few times when you should avoid them. I will be discussing those times below, so keep on reading if you would like a deeper dive into deep learning.

When You Want to Easily Explain

Photo by Malte Helmhold on Unsplash [2].

Because other algorithms have been around longer, they have countless amounts of documentation, including examples and functions that make interpretability easier. The way the other algorithms work is also easier to interpret. Deep learning can be intimidating to data scientists for this reason as well: it can be a turn-off to use a deep learning algorithm when you are unsure of how to explain it to a stakeholder.

Here are 3 examples of when you would have trouble explaining deep learning:

When you want to describe the top features of your model: the features become hidden inputs, so you will not know what caused a certain prediction to happen, and if you need to prove to stakeholders or customers why a certain output was achieved, it can be more of a black box
When you want to tune your hyperparameters like learning rate and batch size
When you want to explain how the algorithm itself works: for example, if you were to present the algorithm itself to stakeholders, they might get lost, because even a simplified approach is still difficult to understand

Here are 3 examples of how you could handle those same situations with non-deep learning algorithms:

When you want to explain your top features, you can easily access SHAP libraries. For the algorithm CatBoost, say, once your model is fitted, you can simply make a summary plot from feat = model.get_feature_importance() and then use summary_plot() to rank the features by feature name, so that you can present a nice plot to stakeholders (and yourself for that matter)

Example of ranked SHAP output from a non-deep learning model [3].

As a solution, some other algorithms have made it easy to tune your hyperparameters by randomized grid search or a more structured, set grid search method. There are even some algorithms that tune themselves so you do not have to worry about complicated tuning
Explaining how other algorithms work can be a lot easier. With decision trees, for example, you can easily show a yes or no, 0/1 chart that gives a simple answer for the features that lead to a prediction: yes it is raining and yes it is winter would lead to yes it is going to be cold

Overall, deep learning algorithms are useful and powerful, so there is definitely a time and place for them, but there are other algorithms that you can use instead, as we will discuss below.

When You Can Use Other Algorithms

Photo by Luca Bravo on Unsplash [4].

To be frank, there are a few go-to algorithms that can give you a great model with great results rather quickly. Some of these algorithms include Linear Regression, Decision Trees, Random Forest, XGBoost, and CatBoost.
These are simpler alternatives.

Here are examples of why you would want to use a non-deep learning algorithm, because you have so many other, simpler, non-deep learning options:

They can be easier and faster to set up. For example, deep learning can require you to add sequential, dense layers to your model and compile it, which can be more complex and take longer than simply having a regressor or classifier and fitting it with non-deep learning algorithms
I personally find that more errors can result from this more complex deep learning code, and the documentation for how to fix them can be confusing, old, or not applicable. An algorithm like Random Forest, by contrast, can have much more documentation on errors that is easy to understand
Training a deep learning algorithm may not be complicated sometimes, but when predicting from an endpoint, it might be confusing how to feed in the values to predict on, whereas with some models you can simply pass the values as an encoded list of ordered values

I would say that you can of course try out deep learning algorithms, but before you do that, it might be best to start with a simpler solution. It can depend on things like how often you will train and make predictions, or if it is a one-off task. There are some other reasons why you would not want to use a deep learning algorithm, like when you have a small dataset and small budget, as we will discuss below.

When You Have a Small Dataset and Budget

Photo by Hello I’m Nik on Unsplash [5].

Oftentimes, you can be working as a data scientist at a smaller company, or perhaps at a startup. In these cases, you would not have much data and you might not have a big budget. You would, therefore, try to avoid the use of deep learning algorithms.
Sometimes you can even have a small dataset of just a few thousand rows and a few features, in which case you could simply run an alternative model locally, rather than spending a lot of money by serving a deep learning model frequently.

Here is when you should second guess using a deep learning algorithm based on costs and data availability:

Small data availability is usually the case for a lot of companies (but is not always the case), and deep learning performs better when there is a lot of data
You might be performing a one-off task, as in the model only predicts one time, and you can run something like a simple Decision Tree Classifier locally for free (not all models will be running in production frequently). It might not be worth investing time in a deep learning model.
Your company is interested in data science applications but wants to keep the budget small. Rather than perform costly executions with a deep learning model, you can use a tree-based model with early stopping rounds to prevent overfitting, shorten training time, and ultimately reduce costs

There have been times where I brought up deep learning and it was shot down for a variety of reasons, and these reasons were usually the case. But I do not want to dissuade someone from using deep learning completely, as it is something you should use sometimes in your career, and can be something you do frequently or mainly depending on the circumstances and where you are working.

Summary

Overall, before you dive deep into deep learning, realize that there are some times when you should avoid using it for a variety of reasons. There are, of course, more reasons for avoiding it, but there are also reasons for using it too.
It is ultimately up to you to look at the pros and cons of deep learning yourself.

Here are three times/reasons when you should not use deep learning:

* When You Want to Easily Explain
* When You Can Use Other Algorithms
* When You Have Small Dataset and Budget

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree or disagree with reasons for avoiding deep learning. Why or why not? What other reasons do you think you should avoid using deep learning as a data scientist? These can certainly be clarified even further, but I hope I was able to shed some light on deep learning. Thank you for reading!

I am not affiliated with any of these companies.

Please feel free to check out my profile, Matt Przybyla, and other articles, as well as subscribe to receive email notifications for my blogs by following the link below, or by clicking on the subscribe icon on the top of the screen by the follow icon, and reach out to me on LinkedIn if you have any questions or comments.

Subscribe link: https://datascience2.medium.com/subscribe

References

[1] Photo by Nadine Shaabana on Unsplash, (2018)
[2] Photo by Malte Helmhold on Unsplash, (2021)
[3] M.Przybyla, Example of ranked SHAP output from a non-deep learning model, (2021)
[4] Photo by Luca Bravo on Unsplash, (2016)
[5] Photo by Hello I’m Nik on Unsplash, (2021)

Daniel Morales

Apr 16, 2020

Data Science

Machine Learning

Model Evaluation Metrics in Machine Learning

Predictive models have become a trusted advisor to many businesses, and for good reason. These models can “foresee the future”, and there are many different methods available, meaning any industry can find one that fits its particular challenges.

When we talk about predictive models, we are talking either about a regression model (continuous output) or a classification model (nominal or binary output). In classification problems, we use two types of algorithms (depending on the kind of output they create):

Class output: Algorithms like SVM and KNN create a class output. For instance, in a binary classification problem, the outputs will be either 0 or 1. However, today we have algorithms that can convert these class outputs to probability.
Probability output: Algorithms like Logistic Regression, Random Forest, Gradient Boosting, Adaboost, etc. give probability outputs. Converting probability outputs to class output is just a matter of creating a threshold probability.

Introduction

While preparing data and training a machine learning model are key steps in the machine learning pipeline, it’s equally important to measure the performance of the trained model.
How well the model generalizes on the unseen data is what defines adaptive vs non-adaptive machine learning models.

By using different metrics for performance evaluation, we should be in a position to improve the overall predictive power of our model before we roll it out for production on unseen data.

Without a proper evaluation of the ML model using different metrics, relying only on accuracy can lead to problems when the model is deployed on unseen data and can result in poor predictions.

This happens because, in cases like these, our models don’t learn but instead memorize; hence, they cannot generalize well on unseen data.

Model Evaluation Metrics

Let us now define the evaluation metrics for evaluating the performance of a machine learning model, which is an integral component of any data science project. Evaluation aims to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.

Confusion Matrix

A confusion matrix is a matrix representation of the prediction results of any binary testing that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

Confusion matrix with 2 class labels.

Each prediction can be one of the four outcomes, based on how it matches up to the actual value:

True Positive (TP): Predicted True and True in reality.
True Negative (TN): Predicted False and False in reality.
False Positive (FP): Predicted True and False in reality.
False Negative (FN): Predicted False and True in reality.

Now let us understand this concept using hypothesis testing.

A Hypothesis is speculation or theory based on insufficient evidence that lends itself to further testing and experimentation.
With further testing, a hypothesis can usually be proven true or false.

A Null Hypothesis is a hypothesis that says there is no statistically significant relationship between the two variables in the hypothesis. It is the hypothesis that the researcher is trying to disprove.

Ideally, we would always reject the null hypothesis when it is false, and we would accept the null hypothesis when it is indeed true.

Even though hypothesis tests are meant to be reliable, there are two types of errors that can occur.

These errors are known as Type I and Type II errors.

For example, when examining the effectiveness of a drug, the null hypothesis would be that the drug does not affect a disease.

Type I Error: equivalent to a False Positive (FP).

The first kind of error that is possible involves the rejection of a null hypothesis that is true.

Let’s go back to the example of a drug being used to treat a disease. If we reject the null hypothesis in this situation, then we claim that the drug does have some effect on a disease. But if the null hypothesis is true, then, in reality, the drug does not combat the disease at all. The drug is falsely claimed to have a positive effect on a disease.

Type II Error: equivalent to a False Negative (FN).

The other kind of error occurs when we accept a false null hypothesis. This sort of error is called a type II error and is also referred to as an error of the second kind.

If we think back again to the scenario in which we are testing a drug, what would a type II error look like?
A type II error would occur if we accepted that the drug has no effect on the disease, but in reality, it did.

A sample Python implementation of the confusion matrix.

import warnings
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

# ignore warnings
warnings.filterwarnings('ignore')

# Load the iris dataset (the file has no header row)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url, header=None)
X = df.iloc[:, 0:4]
y = df.iloc[:, 4]

# test size
test_size = 0.33
# generate the same set of random numbers
seed = 7

# Split data into train and test set.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Construct the confusion matrix (labels must be passed by keyword)
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
cm = confusion_matrix(y_test, pred, labels=labels)
print(cm)

# Plot the confusion matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm)
plt.title('Confusion matrix')
fig.colorbar(cax)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()

Confusion matrix with 3 class labels.

The diagonal elements represent the number of points for which the predicted label is equal to the true label, while anything off the diagonal was mislabeled by the classifier. Therefore, the higher the diagonal values of the confusion matrix the better, indicating many correct predictions.

In the run reported here, the classifier predicted all the 13 setosa and 18 virginica plants in the test data perfectly. However, it incorrectly classified 4 of the versicolor plants as virginica.

There is also a list of rates that are often computed from a confusion matrix for a binary classifier:

1.
AccuracyOverall, how often is the classifier correct?Accuracy = (TP+TN)/totalWhen our classes are roughly equal in size, we can use accuracy, which will give us correctly classified values.Accuracy is a common evaluation metric for classification problems. It’s the number of correct predictions made as a ratio of all predictions made.Misclassification Rate(Error Rate): Overall, how often is it wrong. Since accuracy is the percent we correctly classified (success rate), it follows that our error rate (the percentage we got wrong) can be calculated as follows:Misclassification Rate = (FP+FN)/totalWe use the sklearn module to compute the accuracy of a classification task, as shown below.#import modules import warnings import pandas as pd import numpy as np from sklearn import model_selection from sklearn.linear_model import LogisticRegression from sklearn import datasets from sklearn.metrics import accuracy_score #ignore warnings warnings.filterwarnings('ignore') # Load digits dataset iris = datasets.load_iris() # # Create feature matrix X = iris.data # Create target vector y = iris.target #test size test_size = 0.33 #generate the same set of random numbers seed = 7 #cross-validation settings kfold = model_selection.KFold(n_splits=10, random_state=seed) #Model instance model = LogisticRegression() #Evaluate model performance scoring = 'accuracy' results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring) print('Accuracy -val set: %.2f%% (%.2f)' % (results.mean()*100, results.std())) #split data X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed) #fit model model.fit(X_train, y_train) #accuracy on test set result = model.score(X_test, y_test) print("Accuracy - test set: %.2f%%" % (result*100.0))The classification accuracy is 88% on the validation set.2. 
PrecisionWhen it predicts yes, how often is it correct?Precision=TP/predicted yesWhen we have a class imbalance, accuracy can become an unreliable metric for measuring our performance. For instance, if we had a 99/1 split between two classes, A and B, where the rare event, B, is our positive class, we could build a model that was 99% accurate by just saying everything belonged to class A. Clearly, we shouldn’t bother building a model if it doesn’t do anything to identify class B; thus, we need different metrics that will discourage this behavior. For this, we use precision and recall instead of accuracy.3. Recall or SensitivityWhen it’s actually yes, how often does it predict yes?True Positive Rate = TP/actual yesRecall gives us the true positive rate (TPR), which is the ratio of true positives to everything positive.In the case of the 99/1 split between classes A and B, the model that classifies everything as A would have a recall of 0% for the positive class, B (precision would be undefined — 0/0). Precision and recall provide a better way of evaluating model performance in the face of a class imbalance. They will correctly tell us that the model has little value for our use case.Just like accuracy, both precision and recall are easy to compute and understand but require thresholds. Besides, precision and recall only consider half of the confusion matrix:4. F1 ScoreThe F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.Why harmonic mean? Since the harmonic mean of a list of numbers skews strongly toward the least elements of the list, it tends (compared to the arithmetic mean) to mitigate the impact of large outliers and aggravate the impact of small ones.An F1 score punishes extreme values more. 
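To make the last two points concrete, here is a minimal sketch in plain Python. The labels and the precision/recall pair are made-up illustrative values, not results from the article: the first part reproduces the 99/1 scenario described above, and the second compares the arithmetic and harmonic means for a lopsided model.

```python
# Illustrative data only: 99 negatives (class A) and 1 positive (class B)
y_true = [0] * 99 + [1]
y_pred = [0] * 100  # a "model" that always predicts the majority class A

# Count the four confusion-matrix cells
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# Precision is 0/0 here (no positive predictions), so report None instead
precision = tp / (tp + fp) if (tp + fp) > 0 else None
print(accuracy, recall, precision)


def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A lopsided model: perfect precision but only 10% recall
p, r = 1.0, 0.1
arithmetic_mean = (p + r) / 2
harmonic_mean = f1(p, r)
print(arithmetic_mean, harmonic_mean)
```

The always-A model scores 99% accuracy with 0% recall, and for the lopsided model the arithmetic mean is about 0.55 while the harmonic mean (the F1 score) is only about 0.18, which is the sense in which F1 punishes extreme values.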
Ideally, an F1 score could be an effective evaluation metric in the following classification scenarios:

When FP and FN are equally costly, meaning that missing true positives and raising false positives impact the model in almost the same way, as in our cancer detection classification example
When adding more data doesn't effectively change the outcome
When TN is high (as with flood predictions, cancer predictions, etc.)

A sample Python implementation of the F1 score:

    import warnings

    import pandas
    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score, f1_score

    warnings.filterwarnings('ignore')

    # Load the Pima Indians diabetes dataset (the file has no header row)
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    dataframe = pandas.read_csv(url, header=None)
    dat = dataframe.values
    X = dat[:, :-1]
    y = dat[:, -1]

    test_size = 0.33
    seed = 7
    model = LogisticRegression()

    # Split data
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model.fit(X_train, y_train)

    # Predict class labels for the test set
    pred = model.predict(X_test)

    # precision: tp / (tp + fp)
    precision = precision_score(y_test, pred)
    print('Precision: %f' % precision)

    # recall: tp / (tp + fn)
    recall = recall_score(y_test, pred)
    print('Recall: %f' % recall)

    # f1: 2 tp / (2 tp + fp + fn)
    f1 = f1_score(y_test, pred)
    print('F1 score: %f' % f1)

5. Specificity

When it's actually no, how often does it predict no?

True Negative Rate = TN / actual no

It is the true negative rate, or the proportion of true negatives to everything that should have been classified as negative.

Note that, together, specificity and sensitivity consider the full confusion matrix.

6. Receiver Operating Characteristics (ROC) Curve

Measuring the area under the ROC curve is also a very useful method for evaluating a model. By plotting the true positive rate (sensitivity) versus the false positive rate (1 - specificity), we get the Receiver Operating Characteristic (ROC) curve. This curve allows us to visualize the trade-off between the true positive rate and the false positive rate.

Good ROC curves bow toward the top-left corner of the plot. The dashed diagonal line represents random guessing (no predictive value) and is used as a baseline; anything below it is considered worse than guessing.

A sample Python implementation of the ROC curve:

    # Classification: area under the ROC curve
    import warnings

    import matplotlib.pyplot as plt
    import pandas
    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve

    warnings.filterwarnings('ignore')

    # Load the Pima Indians diabetes dataset (the file has no header row)
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    dataframe = pandas.read_csv(url, header=None)
    dat = dataframe.values
    X = dat[:, :-1]
    y = dat[:, -1]

    test_size = 0.33
    seed = 7
    model = LogisticRegression()

    # Split data
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model.fit(X_train, y_train)

    # Predict probabilities and keep those for the positive outcome only
    probs = model.predict_proba(X_test)
    probs = probs[:, 1]

    auc = roc_auc_score(y_test, probs)
    print('AUC - Test Set: %.2f%%' % (auc * 100))

    # Calculate the ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, probs)

    # Plot the no-skill baseline
    plt.plot([0, 1], [0, 1], linestyle='--')
    # Plot the ROC curve for the model
    plt.plot(fpr, tpr, marker='.')
    plt.xlabel('False positive rate')
    plt.ylabel('Sensitivity/ Recall')
    plt.show()

In the example above, the AUC is relatively close to 1 and greater than 0.5.
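To put the AUC scale in context, here is a small sketch using hand-made scores (illustrative values, not from the dataset above). It relies on a standard interpretation of AUC: the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties counted as one half. The helper name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]

perfect = pairwise_auc(y_true, [0.1, 0.2, 0.8, 0.9])        # positives always higher
uninformative = pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5])  # every pair is a tie
inverted = pairwise_auc(y_true, [0.9, 0.8, 0.2, 0.1])       # positives always lower

print(perfect, uninformative, inverted)
```

The three cases give 1.0, 0.5 and 0.0 respectively, which is why 0.5 is treated as the no-skill baseline and values below it as worse than guessing.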
A perfect classifier will have the ROC curve go along the Y-axis and then along the X-axis.

Log Loss

Log Loss is the most important classification metric based on probabilities.

As the predicted probability of the true class gets closer to zero, the loss grows without bound. For a binary label y and predicted probability p of the positive class, the loss for one sample is -(y * log(p) + (1 - y) * log(1 - p)).

It measures the performance of a classification model where the prediction input is a probability value between 0 and 1. Log loss increases as the predicted probability diverges from the actual label. The goal of any machine learning model is to minimize this value. As such, a smaller log loss is better, with a perfect model having a log loss of 0.

A sample Python implementation of the log loss:

    # Classification: log loss
    import warnings

    import pandas
    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    warnings.filterwarnings('ignore')

    # Load the Pima Indians diabetes dataset (the file has no header row)
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    dataframe = pandas.read_csv(url, header=None)
    dat = dataframe.values
    X = dat[:, :-1]
    y = dat[:, -1]

    test_size = 0.33
    seed = 7
    model = LogisticRegression()

    # Split data
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model.fit(X_train, y_train)

    # Predict class probabilities and compute the log loss
    probs = model.predict_proba(X_test)
    loss = log_loss(y_test, probs)
    print("Logloss: %.2f" % loss)

The original version of this snippet passed hard class labels from predict() to log_loss and reported Logloss: 8.02. Log loss is defined on predicted probabilities, so the version above uses predict_proba() and its output will differ.

Jaccard Index

The Jaccard index is one of the simplest ways to calculate the accuracy of a classification ML model. Let's understand it with an example. Suppose we have a labeled test set, with labels as:

y = [0,0,0,0,0,1,1,1,1,1]

And our model has predicted the labels as:

y1 = [1,1,0,0,0,1,1,1,1,1]

A Venn diagram of the labels of the test set and the labels of the predictions shows their intersection and union.

The Jaccard index, or Jaccard similarity coefficient, is a statistic used in understanding the similarities between sample sets. The measurement emphasizes the similarity between finite sample sets and is formally defined as the size of the intersection divided by the size of the union of the two labeled sets:

Jaccard Index, or Intersection over Union (IoU): J(y, y1) = |y ∩ y1| / |y ∪ y1| = |y ∩ y1| / (|y| + |y1| - |y ∩ y1|)

So, for our example, the intersection of the two sets is equal to 8 (since eight values are predicted correctly) and the union is 10 + 10 - 8 = 12. The Jaccard index therefore gives us the accuracy as:

J(y, y1) = 8 / 12 ≈ 0.67

So, the accuracy of our model, according to the Jaccard index, is about 0.67, or 67%.

The higher the Jaccard index, the higher the accuracy of the classifier.

A sample Python implementation of the Jaccard index. Note that this helper compares the sets of distinct values in the two arrays, not the labels position by position as in the example above:

    import numpy as np

    def compute_jaccard_similarity_score(x, y):
        # Size of the intersection divided by the size of the union
        intersection_cardinality = len(set(x).intersection(set(y)))
        union_cardinality = len(set(x).union(set(y)))
        return intersection_cardinality / float(union_cardinality)

    score = compute_jaccard_similarity_score(
        np.array([0, 1, 2, 5, 6]), np.array([0, 2, 3, 5, 7, 9]))
    print("Jaccard Similarity Score : %s" % score)

Jaccard Similarity Score : 0.375

Kolmogorov-Smirnov chart

The K-S, or Kolmogorov-Smirnov, chart measures the performance of classification models. More accurately, K-S is a measure of the degree of separation between the positive and negative distributions.

The cumulative frequency for the observed and hypothesized distributions is plotted against the ordered frequencies. The vertical double arrow indicates the maximal vertical difference.

The K-S is 100 if the scores partition the population into two separate groups in which one group contains all the positives and the other all the negatives. On the other hand, if the model cannot differentiate between positives and negatives, then it is as if the model selects cases randomly from the population. The K-S would be 0.

In most classification models the K-S will fall between 0 and 100, and the higher the value, the better the model is at separating the positive from the negative cases.

The K-S may also be used to test whether two underlying one-dimensional probability distributions differ. It is a very efficient way to determine if two samples are significantly different from each other.

A sample Python implementation of the Kolmogorov-Smirnov test:

    import random

    from scipy.stats import kstest

    # Number of random samples
    N = 10

    # Draw N uniform random numbers in [0, 1)
    actual = [random.random() for _ in range(N)]
    print(actual)

    # Test the sample against the standard normal distribution
    x = kstest(actual, "norm")
    print(x)

The null hypothesis used here assumes that the numbers follow the normal distribution. The test returns a statistic and a p-value. If the p-value is < alpha, we reject the null hypothesis.

Alpha is defined as the probability of rejecting the null hypothesis given that the null hypothesis (H0) is true. For most practical applications, alpha is chosen as 0.05.

Gain and Lift Chart

Gain or lift is a measure of the effectiveness of a classification model, calculated as the ratio between the results obtained with and without the model. Gain and lift charts are visual aids for evaluating the performance of classification models. However, in contrast to the confusion matrix, which evaluates models on the whole population, a gain or lift chart evaluates model performance in a portion of the population.

The higher the lift (i.e. the further up it is from the baseline), the better the model.

The following gains chart, run on a validation set, shows that with 50% of the data, the model captures 90% of the targets. Adding more data yields a negligible increase in the percentage of targets included in the model.

Gain/lift chart

Lift charts are often shown as a cumulative lift chart, which is also known as a gains chart. Therefore, gains charts are sometimes (perhaps confusingly) called "lift charts", but they are more accurately cumulative lift charts.

One of their most common uses is in marketing, to decide if a prospective client is worth calling.

Gini Coefficient

The Gini coefficient, or Gini index, is a popular metric for imbalanced class values. The coefficient ranges from 0 to 1, where 0 represents perfect equality and 1 represents perfect inequality. Here, if the value of the index is higher, then the data will be more dispersed.

The Gini coefficient can be computed from the area under the ROC curve (AUC) using the following formula:

Gini Coefficient = (2 * AUC) - 1

Conclusion

Understanding how well a machine learning model is going to perform on unseen data is the ultimate purpose behind working with these evaluation metrics. Metrics like accuracy, precision, and recall are good ways to evaluate classification models for balanced datasets, but if the data is imbalanced and there's a class disparity, then other methods like ROC/AUC and the Gini coefficient perform better in evaluating the model performance.

Well, this concludes this article. I hope you have enjoyed reading it. Feel free to share your comments, thoughts, and feedback in the comment section.

Thanks for reading!
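As a short addendum to the K-S and Gini coefficient sections above, here is a minimal sketch in plain Python showing how both can be read off the ROC curve. The labels and scores are made up for illustration, and the helper names `roc_points` and `trapezoid_area` are hypothetical. It assumes the usual relationships: Gini = 2 * AUC - 1, and the K-S statistic between the positive and negative score distributions equals the largest gap between the true positive rate and the false positive rate over all thresholds.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    # Sweep thresholds from the highest score down; tied scores share a threshold
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def trapezoid_area(x, y):
    """Area under a piecewise-linear curve via the trapezoid rule."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2 for i in range(1, len(x)))


# Illustrative labels and model scores (not from the article's datasets)
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

fpr, tpr = roc_points(y_true, scores)
auc = trapezoid_area(fpr, tpr)
gini = 2 * auc - 1
# K-S as the maximum vertical gap between TPR and FPR (0 to 1 scale)
ks = max(t - f for t, f in zip(tpr, fpr))

print("AUC: %.4f  Gini: %.4f  K-S: %.2f" % (auc, gini, ks))
```

On this toy data the AUC is 0.8125, the Gini coefficient is 0.625, and the K-S is 0.5 (50 on the 0 to 100 scale used above).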

Juan Guillermo Gómez Ramírez

Apr 16, 2020

Machine Learning

The Role of AI in Unstructured Data Mining: Challenges and Opportunities

In our fast-paced digital world, we're producing staggering volumes of data every day. This data falls into two key categories: structured, known for its order and efficiency, and unstructured, a captivating puzzle brimming with untapped potential.

In this article, we will uncover how AI confronts the complexities of unstructured data, the hurdles it faces, and the intriguing opportunities it opens up to businesses in any industry.

Understanding Unstructured Data

Unstructured data mining is the technique of extracting valuable and meaningful insights from an abundant well of unstructured data. It uncovers hidden gems of knowledge, making it a crucial pursuit in our data-rich era.

In today's digital realm, unstructured data is generated in unprecedented quantities. Billions of text documents, images, and videos come to life daily, creating a treasure trove of information just waiting for organizations to explore.

Unlocking the insights hidden within unstructured data can provide organizations with a competitive edge. This data can reveal customer sentiments, emerging trends, and valuable feedback that might otherwise go unnoticed.

The Basics of Data Mining

Data mining discovers patterns, trends, and valuable information within a dataset. It involves various techniques to extract knowledge from raw data. While it's exceptionally effective with structured data, applying data mining to unstructured data requires a unique set of skills and tools.

Unstructured Data Mining

Unstructured data mining is a method focused on the extraction of valuable information from the vast, unstructured data available. This process uncovers hidden insights, making it a valuable endeavor in today's data-driven world.

The AI Revolution

The AI revolution has given rise to an exciting era of possibilities in unstructured data mining. AI's remarkable capabilities are instrumental in taming the unstructured data landscape, and it involves a multitude of components, including:

Machine learning: enables AI systems to learn from data, make predictions, and identify patterns, enhancing data mining capabilities.
Deep learning: uses neural networks to model complex patterns in unstructured data, which is particularly valuable in image and speech recognition.
Sentiment analysis: gauges emotional tones within textual data, helping to understand public opinion and tailor strategies.
Pattern recognition: identifies recurring structures in data, aiding in image processing and text mining.
Knowledge graphs: structure data relationships, improving contextual understanding and data retrieval.
Anomaly detection: identifies outliers in data, which is essential for fraud detection and data security.

Challenges in Unstructured Data Mining

As promising as AI is at handling unstructured data, it's not without its set of challenges. Here, we delve into some of the major hurdles:

Data Quality

Unstructured data is inherently messy. It's laden with errors, inconsistencies, and biases, which makes it a challenge to extract meaningful insights from this data. AI systems need to be trained rigorously to navigate and decipher this diversity in data quality. Techniques like data cleansing, normalization, and the use of context are essential in ensuring that AI systems provide accurate results.

Scalability

As the volume of unstructured data grows, AI systems must scale to handle the influx effectively. Traditional hardware and algorithms might not be sufficient. Scalable infrastructure and distributed computing become crucial to ensuring that AI systems can process and analyze vast amounts of data efficiently.

Privacy Concerns

Mining unstructured data often raises ethical questions regarding privacy and data protection. That's why it's essential to strike the right balance between data utilization and respecting individual privacy. It's a challenge to ensure that AI systems are used responsibly and in compliance with data protection laws and regulations, such as GDPR in Europe. Techniques like anonymization and consent management play a vital role in addressing these privacy concerns.

Opportunities and Applications

AI's role in unstructured data mining has opened up a world of opportunities across various industries. Let's explore some of the most promising applications:

Customer Insights

Unstructured data, particularly data sourced from social media and customer reviews, serves as a goldmine of information on customer behavior and preferences. By leveraging AI algorithms, companies can analyze sentiments, spot emerging trends, and even forecast future buying patterns. With these insights, they can fine-tune their marketing strategies, product development, and customer service to align with their ever-evolving audience's demands.

Healthcare Diagnosis

The abundance of unstructured data found in medical records, radiological images, and wearable device data holds the key to transformative advancements. AI-powered systems that analyze this data can facilitate early disease detection and support highly individualized treatment plans, ultimately raising the standard of patient care. For example, AI expedites the process of analyzing medical images for anomalies, which can significantly reduce the time required for diagnosing and treating severe conditions.

Fraud Detection

When it comes to financial institutions, AI is a vital tool for exposing fraudulent activities that often hide within the vast volumes of unstructured transaction data. Through a meticulous examination of transaction patterns and anomalies, AI systems can rapidly pinpoint fraudulent actions, providing businesses with a robust defense against significant financial losses. The ability to detect and thwart fraud in real time provides a critical advantage and can save businesses substantial sums every year.

Conclusion

The future belongs to those who embrace the AI revolution in unstructured data mining. In this future, data isn't just information; it's the key to success. So, let's move forward, embracing this tomorrow, where possibilities are limitless and opportunities are endless.

nikos_datasource

Apr 16, 2020

Uber Has Been Quietly Assembling One of the Most Impressive Open Source Deep Learning Stacks in the Market


Jesus Rodriguez
