The best data science and machine learning articles. Written by data scientists for data scientists (and business people).

Deep learning

When to Avoid Deep Learning
Introduction

This article is intended for data scientists who may be considering deep learning algorithms and want to know more about the drawbacks of implementing these types of models in their work. Deep learning algorithms have many benefits, are powerful, and can be fun to show off. However, there are times when you should avoid them. I will discuss those times below, so keep reading if you would like a deeper dive into deep learning.

When You Want to Easily Explain

Photo by Malte Helmhold on Unsplash [2].

Because other algorithms have been around longer, they have extensive documentation, including examples and functions that make interpretability easier. The same goes for how those algorithms work themselves. Deep learning can be intimidating to data scientists for this reason as well: it can be a turn-off to use a deep learning algorithm when you are unsure how to explain it to a stakeholder.

Here are 3 examples of when you would have trouble explaining deep learning:

* When you want to describe the top features of your model: the features become hidden inputs, so you will not know what caused a certain prediction to happen, and if you need to prove to stakeholders or customers why a certain output was achieved, the model can be more of a black box
* When you want to tune your hyperparameters, like learning rate and batch size
* When you want to explain how the algorithm itself works: for example, if you were to present the algorithm to stakeholders, they might get lost, because even a simplified explanation is still difficult to understand

Here are 3 examples of how you could handle those same situations with non-deep learning algorithms:

* When you want to explain your top features, you can easily use SHAP libraries. For an algorithm like CatBoost, once your model is fitted, you can simply get the importances with feat = model.get_feature_importance() and then use summary_plot() to rank the features by feature name, so that you can present a clear plot to stakeholders (and yourself, for that matter). Example of ranked SHAP output from a non-deep learning model [3].
* Some other algorithms make it plenty easy to tune your hyperparameters with a randomized grid search or a more structured, set grid search. There are even algorithms that tune themselves, so you do not have to worry about complicated tuning
* Explaining how other algorithms work can be a lot easier. With decision trees, for example, you can show a simple yes/no (0/1) chart of the features that lead to a prediction: yes it is raining, yes it is winter, therefore yes it is going to be cold

Overall, deep learning algorithms are useful and powerful, so there is definitely a time and place for them, but there are other algorithms you can use instead, as we will discuss below.

When You Can Use Other Algorithms

Photo by Luca Bravo on Unsplash [4].

To be frank, there are a few go-to algorithms that can give you a great model with great results rather quickly. Some of these algorithms include Linear Regression, Decision Trees, Random Forest, XGBoost, and CatBoost.
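To make the decision-tree point above concrete, here is a minimal sketch (the weather features and data are made up for illustration, not from the article) showing how a fitted tree can be printed as a readable chart of yes/no rules, and how its top features can be ranked with the built-in importances, no SHAP tooling required:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [is_raining, is_winter] -> is_cold (made-up illustration)
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 0, 1, 0]  # in this tiny example, it is cold whenever it is winter

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# A plain-text chart of the yes/no splits, easy to show stakeholders
print(export_text(model, feature_names=["is_raining", "is_winter"]))

# Built-in feature ranking (a simpler stand-in for a SHAP summary plot)
print(dict(zip(["is_raining", "is_winter"], model.feature_importances_)))
```

Because the tree only needs the is_winter split to separate the toy labels, the printed rules and the importances both point at that single feature, which is exactly the kind of one-glance explanation that is hard to extract from a deep network.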
These are simpler alternatives. Here are examples of why you would want to use a non-deep learning algorithm, given that you have so many other, simpler options:

* They can be easier and faster to set up. Deep learning can require you to add sequential, dense layers to your model and compile it, which can be more complex and take longer than simply instantiating a regressor or classifier and fitting it with a non-deep learning algorithm
* I personally find that more errors result from this more complex deep learning code, and the documentation for fixing them can be confusing or outdated. An algorithm like Random Forest, by contrast, tends to have much more documentation on errors that is easy to understand
* Training a deep learning algorithm may not always be complicated, but when predicting from an endpoint, it can be confusing how to feed in values to predict on, whereas with some models you can simply pass the values as an encoded list of ordered values

You can of course try out deep learning algorithms, but before you do, it might be best to start with a simpler solution. It can depend on things like how often you will train and make predictions, or whether it is a one-off task. There are other reasons why you would not want to use a deep learning algorithm, like when you have a small dataset and a small budget, as we will discuss below.

When You Have a Small Dataset and Budget

Photo by Hello I’m Nik on Unsplash [5].

Oftentimes, you may be working as a data scientist at a smaller company, or perhaps at a startup. In these cases, you would not have much data and you might not have a big budget. You would, therefore, try to avoid the use of deep learning algorithms.
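As a rough illustration of the "easier and faster to set up" point above, fitting a non-deep-learning model is often just a couple of lines, and even hyperparameter tuning comes built in via a randomized grid search (a hedged sketch with synthetic data and arbitrary parameter ranges, not a recipe from the article):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data with a strong linear signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2 + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)

# No layers to stack or compile: instantiate, tune, fit
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Compare that with a deep learning workflow, where you would typically define an architecture, choose a loss and optimizer, compile, and then hand-tune the learning rate and batch size yourself.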
Sometimes you may even have a small dataset of just a few thousand rows and a few features; in that case, you could simply run an alternative model locally instead, rather than spending a lot of money serving a model frequently.

Here is when you should second-guess using a deep learning algorithm based on costs and data availability:

* Small data availability is the reality for a lot of companies (though not all), and deep learning performs better when there is a lot of data
* You might be performing a one-off task, where the model only predicts once, and you can run it locally for free (not all models will be running in production frequently), like a simple Decision Tree Classifier. It might not be worth investing time in a deep learning model
* Your company may be interested in data science applications but want to keep the budget small, preferring a tree-based model with early stopping rounds (to prevent overfitting, shorten training time, and ultimately reduce costs) over costly deep learning runs

There have been times when I brought up deep learning and it was shot down for a variety of reasons, usually the ones above. But I do not want to dissuade anyone from using deep learning completely, as it is something you will likely use at some point in your career, and it can be something you do frequently, or mainly, depending on the circumstances and where you are working.

Summary

Overall, before you dive deep into deep learning, realize that there are times when you should avoid it for a variety of reasons. There are, of course, more reasons for avoiding it, but there are also reasons for using it. It is ultimately up to you to weigh the pros and cons of deep learning yourself.

Here are three times/reasons when you should not use deep learning:

* When You Want to Easily Explain
* When You Can Use Other Algorithms
* When You Have a Small Dataset and Budget

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree or disagree with these reasons for avoiding deep learning. Why or why not? What other reasons do you think there are for avoiding deep learning as a data scientist? These can certainly be clarified even further, but I hope I was able to shed some light on deep learning. Thank you for reading!

I am not affiliated with any of these companies.

Please feel free to check out my profile, Matt Przybyla, and other articles, as well as subscribe to receive email notifications for my blogs by following the link below, or by clicking on the subscribe icon at the top of the screen next to the follow icon, and reach out to me on LinkedIn if you have any questions or comments.

Subscribe link: https://datascience2.medium.com/subscribe

References

[1] Photo by Nadine Shaabana on Unsplash, (2018)
[2] Photo by Malte Helmhold on Unsplash, (2021)
[3] M. Przybyla, Example of ranked SHAP output from a non-deep learning model, (2021)
[4] Photo by Luca Bravo on Unsplash, (2016)
[5] Photo by Hello I’m Nik on Unsplash, (2021)

Programming
Deep learning
Machine Learning

21 Resources for Learning Math for Data Science
This is probably one of the biggest worries of those starting out in data science: learning or refreshing math.

Image by DataSource.ai

Let’s be honest: most people didn’t do very well in math in school, maybe not even in college, and this is very scary and creates a barrier for those who want to explore this discipline called data science.

A few days ago I published a post in Towards Data Science, and right here on our blog, called “Study Plan for Learning Data Science Over the Next 12 Months”, where I gave some quarterly recommendations and placed an emphasis on studying mathematics and statistics in this first quarter, and from which I received many questions about exactly which materials I recommended. Well, this post answers those questions. But before that, I want to give you some context.

Leaving aside the factors or reasons that have led most people to hate math, it is a reality that we need it in data science. For me, one of the biggest shortcomings I found in mathematics was its apparent lack of applicability in the real world; I didn’t see a reason for intermediate and advanced mathematics, such as multivariate calculus. I confess that in school and college I didn’t like math for that reason, though I always did well, with scores and averages above the majority (especially in statistics). But I still didn’t see how I could use a derivative or a matrix in the real world. I eventually ended up as a software engineer, and once I entered the world of data science I was able to make the connection between mathematics, statistics, and the real world.

On the other hand, it is important to clarify that we do not need a master’s degree in pure mathematics to do data science projects.
As I mentioned in previous posts, there is a big debate in the community about how much math we need to do a good job as data scientists. We could say that data science is divided into two major fields of work: research and production.

By research, we mean research and development, which normally takes place within a large company (usually a tech company), or one focused on cutting-edge technology (such as medical research), or within universities. This sector has very limited job offers. The great advantage is deep knowledge of algorithms and their implementations, and the ability to create variations of existing algorithms to improve them, or even to create new machine learning algorithms. The disadvantage is the unpractical nature of the work: it is very theoretical, often with publishing papers as the only objective, and it is generally far from business use cases. For reference on this, I recently read a post on Reddit, which I recommend.

By production, we refer to the practical side of this discipline, where in your day-to-day job you’ll generally use libraries such as scikit-learn, TensorFlow, Keras, PyTorch, and others. These libraries operate like a black box: you enter data and get an output, but you don’t know in detail what happened in the process. This also has its advantages and disadvantages, but it certainly makes life much easier when putting useful models into production. What I don’t recommend is using them blindly, without the minimum mathematical foundations to understand a little of their fundamentals. That is the objective of this post: to guide you and recommend some valuable resources so that you have the necessary foundations and do not operate those libraries blindly.

So if you decide to focus on research and development, you are going to need mathematics and statistics in depth (very in-depth).
If you are going to go for the practical side, the libraries will handle most of it for you, under the hood. It should be noted that most job offers are on the practical side.

Well, after the previous remarks, it is time to define the specific topics needed to build an initial foundation in mathematics for data science.

Linear Algebra: This subject is important for working with data in vector and matrix form, acquiring the skills to solve systems of linear algebraic equations, and finding the basic matrix decompositions, along with a general understanding of their applicability.

Calculus: Here it is important to study functional maps, limits (for sequences and for functions of one and several variables), differentiation (from a single variable to multiple variables), and integration, sequentially building a foundation for basic optimization. It is also important to study gradient descent here.

Probability theory: Here you should learn about random variables, i.e. variables whose values are determined by a random experiment. Random variables are used as a model for the data generation processes we want to study. The properties of the data are deeply linked to the corresponding properties of the random variables, such as expected value, variance, and correlations.

Note: these subjects are much deeper than what I just mentioned; this is simply a guide to the topics and resources recommended for approaching mathematics in the field of data science.

Now that we have a better idea of the path we should take, let’s examine the recommended resources. We will divide them into basic, intermediate, and advanced. In the advanced section, we’ll have resources focused on deep learning.

Basics: in this first section we’ll recommend resources on the mathematical basics.
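Since the Calculus topic above singles out gradient descent, here is what it looks like in a few lines of plain Python: minimizing a simple quadratic f(x) = (x - 3)^2 by repeatedly stepping against the derivative (a toy sketch, not taken from any of the courses listed):

```python
def f(x):
    return (x - 3) ** 2

def grad(x):           # derivative of f: f'(x) = 2 * (x - 3)
    return 2 * (x - 3)

x = 0.0                # arbitrary starting point
lr = 0.1               # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)  # step in the direction of steepest descent

print(round(x, 4))     # prints 3.0, the minimum of f
```

The same idea, applied to the loss function of a model with millions of parameters instead of one variable, is what the training loops of the deep learning libraries mentioned earlier are doing under the hood.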
Mathematical thinking, algebra, and how to implement math with Python.

1- Introduction to Mathematical Thinking
Price: Free
Image by Coursera
Description: Learn how to think the way mathematicians do: a powerful cognitive process developed over thousands of years. Mathematical thinking is not the same as doing mathematics, at least not as mathematics is typically presented in our school system. School math typically focuses on learning procedures to solve highly stereotyped problems. Professional mathematicians think a certain way to solve real problems: problems that can arise from the everyday world, or from science, or from within mathematics itself. The key to success in school math is to learn to think inside the box. In contrast, a key feature of mathematical thinking is thinking outside the box, a valuable ability in today’s world. This course helps to develop that crucial way of thinking.
Link: https://www.coursera.org/learn/mathematical-thinking

2- Mathematical Foundation for AI and Machine Learning
Price: $46.99 USD
Image by Packt
Description: Artificial intelligence has gained importance in the last decade, with a lot depending on the development and integration of AI in our daily lives. The progress that AI has already made is astounding, with innovations like self-driving cars, medical diagnosis, and even beating humans at strategy games like Go and Chess. The future of AI is extremely promising, and the day we have our own robotic companions isn’t far off. This has pushed a lot of developers to start writing code for AI and ML programs. However, learning to write algorithms for AI and ML isn’t easy and requires extensive programming and mathematical knowledge. Mathematics plays an important role, as it builds the foundation for programming in these two streams. And in this course, we’ve covered exactly that.
We designed a complete course to help you master the mathematical foundation required for writing programs and algorithms for AI and ML.
Link: https://www.packtpub.com/product/mathematical-foundation-for-ai-and-machine-learning-video/9781789613209

3- Math for Programmers
Price: $47.99
Image by Manning
Description: In Math for Programmers you’ll explore important mathematical concepts through hands-on coding. Filled with graphics and more than 300 exercises and mini-projects, this book unlocks the door to interesting (and lucrative!) careers in some of today’s hottest fields. As you tackle the basics of linear algebra, calculus, and machine learning, you’ll master the key Python libraries used to turn them into real-world software applications.
Link: https://www.manning.com/books/math-for-programmers

4- Algebra 1
Price: Free
Image by Khan Academy
Link: https://www.khanacademy.org/math/algebra

5- Algebra 2
Price: Free
Image by Khan Academy
Link: https://www.khanacademy.org/math/algebra2

6- Master Math by Coding in Python
Price: $12.99
Image by Udemy
Description: You can learn a lot of math with a bit of coding! Many people don’t know that Python is a really powerful tool for learning math. Sure, you can use Python as a simple calculator, but did you know that Python can help you learn more advanced topics in algebra, calculus, and matrix analysis? That’s exactly what you’ll learn in this course. This course is a perfect supplement to your school/university math course, or for your post-school return to mathematics.

Let me guess what you are thinking: “But I don’t know Python!” That’s okay! This course is aimed at complete beginners; I take you through every step of the code. You don’t need to know anything about Python, although it’s useful if you already have some programming experience. “But I’m not good at math!” You will be amazed at how much better you can learn math by using Python as a tool to help with your courses or your independent study.
And that’s exactly the point of this course: Python programming as a tool to learn mathematics. This course is designed to be the perfect addition to any other math course or textbook that you are going through.
Link: https://www.udemy.com/course/math-with-python/

7- Introduction to Linear Models and Matrix Algebra
Price: Free
Image by edX
Description: Matrix algebra underlies many of the current tools for experimental design and the analysis of high-dimensional data. In this introductory online course in data analysis, we will use matrix algebra to represent the linear models that are commonly used to model differences between experimental units, and we will perform statistical inference on these differences. Throughout the course we will use the R programming language to perform matrix operations.

Given the diversity in the educational background of our students, we have divided the series into seven parts. You can take the entire series or the individual courses that interest you. If you are a statistician, you should consider skipping the first two or three courses; similarly, if you are a biologist, you should consider skipping some of the introductory biology lectures. Note that the statistics and programming aspects of the class ramp up in difficulty relatively quickly across the first three courses. You will need to know some basic stats for this course. By the third course we will be teaching advanced statistical concepts, such as hierarchical models, and by the fourth, advanced software engineering skills, such as parallel computing and reproducible research concepts.
Link: https://www.edx.org/course/introduction-to-linear-models-and-matrix-algebra

8- Applying Math with Python
Price: $20.99
Image by Packt
Description: Python, one of the world’s most popular programming languages, has a number of powerful packages to help you tackle complex mathematical problems in a simple and efficient way.
These core capabilities help programmers pave the way for building exciting applications in various domains, such as machine learning and data science, using knowledge from the computational mathematics domain. The book teaches you how to solve problems faced in a wide variety of mathematical fields, including calculus, probability, statistics and data science, graph theory, optimization, and geometry. You’ll start by developing core skills and learning about the packages in Python’s scientific stack, including NumPy, SciPy, and Matplotlib. As you advance, you’ll get to grips with more advanced topics in calculus, probability, and networks (graph theory). After you gain a solid understanding of these topics, you’ll discover Python’s applications in data science and statistics, forecasting, geometry, and optimization. The final chapters will take you through a collection of miscellaneous problems, including working with specific data formats and accelerating code. By the end of this book, you’ll have an arsenal of practical coding solutions that can be used and modified to solve a wide range of practical problems in computational mathematics and data science.
Link: https://www.packtpub.com/product/applying-math-with-python/9781838989750

Intermediate: in this second section we will recommend resources focused on calculus and probability.

9- Calculus 1
Price: Free
Image by Khan Academy
Link: https://www.khanacademy.org/math/calculus-1

10- Calculus 2
Price: Free
Image by Khan Academy
Link: https://www.khanacademy.org/math/calculus-2

11- Multivariable Calculus
Price: Free
Image by Khan Academy
Link: https://www.khanacademy.org/math/multivariable-calculus

12- Mathematics for Data Science Specialization
Price: Free
Image by Coursera
Description: Behind numerous standard models and constructions in data science there is mathematics that makes things work, and it is important to understand it to be successful in data science.
In this specialisation we will cover a wide range of mathematical tools and see how they arise in data science. We will cover such crucial fields as discrete mathematics, calculus, linear algebra, and probability. To make your experience more practical, we accompany the mathematics with examples and problems arising in data science and show how to solve them in Python. Each course of the specialisation ends with a project that gives you an opportunity to see how the material of the course is used in data science. Each project is directed at solving a practical problem in data science. In particular, in your projects you will analyse social graphs, predict estate prices, and uncover hidden relations in the data.
Link: https://www.coursera.org/specializations/mathematics-for-data-science

13- Practical Discrete Mathematics
Price: $24.99
Image by Packt
Description: Discrete mathematics is a field of math that deals with studying finite and distinct elements. The theories and principles of discrete math are widely used in solving complexities and building algorithms in computer science, and in computing data in data science. It helps you to understand algorithms, binary, and the general mathematics that is commonly used in data-driven tasks. Practical Discrete Mathematics is a comprehensive introduction for those who are new to the mathematics of countable objects. This book will help you get up to speed with implementing discrete math principles to take your programming skills to another level. You’ll learn the discrete math language and methods crucial to studying and describing objects and functions in branches of computer science and machine learning.
Complete with real-world examples, the book covers the internal workings of memory and CPUs, analyzes data for useful patterns, and shows you how to solve problems in network routing, encryption, and data science. By the end of this book, you’ll have a deeper understanding of discrete mathematics and its applications in computer science, and you’ll be ready to work on real-world algorithm development and machine learning.
Link: https://www.packtpub.com/product/practical-discrete-mathematics/9781838983147

14- Math for Data Science and Machine Learning: University Level
Price: $12.99
Image by Udemy
Description: In this course we will learn math for data science and machine learning, and we will also discuss its importance in the practical world. This course is a bundle of two courses, linear algebra and probability and statistics, so students will learn the complete contents of both in this 7-hour video course, which I have designed according to the needs of the students. Linear algebra and probability and statistics are usually offered to students of data science, machine learning, Python, and IT, which is why I have prepared this dual course for the different sciences. I have taught this course multiple times in my university classes; it is usually offered in two modes, as a 100-mark linear algebra paper and a 100-mark probability and statistics paper, in the same or different semesters. I usually focus on methods and examples while teaching, since examples clarify concepts for students in a variety of ways: they can grasp the main idea the instructor wants to deliver even when they find the subject’s methods difficult.
So, focusing on examples makes the course easy and understandable for students.
Link: https://www.udemy.com/course/master-linear-algebra-and-probability-2-in-1-bundle/

15- Data Science Math Skills
Price: Free
Image by Coursera
Description: Data science courses contain math, and there is no avoiding that! This course is designed to teach learners the basic math needed in order to be successful in almost any data science math course, and it was created for learners who have basic math skills but may not have taken algebra or pre-calculus. Data Science Math Skills introduces the core math that data science is built upon, with no extra complexity, introducing unfamiliar ideas and math symbols one at a time. Learners who complete this course will master the vocabulary, notation, concepts, and algebra rules that all data scientists must know before moving on to more advanced material.
Link: https://www.coursera.org/learn/datasciencemathskills

Advanced: in this last section we will focus on statistics (probability theory) and the application of mathematics to deep learning algorithms.

16- Statistics and Probability
Price: Free
Image by Khan Academy
Link: https://www.khanacademy.org/math/statistics-probability

17- Intro to Inferential Statistics
Price: Free
Image by Udacity
Description: Inferential statistics allows us to draw conclusions from data that might not be immediately obvious. This course focuses on enhancing your ability to develop hypotheses and use common tests such as t-tests, ANOVA tests, and regression to validate your claims.
Link: https://www.udacity.com/course/intro-to-inferential-statistics--ud201

18- Statistical Methods and Applied Mathematics in Data Science
Price: $124.99
Image by Packt
Description: Machine learning and data analysis are the center of attraction for many engineers and scientists, for an obvious reason: their vast application in numerous fields and booming career options.
And Python is one of the leading open-source platforms for data science and numerical computing. IPython and its associated Jupyter Notebook provide Python with efficient interfaces for data analysis and interactive visualization, and they constitute an ideal gateway to the platform. If you are among those seeking to enhance their capabilities in machine learning, then this course is the right choice.

Statistical Methods and Applied Mathematics in Data Science provides many easy-to-follow, ready-to-use, and focused recipes for data analysis and scientific computing. This course tackles data science, statistics, machine learning, signal and image processing, dynamical systems, and pure and applied mathematics. You will apply state-of-the-art methods to various real-world examples, illustrating topics in applied mathematics, scientific modeling, and machine learning. In short, you will be well versed in the standard methods of data science and mathematical modeling.
Link: https://www.packtpub.com/product/statistical-methods-and-applied-mathematics-in-data-science-video/9781789539219

19- Exploring Math for Programmers and Data Scientists
Price: Free
Image by Manning
Description: Exploring Math for Programmers and Data Scientists showcases chapters from three Manning books, chosen by author and master-of-math Paul Orland. You’ll start with a look at the nearest neighbor search problem, common with multidimensional data, and walk through a real-world solution for tackling it. Next, you’ll delve into a set of methods and techniques integral to Principal Component Analysis (PCA), an underlying technique in Latent Semantic Analysis (LSA) for document retrieval. In the last chapter, you’ll work with digital audio data, using mathematical functions in different and interesting ways. Begin sharpening your competitive edge with the fun and fascinating math in this (free!) practical guide!
Link: https://www.manning.com/books/exploring-math-for-programmers-and-data-scientists

20- Hands-On Mathematics for Deep Learning
Price: $27.99
Image by Packt
Description: Most programmers and data scientists struggle with mathematics, having either overlooked or forgotten core mathematical concepts. This book uses Python libraries to help you understand the math required to build deep learning (DL) models. You’ll begin by learning about the core mathematical and modern computational techniques used to design and implement DL algorithms. The book covers essential topics, such as linear algebra, eigenvalues and eigenvectors, the singular value decomposition, and gradient algorithms, to help you understand how to train deep neural networks. Later chapters focus on important neural networks, such as the linear neural network and multilayer perceptrons, with a primary focus on helping you learn how each model works. As you advance, you will delve into the math used for regularization, multi-layered DL, forward propagation, optimization, and backpropagation techniques, to understand what it takes to build full-fledged DL models. Finally, you’ll explore CNN, recurrent neural network (RNN), and GAN models and their applications. By the end of this book, you’ll have built a strong foundation in neural networks and DL mathematical concepts, which will help you to confidently research and build custom models in DL.
Link: https://www.packtpub.com/product/hands-on-mathematics-for-deep-learning/9781838647292

21- Math and Architectures of Deep Learning
Price: $39.99
Image by Manning
Description: Math and Architectures of Deep Learning sets out the foundations of DL in a way that’s both useful and accessible to working practitioners. Each chapter explores a new fundamental DL concept or architectural pattern, explaining the underpinning mathematics and demonstrating how it works in practice with well-annotated Python code.
You’ll start with a primer on basic algebra, calculus, and statistics, working your way up to state-of-the-art DL paradigms taken from the latest research. By the time you’re done, you’ll have the combined theoretical insight and practical skills to identify and implement a DL architecture for almost any real-world challenge.
Link: https://www.manning.com/books/math-and-architectures-of-deep-learning

Conclusion

This is an extensive set of recommendations on resources for learning mathematics for data science, following the previous post about the path to follow in 2021 to learn data science. When we have limited time for study, we should select the resources that suit us best and that fit our style. For example, you might prefer videos over books, so go ahead and choose what suits you best. This material is sufficient whether you want to take a brief look at the mathematics or go deeper into it. I hope you find it useful.

If you have other recommendations for courses, books, or videos, please leave them in the comments so that we can all create links of interest.

Note: we are building a private Slack community of data scientists; if you want to join us you can register here: https://www.datasource.ai/en#slack

I hope you enjoyed this reading! You can follow me on Twitter or LinkedIn. Thanks for reading!

Deep learning
Machine Learning

4 Superpowers That Will Make You Indispensable In a Data Science Career
Learn about the biggest challenges in the data science industry to avoid stagnation in your career.

The challenges people encounter in a data science career are far more serious than the ones they face while getting into it. Often, there is a big mismatch between job expectations and actual responsibilities. Even if you're lucky enough to work in areas you aspire to, collaborating with other roles in a data science project can be a real struggle. It might be easier to get a tooth extracted than to cope with the daily demands of your project manager. Get all of this right, and you might still discover that your solution is untouched by users. "Why wouldn't anyone understand or use something that's so obvious," you might wonder.

All of this can lead to an existential crisis early on in your data science career. The risk of career stagnation is high for many professionals in this field. How do you tackle it? I'll share the 4 big, hairy challenges in data science projects that cause many of your personal struggles. Based on learnings from our work at Gramener, we'll discuss what they mean to you, and how you can smash them to become indispensable in your project and in the data science industry.

1. Sharpen your ability to handle messy data

Photo by Karim MANJRA on Unsplash

Poor data quality is one of the top challenges in data science. Bad data costs organizations over $15 million annually. You need clean, structured data to come up with big, useful, and surprising insights. Fancy using deep learning techniques? Then you'll need a lot more data, and it must be neatly labeled.

In data science, 80% of the time is spent preparing data, and the other 20% on complaining about it! — Kirk Borne

You must pick up the skills to discover the data that your business problem needs. Learn how to curate and transform the data for analysis. Yes, data cleaning is very much a data scientist's job. Play with data and get your hands dirty.
You'll develop an eye for spotting anomalies, and patterns will start jumping out at you.

Let's say your project's intent is to analyze customer experience. The first task is to scout for all potential data assets, such as customer profiles, transactions, surveys, and social activities. Any of these that don't map to your business problem must be dropped. Inspect and clean the data, and you will lose some more. Do this for weeks or months, and then you are ready for your analysis!

2. Learn the techniques and don't worry about the tools

The Data and AI Landscape by Matt Turck. Can't read this chart? Don't worry, that's beside the point!

The data science industry is crowded with hundreds of tools. No one tool covers the entire workflow. Every week, brilliant new tools get created, and a dozen go out of business or get bought out. Companies spend millions on enterprise licenses, only to find that they aren't as compelling anymore. This fragmented ecosystem poses a big challenge for aspirants. A top question I often get asked is, "Should I learn Python or R? PowerBI or D3?" I always say that the tool really does not matter. Learn the technique like the back of your hand. You can always transfer your learning from one tool to another in weeks.

The tool really does not matter. It is the person's skill with a tool that counts.

For example, to master visualization, don't start with the tools. Learn the principles of information design, the basics of visual design, and color theory. Then get some real data and internalize the techniques by solving problems. Any visualization tool you can get your hands on will do. Don't over-optimize.

3. Master the application of techniques to solve real-world problems

Photo by Olav Ahrens Røtne on Unsplash

Over 80% of data science projects fail. Wonder why? There are challenges throughout the lifecycle: from picking the wrong business problem to framing an incorrect solution approach, from choosing the wrong techniques to a failure in translating them to users.
Every role in data science contributes to these misses. No, most of these gaps aren't technical.

Most data science projects don't deliver business ROI because they solve the wrong problems.

What's the common thread here? It is the poor application of skills to business problems. For example, when data scientists just want to build great models but don't pay attention to their users' needs, it hurts projects. Don't stop with an intuition of a technique, or the math behind it. Find where it's relevant and what it takes to apply it. Get invested in solving users' problems.

Let's say you've mastered a dozen forecasting techniques. Which one would you pick when your user needs tomorrow's price to make her trade, but has just 1 past data point? Does it change if you have 100 or 10,000 points? What if she just needs to know whether to 'hold' or 'sell at the market price'?

4. Go beyond data and analytics skills to succeed in data analytics!

Companies often hire just for machine learning skills. They invest in data engineering and may get some training organized in visualization and data literacy. But this team is imbalanced and will deliver sub-optimal results. Every data science team must have 5 skills for effective project outcomes.

The 5 roles and skills critical to delivering value in data science (Comics: www.gramener.com/comicgen)

If you're playing one of these roles, should you care about this? Absolutely. Here's how you can increase your influence in any project. Master one skill as your core area; this is your primary role. Invest in and learn a secondary skill; you should be able to step in as a backup and provide support on this one. What about the other three? Pick up a broad familiarity. You must be able to relate to them, understand their pain areas, and connect them back to your work.
Do this and you'll be worth your weight in gold!

You need a lot more than data and analytics skills to succeed in the data analytics industry.

A key takeaway here is that there are 5 roles in a data science career, not just 'data scientist'. Let's say you're an ML engineer. Your secondary skill can be information design. Learn about charts and how to choose the right one. Find out what users look for in visuals and what that means for the UI you're building.

It's now time to make yourself indispensable

Every single project in data science faces these four challenges. Organizations lose millions due to failed data science investments. Clients are worried because their business problems remain unsolved. Data science leaders and managers freak out because the failure rate of projects is insane. All of this often translates to excess demands and high pressure on data science professionals.

Understanding these big-picture challenges is a great starting point for you. Empathize with your project team and leaders. The four tips you've learned here will equip you to tackle the challenges head-on. Start practicing them and you'll see greater trust in and acceptance of your work. Soon, you'll become indispensable and rise faster in your career. Good luck smashing these challenges in your project!

Found these suggestions useful? Have any more tips to tackle these challenges? Add them to the comments. Stay in touch with me on LinkedIn and Twitter.

Title photo by Steven Libralon on Unsplash.

Deep learning
Data Science
Machine Learning

Google Believes Machine Learning Frameworks Need Five Key Things to Reach Mainstream Developers
Google Research conducted a detailed survey among TensorFlow.js developers to determine the key elements that streamline the adoption of machine learning frameworks.

Machine learning (ML) could be the most important element of the next generation of software applications, and yet its usage is constrained to highly skilled developers. Unlike previous technology trends, machine learning is not showing a clear transition path into mainstream developer adoption. Part of that friction is due to the fact that most machine learning frameworks still require a very high entry point in terms of computer science knowledge. Recently, Google Research published a paper that outlined five key principles to break that friction point and design machine learning frameworks that can be adopted by a broader range of developers.

Tinkerers vs. Scientists: The Impostor Syndrome in Machine Learning

In a famous article published in 1978, Dr. Pauline R. Clance described a phenomenon in which accomplished individuals experience self-perceived intellectual phoniness, or the feeling of being a fraud. Dr. Clance referred to this phenomenon as the "impostor syndrome", which has become a common term in modern psychology. The impostor syndrome is present across all sorts of areas of modern society, but it is especially prevalent in fields that demand high intellectual skill. I believe a version of that phenomenon is negatively influencing the adoption of machine learning technologies, as many developers believe they don't have the necessary mathematical and computer science skills to jump into the space.

The history of software development technologies has been a constant quest to lower the entry point and increase adoption. From the early programming languages to recent movements such as mobile development, software development technologies have undergone several simplification cycles to attract broader groups of developers.
Since the early days of programming languages and graphical interfaces, most software development stacks have created layers upon layers of abstractions that hide the underlying computer science details. Part of the massive developer adoption experienced by mobile and web technologies relied on attracting developers, many of whom are "tinkerers" and doers without a computer science background. That path towards simplification doesn't seem very obvious in the case of machine learning.

Google's research correctly uncovered that many developers fear getting into machine learning technologies because they lack the mathematical skills required. Despite plenty of evidence of self-taught hackers tinkering with machine learning models, most developers see machine learning specialists as people with deep knowledge of linear algebra and statistics; a textbook definition of the impostor syndrome. In my opinion, there are several factors that are not helping to mitigate this challenge:

1) Mathematical terminology in ML framework documentation: Today, the documentation of most machine learning frameworks reads like a math class rather than a developer framework.

2) Acceleration of artificial intelligence (AI) research: The rapid growth of AI research is a fascinating thing. However, machine learning frameworks are regularly forced to adopt new research techniques that lack a proper developer experience.

3) Lack of ML lifecycle management tools: The process of debugging or testing machine learning models feels like a never-ending exercise of tuning hundreds of obscure parameters that are not understood by mainstream developers. The existing toolset in the machine learning space is still too limited for mainstream adoption.

Google's Five Principles for Better ML Frameworks

Designing better machine learning frameworks to streamline developer adoption is one of the pivotal challenges of the next decade of the AI space.
Based on their research, Google identified five key aspects that machine learning framework designers should consider to lower the entry point for developers:

1) Demystify Mathematical and Algorithmic Concepts

Despite ongoing efforts to lower the barriers to machine learning, and despite a preponderance of ML newcomers aspiring to learn, surprisingly many developers perceived current resources to be intended for more advanced audiences. New machine learning frameworks should abstract these mathematical models into digestible, practical concepts. For instance, most developers have no idea what gradient descent is, but they can certainly understand tuning the learning rate of a model.

2) Support Learning by Doing

Today, most tutorials that highlight best practices in machine learning require substantial lines of code. Abstracting best practices into small code blocks that can be used by beginner developers is essential to help developers who prefer a "learning by doing" approach to master a new technology.

3) Support Re-use and Modification of Pre-made ML Models

Complementing the previous point, it seems important to include canonical models that could help developers casually become aware of machine learning idiosyncrasies from within their existing programming workflow. In addition to lowering the entry point for developers, this can also help enforce consistency across programs.

4) Synthesize ML Best Practices into Just-in-Time Hints

Decisions such as which parameter to fine-tune first or how many layers to use in a model can require years of machine learning development experience. Building better visualization and interpretability tools that surface strategic pointers can help developers overcome this challenge.

5) Emphasize and Support the Experimental Nature of ML

The lifecycle of machine learning applications is different from previous technology trends. Specifically, machine learning applications require heavy experimentation and trial and error.
Even with the best model for a specific scenario, dozens and dozens of experiments are required to ensure its optimal performance. Facilitating the experimentation process should be another key focus of future machine learning frameworks.

These are some of the recommendations Google Research believes are required to streamline the adoption of machine learning frameworks. Ultimately, new developer stacks need to balance incorporating new research methods with a simple developer experience that facilitates adoption by a broader range of developers.
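Principle 1 can be made concrete with a toy example. A developer does not need the calculus behind gradient descent to grasp what a learning rate does; a few lines of plain Python make its effect visible (the function and values below are purely illustrative):

```python
def gradient_descent(lr, steps=100, x=0.0):
    """Minimize f(x) = (x - 3)^2 by repeatedly stepping against the gradient."""
    for _ in range(steps):
        grad = 2 * (x - 3)  # derivative of (x - 3)^2
        x -= lr * grad      # the learning rate scales every update
    return x

# A moderate learning rate converges to the minimum at x = 3 ...
print(round(gradient_descent(lr=0.1), 4))                 # 3.0
# ... while one that is too large overshoots on every step and diverges.
print(abs(gradient_descent(lr=1.1, steps=20) - 3) > 100)  # True
```

This is exactly the kind of digestible framing the paper advocates: "learning rate" becomes a knob with observable behavior rather than a term from an optimization textbook.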

Deep learning
Machine Learning
Data Science

Numpy & OpenCV In Action
NumPy is a powerful open-source Python library, widely used for mathematical and statistical calculations. This is because many of its operations are based on vectorized calculations over multidimensional arrays. Other important libraries, like Pandas and OpenCV, also make use of NumPy. In this post we will see a bit of the efficiency of NumPy when selecting data arranged in rows and columns. Additionally, we will talk about OpenCV, which is widely used in image processing, machine learning, and computer vision, and we will see how it relates to NumPy. Finally, we will work through basic operations with OpenCV and see how we can make use of NumPy in image processing.

REQUIREMENTS

To start working with NumPy and OpenCV we must first have Python installed on the computer. In the link below you will find the download page for Python: https://www.python.org/downloads/

Once Python is installed, we can proceed to get NumPy and OpenCV as follows.

To get NumPy, just run "pip install numpy" in the command line of your computer. In the link below you can find a brief description: https://pypi.org/project/numpy/

Similarly for OpenCV, we can run "pip install opencv-python". You can see the link below for a brief description: https://pypi.org/project/opencv-python/

Another alternative is to download Anaconda, in which all these packages (NumPy, OpenCV, etc.) come pre-installed. To download Anaconda, go to: https://www.anaconda.com/distribution/

FIRST STEPS WITH NUMPY

The first step is to import the library using the keyword "import", as follows:

import numpy as np

With NumPy imported, we can create a list of numbers between 0 and 10 that denote the scores of a survey for a new product launched in the market.
The idea is to select the scores using plain Python and then NumPy, and see the difference:

- Selection of a single element
- Selection of a column
- Selection of part of a column

As we can see, plain Python relies heavily on for-loops for selection, which takes time and resources because it handles one element at a time. NumPy, on the other hand, uses vectors to make a faster selection, and its mathematical operations are likewise based on vectorization.

Basic operations with NumPy

Below we download a dataset that presents the number of deaths from malaria by year and continent. From this data, we will see how to extract the information to be analyzed, and then apply basic NumPy operations. The dataset was downloaded from the following link: https://ourworldindata.org/malaria#malaria-death-rates

- First, we import the CSV where the information is contained, as a list.
- Transform the list into a NumPy array.
- Then, we extract the malaria deaths that occurred in 2000 for the different continents.
- We can check this by doing it with Python and then see the difference.
- As we can see, we got the same values. However, these values are strings, so we now convert them to integers.
- Now that they are integers, we can add up the values and get the total number of malaria deaths in the year 2000.
- Also, get the average.
- And the maximum and minimum values.

Let's talk a little bit about OpenCV

OpenCV is a powerful library for working with computer vision, machine learning, and image processing, among many other applications. A very interesting point of this library is that it uses NumPy to handle all of its array structures, which makes certain complex operations much easier since NumPy handles math, statistics, and vectors very well. Another plus is the integration with other NumPy-based libraries like SciPy and Matplotlib.
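The operations just described can be sketched as follows (the scores and death counts are illustrative, not the article's actual data):

```python
import numpy as np

# Survey scores (0-10) for a newly launched product.
scores = [7, 9, 3, 10, 5, 8, 2, 9]

# Plain Python: a loop inspects one element at a time.
high_py = [s for s in scores if s >= 8]

# NumPy: the same selection as a single vectorized expression.
arr = np.array(scores)
high_np = arr[arr >= 8]

print(high_py)           # [9, 10, 8, 9]
print(high_np.tolist())  # [9, 10, 8, 9]

# Values read from a CSV arrive as strings; cast them before aggregating.
deaths_2000 = np.array(["120", "45", "300"]).astype(int)
print(deaths_2000.sum())                     # 465
print(deaths_2000.mean())                    # 155.0
print(deaths_2000.max(), deaths_2000.min())  # 300 45
```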
For more information about OpenCV and its relationship with NumPy, see the following link: https://docs.opencv.org/master/d0/de3/tutorial_py_intro.html

Next, we will see how to import OpenCV, read an image with it, and display the image with Matplotlib. Based on OpenCV, NumPy, and Matplotlib we will develop each item listed below:

- Import OpenCV and Matplotlib.
- Read an image with OpenCV, in grayscale and with the RGB filter.
- Check the type and dimensions of the image (grayscale and RGB).

For the grayscale image: in this section we can see that a grayscale image is a 2-dimensional NumPy array of type "uint8", which means it is composed of values between 0 and 255. These two extremes correspond to the color black for 0 and white for 255. An image with RGB filters (red, green, blue), on the other hand, is a 3-dimensional NumPy array.

Now that we know this, we can work with the images as we did with the 2-dimensional NumPy arrays. But first, let's see what the images look like. For the RGB image, we can see how each filter looks.

Now let's play with the 2-dimensional images:

- We can show the top half of the grayscale image, and then the lower half.
- With NumPy we can also flip the images, and there is an option to rotate the image 90 degrees.
- Finally, let's take the negative of an image with NumPy.

Arithmetic operations on images using NumPy and OpenCV

Let's import a new image and adjust its size to match the first imported image, with a shape of (600, 571). Let's see the addition of images with OpenCV, and then with NumPy. When using NumPy, the results are based on modular arithmetic and the images do not look very good. Using OpenCV is much better, because it is based on saturation at either 255 or 0 depending on the operation. To understand this distinction between modular and saturated arithmetic, simply note that NumPy takes the result modulo 256, whether the operation is addition or subtraction.
OpenCV, however, uses saturation: it clips the result at 255 for addition and at 0 for subtraction, so the images look cleaner. For subtraction, the same contrast appears: first with NumPy, then with OpenCV.

You can take the link below as a reference for arithmetic operations with OpenCV and NumPy: https://docs.opencv.org/master/d0/d86/tutorial_py_image_arithmetics.html
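The modular vs. saturated contrast described above can be sketched with NumPy alone (cv2.add and cv2.subtract are the OpenCV counterparts; the pixel values are illustrative):

```python
import numpy as np

# Two "pixels" stored as 8-bit unsigned integers, the usual image dtype.
a = np.array([200, 50], dtype=np.uint8)
b = np.array([100, 80], dtype=np.uint8)

# Plain NumPy addition is modular: 200 + 100 = 300 wraps around to 44.
print((a + b).tolist())  # [44, 130]

# Saturated addition, which is what cv2.add(a, b) does: clip at 255.
sat_add = np.clip(a.astype(int) + b.astype(int), 0, 255).astype(np.uint8)
print(sat_add.tolist())  # [255, 130]

# Saturated subtraction, as in cv2.subtract(b, a): clip at 0.
sat_sub = np.clip(b.astype(int) - a.astype(int), 0, 255).astype(np.uint8)
print(sat_sub.tolist())  # [0, 30]
```

This is why the saturated images look cleaner: an overexposed region pins at white (255) instead of wrapping around to a dark value.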

Machine Learning
Deep learning

Uber Has Been Quietly Assembling One of the Most Impressive Open Source Deep Learning Stacks in the Market
Let's look at some of Uber's top machine learning open source projects.

Artificial intelligence (AI) has been an atypical technology trend. In a traditional technology cycle, innovation typically begins with startups trying to disrupt industry incumbents. In the case of AI, most of the innovation in the space has been coming from the big corporate labs of companies like Google, Facebook, Uber, or Microsoft. Those companies are not only leading impressive research efforts but also regularly open sourcing new frameworks and tools that streamline the adoption of AI technologies. In that context, Uber has emerged as one of the most active contributors to open source AI technologies in the current ecosystem. In just a few years, Uber has regularly open sourced projects across different areas of the AI lifecycle. Today, I would like to review a few of my favorites.

Uber is a near-perfect playground for AI technologies. The company combines all the traditional AI requirements of a large scale tech company with a front row seat to AI-first transportation scenarios. As a result, Uber has been building machine/deep learning applications across widely diverse scenarios, ranging from customer classification to self-driving vehicles. Many of the technologies used by Uber teams have been open sourced and received accolades from the machine learning community. Let's look at some of my favorites.

Note: I am not covering technologies like Michelangelo or PyML, as they are already well documented, having been open sourced earlier.

Ludwig: A Toolbox for No-Code Machine Learning Models

Ludwig is a TensorFlow-based toolbox that allows users to train and test deep learning models without the need to write code.
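In practice, Ludwig's no-code workflow boils down to a declarative model definition. A hypothetical sketch of such a YAML file is shown below (the input_features/output_features layout follows Ludwig's documentation; the feature names and values are illustrative):

```yaml
# Illustrative Ludwig model definition: train a text classifier
# from a CSV with a "review" column and a "sentiment" column.
input_features:
  - name: review
    type: text
output_features:
  - name: sentiment
    type: category
training:
  epochs: 10
```

With a definition like this, a single `ludwig train` command pointed at the CSV and this file trains the model without any Python code.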
Conceptually, Ludwig was created under five fundamental principles:

· No coding required: no coding skills are required to train a model and use it for obtaining predictions.
· Generality: a new data-type-based approach to deep learning model design makes the tool usable across many different use cases.
· Flexibility: experienced users have extensive control over model building and training, while newcomers will find it easy to use.
· Extensibility: easy to add new model architectures and new feature data types.
· Understandability: deep learning model internals are often considered black boxes, but Ludwig provides standard visualizations to understand their performance and compare their predictions.

Using Ludwig, a data scientist can train a deep learning model by simply providing a CSV file that contains the training data as well as a YAML file with the inputs and outputs of the model. Using those two inputs, Ludwig performs a multi-task learning routine to predict all outputs simultaneously and evaluate the results. Under the covers, Ludwig provides a series of deep learning models that are constantly evaluated and can be combined in a final architecture. The Uber engineering team explains this process by using the following analogy: "if deep learning libraries provide the building blocks to make your building, Ludwig provides the buildings to make your city, and you can choose among the available buildings or add your own building to the set of available ones."

Pyro: A Native Probabilistic Programming Language

Pyro is a deep probabilistic programming language (PPL) released by Uber AI Labs. Pyro is built on top of PyTorch and is based on four fundamental principles:

· Universal: Pyro is a universal PPL; it can represent any computable probability distribution. How?
By starting from a universal language with iteration and recursion (arbitrary Python code), and then adding random sampling, observation, and inference.

· Scalable: Pyro scales to large data sets with little overhead above hand-written code. How? By building modern black box optimization techniques, which use mini-batches of data, to approximate inference.

· Minimal: Pyro is agile and maintainable. How? Pyro is implemented with a small core of powerful, composable abstractions. Wherever possible, the heavy lifting is delegated to PyTorch and other libraries.

· Flexible: Pyro aims for automation when you want it and control when you need it. How? Pyro uses high-level abstractions to express generative and inference models, while allowing experts to easily customize inference.

These principles often pull Pyro's implementation in opposite directions. Being universal, for instance, requires allowing arbitrary control structure within Pyro programs, but this generality makes it difficult to scale. In general, however, Pyro achieves a brilliant balance between these capabilities, making it one of the best PPLs for real-world applications.

Manifold: A Debugging and Interpretation Toolset for Machine Learning Models

Manifold is Uber's technology for debugging and interpreting machine learning models at scale.
With Manifold, the Uber engineering team wanted to accomplish some very tangible goals:

· Debug code errors in a machine learning model.
· Understand the strengths and weaknesses of one model, both in isolation and in comparison with other models.
· Compare and ensemble different models.
· Incorporate insights gathered through inspection and performance analysis into model iterations.

To accomplish those goals, Manifold segments the machine learning analysis process into three main phases: Inspection, Explanation, and Refinement.

· Inspection: In the first part of the analysis process, the user designs a model and attempts to investigate and compare the model outcome with other existing ones. During this phase, the user compares typical performance metrics, such as accuracy, precision/recall, and the receiver operating characteristic (ROC) curve, to get coarse-grained information on whether the new model outperforms the existing ones.

· Explanation: This phase of the analysis process attempts to explain the different hypotheses formulated in the previous phase. It relies on comparative analysis to explain some of the symptoms of the specific models.

· Refinement: In this phase, the user attempts to verify the explanations generated in the previous phase by encoding the knowledge extracted from them into the model and testing the performance.

Plato: A Framework for Building Conversational Agents at Scale

Uber built the Plato Research Dialogue System (PRDS) to address the challenges of building large scale conversational applications. Conceptually, PRDS is a framework to create, train, and evaluate conversational AI agents in diverse environments.
From a functional standpoint, PRDS includes the following building blocks:

· Speech recognition (transcribe speech to text)
· Language understanding (extract meaning from that text)
· State tracking (aggregate information about what has been said and done so far)
· API call (search a database, query an API, etc.)
· Dialogue policy (generate the abstract meaning of the agent's response)
· Language generation (convert abstract meaning into text)
· Speech synthesis (convert text into speech)

PRDS was designed with modularity in mind, in order to incorporate state-of-the-art research in conversational systems and to continuously evolve every component of the platform. In PRDS, each component can be trained either online (from interactions) or offline and then incorporated into the core engine. From the training standpoint, PRDS supports interactions with both human and simulated users. The latter are common for jumpstarting conversational AI agents in research scenarios, while the former are more representative of live interactions.

Horovod: A Framework for Training Deep Learning Models at Scale

Horovod is the part of the Uber ML stack that has become extremely popular within the community, and it has been adopted by research teams at AI powerhouses like DeepMind and OpenAI. Conceptually, Horovod is a framework for running distributed deep learning training jobs at scale. Horovod leverages message passing interface (MPI) stacks such as OpenMPI to enable a training job to run on a highly parallel and distributed infrastructure without any modifications.
Running a distributed TensorFlow training job with Horovod is accomplished in four simple steps:

1. hvd.init() initializes Horovod.
2. config.gpu_options.visible_device_list = str(hvd.local_rank()) assigns a GPU to each of the TensorFlow processes.
3. opt = hvd.DistributedOptimizer(opt) wraps any regular TensorFlow optimizer with the Horovod optimizer, which takes care of averaging gradients using ring-allreduce.
4. hvd.BroadcastGlobalVariablesHook(0) broadcasts variables from the first process to all other processes to ensure consistent initialization.

Uber AI Research: A Regular Source of AI Research

Last but not least, we should mention Uber's active contributions to AI research. Many of Uber's open source releases are inspired by their research efforts. The Uber AI Research website is a phenomenal catalog of papers that highlight Uber's latest efforts in AI research.

These are some of the contributions of the Uber engineering team that have seen regular adoption by the AI research and development community. As Uber continues implementing AI solutions at scale, we should see new and innovative frameworks that simplify the adoption of machine learning by data scientists and researchers.
