Data science encompasses a variety of terms. Not only beginners but also seasoned professionals can stumble upon a term and wonder what it means. Our memory is not perfect, but we can always go back and refresh it. For this reason, I decided to create the ultimate glossary for data scientists. Some of the concepts here are used more often than others, but something you won't use today may be exactly what you're looking for tomorrow, right? Having a clear picture in your mind is always beneficial. So let's get started!
Data Science Terms: An A-Z Guide
Please note that all terms are sorted alphabetically, so some entries assume a basic understanding of related concepts. In other cases, be prepared to absorb a lot of new things :)
Although all concepts are arranged alphabetically, this one takes a very logical first position. Algorithms are the basis for everything: the elementary building blocks of any procedure, and of computer programs in particular. A program, after all, is a set of instructions we give a computer so it can take values and manipulate them as we need.
Data science != programming, but we use lots of algorithms for statistics and machine learning — Principal Component Analysis, K-Means Clustering, Support Vector Machines, and more. A full explanation of all of them can be found in my previous article — Top 10 Machine Learning Algorithms for Data Science.
Artificial Intelligence (AI)
The human mind can solve tasks of various complexity. What about machines? They are also learning to do so, and this ability is called Artificial Intelligence because it strives to replicate human intelligence. But not replace it! Systems that use this ability are called AI-powered programs, and they involve the use of machine learning algorithms, statistics, data science techniques, and so on.
Artificial intelligence and machine learning are related. However, ML is a subset of AI, not vice versa. AI is the ability of machines to learn, and this ability is constantly evolving; ML is the technical part of that evolution, a set of algorithms. AI has a variety of applications: speech recognition, decision-making, language translation, object classification, and so on.
Big data is not so much about the data itself as about the stunning progress in statistical and other methods of extracting insights from it. Big data rests on a simple principle: the more you know about a particular subject, the more reliably you can reach a new understanding and predict what will happen in the future.
This is done through a process of building models from the data we collect, extracting insights from them, and simulating future scenarios, adjusting the values of the data points each time.
This process is automated: modern analytics technologies will run millions of these simulations, adjusting all possible variables until they find a model, or idea, that helps solve the problem at hand. The famous three Vs of big data are Volume, Velocity, and Variety.
Classification is a supervised machine learning technique. What does "supervised" mean? It is called supervised learning because the process of an algorithm learning from a training dataset can be thought of as a teacher supervising the learning process: we know the correct answers, the algorithm iteratively makes predictions on the training data, and the teacher corrects it.
Okay, so supervised learning can be further grouped into regression and classification problems. A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”. For example, it can be used to determine if a customer is likely to spend over $20 online, based on their similarity to other customers who have previously spent that amount.
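As a toy illustration of the $20 example above, here is a minimal sketch using scikit-learn's k-nearest-neighbors classifier; the spend figures and labels are invented for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: each customer's past online spend in dollars,
# labeled by whether a later purchase exceeded $20.
X = [[5], [10], [15], [25], [30], [40]]
y = ["under20", "under20", "under20", "over20", "over20", "over20"]

# The classifier assigns a new customer the majority category among the
# 3 most similar (closest-spending) training customers.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[35]]))  # a $35 spender resembles the "over20" group
```

The output here is a category, not a number — exactly what makes this a classification problem rather than regression.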
Covariance is a measure of how changes in one variable are associated with changes in a second variable.
When two sets of data are strongly linked together we say they have a High Correlation. The word Correlation is made of Co- (meaning “together”), and Relation. Correlation is Positive when the values increase together, and it is Negative when one value decreases as the other increases.
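Both ideas are easy to check with NumPy; the toy numbers below are chosen so the relationship is perfectly linear:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])  # y moves exactly with x

cov = np.cov(x, y)[0, 1]        # positive: the variables increase together
corr = np.corrcoef(x, y)[0, 1]  # +1.0: a perfect positive correlation

print(cov, corr)
```

Note that correlation is just covariance rescaled to the range [-1, 1], which makes it easier to compare across datasets.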
Clustering techniques attempt to collect and categorize sets of points into groups that are “sufficiently similar,” or “close” to one another. “Close” varies depending on how you choose to measure distance. Complexity increases as more features are added to a problem space.
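A quick sketch with scikit-learn's KMeans, assuming Euclidean distance as the measure of "close"; the points are invented so that two groups are obvious:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two visually obvious groups
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

# Ask k-means for two clusters; here "close" means Euclidean distance
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)  # the first three points share one label, the last three another
```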
Data analysis is the little brother of data science. It is focused more on answering questions about the present and the past, uses less complex statistics, and generally tries just to identify patterns. It is more about procedures performed on data, like cleaning, transforming, and modeling. Data science is broader than that, because it's about making predictions, extracting insights, and other things.
“A scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him.”
–Gordon Lindsay Glegg
At its heart, a data engineer is a hybrid of sorts between a data analyst and a data scientist, typically in charge of managing data workflows, pipelines, and ETL processes. So, it is all about the back end. Data engineers build systems for data scientists to conduct their analysis. A data scientist may also be a data engineer. In larger groups, engineers can focus solely on speeding up analysis and keeping data well organized and easy to access.
Read more about it: Who Is a Data Engineer & How to Become a Data Engineer?
Well, here is the main thing that made so much noise. Predictions, insights, manipulations with data, turning messy and disparate data into understandable material — all this is done by a data science badass.
What does a day in the life of a data scientist look like? He or she is responsible for building the data foundation, performing robust analytics, running experiments, building machine learning pipelines and personalized data products, and finally gaining a better understanding of the business.
Read more: A Beginner’s Guide To Data Science
A significant part of the data science routine is visualizing what you do. Do you understand statistics, machine learning algorithms, SQL, Python? Great, if yes. But the vast majority of customers are far removed from these terms, yet they still need to know what is going on. Infographics, traditional plots, or even full data dashboards are what data visualization is all about. The ability to use it, and to translate complex procedures into simple terms, is a real art!
Data exploration is another part of the data science process. Here a scientist usually asks basic questions that help to understand the context of a dataset. Exploring means investigating. What you learn during the exploration phase will guide the more in-depth analysis later, and it helps you recognize when a result might be surprising and warrant further investigation.
Data mining is the process of pulling actionable insight out of a set of data and putting it to good use. This includes everything from cleaning and organizing the data, to analyzing it to find meaningful patterns and connections, to communicating those connections in a way that helps decision-makers improve their product or organization.
A data warehouse is a system used to do quick analysis of business trends using data from many sources. They’re designed to make it easy for people to answer important statistical questions without a Ph.D. in database architecture.
You can't be a badass data scientist without trees. Decisions are what our tasks are all about, and decision trees are a great tool for building predictive models. As the name suggests, the visual model for the decision-making process is a tree. It's widely used in data mining and machine learning.
A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from the root to the leaf represent classification rules.
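For instance, scikit-learn can fit a small tree on invented symptom data; the labels below are purely illustrative, not medical advice:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: [has_fever, has_cough] -> diagnosis label
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = ["flu", "flu", "cold", "healthy"]

# Each internal node of the fitted tree tests one attribute,
# and each leaf node carries a class label.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[1, 1]]))  # follows the root-to-leaf path for fever + cough
```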
Exploratory Data Analysis is often the first step when analyzing datasets. With EDA techniques, data scientists can summarize a dataset’s main characteristics and inform the development of more complex models or logical next steps.
Extract, transform, load — ETL is a type of data integration used to blend data from multiple sources. It’s often used to build a data warehouse. An important aspect of this data warehousing is that it consolidates data from multiple sources and transforms it into a common, useful format. For example, ETL normalizes data from multiple business departments and processes to make it standardized and consistent.
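A minimal pure-Python sketch of the idea: two hypothetical "sources" store amounts in different number formats, and the transform step normalizes them before loading into one list standing in for the warehouse:

```python
# Extract: records pulled from two hypothetical sources
sales_eu = [{"amount": "1.200,50", "currency": "EUR"}]  # European number format
sales_us = [{"amount": "1,200.50", "currency": "USD"}]  # US number format

def transform_eu(rec):
    # Normalize "1.200,50" -> 1200.50
    value = float(rec["amount"].replace(".", "").replace(",", "."))
    return {"amount": value, "currency": rec["currency"]}

def transform_us(rec):
    # Normalize "1,200.50" -> 1200.50
    value = float(rec["amount"].replace(",", ""))
    return {"amount": value, "currency": rec["currency"]}

# Load: both sources now share one common, consistent format
warehouse = [transform_eu(r) for r in sales_eu] + [transform_us(r) for r in sales_us]
print(warehouse)
```

Real ETL tools add scheduling, validation, and error handling on top, but the extract-transform-load shape is the same.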
GitHub is a great knowledge hub for programmers and data scientists. It provides access control and several collaboration features, such as bug tracking, feature requests, task management, and wikis for every project. GitHub offers both private repositories and free accounts, which are commonly used to host open-source software projects.
A hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training. Given the hyperparameters, the training algorithm learns the parameters from the data.
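The distinction is easy to see with scikit-learn's ridge regression (the data below are made up): `alpha` is a hyperparameter we fix before training, while `coef_` and `intercept_` are parameters learned from the data.

```python
from sklearn.linear_model import Ridge

X = [[0], [1], [2], [3]]
y = [0, 1, 2, 3]

# alpha is a hyperparameter: chosen by us before the learning process begins
model = Ridge(alpha=0.1).fit(X, y)

# coef_ and intercept_ are ordinary parameters: derived via training
print(model.coef_, model.intercept_)
```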
Linear Regression is used to model a linear relationship between a continuous, scalar response variable and at least one explanatory variable. Linear Regression can be used for predicting monetary valuations amongst other use cases.
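A minimal scikit-learn sketch, using invented apartment prices as the monetary use case:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical data: apartment size in square meters vs. price in thousands
X = [[30], [50], [70], [90]]
y = [90, 150, 210, 270]  # here price is exactly 3 * size

model = LinearRegression().fit(X, y)
print(model.predict([[60]]))  # the fitted line predicts 180 for a 60 m^2 flat
```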
Logistic Regression is used to model a probabilistic relationship between a binary response variable and at least one explanatory variable. The output of the Logistic Regression model is the log odds, which can be transformed to obtain the probability. Logistic Regression can be used to predict the likelihood of churn amongst other use cases.
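A small churn-flavored sketch with scikit-learn (the data are invented); `predict_proba` returns the probability obtained from the model's log odds:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical data: months since last purchase -> churned (1) or not (0)
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(X, y)

# Probability of churn for a customer inactive for 12 months
print(clf.predict_proba([[12]])[0, 1])
```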
Machine learning is a set of algorithms that allow computers to complete a task without being explicitly programmed for it. These algorithms build a mathematical model based on sample data, known as "training data". There are many types of machine learning techniques; most are classified as either supervised or unsupervised.
While AI is a technique that enables machines to mimic human behavior, machine learning is a technique used to implement Artificial Intelligence. It is a process during which machines (computers) learn by being fed data and picking up a few tricks on their own, without being explicitly programmed to do so. All in all, machine learning is the meat and potatoes of AI.
“Observation which deviates so much from other observations as to arouse suspicion it was generated by a different mechanism” — D. M. Hawkins (1980)
An outlier is an element of a data set that distinctly stands out from the rest of the data. In other words, outliers are those data points that lie outside the overall pattern of distribution. Outliers may indicate variability in measurement, experimental errors, or a novelty.
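One common way to flag such points is the 1.5 x IQR rule, shown here with NumPy on invented measurements:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 clearly stands out

# The interquartile range (IQR) spans the middle 50% of the data
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR outside the quartiles are flagged as outliers
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)  # [95]
```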
Overfitting happens when a model considers too much information. It’s like asking a person to read a sentence while looking at a page through a microscope. The patterns that enable understanding get lost in the noise.
Regression is a supervised machine learning technique. It solves problems where the output variable is a real value, such as "dollars" or "weight". Regression aims to find the relationship between variables; in machine learning, it is used to predict an outcome based on that relationship. It focuses on how a target value changes as other values within a dataset change.
The most commonly used forms of regression are linear regression, logistic regression, and ridge regression. Read more: Key Types of Regressions: Which One to Use?
Reinforcement learning is an area of machine learning, distinct from both supervised and unsupervised learning, where the machine seeks to maximize a reward. The machine, or "agent," learns through trial and error as well as reward and punishment.
If you’ve heard of positive and negative reinforcement, those same principles are applied here. Reinforcement learning problems are usually explained in terms of games. Let’s take chess, for example. The machine’s goal is to win at chess. It’s positively reinforced when it makes moves that win material, such as capturing a pawn, and negatively reinforced when it makes moves that lose material, such as having a pawn captured. Combinations of these rewards and punishments result in a self-learning machine that improves at chess over time.
The standard deviation of a set of values helps us understand how spread out those values are. This statistic is more useful than the variance because it's expressed in the same units as the values themselves. Mathematically, the standard deviation is the square root of the variance of a set. It's often represented by the Greek letter sigma, σ.
Supervised learning is when the model is trained on a labeled dataset, one that has both input and output parameters. In this type of learning, both the training and validation datasets are labeled.
“It’s similar to the way a child might learn arithmetic from a teacher.” — Nikki Castle.
This is distinctly different from unsupervised learning, which does not rely on human guidance. An example use case for supervised learning might include a data scientist training an algorithm to recognize images of female human beings using correctly labeled images of female human beings and their characteristics.
Underfitting happens when you don't offer a model enough information. An example of underfitting would be asking someone to graph the change in temperature over a day while giving them only the high and the low. Instead of the smooth curve one might expect, you only have enough information to draw a straight line.
Unstructured data is any data that does not fit a predefined data model. Often this data does not fit into the typical row-column structure of a database. Images, emails, videos, audio, and pretty much anything else that might be difficult to "tabify" might constitute examples of unstructured data.
Supervised learning is the technique of accomplishing a task by providing training input and output patterns to the system, whereas unsupervised learning is a self-learning technique in which the system has to discover the features of the input population on its own, with no prior set of categories. Unsupervised learning is often used to preprocess data. Usually, that means compressing it in some meaning-preserving way, as with PCA or SVD, before feeding it to a deep neural net or another supervised learning algorithm.
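As an illustration of that compression step, here is a sketch with scikit-learn's PCA on synthetic 3-D points that actually lie near a 1-D line:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic 3-D data that really varies along a single direction, plus tiny noise
t = rng.normal(size=(100, 1))
X = t @ np.array([[1.0, 2.0, 3.0]]) + 0.01 * rng.normal(size=(100, 3))

# One principal component captures almost all the variance,
# so the data can be compressed from 3 dimensions down to 1
pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_)
```

No labels were needed: PCA discovered the dominant direction in the input on its own, which is what makes it unsupervised.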
The variance of a set of values measures how spread out those values are. Mathematically, it is the average squared difference between the individual values and the mean of the set. The square root of the variance gives us the standard deviation, which is more intuitively useful.
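Both statistics take one line each in NumPy; the values below are chosen so the numbers come out round:

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

variance = values.var()  # mean squared deviation from the mean: 4.0
std_dev = values.std()   # square root of the variance: 2.0

print(variance, std_dev)
```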
Web scraping is the process of pulling data from a website’s source code. It generally involves writing a script that will identify the information a user wants and pull it into a new file for later analysis.
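A minimal sketch using only the standard library's html.parser, run on an inline string standing in for a downloaded page (a real scraper would first fetch the page over HTTP):

```python
from html.parser import HTMLParser

# A tiny hypothetical page; imagine it was downloaded from a website
page = ("<html><body><h1>Prices</h1>"
        "<p class='price'>$10</p><p class='price'>$15</p></body></html>")

class PriceScraper(HTMLParser):
    """Collects the text of every <p class='price'> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data)
            self.in_price = False

scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)  # ['$10', '$15']
```

In practice, libraries such as BeautifulSoup or Scrapy handle the messy HTML found in the wild far more robustly than this hand-rolled parser.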
I hope this glossary will add some clarity to these terms and destroy any misconceptions about data science. Feel free to use it as a reference anytime you want to brush up on your knowledge. Happy data science learning!
Thanks for reading, best of luck, and cheers!
Inspired to learn more about AI, ML & Data Science? Check out my Medium
“The Ultimate Glossary of Data Science” – Oleksii Kharkovyna