OPINION.

You have probably read an article about the difference between a data scientist and a data engineer. I always thought the distinction was clear. Data engineers make the data ready for use and then data scientists work on that data.

However, my opinion on this distinction has changed dramatically after I started working as a data scientist.

Photo by Ben White on Unsplash

Everything in data science starts with data. Your machine learning model is only as good as the data fed into it. Garbage in, garbage out! A data scientist cannot work magic and create a valuable product without proper data.

Proper data is not always readily available to data scientists. In most cases, it will be the responsibility of the data scientist to convert the raw data into a proper format.

Unless you work for a big tech company that has separate teams of data engineers and data scientists, you should possess the ability and skills to handle some data engineering tasks. These tasks cover a broad range of operations and I will elaborate on this in the remaining part of the article.

What is the difference anyway?

I would like to state my opinion on the relationship between the job of a data engineer and a data scientist.

A data engineer is a data engineer. A data scientist should be both a data scientist and a data engineer.

It may seem like a debatable statement. However, I would like to emphasize that my opinion was different before I started working as a data scientist. I used to think of data engineers and data scientists as separate entities.

In the remaining part of the article, I will try to explain what I mean when I say that a data scientist should be both a data scientist and a data engineer.

For instance, data engineers perform a set of operations known as ETL (extract, transform, load). It covers the procedures for collecting data from one or more sources, applying some transformations, and then loading the result into a different destination.
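To make the three ETL stages concrete, here is a minimal sketch that uses only the Python standard library. The sample CSV text, the `orders` table, and the function names are assumptions made up for illustration; a real pipeline would read from files, APIs, or databases and would need far more validation.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,Alice,4.25
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise names, drop rows with no amount, cast types."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


# An in-memory SQLite database stands in for the destination.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping each stage in its own function makes it easier to test the cleaning rules separately from the code that touches the source and the destination.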

I would definitely not be surprised if a data scientist is expected to perform ETL operations. Data science is still evolving and most companies do not have clearly separated data engineer and data scientist roles. As a result, a data scientist should be able to perform some data engineering tasks.

If you expect to only work on running machine learning algorithms with ready-to-use data, you will face the harsh truth soon after you start working as a data scientist.

You may have to write some stored procedures in SQL to preprocess the client data. It is also possible that you receive the client data from a few different sources. It will be your job to extract and combine them. Then, you will need to load them into a single destination. To write efficient stored procedures, you need extensive SQL skills.

The transform part of ETL involves many data cleaning and manipulation steps. SQL may not be the best choice if you work with large-scale data. Distributed computing is a better alternative in such cases. Therefore, a data scientist should also be familiar with distributed computing.
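As a rough illustration of combining client data from two sources into one table, the sketch below uses SQLite through Python's built-in `sqlite3` module. SQLite does not support stored procedures, so a Python function stands in for one here. The table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical client sources with slightly different schemas,
# plus the single destination table.
conn.executescript("""
CREATE TABLE crm_customers (id INTEGER, full_name TEXT, email TEXT);
CREATE TABLE shop_customers (customer_id INTEGER, name TEXT, email_address TEXT);
CREATE TABLE customers (name TEXT, email TEXT);

INSERT INTO crm_customers VALUES
    (1, 'Ada Lovelace', 'ADA@example.com'),
    (2, 'Alan Turing', 'alan@example.com');
INSERT INTO shop_customers VALUES
    (10, 'Alan Turing', 'alan@example.com'),
    (11, 'Grace Hopper', 'grace@example.com');
""")


def combine_customers(conn):
    """Stand-in for a stored procedure: merge both sources into one table."""
    conn.execute("DELETE FROM customers")  # makes the step safe to re-run
    conn.execute("""
        INSERT INTO customers (name, email)
        SELECT full_name, LOWER(email) FROM crm_customers
        UNION  -- UNION (not UNION ALL) drops exact duplicate rows
        SELECT name, LOWER(email_address) FROM shop_customers
    """)
    conn.commit()


combine_customers(conn)
rows = conn.execute("SELECT name, email FROM customers ORDER BY email").fetchall()
print(rows)
```

In a database server that does support stored procedures, the same `INSERT ... SELECT ... UNION` statement could live inside one; the exact syntax for defining it varies by vendor.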

Your best friend in distributed computing might be Spark. It is an analytics engine used for large-scale data processing. We can distribute both data and computations over clusters to achieve a substantial performance increase.
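The idea of distributing both data and computations can be pictured with a single-machine analogy. The sketch below is not Spark code: it uses Python's standard `concurrent.futures` module to split a list into partitions, compute a partial result on each one, and combine the partial results. The function names are made up for illustration, and the threads only show the shape of the pattern (split, compute, combine) rather than delivering a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split the data into roughly equal chunks, one per worker."""
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_stats(chunk):
    """Work done independently on each partition: a local sum and count."""
    return sum(chunk), len(chunk)


def distributed_mean(data, n_workers=4):
    """Compute the mean by combining per-partition sums and counts.

    Assumes `data` is a non-empty list of numbers.
    """
    chunks = partition(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 101))
mean = distributed_mean(values)
print(mean)
```

A cluster engine applies the same pattern across many machines, which is why the way data is partitioned and the number of workers matter so much for performance.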

If you are familiar with Python and SQL, you won’t have a hard time getting used to Spark. You can use Spark features through PySpark, which is the Python API for Spark.

Read also: A Beginner’s Guide to Apache Spark

When it comes to working with clusters, the optimal environment is the cloud. There are various cloud providers but AWS, Azure, and Google Cloud Platform (GCP) lead the way.

Although the PySpark code is the same for all cloud providers, how you set up the environment and create clusters differs between them. Each of them lets you create clusters with either scripts or the user interface.

Distributed computing over clusters is a whole different world. It is nothing like doing analysis on your own computer. It has very different dynamics. Evaluating cluster performance and choosing the optimal number of workers for a cluster will be your predominant concerns.

Read also:
* The Full Stack Data Scientist
* Everything a Data Scientist Should Know About Data Management

Data Scientists Without Data Engineering Skills Will Face the Harsh Truth

Soner Yıldırım
