The best data science and machine learning articles. Written by data scientists for data scientists (and business people)

Libraries
Pandas

16 Underrated Pandas Series Methods And When To Use Them
In this article, we're going to explore some lesser-known but very useful pandas methods for manipulating Series objects. Some of these methods apply only to Series, while others apply to both Series and DataFrames, though they may behave differently with each structure type.

1. is_unique

As its name suggests, this method checks whether all the values of a Series are unique:

import pandas as pd
print(pd.Series([1, 2, 3, 4]).is_unique)
print(pd.Series([1, 2, 3, 1]).is_unique)
Output:
True
False
2 & 3. is_monotonic and is_monotonic_decreasing

With these two methods, we can check if the values of a Series are in ascending/descending order. (Note: in pandas 2.0+, is_monotonic has been removed in favor of the equivalent is_monotonic_increasing.)

print(pd.Series([1, 2, 3, 8]).is_monotonic)
print(pd.Series([1, 2, 3, 1]).is_monotonic)
print(pd.Series([9, 8, 4, 0]).is_monotonic_decreasing)
Output:
True
False
True

Both methods also work for a Series with string values. In this case, Python uses lexicographical ordering under the hood, comparing two subsequent strings character by character. It's not the same as simple alphabetical ordering; in fact, the numeric example above is a particular case of such an ordering. As the Python documentation says:

Lexicographical ordering for strings uses the Unicode code point number to order individual characters.

In practice, this mainly means that letter case and special symbols are also taken into account:

print(pd.Series(['fox', 'koala', 'panda']).is_monotonic)
print(pd.Series(['FOX', 'Fox', 'fox']).is_monotonic)
print(pd.Series(['*', '&', '_']).is_monotonic)
Output:
True
True
False

A curious exception happens when all the values of a Series are the same. In this case, both methods return True:

print(pd.Series([1, 1, 1, 1, 1]).is_monotonic)
print(pd.Series(['fish', 'fish']).is_monotonic_decreasing)
Output:
True
True

Also Read: 4 Must-Know Python Pandas Functions for Time Series Analysis

4. hasnans

This method checks if a Series contains NaN values:

import numpy as np
print(pd.Series([1, 2, 3, np.nan]).hasnans)
print(pd.Series([1, 2, 3, 10, 20]).hasnans)
Output:
True
False

5. empty

Sometimes, we might want to know if a Series is completely empty, containing not even NaN values:

print(pd.Series().empty)
print(pd.Series(np.nan).empty)
Output:
True
False

A Series can become empty after some manipulations with it, for example, filtering:

s = pd.Series([1, 2, 3])
s[s > 3].empty
Output:
True
6 & 7. first_valid_index() and last_valid_index()

These two methods return the index of the first/last non-NaN value and are particularly useful for Series objects with many NaNs:

print(pd.Series([np.nan, np.nan, 1, 2, 3, np.nan]).first_valid_index())
print(pd.Series([np.nan, np.nan, 1, 2, 3, np.nan]).last_valid_index())
Output:
2
4

If all the values of a Series are NaN, both methods return None:

print(pd.Series([np.nan, np.nan, np.nan]).first_valid_index())
print(pd.Series([np.nan, np.nan, np.nan]).last_valid_index())
Output:
None
None
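One practical use of these two methods, sketched below under the assumption that NaNs appear only at the edges of the Series, is trimming those NaNs off:

```python
import numpy as np
import pandas as pd

# A Series with leading and trailing NaNs
s = pd.Series([np.nan, np.nan, 1, 2, 3, np.nan])

# Slice off the leading/trailing NaNs using both methods;
# .loc slicing is inclusive of the end label
trimmed = s.loc[s.first_valid_index():s.last_valid_index()]
print(trimmed)
```

Note this only removes NaNs at the edges; interior NaNs (if any) would survive the slice.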
8. truncate()

This method truncates a Series before and after some index value. Let's truncate the Series from the previous section, leaving only the non-NaN values:

s = pd.Series([np.nan, np.nan, 1, 2, 3, np.nan])
s.truncate(before=2, after=4)
Output:
2 1.0
3 2.0
4 3.0
dtype: float64

The original index of the Series was preserved. We may want to reset it and also assign the truncated Series to a variable:

s_truncated = s.truncate(before=2, after=4).reset_index(drop=True)
print(s_truncated)
Output:
0 1.0
1 2.0
2 3.0
dtype: float64

Also Read: Pandas vs SQL. When Data Scientists Should Use One Over the Other

9. convert_dtypes()

As the pandas documentation says, this method is used to

Convert columns to best possible dtypes using dtypes supporting pd.NA.

Considering only Series objects (not DataFrames), the main application of this method is converting nullable integers (i.e., float numbers with a decimal part equal to 0, such as 1.0, 2.0, etc.) back to "normal" integers. Such float numbers appear when the original Series contains both integers and NaN values: since NaN is a float in NumPy and pandas, any Series with missing values becomes float as a whole.

Let's take a look at the example from the previous section to see how it works:

print(pd.Series([np.nan, np.nan, 1, 2, 3, np.nan]))
print('\n')
print(pd.Series([np.nan, np.nan, 1, 2, 3, np.nan]).convert_dtypes())
Output:
0 NaN
1 NaN
2 1.0
3 2.0
4 3.0
5 NaN
dtype: float64
0 <NA>
1 <NA>
2 1
3 2
4 3
5 <NA>
dtype: Int64
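This is handy because a plain astype(int) fails on a Series containing NaN. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

# s.astype(int) would raise here: NaN cannot be represented as int64.
# convert_dtypes() switches to the nullable Int64 dtype instead,
# which represents the missing value as <NA>
s_int = s.convert_dtypes()
print(s_int.dtype)
```

Aggregations such as sum() skip the missing value in the nullable dtype, just as they skip NaN in the float dtype.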
10. clip()

We can clip all the values of a Series at the given thresholds (the lower and upper parameters):

s = pd.Series(range(1, 11))
print(s)
s_clipped = s.clip(lower=2, upper=7)
print(s_clipped)
Output:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
dtype: int64
0 2
1 2
2 3
3 4
4 5
5 6
6 7
7 7
8 7
9 7
dtype: int64
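Both bounds are optional; if only one is given, clip() effectively acts as a floor or a ceiling. A quick sketch:

```python
import pandas as pd

s = pd.Series(range(1, 11))

# Passing only the lower bound leaves the upper side untouched
floored = s.clip(lower=3)
print(floored.min(), floored.max())
```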
11. rename_axis()

For a Series object, this method sets the name of the index:

s = pd.Series({'flour': '300 g', 'butter': '150 g', 'sugar': '100 g'})
print(s)
s = s.rename_axis('ingredients')
print(s)
Output:
flour 300 g
butter 150 g
sugar 100 g
dtype: object
ingredients
flour 300 g
butter 150 g
sugar 100 g
dtype: object
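A named index pays off when the Series is converted to a DataFrame: the name becomes a regular column. A small sketch (the 'amount' column name here is an arbitrary choice, not from the original example):

```python
import pandas as pd

s = pd.Series({'flour': '300 g', 'butter': '150 g', 'sugar': '100 g'})
s = s.rename_axis('ingredients')

# reset_index turns the named index into a column; the `name`
# argument labels the values column
df = s.reset_index(name='amount')
print(df.columns.tolist())
```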
12 & 13. nsmallest() and nlargest()

These two methods return the smallest/largest elements of a Series. By default, they return 5 values: in ascending order for nsmallest() and in descending order for nlargest().

s = pd.Series([3, 2, 1, 100, 200, 300, 4, 5, 6])
s.nsmallest()
Output:
2 1
1 2
0 3
6 4
7 5
dtype: int64

It's possible to specify a different number of smallest/largest values to be returned. Also, we may want to reset the index and assign the result to a variable:

largest_3 = s.nlargest(3).reset_index(drop=True)
print(largest_3)
Output:
0 300
1 200
2 100
dtype: int64

Also Read: Pandas vs SQL. When Data Scientists Should Use One Over the Other

14. pct_change()

For a Series object, we can calculate the percentage change (or, more precisely, fractional change) between the current and a prior element. This can be helpful, for example, when working with time series, or when creating a waterfall chart in percentages or fractions.

s = pd.Series([20, 33, 14, 97, 19])
s.pct_change()
Output:
0 NaN
1 0.650000
2 -0.575758
3 5.928571
4 -0.804124
dtype: float64

To make the resulting Series more readable, let's round it:

s.pct_change().round(2)
Output:
0 NaN
1 0.65
2 -0.58
3 5.93
4 -0.80
dtype: float64
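pct_change() also accepts a periods parameter for comparing against elements further back. A sketch on the same data:

```python
import pandas as pd

s = pd.Series([20, 33, 14, 97, 19])

# periods=2 compares each element with the one two positions earlier,
# so the first two values are NaN
change_2 = s.pct_change(periods=2).round(2)
print(change_2)
```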
15. explode()

This method transforms each list-like element of a Series (lists, tuples, sets, Series, ndarrays) into a row. Empty list-likes are transformed into a row with NaN. To avoid repeated indices in the resulting Series, it's better to reset the index:

s = pd.Series([[np.nan], {1, 2}, 3, (4, 5)])
print(s)
s_exploded = s.explode().reset_index(drop=True)
print(s_exploded)
Output:
0 [nan]
1 {1, 2}
2 3
3 (4, 5)
dtype: object
0 NaN
1 1
2 2
3 3
4 4
5 5
dtype: object
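Note that the exploded Series keeps the object dtype even when all its elements are numeric. A sketch of converting it afterwards:

```python
import numpy as np
import pandas as pd

s = pd.Series([[np.nan], {1, 2}, 3, (4, 5)])
s_exploded = s.explode().reset_index(drop=True)

# pd.to_numeric converts the object-dtyped result to a numeric
# dtype (float64 here, because of the NaN)
s_numeric = pd.to_numeric(s_exploded)
print(s_numeric.dtype)
```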
16. repeat()

This method consecutively repeats each element of a Series a given number of times. In this case as well, it makes sense to reset the index:

s = pd.Series([1, 2, 3])
print(s)
s_repeated = s.repeat(2).reset_index(drop=True)
print(s_repeated)
Output:
0 1
1 2
2 3
dtype: int64
0 1
1 1
2 2
3 2
4 3
5 3
dtype: int64

If the number of repetitions is set to 0, an empty Series is returned:

s.repeat(0)
Output:
Series([], dtype: int64)
Conclusion

To sum up, we investigated 16 rarely used pandas methods for working with Series and some of their use cases. If you know other interesting ways to manipulate pandas Series, you're very welcome to share them in the comments.

Thanks for reading!

Also read: Using Python And Pandas Datareader to Analyze Financial Data

Libraries
Pandas

4 Must-Know Python Pandas Functions for Time Series Analysis
Time series data consists of data points attached to sequential timestamps. Daily sales, hourly temperature values, and second-level measurements in a chemical process are some examples of time series data.

Time series data has different characteristics than ordinary tabular data. Thus, time series analysis has its own dynamics and can be considered a separate field. There are books of over 500 pages that cover time series analysis concepts and techniques in depth.

Pandas was created by Wes McKinney to provide an efficient and flexible tool for working with financial data, which is a kind of time series. In this article, we will go over 4 Pandas functions that can be used for time series analysis.

We need data for the examples. Let's start by creating our own time series data.

import numpy as np
import pandas as pd
df = pd.DataFrame({
    "date": pd.date_range(start="2020-05-01", periods=100, freq="D"),
    "temperature": np.random.randint(18, 30, size=100) +
                   np.random.random(100).round(1)
})
df.head()

(image by author)

We have created a data frame that contains temperature measurements over a period of 100 days. The date_range function of Pandas can be used to generate a date range with a customized frequency. The temperature values are generated randomly using NumPy functions.

We can now start on the functions.

1. Shift

It is a common operation to shift time series data. We may need to compare lagged or lead features. In our data frame, we can create a new feature that contains the temperature of the previous day.

df["temperature_lag_1"] = df["temperature"].shift(1)
df.head()(image by author)The scalar value passed to the shift function indicates the number of periods to shift. The first row of the new column is filled with NaN because there is no previous value for the first row.The fill_value parameter can be used for filling the missing values with a scalar. Let’s replace the NaN with the average value of the temperature column.df["temperature_lag_1"] = df["temperature"]\
.shift(1, fill_value = df.temperature.mean())
df.head()

(image by author)

If you are interested in future values, you can shift backwards by passing negative values to the shift function. For instance, -1 brings the temperature of the next day.

2. Resample

Another common operation performed on time series data is resampling. It involves changing the frequency of the periods. For instance, we may be interested in weekly temperature data rather than daily measurements.

The resample function creates groups (or bins) of a specified interval. Then, we can apply aggregation functions to the groups to calculate the value based on the resampled frequency.

Let's calculate the average weekly temperatures. The first step is to resample the data to week level. Then, we apply the mean function to calculate the average.

df_weekly = df.resample("W", on="date").mean()
df_weekly.head()

(image by author)

The first parameter specifies the frequency for resampling. "W" stands for week, surprisingly. If the data frame does not have a datetime index, the column that contains the date or time information needs to be passed to the on parameter.

3. Asfreq

The asfreq function provides a different technique for resampling. It returns the value at the end of the specified interval. For instance, asfreq("W") returns the value on the last day of each week.

In order to use the asfreq function, we should set the date column as the index of the data frame.

df.set_index("date").asfreq("W").head()

(image by author)

Since we are getting the value at a specific day, it is not necessary to apply an aggregation function.

4. Rolling

The rolling function can be used to calculate the moving average, which is a highly common operation for time series data. It creates a window of a particular size. Then, we can use this window to make calculations as it rolls through the data points.

The figure below explains the concept of rolling.

(image by author)

Let's create a rolling window of 3 and use it to calculate the moving average.

df.set_index("date").rolling(3).mean().head()

(image by author)

For any day, the values show the average of that day and the previous 2 days. In this run, the values of the first 3 days are 18.9, 23.8, and 19.9, so the moving average on the third day is the average of these values, about 20.9.

The first 2 values are NaN because they do not have 2 previous values. We can also make the rolling window cover the previous and next day for any given day by setting the center parameter to True.

df.set_index("date").rolling(3, center=True).mean().head()

(image by author)

The values of the first 3 days are 18.9, 23.8, and 19.9. Thus, the moving average on the second day is the average of these 3 values.
In this setting, only the first value is NaN because we only need 1 previous value.

Conclusion

We have covered 4 Pandas functions that are commonly used in time series analysis. Predictive analytics is an essential part of data science, and time series analysis is at the core of many problems that predictive analytics aims to solve. Hence, if you plan to work on predictive analytics, you should definitely learn how to handle time series data.

Thank you for reading. Please let me know if you have any feedback.

Soner Yıldırım
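The four functions covered in this article can be tried end to end on a tiny deterministic frame. The sketch below uses arbitrary fixed values rather than the random data from the article, so the results are reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range(start="2020-05-01", periods=6, freq="D"),
    "temperature": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# 1. Shift: previous day's temperature
df["lag_1"] = df["temperature"].shift(1)

# 2. Resample: weekly mean (2020-05-01 is a Friday, so the first
#    week bin ends on Sunday 2020-05-03)
weekly = df.resample("W", on="date")["temperature"].mean()

# 3. Asfreq: value on the last day of each week
last_of_week = df.set_index("date")["temperature"].asfreq("W")

# 4. Rolling: 3-day moving average (first two values are NaN)
rolling = df.set_index("date")["temperature"].rolling(3).mean()
```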

Pandas
Big Data

How to Process a DataFrame With Millions of Rows in Seconds
Yet another Python library for Data Analysis that You Should Know About — and no, I am not talking about Spark or Dask.

Big Data Analysis in Python is having its renaissance. It all started with NumPy, which is also one of the building blocks behind the tool I am presenting in this article.

In 2006, Big Data was a topic that was slowly gaining traction, especially with the release of Hadoop. Pandas followed soon after with its DataFrames. 2014 was the year when Big Data became mainstream; Apache Spark was also released that year. In 2018 came Dask and other libraries for data analytics in Python.

Each month I find a new data analytics tool that I am eager to learn. Spending an hour or two on tutorials is a worthy investment, as it can save you a lot of time in the long run. It's also important to keep in touch with the latest tech.

While you might expect that this article will be about Dask, you are wrong. I found another Python library for data analysis that you should know about.

Meet Vaex

Photo by Mathew Schwartz on Unsplash

Vaex is a high-performance Python library for lazy out-of-core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets.

It can calculate basic statistics for more than a billion rows per second. It supports multiple visualizations, allowing interactive exploration of big data.

What's the difference between Vaex and Dask?

Photo by Stillness InMotion on Unsplash

Vaex is not similar to Dask, but it is similar to Dask DataFrames, which are built on top of pandas DataFrames. This means that Dask inherits pandas' issues, like high memory usage. This is not the case with Vaex.

Vaex doesn't make DataFrame copies, so it can process bigger DataFrames on machines with less main memory.

Both Vaex and Dask use lazy processing. The main difference is that Vaex calculates a field when it is needed, whereas with Dask we need to explicitly call the compute function.

Data needs to be in HDF5 or Apache Arrow format to take full advantage of Vaex.

How to install Vaex?

Installing Vaex is as simple as installing any other Python package:

pip install vaex

Let's take Vaex for a test drive

Photo by Eugene Chystiakov on Unsplash

Let's create a pandas DataFrame with 1 million rows and 1000 columns to create a big data file.

import vaex
import pandas as pd
import numpy as np

n_rows = 1000000
n_cols = 1000
df = pd.DataFrame(np.random.randint(0, 100, size=(n_rows, n_cols)),
                  columns=['col%d' % i for i in range(n_cols)])
df.head()

First few lines in a Pandas DataFrame (image made by author)

How much main memory does this DataFrame use?

df.info(memory_usage='deep')

Let's save it to disk so that we can read it later with Vaex.

file_path = 'big_file.csv'
df.to_csv(file_path, index=False)

We wouldn't gain much by reading the whole CSV directly with Vaex, as the speed would be similar to pandas. Both need approximately 85 seconds on my laptop.

We need to convert the CSV to HDF5 (the Hierarchical Data Format version 5) to see the benefit of Vaex. Vaex has a function for the conversion, which even supports files bigger than the main memory by converting smaller chunks.

If you cannot open a big file with pandas because of memory constraints, you can convert it to HDF5 and process it with Vaex.

dv = vaex.from_csv(file_path, convert=True, chunk_size=5_000_000)

This function creates an HDF5 file and persists it to disk.

What's the datatype of dv?

type(dv)
# output
vaex.hdf5.dataset.Hdf5MemoryMapped

Now, let's read the 7.5 GB dataset with Vaex. We wouldn't need to read it again, as we already have it in the dv variable; this is just to test the speed.

dv = vaex.open('big_file.csv.hdf5')

Vaex needed less than 1 second to execute the command above. But Vaex didn't actually read the file, because of lazy loading, right?

Let's force it to read the file by calculating the sum of col1.

suma = dv.col1.sum()
suma
# Output
# array(49486599)

I was really surprised by this one. Vaex needed less than 1 second to calculate the sum. How is that possible?

Opening such data is instantaneous regardless of the file size on disk. Vaex just memory-maps the data instead of reading it into memory. This is the optimal way of working with datasets that are larger than the available RAM.

Plotting

Vaex is also fast when plotting data. It has the special plotting functions plot1d, plot2d and plot2d_contour.

dv.plot1d(dv.col2, figsize=(14, 7))

Plotting with Vaex (image made by author)

Virtual columns

Vaex creates a virtual column when adding a new column: a column that doesn't take up main memory, as it is computed on the fly.

dv['col1_plus_col2'] = dv.col1 + dv.col2
dv['col1_plus_col2']

The virtual column in Vaex (image made by author)

Efficient filtering

Vaex won't create DataFrame copies when filtering data, which is much more memory efficient.

dvv = dv[dv.col1 > 90]
Aggregations

Aggregations work slightly differently than in pandas, but more importantly, they are blazingly fast.

Let's add a binary virtual column where col1 >= 50.

dv['col1_50'] = dv.col1 >= 50

Vaex combines group by and aggregation in a single command. The command below groups the data by the col1_50 column and calculates the sum of the col3 column.

dv_group = dv.groupby(dv['col1_50'], agg=vaex.agg.sum(dv['col3']))
dv_group

Aggregations in Vaex (image made by author)

Joins

Vaex joins data without making memory copies, which saves main memory. Pandas users will be familiar with the join function:

dv_join = dv.join(dv_group, on='col1_50')
Conclusion

In the end, you might ask: should we simply switch from pandas to Vaex? The answer is a big NO.

Pandas is still the best tool for data analysis in Python. It has well-supported functions for the most common data analysis tasks.

When it comes to bigger files, pandas might not be the fastest tool. This is a great place for Vaex to jump in.

Vaex is a tool you should add to your data analytics toolbox. When working on an analysis task where pandas is too slow or simply crashes, pull Vaex out of your toolbox, filter out the most important entries, and continue the analysis with pandas.

Follow me on Twitter, where I regularly tweet about Data Science and Machine Learning.

Data Science
Pandas

A Better Way for Data Preprocessing: Pandas Pipe
Real-life data is usually messy. It requires a lot of preprocessing to be ready for use. Pandas, being one of the most widely used data analysis and manipulation libraries, offers several functions to preprocess raw data.

In this article, we will focus on one particular function that organizes multiple preprocessing operations into a single one: the pipe function.

When it comes to software tools and packages, I learn best by working through examples, and I keep this in mind when creating content. I will do the same in this article.

Let's start by creating a data frame with mock data.

import numpy as np
import pandas as pd
df = pd.DataFrame({
    "id": [100, 100, 101, 102, 103, 104, 105, 106],
    "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
    "B": [45, 56, 48, 47, 62, 112, 54, 49],
    "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5]
})
df

(image by author)

Our data frame contains some missing values indicated by the standard missing value representation (i.e., NaN). The id column includes duplicate values. Last but not least, 112 in column B looks like an outlier.

These are some of the typical issues in real-life data. We will create a pipe that handles the issues we have just described.

For each task, we need a function. Thus, the first step is to create the functions that will be placed in the pipe.

It is important to note that the functions used in the pipe need to take a data frame as an argument and return a data frame.

The first function handles the missing values.

def fill_missing_values(df):
    for col in df.select_dtypes(include=["int", "float"]).columns:
        val = df[col].mean()
        df[col].fillna(val, inplace=True)
    return df

I prefer to replace the missing values in the numerical columns with the mean value of the column. Feel free to customize this function; it will work in the pipe as long as it takes a data frame as an argument and returns a data frame.

The second function will help us remove duplicate values.

def drop_duplicates(df, column_name):
    df = df.drop_duplicates(subset=column_name)
    return df

I got some help from the built-in drop_duplicates function of Pandas. It eliminates the duplicate values in the given column or columns. In addition to the data frame, this function also takes a column name as an argument. We can pass such additional arguments to the pipe as well.

Also read: Pandas vs SQL. When Data Scientists Should Use One Over the Other

The last function in the pipe will be used for eliminating the outliers.

def remove_outliers(df, column_list):
    for col in column_list:
        avg = df[col].mean()
        std = df[col].std()
        low = avg - 2 * std
        high = avg + 2 * std
        # inclusive="both" replaces the boolean inclusive flag,
        # which is deprecated in newer pandas versions
        df = df[df[col].between(low, high, inclusive="both")]
    return df

What this function does is as follows:

- It takes a data frame and a list of columns
- For each column in the list, it calculates the mean and standard deviation
- It calculates a lower and an upper bound using the mean and standard deviation
- It removes the values that are outside the range defined by the lower and upper bounds

Just like with the previous functions, you can choose your own way of detecting outliers.

We now have 3 functions, each handling a data preprocessing task. The next step is to create a pipe with these functions.

df_processed = (df.
pipe(fill_missing_values).
pipe(drop_duplicates, "id").
pipe(remove_outliers, ["A", "B"]))

This pipe executes the functions in the given order. We can pass the arguments to the pipe along with the function names.

One thing to mention here is that some functions in the pipe modify the original data frame in place. Thus, using the pipe as indicated above will update df as well.

One option to overcome this issue is to use a copy of the original data frame in the pipe. If you do not care about keeping the original data frame as is, you can just use it in the pipe.

I will update the pipe as below:

my_df = df.copy()
df_processed = (my_df.
pipe(fill_missing_values).
pipe(drop_duplicates, "id").
pipe(remove_outliers, ["A", "B"]))

Let's take a look at the original and processed data frames:

df (image by author)

df_processed (image by author)

Conclusion

You can, of course, accomplish the same tasks by applying these functions separately. However, the pipe function offers a structured and organized way of combining several functions into a single operation.

Depending on the raw data and the tasks, the preprocessing may include more steps. You can add as many steps as you need in the pipe function. As the number of steps increases, the syntax becomes cleaner with the pipe function compared to executing the functions separately.

Thank you for reading. Please let me know if you have any feedback.

Also read:
- Using Python And Pandas Datareader to Analyze Financial Data
- Using Pandas Profiling to Accelerate Our Exploratory Analysis
- Pandas Essentials For Data Science
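Putting the pieces together, here is the whole pipeline as one runnable sketch. The outlier filter is written with explicit comparisons, which is equivalent to between() and avoids its version-dependent inclusive argument:

```python
import numpy as np
import pandas as pd

def fill_missing_values(df):
    # Replace NaNs in numeric columns with the column mean
    for col in df.select_dtypes(include=["int", "float"]).columns:
        df[col] = df[col].fillna(df[col].mean())
    return df

def drop_duplicates(df, column_name):
    # Keep the first row for each value of the given column
    return df.drop_duplicates(subset=column_name)

def remove_outliers(df, column_list):
    # Keep values within mean +/- 2 standard deviations
    for col in column_list:
        low = df[col].mean() - 2 * df[col].std()
        high = df[col].mean() + 2 * df[col].std()
        df = df[(df[col] >= low) & (df[col] <= high)]
    return df

df = pd.DataFrame({
    "id": [100, 100, 101, 102, 103, 104, 105, 106],
    "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
    "B": [45, 56, 48, 47, 62, 112, 54, 49],
    "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5],
})

# Working on a copy keeps the original df intact
df_processed = (df.copy()
                .pipe(fill_missing_values)
                .pipe(drop_duplicates, "id")
                .pipe(remove_outliers, ["A", "B"]))
print(df_processed.shape)
```

On this mock data, the pipe drops one duplicate id and the 112 outlier in column B, leaving 6 of the original 8 rows.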

SQL
Pandas

Pandas vs SQL. When Data Scientists Should Use One Over the Other
A deep dive into the benefits of each tool

Table of Contents
- Introduction
- Pandas
- SQL
- Summary
- References

Introduction

Both of these tools are important not only to data scientists but also to those in similar positions, like data analytics and business intelligence. With that being said, when should data scientists use pandas over SQL and vice versa? In some situations, you can get away with just using SQL; at other times, pandas is much easier to use, especially for data scientists who focus on research in a Jupyter Notebook setting. Below, I will discuss when you should use SQL and when you should use pandas. Keep in mind that both of these tools have specific use cases, but their functionality often overlaps, and that is what I will be comparing below.

Pandas

Photo by Kalen Kemp on Unsplash [2].

Pandas [3] is an open-source data analysis tool in the Python programming language. The benefit of pandas starts when you already have your main dataset, usually from a SQL query. This main difference can mean that the two tools are separate; however, you can also perform several of the same functions in each respective tool. For example, you can create new features from existing columns in pandas, perhaps more easily and quickly than in SQL.

It is important to note that I am not comparing what pandas does that SQL cannot do and vice versa. I will be picking the tool that can do the function more efficiently, or that is preferable for data science work, in my opinion and from personal experience.

Here are times where using pandas is more beneficial than SQL, while also having the same functionality as SQL:

Creating calculated fields from existing features

When writing a more complex SQL query, you often incorporate subqueries in order to divide values from different columns.
In pandas, you can divide features much more easily, like the following:

df["new_column"] = df["first_column"] / df["second_column"]

The code above shows how you can divide two separate columns and assign the values to a new column; in this case, you are performing the feature creation on the entire dataset or dataframe. You can use this in both feature exploration and feature engineering in the data science process.

Grouping by

Also referring to subqueries, grouping by in SQL can become quite complex and require lines and lines of code that can be visually overwhelming. In pandas, you can group by with one line of code. I am not referring to the group by at the end of a simple select-from-table query, but to one where multiple subqueries are involved.

df.groupby(by="first_column").mean()

This returns the mean of every column in the dataframe, grouped by first_column. There are many other ways to use this grouping function, which are outlined nicely in the pandas documentation linked below.

Checking data types

In SQL, you will often have to cast types, but sometimes it is a little clearer to see the way pandas lays out data types in a vertical format, rather than scrolling through a horizontal output in SQL. Some examples of returned data types are int64, float64, datetime64[ns], and object.

df.dtypes

While these are all fairly simple functions of pandas and SQL, in SQL they are particularly tricky and sometimes just much easier to implement in a pandas dataframe. Now, let's look at what SQL is better at performing.

SQL

Photo by Caspar Camille Rubin on Unsplash [4].

SQL is probably the language used by the widest range of positions. For example, a data engineer could use SQL, as could a Tableau developer or a product manager. With that being said, data scientists tend to use SQL frequently.
It is important to note that there are several different dialects of SQL, all usually having similar functionality, just formatted slightly differently.

Here are times where using SQL is more beneficial than pandas, while also having the same functionality as pandas:

WHERE clause

This clause in SQL is used frequently and can also be replicated in pandas. In pandas, however, it is slightly more difficult, or less intuitive: you have to write out redundant code, whereas in SQL you simply need the WHERE.

SELECT ID
FROM TABLE
WHERE ID > 100

In pandas, it would be something like:

df[df["ID"] > 100]["ID"]

Yes, both are simple; one is just a little more intuitive.

JOINS

Pandas has a few ways to join, which can be a little overwhelming, whereas in SQL you can perform simple joins like INNER, LEFT, and RIGHT:

SELECT
one.column_A,
two.column_B
FROM FIRST_TABLE one
INNER JOIN SECOND_TABLE two ON two.ID = one.ID

In this code, the join is slightly easier to read than in pandas, where you have to merge dataframes; especially as you merge more than two dataframes, it can become quite complex in pandas. SQL can perform multiple joins, whether INNER or otherwise, all in the same query.

All of these examples, whether SQL or pandas, can be used in at least the exploratory data analysis portion of the data science process, as well as in feature engineering and in querying model results once they are stored in a database.

Summary

This comparison of pandas versus SQL comes down to personal preference, and you may well disagree with my opinion. However, I hope it still sheds light on the differences between pandas and SQL, as well as what you can perform the same in both tools, using slightly different coding techniques and a different language altogether.

To summarize, we have compared the benefits of using pandas over SQL and vice versa for a few of their shared functions:

* creating calculated fields from existing features
* grouping by
* checking data types
* WHERE clause
* JOINS

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree with these comparisons — why or why not? Do you think one tool in particular is better than the other? What other data science tools can you think of that would offer a similar comparison? What other functions of pandas and SQL could we compare?
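For reference, the SQL snippets above translate roughly to the following pandas calls. The table and column names are the hypothetical ones from the queries, with small mock frames for illustration:

```python
import pandas as pd

first_table = pd.DataFrame({"ID": [1, 150, 200], "column_A": ["a", "b", "c"]})
second_table = pd.DataFrame({"ID": [150, 200, 300], "column_B": ["x", "y", "z"]})

# SELECT ID FROM TABLE WHERE ID > 100
filtered = first_table[first_table["ID"] > 100]["ID"]

# SELECT one.column_A, two.column_B
# FROM FIRST_TABLE one INNER JOIN SECOND_TABLE two ON two.ID = one.ID
joined = first_table.merge(second_table, on="ID", how="inner")[["column_A", "column_B"]]
print(filtered.tolist())
print(joined)
```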

Visualization
Pandas
Libraries

Using Predictive Power Score to Pinpoint Non-linear Correlations

Business
Libraries
Pandas

Using Python And Pandas Datareader to Analyze Financial Data
Finance and economics are becoming more and more interesting for all kinds of people, regardless of their career or profession. This is because we are all affected by economic data, or at least we are increasingly interested in being up to date, and we have a lot of information at hand.

Every day, billions of bytes of financial data are sent over the Internet. Whether it is the price of a share, an e-commerce transaction, or even information on a country's GDP, all this data, when properly organized and managed, can be used to build some amazing and insightful software applications.

Pandas
Libraries

Using Pandas Profiling to Accelerate Our Exploratory Analysis
Pandas Profiling is a library that generates reports from a pandas DataFrame. The df.describe() function that we normally use in Pandas is great, but it is a bit basic for a more serious and detailed exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

News
Machine Learning
Business
Pandas

Interview To The Winners Of The Data Science Competition "Predicting App Ratings In Google Play Store".
As usual, we have given ourselves the task of interviewing the winners of the competition "Google Play Store Rating Prediction" that ended a few days ago. The winner was Edimer "Siderus" from Colombia, with a score of 0.698709403908066, who has become #1 on our general leaderboard across the 5 competitions we have run so far.

The objective of this competition was to analyze and rank the rating of mobile applications in the Android marketplace of the Google Play Store. The model was evaluated using the F1 score, because the amount of data in the two classes was not symmetrical. As we worked with an imbalanced dataset, the goal was to optimize the model to properly classify both classes and maximize classification accuracy, especially for the minority class.

For this competition we had a record number of participants, with 135 people joining, and we evaluated a total of 1,497 models. Many thanks to the participants, and we invite you to take part in the new competition, "Prediction of Online Shoppers Purchasing Intention".

Let's take a look at the first places of the competition and the answers they gave us in the interview. Let's learn from them!

Rank #1 - Siderus - Colombia

Q: In general terms, how did you approach the problem posed in the competition?

A: At first I tried to understand the problem correctly, familiarizing myself with the database. Then I spent a lot of time building graphs, trying to find underlying patterns or anomalies in the data that would allow me to make objective decisions. Finally, I fitted three models that served as a baseline to check whether new ideas (or algorithms) performed better than these initial results.

Q: For this particular competition, did you have any previous experience in this field?

A: No, none. My field is agricultural sciences.

Q: What important results/conclusions did you find in exploring the data?
What challenges did you have to deal with?

A: Several results caught my attention. For example, an application with many reviews is not necessarily successful; however, the ratio between the number of installations and reviews turned out to be the most important variable for my models. I found it interesting that free apps were more likely to be unsuccessful, and it also seems that people like apps to be constantly updated and small in size. Personally, I think the biggest problem was that the classes were unbalanced; fortunately, there are tools that use sampling with replacement to let us work with this type of data.

Q: In general terms, what data processing and feature engineering did you do for this competition?

A: As preprocessing, I imputed missing values with the k-nearest neighbors algorithm. For the multilayer perceptron I standardized the numerical variables and transformed them with the Yeo-Johnson transformation; for tree-based algorithms (XGBoost, LightGBM or CatBoost) I only imputed the data. In all algorithms I used up-sampling to balance the classes.

Q: Which Machine Learning algorithms did you use for the competition?

A: I tried many: Naive Bayes, KNN, generalized linear models with regularization, a multilayer perceptron with Keras, a Support Vector Machine with radial kernel, Random Forest, XGBoost, LightGBM, CatBoost, among others.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?

A: The three highest-scoring algorithms were LightGBM, CatBoost and the multilayer perceptron; an ensemble of the three provided the best results.

Q: What libraries did you use for this particular competition?

A: All my work was in R, using tidyverse and tidymodels as the main libraries. I also used lightgbm, catboost and treesnip.
The themis library was very useful for up-sampling.

Q: How many years of experience do you have in Data Science and where are you currently working?

A: I have been working with data for about 5 years, mainly in the design and statistical analysis of agricultural experiments.

Q: What advice would you give to those who did not score so well in the competition?

A: Explore the data a lot and invest a lot of time in visualization; understanding the problem is, I think, the fundamental part of any data-driven project.

Rank #2 - Pablo Lucero - Ecuador

Q: In general terms, how did you approach the problem posed in the competition?

A: First I did a basic exploratory analysis, then I built a baseline to have something to build on. Subsequently, I performed attribute extraction and then generated new attributes. For modeling I tried different algorithms; the best results came from tree-based methods, which I optimized to improve the final score.

Q: For this particular competition, did you have any previous experience in this field?

A: Yes, in my previous job I had the opportunity to address similar problems.

Q: What important findings/conclusions did you find in exploring the data? What challenges did you have to deal with?

A: Very quickly: free applications are the most in demand, and most of the successful applications support at least version 4.1. The Everyone category has the most applications on the market. One of the challenges was the generation of new attributes; I think that was the key to reaching the top positions.

Q: In general terms, what data processing and feature engineering did you do for this competition?

A: In general terms, for the data processing I cleaned the text-type attributes to convert them to numerical values (Price, Installs, last update, etc.) and removed symbols or other unnecessary characters (Current Ver).
As for the attribute engineering part, this was based on deriving new attributes from the relationships that may exist between the App attribute and the rest. For example, the number of words in the App title, or whether a Category word appears in the App title. This gave us about 20 base attributes. A logarithmic transformation was also applied to improve the distribution of certain attributes. Genetic programming was then applied to obtain about 40 new attributes, for a total set of 60.

Q: What Machine Learning algorithms did you use for the competition?

A: I tried different ones: SVM, RF, MLP, LightGBM, XGBoost and CatBoost.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?

A: Of all of them, the one that gave me the best results was LightGBM, so I decided to optimize its parameters for the final round.

Q: What libraries did you use for this particular competition?

A: One for genetic programming called gplearn.

Q: How many years of experience do you have in Data Science and where do you currently work?

A: I have 5 years of experience. I am currently working in a manufacturing company in the project area, leading Industry 4.0 topics.

Q: What advice would you give to those who did not score so well in the competition?

A: Review online documentation on similar problems; it helps to get a better picture of the problem. (We should not reinvent the wheel.)

Rank #3 - Fernando Chica - Ecuador

Q: In general terms, how did you approach the problem posed in the competition?

A: Initially, I performed an exploratory analysis to identify the features of the data; from there I postulated possible feature extraction techniques and classification models.
Q: For this particular competition, did you have any previous experience in this field?

A: In data analysis, yes, but not for this particular problem of predicting application ratings.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?

A: The first thing you notice is that most of the variables are categorical, so at the beginning you had to think about what kind of transformation could turn them into numerical variables, since not all models can work with categorical variables. On the other hand, the main problem with this database (it is even mentioned in the description of the challenge) is that the amount of data in each class is not the same; that is, it is an unbalanced dataset. In that sense, the challenge was to select the model or process to address this problem and avoid overfitting.

Q: In general terms, what data processing and feature engineering did you do for this competition?

A: Transformation from categorical to numerical variables, then data balancing tests: duplicating data from the class with fewer observations, removing data from the class with more observations, and creating synthetic data (until the classes were balanced) for the class with fewer observations. But there was no significant improvement in the performance of the models tested, so data balancing was not used in the final model.

Q: What Machine Learning algorithms did you use for the competition?

A: Multilayer Perceptron, linear regression, decision trees, XGBoost, LightGBM, Random Forest and Bagging.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?

A: The one that gave me the best score was Bagging, using decision trees as base models.
I think it worked better because of the data processing I did. With Bagging you can also choose the importance given to each class during training, and since the data is unbalanced, this allows you to regularize the model and prevent overfitting.

Q: What libraries did you use for this particular competition?

A: A variety of libraries, but broadly: sklearn, numpy, pandas, matplotlib, seaborn, imblearn, datetime and keras.

Q: How many years of experience do you have in Data Science and where are you currently working?

A: I have about 4 years of experience in Data Science and I am currently working as a researcher at a university in the field of applied artificial intelligence.

Q: What advice would you give to those who did not score so well in the competition?

A: Be very curious about what the data hides, consider strategies that may seem absurd, and look beyond what the data shows at first glance.

Rank #4 - Nicolás Dominutti - Argentina

Q: In general terms, how did you approach the problem posed in the competition?

A: After the EDA, I applied a preprocessing pipeline to extract valuable data from the variables. Then I focused on generating new variables that would provide another perspective on the original data before entering the model selection stage.

Q: For this particular competition, did you have any previous experience in this field?

A: This is the first official competition I have participated in; previously I did bootcamps and focused on personal ML projects.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?

A: The EDA showed that the dataset was highly unbalanced and consisted of very disparate and messy variables that demanded an interesting data processing pipeline. On the other hand, this analysis also revealed insights that allowed us to generate new variables that added value (e.g.
apps with 0 reviews tended to have a high rating almost unanimously).

Q: In general terms, what data processing and feature engineering did you do for this competition?

A: We applied techniques such as extraction of relevant data via regex, creation of new variables, encoding of features treated as categorical, and standardization of numeric variables (for algorithms that need it; it was not used for the winning algorithm, an XGBoost). As an interesting point, given the unbalanced dataset, I chose to perform random oversampling on the least represented class.

Q: What Machine Learning algorithms did you use for the competition?

A: I tested Logistic Regression, SVM, Random Forest, CatBoost and XGBoost.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?

A: It is not surprising that the best score was obtained with XGBoost, an algorithm already consolidated in worldwide competitions. It is a very powerful library based on boosting, which makes it possible to obtain impressive scores.

Q: Which libraries did you use for this particular competition?

A: re, numpy, pandas, sklearn, catboost and xgboost.

Q: How many years of experience do you have in Data Science and where are you currently working?

A: It has been 2 years since I started my first Data Science courses. I am currently working at Johnson & Johnson.

Q: What advice would you give to those who did not score as well in the competition?

A: Spend time understanding the problem domain in detail, ask yourself questions about the why of the industry, and capture the answers and insights in the dataset.

Rank #5 - Fernando Cifuentes - Colombia

Q: In general terms, how did you approach the problem posed in the competition?
A: First I had to understand the problem and the variables, and above all do a good cleaning job on them, since it was difficult to work with them as they were. Then I created new variables, after that I optimized hyperparameters in my models, and finally I made the prediction.

Q: For this particular competition, did you have any previous experience in this field?

A: I have experience in classification models, which I have been working on for the last few years.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?

A: For this case it was a challenge to work with the version variable, as it did not correspond to a decimal number (e.g. 8.1.1). The same went for the Android version, which indicated that it varied depending on the device. The conclusion was that it is not possible to work with these variables directly; a good cleaning job had to be done before feeding them into the model. In addition, I realized the data were unbalanced, so I had to use a SMOTE algorithm to obtain a balanced base through oversampling.

Q: In general terms, what data processing and feature engineering did you do for this competition?

A: For example, for the version I took only up to its second level, i.e. 8.1. For the update date, I took the maximum update date in the base and, relative to that date, calculated how many months the other applications had gone without an update. For the Android version, I imputed the data in order to have an approximation in the cases where a version was not specified. I also created a new variable, which I call the rating ratio, corresponding to the number of comments over the number of downloads; this was the most important variable in my ranking model.

Q: What Machine Learning algorithms did you use for the competition?

A: I used 3 models: Random Forest, XGBoost and LightGBM.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?

A: An ensemble model by voting of the three models mentioned above. I think it got the best result because, at the macro level, each model had very similar metrics; however, at the individual level the predictions varied for some records, so the ensemble reached a "consensus" among the three models.

Q: What libraries did you use for this particular competition?

A: The main libraries used were pandas, sklearn, xgboost and lightgbm.

Q: How many years of experience do you have in Data Science and where do you currently work?

A: I am currently working in a bank, specifically in modeling, for about three years.

Q: What advice would you give to those who did not score so well in the competition?

A: Don't get discouraged, we all start like that. Keep participating in competitions and reading forums; that's where you get the most help to improve your results.

Rank #6 - David Villabón - Colombia

Q: In general terms, how did you approach the problem posed in the competition?

A: The first thing I did with the dataset was to transform the variables that were supposed to be numerical, then feature engineering, then testing raw models by evaluating their F1 score, and finally improving the selected model!

Q: For this particular competition, did you have any previous experience in this field?

A: No, but through exploration and understanding of the data I gained insights into the field.

Q: What important findings/conclusions did you find in exploring the data?
What challenges did you have to deal with?

A: Evidently, the exploration of the data showed a considerable imbalance in the target "Rating", which made it a challenge to obtain good results.

Q: In general terms, what data processing and feature engineering did you do for this competition?

A: After transforming the data that I assumed was numerical but was not, I proceeded to encode the categorical variables, then removed outliers, scaled the data, performed variable selection, and finally applied techniques for balancing the target variable.

Q: What Machine Learning algorithms did you use for the competition?

A: I tested LogisticRegression, Perceptron, RandomForestClassifier, KNN, XGBoost, LightGBM, RUSBoostClassifier and AdaBoostClassifier.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?

A: I chose RUSBoostClassifier, since it did not overfit.

Q: What libraries did you use for this particular competition?

A: I used Pandas, Numpy, matplotlib, Sklearn, Imblearn and xgboost.

Q: How many years of experience do you have in Data Science and where do you currently work?

A: I have been studying data science for a couple of years; currently my work is not related to Data Science.

Q: What advice would you give to those who did not score so well in the competition?

A: It is fundamental to understand the dataset, to scrutinize the data, and to know how to select the final model. I think those are some of the keys to obtaining good results.

Rank #9 - James Valencia - Peru

Q: In general terms, how did you approach the problem posed in the competition?

A: I followed the steps described in the CRISP-DM methodology.
To address the particular problem of the unbalanced target, I divided the training set into three partitions, trained a different boosting model on each partition, and obtained the final prediction by combining the three predictions from the models.

Q: For this particular competition, did you have any previous experience in this field?

A: I participated in the previous DataSourceAI competition and also in some competitions on Kaggle.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?

A: Preprocessing of the data was necessary to obtain numerical data and identify the impact on the target. In addition, I had to investigate an evaluation approach suited to an unbalanced target: model ensembling.

Q: In general terms, what data processing and feature engineering did you do for this competition?

A: I used regex to remove characters such as M (million), $ (dollar), etc. For the encoding of categorical variables, I used the average of the target associated with each category in the analyzed column.

Q: What Machine Learning algorithms did you use for the competition?

A: Three boosting models: CatBoost, XGBoost and LightGBM.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?

A: The LightGBM model, because it is a more optimized model and works well with large amounts of previously processed data.

Q: What libraries did you use for this particular competition?

A: The classic libraries for preprocessing: pandas, scikit-learn, matplotlib, metrics, among others. Plus some particular ones for the boosting models: catboost, xgboost and lightgbm.

Q: How many years of experience do you have in Data Science and where are you currently working?

A: I have two years of experience coding predictive clustering, classification and regression models in Python.
In addition, due to the elections in my country (Peru), I am training natural language processing models, taking as input tweets from social networks through the tweepy and spacy libraries.

Q: What advice would you give to those who did not score so well in the competition?

A: Do your own research through tutorials on the internet. Currently there are many resources on Kaggle, Analytics Vidhya, Towards Data Science and even YouTube channels (my favorite is StatQuest).

Rank #10 - Frank Diego - Peru

Q: In general terms, how did you approach the problem posed in the competition?

A: By performing an exploratory analysis of the data, cleaning the data, identifying the most significant predictor variables and testing different classification models.

Q: For this particular competition, did you have any previous experience in this field?

A: First time.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?

A: Finding categorical variables with high cardinality, imbalanced data, identifying and removing outliers in different predictor variables, and testing various classification models.

Q: In general terms, what data processing and feature engineering did you do for this competition?

A: Removing special characters and text characters in the Size, Installs and Prices variables; identifying the version number of each app and the number of Android versions available for each app; using encoding techniques for categorical variables; and data normalization.

Q: What Machine Learning algorithms did you use for the competition?

A: Logistic Regression and Random Forest.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?

A: Random Forest, because it had better scores in accuracy, precision and recall.

Q: Which libraries did you use for this particular competition?

A: Pandas, sklearn, matplotlib, seaborn and scikitplot.

Q: How many years of experience do you have in Data Science and
where are you currently working?

A: I've only been in the data science world for about half a year. I have taken online courses on data processing with the Pandas library and basic statistics, and followed YouTube tutorials on machine learning, which helped me apply it to this challenge. On the other hand, I have a venture in commercial intelligence for exports from Peru, which allows me to support exporting companies on the foreign trade scenario in various productive sectors.

Q: What advice would you give to those who did not score so well in the competition?

A: Deepen the exploratory analysis of the datasets to obtain a better understanding of the most important characteristics that influence the target variable.

Conclusion

As we can see, each of the participants was able to test different models, among which boosting models stand out, and each participant experimented with different approaches to solve the problem. We hope you have drawn your own conclusions; you can share them with us in the comments. We await you in the competition that is currently active, and maybe you could be among the TOP 10 interviewees of the next competition!

Join Competition

Many thanks to all the participants and to the winners who helped us with the survey!

PS: we are growing our data scientist discussion forum on Slack at the following link, join and participate.
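A recurring theme in these interviews is balancing the classes by oversampling the minority class before training. As a rough illustration of the idea, here is a toy sketch in pandas; it is not the winners' actual code (they used tools such as themis, imblearn or SMOTE), and the column names are hypothetical:

```python
import pandas as pd

# Toy imbalanced dataset: 90 "unsuccessful" apps (class 0) vs 10 "successful" ones (class 1).
df = pd.DataFrame({
    "reviews": range(100),
    "rating_class": [0] * 90 + [1] * 10,
})

# Random oversampling: resample the minority class with replacement
# until both classes have the same number of rows.
counts = df["rating_class"].value_counts()
minority = counts.idxmin()
n_extra = counts.max() - counts.min()

extra = df[df["rating_class"] == minority].sample(n_extra, replace=True, random_state=0)
balanced = pd.concat([df, extra], ignore_index=True)

print(balanced["rating_class"].value_counts().min())  # 90: both classes now equal
```

Note that oversampling should be applied only to the training split, never before the train/test split, or the duplicated rows leak into the evaluation set.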

Pandas
Python
Machine Learning

Building A Linear Regression Model With Python To Predict Retail Customer Spending
We will create a complete project trying to predict customer spending using linear regression with Python. In this exercise, we have some historical transaction data from 2010 and 2011. For each transaction, we have a customer identifier (CustomerID), the number of units purchased (Quantity), the date of purchase (InvoiceDate) and the unit cost (UnitPrice), as well as some other information about the purchased item. You can find the dataset here.

We want to prepare this data for a regression of 2010 customer transaction data against 2011 spending. Therefore, we will create features from the 2010 data and calculate the target (the amount of money spent) for 2011. When we create this model, it should generalize to future years for which we do not yet have the outcome. For example, we could use 2020 data to predict 2021 spending behavior in advance, unless the market or business has changed significantly since the time period covered by the data used to fit the model:

import pandas as pd
df = pd.read_csv('datasets/retail_transactions.csv')
df.head()

Convert the InvoiceDate column to date format using the following code:

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df.head()

Calculate the revenue for each row by multiplying the quantity by the unit price:

df['revenue'] = df['UnitPrice']*df['Quantity']
df.head()

You will notice that each invoice is spread over several rows, one for each type of product purchased. These can be combined so that the data for each transaction is in a single row. To do this, we can perform a groupby operation on InvoiceNo. Before that, however, we need to specify how to combine the rows that are grouped together. Use the following code:

operations = {'revenue':'sum',
'InvoiceDate':'first',
'CustomerID':'first'
}
df = df.groupby('InvoiceNo').agg(operations)
df.head()
In the preceding code snippet, we first specify the aggregation functions we will use for each column, and then perform the grouping and apply those functions. InvoiceDate and CustomerID will be the same for all rows of the same invoice, so we can simply take the first entry for them. For revenue, we sum the revenue for all items on the same invoice to get the total revenue for that invoice.

Since we will be using the year to decide which rows are used for prediction and which ones we are predicting, create a separate column called year, as follows:

df['year'] = df['InvoiceDate'].apply(lambda x: x.year)
df.head()

Transaction dates can also be an important source of features. The days from a customer's last transaction to the end of the year, or how early a customer made their first transaction, can tell us a bit about the customer's purchase history, which could be important. Therefore, for each transaction, we will calculate how many days there are between the last day of 2010 and the date of the invoice:

df['days_since'] = (pd.Timestamp(year=2010, month=12, day=31) -
df['InvoiceDate']).apply(lambda x: x.days)
df.head()

Currently, we have the data grouped by invoice, but we really want it grouped by customer. We'll start by calculating all of our predictors. We will again define a set of aggregation functions for each of our variables and apply them using groupby. We will calculate the sum of the revenues. For `days_since`, we will calculate the maximum and minimum number of days (giving us features that tell us how long this customer was active in 2010, and how recently), as well as the number of unique values (giving us how many separate days this customer made purchases on). Since these are our predictors, we will apply these functions only to the data from 2010, store the result in a variable, X, and use the `head` function to see the results:

operations = {'revenue':'sum',
'days_since':['max','min','nunique'],
}
X = df[df['year'] == 2010].groupby('CustomerID').agg(operations)
X.head()

As you can see in the output above, since we perform multiple types of aggregation on the `days_since` column, we end up with multi-level column labels. To simplify this, we can flatten the column names for easy reference later. Use the following code and print the results:

X.columns = [' '.join(col).strip() for col in X.columns.values]
X.head()

Let's calculate one more feature: the average spend per order. We can calculate this by dividing the sum of the revenue by `days_since nunique`, the number of unique purchase days (this is actually the average spend per day, not per order, but we are assuming that if two orders were placed on the same day, we can treat them as part of the same order for our purposes):

X['avg_order_cost'] = X['revenue sum']/X['days_since nunique']
X.head()

Now that we have our predictors, we need the outcome we will predict, which is just the sum of revenues for 2011. We can calculate it with a simple groupby and store the values in the variable y, as follows:

y = df[df['year'] == 2011].groupby('CustomerID')['revenue'].sum()
y

Now we can put our predictors and outcome into a single DataFrame, `wrangled_df`, and rename the columns to more intuitive names. Finally, look at the resulting DataFrame using the `head` function:

wrangled_df = pd.concat([X,y], axis=1)
wrangled_df.columns = ['2010 revenue',
'days_since_first_purchase',
'days_since_last_purchase',
'number_of_purchases',
'avg_order_cost',
'2011 revenue']
wrangled_df.head()

Note that many of the values in our DataFrame are `NaN`. This is caused by customers who were active only in 2010 or only in 2011, so there is no data for the other year. Later we will work on predicting which of our customers will churn, but for now we will just drop all customers not active in both years. Note that this means our model will predict customer spending in the next year assuming they are still active customers. To remove customers with missing values, we will drop rows where either revenue column is null, as follows:

wrangled_df = wrangled_df[~wrangled_df['2010 revenue'].isnull()]
wrangled_df = wrangled_df[~wrangled_df['2011 revenue'].isnull()]
wrangled_df.head()
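As an aside, the two null filters above can be written in one step with pandas' `dropna`. A minimal equivalent sketch, using a hypothetical miniature of `wrangled_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of wrangled_df: some customers are missing one year's revenue.
wrangled_df = pd.DataFrame({
    '2010 revenue': [100.0, np.nan, 250.0],
    '2011 revenue': [120.0, 80.0, np.nan],
})

# Keep only rows where both revenue columns are present,
# mirroring the pair of isnull() filters above.
wrangled_df = wrangled_df.dropna(subset=['2010 revenue', '2011 revenue'])
print(len(wrangled_df))  # 1
```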
As a final data cleaning step, it is often a good idea to get rid of outliers. One common rule of thumb is to treat as an outlier any data point more than three standard deviations above the median, so we will use this to remove customers that are outliers in terms of 2010 or 2011 revenue:

wrangled_df = wrangled_df[wrangled_df['2011 revenue']
< ((wrangled_df['2011 revenue'].median())
+ wrangled_df['2011 revenue'].std()*3)]
wrangled_df = wrangled_df[wrangled_df['2010 revenue']
< ((wrangled_df['2010 revenue'].median())
+ wrangled_df['2010 revenue'].std()*3)]

wrangled_df.head()
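The same median-plus-three-standard-deviations rule is applied twice above, so it can be factored into a small helper. A sketch, using a hypothetical `filter_outliers` function and toy data:

```python
import pandas as pd

def filter_outliers(df, column, n_std=3):
    """Drop rows where `column` exceeds the column's median
    by more than `n_std` standard deviations."""
    cutoff = df[column].median() + df[column].std() * n_std
    return df[df[column] < cutoff]

# Toy example: one extreme revenue value gets removed.
demo = pd.DataFrame({'2011 revenue': [100] * 50 + [100000]})
demo = filter_outliers(demo, '2011 revenue')
print(len(demo))  # 50
```

One caveat of this rule: a single extreme value also inflates the standard deviation (and hence the cutoff), so with only a handful of rows an obvious outlier can survive the filter; it behaves better on larger samples like this one.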
After the data cleanup and feature engineering are done, it is often a good idea to save the new data to a new file so that, as you develop the model, you do not need to run the data through the entire feature engineering and cleanup pipeline every time you want to rerun the code. We can do this using the `to_csv` function:

wrangled_df.to_csv('datasets/wrangled_transactions.csv')

Examining the relationships between the predictors and the outcome

In this exercise, we will use the features we calculated in the previous exercise and see whether these variables have any relationship with our outcome of interest (customer sales revenue in 2011).

Use pandas to import the data you saved at the end of the last exercise, using CustomerID as the index:

df = pd.read_csv('datasets/wrangled_transactions.csv', index_col='CustomerID')

The seaborn library has a number of plotting features. Its pair plot function will draw histograms and pairwise scatter plots of all our variables in a single figure, allowing us to easily examine both the distributions of our data and the relationships between variables. Use the following code:

import seaborn as sns
%matplotlib inline
sns.pairplot(df)

In the resulting grid, the diagonal shows a histogram for each variable, while each off-diagonal cell shows the scatter plot between a pair of variables. The bottom row of figures shows the scatter plots of 2011 revenue (our outcome of interest) against each of the other variables. Because the data points overlap and there is a fair amount of variation, the relationships do not appear very clear in the visualizations.

Therefore, we can use correlations to help us interpret the relationships. The `corr` function of pandas will compute the correlations between all the variables in a DataFrame:

df.corr()

Again, we can look at the last row to see the relationships between our predictors and the outcome of interest (2011 revenue). Positive numbers indicate a positive relationship, e.g., the higher a customer's 2010 revenue, the higher their expected revenue in 2011. Negative numbers mean the opposite, e.g., the more days since a customer's last purchase, the lower the expected revenue for 2011. Also, the larger the absolute value, the stronger the relationship. The resulting correlations should make sense: customers who spent more in 2010 tend to spend more in 2011, while customers whose last purchase was longer ago tend to spend less.

Building a linear model that predicts customer spending

In this exercise, we will build a linear model of customer spending using the features created in the previous exercise.

Recall that there is only a weak relationship between `days_since_first_purchase` and 2011 revenue, so we will not include that predictor in our model. Store the predictor columns and the outcome column in the X and y variables, respectively:

X = df[['2010 revenue',
'days_since_last_purchase',
'number_of_purchases',
'avg_order_cost'
]]
y = df['2011 revenue']

We use sklearn to perform a train-test split of the data, so that we can evaluate the model on data it was not trained on, as shown here:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

We import LinearRegression from sklearn, create a LinearRegression model and fit it to the training data:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)We examine the coefficients of the model by checking the coef_ property. Note that these are in the same order as our X columns: 2010 revenue, days since last purchase, number of purchases and average order cost:model.coef_
>> array([ 5.78799016, 7.47737544, 336.60769871, -2.0558923 ])Check the intercept term of the model by checking the intercept_ property:model.intercept_
>> 264.8693265705956Now we can use the fitted model to make predictions about a customer outside our data set.Make a DataFrame containing a customer's data, where the 2010 revenue is 1,000, the number of days since last purchase is 20, the number of purchases is 2, and the average order cost is 500. Have the model make a prediction on this customer's data:single_customer = pd.DataFrame({
'2010 revenue': [1000],
'days_since_last_purchase': [20],
'number_of_purchases': [2],
'avg_order_cost': [500]
})
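A linear model's prediction is simply the intercept plus the dot product of the coefficients with the feature values. As a minimal self-contained sketch (with made-up numbers, not the article's customer data), we can verify that `predict` and the manual computation agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: four customers, two features each (hypothetical numbers)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([10.0, 8.0, 20.0, 18.0])

model = LinearRegression().fit(X, y)

# A prediction is intercept_ + coef_ . x for each row
x_new = np.array([[5.0, 6.0]])
manual = model.intercept_ + x_new @ model.coef_
print(manual, model.predict(x_new))  # the two agree
```

This is all `predict` does for a linear regression, which is why the coefficients are so directly interpretable.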
```python
single_customer
```

Have the model make a prediction for this customer:

```python
model.predict(single_customer)
>> array([5847.67624446])
```

We can plot the model's predictions on the test set against the actual values. First, we import matplotlib and make a scatter plot of the model's predictions on X_test against y_test. We constrain the x and y axes to a maximum of 10,000 so that we have a better view of where most of the data points lie. Finally, we add a line with slope 1, which serves as our reference: if all points lay on this line, it would mean that we have a perfect relationship between our predictions and the true response:

```python
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(model.predict(X_test), y_test)
plt.xlim(0, 10000)
plt.ylim(0, 10000)
plt.plot([0, 10000], [0, 10000], 'r-')
plt.xlabel('Model Predictions')
plt.ylabel('True Value')
plt.show()
```

In the graph above, the red line indicates where the points would lie if each prediction equaled the actual value. Since many of our points are quite far from the red line, the model is not completely accurate. However, there does appear to be some relationship: higher model predictions tend to correspond to higher true values.

To examine the relationship further, we can use correlation. From scipy, we can import the pearsonr function, which calculates the correlation between two arrays, just as we did with pandas for the entire DataFrame. We use it to calculate the correlation between our model's predictions and the actual values:

```python
from scipy.stats import pearsonr

pearsonr(model.predict(X_test), y_test)
>> (0.6125740076680493, 1.934002067463782e-20)
```

Two numbers are returned: (0.6125740076680493, 1.934002067463782e-20). The first is the correlation, which is close to 0.6, indicating a strong relationship. The second is the p-value, which indicates the probability of seeing such a strong relationship if the two sets of numbers were unrelated; the very low value here means this relationship is unlikely to be due to chance.

Conclusion

We have constructed a simple example of linear regression. You could try the same exercise with decision trees and compare the resulting models. In a later article, we will show how to do this.
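As the conclusion suggests, a tree-based model can be swapped in with almost no code changes thanks to sklearn's uniform fit/predict/score interface. Here is a minimal sketch on synthetic, roughly linear data (hypothetical numbers, not the article's customer DataFrame), comparing DecisionTreeRegressor against LinearRegression on a held-out set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, roughly linear data standing in for the customer features
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = X @ np.array([5.0, 300.0, -2.0]) + 250.0 + rng.normal(scale=10.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=100).fit(X_train, y_train)

# score() returns R^2 on the held-out data; on genuinely linear data the
# linear model should do at least as well, while a tree shines when the
# relationships are non-linear
print('linear R^2:', linear.score(X_test, y_test))
print('tree R^2:  ', tree.score(X_test, y_test))
```

On real customer data the comparison could go either way, which is exactly why evaluating both on a held-out test set, as done above, is the fair way to choose between them.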
