Finance and economics are becoming more and more interesting to all kinds of people, regardless of career or profession. We are all affected by economic data, we increasingly want to stay up to date, and we have an enormous amount of information at hand.
Every day, billions of bytes of financial data travel over the Internet, whether it is the price of a share, an e-commerce transaction, or a country's GDP. Properly organized and managed, all this data can be used to build amazing and insightful software applications.
We will use Python to access public financial data, organize it, and combine it to gain new insights into how money makes the world go round. We will focus mainly on two Python modules: pandas and pandas-datareader.
We will use these modules to import data from some of the largest financial organizations in the world, as well as data stored locally on our computers. By the end of the notebook, you should feel comfortable importing financial data into Python, whether from a public source or a local file, organizing that data, and combining datasets with one another.
Many financial institutions, stock markets, and global banks make large amounts of their data publicly available. Most of this data is well organized, updated live, and accessible through an application programming interface (API), which gives programming languages such as Python a way to download and import it.
The pandas-datareader module is specifically designed to interface with some of the world's most popular financial data APIs, and import their data into an easily digestible pandas DataFrame. Each financial API is accessed by a different function exposed by pandas-datareader. Generally, accessing each API requires a different set of arguments and information to be provided by the programmer.
We will import data from several of these APIs and play with them. For a complete list of all data that the pandas-datareader can access, you can consult the official documentation.
Let's write a small program that uses pandas-datareader to fetch GDP data from the World Bank API. First we will install the library as follows (please run the command in the next cell yourself, I already have it installed!):
!pip3 install pandas_datareader
from pandas_datareader import wb
from datetime import datetime
Here we have imported the World Bank reader via wb, and the Python datetime library so we can set the start and end dates of the analysis. For more information about Python datetimes, see this other notebook!
start = datetime(2005, 1, 1)
end = datetime(2008, 1, 1)
indicator_id = 'NY.GDP.PCAP.KD'
gdp_per_capita = wb.download(indicator=indicator_id, start=start, end=end, country=['US', 'CA', 'MX'])
gdp_per_capita
| NY.GDP.PCAP.KD | ||
|---|---|---|
| country | year | |
| Canada | 2008 | 48495.204037 |
| 2007 | 48534.174482 | |
| 2006 | 45857.996555 | |
| 2005 | 44471.080072 | |
| Mexico | 2008 | 9587.636339 |
| 2007 | 9622.047957 | |
| 2006 | 9547.333571 | |
| 2005 | 9270.656542 | |
| United States | 2008 | 49319.478865 |
| 2007 | 49856.281491 | |
| 2006 | 49405.767296 | |
| 2005 | 48499.812376 |
The download method has several parameters that we can pass in order to get the data from the World Bank. Among them is indicator, to which we passed our indicator_id. You can find more information about the parameters here.
As we can see, with this command we have obtained GDP per capita data for three countries over a given date range.
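The frame returned by wb.download is indexed by a (country, year) MultiIndex, which is why the table above nests years under countries. Reshaping it with unstack often makes cross-country comparison easier. Here is a sketch on a small hand-built frame with the same shape (the numbers are illustrative, rounded from the output above, not a live API call):

```python
import pandas as pd

# Build a small frame shaped like wb.download output:
# a (country, year) MultiIndex with one indicator column.
index = pd.MultiIndex.from_product(
    [['Canada', 'Mexico'], ['2005', '2006']],
    names=['country', 'year']
)
gdp = pd.DataFrame({'NY.GDP.PCAP.KD': [44471.1, 45858.0, 9270.7, 9547.3]},
                   index=index)

# Pivot countries into columns so each year is one row.
wide = gdp['NY.GDP.PCAP.KD'].unstack(level='country')
print(wide)
```

The same unstack call works unchanged on the real gdp_per_capita frame downloaded above.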
The NASDAQ Stock Exchange identifies each of its listed stocks with a unique ticker symbol.
It also provides a useful API for accessing the symbols currently traded on it. Pandas-datareader provides several functions to import data from the NASDAQ API through its nasdaq_trader sub-module.
from pandas_datareader.nasdaq_trader import get_nasdaq_symbols
To import the list of stock symbols, we use the get_nasdaq_symbols function from nasdaq_trader, like this:
symbols = get_nasdaq_symbols()
When called, it goes to the NASDAQ API and imports the list of symbols currently being traded. The advantage of using pandas-datareader is that all the logic for interacting with the NASDAQ API, or any other API, is encapsulated in easy-to-use sub-modules and functions like the ones above.
symbols
| Nasdaq Traded | Security Name | Listing Exchange | Market Category | ETF | Round Lot Size | Test Issue | Financial Status | CQS Symbol | NASDAQ Symbol | NextShares | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Symbol | |||||||||||
| A | True | Agilent Technologies, Inc. Common Stock | N | False | 100.0 | False | NaN | A | A | False | |
| AA | True | Alcoa Corporation Common Stock | N | False | 100.0 | False | NaN | AA | AA | False | |
| AAA | True | Listed Funds Trust AAF First Priority CLO Bond... | P | True | 100.0 | False | NaN | AAA | AAA | False | |
| AAAU | True | Perth Mint Physical Gold ETF | P | True | 100.0 | False | NaN | AAAU | AAAU | False | |
| AACG | True | ATA Creativity Global - American Depositary Sh... | Q | G | False | 100.0 | False | N | NaN | AACG | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZXYZ.A | True | Nasdaq Symbology Test Common Stock | Q | Q | False | 100.0 | True | N | NaN | ZXYZ.A | False |
| ZXZZT | True | NASDAQ TEST STOCK | Q | G | False | 100.0 | True | N | NaN | ZXZZT | False |
| ZYME | True | Zymeworks Inc. Common Shares | N | False | 100.0 | False | NaN | ZYME | ZYME | False | |
| ZYNE | True | Zynerba Pharmaceuticals, Inc. - Common Stock | Q | G | False | 100.0 | False | N | NaN | ZYNE | False |
| ZYXI | True | Zynex, Inc. - Common Stock | Q | S | False | 100.0 | False | N | NaN | ZYXI | False |
9338 rows × 11 columns
We have a total of 9,338 different stocks listed on the Nasdaq, with columns holding various pieces of economic and financial information about each one. But what if we want to access a single symbol/stock?
symbols.loc['IBM']
Nasdaq Traded                                                    True
Security Name       International Business Machines Corporation Co...
Listing Exchange                                                    N
Market Category
ETF                                                             False
Round Lot Size                                                    100
Test Issue                                                      False
Financial Status                                                  NaN
CQS Symbol                                                        IBM
NASDAQ Symbol                                                     IBM
NextShares                                                      False
Name: IBM, dtype: object
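Besides looking up one ticker with loc, ordinary boolean indexing works on this table too, for example to keep only the ETFs. A sketch on a tiny hand-built frame that mimics two of the eleven columns of the real get_nasdaq_symbols() output (tickers and names taken from the table above):

```python
import pandas as pd

# Miniature stand-in for the symbols DataFrame: the index is the ticker.
symbols = pd.DataFrame(
    {'Security Name': ['Agilent Technologies, Inc. Common Stock',
                       'Perth Mint Physical Gold ETF',
                       'Zynex, Inc. - Common Stock'],
     'ETF': [False, True, False]},
    index=pd.Index(['A', 'AAAU', 'ZYXI'], name='Symbol'))

# Keep only the rows flagged as ETFs.
etfs = symbols[symbols['ETF']]
print(etfs.index.tolist())  # ['AAAU']
```

The same mask applied to the full 9,338-row frame would return every ETF traded on the exchange.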
Technical analysis in finance is the type of analysis performed by means of statistics and charts on stocks (or indices, in our case). Let's see how to do something very simple with Plotly, a Python library for charting. In this case we'll access Microsoft's daily quotes. Let's do it!
!pip3 install plotly
Requirement already satisfied: plotly in /home/daniel/Desktop/narrativetext_project/notebooks/lib/python3.6/site-packages (4.12.0)
Requirement already satisfied: six in /home/daniel/Desktop/narrativetext_project/notebooks/lib/python3.6/site-packages (from plotly) (1.15.0)
Requirement already satisfied: retrying>=1.3.3 in /home/daniel/Desktop/narrativetext_project/notebooks/lib/python3.6/site-packages (from plotly) (1.3.3)
import plotly.graph_objects as go
import pandas_datareader.data as web
stock = 'MSFT'
start = datetime(2019, 1, 1)
df = web.DataReader(stock, data_source='yahoo', start=start)
df
| High | Low | Open | Close | Volume | Adj Close | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2019-01-02 | 101.750000 | 98.940002 | 99.550003 | 101.120003 | 35329300.0 | 98.602066 |
| 2019-01-03 | 100.190002 | 97.199997 | 100.099998 | 97.400002 | 42579100.0 | 94.974693 |
| 2019-01-04 | 102.510002 | 98.930000 | 99.720001 | 101.930000 | 44060600.0 | 99.391899 |
| 2019-01-07 | 103.269997 | 100.980003 | 101.639999 | 102.059998 | 35656100.0 | 99.518669 |
| 2019-01-08 | 103.970001 | 101.709999 | 103.040001 | 102.800003 | 31514400.0 | 100.240234 |
| ... | ... | ... | ... | ... | ... | ... |
| 2020-11-16 | 217.740005 | 214.520004 | 214.869995 | 217.229996 | 24953300.0 | 216.662766 |
| 2020-11-17 | 217.679993 | 214.080002 | 216.100006 | 214.460007 | 24154100.0 | 213.900009 |
| 2020-11-18 | 215.169998 | 210.929993 | 213.649994 | 211.080002 | 28372800.0 | 211.080002 |
| 2020-11-19 | 213.029999 | 209.929993 | 211.380005 | 212.419998 | 24792700.0 | 212.419998 |
| 2020-11-20 | 213.289993 | 210.000000 | 212.199997 | 210.389999 | 22829100.0 | 210.389999 |
478 rows × 6 columns
We have accessed MSFT's data. We did this by importing the .data sub-module from pandas_datareader and giving it the alias web. Under the hood we are using the Yahoo Finance API to import the data, but pandas-datareader lets us do it in a very simple way. Now we are going to plot the result to perform some technical analysis.
graph = {
'x': df.index,
'open': df.Open,
'close': df.Close,
'high': df.High,
'low': df.Low,
'type': 'candlestick',
'name': 'MSFT',
'showlegend': True
}
fig = go.Figure(
    data=[graph],
    layout_title="Microsoft Stock"
)
fig
We just did something very interesting: we charted MSFT's stock with up-to-date data! Today is November 20, 2020, so the last data point on my chart is from that date. You can do the same: hover the mouse at the end of the chart and see the stock's last quote! You could apply this to your own investment portfolio, running the code daily to perform technical analysis on those tickers!
Many of the APIs that pandas-datareader connects to allow us to filter the data we get by date. Financial institutions tend to keep records that go back several decades, and when we import that data, it is useful to be able to specify exactly which period we want.
An API that does just that is the Federal Reserve Bank of St. Louis (FRED), which we can access by first importing the pandas_datareader.data sub-module and then calling its DataReader function:
import pandas_datareader.data as web
start = datetime(2019, 1, 1)
end = datetime(2019, 2, 1)
sap_data = web.DataReader('SP500', 'fred', start, end)
sap_data
| SP500 | |
|---|---|
| DATE | |
| 2019-01-01 | NaN |
| 2019-01-02 | 2510.03 |
| 2019-01-03 | 2447.89 |
| 2019-01-04 | 2531.94 |
| 2019-01-07 | 2549.69 |
| 2019-01-08 | 2574.41 |
| 2019-01-09 | 2584.96 |
| 2019-01-10 | 2596.64 |
| 2019-01-11 | 2596.26 |
| 2019-01-14 | 2582.61 |
| 2019-01-15 | 2610.30 |
| 2019-01-16 | 2616.10 |
| 2019-01-17 | 2635.96 |
| 2019-01-18 | 2670.71 |
| 2019-01-21 | NaN |
| 2019-01-22 | 2632.90 |
| 2019-01-23 | 2638.70 |
| 2019-01-24 | 2642.33 |
| 2019-01-25 | 2664.76 |
| 2019-01-28 | 2643.85 |
| 2019-01-29 | 2640.00 |
| 2019-01-30 | 2681.05 |
| 2019-01-31 | 2704.10 |
| 2019-02-01 | 2706.53 |
The DataReader function takes 4 arguments: the name of the data series ('SP500' above), the data source ('fred'), a start date, and an end date. By changing the start and end dates, you can easily filter the data you receive.
Shift() operation

Once we've imported a DataFrame full of financial data, there are some pretty cool ways to manipulate it. In this section we will look at shift(), a DataFrame method that moves all the rows in a column up or down.
shift() can be called on a single column, or on the entire DataFrame, in which case all columns are shifted. You can also shift by more than one row, and in either direction.
# shifts all rows down by 1
dataframe.shift(1)
# shifts all rows in the 'name' column up by 5
dataframe['name'].shift(-5)
# shifts all rows in the 'price' column down by 3
dataframe['price'].shift(3)
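The snippets above are schematic; here is a runnable version on a toy price column (hypothetical prices) showing how shift(1) pushes values down one row:

```python
import pandas as pd

# A tiny frame of made-up prices.
df = pd.DataFrame({'price': [10.0, 11.0, 12.5, 12.0]})

# shift(1) moves every value down one row; the first row becomes NaN
# because there is nothing above it to shift in.
df['prev_price'] = df['price'].shift(1)
print(df)
```

Each row now sits next to the previous day's price, which is exactly the alignment we need for the calculations below.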
shift() is particularly useful with financial data. For example, it can help calculate the percentage growth between one row and the next, or find the difference in stock prices over a series of days. Let's see an example.
start = datetime(2008, 1, 1)
end = datetime(2018, 1, 1)
gdp = web.DataReader('GDP', 'fred', start, end)
gdp.head()
| GDP | |
|---|---|
| DATE | |
| 2008-01-01 | 14651.039 |
| 2008-04-01 | 14805.611 |
| 2008-07-01 | 14835.187 |
| 2008-10-01 | 14559.543 |
| 2009-01-01 | 14394.547 |
We have imported GDP data from FRED. Now we will create a new column called Growth holding the difference between each quarter and the previous one.
gdp['Growth'] = gdp['GDP'] - gdp['GDP'].shift(1)
gdp.head()
| GDP | Growth | |
|---|---|---|
| DATE | ||
| 2008-01-01 | 14651.039 | NaN |
| 2008-04-01 | 14805.611 | 154.572 |
| 2008-07-01 | 14835.187 | 29.576 |
| 2008-10-01 | 14559.543 | -275.644 |
| 2009-01-01 | 14394.547 | -164.996 |
We can now see the quarter-over-quarter differences in this new column. One important clarification: the first row of the Growth column is NaN because there is no previous row to subtract from it; it is the first row of the dataset.
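pandas also ships built-in helpers for this pattern: diff() is shorthand for the subtraction with shift(1), and pct_change() gives the relative version. A quick sketch on made-up GDP-like values:

```python
import pandas as pd

# Made-up values in the same spirit as the FRED GDP series.
gdp = pd.DataFrame({'GDP': [14651.0, 14805.0, 14835.0]})

# diff() is equivalent to gdp['GDP'] - gdp['GDP'].shift(1)
gdp['Growth'] = gdp['GDP'].diff()

# pct_change() gives the relative change between consecutive rows
gdp['GrowthPct'] = gdp['GDP'].pct_change()
print(gdp)
```

Both helpers leave the first row as NaN for the same reason as shift(): there is no earlier row to compare against.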
Two useful calculations that can be made with financial data are variance and covariance. To illustrate these concepts, let's use the example of a DataFrame that measures stock and bond prices over time.
Variance measures how far a set of numbers is from its average. In finance, it is used to determine the volatility of investments.
dataframe['stocks'].var() # 106427
dataframe['bonds'].var() # 2272
In the variance calculations above, the value for stocks is much larger than the value for bonds (106427 vs. 2272). That's because stock prices are more dispersed than bond prices, indicating that stocks are the more volatile investment.
Covariance, in a financial context, describes the relationship between the returns of two different investments over a period of time, and can be used to help balance a portfolio. Calling cov() on our DataFrame produces a matrix containing the covariance between each pair of columns. Covariance is closely related to correlation in finance. In our example data, when stock prices go up, bonds go down; we can use the covariance function to see this numerically.
dataframe.cov()
The code above produces a 2×2 DataFrame in which each value is the covariance between two columns. The higher the number, the more the two investments tend to move up and down together. The upper-right and lower-left entries hold the covariance between stocks and bonds. Here it is a negative number, which indicates that stocks and bonds tend to move in opposite directions. To see this with an example, we are going to work through a more complex exercise, let's go!
The logarithmic return is one of the three methods for calculating return; it assumes that returns compound continuously rather than over sub-periods. It is calculated by taking the natural logarithm of the final value divided by the initial value.
As an example, say the final value of an investment was $11 and the initial value was $10. The return would be ln(11/10) ≈ 0.0953, or 9.53%.
In mathematics and statistics, a distinction is often made between discrete and continuous data. The log return is the continuous, theoretical version: if the whole period were divided into an infinite number of sub-periods, what would the return be? In the real world, however, most people think of returns over discrete periods.
Except for very short periods of time (less than a day), and for theoretical applications involving calculus and the precise measurement of curves, logarithmic returns are not commonly used.
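As a quick numeric check of the $10 to $11 example above, here is the log return next to the simple return, in Python rather than a spreadsheet formula:

```python
import numpy as np

initial, final = 10.0, 11.0

simple_return = final / initial - 1   # the discrete version: 10%
log_return = np.log(final / initial)  # the continuous version: ~9.53%

print(round(simple_return, 4))
print(round(log_return, 4))
```

The log return is always slightly smaller than the simple return for a gain, and the two converge as the period (and the return) shrinks.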
Price data are useful, but in this case, since we want to compare each data set, it would be even better if instead of daily/annual prices, we had information on the returns of daily/annual prices.
As a first step, let's define a function called log_return, which should accept one parameter, prices.
def log_return(prices):
pass
The equation for calculating the log return between two prices is as follows
natural_log(current price/previous price)
In our case we want to run this equation for every day/year of price data in our imported Series (a Series is a single column of a DataFrame).
The pandas shift() function can be used to divide each current price by its previous price in the Series.
prices / prices.shift(1)
And we can use Numpy's natural logarithm function to get the log return for each entry in the new Series.
import numpy as np
np.log(Series)
Let's do it
import numpy as np
def log_return(prices):
return np.log(prices / prices.shift(1))
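Before pointing the function at real data, a quick sanity check on a tiny hand-made price Series (the function is repeated here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

def log_return(prices):
    # natural log of each price divided by the previous price
    return np.log(prices / prices.shift(1))

# Each price is 10% higher than the last.
prices = pd.Series([10.0, 11.0, 12.1])
returns = log_return(prices)
print(returns)
# first entry is NaN; the other two are both ln(1.1), ~0.0953
```

Constant multiplicative growth shows up as a constant log return, which is exactly the property that makes log returns convenient to sum and compare.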
Let's use our new log_return function to calculate the logarithmic returns of the Nasdaq, S&P 500, GDP, and export series, which we load next.
start = datetime(1999, 1, 1)
end = datetime(2019, 1, 1)
nasdaq_data = web.DataReader("NASDAQ100", "fred", start, end)
sap_data = web.DataReader("SP500", "fred", start, end)
gdp_data = wb.download(indicator='NY.GDP.MKTP.CD', country=['US'], start=start, end=end)
export_data = wb.download(indicator='NE.EXP.GNFS.CN', country=['US'], start=start, end=end)
Now we are going to pass each variable to the log_return function
nasdaq_returns = log_return(nasdaq_data['NASDAQ100'])
nasdaq_returns
DATE
1999-01-01 NaN
1999-01-04 NaN
1999-01-05 0.025876
1999-01-06 0.031526
1999-01-07 0.001221
...
2018-12-26 NaN
2018-12-27 0.004069
2018-12-28 -0.000483
2018-12-31 0.007087
2019-01-01 NaN
Name: NASDAQ100, Length: 5218, dtype: float64
sap_returns = log_return(sap_data['SP500'])
sap_returns
DATE
2010-11-22 NaN
2010-11-23 -0.014387
2010-11-24 0.014813
2010-11-25 NaN
2010-11-26 NaN
...
2018-12-26 NaN
2018-12-27 0.008526
2018-12-28 -0.001242
2018-12-31 0.008457
2019-01-01 NaN
Name: SP500, Length: 2117, dtype: float64
gdp_returns = log_return(gdp_data['NY.GDP.MKTP.CD'])
gdp_returns
country year
United States 2019 NaN
2018 -0.040354
2017 -0.052176
2016 -0.040757
2015 -0.026427
2014 -0.039039
2013 -0.042966
2012 -0.035650
2011 -0.041243
2010 -0.036063
2009 -0.036900
2008 0.018100
2007 -0.017898
2006 -0.045096
2005 -0.057963
2004 -0.065203
2003 -0.063851
2002 -0.046611
2001 -0.032961
2000 -0.031631
1999 -0.062554
Name: NY.GDP.MKTP.CD, dtype: float64
export_returns = log_return(export_data['NE.EXP.GNFS.CN'])
export_returns
country year
United States 2019 NaN
2018 0.002376
2017 -0.063109
2016 -0.059492
2015 0.020588
2014 0.045240
2013 -0.042320
2012 -0.036803
2011 -0.041123
2010 -0.130190
2009 -0.154485
2008 0.149476
2007 -0.100832
2006 -0.120293
2005 -0.120663
2004 -0.102871
2003 -0.127967
2002 -0.036798
2001 0.025597
2000 0.067562
1999 -0.099148
Name: NE.EXP.GNFS.CN, dtype: float64
We are now ready to compare the volatility of each type of data. Remember that variance, in the context of financial data, tells us how volatile an investment is. We'll use the var() function in Pandas to calculate the variance of returns on stocks and World Bank data, and print the results.
The results can be interpreted in various ways, but in general, the higher the variance, the more volatile the data.
What conclusions can be drawn from these data? Which data set was the most volatile? Did any data sets have similar variances?
Note that the variances below are printed in scientific notation; for example, 8.860342194008153e-05 is equivalent to 0.00008860342194008153.
print('nasdaq_returns:', nasdaq_returns.var())
nasdaq_returns: 0.0003178379833057229
print('sap_returns:', sap_returns.var())
sap_returns: 8.375653262337878e-05
print('gdp_returns:', gdp_returns.var())
gdp_returns: 0.0003408930734314469
print('export_returns:', export_returns.var())
export_returns: 0.006197236757109849
The S&P 500, a set of 500 large companies listed on U.S. stock exchanges, has the smallest variation, and therefore is the least volatile. Because the S&P 500 index is a weighted measure of many stocks in a variety of industries, it is considered a safer and more diversified investment.
Stocks are interesting. The NASDAQ 100 is more volatile than the S&P 500, which, when you think about it, makes sense since the S&P 500 is much more diversified and follows the market more closely.
So finally we have GDP and exports.
Exports are very volatile, which could have to do with industries that have moved abroad in the last 20 years, and global competition for the production of goods.
GDP is actually quite similar to the NASDAQ 100 in terms of volatility, which is perhaps an interesting coincidence.
I hope you enjoyed this reading!