Quantitative Investment Learning Together: March 2018

Saturday, March 24, 2018

Mind-Map for Junior Students Looking for Quantitative Research Positions

This roadmap is what I thought that should be beneficial for students like me who are interested in a quantitative research position in financial institutions, especially in hedge funds.

In my opinion, Finance, Mathematics, Statistics, Computer Science and Soft Skills are key elements to successfully looking for such a job. Sadly, the red blanks showed what I have not learnt yet, and not having basic knowledge in these fields might prevent me from having a competitive competence.

Hopefully, this graph could remind me of what is important for my career and what is not, and it should be much better if it could be beneficial for others who are also highly interested in quant research.

PDF Version:

Image:

Friday, March 9, 2018

Black-Scholes Option Pricing Model and Empirical Analysis on 50ETF Option

In the previous blog, I tried Binomial-Tree Model to price 50ETF option in China, where I found it significantly underestimated the price. In the following analysis, I choose Black-Scholes Model instead, but found that although the difference of the predicted prices are significantly different with a p-value around 0.03, the results of prediction is pretty similar, leading to the same conclusion: Black-Scholes Model also underestimates the price of 50ETF option in China.

The formula of Black-Scholes is as follow:

I have also written the codes of estimating prices under BS Model:

from scipy import stats
def Black_Scholes(r,sig,S0,K,T,isCall):
# European Option Pricing
d1 = (np.log(S0/K) + (r+sig**2/2)*T)/(sig*np.sqrt(T))
d2 = d1 - sig*np.sqrt(T)
if isCall == True:
C0 = S0*stats.norm.cdf(d1,0,1)-np.exp(-r*T)*K*stats.norm.cdf(d2,0,1)
return C0, stats.norm.cdf(d1,0,1)
else:
P0 = -S0*stats.norm.cdf(-d1,0,1)+np.exp(-r*T)*K*stats.norm.cdf(-d2,0,1)
return P0, -stats.norm.cdf(-d1,0,1)

Keep all the other methods the same with the previous blog, only replace the function to price option under binomial-tree model with the function above, and increase the test period from the range of (2015-07-31, 2015-11-30) to (2015-07-31, 2016-03-22), the results of the prices are stored in this file: binomialResults_Binomial&BS.csv

Visualizing the difference between the two models, and we find that these two models nearly predict the same results:

plt.figure(dpi=200)
plt.title('Binomial Against Black-Scholes Price')
plt.xlabel('Binomial Price')
plt.ylabel('Black-Scholes Price')
plt.xlim(0,0.6); plt.ylim(0,0.6)
plt.plot(df_results.BinomialPrice,df_results.BSPrice,ls=' ',marker='.')
plt.savefig('Binomial Against Black-Scholes Price.png')

Although statistical analysis (t-test on the differences) shows that the results are significantly different with a p-value of around 0.03:

diff = df_results.BinomialPrice-df_results.BSPrice
ttest_1sample = stats.ttest_1samp(diff, 0)

In [1]: ttest_1sample
Out[1]: Ttest_1sampResult(statistic=2.1480372929383886, pvalue=0.031870221099299899)

Then we may conclude that although slightly different, Binomial-Tree and Black-Scholes Models predict nearly the same prices on 50ETF option from 2015-07-31 to 2016-03-22. Plot Black-Scholes Price against Real Price, and we can find it also significantly underestimates the price of options.

plt.figure(dpi=200)
plt.title('Black-Scholes Price against Real Price')
plt.xlabel('Black-Scholes Price')
plt.ylabel('Real Price')
plt.xlim(0,0.6); plt.ylim(0,0.6)
plt.plot(df_results.BinomialPrice,df_results.Close,ls=' ',marker='.')
plt.savefig('Black-Scholes Price against Real Price.png')

Binomial-Tree Option Pricing Model and Empirical Analysis on 50ETF Option

I have implemented binomial-tree option pricing model to the only listed financial option in China: option on 50ETF. The results show that though somewhat similar, binomial-tree model cannot successfully price the options, and it might underestimate the risks which could lead to raising the prices.

I. Basic Steps on Binomial-Tree Model

Binomial-Tree Model assumes that in each time period, the stock price would either go up or go down, with a similar scale, as shown in the following graph:

Assuming that we know the exact parameter of the growth and the variance, we can calculate the expected tree of stocks. Specify a certain dt up to its exercise date, we may get a stock tree.

After building the stock tree, we can derive the corresponding option tree. However, unlike calculating from the first node, in the Binomial-Tree Model, we calculate the price of the option from the end of the tree. That is, we calculate all the possible prices of option on the exercise date, and backwardly propagating the previous prices by the formula shown above, and leading to the estimated price at present.

We can also calculate the delta by (C(1,1)-C(1,0))/(S(1,1)-S(1,0)) to show the sensitivity of option to its underlying asset.

Similarly, replacing C(N,m) = Max(S(N,m)-K,0) with Max(K-S(N,m),0), we can derive the price at exercise date of the put option, and by remaining other steps the same, we might calculate the P(0,0) and its delta.

The Python codes of binomial tree are shown as following:

import numpy as np

def binomial_tree(r,mu,sig,S0,K,T,N,isCall):

# European Option Pricing

dt = T/N

q=(np.exp(r*dt)-1-mu*dt)/(2*sig*np.sqrt(dt))+0.5;

U=1+mu*dt+sig*np.sqrt(dt)

D=1+mu*dt-sig*np.sqrt(dt)

DCF=np.exp(-r*dt)

S=np.zeros((N+1,N+1))

C=np.zeros((N+1,N+1))

S[0,0]=S0

for j in range(1,N+1):

S[0,j]=S[0,j-1]*D

for i in range(1,j+1):

S[i,j]=S[i-1,j-1]*U

if isCall == True:

for i in range(N+1):

C[i,N]=max(S[i,N]-K,0)

else:

for i in range(N+1):

C[i,N]=max(K-S[i,N],0)

for j in range(N-1,-1,-1):

for i in range(0,j+1):

C[i,j] = DCF*((1-q)*C[i,j+1]+q*C[i+1,j+1])

C0 = C[0,0]

alpha0 = (C[1,1]-C[0,1])/(S[1,1]-S[0,1])

return C0, alpha0

II. Empirical Analysis on 50ETF Option Data

50ETF (Managed by Huaxia Fund) is the only active tradable in-the-market financial asset who has derivative option. Its price basically reflects the future aspiration of overall financial market in China. Because the option on 50ETF is European, the following analysis would only consider European method.

I have collected the data of the basic information and daily price of underlying 50ETF prices (from Wind), and the data of information in the scale of tick of 50ETF option (private dataset). Because the dataset of option is too large, I might not put all of them online. I would only list one sample of the dataset here. Please e-mail me for further request. Other data are available here: Price of 50ETF, Basic Information about 50ETF Option, non-risk asset returns in China, description of the option dataset, one sample of the option dataset. The range of test time is from 2015-07-31 to 2015-11-30.

The price of 50ETF is like the price of the stock shown above, and mu is estimated by its historical returns from the estimated date to it was one year ago, dividing the time of each iteration. sigma is estimated by the standard variance with a similar method.

Codes of reading data, determining parameters, and pricing options for each option with respective strike price on each date with trading volumes greater than 0 are as follows:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import ffn
import datetime
import calendar
import os

# Read Basic Information of the Data
def filterMonth(kwd, files):
return [file for file in files if file.find(str(kwd)) != -1]
df_results = pd.DataFrame(columns=['TradingDate','SYMBOL','BinomialPrice','delta','Close'])
df_etf = pd.read_csv('50ETF.csv')
df_etf.index = pd.to_datetime(df_etf.TradingDate)
df_description = pd.read_excel('SHSOL1_description.xlsx')
df_basicinfo = pd.read_excel('Basic Information.xlsx')
df_basicinfo.index = df_basicinfo['期权代码'] #'期权代码' means the symbol of the option
df_nonrisk = pd.read_csv('nonrisk.csv')
df_nonrisk.index = pd.to_datetime(df_nonrisk.Clsdt)

# Load the price of option

year = 2015
for month in range(7,12):
monthRange = calendar.monthrange(year, month)
startdate = datetime.datetime(year,month,monthRange[0])
enddate = datetime.datetime(year,month,monthRange[1]) #One Month at most at one time
df_etf_inrange = df_etf[startdate:enddate]
ymonth = pd.to_datetime(startdate).year*100+pd.to_datetime(startdate).month
df_description = pd.read_excel('SHSOL1_description.xlsx')
cols = df_description['字段名'].tolist() #'字段名' means the name of the character
for (root, dirs, files) in os.walk('SHSOL1_TAQ1(1-100)'):
files = filterMonth(ymonth, files)
df = pd.DataFrame()
for f in files:
print(f)
if not df.empty:
df = pd.concat( [df,\
pd.read_csv('SHSOL1_TAQ1(1-100)\\'+f,names=cols)], axis=0 )
else:
df = pd.read_csv('SHSOL1_TAQ1(1-100)\\'+f,names=cols)
contracts = df.SYMBOL.unique()
df.index = pd.to_datetime(df.BUSINESSTIME)

# Load Parameters and Pricing
diags = []
for date in pd.date_range(startdate,enddate):
if date.to_pydatetime() in df_etf_inrange.index:
close = df_etf_inrange.Close[date.to_pydatetime()]
S0 = close
tradableOptions = df_basicinfo[(df_basicinfo['上市日'] <= date.to_pydatetime())&(df_basicinfo['到期日'] > date.to_pydatetime())] #'上市日' means the first date in the market; '到期日' means the last date, or strike/exercise date
returns = ffn.to_returns(df_etf.Close[pd.to_datetime(date)-datetime.timedelta(days=365):pd.to_datetime(date)-datetime.timedelta(days=1)])
mu0 = returns.mean()
sig0 = returns.std()
Ts = tradableOptions['到期日'] - date #'到期日' means strike/exercise date
Ks = tradableOptions['行权价'] #'行权价' means strike/exercise price
N = 20
r = df_nonrisk.Nrrdata[date.to_pydatetime()]/100
isCall = tradableOptions['认购/认沽'] == '认购' #'认购' means 'Call'; '认沽' means 'Put'
for option in tradableOptions.index:
if option not in contracts:
continue
reals = df[(df.SYMBOL==option)&(df.VOLUME!=0)]
if reals.empty:
continue
else:
reals = reals.LASTPRICE[date.strftime('%Y-%m-%d')]
if reals.empty:
continue
K = Ks[option]
T = Ts[option].days/365
isC = isCall[option]
real = reals[~np.isnan(reals)][-1]
mu = mu0/(T/N)
sig = sig0/np.sqrt(T/N)
binomial0, alpha0 = binomial_tree(r, mu, sig, S0, K, T, N, isC)
diags.append([date,option,binomial0,alpha0,real])
df_results = pd.concat([df_results,pd.DataFrame(diags,columns=['TradingDate','SYMBOL','BinomialPrice','delta','Close'])])
df_results.to_csv('binomialResults.csv')

The results of the 'TradingDate', 'SYMBOL', 'BinomialPrice', 'delta', and 'Close' are stored in 'binomialResults.csv'. Show the scatter plot of the binomial price and close price, we have:

plt.figure(dpi=200)
plt.title('Binomial Price against Real Price')
plt.xlabel('Binomial Price')
plt.ylabel('Real Price')
plt.xlim(0,0.6); plt.ylim(0,0.6)
plt.plot(df_results.BinomialPrice,df_results.Close,ls=' ',marker='.')
plt.savefig('Binomial Price against Real Price.png')

The plot shows that the price of binomial tree model is somewhat similar with the real price, but it seems that it is slightly smaller than the real price. We try the following codes to show the result of one sample t-test to show how similar the results are:

from scipy import stats
ttest_1sample = stats.ttest_1samp(df_results.BinomialPrice-df_results.Close, 0)

In[1]: ttest_1sample
Out[1]: Ttest_1sampResult(statistic=-22.871123481744913, pvalue=1.430363679347074e-91)

It shows that binomial-tree model significantly underestimate the option price. It seems that the risk in the option market cannot be successfully reflected in this model.

Thursday, March 8, 2018

Logistic Regression: Classical Binomial Linear Classification Machine Learning Algorithm

Logistic Regression is one of the most basic machine learning classification algorithms. It is fast and easy to understand, but it could only identify binomial classes. Other than that, we may need Softmax Regression or Factorization Machine, which are more generalized forms of Logistic Regression aiming to solve multi-classes problems. Furthermore, emerging usage of SVM, ANN, and Decision Tree can also be highly powerful.

However, in this blog, I am only going to focus on Logistic Regression, briefly summarize the essential steps of it, and give a simple example. The dataset is available here and it was extracted from https://catalog.data.gov/dataset/state-energy-data-system-seds-complete-dataset-through-2009#sec-dates.

I. Introduction to Dataset

The dataset is about the energy profile of all states of U.S. from 1960 to 2009. The meanings of each attribute is available in the metadata posted in the website. The 'energy.csv' is exactly the same but preprocessed to panel form.

Suppose we would like to teach our computers to distinguish the economic and energy developing status of different states, and to make it solvable by Logistic Regression, we only recognize two states, Hawaii ('HI') and New York ('NY'). To make the results visualizable, we only include two attributes as the explainers, and they are Total Energy Average Consumption ('TETCB'/Billion Btu) and Real Value of GDP ('GDPRV'/Million Dollars).

Firstly, we could have a brief look at the scatter plots of the two states with the following codes:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('energy_data.csv')
df1 = df[['StateCode','Year','GDPRV','TETCB']].dropna()
state1 = 'HI'
state2 = 'NY'
fullset = df1[(df1['StateCode'] == state1)|(df1['StateCode'] ==state2)]

plt.figure(dpi=150)
plt.title(state1+' & '+state2)
plt.xlabel('Total Energy Total Consumption (Billion Btu)')
plt.ylabel('Real GDP Value (Million Dollars)')
plt.plot(fullset[fullset.StateCode==state1].TETCB,fullset[fullset.StateCode==state1].GDPRV,ls='',marker='.',color='blue')
plt.plot(fullset[fullset.StateCode==state2].TETCB,fullset[fullset.StateCode==state2].GDPRV,ls='',marker='.',color='red')
plt.legend([state1,state2])
plt.savefig(state1+'_'+state2+'_scatter.png')

Because our analysis is simplified, we could easily observe the difference between the two plots. For example, we could easily classify a point (600000, 40000) to 'NY'. Now, we can test the ability of machine learning algorithm.

II. Steps and Codes

Logistic Regression uses sigmoid function as the function to transfer processed attributes to probability and in the meanwhile beautifully lead the loss function to be convex, and implement maximum likelihood estimation to minimize the loss function of the estimations and real values.

Instead of looking for analytical solutions, we specify a random initial hyperplane and use gradient descent method to learn how close the hyperplane is to the final optima. As thousands of iteration, our results might steadily converge to a constant, which should be the final result of the weights, matrix form of classification hyperplane.

The following codes describe the steps and the codes:

def sig(x):
return 1.0 / (1 + np.exp(-x))

def lr_train_bgd(feature, label, maxCycle, alpha):
n = np.shape(feature)[1]
w = np.mat(np.random.rand(n,1))
i = 0
while i <= maxCycle:
i += 1
h = sig(feature * w)
err = label - h
if i % (maxCycle/10) == 0:
print("\t---------iter=" + str(i) + \
" , sum of err= " + str(np.sum(np.abs(err))))
print('\t-------------weights=',end=' ')
print(w)
w = w + alpha * feature.T * err
return w

#The codes of the function sig(*) and lr_train_bgd(*) is available in https://github.com/zhaozhiyong19890102/Python-Machine-Learning-Algorithm/blob/master/Chapter_1%20Logistic%20Regression/lr_test.py

ones = np.ones((len(fullset),1))
feature = np.column_stack((ones,np.mat(fullset[['TETCB','GDPRV']])))
label = []
for s in fullset.StateCode:
if s == state1:
label.append(0.0)
elif s == state2:
label.append(1.0)
label = np.mat(label).T
weights = lr_train_bgd(feature,label,10000,0.1)

plt.figure(dpi=150)
plt.title(state1+' & '+state2+' without Standardization')
plt.xlabel('Total Energy Total Consumption (Billion Btu)')
plt.ylabel('Real GDP Value (Million Dollars)')
plt.plot(fullset[fullset.StateCode==state1].TETCB,fullset[fullset.StateCode==state1].GDPRV,ls='',marker='.',color='blue')
plt.plot(fullset[fullset.StateCode==state2].TETCB,fullset[fullset.StateCode==state2].GDPRV,ls='',marker='.',color='red')
x = np.linspace(200000,700000)
y = (- weights[0]/weights[2] - weights[1]*x/weights[2]).T
plt.plot(x,y)
plt.legend([state1,state2])
plt.savefig(state1+'_'+state2+'_LogisticRegression_noStandard.png')

The solid line in the plot is the classification line. Above it is classified as 'HI' and beneath it as 'HI'. Obviously, the result is not satisfactory. I asked codes from my friend professional at computer science and found nearly the same results. It turned out that I was once so upset that this algorithm was too unreliable.

III. The importance of Standardizing Attributes

However, I suddenly got an idea: I forgot to standardize my explainers / attributes! Because Logistic Regression uses gradient descent method to improve the accuracy of classification hyperplane, and without standardizing my attributes to the same scale, my loss function might be a extremely extended ellipsoid, and any specified learning rate might be either too large to converge the loss function or too small to speed up the calculations.

Change a little bit of the code, by adding the red part inside the codes:

......

state1 = 'NY'

state2 = 'HI'

fullset = df1[(df1['StateCode'] == state1)|(df1['StateCode'] ==state2)]

for col in ('GDPRV','TETCB'):

fullset[col] = (fullset[col] - np.min(fullset[col]))/(np.max(fullset[col]) - np.min(fullset[col]))

def sig(x):
return 1.0 / (1 + np.exp(-x))
.......

We may get a satisfactory result of classification:

Therefore, we could use this code to solve binomial classification problems.

IV. Available Source Codes

I have also tried a simple test without standardizing attributes (but the attributes are relatively in the same scale), and I may provide all the codes.

Part I: Introduction to dataset
Part II: Logistic Regression without Standardization
Part III: Logistic Regression with Manual Dataset
Part IV: Logistic Regression with Standardization

Reference:

Machine Learning Algorithm: Python, Zhiyong ZHAO (《Python机器学习算法》赵志勇)；essential codes are from his github website referred previously.