Stock Prediction Using Machine Learning Algorithms
Introduction
Millions of people trade stocks every day, and hundreds of millions of dollars flow through markets such as the Nasdaq and the New York Stock Exchange, tracked by indices like the Dow Jones Industrial Average and the S&P 500. Everyone would like to predict stock prices and profit from the market. In this project, I try to predict the direction of a given stock.
Project Definition
Project Overview
In this project, I explore whether machine learning algorithms can help us predict the direction of a given stock. To that end, I pull a dataset for the stock from Yahoo Finance and pre-process the data. The data is then examined with exploratory data analysis methods to check how the stock has performed historically. Lastly, the project forecasts the stock price by applying various machine learning algorithms and compares the results.
Problem Statement
The goal of the project is to predict the price change and direction of the stock using various machine learning models. Since the input used for prediction (the adjusted close price) is a continuous value, I use regression models to forecast future prices. The tasks involved are as follows:
1. Load historical stock price data from Yahoo Finance.
2. Check for missing values and clean the data.
3. Perform exploratory data analysis.
4. Perform data preparation and feature engineering for machine learning.
5. Train the regression models.
6. Validate the models.
7. Select the best model and make a recommendation.
The best-performing model is then recommended to investors for stock price prediction.
Metrics
For the project, I use Root Mean Square Error and R2 score to assess and validate the various machine learning algorithms.
The first metric, Root Mean Square Error (RMSE), is the standard deviation of the residuals and measures the typical size of the prediction error. It is simple to compute and widely used in statistics. The formula is:

RMSE = √( (1/N) · Σᵢ (yᵢ − ŷᵢ)² )

where N is the number of non-missing data points, yᵢ is the actual observation of the time series at point i, and ŷᵢ is the estimated (predicted) value.
R-squared is a goodness-of-fit measure for regression models: it is the percentage of the variance in the dependent variable that is explained by the independent variables. In other words, R-squared measures the strength of the relationship between the model and the dependent variable on a 0–100% scale; the stronger the relationship, the higher the score.
In short, the historical data of the given stock is used to build machine learning models, and the two metrics are applied to each model to select the best solution. The project therefore selects the model with a low RMSE and a high R2 score.
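To make the metrics concrete, here is a minimal sketch of how both scores can be computed with scikit-learn; the arrays below are illustrative placeholders, not project data.

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative actual and predicted prices (placeholder values)
y_true = np.array([100.0, 102.0, 101.5, 103.0])
y_pred = np.array([99.5, 102.5, 101.0, 104.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of the mean squared residual
r2 = r2_score(y_true, y_pred)                       # share of variance explained
print('RMSE:', rmse, 'R2:', r2)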
Data Exploration
I select Adobe stock for this project and pull its historical data from Yahoo Finance. It is easy to fetch financial data from Yahoo with the yfinance module, which I import with import yfinance as yf.
The data contains the open, high, low, close, and adjusted close prices, plus the volume, for the stock. The following table shows the Adobe stock price data from the beginning of 2010.
The summary statistics preview shows the mean, median, standard deviation, and other descriptive statistics for the data.
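A minimal sketch of how such a dataset can be pulled and summarized (the start date matches the 2010 starting point mentioned above; the variable names are assumptions):

import yfinance as yf

# Download Adobe (ADBE) daily price history from the beginning of 2010
df = yf.download("ADBE", start="2010-01-01")
print(df.head())        # columns such as Open, High, Low, Close, Adj Close, Volume (may vary by yfinance version)
print(df.describe())    # mean, std, min, quartiles, max for each column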
Exploratory Visualization
The first thing I look at is how the price changed between 2010 and 2021. I select the adjusted close price and build a chart of the price change. The stock shows an upward trend over the 11-year period.
The chart below shows the daily return for the stock.
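A minimal sketch of how the daily returns (and the ret DataFrame passed to ann_risk_return below) might be computed; the original may have used log returns instead, and the variable names are assumptions:

# Daily simple returns from the adjusted close prices (assumed variable names)
df_adj_close = df[["Adj Close"]]
ret = df_adj_close.pct_change().dropna()
ret.plot(figsize=(15, 6), title="ADBE daily returns")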
I also calculate the aggregated annual return and risk in the following code; the result indicates that the annual return is 28% and the risk is 30%.
# This function calculates annualized risk and return for stocks
def ann_risk_return(returns_df):
    summary = returns_df.agg(["mean", "std"]).T
    summary.columns = ["Return", "Risk"]
    summary.Return = summary.Return * 252
    summary.Risk = summary.Risk * np.sqrt(252)
    return summary

summary = ann_risk_return(ret)
Moving Average Comparison for 50, 100, and 200 Days
The Simple Moving Average (SMA) is a stock indicator commonly used in finance as a technical indicator. The reason for calculating an SMA is to smooth out the stock price data.
I create SMAs for 50, 100, and 200 days to see the smoothed price changes. Since the SMA is a powerful technical indicator, it is frequently used to predict the direction of a stock. The 50-day and 200-day averages are well known by investors and are accepted as a trading signal: a buy signal is generated when the 50-day moving average crosses above the 200-day moving average.
Based on the moving average chart, the stock indicates a buy signal, since the 50-day moving average crosses above the 200-day moving average (SMA).
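A minimal sketch of how the three moving averages can be computed and plotted with pandas rolling windows (variable names are assumptions):

# Simple moving averages over 50, 100, and 200 trading days
sma = df_adj_close.copy()
for window in (50, 100, 200):
    sma["SMA{}".format(window)] = df_adj_close["Adj Close"].rolling(window).mean()
sma.plot(figsize=(15, 6), title="ADBE price with 50/100/200-day SMAs")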
Annual Return Triangle
In this part of the data analysis, I take a look at the stock's historical annual returns. First, I resample the adjusted close prices to annual frequency:
annual = df_adj_close.resample("A", kind = "period").last()
The annual return triangle shows the historical annual returns over the 11-year period.
annual_log_ret = np.log(annual / annual.shift())
annual_log_ret.dropna(inplace=True)
years = annual_log_ret.index.size

def annual_return_triangle(df, annual_log_ret):
    windows = [year for year in range(years, 0, -1)]
    for i in df.columns:
        for year in windows:
            annual_log_ret["{}{}Y".format(i, year)] = annual_log_ret[i].rolling(year).mean()
    return annual_log_ret

an_ret_tri = annual_return_triangle(df_adj_close, annual_log_ret)
triangle = an_ret_tri.drop(columns=df_adj_close.columns)
triangle.columns
The following code shows how to build the annual return triangle graph. Looking at the graph, I can see that over the 11-year period the stock earned different percentages of return depending on the entry year and holding period. For example, the intersection of ADBE1Y and 2021 shows a one-year annual return of 9.2%. If someone had bought the stock in 2020 and held it for one year, they would have gained 41.6% (the 2020 and ADBE1Y intersection), whereas they would have lost 8.5% if the stock had been bought in 2011 and held for one year.
def graph_annual_return_triangle(df):
    i = 0
    new_list = []
    while i < len(triangle.columns.values):
        new_list.append(df.columns.values[i:i + years])
        i += years
    for i in new_list:
        plt.figure(figsize=(30, 20))
        sns.set(font_scale=2)
        sns.heatmap(df[i], annot=True, fmt=".1%", cmap="RdYlGn")
        plt.tick_params(axis="y", labelright=True)
    return plt.show()
Machine Learning Algorithms
Algorithms and Techniques
The aim of the project is to predict accurate stock prices by using the best possible machine learning algorithm.
I use the TimeSeriesSplit function to split the data into train, test, and validation sets. I do not use a random split for the time series data, since a random split would not be valid due to its autoregressive nature, trend, and seasonality.
I use various machine learning models to see how precisely they can predict the stock price. These are the algorithms I use in the project: Decision Tree Regressor, Support Vector Regressor (SVR), LassoCV, RidgeCV, and Stochastic Gradient Descent (SGD).
Machine Learning Work Flow:
- Normalize the data
- Split the data into Train, Test, and Validation Sets
- Implement model prediction and evaluation
- Compare the results
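Before walking through the steps, these are the imports assumed by the code snippets in the rest of the post (a consolidated sketch; the original notebook may organize them differently):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LassoCV, RidgeCV, SGDRegressor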
1- Normalize the data
When the data is pulled directly from Yahoo Finance, its values are not in a convenient range for predicting future prices. Feature scaling normalizes the data and can improve the performance of some machine learning algorithms.
I use sklearn MinMaxScaler method to range the data between 0 and 1.
def normalize_featuresDF(df):
    """
    created to normalize df data - range between 0-1
    args:
        df : the data we pulled up from Yahoo for the stock
    return:
        normalized_features_df : data after it is normalized
    """
    scaler = MinMaxScaler()
    feature_columns = df.columns
    feature_minmax_data = scaler.fit_transform(df)
    normalized_features_df = pd.DataFrame(columns=feature_columns, data=feature_minmax_data, index=df.index)
    return normalized_features_df
2- Split the data into Train, Test, and Validation sets
I split the data into three different sets: train, test, and validation. This process helps evaluate the models more accurately.
The train set is the sample of data used to fit the model. The validation set is used to provide an unbiased evaluation of a model fit on a training set while tuning model hyperparameters. The test set is used for the final evaluation after the model is completely trained with the train and validation set.
def split_ValidationSet(features_df, target_df, length=90):
    """
    method to separate the validation set from the complete df
    args:
        features_df: full features_df
        target_df : full target_df
        length: prediction length
    returns :
        validation_x : features validation set
        validation_y : target validation set
    """
    # need to shift the target array because we are predicting the n + 1 day's price
    target_df = target_df.shift(-1)
    # split the validation set: roughly the latest 10% of the data is kept for validation
    validation_y = target_df[-length:-1]
    validation_x = features_df[-length:-1]
    return validation_x, validation_y

# Now get final_features_df and final_target_df by excluding the validation set
def split_Final_df(normalized_features_df, target_df, v_length=90):
    """
    This method keeps the remaining data after the validation set is removed.
    args:
        features_df: normalized features_df
        target_df: complete target_df
        v_length: validation set length
    return:
        final_features_df : feature df excluding the validation set
        final_target_df : target df excluding the validation set
    """
    final_features_df = normalized_features_df[:-v_length]
    final_target_df = target_df[:-v_length]
    return final_features_df, final_target_df

# Split the final set into training and testing sets
#splitting training and testing set using sklearn's TimeSeries split
def split_Train_Test_DF(final_features_df, final_target_df, n_splits=10):
    """
    Using sklearn's TimeSeriesSplit to split the training and testing sets
    args:
        final_features_df: features_df after splitting off the validation set
        final_target_df: target_df after splitting off the validation set
    return:
        x_train : training feature set
        y_train : training target set
        x_test : testing feature set
        y_test : testing target set
    """
    ts_split = TimeSeriesSplit(n_splits)
    for train_index, test_index in ts_split.split(final_features_df):
        x_train, x_test = final_features_df[:len(train_index)], final_features_df[len(train_index):(len(train_index) + len(test_index))]
        y_train, y_test = final_target_df[:len(train_index)].values.ravel(), final_target_df[len(train_index):(len(train_index) + len(test_index))].values.ravel()
    return x_train, y_train, x_test, y_test
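For reference, a minimal sketch of how these helpers might be chained together; the exact feature and target choices are assumptions based on the defaults above:

# Assumed wiring of the preprocessing helpers defined above
normalized_features_df = normalize_featuresDF(df)
target_df = df[["Adj Close"]]                      # assumed target column
validation_x, validation_y = split_ValidationSet(normalized_features_df, target_df, length=90)
final_features_df, final_target_df = split_Final_df(normalized_features_df, target_df, v_length=90)
x_train, y_train, x_test, y_test = split_Train_Test_DF(final_features_df, final_target_df, n_splits=10)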
After splitting the data, it is visualized with a graph below.
def DataSet_Graph():
    """
    Chart shows the 3 different sets (Train, Test, Validation) in a single plot
    Since it is time series data, the sets should not be mixed
    """
    t = y_test.astype(float)
    v = target_df[-90:-1].values.ravel()
    plt.figure(figsize=(20, 10))
    plt.plot(y_train, label='training_set')
    plt.plot([None] * len(y_train) + [x for x in t], label='test_set')
    plt.plot([None] * (len(y_train) + len(t)) + [x for x in v], label='validation_set')
    plt.xlabel('Days', fontsize=18)
    plt.ylabel('Price', fontsize=18)
    plt.title('Split dataset into training/validation/test set', fontsize=20)
    plt.legend()
3- Implement Model Prediction and Evaluation
I focus on the error between the actual and predicted values. For that reason I use the RMSE (Root Mean Square Error) and R-squared (R2) metrics. Both metrics work well for selecting the best model to predict stock prices: they measure the errors and also serve as loss functions to minimize.
I have created three different functions. The first one is model_validateResult(). It uses the validation data set and returns the RMSE and R2 scores; it also builds a graph showing the trendline for actual vs. predicted data. The second function is bestModel_validateResult(). It uses the test data set and returns the RMSE and R2 scores, along with a graph of the trendline for actual vs. predicted test data. The last function is value_Compare(). It generates a dataset showing the difference between the actual and predicted data.
# Method to evaluate the benchmark model and solution models with the validation data set
def model_validateResult(model, model_name):
    """
    Returns RMSE_Score and R2_Score
    Also plots the actual vs predicted trend
    args:
        model : the model to validate
        model_name: name of the model
    return:
        RMSE_Score : calculated RMSE score
        R2_Score : calculated R2 score
    """
    model = model(x_train, y_train, validation_x)
    prediction = model.predict(validation_x)
    RMSE_Score = np.sqrt(mean_squared_error(validation_y, prediction))
    R2_Score = r2_score(validation_y, prediction)
    # trendline for actual vs prediction
    plt.figure(figsize=(23, 10))
    plt.plot(validation_y.index, prediction, color='green', linestyle='dashed', linewidth=3,
             marker='o', markerfacecolor='green', markersize=8, label='Predicted')
    plt.plot(validation_y.index, validation_y, color='red', linestyle='dashed', linewidth=3,
             marker='o', markerfacecolor='red', markersize=8, label='Actual')
    plt.ylabel('Price', fontsize=20)
    plt.xlabel('Date', fontsize=20)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
    plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
    plt.title(model_name + ' Predicted vs Actual', fontsize=20)
    plt.legend(loc='upper right')
    plt.show()
    print(model_name + ' RMSE: ', RMSE_Score)
    print(model_name + ' R2 score: ', R2_Score)
    return RMSE_Score, R2_Score

# Method to evaluate the final model with the testing data set
def bestModel_validateResult(model, model_name):
    """
    Returns RMSE_Score and R2_Score
    Also plots the actual vs predicted trend
    args:
        model : the model to validate
        model_name: name of the model
    return:
        RMSE_Score : calculated RMSE score
        R2_Score : calculated R2 score
    """
    # using the testing set for the evaluation
    model = model(x_train, y_train, x_test)
    prediction = model.predict(x_test)
    RMSE_Score = np.sqrt(mean_squared_error(y_test, prediction))
    R2_Score = r2_score(y_test, prediction)
    plt.figure(figsize=(23, 10))
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.title(model_name + ' Prediction vs Actual', fontsize=20)
    plt.plot(y_test, label='test data')
    plt.plot(prediction, label='prediction')
    plt.xlabel('Days', fontsize=20)
    plt.ylabel('Price', fontsize=20)
    plt.legend()
    print(model_name + ' RMSE: ', RMSE_Score)
    print(model_name + ' R2 score: ', R2_Score)
    return RMSE_Score, R2_Score

def value_Compare(model):
"""
this method is used to create final data frame using testing value with predicted value.
args:
model : trained model
return :
df : df with test value and predicted value
"""
model = model(x_train, y_train, x_test)
prediction = model.predict(x_test)
col1 = pd.DataFrame(y_test, columns=['True_value'])
col2 = pd.DataFrame(prediction, columns = ['Predicted_value'])
df = pd.concat([col1, col2], axis=1)
return df
4- Algorithms
a)- Decision Tree Regressor
The first model I use to predict stock prices is the Decision Tree Regressor (DTR). It is a supervised learning method. Briefly, the DTR model breaks a data set down into smaller and smaller subsets as the model is developed. The function below implements the decision tree regressor algorithm, and the graph shows that the model does not fit the validation data well, since the RMSE is high and the R2 score is very low.
# modeling DecisionTreeRegressor with default parameters
def model_Decision_Tree_Regressor(x_train, y_train, validation_x, random_state=0):
    """
    args:
        x_train : training set
        y_train : target training set
        validation_x : validation feature set (kept for a consistent interface)
        random_state : randomness of the estimator
    return:
        model : returns the trained model
    """
    # initialize DTR
    dtr = DecisionTreeRegressor(random_state=random_state)
    # fit the data
    model = dtr.fit(x_train, y_train)
    return model

# Getting the RMSE and R2 score by validating the model.
# DTR model RMSE and R2 score with plot
RMSE_Score, R2_Score = model_validateResult(model_Decision_Tree_Regressor, model_name="Decision Tree")
b)- Support Vector Regressor (SVR)
SVR is a very popular machine learning algorithm for regression problems. SVR specifies how much error is tolerable in the model and finds an appropriate line (or hyperplane in higher dimensions) to fit the data.
For the SVR model, I use a linear kernel. For tuning, I use a set of C and epsilon values and the GridSearchCV method to find the best parameters.
As the graph shows, SVR performs well on the validation data based on its RMSE and R-squared scores: the RMSE is 8.7 while the R2 score is around 91%. The tuned SVR model works better than the plain SVR model, with an RMSE of 8.2 and an R2 score of 92%.
def model_SVR(x_train, y_train, validation_x):
    """
    This method uses the SVR algorithm to train the data.
    args:
        x_train : feature training set
        y_train : target training set
        validation_x : validation feature set
    return:
        model : returns the trained model
    """
    svr_model = SVR(kernel='linear')
    model = svr_model.fit(x_train, y_train)
    return model

def model_SVRTuning(x_train, y_train, validation_x):
    """
    This method uses the SVR algorithm to train the data.
    Using different sets of C and epsilon.
    Using GridSearchCV to select the best hyperparameters
    args:
        x_train : feature training set
        y_train : target training set
        validation_x : validation feature set
    return:
        model : returns the trained model
    """
    hyperparameters_linearSVR = {
        'C': [0.5, 1.0, 10.0, 50.0, 100.0, 120.0, 150.0, 300.0, 500.0, 700.0, 800.0, 1000.0],
        'epsilon': [0, 0.1, 0.5, 0.7, 0.9],
    }
    grid_search_SVR_feat = GridSearchCV(estimator=model_SVR(x_train, y_train, validation_x),
                                        param_grid=hyperparameters_linearSVR,
                                        cv=TimeSeriesSplit(n_splits=10),
                                        )
    model = grid_search_SVR_feat.fit(x_train, y_train)
    # print(grid_search_SVR_feat.best_params_)
    return model
RMSE_Score, R2_Score = model_validateResult(model_SVR, model_name="SVR")

# SVR model tuning
RMSE_Score, R2_Score = model_validateResult(model_SVRTuning, model_name = "SVR_Tuned")
c)- Lasso and Ridge
Lasso regression is a regularization technique. It obtains the subset of predictors that minimizes prediction error. The model uses shrinkage, where data values are shrunk towards a central point such as the mean; the lasso procedure thus encourages simple, sparse models.
Ridge regression is a model tuning method used to analyze data that suffers from multicollinearity. It performs L2 regularization. When multicollinearity is present, least-squares estimates are unbiased but their variances are large, so the predicted values can end up far from the actual data. In our case, the results show that both models fit the validation data set well based on their RMSE and R2 scores.
def model_Lasso(x_train, y_train, validation_x):
    """
    This method uses LassoCV to train the data.
    args:
        x_train : feature training set
        y_train : target training set
        validation_x : validation feature set
    return:
        model : returns the trained model
    """
    lasso_clf = LassoCV(n_alphas=1000, max_iter=3000, random_state=0)
    model = lasso_clf.fit(x_train, y_train)
    # prediction = model.predict(validation_x)
    return model

def model_Ridge(x_train, y_train, validation_x):
    """
    This method uses RidgeCV to train the data.
    args:
        x_train : feature training set
        y_train : target training set
        validation_x : validation feature set
    return:
        model : returns the trained model
    """
    ridge_clf = RidgeCV(gcv_mode='auto')
    model = ridge_clf.fit(x_train, y_train)
    # prediction = ridge_model.predict(validation_x)
    return model

RMSE_Score, R2_Score = model_validateResult(model_Lasso, model_name="Lasso")
RMSE_Score, R2_Score = model_validateResult(model_Ridge, model_name="Ridge")
d)- Stochastic Gradient Descent model
Stochastic gradient descent is an optimization algorithm often used in machine learning to find the model parameters that give the best fit between predicted and actual outputs. It is a powerful technique and is broadly used in machine learning applications.
The parameters are used as follows:
max_iter = 1000: the maximum number of iterations the model goes through
tol = 1e-3: the stopping criterion
loss = 'squared_epsilon_insensitive': errors smaller than epsilon are ignored
penalty = 'l1': the regularization penalty
alpha = 0.1: the constant that multiplies the regularization term
def Stochastic_Gradient_Descent_model(x_train, y_train, validation_x):
    """
    This method uses SGDRegressor to train the data.
    args:
        x_train : feature training set
        y_train : target training set
        validation_x : validation feature set
    return:
        model : returns the trained model
    """
    sgd = SGDRegressor(max_iter=1000, tol=1e-3, loss='squared_epsilon_insensitive', penalty='l1', alpha=0.1)
    model = sgd.fit(x_train, y_train)
    # prediction = model.predict(validation_x)
    return model

RMSE_Score, R2_Score = model_validateResult(Stochastic_Gradient_Descent_model, model_name="SGD")
5- Model Comparison
In order to compare all the models, I create functions that pick the best model based on the lowest RMSE value and the highest R2 score.
def ValidationDataResult(model, model_name):
    """
    Returns RMSE_Score and R2_Score using the validation data set
    args:
        model : it takes the model to validate
        model_name: the model name
    return:
        RMSE_Score : calculated RMSE score
        R2_Score : calculated R2 score
    """
    model = model(x_train, y_train, validation_x)
    prediction = model.predict(validation_x)
    RMSE_Score = np.sqrt(mean_squared_error(validation_y, prediction))
    R2_Score = r2_score(validation_y, prediction)
    model_validation = {model_name: [RMSE_Score, R2_Score]}
    return model_validation

# Method to evaluate the final model with the testing data set
def TestDataResult(model, model_name):
    """
    Returns RMSE_Score and R2_Score
    Using the testing data set for evaluation
    args:
        model : it takes the model to validate
        model_name: the model name
    return:
        RMSE_Score : calculated RMSE score
        R2_Score : calculated R2 score
    """
    # using the testing set for the evaluation
    model = model(x_train, y_train, x_test)
    prediction = model.predict(x_test)
    RMSE_Score = np.sqrt(mean_squared_error(y_test, prediction))
    R2_Score = r2_score(y_test, prediction)
    model_validation_test_data = {model_name: [RMSE_Score, R2_Score]}
    return model_validation_test_data
After the functions are created, the validation data RMSE and R2 scores and the test data RMSE and R2 scores are collected into dictionaries.
import warnings
warnings.filterwarnings('ignore')

model_list = {'Decision_Tree': model_Decision_Tree_Regressor, 'SVR': model_SVR,
              'SVR_Tuning': model_SVRTuning, 'Lasso': model_Lasso,
              'Ridge': model_Ridge, 'Stochastic_Gradient': Stochastic_Gradient_Descent_model}
ValidationData_RMSE_R2_Score = []
TestData_RMSE_R2_Score = []
for key, value in model_list.items():
    all_model_val = ValidationDataResult(model=value, model_name=key)
    ValidationData_RMSE_R2_Score.append(all_model_val)
print('Validation Data Result : ', ValidationData_RMSE_R2_Score)
for key, value in model_list.items():
    all_model_val_test = TestDataResult(model=value, model_name=key)
    TestData_RMSE_R2_Score.append(all_model_val_test)
print('Test Data Result : ', TestData_RMSE_R2_Score)
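The score dictionaries can then be assembled into the DataFrames used for the comparison below; a minimal sketch of that step (the helper name results_to_df is hypothetical):

# Hypothetical helper: turn the list of {model_name: [RMSE, R2]} dicts into a DataFrame
def results_to_df(score_list):
    merged = {name: scores for d in score_list for name, scores in d.items()}
    return pd.DataFrame(merged, index=['RMSE', 'R2_Score']).T

Validation_Model_List = results_to_df(ValidationData_RMSE_R2_Score)
Test_Model_List = results_to_df(TestData_RMSE_R2_Score)
print(Validation_Model_List)
print(Test_Model_List)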
The validation data set RMSE and R2 scores are shown in the chart below.
The test data set RMSE and R2 scores are shown in the chart below.
Based on the dataframes of the validation data set and test data set results, the tuned SVR model shows the lowest RMSE and the highest R2 scores.
# We select the lowest RMSE and highest R2_Score to pick the best model.
# All results indicate SVR_Tuning is the best model.
print('Min RMSE for Validation DataSet : ', Validation_Model_List['RMSE'].idxmin(),'\nMax R2_Score for Validation DataSet : ',Validation_Model_List['R2_Score'].idxmax())
print('Min RMSE for Test DataSet: ',Test_Model_List['RMSE'].idxmin(),'\nMax R2_Score for Test DataSet : ',Test_Model_List['R2_Score'].idxmax())
The tuned SVR model is then validated against the test data set. I use the bestModel_validateResult() function to show the best model's results, as follows.
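A minimal sketch of that final step, assuming the calls mirror the earlier evaluation pattern:

# Final evaluation of the tuned SVR model on the test set
RMSE_Score, R2_Score = bestModel_validateResult(model_SVRTuning, model_name="SVR_Tuned")
# Side-by-side comparison of actual vs predicted test values
comparison_df = value_Compare(model_SVRTuning)
print(comparison_df.head())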
This is the test data result showing actual and prediction data.
Conclusion
In this project, I have tried various machine learning algorithms to predict stock prices and compared the models' metrics. The actual and predicted trendlines clearly indicate that the SVR model is a good fit for predicting the stock price.
After tuning the SVR model's C and epsilon parameters, the predictions improved further. I would say that it is not possible to get a 99% accurate prediction, since numerous factors can affect stock prices. However, we can at least predict the general trendline by taking different factors into consideration.
Improvement
There are still many technical and fundamental analysis methods and feature variables that were not included in this project.
There are also many different machine learning algorithms that I have not applied and that could be tried later.
I would also increase the number of years of data used to build the machine learning model in order to improve its performance.
The model could also be extended to cover more than one stock.
The GitHub link is as follows: