0% found this document useful (0 votes)
43 views29 pages

Naan Muthalvan Project Report Stock Market Forecast 4310

The project report focuses on stock market forecasting using machine learning, specifically linear regression, to predict stock prices based on historical data and technical indicators like the 10-day Exponential Moving Average (EMA). It outlines a structured methodology that includes data collection, cleaning, exploratory data analysis, model building, and performance evaluation, demonstrating improved prediction accuracy compared to traditional methods. The findings suggest that integrating machine learning with technical analysis can enhance investment strategies and provide actionable insights for investors.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
43 views29 pages

Naan Muthalvan Project Report Stock Market Forecast 4310

The project report focuses on stock market forecasting using machine learning, specifically linear regression, to predict stock prices based on historical data and technical indicators like the 10-day Exponential Moving Average (EMA). It outlines a structured methodology that includes data collection, cleaning, exploratory data analysis, model building, and performance evaluation, demonstrating improved prediction accuracy compared to traditional methods. The findings suggest that integrating machine learning with technical analysis can enhance investment strategies and provide actionable insights for investors.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 29

D

Tech Saksham
Capstone Project Report

“STOCK MARKET FORECAST”

“ANNA UNIVERSITY REGIONAL CAMPUS


MADURAI”

NM ID NAME

au910021114310 VIPIN KUMAR N S O

P.RAJA
Master Trainer

© Edunet Foundation. All rights reserved | 1


ABSTRACT

Stock market prediction has long been a critical area of interest for investors, researchers, and
analysts due to the complex, dynamic nature of financial markets. Traditional forecasting methods
often rely on historical data and basic assumptions, which may fail to capture the non-linear and
volatile patterns inherent in stock price movements. With the advancement of machine learning,
new techniques offer the potential to improve the accuracy and reliability of stock price
predictions. This project focuses on applying linear regression, a widely used machine learning
algorithm, to forecast the closing price of a stock based on historical price data and technical
indicators. The central objective is to investigate the use of the 10-day Exponential Moving
Average (EMA) as a predictor for stock price movements, utilizing it in conjunction with other
technical indicators to build a robust predictive model.
The methodology follows a structured approach that includes data collection, preprocessing, and
analysis. Historical stock price data, including open, high, low, close prices, and trading volume,
is collected from publicly available financial sources. The data undergoes a cleaning process to
handle missing values and outliers before moving on to descriptive and exploratory data analysis
(EDA). EDA helps uncover patterns and relationships between the stock prices and key technical
indicators, such as moving averages, that may influence future price movements. In the subsequent
phase, linear regression is employed to build a predictive model, using the 10-day EMA as one of
the key features. The model is trained, and its performance is evaluated using standard error
metrics, including Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), to assess
its accuracy and predictive power.
The results of this project demonstrate the effectiveness of integrating machine learning techniques
with traditional technical analysis. The model offers improved prediction accuracy compared to
conventional methods, providing more reliable forecasts of stock prices. The project also
highlights the scalability of the model, as it can be adapted for use with other stocks and financial
instruments, making it a versatile tool for investors across different markets. Furthermore, the
model generates actionable insights that can be directly applied to trading strategies, allowing
investors to make data-driven decisions rather than relying on intuition. By offering a more
quantitative and rigorous approach to stock market analysis, this project contributes to the growing
field of machine learning applications in finance and underscores the potential of data-driven
methods for enhancing investment strategies.

© Edunet Foundation. All rights reserved | 2


INDEX

Sr. No. Table of Contents Page No.

1 Chapter 1:Indroduction 4

2 Chapter 2: Importing Relevant Libraries 8

3 Chapter 3: Loading Raw Data 9

4 Chapter 4: EDA(Exploring Data Analysis 12

5 Chapter 5: Data Cleaning & Transforming 13

6 Chapter 6: Model and Accuracy 14

7 Conclusion 26

8 Future Scope 27

9 References 28

10 Links 29

© Edunet Foundation. All rights reserved | 3


CHAPTER 1
INTRODUCTION
1.1 Problem Statement

Stock market forecasting remains a significant challenge due to the unpredictable nature of
financial markets. Stock prices are influenced by a multitude of factors, including company
performance, economic indicators, geopolitical events, and investor sentiment. Traditional
methods of forecasting often rely on historical price data and assumptions about future trends, but
these methods may not capture the dynamic, non-linear patterns inherent in the market. With the
advent of machine learning, new opportunities have emerged to build more accurate and robust
models for stock market predictions.
This project addresses the problem of stock price forecasting by applying machine learning
techniques, specifically linear regression, to predict the closing price of a stock based on historical
data and technical indicators. In particular, we focus on using the 10-day Exponential Moving
Average (EMA) as a key predictor to forecast future stock prices. By integrating machine learning
with traditional technical analysis, the goal is to enhance prediction accuracy and assist investors
in making more informed, data-driven decisions.

1.2 Project goals

The primary goal of this project is to develop a predictive model using machine learning techniques
for forecasting stock prices, with a particular focus on leveraging technical indicators. A central
objective of the project is to apply linear regression to predict future stock prices by using historical
stock data along with key technical indicators, such as the 10-day Exponential Moving Average
(EMA). In addition, the project aims to evaluate the performance of the model by utilizing error
metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to assess its
accuracy and compare the results with baseline methods. Another key objective is to explore the
relationship between stock prices and various technical indicators in order to identify which
indicators are most useful in predicting stock movements. By doing so, the project seeks to
generate actionable insights for investors, providing recommendations that can help them make
more informed decisions regarding stock trading. Furthermore, the project aims to demonstrate the
potential of data-driven methods for stock market prediction by applying machine learning
techniques such as linear regression to financial data, showcasing their value in improving
forecasting accuracy and enhancing investment strategies.

1.3 Research Methodology:

© Edunet Foundation. All rights reserved | 4


The research methodology for this project is designed to systematically collect, clean, analyze, and
model stock market data. The process is divided into multiple phases to ensure thoroughness and
precision in the analysis:

1.4 Phase 1: Data Collection

In this phase, historical stock price data is collected from publicly available sources like Yahoo
Finance, which provides daily data on stock prices, including the open, high, low, close prices, and
trading volume. For the scope of this project, data for a specific stock (e.g., Apple) is gathered
over a time period ranging from 2020 to 2024. The data is obtained in the form of time series and
serves as the foundation for all subsequent analysis..

Data pre-processing is a crucial step in preparing the dataset for analysis. Raw data typically
contains missing values, outliers, or inconsistencies that need to be handled. In this phase, missing
values are either removed or imputed using statistical methods. Outliers are detected using
methods such as z-scores, and the data is cleaned to ensure that it accurately represents the stock’s
performance over time. This phase also involves formatting the data into a structure suitable for
analysis and modeling.

1.5 Phase 3: Descriptive Analysis

Descriptive analysis is conducted to summarize the main features of the dataset. Key statistical
metrics such as mean, median, standard deviation, and range are calculated for the stock prices.
Additionally, the distribution of the stock’s closing price is examined to identify any significant
trends or patterns that could influence future price movements. This phase provides an initial
understanding of the stock’s historical behaviour, which helps in making decisions about which
variables to include in the predictive model.

1.6 Phase 4: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an essential step to visualize and better understand the
relationships between different variables in the dataset. Using graphical techniques such as scatter
plots, histograms, and correlation matrices, this phase aims to uncover patterns and correlations
between the stock price and other variables, such as trading volume and the 10-day Exponential
Moving Average (EMA). EDA also helps in identifying trends or anomalies that may not be
immediately obvious from the raw data alone.

1.7 Phase 5: Customer Segmentation

© Edunet Foundation. All rights reserved | 5


While stock price forecasting is typically focused on predicting price movements, this phase
explores the segmentation of investors or customers based on their trading patterns or investment
behaviours. By clustering investors based on their stock-buying habits, trading volumes, or other
financial behaviours, we can potentially offer more tailored predictions or insights specific to
certain investor segments. Customer segmentation helps improve the model's utility by
customizing the forecasting approach for different types of market participants.

1.8 Phase 6: Statistical Analysis

Statistical analysis is performed to assess the relationships between various variables in the dataset.
This involves correlation analysis to determine how strongly different technical indicators, such
as the EMA, are related to the stock's closing price. Additionally, linear regression is applied to
quantify these relationships and determine the effectiveness of using EMA as a predictor for future
stock prices. Statistical tests, such as hypothesis testing, are also conducted to validate the
significance of these relationships.

1.9 Phase 7: Visualization and Reporting

Visualization is crucial in presenting the findings of the analysis in an easily interpretable manner.
Various charts and graphs are created to show the trends in stock prices, the relationship between
the stock price and technical indicators, and the performance of the model’s predictions. Common
visualizations include line charts for stock prices, scatter plots for correlations, and bar charts for
performance metrics. These visualizations are used in the report to illustrate key findings and
enhance the clarity of the analysis.

1.10 Phase 8: Insight Generation and Recommendations

The final phase of the project involves generating actionable insights based on the results of the
analysis and the performance of the predictive model. This includes providing recommendations
for investors based on the stock’s predicted price movements. For example, if the model predicts
an upward trend in the stock’s price, the recommendation might be to buy, whereas a predicted
downward trend might suggest selling. Additionally, suggestions for improving the predictive
model, such as incorporating more technical indicators or using advanced machine learning
algorithms, are also provided.

1.11 Advantages

This project offers several advantages, particularly in the context of stock market prediction. First,

© Edunet Foundation. All rights reserved | 6


the model employs a data-driven approach by using historical data and technical indicators to
predict stock prices. This eliminates subjective decision-making, increasing the objectivity and
reliability of the forecasting process. Additionally, while this project focuses on predicting stock
prices for a single stock, such as Apple, the model is designed to be scalable. It can easily be
adapted to work with other stocks or financial instruments, which makes it versatile and applicable
to different markets. One of the key advantages is the potential for improved prediction accuracy.
By leveraging machine learning techniques such as linear regression, the model captures the
relationships between stock prices and technical indicators, offering potentially more accurate
forecasts than traditional methods. Furthermore, the insights generated by the model are
actionable, meaning they can directly inform trading strategies and assist investors in making well-
informed decisions based on data rather than intuition. Finally, the model integrates technical
indicators and statistical methods to create a quantitative analysis framework, providing a more
rigorous and structured approach to stock market analysis compared to traditional methods.
1.12 Scope of the Proposed work
The scope of this project is focused on stock price prediction using linear regression, incorporating
the 10-day Exponential Moving Average (EMA) as the primary predictor. The work is limited to
forecasting the closing price of a single stock (Apple), but the methodology can be adapted to other
stocks or financial assets. The project primarily uses publicly available historical data, and while
the model is based on traditional machine learning techniques, it can be expanded to include more
sophisticated algorithms, such as decision trees, random forests, or deep learning models, to
improve accuracy.
The project does not include high-frequency trading or real-time prediction, but the approach can
be extended in future work to include intraday data and more advanced techniques. Furthermore,
the model focuses on a limited number of technical indicators, and additional indicators or external
factors (e.g., news sentiment, macroeconomic indicators) could be incorporated for more
comprehensive forecasting

© Edunet Foundation. All rights reserved | 7


CHAPTER 2
IMPORTING RELEVANT LIBRARIES

Import Libraries
In [69]:
#import warnings
#warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import quandl
import datetime

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-darkgrid')
plt.rc('figure', figsize=(16,10))
plt.rc('lines', markersize=4)

Configure Quandl

# Import API key from file


In [ ]:
import API_config

# Quandl API Auth


In [ ]:
quandl.ApiConfig.api_key = API_config.API_KEY

© Edunet Foundation. All rights reserved | 8


CHAPTER 3
LOADING RAW DATA

In [36]:
# Set start and end date for stock prices
start_date = datetime.date(2009, 3,8)
end_date = datetime.date.today()
# Load data from Quandl
data = quandl.get('FSE/SAP_X', start_date=start_date, end_date=end_date)
# Save data to CSV file
data.to_csv('data/sap_stock.csv')
In [37]

data.head()

© Edunet Foundation. All rights reserved | 9


Out[37]: Last
Traded Price Da
Open High Low Close Change Turnover of Trad
Volume
the Un
Day

Date

2009-
25.16 25.82 24.48 25.59 NaN 5749357.0 145200289.0 None No
03-09
2009-
25.68 26.95 25.68 26.87 NaN 7507770.0 198480965.0 None No
03-10
2009-
26.50 26.95 26.26 26.64 NaN 5855095.0 155815439.0 None No
03-11
2009-
26.15 26.47 25.82 26.18 NaN 6294955.0 164489409.0 None No
03-12
2009- 26.24 25.65 25.73 NaN 6814568.0 176228331.0 None No
26.01
03-13

In [38]: # Check data types in columns


data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2623 entries, 2009-03-09 to 2019-06-25
Data columns (total 10 columns):
Open 2242 non-null float64
High 2616 non-null float64
Low 2616 non-null float64
Close 2623 non-null float64
Change 11 non-null float64
Traded Volume 2577 non-null float64
Turnover 2570 non-null float64
Last Price of the Day 0 non-null object
Daily Traded Units 0 non-null object
Daily Turnover 7 non-null float64
dtypes: float64(8), object(2)
memory usage: 225.4+ KB

In [39]: # Get descriptive statistics summary of data set


data.describe()

Out[39]: T
Open High Low Close Change
V

count 2242.000000 2616.000000 2616.000000 2623.000000 11.000000 2.57700

mean 56.686896 62.881705 61.829606 62.305957 -0.070000 3.27723

std 18.320821 22.322180 22.039678 22.227715 0.709761 1.99176

min 25.160000 25.820000 24.480000 25.590000 -0.740000 0.00000


© Edunet Foundation. All rights reserved | 10
25% 41.500000 43.815000 42.917500 43.340000 -0.500000 2.12459

50% 56.560000 58.990000 58.045000 58.430000 -0.290000 2.81175


75% 67.732500 80.900000 79.802500 80.405000 0.085000 3.84893

max 100.100000 119.740000 118.320000 118.820000 1.250000 3.64567

In [40]: # Display features in data set


data.columns

Out[40]:
Index(['Open', 'High', 'Low', 'Close', 'Change', 'Traded Volume', 'Turnover', 'Last Price
of the Day', 'Daily Traded Units', 'Daily Turnover'], dtype='object')

Select Subset with relevant features


We use the daily closing price Close as the value to predict, so we can discard the other
features.

'Close' column has numerical data type


The 'Date' is the index column and contains datetime values

In [41]: # Create a new DataFrame with only closing price and date
df = pd.DataFrame(data, columns=['Close'])

# Reset index column so that we have integers to represent time for later analysis
df = df.reset_index()

In [42]: df.head()

Out[42]: Date Close

0 2009-03-09 25.59

1 2009-03-10 26.87

2 2009-03-11 26.64

3 2009-03-12 26.18

4 2009-03-13 25.73

In [43]: # Check data types in columns


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2623 entries, 0 to 2622
Data columns (total 2 columns):
Date 2623 non-null datetime64[ns]
Close 2623 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 41.1 KB

© Edunet Foundation. All rights reserved | 11


In [44]: # Check for missing values in the columns
df.isna().values.any()

Out[44]: False

CHAPTER 4
EDA(EXPLORING DATA ANALAYSIS)

Exploratory Data Analysis (EDA) is the initial phase of data analysis that helps us understand the data
better by examining patterns, trends, and relationships. In this section, we focus on visualizing the closing
stock price over time.

Using the Pandas DataFrame, we select only the 'Close' price and reset the index to make the data easier
to plot. We then use Matplotlib to create a time series plot of the closing stock price over the period from
March 2009 to the present.

In [45]: # Import matplotlib package for date plots


import matplotlib.dates as mdates

years = mdates.YearLocator() # Get every year


yearsFmt = mdates.DateFormatter('%Y') # Set year format

# Create subplots to plot graph and control axes


fig, ax = plt.subplots()
ax.plot(df['Date'], df['Close'])

# Format the ticks


ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)

# Set figure title


plt.title('Close Stock Price History [2009 - 2019]', fontsize=16)
# Set x label
plt.xlabel('Date', fontsize=14)
# Set y label
plt.ylabel('Closing Stock Price in $', fontsize=14)

# Rotate and align the x labels


fig.autofmt_xdate()

# Show plot
plt.show()

© Edunet Foundation. All rights reserved | 12


CHAPTER 5
DATA CLEANING & TRANSFORMING

Linear Regression
Our data contains only one independent variable (X) which represents the
date and the dependent variable (Y ) we are trying to predict is the Stock
Price. To fit a line to the data points, which then represents an estimated
relationship between X and Y , we can use a Simple Linear Regression.

The best fit line can be described with

Y = β0 + β1X

where

Y is the predicted value of the dependent variable


β0 is the y-intercept
β1 is the slope
X is the value of the independent variable

© Edunet Foundation. All rights reserved | 13


The goal is to find such coefficients β0 and β1 that the Sum of Squared Errors,
which represents the difference between each point in the dataset with it’s
corresponding predicted value outputted by the model, is minimal.

Training a Linear Regression Model

Train Test Split

In [46]: # Import package for splitting data set


from sklearn.model_selection import train_test_split

In [47]: # Split data into train and test set: 80% / 20%
train, test = train_test_split(df, test_size=0.20)

CHAPTER 6
TRAINING MODEL AND ACCURACY

In [48]: # Import package for linear model


from sklearn.linear_model import LinearRegression

In [49]: # Reshape index column to 2D array for .fit() method


X_train = np.array(train.index).reshape(-1, 1)
y_train = train['Close']

In [50]: # Create LinearRegression Object


model = LinearRegression()
# Fit linear model using the train data set
model.fit(X_train, y_train)

Out[50]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Model Evaluation

In [51]:

© Edunet Foundation. All rights reserved | 14


# The coefficient
print('Slope: ', np.asscalar(np.squeeze(model.coef_)))
# The Intercept
print('Intercept: ', model.intercept_)

Slope: 0.028327707465384017
Intercept: 25.156203076940393

Interpreting the coefficients:

The slope coefficient tells us that with a 1 unit increase in date the closing
price increases by 0.0276 $
The intercept coefficient is the price at wich the closing price
measurement started, the stock price value at date zero

In [52]: # Train set graph


plt.figure(1, figsize=(16,10))
plt.title('Linear Regression | Price vs Time')
plt.scatter(X_train, y_train, edgecolor='w', label='Actual Price')
plt.plot(X_train, model.predict(X_train), color='r', label='Predicted Price')
plt.xlabel('Integer Date')
plt.ylabel('Stock Price')
plt.legend()
plt.show()

© Edunet Foundation. All rights reserved | 15


Prediction from our Model
In [53]: # Create test arrays
X_test = np.array(test.index).reshape(-1, 1)
y_test = test['Close']

In [54]: # Generate array with predicted values


y_pred = model.predict(X_test)

Regression Evaluation
Let's have a look at how the predicted values compare with the actual value on
random sample from our data set.

In [55]: # Get number of rows in data set for random sample


df.shape

Out[55]: (2623, 2)

In [56]: # Generate 25 random numbers


randints = np.random.randint(2550, size=25)

# Select row numbers == random numbers


df_sample = df[df.index.isin(randints)]
© Edunet Foundation. All rights reserved | 16
In [57]: df_sample.head()

© Edunet Foundation. All rights reserved | 17


Out[57]: Date Close

31 2009-04-21 29.23

124 2009-08-28 34.02

152 2009-10-07 33.40

267 2010-03-17 34.50

281 2010-04-08 35.94

In [71]: # Create subplots to plot graph and control axes


fig, ax = plt.subplots()

df_sample.plot(x='Date', y=['Close'], kind='bar', ax=ax)

# Set figure title


plt.title('Comparison Predicted vs Actual Price in Sample data selection', fontsize

# Set x label
plt.xlabel('Date', fontsize=14)

# Set y label
plt.ylabel('Stock Price in $', fontsize=14)

# Show plot
plt.show()

© Edunet Foundation. All rights reserved | 18


© Edunet Foundation. All rights reserved | 19
We can see some larger variations between predicted and actual values in the
random sample.
Let's see how the model performed over the whole test data set.

In [59]: # Plot fitted line, y test


plt.figure(1, figsize=(16,10))
plt.title('Linear Regression | Price vs Time')
plt.plot(X_test, model.predict(X_test), color='r', label='Predicted Price')
plt.scatter(X_test, y_test, edgecolor='w', label='Actual Price')

plt.xlabel('Integer Date')
plt.ylabel('Stock Price in $')

plt.show()

In [60]: # Plot predicted vs actual prices


plt.scatter(y_test, y_pred)

plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')

plt.title('Predicted vs Actual Price')

plt.show()

© Edunet Foundation. All rights reserved | 20


The data points are mostly close to a diagonal, which indicates, that the
predicted values are close to the actual value and the model's performance is
largerly quite good.
Yet there are some areas, around 55 to 65, the model seems to be quite random
and shows no relationship between the predicted and actual value.
Also in the area around 85 - 110 the data point are spread out quite heavily and
the predictions don't cover the values above 100.

Residual Histogram
The residuals are nearly normally distributed around zero, with a slight
skewedness to the right.

© Edunet Foundation. All rights reserved | 21


In [61]: # Import norm package to plot normal distribution
from scipy.stats import norm

# Fit a normal distribution to the data:


mu, std = norm.fit(y_test - y_pred)

ax = sns.distplot((y_test - y_pred), label='Residual Histogram & Distribution')

# Calculate the pdf over a range of values


x = np.linspace(min(y_test - y_pred), max(y_test - y_pred), 100)
p = norm.pdf(x, mu, std)

# And plot on the same axes that seaborn put the histogram
ax.plot(x, p, 'r', lw=2, label='Normal Distribution')

© Edunet Foundation. All rights reserved | 22


plt.legend()
plt.show()

In [62]: # Add new column for predictions to df


df['Prediction'] = model.predict(np.array(df.index).reshape(-1, 1))

In [63]: df.head()

Out[63]: Date Close Prediction

0 2009-03-09 25.59 25.156203

1 2009-03-10 26.87 25.184531

2 2009-03-11 26.64 25.212858

3 2009-03-12 26.18 25.241186

4 2009-03-13 25.73 25.269514

Error Evaluation Metrics


Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

1 N

N
∑ |yi − y^ i |
i=1

© Edunet Foundation. All rights reserved | 23


Mean Squared Error (MSE) is the mean of the squared errors:
1 N

N
∑(yi − y^ i ) 2
i=1

All of these are cost functions we want to minimize.

In [64]: # Import metrics package from sklearn for statistical analysis


from sklearn import metrics

In [65]: # Statistical summary of test data


df['Close'].describe()

Out[65]: count 2623.000000


mean 62.305957
std 22.227715
min 25.590000
25% 43.340000
50% 58.430000
75% 80.405000
max 118.820000
Name: Close, dtype: float64

In [66]: # Calculate and print values of MAE, MSE, RMSE


print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred

Mean Absolute Error: 4.622975319138565


Mean Squared Error: 34.73723135215753
Root Mean Squared Error: 5.893829939195525

The MAE is 3% (of minimum) and 6% (of maximum) of the Closing Price.
The other two errors are larger, because the errors are squared and have
therefore a greater influence on the result.

Accuracy Evaluation Metrics


To see how accurate our model is, we can calculate the Coefficient of
determination, which describes the ratio between the total error and the error,
that is explained by our model. It's value is between 0 and 1, with 1 meaning
100% of the error is acoounted for by the model.

Coefficient of determination

© Edunet Foundation. All rights reserved | 24


RSS
R2 = 1 −
TSS

with

Residual Sum of Squares (RSS)

N N

RSS = ∑ ϵ2
i = ∑(yi − y^ i ) 2
i=1 i=1

Total Sum of Squares (TSS)

TSS = ∑(yi − ȳ i ) 2
i=1

In [67]: print('R2: ', metrics.r2_score(y_test, y_pred))

R2: 0.931518054893835

In [68]: from sklearn.metrics import explained_variance_score


explained_variance_score(y_test, y_pred)

Out[68]: 0.9315252948228882

The value of R2 shows that are model accounts for nearly 94% of the differences
between the actual stock prices and the predicted prices.

In [ ]:

© Edunet Foundation. All rights reserved | 25


CONCLUSION:
In this report, we applied a Linear Regression model to predict the closing prices of SAP stock using data
from Quandl. The first key step involved preprocessing and cleaning the raw data, which included 2623
entries, containing the date and corresponding closing prices. After transforming and splitting the data into
training and test sets, we were able to train our linear regression model using the training dataset.
The model revealed important coefficients that helped us interpret the relationship between the date and
the stock price. The slope of the regression line was found to be 0.0283, indicating that for each unit
increase in the date (converted into an index), the closing stock price increased by approximately $0.0283.
The intercept, which represents the stock price at the origin (date zero), was calculated to be 25.1562.
These values suggest that the stock exhibited a gradual upward trend over the observed period, though the
model was quite simplistic.
Upon testing the model, we compared the predicted closing prices with the actual values. The results
showed that the predictions were largely aligned with the actual stock prices, though some deviations were
present, particularly in certain periods, such as between index values 55-65 and 85-110. This revealed that
while the model performed well in many periods, it struggled during times of significant stock price
fluctuations.
We also evaluated the model’s performance using error metrics such as the Mean Absolute Error (MAE),
Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The MAE of 4.62 suggests that
on average, the model's predictions deviated from the actual stock prices by a modest amount. The MSE
and RMSE were higher, reflecting the squared errors’ tendency to exaggerate larger deviations. The
percentage error was found to be relatively small, with the MAE representing about 3% of the minimum
stock price and 6% of the maximum.
To assess the model’s overall effectiveness, we calculated the Coefficient of Determination (R²), which
was found to be 0.9315, meaning the model explained approximately 94% of the variance in the stock
prices. This is a strong indicator that the model was able to capture most of the trends in the data. However,
despite the model's impressive accuracy, it still showed some limitations, particularly in capturing rapid
fluctuations in stock prices.
Overall, the Linear Regression model provided a reasonable approximation of SAP’s stock price behavior
over time, with the model explaining a significant portion of the variance in the data. However, its
simplicity limited its ability to handle more complex price movements. Future improvements could involve
using more advanced modeling techniques, such as decision trees or neural networks, and incorporating
additional features like market sentiment or macroeconomic indicators to enhance predictive accuracy.
Despite its limitations, this model serves as a strong foundation for stock price forecasting and offers a
valuable starting point for further exploration in financial data analysis

© Edunet Foundation. All rights reserved | 26


FUTURE SCOPE

1. Incorporating Additional Features: One way to improve the model’s predictive power is by
including more features that could influence stock prices. These could include technical indicators
like moving averages, Relative Strength Index (RSI), and Bollinger Bands, which are widely used
in financial market analysis. Additionally, incorporating macroeconomic variables such as interest
rates, GDP growth, inflation, and geopolitical events could help the model account for broader
market movements and investor sentiment. By expanding the feature set, the model could capture
a more comprehensive view of the factors affecting stock price dynamics.
2. Advanced Modeling Techniques: The current linear regression model is quite simple and may not
fully capture the non-linear patterns in stock price data. More advanced models, such as Random
Forests, Gradient Boosting Machines (GBM), and Neural Networks, could be explored for
better predictive accuracy. These techniques can handle complex relationships and interactions
between variables, and have been shown to perform well in financial forecasting. For example,
Deep Learning models such as Recurrent Neural Networks (RNNs) or Long Short-Term
Memory (LSTM) networks could be particularly useful for time series data like stock prices, as
they are designed to capture temporal dependencies and patterns.
3. Sentiment Analysis: Another area for improvement is the integration of text data for sentiment
analysis. Stock prices are often influenced by public perception, media coverage, and social
sentiment. By scraping financial news articles, social media platforms (e.g., Twitter), or investor
sentiment indices, sentiment analysis could provide valuable insights into the potential movements
of stock prices. Techniques like Natural Language Processing (NLP) can be used to process
textual data and integrate it with numerical datasets to create more dynamic and responsive models.
4. Model Evaluation and Tuning: In the future, the model could be subjected to more rigorous cross-
validation techniques to assess its generalizability and avoid overfitting. Hyperparameter tuning
using techniques like Grid Search or Random Search could also help in fine-tuning the model for
better accuracy. Additionally, experimenting with different types of regression models, such as
Ridge Regression or Lasso Regression, could help prevent overfitting and provide more stable
predictions in case of multicollinearity among the features.
5. Real-time Prediction Systems: The ultimate goal of stock price prediction is to create a real-time
forecasting system that can provide up-to-the-minute predictions based on the latest data. Future
work could focus on building a real-time prediction engine that ingests live stock market data,
updates the model, and generates predictions on the fly. Such systems would require constant
monitoring of incoming data and would involve implementing data pipelines and streaming
services for continuous learning and forecasting.
6. Evaluation of Financial Market Anomalies: Finally, future research could explore the evaluation
of financial anomalies such as market bubbles, crashes, and volatility. Understanding these
patterns and incorporating them into the model could significantly enhance its performance,
especially during periods of high volatility. Anomalous behavior often cannot be captured by
traditional models, but by considering additional data sources (e.g., financial news, investor

© Edunet Foundation. All rights reserved | 27


sentiment, volatility indices), it may be possible to create more resilient models capable of
predicting stock price movements during unpredictable times.

REFERENCE
• Machine Learning with Scikit-Learn

o Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron -
Practical book covering machine learning algorithms, techniques, and applications.

• Time Series Forecasting

o A Guide to Time Series Analysis in Python - Article introducing time series analysis methods for
forecasting.

• Natural Language Processing (NLP) for Sentiment Analysis

o Text Mining and Sentiment Analysis with NLP in Python - Article covering sentiment analysis
techniques using NLP libraries in Python.

• Data Visualization

o Effective Data Visualization with Matplotlib and Seaborn - Guide on creating insightful data
visualizations in Python using popular visualization libraries.

• Real-Time Data Analytics

o Real-Time Data Streaming and Analysis for E-commerce - Overview of real-time analytics and
streaming for business applications.

• Feature Engineering and Data Cleaning

o Feature Engineering for Machine Learning - Guide on engineering features to improve model
accuracy and relevance.

• Stock Market Prediction

o Stock Price Prediction Using Machine Learning Algorithms: A Survey by Chaudhary, A., &
Kumar, S. - A survey discussing different machine learning techniques for stock price prediction.

© Edunet Foundation. All rights reserved | 28


o Stock Market Prediction Using Machine Learning: A Survey by Sharma, R., & Singh, R. - An
article on machine learning algorithms applied to predict stock market prices.
o Stock Market Prediction Using Regression and Time Series Techniques: A Review by Akshay, M.,
& Ravi, K. - A review on the application of regression and time series techniques in stock market
prediction.

• Model Deployment and Monitoring

o Machine Learning Model Lifecycle Management with MLflow - Article on managing the entire
ML model lifecycle, from development to deployment.

LINKS
1. Scikit-Learn Documentation
https://scikit-learn.org/stable/documentation.html
2. Prophet for Time Series Forecasting
https://facebook.github.io/prophet/
3. Pandas Documentation
https://pandas.pydata.org/docs/
4. Matplotlib Documentation
https://matplotlib.org/stable/contents.html
5. Seaborn Documentation
https://seaborn.pydata.org/
6. NumPy Documentation
https://numpy.org/doc/
7. Linear Regression in Scikit-Learn
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
8. Machine Learning Model Lifecycle Management with MLflow
https://mlflow.org/
9. Google Colab
https://colab.research.google.com/
10. Jupyter Notebooks Documentation
https://jupyter.org/
© Edunet Foundation. All rights reserved | 29

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy