Big Data Analytics Project Report
on
BIG DATA STOCK ANALYSIS
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING
by
S.JAYANTH
(227R1A67H4)
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
CMR TECHNICAL CAMPUS
An UGC Autonomous Institute
Accredited by NBA & NAAC with A Grade
(Approved by AICTE, Affiliated to JNTU, Hyderabad)
Kandlakoya (V), Medchal (M), Hyderabad-501 401
(2024-2025)
CERTIFICATE
This is to certify that the presentation entitled "BIG DATA STOCK ANALYSIS" was submitted by S. JAYANTH (227R1A67H4).
Subject Faculty
Mrs. B. Sangamithra
1. ABSTRACT
In today’s fast-paced financial world, analyzing stock market trends is essential for making
informed investment decisions. The vast and complex nature of stock market data, including
historical prices, real-time transactions, and market sentiment, demands advanced tools and
frameworks for effective processing and analysis. Big Data technologies offer a promising
solution to tackle this challenge.
This paper presents a system for Big Data Stock Analysis using Hadoop, a powerful open-
source framework designed for distributed storage and processing of massive datasets. By
leveraging Hadoop’s core components—HDFS (Hadoop Distributed File System) for data
storage and MapReduce for parallel data processing—we develop an efficient architecture to
manage and analyze extensive stock market datasets.
The system processes historical and real-time stock data to generate insights, such as
predicting trends, detecting anomalies, and identifying profitable investment opportunities.
Additional integration with tools like Apache Hive and Apache Spark facilitates querying,
data visualization, and enhanced analytics. The framework also incorporates sentiment
analysis by processing social media data and news articles, thereby correlating market
sentiments with stock performance.
The proposed solution demonstrates scalability, fault tolerance, and efficiency, making it
suitable for handling the dynamic and high-volume nature of financial data. Our experiments
show that the Hadoop-based approach significantly reduces data processing time and
enhances prediction accuracy compared to traditional methods. This work highlights the
potential of Big Data technologies in transforming stock market analytics and improving
decision-making in the financial domain.
Keywords: Big Data, Hadoop, Stock Analysis, HDFS, MapReduce, Financial Analytics,
Sentiment Analysis
2. INTRODUCTION
Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population [1]. Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze [2]. Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools [3]. Big Data encompasses everything from clickstream data from the web to genomic and proteomic data from biological research and medicine. Big Data is a heterogeneous mix of structured data (traditional datasets in rows and columns, such as DBMS tables, CSV and XLS files) and unstructured data, such as e-mail attachments, manuals, images, PDF documents, medical records (x-ray, ECG, and MRI images), forms, rich media (graphics, video, and audio), and contacts. Businesses are primarily concerned with managing unstructured data, because over 80 percent of enterprise data is unstructured and requires significant storage space and effort to manage. Big data has the following characteristics [3]: Volume, the first important characteristic of big data, is the size of the data, which determines whether it can be considered Big Data at all; the name 'Big Data' itself indicates that the data is huge.
3. PURPOSE
4. OBJECTIVES
1. Efficient Stock Data Handling: Develop a scalable and distributed system to process
and analyze large volumes of stock market data using Hadoop.
2. Real-Time Data Processing: Implement mechanisms to handle real-time streaming
stock data for timely insights.
3. Comprehensive Data Analysis: Perform historical and trend analysis on stock data
to identify patterns and predict future market movements.
4. Scalability and Performance Optimization: Utilize Hadoop's distributed file system
(HDFS) and MapReduce framework to ensure fast processing of vast datasets while
maintaining high performance.
5. Data Storage and Retrieval: Efficiently store and retrieve structured and
unstructured stock data across a distributed environment.
6. Visualization and Insights: Generate user-friendly reports and visualizations for
stock performance metrics, enabling better decision-making.
7. Integration with Analytical Tools: Integrate Hadoop with data analysis tools (e.g.,
Hive, Pig, Spark) for enhanced querying and machine learning capabilities.
8. Data Security and Reliability: Ensure the security and reliability of sensitive stock
market data throughout the processing pipeline.
9. Automation of Analysis Pipelines: Develop automated workflows for data ingestion,
cleaning, processing, and visualization.
10. Cost-Effective Solution: Leverage Hadoop's open-source framework to provide a
cost-efficient system for handling large-scale stock analysis.
5. APACHE HADOOP
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are commonplace and thus should be automatically handled in software by the framework [6]. The core of Apache Hadoop consists of a storage part (the Hadoop Distributed File System, HDFS) and a processing part (MapReduce). Hadoop splits files into large blocks and distributes them among the nodes in the cluster. To process the data, Hadoop MapReduce ships packaged code to the nodes, which then process their share of the data in parallel. This approach takes advantage of data locality – nodes manipulating the data they have on hand – to process the data faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are connected via high-speed networking.
The base Apache Hadoop framework is composed of the following modules: Hadoop Common – libraries and utilities needed by the other Hadoop modules; Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster; Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and scheduling users' applications; and Hadoop MapReduce – a programming model for large-scale data processing. The term "Hadoop" has come to refer not just to the base modules above, but also to the collection of additional software packages that run on top of it, such as Apache Pig, Apache Hive, and others.
6. MAPREDUCE
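As described in the previous section, MapReduce processes data as <key, value> pairs in two phases: a map phase that emits intermediate pairs and a reduce phase that aggregates all values sharing the same key, with a shuffle-and-sort step in between. The following minimal, pure-Python sketch (the sample records and symbols are hypothetical) imitates this flow by computing the average closing price per stock symbol:

from itertools import groupby
from operator import itemgetter

# Hypothetical input records: (symbol, close_price)
records = [("INFY", 1500.0), ("TCS", 3500.0), ("INFY", 1520.0), ("TCS", 3450.0)]

# Map phase: emit <key, value> pairs (a real mapper would parse raw input lines)
mapped = [(symbol, price) for symbol, price in records]

# Shuffle-and-sort phase: bring all values for the same key together
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate the values for each key
for symbol, pairs in groupby(mapped, key=itemgetter(0)):
    prices = [price for _, price in pairs]
    print(f"{symbol}\t{sum(prices) / len(prices):.2f}")

On a real cluster the map and reduce phases run in parallel across the nodes that hold the data blocks, which is exactly the data-locality advantage described above.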
7. PIG
Apache Pig is a platform for analyzing Big Data that consists of a high-level language for expressing data analysis programs, along with infrastructure for evaluating those programs. Pig's architecture includes a compiler that produces sequences of MapReduce programs, for which a large-scale parallel implementation already exists. With Pig, data workers can write complex data transformations without any knowledge of Java. Pig's simple SQL-like scripting language is called Pig Latin, and it is easily understood by developers who are familiar with scripting languages and SQL. Pig is complete, so all required data manipulations can be done in Apache Hadoop with Pig. Through the User Defined Functions (UDFs) available in Pig, it can invoke code in many languages, such as JRuby, Jython, and Java, and Pig scripts can in turn be embedded in other languages. The advantage of Pig is that it can be used as a building block for larger and more complex applications that handle real business problems. Pig works with data from many sources and stores the results in HDFS. Important features of Pig include ease of programming: it is easy to achieve parallel execution of data analysis tasks, and complex tasks consisting of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
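Since, as noted above, Pig scripts can be embedded in other languages, the following sketch drives a small Pig Latin script from Python. It is a minimal illustration only: the file names and field layout are assumptions, and it presumes a local Pig installation on the PATH.

import subprocess

# Hypothetical Pig Latin script: average closing price per symbol
pig_script = """
prices = LOAD 'stock_data.csv' USING PigStorage(',')
         AS (date:chararray, symbol:chararray, open:float, close:float, volume:long);
grouped = GROUP prices BY symbol;
avg_close = FOREACH grouped GENERATE group AS symbol, AVG(prices.close) AS avg_close;
STORE avg_close INTO 'avg_close_out';
"""

with open("avg_close.pig", "w") as f:
    f.write(pig_script)

# Run in local mode for testing; omit "-x local" to run on the cluster
subprocess.run(["pig", "-x", "local", "avg_close.pig"], check=True)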
8. HIVE
The Apache Hive data warehouse infrastructure, built on top of Hadoop, facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. It also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL. Hadoop was built to organize and store massive amounts of data of various shapes, sizes, and formats. Because of its "schema on read" architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data – structured and unstructured – from a multitude of sources. Data analysts use Hive to explore, structure, and analyze that data, then turn it into business insight. Hive looks similar to traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, so Hive queries may have very high latency (many minutes); Hive therefore cannot be used for applications that need fast response times. Finally, Hive is read-oriented and therefore not appropriate for applications that require a high percentage of write operations.
The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases consist of tables, which are made up of partitions. Data can be accessed via a simple query language, and Hive supports overwriting or appending data. Within a particular database, data in the tables is serialized, and each table has a corresponding Hadoop Distributed File System (HDFS) directory.
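To make the "schema on read" idea concrete, the following sketch projects a table structure onto stock files already sitting in HDFS and runs a HiveQL aggregation from Python through the Hive CLI. The table layout and the /stockdata path are assumptions, and the hive binary is presumed to be on the PATH.

import subprocess

# Hypothetical schema projected onto CSV files under /stockdata ("schema on read")
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
    trade_date STRING,
    symbol     STRING,
    open       FLOAT,
    close      FLOAT,
    volume     BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/stockdata';

SELECT symbol, AVG(close) AS avg_close
FROM stocks
GROUP BY symbol;
"""

# hive -e executes the quoted HiveQL and prints the result to stdout
subprocess.run(["hive", "-e", hiveql], check=True)

Because the table is EXTERNAL, dropping it removes only the metadata; the underlying HDFS files are left untouched.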
9. METHODOLOGY
1. Data Collection
Collect stock data from sources like Yahoo Finance, Google Finance, or APIs such as Alpha Vantage, Quandl, or Bloomberg.
Data types include historical prices, intraday trading data, and news sentiment.
2. Data Preprocessing
Transformation: Normalize prices, convert timestamps, and format the data for compatibility with Hadoop (a preprocessing sketch follows this list).
3. Data Storage
Use HDFS (Hadoop Distributed File System) to store large volumes of stock data, in formats such as:
o CSV
o JSON
4. Processing Framework
MapReduce:
o Use the Mapper for parallel processing of stock data (e.g., calculating moving averages).
o Use the Reducer for aggregation tasks (e.g., computing total trading volumes).
Apache Hive:
o Set up tables in Hive for querying stock data with SQL-like syntax.
Apache Pig:
o Write Pig Latin scripts for complex data transformations.
5. Data Analysis
Descriptive Analytics:
Volume Analysis:
Predictive Analytics:
Sentiment Analysis:
o Combine Hadoop with NLP libraries to assess the impact of news on stock trends.
6. Visualization
Export processed data from Hadoop to visualization tools like Tableau or Power BI.
Use libraries such as Matplotlib or D3.js for custom charts and graphs.
7. Workflow Automation
8. Performance Optimization
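A minimal sketch of the preprocessing step above, using pandas on a single machine before the data is pushed into HDFS; the column layout (Date, Symbol, Open, Close, Volume) is an assumption chosen to match the field positions used in the sample code later:

import pandas as pd

# Assumed raw layout: Date, Symbol, Open, Close, Volume
df = pd.read_csv("stock_data.csv")

df["Date"] = pd.to_datetime(df["Date"])   # convert timestamps
df = df.dropna(subset=["Close"])          # drop rows with missing prices
df["Close"] = df["Close"].astype(float)

# One possible normalization: min-max scale the close price per symbol
df["CloseNorm"] = df.groupby("Symbol")["Close"].transform(
    lambda s: (s - s.min()) / (s.max() - s.min()))

# Write a clean, Hadoop-friendly CSV (then: hdfs dfs -put stock_data_clean.csv /stockdata)
df.to_csv("stock_data_clean.csv", index=False)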
10. TECHNICAL INDICATORS
1. Moving Averages
a. Simple Moving Average (SMA):
Description: Calculates the average stock price over a fixed number of periods.
Implementation in Hadoop:
o For each stock, the mapper processes the price data, and the reducer calculates the SMA for each time window.
3. Bollinger Bands:
Description: Composed of a moving average (middle band) and two standard deviations above and below it (upper and lower bands).
Implementation:
o Use Hive or Spark to compute the SMA and standard deviation for the desired window size.
4. MACD (Moving Average Convergence Divergence):
Implementation:
o Subtract the two EMAs (typically the 12-day and 26-day) to find the MACD line, then calculate the signal line as a 9-day EMA of the MACD line.
5. VWAP (Volume-Weighted Average Price):
Implementation:
o Use MapReduce or Spark to sum (Price × Volume) and Volume, and then divide the two.
6. ATR (Average True Range):
Implementation:
o Use Hive queries or Spark functions to calculate the true range for each day and then average it over a time window.
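The indicator formulas above can be prototyped on a single machine before being ported to Hive or Spark. The following pandas sketch computes the SMA, Bollinger Bands, MACD, and VWAP for one stock; the file name, column names, symbol, and window sizes are assumptions:

import pandas as pd

df = pd.read_csv("stock_data_clean.csv", parse_dates=["Date"]).sort_values("Date")
df = df[df["Symbol"] == "INFY"]  # hypothetical symbol; indicators are computed per stock

# Simple Moving Average over a 20-day window
df["SMA20"] = df["Close"].rolling(window=20).mean()

# Bollinger Bands: SMA plus/minus two standard deviations
std20 = df["Close"].rolling(window=20).std()
df["BollUpper"] = df["SMA20"] + 2 * std20
df["BollLower"] = df["SMA20"] - 2 * std20

# MACD: 12-day EMA minus 26-day EMA, with a 9-day EMA signal line
ema12 = df["Close"].ewm(span=12, adjust=False).mean()
ema26 = df["Close"].ewm(span=26, adjust=False).mean()
df["MACD"] = ema12 - ema26
df["Signal"] = df["MACD"].ewm(span=9, adjust=False).mean()

# VWAP: cumulative (Price × Volume) divided by cumulative Volume
df["VWAP"] = (df["Close"] * df["Volume"]).cumsum() / df["Volume"].cumsum()

print(df[["Date", "Close", "SMA20", "MACD", "VWAP"]].tail())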
11. SAMPLE CODE
Steps to Execute in Hadoop:
1. Upload Input Data to HDFS:
hdfs dfs -mkdir -p /stockdata
hdfs dfs -put stock_data.csv /stockdata/
2. Run the Hadoop Streaming Job (the streaming jar path below is the usual default and may differ by installation):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input /stockdata/stock_data.csv \
-output /stockdata/output
3. View Results:
hdfs dfs -cat /stockdata/output/part-*
#!/usr/bin/env python3
# mapper.py – emits <stock_symbol, close_price> for every data row
import sys

for line in sys.stdin:
    # Skip header
    if line.startswith("Date"):
        continue
    try:
        fields = line.strip().split(",")
        stock_symbol = fields[1]
        close_price = float(fields[3])
        print(f"{stock_symbol}\t{close_price}")
    except Exception:
        # Ignore malformed rows
        continue
#!/usr/bin/env python3
# reducer.py – averages the close prices per stock symbol
# (Hadoop Streaming delivers mapper output sorted by key)
import sys

current_symbol = None
sum_price = 0.0
count = 0

for line in sys.stdin:
    try:
        stock_symbol, close_price = line.strip().split("\t")
        close_price = float(close_price)
        if stock_symbol == current_symbol:
            sum_price += close_price
            count += 1
        else:
            if current_symbol:
                avg_price = sum_price / count
                print(f"{current_symbol}\t{avg_price:.2f}")
            current_symbol = stock_symbol
            sum_price = close_price
            count = 1
    except Exception:
        continue

# Emit the average for the last symbol
if current_symbol:
    avg_price = sum_price / count
    print(f"{current_symbol}\t{avg_price:.2f}")
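Hadoop Streaming sorts the mapper output by key during the shuffle phase, so each reducer receives all records for a given stock symbol contiguously; that is why the reducer above can compute every average in a single pass, emitting a result whenever the symbol changes and once more at the end of the input.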
12. OUTPUT
Using such queries, visualizations are generated for various attributes of the stock data. One such graph shows the price change over the last 7 days, the last 30 days, and the last 6 months, together with the moving average of the previous year (Fig. 5).
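A minimal Matplotlib sketch of such a chart, assuming the processed daily close prices have been exported from HDFS to a local CSV (the file name and columns are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

# Assumed export: one row per trading day with Date and Close columns
df = pd.read_csv("stock_prices.csv", parse_dates=["Date"]).sort_values("Date")

# Moving average over the previous year (roughly 252 trading days)
df["MA252"] = df["Close"].rolling(window=252).mean()

plt.plot(df["Date"], df["Close"], label="Close price", alpha=0.5)
plt.plot(df["Date"], df["MA252"], label="1-year moving average")
plt.plot(df["Date"].tail(30), df["Close"].tail(30), label="Last 30 days")

plt.xlabel("Date")
plt.ylabel("Price")
plt.legend()
plt.title("Price change and moving average")
plt.show()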
13. FUTURE SCOPE
Big Data Stock Analysis using Hadoop has immense potential for growth and evolution.
Some key areas of future development include:
1. Advanced Predictive Analytics: Leveraging machine learning and AI with Hadoop for
more accurate stock price predictions and market trend analysis. This can provide
better decision-making tools for investors.
2. Real-Time Stock Analysis: Integration of Hadoop with real-time data streaming tools
like Apache Kafka to perform instantaneous analysis, offering immediate insights into
market movements.
14. CONCLUSION
Big Data Stock Analysis using Hadoop marks a significant advancement in the field of
financial analytics, offering a powerful and scalable solution for processing the immense
volume of data generated by stock markets. Traditional systems often struggle with the sheer
size, velocity, and variety of data in the financial domain, but Hadoop's distributed computing
model overcomes these challenges efficiently.
This approach enables the seamless integration of structured and unstructured data, making it
possible to derive actionable insights from diverse sources such as market feeds, social
media, and financial reports. The project showcases Hadoop’s potential to deliver high-speed
data processing, predictive analytics, and visualization, helping traders, investors, and
analysts make informed decisions.
The success of this methodology emphasizes the transformative role of Big Data in modern
finance, highlighting its potential to redefine stock market analysis. As financial markets
continue to grow in complexity and data size, solutions like Hadoop will be indispensable for
driving innovation, improving decision-making, and ensuring a more data-centric approach to
investment strategies.
This project not only underlines the current capabilities of Big Data tools but also paves the
way for future advancements in financial technologies.
15. REFERENCES
1. Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks and Applications, 19(2), 171–209.
2. Jain, V., & Reddy, K. (2017). Big Data and Predictive Analytics in Stock Market Decision Making. International Journal of Computer Applications, 162(7), 34–38.
3. McKinsey Global Institute (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity.