
A Project Report for BIG DATA Lab (22CS307PC)

On

BIG DATA STOCK ANALYSIS USING HADOOP


Submitted
to
CMR Technical Campus, Hyderabad

In partial fulfillment of the requirements for the Award of the Degree


of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING

by
S.JAYANTH
(227R1A67H4)

Under the esteemed guidance of


Mr G Pavan Kumar
Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
CMR TECHNICAL CAMPUS
An UGC Autonomous Institute
Accredited by NBA & NAAC with A Grade
(Approved by AICTE, Affiliated to JNTU, Hyderabad)
Kandlakoya (V), Medchal (M), Hyderabad-501 401
(2024-2025)


CERTIFICATE
This is to certify that the presentation entitled "BIG DATA STOCK ANALYSIS USING HADOOP" is submitted by S.JAYANTH, bearing Roll Number 227R1A67H4, of B.Tech Computer Science and Engineering, in partial fulfillment of the requirement of the presentation and for the award of the Degree of Bachelor of Technology during the academic year 2024-25.

Subject Faculty
Mrs. B. Sangamithra


TABLE OF CONTENTS


 ABSTRACT
 INTRODUCTION
 PURPOSE
 OBJECTIVES
 APACHE HADOOP
 MAPREDUCE
 PIG
 HIVE
 STOCK DATA ANALYSIS
 TECHNICAL INDICATORS
 SAMPLE CODE
 OUTPUT
 FUTURE SCOPE
 CONCLUSION
 REFERENCES


1.ABSTRACT

In today’s fast-paced financial world, analyzing stock market trends is essential for making
informed investment decisions. The vast and complex nature of stock market data, including
historical prices, real-time transactions, and market sentiment, demands advanced tools and
frameworks for effective processing and analysis. Big Data technologies offer a promising
solution to tackle this challenge.
This paper presents a system for Big Data Stock Analysis using Hadoop, a powerful open-
source framework designed for distributed storage and processing of massive datasets. By
leveraging Hadoop’s core components—HDFS (Hadoop Distributed File System) for data
storage and MapReduce for parallel data processing—we develop an efficient architecture to
manage and analyze extensive stock market datasets.
The system processes historical and real-time stock data to generate insights, such as
predicting trends, detecting anomalies, and identifying profitable investment opportunities.
Additional integration with tools like Apache Hive and Apache Spark facilitates querying,
data visualization, and enhanced analytics. The framework also incorporates sentiment
analysis by processing social media data and news articles, thereby correlating market
sentiments with stock performance.
The proposed solution demonstrates scalability, fault tolerance, and efficiency, making it
suitable for handling the dynamic and high-volume nature of financial data. Our experiments
show that the Hadoop-based approach significantly reduces data processing time and
enhances prediction accuracy compared to traditional methods. This work highlights the
potential of Big Data technologies in transforming stock market analytics and improving
decision-making in the financial domain.
Keywords: Big Data, Hadoop, Stock Analysis, HDFS, MapReduce, Financial Analytics,
Sentiment Analysis


2.INTRODUCTION

Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population [1]. Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze [2]. It is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools [3]. Big Data encompasses everything from clickstream data from the web to genomic and proteomic data from biological research and medicine. Big Data is a heterogeneous mix of structured data (traditional datasets in rows and columns, such as DBMS tables, CSV and XLS files) and unstructured data such as e-mail attachments, manuals, images, PDF documents, medical records (x-rays, ECG and MRI images), forms, contacts, and rich media like graphics, video, and audio.

Businesses are primarily concerned with managing unstructured data, because over 80 percent of enterprise data is unstructured and requires significant storage space and effort to manage. Big data has the following characteristics [3]: Volume is the first important characteristic; it is the size of the data that determines whether it can actually be considered Big Data or not. The name 'Big Data' itself indicates that the data is huge.


3.PURPOSE

1. Efficient Processing of Large-Scale Financial Data: To handle and analyze massive
volumes of stock market data, including historical and real-time data, by leveraging
Hadoop's distributed computing framework.
2. Performance Optimization: To process complex computations involved in stock
price predictions, trends, and behavior analysis with enhanced speed and efficiency
using Hadoop’s parallel processing capabilities.
3. Pattern Detection and Insights Extraction: To uncover patterns, anomalies, and
trends from stock market data that can guide investment decisions, reduce risks, and
enhance returns.
4. Cost-Effective Data Management: To enable scalable and cost-efficient storage and
analysis of financial data using Hadoop's ecosystem, including HDFS (Hadoop
Distributed File System) and MapReduce.
5. Real-Time Analytics: To facilitate near real-time analysis of stock market data for
high-frequency trading, sentiment analysis, or algorithmic trading.
6. Decision Support: To empower financial analysts, investors, and businesses with
actionable insights by leveraging advanced analytics and machine learning models on
the Hadoop framework.
7. Scalability and Flexibility: To ensure the stock analysis system can scale as data
grows and adapt to new sources of data or changing market dynamics.


4.OBJECTIVES

1. Efficient Stock Data Handling: Develop a scalable and distributed system to process
and analyze large volumes of stock market data using Hadoop.
2. Real-Time Data Processing: Implement mechanisms to handle real-time streaming
stock data for timely insights.
3. Comprehensive Data Analysis: Perform historical and trend analysis on stock data
to identify patterns and predict future market movements.
4. Scalability and Performance Optimization: Utilize Hadoop's distributed file system
(HDFS) and MapReduce framework to ensure fast processing of vast datasets while
maintaining high performance.
5. Data Storage and Retrieval: Efficiently store and retrieve structured and
unstructured stock data across a distributed environment.
6. Visualization and Insights: Generate user-friendly reports and visualizations for
stock performance metrics, enabling better decision-making.
7. Integration with Analytical Tools: Integrate Hadoop with data analysis tools (e.g.,
Hive, Pig, Spark) for enhanced querying and machine learning capabilities.
8. Data Security and Reliability: Ensure the security and reliability of sensitive stock
market data throughout the processing pipeline.
9. Automation of Analysis Pipelines: Develop automated workflows for data ingestion,
cleaning, processing, and visualization.
10. Cost-Effective Solution: Leverage Hadoop's open-source framework to provide a
cost-efficient system for handling large-scale stock analysis.


5. APACHE HADOOP

Apache Hadoop is an open-source software framework written in Java for distributed storage
and distributed processing of very large data sets on computer clusters. All the modules in
Hadoop are designed with the fundamental assumption that hardware failures are commonplace
and should therefore be handled automatically in software by the framework [6]. The core of
Apache Hadoop consists of a storage part (the Hadoop Distributed File System, HDFS) and a
processing part (MapReduce). Hadoop splits files into large blocks and distributes them
among the nodes in the cluster. To process the data, Hadoop MapReduce ships packaged code
to the nodes so that each node processes, in parallel, the data it holds. This approach takes
advantage of data locality (nodes manipulating the data they have on hand) to process the data
faster and more efficiently than a conventional supercomputer architecture, which relies on a
parallel file system where computation and data are connected via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:
 Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
 Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
 Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications;
 Hadoop MapReduce – a programming model for large-scale data processing.
The term "Hadoop" has come to refer not just to the base modules above, but also to the collection of additional software packages, such as Apache Pig, Apache Hive, and others.


Fig 1: Hadoop ecosystem installed over EXT-4 file system.

6.MAPREDUCE

MapReduce is a programming model for expressing distributed computation over huge amounts of
data, and an execution framework for large-scale data processing on clusters of commodity
servers. It was originally developed by Google and is built on well-known principles of parallel
and distributed processing. Hadoop implements an open-source MapReduce, written in Java,
which provides reliable, scalable, and fault-tolerant distributed computing. The key-value pair
is the basic data structure in MapReduce. Keys and values may be primitives such as integers,
floating-point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists,
tuples, associative arrays, etc.); programmers can also define their own data types. The map function
takes the input data and generates intermediate key-value pairs. The reduce function
then takes an intermediate key and a set of values and produces a smaller set of values; typically the
reducer produces zero or one output value per key. The MapReduce framework is responsible for
automatically splitting the input, distributing each chunk to mappers on multiple machines,
grouping and sorting all intermediate values associated with the same intermediate key, and passing
these values to reducers on multiple machines. A master node monitors the mappers
and reducers and re-executes them on failure. A MapReduce job can consist of thousands of
individual tasks that must be assigned to nodes in the cluster.
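The map, shuffle, and reduce phases described above can be illustrated with a small in-process Python sketch. This is a toy simulation of the framework's behavior, not actual Hadoop code; it uses the customary word-count example over a few tickers:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: emit a (word, 1) key-value pair for every word seen
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # The framework sorts and groups the intermediate pairs by key,
    # then hands each key and its list of values to the reducer
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

lines = ["buy AAPL", "sell AAPL", "buy GOOG"]
print(dict(reduce_phase(map_phase(lines))))
# {'AAPL': 2, 'GOOG': 1, 'buy': 2, 'sell': 1}
```

In real Hadoop the sort-and-group step happens across the network between mapper and reducer nodes; here it is simulated by `sorted` plus `groupby`.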


Fig 2: Simplified view of MapReduce

7.PIG
Apache Pig is a platform for analyzing Big Data that consists of a high-level language for
expressing data analysis programs, along with infrastructure for evaluating these programs.
Pig's architecture consists of a compiler that produces sequences of MapReduce programs,
for which a large-scale parallel implementation already exists. With Pig, data workers can write
complex data transformations without knowing Java. Pig's simple SQL-like
scripting language is called Pig Latin, and is easily understood by developers who are
familiar with scripting languages and SQL. Pig is complete, so all required data
manipulations can be done in Apache Hadoop with Pig alone. Using the User Defined Functions
(UDFs) available in Pig, it can invoke code in many languages, such as JRuby, Jython, and
Java, and Pig scripts can in turn be embedded in other languages. The advantage of Pig is that it can be
used as a building block for larger and more complex applications that tackle real business
problems. Pig works with data from many sources and stores the results into HDFS.
Important features of Pig include ease of programming: it is easy to achieve parallel execution
of data analysis tasks, and complex tasks consisting of multiple interrelated data transformations
are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.


8.HIVE

The Apache Hive data warehouse infrastructure built on top of Hadoop facilitates querying
and managing large datasets residing in distributed storage. Hive provides a mechanism to
project structure onto this data and query the data using a SQL-like language called HiveQL.

It also allows traditional map/reduce programmers to plug in their custom mappers and
reducers when it is inconvenient or inefficient to express this logic in HiveQL. Hadoop was
built to organize and store massive amounts of data of various shapes, sizes and formats.
Because of its "schema on read" architecture, a Hadoop cluster is a perfect reservoir of
heterogeneous data, both structured and unstructured, from a multitude of sources. Data
analysts use Hive to explore, structure, and analyze that data, then turn it into business insight.
Hive is similar to traditional database code with SQL access. However, since Hive is based
on Hadoop and MapReduce operations, there are several key differences. The first is that
Hadoop is intended for long sequential scans, so queries may have very high latency (many
minutes); therefore Hive cannot be used for applications that need fast response times.
Finally, Hive is read-oriented and therefore not appropriate for applications that require a
high percentage of write operations.

How Hive Works :

The tables in Hive are similar to tables in a relational database, and data units are organized
in a taxonomy from larger to more granular units. Databases consist of tables, which are made
up of partitions. Data can be accessed via a simple query language, and Hive supports
overwriting or appending data. Within a particular database, data in the tables is serialized
and each table has a corresponding Hadoop Distributed File System (HDFS) directory.


9.STOCK DATA ANALYSIS

Steps to Perform Stock Data Analysis with Hadoop :


1. Data Collection

 Collect stock data from sources like Yahoo Finance, Google Finance, or APIs such as
Alpha Vantage, Quandl, or Bloomberg.

 Data types include historical prices, intraday trading data, and news sentiment.

2. Data Preprocessing

 Cleaning: Handle missing values, inconsistent formats, and anomalies.

 Transformation: Normalize prices, convert timestamps, and format the data for
compatibility with Hadoop.

3. Data Storage in Hadoop

 Use HDFS (Hadoop Distributed File System) to store large volumes of stock data.

 Structure data files as:

o CSV

o JSON

o Parquet (for better efficiency).

4. Processing Framework

 MapReduce:

o Use Mapper for parallel processing of stock data (e.g., calculating moving
averages).

o Use Reducer for aggregation tasks (e.g., computing total trading volumes).

 Apache Hive:

o Set up tables in Hive for querying stock data with SQL-like syntax.

 Apache Pig:

o Use Pig scripts for semi-structured or unstructured stock data analysis.


 Apache Spark (optional):

o If real-time processing or faster computation is required, use Spark with Hadoop.

5. Stock Data Analysis Techniques

 Descriptive Analytics:

o Compute averages, variances, and standard deviations of stock prices.

 Time Series Analysis:

o Use rolling averages, Bollinger bands, or ARIMA for trend analysis.

 Volume Analysis:

o Analyze trading volumes to detect unusual activities.

 Predictive Analytics:

o Integrate machine learning frameworks like Apache Mahout or TensorFlow for price predictions.

 Sentiment Analysis:

o Combine Hadoop with NLP libraries to assess the impact of news on stock
trends.

6. Visualization and Reporting

 Export processed data from Hadoop to visualization tools like Tableau or Power BI.

 Use libraries such as Matplotlib or D3.js for custom charts and graphs.

7. Workflow Automation

 Schedule recurring data analysis tasks with Apache Oozie.

 Implement data pipelines using Apache NiFi for real-time updates.

8. Performance Optimization

 Optimize HDFS block size for large files.

 Tune the Hadoop cluster configuration to handle high-frequency data efficiently.
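As a concrete illustration of the preprocessing step above, a minimal cleaning routine might look like the following sketch. The four-column CSV layout (Date, Symbol, Open, Close) is an assumption for illustration, not a format prescribed by the report:

```python
from datetime import datetime

def preprocess(row):
    """Clean one CSV row: drop the header and malformed records,
    normalize the timestamp, and cast the close price to float."""
    fields = row.strip().split(",")
    if len(fields) < 4 or fields[0] == "Date":
        return None                       # header row or truncated record
    try:
        date = datetime.strptime(fields[0], "%Y-%m-%d")
        close = float(fields[3])
    except ValueError:
        return None                       # unparseable date or price
    return (date.date().isoformat(), fields[1], close)

rows = ["Date,Symbol,Open,Close",
        "2024-01-02,AAPL,183.1,185.6",
        "bad,row",
        "2024-01-03,AAPL,184.0,184.2"]
print([r for r in map(preprocess, rows) if r])
# [('2024-01-02', 'AAPL', 185.6), ('2024-01-03', 'AAPL', 184.2)]
```

In the full pipeline this logic would run inside the mapper (or a Pig/Spark job) before the cleaned records are written to HDFS.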


10.TECHNICAL INDICATORS
1. Moving Averages
a. Simple Moving Average (SMA):

 Description: Calculates the average stock price over a fixed number of periods.

 Implementation in Hadoop:

o Use MapReduce or Hive to calculate averages on a sliding window of stock prices.

o For each stock, the mapper processes the price data, and the reducer calculates the SMA for each time window.

b. Exponential Moving Average (EMA):

 Description: Similar to SMA but gives more weight to recent prices.

 Implementation:

o Use a recursive formula in Spark for efficient computation across time-series data.
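Independently of how they are parallelized, both averages reduce to short functions. The sketch below shows the arithmetic a mapper/reducer or Spark job would distribute; the EMA uses the standard smoothing factor k = 2/(window+1) and seeds the series with the first price (a common convention, though some tools seed with an SMA instead):

```python
def sma(prices, window):
    # Simple moving average: plain mean over each sliding window
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

def ema(prices, window):
    # Exponential moving average: seed with the first price, then
    # apply the recurrence ema_t = k * price_t + (1 - k) * ema_(t-1)
    k = 2 / (window + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(price * k + out[-1] * (1 - k))
    return out

prices = [10.0, 11.0, 12.0, 13.0, 14.0]
print(sma(prices, 3))  # [11.0, 12.0, 13.0]
print(ema(prices, 3))  # [10.0, 10.5, 11.25, 12.125, 13.0625]
```

Note that the SMA produces one value per complete window, while the EMA produces a value for every input price.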

2. Relative Strength Index (RSI):


 Description: Measures the magnitude of recent price changes to evaluate overbought
or oversold conditions.

 Implementation:

o Calculate daily gains and losses using Pig or Spark SQL.

o Apply the RSI formula using UDFs (User Defined Functions).
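The RSI arithmetic that such a UDF would perform is compact. The sketch below implements the simple (non-smoothed) variant over the most recent price changes; Wilder's original formulation adds an extra exponential smoothing step:

```python
def rsi(prices, period=14):
    # Simple RSI: total gain vs. total loss over the last `period`
    # price changes; RSI = 100 - 100 / (1 + RS), RS = gains / losses
    changes = [b - a for a, b in zip(prices, prices[1:])]
    recent = changes[-period:]
    gains = sum(c for c in recent if c > 0)
    losses = sum(-c for c in recent if c < 0)
    if losses == 0:
        return 100.0          # no losses at all: maximally overbought
    rs = gains / losses       # relative strength
    return 100 - 100 / (1 + rs)

print(rsi([10, 11, 10, 12, 10], period=4))  # 50.0
```

Values above roughly 70 are conventionally read as overbought and below 30 as oversold.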

3. Bollinger Bands:
 Description: Composed of a moving average (middle band) and two standard
deviations above and below it (upper and lower bands).

 Implementation:

15
16

o Use Hive or Spark to compute SMA and standard deviation for the desired
window size.

o Generate bands as additional fields in the dataset.
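The band computation itself can be sketched as follows. Population standard deviation is assumed here; some charting packages use the sample variant, which shifts the bands slightly:

```python
from statistics import mean, pstdev

def bollinger(prices, window=20, k=2):
    # For each window: middle band = SMA, upper/lower = SMA +/- k * stddev
    bands = []
    for i in range(window - 1, len(prices)):
        win = prices[i - window + 1:i + 1]
        mid = mean(win)
        sd = pstdev(win)
        bands.append((mid - k * sd, mid, mid + k * sd))
    return bands

print(bollinger([10, 14, 18], window=2))
```

Each tuple is (lower band, middle band, upper band), which maps directly onto the "additional fields" mentioned above.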

4. Moving Average Convergence Divergence (MACD):

 Description: Highlights changes in the stock's momentum by comparing two moving averages (short-term and long-term).

 Implementation:

o Compute the 12-day EMA (short) and 26-day EMA (long).

o Subtract the two EMAs to find the MACD line, then calculate the signal line
as a 9-day EMA of the MACD.
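Those steps translate directly into code. The sketch below reuses the standard EMA recurrence; 12, 26, and 9 are the conventional defaults named above:

```python
def ema(prices, window):
    # Standard EMA recurrence with smoothing factor k = 2 / (window + 1)
    k = 2 / (window + 1)
    out = [prices[0]]
    for price in prices[1:]:
        out.append(price * k + out[-1] * (1 - k))
    return out

def macd(prices, short=12, long=26, signal=9):
    # MACD line = short EMA - long EMA; signal line = EMA of the MACD line
    macd_line = [s - l for s, l in zip(ema(prices, short), ema(prices, long))]
    return macd_line, ema(macd_line, signal)
```

A quick sanity check: a flat price series yields an all-zero MACD line and signal line, since both EMAs coincide.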

5. Volume Weighted Average Price (VWAP):

 Description: Measures the average price weighted by trading volume.

 Implementation:

o Use MapReduce or Spark to sum (Price × Volume) and Volume, and then
divide the two.
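The formula is a single weighted average, which a reducer would compute per symbol per day:

```python
def vwap(trades):
    # trades: iterable of (price, volume) pairs
    # VWAP = sum(price * volume) / sum(volume)
    total_pv = sum(price * volume for price, volume in trades)
    total_volume = sum(volume for _, volume in trades)
    return total_pv / total_volume

print(vwap([(10.0, 100), (11.0, 300)]))  # 10.75
```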

6. Average True Range (ATR):

 Description: Measures market volatility by considering recent high, low, and closing prices.

 Implementation:

o Use Hive queries or Spark functions to calculate the true range for each day
and then average it over a time window.
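The true-range calculation such a query would express might be sketched as follows, with each bar assumed to be a (high, low, close) tuple. This is the simple-average variant; Wilder's ATR applies exponential smoothing instead:

```python
def atr(bars, window):
    # bars: list of (high, low, close); each day's true range is the
    # largest of (high - low), |high - prev_close|, |low - prev_close|
    true_ranges = []
    for i in range(1, len(bars)):
        high, low, _ = bars[i]
        prev_close = bars[i - 1][2]
        tr = max(high - low, abs(high - prev_close), abs(low - prev_close))
        true_ranges.append(tr)
    # ATR: mean true range over the most recent `window` days
    return sum(true_ranges[-window:]) / window

print(atr([(10, 8, 9), (12, 9, 11), (13, 10, 12)], window=2))  # 3.0
```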


11.SAMPLE CODE
Steps to Execute in Hadoop :
1.Upload Input Data to HDFS:

hdfs dfs -mkdir /stockdata

hdfs dfs -put stock_data.csv /stockdata/

2.Run the MapReduce Job:

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-*.jar \

-files mapper.py,reducer.py \

-mapper mapper.py \

-reducer reducer.py \

-input /stockdata/stock_data.csv \

-output /stockdata/output

3.View Results:

hdfs dfs -cat /stockdata/output/part-00000

Mapper Code (mapper.py) :

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    # Skip the CSV header row
    if line.startswith("Date"):
        continue
    try:
        fields = line.strip().split(",")
        stock_symbol = fields[1]
        close_price = float(fields[3])
        print(f"{stock_symbol}\t{close_price}")
    except (IndexError, ValueError):
        # Silently drop malformed rows
        continue

Reducer Code (reducer.py) :

#!/usr/bin/env python3
import sys

current_symbol = None
sum_price = 0.0
count = 0

# Hadoop delivers mapper output sorted by key, so all lines for one
# symbol arrive consecutively and can be averaged in a single pass.
for line in sys.stdin:
    try:
        stock_symbol, close_price = line.strip().split("\t")
        close_price = float(close_price)
    except ValueError:
        continue
    if stock_symbol == current_symbol:
        sum_price += close_price
        count += 1
    else:
        if current_symbol:
            # Emit result for the previous stock
            avg_price = sum_price / count
            print(f"{current_symbol}\t{avg_price:.2f}")
        current_symbol = stock_symbol
        sum_price = close_price
        count = 1

# Emit result for the last stock
if current_symbol:
    avg_price = sum_price / count
    print(f"{current_symbol}\t{avg_price:.2f}")
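Before submitting the job to the cluster, the mapper/reducer logic can be sanity-checked locally. The harness below simulates the whole streaming pipeline (map, sort, reduce) in a single process; the four-column CSV layout is the same assumption the mapper makes:

```python
def run_pipeline(csv_lines):
    # Map: emit (symbol, close) for each data row, skipping the header
    pairs = []
    for line in csv_lines:
        if line.startswith("Date"):
            continue
        fields = line.strip().split(",")
        try:
            pairs.append((fields[1], float(fields[3])))
        except (IndexError, ValueError):
            continue
    # Shuffle: Hadoop delivers mapper output to reducers sorted by key
    pairs.sort()
    # Reduce: average close price per symbol
    totals = {}
    for symbol, close in pairs:
        total, count = totals.get(symbol, (0.0, 0))
        totals[symbol] = (total + close, count + 1)
    return {symbol: total / count for symbol, (total, count) in totals.items()}

lines = ["Date,Symbol,Open,Close",
         "2024-01-02,AAPL,182.1,100.0",
         "2024-01-02,GOOG,140.0,200.0",
         "2024-01-03,AAPL,184.0,110.0"]
print(run_pipeline(lines))  # {'AAPL': 105.0, 'GOOG': 200.0}
```

The same check can also be run against the actual scripts from a shell by piping a sample file through mapper.py, sort, and reducer.py in sequence.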

12.OUTPUT

Fig 3: Status of Pig query getting executed


Fig 4: Snapshot of file system over the HDFS.

Using such queries, visualizations are produced for various attributes of the stock data; one such graph, showing the price change over the last 7 days, last 30 days, and last 6 months, along with the moving average of the previous year, is shown in Fig 5.

Fig 5: Graph for some calculated attributes.


13.FUTURE SCOPE

Big Data Stock Analysis using Hadoop has immense potential for growth and evolution.
Some key areas of future development include:

1. Advanced Predictive Analytics: Leveraging machine learning and AI with Hadoop for
more accurate stock price predictions and market trend analysis. This can provide
better decision-making tools for investors.

2. Real-Time Stock Analysis: Integration of Hadoop with real-time data streaming tools
like Apache Kafka to perform instantaneous analysis, offering immediate insights into
market movements.

3. Integration with Blockchain: Combining Hadoop's analytical power with blockchain's
secure, decentralized data storage could enhance the reliability and transparency of
stock transactions.

4. Personalized Investment Advice: Customizing investment recommendations based on
user behavior and portfolio patterns analyzed through Hadoop.

5. Scalability: As datasets grow larger, Hadoop's distributed architecture will enable
handling even more extensive and complex data efficiently, making it ideal for
evolving stock markets.


14.CONCLUSION

Big Data Stock Analysis using Hadoop marks a significant advancement in the field of
financial analytics, offering a powerful and scalable solution for processing the immense
volume of data generated by stock markets. Traditional systems often struggle with the sheer
size, velocity, and variety of data in the financial domain, but Hadoop's distributed computing
model overcomes these challenges efficiently.

This approach enables the seamless integration of structured and unstructured data, making it
possible to derive actionable insights from diverse sources such as market feeds, social
media, and financial reports. The project showcases Hadoop’s potential to deliver high-speed
data processing, predictive analytics, and visualization, helping traders, investors, and
analysts make informed decisions.

Moreover, by employing Hadoop’s ecosystem, businesses can reduce costs, enhance
operational efficiency, and gain a competitive edge in stock market analysis. The ability to
handle data across different nodes ensures reliability, even in the face of hardware failures,
making it a dependable choice for critical financial systems.

The success of this methodology emphasizes the transformative role of Big Data in modern
finance, highlighting its potential to redefine stock market analysis. As financial markets
continue to grow in complexity and data size, solutions like Hadoop will be indispensable for
driving innovation, improving decision-making, and ensuring a more data-centric approach to
investment strategies.

This project not only underlines the current capabilities of Big Data tools but also paves the
way for future advancements in financial technologies.


15.REFERENCES

1. Books:

o White, T. (2015). Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale. O'Reilly Media.

o Miner, D., & Shook, A. (2012). MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media.

2. Research Papers and Journals:

o Chen, M., Mao, S., & Liu, Y. (2014). Big Data: A Survey. Mobile Networks
and Applications, 19(2), 171–209.

o Jain, V., & Reddy, K. (2017). Big Data and Predictive Analytics in Stock
Market Decision Making. International Journal of Computer Applications,
162(7), 34–38.

3. Websites and Articles:

o Apache Hadoop Official Documentation: https://hadoop.apache.org/

o Investopedia: Big Data in Stock Market Analysis: https://www.investopedia.com/

o Hortonworks Blog on Hadoop and Financial Services: https://hortonworks.com/blog/

4. Case Studies and Reports:

o Gartner Report on Big Data Trends in Finance

o McKinsey Global Institute: Big Data: The Next Frontier for Innovation,
Competition, and Productivity


5. Online Tutorials and Videos:

o Hadoop tutorials on platforms like Coursera, edX, and Udemy.

o YouTube Channels: Data Engineering and Big Data-focused channels often include practical examples and projects.

6. Datasets and Tools:

o Yahoo Finance API for stock data.

o Kaggle Datasets: Stock Market Data and Big Data Analysis.

o Apache Hive, HBase, and Spark for Hadoop-related analytics.

