Yasir f29 Ass1 Bigdata
ID F20603029
ASSIGNMENT # 1
SUBJECT BIG DATA ANALYTICS
Question#1 [Marks=10]
You are required to list down the prominent tools used in the field of Big Data Analytics
by exploring the books, internet sources and published research works. Furthermore,
provide the prominent features of each tool.
ANSWER:
1) Apache Hadoop:
Distributed storage and processing: Hadoop splits large datasets into smaller chunks and
distributes them across multiple computers in a cluster. This allows for faster processing since
each computer works on a smaller part of the data.
Fault tolerance: Hadoop is designed to survive computer failures. It keeps multiple copies (replicas) of each data block on different computers in the cluster, so if one computer stops working, processing continues using another replica, and Hadoop re-replicates the lost blocks so the data stays protected.
Scalability: As your data grows, you can easily add more computers to your Hadoop cluster to
handle the extra load. This makes it easy to scale your data infrastructure as your business
grows.
MapReduce: MapReduce is a way of splitting up a big data processing task into smaller parts,
then combining the results. Hadoop uses this programming model to process large datasets in
parallel, making it faster and more efficient.
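The MapReduce model described above can be sketched in plain Python. This is a single-process illustration of the idea, not Hadoop's actual Java API; in a real cluster, the map and reduce tasks run on different machines and the shuffle moves data over the network:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document chunk.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine the values for each key into a final result.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data analytics", "big data tools"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"big": 2, "data": 2, "analytics": 1, "tools": 1}
```

The same three-stage structure (map, shuffle, reduce) is what Hadoop parallelizes across the cluster.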
2) Apache Spark:
In-memory processing: Spark keeps data in memory as much as possible, which makes it faster than disk-based engines such as Hadoop MapReduce, which write intermediate results to disk between stages. This speed boost can make a big difference when working with large datasets, especially iterative workloads.
Unified analytics engine: Spark is a versatile tool that can handle a wide range of analytics
tasks, including SQL queries, machine learning, graph processing, and real-time streaming
analytics. This means you can use Spark for many different types of data analysis without
needing separate tools.
Fault tolerance: Just like Hadoop, Spark is designed to keep working even if some computers in the cluster fail. It does this by recording the lineage of each dataset (the chain of transformations that produced it), so any lost partition can be recomputed from the source data rather than being lost.
Compatibility: Spark can run on different types of cluster management systems, including
Hadoop YARN, Apache Mesos, or Kubernetes. This flexibility makes it easy to integrate Spark
into your existing data infrastructure.
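Spark's execution style, lazy transformations plus optional in-memory caching, can be imitated with a small toy class. This is an illustration of the concept only, not the PySpark API (real Spark also distributes the data and caches on the first action rather than eagerly):

```python
class LazyDataset:
    """Toy stand-in for a Spark RDD/DataFrame: transformations are lazy,
    and cache() keeps the computed result in memory for reuse."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the data
        self._cached = None

    def map(self, fn):
        # Lazy: nothing is computed until collect() is called.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        # Materialize immediately and keep in memory
        # (real Spark caches on the first action instead).
        self._cached = self._compute()
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

nums = LazyDataset(lambda: list(range(10)))
evens = nums.filter(lambda x: x % 2 == 0).cache()  # computed once, held in memory
squares = evens.map(lambda x: x * x).collect()
# squares == [0, 4, 16, 36, 64]
```

Because `evens` is cached, any further transformations built on it reuse the in-memory result instead of re-reading and re-filtering the source, which is the core of Spark's speed advantage.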
3) Apache Kafka:
Distributed messaging system: Kafka is a tool for storing and streaming data between
different systems or applications. It's designed to handle large volumes of data with high
throughput and low latency.
Horizontal scalability: Kafka can handle more data by adding more computers (called brokers)
to the cluster. This makes it easy to scale Kafka as your data needs grow.
Durability: Kafka persists data to disk and replicates each partition across multiple brokers, so messages survive individual broker failures and even a full cluster restart. This makes it a reliable option for storing important data.
Stream processing: Kafka includes a feature called Kafka Streams that lets you process data
in real-time as it flows through the system. This can be useful for tasks like monitoring, fraud
detection, or real-time analytics.
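Kafka's core abstraction is an append-only log that consumers read by offset. The toy class below illustrates that idea in plain Python (it is not the Kafka client API, and omits partitions, brokers, and replication):

```python
class Topic:
    """Toy append-only log: producers append records, and each consumer
    tracks its own offset, so many readers can consume the same data
    independently and replay it from any point."""

    def __init__(self):
        self.log = []   # records persist in arrival order

    def produce(self, record):
        self.log.append(record)
        return len(self.log) - 1   # offset assigned to the new record

    def consume(self, offset, max_records=10):
        # Return a batch starting at `offset`, plus the next offset to read.
        batch = self.log[offset:offset + max_records]
        return batch, offset + len(batch)

topic = Topic()
for event in ["login", "click", "purchase"]:
    topic.produce(event)

batch, next_offset = topic.consume(offset=0, max_records=2)
# batch == ["login", "click"], next_offset == 2
```

Because the log is never mutated in place, a slow consumer and a fast consumer can read the same topic at different offsets without interfering with each other.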
4) Apache Flink:
Stream processing: Flink is designed to process streaming data in real-time, making it a good
choice for applications that need to react quickly to changing data.
Event time processing: Flink can handle data that arrives out of order or late, which is
common in streaming applications. It does this by keeping track of when each event occurred,
rather than just processing them in the order they arrive.
Stateful computations: Flink lets you keep track of state across multiple events, which can be
useful for tasks like aggregating data or detecting patterns over time.
Fault tolerance: Flink automatically takes checkpoints of its state, so it can recover quickly if
something goes wrong. This makes it a reliable option for mission-critical applications.
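Flink's event-time idea, assigning each record to a window based on when it happened rather than when it arrived, can be sketched as follows. This is a conceptual illustration, not the Flink API; real Flink uses watermarks generated by the source and supports many window types:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size, allowed_lateness):
    """Toy event-time windowing: events are (timestamp, value) pairs that
    may arrive out of order; each event lands in the window containing its
    own timestamp, and events later than the allowed lateness are dropped."""
    windows = defaultdict(int)
    max_ts = 0   # highest timestamp seen so far (a crude watermark)
    for ts, _value in events:
        max_ts = max(max_ts, ts)
        if ts < max_ts - allowed_lateness:
            continue   # too late: the window has already closed
        window_start = (ts // window_size) * window_size
        windows[window_start] += 1
    return dict(windows)

# Out-of-order stream: t=12 arrives after t=17, and t=1 arrives last.
events = [(3, "a"), (7, "b"), (17, "c"), (12, "d"), (1, "e")]
counts = tumbling_window_counts(events, window_size=10, allowed_lateness=5)
# t=12 is accepted despite arriving late; t=1 is beyond the lateness
# bound (17 - 5 = 12) and is dropped, so counts == {0: 2, 10: 2}
```

The key point the sketch shows: results are grouped by when events occurred, so a late-but-tolerable event still lands in the correct window.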
5) Apache Cassandra:
Linear scalability: Cassandra can handle more data by adding more computers to the cluster.
This makes it easy to scale as your data grows.
Tunable consistency: Cassandra lets you choose how consistent you want your data to be.
You can prioritize consistency (making sure all copies of the data are the same) or availability
(making sure you can always access the data), depending on your needs.
Built-in replication: Cassandra automatically copies data to multiple computers in the cluster,
so even if one computer fails, you can still access your data. This makes it a reliable option for
storing important data.
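Cassandra's placement and consistency rules can be illustrated with a toy partitioner: hash the key to pick a primary node, place replicas on the following nodes, and require a quorum of replicas for a consistent read or write. This is a simplified sketch of the idea, not Cassandra's actual token-ring implementation:

```python
import hashlib

def replica_nodes(key, nodes, replication_factor=3):
    """Toy partitioner: hash the row key to choose a primary node, then
    place the remaining replicas on the next nodes around the ring."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = h % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]

def quorum(replication_factor):
    # Tunable consistency: QUORUM means a majority of replicas must
    # acknowledge a read or write (e.g. 2 of 3).
    return replication_factor // 2 + 1

nodes = ["node-a", "node-b", "node-c", "node-d"]
replicas = replica_nodes("user:42", nodes, replication_factor=3)
# 3 distinct nodes hold copies; quorum(3) == 2 of them must respond
```

Choosing weaker levels (one replica) favors availability and latency, while QUORUM or ALL favors consistency, which is exactly the trade-off "tunable consistency" refers to.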
6) MongoDB:
Scalability: MongoDB can handle more data by spreading it across multiple computers, a
process called sharding. This makes it easy to scale as your data grows.
Indexing: MongoDB supports various types of indexes, which can speed up queries by telling
MongoDB where to look for data.
Replication and high availability: MongoDB automatically copies data to multiple computers
in the cluster, so even if one computer fails, you can still access your data. It also includes
features for automatic failover, so if one computer goes down, another one can take its place
without interrupting service.
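The indexing feature above can be illustrated with a toy single-field index: instead of scanning every document, a lookup table maps each field value straight to the matching documents. This is a conceptual sketch, not MongoDB's B-tree implementation or the PyMongo API:

```python
def build_index(docs, field):
    """Toy single-field index: maps each value of `field` to the positions
    of the documents containing it, so queries avoid a full collection scan."""
    index = {}
    for i, doc in enumerate(docs):
        index.setdefault(doc.get(field), []).append(i)
    return index

def find(docs, index, value):
    # Equality query served entirely from the index.
    return [docs[i] for i in index.get(value, [])]

docs = [
    {"name": "ada", "city": "lahore"},
    {"name": "bo", "city": "karachi"},
    {"name": "cy", "city": "lahore"},
]
city_index = build_index(docs, "city")
lahore_docs = find(docs, city_index, "lahore")
# returns the two documents whose city is "lahore"
```

The same principle explains why indexes speed up reads but add a small cost to writes: every insert must also update the lookup table.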
7) Elasticsearch:
Distributed search and analytics engine: Elasticsearch is built on Apache Lucene and offers
a distributed search and analytics engine for handling large volumes of structured and
unstructured data across clusters.
Real-time indexing and search: It enables real-time indexing and search, allowing users to
ingest data and search for it immediately after ingestion.
Scalability and fault tolerance: Elasticsearch's distributed architecture ensures scalability by
allowing users to add more nodes to the cluster. It also provides fault tolerance through data
replication and automatic failover mechanisms.
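The data structure underlying Elasticsearch (and Lucene) is the inverted index: each term maps to the set of documents containing it, so search becomes a dictionary lookup plus a set intersection. A minimal sketch of that idea, not the Elasticsearch query API:

```python
from collections import defaultdict

def index_documents(docs):
    """Toy inverted index: each term maps to the IDs of the documents
    that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            inverted[term].add(doc_id)
    return inverted

def search(inverted, query):
    # AND semantics: return documents containing every query term.
    term_sets = [inverted.get(t, set()) for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

docs = {
    1: "distributed search engine",
    2: "real-time search and analytics",
    3: "distributed analytics cluster",
}
hits = search(index_documents(docs), "distributed analytics")
# only document 3 contains both terms
```

Because the index is built at ingestion time, queries never scan raw text, which is what makes search fast even over very large document sets.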
8) Splunk:
Platform for machine-generated big data: Splunk is a comprehensive platform designed for
searching, monitoring, and analyzing large volumes of machine-generated data.
Real-time data ingestion: It offers real-time data ingestion capabilities with high throughput
and low latency, enabling organizations to quickly access and analyze data as it's generated.
Indexing and search: Splunk indexes and searches various types of machine data, including
logs, events, and other types of machine-generated data, facilitating rapid data retrieval and
analysis.
Monitoring and alerting: Splunk provides robust monitoring and alerting capabilities, allowing
users to set up alerts based on predefined conditions and receive proactive insights into system
performance and potential issues.
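The parse-index-alert workflow that Splunk applies to machine data can be sketched in a few lines of plain Python. The log format and threshold here are made up for illustration; this is the general pattern, not Splunk's SPL language or API:

```python
import re

def parse_log_line(line):
    """Toy log parser: split a line of the assumed form
    '<timestamp> <LEVEL> <message>' into fields."""
    m = re.match(r"(\S+) (\w+) (.+)", line)
    return {"ts": m.group(1), "level": m.group(2), "msg": m.group(3)} if m else None

def check_alert(events, level="ERROR", threshold=3):
    # Fire an alert when the number of matching events reaches a threshold,
    # mimicking a predefined alerting condition.
    count = sum(1 for e in events if e and e["level"] == level)
    return count >= threshold

lines = [
    "10:00:01 INFO service started",
    "10:00:05 ERROR db timeout",
    "10:00:06 ERROR db timeout",
    "10:00:09 ERROR db timeout",
]
events = [parse_log_line(l) for l in lines]
alert = check_alert(events, level="ERROR", threshold=3)
# three ERROR events meet the threshold, so the alert fires
```

Splunk does the same thing at scale: extract fields from raw machine data at ingestion, index them, and evaluate alerting conditions continuously over the indexed events.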
Question#2 [Marks=10]
Provide the application areas of Big Data Analytics in various Engineering fields to solve
the complex engineering problems. Furthermore, discuss the importance of Big Data
Analytics in Engineering domain with the help of different case studies.
ANSWER:
Big Data Analytics has been adopted across diverse engineering sectors, reshaping how complex engineering problems are approached. Below are its main application areas in different engineering domains, along with case studies that illustrate its importance:
1. Manufacturing Engineering:
Predictive Maintenance: Leveraging sensor data analysis, Big Data Analytics
empowers predictive maintenance, as demonstrated by General Electric (GE)
who successfully implemented IoT sensors and analytics to anticipate and
prevent machinery failures. This initiative resulted in substantial reductions in
downtime and maintenance expenses.
Quality Control: Real-time production data analysis proves instrumental in defect
identification and process optimization, as evidenced by Ford's utilization of big
data analytics to enhance vehicle quality. By scrutinizing warranty claims, Ford
achieved diminished defects and heightened customer satisfaction.
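The predictive-maintenance idea described above, analyzing sensor data to catch failures before they happen, can be reduced to a simple statistical check: flag readings that deviate sharply from recent behavior. The sensor values and thresholds below are invented for illustration; real systems like GE's use far richer models:

```python
from statistics import mean, stdev

def anomalous_readings(readings, window=5, k=3.0):
    """Toy predictive-maintenance check: flag any sensor reading that
    deviates more than k standard deviations from the rolling mean of
    the previous `window` readings."""
    flagged = []
    for i in range(window, len(readings)):
        recent = readings[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        if sigma > 0 and abs(readings[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

# Stable vibration levels with one sudden spike at index 8.
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1, 5.0, 1.0]
spikes = anomalous_readings(vibration, window=5, k=3.0)
# only the spike at index 8 is flagged
```

Catching such deviations early is what lets maintenance be scheduled before a bearing or motor actually fails, which is the source of the downtime savings cited above.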
2. Civil Engineering:
Infrastructure Monitoring: Big Data Analytics plays a pivotal role in monitoring
infrastructure health, exemplified by the Hong Kong government's real-time bridge
monitoring system, which continuously analyzes data from sensors, cameras, and
other sources to ensure structural safety.
Traffic Management: Effective traffic flow optimization and congestion reduction
are achieved through the analysis of traffic data from sensors and GPS devices,
as illustrated by the city of Los Angeles. By employing big data analytics to refine
traffic signal timing, Los Angeles achieved noteworthy reductions in travel times
and fuel consumption.
3. Environmental Engineering:
Climate Modeling: Climate scientists leverage Big Data Analytics to model
climate change and its repercussions, as demonstrated by the European Centre
for Medium-Range Weather Forecasts (ECMWF). By harnessing big data
analytics for weather prediction and climate research, ECMWF enhances climate
modeling accuracy and insights.
Natural Disaster Management: Historical data analysis forms the backbone of
predicting and mitigating the impact of natural disasters, a practice exemplified
by NASA's Earth Science Division. By utilizing big data analytics to monitor
natural disasters and issue early warnings, NASA effectively mitigates disaster-
related risks.
4. Electrical Engineering:
Smart Grids: Optimizing energy distribution in smart grids is achieved through
consumption pattern analysis and grid performance data scrutiny, as showcased
by Pacific Gas and Electric (PG&E). By leveraging big data analytics, PG&E
enhances grid reliability and efficiency, fostering sustainable energy distribution.
Power Systems Optimization: Data analysis from power plants and transmission
grids fuels energy generation and distribution optimization, exemplified by
Siemens. Through the implementation of a big data analytics platform for energy
management, Siemens achieved cost reductions and improved grid stability.
5. Mechanical Engineering:
Product Design and Optimization: Engineers harness Big Data Analytics to
optimize product designs through simulation data and performance metrics
analysis, epitomized by Boeing's aircraft design optimization efforts. By
leveraging big data analytics, Boeing achieves the development of fuel-efficient
and environmentally friendly aircraft.
Supply Chain Management: Supply chain data analysis facilitates inventory
management, production scheduling, and logistics optimization, as demonstrated
by General Motors (GM). By employing big data analytics, GM optimizes its
supply chain, resulting in cost reductions and efficiency improvements.