Big data
Big data is a combination of structured, semistructured and unstructured data collected by
organizations that can be mined for information and used in machine
learning projects, predictive modeling and other advanced analytics applications.
Apache Hadoop
• Apache Hadoop is an open source framework that is used to efficiently store and process large
datasets.
• Instead of using one large computer to store and process the data, Hadoop allows clustering
multiple computers to analyze massive datasets in parallel more quickly.
• Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis, and data
mining.
Following are the components that collectively form a Hadoop ecosystem:
HDFS:
• HDFS is the primary or major component of Hadoop ecosystem and is responsible
for storing large data sets of structured or unstructured data across various nodes
and thereby maintaining the metadata in the form of log files.
• HDFS consists of two core components i.e.
1. Name node
2. Data Node
• Name Node is the prime node which contains metadata (data about data),
requiring comparatively fewer resources than the data nodes that store the actual
data. These data nodes are commodity hardware in the distributed environment,
which undoubtedly makes Hadoop cost effective.
• HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
YARN:
• Yet Another Resource Negotiator, as the name implies, YARN is the one that helps
to manage the resources across the clusters. In short, it performs scheduling and
resource allocation for the Hadoop system.
• It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
• The Resource Manager has the privilege of allocating resources for the applications in the
system, whereas Node Managers work on the allocation of resources such as CPU,
memory, and bandwidth per machine and later acknowledge the Resource
Manager. The Application Manager works as an interface between the Resource
Manager and Node Managers and performs negotiations as per the requirements of
the two.
MapReduce:
• MapReduce is a paradigm which has two phases: the mapper phase and the
reducer phase. In the mapper, the input is given in the form of key-value pairs. The
output of the mapper is fed to the reducer as input. The reducer runs only after the
mapper is over. The reducer also takes input in key-value format, and the output of the
reducer is the final output.
• MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thereby organizes
it into groups. Map() generates a key-value-pair-based result
which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating
the mapped data. In simple terms, Reduce() takes the output generated by
Map() as input and combines those tuples into a smaller set of tuples.
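A minimal sketch of this Map()/Reduce() idea using plain Scala collections (not Hadoop's Java MapReduce API); the sample input lines are an assumption for illustration:

// "Map" phase: emit a (word, 1) key-value pair for every word.
// Grouping by key plays the role of the shuffle, and the "Reduce" phase sums the values per key.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data is big", "hadoop stores big data")               // assumed sample input

    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.split(" ")).map(word => (word, 1))                       // Map(): (key, value) pairs

    val reduced: Map[String, Int] =
      mapped.groupBy(_._1).map { case (w, pairs) => (w, pairs.map(_._2).sum) } // Reduce(): aggregate per key

    reduced.foreach { case (word, count) => println(s"$word -> $count") }
  }
}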
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are
128 MB by default and this is configurable. Files in HDFS are broken into block-sized
chunks, which are stored as independent units. Unlike a traditional file system, if a file in HDFS
is smaller than the block size, it does not occupy the full block's size, i.e. a 5 MB file stored
in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large just to
minimize the cost of seeks (see the short sketch after this list).
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master.
The Name Node is the controller and manager of HDFS, as it knows the status and the metadata
of all the files in HDFS; the metadata information being file permissions, names and the
location of each block. The metadata is small, so it is stored in the memory of the name
node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple
clients concurrently, so all this information is handled by a single machine. The file system
operations like opening, closing, renaming etc. are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to by the client or the
name node. They report back to the name node periodically with the list of blocks that they
are storing. The data nodes, being commodity hardware, also do the work of block creation,
deletion and replication as instructed by the name node.
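Following up on the block arithmetic in the Blocks item above, a tiny Scala sketch (the file size and block size are assumed values):

// A 300 MB file with a 128 MB block size is split into ceil(300 / 128) = 3 blocks:
// two full 128 MB blocks plus one 44 MB block, which occupies only 44 MB on disk.
val blockSizeMB = 128
val fileSizeMB  = 300
val numBlocks   = math.ceil(fileSizeMB.toDouble / blockSizeMB).toInt
println(s"$fileSizeMB MB file -> $numBlocks blocks of up to $blockSizeMB MB each")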
[Figure: HDFS DataNode and NameNode architecture]
HDFS Read
• To read a file from HDFS, the client needs to interact with the Namenode.
• The Namenode provides the addresses of the Datanodes (slaves) where the file is stored.
• The client then interacts with the respective Datanodes to read the file.
• The Namenode also provides a token to the client, which it shows to the Datanodes for
authentication.
• The Namenode checks the client's access rights and then grants the client access to the
Datanodes.
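A minimal sketch of this read path using the Hadoop FileSystem client API in Scala (the NameNode address and file path are assumptions; the client library performs the Namenode/Datanode interaction described above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://localhost:9000")      // assumed NameNode address

// FileSystem.get connects to the NameNode; fs.open asks it for the block locations,
// and the returned stream then reads the blocks directly from the DataNodes.
val fs = FileSystem.get(conf)
val in = fs.open(new Path("/geeks/AI.txt"))            // file used in the command examples below
Source.fromInputStream(in).getLines().foreach(println)
in.close()
fs.close()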
Secondary Name Node: It is a separate physical machine which acts as a helper of the name node.
It performs periodic checkpoints. It communicates with the name node and takes snapshots of the
metadata, which helps minimize downtime and data loss.
HDFS commands:
1. First you need to start the Hadoop services using the following command:
sbin/start-all.sh
2. To check the Hadoop services are up and running use the following command:
jps
3. ls: This command is used to list all the files. Use lsr for recursive approach. It is useful
when we want a hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains executables,
so bin/hdfs means we want the hdfs executable, particularly the dfs (Distributed File System)
commands.
4. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s
first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder is
created relative to the home directory.
5. touchz: It creates an empty file in HDFS.
Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
6. copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This is
the most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks
7. cat: To print the contents of a file.
Syntax:
bin/hdfs dfs -cat <path(on hdfs)>
Example:
// print the content of AI.txt present
// inside the geeks folder.
bin/hdfs dfs -cat /geeks/AI.txt
8. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero
myfile.txt from geeks folder will be copied to folder hero present on Desktop.
Note: Observe that we don’t write bin/hdfs while checking the things present on
local filesystem.
9. moveFromLocal: This command moves a file/folder from the local file system to hdfs
(the local copy is removed, i.e. cut and paste).
Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks
10. cp: This command is used to copy files within hdfs. Let’s copy
folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
11. mv: This command is used to move files within hdfs. Let’s cut-paste a
file myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
12. rmr: This command deletes a file from HDFS recursively. It is a very useful command when
you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content
inside the directory then the directory itself.
13. du: It shows the disk usage (size) of each file/directory in the given path.
Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks
14. dus: This command will give the total size of directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /geeks
15. stat: It will give the last modified time of a directory or path. In short, it gives the stats of
the directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geeks
16. setrep: This command is used to change the replication factor of a file/directory in HDFS.
By default it is 3 for anything which is stored in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS.
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for the directory geeks stored
in HDFS.
bin/hdfs dfs -setrep -R 4 /geeks
Note: The -w means wait till the replication is completed. And -R means
recursively, we use it for directories as they may also contain many files and
folders inside them.
Note: There are more commands in HDFS but we discussed the commands which are
commonly used when working with Hadoop. You can check out the list of dfs commands
using the following command:
bin/hdfs dfs
What is YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java MapReduce and lets other
applications such as HBase and Spark work on it. Different YARN applications
can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time,
bringing great benefits for manageability and cluster utilization.
Components Of YARN
o Client: For submitting MapReduce jobs.
o Resource Manager: To manage the use of resources across the cluster.
o Node Manager: For launching and monitoring the compute containers on machines in
the cluster.
o Map Reduce Application Master: Coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by the
resource manager and managed by the node managers.
JobTracker and TaskTracker were used in previous versions of Hadoop and were responsible
for resource handling and progress tracking. However, Hadoop 2.0 has the Resource Manager
and Node Manager to overcome the shortcomings of JobTracker and TaskTracker.
[Figure: MapReduce architecture]
SPARK
Spark and Core Spark: From Fundamentals to Advanced Concepts
Spark is a powerful distributed computing platform for big data processing. It excels at handling
large datasets efficiently by parallelizing tasks across clusters of computers.
Spark is a market leader for big data processing. It is widely used across organizations in many
ways. It can run up to 100 times faster than Hadoop MapReduce in memory and about 10 times
faster on disk.
Spark is mainly used for data processing; for storage it relies on external systems such as HDFS or S3.
Spark is often used as a replacement for MapReduce.
Spark components
1. Low-Level API (RDDs)
• RDDs (Resilient Distributed Datasets): Immutable distributed collections of objects
processed in parallel. Operations are either transformations (e.g., map(), filter()) or
actions (e.g., collect(), reduce()).
• Provides fine-grained control and fault tolerance but is more complex than higher-level
APIs.
2. Structured API (DataFrames & Datasets)
• DataFrames: Distributed collections of data organized into named columns (like tables in
a database), with optimizations via the Catalyst optimizer.
• Datasets: Type-safe, strongly-typed versions of DataFrames (available in Scala and Java).
• Supports SQL queries, aggregation, and transformations in a more user-friendly way
than RDDs.
3. Libraries and Ecosystem
• Spark SQL: For querying structured data using SQL syntax, integrates with DataFrames.
• Spark Streaming: Real-time data stream processing with DStreams and Structured
Streaming.
• MLlib: Machine learning library with scalable algorithms for classification, regression,
clustering, and more.
• GraphX: Graph processing library for analytics on graphs (social networks,
recommendations).
• Delta Lake: ACID transactions, versioning, and schema enforcement for big data storage.
• PySpark: Python API for using Spark, especially for data processing and machine
learning.
• Koalas: Pandas-like API on Spark for working with large datasets.
4. Ecosystem Integration
• Hadoop: Spark can run on Hadoop clusters and interact with HDFS.
• Kafka, Cassandra, HBase: Spark integrates with these systems for real-time data
ingestion and storage.
Core Components
[Figure: the different components of Spark]
Apache Spark 3.5 is a framework that is supported in Scala, Python, R Programming, and Java.
Below are different implementations of Spark.
• Spark – Default interface for Scala and Java
• PySpark – Python interface for Spark
• SparklyR – R interface for Spark.
As of writing this Apache Spark Tutorial, Spark supports the below cluster managers:
• Standalone – a simple cluster manager included with Spark that makes it easy to
set up a cluster.
• Hadoop YARN – the resource manager in Hadoop 2; this is the most commonly used
cluster manager.
• Apache Mesos – Mesos is a cluster manager that can also run Hadoop
MapReduce and Spark applications.
• Kubernetes – an open-source system for automating deployment, scaling, and
management of containerized applications.
What is a Core in Spark?
In Spark, the term core has two different meanings:
1. Core (as a physical resource):
o Core refers to a CPU core on a machine. In a cluster, each worker node has a
certain number of CPU cores available for running tasks.
o Spark uses the cores to distribute tasks across the executors. The more cores
available, the more parallel tasks can be executed, improving performance for
large jobs.
2. Core (as part of the Spark ecosystem):
Spark Core is the foundational component of the Spark ecosystem. It provides the basic
functionalities for Spark applications, including:
o Memory management
o Fault tolerance
o Scheduling of jobs and tasks
o Interaction with storage systems (HDFS, S3, etc.)
All higher-level components like Spark SQL, MLlib, and Spark Streaming are built on top
of Spark Core.
What is lazy evaluation?
Lazy evaluation in Spark means that execution will not start until an action is triggered.
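A small sketch in spark-shell (sc is the SparkContext the shell provides; the numbers are made up):

// Transformations are only recorded here; nothing runs yet.
val rdd      = sc.parallelize(1 to 10)
val squared  = rdd.map(x => x * x)          // transformation: lazy
val filtered = squared.filter(_ > 20)       // transformation: still lazy

// Execution starts only when an action is called.
println(filtered.collect().mkString(", "))  // action: triggers the whole chain -> 25, 36, 49, 64, 81, 100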
Spark Job
Spark job: A job in Spark refers to a sequence of transformations on data. Whenever an
action such as count(), first(), collect(), or save() is called on an RDD (Resilient Distributed
Dataset), a job is created.
Spark Stages: Spark job is divided into Spark stages, where each stage represents a set of
tasks that can be executed in parallel.
A stage consists of a set of tasks that are executed on a set of partitions of the data.
DAG (Directed Acyclic Graph): Spark schedules stages and tasks using a DAG (Directed
Acyclic Graph) scheduler, which schedules the stages of tasks.
The TaskScheduler is responsible for sending tasks to the cluster, running them, retrying if
there are failures, and mitigating stragglers.
Data Partitioning: Data is divided into smaller chunks, called partitions, and processed in
parallel across multiple nodes in a cluster.
Task: Tasks are the smallest unit of work, sent to one executor. The number of tasks
depends on the number of data partitions. Each task performs transformations on a chunk
of data.
A Spark job typically involves the following steps:
• Loading data from a data source
• Transforming or manipulating the data using Spark’s APIs such as Map, Reduce,
Join, etc.
• Storing the processed data back to a data store.
Narrow Transformations:
• No data shuffling across partitions.
• Faster and more efficient (executed in a single stage).
• map(), filter(), flatMap(), union(), sample(), mapPartitions(), mapValues(), coalesce(), zip()
Wide Transformations:
• Data shuffling between partitions.
• Slower and more resource-intensive (requires multiple stages).
• groupBy(), reduceByKey(), join(), distinct(), cogroup(), repartition(), cartesian(),
aggregateByKey(), groupByKey()
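A short spark-shell sketch contrasting the two (sample data assumed); toDebugString shows the shuffle boundary the wide transformation introduces:

val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hdfs", "spark"))

// Narrow: each output partition depends on one input partition, so no shuffle is needed.
val pairs = words.map(word => (word, 1))

// Wide: values with the same key must be brought together, causing a shuffle and a new stage.
val counts = pairs.reduceByKey(_ + _)

println(counts.toDebugString)       // the lineage shows a ShuffledRDD, i.e. a stage boundary
counts.collect().foreach(println)   // e.g. (spark,3), (hadoop,1), (hdfs,1)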
Spark Job Explanation with an Example:
Let’s take an example where we read a CSV file, perform a couple of transformations on the
data, and then run an action to demonstrate the concepts of job, stage, and task in Spark.
The file name and column name below are assumptions for illustration.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("JobExample").master("local[*]").getOrCreate()
val df = spark.read.option("header", "true").csv("data.csv")  // assumed input file with a header row
val result = df.select("salary").distinct().count()           // select/distinct are transformations; count() is the action that creates the job
println(result)
spark.stop()
spark-shell
The Spark binary comes with an interactive spark-shell. In order to start a shell, go to your
SPARK_HOME/bin directory and type "spark-shell". This command loads Spark and
displays what version of Spark you are using.
spark-shell
By default, spark-shell provides the spark (SparkSession) and sc (SparkContext) objects to
use. Let’s see some examples.
RDD creation
RDDs are created primarily in two different ways: first, by parallelizing an existing
collection, and second, by referencing a dataset in an external storage
system (HDFS, S3 and many more).
sparkContext.parallelize()
sparkContext.parallelize is used to parallelize an existing collection in your driver program.
This is a basic method to create RDD.
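For example, in spark-shell (the sample collection and partition count are arbitrary):

// Create an RDD from an in-memory collection; numSlices controls the number of partitions.
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 3)
println(rdd.getNumPartitions)   // 3
println(rdd.count())            // 5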
sparkContext.textFile()
Using the textFile() method we can read a text (.txt) file from many sources like HDFS, S3,
Azure, the local file system, etc. into an RDD.
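For example (the HDFS path is an assumption):

// Each line of the text file becomes one element of the resulting RDD.
val linesRdd = spark.sparkContext.textFile("hdfs://localhost:9000/geeks/AI.txt")
println(linesRdd.count())           // number of lines in the file
linesRdd.take(5).foreach(println)   // print the first few lines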
RDD Operations
On Spark RDD, you can perform two kinds of operations.
RDD Transformations
Spark RDD transformations are lazy operations, meaning they don’t execute until you call
an action on the RDD. Since RDDs are immutable, when you run a transformation (for example,
map()), instead of updating the current RDD, it returns a new RDD.
Some transformations on RDDs
are flatMap(), map(), reduceByKey(), filter(), sortByKey(), and all of these return a new RDD
instead of updating the current one.
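A small sketch chaining some of these transformations in spark-shell (sample data assumed); nothing executes until the action at the end:

val lines = sc.parallelize(Seq("spark makes rdds", "rdds are immutable"))

val wordCounts = lines
  .flatMap(_.split(" "))        // new RDD of words
  .map(word => (word, 1))       // new RDD of (word, 1) pairs
  .reduceByKey(_ + _)           // new RDD of (word, count) pairs
  .sortByKey()                  // new RDD sorted by word

wordCounts.collect().foreach(println)  // action: e.g. (are,1), (immutable,1), (makes,1), (rdds,2), (spark,1)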
RDD Actions
An RDD action operation returns values from an RDD to the driver node. In other words, any
RDD function that returns something other than RDD[T] is considered an action. Actions trigger
the computation and return the result (for example, a count or a list of elements) to the driver program.
Some actions on RDDs are count(), collect(), first(), max(), reduce() and more.
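For example, in spark-shell:

val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

println(nums.count())           // 5  -> number of elements, returned to the driver
println(nums.first())           // 5  -> first element
println(nums.max())             // 5  -> largest element
println(nums.reduce(_ + _))     // 15 -> sum of all elements
println(nums.collect().toList)  // List(5, 1, 4, 2, 3) -> all elements brought to the driver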
Since DataFrames are a structured format with named columns, we can get the
schema of the DataFrame using df.printSchema().
df.show() displays the first 20 rows of the DataFrame.
+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+
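A minimal sketch of how a DataFrame with this shape could be created in spark-shell (the data and column names are taken from the output above; toDF() requires import spark.implicits._):

import spark.implicits._

val data = Seq(
  ("James", "", "Smith", "1991-04-01", "M", 3000),
  ("Michael", "Rose", "", "2000-05-19", "M", 4000),
  ("Robert", "", "Williams", "1978-09-05", "M", 4000),
  ("Maria", "Anne", "Jones", "1967-12-01", "F", 4000),
  ("Jen", "Mary", "Brown", "1980-02-17", "F", -1)
)
val df = data.toDF("firstname", "middlename", "lastname", "dob", "gender", "salary")

df.printSchema()  // column names and inferred types
df.show()         // first 20 rows (all 5 here), as in the output above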