
Created by – Jairam Kancharla

Big data
Big data is a combination of structured, semi-structured, and unstructured data collected by
organizations that can be mined for information and used in machine
learning projects, predictive modeling, and other advanced analytics applications.

How Does Big Data Work?


Big data analytics involves spotting trends, patterns, and correlations within vast amounts of
unprocessed data in order to guide data-driven decisions. These procedures apply well-known
statistical analysis methods, such as clustering and regression, to much larger datasets with
the help of newer tools. The typical workflow is:
1. Collect the data
2. Organise the data
3. Clean the data
4. Analyse the data
The Five V’s of Big Data
These datasets are so huge and complex in volume, velocity, and variety that traditional data
management systems cannot store, process, and analyze them.

Volume: With increasing dependence on technology, data is being produced in large
volumes. Common examples are data produced by social networking sites, sensors,
scanners, airlines, and other organizations.
Velocity: A huge amount of data is generated every second. It was estimated that by the end
of 2020 every individual would produce around 3 MB of data per second. This large volume of data
is being generated at great velocity.
Variety: The data being produced by different means is of three types:
Structured Data: Relational data that is stored in the form of rows and columns.
Unstructured Data: Texts, pictures, videos, etc. are examples of unstructured data,
which cannot be stored in the form of rows and columns.
Semi-Structured Data: Log files are an example of this type of data.
Veracity: The term veracity refers to inconsistent or incomplete data, which
results in doubtful or uncertain information. Often data
inconsistency arises because of the volume of data: data in bulk can
create confusion, whereas too little data can convey only half or incomplete
information.
Value: After taking the four V's into account, there is one more V, which stands for
Value. Bulk data with no value is of no good to the company unless it is turned
into something useful. Data in itself is of no use or importance; it needs to be
converted into something valuable in order to extract information. Hence, you can say that
Value is the most important of the 5 V's.

Apache Hadoop
• Apache Hadoop is an open source framework that is used to efficiently store and process large
datasets.
• Instead of using one large computer to store and process the data, Hadoop allows clustering
multiple computers to analyze massive datasets in parallel more quickly.
• Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis, and data
mining.
Following are the components that collectively form a Hadoop ecosystem:

• HDFS: Hadoop Distributed File System


• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
What are the four main modules of Hadoop?
Hadoop consists of four main modules:
• Hadoop Distributed File System (HDFS) – A distributed file system that runs on
standard or low-end hardware. HDFS provides better data throughput than traditional
file systems, in addition to high fault tolerance and native support of large datasets.
• Yet Another Resource Negotiator (YARN) – Manages and monitors cluster nodes and
resource usage. It schedules jobs and tasks.
• MapReduce – A framework that helps programs do the parallel computation on data.
The map task takes input data and converts it into a dataset that can be computed in
key value pairs. The output of the map task is consumed by reduce tasks to aggregate
output and provide the desired result.
• Hadoop Common – Provides common Java libraries that can be used across all
modules.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It has huge, flexible storage.
5. It is low cost.

HDFS:
• HDFS is the primary or major component of Hadoop ecosystem and is responsible
for storing large data sets of structured or unstructured data across various nodes
and thereby maintaining the metadata in the form of log files.
• HDFS consists of two core components i.e.
1. Name node
2. Data Node
• The Name Node is the prime node; it contains the metadata (data about data) and
requires comparatively fewer resources than the data nodes, which store the actual
data. These data nodes are commodity hardware in the distributed environment,
which makes Hadoop cost effective.
• HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
YARN:
• Yet Another Resource Negotiator: as the name implies, YARN helps
to manage the resources across the clusters. In short, it performs scheduling and
resource allocation for the Hadoop system.
• It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
• The Resource Manager has the privilege of allocating resources for the applications in the
system, whereas Node Managers work on the allocation of resources such as CPU,
memory, and bandwidth per machine and later acknowledge the Resource
Manager. The Application Manager works as an interface between the Resource
Manager and Node Managers and performs negotiations as per the requirements of
the two.
MapReduce:
• MapReduce is a paradigm which has two phases: the mapper phase and the
reducer phase. In the mapper, the input is given in the form of key-value pairs. The
output of the mapper is fed to the reducer as input. The reducer runs only after the
mapper is over. The reducer also takes input in key-value format, and the output of the
reducer is the final output.
• MapReduce makes use of two functions, Map() and Reduce(), whose tasks are
(a short word-count sketch follows this list):
1. Map() performs sorting and filtering of data, thereby organizing it
into groups. Map generates a key-value-pair based result
which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating
the mapped data. In short, Reduce() takes the output generated by
Map() as input and combines those tuples into a smaller set of tuples.
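As a minimal sketch of the Map()/Reduce() idea described above (this is plain Scala over in-memory collections, not Hadoop's actual Java MapReduce API), the word-count example below uses made-up input lines:

// Word count sketched with plain Scala collections to show the
// map -> group (shuffle) -> reduce flow of the MapReduce paradigm.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data with hadoop", "hadoop and spark", "big data")

    // Map(): emit a (word, 1) pair for every word
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.split("\\s+")).map(word => (word, 1))

    // Shuffle/sort: group the pairs by key
    val grouped: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

    // Reduce(): aggregate the values for each key
    val counts: Map[String, Int] =
      grouped.map { case (word, ones) => (word, ones.sum) }

    counts.foreach { case (word, n) => println(s"$word -> $n") }
  }
}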
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are
128 MB by default, and this is configurable. Files in HDFS are broken into block-sized
chunks, which are stored as independent units. Unlike in a local file system, if a file in HDFS
is smaller than the block size it does not occupy the full block's size; i.e. a 5 MB file stored
in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large mainly to
minimize the cost of seeks. (A short worked example follows this list.)
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master.
The Name Node is the controller and manager of HDFS, as it knows the status and the metadata
of all the files in HDFS; the metadata includes file permissions, names, and the
location of each block. The metadata is small, so it is stored in the memory of the name
node, allowing faster access to it. Moreover, since the HDFS cluster is accessed by multiple
clients concurrently, all this information is handled by a single machine. File system
operations like opening, closing, renaming, etc. are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by a client or the name node.
They report back to the name node periodically with the list of blocks that they are storing. The
data node, being commodity hardware, also does the work of block creation, deletion,
and replication as directed by the name node.
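As a tiny worked example of the block model (the file size is chosen arbitrarily), a 300 MB file with the default 128 MB block size is split into three blocks, and the last block occupies only 44 MB:

// Hypothetical file size; 128 MB is the HDFS default block size.
val blockSizeMB = 128
val fileSizeMB  = 300

val fullBlocks  = fileSizeMB / blockSizeMB                     // 2 full 128 MB blocks
val lastBlockMB = fileSizeMB % blockSizeMB                     // 44 MB in the final block
val totalBlocks = fullBlocks + (if (lastBlockMB > 0) 1 else 0)

println(s"$totalBlocks blocks, last block holds $lastBlockMB MB")  // 3 blocks, 44 MB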
HDFS DataNode and NameNode Image:

HDFS Read
• To read a file from HDFS, the client first needs to interact with the Namenode.
• The Namenode provides the addresses of the slaves (datanodes) where the file is stored.
• The client then interacts with the respective datanodes to read the file.
• The Namenode also provides a token to the client, which it shows to the datanodes for
authentication.
• The Namenode checks the access rights of the client node and then grants it access to the
datanodes.

HDFS Read Image:


HDFS Write Image:

Secondary Name Node: It is a separate physical machine which acts as a helper to the name node.
It performs periodic checkpoints: it communicates with the name node and takes snapshots of the
metadata, which helps minimize downtime and loss of data.
HDFS commands:
1. First you need to start the Hadoop services using the following command:

sbin/start-all.sh

2. To check the Hadoop services are up and running use the following command:
jps

3. ls: This command is used to list all the files. Use -ls -R (or the older lsr) for a recursive listing. It is useful
when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>

Example:
bin/hdfs dfs -ls /

It will print all the directories present in HDFS. The bin directory contains executables,
so bin/hdfs means we want the hdfs executable, particularly its dfs (Distributed File System)
commands.
4. mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s
first create it.
Syntax:

bin/hdfs dfs -mkdir <folder name>


creating home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your computer

Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder
created relative to the home directory.

5. touchz: It creates an empty file.


Syntax:
bin/hdfs dfs -touchz <file_path>

Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
6. copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This is
the most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>

Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks

7. cat: To print file contents.


Syntax:
bin/hdfs dfs -cat <path>

Example:
// print the content of AI.txt present
// inside geeks folder.
bin/hdfs dfs -cat /geeks/AI.txt

8. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>

Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero

myfile.txt from geeks folder will be copied to folder hero present on Desktop.

Note: Observe that we don’t write bin/hdfs while checking the things present on
local filesystem.

9. moveFromLocal: This command will move file from local to hdfs.


Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>

Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks

10. cp: This command is used to copy files within hdfs. Let's copy
folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>

Example:
bin/hdfs dfs -cp /geeks /geeks_copied

11. mv: This command is used to move files within hdfs. Let's cut-paste a
file myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>

Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied

12. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when
you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>

Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content
inside the directory then the directory itself.

13. du: It will give the size of each file in the directory.


Syntax:
bin/hdfs dfs -du <dirName>

Example:
bin/hdfs dfs -du /geeks
14. dus: This command will give the total size of directory/file.
Syntax:
bin/hdfs dfs -dus <dirName>

Example:
bin/hdfs dfs -dus /geeks

15. stat: It will give the last modified time of directory or path. In short it will give stats of the
directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>

Example:
bin/hdfs dfs -stat /geeks

16. setrep: This command is used to change the replication factor of a file/directory in HDFS.
By default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).
Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS.
bin/hdfs dfs -setrep -R -w 6 geeks.txt
Example 2: To change the replication factor to 4 for the directory geeks stored
in HDFS.
bin/hdfs dfs -setrep -R 4 /geeks
Note: The -w flag means wait till the replication is completed, and -R means
recursive; we use it for directories as they may also contain many files and
folders inside them.
Note: There are more commands in HDFS but we discussed the commands which are
commonly used when working with Hadoop. You can check out the list of dfs commands
using the following command:
bin/hdfs dfs

What is YARN
Yet Another Resource Negotiator (YARN) takes Hadoop beyond just MapReduce programs written in Java and
lets other applications such as HBase and Spark work on the cluster. Different YARN applications
can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time,
bringing great benefits for manageability and cluster utilization.
Components Of YARN
o Client: For submitting MapReduce jobs.
o Resource Manager: To manage the use of resources across the cluster
o Node Manager: For launching and monitoring the compute containers on machines in
the cluster.
o MapReduce Application Master: Coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by the
resource manager and managed by the node managers.

JobTracker and TaskTracker were used in previous versions of Hadoop and were responsible
for handling resources and checking progress. Hadoop 2.0 replaced them with the ResourceManager
and NodeManager to overcome their shortcomings.

Steps in Map Reduce:

Step 1: Mapper (or) Map


• The input data is first split into smaller blocks. Each mapper receives its block as key-value
pairs and returns a list of <key, value> pairs.
• In the mapping step, data is split between parallel processing tasks, and transformation
logic can be applied to each chunk of data.
Step 2: Sort and Shuffle
• Sort and shuffle occur on the output of the mapper and before the reducer.
• The results are sorted by key.
• The shuffling phase groups together all the values belonging to the same key.
Step 3: Reduce
• The reducer performs a defined function on the list of values for each unique key, and the final
output <key, value> pairs are stored/displayed.
• The reducer processes this input further to reduce the intermediate values into smaller
values.
• The output from this phase is stored in the HDFS.

Example:
MapReduce architecture
SPARK
Spark and Core Spark: From Fundamentals to Advanced Concepts
Spark is a powerful distributed computing platform for big data processing. It excels at handling
large datasets efficiently by parallelizing tasks across clusters of computers.
Spark is a market leader for big data processing. It is widely used across organizations in many
ways, and it can outperform Hadoop MapReduce by running up to 100 times faster in memory and
about 10 times faster on disk.
Spark is mainly a data processing engine; for storage it relies on external systems such as HDFS, S3, or other file systems.
Spark is often used as a replacement for MapReduce.
Spark components
1. Low-Level API (RDDs)
• RDDs (Resilient Distributed Datasets): Immutable distributed collections of objects
processed in parallel. Operations are either transformations (e.g., map(), filter()) or
actions (e.g., collect(), reduce()).
• Provides fine-grained control and fault tolerance but is more complex than higher-level
APIs.
2. Structured API (DataFrames & Datasets)
• DataFrames: Distributed collections of data organized into named columns (like tables in
a database), with optimizations via the Catalyst optimizer.
• Datasets: Type-safe, strongly-typed versions of DataFrames (available in Scala and Java).
• Supports SQL queries, aggregation, and transformations in a more user-friendly way
than RDDs.
3. Libraries and Ecosystem
• Spark SQL: For querying structured data using SQL syntax; integrates with DataFrames (see the short sketch after this list).
• Spark Streaming: Real-time data stream processing with DStreams and Structured
Streaming.
• MLlib: Machine learning library with scalable algorithms for classification, regression,
clustering, and more.
• GraphX: Graph processing library for analytics on graphs (social networks,
recommendations).
• Delta Lake: ACID transactions, versioning, and schema enforcement for big data storage.
• PySpark: Python API for using Spark, especially for data processing and machine
learning.
• Koalas: Pandas-like API on Spark for working with large datasets.
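As a small sketch of the Spark SQL item above (the view name, column names, and rows are made up for illustration), a DataFrame can be registered as a temporary view and queried with SQL:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSqlSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: (name, score)
val scores = Seq(("alice", 90), ("bob", 75), ("carol", 82)).toDF("name", "score")

// Spark SQL: query the DataFrame through a temporary view
scores.createOrReplaceTempView("scores")
val top = spark.sql("SELECT name, score FROM scores WHERE score >= 80 ORDER BY score DESC")

top.show()
spark.stop()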
4. Ecosystem Integration
• Hadoop: Spark can run on Hadoop clusters and interact with HDFS.
• Kafka, Cassandra, HBase: Spark integrates with these systems for real-time data
ingestion and storage.
Core Components
The following diagram gives a clear picture of the different components of Spark:

Apache Spark 3.5 is a framework that is supported in Scala, Python, R, and Java.
Below are different implementations of Spark.
• Spark – Default interface for Scala and Java
• PySpark – Python interface for Spark
• SparklyR – R interface for Spark.

Features of Apache Spark


• In-memory computation
• Distributed processing using parallelize
• Can be used with many cluster managers (Spark standalone, YARN, Mesos, etc.)
• Fault-tolerant
• Immutable
• Lazy evaluation
• Cache & persistence
• Built-in optimization when using DataFrames
• Supports ANSI SQL

Advantages of Apache Spark


• Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine
that allows you to process data efficiently in a distributed fashion.
• Applications running on Spark can be up to 100x faster than traditional MapReduce-based
systems for in-memory workloads.
• You will get great benefits from using Spark for data ingestion pipelines.
• Using Spark we can process data from Hadoop HDFS, AWS S3, Databricks
DBFS, Azure Blob Storage, and many other file systems.
• Spark is also used to process real-time data using Spark Streaming and Kafka.
• It provides connectors to store data in NoSQL databases like MongoDB.
Cluster Manager Types

As of this writing, Spark supports the following cluster managers:
• Standalone – a simple cluster manager included with Spark that makes it easy to
set up a cluster.
• Hadoop YARN – the resource manager in Hadoop 2; this is the most commonly used
cluster manager.
• Apache Mesos – a cluster manager that can also run Hadoop
MapReduce and Spark applications.
• Kubernetes – an open-source system for automating deployment, scaling, and
management of containerized applications.
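As a minimal sketch of how an application picks one of these cluster managers, the master URL is set when building the SparkSession; the URLs below ("local[*]", "yarn", "spark://host:7077") are illustrative placeholders rather than values from this document.

import org.apache.spark.sql.SparkSession

// The cluster manager is chosen via the master URL:
// "local[*]" runs everything in the local JVM, "yarn" submits to Hadoop YARN,
// and "spark://host:7077" points at a standalone master (host/port are placeholders).
val spark = SparkSession.builder()
  .appName("ClusterManagerExample")
  .master("local[*]")   // swap for "yarn" or "spark://host:7077" as appropriate
  .getOrCreate()

println(spark.sparkContext.master)
spark.stop()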
What is a Core in Spark?
In Spark, the term core has two different meanings:
1. Core (as a physical resource):
o Core refers to a CPU core on a machine. In a cluster, each worker node has a
certain number of CPU cores available for running tasks.
o Spark uses the cores to distribute tasks across the executors. The more cores
available, the more parallel tasks can be executed, improving performance for
large jobs.
2. Core (as part of the Spark ecosystem):
Spark Core is the foundational component of the Spark ecosystem. It provides the basic
functionalities for Spark applications, including:
o Memory management
o Fault tolerance
o Scheduling of jobs and tasks
o Interaction with storage systems (HDFS, S3, etc.)
All higher-level components like Spark SQL, MLlib, and Spark Streaming are built on top
of Spark Core.
What is lazy evaluation:
Lazy evaluation in Spark means that execution does not start until an action is triggered; transformations only build up an execution plan.
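A small sketch of lazy evaluation (the numbers are arbitrary): the two transformations below only record the lineage, and nothing runs until the count() action is called.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LazyEvalExample").master("local[*]").getOrCreate()

// Transformations: nothing executes yet, Spark only records the plan/lineage.
val nums    = spark.sparkContext.parallelize(1 to 1000000)
val evens   = nums.filter(_ % 2 == 0)   // transformation, lazy
val doubled = evens.map(_ * 2)          // transformation, lazy

// Action: this is the point where Spark actually schedules and runs a job.
val total = doubled.count()
println(s"count = $total")

spark.stop()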

Apache Spark Architecture:


Spark works in a master-slave architecture where the master is called the “Driver”
and slaves are called “Workers”.
Master - Driver
Slaves - Workers (or) executors
Cluster manager – (Resource allocation)

Explanation for spark architecture:


Spark Driver creates a context that is an entry point to your application, and all operations
(transformations and actions) are executed on worker nodes, and the resources are
managed by Cluster Manager.
Driver Program - Driver Program in the Apache Spark architecture calls the main program of an
application and creates SparkContext.
Spark Driver contains various other components such as DAG Scheduler, Task Scheduler,
Backend Scheduler, and Block Manager, which are responsible for translating the user-written
code into jobs that are actually executed on the cluster.
SparkContext –
• The Driver Program in the Apache Spark architecture calls the main program of an
application and creates the SparkContext.
• The Spark Driver and SparkContext collectively watch over the Spark job execution within
the cluster.
• Whenever an RDD is created in the SparkContext, it can be distributed across many
worker nodes and can also be cached there.
• Worker nodes execute the tasks assigned by the Cluster Manager and return the results to
the SparkContext.
Cluster manager – (Resource allocation)
• The driver requests the cluster manager to allocate resources.
• The cluster manager reports which nodes are available and where the code can be
executed.
• The Spark Driver works with the Cluster Manager to manage various other jobs.
• The job is then split into multiple smaller tasks, which are further distributed to
worker nodes.

Spark Job
Spark job: A job in Spark refers to a sequence of transformations on data. Whenever an
action like count(), first(), collect(), and save() is called on RDD (Resilient Distributed
Datasets), a job is created.

Spark Stages: Spark job is divided into Spark stages, where each stage represents a set of
tasks that can be executed in parallel.
A stage consists of a set of tasks that are executed on a set of partitions of the data.
DAG (Directed Acyclic Graph): Spark schedules stages and tasks. Spark uses a DAG (Directed
Acyclic Graph) scheduler, which schedules stages of tasks.
The TaskScheduler is responsible for sending tasks to the cluster, running them, retrying if
there are failures, and mitigating stragglers.
Data Partitioning: Data is divided into smaller chunks, called partitions, and processed in
parallel across multiple nodes in a cluster.
Task: Tasks are the smallest unit of work, sent to one executor. The number of tasks
depends on the number of data partitions. Each task performs transformations on a chunk
of data.
A Spark job typically involves the following steps:
• Loading data from a data source
• Transforming or manipulating the data using Spark’s APIs such as Map, Reduce,
Join, etc.
• Storing the processed data back to a data store.

Narrow Transformations:
• No data shuffling across partitions.
• Faster and more efficient (executed in a single stage).
• map(), filter(), flatMap(), union(), sample(), mapPartitions(), mapValues(), coalesce(), zip()
Wide Transformations:
• Data shuffling between partitions.
• Slower and more resource-intensive (requires multiple stages).
• groupBy(), reduceByKey(), join(), distinct(), cogroup(), repartition(), cartesian(),
aggregateByKey(), groupByKey()
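A short illustrative sketch (the word list is made up): map() below is a narrow transformation, while reduceByKey() is a wide transformation that shuffles data across partitions and starts a new stage.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NarrowVsWide").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hdfs", "spark"))

// Narrow transformation: each output partition depends on a single input
// partition, so no data moves across the network.
val pairs = words.map(word => (word, 1))

// Wide transformation: rows with the same key must be brought together,
// so Spark shuffles data between partitions.
val counts = pairs.reduceByKey(_ + _)

counts.collect().foreach(println)   // action: triggers the job
spark.stop()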
Spark Job Explanation with an Example:
Let's take an example where we read a CSV file, perform some transformations on the
data, and then run an action to demonstrate the concepts of job, stage, and task in Spark.

import org.apache.spark.sql.SparkSession

// Create a Spark Session


val spark = SparkSession.builder
.appName("Spark Job Stage Task Example")
.getOrCreate()

// Read a CSV file - this is a transformation and doesn't trigger a job


val data = spark.read.option("header", "true").csv("path/to/your/file.csv")

// Perform a transformation to create a new DataFrame with an added column.
// This also doesn't trigger a job, as it's a transformation (not an action).
val transformedData = data.withColumn("new_column", data("existing_column") * 2)

// Now, call an action - this triggers a Spark job


val result = transformedData.count()

println(result)

spark.stop()

In the above code:


1. A Job is triggered when we call the action count(). This is where Spark schedules
tasks to be run.
2. Stages are created based on transformations. In this example, we have two
transformations (read.csv and withColumn). However, these two transformations
belong to the same stage since there's no data shuffling between them.
3. Tasks are the smallest unit of work, sent to one executor. The number of tasks
depends on the number of data partitions. Each task performs transformations on a
chunk of data.
Spark Installation

spark-shell
The Spark binary comes with an interactive spark-shell. To start the shell, go to your
SPARK_HOME/bin directory and type "spark-shell". This command loads Spark and
displays what version of Spark you are using.
spark-shell
By default, spark-shell provides the spark (SparkSession) and sc (SparkContext) objects to
use. Let's see some examples.

Example: creating an RDD in spark-shell
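A brief sketch of what you might type at the spark-shell prompt, using the pre-created sc object (the numbers are arbitrary):

// Inside spark-shell, `spark` and `sc` already exist; no imports are needed.
val rdd = sc.parallelize(1 to 10)
val squares = rdd.map(n => n * n)
println(squares.collect().mkString(", "))   // 1, 4, 9, ..., 100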


spark-shell also starts the Spark context web UI, which by default can be accessed at
http://localhost:4040 (if that port is busy, Spark tries 4041, 4042, and so on).
Spark Core
This section covers the Spark Core library with examples in Scala. Spark Core is the main base library of Spark;
it provides the abstractions for distributed task dispatching, scheduling, basic I/O
functionality, etc.
Key Differences:

| Feature            | SparkContext                                            | SparkSession                                                      |
|--------------------|---------------------------------------------------------|-------------------------------------------------------------------|
| Introduced         | Spark 1.x                                               | Spark 2.0                                                         |
| Primary Use        | Managing RDDs and interacting with Spark clusters       | Unified entry point for Spark SQL, DataFrames, Datasets, and RDDs |
| Access to Data     | RDDs                                                    | DataFrames, Datasets, RDDs, SQL queries                           |
| Cluster Management | Yes                                                     | Yes (through the underlying SparkContext)                         |
| SQL Support        | No                                                      | Yes                                                               |
| Hive Support       | No                                                      | Yes (with enableHiveSupport())                                    |
| Requires           | SparkContext needs to be created manually for RDD work  | Internally handles SparkContext creation, making it simpler       |
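A small sketch showing both entry points in practice: the SparkSession is built directly, while the SparkContext is obtained from it instead of being created manually (names and data are illustrative).

import org.apache.spark.sql.SparkSession

// SparkSession: unified entry point (Spark 2.0+)
val spark = SparkSession.builder()
  .appName("EntryPointExample")
  .master("local[*]")
  .getOrCreate()

// SparkContext: still available, accessed through the SparkSession
val sc = spark.sparkContext

val rdd = sc.parallelize(Seq(1, 2, 3))     // RDD API via SparkContext
val df  = spark.range(3).toDF("id")        // DataFrame API via SparkSession

println((rdd.count(), df.count()))
spark.stop()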

Resilient Distributed Datasets (RDDs) are considered the fundamental data
structure of Spark. An RDD is immutable and read-only; RDDs perform
computations through transformations and actions.
RDDs are fault-tolerant, immutable distributed collections of objects, which means
once you create an RDD you cannot change it. Each dataset in an RDD is divided into
logical partitions, which can be computed on different nodes of the cluster.

RDD creation
RDDs are created primarily in two different ways: first, by parallelizing an existing
collection, and secondly, by referencing a dataset in an external storage
system (HDFS, S3, and many more).
sparkContext.parallelize()
sparkContext.parallelize is used to parallelize an existing collection in your driver program.
This is a basic method to create RDD.

//Create RDD from parallelize


val dataSeq = Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))
val rdd=spark.sparkContext.parallelize(dataSeq)

sparkContext.textFile()
Using the textFile() method we can read a text (.txt) file from many sources like HDFS, S3,
Azure, the local filesystem, etc. into an RDD.

//Create RDD from external Data source


val rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

RDD Operations
On Spark RDD, you can perform two kinds of operations.

RDD Transformations
Spark RDD transformations are lazy operations, meaning they don't execute until you call
an action on the RDD. Since RDDs are immutable, when you run a transformation (for example
map()), instead of updating the current RDD it returns a new RDD.
Some transformations on RDDs
are flatMap(), map(), reduceByKey(), filter(), and sortByKey(), and all of these return a new RDD
instead of updating the current one.

RDD Actions
An RDD action returns values from an RDD to the driver node. In other words, any
RDD function that returns something other than RDD[T] is considered an action. Actions trigger
the computation and return the result (for example, a list of values) to the driver program.
Some actions on RDDs are count(), collect(), first(), max(), reduce() and more.
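A brief sketch tying transformations and actions together (the numbers are arbitrary): each transformation returns a new RDD, and only the actions at the end return values to the driver.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RDDOpsExample").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(Seq(5, 1, 4, 2, 3))

// Transformations: each returns a new RDD, nothing runs yet.
val doubled  = rdd.map(_ * 2)
val filtered = doubled.filter(_ > 4)

// Actions: these trigger the computation and return values to the driver.
println(filtered.count())              // 3
println(filtered.collect().toList)     // List(10, 8, 6)
println(filtered.reduce(_ + _))        // 24

spark.stop()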

Spark Read CSV file into RDD


This example loads a CSV file into a Spark RDD using Scala. Using the textFile() method
of the SparkContext class we can read one or more CSV files:
// Read from csv file into RDD
val rddFromFile = spark.sparkContext.textFile("C:/tmp/files/text01.csv")
Spark DataFrame Basics
DataFrame creation
By using createDataFrame() function of the SparkSession you can create a DataFrame.
// Create DataFrame
val data = Seq(("James", "", "Smith", "1991-04-01", "M", 3000),
  ("Michael", "Rose", "", "2000-05-19", "M", 4000),
  ("Robert", "", "Williams", "1978-09-05", "M", 4000),
  ("Maria", "Anne", "Jones", "1967-12-01", "F", 4000),
  ("Jen", "Mary", "Brown", "1980-02-17", "F", -1)
)

val columns = Seq("firstname", "middlename", "lastname", "dob", "gender", "salary")

val df = spark.createDataFrame(data).toDF(columns: _*)

Since a DataFrame is a structured format with named columns, we can get the
schema of the DataFrame using df.printSchema().
df.show() displays the first 20 rows of the DataFrame.
+---------+----------+--------+----------+------+------+
|firstname |middlename |lastname |dob |gender |salary |
+---------+----------+--------+----------+------+------+
|James | |Smith |1991-04-01|M |3000 |
|Michael |Rose | |2000-05-19|M |4000 |
|Robert | |Williams |1978-09-05|M |4000 |
|Maria |Anne |Jones |1967-12-01|F |4000 |
|Jen |Mary |Brown |1980-02-17|F |-1 |
+---------+----------+--------+----------+------+------+
