Integration of Python With Hadoop and Spark
Introduction
Big data refers to collections of data that are vast in size and keep growing exponentially with time. The data is so large and complex that none of the traditional data management tools can store or process it efficiently.
There are other technologies that can process Big Data more efficiently than Python on its own: Hadoop and Spark.
Hadoop
Hadoop is one of the best-known solutions for storing and processing Big Data because it stores huge files in the Hadoop Distributed File System (HDFS) without requiring any predefined schema.
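As a quick illustration of touching HDFS from Python, here is a minimal sketch using the third-party hdfs (HdfsCLI) package; the NameNode address, user name, and file paths are assumptions, and it requires a running HDFS cluster with WebHDFS enabled:
#install the third-party HdfsCLI client
!pip install -q hdfs
from hdfs import InsecureClient
#connect to the WebHDFS endpoint of the NameNode (address and user are assumptions)
client = InsecureClient("http://localhost:9870", user="hadoop")
#write a small file into HDFS, then list the directory and read the file back
client.write("/user/hadoop/hello.txt", data=b"hello from python\n", overwrite=True)
print(client.list("/user/hadoop"))
with client.read("/user/hadoop/hello.txt") as reader:
    print(reader.read().decode("utf-8"))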
Why Python?
2. Easy to learn:
Python is very easy to learn, and its syntax and code are simple and readable even for beginners. It has a wide range of applications, such as web development, data science, and machine learning.
Python also lets us write programs with fewer lines of code than most other programming languages, and this simplicity is a big reason its popularity is growing so rapidly.
5. Speed and Efficiency:
Python is among the most widely used languages for ML/AI due to its speed and efficiency.
Spark
Spark provides a Python API called PySpark, released by the Apache Spark community to support Python with Spark. Using PySpark, one can easily integrate and work with RDDs from within the Python programming language.
Spark also comes with an interactive Python shell called the PySpark shell. It links the Python API to the Spark core and initializes the Spark context, and it can be launched directly from the command line for interactive use.
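As a minimal sketch of what working with RDDs from Python looks like (it assumes PySpark is already installed; the installation steps are shown later in this article, and the numbers are purely illustrative):
#importing SparkSession and creating a local session (the app name is arbitrary)
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext
#turning an ordinary Python list into an RDD and running a simple filter/map pipeline
numbers = sc.parallelize(range(1, 11))
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(even_squares.collect())   # [4, 16, 36, 64, 100]
print(even_squares.sum())       # 220
spark.stop()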
Components of Hadoop:
There are mainly two components of Hadoop: HDFS, which provides distributed storage, and MapReduce, which provides distributed processing.
# Install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#set the JAVA_HOME and SPARK_HOME environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-
hadoop3.2"
#Running Hadoop
!/usr/local/hadoop-3.3.0/bin/hadoop
!mkdir ~/input
!cp /usr/local/hadoop-3.3.0/etc/hadoop/*.xml ~/input
!ls ~/input
!/usr/local/hadoop-3.3.0/bin/hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep ~/input ~/grep_example 'allowed[.]*'
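The grep job above counts the strings matching the regular expression 'allowed[.]*' in the copied XML files and writes its results to the ~/grep_example directory named on the command line; the part-file names inside it are generated by Hadoop:
#print the output produced by the example MapReduce job
!cat ~/grep_example/*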
MapReduce
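MapReduce is Hadoop's processing layer: mappers turn input records into key-value pairs and reducers aggregate the values for each key. Here is a hedged sketch of how plain Python scripts plug into it via Hadoop Streaming; the script names, input/output paths, and the streaming-jar location are assumptions based on the install layout above.
#mapper.py - reads lines from stdin and emits "word<TAB>1" for every word
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
#reducer.py - Hadoop sorts mapper output by key, so counts for a word arrive together
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
The two scripts can then be handed to the Hadoop Streaming jar, which pipes the input through them:
!/usr/local/hadoop-3.3.0/bin/hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input ~/input -output ~/wordcount_output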
Apache Spark
# Install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
#Extract the spark archive into the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
#Set your spark folder to your system path environment.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-
hadoop3.2"
#Install findspark using pip
!pip install -q findspark
#initialize findspark so that the Spark folder set above is visible to Python
import findspark
findspark.init()
#Spark for Python (pyspark)
!pip install pyspark
#importing pyspark
import pyspark
#importing SparkSession
from pyspark.sql import SparkSession
#creating a SparkSession object: "local[*]" is the master URL and the appName is arbitrary
spark = SparkSession.builder.master("local[*]").appName("PySparkApp").getOrCreate()
#printing the version of spark
print("Apache Spark version: ", spark.version)