Running PySpark on Jupyter Notebook

Prerequisites:

1. Anaconda installation
2. Java 1.8
3. Spark 2.3

1. Verify the installations


Log in to the EC2 instance as the root user.

a. Run the following command to verify the Anaconda installation


conda --version

b. Run the following command to verify Java 8 installation


java -version

c. Run the following command to verify Spark2 installation


spark2-shell --version
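
Optionally, you can script the same three checks. The sketch below is just a convenience (run it with the Anaconda python on the instance); it shells out to the exact commands shown above.

import subprocess

# run each version check; java and spark2-shell print their banner to stderr,
# so redirect stderr into stdout before decoding
for cmd in (["conda", "--version"], ["java", "-version"], ["spark2-shell", "--version"]):
    out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    print(" ".join(cmd))
    print(out.decode().strip())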
2. Configure Jupyter (one-time process)
Jupyter Notebook is already installed on our CDH instance along with Anaconda. However, we need to
configure it before we can actually run it.

a. Run the following command to generate the Jupyter configuration file.


jupyter notebook --generate-config

You can see that jupyter_notebook_config.py has been created inside the /root/.jupyter directory.

b. Allow access to your remote Jupyter server

Open the jupyter_notebook_config.py file


vi .jupyter/jupyter_notebook_config.py

Press 'i' to enter insert mode


Copy and paste the two lines below into the file.

c.NotebookApp.allow_origin = '*'  # allow all origins
c.NotebookApp.ip = '0.0.0.0'  # listen on all IPs

To save and exit
> Press 'Esc' > Type :wq! > Hit Enter
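
If you want to double-check the edit without reopening vi, a minimal Python sketch like the one below (assuming the config was generated under /root/.jupyter as shown above) prints the settings you just added:

# print the active (non-comment) NotebookApp settings from the generated config file
with open("/root/.jupyter/jupyter_notebook_config.py") as f:
    for line in f:
        line = line.strip()
        if line.startswith("c.NotebookApp."):
            print(line)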
3. Now, use the following command to run the Jupyter Notebook

jupyter notebook --port 7861 --allow-root

You will see that the Jupyter server has started.

Note: Don't kill this process; keep it running until you are done using your Jupyter Notebook.
To stop/close your Jupyter Notebook, open this terminal and press Ctrl + C

You can see that it has given you a URL where your Jupyter Notebook is running. In my case the
URL is

http://(ip-10-0-0-228.ec2.internal or 127.0.0.1):7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005

In order to open the Jupyter Notebook in your browser, replace the hostname part (ip-10-0-0-228.ec2.internal
or 127.0.0.1) with the Public IP of your EC2 instance.

In my case, the final URL will be:


http://3.89.129.54:7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005
Now, you can open your web browser and copy-paste your final URL to open the Jupyter
notebook. You should be able to see the Jupyter home page.
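
The replacement itself is just a string substitution. Purely as an illustration (using the example hostname, token, and public IP from this document; yours will differ):

# swap the internal hostname for the instance's public IP to build the final URL
internal_url = "http://ip-10-0-0-228.ec2.internal:7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005"
public_ip = "3.89.129.54"
final_url = internal_url.replace("ip-10-0-0-228.ec2.internal", public_ip)
print(final_url)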

4. PySpark on Jupyter

In order to run PySpark in a Jupyter notebook, you need to paste and run the following
lines of code in every PySpark notebook you create.

import os
import sys
os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

Note: The values of the above environment variables may be different in your case; we suggest
you verify them before using them. Please follow the steps below to do so.
i. Anaconda and Python

os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"

Make sure Anaconda is installed in the /opt/cloudera/parcels/ directory. To verify, check
whether the Anaconda directory is present under the /opt/cloudera/parcels/ path.

ls /opt/cloudera/parcels/
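
You can also run this check from Python; a minimal sketch (the path is the one used in this document, adjust it if your parcel directory differs):

import os

# verify that the interpreter used for PYSPARK_PYTHON actually exists
pyspark_python = "/opt/cloudera/parcels/Anaconda/bin/python"
print(pyspark_python, "exists:", os.path.exists(pyspark_python))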

ii. Java Home

os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"

Run the following command to get your JAVA_HOME path


echo $JAVA_HOME

Hence, /usr/java/jdk1.8.0_161/jre will be the value of our JAVA_HOME variable.
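
As a quick sanity check from Python (the path below is the one from this document; use whatever echo $JAVA_HOME returned on your instance):

import os

# JAVA_HOME should contain a bin/java executable
java_home = "/usr/java/jdk1.8.0_161/jre"
print(os.path.exists(os.path.join(java_home, "bin", "java")))  # expect True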

iii. Spark Home


For the Spark home path in the following line:
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"

The name of this Spark parcel directory (SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101 above) must be
exactly the same as the one present under your /opt/cloudera/parcels/ directory.

Run the following command to check it, and replace it in case you have a different
version/distribution installed.

ls /opt/cloudera/parcels/
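
If you prefer, the sketch below does the same lookup from Python and picks out the SPARK2 parcel name, so you can paste the exact directory name into SPARK_HOME (it only assumes the default /opt/cloudera/parcels/ location):

import os

# find the versioned SPARK2 parcel name instead of typing it by hand
parcels = os.listdir("/opt/cloudera/parcels/")
spark2_parcels = [p for p in parcels if p.startswith("SPARK2-")]
print(spark2_parcels)
# SPARK_HOME is then "/opt/cloudera/parcels/" + spark2_parcels[0] + "/lib/spark2/"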
iv. Version of py4j and pyspark.zip file

sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")


sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

After you have identified your correct Spark home path, this step verifies the version of the
py4j file.

For that, run the following command (make sure to modify this command according to the Spark
home you identified in point iii)

ls /opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib

You should be able to see the py4j and pyspark files. Make sure you modify the file names in
the above code according to your instance.
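
Rather than hard-coding the py4j version, you can also let Python discover it. This is an optional sketch; it assumes SPARK_HOME has already been set exactly as in point 4.

import glob
import os
import sys

# discover the py4j zip under $SPARK_HOME/python/lib instead of hard-coding "0.10.6"
pylib = os.path.join(os.environ["SPARK_HOME"], "python", "lib")
py4j_zip = glob.glob(os.path.join(pylib, "py4j-*-src.zip"))[0]
sys.path.insert(0, py4j_zip)
sys.path.insert(0, os.path.join(pylib, "pyspark.zip"))
print(py4j_zip)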
5. Test your PySpark setup

a. Open a new Jupyter Notebook


b. Copy-paste the environment variables that you finalized in point 4 into a cell.
(You need to do this for every PySpark notebook that you create.)

c. Run the cell; you should not see any errors.


d. Now, let's initialize the SparkContext object. Copy-paste the following code into a new cell
> Run the cell
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("jupyter_Spark").setMaster("yarn-client")
sc = SparkContext(conf=conf)
sc

You should be able to see the SparkContext details (Spark version, master, and app name) as output.

This means your PySpark setup is working fine.
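
Optionally, you can go one step further and run a tiny job to confirm that the cluster can actually execute work; a minimal sketch:

# distribute a small range across the cluster and sum it back
rdd = sc.parallelize(range(100))
print(rdd.sum())  # expect 4950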


6. Closing Jupyter Notebook

To stop the notebook, open the terminal/PuTTY window where the Jupyter process is running and
press Ctrl + C.
Enter 'y' to shut down the notebook server.
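
If the SparkContext from step 5 is still active, it is also good practice to release its cluster resources by running the following in a notebook cell before closing the server:

# stop the SparkContext so YARN releases the application's containers
sc.stop()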
