Sumit Kothari Apache Spark and Scala Practical 17
_______________
Faculty in-charge
Apache Spark and Scala Practical
Name: Sumit Kothari
Roll number: 17
Practical 1
Aim: Write steps to install Scala
If the JDK is not already installed, download the latest version suited to your computer's requirements from oracle.com and complete the installation.
Downloading Scala: Before starting the installation process, you need to download Scala. All versions of Scala for Windows are available on scala-lang.org. Download Scala and follow the instructions below to install it.
Beginning with the Installation:
1. Getting Started
2. Move on to Installing
3. Installation Process
4. Finished Installation
After completing the installation, any IDE or text editor can be used to write Scala code and run it from the IDE or the command prompt using the commands:
scalac file_name.scala
scala class_name
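For example, a minimal program (the file and object name HelloScala are only illustrative) can be compiled and run like this:

// HelloScala.scala
object HelloScala {
  def main(args: Array[String]): Unit = {
    println("Hello, Scala!")
  }
}

Compile and run from the command prompt:
scalac HelloScala.scala
scala HelloScala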
Name: Sumit Kothari
Roll number: 17
Practical 2
Aim: Write steps to setup and configure Apache Spark
Prerequisites:
• Java Development Kit (JDK): Ensure JDK 8 or later is installed.
• Apache Spark: Download the latest stable version from the official website
(https://spark.apache.org/downloads.html).
• Hadoop: If you're using Hadoop for distributed processing, download and install it.
• Scala (Optional): If you plan to write Spark applications in Scala, install Scala.
Setup and Configuration:
1. Unzip Spark: Extract the downloaded Spark distribution to a desired location.
2. Set Environment Variables:
o SPARK_HOME: Set this variable to the extracted Spark directory.
o JAVA_HOME: Set this variable to the directory where your JDK is installed.
o HADOOP_HOME: If using Hadoop, set this variable to the Hadoop
installation directory.
o PATH: Add the bin directory of Spark and Hadoop (if applicable) to your
system's PATH.
3. Run Spark:
o Local Mode: To run Spark locally on your machine, open a terminal and
navigate to the bin directory of Spark. Execute the following command:
./spark-shell
o Standalone Mode: For a standalone cluster, configure spark-env.sh in the conf
directory of Spark. Set the necessary environment variables (e.g., SPARK_MASTER_HOST,
SPARK_WORKER_CORES, SPARK_WORKER_MEMORY) and start the master and worker
nodes with the scripts in the sbin directory.
o YARN Mode: If using YARN, configure spark-defaults.conf and submit
applications with spark-submit using --master yarn.
o Mesos Mode: If using Mesos, configure spark-defaults.conf and submit
applications with spark-submit using a mesos:// master URL.
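As a quick check that the configuration above works, a small job can be run in the shell started in Local Mode. A minimal sketch; the numbers are only illustrative:

// inside spark-shell, the SparkContext is available as sc
val nums = sc.parallelize(1 to 100)   // distribute the numbers 1 to 100 as an RDD
println(nums.reduce(_ + _))           // prints 5050 if Spark is set up correctly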
Name: Sumit Kothari
Roll number: 17
Practical 3
Aim: Write a scala program to perform basic mathematical operations
Source code & Output:
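A minimal sketch of such a program; the operand values are only illustrative:

object MathOps {
  def main(args: Array[String]): Unit = {
    val a = 15
    val b = 4
    println(s"Sum:        ${a + b}")
    println(s"Difference: ${a - b}")
    println(s"Product:    ${a * b}")
    println(s"Quotient:   ${a / b}")
    println(s"Remainder:  ${a % b}")
  }
}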
Name: Sumit Kothari
Roll number: 17
Practical 4
4.1
Aim: Write a scala program to compute the sum of the two given integer values. If the two
values are the same, then return triple their sum.
Source code & Output:
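A minimal sketch of one way to write this:

object TripleSum {
  // returns a + b, or triple the sum when both values are equal
  def sum(a: Int, b: Int): Int =
    if (a == b) (a + b) * 3 else a + b

  def main(args: Array[String]): Unit = {
    println(sum(2, 3))  // 5
    println(sum(4, 4))  // 24
  }
}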
4.2
Aim: Write a scala program to compute the sum of the two given integer values. If the two
values are the same, then return triple their sum.
Source code & Output:
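A sketch of a variation that reads the two integers from the console (the use of readLine for input is an assumption):

object TripleSumInput {
  def main(args: Array[String]): Unit = {
    print("Enter the first integer: ")
    val a = scala.io.StdIn.readLine().trim.toInt
    print("Enter the second integer: ")
    val b = scala.io.StdIn.readLine().trim.toInt
    val result = if (a == b) (a + b) * 3 else a + b
    println(s"Result: $result")
  }
}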
4.3
Aim: Write a scala program to print the table of a number.
Source code & Output:
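A minimal sketch, printing the table of 7 (the number is only illustrative):

object MultiplicationTable {
  def main(args: Array[String]): Unit = {
    val n = 7
    for (i <- 1 to 10)
      println(s"$n x $i = ${n * i}")
  }
}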
Name: Sumit Kothari
Roll number: 17
Practical 5
5.1
Aim: Write a program to greet the user
Source code & Output:
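A minimal sketch that reads the user's name from the console:

object Greet {
  def main(args: Array[String]): Unit = {
    print("Enter your name: ")
    val name = scala.io.StdIn.readLine()
    println(s"Hello, $name! Welcome to Scala.")
  }
}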
5.2
Aim: Write a recursive function that calculates the factorial
Source code & Output:
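A minimal sketch of a recursive factorial function:

object Factorial {
  // recursive definition: n! = n * (n - 1)!, with 0! = 1! = 1
  def factorial(n: Int): Long =
    if (n <= 1) 1L else n * factorial(n - 1)

  def main(args: Array[String]): Unit = {
    println(factorial(5))  // 120
  }
}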
5.3
Aim: Write a program to print a List
Source code & Output:
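A minimal sketch; the list contents are only illustrative:

object PrintList {
  def main(args: Array[String]): Unit = {
    val nums = List(10, 20, 30, 40, 50)
    nums.foreach(println)          // one element per line
    println(nums.mkString(", "))   // the whole list on one line
  }
}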
5.4
Aim: Write a program to add two numbers
Source code & Output:
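A minimal sketch; the operands are only illustrative:

object AddTwoNumbers {
  def main(args: Array[String]): Unit = {
    val a = 12
    val b = 8
    println(s"Sum of $a and $b is ${a + b}")
  }
}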
5.5
Aim: Write a higher order function to apply functions to a list
Source code & Output:
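A minimal sketch in which applyToAll is a higher-order function (the function and value names are only illustrative):

object HigherOrder {
  // higher-order function: takes another function f and applies it to every element
  def applyToAll(xs: List[Int], f: Int => Int): List[Int] = xs.map(f)

  def main(args: Array[String]): Unit = {
    val nums = List(1, 2, 3, 4)
    println(applyToAll(nums, x => x * x))  // List(1, 4, 9, 16)
    println(applyToAll(nums, _ + 10))      // List(11, 12, 13, 14)
  }
}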
5.6
Aim: Write an anonymous function to filter even numbers
Source code & Output:
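A minimal sketch using an anonymous function as the filter predicate:

object FilterEven {
  def main(args: Array[String]): Unit = {
    val nums = List(1, 2, 3, 4, 5, 6, 7, 8)
    val evens = nums.filter(n => n % 2 == 0)  // anonymous function
    println(evens)                            // List(2, 4, 6, 8)
  }
}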
Name: Sumit Kothari
Roll number: 17
Practical 6
Aim: To implement a basic word count program using Spark RDDs.
Steps:
1. Create a text file containing repeated words and save it in the C: drive. This file will
be used for the Word Count program.
2. Open Windows PowerShell, change the directory to the Spark bin folder using the cd
command, and then execute the command spark-shell to launch the Spark REPL.
3. Next, load your text file into Spark by typing the following command in the Spark
shell (all the commands used in these steps are sketched after this list).
4. Next, execute the command text.collect() to display the contents of the file loaded into
Spark.
5. Use flatMap to split each line of the loaded text into words, creating a new RDD (named counts here).
6. Next, run counts.collect() to display the list of words after splitting each line of text.
7. Use the command given below to map each word to a key-value pair where the word
is the key, and the value is 1.
8. Type in the next command to retrieve and display the results of the mapped RDD
from Spark to the driver program as a list.
9. The command written below is used to aggregate the values of each key in the RDD
mapf by summing them up, creating a new RDD reducef with the total counts for each unique
key.
10. Use reducef.collect() to retrieve and display the aggregated results from the reducef
RDD.
11. Enter the command mentioned below in order to save the aggregated results from the
reducef RDD to the "spark_output" folder in the C drive. This folder will be created
automatically. One does not need to create it manually.
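The commands referred to in the steps above follow this general pattern. This is a sketch: the input file name sample.txt is an assumption, while the RDD names text, counts, mapf and reducef and the spark_output folder match the steps.

// Practical 6: word count in spark-shell
val text = sc.textFile("C:/sample.txt")            // step 3: load the text file
text.collect()                                     // step 4: show the file contents
val counts = text.flatMap(line => line.split(" ")) // step 5: split each line into words
counts.collect()                                   // step 6: show the words
val mapf = counts.map(word => (word, 1))           // step 7: map each word to (word, 1)
mapf.collect()                                     // step 8: show the mapped pairs
val reducef = mapf.reduceByKey(_ + _)              // step 9: sum the counts for each word
reducef.collect()                                  // step 10: show the aggregated counts
reducef.saveAsTextFile("C:/spark_output")          // step 11: save the results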
Output: