CCA-175 Docs and Projects
Each CCA question requires you to solve a particular scenario. In some cases, a tool such as Impala or Hive may be used. In other cases, coding is required. To reduce development time on Spark questions, a template is often provided that contains a skeleton of the solution, asking the candidate to fill in the missing lines with functional code. The template is written in either Scala or Python.
You are not required to use the template and may solve the scenario using a language you prefer. Be aware,
however, that coding every problem from scratch may take more time than is allocated for the exam.
Your exam is graded immediately upon submission and you are e-mailed a score report the same day as your
exam. Your score report displays the problem number for each problem you attempted and a grade on that
problem. If you fail a problem, the score report includes the criteria you failed (e.g., “Records contain
incorrect data” or “Incorrect file format”). We do not report more information in order to protect the exam
content.
If you pass the exam, you receive a second e-mail within a few days of your exam with your digital certificate as a PDF, your license number, a LinkedIn profile update, and a link to download your CCA logos for use in your personal business collateral and social media profiles.
Required Skills
Data Ingest
The skills required to transfer data between external systems and your cluster. This includes the following:
•Change the delimiter and file format of data during import using Sqoop
•Load data into and out of HDFS using the Hadoop File System commands (both skills are sketched below)
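A hedged sketch of both ingest skills; the connection string, database, table, and paths are placeholders, not exam values:

sqoop import \
  --connect jdbc:mysql://dbhost/retail_db \
  --username dbuser --password dbpass \
  --table orders \
  --fields-terminated-by '\t' \
  --as-avrodatafile \
  --target-dir /user/simplilearn/orders_avro

hdfs dfs -put localdata.csv /user/simplilearn/      # load a local file into HDFS
hdfs dfs -get /user/simplilearn/orders_avro .       # copy HDFS data back to local disk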
Transform, Stage, and Store
Convert a set of data values in a given format stored in HDFS into new data values or a new data format and
write them into HDFS.
•Write the results from an RDD back into HDFS using Spark
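A minimal sketch of this skill, assuming hypothetical input and output paths:

val lines = sc.textFile("/user/simplilearn/input")   // hypothetical input path
val upper = lines.map(_.toUpperCase)                 // any transformation
upper.saveAsTextFile("/user/simplilearn/output")     // writes part files back into HDFS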
Use Spark SQL to interact with the metastore programmatically in your applications. Generate reports by
using queries against loaded data.
•Use metastore tables as an input source or an output sink for Spark applications
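For example, a hedged Spark 2.x sketch, assuming a SparkSession named spark with Hive support enabled (the database and table names are hypothetical):

val df = spark.table("xyz.simplilearn3")                        // metastore table as an input source
val report = df.filter("salary > 10000")                        // generate a report from the loaded data
report.write.mode("overwrite").saveAsTable("xyz.high_salary")   // metastore table as an output sink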
This is a practical exam and the candidate should be familiar with all aspects of generating a result, not just
writing code.
•Supply command-line options to change your application configuration, such as increasing available
memory
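For instance, memory can be raised at submission time; the values and application file below are illustrative only:

spark-submit --master yarn \
  --driver-memory 2G \
  --executor-memory 4G \
  --num-executors 4 \
  myapp.py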
TEST YOURSELF:
1. Write the missing Spark code in the given program to sort by the Name column.
Output: Array((E04,Amer), (E05,Ankit), (E08,Deshdeep), (E02,Karthik), (E09,Kumar), (E03,Rakesh), (E06,Roopesh), (E01,Shivank), (E07,Tejas), (E10,Venkat))
Program
val emp = sc.textFile("/user/simplilearn/Employee")
val pairRDD = emp.map(x => (x.split(",")(0), x.split(",")(1)))
<Write your code>
Answer
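// Swap each pair to (name, id) so the name becomes the key, sort by that key, then swap back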
val swap1 = pairRDD.map(_.swap).sortByKey().map(_.swap)
2. Create a database named "XYZ" or, if it is already created, use the existing database. Write a Hive DDL script to create a table named "simplilearn3", load a dataset in the given format into the table, and complete the following requirement:
Write a Hive query for employees who have a salary of more than 10,000.
Format:
Sl. No, Name, Age, Salary
Emp001, John, 34, 20000
Paste the create table syntax in the given space.
Answer
select * from simplilearn3 where salary > 10000;
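The question also asks for the create-table syntax, which this answer omits; a minimal Hive DDL sketch matching the stated format (the column names, types, delimiter, and load path are assumptions):

create database if not exists XYZ;
use XYZ;
create table if not exists simplilearn3 (
  sl_no string,
  name string,
  age int,
  salary int
)
row format delimited fields terminated by ',';
load data inpath '/user/simplilearn/employee.csv' into table simplilearn3; -- hypothetical path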
4. Create a database named "XYZ" or, if it is already created, use the existing database. Write a Hive DDL script to create a table named "simplilearn1" and load some sample data into the table in the format given below:
Name Sex Age Father_Name
Example:
Anupam Male 45 Daulat
Paste the create table syntax in the given space.
Answer
create table if not exists simplilearn1(
  name string,
  business_places array<string>,
  sex_age struct<sex:string,age:int>,
  fathername_nuofchild map<string,int>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':';
6. Execute the following Python program, using four local threads to count the words.
Program:
# create a program wordcount.py
import sys
from operator import add
from pyspark import SparkContext
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: wordcount <file>"
        exit(-1)
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)
    <Write your code>
    <Write your code>
    <Write your code>
    output = counts.collect()
    for (word, count) in output:
        print "%s: %i" % (word, count)
    sc.stop()
Answer
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
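To meet the four-local-threads requirement, one way to launch the program is shown below (the input path is a placeholder):

spark-submit --master local[4] wordcount.py /user/simplilearn/input.txt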
7. Find the missing code in the Scala program to display the output in the following format.
Output: Array[(Int, String)] = Array((4,anar), (5,applelichi), (6,bananagrapes), (7,oranges))
Program
val a = sc.parallelize(List("apple","banana","oranges","grapes","lichi","anar"))
val b = a.map(x =>(x.length,x))
<Write your code>
Answer
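// foldByKey starts from the empty string and concatenates all names that share the same length key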
b.foldByKey("")(_+_).collect
8. Write the missing code in the given program to identify animals whose names have four letters, producing the expected output.
Output: Array((4,lion))
Program
val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant","falcon","squid"),2)
val d = c.keyBy(_.length)
<Write your code>
Answer
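// subtractByKey drops pairs from b whose key (the name length) also appears in d, leaving only (4,lion)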
b.subtractByKey(d).collect
11. Write the missing code in the given Scala program to display the output in the format below.
Output: Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 2)
Program
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
< Write your code>
Answer
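// countByValue returns a driver-side Map from each element to its number of occurrences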
b.countByValue()
Projects:
https://drive.google.com/drive/folders/0B9tN1aTNNV0RLWhLOTdWSHN5NXM?usp=sharing