Important Spark Questions
At a high level, every Apache Spark application consists of a driver program that
launches various parallel operations on executor Java Virtual Machines (JVMs)
running either in a cluster or locally on the same machine. In Databricks, the notebook
interface is the driver program. The driver program contains the application's main loop; it creates distributed datasets on the cluster and then applies operations (transformations and actions) to those datasets.
Driver programs access Apache Spark through a SparkSession object regardless of
deployment location.
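As a minimal sketch (the application name is illustrative), a driver program typically obtains its SparkSession like this; in a Databricks notebook the session is already created for you as spark:

    from pyspark.sql import SparkSession

    # Create (or reuse) the SparkSession the driver uses to talk to the cluster.
    spark = SparkSession.builder.appName("example-app").getOrCreate()

    # The lower-level RDD entry point is reachable through the session.
    sc = spark.sparkContext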
At a high level, when an action is called on an RDD, Spark creates the DAG and submits it to the DAG scheduler.
• The DAG scheduler divides the operators into stages of tasks. A stage is made up of tasks based on partitions of the input data. The DAG scheduler pipelines operators together; for example, many map operators can be scheduled in a single stage. The final output of the DAG scheduler is a set of stages.
• The stages are passed on to the task scheduler. The task scheduler launches the tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler does not know about the dependencies between stages.
• The worker (slave node) executes the tasks (see the sketch below).
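A minimal PySpark sketch of that flow (the data is invented): the narrow operations are pipelined into one stage, reduceByKey introduces a shuffle and therefore a second stage, and nothing runs until the action is called.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext

    rdd = sc.parallelize(["a b", "b c", "a c"], 2)

    # Narrow transformations: pipelined together into a single stage.
    pairs = rdd.flatMap(lambda line: line.split()).map(lambda w: (w, 1))

    # Wide transformation: requires a shuffle, so the DAG scheduler cuts a stage boundary here.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Only the action builds the DAG, splits it into stages, and submits tasks.
    print(counts.collect())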
1) Spark SQL
Many data scientists, analysts, and general business intelligence users rely on
interactive SQL queries for exploring data. Spark SQL is a Spark module for
structured data processing. It provides a programming abstraction called
DataFrame and can also act as a distributed SQL query engine. It enables unmodified
Hadoop Hive queries to run up to 100x faster on existing deployments and data. It
also provides powerful integration with the rest of the Spark ecosystem (e.g.,
integrating SQL query processing with machine learning).
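A small hedged sketch of both entry points (the column names and rows are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

    # DataFrame API
    df.filter(df.age > 40).show()

    # The same query through the distributed SQL engine
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 40").show()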
2) Spark Streaming
Many applications need the ability to process and analyse not only batch data,
but also streams of new data in real-time. Running on top of Spark, Spark
Streaming enables powerful interactive and analytical applications across both
streaming and historical data, while inheriting Spark’s ease of use and fault
tolerance characteristics. It readily integrates with a wide variety of popular data
sources, including HDFS, Flume, Kafka, and Twitter.
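A minimal sketch using the classic DStream API (the socket source on localhost:9999 is a placeholder; in practice it could be Kafka, Flume, or HDFS):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-demo")
    ssc = StreamingContext(sc, 10)          # 10-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()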
3) Machine Learning: MLlib
Machine learning has quickly emerged as a critical piece in mining Big Data for
actionable insights. Built on top of Spark, MLlib is a scalable machine learning
library that delivers both high-quality algorithms (e.g., multiple iterations to
increase accuracy) and blazing speed (up to 100x faster than MapReduce). The
library is usable in Java, Scala, and Python as part of Spark applications, so that
you can include it in complete workflows.
4) Graph Computation: GraphX
GraphX is a graph computation engine built on top of Spark that enables users
to interactively build, transform and reason about graph structured data at
scale. It comes complete with a library of common algorithms.
5) General Execution: Spark Core
Spark Core is the underlying general execution engine for the Spark platform that
all other functionality is built on top of. It provides in-memory computing
capabilities to deliver speed, a generalized execution model to support a wide
variety of applications, and Java, Scala, and Python APIs for ease of development.
A Single Node cluster is a cluster consisting of a Spark driver and no Spark workers.
Such clusters support Spark jobs and all Spark data sources, including Delta Lake.
In contrast, Standard clusters require at least one Spark worker to run Spark jobs.
Single Node cluster use cases:
• Running single-node machine learning workloads that need Spark to load and save data
• Lightweight exploratory data analysis (EDA)
Single Node cluster properties:
• Runs Spark locally with as many executor threads as logical cores on the cluster (the number of cores on the driver minus 1).
• Has 0 workers, with the driver node acting as both master and worker.
• The executor stderr, stdout, and log4j logs are in the driver log.
• Cannot be converted to a Standard cluster. Instead, create a new cluster with the mode set to Standard.
RDD lineage (also called the RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD and forms a logical execution plan.
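The lineage Spark has recorded for an RDD can be inspected with toDebugString; the operations below are only an illustration:

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext

    rdd = sc.parallelize(range(100), 4)
    reduced = rdd.map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)

    # Prints the chain of parent RDDs, i.e. the lineage / logical execution plan.
    print(reduced.toDebugString().decode("utf-8"))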
At a high level, there are two kinds of transformations that can be applied to RDDs: narrow transformations and wide transformations. Wide transformations result in stage boundaries.
The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions present in the textFile. For example, if there are 4 partitions, then 4 sets of tasks will be created and submitted in parallel, provided there are enough workers/cores. The diagram below illustrates this in more detail:
[Figure: Repartition. Source: https://pixipanda.com/]
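A rough sketch of the idea (the input path is hypothetical): the number of tasks in the first stage equals the number of partitions of the input, and repartition() changes that parallelism at the cost of a shuffle.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext

    rdd = sc.textFile("/data/input.txt", minPartitions=4)   # hypothetical path
    print(rdd.getNumPartitions())    # e.g. 4 -> 4 tasks in the first stage

    # repartition() shuffles the data into a new number of partitions.
    wider = rdd.repartition(8)
    print(wider.getNumPartitions())  # 8 -> 8 tasks for the following stages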
14. What is data skew?
Data Skew
Often the data is split into partitions based on a key, for instance the first letter of a name. If values are not evenly distributed across this key, then more data will be placed in one partition than in another. For example, partition A might end up with three times as many records as partitions B and C. Because the A partition is 3 times larger than the other two, it will take approximately 3 times as long to compute. As the next stage of processing cannot begin until all three partitions are evaluated, the overall results from the stage will be delayed.
A skew hint must contain at least the name of the relation with skew. A relation is
a table, view, or a subquery. All joins with this relation then use skew join
optimization.
There might be multiple joins on a relation and only some of them will suffer from
skew. Skew join optimization has some overhead so it is better to use it only when
needed. For this purpose, the skew hint accepts column names. Only joins with
these columns use skew join optimization.
You can also specify skew values in the hint. Depending on the query and data,
the skew values might be known (for example, because they never change) or might
be easy to find out. Doing this reduces the overhead of skew join optimization.
Otherwise, Delta Lake detects them automatically.
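On Databricks the skew hint is written roughly as follows (the table names, column name, and skew values are invented, and the exact syntax may vary by runtime version):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hint on the whole relation.
    spark.sql("""
      SELECT /*+ SKEW('orders') */ *
      FROM orders JOIN customers ON orders.cust_id = customers.cust_id
    """)

    # Hint restricted to a column, with the known skewed values supplied.
    spark.sql("""
      SELECT /*+ SKEW('orders', 'cust_id', (0, 1)) */ *
      FROM orders JOIN customers ON orders.cust_id = customers.cust_id
    """)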
Key/value RDDs are commonly used to perform aggregations, and often we will
do some initial ETL (extract, transform, and load) to get our data into a key/value
format. Key/value RDDs expose new operations (e.g., counting up reviews for
each product, grouping together data with the same key, and grouping together
two different RDDs).
Common transformations on one pair RDD (example RDD: {(1, 2), (3, 4), (3, 6)}):
• reduceByKey(func) - combine values with the same key.
• groupByKey() - group values with the same key.
• mapValues(func) - apply a function to each value of a pair RDD without changing the key.
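A quick sketch of these pair RDD operations on that example RDD (output ordering may differ across partitions):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext
    rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])

    print(rdd.reduceByKey(lambda a, b: a + b).collect())   # [(1, 2), (3, 10)]
    print(rdd.groupByKey().mapValues(list).collect())      # [(1, [2]), (3, [4, 6])]
    print(rdd.mapValues(lambda v: v + 1).collect())        # [(1, 3), (3, 5), (3, 7)]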
A shuffle occurs when data is rearranged between partitions. This is required when
a transformation requires information from other partitions, such as summing all
the values in a column. Spark will gather the required data from each partition and
combine it into a new partition, likely on a different executor.
During a shuffle, data is written to disk and transferred across the network,
halting Spark’s ability to do processing in-memory and causing a performance
bottleneck. Consequently we want to try to reduce the number of shuffles being
done or reduce the amount of data being shuffled.
Cluster Mode
In cluster mode, the Spark driver (the application master) is started on one of the worker machines. The client that submits the application can therefore go away after initiating it, or continue with other work: it works on a "fire and forget" basis.
Client Mode
In client mode, the client that submits the Spark application starts the driver, and the driver maintains the SparkContext. Until that particular job finishes, the driver manages the tasks, so the client has to stay in touch with the cluster and remain online until the job completes.
In client mode, the client can keep receiving information about the status of the job and any changes happening to it, so if we want to keep monitoring the status of a particular job, we can submit it in client mode. In this mode the entire application depends on the local machine, since the driver resides there; if anything goes wrong on the local machine, the driver goes down and the entire application goes down with it. Hence this mode is not suitable for production use cases. However, it is good for debugging or testing, since the outputs are printed to the driver terminal on the local machine.
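The deploy mode is chosen with the --deploy-mode flag of spark-submit; the master URL and application file below are placeholders:

    # Driver runs on a worker inside the cluster ("fire and forget").
    spark-submit --master yarn --deploy-mode cluster my_app.py

    # Driver runs on the submitting (client) machine.
    spark-submit --master yarn --deploy-mode client my_app.py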
20. What file formats are used in big data, and what are the differences between them?
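Common formats are plain text/CSV and JSON (row-based, human-readable), Avro (row-based binary, good for write-heavy pipelines and schema evolution), and the columnar formats Parquet and ORC (good for analytical reads and compression). A hedged sketch of reading and writing them in Spark (paths are placeholders; Avro requires the external spark-avro package):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.option("header", "true").csv("/data/in.csv")      # row-based text
    df.write.mode("overwrite").json("/data/out_json")                 # row-based text
    df.write.mode("overwrite").parquet("/data/out_parquet")           # columnar, compressed
    df.write.mode("overwrite").orc("/data/out_orc")                   # columnar, Hive-friendly
    df.write.mode("overwrite").format("avro").save("/data/out_avro")  # row-based binary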
Spark operates on data in fault-tolerant file systems like HDFS or S3. Hence, all of
the RDDs generated from the fault-tolerant data are also fault-tolerant. However,
this is not the case for Spark Streaming as the data in most cases is received over
the network (except when fileStream is used). To achieve the same fault-tolerance
properties for all of the generated RDDs, the received data is replicated among multiple Spark executors on worker nodes in the cluster (the default replication factor is 2). This leads to two kinds of data in the system that need to be recovered in the event of failures:
1. Data received and replicated - This data survives failure of a single worker
node as a copy of it exists on one of the other nodes.
2. Data received but buffered for replication - Since this is not replicated,
the only way to recover this data is to get it again from the source.
Furthermore, there are two kinds of failures that we should be concerned about:
1. Failure of a worker node - Any of the worker nodes running executors can fail, and all in-memory data on those nodes will be lost. If any receivers were running on the failed nodes, then their buffered data will be lost.
2. Failure of the driver node - If the driver node running the Spark Streaming application fails, then obviously the SparkContext is lost, and all executors with their in-memory data are lost.
With this basic knowledge, let us understand the fault-tolerance semantics of
Spark Streaming.
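In practice the standard mitigations are checkpointing to a fault-tolerant store and enabling the receiver write-ahead log, so even buffered-but-not-yet-replicated data can be recovered. A hedged sketch (the checkpoint path and socket source are placeholders):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("ft-demo")
            # Write received blocks to a write-ahead log before processing them.
            .set("spark.streaming.receiver.writeAheadLog.enable", "true"))

    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)

    # Checkpoint to a fault-tolerant file system such as HDFS or S3.
    ssc.checkpoint("hdfs:///checkpoints/ft-demo")

    lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
    lines.count().pprint()

    ssc.start()
    ssc.awaitTermination()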
The reduceByKey() transformation gathers together pairs that have the same key
and applies a function to two associated values at a time. reduceByKey() operates
by applying the function first within each partition on a per-key basis and then
across the partitions.
While both the groupByKey() and reduceByKey() transformations can often be used
to solve the same problem and will produce the same answer,
the reduceByKey() transformation works much better for large distributed datasets.
This is because Spark knows it can combine output with a common key on each
partition before shuffling (redistributing) the data across nodes. Only use
groupByKey() if the operation would not benefit from reducing the data before the
shuffle occurs.
groupByKey() simply groups your dataset based on a key. It results in a data shuffle when the RDD is not already partitioned appropriately; the sketch below contrasts reduceByKey, groupByKey, and foldByKey.
• foldByKey merges the values for each key using an associative function and a
neutral "zero value".
23. What is lazy evaluation in Spark, and what are its benefits?
Lazy Evaluation: transformations in Spark are evaluated lazily. Calling a transformation only records it in the logical plan (the DAG); nothing is executed until an action is invoked. This lets Spark optimize the whole chain of operations and avoid computing results that are never used.
Spark SQL is one of the most technically involved components of Apache Spark. It
powers both SQL queries and the DataFrame API. At the core of Spark SQL is the
Catalyst optimizer, which leverages advanced programming language features (e.g.
Scala’s pattern matching and quasiquotes) in a novel way to build an extensible
query optimizer.
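A minimal sketch of lazy evaluation and the Catalyst plans (the computation is invented): the transformations only build a plan, explain() shows what Catalyst produced, and nothing executes until the action.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000000)

    # Transformations only build up a logical plan; nothing has executed yet.
    filtered = df.filter(df.id % 2 == 0).select((df.id * 2).alias("doubled"))

    # Inspect the analyzed, optimized, and physical plans produced by Catalyst.
    filtered.explain(True)

    # Only an action such as count() actually triggers execution.
    print(filtered.count())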
When one side of a join is relatively small, we choose to broadcast it to avoid a shuffle and improve performance. But because the broadcast table is first collected to the driver and then redundantly distributed to every executor, broadcasting a relatively large table puts heavy pressure on both the driver and the executors.
But because Spark is a distributed computing engine, a large amount of data can be divided into n smaller data sets and processed in parallel. Applying this idea to the join gives the Shuffle Hash Join: Spark SQL partitions both tables of the join by the join key into n partitions and then hash-joins the corresponding partitions of the two tables. To a certain extent this reduces the pressure that broadcasting would place on the driver, and it also means no single executor has to hold the entire broadcast table in memory.
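On Spark 3.0 and later the choice can be nudged with a join hint; a hedged sketch (the tables are hypothetical and assumed too large to broadcast):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    orders = spark.table("orders")        # hypothetical large table
    customers = spark.table("customers")  # hypothetical large table

    # Ask for a shuffle hash join instead of the default sort-merge join.
    joined = orders.join(customers.hint("shuffle_hash"), "cust_id")
    joined.explain()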
As we all know, in common database models (such as the star model or the snowflake model), tables are generally divided into two types: fact tables and dimension tables. Dimension tables (small tables) generally refer to fixed, slowly changing data such as contacts or items, and their size is usually limited. Fact tables generally record transactions, such as sales records, and usually keep growing over time; in other words, they are large tables.
Because a join connects the records of two tables that share the same key value, the most direct way to join two tables in Spark SQL is to partition both tables by the key and then, within each partition, connect the records that have the same key value. But this inevitably involves a shuffle, and a shuffle is a relatively time-consuming operation in Spark, so we should try to design Spark applications to avoid a lot of shuffling.
When a dimension table is joined with a fact table, we can avoid the shuffle by distributing all of the data of the (limited-size) dimension table to every node for the fact table to use. Each executor stores all of the dimension table's data, so to a certain extent we sacrifice space in exchange for avoiding a lot of time-consuming shuffle work. In Spark SQL this is called a Broadcast Join.
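A hedged sketch of a broadcast join in PySpark (the fact and dimension tables are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    sales = spark.table("sales")   # large fact table (hypothetical)
    items = spark.table("items")   # small dimension table (hypothetical)

    # Broadcasting the dimension table avoids shuffling the large fact table.
    joined = sales.join(broadcast(items), "item_id")
    joined.explain()   # the physical plan should show a BroadcastHashJoin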