Question

The document outlines a comprehensive guide to PySpark, covering topics such as RDDs, DataFrames, SQL integration, data preprocessing, and performance optimization techniques. It also includes a series of interview questions related to PySpark, SQL, AWS, and Power BI, aimed at assessing technical knowledge and problem-solving skills. Additionally, it details the structure of interviews, including technical and managerial rounds, focusing on practical applications and theoretical understanding.

Uploaded by

21f1003899
Copyright
© All Rights Reserved

➊ Basics of PySpark

→Understanding Resilient Distributed Datasets (RDDs)


→Differences between RDD, DataFrame, and Dataset
→SparkSession (entry point to PySpark)
→PySpark installation and setup

➋ DataFrames and Transformations


→Creating DataFrames (from RDD, CSV, JSON, etc.)
→Common transformations (select, filter, withColumn, groupBy)
→Actions (collect, count, show)
→Lazy evaluation concept

➌ PySpark SQL
→Registering DataFrames as temporary views
→Writing SQL queries within PySpark
→Using built-in SQL functions
→Joins (inner, outer, left, right)

➍ Data Preprocessing
→Handling null values (fillna, dropna)
→Changing column data types (cast)
→Renaming columns
→Working with schemas

➎ UDFs (User-Defined Functions)


→Creating and applying UDFs in PySpark
→Performance considerations with UDFs
→Vectorized UDFs using Pandas

➏ Working with Partitions


→Understanding partitions and parallelism
→Repartitioning and coalescing DataFrames
→Optimizing performance by managing partitions

➐ Performance Optimization
→Broadcast joins
→Catalyst optimizer and Tungsten execution engine
→Caching and persistence (cache, persist)
→Skew handling and data shuffling

➑ File Formats and Data Sources


→Reading and writing data (CSV, Parquet, ORC, Avro)
→Compression techniques for big data
→Working with external databases using JDBC

➒ Error Handling and Debugging


→Common PySpark errors and troubleshooting
→Understanding and analyzing logs
→Using explain() to debug query plans

-----------------------------------------------------------------------------
PySpark / Databricks interview questions for Azure data engineers

1) How would you apply indexing on a table in Databricks?
2) Explain the use of the DAG in Spark.
3) Explain optimization techniques in Spark.
4) How do you handle skewed data in PySpark?
5) What shuffle operations are available in PySpark?
6) Difference between Spark Streaming and Structured Streaming in PySpark?
7) Explain the different types of joins in PySpark.
8) Explain fault tolerance in PySpark.
9) What are broadcast variables and broadcast joins?
10) Explain lazy evaluation in PySpark.
11) What are RDD, DataFrame, and Dataset in PySpark?
12) What transformations are available in PySpark?
13) How do you handle null values in PySpark?
14) How do you perform aggregations in PySpark?
15) Difference between cluster mode and client mode?
16) Difference between partitioning and bucketing?
17) What are the different cluster options in Databricks?
18) Explain the Databricks architecture.
19) Difference between MapReduce and Spark?
20) How do you read data from a CSV file to create a DataFrame?
21) How do you remove duplicate values from a DataFrame?
22) How do you add a new column to a DataFrame?
23) How do you select only a few columns from a DataFrame?
24) How do you drop columns from a DataFrame?
25) Difference between repartition and coalesce?
26) How do you handle your PySpark code deployment?
27) How do you call one notebook from another notebook?
28) How do you connect a Databricks notebook to an ADF pipeline?
29) How do you estimate the amount of resources for your Spark job?
30) How do you read data from a URL in Databricks?
31) What is Unity Catalog?
32) You have a CSV file; how would you save it in Delta format?
33) You have a table in Databricks; how would you optimize it?
34) Difference between caching and persistence?
35) How does Auto Loader work?
36) You have secrets in a key vault and need to pass a value into the notebook; how
do you do it?
37) How did you implement incremental loading?
38) How do you manage metadata in Databricks?
39) How did you use time travel in your project?
40) Why is Databricks better than dataflow?

----------------------------------------------------------------------
1. What is SQL, and why is it used?
2. Write a query to fetch the second-highest salary from the Employee table.
3. What are the different types of SQL commands?
4. Write a query to find duplicate records in a table.
5. What is the difference between DELETE and TRUNCATE?
6. Write a query to get the department with the highest number of employees.
7. What are joins in SQL? Name the types of joins.
8. Write a query to fetch records where name starts with ‘A’.
9. What is a primary key, and how is it different from a unique key?
10. Write a query to fetch employees who earn more than the average salary.
11. What is a foreign key, and why is it important?
12. Write a query to get the top 3 highest salaries in the Employee table.
13. What is the difference between WHERE and HAVING clauses?
14. Write a query to fetch common records from two tables.
15. What is normalization? Explain its types.
16. Write a query to create a table with constraints (primary key, unique, and
foreign key).
17. What are indexes in SQL, and what are their types?
18. Write a query to count the number of employees in each department.
19. What is the difference between clustered and non-clustered indexes?
20. Write a query to find employees who have not been assigned a department.
21. What are aggregate functions in SQL? Give examples.
22. Write a query to combine the results of two tables using UNION.
23. What is the difference between UNION and UNION ALL?
24. Write a query to fetch the nth highest salary in a table.
25. What is a self-join, and when would you use it?
26. Write a query to get the total salary paid to employees in each department.
27. What is the difference between RANK(), DENSE_RANK(), and ROW_NUMBER()?
28. Write a query to update the salary of employees by 10% in the Employee table.
29. What are ACID properties in a database?
30. Write a query to delete duplicate records from a table while keeping one
instance.
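
A few of the query questions above (nth highest salary, employees with no department,
headcount per department) can be tried out with a minimal sketch like the one below.
It is an illustration only, not a reference answer: it assumes SQLite through Python's
standard sqlite3 module, and the Employee columns and sample rows are made up here.
LIMIT/OFFSET is SQLite syntax; other databases use TOP or FETCH FIRST instead.

```python
import sqlite3


def build_employee_db():
    """Create an in-memory database with a small, hypothetical Employee table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Employee ("
        "id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Employee (name, department, salary) VALUES (?, ?, ?)",
        [
            ("Asha", "Sales", 50000),
            ("Ben", "Sales", 70000),
            ("Chen", "IT", 90000),
            ("Dina", "IT", 70000),
            ("Amit", None, 40000),  # not assigned to any department
        ],
    )
    return conn


def nth_highest_salary(conn, n):
    """Return the nth highest distinct salary, or None if there are fewer than n."""
    row = conn.execute(
        "SELECT DISTINCT salary FROM Employee "
        "ORDER BY salary DESC LIMIT 1 OFFSET ?",
        (n - 1,),
    ).fetchone()
    return row[0] if row else None


def employees_without_department(conn):
    """Return names of employees whose department is NULL."""
    rows = conn.execute(
        "SELECT name FROM Employee WHERE department IS NULL ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def employees_per_department(conn):
    """Return (department, headcount) pairs, skipping unassigned employees."""
    return conn.execute(
        "SELECT department, COUNT(*) FROM Employee "
        "WHERE department IS NOT NULL "
        "GROUP BY department ORDER BY department"
    ).fetchall()
```

Using DISTINCT before ORDER BY means tied salaries count once, so n = 2 gives the
second-highest salary value rather than the second row.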

-------------------------------------------------------------------------
Round 1: Technical (1 Hr)
✅ Tell me about yourself and any recent projects you have been a part of.
✅ Questions related to your projects.
✅ How would you connect multiple tables from different AWS databases (e.g., RDS,
Redshift) using a single connection in AWS Glue?
✅ What are the different types of triggers in AWS Glue or AWS Step Functions?
✅ How do you deploy code from DEV to QA and PROD environments using AWS services?
✅ How do you create a CI/CD pipeline for deployment in AWS using CodePipeline,
CodeCommit, and CodeBuild?
✅ What types of transformations have you performed in your projects using AWS Glue
or other services?
✅ How can you replace spaces in column names with underscores in source files using
AWS Glue and S3?
✅ What is SCD Type 2, and how can you implement it in AWS using Glue or Redshift?
✅ What are the differences between AWS S3 and AWS Redshift in terms of data storage
and usage?
✅ How do you read data from S3 using Amazon Redshift Spectrum or Athena?
✅ Write a Python function to merge two sorted lists into one sorted list.
✅ Write an SQL query to fetch the 2nd-highest salary department-wise, and describe
different approaches to do it.
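
For the merge question in the list above, one common approach is the two-pointer merge
sketched below. The function name merge_sorted is chosen here for illustration, and
the sketch assumes both inputs are already sorted in ascending order.

```python
def merge_sorted(left, right):
    """Merge two ascending lists into a new ascending list in O(len(left) + len(right))."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller front element from either list.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these slices is non-empty; append whatever remains.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Using <= rather than < keeps equal elements from the first list ahead of those from
the second, which makes the merge stable.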

Round 2: Technical (30 Mins)


✅ How do you create a view in AWS Glue or Amazon Redshift?
✅ Write a DDL command in Amazon Redshift to create a table.
✅ What AWS Glue activities have you used in your project?
✅ Are you familiar with AWS S3 and IAM security? How do you secure access to data
in S3?
✅ What are the different authentication methods available in AWS Glue for accessing
S3 or RDS?
✅ How many team members are there in your team and what's your role in the team?
✅ What are your skillsets, roles, and responsibilities in your current data
engineering project, especially around Spark and AWS?
✅ How would you design a pipeline to ingest, transform, and load (ETL) large
datasets from S3 into Amazon Redshift using Spark?
✅ How would you implement data versioning in a Spark-based pipeline, ensuring that
data can be tracked across versions?
✅ Questions on Spark optimizations: what they are and when to use them.

Round 3: HR
✅ Discussion around my experience and projects, some resume-based questions.
✅ What are you expecting in your next job role?
✅ Package discussion

------------------------------------------------------------------------
SQL Questions
Write a query to fetch the top 5 employees with the highest salaries from an
Employees table.
Write a query to list all records in the Orders table where the delivery_date is
NULL.
Write a query to calculate the total sales and average discount offered in each
product category from the Products table.
Retrieve project details along with the names of project managers for all projects
where the status is "Completed," using a join between the Projects and Employees
tables.
Write a query to fetch all invoices in the Invoices table where the due_date is
more than 15 days past the invoice_date.
Write a query to identify suppliers from the Suppliers table whose total supplied
quantity exceeds 10,000 units, grouped by supplier_id.
How would you find duplicate entries in the Transactions table based on both
transaction_id and customer_id? Write a query to display these duplicate rows.
Write a query to rank products in each category by their total sales revenue using
a ranking function.
Write a query to find all customers in the Customers table who have not placed an
order in the last 6 months.
Write a query to update the salary column in the Employees table to increase by 10%
for employees in the "Marketing" department.
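
As an illustration of the duplicate-entries question above, the sketch below groups on
both key columns and keeps the groups that occur more than once. It assumes SQLite via
Python's sqlite3 module; the amount column and the sample rows are hypothetical.

```python
import sqlite3


def build_transactions_db():
    """Create an in-memory Transactions table with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Transactions ("
        "transaction_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO Transactions VALUES (?, ?, ?)",
        [
            (1, 100, 25.0),
            (1, 100, 25.0),  # same transaction_id and customer_id as the row above
            (2, 100, 10.0),
            (2, 200, 10.0),  # same transaction_id, different customer: not a duplicate
            (3, 300, 5.0),
            (3, 300, 5.0),
            (3, 300, 5.0),
        ],
    )
    return conn


def find_duplicate_transactions(conn):
    """Return (transaction_id, customer_id, occurrences) for repeated key pairs."""
    return conn.execute(
        """
        SELECT transaction_id, customer_id, COUNT(*) AS occurrences
        FROM Transactions
        GROUP BY transaction_id, customer_id
        HAVING COUNT(*) > 1
        ORDER BY transaction_id, customer_id
        """
    ).fetchall()
```

This returns one row per duplicated key pair with its count. To display every
duplicate row in full, the result can be joined back to Transactions on both columns.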

Power BI Questions
Create a dynamic visual to display total revenue by product category and allow
users to filter the data by month and region.
Write a DAX measure to calculate the year-over-year revenue growth for each product
category.
Write a DAX measure to calculate cumulative revenue for each region across quarters
in a fiscal year.
Write a DAX measure to display the top 10 customers by revenue in a table visual.
Explain the difference between calculated columns and measures with examples of
calculating employee bonus percentages and total team bonus.
Explain how to implement RLS in Power BI to ensure department heads only see data
related to their own teams.
Create a fiscal date table where the fiscal year starts in July, and use it to
calculate year-to-date revenue for the fiscal year.
Write a DAX measure to calculate the percentage of returning customers month-over-
month.
Design a KPI dashboard in Power BI to show quarterly profit margins with dynamic
color indicators (e.g., red for below target, green for above target).
Explain how to use Power Query to handle messy data by:
• Splitting a single column with concatenated values into multiple columns.
• Removing special characters from a text column.
• Merging two tables based on a common key.

------------------------------------------------------------------

1/ How would you find the second highest salary in a table without using LIMIT or
TOP?
2/ Write a query to find duplicate rows in a table and the count of their
occurrences.
3/ How would you retrieve the nth highest salary from a table?
4/ Write a query to identify employees whose salaries are greater than the average
salary in their department.
5/ How can you delete duplicate rows from a table while keeping only one instance
of each?
6/ Write a query to find employees who have the highest salary in each department.
7/ How would you retrieve records where a column contains only numeric data, even
if it’s stored as a string?
8/ Write a query to find the running total of sales in a sales table.
9/ How would you find the longest consecutive sequence of dates in a table?
10/ Write a query to pivot a table's data from rows to columns.
11/ How would you calculate the cumulative percentage of a column in a table?
12/ Write a query to find gaps in a sequence of numbers in a table.
13/ How would you retrieve records that belong to a specific time window (e.g.,
last 7 days)?
14/ Write a query to join a table with itself to find employees who share the same
manager.
15/ How would you find the top three customers by total purchase amount in each
region?
16/ Write a query to find the maximum difference between two consecutive values in
a column.
17/ How would you identify the median value of a column in a table?
18/ Write a query to find overlapping date ranges in a table.
19/ How would you rank employees based on their salaries within their department?
20/ Write a query to find rows where data in a specific column repeats after a
certain number of rows.
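
Questions 1/ and 4/ above can both be answered with plain subqueries. The sketch below
is one possible approach, not the only one. It assumes SQLite via Python's sqlite3
module and a hypothetical Employee table with name, department, and salary columns.

```python
import sqlite3

# Second-highest salary without LIMIT or TOP: the largest value below the maximum.
SECOND_HIGHEST = """
SELECT MAX(salary)
FROM Employee
WHERE salary < (SELECT MAX(salary) FROM Employee)
"""

# Employees paid more than the average of their own department (correlated subquery).
ABOVE_DEPT_AVG = """
SELECT e.name
FROM Employee AS e
WHERE e.salary > (
    SELECT AVG(d.salary) FROM Employee AS d WHERE d.department = e.department
)
ORDER BY e.name
"""


def make_db():
    """Create an in-memory Employee table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employee (name TEXT, department TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO Employee VALUES (?, ?, ?)",
        [
            ("Asha", "Sales", 50000),
            ("Ben", "Sales", 70000),
            ("Chen", "IT", 90000),
            ("Dina", "IT", 60000),
            ("Eli", "IT", 90000),
        ],
    )
    return conn


def second_highest_salary(conn):
    """Return the second-highest distinct salary, or None if there is none."""
    return conn.execute(SECOND_HIGHEST).fetchone()[0]


def above_department_average(conn):
    """Return names of employees earning more than their department's average."""
    return [name for (name,) in conn.execute(ABOVE_DEPT_AVG).fetchall()]
```

Because the MAX-below-MAX form compares values rather than rows, the two employees
tied at the top salary do not hide the second-highest value.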

-----------------------------------------------------------------------------------
SQL Questions -

1. Write a query to find the second-highest salary in a department. You might use
ROW_NUMBER() or DENSE_RANK() to achieve this.
2. Create a query to calculate the total number of transactions per user for each
day. This typically involves GROUP BY and COUNT() for aggregation.
3. Write a query to select projects with the highest budget-per-employee ratio
from two related tables (projects and employees). This tests your ability to handle
complex joins and aggregations.
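
Question 1 above mentions DENSE_RANK(). A minimal sketch of that approach follows,
assuming SQLite 3.25 or later (the first version with window functions) through
Python's sqlite3 module. The Employee table and its rows are hypothetical.

```python
import sqlite3

# Rank salaries inside each department, then keep the rows ranked second.
# DENSE_RANK gives tied salaries the same rank and leaves no gaps, so rank 2
# is always the second-highest distinct salary in the department.
SECOND_HIGHEST_PER_DEPT = """
SELECT department, salary
FROM (
    SELECT DISTINCT department, salary,
           DENSE_RANK() OVER (
               PARTITION BY department ORDER BY salary DESC
           ) AS rnk
    FROM Employee
)
WHERE rnk = 2
ORDER BY department
"""


def make_db():
    """Create an in-memory Employee table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employee (name TEXT, department TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO Employee VALUES (?, ?, ?)",
        [
            ("Asha", "Sales", 50000),
            ("Ben", "Sales", 70000),
            ("Chen", "IT", 90000),
            ("Dina", "IT", 60000),
            ("Eli", "IT", 90000),  # ties with Chen for the top IT salary
        ],
    )
    return conn


def second_highest_per_department(conn):
    """Return (department, second-highest salary) pairs."""
    return conn.execute(SECOND_HIGHEST_PER_DEPT).fetchall()
```

With ROW_NUMBER() instead, the tie in IT would put 90000 in both first and second
place, which is usually not what the question intends.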

Power BI Questions -

1. Explain the difference between Import and DirectQuery modes. Which would you
choose for large datasets? (DirectQuery queries the source at report time, so data
stays current but visuals may be slower, whereas Import is faster but only reflects
the last refresh.)
2. What are slicers, and how do they differ from visual-level filters? Discuss
their impact on data in a Power BI dashboard.
3. How do you implement Row-Level Security (RLS) in Power BI? Explain how you
would restrict data access to specific users or groups.
4. What is a paginated report, and when would you use it? These are ideal for
multi-page outputs like invoices or billing statements.

Python Questions -

1. Write a Python script to identify unique values in a list and count their
occurrences. This tests your understanding of sets and dictionaries.
2. How would you use pandas to merge two datasets and calculate total sales for
products with valid promotions? This involves merge(), groupby(), and basic data
analysis functions.
3. Explain the differences between lists, tuples, sets, and dictionaries in
Python, highlighting their use cases in data manipulation and analysis.
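
For question 1 above, a short sketch using only the standard library is shown below.
The function names are chosen here for illustration; collections.Counter is a
dictionary subclass, so the same idea can be written with a plain dict and a loop.

```python
from collections import Counter


def unique_values(values):
    """Return the distinct values in first-seen order (dict keys keep insertion order)."""
    return list(dict.fromkeys(values))


def count_occurrences(values):
    """Return a plain dict mapping each distinct value to how often it appears."""
    return dict(Counter(values))
```

A set would also give the unique values, but it does not preserve the original
order, which is why dict.fromkeys is used in this sketch.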

𝟭. 𝗥𝗲𝘀𝘂𝗺𝗲 𝗦𝗰𝗿𝗲𝗲𝗻𝗶𝗻𝗴
-----------------------------------------------------------------------------------

• Experience with cloud data pipelines.


• Proficiency in Snowflake, SQL, and advanced SQL.
• Strong skills in AWS, PySpark, and Python.
• Highlighted relevant cloud and data engineering projects.
𝟮. 𝗧𝗲𝗹𝗲𝗽𝗵𝗼𝗻𝗶𝗰 𝗗𝗶𝘀𝗰𝘂𝘀𝘀𝗶𝗼𝗻
• Discussed past projects and technical stack.
• Rated skills listed in the resume.
• Clarified expectations for the role.
• Assessed alignment with the job requirements.

𝟯. 𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗜
• Parquet file format use cases and limitations.
• SQL query to find distinct flight routes.
• Coding problem: Shift zeroes in a list without extra space.
• PySpark transformations and cloud storage integration.
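
The zero-shifting problem above does not say which direction to shift. The sketch
below assumes zeroes move to the end while the other elements keep their relative
order. It works in place with two indices, so it needs no second list.

```python
def shift_zeroes(nums):
    """Move all zeroes to the end of nums in place, keeping the order of the rest."""
    write = 0  # next position that should hold a non-zero value
    for read in range(len(nums)):
        if nums[read] != 0:
            # Swap the non-zero value forward; a zero (if any) moves back.
            nums[write], nums[read] = nums[read], nums[write]
            write += 1
    return nums
```

The list is modified in place and also returned for convenience. The loop makes one
pass over the input, so the running time is linear in the list length.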

𝟰. 𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗜𝗜
• Complex SQL queries with joins and window functions.
• File format comparisons (e.g., Parquet vs. CSV).
• Coding challenge: String manipulation and character counting.
• Deep dive into RANK() vs. DENSE_RANK() differences.

𝟱. 𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗜𝗜𝗜


• Differences between partitioning and bucketing.
• Designing pipeline triggers for large data loads.
• Comparison of RDBMS vs. NoSQL databases.
• SQL for handling missing and duplicate data.

𝟲. 𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗜𝗩
• Designing pipelines for petabyte-scale data.
• Data loading frameworks and performance tuning.
• AWS services for streaming data solutions.
• Basic ML concepts, including CNNs.

𝟳. 𝗧𝗲𝗰𝗵𝗻𝗼-𝗠𝗮𝗻𝗮𝗴𝗲𝗿𝗶𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄
• E-commerce platform enhancements.
• Probability-based brain teasers.
• Insight gathering from datasets.
• Logical problem-solving under pressure.

𝟴. 𝗠𝗮𝗻𝗮𝗴𝗲𝗿𝗶𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄
• Reasons for joining Apple.
• Challenges faced in past roles.
• Significant achievements and lessons learned.
• Strategies for handling conflicts and teamwork.

𝟵. 𝗛𝗥 𝗗𝗶𝘀𝗰𝘂𝘀𝘀𝗶𝗼𝗻
• Reason for job change.
• Location and role preferences.
• Expectations for the role and benefits discussion.
• Discussed career growth opportunities.
