Question
➌ PySpark SQL
→Registering DataFrames as temporary views
→Writing SQL queries within PySpark
→Using built-in SQL functions
→Joins (inner, outer, left, right)
➍ Data Preprocessing
→Handling null values (fillna, dropna)
→Changing column data types (cast)
→Renaming columns
→Working with schemas
➐ Performance Optimization
→Broadcast joins
→Catalyst optimizer and Tungsten execution engine
→Caching and persistence (cache, persist)
→Skew handling and data shuffling
-----------------------------------------------------------------------------
PySpark / Databricks interview questions for Azure data engineers
----------------------------------------------------------------------
1. What is SQL, and why is it used?
2. Write a query to fetch the second-highest salary from the Employee table.
3. What are the different types of SQL commands?
4. Write a query to find duplicate records in a table.
5. What is the difference between DELETE and TRUNCATE?
6. Write a query to get the department with the highest number of employees.
7. What are joins in SQL? Name the types of joins.
8. Write a query to fetch records where name starts with ‘A’.
9. What is a primary key, and how is it different from a unique key?
10. Write a query to fetch employees who earn more than the average salary.
11. What is a foreign key, and why is it important?
12. Write a query to get the top 3 highest salaries in the Employee table.
13. What is the difference between WHERE and HAVING clauses?
14. Write a query to fetch common records from two tables.
15. What is normalization? Explain its types.
16. Write a query to create a table with constraints (primary key, unique, and
foreign key).
17. What are indexes in SQL, and what are their types?
18. Write a query to count the number of employees in each department.
19. What is the difference between clustered and non-clustered indexes?
20. Write a query to find employees who have not been assigned a department.
21. What are aggregate functions in SQL? Give examples.
22. Write a query to combine the results of two tables using UNION.
23. What is the difference between UNION and UNION ALL?
24. Write a query to fetch the nth highest salary in a table.
25. What is a self-join, and when would you use it?
26. Write a query to get the total salary paid to employees in each department.
27. What is the difference between RANK(), DENSE_RANK(), and ROW_NUMBER()?
28. Write a query to update the salary of employees by 10% in the Employee table.
29. What are ACID properties in a database?
30. Write a query to delete duplicate records from a table while keeping one
instance.
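One possible approach to question 30, sketched with Python's built-in sqlite3 module so it can be run locally. The Employee columns and sample rows are made up for illustration, and rowid is SQLite-specific; other engines usually need ROW_NUMBER() in a CTE or a surrogate key instead.

```python
import sqlite3


def delete_duplicates(conn):
    """Delete duplicate Employee rows, keeping the lowest rowid per group."""
    conn.execute(
        """
        DELETE FROM Employee
        WHERE rowid NOT IN (
            SELECT MIN(rowid)
            FROM Employee
            GROUP BY name, salary
        )
        """
    )


# Hypothetical sample data: "Asha" appears three times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Asha", 100), ("Asha", 100), ("Ben", 200), ("Asha", 100), ("Cara", 300)],
)

delete_duplicates(conn)
remaining = conn.execute(
    "SELECT name, salary FROM Employee ORDER BY name"
).fetchall()
print(remaining)
```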
-------------------------------------------------------------------------
Round 1: Technical (1 Hr)
✅ Tell me about yourself and any recent projects you have been a part of.
✅ Questions related to your projects.
✅ How would you connect multiple tables from different AWS databases (e.g., RDS,
Redshift) using a single connection in AWS Glue?
✅ What are the different types of triggers in AWS Glue or AWS Step Functions?
✅ How do you deploy code from DEV to QA and PROD environments using AWS services?
✅ How do you create a CI/CD pipeline for deployment in AWS using CodePipeline,
CodeCommit, and CodeBuild?
✅ What types of transformations have you performed in your projects using AWS Glue
or other services?
✅ How can you replace spaces in column names with underscores in source files using
AWS Glue and S3?
✅ What is SCD Type 2, and how can you implement it in AWS using Glue or Redshift?
✅ What are the differences between AWS S3 and AWS Redshift in terms of data storage
and usage?
✅ How do you read data from S3 using Amazon Redshift Spectrum or Athena?
✅ Write a Python function to merge two sorted lists into one sorted list.
✅ Write an SQL query to fetch the 2nd-highest salary department-wise, and describe
different approaches to do it.
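For the Python question in this round (merging two sorted lists), a minimal two-pointer sketch is shown here. The function name merge_sorted is an illustrative choice, and the inputs are assumed to be sorted in ascending order already.

```python
def merge_sorted(a, b):
    """Merge two ascending lists into a new ascending list in O(len(a) + len(b))."""
    merged = []
    i = j = 0
    # Walk both lists, always taking the smaller front element.
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of these slices is non-empty.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


print(merge_sorted([1, 3, 5], [2, 4, 6]))
```

Using `<=` keeps the merge stable: when values tie, elements from the first list come first.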
Round 3: HR
✅ Discussion around my experience and projects, some resume-based questions.
✅ What are you expecting in your next job role?
✅ Package discussion
------------------------------------------------------------------------
SQL Questions
Write a query to fetch the top 5 employees with the highest salaries from an
Employees table.
Write a query to list all records in the Orders table where the delivery_date is
NULL.
Write a query to calculate the total sales and average discount offered in each
product category from the Products table.
Retrieve project details along with the names of project managers for all projects
where the status is "Completed," using a join between the Projects and Employees
tables.
Write a query to fetch all invoices in the Invoices table where the due_date is
more than 15 days past the invoice_date.
Write a query to identify suppliers from the Suppliers table whose total supplied
quantity exceeds 10,000 units, grouped by supplier_id.
How would you find duplicate entries in the Transactions table based on both
transaction_id and customer_id? Write a query to display these duplicate rows.
Write a query to rank products in each category by their total sales revenue using
a ranking function.
Write a query to find all customers in the Customers table who have not placed an
order in the last 6 months.
Write a query to update the salary column in the Employees table to increase by 10%
for employees in the "Marketing" department.
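As an illustration of the duplicate-detection question above, the sketch below groups the Transactions table by both keys and keeps groups that occur more than once. It uses Python's sqlite3 module only to make the query runnable; the amount column and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Transactions (transaction_id INTEGER, customer_id INTEGER, amount REAL)"
)
# Hypothetical sample data: only the pair (1, 10) is repeated.
conn.executemany(
    "INSERT INTO Transactions VALUES (?, ?, ?)",
    [(1, 10, 5.0), (1, 10, 5.0), (2, 11, 7.5), (3, 12, 1.0), (3, 13, 1.0)],
)

# A key pair is a duplicate when its group holds more than one row.
duplicates = conn.execute(
    """
    SELECT transaction_id, customer_id, COUNT(*) AS occurrences
    FROM Transactions
    GROUP BY transaction_id, customer_id
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicates)
```

To display the full duplicate rows rather than the key pairs, this result can be joined back to Transactions on both columns.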
Power BI Questions
Create a dynamic visual to display total revenue by product category and allow
users to filter the data by month and region.
Write a DAX measure to calculate the year-over-year revenue growth for each product
category.
Write a DAX measure to calculate cumulative revenue for each region across quarters
in a fiscal year.
Write a DAX measure to display the top 10 customers by revenue in a table visual.
Explain the difference between calculated columns and measures with examples of
calculating employee bonus percentages and total team bonus.
Explain how to implement RLS in Power BI to ensure department heads only see data
related to their own teams.
Create a fiscal date table where the fiscal year starts in July, and use it to
calculate year-to-date revenue for the fiscal year.
Write a DAX measure to calculate the percentage of returning customers month-over-
month.
Design a KPI dashboard in Power BI to show quarterly profit margins with dynamic
color indicators (e.g., red for below target, green for above target).
Explain how to use Power Query to handle messy data by:
Splitting a single column with concatenated values into multiple columns.
Removing special characters from a text column.
Merging two tables based on a common key.
------------------------------------------------------------------
1/ How would you find the second highest salary in a table without using LIMIT or
TOP?
2/ Write a query to find duplicate rows in a table and the count of their
occurrences.
3/ How would you retrieve the nth highest salary from a table?
4/ Write a query to identify employees whose salaries are greater than the average
salary in their department.
5/ How can you delete duplicate rows from a table while keeping only one instance
of each?
6/ Write a query to find employees who have the highest salary in each department.
7/ How would you retrieve records where a column contains only numeric data, even
if it’s stored as a string?
8/ Write a query to find the running total of sales in a sales table.
9/ How would you find the longest consecutive sequence of dates in a table?
10/ Write a query to pivot a table's data from rows to columns.
11/ How would you calculate the cumulative percentage of a column in a table?
12/ Write a query to find gaps in a sequence of numbers in a table.
13/ How would you retrieve records that belong to a specific time window (e.g.,
last 7 days)?
14/ Write a query to join a table with itself to find employees who share the same
manager.
15/ How would you find the top three customers by total purchase amount in each
region?
16/ Write a query to find the maximum difference between two consecutive values in
a column.
17/ How would you identify the median value of a column in a table?
18/ Write a query to find overlapping date ranges in a table.
19/ How would you rank employees based on their salaries within their department?
20/ Write a query to find rows where data in a specific column repeats after a
certain number of rows.
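Question 1/ in this list (second-highest salary without LIMIT or TOP) is commonly answered with a nested MAX. The sketch below runs that query through Python's sqlite3 module; the Employee table layout and the salaries are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, salary INTEGER)")
# Hypothetical sample data: the top salary (300) appears twice.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Asha", 100), ("Ben", 300), ("Cara", 200), ("Dev", 300)],
)

# The largest salary strictly below the overall maximum is the second highest.
second_highest = conn.execute(
    """
    SELECT MAX(salary)
    FROM Employee
    WHERE salary < (SELECT MAX(salary) FROM Employee)
    """
).fetchone()[0]
print(second_highest)
```

The strict `<` comparison means ties at the top do not affect the result, and the query returns NULL (None in Python) when there is no second distinct salary.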
-----------------------------------------------------------------------------------
SQL Questions -
1. Write a query to find the second-highest salary in a department. You might use
ROW_NUMBER() or DENSE_RANK() to achieve this.
2. Create a query to calculate the total number of transactions per user for each
day. This typically involves GROUP BY and COUNT() for aggregation.
3. Write a query to select projects with the highest budget-per-employee ratio
from two related tables (projects and employees). This tests your ability to handle
complex joins and aggregations.
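A sketch of SQL question 1 from this list using DENSE_RANK(), run through Python's sqlite3 module. It assumes a SQLite build with window-function support (3.25 or later), and the Employee columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, department TEXT, salary INTEGER)")
# Hypothetical sample data with ties at the top of each department.
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [
        ("Asha", "Eng", 300),
        ("Ben", "Eng", 200),
        ("Cara", "Eng", 300),
        ("Dev", "Ops", 150),
        ("Eli", "Ops", 120),
        ("Fay", "Ops", 150),
    ],
)

# DENSE_RANK gives tied salaries the same rank with no gaps,
# so rank 2 is the second-highest distinct salary per department.
second_per_department = conn.execute(
    """
    SELECT department, name, salary
    FROM (
        SELECT department, name, salary,
               DENSE_RANK() OVER (
                   PARTITION BY department ORDER BY salary DESC
               ) AS rnk
        FROM Employee
    ) AS ranked
    WHERE rnk = 2
    ORDER BY department, name
    """
).fetchall()
print(second_per_department)
```

ROW_NUMBER() would instead pick the second row, which is another top earner whenever the highest salary is tied.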
Power BI Questions -
1. Explain the difference between Import and DirectQuery modes. Which would you
choose for large datasets? (DirectQuery queries the source at report time, so data
is current but visuals may be slower, whereas Import is faster but only as fresh as
the last refresh.)
2. What are slicers, and how do they differ from visual-level filters? Discuss
their impact on data in a Power BI dashboard.
3. How do you implement Row-Level Security (RLS) in Power BI? Explain how you
would restrict data access to specific users or groups.
4. What is a paginated report, and when would you use it? These are ideal for
multi-page outputs like invoices or billing statements.
Python Questions -
1. Write a Python script to identify unique values in a list and count their
occurrences. This tests your understanding of sets and dictionaries.
2. How would you use pandas to merge two datasets and calculate total sales for
products with valid promotions? This involves merge(), groupby(), and basic data
analysis functions.
3. Explain the differences between lists, tuples, sets, and dictionaries in
Python, highlighting their use cases in data manipulation and analysis.
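For Python question 1 in this list, a minimal sketch with a plain dictionary is shown here. The function name count_occurrences and the sample list are illustrative.

```python
def count_occurrences(values):
    """Return a dict mapping each unique value to how often it appears."""
    counts = {}
    for value in values:
        # dict.get supplies 0 the first time a value is seen.
        counts[value] = counts.get(value, 0) + 1
    return counts


counts = count_occurrences(["a", "b", "a", "c", "b", "a"])
unique_values = set(counts)
print(counts, unique_values)
```

The standard library's collections.Counter does the same counting in one call and is usually the idiomatic choice outside an interview setting.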
𝟭. 𝗥𝗲𝘀𝘂𝗺𝗲 𝗦𝗰𝗿𝗲𝗲𝗻𝗶𝗻𝗴
-----------------------------------------------------------------------------------
𝟯. 𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗜
• Parquet file format use cases and limitations.
• SQL query to find distinct flight routes.
• Coding problem: Shift zeroes in a list without extra space.
• PySpark transformations and cloud storage integration.
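The coding problem in this round can be sketched as below, assuming "shift zeroes" means moving every zero to the end of the list while keeping the other elements in their original order. The function name shift_zeroes is hypothetical.

```python
def shift_zeroes(nums):
    """Move all zeroes to the end of nums in place, using O(1) extra space."""
    write = 0  # index where the next non-zero element belongs
    for read in range(len(nums)):
        if nums[read] != 0:
            # Swap the non-zero element forward; zeroes drift to the back.
            nums[write], nums[read] = nums[read], nums[write]
            write += 1
    return nums


print(shift_zeroes([0, 1, 0, 3, 12]))
```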
𝟰. 𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗜𝗜
• Complex SQL queries with joins and window functions.
• File format comparisons (e.g., Parquet vs. CSV).
• Coding challenge: String manipulation and character counting.
• Deep dive into RANK() vs. DENSE_RANK() differences.
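To make the RANK() vs. DENSE_RANK() difference concrete, the sketch below ranks a small set of tied salaries. It runs the query through Python's sqlite3 module, assumes a SQLite build with window-function support (3.25 or later), and uses made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (salary INTEGER)")
# Hypothetical sample data: two employees tie for the top salary.
conn.executemany("INSERT INTO Employee VALUES (?)", [(300,), (300,), (200,)])

rows = conn.execute(
    """
    SELECT salary,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM Employee
    ORDER BY salary DESC
    """
).fetchall()

# Each row is (salary, rank, dense_rank).
for row in rows:
    print(row)
```

After the tie, RANK() skips to 3 because two rows precede the 200 salary, while DENSE_RANK() continues with 2.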
𝟲. 𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗜𝗩
• Designing pipelines for petabyte-scale data.
• Data loading frameworks and performance tuning.
• AWS services for streaming data solutions.
• Basic ML concepts, including CNNs.
𝟳. 𝗧𝗲𝗰𝗵𝗻𝗼-𝗠𝗮𝗻𝗮𝗴𝗲𝗿𝗶𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄
• E-commerce platform enhancements.
• Probability-based brain teasers.
• Insight gathering from datasets.
• Logical problem-solving under pressure.
𝟴. 𝗠𝗮𝗻𝗮𝗴𝗲𝗿𝗶𝗮𝗹 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄
• Reasons for joining Apple.
• Challenges faced in past roles.
• Significant achievements and lessons learned.
• Strategies for handling conflicts and teamwork.
𝟵. 𝗛𝗥 𝗗𝗶𝘀𝗰𝘂𝘀𝘀𝗶𝗼𝗻
• Reason for job change.
• Location and role preferences.
• Expectations for the role and benefits discussion.
• Career growth opportunities.