100 Data Engineering Questions & Answers
By
Afrin Ahamed
1. Explain what Data Engineering is.
Data engineering is the process of designing, building, and maintaining the infrastructure and systems that enable the collection, storage, processing, and
analysis of large amounts of data. It involves tasks such as data integration, data transformation, data storage and retrieval, data quality assurance, and data
pipeline development. Data engineering is a crucial component of data-driven organizations and enables them to effectively leverage the power of data to gain
insights, make informed decisions, and drive business growth.
2. Explain an end-to-end Data Engineering architecture.
Data engineering architecture refers to the design and structure of the systems and processes used to manage and process data in an organization. A well-
designed data engineering architecture enables efficient, reliable, and scalable data processing, storage, and retrieval.
A typical data engineering architecture may involve several layers, including:
Data Sources: This layer includes all the systems and applications that generate data, such as databases, sensors, and web services.
Data Ingestion: This layer is responsible for collecting and aggregating data from various sources and bringing it into the data processing pipeline. Tools such
as Apache Kafka, Apache Flume, and AWS Kinesis are often used for this purpose.
Data Storage: This layer involves storing data in a structured, semi-structured, or unstructured format, depending on the nature of the data. Common data
storage systems include relational databases, NoSQL databases, and data lakes.
Data Processing: This layer is responsible for transforming, cleaning, and enriching the data to make it useful for downstream analytics and machine learning.
Tools such as Apache Spark, Apache Beam, and AWS Glue are commonly used for this purpose.
Data Analytics: This layer involves using various tools and techniques to analyze the data and gain insights. This may involve data visualization, machine
learning, and other data analysis techniques.
Data Delivery: This layer is responsible for delivering the data and insights to end-users or downstream applications, such as dashboards, reports, and APIs.
Overall, a good data engineering architecture should be scalable, reliable, secure, and flexible enough to accommodate changing data requirements and
business needs.
3. What does the day-to-day work of a Data Engineer look like?
The day-to-day responsibilities of a data engineer can vary depending on the organization, but here are some common tasks that a data engineer may
perform:
Data Ingestion: Collecting data from various sources and systems and bringing it into the data processing pipeline. This may involve setting up ETL (Extract,
Transform, Load) pipelines or working with tools such as ADF, AZCopy, FTP Tools, API Calls, Apache Kafka, Apache Flume, or AWS Kinesis.
Data Storage: Designing and implementing data storage solutions that can accommodate large volumes of structured, semi-structured, or unstructured data in a data lake such as Azure Data Lake Storage Gen2, AWS S3, Google Cloud Storage, or HDFS. This may involve working with cloud data warehouses such as Azure SQL DWH, AWS Redshift, Google BigQuery, and Snowflake, as well as NoSQL databases.
Data Transformation: Transforming and cleaning the data to make it useful for downstream analytics and machine learning. This may involve using tools such as Apache Spark, Databricks Spark SQL, Synapse Analytics with SQL/Spark, Apache Beam, or AWS Glue.
Data Quality: Ensuring the quality and consistency of the data by implementing data validation, verification, and monitoring processes, for example with PySpark transformations in Databricks or Synapse Analytics, or with SQL in a cloud data warehouse.
Performance Optimization: Improving the performance of data processing pipelines by optimizing queries, improving data partitioning, or using caching
strategies.
Data Security: Ensuring the security of the data by implementing encryption, access controls, and other security measures.
Documentation: Creating and maintaining documentation for the data processing pipelines and data storage systems to help ensure that they can be easily
understood and maintained by other members of the team.
Collaboration: Working closely with data analysts, data scientists, and other stakeholders to understand their requirements and to ensure that the data
engineering systems meet their needs.
Overall, the day-to-day responsibilities of a data engineer involve designing, building, and maintaining the infrastructure and systems that enable an
organization to effectively manage and process large volumes of data.
Azure IoT Hub
Azure Stream Analytics
Microsoft Purview
Azure Data Share
Microsoft Power BI
Azure Active Directory
Azure Cost Management
Azure Key Vault
Azure Monitor
Microsoft Defender for Cloud
Azure DevOps
Azure Policy
GitHub
6. What are the various cloud data platforms you have worked with? Explain the data services available from one of the cloud providers.
Amazon Web Services (AWS): AWS offers a wide range of data-related services, including data storage (S3, EBS, EFS, etc.), data processing (EC2, EMR,
Glue, etc.), and data analytics (Athena, Redshift, QuickSight, etc.).
Microsoft Azure: Azure offers a range of data-related services, including data storage (Blob storage, Azure Files, etc.), data processing (HDInsight, Azure Data
Factory, etc.), and data analytics (Azure Synapse Analytics, Power BI, etc.).
Google Cloud Platform (GCP): GCP offers a range of data-related services, including data storage (Cloud Storage, Cloud SQL, etc.), data processing (Cloud
Dataproc, Dataflow, etc.), and data analytics (BigQuery, Data Studio, etc.).
Snowflake: Snowflake is a cloud-based data warehousing platform that enables users to store, process, and analyze large volumes of data.
IBM Cloud: IBM Cloud offers a range of data-related services, including data storage (Cloud Object Storage, Cloud Databases, etc.), data processing (IBM
Cloud Pak for Data, Watson Studio, etc.), and data analytics (IBM Cognos Analytics, Watson Discovery, etc.).
7. What are the various programming languages used in data engineering?
There are several programming languages that are commonly used in data engineering, including:
Python: Python is a popular language for data engineering due to its ease of use, versatility, and large selection of data-related libraries and frameworks, such
as Pandas, NumPy, and Apache Spark. Python can be used for data processing, data analysis, and building data pipelines.
SQL: SQL (Structured Query Language) is a language used for querying and manipulating data in relational databases. It is a standard language that is widely
used in data engineering, particularly for data warehousing and data analytics.
Java: Java is a commonly used language for building large-scale data processing systems, particularly those that use distributed computing frameworks such
as Apache Hadoop and Apache Spark.
Scala: Scala is a high-performance language that is used in distributed computing frameworks such as Apache Spark. It is often used in conjunction with
Java to build large-scale data processing systems.
R: R is a language that is commonly used for statistical computing and data analysis. It has a large selection of libraries and frameworks that make it well-
suited for data engineering tasks such as data cleaning, data visualization, and data analysis.
JavaScript: Snowflake supports the use of JavaScript in Snowflake Stored Procedures and User-Defined Functions (UDFs). JavaScript can be used in Stored Procedures and UDFs to implement custom business logic and data transformations within Snowflake. This can include manipulating data, calling external APIs, and performing calculations.
Other programming languages that are commonly used in data engineering include C++, Perl, and Go. The choice of language will depend on factors such as
the specific requirements of the project, the technical expertise of the team, and the available tools and libraries.
Python in particular is a popular choice for data engineering, for several reasons:
Ease of use: Python is an easy-to-learn language with a simple syntax that makes it accessible for beginners. This means that data engineers can quickly get up to speed with Python and start building data pipelines.
Versatility: Python has a wide range of libraries and frameworks that are specifically designed for data processing, such as Pandas, NumPy, and Apache
Spark. This makes it a powerful language for building data engineering pipelines, as well as for performing data analysis and machine learning tasks.
Interoperability: Python can easily interface with other languages and technologies commonly used in data engineering, such as SQL databases and big data
platforms like Hadoop and Spark.
Community: Python has a large and active community of data engineers and data scientists, who contribute to open-source libraries and tools that make
data engineering tasks faster and more efficient.
Scalability: Python's ability to scale horizontally through distributed computing frameworks like Apache Spark and Dask make it suitable for processing large
data sets in parallel.
Flexibility: Python is a multi-purpose language that can be used for a variety of tasks beyond data engineering and data science. For example, it is used for
web development, automation, and scripting.
Better support for deep learning: Python has gained popularity in recent years as a language for deep learning, and as such, it has several widely used deep
learning libraries, such as TensorFlow, Keras, and PyTorch. These libraries are well-supported in Databricks, making Python a natural choice for data
engineers and data scientists who are working on deep learning projects.
Integration with Jupyter Notebooks: Databricks supports the use of Jupyter Notebooks, which are a popular platform for interactive data analysis and
visualization using Python. This integration enables data engineers and data scientists to work seamlessly in a single environment.
SQL is equally central to data engineering and is used in several ways:
Data extraction: SQL can be used to extract data from databases and load it into data pipelines for further processing.
Data transformation: SQL can be used to transform data by manipulating or joining tables, filtering data based on certain conditions, or aggregating data
into new forms.
Data quality: SQL can be used to ensure data quality by performing data validation checks, such as checking for missing or duplicate values.
Data warehousing: SQL can be used to create and manage data warehouses, which are centralized repositories of data used for reporting and analysis.
ETL (Extract, Transform, Load): SQL can be used in ETL pipelines to extract data from different sources, transform it into the desired format, and load it into
the target data store.
SQL is a widely used and powerful tool for data engineering, and it is supported by many different relational database management systems (RDBMS).
Additionally, the use of SQL can help ensure data integrity and consistency across different systems and applications.
10.What are the different data sources / targets you used in your project?
Data engineering pipelines can work with any number of sources and targets, such as:
Relational databases: These are databases that store data in tables with predefined relationships between them. Common examples include MySQL,
PostgreSQL, Oracle, and Microsoft SQL Server.
NoSQL databases: These are databases that store data in non-tabular structures, such as key-value stores, document databases, and graph databases.
Common examples include MongoDB, Cassandra, and Neo4j.
Cloud storage: Cloud storage services like Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage are commonly used as sources and targets
for data pipelines.
APIs: APIs (Application Programming Interfaces) allow data to be accessed and integrated from various online services and applications. Common examples
include REST APIs and SOAP APIs.
File formats: Data can be stored and exchanged in a variety of file formats, such as CSV, JSON, XML, Parquet, and Avro. These formats are often used as
sources and targets for data pipelines.
Message queues: Message queues, such as Apache Kafka and RabbitMQ, are used to send and receive data between different systems.
11. What does data ingestion mean? What tools have you used for data ingestion?
Ingestion in cloud computing refers to the process of bringing data from external sources into a cloud-based storage or processing system. This is a critical
step in data engineering, as it enables organizations to centralize their data and make it accessible for analysis, reporting, and other purposes.
There are various types of ingestion tools that are commonly used in cloud computing, including:
Batch ingestion tools: These tools are used for processing large volumes of data in batches, typically on a daily or weekly basis. Examples of batch ingestion
tools include Apache Hadoop, Apache Spark, and Apache Flume.
Real-time ingestion tools: These tools are used for processing data in real-time or near real-time, typically with low latency and high throughput. Examples of
real-time ingestion tools include Azure Event Hubs, Azure IoT Hub, Apache Kafka, AWS Kinesis, and Google Cloud Pub/Sub.
Cloud-native ingestion tools: These tools are specifically designed for use in cloud environments and often provide seamless integration with cloud-based
storage and processing systems. Examples of cloud-native ingestion tools include AWS Glue, Google Cloud Dataflow, and Microsoft Azure Data Factory.
Database replication tools: These tools are used to replicate data from one database to another, typically for disaster recovery or high availability purposes.
Examples of database replication tools include AWS Database Migration Service, Google Cloud SQL, and Microsoft SQL Server Replication.
File transfer tools: These tools are used for transferring files from one location to another, typically over the internet. Examples of file transfer tools include
AWS Transfer for SFTP, Google Cloud Storage Transfer Service, and Microsoft Azure File Sync.
12. How do you read / move data from on-premises to the cloud? Which tools have you worked with?
Cloud Data Migration Tools: These tools are specifically designed for use in cloud environments and often provide seamless integration with cloud-based
storage and processing systems. Examples of cloud-native ingestion tools include AWS Glue, Google Cloud Dataflow, and Microsoft Azure Data Factory.
Third-party data integration tools: There are many third-party data integration tools available that can help you ingest data from on-premises to the cloud.
These tools offer features such as data transformation, data validation, and data cleansing, and support a wide range of data sources and targets.
Custom scripts and APIs: If you have specific data integration requirements, you can develop custom scripts or APIs to ingest data from on-premises to the
cloud. This approach can be time-consuming and complex, but it provides more flexibility and control over the data integration process.
13. What are the different file types you have used in your day-to-day work?
In data engineering, there are several types of data files that are commonly used to store and process data. Here are a few examples:
CSV files: CSV (Comma-Separated Values) files are a simple and widely used format for storing tabular data. Each row represents a record, and each column
represents a field in the record. CSV files can be easily imported into a variety of data processing tools and databases.
JSON files: JSON (JavaScript Object Notation) files are a lightweight data format that is easy to read and write. They are commonly used for web-based
applications and for exchanging data between different programming languages. JSON files are structured as key-value pairs and can contain nested objects
and arrays.
Parquet files: Parquet is a columnar storage format that is optimized for big data processing. It is designed to be highly efficient for analytics workloads,
allowing for faster query performance and reduced storage costs. Parquet files are often used in data warehousing and data lake environments.
Avro files: Avro is a binary data format that is designed to be compact and efficient. It supports schema evolution, meaning that the schema can be changed
over time without breaking existing applications. Avro files are often used in Hadoop and other big data processing frameworks.
ORC files: ORC (Optimized Row Columnar) files are another columnar storage format that is designed for fast data processing. ORC files are highly
compressed, making them efficient to store and transmit over networks. They are commonly used in Hadoop and other big data processing environments.
Delta files: Delta is a file format created by Databricks that is designed for building data lakes and data warehouses. Delta files are based on Parquet files but
add transactional capabilities to support updates, deletes, and merges. Delta files are designed to be highly scalable and performant, making them well-suited
for big data processing.
XML files: XML (Extensible Markup Language) is a markup language that is used to store and transport data. XML files are structured as nested elements,
with each element representing a record or data item. XML is a flexible and self-describing format that can be used for a wide range of data processing needs,
including web services, document exchange, and database integration.
The choice of data file format depends on the specific needs of the data processing and storage system. Data engineers need to consider factors such as
performance, scalability, flexibility, and interoperability when selecting the appropriate format for a particular use case.
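As a small illustration of reading and writing these formats with PySpark, here is a minimal sketch; the paths are illustrative, and Avro and Delta support require the spark-avro and Delta Lake libraries, which are bundled in Databricks runtimes:

    # Read a CSV source once, then write it out in the other formats
    df = spark.read.csv("/mnt/datalake/raw/events.csv", header=True, inferSchema=True)

    df.write.mode("overwrite").json("/mnt/datalake/out/events_json")
    df.write.mode("overwrite").parquet("/mnt/datalake/out/events_parquet")
    df.write.mode("overwrite").orc("/mnt/datalake/out/events_orc")
    df.write.mode("overwrite").format("avro").save("/mnt/datalake/out/events_avro")
    df.write.mode("overwrite").format("delta").save("/mnt/datalake/out/events_delta")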
15.What are the differences between ORC, AVRO and Parquet files?
If you have large-scale data processing needs: Orc and Parquet are both designed for efficient storage and processing of large datasets. They can provide
high performance and scalability for data processing jobs that require processing of large volumes of data.
If you need advanced compression capabilities: Parquet and Orc support more advanced compression techniques like Zstandard and Snappy, which can lead
to higher compression ratios and faster query performance. If you have specific compression requirements, these file formats may be a better choice than
Avro.
If you need strong schema evolution support: Avro has strong support for schema evolution, meaning that the schema can be changed over time without
breaking existing applications. If you expect your schema to evolve frequently, Avro may be a better choice than Parquet or Orc.
If you need a lightweight and flexible file format: Avro is a lightweight and flexible file format that can be used for a wide range of data processing needs. If you
have a small dataset or require more flexible schema management, Avro may be a better choice than Parquet or Orc.
16. What are the best use cases for ORC, Avro, and Parquet files? (Explain the data scenarios best suited to each file type.)
ORC, Avro, and Parquet are three popular file formats used in data engineering (ORC and Parquet are columnar, while Avro is row-based). Here are some key differences between them:
Compression: All three file formats support compression, but the specific compression algorithms and settings vary. Parquet and Orc support more advanced
compression techniques like Zstandard and Snappy, which can lead to higher compression ratios and faster query performance. Avro supports simpler
compression techniques like deflate and snappy, which are less efficient but more widely supported.
Schema evolution: Avro has strong support for schema evolution, meaning that the schema can be changed over time without breaking existing applications.
Parquet and Orc also support schema evolution, but the process is more complex and requires more care to avoid breaking compatibility.
Performance: Parquet and Orc are optimized for big data processing and can handle large volumes of data quickly and efficiently. Avro is more lightweight and
may be better suited for smaller datasets or applications that require more flexible schema management.
Integration with data processing tools: All three file formats are widely supported by data processing tools and platforms, including Hadoop, Spark, and
others. However, some tools may have better performance or compatibility with certain file formats.
17. How do you convert data from one file format to another? For example, CSV to ORC, ORC to Parquet, or JSON to CSV?
In Spark, you can use the following steps to convert a file in ORC format to CSV (a code sketch covering both directions follows these steps):
Read the Orc file using the spark.read.orc() function and store it in a DataFrame.
Write the DataFrame to a CSV file using the DataFrame.write() function, specifying the format as “csv”.
This will save the contents of the Orc file in CSV format to the specified file path.
Similarly, to convert a CSV file to Orc format, you can use the following steps:
Read the CSV file using the spark.read.csv() function and store it in a DataFrame.
Write the DataFrame to an ORC file using the DataFrame.write() function, specifying the format as “orc”.
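A minimal PySpark sketch of both conversions, assuming an existing SparkSession (spark) and illustrative paths:

    # ORC -> CSV
    df = spark.read.orc("/data/input/orders.orc")
    df.write.format("csv").option("header", "true").mode("overwrite").save("/data/output/orders_csv")

    # CSV -> ORC
    df = spark.read.csv("/data/input/orders.csv", header=True, inferSchema=True)
    df.write.format("orc").mode("overwrite").save("/data/output/orders_orc")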
18.What is metadata?
Metadata refers to data that describes other data. It provides information about the structure, content, quality, and other characteristics of a dataset. Metadata is used to help users understand, interpret, and use data, and it is a critical component of data management and data engineering. Metadata can include various types of information, such as:
Data type and format
Data source and origin
Date and time of data creation, update, and access
Data schema and relationships
Data quality and completeness
Data privacy and security restrictions
Data access and usage permissions
Metadata can be stored in a separate database or system, or it can be embedded within the data itself, such as in file headers or database schemas. It can be managed manually or automatically using metadata management tools and processes. Metadata is essential for effective data discovery, integration, transformation, and analysis. It helps to ensure that data is accurate, consistent, and usable across different applications and systems.
19. What are typical scenarios where you receive data as JSON files?
JSON (JavaScript Object Notation) is a widely used data format that is simple, lightweight, and easy to parse. It is used for a variety of data scenarios,
including:
Web APIs: JSON is commonly used as a data format for web APIs, which allow applications to exchange data over the internet. APIs can return JSON data in
response to requests from client applications, which can then parse and use the data as needed.
Big data: JSON is often used for storing and processing big data, especially in NoSQL databases like MongoDB and Cassandra. JSON allows for flexible
schema design, which can be beneficial for handling unstructured or semi-structured data.
IoT devices: JSON is used to transmit and store data from Internet of Things (IoT) devices, which can generate large amounts of data in real time. JSON can
be used to encode data such as sensor readings, location data, and other metadata.
Configuration files: JSON can be used to store configuration data for applications or systems. This can include data such as server settings, user preferences,
or other application-specific settings.
Log files: JSON can be used to store log data, which can be analyzed and monitored for application performance, security, or other purposes. JSON log data
can be easier to parse and analyze than other formats like plain text or XML.
Overall, JSON is a versatile and widely used data format that can be used in a variety of data scenarios. Its simplicity, flexibility, and compatibility with many
programming languages and tools make it a popular choice for developers and data engineers.
20. What are the different types of data you have worked with in your project?
Explain the different source and target data file formats in your project, such as CSV, TSV, Parquet, ORC, and Delta.
22. Which is better to use, pandas or Spark? Explain why.
If you are working with large datasets that require distributed processing, Spark is likely the better choice. However, if your data fits into memory and you are comfortable working with Python, pandas can be a more convenient option.
Clusters in Databricks are typically used to run distributed computing jobs, such as processing large datasets, training machine learning models, or
performing real-time data analytics. Multiple users can share a cluster, and the platform automatically manages resource allocation and scheduling to ensure
efficient use of the available resources.
Azure Data Factory provides a graphical interface and a set of tools to enable you to build and manage data pipelines that can connect to various data
sources, including on-premises and cloud-based sources. You can use the tool to perform a variety of data integration tasks, such as data ingestion, data
transformation, and data loading.
Integration with other Azure services, such as Azure Blob Storage, Azure Data Lake Storage, Azure SQL Database, and more.
Support for a variety of data sources, including structured, unstructured, and semi-structured data.
Built-in data transformation and processing using Azure Databricks or HDInsight.
Advanced security and monitoring features, including role-based access control, auditing, and logging.
Overall, Azure Data Factory simplifies the process of building and managing data integration pipelines, allowing organizations to easily move, transform, and
analyze data from various sources in a scalable, cost-effective way.
To set up the connection, you would need to follow these general steps:
Create a linked service in Azure Data Factory for the on-premises data source you want to connect to.
For the self-hosted IR method, download and install the self-hosted IR on an on-premises machine, and register it with Azure Data Factory.
For the Azure Data Gateway method, download and install the gateway, and register it with Azure Data Factory.
Configure the linked service to use the self-hosted IR or Azure Data Gateway.
Use the linked service in your data pipeline to move data between the on-premises data source and Azure.
29.Can you write a custom activity / transformation in Data Factory using Python? Pandas? Spark?
Use a Databricks notebook for any kind of custom transformation, using SQL, Python, PySpark, or Scala.
31. What are the big data SaaS services in Azure?
Azure Data Lake is a big data solution based on multiple cloud services in the Microsoft Azure ecosystem. It allows organizations to ingest multiple data sets, including structured, unstructured, and semi-structured data, into an infinitely scalable data lake enabling storage, processing, and analytics. Its key components include the core infrastructure, ADLS, ADLA, Databricks, Synapse Analytics, and HDInsight, along with best practices for using them effectively.
Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob Storage. For example, Data Lake Storage Gen2
provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you'll also get low-cost, tiered storage, with
high availability/disaster recovery capabilities.
33.What are event driven data solutions?
An event-driven architecture consists of event producers that generate a stream of events, and event consumers that listen for the events.
An event driven architecture can use a publish/subscribe (also called pub/sub) model or an event stream model.
Pub/sub: The messaging infrastructure keeps track of subscriptions. When an event is published, it sends the event to each subscriber. After an event is
received, it cannot be replayed, and new subscribers do not see the event.
Event streaming: Events are written to a log. Events are strictly ordered (within a partition) and durable. Clients don't subscribe to the stream; instead, a client can read from any part of the stream. The client is responsible for advancing its position in the stream. That means a client can join at any time, and can replay events.
Copy and transform data from and to a REST endpoint by using Azure Data Factory.
We can use the following two activities:
1) Copy activity with a dataset of type REST
2) Web activity with the REST API URL, using the GET, PUT, or POST methods.
Web Activity can be used to call a custom REST endpoint from an Azure Data Factory or Synapse pipeline. You can pass datasets and linked services to be
consumed and accessed by the activity.
36.What is batch data processing?
Batch processing is a technique for automating and processing multiple transactions as a single group. Batch processing helps in handling tasks like payroll,
end-of-month reconciliation, or settling trades overnight.
Batch data processing is a method of processing data in which data is collected over a period of time and then processed as a group (or "batch") rather than
in real-time. This involves storing the data in a file or database until there is enough data to process, and then running a job or script to process the data in
bulk. Batch processing is commonly used for large-scale data processing and analysis, such as data warehousing, ETL (extract, transform, load), and report
generation.
38.What is streaming?
Streaming refers to the continuous flow of data or media content (such as audio, video, or text) from a source to a recipient over a network in real-time,
allowing the recipient to access and consume the content as it is being transmitted. Unlike traditional download-based methods, which require the entire
content to be downloaded before it can be accessed, streaming allows users to access and use the content as it is being delivered, which provides faster
access and reduces storage requirements. Streaming is commonly used for a wide range of applications, such as video and music streaming services, online
gaming, live broadcasts, and real-time data processing.
Azure provides several services for streaming and real-time data:
Azure Event Hubs: a highly scalable data streaming platform that can collect and process millions of events per second from various sources.
Azure IoT Hub: a cloud-based platform that can connect, monitor, and manage IoT devices and process real-time device telemetry data.
Azure Notification Hubs: a service that enables push notifications to be sent to mobile and web applications at scale.
Azure Media Services: a platform that provides cloud-based media processing and delivery services for streaming video and audio content.
Azure Data Explorer: a fast and highly scalable data exploration and analytics service that can process and analyze large volumes of streaming data in real-
time.
40.What is Spark SQL?
Spark SQL is a component of the Apache Spark open-source big data processing framework that enables developers to run SQL-like queries on large
datasets stored in distributed storage systems like Hadoop Distributed File System (HDFS) and Apache Cassandra. It provides a programming interface to
work with structured data using SQL queries, DataFrame APIs, and Datasets APIs.
Spark SQL allows users to combine the benefits of both relational and procedural programming paradigms to work with data in a distributed environment. It
also provides support for reading and writing data in various file formats such as Parquet, ORC, Avro, JSON, and CSV.
Spark SQL includes an optimizer that can optimize SQL queries to improve query performance by pushing down filters, aggregations, and other operations to
the data source. This optimization enables Spark SQL to process large datasets efficiently in a distributed environment.
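As a small illustration, the sketch below registers a DataFrame as a temporary view and queries it with Spark SQL; the sample data and view name are made up for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

    # Create a small DataFrame and expose it to SQL as a temporary view
    sales = spark.createDataFrame(
        [("US", 100), ("US", 250), ("DE", 80)], ["country", "amount"]
    )
    sales.createOrReplaceTempView("sales")

    # Run a SQL query; Spark SQL's optimizer plans and executes it
    result = spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country")
    result.show()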
41. How do you convert a pandas DataFrame to a Spark DataFrame and vice versa?
Creating a Spark DataFrame and converting it into a pandas DataFrame:
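A minimal sketch of the conversion in both directions, assuming an existing SparkSession named spark:

    import pandas as pd

    # pandas -> Spark
    pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    sdf = spark.createDataFrame(pdf)

    # Spark -> pandas (collects the data to the driver, so use only on small results)
    pdf_back = sdf.toPandas()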
42. How do you slice a data file into 10 small files (a large CSV file with 10 million lines, sliced into 1 million lines per file) using pandas?
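A minimal pandas sketch, assuming the input file is named large.csv and an output naming pattern of part_1.csv through part_10.csv:

    import pandas as pd

    chunk_size = 1_000_000  # 1 million rows per output file
    # read_csv with chunksize streams the file instead of loading all 10M rows at once
    for i, chunk in enumerate(pd.read_csv("large.csv", chunksize=chunk_size), start=1):
        chunk.to_csv(f"part_{i}.csv", index=False)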
43. How do you slice a data file into 10 small files (a large CSV file with 10 million lines, sliced into 1 million lines per file) using Spark?
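A minimal PySpark sketch with illustrative paths; repartition(10) yields roughly one million rows per output part file, though Spark controls the exact file names:

    df = spark.read.csv("/data/large.csv", header=True)
    # Write the data back out as 10 part files of roughly equal size
    df.repartition(10).write.mode("overwrite").option("header", "true").csv("/data/large_split")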
44. What is Databricks? Why is it required? Which Databricks runtimes have you used?
Databricks Runtime is a managed computing environment for Apache Spark, which is an open-source distributed computing framework for big data
processing. Databricks Runtime is optimized for running Spark-based workloads in a cloud-based environment and includes pre-configured clusters, drivers,
and tools that make it easy to set up and manage Spark applications. It provides a unified platform for data engineers, data scientists, and business analysts
to collaborate on big data processing and analytics tasks. Databricks Runtime is part of the Databricks Unified Analytics Platform, which also includes data
integration, machine learning, and visualization tools.
Databricks provides a Docker image for the Databricks Runtime environment. The image can be used to run Databricks workloads locally or in a containerized
environment. The image can be obtained from Docker Hub or built from source using the Databricks open-source repository on GitHub.
45. What is a cluster in Databricks? What are the different cluster types available in Databricks?
A Databricks cluster is a managed computing environment that allows users to run distributed data processing workloads on the Databricks platform. It is a
group of virtual machines that are provisioned and configured to work together to execute distributed data processing tasks, such as data ingestion,
transformation, machine learning, and deep learning. The cluster resources, such as the number and type of virtual machines, can be adjusted based on the
workload requirements, and users can choose from various cluster configurations to optimize performance and cost. Databricks clusters are typically used to
process large volumes of data and to train and deploy machine learning models at scale.
Databricks offers two main types of compute for running workloads: all-purpose compute and job compute.
All-purpose clusters: These clusters are designed for interactive and collaborative work, such as developing and running notebooks, and can be shared by multiple users. They can be created and managed using the Databricks UI or API and can be used to run data processing jobs, ETL pipelines, and machine learning workflows.
Job clusters: Job clusters are a type of ephemeral cluster that are created on-demand to run a specific job and are terminated automatically once the job is
complete. Job clusters are optimized for cost and performance, as they are created with the minimum required resources to run the job. They are typically
used for running ad-hoc or one-time batch jobs, such as data transformations or model training.
Standard clusters: These are the most common type of cluster and are used for general-purpose data processing and analytics workloads. Standard clusters
are highly customizable and can be configured with various virtual machine types, network settings, and storage options.
High Concurrency clusters: These clusters are optimized for running interactive workloads and serving multiple users concurrently. They are designed to
handle small to medium-sized queries and are highly scalable, allowing users to increase or decrease the cluster size based on demand.
GPU clusters: These clusters are used for running deep learning workloads and training machine learning models that require high-performance GPUs. GPU
clusters can be configured with different types of GPUs, such as NVIDIA V100, P100, and K80, and are optimized for running TensorFlow, PyTorch, and other
deep learning frameworks.
Serverless clusters: These clusters are designed to provide a highly scalable and cost-effective computing environment for ad-hoc workloads and bursty
data processing. Serverless clusters automatically scale up or down based on the workload demand and are charged based on the actual usage.
Kubernetes clusters: These clusters allow users to run Databricks workloads on a Kubernetes cluster, giving them more control over the cluster environment
and enabling them to leverage Kubernetes features such as auto-scaling, load balancing, and rolling updates.
Standard mode clusters are now called No Isolation Shared access mode clusters.
High Concurrency with Tables ACLs are now called Shared access mode clusters.
46. How do you connect (mount) Data Lake Storage to Databricks?
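A minimal sketch of mounting a container with dbutils.fs.mount, reading the storage account key from a Databricks secret scope; the account, container, scope, and key names are placeholders:

    storage_account = "mystorageaccount"   # placeholder
    container = "raw"                      # placeholder

    dbutils.fs.mount(
        source=f"wasbs://{container}@{storage_account}.blob.core.windows.net/",
        mount_point=f"/mnt/{container}",
        extra_configs={
            f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
                dbutils.secrets.get(scope="my-scope", key="storage-account-key")
        },
    )

    # Verify the mount by listing its contents
    display(dbutils.fs.ls(f"/mnt/{container}"))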
47. How do you enable the SFTP service on your Azure Data Lake Storage account?
48.How do you securely mount your data from ADLS to Databricks?
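A minimal sketch of a more locked-down mount using an Azure AD service principal with OAuth, reading credentials from a Key Vault-backed secret scope; the account, container, scope, and tenant values are placeholders:

    storage_account = "mystorageaccount"   # placeholder
    container = "curated"                  # placeholder
    tenant_id = "<tenant-id>"              # placeholder

    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id":
            dbutils.secrets.get(scope="kv-scope", key="sp-client-id"),
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="kv-scope", key="sp-client-secret"),
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

    dbutils.fs.mount(
        source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
        mount_point=f"/mnt/{container}",
        extra_configs=configs,
    )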
49. How do you read / write data from ADLS to Databricks?
You can read and write data from Azure Data Lake Storage (ADLS) to Databricks using the following steps:
Mount ADLS Gen1 or Gen2 to Databricks: To mount ADLS to Databricks, you can use the Databricks UI or API. You will need to provide the ADLS account
name and key or an Azure Active Directory token, along with the mount point and configuration options. This will create a virtual filesystem on Databricks that
points to the ADLS storage account.
Read data from ADLS: Once the ADLS account is mounted, you can read data from it using the file APIs in Databricks, such as spark.read or dbutils.fs. For example, you can read a CSV file directly from the mount point.
Write data to ADLS: To write data to ADLS, you can use the same file APIs, but specify the ADLS mount point as the output directory, for example to write a DataFrame to a Parquet file on ADLS. Both directions are shown in the sketch below.
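A minimal sketch, assuming the container is mounted at an illustrative path /mnt/datalake:

    # Read a CSV file from the mounted ADLS path
    df = spark.read.csv("/mnt/datalake/raw/customers.csv", header=True, inferSchema=True)

    # Write the DataFrame back to ADLS as Parquet
    df.write.mode("overwrite").parquet("/mnt/datalake/curated/customers_parquet")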
50.What are the different ways to schedule a data engineering job in Azure?
There are several ways to schedule a data engineering job in Azure, depending on the requirements and use case:
Azure Data Factory: Azure Data Factory (ADF) is a cloud-based data integration service that allows you to create, schedule, and orchestrate data pipelines.
ADF provides a visual interface for building pipelines using drag-and-drop components, as well as code-based pipeline definition using Azure Resource
Manager (ARM) templates. ADF also supports various data sources and destinations, including cloud and on-premises databases, files, and big data stores.
Azure Logic Apps: Azure Logic Apps is a cloud-based service that allows you to create and schedule workflows that integrate with various systems and
services. Logic Apps provides a visual workflow designer that allows you to create workflows using pre-built connectors and custom code. Logic Apps can
integrate with Azure services, as well as third-party services, such as Salesforce, Slack, and Twilio.
Azure Functions: Azure Functions is a serverless compute service that allows you to run event-driven code in response to various triggers. Functions can be
used to run data processing or data integration code on a schedule or in response to an event, such as a file upload or a message in a queue. Functions can
be written in several languages, including C#, Python, and JavaScript.
Azure Batch: Azure Batch is a cloud-based job scheduling and compute management service that allows you to run large-scale parallel and high-
performance computing (HPC) workloads. Batch provides a managed environment for running jobs on clusters of virtual machines, with support for job
scheduling, job dependencies, and scaling. Batch can be used to run data processing or machine learning workloads on a large scale.
Azure Kubernetes Service: Azure Kubernetes Service (AKS) is a managed Kubernetes service that allows you to deploy and manage containerized
applications and services. AKS provides a scalable and highly available environment for running batch jobs or data processing workloads using containerized
applications. AKS can be integrated with Azure services, such as Azure Container Registry, for a seamless end-to-end experience.
Azure VM cron job: In Azure, you can schedule a cron job on a virtual machine (VM) using the built-in Linux cron service.
SSH into your VM: Open a terminal and SSH into your VM using the ssh command and your VM's public IP address or DNS name.
Open the cron configuration file: Once you're connected to the VM, open the cron table for editing with the crontab -e command and add an entry for your job.
Databricks Workflows jobs: These are used to automate and schedule data processing and analysis tasks. Here are some key features of Databricks jobs:
Scheduling: You can create jobs to run on a schedule, such as daily, weekly, or monthly. You can also specify the start time and end time for the job and the
time zone in which it should run.
Recurrence: You can set the job to recur at a specific interval, such as every 15 minutes, every hour, or every day.
months_between
months_between(timestamp1, timestamp2[, roundOff]) - If timestamp1 is later than timestamp2, then the result is positive. If timestamp1 and timestamp2 are
on the same day of month, or both are the last day of month, time of day will be ignored. Otherwise, the difference is calculated based on 31 days per month,
and rounded to 8 digits unless roundOff=false.
SELECT months_between('1997-02-28 10:30:00', '1996-10-30');
trim
trim(str) - Removes the leading and trailing space characters from str.
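A short sketch of both functions in a notebook cell; the months_between result shown matches the example in the Spark SQL documentation, and the trim example is illustrative:

    spark.sql("SELECT months_between('1997-02-28 10:30:00', '1996-10-30') AS diff").show()
    # diff is approximately 3.94959677

    spark.sql("SELECT trim('   Spark SQL   ') AS trimmed").show()
    # trimmed = 'Spark SQL'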
52. How do you read the filename while reading data from ADLS into Databricks?
To read the filename while reading data from ADLS into Databricks, you can use the input_file_name() function in PySpark. Here's an example code snippet:
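A minimal sketch, assuming a directory of CSV files under an illustrative mount path:

    from pyspark.sql.functions import input_file_name

    # Add a column containing the full path of the file each row came from
    df = (
        spark.read.csv("/mnt/datalake/raw/sales/*.csv", header=True)
        .withColumn("source_file", input_file_name())
    )
    df.select("source_file").distinct().show(truncate=False)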
54. How do you read the schema from a particular file using Spark / pandas?
To read the schema from a file using Spark, you can use the printSchema() method of a DataFrame object. This method prints the schema of the DataFrame in
a tree format, showing the data types and structure of each column. Here's an example using PySpark:
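A minimal sketch in both PySpark and pandas, with illustrative file paths:

    # Spark: infer the schema from a file and print it
    df = spark.read.csv("/mnt/datalake/raw/customers.csv", header=True, inferSchema=True)
    df.printSchema()      # tree view of column names and data types
    print(df.schema)      # the StructType object itself

    # pandas: dtypes shows the inferred column types
    import pandas as pd
    pdf = pd.read_csv("/dbfs/mnt/datalake/raw/customers.csv")
    print(pdf.dtypes)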
Delta Lake provides several advantages over traditional data storage solutions for big data analytics. Here are some of the key advantages of using Delta
Lake:
ACID transactions: Delta Lake provides ACID transactions, which ensure that data is processed reliably and without conflicts. This helps to eliminate
common data integrity issues that can arise in distributed environments.
Data versioning: Delta Lake keeps track of every change that is made to the data, allowing users to revert to earlier versions of the data if necessary. This
makes it easy to recover from data corruption or accidental data loss.
Schema enforcement: Delta Lake enforces schema on write, which ensures that all data written to the data lake conforms to a consistent schema. This helps
to eliminate data quality issues and makes it easier to manage and analyze data.
Query optimization: Delta Lake provides features like data skipping, predicate pushdown, and Z-ordering that improve query performance by reducing the
amount of data that needs to be scanned.
Stream processing: Delta Lake supports both batch and streaming workloads, allowing users to process data in real-time and perform near-real-time
analysis.
Open source: Delta Lake is an open-source project that is supported by a large and active community. This makes it easy to get help and support, and to
contribute to the development of the technology.
Overall, Delta Lake provides a reliable, scalable, and performant data storage solution for big data analytics, making it an attractive choice for many
organizations.
56. How do you list all databases and all tables in Databricks?
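A minimal sketch using the Spark catalog API and the equivalent SQL commands:

    # Catalog API: iterate over databases and their tables
    for db in spark.catalog.listDatabases():
        print(db.name)
        for table in spark.catalog.listTables(db.name):
            print("  ", table.name)

    # Equivalent SQL
    spark.sql("SHOW DATABASES").show()
    spark.sql("SHOW TABLES IN default").show()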
Databricks notebooks also support magic commands. To switch languages, you can use the language magic command %<language> at the beginning of a cell; the supported language magic commands are %python, %r, %scala, and %sql. Commonly used magic commands include:
%fs: Access the Databricks File System (DBFS) and interact with files.
%sh: Run shell commands in a notebook cell.
%md: Write markdown in a notebook cell.
%sql: Execute SQL queries against tables in a database.
%python: Switch to Python language mode in a notebook cell.
%scala: Switch to Scala language mode in a notebook cell.
%r: Switch to R language mode in a notebook cell.
%run: Run a notebook, passing arguments if necessary.
%pip: Install Python packages using pip.
%conda: Install Python packages using conda.
%load: Load external code into a notebook cell.
%lsmagic: List all available magic commands.
58. How do you connect Databricks with a SQL datastore?
Azure Databricks supports connecting to external databases using JDBC. The same basic configuration can be used from Python, SQL, and Scala.
Read data with JDBC
You must configure a number of settings to read data using JDBC. Note that each database uses a different format for the <jdbc_url>.
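A minimal PySpark sketch of a JDBC read against an Azure SQL Database; the server, database, table, and secret names are placeholders, and the credentials are read from a secret scope rather than hard-coded:

    jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"  # placeholder

    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.customers")                                  # placeholder table
        .option("user", dbutils.secrets.get("kv-scope", "sql-user"))
        .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
        .load()
    )

    df.show(5)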
Generate an Azure Databricks (ADB) access token in User Settings and use token-based authentication in the ADF linked service.
Create a new pipeline and use the Databricks Notebook activity in the pipeline activities. Add a Databricks notebook to the pipeline by expanding the "Databricks" activity category, then dragging and dropping a Notebook activity onto the pipeline design canvas.
60. What is Azure Purview?
Microsoft Purview provides a unified data governance solution to help manage and govern your on-premises, multicloud, and software as a service (SaaS)
data. Easily create a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data
lineage. Enable data consumers to access valuable, trustworthy data.
Azure Purview is a cloud-based data governance solution from Microsoft that helps organizations discover, manage, and secure their data assets across on-
premises, multi-cloud, and SaaS environments. It provides a unified view of an organization's data landscape, enabling users to understand their data assets
and their relationships, and to discover new data sources and insights.
Azure Purview is designed to help organizations address the challenges of data discovery, cataloging, and governance. It includes features such as automated
data discovery and classification, data lineage and impact analysis, metadata management, data cataloging, and policy enforcement. These features help
organizations ensure the accuracy, completeness, consistency, and security of their data assets throughout their lifecycle.
Azure Purview also provides integration with other Microsoft cloud services, such as Azure Data Factory, Azure Databricks, and Azure Synapse Analytics, as
well as with third-party services, enabling organizations to manage and govern their data assets across a wide range of environments and tools.
Overall, Azure Purview is a comprehensive solution for data governance and management that helps organizations gain greater visibility and control over their
data assets, while also improving compliance and reducing risks associated with data management.
61.What is data governance?
Data governance is a set of policies, procedures, and practices that organizations use to manage their data assets. It encompasses the processes and
controls that ensure the accuracy, completeness, consistency, and security of an organization's data throughout its lifecycle.
Data governance is concerned with ensuring that the right people have access to the right data at the right time, and that data is used in a responsible and
ethical way. It also involves managing data quality, metadata, data standards, and data security.
The goals of data governance include improving the accuracy and consistency of data, ensuring compliance with regulations and standards, reducing risks
associated with data management, and maximizing the value of an organization's data assets.
Data governance is typically managed by a dedicated team or department within an organization, which is responsible for defining policies and procedures,
monitoring compliance, and enforcing standards. This team works closely with other departments, such as IT, legal, and business operations, to ensure that
data governance is integrated throughout the organization.
62.What is data lineage?
Data lineage refers to the journey or path that data takes from its source through various transformations and processes, to its final destination, such as a
report or dashboard. It is a way of tracing the flow of data through a system, and it helps to understand how data is used and manipulated within an
organization.
Data lineage typically includes information about the data's origins, its processing and transformations, and any other data that it may be related to or depend
on. It can also include metadata about the data, such as its format, structure, and quality.
Data lineage is important for a number of reasons. It helps to ensure data accuracy and consistency, by identifying any potential sources of errors or
discrepancies in the data. It also helps to meet regulatory and compliance requirements, by providing a clear audit trail of data usage and handling.
Additionally, it helps to improve data governance and management, by providing a better understanding of how data is used within an organization, and who
has access to it.
64. How do you store REST API data in Databricks Delta Lake?
You can use Python to call the REST API and store the response in JSON format.
Then read the JSON, create a DataFrame, and write the target table as a Delta table using the DataFrame write API.
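A minimal sketch of that flow using the requests library; the API URL, response shape, and table name are assumptions for illustration:

    import json
    import requests

    # Call the REST API (placeholder URL) and parse the JSON response
    response = requests.get("https://api.example.com/v1/orders", timeout=30)
    records = response.json()   # assumed to be a list of JSON objects

    # Create a Spark DataFrame from the JSON records
    rdd = spark.sparkContext.parallelize([json.dumps(r) for r in records])
    df = spark.read.json(rdd)

    # Write the DataFrame out as a Delta table
    df.write.format("delta").mode("append").saveAsTable("bronze.orders_raw")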
66. Can you use Databricks Delta as an operational store, for example for an ordering system or a real-time booking system?
Databricks Delta is primarily designed for OLAP (Online Analytical Processing) workloads, which involve analyzing large datasets to gain insights and make
informed business decisions. OLAP workloads typically involve complex queries that aggregate and summarize data, and require fast access to large volumes
of data.
While Delta can be used for OLTP (Online Transaction Processing) workloads as well, it is not optimized for this type of workload. OLTP workloads involve
managing high volumes of small transactions, typically involving the insertion, deletion, and updating of individual records. These workloads require high
throughput, low latency, and high concurrency, and typically involve relatively small datasets.
While Delta does provide transactional capabilities, including ACID transactions and data versioning, its design and performance characteristics are better
suited for OLAP workloads, which involve larger datasets and more complex queries.
That being said, Delta can be used in combination with other tools and platforms to support OLTP workloads. For example, you could use Delta to store and
manage historical data, and use a separate database or data store to handle real-time transactional processing.
Overall, Databricks Delta is a powerful and flexible platform for managing and processing large datasets, and can be a great choice for OLAP workloads that
require fast access to large volumes of data.
69. What is best to use within Databricks: Python, Scala, or SQL?
The choice of programming language to use in Databricks depends on a variety of factors, including the nature of the data and the processing that you need to
perform, as well as your personal preferences and expertise. In general, Databricks supports several programming languages, including Python, Scala, SQL, R, and
Java. Each language has its own strengths and weaknesses, and may be better suited to certain types of tasks.
Here are a few factors to consider when deciding which language to use in Databricks:
Data types and processing: Python is a popular choice for data analysis and machine learning tasks, as it has a large number of libraries and tools for these
tasks, including NumPy, Pandas, and Scikit-learn. Scala, on the other hand, is a good choice for tasks that require high-performance data processing, such as
streaming or distributed computing. SQL is best for tasks that require querying and processing data stored in databases or data warehouses.
Integration with Spark: Scala is a native language for Apache Spark, and is often used for developing Spark applications. Python, on the other hand, has a
Spark API that allows you to write Spark applications using Python. SQL is used to express relational queries in Spark SQL, which can be used to process
large-scale data sets.
Team skills and preferences: The choice of language may also depend on the skills and preferences of your team. If your team has more experience with
Python, it may be more efficient to use Python. If your team has more experience with SQL, it may be more efficient to use SQL.
Overall, the best language to use in Databricks depends on your specific use case and the nature of the data and processing that you need to perform. It's
often a good idea to experiment with different languages to see which one works best for your needs. Databricks provides an environment that supports
multiple languages and makes it easy to switch between them, so you can choose the language that is most appropriate for each task.
Azure offers a broad set of data services, including:
Azure Data Lake Storage: A highly scalable and secure data lake service that allows you to store and analyze large amounts of data.
Azure SQL Database: A fully managed relational database service that offers high availability, security, and scalability.
Azure Cosmos DB: A globally distributed, multi-model database service that supports NoSQL data models, including key-value, graph, and document.
Azure HDInsight: A fully managed cloud service that makes it easy to process big data using popular open-source frameworks such as Hadoop, Spark, Hive,
and HBase.
Azure Stream Analytics: A real-time data processing service that allows you to analyze and gain insights from streaming data.
Azure Databricks: A collaborative, cloud-based platform for data engineering, machine learning, and analytics that is based on Apache Spark.
Azure Synapse Analytics: An analytics service that allows you to analyze large amounts of structured and unstructured data using both serverless and
provisioned resources.
Azure Machine Learning: A cloud-based machine learning service that allows you to build, train, and deploy machine learning models.
Azure Cognitive Search: A fully managed search-as-a-service that allows you to add search capabilities to your applications using natural language
processing and machine learning.
71.What is Azure Cosmos DB?
Cosmos Database (DB) is a globally distributed, low latency, multi-model database for managing data at large scales. It is a cloud-based NoSQL database
offered as a PaaS (Platform as a Service) from Microsoft Azure. It is a highly available, high throughput, reliable database and is often called a serverless
database. Cosmos DB evolved from Azure DocumentDB and is available in Azure regions around the world.
The key features of Cosmos DB are:
Globally Distributed: With Azure regions spread out globally, the data can be replicated globally.
Scalability: Cosmos DB is horizontally scalable to support hundreds of millions of reads and writes per second.
Schema-Agnostic Indexing: This enables the automatic indexing of data without schema and index management.
Multi-Model: It can store data in Key-value Pairs, Document-based, Graph-based, Column Family-based databases. Global distribution, horizontal
partitioning, and automatic indexing capabilities are the same irrespective of the data model.
High Availability: It has 99.99 % availability for reads and writes for both multi region and single region Azure Cosmos DB accounts.
Low Latency: The global availability of Azure regions allows for the global distribution of data, which further makes it available nearest to the customers. This
reduces the latency in retrieving data.
73. What is Delta time travel? Did you work with Delta time travel? Explain where you used it.
Delta Time Travel is a feature of Databricks Delta that allows users to access and query previous versions of a Delta table. With Delta Time Travel, users can
query a table as it appeared at any point in time, without having to create and manage multiple versions of the table manually.
To use Delta Time Travel, you can specify a version number or a timestamp when querying a Delta table. For example, you can use the AS OF syntax in a SQL
query to query the table as it appeared at a specific timestamp, like this:
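A minimal sketch of both the SQL and DataFrame reader forms, with an illustrative table name and timestamp:

    # SQL: query the table as of a timestamp or a version number
    spark.sql("SELECT * FROM sales.customers TIMESTAMP AS OF '2023-01-01'").show()
    spark.sql("SELECT * FROM sales.customers VERSION AS OF 12").show()

    # DataFrame reader equivalent
    df = spark.read.option("timestampAsOf", "2023-01-01").table("sales.customers")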
I have used Delta Time Travel in a project where we needed to maintain a history of changes to a customer database, so we could track changes over time
and understand how the data had evolved. We used Delta Time Travel to query the database as it appeared at different points in time, and to compare
versions of the data to identify changes and trends. This allowed us to gain insights into the data and make better decisions based on historical trends.
75.Do you have an option to work with Spark without databricks in Azure?
Yes, you can work with Apache Spark on Azure without using Databricks. Azure provides several services for running Spark workloads, including:
Azure HDInsight: This is a fully-managed cloud service that makes it easy to process big data using popular open-source frameworks such as Hadoop,
Spark, Hive, and HBase. With HDInsight, you can deploy and manage Spark clusters in Azure, and run Spark jobs using familiar tools and languages.
Azure Synapse Analytics: This is an analytics service that allows you to analyze large amounts of structured and unstructured data using both serverless and
provisioned resources. Synapse Analytics includes a Spark pool that allows you to run Spark jobs and Spark SQL queries on large data sets.
Azure Data Factory: This is a cloud-based data integration service that allows you to create and schedule data pipelines that can move and transform data
between various sources and destinations, including Spark clusters.
Azure Kubernetes Service (AKS): This is a fully managed Kubernetes service that allows you to deploy and manage containerized applications and services,
including Spark applications. With AKS, you can deploy Spark clusters as Kubernetes pods and manage them using Kubernetes tools and APIs.
These services provide different ways of running Spark workloads on Azure, depending on your needs and requirements. They offer varying levels of
scalability, performance, and cost, and support different programming languages, data sources, and data processing frameworks.
76.What is Azure Synapse Analytics?
Azure Synapse Analytics is an analytics service that brings together data integration, enterprise data warehousing, and big data analytics. With Synapse Analytics, users can:
Ingest data from various sources, including streaming data, batch data, and big data.
Store data in a scalable and flexible data lake that uses Azure Blob Storage or Azure Data Lake Storage Gen2.
Analyze data using a variety of tools and services, including Apache Spark, Power BI, Azure Machine Learning, and Azure Databricks.
Build data pipelines and workflows to automate data integration and processing using Azure Data Factory.
Use a serverless SQL pool or dedicated SQL pool to run fast, scalable SQL queries on large data sets.
Secure data using advanced security and compliance features, including Azure Active Directory integration, network isolation, and row-level security.
Synapse Analytics also offers an integrated development environment (IDE) called Synapse Studio, which provides a unified workspace for data engineers,
data scientists, and business analysts to collaborate on data-related projects. The IDE includes tools for data preparation, data transformation, data
visualization, and machine learning, as well as a notebook environment for running code and exploring data.
77.How do you get all reporting employees of a manager as a list in Spark SQL?
To get all reporting employees of a manager as a list in Spark SQL, you can use a self-join and a GROUP BY clause. Here's an example SQL query:
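A minimal sketch, assuming an employees table with employee_id and manager_id columns (both names are placeholders):

    result = spark.sql("""
        SELECT m.employee_id AS manager_id,
               collect_list(e.employee_id) AS reporting_employee_ids
        FROM employees e
        JOIN employees m
          ON e.manager_id = m.employee_id   -- self-join: match each employee to its manager row
        GROUP BY m.employee_id              -- one output row per manager
    """)
    result.show(truncate=False)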
This query joins the employees table with itself using the manager_id column to match each employee with their manager. The collect_list function is used to
aggregate the employee_id values for each manager into a list. The GROUP BY clause groups the results by manager_id.
This query will return a result set with two columns: manager_id and reporting_employee_ids. The manager_id column contains the ID of each manager, and
the reporting_employee_ids column contains a list of the IDs of all the employees reporting to that manager. You can further process this result set in Spark
SQL or other Spark APIs to generate the desired output format.
78.How do you loop through all records of a delta table using SQL and Python?
You can loop through all records of a Delta table using SQL and Python by using the Delta Lake pySpark API in a Python script.
Here's an example script:
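A minimal sketch using the Delta Lake PySpark API; the table path /mnt/delta/customers and the printed columns are placeholders:

    from delta.tables import DeltaTable

    # Load the Delta table and convert it to a DataFrame
    dt = DeltaTable.forPath(spark, "/mnt/delta/customers")   # hypothetical path
    df = dt.toDF()

    # collect() brings all rows to the driver, so this pattern suits small tables only
    for row in df.collect():
        print(row.asDict())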
It loads the Delta table into a DataFrame, loops through each row returned by the df.collect() method, and prints the values of each row.
Alternatively, you can use Spark SQL to loop through all records of a Delta table. Here's an example SQL query:
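The same idea expressed through Spark SQL (wrapped in Python here), assuming the Delta table is registered as customers:

    rows = spark.sql("SELECT * FROM customers").collect()
    for row in rows:
        print(row.asDict())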
80.How do you convert a DataFrame to a GraphFrame?
To convert a DataFrame to a GraphFrame in Databricks, you can use the GraphFrame.fromEdges method to create a GraphFrame from a DataFrame that
represents the edges of the graph. Here's an example of how to do this:
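A minimal PySpark sketch, assuming the graphframes library is installed on the cluster and using made-up edge values. The GraphFrame.fromEdges helper is part of the Scala API; in Python the GraphFrame(vertices, edges) constructor is used, so the vertex DataFrame is derived from the edge DataFrame here:

    from graphframes import GraphFrame
    from pyspark.sql.functions import col

    # Edge DataFrame with the required src and dst columns
    edges = spark.createDataFrame(
        [("a", "b"), ("b", "c"), ("c", "a")],
        ["src", "dst"]
    )

    # Build the vertex DataFrame from the distinct endpoints of the edges
    vertices = (edges.select(col("src").alias("id"))
                     .union(edges.select(col("dst").alias("id")))
                     .distinct())

    g = GraphFrame(vertices, edges)
    display(g.vertices)   # display() is available in Databricks notebooks
    display(g.edges)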
In this example, the createDataFrame method is used to create a DataFrame edges that contains two columns, src and dst, which represent the edges of the graph. A GraphFrame g is then created from the edges (via GraphFrame.fromEdges in the Scala API, or by deriving a vertex DataFrame and calling the GraphFrame constructor in Python, as in the sketch above). Finally, the display method is used to show the vertices and edges of the graph.
Note that before you can convert a DataFrame to a GraphFrame, you need to make sure that the DataFrame has the correct schema and contains the
appropriate columns to represent the vertices and edges of the graph. In addition, you may need to perform additional transformations on the DataFrame to
prepare it for use as a graph, such as adding vertex properties or filtering out irrelevant data.
81.How do you stream data from IoT devices in Azure Databricks?
Data Ingest: stream real-time raw sensor data from Azure IoT Hub into the Delta format in Azure Storage.
Data Processing: stream process the sensor data from raw (Bronze) to aggregated (Silver) to enriched (Gold) Delta tables on Azure Storage.
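A hedged sketch of the ingest step, assuming the Azure Event Hubs connector (azure-eventhubs-spark) is installed on the cluster, that <IOT_HUB_CONNECTION_STRING> points at the IoT Hub's Event Hubs-compatible endpoint, and that the /mnt/iot paths are placeholder mount points:

    # Encrypt the connection string as the Event Hubs connector expects
    # (sc is the SparkContext predefined in Databricks notebooks)
    conn = "<IOT_HUB_CONNECTION_STRING>"
    ehConf = {
        "eventhubs.connectionString":
            sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn)
    }

    # Read the raw sensor stream
    raw = (spark.readStream
                .format("eventhubs")
                .options(**ehConf)
                .load())

    # Land the raw payloads in a Bronze Delta table on Azure Storage
    (raw.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/mnt/iot/bronze/_checkpoint")
        .start("/mnt/iot/bronze"))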
82.How do you find available versions of a delta table?
The DESCRIBE HISTORY command returns provenance information, including the operation, user, timestamp, and version, for each write to a table. Table history is retained for 30 days by default.
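For example (a minimal sketch assuming a Delta table named customers and a hypothetical storage path):

    # List every version recorded in the table's transaction history
    spark.sql("DESCRIBE HISTORY customers").show(truncate=False)

    # The same works for a path-based table
    spark.sql("DESCRIBE HISTORY delta.`/mnt/delta/customers`").show(truncate=False)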
83.How do you find data history from Delta tables using time stamp or version?
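You can read an older snapshot either with VERSION AS OF / TIMESTAMP AS OF in SQL or with the versionAsOf / timestampAsOf DataFrame reader options. A minimal sketch, assuming the same customers table and a hypothetical path:

    # SQL: query a specific version or a point in time
    spark.sql("SELECT * FROM customers VERSION AS OF 3").show()
    spark.sql("SELECT * FROM customers TIMESTAMP AS OF '2023-01-01'").show()

    # DataFrame reader options for a path-based table
    old_df = (spark.read.format("delta")
                   .option("versionAsOf", 3)    # or .option("timestampAsOf", "2023-01-01")
                   .load("/mnt/delta/customers"))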
84.How do you remove all versions of a Delta table?
To remove all versions of a Delta table in Databricks, you can use the VACUUM command with the RETAIN 0 HOURS option. This command will remove all
the older versions of the table that are older than the retention period, which in this case is set to zero hours. Here is an example:
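A minimal sketch, assuming a table named customers. Databricks refuses retention periods below the default 7 days unless the safety check is disabled first:

    # Disable the retention duration safety check (use with care)
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

    # Remove all data files that are not required by the current version of the table
    spark.sql("VACUUM customers RETAIN 0 HOURS")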
86.Explain various SQL joins with examples?
In Spark SQL, several types of joins can be used to combine data from two or more tables:
Inner Join: Returns only the rows where there is a match between the keys in both tables. Syntax: SELECT ... FROM table1 JOIN table2 ON table1.key = table2.key
Left Outer Join: Returns all the rows from the left table and the matching rows from the right table. If there is no match in the right table, it returns null values for the right table's columns. Syntax: SELECT ... FROM table1 LEFT JOIN table2 ON table1.key = table2.key
Right Outer Join: Returns all the rows from the right table and the matching rows from the left table. If there is no match in the left table, it returns null values for the left table's columns. Syntax: SELECT ... FROM table1 RIGHT JOIN table2 ON table1.key = table2.key
Full Outer Join: Returns all the rows from both tables and fills null values for the columns that do not have a matching key in the other table. Syntax: SELECT ... FROM table1 FULL OUTER JOIN table2 ON table1.key = table2.key
Left Semi Join: Returns all the rows from the left table for which there is a match in the right table, without returning any columns from the right table. In other words, it behaves like an inner join that keeps only the left table's columns. Syntax: SELECT ... FROM table1 LEFT SEMI JOIN table2 ON table1.key = table2.key
Left Anti Join: Returns all the rows from the left table for which there is no match in the right table, without returning any columns from the right table. In other words, it returns all the rows from the left table that do not have a corresponding match in the right table. Syntax: SELECT ... FROM table1 LEFT ANTI JOIN table2 ON table1.key = table2.key
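To make the difference between LEFT SEMI and LEFT ANTI concrete, here is a small hedged sketch with two made-up tables:

    orders = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "customer"])
    shipped = spark.createDataFrame([(1,), (3,)], ["id"])
    orders.createOrReplaceTempView("orders")
    shipped.createOrReplaceTempView("shipped")

    # Returns orders 1 and 3 (those with a matching shipment), only the orders columns
    spark.sql("SELECT * FROM orders LEFT SEMI JOIN shipped ON orders.id = shipped.id").show()

    # Returns order 2 (the one with no matching shipment)
    spark.sql("SELECT * FROM orders LEFT ANTI JOIN shipped ON orders.id = shipped.id").show()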
87.What is partition in databricks? how does it work?
A partitioned table is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller
partitions, you can improve query performance and control costs by reducing the number of bytes read by a query. You partition tables by specifying a
partition column which is used to segment the table.
If a query uses a qualifying filter on the value of the partitioning column, the engine can scan only the partitions that match the filter and skip the remaining partitions. This process is called partition pruning.
In a partitioned table, data is stored in separate directories, one per partition value, each of which holds one partition of the data. The table also maintains metadata about its partitions, which is used to plan and optimize queries.
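A minimal sketch of creating and querying a partitioned Delta table in Databricks; the events data and the event_date partition column are made up for illustration:

    events = spark.createDataFrame(
        [("2023-01-01", "click"), ("2023-01-01", "view"), ("2023-01-02", "click")],
        ["event_date", "event_type"]
    )

    # Write the data partitioned by event_date; each partition value becomes its own directory
    (events.write
           .format("delta")
           .partitionBy("event_date")
           .mode("overwrite")
           .saveAsTable("events_partitioned"))

    # A filter on the partition column allows the engine to prune the other partitions
    spark.sql("SELECT count(*) FROM events_partitioned WHERE event_date = '2023-01-01'").show()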
88.Read a text that contains millions of words, and find the top 10 words in the entire text, excluding prepositions and articles?
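A hedged PySpark sketch, assuming the text lives at the hypothetical path /mnt/data/book.txt and using a small illustrative stop-word list (a fuller list of prepositions and articles would be used in practice):

    from pyspark.sql.functions import explode, split, lower, col

    stop_words = ["a", "an", "the", "of", "in", "on", "at", "to", "for", "with", "by", "from"]

    # Split each line into lowercase words and drop empty strings and stop words
    words = (spark.read.text("/mnt/data/book.txt")
                  .select(explode(split(lower(col("value")), "\\W+")).alias("word"))
                  .filter((col("word") != "") & (~col("word").isin(stop_words))))

    # Count occurrences and keep the 10 most frequent words
    top10 = words.groupBy("word").count().orderBy(col("count").desc()).limit(10)
    top10.show()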
89.How do you find all possible substrings of a given name or string, excluding spaces and special characters? For example, “data” → “d”, ”a”, “t”, “da”,
“dt”, “td”, “ta”, “at”, “aa”…etc….”data”
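A plain-Python sketch for the standard, contiguous-substring interpretation; the combinations listed in the question (such as “dt” or “aa”) would additionally require generating permutations of character subsets:

    def all_substrings(s: str):
        # Drop spaces and special characters before generating substrings
        cleaned = "".join(ch for ch in s if ch.isalnum())
        subs = set()
        for i in range(len(cleaned)):
            for j in range(i + 1, len(cleaned) + 1):
                subs.add(cleaned[i:j])
        return sorted(subs, key=lambda x: (len(x), x))

    print(all_substrings("data"))
    # ['a', 'd', 't', 'at', 'da', 'ta', 'ata', 'dat', 'data']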
90.Read data from multiple CSV files with the same schema, and display the number of records from each file along with the filename (get the file name into the DataFrame / table)?
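A minimal sketch, assuming the CSV files live under the hypothetical folder /mnt/data/csv/ and have a header row:

    from pyspark.sql.functions import input_file_name

    # Read all CSV files and tag each row with the file it came from
    df = (spark.read
               .option("header", "true")
               .csv("/mnt/data/csv/*.csv")
               .withColumn("file_name", input_file_name()))

    # Record count per source file
    df.groupBy("file_name").count().show(truncate=False)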
91.There are 10 million records in the table and the schema does not contain the ModifiedDate column. One cell was modified the next day in the table.
How will you fetch that particular information that needs to be loaded into the warehouse?
One way to handle this is to compare the latest snapshot of the source table with the copy previously loaded into the warehouse using a left anti join on all columns; any row whose values changed, including the one with the modified cell, will be returned. In Spark, a left anti join is a type of join operation that returns all the rows from the left DataFrame that do not have a matching key in the right DataFrame. In other words, it returns the rows in the left DataFrame that are not present in the right DataFrame.
The left anti join operation is useful when you want to find the rows in one DataFrame that are not present in another DataFrame based on a common key
column. This operation can be used to filter out the rows in the left DataFrame that do not have a corresponding key in the right DataFrame.
The left anti join operation can be performed using the join method of the DataFrame API, with the how parameter set to "left_anti".
Here's an example:
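A hedged sketch of that idea; the table names below are assumptions, and the anti join is performed on every column so that the row containing the modified cell (and any new rows) is returned:

    # Current snapshot of the source and the copy previously loaded into the warehouse
    today = spark.table("source_table")
    yesterday = spark.table("warehouse_copy")

    # Rows in today's snapshot that have no identical row in yesterday's copy
    changed_or_new = today.join(yesterday, on=today.columns, how="left_anti")
    changed_or_new.show()

    # today.exceptAll(yesterday) would achieve a similar comparison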
92.Read covid data from https://data.cdc.gov/ and provide the number of diseased patients per day for the last 5 days.
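A hedged sketch using the CSV export that data.cdc.gov datasets expose; the dataset id in the URL and the column names (submission_date, new_case) are assumptions and would need to be replaced with the actual dataset being used:

    import pandas as pd

    url = "https://data.cdc.gov/resource/<dataset-id>.csv"   # hypothetical dataset endpoint
    df = pd.read_csv(url)

    # Aggregate daily counts and keep the most recent 5 days
    df["submission_date"] = pd.to_datetime(df["submission_date"])
    daily = (df.groupby("submission_date")["new_case"].sum()
               .sort_index()
               .tail(5))
    print(daily)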
93.Make a secure connection from databricks to ADLS and read data from a CSV file and convert it into a delta table with appropriate schema
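A hedged sketch using a service principal (OAuth) whose credentials are kept in a Databricks secret scope; the scope, key, storage account, container, file path, and column names are placeholders:

    # Retrieve the client secret from a secret scope rather than hard-coding it
    service_credential = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")

    # Configure OAuth access to ADLS Gen2 for this Spark session
    spark.conf.set("fs.azure.account.auth.type.mystorageacct.dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type.mystorageacct.dfs.core.windows.net",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id.mystorageacct.dfs.core.windows.net",
                   "<application-client-id>")
    spark.conf.set("fs.azure.account.oauth2.client.secret.mystorageacct.dfs.core.windows.net",
                   service_credential)
    spark.conf.set("fs.azure.account.oauth2.client.endpoint.mystorageacct.dfs.core.windows.net",
                   "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

    # Read the CSV with an explicit schema and write it out as a Delta table
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("customer_id", IntegerType(), True),   # columns are assumptions
        StructField("name", StringType(), True),
    ])

    df = (spark.read
               .option("header", "true")
               .schema(schema)
               .csv("abfss://mycontainer@mystorageacct.dfs.core.windows.net/raw/customers.csv"))

    df.write.format("delta").mode("overwrite").saveAsTable("customers")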
94.If you are running a large number of pipelines and they are taking a long time to execute, how do you resolve these issues?
If you are experiencing slow pipeline execution times in Azure Data Factory, there are several steps you can take to optimize performance:
Use parallelism: If your pipelines are processing large amounts of data, consider using parallelism to split the workload across multiple activities or pipelines.
This can help speed up processing times and reduce the overall time required to complete the pipeline.
Optimize data movement: Data movement can be a bottleneck for pipeline performance, so it's important to optimize the data movement as much as
possible. This could involve compressing data before transfer, using partitioning to move data in smaller chunks, or using a dedicated transfer service like
Azure Data Box.
Optimize data transformation: If your pipelines involve complex data transformations, consider pushing them down to more efficient data processing engines such as Databricks or Synapse Analytics. This can help reduce the time required to transform data and speed up overall pipeline execution.
Optimize infrastructure: Consider upgrading the infrastructure used to run the pipelines. This might involve upgrading the VM size, increasing the number of
nodes in a cluster, or scaling out by adding more worker nodes.
Monitor and optimize: Monitor the performance of your pipelines regularly and use performance metrics to identify areas for improvement. This could involve
using tools like Azure Monitor or third-party monitoring solutions to track metrics such as execution time, data throughput, and resource utilization.
96.Different Types of triggers in ADF?
In Azure Data Factory, there are three types of triggers available for executing pipelines:
Schedule Trigger: A Schedule Trigger allows you to run a pipeline on a recurring schedule, such as once a day or once an hour. You can define the frequency
and start time for the trigger, as well as any additional parameters, and the trigger will automatically execute the pipeline at the specified times.
Event-Based Trigger: An Event-Based Trigger allows you to execute a pipeline in response to an event, such as a file being added to a data store, a message
being posted to a queue, or an HTTP request being received. You can configure the trigger to monitor specific events and trigger the pipeline when those
events occur.
Tumbling Window Trigger: A Tumbling Window Trigger allows you to execute a pipeline on a recurring schedule, but with a more complex definition of the
trigger time. You can define a start time and end time for the trigger, as well as the duration of the window, and the trigger will execute the pipeline at the start
of each window.
Each of these trigger types has its own specific use cases and benefits. For example, a Schedule Trigger might be appropriate for a pipeline that needs to run on a set schedule, while an Event-Based Trigger might be useful for a pipeline that needs to process data in near real time as it's generated. The Tumbling Window Trigger is suited to processing data in fixed-size, contiguous, non-overlapping time windows.
97.What are the different ways to execute pipelines in ADF?
There are several ways to execute pipelines in Azure Data Factory, including:
Trigger-based execution: You can create a trigger that automatically executes a pipeline on a specified schedule or when an event occurs (such as when
new data is added to a data store).
Ad-hoc execution: You can manually execute a pipeline from the Azure Data Factory user interface or programmatically using the REST API or Azure
PowerShell.
Event-based execution: You can use Azure Event Grid to trigger a pipeline when an event occurs in an Azure service such as Blob Storage, Event Hub, or IoT
Hub.
External execution: You can use a third-party scheduling tool or an orchestration tool such as Azure Logic Apps or Azure Functions to execute pipelines.
Continuous integration and delivery (CI/CD): You can use Azure DevOps or another CI/CD tool to automatically build, test, and deploy Data Factory
pipelines.
Overall, these execution options provide a lot of flexibility for running Data Factory pipelines in different scenarios, whether it's on a set schedule, in response
to an event, or manually triggered as needed.
98.Best way to copy large data from on-premises to a data lake using ADF?
The performance of a self-hosted integration runtime in Azure Data Factory can be influenced by a variety of factors, including the configuration of the runtime
and the performance of the hardware it is running on. Here are some tips to optimize the performance of a self-hosted integration runtime:
Use a dedicated machine: To optimize performance, use a dedicated machine for the self-hosted integration runtime. Avoid sharing the machine with other
workloads that may affect its performance.
Optimize machine resources: Ensure that the machine used for the self-hosted integration runtime has sufficient CPU, memory, and disk space to handle
the workloads.
Use compression: Compressing the data before transferring it can reduce the amount of data that needs to be transferred, which can result in faster transfer
speeds.
Use multi-part uploads: If the file is very large, consider using multi-part uploads. This allows the file to be split into smaller chunks, which can be transferred
in parallel. This can significantly improve the transfer speed.
Use binary copy: Binary copy is a feature in Azure Data Factory that allows you to copy files between different file-based data stores, such as Azure Blob Storage, Azure Data Lake Storage, and on-premises file systems. Binary copy enables you to copy large files quickly and efficiently, without having to read the entire file into memory.
Use a self-hosted integration runtime: A self-hosted integration runtime can be installed on-premises and used to securely transfer data between on-premises data stores and cloud-based data stores. It can take advantage of the faster network speeds and reduced latency of the on-premises network.
Use Azure ExpressRoute: Azure ExpressRoute provides a dedicated, private connection between an on-premises network and Azure. This can improve the
performance of data transfer by providing faster and more reliable connectivity.
Use Azure Data Box: If the data is very large, consider using Azure Data Box. Azure Data Box is a physical appliance that can be shipped to the on-premises
location to transfer large amounts of data.
Optimize the on-premises network: Ensure that the on-premises network is optimized for data transfer, with sufficient bandwidth and low latency. Consider
upgrading the network infrastructure if necessary.
99.How To Get the latest added file in a folder using Azure Data Factory?
To get the latest added file in a folder using Azure Data Factory, you can use the "Get Metadata" activity with the "Child Items" field and sort the results by creation or modification time. Here are the steps to do this:
Add a "Get Metadata" activity to your pipeline and configure it to connect to the folder you want to monitor.
In the "Get Metadata" activity, select the "Child Items" option as the child item.
Under the "Field List" tab, add the "creationTime" and/or "lastModified" fields.
Under the "Field List" tab, click on the "Add dynamic content" button to add an expression that sorts the files by the creation or modification time. The
expression should look like this:
This will sort the files in descending order by their creation or modification time.
Finally, add an "If Condition" activity to check if any files were found in the folder. The expression should be:
If the output is empty, you can use a "Set Variable" activity to set a default value. If there are files, you can use a "Set Variable" activity to set the latest file
name and/or path using the expression:
100.What is the meaning of a hierarchical folder structure in data engineering, and how is it used in organizing and managing data within a data lake?
In data engineering, hierarchical folder structure refers to the organization of data files and folders in a hierarchical or tree-like structure, where files and folders
are arranged in a parent-child relationship. In this structure, each folder can contain one or more sub-folders, and each sub-folder can contain one or more
files or additional sub-folders. This allows for a logical organization of data files and facilitates efficient storage and retrieval of data.
In the context of data lakes, a hierarchical folder structure can be used to organize data files in a way that reflects the different data domains, data sources, or
business units. For example, a data lake may have a top-level folder for each data domain (such as sales, finance, or customer data), and within each domain
folder, there may be sub-folders for each data source (such as ERP systems, CRM systems, or social media platforms).
A hierarchical folder structure can also be used to enforce data governance and access controls, by setting permissions at the folder or file level to control
who can view or modify data.
Learn & Lead