v3 GCP Service-Wise Interview Questions
Dataproc
1. What is Dataproc?
2. What is a machine family? What are the types of machine families? (General-Purpose,
Storage-Optimized, Compute-Optimized, GPU-Focused)
3. What is the difference between Dataproc and Dataflow and which one is costlier?
4. I have created Dataproc cluster and accidentally deleted that cluster, can we retrieve
it back? Does the data get deleted? If yes, how can we save it?
5. I have created Dataproc cluster and my jobs are running slow, what are the ways to
optimize my current jobs?
6. What is the basic difference between zonal cluster and regional cluster?
7. What are preemptible clusters?
8. What is the best practice to optimize Dataproc cost?
9. How will you optimize Spark Performance on Dataproc Cluster?
10. We have one file in Cloud Storage and we want to process that file in Apache Spark
on a Dataproc Cluster, how can we submit Spark job to run the file using CLI?
11. How will you monitor the job progress and identify the issue if the Spark job is taking
a long time to run?
12. How can you handle sudden surge in data if using Dataproc?
Dataflow
1. What is Dataflow?
2. Major components of Dataflow?
3. How Dataflow automatically scales?
4. What is windowing and what is its purpose?
5. How would you schedule a Dataflow workflow without using Cloud Composer?
6. Explain autoscaling in Dataflow.
7. How does Dataflow handle errors or job failures?
Miscellaneous/Project-Specific Questions
1. Explain your project and the services you used.
2. What are the operators used in your DAGs in Cloud Composer?
3. What kind of data transformation was done during the project?
4. How did you perform orchestration, task dependency, and pipeline management in
your project?
5. How would you optimize and troubleshoot jobs running in Dataproc, BigQuery, or
Dataflow?
ANSWERS TO ALL THE QUESTIONS ASKED ABOVE
Google Cloud Storage (GCS):
1. How to Change the Bucket Location from One Region to Another?
• Bucket location cannot be changed directly.
o Why? Once a GCS bucket is created in a specific location, it is geographically
bound to that region. The location defines where the data is stored.
o Solution:
1. Create a new bucket in the desired region.
2. Copy the data from the old bucket to the new bucket using gsutil or
the Google Cloud Console.
3. Delete the old bucket (optional) if it’s no longer needed.
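A minimal sketch of step 2 using the google-cloud-storage Python client (bucket names are placeholders; gsutil cp -r or Storage Transfer Service achieves the same thing):
Python:
from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket("old-bucket")        # existing bucket (placeholder name)
dst_bucket = client.bucket("new-bucket-eu")     # new bucket created in the target region

# Copy every object from the old bucket into the new one
for blob in client.list_blobs(src_bucket):
    src_bucket.copy_blob(blob, dst_bucket, blob.name)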
5. What Are the Ways to Upload a CSV File into GCS Bucket?
• You can upload files to GCS using:
1. Web Interface (Console):
▪ Navigate to GCS in the Google Cloud Console, click on your bucket,
and use the Upload Files button.
2. gsutil (CLI):
▪ Use the gsutil cp command:
gsutil cp /local/path/to/file.csv gs://your-bucket-name/
3. API:
▪ Use the Cloud Storage API (client libraries) to programmatically upload
data from your application; a short sketch follows this list.
4. Storage Transfer Service:
▪ Use this for large data transfers from external sources or cloud storage
(e.g., AWS S3 to GCS).
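A minimal sketch of the API option (item 3) using the google-cloud-storage Python client (bucket name and file paths are placeholders):
Python:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")      # placeholder bucket name
blob = bucket.blob("uploads/file.csv")          # destination object path

# Upload a local CSV file to the bucket
blob.upload_from_filename("/local/path/to/file.csv")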
6. gsutil Command for Copying File from One Bucket to Another Bucket?
• The command is:
gsutil cp gs://source-bucket-name/file.csv gs://destination-bucket-name/
This copies the file from one bucket to another.
8. What Are Two Types of Permissions on Bucket Level? (Access Control List & IAM)
• Access Control List (ACL):
o Grants permissions to specific users or groups to access or manage objects in
the bucket.
o Permissions include READ, WRITE, OWNER for the objects and bucket.
• Identity and Access Management (IAM):
o IAM provides more granular, role-based access control to resources in Google
Cloud.
o Roles like roles/storage.objectViewer or roles/storage.objectAdmin allow
you to manage access at the bucket or project level.
11. What is the Maximum File Size We Can Upload to GCS Bucket?
• The maximum file size that can be uploaded is 5 TB per object.
12. We Want to Upload a File in Bucket, What is the Optimized Method by Which We Can
Reduce Cost and Save from Any Disaster?
• Cost Reduction:
o Store data in Nearline, Coldline, or Archive classes for infrequent access.
o Implement bucket lifecycle policies to transition older data to cheaper
classes.
• Protection from Disaster:
o Enable versioning in the bucket to keep multiple versions of objects.
o Use Object Access Control Lists (ACLs) or IAM to restrict access.
o Enable Object Locking to prevent data from being deleted or overwritten.
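A sketch of the lifecycle and versioning points above using the google-cloud-storage Python client (bucket name, ages, and storage classes are illustrative):
Python:
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("your-bucket-name")  # placeholder bucket name

# Disaster protection: keep previous versions of overwritten/deleted objects
bucket.versioning_enabled = True

# Cost reduction: move objects to cheaper classes as they age, then delete them
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_delete_rule(age=365)

bucket.patch()  # apply the bucket configuration changes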
13. How to Add Bucket-Level Permission to Have Only View Level Access?
• Use IAM roles like roles/storage.objectViewer to allow read-only access to objects in
the bucket.
o This will grant view-only permissions without allowing any modifications.
Example:
gsutil iam ch user:your-email@example.com:roles/storage.objectViewer gs://your-bucket-name
14. How Do You Ensure Data Consistency in a GCP Bucket?
• GCS provides strong consistency for objects:
o After an object is uploaded, it is immediately available for reading (no
eventual consistency).
o Atomic updates: When you overwrite an object, the new data is immediately
visible.
o Versioning can help you maintain a history of objects and protect against
accidental changes.
20. Versioning?
• Versioning in GCS allows you to store multiple versions of the same object.
o Every time an object is overwritten, the previous version is still stored in the
bucket.
o Enable versioning through the Cloud Console or gsutil:
gsutil versioning set on gs://your-bucket-name/
21. If You Have 9 Versions of a File, How Will You Access a Particular Version, 1st or 5th
Version?
• You can access specific versions of an object by specifying the object version's
generation number.
o Retrieve the object by its version using the generation parameter:
gsutil cp gs://your-bucket-name/your-file#<generation_number> local-file
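A sketch of finding and downloading a specific generation with the google-cloud-storage Python client (bucket and object names are placeholders):
Python:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")      # placeholder names

# List every generation of the object, oldest first
versions = [b for b in client.list_blobs(bucket, versions=True) if b.name == "your-file"]
versions.sort(key=lambda b: b.generation)

first_version = versions[0]   # 1st version
fifth_version = versions[4]   # 5th version

# Download the chosen generation
blob = bucket.blob("your-file", generation=fifth_version.generation)
blob.download_to_filename("local-file")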
BigQuery (BQ):
1. Big Query Architecture?
• BigQuery Architecture is built on the serverless model, which abstracts away the
underlying infrastructure.
o Storage Layer: Stores data in a highly optimized, distributed manner.
▪ Data is split into columns (columnar storage).
o Compute Layer: Processes queries using distributed compute engines.
o Query Engine: Executes SQL queries on data stored in BigQuery. It can scale
out to thousands of nodes automatically for large queries.
o Dremel Engine: BigQuery uses Dremel, a highly scalable, low-latency system
for running queries on large datasets.
o Cloud Storage: Integrates seamlessly with Google Cloud Storage for data
loading and exporting.
o SQL Engine: BigQuery supports standard SQL for querying and has
optimizations like materialized views, cached results, etc.
4. How Many Types of Views are Present in Big Query? What’s the Purpose and Why Do
We Need Them?
• Types of Views:
1. Standard Views:
▪ A stored query that executes each time you reference the view.
▪ Purpose: Simplifies complex queries, hides underlying table details,
and provides data abstraction.
2. Materialized Views:
▪ Precomputed results stored for faster access.
▪ Purpose: Reduces query cost and execution time, especially for
frequently used queries.
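A sketch of creating both view types with DDL through the BigQuery Python client (dataset, table, and column names are hypothetical):
Python:
from google.cloud import bigquery

client = bigquery.Client()

# Standard view: just a saved query, re-executed on every reference
client.query("""
    CREATE VIEW dataset.active_customers_v AS
    SELECT customer_id, last_order_date
    FROM dataset.customers
    WHERE status = 'active'
""").result()

# Materialized view: results are precomputed and refreshed automatically
client.query("""
    CREATE MATERIALIZED VIEW dataset.daily_sales_mv AS
    SELECT order_date, SUM(amount) AS total_sales
    FROM dataset.orders
    GROUP BY order_date
""").result()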
7. Does Big Query Have a Concept of Primary Key and Foreign Key?
• No, BigQuery does not enforce primary or foreign keys like traditional relational
databases.
o It is an OLAP system optimized for analytics, so constraints like primary and
foreign keys are not supported.
o However, you can create relationships and enforce them logically in your
application.
8. Upload CSV File from GCS Bucket to Big Query from CLI (Header Not Required, Big Query
Should Auto-detect Schema)?
• To load a CSV file from GCS into BigQuery with schema auto-detection and no
header:
bq load --source_format=CSV --skip_leading_rows=1 --autodetect dataset.table gs://your-bucket-name/your-file.csv
9. How Do You Handle Large-Scale Data Transformations in Big Query While Migrating
from On-premises?
• Use Data Transfer Services to move data:
1. Cloud Storage: Upload data from on-prem to GCS, and then load into
BigQuery.
2. Cloud Dataproc: Use it for batch data processing and then load data into
BigQuery.
3. BigQuery Data Transfer Service: For regular data transfers from other
sources.
4. Streaming Inserts: Use streaming for real-time data migration.
13. I Want Current Location and Past Location of the Product We Ordered in BigQuery?
• You can track historical data using Time Travel to get past states of the data.
• If you have a location column with a timestamp of when the product's location
changes, you can query the most recent location and historical ones using ORDER BY
and LIMIT.
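A sketch of both queries with the BigQuery Python client, assuming a hypothetical product_locations table with product_id, location, and updated_at columns:
Python:
from google.cloud import bigquery

client = bigquery.Client()

# Current location: latest row per product
current_sql = """
    SELECT product_id, location
    FROM dataset.product_locations
    QUALIFY ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY updated_at DESC) = 1
"""

# Past locations: the table as it looked 24 hours ago (time travel)
past_sql = """
    SELECT product_id, location
    FROM dataset.product_locations
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
"""

current_rows = client.query(current_sql).result()
past_rows = client.query(past_sql).result()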
14. While Loading Data to Big Query, I Do Not Want to Put Schema Manually?
• You can use schema auto-detection when loading data:
bq load --autodetect --source_format=CSV dataset.table gs://your-bucket-name/your-file.csv
15. We Are Working on Sales Aggregated Data; We Want to Avoid the Whole Table Scan,
So What is the Best Way to Avoid It?
• Partitioning and Clustering can help reduce the data scanned during queries:
o Partition by date to query only specific time ranges.
o Cluster by columns frequently used in queries like product_id to optimize
performance.
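A sketch of creating such a table with DDL through the BigQuery Python client (table and column names are hypothetical):
Python:
from google.cloud import bigquery

client = bigquery.Client()

# Partition on the order date and cluster on product_id
client.query("""
    CREATE TABLE dataset.sales
    PARTITION BY DATE(order_ts)
    CLUSTER BY product_id AS
    SELECT * FROM dataset.sales_raw
""").result()

# Filtering on the partition column means only matching partitions are scanned
rows = client.query("""
    SELECT product_id, SUM(amount) AS total
    FROM dataset.sales
    WHERE DATE(order_ts) BETWEEN '2025-01-01' AND '2025-01-31'
    GROUP BY product_id
""").result()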
16. Using Big Query, I Want to Query Files in Amazon Redshift and Azure Blob Storage,
How to Do? (Using Big Query Omni Service)
• BigQuery Omni allows querying data stored in AWS S3 and Azure Blob Storage
directly from BigQuery.
o Set up BigQuery Omni for cross-cloud querying, where you can access data
stored on AWS or Azure without moving it to GCP.
17. What is the Maximum Number of Columns That Can Be Partitioned and Clustered in
BigQuery?
• Partitioning: A table can only be partitioned on one column.
• Clustering: You can cluster on up to 4 columns.
18. How Do You Store and Query Structured vs Semi-Structured Data in BigQuery?
• Structured Data:
o Store in regular tables with fixed schemas (e.g., relational tables).
• Semi-Structured Data:
o Store in STRUCT or ARRAY fields within tables.
o Use JSON format for semi-structured data (e.g., nested fields).
o Query semi-structured data using JSON_EXTRACT functions.
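A sketch of querying a hypothetical events table that has a JSON string column (payload) and a repeated STRUCT field (items):
Python:
from google.cloud import bigquery

client = bigquery.Client()

# Pull a field out of the JSON column and flatten the repeated field
rows = client.query("""
    SELECT
      JSON_EXTRACT_SCALAR(payload, '$.user.id') AS user_id,
      item.sku,
      item.qty
    FROM dataset.events,
    UNNEST(items) AS item
""").result()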
3. What is the Difference Between Dataproc and Dataflow and Which One is Costlier?
• Dataproc vs Dataflow:
o Dataproc:
▪ Apache Spark and Apache Hadoop based. Primarily used for batch
processing big data workloads.
▪ Supports: Spark jobs, MapReduce, and Hive queries.
▪ Use case: Complex data transformations, iterative machine learning,
and legacy Hadoop workflows.
o Dataflow:
▪ Apache Beam based service. Used for both batch and streaming data
processing.
▪ Supports: Stream processing, windowing, and triggers (event-based
processing).
▪ Use case: Real-time analytics, event-driven systems, real-time data
pipelines.
• Cost Comparison:
o Dataproc is usually cheaper when handling batch processing jobs (as it is
billed by cluster usage).
o Dataflow tends to be more expensive for long-running stream processing due
to its managed pipeline model and automatic scaling.
4. I Have Created Dataproc Cluster and Accidentally Deleted That Cluster, Can We Retrieve
It Back? Does the Data Get Deleted? If Yes, How Can We Save It?
• Cluster Deletion:
o Once a Dataproc cluster is deleted, the cluster's compute resources (VMs)
and temporary data (stored on local disks) are gone.
o Data on Cloud Storage or BigQuery is not affected. Dataproc clusters typically
process data stored in Google Cloud Storage (GCS), which is persistent and
survives cluster deletion.
• How to Save Data:
o Always ensure critical data is stored in Google Cloud Storage or other
managed services (BigQuery, Cloud SQL) before deleting the cluster.
o Consider using persistent disk for temporary storage, or Cloud Storage for
data that's meant to persist.
5. I Have Created Dataproc Cluster and My Jobs Are Running Slow, What Are the Ways to
Optimize My Current Jobs?
Ways to Optimize Spark Jobs on Dataproc:
1. Increase Cluster Size: Add more worker nodes to scale the cluster, especially if jobs
are resource-intensive.
2. Tune Spark Settings: Adjust the Spark configuration to optimize performance (e.g.,
executor memory, executor cores, shuffle partitions); a short sketch follows this list.
3. Data Partitioning:
o Partition large datasets effectively to ensure better parallelism.
o Repartitioning during job execution might help avoid data shuffling.
4. Avoid Wide Transformations: Minimize operations that require shuffling (like
groupByKey, join on large datasets).
5. Use Dataproc Optimized Images: Use Dataproc’s optimized images for Spark to
enhance performance.
6. Preemptible VMs: Use preemptible instances to scale cost-effectively for high-
performance workloads.
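A minimal PySpark tuning sketch related to points 2 and 3 above (the configuration values and the GCS path are illustrative, not recommendations):
Python:
from pyspark.sql import SparkSession

# Typical executor and shuffle knobs
spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

# Repartition on the join/aggregation key to spread work evenly across executors
df = spark.read.parquet("gs://your-bucket/sales/")
df = df.repartition(200, "customer_id")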
6. What is the Basic Difference Between Zonal Cluster and Regional Cluster?
• Zonal Cluster:
o Resides in a single zone within a region.
o Less fault-tolerant, as if the zone faces issues, the cluster may become
unavailable.
• Regional Cluster:
o Spans multiple zones within a region, providing high availability and fault
tolerance.
o More resilient to zone failures and distributes the workload across different
zones for better uptime.
10. We Have One File in Cloud Storage and We Want to Process That File in Apache Spark
on a Dataproc Cluster, How Can We Submit Spark Job to Run the File Using CLI?
• To submit a Spark job on Dataproc via the CLI:
gcloud dataproc jobs submit spark \
--cluster your-cluster-name \
--region your-region \
--class org.apache.spark.examples.SparkPi \
--jars gs://your-bucket-name/your-jar-file.jar \
-- gs://your-bucket-name/input-data.txt
o Replace your-cluster-name with your Dataproc cluster's name, and adjust the
file paths accordingly.
o You can specify Spark properties and arguments after the --.
11. How Will You Monitor the Job Progress and Identify the Issue if the Spark Job is Taking
a Long Time to Run?
• Monitoring:
1. Cloud Logging: Use Cloud Logging (formerly Stackdriver) to view driver and executor logs from your Spark job.
2. Dataproc UI: Dataproc provides a web UI to monitor job progress and logs.
3. Spark UI: Check the Spark UI for detailed metrics like task duration, shuffle
operations, and memory usage.
4. Job Metrics: Monitor the job's stages, task time, and any skew in data
processing.
• Troubleshooting:
o If tasks are taking longer than expected, check if any tasks are stuck or there
are out-of-memory errors.
12. How Can You Handle Sudden Surge in Data if Using Dataproc?
• Handling Data Surges:
1. Use Autoscaling: Dataproc allows you to automatically scale the number of
nodes in the cluster based on the demand.
2. Preemptible VMs: Leverage preemptible VMs to quickly scale up without
adding significant costs.
3. Pre-emptive Job Scheduling: Schedule jobs in off-peak hours if a surge is
predictable.
4. Job Queuing: Use Apache Kafka or Google Cloud Pub/Sub to manage
streaming data and process it in batches.
Cloud Composer (Airflow):
1. What is Cloud Composer?
• Cloud Composer is a fully managed Apache Airflow service offered by Google Cloud.
It is used for orchestrating and scheduling workflows or data pipelines.
• Apache Airflow is an open-source tool to help automate, monitor, and schedule
complex workflows.
o Orchestrates workflows: Runs data processing tasks in sequence or parallel,
and ensures the correct order of execution.
o Managed Service: Cloud Composer takes care of provisioning resources,
scaling, and upgrading Airflow instances.
o Integrated with Google Cloud Services: Easily integrates with other Google
Cloud products like Dataproc, BigQuery, Cloud Storage, etc.
2. What is a DAG?
• DAG stands for Directed Acyclic Graph. In the context of Apache Airflow:
o Directed: Each edge (task dependency) points from one task to another.
o Acyclic: Dependencies never loop back on themselves, meaning there are no circular dependencies.
o Graph: A structure that represents a set of tasks and their dependencies.
• In Cloud Composer, a DAG is a collection of tasks that are organized in a way that
they follow certain execution order or dependencies.
o Tasks: Individual units of work (like running a script, loading data, or calling an
API).
o Dependencies: The relationship that determines which task must run before
or after another.
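A minimal DAG sketch tying these ideas together (DAG ID, schedule, and the callables are placeholders):
Python:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # pull data from the source

def load():
    pass  # write data to the target

with DAG(
    dag_id="example_etl",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # load runs only after extract succeeds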
6. I Have Created a DAG with 4 Tasks, If Any Task Fails, I Don’t Want to Stop the DAG.
• You can control this with trigger rules and retries:
1. Trigger Rule:
▪ The trigger rule decides when a task runs based on the status of its
upstream tasks.
▪ Set trigger_rule=TriggerRule.ALL_DONE on downstream tasks so they run
once their upstream tasks have finished, whether those tasks succeeded
or failed. (TriggerRule.ONE_FAILED and TriggerRule.ALL_FAILED are
stricter rules for tasks that should run only when something fails.) A
short sketch follows this list.
2. Task retry mechanism: Enable retries (and a retry_delay) on the task so a
failed task is re-attempted before the run is marked failed.
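A short sketch of the trigger-rule approach mentioned above (assumes a dag object is already defined elsewhere):
Python:
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

# Runs once all upstream tasks have finished, even if some of them failed
cleanup = PythonOperator(
    task_id="cleanup",
    python_callable=lambda: print("cleaning up"),
    trigger_rule=TriggerRule.ALL_DONE,
    dag=dag,  # 'dag' assumed to be defined elsewhere
)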
10. How to Execute One DAG After the Successful Completion of Another DAG?
• Use the ExternalTaskSensor operator:
o This operator waits for a task from another DAG to complete before executing
the current task.
o Example:
Python:
from airflow.sensors.external_task import ExternalTaskSensor

task_1 = ExternalTaskSensor(
    task_id='wait_for_task_1',
    external_dag_id='dag_1',    # DAG ID of the other DAG
    external_task_id='task_1',  # Task ID to monitor
    mode='poke',                # Polling mode
    timeout=600,                # Max wait time (seconds)
)
11. How Would You Use DAG in Data Pipeline Orchestration?
• A DAG is an essential part of orchestrating data pipelines, as it defines the workflow
and task dependencies in a structured manner.
• Example Workflow:
1. Extract Data from a source (e.g., API, Database).
2. Transform Data using a Spark job or any other processing tool.
3. Load Data into the target (e.g., BigQuery, Cloud Storage).
• Key Benefits of using DAGs in orchestration:
o Task Dependencies: Easily control execution order and logic.
o Scheduling: Automatically schedule and monitor data pipelines.
o Error Handling: Implement retries, failure handling, and alerts.
o Monitoring: Airflow provides detailed monitoring of task execution and
performance.
Dataflow:
1. What is Dataflow?
• Dataflow is a fully managed streaming and batch data processing service in Google
Cloud.
o It is built on Apache Beam, an open-source unified stream and batch data
processing model.
o Dataflow simplifies the process of building, deploying, and managing data
pipelines for both batch and real-time data processing.
o Purpose: It helps you analyze and process data at large scale with minimal
effort, providing auto-scaling, monitoring, and cost-effective data processing.
Key Features:
• Unified Programming Model: Dataflow uses Apache Beam, which allows you to
process both batch and streaming data with the same codebase.
• Fully Managed: You don’t have to worry about managing infrastructure or clusters.
Google handles all scaling and resource allocation automatically.
• Cost-Effective: You only pay for the resources you use, and Dataflow can scale
automatically to save costs during low usage times.
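A minimal Apache Beam (Python SDK) pipeline sketch of the kind Dataflow runs; the project, region, bucket paths, and CSV column position are placeholders:
Python:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "ParseAmount" >> beam.Map(lambda line: float(line.split(",")[2]))
        | "Sum" >> beam.CombineGlobally(sum)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/total")
    )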
Miscellaneous/Project-Specific Questions:
3. What kind of data transformation was done during the project?
• Lightweight transformations (cleaning, type casting, small lookups) ran as Python callables inside Airflow tasks:
Python:
def my_python_function():
    # Data transformation logic
    pass
• Heavier transformations ran as PySpark jobs on Dataproc, for example deduplicating raw files from GCS:
Python:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DataTransformation').getOrCreate()

# Load data
data = spark.read.csv("gs://my-bucket/raw_data.csv", header=True)

# Cleaning: Remove duplicates
clean_data = data.dropDuplicates()
4. How did you perform orchestration, task dependency, and pipeline management in your
project?
Orchestration and Task Dependencies:
In Cloud Composer (Airflow), orchestration is achieved by defining a DAG that represents
the workflow of tasks. The tasks are connected in a dependency chain where one task must
complete before the next task starts.
• Task Dependencies: You define the order in which tasks should run by using the >> or
<< operators, which create a "dependency chain" between tasks.
o Example: Task A must finish before Task B starts.
Python:
task_a >> task_b # task_b will run only after task_a completes
• Pipeline Management: Using Cloud Composer, the entire workflow is scheduled,
managed, and monitored via the Airflow UI, where you can view logs, retry failed
tasks, and monitor task execution status.
Example of Task Dependencies in Airflow:
Python:
from airflow.operators.python import PythonOperator  # my_function and dag defined elsewhere

task1 = PythonOperator(task_id='task1', python_callable=my_function, dag=dag)
task2 = PythonOperator(task_id='task2', python_callable=my_function, dag=dag)
task3 = PythonOperator(task_id='task3', python_callable=my_function, dag=dag)

task1 >> task2 >> task3  # task2 runs after task1; task3 runs after task2
Submitting a PySpark job to Dataproc from the DAG (DataprocSubmitJobOperator):
Python:
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

submit_pyspark = DataprocSubmitJobOperator(
    task_id='submit_pyspark_job',
    job={
        'reference': {'project_id': 'my-project'},
        'placement': {'cluster_name': 'my-cluster'},
        'pyspark_job': {'main_python_file_uri': 'gs://my-bucket/my_script.py'},
    },
    region='us-central1',
    dag=dag,
)

pyspark_task = DataprocSubmitJobOperator(
    task_id='pyspark_task',
    job={
        'reference': {'project_id': 'my-project'},
        'placement': {'cluster_name': 'my-cluster'},
        'pyspark_job': {'main_python_file_uri': 'gs://my-bucket/my_pyspark_script.py'},
    },
    region='us-central1',
    dag=dag,
)

Running lightweight Python logic with PythonOperator:
Python:
def my_pyspark_function():
    # Submit the job or run the script here (PySpark-related logic)
    pass

python_task = PythonOperator(
    task_id='python_task',
    python_callable=my_pyspark_function,
    dag=dag,
)
8. Hooks in Airflow
Hooks are used in Airflow to interact with external systems. They abstract the logic
needed to connect to a system like databases, cloud storage, etc.
Example Hooks:
• BigQueryHook: Allows interaction with BigQuery.
• GoogleCloudStorageHook: Interacts with GCS.
• DataprocHook: Used to manage Dataproc clusters.
Example:
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
hook = BigQueryHook(gcp_conn_id='google_cloud_default')
results = hook.get_pandas_df('SELECT * FROM dataset.table')
9. XCom variable in Airflow – more details about it, which format you share it, and how
can you catch it between tasks?
XCom (Cross-communication) is a mechanism in Airflow to exchange small amounts of
data between tasks. You can push and pull data between tasks.
• Pushing XCom: Use xcom_push to send data from one task to another.
• Pulling XCom: Use xcom_pull to retrieve the data in another task.
Format: XCom can store any Python object like strings, integers, or even dictionaries.
Example:
Python:
# Push an XCom value from one task
task_instance.xcom_push(key='my_key', value='my_value')

# Pull it in a downstream task (the task id shown is illustrative)
value = task_instance.xcom_pull(task_ids='pushing_task', key='my_key')
Reading an environment variable inside a PythonOperator task (e.g., a variable set on the Cloud Composer environment):
Python:
import os

def my_function():
    my_var = os.getenv('MY_ENV_VAR')
    print(my_var)

task = PythonOperator(
    task_id='my_task',
    python_callable=my_function,
    dag=dag,
)
12. Different storage classes in GCS
Google Cloud Storage (GCS) offers several storage classes to optimize for performance
and cost:
1. Standard: High availability, frequently accessed data.
2. Nearline: Infrequently accessed data (roughly once a month or less; 30-day minimum storage).
3. Coldline: Rarely accessed data (roughly once a quarter or less; 90-day minimum storage).
4. Archive: Lowest cost, for data accessed less than once a year (365-day minimum storage).
15. I have a file in GCS and want to share it outside GCP with temporary access. How can I
share only that file?
You can generate a signed URL that grants temporary access to a specific file in your GCS
bucket.
Example:
gsutil signurl -d 10m /path/to/private-key.json gs://your_bucket/file_name
This provides temporary access to the file via a URL.
16. Signed URL in GCP
A signed URL allows temporary access to a specific GCS object without the need for the
user to authenticate. It is typically used for giving time-limited access to files.
17. I want to submit a job but don’t want to create a cluster in Dataproc. If this can be
done, how to do it?
Dataproc Serverless lets you submit Spark batch workloads without creating or managing a
cluster; Dataproc provisions and tears down the compute for you. This is useful for
short-lived jobs that don't need a persistent cluster.
19. Optimizing techniques you know about when jobs take more time for submitting?
• Increase Parallelism: Adjust the number of partitions to allow tasks to run in parallel.
• Use Caching: Cache intermediate data to avoid redundant computation.
• Optimize Cluster Size: Choose the correct instance types and sizes for your cluster.
• Optimize Code: Use efficient algorithms, avoid shuffling, and minimize data transfer.
df = df.withColumn("row_num",
F.row_number().over(Window.partitionBy("ID").orderBy("timestamp")))
Defining a DAG that runs on a schedule only within a fixed date range:
Python:
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'example_dag',
    start_date=datetime(2025, 2, 6),
    end_date=datetime(2025, 2, 7),           # Set the time range
    schedule_interval=timedelta(hours=1),    # Runs every hour within the range
)
• start_date: The date and time when the DAG will first be triggered.
• end_date: The last date the DAG will run.
3. XCom – How does it work? Default size?
XCom (Cross-communication) is a feature in Apache Airflow that allows tasks to share data
between each other. You can push data from one task and pull it from another task.
• Push XCom: xcom_push(key='key_name', value='value')
• Pull XCom: xcom_pull(task_ids='task_name', key='key_name')
• Default Size: XComs are intended for small values; the commonly quoted practical limit is
around 48 KB, and the hard limit depends on the metadata database backend. For anything
larger, store the data externally (e.g., in a database or Cloud Storage) and pass only a
reference or identifier through XCom.
4. Templating in Airflow
Templating in Airflow allows you to use Jinja templating to dynamically generate task
arguments, based on execution context, such as execution date, task instance information,
etc.
Example:
from airflow.operators.python_operator import PythonOperator

def print_execution_date(**kwargs):
    print(f"The execution date is: {kwargs['execution_date']}")

task = PythonOperator(
    task_id='print_execution_date_task',
    python_callable=print_execution_date,
    provide_context=True,
    dag=dag,
)
In this case, execution_date is read from the task's runtime context; the same value is
available in Jinja templates as {{ execution_date }} or {{ ds }}.
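For an argument that is rendered by Jinja itself, templated operator fields can be used; a small sketch with the built-in {{ ds }} macro (assumes a dag object is already defined):
Python:
from airflow.operators.bash import BashOperator

# {{ ds }} is a built-in Jinja macro rendered to the run's logical date (YYYY-MM-DD)
print_ds = BashOperator(
    task_id="print_ds",
    bash_command="echo 'Processing data for {{ ds }}'",
    dag=dag,  # 'dag' assumed to be defined elsewhere
)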
5. Micros in Dataproc
In the Dataproc context, “micros” usually refers to micro-batching in Spark (Structured)
Streaming: instead of handling each event individually, the stream is processed in very
small batches, giving near-real-time results.
• Typically used for streaming data processing.
• In Dataproc, this can be achieved by setting up Spark Streaming jobs with small time
windows for processing.
6. Backfill in Airflow
Backfilling in Airflow means creating and running DAG runs for past schedule intervals that
have not been executed yet, for example the intervals between the DAG's start_date and now,
or intervals missed while the DAG was paused or failing.
• Backfilling ensures that every run that should have happened is executed; it can also be
triggered manually with the airflow dags backfill CLI command.
• Whether past runs are scheduled automatically is controlled by the catchup argument.
7. Catchup in Airflow
Catchup is an Airflow parameter that ensures all previous scheduled runs are executed if
they are missed.
• Catchup=True (default): Airflow will execute all the missed runs for the DAG in
sequence, starting from the start_date.
• Catchup=False: It will only execute the current DAG run and not attempt to "catch
up" on missed runs.
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'example_dag',
    catchup=False,                           # Skip missed DAG runs
    start_date=datetime(2025, 2, 6),
    schedule_interval=timedelta(hours=1),
)
8. Dataflow Architecture
Dataflow is a fully-managed stream and batch processing service on Google Cloud that is
based on Apache Beam. The architecture consists of:
• Pipeline: The core part of Dataflow where transformations are defined.
• Workers: The compute resources that process the data. Workers scale up and down
based on the load.
• FlexRS and autoscaling: Automatically adjust the number of workers based on
workload.
• Data storage: Data can come from sources like GCS, BigQuery, Pub/Sub, etc.
9. Windowing in Dataflow
Windowing in Dataflow (via Apache Beam) is used to break streaming data into fixed-size or
sliding windows for easier processing.
• Fixed windows: Data is grouped into fixed-length time intervals (e.g., every 5
minutes).
• Sliding windows: Data is grouped into overlapping intervals, which "slide" over time.
• Session windows: Grouping data based on user-defined periods of activity.
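A short Apache Beam (Python SDK) sketch of the fixed-window case, with the other window types noted as comments; the Pub/Sub topic name is a placeholder:
Python:
import apache_beam as beam
from apache_beam import window

def add_windowing(p):
    return (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "FixedWindows" >> beam.WindowInto(window.FixedWindows(5 * 60))  # 5-minute windows
        # Alternatives:
        #   window.SlidingWindows(size=600, period=60)  # 10-min windows emitted every minute
        #   window.Sessions(gap_size=900)               # session closes after 15 min of inactivity
        | "Ones" >> beam.Map(lambda _: 1)
        | "CountPerWindow" >> beam.CombineGlobally(sum).without_defaults()
    )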