
GCP SERVICE WISE INTERVIEW QUESTIONS

Google Cloud Storage (GCS)


1. How to Change the Bucket Location from one region to another?
2. What are the different storage classes in GCS?
3. Can we convert Nearline to Coldline and Coldline to Nearline?
4. Meaning of Object in Google Cloud Storage?
5. What are the ways to upload a CSV file into a GCS bucket?
6. gsutil command for copying a file from one bucket to another bucket?
7. Default Storage Class in GCS?
8. What are two types of permissions on Bucket Level? (Access Control List & IAM)
9. What is Lifecycle Policy Of GCS?
10. How will you secure the data present on Google Cloud Storage? [Google-managed
keys and CMEK (Customer-Managed Encryption Keys)]
11. What is the maximum file size we can upload to GCS bucket?
12. We want to upload a file to a bucket; what is the optimized method by which we can
reduce cost and also protect against any disaster?
13. How to add bucket-level permission to have only view level access?
14. How do you ensure data consistency in a GCP bucket?
15. What is gsutil?
16. gsutil Copy command from 1 bucket to another bucket?
17. Usage of -m and -r in copy?
18. How to prevent accidental bucket deletion in GCP?
19. How to apply lifecycle policy in GCS?
20. Versioning?
21. If you have 9 versions of a file, how will you access a particular version, 1st or 5th
version?
BigQuery (BQ)
1. Big Query Architecture?
2. Difference Between Partitioning and clustering?
3. Basic query optimization and cost optimization Technique in Big Query?
4. How many types of views are present in BigQuery? What's their purpose and why do we
need them?
5. How does BigQuery charge?
6. What is Dry Run?
7. Does BigQuery have a concept of Primary Key and Foreign Key?
8. Upload CSV file from GCS Bucket to Big Query from CLI (Header not required, Big
Query should auto-detect Schema?)
9. How do you handle large scale data transformations in Big Query while migrating
from On-premises?
10. Limitation of Big Query?
11. Is BigQuery OLAP or OLTP?
12. I have one table and dropped that, now I want it back, How? (Snapshot, Time Travel)
13. I want current location and past location of the product we ordered in Big Query?
14. While Loading data to Big Query I do not want to put schema manually?
15. We are working on the sales aggregated data we want to avoid the whole table scan,
so what is the best way to avoid it?
16. Using BigQuery, I want to query files which are in Amazon Redshift and in Azure Blob
Storage; how to do it? (Using the BigQuery Omni service)
17. What is the maximum number of columns that can be partitioned and clustered in
BigQuery?
18. How do you store and query structured vs semi-structured data in BigQuery?
19. Can you explain the time travel feature in BigQuery?
20. How to alter a table in BigQuery?
21. How to create authorized views in BigQuery?
22. How to load data from GCS to BigQuery?
23. How to mask sensitive data in BigQuery?
24. What is partitioning and clustering in BigQuery?
25. How to perform optimizations in BigQuery during joins?

Dataproc
1. What is Dataproc?
2. What is machine family? Types of Machine Family? (General-Purpose, Storage
Optimized, Compute Optimized, GPU-Focused)
3. What is the difference between Dataproc and Dataflow and which one is costlier?
4. I have created Dataproc cluster and accidentally deleted that cluster, can we retrieve
it back? Does the data get deleted? If yes, how can we save it?
5. I have created Dataproc cluster and my jobs are running slow, what are the ways to
optimize my current jobs?
6. What is the basic difference between zonal cluster and regional cluster?
7. What are preemptible clusters?
8. What is the best practice to optimize Dataproc cost?
9. How will you optimize Spark Performance on Dataproc Cluster?
10. We have one file in Cloud Storage and we want to process that file in Apache Spark
on a Dataproc Cluster, how can we submit Spark job to run the file using CLI?
11. How will you monitor the job progress and identify the issue if the Spark job is taking
a long time to run?
12. How can you handle sudden surge in data if using Dataproc?

Cloud Composer (Airflow)


1. What is Cloud Composer?
2. What is DAG?
3. Types of operators used in Cloud Composer?
4. I have created one DAG, and inside that:
o Created Dataproc cluster
o Submitted one pyspark file/job
o After submitting the job, I am deleting the cluster
o What happens if I didn’t mention DAG task dependency?
o If DAG fails, why and if it is not failed, why?
5. What is the meaning of "on failure callback" in Cloud Composer?
6. I have created a DAG with 4 tasks, if any task fails, I don’t want to stop the DAG.
7. How would you handle DAG failures or retries in Cloud Composer?
8. How to restart a DAG in Airflow?
9. How to create task dependencies in Airflow?
10. How to execute one DAG after the successful completion of another DAG?
11. How would you use DAG in data pipeline orchestration?

Dataflow
1. What is Dataflow?
2. Major components of Dataflow?
3. How Dataflow automatically scales?
4. What is windowing and what is its purpose?
5. How would you schedule a Dataflow workflow without using Cloud Composer?
6. Explain autoscaling in Dataflow.
7. How does Dataflow handle errors or job failures?

Miscellaneous/General Cloud Topics


1. What is IAM? What is a service account? How many service accounts can be created
per project?
2. What are roles in GCP?
3. What is the difference between service and batch accounts?
4. Explain different storage classes in GCS and their costings.
5. How do you ensure data consistency in a GCP bucket?
6. What is the difference between HTTP and HTTPS?
7. What is an API Whitelisting and Bucket Whitelisting?
8. How do you handle large-scale data migrations from on-prem to cloud?
9. Difference between serverless and managed services?
10. Horizontal scaling vs. vertical scaling.
11. How do you move on-prem data to cloud using Data Migration Service?

Miscellaneous/Project-Specific Questions
1. Explain your project and the services you used.
2. What are the operators used in your DAGs in Cloud Composer?
3. What kind of data transformation was done during the project?
4. How did you perform orchestration, task dependency, and pipeline management in
your project?
5. How would you optimize and troubleshoot jobs running in Dataproc, BigQuery, or
Dataflow?
ANSWERS OF ALL THE QUESTIONS ASKED ABOVE
Google Cloud Storage (GCS):
1. How to Change the Bucket Location from One Region to Another?
• Bucket location cannot be changed directly.
o Why? Once a GCS bucket is created in a specific location, it is geographically
bound to that region. The location defines where the data is stored.
o Solution:
1. Create a new bucket in the desired region.
2. Copy the data from the old bucket to the new bucket using gsutil or
the Google Cloud Console.
3. Delete the old bucket (optional) if it’s no longer needed.

2. What are the Different Storage Classes in GCS?


• GCS provides four main storage classes, which optimize costs based on how
frequently data is accessed:
o Standard:
▪ Designed for frequent access.
▪ Low latency, high throughput.
▪ Suitable for applications that need fast, frequent access to data.
o Nearline:
▪ For infrequent access (data accessed less than once a month).
▪ Lower cost than Standard.
o Coldline:
▪ For archival data that’s accessed less than once a year.
▪ Much cheaper than Nearline.
o Archive:
▪ For long-term storage of data that is accessed very rarely.
▪ Cheapest option, but with higher retrieval costs.
3. Can We Convert Nearline to Coldline and Coldline to Nearline?
• Yes, you can change the storage class of an object:
o Use gsutil or the GCP Console to change an object’s storage class.
o Example gsutil rewrite command:
gsutil rewrite -s coldline gs://your-bucket/your-object
o Note: Changing the storage class doesn’t move the data, but adjusts the cost
and retrieval time.

4. Meaning of Object in Google Cloud Storage?


• Object:
o In GCS, an object is any file or data stored within a bucket.
o Objects consist of:
▪ Data (the actual content, like an image, text file, or database dump).
▪ Metadata (additional information about the object, such as the file
name, size, storage class, and custom metadata).

5. What Are the Ways to Upload a CSV File into GCS Bucket?
• You can upload files to GCS using:
1. Web Interface (Console):
▪ Navigate to GCS in the Google Cloud Console, click on your bucket,
and use the Upload Files button.
2. gsutil (CLI):
▪ Use the gsutil cp command:
gsutil cp /local/path/to/file.csv gs://your-bucket-name/
3. API:
▪ Use the Cloud Storage API or a client library to programmatically upload
data from your application (see the Python sketch after this list).
4. Storage Transfer Service:
▪ Use this for large data transfers from external sources or cloud storage
(e.g., AWS S3 to GCS).
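As a minimal sketch of option 3, a local file can be uploaded with the google-cloud-storage Python client; the bucket, object, and local path below are placeholders.
Python:
from google.cloud import storage

client = storage.Client()                    # uses application-default credentials
bucket = client.bucket("your-bucket-name")   # placeholder bucket name
blob = bucket.blob("uploads/file.csv")       # destination object name in the bucket

blob.upload_from_filename("/local/path/to/file.csv")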
6. gsutil Command for Copying File from One Bucket to Another Bucket?
• The command is:
gsutil cp gs://source-bucket-name/file.csv gs://destination-bucket-name/
This copies the file from one bucket to another.

7. Default Storage Class in GCS?


• Standard is the default storage class for new buckets in GCS.
o This means that unless you specify otherwise, the data will be stored in the
Standard class, optimized for frequent access.

8. What Are Two Types of Permissions on Bucket Level? (Access Control List & IAM)
• Access Control List (ACL):
o Grants permissions to specific users or groups to access or manage objects in
the bucket.
o Permissions include READ, WRITE, OWNER for the objects and bucket.
• Identity and Access Management (IAM):
o IAM provides more granular, role-based access control to resources in Google
Cloud.
o Roles like roles/storage.objectViewer or roles/storage.objectAdmin allow
you to manage access at the bucket or project level.

9. What is Lifecycle Policy of GCS?


• Lifecycle Policy:
o Automatically manage objects in a bucket over time by applying rules.
o Rules can transition objects to cheaper storage classes (e.g., move to Coldline
after 30 days) or delete objects after a certain period.
o Example:
gsutil lifecycle set lifecycle.json gs://your-bucket-name/
10. How Will You Secure the Data Present on Google Cloud Storage?
• Google Managed Encryption:
o GCS encrypts your data by default using Google-managed encryption keys.
• Customer-Managed Encryption Keys (CMEK):
o You can control encryption by using your own encryption keys stored in Cloud
KMS (Key Management Service).
o Helps meet compliance and security requirements that require you to
manage the encryption keys.

11. What is the Maximum File Size We Can Upload to GCS Bucket?
• The maximum file size that can be uploaded is 5 TB per object.

12. We Want to Upload a File in Bucket, What is the Optimized Method by Which We Can
Reduce Cost and Save from Any Disaster?
• Cost Reduction:
o Store data in Nearline, Coldline, or Archive classes for infrequent access.
o Implement bucket lifecycle policies to transition older data to cheaper
classes.
• Protection from Disaster:
o Enable versioning in the bucket to keep multiple versions of objects.
o Use Object Access Control Lists (ACLs) or IAM to restrict access.
o Enable Object Locking to prevent data from being deleted or overwritten.

13. How to Add Bucket-Level Permission to Have Only View Level Access?
• Use IAM roles like roles/storage.objectViewer to allow read-only access to objects in
the bucket.
o This will grant view-only permissions without allowing any modifications.
Example:
gsutil iam ch user:your-email@example.com:roles/storage.objectViewer gs://your-bucket-name
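The same binding can be added programmatically; below is a hedged sketch with the google-cloud-storage Python client, using placeholder bucket and user names.
Python:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")

# Fetch the current IAM policy, append a read-only binding, and save it back.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"user:your-email@example.com"},
})
bucket.set_iam_policy(policy)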
14. How Do You Ensure Data Consistency in a GCP Bucket?
• GCS provides strong consistency for objects:
o After an object is uploaded, it is immediately available for reading (no
eventual consistency).
o Atomic updates: When you overwrite an object, the new data is immediately
visible.
o Versioning can help you maintain a history of objects and protect against
accidental changes.

15. What is gsutil?


• gsutil is a command-line tool that allows you to interact with Google Cloud Storage.
o It supports tasks like uploading, downloading, copying, and moving files
between GCS buckets.
o You can also manage bucket settings and permissions using gsutil.

16. gsutil Copy Command from One Bucket to Another Bucket?


• The command for copying files from one bucket to another is:
gsutil cp gs://source-bucket-name/file.csv gs://destination-bucket-name/

17. Usage of -m and -r in Copy?


• -m:
o This flag enables multi-threading for the operation, speeding up large file
transfers by processing files in parallel.
• -r:
o This flag indicates a recursive operation, which is used when copying
directories and subdirectories.
o Example:
gsutil -m cp -r gs://source-bucket-name/ gs://destination-bucket-name/
18. How to Prevent Accidental Bucket Deletion in GCP?
• Enable Bucket Lock:
o This will make the bucket and its contents immutable for a certain period,
preventing accidental deletion.
o Object Versioning can also be used to keep older versions of the objects safe.

19. How to Apply Lifecycle Policy in GCS?


• Lifecycle policies can be set using a JSON configuration.
o Example JSON rule:
{
"rule": [
{
"action": {
"type": "Delete"
},
"condition": {
"age": 365
}
}
]
}
o Apply it using:
gsutil lifecycle set lifecycle.json gs://your-bucket-name/

20. Versioning?
• Versioning in GCS allows you to store multiple versions of the same object.
o Every time an object is overwritten, the previous version is still stored in the
bucket.
o Enable versioning through the Cloud Console or gsutil:
gsutil versioning set on gs://your-bucket-name/
21. If You Have 9 Versions of a File, How Will You Access a Particular Version, 1st or 5th
Version?
• You can access specific versions of an object by specifying the object version's
generation number.
o Retrieve the object by its version using the generation parameter:
gsutil cp gs://your-bucket-name/your-file#<generation_number> local-file
BigQuery (BQ):
1. Big Query Architecture?
• BigQuery Architecture is built on the serverless model, which abstracts away the
underlying infrastructure.
o Storage Layer: Stores data in a highly optimized, distributed manner.
▪ Data is split into columns (columnar storage).
o Compute Layer: Processes queries using distributed compute engines.
o Query Engine: Executes SQL queries on data stored in BigQuery. It can scale
out to thousands of nodes automatically for large queries.
o Dremel Engine: BigQuery uses Dremel, a highly scalable, low-latency system
for running queries on large datasets.
o Cloud Storage: Integrates seamlessly with Google Cloud Storage for data
loading and exporting.
o SQL Engine: BigQuery supports standard SQL for querying and has
optimizations like materialized views, cached results, etc.

2. Difference Between Partitioning and Clustering?


• Partitioning:
o Organizes data into segments based on a column (typically date/time).
o Example: You can partition data by a timestamp column, dividing it into daily,
monthly, or yearly partitions.
o Benefit: Optimizes query performance by only scanning relevant partitions.
• Clustering:
o Organizes data within partitions based on columns (like grouping rows with
similar values).
o Data is physically stored in blocks ordered by clustered columns.
o Benefit: Speeds up queries that filter or group by clustered columns, reducing
scan time.
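A minimal sketch of creating a table that is both partitioned and clustered, using BigQuery DDL through the google-cloud-bigquery Python client; the project, dataset, and column names are illustrative.
Python:
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE `my-project.my_dataset.sales` (
  order_id   STRING,
  product_id STRING,
  amount     NUMERIC,
  order_ts   TIMESTAMP
)
PARTITION BY DATE(order_ts)   -- queries filtering on the date only scan matching partitions
CLUSTER BY product_id         -- rows with the same product_id are stored together
"""

client.query(ddl).result()    # run the DDL job and wait for it to finish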
3. Basic Query Optimization and Cost Optimization Technique in Big Query?
• Query Optimization:
1. Select only required columns: Avoid SELECT *.
2. Filter early: Apply filters in WHERE clauses to reduce the result set.
3. Use approximate functions: For large aggregations, use
APPROX_COUNT_DISTINCT.
4. Limit the data: Use LIMIT to restrict the query results.
5. Avoid joins on large datasets unless necessary.
• Cost Optimization:
1. Partition and cluster tables.
2. Materialized Views: Use materialized views to store precomputed results.
3. Use caching: BigQuery caches query results by default for repeated queries.
4. Data Sampling: Use LIMIT for sample queries.
5. Query fewer rows: Partition data and filter to scan only relevant rows.

4. How Many Types of Views are Present in Big Query? What’s the Purpose and Why Do
We Need Them?
• Types of Views:
1. Standard Views:
▪ A stored query that executes each time you reference the view.
▪ Purpose: Simplifies complex queries, hides underlying table details,
and provides data abstraction.
2. Materialized Views:
▪ Precomputed results stored for faster access.
▪ Purpose: Reduces query cost and execution time, especially for
frequently used queries.
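A hedged sketch of creating a materialized view with DDL; the names are illustrative and assume a sales table like the one sketched earlier.
Python:
from google.cloud import bigquery

client = bigquery.Client()

# Precompute per-product revenue; BigQuery keeps the result refreshed for you.
ddl = """
CREATE MATERIALIZED VIEW `my-project.my_dataset.sales_by_product` AS
SELECT product_id, SUM(amount) AS total_amount
FROM `my-project.my_dataset.sales`
GROUP BY product_id
"""

client.query(ddl).result()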

5. How Does BigQuery Charge?


• BigQuery Pricing is based on:
1. Storage:
▪ Charged per GB per month for storing data in tables.
2. Querying:
▪ Charged per bytes processed during queries (typically $5 per TB).
3. Data Insertion:
▪ Free for data loading into tables.
4. Streaming Inserts:
▪ Charged per MB for data streamed in real-time.

6. What is Dry Run?


• Dry Run allows you to simulate the query execution without actually running it,
helping estimate the cost and resources needed for a query.
o Example: Use --dry_run to check how much data a query will process before
execution.
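Besides the --dry_run flag on the bq CLI, the Python client can produce the same estimate; a sketch with an illustrative query follows.
Python:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT product_id, SUM(amount) FROM `my-project.my_dataset.sales` GROUP BY product_id",
    job_config=job_config,
)

# No data is read and nothing is billed; only the estimate is returned.
print(f"This query would process {job.total_bytes_processed} bytes")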

7. Does Big Query Have a Concept of Primary Key and Foreign Key?
• No, BigQuery does not enforce primary or foreign keys like traditional relational
databases.
o It is an OLAP system optimized for analytics, so constraints like primary and
foreign keys are not supported.
o However, you can create relationships and enforce them logically in your
application.

8. Upload CSV File from GCS Bucket to Big Query from CLI (Header Not Required, Big Query
Should Auto-detect Schema)?
• To load a CSV file from GCS into BigQuery with schema auto-detection and no
header:
bq load --source_format=CSV --skip_leading_rows=1 --autodetect dataset.table gs://your-bucket-name/your-file.csv
9. How Do You Handle Large-Scale Data Transformations in Big Query While Migrating
from On-premises?
• Use Data Transfer Services to move data:
1. Cloud Storage: Upload data from on-prem to GCS, and then load into
BigQuery.
2. Cloud Dataproc: Use it for batch data processing and then load data into
BigQuery.
3. BigQuery Data Transfer Service: For regular data transfers from other
sources.
4. Streaming Inserts: Use streaming for real-time data migration.

10. Limitations of Big Query?


• Storage Limitations:
o No fixed maximum table size, but per-row and per-column limits apply (for
example, a maximum row size of 100 MB and up to 10,000 columns per table).
• Query Limitations:
o Maximum query length: 1 MB (query text).
o Execution Time: Queries must complete within 6 hours.
• No support for enforced primary/foreign keys, and joins on large tables can be slow
without partitioning or clustering.

11. Big Query is OLAP or OLTP?


• BigQuery is an OLAP (Online Analytical Processing) system, designed for large-scale
data analysis and complex queries over huge datasets.
o It is optimized for querying data in a read-heavy environment, not for
transactional operations.
12. I Have One Table and Dropped That, Now I Want It Back, How? (Snapshot, Time Travel)
• Time Travel in BigQuery allows you to query historical data within the time travel
window (7 days by default).
o Example:
SELECT * FROM `your-project.your-dataset.your-table` FOR SYSTEM_TIME AS OF TIMESTAMP
'2022-02-01 12:00:00 UTC'
o Alternatively, use Snapshots (manual backups) to restore the table.

13. I Want Current Location and Past Location of the Product We Ordered in BigQuery?
• You can track historical data using Time Travel to get past states of the data.
• If you have a location column with a timestamp of when the product's location
changes, you can query the most recent location and historical ones using ORDER BY
and LIMIT.

14. While Loading Data to Big Query, I Do Not Want to Put Schema Manually?
• You can use schema auto-detection when loading data:
bq load --autodetect --source_format=CSV dataset.table gs://your-bucket-name/your-file.csv

15. We Are Working on Sales Aggregated Data; We Want to Avoid the Whole Table Scan,
So What is the Best Way to Avoid It?
• Partitioning and Clustering can help reduce the data scanned during queries:
o Partition by date to query only specific time ranges.
o Cluster by columns frequently used in queries like product_id to optimize
performance.

16. Using Big Query, I Want to Query Files in Amazon Redshift and Azure Blob Storage,
How to Do? (Using Big Query Omni Service)
• BigQuery Omni allows querying data stored in AWS S3 and Azure Blob Storage
directly from BigQuery.
o Set up BigQuery Omni for cross-cloud querying, where you can access data
stored on AWS or Azure without moving it to GCP.

17. What is the Maximum Number of Columns That Can Be Partitioned and Clustered in
BigQuery?
• Partitioning: A table can only be partitioned on one column.
• Clustering: You can cluster on up to 4 columns.

18. How Do You Store and Query Structured vs Semi-Structured Data in BigQuery?
• Structured Data:
o Store in regular tables with fixed schemas (e.g., relational tables).
• Semi-Structured Data:
o Store in STRUCT or ARRAY fields within tables.
o Use JSON format for semi-structured data (e.g., nested fields).
o Query semi-structured data using JSON_EXTRACT functions.
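A hedged example query that touches both nested/repeated fields and a JSON string column; it assumes an orders table with an items ARRAY of STRUCTs and a raw_payload JSON string, all of which are illustrative.
Python:
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  order_id,
  item.name AS item_name,                                          -- field of a STRUCT
  JSON_EXTRACT_SCALAR(raw_payload, '$.customer.country') AS country
FROM `my-project.my_dataset.orders`,
UNNEST(items) AS item                                              -- flatten the repeated field
"""

for row in client.query(sql).result():
    print(row.order_id, row.item_name, row.country)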

19. Can You Explain the Time Travel Feature in BigQuery?


• Time Travel allows you to query a table as it appeared at any point in the past (within
the last 7 days).
o Useful for recovering accidentally deleted or altered data.

20. How to Alter a Table in BigQuery?


• BigQuery doesn’t support full ALTER TABLE commands (like changing a column type).
However, you can:
1. Add columns with:
ALTER TABLE dataset.table ADD COLUMN new_column STRING;
2. Rename columns or change types by creating a new table with the updated
schema and copying data.

21. How to Create Authorized Views in BigQuery?


• Authorized Views allow you to restrict access to the data in a table via a view.
o Create a view and grant access only to the view, not the underlying table.
o Use the bq command to share access to the view, ensuring the data is
accessed in a controlled manner.
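A hedged sketch of the usual flow with the Python client: create the view, then authorize it on the source dataset so it can read data the end users cannot query directly. Project, dataset, and table names are placeholders.
Python:
from google.cloud import bigquery

client = bigquery.Client()

# 1. Create a view that exposes only the columns you want to share.
client.query("""
CREATE VIEW `my-project.shared_ds.customer_view` AS
SELECT customer_id, country FROM `my-project.private_ds.customers`
""").result()

# 2. Authorize the view on the source dataset.
source = client.get_dataset("my-project.private_ds")
entries = list(source.access_entries)
entries.append(
    bigquery.AccessEntry(
        role=None,
        entity_type="view",
        entity_id={
            "projectId": "my-project",
            "datasetId": "shared_ds",
            "tableId": "customer_view",
        },
    )
)
source.access_entries = entries
client.update_dataset(source, ["access_entries"])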

22. How to Load Data from GCS to BigQuery?


• Use the bq command or BigQuery Console:
bq load --source_format=CSV dataset.table gs://your-bucket-name/your-file.csv
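Equivalently, a hedged sketch with the Python client; the bucket, dataset, and table names are placeholders.
Python:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://your-bucket-name/your-file.csv",
    "my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete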

23. How to Mask Sensitive Data in BigQuery?


• Use Data Masking with BigQuery’s Column-Level Security to hide sensitive data for
specific users.
o Define rules based on user roles to mask or obfuscate data.

24. What is Partitioning and Clustering in BigQuery?


• Partitioning: Breaks data into partitions (like by date) to reduce scan costs.
• Clustering: Orders data within partitions by specific columns, optimizing queries on
those columns.

25. How to Perform Optimizations in BigQuery During Joins?


• Optimize Joins:
1. Use partitioned tables to limit the data being joined.
2. Cluster tables by the join key.
3. Use approximate functions (APPROX_COUNT_DISTINCT) for large datasets.
Dataproc:
1. What is Dataproc?
• Google Cloud Dataproc is a fully managed Apache Hadoop and Apache Spark service
for processing big data workloads.
o Hadoop: A framework that allows for the distributed processing of large
datasets across clusters of computers using simple programming models.
o Spark: A fast, in-memory data processing engine, primarily for batch
processing, but also supports real-time data streaming.
• Dataproc provides an easy-to-use, scalable, and cost-effective solution for big data
processing.
o You can create clusters (groups of virtual machines) and run Spark or Hadoop
jobs on them.
o It integrates with other Google Cloud services like Google Cloud Storage,
BigQuery, and Cloud Pub/Sub.
o Dataproc is billed by the second for the compute time used, providing
flexibility and cost control.

2. What is Machine Family? Types of Machine Family?


• Machine family in Google Cloud refers to the types of virtual machines that you can
use for your workload. Each family is optimized for specific use cases.
Types of Machine Families:
1. General-Purpose:
o Balanced for most workloads.
o Examples: n1-standard, e2, n2
o Use case: Websites, web apps, batch jobs, medium databases.
2. Storage Optimized:
o Optimized for storage with high input/output operations per second (IOPS).
o Examples: n2d, t2d
o Use case: Large data lakes, data warehouses.
3. Compute Optimized:
o High processing power with fewer resources for memory and storage.
o Examples: c2, c2d
o Use case: High-performance computing (HPC), scientific simulations, batch
processing tasks.
4. GPU-Focused:
o Equipped with GPUs (Graphics Processing Units) for tasks requiring massive
parallel processing like machine learning and AI workloads.
o Examples: a2, g2 (with attached NVIDIA GPUs such as A100 or L4)
o Use case: Machine learning, deep learning, image processing.

3. What is the Difference Between Dataproc and Dataflow and Which One is Costlier?
• Dataproc vs Dataflow:
o Dataproc:
▪ Apache Spark and Apache Hadoop based. Primarily used for batch
processing big data workloads.
▪ Supports: Spark jobs, MapReduce, and Hive queries.
▪ Use case: Complex data transformations, iterative machine learning,
and legacy Hadoop workflows.
o Dataflow:
▪ Apache Beam based service. Used for both batch and streaming data
processing.
▪ Supports: Stream processing, windowing, and triggers (event-based
processing).
▪ Use case: Real-time analytics, event-driven systems, real-time data
pipelines.
• Cost Comparison:
o Dataproc is usually cheaper when handling batch processing jobs (as it is
billed by cluster usage).
o Dataflow tends to be more expensive for long-running stream processing due
to its managed pipeline model and automatic scaling.
4. I Have Created Dataproc Cluster and Accidentally Deleted That Cluster, Can We Retrieve
It Back? Does the Data Get Deleted? If Yes, How Can We Save It?
• Cluster Deletion:
o Once a Dataproc cluster is deleted, the cluster's compute resources (VMs)
and temporary data (stored on local disks) are gone.
o Data on Cloud Storage or BigQuery is not affected. Dataproc clusters typically
process data stored in Google Cloud Storage (GCS), which is persistent and
survives cluster deletion.
• How to Save Data:
o Always ensure critical data is stored in Google Cloud Storage or other
managed services (BigQuery, Cloud SQL) before deleting the cluster.
o Consider using persistent disk for temporary storage, or Cloud Storage for
data that's meant to persist.

5. I Have Created Dataproc Cluster and My Jobs Are Running Slow, What Are the Ways to
Optimize My Current Jobs?
Ways to Optimize Spark Jobs on Dataproc:
1. Increase Cluster Size: Add more worker nodes to scale the cluster, especially if jobs
are resource-intensive.
2. Tune Spark Settings: Adjust Spark configuration to optimize performance (e.g.,
executor memory and core settings).
3. Data Partitioning:
o Partition large datasets effectively to ensure better parallelism.
o Repartitioning during job execution might help avoid data shuffling.
4. Avoid Wide Transformations: Minimize operations that require shuffling (like
groupByKey, join on large datasets).
5. Use Dataproc Optimized Images: Use Dataproc’s optimized images for Spark to
enhance performance.
6. Preemptible VMs: Use preemptible instances to scale cost-effectively for high-
performance workloads.
6. What is the Basic Difference Between Zonal Cluster and Regional Cluster?
• Zonal Cluster:
o Resides in a single zone within a region.
o Less fault-tolerant, as if the zone faces issues, the cluster may become
unavailable.
• Regional Cluster:
o Spans multiple zones within a region, providing high availability and fault
tolerance.
o More resilient to zone failures and distributes the workload across different
zones for better uptime.

7. What Are Preemptible Clusters?


• Preemptible VMs are short-lived, cost-effective instances that Google Cloud can
terminate at any time when resources are needed elsewhere.
o Cheaper than regular VMs (about 80% less).
o Ideal for batch jobs that can tolerate interruption and can be restarted.
o Use for tasks like Spark jobs where execution time is flexible.

8. What is the Best Practice to Optimize Dataproc Cost?


• Best Practices:
1. Use Preemptible VMs: For non-critical workloads, use preemptible instances
to lower costs.
2. Use Autoscaling: Set up autoscaling to dynamically add or remove worker
nodes based on workload, avoiding over-provisioning.
3. Idle Clusters: Shut down clusters when not in use to avoid unnecessary costs.
4. Optimize Spark Jobs: Follow Spark optimization strategies to reduce the
runtime and resource usage of jobs.
5. Monitor Usage: Use Stackdriver Monitoring to track cluster performance and
identify idle resources.
9. How Will You Optimize Spark Performance on Dataproc Cluster?
• Spark Performance Optimization:
1. Partitioning Strategy: Make sure your data is partitioned based on the keys
you're querying.
2. Executor Memory: Increase memory allocated to executors if your jobs need
more resources.
3. Shuffle Partitions: Increase the number of shuffle partitions
(spark.sql.shuffle.partitions) if your cluster is large.
4. Data Caching: Cache intermediate data that is reused in Spark
transformations.
5. Use the Right File Format: Prefer columnar formats (Parquet, ORC) over row-
based formats (CSV, JSON) to minimize data movement.

10. We Have One File in Cloud Storage and We Want to Process That File in Apache Spark
on a Dataproc Cluster, How Can We Submit Spark Job to Run the File Using CLI?
• To submit a Spark job on Dataproc via the CLI:
gcloud dataproc jobs submit spark \
--cluster your-cluster-name \
--region your-region \
--class org.apache.spark.examples.SparkPi \
--jars gs://your-bucket-name/your-jar-file.jar \
-- gs://your-bucket-name/input-data.txt
o Replace your-cluster-name with your Dataproc cluster's name, and adjust the
file paths accordingly.
o You can specify Spark properties and arguments after the --.
11. How Will You Monitor the Job Progress and Identify the Issue if the Spark Job is Taking
a Long Time to Run?
• Monitoring:
1. Cloud Logging: Use Stackdriver Logging to view logs from your Spark job.
2. Dataproc UI: Dataproc provides a web UI to monitor job progress and logs.
3. Spark UI: Check the Spark UI for detailed metrics like task duration, shuffle
operations, and memory usage.
4. Job Metrics: Monitor the job's stages, task time, and any skew in data
processing.
• Troubleshooting:
o If tasks are taking longer than expected, check if any tasks are stuck or there
are out-of-memory errors.

12. How Can You Handle Sudden Surge in Data if Using Dataproc?
• Handling Data Surges:
1. Use Autoscaling: Dataproc allows you to automatically scale the number of
nodes in the cluster based on the demand.
2. Preemptible VMs: Leverage preemptible VMs to quickly scale up without
adding significant costs.
3. Pre-emptive Job Scheduling: Schedule jobs in off-peak hours if a surge is
predictable.
4. Job Queuing: Use Apache Kafka or Google Cloud Pub/Sub to manage
streaming data and process it in batches.
Cloud Composer (Airflow):
1. What is Cloud Composer?
• Cloud Composer is a fully managed Apache Airflow service offered by Google Cloud.
It is used for orchestrating and scheduling workflows or data pipelines.
• Apache Airflow is an open-source tool to help automate, monitor, and schedule
complex workflows.
o Orchestrates workflows: Runs data processing tasks in sequence or parallel,
and ensures the correct order of execution.
o Managed Service: Cloud Composer takes care of provisioning resources,
scaling, and upgrading Airflow instances.
o Integrated with Google Cloud Services: Easily integrates with other Google
Cloud products like Dataproc, BigQuery, Cloud Storage, etc.

2. What is a DAG?
• DAG stands for Directed Acyclic Graph. In the context of Apache Airflow:
o Directed: Each edge (task dependency) points from one task to another.
o Acyclic: No task can point to itself, meaning there’s no circular dependencies.
o Graph: A structure that represents a set of tasks and their dependencies.
• In Cloud Composer, a DAG is a collection of tasks that are organized in a way that
they follow certain execution order or dependencies.
o Tasks: Individual units of work (like running a script, loading data, or calling an
API).
o Dependencies: The relationship that determines which task must run before
or after another.

3. Types of Operators Used in Cloud Composer?


Operators are building blocks in Airflow used to define the tasks within a DAG. Some
common types of operators are:
1. BashOperator:
o Executes a bash command/script.
o Example: BashOperator(task_id='run_bash', bash_command='echo "Hello
World"')
2. PythonOperator:
o Executes a Python function.
o Example: PythonOperator(task_id='run_python',
python_callable=my_python_function)
3. BranchPythonOperator:
o Used to decide which task(s) to execute based on some condition.
o Example: Run different tasks based on the output of a condition.
4. DummyOperator:
o Placeholder for tasks, often used for testing or DAG structure purposes.
o Example: DummyOperator(task_id='start')
5. DataprocOperator:
o Used to create and manage Google Cloud Dataproc clusters and run jobs on
them (like Spark jobs).
o Example: Submit a Spark job to a Dataproc cluster.
6. BigQueryOperator:
o Used to run queries in Google BigQuery.
o Example: Execute a SQL query in BigQuery.
7. GCSFileTransferOperator:
o Handles file transfer between Google Cloud Storage (GCS) and other services.
8. HttpOperator:
o Sends HTTP requests to any web service (e.g., REST APIs).

4. I Have Created One DAG, and Inside That:


• Created Dataproc cluster
• Submitted one PySpark file/job
• After submitting the job, I am deleting the cluster
• What happens if I didn’t mention DAG task dependency?
• If DAG fails, why and if it is not failed, why?
• Task Dependencies:
o If you do not specify task dependencies, tasks may run in parallel by default,
which could cause issues like trying to delete the cluster before the PySpark
job has finished.
o Without dependencies, tasks do not wait for others to complete before
starting, leading to race conditions or premature execution of dependent
tasks (like deleting the cluster before the job is finished).
• If the DAG Fails:
o The DAG could fail if:
▪ A task like PySpark job submission fails (e.g., script issues, resources
unavailable).
▪ Cluster deletion fails because the task was not set to wait until the
PySpark job finishes.
o If the DAG does not fail:
▪ It may be due to lack of proper dependencies, which could allow tasks
to run even if other steps should logically come first.
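A hedged sketch of the scenario above using operators from the Google provider package (apache-airflow-providers-google); the project, region, bucket, and cluster settings are placeholders, and the delete step uses trigger_rule=ALL_DONE so the cluster is removed even if the job fails.
Python:
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

PROJECT_ID = "my-project"
REGION = "us-central1"
CLUSTER_NAME = "ephemeral-spark-cluster"

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket-name/jobs/transform.py"},
}

with DAG("dataproc_ephemeral_cluster", start_date=datetime(2025, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    )

    submit_job = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,  # clean up whether the job succeeded or failed
    )

    # Explicit dependencies prevent the delete task from racing the Spark job.
    create_cluster >> submit_job >> delete_cluster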

5. What is the Meaning of "On Failure Callback" in Cloud Composer?


• On Failure Callback is a function you define to be called when a task fails.
o You can specify a callback function to notify you or trigger an action when a
task fails.
o Example: Sending an email, logging an error, or triggering a different process
to handle the failure.
Example:
def on_failure_callback(context):
    send_failure_email(context['task_instance'])
• This can be configured using the on_failure_callback parameter when defining a task
in your DAG.
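A slightly fuller hedged sketch; the notification logic and DAG/task names are illustrative.
Python:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_failure(context):
    # 'context' includes the task instance, DAG id, and run information.
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed")  # replace with email/Slack/etc.

default_args = {
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_failure,   # called whenever a task fails
}

with DAG("callback_demo", start_date=datetime(2025, 1, 1),
         schedule_interval=None, catchup=False, default_args=default_args) as dag:
    flaky_task = PythonOperator(task_id="flaky_task", python_callable=lambda: 1 / 0)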

6. I Have Created a DAG with 4 Tasks, If Any Task Fails, I Don’t Want to Stop the DAG.
• In this case, use trigger rules and retries so a single failure does not stop the rest of
the DAG (see the sketch after this list):
1. Trigger Rule:
▪ The trigger rule controls when a task is executed based on the status
of its upstream tasks.
▪ Use TriggerRule.ALL_DONE on downstream tasks so they run once all
upstream tasks have finished, whether those tasks succeeded or failed.
2. Task retry mechanism: Enable retries for the failing task so it is re-attempted
before being marked as failed.
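A partial hedged snippet showing the trigger rule; it assumes a dag object and upstream tasks already exist, and the task name is illustrative.
Python:
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

# Runs once all upstream tasks have finished, regardless of success or failure,
# so one failing task does not stop the rest of the DAG.
summary = BashOperator(
    task_id="summarize_run",
    bash_command="echo 'pipeline finished'",
    trigger_rule=TriggerRule.ALL_DONE,
    dag=dag,
)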

7. How Would You Handle DAG Failures or Retries in Cloud Composer?


• Failure Handling:
o Retries: You can set a task to retry in case of failure by configuring the retries
parameter.
▪ Example: retries=3, which means Airflow will retry the task up to 3
times.
o Retry Delay: Set a delay between retries using retry_delay parameter.
▪ Example: retry_delay=timedelta(minutes=5)
• On Failure Callbacks: As mentioned, use callbacks to handle specific failure actions
like sending alerts or triggering another process.
• Alerting:
o Set up email notifications or use Google Cloud Monitoring to alert you on
task failures.

8. How to Restart a DAG in Airflow?


• To restart a DAG:
1. Mark the task as Failed/Success: If a task fails, manually mark it as success or
retry.
2. Triggering DAG: You can manually trigger a DAG from the Airflow UI by
clicking on the play button or use the following command:
Bash:
airflow dags trigger your_dag_id
9. How to Create Task Dependencies in Airflow?
• Task dependencies are created by using the set_upstream() and set_downstream()
methods, or by using the bitshift (>> and <<) operators.
o Bitshift Operator: It is the preferred method and is more concise.
Python:
task_1 >> task_2 # task_2 will run after task_1
▪ You can also use the reverse:
Python:
task_2 << task_1 # task_1 will run before task_2
o Using set_downstream:
Python:
task_1.set_downstream(task_2) # task_2 will run after task_1

10. How to Execute One DAG After the Successful Completion of Another DAG?
• Use the ExternalTaskSensor operator:
o This operator waits for a task from another DAG to complete before executing
the current task.
o Example:
Python:
from airflow.sensors.external_task import ExternalTaskSensor

task_1 = ExternalTaskSensor(
    task_id='wait_for_task_1',
    external_dag_id='dag_1',      # DAG ID of the other DAG
    external_task_id='task_1',    # Task ID to monitor
    mode='poke',                  # Polling mode
    timeout=600,                  # Max wait time in seconds
)
11. How Would You Use DAG in Data Pipeline Orchestration?
• A DAG is an essential part of orchestrating data pipelines, as it defines the workflow
and task dependencies in a structured manner.
• Example Workflow:
1. Extract Data from a source (e.g., API, Database).
2. Transform Data using a Spark job or any other processing tool.
3. Load Data into the target (e.g., BigQuery, Cloud Storage).
• Key Benefits of using DAGs in orchestration:
o Task Dependencies: Easily control execution order and logic.
o Scheduling: Automatically schedule and monitor data pipelines.
o Error Handling: Implement retries, failure handling, and alerts.
o Monitoring: Airflow provides detailed monitoring of task execution and
performance.
Dataflow:
1. What is Dataflow?
• Dataflow is a fully managed streaming and batch data processing service in Google
Cloud.
o It is built on Apache Beam, an open-source unified stream and batch data
processing model.
o Dataflow simplifies the process of building, deploying, and managing data
pipelines for both batch and real-time data processing.
o Purpose: It helps you analyze and process data at large scale with minimal
effort, providing auto-scaling, monitoring, and cost-effective data processing.
Key Features:
• Unified Programming Model: Dataflow uses Apache Beam, which allows you to
process both batch and streaming data with the same codebase.
• Fully Managed: You don’t have to worry about managing infrastructure or clusters.
Google handles all scaling and resource allocation automatically.
• Cost-Effective: You only pay for the resources you use, and Dataflow can scale
automatically to save costs during low usage times.

2. Major Components of Dataflow


The key components in Dataflow are:
1. Pipeline:
o A pipeline defines the workflow for processing data. It consists of a series of
steps that define how the data flows through various processing stages (e.g.,
reading data, applying transformations, writing data).
o In Apache Beam, a pipeline is composed of PCollection (data) and
PTransforms (transformations).
2. PCollection:
o A PCollection is a collection of data that is processed within a pipeline. It
represents an abstraction over the data that may be either bounded (batch)
or unbounded (streaming).
3. PTransform:
o A PTransform is a transformation applied to a PCollection. This can be
operations like filtering, mapping, or windowing data.
o For example, a ParDo is a common PTransform that allows you to apply a
function to each element of the PCollection.
4. Windowing:
o Windowing is a technique that groups elements of an unbounded stream
(real-time data) into finite windows based on certain criteria (time, data size,
etc.).
o This is crucial for processing streaming data in manageable chunks.
5. Runner:
o A runner is the engine that executes your Apache Beam pipeline. Dataflow
itself acts as a runner on Google Cloud.
o Example: The Dataflow Runner executes the pipeline on the Dataflow
service, which manages resources and scaling automatically.
6. IO (Input/Output):
o IO connectors allow Dataflow pipelines to read and write data from external
sources, such as Google Cloud Storage, BigQuery, Cloud Pub/Sub, etc.
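A minimal hedged word-count sketch in the Apache Beam Python SDK that ties these pieces together; the project, bucket, and paths are placeholders.
Python:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",            # use "DirectRunner" to test locally
    project="my-project",
    region="us-central1",
    temp_location="gs://your-bucket-name/tmp",
)

with beam.Pipeline(options=options) as p:                                         # the Pipeline
    (
        p
        | "Read"   >> beam.io.ReadFromText("gs://your-bucket-name/input/*.txt")   # IO -> PCollection
        | "Split"  >> beam.FlatMap(lambda line: line.split())                     # PTransforms
        | "Pair"   >> beam.Map(lambda word: (word, 1))
        | "Count"  >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, n: f"{word},{n}")
        | "Write"  >> beam.io.WriteToText("gs://your-bucket-name/output/wordcount")
    )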

3. How Dataflow Automatically Scales?


Dataflow can automatically scale your data processing pipeline based on the amount of data
that needs processing. This scaling happens in the following ways:
1. Dynamic Scaling:
o Dataflow automatically adjusts the number of workers (compute resources)
used to process data based on the size and complexity of the data.
o It automatically scales up (add more workers) when data volume increases or
scales down (reduce workers) when the volume decreases, to optimize
performance and costs.
2. Worker Types and Flexible Resource Scheduling:
o Dataflow provisions worker VMs automatically based on the pipeline's
requirements; you can also choose a specific machine type if needed.
o For batch pipelines, Flexible Resource Scheduling (FlexRS) reduces cost by
using a mix of preemptible and regular VMs, in exchange for a delayed
(queued) job start.
3. Autoscaling Algorithms:
o Dataflow uses a progressive autoscaling algorithm to determine how many
workers are required at any given time based on pipeline load.
o This ensures minimal latency and resource wastage, optimizing performance
and cost.
4. Dynamic Work Rebalancing:
o When data is processed, Dataflow dynamically redistributes work between
workers, ensuring an even load distribution. This helps to prevent
bottlenecks.

4. What is Windowing and What is Its Purpose?


• Windowing is a technique used in streaming data to break the data into manageable
windows (groups) of elements based on time or other criteria.
Purpose of Windowing:
1. Time-based Processing: When dealing with streaming data (e.g., real-time
data), you need a way to group data over time intervals. This makes it easier
to aggregate or perform operations over time ranges.
2. Event-Time Processing: In event-driven systems, data doesn’t always arrive in
the correct order, and windowing allows you to re-group data into time
windows (e.g., 5-minute, 1-hour windows).
3. Handling Late Data: Windowing ensures that late-arriving data is not ignored
but is instead included in the correct window for proper processing.
Types of Windowing:
1. Fixed Window:
o Data is grouped into fixed-size windows (e.g., 1-minute windows).
o Example: Group events occurring in a specific 1-minute window.
2. Sliding Window:
o Overlapping windows with a specified size and slide interval.
o Example: A 5-minute window sliding every 1 minute.
3. Session Window:
o Data is grouped based on session activity. This is useful for grouping user
sessions or events that are related to a single interaction.
o Example: Grouping events within a 10-minute gap of inactivity.
4. Global Window:
o A single large window containing all elements. This is useful for non-time-
based aggregations.
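A partial hedged snippet: applying fixed one-minute windows to an unbounded PCollection, here assumed to be named events and to carry event timestamps and a user_id field.
Python:
import apache_beam as beam
from apache_beam.transforms import window

# Group the stream into 60-second fixed windows, then count events per user per window.
per_user_counts = (
    events
    | "FixedWindows" >> beam.WindowInto(window.FixedWindows(60))
    | "KeyByUser"    >> beam.Map(lambda e: (e["user_id"], 1))
    | "CountPerUser" >> beam.CombinePerKey(sum)
)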
5. How Would You Schedule a Dataflow Workflow Without Using Cloud Composer?
You can schedule Dataflow pipelines without using Cloud Composer in several ways:
1. Google Cloud Scheduler:
o Use Google Cloud Scheduler, a fully managed cron job service, to trigger your
Dataflow pipeline at specific times.
o Steps:
1. Create a Cloud Scheduler job.
2. Set it to trigger an HTTP request or Pub/Sub message.
3. Trigger a Dataflow job using the gcloud dataflow jobs run command or
the Dataflow API.
2. Google Cloud Functions:
o Use Cloud Functions to execute your pipeline on a schedule.
o Set up a Cloud Function to trigger a Dataflow job when an event occurs or at
a specific time (using a timer in the Cloud Function code).
3. Command Line Interface (CLI):
o You can manually or script the triggering of your Dataflow pipeline using the
gcloud CLI.
o Example:
Bash:
gcloud dataflow jobs run <job-name> --gcs-location=<pipeline-location> --region=<region>
4. Dataflow Templates:
o Dataflow supports templates, which can be executed on a scheduled basis.
o You can create a Dataflow template and then trigger it via Cloud Scheduler or
any other tool that supports HTTP triggers.

6. Explain Autoscaling in Dataflow


Autoscaling in Dataflow refers to the ability of the service to dynamically adjust the number
of workers involved in executing a pipeline based on the volume of incoming data or the
computational requirements.
• How it works:
1. Initial Scaling: Dataflow starts with a small number of workers.
2. Scaling Up: If more data arrives or the job requires more compute resources,
Dataflow increases the number of workers automatically.
3. Scaling Down: Once the job completes, or data volume decreases, Dataflow
reduces the number of workers to optimize cost.
• Autoscaling Characteristics:
o Dataflow handles both stream and batch processing with autoscaling.
o This ensures cost efficiency because you're only billed for the resources you
actually use.
o It adjusts based on pipeline load and task complexity.

7. How Does Dataflow Handle Errors or Job Failures?


Dataflow provides several mechanisms to handle errors or failures in your jobs:
1. Retry Logic:
o Dataflow automatically retries failed tasks or stages up to a specified
maxRetryCount (set by you) to mitigate transient errors.
2. Error Handling via Dead-letter Policy:
o For certain types of errors (e.g., processing failures), Dataflow allows you to
configure a dead-letter policy where data that could not be processed is
stored in a separate location for later inspection or reprocessing.
3. Logging:
o Dataflow integrates with Cloud Logging (formerly Stackdriver), which
provides detailed logs for each job stage. You can monitor errors or failures
via the logs and take corrective actions.
o Errors are logged, allowing users to trace the cause of the failure.
4. Job State:
o Dataflow jobs provide feedback on their state (e.g., Running, Succeeded,
Failed). You can monitor job status via the Dataflow Console or Cloud
Monitoring to identify failures early.
5. Custom Error Handling:
o You can implement your own error handling inside the Apache Beam
pipeline, using custom logic to handle or filter erroneous data.
Miscellaneous/General Cloud Topics:
1. What is IAM? What is a service account? How many service accounts can be created per
project?
IAM (Identity and Access Management):
• IAM is a framework used to control who has access to specific resources in your
Google Cloud project and what actions they can perform.
• Purpose: IAM allows you to manage and enforce access control policies for
resources, ensuring that only authorized users and services can access or modify
your resources.
• Components:
1. Users: Individuals or groups who have access to resources.
2. Roles: Define a set of permissions that determine what actions a user can
perform on resources.
3. Permissions: The actions that are allowed on a resource (e.g., read, write,
execute).
4. Service Accounts: These are accounts used by applications or virtual
machines (VMs) to interact with Google Cloud services programmatically.
Service Account:
• A service account is an identity used by applications or virtual machines (VMs) to
interact with Google Cloud services.
• Unlike user accounts (which represent real users), service accounts represent non-
human entities that need to access Google Cloud resources.
• Service Account Keys are used to authenticate these applications, allowing them to
interact with Google Cloud services.
How many service accounts can be created per project?
• Service Accounts: A Google Cloud project can have up to 100 service accounts by
default, but this limit can be increased by submitting a request to Google Cloud
support if necessary.

2. What are Roles in GCP?


• Roles in IAM are collections of permissions that are assigned to users or service
accounts to determine what actions they can perform on Google Cloud resources.
Types of Roles:
1. Primitive Roles: Broad permissions across all resources in a project.
o Owner: Full access to all resources and services.
o Editor: Can modify resources but not manage permissions.
o Viewer: Read-only access to resources.
2. Predefined Roles: More granular and specific roles that are tailored to certain
services.
o Example: roles/storage.objectViewer for read-only access to Google Cloud
Storage.
3. Custom Roles: Roles that you can define by selecting specific permissions for fine-
grained control over access.

3. What is the difference between service and batch accounts?


• Service Account:
o A service account is used by applications, VMs, or services to perform
automated actions or interact with Google Cloud APIs.
o It is typically used for continuous, long-running operations.
• Batch Account:
o A batch account is used for jobs that are scheduled to run in batches. These
jobs are typically one-time, or run periodically (e.g., every night), and are not
expected to run continuously.
o For example, Google Cloud Dataflow can execute batch data processing jobs,
and a batch account would be used to manage those workloads.

4. Explain different storage classes in GCS and their costings.


Google Cloud Storage (GCS) offers multiple storage classes, each designed for specific use
cases:
1. Standard:
o For frequently accessed data.
o Cost: Higher storage cost but low access cost (good for frequently accessed
data).
2. Nearline:
o For data that is accessed less than once a month.
o Cost: Lower storage cost than Standard, but higher access cost. Ideal for
infrequent access data (e.g., backups).
3. Coldline:
o For data that is rarely accessed (less than once a year).
o Cost: Lower storage cost than Nearline, but higher access cost. Suitable for
long-term archival data.
4. Archive:
o For data that is rarely accessed and is meant for long-term storage.
o Cost: The cheapest storage class in terms of storage, but with higher retrieval
costs. Best for archival data with very infrequent access.
Cost Breakdown:
• Storage Costs: Charges based on the amount of data stored.
• Access Costs: Charges for data retrieval or access (higher for colder classes).
• Data Retrieval Time: Coldline and Archive take longer to retrieve data than Standard
and Nearline.

5. How do you ensure data consistency in a GCP bucket?


• Data consistency in Google Cloud Storage is guaranteed with strong consistency:
o Strong Consistency: Any operation that modifies the data (e.g., write, delete)
is immediately reflected in subsequent read requests.
o Example: If you upload an object to a bucket, subsequent read requests for
that object will reflect the newly uploaded data immediately, no matter
where they originate.
• Eventual Consistency (used in other cloud storage systems) is not applicable in GCS,
as GCS provides strong consistency by default.

6. What is the difference between HTTP and HTTPS?


• HTTP (Hypertext Transfer Protocol):
o HTTP is the protocol used for transferring data over the web.
o It does not encrypt the data between the client and the server, making it
vulnerable to man-in-the-middle attacks and eavesdropping.
• HTTPS (Hypertext Transfer Protocol Secure):
o HTTPS is the secure version of HTTP that uses SSL/TLS encryption to secure
the connection between the client and server.
o It ensures data integrity, confidentiality, and authentication, making it safe
for transmitting sensitive data like passwords or payment details.

7. What is API Whitelisting and Bucket Whitelisting?


• API Whitelisting:
o API whitelisting allows you to restrict access to certain APIs based on a list of
approved IP addresses, services, or users.
o This ensures that only trusted entities can call your APIs, preventing
unauthorized access.
• Bucket Whitelisting:
o Similar to API whitelisting, bucket whitelisting allows you to restrict access to
your Google Cloud Storage buckets based on IP addresses or predefined
rules.
o This ensures that only certain users or systems can access or modify the data
in the bucket.

8. How do you handle large-scale data migrations from on-prem to cloud?


To handle large-scale data migrations from on-premises systems to Google Cloud, you can
use several approaches:
1. Google Cloud Storage Transfer Service:
o Automates the transfer of large amounts of data from on-premises to Google
Cloud Storage.
o Can handle scheduled, incremental, or one-time transfers.
2. Transfer Appliance:
o A physical device that allows you to copy large datasets locally before
shipping it to Google Cloud for upload.
o This is particularly useful when you have terabytes or petabytes of data to
migrate.
3. Google Cloud Data Transfer API:
o Used for custom transfers or integrating with existing systems.
4. Third-Party Tools:
o You can also use third-party tools like Velostrata or CloudEndure to handle
complex migrations.
9. Difference between serverless and managed services?
• Serverless Services:
o In a serverless architecture, the cloud provider fully manages the
infrastructure for you.
o Example: Google Cloud Functions or Google Cloud Run.
o Benefits: No need to manage servers, auto-scaling, pay-per-use pricing.
o Drawback: Limited control over the underlying infrastructure.
• Managed Services:
o A managed service also abstracts infrastructure management but typically
gives you more control over configuration.
o Example: Google Kubernetes Engine (GKE) or Google Cloud SQL.
o Benefits: More control, while still abstracting some infrastructure
management.
o Drawback: You still need to manage certain components, like scaling or
maintenance.

10. Horizontal Scaling vs. Vertical Scaling


• Horizontal Scaling (Scaling Out):
o Involves adding more instances (e.g., additional virtual machines or
containers) to handle increased load.
o Example: Adding more instances of a service in Kubernetes or Google
Compute Engine.
o Benefits: Highly flexible and cost-effective for large-scale applications.
• Vertical Scaling (Scaling Up):
o Involves adding resources (e.g., CPU, memory) to a single instance to increase
its capacity.
o Example: Upgrading a virtual machine's size in Google Compute Engine.
o Benefits: Simpler and good for applications that require larger resources in
one instance.
11. How do you move on-prem data to cloud using Data Migration Service?
Google Cloud's Data Migration Service helps you migrate databases from on-premises
systems to Google Cloud.
• Steps:
1. Set up the Data Migration Service in Google Cloud Console.
2. Choose the source database (e.g., MySQL, SQL Server).
3. Provide connection details for the on-prem database.
4. Set up the target database on Google Cloud (e.g., Google Cloud SQL).
5. Start the migration, which will transfer data from on-prem to the cloud with
minimal downtime.
• Features:
o Minimal downtime migration.
o Supports real-time replication of ongoing changes during migration.
Miscellaneous/Project-Specific Questions:
1. Explain your project and the services you used.
Project Overview:
• A project typically refers to a cloud-based system where data is ingested, processed,
and analyzed. The services you use depend on the nature of the project. For
example:
o A data processing pipeline in Google Cloud may involve collecting data from
different sources, transforming it, and storing the results in a warehouse or
dashboard.
Example Project:
• Data Ingestion: Data is ingested from different sources such as on-prem systems,
web scraping, APIs, etc., into Google Cloud Storage (GCS).
• Data Processing: The data is processed using Dataproc (for Apache Spark/Apache
Hadoop jobs) or Dataflow (for real-time and batch processing using Apache Beam).
• Data Storage: Processed data is then stored in BigQuery for analytics and querying.
• Orchestration: Orchestration of the pipeline and task management is done using
Cloud Composer, which automates scheduling and monitoring of workflows.
• Data Visualization: Dashboards are created using Google Data Studio or Looker to
visualize the data from BigQuery.
Services Used:
1. Google Cloud Storage (GCS): For storing raw and processed data.
2. Dataproc: For running Apache Spark and Hadoop jobs.
3. Dataflow: For real-time data processing using Apache Beam.
4. BigQuery: For data warehousing and running analytical queries.
5. Cloud Composer: For orchestration and scheduling tasks in the pipeline.
6. Google Data Studio/Looker: For creating visualizations and reports.
2. What are the operators used in your DAGs in Cloud Composer?
DAG (Directed Acyclic Graph):
• A DAG in Cloud Composer (Apache Airflow) is a collection of tasks that are executed
in a specified order.

Operators in Cloud Composer:


Operators are predefined tasks in Airflow, and each one represents a different kind of
activity. Here are some operators commonly used in Cloud Composer:
1. PythonOperator: Executes a Python function in a task.
o Use Case: Running Python scripts for data processing or transformations.
2. BashOperator: Executes a Bash command in a task.
o Use Case: Running shell commands, file manipulations, or calling external
scripts.
3. DataprocSubmitJobOperator: Used to submit a job to a Dataproc cluster (Apache
Spark or Hadoop jobs).
o Use Case: Running Spark or Hadoop jobs in a Dataproc cluster as part of your
workflow.
4. BigQueryOperator: Executes a query in BigQuery or loads data to BigQuery.
o Use Case: Running SQL queries on BigQuery, loading data into BigQuery, or
exporting data.
5. GCS transfer operators (e.g., GCSToGCSOperator, LocalFilesystemToGCSOperator,
GCSToLocalFilesystemOperator): For moving files into, out of, or between Google Cloud Storage buckets.
o Use Case: Uploading local files to GCS, downloading objects, or copying objects between buckets.
6. HttpSensor: Waits for an HTTP endpoint to return a specific response before
proceeding.
o Use Case: Triggering tasks after an external API responds.
7. DummyOperator: A placeholder task used to create dependencies or structure the
workflow.
o Use Case: Used when a task is just needed as a placeholder (e.g., for grouping
or synchronizing tasks).
8. EmailOperator: Sends emails as part of the workflow.
o Use Case: Notifying the team about task success or failure.
Example DAG:
Python:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_local import GCSToLocalFilesystemOperator

def my_python_function():
    # Data transformation logic
    pass

# Dataproc job definition: the target cluster is set inside 'placement'
job_config = {
    'reference': {'project_id': 'my-project'},
    'placement': {'cluster_name': 'my-cluster'},
    'pyspark_job': {'main_python_file_uri': 'gs://my-bucket/my_script.py'},
}

dag = DAG('my_data_pipeline', start_date=datetime(2025, 2, 6), schedule_interval=None)

task1 = PythonOperator(task_id='python_task', python_callable=my_python_function, dag=dag)
task2 = DataprocSubmitJobOperator(task_id='spark_task', job=job_config,
                                  region='us-central1', dag=dag)
task3 = GCSToLocalFilesystemOperator(task_id='download_file', bucket='my_bucket',
                                     object_name='file.txt', filename='/tmp/file.txt', dag=dag)

task1 >> task2 >> task3  # Defining task dependencies

3. What kind of data transformation was done during the project?


Data transformations refer to the process of cleaning, enriching, or altering raw data into a
more structured format for analysis. Below are some examples of common transformations
done during cloud-based data projects:
Types of Transformations:
1. Data Cleaning:
o Removing duplicates: Cleaning out duplicate records to ensure data integrity.
o Handling missing values: Replacing null or missing values with meaningful
values or removing them.
o Normalization/Standardization: Ensuring consistency in formats (e.g.,
converting date formats, handling unit conversions).
2. Data Aggregation:
o Aggregating data (e.g., summing up values, calculating averages) to get high-
level insights.
o Example: Calculating daily, monthly, or yearly totals in sales data.
3. Data Enrichment:
o Combining data from multiple sources to enhance the data’s value.
o Example: Merging customer data with external demographic data.
4. Data Parsing:
o Extracting useful information from unstructured or semi-structured data (e.g.,
parsing logs or JSON).
o Example: Parsing a log file into structured data that can be loaded into
BigQuery.
5. Data Filtering:
o Filtering out irrelevant records that do not contribute to your analysis.
o Example: Only including transactions above a certain value.
6. Data Transformation for Analysis:
o Performing mathematical or statistical operations to transform the data for
analysis.
o Example: Converting raw transaction data into time series data for trend
analysis.
Example Transformation using Spark (in Dataproc):
Python:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DataTransformation').getOrCreate()

# Load data (inferSchema so numeric columns such as 'sales' are typed correctly)
data = spark.read.csv("gs://my-bucket/raw_data.csv", header=True, inferSchema=True)

# Cleaning: remove duplicates
clean_data = data.dropDuplicates()

# Aggregation: calculate sum of sales by product category
aggregated_data = clean_data.groupBy('category').sum('sales')

# Save the transformed data to BigQuery (requires the Spark-BigQuery connector;
# a temporary GCS bucket is needed for the indirect write method)
aggregated_data.write.format('bigquery') \
    .option('table', 'project_id.dataset_name.table_name') \
    .option('temporaryGcsBucket', 'my-temp-bucket') \
    .mode('overwrite') \
    .save()

4. How did you perform orchestration, task dependency, and pipeline management in your
project?
Orchestration and Task Dependencies:
In Cloud Composer (Airflow), orchestration is achieved by defining a DAG that represents
the workflow of tasks. The tasks are connected in a dependency chain where one task must
complete before the next task starts.
• Task Dependencies: You define the order in which tasks should run by using the >> or
<< operators, which create a "dependency chain" between tasks.
o Example: Task A must finish before Task B starts.
Python:
task_a >> task_b # task_b will run only after task_a completes
• Pipeline Management: Using Cloud Composer, the entire workflow is scheduled,
managed, and monitored via the Airflow UI, where you can view logs, retry failed
tasks, and monitor task execution status.
Example of Task Dependencies in Airflow:
Python:
task1 = PythonOperator(task_id='task1', python_callable=my_function, dag=dag)
task2 = PythonOperator(task_id='task2', python_callable=my_function, dag=dag)
task3 = PythonOperator(task_id='task3', python_callable=my_function, dag=dag)

task1 >> task2  # Task2 runs after Task1 completes
task2 >> task3  # Task3 runs after Task2 completes
5. How would you optimize and troubleshoot jobs running in Dataproc, BigQuery, or
Dataflow?
Optimization:
1. Dataproc:
o Cluster Sizing: Ensure that the cluster is properly sized based on the job’s
resource requirements (memory, CPU). Use preemptible instances to reduce
costs.
o Caching: Cache intermediate results in HDFS or Spark to avoid
recomputation.
o Resource Management: Tune the number of partitions and use dynamic
resource allocation for better performance.
2. BigQuery:
o Partitioning and Clustering: Partition tables based on time or other relevant
fields to speed up query performance. Cluster tables to optimize for specific
columns frequently queried.
o Limit Scanned Data: Filter on the partition column to reduce the data a query
scans, and avoid SELECT * — select only the columns you need.
o Query Optimization: Use materialized views or cached query results for
frequently repeated queries.
3. Dataflow:
o Autoscaling: Ensure that autoscaling is enabled to adjust the number of
workers dynamically based on job demands.
o Pipeline Optimization: Use windowing and watermarks to handle large
amounts of data and real-time processing efficiently.
o Parallel Processing: Design the pipeline with parallelism in mind to speed up
processing.
Troubleshooting:
1. Dataproc: Monitor logs via Cloud Logging. Look for errors or warnings in the job logs
that can indicate issues with resources or cluster configuration.
2. BigQuery: Use the Query Execution Plan to analyze query performance and identify
bottlenecks.
3. Dataflow: Review Dataflow's job logs for any errors or slow stages. Use the Dataflow
dashboard to monitor worker performance and the stages of your pipeline.
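For example, a minimal sketch using the google-cloud-bigquery Python client to estimate how much data a query would scan before running it (the project, dataset, and column names here are placeholders):
Python:
from google.cloud import bigquery

client = bigquery.Client()

# Dry run: estimate bytes scanned without running (or paying for) the query
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
sql = """
    SELECT order_id, amount
    FROM `my-project.sales.orders`
    WHERE order_date BETWEEN '2025-01-01' AND '2025-01-31'  -- partition filter limits scanned data
"""
job = client.query(sql, job_config=dry_run_config)
print(f"This query would process {job.total_bytes_processed / 1e9:.2f} GB")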
Ajay Pratap Singh questions:
1. Explain recent project briefly – which scripting language you use for transformation
– PySpark?
Project Overview:
• The project might involve data processing in a cloud-based environment, where raw
data is ingested, transformed, and analyzed.
• The scripting language used for transformation could be PySpark, which is a Python
API for Apache Spark. PySpark allows you to perform distributed data processing on
large datasets.
Example Scenario:
• Data from various sources (such as Cloud Storage, databases, or APIs) is loaded into
Google Cloud Storage (GCS).
• Using Apache Spark (via Dataproc), PySpark scripts are written to transform the
data, such as aggregating, cleaning, or enriching the data.
• After transformation, the data is loaded into BigQuery for analysis.

2. What kind of transformations do you perform on data?


Common types of data transformations include:
• Data Cleaning: Removing null values, fixing formatting issues, and removing
duplicates.
• Data Aggregation: Summing or averaging data by different categories.
• Data Filtering: Filtering out records based on certain conditions.
• Data Enrichment: Adding additional information to the data by merging or joining
with other datasets.
• Data Normalization: Scaling values to a common range or format.
• Data Parsing: Extracting structured information from semi-structured data like JSON
or CSV.
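As an illustration, a few of these transformations expressed in PySpark (df and customers_df are hypothetical DataFrames, and the column names are placeholders):
Python:
from pyspark.sql import functions as F

# Cleaning: drop duplicates and fill missing amounts
cleaned = df.dropDuplicates().fillna({'amount': 0})

# Filtering: keep only transactions above a threshold
filtered = cleaned.filter(F.col('amount') > 100)

# Enrichment: join with a customer reference dataset
enriched = filtered.join(customers_df, on='customer_id', how='left')

# Aggregation: daily totals per category
daily_totals = enriched.groupBy('category', 'order_date').agg(F.sum('amount').alias('total_amount'))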

3. How did you orchestrate and schedule the job?


Orchestration and scheduling can be done using Cloud Composer, which is built on
Apache Airflow. You can create DAGs (Directed Acyclic Graphs) that define the order of
tasks, their dependencies, and schedules.
Steps:
1. Define tasks like loading data from GCS, processing it with Dataproc, and storing
results in BigQuery.
2. Use Cloud Composer to schedule the DAG to run at specific intervals.
3. Task dependencies ensure tasks run in the correct order (e.g., a Spark job on
Dataproc cannot run before the data is available).

4. How do you execute a PySpark script to Dataproc?


You can submit a PySpark job to Dataproc via the DataprocSubmitJobOperator in
Airflow.
Example:
Python:
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

submit_pyspark = DataprocSubmitJobOperator(
task_id='submit_pyspark_job',
job={
'reference': {'project_id': 'my-project'},
'placement': {'cluster_name': 'my-cluster'},
'pyspark_job': {'main_python_file_uri': 'gs://my-bucket/my_script.py'},
},
region='us-central1',
dag=dag
)

5. Hands-on with Dataproc and Composer


• Dataproc: Create clusters and run jobs using Spark and Hadoop.
• Cloud Composer: Orchestrate workflows by defining DAGs to schedule tasks such as
running Dataproc jobs, transferring data to BigQuery, etc.
6. How can we submit a PySpark job to Airflow?
To submit a PySpark job to Airflow, you would use the DataprocSubmitJobOperator,
similar to the example mentioned above.
Example:
Python:
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

pyspark_task = DataprocSubmitJobOperator(
task_id='pyspark_task',
job={
'reference': {'project_id': 'my-project'},
'placement': {'cluster_name': 'my-cluster'},
'pyspark_job': {'main_python_file_uri': 'gs://my-bucket/my_pyspark_script.py'},
},
region='us-central1',
dag=dag
)

7. Use Python operator to submit the job in Airflow


You can use the PythonOperator to execute a Python function. However, if you want to
run a PySpark job, it’s typically better to use the DataprocSubmitJobOperator. But here’s
an example of using PythonOperator:
Python:
from airflow.operators.python_operator import PythonOperator

def my_pyspark_function():
# Submit the job or run the script here (PySpark related logic)
pass

python_task = PythonOperator(
task_id='python_task',
python_callable=my_pyspark_function,
dag=dag
)

8. Hooks in Airflow
Hooks are used in Airflow to interact with external systems. They abstract the logic
needed to connect to a system like databases, cloud storage, etc.
Example Hooks:
• BigQueryHook: Allows interaction with BigQuery.
• GoogleCloudStorageHook: Interacts with GCS.
• DataprocHook: Used to manage Dataproc clusters.
Example:
Python:
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook

hook = BigQueryHook(gcp_conn_id='google_cloud_default')
results = hook.get_pandas_df('SELECT * FROM dataset.table')

9. XCom variable in Airflow – more details about it, which format you share it, and how
can you catch it between tasks?
XCom (Cross-communication) is a mechanism in Airflow to exchange small amounts of
data between tasks. You can push and pull data between tasks.
• Pushing XCom: Use xcom_push to send data from one task to another.
• Pulling XCom: Use xcom_pull to retrieve the data in another task.
Format: XCom can store any Python object like strings, integers, or even dictionaries.
Example:
Python:
# Inside a task, the TaskInstance is available from the context (e.g., kwargs['ti'] in a PythonOperator)
# Push an XCom value
task_instance.xcom_push(key='my_key', value='my_value')

# Pull an XCom value in a downstream task
value = task_instance.xcom_pull(task_ids='previous_task', key='my_key')

10. What happens if we do XCom push?


When you push an XCom, the data gets stored in Airflow's metadata database. This
allows subsequent tasks to access the data using xcom_pull.

11. How to define environment variable in Airflow?


You can set environment variables for Airflow in a few ways:
1. At the environment level: in Cloud Composer, update the environment's variables (they are then visible to all tasks via os.environ); in self-managed Airflow, export them where the workers run or configure them in airflow.cfg.
2. Per task: operators that support it (for example, BashOperator) accept an env parameter; PythonOperator does not, so Python tasks usually read values with os.getenv or receive them through op_kwargs.
Example:
Python:
import os
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def my_function():
    my_var = os.getenv('MY_ENV_VAR')  # assumed to be set at the environment level
    print(my_var)

python_task = PythonOperator(task_id='read_env_var', python_callable=my_function, dag=dag)

bash_task = BashOperator(
    task_id='task_level_env',
    bash_command='echo $MY_ENV_VAR',
    env={'MY_ENV_VAR': 'my_value'},  # env is supported by BashOperator
    dag=dag,
)
12. Different storage classes in GCS
Google Cloud Storage (GCS) offers several storage classes to optimize for performance
and cost:
1. Standard: High availability, for frequently accessed ("hot") data.
2. Nearline: Infrequently accessed data (read at most about once a month).
3. Coldline: Rarely accessed data (read at most about once every 90 days).
4. Archive: Lowest storage cost, for data accessed less than once a year (e.g., backups, compliance archives).

13. In GCS bucket, object deleted – how to get it back?


If you have versioning enabled for your GCS bucket, you can retrieve a deleted object by
accessing a previous version of the file.

14. Object Versioning in GCS


Object versioning in GCS allows you to keep previous versions of objects in a bucket.
When a file is deleted or overwritten, GCS retains the older versions that can be
accessed later.
To enable versioning:
Bash:
gsutil versioning set on gs://your_bucket_name
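If the live object has already been deleted, a possible way to restore an older generation using the google-cloud-storage Python client (bucket and object names are placeholders) is:
Python:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('your_bucket_name')

# List every generation (version) of the object, including deleted ones
versions = list(client.list_blobs('your_bucket_name', prefix='file.txt', versions=True))

# Pick the generation you want to restore (inspect .generation / .time_deleted)
old = versions[-1]

# Copy that generation back as the live object to "undelete" it
bucket.copy_blob(old, bucket, 'file.txt', source_generation=old.generation)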

15. I have a file in GCS and want to share the file outside GCP (temporary access). How
to give only share file?
You can generate a signed URL that grants temporary access to a specific file in your GCS
bucket.
Example:
gsutil signurl -d 10m /path/to/private-key.json gs://your_bucket/file_name
This provides temporary access to the file via a URL.
16. Signed URL in GCP
A signed URL allows temporary access to a specific GCS object without the need for the
user to authenticate. It is typically used for giving time-limited access to files.

17. I want to submit a job but don’t want to create a cluster in Dataproc. If this can be
done, how to do it?
Dataproc Serverless lets you run Spark workloads (submitted as batches) without creating or
managing a cluster. This is well suited to short-lived jobs that don't need a persistent
cluster.

18. Dataproc Serverless


Dataproc Serverless runs Spark workloads in a serverless environment, where Google provisions
and manages the underlying infrastructure and autoscaling for you. This is useful for cost
optimization and scalability. A sketch of submitting a serverless batch from Airflow follows.
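A possible sketch using the Google provider's DataprocCreateBatchOperator (project, region, bucket, and batch_id are placeholders; assumes the apache-airflow-providers-google package is installed):
Python:
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

serverless_pyspark = DataprocCreateBatchOperator(
    task_id='pyspark_serverless_batch',
    project_id='my-project',
    region='us-central1',
    batch={
        'pyspark_batch': {'main_python_file_uri': 'gs://my-bucket/my_script.py'},
    },
    batch_id='my-batch-001',  # must be unique per batch
    dag=dag,
)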

19. Optimizing techniques you know about when jobs take more time for submitting?
• Increase Parallelism: Adjust the number of partitions to allow tasks to run in parallel.
• Use Caching: Cache intermediate data to avoid redundant computation.
• Optimize Cluster Size: Choose the correct instance types and sizes for your cluster.
• Optimize Code: Use efficient algorithms, avoid shuffling, and minimize data transfer.
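For instance, a small PySpark sketch combining a few of these techniques (DataFrame and column names are hypothetical):
Python:
from pyspark.sql import functions as F

# Repartition on the join key and cache the DataFrame that is reused downstream
df = df.repartition(200, 'customer_id').cache()

# Broadcast the small dimension table to avoid an expensive shuffle join
result = df.join(F.broadcast(customers_df), 'customer_id')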

20. Partition Purging


Partition purging refers to removing data from partitions that are no longer needed. For
example, removing old data from partitions in a BigQuery table or GCS.
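One way to purge old partitions automatically in BigQuery is to set a partition expiration, for example with the Python client (the table is assumed to already be date-partitioned; names are placeholders):
Python:
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table('my-project.my_dataset.events')  # an existing date-partitioned table

# Partitions older than 90 days are deleted (purged) automatically
table.time_partitioning.expiration_ms = 90 * 24 * 60 * 60 * 1000
client.update_table(table, ['time_partitioning'])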

21. How to find data skewness?


You can detect data skewness by analyzing the distribution of data across partitions. If
one partition has much more data than others, it can cause performance issues in
distributed processing systems like Dataproc or BigQuery.
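A quick way to check this in PySpark (df and the key column are hypothetical):
Python:
from pyspark.sql import functions as F
from pyspark.sql.functions import spark_partition_id

# Rows per key: a handful of keys with very large counts indicates skew
df.groupBy('customer_id').count().orderBy(F.desc('count')).show(10)

# Rows per Spark partition: very uneven counts mean some tasks do far more work than others
df.withColumn('pid', spark_partition_id()).groupBy('pid').count().orderBy('pid').show()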
22. How will you partition ID of that partition and number the records in that
partition?
In PySpark, you can use partitionBy() to partition the data by a specific column, like ID.
Use row_number() to number the records within each partition.
Example:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = df.withColumn(
    "row_num",
    F.row_number().over(Window.partitionBy("ID").orderBy("timestamp"))
)

23. Use of BigQuery in Data Warehouse


BigQuery is used as a data warehouse for performing fast, scalable, and cost-efficient
analytics on large datasets. It allows running SQL queries on data stored in tables and
performs real-time analytics.

24. How can we optimize queries in BigQuery?


• Partitioning and Clustering: Use partitioned tables and clustered columns to
optimize query performance.
• Limit Scanned Data: Use filters and only select the necessary columns.
• Materialized Views: Use materialized views to store query results for frequent use.

25. Materialized View – Describe


A materialized view in BigQuery stores the results of a query, improving performance
when repeatedly querying the same data. They are updated automatically in the
background.
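A materialized view can be created with standard SQL DDL, for example via the BigQuery Python client (project, dataset, table, and column names are placeholders):
Python:
from google.cloud import bigquery

client = bigquery.Client()

# Pre-aggregate sales per category and day; BigQuery keeps this view refreshed in the background
client.query("""
    CREATE MATERIALIZED VIEW `my-project.sales.mv_daily_category_totals` AS
    SELECT category, DATE(order_ts) AS order_date, SUM(amount) AS total_amount
    FROM `my-project.sales.orders`
    GROUP BY category, order_date
""").result()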
Jojo questions:
1. Executor Types
Executors are the worker processes that run tasks on a cluster. In Apache Spark (which Dataproc uses for Spark jobs), the key processes are:
• Driver: The main process that coordinates the Spark application, managing the SparkContext/SparkSession and scheduling tasks (it is not an executor itself).
• Executors: Processes on the worker nodes that run the tasks distributed by the driver — reading input data, processing it, and storing or returning results.
In the Dataproc context:
• A Dataproc cluster consists of one driver node and several worker nodes. You can
specify the number of worker nodes depending on the scale of your workload.

2. How to schedule a job in a time range


In Cloud Composer (which is based on Airflow), you can schedule jobs to run within a
specific time range by defining start_date and end_date for the DAG.
Here’s how you define it in the DAG definition:
Python:
from datetime import datetime, timedelta
from airflow import DAG

dag = DAG(
    'example_dag',
    start_date=datetime(2025, 2, 6),
    end_date=datetime(2025, 2, 7),          # Set the time range
    schedule_interval=timedelta(hours=1),   # Runs every hour within the range
)
• start_date: The date and time when the DAG will first be triggered.
• end_date: The last date the DAG will run.
3. XCom – How does it work? Default size?
XCom (Cross-communication) is a feature in Apache Airflow that allows tasks to share data
between each other. You can push data from one task and pull it from another task.
• Push XCom: xcom_push(key='key_name', value='value')
• Pull XCom: xcom_pull(task_ids='task_name', key='key_name')
• Default Size: XCom values are stored in the Airflow metadata database, so the practical size limit depends on the backend (a figure of roughly 48 KB is commonly cited). XCom is meant for small payloads; for anything larger, store the data externally (e.g., in GCS or a database) and pass only a reference or identifier through XCom.

4. Templating in Airflow
Templating in Airflow allows you to use Jinja templating to dynamically generate task
arguments, based on execution context, such as execution date, task instance information,
etc.
Example:
Python:
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def print_execution_date(**kwargs):
    print(f"The execution date is: {kwargs['ds']}")

python_task = PythonOperator(
    task_id='print_execution_date_task',
    python_callable=print_execution_date,
    dag=dag,
)

# Templated field: {{ ds }} is rendered by Jinja into the DAG run's logical date
bash_task = BashOperator(
    task_id='echo_execution_date',
    bash_command='echo "Execution date: {{ ds }}"',
    dag=dag,
)
Here the PythonOperator receives the execution context through **kwargs, while the BashOperator's bash_command is a templated field rendered with Jinja.
5. Micros in Dataproc
"Micros" in a Dataproc context most likely refers to micro-batching in Spark Streaming / Structured Streaming, where incoming data is processed in small time-based batches rather than one record at a time.
• Typically used for near-real-time streaming data processing.
• On Dataproc, this is achieved by running Spark Streaming jobs with small trigger intervals (processing windows).

6. Backfill in Airflow
Backfilling in Airflow means running task instances for past schedule intervals that were never executed — for example, after a DAG is deployed with a start_date in the past, or after downtime. Missed runs can be executed automatically (driven by the catchup setting) or triggered manually with the airflow dags backfill CLI command for a given date range.
• Backfilling ensures that all scheduled runs that should have happened are eventually executed.
• Automatic catch-up behaviour is controlled by the catchup argument (see below).

7. Catchup in Airflow
Catchup is an Airflow parameter that ensures all previous scheduled runs are executed if
they are missed.
• Catchup=True (default): Airflow will execute all the missed runs for the DAG in
sequence, starting from the start_date.
• Catchup=False: It will only execute the current DAG run and not attempt to "catch
up" on missed runs.
Python:
dag = DAG(
    'example_dag',
    catchup=False,  # Skip missed DAG runs
    start_date=datetime(2025, 2, 6),
    schedule_interval=timedelta(hours=1),
)
8. Dataflow Architecture
Dataflow is a fully-managed stream and batch processing service on Google Cloud that is
based on Apache Beam. The architecture consists of:
• Pipeline: The core part of Dataflow where transformations are defined.
• Workers: The compute resources that process the data. Workers scale up and down
based on the load.
• FlexRS and autoscaling: Automatically adjust the number of workers based on
workload.
• Data storage: Data can come from sources like GCS, BigQuery, Pub/Sub, etc.
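A minimal Apache Beam pipeline sketch that could run on Dataflow (bucket paths and project are placeholders; the runner can be switched to DirectRunner for local testing):
Python:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
)

# Classic word count: read text, split into words, count per word, write results
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')
     | 'Split' >> beam.FlatMap(str.split)
     | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
     | 'CountPerWord' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.MapTuple(lambda word, count: f'{word},{count}')
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/wordcounts'))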

9. Windowing in Dataflow
Windowing in Dataflow (via Apache Beam) is used to break streaming data into fixed-size or
sliding windows for easier processing.
• Fixed windows: Data is grouped into fixed-length time intervals (e.g., every 5
minutes).
• Sliding windows: Data is grouped into overlapping intervals, which "slide" over time.
• Session windows: Grouping data based on user-defined periods of activity.
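A sketch of applying these window types in Beam/Dataflow ('events' is an assumed unbounded PCollection of timestamped elements, e.g., read from Pub/Sub):
Python:
import apache_beam as beam
from apache_beam.transforms import window

# Fixed 5-minute windows
fixed = events | 'Fixed5Min' >> beam.WindowInto(window.FixedWindows(5 * 60))

# 10-minute sliding windows that advance every minute (windows overlap)
sliding = events | 'Sliding' >> beam.WindowInto(window.SlidingWindows(size=10 * 60, period=60))

# Session windows that close after 30 minutes of inactivity per key
sessions = events | 'Sessions' >> beam.WindowInto(window.Sessions(gap_size=30 * 60))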

10. Fault Tolerance and Recovery in Dataflow (Real-time)


Fault tolerance in Dataflow is the ability of the system to handle errors and continue
processing. Dataflow ensures that even if there is a failure in one part of the pipeline, the
system can recover and continue processing data.
• Checkpointing: Data is periodically saved to stable storage so that in case of failure,
the process can resume from the last checkpoint.
• Retry mechanisms: If tasks fail, Dataflow automatically retries them.

11. Bounded vs. Unbounded Data in Dataflow


• Bounded data: Data that has a fixed, known size (batch processing). Once the data
arrives, it’s processed in chunks (e.g., a daily log file).
• Unbounded data: Continuous, real-time data streams (e.g., data from sensors,
messages in Pub/Sub). Dataflow processes this in real-time.
12. PVM (Portable Data Processing Model)
PVM is a model used in Apache Beam (and hence Dataflow) for defining data processing
pipelines. It allows for portability of code across different execution environments (like
Dataflow, Spark, etc.).

13. Spark Submit Ways


There are several ways to submit Spark jobs:
1. Command-line: Use spark-submit from the terminal.
Bash:
spark-submit --class <main-class> --master yarn <your-jar-file>
2. Dataproc: Submit jobs directly to Dataproc using the DataprocSubmitJobOperator in
Airflow.
3. Web UI: Dataproc provides a web UI for submitting jobs manually.

14. Data Cardinality in Partitioning


Data Cardinality refers to the uniqueness of data in a column or table. When partitioning
data, understanding cardinality is important to ensure that the partitions are balanced, and
no partition becomes overly large compared to others.
• For example, partitioning a dataset by a high-cardinality column like User ID can
result in very unbalanced partitions if the column has many unique values.

15. Capacitor in Dataproc


Capacitor is BigQuery's proprietary columnar storage format (not a Spark/Dataproc feature). It stores data in a compressed, columnar layout that enables very fast scans and aggregations, and it underpins how BigQuery reads data for large-scale analytics queries.

16. Snapshot in Dataproc


A snapshot is a point-in-time copy of data or state. In a Dataproc context this usually means persistent-disk snapshots of cluster VMs or HDFS snapshots used to preserve data; clusters themselves are typically treated as ephemeral, with their configuration preserved via templates or automation rather than a built-in cluster snapshot feature.
17. BigQuery Limitations
BigQuery has the following notable limitations (approximate values, subject to change):
• Maximum query length: about 1 MB of SQL text per query.
• Maximum columns: about 10,000 columns per table.
• Query execution time: queries are limited to roughly 6 hours of execution.
• No enforced primary/foreign keys or traditional indexes; such constraints are informational only.
• Very complex or deeply nested queries can run slowly and may hit resource limits.

18. BigQuery Pricing


BigQuery pricing is primarily based on:
• Storage costs: Based on the amount of data stored.
• Query costs: Based on the amount of data processed by a query (measured in bytes).
• Streaming insert costs: For real-time data ingestion via the Streaming API.
You can optimize costs by using partitioned tables and reducing the amount of data queried.
