Suharshini - Data Engineer - Python
Summary:
• Over 9 years of experience as a Senior Data Engineer, specializing in data pipeline
design, development, validation, and implementation, with a focus on enhancing
data quality and enabling actionable insights through advanced data visualization.
• Expert in ETL processes, data modeling, and database management, ensuring
efficient data flow, organized structures, and optimized data retrieval across
various systems.
• Proficient in Python and in distributed processing frameworks such as Apache Spark and PySpark, along with Big Data and data platform technologies including Hadoop, Hive, Airflow, Kafka, NiFi, and Grafana.
• Hands-on experience in designing and developing Spark applications using Scala and PySpark, comparing performance across Spark, Hive, and SQL/Oracle.
• Optimized DataFrame queries using lazy evaluation and predicate pushdown
techniques.
• Experience with cloud services including AWS (EC2, S3, Redshift, Glue,
Lambda, Step Functions, CloudWatch, SNS, DynamoDB, SQS) and Azure (Data
Factory, Azure SQL, Blob Storage, SQL Data Warehouse).
• Solid experience working with diverse data formats, including CSV, text, sequence files, Avro, Parquet, ORC, and JSON.
• Automated data ingestion into BigQuery from various sources using Python and
Airflow.
• Implemented authentication and authorization mechanisms in Node.js APIs for
secure data access.
• Strong understanding of NoSQL databases with hands-on experience developing
applications using Cassandra and MongoDB.
• Extensive experience with relational databases such as Oracle, MySQL, and
PostgreSQL, with proficiency in writing complex SQL queries.
• Applied advanced statistical techniques (regression, clustering, time series
analysis) for data modeling.
• Skilled in using JIRA for project management, GIT for version control, and
Jenkins and Docker for continuous integration and deployment. Proficient in
conducting code reviews using Crucible.
• Experience leveraging Oozie and Airflow for orchestrating complex data
workflows, automating ETL processes, and optimizing data pipeline performance.
• Managed data lake architectures to store and process large volumes of structured
and unstructured data.
• Expertise in Teradata, Informatica, Python, and UNIX shell scripting to process
large volumes of data from diverse sources and load it into databases like Teradata
and Oracle.
• Developed reusable Talend components for standardized ETL processes across multiple
projects.
• Developed multiple Power BI dashboards that provide actionable insights, enabling real-
time monitoring and facilitating informed decision-making for stakeholders.
• Used PySpark UDFs (User-Defined Functions) to handle complex business logic in Spark DataFrames (a minimal sketch follows this summary).
• Experienced in leveraging Databricks for building scalable data pipelines and
performing advanced analytics with Apache Spark.
• Proficient in using Snowflake for efficient data warehousing, optimizing performance,
and integrating structured and semi-structured data for analytics.
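For illustration, a minimal sketch of the PySpark UDF pattern mentioned above; the function, column names, and business rule are hypothetical placeholders rather than code from any specific project:

```python
# Minimal PySpark UDF sketch; names and the business rule are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf_sketch").getOrCreate()

def classify_claim(amount):
    # Hypothetical business rule: bucket claim amounts into risk tiers.
    if amount is None:
        return "unknown"
    return "high" if amount > 10000 else "standard"

classify_claim_udf = udf(classify_claim, StringType())

claims = spark.createDataFrame(
    [(1, 2500.0), (2, 15000.0), (3, None)],
    ["claim_id", "claim_amount"],
)

# Apply the UDF to derive a business-logic column on the DataFrame.
claims.withColumn("risk_tier", classify_claim_udf("claim_amount")).show()
```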
Technical Skills:
Programming Languages : Python, SQL, Scala, PL/SQL
Big Data Technologies : Apache Spark, Hadoop, Hive, Kafka
Data Warehousing : Snowflake, Redshift, BigQuery
Cloud Platforms : AWS (S3, Lambda, Glue, Redshift, EMR, Kinesis, CloudWatch, SQS, SNS), Azure (Data Factory, Synapse, Azure Data Lake Storage), GCP (BigQuery, Dataflow)
ETL Tools : Apache Airflow, Informatica, Talend, DBT
Databases : MySQL, PostgreSQL, MongoDB, Cassandra
Data Modeling : Star Schema, Snowflake Schema, Dimensional Modeling, Databricks
DevOps/CI-CD : Docker, Kubernetes, Jenkins, GitHub
Visualization Tools : Power BI, Tableau, Qlik Sense, Grafana, Kibana
Scripting/Automation : Shell scripting, Python scripting for automation, Terraform
PROFESSIONAL EXPERIENCE:
• Designed and implemented parameterized pipelines using AWS Glue, S3, and
Spark to migrate 300+ tables from on-premises systems to a Lakehouse architecture
on Amazon Redshift, improving scalability and reducing the need for multiple
pipelines.
• Developed complex queries in Teradata SQL, optimizing data retrieval and reporting
efficiency.
• Integrated Talend with Snowflake to ensure seamless data migration and real-time
analytics.
• Used Python’s logging and exception handling to improve data pipeline robustness.
• Integrated BigQuery with Apache Spark for hybrid data processing.
• Used Node.js with MongoDB and PostgreSQL to build high-performance data storage
solutions.
• Built and optimized ETL pipelines with AWS Glue and PySpark to process datasets of
up to 10TB, enhancing data quality and reducing pipeline runtime by 30%.
• Developed real-time data processing pipelines using Apache Spark Streaming and
Kafka, handling millions of events daily and storing results in PostgreSQL for
downstream analysis.
• Designed efficient DataFrame-based transformations to enhance data processing speed.
• Created configuration-driven ETL pipelines with Spark, Hive, HDFS, SQL Server,
Oozie, and AWS S3 for efficient data ingestion and transformation into mirror
databases.
• Scheduled and monitored ETL workflows using PySpark on Databricks and Apache
Airflow, ensuring smooth and reliable processing of batch and streaming data.
• Worked with Teradata QueryGrid and foreign servers to integrate data from multiple
sources.
• Automated log parsing and monitoring using CloudWatch Insights, improving system
observability.
• Optimized Talend job execution by improving memory usage, parallel processing, and
data partitioning.
• Utilized Spark for data transformations, including cleaning, aggregation,
normalization, and loading into Snowflake and Redshift with Spark-Snowflake
connectors.
• Designed data models using dimensional modeling and Snowflake schema to meet
reporting and analytics requirements.
• Implemented advanced Spark optimization techniques, such as salting, broadcast joins, and caching/persistence, to enhance query performance and reduce job execution time (see the sketches at the end of this section).
• Optimized SQL queries using Starburst/Trino to enhance data retrieval performance
across distributed datasets.
• Designed ETL architectures to transfer data from OLTP to OLAP systems and
integrated attribution data into the Hadoop ecosystem for analytical processing.
• Built cloud platform infrastructure on AWS using Terraform for infrastructure as
code and Kubernetes for container orchestration to ensure scalability and high
availability.
• Performed A/B testing and statistical hypothesis testing to validate business
assumptions.
• Developed SQL-based ETL pipelines using BigQuery to process massive datasets.
• Automated AWS resource management with Python scripts and the Boto3 SDK, reducing manual intervention by 40% and improving accuracy in provisioning and data transfer (see the sketches at the end of this section).
• Automated Talend workflows using scheduling tools to streamline ETL operations.
• Implemented Snowflake micro-partitioning, clustering, and materialized views to
optimize query execution.
• Developed multi-threaded and parallel processing applications in Python to optimize
data workloads.
• Established CI/CD pipelines with Maven, GitHub, and AWS CodePipeline,
reducing deployment times from over 5 hours to under 10 minutes and ensuring
consistent releases across environments.
• Developed advanced Teradata SQL queries for complex data extraction, aggregation,
and analysis.
• Implemented window functions, aggregations, and complex joins using Spark
DataFrames.
• Conducted extensive query performance tuning in AWS Redshift, Hive, and
Snowflake, utilizing partitioning, bucketing, file compression, and cost-based
optimization to reduce query execution time by 50%.
• Enhanced Redshift cluster performance by applying distribution keys and sort keys
effectively, optimizing workloads for analytics and reporting.
• Debugged and resolved issues in failed Spark jobs, ensuring smooth workflows and
maximizing data processing efficiency.
• Designed and developed interactive dashboards using Tableau and Qlik Sense,
integrating data from AWS Redshift, S3, and Snowflake for real-time KPI monitoring
and business metrics analysis.
• Developed CloudWatch dashboards for real-time visualization of infrastructure
performance.
• Developed data quality checks and validation rules within Talend to ensure accurate
and consistent data processing.
• Integrated Snowflake with DBT for scalable data transformations, ensuring
compliance with best practices for performance and cost optimization.
• Conducted cross-team and internal training sessions on multiple deliverables produced by the data engineering team.
• Leveraged AWS EMR to process and move large volumes of data into S3,
DynamoDB, and other AWS data stores for analytics and storage.
• Designed custom Python libraries and modules to enhance reusability and
maintainability.
• Developed ETL processes in Node.js, enabling seamless data transfer between
distributed systems.
• Implemented cost-efficient data processing strategies by leveraging BigQuery’s
partitioning and clustering features.
• Created reusable Shell and Python scripts to automate recurring tasks, such as data
validation, improving error handling and reducing manual effort.
• Established incident response frameworks, including runbooks and automated root-
cause analysis, enhancing system reliability and minimizing downtime.
• Participated in code reviews with peers to ensure proper test coverage and consistent
code standards.
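For illustration, a minimal sketch of the broadcast-join and caching optimizations referenced in this section; the tables and columns are hypothetical:

```python
# Broadcast-join and caching sketch in PySpark; tables and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("spark_optimization_sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "US", 120.0), (2, "CA", 80.0), (3, "US", 45.5)],
    ["order_id", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("CA", "Canada")],
    ["country_code", "country_name"],
)

# Broadcast the small dimension table so the join avoids shuffling the large side.
enriched = orders.join(broadcast(countries), on="country_code", how="left")

# Cache a DataFrame that multiple downstream actions reuse.
enriched.cache()
enriched.groupBy("country_name").sum("amount").show()
print(enriched.count())
```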
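Similarly, a minimal sketch of the Boto3-based automation referenced in this section; the bucket names and prefix are placeholders, and credentials are assumed to come from the standard AWS credential chain:

```python
# Boto3 S3 automation sketch; bucket names and prefix are placeholders, and
# credentials are resolved from the standard AWS credential chain.
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "example-source-bucket"  # placeholder
TARGET_BUCKET = "example-target-bucket"  # placeholder

def copy_prefix(prefix):
    """Copy every object under `prefix` from the source to the target bucket."""
    copied = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket=TARGET_BUCKET,
                Key=obj["Key"],
                CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            )
            copied += 1
    return copied

if __name__ == "__main__":
    print(f"Copied {copy_prefix('raw/2024/')} objects")
```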
Company: Kaiser Permanente, California Oct 2021-May 2022
Role: Data Engineer
• Installed Kafka on a Hadoop cluster and configured Java producers and consumers to establish connections from source systems to HDFS.
• Loaded real-time data from various data sources into HDFS using Kafka.
• Worked with NoSQL databases like HBase to create tables and load large datasets
from various sources.
• Built RESTful APIs and microservices in Python to facilitate data exchange across
platforms.
• Worked on reading multiple data formats on HDFS using Python.
• Implemented Spark using Python (PySpark) and SparkSQL for faster testing and
processing of data.
• Implemented Talend error-handling and logging mechanisms to improve data pipeline
reliability and maintainability.
• Converted Hive/SQL queries into Spark transformations using Spark SQL, DataFrames, and Python (see the sketch at the end of this role).
• Designed and built scalable data pipelines to process petabyte-scale datasets.
• Monitored and optimized Teradata system resources, reducing CPU and I/O usage by
optimizing query structures.
• Converted MapReduce programs into Spark transformations using Spark RDDs in Python.
• Designed ETL workflows to load data from diverse sources (e.g., flat files, APIs,
databases) into Snowflake while ensuring efficient data transformation and
processing.
• Automated deployment of NiFi workflows across staging and production
environments using NiFi Registry and CI/CD pipelines.
• Diagnosed and resolved bottlenecks in NiFi dataflows by analyzing data provenance
and processor performance metrics.
• Successfully migrated 20TB of data from on-premises Oracle databases to AWS
Redshift, developing a comprehensive migration strategy.
• Configured AWS CloudWatch metrics and alerts to monitor system performance and
identify anomalies.
• Developed RESTful APIs in Node.js to enable real-time data integration across
systems.
• Designed and optimized BigQuery tables for efficient data storage and retrieval.
• Used Airflow and other workflow schedulers to manage Spark jobs with control flows.
• Integrated Node.js microservices with cloud-based data warehouses (Snowflake,
Teradata, AWS Redshift).
• Led the migration from Perforce to GitHub, implementing branching, merging, and
tagging strategies to ensure minimal workflow disruptions.
• Automated monitoring and alerting systems using Grafana and Prometheus, ensuring
real-time tracking of infrastructure health and reducing incident response time by
40%.
• Configured Talend Data Integration components to connect with Snowflake, SQL
databases, APIs, and cloud storage.
• Automated cloud platform provisioning workflows using Python and Terraform,
enabling rapid environment setup with minimal manual intervention.
• Leveraged Delta Lake for ACID transactions, ensuring data consistency and
reliability in ETL workflows.
• Integrated Starburst with Databricks and AWS S3 to enable seamless querying
across multiple data sources.
• Engineered secure cloud environments by implementing IAM policies, encryption
standards, and network firewalls, ensuring compliance with GDPR and SOC2
regulations.
• Ensured end-to-end data quality and governance by implementing robust validation
frameworks in both Informatica and DBT.
• Implemented Python-based automation scripts to streamline data pipeline deployments
and monitoring.
• Developed and optimized Teradata stored procedures and macros to automate
repetitive tasks and improve query efficiency.
• Created report visualizations using Power BI.
• Built parameterized Matillion jobs to handle dynamic data loads, enabling job
reusability across environments.
• Supported DevOps teams with transitions for increased security and data privacy
policies.
• Handled schema inference and explicit schema definitions to ensure data consistency.
• Set up Atlas, Kafka, and NiFi clusters from scratch with high availability.
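For illustration, a minimal sketch of rewriting a Hive/SQL aggregation as equivalent Spark SQL and DataFrame transformations, as referenced in this role; the table and columns are illustrative:

```python
# Sketch: the same aggregation expressed via Spark SQL and via the DataFrame API.
# Table and column names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive_to_spark_sketch").getOrCreate()

events = spark.createDataFrame(
    [("login", "2024-01-01"), ("login", "2024-01-01"), ("logout", "2024-01-02")],
    ["event_type", "event_date"],
)
events.createOrReplaceTempView("events")

# Hive-style query executed through Spark SQL.
sql_result = spark.sql(
    "SELECT event_date, event_type, COUNT(*) AS cnt "
    "FROM events GROUP BY event_date, event_type"
)

# Equivalent DataFrame API transformation.
df_result = events.groupBy("event_date", "event_type").agg(F.count(F.lit(1)).alias("cnt"))

sql_result.show()
df_result.show()
```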
Company: US Bank, Peoria, IL June 2020 - Sep 2021
Role: Data Engineer
• Designed and implemented data pipelines by launching Azure Databricks clusters
integrated with Azure Data Factory, reading datasets from various data sources,
performing transformations and analytics, and storing results in target applications
such as Azure SQL Database and Azure Synapse Analytics.
• Worked with the Spark ecosystem, using Spark SQL and Scala to process diverse data formats, including text files, CSV files, and JSON, applying various transformations.
• Implemented performance tuning techniques such as query rewrite, join optimization,
and parallel execution in Teradata.
• Designed and developed ETL pipelines using Talend to extract, transform, and load
data from multiple sources.
• Migrated on-premises databases, including Oracle and PostgreSQL, to Azure SQL
Database and Azure Database for PostgreSQL, using Azure Database Migration
Service (DMS) for scalable and cost-effective solutions.
• Designed Azure infrastructure using Azure Resource Manager (ARM) templates and
Terraform, provisioning services such as Azure Virtual Machines (VMs), Azure
Blob Storage, Azure SQL Database, and Cosmos DB for data storage and
processing.
• Built and managed complex data workflows using orchestration tools like Azure
Data Factory, Apache Airflow, and Databricks, ensuring smooth pipeline
executions.
• Developed data engineering solutions using Python for data ingestion, transformation,
and validation.
• Designed federated queries in Starburst/Trino to improve analytical performance on
large-scale datasets.
• Developed an import pipeline that copies raw data from on-prem servers to Azure
Blob storage, establishing a bronze table for initial data storage.
• Created a process pipeline using Databricks notebooks to refine data into silver and gold tables, enhancing data quality (see the sketch at the end of this role).
• Integrated Azure Database for auditing pipeline runs and managing dependencies,
improving monitoring and reliability.
• Implemented data validation and cleansing techniques to ensure data integrity and
consistency, including removing duplicates, fixing incorrect formats, and filling
missing values.
• Developed and optimized ETL pipelines in Databricks using PySpark and
SparkSQL for scalable data transformation.
• Ensured scalability of data pipelines to handle increasing volumes of data by
implementing horizontal and vertical scaling solutions.
• Utilized Teradata utilities (FastLoad, MultiLoad, TPT, BTEQ) for high-performance
data ingestion and extraction.
• Built and implemented custom monitoring systems for data quality and pipeline
performance, providing real-time insights into operational status.
• Optimized cloud resource usage and storage, implemented data lifecycle
management, and ensured cost efficiency using Azure Cost Management.
• Ensured proper encryption and security measures for sensitive data, including
encryption at rest and in transit, and implemented access control mechanisms such as
RBAC (Role-Based Access Control).
• Created comprehensive documentation for the design, architecture, and maintenance
of data pipelines, ensuring knowledge transfer and best practices.
• Developed and optimized ETL workflows using Informatica PowerCenter, ensuring
high-performance data transformations.
• Built multiple Power BI dashboards to provide actionable insights, enabling real-time
monitoring and informed decision-making for stakeholders.
• Implemented best practices for version control, including using Git for source code
management and managing code deployments using CI/CD pipelines.
• Implemented solutions to ensure the successful migration of large-scale data,
handling complex datasets, ensuring data integrity, and optimizing storage.
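For illustration, a minimal sketch of the bronze-to-silver refinement and cleansing described in this role, assuming a Databricks notebook with Delta Lake; the paths and columns are placeholders, and `spark` is the session Databricks provides:

```python
# Bronze-to-silver refinement sketch for a Databricks/Delta Lake notebook.
# Paths and columns are placeholders; `spark` is the session Databricks provides.
from pyspark.sql import functions as F

bronze_path = "/mnt/datalake/bronze/transactions"  # placeholder mount path
silver_path = "/mnt/datalake/silver/transactions"  # placeholder mount path

bronze_df = spark.read.format("delta").load(bronze_path)

silver_df = (
    bronze_df
    .dropDuplicates(["transaction_id"])                    # remove duplicate records
    .withColumn("amount", F.col("amount").cast("double"))  # fix incorrect formats
    .fillna({"currency": "USD"})                           # fill missing values
    .filter(F.col("transaction_id").isNotNull())           # drop rows failing validation
)

# Write the refined data as a Delta table for downstream (gold) aggregation.
silver_df.write.format("delta").mode("overwrite").save(silver_path)
```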
Company: Tech Mahindra, Bangalore, India Aug 2015 - Nov 2019
Role: Data Engineer
• Designed and implemented scalable ETL pipelines using Python, SQL, and tools like AWS Glue and Apache Airflow, integrating data from SQL Server, flat files, and APIs (see the sketch at the end of this role).
• Migrated on-premises infrastructure to AWS (EC2, S3, RDS), achieving a 40% cost
reduction, and managed cloud-based data lakes and warehouses such as Snowflake
and Redshift.
• Designed and maintained data lakes on AWS S3, integrating data from diverse sources
and optimizing retrieval speeds using Athena and Hive partitions.
• Connected Power Apps with data pipelines (Azure Data Factory, Synapse Analytics) for real-time updates or manual interventions.
• Developed detailed business and technical documentation, including schemas, data
mapping, and validation rules, to standardize processes across teams.
• Integrated test case development into agile processes, ensuring all technical solutions
aligned with pre-defined business use cases.
• Improved query performance by implementing advanced techniques such as CTEs,
temp tables, dynamic SQL, indexing, and partitioning, resulting in significant
reductions in execution times.
• Developed complex Teradata SQL queries to extract, transform, and analyze large
datasets for business reporting.
• Engineered ETL processes to extract data from various sources, transform it according to business requirements, and load it into Azure services such as ADF and Synapse.
• Developed and maintained star/snowflake schemas and interactive dashboards in
Power BI, delivering actionable insights and enhancing decision-making.
• Automated workflows using dbt and Power Automate, streamlining operations and
improving operational efficiency.
• Conducted data cleaning, de-duplication, and validation processes, ensuring 99%
data accuracy for critical business operations.
• Collaborated in Scrum environments and implemented CI/CD pipelines with Git and
Jenkins to ensure efficient deployment and development workflows.
• Designed and optimized databases for performance and scalability across platforms like SQL Server, Oracle, and PostgreSQL.
• Managed cloud-based infrastructure (AWS S3, Azure Data Lake) for handling large-
scale data workflows, optimizing resource utilization and reducing storage costs by
30%.
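For illustration, a minimal sketch of an Airflow DAG of the kind used to orchestrate these ETL steps; the DAG id, schedule, and task callables are hypothetical:

```python
# Minimal Airflow DAG sketch; DAG id, schedule, and callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder: pull data from SQL Server, flat files, or APIs.
    print("extracting source data")

def transform_and_load():
    # Placeholder: apply business rules and load into Snowflake/Redshift.
    print("transforming and loading data")

with DAG(
    dag_id="daily_sales_etl",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="transform_and_load", python_callable=transform_and_load)

    extract_task >> load_task
```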
EDUCATION:
Master's in Computer Science, Western Illinois University
Undergraduate degree in Electrical and Electronics Engineering, Vikrama Simhapuri University
Certifications:
• AWS Solutions Architect Associate (SAA)
• Databricks Lakehouse Fundamentals