Sai Vodnala DE
Email: Rahul@sarsai.us
Phone: (972)-945-5081
Data Engineer
PROFESSIONAL SUMMARY:
Over 9 years of experience in building highly scalable data analytics applications using the latest Big Data
technologies.
Expertise in developing production-ready Spark applications utilizing Spark-Core, Data Frames, Spark-SQL,
Spark-ML, and Spark-Streaming APIs.
Hands-on experience with cloud platforms such as AWS, GCP, and Azure and their services, including EC2,
S3, Athena, RDS, EMR, VPC, IAM, Elastic Load Balancing, Lambda, Redshift, ElastiCache, Auto Scaling, CloudFront,
CloudWatch, Data Pipeline, DMS, Aurora, BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow,
Data Fusion, Pub/Sub, Cloud Shell, and Composer (Airflow as a service).
Experienced in building real-time data workflows using Kafka, Spark streaming, and HBase.
Good hands-on experience working with various distributions of Hadoop like Cloudera (CDH),
Hortonworks (HDP), and Amazon EMR.
Good understanding of Distributed Systems architecture and design principles behind Parallel Computing.
Expertise in designing, implementing, and managing server integration solutions to ensure seamless data
flow between various systems and applications.
Demonstrated ability to utilize Databricks for advanced analytics, including real-time data processing,
machine learning model training, and data visualization, driving actionable insights and data-driven
decision-making.
Experienced in building ETL pipelines using a combination of Python and Snowflake's SnowSQL, and in writing SQL
queries against Snowflake.
Hands-on experience with leading cloud platforms like AWS, GCP, and Azure, utilizing Databricks for
complex data processing tasks, real-time analytics, and machine learning applications.
Proficient in designing, deploying, and managing cloud infrastructure using Terraform.
Designed, developed, and maintained complex data integration solutions using the Dell Boomi
Integration Platform, ensuring seamless and efficient data flow across diverse systems and contributing to
enhanced data accuracy and business decision-making.
Good knowledge of distributed databases like Snowflake, HBase, Cassandra, and MongoDB.
Good understanding of performance tuning, partitioning, and building scalable data lakes.
Solid experience in working with various data formats like CSV, text, sequential, Avro, Parquet, ORC, and
JSON.
Experienced in performing ETL on structured and semi-structured data using Pig Latin scripts.
Strong experience writing complex MapReduce jobs including development of custom Input Formats and
custom Record Readers.
Experience in connecting various Hadoop sources like Hive, Impala, Phoenix to Tableau for reporting.
Good understanding of the core concepts of programming such as algorithms, data structures, and
collections.
Development experience with RDBMS, including writing SQL queries, views, stored procedures, triggers,
etc.
Strong understanding of Software Development Lifecycle (SDLC) and various methodologies (Waterfall,
Agile).
TECHNICAL SKILLS:
Big Data Technologies: Hadoop, HDFS, MapReduce, Databricks, Spark, HBase, Oozie, Hive, Sqoop, Pig, Flume,
Kafka, Flink, Snowflake, NiFi, Cassandra, MongoDB, BigQuery, GCS Bucket, Cloud Functions, Cloud Dataflow,
Data Fusion, Pub/Sub, Cloud Shell, Composer (Airflow as a service)
Programming Languages: Python, Java, Scala, SQL, R, PowerShell, Perl, Shell
Data Warehousing: Informatica PowerCenter, Informatica Cloud, Informatica IDQ, Talend Open Studio &
Integration Suite, SQL Server Integration Services (SSIS), AWS Redshift
Databases: Oracle, SQL Server, MySQL, DB2, PostgreSQL, Amazon Redshift, Aurora
BI & Data Visualization Tools: Tableau, Business Objects, Looker, Power BI, QuickSight, Domo, QlikView,
Apache Superset
Query Languages: SQL, PL/SQL, T-SQL, HiveQL, Impala SQL
PROFESSIONAL EXPERIENCE:
Responsibilities:
Implemented installation and configuration of a multi-node cluster in the cloud on Amazon Web Services
(AWS) EC2. Handled AWS management tools such as CloudWatch and CloudTrail.
Stored the log files in AWS S3. Used versioning in S3 buckets where the highly sensitive information is
stored.
Integrated AWS DynamoDB with AWS Lambda to store item values and back up the DynamoDB
streams. Automated regular AWS tasks such as snapshot creation using Python scripts.
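A minimal sketch of this kind of snapshot automation, assuming boto3 and an illustrative tag filter and region (not the actual account setup):

```python
# Hypothetical sketch of the snapshot-automation described above; the region,
# tag filter, and description format are illustrative assumptions.
from datetime import datetime

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")


def snapshot_tagged_volumes(tag_key="Backup", tag_value="daily"):
    """Create a snapshot for every EBS volume carrying the given tag."""
    volumes = ec2.describe_volumes(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
    )["Volumes"]
    for vol in volumes:
        ec2.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description=f"Automated snapshot {datetime.utcnow():%Y-%m-%d}",
        )


if __name__ == "__main__":
    snapshot_tagged_volumes()
```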
Designed data warehouses on platforms such as AWS Redshift, Azure SQL Data Warehouse, and other high-
performance platforms.
Installed and configured Apache Airflow to work with AWS S3 buckets and created workflows (DAGs) to run in Airflow.
Prepared scripts to automate the ingestion process using PySpark and Scala as needed, pulling from various
sources such as APIs, AWS S3, Teradata, and Redshift.
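A minimal PySpark sketch of such an ingestion script, with hypothetical bucket paths and a simple CSV-to-Parquet flow standing in for the real sources:

```python
# Illustrative PySpark ingestion sketch; bucket names and the CSV layout are
# assumptions, not the actual pipeline configuration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3_ingestion").getOrCreate()

# Read raw CSV files landed in S3, infer a basic schema, and deduplicate.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3a://example-raw-bucket/events/"))

# Persist the curated output back to S3 as Parquet for downstream consumers.
(raw.dropDuplicates()
    .write
    .mode("overwrite")
    .parquet("s3a://example-curated-bucket/events/"))
```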
Created multiple scripts to automate ETL/ELT processes from multiple sources using PySpark.
Developed PySpark scripts utilizing Spark SQL and DataFrames for data analysis, storing results back into S3.
Developed PySpark code to load data from the Set layer to the Hub layer, implementing the business logic. Developed
Spark SQL code to implement business logic, with Python as the programming language.
Designed, developed, and delivered jobs and transformations over the data to enrich it and
progressively elevate it for consumption in the Pub layer of the data lake.
Integrated Delta Lake on Databricks to manage data with ACID transactions, enabling efficient handling of
batch and streaming data, while ensuring data accuracy and compliance.
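A brief sketch of a Delta Lake upsert of this kind on Databricks; the table path and the order_id merge key are illustrative assumptions:

```python
# Hedged sketch of a Delta Lake MERGE (upsert) on Databricks; paths and join
# keys are hypothetical stand-ins for the actual tables.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incoming batch (or micro-batch) of changed records.
updates = spark.read.parquet("s3a://example-bucket/incoming/orders/")

# Existing Delta table managed with ACID transactions.
target = DeltaTable.forPath(spark, "s3a://example-bucket/delta/orders/")

# MERGE gives transactional upsert semantics over batch and streaming inputs.
(target.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```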
Designed and automated ETL pipelines on Databricks using PySpark and Scala, streamlining data ingestion
and transformation from multiple sources including AWS S3, Teradata, and Redshift, improving overall data
pipeline performance.
Developed and optimized ETL pipelines using Talend Open Studio to facilitate seamless data migration and
transformation between various systems on AWS.
Automated the provisioning and management of AWS infrastructure using Terraform, enabling consistent
and repeatable cloud deployments.
Developed and maintained Terraform modules for creating VPCs, EC2 instances, S3 buckets, and other AWS
services, streamlining infrastructure as code (IaC) processes.
Worked on sequence files, map-side joins, bucketing, and partitioning for Hive performance enhancement and
storage improvement.
Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs
with ingested data.
Used Spark Streaming to divide streaming data into micro-batches as input to the Spark engine for batch
processing.
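A minimal DStream sketch of this micro-batching pattern; the socket source and 10-second batch interval are stand-ins for the actual Kafka feed:

```python
# Illustrative Spark Streaming (DStream) sketch: incoming data is divided into
# fixed-interval micro-batches and processed by the Spark engine. The socket
# source is an assumption used only to keep the example self-contained.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="microbatch_demo")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's result

ssc.start()
ssc.awaitTermination()
```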
Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the
Spark engine and Spark SQL for data analysis, providing results to the data scientists for further analysis.
Developed various UDFs in Map-Reduce and Python for Pig and Hive.
Handled data integrity checks using Hive queries, Hadoop, and Spark.
Worked on performing transformations & actions on RDDs and Spark Streaming data with Scala.
Implemented the Machine learning algorithms using Spark with Python.
Leveraged Qlik Replicate to set up real-time data replication between on-premises databases and cloud-
based data warehouses, ensuring high data availability and consistency.
Profiled structured, unstructured, and semi-structured data across various sources to identify patterns in the
data and implemented data quality metrics using the necessary queries or Python scripts, depending on the source.
Designed and implemented Scala programs using Spark DataFrames and RDDs for transformations and
actions on input data.
Improved Hive query performance by implementing partitioning and clustering and by using optimized file
formats (ORC).
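An illustrative example of the partitioning and ORC optimization described here, issued through Spark SQL with Hive support; table and column names are hypothetical:

```python
# Sketch of creating a partitioned, ORC-backed Hive table and loading it with
# a dynamic-partition insert. The sales tables and columns are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_orc (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
""")

# Allow dynamic partitions, then insert from a hypothetical staging table.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE sales_orc PARTITION (order_date)
    SELECT order_id, amount, order_date FROM sales_staging
""")
```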
Environment: AWS, Kafka, Jenkins, Databricks, Docker, Linux, Red Hat, Git, CloudWatch, Python, Shell Scripting,
WebSphere, Splunk, SoapUI, PowerShell.
Responsibilities:
Worked on building data warehouse structures and creating fact, dimension, and aggregate tables through
dimensional modeling with Star and Snowflake schemas.
Worked on exporting data to Snowflake and analyzing it for visualization and report generation
for the BI team.
Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage
Service (S3) as the storage mechanism.
Used AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs
on AWS.
Configured AWS Multi-Factor Authentication in IAM to implement two-step authentication of user access
using Google Authenticator and AWS Virtual MFA.
Worked on AWS Data Pipeline to configure data loads from S3 into Snowflake.
Used Spark SQL to load data into Hive tables and wrote queries to fetch data from these tables.
Implemented partitioning and bucketing in Hive.
Worked on many ETL jobs creating aggregate tables through transformations and actions such as joins and
summing of amounts.
Designed, built, and managed ELT data pipelines leveraging Airflow, Python, dbt, Stitch Data, and Azure
solutions.
Designed and developed Flink pipelines to consume streaming data from Kafka, applying business logic
to massage, transform, and serialize the raw data.
Implemented the big data solution using Hadoop, Hive, and Informatica to pull and load data into
HDFS.
Utilized Databricks and Spark Streaming to implement real-time data processing pipelines, ensuring that
processed data was readily available for downstream applications and analytics.
Improved performance of Hive queries on Databricks by implementing partitioning, bucketing, and other
optimization techniques, significantly reducing query execution times.
Ensured data quality checks were performed in line with business needs and moved the processed
data to the consumption layer and Druid.
Used JSON schema to define table and column mapping from S3 data to Snowflake.
Responsible for building scalable distributed data solutions using a Snowflake cluster environment with Amazon
EMR.
Worked on migrating data from Teradata to AWS using Python and BI tools like Alteryx.
Processed raw log files from set-top boxes using Java MapReduce code and shell scripts and stored
them as text files in HDFS.
Ingested data from legacy and upstream systems into HDFS using Apache Sqoop, Flume, Java MapReduce
programs, Hive queries, and Pig scripts.
Created external tables in Hive and Athena for data stored in S3 buckets.
Performed performance tuning of Spark applications, setting the correct level of parallelism and tuning
memory usage.
Designed ETL data flows by creating mappings/workflows to extract data from SQL Server, and performed
data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.
Used Python in data migration from Teradata to Snowflake during history loads and quality checks.
Created data models and used tools like Qlik to produce visualizations and reports.
Developed all kinds of graphical MIS reports, including various charts and dashboards.
Developed complex spark applications for performing various denormalization of the datasets and creating
a unified data analytics layer for downstream teams.
Handled large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, and
effective, efficient joins.
Created S3 buckets and managed their policies, and utilized S3 and Glacier for storage and
backup on AWS.
Used Kafka capabilities such as distribution, partitioning, and the replicated commit log service for messaging
systems, maintaining data feeds.
Implemented continuous integration and deployment using CI/CD tools like Jenkins, GIT, Maven.
Involved in loading data from REST endpoints into Kafka producers and transferring the data to Kafka brokers.
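A small sketch of this REST-to-Kafka flow, assuming the kafka-python client and an illustrative endpoint, topic, and broker list:

```python
# Hypothetical sketch of pulling a REST endpoint and publishing to Kafka; the
# URL, topic name, and broker address are illustrative assumptions.
import json

import requests
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Fetch a page of records from the (assumed) REST endpoint.
response = requests.get("https://api.example.com/events", timeout=30)
response.raise_for_status()

# Publish each record to the Kafka topic consumed by downstream brokers.
for record in response.json():
    producer.send("events-topic", value=record)

producer.flush()
```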
Used the BashOperator, S3 operators, PythonOperator, etc., in Airflow to run Python applications.
Environment: AWS S3, EMR, Athena, Databricks, Druid, IAM, Teradata, PySpark, Oracle, Airflow, Flink, Snowflake,
Hadoop.
Responsibilities:
Involved in migrating existing traditional ETL jobs to Spark and Hive Jobs on new cloud data lake.
Developed UDFs for Hive and Pig to support extra functionality provided by Teradata.
Involved in creating Hive scripts for performing ad hoc data analysis required by the business teams.
Wrote live real-time processing and core jobs using Spark Streaming with Kafka as the data
pipeline system.
Used SQL Server Integration Services (SSIS) for extracting, transforming, and loading data into target
systems from multiple sources.
Analyzed existing Azure Synapse database, tables and other objects to prepare to migrate to GCP.
Worked extensively with Data migration, Data cleansing, Data profiling, and ETL Processes features for data
warehouses.
Created a data ingestion framework in Snowflake for both batch and real-time data from different file
formats using Snowflake stages and Snowpipe.
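A hedged sketch of such a stage-plus-Snowpipe setup issued through the Snowflake Python connector; the stage, pipe, table, and credentials are hypothetical, and the storage integration and notification setup are omitted:

```python
# Illustrative Snowpipe ingestion setup; account, objects, and the S3 URL are
# assumptions, and credentials/storage integration details are left out.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account", user="example_user", password="***",
    warehouse="LOAD_WH", database="RAW", schema="PUBLIC",
)
cur = conn.cursor()

# External stage over the landing bucket (storage integration omitted here).
cur.execute(
    "CREATE STAGE IF NOT EXISTS raw_stage URL='s3://example-bucket/landing/'"
)

# Snowpipe that continuously copies newly arrived JSON files into a raw table.
cur.execute("""
    CREATE PIPE IF NOT EXISTS raw_pipe AUTO_INGEST = TRUE AS
    COPY INTO raw_events
    FROM @raw_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")
```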
Managed the ACLs and IAM policies for the data services in GCS, Dataflow, and BigQuery.
Automated the process of creating and deleting Dataproc clusters using Google Cloud Functions (GCF).
Created a Python-based framework to incrementally push data from GCS to BigQuery.
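A minimal sketch of this kind of GCS-to-BigQuery load using the google-cloud-bigquery client; the bucket path, dataset, and partition filter are assumptions:

```python
# Hedged sketch of an incremental GCS-to-BigQuery load; the URI pattern,
# dataset/table names, and the "only load the newest partition" logic are
# illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Load only the files that landed since the last run (path is assumed).
uri = "gs://example-bucket/events/dt=2023-01-01/*.parquet"
load_job = client.load_table_from_uri(
    uri, "example_dataset.events", job_config=job_config
)
load_job.result()  # block until the load job completes
```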
Developed a Scala archetype that abstracts ETL functions and tracks lineage.
Reduced compute cost by configuring the autoscaling policy for Dataproc clusters to ensure optimal
resource usage.
Used Airflow for orchestration and scheduling of the ingestion scripts.
Developed Python code for task definitions, dependencies, SLA watchers, and time sensors for each job, supporting
workflow management and automation with Airflow.
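An illustrative Airflow 2 DAG showing task dependencies, a per-task SLA, and a time-based sensor in this spirit; the task names, callable, and schedule are hypothetical:

```python
# Sketch of an Airflow DAG with dependencies, an SLA, and a time sensor; the
# schedule, SLA window, and task bodies are assumptions for illustration.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.time_delta import TimeDeltaSensor


def run_ingestion():
    print("ingesting...")  # placeholder for the real ingestion script


with DAG(
    dag_id="ingestion_workflow",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args={"sla": timedelta(hours=2)},  # SLA miss triggers a callback/alert
    catchup=False,
) as dag:
    wait = TimeDeltaSensor(task_id="wait_one_hour", delta=timedelta(hours=1))
    ingest = PythonOperator(task_id="ingest", python_callable=run_ingestion)
    notify = BashOperator(task_id="notify", bash_command="echo 'ingestion done'")

    # Dependencies: sensor gates ingestion, which gates notification.
    wait >> ingest >> notify
```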
Built ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL.
Created and designed Informatica mappings to load data from source systems.
Worked with the Source Analyzer, Warehouse Designer, Mapping Designer, Mapplet Designer, and
Transformation Developer tools in Informatica PowerCenter.
Used Alteryx for ETL, data preparation for EDA, and spatial and
predictive analytics.
Involved in automating the project's build activity using Bash scripts.
Implemented Spark using Scala and Spark SQL for faster testing and processing of data; responsible for
managing data from different sources.
Worked on data analysis and reporting in Tableau on customer usage metrics.
Presented this analysis to leadership in support of product growth and to motivate a team of engineers
and product managers.
Developed Job Scheduler scripts for data migration using Bash Shell scripting.
Extensively worked on Hive, Pig and Sqoop for sourcing and transformations.
Developed Hive Scripts for ETL workflow.
Environment: Hortonworks, GCP, Bash, HDFS, Hive, Scala, Airflow, Teradata, Oracle, Sqoop, Snowflake, SQL, Oozie.
Responsibilities:
Developed a robust ETL framework in SSIS to efficiently process and monitor the import of flat files for
critical clients, ensuring smooth data integration.
Managed and enhanced SQL scripts and intricate queries for data analysis and extraction, crafting Stored
Procedures to generate a diverse range of reports, including Drill-through, Parameterized, Tabular, and
Matrix reports, using SSRS.
Proficiently executed version control tasks such as Branching, Merging, Tagging, and Release Activities
within GIT, facilitating seamless code management and collaboration.
Designed, constructed, and deployed Business Intelligence (BI) solutions using Power BI, enabling data-
driven insights and visualizations.
Automated an in-house metadata exploration project through the creation of Shell Script and SQL scripts,
streamlining data exploration and resource management.
Leveraged scripting languages such as Shell and Python to develop custom solutions, including Python
scripts for data transfer from MongoDB to SQL Database.
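A minimal sketch of such a MongoDB-to-SQL transfer using pymongo, pandas, and SQLAlchemy; the connection strings, collection, and table names are illustrative:

```python
# Hypothetical MongoDB-to-SQL transfer; the Mongo URI, SQL DSN, collection,
# and target table are assumptions standing in for the real environments.
import pandas as pd
from pymongo import MongoClient
from sqlalchemy import create_engine

# Pull documents from the (assumed) source collection, dropping the _id field.
mongo = MongoClient("mongodb://localhost:27017/")
docs = list(mongo["appdb"]["orders"].find({}, {"_id": 0}))

# Append the records to a SQL table via a DSN-based SQLAlchemy engine.
engine = create_engine("mssql+pyodbc://user:pass@my_dsn")  # assumed ODBC DSN
pd.DataFrame(docs).to_sql("orders", engine, if_exists="append", index=False)
```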
Demonstrated expertise in crafting Calculated Columns and Measures in Power BI and Excel, tailoring them
to specific project requirements using DAX queries.
Played a pivotal role in the installation, configuration, and development of SSIS packages, optimizing data
processing workflows.
Actively enhanced database performance by implementing various clustered and non-clustered indexes, as
well as index views, and employed effective data aggregation, sorting, and table joining strategies.
Environment: ETL Framework, SQL Scripts, SSRS, GIT, Power BI, Shell Scripting, SQL DB.