
SAI KRUTHIK P [Data Engineer/Analyst]

Kruthikreddy3010@gmail.com

https://www.linkedin.com/in/sai-kruthik-p93/

PH: (704)-879-1877

Professional Summary:

● 10+ years of experience as a Senior Data Engineer/Data Engineer and Hadoop Developer, including designing, developing, and implementing data models for enterprise-level applications and systems.
● Experienced in creating Vizboards in Platfora for real-time data-visualization dashboards on Hadoop.
● Collected log data from various sources and integrated it into HDFS using Flume.
● Creative skills in developing elegant solutions to challenges related to pipeline engineering.
● Strong experience in core Java, Scala, SQL, PL/SQL, and RESTful web services.
● Strong experience in data migration from RDBMS to the Snowflake cloud data warehouse.
● Extensive knowledge of various reporting objects such as Facts, Attributes, Hierarchies, Transformations, filters, prompts, calculated fields, Sets, Groups, and Parameters in Tableau.
● Experience in working with Flume and NiFi for loading log files into Hadoop.
● Experienced in running queries using Impala and in using BI tools to run ad-hoc queries directly on Hadoop.
● Designed APIs to load data from Google Analytics and Google BigQuery.
● Experienced in managing Hadoop clusters and services using Cloudera Manager.
● Experience in developing custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HQL (HiveQL), and used UDFs from the Piggybank UDF repository.
● Expertise in using various Hadoop ecosystem components such as MapReduce, Pig, Hive, Zookeeper, HBase, Sqoop, Oozie, Flume, Drill, and Spark for data storage and analysis.
● Implemented various algorithms for analytics using Cassandra with Spark and Scala.
● Good experience in Oozie Framework and Automating daily import jobs.
● Experienced in troubleshooting errors in Hbase Shell/API, Pig, Hive and Map Reduce.
● Highly experienced in importing and exporting data between HDFS and Relational Database
Management systems using Sqoop.
● Strong understanding of Data Warehousing (DWH) concepts, including Fact tables, Dimension tables, and Star/Snowflake schema modeling.
● Expertise in designing and developing data models and KPIs to support business decision-making.
● Proficient in data gathering, exploration, transformation, analysis, and data mining to create
analytics-ready datasets
● Strong background in working with large datasets, managing high data volumes, and integrating
data from multiple sources
● Skilled in using SQL and Python scripting to translate business needs into data-driven solutions
● Proficiency in Dataiku, along with expertise in SQL and Python
● Good understanding of Azure Big Data technologies such as Azure Data Lake Analytics, Azure Data Lake Store, Azure Data Factory, and Azure Databricks; created a POC for moving data from flat files and SQL Server using U-SQL jobs.
● Good knowledge of querying data from Cassandra for searching, grouping, and sorting.
● Extensive hands-on experience with distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, and effective use of Azure SQL Database, MapReduce, Hive, SQL, and PySpark to solve big data problems.
● Ability to work effectively in cross-functional team environments, with excellent communication and interpersonal skills.
● Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala (a minimal PySpark sketch of this pattern follows this list).
● Expertise with Big Data on scalable AWS cloud services, i.e., EC2, S3, SageMaker, Auto Scaling, Glue, Lambda, CloudWatch, CloudFormation, Athena, DynamoDB, and Redshift.
● Developed custom Kafka producers and consumers for publishing to and subscribing to Kafka topics.
● Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
● Worked with various programming languages using IDEs and tools such as Eclipse, NetBeans, IntelliJ, PuTTY, and Git.
● Good working experience with Spark (Spark Streaming, Spark SQL) using Scala and Kafka. Worked on reading multiple data formats from HDFS using Scala.
● Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as Cassandra and MongoDB.
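
For illustration, a minimal PySpark sketch of the Hive/SQL-query-to-DataFrame-transformation pattern noted above; the database, table, and column names (sales_db.orders, order_date, order_total, region) are hypothetical placeholders, not actual project code.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Requires a configured Hive metastore for spark.table() to resolve Hive tables.
    spark = (SparkSession.builder
             .appName("hive-to-dataframe-example")
             .enableHiveSupport()
             .getOrCreate())

    # Equivalent Hive/SQL query:
    #   SELECT region, SUM(order_total) AS total_sales
    #   FROM sales_db.orders WHERE order_date >= '2024-01-01'
    #   GROUP BY region
    orders = spark.table("sales_db.orders")          # hypothetical Hive table
    total_sales = (orders
                   .filter(F.col("order_date") >= "2024-01-01")
                   .groupBy("region")
                   .agg(F.sum("order_total").alias("total_sales")))

    total_sales.show()
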
Technical Skills:

● Big Data Technologies: HDFS, MapReduce, YARN, Hive, Pig, HBase, Impala, Zookeeper, Sqoop, Oozie, Kafka, DataStax & Apache Cassandra, Drill, Flume, Spark, Solr, and Avro.
● Programming Languages: Scala, Python, SQL, Java, PL/SQL, Linux shell scripts
● RDBMS: Oracle 10g/11g/12c, MySQL, SQL Server, Teradata, MS Access
● NoSQL: HBase, Cassandra, MongoDB
● Cloud Technologies: MS Azure, AWS, Snowflake
● Web/Application servers: Tomcat, LDAP
● Methodologies: Agile, UML, Design Patterns (Core Java and J2EE)
● Tools Used: Eclipse, PuTTY, Cygwin, MS Office
● BI Tools: Platfora, Tableau, PowerBI, Pentaho
Project Experience:
Client: Anheuser-Busch, NYC, NY Aug 2023 – Present
Senior Data Engineer

● Led the development of Python scripts and UDFs using DataFrames/SQL and RDD/MapReduce in
Spark for advanced data aggregation, complex queries, and seamless integration with RDBMS
through Sqoop.
● Established an end-to-end Data Science platform using AWS Sagemaker
to deploy multiple machine learning and deep learning models at scale.
● Performed statistical analysis and built high-quality prediction and classification systems using data mining and machine learning techniques. Improved prediction and classification results using modeling techniques such as SVM, Naive Bayes, Decision Trees, Gradient Boosting (GBM), XGBoost, AdaNet, Random Forest, Classification, Linear/Logistic Regression, K-Means, and K-NN, along with deep learning applications using the TensorFlow and Keras libraries, with the objective of achieving the lowest test error.
● Designed and developed a complex dashboard system to monitor, display, and control various IoT devices, along with a front-end web UI.
● Worked with deep learning frameworks such as MXNet, Caffe2, TensorFlow, Theano, CNTK, and Keras to build deep learning models for object detection.
● Spearheaded the ingestion and processing of XML messages using Kafka and Spark Streaming,
ensuring real-time capture and processing of UI updates.
● Leveraged JSON for real-time data processing tasks, ensuring accurate and timely data updates in data
pipelines.
● Orchestrated the migration of an on-premises application to AWS, leveraging services like EC2 and S3, and expertly managing Hadoop clusters on EMR for optimal performance.
● Architected robust data pipelines, importing data from AWS S3 into Spark RDDs, and implementing
efficient transformations for scalable data processing.
● Pioneered the implementation of real-time processing solutions using Spark Streaming with Kafka,
enabling timely insights and actions on streaming data.
● Innovated solutions for handling and analyzing large-scale clickstream data from Google Analytics
with BigQuery, driving actionable insights for business stakeholders.
● Established and maintained essential DevOps tools and processes, including provisioning scripts,
deployment automation, and environment management on AWS, Rackspace, and Cloud platforms.
● Designed and implemented data ingestion frameworks, enabling seamless extraction,
transformation, and loading of consumer response data into Hive external tables for visualization in
Tableau dashboards.
● Demonstrated expertise in managing file movements between HDFS and AWS S3, ensuring efficient
data transfer and storage across distributed environments.
● Engineered solutions for loading and transforming large volumes of structured and semi-structured
data using Hive, optimizing performance and resource utilization.
● Managed and analyzed vast amounts of log data using Hadoop, Apache Kafka, and Apache Storm,
delivering actionable insights for operational improvements.
● Led the migration of legacy data warehouses (DWH) and databases to Snowflake, ensuring data integrity, performance, and scalability of modern analytics platforms.
● Led the integration efforts with various NoSQL databases like HBase and Cassandra, enriching data
capabilities and supporting diverse use cases.
● Designed and implemented AWS CloudFormation templates, enabling consistent and scalable
infrastructure provisioning for web applications and databases.
● Managed S3 buckets and implemented backup strategies using Glacier, ensuring data durability and
compliance with organizational policies.
● Developed and migrated MapReduce programs into Spark transformations using Spark and Scala,
improving the performance and scalability of data processing workflows.
● Configured and optimized Spark Streaming for real-time data ingestion from Kafka, ensuring high
throughput and low-latency processing.
● Leveraged Spark SQL to connect to Hive and perform distributed processing, optimizing query
performance and resource utilization.
● Orchestrated the export of analyzed data to relational databases using Sqoop, facilitating
visualization and reporting in Tableau for business stakeholders.
● Designed and implemented a composite server for data virtualization needs, enabling secure and
efficient access to restricted data via REST APIs.
● Integrated Cassandra as a distributed, persistent metadata store, providing scalable and high-performance metadata resolution for network entities.
● Led the integration of various cloud technologies (AWS, Azure, Snowflake) to create scalable and efficient data architectures.
● Spearheaded the migration of on-premises applications to cloud platforms, leveraging services like EC2, S3, EMR, and Snowflake.
● Implemented and optimized Spark Streaming with Kafka to enable timely insights and actions on streaming data (a streaming sketch appears at the end of this section).
● Engineered solutions for handling and analyzing large-scale clickstream data from sources like Google Analytics and BigQuery.
● Demonstrated expertise in designing and developing robust data pipelines for real-time data processing and analytics.
● Led the end-to-end migration of 800+ objects (4TB) from SQL Server to Snowflake (an illustrative load sketch follows this list).
● Created roles and access privileges, managed Snowflake Admin activities.

● Retrofitted 500 Talend jobs from SQL Server to Snowflake.

● Developed and optimized views and stored procedures in Snowflake.

● Enabled data sharing between Snowflake accounts and validated data integrity.
● Built dashboards and reports in Looker and Power BI, ensuring data accuracy.

● Consulted on Snowflake Data Platform solutions, focusing on architecture, design, and deployment.
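
To illustrate the SQL Server-to-Snowflake load pattern from the migration bullets above, a minimal sketch using snowflake-connector-python; the connection parameters, stage (@my_stage), local file path, and target table (DIM_CUSTOMER) are hypothetical placeholders, not the actual migration jobs (which were largely implemented in Talend).

    import snowflake.connector

    # Hypothetical connection parameters; real credentials would come from a secrets manager.
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="...",
        warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
    )

    cur = conn.cursor()
    try:
        # Upload a CSV extract exported from SQL Server to an internal stage.
        cur.execute("PUT file:///tmp/dim_customer.csv @my_stage AUTO_COMPRESS=TRUE")
        # Bulk-load the staged file into the target Snowflake table.
        cur.execute("""
            COPY INTO DIM_CUSTOMER
            FROM @my_stage/dim_customer.csv.gz
            FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
        """)
    finally:
        cur.close()
        conn.close()
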

Environment: HDFS, Hive, Snowflake, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.),
Scala, Sqoop, Spark, Tableau, Yarn, Cloudera, SQL, Terraform, BigQuery, Google Analytics, Splunk,
RDBMS, Elasticsearch, Kerberos, Jira, Confluence, Shell/Perl Scripting, Zookeeper, Ranger, Git,
Kafka, CI/CD (Jenkins), Kubernetes.
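
As a follow-up to the Spark Streaming with Kafka work noted in this section, a minimal PySpark Structured Streaming sketch; the broker address, topic name (ui_updates), and checkpoint path are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-streaming-example").getOrCreate()

    # Read a live stream from a Kafka topic.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
              .option("subscribe", "ui_updates")                    # hypothetical topic
              .load())

    # Kafka delivers raw bytes; cast the value to a string before downstream parsing.
    parsed = events.selectExpr("CAST(value AS STRING) AS message", "timestamp")

    # Write micro-batches to the console; a real pipeline would target Hive, S3, or Cassandra.
    query = (parsed.writeStream
             .format("console")
             .option("checkpointLocation", "/tmp/checkpoints/ui_updates")  # hypothetical path
             .start())

    query.awaitTermination()
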

Client: Equifax, St Louis, MO May 2021 – Aug 2023


Lead Data Engineer

● Implemented Agile methodologies and utilized Rally for task and user story management.

● Conducted data cleaning, reshaping, and segmentation using NumPy and pandas in Python.

● Implemented Apache Sentry to ensure fine-grained access control on Hive tables, enhancing data
security.
● Developed and maintained shell scripts for automating data processing and job scheduling tasks.

● Deployed and managed Windows Kubernetes clusters with Azure Container Service (ACS),
leveraging Kubernetes and Docker for streamlined CI/CD workflows.
● Led the migration of key systems to Azure Cloud Services, focusing on Snowflake data
warehouse solutions and optimizing SQL queries
● Designed and constructed robust data pipelines using Azure Data Factory and Azure Databricks,
enabling efficient data movement and transformation across various Azure data services.
● Migrated complex MapReduce programs to Spark processing, enhancing data processing efficiency
and scalability.
● Executed comprehensive data preprocessing, feature engineering, and data cleaning tasks using
Python to ensure data quality and reliability.
● Leveraged Cassandra for data storage and retrieval, implementing various data modeling techniques
to optimize database performance.
● Developed and maintained ETL pipelines using Python and Snowflake, facilitating seamless data
integration and transformation processes.
● Developed Stored procedures to automate ETL processes, ensuring efficient data extraction, transformation,
and loading.
● Implemented Apache Drill on Hadoop for SQL and NoSQL data integration, enhancing data
accessibility and analytics capabilities.
● Engineered Kafka consumers for data validation before ingestion into Hive and Cassandra
databases, ensuring data quality and consistency.
● Utilized Spark and Scala for real-time data processing and analytics, enabling timely insights and
decision-making.
● Deployed composite servers for data virtualization needs, facilitating secure and efficient access to
restricted data via REST APIs.
● Managed and optimized Snowflake environments, ensuring high availability and performance of
data warehouse solutions.
● Created insightful data visualizations using Python, Tableau, and Azure CLI scripts for automation,
enabling data-driven decision-making processes.
● Conducted Hadoop cluster maintenance and upgrades, collaborating with enterprise data support
teams to ensure system stability and performance.
● Utilized Azure Databricks notebooks for data movement, transformation, and analysis, streamlining
data processing workflows.
● Employed SQL Azure for database needs and implemented partitioning and dynamic partitions in
Hive for efficient data access and management.
● Created Groovy domain classes to access the database, and created login and other services.

● Added Groovy composers and Views to the project

● Involved in setting up one-click deployment of the application to TC servers using Jenkins and
GreenLight.
● Individual contributor developing service applications to consume and integrate with Kenexa using Grails, Spring, Hibernate, Groovy, and Oracle.
● Implemented framework to read data from Excel using Groovy

● Implemented services in a modeling analytics platform using Grails and Groovy to expose RESTful web services consumed by the UI layer.

● Evaluated Snowflake design considerations for any change in the application.

● Built the logical and physical data model for Snowflake as per the required changes.

● Published Power BI reports to the required organizations and made Power BI dashboards available in web clients and mobile apps.
● Worked with cloud architecture to set up the environment. Involved in migrating objects from Teradata to Snowflake and created Snowpipe for continuous data loading (illustrated in the sketch below).
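
A minimal sketch of creating a Snowpipe for continuous loading, as mentioned in the last bullet; it is issued through snowflake-connector-python, and the pipe, stage, and table names (RAW_EVENTS_PIPE, @landing_stage, RAW_EVENTS) are hypothetical.

    import snowflake.connector

    # Hypothetical connection; credentials would normally come from a vault or key pair.
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="...",
        warehouse="LOAD_WH", database="RAW", schema="PUBLIC",
    )

    with conn.cursor() as cur:
        # The pipe continuously copies new files from the stage into the target table
        # (assumed here to have a single VARIANT column for the JSON payload).
        cur.execute("""
            CREATE PIPE IF NOT EXISTS RAW_EVENTS_PIPE
            AUTO_INGEST = TRUE
            AS
            COPY INTO RAW_EVENTS
            FROM @landing_stage
            FILE_FORMAT = (TYPE = JSON)
        """)
    conn.close()
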

Environment: MapR, MapReduce, Spark, Snowflake, Scala, AWS, Java, Azure SQL, Azure Databricks, Azure Data Lake, HDFS, Hive, Pig, Impala, Cassandra, Python, Kafka, Tableau, Teradata, CentOS, Pentaho, Zookeeper, Sqoop.

Client: OhioHealth - Columbus, OH Aug 2019 – Apr 2021


Data Engineer

● Designed and developed SSRS reports and SSIS packages to facilitate Extract, Transform, and Load (ETL) processes from diverse Healthcare source systems.
● Leveraged Hive SQL, Presto SQL, and Spark SQL for efficient ETL operations, selecting the appropriate technology for each Healthcare data task.
● Implemented robust monitoring, metrics, and logging systems on AWS to ensure data pipeline reliability and performance in Healthcare data processing.
● Engineered exception-handling mechanisms and integrated Kafka for error management within Healthcare data processing workflows.
● Led ETL processes and conducted thorough data validation using SQL Server Integration Services (SSIS), ensuring Healthcare data quality and consistency.
● Orchestrated and automated ETL solutions, streamlining Healthcare operational processes for improved efficiency and reliability.
● Optimized Redshift environment to enhance query performance, achieving up to 100x faster execution for Healthcare analytics tools like Tableau and SAS Visual Analytics.
● Automated data sampling procedures using Python scripts to maintain Healthcare data integrity and quality assurance.
● Worked extensively with AWS cloud services including EC2, S3, EMR, and DynamoDB for scalable Healthcare Big Data processing solutions.
● Designed Entity Relationship Diagrams (ERDs), functional diagrams, and data flow diagrams to inform Healthcare database design and implementation.
● Implemented Workload Management (WLM) strategies in Redshift to prioritize and optimize query execution for different types of Healthcare workloads.
● Developed ad hoc queries and reports using SQL Server Reporting Services (SSRS) to support Healthcare business decision-making processes.
● Designed and implemented Healthcare data marts following Ralph Kimball's Dimensional Data Mart modeling methodology to ensure efficient data storage and retrieval.
● Integrated Kafka with Spark Streaming for real-time Healthcare data processing, enabling timely insights and actions on streaming data.
● Managed AWS security groups and implemented high-availability, fault-tolerance, and auto-scaling configurations using Terraform templates for Healthcare applications.
● Applied machine learning algorithms and statistical modeling techniques such as decision trees and logistic regression using scikit-learn in Python for Healthcare predictive analytics (a sketch follows this list).
● Implemented data normalization processes for new Healthcare data ingested into Redshift, ensuring data consistency and integrity.
● Developed and maintained complex SSIS/ETL packages for Healthcare data extraction, transformation, and loading operations.
● Utilized Oozie Scheduler to automate Healthcare pipeline workflows and orchestrate MapReduce jobs for timely data extraction and processing.
● Conducted performance tuning and optimization of SQL queries using execution plans, query
analyzers, and database tuning tools to enhance overall Healthcare system performance.
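
A minimal scikit-learn sketch of the logistic-regression modeling pattern referenced above; synthetic data stands in for the Healthcare features, which are not reproduced here.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Synthetic stand-in for de-identified patient features (the real data is not shown).
    X, y = make_classification(n_samples=1000, n_features=12, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit a simple classifier and report hold-out accuracy.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
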
Environment: SQL Server, Hive SQL, Presto SQL, Spark SQL, SSRS, SSIS, AWS (EC2, S3, Redshift,
EMR, DynamoDB), Kafka, Python, Terraform, Tableau, SAS Visual Analytics, Ralph Kimball's
Dimensional Data Mart modeling methodology, Oozie Scheduler, scikit-learn, Erwin, CI/CD (AWS
Lambda, CodePipeline), Data Warehousing, Big Data technologies

Client: Insurance Auto Auctions, West Chester, IL Dec 2017 – July 2019
Data Engineer
● Contributed to the design and development phases of the Software Development Life Cycle
(SDLC) utilizing Scrum methodology.
● Managed data synchronization between EC2 and S3, Hive stand-up, and AWS profiling for
efficient data operations.
● Developed and optimized data pipelines of MapReduce programs using Chained Mappers for
scalable data processing.
● Utilized Hive extensively to analyze partitioned and bucketed data, deriving actionable insights
and metrics for reporting on dashboards.
● Leveraged Maven for building and deploying jar files of MapReduce programs to the cluster,
ensuring seamless execution.
● Worked with a variety of NoSQL databases including HBase, Cassandra, DynamoDB (AWS),
and MongoDB for diverse data storage and retrieval needs.
● Designed and implemented optimized Hive partitions for efficient data separation and processing, adhering to best practices in Pig and Hive for performance tuning (a partitioning sketch follows this list).
● Loaded aggregated data into DB2 for reporting and analysis purposes, ensuring data accuracy
and accessibility.
● Utilized Pig as an ETL tool for data transformations, event joins, filters, and pre-aggregations
before storing data onto HDFS.
● Developed a customized BI tool using HiveQL for query analytics, empowering the
management team with actionable insights.
● Managed permissions, policies, and roles for users and groups using AWS Identity and Access
Management (IAM), ensuring secure data access.
● Implemented AWS CLI Auto Scaling and CloudWatch Monitoring for efficient resource
management and monitoring of data processes.
● Facilitated data import/export operations between HDFS and databases using Sqoop, ensuring
seamless data movement.
● Engineered a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest behavioral
data into HDFS for analysis, enabling data-driven decision-making.
● Implemented optimization and performance tuning techniques in Hive and Pig to enhance data
processing efficiency and speed.
● Developed and automated job flows in Oozie for seamless workflow orchestration and data
extraction from warehouses and weblogs.
● Designed and implemented a Cassandra-based NoSQL database for persisting high-volume
user profile data, ensuring scalability and reliability.
● Successfully migrated high-volume OLTP transactions from Oracle to Cassandra, optimizing
data storage and retrieval processes.
● Created Talend jobs for batch processing of data and worked on multi-threading in Talend.
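
A minimal PySpark sketch of the Hive partitioning approach referenced earlier in this list; the staging path, database, table, and partition column (auction_db.vehicle_events, event_date) are hypothetical.

    from pyspark.sql import SparkSession

    # Requires a configured Hive metastore for saveAsTable() to register the Hive table.
    spark = (SparkSession.builder
             .appName("hive-partitioning-example")
             .enableHiveSupport()
             .getOrCreate())

    events = spark.read.parquet("/data/staging/vehicle_events")   # hypothetical staging path

    # Partitioning by the date column lets queries that filter on event_date skip
    # irrelevant partitions instead of scanning the full table.
    (events.write
     .mode("overwrite")
     .partitionBy("event_date")
     .saveAsTable("auction_db.vehicle_events"))
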
Environment: Scrum, EC2, S3, Hive, AWS IAM, DynamoDB, MongoDB, HBase, Cassandra, AWS
CLI, Sqoop, Flume, AWS CloudWatch, Git, Maven, Erwin, SQL Server, Spark, Presto SQL, Kafka,
Python, Java, Oozie Scheduler, Tableau, SAS Visual Analytics, Terraform, Ralph Kimball's
Dimensional Data Mart modeling methodology, CI/CD (AWS Lambda, CodePipeline), Data
Warehousing, Big Data technologies.

Client: Indian Immunologicals Ltd, India Feb 2014 – Sep 2017


Data Engineer

● Generated report on predictive analytics using Python and Tableau including visualizing model
performance and prediction results.
● Created a data service layer of internal tables in Hive for data manipulation and organization.
● Achieved business intelligence by creating and analyzing an application service layer in Hive containing internal tables of the data, which are also integrated with HBase.
● Troubleshot RSA SSH keys in Linux for authorization purposes.
● Partnered with ETL developers to ensure that data is well cleaned and the data warehouse is up to date for reporting purposes, using Pig.
● Utilized Sqoop to import structured data from MySQL, SQL Server, and PostgreSQL, and a semi-structured CSV dataset, into the HDFS data lake.
● Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
● Utilized ORC format to optimize storage space and improve query performance in Hive and Spark
environments.
● Experience creating and organizing HDFS over a staging area.
● Performed data preprocessing and feature engineering for further predictive analytics using Python pandas (sketched below).
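
A minimal pandas sketch of the preprocessing and feature-engineering step above; the file name and columns (batch_records.csv, batch_date, temperature) are hypothetical stand-ins for the project data.

    import pandas as pd

    # Hypothetical extract; the real source tables lived in Hive/MySQL.
    df = pd.read_csv("batch_records.csv", parse_dates=["batch_date"])

    # Basic cleaning: drop exact duplicates and fill missing readings with the column median.
    df = df.drop_duplicates()
    df["temperature"] = df["temperature"].fillna(df["temperature"].median())

    # Simple feature engineering: calendar feature and a normalized reading.
    df["batch_month"] = df["batch_date"].dt.month
    df["temperature_z"] = (df["temperature"] - df["temperature"].mean()) / df["temperature"].std()

    print(df.head())
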

Environment: Hadoop, Hive, HBase, Spark, Python, pandas, PL/SQL, MySQL, SQL Server, PostgreSQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP, Git.
