Gautham - Data Engineer
Contact:
Email: goutham.tech02@gmail.com
Professional Summary:
Around 8 years of professional work experience as a Data Engineer, working with Python, Spark,
AWS, SQL and MicroStrategy in the design, development, testing and implementation of business
application systems for the healthcare and educational sectors.
Extensively worked on system analysis, design, development, testing and implementation of
projects (SDLC); capable of handling responsibilities independently as well as being a proactive
team member.
Hands-on experience in designing and implementing data engineering pipelines and analyzing
data using the AWS stack, including AWS EMR, AWS Glue, EC2, AWS Lambda, Athena and Redshift,
along with Sqoop and Hive.
Hands-on experience in programming using Python, Scala, Java and SQL.
Sound knowledge of the architecture of distributed systems and parallel processing frameworks.
Designed and implemented end-to-end data pipelines to extract, cleanse, process and analyze huge
amounts of behavioral data and log data.
Good experience working with various data analytics services in the AWS Cloud, such as EMR,
Redshift, S3, Athena and Glue.
Experienced in developing production-ready Spark applications using the Spark RDD, DataFrame,
Spark SQL and Spark Streaming APIs.
Hands-on experience working with GCP services such as BigQuery, Cloud Storage (GCS), Cloud
Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, Dataproc and Operations Suite
(Stackdriver).
Worked extensively on fine-tuning Spark applications to improve performance and on troubleshooting
failures in Spark applications.
Hands-on experience in GCP, including BigQuery, Cloud Functions and Dataproc.
Strong experience in using Spark Streaming, Spark SQL and other Spark components such as
accumulators, broadcast variables, different levels of caching and optimization techniques for Spark
jobs.
Proficient in importing/exporting data from RDBMS to HDFS using Sqoop.
Experience in setting up Hadoop clusters on cloud platforms like AWS and GCP.
Used Hive extensively to perform various data analytics required by business teams.
Solid experience working with various data formats such as Parquet, ORC, Avro and JSON.
Experience automating end-to-end data pipelines with strong resilience and recoverability.
Experience in creating Impala views on Hive tables for fast access to data.
Experienced in using the Waterfall, Agile and Scrum software development process frameworks.
Good knowledge in Oracle PL/SQL and shell scripting.
Database/ETL Performance Tuning: Broad experience in database development, including effective
use of database objects, SQL Trace, Explain Plan, different types of optimizers, hints, indexes,
table partitions, sub-partitions, materialized views, global temporary tables, autonomous
transactions, bulk binds and Oracle built-in functions.
Experienced in developing production-ready Spark applications using the Spark RDD, DataFrame
and Spark SQL APIs.
Extensively worked on Spark using Scala on clusters for computational analytics; installed it on top
of Hadoop and built advanced analytical applications using Spark with Hive and
SQL/Oracle/Snowflake.
Experienced, process-oriented Data Analyst with excellent analytical, quantitative and problem-
solving skills using SQL, MicroStrategy, advanced Excel and Python.
Proficient in writing unit tests using unittest/pytest and integrating the test code with the
build process.
Used Python scripts to parse XML and JSON reports and load the information into a database.
Experienced with version control systems such as Git, GitHub and Bitbucket to keep code versions
and configurations organized.
EDUCATIONAL QUALIFICATION:
Bachelor of Engineering, Computer Science and Engineering, Aug 2012 – Feb 2015
Sathyabhama Institute of Science and Technology

INFORMATION TECHNOLOGY:
Big Data Eco-System: HDFS, GPFS, Hive, Sqoop, Spark, YARN, Pig, Kafka
Hadoop Distributions: Hortonworks, Cloudera, IBM BigInsights
Operating Systems: Windows, Linux (CentOS, Ubuntu)
Programming Languages: Python, Scala, Shell Scripting
Databases: Hive, MySQL, Netezza, SQL Server
IDE Tools & Utilities: IntelliJ IDEA, Eclipse, PyCharm, Aginity Workbench, Git
Markup Languages: HTML
ETL: DataStage 9.1/11.5 (Designer/Monitor/Director)
Job Schedulers: Control-M, IBM Symphony Platform, Ambari, Apache Airflow
Reporting Tools: Tableau, Lumira
Cloud Computing Tools: AWS, GCP
Scrum Methodologies: Agile, Asana, Jira
Others: MS Office, RTC, ServiceNow, Optim, IGC (InfoSphere Governance Catalog), WinSCP, MS
Visio
Responsibilities:
Involved in writing Spark applications using Python to perform various data cleansing, validation,
transformation, and summarization activities according to the requirement.
Developed multiple POCs using PySpark and deployed them on the YARN cluster; compared the
performance of Spark with Hive and SQL/Teradata and developed code for reading multiple data
formats on HDFS using PySpark.
Loaded the data into Spark DataFrames and performed in-memory data computation to generate the
output as per the requirements.
Worked on AWS Cloud to convert all existing on-premises processes and databases to AWS Cloud.
Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources
such as S3 (ORC/Parquet/text files) into AWS Redshift.
Used AWS Redshift, S3, Spectrum and Athena services to query large amounts of data stored on S3 to
create a Virtual Data Lake without having to go through the ETL process.
Developed a PySpark job to load CSV files into S3 buckets; created AWS S3 buckets, performed
folder management in each bucket, and managed logs and objects within each bucket.
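A minimal PySpark sketch of the CSV-to-S3 load described in the item above; bucket names, paths and the load_date column are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    # Hypothetical bucket names and paths, for illustration only.
    SOURCE_PATH = "s3a://example-landing-bucket/incoming/*.csv"
    TARGET_PATH = "s3a://example-curated-bucket/members/"

    spark = SparkSession.builder.appName("csv_to_s3_load").getOrCreate()

    # Read the raw CSV files, inferring the schema from the header row.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(SOURCE_PATH))

    # Tag each record with the load date and write back to S3 as Parquet,
    # partitioned so downstream jobs can prune by date.
    (df.withColumn("load_date", F.current_date())
       .write.mode("overwrite")
       .partitionBy("load_date")
       .parquet(TARGET_PATH))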
Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.
Developed a daily process to do incremental import of data from DB2 and Teradata into Hive tables using
Sqoop.
Analyzed the SQL scripts and designed the solution to implement using Spark.
Worked on importing metadata into Hive using Python, migrated existing tables and the data pipeline
from the legacy environment to the AWS cloud (S3), and wrote Lambda functions to run the data
pipeline in the cloud.
Exported the analyzed data to the relational databases using Sqoop for visualization and to generate
reports for the BI team.
Extensively worked with partitions, dynamic partitioning and bucketed tables in Hive; designed both
managed and external tables and worked on optimization of Hive queries.
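The Hive partitioning work above could be sketched roughly as below via spark.sql; the table names, schema and S3 location are assumptions, and bucketing and query-optimization details are omitted.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive_partitioning_example")
             .enableHiveSupport()
             .getOrCreate())

    # Allow dynamic-partition inserts.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # External table over raw files, partitioned by load date.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS claims_ext (
            claim_id  STRING,
            member_id STRING,
            amount    DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        STORED AS PARQUET
        LOCATION 's3a://example-bucket/claims/'
    """)
    # Register any partitions already present at the external location.
    spark.sql("MSCK REPAIR TABLE claims_ext")

    # Managed table kept in the warehouse, partitioned the same way.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS claims_curated (
            claim_id  STRING,
            member_id STRING,
            amount    DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        STORED AS ORC
    """)

    # Dynamic-partition insert from the external staging table into the managed table.
    spark.sql("""
        INSERT OVERWRITE TABLE claims_curated PARTITION (load_date)
        SELECT claim_id, member_id, amount, load_date
        FROM claims_ext
    """)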
Designed, developed and created ETL (Extract, Transform and Load) packages using Python to load data
into data warehouse tools (Teradata) from source databases such as Oracle and MS SQL Server.
Utilized the built-in Python json module to parse member data in JSON format using json.loads or
json.dumps and load it into a database for reporting.
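A small illustration of the json.loads parse-and-load step above; the member fields, table layout and the sqlite3 stand-in for the reporting database are assumptions.

    import json
    import sqlite3  # stand-in for the reporting database; the actual driver is an assumption

    # Sample member payload of the kind described above (hypothetical fields).
    raw = '{"member_id": "M123", "plan": "GOLD", "premium": 245.50}'

    record = json.loads(raw)  # parse the JSON text into a Python dict

    conn = sqlite3.connect("reporting.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS members (member_id TEXT, plan TEXT, premium REAL)"
    )
    conn.execute(
        "INSERT INTO members (member_id, plan, premium) VALUES (?, ?, ?)",
        (record["member_id"], record["plan"], record["premium"]),
    )
    conn.commit()
    conn.close()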
Consumed REST APIs using Python requests (GET and POST operations) to fetch and post member data
to different environments.
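A hedged sketch of the GET/POST member-data calls above using the requests library; the endpoint, headers and token are hypothetical.

    import requests

    BASE_URL = "https://api.example.com/members"  # hypothetical endpoint
    HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

    # Fetch a member record from one environment.
    resp = requests.get(f"{BASE_URL}/M123", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    member = resp.json()

    # Post the same record to another environment.
    post_resp = requests.post(BASE_URL, json=member, headers=HEADERS, timeout=30)
    post_resp.raise_for_status()
    print(post_resp.status_code)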
Used the pandas API to organize the data in time-series and tabular formats for central, timestamp-based
data manipulation and retrieval during various loads into the DataMart.
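A minimal pandas sketch of the time-series handling above; the load metrics and column names are invented for illustration.

    import pandas as pd

    # Hypothetical load metrics keyed by a central timestamp.
    df = pd.DataFrame(
        {
            "load_ts": pd.to_datetime(
                ["2021-01-01 02:00", "2021-01-01 14:00", "2021-01-02 02:00"]
            ),
            "rows_loaded": [120_000, 95_000, 130_000],
        }
    )

    # Index on the timestamp so the frame behaves as a time series.
    ts = df.set_index("load_ts").sort_index()

    # Daily totals for DataMart load reporting.
    daily = ts.resample("D").sum()
    print(daily)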
Used Python libraries and SQL queries/subqueries to create several datasets producing statistics,
tables, figures, charts and graphs; good experience of software development using IDEs such as
PyCharm and Jupyter Notebook.
Worked on bash scripting to automate the Python jobs for day-to-day administration.
Performed data extraction and manipulation over large relational datasets using SQL, Python, and other
analytical tools.
Extensively worked with Teradata utilities such as BTEQ, FastExport, FastLoad and MultiLoad to export
and load Claims & Callers data to/from different source systems, including flat files.
Technical Stack: AWS EMR, AWS Glue, Redshift, Hadoop, HDFS, Teradata, SQL, Oracle, Hive, Spark, Python,
Sqoop, MicroStrategy, Excel.
Responsibilities:
Responsible for developing a highly scalable and flexible authority engine for all customer data.
Worked on resetting customer attributes that provide insight about customers, such as purchase
frequency, marketing channel and Groupon deal categorization; consolidated different sources of data
using SQL, Hive and Scala.
Integrated third-party data (gender, age and purchase history from other sites) into the existing
data store.
Used the Kafka HDFS Connector to export data from Kafka topics to HDFS files in a variety of formats;
the connector integrates with Apache Hive to make data immediately available for SQL querying.
Normalized the data according to business needs, including data cleansing, modifying data types
and various transformations using Spark, Scala and GCP Dataproc.
Implemented dynamic partitioning in BigQuery tables and used appropriate file formats and compression
techniques to improve the performance of PySpark jobs on Dataproc.
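The BigQuery partitioning part of the item above might look roughly like this with the google-cloud-bigquery client; the project, dataset, table and schema names are assumptions.

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")  # hypothetical project id

    table_id = "example-project.analytics.customer_events"
    schema = [
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("event_type", "STRING"),
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
    ]

    table = bigquery.Table(table_id, schema=schema)
    # Partition on the event timestamp so queries can prune by day.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="event_ts",
    )
    table = client.create_table(table, exists_ok=True)
    print(f"Created {table.table_id}, partitioned on {table.time_partitioning.field}")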
Built a system for analyzing column names from all tables and identifying personal-information
columns across on-premises databases for data migration to GCP.
Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using
Cloud Dataflow with Python.
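A minimal Apache Beam (Cloud Dataflow) sketch of the Pub/Sub-to-BigQuery flow above; the subscription and table names are hypothetical, and error handling is omitted.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical resource names for illustration.
    SUBSCRIPTION = "projects/example-project/subscriptions/events-sub"
    TABLE = "example-project:analytics.events"

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                TABLE,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )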
Worked on partitions of Pub/Sub messages and setting up the replication factors.
Operations-focused, including building proper monitoring of data processes and data quality.
Effectively worked and communicated with product, marketing, business owners, business intelligence,
and the data infrastructure and warehouse teams.
Performed analysis on data discrepancies and recommended solutions based upon root cause.
Designed and developed job flows using Apache Airflow.
Worked with IntelliJ IDEA, Eclipse, Maven, SBT and Git.
Worked on a data pipeline built on top of Spark using Scala.
Designed, developed and created ETL (Extract, Transform and Load) packages using Python and SQL
Server Integration Services (SSIS) to load data from Excel workbooks and flat files into the data
warehouse (Microsoft SQL Server).
Implemented an application for cleansing and processing terabytes of data using Python and
Spark.
Developed packages using Python, Shell scripting, XML to automate some of the menial tasks.
Used Python to write data into JSON files for testing Student Item level information.
Created scripts for data modelling and data import and export.
Developed SSIS Packages to extract Student data from source systems such as Transactional
system for online assessments and legacy system for paper pencil assessments, transform data
based on business rules and load the data into reporting DataMart tables such as dimensions,
facts and aggregated fact tables.
Developed T-SQL (transact SQL) queries, stored procedures, user-defined functions, built-in
functions.
Used advanced SQL and dynamic SQL methods, creating pivot and unpivot functions, dynamic table
expressions and dynamic execution of loads through parameters and variables for generating data
files.
Used windowing functions such as ROW_NUMBER, RANK, DENSE_RANK and NTILE to order data and
remove duplicates in source data before loading to the DataMart for better performance.
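The item above describes the ROW_NUMBER dedup pattern in T-SQL; for consistency with the other sketches, here is the same pattern expressed in PySpark, with hypothetical columns and data.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("dedupe_example").getOrCreate()

    # Hypothetical source rows with duplicate student/assessment records.
    rows = [
        ("S1", "MATH", "2021-05-01", 88),
        ("S1", "MATH", "2021-05-03", 90),  # the later record should win
        ("S2", "READ", "2021-05-02", 75),
    ]
    df = spark.createDataFrame(rows, ["student_id", "subject", "loaded_at", "score"])

    # ROW_NUMBER over a window keyed by the natural key, newest record first.
    w = Window.partitionBy("student_id", "subject").orderBy(F.col("loaded_at").desc())
    deduped = (
        df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn")
    )
    deduped.show()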
Worked on performance tuning of existing and new queries using SQL Server Execution plan, SQL
Sentry Plan Explorer to identify missing indexes, table scans, index scans.
Redesigned and tuned stored procedures, triggers, UDFs, views and indexes to increase the
performance of slow-running queries.
Expertise in Snowflake to create and maintain tables and views.
Optimized queries by adding necessary non-clustered indexes and covering indexes.
Developed Power Pivot/SSRS (SQL Server Reporting Services) Reports and added logos, pie
charts, bar graphs for display purposes as per business needs.
Worked on importing and exporting data from Snowflake, Oracle and DB2 into HDFS and Hive using
Sqoop for analysis, visualization and report generation.
Designed SSRS reports using parameters, drill down options, filters, sub reports.
Developed internal dashboards for the team using Power BI tools for tracking daily tasks.
Environment: Python, GCP, Spark, Hive, Scala, Snowflake, Jupyter Notebook, Shell Scripting, SQL Server
2016/2012, T-SQL, SSIS, Visual Studio, Power BI, PowerShell.
Responsibilities:
Developed Hive Scripts, Hive UDFs, Python Scripting and used Spark (Spark-SQL, Spark-shell) to
process data in Hortonworks.
Performed advanced procedures like text analytics and processing using the in-memory computing
capabilities of Spark.
Designed and Developed Scala code for data pull from cloud based systems and applying
transformations on it.
Used Sqoop to import data into HDFS from a MySQL database and vice versa.
Implemented optimized joins to perform analysis on different data sets using MapReduce programs.
Created continuous integration and continuous delivery (CI/CD) pipeline on AWS that helps to
automate steps in software delivery process.
Experience in loading and transforming large data sets of structured, unstructured and
semi-structured data in Hortonworks.
Implemented Partitioning, Dynamic Partitions and Buckets in HIVE & Impala for efficient data access.
Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
Extensively worked on HiveQL and join operations, writing custom UDFs, with good experience in
optimizing Hive queries.
Experienced in running queries using Impala and used BI and reporting tools (Tableau) to run ad-
hoc queries directly on Hadoop.
Worked on Apache Tez, an extensible framework for building high-performance batch and interactive
data processing applications, for Hive jobs.
Experience in using the Spark framework with Scala and Python; good exposure to performance tuning
of Hive queries and MapReduce jobs in the Spark (Spark SQL) framework on Hortonworks.
Developed Scala and Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark
SQL for data aggregation and queries, and wrote data back into RDBMS through Sqoop.
Configured Spark Streaming receivers to receive Kafka input streams and specified the exact block
interval for processing data into HDFS using Scala.
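The item above used the receiver-based DStream API in Scala; as a rough PySpark equivalent, a Structured Streaming read from Kafka into HDFS could look like the sketch below (broker, topic and paths are assumptions, and the spark-sql-kafka connector package must be available on the cluster).

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kafka_stream_to_hdfs").getOrCreate()

    # Hypothetical broker and topic names.
    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "clickstream")
        .load()
    )

    # Kafka delivers key/value as binary; cast the payload to a string column.
    events = stream.select(F.col("value").cast("string").alias("payload"))

    # Micro-batches written to HDFS; the trigger interval plays the role of the
    # block interval mentioned above.
    query = (
        events.writeStream.format("parquet")
        .option("path", "hdfs:///data/clickstream/")
        .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
        .trigger(processingTime="30 seconds")
        .start()
    )
    query.awaitTermination()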
Collected the data using Spark Streaming and loaded it into HBase and Cassandra; used the Spark-
Cassandra Connector to load data to and from Cassandra.
Collected and aggregated large amounts of log data using Kafka, staging the data in an HDFS data lake
for further analysis.
Used Hive to analyze data ingested into HBase by using Hive-HBase integration and HBase Filters to
compute various metrics for reporting on the dashboard.
Developed shell scripts in the UNIX environment to automate the data flow from source to different
zones in HDFS.
Created and defined job workflows as per their dependencies in Oozie, with e-mail notification on job
completion for the teams requesting the data, and monitored jobs using Oozie on Hortonworks.
Experience in designing both time driven and data driven automated workflows using Oozie.
Environment: Hadoop (Cloudera), HDFS, MapReduce, Hive, Scala, Python, Pig, Sqoop, AWS, Azure, DB2,
UNIX Shell Scripting, JDBC.
Responsibilities:
Installed, configured, and maintained Apache Hadoop clusters for application development and
major components of Hadoop Ecosystem: Hive, Pig, HBase, Sqoop, Flume, Oozie and Zookeeper.
Implemented a six-node CDH4 Hadoop cluster on CentOS.
Imported and exported data into HDFS and Hive from different RDBMS using Sqoop.
Experienced in defining job flows to run multiple MapReduce and Pig jobs using Oozie.
Imported log files into HDFS using Flume and loaded them into Hive tables to query data.
Monitored running MapReduce programs on the cluster.
Responsible for loading data from UNIX file systems to HDFS.
Used HBase-Hive integration and wrote multiple Hive UDFs for complex queries.
Involved in writing APIs to Read HBase tables, cleanse data and write to another HBase table.
Created multiple Hive tables, implemented Partitioning, Dynamic Partitioning and Buckets in Hive for
efficient data access.
Wrote multiple MapReduce programs in Java for data extraction, transformation and aggregation
from multiple file formats, including XML, JSON, CSV and other compressed file formats.
Experienced in running batch processes using Pig Scripts and developed Pig UDFs for data
manipulation according to Business Requirements.
Experienced in writing programs using HBase Client API.
Involved in loading data into HBase using HBase Shell, HBase Client API, Pig and Sqoop.
Experienced in design, development, tuning and maintenance of NoSQL database.
Wrote MapReduce programs in Python using the Hadoop Streaming API.
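A minimal Hadoop Streaming sketch of a MapReduce job in Python, of the kind described above; the word-count logic, file name, jar path and HDFS paths are illustrative only.

    #!/usr/bin/env python
    """Minimal Hadoop Streaming word-count sketch: mapper and reducer in one file.

    Run roughly as (jar location and paths are assumptions):
      hadoop jar hadoop-streaming.jar \
          -input /data/raw -output /data/counts \
          -mapper "python wordcount.py map" -reducer "python wordcount.py reduce"
    """
    import sys


    def mapper():
        # Emit one (word, 1) pair per token read from stdin.
        for line in sys.stdin:
            for word in line.strip().split():
                print(f"{word}\t1")


    def reducer():
        # Hadoop sorts mapper output by key, so counts for a word arrive together.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = word, 0
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")


    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()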
Developed unit test cases for Hadoop Map Reduce jobs with MRUnit.
Excellent experience in ETL analysis, designing, developing, testing and implementing ETL processes
including performance tuning and query optimizing of database.
Continuously monitored and managed the Hadoop cluster using Cloudera manager and Web UI.
Worked with application teams to install operating system, Hadoop updates, patches, version
upgrades as required.
Used Maven as the build tool and SVN for code management.
Worked on writing RESTful web services for the application.
Implemented testing scripts to support test driven development and continuous integration.
Environment: Hadoop, MapReduce, HDFS, HBase, Hive, Impala, Pig, Java, SQL, Ganglia, Sqoop, Flume,
Oozie, Unix, JavaScript, Maven, Eclipse.
Responsibilities:
Involved in the complete SDLC life cycle, design and development of the application.
AGILE methodology was followed and was involved in SCRUM meetings.
Created various Java bean classes to capture the data from the UI controls.
Designed UML diagrams like class diagrams, sequence diagrams and activity diagrams.
Implemented Java web services, JSPs and Servlets for handling data.
Designed and developed the user interface using Struts 2.0, JavaScript and XHTML.
Made use of Struts validation framework for validations at the server side.
Created and Implemented the DAO layer using Hibernate tools.
Implemented custom interceptors and exception handlers for Struts 2 application.
Ajax was used to provide dynamic search capabilities for the application.
Developed business components using service locator, session facade design patterns.
Developed session facades with stateless session beans for coarse-grained functionality.
Worked with Log4J for logging purpose in the project.
Environment: Java 1.5, JavaScript, Struts 2.0, Hibernate 3.0, Ajax, JAXB, XML, XSLT, Eclipse, Tomcat.