DE Sample Resume
PROFESSIONAL SUMMARY:
Dynamic and motivated IT professional with over 8 years of experience as a Big Data Engineer, with expertise in
designing data-intensive applications using the Hadoop ecosystem, big data analytics, cloud data engineering,
data warehouse/data mart, data visualization, reporting, and data quality solutions.
In-depth knowledge of Hadoop architecture and its components, such as YARN, HDFS, NameNode, DataNode,
JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce programming paradigm.
Extensive experience in Hadoop-led development of enterprise-level solutions utilizing Hadoop components such
as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.
Profound experience in performing data ingestion and data processing (transformations, enrichment, and
aggregations). Strong knowledge of the architecture of distributed systems and parallel processing, and an
in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
Experienced in using Spark to improve the performance and optimization of existing algorithms in Hadoop with
Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively with
PySpark and Scala.
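A minimal PySpark sketch of the kind of DataFrame API and Spark SQL work described above; the path, table name, and column names are hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-aggregation").getOrCreate()

# DataFrame API: read and aggregate (hypothetical path and columns).
orders = spark.read.parquet("hdfs:///data/orders")
daily = (orders
         .groupBy(F.to_date("order_ts").alias("order_date"))   # aggregate per day
         .agg(F.sum("amount").alias("total_amount")))

# Same result expressed through Spark SQL against a temp view.
orders.createOrReplaceTempView("orders")
daily_sql = spark.sql("""
    SELECT to_date(order_ts) AS order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY to_date(order_ts)
""")

daily.write.mode("overwrite").parquet("hdfs:///data/orders_daily")
```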
Handled ingestion of data from different data sources into HDFS using Sqoop and Flume, performed
transformations using Hive and MapReduce, and loaded the transformed data back into HDFS. Managed Sqoop jobs
with incremental loads to populate Hive external tables.
Good working experience in using Apache Hadoop ecosystem components like MapReduce, HDFS, Hive, Sqoop,
Pig, Oozie, Flume, HBase, and Zookeeper.
Extensive experience working on Spark, performing ETL using Spark Core and Spark SQL and real-time data
processing using Spark Streaming.
Extensively worked with Kafka as middleware for real-time data pipelines.
Wrote UDFs in Java and integrated them with Hive and Pig.
Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie. Experienced with the
most common Airflow operators: PythonOperator, BashOperator, GoogleCloudStorageDownloadOperator,
GoogleCloudStorageObjectSensor, and GoogleCloudStorageToS3Operator.
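A minimal sketch of an Airflow DAG using the PythonOperator and BashOperator mentioned above (Airflow 2.x import paths assumed; the DAG id, task names, callable, and command are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_partition(**context):
    # Hypothetical callable: log the logical date and kick off an extract.
    print(f"extracting partition for {context['ds']}")


with DAG(
    dag_id="daily_ingest_sketch",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_partition",
        python_callable=extract_partition,
    )
    load = BashOperator(
        task_id="load_to_hive",
        bash_command="echo 'spark-submit load_job.py {{ ds }}'",  # placeholder command
    )

    extract >> load                          # run extract before load
```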
Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as
MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java apps to handle data in MongoDB and
HBase. Used Phoenix to create an SQL layer on HBase.
Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored
procedures, cursors, triggers, and transactions.
Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server, and performing
data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.
Expert in designing Parallel jobs using various stages like Join, Merge, Lookup, remove duplicates, Filter, Dataset,
Lookup file set, Complex flat file, Modify, Aggregator, XML.
Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing,
Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.
Experienced in dimensional modeling of facts and dimensions (star schema, snowflake schema), transactional
modeling, and SCDs (slowly changing dimensions).
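A minimal, illustrative sketch of the Type 2 slowly changing dimension pattern in PySpark; the table and column names are hypothetical, and effective-date columns are omitted for brevity:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

# Hypothetical current dimension and incoming staging data.
dim = spark.createDataFrame(
    [(1, "Alice", "NY", True)], ["cust_id", "name", "state", "is_current"])
stage = spark.createDataFrame(
    [(1, "Alice", "NJ")], ["cust_id", "name", "state"])

# Keys whose tracked attribute changed: their current record must be closed out.
changed = (dim.filter("is_current").alias("d")
           .join(stage.alias("s"), "cust_id")
           .filter("d.state <> s.state")
           .select("cust_id"))

expired = (dim.join(changed, "cust_id", "left_semi")
           .withColumn("is_current", F.lit(False)))      # close old version
unchanged = dim.join(changed, "cust_id", "left_anti")     # keep untouched rows
new_rows = (stage.join(changed, "cust_id", "left_semi")
            .withColumn("is_current", F.lit(True)))       # insert new version

new_dim = unchanged.unionByName(expired).unionByName(new_rows)
new_dim.show()
```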
Experienced with JSON-based RESTful web services and XML-based SOAP web services; also worked on
various applications using Python IDEs and editors like Sublime Text and PyCharm.
Built and productionized predictive models on large datasets utilizing advanced statistical modeling,
machine learning, and other data mining techniques.
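A minimal scikit-learn sketch of the model-building workflow described above; the data here is a synthetic placeholder standing in for customer features and a binary label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (not real customer data).
rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate with AUC before promoting the model to production scoring.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```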
Developed intricate algorithms based on deep-dive statistical analysis and predictive data modeling that were
used to deepen customer relationships, strengthen customer longevity, and personalize interactions.
TECHNICAL SKILLS:
Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka,
Zookeeper, YARN, Apache Spark, Mahout, Sparklib
Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata, Cosmos.
Programming: Python, PySpark, Scala, Java, C, C++, Shell script, Perl script, SQL
Cloud Technologies: AWS (Lambda, EC2, EMR, Amazon S3, Kinesis, SageMaker), Microsoft Azure, GCP
Frameworks: Django REST framework, MVC, Hortonworks
Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL
Server Management Studio, SQL Assistant, Postman
Versioning tools: SVN, Git, GitHub (Version Control)
Network Security: Kerberos
Database Modelling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling
Monitoring/Project Tracking: Apache Airflow, Jira, Rally (Agile)
Visualization/ Reporting: Tableau, ggplot2, matplotlib, SSRS and Power BI
Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest,
Association Rules, NLP, and Clustering
Machine Learning Tools: Scikit-learn, Pandas, TensorFlow, Spark ML, SAS, R, Keras
EXPERIENCE:
USDA, NY Jul 2022 to Present
Data Engineer
Responsibilities:
Involved in analyzing data from different sources like Teradata and MySQL, and importing data into Hive using Sqoop.
Involved in identifying customers from unstructured data with PySpark and FuzzyWuzzy matching logic.
Wrote UDFs, UDTFs, and UDAFs in Spark to implement business logic on data.
Involved in moving large volumes of data between servers in Hadoop using compression techniques.
Created real-time data pipelines and frameworks with Kafka and Spark Streaming, loading data into HBase.
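A minimal sketch of a Spark Structured Streaming job reading from Kafka, in the spirit of the pipeline above; the broker address, topic, paths, and the sink are placeholders (it assumes the spark-sql-kafka connector is on the classpath, and the real HBase write would go through whatever connector the cluster uses):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Placeholder broker and topic names.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS json"))

# Pull a couple of hypothetical fields out of the JSON payload.
parsed = events.select(
    F.get_json_object("json", "$.id").alias("id"),
    F.get_json_object("json", "$.amount").cast("double").alias("amount"),
)


def write_batch(batch_df, batch_id):
    # Placeholder sink: each micro-batch would be written to HBase via the
    # configured connector; here it lands in a staging directory instead.
    batch_df.write.mode("append").parquet("hdfs:///data/events_staged")


query = (parsed.writeStream
         .option("checkpointLocation", "hdfs:///chk/events")
         .foreachBatch(write_batch)
         .start())
query.awaitTermination()
```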
Worked on NiFi processors to create data pipelines that copy data from JMS MQ to Kafka topics and process data
in flight, such as JSON-to-XML conversion.
Created a custom NiFi processor for processing data with business logic.
Involved in the development of data frameworks in Python, Java, and Scala.
Wrote shell scripts to launch jobs with the required parameters and environment.
Monitored the Spark Web UI, DAG scheduler, and YARN ResourceManager UI to optimize queries and performance in
Spark.
Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation
using PySpark.
Developed Spark Streaming jobs in Scala to consume data from Kafka topics, transform the data, and insert it
into HBase. Responsible for building scalable distributed data solutions using Hadoop. Managed data
coming from different sources and involved in HDFS maintenance and loading of structured and unstructured
data.
Wrote multiple MapReduce programs in Java for data analysis.
Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log
files.
Loaded data from various data sources into HDFS.
Experienced in tuning HiveQL to minimize query response time.
Implemented Avro and Parquet data formats for Apache Hive computations to handle custom business
requirements.
Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, DataFrames, and
Spark SQL APIs.
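A minimal sketch of the kind of conversion described above: a classic word-count MapReduce job expressed first as RDD transformations and actions, then with the DataFrame/Spark SQL API (paths are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mr-to-spark-sketch").getOrCreate()
sc = spark.sparkContext

# RDD version: map + reduceByKey replaces the hand-written mapper/reducer.
counts_rdd = (sc.textFile("hdfs:///data/input.txt")
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

# DataFrame / Spark SQL version of the same job.
lines = spark.read.text("hdfs:///data/input.txt")
counts_df = (lines
             .select(F.explode(F.split("value", r"\s+")).alias("word"))
             .groupBy("word")
             .count())

counts_df.write.mode("overwrite").parquet("hdfs:///data/word_counts")
```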
Worked on SequenceFiles, RCFiles, map-side joins, bucketing, and partitioning for Hive performance enhancement
and storage improvement.
Experienced in handling different types of joins in Hive.
Implemented the Databricks API in a Scala program to push the processed data to AWS S3.
Responsible for performing extensive data validation using Hive.
Used storage formats like Avro to access multiple columns of data quickly in complex queries.
Experience working with EMR clusters in the AWS cloud and with S3, Redshift, and Snowflake.
Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with
historical data.
Used Pig as an ETL tool to perform transformations, event joins, filters, and some pre-aggregations.
Worked on different file formats and compression codecs.
Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
Worked on different file formats like Parquet, Avro files and ORC file formats.
Created a framework using Spark Streaming and Kafka to process data in real time and feed it to APIs.
Worked on recreating data pipelines in an AWS architecture using EC2, the S3 storage system, RDS,
CloudFormation, Security Groups, SNS, SQS, and Lambda functions, along with DevOps practices, Unix (Red Hat),
networking, and web services architecture.
Environment: Hive, Spark, Spark SQL, Kafka, Spark Streaming, Scala, NiFi, AWS EC2, AWS EMR, AWS S3, Unix
Shell Scripting, NoSQL database HBase, Control-M, YARN, Jenkins, JIRA, JDBC.
OMES May 2021 to June 2022
Data Engineer
Responsibilities:
Evaluated and extracted/transformed data for analytical purposes within the context of a big data environment.
In-depth understanding of Spark architecture, including Spark Core, Spark SQL, and DataFrames.
Developed Spark applications using Python (PySpark) to transform data according to business rules.
Used Spark SQL to load data into Hive tables and wrote queries to fetch data from these tables.
Experienced in performance tuning of Spark applications, setting the correct level of parallelism and tuning
memory.
Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, and
effective, efficient joins.
Sourced data from various systems into the Hadoop ecosystem using big data tools like Sqoop.
Worked with Oracle and Teradata for data import/export operations from different data marts.
Involved in creating Hive tables, loading them with data, and writing Hive queries.
Worked extensively with data migration, data cleansing, and data profiling.
Worked on tuning Hive to improve performance and solved performance issues in Hive with an understanding of
joins, grouping, and aggregation and how they translate to MapReduce jobs.
Implemented partitioning, dynamic partitioning, and bucketing in Hive.
Modeled Hive partitions extensively for data separation.
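A minimal sketch of the partitioning, dynamic partitioning, and bucketing described above, issued through Spark SQL against Hive; the database objects (sales_part, sales_bucketed, staging_sales) and columns are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Partitioned Hive table (hypothetical schema).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_part (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
""")

# Dynamic partition insert: partition values come from the data itself
# (staging_sales is a hypothetical source table).
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE sales_part PARTITION (order_date)
    SELECT order_id, amount, order_date
    FROM staging_sales
""")

# Bucketing DDL (CLUSTERED BY); populating Hive-bucketed tables is typically
# done from Hive itself, since Spark's own bucketing is not Hive-compatible.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_bucketed (
        order_id BIGINT,
        amount   DOUBLE
    )
    CLUSTERED BY (order_id) INTO 32 BUCKETS
    STORED AS ORC
""")
```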
Involved in building applications using Maven and integrating with continuous integration servers like Jenkins to
build jobs.
Expertise in Hive queries; created user-defined aggregate functions and worked on advanced optimization
techniques.
Implemented ETL code to load data from multiple sources into HDFS using Pig scripts.
Automation tools like Oozie were used for scheduling jobs.
Exported the analyzed data to relational databases using Sqoop for visualization and report generation.
Environment: Apache Hadoop, Hive, MapReduce, Sqoop, Spark, Python, Cloudera Manager CM 5.1.1, HDFS,
Oozie, PuTTY.
Con Edison, NYC, NY Feb 2019 to Apr 2021
Hadoop Developer
Responsibilities:
Involved in creating Hive tables, and loading and analyzing data using Hive queries.
Optimized existing Hive scripts using many hive optimization techniques.
Built reusable Hive UDF libraries that business users could reuse.
Built and implemented Apache Pig scripts to load data from and store data into Hive.
Involved in loading data from the Linux file system to HDFS.
Worked on implementing Flume to import streaming log data and aggregate it into HDFS.
Imported data from various data sources, performed transformations using Spark, and loaded the data into Hive.
Worked with the Spark Core, Spark Streaming, and Spark SQL modules of Spark.
Used Scala to write the code for all the use cases in Spark.
Monitored the Hadoop cluster using tools like Ambari and Cloudera Manager.
Explored various modules of Spark and worked with Data Frames, RDD and Spark Context.
Performed data analysis using Spark with Scala.
Created RDDs, DataFrames, and Datasets.
Experience in using DStreams in Spark Streaming, accumulators, broadcast variables, various levels of caching,
and optimization techniques in Spark.
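A minimal PySpark sketch of an accumulator plus an explicit caching level, in the spirit of the techniques above; the file path and record layout are placeholders:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-accumulator-sketch").getOrCreate()
sc = spark.sparkContext

bad_records = sc.accumulator(0)   # counts malformed lines across executors


def parse(line):
    parts = line.split(",")
    if len(parts) != 3:           # placeholder record layout: 3 fields expected
        bad_records.add(1)        # accumulator update happens on the executors
        return None
    return parts


records = (sc.textFile("hdfs:///data/events.csv")   # placeholder path
           .map(parse)
           .filter(lambda r: r is not None))

# Persist at an explicit storage level so repeated actions reuse the parsed data.
records.persist(StorageLevel.MEMORY_AND_DISK)

print("rows:", records.count())
print("bad records:", bad_records.value)   # accumulator value read on the driver
```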
Created numerous Spark Streaming jobs that pull JSON messages from Kafka topics, parse them in flight using Java
code, and land them on our Hadoop platform.
Used Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common
learner data model, which gets data from Kafka in near real time and persists it into databases.
Experience working on Spark performing ETL using Spark SQL; loaded all datasets into Hive from source CSV
files using Spark.
Experience in working with NoSQL databases like HBase, Cassandra, and MongoDB.
Worked closely with the admin team on configuring Zookeeper and used Zookeeper to coordinate cluster services.
Created views in Tableau Desktop that were published to the internal team for review and further data analysis
and customization using filters and actions.
Environment: Hive, MapReduce, Ambari, Spark, Knox, Spark SQL, Spark Streaming, Scala, Kafka, Zookeeper, UNIX
Shell/Bash Scripting, Oozie, Python, Tableau, YARN, JIRA, JDBC
Nomura Securities, NY Aug 2017 to Jan 2019
Bigdata Developer
Responsibilities:
As a Big Data Developer, I worked on the Hadoop ecosystem, including Hive, HBase, Oozie, Pig, Zookeeper, Spark
Streaming, and MCS (MapR Control System), with the MapR distribution.
Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data
cleaning and pre-processing.
Built code for real-time data ingestion using Java, MapR Streams (Kafka), and Storm.
Involved in various phases of development; analyzed and developed the system following the Agile Scrum
methodology.
Worked on Apache Solr, which is used as an indexing and search engine.
Involved in the development of the Hadoop system and in improving multi-node Hadoop cluster performance.
Worked on analyzing Hadoop stack and different Big data tools including Pig and Hive, HBase database and
Sqoop.
Developed a data pipeline using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
Worked with different data sources like Avro data files, XML files, JSON files, SQL server and Oracle to load data
into Hive tables.
Used J2EE design patterns like Factory pattern & Singleton Pattern.
Used Spark to create structured data from large amounts of unstructured data from various sources.
Implemented Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon
Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
Performed transformations, cleaning and filtering on imported data using Hive, MapReduce, Impala and loaded
final data into HDFS.
Developed Python scripts to find SQL injection vulnerabilities in SQL queries.
Experienced in designing and developing POC’s in Spark using Scala to compare the performance of Spark with
Hive and SQL/Oracle.
Responsible for coding MapReduce program, Hive queries, testing and debugging the MapReduce programs.
Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames, and
loaded the data into Cassandra.
Involved in the process of data acquisition, data pre-processing and data exploration of telecommunication
project in Scala.
Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
Specified the cluster size, resource pool allocation, and Hadoop distribution by writing the specification in
JSON file format.
Imported weblogs and unstructured data using Apache Flume and stored the data in a Flume channel.
Exported event weblogs to HDFS by creating an HDFS sink that directly deposits the weblogs in HDFS.
Used RESTful web services with MVC for parsing and processing XML data.
Utilized XML and XSL Transformation for dynamic web-content and database connectivity.
Involved in loading data from the Unix file system to HDFS. Involved in designing schemas, writing CQL queries,
and loading data using Cassandra.
Built the automated build and deployment framework using Jenkins, Maven etc.
Environment: Hadoop, Hive, HBase, Oozie, Pig, Zookeeper, MapR, HDFS, MapReduce, Java, MS, Jenkins, Agile,
Apache Solr, Apache Flume, Amazon EMR, Spark, Scala, Cassandra, Apache Kafka, MVC
Dataiku Jan 2016 to Jul 2016
Java Developer
Responsibilities: