Mathisha Jeeva
Phone: +1 (540)-753-0890
Email: jvmathisha@gmail.com
Professional Summary:
Sr. Data Engineer with 10+ years of experience building data-intensive applications and creating pipelines
using Python, with strong experience on Amazon Web Services (AWS).
Good experience in software development, including the design and development of enterprise and web-
based applications.
Hands-on technical experience in Python, Java, DB2, SQL and R programming.
Good experience with Python libraries such as Pandas, NumPy, Seaborn, Matplotlib, and Scikit-learn to
analyze data, perform data cleaning and visualization, and build models.
Hands-on experience in developing and deploying enterprise-based applications using major Hadoop
ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark
GraphX, Spark SQL, Kafka.
Experience in evaluating the technology stack for building Cloud-Based Analytics solutions by conducting
research and identifying appropriate strategies, tools, and methodologies for developing end-to-end analytics
solutions and assisting in the development of a technology roadmap for Data Ingestion, Data Lakes, Data
Processing, and Data Visualization.
Experience in designing and building the data management lifecycle covering Data Ingestion, Data Integration,
Data Consumption, and Data Delivery, integrating Reporting, Analytics, and system-to-system integration.
Experience with Amazon Web Services (Amazon EC2, Amazon S3, Amazon RDS, Amazon Elastic Load
Balancing, Amazon SQS, AWS Identity and Access Management, Amazon SNS, Amazon CloudWatch, Amazon
EBS, Amazon CloudFront, VPC, DynamoDB, Lambda, and Redshift).
Good experience in developing AWS centralized logging solutions for security teams to consolidate AWS
logs and analyze them to detect incidents, using Elasticsearch and EC2 server logs.
Experience with Data Extraction, Transformation and Loading (ETL).
Good experience on Data Modelling (Dimensional and Relational) concepts like Star-Schema Modelling,
Snowflake Schema Modelling, and Fact and Dimension Tables.
Strong experience in using Python Integrated IDEs like PyCharm, Sublime Text, and IDLE.
Experience in developing web applications and implementing Model View Controller (MVC) architecture
using the server-side frameworks Django and Flask.
Experience with stream processing using PySpark and Kafka (a minimal sketch appears at the end of this summary).
Working knowledge of Kubernetes to deploy, scale, load balance, and manage Docker containers.
Good experience in Data Extraction, Transformation, and Loading (ETL) using tools such as SQL
Server Integration Services (SSIS) and Data Transformation Services (DTS).
Experience in Database Design and development with Business Intelligence using SQL Server Integration
Services (SSIS), SQL Server Analysis Services (SSAS), OLAP Cubes, Star Schema and Snowflake Schema.
Expertise in creating and enhancing CI/CD pipeline to ensure Business Analysts can build, test, and deploy
quickly.
Extensive knowledge of Exploratory Data Analysis, Big Data analytics using Spark, and predictive analysis
using Linear and Logistic Regression models, with a good understanding of supervised and unsupervised
algorithms.
Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing
and transforming complex data using in-memory computing capabilities written in Scala.
Hands-on experience in visualizing the data using Power BI, Tableau, R (ggplot), Python (Pandas, matplotlib,
NumPy, SciPy).
Proficient in writing Data Analysis Expressions (DAX) in the Tabular data model.
Hands-on experience in designing database schemas by applying normalization.
Experienced in all phases of Software Development Life Cycle (SDLC) including Requirements gathering,
Analysis, Design, Reviews, Coding, Unit Testing, and Integration Testing.
Built machine learning models on safety and risk data to identify factors responsible for injuries.
Analyzed the requirements and developed Use Cases, UML Diagrams, Class Diagrams, Sequence and
State Machine Diagrams.
Excellent communication and interpersonal skills, with the ability to resolve complex business problems.
Direct interaction with clients and business users across different locations on critical issues.
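A minimal PySpark sketch of the Kafka stream processing noted above, assuming a Kafka source and a Parquet sink; the broker address, topic name, and output paths are illustrative placeholders, and the Spark Kafka connector package is assumed to be available on the cluster.

# Minimal sketch: reading a Kafka topic with Spark Structured Streaming (PySpark)
# and persisting the stream as Parquet. Brokers, topic, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Subscribe to a Kafka topic; key and value arrive as binary and are cast to string here.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
    .option("subscribe", "weblogs")                       # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
)

# Write each micro-batch to Parquet; checkpointing makes the sink fault tolerant.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/weblogs/parquet")              # placeholder output path
    .option("checkpointLocation", "/data/weblogs/_chk")   # placeholder checkpoint dir
    .start()
)
query.awaitTermination()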
TECHNICAL SKILLS:
Professional Experience:
Responsibilities:
Responsible for loading structured, unstructured, and semi-structured data into Hadoop by creating static and
dynamic partitions.
Worked with AWS services, CloudWatch, Splunk monitoring, and Qlik Sense.
Worked with different data formats such as JSON and applied machine learning algorithms in Python.
Performed statistical data analysis and data visualization using Python.
Created and ran jobs on the AWS cloud to extract, transform, and load data into AWS Redshift using AWS Glue,
with S3 for data storage and AWS Lambda to trigger the jobs (a sketch of the trigger appears after this list).
Imported real-time weblogs using Kafka as a messaging system and ingested the data into Spark Streaming.
Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to
HDFS using Scala.
Developed a PySpark script to encrypt raw data on client-specified columns using hashing algorithms
(sketched after this list).
Worked on creating data pipelines and transforming data to a structure that is relevant to the problem by
selecting appropriate techniques.
Research and implementation of scalable machine learning models to detect customers with high potential
treatment effects.
Worked on technical feasibility along with developers, architects, and stakeholders.
Involved in AWS development related to metric alarms and dashboards.
Identified data integration points, reconciled data across applications, and developed business process
monitoring.
Applied various ADF data flow transformations such as Data Conversion, Conditional Split, Derived
Column, Lookup, Join, Union, Aggregate, Pivot, and Filter, and performed data flow transformations using the
Data Flow activity.
Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
Worked on creating pipelines, data flows, and complex data transformations and manipulations using ADF
and PySpark with Databricks.
Developed Python programs to consume data from APIs as part of several data extraction processes and store
the data in AWS S3.
Developed Spark applications using PySpark and Spark-SQL for data transformation, and aggregation from
multiple file formats for analyzing & transforming the data to uncover insights into the customer usage
patterns.
Worked on predicting the cluster size, and monitoring and troubleshooting the Spark Databricks cluster.
Worked with IDE tools like Eclipse to update and push extracted data and to create workflows from the
databases.
Automated and validated the created data-driven workflows extracted from the ADF using Apache Airflow.
Scheduled and monitored the data pipelines in ADF and automated jobs using different triggers (Event,
Scheduled, and Tumbling) in ADF.
Worked on maintaining and tracking the changes using version control tools like GIT.
Worked on building data warehouse structures and creating facts, dimensions, and aggregate tables through
dimensional modeling with Star and Snowflake schemas.
Worked on defining the CI/CD process and support test automation framework in the cloud as part of the
build engineering team.
Built Power BI reports and dashboards with interactive capabilities.
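A minimal sketch of the Lambda trigger referenced above, assuming an S3 event notification wired to the function and a Glue job named redshift-load-job; the job name and argument keys are placeholders, not the actual project configuration.

# Minimal sketch (assumptions): an AWS Lambda handler that starts a Glue ETL job
# when a new object lands in S3. Job name and argument keys are placeholders.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # S3 event notification: pull the bucket and key that triggered the function.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Kick off the Glue job, passing the new object location as job arguments.
    response = glue.start_job_run(
        JobName="redshift-load-job",                          # placeholder Glue job name
        Arguments={"--source_path": f"s3://{bucket}/{key}"},  # placeholder argument key
    )
    return {"JobRunId": response["JobRunId"]}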
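A minimal sketch of the column-hashing approach mentioned earlier in this list, assuming SHA-256 via PySpark's built-in sha2 function; the sensitive column list and S3 paths are illustrative placeholders.

# Minimal sketch (assumptions): hashing client-specified columns in a raw DataFrame
# with SHA-256 before writing it back out. Columns and paths are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("column-hashing-sketch").getOrCreate()

SENSITIVE_COLUMNS = ["ssn", "email", "phone"]             # placeholder column list

raw_df = spark.read.parquet("s3://raw-bucket/input/")     # placeholder input path

# Replace each sensitive column with its SHA-256 digest; other columns pass through.
hashed_df = raw_df
for column in SENSITIVE_COLUMNS:
    hashed_df = hashed_df.withColumn(column, F.sha2(F.col(column).cast("string"), 256))

hashed_df.write.mode("overwrite").parquet("s3://curated-bucket/hashed/")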
Environment: AWS, Python, SQL Database, Azure Synapse Analytics, Teradata, Hadoop, HDFS, Sqoop, ETL,
SQL DB, Oracle, SQL Server, PySpark with Databricks, Apache Airflow, Eclipse, GIT
Client: Kone / CognitiveMobile Technology Pvt. Ltd. Dec 2020 – July 2022
Role: Sr. Data Engineer
Responsibilities:
Designed and developed scalable, efficient data pipeline processes to handle data ingestion, cleansing,
transformation, and integration using Sqoop, Hive, Python, and Impala.
Designed and built data pipelines, which involves extracting data from multiple sources, transforming it, and
loading it into appropriate AWS storage such as S3 and Redshift.
Chose and configured AWS services such as Redshift and Lambda to handle data storage, processing, and
analysis based on specific needs.
Developed a data pipeline using Kafka and Storm to store data into HDFS.
Developed data quality checks and monitored and optimized data pipelines.
Collaborated with data analysts and business users to understand their data needs and design solutions.
Created clear documentation using AWS service-specific tools to ensure efficient knowledge transfer and
team understanding.
Identified and mitigated risks.
Implemented Big Data analytics and advanced data science techniques to identify trends, patterns, and
discrepancies in petabytes of data using Hadoop, Python, HDFS, MapReduce, Hive, and machine learning.
Worked on the Scala code base related to Apache Spark, performing actions and transformations on RDDs,
DataFrames, and Datasets using Spark SQL and Spark Streaming contexts.
Built machine learning models, including SVM, random forest, and XGBoost, to score and identify potential
new business cases with Python Scikit-learn.
Experienced in implementing and monitoring solutions with AWS Lambda, S3, Amazon Redshift,
Databricks, and Amazon CloudWatch for scalable and high-performance computing clusters.
Imported data from relational data sources to HDFS and imported bulk data into HBase using Map Reduce
programs.
Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
Proficient in analyzing large unstructured data sets using Pig, and in developing and designing POCs using
MapReduce and Scala and deploying them on the YARN cluster.
Developed simple to complex MapReduce streaming jobs using Python.
Scheduled automated tasks with Oozie for loading data into HDFS through Sqoop and pre-processing the
data with Pig and Hive.
Built models of the data processing using the PySpark and Spark SQL in Databricks to generate insights
for specific purposes.
Created ETL pipelines to load data from multiple data sources into the Databricks Delta Lake in a multi-layer
architecture (a sketch appears at the end of this section).
Experienced in designing and implementing data warehousing and analytics solutions on Snowflake to
store and manage data in a scalable and performant manner.
Worked on AWS Data Pipeline to configure data loads from S3 into Redshift; used AWS components to
download and upload data files (with ETL) to the AWS system via S3, and used AWS Data Pipeline to schedule
an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
Developed Lambda functions and assigned IAM roles to run python scripts along with various triggers
(SQS, SNS).
Designed AWS landing zones that are safe and secure.
Proficient in using Scala and PySpark for big data processing and analytics, leveraging the power of Apache
Spark to handle large volumes of data and deliver insights.
Handled the ETL framework in Spark with Python and Scala for data transformations.
Designed and executed on-prem to AWS cloud migration projects for state agencies.
Performed data pre-processing and feature engineering for further predictive analytics using Python Pandas.
Worked on building data warehouse structures and creating facts, dimensions, and aggregate tables through
dimensional modeling with Star and Snowflake schemas.
Worked on the Oozie workflow engine for job scheduling.
Implemented a continuous delivery pipeline with Docker and GitHub.
Involved in creating/modifying worksheets and data visualization dashboards in Tableau.
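A minimal sketch of the multi-layer Delta Lake load referenced above, assuming a Databricks runtime with Delta available; the source path, table locations, column names, and de-duplication key are placeholders rather than the actual project objects.

# Minimal sketch (assumptions): landing raw source files in a bronze Delta table and
# refining them into a silver layer. Paths, columns, and the dedup key are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("delta-multilayer-sketch").getOrCreate()

# Bronze: land the raw CSV as-is, tagged with an ingestion timestamp.
bronze_df = (
    spark.read.option("header", "true").csv("s3://landing/orders/")   # placeholder source
    .withColumn("_ingested_at", F.current_timestamp())
)
bronze_df.write.format("delta").mode("append").save("/delta/bronze/orders")

# Silver: de-duplicate and cast types for downstream consumption.
silver_df = (
    spark.read.format("delta").load("/delta/bronze/orders")
    .dropDuplicates(["order_id"])                                      # placeholder key
    .withColumn("order_amount", F.col("order_amount").cast("double"))
)
silver_df.write.format("delta").mode("overwrite").save("/delta/silver/orders")

Separating the append-only bronze layer from the cleaned silver layer keeps raw history replayable while giving downstream consumers a consistent, typed dataset.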
Environment: Amazon S3 Buckets, AWS Glue Data Catalog, CloudWatch, Crawler & AWS Athena, Python,
Hadoop, MapReduce, HDFS, HBase, AWS, Redshift, S3, IAM, Lambda, SQS, SNS, Kubernetes, Docker, UNIX,
Scala, Maven, GitHub, Tableau, Agile, Scrum
Responsibilities:
Involved in Data Analysis, Data Profiling, Data Modeling, Data Governance, production support, and
ETL development.
The initial goal of the project is to support the Capital One Auto Finance (COAF) ETL applications that
process data related to Loan Origination, Loan Servicing, and Marketing & Analysis, which are then
completely migrated to the cloud.
Worked on PySpark jobs to load and transform source files from S3 buckets. Used various PySpark DataFrame
functions and implemented custom UDFs to perform bespoke business logic.
Developed a generic PySpark application to handle schema transformation and column projection, with
DataFrame reads from and writes to the Snowflake database.
Involved in migration from on-prem to cloud infrastructure (initial database, compute, as well as the
storage layer).
Involved in production Deployment activities, Production parallel validation & Warranty Support activities.
Responsible for overall team delivery and supporting release activities.
Provided technical guidance to the product owner and actively participated in feature/story grooming sessions
to gather and clarify business requirements.
Developed Automic scripting and created jobs, workflows, and time events as part of the data pipelines from
source to destination.
Developed automated reports that validate the data loads, using Python and PySpark.
Hands-on experience building groups, hierarchies, and sets to create detail-level summary reports and dashboards.
Implemented row-level security for the reports to restrict data related to other dealers.
Onboarded data sets from new sources to the existing data pipelines all the way to the metrics.
Data processing using Spark Jobs with PySpark in Databricks to clean the information and then store or
integrate it into the Enterprise Data Warehouse in BigQuery.
Planned, implemented, and managed the digital signage server and systems. Performed performance and
security load testing for various applications and systems.
Extracted data from multiple well-logging systems to develop machine learning algorithms for problem-
solving.
Developed predictive models using Python & R to predict customer churn and classification of customers.
Used Spark for Parallel data processing and better performances using Python.
Used Spark Streaming to divide streaming data into batches as an input to the Spark engine for batch
processing.
Developed Spark Programs using Scala and performed transformations and actions on RDDs.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala, and Python.
Created Databricks notebooks using PySpark, Scala, and Spark SQL to read and write JSON, CSV, and
Parquet.
Generated various capacity planning reports (graphical) using Python packages like NumPy, SciPy, Pandas,
and Matplotlib.
Worked with Snowflake utilities, SnowSQL, SnowPipe, and Big Data model techniques using Python.
Created ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's
SnowSQL, and wrote SQL queries against Snowflake (see the sketch at the end of this section).
Used Python APIs for extracting daily data from multiple vendors.
Built interactive Power BI dashboards and published Power BI reports utilizing parameters, calculated fields,
table calculations, user filters, action filters, and sets to handle views more efficiently.
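A minimal sketch of the Python-to-Snowflake loading referenced above, assuming the snowflake-connector-python package and a pre-created stage; the account, credentials, stage, and table names are placeholders.

# Minimal sketch (assumptions): loading staged files into Snowflake and validating
# the load with the Snowflake Python connector. Object names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="ETL_USER",              # placeholder credentials
    password="***",
    account="my_account",         # placeholder account identifier
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Load files already staged (e.g. via SnowPipe or PUT) into the target table.
    cur.execute("COPY INTO DAILY_SALES FROM @DAILY_SALES_STAGE FILE_FORMAT = (TYPE = CSV)")
    # Validate the load with a simple row count.
    cur.execute("SELECT COUNT(*) FROM DAILY_SALES")
    print(cur.fetchone()[0])
finally:
    conn.close()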
Environment: AWS S3, RDS, EC2, CloudFormation, Lambda, CloudWatch, EMR, VPC, Python, NumPy,
SciPy, HDInsight, Big Data, Snowflake, SnowSQL, SnowPipe, Spark, Spark Streaming, Spark RDD, PySpark,
Spark SQL
Client: Advice Sync Consulting Pvt.Ltd, India Jun 2011 – Jan 2015
Role: Programming Analyst
Responsibilities:
DLG (Direct Line Group of Insurance) provides an application where customers can directly apply for a new
quote and either buy a policy or save and retrieve it later. There are other channels including Mobile, Web
(branch operations and direct web), Aggregators (price comparison websites), and legacy systems such as
mainframe (UIS) and telesales (call center).
Involved in setting up the Hadoop development environment and configurations.
Coded and unit tested MapReduce programs to aggregate and transform files in HDFS and deliver them to
HBase, over which web applications run. Implemented customized InputFormats, RecordReaders, and
MapReduce design patterns.
Developed Pig scripts for analyzing and preprocessing input files. Wrote Hive queries to analyze the
intermediate output files.
Extended Pig and Hive functionality by writing UDFs. Troubleshot and improved job performance for
optimization and benchmarking.
Investigated business-critical incidents and took timely action to resolve the root cause of issues, ensuring
business as usual (BAU).
Involved in preparing and updating application-related documents such as the system appreciation document
and run book.
Coordinated change management and deployment on the production server for Change Records (CR) and
other major and minor releases.
Worked on and designed a Big Data analytics platform for processing customer interface preferences and
comments using Hadoop, Hive and Pig, and Cloudera.
Importing and exporting data into HDFS and Hive using Sqoop from Oracle and vice versa.
Worked on reading multiple data formats on HDFS using Python.
Analyzed the SQL scripts and designed the solution to implement using Python, and R.
Built data platforms, pipelines, and storage systems using Apache Kafka, Apache Storm, and search
technologies such as Elasticsearch.
Implemented POCs to migrate iterative MapReduce programs into Spark transformations using Python.
Used the advanced Python packages like NumPy, SciPy for various sophisticated numerical and scientific
calculations.
Worked with IAM to set up user roles with corresponding user and group policies using JSON and add
project users to the AWS account with multi factor authentication enabled and least privilege permissions.
Utilized AWS CLI to automate backups of ephemeral data-stores to S3 buckets, EBS and create nightly AMIs
for mission critical production servers as backups.
Worked with EC2, Cloud Watch, Elastic Load Balancing and managing securities on AWS.
Used AWS Lambda to run code without provisioning servers. Ran queries from Python using the Python-MySQL
connector and MySQL database package.
Designed the front end and back end of the application utilizing Python on the Django web framework.
Responsible for configuring and networking of Virtual Private Cloud (VPC), Cloud Front.
Identified the business rules implemented in the complete project and documented them; this information is
used by the business team to write up the requirements.
Validated the UI specs and corrected them to make sure they are in line with the existing system.
Responsible for getting updates from the onshore team, conducting stand up meetings and providing the
updates to the Scrum Master.
Analyzed and documented the details on various external reports which are obtained from external systems
within the company and third-party vendors.
Implemented and enhanced CRUD operations for the applications using the MVC (Model View Controller)
architecture of the Django framework and Python, and conducted code reviews.
Wrote Python modules to extract and load asset data from the MySQL source database (a minimal sketch
follows at the end of this section).
Analyzed the requirements and developed Use Cases, UML Diagrams, Class Diagrams, Sequence and State
Machine Diagrams.
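A minimal sketch of such a MySQL extract module, assuming the mysql-connector-python package; the connection details, table, and column names are illustrative placeholders, not the actual source schema.

# Minimal sketch (assumptions): extracting asset rows from a MySQL source database
# with mysql-connector-python. Connection details, table, and columns are placeholders.
import mysql.connector

def fetch_assets(asset_type):
    conn = mysql.connector.connect(
        host="mysql-source",          # placeholder host
        user="etl_user",
        password="***",
        database="assets_db",
    )
    try:
        cur = conn.cursor(dictionary=True)
        # Parameterized query to avoid SQL injection.
        cur.execute(
            "SELECT asset_id, asset_name, updated_at FROM assets WHERE asset_type = %s",
            (asset_type,),
        )
        return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for row in fetch_assets("server"):
        print(row)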
Environment: Python, Django, Hadoop, HIVE, Sqoop, HBase, Scala, Spark, AWS EC2, Lambda, Cloud Watch,
Elastic Load Balancing (ELB), IAM, Virtual Private Cloud (VPC), Cloud Front, JSON, JavaScript, RESTful
webservice, MySQL, PyUnit, Jenkins, Visio, SQL Server
Education:
Bachelor's in Information Technology from Anna University, India - 2011