
Thanks for having me. I appreciate you taking the time to discuss the Senior Data Engineer opportunity.

I'm looking forward to walking you through my experience so that, afterwards, you can decide whether I'm a good fit for Deriv's challenges and goals.

Before I start, I'd like to apologize if my level of English isn't ideal. In any case, I'll do my best.

My journey as a Data Engineer spans over six years, during which I've focused on the end-to-end lifecycle of data, from ingestion and processing to making it accessible
and valuable for decision-making, which I see aligns closely with treating data as a production asset.

A significant part of my experience, for instance at Orange Madagascar, involved designing, developing, and maintaining data pipelines. This meant working with tools
like Apache NiFi, Apache Kafka, and Talend for data ingestion and ETL processes, but my main mission was actually integrating Apache Spark. We managed large datasets
within Hadoop, Apache Spark, and Delta Lake environments, and I was also responsible for orchestrating these workflows using Apache Airflow and VTOM,
ensuring data was processed efficiently and reliably. This relates directly to the need for building and maintaining data analytics platforms and handling data ingestion and
validation.

I've developed solutions to feed data warehouses, practically with cloud solutions like Google BigQuery and AWS RDS, and
I have hands-on experience creating dashboards and reports using BI tools such as Power BI, Qlik Sense, and Tableau.

My cloud experience is primarily with:

- AWS (services like Lambda for serverless compute, EMR for running big data frameworks such as Apache Spark or Hadoop, and S3)
- GCP (BigQuery, Dataflow, Cloud Storage)

I'm comfortable with infrastructure-as-code principles, having used Terraform for deployments. My work has always
involved batch processing systems, and I have a good command of Python and SQL, which are central to my data engineering toolkit.

During my time as a Data Quality Manager, I was specifically tasked with ensuring the integrity and reliability of our data assets. This involved implementing validation
processes, monitoring data, and collaborating with various teams to resolve issues. This experience would be very valuable in contributing to your data governance
initiatives and maintaining your data catalogue.

In roles at companies like RAVATE, GHANTY and Argos Veterinaire, I further honed my skills in ETL development with Talend and workflow orchestration with Airflow,
often collaborating closely with software engineers and business stakeholders to integrate data solutions and ensure they met business needs.

Throughout these experiences, I've always aimed to be autonomous and proactive, because I have always been assigned to external customers, whether remote, like
RAVATE (Réunion) and Argos (France), or local to Madagascar, like Orange Madagascar. In that context, identifying areas for improvement and taking the initiative to implement
better solutions is one of my goals: I represent a company that I have to promote to our customers, so I'm obliged to give my all on every
problem I take on. I'm passionate about building systems that not only work but are also maintainable and scalable, and I'm excited by the prospect of contributing to a
dynamic Data department like yours, particularly within a company that so clearly prioritizes data-driven decision-making and aims to be a leader in its field.

That's about all I can say about my resume. I think I've covered all my experience. Now I'm ready to answer your questions. Thank you for hearing me out.
Spark:

Essentially, Apache Spark provides a fast, general-purpose, and fault-tolerant distributed computing framework that excels at processing large datasets by leveraging
in-memory computation and a unified engine for various analytical tasks.

1. In-Memory Computation for Speed:

o Spark's primary advantage is its ability to perform computations in memory rather than constantly reading and writing to disk (like traditional MapReduce). This makes
it significantly faster for iterative algorithms (common in machine learning) and interactive data analysis.

2. Resilient Distributed Datasets (RDDs) & Fault Tolerance:

o The fundamental data structure in Spark is the RDD, an immutable, partitioned collection of records that can be operated on in parallel.

o RDDs track their lineage (the sequence of transformations that created them). If a partition of an RDD is lost due to a node failure, Spark can recompute it using this
lineage, providing fault tolerance.

3. Distributed & Parallel Processing:

o Spark distributes data (RDDs/DataFrames/Datasets) across a cluster of machines and executes operations (transformations and actions) in parallel on these distributed
partitions, enabling it to process very large datasets.

4. Unified Analytics Engine (Versatility):

o Spark is not just for batch processing. It provides a unified platform with libraries for:

- Spark SQL: For querying structured and semi-structured data using SQL.

- Spark Streaming: For processing real-time streaming data.

- MLlib (Machine Learning Library): For scalable machine learning algorithms.

- GraphX (Graph Processing Library): For graph analytics.

o This allows users to combine these different types of processing within a single application.

5. Lazy Evaluation:

o Spark transformations (like map, filter) are lazily evaluated. This means they don't execute immediately when defined. Instead, Spark builds up a DAG (Directed Acyclic
Graph) of operations.

o Computations are only triggered when an "action" (like count, collect, save) is called. This allows Spark to optimize the overall execution plan.

6. Support for Multiple Languages:


o Spark provides APIs in Scala (its native language), Python, Java, and R, making it accessible to a wide range of developers and data scientists.
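
To make the lazy-evaluation and unified-API points above concrete, here is a minimal PySpark sketch. It assumes a working Spark installation; the S3 path and column names are hypothetical, used only for illustration.

# Minimal PySpark sketch: transformations build a DAG lazily, an action triggers it.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Hypothetical input path and columns.
events = spark.read.json("s3://example-bucket/events/")

# Transformations: nothing executes yet, Spark only records the plan.
daily_counts = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy(F.to_date("event_ts").alias("day"))
    .count()
)

# Action: this is what triggers the optimized, distributed execution.
daily_counts.show()

spark.stop()

The same DataFrame API works whether the job runs locally or on a cluster; only the deployment configuration changes.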

Kafka:

Essentially, Kafka is built for high-throughput, scalable, durable, and real-time message streaming by distributing data logs across a cluster of servers.

1. Distributed Log: Kafka acts like a massive, ordered, and fault-tolerant log file where messages are appended.

2. Topics & Partitions: Data is organized into categories called "topics." Each topic is split into "partitions" (the actual logs) for parallelism and scalability. Order is
guaranteed within a partition.

3. Producers: Applications that write (publish) messages to Kafka topics.

4. Consumers & Consumer Groups: Applications that read (subscribe to) messages. Consumers track their progress via "offsets." Multiple consumers can form a
"consumer group" to share the load of processing messages from a topic's partitions.

5. Brokers: Servers that form the Kafka cluster, storing data and serving client requests.

6. Replication: Partitions are copied across multiple brokers (leader & followers) for fault tolerance and data durability. If a leader fails, a follower takes over.

7. Data Retention: Messages are kept for a configurable period, allowing re-reading, even if already consumed.

8. Decoupling & Scalability: Producers and consumers are independent, and the system scales horizontally by adding more brokers, partitions, and consumers.
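
As a concrete illustration of producers, consumers, and consumer groups, here is a minimal sketch using the kafka-python client (one of several Python clients). The broker address, topic name, and consumer-group id are assumptions for illustration only.

# Minimal kafka-python sketch: one producer, one consumer in a group.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes JSON messages to the (hypothetical) "orders" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 42, "amount": 19.99})
producer.flush()

# Consumer: member of the "billing" consumer group; offsets track its progress.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)

Adding more consumers with the same group_id spreads the topic's partitions across them, which is how the load-sharing described in point 4 works in practice.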
Airflow:

Essentially, Airflow provides a robust and flexible framework for defining, scheduling, executing, and monitoring complex data pipelines and other workflows
programmatically, with a strong emphasis on code-based definitions and a rich set of integrations.

1. Workflows as Code (Python):

o Workflows (called DAGs - Directed Acyclic Graphs) are defined as Python scripts. This allows for dynamic generation, version control, collaboration, and easy testing.

2. Directed Acyclic Graphs (DAGs):

o Workflows are structured as DAGs. This means they have a clear direction of execution (tasks flow one way) and no circular dependencies, ensuring a finite and
predictable run. Each DAG represents a collection of all the tasks you want to run, organized and showing their relationships.

3. Tasks & Operators:

o Each step or unit of work within a DAG is a "Task."

o "Operators" define what a task actually does. Airflow provides many pre-built operators (e.g., PythonOperator, BashOperator, PostgresOperator, GCPOperator) and
you can create custom ones.

4. Scheduling & Orchestration:

o Airflow's scheduler is responsible for triggering DAG runs based on defined schedules (e.g., daily, hourly, cron expressions) or external events.

o It manages task execution order based on dependencies, handles retries for failed tasks, and monitors the overall workflow progress.

5. Extensibility & Modularity:

o Airflow is highly extensible. You can create custom operators, hooks (to interface with external systems), and plugins to fit specific needs. This modular design allows
you to easily integrate with virtually any system.

6. Rich Web User Interface (UI):

o Airflow provides a comprehensive web UI to visualize DAGs, monitor their progress, view logs, trigger runs manually, clear task statuses, and manage connections to
external systems.

7. Scalability:

o Airflow is designed to scale. While a simple setup can run on a single machine, it supports distributed execution of tasks using executors like Celery or Kubernetes,
allowing it to handle a large number of complex workflows and tasks.
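
To illustrate the workflows-as-code idea above, here is a minimal sketch of a daily DAG using the Airflow 2.x API. The DAG id, task names, and task logic are hypothetical placeholders.

# Minimal Airflow DAG sketch: extract -> validate -> load, scheduled daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def validate_data():
    # Placeholder for a validation step (e.g. row counts, schema checks).
    print("validating extracted data")


with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    validate = PythonOperator(task_id="validate", python_callable=validate_data)
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Dependencies define the DAG edges the scheduler uses for ordering and retries.
    extract >> validate >> load

Because the whole pipeline is a Python script, it can be version-controlled, reviewed, and tested like any other code.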
Unity Catalog:

1. Unified Governance for Data and AI Assets on the Lakehouse:

o Unity Catalog provides a central place to govern not just tables (especially Delta Lake tables; Delta Lake itself is an open-source storage framework), but also files, machine
learning models, and functions across all your Databricks workspaces and, potentially, across clouds.

o Open Connection: It aims to bring structure and governance to data often stored in open formats (like Parquet, Avro, ORC) within your data lake (e.g., on S3, ADLS,
GCS), especially when managed by Delta Lake.

2. Fine-Grained Access Control using Standard SQL:

o It allows you to define access permissions (GRANT/REVOKE) on tables, views, schemas, and catalogs using familiar SQL commands. This provides granular control
over who can access and modify what data.

o Open Connection: While the UC mechanism is Databricks-specific, it uses the well-understood SQL standard for defining these permissions.

3. Automated Data Lineage:

o Unity Catalog automatically captures and visualizes data lineage down to the column level for queries and notebooks run through Databricks. This helps understand
how data is derived, track dependencies, and assess the impact of changes.

o Open Connection: This lineage often tracks data transformations performed by Apache Spark (open source) on data in Delta Lake tables.

4. Data Discovery and Search:

o It provides a searchable catalog of all your data assets, making it easier for users to find and understand the data they need. Users can add comments, tags, and
browse metadata.

5. Centralized Auditing:
o It creates detailed audit logs of actions performed against data and AI assets, such as who accessed what data and when, which is crucial for security and compliance.

6. Secure Data Sharing via Delta Sharing (Open Protocol):

o This is a key area where "open source" comes into play directly. Unity Catalog deeply integrates with Delta Sharing, which is an open protocol (incubated by the
Linux Foundation) for securely sharing live data from your lakehouse with other organizations or internal teams, regardless of their computing platform.

o Unity Catalog acts as the metastore and governance layer to manage what data is shared via this open protocol.

7. Unified Namespace (Three-Level):

o It introduces a standard three-level namespace (catalog.schema.table) to organize data, making it easier to manage and access data consistently across different
workspaces and even clouds (with federation in the roadmap).
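
To illustrate the three-level namespace and SQL-based grants, here is a hedged sketch as it might be run from a Databricks notebook (where a `spark` session is provided). The catalog, schema, table, and group names are hypothetical.

# Hedged Unity Catalog sketch (Databricks notebook context assumed).
# Three-level namespace: catalog.schema.table
spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.sales.orders (
        order_id BIGINT,
        amount   DOUBLE,
        order_ts TIMESTAMP
    )
""")

# Fine-grained access control with standard SQL GRANT/REVOKE on a hypothetical group.
spark.sql("GRANT SELECT ON TABLE analytics.sales.orders TO `data-analysts`")
spark.sql("REVOKE SELECT ON TABLE analytics.sales.orders FROM `data-analysts`")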

Beyond my technical skills, I believe what makes me particularly suited for a role like this is a profound personal drive, largely shaped by my experiences and aspirations
growing up in Madagascar.

My dream has always been to build a future with greater opportunity, not just for myself but also to show what's possible. This isn't just a job for me; it's a
significant step towards that dream, and it translates into an exceptionally strong commitment to making the most of every chance. When you deeply value an
opportunity, you naturally bring a higher level of dedication and a strong work ethic.
