
Thanks for having me. I appreciate you taking the time to discuss the Senior Data Engineer opportunity.

I'm looking forward to walking you through my experience so that, afterwards, you can decide whether I'm a good fit for Deriv's challenges and goals.

Before I start, I'd like to apologize if my level of English isn't ideal. In any case, I'll do my best.

My journey as a Data Engineer spans over six years, during which I've focused on the end-to-end lifecycle of data, from ingestion and processing to making it accessible
and valuable for decision-making, which I see aligns closely with treating data as a production asset.

A significant part of my experience, for instance at Orange Madagascar, involved designing, developing, and maintaining data pipelines. This meant working with tools
like Apache NiFi, Apache Kafka, and Talend for data ingestion and ETL processes, but my main mission was actually integrating Apache Spark. We managed large datasets
within Hadoop, Apache Spark, and Delta Lake environments, and I was also responsible for orchestrating these workflows using Apache Airflow and VTOM,
ensuring data was processed efficiently and reliably. This relates directly to the need for building and maintaining data analytics platforms and handling data ingestion and
validation.

I've developed solutions to feed data warehouses, practically with cloud solutions like Google BigQuery and AWS RDS, and
I have hands-on experience creating dashboards and reports using BI tools such as Power BI, Qlik Sense, and Tableau.

My cloud experience is primarily with:

- AWS (services like Lambda for serverless compute, EMR for running big data frameworks such as Apache Spark or Hadoop, and S3)
- GCP (BigQuery, Dataflow, Cloud Storage)

I'm comfortable with infrastructure-as-code principles, having used Terraform for deployments. My work has always
involved batch processing systems, and I have a good command of Python and SQL, which are central to my data engineering toolkit.

During my time as a Data Quality Manager, I was specifically tasked with ensuring the integrity and reliability of our data assets. This involved implementing validation
processes, monitoring data, and collaborating with various teams to resolve issues. This experience would be very valuable in contributing to your data governance
initiatives and maintaining your data catalogue.

In roles at companies like RAVATE, GHANTY and Argos Veterinaire, I further honed my skills in ETL development with Talend and workflow orchestration with Airflow,
often collaborating closely with software engineers and business stakeholders to integrate data solutions and ensure they met business needs.

Throughout these experiences, I've always aimed to be autonomous and proactive, because I have always been assigned to external customers, whether remote, like
RAVATE (Réunion) and Argos (France), or local to Madagascar, like Orange Madagascar. In that context, identifying areas for improvement and taking the initiative to implement
better solutions is one of my goals: I represent a company that I have to promote to our customers, so I'm obliged to give my all on every
problem I take on. I'm passionate about building systems that not only work but are also maintainable and scalable, and I'm excited by the prospect of contributing to a
dynamic Data department like yours, particularly within a company that so clearly prioritizes data-driven decision-making and aims to be a leader in its field.

That's about all I can say about my resume. I think I've covered all my experience. Now I'm ready to answer your questions. Thank you for hearing me out.
Spark:

Essentially, Apache Spark provides a fast, general-purpose, and fault-tolerant distributed computing framework that excels at processing large datasets by leveraging
in-memory computation and a unified engine for various analytical tasks.

1. In-Memory Computation for Speed:

o Spark's primary advantage is its ability to perform computations in memory rather than constantly reading and writing to disk (like traditional MapReduce). This makes
it significantly faster for iterative algorithms (common in machine learning) and interactive data analysis.

2. Resilient Distributed Datasets (RDDs) & Fault Tolerance:

o The fundamental data structure in Spark is the RDD, an immutable, partitioned collection of records that can be operated on in parallel.

o RDDs track their lineage (the sequence of transformations that created them). If a partition of an RDD is lost due to a node failure, Spark can recompute it using this
lineage, providing fault tolerance.

3. Distributed & Parallel Processing:

o Spark distributes data (RDDs/DataFrames/Datasets) across a cluster of machines and executes operations (transformations and actions) in parallel on these distributed
partitions, enabling it to process very large datasets.

4. Unified Analytics Engine (Versatility):

o Spark is not just for batch processing. It provides a unified platform with libraries for:

- Spark SQL: For querying structured and semi-structured data using SQL.

- Spark Streaming: For processing real-time streaming data.

- MLlib (Machine Learning Library): For scalable machine learning algorithms.

- GraphX (Graph Processing Library): For graph analytics.

o This allows users to combine these different types of processing within a single application.

5. Lazy Evaluation:

o Spark transformations (like map, filter) are lazily evaluated. This means they don't execute immediately when defined. Instead, Spark builds up a DAG (Directed Acyclic
Graph) of operations.

o Computations are only triggered when an "action" (like count, collect, save) is called. This allows Spark to optimize the overall execution plan.

6. Support for Multiple Languages:


o Spark provides APIs in Scala (its native language), Python, Java, and R, making it accessible to a wide range of developers and data scientists.
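
To make the lazy-evaluation and unified-API points above concrete, here is a minimal PySpark sketch. It assumes a working Spark installation; the S3 path and column names are hypothetical, used only for illustration.

# Minimal PySpark sketch: transformations build a DAG lazily, an action triggers it.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Hypothetical input path and columns.
events = spark.read.json("s3://example-bucket/events/")

# Transformations: nothing executes yet, Spark only records the plan.
daily_counts = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy(F.to_date("event_ts").alias("day"))
    .count()
)

# Action: this is what triggers the optimized, distributed execution.
daily_counts.show()

spark.stop()

The same DataFrame API works whether the job runs locally or on a cluster; only the deployment configuration changes.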

Kafka:

Essentially, Kafka is built for high-throughput, scalable, durable, and real-time message streaming by distributing data logs across a cluster of servers.

1. Distributed Log: Kafka acts like a massive, ordered, and fault-tolerant log file where messages are appended.

2. Topics & Partitions: Data is organized into categories called "topics." Each topic is split into "partitions" (the actual logs) for parallelism and scalability. Order is
guaranteed within a partition.

3. Producers: Applications that write (publish) messages to Kafka topics.

4. Consumers & Consumer Groups: Applications that read (subscribe to) messages. Consumers track their progress via "offsets." Multiple consumers can form a
"consumer group" to share the load of processing messages from a topic's partitions.

5. Brokers: Servers that form the Kafka cluster, storing data and serving client requests.

6. Replication: Partitions are copied across multiple brokers (leader & followers) for fault tolerance and data durability. If a leader fails, a follower takes over.

7. Data Retention: Messages are kept for a configurable period, allowing re-reading, even if already consumed.

8. Decoupling & Scalability: Producers and consumers are independent, and the system scales horizontally by adding more brokers, partitions, and consumers.
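
As a concrete illustration of producers, consumers, and consumer groups, here is a minimal sketch using the kafka-python client (one of several Python clients). The broker address, topic name, and consumer-group id are assumptions for illustration only.

# Minimal kafka-python sketch: one producer, one consumer in a group.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes JSON messages to the (hypothetical) "orders" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 42, "amount": 19.99})
producer.flush()

# Consumer: member of the "billing" consumer group; offsets track its progress.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)

Adding more consumers with the same group_id spreads the topic's partitions across them, which is how the load-sharing described in point 4 works in practice.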
Airflow:

Essentially, Airflow provides a robust and flexible framework for defining, scheduling, executing, and monitoring complex data pipelines and other workflows
programmatically, with a strong emphasis on code-based definitions and a rich set of integrations.

1. Workflows as Code (Python):

o Workflows (called DAGs - Directed Acyclic Graphs) are defined as Python scripts. This allows for dynamic generation, version control, collaboration, and easy testing.

2. Directed Acyclic Graphs (DAGs):

o Workflows are structured as DAGs. This means they have a clear direction of execution (tasks flow one way) and no circular dependencies, ensuring a finite and
predictable run. Each DAG represents a collection of all the tasks you want to run, organized and showing their relationships.

3. Tasks & Operators:

o Each step or unit of work within a DAG is a "Task."

o "Operators" define what a task actually does. Airflow provides many pre-built operators (e.g., PythonOperator, BashOperator, PostgresOperator, GCPOperator) and
you can create custom ones.

4. Scheduling & Orchestration:

o Airflow's scheduler is responsible for triggering DAG runs based on defined schedules (e.g., daily, hourly, cron expressions) or external events.

o It manages task execution order based on dependencies, handles retries for failed tasks, and monitors the overall workflow progress.

5. Extensibility & Modularity:

o Airflow is highly extensible. You can create custom operators, hooks (to interface with external systems), and plugins to fit specific needs. This modular design allows
you to easily integrate with virtually any system.

6. Rich Web User Interface (UI):

o Airflow provides a comprehensive web UI to visualize DAGs, monitor their progress, view logs, trigger runs manually, clear task statuses, and manage connections to
external systems.

7. Scalability:

o Airflow is designed to scale. While a simple setup can run on a single machine, it supports distributed execution of tasks using executors like Celery or Kubernetes,
allowing it to handle a large number of complex workflows and tasks.
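
To illustrate the workflows-as-code idea above, here is a minimal sketch of a daily DAG using the Airflow 2.x API. The DAG id, task names, and task logic are hypothetical placeholders.

# Minimal Airflow DAG sketch: extract -> validate -> load, scheduled daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def validate_data():
    # Placeholder for a validation step (e.g. row counts, schema checks).
    print("validating extracted data")


with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    validate = PythonOperator(task_id="validate", python_callable=validate_data)
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Dependencies define the DAG edges the scheduler uses for ordering and retries.
    extract >> validate >> load

Because the whole pipeline is a Python script, it can be version-controlled, reviewed, and tested like any other code.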
Unity Catalog:

1. Unified Governance for Data and AI Assets on the Lakehouse:

o Unity Catalog provides a central place to govern not just tables (especially Delta Lake tables; Delta Lake itself is an open-source storage framework), but also files, machine
learning models, and functions across all your Databricks workspaces and, potentially, across clouds.

o Open Connection: It aims to bring structure and governance to data often stored in open formats (like Parquet, Avro, ORC) within your data lake (e.g., on S3, ADLS,
GCS), especially when managed by Delta Lake.

2. Fine-Grained Access Control using Standard SQL:

o It allows you to define access permissions (GRANT/REVOKE) on tables, views, schemas, and catalogs using familiar SQL commands. This provides granular control
over who can access and modify what data.

o Open Connection: While the UC mechanism is Databricks-specific, it uses the well-understood SQL standard for defining these permissions.

3. Automated Data Lineage:

o Unity Catalog automatically captures and visualizes data lineage down to the column level for queries and notebooks run through Databricks. This helps understand
how data is derived, track dependencies, and assess the impact of changes.

o Open Connection: This lineage often tracks data transformations performed by Apache Spark (open source) on data in Delta Lake tables.

4. Data Discovery and Search:

o It provides a searchable catalog of all your data assets, making it easier for users to find and understand the data they need. Users can add comments, tags, and
browse metadata.

5. Centralized Auditing:
o It creates detailed audit logs of actions performed against data and AI assets, such as who accessed what data and when, which is crucial for security and compliance.

6. Secure Data Sharing via Delta Sharing (Open Protocol):

o This is a key area where "open source" comes into play directly. Unity Catalog deeply integrates with Delta Sharing, which is an open protocol (incubated by the
Linux Foundation) for securely sharing live data from your lakehouse with other organizations or internal teams, regardless of their computing platform.

o Unity Catalog acts as the metastore and governance layer to manage what data is shared via this open protocol.

7. Unified Namespace (Three-Level):

o It introduces a standard three-level namespace (catalog.schema.table) to organize data, making it easier to manage and access data consistently across different
workspaces and even clouds (with federation in the roadmap).
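
To illustrate the three-level namespace and SQL-based grants, here is a hedged sketch as it might be run from a Databricks notebook (where a `spark` session is provided). The catalog, schema, table, and group names are hypothetical.

# Hedged Unity Catalog sketch (Databricks notebook context assumed).
# Three-level namespace: catalog.schema.table
spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.sales.orders (
        order_id BIGINT,
        amount   DOUBLE,
        order_ts TIMESTAMP
    )
""")

# Fine-grained access control with standard SQL GRANT/REVOKE on a hypothetical group.
spark.sql("GRANT SELECT ON TABLE analytics.sales.orders TO `data-analysts`")
spark.sql("REVOKE SELECT ON TABLE analytics.sales.orders FROM `data-analysts`")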

Beyond my technical skills, I believe what makes me particularly suited for a role like this is a profound personal drive, largely shaped by my experiences and aspirations
growing up in Madagascar.

My dream has always been to build a future with greater opportunity, not just for myself but also to show what's possible. This isn't just a job for me; it's a
significant step towards that dream, and it translates into an exceptionally strong commitment to making the most of every chance. When you deeply value an
opportunity, you naturally bring a higher level of dedication and a strong work ethic.
