0% found this document useful (0 votes)
187 views6 pages

Data Report Martin Inline Graphics R8 1

The document describes a unified data architecture that includes components for ingesting data from various sources, storing the data in data warehouses and lakes, and performing queries, analytics, and machine learning on the historical and current data to power dashboards, reports, and embedded analytics. The architecture aims to provide a single platform to extract, transform, load, store, query, analyze and visualize data from diverse systems and applications.

Uploaded by

Bisyron Wahyudi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
187 views6 pages

Data Report Martin Inline Graphics R8 1

The document describes a unified data architecture that includes components for ingesting data from various sources, storing the data in data warehouses and lakes, and performing queries, analytics, and machine learning on the historical and current data to power dashboards, reports, and embedded analytics. The architecture aims to provide a single platform to extract, transform, load, store, query, analyze and visualize data from diverse systems and applications.

Uploaded by

Bisyron Wahyudi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

A Unified Data Infrastructure Architecture

Query and Processing


Ingestion and
Sources Transformation Storage Historical Predictive Output

Connectors Data Warehouse Dashboards


OLTP Databases (Looker, Superset,
(Fivetran, Stitch,
via CDC Matillion)
(Snowflake, BigQuery, Redshift)
Mode, Tableau)

Applications/ERP Embedded
(Oracle, Salesforce,
Data Modeling Analytics
Netsuite, ...) (dbt, LookML)
(Sisense, Looker,
cube.js)
Event Collectors Workflow Data Science Platform
(Segment, Snowplow) Manager (Databricks, Domino, Sagemaker, Dataiku, Augmented
(Airflow, Dagster, DataRobot, Anaconda, ...) Analytics
Prefect)
(Thoughtspot, Outlier,
Anodot, Sisu)
Logs
Data Science and ML Libraries
(Pandas, Numpy, R, Dask, Ray, Spark, ...
Spark Platform Data Lake Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...) App Frameworks
3rd Party APIs (Databricks, EMR) (Plotly Dash, Streamlit)
(e.g., Stripe) Databricks/
Delta Lake, Iceberg, Ad Hoc Query
Python Libs Hudi, Hive Acid
(Pandas, Boto,
Engine
File and Object Dask, Ray, ...) (Presto, Dremio/ Custom Apps
Storage Drill, Impala)
Parquet,
Batch Query ORC, Avro
Engine Real-time
(Hive) Analytics
(Imply/Druid, Altinity/
S3, GCS, Clickhouse, Rockset)
ABS, HDFS

Event Streaming
(Confluent/Kafka,
Pulsar, AWS Kinesis)

Stream
Processing
(Databricks/Spark,
Confluent/Kafka, Flink)

Metadata
Management Quality and Testing Entitlements Observability
and Security (Unravel, Accel Data,
(Collibra, Alation, Hive, (Great Expectations)
(Privacera, Immuta) Fiddler)
Metastore, DataHub, ...)
Interpreting the Architecture
Query and Processing
Ingestion and
Sources Transformation Storage Historical Predictive Output

Generate relevant Extract data from Store data in a Present results of


Provide an interface for analysts and data scientists
business and operational systems format accessible to data analysis to
to derive insights (query)
operational data (E) query & processing internal and
systems external users
Execute queries and data models against stored
Deliver to storage,
data, often using distributed compute (processing)
aligning schemas Optimize for low Embed data models
between source cost, scalability, and into operational
and destination (L) analytic workloads systems and
(e.g., column store) applications
Transform data to a
structure ready for In some cases,
analysis (T) provide additional
data structures or
guarantees Describe what Predict what will
happened in the happen in the future
past (including very
recent past) Build data-driven/
ML applications

Coordinate the flow of data and the execution of computations across the full lifecycle

Ensure proper data quality, performance, and governance of all systems and datasets
Three Common Blueprints

Analytic
1 Modern Business Intelligence
Systems

2 Multimodal Data Processing

Operational
3 AI and ML
Systems
1. Modern Business Intelligence Blueprint
Query and Processing
Ingestion and
Sources Transformation Storage Historical Predictive Output

Connectors Data Warehouse Dashboards


OLTP Databases (Looker, Superset,
(Fivetran, Stitch,
via CDC Matillion)
(Snowflake, BigQuery, Redshift)
Mode, Tableau)

Applications/ERP Embedded
(Oracle, Salesforce,
Data Modeling Analytics
Netsuite, ...) (dbt, LookML)
(Sisense, Looker,
cube.js)
Event Collectors Workflow Data Science Platform
(Segment, Snowplow) Manager (Databricks, Domino, Sagemaker, Dataiku, Augmented
(Airflow, Dagster, DataRobot, Anaconda, ...) Analytics
Prefect)
(Thoughtspot, Outlier,
Anodot, Sisu)
Logs
Data Science and ML Libraries
(Pandas, Numpy, R, Dask, Ray, Spark, ...
Spark Platform Data Lake Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...) App Frameworks
3rd Party APIs (Databricks, EMR) (Plotly Dash, Streamlit)
(e.g., Stripe) Databricks/
Delta Lake, Iceberg, Ad Hoc Query
Python Libs Hudi, Hive Acid
(Pandas, Boto,
Engine
File and Object Dask, Ray, ...) (Presto, Dremio/ Custom Apps
Storage Drill, Impala)
Parquet,
Batch Query ORC, Avro
Engine Real-time
(Hive) Analytics
(Imply/Druid, Altinity/
S3, GCS, Clickhouse, Rockset)
ABS, HDFS

Event Streaming
(Confluent/Kafka,
Pulsar, AWS Kinesis)

Stream
Processing
(Databricks/Spark,
Confluent/Kafka, Flink)

Metadata
Management Quality and Testing Entitlements Observability
and Security (Unravel, Accel Data,
(Collibra, Alation, Hive, (Great Expectations)
(Privacera, Immuta) Fiddler)
Metastore, DataHub, ...)
2. Multimodal Data Processing Blueprint
Query and Processing
Ingestion and
Sources Transformation Storage Historical Predictive Output

Connectors Data Warehouse Dashboards


OLTP Databases (Looker, Superset,
(Fivetran, Stitch,
via CDC Matillion)
(Snowflake, BigQuery, Redshift)
Mode, Tableau)

Applications/ERP Embedded
(Oracle, Salesforce,
Data Modeling Analytics
Netsuite, ...) (dbt, LookML)
(Sisense, Looker,
cube.js)
Event Collectors Workflow Data Science Platform
(Segment, Snowplow) Manager (Databricks, Domino, Sagemaker, Dataiku, Augmented
(Airflow, Dagster, DataRobot, Anaconda, ...) Analytics
Prefect)
(Thoughtspot, Outlier,
Anodot, Sisu)
Logs
Data Science and ML Libraries
(Pandas, Numpy, R, Dask, Ray, Spark, ...
Spark Platform Data Lake Scikit-learn, Pytorch, TensorFlow, Spark ML, XGBoost, ...) App Frameworks
3rd Party APIs (Databricks, EMR) (Plotly Dash, Streamlit)
(e.g., Stripe) Databricks/
Delta Lake, Iceberg, Ad Hoc Query
Python Libs Hudi, Hive Acid
(Pandas, Boto,
Engine
File and Object Dask, Ray, ...) (Presto, Dremio/ Custom Apps
Storage Drill, Impala)
Parquet,
Batch Query ORC, Avro
Engine Real-time
(Hive) Analytics
(Imply/Druid, Altinity/
S3, GCS, Clickhouse, Rockset)
ABS, HDFS

Event Streaming
(Confluent/Kafka,
Pulsar, AWS Kinesis)

Stream
Processing
(Databricks/Spark,
Confluent/Kafka, Flink)

Metadata
Management Quality and Testing Entitlements Observability
and Security (Unravel, Accel Data,
(Collibra, Alation, Hive, (Great Expectations)
(Privacera, Immuta) Fiddler)
Metastore, DataHub, ...)
3. AI and ML Blueprint

Data Transformation Model Training and Development Model Inference

Data Labeling
(Labelbox, Snorkel,
Scale, Sagemaker)

Data Sources
(Data lake + Dataflow Automation
data warehouse + (Airflow, Pachyderm, Elementl, Prefect, Tecton, Kubeflow)
streaming engine)

Query Engines Feature Store Feature Server


(Presto, Hive) (Tecton) (Tecton, Cassandra)

Data Science
Libraries
(Spark, Pandas,
NumPy, Dask)

Data Science Platform Model Batch Predictor


(Jupyter, Databricks, Domino, Sagemaker, DataRobot, Registry (Spark)
H2O, Colab, Deepnote, Noteable) (Algorithmia,
MLflow,
Sagemaker) Online Model Clients
Server
Experiment ML (TF Serving, Ray
Tracking Framework Compiler Serve, Seldon)
(Weights and (Scikit-learn, (TVM)
Biases, Comet, XGBoost, MLlib)
MLflow)
Model
DL Monitoring
Visualization Framework (Fiddler, Arthur,
(Tensorboard, (TensorFlow, Keras, Arize)
Fiddler) PyTorch, H2O,
Hugging Face)

Model Tuning
(Sigopt, hyperopt, RL Libraries
Ray Tune) (Gym, Dopamine,
RLlib, Coach)

Distributed
Processing
(Spark, Ray, Dask,
Distributed TF,
Kubeflow,
Horovod)

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy