Chapter 09 MRS: Huawei's Big Data Platform
Foreword
⚫ This chapter first provides an overview of Huawei's big data platform, MRS,
before looking at its advantages and application scenarios. Then, it
describes some MRS components, including Hudi, HetuEngine, Ranger, and
LDAP+Kerberos authentication. Finally, it illustrates the MRS cloud-native
data lake baseline solution.
2 Huawei Confidential
Objectives
Contents
1. Overview of MRS
2. MRS Components
Trends in the Evolution of Big Data Technology
Huawei Cloud Services
⚫ Huawei Cloud is Huawei's signature cloud service brand. It is a culmination of Huawei's 30-plus years
of expertise in ICT infrastructure products and solutions. Huawei Cloud is committed to providing
stable, secure, and reliable cloud services that help organizations of all sizes grow in an intelligent
world. To complement an already impressive list of offerings, Huawei Cloud is pursuing a vision of
inclusive AI, a vision of AI that is affordable, effective, and reliable for everyone. As a foundation,
Huawei Cloud provides a powerful computing platform and an easy-to-use development platform for
Huawei's full-stack all-scenario AI strategy.
⚫ Huawei aims to build an open, cooperative, win-win cloud ecosystem and to help partners integrate
quickly into the local ecosystem. Huawei Cloud adheres to business boundaries, respects data
sovereignty, does not monetize customer data, and works with partners on joint innovation to
continuously create value for customers and partners.
Huawei Cloud MRS
⚫ MapReduce Service (MRS) is used to deploy and manage Hadoop systems on Huawei Cloud.
⚫ MRS provides enterprise-level big data clusters on the cloud. Tenants have full control over clusters and
can easily run big data components such as Hadoop, Spark, HBase, Kafka, and Storm. MRS is fully
compatible with open-source APIs and combines Huawei Cloud computing and storage with big data
industry experience to provide a full-stack big data platform featuring high performance, low cost,
flexibility, and ease of use. The platform can also be customized based on service requirements,
helping enterprises quickly build a massive data processing system and discover new value and
business opportunities by analyzing and mining massive amounts of data in real time or non-real time.
MRS Highlights
MRS Architecture
Huawei Cloud FusionInsight MRS cloud-native data lake
⚫ Cloud-native architecture for agile building of data lakes
• Easy deployment: one-click cluster creation and service provisioning within 30 minutes
• Agile building: unified data import, metadata management, and security management
• Decoupled storage and compute: independent compute and storage resource expansion for
convenient deployment of gPaaS & AI DaaS services like DataArts Studio and GES
⚫ Three cloud-native data lakes with one architecture
• Offline data lake, real-time data lake, and logical data lake
• Specialized data marts: data warehouse handling within a data lake, shortening the analysis
link and construction period
[Figure: MRS architecture. Analysis scenarios: real-time analytics, offline analysis, interactive
query, real-time retrieval, and multi-modal analysis. Components: ClickHouse, Hive, HBase,
HetuEngine, Kafka, Redis, IoTDB, Spark, Flink, Tez, Elasticsearch, Yarn, and a convergent scheduler,
on top of data lake formation and data. Manager functions: automatic deployment, large cluster
management, tenant management, refined monitoring, superior scheduling, and centralized alarm
management.]
MRS Application Scenarios
MRS in Hybrid Cloud: Data Base of the FusionInsight
Intelligent Data Lake
[Figure: the FusionInsight intelligent data lake (MRS + DWS + DataArts Studio), shown alongside
ModelArts and ROMA, sits on compute, storage, network, and security infrastructure. It provides
data enablement, AI enablement, and application enablement with an open ecosystem, serving
government, enterprise, finance, and Internet customers and supporting ecosystem collaboration
to accumulate and share industry digital assets.]
Hudi
⚫ Hudi is an open-source project that entered the Apache Incubator in 2019 and became a
top-level Apache project in 2020.
⚫ Huawei joined Hudi community development in 2020 and uses Hudi in
FusionInsight.
⚫ Hudi is a data lake table format that makes it possible to update and delete
data, and to consume newly written data, on HDFS. It supports multiple compute engines
and provides insert, update, and delete (IUD) interfaces as well as streaming primitives,
including upsert and incremental pull, over datasets on HDFS.
⚫ Hudi is the file organization layer of the data lake. It manages Parquet files, provides
data lake capabilities and IUD APIs, and supports compute engines.
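The upsert and incremental pull primitives named above can be illustrated with a small in-memory model. This is only a sketch of the semantics, not Hudi's API: the class ToyTable and its methods are hypothetical names, commit times are plain integers, and no files are written.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table with upsert and incremental pull."""
    rows: dict = field(default_factory=dict)     # record key -> (commit_time, payload)
    commits: list = field(default_factory=list)  # ordered commit times (a toy timeline)

    def upsert(self, commit_time, records):
        """Insert new keys and overwrite existing keys in one commit."""
        for key, payload in records.items():
            self.rows[key] = (commit_time, payload)
        self.commits.append(commit_time)

    def delete(self, commit_time, keys):
        """Remove the given keys in one commit."""
        for key in keys:
            self.rows.pop(key, None)
        self.commits.append(commit_time)

    def snapshot(self):
        """Return the latest payload of every live record."""
        return {key: payload for key, (_, payload) in self.rows.items()}

    def incremental_pull(self, since):
        """Return only the live records written after the given commit time."""
        return {key: payload for key, (ts, payload) in self.rows.items() if ts > since}


table = ToyTable()
table.upsert(1, {"u1": {"city": "Shenzhen"}, "u2": {"city": "Paris"}})
table.upsert(2, {"u2": {"city": "Berlin"}, "u3": {"city": "Cairo"}})
table.delete(3, ["u1"])

print(table.snapshot())
print(table.incremental_pull(since=1))
```

A downstream job that remembers the last commit it processed can call incremental_pull with that commit time instead of rescanning the whole table. This toy version does not report deletions to incremental readers.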
Hudi Features
⚫ Supports fast updates through custom indexes.
⚫ Supports snapshot isolation for data write and query.
⚫ Manages file size and layout based on statistics.
⚫ Supports timeline.
⚫ Supports data rollback.
⚫ Supports savepoints for data restoration.
⚫ Merges data asynchronously.
⚫ Optimizes data lake storage using the clustering mechanism.
Hudi Architecture: Batch and Real-Time Data Import, Compatible
with Diverse Components, and Open-Source Storage Formats
⚫ Storage modes
Copy On Write (COW): high read performance and slower write speed than that of MOR
Merge on Read (MOR): high write performance and lower read performance
⚫ Storage formats
The open-source Parquet and HFile formats are supported. The support for ORC is under planning.
⚫ Storage engines
Open-source HDFS and Huawei Cloud Object Storage Service (OBS)
⚫ Views
Read-optimized view
Incremental view
Real-time view
[Figure: Hudi architecture. Data is imported through Spark Streaming, Flink, and batch jobs into
Hudi datasets, which consist of a timeline, data files, an index, and metadata, in COW or MOR
storage mode. Storage formats: Parquet, HFile, ORC. Storage engines: HDFS, OBS. Spark, Flink, Hive,
and HetuEngine read the datasets through the read-optimized, incremental, and real-time views.]
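The trade-off between the two storage modes, and the difference between the read-optimized and real-time views, can be sketched with two toy classes. This is a simplified model under stated assumptions, not Hudi code: the class names are hypothetical, dictionaries stand in for base files, a list stands in for delta logs, and rows_rewritten is just a counter for write cost.

```python
class CopyOnWriteTable:
    """Toy COW table: every write rewrites the base data, so reads need no merge."""

    def __init__(self):
        self.base = {}            # stands in for columnar base files
        self.rows_rewritten = 0   # rough measure of write cost

    def upsert(self, records):
        new_base = dict(self.base)   # copy the existing data
        new_base.update(records)     # apply the changes to the copy
        self.rows_rewritten += len(new_base)
        self.base = new_base

    def read(self):
        return dict(self.base)


class MergeOnReadTable:
    """Toy MOR table: writes append deltas, and readers merge them on demand."""

    def __init__(self):
        self.base = {}
        self.log = []             # appended delta batches
        self.rows_rewritten = 0

    def upsert(self, records):
        self.log.append(dict(records))   # cheap append, no rewrite

    def read_optimized(self):
        return dict(self.base)           # base data only, may lag behind

    def read_real_time(self):
        merged = dict(self.base)
        for delta in self.log:           # merge deltas at read time
            merged.update(delta)
        return merged

    def compact(self):
        self.base = self.read_real_time()
        self.rows_rewritten += len(self.base)
        self.log.clear()


cow, mor = CopyOnWriteTable(), MergeOnReadTable()
for batch in ({"a": 1, "b": 2}, {"b": 3}, {"c": 4}):
    cow.upsert(batch)
    mor.upsert(batch)

stale_view = mor.read_optimized()
fresh_view = mor.read_real_time()
print(cow.read(), cow.rows_rewritten)
print(stale_view, fresh_view, mor.rows_rewritten)
mor.compact()
```

In this model the COW table pays the rewrite cost on every write, while the MOR table defers it to compaction and makes the real-time reader do the merge. That matches the read and write characteristics listed for the two modes.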
A Big Data Ecosystem Requires Interactive Query and
Unified SQL Access
➢ Services require interactive analysis within subseconds or seconds, which may result in redundant data replicas.
➢ Unified SQL access is required due to diversified components in the data lake.
HetuEngine
⚫ HetuEngine is a Huawei-developed high-performance engine for distributed SQL query and data virtualization. Fully
compatible with the big data ecosystem, HetuEngine queries massive data sets within seconds. It supports
heterogeneous data sources, enabling one-stop SQL analysis in the data lake.
⚫ Cross-domain
Cross-domain service entry
Unified authentication and access control
Distributed networking
High-performance cross-domain data transmission at GB/s level
Zero metadata synchronization
Restricted data access
Cross-domain computing pushdown
Fast rollout with simplified configuration
⚫ Cross-source
Data virtualization
Materialized view
UDFs specific for users and data sources
Subquery pushdown
Small table roaming
Compatibility with HQL syntax
Interconnection with common BI tools
⚫ Cloud-native
Centralized O&M of resources and permissions
Visualized and instant data source configuration
Auto scaling
Multi-instance and multi-tenant deployment
Rolling restart without interrupting services
Backup & DR
[Figure: HetuEngine architecture in three layers. Cloud service layer: O&M and monitoring,
resource management, permission management, configuration tuning, data source information
management, and business metadata. Engine layer: multiple compute instances. Data layer:
Hive/HDFS/OBS, ClickHouse, HBase, Elasticsearch, and DWS.]
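The idea of one SQL entry point over several data sources can be illustrated by analogy with Python's built-in sqlite3 module, where ATTACH DATABASE lets a single statement join tables that live in separate databases. This is only an analogy, not HetuEngine or its syntax; the table names, columns, and rows are made up for the sketch.

```python
import sqlite3

# One connection acts as the single SQL entry point.
conn = sqlite3.connect(":memory:")
# A second, separate in-memory database stands in for another data source.
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

conn.execute("CREATE TABLE main.orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE warehouse.customers (customer_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0)],
)
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(10, "EU"), (11, "APAC")],
)

# A single query joins both sources without copying data between them first.
rows = conn.execute(
    """
    SELECT c.region, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN warehouse.customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
print(rows)
```

The point of the sketch is the shape of the query: the caller names tables from both sources in one statement and no intermediate ETL copy is needed. A distributed engine adds pushdown, parallel execution, and access control on top of that idea.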
Open-Source Community Edition vs. HetuEngine
Item: syntax compatibility
Open-source community edition: Presto syntax
HetuEngine: Presto syntax, plus syntax enhancement that is compatible with 90% of HQL scenarios
Ranger
⚫ Apache Ranger offers a centralized security management framework and supports
unified authorization and auditing. It manages fine-grained access control over
Hadoop and related components, such as HDFS, Hive, HBase, Kafka, and Storm.
Users can use the front-end web UI provided by Ranger to configure policies to
control users' access to these components.
Ranger Architecture
Relationship Between Ranger and Other Components
⚫ Ranger provides policy-based access control (PBAC) authorization plug-ins for component servers.
Currently, components such as HDFS, Yarn, Hive, HBase, Kafka, Storm, and Spark2x support Ranger
authorization. More components will be supported in the future.
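Policy-based access control can be sketched as a list of policies that a plug-in evaluates for each request. The model below is a deliberately reduced illustration, not Ranger's policy schema or API: Policy and is_allowed are hypothetical names, resources are matched with shell-style patterns, and real deployments add groups, roles, deny rules, and auditing.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Policy:
    """Hypothetical allow policy: who may do what on which resource of a service."""
    service: str         # for example "hdfs" or "hive"
    resource: str        # shell-style pattern for the resource name
    users: frozenset
    accesses: frozenset  # permitted access types


def is_allowed(policies, user, service, resource, access):
    """Allow the request if at least one policy matches; deny by default."""
    return any(
        p.service == service
        and user in p.users
        and access in p.accesses
        and fnmatchcase(resource, p.resource)
        for p in policies
    )


policies = [
    Policy("hdfs", "/warehouse/sales/*", frozenset({"alice"}), frozenset({"read", "write"})),
    Policy("hive", "sales.orders", frozenset({"alice", "bob"}), frozenset({"select"})),
]

print(is_allowed(policies, "alice", "hdfs", "/warehouse/sales/2024/part-0", "read"))
print(is_allowed(policies, "bob", "hdfs", "/warehouse/sales/2024/part-0", "read"))
print(is_allowed(policies, "bob", "hive", "sales.orders", "select"))
```

Keeping the policies in one place and evaluating them inside each component is what lets a single web UI control access across HDFS, Hive, and the other services.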
LDAP
⚫ LDAP stands for Lightweight Directory Access Protocol. It is a protocol for
accessing directory services, derived from the X.500 standards, and is widely used
to implement centralized account management.
⚫ On the Huawei big data platform, an LDAP server functions as a directory service
system to implement centralized account management.
⚫ LDAP has the following characteristics:
LDAP runs over TCP/IP or other connection-oriented transfer services.
LDAP is an Internet Engineering Task Force (IETF) Standards Track protocol and is specified
in RFC 4510, Lightweight Directory Access Protocol (LDAP): Technical Specification
Road Map.
Kerberos
⚫ Kerberos is a network authentication protocol named after Cerberus, the ferocious three-headed
guard dog of Hades in Greek mythology. The protocol adopts a client–server model and relies on
cryptographic algorithms such as Data Encryption Standard (DES) and Advanced Encryption
Standard (AES). It also provides mutual authentication, so that the client and server
can verify each other's identity.
⚫ To manage access control permissions on data and resources in a cluster, it is recommended
that the cluster be installed in security mode. In security mode, a client application must be
authenticated and a secure session must be established before the application can access
resources in the cluster. MRS uses KrbServers to provide Kerberos authentication for all
components, implementing a reliable authentication mechanism.
Architecture of Huawei Big Data Security Authentication
Scenarios
[Figure: authentication flow among the user, the CAS server, and Kerberos. Recoverable step
labels: 1. Log in. 3. Perform authentication. 3.1 Perform authentication.]
Enhanced Open-Source LDAP+Kerberos Features
⚫ Service Authentication in the Cluster
In an MRS cluster in security mode, services access each other based on the Kerberos security
architecture. When a service (such as HDFS) in the cluster starts, it obtains the corresponding session key
(keytab, used for identity authentication of the application) from Kerberos. If another service (such
as Yarn) needs to access HDFS to add, delete, modify, or query data, it must first obtain the
corresponding ticket-granting ticket (TGT) and service ticket (ST) for secure access.
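The sequence described here (keytab, then TGT, then ST) can be sketched as a toy ticket flow. This is a heavily simplified illustration and not the Kerberos protocol: tickets are only tagged with an HMAC instead of being encrypted, there are no timestamps, lifetimes, or session keys, and ToyKdc, seal, and unseal are hypothetical names.

```python
import hashlib
import hmac
import json
import secrets


def seal(key: bytes, payload: dict) -> dict:
    """Tag a payload so that only holders of the key can verify it."""
    body = json.dumps(payload, sort_keys=True).encode()
    return {"body": body, "tag": hmac.new(key, body, hashlib.sha256).digest()}


def unseal(key: bytes, ticket: dict) -> dict:
    """Verify the tag with the key and return the payload, or raise PermissionError."""
    expected = hmac.new(key, ticket["body"], hashlib.sha256).digest()
    if not hmac.compare_digest(expected, ticket["tag"]):
        raise PermissionError("ticket was not issued for this key")
    return json.loads(ticket["body"])


class ToyKdc:
    """Toy key distribution center holding one long-term key per principal."""

    def __init__(self):
        self.tgs_key = secrets.token_bytes(32)  # key of the ticket-granting service
        self.keytabs = {}                       # principal -> long-term key

    def register(self, principal: str) -> bytes:
        """Create the principal and return its key (the keytab stand-in)."""
        key = secrets.token_bytes(32)
        self.keytabs[principal] = key
        return key

    def issue_tgt(self, principal: str, proof: bytes) -> dict:
        """Issue a TGT if the caller proves knowledge of the principal's key."""
        expected = hmac.new(self.keytabs[principal], principal.encode(), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, proof):
            raise PermissionError("authentication failed")
        return seal(self.tgs_key, {"client": principal})

    def issue_service_ticket(self, tgt: dict, service: str) -> dict:
        """Exchange a valid TGT for a ticket that only the target service can verify."""
        client = unseal(self.tgs_key, tgt)["client"]
        return seal(self.keytabs[service], {"client": client, "service": service})


kdc = ToyKdc()
yarn_key = kdc.register("yarn")
hdfs_key = kdc.register("hdfs")

proof = hmac.new(yarn_key, b"yarn", hashlib.sha256).digest()
tgt = kdc.issue_tgt("yarn", proof)             # step 1: Yarn authenticates and gets a TGT
st = kdc.issue_service_ticket(tgt, "hdfs")     # step 2: the TGT is exchanged for an ST
claims = unseal(hdfs_key, st)                  # step 3: HDFS verifies the ST with its own key
print(claims)
```

The long-term keys never travel with a request: Yarn shows a TGT to obtain the ST, and HDFS accepts the ST because it was sealed with the key HDFS holds.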
A Panorama of the FusionInsight MRS Cloud-Native Data
Lake Baseline Solution in Huawei Cloud Stack
⚫ The MRS data lake solution implements the "three lakes + mart" service scenario to meet customers' requirements in different
phases of data lake construction.
[Figure: panorama of the solution on hybrid cloud.
Data sources: IoT, messages, files, service DBs, and others.
Data integration: real-time loading, real-time synchronization, batch loading, scheduled loading,
and on-demand loading, with CDL as the real-time integration engine.
Real-time stream processing: Kafka (message queue), SparkStreaming/Flink (stream processing
engine), and IoTDB (time series database).
Real-time data lake: Flink SQL (batch-stream convergence).
Offline data lake: Spark, Hive, and HetuEngine (query in the lake) over source data, detail data,
and model data stored as Parquet, ORC, and Hudi.
Logical data lake: HetuEngine (cross-lake interactive query) over data lake A and data lake B.
Specialized data marts: ClickHouse (real-time OLAP), HBase (simple real-time retrieval),
Elasticsearch (complex retrieval), GES (graph database), and Redis (in-memory database).
Data storage: HDFS and OBS, with DR.
Application-layer labels: data management, data cleansing, specialized analytics, mining and
modeling, fixed reports, AI analytics, self-service analysis, list/details query, large-screen
display, and BI applications.]
Offline Data Lake
Data lake: A big data platform holding a vast amount of data in its native format for an enterprise. Access to data and compute
power is granted to users through strict data permission and resource control. In a data lake, one replica of data supports
multidimensional analysis.
Offline: Data is typically stored in the data lake more than 15 minutes after it is generated. Data handled with this
delay is considered offline.
Real-Time Data Lake
Data lake: A big data platform holding a vast amount of data in its native format for an enterprise. Access to data and compute power
is granted to users through strict data permission and resource control. In a data lake, one replica of data supports multidimensional
analysis.
Real-time: Data is stored in the data lake within 1 minute of being generated. Quasi-real-time: Data is stored in
the data lake within 1 to 15 minutes of being generated.
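The latency thresholds that separate the real-time, quasi-real-time, and offline cases can be written as a small classifier. This is a sketch with a hypothetical function name, and how the exact boundaries at 1 minute and 15 minutes are assigned is an assumption, since the definitions do not say.

```python
from datetime import timedelta


def classify_ingestion(delay: timedelta) -> str:
    """Classify how quickly data reaches the data lake after it is generated."""
    if delay < timedelta(0):
        raise ValueError("delay cannot be negative")
    if delay <= timedelta(minutes=1):      # assumption: exactly 1 minute counts as real-time
        return "real-time"
    if delay <= timedelta(minutes=15):     # assumption: exactly 15 minutes counts as quasi-real-time
        return "quasi-real-time"
    return "offline"


for delay in (timedelta(seconds=20), timedelta(minutes=5), timedelta(hours=2)):
    print(delay, classify_ingestion(delay))
```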
Logical Data Lake
Data lake: a big data platform holding a vast amount of data in various formats in an enterprise. It opens data and compute power to
users with strict data permission and resource control.
Logical data lake: a virtual data lake composed of multiple physically dispersed data platforms.
X Bank: Rolling Upgrades, Decoupled Storage-Compute, and
HetuEngine-based Real-Time BI
Pain points
⚫ Clusters on X's big data platform process 100,000+ jobs and store 30+ PB data per day. During
upgrades, traditional solutions require power-off and restart, which affects important services
such as anti-fraud and precision marketing on live networks.
⚫ Traditional big data storage usage exceeds 70%, but the CPU usage is less than 50%. In the
all-in-one solution, compute and storage resources need to be expanded together, which wastes
resources.
⚫ Data in the lakes and warehouses of the traditional big data platform is isolated. Associated
analysis requires complex ETL tasks to process data and then load it to the OLAP mart. This
results in long data links and low analysis efficiency.
Solution
⚫ The MRS rolling upgrade supports sequential upgrades in different batches until all nodes in
the cluster are upgraded to the latest version. It also supports automatic isolation of faulty
nodes during the upgrade. When an upgrade is complete, the faulty nodes are handled.
⚫ The decoupled storage-compute architecture enables on-demand expansion of insufficient
resources only. The traditional three replicas are replaced by the enterprise-level EC 1.2
replicas.
⚫ HetuEngine supports cross-source collaboration and lakehouse collaborative analysis,
preventing unnecessary ETL processes and reducing data migrations.
Benefits
⚫ Rolling upgrades ensure service continuity, which enables continuous evolution based on the
same architecture.
⚫ The decoupled storage-compute architecture improves compute resource usage by 30%+ and
storage resource usage by 100%+, and reduces TCO by 60%.
⚫ Collaborative analysis across data lakes and warehouses reduces ETL by 80% and improves
analysis efficiency by 10x+, reducing required time from minutes to seconds.
[Figure: X Bank architecture. Applications: operation report, feature label, data science, and
others. Data mart (performance, credit, and more). Financial data warehouse on DWS (480+ nodes)
and analysis and mining platform on DWS (240+ nodes), queried through HetuEngine. Unified full
data processing platform on MRS (1,500+ nodes) with batch processing, stream processing, and HDFS
storage of massive raw data. Sources: structured data (OA, ERP, and others) and semi-structured
and unstructured data.]
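The effect of replacing three replicas with erasure coding can be estimated with simple arithmetic. The sketch below assumes that "EC 1.2 replicas" means about 1.2 units of raw storage per unit of logical data, and it uses a made-up logical volume of 30 PB. Whether this calculation is the basis of the reported benefit figures is not stated in the case.

```python
def raw_capacity_pb(logical_pb: float, overhead_factor: float) -> float:
    """Raw storage needed to hold the logical data at a given redundancy overhead."""
    return logical_pb * overhead_factor


logical_pb = 30.0                                      # hypothetical logical data volume
three_replicas = raw_capacity_pb(logical_pb, 3.0)      # every block stored three times
erasure_coded = raw_capacity_pb(logical_pb, 1.2)       # assumption: 1.2x overhead with EC

raw_saving = 1 - erasure_coded / three_replicas        # share of raw capacity no longer needed
usable_gain = three_replicas / erasure_coded - 1       # extra logical data per unit of raw storage

print(f"three replicas: {three_replicas:.1f} PB raw")
print(f"erasure coding: {erasure_coded:.1f} PB raw")
print(f"raw capacity saved: {raw_saving:.0%}")
print(f"usable capacity gain: {usable_gain:.0%}")
```

Under these assumptions the same disks hold 2.5 times as much logical data, which is the general reason erasure coding is attractive when storage, rather than compute, is the bottleneck.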
XX Healthcare Security Administration Built a Unified Offline
Data Lake for Decision-Making
Pain points
⚫ The medical insurance, medicine, and medical treatment systems are siloed. A unified data
platform covering all subjects, objects, services, processes, and data is needed.
⚫ Scattered management and fixed standards cause low efficiency in business handling and
operation.
⚫ It is difficult to detect and prevent violations in medical insurance reimbursement and
insurance fraud.
Solution
⚫ Use MRS to build an offline data lake to store data from different sources, such as medical
insurance, medication, and medical treatment. Build the original data store (ODS), common data
model (CDM), and application data store (ADS) in the offline and real-time computing data areas
to implement full data management and governance and build a unified provincial health insurance
data platform.
⚫ Use unified data standards to implement centralized data governance and association analysis
of data from different sources.
⚫ Build an offline data lake to centrally store video, image, text, and IoT data, providing
real-time computing data areas and real-time data processing capabilities.
Benefits
⚫ A one-stop platform that holds all-domain data is provided for people to handle healthcare
business. The intensive construction reduces data silos and TCO by 30%.
⚫ Medical insurance reimbursement becomes 3 times more efficient, and manual review workloads
and error rates decrease by 80%. People only need to visit the office once to handle business.
⚫ The real-time data computing capability effectively controls vulnerabilities that may breed
medical insurance reimbursement violations and insurance fraud, recovering XX00 million in
economic losses every year and ensuring the sound development of the medical benefits fund (MBF).
[Figure: solution architecture. Applications: macro decision-making, service application, MBF
risk warning, operation monitoring, real-time dashboard, audit and supervision, and real-time
monitoring. Offline data lake (100+ nodes) on the FusionInsight MRS cloud-native data lake, with
the original data layer (ODS), common data model (CDM), and application data store (ADS), offline
computing, real-time computing, real-time access, and unified storage. Domains: medical insurance,
medicine, and healthcare. Provincial data: basic information data, business service data, public
service data, and management data. External exchange through the provincial and national data
exchange platforms with public safety, health commissions, civil affairs, banks, and insurance.]
Quiz
1. What is MRS?
Summary
⚫ This chapter first described Huawei's big data platform, MRS, along with its
advantages and application scenarios. It then went through some MRS
components, including Hudi, HetuEngine, Ranger, and LDAP+Kerberos
security authentication. Finally, this chapter introduced the MRS cloud-
native data lake baseline solution.
Acronyms and Abbreviations
⚫ BI: Business Intelligence
⚫ ETL: Extract, Transform, and Load, a process that involves extracting data,
transforming the data, and loading the data to final targets.
⚫ AI: Artificial Intelligence
⚫ DWS: Data Warehouse Service
⚫ ES: Elasticsearch, distributed full-text search service
⚫ OBS: Object Storage Service
⚫ ORC: Optimized Row Columnar, a self-describing columnar storage format. Apache ORC is a
top-level Apache project.
⚫ COW: Copy On Write
⚫ MOR: Merge On Read
⚫ UDF: User-Defined Function
⚫ TCO: Total Cost of Ownership
⚫ ODS: Operational Data Store
⚫ CDM: Common Data Model
⚫ ADS: Application Data Store
Recommendations
⚫ Huawei Cloud
https://www.huaweicloud.com/intl/en-us/
⚫ Huawei Talent
https://e.huawei.com/en/talent/portal/#/
⚫ Huawei Enterprise Product & Service Support
https://support.huawei.com/enterprise/en/index.html
Thank you.
Bring digital to every person, home, and organization for a fully connected,
intelligent world.