Alibaba Cloud Apsara Stack Enterprise 2105


Alibaba Cloud
Apsara Stack Enterprise

MaxCompute
Operations and Maintenance Guide

Product Version: 2105, Internal: V3.14.0
Document Version: 20221222

Legal disclaimer
Alibaba Cloud reminds you to carefully read and fully understand the terms and conditions of this legal
disclaimer before you read or use this document. If you have read or used this document, it shall be deemed
as your total acceptance of this legal disclaimer.

1. You shall download and obtain this document from the Alibaba Cloud website or other
Alibaba Cloud-authorized channels, and use this document for your own legal business activities only.
The content of this document is considered confidential information of Alibaba Cloud. You shall strictly
abide by the confidentiality obligations. No part of this document shall be disclosed or provided to any
third party for use without the prior written consent of Alibaba Cloud.

2. No part of this document shall be excerpted, translated, reproduced, transmitted, or disseminated by
any organization, company or individual in any form or by any means without the prior written consent of
Alibaba Cloud.

3. The content of this document may be changed because of product version upgrade, adjustment, or
other reasons. Alibaba Cloud reserves the right to modify the content of this document without notice
and an updated version of this document will be released through Alibaba Cloud-authorized channels
from time to time. You should pay attention to the version changes of this document as they occur and
download and obtain the most up-to-date version of this document from Alibaba Cloud-authorized
channels.

4. This document serves only as a reference guide for your use of Alibaba Cloud products and services.
Alibaba Cloud provides this document based on the "status quo", "being defective", and "existing
functions" of its products and services. Alibaba Cloud makes every effort to provide relevant operational
guidance based on existing technologies. However, Alibaba Cloud hereby makes a clear statement that
it in no way guarantees the accuracy, integrity, applicability, and reliability of the content of this
document, either explicitly or implicitly. Alibaba Cloud shall not take legal responsibility for any errors or
lost profits incurred by any organization, company, or individual arising from download, use, or trust in
this document. Alibaba Cloud shall not, under any circumstances, take responsibility for any indirect,
consequential, punitive, contingent, special, or punitive damages, including lost profits arising from the
use or trust in this document (even if Alibaba Cloud has been notified of the possibility of such a loss).

5. By law, all the contents in Alibaba Cloud documents, including but not limited to pictures, architecture
design, page layout, and text description, are intellectual property of Alibaba Cloud and/or its
affiliates. This intellectual property includes, but is not limited to, trademark rights, patent rights,
copyrights, and trade secrets. No part of this document shall be used, modified, reproduced, publicly
transmitted, changed, disseminated, distributed, or published without the prior written consent of
Alibaba Cloud and/or its affiliates. The names owned by Alibaba Cloud shall not be used, published, or
reproduced for marketing, advertising, promotion, or other purposes without the prior written consent of
Alibaba Cloud. The names owned by Alibaba Cloud include, but are not limited to, "Alibaba Cloud",
"Aliyun", "HiChina", and other brands of Alibaba Cloud and/or its affiliates, which appear separately or in
combination, as well as the auxiliary signs and patterns of the preceding brands, or anything similar to
the company names, trade names, trademarks, product or service names, domain names, patterns,
logos, marks, signs, or special descriptions that third parties identify as Alibaba Cloud and/or its
affiliates.

6. Please directly contact Alibaba Cloud for any errors of this document.


Document conventions
Style: Danger
Description: A danger notice indicates a situation that will cause major system changes, faults, physical injuries, and other adverse results.
Example: Danger: Resetting will result in the loss of user configuration data.

Style: Warning
Description: A warning notice indicates a situation that may cause major system changes, faults, physical injuries, and other adverse results.
Example: Warning: Restarting will cause business interruption. About 10 minutes are required to restart an instance.

Style: Notice
Description: A caution notice indicates warning information, supplementary instructions, and other content that the user must understand.
Example: Notice: If the weight is set to 0, the server no longer receives new requests.

Style: Note
Description: A note indicates supplemental instructions, best practices, tips, and other content.
Example: Note: You can use Ctrl + A to select all files.

Style: >
Description: Closing angle brackets are used to indicate a multi-level menu cascade.
Example: Click Settings > Network > Set network type.

Style: Bold
Description: Bold formatting is used for buttons, menus, page names, and other UI elements.
Example: Click OK.

Style: Courier font
Description: Courier font is used for commands.
Example: Run the cd /d C:/window command to enter the Windows system folder.

Style: Italic
Description: Italic formatting is used for parameters and variables.
Example: bae log list --instanceid Instance_ID

Style: [] or [a|b]
Description: This format is used for an optional value, where only one item can be selected.
Example: ipconfig [-all|-t]

Style: {} or {a|b}
Description: This format is used for a required value, where only one item can be selected.
Example: switch {active|stand}


Table of Contents
1.Concepts and architecture
2.O&M commands and tools
2.1. Before you start
2.2. odpscmd commands
2.3. Tunnel commands
2.4. LogView tool
2.4.1. Before you start
2.4.2. LogView introduction
2.4.3. Preliminary knowledge of LogView
2.4.4. Basic operations and examples
2.4.5. Best practices
2.5. Apsara Big Data Manager
3.Routine O&M
3.1. Configurations
3.2. Routine inspections
3.3. Shut down a chunkserver, perform maintenance, and then … clone the chunkserv
3.4. Shut down a chunkserver for maintenance without compromising … the system
3.5. Adjust the virtual resources of the Apsara system in MaxCompute
3.6. Restart MaxCompute services
4.MaxCompute O&M
4.1. Log on to the ABM console
4.2. Business O&M
4.2.1. O&M overview and entry
4.2.2. Project management
4.2.2.1. Project list
4.2.2.2. Project details


4.2.2.3. Encrypt data
4.2.2.4. Grant access permissions on the metadata warehouse
4.2.2.5. Perform disaster recovery
4.2.2.6. Migrate projects
4.2.3. Manage quota groups
4.2.4. Job management
4.2.4.1. Job snapshots
4.2.5. Business optimization
4.2.5.1. Merge small files
4.2.5.2. Compress idle files
4.2.5.3. Analyze resources
4.3. Service O&M
4.3.1. Control service O&M
4.3.1.1. O&M features and entry
4.3.1.2. Control service overview
4.3.1.3. Control service health
4.3.1.4. Instances
4.3.1.5. Control service configuration
4.3.1.6. Metadata warehouse for the control service
4.3.1.7. Stop or start a server role
4.3.1.8. Start AdminConsole
4.3.1.9. Collect service logs
4.3.2. Job Scheduler O&M
4.3.2.1. O&M features and entry
4.3.2.2. Overview
4.3.2.3. Job Scheduler health
4.3.2.4. Quotas
4.3.2.5. Instances


4.3.2.6. Job Scheduler compute nodes
4.3.2.7. Enable and disable SQL acceleration
4.3.2.8. Restart a master node of Job Scheduler
4.3.3. Apsara Distributed File System O&M
4.3.3.1. O&M features and entry
4.3.3.2. Overview
4.3.3.3. Instances
4.3.3.4. Apsara Distributed File System health
4.3.3.5. Apsara Distributed File System storage
4.3.3.6. Change the primary master node of Apsara Distributed File System
4.3.3.7. Clear the recycle bin of Apsara Distributed File System
4.3.3.8. Enable or disable data rebalancing for Apsara Distributed File System
4.3.3.9. Run a checkpoint on a master node of Apsara Distributed File System
4.3.4. Tunnel service
4.3.4.1. O&M features and entry
4.3.4.2. Overview
4.3.4.3. Instances
4.3.4.4. Traffic analysis
4.3.4.5. Restart Tunnel servers
4.4. Cluster O&M
4.4.1. O&M features and entry
4.4.2. Cluster health
4.4.3. Overview
4.4.4. Servers
4.4.5. Scale in and scale out a MaxCompute cluster
4.4.6. Restore environment settings and enable auto repair
4.5. Host O&M
4.5.1. O&M features and entry


4.5.2. Host overview
4.5.3. Host charts
4.5.4. Host health
4.5.5. Host services
5.Common issues and solutions
5.1. View and allocate MaxCompute cluster resources
5.2. Common issues and data skew troubleshooting


1.Concepts and architecture


This topic describes the concepts and architecture of MaxCompute. The architecture and descriptions
are for reference only. They are subject to the released product type and supplementary features.

Figure: Architecture. The legend distinguishes the basic features of MaxCompute, the enhanced features of
MaxCompute, and the features provided by external systems.

Category: Peripheral platforms

MaxCompute supports the following peripheral platforms:

Apsara Uni-manager Management Console: a unified and intelligent O&M platform. For more information, see Apsara Uni-manager Management Console User Guide.
DataWorks: a visualization tool. You can use DataWorks to perform common operations, such as synchronizing data, scheduling jobs, and generating reports. For more information, see DataWorks Technical White Paper.
Apsara Big Data Manager (ABM): provides an easy method for field engineers to manage MaxCompute. For more information, see Apsara Big Data Manager Technical White Paper.
Machine Learning Platform for AI (PAI): a machine learning algorithm platform based on MaxCompute. For more information, see Machine Learning Platform for AI Technical White Paper.
Two-party applications: other Alibaba Cloud services supported by MaxCompute, such as DataV.
Three-party applications: other services that are compatible with MaxCompute.


Category: Tools

MaxCompute supports the following tools:

Tunnel: a tunnel service. MaxCompute allows you to import heterogeneous data into or export the data from MaxCompute by using Tunnel. For more information, see Tunnel in MaxCompute Product Introduction.
MaxCompute Migration Assist (MMA): the data migration tool of MaxCompute. If you use MMA, Meta Carrier is used to access your Hive metastore service and capture Hive metadata. Then, MMA uses the Hive metadata to generate data definition language (DDL) statements and SQL statements of Hive user-defined table-valued functions (UDTFs). The DDL statements are used to create MaxCompute tables and their partitions. The SQL statements of Hive UDTFs are used to migrate data.
Hybrid backup recovery (HBR): integrates data backup and migration capabilities of Apsara Stack.
odpscmd: the MaxCompute client. For more information, see Client in MaxCompute User Guide.
MaxCompute Studio: the big data integrated development environment tool that is provided by MaxCompute. MaxCompute Studio is installed on a developer client. It is a development plug-in that Alibaba Cloud provides for the popular integrated development environment (IDE) IntelliJ IDEA.
DataWorks DataStudio: a visualized development platform provided by DataWorks. For more information, see DataWorks User Guide.

Category: User interfaces

MaxCompute supports the following interfaces:

Interactive languages: CLI, SQL, Python, Java, and Scala.
SDKs and APIs: SDK for Java, SDK for Python, and Java Database Connectivity (JDBC).

For more information, see MaxCompute Developer Guide.

Category: SQL computing capabilities

MaxCompute supports the following SQL computing capabilities:

Enhanced capabilities: support LOAD, parameterized view, lifecycle management, and CLONE TABLE.
User-defined functions (UDFs): include SQL UDFs, Java UDFs, and Python UDFs.
Query: the query operations, such as SELECT and EXPLAIN statements and built-in functions.
Data manipulation language (DML) statements: include INSERT, UPDATE, and DELETE.
DDL statements: allow you to create internal tables, external tables, clustered tables, and partitioned tables.
Basic capabilities: support multiple data types and data formats and allow you to upload resource files.

For more information, see MaxCompute SQL in MaxCompute User Guide.


Category: Computing models

MaxCompute supports the following computing models:

Mars: a tensor-based unified distributed computing framework. Mars can use parallel and distributed computing technologies to accelerate data processing for Python data science stacks. For more information, see Mars in MaxCompute User Guide.
Spark on MaxCompute: a solution developed by Alibaba Cloud to enable the seamless use of Spark on the MaxCompute platform. It supplements a wide variety of features to MaxCompute. For more information, see Spark on MaxCompute in MaxCompute User Guide.
MapReduce on MaxCompute: allows you to run MapReduce jobs on MaxCompute. For more information, see MaxCompute MapReduce in MaxCompute User Guide.
VVP on MaxCompute: encapsulates the features of Realtime Compute for Apache Flink that is developed on the Ververica Platform (VVP) based on MaxCompute resources. You can use the Cupid joint computing platform to complete the operations related to real-time computing by using the underlying storage and computing resources of MaxCompute on the VVP UI. For more information, see VVP On MaxCompute in MaxCompute User Guide.
Graph: a processing framework designed for iterative graph computing. For more information, see MaxCompute Graph in MaxCompute User Guide.

Category: Management

MaxCompute can be managed from the following aspects:

Cost: measures resource usage.
Job: provides mechanisms to manage jobs. For example, you can use these mechanisms to schedule jobs, use LogView to view job information, and set job priorities.
Engine resource: supports high-performance MaxCompute Query Acceleration (MCQA).
Large scale: allows you to deploy MaxCompute clusters across regions.
Lakehouse: a data management platform that combines data lakes and data warehouses. It integrates the flexibility and diverse ecosystems of data lakes with the enterprise-class deployment of data warehouses.

For more information, see MaxCompute Operations and Maintenance Guide.


Category: Compliance governance

MaxCompute allows you to use the following methods for compliance governance:

Security management: allows you to control the permissions of users and roles, and supports multiple authorization methods, such as ACL-based, policy-based, and column-level authorization.
Unified metadata storage: stores metadata in a centralized manner.
Log audit: audits different log data of different users.
Backup and restoration: allows you to back up and restore data from a storage system.
Dynamic data masking: allows you to query data masking rules in DataWorks.
Data quality: DataWorks provides an end-to-end platform that supports quality verification, notification, and management services for various heterogeneous data sources.
Content moderation audit: uses the content moderation engine to identify and audit pornographic, violent, and illegal content.

For more information about security, see MaxCompute Security White Paper.

Category: Data storage

MaxCompute stores data as tables or volumes.

The following figure shows how a MaxCompute job is run.

Figure: Procedure to run a MaxCompute job

The following concepts are involved in the procedure to run a MaxCompute job.

1. MaxCompute instance: the instance of a MaxCompute job. A job is anonymous if it is not defined. A
MaxCompute job can contain multiple MaxCompute tasks. In a MaxCompute instance, you can submit
multiple SQL or MapReduce tasks, and specify whether to run the tasks in parallel or in sequence. This
application is rarely implemented because MaxCompute jobs are not commonly used. In most cases,
an instance contains only one task.


2. MaxCompute task: a specific task in MaxCompute. Almost 20 task types, such as SQL, MapReduce,
Admin, Lot, and Xlib, are supported. The execution logic varies greatly based on the task type.
Different tasks in an instance are differentiated by their task name. MaxCompute tasks run in the
control cluster. Simple tasks, such as metadata modification, can run in the control cluster for their
entire lifecycles. To run computing tasks, submit Fuxi jobs to the compute cluster.
3. Fuxi job: a computing model provided by the Job Scheduler module. A Fuxi job corresponds to a Fuxi
service. A Fuxi job represents a task that can be completed, while a Fuxi service represents a resident
process.
The directed acyclic graph (DAG) scheduling approach can be used to schedule Fuxi jobs. Each job
has a job master to schedule its job resources.
For SQL, Fuxi jobs are divided into offline and online jobs. Online jobs evolve from the service mode
jobs. An online job is also called a quasi-real-time task. An online job is a resident process that can
be executed whenever tasks are available. This reduces the time required for starting and stopping
a job.
You can submit a MaxCompute task to multiple compute clusters. The primary key name of a Fuxi
job is in the format of cluster name + job name.
The JSON plan for Job Scheduler to submit a job and the status of a finished job are stored in
Apsara Distributed File System.
4. Fuxi task: a sub-concept of Fuxi job. Similar to MaxCompute tasks, different Fuxi tasks represent
different execution logic. Fuxi tasks can be linked together as pipes to implement complex logic.
5. Fuxi instance: the instance of a Fuxi task. A Fuxi instance is the smallest unit that can be scheduled by
Job Scheduler. When a task is executed, it is divided into many logical units to improve the processing
speed. Different instances run the same execution logic but work with different input and
output data.
6. Fuxi worker: an underlying concept of Job Scheduler. A worker represents an operating system
process. A worker can be reused by multiple Fuxi instances, but a worker can only handle one instance
at a time.
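
The containment relationships in items 3 to 6 can be sketched as a small data model. The following Python sketch is illustrative only: the class names, field names, and sample values are hypothetical and are not part of any MaxCompute or Job Scheduler API. It captures two rules stated above: a Fuxi job is identified by its cluster name together with its job name, and a worker serves only one Fuxi instance at a time but can be reused afterwards.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FuxiInstance:
    """Smallest schedulable unit: one slice of a Fuxi task's data."""
    task_name: str
    index: int


@dataclass
class FuxiTask:
    """Sub-concept of a Fuxi job; divided into instances when executed."""
    name: str
    instances: List[FuxiInstance] = field(default_factory=list)

    def split(self, count: int) -> None:
        # Every instance runs the same logic on a different slice of data.
        self.instances = [FuxiInstance(self.name, i) for i in range(count)]


@dataclass
class FuxiJob:
    cluster_name: str
    job_name: str
    tasks: List[FuxiTask] = field(default_factory=list)

    @property
    def key(self) -> Tuple[str, str]:
        # The guide identifies a Fuxi job by cluster name + job name.
        return (self.cluster_name, self.job_name)


class FuxiWorker:
    """Models an OS process that serves one Fuxi instance at a time."""

    def __init__(self) -> None:
        self.current: Optional[FuxiInstance] = None

    def start(self, instance: FuxiInstance) -> None:
        if self.current is not None:
            raise RuntimeError("worker is busy with another instance")
        self.current = instance

    def finish(self) -> None:
        # Once the instance finishes, the same worker can be reused.
        self.current = None
```

Representing the job key as a tuple avoids guessing how the two names are joined in the real system, which the guide does not specify.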

Note
InstanceID: the unique identifier of a MaxCompute job. It is commonly used for
troubleshooting. You can construct the LogView of the current instance based on the
project name and instance ID.
Service master or job master: a primary node of the service or job type. The primary node is
responsible for requesting and scheduling resources, creating work plans for workers, and
monitoring workers across their entire lifecycles.

The storage and computing layer of MaxCompute is a core component of the proprietary cloud
computing platform of Alibaba Cloud. As the kernel of the Apsara system, this component runs in the
compute cluster independent of the control cluster. The architecture diagram illustrates only the major
modules.


2.O&M commands and tools


2.1. Before you start
Before using MaxCompute O&M commands and tools, you must be aware of the following information:

During the MaxCompute O&M process, the default account is admin. You must run all commands as an
admin user. You must use your admin account and sudo to run commands that require sudo privileges.

2.2. odpscmd commands


You can use the command line to perform operations and maintenance. You must log on to the
command line tool before you can run commands. The specific procedure is as follows:

1. Log on to the Apsara Infrastructure Management Framework console. In the left-side navigation
pane, choose Operations > Cluster Operations. In the Cluster search box, enter odps to search
for the expected cluster.
2. Click the cluster in the search result. On the Cluster Details page, click the Services tab. In the
Services search box, search for odps-service-computer. Click odps-service-computer in the search
result.
3. After you access the odps-service-computer service, select ComputerInit# on the Service Details
page. In the Actions column corresponding to the machine, click Terminal. In the TerminalService
window that appears, you can perform subsequent command line operations.

Console command directories and configurations


The MaxCompute client is located in the clt folder under the /apsara/odps_tools directory of odpsag.
The client configuration file is located in the conf directory under the clt folder. The access_id,
access_key, end_point, log_view, and tunnel_point parameters are configured by default. You can use
the ./clt/bin/odpscmd command to view information such as the version number in interactive
mode. For example, run the HTTP GET /projects/admin_task_project/system; command to check
the version information of MaxCompute.
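
Before starting the client, it can help to confirm that the parameters named above are actually present in the configuration file. The following Python sketch is a minimal illustration under stated assumptions: it assumes the file holds one key=value pair per line with # comment lines, and every value in the sample is made up. Only the five parameter names come from this guide.

```python
# Parameter names as listed in this guide.
REQUIRED_KEYS = ("access_id", "access_key", "end_point", "log_view", "tunnel_point")


def parse_client_config(text):
    """Parse key=value lines into a dict (file format is an assumption)."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that are not key=value pairs
        config[key.strip()] = value.strip()
    return config


def missing_keys(config):
    """Return the required parameters that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Hypothetical sample content; none of these values are real.
sample = """
# sample client configuration
access_id=my_id
access_key=my_key
end_point=http://service.example.invalid/api
log_view=http://logview.example.invalid
"""
print(missing_keys(parse_client_config(sample)))  # prints ['tunnel_point']
```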

Description of client command options


The following figure shows the client command options.

Figure: Client command options

-e: The MaxCompute client executes the specified SQL statements without entering interactive mode.
--project, -u, and -p: The client directly uses the specified values for the project, user, and pass
parameters. If you do not specify a parameter, the client uses the corresponding value configured in
the conf file.
-k and -f: The client directly executes local SQL files.
--instance-priority: This option is used to assign a priority to the current task. Valid values: 0 to 9.
A lower value indicates a higher priority.
-r: This option indicates the number of times a failed command will be retried. It is commonly used in
scripting jobs.
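
For scripting jobs, the options above are typically combined into one non-interactive call. The following Python sketch assembles such an argument list and validates the priority range stated above. It is an illustration, not the client's own tooling: the function name and the project name are hypothetical, and the exact argument spelling (--project=<name>, --instance-priority=<n>, -r <n>, -e "<statements>") is an assumption that should be checked against the help output of your client version.

```python
import shlex


def build_odpscmd_args(statements, project=None, priority=None, retries=None,
                       client="./clt/bin/odpscmd"):
    """Build an argument list for a non-interactive client call."""
    args = [client]
    if project is not None:
        args.append(f"--project={project}")
    if priority is not None:
        # Valid values are 0 to 9; a lower value means a higher priority.
        if not 0 <= priority <= 9:
            raise ValueError("instance priority must be between 0 and 9")
        args.append(f"--instance-priority={priority}")
    if retries is not None:
        if retries < 0:
            raise ValueError("retry count must not be negative")
        args.extend(["-r", str(retries)])
    # -e runs the given statements without entering interactive mode.
    args.extend(["-e", statements])
    return args


args = build_odpscmd_args("whoami; show p;", project="my_project",
                          priority=1, retries=3)
print(shlex.join(args))  # shell-quoted form, ready to paste into a terminal
```

Returning a list rather than a single string lets the caller pass it to subprocess.run without involving a shell, which avoids quoting problems with the semicolons inside the statement string.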

Commonly used SQL commands for O&M

The following table lists the commonly used commands.

Commonly used commands

whoami: Allows you to view your Apsara Stack tenant account and endpoint information.
show p: Allows you to view information about all instances that have been run.
wait <instanceid>: Allows you to re-generate the LogView and Fuxi job information of a task. To run this command, you must have owner permissions, and the LogView and Fuxi job information must be stored in the same project.
kill <instanceid>: Allows you to terminate specified instances.
tunnel upload/download: Allows you to test whether Tunnel is functioning.
desc project <projectname> -extended: Allows you to view the project usage. The desc command has the following forms:
  desc extended table: allows you to view table information.
  desc table_name partition(pt_spec): allows you to view partition information.
  desc resource $resource_name: allows you to view project resource information.
  desc project $project_name -extended: allows you to view cluster information.
export <project name> <local_file_path>: Allows you to export DDL statements of all tables in a project.
create table <tablename> (...): Allows you to create a table.
select count(*) from <tablename>: Allows you to query the number of rows in a table.
explain: Allows you to generate an execution plan without submitting Fuxi jobs, so that you can view the resources required for tasks.
list: Allows you to list tables, resources, and roles.
show: Allows you to view table and partition information.
purge: Allows you to remove all data from the MaxCompute recycle bin directly to the Apsara Distributed File System recycle bin.
  purge table <tablename>: allows you to purge a single table.
  purge all: allows you to purge all tables from the current project.

2.3. Tunnel commands


The client provides Tunnel commands that implement the original functions of the Dship tool. Tunnel
commands are mainly used to upload or download data.

Tunnel commands

tunnel upload: Allows you to upload data to MaxCompute tables. You can upload files or level-1 directories. Data can only be uploaded to a single table or table partition each time. The destination partition must be specified for partitioned tables.
tunnel download: Allows you to download data from MaxCompute tables. You can only download data to a single file. Only data in one table or partition can be downloaded to one file each time. For partitioned tables, the source partition must be specified.
tunnel resume: If an error occurs because of network or Tunnel service faults, you can resume file or directory transmission after interruption. This command only allows you to resume the previous data upload. Every data upload or download operation is called a session. Run the resume command and specify the ID of the session to be resumed.
tunnel show: Allows you to view historical task information.
tunnel purge: Purges the session directory. Sessions from the last three days are purged by default.

Tunnel commands allow you to view help information by using the help sub-command on the client.
The sub-commands of each Tunnel command are described as follows:

Upload
Imports data of a local file into a MaxCompute table. The following example shows how to use the
sub-commands:


odps@ project_name>tunnel help upload;


usage: tunnel upload [options] <path> <[project.]table[/partition]>
upload data from local file
-acp,-auto-create-partition <ARG> auto create target partition if not
exists, default false
-bs,-block-size <ARG> block size in MiB, default 100
-c,-charset <ARG> specify file charset, default ignore.
set ignore to download raw data
-cp,-compress <ARG> compress, default true
-dbr,-discard-bad-records <ARG> specify discard bad records
action(true|false), default false
-dfp,-date-format-pattern <ARG> specify date format pattern, default
yyyy-MM-dd HH:mm:ss
-fd,-field-delimiter <ARG> specify field delimiter, support
unicode, eg \u0001. default ","
-h,-header <ARG> if local file should have table
header, default false
-mbr,-max-bad-records <ARG> max bad records, default 1000
-ni,-null-indicator <ARG> specify null indicator string,
default ""(empty string)
-rd,-record-delimiter <ARG> specify record delimiter, support
unicode, eg \u0001. default "\r\n"
-s,-scan <ARG> specify scan file
action(true|false|only), default true
-sd,-session-dir <ARG> set session dir, default
D:\software\odpscmd_public\plugins\dship
-ss,-strict-schema <ARG> specify strict schema mode. If false,
extra data will be abandoned and
insufficient field will be filled
with null. Default true
-te,-tunnel_endpoint <ARG> tunnel endpoint
-threads <ARG> number of threads, default 1
-tz,-time-zone <ARG> time zone, default local timezone:
Asia/Shanghai
Example:
tunnel upload log.txt test_project.test_table/p1="b1",p2="b2"
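
The usage line above places the options first, then the local path, then the destination table with an optional partition spec. The following Python sketch assembles a command string in that order so that it can be pasted into the client or generated by a script. The function name is hypothetical, the option names are the short forms from the help output, and quoting the string-valued options is an assumption about what the client accepts.

```python
# Short option names whose values are wrapped in double quotes (assumption).
QUOTED_OPTIONS = {"fd", "rd", "ni", "dfp"}


def tunnel_upload_command(path, table, partition=None, **options):
    """Build a command that follows the usage line:
    tunnel upload [options] <path> <[project.]table[/partition]>
    """
    parts = ["tunnel", "upload"]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # the client expects true|false
        text = f'"{value}"' if name in QUOTED_OPTIONS else str(value)
        parts.extend([f"-{name}", text])
    target = table
    if partition:
        # Partition spec in the form p1="b1",p2="b2", as in the help example.
        spec = ",".join(f'{key}="{val}"' for key, val in partition.items())
        target = f"{table}/{spec}"
    parts.extend([path, target])
    return " ".join(parts) + ";"


print(tunnel_upload_command("log.txt", "test_project.test_table",
                            {"p1": "b1", "p2": "b2"}, h=True, fd=","))
# prints: tunnel upload -h true -fd "," log.txt test_project.test_table/p1="b1",p2="b2";
```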

Parameters:

-acp: indicates whether to automatically create the destination partition if it does not exist. No
destination partition is created by default.
-bs: specifies the size of each data block uploaded with Tunnel. Default value: 100 MiB (MiB = 1024 *
1024B).
-c: specifies the encoding format of the local data file. The help output lists the default as ignore,
in which case the raw source data is used without conversion.
-cp: indicates whether to compress the local data file before it is uploaded to reduce network
traffic. By default, the local data file is compressed before it is uploaded.
-dbr: indicates whether to ignore dirty data (such as additional columns, missing columns, and
columns with mismatched data types).
If this parameter is set to true, all data that does not comply with table definitions is ignored.

If this parameter is set to false, an error is returned when dirty data is found, so that raw data in
the destination table is not contaminated.
-dfp: specifies the DateTime format. Default value: yyyy-MM-dd HH:mm:ss.
-fd: specifies the column delimiter used in the local data file. Default value: comma (,).
-h: indicates whether the data file contains the header. If this parameter is set to true, Dship skips the
header row and starts uploading data from the second row.
-mbr: specifies the maximum number of dirty data rows that are allowed. Default value: 1000. If more
rows of dirty data are found, the upload is terminated.
-ni: specifies the NULL data identifier. Default value: an empty string ("").
-rd: specifies the row delimiter used in the local data file. Default value: \r\n.
-s: indicates whether to scan the local data file. The help output lists the default value as true.
If this parameter is set to true, the system scans the source data first, and then imports the data if
the format is correct.
If this parameter is set to false, the system imports data directly without scanning.
If this parameter is set to only, the system only scans the source data, and does not import the
data after scanning.
-sd: sets the session directory.
-te: specifies the Tunnel endpoint.
-threads: specifies the number of threads. Default value: 1.
-tz: specifies the time zone. Default value: the local time zone, for example, Asia/Shanghai.
Show
Displays historical records. The following example shows how to use the sub-commands:

odps@ project_name>tunnel help show;


usage: tunnel show history [options]
show session information
-n,-number <ARG> lines
Example:
tunnel show history -n 5
tunnel show log

Parameters:

-n: specifies the number of rows to be displayed.

Resume
Resumes the execution of historical operations (only applicable to data upload). The following
example shows how to use the sub-commands:

odps@ project_name>tunnel help resume;


usage: tunnel resume [session_id] [-force]
resume an upload session
-f,-force force resume
Example:
tunnel resume

Download
The following example shows how to use the sub-commands:

odps@ project_name>tunnel help download;


usage: tunnel download [options] <[project.]table[/partition]> <path>
download data to local file
-c,-charset <ARG> specify file charset, default ignore.
set ignore to download raw data
-ci,-columns-index <ARG> specify the columns index(starts from
0) to download, use comma to split each
index
-cn,-columns-name <ARG> specify the columns name to download,
use comma to split each name
-cp,-compress <ARG> compress, default true
-dfp,-date-format-pattern <ARG> specify date format pattern, default
yyyy-MM-dd HH:mm:ss
-e,-exponential <ARG> When download double values, use
exponential express if necessary.
Otherwise at most 20 digits will be
reserved. Default false
-fd,-field-delimiter <ARG> specify field delimiter, support
unicode, eg \u0001. default ","
-h,-header <ARG> if local file should have table header,
default false
-limit <ARG> specify the number of records to
download
-ni,-null-indicator <ARG> specify null indicator string, default
""(empty string)
-rd,-record-delimiter <ARG> specify record delimiter, support
unicode, eg \u0001. default "\r\n"
-sd,-session-dir <ARG> set session dir, default
D:\software\odpscmd_public\plugins\dshi
p
-te,-tunnel_endpoint <ARG> tunnel endpoint
-threads <ARG> number of threads, default 1
-tz,-time-zone <ARG> time zone, default local timezone:
Asia/Shanghai
usage: tunnel download [options] instance://<[project/]instance_id> <path>
download instance result to local file
-c,-charset <ARG> specify file charset, default ignore.
set ignore to download raw data
-ci,-columns-index <ARG> specify the columns index(starts from
0) to download, use comma to split each
index
-cn,-columns-name <ARG> specify the columns name to download,
use comma to split each name
-cp,-compress <ARG> compress, default true
-dfp,-date-format-pattern <ARG> specify date format pattern, default
yyyy-MM-dd HH:mm:ss
-e,-exponential <ARG> When download double values, use
exponential express if necessary.
Otherwise at most 20 digits will be
reserved. Default false
-fd,-field-delimiter <ARG> specify field delimiter, support
unicode, eg \u0001. default ","
-h,-header <ARG> if local file should have table header,
default false
-limit <ARG> specify the number of records to
download
-ni,-null-indicator <ARG> specify null indicator string, default
""(empty string)
-rd,-record-delimiter <ARG> specify record delimiter, support
unicode, eg \u0001. default "\r\n"
-sd,-session-dir <ARG> set session dir, default
D:\software\odpscmd_public\plugins\dshi
p
-te,-tunnel_endpoint <ARG> tunnel endpoint
-threads <ARG> number of threads, default 1
-tz,-time-zone <ARG> time zone, default local timezone:
Asia/Shanghai
Example:
tunnel download test_project.test_table/p1="b1",p2="b2" log.txt
tunnel download instance://test_project/test_instance log.txt

Parameters:

-c: specifies the local data file encoding format. Default value: UTF-8.
-ci: specifies the column index (starting from 0) for downloading. Separate multiple entries with
commas (,).
-cn: specifies the names of columns to be downloaded. Separate multiple entries with commas (,).
-cp, -compress: indicates whether to compress the data before it is downloaded to reduce network
traffic. By default, the data is compressed before it is downloaded.
-dfp: specifies the DateTime format. Default value: yyyy-MM-dd HH:mm:ss.
-e: indicates whether to express DOUBLE values in exponential notation when they are downloaded.
If this parameter is not set, a maximum of 20 digits are retained. Default value: false.
-fd: specifies the column delimiter used in the local data file. Default value: comma (,).
-h: indicates whether the downloaded file includes a table header. Default value: false.

Note -h=true and threads>1 cannot be used together.

-limit: specifies the number of records to be downloaded.
-ni: specifies the NULL data identifier. Default value: an empty string ("").
-rd: specifies the row delimiter used in the local data file. Default value: \r\n.
-sd: sets the session directory.
-te: specifies the Tunnel endpoint.
-threads: specifies the number of threads. Default value: 1.
-tz: specifies the time zone. Default value: Asia/Shanghai.
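
The restriction that -h=true and threads>1 cannot be combined is easy to violate when download commands are generated by a script. The following is a minimal Python sketch that assembles a tunnel download command line and rejects that combination before the command reaches odpscmd. The function name build_download_command is hypothetical, and only the -h, -threads, and -limit flags listed in the help output are used.

```python
def build_download_command(table, path, header=False, threads=1, limit=None):
    """Return a `tunnel download` command line (hypothetical helper).

    Only flags that appear in the help output are emitted.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if header and threads > 1:
        # Documented restriction: -h=true and threads>1 cannot be used together.
        raise ValueError("-h true cannot be combined with -threads greater than 1")
    parts = ["tunnel", "download",
             "-h", "true" if header else "false",
             "-threads", str(threads)]
    if limit is not None:
        # -limit is the number of records to download.
        parts += ["-limit", str(limit)]
    parts += [table, path]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_download_command("test_project.test_table", "log.txt", threads=4))
```

The helper only builds the string. Running it still requires pasting the result into an odpscmd session.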

Purge

Purges the session directory. Sessions from the last three days are purged by default. The following
example shows how to use the sub-commands:

odps@ project_name>tunnel help purge;


usage: tunnel purge [n]
force session history to be purged.([n] days before, default
3 days)
Example:
tunnel purge 5

2.4. LogView tool


2.4.1. Before you start
You must confirm the LogView process status before using LogView. If the process status is off, you
must start the LogView process.

The procedure for querying the process status and starting the process is as follows:

1. Log on to the Apsara Infrastructure Management Framework console. In the left-side navigation
pane, choose Operations > Cluster Operations. In the Cluster search box, enter odps to search
for the expected cluster.
2. Click the cluster in the search result. On the Cluster Details page, click the Services tab. In the Service
search box, search for odps-service-console. Click odps-service-console in the search result.
3. After you access the odps-service-console service, select LogView# on the Service Details page.
In the Actions column corresponding to the machine, click Terminal to open the TerminalService
window.
4. Run the following command to find the Docker container where LogView resides:

docker ps|grep logview

5. Run the following commands to view the LogView process status:

ps -aux|grep logview

netstat -ntulp|grep 9000

6. If the process status is off, run the following command to start the process:

/opt/aliyun/app/logview/bin/control start
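
The netstat check in step 5 can be made stricter than a plain grep, which also matches ports such as 19000. The following is a minimal Python sketch that inspects captured netstat -ntulp output and reports whether a process is listening on the port. It assumes that port 9000 is the LogView port (inferred from the grep in step 5) and that the output uses the usual Linux column layout. The function name is hypothetical.

```python
def is_port_listening(netstat_output, port=9000):
    """Return True if a LISTEN row in `netstat -ntulp` output uses exactly this port."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed TCP row layout:
        # proto recv-q send-q local-address foreign-address state pid/program
        if len(fields) >= 6 and fields[5] == "LISTEN":
            local_address = fields[3]
            # The port is the text after the last colon (works for IPv4 and IPv6 rows).
            if local_address.rsplit(":", 1)[-1] == str(port):
                return True
    return False


if __name__ == "__main__":
    sample = "tcp   0   0 0.0.0.0:9000   0.0.0.0:*   LISTEN   1234/java"
    print(is_port_listening(sample))
```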

2.4.2. LogView introduction


LogView is a tool for checking and debugging a job submitted to MaxCompute. LogView allows you to
check the running details of a job.

LogView functions
LogView allows you to check the running status, details, and results of a job, and the progress of each
phase.

LogView endpoint

Take the odpscmd client as an example. After you submit an SQL task on the client, a long string
starting with logview is returned.
Figure: A long string starting with logview

Remove all carriage return and line feed characters from the string, and then enter the string in the
address bar of the browser.
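
When the string is copied from a terminal that wraps long lines, the line breaks can be stripped programmatically. The following is a minimal Python sketch. The function name is hypothetical and the sample input is a placeholder, not a real LogView string.

```python
def join_logview_string(raw):
    """Remove every carriage return and line feed from a copied LogView string."""
    return raw.replace("\r", "").replace("\n", "")


if __name__ == "__main__":
    # Placeholder text that stands in for a wrapped LogView string.
    wrapped = "logview-first-part\r\n-second-part\n-third-part"
    print(join_logview_string(wrapped))
```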

Composition of a LogView string


A LogView string consists of five parts, as shown in the following figure.
Figure: Composition of a LogView string

2.4.3. Preliminary knowledge of LogView


For complex SQL queries, you must have an in-depth knowledge of the relationships between
MaxCompute tasks and Fuxi instances before you can understand LogView.

In short, a MaxCompute task consists of one or more Fuxi jobs. Each Fuxi job consists of one or more Fuxi
tasks. Each Fuxi task consists of one or more Fuxi instances.
Figure: Relationships between MaxCompute tasks and Fuxi instances
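
The containment hierarchy described above can be pictured as nested records. The following Python sketch models it with dataclasses purely for illustration. The class and field names are hypothetical and do not correspond to any MaxCompute or Fuxi API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FuxiInstance:
    instance_id: str


@dataclass
class FuxiTask:
    name: str
    instances: List[FuxiInstance] = field(default_factory=list)


@dataclass
class FuxiJob:
    name: str
    tasks: List[FuxiTask] = field(default_factory=list)


@dataclass
class MaxComputeTask:
    name: str
    jobs: List[FuxiJob] = field(default_factory=list)


def count_instances(task):
    """Count the Fuxi instances under one MaxCompute task."""
    return sum(len(fuxi_task.instances) for job in task.jobs for fuxi_task in job.tasks)


if __name__ == "__main__":
    # Illustrative names only.
    sql_task = MaxComputeTask("sql_task", [
        FuxiJob("job0", [
            FuxiTask("M1", [FuxiInstance("M1#0"), FuxiInstance("M1#1")]),
            FuxiTask("R2", [FuxiInstance("R2#0")]),
        ]),
    ])
    print(count_instances(sql_task))
```

Reading LogView from the top level down follows the same nesting: task, then Fuxi job, then Fuxi task, then the individual instances.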

The following figures show the relevant information in LogView.

MaxCompute Instance
Figure: MaxCompute Instance

MaxCompute Task
Figure: MaxCompute Task

Task Detail - Fuxi Job
Figure: Task Detail - Fuxi Job (1)
Figure: Task Detail - Fuxi Job (2)

Task Detail - Summary
Figure: Task Detail - Summary

Task Detail - JSONSummary
Figure: Task Detail - JSONSummary

2.4.4. Basic operations and examples


View each point in time in the life cycle of a job
Figure: View each point in time in the life cycle of a job

View the time it takes for Job Scheduler to schedule an instance
Figure: View the time it takes for Job Scheduler to schedule an instance

View the polling interval
Figure: View the polling interval

After a MaxCompute instance is submitted, odpscmd polls the execution status of the job at an
interval of approximately 5 seconds.
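
The polling behavior can be pictured as a fixed-interval loop. The following Python sketch is an illustration of that pattern, not the odpscmd implementation. The get_status callable, the function name, and the use of "Terminated" as the only final status are assumptions made for the example.

```python
import time

# Assumption for this sketch: "Terminated" is the only final status.
TERMINAL_STATES = {"Terminated"}


def wait_for_job(get_status, interval_seconds=5.0, max_polls=None, sleep=time.sleep):
    """Poll get_status() at a fixed interval until a terminal status is seen.

    Returns the final status and the number of polls that were needed.
    """
    polls = 0
    while True:
        status = get_status()
        polls += 1
        if status in TERMINAL_STATES:
            return status, polls
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("job did not terminate within %d polls" % max_polls)
        sleep(interval_seconds)


if __name__ == "__main__":
    statuses = iter(["Running", "Running", "Terminated"])
    # A no-op sleep keeps the demonstration instant.
    print(wait_for_job(lambda: next(statuses), sleep=lambda seconds: None))
```

With a 5-second interval, a job that finishes between two polls is reported up to about 5 seconds late, which is worth keeping in mind when reading job timelines.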

Check for data skews
Figure: Check for data skews

View the UDF and MR debugging information
Figure: View the UDF and MR debugging information (1)
Figure: View the UDF and MR debugging information (2)

View the task status - Terminated
Figure: View the task status - Terminated
2.4.5. Best practices


Locate LogView based on the instance ID
After you submit a job, you can press Ctrl+C to return to odpscmd and perform other operations. You
can run the wait <instanceid>; command to locate LogView and obtain the job status.
Figure: Locate LogView based on the instance ID

Locate running tasks

After you exit the control window, you can run the show p; command to locate currently running
tasks and historical tasks.
Figure: Locate running tasks

2.5. Apsara Big Data Manager


Apsara Big Data Manager (ABM) supports O&M on big data services from the perspectives of business,
services, clusters, and hosts. You can also update big data services, customize alert configurations, and
view the O&M history in the ABM console.

On-site Apsara Stack engineers can use ABM to easily manage big data services by performing actions,
such as viewing resource usage, checking and handling alerts, and modifying configurations.

For more information about how to log on to the ABM console and perform O&M operations in the
console, see MaxCompute O&M.

3. Routine O&M
3.1. Configurations
MaxCompute configurations are stored in the /apsara/odps_service/deploy/env.cfg file in
odpsag. The configuration file contains the following content:

odps_worker_num=3
executor_worker_num=3
hiveserver_worker_num=3
replication_server_num=3
messager_partition_num=3

You can modify these parameter values based on your requirements, and then restart the corresponding
MaxCompute services so that they start with the configured values. For more information, see Restart
MaxCompute services.
If you add xstream_max_worker_num=3 at the end of the configuration file, XStream is started
with three running workers.
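
Before restarting services, it can help to read the worker counts back from the file and confirm that they are what you expect. The following is a minimal Python sketch of a key=value parser for the env.cfg content shown above. It assumes that every value is an integer and that lines starting with # or -- are comments. Both are assumptions, and the function name is hypothetical.

```python
def parse_env_cfg(text):
    """Parse key=value lines from env.cfg content into a dict of integers."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Assumption: blank lines and lines starting with "#" or "--" carry no settings.
        if not line or line.startswith("#") or line.startswith("--"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue
        # Spaces around "=" are tolerated, for example "odps_worker_num = 2".
        settings[key.strip()] = int(value.strip())
    return settings


if __name__ == "__main__":
    sample = "odps_worker_num=3\nexecutor_worker_num=3\nxstream_max_worker_num=3\n"
    print(parse_env_cfg(sample))
```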

3.2. Routine inspections


1. On the Cluster Operations page in Apsara Infrastructure Management Framework, check whether all
machines have reached the desired state.
i. Log on to the Apsara Infrastructure Management Framework console. In the left-side navigation
pane, choose Operations > Cluster Operations. In the Cluster search box, enter odps and click
the search icon to search for the expected cluster.
ii. Check whether all machines have reached the desired state based on the information in the Status,
Machine Status, and Server Role Status columns. The following figure shows that some machines
have not reached the desired state.
iii. Click the exceptions in the Machine Status and Server Role Status columns to view the exception
details.
2. Use /home/admin/odps/odps_tools/clt/bin/odpscmd -e to run the following
statement:

select count(*) from datahub_smoke_test;

The following figure shows that the Fuxi job is running. The command output indicates that the Fuxi job
functions properly.

3. Run the following commands to check whether the following workers exist and whether they have
been restarted recently:
i. r swl Odps/MessagerServicex

ii. r swl Odps/OdpsServicex

iii. r swl Odps/HiveServerx

iv. r swl Odps/QuotaServicex

v. r swl Odps/ReplicationServicex

4. Run the following command to check for errors:

puadmin lscs |grep -vi NORMAL|grep -vi DISK_OK

5. Run the following commands to check data integrity:


i. puadmin fs -abnchunk -t none

ii. puadmin fs -abnchunk -t onecopy

iii. puadmin fs -abnchunk -t lessmin

6. Log on to the machine where Apsara Name Service and Distributed Lock Synchronization System
resides, and run the following command to check the mode of the node:

echo srvr | nc localhost 10240 | grep Mode

Examples:

tj_show -r nuwa.NuwaZK#>/tmp/nuwa;pssh -h /tmp/nuwa -i "echo srvr | nc localhost 10240 | grep Mode"

7. Run the following commands to check whether Apsara Distributed File System functions properly:

puadmin gems

puadmin gss

8. Perform daily inspections in Apsara Big Data Manager (ABM) to check disk usage.
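
The error check in step 4 keeps only the lines that mention neither NORMAL nor DISK_OK, so an empty result means that nothing needs attention. When the puadmin lscs output is captured by an inspection script, the same filter can be applied in Python, as in the following minimal sketch. The function name is hypothetical, and the sample lines are placeholders because the real output format is not shown in this guide.

```python
def abnormal_lines(lscs_output):
    """Mirror `grep -vi NORMAL | grep -vi DISK_OK` on captured output.

    Unlike grep, blank lines are also dropped so that an empty list means "healthy".
    """
    remaining = []
    for line in lscs_output.splitlines():
        upper = line.upper()
        if "NORMAL" in upper or "DISK_OK" in upper:
            continue
        if line.strip():
            remaining.append(line)
    return remaining


if __name__ == "__main__":
    # Placeholder lines; the real puadmin lscs layout may differ.
    sample = "cs-host-1 NORMAL\ncs-host-1 disk1 DISK_OK\ncs-host-2 disk3 DISK_ERROR\n"
    print(abnormal_lines(sample))
```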

3.3. Shut down a chunkserver, perform maintenance, and then clone the chunkserver
Prerequisites
A customer has asked to fix a faulty instance of odps_cs and clone a new one.
You must inform the customer that this operation will temporarily render a chunkserver in the cluster
unavailable, but will not affect the overall operation of the service.
All MaxCompute services have reached the desired state and are functioning properly.
All services on the OPS1 server have reached the desired state and are functioning properly.
You must ensure that the disk space available is sufficient for data migration triggered when a node
goes offline.
If the primary node exists on the machine to be brought offline, you must ensure that services are
switched from the primary node to the secondary node.

Procedure

1. In Apsara Infrastructure Management Framework, find ComputerInit# in the odps-service-computer
service of the odps cluster, and open the corresponding TerminalService window. Run the following
commands to check the data integrity of Apsara Distributed File System:

puadmin abnchunk fs -t none
-- Check for any missing files. If no output is displayed, no files are missing.
puadmin abnchunk fs -t onecopy
-- Check for files that have only one copy. If no output is displayed, no file is down to a single copy.
puadmin abnchunk fs -t lessmin
-- Check for files that have fewer copies than the minimum number of backups. If no output is
-- displayed, every file has at least the minimum number of copies.

2. Add the machine to be shut down to a Job Scheduler blacklist.

i. Run the following command to enable the blacklisting function of Job Scheduler (ignore this step if
the function has been enabled):

/apsara/deploy/rpc_caller --Server=nuwa://localcluster/sys/fuxi/master/ForClient --Method=/fuxi/SetGlobalFlag --Parameter={\"fuxi_Enable_BadNodeManager\":false}

ii. Run the following command to check the hostnames in the existing blacklist:

/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster get

iii. Run the following command to add the machine to be shut down to the blacklist:

/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster add $hostname

iv. Run the following command to check whether the machine to be shut down is already included in
the blacklist:

/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster get

3. Shut down the machine, perform maintenance, and then restart the machine.

Note Do not compromise the system during maintenance.

4. Run the following commands to remove the machine from the Job Scheduler blacklist and confirm that
it is no longer listed:

/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster remove $hostname

/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster get

5. Set the status of rma to pending for the faulty machine.

i. Log on to the OPS1 server. Set the status of the rma action to pending for the faulty machine. The
hostname of the faulty machine is m1.

Run the following command:

curl "http://127.0.0.1:7070/api/v5/SetMachineAction?hostname=m1" -d '{"action_name":"rma", "action_status":"pending"}'

The command output is as follows:

{
"err_code": 0,
"err_msg": "",
"data": [
{
"hostname": "m1"
}
]
}

ii. Run the following command to configure the audit log:

curl "http://127.0.0.1:7070/api/v5/AddAuditLog?object=/m/m1&category=action" -d '{"category":"action", "from":"tianji.HealingService#", "object":"/m/m1", "content": "{\n \"action\" : \"/action/rma\",\n \"description\" : \"/monitor/rma=error, mtime: 1513488046851649\",\n \"status\" : \"pending\"\n}\n" }'

The mtime parameter, which represents action_description@mtime, is set to 1513488046851649 in
the example. Set the parameter to the current system time when you configure the audit log. Run
the following command to query the mtime value:

curl "http://127.0.0.1:7070/api/v5/GetMachineInfo?hostname=m1&attr=action_name,action_status,action_description@mtime"

The command output is as follows:

{
"err_code": 0,
"err_msg": "",
"data": {
"action_description": "",
"action_description@mtime": 1516168642565661,
"action_name": "rma",
"action_name@mtime": 1516777552688111,
"action_status": "pending",
"action_status@mtime": 1516777552688111,
"hostname": "m1",
"hostname@mtime": 1516120875605211
}
}

6. Wait for approval.

i. Wait until the status of the rma action becomes approved or doing on the machine. Check the
action status.
Run the following command to obtain the machine information:

curl "http://127.0.0.1:7070/api/v5/GetMachineInfo?hostname=m1"

Command output:

A large amount of information is returned. Locate the action_status keyword, for example
"action_status": "pending".

ii. Check the SR approval status on the machine. pending indicates that the SR is awaiting approval.
approved, doing, or done indicates that the SR has been approved. If no action was taken, the SR
was not approved.
Run the following query command:

curl "http://127.0.0.1:7070/api/v5/GetMachineInfoPackage?hostname=m1&attr=sr.id,sr.action_name,sr.action_status"

Command output: A large amount of information is returned. You can also view items in the doing
state on the webpage.
7. Shut down the machine when the status of rma becomes approved or doing. After the maintenance
is completed, start the machine.

Note If you need to clone the machine after the maintenance is completed, proceed with
the next step. Otherwise, skip the next step.

8. Clone the machine.
i. After the maintenance is completed, run the following command to clone the machine on the
OPS1 server:

curl "http://127.0.0.1:7070/api/v5/SetMachineAction?hostname=m1&action_name=rma&action_status=doing" -d '{"action_name":"clone", "action_status":"approved", "action_description":"", "force":true}'

The command output is as follows:

{
"err_code": 0,
"err_msg": "",
"data": [
{
"hostname": "m1"
}
]
}

ii. Access the clone container. Run the following commands to check the clone status and confirm
whether the clone operation takes effect.
a. Run the following command to query the clone container:

docker ps|grep clone

The command output is as follows:

18c1339340ab reg.docker.god7.cn/tianji/ops_service:1f147fec4883e082646715cb79c3710f7b2ae9c6e6851fa9a9452b92b4b3366a ops.OpsClone__.clone.1514969139

b. Run the following command to log on to the container. Replace the container ID with the ID
returned in the previous step:

docker exec -it 18c1339340ab bash


c. Run the following command to query the clone task:

/home/tops/bin/python /root/opsbuild/bin/opsbuild.py acli list --status=ALL -n 10000 | vim -

9. Run the following command to restore the machine status:

curl "http://127.0.0.1:7070/api/v5/SetMachineAction?hostname=m1&action_name=rma" -d '{"action_name":"rma","action_status":"done", "force":true}'

10. Check the machine status through the command or Apsara Infrastructure Management Framework. If
the status is GOOD, the machine is normal.
Run the following command to check the machine status:

curl "http://127.0.0.1:7070/api/v5/GetMachineInfo?hostname=m1&attr=state,hostname"

11. Check whether the cluster has reached the desired state. Ensure that all services on the machine
being brought online have reached the desired state.
12. Run the following commands to remove the machine from the Job Scheduler blacklist:

/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster remove $hostname


/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster get
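
The approval wait in steps 6 and 7 comes down to reading action_name and action_status from the GetMachineInfo response. The following is a minimal Python sketch that evaluates a captured response body. It assumes the JSON layout shown in the sample output of step 5, and the function name is hypothetical.

```python
import json

# Statuses after which the procedure allows the machine to be shut down.
APPROVED_STATES = {"approved", "doing"}


def rma_ready_for_shutdown(response_text):
    """Return True if the response reports the rma action as approved or doing."""
    payload = json.loads(response_text)
    if payload.get("err_code") != 0:
        raise RuntimeError("request failed: %s" % payload.get("err_msg"))
    data = payload.get("data", {})
    return data.get("action_name") == "rma" and data.get("action_status") in APPROVED_STATES


if __name__ == "__main__":
    body = ('{"err_code": 0, "err_msg": "", "data": '
            '{"action_name": "rma", "action_status": "pending", "hostname": "m1"}}')
    print(rma_ready_for_shutdown(body))
```

The function only interprets a response that has already been fetched, for example with the curl command in step 6.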

3.4. Shut down a chunkserver for maintenance without compromising the system
Prerequisites

Check that all MaxCompute services have reached the desired state and are functioning properly.

Procedure
1. In Apsara Infrastructure Management Framework, locate ComputerInit# in the odps-service-computer
service of the odps cluster, and open the corresponding TerminalService window. Run the
following commands to check the data integrity of Apsara Distributed File System:

puadmin abnchunk fs -t none
-- Check for any missing files. If no output is displayed, no files are missing.
puadmin abnchunk fs -t onecopy
-- Check for files that have only one copy. If no output is displayed, no file is down to a single copy.
puadmin abnchunk fs -t lessmin
-- Check for files that have fewer copies than the minimum number of backups. If no output is
-- displayed, every file has at least the minimum number of copies.

2. Add the machine to be shut down to a Job Scheduler blacklist.

i. Run the following command to enable the blacklisting function of Job Scheduler (ignore this step if
the function has been enabled):

/apsara/deploy/rpc_caller --Server=nuwa://localcluster/sys/fuxi/master/ForClient --Method=/fuxi/SetGlobalFlag --Parameter={\"fuxi_Enable_BadNodeManager\":false}

ii. Run the following command to check the hostnames in the existing blacklist:

/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster get

iii. Run the following command to add the machine to be shut down to the blacklist:

/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster add $hostname

iv. Run the following command to check whether the machine to be shut down is already included in
the blacklist:

/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster get

3. Shut down the machine for maintenance and then restart the machine.

Note Do not compromise the system during maintenance.

4. Run the following commands to remove the machine from the Job Scheduler blacklist and confirm that
it is no longer listed:

/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster remove $hostname


/apsara/deploy/rpc_wrapper/rpc.sh blacklist cluster get
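
The blacklist steps always pair a change (add or remove) with a get that verifies it. The following Python sketch generates the command pairs for one host so that a runbook or wrapper script can print them in the right order. It only builds argument lists from the rpc.sh commands shown above and does not execute anything. The function names are hypothetical.

```python
RPC_SH = "/apsara/deploy/rpc_wrapper/rpc.sh"


def blacklist_commands(hostname, action):
    """Return the rpc.sh argument lists for one blacklist change plus its verification."""
    if action not in ("add", "remove"):
        raise ValueError("action must be 'add' or 'remove'")
    return [
        [RPC_SH, "blacklist", "cluster", action, hostname],
        # Always list the blacklist afterwards to confirm the change.
        [RPC_SH, "blacklist", "cluster", "get"],
    ]


def maintenance_plan(hostname):
    """Commands to run before and after the maintenance window."""
    return {
        "before_shutdown": blacklist_commands(hostname, "add"),
        "after_restart": blacklist_commands(hostname, "remove"),
    }


if __name__ == "__main__":
    # "m1" is the example hostname used elsewhere in this guide.
    for phase, commands in maintenance_plan("m1").items():
        for argv in commands:
            print(phase, " ".join(argv))
```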

Expected results
During the shutdown of Pangu_chunkserver, Apsara Distributed File System keeps trying to read
data, and SQL tasks remain in the running state. The tasks are completed after seven to eight
minutes, or after the machine resumes operation.

3.5. Adjust the virtual resources of the Apsara system in MaxCompute
Prerequisites
All MaxCompute services have reached the desired state and are functioning properly.

Procedure
1. Log on to the Apsara Infrastructure Management Framework console. In the left-side navigation
pane, choose Operations > Cluster Operations. In the Cluster search box, enter odps to search
for the expected cluster.
2. Click the cluster in the search result. On the Cluster Details page, click the Cluster Configuration tab.
In the left-side file list, find the role.conf file in the fuxi directory.
Figure: role.conf file

3. Adjust the machine tags on the right and click Preview and Submit.
Figure: Adjust machine tags

4. In the Confirm and Submit dialog box that appears, enter the change description and click Submit.
Figure: Submit

5. The cluster starts rolling and the changes start to take effect.

Note You can check the task status in the operation log. If the changes take effect, the
status becomes Successful.

6. After the changes are made, run the r ttrl command in the TerminalService window to confirm
the changes.

3.6. Restart MaxCompute services


Procedure
1. Log on to the Apsara Infrastructure Management Framework console. In the left-side navigation
pane, choose Operations > Cluster Operations. In the Cluster search box, enter odps to search
for the expected cluster.
2. Click the cluster in the search result. On the Cluster Details page, click the Services tab. In the Service
search box, search for odps-service-computer. Click odps-service-computer in the search result.
3. After you access the odps-service-computer service, select ComputerInit# on the Service Details
page. In the Actions column corresponding to the machine, click Terminal. In the TerminalService
window that appears, you can perform subsequent command line operations.
4. Run the following command to obtain the number of machines:

tj_show -r fuxi.Tubo#

5. Divide the number of machines by 3 to obtain the workernum value.

Note The workernum value ranges from 1 to 3.

6. Run vim /apsara/odps_service/deploy/env.cfg and set the following parameters to the workernum
value:

odps_worker_num = 2
executor_worker_num = 2
hiveserver_worker_num = 2
replication_server_num = 2
messager_partition_num = 2
-- The values here are used as an example. Set these values as needed.

7. Restart Hive and MaxCompute.

/apsara/odps_service/deploy/install_odps.sh restart_hiveservice
-- Restart Hive.
/apsara/odps_service/deploy/install_odps.sh restart_odpsservice
-- Restart MaxCompute.

r swl Odps/OdpsServicex
r swl Odps/HiveServerx
-- Check the service update status and time after restart.

8. Restart the messager service.

cd /apsara/odps_service/deploy/; sh install_odps.sh pedeploymessagerservice


-- Restart the messager service.

r swl Odps/MessagerServicex
-- Check the service update status and time after restart.

9. Restart the quota service.

cd /apsara/odps_service/deploy/; sh install_odps.sh pedeployquotaservice


-- Restart the quota service.

r swl Odps/QuotaServicex
-- Check the service update status and time after restart.

10. Restart the replication service.

cd /apsara/odps_service/deploy/; sh install_odps.sh pedeployreplicationservice


-- Restart the replication service.

r swl Odps/ReplicationServicex
-- Check the service update status and time after restart.

11. Restart the service mode.

r plan Odps/CGServiceControllerx >/home/admin/servicemode.json


r sstop Odps/CGServiceControllerx
r start /home/admin/servicemode.json
-- Restart the service mode.

r swl Odps/CGServiceControllerx
-- Check the CGServiceControllerx service update status and time after restart.
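
The workernum calculation in steps 5 and 6 can be written down as a small helper. The following Python sketch assumes that the division rounds down and that the result is clamped to the documented range of 1 to 3. The rounding rule is an assumption because the procedure does not state it. The function names are hypothetical, and the parameter names are the ones shown in step 6.

```python
ENV_CFG_KEYS = (
    "odps_worker_num",
    "executor_worker_num",
    "hiveserver_worker_num",
    "replication_server_num",
    "messager_partition_num",
)


def worker_num(machine_count):
    """Divide the machine count by 3 and clamp the result to the range 1 to 3."""
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    # Assumption: integer division that rounds down.
    return max(1, min(3, machine_count // 3))


def render_env_cfg(machine_count):
    """Return env.cfg lines that use the same workernum value for every parameter."""
    value = worker_num(machine_count)
    return ["%s = %d" % (key, value) for key in ENV_CFG_KEYS]


if __name__ == "__main__":
    print("\n".join(render_env_cfg(7)))
```

Using one value for all five parameters matches the example in step 6. Set individual values differently if your deployment requires it.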

4. MaxCompute O&M
4.1. Log on to the ABM console
This topic describes how to log on to the Apsara Big Data Manager (ABM) console.

Prerequisites
The endpoint of the Apsara Uni-manager Operations Console and the username and password used
to log on to the console are obtained from the deployment personnel or an administrator.

The endpoint of the Apsara Uni-manager Operations Console is in the following format:
region-id.ops.console.intranet-domain-id.
A browser is available. We recommend that you use Google Chrome.

Procedure
1. Open your Chrome browser.
2. In the address bar, enter the endpoint of the Apsara Uni-manager Operations Console. Press the
Enter key.

Note You can select a language from the drop-down list in the upper-right corner of the
page.

3. Enter your username and password.

Note Obtain the username and password used to log on to the Apsara Uni-manager
Operations Console from the deployment personnel or an administrator.

When you log on to the Apsara Uni-manager Operations Console for the first time, you must change
the password of your username.

For security reasons, your password must meet the following requirements:

The password contains uppercase and lowercase letters.
The password contains digits.
The password contains at least one of the following special characters: ! @ # $ %

The password must be 10 to 20 characters in length.

4. Click Log On.
5. In the top navigation bar of the Apsara Uni-manager Operations Console, click O&M.
6. In the left-side navigation pane, choose Product Management > Products.
7. In the Big Data Services section, choose General-Purpose O&M > Apsara Big Data Manager.

4.2. Business O&M


4.2.1. O&M overview and entry
This topic describes the business O&M features and how to go to the business O&M page.

Business O&M features

Projects:
  Project List: shows all projects and project details in a MaxCompute cluster. You can search for and
  filter projects. You can also change the quota group of a project. If zone-disaster recovery is
  enabled, you can specify resource replication parameters and determine whether to enable
  resource replication for a project.
  Authorize Package for Metadata Repository: allows you to authorize members of a project to
  access the metadata warehouse.
  Encryption at Rest: allows you to encrypt the data stored in MaxCompute projects.
  Disaster Recovery: allows you to view the cluster status when zone-disaster recovery is enabled
  for MaxCompute. You can enable the switchover between the primary and secondary clusters. You
  can also determine whether to run scheduled tasks to synchronize resources between the primary
  and secondary clusters.
Quota Groups: shows the quota groups of all projects in a MaxCompute cluster. It allows you to
create and modify quota groups. You can also view details about quota groups and enable period
management for quota groups.
Jobs: shows information about jobs in a MaxCompute cluster. You can search for and filter jobs. You
can also view the operational logs, terminate running jobs, and collect job logs.
Business Optimization:
  File Merging: allows you to create file merge tasks for clusters and projects. You can also filter
  merge tasks and view the records of the tasks.
  File Archiving: allows you to create file archive tasks for clusters and projects. You can also filter
  archive tasks and view the records of the tasks.
  Resource Analysis: allows you to view the resource usage of the cluster from different dimensions.

Go to the business O&M page

1. Log on to the Apsara Big Data Manager (ABM) console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Business tab. In the
left-side navigation pane, choose Projects > Project List.


4.2.2. Project management

4.2.2.1. Project list


The Project List page shows all projects and project details in a MaxCompute cluster. You can filter,
query, and sort projects. You can also change the quota group of a project. If zone-disaster recovery is
enabled, you can specify resource replication parameters and determine whether to enable resource
replication for a project.

Go to the Project List page

In the left-side navigation pane of the Business tab, choose Projects > Project List to view projects
in a cluster.

The Project List page shows the detailed information about all projects in a cluster. You can view the
name, cluster, used storage, storage quota, storage usage, number of files, owner, and creation time of
a project.

View project details

On the Project List page, click the name of a project to view its details. You can view the project
overview, jobs, storage, configuration, quota group, and tunnel, as well as information about resource
analysis and cross-cluster replication. For more information, see MaxCompute workbench. You can also
grant access permissions on the metadata warehouse to project members and encrypt data of the
project. For more information, see Grant access permissions on the metadata warehouse and Encrypt
data.

Change a quota group


You can change the default quota group of a project.
1. On the Project List page, find the target project, click Actions in the Actions column, and select
Change Default Quota Group. In the Change Default Quota Group pane, specify the required
parameters.
    Parameters:
    Region: the region of the project.
    Cluster: the default cluster of the project. If the project belongs to multiple clusters, select a
    cluster from the drop-down list to serve as the default cluster.
    Quota Group: the quota group to which the project belongs. To change the quota group, select
    a quota group from the drop-down list.
2. After you specify the parameters, click Run.

Modify the storage quota

You can modify the storage quota of a project.
1. On the Project List page, find the target project, click Actions in the Actions column, and select
Modify Storage Quota. In the Change Storage Quota pane, specify the required parameters.

    Parameters:
    Region: the region of the project.
    Project: the name of the project for which you want to modify the storage quota.
    Cluster: the default cluster of the project.
    Target Storage Quota (TB): the new storage quota.
    Reason: the reason for the modification.
2. After you specify the parameters, click Run.

Configure resource replication

The resource replication feature can be configured only in zone-disaster recovery scenarios. In other
scenarios, you can only view the settings. In zone-disaster recovery scenarios, you can determine
whether to enable the resource replication feature for a project in the primary cluster. If the resource
replication feature is enabled for a project, you can configure data synchronization rules for the project
to regularly synchronize data such as table data to a secondary cluster.
1. On the Project List page, find the target project, click Actions in the Actions column, and select
Resource Replication. In the Copy Resource pane, specify the required parameters.


    Parameters:
    Enable: specifies whether to enable the resource replication feature. The value true indicates
    that the resource replication feature is enabled. The value false indicates that the resource
    replication feature is disabled. Default value: false.
    Configure: the data synchronization rules of a project. In most cases, the default settings are
    used. If you want to modify the settings, consult second-line O&M engineers.
2. After you modify code in the Configure field, click Compare Versions to view the differences,
which are highlighted.

3. Click Run.
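Before you click Run, it can help to review the edit offline in the same spirit as Compare Versions. The following Python sketch uses the standard difflib module to print the lines that differ between two versions of a configuration. The rule contents are hypothetical, because this guide does not show the schema of the Configure field, and the sketch is not how ABM implements the comparison.

```python
import difflib
import json


def compare_versions(old: dict, new: dict) -> list:
    """Return unified-diff lines between two versions of a configuration."""
    old_text = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_text = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(old_text, new_text, "current", "edited", lineterm=""))


# Hypothetical rule contents; the real schema of the Configure field is not documented here.
current = {"sync_tables": True, "interval_minutes": 60}
edited = {"sync_tables": True, "interval_minutes": 30}

for line in compare_versions(current, edited):
    print(line)
```

Lines that start with a minus sign come from the current version, and lines that start with a plus sign come from the edited version. Two identical versions produce no output.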

4.2.2.2. Project details


The Apsara Big Data Manager (ABM) console shows your MaxCompute projects and project details. You
can view the project overview, jobs, storage, configurations, quota groups, and tunnels, as well as
information about resource analysis, storage encryption, and cross-cluster replication.

Go to the project details page

1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Business tab. The
Project List page appears by default. Click the name of a project to view its details.

Overview
On the Overview tab, you can view the following information about the selected project:
Basic information, such as the default quota group, creator, creation time, service, and region
Trend charts that show the trend lines of requested and used CPU and memory resources by minute
in different colors
Trend chart that shows the trend lines of CPU utilization and memory usage by day in different colors

Jobs
On the Jobs tab, you can view job snapshots by day over the last week. Detailed information about a
job snapshot includes the job ID, project, quota group, submitter, running duration, minimum CPU
utilization, maximum CPU utilization, minimum memory usage, maximum memory usage, DataWorks node,
running status, start time, priority, and type. You can also view the operational logs of a job to
troubleshoot failures.

You can perform the following operations on the Jobs tab:

Customize columns or sort job snapshots by column.
View the operational logs of jobs or terminate jobs.

Storage


On the Storage tab, you can view the storage usage, used storage space, storage quota, and
available storage space. You can also view a trend chart that shows the trend lines of storage usage,
the number of files in Apsara Distributed File System, the number of tables, the number of partitions,
and idle storage by day in different colors.

Note: The Storage tab shows only information about storage resources. To query
information about computing resources, go to the Quota Groups tab.

Configuration
On the Configuration tab, you can configure the general, sandbox, SQL, MapReduce, access control,
and resource recycling properties of the project. You can also configure package-based authorization to
allow access to the metadata warehouse.
On the Properties tab, you can view and modify each configuration item. Then, click Submit. To
restore all configuration items to the default settings, click Reset.

On the Authorize Package for Metadata Repository tab, you can install the package and perform
package-based authorization.

Quota Groups
On the Quota Groups tab, you can view the quota groups of a project and the details of each quota
group.

To view details about a quota group, click the quota group name in the Quota column.

Note: The Quota Groups tab shows only information about computing resources. To query
information about storage resources, go to the Storage tab.

Tunnel
On the Tunnel tab, you can view the tunnel throughput of the project in bytes per minute.
The Tunnel Throughput (Bytes/Min) chart shows the trend lines of inbound and outbound traffic in
different colors.

Resource Analysis
On the Resource Analysis tab, you can view the resource usage of the project from different
dimensions, including tables, tasks, execution time, start time, and engines.

Encryption at Rest


On the Encryption at Rest tab, you can encrypt data by using the following encryption algorithms:
AES-CTR, AES256, RC4, and SM4.

Cross-cluster Replication
On the Cross-cluster Replication tab, you can view the projects that have the cross-cluster
replication feature enabled and the details and status of cross-cluster replication.

When you deploy multiple clusters to use MaxCompute, MaxCompute projects may be mutually
dependent. In this case, data may be directly read between projects. MaxCompute regularly scans
tables or partitions that are directly read by other tables or partitions. If the duration of direct data
reading reaches the specified threshold, MaxCompute adds the tables or partitions to the cross-cluster
replication list.

Assume that Project 1 in Cluster A depends on Table1 of Project 2 in Cluster B. In this case, Project 1
directly reads data from Table1. If the duration of direct data reading reaches the specified threshold,
MaxCompute adds Table1 to the cross-cluster replication list.
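The threshold rule described above can be sketched in a few lines of Python. Everything in this sketch is hypothetical: the record fields, the unit of the duration (days), and the threshold value are assumptions for illustration, and the sketch does not reflect how MaxCompute tracks direct reads internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectRead:
    """One observed cross-cluster dependency (all field names are hypothetical)."""
    reader_project: str
    source_table: str
    days_read_directly: int


def tables_to_replicate(reads, threshold_days):
    """Return the source tables whose direct-read duration has reached the threshold."""
    return sorted({r.source_table for r in reads if r.days_read_directly >= threshold_days})


reads = [
    DirectRead("Project1", "Project2.Table1", days_read_directly=7),
    DirectRead("Project1", "Project2.Table2", days_read_directly=2),
]
print(tables_to_replicate(reads, threshold_days=7))
```

With a threshold of 7 days, only Project2.Table1 qualifies for the replication list, because Project2.Table2 has been read directly for only 2 days.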

The Cross-cluster Replication tab consists of the Replication Details and Replication
Configuration sub-tabs.
Replication Details: shows information about the tables that support cross-cluster replication. The
information includes the project name, cluster name, table name, partition, storage space, number of
files, and cluster to which the data is synchronized.
Replication Configuration: shows the configuration of the tables that support cross-cluster
replication. The configuration includes the table name, priority, cluster to which the data is
synchronized, and lifecycle. You can also view the progress of cross-cluster replication for a table.

4.2.2.3. Encrypt data


You can specify whether to encrypt the data stored in MaxCompute projects.

Prerequisites
If MaxCompute V3.8.0 or later is newly deployed, storage encryption is supported by default. If MaxCompute
is upgraded from an earlier version to V3.8.0 or later, storage encryption is not supported by default. To enable
storage encryption in this case, complete the configuration for your MaxCompute cluster.

Context
After storage encryption is enabled for a project, it cannot be disabled. After storage encryption is
enabled, only the data that is newly written to the project is automatically encrypted. To encrypt
historical data, you can create rules and configure tasks.

Before you encrypt historical data for a project, make sure that you understand the concepts of rules
and tasks in Apsara Big Data Manager (ABM). A rule specifies the time period of historical data
that you want to encrypt in a specific project. After you create a rule, the system obtains the data in
the specified time period every day after the data is exported from the metadata warehouse. Only one
rule takes effect per day: if multiple rules are created on a single day, only the latest rule takes
effect. Each rule takes effect only once. You can create a key rotate task to encrypt the selected
historical data.
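The rule that only the latest rule created on a given day takes effect can be illustrated with a short Python sketch. The tuple layout and the function name are hypothetical, and the sketch models only the selection logic described in the Context section, not ABM itself.

```python
from datetime import datetime


def effective_rules(rules):
    """Keep only the latest rule created on each calendar day.

    `rules` is a list of (created_at, time_period) tuples.
    """
    latest = {}
    for created_at, period in sorted(rules, key=lambda rule: rule[0]):
        latest[created_at.date()] = (created_at, period)  # later rules overwrite earlier ones
    return [latest[day] for day in sorted(latest)]


rules = [
    (datetime(2022, 12, 1, 9, 0), "Last Three Months"),
    (datetime(2022, 12, 1, 15, 30), "All"),
    (datetime(2022, 12, 2, 10, 0), "Six Months Ago"),
]
for created_at, period in effective_rules(rules):
    print(created_at.date(), period)
```

In this example, the rule created at 09:00 on the first day is superseded by the rule created at 15:30 on the same day.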

Procedure
1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Business tab. In the
left-side navigation pane, choose Projects > Project List.
4. On the Project List page, click the name of the required project to go to the project details page.
5. On the project details page, click the Encryption at Rest tab. The Encrypt tab appears.
6. Enable storage encryption.
After storage encryption is enabled, all data that is newly written to the project is automatically
encrypted.
    i. On the Encrypt tab, click Modify in the Actions column. In the Configure Encrypted Storage
    panel, specify Encryption Algorithm, region, and project.

    Note: AES-CTR, AES256, RC4, and SM4 encryption algorithms are supported.

    ii. Click Run.

    After storage encryption is enabled, the switch in the Encrypted Storage column is turned on.
7. To encrypt historical data or encrypted data, perform the following steps:
    i. Create a rule.

    On the Create Rule tab, click OK in the Actions column of a time period in the Create Rule
    section. In the Create Rule message, click Run. The new rule appears in the rule list.

    The available time periods include Last Three Months, Last Six Months, Three Months Ago,
    Six Months Ago, and All.
    ii. Create a key rotate task.

    On the Configure Task tab, click Add a key rotate task. In the Edit Key Rotate Task panel,
    specify the required parameters and click Run.

    Parameter                       Description
    Region                          The region where the project whose data is to be encrypted resides. Select a
                                    region from the drop-down list.
    Project Name                    The name of the project whose data is to be encrypted.
    Start Timestamp                 The start time of the task.
    Ended At                        The end time of the task.
    Priority                        The priority of the task. A small value indicates a high priority.
    Enabled                         Specifies whether the task is enabled.
    Bandwidth Limit                 Specifies whether to limit the concurrency of merge tasks for the project.
                                    Yes: indicates that merge tasks cannot be concurrently run.
                                    No: indicates that merge tasks can be concurrently run.
    Maximum Concurrent Tasks        The maximum number of merge tasks that can be run for the cluster of the
                                    selected project at the same time. This parameter is valid only when Bandwidth
                                    Limit is set to No.
    Maximum Number of Running Jobs  The maximum number of jobs that can be run for the cluster of the selected
                                    project at the same time. This parameter is a global parameter. The jobs refer to
                                    all types of jobs in the cluster of the selected project, not only the merge tasks.
    Merge Parameters                {
                                    "odps.merge.cross.paths": "true",
                                    "odps.idata.useragent": "odps encrypt key rotate via force mergeTask",
                                    "odps.merge.max.filenumber.per.job": "10000000",
                                    "odps.merge.max.filenumber.per.instance": "10000",
                                    "odps.merge.failure.handling": "any",
                                    "odps.merge.maintain.order.flag": "true",
                                    "odps.merge.smallfile.filesize.threshold": "4096",
                                    "odps.merge.quickmerge.flag": "true",
                                    "odps.merge.maxmerged.filesize.threshold": "4096",
                                    "odps.merge.force.rewrite": "true",
                                    "odps.merge.restructure.action": "hardlink"
                                    }
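Every value in the Merge Parameters block is a JSON string, including the flags and the numeric limits, so an edited value presumably needs to stay quoted. The following Python sketch parses the block with the standard json module and converts the values to native types for inspection only. The helper name is hypothetical, the user-agent value is assumed to wrap at a space in the printed table, and the sketch makes no claim about what each key does.

```python
import json

MERGE_PARAMETERS = """
{
  "odps.merge.cross.paths": "true",
  "odps.idata.useragent": "odps encrypt key rotate via force mergeTask",
  "odps.merge.max.filenumber.per.job": "10000000",
  "odps.merge.max.filenumber.per.instance": "10000",
  "odps.merge.failure.handling": "any",
  "odps.merge.maintain.order.flag": "true",
  "odps.merge.smallfile.filesize.threshold": "4096",
  "odps.merge.quickmerge.flag": "true",
  "odps.merge.maxmerged.filesize.threshold": "4096",
  "odps.merge.force.rewrite": "true",
  "odps.merge.restructure.action": "hardlink"
}
"""


def typed_view(raw_json):
    """Convert "true"/"false" and digit-only strings to bool/int for inspection only."""
    typed = {}
    for key, value in json.loads(raw_json).items():
        if value in ("true", "false"):
            typed[key] = value == "true"
        elif value.isdigit():
            typed[key] = int(value)
        else:
            typed[key] = value
    return typed


for key, value in typed_view(MERGE_PARAMETERS).items():
    print(f"{key} = {value!r}")
```

Parsing the text before you paste it into the panel is also a quick way to catch a missing comma or quotation mark.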

8. (Optional) View the history of data encryption in the project.

On the Historical Queries tab, select a date from the Date drop-down list. Then, you can view
information about storage encryption on the specified date.

4.2.2.4. Grant access permissions on the metadata warehouse

You can grant access permissions on the metadata warehouse to projects and project members.

Prerequisites
If MaxCompute V3.8.1 or later is newly deployed, the package of the metadata warehouse is installed by
default. In this case, you can directly use Apsara Big Data Manager (ABM) to grant access permissions
on the metadata warehouse. If MaxCompute is upgraded from an earlier version to V3.8.1 or later, the
package of the metadata warehouse is not installed by default. Before you grant access permissions on the
metadata warehouse, you must manually install the package of the metadata warehouse.
A project is created in DataWorks.

Context
To allow a project to access the metadata warehouse, grant the required permissions to the project
and install the package to the project in the ABM console. When you install the package, ABM retrieves
authentication information, such as the AccessKey pair, of the project from DataWorks. If the project was
created in MaxCompute instead of DataWorks, an error message is returned during installation.

Procedure
1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Business tab. The
Project List page appears by default.
4. Click the name of the required project to go to the project details page.
5. On the project details page, click the Configuration tab. Then, click the Authorize Package for
Metadata Repository tab.
6. Click Authorize in the Actions column. In the Authorize Package message, click Run. A message
appears, indicating that the permissions are granted.
7. Click Install in the Actions column. In the Install Package message, click Run. A message appears,
indicating that the package is installed.
After the package is installed, the switch in the Authorized column is turned on.

4.2.2.5. Perform disaster recovery


When a primary MaxCompute cluster fails, you can perform a primary/secondary switchover in the
Apsara Big Data Manager (ABM) console to restore services. This topic describes the prerequisites and
procedure of disaster recovery. In this topic, disaster recovery indicates zone-disaster recovery.

Prerequisites
The resource replication feature is disabled in the ABM console. To disable the feature, perform the
following steps:
    i. Log on to the ABM console.

    ii. In the upper-left corner, click the icon and then MaxCompute.

    iii. In the left-side navigation pane of the Business tab, choose Projects > Disaster Recovery.
    iv. On the page that appears, turn off Resource Synchronization Status.

The domain name of ABM is pointed to the IP address of the secondary ABM cluster. To point the
domain name to the IP address, perform the following steps:
    i. Log on to the ABM console.

    ii. In the upper-left corner, click the icon and then MaxCompute.

    iii. On the MaxCompute page, click Management in the top navigation bar. In the left-side navigation
    pane of the page that appears, click Jobs. The Jobs tab appears by default.
    iv. Find the Change Bcc Dns-Vip Relation For Disaster Recovery job and click Run in the Actions column.
    The Job Properties section appears.

    v. Click the icon next to Group Name to configure the IP address of the Docker container.


    Note: NewBccAGIp indicates the IP address of the Docker container under AG# for the
    bcc-saas service of the secondary ABM cluster. You must configure an IP address at the
    #Docker# level.

    In the dialog box that appears, click the Servers tab. Enter the IP address of a server in the field
    and click Add Server. Then, click OK. The IP address is configured.

    vi. In the upper-right corner, click Run. In the message that appears, click Confirm.
    vii. On the page that appears, click Start in the upper-right corner. The switchover starts.

    Note: If a step fails, click Retry. After all the steps are complete, the domain name of
    ABM is pointed to the IP address of the secondary ABM cluster.

The secondary ABM cluster page is accessible. If this page is inaccessible, go to the
/usr/loca/bigdatak/controllers/bcc/tool/disaster_recovery directory of the Docker container in bcc-saa.AG# of the
secondary ABM cluster. Then, run the /home/tops/bin/python change_dns_vip.py script in the
directory. If job_success appears, the execution succeeds. Then, run the /home/tops/bin/python
disaster_init.py script in the current directory. If job_success appears, the execution succeeds. After
the scripts are successfully run, you can go to the secondary cluster page.

    Note: If an exception occurs when you run the scripts, run them again.

The Business Continuity Management Center (BCMC) switchover of MaxCompute is complete. The
services on which MaxCompute depends are running normally. The services include AAS, Tablestore,
and MiniRDS.

By default, the data synchronization feature is disabled for MaxCompute projects because the
computing and storage resources of the primary and secondary data centers are limited. To enable
the data synchronization feature, submit a ticket.
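If the secondary ABM cluster page is inaccessible, the two scripts named in the prerequisites are run in order, and each run is checked for job_success. The following Python sketch wraps that sequence. The directory, interpreter path, and script names are copied from this guide as printed. The function names are hypothetical, and the sketch assumes that job_success is written to standard output or standard error, which this guide does not specify.

```python
import subprocess

# Paths and script names as printed in this guide; run inside the Docker container
# of the secondary ABM cluster.
TOOL_DIR = "/usr/loca/bigdatak/controllers/bcc/tool/disaster_recovery"
PYTHON = "/home/tops/bin/python"
SCRIPTS = ["change_dns_vip.py", "disaster_init.py"]


def run_script(script, cwd=TOOL_DIR):
    """Run one script and return its combined output."""
    completed = subprocess.run(
        [PYTHON, script], cwd=cwd, capture_output=True, text=True, check=False
    )
    return completed.stdout + completed.stderr


def run_recovery_scripts(runner=run_script):
    """Run the scripts in order; stop at the first one that does not report job_success."""
    for script in SCRIPTS:
        output = runner(script)
        # Assumption: the success marker appears somewhere in the script output.
        if "job_success" not in output:
            raise RuntimeError(f"{script} did not report job_success:\n{output}")
    return True
```

Calling run_recovery_scripts() inside the container runs both scripts. The runner argument exists so that the sequence can be rehearsed with a stub before it is used on a real cluster.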

Context
Pay attention to the following points for a disaster recovery switchover:

By default, the logon to Apsara Big Data Manager depends on the Apsara Uni-manager Operations
Console. If the Apsara Uni-manager Operations Console has not reached the desired state, single
sign-on is not supported. In this case, go to the
/usr/loca/bigdatak/controllers/bcc/tool/disaster_recovery directory of the Docker container in
bcc-saa.AG#. Then, run change_login_by_bcc.sh to
switch the logon mode to the mode that is independent of the Apsara Uni-manager Operations
Console. After the Apsara Uni-manager Operations Console has reached the desired state, run
change_login_by_aso.sh to switch the logon mode back to the mode that depends on the Apsara
Uni-manager Operations Console.

An exception may occur in each step of the switchover process. If an exception occurs, click Retry. If
the retry succeeds, proceed to the next step. If the exception persists after multiple retries, contact
O&M engineers to perform troubleshooting. Then, click Retry to complete the step.

For each switchover, the Apsara distributed operating system of the original primary MaxCompute
cluster must be restarted. Otherwise, the admintask service may be faulty after the switchover is
complete.

In the Collect Unsynchronized Data step, an exception shown in the following figure may occur. If this
occurs, click Recollect Unsynchronized Data.
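The retry guidance above amounts to a bounded retry loop followed by escalation. The following Python sketch shows that pattern in general terms. The function name and the retry limit are hypothetical, because this guide says only "multiple retries", and the sketch does not drive the ABM console.

```python
def run_step_with_retries(step, max_retries=3):
    """Call `step()` until it succeeds or the retry budget is used up.

    Returns True on success and False when the step should be escalated
    to O&M engineers.
    """
    for attempt in range(1 + max_retries):
        try:
            step()
            return True
        except Exception as error:  # broad on purpose: any exception triggers a retry
            print(f"attempt {attempt + 1} failed: {error}")
    return False
```

A return value of False corresponds to the point at which this guide recommends that you contact O&M engineers before you retry the step again.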

Procedure
1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. In the left-side navigation pane of the Business tab, choose Projects > Disaster Recovery.
4. In the upper-right corner, click Switchover Process to start the disaster recovery process.
5. Wait for resource replication to automatically stop.
After Next becomes blue, click Next.

Note: If an error occurs, click Retry. If the retry fails, contact O&M engineers to perform
troubleshooting and try again.

6. Switch control clusters.

    i. Wait for the primary/secondary switchover to complete for control clusters.

    Note: After the original primary cluster becomes the secondary cluster, the switchover is
    complete.

    ii. Click Restart Standby Cluster.

    Note: The MaxCompute clusters become abnormal.

    iii. After the MaxCompute clusters become normal, click Restart Frontend Server and wait until the
    restart result is returned.
    iv. After the restart succeeds, click Test adminTask.

    Note: If an exception occurs, click Retry and then Test adminTask. Alternatively,
    repeat from Step 6.ii.

    v. After Next becomes blue, click Next.

    Note: The Switching message remains displayed until the test succeeds.

7. Switch computing clusters.

The computing cluster switchover automatically starts for the projects that have two computing
clusters. The switchover cannot be performed for the projects that have only one computing cluster.
After the switchovers are complete for all the projects, click Next.

Note: If the computing clusters of a project fail to be switched, contact O&M engineers to
identify the cause of the exception. If the exception can be fixed, fix it and click Retry to
continue the switchover. If the project is damaged or does not need a cluster switchover, click
Next after you confirm that computing clusters of other projects are switched.

8. Switch the replication service to the secondary clusters.

The script is automatically run in the background. When a success message appears, click Next.


9. Collect unsynchronized data.

    i. Wait for the system to collect statistics on projects that contain unsynchronized data.

    Note: This step requires a long time to complete. The specific time depends on the data
    volume.

    ii. After the collection is complete, click Download Unsynchronized Data of Selected Projects
    to download the unsynchronized data to your computer.

    Note: The unsynchronized data that is obtained from this step is required for the
    Manually Fill in Missing Data step. The projects that are obtained from this step must be the
    same as those for the Repair Metadata and Manually Fill in Missing Data steps.

    iii. After the unsynchronized data is downloaded, verify the data and click Next. If all data is
    synchronized, click Next.

    Note: If the unsynchronized data is abnormal, you can click Recollect Unsynchronized Data.

10. Repair metadata.

Select all projects, click Repair Metadata of Selected Projects, and then wait for results. If the
metadata of some projects fails to be repaired, click Download Last Execution Log and send the
logs to O&M engineers. The logs can be used to identify and analyze the cause of the exception.
After the exception is fixed, repair the metadata of the projects again. If you do not need to repair
the metadata of all projects, click Next after the metadata of required projects is repaired.

11. Manually supplement missing data.

Use DataWorks or the odpscmd client to manually supplement the missing data based on the
unsynchronized data that you downloaded. After you supplement the missing data, select all
projects and click Confirm Data Repair Complete. Then, click Next.

12. Repair unsynchronized resources.

    i. Wait for the system to collect statistics on projects that contain unsynchronized resources.

    Note: This step requires a long time to complete. The specific time depends on the data
    volume.

    ii. Use DataWorks or the odpscmd client to manually supplement the missing resources based on the
    unsynchronized resources that you collected. If an exception occurs, send exception information
    to O&M engineers to perform troubleshooting. After all the project resources are repaired, click
    Complete and Next.
13. Wait for resource replication to automatically start.

After Next becomes blue, click Next.

14. Exit the configuration wizard.

After the switchover is complete, click Back to exit the wizard.

4.2.2.6. Migrate projects

Apsara Big Data Manager (ABM) allows you to migrate MaxCompute projects across regions from one
cluster to another. This allows you to balance the computing and storage resources of each cluster.

Note: The project migration feature is supported only when the clusters are deployed in
multi-region mode.

Create a project migration task

1. In the left-side navigation pane of the Business tab, choose Projects > Project Migration.
2. In the upper part of the Migration Mission page, select the region where the project resides.

3. In the upper-right corner, click Create Mission. On the page that appears, specify the parameters in
the General, Source, Target Selection, and Cluster for Mission Execution sections as
prompted.


The following table describes the required parameters.

Section                        Parameter                      Description
Source                         Source Cluster                 The name of the source cluster. Select a cluster from the
                                                              drop-down list.
                               Quota Group                    The quota group of the source cluster. Select a quota group
                                                              from the drop-down list.
                               Project List                   The projects that you want to migrate. After Quota Group is
                                                              specified, all the projects in the quota group are
                                                              automatically loaded. You can migrate these projects at a
                                                              time.
                                                              If some projects in the quota group do not need to be
                                                              migrated, you can remove the projects.
Target Selection               Copy Source Quota Group        Specifies whether the destination cluster uses the same quota
                                                              group as the source cluster. If you enable this feature, the
                                                              Target Quota Group parameter cannot be specified.
                               Change Tunnel Routing Address  Specifies whether to use a new Tunnel route. Tunnel provides
                                                              highly concurrent upload and download services for offline
                                                              data. Each project has a default Tunnel route. If you want to
                                                              use a new Tunnel route after a project is migrated to a new
                                                              cluster, enable Change Tunnel Routing Address and
                                                              specify the new Tunnel route.
                               PanguVolume Target Server      Specifies whether the destination Apsara Distributed File
                                                              System volume can be specified. Cross-volume project
                                                              migration is not supported. Set this parameter to No.
Cluster for Mission Execution  Cluster                        Source Cluster: indicates that the source cluster pushes
                                                              the project to the destination cluster.
                                                              Target Cluster: indicates that the destination cluster
                                                              pulls the project from the source cluster.

4. Click Preview to preview project migration details.

5. After you confirm the configuration, click Start Planning in the upper-left corner. A project
migration task is generated. The migration details appear.

It requires some time to generate the task.

59 > Document Version: 20221222


Operat ions and Maint enance Guide·
MaxComput e
MaxComput e O&M

A st andard project migrat ion t ask generally includes five st eps:

i. Add T arget Clust er: Add t he dest inat ion clust er t o t he clust er list of t he project t hat you want
t o migrat e.
ii. St art t o Replicat e : Replicat e t he project from t he source clust er t o t he dest inat ion clust er.
iii. Swit ch Def ault Clust er: Change t he default clust er of t he project t o t he dest inat ion clust er.
Aft er t he default clust er is changed, generat ed dat a is writ t en t o t he dest inat ion clust er.
iv. Clear Replicat ion: Clear t he dat a replicat ion list . During project migrat ion, t he migrat ed project in
t he source clust er and t he corresponding project in t he dest inat ion clust er synchronize dat a based
on t he dat a replicat ion list . T his ensures dat a consist ency bet ween t he t wo project s. Dat a is
cont inuously synchronized unt il t he dat a replicat ion list is cleared.
v. Remove Source Clust er: Delet e t he migrat ed project from t he source clust er.

For more informat ion about how t o modify a t ask aft er it is generat ed, see Modify a project migrat ion
t ask.

Run the project migration task


After the project migration task is created, you can run the task on the Migration Details page.

1. Click the task name in the task list to go to the Migration Details page.
2. On the Migration Details page, click Submit for Execution.
After the project migration task starts, the system automatically runs the Add Target Cluster and Start to Replicate steps in sequence.

If you migrate multiple projects at a time, the process requires many steps to complete. Therefore, we recommend that you sort the steps by project to view the migration steps for each project. If the status of a step is Success, the step is complete. If the status of a step is Failed, the step has failed.

In the migration process, some steps can be run only after you click OK. If you do not need to run a step, click Skip. To confirm or skip multiple steps at a time, select the steps and click OK or Skip in the upper-left corner.
You can also click the status of a migration step for a project. In the dialog box that appears, click Yes to skip the remaining steps.

3. When the Start to Replicate step is complete, check the difference in data volumes between the migrated project in the source cluster and the corresponding project in the destination cluster.

Important: We recommend that you run the next step only when the difference in data volumes does not exceed 5%.

To check the data volume of a project, log on to the admingateway host in the cluster where the project resides and run the pu dirmeta /product/aliyun/odps/${project_name}/ command.

4. If the difference in data volumes does not exceed 5%, perform one of the following operations:
   Change the default cluster: Click OK in the Actions column of the Switch Default Cluster step. After this operation, the destination cluster becomes the default cluster of the migrated project. The default cluster is changed in this example.
   Do not change the default cluster: Click Skip in the Actions column of the Switch Default Cluster step. After this operation, the source cluster is still used as the default cluster of the project.

After the default cluster is changed, generated data is written to the destination cluster.

Warning: During project migration, the migrated project in the source cluster and the corresponding project in the destination cluster synchronize data based on the data replication list to ensure data consistency. It requires some time for data synchronization to complete. Therefore, after the default cluster is changed, we recommend that you wait for about one week before you proceed to the next step.

5. Wait for about one week and check whether the data volume of the migrated project in the source cluster is the same as that of the corresponding project in the destination cluster.
To check the data volume of a project, log on to the admingateway host in the cluster where the project resides and run the pu dirmeta /product/aliyun/odps/${project_name}/ command.

Warning: Before you proceed to the next step, make sure that the data volume of the migrated project in the source cluster is the same as that of the corresponding project in the destination cluster. Otherwise, data may be lost.

6. To retain the migrated project in the source cluster, click Skip in the Actions column of the Remove Source Cluster step before you perform the Clear Replication step.
7. After the data volume of the migrated project in the source cluster becomes the same as that of the project in the destination cluster, click OK in the Actions column of the Clear Replication step to clear the data replication list.

After the data replication list is cleared, data is no longer synchronized between the migrated project in the source cluster and the corresponding project in the destination cluster.

The system automatically runs the Remove Source Cluster step to delete all migrated projects from the source cluster. This releases storage and computing resources.
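
The 5% guideline in steps 3 and 4 can be checked with a few lines of Python once you have read each project's size from the pu dirmeta output. This is only a minimal sketch: the function names and the byte counts are hypothetical, and it assumes that the difference is measured relative to the source project, which this guide does not specify.

```python
def volume_difference_ratio(source_bytes: int, target_bytes: int) -> float:
    """Return the size difference as a fraction of the source project size."""
    if source_bytes <= 0:
        raise ValueError("source_bytes must be positive")
    return abs(source_bytes - target_bytes) / source_bytes


def ready_to_switch(source_bytes: int, target_bytes: int,
                    threshold: float = 0.05) -> bool:
    """True when the difference does not exceed the recommended 5% threshold."""
    return volume_difference_ratio(source_bytes, target_bytes) <= threshold


# Hypothetical sizes in bytes, copied by hand from the command output.
print(ready_to_switch(1_000_000_000, 960_000_000))  # 4% difference
print(ready_to_switch(1_000_000_000, 900_000_000))  # 10% difference
```

Before the Clear Replication step, the guide requires the two volumes to be the same, so the same helper could be called with threshold=0.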

View migration details


You can view the details of a project migration task, including the steps, results, and debugging information.

1. If multiple migration tasks exist, search for a task or filter tasks on the Migration Mission page.
   Filter tasks: Select a task state from the Filter out Mission By drop-down list. All tasks in this state are automatically filtered from the migration task list.
   Search for a task: Enter the name of a migration task in the search box in the upper-right corner and click the search icon to search for the task.

2. Click the name of a task. On the Migration Details page, view the details of the task.

3. If a step fails, click the Details or Debugging icon in the Actions column to view the details or debugging information of the step. This allows you to identify the cause of the failure.
4. Perform other required operations.

Click Menu in the upper-right corner. You can export the step list, change the column width to automatically fit the content, or customize whether to show or hide a column.

You can also right-click a cell in the step list and copy the cell content.

View step details and debugging information


If a step fails, you can view the step details and debugging information to identify the cause of the failure.

1. Find the step that fails to run during the migration of a project.

2. Click the Details icon in the Actions column to view the details of the step.

3. Click the Debugging icon in the Actions column to view the debugging information of the step.

Modify a project migration task


After a project migration task is created, you can modify the task if the task does not meet your requirements.

To modify the task, find the required task and click Modify Mission in the Actions column, or click Replan on the Migration Details page.

4.2.3. Manage quota groups


Apsara Big Data Manager (ABM) shows the quota groups of all projects in a MaxCompute cluster. It allows you to create and modify quota groups. You can also view details about quota groups and enable period management for quota groups.

Go to the Quota Groups page


1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Business tab. In the left-side navigation pane of the tab that appears, click Quota Groups. Then, click Quota Groups or Periods as required.

Create a quota group


In the upper-right corner of the Quota Groups page, click Create Quota Group. In the panel that appears, configure the parameters and click Run.

Parameter | Description
Cluster | The cluster of the quota group that you want to create.
Quota Group | The name of the quota group that you want to create.
Preemption Policy | The preemption policy of the quota group. Valid values: No Preemption and Preemption. Default value: No Preemption.
Scheduling Type | The type of resource scheduling. Valid values: First In, First Out and Average. Default value: First In, First Out.
Minimum CUs | The minimum number of compute units (CUs) that are provided by the quota group.
Maximum CUs | The maximum number of CUs that are provided by the quota group.
CPU-to-Memory Ratio | The ratio of CPUs to memory of hosts in the quota group.
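
Before you enter values in the Create Quota Group panel, it can help to sanity-check them. The following sketch is not an ABM API: the class and field names are hypothetical, the valid values and defaults are taken from the table above, and the rule that Minimum CUs must not exceed Maximum CUs is an assumption inferred from the parameter names.

```python
from dataclasses import dataclass
from typing import List

PREEMPTION_POLICIES = ("No Preemption", "Preemption")
SCHEDULING_TYPES = ("First In, First Out", "Average")


@dataclass
class QuotaGroupSettings:
    """Hypothetical container for the values typed into the panel."""
    cluster: str
    quota_group: str
    minimum_cus: int
    maximum_cus: int
    preemption_policy: str = "No Preemption"      # default from the table
    scheduling_type: str = "First In, First Out"  # default from the table

    def problems(self) -> List[str]:
        """Return a list of human-readable problems; empty means no issue found."""
        found = []
        if self.preemption_policy not in PREEMPTION_POLICIES:
            found.append(f"unknown preemption policy: {self.preemption_policy}")
        if self.scheduling_type not in SCHEDULING_TYPES:
            found.append(f"unknown scheduling type: {self.scheduling_type}")
        if self.minimum_cus < 0:
            found.append("Minimum CUs must not be negative")
        # Assumption: the minimum cannot be larger than the maximum.
        if self.minimum_cus > self.maximum_cus:
            found.append("Minimum CUs exceeds Maximum CUs")
        return found


settings = QuotaGroupSettings("example_cluster", "example_quota", 100, 500)
print(settings.problems())
```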

Modify a quota group


On the Quota Groups page, find the quota group that you want to modify and click Modify in the Actions column. In the panel that appears, modify the settings and click Run.

Note: If period management has been enabled for the quota group you want to modify, first modify the period management configuration.

View details about a quota group


On the Quota Groups page, find the quota group whose details you want to view and click Details in the Actions column. Then, you can view information about the resource usage, resource analysis, and period management of the quota group.

Enable period management for a quota group


1. On the Periods page, find the quota group for which you want to enable period management and click Period Management in the Actions column.
2. On the Period Management tab, click Set Periods. In the dialog box that appears, set Period and click Enable Period Management.

Note:
   You can click Add to specify more than one period and Delete to delete a period.
   For the quota group that has period management enabled, click Edit in the Actions column. In the Modify Period Configuration panel, you can modify the parameters of the quota group within the specified period.

3. To disable period management for a quota group, click Set Periods again. In the dialog box that appears, click Disable Period Management.

4.2.4. Job management

4.2.4.1. Job snapshots


The job snapshots feature allows you to manage the tasks that are created in MaxCompute and the merge tasks that are created in Apsara Big Data Manager (ABM). You can also view Logview information about jobs, terminate jobs, and collect job logs.

View job snapshots


You can view job snapshots by day in the last week. The information about a job snapshot includes the job ID, project, quota group, submitter, running duration, minimum CPU utilization, and maximum CPU utilization. It also includes the minimum memory usage, maximum memory usage, DataWorks node, running status, start time, priority, and type. You can also view the operational logs of a job to identify job failures.

1. In the left-side navigation pane of the Business tab, choose Jobs > Job Snapshots. The Job Snapshots page appears.

2. In the upper-right corner, select the date and time to view job snapshots by day.

3. Click All, Running, Waiting for Resources, or Initializing to view job snapshots on the specified date.
4. Find the required snapshot and click Logview in the Actions column. In the dialog box that appears, click Run to view Logview information about the job.

Terminate jobs

1. In the left-side navigation pane of the Business tab, choose Jobs > Job Snapshots. The Job Snapshots page appears.

2. Select one or more jobs and click Terminate Job above the snapshot list. In the panel that appears, view information about the job or jobs that you want to terminate.

3. Click Run. A message appears, indicating the running result.

Collect job logs


If an exception occurs during job running, you can collect job logs to identify and analyze the exception.

1. In the left-side navigation pane of the Business tab, choose Jobs > Job Snapshots. The Job Snapshots page appears.

2. In the upper-right corner of the Job Snapshots page, choose Actions > Collect Job Logs.
3. In the Collect Job Logs panel, configure the parameters.

The following table describes the parameters.

Parameter | Description
Target Service | The service from which you want to collect job logs.
instanceid | Optional. The ID of the job instance.
requestid | Optional. The request ID returned when the job fails. If the value you specify is not a request ID, job logs that contain the specified value are collected.
Time Period | The time period to collect job logs.
Time Interval | Optional. The time interval to collect job logs. Unit: hours.
Degree of Concurrency | The maximum number of nodes from which you can collect job logs at the same time.

4. Click Run to start job log collection.

5. View the execution status and progress of job log collection.
In the upper-right corner of the Job Snapshots page, click Actions and select Execution History next to Collect Job Logs. In the Execution History panel, view the execution status and history of job log collection.
RUNNING indicates that the execution is in progress. SUCCESS indicates that the execution succeeds. FAILED indicates that the execution fails. If the status is RUNNING, click Details in the Actions column of a task to view the execution progress.
6. View the path to store job logs.

In the Execution History panel, click Details in the Details column of an execution record to view the details. In the Steps section, view the path to store the job logs.

4.2.5. Business optimization

4.2.5.1. Merge small files

Excessive small files in a MaxCompute cluster occupy a lot of memory resources. Apsara Big Data Manager (ABM) allows you to merge multiple small files in clusters and projects to free up memory occupied by the files.

Create a file merge task for a cluster


If multiple small files exist in most projects of a MaxCompute cluster, you can create a task to merge these files in a centralized manner.

1. In the left-side navigation pane of the Business tab, choose Business Optimization > File Merging. The Merge Tasks tab appears.

2. In the Merge Tasks for Clusters section, click Create Merge Task. In the Modify Merge Task for Cluster panel, specify the required parameters.

The following table describes the parameters.

Parameter | Description
Cluster | The cluster for which you want to run the merge task. Select a cluster from the drop-down list.
Start Time | The start time of the task.
End Time | The end time of the task.
Bandwidth Limit | Specifies whether to limit the concurrency of merge tasks for the cluster. Yes: indicates that merge tasks cannot be concurrently run. No: indicates that merge tasks can be concurrently run.
Maximum Concurrent Tasks | The maximum number of merge tasks that can be run for the selected cluster at the same time. This parameter is valid only when Bandwidth Limit is set to No.
Enabled | Specifies whether the task is enabled.
Merge Parameters | The parameter configuration for the merge task. You can use the following default configuration:

    {
        "odps.idata.useragent": "SRE Merge",
        "odps.merge.cpu.quota": "75",
        "odps.merge.quickmerge.flag": "true",
        "odps.merge.cross.paths": "true",
        "odps.merge.smallfile.filesize.threshold": "4096",
        "odps.merge.maxmerged.filesize.threshold": "4096",
        "odps.merge.max.filenumber.per.instance": "10000",
        "odps.merge.max.filenumber.per.job": "10000000",
        "odps.merge.maintain.order.flag": "true",
        "odps.merge.failure.handling": "any"
    }

Maximum Running Jobs | The maximum number of jobs that can be run for the selected cluster at the same time. This parameter is a global parameter. The jobs refer to all types of jobs in the selected cluster, not only merge tasks.

3. Click Compare Versions below Merge Parameters to view the differences between the original and modified values.

4. Click Run.

The newly created merge task appears in the list of merge tasks for clusters.
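
If you edit the merge parameters outside the console, you can preview the kind of difference that Compare Versions reports with a short script. This is a minimal sketch, not the console's implementation: the function name compare_versions and the modified value "50" are hypothetical, while the default configuration is copied from the table above.

```python
import json

DEFAULT_MERGE_PARAMETERS = json.loads("""
{
    "odps.idata.useragent": "SRE Merge",
    "odps.merge.cpu.quota": "75",
    "odps.merge.quickmerge.flag": "true",
    "odps.merge.cross.paths": "true",
    "odps.merge.smallfile.filesize.threshold": "4096",
    "odps.merge.maxmerged.filesize.threshold": "4096",
    "odps.merge.max.filenumber.per.instance": "10000",
    "odps.merge.max.filenumber.per.job": "10000000",
    "odps.merge.maintain.order.flag": "true",
    "odps.merge.failure.handling": "any"
}
""")


def compare_versions(original: dict, modified: dict) -> dict:
    """Map each changed key to an (original value, modified value) pair.

    A key that exists on only one side is reported with None on the other.
    """
    changed = {}
    for key in sorted(set(original) | set(modified)):
        before, after = original.get(key), modified.get(key)
        if before != after:
            changed[key] = (before, after)
    return changed


# Hypothetical edit: lower the CPU quota value from "75" to "50".
modified = dict(DEFAULT_MERGE_PARAMETERS)
modified["odps.merge.cpu.quota"] = "50"
print(compare_versions(DEFAULT_MERGE_PARAMETERS, modified))
# {'odps.merge.cpu.quota': ('75', '50')}
```

Note that every value in the default configuration is a JSON string, including the numeric and Boolean ones, so edited values are kept as strings here as well.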

Create a merge task for a project


If excessive small files exist in only a few projects of a MaxCompute cluster, you can create a merge task to merge the small files in a specific project.
1. In the left-side navigation pane of the Business tab, choose Business Optimization > File Merging. The Merge Tasks tab appears.

2. In the Merge Tasks for Projects section, click Create Merge Task. In the Modify Merge Task for Project panel, specify the required parameters.

The following table describes the parameters.

Parameter | Description
Region | The region where the selected project resides. Select a region from the drop-down list.
Project Name | The name of the project for which you want to run the merge task. Select a project from the drop-down list.
Start Time | The start time of the task.
Priority | The priority of the task. A small value indicates a high priority.
End Time | The end time of the task.
Enabled | Specifies whether the task is enabled.
Bandwidth Limit | Specifies whether to limit the concurrency of merge tasks for the project. Yes: indicates that merge tasks cannot be concurrently run. No: indicates that merge tasks can be concurrently run.
Maximum Concurrent Tasks | The maximum number of merge tasks that can be run for the cluster where the selected project resides at the same time. This parameter is valid only when Bandwidth Limit is set to No.
Maximum Running Jobs | The maximum number of jobs that can be run for the cluster where the selected project resides at the same time. This parameter is a global parameter. The jobs refer to all types of jobs in the cluster where the selected project resides, not only merge tasks.

3. Click Run.

The newly created merge task appears in the list of merge tasks for projects.

View merge task statistics


In the left-side navigation pane of the Business tab, choose Business Optimization > File Merging. Then, click the Historical Statistics tab to view the historical statistics of merge tasks for clusters and projects.

Merge Task Statistics

The trend chart for merge tasks shows statistics on the execution of all merge tasks for each day in the last month. It shows the numbers of running tasks, finished tasks, waiting tasks, timeout tasks, failed tasks, invalid tasks, merged partitions, and reduced files. It also shows the reduced data volume on physical storage, in bytes.

Merge Tasks for Clusters and Merge Tasks for Projects

The two tables show statistics on the execution of merge tasks for clusters and projects on a specific day in the last month. The tables show the numbers of running tasks, finished tasks, waiting tasks, timeout tasks, failed tasks, invalid tasks, merged partitions, and reduced files. The tables also show the reduced data volume on physical storage, in bytes.

Manage merge types


In the left-side navigation pane of the Business tab, choose Business Optimization > File Merging. Then, click the Merge Types tab to view the existing merge types and merge parameters.

Create Merge Type


1. In the Merge Tasks section, click Create Merge Type. In the Modify Merge Type panel, specify the required parameters.

The following table describes the parameters.

Parameter | Description
Merge Type | The name of the merge type.
Merge Parameters | The merge parameters of the merge type.

2. Click Compare Versions below Merge Parameters to view the differences between the original and modified values.

3. Click Run.

The newly created merge type appears in the list of merge types.

4.2.5.2. Compress idle files


Apsara Big Data Manager (ABM) allows you to create archive tasks to compress idle files in MaxCompute clusters and projects. This saves storage space for the clusters.

Definition

In a cluster, ABM sorts the tables or partitions created more than 90 days ago by storage space. Then, it compresses the first 100,000 tables or partitions.
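
The selection rule in the definition can be sketched as follows. This is an illustration only, not ABM's implementation: the TableInfo class and its field names are hypothetical, and the sketch assumes that the sort is by storage space in descending order (largest first), which the definition does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TableInfo:
    """Hypothetical record for one table or partition."""
    name: str
    created_at: datetime
    storage_bytes: int


def select_archive_candidates(tables: List[TableInfo], now: datetime,
                              min_age_days: int = 90,
                              limit: int = 100_000) -> List[TableInfo]:
    """Keep items created more than min_age_days ago, sort them by storage
    space (assumed largest first), and return at most `limit` of them."""
    cutoff = now - timedelta(days=min_age_days)
    old_enough = [t for t in tables if t.created_at < cutoff]
    old_enough.sort(key=lambda t: t.storage_bytes, reverse=True)
    return old_enough[:limit]


now = datetime(2022, 12, 22)
sample = [
    TableInfo("old_small", now - timedelta(days=200), 10),
    TableInfo("new_large", now - timedelta(days=10), 999),
    TableInfo("old_large", now - timedelta(days=91), 50),
]
print([t.name for t in select_archive_candidates(sample, now)])
# ['old_large', 'old_small']
```

The table new_large is skipped despite its size because it was created fewer than 90 days before the reference date.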

Create an archive task for a cluster


If excessive idle files exist in most projects of a MaxCompute cluster, you can create an archive task to compress the idle files in the cluster in a centralized manner.

1. In the left-side navigation pane of the Business tab, choose Business Optimization > File Archiving. The Archive Tasks tab appears.

2. In the Archive Tasks for Clusters section, click Create Archive Task. In the Modify Archive Task for Cluster panel, specify the required parameters.

The following table describes the parameters.

Parameter | Description
Cluster | The cluster for which you want to run the archive task. Select a cluster from the drop-down list.
Start Time | The start time of the task.
End Time | The end time of the task.
Bandwidth Limit | Specifies whether to limit the concurrency of archive tasks for the cluster. Yes: indicates that archive tasks cannot be concurrently run. No: indicates that archive tasks can be concurrently run.
Maximum Concurrent Jobs | The maximum number of archive tasks that can be run for the selected cluster at the same time. This parameter is valid only when Bandwidth Limit is set to No.
Enable | Specifies whether the task is enabled.
Maximum Running Jobs | The maximum number of jobs that can be run for the selected cluster at the same time. This parameter is a global parameter. The jobs refer to all types of jobs in the selected cluster, not only archive tasks.
Archive Parameters | The parameter configuration for the archive task. You can use the following default configuration:

    {
        "odps.idata.useragent": "SRE Archive",
        "odps.oversold.resources.ratio": "100",
        "odps.merge.quickmerge.flag": "true",
        "odps.merge.cross.paths": "true",
        "odps.merge.smallfile.filesize.threshold": "4096",
        "odps.merge.maxmerged.filesize.threshold": "4096",
        "odps.merge.max.filenumber.per.instance": "10000",
        "odps.merge.max.filenumber.per.job": "10000000",
        "odps.merge.maintain.order.flag": "true",
        "odps.sql.hive.compatible": "true",
        "odps.merge.compression.strategy": "normal",
        "odps.compression.strategy.normal.compressor": "zstd",
        "odps.merge.failure.handling": "any",
        "odps.merge.archive.flag": "true"
    }

3. Click Compare Versions below Archive Parameters to view the differences between the original and modified values.
4. Click Run.

The newly created archive task appears in the list of archive tasks for clusters.
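
The default archive configuration shares most of its keys with the default merge configuration in the Merge small files topic. The sketch below lists what differs between the two defaults. Both dictionaries are copied from this guide; the variable names are hypothetical, and the comparison says nothing about what each parameter does internally.

```python
MERGE_DEFAULTS = {
    "odps.idata.useragent": "SRE Merge",
    "odps.merge.cpu.quota": "75",
    "odps.merge.quickmerge.flag": "true",
    "odps.merge.cross.paths": "true",
    "odps.merge.smallfile.filesize.threshold": "4096",
    "odps.merge.maxmerged.filesize.threshold": "4096",
    "odps.merge.max.filenumber.per.instance": "10000",
    "odps.merge.max.filenumber.per.job": "10000000",
    "odps.merge.maintain.order.flag": "true",
    "odps.merge.failure.handling": "any",
}

ARCHIVE_DEFAULTS = {
    "odps.idata.useragent": "SRE Archive",
    "odps.oversold.resources.ratio": "100",
    "odps.merge.quickmerge.flag": "true",
    "odps.merge.cross.paths": "true",
    "odps.merge.smallfile.filesize.threshold": "4096",
    "odps.merge.maxmerged.filesize.threshold": "4096",
    "odps.merge.max.filenumber.per.instance": "10000",
    "odps.merge.max.filenumber.per.job": "10000000",
    "odps.merge.maintain.order.flag": "true",
    "odps.sql.hive.compatible": "true",
    "odps.merge.compression.strategy": "normal",
    "odps.compression.strategy.normal.compressor": "zstd",
    "odps.merge.failure.handling": "any",
    "odps.merge.archive.flag": "true",
}

# Keys that appear in only one of the two default configurations.
archive_only = sorted(set(ARCHIVE_DEFAULTS) - set(MERGE_DEFAULTS))
merge_only = sorted(set(MERGE_DEFAULTS) - set(ARCHIVE_DEFAULTS))

# Shared keys whose default values differ.
changed = sorted(
    key for key in set(MERGE_DEFAULTS) & set(ARCHIVE_DEFAULTS)
    if MERGE_DEFAULTS[key] != ARCHIVE_DEFAULTS[key]
)

print("archive only:", archive_only)
print("merge only:", merge_only)
print("different values:", changed)
```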

Create an archive task for a project


If excessive idle files exist in only a few projects of a MaxCompute cluster, you can create an archive task to compress the idle files in a specific project.

Note: If the tables or partitions of a project are not among the top 100,000 in the cluster of the project, the archive task cannot compress the idle files in the project.

1. In the left-side navigation pane of the Business tab, choose Business Optimization > File Archiving. The Archive Tasks tab appears.

2. In the Archive Tasks for Projects section, click Create Archive Task. In the Modify Archive Task for Project panel, specify the required parameters.
The following table describes the parameters.

Parameter | Description
Region | The region where the selected project resides. Select a region from the drop-down list.
Project Name | The name of the project for which you want to run the archive task. Select a project from the drop-down list.
Start Time | The start time of the task.
Priority | The priority of the task. A small value indicates a high priority.
End Time | The end time of the task.
Bandwidth Limit | Specifies whether to limit the concurrency of archive tasks for the project. Yes: indicates that archive tasks cannot be concurrently run. No: indicates that archive tasks can be concurrently run.
Maximum Concurrent Jobs | The maximum number of archive tasks that can be run for the cluster where the selected project resides at the same time. This parameter is valid only when Bandwidth Limit is set to No.
Enable | Specifies whether the task is enabled.
Maximum Running Jobs | The maximum number of jobs that can be run for the cluster where the selected project resides at the same time. This parameter is a global parameter. The jobs refer to all types of jobs in the cluster where the selected project resides, not only archive tasks.

3. Click Run.

The newly created archive task appears in the list of archive tasks for projects.

View archive task statistics

In the left-side navigation pane of the Business tab, choose Business Optimization > File Archiving. Then, click the Historical Statistics tab to view the historical statistics of archive tasks for clusters and projects.

Archive Tasks

The trend chart for archive tasks shows statistics on the execution of all archive tasks for each day in the last month. It shows the numbers of running tasks, finished tasks, waiting tasks, timeout tasks, failed tasks, invalid tasks, merged partitions, and reduced files. It also shows the reduced data volume on physical storage, in bytes.

Statistics by Cluster and Statistics by Project

The two tables show statistics on the execution of archive tasks for clusters and projects on a specific day in the last month. The tables show the numbers of running tasks, finished tasks, waiting tasks, timeout tasks, failed tasks, invalid tasks, merged partitions, and reduced files. The tables also show the reduced data volume on physical storage, in bytes.

Manage archive types


In the left-side navigation pane of the Business tab, choose Business Optimization > File Archiving. Then, click the Archive Types tab to view the existing archive types and archive parameters.

Create Archive Type

1. In the Archive Tasks section, click Create Archive Type. In the Modify Archive Type panel, specify the required parameters.

The following table describes the parameters.

Parameter | Description
Archive Type | The name of the archive type.
Archive Parameters | The archive parameters of the archive type.

2. Click Compare Versions below Archive Parameters to view the differences between the original and modified values.
3. Click Run.

The newly created archive type appears in the list of archive types.

4.2.5.3. Analyze resources


Apsara Big Data Manager (ABM) allows you to analyze the resources for MaxCompute clusters on different tabs in the ABM console. This way, you can better understand the data storage in MaxCompute. The tabs include Tables, Projects, Tasks, Execution Time, Start Time, and Engines.

Tables
On the Tables tab, you can view the detailed information about all tables in each project, including Partitions, Storage Usage (GB), Pangu File Count, Partitions Ranking, Storage Usage Ranking, and Pangu File Count Ranking. You can sort tables by partition quantity, physical storage usage, and file quantity of Apsara Distributed File System.
In the left-side navigation pane of the Business tab, choose Business Optimization > Resource Analysis. The Tables tab appears.

Projects
On the Projects tab, you can view the detailed information about storage for each project, including Pangu File Count, Storage Usage (GB), CU Usage, Total Memory Usage, Tasks, Tables, Idle Storage, and daily and weekly increases in percentage of these items.

In the left-side navigation pane of the Business tab, choose Business Optimization > Resource Analysis. Click the Projects tab.

Tasks
On the Tasks tab, you can view the detailed information about all tasks in each project, including instanceid, Status, CU Usage, Start Time, End Time, Execution Time (s), CU Usage Ranking, and SQL Statements.
In the left-side navigation pane of the Business tab, choose Business Optimization > Resource Analysis. Click the Tasks tab.

Execution Time
On the Execution Time tab, you can view the numbers of tasks whose execution time is within different time ranges in each project. The metrics include Less than 5 Minutes, Less than 15 Minutes, Less than 30 Minutes, Less than 60 Minutes, and More than 60 Minutes. The Execution Time chart displays the trend lines of task quantity in different colors by day.

In the left-side navigation pane of the Business tab, choose Business Optimization > Resource Analysis. Click the Execution Time tab.
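
To reproduce the Execution Time ranges offline, for example from the Execution Time (s) column exported from the Tasks tab, a bucketing function such as the following can be used. It is a sketch under two assumptions that this guide does not confirm: each task is counted in exactly one range (the first one that it fits), and a task that runs for exactly 60 minutes is counted under More than 60 Minutes. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# (exclusive upper bound in seconds, label shown on the tab)
BUCKETS = [
    (5 * 60, "Less than 5 Minutes"),
    (15 * 60, "Less than 15 Minutes"),
    (30 * 60, "Less than 30 Minutes"),
    (60 * 60, "Less than 60 Minutes"),
]
LAST_LABEL = "More than 60 Minutes"
ALL_LABELS = [label for _, label in BUCKETS] + [LAST_LABEL]


def bucket_for(seconds: float) -> str:
    """Return the label of the first range that the duration fits into."""
    for upper_bound, label in BUCKETS:
        if seconds < upper_bound:
            return label
    return LAST_LABEL


def count_by_bucket(durations: Iterable[float]) -> Dict[str, int]:
    """Count tasks per range; ranges without tasks are reported as 0."""
    counts = Counter(bucket_for(s) for s in durations)
    return {label: counts.get(label, 0) for label in ALL_LABELS}


# Hypothetical execution times in seconds.
print(count_by_bucket([30, 400, 1000, 2000, 4000, 45]))
```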

Start Time
On the Start Time tab, you can view the numbers of tasks started in different time periods for each project. The time interval is 30 minutes. The Tasks chart displays the trend line of the number of tasks started in a specified time period by day.

In the left-side navigation pane of the Business tab, choose Business Optimization > Resource Analysis. Click the Start Time tab.
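
The 30-minute grouping on the Start Time tab can be approximated as shown below. This is a minimal sketch with hypothetical function names and timestamps; it assumes that intervals are aligned to the hour and the half hour, which this guide does not state.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def slot_start(ts: datetime, minutes: int = 30) -> datetime:
    """Round a timestamp down to the start of its interval within the hour."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)


def tasks_per_slot(start_times: Iterable[datetime]) -> Counter:
    """Count how many tasks started in each 30-minute interval."""
    return Counter(slot_start(ts) for ts in start_times)


# Hypothetical task start times.
starts = [
    datetime(2022, 12, 22, 9, 5, 10),
    datetime(2022, 12, 22, 9, 29, 59),
    datetime(2022, 12, 22, 9, 47, 13),
]
for slot, count in sorted(tasks_per_slot(starts).items()):
    print(slot.strftime("%H:%M"), count)
```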

Engines
On t he Engines t ab, you can view t he t rend lines of performance st at ist ics of t asks in each project in t he
T ask Performance Analysis chart . T he performance met rics include cost _cpu, cost _mem, cost _t ime,
input _byt es, input _byt es_per_cu, input _records, input _records_per_cu, out put _byt es,
out put _byt es_per_cu, out put _records, and out put _records_per_cu.


In the left-side navigation pane of the Business tab, choose Business Optimization > Resource
Analysis. Click the Engines tab.

4.3. Service O&M


4.3.1. Control service O&M

4.3.1.1. O&M features and entry


This topic describes control service O&M features and how to go to the control service O&M page.

Control service O&M features

Overview: shows the overall running information about the control service. You can view the service
overview, service status, job running, executor pool size, and job status.
Health Status: shows all checkers for the control service. You can query checker details, check results
for hosts in a cluster, and schemes to clear alerts (if any exist). You can also log on to a host and
perform manual checks on the host.
Instances: shows information about the server roles of the control service. You can view the host,
status, requested CPU resources, and requested memory of each server role.
Configuration: provides the access entry to configure global computing, cluster-level computing,
computing scheduling, and cluster endpoints.
Metadata Repository: allows you to view the completion time and status of the output tasks of the
metadata warehouse and the trend chart of the consumed time for running tasks in MaxCompute.
Start Service Role or Stop Service Role: allows you to start or stop the server roles of the
MaxCompute control service and view the execution history. If you fail to start or stop the server
roles, you can identify the cause of the failure.
Start Admin Console: allows you to start AdminConsole.
Collect Service Logs: allows you to collect service logs for the specified time period. This enables you
to identify the cause of a failure.

Go to the control service O&M page

1. Log on to the Apsara Big Data Manager (ABM) console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.


4. In the left-side navigation pane of the Services tab, click Control. The Overview tab for the control
service appears.

4.3.1.2. Control service overview

The Overview page displays the overall running information about the control service, including the
service summary, service status, job summary, executor pool summary, and job status.

Entry
On the Services page, click Control in the left-side navigation pane. The Overview page for the
control service appears.


On the Overview page, you can view the overall running information about the control service,
including the service summary, service status, job summary, executor pool summary, and job status.

Services
This section displays the numbers of available services and unavailable services.

Service Status
This section displays all control service roles. You can also view the numbers of available and
unavailable services for each service role.

Traffic - Jobs
This section displays the total number of jobs in the cluster, and the numbers of running jobs, jobs
waiting for resources, and jobs waiting for scheduling.

Saturability - Executor Pool Size

This section displays information about the thread pool, including the resource usage, number of jobs
being processed, queue length, and maximum concurrency.

Latency - Waiting Jobs

This section displays the trend chart of jobs. The chart displays the trend lines of the numbers of
running jobs, jobs waiting for resources, and jobs waiting for scheduling in different colors.

4.3.1.3. Control service health

On the Health Status page for the control service, you can view all checkers of a cluster, including the
checker details, check results for the hosts in the cluster, and schemes to clear alerts (if any). In addition,
you can log on to a host and perform manual checks on the host.

Entry
On the Services page, click Control in the left-side navigation pane, and then click the Health Status
tab.


On the Health Status page, you can view all checkers of the cluster and the check results for the hosts
in the cluster. The check results are divided into Critical, Warning, and Exception. They are displayed
in different colors. Pay attention to the check results, especially the Critical and Warning results, and
handle them in a timely manner.

Supported operations
On the Health Status page, you can view all checkers of a cluster, including the checker details, check
results for the hosts in the cluster, and schemes to clear alerts (if any). In addition, you can log on to a
host and perform manual checks on the host. For more information, see Cluster health.

4.3.1.4. Instances
The Instances tab shows information about server roles, which includes the host, status, requested CPU
resources, and requested memory of each server role.

Go to the Instances tab

In the left-side navigation pane of the Services tab, click Control. Then, click the Instances tab.
The Instances tab shows information about server roles, which includes the host, status, requested CPU
resources, and requested memory of each server role.

4.3.1.5. Control service configuration


The Configuration page under Control provides access to the configuration of global computing,
cluster-level computing, computing scheduling, and cluster endpoints. If you need to modify the
configurations of the control service, submit a ticket to apply for technical support, and then modify
the configurations carefully under the guidance of technical support engineers.

On the Services page, click Control in the left-side navigation pane, and then click the Configuration
tab.
The Configuration page consists of the following tabs:

Computing: provides the global computing configuration, cluster-level computing configuration, and
computing scheduling configuration features.
Tunnel Routing Address: provides the cluster endpoint configuration feature.

4.3.1.6. Metadata warehouse for the control service


This topic describes how to view the completion time and status of the output tasks of the metadata
warehouse and the trend chart of the consumed time for running tasks in MaxCompute.

The metadata warehouse in MaxCompute regularly runs output tasks every day. Apsara Big Data
Manager (ABM) obtains the status of output tasks every 30 minutes. If an output task of the metadata
warehouse is not complete within 24 hours, the output task is regarded as a failure.
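
The polling interval and the 24-hour rule described above can be sketched in Python as follows. The function names and status labels are hypothetical, and the sketch assumes that the 24 hours are measured from the time the output task starts, which this guide does not specify.

```python
from datetime import datetime, timedelta

POLL_INTERVAL = timedelta(minutes=30)   # how often ABM obtains the task status
COMPLETION_LIMIT = timedelta(hours=24)  # limit after which a task is a failure


def classify_output_task(started_at, completed_at, now):
    """Classify one output task as 'complete', 'running', or 'failed'.

    completed_at is None while the task has not finished.
    Assumption: the limit is measured from started_at.
    """
    end = completed_at if completed_at is not None else now
    if end - started_at > COMPLETION_LIMIT:
        return "failed"
    return "complete" if completed_at is not None else "running"


def next_collection_time(last_collected_at):
    """Return the time at which the status is expected to be obtained next."""
    return last_collected_at + POLL_INTERVAL
```

Because the status is collected only every 30 minutes, the displayed status can lag behind the actual task status by up to one polling interval.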


In the left-side navigation pane of the Services tab, click Control. On the page that appears, click the
Metadata Repository tab.

The Metadata Repository tab displays the completion time of the output tasks of the metadata
warehouse and the trend chart of the consumed time for running tasks. The time displayed in the
Completed At column indicates the time when an output task is complete. The time displayed in the
Collected At column indicates the last time at which ABM obtains the status of output tasks.

4.3.1.7. Stop or start a server role


Apsara Big Data Manager (ABM) allows you to start or stop the server roles of the MaxCompute control
service and view the execution history. If you fail to start or stop the server roles, you can identify the
cause of the failure.

Stop a server role


1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Control. In the upper-right corner of the
tab that appears, choose Actions > Stop Service Role.
5. In the Stop Service Role panel, select a server role that you want to stop and click Run.
6. In the upper-right corner, click Actions and select Execution History next to Stop Service Role to
check whether the action is successful in the execution history.

The Execution History panel shows the current status, submission time, start time, end time, and
operator of each action.

7. Click Details in the Details column to view the execution details.

On the execution details page, you can view the job name, execution status, execution steps, script,
and parameter settings. You can also download the execution details to your computer.

Start a server role


1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Control. In the upper-right corner of the
tab that appears, choose Actions > Start Service Role.
5. In the Start Service Role panel, select a server role that you want to start and click Run.
6. In the upper-right corner, click Actions and select Execution History next to Start Service Role to
check whether the action is successful in the execution history.

The Execution History panel shows the current status, submission time, start time, end time, and
operator of each action.

7. Click Details in the Details column to view the execution details.

On the execution details page, you can view the job name, execution status, execution steps, script,
and parameter settings. You can also download the execution details to your computer.

Identify the cause of a failure


This section describes how to identify the cause of the failure to start a server role.

1. In the Execution History panel, click Details in the Details column of the task to view the details.
2. In the Start Service Role panel, click View Details for a failed step to identify the cause of the
failure.

You can view the parameter settings, outputs, error messages, script, and runtime parameters to
identify the cause of the failure.

4.3.1.8. Start AdminConsole


AdminConsole is a management platform of MaxCompute. It is disabled by default. Apsara Big Data
Manager (ABM) allows you to quickly start AdminConsole to better manage MaxCompute clusters.

Prerequisites
Your ABM account is granted the required permissions to perform O&M operations on MaxCompute.

Step 1: Start AdminConsole


1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Control.
5. In the upper-right corner of the page that appears, choose Actions > Start Admin Console.
6. In the Start Admin Console panel, click Run.

Step 2: View the execution status or progress

1. On any tab of the CONTROL page, click Actions and select Execution History next to Start
Admin Console in the upper-right corner to view the execution history.


RUNNING indicates that the execution is in progress. SUCCESS indicates that the execution
succeeds. FAILED indicates that the execution fails.

2. If the status is RUNNING, click Details in the Details column to view the execution progress.
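
The three status values above map to different follow-up actions. The following Python sketch, with hypothetical names, summarizes that mapping. The text returned for SUCCESS is an assumption, because this guide describes follow-up actions only for RUNNING and FAILED.

```python
from enum import Enum


class ExecutionStatus(Enum):
    """Status values shown in the Execution History panel."""
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


def next_step(status_text):
    """Map a displayed status to the follow-up action for the operator."""
    status = ExecutionStatus(status_text)  # raises ValueError for unknown text
    if status is ExecutionStatus.RUNNING:
        return "Click Details to view the execution progress."
    if status is ExecutionStatus.FAILED:
        return "View the execution logs to identify the cause of the failure."
    # Assumption: a successful execution needs no follow-up.
    return "No further action is required."
```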

Step 3: (Optional) Identify the cause of a failure

If the status is FAILED, you can view the execution logs to identify the cause of the failure.

1. On any tab of the CONTROL page, click Actions and select Execution History next to Start
Admin Console in the upper-right corner to view the execution history.
2. In the Execution History panel, click Details in the Details column of the task to view the details.
3. On the Servers tab of the failed step, click View Details in the Actions column of a failed server.
The Execution Output tab appears in the Execution Details section. You can view the output to
identify the cause of the failure.

4.3.1.9. Collect service logs


Apsara Big Data Manager (ABM) allows you to collect service logs for the specified time period. This
enables you to identify the cause of a failure.

Prerequisites
Your ABM account is granted the required permissions to perform O&M operations on MaxCompute.

Step 1: Collect service logs


1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Control.
5. In the upper-right corner of the page that appears, choose Actions > Collect Service Logs.
6. In the Collect Service Logs panel, specify the required parameters.
The following table describes the parameters.

Parameter               Description

Target Service          The service from which you want to collect service logs. Select a service
                        from the drop-down list. You can select multiple services.

Time Period             The time period in which the logs that you want to collect are generated.

Degree of Concurrency   The maximum number of nodes from which you can collect service logs at
                        the same time.

Hostname                The name of the host. Separate multiple hostnames with commas (,).

7. Click Run.
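
To illustrate how the Hostname and Degree of Concurrency parameters relate, the following Python sketch splits a comma-separated hostname value and groups the hosts so that no group exceeds the concurrency limit. The function names and the host names in the usage note are hypothetical, and grouping hosts into fixed batches is an assumption. This guide does not describe how ABM schedules log collection internally.

```python
def parse_hostnames(value):
    """Split a comma-separated Hostname value into a list of hostnames."""
    return [name.strip() for name in value.split(",") if name.strip()]


def batches(hostnames, degree_of_concurrency):
    """Group hosts so that at most degree_of_concurrency are handled at once."""
    if degree_of_concurrency < 1:
        raise ValueError("degree_of_concurrency must be at least 1")
    return [
        hostnames[i:i + degree_of_concurrency]
        for i in range(0, len(hostnames), degree_of_concurrency)
    ]
```

For example, `batches(parse_hostnames("host-a,host-b,host-c"), 2)` yields one group of two hosts followed by one group of one host.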

Step 2: View the execution status or progress


1. On any tab of the CONTROL page, click Actions and select Execution History next to Collect
Service Logs in the upper-right corner to view the execution history.

RUNNING indicates that the execution is in progress. SUCCESS indicates that the execution
succeeds. FAILED indicates that the execution fails.

2. If the status is RUNNING, click Details in the Details column to view the execution progress.

Step 3: (Optional) Identify the cause of a failure

If the status is FAILED, you can view the execution logs to identify the cause of the failure.

1. On any tab of the CONTROL page, click Actions and select Execution History next to Collect
Service Logs in the upper-right corner to view the execution history.
2. In the Execution History panel, click Details in the Details column of the task to view the details.
3. On the Servers tab of the failed step, click View Details in the Actions column of a failed server.
The Execution Output tab appears in the Execution Details section. You can view the output to
identify the cause of the failure.

4.3.2. Job Scheduler O&M

4.3.2.1. O&M features and entry


This topic describes Job Scheduler O&M features and how to go to the Job Scheduler O&M page.

Job Scheduler O&M features

Overview: displays the key operating information of Job Scheduler. The information includes the
service overview, service status, resource usage, compute node overview, and the trend charts of CPU
utilization and memory usage.
Health Status: displays all checkers for Job Scheduler. You can query checker details, check results for
hosts in a cluster, and schemes to clear alerts (if any exist). You can also log on to a host and
perform manual checks on the host.
Quotas: allows you to view, create, or modify the quota groups in Job Scheduler.
Instances: displays information about the master nodes and server roles of Job Scheduler and allows
you to restart the master nodes.
Compute Nodes: displays all compute nodes in Job Scheduler and allows you to add compute nodes
to or remove compute nodes from a blacklist or read-only list.
Enable SQL Acceleration or Disable SQL Acceleration: allows you to enable or disable SQL
acceleration for Job Scheduler.
Restart Fuxi Master Node: allows you to restart the primary and secondary master nodes for Job
Scheduler.

Go to the Job Scheduler O&M page

1. Log on to the Apsara Big Data Manager (ABM) console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Fuxi. The Overview tab appears.


4.3.2.2. Overview
The Overview tab shows the key operating information of Job Scheduler. The information includes the
service overview, service status, resource usage, compute node overview, and the trend charts of CPU
utilization and memory usage.

Go to the Overview tab

1. In the left-side navigation pane of the Services tab, click Fuxi.
2. Select a cluster and click the Overview tab. The Overview tab for the selected cluster appears.

Services
This section shows the numbers of available services, unavailable services, and services that are being
updated.


Roles
This section shows all Job Scheduler server roles and their states. You can also view the expected and
actual numbers of machines for each server role.

Click the name of a server role to go to the Apsara Infrastructure Management Framework console and
view its details.

CPU Usage (1/100 Core) and Memory Usage (MB)

The Trend for Resource Usage section shows the trend charts of CPU utilization and memory usage for
Job Scheduler. Each trend chart shows information about the used quota, minimum quota, maximum
cluster quota, requested quota, and maximum quota in different colors. The trend charts are
periodically refreshed. You can also manually refresh the trend charts. You can also view the trend
charts of CPU utilization and memory usage for a specific period.
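
The chart units in the heading above (1/100 core for CPU and MB for memory) can be converted to whole cores and gigabytes as in the following Python sketch. The function names are hypothetical, and the memory conversion assumes that 1 GB equals 1024 MB.

```python
def cpu_units_to_cores(cpu_units):
    """Convert a CPU value expressed in 1/100 core to cores."""
    return cpu_units / 100


def memory_mb_to_gb(memory_mb):
    """Convert megabytes to gigabytes, assuming 1 GB = 1024 MB."""
    return memory_mb / 1024
```

Under these assumptions, a used quota of 250 on the CPU chart corresponds to 2.5 cores.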


Saturability - Resource Usage


This section shows the allocation of CPU and memory resources.
CPU (Core): shows the CPU utilization, the total number of CPU cores, the number of available CPU
cores, and the CPU cores for SQL acceleration.
Memory (Bytes): shows the memory usage, the total memory size, the available memory size, and the
memory size for SQL acceleration.

Compute Nodes
This section shows the details of compute nodes in Job Scheduler. The details include the percentage
of online compute nodes, the total number of compute nodes, the number of online compute nodes,
and the number of compute nodes in a blacklist.


4.3.2.3. Job Scheduler health


On the Health Status page for Job Scheduler, you can view all checkers of Job Scheduler, including the
checker details, check results, and schemes to clear alerts (if any). In addition, you can log on to a host
and perform manual checks on the host.

Entry
1. On the Services page, click Fuxi in the left-side navigation pane.
2. Select a cluster from the drop-down list, and then click the Health Status tab. The Health Status
page for Job Scheduler appears.

On the Health Status page, you can view all checkers of the Job Scheduler service and the check
results for all hosts in the cluster. The check results are divided into Critical, Warning, and
Exception. They are displayed in different colors. Pay attention to the check results, especially the
Critical and Warning results, and handle them in a timely manner.

Supported operations
On the Health Status page, you can view all checkers of a cluster, including the checker details, check
results for the hosts in the cluster, and schemes to clear alerts (if any). In addition, you can log on to a
host and perform manual checks on the host. For more information, see Cluster health.

4.3.2.4. Quotas
You can view, create, or modify quota groups in Job Scheduler on the Quotas tab. A quota group is
used to allocate computing resources to MaxCompute projects, including CPU and memory resources.

Go to the Quotas tab

1. In the left-side navigation pane of the Services tab, click Fuxi.
2. Select a cluster and click the Quotas tab. The Quotas tab for the selected cluster appears.


The Quotas tab lists existing quota groups in Job Scheduler.

Create a quota group

1. In the upper-left corner of the Quotas tab, click Create Quota Group.
2. In the Quota Group pane, specify the required parameters.

3. Click Run.
The newly created quota group appears in the quota group list.

View quota group details

Click the name of a quota group to view its details. The Resource Usage tab shows the trend charts
of CPU utilization and memory usage. The Applications tab shows the projects that use the quota
group resources.
Resource usage


Applications

Modify a quota group

1. On the Quotas tab, find the quota group that you want to modify and click Modify in the Actions
column. In the pane that appears, modify parameters as instructed.
2. Click Run.

After the configuration is complete, you can check whether the quota group is modified in the quota
group list.

4.3.2.5. Instances
This topic describes how to view information about the master nodes and server roles of Job Scheduler
and how to restart the master nodes.

Go to the Instances tab

1. In the left-side navigation pane of the Services tab, click Fuxi.
2. Select a cluster and click the Instances tab. The Instances tab for the selected cluster appears.


The Instances tab shows information about the master nodes and server roles of Job Scheduler. The
information about the master nodes includes the IP address, hostname, server role, and start time.
The information about a server role includes the role name, hostname, role status, and host status.

Supported operations
You can restart the master nodes of Job Scheduler. For more information, see Restart a master node
of Job Scheduler.

4.3.2.6. Job Scheduler compute nodes

You can view the details of compute nodes on the Compute Nodes page for Job Scheduler, including
the total CPU, idle CPU, total memory, and idle memory of each compute node. You can also check
whether a node is added to the blacklist and whether it is active. In addition, you can add compute
nodes to or remove compute nodes from the blacklist or read-only list on the Compute Nodes page.

Entry
1. On the Services page, click Fuxi in the left-side navigation pane.
2. Select a cluster from the drop-down list, and then click the Compute Nodes tab. The Compute
Nodes page for Job Scheduler appears.


You can view the details of compute nodes on the Compute Nodes page for Job Scheduler, including
the total CPU, idle CPU, total memory, and idle memory of each compute node. You can also check
whether a node is added to the blacklist and whether it is active.

Blacklist and read-only setting

You can add compute nodes to or remove compute nodes from the blacklist or read-only list. To add
compute nodes to the blacklist, follow these steps:

1. On the Compute Nodes page, click Actions for the target compute node and then select Add to
Blacklist.
2. In the dialog box that appears, click Run. A message appears, indicating that the action has been
submitted.

The value of the Hostname parameter is automatically filled. You do not need to specify a value for
this parameter.
You can check whether a compute node is added to the blacklist in the compute node list after the
configuration is completed.

4.3.2.7. Enable and disable SQL acceleration

You can enable or disable SQL acceleration for Job Scheduler in the Apsara Big Data Manager (ABM)
console. The execution speed of SQL statements in Job Scheduler is greatly increased with SQL
acceleration enabled, but more computing resources are consumed.

Enable SQL acceleration
1. In the left-side navigation pane of the Services tab, click Fuxi. Then, select a cluster.
2. In the upper-right corner of the tab that appears, choose Actions > Enable SQL Acceleration.
3. In the Enable SQL Acceleration panel, set the WorkerSpans parameter.


WorkerSpans: the default resource quota of the cluster and the resource quota for a specific
period. Default value: default:2,12-23:2.

Note The default value indicates that the default resource quota is 2 and the resource
quota for the period from 12:00 to 23:00 is also 2. You can set the resource quota as needed.
For example, you can set this parameter to default:2,12-23:4 to increase the resource quota in
peak hours.

4. Click Run.
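
The WorkerSpans format described in step 3 can be illustrated with the following Python sketch, which parses a value such as default:2,12-23:4 and looks up the quota that applies at a given hour. The function names are hypothetical. The sketch also assumes that a default entry is required, that several periods may be listed, and that the end hour of a period is exclusive. None of these details are confirmed by this guide.

```python
def parse_worker_spans(value):
    """Parse a WorkerSpans string.

    Returns (default_quota, [(start_hour, end_hour, quota), ...]).
    """
    default_quota = None
    spans = []
    for item in value.split(","):
        key, _, quota_text = item.strip().partition(":")
        quota = int(quota_text)
        if key == "default":
            default_quota = quota
        else:
            start_text, _, end_text = key.partition("-")
            spans.append((int(start_text), int(end_text), quota))
    if default_quota is None:
        # Assumption: a default entry must always be present.
        raise ValueError("WorkerSpans must contain a default entry")
    return default_quota, spans


def quota_at(value, hour):
    """Return the quota that applies at the given hour of the day (0-23)."""
    default_quota, spans = parse_worker_spans(value)
    for start_hour, end_hour, quota in spans:
        # Assumption: the end hour is exclusive (12-23 covers 12:00 to 22:59).
        if start_hour <= hour < end_hour:
            return quota
    return default_quota
```

With the example value default:2,12-23:4, `quota_at` returns 4 for hour 13 and 2 for hour 3.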

Disable SQL acceleration
1. In the left-side navigation pane of the Services tab, click Fuxi. Then, select a cluster.
2. In the upper-right corner of the tab that appears, choose Actions > Disable SQL Acceleration.
3. In the Disable SQL Acceleration panel, click Run.

View the execution history of enabling or disabling SQL acceleration

After you submit the action of enabling or disabling SQL acceleration, you can view the execution
history to check whether the action is complete. The system executes the action as a job. It provides
execution records and logs for each execution so that you can identify faults encountered during its
execution. This section describes how to view the execution history of enabling SQL acceleration.

1. In the left-side navigation pane of the Services tab, click Fuxi. Then, select a cluster.
2. In the upper-right corner of the tab that appears, click Actions and select Execution History next
to Enable SQL Acceleration.
3. In the Execution History panel, view the execution history of enabling SQL acceleration.

The execution history shows the current status, submission time, start time, end time, and operator
of each execution.

4. If the execution fails, click Details in the Details column to identify the cause of the failure.


4.3.2.8. Restart a master node of Job Scheduler


Job Scheduler is the resource management and task scheduling system of the Apsara distributed
operating system. Apsara Big Data Manager (ABM) allows you to quickly restart the primary and
secondary master nodes of Job Scheduler. Cluster services are not affected during the restart process.

Prerequisites
Your ABM account is granted the required permissions to perform O&M operations on MaxCompute.

Step 1: Restart a master node of Job Scheduler

1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Fuxi. Then, click the Instances tab.
5. On the Instances tab, choose Actions > Restart Fuxi Master Node in the Actions column of a
primary or secondary master node.
6. In the Restart Fuxi Master Node panel, click Run. The Restart Fuxi Master Node panel appears.

Step 2: View the execution status or progress

1. In the Restart Fuxi Master Node panel, check the execution history of restarting master nodes.

The Restart Fuxi Master Node panel displays the restart history. RUNNING indicates that the
execution is in progress. SUCCESS indicates that the execution succeeds. FAILED indicates that the
execution fails.

2. If the status is RUNNING, click Details in the Details column to view the execution progress.

Step 3: (Optional) Identify the cause of a failure


If the status is FAILED, you can view the execution logs to identify the cause of the failure.

1. In the Restart Fuxi Master Node panel, check the execution history of restarting master nodes.
2. Click Details in the Details column of the task to view the details.
3. On the Servers tab of the failed step, click View Details in the Actions column of a failed server.
The Execution Output tab appears in the Execution Details section. You can view the output to
identify the cause of the failure.

4.3.3. Apsara Distributed File System O&M

4.3.3.1. O&M features and entry


This topic describes the O&M features of Apsara Distributed File System and how to go to the Apsara
Distributed File System O&M page.

Apsara Distributed File System O&M features

Overview: shows the key operating information of Apsara Distributed File System. The information
includes the service overview, service status, storage usage, storage node overview, and the trend
charts of storage usage and file count.
Health Status: shows all checkers for Apsara Distributed File System. You can query checker details,
check results for hosts in a cluster, and schemes to clear alerts (if any exist). You can also log on to a
host and perform manual checks on the host.
Instances: shows information about the master nodes and server roles of Apsara Distributed File
System. You can change the primary master node or run a checkpoint on a master node of Apsara
Distributed File System.
Storage Nodes: shows information about the storage nodes of Apsara Distributed File System. You
can set the status of a storage node to Disabled or Normal. You can also set the status of a disk on a
storage node to Normal or Error.
Change Primary Master Node: allows you to change the primary master node of Apsara Distributed
File System in a cluster.
Run Checkpoint on Master Node: allows you to run checkpoints on master nodes of Apsara
Distributed File System to write memory data to disks.
Empty Recycle Bin: allows you to clear the recycle bin of Apsara Distributed File System.
Enable Data Rebalancing or Disable Data Rebalancing: allows you to enable or disable the data
rebalancing feature of Apsara Distributed File System.

Go to the Pangu page


1. Log on to the Apsara Big Data Manager (ABM) console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Pangu. Then, select a cluster. The
Overview tab for the selected cluster appears.

4.3.3.2. Overview
The Overview tab shows the key operating information about Apsara Distributed File System. The
information includes the service overview, service status, storage usage, storage node overview, and
the trend charts of storage usage and file count.

Go to the Overview tab

1. In the left-side navigation pane of the Services tab, click Pangu.
2. Select a cluster and click the Overview tab. The Overview tab for the selected cluster appears.

Services
This section shows the status of Apsara Distributed File System and the number of server roles.

Roles

This section shows all server roles of Apsara Distributed File System and their states. You can also view
the expected and actual numbers of hosts for each server role.

Saturability - Storage
This section shows the storage usage and file count.

Storage: shows the storage usage, total storage space, available storage space, and recycle bin size.
File Count: shows the file count usage, maximum number of files, number of existing files, and number of files in the recycle bin.

Storage Trend and File Count Trend


This section shows the trend charts of the storage usage and file count. The storage usage chart
shows the trend lines of the total storage space, used storage space, and storage usage in different
colors. The file count chart shows the trend line of the file count.

In the upper-right corner of the chart, click the icon to enlarge the chart. The following figure shows
an enlarged chart of storage usage.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
storage usage of the cluster in the specified period.

Storage Nodes
This section shows information about the storage nodes of Apsara Distributed File System. The
information includes the numbers of data nodes, normal nodes, disks, and normal disks. You can also
view the faulty node percentage and faulty disk percentage.
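
The guide does not state how the faulty percentages are derived. The following Python sketch assumes that a node or disk that is not counted as normal is counted as faulty, so the percentage follows from the total and normal counts shown in this section. The function name and this definition are illustrative assumptions, not part of ABM.

```python
def faulty_percentage(total: int, normal: int) -> float:
    """Percentage of items (nodes or disks) that are not in the normal state.

    Assumption: faulty = total - normal. ABM may use a different definition.
    """
    if total < 0 or not 0 <= normal <= max(total, 0):
        raise ValueError("normal must be between 0 and total")
    if total == 0:
        # No nodes or disks reported, so nothing can be faulty.
        return 0.0
    return (total - normal) / total * 100


if __name__ == "__main__":
    # Hypothetical counts: 8 data nodes, 6 of them normal.
    print(faulty_percentage(8, 6))
```

Comparing a value computed this way with the percentage shown on the tab is a quick way to confirm whether the assumed definition matches your environment.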

4.3.3.3. Instances
This topic describes how to view information about the master nodes and server roles of Apsara
Distributed File System. It also describes how to change the primary master node or run a checkpoint on
a master node of Apsara Distributed File System.

Go to the Instances tab

1. In the left-side navigation pane of the Services tab, click Pangu.
2. Select a cluster and click the Instances tab. The Instances tab for the selected cluster appears.

The Instances tab shows information about the master nodes and server roles of Apsara Distributed
File System. The information about a master node includes the IP address, hostname, server role, and
log ID. The information about a server role includes the role name, hostname, role status, and host
status.

Supported operations
You can change the primary master node or run a checkpoint on a master node of Apsara Distributed
File System. For more information, see Change the primary master node of Apsara Distributed File
System and Run a checkpoint on a master node of Apsara Distributed File System.

4.3.3.4. Apsara Distributed File System health


On the Health Status page for Apsara Distributed File System, you can view all checkers of Apsara
Distributed File System, including the checker details, check results, and schemes to clear alerts (if any).
In addition, you can log on to a host and perform manual checks on the host.

Entry

1. On the Services page, click Pangu in the left-side navigation pane.
2. Select a cluster from the drop-down list, and then click the Health Status tab. The Health Status
page for Apsara Distributed File System appears.

On the Health Status page, you can view all checkers of Apsara Distributed File System and the
check results for all hosts in the cluster. The check results are divided into Critical, Warning, and
Exception. They are displayed in different colors. Pay attention to the check results, especially the
Critical and Warning results, and handle them in a timely manner.

Supported operations
On the Health Status page, you can view all checkers of a cluster, including the checker details, check
results for the hosts in the cluster, and schemes to clear alerts (if any). In addition, you can log on to a
host and perform manual checks on the host. For more information, see Cluster health.

4.3.3.5. Apsara Distributed File System storage


This topic describes how to view the storage overview and storage node information of Apsara
Distributed File System, and how to set the status of storage nodes and data disks.

Entry to the Storage Overview page

1. On the Services page, click Pangu in the left-side navigation pane.
2. Select a cluster from the drop-down list, and then click the Storage tab. The Storage Overview
page for Apsara Distributed File System appears.

The Storage Overview page displays whether data rebalancing is enabled, key metrics and their
values, suggestions to handle exceptions, and rack specifications of Apsara Distributed File System.

Entry to the Storage Nodes page

1. On the Services page, click Pangu in the left-side navigation pane.
2. Select a cluster from the drop-down list, and then click the Storage tab. The Storage Overview
page for Apsara Distributed File System appears.
3. Click the Storage Nodes tab. The Storage Nodes page appears.

The Storage Nodes page displays the information about all storage nodes of Apsara Distributed
File System, including the total storage size, available storage size, status, TTL, and send buffer size.

Set the storage node status

You can set the storage node status to Disabled or Normal. This section describes how to set the
status of a storage node to Disabled.
1. On the Storage Nodes page, find the target storage node and choose Actions > Set Node
Status to Disabled in the Actions column.
2. In the Set Node Status to Shutdown panel, click Run. A message appears, indicating that the
action has been submitted.

The values of the Volume and Hostname parameters are automatically filled based on the selected
storage node. You do not need to specify values for the parameters.

You can check whether the status of the storage node has changed in the storage node list.

Set the data disk status

You can set the data disk status to Error or Normal. This section describes how to set the status of a
data disk to Error.

1. On the Storage Nodes page, find the target storage node and choose Actions > Set Disk Status
to Error in the Actions column.
2. In the Set Disk Status to Error panel, set the Diskid parameter.

The values of the Volume and Hostname parameters are automatically filled based on the selected
storage node. You do not need to specify values for the parameters.

3. Click Run. A message appears, indicating that the action has been submitted.

4.3.3.6. Change the primary master node of Apsara Distributed File System
Apsara Big Data Manager (ABM) allows you to perform a primary/secondary switchover on the master
nodes of Apsara Distributed File System. After the primary/secondary switchover is complete, an
original secondary master node becomes the primary master node, and the original primary master node
becomes a secondary master node.

Prerequisites
Your ABM account is granted the required permissions to perform O&M operations on MaxCompute.

Background information
A volume in Apsara Distributed File System is similar to a namespace. The default volume is
PanguDefaultVolume. If a cluster contains a large number of nodes, multiple volumes may exist. A
volume has three master nodes. One of the nodes serves as the primary master node, and the other
two nodes serve as secondary master nodes.

Procedure
1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Pangu. Then, select a cluster and click the
Instances tab.
5. In the Master Status section of the Instances tab, find the required master node and choose
Actions > Change Primary Master Node in the Actions column. In the Change Primary Master Node
panel, specify the required parameters.

Parameter description:

Volume: the volume whose primary master node needs to be changed. Default value: PanguDefaultVolume. If a cluster contains multiple volumes, set this parameter to the name of the actual volume whose primary master node needs to be changed.
Hostname: the hostname of the secondary master node that is to be the new primary master node.
Log Gap: the maximum log number gap between the original primary and secondary master nodes you want to switch. During the switchover, the system checks the log number gap. If the gap is less than the specified value, the switchover is allowed. Otherwise, you cannot change the primary master node. Default value: 100000.
6. Click Run. The Change Primary Master Node panel appears.

The Change Primary Master Node panel shows the switchover history. RUNNING indicates that the
execution is in progress. SUCCESS indicates that the execution succeeds. FAILED indicates that the
execution fails.

7. If the status is FAILED, click Details in the Details column to identify the cause of the failure.

You can view information about parameter settings, host details, script, and runtime parameters to
identify the cause of the failure.
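
The Log Gap rule in step 5 can be written out as a small check. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that the log number gap is the absolute difference between the log IDs of the two master nodes, as listed on the Instances tab. The guide does not describe how ABM computes the gap internally.

```python
DEFAULT_LOG_GAP = 100000  # default value of the Log Gap parameter in this guide


def switchover_allowed(primary_log_id: int, secondary_log_id: int,
                       log_gap: int = DEFAULT_LOG_GAP) -> bool:
    """Return True if the log number gap is less than the specified Log Gap.

    Assumption: the gap is the absolute difference between the two log IDs.
    """
    gap = abs(primary_log_id - secondary_log_id)
    return gap < log_gap


if __name__ == "__main__":
    # Hypothetical log IDs read from the Instances tab.
    print(switchover_allowed(1_000_000, 950_000))   # gap of 50000
    print(switchover_allowed(1_000_000, 900_000))   # gap of 100000
```

Under this reading, a gap equal to the Log Gap value is rejected, because the guide requires the gap to be less than the specified value.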

4.3.3.7. Clear the recycle bin of Apsara Distributed File System
Apsara Big Data Manager (ABM) allows you to clear the recycle bin of Apsara Distributed File System to
release storage space.

Prerequisites
Your ABM account is granted the required permissions to perform O&M operations on MaxCompute.

Procedure
1. In the left-side navigation pane of the Services tab, click Pangu. Then, select a cluster. The
Overview tab for the selected cluster appears.
2. In the upper-right corner, choose Actions > Empty Recycle Bin.
3. In the Empty Recycle Bin panel, set the Volume parameter. The default value is
PanguDefaultVolume.

4. Click Run.
5. View the execution status.

In the upper-right corner, click Actions and select Execution History next to Empty Recycle Bin to
view the execution history.

RUNNING indicates that the execution is in progress. SUCCESS indicates that the execution
succeeds. FAILED indicates that the execution fails.
6. If the status is FAILED, click Details in the Details column to identify the cause of the failure.

You can view information about parameter settings, host details, script, and runtime parameters to
identify the cause of the failure.

4.3.3.8. Enable or disable data rebalancing for Apsara Distributed File System
Apsara Big Data Manager (ABM) allows you to enable or disable data rebalancing for Apsara Distributed
File System.

Prerequisites
Your ABM account is granted the required permissions to perform O&M operations on MaxCompute.

Disable data rebalancing

1. In the left-side navigation pane of the Services tab, click Pangu.
2. In the Pangu panel, select a cluster.

3. In the upper-right corner of the tab that appears, choose Actions > Disable Data Rebalancing.
4. In the Disable Data Rebalancing panel, set the Volume parameter. The default value is
PanguDefaultVolume.

5. Click Run.
6. View the execution status.

Click Actions and select Execution History next to Disable Data Rebalancing to view the
execution history.

RUNNING indicates that the execution is in progress. SUCCESS indicates that the execution
succeeds. FAILED indicates that the execution fails.

7. If the status is FAILED, click Details in the Details column to identify the cause of the failure. For more
information, see Identify the cause of a failure.

Enable data rebalancing

1. In the left-side navigation pane of the Services tab, click Pangu.
2. In the Pangu panel, select a cluster.
3. In the upper-right corner of the tab that appears, choose Actions > Enable Data Rebalancing.
4. In the Enable Data Rebalancing panel, set the Volume parameter. The default value is
PanguDefaultVolume.

5. Click Run.
6. View the execution status.

Click Actions and select Execution History next to Enable Data Rebalancing to view the
execution history.

RUNNING indicates that the execution is in progress. SUCCESS indicates that the execution
succeeds. FAILED indicates that the execution fails.

7. If the status is FAILED, click Details in the Details column to identify the cause of the failure. For more
information, see Identify the cause of a failure.

Identify the cause of a failure

This section uses a failed attempt to enable data rebalancing as an example.
1. In the Execution History panel, click Details in the Details column for a failed execution.
2. In the Enable Data Rebalancing panel, click View Details for a failed step to identify the cause of
the failure.

You can view information about parameter settings, host details, script, and runtime parameters to
identify the cause of the failure.

4.3.3.9. Run a checkpoint on a master node of Apsara Distributed File System
Apsara Big Data Manager (ABM) allows you to run checkpoints on master nodes of Apsara Distributed
File System. This operation writes memory data to disks. If Apsara Distributed File System becomes faulty,
you can use checkpoints to restore data to its state before the failure. This ensures data consistency.

Prerequisites
Your ABM account is granted the required permissions to perform O&M operations on MaxCompute.

Procedure
1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Pangu. Then, select a cluster and click the
Instances tab.
5. In the Master Status section of the Instances tab, find the required master node and choose
Actions > Run Checkpoint on Master Node in the Actions column. In the Run Checkpoint on
Master Node panel, set the Volume parameter.

Note: The default value of Volume is PanguDefaultVolume.

6. Click Run. The Run Checkpoint on Master Node panel appears.

The Run Checkpoint on Master Node panel shows the execution history of the checkpoint on the
master node. RUNNING indicates that the execution is in progress. SUCCESS indicates that the
execution succeeds. FAILED indicates that the execution fails.

7. If the status is FAILED, click Details in the Details column to identify the cause of the failure.
You can also view information about parameter settings, host details, script, and execution
parameters to identify the cause of the failure.

4.3.4. Tunnel service

4.3.4.1. O&M features and entry


This topic describes the definition and O&M features of the Tunnel service. It also describes
how to go to the O&M page of the Tunnel service.

Definition of the Tunnel service

The Tunnel service serves as a data tunnel of MaxCompute. You can use this service to upload data to
or download data from MaxCompute.

O&M features of the Tunnel service

Overview: shows information about the Tunnel service. The information includes the service overview, service status, and throughput trend chart.
Instances: shows information about the server roles of the Tunnel service.
Traffic Analysis: shows the traffic curves of specific projects in a specific period. The curves show traffic types and the peak throughput in the specified period, which helps you make informed decisions.
Restart Tunnel Server: allows you to restart one or more Tunnel servers.

Go to the Tunnel Service page

1. Log on to the Apsara Big Data Manager (ABM) console.

2. In the upper-right corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Tunnel Service. The Overview tab for the
Tunnel service appears.

4.3.4.2. Overview
The Overview tab for the Tunnel service shows key operating information. The information includes the
service overview, service status, and throughput.

Go to the Overview tab

In the left-side navigation pane of the Services tab, click Tunnel Service. The Overview tab for the
Tunnel service appears.

The Overview tab shows key operating information about the Tunnel service. The information includes
the service overview, service status, and throughput trend chart.

Services
The Services section shows the numbers of available services, unavailable services, and services that are
being updated.

Roles
The Roles section shows all Tunnel server roles and their status. You can also view the expected and
actual numbers of hosts for each server role.

Tunnel throughput
The Tunnel Throughput (Bytes/Min) chart shows the trend lines of the inbound and outbound traffic in
different colors. This trend chart can be automatically or manually refreshed. You can view the trend
chart of Tunnel throughput in a specific period.

4.3.4.3. Instances
The Instances tab shows information about the Tunnel server roles. The information includes the role
name, hostname, IP address, role status, and host status.

Go to the Instances tab

In the left-side navigation pane of the Services tab, click Tunnel Service. Then, click the Instances
tab. The Instances tab for the Tunnel service appears.

The Instances tab shows information about all Tunnel server roles. The information includes the role
name, hostname, IP address, role status, and host status. The status can be good, error, or upgrading.

4.3.4.4. Traffic analysis

The Traffic Analysis tab displays the traffic curves of specific projects in a specific period. The curves
show traffic types and the peak throughput in the specified period, which helps you make informed
decisions.

Go to the Traffic Analysis tab

In the left-side navigation pane of the Services tab, click Tunnel Service. Then, click the Traffic
Analysis tab. The Traffic Analysis tab for the Tunnel service appears.

After you specify a period and the project for traffic analysis, click the icon. Then, you can view the
upstream and downstream throughput curves of Tunnel traffic for traffic analysis.

Note
The traffic data comes from Monitoring System. Make sure that this system is normal.
By default, the top five projects that have the most traffic are selected. You can also filter projects based on your business requirements.
By default, the beginning of the period is two days before the current time, and the end of the period is one day before the current time. You can also specify the period based on your business requirements.
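
The default period described in the note can be expressed as a short calculation. The following Python sketch is illustrative only: the function name is hypothetical, and it only mirrors the rule stated above (start two days before the current time, end one day before the current time). It does not call any ABM interface.

```python
from datetime import datetime, timedelta
from typing import Tuple


def default_traffic_window(now: datetime) -> Tuple[datetime, datetime]:
    """Return the default (start, end) of the traffic analysis period."""
    start = now - timedelta(days=2)  # two days before the current time
    end = now - timedelta(days=1)    # one day before the current time
    return start, end


if __name__ == "__main__":
    start, end = default_traffic_window(datetime.now())
    print("start:", start.isoformat(), "end:", end.isoformat())
```

The default window therefore always spans 24 hours and never includes the most recent day of traffic. To analyze recent traffic, specify the period manually.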

4.3.4.5. Restart Tunnel servers

Apsara Big Data Manager (ABM) allows you to restart Tunnel servers for the corresponding server roles.

Prerequisites
Your ABM account is granted the required permissions to perform O&M operations on MaxCompute.

Context
You can restart one or more Tunnel servers at a time on the Instances tab.

Step 1: Restart Tunnel servers

1. Log on to the ABM console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Services tab.
4. In the left-side navigation pane of the Services tab, click Tunnel Service. Then, click the Instances
tab.
5. On the Instances tab, select one or more server roles for which you want to restart the Tunnel service.
In the upper-right corner, choose Actions > Restart Tunnel Server.
6. In the Restart Tunnel Server panel, configure the required parameters.
The following table describes the required parameters.

Parameter        Description

Force Restart    Specifies whether to forcibly restart the Tunnel server for the selected server role. Valid values:
                 no_force: Do not forcibly restart the Tunnel server. If a server role is in the running state, the corresponding Tunnel server is not restarted.
                 force: Forcibly restart the Tunnel server. The Tunnel server is restarted regardless of the server role state.

Hostname         The hostname of the selected server role. The value is automatically provided. You do not need to specify a value for this parameter.

7. Click Run.
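
The two Force Restart values can be summarized as a decision rule. The following Python sketch is a hypothetical illustration of the behavior described in the table. The function name and the state string "running" are assumptions, and the sketch does not reflect how ABM implements the restart.

```python
def should_restart(force_restart: str, role_state: str) -> bool:
    """Decide whether a Tunnel server is restarted for a selected server role.

    force_restart: "no_force" or "force", as listed in the parameter table.
    role_state: state of the server role. "running" is an assumed spelling.
    """
    if force_restart == "force":
        # Restarted regardless of the server role state.
        return True
    if force_restart == "no_force":
        # A server role in the running state is left untouched.
        return role_state != "running"
    raise ValueError("force_restart must be 'no_force' or 'force'")


if __name__ == "__main__":
    for mode in ("no_force", "force"):
        for state in ("running", "error"):
            print(mode, state, should_restart(mode, state))
```

In other words, no_force only affects server roles that are not running, whereas force also restarts healthy, running servers.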

Step 2: View the execution status or progress


1. On the Overview or Instances tab of the Tunnel Service page, click Actions in the upper-right
corner. Then, select Execution History next to Restart Tunnel Server to view the execution
history.

RUNNING indicates that the execution is in progress. SUCCESS indicates that the execution
succeeds. FAILED indicates that the execution fails.
2. If the status is RUNNING, click Details in the Details column to view the execution progress.

Step 3: (Optional) Identify the cause of a failure

If the status is FAILED, you can view the execution logs to identify the cause of the failure.

1. On the Overview or Instances tab of the Tunnel Service page, click Actions in the upper-right
corner. Then, select Execution History next to Restart Tunnel Server to view the execution
history.
2. In the Execution History panel, click Details in the Details column of the task to view the details.
3. On the Servers tab of the failed step, click View Details in the Actions column of a failed server.
The Execution Output tab appears in the Execution Details section. You can view the output to
identify the cause of the failure.

4.4. Cluster O&M


4.4.1. O&M features and entry
This topic describes the O&M features of MaxCompute clusters and how to go to the
MaxCompute cluster O&M page.

Cluster O&M features

O&M features of MaxCompute clusters:

Overview: shows the overall running information about a cluster. You can view the host status, service status, health check result, and health check history. You can also view the trend charts of CPU utilization, disk usage, memory usage, load, and packet transmission for the cluster. In the Log on section, you can click the name of the host whose role is pangu master, fuxi master, or odps ag to log on to the host.
Health Status: shows all checkers for a cluster. You can query checker details, check results for hosts in the cluster, and schemes to clear alerts (if any exist). You can also log on to a host and perform manual checks on the host.
Servers: shows information about hosts in a cluster. The information includes the hostname, IP address, role, type, CPU utilization, memory usage, root disk usage, packet loss rate, and packet error rate.
Scale out Cluster or Scale in Cluster: allows you to add or remove physical hosts to scale out or scale in a MaxCompute cluster.
Enable Auto Repair: allows you to enable auto repair for MaxCompute clusters.
Restore Environment Settings: allows you to restore environment settings for multiple hosts in a MaxCompute cluster at a time.

Go to the Clusters tab


1. Log on to the Apsara Big Data Manager (ABM) console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Clusters tab.
4. In the left-side navigation pane of the Clusters tab, click a cluster. The Overview tab for the
selected cluster appears.

4.4.2. Cluster health

The Health Status tab shows all checkers for a cluster. You can query checker details, check results for
hosts in the cluster, and schemes to clear alerts (if any exist). You can also log on to a host and
perform manual checks on the host.

Go to the Health Status tab

1. Log on to the Apsara Big Data Manager (ABM) console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Clusters tab.
4. In the left-side navigation pane of the Clusters tab, select a cluster. Then, click the Health Status
tab. The Health Status tab for the selected cluster appears.

On the Health Status tab, you can view all checkers for the cluster and the check results for the
hosts in the cluster. The following alerts may be reported on a host: CRITICAL, WARNING, and
EXCEPTION. The alerts are represented in different colors. You must handle the alerts in a timely
manner, especially the CRITICAL and WARNING alerts.

View checker details


1. On the Health Status tab, click Details in the Actions column of a checker. On the Details page, view
checker details.

The checker details include Name, Source, Alias, Application, Type, Scheduling, Data
Collection, Default Execution Interval, and Description. The schemes to clear alerts are provided
in the description.

2. Click Show More to view more information about the checker.

You can view information about Script, Target (TianJi), Default Threshold, and Mount Point.

View the hosts for which alerts are reported and causes for the alerts
You can view the check history and check results of a checker on a host.

1. On the Health Status tab, click + to expand a checker for which alerts are reported. You can view all
hosts where the checker is run.

2. Click a hostname. In the panel that appears, click Details in the Actions column of a check result to
view the cause of the alert.

Clear alerts
On the Health Status tab, click Details in the Actions column of a checker for which alerts are reported.
On the Details page, view the schemes to clear alerts.

Log on to a host
You may need to log on to a host to handle alerts or other issues that occurred on the host.
1. On the Health Status tab, click + to expand a checker for which alerts are reported.

2. Click the Login in icon of a host. The TerminalService page appears.

3. On the TerminalService page, click the hostname in the left-side navigation pane to log on to the
host.

Run a checker again


After you clear an alert for a host, click Refresh in the Actions column of the host to run the checker
again for the host. This way, you can check whether the alert is cleared.

4.4.3. Overview
This topic describes how to go to the Overview tab of a MaxCompute cluster. It also shows the cluster
overview and describes the operations that you can perform on this tab.

Go to the Overview tab

1. Log on to the Apsara Big Data Manager (ABM) console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Clusters tab.
4. In the left-side navigation pane of the Clusters tab, select a cluster. The Overview tab for the
selected cluster appears.

On the Overview tab, you can quickly log on to a host that is commonly used in MaxCompute cluster
O&M. You can view the host status, service status, health check result, and health check history. You
can also view the trend charts of CPU utilization, disk usage, memory usage, load, and packet
transmission for the cluster.

Log on
In this section, you can log on to a host that is commonly used in MaxCompute cluster O&M and whose
role is pangu master, fuxi master, or odps ag.

1. In the Log on section, click the hostname in the Hostname column. The Hosts tab for the host
appears.
2. In the upper-left corner, click the Login in icon of the host. The TerminalService page appears.

3. In the left-side navigation pane, click the hostname to log on to the host.

Servers
T his sect ion shows all host st at us and t he number of host s in each st at e. A host can be in t he good or
error st at e.

Services
T his sect ion displays all services deployed in t he clust er and t he respect ive number of services in t he
good and bad st at es.

CPU
This chart shows the trend lines of the total CPU utilization (cpu), CPU utilization for executing code in
kernel space (sys), and CPU utilization for executing code in user space (user) for the cluster in different
colors.

In the upper-right corner of the chart, click the icon to zoom in on the chart.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
CPU utilization of the cluster in the specified period.

DISK
This chart shows the trend lines of the storage usage of the /, /boot, /home/admin, and /home
directories for the cluster over time in different colors.

In the upper-right corner of the chart, click the icon to zoom in on the chart.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
storage usage of the cluster in the specified period.

LOAD
This chart shows the trend lines of the 1-minute, 5-minute, and 15-minute load averages for the cluster
in different colors.

In the upper-right corner of the chart, click the icon to zoom in on the chart.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
1-minute, 5-minute, and 15-minute load averages of the cluster in the specified period.

MEMORY
This chart shows the trend lines of the memory usage (mem), total memory size (total), used memory
size (used), size of memory used by buffers (buff), size of memory used by the page cache (cach), and
available memory size (free) for the cluster in different colors.

In the upper-right corner of the chart, click the icon to zoom in on the chart.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
memory usage of the cluster in the specified period.

PACKAGE
This chart shows the trend lines of the numbers of dropped packets (drop), error packets (error),
received packets (in), and sent packets (out) for the cluster in different colors. These trend lines reflect
the data transmission status of the cluster.

In the upper-right corner of the chart, click the icon to zoom in on the chart.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
data transmission status of the cluster in the specified period.

Health Check
This section shows the number of checkers for the cluster and the numbers of CRITICAL, WARNING, and
EXCEPTION alerts.

Click View Details to go to the Health Status tab. On this tab, you can view health check details. For
more information, see Cluster health.

Health Check History
This section shows the records of the health checks performed on the cluster. You can view the
numbers of CRITICAL, WARNING, and EXCEPTION alerts.

Click View Details to go to the Health Status tab. On this tab, you can view health check details. For
more information, see Cluster health.

You can click the event content of a check to view the exception items.

4.4.4. Servers
The Servers tab shows information about hosts. The information includes the hostname, IP address,
role, type, CPU utilization, total memory size, available memory size, load, root disk usage, packet loss
rate, and packet error rate.

In the left-side navigation pane of the Clusters tab, click a cluster. Then, click the Servers tab. The
Servers tab for the selected cluster appears.

To view more information about a host, click the name of the host. The Hosts tab appears.

4.4.5. Scale in and scale out a MaxCompute cluster
Apsara Big Data Manager (ABM) supports MaxCompute cluster scaling. To scale out a MaxCompute
cluster, add physical hosts in the default cluster of Apsara Infrastructure Management Framework to
the MaxCompute cluster. To scale in a MaxCompute cluster, remove physical hosts from the
MaxCompute cluster to the default cluster of Apsara Infrastructure Management Framework.

Description
In Apsara Stack, scaling out a cluster involves complex operations. You must configure a new physical
host on Deployment Planner and Apsara Infrastructure Management Framework so that it can be added
to the default cluster of Apsara Infrastructure Management Framework. The default cluster of Apsara
Infrastructure Management Framework is an idle resource pool that provides resources to scale out
clusters. If you want to scale out a cluster, add physical hosts in the default cluster of Apsara
Infrastructure Management Framework to the cluster. If you want to scale in a cluster, remove physical
hosts from the cluster to the default cluster of Apsara Infrastructure Management Framework.

You can use this method to scale out or scale in a MaxCompute cluster in the ABM console.

Prerequisites
Scale-out: The physical host that you want to add is an SInstance host in the default cluster of
Apsara Infrastructure Management Framework.
Scale-out: The template host must be an SInstance host. You can log on to the admingateway host
in a MaxCompute cluster to view SInstance hosts.
Scale-in: The physical host that you want to remove is an SInstance host. You can log on to the
admingateway host in a MaxCompute cluster to view SInstance hosts.
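
The relationship between the default cluster and a MaxCompute cluster described above can be pictured as hosts moving between two pools. The following Python sketch only illustrates that model. The function names and hostnames are hypothetical, and the code does not call ABM or Apsara Infrastructure Management Framework.

```python
def scale_out(default_pool, cluster, hosts):
    """Move hosts from the idle default pool into the cluster."""
    hosts = set(hosts)
    missing = hosts - default_pool
    if missing:
        raise ValueError(f"not in the default cluster: {sorted(missing)}")
    return default_pool - hosts, cluster | hosts


def scale_in(default_pool, cluster, hosts):
    """Return hosts from the cluster to the idle default pool."""
    hosts = set(hosts)
    missing = hosts - cluster
    if missing:
        raise ValueError(f"not in the MaxCompute cluster: {sorted(missing)}")
    return default_pool | hosts, cluster - hosts


# Hypothetical hostnames, for illustration only.
default_pool = {"host-a", "host-b"}
cluster = {"host-c"}
default_pool, cluster = scale_out(default_pool, cluster, ["host-a"])
print(sorted(default_pool), sorted(cluster))
```

In this model, a host is always in exactly one of the two pools, which mirrors the statement that scale-in returns hosts to the default cluster.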

Scale out a MaxCompute cluster

You can add multiple hosts to a MaxCompute cluster at a time to scale out the cluster. To add hosts to
a MaxCompute cluster, you must specify an existing host as the template host. The hosts that you
want to add copy configurations from the template host. This allows the hosts to be added to the
cluster in a single operation.

1. Log on to the admingateway host in the MaxCompute cluster. Run the r ttrl command to query
and record SInstance hosts. For more information about how to log on to a host, see Log on to a
host.

2. In the left-side navigation pane of the Clusters tab, click a cluster. Then, click the Servers tab. On
the tab that appears, select an SInstance host and use it as the template host.

3. In the upper-right corner, choose Actions > Scale out Cluster. In the Scale out Cluster panel,
configure the parameters.

Parameters:

Region: the region of the host that you want to add.

Refer Hostname: the name of the template host. By default, the name of the selected host is used.
Hostname: the name of the host that you want to add. The drop-down list displays all available
hosts in the default cluster for scale-out operations. You can select one or more hosts from the
drop-down list.

4. Click Run. A message appears, indicating that the request has been submitted.
5. View the scale-out status.

In the upper-right corner, click Actions and select Execution History next to Scale out Cluster to
view the scale-out history.

It requires some time for the cluster to be scaled out. RUNNING indicates that the execution is in
progress. SUCCESS indicates that the execution succeeds. FAILED indicates that the execution fails.

6. If the status is RUNNING, click Details in the Details column to view the steps and progress of the
execution.

7. If the status is FAILED, click Details in the Details column to identify the cause of the failure.
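
The three execution states in the steps above, and the follow-up action for each, can be summarized as a small lookup. This is a minimal Python sketch for note-taking or runbook tooling. The `next_step` helper is hypothetical and is not part of ABM.

```python
# Status values as shown in the Execution History list.
NEXT_STEP = {
    "RUNNING": "Click Details to view the steps and progress of the execution.",
    "SUCCESS": "The execution succeeded.",
    "FAILED": "Click Details to identify the cause of the failure.",
}


def next_step(status):
    """Return the follow-up action for an execution status."""
    try:
        return NEXT_STEP[status.strip().upper()]
    except KeyError:
        raise ValueError(f"unexpected status: {status!r}") from None


print(next_step("FAILED"))
```

The same three states are used by the scale-in, environment restoration, and auto repair operations described later in this chapter.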

Scale in a MaxCompute cluster

You can remove multiple hosts from a MaxCompute cluster at a time to scale in the cluster.

1. Log on to the admingateway host in the MaxCompute cluster. Run the r ttrl command to query
and record SInstance hosts. For more information about how to log on to a host, see Log on to a
host.

2. In the left-side navigation pane of the Clusters tab, click a cluster. Then, click the Servers tab. On
the tab that appears, select one or more SInstance hosts that you want to remove.

3. In the upper-right corner, choose Actions > Scale in Cluster. In the Scale in Cluster panel,
configure the parameters.

Parameters:

Region: the region of the host that you want to remove.

Hostname: the name of the host that you want to remove. By default, the name of the selected
host is used.

4. Click Run. A message appears, indicating that the request has been submitted.
5. View the scale-in status.

In the upper-right corner, click Actions and select Execution History next to Scale in Cluster to
view the scale-in history.

It requires some time for the cluster to be scaled in. RUNNING indicates that the execution is in
progress. SUCCESS indicates that the execution succeeds. FAILED indicates that the execution fails.

6. If the status is RUNNING, click Details in the Details column to view the steps and progress of the
execution.

7. If the status is FAILED, click Details in the Details column to identify the cause of the failure.

Identify the cause of a scale-in or scale-out failure

This section uses cluster scale-in as an example to describe how to identify the cause of a failure.

1. In the upper-right corner of the Clusters tab, click Actions and select Execution History next to
Scale in Cluster to view the scale-in history.
2. Click Details in the Details column of a failed operation to identify the cause of the failure.

You can view information about parameter settings, host details, scripts, and runtime parameters to
identify the cause of the failure.

4.4.6. Restore environment settings and enable auto repair
Apsara Big Data Manager (ABM) allows you to restore the environment settings for multiple hosts in a
MaxCompute cluster at a time. It also allows you to enable the auto repair feature for a MaxCompute
cluster.

Restore environment settings

ABM allows you to restore the environment settings for multiple hosts in a MaxCompute cluster at a
time.

1. In the upper-right corner of the Clusters tab, choose Actions > Restore Environment Settings. In
the Restore Environment Settings panel, set the Hosts parameter.

Note You can enter the names of multiple hosts. Separate the names with commas (,).

2. Click Run. A message appears, indicating that the request has been submitted.
3. View the restoration status.

Click Actions and select Execution History next to Restore Environment Settings to view the
restoration history.

It requires some time for the restoration to complete. RUNNING indicates that the execution is in
progress. SUCCESS indicates that the execution succeeds. FAILED indicates that the execution fails.

4. If the status is RUNNING, click Details in the Details column to view the steps and progress of the
execution.
5. If the status is FAILED, click Details in the Details column to identify the cause of the failure.
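
When many hosts need to be restored, the comma-separated value for the Hosts parameter in step 1 can be prepared in advance. The following Python sketch shows one way to do this. The `build_hosts_parameter` helper and the hostnames are hypothetical, and trimming and de-duplicating names is a convenience added here, not a documented ABM behavior.

```python
def build_hosts_parameter(hostnames):
    """Join hostnames with commas, as the Hosts parameter expects."""
    cleaned = []
    for name in hostnames:
        name = name.strip()
        if not name:
            continue  # skip blank entries
        if "," in name:
            raise ValueError(f"hostname must not contain a comma: {name!r}")
        if name not in cleaned:
            cleaned.append(name)  # keep the first occurrence, preserve order
    if not cleaned:
        raise ValueError("at least one hostname is required")
    return ",".join(cleaned)


# Hypothetical hostnames, for illustration only.
print(build_hosts_parameter(["host-a", " host-b ", "host-a"]))
```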

Enable auto repair

ABM allows you to enable the auto repair feature for a MaxCompute cluster. After this feature is
enabled, repair tickets reported by Xunyangjian are automatically handled.

1. In the upper-right corner of the Clusters tab, choose Actions > Enable Auto Repair. In the Enable
Auto Repair panel, set the Cluster parameter and select Enable for Auto Repair.

Parameters:

Cluster: the name of the cluster for which you want to enable the auto repair feature.
Auto Repair: If you require the feature, select Enable. Otherwise, select Disable.

2. Click Run. A message appears, indicating that the request has been submitted.
3. View the status of the feature.

Click Actions and select Execution History next to Enable Auto Repair to view the feature-related
operation history.

RUNNING indicates that the execution is in progress. SUCCESS indicates that the execution succeeds.
FAILED indicates that the execution fails.

4. If the status is RUNNING, click Details in the Details column to view the steps and progress of the
execution.

5. If the status is FAILED, click Details in the Details column to identify the cause of the failure.

4.5. Host O&M

4.5.1. O&M features and entry
This topic describes MaxCompute host O&M features. It also describes how to go to the host O&M
page.

Host O&M features

Overview: shows brief information about hosts in a MaxCompute cluster. The information includes
the server information, server role status, health check result, and health check history. You can also
view the trend charts of CPU utilization, disk usage, memory usage, load, and packet transmission for
the host.
Charts: shows the enlarged trend charts of CPU utilization, memory usage, disk usage, load, and
packet transmission.
Health Status: shows all checkers for a host. You can query checker details, check results for hosts in
a cluster, and schemes to clear alerts (if any). You can also log on to a host and perform
manual checks on the host.
Services: shows the cluster, service instances, and service instance roles of a host.

Go to the Hosts tab

1. Log on to the Apsara Big Data Manager (ABM) console.

2. In the upper-left corner, click the icon and then MaxCompute.

3. On the MaxCompute page, click O&M in the top navigation bar. Then, click the Hosts tab.
4. In the left-side navigation pane of the Hosts tab, select a host. The Overview tab for the host
appears.

4.5.2. Host overview
The Overview tab for a host shows brief information about the host in a MaxCompute cluster. On this
tab, you can view server information, service role status, health check result, and health check history of
the host. You can also view the trend charts of CPU utilization, disk usage, memory usage, load, and
packet transmission for the host.

Go to the Overview tab

In the left-side navigation pane of the Hosts tab, click a host. Then, click the Overview tab. The
Overview tab for the host appears.

Server Information
The Server Information section shows information about the host. Server information includes the
region, cluster, name, IP address, data center, and server room.

Service Role Status
The Service Role Status section shows information about the services deployed on the host, including
the roles, status, and number of services.

CPU
The CPU chart shows the trend lines of the total CPU utilization (cpu), CPU utilization for executing
code in kernel space (sys), and CPU utilization for executing code in user space (user) of the host over
time in different colors.

In the upper-right corner of the chart, click the icon to zoom in on the chart.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
CPU utilization of the host in the specified period.

DISK
The DISK chart shows the trend lines of the storage usage of the /, /boot, /home/admin, and /home
directories for the host over time in different colors.

In the upper-right corner of the chart, click the icon to zoom in on the chart.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
storage usage of the host in the specified period.

LOAD
The LOAD chart shows the trend lines of the 1-minute, 5-minute, and 15-minute load averages for the
host over time in different colors.

In the upper-right corner of the chart, click the icon to zoom in on the chart.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
1-minute, 5-minute, and 15-minute load averages of the host in the specified period.

MEMORY
The MEMORY chart shows the trend lines of the memory usage (mem), total memory size (total), used
memory size (used), size of memory used by kernel buffers (buff), size of memory used by the page
cache (cach), and available memory size (free) for the host over time in different colors.

In the upper-right corner of the chart, click the icon to zoom in on the chart.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
memory usage of the host in the specified period.

PACKAGE
The PACKAGE chart shows the trend lines of the numbers of dropped packets (drop), error packets
(error), received packets (in), and sent packets (out) for the host over time in different colors. These
trend lines reflect the data transmission status of the host.

In the upper-right corner of the chart, click the icon to zoom in on the chart.

You can specify the start time and end time in the upper-left corner of the enlarged chart to view the
data transmission status of the host in the specified period.

Health Check
The Health Check section shows the number of checkers deployed for the host and the numbers of
CRITICAL, WARNING, and EXCEPTION alerts.

Click View Details to go to the Health Status tab. On this tab, you can view the health check details.

Health Check History

The Health Check History section shows the records of the health checks performed on the host.

Click View Details to go to the Health Status tab. On this tab, you can view the health check details.
You can click the event content of a check to view the abnormal items.

4.5.3. Host charts

On the host chart page, you can view the enlarged trend charts of CPU usage, memory usage, storage
usage, load, and packet transmission.

On the Hosts page, select a host in the left-side navigation pane, and then click the Charts tab. The
Charts page for the host appears.

The Charts page displays trend charts of CPU usage, disk usage, memory usage, load, and packet
transmission for the host. For more information, see Host overview.

4.5.4. Host health
On the Health Status page, you can view the checkers of the selected host, including the checker
details, check results, and schemes to clear alerts (if any). In addition, you can log on to the host and
perform manual checks on the host.

Entry
On the Hosts page, select a host in the left-side navigation pane, and then click the Health Status
tab. The Health Status page for the host appears.

On the Health Status page, you can view all checkers and the check results for the host. The check
results are divided into Critical, Warning, and Exception. They are displayed in different colors. Pay
attention to the check results, especially the Critical and Warning results, and handle them in a timely
manner.

View checker details

1. On the Health Status page, click Details in the Actions column of a checker. In the dialog box that
appears, view the checker details.

The checker details include the name, source, alias, application, type, default execution interval, and
description of the checker, whether scheduling is enabled, and whether data collection is enabled.
The schemes to clear alerts are provided in the description.

2. Click Show More at the bottom to view more information about the checker.

You can view information about the execution script, execution target, default threshold, and mount
point for data collection.

View alert causes

You can view the check history and check results of a checker.

1. On the Health Status page, click + to expand a checker with alerts.

2. Click the hostname. In the dialog box that appears, click Details in the Actions column of a check
result to view the alert causes.

Clear alerts

On the Health Status page, click Details in the Actions column of a checker with alerts. In the dialog
box that appears, view the schemes to clear alerts.

Log on to a host
To log on to a host to clear alerts or perform other operations, follow these steps:

1. On the Health Status page, click + to expand a checker with alerts.

2. Click the Log On icon of a host. The TerminalService page appears.

3. On the TerminalService page, click the hostname on the left to log on to the host.

Run a checker again

After you clear an alert for a host, click Refresh in the Actions column of the host to run the checker
again for the host. In this way, you can check whether the alert is cleared.

4.5.5. Host services

On the Services page, you can view information about service instances and service instance roles of a
host.

On the Hosts page, select a host in the left-side navigation pane, and then click the Services tab. The
Services page for the host appears.

On the Services page, you can view the cluster, service instances, and service instance roles of the
host.

5. Common issues and solutions

5.1. View and allocate MaxCompute cluster resources
This topic describes how to view the storage and computing resources in a MaxCompute cluster. This
topic also describes the quota group-related concepts, relationships between a quota group and a
MaxCompute project, and quota group division policies.

Resources that can be allocated to projects in a MaxCompute cluster

Storage resources: The total amount of storage resources available in a MaxCompute cluster is limited
and can be calculated based on the number of compute nodes in the entire cluster. The storage
capacity in a MaxCompute cluster is managed through Apsara Distributed File System. You can run
Apsara Distributed File System commands to view storage information, such as the total storage
capacity and the current storage usage statistics. The following metrics are available for measuring
storage resources:
Storage capacity metric: indicates the total size of files that can be stored in a cluster. You can
calculate the total file size in a cluster based on the following formula:
Total file size in a cluster = Number of machines * (Size of a single disk * (Number of disks on a
single machine - 1)) * System security level * System compression ratio / Number of distributed
replicas.

Note
Based on the standard TPC-H test data set, the ratio of the original data size to the
compressed data size is 3:1. The ratio varies depending on the characteristics of
business data.
Typically, three replicas are stored in a distributed manner.
Security level: The default value is 0.85 in the MaxCompute system. You can set
a custom security level as required. For example, when the business data increases
rapidly and reaches 85% of the total storage quota, the security level is low. You must
scale out the system as required or delete unnecessary data.
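
The storage capacity formula above can be checked with a short calculation. The following Python sketch is a direct transcription of the formula, with the security level (0.85), compression ratio (3:1), and replica count (3) from the note used as defaults. The function name and the sample cluster size are hypothetical.

```python
def total_file_size(machines, disk_size, disks_per_machine,
                    security_level=0.85, compression_ratio=3.0, replicas=3):
    """Total file size that a cluster can store, in the unit of disk_size.

    Mirrors the formula in this topic, which subtracts one disk
    per machine before multiplying by the disk size.
    """
    if disks_per_machine < 2:
        raise ValueError("the formula requires at least two disks per machine")
    raw = machines * (disk_size * (disks_per_machine - 1))
    return raw * security_level * compression_ratio / replicas


# Hypothetical cluster: 100 machines, each with twelve 8 TB disks.
capacity_tb = total_file_size(machines=100, disk_size=8, disks_per_machine=12)
print(f"{capacity_tb:.1f} TB")
```

With the default 3:1 compression ratio and three replicas, the two factors cancel out, so the result is the raw capacity multiplied by the security level (about 7,480 TB for the sample values). Data that compresses less well yields a smaller figure.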

How to view the storage capacity of a MaxCompute cluster

Run the puadmin lscs command on the cluster AG. The total disk size, total free disk size, and
total file size are displayed at the end of the command output.
Capacity information

Note Parameters:

Total Disk Size: the total amount of physical space. Each file is stored in
three copies. The logical space is one third the size of the physical space.
Total Free Disk Size: the total size of available disks, excluding recycle bins
on chunkservers.
Total File Size: the total amount of physical space used by Apsara
Distributed File System files, including the /deleted/ directory.
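
The relationship between the physical figures reported by the command and the logical space available to data can be expressed as follows. This Python sketch assumes that the values are entered by hand from the command output and share one unit. The function names are hypothetical, and comparing the usage ratio with the 0.85 security level is an illustrative assumption, not a documented check.

```python
REPLICAS = 3  # each file is stored in three copies


def logical_size(physical_size, replicas=REPLICAS):
    """Logical space that corresponds to a physical size."""
    return physical_size / replicas


def physical_usage_ratio(total_disk_size, total_free_disk_size):
    """Fraction of the physical space that is no longer free."""
    return (total_disk_size - total_free_disk_size) / total_disk_size


def exceeds_security_level(total_disk_size, total_free_disk_size, security_level=0.85):
    """True if the usage ratio has reached the security level (assumed check)."""
    return physical_usage_ratio(total_disk_size, total_free_disk_size) >= security_level


# Hypothetical values, in TB.
print(logical_size(300))                 # logical space for 300 TB of physical space
print(physical_usage_ratio(1000, 250))   # share of physical space in use
```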

Run the following command on the cluster AG to view the storage capacity used by all projects:

pu ls -l pangu://localcluster/product/aliyun/odps/

Example:

# View the capacity used by a single project, such as adsmr.
pu ls -l pangu://localcluster/product/aliyun/odps/|grep adsmr -A 4

Project capacity information

Note Parameters:

Length: the logical length used by a project. The physical length required
is three times the logical length.
FileNumber: the number of files used.
DirNumber: the number of directories used.

File size metric: The total size of files that can be stored in a cluster is limited based on the
memory capacity of PanguMaster. The existence of a large number of small files or an improper
number of files in a cluster can also affect the stability of the cluster and its services.

The Apsara Distributed File System index files, including the information of Apsara Distributed File
System files and directories, are stored in the PanguMaster memory. Each file in PanguMaster
corresponds to a file node. Each file node uses XXX bytes of memory, each level of directory uses
XXX bytes of memory, and each chunk uses XXX bytes of memory. A large file is split into multiple
chunks in Apsara Distributed File System. Therefore, the factors that affect PanguMaster memory
usage include the number of files, directory hierarchy, and number of chunks.

If the size of the original files in Apsara Distributed File System is large, the memory usage of
PanguMaster is relatively low. When a large number of small files exist, the memory usage of
PanguMaster is relatively high.

We recommend that you perform the following operations to reduce the memory usage of
PanguMaster:

Reduce or even delete empty directories, which occupy memory, and reduce the number of
directory levels.
Do not create directories. A directory is created automatically when you create a file.
Store multiple files in a directory. However, a directory can store a maximum of 100,000 files.
Decrease the length of file names and directory names to reduce the memory usage and
network traffic in PanguMaster.
Reduce the number of small tables and files. We recommend that you use Tunnel to upload and
commit MaxCompute tables only when the table data size reaches 64 MB.
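
Two of the recommendations above carry concrete thresholds: the 64 MB commit size and the limit of 100,000 files per directory. The following Python sketch shows how a data-loading script might apply them. The helper names are hypothetical, the code does not use the Tunnel SDK, and treating 64 MB as 64 * 1024 * 1024 bytes is an assumption.

```python
MIN_COMMIT_BYTES = 64 * 1024 * 1024   # 64 MB, assumed to mean binary megabytes
MAX_FILES_PER_DIR = 100_000           # per-directory limit from the list above


def ready_to_commit(buffered_bytes):
    """True once enough data is buffered to avoid creating a small file."""
    return buffered_bytes >= MIN_COMMIT_BYTES


def directories_needed(file_count):
    """Smallest number of directories that keeps each one within the limit."""
    return -(-file_count // MAX_FILES_PER_DIR)  # ceiling division


print(ready_to_commit(10 * 1024 * 1024))
print(directories_needed(250_000))
```

Buffering data until `ready_to_commit` returns True reduces the number of small files, and therefore the number of file nodes that PanguMaster must keep in memory.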

The following figure shows the numbers of files that can be stored in Apsara Distributed File
System for different PanguMaster memory capacities.
Numbers of files that can be stored for different PanguMaster memory capacities

How to view the number of files stored in a MaxCompute cluster

Run the pu quota command on the cluster AG to view the total number of files stored in a
MaxCompute cluster.
Total number of files

This example uses the adsmr project to demonstrate how to view the number of files. Run the
following command on the cluster AG to view the number of files for a single project in a
MaxCompute cluster:

pu ls -l pangu://localcluster/product/aliyun/odps/|grep adsmr -A 4

Number of files for a single project

Note Parameters:

FileNumber: the number of files used.
DirNumber: the number of directories used.
FileNumber + DirNumber = Number of files for the current project.
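
The per-project figures described in the notes above (Length, FileNumber, and DirNumber) combine as shown in the following Python sketch. The `ProjectUsage` class is hypothetical, and its values are assumed to be copied by hand from the command output. The sketch does not parse the output itself because its exact layout is not shown here.

```python
from dataclasses import dataclass

REPLICAS = 3  # the physical length is three times the logical length


@dataclass
class ProjectUsage:
    length: int       # Length: logical length used by the project
    file_number: int  # FileNumber: number of files used
    dir_number: int   # DirNumber: number of directories used

    @property
    def physical_length(self):
        """Physical length required for the project."""
        return self.length * REPLICAS

    @property
    def file_count(self):
        """Number of files for the project: FileNumber + DirNumber."""
        return self.file_number + self.dir_number


# Hypothetical figures for a project such as adsmr.
usage = ProjectUsage(length=1_000_000, file_number=1200, dir_number=300)
print(usage.physical_length, usage.file_count)
```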

Computing resources: CPU and memory are typically referred to as computing resources in a
MaxCompute cluster. The total amount of computing resources is calculated based on the following
formula: Total amount of computing resources = (Number of CPU cores + Memory size of each
machine) * Number of machines. For example, each machine has 56 CPU cores. One core on each
machine is used by the system. The remaining 55 cores are managed by the distributed scheduling
system and are scheduled for use by the MaxCompute service. The memory (aside from the chunk of
memory for system overhead) is allocated by Job Scheduler. Typically, 4 GB of memory is allocated
per CPU core in each MaxCompute task. The ratio varies depending on MaxCompute tasks.
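
The example in the preceding paragraph can be worked through numerically. The following Python sketch counts the schedulable cores in a cluster and the memory that tasks would request at the typical ratio of 4 GB per core. The function name and the cluster size are hypothetical, and the memory figure is an estimate of task demand, not the physical memory of the machines.

```python
def schedulable_resources(machines, cores_per_machine,
                          reserved_cores=1, memory_per_core_gb=4):
    """Return (schedulable cores, memory in GB requested at the typical ratio)."""
    cores = (cores_per_machine - reserved_cores) * machines
    return cores, cores * memory_per_core_gb


# Hypothetical cluster: 10 machines with 56 cores each, one core reserved per machine.
cores, memory_gb = schedulable_resources(machines=10, cores_per_machine=56)
print(cores, memory_gb)
```

Because the ratio varies between tasks, `memory_per_core_gb` is kept as a parameter instead of a constant.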

How to view computing resources

Run the r ttrl command on the cluster AG to view all computing resources.

All computing resources

Note In the command output, the domain name, total CPU capacity (Unit: U.
100 U = 1 core), and total memory (Unit: MB) of each Tubo machine, as well as the
role of each Tubo machine in Job Scheduling System are listed in four columns.
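
The units in the note above convert to cores and gigabytes as shown in the following Python sketch. The machine rows are hypothetical values entered by hand, not parsed command output, and the conversion of 1 GB to 1024 MB is an assumption.

```python
U_PER_CORE = 100  # 100 U = 1 core


def cpu_units_to_cores(cpu_units):
    """Convert CPU capacity in U to a number of cores."""
    return cpu_units / U_PER_CORE


def memory_mb_to_gb(memory_mb):
    """Convert memory in MB to GB (assumes 1 GB = 1024 MB)."""
    return memory_mb / 1024


# Hypothetical rows: (domain name, CPU in U, memory in MB).
machines = [
    ("host-a", 5500, 225280),
    ("host-b", 5500, 225280),
]
total_cores = sum(cpu_units_to_cores(cpu) for _, cpu, _ in machines)
total_memory_gb = sum(memory_mb_to_gb(mem) for _, _, mem in machines)
print(total_cores, total_memory_gb)
```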

Run t he r tfrl command on t he clust er AG t o view t he remaining comput ing resources.


Remaining comput ing resources

Not e In t he command out put , t he domain name, t ot al CPU capacit y (Unit : U.


100 U = 1 core), and t ot al memory (Unit : MB) of each T ubo machine, as well as t he
role of each T ubo machine in Job Scheduling Syst em are list ed in four columns.


Run the r cru command on the cluster AG to view the resources used by all running jobs in
MaxCompute.
Resources used by all running jobs

Note The name, total CPU capacity, and total memory of each job, as well as the number of
Fuxi instances started in the role of each job in Job Scheduling System are listed in four columns.

How to allocate project resources in a MaxCompute cluster


Storage resource allocation: Based on the characteristics of a project, the space size and file size
limit are configured when you create the project.

If the following error messages are displayed, the file size limit of the project has been exceeded. In
this case, you must organize the data in the project by deleting unnecessary table data or increasing
the storage resource quota.
Error messages

Important The sum of the storage capacity of all projects cannot exceed the total
allowable storage capacity of a service. Similarly, the total file size of all projects cannot exceed
the total allowable file size. Therefore, you must properly allocate the storage space and file
size limit by project and make timely adjustments based on your business requirements.

Computing resource allocation: division of quota groups.

What is a quota group?


A MaxCompute cluster allows you to divide computing resources into different quota groups, and
schedule them as required. A quota group represents a certain amount of CPU and memory
resources. MinQuota and MaxQuota are used for CPU and memory configurations. MinQuota is the
minimum quota allowed for the quota group, and MaxQuota is the maximum quota allowed for
the quota group. For example, MinCPU=500 indicates that the quota group has been assigned at
least 500/100=5 cores. MaxCPU=2000 indicates that the quota group can be assigned at most
2000/100=20 cores.
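
The conversion in the example above can be written as a short Python sketch. The function name is hypothetical; only the 100 U = 1 core ratio and the MinCPU and MaxCPU values come from this guide.

```python
U_PER_CORE = 100  # unit used in this guide: 100 U = 1 core


def quota_cores(min_cpu_u, max_cpu_u):
    """Convert a quota group's MinCPU and MaxCPU (in U) to (min cores, max cores)."""
    if not 0 <= min_cpu_u <= max_cpu_u:
        raise ValueError("expected 0 <= MinCPU <= MaxCPU")
    return min_cpu_u / U_PER_CORE, max_cpu_u / U_PER_CORE


# MinCPU=500 and MaxCPU=2000 from the example above.
print(quota_cores(500, 2000))
```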

MaxCompute uses a FAIR scheduling policy and a first-in-first-out (FIFO) scheduling policy by
default. The difference between the FAIR and FIFO scheduling policies lies in the keys by which
tasks in waiting queues are sorted. If each schedule unit has its own priority, both FAIR and FIFO
scheduling policies allocate high-priority schedule units first. If all schedule units share the same
priority, the FIFO scheduling policy sorts the schedule units by the time when they are submitted.
The earlier they are submitted, the higher priority they have. The FAIR scheduling policy sorts the
schedule units by the slotNum allocated to them. The smaller the slotNum is, the higher priority
they have. For the FAIR policy group, this can basically ensure that the same amount of resources
are assigned to schedule units with the same priority.
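
The two sort orders can be illustrated with a small Python sketch. It is not the Job Scheduler implementation: the ScheduleUnit class, its field names, and the convention that a larger number means a higher priority are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ScheduleUnit:
    name: str
    priority: int     # assumption: a larger number means a higher priority
    submit_time: int  # assumption: submission time in seconds
    slot_num: int     # slots already allocated to the unit


def fifo_order(units):
    """Higher priority first, then earlier submission first."""
    return sorted(units, key=lambda u: (-u.priority, u.submit_time))


def fair_order(units):
    """Higher priority first, then fewer allocated slots first."""
    return sorted(units, key=lambda u: (-u.priority, u.slot_num))


units = [
    ScheduleUnit("a", priority=1, submit_time=10, slot_num=8),
    ScheduleUnit("b", priority=1, submit_time=20, slot_num=2),
    ScheduleUnit("c", priority=5, submit_time=30, slot_num=9),
]
print([u.name for u in fifo_order(units)])
print([u.name for u in fair_order(units)])
```

Units a and b share a priority, so FIFO favors the earlier submission (a) while FAIR favors the unit that holds fewer slots (b).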

You can run the r quota command on the cluster AG to view quota group settings.
View quota group settings

You can run the following command on the cluster AG to create and modify a quota as needed:

sh /apsara/deploy/rpc_wrapper/rpc.sh setquota -i $QUOTAID -a $QUOTANAME -t fair -s $max_cpu_quota $max_mem_quota -m $min_cpu_quota $min_mem_quota

Note The command with $QUOTAID is used to modify a quota. The command without
$QUOTAID is used to create a quota.
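
As a sketch of the difference between the two forms, the following Python helper only assembles the command string and does not run it. The helper name, the quota name, and the numeric values are hypothetical. The script path and the -i, -a, -t, -s, and -m options are the ones shown in the command above.

```python
import shlex

RPC_SH = "/apsara/deploy/rpc_wrapper/rpc.sh"  # path as given in this guide


def setquota_command(name, max_cpu, max_mem, min_cpu, min_mem, quota_id=None):
    """Build the setquota command line; pass quota_id to modify an existing quota."""
    args = ["sh", RPC_SH, "setquota"]
    if quota_id is not None:
        args += ["-i", str(quota_id)]  # with -i: modify; without -i: create
    args += ["-a", name, "-t", "fair",
             "-s", str(max_cpu), str(max_mem),
             "-m", str(min_cpu), str(min_mem)]
    return " ".join(shlex.quote(a) for a in args)


# Hypothetical values, for illustration only.
print(setquota_command("etl", 2000, 8192, 500, 2048))                 # create
print(setquota_command("etl", 2000, 8192, 500, 2048, quota_id=9243))  # modify
```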

Create a quota


Modify a quota

How to divide quota groups

To divide quota groups correctly, you must understand the relationship between a MaxCompute
project and a quota group.

You can select the quota group to which a project belongs upon project creation or modify the
quota group after project creation.

Resources in a quota group can be used by all running tasks of all projects in this quota group.
Therefore, the project tasks in the same quota group may be affected during peak hours. That is,
one or several large tasks may take up all resources in the quota group, while other computing
tasks can only wait for resources.


For example, in the following two figures, the first figure shows that a lot of jobs are waiting for
resources (in red box). However, a lot of cluster resources are left unused. You can check the quota
usage. In the second figure, quota 9243 is only allocated with 5000 U, all of which are in use. The
CPU quota for 9243 is used up, but there are still pending tasks in 9243. In this case, even if there
are unused cluster resources, the tasks under this quota cannot have resources allocated to them.
Jobs waiting for resources

Quota used up

You must divide quota groups based on the following general principles:

You must plan quota groups in a way that they do not mutually interfere with each other in a
large resource pool, and avoid overly fine-grained division of resource groups. For example, some
large tasks cannot be scheduled due to quota group limits, or occupy a quota group for an
extended period of time, which affects other tasks in the group.
You must consider the configured MinQuota and MaxQuota when dividing quota groups.
You can oversell the resources in your cluster, that is, the sum of MaxQuotas of all quota groups
can be greater than the total amount of cluster resources. However, the oversell ratio cannot be
too high. If the oversell ratio is too high, a quota group with a running project may perpetually
occupy a large amount of resources.
When dividing quota groups, you must consider the priorities of tasks, task execution duration,
amount of task data, and characteristics of computing types.


Properly configure quota groups for peak hours. We recommend that you configure a separate
quota group for tasks that are important and time-consuming.
The division of quota groups and the selection and configuration of projects are conducted
based on a resource pre-allocation policy, which needs to be adjusted in a timely manner, based
on actual requirements.
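
The oversell ratio mentioned in the principles above can be checked with a short calculation. This Python sketch is illustrative only: the function name and the quota figures are hypothetical, and this guide does not state a specific acceptable ratio.

```python
def oversell_ratio(max_quotas_u, cluster_total_u):
    """Sum of all quota groups' MaxQuota divided by the cluster total (same unit)."""
    if cluster_total_u <= 0:
        raise ValueError("cluster total must be positive")
    return sum(max_quotas_u) / cluster_total_u


# Hypothetical cluster with 55000 U of CPU and three quota groups.
ratio = oversell_ratio([30000, 20000, 16000], 55000)
print(f"oversell ratio: {ratio:.2f}")
```

A ratio above 1 means the cluster is oversold. How far above 1 is acceptable depends on how often the quota groups peak at the same time.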

5.2. Common issues and data skew troubleshooting
Scenario 1: how to determine whether a job has stopped running
due to insufficient resources
Symptom: The job does not progress as expected.
Symptom

Cause: The issue is typically caused by insufficient resources. You can use LogView to determine the
status of job resources (task instance status).

Ready: indicates that instances are waiting for Job Scheduler to allocate resources. Instances can
resume operation after they obtain the necessary resources.
Wait: indicates that instances are waiting for dependent tasks to complete.


The task instances in the Ready state shown in the following figure indicate that there are insufficient
resources to run these tasks. After an instance obtains the necessary resources, its status changes to
Running.
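
The reasoning above can be summarized in a small Python sketch. It does not call LogView or any MaxCompute API: it assumes that you have already copied the instance states into a list, and the function name and the returned messages are hypothetical.

```python
from collections import Counter


def diagnose(instance_states):
    """Summarize why a job may be stalled, based on its task instance states."""
    counts = Counter(instance_states)
    if counts["Ready"] > 0:
        # Ready: waiting for Job Scheduler to allocate resources.
        return "waiting for resources"
    if counts["Wait"] > 0:
        # Wait: waiting for dependent tasks to complete.
        return "waiting for dependent tasks"
    return "no resource or dependency wait detected"


print(diagnose(["Running", "Ready", "Ready"]))
```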

Solution:

If there are insufficient resources during peak hours, you can reschedule the tasks to run during off-
peak hours.
If the computing quotas are insufficient, check whether the quota group of the project has sufficient
computing resources.
If computing resources in the cluster are occupied for long periods of time, you can develop a
computing quota allocation policy to scale the quota as necessary.
We recommend that you do not run abnormally large jobs to prevent the jobs from occupying
resources for extended periods of time.
You can enable SQL acceleration, so that you can run small jobs without requesting resources from
Job Scheduler.
You can use the First-In First-Out (FIFO) scheduling policy.

Scenario 2: how to find the root cause of a job that has been running
for an extended period of time
Symptom: The MaxCompute job execution progress has remained at 99% for a long period of time.

Cause: The running time of some Fuxi instances in the MaxCompute job is significantly longer than that
of other Fuxi instances.
Cause analysis


Further analysis: Analyze the job summary in LogView, and calculate the difference between the max
and avg values of input and output records of a slow task. If the max and avg values differ by several
orders of magnitude, it can be initially determined that the job data is skewed.
Further analysis
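
The max and avg comparison can be expressed as a short Python check. The function name is hypothetical, and the default threshold of two orders of magnitude is an assumption: this guide only says "several orders of magnitude".

```python
import math


def looks_skewed(max_records, avg_records, orders_of_magnitude=2):
    """Return True when max exceeds avg by at least the given orders of magnitude."""
    if max_records <= 0 or avg_records <= 0:
        return False
    return math.log10(max_records / avg_records) >= orders_of_magnitude


# Hypothetical record counts read from a job summary.
print(looks_skewed(max_records=5_000_000, avg_records=20_000))
print(looks_skewed(max_records=30_000, avg_records=20_000))
```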

Solution: If there are slow Fuxi instances on a particular machine, check whether a hardware failure has
occurred on the machine.

Scenario 3: How to improve the concurrency of MaxCompute jobs


Fault locating: The concurrency of Map tasks depends on the following factors:

Split size and merge limit.

Map takes a series of data files as inputs. Larger files are split into partitions based on the
odps.sql.mapper.split.size value, which is 256 MB by default. An instance is started for each partition.
However, starting an instance requires resources and time. Small files can be merged into a single
partition based on the odps.sql.mapper.merge.limit.size value and be processed by a single instance
to improve instance utilization. The default value of odps.sql.mapper.merge.limit.size is 64 MB. The
total size of small files merged cannot exceed this value.
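
To show how the two parameters influence the number of Map instances, the following Python sketch uses a deliberately simplified model. It is an assumption, not MaxCompute's actual planning algorithm: files larger than the split size are cut into split-size pieces, files no larger than the merge limit are packed greedily into groups that stay within the merge limit, and every other file gets one instance. The function name is hypothetical.

```python
import math


def estimate_map_instances(file_sizes_mb, split_size_mb=256, merge_limit_mb=64):
    """Rough estimate of Map instances under the simplified model described above."""
    instances = 0
    small_group_mb = 0  # size of the group of small files being filled
    for size in sorted(file_sizes_mb, reverse=True):
        if size > split_size_mb:
            instances += math.ceil(size / split_size_mb)  # one instance per split
        elif size > merge_limit_mb:
            instances += 1  # medium file: processed on its own
        else:
            if small_group_mb and small_group_mb + size > merge_limit_mb:
                instances += 1  # close the full group, start a new one
                small_group_mb = 0
            small_group_mb += size
    if small_group_mb:
        instances += 1  # last, partially filled group
    return instances


files = [1024, 100, 30, 30, 30]  # hypothetical file sizes in MB
print(estimate_map_instances(files))           # default 256 MB split, 64 MB merge limit
print(estimate_map_instances(files, 128, 32))  # smaller values give more instances
```

Under this model, lowering either value increases the instance count, which matches the tuning direction described in this scenario.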

Instances cannot process data across multiple partitions.

A partition is mapped to a folder in Apsara Distributed File System. You must run at least one instance
to process data in a partition. Instances cannot process data across multiple partitions. In a partition,
you must run instances based on the preceding rule.

Typically, the number of instances for Reduce tasks is 1/4 of that for Map tasks. The number of
instances for Join tasks is the same as that for Map tasks, but cannot exceed 1,111.

You can use the following methods to increase the number of concurrent instances for Reduce and Join
tasks:
set odps.sql.reducer.instances = xxx

set odps.sql.joiner.instances = xxx

Scenarios that require higher concurrency:

A single record only contains a small amount of data.

Because a single record contains a small amount of data, there are many records in a file of the same
size. If you split data into 256 MB chunks, a single Map instance needs to process a large number of
records, reducing concurrency.

Dump operations occur in the Map, Reduce, and Join stages.

Based on the preceding job summary analysis, the displayed dump information indicates that the
instance does not have sufficient memory to sort data in the Shuffle stage. Improving concurrency
can reduce the amount of data processed by a single instance to the amount of data that can be
handled by the memory, eliminate disk I/O time consumption, and improve the processing speed.


Time-consuming UDFs are used.

The execution of UDFs is time-consuming. If you execute UDFs concurrently, you can reduce the UDF
execution time of an instance.

Solution:

You can decrease the following parameter values to improve the concurrency of Map tasks:

odps.sql.mapper.split.size = xxx
odps.sql.mapper.merge.limit.size = xxx

You can increase the following parameter values to improve the concurrency of Reduce and Join
tasks:

odps.sql.reducer.instances = xxx
odps.sql.joiner.instances = xxx

Note: Improving concurrency will result in a greater amount of resources being consumed. We
recommend that you take cost into account when improving concurrency. Aim for an average instance
run time of about 10 minutes after optimization, which improves overall resource utilization. We
recommend that you optimize jobs in critical paths so that they consume less time.

Scenario 4: how to resolve data skew issues


Different types of data skew issues in SQL are resolved in different ways.

GROUP BY data skew

The uneven distribution of GROUP BY keys results in data skew on reducers. You can set the anti-skew
parameter before executing SQL tasks.

set odps.sql.groupby.skewindata=true

After this parameter is set to true, the system automatically adds a random number to each key
when running the Shuffle hash algorithm and prevents data skew by introducing a new task.
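
The idea of adding a random number to each key and then merging the partial results in an extra task can be illustrated with a two-stage count in Python. This is a conceptual sketch, not MaxCompute's internal implementation. The function name, the number of buckets, and the sample data are assumptions.

```python
import random
from collections import Counter


def two_stage_count(keys, buckets=4, seed=0):
    """Count keys in two stages so that one hot key is spread over several groups."""
    rng = random.Random(seed)
    # Stage 1: group by (random salt, key); a hot key is split into up to `buckets` groups.
    partial = Counter((rng.randrange(buckets), key) for key in keys)
    # Stage 2 (the extra task): drop the salt and merge the partial counts.
    final = Counter()
    for (_salt, key), count in partial.items():
        final[key] += count
    return partial, final


keys = ["hot"] * 1000 + ["cold"] * 10  # hypothetical skewed input
partial, final = two_stage_count(keys)
print(max(partial.values()), final["hot"], final["cold"])
```

The largest stage-1 group is much smaller than the 1,000 rows that a single reducer would receive for the hot key without salting, and the final counts are unchanged.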

DISTRIBUTE BY data skew

Using constants to execute the DISTRIBUTE BY clause for full sorting of the entire table will result in
data skew on reducers. We recommend that you do not perform this operation.

Data skew in the Join stage

Data is skewed in the Join stage when the Join keys are unevenly distributed. For example, a key exists
in multiple joined tables, resulting in a Cartesian explosion of data in the Join instance. You can use
one of the following solutions to resolve data skew in the Join stage:

When a large table and a small table are joined, use MapJoin instead of Join to optimize query
performance.
Use a separate logic to handle a skewed key. For example, when a large number of null values exist
in the key, you can filter out the null values or execute a CASE WHEN statement to replace them
with random values before the Join operation.
If you do not want to modify SQL statements, configure the following parameters to allow
MaxCompute to perform automatic optimization:

set odps.sql.skewinfo=tab1:(col1,col2)[(v1,v2),(v3,v4),...]
set odps.sql.skewjoin=true;


Data skew caused by multi-distinct

Multi-distinct syntax aggravates GROUP BY data skew. You can use the GROUP BY clause with the
COUNT function instead of multi-distinct to alleviate the data skew issue.

UDF OOM

Some jobs report an OOM error during runtime. The error message is as follows: FAILED: ODPS-0123144:
Fuxi job failed - WorkerRestart errCode:9,errMsg:SigKill(OOM), usually caused by OOM(out
of memory) . You can fix the error by configuring the UDF runtime parameters. Example:

set odps.sql.mapper.memory=3072;
set odps.sql.udf.jvm.memory=2048;
set odps.sql.udf.python.memory=1536;

The related data skew settings are as follows:

set odps.sql.groupby.skewindata=true/false

Description: allows you to enable GROUP BY optimization.

set odps.sql.skewjoin=true/false

Description: allows you to enable Join optimization. It is effective only when odps.sql.skewinfo is set.

set odps.sql.skewinfo

Description: allows you to set detailed information for Join optimization. The command syntax is as
follows:
set odps.sql.skewinfo=skewed_src:(skewed_key)[("skewed_value")]
src a join src_skewjoin1 b on a.key = b.key;

Example:
set odps.sql.skewinfo=src_skewjoin1:(key)[("0")]
-- The output result for a single skewed value of a single field is as follows:
explain select a.key c1, a.value c2, b.key c3, b.value c4 from src a join src_skewjoin1 b on a.key = b.key;

set odps.sql.skewinfo=src_skewjoin1:(key)[("0")("1")]
-- The output result for multiple skewed values of a single field is as follows:
explain select a.key c1, a.value c2, b.key c3, b.value c4 from src a join src_skewjoin1 b on a.key = b.key;
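
When the skewed values are known in advance, the setting string can be generated instead of typed by hand. The following Python helper is a hypothetical convenience that only reproduces the single-field format shown in the examples above. It does not validate the table or column names against MaxCompute.

```python
def skewinfo_setting(table, column, skewed_values):
    """Build a 'set odps.sql.skewinfo=...' statement for one column of one table."""
    if not skewed_values:
        raise ValueError("at least one skewed value is required")
    values = "".join(f'("{value}")' for value in skewed_values)
    return f"set odps.sql.skewinfo={table}:({column})[{values}]"


# Table, column, and values taken from the examples above.
print(skewinfo_setting("src_skewjoin1", "key", ["0"]))
print(skewinfo_setting("src_skewjoin1", "key", ["0", "1"]))
```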

Scenario 5: how to configure common SQL parameters


Map settings

set odps.sql.mapper.cpu=100

Description: allows you to set the number of CPUs used by each instance in a Map task. Default value:
100. Valid values: 50 to 800.

set odps.sql.mapper.memory=1024

Description: allows you to set the memory size of each instance in a Map task. Unit: MB. Default value:
1024. Valid values: 256 to 12288.

set odps.sql.mapper.merge.limit.size=64

Description: allows you to set the maximum size of control files to be merged. Unit: MB. Default value:
64. You can set this variable to control the inputs of mappers. Valid values: 0 to Integer.MAX_VALUE.

set odps.sql.mapper.split.size=256

Description: allows you to set the maximum data input volume for a Map task. Unit: MB. Default value:
256. You can set this variable to control the inputs of mappers. Valid values: 1 to Integer.MAX_VALUE.

Join settings

set odps.sql.joiner.instances=-1

Description: allows you to set the number of instances in a Join task. Default value: -1. Valid values: 0 to
2000.

set odps.sql.joiner.cpu=100

Description: allows you to set the number of CPUs used by each instance in a Join task. Default value:
100. Valid values: 50 to 800.

set odps.sql.joiner.memory=1024

Description: allows you to set the memory size of each instance in a Join task. Unit: MB. Default value:
1024. Valid values: 256 to 12288.

Reduce settings

set odps.sql.reducer.instances=-1

Description: allows you to set the number of instances in a Reduce task. Default value: -1. Valid values:
0 to 2000.

set odps.sql.reducer.cpu=100

Description: allows you to set the number of CPUs used by each instance in a Reduce task. Default
value: 100. Valid values: 50 to 800.

set odps.sql.reducer.memory=1024

Description: allows you to set the memory size of each instance in a Reduce task. Unit: MB. Default
value: 1024. Valid values: 256 to 12288.


UDF settings

set odps.sql.udf.jvm.memory=1024

Description: allows you to set the maximum memory size used by the UDF JVM heap. Unit: MB. Default
value: 1024. Valid values: 256 to 12288.

set odps.sql.udf.timeout=600

Description: allows you to set the timeout period of a UDF. Unit: seconds. Default value: 600. Valid
values: 0 to 3600.

set odps.sql.udf.python.memory=256

Description: allows you to set the maximum memory size used by the UDF Python API. Unit: MB. Default
value: 256. Valid values: 64 to 3072.

set odps.sql.udf.optimize.reuse=true/false

Description: When this parameter is set to true, each UDF function expression can only be calculated
once, improving performance. Default value: true.

set odps.sql.udf.strict.mode=false/true

Description: allows you to control whether functions return NULL or an error if dirty data is found. If the
parameter is set to true, an error is returned. Otherwise, NULL is returned.

MapJoin settings

set odps.sql.mapjoin.memory.max=512

Description: allows you to set the maximum memory size for a small table when running MapJoin. Unit:
MB. Default value: 512. Valid values: 128 to 2048.

set odps.sql.reshuffle.dynamicpt=true/false

Description:
Dynamic partitioning scenarios are time-consuming. Disabling dynamic partitioning can accelerate SQL.
If there are few dynamic partitions, disabling dynamic partitioning can prevent data skew.
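
Because each numeric parameter in this scenario has a documented valid range, a value can be checked before it is used in a set statement. The following Python sketch copies a subset of the ranges listed above. The dictionary and function names are hypothetical, and the check runs entirely on the client side.

```python
# Valid ranges copied from the parameter descriptions in this scenario.
VALID_RANGES = {
    "odps.sql.mapper.cpu": (50, 800),
    "odps.sql.mapper.memory": (256, 12288),
    "odps.sql.joiner.cpu": (50, 800),
    "odps.sql.joiner.memory": (256, 12288),
    "odps.sql.reducer.cpu": (50, 800),
    "odps.sql.reducer.memory": (256, 12288),
    "odps.sql.udf.jvm.memory": (256, 12288),
    "odps.sql.udf.timeout": (0, 3600),
    "odps.sql.udf.python.memory": (64, 3072),
    "odps.sql.mapjoin.memory.max": (128, 2048),
}


def set_statement(name, value):
    """Return a 'set name=value;' statement after checking the documented range."""
    low, high = VALID_RANGES[name]
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside the valid range {low} to {high}")
    return f"set {name}={value};"


print(set_statement("odps.sql.mapper.memory", 3072))
```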

Scenario 6: how to check the storage usage of a single project


Launch the MaxCompute console as a project owner and run the desc project <project_name> -extended;
command to view the following information.
Storage information


The preceding figure shows the capacity-related storage information of the project. The relationship
between the physical and logical values of the related metrics is: Physical value of a metric = Logical
value of the metric * Number of replicas.
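
The relationship can be checked with a one-line calculation. In this Python sketch, the function name and the sample numbers are hypothetical, and the replica count must be taken from your own cluster configuration.

```python
def physical_value(logical_value, replicas):
    """Physical value of a metric = logical value of the metric * number of replicas."""
    if logical_value < 0 or replicas < 1:
        raise ValueError("logical value must be >= 0 and replicas must be >= 1")
    return logical_value * replicas


# Hypothetical example: 100 GB of logical data stored with 3 replicas.
print(physical_value(100, 3))
```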
