RETHINKING THE EXPRESSIVE POWER OF GNNS VIA GRAPH BICONNECTIVITY
Bohang Zhang∗ Shengjie Luo∗ Liwei Wang Di He
zhangbohang@pku.edu.cn, luosj@stu.pku.edu.cn, {wanglw,dihe}@pku.edu.cn
Peking University
ABSTRACT
1 INTRODUCTION
Graph neural networks (GNNs) have recently become the dominant approach for graph representa-
tion learning. Among numerous architectures, message-passing neural networks (MPNNs) are ar-
guably the most popular design paradigm and have achieved great success in various fields (Gilmer
et al., 2017; Hamilton et al., 2017; Kipf & Welling, 2017; Veličković et al., 2018). However, one
major drawback of MPNNs lies in the limited expressiveness: as pointed out by Xu et al. (2019);
Morris et al. (2019), they can never be more powerful than the classic 1-dimensional Weisfeiler-
Lehman (1-WL) test in distinguishing non-isomorphic graphs (Weisfeiler & Leman, 1968). This
inspired a variety of works to design provably more powerful GNNs that go beyond the 1-WL test.
One line of subsequent works aimed to propose GNNs that match the higher-order WL variants
(Morris et al., 2019; 2020; Maron et al., 2019c;a; Geerts & Reutter, 2022). While being highly
expressive, such an approach suffers from severe computation/memory costs. Moreover, there
have been concerns about whether the achieved expressiveness is necessary for real-world tasks
(Veličković, 2022). In light of this, other recent works sought to develop new GNN architectures
with improved expressiveness while still keeping the message-passing framework for efficiency
(Bouritsas et al., 2022; Bodnar et al., 2021b;a; Bevilacqua et al., 2022; Wijesinghe & Wang, 2022,
and see Appendix A for more recent advances). However, most of these works mainly justify their
expressiveness by giving toy examples where WL algorithms fail to distinguish, e.g., by focusing on
regular graphs. On the theoretical side, it is quite unclear what additional power they can system-
atically and provably gain. More fundamentally, to the best of our knowledge (see Appendix D.1),
there is still a lack of principled and convincing metrics beyond the WL hierarchy to formally mea-
sure the expressive power and to guide the design of provably better GNN architectures.
∗ Equal Contribution.
(a) Original graph (b) Block cut-edge tree (c) Block cut-vertex tree
Figure 1: An illustration of edge-biconnectivity and vertex-biconnectivity. Cut vertices/edges are
outlined in bold red. Gray nodes in (b)/(c) are edge/vertex-biconnected components, respectively.
In this paper, we systematically study the problem of designing expressive GNNs from a novel
perspective of graph biconnectivity. Biconnectivity has long been a central topic in graph theory
(Bollobás, 1998). It comprises a series of important concepts such as cut vertex (articulation point),
cut edge (bridge), biconnected component, and block cut tree (see Section 2 for formal definitions).
Intuitively, biconnectivity provides a structural description of a graph by decomposing it into disjoint
sub-components and linking them via cut vertices/edges to form a tree structure (cf. Figure 1(b,c)).
As can be seen, biconnectivity purely captures the intrinsic structure of a graph.
The significance of graph biconnectivity can be reflected in various aspects. Firstly, from a theo-
retical point of view, it is a basic graph property and is linked to many fundamental topics in graph
theory, ranging from path-related problems to network flow (Granot & Veinott Jr, 1985) and span-
ning trees (Kapoor & Ramesh, 1995), and is highly relevant to planar graph isomorphism (Hopcroft
& Tarjan, 1972). Secondly, from a practical point of view, cut vertices/edges have substantial values
in many real applications. For example, chemical reactions are highly related to the edge-biconnectivity
of the molecular graph, where the breakage of molecular bonds usually occurs at cut edges and
each biconnected component often remains unchanged after the reaction. As another example, social
networks are related to vertex-biconnectivity, where cut vertices play an important role in linking
between different groups of people (biconnected components). Finally, from a computational point
of view, the problems related to biconnectivity (e.g., finding cut vertices/edges or constructing block
cut trees) can all be efficiently solved using classic algorithms (Tarjan, 1972), with a computational
complexity linear in the graph size (the same as an MPNN). Therefore, one may naturally expect
that popular GNNs should be able to learn all things related to biconnectivity without difficulty.
Unfortunately, we show this is not the case. After a thorough analysis of four classes of representative
GNN architectures in the literature (see Section 3.1), we find that, surprisingly, none of them can
even solve the easiest biconnectivity problem: distinguishing whether a graph has cut vertices/edges
or not (a graph-level binary classification). As a result, they obviously fail on
the following harder tasks: (i) identifying all cut vertices (a node-level task); (ii) identifying all
cut edges (an edge-level task); (iii) the graph-level task for general biconnectivity problems, e.g.,
distinguishing a pair of graphs that have non-isomorphic block cut trees. This raises the following
question: can we design GNNs with provable expressiveness for biconnectivity problems?
We first give an affirmative answer to the above question. By conducting a deep analysis of the
recently proposed Equivariant Subgraph Aggregation Network (ESAN) (Bevilacqua et al., 2022), we
prove that the DSS-WL algorithm with node marking policy can precisely identify both cut vertices
and cut edges. This provides a new understanding as well as a strong theoretical justification for the
expressive power of DSS-WL and its recent extensions (Frasca et al., 2022). Furthermore, we give
a fine-grained analysis of several key factors in the framework, such as the graph generation policy
and the aggregation scheme, by showing that neither (i) the ego-network policy without marking
nor (ii) a variant of the weaker DS-WL algorithm can identify cut vertices.
However, GNNs designed based on DSS-WL are usually sophisticated and suffer from high com-
putation/memory costs. The main contribution in this paper is then to give a principled and effi-
cient way to design GNNs that are expressive for biconnectivity problems. Targeting this question,
we restart from the classic 1-WL algorithm and figure out a major weakness in distinguishing bi-
connectivity: the lack of distance information between nodes. Indeed, the importance of distance
information is theoretically justified in our proof for analyzing the expressive power of DSS-WL.
To this end, we introduce a novel color refinement framework, formalized as Generalized Distance
Weisfeiler-Lehman (GD-WL), by directly encoding a general distance metric into the WL aggregation procedure.
Table 1: Summary of theoretical results on the expressive power of different GNN models for various
biconnectivity problems. We also list the time/space complexity (per WL iteration) for each WL
algorithm, where n and m are the number of nodes and edges of a graph, respectively.

                Section 3.1                         Section 3.2         Section 4
Model           MPNN   GSN    CWN   GraphSNN   ESAN               Ours                3-IGN
WL variant      1-WL   SC-WL  CWL   OS-WL      DSS-WL   DS-WL     SPD-WL   GD-WL      2-FWL
Cut vertex      ✗      ✗      ✗     ✗          ✓        ✗         ✗        ✓          ✓
Cut edge        ✗      ✗      ✗     ✗          ✓        Unknown   ✓        ✓          ✓
BCVTree         ✗      ✗      ✗     ✗          ✓        Unknown   ✗        ✓          ✓
BCETree         ✗      ✗      ✗     ✗          ✓        Unknown   ✓        ✓          ✓
Ref. Theorem    -      3.1    C.12  C.13       3.2      C.16      4.1      4.2, 4.3   4.6
Time            n+m    n+m    -     n+m        n(n+m)   n(n+m)    n²       n²         n³
Space¹          n      n      -     n          n²       n         n        n          n²

¹ The space complexity of WL algorithms may differ from that of the corresponding GNN models in training,
e.g., for DS-WL and GD-WL, due to the need to store intermediate results for back-propagation.
We first prove that, as a special case, the Shortest Path Distance WL (SPD-WL) is
expressive for all edge-biconnectivity problems, thus providing a novel understanding of its empiri-
cal success. However, it still cannot identify cut vertices. We further suggest an alternative called the
Resistance Distance WL (RD-WL) for vertex-biconnectivity. To sum up, all biconnectivity problems
can be provably solved within our proposed GD-WL framework.
Finally, we give a worst-case analysis of the proposed GD-WL framework. We discuss its limitations
by proving that the expressive power of both SPD-WL and RD-WL can be bounded by the standard
2-FWL test (Cai et al., 1992). Consequently, 2-FWL is fully expressive for all biconnectivity met-
rics. Besides, since GD-WL heavily relies on distance information, we proceed to analyze its power
in distinguishing the class of distance-regular graphs (Brouwer et al., 1989). Surprisingly, we show
GD-WL matches the power of 2-FWL in this case, which strongly justifies its high expressiveness
in distinguishing hard graphs. A summary of our theoretical contributions is given in Table 1.
Practical Implementation. The main advantage of GD-WL lies in its simplicity, efficiency and
parallelizability. We show it can be easily implemented using a Transformer-like architecture by
injecting the distance into Multi-head Attention (Vaswani et al., 2017), similar to Ying et al. (2021a).
Importantly, we prove that the resulting Graph Transformer (called Graphormer-GD) is as expressive
as GD-WL. This offers strong theoretical insights into the power and limits of Graph Transformers.
Empirically, we show Graphormer-GD not only achieves perfect accuracy in detecting cut vertices
and cut edges, but also outperforms prior GNN architectures on popular benchmark datasets.
2 PRELIMINARY
Notations. We use { } to denote sets and use {{ }} to denote multisets. The cardinality of (multi)set
S is denoted as |S|. The index set is denoted as [n] := {1, · · · , n}. Throughout this paper, we
consider simple undirected graphs G = (V, E) with no repeated edges or self-loops. Therefore,
each edge {u, v} ∈ E can be expressed as a set of two elements. For a node u ∈ V, denote its
neighbors as NG (u) := {v ∈ V : {u, v} ∈ E} and denote its degree as degG (u) := |NG (u)|. A
path P = (u0 , · · · , ud ) is a tuple of nodes satisfying {ui−1 , ui } ∈ E for all i ∈ [d], and its length
is denoted as |P | := d. A path P is said to be simple if it does not go through a node more than
once, i.e. ui ≠ uj for i ≠ j. The shortest path distance between two nodes u and v is defined as
disG (u, v) := min{|P | : P is a path from u to v}. The induced subgraph with vertex subset S ⊂ V
is defined as G[S] = (S, ES ), where ES := {{u, v} ∈ E : u, v ∈ S}.
We next introduce the concepts of connectivity, vertex-biconnectivity and edge-biconnectivity.
Definition 2.1. (Connectivity) A graph G is connected if for any two nodes u, v ∈ V, there is a
path from u to v. A vertex set S ⊂ V is a connected component of G if G[S] is connected and for
any proper superset T ⊋ S, G[T ] is disconnected. Denote CC(G) as the set of all connected com-
ponents, then CC(G) forms a partition of the vertex set V. Clearly, G is connected iff |CC(G)| = 1.
Definition 2.2. (Biconnectivity) A node v ∈ V is a cut vertex (or articulation point) of G if re-
moving v increases the number of connected components, i.e., |CC(G[V\{v}])| > |CC(G)|. A
graph is vertex-biconnected if it is connected and does not have any cut vertex. A vertex set S ⊂ V
is a vertex-biconnected component of G if G[S] is vertex-biconnected and for any proper super-
set T ⊋ S, G[T ] is not vertex-biconnected. We can similarly define the concepts of cut edge (or
bridge) and edge-biconnected component (we omit them for brevity). Finally, denote BCCV (G)
(resp. BCCE (G)) as the set of all vertex-biconnected (resp. edge-biconnected) components.
Two non-adjacent nodes u, v ∈ V are in the same vertex-biconnected component iff there are two
paths from u to v that do not intersect (except at endpoints). Two nodes u, v are in the same edge-
biconnected component iff there are two paths from u to v that do not share an edge. On the other
hand, if two nodes are in different vertex/edge-biconnected components, any path between them
must go through some cut vertex/edge. Therefore, cut vertices/edges can be regarded as “hubs” in
a graph that link different subgraphs into a whole. Furthermore, the link between cut vertices/edges
and biconnected components forms a tree structure, which is called the block cut tree (cf. Figure 1).
Definition 2.3. (Block cut-edge tree) The block cut-edge tree of graph G = (V, E) is defined as
follows: BCETree(G) := (BCC_E(G), E^E), where

    E^E := { {S_1, S_2} : S_1, S_2 ∈ BCC_E(G), ∃u ∈ S_1, v ∈ S_2, s.t. {u, v} ∈ E }.

Definition 2.4. (Block cut-vertex tree) The block cut-vertex tree of graph G = (V, E) is defined as
follows: BCVTree(G) := (BCC_V(G) ∪ V^Cut, E^V), where V^Cut ⊂ V is the set containing all cut
vertices of G and

    E^V := { {S, v} : S ∈ BCC_V(G), v ∈ V^Cut, v ∈ S }.
The following theorem shows that all concepts related to biconnectivity can be efficiently computed.
Theorem 2.5. (Tarjan, 1972) The problems related to biconnectivity, including identifying all cut
vertices/edges, finding all biconnected components (BCCV (G) and BCCE (G)), and building block
cut trees (BCVTree(G) and BCETree(G)), can all be solved using the Depth-First Search algo-
rithm, within a computation complexity linear in the graph size, i.e. Θ(|V| + |E|).
Isomorphism and color refinement algorithms. Two graphs G = (VG , EG ) and H = (VH , EH )
are isomorphic (denoted as G ≃ H) if there is an isomorphism (bijective mapping) f : VG → VH
such that for any nodes u, v ∈ VG , {u, v} ∈ EG iff {f (u), f (v)} ∈ EH . A color refinement
algorithm is an algorithm that outputs a color mapping χG : VG → C when taking graph G as input,
where C is called the color set. A valid color refinement algorithm must preserve invariance under
isomorphism, i.e., χG (u) = χH (f (u)) for isomorphism f and node u ∈ VG . As a result, it can be
used as a necessary test for graph isomorphism by comparing the multisets {{χG (u) : u ∈ VG }} and
{{χH (u) : u ∈ VH }}, which we call the graph representations. Similarly, χG (u) can be seen as the
node feature of u ∈ VG , and {{χG (u), χG (v)}} corresponds to the edge feature of {u, v} ∈ EG . All
algorithms studied in this paper fit the color refinement framework; please refer to Appendix B
for a precise description of several representatives (e.g., the classic 1-WL and k-FWL algorithms).
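To make this framework concrete, here is a minimal sketch of the classic 1-WL test in this notation. It is an illustration under two assumptions of ours: graphs are given as adjacency dicts, and the perfect hash is realized by relabeling distinct signatures with fresh integer ids.

```python
def one_wl(G):
    """Classic 1-WL color refinement.
    G: dict mapping each node to an iterable of its neighbors.
    Returns a stable color mapping chi: node -> int."""
    chi = {v: 0 for v in G}                                  # uniform initial colors
    while True:
        # signature = own color + multiset (sorted tuple) of neighbor colors
        sig = {v: (chi[v], tuple(sorted(chi[u] for u in G[v]))) for v in G}
        # a perfect "hash": relabel distinct signatures with fresh integer ids
        table = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_chi = {v: table[sig[v]] for v in G}
        if len(set(new_chi.values())) == len(set(chi.values())):
            return new_chi                                   # partition is stable
        chi = new_chi
```

The graph representation is then the multiset sorted(one_wl(G).values()); to compare two graphs with consistent color ids, one can run the refinement on their disjoint union.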
Problem setup. This paper focuses on the following three types of problems with increasing diffi-
culty. Firstly, we say a color refinement algorithm can distinguish whether a graph is vertex/edge-
biconnected if, for any graphs G, H where G is vertex/edge-biconnected but H is not, their graph
representations are different, i.e. {{χG (u) : u ∈ VG }} ≠ {{χH (u) : u ∈ VH }}. Secondly, we say a
color refinement algorithm can identify cut vertices if for any graphs G, H and nodes u ∈ VG , v ∈
VH where u is a cut vertex but v is not, their node features are different, i.e. χG (u) ≠ χH (v).
Similarly, it can identify cut edges if for any {u, v} ∈ EG and {w, x} ∈ EH where {u, v} is a cut
edge but {w, x} is not, their edge features are different, i.e. {{χG (u), χG (v)}} ≠ {{χH (w), χH (x)}}.
Finally, we say a color refinement algorithm can distinguish block cut-vertex/edge trees if, for any
graphs G, H satisfying BCVTree(G) ≄ BCVTree(H) (or BCETree(G) ≄ BCETree(H)), their
graph representations are different, i.e. {{χG (u) : u ∈ VG }} ≠ {{χH (u) : u ∈ VH }}.
(a) (b) (c) (d)
Figure 2: Illustration of four representative counterexamples (see Examples C.9 and C.10 for general
definitions). Graphs in the first row have cut vertices (outlined in bold red) and some also have cut
edges (denoted as red lines), while graphs in the second row do not have any cut vertex or cut edge.
3.1 COUNTEREXAMPLES
1-WL/MPNNs. We first consider the classic 1-WL. We provide two principled classes of counterex-
amples, which are formally defined in Examples C.9 and C.10, with a few special cases illustrated in
Figure 2. For each pair of graphs in Figure 2, the color of each node is drawn according to the 1-WL
color mapping. It can be seen that the two graph representations are the same. Therefore, 1-WL
cannot solve any biconnectivity problem listed in Section 2.
Substructure Counting WL/GSN. Bouritsas et al. (2022) developed a principled approach to boost
the expressiveness of MPNNs by incorporating substructure counts into node features or the 1-
WL aggregation procedure. The resulting algorithm, which we call the SC-WL, is detailed in Ap-
pendix B.3. However, we show that no matter what substructures are used, the corresponding GSN still
cannot solve any biconnectivity problem listed in Section 2. We give a proof in Appendix C.2 for
the general case that allows arbitrary substructures, based on Examples C.9 and C.10. We also point
out that our negative result applies to the similar GNN variant in Barceló et al. (2021).
Theorem 3.1. Let H = {H1 , · · · , Hk }, Hi = (Vi , Ei ) be any set of connected graphs and denote
n = maxi∈[k] |Vi |. Then SC-WL (Appendix B.3) using the substructure set H cannot solve any
vertex/edge-biconnectivity problem listed in Section 2. Moreover, there exist counterexample graphs
whose sizes (both in terms of vertices and edges) are O(n).
GNNs with lifting transformations (MPSN/CWN). Bodnar et al. (2021b;a) considered another
approach to design powerful GNNs by using graph lifting transformations. In a nutshell, these ap-
proaches exploit higher-order graph structures such as cliques and cycles to design new WL aggre-
gation procedures. Unfortunately, we show the resulting algorithms, called the SWL and CWL, still
cannot solve any biconnectivity problem. Please see Appendix C.2 (Proposition C.12) for details.
Other GNN variants. In Appendix C.2, we discuss other recently proposed GNNs, such as Graph-
SNN (Wijesinghe & Wang, 2022), GNN-AK (Zhao et al., 2022), and NGNN (Zhang & Li, 2021).
Due to space limits, we defer the corresponding negative results to Propositions C.13, C.15 and C.16.
We next switch our attention to a new type of GNN framework proposed in Bevilacqua et al. (2022),
called the Equivariant Subgraph Aggregation Networks (ESAN). The central algorithm in ESAN is
called the DSS-WL. Given a graph G, DSS-WL first generates a bag of vertex-shared (sub)graphs
B_G^π = {{G_1, · · · , G_m}} according to a graph generation policy π. Then in each iteration t, the
algorithm refines the color of each node v in each subgraph G_i by jointly aggregating its neighboring
colors in its own subgraph and across all subgraphs. The aggregation formulas can be written as:

    χ_{G_i}^t(v) := hash( χ_{G_i}^{t−1}(v), {{χ_{G_i}^{t−1}(u) : u ∈ N_{G_i}(v)}}, χ_G^{t−1}(v), {{χ_G^{t−1}(u) : u ∈ N_G(v)}} ),    (1)
    χ_G^t(v) := hash( {{χ_{G_i}^t(v) : i ∈ [m]}} ),    (2)

where hash is a perfect hash function. DSS-WL terminates when χ_G^t induces a stable vertex parti-
tion. In this paper, we consider node-based graph generation policies, for which each subgraph is
associated with a specific node, i.e. B_G^π = {{G_v : v ∈ V}}. Some popular choices are node deletion
π_ND, node marking π_NM, the k-ego-network π_EGO(k), and its node-marking version π_EGOM(k). A full
description of DSS-WL as well as the different policies can be found in Appendix B.4 (Algorithm 3).
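To make Equations (1) and (2) concrete, here is a minimal, unoptimized sketch of DSS-WL under the node-marking policy π_NM, in the same style as the one_wl sketch above (all helper names are ours; the hash is again realized by relabeling distinct signatures).

```python
def dss_wl(G):
    """A minimal sketch of DSS-WL with node marking (Eqs. (1)-(2)).
    Under pi_NM, subgraph G_w shares all nodes/edges with G; only the initial
    color of the marked node w differs. G: dict node -> set of neighbors.
    Returns the stable cross-subgraph color mapping chi_G: node -> int."""
    V = list(G)
    # chi[w][v]: color of node v in subgraph G_w (w is the marked node)
    chi = {w: {v: int(v == w) for v in V} for w in V}
    while True:
        # Eq. (2): aggregate each node's colors across all subgraphs
        sig_G = {v: tuple(sorted(chi[w][v] for w in V)) for v in V}
        t_G = {s: i for i, s in enumerate(sorted(set(sig_G.values())))}
        chi_G = {v: t_G[sig_G[v]] for v in V}
        # Eq. (1): refine within each subgraph, also seeing whole-graph colors
        sig = {(w, v): (chi[w][v],
                        tuple(sorted(chi[w][u] for u in G[v])),
                        chi_G[v],
                        tuple(sorted(chi_G[u] for u in G[v])))
               for w in V for v in V}
        t = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        if len(t) == len({chi[w][v] for w in V for v in V}):
            return chi_G                     # stable vertex partition reached
        chi = {w: {v: t[sig[(w, v)]] for v in V} for w in V}
```

Each iteration touches all n subgraphs and all their edges, reflecting the Θ(n(n+m)) per-iteration cost listed in Table 1.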
A fundamental question regarding DSS-WL is how expressive it is. While a straightforward analysis
shows that DSS-WL is strictly more powerful than 1-WL, an in-depth understanding of what addi-
tional power DSS-WL gains over 1-WL is still limited. The only new result is the very recent work
of Frasca et al. (2022), who showed a 3-WL upper bound for the expressivity of DSS-WL. Yet, such
a result actually gives a limitation of DSS-WL rather than showing its power. Moreover, there is a
large gap between the rather strong 3-WL and the weak 1-WL. In the following, we take a different
perspective and prove that DSS-WL is expressive for both types of biconnectivity problems.
Theorem 3.2. Let G = (VG , EG ) and H = (VH , EH ) be two graphs, and let χG and χH be the
corresponding DSS-WL color mapping with node marking policy. Then the following holds:
• For any two nodes w ∈ VG and x ∈ VH , if χG (w) = χH (x), then w is a cut vertex if and
only if x is a cut vertex.
• For any two edges {w1 , w2 } ∈ EG and {x1 , x2 } ∈ EH , if {{χG (w1 ), χG (w2 )}} =
{{χH (x1 ), χH (x2 )}}, then {w1 , w2 } is a cut edge if and only if {x1 , x2 } is a cut edge.
The proof of Theorem 3.2 is highly technical and is deferred to Appendix C.3. By using the basic
results derived in Appendix C.1, we conduct a careful analysis of the DSS-WL color mapping and
discover several important properties. They give insights into why DSS-WL can succeed in distin-
guishing biconnectivity, as we will discuss below.
How can DSS-WL distinguish biconnectivity? We find that a crucial advantage of DSS-WL
over the classic 1-WL is that DSS-WL color mapping implicitly encodes distance information (see
Lemma C.19(e) and Corollary C.24). For example, two nodes u ∈ VG , v ∈ VH will have dif-
ferent DSS-WL colors if the distance set {{disG (u, w) : w ∈ VG }} differs from {{disH (v, w) :
w ∈ VH }}. Our proof highlights that distance information plays a vital role in distinguishing edge-
biconnectivity when combined with color refinement algorithms (detailed in Section 4), and it also
helps distinguish vertex-biconnectivity (see the proof of Lemma C.22). Consequently, our analysis
provides a novel understanding and a strong justification for the success of DSS-WL in two aspects:
the graph representation computed by DSS-WL intrinsically encodes distance and biconnectivity
information, both of which are fundamental structural properties of graphs but are lacking in 1-WL.
Discussions on graph generation policies. Note that Theorem 3.2 holds for node marking policy.
In fact, the ability of DSS-WL to encode distance information heavily relies on node marking as
shown in the proof of Lemma C.19. In contrast, we prove that the ego-network policy πEGO(k)
cannot distinguish cut vertices (Proposition C.14), using the counterexample given in Figure 2(c).
Therefore, our result shows an inherent advantage of node marking over the ego-network policy in
distinguishing a class of non-isomorphic graphs, which was raised as an open question in Bevilacqua
et al. (2022, Section 5). It also highlights a theoretical limitation of πEGO(k) compared with its node
marking version πEGOM(k) , a subtle difference that may not have received sufficient attention yet.
For example, both the GNN-AK and GNN-AK-ctx architectures (Zhao et al., 2022) cannot solve
vertex-biconnectivity problems since they are similar to πEGO(k) (see Proposition C.15). On the other
hand, the GNN-AK+ does not suffer from such a drawback although it also uses πEGO(k) , because
it further adds distance encoding in each subgraph (which is more expressive than node marking).
Discussions on DS-WL. Bevilacqua et al. (2022); Cotta et al. (2021) also considered a weaker ver-
sion of DSS-WL, called the DS-WL, which aggregates the node color in each subgraph without
interaction across different subgraphs (see formula (10)). We show in Proposition C.16 that unfor-
tunately, DS-WL with common node-based policies cannot identify cut vertices when the color of
each node v is defined as its associated subgraph representation Gv . This theoretically reveals the
importance of cross-graph aggregation and justifies the design of DSS-WL. Finally, we point out
that Qian et al. (2022) very recently proposed an extension of DS-WL that adds a final cross-graph
aggregation procedure, for which our negative result may not hold. It may be an interesting direction
to theoretically analyze the expressiveness of this type of DS-WL in future work.
While DSS-WL (with node marking) is expressive for biconnectivity, the resulting GNN architectures
are rather sophisticated, and it is natural to study whether simpler architectures exist. More importantly,
DSS-WL suffers from high computational costs in both time and memory. Indeed, it requires Θ(n²) space
and Θ(nm) time per iteration (using policy πNM ) to compute node colors for a graph with n nodes
and m edges, which is n times more costly than 1-WL. Given the theoretical linear lower bound in
Theorem 2.5, one may naturally raise the question of how to close the gap by developing more
efficient color refinement algorithms.
We approach the problem by rethinking the classic 1-WL test. We argue that a major weakness of
1-WL is that it is agnostic to distance information between nodes, partly because each node can
only “see” its neighbors in aggregation. On the other hand, the DSS-WL color mapping implicitly
encodes distance information as shown in Section 3.2, which inspires us to formally study whether
incorporating distance in the aggregation procedure is crucial for solving biconnectivity problems.
To this end, we introduce a novel color refinement framework which we call Generalized Distance
Weisfeiler-Lehman (GD-WL). The update rule of GD-WL is very simple and can be written as:
    χ_G^t(v) := hash( {{(d_G(v, u), χ_G^{t−1}(u)) : u ∈ V}} ),    (3)
where dG can be an arbitrary distance metric. The full algorithm is described in Algorithm 4.
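In the same style as the sketches above, a minimal sketch of GD-WL itself follows; it assumes the distance lookup d has been precomputed (the helper name and input format are ours).

```python
def gd_wl(G, d):
    """A minimal sketch of GD-WL (Eq. (3)). G: iterable of nodes; d: a
    precomputed lookup with d[v][u] = d_G(v, u) for an arbitrary distance
    metric. For real-valued metrics such as RD, round the entries to a fixed
    precision first so that equal distances hash equally.
    Returns the stable color mapping chi: node -> int."""
    V = list(G)
    chi = {v: 0 for v in V}
    while True:
        # signature of v = multiset of (distance to u, color of u) pairs
        sig = {v: tuple(sorted((d[v][u], chi[u]) for u in V)) for v in V}
        table = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_chi = {v: table[sig[v]] for v in V}
        if len(set(new_chi.values())) == len(set(chi.values())):
            return new_chi                    # stable vertex partition
        chi = new_chi
```

Each iteration processes Θ(n²) node pairs, matching the per-iteration cost in Table 1 up to the sorting overhead of this naive sketch.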
SPD-WL for edge-biconnectivity. As a special case, when choosing the shortest path distance
dG = disG , we obtain an algorithm which we call SPD-WL. It can be equivalently written as
    χ_G^t(v) := hash( χ_G^{t−1}(v), {{χ_G^{t−1}(u) : u ∈ N_G(v)}}, {{χ_G^{t−1}(u) : dis_G(v, u) = 2}},
                      · · · , {{χ_G^{t−1}(u) : dis_G(v, u) = n − 1}}, {{χ_G^{t−1}(u) : dis_G(v, u) = ∞}} ).    (4)
From (4) it is clear that SPD-WL is strictly more powerful than 1-WL since it additionally aggre-
gates the k-hop neighbors for all k > 1. There have been several prior works related to SPD-WL,
including using distance encoding as node features (Li et al., 2020) or performing k-hop aggrega-
tion for some small k (see Appendix D.2 for more related works and discussions). Yet, these works
are either purely empirical or provide limited theoretical analysis (e.g., by focusing only on regular
graphs). Instead, we introduce the general and more expressive SPD-WL framework with a rather
different motivation and perform a systematic study on its expressive power. Our key result confirms
that SPD-WL is fully expressive for all edge-biconnectivity problems listed in Section 2.
Theorem 4.1. Let G = (VG , EG ) and H = (VH , EH ) be two graphs, and let χG and χH be the
corresponding SPD-WL color mapping. Then the following holds:
• For any two edges {w1 , w2 } ∈ EG and {x1 , x2 } ∈ EH , if {{χG (w1 ), χG (w2 )}} =
{{χH (x1 ), χH (x2 )}}, then {w1 , w2 } is a cut edge if and only if {x1 , x2 } is a cut edge.
• If {{χG (w) : w ∈ VG }} = {{χH (w) : w ∈ VH }}, then BCETree(G) ≃ BCETree(H).
Theorem 4.1 is highly non-trivial and perhaps surprising at first sight, as it combines three seemingly
unrelated concepts (i.e., SPD, biconnectivity, and the WL test) into a unified conclusion. We give a
proof in Appendix C.4, which separately considers two cases: χG (w1 ) ̸= χG (w2 ) and χG (w1 ) =
χG (w2 ) (see Figure 2(b,d) for examples). For each case, the key technique in the proof is to construct
an auxiliary graph (Definitions C.26 and C.34) that precisely characterizes the structural relationship
between nodes that have specific colors (see Corollaries C.31 and C.40). Finally, we highlight that
the second item of Theorem 4.1 may be particularly interesting: while distinguishing general non-
isomorphic graphs is known to be hard (Cai et al., 1992; Babai, 2016), we show that distinguishing
non-isomorphic graphs with different block cut-edge trees can be solved much more easily by SPD-WL.
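As a small sanity check of SPD-WL's extra power over 1-WL, the following example (our construction, reusing the gd_wl sketch and the disjoint-union trick from above) contrasts a 6-cycle with the disjoint union of two triangles: both are 2-regular, so 1-WL colors every node identically, whereas SPD-WL separates them.

```python
import networkx as nx

def spd_matrix(G, inf=float("inf")):
    """All-pairs shortest path distances; inf encodes unreachable pairs,
    matching the dis_G(v, u) = infinity bucket in Eq. (4)."""
    sp = dict(nx.all_pairs_shortest_path_length(G))
    return {v: {u: sp[v].get(u, inf) for u in G} for v in G}

G = nx.cycle_graph(6)                                         # one 6-cycle
H = nx.disjoint_union(nx.cycle_graph(3), nx.cycle_graph(3))   # two triangles
U = nx.disjoint_union(G, H)       # shared run => comparable color ids
chi = gd_wl(U, spd_matrix(U))     # SPD-WL = GD-WL with d = SPD
rep_G = sorted(chi[v] for v in range(6))       # nodes 0-5 come from G
rep_H = sorted(chi[v] for v in range(6, 12))   # nodes 6-11 come from H
print(rep_G != rep_H)             # True: SPD-WL separates them; 1-WL does not
```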
RD-WL for vertex-biconnectivity. Unfortunately, while SPD-WL is fully expressive for edge-
biconnectivity, it is not expressive for vertex-biconnectivity. We give a simple counterexample in
Figure 2(c), where SPD-WL cannot distinguish the two graphs. Nevertheless, we find that by using
a different distance metric, problems related to vertex-biconnectivity can also be fully solved. We
propose such a choice called the Resistance Distance (RD), denoted as dis_G^R, which is also a basic
metric in graph theory (Doyle & Snell, 1984; Klein & Randić, 1993; Sanmartín et al., 2022). For-
mally, the value of dis_G^R(u, v) is defined to be the effective resistance between nodes u and v when
treating G as an electrical network in which each edge corresponds to a resistance of one ohm. We
note that other generalized distances can also be considered (Li et al., 2020; Velingker et al., 2022).
RD has many elegant properties. First, it is a valid metric: indeed, RD is non-negative, definite,
symmetric, and satisfies the triangle inequality (see Appendix E.2). Moreover, similar to SPD,
we also have 0 ≤ dis_G^R(u, v) ≤ n − 1, and dis_G^R(u, v) = dis_G(u, v) if G is a tree. In Appendix E.2,
we further show that RD is highly related to the graph Laplacian and can be efficiently calculated.
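The Laplacian connection makes RD straightforward to precompute. Below is a minimal sketch using the classical pseudoinverse identity dis_G^R(u, v) = L⁺[u,u] + L⁺[v,v] − 2L⁺[u,v] (the identity is standard, cf. Klein & Randić (1993); the helper name is ours).

```python
import numpy as np
import networkx as nx

def resistance_matrix(G):
    """Resistance distance via the Laplacian pseudoinverse L+:
    dis_G^R(u, v) = L+[u, u] + L+[v, v] - 2 L+[u, v].
    One O(n^3) computation per graph; assumes G is connected
    (for disconnected graphs, apply this per connected component)."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    Lp = np.linalg.pinv(L)                 # Moore-Penrose pseudoinverse
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2 * Lp

# On a tree, RD coincides with SPD: three unit resistors in series give 3.
P = nx.path_graph(4)
assert np.isclose(resistance_matrix(P)[0, 3], 3.0)
```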
Theorem 4.2. Let G = (VG , EG ) and H = (VH , EH ) be two graphs, and let χG and χH be the
corresponding RD-WL color mapping. Then the following holds:
• For any two nodes w ∈ VG and x ∈ VH , if χG (w) = χH (x), then w is a cut vertex if and
only if x is a cut vertex.
• If {{χG (w) : w ∈ VG }} = {{χH (w) : w ∈ VH }}, then BCVTree(G) ≃ BCVTree(H).
The form of Theorem 4.2 exactly parallels Theorem 4.1, which shows that RD-WL is fully expres-
sive for vertex-biconnectivity. We give a proof of Theorem 4.2 in Appendix C.5. In particular, the
proof of the second item is highly technical due to the challenges in analyzing the (complex) struc-
ture of the block cut-vertex tree. It also highlights that distinguishing non-isomorphic graphs that
have different BCVTrees is much easier than the general case.
Combining Theorems 4.1 and 4.2 immediately yields the following corollary, showing that all bi-
connectivity problems can be solved within our proposed GD-WL framework.
Corollary 4.3. When using both SPD and RD (i.e., by setting d_G(u, v) := (dis_G(u, v), dis_G^R(u, v))),
the corresponding GD-WL is fully expressive for both vertex-biconnectivity and edge-biconnectivity.
Computational cost. The GD-WL framework only needs Θ(n) space and Θ(n²) time per iteration
for a graph with n nodes and m edges, both of which are strictly less than DSS-WL.
In particular, GD-WL has the same space complexity as 1-WL, which can be crucial for large-scale
tasks. On the other hand, one may ask how much computational overhead there is in preprocessing
the pairwise distances between nodes. We show in Appendix E that this cost can be
trivially upper bounded by O(nm) for SPD and O(n³) for RD. Note that the preprocessing step only
needs to be executed once, and we find that its cost is negligible compared to that of the GNN architecture.
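Putting the pieces together, the one-time preprocessing for the combined metric of Corollary 4.3 can be sketched as follows, reusing our earlier spd_matrix and resistance_matrix helpers (and assuming, as an illustrative simplification, a connected graph with integer node labels 0..n−1).

```python
import numpy as np

def gd_metric(G):
    """One-time preprocessing of d_G(u, v) = (dis_G(u, v), dis_G^R(u, v)),
    as in Corollary 4.3. Assumes G is connected with node labels 0..n-1."""
    spd = spd_matrix(G)
    R = np.round(resistance_matrix(G), 8)   # fixed precision for exact hashing
    return {v: {u: (spd[v][u], float(R[v, u])) for u in G} for v in G}

# chi = gd_wl(G, gd_metric(G))  # expressive for both kinds of biconnectivity
```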
Practical implementation. One of the main advantages of GD-WL is its high degree of paralleliz-
ability. In particular, we find GD-WL can be easily implemented using a Transformer-like architec-
ture by injecting distance information into Multi-head Attention (Vaswani et al., 2017), similar to
the structural encoding in Graphormer (Ying et al., 2021a). The attention layer can be written as:
    Y^h = ( ϕ_1^h(D) ⊙ softmax( X W_Q^h (X W_K^h)^⊤ ) + ϕ_2^h(D) ) X W_V^h,    (5)

where X ∈ R^{n×d} is the input node feature matrix from the previous layer, D ∈ R^{n×n} is the distance matrix
such that D_{uv} = d_G(u, v), W_Q^h, W_K^h, W_V^h ∈ R^{d×d_H} are learnable weight matrices of the h-th
head, ϕ_1^h and ϕ_2^h are elementwise functions applied to D (possibly parameterized), and ⊙ denotes
elementwise multiplication. The results Y^h ∈ R^{n×d_H} across all heads h are then combined and
projected to obtain the final output Y = Σ_h Y^h W_O^h, where W_O^h ∈ R^{d_H×d}. We call the resulting
architecture Graphormer-GD, and the full structure of Graphormer-GD is provided in Appendix E.3.
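A minimal PyTorch sketch of one such attention layer follows. It assumes distances have been bucketized into integer ids so that ϕ_1^h and ϕ_2^h can be realized as learned per-head scalars; this is one of many valid parameterizations, not necessarily the one in Appendix E.3, and all names below are ours.

```python
import torch
import torch.nn as nn

class GDAttention(nn.Module):
    """One distance-modulated multi-head attention layer in the style of
    Eq. (5): Y^h = (phi1^h(D) * softmax(Q K^T) + phi2^h(D)) V."""
    def __init__(self, d, d_head, n_heads, n_buckets=32):
        super().__init__()
        self.W_Q = nn.Linear(d, d_head * n_heads, bias=False)
        self.W_K = nn.Linear(d, d_head * n_heads, bias=False)
        self.W_V = nn.Linear(d, d_head * n_heads, bias=False)
        self.W_O = nn.Linear(d_head * n_heads, d, bias=False)
        # one learned scalar per (distance bucket, head) for phi1 and phi2
        self.phi1 = nn.Embedding(n_buckets, n_heads)
        self.phi2 = nn.Embedding(n_buckets, n_heads)
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, X, D_bucket):
        # X: [n, d] node features; D_bucket: [n, n] integer distance buckets
        n = X.size(0)
        Q = self.W_Q(X).view(n, self.n_heads, self.d_head)
        K = self.W_K(X).view(n, self.n_heads, self.d_head)
        V = self.W_V(X).view(n, self.n_heads, self.d_head)
        att = torch.einsum("ihd,jhd->hij", Q, K) / self.d_head ** 0.5
        att = torch.softmax(att, dim=-1)                      # [h, n, n]
        p1 = self.phi1(D_bucket).permute(2, 0, 1)             # [h, n, n]
        p2 = self.phi2(D_bucket).permute(2, 0, 1)
        Y = torch.einsum("hij,jhd->ihd", p1 * att + p2, V)    # Eq. (5)
        return self.W_O(Y.reshape(n, -1))                     # combine heads
```

Stacked with the usual feed-forward blocks and residual connections, such layers yield a Graphormer-style network; the actual Graphormer-GD structure is given in Appendix E.3.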
It is easy to see that the mapping from X to Y in (5) is equivariant and simulates the GD-WL
aggregation. Importantly, we have the following expressivity result, which precisely characterizes
the power and limits of Graphormer-GD. We give a proof in Appendix E.3.
Theorem 4.4. Graphormer-GD is at most as powerful as GD-WL in distinguishing non-isomorphic
graphs. Moreover, when choosing proper functions ϕh1 and ϕh2 and using a sufficiently large number
of heads and layers, Graphormer-GD is as powerful as GD-WL.
On the expressivity upper bound of GD-WL. To complete the theoretical analysis, we finally
provide an upper bound of the expressive power for our proposed SPD-WL and RD-WL, by studying
the relationship with the standard 2-FWL (3-WL) algorithm.
Theorem 4.5. The 2-FWL algorithm is more powerful than both SPD-WL and RD-WL. Formally,
the 2-FWL color mapping induces a finer vertex partition than that of both SPD-WL and RD-WL.
We give a proof in Appendix C.6. Using Theorem 4.5, we arrive at the important corollary:
Corollary 4.6. The 2-FWL is fully expressive for both vertex-biconnectivity and edge-biconnectivity.
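For readers who wish to check these bounds empirically, here is a compact reference sketch of the folklore 2-FWL update, in the same style as our 1-WL and GD-WL sketches above (all names are ours).

```python
def two_fwl(G):
    """A reference sketch of the 2-FWL test: colors live on ordered node
    pairs and are refined by
      chi(u, v) <- hash(chi(u, v), {{(chi(u, w), chi(w, v)) : w in V}}),
    at Theta(n^3) cost per iteration as in Table 1.
    G: dict node -> set of neighbors. Returns chi on ordered pairs."""
    V = list(G)
    # initial color: isomorphism type of the pair (equal / adjacent / other)
    chi = {(u, v): (u == v, v in G[u]) for u in V for v in V}
    while True:
        sig = {(u, v): (chi[(u, v)],
                        tuple(sorted((chi[(u, w)], chi[(w, v)]) for w in V)))
               for u in V for v in V}
        table = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_chi = {p: table[sig[p]] for p in sig}
        if len(set(new_chi.values())) == len(set(chi.values())):
            return new_chi                    # stable pair partition
        chi = new_chi
```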
A worst-case analysis of GD-WL for distance-regular graphs. Since GD-WL heavily relies on
distance information, one may wonder about its expressiveness in the worst-case scenario where
distance information may not help distinguish certain non-isomorphic graphs, in particular, the class
of distance-regular graphs (Brouwer et al., 1989). Due to space limits, we provide a comprehensive
study of this question in Appendix C.7, where we give a precise and complete characterization of
Figure 3: Illustration of non-isomorphic distance-regular graphs. (a) Dodecahedron vs. Desargues
graph: SPD-WL fails while RD-WL succeeds. (b) 4×4 rook's graph vs. Shrikhande graph: both
SPD-WL and RD-WL fail.
what types of distance-regular graphs SPD-WL/RD-WL/2-FWL can distinguish (with both theoret-
ical results and counterexamples). The main result is presented as follows:
Theorem 4.7. RD-WL is strictly more powerful than SPD-WL in distinguishing non-isomorphic
distance-regular graphs. Moreover, RD-WL is as powerful as 2-FWL in distinguishing non-
isomorphic distance-regular graphs.
The above theorem strongly justifies the power of resistance distance and our proposed GD-WL.
Importantly, to our knowledge, this is the first result showing that a more efficient WL algorithm can
match the expressive power of 2-FWL in distinguishing distance-regular graphs.
5 EXPERIMENTS
In this section, we perform empirical evaluations of our proposed Graphormer-GD. We mainly con-
sider the following two sets of experiments. Firstly, we would like to verify whether Graphormer-
GD can indeed learn biconnectivity-related metrics easily as our theory predicts. Secondly, we
would like to investigate whether GNNs with sufficient expressiveness for biconnectivity can also
help real-world tasks and benefit the generalization performance as well. The code and models will
be made publicly available at https://github.com/lsj2408/Graphormer-GD.
Synthetic tasks. To test the expressive power of GNNs for biconnectivity metrics, we separately
consider two tasks: (i) Cut Vertex Detection and (ii) Cut Edge Detection. Given a GNN model that
outputs node features, we add a learnable prediction head that takes each node feature (or two node
features corresponding to each edge) as input and predicts whether it is a cut vertex (cut edge) or
not. The evaluation metric for both tasks is the graph-level accuracy, i.e., given a graph, the model
prediction is considered correct only when all the cut vertices/edges are correctly identified. To make
the results convincing, we construct a challenging dataset that comprises various types of hard graphs,
including regular graphs with cut vertices/edges and also Examples C.9 and C.10 mentioned in
Section 3. We also choose several GNN baselines with different levels of expressive power: (i) classic
MPNNs (Kipf & Welling, 2017; Veličković et al., 2018; Xu et al., 2019); (ii) Graph Substructure
Network (Bouritsas et al., 2022); (iii) Graphormer (Ying et al., 2021a). The details of model
configurations, dataset, and training procedure are provided in Appendix F.1.

Table 2: Accuracy on cut vertex (articulation point) and cut edge (bridge) detection tasks.

Model                              Cut Vertex Detection   Cut Edge Detection
GCN (Kipf & Welling, 2017)         51.5%±1.3%             62.4%±1.8%
GAT (Veličković et al., 2018)      52.0%±1.3%             62.8%±1.9%
GIN (Xu et al., 2019)              53.9%±1.7%             63.1%±2.2%
GSN (Bouritsas et al., 2022)       60.1%±1.9%             70.7%±2.1%
Graphormer (Ying et al., 2021a)    76.4%±2.8%             84.5%±3.3%
Graphormer-GD (ours)               100%                   100%
  - w/o Resistance Distance        83.3%±2.7%             100%
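For completeness, a minimal sketch of the graph-level accuracy metric described above (the helper name and data layout are our assumptions):

```python
def graph_level_accuracy(preds, labels):
    """A graph counts as correct only if every per-node (or per-edge) cut
    prediction in it is correct.
    preds, labels: lists of per-graph boolean sequences."""
    correct = sum(list(p) == list(y) for p, y in zip(preds, labels))
    return correct / len(labels)
```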
The results are presented in Table 2. It can be seen that baseline GNNs cannot perfectly solve these
synthetic tasks. In contrast, Graphormer-GD achieves 100% accuracy on both tasks, implying
that it can easily learn biconnectivity metrics even in very difficult graphs. Moreover, while using
only SPD suffices to identify cut edges, it is still necessary to further incorporate RD to identify cut
vertices. This is consistent with our theoretical results in Theorems 4.1, 4.2 and 4.4.
Real-world tasks. We further study the empirical performance of our Graphormer-GD on the real-
world benchmark: ZINC from Benchmarking-GNNs (Dwivedi et al., 2020). To show the scalability
of Graphormer-GD, we train our models on both ZINC-Full (consisting of 250K molecular graphs)
and ZINC-Subset (12K selected graphs). We comprehensively compare our model with prior expressive
GNNs that have been publicly released.
Table 3: Mean Absolute Error (MAE) on the ZINC test set. Following Dwivedi et al. (2020), the
parameter budget of compared models is set to 500k. We use ∗ to indicate the best performance.

Method          Model                                   Time (s)   Params    ZINC-Subset    ZINC-Full
MPNNs           GIN (Xu et al., 2019)                   8.05       509,549   0.526±0.051    0.088±0.002
                GraphSAGE (Hamilton et al., 2017)       6.02       505,341   0.398±0.002    0.126±0.003
                GAT (Veličković et al., 2018)           8.28       531,345   0.384±0.007    0.111±0.002
                GCN (Kipf & Welling, 2017)              5.85       505,079   0.367±0.011    0.113±0.002
                MoNet (Monti et al., 2017)              7.19       504,013   0.292±0.006    0.090±0.002
                GatedGCN-PE (Bresson & Laurent, 2017)   10.74      505,011   0.214±0.006    -
                MPNN(sum) (Gilmer et al., 2017)         -          480,805   0.145±0.007    -
                PNA (Corso et al., 2020)                -          387,155   0.142±0.010    -
Higher-order    RingGNN (Chen et al., 2019)             178.03     527,283   0.353±0.019    -
GNNs            3WLGNN (Maron et al., 2019a)            179.35     507,603   0.303±0.068    -
Substructure-   GSN (Bouritsas et al., 2022)            -          ~500k     0.101±0.010    -
based GNNs      CIN-Small (Bodnar et al., 2021a)        -          ~100k     0.094±0.004    0.044±0.003
Subgraph        NGNN (Zhang & Li, 2021)                 -          ~500k     0.111±0.003    0.029±0.001
GNNs            DSS-GNN (Bevilacqua et al., 2022)       -          445,709   0.097±0.006    -
                GNN-AK (Zhao et al., 2022)              -          ~500k     0.105±0.010    -
                GNN-AK+ (Zhao et al., 2022)             -          ~500k     0.091±0.011    -
                SUN (Frasca et al., 2022)               15.04      526,489   0.083±0.003    -
Graph           GT (Dwivedi & Bresson, 2021)            -          588,929   0.226±0.014    -
Transformers    SAN (Kreuzer et al., 2021)              -          508,577   0.139±0.006    -
                Graphormer (Ying et al., 2021a)         12.26      489,321   0.122±0.006    0.052±0.005
                URPE (Luo et al., 2022b)                12.40      491,737   0.086±0.007    0.028±0.002
GD-WL           Graphormer-GD (ours)                    12.52      502,793   0.081±0.009∗   0.025±0.004∗
For a fair comparison, we ensure that the parameter budget of both Graphormer-GD and the
compared models is around 500K, following Dwivedi
et al. (2020). Details of baselines and settings are presented in Appendix F.2.
The results are shown in Table 3, where our score is averaged over four experiments with differ-
ent seeds. We also list the per-epoch training time of different models on ZINC-subset as well
as their model parameters. It can be seen that Graphormer-GD surpasses or matches all compet-
itive baselines on the test sets of both ZINC-Subset and ZINC-Full. Furthermore, we find that the
empirical performance of the compared models aligns with their expressive power as measured by graph
biconnectivity. For example, Subgraph GNNs, which are expressive for biconnectivity, consistently
outperform classic MPNNs by a large margin. Compared with Subgraph GNNs, the main advan-
tage of Graphormer-GD is that it is simpler to implement and more parallelizable, while still
achieving better performance. Therefore, we believe our proposed architecture is both effective and
efficient and can be extended well to more practical scenarios like drug discovery.
Other tasks. We also perform node-level experiments on two popular datasets: the Brazil-Airports
and the Europe-Airports. Due to space limits, the results are shown in Appendix F.3.
6 CONCLUSION
In this paper, we systematically investigate the expressive power of GNNs via the perspective of
graph biconnectivity. Through the novel lens, we gain strong theoretical insights into the power and
limits of existing popular GNNs. We then introduce the principled GD-WL framework that is fully
expressive for all biconnectivity metrics. We further design the Graphormer-GD architecture that
is provably powerful while enjoying practical efficiency and parallelizability. Experiments on both
synthetic and real-world datasets demonstrate the effectiveness of Graphormer-GD.
There are still many promising directions that have not yet been explored. Firstly, it remains an
important open problem whether biconnectivity can be solved more efficiently in o(n²) time using
equivariant GNNs. Secondly, a deep understanding of GD-WL is generally lacking. For example,
we conjecture that RD-WL can encode graph spectra (Lim et al., 2022) and is strictly more powerful
than SPD-WL in distinguishing general graphs. Thirdly, it may be interesting to further investigate
more expressive distance (structural) encoding schemes beyond RD-WL and to explore how to encode
them in Graph Transformers. Finally, one can extend biconnectivity to a hierarchy of higher-order
variants (e.g., tri-connectivity), which provides a view completely parallel to the WL hierarchy for
studying expressive power and guiding the design of provably powerful GNN architectures.
ACKNOWLEDGMENTS
Bohang Zhang is grateful to Ruichen Li for his great help in discussing and checking several of
the main results in this paper, including Theorems 3.1, 3.2, 4.1 and 4.7. In particular, after the
initial submission, Ruichen Li discovered a simpler proof of Lemma C.28 and helped complete the
proof of Theorem C.61. Bohang Zhang would also like to thank Yiheng Du, Kai Yang and Ruichen Li
for correcting some small mistakes in the proofs of Lemmas C.20 and C.45.
REFERENCES
Ralph Abboud, İsmail İlkan Ceylan, Martin Grohe, and Thomas Lukasiewicz. The surprising power
of graph neural networks with random node initialization. In Proceedings of the Thirtieth Inter-
national Joint Conference on Artificial Intelligence, IJCAI-21, pp. 2112–2118, 2021.
Ralph Abboud, Radoslav Dimitrov, and Ismail Ilkan Ceylan. Shortest path networks for graph
property prediction. In The First Learning on Graphs Conference, 2022.
Robert Ackland et al. Mapping the us political blogosphere: Are conservative bloggers more promi-
nent? In BlogTalk Downunder 2005 Conference, Sydney. BlogTalk Downunder 2005 Conference,
Sydney, 2005.
Noga Alon, Raphael Yuster, and Uri Zwick. Finding and counting given length cycles. Algorithmica,
17(3):209–223, 1997.
Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications.
In International Conference on Learning Representations, 2021.
Vikraman Arvind, Frank Fuhlbrück, Johannes Köbler, and Oleg Verbitsky. On weisfeiler-leman
invariance: Subgraph counts and related graph properties. Journal of Computer and System Sci-
ences, 113:42–59, 2020.
Waiss Azizian and Marc Lelarge. Expressive power of invariant and equivariant graph neural net-
works. In International Conference on Learning Representations, 2021.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
Muhammet Balcilar, Pierre Héroux, Benoit Gauzere, Pascal Vasseur, Sébastien Adam, and Paul
Honeine. Breaking the limits of message passing graph neural networks. In International Con-
ference on Machine Learning, pp. 599–608. PMLR, 2021.
Pablo Barceló, Floris Geerts, Juan Reutter, and Maksimilian Ryschkov. Graph neural networks with
local graph parameters. In Advances in Neural Information Processing Systems, volume 34, pp.
25280–25293, 2021.
Beatrice Bevilacqua, Fabrizio Frasca, Derek Lim, Balasubramaniam Srinivasan, Chen Cai, Gopinath
Balamurugan, Michael M Bronstein, and Haggai Maron. Equivariant subgraph aggregation net-
works. In International Conference on Learning Representations, 2022.
Cristian Bodnar, Fabrizio Frasca, Nina Otter, Yu Guang Wang, Pietro Liò, Guido Montufar, and
Michael M. Bronstein. Weisfeiler and lehman go cellular: CW networks. In Advances in Neural
Information Processing Systems, volume 34, 2021a.
Cristian Bodnar, Fabrizio Frasca, Yuguang Wang, Nina Otter, Guido F Montufar, Pietro Lio, and
Michael Bronstein. Weisfeiler and lehman go topological: Message passing simplicial networks.
In International Conference on Machine Learning, pp. 1026–1037. PMLR, 2021b.
Béla Bollobás. Modern graph theory, volume 184. Springer Science & Business Media, 1998.
Giorgos Bouritsas, Fabrizio Frasca, Stefanos P Zafeiriou, and Michael Bronstein. Improving graph
neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2022.
Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint
arXiv:1711.07553, 2017.
Andries E Brouwer, Arjeh M Cohen, Arjeh M Cohen, and Arnold Neumaier. Distance-regular
graphs. Springer (Berlin [ua]), 1989.
Jin-Yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables
for graph identification. Combinatorica, 12(4):389–410, 1992.
Ashok K Chandra, Prabhakar Raghavan, Walter L Ruzzo, Roman Smolensky, and Prasoon Tiwari.
The electrical resistance of a graph captures its commute and cover times. computational com-
plexity, 6(4):312–340, 1996.
Zhengdao Chen, Soledad Villar, Lei Chen, and Joan Bruna. On the equivalence between graph
isomorphism testing and function approximation with gnns. Advances in neural information
processing systems, 32, 2019.
Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. Can graph neural networks count sub-
structures? In Proceedings of the 34th International Conference on Neural Information Process-
ing Systems, pp. 10383–10395, 2020.
Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal
neighbourhood aggregation for graph nets. In Advances in Neural Information Processing Sys-
tems, volume 33, pp. 13260–13271, 2020.
Leonardo Cotta, Christopher Morris, and Bruno Ribeiro. Reconstruction for powerful graph repre-
sentations. In Advances in Neural Information Processing Systems, volume 34, pp. 1713–1726,
2021.
Pim de Haan, Taco Cohen, and Max Welling. Natural graph networks. In Proceedings of the 34th
International Conference on Neural Information Processing Systems, volume 33, pp. 3636–3646,
2020.
Peter G Doyle and J Laurie Snell. Random walks and electric networks, volume 22. American
Mathematical Soc., 1984.
Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs.
AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021.
Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson.
Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
Or Feldman, Amit Boyarski, Shai Feldman, Dani Kogan, Avi Mendelson, and Chaim Baskin. We-
isfeiler and leman go infinite: Spectral and combinatorial pre-colorings. In ICLR 2022 Workshop
on Geometrical and Topological Representation Learning, 2022.
Jiarui Feng, Yixin Chen, Fuhai Li, Anindya Sarkar, and Muhan Zhang. How powerful are k-hop
message passing graph neural networks. arXiv preprint arXiv:2205.13328, 2022.
Robert W Floyd. Algorithm 97: shortest path. Communications of the ACM, 5(6):345, 1962.
Fabrizio Frasca, Beatrice Bevilacqua, Michael Bronstein, and Haggai Maron. Understanding and
extending subgraph gnns by rethinking their symmetries. arXiv preprint arXiv:2206.11140, 2022.
Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits
of graph neural networks. In International Conference on Machine Learning, pp. 3419–3430.
PMLR, 2020.
Floris Geerts and Juan L Reutter. Expressiveness and approximation properties of graph neural
networks. In International Conference on Learning Representations, 2022.
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural
message passing for quantum chemistry. In International conference on machine learning, pp.
1263–1272. PMLR, 2017.
Frieda Granot and Arthur F Veinott Jr. Substitutes, complements and ripples in network flows.
Mathematics of Operations Research, 10(3):471–497, 1985.
Ivan Gutman and W Xiao. Generalized inverse of the laplacian matrix and some applications. Bul-
letin (Académie serbe des sciences et des arts. Classe des sciences mathématiques et naturelles.
Sciences mathématiques), pp. 15–23, 2004.
William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large
graphs. In Proceedings of the 31st International Conference on Neural Information Processing
Systems, volume 30, pp. 1025–1035, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
770–778, 2016.
John E Hopcroft and Robert Endre Tarjan. Isomorphism of planar graphs. In Complexity of computer
computations, pp. 131–152. Springer, 1972.
Max Horn, Edward De Brouwer, Michael Moor, Yves Moreau, Bastian Rieck, and Karsten Borg-
wardt. Topological graph neural networks. In International Conference on Learning Representa-
tions, 2022.
Yinan Huang, Xingang Peng, Jianzhu Ma, and Muhan Zhang. Boosting the cycle counting power
of graph neural networks with I²-GNNs. In International Conference on Learning Represen-
tations, 2023.
Neil Immerman and Eric Lander. Describing graphs: A first-order approach to graph canonization.
In Complexity theory retrospective, pp. 59–81. Springer, 1990.
Sanjiv Kapoor and Hariharan Ramesh. Algorithms for enumerating all spanning trees of undirected
and weighted graphs. SIAM Journal on Computing, 24(2):247–265, 1995.
Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. In
Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.
7092–7101, 2019.
Sandra Kiefer. Power and limits of the Weisfeiler-Leman algorithm. PhD thesis, Dissertation, RWTH
Aachen University, 2020.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional net-
works. In International Conference on Learning Representations, 2017.
Douglas J Klein and Milan Randić. Resistance distance. Journal of mathematical chemistry, 12(1):
81–95, 1993.
Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Re-
thinking graph transformers with spectral attention. In Advances in Neural Information Process-
ing Systems, volume 34, 2021.
Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. Distance encoding: design provably
more powerful neural networks for graph representation learning. In Proceedings of the 34th
International Conference on Neural Information Processing Systems, pp. 4465–4478, 2020.
Derek Lim, Joshua Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, and Stefanie
Jegelka. Sign and basis invariant networks for spectral graph representation learning. arXiv
preprint arXiv:2202.13013, 2022.
Andreas Loukas. What graph neural networks cannot learn: depth vs width. In International Con-
ference on Learning Representations, 2020.
Shengjie Luo, Tianlang Chen, Yixian Xu, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, and Di He.
One transformer can understand both 2d & 3d molecular data. arXiv preprint arXiv:2210.01765,
2022a.
Shengjie Luo, Shanda Li, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, and Di He. Your transformer
may not be as powerful as you expect. arXiv preprint arXiv:2205.13401, 2022b.
Haggai Maron, Heli Ben-Hamu, Hadar Serviansky, and Yaron Lipman. Provably powerful graph
networks. In Advances in neural information processing systems, volume 32, pp. 2156–2167,
2019a.
Haggai Maron, Heli Ben-Hamu, Nadav Shamir, and Yaron Lipman. Invariant and equivariant graph
networks. In International Conference on Learning Representations, 2019b.
Haggai Maron, Ethan Fetaya, Nimrod Segol, and Yaron Lipman. On the universality of invariant
networks. In International conference on machine learning, pp. 4363–4371. PMLR, 2019c.
Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M
Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5115–5124,
2017.
Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav
Rattan, and Martin Grohe. Weisfeiler and leman go neural: Higher-order graph neural networks.
In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp. 4602–4609, 2019.
Christopher Morris, Gaurav Rattan, and Petra Mutzel. Weisfeiler and leman go sparse: towards
scalable higher-order graph embeddings. In Proceedings of the 34th International Conference on
Neural Information Processing Systems, pp. 21824–21840, 2020.
Christopher Morris, Yaron Lipman, Haggai Maron, Bastian Rieck, Nils M Kriege, Martin Grohe,
Matthias Fey, and Karsten Borgwardt. Weisfeiler and leman go machine learning: The story so
far. arXiv preprint arXiv:2112.09992, 2021.
Christopher Morris, Gaurav Rattan, Sandra Kiefer, and Siamak Ravanbakhsh. Speqnets: Sparsity-
aware permutation-equivariant graph networks. In International Conference on Machine Learn-
ing, pp. 16017–16042. PMLR, 2022.
Ryan Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Relational pooling
for graph representations. In International Conference on Machine Learning, pp. 4663–4673.
PMLR, 2019.
Pál András Papp and Roger Wattenhofer. A theoretical comparison of graph neural network exten-
sions. arXiv preprint arXiv:2201.12884, 2022.
Pál András Papp, Karolis Martinkus, Lukas Faber, and Roger Wattenhofer. Dropgnn: random
dropouts increase the expressiveness of graph neural networks. In Advances in Neural Infor-
mation Processing Systems, volume 34, pp. 21997–22009, 2021.
Chendi Qian, Gaurav Rattan, Floris Geerts, Christopher Morris, and Mathias Niepert. Ordered
subgraph aggregation networks. arXiv preprint arXiv:2206.11168, 2022.
Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. struc2vec: Learning node
representations from structural identity. In Proceedings of the 23rd ACM SIGKDD international
conference on knowledge discovery and data mining, pp. 385–394, 2017.
Enrique Fita Sanmartín, Sebastian Damrich, and Fred Hamprecht. The algebraic path problem for
graph metrics. In International Conference on Machine Learning, pp. 19178–19204. PMLR,
2022.
Ryoma Sato. A survey on the expressive power of graph neural networks. arXiv preprint
arXiv:2003.04078, 2020.
Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Approximation ratios of graph neural networks
for combinatorial problems. In Proceedings of the 33rd International Conference on Neural
Information Processing Systems, pp. 4081–4090, 2019.
Ryoma Sato, Makoto Yamada, and Hisashi Kashima. Random features strengthen graph neural
networks. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM),
pp. 333–341. SIAM, 2021.
Bernhard Scholkopf, Kah-Kay Sung, Christopher JC Burges, Federico Girosi, Partha Niyogi,
Tomaso Poggio, and Vladimir Vapnik. Comparing support vector machines with gaussian kernels
to radial basis function classifiers. IEEE transactions on Signal Processing, 45(11):2758–2765,
1997.
Yu Shi, Shuxin Zheng, Guolin Ke, Yifei Shen, Jiacheng You, Jiyan He, Shengjie Luo, Chang Liu,
Di He, and Tie-Yan Liu. Benchmarking graphormer on large-scale molecular modeling datasets.
arXiv preprint arXiv:2203.04810, 2022.
Rajat Talak, Siyi Hu, Lisa Peng, and Luca Carlone. Neural trees for learning on graphs. In Advances
in Neural Information Processing Systems, volume 34, pp. 26395–26408, 2021.
Robert Tarjan. Depth-first search and linear graph algorithms. SIAM journal on computing, 1(2):
146–160, 1972.
Erik Thiede, Wenda Zhou, and Risi Kondor. Autobahn: Automorphism-based graph neural nets. In
Advances in Neural Information Processing Systems, volume 34, pp. 29922–29934, 2021.
Jan Toenshoff, Martin Ritzert, Hinrikus Wolf, and Martin Grohe. Graph learning with 1d convolu-
tions on random walks. arXiv preprint arXiv:2102.08786, 2021.
Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M.
Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In Interna-
tional Conference on Learning Representations, 2022.
Edwin R van Dam, Jack H Koolen, and Hajime Tanaka. Distance-regular graphs. arXiv preprint
arXiv:1410.6294, 2014.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, volume 30, 2017.
Petar Veličković. Message passing all the way up. arXiv preprint arXiv:2202.11097, 2022.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
Bengio. Graph attention networks. In International Conference on Learning Representations,
2018.
Ameya Velingker, Ali Kemal Sinop, Ira Ktena, Petar Veličković, and Sreenivas Gollapudi. Affinity-
aware graph networks. arXiv preprint arXiv:2206.11941, 2022.
Clément Vignac, Andreas Loukas, and Pascal Frossard. Building powerful and equivariant graph
neural networks with structural message-passing. In Proceedings of the 34th International Con-
ference on Neural Information Processing Systems, pp. 14143–14155, 2020.
Boris Weisfeiler and Andrei Leman. The reduction of a graph to canonical form and the algebra
which appears therein. NTI, Series, 2(9):12–16, 1968.
Asiri Wijesinghe and Qing Wang. A new perspective on “how graph neural networks go beyond
weisfeiler-lehman?”. In International Conference on Learning Representations, 2022.
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
networks? In International Conference on Learning Representations, 2019.
Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and
Tie-Yan Liu. Do transformers really perform badly for graph representation? Advances in Neural
Information Processing Systems, 34, 2021a.
Chengxuan Ying, Mingqi Yang, Shuxin Zheng, Guolin Ke, Shengjie Luo, Tianle Cai, Chenglin Wu,
Yuxin Wang, Yanming Shen, and Di He. First place solution of kdd cup 2021 ogb large-scale
challenge graph-level track. arXiv preprint arXiv:2106.08279, 2021b.
Jiaxuan You, Jonathan M Gomes-Selman, Rex Ying, and Jure Leskovec. Identity-aware graph
neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35,
pp. 10737–10745, 2021.
Raphael Yuster and Uri Zwick. Finding even cycles even faster. SIAM Journal on Discrete Mathe-
matics, 10(2):209–222, 1997.
Muhan Zhang and Pan Li. Nested graph neural networks. In Advances in Neural Information
Processing Systems, volume 34, pp. 15734–15747, 2021.
Lingxiao Zhao, Wei Jin, Leman Akoglu, and Neil Shah. From stars to subgraphs: Uplifting any gnn
with local structure awareness. In International Conference on Learning Representations, 2022.
Appendix
Table of Contents
A Recent advances in expressive GNNs
B The Weisfeiler-Lehman algorithms and recently proposed variants
C Proof of Theorems
  C.1 Properties of color refinement algorithms
  C.2 Counterexamples
  C.3 Proof of Theorem 3.2
  C.4 Proof of Theorem 4.1
  C.5 Proof of Theorem 4.2
  C.6 Proof of Theorem 4.5
  C.7 Proof of Theorem 4.7
F Experimental Details
  F.1 Synthetic Tasks
  F.2 Real-world Tasks
  F.3 More Tasks
  F.4 Efficiency Evaluation
A RECENT ADVANCES IN EXPRESSIVE GNNS
Since the seminal works of Xu et al. (2019); Morris et al. (2019), extensive studies have been devoted
to developing new GNN architectures with better expressiveness beyond the 1-WL test. These works
can be broadly classified into the following categories.
Higher-order GNNs. One straightforward way to design provably more expressive GNNs is in-
spired by the higher-order WL tests (see Appendix B.2). Instead of performing node feature ag-
gregation, these higher-order GNNs calculate a feature vector for each k-tuple of nodes (k ≥ 2)
and perform aggregation between features of different tuples using tensor operations (Morris et al.,
2019; Maron et al., 2019b;c;a; Keriven & Peyré, 2019; Azizian & Lelarge, 2021; Geerts & Reutter,
2022). In particular, Maron et al. (2019a) leveraged equivariant matrix multiplication to design net-
work layers that mimic the 2-FWL aggregation. Due to the huge computational cost of higher-order
GNNs, several recent works considered improving efficiency by leveraging the sparse and local na-
ture of graphs and designing a “local” version of the k-WL aggregation, which comes at the cost of
some expressiveness (Morris et al., 2020; 2022). The work of Vignac et al. (2020) can also be seen
as a local 2-order GNN and its expressive power is bounded by 3-IGN (Maron et al., 2019c).
Substructure-based GNNs. Another way to design more expressive GNNs is inspired by studying
the failure cases of the 1-WL test. In particular, Chen et al. (2020) pointed out that standard MPNNs
cannot detect/count common substructures such as cycles, cliques, and paths. Based on this finding,
Bouritsas et al. (2022) designed the Graph Substructure Network (GSN) by incorporating substruc-
ture counting into node features using a preprocessing step. Such an approach was later extended
by Barceló et al. (2021) based on homomorphism counting. Bodnar et al. (2021b;a); Thiede et al.
(2021); Horn et al. (2022) further developed novel WL aggregation schemes that take into account
these substructures (e.g., cycles or cliques). Toenshoff et al. (2021) considered using random walk
techniques to generate small substructures.
Subgraph GNNs. In fact, the graphs indistinguishable by 1-WL tend to possess a high degree of
symmetry (e.g., see Figure 2). Based on this observation, a variety of recent approaches sought
to break the symmetry by feeding subgraphs into an MPNN. To maintain equivariance, a set of
subgraphs is generated symmetrically from the original graph using predefined policies, and the final
output is aggregated across all subgraphs. There have been several subgraph generation policies in
prior works, such as node deletion (Cotta et al., 2021), edge deletion (Bevilacqua et al., 2022), node
marking (Papp & Wattenhofer, 2022), and ego-networks (Zhao et al., 2022; Zhang & Li, 2021; You
et al., 2021). These works also slightly differ in the aggregation schemes. In particular, Bevilacqua
et al. (2022) developed a unified framework, called ESAN, which includes per-layer aggregation
across subgraphs and thus enjoys better expressiveness. Very recently, Frasca et al. (2022) further
extended the framework based on a more relaxed symmetry analysis and proved an upper bound of
its expressiveness to be 3-WL. Qian et al. (2022) provided a theoretical analysis of how subgraph
GNNs relate to k-FWL and also designed an approach to learn policies.
Non-equivariant GNNs. Perhaps one of the simplest ways to break the intrinsic symmetry of 1-WL
aggregation is to use non-equivariant GNNs. Indeed, Loukas (2020) proved that if each node in a
GNN is equipped with a unique identifier, then standard MPNNs are already Turing universal.
Several works exploit this idea to build powerful GNNs, e.g., using port numbering (Sato et al.,
2019), relational pooling (Murphy et al., 2019), random features (Sato et al., 2021; Abboud et al.,
2021), or dropout techniques (Papp et al., 2021). However, since the resulting architectures cannot
fully preserve equivariance, the sample complexity required for training may be large and
generalization is not guaranteed (Garg et al., 2020). Therefore, in this paper we only focus on
analyzing and designing equivariant GNNs.
Other approaches. Wijesinghe & Wang (2022); de Haan et al. (2020) designed novel variants of
MPNNs based on more powerful neighborhood aggregation schemes that are aware of the local
graph structure, rather than simply treating neighboring nodes as a set. Li et al. (2020); Velingker
et al. (2022) incorporated distance encoding into node/edge features to enhance the expressive power
of MPNNs. Balcilar et al. (2021); Feldman et al. (2022) utilized spectral information of graphs to
achieve better expressiveness beyond 1-WL. Talak et al. (2021) proposed the Neural Tree Network
that performs message passing between higher-order subgraphs instead of node-level aggregation.
Finally, for a comprehensive survey on expressive GNNs, we refer readers to Sato (2020) and Morris
et al. (2021).
B THE WEISFEILER-LEHMAN ALGORITHMS AND RECENTLY PROPOSED VARIANTS
In this section, we give a precise description of the family of Weisfeiler-Lehman algorithms and
several recently proposed variants that are studied in this paper. We first present the classic 1-WL
algorithm (Weisfeiler & Leman, 1968) and the more advanced k-FWL (Cai et al., 1992; Morris et al.,
2019). We then present several recently proposed WL variants, including WL with Substructure
Counting (SC-WL) (Bouritsas et al., 2022), Overlap Subgraph WL (OS-WL) (Wijesinghe & Wang,
2022), Equivariant Subgraph Aggregation WL (DSS-WL) (Bevilacqua et al., 2022), and Generalized
Distance WL (GD-WL).
Throughout this section, we assume hash : X → C is an injective hash function that maps “arbitrary
objects” to a color in C, where C is an abstract set called the color set. Formally, the domain X
comprises all the objects we are interested in:
• R ⊂ X and C ⊂ X;
• For any finite multiset M with elements in X, M ∈ X;
• For any tuple c ∈ X^k of finite dimension k ∈ N+, c ∈ X.
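As a concrete illustration (a minimal sketch of ours, not part of the paper's formalism), such an injective hash can be realized in Python by canonically encoding the three kinds of objects above; representing finite multisets by collections.Counter is an assumption of this sketch.

```python
from collections import Counter

def encode(obj):
    """Canonically encode numbers, strings, tuples, and multisets (Counters)
    as hashable values, so that equal objects always receive equal codes."""
    if isinstance(obj, (int, float, str)):
        return obj
    if isinstance(obj, Counter):   # a finite multiset
        return ("ms",) + tuple(sorted((encode(x) for x in obj.elements()), key=repr))
    if isinstance(obj, tuple):     # a finite-dimensional tuple
        return ("tp",) + tuple(encode(x) for x in obj)
    raise TypeError(f"unsupported object: {obj!r}")

# Equal multisets receive the same code regardless of element order.
assert encode(Counter(["b", "a", "a"])) == encode(Counter(["a", "a", "b"]))
```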
B.1 1-WL TEST

Given a graph G = (V, E), the 1-dimensional Weisfeiler-Lehman algorithm (1-WL), also called the
color refinement algorithm, iteratively calculates a color mapping χ_G from each vertex v ∈ V to a
color χ_G(v) ∈ C. The pseudo code of 1-WL is presented in Algorithm 1. Intuitively, at the beginning
the color of each vertex is initialized to be the same. Then, in each iteration, the 1-WL algorithm
updates each vertex color by combining its own color with the neighborhood color multiset using a
hash function. This procedure is repeated for a sufficiently large number of iterations T, e.g. T = |V|.
Algorithm 1: 1-WL (Color Refinement) Algorithm
Input : Graph G = (V, E), the number of iterations T
Output: Color mapping χ_G : V → C
1 Initialize: χ^0_G(v) := c_0 for each v ∈ V, where c_0 ∈ C is a fixed color
2 for t ← 1 to T do
3   for each v ∈ V do
4     χ^t_G(v) := hash(χ^{t−1}_G(v), {{χ^{t−1}_G(u) : u ∈ N_G(v)}})
5 Return: χ^T_G
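To make Algorithm 1 concrete, here is a minimal Python sketch; representing colors as integers and realizing the injective hash by interning signatures into fresh integers are implementation choices of this sketch, not part of the algorithm.

```python
def wl_1(adj, T=None):
    """1-WL (Algorithm 1). `adj` maps each vertex to an iterable of neighbors."""
    T = len(adj) if T is None else T
    color = {v: 0 for v in adj}              # every vertex starts with the same color
    table = {}                               # interns signatures as fresh integer colors
    for _ in range(T):
        new = {}
        for v in adj:
            # signature = own color + multiset of neighbor colors (as a sorted tuple)
            sig = (color[v], tuple(sorted(color[u] for u in adj[v])))
            new[v] = table.setdefault(sig, len(table))
        if len(set(new.values())) == len(set(color.values())):
            break                            # partition has stabilized
        color = new
    return color

# The multiset {{chi_G(v) : v in V}} serves as the graph representation.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}   # a triangle plus an edge
print(sorted(wl_1(adj).values()))
```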
At each iteration, the color mapping χ^t_G induces a partition of the vertex set V with an equivalence
relation ∼_{χ^t_G}, defined by u ∼_{χ^t_G} v ⟺ χ^t_G(u) = χ^t_G(v) for u, v ∈ V. We call each
equivalence class a color class with an associated color c ∈ C, denoted as (χ^t_G)^{-1}(c) := {v ∈ V :
χ^t_G(v) = c}. The corresponding partition is then denoted as P^t_G = {(χ^t_G)^{-1}(c) : c ∈ C^t_G},
where C^t_G := {χ^t_G(v) : v ∈ V} is the color set containing all the colors present among vertices of G.

An important observation is that each 1-WL iteration refines the partition P^t_G into a finer partition
P^{t+1}_G, because for any u, v ∈ V, u ∼_{χ^{t+1}_G} v implies u ∼_{χ^t_G} v. Since the number of
vertices |V| is finite, there must exist an iteration T_stable < |V| such that P^{T_stable}_G =
P^{T_stable+1}_G. It follows that P^t_G = P^{T_stable}_G for all t ≥ T_stable, i.e. the partition
stabilizes. We thus denote P_G := P^{T_stable}_G as the stable partition induced by the 1-WL
algorithm, and denote by χ_G any stable color mapping (i.e. any χ^t_G with t ≥ T_stable). We can
similarly define the inverse mapping χ_G^{-1}. The mapping χ_G serves as a node feature extractor,
so that χ_G(v) is the representation of node v ∈ V. Correspondingly, the multiset {{χ_G(v) : v ∈ V}}
can serve as the representation of graph G.
The 1-WL algorithm can be used to test whether two graphs G and H are isomorphic, by comparing
their graph representations {{χ_G(v) : v ∈ V}} and {{χ_H(v) : v ∈ V}}. If the two multisets are not
equal, then G and H are clearly non-isomorphic; thus matching 1-WL representations is a necessary
condition for graph isomorphism. Nevertheless, the 1-WL test fails when {{χ_G(v) : v ∈ V}} =
{{χ_H(v) : v ∈ V}} but G and H are still non-isomorphic (see Figure 2 for a counterexample). This
motivates the more powerful higher-order WL tests, which are described in the next subsection.
B.2 k-FWL TEST
In this section, we present a family of algorithms called the k-dimensional Folklore Weisfeiler-
Lehman algorithms (k-FWL). Instead of calculating a node color mapping, k-FWL computes a color
mapping on each k-tuple of nodes. The pseudo code of k-FWL (k ≥ 2) is presented in Algorithm 2.
Intuitively, at the beginning, the color of each vertex tuple v encodes the full structure (i.e. the
isomorphism type) of the subgraph induced by the ordered vertex set {v_i : i ∈ [k]}, by hashing the
“adjacency” matrix A_v defined in (6). Then, in each iteration, the k-FWL algorithm updates the
color of each vertex tuple by combining its own color with the “neighborhood” colors using a hash
function. Here, the neighborhood of a tuple v consists of all tuples that differ from v in exactly one
element. These k × |V| neighborhood colors are grouped into a multiset of size |V| where each
element is a k-tuple. Finally, the update procedure is repeated for a sufficiently large number of
iterations T, e.g. T = |V|^k.

Algorithm 2: k-FWL Algorithm
Input : Graph G = (V, E), the number of iterations T
Output: Color mapping χ_G : V^k → C
1 Initialize: χ^0_G(v) := hash(A_v) for each v ∈ V^k
2 for t ← 1 to T do
3   for each v ∈ V^k do
4     χ^t_G(v) := hash(χ^{t−1}_G(v), {{(χ^{t−1}_G(v[1→w]), · · · , χ^{t−1}_G(v[k→w])) : w ∈ V}}), where v[i→w] denotes the tuple v with its i-th element replaced by w
5 Return: χ^T_G
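For concreteness, the following Python sketch instantiates the procedure for k = 2 (2-FWL); the initial color of each ordered pair encodes its isomorphism type (equal, adjacent, or non-adjacent), playing the role of hashing A_v, and the injective hash is again realized by interning signatures, an implementation choice of this sketch.

```python
from itertools import product

def fwl_2(adj, T=None):
    """2-FWL: color refinement on ordered vertex pairs (u, v)."""
    V = list(adj)
    T = len(V) ** 2 if T is None else T
    def iso(u, v):                       # isomorphism type of the ordered pair
        return 0 if u == v else (1 if v in adj[u] else 2)
    color = {(u, v): iso(u, v) for u, v in product(V, V)}
    table = {}
    for _ in range(T):
        new = {}
        for u, v in product(V, V):
            # neighborhood of (u, v): for each w, the pair of colors of the two
            # tuples obtained by substituting w into each coordinate
            nbr = tuple(sorted((color[(w, v)], color[(u, w)]) for w in V))
            new[(u, v)] = table.setdefault((color[(u, v)], nbr), len(table))
        if len(set(new.values())) == len(set(color.values())):
            break                        # partition of pairs has stabilized
        color = new
    return color

# Vertex colors are extracted from diagonal pairs: chi_G(v) := chi_G(v, v).
```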
Similar to 1-WL, the k-FWL color mapping χ^t_G induces a partition of the set of vertex k-tuples
V^k, and each k-FWL iteration refines the partition of the previous iteration. Since the number of
vertex k-tuples |V|^k is finite, there must exist an iteration T_stable < |V|^k such that the partition
no longer changes for t ≥ T_stable. We denote the stable color mapping as χ_G, picking any χ^t_G
with t ≥ T_stable.
The k-FWL algorithm can be used to distinguish whether two graphs G and H are isomorphic, by
comparing their graph representations {{χ_G(v) : v ∈ V^k}} and {{χ_H(v) : v ∈ V^k}}. It has been
proved that k-FWL is strictly more powerful than 1-WL in distinguishing non-isomorphic graphs,
and (k + 1)-FWL is strictly more powerful than k-FWL for all k ≥ 2 (Cai et al., 1992).
Moreover, the k-FWL algorithm can also be used to extract node representations, as with 1-WL. To
do this, we can simply define χ_G(v) := χ_G(v, · · · , v) as the vertex color of the k-FWL algorithm
(without abuse of notation), which induces a partition P_G over the vertex set V. It has been shown
that this partition is finer than the partition induced by 1-WL, and that the vertex partition induced
by (k + 1)-FWL is finer than that of k-FWL (Kiefer, 2020).
B.3 SC-WL TEST

Recently, Bouritsas et al. (2022) proposed a variant of the 1-WL algorithm by incorporating so-called
substructure counting into the WL aggregation procedure. This yields an algorithm that is provably
more powerful than the original 1-WL test.
To describe the algorithm, we first need the notion of the automorphism group. Given a graph H =
(V_H, E_H), an automorphism of H is a bijective mapping f : V_H → V_H such that for any two
vertices u, v ∈ V_H, {u, v} ∈ E_H ⟺ {f(u), f(v)} ∈ E_H. All automorphisms of H form a group
under function composition, called the automorphism group and denoted Aut(H).
The automorphism group Aut(H) yields a partition of the vertex set V_H, called orbits. Formally,
given a vertex v ∈ V_H, define its orbit Orb_H(v) = {u ∈ V_H : ∃f ∈ Aut(H), f(u) = v}. The
set of all orbits H\Aut(H) := {Orb_H(v) : v ∈ V_H} is called the quotient of the automorphism.
Denote d_H = |H\Aut(H)| and denote the elements of H\Aut(H) as {O^V_{H,i}}_{i=1}^{d_H}. We are
now ready to describe the procedure of SC-WL.
Pre-processing. Depending on the task, one first specifies a set of (small) connected graphs H =
{H_1, · · · , H_k}, which will be used for substructure counting in the input graph G. Popular choices
of these small graphs are cycles of different lengths (e.g., triangles or squares) and cliques. Given a
graph G = (V_G, E_G), for each vertex v ∈ V_G and each graph H ∈ H, the following quantities are
calculated:
x^V_{H,i}(v) := |{G[S] : S ⊆ V_G, G[S] ≃ H, v ∈ S, f_{G[S]→V_H}(v) ∈ O^V_{H,i}}|, i ∈ [d_H], (7)

where f_{G[S]→V_H} is any isomorphism that maps the vertices of graph G[S] to those of graph H.
Intuitively, x^V_{H,i}(v) counts the number of induced subgraphs of G that are isomorphic to H,
contain node v, and in which the orbit of v corresponds to the orbit O^V_{H,i}. The counts
corresponding to different orbits O^V_{H,i} and different graphs H are finally combined and
concatenated into a vector:

x^V(v) = [x^V_{H_1}(v)^⊤, · · · , x^V_{H_k}(v)^⊤]^⊤ ∈ N^D_+, (8)

where the dimension of x^V(v) is D = Σ_{i∈[k]} d_{H_i}.
Message Passing. The message passing procedure is similar to Algorithm 1, except that the aggre-
gation formula (Line 4) is replaced by the following update rule:

χ^t_G(v) := hash(χ^{t−1}_G(v), x^V(v), {{(χ^{t−1}_G(u), x^V(u)) : u ∈ N_G(v)}}), (9)

which incorporates the substructure counts (7, 8). Note that the update rule (9) is slightly simpler
than that of the original paper (Bouritsas et al., 2022, Section 3.2), but the expressive power of the
two formulations is the same.
Finally, we note that the above procedure counts substructures and calculates features x^V for each
vertex of G. One can similarly calculate substructure counts for each edge of G, and the conclusion
in this paper (Theorem 3.1) still holds. Please refer to Bouritsas et al. (2022) for more details on
how to calculate edge features.
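As a simplified illustration of the above pipeline, the following Python sketch instantiates SC-WL with H = {triangle}; since a triangle has a single vertex orbit, d_H = 1 and x^V(v) reduces to the number of triangles containing v. The function names are ours and purely illustrative.

```python
from itertools import combinations

def triangle_counts(adj):
    """x^V(v) for H = triangle: the number of triangles containing each vertex v."""
    x = {v: 0 for v in adj}
    for u, v, w in combinations(adj, 3):
        if v in adj[u] and w in adj[u] and w in adj[v]:
            for node in (u, v, w):
                x[node] += 1
    return x

def sc_wl(adj, T=None):
    """1-WL where every color update also sees the substructure counts, as in (9)."""
    x = triangle_counts(adj)
    T = len(adj) if T is None else T
    color = {v: 0 for v in adj}
    table = {}
    for _ in range(T):
        new = {}
        for v in adj:
            sig = (color[v], x[v], tuple(sorted((color[u], x[u]) for u in adj[v])))
            new[v] = table.setdefault(sig, len(table))
        if len(set(new.values())) == len(set(color.values())):
            break
        color = new
    return color
```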
B.4 DSS-WL TEST

Recently, Bevilacqua et al. (2022) developed a new type of graph neural network, called Equivariant
Subgraph Aggregation Networks, as well as a new WL variant named DSS-WL. Given a graph
G = (V, E), DSS-WL first generates a bag of graphs B^π_G = {{G_1, · · · , G_m}} which share the
vertices, i.e. G_i = (V, E_i), but differ in the edge sets E_i. Here, π denotes the graph generation
policy which determines the edge set E_i of each graph G_i. The initial coloring χ^0_{G_i}(v) for each
node v ∈ V in graph G_i is also determined by π and can differ across nodes and graphs. In each
iteration, the algorithm refines the color of each node by jointly aggregating its neighboring colors
within its own graph and across different graphs. This procedure is repeated for a sufficiently large
number of iterations T to obtain the stable color mappings χ_{G_i} and χ_G. The pseudo code of
DSS-WL is presented in Algorithm 3.
The key component of the DSS-WL algorithm is the graph generation policy π, which must maintain
symmetry, i.e., be equivariant under permutations of the vertex set. We list several common choices
below:
• Node marking policy π = π_NM. In this policy, we have B^π_G = {{G_v : v ∈ V}} where
G_v = G, i.e., there are |V| graphs in B^π_G whose structures are completely the same.
The difference, however, lies in the initial coloring, which marks the special node v in the
following way: χ^0_{G_v}(v) = c_1 and χ^0_{G_v}(u) = c_0 for all other nodes u ≠ v, where
c_0, c_1 ∈ C are two different colors.

• Node deletion policy π = π_ND. The bag of graphs for this policy is also defined as B^π_G =
{{G_v : v ∈ V}}, but each graph G_v = (V, E_v) has a different edge set E_v := E\{{v, w} :
w ∈ N_G(v)}. Intuitively, it removes all edges that connect to node v and thus makes v an
isolated node. The initial coloring is chosen as a constant χ^0_{G_i}(v) = c_0 for all v ∈ V and
G_i ∈ B^π_G, for some fixed color c_0 ∈ C.
Algorithm 3: DSS Weisfeiler-Lehman Algorithm
Input : Graph G = (V, E), the number of iterations T, and graph selection policy π
Output: Color mapping χ_G : V → C
1 Initialize: Generate a bag of graphs B^π_G = {{G_i}}_{i=1}^m, G_i = (V, E_i), and initial colorings χ^0_{G_i} for i ∈ [m] according to policy π
2 Let χ^0_G(v) := hash({{χ^0_{G_i}(v) : i ∈ [m]}}) for each v ∈ V
3 for t ← 1 to T do
4   for each v ∈ V do
5     for i ← 1 to m do
6       χ^t_{G_i}(v) := hash(χ^{t−1}_{G_i}(v), {{χ^{t−1}_{G_i}(u) : u ∈ N_{G_i}(v)}}, χ^{t−1}_G(v), {{χ^{t−1}_G(u) : u ∈ N_G(v)}})
7     χ^t_G(v) := hash({{χ^t_{G_i}(v) : i ∈ [m]}})
8 Return: χ^T_G
• Ego network policy π = π_EGO(k). In this policy, we also have B^π_G = {{G_v : v ∈ V}} with
G_v = (V, E_v). The edge set E_v is defined as E_v := {{u, w} ∈ E : dis_G(u, v) ≤ k,
dis_G(w, v) ≤ k}, which corresponds to a subgraph containing all the k-hop neighbors of v
and isolating the other nodes. The initial coloring is chosen as χ^0_{G_i}(v) = c_0 for all v ∈ V
and G_i ∈ B^π_G, where c_0 ∈ C is a constant. One can also consider the ego network policy
with marking, π = π_EGOM(k), by marking the initial color of the special node v in each G_v.
We note that for all the above policies, |B^π_G| = |V|. There are other choices such as the edge deletion
policy (Bevilacqua et al., 2022), but we do not discuss them in this paper. A straightforward analysis
yields that DSS-WL with any of the above policies is strictly more powerful than the classic 1-WL
algorithm. Also, the node marking policy has been shown to be at least as powerful as the node
deletion policy (Papp & Wattenhofer, 2022).
Finally, we highlight that Bevilacqua et al. (2022); Cotta et al. (2021) also proposed a weaker version
of DSS-WL, called the DS-WL algorithm. The difference is that for DS-WL, Lines 6 and 7 in
Algorithm 3 are replaced by a simple 1-WL aggregation:

χ^t_{G_i}(v) := hash(χ^{t−1}_{G_i}(v), {{χ^{t−1}_{G_i}(u) : u ∈ N_{G_i}(v)}}). (10)

However, the original formulation of DS-WL (Bevilacqua et al., 2022) only outputs a graph repre-
sentation {{{{χ_{G_i}(v) : v ∈ V}} : G_i ∈ B^π_G}} rather than a color for each node, which does not
suit node-level tasks (e.g., finding cut vertices). Nevertheless, there are simple adaptations that make
DS-WL output a color mapping χ_G. We will study these adaptations in Appendix C.2 (see the
paragraph above Proposition C.16) and discuss their limitations compared with DSS-WL.
B.5 GD-WL TEST

In this paper, we study a new variant of the color refinement algorithm, called the Generalized Dis-
tance WL (GD-WL). The complete procedure is given in Algorithm 4 below, where d_G is a distance
function on graph G. As a special case, when choosing d_G = dis_G (the shortest path distance), the
resulting algorithm is called the Shortest Path Distance WL (SPD-WL), which is strictly more
powerful than the classic 1-WL.

Algorithm 4: Generalized Distance Weisfeiler-Lehman Algorithm (GD-WL)
Input : Graph G = (V, E), a distance function d_G, the number of iterations T
Output: Color mapping χ_G : V → C
1 Initialize: χ^0_G(v) := c_0 for each v ∈ V, where c_0 ∈ C is a fixed color
2 for t ← 1 to T do
3   for each v ∈ V do
4     χ^t_G(v) := hash({{(d_G(v, u), χ^{t−1}_G(u)) : u ∈ V}})
5 Return: χ^T_G
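For concreteness, the following Python sketch implements GD-WL instantiated with d_G = dis_G (i.e., SPD-WL), computing all-pairs shortest path distances by breadth-first search.

```python
from collections import deque

def bfs_dist(adj, s):
    """Shortest path distances from s (float('inf') for unreachable nodes)."""
    dist = {v: float("inf") for v in adj}
    dist[s] = 0
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if dist[w] == float("inf"):
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def spd_wl(adj, T=None):
    """GD-WL with d_G = dis_G: chi^t(v) = hash({{(d(v,u), chi^{t-1}(u)) : u in V}})."""
    V = list(adj)
    T = len(V) if T is None else T
    d = {v: bfs_dist(adj, v) for v in V}
    color = {v: 0 for v in V}
    table = {}
    for _ in range(T):
        new = {}
        for v in V:
            sig = tuple(sorted((d[v][u], color[u]) for u in V))
            new[v] = table.setdefault(sig, len(table))
        if len(set(new.values())) == len(set(color.values())):
            break
        color = new
    return color
```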
C PROOF OF THEOREMS
This section provides all the missing proofs in this paper. For ease of reading, we restate each
theorem before giving its proof.

C.1 PROPERTIES OF COLOR REFINEMENT ALGORITHMS

In this subsection, we first derive several important properties that are shared by a general class
of color refinement algorithms. They will serve as key lemmas in our subsequent proofs. Here,
a general color refinement algorithm takes a graph G = (V_G, E_G) as input and calculates a color
mapping χ_G : V_G → C. We first define a concept called the WL-condition.
Definition C.1. A color mapping χ_G : V_G → C is said to satisfy the WL-condition if for any two
vertices u, v with the same color (i.e. χ_G(u) = χ_G(v)) and any color c ∈ C,

|N_G(u) ∩ χ_G^{-1}(c)| = |N_G(v) ∩ χ_G^{-1}(c)|,

where χ_G^{-1} is the inverse mapping of χ_G.
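The WL-condition can be checked mechanically; below is a small illustrative Python sketch (the function name is ours), assuming adj maps each vertex to its neighbors and chi maps each vertex to its color.

```python
from collections import Counter

def satisfies_wl_condition(adj, chi):
    """Check Definition C.1: vertices with the same color must see every
    color c equally often in their neighborhoods."""
    seen = {}
    for v in adj:
        profile = Counter(chi[u] for u in adj[v])   # |N(v) ∩ chi^{-1}(c)| per color c
        if chi[v] in seen and seen[chi[v]] != profile:
            return False
        seen[chi[v]] = profile
    return True
```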
Remark C.2. The WL-condition can be further generalized to handle two graphs. Let χ_G : V_G → C
and χ_H : V_H → C be two color mappings obtained by applying the same color refinement algorithm
to graphs G and H, respectively. χ_G and χ_H are said to jointly satisfy the WL-condition if, for any
two vertices u ∈ V_G and v ∈ V_H with the same color (χ_G(u) = χ_H(v)) and any color c ∈ C,

|N_G(u) ∩ χ_G^{-1}(c)| = |N_H(v) ∩ χ_H^{-1}(c)|.
It is easy to see that the classic 1-WL algorithm (Algorithm 1) satisfies the WL-condition. In fact,
many of the algorithms presented in this paper satisfy such a condition, as we show below, including
DSS-WL (Algorithm 3), SPD-WL (Algorithm 4 with d_G = dis_G), and k-FWL (Algorithm 2).
Proposition C.3. Consider the DSS-WL algorithm (Algorithm 3) with an arbitrary graph selection
policy π. Let χ_G and χ_H be the color mappings for graphs G and H, and let {{χ_{G_i} : i ∈ [m_G]}}
and {{χ_{H_j} : j ∈ [m_H]}} be the color mappings for the subgraphs generated by π. Then,

• χ_G and χ_H jointly satisfy the WL-condition;
• χ_{G_i} and χ_{H_j} jointly satisfy the WL-condition for any i ∈ [m_G] and j ∈ [m_H].
Proof. We first prove the second bullet of Proposition C.3. By definition of the DSS-WL aggregation
procedure (Line 6 in Algorithm 3), χ_{G_i}(u) = χ_{H_j}(v) already implies {{χ_{G_i}(w) : w ∈
N_{G_i}(u)}} = {{χ_{H_j}(w) : w ∈ N_{H_j}(v)}}. Namely, |N_{G_i}(u) ∩ χ_{G_i}^{-1}(c)| =
|N_{H_j}(v) ∩ χ_{H_j}^{-1}(c)| holds for any c ∈ C.

We then turn to the first bullet. If χ_G(u) = χ_H(v), then {{χ_{G_i}(u) : i ∈ [m_G]}} = {{χ_{H_j}(v) :
j ∈ [m_H]}} (Line 7 in Algorithm 3). Then there exists a pair of indices i ∈ [m_G] and j ∈ [m_H]
such that χ_{G_i}(u) = χ_{H_j}(v). By definition of the DSS-WL aggregation, this implies {{χ_G(w) :
w ∈ N_G(u)}} = {{χ_H(w) : w ∈ N_H(v)}} and concludes the proof.
Proposition C.4. Let χ_G and χ_H be two mappings returned by SPD-WL (Algorithm 4 with d_G =
dis_G) for graphs G and H, respectively. Then χ_G and χ_H jointly satisfy the WL-condition.

Proof. If χ_G(u) = χ_H(v) for some nodes u, v, then by the update rule (Line 4 in Algorithm 4),

{{(dis_G(u, w), χ_G(w)) : w ∈ V}} = {{(dis_H(v, w), χ_H(w)) : w ∈ V}}.

Since w ∈ N_G(u) if and only if dis_G(u, w) = 1, we have

{{χ_G(w) : w ∈ N_G(u)}} = {{χ_H(w) : w ∈ N_H(v)}}.

Therefore, for any c ∈ C, |N_G(u) ∩ χ_G^{-1}(c)| = |N_H(v) ∩ χ_H^{-1}(c)|.
Proposition C.5. Let χG and χH be two vertex color mappings returned by the k-FWL algorithm
(k ≥ 2). Then χG and χH jointly satisfy the WL-condition.
Proof. Let χ_G(u) = χ_H(v) for some u ∈ V_G and v ∈ V_H. By the update formula (Line 4 in
Algorithm 2), {{χ_G(u, · · · , u, w) : w ∈ V_G}} = {{χ_H(v, · · · , v, w) : w ∈ V_H}}. Note that for any
nodes w_1 ∈ V_G, w_2 ∈ V_H and any x_1 ∈ N_G(w_1), x_2 ∉ N_H(w_2), one has χ_G(w_1, · · · , w_1, x_1) ≠
χ_H(w_2, · · · , w_2, x_2). This is obtained from the definition of the initialization mapping χ^0_G and the
fact that χ_G refines χ^0_G. Consequently, {{χ_G(u, · · · , u, w) : w ∈ N_G(u)}} = {{χ_H(v, · · · , v, w) :
w ∈ N_H(v)}}. Next, we can use the fact that if χ_G(u, · · · , u, w_1) = χ_H(v, · · · , v, w_2) for some
w_1 ∈ V_G, w_2 ∈ V_H, then χ_G(w_1) = χ_H(w_2) (see Lemma C.6). Therefore, {{χ_G(w) : w ∈ N_G(u)}} =
{{χ_H(w) : w ∈ N_H(v)}}, which concludes the proof.
To complete the proof of Proposition C.5, it remains to prove the following lemma:

Lemma C.6. Let χ_G and χ_H be color mappings for graphs G and H in the k-FWL algorithm
(k ≥ 2). Denote

cat_{i,j}(w, x) := (w, · · · , w, x, · · · , x),

where w is repeated i times and x is repeated j times (and cat_{i,1,j}(w, y, x) is defined analogously,
with y inserted between the two blocks). Then for any i ∈ [k − 1] and any nodes u, w ∈ V_G and
v, x ∈ V_H, χ_G(cat_{k−i,i}(u, w)) = χ_H(cat_{k−i,i}(v, x)) implies χ_G(cat_{k−i−1,i+1}(u, w)) =
χ_H(cat_{k−i−1,i+1}(v, x)). In particular, applying this repeatedly, χ_G(cat_{k−1,1}(u, w)) =
χ_H(cat_{k−1,1}(v, x)) implies χ_G(w) = χ_H(x).

Proof. By the update formula (Line 4 in Algorithm 2), χ_G(cat_{k−i,i}(u, w)) = χ_H(cat_{k−i,i}(v, x))
implies that {{χ_G(cat_{k−i−1,1,i}(u, y, w)) : y ∈ V_G}} = {{χ_H(cat_{k−i−1,1,i}(v, y, x)) : y ∈ V_H}}.
Note that for any j ∈ [k − 1] and any z ∈ V_G^k, z′ ∈ V_H^k with z_j = z_{j+1} but z′_j ≠ z′_{j+1}, one has
χ_G(z) ≠ χ_H(z′). This is obtained from the definition of the initialization mapping χ^0_G and the fact
that χ_G refines χ^0_G. Therefore, we have χ_G(cat_{k−i−1,i+1}(u, w)) = χ_H(cat_{k−i−1,i+1}(v, x)), as
desired.
Equipped with the concept of the WL-condition, we now present several key results. In the following,
let χ_G : V_G → C and χ_H : V_H → C be two color mappings jointly satisfying the WL-condition.

Lemma C.7. Let (v_0, · · · , v_d) be any path (not necessarily simple) of length d in graph G. Then
for any node u_0 ∈ χ_H^{-1}(χ_G(v_0)) in graph H, there exists a path (u_0, · · · , u_d) of the same length d
starting at u_0, such that χ_H(u_i) = χ_G(v_i) holds for all i ∈ [d].
Proof. The proof is based on induction over the path length d. For the base case of d = 1, if the
conclusion does not hold, then there exist two vertices u ∈ V_G, v ∈ V_H with the same color (i.e.
χ_G(u) = χ_H(v)) and a color c = χ_G(v_1) such that N_G(u) ∩ χ_G^{-1}(c) ≠ ∅ but N_H(v) ∩ χ_H^{-1}(c) = ∅.
This obviously contradicts the WL-condition. For the induction step on the path length d, one can
split the path into two parts (v_0, · · · , v_{d−1}) and (v_{d−1}, v_d). Separately applying the induction
hypothesis yields two paths (u_0, · · · , u_{d−1}) and (u_{d−1}, u_d) such that χ_H(u_i) = χ_G(v_i) for all
i ∈ [d]. Linking the two paths completes the proof.
Finally, let us define the shortest path distance between a node u and a vertex set S as dis_G(u, S) :=
min_{v∈S} dis_G(u, v). The above lemma directly yields the following corollary:

Corollary C.8. For any color c ∈ {χ_G(w) : w ∈ V_G} and any two vertices u ∈ V_G, v ∈ V_H with
the same color (i.e. χ_G(u) = χ_H(v)), dis_G(u, χ_G^{-1}(c)) = dis_H(v, χ_H^{-1}(c)).
C.2 COUNTEREXAMPLES

We provide the following two families of counterexamples, which most prior works cannot distin-
guish.
Example C.9. Let G_1 = (V, E_1) and G_2 = (V, E_2) be a pair of graphs with n = 2km + 1 nodes,
where m, k are two positive integers satisfying mk ≥ 3. Denote V = [n] and define the edge sets as
follows:

E_1 = {{i, (i mod 2km) + 1} : i ∈ [2km]} ∪ {{n, i} : i ∈ [2km], i mod m = 0},
E_2 = {{i, (i mod km) + 1} : i ∈ [km]} ∪ {{i + km, (i mod km) + km + 1} : i ∈ [km]} ∪
{{n, i} : i ∈ [2km], i mod m = 0}.
See Figure 2(a-c) for an illustration of three cases: (i) m = 2, k = 2; (ii) m = 4, k = 1; (iii)
m = 1, k = 4. It is easy to see that regardless of the choice of m and k, G_1 has no cut vertex,
while G_2 always has a cut vertex, namely node n. The case of k = 1 is special: there, G_2 actually
has three cut vertices, namely nodes m, 2m, and n, and it even has two cut edges {m, n} and
{2m, n} (Figure 2(b)).
Example C.10. Let G_1 = (V, E_1) and G_2 = (V, E_2) be a pair of graphs with n = 2m nodes, where
m ≥ 3 is an arbitrary integer. Denote V = [n] and define the edge sets as follows:

E_1 = {{i, (i mod n) + 1} : i ∈ [n]} ∪ {{m, 2m}},
E_2 = {{i, (i mod m) + 1} : i ∈ [m]} ∪ {{i + m, (i mod m) + m + 1} : i ∈ [m]} ∪ {{m, 2m}}.
See Figure 2(d) for an illustration of the case n = 8. It is easy to see that G_1 does not have any cut
vertex or cut edge, while G_2 has two cut vertices, namely nodes m and 2m, as well as a cut edge
{m, 2m}.
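Both families are easy to generate and check programmatically. The following Python sketch (ours, for illustration) constructs Example C.9 and verifies the claimed cut vertices by brute force, i.e., by deleting each vertex and testing connectivity; Example C.10 can be checked analogously.

```python
def example_c9(m, k):
    """Edge sets E1, E2 of Example C.9 on n = 2km + 1 nodes (vertices 1..n)."""
    n = 2 * k * m + 1
    E1 = {frozenset((i, i % (2 * k * m) + 1)) for i in range(1, 2 * k * m + 1)}
    E2 = {frozenset((i, i % (k * m) + 1)) for i in range(1, k * m + 1)}
    E2 |= {frozenset((i + k * m, i % (k * m) + k * m + 1)) for i in range(1, k * m + 1)}
    spokes = {frozenset((n, i)) for i in range(1, 2 * k * m + 1) if i % m == 0}
    return n, E1 | spokes, E2 | spokes

def cut_vertices(n, E):
    """Brute force: v is a cut vertex iff deleting it disconnects the rest."""
    def connected(nodes, edges):
        nodes = set(nodes)
        seen, stack = set(), [next(iter(nodes))]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack += [w for e in edges if u in e for w in e if w in nodes]
        return seen == nodes
    return [v for v in range(1, n + 1)
            if not connected(set(range(1, n + 1)) - {v},
                             {e for e in E if v not in e})]

n, E1, E2 = example_c9(m=2, k=2)
print(cut_vertices(n, E1))  # []  : G1 has no cut vertex
print(cut_vertices(n, E2))  # [9] : node n is a cut vertex of G2
```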
Theorem C.11. Let H = {H_1, · · · , H_k}, with H_i = (V_i, E_i), be any set of connected graphs, and
denote n_V = max_{i∈[k]} |V_i|. Then SC-WL (Appendix B.3) using the substructure set H can neither
distinguish whether a given graph has cut vertices nor distinguish whether it has cut edges. Moreover,
there exist counterexample graphs whose size (both in terms of vertices and edges) is O(n_V).
Proof. We prove that SC-WL can distinguish neither Example C.9 nor Example C.10 when n_V < m
(m is defined in these examples). First note that for both examples, any cycle in G_1 or G_2 has
length at least m. Since the number of nodes in each H_i is at most n_V, if H_i contains a cycle, it
occurs in neither G_1 nor G_2 and thus has no effect in distinguishing the two graphs. As a result,
we can simply assume all graphs in H are trees (connected graphs with no cycles). Below, we
provide a complete proof for Example C.9, which already yields the conclusion that SC-WL can
distinguish neither cut vertices nor cut edges. We omit the proof for Example C.10 since the proof
technique is similar.
Proof for Example C.9. Let H_i be a tree with fewer than m vertices, where m is defined in Exam-
ple C.9. By the symmetry of the two graphs G_1 and G_2, it suffices to prove the following two types
of equations: x^V_{G_1}(n) = x^V_{G_2}(n) and x^V_{G_1}(i) = x^V_{G_2}(i) for all m < i ≤ 2m, where x^V is
defined in (8).

We first aim to prove that x^V_{G_1}(v) = x^V_{G_2}(v) for v ∈ {m + 1, · · · , 2m}. Consider an induced
subgraph G_1[S] which is isomorphic to H_i and contains node v. Define the set T := {jm : j ∈ [2k]} ∩ S.
For ease of presentation, we define an operation cir(x, a, b) that outputs the integer y in the range
(a, b] such that y has the same remainder as x (mod b − a). Formally, cir(x, a, b) = y if a < y ≤ b
and x ≡ y (mod b − a).
• If n ∉ S, then it is easy to see that G_1[S] is a chain, i.e., no vertex has degree larger
than 2. We define the following mapping g_S : S → [n]:

g_S(u) = cir(u, m, 2m) if k = 1, and g_S(u) = cir(u, 0, km) if k ≥ 2.

In this way, the chain G_1[S] is mapped to a chain of G_2 that contains v. Concretely, denote
g_S(S) = {g_S(u) : u ∈ S}; then G_2[g_S(S)] ≃ G_1[S] ≃ H_i, and obviously the orbit of v in
G_2[g_S(S)] matches the orbit of v in G_1[S]. See Figure 4(a,b) for an illustration of this case.
• If n ∈ S, then it is easy to see that T ≠ ∅. We will similarly construct a mapping
g_S : S → [n] that maps S to g_S(S) and satisfies g_S(v) = v, defined as follows. For
each u ∈ S\{n}, we find the unique vertex w_u ∈ T such that dis_{G_1[S]}(u, w_u) is minimal.
The node w_u is well-defined since T ≠ ∅ and any path in G_1[S] from u to a node
in T goes through w_u. Define

g_S(u) =
  cir(u, m, 2m)    if k = 1 and w_u = w_v,
  cir(u, 0, m)     if k = 1 and w_u ≠ w_v,
  cir(u, 0, km)    if k > 1 and w_u ≤ km,
  cir(u, km, 2km)  if k > 1 and w_u > km.

We also define g_S(n) = n. This definition guarantees that for any x_1, x_2 ∈ S, {x_1, x_2} ∈
E_{G_1} ⟺ {g_S(x_1), g_S(x_2)} ∈ E_{G_2}. Therefore, G_2[g_S(S)] ≃ G_1[S] ≃ H_i. Moreover,
observe that g_S(u) ≡ u (mod m) always holds, and thus it is easy to see that the orbit of
v in G_2[g_S(S)] matches the orbit of v in G_1[S]. See Figure 4(c,d) for an illustration of this
case.
[Figure 4: Illustration of the proof of Theorem 3.1. Panels: (a) n ∉ S, k = 1; (b) n ∉ S, k > 1;
(c) n ∈ S, k = 1; (d) n ∈ S, k > 1. The trees G_1[S] and G_2[g(S)] are outlined in orange.]
Finally, note that for any two different sets S_1 and S_2 such that G_1[S_1] ≃ G_1[S_2] ≃ H_i, we have
g_{S_1}(S_1) ≠ g_{S_2}(S_2), which guarantees that the mapping g : {S ⊆ [n] : G_1[S] ≃ H_i, v ∈ S} →
{S ⊆ [n] : G_2[S] ≃ H_i, v ∈ S} defined by g(S) = g_S(S) is injective. One can further
check that the mapping g is also surjective, and thus it is bijective. This means x^V_{G_1}(v) = x^V_{G_2}(v)
for v ∈ {m + 1, · · · , 2m}. The proof of x^V_{G_1}(n) = x^V_{G_2}(n) is almost the same, so we omit it
here. Note that under the classic 1-WL, the colors χ_{G_1}(v) and χ_{G_2}(v) are also the same. Therefore,
adding the features x^V(v) does not help distinguish the two graphs. This finishes the proof for
Example C.9.
Using a similar cycle analysis to the above proof, we obtain the following negative result for Simplicial
WL (Bodnar et al., 2021b) and Cellular WL (Bodnar et al., 2021a):

Proposition C.12. Consider the SWL algorithm (Bodnar et al., 2021b), or more generally, the CWL
algorithms with either k-CL, k-IC, or k-C as lifting maps, where k ≥ 3 is an integer (Bodnar et al., 2021a,
Definition 14). These algorithms can neither distinguish whether a given graph has cut vertices nor
distinguish whether it has cut edges.
Proof. Observe that the counterexample graphs in both Examples C.9 and C.10 do not contain cliques.
Therefore, SWL (or CWL with k-CL) reduces to the classic 1-WL and thus fails to distinguish
them. Since the length of any cycle in these counterexample graphs is at least m (m is defined in
Examples C.9 and C.10), CWL with k-IC or k-C also reduces to 1-WL when m > k.
Therefore, there exist graphs of size O(k) such that CWL can distinguish neither cut vertices
nor cut edges.

Finally, we point out that even if k is not a constant (i.e., it can scale with the graph size), CWL with k-IC
still fails to distinguish whether a given graph has cut vertices. This is because for Example C.9 with
k ≥ 2 (e.g. Figure 2(b,c)), CWL with k-IC still outputs the same graph representation for both G_1
and G_2. This happens because all the 2-dimensional cells in these examples are cycles of equal
length m + 2, and one can easily check that they have the same CWL color.
Proposition C.13. The OS-WL algorithm (Wijesinghe & Wang, 2022) can neither distinguish
whether a given graph has cut vertices nor distinguish whether it has cut edges.

Proof. An important limitation of OS-WL is that if a graph does not contain triangles, then the over-
lap subgraph S_{uv} between any two adjacent nodes u, v has only one edge {u, v}. Consequently,
the subgraph mapping ω does not take effect, and OS-WL reduces to the classic 1-WL. Therefore,
Example C.9 with m > 1 and Example C.10 with m > 3 still apply here, since the graphs G_1 and G_2
do not contain triangles (see Figure 2(a,b,d)). Moreover, Example C.9 with m = 1 (see Figure 2(c))
is also a counterexample, as discussed in Wijesinghe & Wang (2022, Figure 2(a)).
Proposition C.14. DSS-WL with the ego network policy without marking cannot distinguish the
graphs in Example C.9 with m = 1 (Figure 2(c)).

Proof. First note that for any two vertices u, v in either G_1 or G_2 of Example C.9, their
shortest path distance does not exceed 2. Thus we only need to consider the ego network policies
π_EGO(1) and π_EGO(2).

• For π_EGO(2), the ego graph of every node is simply the original graph, and thus all graphs
in the bag B^π are equal. Hence DSS-WL reduces to the classic 1-WL and cannot distinguish
G_1 and G_2.

• For π_EGO(1), the ego graph of each node v ≠ n is a graph with 5 edges, having the shape of
two triangles sharing one edge. These ego graphs are clearly isomorphic. The ego graph of
the special node n is the original graph containing all edges. It is easy to see that the vertex
partition of DSS-WL becomes stable after only one iteration, and the color mappings of G_1
and G_2 are the same. Therefore, DSS-WL cannot distinguish G_1 and G_2.

We thus conclude the proof.
Proposition C.15. The GNN-AK architecture proposed in Zhao et al. (2022) cannot distinguish
whether a given graph has cut vertices.

Proof. The GNN-AK architecture is quite similar to DSS-WL with the ego network policy, but
weaker. There is also a subtle difference: GNN-AK adds a so-called centroid encoding. However,
unlike node marking, which is performed before the WL procedure, the centroid encoding is applied
after the WL procedure. This subtle difference makes GNN-AK unable to distinguish the two graphs
G_1 and G_2 above.
We finally consider the DS-WL algorithm proposed in Cotta et al. (2021); Bevilacqua et al. (2022).
As discussed in Appendix B.4, the original DS-WL formulation only outputs a graph representation
rather than node colors. There are two simple ways to define node colors for DS-WL:

• If the graph generation policy π is node-based, then each subgraph in B^π_G = {{G_i}}_{i=1}^{|V|} is
uniquely associated with a specific node v ∈ V. We can thus use the graph representation of
each subgraph G_i as the color of the associated node. This strategy has appeared in prior works,
e.g. Zhao et al. (2022).

• For a general graph generation policy π, there no longer exists an explicit bijective mapping
between nodes and subgraphs. In this case, another possible way is to define χ_G(v) :=
{{χ_{G_i}(v) : G_i ∈ B^π_G}}, similar to DSS-WL. This approach was recently introduced by Qian
et al. (2022). However, such a strategy loses the memory advantage of DS-WL (i.e., it needs
Θ(|V|·|B^π_G|) memory rather than Θ(|V| + |B^π_G|)) and is less expressive than DSS-WL. We
thus do not study this variant in the present work.
Proposition C.16. The DS-WL algorithm with node marking/deletion policy cannot distinguish cut
vertices when each node’s color is defined as its associated subgraph representation.
Proof. One can similarly check that for Example C.9 with m = 1 (see Figure 2(c)), the color of node
n will be the same for both graphs G1 and G2 . Therefore, DS-WL cannot identify cut vertices.
Finally, using a similar proof technique, one can show that the NGNN architecture proposed in
Zhang & Li (2021) (with shortest path distance encoding) cannot identify cut vertices.
C.3 PROOF OF THEOREM 3.2

Theorem C.17. Let G = (V, E_G) and H = (V, E_H) be two graphs, and let χ_G and χ_H be the
corresponding DSS-WL color mappings with the node marking policy. Then the following holds:

• For any two nodes w ∈ V in G and x ∈ V in H, if χ_G(w) = χ_H(x), then w is a cut vertex
in graph G if and only if x is a cut vertex in graph H.

• For any two edges {w_1, w_2} ∈ E_G and {x_1, x_2} ∈ E_H, if {{χ_G(w_1), χ_G(w_2)}} =
{{χ_H(x_1), χ_H(x_2)}}, then {w_1, w_2} is a cut edge if and only if {x_1, x_2} is a cut edge.
Proof. We divide the proof into two parts in Appendices C.3.1 and C.3.2, separately focusing on
each bullet of Theorem 3.2. Before going into the proof, we first define several notations.
Denote by χ^u_G(v) the color of node v under the DSS-WL algorithm when u is marked as the special
node. By definition of DSS-WL (Line 7 in Algorithm 3), χ_G(v) = hash({{χ^u_G(v) : u ∈ V}}). We
can similarly define the inverse mappings (χ^u_G)^{-1}.
We first present a lemma which allows us to exclude the case of disconnected graphs.

Lemma C.18. Given a node w, let S_G(w) ⊆ V be the connected component of graph G that
contains node w. For any two nodes w ∈ V in G and x ∈ V in H, if χ_G(w) = χ_H(x), then
χ_{G[S_G(w)]}(w) = χ_{H[S_H(x)]}(x).

Proof. We first prove that if χ_G(w) = χ_H(x), then {{χ^u_G(w) : u ∈ S_G(w)}} = {{χ^u_H(x) : u ∈
S_H(x)}}. First note that for any nodes u, w in G and v, x in H, if u ∈ S_G(w) but v ∉ S_H(x),
then χ^u_G(w) ≠ χ^v_H(x). This is because DSS-WL only performs neighborhood aggregation, so the
mark at v cannot propagate to node x while the mark at u can propagate to node w. By definition
we have

χ_G(w) = hash({{χ^u_G(w) : u ∈ S_G(w)}} ∪ {{χ^v_G(w) : v ∉ S_G(w)}}).

Similarly,

χ_H(x) = hash({{χ^u_H(x) : u ∈ S_H(x)}} ∪ {{χ^v_H(x) : v ∉ S_H(x)}}).

Since χ_G(w) = χ_H(x), we have {{χ^u_G(w) : u ∈ S_G(w)}} = {{χ^u_H(x) : u ∈ S_H(x)}}. This clearly
implies {{χ^u_{G[S_G(w)]}(w) : u ∈ S_G(w)}} = {{χ^u_{H[S_H(x)]}(x) : u ∈ S_H(x)}}, and thus
χ_{G[S_G(w)]}(w) = χ_{H[S_H(x)]}(x).
Note that if w is a cut vertex in G, then w is a cut vertex in G[S_G(w)]. Therefore, based on
Lemma C.18, we can restrict our attention to the subgraphs G[S_G(w)] and H[S_H(x)] instead of the
original (potentially disconnected) graphs. In other words, in the subsequent proof we can simply
assume that both graphs G and H are connected.

We next present several simple but important properties regarding the DSS-WL color mapping as
well as the subgraph color mappings.
Lemma C.19. Let u, w be two nodes in a connected graph G and v, x two nodes in a connected
graph H. Then the following holds:

(a) If w = u and x ≠ v, then χ^u_G(w) ≠ χ^v_H(x);
(b) If χ^u_G(w) = χ^v_H(x), then χ_G(w) = χ_H(x);
(c) If χ^u_G(w) = χ^v_H(x), then χ_G(u) = χ_H(v);
(d) χ_G(w) = χ_H(x) if and only if χ^w_G(w) = χ^x_H(x);
(e) If χ^u_G(w) = χ^v_H(x), then dis_G(u, w) = dis_H(v, x).
Proof. Item (a) holds because in DSS-WL, the marked node can never have the same color as an
unmarked node. This can be rigorously proved by induction over the iteration t in the DSS-WL
algorithm (Line 6 in Algorithm 3).

Item (b) simply follows by definition of the DSS-WL aggregation procedure, since the color χ^u_G(w)
encodes the color χ_G(w).

We next prove item (c), which follows from the WL-condition of the DSS-WL algorithm (Proposi-
tion C.3). Since G is connected, there is a path from w to u. Therefore, in graph H there is also a
path from x to some node v′ satisfying χ^u_G(u) = χ^v_H(v′) (Lemma C.7). Now using item (a), it can
only be the case that v′ = v, and thus χ^u_G(u) = χ^v_H(v). Finally, by item (b) we obtain the desired
result.

We next prove item (d). On the one hand, item (b) already shows that χ^w_G(w) = χ^x_H(x) ⟹
χ_G(w) = χ_H(x). On the other hand, by definition of the DSS-WL algorithm,

χ_G(w) = hash({{χ^w_G(w)}} ∪ {{χ^u_G(w) : u ∈ V\{w}}}),
χ_H(x) = hash({{χ^x_H(x)}} ∪ {{χ^v_H(x) : v ∈ V\{x}}}).

Since χ_G(w) = χ_H(x) and χ^w_G(w) ≠ χ^v_H(x) holds for all v ∈ V\{x} (by item (a)), we obtain
χ^w_G(w) = χ^x_H(x).

We finally prove item (e), which again can be derived from the WL-condition of the DSS-WL al-
gorithm. If χ^u_G(w) = χ^v_H(x), then by Corollary C.8 we have dis_G(w, (χ^u_G)^{-1}(χ^u_G(u))) =
dis_H(x, (χ^v_H)^{-1}(χ^u_G(u))). Using item (a), we have (χ^u_G)^{-1}(χ^u_G(u)) = {u}, and for any
v′ ≠ v, χ^v_H(v′) ≠ χ^v_H(v). Therefore, it can only be the case that (χ^v_H)^{-1}(χ^u_G(u)) = {v} and
χ^v_H(v) = χ^u_G(u). This yields dis_G(u, w) = dis_H(v, x) and concludes the proof.
Lemma C.20. Let u be a node in a connected graph G and v a node in a connected graph H. If
χ^u_G(u) = χ^v_H(v), then {{χ^u_G(w) : w ∈ V}} = {{χ^v_H(w) : w ∈ V}}.

Proof. Let N^d_G(u) := {w ∈ V : dis_G(u, w) = d} be the d-hop neighbors of node u in graph
G, and denote C^d_G := {{χ^u_G(w) : w ∈ N^d_G(u)}} the multiset containing the colors of all nodes
w at distance d from node u. We similarly denote N^d_H(v) := {w : dis_H(v, w) = d} and
C^d_H := {{χ^v_H(w) : w ∈ N^d_H(v)}}. It suffices to prove that C^d_G = C^d_H for all d ∈ N.

We prove this by induction. The case of d = 0 is trivial. Now suppose the claim holds for d (i.e.,
C^d_G = C^d_H); we want to prove C^{d+1}_G = C^{d+1}_H. Note that for any nodes x_1, x_2 satisfying
χ^u_G(x_1) = χ^v_H(x_2), we have {{χ^u_G(w) : w ∈ N_G(x_1)}} = {{χ^v_H(w) : w ∈ N_H(x_2)}}. Therefore,
by the induction assumption C^d_G = C^d_H,

⋃_{x∈N^d_G(u)} {{χ^u_G(w) : w ∈ N_G(x)}} = ⋃_{x∈N^d_H(v)} {{χ^v_H(w) : w ∈ N_H(x)}},

where the unions are taken with multiplicity.

We next claim that C^d_G ∩ C^{d′}_G = ∅ for any d ≠ d′. This is because for any nodes w_1 and w_2 with
the same color χ^u_G(w_1) = χ^u_G(w_2), Lemma C.19(e) gives dis_G(w_1, u) = dis_G(w_2, u). Using
this property, we can restrict the above identity to nodes at distance exactly d + 1 and obtain

⋃_{x∈N^d_G(u)} {{χ^u_G(w) : w ∈ N_G(x) ∩ N^{d+1}_G(u)}} = ⋃_{x∈N^d_H(v)} {{χ^v_H(w) : w ∈ N_H(x) ∩ N^{d+1}_H(v)}}.

Finally, observe that if χ^u_G(w_1) = χ^v_H(w_2) for some nodes w_1 and w_2, then |N_G(w_1) ∩ N^d_G(u)| =
|N_H(w_2) ∩ N^d_H(v)| (again because C^d_G ∩ C^{d′}_G = ∅ for any d ≠ d′); that is, each color appears
in the two unions above with the same per-node multiplicity. Consequently, {{χ^u_G(w) : w ∈
N^{d+1}_G(u)}} = {{χ^v_H(w) : w ∈ N^{d+1}_H(v)}}, namely C^{d+1}_G = C^{d+1}_H. This completes the
induction step.
We now present the following key result, which shows an important property of the DSS-WL color
mapping:

Corollary C.21. Let u, v ∈ V be two nodes in a connected graph G with the same DSS-WL color, i.e.
χ_G(u) = χ_G(v). Then for any color c ∈ C, {{χ^u_G(w) : w ∈ χ_G^{-1}(c)}} = {{χ^v_G(w) : w ∈ χ_G^{-1}(c)}}.
Proof. First observe that if χ_G(u) = χ_G(v), then χ^u_G(u) = χ^v_G(v) (by Lemma C.19(d)). Con-
sequently, {{χ^u_G(w) : w ∈ V}} = {{χ^v_G(w) : w ∈ V}} holds by Lemma C.20. If {{χ^u_G(w) :
w ∈ χ_G^{-1}(c)}} ≠ {{χ^v_G(w) : w ∈ χ_G^{-1}(c)}}, then there must exist two nodes w_1 ∈ χ_G^{-1}(c)
and w_2 ∉ χ_G^{-1}(c) such that χ^u_G(w_1) = χ^v_G(w_2). By Lemma C.19(b), this gives
χ_G(w_1) = χ_G(w_2), a contradiction.
In the subsequent proof, we assume the connected graph G is not vertex-biconnected and let u ∈ V
be a cut vertex of G. Let {S_i}_{i=1}^m (m ≥ 2) be the partition of the vertex set V\{u} into the
connected components obtained after removing node u.
Lemma C.22. There is at most one set S_i satisfying S_i ∩ χ_G^{-1}(χ_G(u)) ≠ ∅. In other words, if
S_i ∩ χ_G^{-1}(χ_G(u)) ≠ ∅ for some i ∈ [m], then for any j ∈ [m] with j ≠ i, S_j ∩ χ_G^{-1}(χ_G(u)) = ∅.
Using a similar proof technique as the one in Lemma C.23, we can prove the first part of Theo-
rem 3.2. Suppose u′ ∈ χ_H^{-1}(χ_G(u)); we want to prove that u′ is a cut vertex of graph H.
Observe that |χ_G^{-1}(χ_G(u))| = |χ_H^{-1}(χ_H(u′))|. (A simple proof is as follows: χ_G(u) = χ_H(u′)
implies χ^u_G(u) = χ^{u′}_H(u′) by Lemma C.19(d); thus, using Lemma C.20, we have {{χ^u_G(w) : w ∈
V}} = {{χ^{u′}_H(w) : w ∈ V}}, and finally {{χ_G(w) : w ∈ V}} = {{χ_H(w) : w ∈ V}} by
Lemma C.19(b).)
We first consider the case |χ_G^{-1}(χ_G(u))| = |χ_H^{-1}(χ_H(u′))| > 1. Following the above proof,
we can similarly pick w ∈ S_j in G and w′ in H satisfying χ_G(w) ≠ χ_G(u) and χ^{u′}_H(w′) = χ^u_G(w).
Since |χ_G^{-1}(χ_G(u))| > 1, we can pick a node u_H ∈ χ_H^{-1}(χ_G(u)) in H such that u_H ≠ u′. If u′ is
not a cut vertex, then there is a path P = (x_0, · · · , x_d) in H with x_0 = w′ and x_d = u_H, such
that χ^{u′}_H(x_i) ≠ χ^{u′}_H(u′) for all i ∈ [d] (by Lemma C.19(a)). Using the WL-condition, there exists a
path Q = (y_0, · · · , y_d) in G satisfying y_0 = w and χ^u_G(y_i) = χ^{u′}_H(x_i) for all i ∈ [d]. In particular,
χ^u_G(y_d) = χ^{u′}_H(u_H), which implies χ_G(y_d) = χ_H(u_H) by Lemma C.19(b). However, any
path from w to y_d ∈ χ_G^{-1}(χ_G(u)) must go through node u, implying that χ^u_G(y_i) = χ^u_G(u) for some
i ∈ [d]. This yields a contradiction, because χ^u_G(y_i) = χ^{u′}_H(x_i) ≠ χ^{u′}_H(u′) = χ^u_G(u). See Figure 6(a)
for an illustration of this paragraph.

[Figure 6: Several illustrations to help understand the main proof of Theorem 3.2; panels show graphs G and H.]
We finally consider the case |χ_G^{-1}(χ_G(u))| = |χ_H^{-1}(χ_H(u′))| = 1. Let w ∈ S_1 and x ∈ S_2
be two nodes in G that belong to different connected components after removing node u; then
χ_G(w) ≠ χ_G(u) and χ_G(x) ≠ χ_G(u). Since χ_G(u) = χ_H(u′), by the WL-condition (Lemma C.7)
there is a node w′ ∈ χ_H^{-1}(χ_G(w)) in H. Consequently, χ^w_G(w) = χ^{w′}_H(w′) (Lemma C.19(d)). Again
by the WL-condition, there is a node x′ ∈ (χ^{w′}_H)^{-1}(χ^w_G(x)) in H. Clearly, w′ ≠ u′ and x′ ≠ u′
(because they have different colors). If u′ is not a cut vertex, then there is a path P = (y_0, · · · , y_d) in
H such that y_0 = x′, y_d = w′ and y_i ≠ u′ for all i ∈ [d]. It follows that for all i ∈ [d], χ_H(y_i) ≠
χ_H(u′) by our assumption |χ_H^{-1}(χ_H(u′))| = 1, and thus χ^{w′}_H(y_i) ≠ χ^{w′}_H(u′) (by Lemma C.19(b)).
Since χ^w_G(x) = χ^{w′}_H(x′), by the WL-condition (Lemma C.7), there is a path Q = (z_0, · · · , z_d) in G
satisfying z_0 = x and z_i ∈ (χ^w_G)^{-1}(χ^{w′}_H(y_i)) for i ∈ [d]. See Figure 6(b) for an illustration of this
paragraph.

Clearly, we have z_d = w, using χ^w_G(z_d) = χ^{w′}_H(w′) and Lemma C.19(a). On the other hand, by
Lemma C.19(b), χ^w_G(z_i) = χ^{w′}_H(y_i) implies χ_G(z_i) = χ_H(y_i), and thus χ_G(z_i) ≠ χ_H(u′) = χ_G(u)
holds for all i ∈ [d]; hence z_i ≠ u. In other words, we have found a path from x to w without
going through node u, which yields a contradiction since u is a cut vertex. We have thus finished the
proof.
We now turn to the second bullet of Theorem 3.2. The key step is the following claim: if χ_G(w) =
χ_G(x) for two nodes w, x in a connected graph G, then for any color c ∈ C, {{dis_G(w, v) : v ∈
χ_G^{-1}(c)}} = {{dis_G(x, v) : v ∈ χ_G^{-1}(c)}}.

Proof. By Corollary C.21, we have {{χ^w_G(v) : v ∈ χ_G^{-1}(c)}} = {{χ^x_G(v) : v ∈ χ_G^{-1}(c)}}. Since for
any nodes u, v, χ^w_G(u) = χ^x_G(v) implies dis_G(u, w) = dis_G(v, x) (by Lemma C.19(e)), we
obtain the desired conclusion.

Equivalently, the above claim says that if χ_G(w) = χ_G(x), then the following two multisets are
equal:

{{(dis_G(w, v), χ_G(v)) : v ∈ V}} = {{(dis_G(x, v), χ_G(v)) : v ∈ V}}.

Therefore, the vertex partition induced by the DSS-WL color mapping is finer than that of SPD-WL
(Algorithm 4 with d_G = dis_G). We can thus invoke Theorem 4.1, which directly concludes the
proof (due to Proposition C.56).
C.4 PROOF OF THEOREM 4.1
Theorem C.25. Let G = (V, E_G) and H = (V, E_H) be two graphs, and let χ_G and χ_H be the
corresponding SPD-WL color mappings. Then the following holds:
• For any two edges {w1 , w2 } ∈ EG and {x1 , x2 } ∈ EH , if {{χG (w1 ), χG (w2 )}} =
{{χH (x1 ), χH (x2 )}}, then {w1 , w2 } is a cut edge if and only if {x1 , x2 } is a cut edge.
• If the graph representations of G and H are the same under SPD-WL, then their block
cut-edge trees (Definition 2.3) are isomorphic. Mathematically, {{χG (w) : w ∈ V}} =
{{χH (w) : w ∈ V}} implies that BCETree(G) ≃ BCETree(H).
Proof Sketch. The proof of Theorem 4.1 is highly non-trivial and is divided into three parts (pre-
sented in Appendices C.4.1 to C.4.3, respectively). We first consider the special setting where both
G and H are connected and {{χ_G(w) : w ∈ V}} = {{χ_H(w) : w ∈ V}}. Assume G is not
edge-biconnected, and let {u, v} ∈ E_G be a cut edge in G. We separately consider two cases:
χ_G(u) ≠ χ_G(v) (Appendix C.4.1) and χ_G(u) = χ_G(v) (Appendix C.4.2), and prove that any edge
{u′, v′} ∈ E_H satisfying {{χ_G(u), χ_G(v)}} = {{χ_H(u′), χ_H(v′)}} is also a cut edge of H. This
essentially finishes the proof of the first bullet of the theorem. Finally, we consider the general setting
where graphs G, H may be disconnected and their representations may differ in Appendix C.4.3,
and complete the proof of Theorem 4.1.
Without abuse of notation, throughout Appendices C.4.1 and C.4.2 we redefine the color set C :=
{χG (w) : w ∈ V} = {χH (w) : w ∈ V} to focus only on colors that are present in G (or H), rather
than all (irrelevant) colors in the range of a hash function.
Definition C.26. Given a stable color mapping χ_G of graph G = (V, E_G), the color graph G^C =
(C, E_{G^C}) is the graph whose vertices are the colors in C and where {{c_1, c_2}} ∈ E_{G^C} if and
only if there exist two adjacent vertices w_1, w_2 ∈ V with χ_G(w_1) = c_1 and χ_G(w_2) = c_2.

Lemma C.27. Let S_u and S_v be the two connected components obtained after removing the cut
edge {u, v} from G, with u ∈ S_u and v ∈ S_v, and denote S := χ_G^{-1}(χ_G(u)) ∪ χ_G^{-1}(χ_G(v)).
Then |S ∩ S_u| = 1 or |S ∩ S_v| = 1.

Proof. Assume the lemma does not hold, i.e. |S ∩ S_u| > 1 and |S ∩ S_v| > 1. We first prove that
χ_G^{-1}(χ_G(u)) ∩ S_v ≠ ∅ and χ_G^{-1}(χ_G(v)) ∩ S_u ≠ ∅. By symmetry, we only need to prove the former.
Suppose χ_G^{-1}(χ_G(u)) ∩ S_v = ∅. Then (χ_G^{-1}(χ_G(v)) ∩ S_v)\{v} ≠ ∅ (because |S ∩ S_v| > 1), and
thus there exists v′ ∈ S_v, v′ ≠ v, such that χ_G(v′) = χ_G(v). Note that v′ must connect to a node u′
with χ_G(u′) = χ_G(u). Since {u, v} is a cut edge in G, u′ ∈ S_v. Therefore, χ_G^{-1}(χ_G(u)) ∩ S_v ≠ ∅,
yielding a contradiction. This paragraph is illustrated in Figure 7(a).

We next prove that at least one of the following two (symmetric) conditions holds: (i)
(χ_G^{-1}(χ_G(u)) ∩ S_u)\{u} ≠ ∅; (ii) (χ_G^{-1}(χ_G(v)) ∩ S_v)\{v} ≠ ∅. Based on the above paragraph,
there exists v′ ∈ S_u satisfying χ_G(v′) = χ_G(v). Note that v′ must connect to a node with color χ_G(u).
Based on Lemma C.27, in the subsequent proof we can without loss of generality assume
χ_G^{-1}(χ_G(u)) ∩ S_u = {u} and χ_G^{-1}(χ_G(v)) ∩ S_u = ∅. This leads to the following lemma:
Lemma C.28. For any two distinct nodes u_1, u_2 ∈ χ_G^{-1}(χ_G(u)), any path connecting u_1 and u_2
goes through at least one node in χ_G^{-1}(χ_G(v)).

Now define the set D(x) := {u′ ∈ χ_G^{-1}(χ_G(u)) : dis_G(x, u′) ≤ |P_2|}. Let us focus on the
cardinalities of the sets D(w) and D(w′). It follows that D(w′) = {u}, because for any other node
u′ ∈ χ_G^{-1}(χ_G(u)), u′ ≠ u, we have u′ ∈ S_v and thus
The next lemma presents an important property of the color graph G^C (Definition C.26).

Lemma C.29. G^C has a cut edge {{χ_G(u), χ_G(v)}}.

Proof. Suppose {{χ_G(u), χ_G(v)}} is not a cut edge of G^C. Then there is a simple cycle (c_1, · · · , c_m)
where c_1 = χ_G(u), c_m = χ_G(v) and m > 2; namely, there exists a simple path from c_1 to c_m of
length ≥ 2 avoiding the edge {{χ_G(u), χ_G(v)}}. By the definition of G^C and the WL-condition,
there exists a sequence of nodes {w_i}_{i=1}^m of G with w_1 = u and χ_G(w_i) = c_i, such that
{w_i, w_{i+1}} ∈ E_G for i ∈ [m − 1]. Note that w_i ≠ u for i ∈ {2, · · · , m} and w_2 ≠ v, because
(c_1, · · · , c_m) is a simple path. Therefore, w_i ∈ S_u for all i ∈ [m]. However, this contradicts
|S ∩ S_u| = 1 (Lemma C.27), since χ_G(w_m) = χ_G(v).
Corollary C.30. Any edge {u′, v′} ∈ E_G satisfying {{χ_G(u′), χ_G(v′)}} = {{χ_G(u), χ_G(v)}} is a
cut edge of G.

Proof. If {u′, v′} is not a cut edge, there is a simple cycle going through {u′, v′}. Denote it as
(w_1, · · · , w_m), where w_1 = u′, w_m = v′ and m > 2. By Lemma C.27, χ_G(w_2) ≠ χ_G(v); otherwise
u′ would connect to two different nodes w_2, w_m ∈ χ_G^{-1}(χ_G(v)), and thus u′ and u could not have
the same color under SPD-WL. Let j = min{j ∈ [m] : χ_G(w_j) = χ_G(v)}; then j > 2. Consider
the path (w_1, · · · , w_j). It follows that χ_G(w_k) ≠ χ_G(u) for all k ∈ {2, · · · , j} by Lemma C.28
(otherwise there would be a path from node w_1 to some node w_i ∈ χ_G^{-1}(χ_G(u)) (i ∈ {2, · · · , j})
that does not go through nodes in the set χ_G^{-1}(χ_G(v)), a contradiction). Therefore,
(χ_G(w_1), · · · , χ_G(w_j)) is a path of length ≥ 2 in G^C from χ_G(u) to χ_G(v) (not necessarily simple)
that avoids the edge {{χ_G(u), χ_G(v)}}. This contradicts Lemma C.29, which says that
{{χ_G(u), χ_G(v)}} is a cut edge in G^C.
Based on Lemma C.29, the cut edge {{χ_G(u), χ_G(v)}} partitions the vertices C of the color graph
G^C into two classes. Denote them as {C_u, C_v}, where χ_G(u) ∈ C_u and χ_G(v) ∈ C_v. The next
corollary characterizes the structure of the node colors calculated by SPD-WL.

Corollary C.31. For any w satisfying χ_G(w) ∈ C_u, there exists a cut edge {u′, v′} with u′ ∈
χ_G^{-1}(χ_G(u)) and v′ ∈ χ_G^{-1}(χ_G(v)) that partitions V into two classes S_{u′} ∪ S_{v′} with
u′, w ∈ S_{u′} and v′ ∈ S_{v′}, such that χ_G^{-1}(χ_G(u′)) ∩ S_{u′} = {u′} and χ_G^{-1}(χ_G(v′)) ∩ S_{u′} = ∅.
Remark C.32. Corollary C.31 can be seen as a generalized version of Lemma C.27. Indeed, when
w ∈ S_u, one can pick u′ = u and v′ = v; then χ_G^{-1}(χ_G(u′)) ∩ S_{u′} = {u′} and χ_G^{-1}(χ_G(v′)) ∩
S_{u′} = ∅ hold due to Lemma C.27. In general, Corollary C.31 says that all the cut edges with color
{{χ_G(u), χ_G(v)}} play an equal role: Lemma C.27 applies to any chosen cut edge {u′, v′}. An
illustration of Corollary C.31 is given in Figure 9(a).
Proof. By the definition of C_u, any node c ∈ C_u in the color graph can reach the node χ_G(u) without going through χ_G(v). Therefore, there exists some u′ ∈ χ_G^{-1}(χ_G(u)) such that there is a path P_1 from w to u′ that does not go through nodes in the set χ_G^{-1}(χ_G(v)). Also, there exists a node v′ ∈ N_G(u′) with χ_G(v′) = χ_G(v), due to the color of u′. By Corollary C.30, {u′, v′} is a cut edge of G. Clearly, w ∈ S_{u′}.

We next prove the following fact: for any x ∈ S_{u′}, χ_G(x) ∈ C_u. Otherwise, one can pick a node x ∈ S_{u′} with color χ_G(x) ∈ C_v. Consider the shortest path between nodes x and u′, denoted as (y_1, · · · , y_m) where y_1 = x and y_m = u′. It follows that y_i ∈ S_{u′} for all i ∈ [m]. Denote
Figure 9: Illustration of Corollary C.31 and its proof.
c_i = χ_G(y_i) for i ∈ [m]. Then (c_1, · · · , c_m) is a path (not necessarily simple) in the color graph G^C. Now pick the index j = max{j ∈ [m] : c_j ∈ C_v} (which is well-defined because c_1 ∈ C_v). It follows that j < m (since c_m = χ_G(u′) ∈ C_u), c_j = χ_G(v) and c_{j+1} = χ_G(u) (because {{χ_G(u), χ_G(v)}} is a cut edge that partitions the color graph G^C into C_u and C_v). Consider the following two cases (see Figure 9(b) for an illustration):

• j = m − 1. Then u′ connects to both nodes y_j and v′ with color χ_G(y_j) = χ_G(v′) = χ_G(v). This contradicts Lemma C.27, since u only connects to one node v with color χ_G(v).
• j < m − 1. Then y_{j+1} ≠ u′ because the path (y_1, · · · , y_m) is simple. However, one has χ_G(y_i) ≠ χ_G(v) for all i ∈ {j + 1, · · · , m} by definition of j. This contradicts Lemma C.28.
This completes the proof that for any x ∈ S_{u′}, χ_G(x) ∈ C_u. Therefore, χ_G^{-1}(χ_G(v′)) ∩ S_{u′} = ∅.
We have already fully characterized the properties of cut edges {u′ , v ′ } with color {χG (u), χG (v)}.
Now we switch our focus to the graph H. We first prove a general result that holds for arbitrary H.
Lemma C.33. Let {w_1, w_2} ∈ E_H and let P be a path of minimum length from w_1 to w_2 that does not go through the edge {w_1, w_2}. In other words, linking the path P with the edge {w_1, w_2} forms a simple cycle Q. Then for any two nodes x_1, x_2 in Q, dis_H(x_1, x_2) = dis_Q(x_1, x_2).
Proof. Split the cycle Q into two paths Q_1 and Q_2 with endpoints {x_1, x_2}, where Q_1 contains the edge {w_1, w_2} and Q_2 does not. Assume the lemma does not hold, i.e., dis_H(x_1, x_2) < dis_Q(x_1, x_2). This means that there exists a path R in H from x_1 to x_2 with |R| < min(|Q_1|, |Q_2|). Note that the edge {w_1, w_2} occurs at most once in R. Separately consider two cases:
• {w1 , w2 } occurs in R. Then linking R with Q2 forms a cycle that contains {w1 , w2 } exactly
once;
• {w1 , w2 } does not occur in R. Then linking R with Q1 forms a cycle that contains {w1 , w2 }
exactly once.
In both cases, the cycle has a length less than |Q|. This contradicts the condition that P is a path
with minimum length from w1 to w2 without passing edge {w1 , w2 }.
We can similarly consider the color graph H^C = (C, E_{H^C}) defined in Definition C.26. Note that we have assumed that the graph representations of G and H are the same, i.e., {{χ_G(w) : w ∈ V}} = {{χ_H(w) : w ∈ V}}. It follows that H^C is isomorphic to G^C and the identity vertex mapping is an isomorphism, i.e., {{c_1, c_2}} ∈ E_{G^C} ⟺ {{c_1, c_2}} ∈ E_{H^C}. Therefore, {{χ_G(u), χ_G(v)}} is a cut edge of H^C (Lemma C.29) that splits the vertices C into two classes C_u, C_v. Since the vertex labels of H are not important, we can without abuse of notation let u, v be two nodes of H such that χ_H(u) = χ_G(u) and χ_H(v) = χ_G(v).
Figure 10: Illustrations to help understand the proof of the main result.
u′_w ≠ u′_1 and u′_w ≠ u′_2 (otherwise, by Corollary C.31, any path from w′ to a node u′ ≠ u′_w with color χ_G(u′) = χ_G(u) must first go through u′_w and then go through v′_w, implying that |dis_G(w′, u′_1) − dis_G(w′, u′_2)| ≥ 2 and yielding a contradiction). Therefore, dis_G(w′, u′_1) > dis_G(w′, u′_w) and dis_G(w′, u′_2) > dis_G(w′, u′_w). We give an illustration of the structure of G in Figure 10(b) based on this paragraph.
Pick any v_w ∈ χ_H^{-1}(χ_H(v)) satisfying dis_H(v_w, w_k) = dis_G(v′_w, w′). Denote by dropmin(S) := S \ {{min S}} the operation that takes a multiset S and removes one of its minimum elements. We have

dropmin({{dis_G(w′, u_G) : u_G ∈ χ_G^{-1}(χ_G(u))}})
= dropmin({{dis_G(w′, v′_w) + dis_G(v′_w, u_G) : u_G ∈ χ_G^{-1}(χ_G(u))}})   (by Corollary C.31)
= dropmin({{dis_H(w_k, v_w) + dis_H(v_w, u_H) : u_H ∈ χ_H^{-1}(χ_H(u))}}),

and also

dropmin({{dis_G(w′, u_G) : u_G ∈ χ_G^{-1}(χ_G(u))}}) = dropmin({{dis_H(w_k, u_H) : u_H ∈ χ_H^{-1}(χ_H(u))}})

due to the same color χ_G(w′) = χ_H(w_k). Combining the above two equations and noting that dis_H(w_k, v_w) + dis_H(v_w, u_H) ≥ dis_H(w_k, u_H), we obtain the following result: for any u_H ∈ χ_H^{-1}(χ_H(u)) for which dis_H(w_k, v_w) + dis_H(v_w, u_H) > dis_G(w′, u′_w), it holds that dis_H(w_k, v_w) + dis_H(v_w, u_H) = dis_H(w_k, u_H). In particular,

dis_H(w_k, w_1) = dis_H(w_k, v_w) + dis_H(v_w, w_1),
dis_H(w_k, w_j) = dis_H(w_k, v_w) + dis_H(v_w, w_j).

Therefore,

dis_H(w_1, w_j) = dis_H(w_1, w_k) + dis_H(w_k, w_j)
= 2 dis_H(w_k, v_w) + dis_H(v_w, w_1) + dis_H(v_w, w_j)
≥ 2 dis_H(w_k, v_w) + dis_H(w_1, w_j),

implying w_k = v_w. However, χ_H(w_k) ∈ C_u while χ_H(v_w) ∈ C_v, yielding a contradiction.
It is easy to see that hG is well-defined for all w ∈ V because {u, v} is a cut edge of G. We further
define the following auxiliary graph:
Definition C.34. (Auxiliary graph) Define the auxiliary graph G^A = (V_{G^A}, E_{G^A}), where V_{G^A} := {u, v} × C and E_{G^A} := {{{f_G(w_1), f_G(w_2)}} : {w_1, w_2} ∈ E_G}. Note that G^A can have self-loops, so each edge is denoted as a multiset with two elements.

It is straightforward to see that there is only one edge in G^A of the form {{(u, c_1), (v, c_2)}} ∈ E_{G^A} for some c_1, c_2 ∈ C, since {u, v} is a cut edge of G. Therefore, this unique edge is {{(u, χ_G(u)), (v, χ_G(v))}}, and it is a cut edge in G^A.
We also define f_G^{-1} as the inverse mapping of f_G, i.e., f_G^{-1}(z, c) = {w ∈ V : f_G(w) = (z, c)}. We first prove that f_G^{-1} is well-defined on the domain V_{G^A}.
Lemma C.35. f_G is a surjection.

Proof. Suppose that f_G is not a surjection. Then there exists a color c ∈ C such that either f_G^{-1}(u, c) or f_G^{-1}(v, c) is an empty set. Without loss of generality, assume f_G^{-1}(v, c) = ∅; then f_G^{-1}(u, c) ≠ ∅. Pick any w ∈ f_G^{-1}(u, c). Obviously, w ≠ u (otherwise f_G^{-1}(v, χ_G(v)) = ∅, a contradiction). Then we claim that for any x ∈ N_G(w), f_G^{-1}(v, χ_G(x)) is empty. Note that x ∈ f_G^{-1}(u, χ_G(x)). If the claim does not hold, take x′ ∈ f_G^{-1}(v, χ_G(x)). Since x connects to a node with color c and χ_G(x) = χ_G(x′), x′ must also connect to a node with color c. Denote the node with color c that connects to x′ as w′. Then w′ ∈ f_G^{-1}(v, c), yielding a contradiction.

By induction, for any x such that there exists a path from x to w that does not go through the edge {u, v}, we have f_G^{-1}(v, χ_G(x)) = ∅. This finally implies f_G^{-1}(v, χ_G(v)) = ∅, leading to a contradiction. Therefore, f_G is a surjection.
Lemma C.36. |f_G^{-1}(u, χ_G(u))| = |f_G^{-1}(v, χ_G(v))| = 1.

Proof. Pick u′ = arg max_{u′ ∈ f_G^{-1}(u, χ_G(u))} dis_G(u, u′) and similarly pick v′. It follows that any path between u′ and v′ goes through the edge {u, v}. Therefore, dis_G(u′, v′) = dis_G(u, u′) + dis_G(v, v′) + 1. Since all nodes u, u′, v, v′ have the same color under SPD-WL, there exists a node w ∈ χ_G^{-1}(χ_G(u)) satisfying dis_G(u, w) = dis_G(u′, v′), and thus dis_G(u, w) > dis_G(u, u′). By the definition of the node u′, f_G(w) ≠ (u, χ_G(u)) and thus f_G(w) = (v, χ_G(u)). Therefore, dis_G(u, w) = dis_G(v, w) + 1, which implies that

dis_G(v, w) = dis_G(v, v′) + dis_G(u, u′).

Since dis_G(v, w) ≤ dis_G(v, v′), we have dis_G(v, w) = dis_G(v, v′) and u = u′. A similar argument yields v = v′, finishing the proof.
We can now prove some useful properties of the auxiliary graph G^A based on Lemmas C.35 and C.36.

Corollary C.37. For any c_1, c_2 ∈ C, {{(u, c_1), (u, c_2)}} ∈ E_{G^A} if and only if {{(v, c_1), (v, c_2)}} ∈ E_{G^A}.

Proof. By the definition of E_{G^A}, if {{(u, c_1), (u, c_2)}} ∈ E_{G^A}, then there exist two vertices w_1 ∈ f_G^{-1}(u, c_1) and w_2 ∈ f_G^{-1}(u, c_2) such that {w_1, w_2} ∈ E_G. By Lemma C.36, either χ_G(w_1) ≠ χ_G(u) or χ_G(w_2) ≠ χ_G(u). Without loss of generality, assume c_1 ≠ χ_G(u). By Lemma C.35, there exists x_1 ∈ f_G^{-1}(v, c_1). Since χ_G(x_1) = χ_G(w_1), x_1 must also connect to a node x_2 with χ_G(x_2) = c_2. The edge {x_1, x_2} ≠ {u, v} because χ_G(x_1) = c_1 ≠ χ_G(u). Therefore, f_G(x_2) = (v, c_2), namely {{(v, c_1), (v, c_2)}} ∈ E_{G^A}.
The following lemma establishes the distance relationship between the graphs G and G^A.

Lemma C.38. The following holds:

• For any w, w′ ∈ V, dis_G(w, w′) ≥ dis_{G^A}(f_G(w), f_G(w′)).
• For any ξ, ξ′ ∈ V_{G^A} and any node w ∈ f_G^{-1}(ξ), there exists a node w′ ∈ f_G^{-1}(ξ′) such that dis_G(w, w′) = dis_{G^A}(ξ, ξ′).
Proof. The first bullet is trivial, since for all {w, w′} ∈ E_G, {{f_G(w), f_G(w′)}} ∈ E_{G^A} by Definition C.34. We prove the second bullet in the following. Note that G^A can have self-loops, but for any ξ, ξ′ ∈ V_{G^A}, the shortest path between ξ and ξ′ will not go through self-loops. We only need to prove that for all {{ξ, ξ′}} ∈ E_{G^A} with ξ ≠ ξ′ and all w ∈ f_G^{-1}(ξ), there exists w′ ∈ f_G^{-1}(ξ′) such that {w, w′} ∈ E_G. This will imply that for any ξ, ξ′ ∈ V_{G^A} and any node w ∈ f_G^{-1}(ξ), there exists a node w′ ∈ f_G^{-1}(ξ′) such that dis_G(w, w′) ≤ dis_{G^A}(ξ, ξ′), which completes the proof by combining the first bullet of Lemma C.38.

The case of {{ξ, ξ′}} = {{(u, χ_G(u)), (v, χ_G(v))}} is trivial. Now assume that {{ξ, ξ′}} ≠ {{(u, χ_G(u)), (v, χ_G(v))}}. By Definition C.34, there exist x ∈ f_G^{-1}(ξ) and x′ ∈ f_G^{-1}(ξ′) such that {x, x′} ∈ E_G. Note that h_G(x) = h_G(x′), because {x, x′} ≠ {u, v}. Since χ_G(x) = χ_G(w), there exists w′ ∈ χ_G^{-1}(χ_G(x′)) such that {w, w′} ∈ E_G. It must hold that h_G(w) = h_G(w′) (otherwise {w, w′} = {u, v} and thus {{ξ, ξ′}} = {{(u, χ_G(u)), (v, χ_G(v))}}, a contradiction). Therefore, h_G(w′) = h_G(w) = h_G(x) = h_G(x′) and thus f_G(w′) = f_G(x′), namely w′ ∈ f_G^{-1}(ξ′).
Corollary C.39. Let w, w′ ∈ V be two nodes with χ_G(w) = χ_G(w′). The following holds:

• If h_G(w) = h_G(w′), then dis_G(u, w) = dis_G(u, w′) and dis_G(v, w) = dis_G(v, w′).
• If h_G(w) ≠ h_G(w′), say f_G(w) = (u, c) and f_G(w′) = (v, c), then dis_G(u, w) = dis_G(v, w′).

Proof. Proof of the first bullet: by Lemma C.38, there exist two nodes u_1, u_2 ∈ f_G^{-1}(f_G(u)) such that dis_G(u_1, w) = dis_{G^A}(f_G(u), f_G(w)) and dis_G(u_2, w′) = dis_{G^A}(f_G(u), f_G(w′)). Therefore, dis_G(u_1, w) = dis_G(u_2, w′). However, by Lemma C.36 and the condition h_G(w) = h_G(w′), it must be that u_1 = u_2 = u, namely dis_G(u, w) = dis_G(u, w′). The proof of dis_G(v, w) = dis_G(v, w′) is similar.

Proof of the second bullet: let χ_G(w) = χ_G(w′) = c. Without loss of generality, assume f_G(w) = (u, c) and f_G(w′) = (v, c). By Lemma C.38, it suffices to prove that dis_{G^A}((u, χ_G(u)), (u, c)) = dis_{G^A}((v, χ_G(v)), (v, c)). By the definition of G^A and its cut edge {{(u, χ_G(u)), (v, χ_G(v))}}, the shortest path between (u, χ_G(u)) and (u, c) must only go through nodes in the set {(u, c_1) : c_1 ∈ C}, and similarly the shortest path between (v, χ_G(v)) and (v, c) must only go through nodes in {(v, c_2) : c_2 ∈ C}. Finally, Corollary C.37 says that for c_1, c_2 ∈ C, {{(u, c_1), (u, c_2)}} ∈ E_{G^A} if and only if {{(v, c_1), (v, c_2)}} ∈ E_{G^A}. We thus conclude that dis_{G^A}((u, χ_G(u)), (u, c)) = dis_{G^A}((v, χ_G(v)), (v, c)) and dis_G(u, w) = dis_G(v, w′).
Finally, we can prove the following important corollary:

Corollary C.40. For any c ∈ C, |f_G^{-1}(u, c)| = |f_G^{-1}(v, c)|.

Proof. Pick any w ∈ f_G^{-1}(u, c) and x ∈ f_G^{-1}(v, c). By Corollary C.39, we have

dis_G(w, u) = dis_G(x, v) := d,
dis_G(w, v) = dis_G(x, u) = d + 1.

The multiset {{dis_G(u, w′) : χ_G(w′) = c}} contains |f_G^{-1}(u, c)| elements of value d and |f_G^{-1}(v, c)| elements of value d + 1, while the multiset {{dis_G(v, w′) : χ_G(w′) = c}} contains |f_G^{-1}(v, c)| elements of value d and |f_G^{-1}(u, c)| elements of value d + 1. Since u and v have the same color under SPD-WL, the two multisets must be equivalent. Therefore, |f_G^{-1}(u, c)| = |f_G^{-1}(v, c)|.
Next, we switch our focus to the graph H. Since we have assumed that the graph representations of G and H are the same, i.e., {{χ_G(w) : w ∈ V}} = {{χ_H(w) : w ∈ V}}, the size of the set {w ∈ V : χ_H(w) = χ_G(u)} must be 2. We may denote its elements as u and v without abuse of notation, and thus {u, v} ∈ E_H. Also, for any w ∈ V, we have dis_H(w, u) ≠ dis_H(w, v). Therefore, we can similarly define the mapping f_H : V → {u, v} × C and the mapping h_H : V → {u, v} as in (12). The auxiliary graph H^A is defined analogously to Definition C.34.
Lemma C.41. For any c ∈ C, |f_H^{-1}(u, c)| = |f_H^{-1}(v, c)| = |f_G^{-1}(u, c)| = |f_G^{-1}(v, c)|.

Proof. If |f_H^{-1}(u, c)| ≠ |f_H^{-1}(v, c)|, then {{dis_H(u, w) : χ_H(w) = c}} ≠ {{dis_H(v, w) : χ_H(w) = c}}, implying that u and v cannot have the same color under SPD-WL. This already concludes the proof by using Corollary C.40, as

|f_H^{-1}(u, c)| + |f_H^{-1}(v, c)| = |f_G^{-1}(u, c)| + |f_G^{-1}(v, c)|.
We finally present a technical lemma which will be used in the subsequent proof.

Lemma C.42. Given a node w ∈ V and a color c ∈ C, define the multisets

D_{G,=}(w, c) := {{dis_G(w, x) : x ∈ χ_G^{-1}(c), h_G(x) = h_G(w)}},
D_{G,≠}(w, c) := {{dis_G(w, x) : x ∈ χ_G^{-1}(c), h_G(x) ≠ h_G(w)}}.

For any two nodes w, w′ ∈ V in graphs G and H satisfying χ_G(w) = χ_H(w′), pick any d ∈ D_{G,≠}(w, c) and d′ ∈ D_{H,=}(w′, c). Then d′ < d.

Proof. Without loss of generality, assume h_G(w) = h_H(w′) = u and let f_G(w) = f_H(w′) = (u, c_w). Pick x ∈ f_G^{-1}(v, c) and x′ ∈ f_H^{-1}(u, c); then dis_H(x′, u) = min(dis_G(x, u), dis_G(x, v)) and dis_H(w′, u) = min(dis_G(w, u), dis_G(w, v)). Thus

dis_H(w′, x′) ≤ dis_H(w′, u) + dis_H(u, x′)
= min(dis_G(w, u), dis_G(w, v)) + min(dis_G(x, u), dis_G(x, v))
< min(dis_G(w, u) + dis_G(x, v), dis_G(w, v) + dis_G(x, u)) + 1
= dis_G(w, x),

which concludes the proof.
In the following, we will prove that {u, v} is a cut edge in graph H. Consider an edge {{(u, c_1), (v, c_2)}} ∈ E_{H^A} (such an edge exists because {{(u, χ_H(u)), (v, χ_H(v))}} ∈ E_{H^A}). We will prove that this is the only possibility, i.e., it must be that c_1 = χ_H(u) = χ_H(v) = c_2.

By Definition C.34, {{(u, c_1), (v, c_2)}} ∈ E_{H^A} implies that there exist two nodes x′ ∈ f_H^{-1}(u, c_1) and w′ ∈ f_H^{-1}(v, c_2) such that {w′, x′} ∈ E_H. Pick w ∈ χ_G^{-1}(c_2). By Lemma C.42, D_{H,=}(w′, c_1) ∩ D_{G,≠}(w, c_1) = ∅. Since w′ and w have the same color under SPD-WL,

D_{H,=}(w′, c_1) ∪ D_{H,≠}(w′, c_1) = D_{G,=}(w, c_1) ∪ D_{G,≠}(w, c_1).

By Lemma C.41, |D_{H,=}(w′, c_1)| = |D_{H,≠}(w′, c_1)| = |D_{G,=}(w, c_1)| = |D_{G,≠}(w, c_1)|. Therefore, D_{G,≠}(w, c_1) = D_{H,≠}(w′, c_1). We claim that all elements in the multiset D_{G,≠}(w, c_1) are the same. This is because for any x ∈ χ_G^{-1}(c_1) with h_G(x) ≠ h_G(w), we have

dis_G(w, x) = dis_G(w, h_G(w)) + 1 + dis_G(h_G(x), x),

and by Corollary C.39, dis_G(h_G(x), x) takes an equal value for different x. Since {w′, x′} ∈ E_H, we have 1 ∈ D_{H,≠}(w′, c_1), and thus all elements in D_{G,≠}(w, c_1) equal 1. Therefore, c_1 = χ_G(u). Analogously, c_2 = χ_G(u). Therefore, c_1 = χ_H(u) = χ_H(v) = c_2.

Let S_u = {w ∈ V : h_H(w) = u} and S_v = {w ∈ V : h_H(w) = v}. Then the above argument implies that if w ∈ S_u, x ∈ S_v and {w, x} ∈ E_H, then {w, x} = {u, v}. Therefore, {u, v} is a cut edge of graph H.
Proof. Consider the GD-WL procedure defined in Algorithm 4 with an arbitrary distance function d_G. Suppose at iteration t ≥ T, {{χ_G^t(w) : w ∈ V}} ≠ {{χ_H^t(w) : w ∈ V}}. Then at iteration t + 1, we have for each v ∈ V,

χ_G^{t+1}(v) = hash({{hash(d_G(v, u), χ_G^t(u)) : u ∈ V}}).
The above lemma implies that if there exist edges {w_1, w_2} ∈ E_G and {x_1, x_2} ∈ E_H satisfying {{χ_G(w_1), χ_G(w_2)}} = {{χ_H(x_1), χ_H(x_2)}}, then {{χ_G(w) : w ∈ V}} = {{χ_H(w) : w ∈ V}}. Also, SPD-WL ensures that the two graphs are either both connected or both disconnected. If they are both connected, the previous proof (Appendices C.4.1 and C.4.2) ensures that {w_1, w_2} is a cut edge of G if and only if {x_1, x_2} is a cut edge of H. For the disconnected case, let S_G ⊂ V be the connected component containing nodes w_1, w_2, and similarly denote S_H ⊂ V as the connected component containing nodes x_1, x_2. Obviously, |S_G| = |S_H|, due to the facts that dis_G(w_1, y) = ∞ ≠ dis_G(w_1, y′) for all y ∉ S_G, y′ ∈ S_G and that the two edges {w_1, w_2} ∈ E_G, {x_1, x_2} ∈ E_H have the same color under SPD-WL. Moreover, {{χ_G(w) : w ∈ S_G}} = {{χ_H(w) : w ∈ S_H}}. Now consider re-executing the SPD-WL algorithm on the subgraphs G[S_G] and H[S_H] induced by the vertex sets S_G and S_H, respectively. It follows that for any u_G ∈ S_G and u_H ∈ S_H, χ_G(u_G) = χ_H(u_H) implies χ_{G[S_G]}(u_G) = χ_{H[S_H]}(u_H). Therefore, {w_1, w_2} is a cut edge of G[S_G] if and only if {x_1, x_2} is a cut edge of H[S_H]. By the definition of S_G and S_H, {w_1, w_2} is a cut edge of G if and only if {x_1, x_2} is a cut edge of H.
It remains to prove that {{χG (w) : w ∈ V}} = {{χH (w) : w ∈ V}} implies BCETree(G) ≃
BCETree(H). By definition of the block cut-edge tree, each cut edge of G corresponds to
a tree edge in BCETree(G) and each biconnected component of G corresponds to a node of
BCETree(G). We still only focus on the case of connected graphs G, H, and it is straightfor-
ward to extend the proof to the general (disconnected) case using a similar technique as the previous
paragraph.
Given a fixed SPD-WL graph representation R, consider any graph G = (V, E_G) satisfying {{χ_G(w) : w ∈ V}} = R. Since we have proved that the SPD-WL node features χ_G(v), v ∈ V precisely locate all the cut edges, the multiset

C^E := {{{χ_G(u), χ_G(v)} : {u, v} ∈ E_G is a cut edge}}
is fixed (fully determined by R, not by G). Denote C^V := ∪_{{c_1,c_2} ∈ C^E} {c_1, c_2} as the set that contains the colors of the endpoints of all cut edges. For each cut edge {u, v} ∈ E_G, let S_{G,u} and S_{G,v} be the vertex partition corresponding to the two connected components obtained after removing the edge {u, v}, satisfying u ∈ S_{G,u}, v ∈ S_{G,v}, S_{G,u} ∩ S_{G,v} = ∅, S_{G,u} ∪ S_{G,v} = V. It suffices to prove that, given a cut edge {u, v} ∈ E_G with color {χ_G(u), χ_G(v)}, the multiset {{χ_G(w) : w ∈ S_{G,u}, χ_G(w) ∈ C^V}} can be determined purely from R and the edge color {{χ_G(u), χ_G(v)}}, rather than from the specific graph G or edge {u, v}. This essentially concludes the proof, since the BCETree can then be uniquely constructed as follows: if {{χ_G(w) : w ∈ S_{G,u}, χ_G(w) ∈ C^V}} = {{χ_G(u)}} (i.e., it has only one element), then {{χ_G(u), χ_G(v)}} is a leaf edge of the BCETree, in the sense that χ_G(u) connects to a biconnected component that is a leaf of the BCETree. After finding all the leaf edges, we can then find the BCETree edges that connect to leaf edges and determine which leaf edges they connect to. The procedure can be recursively executed until the full BCETree is constructed. The whole procedure does not depend on the specific graph G and only depends on R.
We now show how to determine {{χ_G(w) : w ∈ S_{G,u}, χ_G(w) ∈ C^V}} given a cut edge {u, v} ∈ E_G with color {χ_G(u), χ_G(v)}. Define the multiset

D(c_1, c_2) := {{dis_G(w, x) : x ∈ χ_G^{-1}(c_2)}}   (w ∈ χ_G^{-1}(c_1) can be picked arbitrarily).

Note that D(c_1, c_2) is well-defined (it does not depend on the choice of w) by the definition of the SPD-WL colors. For any c_u, c_v ∈ C^E, pick an arbitrary cut edge {u, v} with colors χ_G(u) = c_u, χ_G(v) = c_v. Define

T(c_u, c_v) = ∪_{c ∈ C^V} {{c}} × |(D(c_u, c) + 1) ∩ D(c_v, c)|,   (13)

where {{c}} × m denotes a multiset with m repeated elements c, and D(c_u, c) + 1 := {{d + 1 : d ∈ D(c_u, c)}}. Intuitively speaking, T(c_u, c_v) collects the colors of all nodes w ∈ V such that dis_G(u, w) + 1 = dis_G(v, w) and χ_G(w) ∈ C^V. Therefore, T(c_u, c_v) is exactly the multiset {{χ_G(w) : w ∈ S_{G,u}, χ_G(w) ∈ C^V}}, and we have completed the proof.
Theorem C.44. Let G = (V, E_G) and H = (V, E_H) be two graphs, and let χ_G and χ_H be the corresponding RD-WL color mappings. Then the following holds:
• For any two nodes w ∈ V in G and x ∈ V in H, if χG (w) = χH (x), then w is a cut vertex
of G if and only if x is a cut vertex of H.
• If the graph representations of G and H are the same under RD-WL, then their block
cut-vertex trees (Definition 2.4) are isomorphic. Mathematically, {{χG (w) : w ∈ V}} =
{{χH (w) : w ∈ V}} implies that BCVTree(G) ≃ BCVTree(H).
Proof Sketch. First observe that Lemma C.43 holds for general distances and thus applies here.
Therefore, if χG (w) = χH (x), the graph representations will be the same, i.e. {{χG (w) : w ∈
V}} = {{χH (w) : w ∈ V}}. By a similar analysis as SPD-WL (Appendix C.4.3), we can only
focus on the case that both graphs are connected. We prove the first bullet of Theorem 4.2 in
Appendix C.5.1 and prove the second bullet in Appendix C.5.2, both assuming that G and H are
connected and their graph representations are the same.
Proof. We use the key finding that the Resistance Distance is equivalent to the Commute Time Distance multiplied by a constant (Chandra et al., 1996, see also Appendix E.2), i.e., dis^C_G(u, w) = 2|E| dis^R_G(u, w). Here, the Commute Time Distance is defined as dis^C_G(u, w) := h_G(u, w) + h_G(w, u), where h_G(u, w) is the average hitting time from u to w in a random walk (Appendix E.2).
• If v is not a cut vertex, then given any nodes u, w with u ≠ v, w ≠ v, we can partition the set P_uw of all hitting paths from u to w (not necessarily simple) into two sets P^v_uw and P̄^v_uw, such that all paths P ∈ P^v_uw contain v and no path P ∈ P̄^v_uw contains v. Clearly, P̄^v_uw ≠ ∅. Given a path P = (x_0, · · · , x_m), define the probability function q(P) := 1/∏_{i=0}^{m−1} deg_G(x_i). Then, by the definition of the average hitting time h,

h_G(u, w) = Σ_{P ∈ P_uw} q(P)·|P| = Σ_{P ∈ P^v_uw} q(P)·|P| + Σ_{P ∈ P̄^v_uw} q(P)·|P|
= Σ_{P_1 ∈ P̄^w_uv, P_2 ∈ P_vw} q(P_1)q(P_2)(|P_1| + |P_2|) + Σ_{P ∈ P̄^v_uw} q(P)·|P|
= (Σ_{P_1 ∈ P̄^w_uv} q(P_1)|P_1|)(Σ_{P_2 ∈ P_vw} q(P_2)) + (Σ_{P_2 ∈ P_vw} q(P_2)|P_2|)(Σ_{P_1 ∈ P̄^w_uv} q(P_1)) + Σ_{P ∈ P̄^v_uw} q(P)·|P|
≤ Σ_{P ∈ P̄^w_uv} q(P)|P| + Σ_{P ∈ P_vw} q(P)|P| + Σ_{P ∈ P̄^v_uw} q(P)|P|
< Σ_{P ∈ P_uv} q(P)|P| + Σ_{P ∈ P_vw} q(P)|P| + Σ_{P ∈ P̄^v_uw} q(P)|P|
Proof. Let u_i = arg max_{u′ ∈ χ_G^{-1}(χ_G(u))} dis^R_G(u, u′). If u_i = u, then S_i ∩ χ_G^{-1}(χ_G(u)) = ∅ for all i ∈ [m], and thus Lemma C.46 clearly holds. Otherwise, u_i ∈ S_i for some i. We will prove that for any j ≠ i, S_j ∩ χ_G^{-1}(χ_G(u)) = ∅.

If the above conclusion does not hold, then we can pick a set S_j and a vertex u_j ∈ S_j ∩ χ_G^{-1}(χ_G(u)). Since u is a cut vertex and S_i, S_j are different connected components, by Lemma C.45 we have dis^R_G(u_i, u_j) = dis^R_G(u_i, u) + dis^R_G(u, u_j) > dis^R_G(u_i, u). This yields a contradiction, because then max_{u′ ∈ χ_G^{-1}(χ_G(u))} dis^R_G(u, u′) ≠ max_{u′ ∈ χ_G^{-1}(χ_G(u_i))} dis^R_G(u_i, u′), which means that u and u_i cannot have the same RD-WL color.
The next lemma presents a key result which is similar to Corollary C.30.

Lemma C.47. Every u′ ∈ χ_G^{-1}(χ_G(u)) is a cut vertex of G.
Proof. If |χ_G^{-1}(χ_G(u))| = 1, then Lemma C.47 clearly holds. Otherwise, by Lemma C.46 there exist two sets S_i and S_j satisfying S_i ∩ χ_G^{-1}(χ_G(u)) ≠ ∅ and S_j ∩ χ_G^{-1}(χ_G(u)) = ∅. Since S_j ≠ ∅, we can pick w ∈ S_j with color χ_G(w) ≠ χ_G(u). Pick u′ ∈ S_i ∩ χ_G^{-1}(χ_G(u)). Since χ_G(u) = χ_G(u′), there exists a node w′ ∈ χ_G^{-1}(χ_G(w)) such that dis^R_G(u, w) = dis^R_G(u′, w′). Then we have

{{dis^R_G(w, u″) : u″ ∈ χ_G^{-1}(χ_G(u))}} = {{dis^R_G(w, u) + dis^R_G(u, u″) : u″ ∈ χ_G^{-1}(χ_G(u))}}   (14)
= {{dis^R_G(w′, u′) + dis^R_G(u′, u″) : u″ ∈ χ_G^{-1}(χ_G(u))}},   (15)

where (14) holds because u is a cut vertex and all u″ ≠ u are in the set S_i while w ∈ S_j (Lemma C.46), and (15) holds because χ_G(u) = χ_G(u′). On the other hand,

{{dis^R_G(w, u″) : u″ ∈ χ_G^{-1}(χ_G(u))}} = {{dis^R_G(w′, u″) : u″ ∈ χ_G^{-1}(χ_G(u))}}.

Therefore, dis^R_G(w′, u″) = dis^R_G(w′, u′) + dis^R_G(u′, u″) for all u″ ∈ χ_G^{-1}(χ_G(u)). Pick u″ = u; then clearly u″ ≠ u′ and u″ ≠ w′. Lemma C.45 shows that u′ is a cut vertex, which concludes the proof. See Figure 11 for an illustration of the above proof.
Proof. Proof of the first bullet: since i ∈ M_G(u), any path from w to a node u_G ∈ χ_G^{-1}(χ_G(u)) goes through the cut vertex u, implying min_{u_G ∈ χ_G^{-1}(χ_G(u))} dis^R_G(w, u_G) = dis^R_G(w, u). Similarly,

and w′ are the same under RD-WL, we have

min_{u_H ∈ χ_H^{-1}(χ_H(u′))} dis^R_H(w′, u_H) = min_{u_G ∈ χ_G^{-1}(χ_G(u))} dis^R_G(w, u_G).
Proof. Assume {{χ_G(w) : w ∈ S_{G,i}(u)}} ∩ {{χ_H(w) : w ∈ S_{H,j}(u′)}} ≠ ∅. Then there exist nodes w ∈ S_{G,i}(u) in G and w′ ∈ S_{H,j}(u′) in H satisfying χ_G(w) = χ_H(w′). Our goal is to prove that {{χ_G(w) : w ∈ S_{G,i}(u)}} = {{χ_H(w) : w ∈ S_{H,j}(u′)}}. It thus suffices to prove that for any color c ∈ C, |χ_G^{-1}(c) ∩ S_{G,i}(u)| = |χ_H^{-1}(c) ∩ S_{H,j}(u′)|.

Define D_G(w, c) = {{dis^R_G(w, x) : x ∈ χ_G^{-1}(c)}} and define D_G(w, c) + d := {{d + d′ : d′ ∈ D_G(w, c)}}. We next claim that

|χ_G^{-1}(c) ∩ S_{G,i}(u)| = |χ_G^{-1}(c)| − |D_G(w, c) ∩ (D_G(u, c) + dis^R_G(w, u))|.

This is simply because for any x ∈ χ_G^{-1}(c), either x ∈ S_{G,i}(u) or x ∉ S_{G,i}(u). If x ∉ S_{G,i}(u), then dis^R_G(w, x) = dis^R_G(w, u) + dis^R_G(u, x) (Lemma C.45); otherwise, dis^R_G(w, x) ≠ dis^R_G(w, u) + dis^R_G(u, x). Similarly,

|χ_H^{-1}(c) ∩ S_{H,j}(u′)| = |χ_H^{-1}(c)| − |D_H(w′, c) ∩ (D_H(u′, c) + dis^R_H(w′, u′))|.
Lemma C.48 and Corollary C.49 lead to the following key corollary:

Corollary C.51. Let u ∈ V be a vertex in G and u′ ∈ V be a vertex in H. If χ_G(u) = χ_H(u′), then m_G(u) = m_H(u′) and

{{{{χ_G(w) : w ∈ S_{G,i}(u)}}}}_{i=1}^{m_G(u)} = {{{{χ_H(w) : w ∈ S_{H,i}(u′)}}}}_{i=1}^{m_H(u′)}.
Proof. If both u and u′ are not cut vertices, Corollary C.51 trivially holds, since m_G(u) = m_H(u′) = 1 and S_{G,1}(u) = V\{u}, S_{H,1}(u′) = V\{u′}. Now assume u and u′ are both cut vertices. We first claim that

{{χ_G(w) : w ∈ ∪_{i∈M_G(u)} S_{G,i}(u)}} = {{χ_H(w) : w ∈ ∪_{i∈M_H(u′)} S_{H,i}(u′)}}.   (16)

To prove the claim, it suffices to prove that for each color c ∈ C,

|(∪_{i∈M_G(u)} S_{G,i}(u)) ∩ χ_G^{-1}(c)| = |(∪_{i∈M_H(u′)} S_{H,i}(u′)) ∩ χ_H^{-1}(c)|.   (17)
We are now ready to prove that {{χ_G(w) : w ∈ V}} = {{χ_H(w) : w ∈ V}} implies BCVTree(G) ≃ BCVTree(H). Recall that in a block cut-vertex tree BCVTree(G), there are two types of nodes: the cut vertices of G and the biconnected components of G. Each edge in BCVTree(G) connects a cut vertex u ∈ V and a biconnected component B ⊂ V such that u ∈ B.
Given a fixed RD-WL graph representation R, consider any graph G = (V, EG ) satisfying
{{χG (w) : w ∈ V}} = R. First, all cut vertices of G can be determined purely from R using the
node colors. We denote the cut vertex color multiset as C V := {{χG (u) : u is a cut vertex of G}}.
Next, the number mG (u) for each cut vertex u can be determined only by its color χG (u) (by Corol-
lary C.51), which is equal to the degree of node u in BCVTree(G). We now give a procedure to
construct BCVTree(G), which purely depends on R rather than the specific graph G.
We examine the multisets T(u) := {{{{χ_G(w) : w ∈ S_{G,i}(u)}}}}_{i=1}^{m_G(u)} for all cut vertices u, which only depend on R and χ_G(u) rather than on the specific graph G or node u, by Corollary C.51. See Figure 12(b) for an illustration of T(u) for four types of cut vertices u. In the first step, we find all cut vertices u such that Σ_{S∈T(u)} 1[C^V ∩ S ≠ ∅] ≤ 1, where 1[·] is the indicator function. In other words, we find cut vertices u such that there is at most one connected component S_{G,i}(u) that contains cut vertices. These cut vertices u will serve as “leaf (cut vertex) nodes” in BCVTree(G), in the sense that each of them connects to at most one internal node in BCVTree(G). The number of BCVTree leaf
Figure 12: Illustrations for constructing the BCVTree given the graph representation R. (a) The original graph. (b) Illustration of the multisets T(u) for each cut vertex u. (c) The first step. (d) The second step. (e) The third step. (f) The final step.
nodes that connect to u is also determined by Corollary C.51. See Figure 12(c) for an illustration. After finding all the “leaf (cut vertex) nodes”, we can then find the cut vertex nodes v that would serve as “leaf (cut vertex) nodes” once all current “leaf (cut vertex) nodes” are removed from the BCVTree. To do this, we compute, for each cut vertex v and each biconnected component B_v associated with v, whether B_v has no cut vertex or all cut vertices in B_v correspond to “leaf (cut vertex) nodes” in BCVTree(G). Then, we check whether a cut vertex v satisfies Σ_{S∈T(v)} 1[(C^V ∩ S)\C_u^V ≠ ∅] ≤ 1, where the set C_u^V contains all colors corresponding to “leaf (cut vertex) nodes”. These vertices v will serve as new “leaf (cut vertex) nodes” when all current “leaf (cut vertex) nodes” are removed from the BCVTree, and the connections between such vertices v and the “leaf (cut vertex) nodes” can also be determined (see Figure 12(d) for an illustration). The procedure can be recursively executed until the full BCVTree is constructed (see Figure 12(f)); the whole procedure does not depend on the specific graph G and only depends on R, which completes the proof.
Given a graph G = (V, E), let χtG be the 2-FWL color mapping after the t-th iteration (see Algo-
rithm 2 for details), and let χG be the stable 2-FWL color mapping. The following result is useful
for the subsequent proof:
Lemma C.52. Let u1 , u2 , v1 , v2 ∈ V be nodes in graph G and t be an integer. The following holds:
• If χtG (u1 , v1 ) = χtG (u2 , v2 ) and t ≥ 1, then degG (u1 ) = degG (u2 ) and degG (v1 ) =
degG (v2 ).
Proof. By the initial coloring (6) of 2-FWL, χ_G^0(u, v) can take the following three types of values:

χ_G^0(u, v) = c_same if u = v;  c_edge if u ≠ v and {u, v} ∈ E;  c_other if u ≠ v and {u, v} ∉ E,

where c_same, c_edge, c_other are three different colors. Therefore, if χ_G^0(u_1, v_1) = χ_G^0(u_2, v_2), then u_1 = v_1 if and only if u_2 = v_2, and {u_1, v_1} ∈ E if and only if {u_2, v_2} ∈ E. For the update step, if χ_G^1(u_1, v_1) = χ_G^1(u_2, v_2), then (19) implies that {{χ_G^0(u_1, w) : w ∈ V}} = {{χ_G^0(u_2, w) : w ∈ V}} and thus |{w ∈ V : {u_1, w} ∈ E}| = |{w ∈ V : {u_2, w} ∈ E}|, namely deg_G(u_1) = deg_G(u_2). We can similarly prove that deg_G(v_1) = deg_G(v_2).

Finally, note that χ_G^t(u_1, v_1) = χ_G^t(u_2, v_2) implies χ_G^{t−1}(u_1, v_1) = χ_G^{t−1}(u_2, v_2) using (19). This concludes the proof of the case t ≥ 1 by a simple induction.
For a path P = (x0 , · · · , xd ) (not necessarily simple) in graph G of length d ≥ 1, define ω(P ) :=
(degG (x1 ), · · · , degG (xd−1 )) which is a tuple of length d − 1. We have the following key lemma:
Lemma C.53. Let t ∈ N be a non-negative integer. Given nodes u1 , u2 , v1 , v2 ∈ V, if χtG (u1 , v1 ) =
χtG (u2 , v2 ), then the following holds:
• Let P_d(u, v) denote the set of all paths (not necessarily simple) from node u to node v of length d. Then |P_{t+1}(u_1, v_1)| = |P_{t+1}(u_2, v_2)|.
• Let Q_d(u, v) denote the set of all hitting paths (not necessarily simple) from node u to node v of length d. Then {{ω(Q) : Q ∈ Q_{t+1}(u_1, v_1)}} = {{ω(Q) : Q ∈ Q_{t+1}(u_2, v_2)}}, and {{ω(Q) : Q ∈ Q_{t+1}(v_1, u_1)}} = {{ω(Q) : Q ∈ Q_{t+1}(v_2, u_2)}}.
Proof. We prove the lemma by induction over the iteration t. We first prove the base case t = 0.

• If u_1 = v_1, then by Lemma C.52, u_2 = v_2. Note that obviously |P_1(u, u)| = 0 and |Q_1(u, u)| = 0 for any node u, namely |P_1(u_1, u_1)| = |P_1(u_2, u_2)| and Q_1(u_1, u_1) = Q_1(u_2, u_2) = ∅.
• Similarly, if u_1 ≠ v_1 and {u_1, v_1} ∉ E, then by Lemma C.52, u_2 ≠ v_2 and {u_2, v_2} ∉ E. We also have |P_1(u_1, v_1)| = |P_1(u_2, v_2)| = 0 and Q_1(u_1, v_1) = Q_1(u_2, v_2) = ∅.
• If u_1 ≠ v_1 and {u_1, v_1} ∈ E, then by Lemma C.52, u_2 ≠ v_2 and {u_2, v_2} ∈ E. Then |P_1(u_1, v_1)| = |P_1(u_2, v_2)| = 1, and Q_1(u_1, v_1) and Q_1(u_2, v_2) each contain a single path whose ω-value is the empty (0-dimensional) tuple.
Now suppose that the conclusion of Lemma C.53 holds at iteration t; we will prove that it also holds at iteration t + 1. First note that for any two nodes u, v, |P_{t+2}(u, v)| = Σ_{w∈N_G(v)} |P_{t+1}(u, w)|. If χ_G^{t+1}(u_1, v_1) = χ_G^{t+1}(u_2, v_2), then by the definition of the 2-FWL update formula (19),

{{(χ_G^t(u_1, w), χ_G^t(w, v_1)) : w ∈ V}} = {{(χ_G^t(u_2, w), χ_G^t(w, v_2)) : w ∈ V}},

which implies that {{χ_G^t(u_1, w) : w ∈ N_G(v_1)}} = {{χ_G^t(u_2, w) : w ∈ N_G(v_2)}} due to Lemma C.52. Therefore,

• By induction, {{|P_{t+1}(u_1, w)| : w ∈ N_G(v_1)}} = {{|P_{t+1}(u_2, w)| : w ∈ N_G(v_2)}}. It follows that Σ_{w∈N_G(v_1)} |P_{t+1}(u_1, w)| = Σ_{w∈N_G(v_2)} |P_{t+1}(u_2, w)| and thus |P_{t+2}(u_1, v_1)| = |P_{t+2}(u_2, v_2)|.
• By induction, {{(χ_G^t(u_1, w), χ_G^t(w, v_1), {{ω(Q) : Q ∈ Q_{t+1}(w, v_1)}}) : w ∈ N_G(u_1)}} = {{(χ_G^t(u_2, w), χ_G^t(w, v_2), {{ω(Q) : Q ∈ Q_{t+1}(w, v_2)}}) : w ∈ N_G(u_2)}}. Since Lemma C.52 says that χ_G^t(w, v) ≠ χ_G^t(v, v) if w ≠ v, we have

{{(χ_G^t(u_1, w), {{ω(Q) : Q ∈ Q_{t+1}(w, v_1)}}) : w ∈ N_G(u_1)\{v_1}}} = {{(χ_G^t(u_2, w), {{ω(Q) : Q ∈ Q_{t+1}(w, v_2)}}) : w ∈ N_G(u_2)\{v_2}}}.

Further using the third bullet of Lemma C.52 and rearranging the two multisets yields

{{(deg_G(w), ω(Q)) : w ∈ N_G(u_1)\{v_1}, Q ∈ Q_{t+1}(w, v_1)}} = {{(deg_G(w), ω(Q)) : w ∈ N_G(u_2)\{v_2}, Q ∈ Q_{t+1}(w, v_2)}}.

Equivalently, {{ω(Q) : Q ∈ Q_{t+2}(u_1, v_1)}} = {{ω(Q) : Q ∈ Q_{t+2}(u_2, v_2)}}. We can similarly prove that {{ω(Q) : Q ∈ Q_{t+2}(v_1, u_1)}} = {{ω(Q) : Q ∈ Q_{t+2}(v_2, u_2)}}.
This concludes the proof of the induction step.
Proof. If χ_G(u_1, v_1) = χ_G(u_2, v_2), then χ_G^t(u_1, v_1) = χ_G^t(u_2, v_2) holds for all t ≥ 0. By Lemma C.53, |P_t(u_1, v_1)| = |P_t(u_2, v_2)| holds for all t ≥ 0 (the case t = 0 trivially holds). Since dis_G(u_1, v_1) = min{t : |P_t(u_1, v_1)| > 0}, we conclude that dis_G(u_1, v_1) = dis_G(u_2, v_2). As for the Resistance Distance dis^R_G, it is equivalent to the Commute Time Distance multiplied by a constant (Chandra et al., 1996, see also Appendix E.2), i.e., dis^C_G(u, w) = 2|E| dis^R_G(u, w). Note that dis^C_G(u, v) = Σ_{i=0}^∞ i · (Σ_{P∈Q_i(u,v)} q(P) + Σ_{P∈Q_i(v,u)} q(P)), where Q_i(u, v) is the set containing all hitting paths of length i from u to v and q(P) = 1/(deg_G(x_0) ∏_{i=1}^{d−1} deg_G(x_i)) for a path P = (x_0, · · · , x_d). By Lemma C.53, we have Σ_{P∈Q_i(u_1,v_1)} q(P) = Σ_{P∈Q_i(u_2,v_2)} q(P) and Σ_{P∈Q_i(v_1,u_1)} q(P) = Σ_{P∈Q_i(v_2,u_2)} q(P) for all i ≥ 0 (the case i = 0 trivially holds), and thus dis^C_G(u_1, v_1) = dis^C_G(u_2, v_2), namely dis^R_G(u_1, v_1) = dis^R_G(u_2, v_2).
Proof. Note that by definition (see Appendix B.2), we have χG (v) := χG (v, v) for any node v ∈ V.
If χG (v1 ) = χG (v2 ), then by definition of 2-FWL aggregation formula,
{{(χG (v1 , w), χG (w, v1 )) : w ∈ V}} = {{(χG (v2 , w), χG (w, v2 )) : w ∈ V}}.
Using Lemma C.6, if χG (v1 , w1 ) = χG (v2 , w2 ) for some nodes w1 and w2 , then χG (w1 ) =
χG (w2 ). Therefore, by using Corollary C.54 we obtain that if χG (v1 ) = χG (v2 ), then
{{(χ_G(w), dis_G(w, v_1)) : w ∈ V}} = {{(χ_G(w), dis_G(w, v_2)) : w ∈ V}},
{{(χ_G(w), dis^R_G(w, v_1)) : w ∈ V}} = {{(χ_G(w), dis^R_G(w, v_2)) : w ∈ V}}.
Finally, the following proposition trivially holds and will be used to prove Corollary 4.6.
Proposition C.56. Given a graph G = (V, EG ), let χG and χ̃G be two color mappings induced by
two different (general) color refinement algorithms, respectively. If the vertex partition induced by
the mapping χG is finer than that of χ̃G , then:
• The mapping χG can distinguish cut vertices/edges if χ̃G can distinguish cut vertices/edges;
• The mapping χG can distinguish the isomorphism type of BCVTree(G)/BCETree(G) if χ̃G
can distinguish the isomorphism type of BCVTree(G)/BCETree(G).
C.7 PROOF OF THEOREM 4.7
In this subsection, we give more fine-grained theoretical results on the expressiveness upper bound
of GD-WL by considering the special problem of distinguishing distance-regular graphs, a class of
hard graphs that are highly relevant to the GD-WL framework. We provide a full characterization of
what types of distance-regular graphs different GD-WL algorithms can or cannot distinguish, with
both proofs and counterexamples.
Given a graph G = (V, E), let N_G^i(u) = {w ∈ V : dis_G(u, w) = i} be the i-hop neighbors of u in G, and let D(G) := max_{u,v∈V} dis_G(u, v) be the diameter of G. We say G is distance-regular if for all i, j ∈ [D(G)] and all nodes u, v, w, x ∈ V with dis_G(u, v) = dis_G(w, x), we have |N_G^i(u) ∩ N_G^j(v)| = |N_G^i(w) ∩ N_G^j(x)|. From the definition, it is straightforward to see that for all u, v ∈ V and i ∈ [D(G)], |N_G^i(u)| = |N_G^i(v)|, i.e., the number of i-hop neighbors is the same for all nodes. We thus denote by κ(G) = (k_1, · · · , k_{D(G)}) the k-hop-neighbor array, where k_i := |N_G^i(u)| with u ∈ V chosen arbitrarily. We next define another important array:

Definition C.57. (Intersection array) The intersection array of a distance-regular graph G is denoted as ι(G) = {b_0, · · · , b_{D(G)−1}; c_1, · · · , c_{D(G)}}, where b_i = |N_G(u) ∩ N_G^{i+1}(v)| and c_i = |N_G(u) ∩ N_G^{i−1}(v)| with dis_G(u, v) = i.
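Both arrays can be read off an all-pairs distance table. The following sketch (our own helpers, assuming networkx, a connected input, and that the graph really is distance-regular, so the counts do not depend on the chosen pair) computes κ(G) and ι(G):

    import networkx as nx

    def kappa_and_iota(G):
        dist = dict(nx.all_pairs_shortest_path_length(G))
        diam = max(max(d.values()) for d in dist.values())
        u0 = next(iter(G))
        kappa = [sum(1 for w in G if dist[u0][w] == i) for i in range(1, diam + 1)]
        b, c = [], []
        for i in range(diam + 1):
            # Any pair at distance i works for a distance-regular graph.
            u, v = next((x, y) for x in G for y in G if dist[x][y] == i)
            b.append(sum(1 for w in G[u] if dist[w][v] == i + 1))
            c.append(sum(1 for w in G[u] if dist[w][v] == i - 1))
        return kappa, b[:-1], c[1:]   # kappa, (b_0..b_{D-1}), (c_1..c_D)

    # Example: the Petersen graph has kappa = (3, 6) and iota = {3, 2; 1, 1}.
    print(kappa_and_iota(nx.petersen_graph()))  # ([3, 6], [3, 2], [1, 1])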
Theorem C.58 precisely characterizes the equivalence classes of all distance-regular graphs for different types of algorithms. Combined with the fact that ι(G) = ι(H) implies κ(G) = κ(H) (see e.g. van Dam et al. (2014, page 8)), we immediately arrive at the following corollary:

Corollary C.59. RD-WL is strictly more powerful than SPD-WL in distinguishing non-isomorphic distance-regular graphs. Moreover, RD-WL is as powerful as 2-FWL in distinguishing non-isomorphic distance-regular graphs.
Proof of the first item of Theorem C.58. This part is straightforward. Consider the SPD-WL color mapping χ_G^1 of graph G after the first iteration. Then for two graphs G, H with n nodes, χ_G^1(u) = χ_H^1(v) if and only if |N_G^i(u)| = |N_H^i(v)| for all i ∈ [n − 1]. Therefore, if κ(G) ≠ κ(H), then for any node u in G and v in H, |N_G^j(u)| ≠ |N_H^j(v)| holds for some j ∈ [max(D(G), D(H))], and thus χ_G^1(u) ≠ χ_H^1(v). Namely, χ_G(u) ≠ χ_H(v) for all nodes u in G and v in H, implying that SPD-WL can distinguish the two graphs. On the other hand, if κ(G) = κ(H), then for any node u in G and v in H we have χ_G^1(u) = χ_H^1(v). Similarly, χ_G^t(u) = χ_H^t(v) for any iteration t ∈ ℕ, and thus SPD-WL cannot distinguish the two graphs.
Proof of the second item of Theorem C.58. The key insight is that, given a distance-regular graph, the resistance distance between a pair of nodes (u, v) only depends on their SPD. Formally, for any nodes u, v, w, x in a distance-regular graph G, dis_G(u, v) = dis_G(w, x) implies dis^R_G(u, v) = dis^R_G(w, x). Actually, we have the following stronger result:
Theorem C.61. For any two nodes u, v in a connected distance-regular graph G, dis^R_G(u, v) = r_{dis_G(u,v)}, where the sequence {r_d}_{d=0}^{D(G)} is recursively defined as follows:

r_0 = 0, and r_d = r_{d−1} + (2/(n k_{d−1} b_{d−1})) Σ_{i=d}^{D(G)} k_i for d ∈ [D(G)],   (20)

where ι(G) = {b_0, · · · , b_{D(G)−1}; c_1, · · · , c_{D(G)}} is the intersection array of G, κ(G) = (k_1, · · · , k_{D(G)}) is its k-hop-neighbor array, and k_0 := 1.
Proof. Let R ∈ ℝ^{n×n} be the RD matrix. Based on Theorem E.1, R can be expressed as R = diag(M)11ᵀ + 11ᵀ diag(M) − 2M, where M = (L + (1/n)11ᵀ)^{−1} and L is the graph Laplacian matrix. Now let R̃ = [r_{dis_G(u,v)}]_{u,v∈V} be the matrix with elements defined in (20). The key step is to prove that 2M = c11ᵀ − R̃ for some c ∈ ℝ; we can then substitute this into the expression of R to obtain

R = (1/2)(diag(c11ᵀ − R̃)11ᵀ + 11ᵀ diag(c11ᵀ − R̃)) − c11ᵀ + R̃ = R̃

(since diag(R̃) = O) and finish the proof.

We now prove 2M = c11ᵀ − R̃ for some c ∈ ℝ, namely (L + (1/n)11ᵀ)(c11ᵀ − R̃) = 2I. Note that R̃ is a symmetric matrix and satisfies R̃1 = c_1·1 for some c_1 ∈ ℝ because G is distance-regular. Combined with the fact that L1 = 0, we have

(L + (1/n)11ᵀ)(c11ᵀ − R̃) = (c − c_1/n)11ᵀ − LR̃.

It thus suffices to prove that LR̃ = c′11ᵀ − 2I for some c′ ∈ ℝ. Let us calculate each element [LR̃]_{uv} (u, v ∈ V). We have

[LR̃]_{uv} = k_1 r_{dis_G(u,v)} − Σ_{d=0}^{D(G)} r_d |N_G(u) ∩ N_G^d(v)|.   (21)

• u = v. In this case, only the d = 1 term in (21) is non-zero, with |N_G(u) ∩ N_G^1(u)| = k_1. Thus [LR̃]_{uv} = −k_1 r_1 = −2(n−1)/n, by using b_0 = k_1, k_0 = 1, and Σ_{i=1}^{D(G)} k_i = n − 1.

• u ≠ v and dis_G(u, v) < D(G). Denote j = dis_G(u, v). In this case, the term N_G(u) ∩ N_G^d(v) in (21) is non-empty only when d ∈ {j − 1, j, j + 1}, and by the definition of the intersection array we have |N_G(u) ∩ N_G^{j−1}(v)| = c_j, |N_G(u) ∩ N_G^{j+1}(v)| = b_j, and |N_G(u) ∩ N_G^j(v)| = |N_G(u)| − c_j − b_j = k_1 − c_j − b_j. Therefore,

[LR̃]_{uv} = k_1 r_j − r_{j−1} c_j − r_j(k_1 − b_j − c_j) − r_{j+1} b_j
= c_j(r_j − r_{j−1}) + b_j(r_j − r_{j+1})
= (2c_j/(n k_{j−1} b_{j−1})) Σ_{i=j}^{D(G)} k_i − (2b_j/(n k_j b_j)) Σ_{i=j+1}^{D(G)} k_i
= (2/(n k_j))(Σ_{i=j}^{D(G)} k_i − Σ_{i=j+1}^{D(G)} k_i) = 2/n,

where in the second last step we use the recursive relation of r_j, and in the last step we use the fact that k_j = k_{j−1} b_{j−1}/c_j for any j ∈ [D(G)] (see e.g. van Dam et al. (2014, page 8)).

• u ≠ v and dis_G(u, v) = D(G). This case is similar to the previous one. Denote j = dis_G(u, v); now N_G(u) ∩ N_G^d(v) ≠ ∅ only when d ∈ {j − 1, j}. We have

[LR̃]_{uv} = k_1 r_j − r_{j−1} c_j − r_j(k_1 − c_j)
= c_j(r_j − r_{j−1})
= (2c_j/(n k_{j−1} b_{j−1})) k_j = 2/n,

where we again use k_j = k_{j−1} b_{j−1}/c_j.

Combining the above three cases, we conclude that LR̃ = (2/n)11ᵀ − 2I, which finishes the proof.
We are now ready to prove the main result. Let G = (V_G, E_G) and H = (V_H, E_H) be two distance-regular graphs. We first prove that if ι(G) = ι(H), then RD-WL cannot distinguish the two graphs. This is a simple consequence of Theorem C.61. Combined with the fact that κ(G) = κ(H), we have {{dis^R_G(u, w) : w ∈ V_G}} = {{dis^R_H(v, w) : w ∈ V_H}} for any nodes u ∈ V_G and v ∈ V_H. Therefore, after the first iteration, the RD-WL color mappings χ_G^1 and χ_H^1 satisfy χ_G^1(u) = χ_H^1(v) for all u ∈ V_G and v ∈ V_H. Similarly, after the t-th iteration we still have χ_G^t(u) = χ_H^t(v) for all u ∈ V_G and v ∈ V_H, and thus RD-WL cannot distinguish the two graphs.

It remains to prove that if ι(G) ≠ ι(H), then RD-WL can distinguish the two graphs. First observe that in Theorem C.61, r_i < r_j holds for any i < j. Therefore, for any nodes u ∈ V_G and v ∈ V_H, {{dis^R_G(u, w) : w ∈ V_G}} = {{dis^R_H(v, w) : w ∈ V_H}} if and only if
Proof of the third item of Theorem C.58. First, if ι(G) ≠ ι(H), then 2-FWL can distinguish graphs G and H. This simply follows from the fact that 2-FWL is more powerful than RD-WL (Theorem 4.5). It remains to prove that if ι(G) = ι(H), then 2-FWL cannot distinguish graphs G and H.

Let χ_G^t : V_G × V_G → C be the 2-FWL color mapping of graph G after t iterations. We aim to prove that for any nodes u, v ∈ V_G and w, x ∈ V_H, if dis_G(u, v) = dis_H(w, x), then χ_G^t(u, v) = χ_H^t(w, x) for any t ∈ ℕ. We prove it by induction. The base case of t = 0 trivially holds. Now suppose the case of t holds, and let us consider the color mapping after t + 1 iterations. By the 2-FWL update rule (2),

χ_G^{t+1}(u, v) = hash(χ_G^t(u, v), {{(χ_G^t(u, z), χ_G^t(z, v)) : z ∈ V_G}}).   (23)

It thus suffices to prove that

{{(χ_G^t(u, z), χ_G^t(z, v)) : z ∈ V_G}} = {{(χ_H^t(w, z), χ_H^t(z, x)) : z ∈ V_H}}.   (24)

Since G and H are distance-regular with ι(G) = ι(H) (and thus κ(G) = κ(H)), the quantities |N_G^i(u) ∩ N_G^j(v)| and |N_H^i(w) ∩ N_H^j(x)| coincide for all i, j, which implies

{{(dis_G(u, z), dis_G(z, v)) : z ∈ V_G}} = {{(dis_H(w, z), dis_H(z, x)) : z ∈ V_H}}.

This already yields (24) by the induction result of iteration t. We thus complete the proof.
In this subsection, we review existing metrics used in prior works to measure the expressiveness of
GNNs. We will discuss the limitations of these metrics and argue why biconnectivity may serve as
a more reasonable and compelling criterion in designing powerful GNN architectures.
WL hierarchy. Since the discovery of the relationship between MPNNs and 1-WL test (Xu et al.,
2019; Morris et al., 2019), the WL hierarchy has been considered as the most standard metric to
guide designing expressive GNNs. However, achieving an expressive power that matches the 2-
FWL test is already highly difficult. Indeed, each iteration of the 2-FWL algorithm already requires
a complexity of Ω(n3 ) time and Θ(n2 ) space for a graph with n vertices (Immerman & Lander,
1990). Therefore, it is impossible to design expressive GNNs using this metric while maintain-
ing its computational efficiency. Moreover, whether achieving higher-order WL expressiveness is
necessary and helpful for real-world tasks has been questioned by recent works (Veličković, 2022).
Structural metrics. Another line of works thus sought different metrics to measure the expressive
power of GNNs. Several popular choices are the ability of counting substructures (Arvind et al.,
2020; Chen et al., 2020; Bouritsas et al., 2022), detecting cycles (Loukas, 2020; Vignac et al., 2020;
Huang et al., 2023), calculating the graph diameter (Garg et al., 2020; Loukas, 2020) or other graph-
related (combinatorial) problems (Sato et al., 2019). Yet, all these metrics have a common drawback:
the corresponding problems may be too hard for GNNs to solve. Indeed, we show in Table 4
that solving any above task requires a computation complexity that grows super-linear w.r.t. the
graph size even using advanced algorithms. Therefore, it is quite natural that standard MPNNs
are not expressive for these metrics, since no GNNs can solve these tasks while being efficient.
Consequently, instead of using GNNs to directly learn these metrics, these works had to use a
precomputation step which can be costly in the worst case.
Table 4: The best computational complexity of known algorithms for solving different graph problems. Here n and m are the number of nodes and edges of a given graph, respectively.

Metric                                       Complexity                   Reference
k-FWL                                        Ω(n^{k+1})                   (Immerman & Lander, 1990)
Counting/detecting triangles                 O(min(n^{2.376}, m^{3/2}))   (Alon et al., 1997)
Detecting cycles of an odd length k ≥ 3      O(min(n^{2.376}, m^2))       (Alon et al., 1997)
Detecting cycles of an even length k ≥ 4     O(n^2)                       (Yuster & Zwick, 1997)
Calculating the graph diameter               O(nm)                        –
Detecting cut vertices                       Θ(n + m)                     (Tarjan, 1972)
Detecting cut edges                          Θ(n + m)                     (Tarjan, 1972)
Due to the lack of proper metrics, most subsequent works mainly justify the expressive power of their proposed GNNs by focusing on regular graphs (Li et al., 2020; Bevilacqua et al., 2022; Bodnar et al., 2021b; Feng et al., 2022; Velingker et al., 2022, to list a few), which hardly appear in practice. In contrast, the biconnectivity metrics proposed in this paper differ from all prior metrics in that (i) biconnectivity is a basic graph property with significant value in both theory and applications; and (ii) it can be efficiently computed with a complexity linear in the graph size, so it is reasonable to expect that these metrics can be learned by expressive GNNs.
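For completeness, here is a short sketch of the classic linear-time DFS in the spirit of Tarjan (1972) that detects both cut edges and cut vertices; it assumes a simple undirected graph given as an adjacency dict (for very large graphs an iterative variant would be needed to avoid Python's recursion limit):

    def cut_edges_and_vertices(adj):
        disc, low = {}, {}
        bridges, cut_vertices = [], set()
        timer = [0]

        def dfs(v, parent):
            disc[v] = low[v] = timer[0]; timer[0] += 1
            children = 0
            for w in adj[v]:
                if w == parent:
                    continue
                if w in disc:                       # back edge
                    low[v] = min(low[v], disc[w])
                else:                               # tree edge
                    children += 1
                    dfs(w, v)
                    low[v] = min(low[v], low[w])
                    if low[w] > disc[v]:
                        bridges.append((v, w))      # cut edge
                    if parent is not None and low[w] >= disc[v]:
                        cut_vertices.add(v)         # non-root cut vertex
            if parent is None and children >= 2:
                cut_vertices.add(v)                 # root with >= 2 subtrees

        for s in adj:
            if s not in disc:
                dfs(s, None)
        return bridges, cut_vertices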
D.2 GNNS WITH DISTANCE ENCODING
In this subsection, we review prior works that are related to our proposed GD-WL. In the research
field of expressive GNNs, the idea of incorporating distance first appeared in Li et al. (2020), where
the authors mainly considered using distance encoding as node features and showed that distance
can help distinguish regular graphs. They also considered an approach similar to k-hop aggrega-
tion by incorporating distance into the message-passing procedure (but without a systematic study).
Zhang & Li (2021) designed a subgraph GNN that also uses (generalized) distance encoding as
node features in each subgraph. Ying et al. (2021a) designed a Transformer architecture that incor-
porates distance information and empirically showed excellent performance. Very recently, Feng
et al. (2022) formally studied the expressive power of k-hop GNNs. Yet, they still restricted the
analysis to regular graphs. The concurrent work of Abboud et al. (2022) designed the shortest path
network which is highly similar to our proposed SPD-WL. They showed the resulting model can
alleviate the bottlenecks and over-squashing problems for MPNNs (Alon & Yahav, 2021; Topping
et al., 2022) due to the increased receptive field.
Compared with prior works, our contribution lies in the following three aspects:
• We formalize the principled and more expressive GD-WL framework, which comprises
SPD-WL as a special case. Our framework is theoretically clean and generalizes all prior
works in a unified manner.
• We systematically and theoretically analyze the expressive power of SPD-WL for general
graphs and highlight a fundamental advantage in distinguishing edge-biconnectivity.
• We design a Transformer-based GNN that is provably as expressive as GD-WL. Thus, our
framework is not only for theoretical analysis, but can also be easily implemented with good
empirical performance on real-world tasks.
Discussions with the concurrent work of Velingker et al. (2022). After the initial submission, we
became aware of a concurrent work (Velingker et al., 2022) which also explored the use of Resistance
Distance to enhance the expressiveness of standard MPNNs. Here, we provide a comprehensive
comparison of these two works. Overall, the main difference is that their approach incorporates
RD (and several related affinity measures) into node/edge features (like Zhang & Li (2021)), while
we combine RD to design a new WL aggregation procedure. As for the theoretical analysis, they
only give a few toy examples of regular graphs to justify the expressive power beyond the 1-WL
test, while we give a systematic analysis of the power of RD-WL for general graphs and point out
that it is fully expressive for vertex-biconnectivity. In Velingker et al. (2022), the authors also made
comparisons to SPD and conjectured that RD may have additional advantages than SPD in terms of
expressiveness. In fact, this question is formally answered in our work, by proving that RD-WL is
expressive for vertex-biconnectivity while SPD-WL is not. Another important contribution of our
work is that we provide an upper bound of the expressive power of RD-WL to be 2-FWL (3-WL),
which reveals the limit of incorporating RD information. We also provide a precise and complete
characterization for the expressiveness of RD-WL in distinguishing distance-regular graphs, which
reveals that RD-WL can match the power of 2-FWL in distinguishing these hard graphs.
In this section, we give implementation details of GD-WL and our proposed GNN architecture. We
also give a detailed analysis of its computational complexity. Below, assume the input graph G = (V, E)
has n vertices and m edges.
Shortest Path Distance can be easily calculated using the Floyd-Warshall algorithm (Floyd, 1962),
which has a complexity of Θ(n3 ). For sparse graphs typically encountered in practice (i.e. m =
o(n2 )), a more clever way is to use breadth-first search that computes the distance from a given node
to all other nodes in the graph. The time complexity can be improved to Θ(nm).
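A minimal sketch of the Θ(nm) approach (helper names are ours; adj maps each node to its neighbor list):

    from collections import deque

    def all_pairs_spd(adj):
        dist = {}
        for s in adj:                 # one BFS per source node
            d = {s: 0}
            queue = deque([s])
            while queue:
                v = queue.popleft()
                for w in adj[v]:
                    if w not in d:
                        d[w] = d[v] + 1
                        queue.append(w)
            dist[s] = d               # nodes missing from d are at distance infinity
        return dist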
E.2 PREPROCESSING RESISTANCE DISTANCE
In this subsection, we first describe several important properties of Resistance Distance. Based on
these properties, we give a simple yet efficient algorithm to calculate Resistance Distance.
Equivalence between Resistance Distance (RD) and Commute Time Distance (CTD). Chandra et al. (1996) established an important relationship between RD and CTD, proving that dis^C_G(u, v) = 2m·dis^R_G(u, v) holds for any graph G and any nodes u, v ∈ V. Here, the Commute Time Distance is defined as dis^C_G(u, v) := h_G(u, v) + h_G(v, u), where h_G(u, v) is the average hitting time from u to v in a random walk. Concretely, h_G(u, v) is equal to the average number of edges traversed by a random walk that starts from u until it reaches v for the first time. Mathematically, it satisfies the following recursive relation:

h_G(u, v) = 0 if u = v;  h_G(u, v) = ∞ if u and v are in different connected components;  h_G(u, v) = 1 + (1/deg_G(u)) Σ_{w∈N_G(u)} h_G(w, v) otherwise.   (25)

The above equation can be used to calculate CTD and thus RD, as we will show later.
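Concretely, fixing the target v turns (25) into a linear system over h(·, v). The following illustrative snippet solves it with numpy and checks the identity dis^C_G = 2m·dis^R_G on a small tree, where RD coincides with SPD:

    import numpy as np
    import networkx as nx

    def hitting_times_to(G, v, order):
        # Solve h(u, v) = 1 + (1/deg(u)) * sum_{w in N(u)} h(w, v), h(v, v) = 0.
        idx = {u: i for i, u in enumerate(order)}
        A, b = np.eye(len(order)), np.ones(len(order))
        for u in order:
            if u == v:
                b[idx[u]] = 0.0          # row of identity forces h(v, v) = 0
                continue
            for w in G[u]:
                A[idx[u], idx[w]] -= 1.0 / G.degree(u)
        return np.linalg.solve(A, b)

    G = nx.path_graph(3)                  # a tree, so RD equals SPD
    order = list(G)
    H = np.array([hitting_times_to(G, v, order) for v in order]).T  # H[u, v] = h(u, v)
    C = H + H.T                           # commute time distances
    SPD = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
    assert np.allclose(C, 2 * G.number_of_edges() * SPD)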
Resistance Distance is a graph metric. We say a function d_G : V × V → ℝ is a graph metric if it is non-negative, positive definite (i.e., d_G(u, v) = 0 if and only if u = v), symmetric, and satisfies the triangle inequality. Let G be a connected graph; then the Resistance Distance dis^R_G is a valid graph metric.
Comparing RD with SPD. It is easy to see that RD is always no larger than SPD, i.e., dis^R_G(u, v) ≤ dis_G(u, v). This is because for any subgraph G′ of G we have dis^R_G(u, v) ≤ dis^R_{G′}(u, v), and when G′ is chosen to contain only the edges that belong to the shortest path between u and v, we have dis^R_{G′}(u, v) = dis_G(u, v). Therefore, the range of RD is the same as that of SPD, i.e., 0 ≤ dis^R_G(u, v) ≤ n − 1. However, unlike SPD, which is an integer, RD can be a general rational number. RD can thus be seen as a more fine-grained distance metric than SPD. Nevertheless, RD is still discrete, and there are only finitely many possible values of dis^R_G(u, v) when n is fixed.
Proof. Denote d = (deg_G(1), · · · , deg_G(n))ᵀ. Define the probability matrix P such that P_ij = 0 if {i, j} ∉ E and P_ij = 1/deg_G(i) if {i, j} ∈ E. Then for any i ≠ j, (25) can be equivalently written as

h(i, j) = 1 + Σ_{k=1}^n P_ik h(k, j) − P_ij h(j, j).   (26)

Now define a matrix H̃ such that H̃_ij = 1 + Σ_{k=1}^n P_ik H̃_kj − P_ij H̃_jj; then H̃_ij = h(i, j) for all i ≠ j (although H̃_ii ≠ 0 = h(i, i)). H̃ can be equivalently written as

H̃ = 11ᵀ + P(H̃ − diag(H̃)),   (27)

where diag(H̃) is the diagonal matrix with elements H̃_ii for i ∈ [n].
We first calculate diag(H̃). Noting that dᵀP = dᵀ, we have

dᵀH̃ = dᵀ11ᵀ + dᵀ(H̃ − diag(H̃)),

and thus dᵀ diag(H̃) = dᵀ11ᵀ, namely

H̃_ii = (1/d_i) dᵀ1 = 2m/d_i.   (28)

Now define H = H̃ − diag(H̃); then H_ij = h(i, j) for all i, j ∈ [n]. We will calculate H in the following proof. We first write (27) equivalently as H + diag(H̃) = 11ᵀ + PH. Then, by multiplying by D, we have

D(I − P)H = D11ᵀ − D diag(H̃).   (29)

Using the fact that D(I − P) = L and (28), we obtain

LH = D11ᵀ − 2mI.   (30)

Next, noting that L1 = 0, we have

(L + (1/n)11ᵀ)^{−1} L = I − (1/n)11ᵀ.   (31)

One important property is that the matrix L + (1/n)11ᵀ is invertible (see Gutman & Xiao (2004, Theorem 4) for a proof). Combining (30) and (31), we have

(I − (1/n)11ᵀ)H = (L + (1/n)11ᵀ)^{−1}(D11ᵀ − 2mI) = M(D11ᵀ − 2mI).   (32)

By taking diagonal elements and noting that diag(H) = O, we obtain

−diag((1/n)11ᵀH) = diag(MD11ᵀ) − 2m diag(M).   (33)

Namely,

(1/n)Hᵀ1 = −MD1 + 2m diag(M)1.   (34)

Substituting (34) into (32) yields

H = M(D11ᵀ − 2mI) − 11ᵀDM + 2m·11ᵀ diag(M).   (35)

Therefore,

H + Hᵀ = 2m(11ᵀ diag(M) + diag(M)11ᵀ − 2M).   (36)

This finally yields dis^R_G(i, j) = (1/2m) dis^C_G(i, j) = (1/2m)(H + Hᵀ)_{ij} = M_ii + M_jj − 2M_ij and concludes the proof.
Computational Complexity. The graph Laplacian can be calculated in O(n2 ) time, and M can
be calculated by matrix inversion which requires O(n3 ) time. Therefore, the overall computational
complexity is O(n3 ) (or O(n2.376 ) using advanced matrix multiplication algorithms).
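In code, the whole O(n³) pipeline takes a few lines (a sketch with our own helper name; nodes are assumed to be indexed 0, …, n − 1):

    import numpy as np
    import networkx as nx

    def resistance_distance_matrix(G):
        n = G.number_of_nodes()
        L = nx.laplacian_matrix(G, nodelist=list(G)).toarray().astype(float)
        M = np.linalg.inv(L + np.ones((n, n)) / n)   # M = (L + 11^T / n)^{-1}
        d = np.diag(M)
        return d[:, None] + d[None, :] - 2 * M       # dis^R(i, j) = M_ii + M_jj - 2 M_ij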
For sparse graphs typically encountered in practice (i.e., m = o(n²)), one may similarly ask whether a complexity that depends on m can be achieved. We conjecture that it should be possible. Below, we give another algorithm to calculate (L + (1/n)11ᵀ)^{−1}. Note that the graph Laplacian L can be equivalently written as L = EEᵀ, where E ∈ ℝ^{n×m} is defined as

E_ij = 1 if ε_j = {i, k} and k > i;  E_ij = −1 if ε_j = {i, k} and k < i;  E_ij = 0 if i ∉ ε_j,   (37)

where we denote E = {ε_1, · · · , ε_m}. Let E = [e_1, · · · , e_m] where e_i ∈ ℝⁿ; then M = ((1/n)11ᵀ + Σ_{i=1}^m e_i e_iᵀ)^{−1}. Noting that each e_i is highly sparse with only two non-zero elements, we suspect that one can obtain an O(nm) complexity using techniques similar to the Sherman-Morrison-Woodbury update. We leave it as an open problem.
E.3 TRANSFORMER-BASED IMPLEMENTATION

Graphormer-GD. The model is built on Graphormer (Ying et al., 2021a), which uses the Transformer (Vaswani et al., 2017) as the backbone network. A Transformer block consists of two layers: a self-attention layer followed by a feed-forward layer, with both layers having normalization (e.g., LayerNorm (Ba et al., 2016)) and skip connections (He et al., 2016). Denote by X^(l) ∈ ℝ^{n×d} the input to the (l + 1)-th block and define X^(0) = X, where n is the number of nodes and d is the feature dimension. For an input X^(l), the (l + 1)-th block works as follows:

A_h(X^(l)) = softmax(X^(l) W_Q^{l,h} (X^(l) W_K^{l,h})ᵀ);   (38)
X̂^(l) = X^(l) + Σ_{h=1}^H A_h(X^(l)) X^(l) W_V^{l,h} W_O^{l,h};   (39)
X^(l+1) = X̂^(l) + FFN(X̂^(l)).   (40)

Graphormer-GD modifies the self-attention matrix in (38) by gating and biasing it with transformed distance matrices:

A_h(X^(l)) = ϕ_1^{l,h}(D) ⊙ softmax(X^(l) W_Q^{l,h} (X^(l) W_K^{l,h})ᵀ + ϕ_2^{l,h}(D)),   (41)

where D ∈ ℝ^{n×n} is the distance matrix such that D_uv = d_G(u, v), ϕ_1^h and ϕ_2^h are element-wise functions applied to D, and ⊙ denotes element-wise multiplication. In this way, the graph structural information can be captured by our Graphormer-GD model.
As stated in Section 4, we mainly consider two distance metrics: the Shortest Path Distance dis_G and the Resistance Distance dis^R_G. For SPD, we follow Ying et al. (2021a) and use their shortest path distance encoding. Formally, let D^SPD be the SPD matrix such that D^SPD_uv = dis_G(u, v). The functions ϕ_1 and ϕ_2 can simply be parameterized by two learnable vectors v^1 and v^2, so that ϕ_1(D^SPD_uv) is a learnable scalar corresponding to v^1_{D^SPD_uv} (and similarly for ϕ_2). If two nodes u and v are not in the same connected component, i.e., D^SPD_uv = ∞, a special learnable scalar is assigned. For RD, we use Gaussian Basis kernels (Scholkopf et al., 1997) to encode the value, since it may not be an integer. The encoded values from different Gaussian Basis kernels are concatenated and further transformed by a two-layer MLP. We integrate both the SPD encoding and the RD encoding to obtain ϕ_1^{l,h}(D) and ϕ_2^{l,h}(D). Note that these two matrices are parameterized by different sets of parameters. Following Ying et al. (2021a), we also incorporate the degree of each node in the input layer using a degree embedding.
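A numpy sketch of one gated attention head in (41) is given below (illustrative only; phi1 and phi2 are passed as arbitrary element-wise callables rather than the learned parameterizations described above):

    import numpy as np

    def softmax(Z):
        Z = Z - Z.max(axis=-1, keepdims=True)
        E = np.exp(Z)
        return E / E.sum(axis=-1, keepdims=True)

    def gated_attention_head(X, D, Wq, Wk, phi1, phi2):
        # Eq. (41): A_h = phi1(D) * softmax(X Wq (X Wk)^T + phi2(D)).
        scores = (X @ Wq) @ (X @ Wk).T + phi2(D)
        return phi1(D) * softmax(scores)

    # With Wq = Wk = 0 and constant phi2, softmax(...) is uniform (1/n), so
    # phi1(D) = 1[D == d_k] recovers the distance-bucket averaging used in
    # the proof of Theorem E.3 below.
    n, dim = 5, 4
    X = np.random.randn(n, dim)
    D = np.random.randint(0, 3, size=(n, n)).astype(float)
    A = gated_attention_head(X, D, np.zeros((dim, dim)), np.zeros((dim, dim)),
                             lambda D: (D == 1).astype(float),
                             lambda D: np.zeros_like(D))
    assert np.allclose(A, (D == 1) / n)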
Relationship between Graphormer-GD and GD-WL. As stated in Section 4, the expressive power of Graphormer-GD is at most that of GD-WL. We will prove that it is actually as powerful as GD-WL under mild assumptions. We first restate Lemma 5 from Xu et al. (2019), which shows that sum aggregators can represent injective functions over multisets.
Lemma E.2. (Xu et al., 2019, Lemma 5) Assume the set X is countable. Then there exists a function
f : X → Rn so that the function h(X̂ ) :=
P
x∈X̂ f (x) is unique for each multiset X̂ ⊂ X of
P
bounded size. Moreover, any multiset function g can be decomposed as g(X̂ ) = ϕ( x∈X̂ f (x)) for
some function ϕ.
We are now ready to present the detailed proof of Theorem 4.4, which is restated as follows:

Theorem E.3. Graphormer-GD is at most as powerful as GD-WL. Moreover, when choosing proper functions ϕ_1^h and ϕ_2^h and using a sufficiently large number of heads and layers, Graphormer-GD is as powerful as GD-WL.
Proof. Consider all graphs with no more than n nodes. The number of possible values of both SPD and RD is thus finite and depends on n (see Appendix E.2). Let

    D_n = {(dis_G(u, v), dis^R_G(u, v)) : G = (V, E), |V| ≤ n, u, v ∈ V}

denote the set of all possible pairs (dis_G(u, v), dis^R_G(u, v)). Since D_n is finite, we can list its elements as D_n = {d_{G,1}, · · · , d_{G,|D_n|}}. With a slight abuse of notation, denote d_G(u, v) = (dis_G(u, v), dis^R_G(u, v)). Then the GD-WL aggregation in (3) can be reformulated as follows:

    χ^t_G(v) := hash(χ^{t,1}_G(v), χ^{t,2}_G(v), · · · , χ^{t,|D_n|}_G(v)),    (42)

where χ^{t,k}_G(v) := {{χ^{t−1}_G(u) : u ∈ V, d_G(u, v) = d_{G,k}}}.
Intuitively, this reformulation indicates that in each iteration, GD-WL updates the color of node v by hashing a tuple of color multisets, where each multiset is obtained by injectively aggregating the colors of all nodes u ∈ V at a specific distance to node v. Therefore, to express GD-WL, it suffices for the model to update the representation of each node following the above procedure (see the sketch below).
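For concreteness, a minimal Python sketch of one such refinement step might look as follows (our own illustration; Python's built-in hash stands in for the injective hash function, and dist is assumed to be a precomputed matrix of distance pairs d_G(u, v)):

def gd_wl_iteration(colors, dist):
    """One GD-WL refinement step, following the reformulation (42).

    colors: list of hashable node colors chi^{t-1}_G(u), e.g., integers.
    dist:   n x n nested list where dist[u][v] is the tuple d_G(u, v).
    Returns the refined node colors chi^t_G(v).
    """
    n = len(colors)
    # Enumerate D_n = {d_{G,1}, ..., d_{G,|D_n|}} in a fixed order.
    values = sorted({dist[u][v] for u in range(n) for v in range(n)})
    new_colors = []
    for v in range(n):
        # chi^{t,k}_G(v): multiset of colors of all u with d_G(u, v) = d_{G,k},
        # encoded as a sorted tuple so equal multisets hash identically.
        per_distance = tuple(
            tuple(sorted(colors[u] for u in range(n) if dist[u][v] == d))
            for d in values
        )
        new_colors.append(hash(per_distance))
    return new_colors

Iterating this step until the coloring stabilizes and hashing the final multiset of node colors yields the GD-WL representation of the graph.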
We show that Graphormer-GD can achieve this goal. Recall that for the h-th head, the attention matrix is defined as ϕ_1^h(D) ⊙ softmax(X W_Q^h (X W_K^h)^⊤ + ϕ_2^h(D)). For the function ϕ_1^h, we define it to be the indicator function ϕ_1^h(d) := I(d = d_{G,h}). For the function ϕ_2^h, we set it to be a constant irrespective of the matrix D. Let W_Q^h, W_K^h be zero matrices. It can be seen that the term softmax(X W_Q^h (X W_K^h)^⊤ + ϕ_2^h(D)) reduces to (1/|V|) 1 1^⊤, and thus for each node v, the output of the h-th attention head is the sum aggregation of the representations of nodes u satisfying d_G(u, v) = d_{G,h}. Formally,

    [A_h(X^{(l)}) X^{(l)}]_v = (1/|V|) Σ_{u : d_G(u,v) = d_{G,h}} [X^{(l)}]_u.

Note that the constant 1/|V| can be extracted with an additional head and concatenated to the node representations. Moreover, the node representation X is processed via the feed-forward network in the previous layer (see (40)). Thus, we can invoke Lemma E.2 and prove that the h-th attention head in Graphormer-GD can implement an injective aggregation function for {{χ^{t−1}_G(u) : u ∈ V, d_G(u, v) = d_{G,h}}}. Therefore, by using a sufficiently large number of attention heads, the multiset representations χ^{t,k}_G, k ∈ [|D_n|], can be injectively obtained.
Finally, the multi-head attention defined in (39) is equivalent to first concatenating the outputs of all attention heads and then using a linear mapping to transform the result. The concatenation is clearly an injective mapping of the tuple of multisets χ^{t,1}_G, χ^{t,2}_G, · · · , χ^{t,|D_n|}_G. When the linear mapping has irrational (generic) weights, the projection is also injective. Therefore, one attention layer followed by the feed-forward network can implement the aggregation formula (42). Thus, our Graphormer-GD is able to simulate GD-WL when using a sufficient number of layers, which concludes the proof.
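As a quick numerical sanity check of the head construction above (our own illustration, not part of the formal proof), one can verify that an indicator ϕ_1^h together with zero query/key matrices and a constant ϕ_2^h reduces a head to the claimed per-distance aggregation:

import torch

n, d = 4, 3
X = torch.randn(n, d)
D = torch.tensor([[0, 1, 2, 1],
                  [1, 0, 1, 2],
                  [2, 1, 0, 1],
                  [1, 2, 1, 0]])       # toy symmetric distance matrix d_G(u, v)
target = 1                             # the distance value d_{G,h} for this head

phi1 = (D == target).float()           # indicator phi_1^h(d) = I(d = d_{G,h})
# With W_Q = W_K = 0 and constant phi_2^h, softmax of a constant row is
# uniform, so the softmax term reduces to (1/|V|) 1 1^T.
A = phi1 * torch.full((n, n), 1.0 / n)
out = A @ X

# Matches (1/|V|) * sum of X_u over all u with d_G(u, v) = d_{G,h}, for every v.
for v in range(n):
    ref = X[D[:, v] == target].sum(dim=0) / n
    assert torch.allclose(out[v], ref)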
F EXPERIMENTAL DETAILS

F.1 SYNTHETIC TASKS
Data Generation and Evaluation Metrics. We carefully design several graph generators to examine the expressive power of the compared models on graph biconnectivity tasks. First, we include the two families of graphs presented in Examples C.9 and C.10 (Appendix C.2). We further introduce a rich family of regular graphs with both cut vertices and cut edges. Each graph in this family is constructed by first randomly generating several connected components and then linking them via cut edges while simultaneously ensuring that each node has the same degree. Combining the above three families of hard graphs, we generate data instances online to train the compared models. For each data instance, the total number of nodes is upper bounded by 120. We use graph-level accuracy as the metric: for each graph, the prediction of the model is considered correct only when all cut vertices/edges, and no others, are identified. We use different seeds to repeat the experiments 5 times and report the average accuracy.
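For reference, the ground-truth cut vertices and cut edges of each generated graph can be computed in linear time with Tarjan-style algorithms. The networkx-based sketch below is our own illustration of how such labels might be produced (biconnectivity_labels is a hypothetical helper, not the authors' data pipeline):

import networkx as nx

def biconnectivity_labels(G: nx.Graph):
    """Return 0/1 cut-vertex labels per node and cut-edge labels per edge.

    Under the graph-level accuracy metric above, a prediction counts as
    correct only if every one of these labels is matched exactly.
    """
    cut_vertices = set(nx.articulation_points(G))
    cut_edges = set(frozenset(e) for e in nx.bridges(G))
    node_labels = {v: int(v in cut_vertices) for v in G.nodes}
    edge_labels = {e: int(frozenset(e) in cut_edges) for e in G.edges}
    return node_labels, edge_labels

# Example: two triangles joined by a single edge; that edge is a cut edge
# (bridge) and its two endpoints are cut vertices (articulation points).
G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 3)])
nodes, edges = biconnectivity_labels(G)
assert nodes[2] == 1 and nodes[3] == 1 and edges[(2, 3)] == 1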
Baselines. We choose several baselines whose expressive power lies at different levels. First, we consider classic MPNNs, including GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019). The expressive power of these GNNs is provably at most that of the 1-WL test (Xu et al., 2019). We also compare the Graph Substructure Network (Bouritsas et al., 2022), which extracts graph substructures to improve the expressive power of MPNNs; the substructure counts are incorporated into node features or the aggregation procedure. Lastly, we compare the Graphormer model (Ying et al., 2021a), which has achieved impressive performance in several graph learning competitions (Ying et al., 2021b; Shi et al., 2022; Luo et al., 2022a).
Settings. We employ a 6-layer Graphormer-GD model. The dimension of hidden layers and feed-
forward layers is set to 768. The number of Gaussian Basis kernels is set to 128. The number of
attention heads is set to 64. The batch size is set to 32. We use AdamW (Kingma & Ba, 2014) as the
optimizer and set its hyperparameter ϵ to 1e-8 and (β1 , β2 ) to (0.9, 0.999). The peak learning rate is
set to 9e-5. The model is trained for 100k steps with a 6k-step warm-up stage. After the warm-up stage, the learning rate decays linearly to zero. All models are trained on a single NVIDIA Tesla V100 GPU.
Subgraph Union Network (SUN) (Frasca et al., 2022) is developed based on a symmetry analysis of a series of existing Subgraph GNNs and an upper bound on their expressive power; it theoretically unifies previous architectures and performs well across several graph representation learning benchmarks.
Last, we compare several Graph Transformer models. GraphTransformer (GT) (Dwivedi & Bresson, 2021) applies the Transformer model to graph tasks, aggregating information only from neighboring nodes to exploit graph sparsity, and proposes Laplacian eigenvectors as positional encodings. Spectral Attention Network (SAN) (Kreuzer et al., 2021) uses a learned positional encoding (LPE) that can take advantage of the full Laplacian spectrum to learn the position of each node in a given graph. Graphormer (Ying et al., 2021a) develops centrality encoding, spatial encoding, and edge encoding to incorporate graph structural information into the Transformer model. Universal RPE (URPE) (Luo et al., 2022b) first shows that there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate, and develops a novel and universal attention module called Universal RPE-based Attention. The effectiveness of URPE has been verified across language and graph benchmarks (e.g., the ZINC dataset).
Settings. Our Graphormer-GD consists of 12 layers. The dimension of the hidden and feed-forward layers is set to 80. The number of Gaussian Basis kernels is set to 128. The number of
attention heads is set to 8. The batch size is selected from [128, 256, 512]. We use AdamW (Kingma
& Ba, 2014) as the optimizer, and set its hyperparameter ϵ to 1e-8 and (β1 , β2 ) to (0.9, 0.999). The
peak learning rate is selected from [4e-4, 5e-4]. The model is trained for 600k steps on ZINC-Subset and 800k steps on ZINC-Full, each with a 60k-step warm-up stage. After the warm-up stage,
the learning rate decays linearly to zero. The dropout ratio is selected from [0.0, 0.1]. The weight
decay is selected from [0.0, 0.01]. All models are trained on 4 NVIDIA Tesla V100 GPUs.
Table 5: Average accuracy on the Brazil-Airports and Europe-Airports datasets. Experiments are repeated 20 times with different seeds. We use * to indicate the best performance.

Model Brazil-Airports Europe-Airports
GCN (Kipf & Welling, 2017) 64.55±4.18 54.83±2.69
GraphSAGE (Hamilton et al., 2017) 70.65±5.33 56.29±3.21
GIN (Xu et al., 2019) 71.89±3.60 57.05±4.08
Struc2vec (Ribeiro et al., 2017) 70.88±4.26 57.94±4.01
DE-GNN-SPD (Li et al., 2020) 73.28±2.47 56.98±2.79
DE-GNN-LP (Li et al., 2020) 75.10±3.80 58.41±3.20
DEA-GNN-SPD (Li et al., 2020) 75.37±3.25 57.99±2.39
Graphormer-GD (ours) 77.69±6.39* 59.23±4.05*

We further conduct experiments to measure the efficiency of our approach by profiling the time cost per training epoch. We compare the efficiency of Graphormer-GD with other baselines, along with the number of model parameters, on the ZINC-Subset from Dwivedi et al. (2020). The number of layers and the hidden dimension of our Graphormer-GD are set to 12 and 80, respectively. The number of attention heads is set to 8. The batch size is set to 128, the same as the setting of all baselines. We run the profiling of all models on a 16GB NVIDIA Tesla V100 GPU. For all baselines, we evaluate the time costs based on the publicly available code of Dwivedi et al. (2020) and Ying et al. (2021a). The results are presented in Table 6.
From Table 6, we can draw the following conclusions. Firstly, the efficiency of Graphormer-GD is of the same order of magnitude as that of classic MPNNs, despite Graphormer-GD having a higher computational complexity (i.e., Θ(n^2) vs. Θ(n + m) for a graph with n nodes and m edges). This may be due to the high parallelizability of the Transformer layers. Secondly, Graphormer-GD is much more efficient than higher-order GNNs, as reflected by the computational complexities in Table 1. Finally, Graphormer-GD is almost as efficient as the original Graphormer, since the newly introduced module encoding the Resistance Distance takes negligible additional time compared to the whole architecture.
Table 6: Efficiency Evaluation of different GNN models. We report the time per training epoch
(seconds) as well as the number of model parameters.
Model # Params Time (s)
GCN (Kipf & Welling, 2017) 505,079 5.85
GraphSAGE (Hamilton et al., 2017) 505,341 6.02
MoNet (Monti et al., 2017) 504,013 7.19
GIN (Xu et al., 2019) 509,549 8.05
GAT (Veličković et al., 2018) 531,345 8.28
GatedGCN-PE (Bresson & Laurent, 2017) 505,011 10.74
RingGNN (Chen et al., 2019) 527,283 178.03
3WLGNN (Maron et al., 2019a) 507,603 179.35
Graphormer (Ying et al., 2021a) 489,321 12.26
Graphormer-GD (ours) 502,793 12.52