gh-129847: Add graphlib.reverse(), graphlib.as_transitive() #130875

Open
wants to merge 12 commits into
base: main
from

Conversation

lordmauve
Contributor

@lordmauve lordmauve commented Mar 5, 2025

  • Add graphlib.reverse() for reversing a DAG in the form accepted by TopologicalSorter
  • Add graphlib.as_transitive() for computing a transitive closure
  • Add unit tests for both
  • Add docs to graphlib module docs
  • Add blurb and What's New entry
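As a rough illustration of what reversing a graph in the dict form accepted by TopologicalSorter involves, here is a minimal sketch. The name `reverse_graph` is hypothetical and this is not the PR's implementation; in particular, keeping every input key in the output even when nothing points at it is an assumption made here, not something the PR description states.

```python
def reverse_graph(graph):
    """Return a new dict with every edge of `graph` flipped.

    Hypothetical helper, not the PR's code. `graph` maps each node to an
    iterable of neighbours, the dict form TopologicalSorter accepts.
    """
    # Assumption: every key of the input stays a key of the output,
    # even if no edge points at it.
    reversed_graph = {node: set() for node in graph}
    for node, neighbours in graph.items():
        for neighbour in neighbours:
            # Nodes that only appear as neighbours become keys here.
            reversed_graph.setdefault(neighbour, set()).add(node)
    return reversed_graph


deps = {"app": ["lib", "util"], "lib": ["util"]}
dependents = reverse_graph(deps)
```

With this sketch, `dependents` maps "util" to the two nodes that list it, and "app" to an empty set.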

📚 Documentation preview 📚: https://cpython-previews--130875.org.readthedocs.build/

@lordmauve
Contributor Author

Bugs are fixed; they were bugs in how I unrolled a recursive implementation to iterative.

For reference, a recursive version is

def as_transitive(graph):
    def visit(node, visited, stack):
        if node in stack:
            cycle = stack[stack.index(node):] + [node]
            raise CycleError("nodes are in a cycle", cycle)
        if node in visited:
            return visited[node]
        closure = set()
        stack.append(node)
        for child in graph.get(node, []):
            closure.add(child)
            closure.update(visit(child, visited, stack))
        stack.pop()
        visited[node] = closure
        return closure

    visited = {}
    return {node: visit(node, visited, []) for node in graph}

The problem with the recursive version is that it is limited to graphs with diameter < sys.getrecursionlimit() so 1000 by default.
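To make the limit concrete, the sketch below reproduces the recursive version (renamed `as_transitive_recursive` so the block runs on its own, and importing `CycleError` from the existing `graphlib` module) and feeds it a simple chain that is deeper than the recursion limit. The chain graph is an illustrative input, not one of the PR's tests.

```python
import sys
from graphlib import CycleError


def as_transitive_recursive(graph):
    """The recursive closure from the comment above, copied for illustration."""
    def visit(node, visited, stack):
        if node in stack:
            cycle = stack[stack.index(node):] + [node]
            raise CycleError("nodes are in a cycle", cycle)
        if node in visited:
            return visited[node]
        closure = set()
        stack.append(node)
        for child in graph.get(node, []):
            closure.add(child)
            closure.update(visit(child, visited, stack))
        stack.pop()
        visited[node] = closure
        return closure

    visited = {}
    return {node: visit(node, visited, []) for node in graph}


# A chain 0 -> 1 -> 2 -> ... that needs more nested calls than Python allows.
depth = sys.getrecursionlimit() + 10
chain = {i: [i + 1] for i in range(depth)}

try:
    as_transitive_recursive(chain)
except RecursionError:
    hit_limit = True
else:
    hit_limit = False
```

Small graphs are unaffected; only inputs whose longest path approaches the recursion limit trigger the failure.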

I've also explored using TopologicalSorter:

def as_transitive(graph):
    graph = {k: set(v) for k, v in graph.items()}
    transitive = {}
    for node in TopologicalSorter(graph).static_order():
        if node in graph:
            direct = graph[node]
            t = transitive[node] = set(direct)
            t.update(*(transitive.get(d, ()) for d in direct))
    return transitive

However, this is much slower as it needs to allocate a lot of temporary objects.

Timings:

./python -m timeit -s 'import graphlib; graph = {"c": "d", "a": "b", "e": "f", "b": "ce"}' 'graphlib.as_transitive(graph)'
Method             timeit
TopologicalSorter  10000 loops, best of 5: 27.5 usec per loop
Iterative          50000 loops, best of 5: 5.44 usec per loop
Recursive          50000 loops, best of 5: 4.18 usec per loop

@lordmauve lordmauve force-pushed the lordmauve/issue129847 branch from 06ef96d to d90f7c2 Compare March 7, 2025 08:18
@lordmauve lordmauve force-pushed the lordmauve/issue129847 branch from 791f94d to 1889b93 Compare March 12, 2025 19:45
Comment on lines +2 to 3
import sys
import os
Member

Minor: sorting

Suggested change
-import sys
-import os
+import os
+import sys

Comment on lines +292 to +293
class TestAsTransitive(unittest.TestCase):
    """Tests for graphlib.as_transitive()."""
Member

> The problem with the recursive version is that it is limited to graphs with diameter < sys.getrecursionlimit() so 1000 by default.

Could we add a test for such a pathological graph?
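One possible shape for such a test is sketched below. Because `graphlib.as_transitive()` only exists on the PR branch, the block defines its own stand-in iterative closure under the same name; that stand-in, the test class name, and the chain size are all assumptions for illustration, not the PR's code.

```python
import sys
import unittest
from graphlib import CycleError


def as_transitive(graph):
    """Stand-in iterative closure (NOT the PR's code), using an explicit stack.

    Assumes each neighbour collection can be iterated more than once.
    """
    closure = {}
    for root in graph:
        if root in closure:
            continue
        on_path = {root}
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if child in on_path:
                    raise CycleError("nodes are in a cycle", child)
                if child not in closure:
                    # Descend into an unfinished child; resume this node later.
                    on_path.add(child)
                    stack.append((child, iter(graph.get(child, ()))))
                    break
            else:
                # All children are finished, so this node's closure can be built.
                stack.pop()
                on_path.discard(node)
                reach = set()
                for child in graph.get(node, ()):
                    reach.add(child)
                    reach |= closure[child]
                closure[node] = reach
    return {node: closure[node] for node in graph}


class TestDeepChain(unittest.TestCase):
    def test_chain_deeper_than_recursion_limit(self):
        # A chain 0 -> 1 -> ... -> depth that a recursive DFS could not walk.
        depth = sys.getrecursionlimit() + 100
        chain = {i: [i + 1] for i in range(depth)}
        result = as_transitive(chain)
        self.assertEqual(result[0], set(range(1, depth + 1)))
        self.assertEqual(result[depth - 1], {depth})
```

Deriving the chain length from `sys.getrecursionlimit()` keeps the test meaningful even if the interpreter's limit has been changed.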

@tim-one
Member

tim-one commented Mar 17, 2025

Question: why is transitive closure making a special case of cycles? If there's a cycle involving N nodes, then the transitive closure of that cycle contains at least N*N edges (every node in the cycle is reachable from every other node, including that each node is reachable from itself). So, e.g., if A, B, and C are in a cycle, the transitive closure maps every node in it to (at least) {A, B, C}.

That is, the usual concept of transitive closure has nothing to do with cycles, so using the name to mean something else is at best surprising. I think it would be best to stick with the conventional meaning. This is, after all, not the "acyclic_graphlib" module 😉.
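The N*N claim can be checked with a small reachability sketch. The helper name `reachable_from` is made up for this illustration and uses a plain worklist search rather than any code from the PR.

```python
def reachable_from(graph, start):
    """Return the nodes reachable from `start` by following one or more edges."""
    seen = set()
    todo = list(graph.get(start, ()))
    while todo:
        node = todo.pop()
        if node not in seen:
            seen.add(node)
            todo.extend(graph.get(node, ()))
    return seen


# A three-node cycle: A -> B -> C -> A.
cycle = {"A": ["B"], "B": ["C"], "C": ["A"]}
closure = {node: reachable_from(cycle, node) for node in cycle}
edge_count = sum(len(targets) for targets in closure.values())
```

Here every node, including the start node itself, ends up in each reachable set, so `edge_count` is 3 * 3.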

@lordmauve
Contributor Author

> This is, after all, not the "acyclic_graphlib" module 😉.

Despite the name I understand it is the acyclic graphlib module. From the discussion in #129847 there is no desire to compete with a full graph library like NetworkX but we may be willing to admit functions that work on acyclic graphs, especially task dependency graphs, which is the original use case for TopologicalSorter.

I usually use TopologicalSorter xor transitive closure in these kinds of applications, so it's useful to have a cycle check when computing the transitive closure: the DFS visits every node anyway, so the check drops out for free.

I could add a kw-only arg? cyclic_ok: bool = False?

@tim-one
Member

tim-one commented Mar 18, 2025

Transitive closure isn't competing with anyone 😉. It's a basic operation applicable to all graphs of all kinds, and has the same meaning everywhere. By default, in the absence of truly compelling reasons not to, Python should do the same as all other packages for it. So I'd be OK with adding an optional cyclic_ok=True argument.

That it's "natural" for a TC implementation to detect cycles isn't necessarily so. It depends on the specific algorithm. Here, for example, is an implementation of Warshall's algorithm (not tested much - may be buggy). It has no concept of "cycle", and does not build explicit paths:

def tc(graph):
    tc = {i : set(j) for i, j in graph.items()}
    for M, Mset in tc.items():
        for R in Mset:
            for Lset in tc.values():
                if M in Lset:
                    Lset.add(R)
    return tc

If you care about speed, you should check that out too. It builds very few temp objects.

@tim-one
Member

tim-one commented Mar 20, 2025

Fun observation: a bunch of places on the web say Warshall's algorithm is only useful when using an adjacency bit matrix representation. But that's not so. This variation is even simpler, and runs much faster, especially so on dense input graphs. This is Python, though, and part of "the trick" is that no indexing or visible .add() operations remain; they're all handled at "C speed" now:

def tc(graph):
    tc = {i : set(j) for i, j in graph.items()}
    for M, Mset in tc.items():
        if Mset:
            for Lset in tc.values():
                if M in Lset:
                    Lset.update(Mset)
    return tc

@tim-one
Member

tim-one commented Mar 20, 2025

Huh! For more speed, replace:

                    Lset.update(Mset)

with

                    Lset |= Mset
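Assembled into a whole function, the variant with the in-place union might look like the sketch below. The function and variable names (`warshall_closure`, `pivot`, `targets`) are renamed for readability and are not from the thread; the sample graph is the one used in the timing command earlier in this conversation.

```python
def warshall_closure(graph):
    """Set-based Warshall transitive closure, following the variant above."""
    closure = {node: set(neighbours) for node, neighbours in graph.items()}
    for pivot, pivot_targets in closure.items():
        if pivot_targets:
            for targets in closure.values():
                if pivot in targets:
                    # Anything that reaches the pivot also reaches all of the
                    # pivot's targets; the in-place union mutates the set.
                    targets |= pivot_targets
    return closure


graph = {"c": "d", "a": "b", "e": "f", "b": "ce"}
closure = warshall_closure(graph)

# Cycles need no special handling: every node on the cycle reaches all of them.
cyclic_closure = warshall_closure({"A": "B", "B": "C", "C": "A"})
```

Only dict values are mutated while iterating, never the dict's keys, so the loops over `closure.items()` and `closure.values()` remain valid.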

@lordmauve
Contributor Author

I'm happy with cyclic_ok=True.

I suspect for the use case of task graphs that graphlib was originally written for, dense graphs are rare. I benchmarked your Warshall's algorithm implementation on my firm's monorepo dependency graph (15k nodes, 130k edges) as it's bigger than trivial and provides organic data. For that use case it's 45x slower, taking 38s per call vs <1s for the DFS version.

@tim-one
Member

tim-one commented Apr 2, 2025

Ya, Warshall is best suited for dense graphs, and I agree those aren't the natural focus here.

An idea is to split off "is there a cycle?" into its own function, so there's only one place that needs to change if ambitions change. "The best" compromise for transitive closure is to find strongly connected components first (in linear time). Then each SCC trivially induces a complete subgraph, and the DAG of SCCs can be done via a topsort.
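A rough sketch of the SCC-first approach follows. It uses an iterative form of Tarjan's algorithm, which happens to emit components in reverse topological order of the condensation, so no separate topsort pass is written out; that choice, the function name `transitive_closure_scc`, and the decision to let members of one SCC share a single result set are all assumptions of this sketch, not anything proposed in the PR.

```python
def transitive_closure_scc(graph):
    """Transitive closure via strongly connected components (illustrative only).

    Nodes in the same SCC share one result set object.
    """
    index = {}       # discovery order of each visited node
    lowlink = {}     # smallest discovery index known to be reachable
    on_stack = set()
    scc_stack = []
    closure = {}     # finished node -> set of nodes reachable from it
    counter = 0

    for root in graph:
        if root in index:
            continue
        index[root] = lowlink[root] = counter
        counter += 1
        scc_stack.append(root)
        on_stack.add(root)
        work = [(root, iter(graph.get(root, ())))]
        while work:
            node, children = work[-1]
            for child in children:
                if child not in index:
                    index[child] = lowlink[child] = counter
                    counter += 1
                    scc_stack.append(child)
                    on_stack.add(child)
                    work.append((child, iter(graph.get(child, ()))))
                    break
                if child in on_stack:
                    lowlink[node] = min(lowlink[node], index[child])
            else:
                work.pop()
                if work:
                    parent = work[-1][0]
                    lowlink[parent] = min(lowlink[parent], lowlink[node])
                if lowlink[node] == index[node]:
                    # `node` is the root of an SCC: pop its members.
                    members = set()
                    while True:
                        member = scc_stack.pop()
                        on_stack.discard(member)
                        members.add(member)
                        if member == node:
                            break
                    reach = set()
                    for member in members:
                        for child in graph.get(member, ()):
                            if child in members:
                                # An edge inside the SCC means it is cyclic,
                                # so every member reaches every member.
                                reach |= members
                            else:
                                # Other SCCs reachable from here are finished.
                                reach.add(child)
                                reach |= closure[child]
                    for member in members:
                        closure[member] = reach
    return {node: closure[node] for node in graph}


graph = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["E"], "E": []}
closure = transitive_closure_scc(graph)
```

In the sample graph, A, B and C form one component and therefore all map to the same five-node set, while D and E are handled as singleton components.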

There was no "grand plan" at the start, but I doubt anyone had in mind "sparse graphs" as a goal. Even cycle-free graphs can have a number of edges quadratic in the number of nodes.
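For instance, a DAG in which every node depends on all lower-numbered nodes has n*(n-1)/2 edges. The sketch below builds one (the size 50 is arbitrary) and confirms with the existing TopologicalSorter that it is acyclic.

```python
from graphlib import TopologicalSorter

n = 50
# Node i lists every node j < i as a predecessor, so there are no cycles.
dense_dag = {i: set(range(i)) for i in range(n)}
edge_count = sum(len(preds) for preds in dense_dag.values())

# static_order() would raise CycleError if the graph had a cycle.
order = list(TopologicalSorter(dense_dag).static_order())
```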

There was a desire not to complicate things by introducing a Graph class too. So it was implicitly accepted that we'd stick (at least at first) to "the natural" Python graph representation: a dict mapping a hashable node to a collection of neighbors. So "directed and unweighted" was implicit at the start. I'd prefer to leave it there too.

My own unstated opinion was that graphlib would reach its limit when it grew a function to compute a directed graph's strongly connected components, a delicate undertaking to do efficiently.

The functions added by this PR are very comfortably within that limit.

@python-cla-bot

python-cla-bot bot commented Apr 18, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

Comment on lines +588 to +589
graphlib
--------
Member

This needs to be re-targeted to 3.15.
