gh-137103: A better circular check for json.dump() #137104

Open
wants to merge 5 commits into
base: main

aivarsk
Contributor

@aivarsk aivarsk commented Jul 25, 2025

When check_circular=True (the default) is used, the JSON module creates a dict of markers and, for each object it visits, creates a new Long object from the object's pointer and stores it in that dict. This is how it detects circular references and avoids dumping the same object again.

Other Python objects like list and dict solve this problem by using Py_ReprEnter/Py_ReprLeave without creating a new Long object for each object.

Use Py_ReprEnter/Py_ReprLeave for JSON as well.
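
For illustration, here is a minimal Python sketch of the marker-dict check described above. It is a simplified model, not the C encoder: `check_and_walk` is a hypothetical helper, and it only walks lists and dicts. It keys the dict on `id(obj)`, which is the per-container integer the description refers to.

```python
import json


def check_and_walk(obj, markers):
    """Walk nested lists/dicts, raising ValueError on a circular reference."""
    if not isinstance(obj, (list, dict)):
        return
    marker_id = id(obj)          # an int key is needed: lists/dicts are unhashable
    if marker_id in markers:     # this container is already being walked further up
        raise ValueError("Circular reference detected")
    markers[marker_id] = obj
    children = obj.values() if isinstance(obj, dict) else obj
    for child in children:
        check_and_walk(child, markers)
    del markers[marker_id]       # finished; the same object may appear again later


shared = [1, 2]
check_and_walk({"a": shared, "b": shared}, {})   # shared but not circular: accepted

loop = []
loop.append(loop)                # a list that contains itself
try:
    check_and_walk(loop, {})
except ValueError as exc:
    print("sketch:", exc)

try:
    json.dumps(loop)             # the real encoder also rejects this input
except ValueError as exc:
    print("json.dumps:", exc)
```

Because each marker is deleted once its container has been walked, an object that is merely referenced twice is not treated as circular; only an object that is still being walked triggers the error.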

Member

@picnixz picnixz left a comment

Is Py_ReprEnter/Py_ReprLeave really meant to be used that way? It also allocates lists and dicts, so I don't know whether it's better than using ints.

More generally, is there a need to use PyDict and PyLong? Maybe we can use only _Py_hashtable instead? I don't know whether hashtables would be faster, though.

@aivarsk
Contributor Author

aivarsk commented Jul 25, 2025

> Is Py_ReprEnter/ReprLeave really meant to be used that way? It also allocates lists and dicts so I don't know whether it's better than using ints.

I think it is meant to be used for that and other objects are using it, you can search around.

/* Helpers for printing recursive container types */
PyAPI_FUNC(int) Py_ReprEnter(PyObject *);
PyAPI_FUNC(void) Py_ReprLeave(PyObject *);

The big difference is that a dict (the current markers) needs a hashable key, but the objects we keep track of are lists and dicts, so a new Long object has to be created for each one. The Py_Repr* functions allocate a dictionary and a list only on the first call per thread and then do a linear scan, which is CPU-cache friendly. They don't allocate anything per tracked object.
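
As a rough illustration of the list-based approach described above, the following Python sketch models the enter/leave pair. It is a simplified model and not CPython's C implementation: `repr_enter`, `repr_leave`, and the explicit `active` list are hypothetical stand-ins (the real functions keep their list in per-thread state).

```python
def repr_enter(active, obj):
    """Return 1 if obj is already being processed, else record it and return 0."""
    for item in active:              # linear scan, comparing identity not equality
        if item is obj:
            return 1
    active.append(obj)               # the object itself is stored; no key is built
    return 0


def repr_leave(active, obj):
    """Remove the most recent record of obj, scanning from the end."""
    for i in range(len(active) - 1, -1, -1):
        if active[i] is obj:
            del active[i]
            break


active = []
outer = []
inner = [outer]
outer.append(inner)                  # outer -> inner -> outer

print(repr_enter(active, outer))     # first visit
print(repr_enter(active, inner))     # first visit
print(repr_enter(active, outer))     # already active: the cycle is detected
repr_leave(active, inner)
repr_leave(active, outer)

# The built-in list repr relies on the same kind of guard:
loop = []
loop.append(loop)
print(repr(loop))                    # [[...]]
```

The scan costs time proportional to the current nesting depth, which is the trade-off against a dict lookup that the rest of the discussion turns on.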

@picnixz
Member

picnixz commented Jul 25, 2025

> I think it is meant to be used for that and other objects are using it, you can search around.

To me it doesn't because it uses a specific key namely Py_Repr in the thread dict. If the object being recursed into is already in the list, wouldn't we detect it as being circular? (I don't have a PoC so I'm wondering whether we won't have a conflicting check). Apart from that, I can only find them being used in __repr__ and never elsewhere. Here it's something that is not __repr__ so I really don't want to misuse the API despite the comment.

> It doesn't do any allocations per new objects being tracked.

While we don't allocate a full new object, we're still growing the list, or am I wrong?


cc @serhiy-storchaka as the JSON expert

Contributor

@nineteendo nineteendo left a comment

Interesting, I'll definitely port this to jsonyx if accepted. Could you decrease the diff though?

@aivarsk
Contributor Author

aivarsk commented Jul 26, 2025

> > I think it is meant to be used for that and other objects are using it, you can search around.
>
> To me it doesn't because it uses a specific key namely Py_Repr in the thread dict. If the object being recursed into is already in the list, wouldn't we detect it as being circular? (I don't have a PoC so I'm wondering whether we won't have a conflicting check). Apart from that, I can only find them being used in __repr__ and never elsewhere. Here it's something that is not __repr__ so I really don't want to misuse the API despite the comment.

Who is the best person to ask? In the worst case we can copy the functions and call them JsonEnter/JsonLeave, because this works better than the current dict approach. Another option would be to change the internal API and make markers a list, not a dict.

> > It doesn't do any allocations per new objects being tracked.
>
> While we don't allocate a full new object, we're still growing the list, or am I wrong?

It will grow the list if there is insufficient space, but it will do so only a few times over the lifetime of the thread.

@serhiy-storchaka
Member

Py_ReprEnter has a very specific purpose.

The first peculiarity is that it uses a thread-local Py_Repr list to track the objects already seen. This is not needed for the JSON encoder; it would be inefficient and may conflict with repr(). We could use an encoder attribute instead.
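
To make the possible conflict concrete, here is a toy Python model, offered as an assumption-laden sketch rather than a reproduction against CPython. `Tracker` and `encode` are hypothetical names; `thread_local_tracker` stands in for state shared between repr() and the encoder, while a fresh `Tracker()` stands in for state owned by a single encoder call.

```python
class Tracker:
    """List-based enter/leave tracker (toy stand-in for shared tracking state)."""

    def __init__(self):
        self.active = []

    def enter(self, obj):
        """Return True if obj is already active, else record it."""
        if any(item is obj for item in self.active):
            return True
        self.active.append(obj)
        return False

    def leave(self, obj):
        """Remove the most recent record of obj."""
        for i in range(len(self.active) - 1, -1, -1):
            if self.active[i] is obj:
                del self.active[i]
                break


def encode(obj, tracker):
    """Hypothetical encoder: only checks for cycles and returns nesting depth."""
    if not isinstance(obj, list):
        return 0
    if tracker.enter(obj):
        raise ValueError("Circular reference detected")
    try:
        return 1 + max((encode(child, tracker) for child in obj), default=0)
    finally:
        tracker.leave(obj)


thread_local_tracker = Tracker()     # shared by "repr" and "encode" in this model
data = [1, [2, 3]]                   # not circular

# Pretend repr(data) is in progress and code it calls tries to encode `data`.
thread_local_tracker.enter(data)
try:
    encode(data, thread_local_tracker)
    shared_ok = True
except ValueError:
    shared_ok = False                # false positive caused by the shared state
finally:
    thread_local_tracker.leave(data)

# With tracking state owned by the encoder call, nothing interferes.
private_ok = encode(data, Tracker()) == 2
print(shared_ok, private_ok)
```

In this model the shared tracker reports a cycle for data that has none, simply because another operation on the same object was already in progress; per-encoder state avoids that.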

The second difference is that it uses a list instead of a dict. This may be more efficient for shallow recursion but much slower for deep recursion. What is the threshold? 10? 100? Who knows, it needs to be measured.

We could use a hybrid approach -- list for shallow recursion and dict for deep recursion. The threshold should be determined by the results of microbenchmarks. This is a complex solution, so it should only be used if the benefit is large enough. It will also slow down the code near the threshold, so we should ensure that the overhead is small.
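
One way the hybrid idea could look is sketched below in Python, under stated assumptions: `HybridMarkers` is a hypothetical class, and `THRESHOLD = 16` is an arbitrary placeholder, since the comment above says the real value would have to come from microbenchmarks. The sketch scans a list while nesting is shallow and switches to a set of ids once the depth passes the threshold.

```python
class HybridMarkers:
    """Track active containers: list scan while shallow, id-keyed set when deep."""

    THRESHOLD = 16  # arbitrary placeholder, not a measured value

    def __init__(self):
        self._stack = []     # active objects in nesting order (keeps them alive)
        self._ids = None     # set of id(obj); built only once depth hits THRESHOLD

    def enter(self, obj):
        """Return True if obj is already active, else record it and return False."""
        if self._ids is not None:
            if id(obj) in self._ids:             # deep mode: hashed lookup
                return True
            self._ids.add(id(obj))
        else:
            if any(item is obj for item in self._stack):   # shallow mode: scan
                return True
            if len(self._stack) >= self.THRESHOLD:
                # Switch to hashed lookups from here on.
                self._ids = {id(item) for item in self._stack}
                self._ids.add(id(obj))
        self._stack.append(obj)
        return False

    def leave(self):
        """Drop the most recently entered object (an encoder leaves in LIFO order)."""
        obj = self._stack.pop()
        if self._ids is not None:
            self._ids.discard(id(obj))
            if not self._stack:
                self._ids = None     # fall back to list mode only once empty


markers = HybridMarkers()
chain = [[] for _ in range(100)]                 # 100 distinct containers
print(any(markers.enter(obj) for obj in chain))  # none seen before
print(markers.enter(chain[0]), markers.enter(chain[-1]))  # both already active
for _ in chain:
    markers.leave()
```

Dropping back to list mode only when the stack is empty is one way to avoid repeatedly rebuilding the set when the depth hovers around the threshold; whether the extra branch pays off is exactly what would need measuring.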
