Skip to content

gh-136459: Add perf trampoline support for macOS #136461

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 22, 2025

Conversation

canova
Copy link
Contributor

@canova canova commented Jul 9, 2025

This PR adds perf trampoline support for macOS. Previously it was Linux-only, which is not actually needed it to be. We can share the implementation of Linux and macOS with some minor changes done in this PR.

For example, here's a before and after profile of this PR, captured by samply (for the same dummy script):
Before / After

As you can see, now we have blue frames that mentions the real Python function names, which makes it a lot more useful to understand this profile. Without it, it's just native frames that's really difficult to understand.

This is my first PR in this project, so please let me know if there is anything I didn't do or should do. Thanks!


📚 Documentation preview 📚: https://cpython-previews--136461.org.readthedocs.build/

@python-cla-bot
Copy link

python-cla-bot bot commented Jul 9, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@canova
Copy link
Contributor Author

canova commented Jul 9, 2025

cc @pablogsal. I went ahead and submitted this PR for #136459 but please let me know what you think!

A running process may create a file in the ``/tmp`` directory, which contains entries
that can map a section of executable code to a name. This interface is described in the
profiling tool (such as `perf <https://perf.wiki.kernel.org/index.php/Main_Page>`_ or
`samply <https://github.com/mstange/samply/>`_). A running process may create a
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a shameless plug to samply 😄 Disclaimer: I'm not the maintainer of the project, but the maintainer is my colleague. But it doesn't change the fact that it's an awesome profiler! But I can revert it if you prefer not to include :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy with the plug, but this docs are going to need much more than this then. If samply is the main way to use this on macOS then we will need to update https://docs.python.org/3/howto/perf_profiling.html with full instructions for samply :)

@canova canova force-pushed the perf-trampoline-macos branch from f0887a1 to f663627 Compare July 9, 2025 10:25

/* These constants are defined inside <elf.h>, which we can't use outside of linux. */
#if !defined(__linux__)
# define EM_386 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to define the variable for different platforms. like this https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c#L126-L135

Not define all the possible variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I just changed them to be defined depending on the platform. I actually copy pasted this from the old code but your suggestion is better.

Copy link
Contributor

@Zheaoli Zheaoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, the trampoline depens on the mmap.

But it has record about the mmap performance issue on macOS.

I think we need a benchmark here

FYI

  1. tmm1/node@3542e4e
  2. https://chromium-review.googlesource.com/c/v8/v8/+/6385655/7/src/diagnostics/perf-jit.cc
  3. https://bugzilla-dev.allizom.org/show_bug.cgi?id=1827214

@pablogsal
Copy link
Member

pablogsal commented Jul 9, 2025

For now, the trampoline depens on the mmap.

But it has record about the mmap performance issue on macOS.

I think we need a benchmark here

FYI

  1. tmm1/node@3542e4e

  2. https://chromium-review.googlesource.com/c/v8/v8/+/6385655/7/src/diagnostics/perf-jit.cc

  3. https://bugzilla-dev.allizom.org/show_bug.cgi?id=1827214

Thanks for you comment but I don't think we need any benchmark. We are using mmap all over the place and that's what the allocators, the JIT and many other things do. The trampoline also batches allocations in big chunks so I think we are fine. If the problem is on samply side that's their challenge to figure out there not here. If they need some sort of hook they should mention it here but is unlikely we will add anything that's not generic.

Edit: ah I think you are referring to the mmap to tell Perf about the maps file no? Sorry I thought you were referring to the jit trampoline compiler that itself uses mmap to get the chunks.

@pablogsal
Copy link
Member

@canova i am currently on vacation until EOW I will answer more when I have some time later today or tomorrow but I would love to get macOS support. I have some questions I think we need to answer first to ensure we provide a good UX if we are adding more profilers.

@canova canova requested a review from diegorusso as a code owner July 9, 2025 23:36
@canova
Copy link
Contributor Author

canova commented Jul 9, 2025

@Zheaoli

For now, the trampoline depens on the mmap.

But it has record about the mmap performance issue on macOS.

I think we need a benchmark here

FYI

1. [tmm1/node@3542e4e](https://github.com/tmm1/node/commit/3542e4e2944f2fe1132ded1052e223f54c90e4bf)

2. https://chromium-review.googlesource.com/c/v8/v8/+/6385655/7/src/diagnostics/perf-jit.cc

3. https://bugzilla-dev.allizom.org/show_bug.cgi?id=1827214

Ah I think you are right. I actually looked for this and only found one mmap inside the perf_trampoline.c file and that was for a memory arena so thought this change was not needed and didn't think much of if after. But looking at it again, I think it's this one that we shouldn't mmap:

/*
* Map the first page of the jitdump file
*
* This memory mapping serves as a signal to perf that this process
* is generating JIT code. Perf scans /proc/.../maps looking for mapped
* files that match the jitdump naming pattern.
*
* The mapping must be PROT_READ | PROT_EXEC to be detected by perf.
*/
perf_jit_map_state.mapped_buffer = mmap(
NULL, // Let kernel choose address
page_size, // Map one page
PROT_READ | PROT_EXEC, // Read and execute permissions (required by perf)
MAP_PRIVATE, // Private mapping
fd, // File descriptor
0 // Offset 0 (first page)
);

I updated the PR to remove this on macOS, let me know what you think!

@pablogsal

@canova i am currently on vacation until EOW I will answer more when I have some time later today or tomorrow but I would love to get macOS support. I have some questions I think we need to answer first to ensure we provide a good UX if we are adding more profilers.

Thanks! I'm glad that you would like to support it on macOS. And I'm happy to discuss the details further and update the code later. (Enjoy your vacation!)

@@ -1,13 +1,21 @@
.text
#if defined(__APPLE__)
.globl __Py_trampoline_func_start
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, Apple and their symbol choices...:S

@pablogsal
Copy link
Member

@canova One thing I am concerned is that for perf I had to add jit support because nothing compiles with frame pointers in the wild which means that unless people compile Python themselves (and take a big-ish hit unfortunately) the frame-pointer version is suboptimal. What's the status of this for samply? Does it also support the perf jit interface? Can you ping here the samply maintainer if the answer is nuanced?

@mstange
Copy link

mstange commented Jul 10, 2025

Samply maintainer here - I don't understand the question, I'm afraid. What do you mean by "add jit support", and what do you mean by "the perf jit interface"?

Samply supports unwinding with DWARF information if available, but only for regular code mapped from a binary / shared library, not for JIT code. When it encounters JIT code during unwinding, it falls back to using frame pointers for that frame. It does not make use of any JIT_CODE_UNWINDING_INFO records from the jitdump.

@pablogsal
Copy link
Member

pablogsal commented Jul 10, 2025

Samply maintainer here - I don't understand the question, I'm afraid. What do you mean by "add jit support", and what do you mean by "the perf jit interface"?

Samply supports unwinding with DWARF information if available, but only for regular code mapped from a binary / shared library, not for JIT code. When it encounters JIT code during unwinding, it falls back to using frame pointers for that frame. It does not make use of any JIT_CODE_UNWINDING_INFO records from the jitdump.

Perf supports two modes to deal with JIT compilers which is what we are leveraging to make it work with Python via the trampolines:

We do implement both. The reason we implement both is that the simple maps interface works as long as perf can unwind through the JIT frames. Perf uses libunwind or eelfutils for this and both choke when Python is not compiled with frame pointers. Check https://docs.python.org/3/howto/perf_profiling.html#how-to-obtain-the-best-results for more info.

Unfortunately, nobody compiles Python with frame pointers because it's very slow. See #96174. Not only that, even if you compile with frame pointers most of the time wheels and binary packages are not compiled with frame pointers so this is sort of not very useful in the wild unless you have control over your full ecosystem.

To deal with this we implement the much more complex JIT interface. This allows us to provide to perf DWARF for the trampolines so we can pass the eh_frame values and the unwinding information. perf then uses these via some horrible code that creates a single ELF file per trampoline (yuk!) and then uses that to interpret the samples it took. The problem with this mode is that to do this perf must dump the entire stack to disk to analyze it later. Unfortunately Python C stack is very big so this not only is slow but almost all the time you need the MAX size of the dump. This ends with gigantic files and a much slower compilation. You can see our support for this in https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c

I hope this gives some background for the question.

I really want to know if simply is going to be able cope with the general case in general in Linux and Mac for all arches that we support.

I am asking this among other things because we already had problems with code that doesn't have frame pointers around the trampolines. See #130856.

@pablogsal
Copy link
Member

Also, we probably need a buildbot and tests for this to ensure this doesn't break in the future

@mstange
Copy link

mstange commented Jul 10, 2025

You can see our support for this in https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c

Thanks for the pointer! Wow, I feel slightly nauseous after reading the part about having to to space out the trampolines by the size of the unwind info... Anyway:

samply supports both perf.map file and jitdump. But as I said above, it does not respect the JIT unwind info. Instead, when finding the return address for a JIT frame, samply always uses framepointer unwinding. Looking at the actual assembly for the trampolines, I can see the following:

  • On aarch64, the trampoline actually has a frame pointer - the stp x29, x30, [sp, -16]!; mov x29, sp instructions make it so that the x29 ("fp") register is set to the location where the caller's framepointer and the return address are stored. So samply's approach should just work.
  • On x86_64, the trampoline does not set up a frame record. But it also doesn't clobber the rbp register. So: If the python C functions are compiled with framepointers, then unwinding can still resume, but the immediate caller of the JIT frame will be missing. And if the python C functions do not use framepointers, then unwinding will likely fail, because unwinding the caller using dwarf information usually requires an accurate rsp value, which samply won't have unless it starts respecting the jitdump unwind info.

For samply's current approach to work on x86_64, the x86_64 trampoline just needs to set up a framepointer - add a push rbp, mov rbp rsp at the start and a pop rbp at the end, adjust the jitdump dwarf to match (for Linux perf), and that should be it, no? Having a framepointer in the trampoline doesn't require using framepointers when compiling the python C code; it's a per-function decision. So IMO such a change would be very much orthogonal to #96174.

@pablogsal
Copy link
Member

Thanks a lot @mstange, that makes perfect sense and really clarifies the situation!

If we’re going to add samply support, we need to ensure it works reliably on both aarch64 and x86_64 across both macOS and Linux. Based on your analysis, we should definitely add frame pointers to the x86_64 trampoline in this PR to ensure consistent behavior across architectures.

I can help set up some buildbot infrastructure to ensure all of this works properly with samply once the PR is ready for more comprehensive testing.

I also think we should have full end-to-end documentation that covers how to use this with Python, similar to our existing perf docs. We should probably make the docs more generic to cover both perf and samply workflows, since users will likely want guidance on both tools depending on their platform and preferences.

@canova would you be willing to add the frame pointer setup to the x86_64 trampoline and the DWARF as part of this PR? It seems like the right time to ensure sample works everywhere and we can advertise it properly

@pablogsal
Copy link
Member

Thanks for the pointer! Wow, I feel slightly nauseous after reading the part about having to to space out the trampolines by the size of the unwind info...

I feel you: I was certainly nauseous when I had to debug that nonsense for hours and hours and even more when I had to implement the hack to "fix" it.

@Zheaoli
Copy link
Contributor

Zheaoli commented Jul 10, 2025

Edit: ah I think you are referring to the mmap to tell Perf about the maps file no? Sorry I thought you were referring to the jit trampoline compiler that itself uses mmap to get the chunks.

Yes, I mean maps file here. Sorry for confusing

@Zheaoli
Copy link
Contributor

Zheaoli commented Jul 10, 2025

I updated the PR to remove this on macOS, let me know what you think!

SGTM

@@ -1166,7 +1191,11 @@ static void perf_map_jit_write_entry(void *state, const void *code_addr,
ev.base.size = sizeof(ev) + (name_length+1) + size;
ev.base.time_stamp = get_current_monotonic_ticks();
ev.process_id = getpid();
#if defined(__APPLE__)
pthread_threadid_np(NULL, &ev.thread_id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just use this implementation in here?

cpython/Include/object.h

Lines 193 to 255 in b44316a

_Py_ThreadId(void)
{
uintptr_t tid;
#if defined(_MSC_VER) && defined(_M_X64)
tid = __readgsqword(48);
#elif defined(_MSC_VER) && defined(_M_IX86)
tid = __readfsdword(24);
#elif defined(_MSC_VER) && defined(_M_ARM64)
tid = __getReg(18);
#elif defined(__MINGW32__) && defined(_M_X64)
tid = __readgsqword(48);
#elif defined(__MINGW32__) && defined(_M_IX86)
tid = __readfsdword(24);
#elif defined(__MINGW32__) && defined(_M_ARM64)
tid = __getReg(18);
#elif defined(__i386__)
__asm__("movl %%gs:0, %0" : "=r" (tid)); // 32-bit always uses GS
#elif defined(__MACH__) && defined(__x86_64__)
__asm__("movq %%gs:0, %0" : "=r" (tid)); // x86_64 macOSX uses GS
#elif defined(__x86_64__)
__asm__("movq %%fs:0, %0" : "=r" (tid)); // x86_64 Linux, BSD uses FS
#elif defined(__arm__) && __ARM_ARCH >= 7
__asm__ ("mrc p15, 0, %0, c13, c0, 3\nbic %0, %0, #3" : "=r" (tid));
#elif defined(__aarch64__) && defined(__APPLE__)
__asm__ ("mrs %0, tpidrro_el0" : "=r" (tid));
#elif defined(__aarch64__)
__asm__ ("mrs %0, tpidr_el0" : "=r" (tid));
#elif defined(__powerpc64__)
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer)
tid = (uintptr_t)__builtin_thread_pointer();
#else
// r13 is reserved for use as system thread ID by the Power 64-bit ABI.
register uintptr_t tp __asm__ ("r13");
__asm__("" : "=r" (tp));
tid = tp;
#endif
#elif defined(__powerpc__)
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer)
tid = (uintptr_t)__builtin_thread_pointer();
#else
// r2 is reserved for use as system thread ID by the Power 32-bit ABI.
register uintptr_t tp __asm__ ("r2");
__asm__ ("" : "=r" (tp));
tid = tp;
#endif
#elif defined(__s390__) && defined(__GNUC__)
// Both GCC and Clang have supported __builtin_thread_pointer
// for s390 from long time ago.
tid = (uintptr_t)__builtin_thread_pointer();
#elif defined(__riscv)
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer)
tid = (uintptr_t)__builtin_thread_pointer();
#else
// tp is Thread Pointer provided by the RISC-V ABI.
__asm__ ("mv %0, tp" : "=r" (tid));
#endif
#else
// Fallback to a portable implementation if we do not have a faster
// platform-specific implementation.
tid = _Py_GetThreadLocal_Addr();
#endif
return tid;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! It looks like _Py_ThreadId is defined only when Py_GIL_DISABLED is defined and Py_LIMITED_API is not defined:

#if defined(Py_GIL_DISABLED) && !defined(Py_LIMITED_API)

It doesn't compile because of that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I meant just for inline assembly because it was for low over head at free threading, maybe you can just pick it for apple implementation.

@canova
Copy link
Contributor Author

canova commented Jul 10, 2025

@mstange @pablogsal Thanks for the investigation!

@canova would you be willing to add the frame pointer setup to the x86_64 trampoline and the DWARF as part of this PR? It seems like the right time to ensure sample works everywhere and we can advertise it properly

Sure that sounds good to me, I'll update the PR soon with this and add some more tests.

Would you like me to update the documentation in this PR as well?

@canova canova force-pushed the perf-trampoline-macos branch from 2830252 to 2812159 Compare July 17, 2025 09:45
@canova
Copy link
Contributor Author

canova commented Jul 17, 2025

Thanks for the suggestions. I've updated the docs and rebased the whole PR on top of the current main branch. Please let me know what you think!

@canova canova requested review from hugovk and pablogsal July 17, 2025 09:47
@canova canova force-pushed the perf-trampoline-macos branch from 2812159 to 7b463c8 Compare July 17, 2025 09:50
@canova canova force-pushed the perf-trampoline-macos branch from 7b463c8 to 3ae5cb2 Compare July 17, 2025 09:53
@pablogsal
Copy link
Member

I will do a final pass tomorrow. @canova @mstange if samply works on windows would it be possible to add windows support somehow? What does the windows story require?

@pablogsal
Copy link
Member

I'm landing this for now but we should work on improving the docs separately. And there is still the Windows question for @canova to answer.

@pablogsal pablogsal merged commit a667800 into python:main Jul 22, 2025
41 checks passed
@pablogsal
Copy link
Member

Thanks a lot for the contribution and the patience @canova. Excellent work 💪

And thanks a lot for you insights @mstange 👌

@canova
Copy link
Contributor Author

canova commented Jul 22, 2025

Thanks for the reviews and help @pablogsal and @hugovk!

I will file an issue for the docs improvements.

@canova @mstange if samply works on windows would it be possible to add windows support somehow? What does the windows story require?

I'm not so sure as I'm not familiar with its implementation in samply. Maybe @mstange could give more insight into it.

@jrmuizel
Copy link

samply does work on Windows but it uses ETW JSCRIPT9::MethodLoad events instead of jitdump. Firefox and Chrome both emit these events for jitted JS code.

We emit the events here: https://searchfox.org/mozilla-central/rev/387160feb07b75ae76bfc12df035a10f58d25168/js/src/jit/PerfSpewer.cpp#643

It probably makes sense for python to follow a similar approach as that will allow it to work properly with Windows Performance Analyzer as well.

@canova
Copy link
Contributor Author

canova commented Jul 22, 2025

Ah, thanks for the information, Jeff! I filed #137022 for this.

encukou added a commit to encukou/cpython that referenced this pull request Jul 23, 2025
pythongh-136461 added perf support for macOS, with ifdefs around all changes
except increasing thread_id to 64 bits.
Make that change Apple-specific too.
encukou added a commit that referenced this pull request Jul 23, 2025
gh-136461 added perf support for macOS, with ifdefs around all changes
except increasing thread_id to 64 bits.
Make that change Apple-specific too.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy