fix(opentelemetry): trace context propagation in process-pool workers #1017

Open · wants to merge 7 commits into base: main
Conversation

gregbrowndev

What was changed

This PR fixes a bug in the OpenTelemetry TracingInterceptor affecting sync, multi-process activities. The fix ensures OTEL tracing works inside the user's activity implementation, e.g. creating child spans, attaching trace events, log correlation, profile correlation, distributed tracing with other systems, etc.

Unlike for async or sync multi-threaded activities, the TracingInterceptor/_ActivityInboundImpl interceptors did not propagate the OTEL trace context across the process pool.

Note: Both async and threadpool executors manage the trace context via Python's contextvars.

For the process-pool executor, any data we want to send to the child process must be extracted from contextvars and/or otherwise passed to loop.run_in_executor as picklable arguments to the target _execute_sync_activity function, and then rebuilt into contextvars on the other side.
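
For illustration, the shape of that hand-off looks roughly like the sketch below, using OpenTelemetry's propagation API to turn the current trace context into a picklable dict in the parent and re-attach it in the child. The helper names are made up for this example and are not the PR's actual code.

```python
from typing import Any, Callable

from opentelemetry import context
from opentelemetry.propagate import extract, inject


def capture_trace_context() -> dict:
    # Parent process: serialise the current trace context into a plain dict
    # using the configured propagator (W3C traceparent/tracestate by default).
    carrier: dict = {}
    inject(carrier)
    return carrier


def run_with_trace_context(carrier: dict, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    # Child process: rebuild the context from the dict and attach it before
    # delegating, so spans created inside fn are parented correctly.
    token = context.attach(extract(carrier))
    try:
        return fn(*args, **kwargs)
    finally:
        context.detach(token)
```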

Since the trace context is created in the TracingInterceptor in the parent process, it would be difficult to get it all the way down to _ActivityInboundImpl, where it could be sent to the child process, without introducing OpenTelemetry as a core dependency. This change attempts to be as transparent as possible, but may introduce a breaking change (see end of section).

The TracingInterceptor's inbound activity interceptor now handles the special case of sync, non-threadpool executor activities. It wraps the input.fn in a picklable dataclass (a rough sketch follows the list) that:

  • captures the trace context for the parent span created in the interceptor / parent process
  • exposes its own __call__ function that becomes the entrypoint of the subprocess task, which reattaches the trace context before delegating to the original activity function
  • preserves as much of the original activity function's metadata as possible, using functools.wraps. This is because downstream interceptors, such as the SentryInterceptor in the Python examples (see feat: add example using Sentry V2 SDK samples-python#140), use reflection on the activity attributes, e.g. fn.__name__, fn.__module__.
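
A rough, illustrative sketch of such a wrapper is shown below. The class name and details are hypothetical; it only demonstrates the capture / re-attach / functools.wraps idea described in the bullets above, not the PR's actual implementation.

```python
import functools
from dataclasses import dataclass, field
from typing import Any, Callable

from opentelemetry import context
from opentelemetry.propagate import extract, inject


@dataclass
class _ContextPropagatingActivity:
    """Picklable callable that re-attaches the OTEL trace context in the worker process."""

    fn: Callable[..., Any]
    carrier: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Capture the parent process's current trace context as a plain dict so
        # the whole object (function reference + carrier) can be pickled.
        inject(self.carrier)
        # Copy __name__, __module__, __doc__, etc. from the real activity so
        # downstream interceptors that reflect on input.fn keep working.
        functools.update_wrapper(self, self.fn)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Entry point inside the child process: restore the trace context, run
        # the original activity, then detach so nothing leaks between tasks.
        token = context.attach(extract(self.carrier))
        try:
            return self.fn(*args, **kwargs)
        finally:
            context.detach(token)
```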

Tests have been added to verify the fix and I've had this patch running in our production environment for several weeks without any issues.

Breaking Change

As mentioned above, this change may break downstream interceptors that rely on receiving the original activity function handle directly.

Care has been taken to ensure common properties are preserved using functools.wraps, as you would with a decorator. However, without more significant changes to other parts of the SDK, I think this cannot be avoided, since a closure wrapping the activity function cannot be pickled.

Users would need to ensure any such interceptor switches on the function name, fn.__name__, rather than on a reference to the real function.
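
As a sketch of what that looks like in a downstream interceptor, assuming the SDK's ActivityInboundInterceptor API (the interceptor and activity names here are assumptions, not code from this PR):

```python
from typing import Any

from temporalio.worker import ActivityInboundInterceptor, ExecuteActivityInput


class MyActivityInterceptor(ActivityInboundInterceptor):
    async def execute_activity(self, input: ExecuteActivityInput) -> Any:
        # Before: identity checks such as `input.fn is my_activity` worked.
        # After this change input.fn may be a picklable wrapper, so switch on
        # the name preserved by functools.wraps instead:
        if input.fn.__name__ == "my_activity":
            ...  # activity-specific handling, e.g. tagging an error scope
        return await self.next.execute_activity(input)
```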

Why?

Users of the SDK's process-pool Worker currently cannot leverage OTEL tracing capabilities inside their own activity implementation. The TracingInterceptor correctly instruments the activity's root span, but further downstream tracing is not properly linked to this parent span. The following are currently broken in sync, multiprocess activities (a minimal example follows the list):

  • creating child spans
  • attaching trace events / attributes
  • log correlation (e.g. with experimental OTEL logging SDK)
  • profile correlation (e.g. with Grafana Pyroscope SDK)
  • propagating the trace context onwards for distributed tracing with other systems
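
As a small, hypothetical example of the first two points, an activity like the following should now produce a span correctly parented under the activity span started by the TracingInterceptor (the activity, span, and attribute names are illustrative):

```python
from opentelemetry import trace
from temporalio import activity

tracer = trace.get_tracer(__name__)


@activity.defn
def process_order(order_id: str) -> None:
    # With the fix, this span is parented to the activity span created by
    # TracingInterceptor in the parent process, even under the process pool.
    with tracer.start_as_current_span("process_order.validate") as span:
        span.set_attribute("order.id", order_id)
        span.add_event("validation started")
```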

Checklist

Closes: #669

How was this tested:

Added tests to verify:

  • child spans in the activity have the correct parent
  • the wrapped activity preserves the original function's metadata

Manual testing using the OTEL logging SDK in my app shows that logs emitted in the activities are injected with the correct trace_id/span_id, enabling log correlation in Grafana/Tempo/Loki. I didn't want to add this to the tests as the logging SDK is still experimental.

Note: testing this was quite tricky. I used a proxy list in the server process manager to access the spans exported from the child process's SpanExporter. I don't expect this would ever be necessary in production code, since all of the OTEL tracing exporters that I've seen use a push-based approach to export spans directly to an OTEL collector or tracing backend. (I think I remember seeing a Jaeger guide that indicated scraping traces from an endpoint, but I believe that was for native Jaeger tooling.) With a push-based exporter, e.g. OTLPSpanExporter, the child process can simply export its spans without needing to consolidate them with the parent process, even while the parent span created in the TracingInterceptor has yet to complete and be exported; tracing backends expect to receive spans out of order.
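
For reference, the test-only exporter is along the lines of the following sketch (assumed names and fields, not the exact test code):

```python
from typing import Sequence

from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult


class ListProxySpanExporter(SpanExporter):
    """Appends finished spans to a multiprocessing Manager list proxy shared with the parent."""

    def __init__(self, finished_spans) -> None:
        # `finished_spans` is a Manager().list() created in the parent process.
        self._finished_spans = finished_spans

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        # Send only picklable fields through the proxy; the full ReadableSpan
        # object may not survive pickling.
        self._finished_spans.extend(
            [
                (s.name, s.context.trace_id, s.parent.span_id if s.parent else None)
                for s in spans
            ]
        )
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass
```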

Any docs updates needed?

Hopefully, no change from users is necessary.

gregbrowndev and others added 5 commits July 20, 2025 16:10
- Add test to show trace context is not available
- This test implementation isn't to be taken as a reference for production.
  The fixed `TracingInterceptor` works in production, provided you use
  the `OTLPSpanExporter` or another exporter that pushes traces to a collector
  or backend, rather than one that pulls traces from the server (if one exists).
- Add a custom span exporter to write finished_spans to a list proxy
  created by the server process manager. This is because we want to test the
  full trace across the process pool. Again, in production, the child process
  can just export spans directly to a remote collector. Tracing is designed
  to handle distributed systems.
- Ensure the child process is initialised with its own TracerProvider to
  avoid different default mp_context behaviours across macOS and Linux
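
For context, a per-worker TracerProvider can be set up via the executor's initializer, along the lines of the sketch below. This is a hedged illustration rather than the test's actual setup, and the OTLP exporter shown is an assumption for the production-style case:

```python
from concurrent.futures import ProcessPoolExecutor

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def _init_worker_tracing() -> None:
    # Runs once per child process: give each worker its own TracerProvider so
    # spans created there are pushed straight to the collector.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)


activity_executor = ProcessPoolExecutor(initializer=_init_worker_tracing)
```

Such an executor can then be passed to the worker as its activity executor.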
@gregbrowndev gregbrowndev requested a review from a team as a code owner August 3, 2025 22:45
@CLAassistant

CLAassistant commented Aug 3, 2025

CLA assistant check
All committers have signed the CLA.

@gregbrowndev gregbrowndev changed the title Fix/opentelemetry trace context propagation fix(opentelemetry): trace context propagation in process-pool workers Aug 3, 2025
- For some reason, the docstring comparison in the reflection check seemed to fail in Python 3.9
- I shortened the docstring to make it easier to compare in VSCode test output, and that seemed to fix
  the test. Maybe 3.9 doesn't strip leading spaces in the docstring (e.g. like textwrap.dedent)?
Successfully merging this pull request may close these issues.

[Feature Request] Support / provide guidance on using OpenTelemetry logging + metrics SDKs with process-pool workers