fix(opentelemetry): trace context propagation in process-pool workers #1017
What was changed
This PR fixes a bug in the OpenTelemetry TracingInterceptor affecting sync, multi-process activities. The fix makes tracing available inside the user's activity implementation, e.g. creating child spans, trace events, log correlation, profile correlation, distributed tracing with other systems, etc.
Unlike async or sync multi-threaded activities, the TracingInterceptor/_ActivityInboundImpl interceptors did not propagate the OTEL trace context across the process pool.
For the process-pool executor, any data we want to send to the child process must be extracted from contextvars and/or otherwise passed to loop.run_in_executor as picklable arguments to the target _execute_sync_activity function, and then rebuilt into contextvars on the other side.
Since the trace context is created in the TracingInterceptor in the parent process, it would be difficult to get it all the way down to _ActivityInboundImpl, where it can be sent to the child process, without introducing OpenTelemetry as a core dependency. This change attempts to be as transparent as possible, but may introduce a breaking change (see the Breaking Change section below).
The TracingInterceptor's inbound activity interceptor now handles the special case for sync, non-threadpool executor activities. It wraps the input.fn in a picklable dataclass that:
- defines a __call__ function that becomes the entrypoint of the subprocess task, which reattaches the trace context before delegating to the original activity function
- preserves the original function's attributes using functools.wraps. This is because downstream interceptors, such as the SentryInterceptor in the Python examples (see feat: add example using Sentry V2 SDK samples-python#140), use reflection on the activity attributes, e.g. fn.__name__, fn.__module__.
Tests have been added to verify the fix and I've had this patch running in our production environment for several weeks without any issues.
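To illustrate the wrapping approach described above, here is a minimal, stdlib-only sketch. It is not the code in this PR: PicklableTracedActivity, current_trace and say_hello are hypothetical names, a plain ContextVar stands in for the OpenTelemetry context, and a pickle round trip stands in for the hop into the process pool.

```python
import contextvars
import functools
import pickle
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical stand-in for the OTEL context; the real interceptor would use
# the OpenTelemetry propagation API rather than a plain ContextVar.
current_trace = contextvars.ContextVar("current_trace", default=None)


@dataclass
class PicklableTracedActivity:
    """Hypothetical wrapper: the activity function plus a trace carrier."""

    fn: Callable[..., Any]
    carrier: Dict[str, str]

    def __post_init__(self) -> None:
        # Copy __name__, __module__, __doc__, etc. so interceptors that
        # reflect on the activity function still see the original values.
        functools.update_wrapper(self, self.fn)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Runs on the receiving side: reattach the context, then delegate.
        token = current_trace.set(dict(self.carrier))
        try:
            return self.fn(*args, **kwargs)
        finally:
            current_trace.reset(token)


def say_hello(name: str) -> str:
    """Example sync activity that reads the ambient trace context."""
    trace = current_trace.get() or {}
    return f"hello {name} (trace_id={trace.get('trace_id', 'none')})"


# "Parent" side: capture the context into a picklable carrier.
wrapped = PicklableTracedActivity(say_hello, {"trace_id": "abc123"})

# Simulate crossing the process boundary with a pickle round trip.
restored = pickle.loads(pickle.dumps(wrapped))
result = restored("world")
```

Because the wrapper is an instance of a module-level class rather than a closure, pickle can serialize it by reference to the class and the wrapped function, and the copied attributes such as __name__ are carried across in the instance state.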
Breaking Change
As mentioned above, this change may break downstream interceptors that rely on receiving the original activity function handle directly.
Care has been taken to ensure common properties are preserved using functools.wraps, like you would with a decorator. However, without more significant changes to other parts of the SDK, I think this cannot be avoided, since a closure function cannot be pickled.
Users would need to ensure any interceptor switches on the function name, fn.__name__, rather than a reference to the real function.
Why?
Users of the SDK's process-pool Worker currently cannot leverage OTEL tracing capabilities inside their own activity implementation. The TracingInterceptor correctly instruments the activity's root span, but further downstream tracing is not properly linked to this parent span. The following is currently broken in sync, multiprocess activities:
Checklist
Closes: 669
How was this tested:
Added tests to verify:
Manual testing using the OTEL logging SDK in my app shows that logs emitted in the activities are injected with the correct trace_id/span_id, enabling log correlation in Grafana/Tempo/Loki. I didn't want to add this to the tests as the logging SDK is still experimental.
Note: testing this was quite tricky. I used a proxy list in the server process manager to access the spans exported from the child process's SpanExporter. I don't expect this would ever be necessary in production code (especially with OpenTelemetry), since all of the OTEL tracing exporters that I've seen use a push-based approach to export spans directly to an OTEL collector or tracing backend. (I think I remember seeing a Jaeger guide that indicated scraping traces from an endpoint, but I think that was for native Jaeger tooling.) With a push-based exporter, e.g. OTLPTraceExporter, the child process can simply export its spans without needing to consolidate them with the parent process, even while the parent span created in the TracingInterceptor is yet to complete and be exported, because the tracing backends expect to receive spans out of order.
Any docs updates needed?
Hopefully, no change from users is necessary.