
Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model
Open, Medium, Public

Description

At the Hackathon, @TK-999 wrote a proof-of-concept patch integrating Mediawiki with the OTel library, including creation of Spans for database queries. We're requesting a few passes of code review on this patch, beginning with any architectural issues and an estimate of how much more work is needed.

Some docs on what features & extra data you can get by linking against instrumentation: https://opentelemetry.io/docs/concepts/instrumentation/libraries/

Docs on the PHP library: https://opentelemetry.io/docs/instrumentation/php/

Event Timeline

Reedy renamed this task from Mediawiki imports OpenTelemetry client instrumentation library for enhanced trace metadata to MediaWiki imports OpenTelemetry client instrumentation library for enhanced trace metadata. Jun 27 2023, 2:30 PM
Restricted Application changed the subtype of this task from "Task" to "Spike". Aug 21 2023, 1:53 PM
Krinkle removed a project: Spike.
Krinkle updated the task description.
Krinkle changed the subtype of this task from "Spike" to "Task".

Change #1027519 had a related patch set uploaded (by TK-999; author: TK-999):

[mediawiki/core@master] [DNM] PoC: Instrument MediaWiki with OpenTelemetry

https://gerrit.wikimedia.org/r/1027519

To run the above PoC, you need a Jaeger deployment. This is trivially doable with the upstream all-in-one image:

$ docker run --rm -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one:1.57

If the upstream image cannot be used, a locally built jaeger binary or the WMF images should work too, although in the latter case some additional configuration is required to make the collector and query services work together.

Then, you need a config in LocalSettings.php:

$wgOpenTelemetryConfig = [
	'endpoint' => 'http://127.0.0.1:4318/v1/traces',
	'serviceName' => 'mediawiki-dev',

	'samplingProbability' => 1.0 // change as needed
];

A couple of thoughts on next steps:

  • We should decide whether we would like a vendor-agnostic tracing abstraction in core or if we're fine with using the OTEL interfaces. PSR-22 proposes a standard tracing interface for PHP, but the work appears to have stalled and the proposed interfaces seem rudimentary at best. (A rough sketch of what a minimal abstraction could look like follows after this list.)
    • We should also decide whether to use the official OTEL client at all or write our own. The former case would AIUI require a code review for security and other considerations.
  • We should decide on the exact level of instrumentation we want. @Krinkle has rightfully warned that trying to instrument too much userland code would effectively be a resurgence of the old wfProfileIn/Out situation, with profiling calls creeping into ever more areas of the code. I generally agree with this, as IMO distributed tracing is not profiling—that job can be done better by, and therefore should be left to, a profiler. However, tracing instrumentation can bring two benefits:
    1. It can help visualize the exact components involved in a distributed operation with many actors, e.g. an edit in a large-scale setup such as Wikimedia which may involve HTTP requests to multiple downstream services and background tasks published to message queues and executed at a later stage. Visualizing such an operation in the form of a trace can help operators pinpoint problematic areas, allowing them to focus the scope of an investigation.
    2. It can provide additional metadata for calls that involve calling an external dependency, such as a downstream service, a database or a cache, e.g. the executed query, or the IP address of the host involved. This additional context could in turn be useful in diagnosing problems.
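For a sense of scale, a vendor-agnostic abstraction would not need to be large. A rough sketch of what a minimal interface could look like (the names below are hypothetical, taken neither from the PoC patch nor from PSR-22):

// Hypothetical sketch of a minimal vendor-agnostic tracing abstraction.
interface SpanInterface {
	/** Attach an arbitrary key/value attribute to the span. */
	public function setAttribute( string $key, $value ): self;

	/** Mark the span as finished; the end timestamp is taken here. */
	public function end(): void;
}

interface TracerInterface {
	/** Start a new span, as a child of the currently active span if there is one. */
	public function startSpan( string $name ): SpanInterface;

	/** Export all finished spans, e.g. at the end of the request. */
	public function shutdown(): void;
}

An OTEL-backed implementation (or a hand-rolled one) could then sit behind these two interfaces without the rest of core caring which it is.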

This is an amazing proof-of-concept, thanks so much @TK-999 !!!

I'm generally in favor of just using the OTel interfaces and definitions. They're rigorously defined, and have broad industry support and a well-functioning open source community.

Re: your second question, I think in general if it looks like an RPC, we should trace it. Database queries for sure. Possibly memcached traffic, depending on the operation counts and overhead.

> This is an amazing proof-of-concept, thanks so much @TK-999 !!!
>
> I'm generally in favor of just using the OTel interfaces and definitions. They're rigorously defined, and have broad industry support and a well-functioning open source community.
>
> Re: your second question, I think in general if it looks like an RPC, we should trace it. Database queries for sure. Possibly memcached traffic, depending on the operation counts and overhead.

Thanks @CDanis :) Yeah, that is my general sentiment as well for adding instrumentation.

One more thing I had forgotten to mention: the OTEL specification does not seem to state a preferred way to "force" a trace to be sampled regardless of the sampling probability or other configuration. We ended up adding support for the jaeger-debug-id header in our OTEL integration at Fandom: when it is present on an incoming HTTP request, it forces sampling and its value is added as a contextual tag. Such functionality is useful for reliably obtaining a trace of an operation with a consistently reproducible failure.
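To illustrate the idea (the header name matches what we used, but the function and its surroundings are just a sketch, not the actual Fandom code), the sampling decision could short-circuit on the debug header:

// Sketch only: force sampling when a debug header is present on the request.
function shouldSample( array $requestHeaders, float $samplingProbability ): bool {
	// A request carrying the debug header is always sampled, so a
	// consistently reproducible failure can reliably be traced.
	if ( isset( $requestHeaders['jaeger-debug-id'] ) ) {
		return true;
	}
	// Otherwise fall back to plain probabilistic sampling.
	return ( mt_rand() / mt_getrandmax() ) < $samplingProbability;
}

The debug id value would additionally be attached to the root span as a contextual tag so that the resulting trace can be found again later.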

CDanis triaged this task as Medium priority.
CDanis updated the task description.

Change #1027519 had a related patch set uploaded (by TK-999; author: TK-999):

[mediawiki/core@master] [DNM] PoC: Instrument MediaWiki with OpenTelemetry

https://gerrit.wikimedia.org/r/1027519

I've spent most of today reviewing this in Gerrit. Highlights:

  1. External dependencies: Too large, too complex, too volatile, insufficient support.

Note this is a criticism of upstream, not of Máté or SRE. In reviewing the patch, I've debugged and navigated that code several times, and it's far too indirect. We'd likely depend on one or two people in the org being able and willing to debug that. I'd prefer not to be one of them.

I acknowledge that this would enable something super cool and useful. That utility is in the OTEL concept and the Jaeger service, not the opentelemetry-php package.

It is too large for something that is logically extremely simple: we build one nested key-value array of words and numbers, and submit it as JSON to an HTTP API. (And the library isn't even responsible for the JSON and HTTP part.) This should be close to 100 lines of code in one or two classes, that we write and review once, in a way that inexperienced contributors can easily read and understand in one session, and which we then support more or less unchanged for a long time.

The external dependencies would add about 4 megabytes to vendor. That's a 10% increase in size for MediaWiki releases (compared to mediawiki-1.42.1.tar.gz/vendor).

$ du -sh
2.7M vendor/open-telemetry/
920K vendor/google/protobuf/
156K vendor/php-http/

$ sloc -i php vendor/open-telemetry/
500 source files containing 20,000 "source" lines of code (i.e. excluding whitespace and comments)

$ sloc -i php  vendor/google/protobuf/
124 source files containing 10,000 source lines of code.

$ sloc -i php  vendor/php-http/
24 source files containing 1,300 source lines of code.

The current version of opentelemetry-php has already dropped support for PHP 7.4 and PHP 8.0, requiring PHP 8.1. This is part of why the patch is failing CI: we're not getting past the first CI step of downloading code, before we even install and start MW or run the tests. Assuming they'll be similarly quick to drop support in the future, this means we can expect to lag behind in versions, lacking upstream support and missing out on free security and performance fixes. Note that while WMF's PHP upgrade cadence plays a role in whether we can even deploy it, this stands separate from that (even if we were willing to make long-term technical bets on that cadence improving), because what we bundle in MW has to satisfy our minimum requirement, not just what we run in production. That includes, for example, MediaWiki LTS releases and periodic updates to their vendored dependencies.

  2. Tests!

The patch doesn't add any tests yet. We should have at least one integration test that constructs the tracing singleton, creates one or two spans (perhaps overlapping them and then nulling them to trigger the automatically scoped end-of-span), verifies that nothing is sent, calls the class's send/shutdown method, and asserts what it sends (HTTP body) and to where (URL).

In order for the test to be simple, self-documenting, and provide the most confidence, having a deterministic outcome that we can assert as a single string would be great.

That will probably rest on the clock source. I recommend the Wikimedia\ConvertibleTimestamp lib and microtime, which let you control clock progression in a test context, so that there is a deterministic passage of time for each of the spans, and thus a deterministic outcome that we can assert. (That reminds me, should we add high-resolution hrtime to this lib?)
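To make that concrete, a test shaped roughly like the following could pin the clock with ConvertibleTimestamp::setFakeTime(). The Tracer and FakeExporter names are placeholders for whatever the minimal library ends up providing; this is a sketch of the shape of the test, not working code against an existing API.

use Wikimedia\Timestamp\ConvertibleTimestamp;

class TracerIntegrationTest extends \PHPUnit\Framework\TestCase {
	public function testSpansAreOnlySentOnShutdown() {
		$fakeTime = 1700000000;
		// Every clock read advances the fake time by one second, so each
		// span gets deterministic start/end timestamps.
		ConvertibleTimestamp::setFakeTime( static function () use ( &$fakeTime ) {
			return $fakeTime++;
		} );

		$exporter = new FakeExporter(); // hypothetical: records request bodies instead of sending them
		$tracer = new Tracer( $exporter, 'http://127.0.0.1:4318/v1/traces' );

		$outer = $tracer->startSpan( 'outer' );
		$inner = $tracer->startSpan( 'inner' );
		$inner = null; // trigger the automatically scoped end-of-span
		$outer->end();

		$this->assertSame( [], $exporter->getRequests(), 'nothing sent before shutdown' );

		$tracer->shutdown();
		[ $url, $body ] = $exporter->getRequests()[0];
		$this->assertSame( 'http://127.0.0.1:4318/v1/traces', $url );
		$this->assertStringContainsString( '"startTimeUnixNano"', $body );

		ConvertibleTimestamp::setFakeTime( false );
	}
}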

  3. The patch is not passing CI yet.

I bypassed the PHP version and vendor issue locally and ran the test suite against a local PHP 8.2 install.

The nature of CI tests is often that only one issue can be found at a time; until it is fixed or worked around, the next issue is not found. As such, I essentially fixed each of them in turn. Each issue turned out to be easy to fix, once found. I've left details about these on the patch.

  4. Inclusion of BagOStuff and "async" JobRunner.

The tracing spans for database queries (Rdbms) and memcached (WANObjectCache, powered by BagOStuff) are great.

I've suggested we exclude the ones for internal uses of BagOStuff, as these should be redundant with WANObjectCache, and the non-WANCache uses of BagOStuff seem either very noisy and not tracing-relevant (i.e. array lookups) or similarly redundant (session store, parser cache, etc.).

I've left details in Gerrit. I believe removing these only improves the usability of the tracing reports, without detracting from their value. I'm happy to reconsider this if others do find them useful (in that case, please read my CR, consider the drawbacks I point out there, and address or explain them in some way!).

With regards to (1) -- the OTEL spec is bespoke, but the trace data format is less so. If we want to, we could probably create a simpler client as suggested with a reasonable effort. While it wouldn't conform to the spec, that shouldn't matter too much in the big picture of things as long as it is conformant when it comes to exporting data and propagating trace context.

I agree with @Krinkle on the sizing: I was surprised by how much code the PHP OpenTelemetry SDK introduces. It's important to note that php-otel provides not only traces but also metrics and logging functionality. We would need only the tracing capabilities, but the PHP OTel SDK comes as a bundle.
We could skip the protobuf extension and send the data in JSON format, which is slower and needs more throughput. The google/protobuf lib is not recommended:

The native protobuf library is significantly slower than the extension. We strongly encourage the use of the extension.

For the exporter we could use Guzzle, which is already part of MediaWiki, but we would also need all the PSR HTTP message/factory interfaces, and I don't think we have those.
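That said, if we build the payload ourselves, the OTLP/HTTP export itself is just a JSON POST, which Guzzle can already do without the PSR factories. A rough sketch (endpoint and timeout are examples; $payload stands for a resourceSpans structure like the sample attached below):

use GuzzleHttp\Client;

// Sketch only: POST a pre-built OTLP JSON payload to the collector.
$client = new Client();
$client->post( 'http://127.0.0.1:4318/v1/traces', [
	'json' => $payload, // Guzzle serialises the array and sets the JSON content type
	'timeout' => 1,     // keep a slow collector from stalling the request
] );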

OTEL sends "reasonable" JSON structures. It should be easy to provide a basic library that generates such structures in a much lighter way.
Let me attach a sample JSON structure that my test env produces:

{
   "resourceSpans":[
      {
         "resource":{
            "attributes":[
               {
                  "key":"host.name",
                  "value":{
                     "stringValue":"228458115fd8"
                  }
               },
               {
                  "key":"host.arch",
                  "value":{
                     "stringValue":"aarch64"
                  }
               },
               {
                  "key":"os.type",
                  "value":{
                     "stringValue":"linux"
                  }
               },
               {
                  "key":"os.description",
                  "value":{
                     "stringValue":"6.6.32-linuxkit"
                  }
               },
               {
                  "key":"os.name",
                  "value":{
                     "stringValue":"Linux"
                  }
               },
               {
                  "key":"os.version",
                  "value":{
                     "stringValue":"#1 SMP Thu Jun 13 14:13:01 UTC 2024"
                  }
               },
               {
                  "key":"process.pid",
                  "value":{
                     "intValue":"1"
                  }
               },
               {
                  "key":"process.executable.path",
                  "value":{
                     "stringValue":"/usr/local/bin/php"
                  }
               },
               {
                  "key":"process.owner",
                  "value":{
                     "stringValue":"php"
                  }
               },
               {
                  "key":"process.runtime.name",
                  "value":{
                     "stringValue":"cli-server"
                  }
               },
               {
                  "key":"process.runtime.version",
                  "value":{
                     "stringValue":"8.2.22"
                  }
               },
               {
                  "key":"telemetry.sdk.name",
                  "value":{
                     "stringValue":"opentelemetry"
                  }
               },
               {
                  "key":"telemetry.sdk.language",
                  "value":{
                     "stringValue":"php"
                  }
               },
               {
                  "key":"telemetry.sdk.version",
                  "value":{
                     "stringValue":"dev-main"
                  }
               },
               {
                  "key":"telemetry.distro.name",
                  "value":{
                     "stringValue":"opentelemetry-php-instrumentation"
                  }
               },
               {
                  "key":"telemetry.distro.version",
                  "value":{
                     "stringValue":"1.0.3"
                  }
               },
               {
                  "key":"service.name",
                  "value":{
                     "stringValue":"mediawiki-otel"
                  }
               },
               {
                  "key":"service.version",
                  "value":{
                     "stringValue":"0.1"
                  }
               }
            ]
         },
         "scopeSpans":[
            {
               "scope":{
                  "name":"io.opentelemetry.contrib.php"
               },
               "spans":[
                  {
                     "traceId":"619ffda61a5bda4c21daded87d711ac3",
                     "spanId":"acdb1288a3c2a88b",
                     "parentSpanId":"586e40b6a4905cd5",
                     "name":"GET http://localhost/query.php?query=test",
                     "kind":3,
                     "startTimeUnixNano":"1722854134303497237",
                     "endTimeUnixNano":"1722854134563485446",
                     "attributes":[
                        {
                           "key":"http.method",
                           "value":{
                              "stringValue":"GET"
                           }
                        },
                        {
                           "key":"http.status_code",
                           "value":{
                              "intValue":"200"
                           }
                        },
                        {
                           "key":"http.response_content_length",
                           "value":{
                              "stringValue":"339"
                           }
                        }
                     ],
                     "droppedAttributesCount":1,
                     "status":{
                        "code":1
                     },
                     "flags":257
                  }
               ]
            }
         ],
         "schemaUrl":"https://opentelemetry.io/schemas/1.26.0"
      }
   ]
}
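Producing a structure of this shape by hand stays small. As a rough sketch (the class and method names are hypothetical, not from any existing patch), a hand-rolled span could serialise itself like this:

// Sketch of a hand-rolled span serialising itself into the OTLP JSON shape above.
class MinimalSpan {
	public function __construct(
		private string $traceId,
		private string $spanId,
		private ?string $parentSpanId,
		private string $name,
		private int $startNano,
		private int $endNano,
		private array $attributes = []
	) {
	}

	public function toOtlpArray(): array {
		$attrs = [];
		foreach ( $this->attributes as $key => $value ) {
			// OTLP wraps every attribute value in a typed envelope.
			$attrs[] = [
				'key' => $key,
				'value' => is_int( $value )
					? [ 'intValue' => (string)$value ]
					: [ 'stringValue' => (string)$value ],
			];
		}
		return [
			'traceId' => $this->traceId,
			'spanId' => $this->spanId,
			'parentSpanId' => $this->parentSpanId ?? '',
			'name' => $this->name,
			'kind' => 3, // SPAN_KIND_CLIENT, as in the sample above
			'startTimeUnixNano' => (string)$this->startNano,
			'endTimeUnixNano' => (string)$this->endNano,
			'attributes' => $attrs,
		];
	}
}

Wrapping a list of such arrays in the resourceSpans/scopeSpans envelope and POSTing it (as in the Guzzle sketch above) is essentially the whole exporter.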

There is a PHP-FIG standard proposal, but it hasn't moved at all: https://github.com/php-fig/fig-standards/pull/1301. If we go with our own lightweight solution, it would be great to also help move the standard forward.

Yeah, there are some additional examples at https://github.com/open-telemetry/opentelemetry-proto/blob/v1.3.2/examples/trace.json. I'll look into creating a PS this week about implementing a simple OTEL tracing lib from scratch.

I wouldn't worry too much about the performance of either the JSON or protobuf serialization, since we'll be sampling trace data heavily in production and unsampled spans never get exported. Our OTEL integration at Fandom used the pure-PHP protobuf serialization and it had no discernible impact on overall performance flamegraphs, thanks to sampling.

Change #1061573 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/core@master] Introduce minimal OTEL tracing library

https://gerrit.wikimedia.org/r/1061573

Change #1061574 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/core@master] PoC: Add request-level OTEL instrumentation

https://gerrit.wikimedia.org/r/1061574

I've prepared a simple tracing library as discussed above.

I'm a bit worried about ownership, should we decide to go with this approach—the OTEL specification might change, requiring us to adapt the client, or we may need to implement new functionality from the spec that we'd like to leverage. It'd be good to make sure there's an owning team for the lib ready to take on such work to hopefully avoid friction in the future.

Change #1062104 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/libs/Timestamp@master] Implement fake-able hrtime(), deprecate microtime()

https://gerrit.wikimedia.org/r/1062104

Change #1062104 merged by jenkins-bot:

[mediawiki/libs/Timestamp@master] Implement fake-able hrtime(), deprecate microtime()

https://gerrit.wikimedia.org/r/1062104

Change #1027519 abandoned by Máté Szabó:

[mediawiki/core@master] [DNM] Add basic OpenTelemetry instrumentation

Reason:

Superseded by Ibc3910058cd7ed064cad293a3cdc091344e66b86.

https://gerrit.wikimedia.org/r/1027519

Change #1061573 merged by jenkins-bot:

[mediawiki/core@master] Introduce minimal OTEL tracing library

https://gerrit.wikimedia.org/r/1061573

pmiazga renamed this task from MediaWiki imports OpenTelemetry client instrumentation library for enhanced trace metadata to Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model. Fri, Oct 11, 2:53 PM

Updated the ticket title to match the current state of things. Yesterday we merged the minimal OTEL tracing library. Thank you @mszabo for the great work!

The remaining steps are:

  1. Decide whether we need to discuss the RAII implications more (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1061573/comments/6d9c0d1e_23dcce5b?tab=comments).

@mszabo found an issue when we have multiple spans inside the same function: the spans are auto-ended when PHP leaves the function, not when it leaves the scope where the span was created. IMHO we should treat RAII as syntactic sugar/a fallback and expect engineers to start/end spans manually. RAII would then be a safety mechanism for when PHP jumps out of scope (because of an exception), making sure we automatically deactivate and end open spans.

  1. implement logic to skip sampling and respect the traceparent header: when the header is present, the sampling logic shouldn't trigger; instead, the request should be traced. (A rough sketch of the header parsing follows after this list.)
  2. wire up the library with MediaWiki logic, e.g. start a root span
  3. instrument database calls
  4. update places where Telemetry::getRequestHeaders is used and instrument them (HTTP factories/MWRequest/etc.)
  5. update wmf-configs: set the sampling rate to 0%, as we shouldn't sample on our own and should only depend on header presence
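For the traceparent step, the W3C Trace Context header (https://www.w3.org/TR/trace-context/) is simple enough to parse without a library. A rough sketch of what that could look like (the function name is a placeholder):

// Sketch only: parse a W3C traceparent header.
// Format: version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
function parseTraceparent( ?string $header ): ?array {
	if ( $header === null ||
		!preg_match( '/^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/', $header, $m )
	) {
		return null; // absent or malformed: fall back to the local sampling config
	}
	return [
		'traceId' => $m[1],
		'parentSpanId' => $m[2],
		// The low bit of trace-flags is the "sampled" flag; when set, we
		// trace the request regardless of our own sampling probability.
		'sampled' => ( hexdec( $m[3] ) & 0x01 ) === 0x01,
	];
}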

@CDanis @Krinkle - what do you think would be the best areas to instrument? IMHO we should wrap the entire request with a root span and instrument database calls and the HTTPFactory. We could instrument the Parser too. For the first run we should keep it small.

@mszabo when testing the library I noticed that it may be tricky to find a good place to start the root span and call shutdown/export. Initially, I followed the PoC and started instrumentation in MediaWikiEntryPoint, but that might be a bit late: I tried to instrument all hooks and discovered that by the time we get to MediaWikiEntryPoint::setup, some hooks have already been called.

> @CDanis @Krinkle - what do you think would be the best areas to instrument? IMHO we should wrap the entire request with a root span and instrument database calls and the HTTPFactory. We could instrument the Parser too. For the first run we should keep it small.

+1 to all.

If you wanted something even smaller to instrument as a starting point, PoolCounter locks wouldn't be bad either.

@pmiazga @mszabo will either one of you have some time soon to at least do #3 above? That's the one I know the least about.

> @pmiazga @mszabo will either one of you have some time soon to at least do #3 above? That's the one I know the least about.

Sure thing, I'll polish https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1061574 and that should do the trick
