MediaInfo captions are being indexed as label fields. This is suboptimal because these fields are mapped for completion.
We should fix this before we start using them more broadly, as they consume much more space than needed (exact-match and prefix fields).
Related Objects
Mentioned In
- T235910: Updates to search keyword functionality for structured data captions
- rEWCSb8e49ba084e1: Update HasDataForLangFeature for caption move to descriptions
- T227847: Backfill terms index
- T228429: Update CirrusSearch invocation to handle the reindexing of captions as descriptions
- rEWCSe62637aeccc7: Query description fields with incaption keyword
- T224611: Implement match for any-language label (haslabel:*)
Mentioned Here
- T190066: Expose all slots to the search interface
Event Timeline
Note however we don't have a field like labels_all for descriptions, right? Also, hascaption is now an alias to haslabel, so they both work on the same field. May be moved to be an alias for hasdescription, of course. But then not clear how hascaption:* would work if at all.
It depends on the solution we decide to follow. If we decide to use descriptions and hascaption:* is an important use case, then we need a new field, e.g. description_count.
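A minimal sketch of how a count field could answer hascaption:* if captions were stored as descriptions. The document layout (a `descriptions` dict keyed by language code plus `description_count`) and the helper names `build_doc` and `has_caption` are hypothetical, for illustration only; they are not the actual CirrusSearch mapping or keyword implementation.

```python
def build_doc(captions):
    """Build a toy search document from {lang: caption} pairs."""
    # Skip empty or whitespace-only captions so they are not counted.
    descriptions = {lang: text for lang, text in captions.items() if text.strip()}
    return {
        "descriptions": descriptions,
        # Hypothetical count field: a wildcard query only needs to check
        # that this is greater than zero, not probe every language.
        "description_count": len(descriptions),
    }


def has_caption(doc, lang):
    """Emulate hascaption:<lang>; lang == "*" means any language."""
    if lang == "*":
        return doc["description_count"] > 0
    return lang in doc["descriptions"]
```

With this shape, hascaption:en is a per-language existence check, while hascaption:* becomes a single range check on the count field.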
Change 519602 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/WikibaseMediaInfo@master] [WIP] Index MediaInfo labels as separate caption fields
We're storing captions in label because captions are essentially labels for images, and we already have a 'description' field in UploadWizard for wikitext
As you say we don't need completion, we just need people to be able to search the caption text. We're pushing all captions into opening_text which is probably adequate for most use cases (though I guess won't cover stemming).
FWIW we don't expect to ever add descriptions to MediaInfo entities. Obvs if there was a consensus that the caption data should be stored there rather than in labels we could consider moving it, but as far as we can see there is no need to have both labels and descriptions for MediaInfo entities
Just to add some clarity on something Cormac said above:
FWIW we don't expect to ever add descriptions to MediaInfo entities.
I'd amend this to say that at the moment we're about 80% sure we won't add descriptions, but it's still quite possible that the Commons community (either as a group or via a prolific bot writer) decides that multilingual descriptions in wikitext aren't good enough and starts migrating over to structured descriptions.
Would representing captions as captions rather than labels in the JSON structure help?
Or rather than pretending media info entities are totally like items / properties just have totally different handling for them when it comes to indexing?
Not really sure which JSON structure you mean @Addshore - but yeah, perhaps we should just explicitly index this stuff differently when doing the indexing. That's easy enough atm, but raises 2 questions:
- ATM slot data is written into the elastic document by the hook onCirrusSearchBuildDocumentParse(). If we were going to have slot data automatically written to the document (and I guess that's the plan) we'd have to come up with a way of configuring how it's indexed.
- If we end up adding structured descriptions, and we want to index those as descriptions too, what then? The current code doesn't lend itself very well to concatenating label and description and writing both into one field. I guess it could be refactored, but it might be tricky.
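To make the second question concrete, here is a rough sketch of what writing both labels and descriptions into one per-language field could look like. The function name `merge_terms` and the dict shapes are assumptions for illustration; the existing indexing code is not structured this way.

```python
def merge_terms(labels, descriptions):
    """Combine per-language labels and descriptions into one list per language.

    Both arguments are hypothetical {lang: text} dicts; labels come first
    in each resulting list.
    """
    merged = {}
    for source in (labels, descriptions):
        for lang, text in source.items():
            if text:  # ignore empty strings
                merged.setdefault(lang, []).append(text)
    return merged
```

Keeping a list per language, rather than one concatenated string, avoids having to pick a separator and keeps the two values distinguishable in the indexed document.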
So, I mean in the JSON that is stored and also the JSON that is presented to consumers of the data.
If we are saying they are nothing like labels as we know them, why are we internally storing them / treating them as labels?
Well, there are two issues here:
1. Captions are semantically not like labels, so it's semantically wrong to store them as labels.
2. Since captions are stored as labels, they are indexed as labels, which means they are indexed for prefix completion search. This wastes resources and applies analyzers that are wrong for the searches that people would actually do on captions.
I am so far mainly concerned with (2), as this means both unnecessary load on our index servers and maybe broken searches too. But it is kinda related to (1) because, as I understand it, the code assumes that if we're storing something in the labels field, it's a label. Maybe we could override this in the WikibaseMediaInfo extension and index it differently, not sure.
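As a rough illustration of the resource concern, this toy comparison counts the terms produced by prefix-style indexing of a whole caption versus plain word tokens. It is only a sketch: the real CirrusSearch analyzers and limits differ, and `edge_ngrams`, `word_tokens`, and the `max_len` cutoff are hypothetical.

```python
def edge_ngrams(text, max_len=20):
    """Return every leading prefix of the string, up to max_len characters."""
    lowered = text.lower()
    return [lowered[:i] for i in range(1, min(len(lowered), max_len) + 1)]


def word_tokens(text):
    """Return lowercase whitespace-separated tokens, as a plain text field might."""
    return text.lower().split()


caption = "A cat sleeping on a red sofa"
prefix_terms = edge_ngrams(caption)   # one term per prefix length
plain_terms = word_tokens(caption)    # one term per word
print(len(prefix_terms), len(plain_terms))
```

Prefix terms grow with the character length of the caption, while plain tokens grow with the word count, which is why sentence-like captions are a poor fit for a completion-oriented mapping.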
Erm ... this is getting a bit philosophical, but I don't really see that labels and descriptions have much semantic meaning associated with them, except for one would expect a label to be shorter than a description.
If we are saying they are nothing like labels as we know them, why are we internally storing them / treating them as labels?
Correct me if I'm wrong here, but as far as I can tell 'labels' in the wikidata sense simply means 'concise descriptions of something that may also have a longer description, and that we index for prefix completion search'. If that's correct, then captions are not "nothing like labels as we know them" - they're still short descriptions of things that may also have long descriptions. They just need to be indexed differently
ATM label data from the MediaInfo slot is written to the elastic doc via the hook onCirrusSearchBuildDocumentParse, so I don't think it'll be difficult to update it so label is indexed in a different way. Perhaps ultimately this approach would make T190066 more difficult to implement, but I don't know what the plan for implementation of that is one way or the other
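The kind of change being described might look roughly like the following, written in Python purely for illustration (the real hook lives in PHP). The function `index_captions` and the `descriptions` key are hypothetical stand-ins, not the extension's actual code.

```python
def index_captions(doc, labels):
    """Return a copy of doc with MediaInfo labels written as descriptions.

    `labels` is a hypothetical {lang: caption} dict taken from the
    MediaInfo slot.
    """
    updated = dict(doc)  # shallow copy so the caller's document is untouched
    updated["descriptions"] = dict(labels)
    # Deliberately no "labels" key: captions then skip whatever
    # prefix-completion mapping applies to label fields.
    return updated
```

The storage format stays the same; only the field the data lands in at indexing time changes.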
If that's correct, then captions are not "nothing like labels as we know them"
Well, ok, yes, not "nothing like", but the dominant use of labels in search, namely prefix search, is irrelevant here. So we need to adjust for that.
Change 522473 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/extensions/WikibaseMediaInfo@master] Index captions as description field rather than label
Change 523679 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/WikibaseCirrusSearch@master] Query description fields with incaption keyword
Change 522473 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Index captions as description field rather than label
Change 523679 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] Query description fields with incaption keyword
Change 519602 abandoned by Cparle:
[WIP] Index MediaInfo labels as separate caption fields
Change 544091 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikibaseCirrusSearch@master] Update HasDataForLangFeature for caption move to descriptions
Change 544091 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] Update HasDataForLangFeature for caption move to descriptions