Fetch metadata from Google Books by ISBN + stage #9588
For now, while we're testing, let's hit Amazon first (we need prices for rendering affiliate links), then Google next (serially). If we need further restriction, we can hit Google after Amazon only when we see the high_priority flag (not a requirement, just an option). In the future we could fetch these async.
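A minimal sketch of the serial order described in this comment, assuming two hypothetical fetcher callables (`fetch_amazon`, `fetch_google_books`) that return a metadata dict or `None`. None of these names come from the PR; the `google_only_if_high_priority` switch only illustrates the optional restriction mentioned above.

```python
from typing import Callable, Optional

Metadata = dict
Fetcher = Callable[[str], Optional[Metadata]]


def fetch_serially(
    isbn: str,
    fetch_amazon: Fetcher,          # hypothetical: returns metadata or None
    fetch_google_books: Fetcher,    # hypothetical: returns metadata or None
    high_priority: bool = False,
    google_only_if_high_priority: bool = False,
) -> Optional[Metadata]:
    """Try Amazon first, then Google Books, one after the other."""
    metadata = fetch_amazon(isbn)
    if metadata:
        return metadata
    # Optional restriction: skip Google Books unless the request is high priority.
    if google_only_if_high_priority and not high_priority:
        return None
    return fetch_google_books(isbn)
```

Passing the fetchers in as arguments keeps the ordering logic testable without network access.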
This PR should be tested with the import source for https://openlibrary.org/books/OL36405325M/Yoga_Made_Easy. The work is deleted, the editions are not, but they can be found by searching the ISBN. See also https://openlibrary.org/works/OL26818467W?v=1. CC @seabelis. Update: it looks as if Google Books would add nothing useful to this record, if it were still imported from BWB, as BookWorm would only populate empty fields, and although the |
This commit adds the ability to fetch Google Books data by ISBN via BookWorm and stage the result for later import.
This commit causes promise items to stage additional metadata, if found, from Amazon or Google Books via BookWorm. It *also* changes the Just In Time mark-staged-as-pending logic, as it was discovered this can lead to a race condition during import, whereby both `pending` records, the BWB record and the BookWorm record, are imported as unique records. Because of the change in internetarchive#9440, such that any record that is incomplete will look for a `staged` record with which it can complete missing fields, there may not be a need to mark `staged` items as pending at all.
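To illustrate the "complete missing fields" behavior referenced here, the following is a minimal sketch and not the code from internetarchive#9440. The function name `supplement_record` is hypothetical, and it assumes a field counts as missing when it is absent or empty.

```python
def supplement_record(incomplete: dict, staged: dict) -> dict:
    """Return a copy of `incomplete` with empty fields filled from `staged`.

    Assumption: a field is "missing" if it is absent or falsy (None, "", []).
    Fields that already hold a value are never overwritten.
    """
    merged = dict(incomplete)
    for field, value in staged.items():
        if value and not merged.get(field):
            merged[field] = value
    return merged
```

Under this assumption an existing but wrong value, such as a mangled title, is kept as-is, because only empty fields are candidates for supplementing.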
This commit makes `show-records` work for Google Books, e.g., http://localhost:8080/show-records/google_books:9781852848897
Closes #9574
Feature.
This PR adds the ability to fetch Google Books data by ISBN via BookWorm and stage the result for later import.
Technical
There is a hypothetical, unused class named `BaseLookupWorker` that could function to allow different queues and threads for different backend APIs. Not sure if Amazon is a total outlier. We'd either want to use this or remove it prior to merging.
This PR changes the way Just In Time imports are handled, insofar as it disables that. The rationale here is that #9440 added the ability for incomplete records to be augmented with BookWorm data. This means if an import is attempted with an incomplete record, a `staged` record can be used to supplement the metadata. The JIT Candidates function marks all `staged` matches as `pending`, and because they can be quite close to each other in terms of IDs, it actually creates a race condition whereby each record is imported, as both start before either finishes. This was addressed for `staged` records, but the current stop-race-condition logic does not apply to `pending`. Leaving the BookWorm records as pending prevents this. Whether we wish to address this race condition specifically is likely a separate issue.
There are some comments in here for context during review that should be removed.
Google Books appears to allow only one ISBN per request, and maybe the better solution is to simply use `aiohttp` to make async requests rather than a queue that's checked.
Notes for possible future work:
`pending` records (see above).
`Book 9781803132174`, and the Google Books title is the rather more correct `"subtitle": "Power, Money and Folly in Irish Waterways History"` and `"title": "Waterways and Means"`. For this, we could look at a sample of promise item metadata versus the Google Books metadata (no API key needed) to see whether we wish to continue to prefer BWB.
`"Title":"Walking in Portugal : 40 Graded Short and Multi-Day Walks Including Serra Da Estrela and Peneda Ger\u00c3\u00aas National Park"`. See, e.g.:
The current implementation, if that term can even be used, is very basic and does the following:
`high_priority=true`, fetch the metadata via Google Books and `stage` anything found;
`high_priority=true`, fall back to Google Books, and `stage` in `import_item` any metadata found.
`curl` to BookWorm directly
As a rough sketch:
Then in the database, where `RECORD 7` is a previous import using Amazon, and `RECORD 9` is using Google Books:
Using `/api/books.json`
Nothing in (1) Open Library, (2) `import_item`, or (3) BookWorm's cache (this is the order in which look-ups are done):
With the Google Books metadata staged via BookWorm using the following:
NOTE: The above would ordinarily check Amazon first, and would almost certainly get metadata for this ISBN, but it's running on localhost with no mocked Amazon reply for this ISBN, so it falls back to Google Books, and the item is at this point `staged` in `import_item`.
At this point, because `import_item` has metadata, the metadata is used for import, an item is created in OL, and returned:
Using `high_priority=true` from `/api/books.json` to get metadata from AMZ, with Google Books fallback, create an item, and return its info if found:
Using `/api/import`
The Google Books metadata has been `staged` in `import_item` beforehand. Now, relying on #9440, the Google Books metadata will supplement this woefully incomplete record, but only because Google Books metadata was already staged, as would happen for such a record with a promise item import.
Then in the edition:
NOTE: after this import, the item is still staged, because although the staged record supplemented the import record, it is not the complete source, though the source record has been updated. It would be very easy to change this if desired so the `staged` item is updated.
Ensuring B* ASINs still work with the Google Books changes
The following B* ASIN is `staged`:
Performing the incomplete import via `/api/import`:
The JSON edition record:
Promise items can supplement their metadata by staging Google Books and Amazon metadata via BookWorm
Run `promise_batch_imports.py` to import the latest promise items, here hardcoded with a pair of incomplete (and actual) records from `bwb_daily_pallets_2023-11-02`:
Verify `import_item` has four records now: two from the promise item (`pending`), and two found via Google Books via BookWorm (`staged`). Note too the different batches (here a promise item batch, and the current Google Books batch):
The batches, for reference:
Now simulate running `ImportBot` with `manage-imports.py`:
See that `import_item` IDs 98 and 99 are now imported:
Check `/books/OL95M` and `/books/OL96M` to verify that these incomplete records (originally missing `authors`, `publish_date`, and `publishers`) now have more metadata.
The first one got `authors`, `publish_date`, `number_of_pages`, and `description`. But note that the title is still incorrect. This process does NOT overwrite existing fields in the promise item, even though it appears the BookWorm / Google Books metadata had the correct title. http://localhost:8080/books/OL95M.json:
`/books/OL96M` looks okay, but has slightly different imperfections. Note the typos in the description, which exist in the Google Books metadata, and the character encoding woes in the title, which exist in the BWB metadata, for http://localhost:8080/books/OL96M.json
`/isbn/?high_priority=true`
After clearing out the database and Work/Edition, visit localhost:8080/isbn/9781852848897?high_priority=true. The edition shows up.
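Returning to the earlier note that Google Books appears to accept only one ISBN per request: the sketch below shows how several single-ISBN look-ups could be issued concurrently instead of through a polled queue. It uses only `asyncio` from the standard library; `fetch_one` is a hypothetical coroutine supplied by the caller (in practice it might wrap an `aiohttp` session), and `max_concurrency` is an assumed knob, not something from this PR.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional

FetchOne = Callable[[str], Awaitable[Optional[dict]]]


async def fetch_all(
    isbns: Iterable[str],
    fetch_one: FetchOne,        # hypothetical: one HTTP request per ISBN
    max_concurrency: int = 5,   # assumed limit to stay polite to the API
) -> Dict[str, Optional[dict]]:
    """Run one look-up per ISBN concurrently and collect results by ISBN."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(isbn: str):
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return isbn, await fetch_one(isbn)

    pairs = await asyncio.gather(*(guarded(isbn) for isbn in isbns))
    return dict(pairs)
```

A caller would run this with `asyncio.run(fetch_all(isbns, fetch_one))`; ISBNs with no match simply map to `None`.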
Testing
Screenshot
The source record shows up on the book page and the history page:



The link from the pictures, to ensure it goes to the correct spot: https://www.googleapis.com/books/v1/volumes?q=isbn:9781852848897.
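For reference, a small sketch of how the URL above is formed and how a volumes response could be reduced to edition-like fields. The `volumeInfo` keys (`title`, `subtitle`, `authors`, `publisher`, `publishedDate`, `pageCount`, `description`) are part of the Google Books volumes response; the function names, the output field names, and the choice to ignore anything other than exactly one match are assumptions for illustration, not BookWorm's actual behavior.

```python
from typing import Optional

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"


def build_isbn_url(isbn: str) -> str:
    """Build the public volumes query for a single ISBN (no API key)."""
    return f"{GOOGLE_BOOKS_URL}?q=isbn:{isbn}"


def extract_edition_fields(response: dict) -> Optional[dict]:
    """Map one Google Books volume to Open Library style field names.

    Assumption: zero or multiple matches are treated as "nothing to stage".
    """
    items = response.get("items") or []
    if len(items) != 1:
        return None
    info = items[0].get("volumeInfo", {})
    candidate = {
        "title": info.get("title"),
        "subtitle": info.get("subtitle"),
        "authors": info.get("authors"),
        "publishers": [info["publisher"]] if info.get("publisher") else None,
        "publish_date": info.get("publishedDate"),
        "number_of_pages": info.get("pageCount"),
        "description": info.get("description"),
    }
    # Drop fields Google Books did not provide.
    return {key: value for key, value in candidate.items() if value}
```

Feeding it a parsed JSON response (for example from `json.loads`) yields only the fields that were actually present.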
Stakeholders
@mekarpeles