more ad-hoc title filtering to avoid non-book imports, and exclude OTHER product class by hornc · Pull Request #9850 · internetarchive/openlibrary · GitHub

more ad-hoc title filtering to avoid non-book imports, and exclude OTHER product class #9850

Merged: 5 commits into master from ISSUE9768 on Sep 9, 2024

Conversation

hornc (Collaborator) commented Sep 6, 2024

closes #9768

From looking at one recent data file, there were 264 titles containing "Dumpbin". Most of them were marked as 'Trade Paper', and 3 were marked as e-books, so that kills my idea of having a format accept-list rather than the current NONBOOK reject list.

The data is not annotated accurately enough to make clear what kind of item is being imported, which is disappointing.

This PR

  • Rejects 'Dumpbin' and 'Copy bin' titles, and confirms that the supplied format codes aren't sufficient to filter accurately, because dumpbins look like books in this field.
  • Adds a new 'Product class' check that excludes records with class OTH, which these dumpbin items carry.
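
A minimal sketch of the product-class check described above. It is not the PR's actual code: the record layout, the `product_class` key, the `BK` value, and the helper name are all assumptions made for illustration. It shows why the class field can catch items that the format field lets through.

```python
# Hypothetical names throughout; this is not the project's actual code.
# Assumption: "OTH" is the code the data file uses for the OTHER product class.
REJECTED_PRODUCT_CLASSES = {"OTH"}


def is_nonbook_product_class(record: dict) -> bool:
    """Return True if the record's product class marks it as a non-book item."""
    product_class = (record.get("product_class") or "").strip().upper()
    return product_class in REJECTED_PRODUCT_CLASSES


# Illustrative records: both carry a book-like format, only the class differs.
records = [
    {"title": "Some Novel", "format": "Trade Paper", "product_class": "BK"},
    {"title": "Some Novel 12-Copy Dumpbin", "format": "Trade Paper", "product_class": "OTH"},
]

importable = [r for r in records if not is_nonbook_product_class(r)]
```

Filtering on the format alone would keep both records, since both are 'Trade Paper'; the class check keeps only the first.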

@hornc requested a review from scottbarnes September 6, 2024 02:27
@hornc added the "Module: Import" label (Issues related to the configuration or use of importbot and other bulk import systems. [managed]) Sep 6, 2024

scottbarnes (Collaborator) left a comment

Looks good to me.

@scottbarnes merged commit 45247da into master Sep 9, 2024 (4 checks passed)
@scottbarnes deleted the ISSUE9768 branch September 9, 2024 23:23
@@ -67,7 +67,9 @@
'version',
# Not a book
'calendar',
'copy bin',

cdrini (Collaborator) commented Sep 10, 2024

Oh, this might not work because of the space; I believe it splits on space when it does the check. Also, this filter only applies to independently published books, and I don't think copy bin or dumpbin items have the publisher "Independently Published".

hornc (Collaborator, Author) replied

Ah, ok, well those two that I added are likely to do nothing then, as the dumpbins are from recognised publishers. It's not worth adding another title filter; the OTH check I added is doing the real work of preventing non-book items. The two strings can probably be removed to avoid confusion, although I don't think they will cause any false positives?

Looks like 'copy bin' will never match because of the space: the title_words are split on space.
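
The space problem can be reproduced with a small sketch. This assumes the check splits the title on spaces and intersects the resulting words with the excluded list; `EXCLUDED_WORDS` and `matches_excluded_word` are hypothetical names, not the project's own.

```python
# Hypothetical reconstruction of a word-based title filter, for illustration only.
EXCLUDED_WORDS = {"calendar", "copy bin", "dumpbin"}


def matches_excluded_word(title: str) -> bool:
    """Split the title on spaces and look for any excluded entry among the words."""
    title_words = set(title.casefold().split(" "))
    return bool(title_words & EXCLUDED_WORDS)


single_word_hit = matches_excluded_word("Some Novel 12-Copy Dumpbin")
two_word_miss = matches_excluded_word("Some Novel 8 Copy Bin")
```

Here `single_word_hit` is `True` because "dumpbin" survives the split as one word, while `two_word_miss` is `False`: the title becomes the separate words "copy" and "bin", so the two-word entry "copy bin" can never be found in the set.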

Collaborator replied

Removing makes sense to me, but I'd also suggest creating an issue to expand this to block those words from any publisher source. (Potentially also worthwhile to switch this to regex so the space thing isn't a problem)
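
One way the regex suggestion could look, as a sketch under stated assumptions rather than a proposed patch: the pattern and the function name `looks_like_display_item` are hypothetical. The function takes only the title, so it would apply regardless of publisher.

```python
import re

# Hypothetical pattern: matches "copy bin" (any whitespace between the words)
# or "dumpbin" as whole words, case-insensitively.
NONBOOK_TITLE_RE = re.compile(r"\b(?:copy\s+bin|dumpbin)\b", re.IGNORECASE)


def looks_like_display_item(title: str) -> bool:
    """Return True if the title contains a retail display-item phrase."""
    return NONBOOK_TITLE_RE.search(title) is not None


checks = {
    title: looks_like_display_item(title)
    for title in (
        "Some Novel 8 Copy Bin",
        "Some Novel 12-Copy Dumpbin",
        "Some Novel",
    )
}
```

Because the pattern is matched against the whole title rather than against individual words, multi-word phrases work, and the `\b` anchors keep it from firing inside longer words.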

Successfully merging this pull request may close these issues.

Non-book retail display items are being imported
3 participants








Fetched URL: https://github.com/internetarchive/openlibrary/pull/9850
