more ad-hoc title filtering to avoid non-book imports, and exclude OTHER product class #9850
move NONBOOK checks earlier in the init to quit early if NONBOOK
Looks good to me.
@@ -67,7 +67,9 @@
'version',
# Not a book
'calendar',
'copy bin',
Oh, this might not work because of the space; I believe it splits on space when it does the check. Also, this filter only applies to independently published books, and I don't think `copy bin` or `dumpbin` items have the publisher `Independently Published`.
Ah, ok. Well, the two strings I added are likely to do nothing then, as the dumpbins are from recognised publishers. It's not worth adding another title filter; the `OTH` check I added is doing the real work of preventing non-book items. The two strings can probably be removed to avoid confusion, although I don't think they will cause any false positives?
Looks like `copy bin` will never match because of the space. The `title_words` are split on space.
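To make the point about the space concrete, here is a minimal sketch of the behaviour described above. It is not the importer's actual code: `EXCLUDED_TITLE_WORDS` and `matched_words` are hypothetical names, and the only assumption is the one stated in this thread, that the title is split on spaces and the resulting words are compared against the filter entries.

```python
# Hypothetical filter list, modelled on the entries shown in the diff above.
EXCLUDED_TITLE_WORDS = {"calendar", "copy bin", "dumpbin"}


def matched_words(title: str) -> set[str]:
    """Return the filter entries found among the space-split title words."""
    # Splitting on spaces yields single words only, so no element of
    # title_words can ever contain a space.
    title_words = set(title.casefold().split(" "))
    return title_words & EXCLUDED_TITLE_WORDS


# "copy" and "bin" arrive as two separate words, so the two-word entry is missed.
print(matched_words("Spring Copy Bin Assortment"))  # set()
# A single-word entry matches as expected.
print(matched_words("Spring Dumpbin Assortment"))   # {'dumpbin'}
```

Under that assumption, any entry containing a space is dead weight in the list rather than a source of false positives.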
Removing makes sense to me, but I'd also suggest creating an issue to expand this to block those words from any publisher source. (Potentially also worthwhile to switch this to regex so the space thing isn't a problem)
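A rough sketch of what the regex variant could look like, in case it helps whoever writes up the follow-up issue. The names `NONBOOK_PHRASES`, `NONBOOK_TITLE_RE`, and `looks_like_nonbook` are hypothetical and nothing here is taken from the existing code; the phrase list just reuses the strings from this thread.

```python
import re

# Hypothetical phrase list; entries mirror the strings discussed in this thread.
NONBOOK_PHRASES = ["calendar", "copy bin", "dumpbin"]

# Each phrase is escaped word by word and rejoined with \s+, so "copy bin"
# matches across any run of whitespace. The \b anchors stop a phrase from
# matching inside a longer word.
NONBOOK_TITLE_RE = re.compile(
    r"\b(?:"
    + "|".join(r"\s+".join(map(re.escape, p.split())) for p in NONBOOK_PHRASES)
    + r")\b",
    re.IGNORECASE,
)


def looks_like_nonbook(title: str) -> bool:
    """True if any phrase occurs in the title as a whole word or phrase."""
    return NONBOOK_TITLE_RE.search(title) is not None


print(looks_like_nonbook("Spring Copy  Bin Assortment"))  # True
print(looks_like_nonbook("Copying Binary Data"))          # False
```

One trade-off to keep in mind: applied to every publisher, a word like "calendar" would also reject legitimate books that have it in the title, so the phrase list for a publisher-independent filter would probably need to be narrower than the current one.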
closes #9768
From looking at one recent data file, there were 264 titles containing "Dumpbin". Most of them were marked as 'Trade Paper', and 3 were marked as e-books, so that kills my idea of having a format accept-list rather than the current NONBOOK reject list.
The data is not accurately annotated to be clear about what kind of item is being imported, which is disappointing.
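The reasoning above can be illustrated with a small sketch. The records and field names below are made up for illustration and do not reflect the real data file layout; the only facts carried over from this PR are that dumpbin items can be marked 'Trade Paper' and that they carry the `OTH` product class.

```python
ACCEPTED_FORMATS = {"Trade Paper"}   # hypothetical format accept-list
REJECTED_PRODUCT_CLASSES = {"OTH"}   # reject-list keyed on product class

# Made-up records; "format" and "product_class" are hypothetical field names.
records = [
    {"title": "Some Novel", "format": "Trade Paper", "product_class": None},
    {"title": "Some Novel 12-Copy Dumpbin", "format": "Trade Paper", "product_class": "OTH"},
]


def passes_format_accept_list(rec: dict) -> bool:
    return rec["format"] in ACCEPTED_FORMATS


def passes_product_class_reject_list(rec: dict) -> bool:
    return rec["product_class"] not in REJECTED_PRODUCT_CLASSES


by_format = [r["title"] for r in records if passes_format_accept_list(r)]
by_class = [r["title"] for r in records if passes_product_class_reject_list(r)]

print(by_format)  # ['Some Novel', 'Some Novel 12-Copy Dumpbin']
print(by_class)   # ['Some Novel']
```

Because the dumpbin shares its format with real books, the format accept-list lets it through, while the product class check removes it.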
This PR
`OTH` is included in these dumpbin items
Technical
Testing
Screenshot
Stakeholders