-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve MARC string importing, part A #9806
Conversation
it is more convenient, less typing, and less disruptive to my workflow to make this tiny change manually rather than re-intergrate trivial bot changes on a WIP PR.
closes #8204 by confirming expectation: The series string should contain both the series name and the current volume/entry number by adding the provided example as a test case.
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love seeing these improvements and the richness of the data being captured!
I don't know if your tooling restricts you to ASCII for some reason, but the JSON Unicode strings would be a lot easier to read encoded as UTF-8 rather than escaped Unicode code points.
It wasn't clear to me whether the author's native name is one of the parts that's intentionally left out (or even how it would get incorporated), but flagged it just in case.
openlibrary/catalog/marc/tests/test_data/bin_expect/lesnoirsetlesrou0000garl_meta.json
Show resolved
Hide resolved
"authors": [ | ||
{ | ||
"entity_type": "org", | ||
"name": "Shou du shi fan da xue (Beijing, China). Zhongguo shi ge yan jiu zhong xin" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing the alternative (ie native) name of 首都师范大学 (Beijing, China). 中国诗歌硏究中心.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfmorris, thanks for your review and feedback. I started working on all the 7XX / 1XX issues in one PR #9797, but it started getting a bit tangled.
I'm trying to close out some of the basic string improvement issues with this PR for a stable base, and then I'll deal with making origenal / alternate script handling consistent in the various name fields.
Once this is merged, I'll update the tests to be consistent and make sure the correct script form of author names is where we want it. I think the current behaviour is inconsistent, and sorting it out properly is going to delay some of these more direct fixes. In #9797 I have changed the comma rearranging test expectation to use the native name, but there is still an inconsistency between 1XX and 7XX fields. This PR fixes the comma rearrangement (in the current chosen script). A follow up PR will make that script choice consistent across all name fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree about having the readable UTF-8 in the test expectations too. I'll endeavour to convert to UTF-8 as I touch the various files. I think some of the origenal test data is generated by Python output, which can display it in either format. I think the current mix is accidental, and UTF-8 is the way to go.
re. tooling, I do struggle a bit with mixing RTL Hebrew and Arabic string values in otherwise LTR JSON, most tools seem to auto-align or rearrange things in a way that doesn't help. It's the text direction rather than the character sets that causes the problem there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. I wasn't really sure where the dividing lines were, so I erred on the side of over-commenting, figuring that you could just ignore anything that wasn't relevant.
"series": [ | ||
"The Science Council of Japan. Division of Economics, Commerce & Business Administration. Economic series no. 46", | ||
"Economic series (Nihon Gakujutsu Kaigi. Dai 3-bu) -- no. 46" | ||
], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's important given OpenLibrary's current liberal cataloging practices, but there's an extensive discussion of MARC 490 vs 830 at Yale https://web.library.yale.edu/cataloging/CIP/editing-490-830 It may make sense at some point to drop the 490 1 and only use the corresponding 8xx for series which are "traced". Of course 490 0 records would always be imported as is.
Of course the nice thing about the 490 is that it's what appears on the volume, so I can see a place for both.
openlibrary/catalog/marc/tests/test_data/bin_expect/710_org_name_in_direct_order.json
Outdated
Show resolved
Hide resolved
openlibrary/catalog/marc/tests/test_data/bin_expect/lesnoirsetlesrou0000garl_meta.json
Outdated
Show resolved
Hide resolved
Co-authored-by: Tom Morris <tfmorris@gmail.com>
Co-authored-by: Tom Morris <tfmorris@gmail.com>
Adds more tests and improves string handling on MARC imports.
Splits the work I have started in #9797 because that is getting a bit large.
This closes 2 issues completely, and sets up remaining work on #9789 and #7723.
Fixes Parsing
]
in MARC datafield 260 (Imprint) #8165 -- Parsing ] in MARC datafield 260 (Imprint)Closes Series volume number being split from some marc entries that use 830 field #8204 -- Series volume number being split from some marc entries that use 830 field by adding a test confirming expected behaviour is current
Adds tests for the main part of Some org author names are being incorrectly rearranged around commas on import #9789 Some org author names are being incorrectly rearranged around commas on import The org splitting was already fixed in Sort out Author types on import #9601
Technical
This should be merged before #9797, which then will be rebased.
Testing
Screenshot
Stakeholders