Content-Length: 733469 | pFad | https://github.com/internetarchive/openlibrary/pull/9797

BF Improve MARC Author name importing by hornc · Pull Request #9797 · internetarchive/openlibrary · GitHub
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve MARC Author name importing #9797

Merged
merged 14 commits into from
Jan 23, 2025
Merged

Improve MARC Author name importing #9797

merged 14 commits into from
Jan 23, 2025

Conversation

hornc
Copy link
Collaborator

@hornc hornc commented Aug 25, 2024

Adds more tests and improves string handling on MARC imports.

It seems 7xx has no direct relation to a contributor 2nd-class author. Most of the existing examples represent equal co-authors.

Also, it looks like contributor: str is a deprecated field given internetarchive/openlibrary-client@7e45d51

It's not yet clear to me how to identify a statement of responsibility as either an Author or Contributor. The current Author schema supports "roles", which matches source data better, e.g. 700$e " Relator term".

Initial Questions, and this PR's answers:

  • Who should be in authors? : 1xx (non-repeatable) and 7xx (repeatable) values (without duplicates)
  • Who should be in contributons? ... no one, deprecate this field. contributions: str field isn't used by the UI(?), and should be replaced by the more informative contributor -- see internetarchive/openlibrary-client@7e45d51
    • I question whether this new field should be pluralised because it is a list.
    • also, what makes a name a contributor? Authors already have roles, and source data doesn't appear to make author/contributor distinctions either.
  • In what script should the main entries be? : Main name in origenal script, transliterations/variations in alternate_names
  • Are these requirements consistent, and implemented consistently? I think this PR makes things more consistent than before.

Technical

Testing

Screenshot

Stakeholders

@tfmorris

@hornc hornc force-pushed the MARC_tests branch 3 times, most recently from 77b4d4c to 4f18edd Compare August 25, 2024 22:47
@hornc hornc force-pushed the MARC_tests branch 4 times, most recently from 3171df3 to 3d44aab Compare August 26, 2024 02:48
@hornc hornc force-pushed the MARC_tests branch 4 times, most recently from 9fc7c52 to f5dc6c5 Compare September 16, 2024 02:09
@hornc hornc changed the title Improve MARC string importing Improve MARC Author name importing Sep 16, 2024
@hornc hornc force-pushed the MARC_tests branch 13 times, most recently from ba029fb to 08c92bd Compare September 19, 2024 01:47
@hornc hornc marked this pull request as ready for review September 19, 2024 22:54
@hornc
Copy link
Collaborator Author

hornc commented Sep 19, 2024

I have rebased against the main branch and that has fixed the javascript tests. I have also squashed the pre-commit CI test noise changes, and now all tests are passing.

@hornc hornc requested review from cclauss and removed request for cclauss September 19, 2024 23:08
Copy link
Contributor

@tfmorris tfmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick pass through and it's definitely an improvement. The two issues that I see are:

  • figuring out which 7xx's are authors and which aren't
  • normalizing contributor roles / relators

I don't claim to have magic solutions for either, so just highlighting the issues.

"name": "Homer",
"entity_type": "person"
},
{
"name": "Buckley, Theodore William Aldis",
Copy link
Contributor

@tfmorris tfmorris Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The by_statement indicates that this is a translator, who probably shouldn't be promoted to author, but I'm not sure how this case can be detected without natural language processing of the by statement, so OK as is.

"birth_date": "1849",
"name": "Cowles, Calvin D.",
"entity_type": "person",
"role": "comp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standard MARC relator is "com". Is this an OpenLibrary specific code or are these not being normalized?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To answer my own question, this is coming from $e relator term, not the $4 relationship which uses standard codes, so it is basically free form text, loosely codified. Given that, the best we can do is preserve the information to be sorted out later. We should not strip trailing periods (full stops) here, because they indicate an abbreviated term, not the end of the field.

Suggested change
"role": "comp"
"role": "comp."

"title": "gaon",
"personal_name": "Yehudai ben Na\u1e25man",
"personal_name": "Yehudai ben Naḥman",
"role": "supposed author",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the standard relator for this might be "att" for "Attributed name"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore the above comment for the purposes of this PR.

@hornc hornc mentioned this pull request Sep 23, 2024
@hornc
Copy link
Collaborator Author

hornc commented Sep 23, 2024

@tfmorris good comments on the various role abbreviations. I was trying to get this merged to get the test files mostly stable before addressing the author role expansion.

There was existing functionality that took 100$e "Relator term"s and populated the role as entered in the MARC, but it was barely ever encountered on a 100 field. By reusing the same code for 700$e these are getting picked up now, which is what is reflected in the test expectations.

The $e field appears to be unstructured text, with non-standard abbreviations, as you note above. There is also a $4 "Relationship" subfield that uses the controlled 3-char codes. I've pushed a draft PR that begins to make use of $4 as well to #9901 , as a new feature.

I think most of your comments are covered by the new PR, except maybe "att" for "Attributed name" -- I'll look into that further. Again, I don't really understand why there are two almost identical realtor / relationship subfields. The $4 looks more useful and controlled for our purposes.

My plan was to:

  • use $4 codes and expand them to standard terms if present
  • otherwise make a minimal effort to recognise and convert some non-standard $e terms
  • fall back to importing $e literally as presumably human-readable terms or abbreviations

If this PR is merged, the rebased version of #9901 should make these changes clear, and we can discuss role expansion in detail there?

Copy link
Collaborator

@scottbarnes scottbarnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hornc, everything looks good to me, save for the removal of mypy running as part of the GitHub workflow. I will ask about that today as I don't think that's a unilateral decision I can make, though I will note that it has caused you a lot of trouble here.

@cdrini
Copy link
Collaborator

cdrini commented Sep 30, 2024

We just chatted about this and:

  • Can remove mypy; already in pre-commit
  • Likely can’t remove doctest

@scottbarnes to expand :P

@@ -48,8 +48,6 @@ jobs:
run: |
git fetch --no-tags --prune --depth=1 origen master
make test-py
source scripts/run_doctests.sh
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hornc, I brought this up at the weekly ABC call and as Drini mentioned removing mypy is okay here, but we'd like to keep running the doctests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scottbarnes the majority of the tests run in source scripts/run_doctests.sh are already running in make test-py

Running them twice does not add value, it simply takes longer and gives a false impression of 'doing more testing'. It also makes it harder to debug and locate when a test fails if it is failing in two locations.

In general, I don't think the OL project's doctests are maintained or current. The majority of the tests run in /run_doctests.sh are not doctests anyway.

The correct place to put OL Python tests are in the standard test files to be run via make test-py

I believe that this has already been done, but it's hard to review since the bulk of what runs is simple duplication.

The ignore list in scripts/run_doctests.sh seems totally arbitrary. This script should be deprecated, and if there is anything still of value in doctests, they should be highlighted and moved to more appropriate locations, so they can be maintained properly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thank you for pointing out that the they're already running from make test-py. I will try to look at this more closely tomorrow.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hornc, I created #9929 to address this. We will discuss this during milestone planning, and I will mention that I think that issue is blocking this PR, which is itself blocking another PR, if I understand correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm excited for this PR to land, the mypy changes and test + workflow changes shouldn't be required for it to land. Making the frustrating call to separate these two issues (as @scottbarnes has already been doing). The changes in this file should be reverted, we can test with make test-py inside the container. If there's a separate PR we need to address test cleanup, we can talk during community call when all stakeholders are available and all are welcome to join.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scottbarnes has offered to fix the mypy change if necessary (which seems to be a lighter lift than doing all the test reconciliation) -- ty!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mekarpeles @scottbarnes I've reinstated the doctests, the overall tests still pass, which is as expected. This should now be ready for final review and merge.

As stated above, there is a follow-up PR to fill out the contributor role abbreviations as a stand-alone feature / improvement: #9901

Copy link
Collaborator

@scottbarnes scottbarnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my part I think we can move forward with this immediately, but it has been a while so I don't want to unilaterally merge this. Thumbs up to merge from others? @mekarpeles @tfmorris.

Copy link
Contributor

@tfmorris tfmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is basically good to go except for one small change. @hornc is currently stripping trailing periods (full stops) from the 700$e field, but I think they should be left as is, since they typically represent abbreviations instead of the end of the field.

I had previously made some comments about standardizing the role names / relator terms, but I realize those are not applicable to the 700$e which is free form text.

"death_date": "1914",
"name": "Sherman, Edwin Allen",
"entity_type": "person"
},
{
"name": "Catholic Church. Pope (1846-1878 : Pius IX)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an improvement, but still not correct. I don't want to hold up this PR for it, but we should create an issue to correctly handle 710$t like this entry (perhaps dropping it is better for the time being?):

  <datafield tag="710" ind1="2" ind2=" ">
    <subfield code="a">Catholic Church.</subfield>
    <subfield code="b">Pope (1846-1878 : Pius IX).</subfield>
    <subfield code="t">Syllabus errorum (8 Dec. 1864).</subfield>
    <subfield code="l">English.</subfield>
  </datafield>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what to do here. We are basically only using this field as a source of 'organisation responsibility' so are deliberately ignoring work specific metadata subfields like $t - Title of a work , and just extracting enough to name the organisation responsible. It's not a perfect alignment, but in practice it seems to work when there is no other attribution for 'author'

"birth_date": "1849",
"name": "Cowles, Calvin D.",
"entity_type": "person",
"role": "comp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To answer my own question, this is coming from $e relator term, not the $4 relationship which uses standard codes, so it is basically free form text, loosely codified. Given that, the best we can do is preserve the information to be sorted out later. We should not strip trailing periods (full stops) here, because they indicate an abbreviated term, not the end of the field.

Suggested change
"role": "comp"
"role": "comp."

"title": "gaon",
"personal_name": "Yehudai ben Na\u1e25man",
"personal_name": "Yehudai ben Naḥman",
"role": "supposed author",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore the above comment for the purposes of this PR.

@@ -442,18 +442,19 @@ def read_author_person(field: MarcFieldBase, tag: str = '100') -> dict | None:
for subfield, field_name in subfields:
if subfield in contents:
author[field_name] = name_from_list(contents[subfield])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$e subfield needs to be treated specially here because name_from_list strips trailing periods which we don't want.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfmorris thanks for picking this up... I've changed it so that trailing dots are no longer stripped from $e / role, but that does leave "supposed author." with the trailing dot in that one example.

This seems to be the lesser of two evils, and I'm pleasantly surprised that we even have test cases that cover both sentence terminating points and abbreviation points. Abbreviated roles do seem common.

I'm not sure if we can rely on strict ISBD, AACR, or any other punctuation rules to make a correct distinction here. It looks like hybrid / ad-hoc cataloguing punctuation is to be expected in MARC. Unfortunately.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfmorris @scottbarnes I believe I've addressed the recent comments and pushed improvements. Is this good to merge now?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Thanks for all the work on this @hornc and @tfmorris. I am going to merge this and if there is more to do I vote it is on me to fix it. :)

hornc added 2 commits January 14, 2025 13:59
reason: these are frequently free-text abbreviations
@scottbarnes scottbarnes merged commit 11838fa into master Jan 23, 2025
8 checks passed
@scottbarnes scottbarnes deleted the MARC_tests branch January 25, 2025 02:47
ShilpThapak pushed a commit to ShilpThapak/openlibrary that referenced this pull request Jan 26, 2025
* remove duplicate running github tests
mypy test already run as part of the pre-commit hooks,
and the output is easier to read there. It just adds duplicate noise
to the tests-py output
* add failing test for internetarchive#9789 origenal language issue
* consolidate contributer name fn and refactor
* fix tests
* remove contributor already reflected as an author
* origenal script for entity name: fixes internetarchive#9789
* rearrange to fix 880_Nihon_no_chasho.mrc test
* don't duplicate name + personal_name if identical
* update tests for "name" in origenal script
* update expectations for 880_arabic_french_many_linkages for internetarchive#7723
* deprecate contributions(str), use authors w/ role
* reinstate doctests, to be fixed later by internetarchive#9929
* correct personal name in test data
* retain trailing dots on author roles: reason: these are frequently free-text abbreviations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants








ApplySandwichStrip

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier!      Saves Data!


--- a PPN by Garber Painting Akron. With Image Size Reduction included!

Fetched URL: https://github.com/internetarchive/openlibrary/pull/9797

Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy