Take into account encoding of source file for syntax error #124188

serhiy-storchaka · 2024-09-17T19:02:59Z

Currently, most syntax errors raised in the compiler (except those raised in the parser) use PyErr_ProgramTextObject() to get the line of code. It does not know the encoding of the source file and interprets the line as UTF-8 (failing if it contains non-UTF-8 sequences). The parser uses _PyErr_ProgramDecodedTextObject().

There are two ways to solve this issue:

Pass the source file encoding from the parser to the code generator. This may require changing some data structures, but it is more efficient.
Detect the encoding in PyErr_ProgramTextObject(). Since the latter is part of the public C API, this can also affect third-party code.
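The second option can be sketched in Python with the standard library's tokenize.detect_encoding(), which honors a BOM and a PEP 263 coding cookie. This is only an illustration of the idea, not what the C function does; read_source_line is a hypothetical helper and the source bytes are made up.

```python
import io
import tokenize


def read_source_line(source: bytes, lineno: int) -> str:
    """Return line `lineno` (1-based) decoded with the detected encoding."""
    # detect_encoding() looks for a BOM or a coding cookie in the first
    # two lines and falls back to UTF-8 when neither is present.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    lines = source.splitlines(keepends=True)
    # "replace" keeps the error report usable even if some bytes are invalid.
    return lines[lineno - 1].decode(encoding, errors="replace")


# Hypothetical Latin-1 source with an incomplete expression on line 2.
source = "# -*- coding: latin-1 -*-\nname = 'Zoë' +\n".encode("latin-1")
line = read_source_line(source, 2)
```

Detecting the encoding at error time means re-reading the start of the file, which is the efficiency cost the first option avoids.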

There are other issues with PyErr_ProgramTextObject():

It leaves the BOM in the first line if the source line contains it. This is not consistent with the offsets.
For very long lines, it returns the tail of the line, the part beyond the first 1000 bytes. The tail can be short, it can start with an invalid character, and it is not consistent with the offsets. If an incomplete line has to be returned, it is better to return the head.
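Both problems can be modeled roughly in Python. The chunked read below is an assumption that approximates reading a line through a fixed-size buffer; the exact buffer handling in the C code may differ, and tail_of_line, head_of_line, and the sample line are hypothetical.

```python
import codecs

MAX_LINE = 1000  # byte limit mentioned above; treated here as the chunk size


def tail_of_line(line: bytes, limit: int = MAX_LINE) -> bytes:
    """Model of the problematic behavior: keep only the last chunk read."""
    chunks = [line[i:i + limit] for i in range(0, len(line), limit)]
    return chunks[-1] if chunks else b""


def head_of_line(line: bytes, limit: int = MAX_LINE) -> bytes:
    """Model of the proposed behavior: keep the first chunk, without the BOM."""
    head = line[:limit]
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    return head


# A first line with a BOM and 600 two-byte characters (1208 bytes in total).
line = codecs.BOM_UTF8 + b"s = " + "é".encode("utf-8") * 600 + b"\n"

tail = tail_of_line(line)
head = head_of_line(line)

# The tail begins in the middle of a character, so strict decoding fails.
try:
    tail.decode("utf-8")
    tail_is_valid = True
except UnicodeDecodeError:
    tail_is_valid = False

# The head starts where column offsets start; "replace" covers a cut-off end.
head_text = head.decode("utf-8", errors="replace")
```

In this model the head may still end in the middle of a character, which is why it is decoded with the "replace" error handler.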

This all applies to PyErr_ProgramText() as well.

Linked PRs


* Detect source file encoding.
* Use the "replace" error handler even for UTF-8 (default) encoding.
* Remove the BOM.
* Fix detection of too long lines if they contain NUL.
* Return the head rather than the tail for truncated long lines.

* Detect source file encoding.
* Use the "replace" error handler even for UTF-8 (default) encoding.
* Remove the BOM.
* Fix detection of too long lines if they contain NUL.
* Return the head rather than the tail for truncated long lines.

(cherry picked from commit e2f7107)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>


hugovk · 2025-02-05T14:28:00Z

Triage: PR merged and backported. Please re-open if there's more to do.

serhiy-storchaka added the labels interpreter-core (Objects, Python, Grammar, and Parser dirs), topic-C-API, 3.12 (only security fixes), 3.13 (bugs and security fixes), 3.14 (bugs and security fixes) on Sep 17, 2024

bedevere-app bot mentioned this issue Sep 17, 2024

gh-124188: Fix PyErr_ProgramTextObject() #124189

Merged

serhiy-storchaka mentioned this issue Sep 17, 2024

gh-123969: refactor _PyErr_RaiseSyntaxError and _PyErr_EmitSyntaxWarning out of compiler #123972

Merged

bedevere-app bot mentioned this issue Sep 24, 2024

[3.13] gh-124188: Fix PyErr_ProgramTextObject() (GH-124189) #124423

Merged

bedevere-app bot mentioned this issue Sep 24, 2024

[3.12] gh-124188: Fix PyErr_ProgramTextObject() (GH-124189) #124426

Merged

hugovk closed this as completed Feb 5, 2025
