Take into account encoding of source file for syntax error #124188

Closed
serhiy-storchaka opened this issue Sep 17, 2024 · 1 comment
Labels
3.12 (only security fixes), 3.13 (bugs and security fixes), 3.14 (bugs and security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), topic-C-API

serhiy-storchaka commented Sep 17, 2024

Currently, most syntax errors raised in the compiler (except those raised in the parser) use PyErr_ProgramTextObject() to get the line of code. It does not know the encoding of the source file and interprets it as UTF-8 (failing if it contains non-UTF-8 sequences). The parser uses _PyErr_ProgramDecodedTextObject().
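
The failure mode can be reproduced outside the interpreter. The following is a minimal Python sketch, not the C implementation: it assumes a hypothetical two-line source file declared as Latin-1 and shows that decoding the offending line as UTF-8 fails, while decoding it with the declared encoding succeeds. The helper name `decode_as_utf8` is made up for illustration.

```python
# Hypothetical source file content, encoded as Latin-1 (illustration only).
source = "# -*- coding: latin-1 -*-\nname = 'café' +\n".encode("latin-1")

# The line a syntax error would point at (line 2), as raw bytes.
raw_line = source.splitlines(keepends=True)[1]


def decode_as_utf8(line: bytes):
    """Mimic a reader that assumes UTF-8; return None if decoding fails."""
    try:
        return line.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The Latin-1 byte for 'é' (0xE9) is not a valid UTF-8 sequence here.
utf8_result = decode_as_utf8(raw_line)
# Decoding with the declared encoding recovers the original text.
latin1_result = raw_line.decode("latin-1")
```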

There are two ways to solve this issue:

  • Pass the source file encoding from the parser to the code generator. This may require changing some data structures, but it is more efficient.
  • Detect the encoding in PyErr_ProgramTextObject(). Since the latter is in the public C API, this can also affect third-party code.
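
As a rough illustration of the second option, the standard library's `tokenize.detect_encoding()` performs the same kind of detection (BOM or coding cookie, defaulting to UTF-8) at the Python level. This is only an analogy for what the C function would have to do; the wrapper `detect_source_encoding` is a hypothetical name.

```python
import io
import tokenize


def detect_source_encoding(data: bytes) -> str:
    """Return the encoding that applies to this source text."""
    # detect_encoding() reads at most two lines through the readline callable.
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(data).readline)
    return encoding


declared = detect_source_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")
with_bom = detect_source_encoding(b"\xef\xbb\xbfx = 1\n")
default = detect_source_encoding(b"x = 1\n")
```

Here `declared` is `'iso-8859-1'` (the normalized name for Latin-1), `with_bom` is `'utf-8-sig'`, and `default` is `'utf-8'`.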

There are other issues with PyErr_ProgramTextObject():

  • It leaves the BOM in the first line if that line contains it. This is not consistent with offsets.
  • For very long lines (exceeding 1000 bytes), it returns the tail of the line. The tail can be short, it can start with an invalid character, and it is not consistent with offsets. If an incomplete line has to be returned, it is better to return the head.

This all applies to PyErr_ProgramText() as well.
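
Both problems can be illustrated with a small Python sketch. It is a simulation under stated assumptions, not the C code: `read_tail` and `read_head` are hypothetical helpers, and `read_tail` assumes a reader that keeps refilling one fixed-size buffer until the line ends, so that only the last chunk survives.

```python
BOM = b"\xef\xbb\xbf"

# 1. A BOM left in the first line shifts every column by one character.
raw_first = BOM + b"x = 1\n"
kept = raw_first.decode("utf-8")          # BOM kept: starts with '\ufeff'
stripped = raw_first.decode("utf-8-sig")  # BOM removed: starts with 'x'


# 2. Tail versus head of an over-long line.
def read_tail(line: bytes, chunk_size: int = 1000) -> bytes:
    """Simulate refilling one fixed-size buffer; only the last chunk is kept."""
    chunk = b""
    for start in range(0, len(line), chunk_size):
        chunk = line[start:start + chunk_size]
    return chunk


def read_head(line: bytes, chunk_size: int = 1000) -> bytes:
    """Keep the first chunk instead, so offsets still count from the line start."""
    return line[:chunk_size]


# One ASCII character followed by 600 two-byte characters: 1202 bytes in total.
long_line = ("x" + "é" * 600 + "\n").encode("utf-8")
tail = read_tail(long_line)  # short, and starts in the middle of a character
head = read_head(long_line)  # starts at the real beginning of the line
```

In this example the tail is only 202 bytes long and begins with a UTF-8 continuation byte, while the head begins with the actual first character of the line.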

serhiy-storchaka added the interpreter-core (Objects, Python, Grammar, and Parser dirs), topic-C-API, 3.12 (only security fixes), 3.13 (bugs and security fixes), and 3.14 (bugs and security fixes) labels Sep 17, 2024
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Sep 17, 2024
* Detect source file encoding.
* Use the "replace" error handler even for UTF-8 (default) encoding.
* Remove the BOM.
* Fix detection of too long lines if they contain NUL.
* Return the head rather than the tail for truncated long lines.
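
Taken together, the behaviour described by this commit message could look roughly like the Python sketch below. It is not the actual C change: `program_text` and `MAX_BYTES` are hypothetical names, the 1000-byte limit is taken from the issue text, and the NUL-related fix is specific to the C line reader and is not modeled here.

```python
import codecs
import io
import tokenize
from typing import Optional

MAX_BYTES = 1000  # assumed limit, based on the "1000 bytes" mentioned in the issue


def program_text(data: bytes, lineno: int, max_bytes: int = MAX_BYTES) -> Optional[str]:
    """Return source line `lineno` (1-based) decoded for display, or None."""
    # Detect the source file encoding (BOM or coding cookie, else UTF-8).
    encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    # Remove the BOM so that column offsets match the returned text.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    lines = data.splitlines(keepends=True)
    if not 1 <= lineno <= len(lines):
        return None
    # Return the head rather than the tail of an over-long line.
    raw = lines[lineno - 1][:max_bytes]
    # Use the "replace" error handler even for UTF-8, so decoding never fails.
    return raw.decode(encoding, errors="replace")


latin1_line = program_text(b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n", 2)
bom_line = program_text(b"\xef\xbb\xbfx = 1\n", 1)
bad_utf8_line = program_text(b"a = 1\nb = 2\nc = '\xff'\n", 3)
long_line = program_text(b"x" * 1500 + b"\n", 1)
missing_line = program_text(b"x = 1\n", 5)
```

With these inputs, the Latin-1 line decodes to its original text, the BOM is gone from the first line, the invalid byte is shown as U+FFFD instead of raising an error, and the long line is cut to its first 1000 bytes.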
serhiy-storchaka added a commit that referenced this issue Sep 24, 2024
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 24, 2024
(cherry picked from commit e2f7107)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Sep 24, 2024
(cherry picked from commit e2f7107)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Sep 24, 2024
(cherry picked from commit e2f7107)
serhiy-storchaka added a commit that referenced this issue Oct 7, 2024
(cherry picked from commit e2f7107)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

hugovk commented Feb 5, 2025

Triage: PR merged and backported. Please re-open if there's more to do.

hugovk closed this as completed Feb 5, 2025