TypeError when parsing regexp with unicode named character sequence escape #90568

Closed
jirkamarsik mannequin opened this issue Jan 17, 2022 · 4 comments · Fixed by #91665
Labels
3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-regex topic-unicode type-feature A feature request or enhancement

Comments

@jirkamarsik
Mannequin

jirkamarsik mannequin commented Jan 17, 2022

BPO 46410
Nosy @vstinner, @ezio-melotti, @serhiy-storchaka, @jirkamarsik

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2022-01-17.12:31:30.830>
labels = ['expert-regex', 'interpreter-core', 'type-feature', 'expert-unicode', '3.11']
title = 'TypeError when parsing regexp with unicode named character sequence escape'
updated_at = <Date 2022-03-19.11:22:19.310>
user = 'https://github.com/jirkamarsik'

bugs.python.org fields:

activity = <Date 2022-03-19.11:22:19.310>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core', 'Regular Expressions', 'Unicode']
creation = <Date 2022-01-17.12:31:30.830>
creator = 'jirkamarsik'
dependencies = []
files = []
hgrepos = []
issue_num = 46410
keywords = []
message_count = 3.0
messages = ['410770', '410874', '415540']
nosy_count = 5.0
nosy_names = ['vstinner', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'jirkamarsik']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue46410'
versions = ['Python 3.11']

@jirkamarsik
Mannequin Author

jirkamarsik mannequin commented Jan 17, 2022

re.compile(r"\N{name of Unicode Named Character Sequence}"), e.g. re.compile(r"\N{KEYCAP NUMBER SIGN}"), raises a TypeError. The regular expression parser relies on 'unicodedata' to look up character names. The 'unicodedata' module recently added support for Unicode Named Character Sequences (https://www.unicode.org/Public/13.0.0/ucd/NamedSequences.txt). Using one of these named character sequences in a regular expression leads to a 'TypeError', because the regexp parser calls 'ord' on a string of length > 1.
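The failure mode described above can be reproduced without the regexp parser. This is a minimal sketch, assuming the installed Unicode database contains the named sequence KEYCAP NUMBER SIGN; the variable names are illustrative and not taken from the CPython source.

```python
import unicodedata

# A named character sequence resolves to several code points, not one.
seq = unicodedata.lookup("KEYCAP NUMBER SIGN")
code_points = [f"U+{ord(ch):04X}" for ch in seq]
print(code_points)

# ord() accepts exactly one character, so it rejects the whole sequence.
try:
    ord(seq)
    ord_failed = False
except TypeError as exc:
    ord_failed = True
    print("TypeError:", exc)
```

Any code path that passes the result of unicodedata.lookup() straight to ord() will fail the same way for named sequences.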

@jirkamarsik jirkamarsik mannequin added type-bug An unexpected behavior, bug, or error 3.10 only security fixes topic-regex labels Jan 17, 2022
@mrabarnett
Mannequin

mrabarnett mannequin commented Jan 18, 2022

They're not supported in string literals either:

Python 3.10.1 (tags/v3.10.1:2cd268a, Dec  6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> "\N{KEYCAP NUMBER SIGN}"
  File "<stdin>", line 1
    "\N{KEYCAP NUMBER SIGN}"
                            ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-21: unknown Unicode character name
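The same check can be scripted rather than typed at the prompt. The sketch below compiles a string literal from source text and reports whether the compiler accepts it; the helper name literal_is_accepted is hypothetical, and the lookup() call at the end is one possible workaround rather than something proposed in this thread.

```python
import unicodedata

def literal_is_accepted(name):
    # Build the source text of a string literal that uses a \N{...} escape.
    source = '"\\N{' + name + '}"'
    try:
        compile(source, "<demo>", "eval")
    except SyntaxError:
        return False
    return True

# A single named character is accepted in a string literal ...
single_ok = literal_is_accepted("NUMBER SIGN")
# ... but a named character sequence is rejected by the unicodeescape codec.
sequence_ok = literal_is_accepted("KEYCAP NUMBER SIGN")

# Possible workaround: resolve the sequence at run time instead.
keycap = unicodedata.lookup("KEYCAP NUMBER SIGN")
print(single_ok, sequence_ok, ascii(keycap))
```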

@serhiy-storchaka
Member

>>> import unicodedata
>>> unicodedata.lookup('KEYCAP NUMBER SIGN')
'#️'
>>> print(ascii(unicodedata.lookup('KEYCAP NUMBER SIGN')))
'#\ufe0f\u20e3'

Supporting Unicode Named Character Sequences in the unicodeescape codec and in the RE parser would be a new feature.

@serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode 3.11 only security fixes type-feature A feature request or enhancement and removed 3.10 only security fixes type-bug An unexpected behavior, bug, or error labels Mar 19, 2022
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 18, 2022
@serhiy-storchaka
Member

Support for named sequences in the \N escape was explicitly rejected in #56962.

One of the reasons is that r"[\N{KEYCAP NUMBER SIGN}]" and "[\N{KEYCAP NUMBER SIGN}]" would be two very different regular expressions, and such errors would be very difficult to catch, whether by a human or programmatically.
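To illustrate the ambiguity, the sketch below builds both readings by hand with unicodedata.lookup(). It assumes that a non-raw literal, if it were supported, would expand to the three code points inside the brackets, and that a regexp-level escape would be meant to match the sequence as a whole; neither behavior exists in Python, so both patterns are constructed explicitly.

```python
import re
import unicodedata

seq = unicodedata.lookup("KEYCAP NUMBER SIGN")

# Reading 1: the sequence is expanded before the regexp parser sees it,
# so the brackets form a character class over its individual code points.
as_class = re.compile("[" + seq + "]")

# Reading 2 (assumed intent of a regexp-level escape): match the whole sequence.
as_sequence = re.compile(re.escape(seq))

class_matches_hash = as_class.fullmatch("#") is not None
class_matches_seq = as_class.fullmatch(seq) is not None
sequence_matches_hash = as_sequence.fullmatch("#") is not None
sequence_matches_seq = as_sequence.fullmatch(seq) is not None

print(class_matches_hash, class_matches_seq)
print(sequence_matches_hash, sequence_matches_seq)
```

The character class matches a bare "#" but not the full keycap sequence, while the second pattern does the opposite, which is the kind of silent divergence the comment warns about.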

serhiy-storchaka added a commit that referenced this issue Apr 22, 2022
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 22, 2022
…e in RE (pythonGH-91665)

re.error is now raised instead of TypeError.
(cherry picked from commit 6ccfa31)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Apr 22, 2022
GH-91665) (GH-91830)

re.error is now raised instead of TypeError.
(cherry picked from commit 6ccfa31)
serhiy-storchaka added a commit that referenced this issue Apr 22, 2022
…GH-91665) (GH-91830) (GH-91834)

re.error is now raised instead of TypeError.
(cherry picked from commit 6ccfa31)
(cherry picked from commit 9c18d78)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
hello-adam pushed a commit to hello-adam/cpython that referenced this issue Jun 2, 2022
… in RE (pythonGH-91665) (pythonGH-91830) (pythonGH-91834)

re.error is now raised instead of TypeError.
(cherry picked from commit 6ccfa31)
(cherry picked from commit 9c18d78)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
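The commit messages above state that re.error is now raised instead of TypeError. Code that has to run on interpreters both with and without the fix could catch both exceptions, roughly as sketched here. The helper name try_compile is hypothetical, and the sketch assumes Python 3.8 or later, where the regexp parser accepts \N{...} for single named characters.

```python
import re

def try_compile(pattern):
    # Fixed versions raise re.error for a named sequence;
    # unfixed versions raise TypeError from the parser.
    try:
        return re.compile(pattern)
    except (re.error, TypeError):
        return None

# A single named character compiles and matches as usual.
named_char = try_compile(r"\N{NUMBER SIGN}")
# A named character sequence is rejected either way.
named_seq = try_compile(r"\N{KEYCAP NUMBER SIGN}")

print(named_char is not None, named_seq is None)
```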