-
-
Notifications
You must be signed in to change notification settings - Fork 3k
mypyc: Fix C string encoding #7978
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Using repr() to encode C strings relied on implementation-defined behaviour and had a number of bugs, for example: * The C string could inadvertently contain trigraphs. * Valid hexdecimal characters following a hexadecimal escape sequence would be parsed as part of the escape sequence. Octal escape sequences are used for unprintable characters because they do not have the same issue as hexadecimal escape sequences. It would be possible to mitigate the issue with hexadecimal escape sequences by emitting multiple string literals. However, the complexity is not worth it, especially as the output is not designed to be human-readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Thanks for this.
Basically looks good, but a few comments about reorganizing it a little.
I'm also sad about the octal thing but oh well.
mypyc/emitmodule.py
Outdated
@@ -421,8 +449,7 @@ def encode_as_c_string(s: str) -> Tuple[str, int]: | |||
|
|||
def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]: | |||
"""Produce a single-escaped, quoted C string and its size from a bytes""" | |||
# This is a kind of abusive way to do this... | |||
escaped = repr(b)[2:-1].replace('"', '\\"') | |||
escaped = ''.join(map(C_CHAR_MAP.__getitem__, b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd write this as a list comprehension instead of a map
mypyc/emitmodule.py
Outdated
@@ -67,6 +68,33 @@ | |||
# A list of (file name, file contents) pairs. | |||
FileContents = List[Tuple[str, str]] | |||
|
|||
# The C standard specifies that any number of valid hexadecimal characters are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth moving this out of emitmodule
at this point. mypyc.cstring
, maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(And then encode_as_c_string and encode_bytes_as_c_string too)
mypyc/emitmodule.py
Outdated
# These assignments must be done after string.printable because they are | ||
# overrides for the printable characters that need to be escaped in string | ||
# literals. | ||
C_CHAR_MAP[ord('\'')] = '\\\'' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all generated the same way (adding a backslash), so I think it would be an improvement to generate them all mechanically from a list of characters to escape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered this, but the neatest way I can think to do this is:
for x in ("'", '"', '\\', 'a', 'b', 'f', 'n', 'r', 't', 'v'):
escaped = '\\{}'.format(x)
value = escaped.encode('ascii').decode('unicode_escape')
C_CHAR_MAP[ord(value)] = escaped
I am not sure whether this is an improvement. What do you think?
d389606
to
bad98fa
Compare
I've force-pushed some changes, i.e. rewrote the comments, and refactored the escaped strings to raw string literals. I will try and implement your suggestions shortly.
@msullivan My original prototype used |
I agree that using octal is preferable to writing complex code that avoids it. |
bad98fa
to
7bc5d89
Compare
mypyc/cstring.py
Outdated
|
||
# These assignments must come last because we prioritize the escape | ||
# sequences in the C standard over any other representation. | ||
CHAR_MAP[ord('\'')] = r'\'' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msullivan This is the simplest method I have been able to devise to generate these escape sequences:
for x in ("'", '"', '\\', 'a', 'b', 'f', 'n', 'r', 't', 'v'):
escaped = '\\{}'.format(x)
value = escaped.encode('ascii').decode('unicode_escape')
C_CHAR_MAP[ord(value)] = escaped
While I am not a huge fan of manually listing each escape sequence, I dislike this method of generating them even more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The encode/decode is a little hokey but I think I like this more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason I dislike this is that we're calling into Python's Unicode APIs. There's also codecs.escape_decode
that handles ASCII only, but it's undocumented.
9389393
to
3ca384a
Compare
Could you also remove |
return encode_bytes_as_c_string(s.encode('utf-8')) | ||
|
||
|
||
def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what an implementation with hexadecimal escape sequences, instead of octal, would look like:
diff --git i/mypyc/cstring.py w/mypyc/cstring.py
index 37f3ab38..0557a216 100644
--- i/mypyc/cstring.py
+++ w/mypyc/cstring.py
@@ -19,9 +19,11 @@ octal digits.
"""
import string
+import itertools
from typing import Tuple
-CHAR_MAP = ['\\{:03o}'.format(i) for i in range(256)]
+HEX_DIGITS = frozenset(string.hexdigits)
+CHAR_MAP = {}
# It is safe to use string.printable as it always uses the C locale.
for c in string.printable:
@@ -49,5 +51,13 @@ def encode_as_c_string(s: str) -> Tuple[str, int]:
def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]:
"""Produce a quoted C string literal and its size, for a byte string."""
- escaped = ''.join([CHAR_MAP[i] for i in b])
+ escaped = ''
+ for printable, group in itertools.groupby(b, key=CHAR_MAP.__contains__):
+ if printable:
+ s = ''.join([CHAR_MAP[i] for i in group])
+ if len(escaped) and s[0] in HEX_DIGITS:
+ escaped += '" "'
+ escaped += s
+ else:
+ escaped += ''.join(['\\x{:02X}'.format(i) for i in group])
return '"{}"'.format(escaped), len(b)
The compiler will not complain about trigraphs anymore because we now escape all question marks in C string literals.
22bc75f
to
d4f3656
Compare
Thanks! |
Using
repr()
to encode C strings relied on implementation-defined behaviour and had a number of bugs, for example:The C string could inadvertently contain trigraphs.
Valid hexdecimal characters following a hexadecimal escape sequence would be parsed as part of the escape sequence.
Octal escape sequences are used for unprintable characters because they do not have the same issue as hexadecimal escape sequences.
It would be possible to mitigate the issue with hexadecimal escape sequences by emitting multiple string literals. However, the complexity is not worth it, especially as the output is not designed to be human-readable.
As an ad hoc test, I generated a Python module containing string literals for every possible character:
Then I compiled the module with Mypyc and checked it produced the correct byte strings: