mypyc: Fix C string encoding #7978

saleemrashid · 2019-11-19T19:05:46Z

Using repr() to encode C strings relied on implementation-defined behaviour and had a number of bugs, for example:

* The C string could inadvertently contain trigraphs.
* Valid hexadecimal characters following a hexadecimal escape sequence would be parsed as part of the escape sequence.

Octal escape sequences are used for unprintable characters because they do not have the same issue as hexadecimal escape sequences.

It would be possible to mitigate the issue with hexadecimal escape sequences by emitting multiple string literals. However, the complexity is not worth it, especially as the output is not designed to be human-readable.
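To make the two bugs concrete, here is a minimal sketch. `naive_encode` mirrors the repr()-based line this PR removes; `octal_encode` is a simplified, hypothetical stand-in that escapes every byte in octal (the actual change keeps printable characters readable).

```python
def naive_encode(b: bytes) -> str:
    """Old approach: reuse repr() and escape double quotes."""
    return '"{}"'.format(repr(b)[2:-1].replace('"', '\\"'))


def octal_encode(b: bytes) -> str:
    """Simplified illustration: every byte becomes a three-digit octal escape."""
    return '"{}"'.format(''.join('\\{:03o}'.format(i) for i in b))


# Byte 0x01 followed by the character '0'. A C compiler reads the hex
# escape greedily, so "\x010" is one character, not two.
hex_case = naive_encode(b'\x01' + b'0')

# repr() leaves question marks alone, so a trigraph such as ??= survives
# into the C source and is replaced with '#' when trigraphs are enabled.
trigraph_case = naive_encode(b'??=')

# Octal escapes consume at most three digits, so the next character is safe.
octal_case = octal_encode(b'\x01' + b'0')

print(hex_case)       # "\x010"
print(trigraph_case)  # "??="
print(octal_case)     # "\001\060"
```

The fixed three-digit width is what makes octal safe here: the escape always ends after the third digit, whatever follows it.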

As an ad hoc test, I generated a Python module containing string literals for every possible character:

with open("literals.py", "w") as f:
    f.write(
        "LITERALS = [{}]".format(
            ", ".join([repr(bytes([x, ord("0")])) for x in range(256)])
        )
    )

Then I compiled the module with Mypyc and checked it produced the correct byte strings:

import literals

assert literals.__file__.endswith(".so")

for x in range(256):
    assert literals.LITERALS[x] == bytes([x, ord("0")])

msullivan

Good catch! Thanks for this.

Basically looks good, but a few comments about reorganizing it a little.

I'm also sad about the octal thing but oh well.

msullivan · 2019-11-19T19:37:53Z

mypyc/emitmodule.py

@@ -421,8 +449,7 @@ def encode_as_c_string(s: str) -> Tuple[str, int]:

 def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]:
    """Produce a single-escaped, quoted C string and its size from a bytes"""
-    # This is a kind of abusive way to do this...
-    escaped = repr(b)[2:-1].replace('"', '\\"')
+    escaped = ''.join(map(C_CHAR_MAP.__getitem__, b))


I'd write this as a list comprehension instead of a map
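For comparison, the two spellings side by side. This is only a sketch: `C_CHAR_MAP` below is a hypothetical stand-in that maps every byte to an octal escape, not the table from the PR.

```python
# Hypothetical stand-in table: every byte as a three-digit octal escape.
C_CHAR_MAP = ['\\{:03o}'.format(i) for i in range(256)]

b = b'a\x00?'

# Iterating over bytes yields ints, so both forms index the table directly.
via_map = ''.join(map(C_CHAR_MAP.__getitem__, b))
via_comprehension = ''.join([C_CHAR_MAP[i] for i in b])

print(via_comprehension)  # \141\000\077
```

Both produce the same string; the comprehension avoids the explicit `__getitem__` reference.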

msullivan · 2019-11-19T19:39:25Z

mypyc/emitmodule.py

@@ -67,6 +68,33 @@
 # A list of (file name, file contents) pairs.
 FileContents = List[Tuple[str, str]]

+# The C standard specifies that any number of valid hexadecimal characters are


I think it's worth moving this out of emitmodule at this point. mypyc.cstring, maybe?

(And then encode_as_c_string and encode_bytes_as_c_string too)

msullivan · 2019-11-19T19:40:19Z

mypyc/emitmodule.py

+# These assignments must be done after string.printable because they are
+# overrides for the printable characters that need to be escaped in string
+# literals.
+C_CHAR_MAP[ord('\'')] = '\\\''


These are all generated the same way (adding a backslash), so I think it would be an improvement to generate them all mechanically from a list of characters to escape.

I considered this, but the neatest way I can think to do this is:

for x in ("'", '"', '\\', 'a', 'b', 'f', 'n', 'r', 't', 'v'):
    escaped = '\\{}'.format(x)
    value = escaped.encode('ascii').decode('unicode_escape')
    C_CHAR_MAP[ord(value)] = escaped

I am not sure whether this is an improvement. What do you think?

saleemrashid · 2019-11-19T20:00:31Z

I've force-pushed some changes, i.e. rewrote the comments, and refactored the escaped strings to raw string literals.

I will try and implement your suggestions shortly.

I'm also sad about the octal thing but oh well.

@msullivan My original prototype used itertools.groupby to group blocks of printable characters and unprintable characters so you could safely use hexadecimal escape sequences. It worked fine but it seemed like unnecessary complexity.

msullivan · 2019-11-19T20:36:09Z

@msullivan My original prototype used itertools.groupby to group blocks of printable characters and unprintable characters so you could safely use hexadecimal escape sequences. It worked fine but it seemed like unnecessary complexity.

I agree that using octal is preferable to writing complex code that avoids it.

saleemrashid · 2019-11-19T22:49:55Z

mypyc/cstring.py

+
+# These assignments must come last because we prioritize the escape
+# sequences in the C standard over any other representation.
+CHAR_MAP[ord('\'')] = r'\''


@msullivan This is the simplest method I have been able to devise to generate these escape sequences:

for x in ("'", '"', '\\', 'a', 'b', 'f', 'n', 'r', 't', 'v'):
    escaped = '\\{}'.format(x)
    value = escaped.encode('ascii').decode('unicode_escape')
    C_CHAR_MAP[ord(value)] = escaped

While I am not a huge fan of manually listing each escape sequence, I dislike this method of generating them even more.

The encode/decode is a little hokey but I think I like this more

The main reason I dislike this is that we're calling into Python's Unicode APIs. There's also codecs.escape_decode that handles ASCII only, but it's undocumented.
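One way to sidestep the Unicode APIs, sketched here purely as an illustration, is to pair each escape letter with its character value in two parallel strings. The names `generated`, `LETTERS`, `VALUES` and `paired` are hypothetical and not part of the PR.

```python
# Approach from the thread: let Python's escape decoder find each value.
generated = {}
for x in ("'", '"', '\\', 'a', 'b', 'f', 'n', 'r', 't', 'v'):
    escaped = '\\{}'.format(x)
    value = escaped.encode('ascii').decode('unicode_escape')
    generated[ord(value)] = escaped

# Hypothetical alternative: list the letters and the characters they
# stand for, then zip them together. No codec round trip is needed.
LETTERS = "'\"\\abfnrtv"
VALUES = "'\"\\\a\b\f\n\r\t\v"
paired = {ord(v): '\\' + letter for letter, v in zip(LETTERS, VALUES)}

print(generated == paired)  # True
```

This still lists every escape by hand, so it trades the codec call for a second literal that must stay in step with the first.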

msullivan · 2019-11-19T23:23:49Z

Could you also remove -Wno-trigraphs from our C flags in build.py <_<

saleemrashid · 2019-11-19T23:25:15Z

mypyc/cstring.py

+    return encode_bytes_as_c_string(s.encode('utf-8'))
+
+
+def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]:


This is what an implementation with hexadecimal escape sequences, instead of octal, would look like:

diff --git i/mypyc/cstring.py w/mypyc/cstring.py
index 37f3ab38..0557a216 100644
--- i/mypyc/cstring.py
+++ w/mypyc/cstring.py
@@ -19,9 +19,11 @@ octal digits.
 """

 import string
+import itertools
 from typing import Tuple

-CHAR_MAP = ['\\{:03o}'.format(i) for i in range(256)]
+HEX_DIGITS = frozenset(string.hexdigits)
+CHAR_MAP = {}

 # It is safe to use string.printable as it always uses the C locale.
 for c in string.printable:
@@ -49,5 +51,13 @@ def encode_as_c_string(s: str) -> Tuple[str, int]:

 def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]:
     """Produce a quoted C string literal and its size, for a byte string."""
-    escaped = ''.join([CHAR_MAP[i] for i in b])
+    escaped = ''
+    for printable, group in itertools.groupby(b, key=CHAR_MAP.__contains__):
+        if printable:
+            s = ''.join([CHAR_MAP[i] for i in group])
+            if len(escaped) and s[0] in HEX_DIGITS:
+                escaped += '" "'
+            escaped += s
+        else:
+            escaped += ''.join(['\\x{:02X}'.format(i) for i in group])
     return '"{}"'.format(escaped), len(b)
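To see what the hexadecimal variant would emit, here is a standalone sketch of the same grouping logic. The `CHAR_MAP` here is an assumption: a reduced table covering only letters, digits and space, whereas the table in the PR is built from string.printable with additional overrides. `encode_bytes_hex` is a hypothetical name.

```python
import itertools
import string
from typing import Tuple

HEX_DIGITS = frozenset(string.hexdigits)

# Assumption: a reduced printable table, for illustration only.
CHAR_MAP = {ord(c): c for c in string.ascii_letters + string.digits + ' '}


def encode_bytes_hex(b: bytes) -> Tuple[str, int]:
    """Quote b as a C string literal using hex escapes for unmapped bytes."""
    escaped = ''
    # groupby alternates between runs of mapped and unmapped bytes.
    for printable, group in itertools.groupby(b, key=CHAR_MAP.__contains__):
        if printable:
            s = ''.join(CHAR_MAP[i] for i in group)
            # A printable run that follows a hex escape and starts with a
            # hex digit would be absorbed into the escape, so close the
            # literal and open a new one; C concatenates adjacent literals.
            if escaped and s[0] in HEX_DIGITS:
                escaped += '" "'
            escaped += s
        else:
            escaped += ''.join('\\x{:02X}'.format(i) for i in group)
    return '"{}"'.format(escaped), len(b)


split_literal, split_size = encode_bytes_hex(b'\x01' + b'0')
plain_literal, plain_size = encode_bytes_hex(b'\x01z')

print(split_literal)  # "\x01" "0"
print(plain_literal)  # "\x01z"
```

The literal is split only when the character after an escape is a hex digit, which is the extra branching the octal version avoids.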

The compiler will not complain about trigraphs anymore because we now escape all question marks in C string literals.
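A small sketch of why escaping question marks is enough. It assumes the escape used is the standard C `\?` sequence; the thread does not show which spelling the PR emits, and `escape_question_marks` is a hypothetical helper.

```python
# The nine C trigraphs all begin with two question marks.
TRIGRAPHS = ['??=', '??/', "??'", '??(', '??)', '??!', '??<', '??>', '??-']


def escape_question_marks(s: str) -> str:
    """Replace each '?' with the C escape sequence for a question mark."""
    return s.replace('?', '\\?')


escaped = [escape_question_marks(t) for t in TRIGRAPHS]

# Once every '?' is preceded by a backslash, no two question marks are
# adjacent, so the preprocessor has nothing to recognise as a trigraph.
print(escaped[0])  # \?\?=
```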

msullivan · 2019-11-20T00:33:28Z

Thanks!

msullivan self-requested a review November 19, 2019 19:07

msullivan reviewed Nov 19, 2019

saleemrashid force-pushed the mypyc-cstr-encoding branch from d389606 to bad98fa Compare November 19, 2019 19:56

saleemrashid force-pushed the mypyc-cstr-encoding branch from bad98fa to 7bc5d89 Compare November 19, 2019 22:44

saleemrashid commented Nov 19, 2019

saleemrashid force-pushed the mypyc-cstr-encoding branch 2 times, most recently from 9389393 to 3ca384a Compare November 19, 2019 22:58

saleemrashid commented Nov 19, 2019

mypyc: Remove -Wno-trigraphs from CFLAGS

d4f3656

The compiler will not complain about trigraphs anymore because we now escape all question marks in C string literals.

saleemrashid force-pushed the mypyc-cstr-encoding branch 3 times, most recently from 22bc75f to d4f3656 Compare November 19, 2019 23:59

msullivan merged commit 05b92c0 into python:master Nov 20, 2019

saleemrashid deleted the mypyc-cstr-encoding branch November 20, 2019 06:54
