Commit bad98fa

mypyc: Fix C string encoding
Using repr() to encode C strings relied on implementation-defined behaviour and had a number of bugs, for example:

* The C string could inadvertently contain trigraphs.
* Valid hexadecimal characters following a hexadecimal escape sequence would be parsed as part of the escape sequence.

Octal escape sequences are used for unprintable characters because they do not have the same issue as hexadecimal escape sequences.

It would be possible to mitigate the issue with hexadecimal escape sequences by emitting multiple string literals. However, the complexity is not worth it, especially as the output is not designed to be human-readable.
1 parent e99a2b5 commit bad98fa
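
To make the two bugs named in the commit message concrete, here is a minimal Python sketch of the old repr()-based approach. The helper name `old_encode` is made up for illustration; its body is the expression this commit removes from `encode_bytes_as_c_string`.

```python
def old_encode(b: bytes) -> str:
    """Hypothetical helper: reuse Python's bytes repr as the body of a C literal."""
    return repr(b)[2:-1].replace('"', '\\"')


# In Python, b'\x12345' is four bytes: 0x12 followed by ASCII '3', '4', '5'.
hex_case = old_encode(b'\x12345')
print(hex_case)  # \x12345

# Python's repr leaves question marks alone, so a trigraph such as ??/
# (which a C compiler may translate to a backslash) passes through unescaped.
trigraph_case = old_encode(b'??/')
print(trigraph_case)  # ??/
```

Placed inside a C string literal, the first output is read as a single hexadecimal escape that swallows the digits after it, and the second can be rewritten by trigraph replacement before the literal is even tokenized.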

1 file changed, +29 -2 lines changed

mypyc/emitmodule.py

Lines changed: 29 additions & 2 deletions
@@ -6,6 +6,7 @@
 import os
 import hashlib
 import json
+import string
 from collections import OrderedDict
 from typing import List, Tuple, Dict, Iterable, Set, TypeVar, Optional

@@ -67,6 +68,33 @@
 # A list of (file name, file contents) pairs.
 FileContents = List[Tuple[str, str]]
 
+# The C standard specifies that an unlimited number of valid hexadecimal
+# characters are parsed as part of the hexadecimal escape sequence. For
+# example, "\x12345" would be unexpectedly parsed as {0x12345}, instead of
+# {0x123, '4', '5'}. Therefore, we use octal escape sequences which are
+# specified to contain at most three octal digits.
+C_CHAR_MAP = ['\\{:03o}'.format(x) for x in range(256)]
+# Most printable characters do not need to be escaped in string literals. We
+# can safely use string.printable here because it always uses the C locale.
+for x in string.printable:
+    C_CHAR_MAP[ord(x)] = x
+# These assignments must be done after string.printable because they are
+# overrides for the printable characters that need to be escaped in string
+# literals.
+C_CHAR_MAP[ord('\'')] = r'\''
+C_CHAR_MAP[ord('\"')] = r'\"'
+C_CHAR_MAP[ord('\\')] = r'\\'
+C_CHAR_MAP[ord('\a')] = r'\a'
+C_CHAR_MAP[ord('\b')] = r'\b'
+C_CHAR_MAP[ord('\f')] = r'\f'
+C_CHAR_MAP[ord('\n')] = r'\n'
+C_CHAR_MAP[ord('\r')] = r'\r'
+C_CHAR_MAP[ord('\t')] = r'\t'
+C_CHAR_MAP[ord('\v')] = r'\v'
+# The question mark is escaped to prevent trigraphs from being interpreted
+# inside string literals. This escape sequence is invalid in Python.
+C_CHAR_MAP[ord('?')] = r'\?'
+
 
 
 class MarkedDeclaration:
     """Add a mark, useful for topological sort."""
@@ -421,8 +449,7 @@ def encode_as_c_string(s: str) -> Tuple[str, int]:
 
 
 def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]:
     """Produce a single-escaped, quoted C string and its size from a bytes"""
-    # This is a kind of abusive way to do this...
-    escaped = repr(b)[2:-1].replace('"', '\\"')
+    escaped = ''.join(map(C_CHAR_MAP.__getitem__, b))
     return '"{}"'.format(escaped), len(b)
 
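
The table-driven approach in the diff above can be exercised on its own. The following is a condensed sketch rather than the committed code: it builds the same 256-entry map, but applies the overrides through a loop over (character, escape) pairs. The `OVERRIDES` name is an assumption introduced for illustration.

```python
import string
from typing import Tuple

# Default: every byte value becomes a three-digit octal escape.
C_CHAR_MAP = ['\\{:03o}'.format(x) for x in range(256)]

# Printable ASCII characters are emitted as themselves.
for ch in string.printable:
    C_CHAR_MAP[ord(ch)] = ch

# Overrides applied last: characters that need an escape inside a C string
# literal, plus '?' so that no trigraph can form. (OVERRIDES is a
# hypothetical name; the commit assigns each entry individually.)
OVERRIDES = [
    ("'", r"\'"), ('"', r'\"'), ('\\', r'\\'),
    ('\a', r'\a'), ('\b', r'\b'), ('\f', r'\f'), ('\n', r'\n'),
    ('\r', r'\r'), ('\t', r'\t'), ('\v', r'\v'),
    ('?', r'\?'),
]
for ch, escape in OVERRIDES:
    C_CHAR_MAP[ord(ch)] = escape


def encode_bytes_as_c_string(b: bytes) -> Tuple[str, int]:
    """Return a quoted C string literal and the byte length of its contents."""
    # Iterating over bytes yields ints, which index the map directly.
    escaped = ''.join(map(C_CHAR_MAP.__getitem__, b))
    return '"{}"'.format(escaped), len(b)


literal, size = encode_bytes_as_c_string(b'\x12345')
print(literal, size)  # "\022345" 4

literal, size = encode_bytes_as_c_string(b'??/')
print(literal, size)  # "\?\?/" 3
```

Because an octal escape stops after at most three digits, `\022` cannot absorb the `345` that follows it, which is the property the hexadecimal form lacks.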
