Commit 0afc0a7

Fix unaccent generation script in Windows
As originally coded, the script would fail on Windows 10 with Python 3, because stdout was switched to UTF-8 only for Python 2. This patch makes that apply to both versions.

Also add Python 2 compatibility markers so that we know what to remove once we drop support for that. Also use a "with" clause to ensure the file descriptor is closed promptly.

Author: Hugh Ranalli, Ramanarayana
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/CAKm4Xs7_61XMyOWmHs3n0mmkS0O4S0pvfWk=7cQ5P0gs177f7A@mail.gmail.com
Discussion: https://postgr.es/m/15548-cef1b3f8de190d4f@postgresql.org
1 parent b438e7e commit 0afc0a7
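
For readers unfamiliar with the stdout issue described in the commit message, here is a minimal sketch of the idea. It is not the script's code: the helper name `utf8_writer` is hypothetical, and it detects the byte stream with `getattr` where the patch uses an explicit version check. It is demonstrated on an in-memory stream so the encoded bytes can be inspected.

```python
import codecs
import io


def utf8_writer(stream):
    """Return a writer that encodes everything written to it as UTF-8.

    On Python 3, sys.stdout is a text stream whose encoding depends on the
    platform, so the underlying byte stream (stream.buffer) is wrapped.
    Streams without a .buffer attribute are wrapped directly.
    """
    binary = getattr(stream, "buffer", stream)
    return codecs.getwriter("utf8")(binary)


# Demonstrate on an in-memory byte stream instead of the real stdout.
sink = io.BytesIO()
out = utf8_writer(sink)
out.write(u"\u00e9\te\n")  # accented character, tab, plain character
encoded = sink.getvalue()
```

In a real script the call would presumably be `sys.stdout = utf8_writer(sys.stdout)`, after which `print` output no longer depends on the console code page.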

1 file changed: +24 -20 lines changed
contrib/unaccent/generate_unaccent_rules.py

Lines changed: 24 additions & 20 deletions
@@ -32,9 +32,15 @@
 # The approach is to be Python3 compatible with Python2 "backports".
 from __future__ import print_function
 from __future__ import unicode_literals
+# END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
+
+import argparse
 import codecs
+import re
 import sys
+import xml.etree.ElementTree as ET
 
+# BEGIN: Python 2/3 compatibility - remove when Python 2 compatibility dropped
 if sys.version_info[0] <= 2:
     # Encode stdout as UTF-8, so we can just print to it
     sys.stdout = codecs.getwriter('utf8')(sys.stdout)
@@ -45,12 +51,9 @@
     # Python 2 and 3 compatible bytes call
     def bytes(source, encoding='ascii', errors='strict'):
         return source.encode(encoding=encoding, errors=errors)
+else:
 # END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
-
-import re
-import argparse
-import sys
-import xml.etree.ElementTree as ET
+    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
 
 # The ranges of Unicode characters that we consider to be "plain letters".
 # For now we are being conservative by including only Latin and Greek. This
@@ -233,21 +236,22 @@ def main(args):
     charactersSet = set()
 
     # read file UnicodeData.txt
-    unicodeDataFile = open(args.unicodeDataFilePath, 'r')
-
-    # read everything we need into memory
-    for line in unicodeDataFile:
-        fields = line.split(";")
-        if len(fields) > 5:
-            # http://www.unicode.org/reports/tr44/tr44-14.html#UnicodeData.txt
-            general_category = fields[2]
-            decomposition = fields[5]
-            decomposition = re.sub(decomposition_type_pattern, ' ', decomposition)
-            id = int(fields[0], 16)
-            combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
-            codepoint = Codepoint(id, general_category, combining_ids)
-            table[id] = codepoint
-            all.append(codepoint)
+    with codecs.open(
+      args.unicodeDataFilePath, mode='r', encoding='UTF-8',
+      ) as unicodeDataFile:
+        # read everything we need into memory
+        for line in unicodeDataFile:
+            fields = line.split(";")
+            if len(fields) > 5:
+                # http://www.unicode.org/reports/tr44/tr44-14.html#UnicodeData.txt
+                general_category = fields[2]
+                decomposition = fields[5]
+                decomposition = re.sub(decomposition_type_pattern, ' ', decomposition)
+                id = int(fields[0], 16)
+                combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
+                codepoint = Codepoint(id, general_category, combining_ids)
+                table[id] = codepoint
+                all.append(codepoint)
 
     # walk through all the codepoints looking for interesting mappings
     for codepoint in all:
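
The last hunk's change, reading the data file inside a "with" block with an explicit encoding, can be tried in isolation with the sketch below. It is a simplified illustration under stated assumptions, not the script itself: `TAG_PATTERN` is a hypothetical stand-in for `decomposition_type_pattern` (whose definition is outside this diff), `read_decompositions` is an invented helper name, and the two sample lines merely follow the semicolon-separated UnicodeData.txt layout with most fields left empty.

```python
import codecs
import os
import re
import tempfile

# Hypothetical stand-in for the script's decomposition_type_pattern:
# strips compatibility tags such as "<super>" from the decomposition field.
TAG_PATTERN = re.compile(r"<[^>]*>")


def read_decompositions(path):
    """Map each codepoint in a UnicodeData.txt-style file to its decomposition."""
    table = {}
    # The "with" block closes the file as soon as the loop finishes,
    # and the explicit encoding avoids the platform default.
    with codecs.open(path, mode="r", encoding="UTF-8") as data_file:
        for line in data_file:
            fields = line.split(";")
            if len(fields) > 5:
                decomposition = TAG_PATTERN.sub(" ", fields[5])
                table[int(fields[0], 16)] = [
                    int(s, 16) for s in decomposition.split(" ") if s != ""
                ]
    return table, data_file.closed


# Two sample lines in the UnicodeData.txt layout (unused fields left empty).
sample = (
    "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9\n"
    "00AA;FEMININE ORDINAL INDICATOR;Lo;0;L;<super> 0061;;;;N;;;;;\n"
)

fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(sample.encode("UTF-8"))
    table, closed = read_decompositions(path)
finally:
    os.remove(path)
```

The second return value is there only to show that the file is already closed once the "with" block exits, which is the point of the change.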
