Commit df4cba6

Commit Patrice's patches except:

> - corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1
>   characters (characters with values >= 0x10000, which are encoded on
>   four bytes).

Also, update mb/expected/unicode.out. This is necessary since the patches affect the result of queries using UTF-8.

---------------------------------------------------------------

Hi,

I should have sent the patch earlier, but got delayed by other stuff. Anyway, here is the patch:

- most of the functionality is only activated when MULTIBYTE is defined,

- check valid UTF-8 characters, client-side only yet, and only on output, you still can send invalid UTF-8 to the server (so, it's only partly compliant to Unicode 3.1, but that's better than nothing).

- formats with the correct number of columns (that's why I made it in the first place after all), but only for UNICODE. However, the code allows to plug-in routines for other encodings, as Tatsuo did for the other multibyte functions.

- corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1 characters (characters with values >= 0x10000, which are encoded on four bytes).

- doesn't depend on the locale capabilities of the glibc (useful for remote telnet).

I would like somebody to check it closely, as it is my first patch to pgsql. Also, I created dummy .orig files, so that the two files I created are included, I hope that's the right way.

Now, a lot of functionality is NOT included here, but I will keep that for 7.3 :) That includes all string checking on the server side (which will have to be a bit more optimised ;) ), and the input checking on the client side for UTF-8, though that should not be difficult. It's just to send the strings through mbvalidate() before sending them to the server. Strong checking on UTF-8 strings is mandatory to be compliant with Unicode 3.1+ .

Do I have time to look for a patch to include iso-8859-15 for 7.2 ? The euro is coming 1. january 2002 (before 7.3 !) and over 280 millions people in Europe will need the euro sign and only iso-8859-15 and iso-8859-16 have it (and unfortunately, I don't think all Unices will switch to Unicode in the meantime).... err... yes, I know that this is not every single person in Europe that uses PostgreSql, so it's not exactly 280m, but it's just a matter of time ! ;)

I'll come back (on pgsql-hackers) later to ask a few questions regarding the full unicode support (normalisation, collation, regexes,...) on the server side :)

Here is the patch !

Patrice.

--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
-- Isn't it weird how scientists can imagine all the matter of the universe exploding out of a dot smaller than the head of a pin, but they can't come up with a more evocative name for it than "The Big Bang" ?
-- What would _you_ call the creation of the universe ?
-- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
1 parent d07bacd commit df4cba6
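The commit message refers to characters with values >= 0x10000 being encoded on four bytes. As a rough illustration of that layout, here is a minimal sketch of decoding one four-byte UTF-8 sequence. The function name `decode_utf8_4` and the `UINT32_MAX` error value are assumptions made for this example; they are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): decode one four-byte
 * UTF-8 sequence of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
 * Returns UINT32_MAX if the lead byte or a continuation byte is
 * malformed. Range checks (overlong forms, > 0x10ffff) are left out.
 */
uint32_t
decode_utf8_4(const unsigned char *s)
{
    int i;

    assert(s != NULL);

    /* lead byte must match 11110xxx */
    if ((s[0] & 0xf8) != 0xf0)
        return UINT32_MAX;

    /* each trailing byte must match 10xxxxxx; stops at a NUL byte too */
    for (i = 1; i < 4; i++)
        if ((s[i] & 0xc0) != 0x80)
            return UINT32_MAX;

    /* 3 payload bits from the lead byte, 6 from each trailing byte */
    return ((uint32_t) (s[0] & 0x07) << 18) |
           ((uint32_t) (s[1] & 0x3f) << 12) |
           ((uint32_t) (s[2] & 0x3f) << 6) |
           (uint32_t) (s[3] & 0x3f);
}
```

For instance, the bytes F0 90 80 80 carry the payload bits of U+10000, the first code point that needs four bytes.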

4 files changed: 529 additions, 70 deletions
src/bin/psql/Makefile

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@
 # Portions Copyright (c) 1996-2001, PostgreSQL Global Development Group
 # Portions Copyright (c) 1994, Regents of the University of California
 #
-# $Header: /cvsroot/pgsql/src/bin/psql/Makefile,v 1.30 2001/02/27 08:13:27 ishii Exp $
+# $Header: /cvsroot/pgsql/src/bin/psql/Makefile,v 1.31 2001/10/15 01:25:10 ishii Exp $
 #
 #-------------------------------------------------------------------------

@@ -19,7 +19,7 @@ override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)

 OBJS=command.o common.o help.o input.o stringutils.o mainloop.o \
     copy.o startup.o prompt.o variables.o large_obj.o print.o describe.o \
-    tab-complete.o
+    tab-complete.o mbprint.o

 all: submake psql


src/bin/psql/mbprint.c

Lines changed: 334 additions & 0 deletions
@@ -0,0 +1,334 @@
+/*
+ * psql - the PostgreSQL interactive terminal
+ *
+ * Copyright 2000 by PostgreSQL Global Development Group
+ *
+ * $Header: /cvsroot/pgsql/src/bin/psql/mbprint.c,v 1.1 2001/10/15 01:25:10 ishii Exp $
+ */
+
+#include "postgres_fe.h"
+#include "mbprint.h"
+
+#ifdef MULTIBYTE
+
+#include "mb/pg_wchar.h"
+#include "settings.h"
+
+/*
+ * This is an implementation of wcwidth() and wcswidth() as defined in
+ * "The Single UNIX Specification, Version 2, The Open Group, 1997"
+ * <http://www.UNIX-systems.org/online.html>
+ *
+ * Markus Kuhn -- 2001-09-08 -- public domain
+ *
+ * customised for PostgreSQL
+ *
+ * original available at : http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
+ */
+
+struct mbinterval {
+    unsigned short first;
+    unsigned short last;
+};
+
+/* auxiliary function for binary search in interval table */
+static int
+mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
+{
+    int min = 0;
+    int mid;
+
+    if (ucs < table[0].first || ucs > table[max].last)
+        return 0;
+    while (max >= min) {
+        mid = (min + max) / 2;
+        if (ucs > table[mid].last)
+            min = mid + 1;
+        else if (ucs < table[mid].first)
+            max = mid - 1;
+        else
+            return 1;
+    }
+
+    return 0;
+}
+
+
+/* The following functions define the column width of an ISO 10646
+ * character as follows:
+ *
+ *    - The null character (U+0000) has a column width of 0.
+ *
+ *    - Other C0/C1 control characters and DEL will lead to a return
+ *      value of -1.
+ *
+ *    - Non-spacing and enclosing combining characters (general
+ *      category code Mn or Me in the Unicode database) have a
+ *      column width of 0.
+ *
+ *    - Other format characters (general category code Cf in the Unicode
+ *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
+ *
+ *    - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
+ *      have a column width of 0.
+ *
+ *    - Spacing characters in the East Asian Wide (W) or East Asian
+ *      FullWidth (F) category as defined in Unicode Technical
+ *      Report #11 have a column width of 2.
+ *
+ *    - All remaining characters (including all printable
+ *      ISO 8859-1 and WGL4 characters, Unicode control characters,
+ *      etc.) have a column width of 1.
+ *
+ * This implementation assumes that wchar_t characters are encoded
+ * in ISO 10646.
+ */
+
+static int
+ucs_wcwidth(pg_wchar ucs)
+{
+    /* sorted list of non-overlapping intervals of non-spacing characters */
+    static const struct mbinterval combining[] = {
+        { 0x0300, 0x034E }, { 0x0360, 0x0362 }, { 0x0483, 0x0486 },
+        { 0x0488, 0x0489 }, { 0x0591, 0x05A1 }, { 0x05A3, 0x05B9 },
+        { 0x05BB, 0x05BD }, { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 },
+        { 0x05C4, 0x05C4 }, { 0x064B, 0x0655 }, { 0x0670, 0x0670 },
+        { 0x06D6, 0x06E4 }, { 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED },
+        { 0x070F, 0x070F }, { 0x0711, 0x0711 }, { 0x0730, 0x074A },
+        { 0x07A6, 0x07B0 }, { 0x0901, 0x0902 }, { 0x093C, 0x093C },
+        { 0x0941, 0x0948 }, { 0x094D, 0x094D }, { 0x0951, 0x0954 },
+        { 0x0962, 0x0963 }, { 0x0981, 0x0981 }, { 0x09BC, 0x09BC },
+        { 0x09C1, 0x09C4 }, { 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 },
+        { 0x0A02, 0x0A02 }, { 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 },
+        { 0x0A47, 0x0A48 }, { 0x0A4B, 0x0A4D }, { 0x0A70, 0x0A71 },
+        { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC }, { 0x0AC1, 0x0AC5 },
+        { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD }, { 0x0B01, 0x0B01 },
+        { 0x0B3C, 0x0B3C }, { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B43 },
+        { 0x0B4D, 0x0B4D }, { 0x0B56, 0x0B56 }, { 0x0B82, 0x0B82 },
+        { 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD }, { 0x0C3E, 0x0C40 },
+        { 0x0C46, 0x0C48 }, { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 },
+        { 0x0CBF, 0x0CBF }, { 0x0CC6, 0x0CC6 }, { 0x0CCC, 0x0CCD },
+        { 0x0D41, 0x0D43 }, { 0x0D4D, 0x0D4D }, { 0x0DCA, 0x0DCA },
+        { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 }, { 0x0E31, 0x0E31 },
+        { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E }, { 0x0EB1, 0x0EB1 },
+        { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC }, { 0x0EC8, 0x0ECD },
+        { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 }, { 0x0F37, 0x0F37 },
+        { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E }, { 0x0F80, 0x0F84 },
+        { 0x0F86, 0x0F87 }, { 0x0F90, 0x0F97 }, { 0x0F99, 0x0FBC },
+        { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 }, { 0x1032, 0x1032 },
+        { 0x1036, 0x1037 }, { 0x1039, 0x1039 }, { 0x1058, 0x1059 },
+        { 0x1160, 0x11FF }, { 0x17B7, 0x17BD }, { 0x17C6, 0x17C6 },
+        { 0x17C9, 0x17D3 }, { 0x180B, 0x180E }, { 0x18A9, 0x18A9 },
+        { 0x200B, 0x200F }, { 0x202A, 0x202E }, { 0x206A, 0x206F },
+        { 0x20D0, 0x20E3 }, { 0x302A, 0x302F }, { 0x3099, 0x309A },
+        { 0xFB1E, 0xFB1E }, { 0xFE20, 0xFE23 }, { 0xFEFF, 0xFEFF },
+        { 0xFFF9, 0xFFFB }
+    };
+
+    /* test for 8-bit control characters */
+    if (ucs == 0) {
+        return 0;
+    }
+
+    if (ucs < 32 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff) {
+        return -1;
+    }
+
+    /* binary search in table of non-spacing characters */
+    if (mbbisearch(ucs, combining,
+                   sizeof(combining) / sizeof(struct mbinterval) - 1)) {
+        return 0;
+    }
+
+    /* if we arrive here, ucs is not a combining or C0/C1 control character */
+
+    return 1 +
+        (ucs >= 0x1100 &&
+         (ucs <= 0x115f ||                    /* Hangul Jamo init. consonants */
+          (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
+           ucs != 0x303f) ||                  /* CJK ... Yi */
+          (ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
+          (ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
+          (ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
+          (ucs >= 0xff00 && ucs <= 0xff5f) || /* Fullwidth Forms */
+          (ucs >= 0xffe0 && ucs <= 0xffe6) ||
+          (ucs >= 0x20000 && ucs <= 0x2ffff)));
+}
+
+pg_wchar
+utf2ucs(const unsigned char *c)
+{
+    /* one char version of pg_utf2wchar_with_len.
+     * no control here, c must point to a large enough string
+     */
+    if ((*c & 0x80) == 0) {
+        return (pg_wchar)c[0];
+    }
+    else if ((*c & 0xe0) == 0xc0) {
+        return (pg_wchar)(((c[0] & 0x1f) << 6) |
+                          (c[1] & 0x3f));
+    }
+    else if ((*c & 0xf0) == 0xe0) {
+        return (pg_wchar)(((c[0] & 0x0f) << 12) |
+                          ((c[1] & 0x3f) << 6) |
+                          (c[2] & 0x3f));
+    }
+    else if ((*c & 0xf8) == 0xf0) {
+        return (pg_wchar)(((c[0] & 0x07) << 18) |
+                          ((c[1] & 0x3f) << 12) |
+                          ((c[2] & 0x3f) << 6) |
+                          (c[3] & 0x3f));
+    }
+    else {
+        /* that is an invalid code on purpose */
+        return 0xffffffff;
+    }
+}
+
+/* mb_utf_wcswidth : calculate column length for the utf8 string pwcs
+ */
+static int
+mb_utf_wcswidth(unsigned char *pwcs, int len)
+{
+    int w, l = 0;
+    int width = 0;
+
+    for (;*pwcs && len > 0; pwcs+=l) {
+        l = pg_utf_mblen(pwcs);
+        if ((len < l) || ((w = ucs_wcwidth(utf2ucs(pwcs))) < 0)) {
+            return width;
+        }
+        len -= l;
+        width += w;
+    }
+    return width;
+}
+
+static int
+utf_charcheck(const unsigned char *c)
+{
+    /* Unicode 3.1 compliant validation :
+     * for each category, it checks the combination of each byte to make sure
+     * it maps to a valid range. It also returns -1 for the following UCS values:
+     *   ucs > 0x10ffff
+     *   ucs & 0xfffe = 0xfffe
+     *   0xfdd0 < ucs < 0xfdef
+     *   ucs & 0xdb00 = 0xd800 (surrogates)
+     */
+    if ((*c & 0x80) == 0) {
+        return 1;
+    }
+    else if ((*c & 0xe0) == 0xc0) {
+        /* two-byte char */
+        if(((c[1] & 0xc0) == 0x80) && ((c[0] & 0x1f) > 0x01)) {
+            return 2;
+        }
+        return -1;
+    }
+    else if ((*c & 0xf0) == 0xe0) {
+        /* three-byte char */
+        if (((c[1] & 0xc0) == 0x80) &&
+            (((c[0] & 0x0f) != 0x00) || ((c[1] & 0x20) == 0x20)) &&
+            ((c[2] & 0xc0) == 0x80)) {
+            int z = c[0] & 0x0f;
+            int yx = ((c[1] & 0x3f) << 6) | (c[2] & 0x3f);
+            int lx = yx & 0x7f;
+
+            /* check 0xfffe/0xffff, 0xfdd0..0xfdef range, surrogates */
+            if (((z == 0x0f) &&
+                 (((yx & 0xffe) == 0xffe) ||
+                  (((yx & 0xf80) == 0xd80) && (lx >= 0x30) && (lx <= 0x4f)))) ||
+                ((z == 0x0d) && ((yx & 0xb00) == 0x800))) {
+                return -1;
+            }
+            return 3;
+        }
+        return -1;
+    }
+    else if ((*c & 0xf8) == 0xf0) {
+        int u = ((c[0] & 0x07) << 2) | ((c[1] & 0x30) >> 4);
+
+        /* four-byte char */
+        if (((c[1] & 0xc0) == 0x80) &&
+            (u > 0x00) && (u <= 0x10) &&
+            ((c[2] & 0xc0) == 0x80) && ((c[3] & 0xc0) == 0x80)) {
+            /* test for 0xzzzzfffe/0xzzzzffff */
+            if (((c[1] & 0x0f) == 0x0f) && ((c[2] & 0x3f) == 0x3f) &&
+                ((c[3] & 0x3e) == 0x3e)) {
+                return -1;
+            }
+            return 4;
+        }
+        return -1;
+    }
+    return -1;
+}
+
+static unsigned char *
+mb_utf_validate(unsigned char *pwcs)
+{
+    int l = 0;
+    unsigned char *p = pwcs;
+    unsigned char *p0 = pwcs;
+
+    while( *pwcs ) {
+        if ((l = utf_charcheck(pwcs)) > 0) {
+            if (p != pwcs) {
+                int i;
+                for( i = 0; i < l; i++) {
+                    *p++ = *pwcs++;
+                }
+            }
+            else {
+                pwcs += l;
+                p += l;
+            }
+        }
+        else {
+            /* we skip the char */
+            pwcs++;
+        }
+    }
+    if (p != pwcs) {
+        *p = '\0';
+    }
+    return p0;
+}
+
+/*
+ * public functions : wcswidth and mbvalidate
+ */
+
+int
+pg_wcswidth(unsigned char *pwcs, int len) {
+    if (pset.encoding == PG_UTF8) {
+        return mb_utf_wcswidth(pwcs, len);
+    }
+    else {
+        /* obviously, other encodings may want to fix this, but I don't know them
+         * myself, unfortunately.
+         */
+        return len;
+    }
+}
+
+unsigned char *
+mbvalidate(unsigned char *pwcs) {
+    if (pset.encoding == PG_UTF8) {
+        return mb_utf_validate(pwcs);
+    }
+    else {
+        /* other encodings needing validation should add their own routines here
+         */
+        return pwcs;
+    }
+}
+#else /* !MULTIBYTE */
+
+/* in single-byte environment, all cells take 1 column */
+int pg_wcswidth(unsigned char *pwcs, int len) {
+    return len;
+}
+#endif
+
+
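The width rules in mbprint.c rest on a binary search over a sorted table of non-overlapping code point intervals. The following is a reduced sketch of that idea, not the patch's code: the names `interval`, `in_table` and `sketch_width` are invented for illustration, the table holds only three of the intervals listed in the diff, and only one wide range (Hangul Syllables) is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct interval {
    uint32_t first;
    uint32_t last;
};

/* Illustrative excerpt only: three zero-width intervals from the diff. */
static const struct interval zero_width[] = {
    { 0x0300, 0x034E }, { 0x1160, 0x11FF }, { 0x200B, 0x200F }
};

/* Binary search over a sorted, non-overlapping table; half-open [lo, hi). */
static int
in_table(uint32_t ucs, const struct interval *table, size_t n)
{
    size_t lo = 0;
    size_t hi = n;

    assert(table != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ucs > table[mid].last)
            lo = mid + 1;
        else if (ucs < table[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Simplified width: 0 for NUL and table hits, -1 for controls, 2 or 1 otherwise. */
int
sketch_width(uint32_t ucs)
{
    if (ucs == 0)
        return 0;
    if (ucs < 0x20 || (ucs >= 0x7f && ucs < 0xa0))
        return -1;
    if (in_table(ucs, zero_width, sizeof zero_width / sizeof zero_width[0]))
        return 0;
    if (ucs >= 0xac00 && ucs <= 0xd7a3)     /* Hangul Syllables */
        return 2;
    return 1;
}
```

The sketch uses a half-open search range with `size_t` indexes instead of the inclusive `int` bounds in `mbbisearch()`; both forms find the same interval, and this one avoids a negative upper bound when the table is empty.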

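`mb_utf_validate()` cleans a string in place: valid sequences are copied towards the front of the buffer and bytes that fail the check are skipped. A minimal sketch of that compaction pattern follows. The names `seq_len` and `strip_malformed` are hypothetical, and the check here is deliberately weaker than the patch's `utf_charcheck()`: it only looks at lead and continuation byte patterns, not at overlong forms, surrogates or noncharacters.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: length announced by the lead byte if all
 * continuation bytes match 10xxxxxx, else 0.
 */
static int
seq_len(const unsigned char *s)
{
    int n;
    int i;

    if (s[0] < 0x80)
        return 1;
    else if ((s[0] & 0xe0) == 0xc0)
        n = 2;
    else if ((s[0] & 0xf0) == 0xe0)
        n = 3;
    else if ((s[0] & 0xf8) == 0xf0)
        n = 4;
    else
        return 0;

    /* a NUL terminator fails this test, so we never read past it */
    for (i = 1; i < n; i++)
        if ((s[i] & 0xc0) != 0x80)
            return 0;
    return n;
}

/* Remove malformed bytes from a NUL-terminated buffer, in place. */
unsigned char *
strip_malformed(unsigned char *s)
{
    unsigned char *src = s;
    unsigned char *dst = s;

    assert(s != NULL);
    while (*src) {
        int n = seq_len(src);

        if (n == 0) {
            src++;              /* skip one offending byte */
            continue;
        }
        while (n-- > 0)
            *dst++ = *src++;    /* keep the whole sequence */
    }
    *dst = '\0';
    return s;
}
```

Because the write pointer never overtakes the read pointer, no second buffer is needed, which matches the in-place approach taken by the patch.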