Commit 2a0af7f

Allow complemented character class escapes within regex brackets.
The complement-class escapes \D, \S, \W are now allowed within bracket expressions. There is no semantic difficulty with doing that, but the rather hokey macro-expansion-based implementation previously used here couldn't cope.

Also, invent "word" as an allowed character class name, thus "\w" is now equivalent to "[[:word:]]" outside brackets, or "[:word:]" within brackets. POSIX allows such implementation-specific extensions, and the same name is used in e.g. bash.

One surprising compatibility issue this raises is that constructs such as "[\w-_]" are now disallowed, as our documentation has always said they should be: character classes can't be endpoints of a range. Previously, because \w was just a macro for "[:alnum:]_", such a construct was read as "[[:alnum:]_-_]", so it was accepted so long as the character after "-" was numerically greater than or equal to "_".

Some implementation cleanup along the way:

* Remove the lexnest() hack, and in consequence clean up wordchrs() to not interact with the lexer.

* Fix colorcomplement() to not be O(N^2) in the number of colors involved.

* Get rid of useless-as-far-as-I-can-see calls of element() on single-character character element names in brackpart(). element() always maps these to the character itself, and things would be quite broken if it didn't --- should "[a]" match something different than "a" does? Besides, the shortcut path in brackpart() wasn't doing this anyway, making it even more inconsistent.

Discussion: https://postgr.es/m/2845172.1613674385@sss.pgh.pa.us
Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
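
The range-endpoint rule from the commit message can be illustrated with a small standalone sketch. It is not the PostgreSQL parser: the function name check_bracket_body, the status codes, and the simplified grammar (plain characters, backslash escapes, and ranges only) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch (not PostgreSQL's REG_* values). */
enum bracket_status { BRACKET_OK = 0, BRACKET_ERANGE = 1, BRACKET_EESCAPE = 2 };

/* Is c one of the class-shorthand escape letters d, s, w, D, S, W? */
static int
is_class_escape(char c)
{
    return c == 'd' || c == 's' || c == 'w' ||
           c == 'D' || c == 'S' || c == 'W';
}

/*
 * Consume one bracket member starting at s[*i]: either a backslash escape
 * or a plain character.  Sets *is_class if it was a class escape.
 * Returns 0 on a dangling backslash, 1 otherwise.
 */
static int
take_member(const char *s, size_t *i, int *is_class)
{
    *is_class = 0;
    if (s[*i] == '\\')
    {
        if (s[*i + 1] == '\0')
            return 0;
        *is_class = is_class_escape(s[*i + 1]);
        *i += 2;
    }
    else
        *i += 1;
    return 1;
}

/*
 * Check the text between '[' and ']' (simplified grammar, assumed here).
 * Class escapes are fine as standalone members, complemented or not,
 * but are rejected as either endpoint of a range.
 */
static enum bracket_status
check_bracket_body(const char *s)
{
    size_t i = 0;

    while (s[i] != '\0')
    {
        int start_is_class, end_is_class;

        if (!take_member(s, &i, &start_is_class))
            return BRACKET_EESCAPE;

        /* a '-' that is not the last character introduces a range */
        if (s[i] == '-' && s[i + 1] != '\0')
        {
            i++;
            if (!take_member(s, &i, &end_is_class))
                return BRACKET_EESCAPE;
            if (start_is_class || end_is_class)
                return BRACKET_ERANGE;
        }
    }
    return BRACKET_OK;
}
```

Under these assumptions, check_bracket_body("\\w-_") reports a range error, while "a-c\\D" and "\\w_-" (trailing hyphen taken literally) are accepted.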
1 parent 6b40d9b commit 2a0af7f

10 files changed: +672 -271 lines changed

doc/src/sgml/func.sgml

Lines changed: 12 additions & 13 deletions
@@ -6097,6 +6097,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 non-ASCII characters to belong to any of these classes.)
 In addition to these standard character
 classes, <productname>PostgreSQL</productname> defines
+the <literal>word</literal> character class, which is the same as
+<literal>alnum</literal> plus the underscore (<literal>_</literal>)
+character, and
 the <literal>ascii</literal> character class, which contains exactly
 the 7-bit ASCII set.
 </para>
@@ -6108,9 +6111,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 matching empty strings at the beginning
 and end of a word respectively. A word is defined as a sequence
 of word characters that is neither preceded nor followed by word
-characters. A word character is an <literal>alnum</literal> character (as
-defined by the <acronym>POSIX</acronym> character class described above)
-or an underscore. This is an extension, compatible with but not
+characters. A word character is any character belonging to the
+<literal>word</literal> character class, that is, any letter, digit,
+or underscore. This is an extension, compatible with but not
 specified by <acronym>POSIX</acronym> 1003.2, and should be used with
 caution in software intended to be portable to other systems.
 The constraint escapes described below are usually preferable; they
@@ -6330,8 +6333,7 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;

 <row>
 <entry> <literal>\w</literal> </entry>
-<entry> <literal>[[:alnum:]_]</literal>
-(note underscore is included) </entry>
+<entry> <literal>[[:word:]]</literal> </entry>
 </row>

 <row>
@@ -6346,21 +6348,18 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;

 <row>
 <entry> <literal>\W</literal> </entry>
-<entry> <literal>[^[:alnum:]_]</literal>
-(note underscore is included) </entry>
+<entry> <literal>[^[:word:]]</literal> </entry>
 </row>
 </tbody>
 </tgroup>
 </table>

 <para>
-Within bracket expressions, <literal>\d</literal>, <literal>\s</literal>,
-and <literal>\w</literal> lose their outer brackets,
-and <literal>\D</literal>, <literal>\S</literal>, and <literal>\W</literal> are illegal.
-(So, for example, <literal>[a-c\d]</literal> is equivalent to
+The class-shorthand escapes also work within bracket expressions,
+although the definitions shown above are not quite syntactically
+valid in that context.
+For example, <literal>[a-c\d]</literal> is equivalent to
 <literal>[a-c[:digit:]]</literal>.
-Also, <literal>[a-c\D]</literal>, which is equivalent to
-<literal>[a-c^[:digit:]]</literal>, is illegal.)
 </para>

 <table id="posix-constraint-escapes-table">
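
The documentation change above defines a word character as any member of the word class (alnum plus underscore) and a word as a run of such characters with no word character on either side. A minimal sketch of that definition follows; it assumes single-byte characters in the C locale, and the names is_word_char and count_words are invented for illustration rather than taken from the PostgreSQL sources.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Sketch of the "word" class: alnum plus underscore (single bytes only). */
static int
is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/*
 * Count words, where a word is a maximal run of word characters,
 * i.e. one that is neither preceded nor followed by a word character.
 */
static size_t
count_words(const char *s)
{
    size_t nwords = 0;
    int    in_word = 0;

    for (; *s != '\0'; s++)
    {
        int w = is_word_char((unsigned char) *s);

        if (w && !in_word)
            nwords++;           /* a word starts here */
        in_word = w;
    }
    return nwords;
}
```

With this definition, "foo_bar baz-qux" contains three words, since the underscore joins foo and bar while the hyphen separates baz from qux.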

src/backend/regex/re_syntax.n

Lines changed: 4 additions & 9 deletions
@@ -519,15 +519,10 @@ character classes:
 (note underscore)
 .RE
 .PP
-Within bracket expressions, `\fB\ed\fR', `\fB\es\fR',
-and `\fB\ew\fR'\&
-lose their outer brackets,
-and `\fB\eD\fR', `\fB\eS\fR',
-and `\fB\eW\fR'\&
-are illegal.
-.VS 8.2
-(So, for example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.
-Also, \fB[a-c\eD]\fR, which is equivalent to \fB[a-c^[:digit:]]\fR, is illegal.)
+The class-shorthand escapes also work within bracket expressions,
+although the definitions shown above are not quite syntactically
+valid in that context.
+For example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.
 .VE 8.2
 .PP
 A constraint escape (AREs only) is a constraint,
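
The commit message notes that there is no semantic difficulty in allowing \D, \S, \W inside brackets: a bracket expression is a union of its members, and a complemented class is just one more set to add. The sketch below models that with a 256-entry membership table. The charset type and its helper functions are hypothetical, limited to single-byte characters, and unrelated to the color-map machinery the regex engine actually uses.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical single-byte character set: member[c] is 1 if c is included. */
struct charset
{
    unsigned char member[256];
};

static void
charset_init(struct charset *cs)
{
    memset(cs->member, 0, sizeof cs->member);
}

/* Add every character in the inclusive range lo..hi. */
static void
charset_add_range(struct charset *cs, unsigned char lo, unsigned char hi)
{
    for (int c = lo; c <= hi; c++)
        cs->member[c] = 1;
}

/* Add a <ctype.h>-style class, or its complement when complement != 0. */
static void
charset_add_class(struct charset *cs, int (*pred) (int), int complement)
{
    for (int c = 0; c < 256; c++)
    {
        if ((pred(c) != 0) != (complement != 0))
            cs->member[c] = 1;
    }
}

static int
charset_contains(const struct charset *cs, unsigned char c)
{
    return cs->member[c];
}

/* Build the set for [a-c\D]: a through c, plus every non-digit. */
static void
build_a_c_or_nondigit(struct charset *cs)
{
    charset_init(cs);
    charset_add_range(cs, 'a', 'c');
    charset_add_class(cs, isdigit, 1);
}
```

In this model [a-c\D] contains 'a', 'x', and ' ' but not '5'; the range a-c is redundant here because those letters are already non-digits.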

src/backend/regex/regc_color.c

Lines changed: 30 additions & 4 deletions
@@ -936,7 +936,16 @@ okcolors(struct nfa *nfa,
 		}
 		else if (cd->nschrs == 0 && cd->nuchrs == 0)
 		{
-			/* parent empty, its arcs change color to subcolor */
+			/*
+			 * Parent is now empty, so just change all its arcs to the
+			 * subcolor, then free the parent.
+			 *
+			 * It is not obvious that simply relabeling the arcs like this is
+			 * OK; it appears to risk creating duplicate arcs. We are
+			 * basically relying on the assumption that processing of a
+			 * bracket expression can't create arcs of both a color and its
+			 * subcolor between the bracket's endpoints.
+			 */
 			cd->sub = NOSUB;
 			scd = &cm->cd[sco];
 			assert(scd->nschrs > 0 || scd->nuchrs > 0);
@@ -1062,17 +1071,34 @@ colorcomplement(struct nfa *nfa,
 	struct colordesc *cd;
 	struct colordesc *end = CDEND(cm);
 	color co;
+	struct arc *a;

 	assert(of != from);

 	/* A RAINBOW arc matches all colors, making the complement empty */
 	if (findarc(of, PLAIN, RAINBOW) != NULL)
 		return;

+	/* Otherwise, transiently mark the colors that appear in of's out-arcs */
+	for (a = of->outs; a != NULL; a = a->outchain)
+	{
+		if (a->type == PLAIN)
+		{
+			assert(a->co >= 0);
+			cd = &cm->cd[a->co];
+			assert(!UNUSEDCOLOR(cd));
+			cd->flags |= COLMARK;
+		}
+	}
+
+	/* Scan colors, clear transient marks, add arcs for unmarked colors */
 	for (cd = cm->cd, co = 0; cd < end && !CISERR(); cd++, co++)
-		if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
-			if (findarc(of, PLAIN, co) == NULL)
-				newarc(nfa, type, co, from, to);
+	{
+		if (cd->flags & COLMARK)
+			cd->flags &= ~COLMARK;
+		else if (!UNUSEDCOLOR(cd) && !(cd->flags & PSEUDO))
+			newarc(nfa, type, co, from, to);
+	}
 }


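
The colorcomplement() change replaces a findarc() search per color with a mark pass over the state's out-arcs followed by one sweep over the color table. A standalone sketch of that mark-and-sweep idea follows; color_slot, MARKED, and complement_colors are stand-ins invented here, not the engine's colordesc, COLMARK, or arc structures.

```c
#include <assert.h>
#include <stddef.h>

#define MARKED 0x1u             /* hypothetical transient flag, standing in for COLMARK */

/* Hypothetical, simplified color table entry. */
struct color_slot
{
    int      in_use;            /* nonzero if this color is allocated */
    unsigned flags;             /* transient mark bits; zero between calls */
};

/*
 * Store in out[] every in-use color that does not occur in present[],
 * and return how many were stored.  out[] must have room for ncolors
 * entries.  Cost is O(ncolors + npresent), not O(ncolors * npresent).
 */
static size_t
complement_colors(struct color_slot *colors, size_t ncolors,
                  const int *present, size_t npresent,
                  int *out)
{
    size_t nout = 0;

    /* Pass 1: transiently mark each color that is already present. */
    for (size_t i = 0; i < npresent; i++)
        colors[present[i]].flags |= MARKED;

    /* Pass 2: clear the marks, emitting every unmarked in-use color. */
    for (size_t co = 0; co < ncolors; co++)
    {
        if (colors[co].flags & MARKED)
            colors[co].flags &= ~MARKED;
        else if (colors[co].in_use)
            out[nout++] = (int) co;
    }
    return nout;
}
```

Clearing each mark during the sweep leaves the table clean for the next call without a third pass, which mirrors the structure of the loop in the diff.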

src/backend/regex/regc_lex.c

Lines changed: 16 additions & 150 deletions
@@ -193,83 +193,6 @@ prefixes(struct vars *v)
 	}
 }

-/*
- * lexnest - "call a subroutine", interpolating string at the lexical level
- *
- * Note, this is not a very general facility. There are a number of
- * implicit assumptions about what sorts of strings can be subroutines.
- */
-static void
-lexnest(struct vars *v,
-		const chr *beginp, /* start of interpolation */
-		const chr *endp) /* one past end of interpolation */
-{
-	assert(v->savenow == NULL); /* only one level of nesting */
-	v->savenow = v->now;
-	v->savestop = v->stop;
-	v->now = beginp;
-	v->stop = endp;
-}
-
-/*
- * string constants to interpolate as expansions of things like \d
- */
-static const chr backd[] = { /* \d */
-	CHR('['), CHR('['), CHR(':'),
-	CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr backD[] = { /* \D */
-	CHR('['), CHR('^'), CHR('['), CHR(':'),
-	CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr brbackd[] = { /* \d within brackets */
-	CHR('['), CHR(':'),
-	CHR('d'), CHR('i'), CHR('g'), CHR('i'), CHR('t'),
-	CHR(':'), CHR(']')
-};
-static const chr backs[] = { /* \s */
-	CHR('['), CHR('['), CHR(':'),
-	CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr backS[] = { /* \S */
-	CHR('['), CHR('^'), CHR('['), CHR(':'),
-	CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
-	CHR(':'), CHR(']'), CHR(']')
-};
-static const chr brbacks[] = { /* \s within brackets */
-	CHR('['), CHR(':'),
-	CHR('s'), CHR('p'), CHR('a'), CHR('c'), CHR('e'),
-	CHR(':'), CHR(']')
-};
-static const chr backw[] = { /* \w */
-	CHR('['), CHR('['), CHR(':'),
-	CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
-	CHR(':'), CHR(']'), CHR('_'), CHR(']')
-};
-static const chr backW[] = { /* \W */
-	CHR('['), CHR('^'), CHR('['), CHR(':'),
-	CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
-	CHR(':'), CHR(']'), CHR('_'), CHR(']')
-};
-static const chr brbackw[] = { /* \w within brackets */
-	CHR('['), CHR(':'),
-	CHR('a'), CHR('l'), CHR('n'), CHR('u'), CHR('m'),
-	CHR(':'), CHR(']'), CHR('_')
-};
-
-/*
- * lexword - interpolate a bracket expression for word characters
- * Possibly ought to inquire whether there is a "word" character class.
- */
-static void
-lexword(struct vars *v)
-{
-	lexnest(v, backw, ENDOF(backw));
-}
-
 /*
  * next - get next token
  */
@@ -292,14 +215,6 @@ next(struct vars *v)
 		RETV(SBEGIN, 0); /* same as \A */
 	}

-	/* if we're nested and we've hit end, return to outer level */
-	if (v->savenow != NULL && ATEOS())
-	{
-		v->now = v->savenow;
-		v->stop = v->savestop;
-		v->savenow = v->savestop = NULL;
-	}
-
 	/* skip white space etc. if appropriate (not in literal or []) */
 	if (v->cflags & REG_EXPANDED)
 		switch (v->lexcon)
@@ -420,32 +335,15 @@ next(struct vars *v)
 				NOTE(REG_UNONPOSIX);
 				if (ATEOS())
 					FAILW(REG_EESCAPE);
-				(DISCARD) lexescape(v);
+				if (!lexescape(v))
+					return 0;
 				switch (v->nexttype)
 				{ /* not all escapes okay here */
 					case PLAIN:
+					case CCLASSS:
+					case CCLASSC:
 						return 1;
 						break;
-					case CCLASS:
-						switch (v->nextvalue)
-						{
-							case 'd':
-								lexnest(v, brbackd, ENDOF(brbackd));
-								break;
-							case 's':
-								lexnest(v, brbacks, ENDOF(brbacks));
-								break;
-							case 'w':
-								lexnest(v, brbackw, ENDOF(brbackw));
-								break;
-							default:
-								FAILW(REG_EESCAPE);
-								break;
-						}
-						/* lexnest done, back up and try again */
-						v->nexttype = v->lasttype;
-						return next(v);
-						break;
 				}
 				/* not one of the acceptable escapes */
 				FAILW(REG_EESCAPE);
@@ -691,49 +589,17 @@ next(struct vars *v)
 		}
 		RETV(PLAIN, *v->now++);
 	}
-	(DISCARD) lexescape(v);
-	if (ISERR())
-		FAILW(REG_EESCAPE);
-	if (v->nexttype == CCLASS)
-	{ /* fudge at lexical level */
-		switch (v->nextvalue)
-		{
-			case 'd':
-				lexnest(v, backd, ENDOF(backd));
-				break;
-			case 'D':
-				lexnest(v, backD, ENDOF(backD));
-				break;
-			case 's':
-				lexnest(v, backs, ENDOF(backs));
-				break;
-			case 'S':
-				lexnest(v, backS, ENDOF(backS));
-				break;
-			case 'w':
-				lexnest(v, backw, ENDOF(backw));
-				break;
-			case 'W':
-				lexnest(v, backW, ENDOF(backW));
-				break;
-			default:
-				assert(NOTREACHED);
-				FAILW(REG_ASSERT);
-				break;
-		}
-		/* lexnest done, back up and try again */
-		v->nexttype = v->lasttype;
-		return next(v);
-	}
-	/* otherwise, lexescape has already done the work */
-	return !ISERR();
+	return lexescape(v);
 }

 /*
  * lexescape - parse an ARE backslash escape (backslash already eaten)
- * Note slightly nonstandard use of the CCLASS type code.
+ *
+ * This is used for ARE backslashes both normally and inside bracket
+ * expressions. In the latter case, not all escape types are allowed,
+ * but the caller must reject unwanted ones after we return.
  */
-static int /* not actually used, but convenient for RETV */
+static int
 lexescape(struct vars *v)
 {
 	chr c;
@@ -775,11 +641,11 @@ lexescape(struct vars *v)
 			break;
 		case CHR('d'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'd');
+			RETV(CCLASSS, CC_DIGIT);
 			break;
 		case CHR('D'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'D');
+			RETV(CCLASSC, CC_DIGIT);
 			break;
 		case CHR('e'):
 			NOTE(REG_UUNPORT);
@@ -802,11 +668,11 @@ lexescape(struct vars *v)
 			break;
 		case CHR('s'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 's');
+			RETV(CCLASSS, CC_SPACE);
 			break;
 		case CHR('S'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'S');
+			RETV(CCLASSC, CC_SPACE);
 			break;
 		case CHR('t'):
 			RETV(PLAIN, CHR('\t'));
@@ -828,11 +694,11 @@ lexescape(struct vars *v)
 			break;
 		case CHR('w'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'w');
+			RETV(CCLASSS, CC_WORD);
 			break;
 		case CHR('W'):
 			NOTE(REG_ULOCALE);
-			RETV(CCLASS, 'W');
+			RETV(CCLASSC, CC_WORD);
 			break;
 		case CHR('x'):
 			NOTE(REG_UUNPORT);
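
The lexer changes above stop splicing replacement text such as "[[:digit:]]" into the input. Instead, each class escape becomes a single token carrying a class code, with separate token types (CCLASSS and CCLASSC in the diff) apparently distinguishing the plain and complemented forms. The sketch below shows that token shape in isolation. All TOK_ and CLS_ names are hypothetical, and unlike the real lexescape() it treats every unrecognized escape as a literal character.

```c
#include <assert.h>

/* Hypothetical token types, loosely modeled on PLAIN, CCLASSS, CCLASSC. */
enum token_type { TOK_PLAIN, TOK_CCLASSS, TOK_CCLASSC, TOK_CONSTRAINT };

/* Hypothetical class codes, standing in for CC_DIGIT, CC_SPACE, CC_WORD. */
enum class_code { CLS_DIGIT = 1, CLS_SPACE, CLS_WORD };

struct token
{
    enum token_type type;
    int             value;      /* class code, or the literal character */
};

/* Classify the character that follows a backslash. */
static struct token
lex_escape(char c)
{
    struct token t;

    switch (c)
    {
        case 'd': t.type = TOK_CCLASSS; t.value = CLS_DIGIT; break;
        case 'D': t.type = TOK_CCLASSC; t.value = CLS_DIGIT; break;
        case 's': t.type = TOK_CCLASSS; t.value = CLS_SPACE; break;
        case 'S': t.type = TOK_CCLASSC; t.value = CLS_SPACE; break;
        case 'w': t.type = TOK_CCLASSS; t.value = CLS_WORD;  break;
        case 'W': t.type = TOK_CCLASSC; t.value = CLS_WORD;  break;
        case 'A':               /* constraint escapes carry no character */
        case 'Z': t.type = TOK_CONSTRAINT; t.value = 0; break;
        default:                /* simplification: anything else is literal */
            t.type = TOK_PLAIN; t.value = (unsigned char) c; break;
    }
    return t;
}

/*
 * Caller-side filter for bracket context: literals and both kinds of
 * class token are accepted, everything else is rejected.
 */
static int
token_ok_in_bracket(struct token t)
{
    return t.type == TOK_PLAIN ||
           t.type == TOK_CCLASSS ||
           t.type == TOK_CCLASSC;
}
```

Keeping the complement bit in the token type lets one lexing routine serve both contexts, leaving only the acceptance check to the bracket-level caller, as the new lexescape() header comment describes.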
