Diffstat (limited to 'gnu/usr.bin/perl/pod/perlreguts.pod')
-rw-r--r-- gnu/usr.bin/perl/pod/perlreguts.pod 128
1 files changed, 85 insertions, 43 deletions
diff --git a/gnu/usr.bin/perl/pod/perlreguts.pod b/gnu/usr.bin/perl/pod/perlreguts.pod
index ec1c243f8a9..bb7f372c664 100644
--- a/gnu/usr.bin/perl/pod/perlreguts.pod
+++ b/gnu/usr.bin/perl/pod/perlreguts.pod
@@ -182,9 +182,9 @@ POSIX char classes called C<regnode_charclass_class> which has an
additional 4-byte (32-bit) bitmap indicating which POSIX char classes
have been included.
- regnode_charclass_class U32 arg1;
- char bitmap[ANYOF_BITMAP_SIZE];
- char classflags[ANYOF_CLASSBITMAP_SIZE];
+ regnode_charclass_class U32 arg1;
+ char bitmap[ANYOF_BITMAP_SIZE];
+ char classflags[ANYOF_CLASSBITMAP_SIZE];
=back
@@ -354,20 +354,23 @@ simpler form.
The call graph looks like this:
- reg() # parse a top level regex, or inside of parens
- regbranch() # parse a single branch of an alternation
- regpiece() # parse a pattern followed by a quantifier
- regatom() # parse a simple pattern
- regclass() # used to handle a class
- reg() # used to handle a parenthesised subpattern
- ....
- ...
- regtail() # finish off the branch
- ...
- regtail() # finish off the branch sequence. Tie each
- # branch's tail to the tail of the sequence
- # (NEW) In Debug mode this is
- # regtail_study().
+ reg() # parse a top level regex, or inside of
+ # parens
+ regbranch() # parse a single branch of an alternation
+ regpiece() # parse a pattern followed by a quantifier
+ regatom() # parse a simple pattern
+ regclass() # used to handle a class
+ reg() # used to handle a parenthesised
+ # subpattern
+ ....
+ ...
+ regtail() # finish off the branch
+ ...
+ regtail() # finish off the branch sequence. Tie each
+ # branch's tail to the tail of the
+ # sequence
+ # (NEW) In Debug mode this is
+ # regtail_study().
A grammar form might be something like this:
@@ -383,6 +386,52 @@ A grammar form might be something like this:
piece : _piece
| _piece quant
+=head3 Parsing complications
+
+The implication of the above description is that a pattern containing nested
+parentheses will result in a call graph which cycles through C<reg()>,
+C<regbranch()>, C<regpiece()>, C<regatom()>, C<reg()>, C<regbranch()> I<etc>
+multiple times, until the deepest level of nesting is reached. All the above
+routines return a pointer to a C<regnode>, which is usually the last regnode
+added to the program. However, one complication is that reg() returns NULL
+for parsing C<(?:)> syntax for embedded modifiers, setting the flag
+C<TRYAGAIN>. The C<TRYAGAIN> propagates upwards until it is captured, in
+some cases by C<regatom()>, but otherwise unconditionally by
+C<regbranch()>. Hence it will never be returned by C<regbranch()> to
+C<reg()>. This flag permits patterns such as C<(?i)+> to be detected as
+errors (I<Quantifier follows nothing in regex; marked by <-- HERE in m/(?i)+
+<-- HERE />).
+
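The C<TRYAGAIN> behaviour described in the paragraph above can be illustrated with a toy parser. This is only a sketch under stated assumptions: every C<toy_*> name is hypothetical, the node type is a stand-in for C<regnode>, and only the flag name C<TRYAGAIN> is taken from the text. It is not the code in regcomp.c. An embedded modifier yields no node, the branch loop absorbs the flag, and a quantifier found where the next atom should start is then reported as an error.

```c
#include <string.h>

/* Hypothetical toy model: only the flag name TRYAGAIN comes from the text. */
#define TRYAGAIN 0x1
#define TOY_MAX_NODES 64

typedef struct { char kind; } toy_node;    /* stand-in for regnode */

static toy_node pool[TOY_MAX_NODES];
static int used;

/* Parse one atom at *pp. Returns NULL with TRYAGAIN set for "(?i)",
 * NULL with *err set for an error, or a node for a literal character. */
static toy_node *toy_atom(const char **pp, int *flagp, const char **err)
{
    const char *p = *pp;
    if (strncmp(p, "(?i)", 4) == 0) {      /* embedded modifier: no node */
        *pp = p + 4;
        *flagp |= TRYAGAIN;
        return NULL;
    }
    if (*p == '+' || *p == '*' || *p == '?') {
        *err = "Quantifier follows nothing";
        return NULL;
    }
    if (used == TOY_MAX_NODES) {
        *err = "pattern too long for the toy pool";
        return NULL;
    }
    pool[used].kind = *p;
    *pp = p + 1;
    return &pool[used++];
}

/* Parse an atom plus an optional quantifier; TRYAGAIN passes upwards. */
static toy_node *toy_piece(const char **pp, int *flagp, const char **err)
{
    toy_node *n = toy_atom(pp, flagp, err);
    if (n == NULL)
        return NULL;                       /* TRYAGAIN or error, unchanged */
    if (**pp == '+' || **pp == '*' || **pp == '?')
        (*pp)++;                           /* quantifier applies to n */
    return n;
}

/* Parse a whole branch; TRYAGAIN is always absorbed here.
 * Returns the number of nodes, or -1 on error. */
static int toy_branch(const char *pat, const char **err)
{
    const char *p = pat;
    int count = 0;
    *err = NULL;
    used = 0;
    while (*p) {
        int flags = 0;
        toy_node *n = toy_piece(&p, &flags, err);
        if (n == NULL) {
            if (flags & TRYAGAIN)
                continue;                  /* nothing emitted; keep parsing */
            return -1;
        }
        count++;
    }
    return count;
}
```

In this sketch, calling C<toy_branch()> on the string (?i)ab+ succeeds with two nodes, while (?i)+ fails because the modifier produced no node for the quantifier to attach to.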
+Another complication is that the representation used for the program differs
+if it needs to store Unicode, but it's not always possible to know for sure
+whether it does until midway through parsing. The Unicode representation for
+the program is larger, and cannot be matched as efficiently. (See L</Unicode
+and Localisation Support> below for more details as to why.) If the pattern
+contains literal Unicode, it's obvious that the program needs to store
+Unicode. Otherwise, the parser optimistically assumes that the more
+efficient representation can be used, and starts sizing on this basis.
+However, if it then encounters something in the pattern which must be stored
+as Unicode, such as an C<\x{...}> escape sequence representing a character
+literal, then this means that all previously calculated sizes need to be
+redone, using values appropriate for the Unicode representation. Currently,
+all regular expression constructions which can trigger this are parsed by code
+in C<regatom()>.
+
+To avoid wasted work when a restart is needed, the sizing pass is abandoned
+- C<regatom()> immediately returns NULL, setting the flag C<RESTART_UTF8>.
+(This action is encapsulated using the macro C<REQUIRE_UTF8>.) This restart
+request is propagated up the call chain in a similar fashion, until it is
+"caught" in C<Perl_re_op_compile()>, which marks the pattern as containing
+Unicode, and restarts the sizing pass. It is also possible for constructions
+within run-time code blocks to turn out to need Unicode representation,
+which is signalled by C<S_compile_runtime_code()> returning false to
+C<Perl_re_op_compile()>.
+
+The restart was previously implemented using a C<longjmp> in C<regatom()>
+back to a C<setjmp> in C<Perl_re_op_compile()>, but this proved to be
+problematic as the latter is a large function containing many automatic
+variables, which interact badly with the emergent control flow of C<setjmp>.
+
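The flag-based restart described above (in place of C<longjmp>) might be sketched as follows. All C<toy_*> names, the per-character sizes, and the 0xFF threshold are assumptions made for illustration; only the flag name C<RESTART_UTF8> comes from the text, and its value here is arbitrary. The point is the control flow: the innermost routine abandons the pass by returning failure with the flag set, intermediate callers pass it up untouched, and the top level flips the mode and sizes again.

```c
#include <stdbool.h>
#include <stddef.h>

#define RESTART_UTF8 0x2    /* name from the text; value is arbitrary here */

typedef struct {
    bool   utf8;            /* representation the sizing pass assumes */
    size_t size;            /* units the program would need */
    int    passes;          /* number of sizing passes started */
} toy_state;

/* Size one code point. Fails with RESTART_UTF8 set when the code point
 * needs the wide representation but the pass assumed the narrow one. */
static bool toy_size_char(toy_state *st, unsigned long cp, int *flagp)
{
    if (cp > 0xFF && !st->utf8) {
        *flagp |= RESTART_UTF8;     /* plays the role of REQUIRE_UTF8 */
        return false;
    }
    st->size += st->utf8 ? 4 : 1;   /* made-up per-character costs */
    return true;
}

/* Intermediate level: propagates a failure upwards without inspecting it. */
static bool toy_size_all(toy_state *st, const unsigned long *cps,
                         size_t n, int *flagp)
{
    for (size_t i = 0; i < n; i++)
        if (!toy_size_char(st, cps[i], flagp))
            return false;
    return true;
}

/* Top level: catches the restart request and repeats the sizing pass. */
static toy_state toy_compile(const unsigned long *cps, size_t n)
{
    toy_state st = { false, 0, 0 };
    for (;;) {
        int flags = 0;
        st.size = 0;                /* discard sizes from the failed pass */
        st.passes++;
        if (toy_size_all(&st, cps, n, &flags))
            return st;
        if (flags & RESTART_UTF8)
            st.utf8 = true;         /* the retry cannot fail this way */
    }
}
```

Because the restart is an ordinary return value, every local variable in the top-level function has a well-defined value when the loop repeats, which is the property the C<setjmp> version made hard to guarantee.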
=head3 Debug Output
In the 5.9.x development version of perl you can C<< use re Debug => 'PARSE' >>
@@ -489,11 +538,11 @@ Now for something much more complex: C</x(?:foo*|b[a][rR])(foo|bar)$/>
atom
>)$< 34 tail~ BRANCH (28)
36 tsdy~ BRANCH (END) (31)
- ~ attach to CLOSE1 (34) offset to 3
+ ~ attach to CLOSE1 (34) offset to 3
tsdy~ EXACT <foo> (EXACT) (29)
- ~ attach to CLOSE1 (34) offset to 5
+ ~ attach to CLOSE1 (34) offset to 5
tsdy~ EXACT <bar> (EXACT) (32)
- ~ attach to CLOSE1 (34) offset to 2
+ ~ attach to CLOSE1 (34) offset to 2
>$< tail~ BRANCH (3)
~ BRANCH (9)
~ TAIL (25)
@@ -765,7 +814,7 @@ implement things such as the stringification of C<qr//>.
The other structure is pointed to by the C<regexp> struct's
C<pprivate> and is in addition to C<intflags> in the same struct
considered to be the property of the regex engine which compiled the
-regular expression;
+regular expression;
The regexp structure contains all the data that perl needs to be aware of
to properly work with the regular expression. It includes data about
@@ -792,31 +841,24 @@ The following structure is used as the C<pprivate> struct by perl's
regex engine. Since it is specific to perl it is only of curiosity
value to other engine implementations.
- typedef struct regexp_internal {
- regexp_paren_ofs *swap; /* Swap copy of *startp / *endp */
- U32 *offsets; /* offset annotations 20001228 MJD
- data about mapping the program to the
- string*/
- regnode *regstclass; /* Optional startclass as identified or constructed
- by the optimiser */
- struct reg_data *data; /* Additional miscellaneous data used by the program.
- Used to make it easier to clone and free arbitrary
- data that the regops need. Often the ARG field of
- a regop is an index into this structure */
- regnode program[1]; /* Unwarranted chumminess with compiler. */
- } regexp_internal;
+ typedef struct regexp_internal {
+ U32 *offsets; /* offset annotations 20001228 MJD
+ * data about mapping the program to
+ * the string*/
+ regnode *regstclass; /* Optional startclass as identified or
+ * constructed by the optimiser */
+ struct reg_data *data; /* Additional miscellaneous data used
+ * by the program. Used to make it
+ * easier to clone and free arbitrary
+ * data that the regops need. Often the
+ * ARG field of a regop is an index
+ * into this structure */
+ regnode program[1]; /* Unwarranted chumminess with
+ * compiler. */
+ } regexp_internal;
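The trailing C<program[1]> member, with its "unwarranted chumminess" comment, is the older spelling of a variable-length trailing array: the struct is over-allocated so that the compiled program lives in the same block. A minimal sketch of that allocation pattern follows, with assumptions stated plainly: the C<toy_*> names are hypothetical, the node type is a stand-in for C<regnode>, and the sketch uses the C99 flexible array member form rather than the C<[1]> declaration shown in the struct.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int toy_regnode;   /* stand-in; the real regnode is a struct */

typedef struct {
    size_t      nodes;              /* hypothetical: how many nodes follow */
    toy_regnode program[];          /* C99 flexible array member */
} toy_internal;

/* Allocate the header and n trailing nodes as a single block, then copy
 * the nodes in. Returns NULL if the allocation fails. */
static toy_internal *toy_internal_new(const toy_regnode *src, size_t n)
{
    toy_internal *ri = malloc(sizeof *ri + n * sizeof ri->program[0]);
    if (ri == NULL)
        return NULL;
    ri->nodes = n;
    if (n > 0)
        memcpy(ri->program, src, n * sizeof ri->program[0]);
    return ri;
}
```

A single C<free()> on the returned pointer releases both the header and the program.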
=over 5
-=item C<swap>
-
-C<swap> formerly was an extra set of startp/endp stored in a
-C<regexp_paren_ofs> struct. This was used when the last successful match
-was from the same pattern as the current pattern, so that a partial
-match didn't overwrite the previous match's results, but it caused a
-problem with re-entrant code such as trying to build the UTF-8 swashes.
-Currently unused and left for backward compatibility with 5.10.0.
-
=item C<offsets>
Offsets holds a mapping of offset in the C<program>