   1 @node Overview
   2 @chapter Overview
   3
   4 A @dfn{regular expression} (or @dfn{regexp}, or @dfn{pattern}) is a text
   5 string that describes some (mathematical) set of strings.  A regexp
   6 @var{r} @dfn{matches} a string @var{s} if @var{s} is in the set of
   7 strings described by @var{r}.
   8
   9 Using the Regex library, you can:
  10
  11 @itemize @bullet
  12
  13 @item
  14 see if a string matches a specified pattern as a whole, and
  15
  16 @item
  17 search within a string for a substring matching a specified pattern.
  18
  19 @end itemize
  20
  21 Some regular expressions match only one string, i.e., the set they
  22 describe has only one member.  For example, the regular expression
  23 @samp{foo} matches the string @samp{foo} and no others.  Other regular
  24 expressions match more than one string, i.e., the set they describe has
  25 more than one member.  For example, the regular expression @samp{f*}
  26 matches the set of strings made up of any number (including zero) of
  27 @samp{f}s.  As you can see, some characters in regular expressions match
  28 themselves (such as @samp{f}) and some don't (such as @samp{*}); the
  29 ones that don't match themselves instead let you specify patterns that
  30 describe many different strings.
  31
  32 To either match or search for a regular expression with the Regex
  33 library functions, you must first compile it with a Regex pattern
  34 compiling function.  A @dfn{compiled pattern} is a regular expression
  35 converted to the internal format used by the library functions.  Once
  36 you've compiled a pattern, you can use it for matching or searching any
  37 number of times.
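
As a rough sketch of this compile-once, use-many-times cycle, the
following C fragment assumes the @sc{gnu} functions as declared by the
GNU C Library's @file{regex.h} when @code{_GNU_SOURCE} is defined.  The
helper names @code{compile_pattern}, @code{matches_whole}, and
@code{find_first} are ours, chosen for illustration; they are not part
of Regex, and the choice of @code{RE_SYNTAX_POSIX_EXTENDED} is arbitrary.

```c
#define _GNU_SOURCE             /* Expose the GNU functions in regex.h.  */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile PATTERN into BUF.  Returns 0 on success, -1 on error.
   (Hypothetical helper; not part of Regex.)  */
int
compile_pattern (struct re_pattern_buffer *buf, const char *pattern)
{
  assert (buf != NULL && pattern != NULL);
  memset (buf, 0, sizeof *buf);         /* Let Regex allocate the buffer.  */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  /* re_compile_pattern returns NULL on success, else an error message.  */
  return re_compile_pattern (pattern, strlen (pattern), buf) == NULL ? 0 : -1;
}

/* Return 1 if the compiled pattern matches all of STRING, else 0.  */
int
matches_whole (struct re_pattern_buffer *buf, const char *string)
{
  regoff_t len = (regoff_t) strlen (string);

  /* re_match returns the number of characters matched from index 0.  */
  return re_match (buf, string, len, 0, NULL) == len;
}

/* Return the index of the first substring of STRING that the compiled
   pattern matches, or -1 if there is none.  */
int
find_first (struct re_pattern_buffer *buf, const char *string)
{
  regoff_t len = (regoff_t) strlen (string);
  regoff_t pos = re_search (buf, string, len, 0, len, NULL);

  return pos >= 0 ? (int) pos : -1;
}
```

With these helpers, a buffer compiled from @samp{f*} matches @samp{fff}
as a whole but not @samp{foo}, and a buffer compiled from @samp{foo} is
found at index 2 of @samp{a foo b}.  The same buffer can be passed to
either helper any number of times before it is released with
@code{regfree}.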
  38
  39 To use the Regex library, include @file{regex.h} in your program.
  40 @pindex regex.h
  41 Regex provides three groups of functions with which you can operate on
  42 regular expressions.  One group---the @sc{gnu} group---is more powerful
  43 but not completely compatible with the other two, namely the @sc{posix}
  44 and Berkeley @sc{unix} groups; its interface was designed specifically
  45 for @sc{gnu}.  The other groups have the same interfaces as do the
  46 regular expression functions in @sc{posix} and Berkeley
  47 @sc{unix}.
  48
  49 We wrote this chapter with programmers in mind, not users of
  50 programs---such as Emacs---that use Regex.  We describe the Regex
  51 library in its entirety, not how to write regular expressions that a
  52 particular program understands.
  53
  54
  55 @node Regular Expression Syntax
  56 @chapter Regular Expression Syntax
  57
  58 @cindex regular expressions, syntax of
  59 @cindex syntax of regular expressions
  60
  61 @dfn{Characters} are things you can type.  @dfn{Operators} are things in
  62 a regular expression that match one or more characters.  You compose
  63 regular expressions from operators, which in turn you specify using one
  64 or more characters.
  65
  66 Most characters represent what we call the match-self operator, i.e.,
  67 they match themselves; we call these characters @dfn{ordinary}.  Other
  68 characters represent either all or parts of fancier operators; e.g.,
  69 @samp{.} represents what we call the match-any-character operator
  70 (which, no surprise, matches (almost) any character); we call these
  71 characters @dfn{special}.  Two different things determine what
  72 characters represent what operators:
  73
  74 @enumerate
  75 @item
  76 the regular expression syntax your program has told the Regex library to
  77 recognize, and
  78
  79 @item
  80 the context of the character in the regular expression.
  81 @end enumerate
  82
  83 In the following sections, we describe these things in more detail.
  84
  85 @menu
  86 * Syntax Bits::
  87 * Predefined Syntaxes::
  88 * Collating Elements vs. Characters::
  89 * The Backslash Character::
  90 @end menu
  91
  92
  93 @node Syntax Bits
  94 @section Syntax Bits
  95
  96 @cindex syntax bits
  97
  98 In any particular syntax for regular expressions, some characters are
  99 always special, others are sometimes special, and others are never
 100 special.  The particular syntax that Regex recognizes for a given
 101 regular expression depends on the value in the @code{syntax} field of
 102 the pattern buffer of that regular expression.
 103
 104 You get a pattern buffer by compiling a regular expression.  @xref{GNU
 105 Pattern Buffers}, and @ref{POSIX Pattern Buffers}, for more information
 106 on pattern buffers.  @xref{GNU Regular Expression Compiling}, @ref{POSIX
 107 Regular Expression Compiling}, and @ref{BSD Regular Expression
 108 Compiling}, for more information on compiling.
 109
 110 Regex considers the value of the @code{syntax} field to be a collection
 111 of bits; we refer to these bits as @dfn{syntax bits}.  In most cases,
 112 they affect what characters represent what operators.  We describe the
 113 meanings of the operators to which we refer in @ref{Common Operators},
 114 @ref{GNU Operators}, and @ref{GNU Emacs Operators}.
 115
 116 For reference, here is the complete list of syntax bits, in alphabetical
 117 order:
 118
 119 @table @code
 120
 121 @cnindex RE_BACKSLASH_ESCAPE_IN_LISTS
 122 @item RE_BACKSLASH_ESCAPE_IN_LISTS
 123 If this bit is set, then @samp{\} inside a list (@pxref{List Operators})
 124 quotes (makes ordinary, if it's special) the following character; if
 125 this bit isn't set, then @samp{\} is an ordinary character inside lists.
 126 (@xref{The Backslash Character}, for what @samp{\} does outside of lists.)
 127
 128 @cnindex RE_BK_PLUS_QM
 129 @item RE_BK_PLUS_QM
 130 If this bit is set, then @samp{\+} represents the match-one-or-more
 131 operator and @samp{\?} represents the match-zero-or-one operator; if
 132 this bit isn't set, then @samp{+} represents the match-one-or-more
 133 operator and @samp{?} represents the match-zero-or-one operator.  This
 134 bit is irrelevant if @code{RE_LIMITED_OPS} is set.
 135
 136 @cnindex RE_CHAR_CLASSES
 137 @item RE_CHAR_CLASSES
 138 If this bit is set, then you can use character classes in lists; if this
 139 bit isn't set, then you can't.
 140
 141 @cnindex RE_CONTEXT_INDEP_ANCHORS
 142 @item RE_CONTEXT_INDEP_ANCHORS
 143 If this bit is set, then @samp{^} and @samp{$} are special anywhere outside
 144 a list; if this bit isn't set, then these characters are special only in
 145 certain contexts.  @xref{Match-beginning-of-line Operator}, and
 146 @ref{Match-end-of-line Operator}.
 147
 148 @cnindex RE_CONTEXT_INDEP_OPS
 149 @item RE_CONTEXT_INDEP_OPS
 150 If this bit is set, then certain characters are special anywhere outside
 151 a list; if this bit isn't set, then those characters are special only in
 152 some contexts and are ordinary elsewhere.  Specifically, if this bit
 153 isn't set then @samp{*}, and (if the syntax bit @code{RE_LIMITED_OPS}
 154 isn't set) @samp{+} and @samp{?} (or @samp{\+} and @samp{\?}, depending
 155 on the syntax bit @code{RE_BK_PLUS_QM}) represent repetition operators
 156 only if they're not first in a regular expression or just after an
 157 open-group or alternation operator.  The same holds for @samp{@{} (or
 158 @samp{\@{}, depending on the syntax bit @code{RE_NO_BK_BRACES}) if
 159 it is the beginning of a valid interval and the syntax bit
 160 @code{RE_INTERVALS} is set.
 161
 162 @cnindex RE_CONTEXT_INVALID_OPS
 163 @item RE_CONTEXT_INVALID_OPS
 164 If this bit is set, then repetition and alternation operators can't be
 165 in certain positions within a regular expression.  Specifically, the
 166 regular expression is invalid if it has:
 167
 168 @itemize @bullet
 169
 170 @item
 171 a repetition operator first in the regular expression or just after a
 172 match-beginning-of-line, open-group, or alternation operator; or
 173
 174 @item
 175 an alternation operator first or last in the regular expression, just
 176 before a match-end-of-line operator, or just after an alternation or
 177 open-group operator.
 178
 179 @end itemize
 180
 181 If this bit isn't set, then you can put the characters representing the
 182 repetition and alternation operators anywhere in a regular expression.
 183 Whether or not they will in fact be operators in certain positions
 184 depends on other syntax bits.
 185
 186 @cnindex RE_DOT_NEWLINE
 187 @item RE_DOT_NEWLINE
 188 If this bit is set, then the match-any-character operator matches
 189 a newline; if this bit isn't set, then it doesn't.
 190
 191 @cnindex RE_DOT_NOT_NULL
 192 @item RE_DOT_NOT_NULL
 193 If this bit is set, then the match-any-character operator doesn't match
 194 a null character; if this bit isn't set, then it does.
 195
 196 @cnindex RE_INTERVALS
 197 @item RE_INTERVALS
 198 If this bit is set, then Regex recognizes interval operators; if this bit
 199 isn't set, then it doesn't.
 200
 201 @cnindex RE_LIMITED_OPS
 202 @item RE_LIMITED_OPS
 203 If this bit is set, then Regex doesn't recognize the match-one-or-more,
 204 match-zero-or-one or alternation operators; if this bit isn't set, then
 205 it does.
 206
 207 @cnindex RE_NEWLINE_ALT
 208 @item RE_NEWLINE_ALT
 209 If this bit is set, then newline represents the alternation operator; if
 210 this bit isn't set, then newline is ordinary.
 211
 212 @cnindex RE_NO_BK_BRACES
 213 @item RE_NO_BK_BRACES
 214 If this bit is set, then @samp{@{} represents the open-interval operator
 215 and @samp{@}} represents the close-interval operator; if this bit isn't
 216 set, then @samp{\@{} represents the open-interval operator and
 217 @samp{\@}} represents the close-interval operator.  This bit is relevant
 218 only if @code{RE_INTERVALS} is set.
 219
 220 @cnindex RE_NO_BK_PARENS
 221 @item RE_NO_BK_PARENS
 222 If this bit is set, then @samp{(} represents the open-group operator and
 223 @samp{)} represents the close-group operator; if this bit isn't set, then
 224 @samp{\(} represents the open-group operator and @samp{\)} represents
 225 the close-group operator.
 226
 227 @cnindex RE_NO_BK_REFS
 228 @item RE_NO_BK_REFS
 229 If this bit is set, then Regex doesn't recognize @samp{\}@var{digit} as
 230 the back reference operator; if this bit isn't set, then it does.
 231
 232 @cnindex RE_NO_BK_VBAR
 233 @item RE_NO_BK_VBAR
 234 If this bit is set, then @samp{|} represents the alternation operator;
 235 if this bit isn't set, then @samp{\|} represents the alternation
 236 operator.  This bit is irrelevant if @code{RE_LIMITED_OPS} is set.
 237
 238 @cnindex RE_NO_EMPTY_RANGES
 239 @item RE_NO_EMPTY_RANGES
 240 If this bit is set, then a regular expression with a range whose ending
 241 point collates lower than its starting point is invalid; if this bit
 242 isn't set, then Regex considers such a range to be empty.
 243
 244 @cnindex RE_UNMATCHED_RIGHT_PAREN_ORD
 245 @item RE_UNMATCHED_RIGHT_PAREN_ORD
 246 If this bit is set and the regular expression has no matching open-group
 247 operator, then Regex considers what would otherwise be a close-group
 248 operator (based on how @code{RE_NO_BK_PARENS} is set) to match @samp{)}.
 249
 250 @end table
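
To see how individual syntax bits change what a character represents,
here is a small sketch.  It assumes the @sc{gnu} functions as declared
by the GNU C Library's @file{regex.h} when @code{_GNU_SOURCE} is
defined; other Regex versions may behave differently.  The helper
@code{match_len} and its return value of @minus{}3 for an invalid
pattern are our own conventions, not part of Regex.

```c
#define _GNU_SOURCE             /* Expose the GNU functions in regex.h.  */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile PATTERN under SYNTAX and match it at the start of STRING.
   Returns the length of the match, -1 for no match, -2 for an internal
   error, or -3 (our convention) if PATTERN is invalid under SYNTAX.  */
int
match_len (reg_syntax_t syntax, const char *pattern, const char *string)
{
  struct re_pattern_buffer buf;
  const char *error;
  regoff_t n = -3;

  assert (pattern != NULL && string != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (syntax);       /* Applies to subsequent compilations.  */
  error = re_compile_pattern (pattern, strlen (pattern), &buf);
  if (error == NULL)
    n = re_match (&buf, string, (regoff_t) strlen (string), 0, NULL);
  regfree (&buf);
  return (int) n;
}
```

For instance, with no bits set, @code{match_len (0, "(a)", "(a)")}
yields 3 because the parentheses are ordinary, whereas
@code{match_len (RE_NO_BK_PARENS, "(a)", "a")} yields 1 because they now
group.  Likewise, @code{match_len (RE_BK_PLUS_QM, "a\\+", "aaa")} yields
3, while under the same bit the pattern @samp{a+} only matches the
two-character string @samp{a+}.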
 251
 252
 253 @node Predefined Syntaxes
 254 @section Predefined Syntaxes
 255
 256 If you're programming with Regex, you can set a pattern buffer's
 257 (@pxref{GNU Pattern Buffers}, and @ref{POSIX Pattern Buffers})
 258 @code{syntax} field either to an arbitrary combination of syntax bits
 259 (@pxref{Syntax Bits}) or else to one of the configurations defined by Regex.
 260 These configurations define the syntaxes used by certain
 261 programs---@sc{gnu} Emacs,
 262 @cindex Emacs
 263 @sc{posix} Awk,
 264 @cindex POSIX Awk
 265 traditional Awk,
 266 @cindex Awk
 267 Grep,
 268 @cindex Grep
 269 @cindex Egrep
 270 Egrep---in addition to syntaxes for @sc{posix} basic and extended
 271 regular expressions.
 272
 273 The predefined syntaxes---taken directly from @file{regex.h}---are:
 274
 275 @smallexample
 276 #define RE_SYNTAX_EMACS 0
 277
 278 #define RE_SYNTAX_AWK                                                   \
 279   (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL                       \
 280    | RE_NO_BK_PARENS            | RE_NO_BK_REFS                         \
 281    | RE_NO_BK_VBAR               | RE_NO_EMPTY_RANGES                   \
 282    | RE_UNMATCHED_RIGHT_PAREN_ORD)
 283
 284 #define RE_SYNTAX_POSIX_AWK                                             \
 285   (RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS)
 286
 287 #define RE_SYNTAX_GREP                                                  \
 288   (RE_BK_PLUS_QM              | RE_CHAR_CLASSES                         \
 289    | RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS                            \
 290    | RE_NEWLINE_ALT)
 291
 292 #define RE_SYNTAX_EGREP                                                 \
 293   (RE_CHAR_CLASSES        | RE_CONTEXT_INDEP_ANCHORS                    \
 294    | RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE                    \
 295    | RE_NEWLINE_ALT       | RE_NO_BK_PARENS                             \
 296    | RE_NO_BK_VBAR)
 297
 298 #define RE_SYNTAX_POSIX_EGREP                                           \
 299   (RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES)
 300
 301 /* P1003.2/D11.2, section 4.20.7.1, lines 5078ff.  */
 302 #define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC
 303
 304 #define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC
 305
 306 /* Syntax bits common to both basic and extended POSIX regex syntax.  */
 307 #define _RE_SYNTAX_POSIX_COMMON                                         \
 308   (RE_CHAR_CLASSES | RE_DOT_NEWLINE      | RE_DOT_NOT_NULL              \
 309    | RE_INTERVALS  | RE_NO_EMPTY_RANGES)
 310
 311 #define RE_SYNTAX_POSIX_BASIC                                           \
 312   (_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM)
 313
 314 /* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes
 315    RE_LIMITED_OPS, i.e., \? \+ \| are not recognized.  Actually, this
 316    isn't minimal, since other operators, such as \`, aren't disabled.  */
 317 #define RE_SYNTAX_POSIX_MINIMAL_BASIC                                   \
 318   (_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS)
 319
 320 #define RE_SYNTAX_POSIX_EXTENDED                                        \
 321   (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS                   \
 322    | RE_CONTEXT_INDEP_OPS  | RE_NO_BK_BRACES                            \
 323    | RE_NO_BK_PARENS       | RE_NO_BK_VBAR                              \
 324    | RE_UNMATCHED_RIGHT_PAREN_ORD)
 325
 326 /* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS
 327    replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added.  */
 328 #define RE_SYNTAX_POSIX_MINIMAL_EXTENDED                                \
 329   (_RE_SYNTAX_POSIX_COMMON  | RE_CONTEXT_INDEP_ANCHORS                  \
 330    | RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES                           \
 331    | RE_NO_BK_PARENS        | RE_NO_BK_REFS                             \
 332    | RE_NO_BK_VBAR          | RE_UNMATCHED_RIGHT_PAREN_ORD)
 333 @end smallexample
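
The following sketch spells one and the same set of strings for three
of the predefined syntaxes.  It assumes the GNU C Library's
@file{regex.h} with @code{_GNU_SOURCE} defined, whose definitions of
these macros may differ in detail from the listing above; the names
@code{match_len}, @code{spellings}, and @code{count_full_matches} are
hypothetical.

```c
#define _GNU_SOURCE             /* Expose the GNU functions in regex.h.  */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Length of the match of PATTERN (compiled under SYNTAX) at the start
   of STRING; -1 for no match, -3 (our convention) if invalid.  */
int
match_len (reg_syntax_t syntax, const char *pattern, const char *string)
{
  struct re_pattern_buffer buf;
  regoff_t n = -3;

  assert (pattern != NULL && string != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (syntax);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    n = re_match (&buf, string, (regoff_t) strlen (string), 0, NULL);
  regfree (&buf);
  return (int) n;
}

struct spelling
{
  reg_syntax_t syntax;
  const char *pattern;
};

/* "One or more `ab', or a single `c'", written for three syntaxes.  */
static const struct spelling spellings[] = {
  { RE_SYNTAX_POSIX_EXTENDED, "(ab)+|c" },
  { RE_SYNTAX_POSIX_BASIC,    "\\(ab\\)\\+\\|c" },
  { RE_SYNTAX_EMACS,          "\\(ab\\)+\\|c" },
};

/* Count how many of the spellings match all of STRING.  */
int
count_full_matches (const char *string)
{
  size_t i;
  int count = 0;

  assert (string != NULL);
  for (i = 0; i < sizeof spellings / sizeof spellings[0]; i++)
    if (match_len (spellings[i].syntax, spellings[i].pattern, string)
        == (int) strlen (string))
      count++;
  return count;
}
```

All three spellings accept @samp{abab} and @samp{c}, and none accepts
@samp{abc}.  Which characters need a backslash follows from the bits
each predefined syntax sets: @code{RE_NO_BK_PARENS} and
@code{RE_NO_BK_VBAR} for the extended syntax, @code{RE_BK_PLUS_QM} for
the basic one, and no bits at all for Emacs.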
 334
 335 @node Collating Elements vs. Characters
 336 @section Collating Elements vs.@: Characters
 337
 338 @sc{posix} generalizes the notion of a character to that of a
 339 collating element.  It defines a @dfn{collating element} to be ``a
 340 sequence of one or more bytes defined in the current collating sequence
 341 as a unit of collation.''
 342
 343 This generalizes the notion of a character in
 344 two ways.  First, a single character can map into two or more collating
 345 elements.  For example, the German
 346 @tex
 347 `\ss'
 348 @end tex
 349 @ifinfo
 350 ``es-zet''
 351 @end ifinfo
 352 collates as the collating element @samp{s} followed by another collating
 353 element @samp{s}.  Second, two or more characters can map into one
 354 collating element.  For example, the Spanish @samp{ll} collates after
 355 @samp{l} and before @samp{m}.
 356
 357 Since @sc{posix}'s ``collating element'' preserves the essential idea of
 358 a ``character,'' we use the latter, more familiar, term in this document.
 359
 360 @node The Backslash Character
 361 @section The Backslash Character
 362
 363 @cindex \
 364 The @samp{\} character has one of four different meanings, depending on
 365 the context in which you use it and what syntax bits are set
 366 (@pxref{Syntax Bits}).  It can: 1) stand for itself, 2) quote the next
 367 character, 3) introduce an operator, or 4) do nothing.
 368
 369 @enumerate
 370 @item
 371 It stands for itself inside a list
 372 (@pxref{List Operators}) if the syntax bit
 373 @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is not set.  For example, @samp{[\]}
 374 would match @samp{\}.
 375
 376 @item
 377 It quotes (makes ordinary, if it's special) the next character when you
 378 use it either:
 379
 380 @itemize @bullet
 381 @item
 382 outside a list,@footnote{Sometimes
 383 you don't have to explicitly quote special characters to make
 384 them ordinary.  For instance, most characters lose any special meaning
 385 inside a list (@pxref{List Operators}).  In addition, if the syntax bits
 386 @code{RE_CONTEXT_INVALID_OPS} and @code{RE_CONTEXT_INDEP_OPS}
 387 aren't set, then (for historical reasons) the matcher considers special
 388 characters ordinary if they are in contexts where the operations they
 389 represent make no sense; for example, the match-zero-or-more
 390 operator (represented by @samp{*}) matches itself in the regular
 391 expression @samp{*foo} because there is no preceding expression on which
 392 it can operate.  It is poor practice, however, to depend on this
 393 behavior; if you want a special character to be ordinary outside a list,
 394 it's better to always quote it, regardless.} or
 395
 396 @item
 397 inside a list and the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is set.
 398
 399 @end itemize
 400
 401 @item
 402 It introduces an operator when followed by certain ordinary
 403 characters---sometimes only when certain syntax bits are set.  See the
 404 cases @code{RE_BK_PLUS_QM}, @code{RE_NO_BK_BRACES}, @code{RE_NO_BK_VBAR},
 405 @code{RE_NO_BK_PARENS}, @code{RE_NO_BK_REFS} in @ref{Syntax Bits}.  Also:
 406
 407 @itemize @bullet
 408 @item
 409 @samp{\b} represents the match-word-boundary operator
 410 (@pxref{Match-word-boundary Operator}).
 411
 412 @item
 413 @samp{\B} represents the match-within-word operator
 414 (@pxref{Match-within-word Operator}).
 415
 416 @item
 417 @samp{\<} represents the match-beginning-of-word operator @*
 418 (@pxref{Match-beginning-of-word Operator}).
 419
 420 @item
 421 @samp{\>} represents the match-end-of-word operator
 422 (@pxref{Match-end-of-word Operator}).
 423
 424 @item
 425 @samp{\w} represents the match-word-constituent operator
 426 (@pxref{Match-word-constituent Operator}).
 427
 428 @item
 429 @samp{\W} represents the match-non-word-constituent operator
 430 (@pxref{Match-non-word-constituent Operator}).
 431
 432 @item
 433 @samp{\`} represents the match-beginning-of-buffer
 434 operator and @samp{\'} represents the match-end-of-buffer operator
 435 (@pxref{Buffer Operators}).
 436
 437 @item
 438 If Regex was compiled with the C preprocessor symbol @code{emacs}
 439 defined, then @samp{\s@var{class}} represents the match-syntactic-class
 440 operator and @samp{\S@var{class}} represents the
 441 match-not-syntactic-class operator (@pxref{Syntactic Class Operators}).
 442
 443 @end itemize
 444
 445 @item
 446 In all other cases, Regex ignores @samp{\}.  For example,
 447 @samp{\n} matches @samp{n}.
 448
 449 @end enumerate
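
The four roles listed above can be checked with a short sketch.  It
assumes the GNU C Library's @file{regex.h} with @code{_GNU_SOURCE}
defined; the helper @code{match_len} and the function
@code{backslash_roles_hold} are hypothetical names of ours.  Note that
each backslash in a pattern is doubled inside a C string literal.

```c
#define _GNU_SOURCE             /* Expose the GNU functions in regex.h.  */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Length of the match of PATTERN (compiled under SYNTAX) at the start
   of STRING; -1 for no match, -3 (our convention) if invalid.  */
int
match_len (reg_syntax_t syntax, const char *pattern, const char *string)
{
  struct re_pattern_buffer buf;
  regoff_t n = -3;

  assert (pattern != NULL && string != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (syntax);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    n = re_match (&buf, string, (regoff_t) strlen (string), 0, NULL);
  regfree (&buf);
  return (int) n;
}

/* Return 1 if `\' plays each of its four roles as described.  */
int
backslash_roles_hold (void)
{
  reg_syntax_t ext = RE_SYNTAX_POSIX_EXTENDED;

  return
    /* 1. Stands for itself in a list when
          RE_BACKSLASH_ESCAPE_IN_LISTS is not set.  */
    match_len (ext, "[\\]", "\\") == 1
    /* 2. Quotes the next character outside a list ...  */
    && match_len (ext, "a\\*", "a*") == 2
    /* ... and inside one when RE_BACKSLASH_ESCAPE_IN_LISTS is set.  */
    && match_len (ext | RE_BACKSLASH_ESCAPE_IN_LISTS, "[\\]]", "]") == 1
    /* 3. Introduces an operator: \w is match-word-constituent.  */
    && match_len (ext, "\\w+", "ab c") == 2
    /* 4. Is ignored otherwise: backslash followed by n matches "n".  */
    && match_len (ext, "\\n", "n") == 1;
}
```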
 450
 451 @node Common Operators
 452 @chapter Common Operators
 453
 454 You compose regular expressions from operators.  In the following
 455 sections, we describe the regular expression operators specified by
 456 @sc{posix}; @sc{gnu} also uses these.  Most operators have more than one
 457 representation as characters.  @xref{Regular Expression Syntax}, for
 458 what characters represent what operators under what circumstances.
 459
 460 For most operators that can be represented in two ways, one
 461 representation is a single character and the other is that character
 462 preceded by @samp{\}.  For example, either @samp{(} or @samp{\(}
 463 represents the open-group operator.  Which one does depends on the
 464 setting of a syntax bit, in this case @code{RE_NO_BK_PARENS}.  Why is
 465 this so?  Historical reasons dictate some of the varying
 466 representations, while @sc{posix} dictates others.
 467
 468 Finally, almost all characters lose any special meaning inside a list
 469 (@pxref{List Operators}).
 470
 471 @menu
 472 * Match-self Operator::                 Ordinary characters.
 473 * Match-any-character Operator::        .
 474 * Concatenation Operator::              Juxtaposition.
 475 * Repetition Operators::                *  +  ? @{@}
 476 * Alternation Operator::                |
 477 * List Operators::                      [...]  [^...]
 478 * Grouping Operators::                  (...)
 479 * Back-reference Operator::             \digit
 480 * Anchoring Operators::                 ^  $
 481 @end menu
 482
 483 @node Match-self Operator
 484 @section The Match-self Operator (@var{ordinary character})
 485
 486 This operator matches the character itself.  All ordinary characters
 487 (@pxref{Regular Expression Syntax}) represent this operator.  For
 488 example, @samp{f} is always an ordinary character, so the regular
 489 expression @samp{f} matches only the string @samp{f}.  In
 490 particular, it does @emph{not} match the string @samp{ff}.
 491
 492 @node Match-any-character Operator
 493 @section The Match-any-character Operator (@code{.})
 494
 495 @cindex @samp{.}
 496
 497 This operator matches any single printing or nonprinting character,
 498 except that it won't match a:
 499
 500 @table @asis
 501 @item newline
 502 if the syntax bit @code{RE_DOT_NEWLINE} isn't set.
 503
 504 @item null
 505 if the syntax bit @code{RE_DOT_NOT_NULL} is set.
 506
 507 @end table
 508
 509 The @samp{.} (period) character represents this operator.  For example,
 510 @samp{a.b} matches any three-character string beginning with @samp{a}
 511 and ending with @samp{b}.
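
A brief sketch of how the two syntax bits affect this operator follows.
It assumes the GNU C Library's @file{regex.h} with @code{_GNU_SOURCE}
defined.  The helper @code{match_len_n} is a hypothetical name; it takes
an explicit length so that the subject string may contain a null
character.

```c
#define _GNU_SOURCE             /* Expose the GNU functions in regex.h.  */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile PATTERN under SYNTAX and match it at the start of the LENGTH
   bytes at STRING.  Returns the length of the match, -1 for no match,
   or -3 (our convention) if PATTERN is invalid.  */
int
match_len_n (reg_syntax_t syntax, const char *pattern,
             const char *string, size_t length)
{
  struct re_pattern_buffer buf;
  regoff_t n = -3;

  assert (pattern != NULL && string != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (syntax);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    n = re_match (&buf, string, (regoff_t) length, 0, NULL);
  regfree (&buf);
  return (int) n;
}

/* Return 1 if `.' matches a newline under SYNTAX, else 0.  */
int
dot_matches_newline (reg_syntax_t syntax)
{
  return match_len_n (syntax, "a.b", "a\nb", 3) == 3;
}

/* Return 1 if `.' matches a null character under SYNTAX, else 0.  */
int
dot_matches_null (reg_syntax_t syntax)
{
  return match_len_n (syntax, "a.b", "a\0b", 3) == 3;
}
```

With no bits set, @code{dot_matches_newline (0)} is 0 and
@code{dot_matches_null (0)} is 1; setting @code{RE_DOT_NEWLINE} flips
the first result, and setting @code{RE_DOT_NOT_NULL} flips the second.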
 512
 513 @node Concatenation Operator
 514 @section The Concatenation Operator
 515
 516 This operator concatenates two regular expressions @var{a} and @var{b}.
 517 No character represents this operator; you simply put @var{b} after
 518 @var{a}.  The result is a regular expression that will match a string if
 519 @var{a} matches its first part and @var{b} matches the rest.  For
 520 example, @samp{xy} (two match-self operators) matches @samp{xy}.
 521
 522 @node Repetition Operators
 523 @section Repetition Operators
 524
 525 Repetition operators repeat the preceding regular expression a specified
 526 number of times.
 527
 528 @menu
 529 * Match-zero-or-more Operator::  *
 530 * Match-one-or-more Operator::   +
 531 * Match-zero-or-one Operator::   ?
 532 * Interval Operators::           @{@}
 533 @end menu
 534
 535 @node Match-zero-or-more Operator
 536 @subsection The Match-zero-or-more Operator (@code{*})
 537
 538 @cindex @samp{*}
 539
 540 This operator repeats the smallest possible preceding regular expression
 541 as many times as necessary (including zero) to match the pattern.
 542 @samp{*} represents this operator.  For example, @samp{o*}
 543 matches any string made up of zero or more @samp{o}s.  Since this
 544 operator operates on the smallest preceding regular expression,
 545 @samp{fo*} has a repeating @samp{o}, not a repeating @samp{fo}.  So,
 546 @samp{fo*} matches @samp{f}, @samp{fo}, @samp{foo}, and so on.
 547
 548 Since the match-zero-or-more operator is a suffix operator, it may be
 549 useless as such when no regular expression precedes it.  This is the
 550 case when it:
 551
 552 @itemize @bullet
 553 @item
 554 is first in a regular expression, or
 555
 556 @item
 557 follows a match-beginning-of-line, open-group, or alternation
 558 operator.
 559
 560 @end itemize
 561
 562 @noindent
 563 Three different things can happen in these cases:
 564
 565 @enumerate
 566 @item
 567 If the syntax bit @code{RE_CONTEXT_INVALID_OPS} is set, then the
 568 regular expression is invalid.
 569
 570 @item
 571 If @code{RE_CONTEXT_INVALID_OPS} isn't set, but
 572 @code{RE_CONTEXT_INDEP_OPS} is, then @samp{*} represents the
 573 match-zero-or-more operator (which then operates on the empty string).
 574
 575 @item
 576 Otherwise, @samp{*} is ordinary.
 577
 578 @end enumerate
 579
 580 @cindex backtracking
 581 The matcher processes a match-zero-or-more operator by first matching as
 582 many repetitions of the smallest preceding regular expression as it can.
 583 Then it continues to match the rest of the pattern.
 584
 585 If it can't match the rest of the pattern, it backtracks (as many times
 586 as necessary), each time discarding one of the matches until it can
 587 either match the entire pattern or be certain that it cannot get a
 588 match.  For example, when matching @samp{ca*ar} against @samp{caaar},
 589 the matcher first matches all three @samp{a}s of the string with the
 590 @samp{a*} of the regular expression.  However, it cannot then match the
 591 final @samp{ar} of the regular expression against the final @samp{r} of
 592 the string.  So it backtracks, discarding the match of the last @samp{a}
 593 in the string.  It can then match the remaining @samp{ar}.
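
The behavior described in this subsection can be tried out with the
sketch below.  It assumes the GNU C Library's @file{regex.h} with
@code{_GNU_SOURCE} defined, and it builds syntaxes from single bits
rather than from the predefined configurations, since those can vary
between Regex versions.  The names @code{match_len} and
@code{leading_star} are hypothetical.

```c
#define _GNU_SOURCE             /* Expose the GNU functions in regex.h.  */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Length of the match of PATTERN (compiled under SYNTAX) at the start
   of STRING; -1 for no match, -3 (our convention) if invalid.  */
int
match_len (reg_syntax_t syntax, const char *pattern, const char *string)
{
  struct re_pattern_buffer buf;
  regoff_t n = -3;

  assert (pattern != NULL && string != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (syntax);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    n = re_match (&buf, string, (regoff_t) strlen (string), 0, NULL);
  regfree (&buf);
  return (int) n;
}

/* Classify how a leading `*' is treated under SYNTAX:
   'i' = the pattern is invalid, 'o' = `*' is ordinary,
   'r' = `*' is a repetition operator with nothing to repeat.  */
char
leading_star (reg_syntax_t syntax)
{
  if (match_len (syntax, "*foo", "*foo") == -3)
    return 'i';
  if (match_len (syntax, "*foo", "*foo") == 4)
    return 'o';
  return match_len (syntax, "*foo", "foo") == 3 ? 'r' : '?';
}
```

Here @code{match_len (0, "ca*ar", "caaar")} yields 5, which the matcher
can reach only by giving back one @samp{a} as described above.  The
three cases for a leading @samp{*} correspond to
@code{leading_star (RE_CONTEXT_INVALID_OPS)},
@code{leading_star (RE_CONTEXT_INDEP_OPS)}, and @code{leading_star (0)},
respectively.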
 594
 595
 596 @node Match-one-or-more Operator
 597 @subsection The Match-one-or-more Operator (@code{+} or @code{\+})
 598
 599 @cindex @samp{+}
 600
 601 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't recognize
 602 this operator.  Otherwise, if the syntax bit @code{RE_BK_PLUS_QM} isn't
 603 set, then @samp{+} represents this operator; if it is, then @samp{\+}
 604 does.
 605
 606 This operator is similar to the match-zero-or-more operator except that
 607 it repeats the preceding regular expression at least once;
 608 @pxref{Match-zero-or-more Operator}, for what it operates on, how some
 609 syntax bits affect it, and how Regex backtracks to match it.
 610
 611 For example, supposing that @samp{+} represents the match-one-or-more
 612 operator, then @samp{ca+r} matches, e.g., @samp{car} and
 613 @samp{caaaar}, but not @samp{cr}.
 614
 615 @node Match-zero-or-one Operator
 616 @subsection The Match-zero-or-one Operator (@code{?} or @code{\?})
 617 @cindex @samp{?}
 618
 619 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't
 620 recognize this operator.  Otherwise, if the syntax bit
 621 @code{RE_BK_PLUS_QM} isn't set, then @samp{?} represents this operator;
 622 if it is, then @samp{\?} does.
 623
 624 This operator is similar to the match-zero-or-more operator except that
 625 it repeats the preceding regular expression once or not at all;
 626 @pxref{Match-zero-or-more Operator}, to see what it operates on, how
 627 some syntax bits affect it, and how Regex backtracks to match it.
 628
 629 For example, supposing that @samp{?} represents the match-zero-or-one
 630 operator, then @samp{ca?r} matches both @samp{car} and @samp{cr}, but
 631 nothing else.
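
The following sketch exercises both the match-one-or-more and the
match-zero-or-one operators, in each of their two representations.  It
assumes the GNU C Library's @file{regex.h} with @code{_GNU_SOURCE}
defined; @code{match_len} and @code{matches_fully} are hypothetical
helper names.

```c
#define _GNU_SOURCE             /* Expose the GNU functions in regex.h.  */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Length of the match of PATTERN (compiled under SYNTAX) at the start
   of STRING; -1 for no match, -3 (our convention) if invalid.  */
int
match_len (reg_syntax_t syntax, const char *pattern, const char *string)
{
  struct re_pattern_buffer buf;
  regoff_t n = -3;

  assert (pattern != NULL && string != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (syntax);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    n = re_match (&buf, string, (regoff_t) strlen (string), 0, NULL);
  regfree (&buf);
  return (int) n;
}

/* Return 1 if PATTERN under SYNTAX matches all of STRING, else 0.  */
int
matches_fully (reg_syntax_t syntax, const char *pattern, const char *string)
{
  return match_len (syntax, pattern, string) == (int) strlen (string);
}
```

With no bits set, @code{matches_fully (0, "ca+r", "caaaar")} and
@code{matches_fully (0, "ca?r", "cr")} both yield 1, while
@samp{ca+r} rejects @samp{cr} and @samp{ca?r} rejects @samp{caar}.
Under @code{RE_BK_PLUS_QM} the same operators are written @samp{ca\+r}
and @samp{ca\?r}, and under @code{RE_LIMITED_OPS} the pattern
@samp{ca+r} matches only the literal string @samp{ca+r}.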
 632
 633 @node Interval Operators
 634 @subsection Interval Operators (@code{@{} @dots{} @code{@}} or @code{\@{} @dots{} @code{\@}})
 635
 636 @cindex interval expression
 637 @cindex @samp{@{}
 638 @cindex @samp{@}}
 639 @cindex @samp{\@{}
 640 @cindex @samp{\@}}
 641
 642 If the syntax bit @code{RE_INTERVALS} is set, then Regex recognizes
 643 @dfn{interval expressions}.  They repeat the smallest possible preceding
 644 regular expression a specified number of times.
 645
 646 If the syntax bit @code{RE_NO_BK_BRACES} is set, @samp{@{} represents
 647 the @dfn{open-interval operator} and @samp{@}} represents the
 648 @dfn{close-interval operator}; otherwise, @samp{\@{} and @samp{\@}} do.
 649
 650 Specifically, supposing that @samp{@{} and @samp{@}} represent the
 651 open-interval and close-interval operators, then:
 652
 653 @table @code
 654 @item @{@var{count}@}
 655 matches exactly @var{count} occurrences of the preceding regular
 656 expression.
 657
 658 @item @{@var{min},@}
 659 matches @var{min} or more occurrences of the preceding regular
 660 expression.
 661
 662 @item @{@var{min},@var{max}@}
 663 matches at least @var{min} but no more than @var{max} occurrences of
 664 the preceding regular expression.
 665
 666 @end table
 667
 668 The interval expression (but not necessarily the regular expression that
 669 contains it) is invalid if:
 670
 671 @itemize @bullet
 672 @item
 673 @var{min} is greater than @var{max}, or
 674
 675 @item
 676 any of @var{count}, @var{min}, or @var{max} is outside the range
 677 zero to @code{RE_DUP_MAX} (which symbol @file{regex.h}
 678 defines).
 679
 680 @end itemize
 681
 682 If the interval expression is invalid and the syntax bit
 683 @code{RE_NO_BK_BRACES} is set, then Regex considers all the
 684 characters in the would-be interval to be ordinary.  If that bit
 685 isn't set, then the regular expression is invalid.
 686
 687 If the interval expression is valid but there is no preceding regular
 688 expression on which to operate, then if the syntax bit
 689 @code{RE_CONTEXT_INVALID_OPS} is set, the regular expression is invalid.
 690 If that bit isn't set, then Regex considers all the characters---other
 691 than backslashes, which it ignores---in the would-be interval to be
 692 ordinary.
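
Here is a sketch of valid intervals in both representations.  It
assumes the GNU C Library's @file{regex.h} with @code{_GNU_SOURCE}
defined, and the helper name @code{match_len} is hypothetical.  We
deliberately leave invalid intervals out, since their treatment differs
between Regex versions.

```c
#define _GNU_SOURCE             /* Expose the GNU functions in regex.h.  */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Length of the match of PATTERN (compiled under SYNTAX) at the start
   of STRING; -1 for no match, -3 (our convention) if invalid.  */
int
match_len (reg_syntax_t syntax, const char *pattern, const char *string)
{
  struct re_pattern_buffer buf;
  regoff_t n = -3;

  assert (pattern != NULL && string != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (syntax);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL)
    n = re_match (&buf, string, (regoff_t) strlen (string), 0, NULL);
  regfree (&buf);
  return (int) n;
}

/* Braces are operators only when RE_INTERVALS is set; whether they
   need a backslash depends on RE_NO_BK_BRACES.  */
static const reg_syntax_t bare_braces = RE_INTERVALS | RE_NO_BK_BRACES;
static const reg_syntax_t backslash_braces = RE_INTERVALS;

/* Return how many leading `a's of STRING the interval MIN..MAX takes,
   or -1 if there are fewer than MIN.  Uses the bare-brace spelling.  */
int
take_two_to_three (const char *string)
{
  return match_len (bare_braces, "a{2,3}", string);
}
```

For example, @code{take_two_to_three ("aaaa")} yields 3 and
@code{take_two_to_three ("a")} yields @minus{}1.  With
@code{backslash_braces} the same interval is written @samp{a\@{2,3\@}},
and with no bits set the pattern @samp{a@{2@}} matches only the literal
four-character string @samp{a@{2@}}.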
 693
 694
 695 @node Alternation Operator
 696 @section The Alternation Operator (@code{|} or @code{\|})
 697
 698 @kindex |
 699 @kindex \|
 700 @cindex alternation operator
 701 @cindex or operator
 702
 703 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't
 704 recognize this operator.  Otherwise, if the syntax bit
 705 @code{RE_NO_BK_VBAR} is set, then @samp{|} represents this operator;
 706 otherwise, @samp{\|} does.
 707
 708 Alternatives match one of a choice of regular expressions:
 709 if you put the character(s) representing the alternation operator between
 710 any two regular expressions @var{a} and @var{b}, the result matches
 711 the union of the strings that @var{a} and @var{b} match.  For
 712 example, supposing that @samp{|} is the alternation operator, then
 713 @samp{foo|bar|quux} would match any of @samp{foo}, @samp{bar} or
 714 @samp{quux}.
 715
 716 @ignore
 717 @c Nobody needs to disallow empty alternatives any more.
 718 If the syntax bit @code{RE_NO_EMPTY_ALTS} is set, then if either of the regular
 719 expressions @var{a} or @var{b} is empty, the
 720 regular expression is invalid.  More precisely, if this syntax bit is
 721 set, then the alternation operator can't:
 722
 723 @itemize @bullet
 724 @item
 725 be first or last in a regular expression;
 726
 727 @item
 728 follow either another alternation operator or an open-group operator
 729 (@pxref{Grouping Operators}); or
 730
 731 @item
 732 precede a close-group operator.
 733
 734 @end itemize
 735
 736 @noindent
 737 For example, supposing @samp{(} and @samp{)} represent the open and
 738 close-group operators, then @samp{|foo}, @samp{foo|}, @samp{foo||bar},
 739 @samp{foo(|bar)}, and @samp{(foo|)bar} would all be invalid.
 740 @end ignore
 741
 742 The alternation operator operates on the @emph{largest} possible
 743 surrounding regular expressions.  (Put another way, it has the lowest
 744 precedence of any regular expression operator.)
 745 Thus, the only way you can
 746 delimit its arguments is to use grouping.  For example, if @samp{(} and
 747 @samp{)} are the open and close-group operators, then @samp{fo(o|b)ar}
 748 would match either @samp{fooar} or @samp{fobar}.  (@samp{foo|bar} would
 749 match @samp{foo} or @samp{bar}.)
 750
 751 @cindex backtracking
 752 The matcher usually tries all combinations of alternatives so as to
 753 match the longest possible string.  For example, when matching
 754 @samp{(fooq|foo)*(qbarquux|bar)} against @samp{fooqbarquux}, it cannot
 755 take, say, the first (``depth-first'') combination it could match, since
 756 then it would be content to match just @samp{fooqbar}.
 757
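As a sketch of this behavior, again assuming the @sc{posix} interface
with @code{REG_EXTENDED}, the hypothetical helper @code{match_length}
below reports how many characters the matcher consumed.

```c
#include <regex.h>

/* Return the length of the match that regexec reports for PATTERN in
   TEXT, or -1 if there is no match or PATTERN does not compile.
   match_length is an illustrative helper, not a Regex function.  */
int match_length(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0)
        len = (int) (m.rm_eo - m.rm_so);
    regfree(&re);
    return len;
}
```

For the pattern and string discussed above, a matcher that follows the
longest-match rule should report 11 (all of @samp{fooqbarquux}) rather
than 7 (just @samp{fooqbar}).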
 758 @comment xx something about leftmost-longest
 759
 760
 761 @node List Operators
 762 @section List Operators (@code{[} @dots{} @code{]} and @code{[^} @dots{} @code{]})
 763
 764 @cindex matching list
 765 @cindex @samp{[}
 766 @cindex @samp{]}
 767 @cindex @samp{^}
 768 @cindex @samp{-}
 769 @cindex @samp{\}
 770 @cindex @samp{[^}
 771 @cindex nonmatching list
 772 @cindex matching newline
 773 @cindex bracket expression
 774
 775 A @dfn{list}, also called a @dfn{bracket expression}, is a set of one or
 776 more items.  An @dfn{item} is a character,
 777 @ignore
 778 (These get added when they get implemented.)
 779 a collating symbol, an equivalence class expression,
 780 @end ignore
 781 a character class expression, or a range expression.  The syntax bits
 782 affect which kinds of items you can put in a list.  We explain the last
 783 two items in subsections below.  Empty lists are invalid.
 784
 785 A @dfn{matching list} matches a single character represented by one of
 786 the list items.  You form a matching list by enclosing one or more items
 787 within an @dfn{open-matching-list operator} (represented by @samp{[})
 788 and a @dfn{close-list operator} (represented by @samp{]}).
 789
 790 For example, @samp{[ab]} matches either @samp{a} or @samp{b}.
 791 @samp{[ad]*} matches the empty string and any string composed of just
 792 @samp{a}s and @samp{d}s in any order.  Regex considers invalid a regular
 793 expression with a @samp{[} but no matching
 794 @samp{]}.
 795
 796 @dfn{Nonmatching lists} are similar to matching lists except that they
 797 match a single character @emph{not} represented by one of the list
 798 items.  You use an @dfn{open-nonmatching-list operator} (represented by
 799 @samp{[^}@footnote{Regex therefore doesn't consider the @samp{^} to be
 800 the first character in the list.  If you put a @samp{^} character first
 801 in (what you think is) a matching list, you'll turn it into a
 802 nonmatching list.}) instead of an open-matching-list operator to start a
 803 nonmatching list.
 804
 805 For example, @samp{[^ab]} matches any character except @samp{a} or
 806 @samp{b}.
 807
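A small sketch of matching and nonmatching lists, assuming the
@sc{posix} interface (@code{regcomp} and @code{regexec}).  The helper
@code{char_in_list} is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the single character C matches the list pattern LIST,
   0 if it does not, and -1 if LIST is not a valid pattern.
   char_in_list is an illustrative helper, not a Regex function.  */
int char_in_list(const char *list, char c)
{
    regex_t re;
    char text[2] = { c, '\0' };   /* one-character string to test */
    int rc;

    if (regcomp(&re, list, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Here @code{char_in_list ("[ab]", 'a')} and
@code{char_in_list ("[^ab]", 'c')} both return 1, while a pattern with
an unclosed list, such as @samp{[ab}, is rejected when it is compiled.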
 808 If the @code{posix_newline} field in the pattern buffer (@pxref{GNU
 809 Pattern Buffers}) is set, then nonmatching lists do not match a newline.
 810
 811 Most characters lose any special meaning inside a list.  The special
 812 characters inside a list follow.
 813
 814 @table @samp
 815 @item ]
 816 ends the list if it's not the first list item.  So, if you want to make
 817 the @samp{]} character a list item, you must put it first.
 818
 819 @item \
 820 quotes the next character if the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is
 821 set.
 822
 823 @ignore
 824 Put these in if they get implemented.
 825
 826 @item [.
 827 represents the open-collating-symbol operator (@pxref{Collating Symbol
 828 Operators}).
 829
 830 @item .]
 831 represents the close-collating-symbol operator.
 832
 833 @item [=
 834 represents the open-equivalence-class operator (@pxref{Equivalence Class
 835 Operators}).
 836
 837 @item =]
 838 represents the close-equivalence-class operator.
 839
 840 @end ignore
 841
 842 @item [:
 843 represents the open-character-class operator (@pxref{Character Class
 844 Operators}) if the syntax bit @code{RE_CHAR_CLASSES} is set and what
 845 follows is a valid character class expression.
 846
 847 @item :]
 848 represents the close-character-class operator if the syntax bit
 849 @code{RE_CHAR_CLASSES} is set and what precedes it is an
 850 open-character-class operator followed by a valid character class name.
 851
 852 @item -
 853 represents the range operator (@pxref{Range Operator}) if it's
 854 not first or last in a list or the ending point of a range.
 855
 856 @end table
 857
 858 @noindent
 859 All other characters are ordinary.  For example, @samp{[.*]} matches
 860 @samp{.} and @samp{*}.
 861
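The rules in the table above can be checked with a short sketch, assuming
the @sc{posix} interface.  The hypothetical helper @code{count_in_list}
counts how many characters of a string a list accepts.

```c
#include <regex.h>
#include <stddef.h>

/* Count the characters of TEXT that the list pattern LIST matches.
   Returns -1 if LIST does not compile.
   count_in_list is an illustrative helper, not a Regex function.  */
int count_in_list(const char *list, const char *text)
{
    regex_t re;
    int count = 0;

    if (regcomp(&re, list, REG_NOSUB) != 0)
        return -1;
    for (; *text != '\0'; text++) {
        char one[2] = { *text, '\0' };   /* test each character alone */
        if (regexec(&re, one, 0, NULL, 0) == 0)
            count++;
    }
    regfree(&re);
    return count;
}
```

For instance, @code{count_in_list ("[]a]", "a]b]")} counts the
@samp{a} and both @samp{]} characters, since a @samp{]} placed first is
a list item, and @code{count_in_list ("[.*]", "a.b*c")} counts only the
@samp{.} and the @samp{*}.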
 862 @menu
 863 * Character Class Operators::   [:class:]
 864 * Range Operator::          start-end
 865 @end menu
 866
 867 @ignore
 868 (If collating symbols and equivalence class expressions get implemented,
 869 then add this.)
 870
 871 node Collating Symbol Operators
 872 subsubsection Collating Symbol Operators (@code{[.} @dots{} @code{.]})
 873
 874 If the syntax bit @code{XX} is set, then you can represent
 875 collating symbols inside lists.  You form a @dfn{collating symbol} by
 876 putting a collating element between an @dfn{open-collating-symbol
 877 operator} and an @dfn{close-collating-symbol operator}.  @samp{[.}
 878 represents the open-collating-symbol operator and @samp{.]} represents
 879 the close-collating-symbol operator.  For example, if @samp{ll} is a
 880 collating element, then @samp{[[.ll.]]} would match @samp{ll}.
 881
 882 node Equivalence Class Operators
 883 subsubsection Equivalence Class Operators (@code{[=} @dots{} @code{=]})
 884 @cindex equivalence class expression in regex
 885 @cindex @samp{[=} in regex
 886 @cindex @samp{=]} in regex
 887
 888 If the syntax bit @code{XX} is set, then Regex recognizes equivalence class
 889 expressions inside lists.  A @dfn{equivalence class expression} is a set
 890 of collating elements which all belong to the same equivalence class.
 891 You form an equivalence class expression by putting a collating
 892 element between an @dfn{open-equivalence-class operator} and a
 893 @dfn{close-equivalence-class operator}.  @samp{[=} represents the
 894 open-equivalence-class operator and @samp{=]} represents the
 895 close-equivalence-class operator.  For example, if @samp{a} and @samp{A}
 896 were an equivalence class, then both @samp{[[=a=]]} and @samp{[[=A=]]}
 897 would match both @samp{a} and @samp{A}.  If the collating element in an
 898 equivalence class expression isn't part of an equivalence class, then
 899 the matcher considers the equivalence class expression to be a collating
 900 symbol.
 901
 902 @end ignore
 903
 904 @node Character Class Operators
 905 @subsection Character Class Operators (@code{[:} @dots{} @code{:]})
 906
 907 @cindex character classes
 908 @cindex @samp{[:} in regex
 909 @cindex @samp{:]} in regex
 910
 911 If the syntax bit @code{RE_CHAR_CLASSES} is set, then Regex
 912 recognizes character class expressions inside lists.  A @dfn{character
 913 class expression} matches one character from a given class.  You form a
 914 character class expression by putting a character class name between an
 915 @dfn{open-character-class operator} (represented by @samp{[:}) and a
 916 @dfn{close-character-class operator} (represented by @samp{:]}).  The
 917 character class names and their meanings are:
 918
 919 @table @code
 920
 921 @item alnum
 922 letters and digits
 923
 924 @item alpha
 925 letters
 926
 927 @item blank
 928 system-dependent; for @sc{gnu}, a space or tab
 929
 930 @item cntrl
 931 control characters (in the @sc{ascii} encoding, code 0177 and codes
 932 less than 040)
 933
 934 @item digit
 935 digits
 936
 937 @item graph
 938 same as @code{print} except omits space
 939
 940 @item lower
 941 lowercase letters
 942
 943 @item print
 944 printable characters (in the @sc{ascii} encoding, space through
 945 tilde, i.e., codes 040 through 0176)
 946
 947 @item punct
 948 neither control nor alphanumeric characters
 949
 950 @item space
 951 space, tab, carriage return, newline, vertical tab, and form feed
 952
 953 @item upper
 954 uppercase letters
 955
 956 @item xdigit
 957 hexadecimal digits: @code{0}--@code{9}, @code{a}--@code{f}, @code{A}--@code{F}
 958
 959 @end table
 960
 961 @noindent
 962 These correspond to the definitions in the C library's @file{<ctype.h>}
 963 facility.  For example, @samp{[:alpha:]} corresponds to the standard
 964 facility @code{isalpha}.  Regex recognizes character class expressions
 965 only inside of lists; so @samp{[[:alpha:]]} matches any letter, but
 966 @samp{[:alpha:]} outside of a bracket expression and not followed by a
 967 repetition operator matches just itself.
 968
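The correspondence with @file{<ctype.h>} can be sketched as follows,
assuming the @sc{posix} interface and the default @samp{C} locale.  The
function name @code{alpha_class_agrees_with_isalpha} is hypothetical.

```c
#include <ctype.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if [[:alpha:]] and isalpha agree on every ASCII code from
   1 through 127, 0 if they disagree somewhere, and -1 if the pattern
   does not compile.  Illustrative only; not a Regex function.  */
int alpha_class_agrees_with_isalpha(void)
{
    regex_t re;
    int c, agree = 1;

    if (regcomp(&re, "[[:alpha:]]", REG_NOSUB) != 0)
        return -1;
    for (c = 1; c < 128; c++) {
        char text[2] = { (char) c, '\0' };
        int in_class = (regexec(&re, text, 0, NULL, 0) == 0);
        if (in_class != (isalpha(c) != 0))
            agree = 0;
    }
    regfree(&re);
    return agree;
}
```

The same loop works for the other class names by substituting, say,
@samp{[[:digit:]]} and @code{isdigit}.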
 969 @node Range Operator
 970 @subsection The Range Operator (@code{-})
 971
 972 Regex recognizes @dfn{range expressions} inside a list. They represent
 973 those characters
 974 that fall between two elements in the current collating sequence.  You
 975 form a range expression by putting a @dfn{range operator} between two
 976 @ignore
 977 (If these get implemented, then substitute this for ``characters.'')
 978 of any of the following: characters, collating elements, collating symbols,
 979 and equivalence class expressions.  The starting point of the range and
 980 the ending point of the range don't have to be the same kind of item,
 981 e.g., the starting point could be a collating element and the ending
 982 point could be an equivalence class expression.  If a range's ending
 983 point is an equivalence class, then all the collating elements in that
 984 class will be in the range.
 985 @end ignore
 986 characters.@footnote{You can't use a character class for the starting
 987 or ending point of a range, since a character class is not a single
 988 character.} @samp{-} represents the range operator.  For example,
 989 @samp{a-f} within a list represents all the characters from @samp{a}
 990 through @samp{f}
 991 inclusively.
 992
 993 If the syntax bit @code{RE_NO_EMPTY_RANGES} is set, then if the range's
 994 ending point collates less than its starting point, the range (and the
 995 regular expression containing it) is invalid.  For example, the regular
 996 expression @samp{[z-a]} would be invalid.  If this bit isn't set, then
 997 Regex considers such a range to be empty.
 998
 999 Since @samp{-} represents the range operator, if you want to make a
1000 @samp{-} character itself
1001 a list item, you must do one of the following:
1002
1003 @itemize @bullet
1004 @item
1005 Put the @samp{-} either first or last in the list.
1006
1007 @item
1008 Include a range whose starting point collates strictly lower than
1009 @samp{-} and whose ending point collates equal or higher.  Unless a
1010 range is the first item in a list, a @samp{-} can't be its starting
1011 point, but @emph{can} be its ending point.  That is because Regex
1012 considers @samp{-} to be the range operator unless it is preceded by
1013 another @samp{-}.  For example, in the @sc{ascii} encoding, @samp{)},
1014 @samp{*}, @samp{+}, @samp{,}, @samp{-}, @samp{.}, and @samp{/} are
1015 contiguous characters in the collating sequence.  You might think that
1016 @samp{[)-+--/]} has two ranges: @samp{)-+} and @samp{--/}.  Rather, it
1017 has the ranges @samp{)-+} and @samp{+--}, plus the character @samp{/}, so
1018 it matches, e.g., @samp{,}, not @samp{.}.
1019
1020 @item
1021 Put a range whose starting point is @samp{-} first in the list.
1022
1023 @end itemize
1024
1025 For example, @samp{[-a-z]} matches a lowercase letter or a hyphen (in
1026 English, in @sc{ascii}).
1027
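A sketch of these range rules, assuming the @sc{posix} interface with
@code{REG_EXTENDED} in the @samp{C} locale.  There, empty ranges are
expected to be rejected as they are when @code{RE_NO_EMPTY_RANGES} is
set.  The helper @code{list_accepts} is hypothetical.

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 if every character of CHARS matches the list LIST, 0 if
   some character does not (or CHARS is empty), and -1 if the pattern
   is invalid or too long.  Illustrative helper, not a Regex function.  */
int list_accepts(const char *list, const char *chars)
{
    char pattern[128];
    regex_t re;
    int n, rc;

    /* Anchor and repeat the list: ^LIST+$  */
    n = snprintf(pattern, sizeof pattern, "^%s+$", list);
    if (n < 0 || (size_t) n >= sizeof pattern)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, chars, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, @code{list_accepts ("[-a-z]", "-amz")} returns 1
because the leading @samp{-} is an ordinary list item, and
@code{list_accepts ("[z-a]", "m")} returns @minus{}1 because the range is
rejected at compile time.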
1028
1029 @node Grouping Operators
1030 @section Grouping Operators (@code{(} @dots{} @code{)} or @code{\(} @dots{} @code{\)})
1031
1032 @kindex (
1033 @kindex )
1034 @kindex \(
1035 @kindex \)
1036 @cindex grouping
1037 @cindex subexpressions
1038 @cindex parenthesizing
1039
1040 A @dfn{group}, also known as a @dfn{subexpression}, consists of an
1041 @dfn{open-group operator}, any number of other operators, and a
1042 @dfn{close-group operator}.  Regex treats this sequence as a unit, just
1043 as mathematics and programming languages treat a parenthesized
1044 expression as a unit.
1045
1046 Therefore, using @dfn{groups}, you can:
1047
1048 @itemize @bullet
1049 @item
1050 delimit the argument(s) to an alternation operator (@pxref{Alternation
1051 Operator}) or a repetition operator (@pxref{Repetition
1052 Operators}).
1053
1054 @item
1055 keep track of the indices of the substring that matched a given group.
1056 @xref{Using Registers}, for a precise explanation.
1057 This lets you:
1058
1059 @itemize @bullet
1060 @item
1061 use the back-reference operator (@pxref{Back-reference Operator}).
1062
1063 @item
1064 use registers (@pxref{Using Registers}).
1065
1066 @end itemize
1067
1068 @end itemize
1069
1070 If the syntax bit @code{RE_NO_BK_PARENS} is set, then @samp{(} represents
1071 the open-group operator and @samp{)} represents the
1072 close-group operator; otherwise, @samp{\(} and @samp{\)} do.
1073
1074 If the syntax bit @code{RE_UNMATCHED_RIGHT_PAREN_ORD} is set and a
1075 close-group operator has no matching open-group operator, then Regex
1076 considers it to match @samp{)}.
1077
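The two spellings of the group operators can be compared with a sketch
that assumes the @sc{posix} interface: basic syntax (no flag) uses
@samp{\(} and @samp{\)}, while @code{REG_EXTENDED} uses @samp{(} and
@samp{)}.  The helper @code{group_span} and the limit @code{MAX_GROUPS}
are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10   /* illustrative limit: whole match plus 9 groups */

/* Return the offsets of group number GROUP (0 is the whole match) for
   the first match of PATTERN in TEXT.  Both offsets are -1 if there is
   no match, the group did not participate, or PATTERN is invalid.
   CFLAGS is passed to regcomp and must not include REG_NOSUB.  */
regmatch_t group_span(const char *pattern, int cflags,
                      const char *text, size_t group)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    regmatch_t span;

    span.rm_so = -1;
    span.rm_eo = -1;
    if (group >= MAX_GROUPS || regcomp(&re, pattern, cflags) != 0)
        return span;
    if (regexec(&re, text, MAX_GROUPS, m, 0) == 0)
        span = m[group];
    regfree(&re);
    return span;
}
```

Both @code{group_span ("\\(b*\\)c", 0, "abbc", 1)} and
@code{group_span ("(b*)c", REG_EXTENDED, "abbc", 1)} report offsets 1
and 3.  In basic syntax, unescaped parentheses are ordinary characters,
so @samp{(b*)c} there matches the literal text @samp{(bb)c}.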
1078
1079 @node Back-reference Operator
1080 @section The Back-reference Operator (@code{\}@var{digit})
1081
1082 @cindex back references
1083
1084 If the syntax bit @code{RE_NO_BK_REFS} isn't set, then Regex recognizes
1085 back references.  A back reference matches a specified preceding group.
1086 The back reference operator is represented by @samp{\@var{digit}}
1087 anywhere after the end of a regular expression's @w{@var{digit}-th}
1088 group (@pxref{Grouping Operators}).
1089
1090 @var{digit} must be between @samp{1} and @samp{9}.  The matcher assigns
1091 numbers 1 through 9 to the first nine groups it encounters.  By using
1092 one of @samp{\1} through @samp{\9} after the corresponding group's
1093 close-group operator, you can match a substring identical to the
1094 one that the group does.
1095
1096 Back references match according to the following (in all examples below,
1097 @samp{(} represents the open-group, @samp{)} the close-group, @samp{@{}
1098 the open-interval and @samp{@}} the close-interval operator):
1099
1100 @itemize @bullet
1101 @item
1102 If the group matches a substring, the back reference matches an
1103 identical substring.  For example, @samp{(a)\1} matches @samp{aa} and
1104 @samp{(bana)na\1bo\1} matches @samp{bananabanabobana}.  Likewise,
1105 @samp{(.*)\1} matches any (newline-free if the syntax bit
1106 @code{RE_DOT_NEWLINE} isn't set) string that is composed of two
1107 identical halves; the @samp{(.*)} matches the first half and the
1108 @samp{\1} matches the second half.
1109
1110 @item
1111 If the group matches more than once (as it might if followed
1112 by, e.g., a repetition operator), then the back reference matches the
1113 substring the group @emph{last} matched.  For example,
1114 @samp{((a*)b)*\1\2} matches @samp{aabababa}; first @w{group 1} (the
1115 outer one) matches @samp{aab} and @w{group 2} (the inner one) matches
1116 @samp{aa}.  Then @w{group 1} matches @samp{ab} and @w{group 2} matches
1117 @samp{a}.  So, @samp{\1} matches @samp{ab} and @samp{\2} matches
1118 @samp{a}.
1119
1120 @item
1121 If the group doesn't participate in a match, i.e., it is part of an
1122 alternative not taken or a repetition operator allows zero repetitions
1123 of it, then the back reference makes the whole match fail.  For example,
1124 @samp{(one()|two())-and-(three\2|four\3)} matches @samp{one-and-three}
1125 and @samp{two-and-four}, but not @samp{one-and-four} or
1126 @samp{two-and-three}.  To see why, suppose the pattern has matched
1127 @samp{one-and-}: its @w{group 2} has matched the empty string and its
1128 @w{group 3} hasn't participated in the match.  If the matcher then
1129 matches @samp{four}, it must next back reference @w{group 3}, because
1130 @samp{\3} follows the @samp{four}; that back reference fails because
1131 @w{group 3} didn't participate, so the whole match fails.
1132
1133 @end itemize
1134
1135 You can use a back reference as an argument to a repetition operator.  For
1136 example, @samp{(a(b))\2*} matches @samp{a} followed by one or more
1137 @samp{b}s.  Similarly, @samp{(a(b))\2@{3@}} matches @samp{abbbb}.
1138
1139 If there is no preceding @w{@var{digit}-th} subexpression, the regular
1140 expression is invalid.
1141
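The back-reference rules above can be tried with a sketch that assumes
the @sc{posix} interface in basic syntax, where groups are written
@samp{\(} @dots{} @samp{\)} and back references are available.  The
helper @code{bre_full_match} is hypothetical.

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 if the basic-syntax PATTERN matches all of TEXT, 0 if it
   does not, and -1 if PATTERN is invalid or too long.
   bre_full_match is an illustrative helper, not a Regex function.  */
int bre_full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    /* Anchor at both ends so that only whole-string matches count.  */
    n = snprintf(anchored, sizeof anchored, "^%s$", pattern);
    if (n < 0 || (size_t) n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, @code{bre_full_match ("\\(.*\\)\\1", "abcabc")} returns 1
because the string consists of two identical halves, and
@code{bre_full_match ("\\(a\\)\\2", "aa")} returns @minus{}1 because
there is no second group for @samp{\2} to refer to.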
1142
1143 @node Anchoring Operators
1144 @section Anchoring Operators
1145
1146 @cindex anchoring
1147 @cindex regexp anchoring
1148
1149 These operators can constrain a pattern to match only at the beginning or
1150 end of the entire string or at the beginning or end of a line.
1151
1152 @menu
1153 * Match-beginning-of-line Operator::  ^
1154 * Match-end-of-line Operator::        $
1155 @end menu
1156
1157
1158 @node Match-beginning-of-line Operator
1159 @subsection The Match-beginning-of-line Operator (@code{^})
1160
1161 @kindex ^
1162 @cindex beginning-of-line operator
1163 @cindex anchors
1164
1165 This operator can match the empty string either at the beginning of the
1166 string or after a newline character.  Thus, it is said to @dfn{anchor}
1167 the pattern to the beginning of a line.
1168
1169 In the following cases, @samp{^} represents this operator.  (Otherwise,
1170 @samp{^} is ordinary.)
1171
1172 @itemize @bullet
1173
1174 @item
1175 It (the @samp{^}) is first in the pattern, as in @samp{^foo}.
1176
1177 @cnindex RE_CONTEXT_INDEP_ANCHORS @r{(and @samp{^})}
1178 @item
1179 The syntax bit @code{RE_CONTEXT_INDEP_ANCHORS} is set, and it is outside
1180 a bracket expression.
1181
1182 @cindex open-group operator and @samp{^}
1183 @cindex alternation operator and @samp{^}
1184 @item
1185 It follows an open-group or alternation operator, as in @samp{a\(^b\)}
1186 and @samp{a\|^b}.  @xref{Grouping Operators}, and @ref{Alternation
1187 Operator}.
1188
1189 @end itemize
1190
1191 These rules imply that some valid patterns containing @samp{^} cannot be
1192 matched; for example, @samp{foo^bar} if @code{RE_CONTEXT_INDEP_ANCHORS}
1193 is set.
1194
1195 @vindex not_bol @r{field in pattern buffer}
1196 If the @code{not_bol} field is set in the pattern buffer (@pxref{GNU
1197 Pattern Buffers}), then @samp{^} fails to match at the beginning of the
1198 string.  @xref{POSIX Matching}, for when you might find this useful.
1199
1200 @vindex newline_anchor @r{field in pattern buffer}
1201 If the @code{newline_anchor} field is not set in the pattern buffer, then
1202 @samp{^} fails to match after a newline.  This is useful when you do not
1203 regard the string to be matched as broken into lines.
1204
1205
1206 @node Match-end-of-line Operator
1207 @subsection The Match-end-of-line Operator (@code{$})
1208
1209 @kindex $
1210 @cindex end-of-line operator
1211 @cindex anchors
1212
1213 This operator can match the empty string either at the end of
1214 the string or before a newline character in the string.  Thus, it is
1215 said to @dfn{anchor} the pattern to the end of a line.
1216
1217 It is always represented by @samp{$}.  For example, @samp{foo$} usually
1218 matches @samp{foo} and also the first three characters of
1219 @samp{foo\nbar}.
1220
1221 Its interaction with the syntax bits and pattern buffer fields is
1222 exactly the dual of @samp{^}'s; see the previous section.  (That is,
1223 ``@samp{^}'' becomes ``@samp{$}'', ``beginning'' becomes ``end'',
1224 ``next'' becomes ``previous'', ``after'' becomes ``before'', and
1225 ``@code{not_bol}'' becomes ``@code{not_eol}''.)
1226
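A sketch of both anchors, assuming the @sc{posix} interface.  Its
@code{REG_NEWLINE} compile flag and @code{REG_NOTBOL} and
@code{REG_NOTEOL} match flags are used here as the counterparts of the
@code{newline_anchor}, @code{not_bol} and @code{not_eol} fields; that
pairing is an assumption of this example.  The helper
@code{search_start} is hypothetical.

```c
#include <regex.h>

/* Return the offset at which PATTERN first matches in TEXT, -1 if it
   does not match, and -2 if PATTERN does not compile.  CFLAGS goes to
   regcomp and EFLAGS to regexec.  Illustrative helper only.  */
int search_start(const char *pattern, const char *text,
                 int cflags, int eflags)
{
    regex_t re;
    regmatch_t m;
    int start = -1;

    if (regcomp(&re, pattern, cflags) != 0)
        return -2;
    if (regexec(&re, text, 1, &m, eflags) == 0)
        start = (int) m.rm_so;
    regfree(&re);
    return start;
}
```

With @code{REG_NEWLINE}, @samp{^bar} matches at offset 4 of
@samp{foo\nbar} and @samp{foo$} matches at offset 0; without it, neither
pattern matches that string.  Passing @code{REG_NOTBOL} stops
@samp{^foo} from matching even at the very start of @samp{foo}.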
1227
1228 @node GNU Operators
1229 @chapter GNU Operators
1230
1231 Following are operators that @sc{gnu} defines (and @sc{posix} doesn't).
1232
1233 @menu
1234 * Word Operators::
1235 * Buffer Operators::
1236 @end menu
1237
1238 @node Word Operators
1239 @section Word Operators
1240
1241 The operators in this section require Regex to recognize parts of words.
1242 Regex uses a syntax table to determine whether or not a character is
1243 part of a word, i.e., whether or not it is @dfn{word-constituent}.
1244
1245 @menu
1246 * Non-Emacs Syntax Tables::
1247 * Match-word-boundary Operator::        \b
1248 * Match-within-word Operator::          \B
1249 * Match-beginning-of-word Operator::    \<
1250 * Match-end-of-word Operator::          \>
1251 * Match-word-constituent Operator::     \w
1252 * Match-non-word-constituent Operator:: \W
1253 @end menu
1254
1255 @node Non-Emacs Syntax Tables
1256 @subsection Non-Emacs Syntax Tables
1257
1258 A @dfn{syntax table} is an array indexed by the characters in your
1259 character set.  In the @sc{ascii} encoding, therefore, a syntax table
1260 has 256 elements.  Regex always uses a @code{char *} variable
1261 @code{re_syntax_table} as its syntax table.  In some cases, it
1262 initializes this variable and in others it expects you to initialize it.
1263
1264 @itemize @bullet
1265 @item
1266 If Regex is compiled with the preprocessor symbols @code{emacs} and
1267 @code{SYNTAX_TABLE} both undefined, then Regex allocates
1268 @code{re_syntax_table} and initializes an element @var{i} either to
1269 @code{Sword} (which it defines) if @var{i} is a letter, number, or
1270 @samp{_}, or to zero if it's not.
1271
1272 @item
1273 If Regex is compiled with @code{emacs} undefined but @code{SYNTAX_TABLE}
1274 defined, then Regex expects you to define a @code{char *} variable
1275 @code{re_syntax_table} to be a valid syntax table.
1276
1277 @item
1278 @xref{Emacs Syntax Tables}, for what happens when Regex is compiled with
1279 the preprocessor symbol @code{emacs} defined.
1280
1281 @end itemize
1282
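A rough sketch of the default initialization described in the first case
above.  It is standalone C and does not touch Regex's own
@code{re_syntax_table}; the names @code{SWORD}, @code{word_table} and
@code{is_word_constituent} are hypothetical stand-ins.

```c
#include <ctype.h>

#define TABLE_SIZE 256
#define SWORD 1   /* hypothetical stand-in for Regex's Sword value */

/* A private table in the style of a non-Emacs syntax table.  */
static char word_table[TABLE_SIZE];
static int word_table_ready = 0;

/* Mark letters, digits and '_' as word-constituent; all else is 0.  */
static void init_word_table(void)
{
    int c;

    for (c = 0; c < TABLE_SIZE; c++)
        word_table[c] = (isalnum(c) || c == '_') ? SWORD : 0;
    word_table_ready = 1;
}

/* Return 1 if C is word-constituent according to the table.  */
int is_word_constituent(unsigned char c)
{
    if (!word_table_ready)
        init_word_table();
    return word_table[c] == SWORD;
}
```

A program that supplies its own table would fill the entries
differently, for example by also marking @samp{-} as word-constituent.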
1283 @node Match-word-boundary Operator
1284 @subsection The Match-word-boundary Operator (@code{\b})
1285
1286 @cindex @samp{\b}
1287 @cindex word boundaries, matching
1288
1289 This operator (represented by @samp{\b}) matches the empty string at
1290 either the beginning or the end of a word.  For example, @samp{\brat\b}
1291 matches the separate word @samp{rat}.
1292
1293 @node Match-within-word Operator
1294 @subsection The Match-within-word Operator (@code{\B})
1295
1296 @cindex @samp{\B}
1297
1298 This operator (represented by @samp{\B}) matches the empty string within
1299 a word. For example, @samp{c\Brat\Be} matches @samp{crate}, but
1300 @samp{dirty \Brat} doesn't match @samp{dirty rat}.
1301
1302 @node Match-beginning-of-word Operator
1303 @subsection The Match-beginning-of-word Operator (@code{\<})
1304
1305 @cindex @samp{\<}
1306
1307 This operator (represented by @samp{\<}) matches the empty string at the
1308 beginning of a word.
1309
1310 @node Match-end-of-word Operator
1311 @subsection The Match-end-of-word Operator (@code{\>})
1312
1313 @cindex @samp{\>}
1314
1315 This operator (represented by @samp{\>}) matches the empty string at the
1316 end of a word.
1317
1318 @node Match-word-constituent Operator
1319 @subsection The Match-word-constituent Operator (@code{\w})
1320
1321 @cindex @samp{\w}
1322
1323 This operator (represented by @samp{\w}) matches any word-constituent
1324 character.
1325
1326 @node Match-non-word-constituent Operator
1327 @subsection The Match-non-word-constituent Operator (@code{\W})
1328
1329 @cindex @samp{\W}
1330
1331 This operator (represented by @samp{\W}) matches any character that is
1332 not word-constituent.
1333
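The word operators can be exercised with the sketch below.  It assumes
the GNU C Library, whose @code{regcomp} accepts these @sc{gnu} operators
in addition to the @sc{posix} ones; other implementations may reject
them.  The helper @code{word_search} is hypothetical.

```c
#include <regex.h>

/* Return the offset of the first match of PATTERN (extended syntax
   plus GNU word operators) in TEXT, -1 if there is none, and -2 if
   PATTERN does not compile.  Illustrative helper only.  */
int word_search(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int start = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -2;
    if (regexec(&re, text, 1, &m, 0) == 0)
        start = (int) m.rm_so;
    regfree(&re);
    return start;
}
```

Note that each backslash is doubled in C string literals, so the pattern
@samp{\brat\b} is written @code{"\\brat\\b"}.  It finds the separate
word in @samp{a rat ran} but finds nothing in @samp{crate}.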
1334
1335 @node Buffer Operators
1336 @section Buffer Operators
1337
1338 Following are operators which work on buffers.  In Emacs, a @dfn{buffer}
1339 is, naturally, an Emacs buffer.  For other programs, Regex considers the
1340 entire string to be matched as the buffer.
1341
1342 @menu
1343 * Match-beginning-of-buffer Operator::  \`
1344 * Match-end-of-buffer Operator::        \'
1345 @end menu
1346
1347
1348 @node Match-beginning-of-buffer Operator
1349 @subsection The Match-beginning-of-buffer Operator (@code{\`})
1350
1351 @cindex @samp{\`}
1352
1353 This operator (represented by @samp{\`}) matches the empty string at the
1354 beginning of the buffer.
1355
1356 @node Match-end-of-buffer Operator
1357 @subsection The Match-end-of-buffer Operator (@code{\'})
1358
1359 @cindex @samp{\'}
1360
1361 This operator (represented by @samp{\'}) matches the empty string at the
1362 end of the buffer.
1363
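The difference between the buffer operators and the line anchors shows
up when the string contains newlines.  The sketch below assumes the GNU
C Library, whose @code{regcomp} accepts the buffer operators, and
compiles with @code{REG_NEWLINE} so that @samp{^} and @samp{$} also match
at line breaks.  The helper @code{multiline_search} is hypothetical.

```c
#include <regex.h>

/* Return the offset of the first match of PATTERN in TEXT when lines
   are significant (REG_NEWLINE), -1 if there is none, and -2 if
   PATTERN does not compile.  Illustrative helper only.  */
int multiline_search(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int start = -1;

    if (regcomp(&re, pattern, REG_NEWLINE) != 0)
        return -2;
    if (regexec(&re, text, 1, &m, 0) == 0)
        start = (int) m.rm_so;
    regfree(&re);
    return start;
}
```

In the string @samp{foo\nbar}, @samp{^bar} matches on the second line,
but the pattern consisting of the match-beginning-of-buffer operator
followed by @samp{bar} does not, because @samp{bar} is not at the start
of the whole string.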
1364
1365 @node GNU Emacs Operators
1366 @chapter GNU Emacs Operators
1367
1368 Following are operators that @sc{gnu} defines (and @sc{posix} doesn't)
1369 that you can use only when Regex is compiled with the preprocessor
1370 symbol @code{emacs} defined.
1371
1372 @menu
1373 * Syntactic Class Operators::
1374 @end menu
1375
1376
1377 @node Syntactic Class Operators
1378 @section Syntactic Class Operators
1379
1380 The operators in this section require Regex to recognize the syntactic
1381 classes of characters.  Regex uses a syntax table to determine this.
1382
1383 @menu
1384 * Emacs Syntax Tables::
1385 * Match-syntactic-class Operator::      \sCLASS
1386 * Match-not-syntactic-class Operator::  \SCLASS
1387 @end menu
1388
1389 @node Emacs Syntax Tables
1390 @subsection Emacs Syntax Tables
1391
1392 A @dfn{syntax table} is an array indexed by the characters in your
1393 character set.  In the @sc{ascii} encoding, therefore, a syntax table
1394 has 256 elements.
1395
1396 If Regex is compiled with the preprocessor symbol @code{emacs} defined,
1397 then Regex expects you to define and initialize the variable
1398 @code{re_syntax_table} to be an Emacs syntax table.  Emacs' syntax
1399 tables are more complicated than Regex's own (@pxref{Non-Emacs Syntax
1400 Tables}).  @xref{Syntax, , Syntax, emacs, The GNU Emacs User's Manual},
1401 for a description of Emacs' syntax tables.
1402
1403 @node Match-syntactic-class Operator
1404 @subsection The Match-syntactic-class Operator (@code{\s}@var{class})
1405
1406 @cindex @samp{\s}
1407
1408 This operator matches any character whose syntactic class is represented
1409 by a specified character.  @samp{\s@var{class}} represents this operator
1410 where @var{class} is the character representing the syntactic class you
1411 want.  For example, @samp{w} represents the syntactic
1412 class of word-constituent characters, so @samp{\sw} matches any
1413 word-constituent character.
1414
1415 @node Match-not-syntactic-class Operator
1416 @subsection The Match-not-syntactic-class Operator (@code{\S}@var{class})
1417
1418 @cindex @samp{\S}
1419
1420 This operator is similar to the match-syntactic-class operator except
1421 that it matches any character whose syntactic class is @emph{not}
1422 represented by the specified character.  @samp{\S@var{class}} represents
1423 this operator.  For example, @samp{w} represents the syntactic class of
1424 word-constituent characters, so @samp{\Sw} matches any character that is
1425 not word-constituent.
1426
1427
1428 @node What Gets Matched?
1429 @chapter What Gets Matched?
1430
1431 Regex usually matches strings according to the ``leftmost longest''
1432 rule; that is, it chooses the longest of the leftmost matches.  This
1433 does not mean that, for a regular expression containing subexpressions,
1434 it simply chooses the longest match for each subexpression, left to
1435 right; the overall match must also be the longest possible one.
1436
1437 For example, @samp{(ac*)(c*d[ac]*)\1} matches @samp{acdacaaa}, not
1438 @samp{acdac}, as it would if it were to choose the longest match for the
1439 first subexpression.
1440
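The two halves of the rule can be seen separately in a sketch that
assumes the @sc{posix} interface with @code{REG_EXTENDED}.  It uses
simpler patterns than the one above; the helper @code{match_span} and
the limit @code{SPAN_GROUPS} are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

#define SPAN_GROUPS 4   /* illustrative limit: whole match plus 3 groups */

/* Return the offsets of group GROUP (0 is the whole match) for the
   match of PATTERN in TEXT; both offsets are -1 on failure.
   match_span is an illustrative helper, not a Regex function.  */
regmatch_t match_span(const char *pattern, const char *text, size_t group)
{
    regex_t re;
    regmatch_t m[SPAN_GROUPS];
    regmatch_t span;

    span.rm_so = -1;
    span.rm_eo = -1;
    if (group >= SPAN_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return span;
    if (regexec(&re, text, SPAN_GROUPS, m, 0) == 0)
        span = m[group];
    regfree(&re);
    return span;
}
```

``Leftmost'' wins first: @samp{b+} in @samp{abbabbb} matches offsets 1
to 3, not the longer run later in the string.  ``Longest'' then applies
to the match as a whole: for @samp{(a|ab)(c|bcd)} against @samp{abcd},
the only way to reach the end of the string is for @w{group 1} to take
the shorter alternative @samp{a}.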
1441
1442 @node Programming with Regex
1443 @chapter Programming with Regex
1444
1445 Here we describe how you use the Regex data structures and functions in
1446 C programs.  Regex has three interfaces: one designed for @sc{gnu}, one
1447 compatible with @sc{posix} and one compatible with Berkeley @sc{unix}.
1448
1449 @menu
1450 * GNU Regex Functions::
1451 * POSIX Regex Functions::
1452 * BSD Regex Functions::
1453 @end menu
1454
1455
1456 @node GNU Regex Functions
1457 @section GNU Regex Functions
1458
1459 If you're writing code that doesn't need to be compatible with either
1460 @sc{posix} or Berkeley @sc{unix}, you can use these functions.  They
1461 provide more options than the other interfaces.
1462
1463 @menu
1464 * GNU Pattern Buffers::         The re_pattern_buffer type.
1465 * GNU Regular Expression Compiling::  re_compile_pattern ()
1466 * GNU Matching::                re_match ()
1467 * GNU Searching::               re_search ()
1468 * Matching/Searching with Split Data::  re_match_2 (), re_search_2 ()
1469 * Searching with Fastmaps::     re_compile_fastmap ()
1470 * GNU Translate Tables::        The `translate' field.
1471 * Using Registers::             The re_registers type and related fns.
1472 * Freeing GNU Pattern Buffers::  regfree ()
1473 @end menu
1474
1475
1476 @node GNU Pattern Buffers
1477 @subsection GNU Pattern Buffers
1478
1479 @cindex pattern buffer, definition of
1480 @tindex re_pattern_buffer @r{definition}
1481 @tindex struct re_pattern_buffer @r{definition}
1482
1483 To compile, match, or search for a given regular expression, you must
1484 supply a pattern buffer.  A @dfn{pattern buffer} holds one compiled
1485 regular expression.@footnote{Regular expressions are also referred to as
1486 ``patterns,'' hence the name ``pattern buffer.''}
1487
1488 You can have several different pattern buffers simultaneously, each
1489 holding a compiled pattern for a different regular expression.
1490
1491 @file{regex.h} defines the pattern buffer @code{struct} as follows:
1492
1493 @example
1494         /* Space that holds the compiled pattern.  It is declared as
1495            `unsigned char *' because its elements are
1496            sometimes used as array indexes.  */
1497   unsigned char *buffer;
1498
1499         /* Number of bytes to which `buffer' points.  */
1500   unsigned long allocated;
1501
1502         /* Number of bytes actually used in `buffer'.  */
1503   unsigned long used;
1504
1505         /* Syntax setting with which the pattern was compiled.  */
1506   reg_syntax_t syntax;
1507
1508         /* Pointer to a fastmap, if any, otherwise zero.  re_search uses
1509            the fastmap, if there is one, to skip over impossible
1510            starting points for matches.  */
1511   char *fastmap;
1512
1513         /* Either a translate table to apply to all characters before
1514            comparing them, or zero for no translation.  The translation
1515            is applied to a pattern when it is compiled and to a string
1516            when it is matched.  */
1517   char *translate;
1518
1519         /* Number of subexpressions found by the compiler.  */
1520   size_t re_nsub;
1521
1522         /* Zero if this pattern cannot match the empty string, one else.
1523            Well, in truth it's used only in `re_search_2', to see
1524            whether or not we should use the fastmap, so we don't set
1525            this absolutely perfectly; see `re_compile_fastmap' (the
1526            `duplicate' case).  */
1527   unsigned can_be_null : 1;
1528
1529         /* If REGS_UNALLOCATED, allocate space in the `regs' structure
1530              for `max (RE_NREGS, re_nsub + 1)' groups.
1531            If REGS_REALLOCATE, reallocate space if necessary.
1532            If REGS_FIXED, use what's there.  */
1533 #define REGS_UNALLOCATED 0
1534 #define REGS_REALLOCATE 1
1535 #define REGS_FIXED 2
1536   unsigned regs_allocated : 2;
1537
1538         /* Set to zero when `regex_compile' compiles a pattern; set to one
1539            by `re_compile_fastmap' if it updates the fastmap.  */
1540   unsigned fastmap_accurate : 1;
1541
1542         /* If set, `re_match_2' does not return information about
1543            subexpressions.  */
1544   unsigned no_sub : 1;
1545
1546         /* If set, a beginning-of-line anchor doesn't match at the
1547            beginning of the string.  */
1548   unsigned not_bol : 1;
1549
1550         /* Similarly for an end-of-line anchor.  */
1551   unsigned not_eol : 1;
1552
1553         /* If true, an anchor at a newline matches.  */
1554   unsigned newline_anchor : 1;
1555
1556 @end example
1557
1558
1559 @node GNU Regular Expression Compiling
1560 @subsection GNU Regular Expression Compiling
1561
1562 In @sc{gnu}, you can both match and search for a given regular
1563 expression.  To do either, you must first compile it in a pattern buffer
1564 (@pxref{GNU Pattern Buffers}).
1565
1566 @cindex syntax initialization
1567 @vindex re_syntax_options @r{initialization}
1568 Regular expressions match according to the syntax with which they were
1569 compiled; with @sc{gnu}, you indicate what syntax you want by setting
1570 the variable @code{re_syntax_options} (declared in @file{regex.h})
1571 before calling the compiling function, @code{re_compile_pattern} (see
1572 below).  @xref{Syntax Bits}, and @ref{Predefined Syntaxes}.
1573
1574 You can change the value of @code{re_syntax_options} at any time.
1575 Usually, however, you set its value once and then never change it.
1576
1577 @cindex pattern buffer initialization
1578 @code{re_compile_pattern} takes a pattern buffer as an argument.  You
1579 must initialize the following fields:
1580
1581 @table @code
1582
1585 @item translate
1586 @vindex translate @r{initialization}
1587 Initialize this to point to a translate table if you want one, or to
1588 zero if you don't.  We explain translate tables in @ref{GNU Translate
1589 Tables}.
1590
1591 @item fastmap
1592 @vindex fastmap @r{initialization}
1593 Initialize this to the address of an array you have allocated if you want
1594 a fastmap (@pxref{Searching with Fastmaps}), or to zero if you don't.
1595
1596 @item buffer
1597 @itemx allocated
1598 @vindex buffer @r{initialization}
1599 @vindex allocated @r{initialization}
1600 @findex malloc
1601 If you want @code{re_compile_pattern} to allocate memory for the
1602 compiled pattern, set both of these to zero.  If you have an existing
1603 block of memory (allocated with @code{malloc}) you want Regex to use,
1604 set @code{buffer} to its address and @code{allocated} to its size (in
1605 bytes).
1606
1607 @code{re_compile_pattern} uses @code{realloc} to extend the space for
1608 the compiled pattern as necessary.
1609
1610 @end table
1611
1612 To compile a pattern buffer, use:
1613
1614 @findex re_compile_pattern
1615 @example
1616 char *
1617 re_compile_pattern (const char *@var{regex}, const int @var{regex_size},
1618                     struct re_pattern_buffer *@var{pattern_buffer})
1619 @end example
1620
1621 @noindent
1622 @var{regex} is the regular expression's address, @var{regex_size} is its
1623 length, and @var{pattern_buffer} is the pattern buffer's address.
1624
1625 If @code{re_compile_pattern} successfully compiles the regular
1626 expression, it returns zero and sets @code{*@var{pattern_buffer}} to the
1627 compiled pattern.  It sets the pattern buffer's fields as follows:
1628
1629 @table @code
1630 @item buffer
1631 @vindex buffer @r{field, set by @code{re_compile_pattern}}
1632 to the compiled pattern.
1633
1634 @item used
1635 @vindex used @r{field, set by @code{re_compile_pattern}}
1636 to the number of bytes the compiled pattern in @code{buffer} occupies.
1637
1638 @item syntax
1639 @vindex syntax @r{field, set by @code{re_compile_pattern}}
1640 to the current value of @code{re_syntax_options}.
1641
1642 @item re_nsub
1643 @vindex re_nsub @r{field, set by @code{re_compile_pattern}}
1644 to the number of subexpressions in @var{regex}.
1645
1646 @item fastmap_accurate
1647 @vindex fastmap_accurate @r{field, set by @code{re_compile_pattern}}
1648 to zero, on the theory that the pattern you're compiling is different
1649 from the one previously compiled into @code{buffer}; in that case (since
1650 you can't make a fastmap without a compiled pattern),
1651 @code{fastmap} would either contain an incompatible fastmap, or nothing
1652 at all.
1653
1654 @c xx what else?
1655 @end table
1656
1657 If @code{re_compile_pattern} can't compile @var{regex}, it returns an
1658 error string corresponding to one of the errors listed in @ref{POSIX
1659 Regular Expression Compiling}.
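
The initialization and compilation steps above can be sketched as follows.
This is a minimal illustration, not a definitive recipe: it assumes the
glibc version of @file{regex.h}, where the @sc{gnu} declarations appear
once @code{_GNU_SOURCE} is defined, and the helper name
@code{compile_gnu} is hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU API */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN into *BUF and let Regex allocate
   the space for the compiled pattern.  Returns NULL on success, or the
   error string from re_compile_pattern on failure.  */
const char *
compile_gnu (struct re_pattern_buffer *buf, const char *pattern)
{
  assert (buf != NULL && pattern != NULL);

  /* Zeroing the buffer sets `translate', `fastmap', `buffer' and
     `allocated' to zero: no translate table, no fastmap, and Regex
     allocates whatever memory the compiled pattern needs.  */
  memset (buf, 0, sizeof *buf);

  /* Choose the syntax before compiling.  */
  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;

  return re_compile_pattern (pattern, strlen (pattern), buf);
}
```

After a successful call, @code{buf->re_nsub} holds the number of
subexpressions; for @samp{a(b|c)*} under this syntax that is 1.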
1660
1661
1662 @node GNU Matching
1663 @subsection GNU Matching
1664
1665 @cindex matching with GNU functions
1666
1667 Matching the @sc{gnu} way means trying to match as much of a string as
1668 possible starting at a position within it you specify.  Once you've compiled
1669 a pattern into a pattern buffer (@pxref{GNU Regular Expression
1670 Compiling}), you can ask the matcher to match that pattern against a
1671 string using:
1672
1673 @findex re_match
1674 @example
1675 int
1676 re_match (struct re_pattern_buffer *@var{pattern_buffer},
1677           const char *@var{string}, const int @var{size},
1678           const int @var{start}, struct re_registers *@var{regs})
1679 @end example
1680
1681 @noindent
1682 @var{pattern_buffer} is the address of a pattern buffer containing a
1683 compiled pattern.  @var{string} is the string you want to match; it can
1684 contain newline and null characters.  @var{size} is the length of that
1685 string.  @var{start} is the string index at which you want to
1686 begin matching; the first character of @var{string} is at index zero.
1687 @xref{Using Registers}, for an explanation of @var{regs}; you can safely
1688 pass zero if you don't need that information.
1689
1690 @code{re_match} matches the regular expression in @var{pattern_buffer}
1691 against the string @var{string} according to the syntax in
1692 @var{pattern_buffer}'s @code{syntax} field.  (@xref{GNU Regular
1693 Expression Compiling}, for how to set it.)  The function returns
1694 @math{-1} if the compiled pattern does not match @var{string} starting
1695 at @var{start} and @math{-2} if an internal error happens; otherwise, it
1696 returns how many (possibly zero) characters of @var{string} the pattern
1697 matched.
1698
1699 An example: suppose @var{pattern_buffer} points to a pattern buffer
1700 containing the compiled pattern for @samp{a*}, and @var{string} points
1701 to @samp{aaaaab} (whereupon @var{size} should be 6). Then if @var{start}
1702 is 2, @code{re_match} returns 3, i.e., @samp{a*} would have matched the
1703 last three @samp{a}s in @var{string}.  If @var{start} is 0,
1704 @code{re_match} returns 5, i.e., @samp{a*} would have matched all the
1705 @samp{a}s in @var{string}.  If @var{start} is either 5 or 6, it returns
1706 zero.
1707
1708 If @var{start} is not between zero and @var{size}, then
1709 @code{re_match} returns @math{-1}.
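
The @samp{a*} example above can be reproduced with a short sketch.  It
assumes glibc's @file{regex.h} with @code{_GNU_SOURCE} defined; the helper
@code{match_at} is hypothetical and compiles the pattern afresh on every
call only to keep the example self-contained.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU API */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: return what re_match reports for PATTERN against
   STRING starting at index START, or -2 if PATTERN does not compile.  */
int
match_at (const char *pattern, const char *string, int start)
{
  struct re_pattern_buffer buf;
  int result;

  assert (pattern != NULL && string != NULL);

  memset (&buf, 0, sizeof buf);
  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  /* Passing zero for `regs': we only want the match length.  */
  result = re_match (&buf, string, (int) strlen (string), start, NULL);

  regfree (&buf);
  return result;
}
```

With this helper, @code{match_at ("a*", "aaaaab", 2)} yields 3, a
@var{start} of 0 yields 5, and a @var{start} of 5 or 6 yields zero.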
1710
1711
1712 @node GNU Searching
1713 @subsection GNU Searching
1714
1715 @cindex searching with GNU functions
1716
1717 @dfn{Searching} means trying to match starting at successive positions
1718 within a string.  The function @code{re_search} does this.
1719
1720 Before calling @code{re_search}, you must compile your regular
1721 expression.  @xref{GNU Regular Expression Compiling}.
1722
1723 Here is the function declaration:
1724
1725 @findex re_search
1726 @example
1727 int
1728 re_search (struct re_pattern_buffer *@var{pattern_buffer},
1729            const char *@var{string}, const int @var{size},
1730            const int @var{start}, const int @var{range},
1731            struct re_registers *@var{regs})
1732 @end example
1733
1734 @noindent
1735 @vindex start @r{argument to @code{re_search}}
1736 @vindex range @r{argument to @code{re_search}}
1737 whose arguments are the same as those to @code{re_match} (@pxref{GNU
1738 Matching}) except that the two arguments @var{start} and @var{range}
1739 replace @code{re_match}'s argument @var{start}.
1740
1741 If @var{range} is positive, then @code{re_search} attempts a match
1742 starting first at index @var{start}, then at @math{@var{start} + 1} if
1743 that fails, and so on, up to @math{@var{start} + @var{range}}; if
1744 @var{range} is negative, then it attempts a match starting first at
1745 index @var{start}, then at @math{@var{start} - 1} if that fails, and so
1746 on, down to @math{@var{start} + @var{range}}.
1747
1748 If @var{start} is not between zero and @var{size}, then @code{re_search}
1749 returns @math{-1}.  When @var{range} is positive, @code{re_search}
1750 adjusts @var{range} so that @math{@var{start} + @var{range} - 1} is
1751 between zero and @var{size}, if necessary; that way it won't search
1752 outside of @var{string}.  Similarly, when @var{range} is negative,
1753 @code{re_search} adjusts @var{range} so that @math{@var{start} +
1754 @var{range} + 1} is between zero and @var{size}, if necessary.
1755
1756 If the @code{fastmap} field of @var{pattern_buffer} is zero,
1757 @code{re_search} matches starting at consecutive positions; otherwise,
1758 it uses @code{fastmap} to make the search more efficient.
1759 @xref{Searching with Fastmaps}.
1760
1761 If no match is found, @code{re_search} returns @math{-1}.  If
1762 a match is found, it returns the index where the match began.  If an
1763 internal error happens, it returns @math{-2}.
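
Here is a small sketch of forward and backward searching.  It assumes
glibc's @file{regex.h} with @code{_GNU_SOURCE} defined; the helper
@code{search_from} is hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU API */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: search STRING for PATTERN, trying start positions
   from START through START + RANGE (RANGE may be negative).  Returns the
   index where the match begins, -1 if there is none, or -2 on error.  */
int
search_from (const char *pattern, const char *string, int start, int range)
{
  struct re_pattern_buffer buf;
  int result;

  assert (pattern != NULL && string != NULL);

  memset (&buf, 0, sizeof buf);     /* `fastmap' is zero: no fastmap */
  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  result = re_search (&buf, string, (int) strlen (string),
                      start, range, NULL);

  regfree (&buf);
  return result;
}
```

For instance, searching @samp{aaabbb} for @samp{b+} from index 0 with a
@var{range} of 6 finds the match at index 3, while a @var{range} of 2
finds nothing because only indices 0 through 2 are tried.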
1764
1765
1766 @node Matching/Searching with Split Data
1767 @subsection Matching and Searching with Split Data
1768
1769 Using the functions @code{re_match_2} and @code{re_search_2}, you can
1770 match or search in data that is divided into two strings.
1771
1772 The function:
1773
1774 @findex re_match_2
1775 @example
1776 int
1777 re_match_2 (struct re_pattern_buffer *@var{buffer},
1778             const char *@var{string1}, const int @var{size1},
1779             const char *@var{string2}, const int @var{size2},
1780             const int @var{start},
1781             struct re_registers *@var{regs},
1782             const int @var{stop})
1783 @end example
1784
1785 @noindent
1786 is similar to @code{re_match} (@pxref{GNU Matching}) except that you
1787 pass @emph{two} data strings and sizes, and an index @var{stop} beyond
1788 which you don't want the matcher to try matching.  As with
1789 @code{re_match}, if it succeeds, @code{re_match_2} returns how many
1790 characters of the data it matched.  Regard @var{string1} and
1791 @var{string2} as concatenated when you set the arguments @var{start} and
1792 @var{stop} and use the contents of @var{regs}; @code{re_match_2} never
1793 returns a value larger than @math{@var{size1} + @var{size2}}.
1794
1795 The function:
1796
1797 @findex re_search_2
1798 @example
1799 int
1800 re_search_2 (struct re_pattern_buffer *@var{buffer},
1801              const char *@var{string1}, const int @var{size1},
1802              const char *@var{string2}, const int @var{size2},
1803              const int @var{start}, const int @var{range},
1804              struct re_registers *@var{regs},
1805              const int @var{stop})
1806 @end example
1807
1808 @noindent
1809 is similarly related to @code{re_search}.
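
As a sketch of searching split data, the hypothetical helper below looks
for a pattern across two strings treated as one.  It assumes glibc's
@file{regex.h} with @code{_GNU_SOURCE} defined.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU API */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: search the concatenation of S1 and S2 for
   PATTERN, trying start positions 0 through STOP and never matching
   beyond index STOP.  Returns the index of the match in the
   concatenation, -1 if there is none, or -2 on error.  */
int
split_search (const char *pattern, const char *s1, const char *s2, int stop)
{
  struct re_pattern_buffer buf;
  int result;

  assert (pattern != NULL && s1 != NULL && s2 != NULL);

  memset (&buf, 0, sizeof buf);
  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  result = re_search_2 (&buf, s1, (int) strlen (s1), s2, (int) strlen (s2),
                        0,          /* start */
                        stop,       /* range */
                        NULL,       /* regs */
                        stop);

  regfree (&buf);
  return result;
}
```

Searching @samp{foo} followed by @samp{bar} for @samp{ob} with a
@var{stop} of 6 reports index 2, a match that spans both strings.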
1810
1811
1812 @node Searching with Fastmaps
1813 @subsection Searching with Fastmaps
1814
1815 @cindex fastmaps
1816 If you're searching through a long string, you should use a fastmap.
1817 Without one, the searcher tries to match at consecutive positions in the
1818 string.  Generally, most of the characters in the string could not start
1819 a match.  It takes much longer to try matching at a given position in the
1820 string than it does to check in a table whether or not the character at
1821 that position could start a match.  A @dfn{fastmap} is such a table.
1822
1823 More specifically, a fastmap is an array indexed by the characters in
1824 your character set.  Under the @sc{ascii} encoding, therefore, a fastmap
1825 has 256 elements.  If you want the searcher to use a fastmap with a
1826 given pattern buffer, you must allocate the array and assign the array's
1827 address to the pattern buffer's @code{fastmap} field.  You can either
1828 compile the fastmap yourself or have @code{re_search} do it for you;
1829 when @code{fastmap} is nonzero, it automatically compiles a fastmap the
1830 first time you search using a particular compiled pattern.
1831
1832 To compile a fastmap yourself, use:
1833
1834 @findex re_compile_fastmap
1835 @example
1836 int
1837 re_compile_fastmap (struct re_pattern_buffer *@var{pattern_buffer})
1838 @end example
1839
1840 @noindent
1841 @var{pattern_buffer} is the address of a pattern buffer.  If the
1842 character @var{c} could start a match for the pattern,
1843 @code{re_compile_fastmap} makes
1844 @code{@var{pattern_buffer}->fastmap[@var{c}]} nonzero.  It returns
1845 @math{0} if it can compile a fastmap and @math{-2} if there is an
1846 internal error.  For example, if @samp{|} is the alternation operator
1847 and @var{pattern_buffer} holds the compiled pattern for @samp{a|b}, then
1848 @code{re_compile_fastmap} sets @code{fastmap['a']} and
1849 @code{fastmap['b']} (and no others).
1850
1851 @code{re_search} uses a fastmap as it moves along in the string: it
1852 checks the string's characters until it finds one that's in the fastmap.
1853 Then it tries matching at that character.  If the match fails, it
1854 repeats the process.  So, by using a fastmap, @code{re_search} doesn't
1855 waste time trying to match at positions in the string that couldn't
1856 start a match.
1857
1858 If you don't want @code{re_search} to use a fastmap,
1859 store zero in the @code{fastmap} field of the pattern buffer before
1860 calling @code{re_search}.
1861
1862 Once you've initialized a pattern buffer's @code{fastmap} field, you
1863 need never do so again---even if you compile a new pattern in
1864 it---provided the way the field is set still reflects whether or not you
1865 want a fastmap.  @code{re_search} will still either do nothing if
1866 @code{fastmap} is null or, if it isn't, compile a new fastmap for the
1867 new pattern.
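
The @samp{a|b} example can be checked with the sketch below.  It assumes
glibc's @file{regex.h} with @code{_GNU_SOURCE} defined, a 256-element
fastmap, and a syntax in which @samp{|} is the alternation operator; the
helper @code{build_fastmap} is hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU API */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN, build its fastmap, and copy the
   256 fastmap entries into OUT.  Returns 0 on success, -1 on failure.  */
int
build_fastmap (const char *pattern, char out[256])
{
  struct re_pattern_buffer buf;
  int status = -1;

  assert (pattern != NULL && out != NULL);

  memset (&buf, 0, sizeof buf);
  buf.fastmap = malloc (256);       /* we allocate; Regex fills it in */
  if (buf.fastmap == NULL)
    return -1;

  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) == NULL
      && re_compile_fastmap (&buf) == 0)
    {
      memcpy (out, buf.fastmap, 256);
      status = 0;
    }

  /* In glibc, regfree also frees the malloc'ed `fastmap' field.  */
  regfree (&buf);
  return status;
}
```

For @samp{a|b}, the entries for @samp{a} and @samp{b} come back nonzero
and an entry such as the one for @samp{c} stays zero.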
1868
1869 @node GNU Translate Tables
1870 @subsection GNU Translate Tables
1871
1872 If you set the @code{translate} field of a pattern buffer to a translate
1873 table, then the @sc{gnu} Regex functions to which you've passed that
1874 pattern buffer use it to apply a simple transformation
1875 to all the regular expression and string characters at which they look.
1876
1877 A @dfn{translate table} is an array indexed by the characters in your
1878 character set.  Under the @sc{ascii} encoding, therefore, a translate
1879 table has 256 elements.  The array's elements are also characters in
1880 your character set.  When the Regex functions see a character @var{c},
1881 they use @code{translate[@var{c}]} in its place, with one exception: the
1882 character after a @samp{\} is not translated.  (This ensures that the
1883 operators, e.g., @samp{\B} and @samp{\b}, are always distinguishable.)
1884
1885 For example, a table that maps all lowercase letters to the
1886 corresponding uppercase ones would cause the matcher to ignore
1887 differences in case.@footnote{A table that maps all uppercase letters to
1888 the corresponding lowercase ones would work just as well for this
1889 purpose.}  Such a table would map all characters except lowercase letters
1890 to themselves, and lowercase letters to the corresponding uppercase
1891 ones.  Under the @sc{ascii} encoding, here's how you could initialize
1892 such a table (we'll call it @code{case_fold}):
1893
1894 @example
1895 for (i = 0; i < 256; i++)
1896   case_fold[i] = i;
1897 for (i = 'a'; i <= 'z'; i++)
1898   case_fold[i] = i - ('a' - 'A');
1899 @end example
1900
1901 You tell Regex to use a translate table on a given pattern buffer by
1902 assigning that table's address to the @code{translate} field of that
1903 buffer.  If you don't want Regex to do any translation, put zero into
1904 this field.  You'll get weird results if you change the table's contents
1905 anytime between compiling the pattern buffer, compiling its fastmap, and
1906 matching or searching with the pattern buffer.
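
Putting the pieces together, a case-insensitive match with the
@code{case_fold} table might look like the sketch below.  It assumes
glibc's @file{regex.h} with @code{_GNU_SOURCE} defined, where the
@code{translate} field points to @code{unsigned char}; the helper
@code{fold_match} is hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU API */
#include <assert.h>
#include <regex.h>
#include <string.h>

static unsigned char case_fold[256];

/* Hypothetical helper: match PATTERN against STRING from index 0,
   ignoring case.  Returns re_match's result, or -2 if PATTERN does not
   compile.  */
int
fold_match (const char *pattern, const char *string)
{
  struct re_pattern_buffer buf;
  int i, result;

  assert (pattern != NULL && string != NULL);

  /* Map lowercase letters to uppercase, everything else to itself.  */
  for (i = 0; i < 256; i++)
    case_fold[i] = i;
  for (i = 'a'; i <= 'z'; i++)
    case_fold[i] = i - ('a' - 'A');

  memset (&buf, 0, sizeof buf);
  buf.translate = case_fold;        /* must be set before compiling */
  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  result = re_match (&buf, string, (int) strlen (string), 0, NULL);

  /* The table is static, so detach it first: regfree would otherwise
     pass the `translate' pointer to free.  */
  buf.translate = NULL;
  regfree (&buf);
  return result;
}
```

The table stays unchanged between compiling and matching, as required
above.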
1907
1908 @node Using Registers
1909 @subsection Using Registers
1910
1911 A group in a regular expression can match a (possibly empty) substring
1912 of the string that the regular expression as a whole matched.  The matcher
1913 remembers the beginning and end of the substring matched by
1914 each group.
1915
1916 To find out what they matched, pass a nonzero @var{regs} argument to a
1917 @sc{gnu} matching or searching function (@pxref{GNU Matching} and
1918 @ref{GNU Searching}), i.e., the address of a structure of this type, as
1919 defined in @file{regex.h}:
1920
1921 @c We don't bother to include this directly from regex.h,
1922 @c since it changes so rarely.
1923 @example
1924 @tindex re_registers
1925 @vindex num_regs @r{in @code{struct re_registers}}
1926 @vindex start @r{in @code{struct re_registers}}
1927 @vindex end @r{in @code{struct re_registers}}
1928 struct re_registers
1929 @{
1930   unsigned num_regs;
1931   regoff_t *start;
1932   regoff_t *end;
1933 @};
1934 @end example
1935
1936 Except for (possibly) the @var{num_regs}'th element (see below), the
1937 @var{i}th element of the @code{start} and @code{end} arrays records
1938 information about the @var{i}th group in the pattern.  (They're declared
1939 as C pointers, but this is only because not all C compilers accept
1940 zero-length arrays; conceptually, it is simplest to think of them as
1941 arrays.)
1942
1943 The @code{start} and @code{end} arrays are allocated in various ways,
1944 depending on the value of the @code{regs_allocated}
1945 @vindex regs_allocated
1946 field in the pattern buffer passed to the matcher.
1947
1948 The simplest and perhaps most useful way is to let the matcher (re)allocate
1949 enough space to record information for all the groups in the regular
1950 expression.  If @code{regs_allocated} is @code{REGS_UNALLOCATED},
1951 @vindex REGS_UNALLOCATED
1952 the matcher allocates @math{1 + @var{re_nsub}} elements (@code{re_nsub} is
1953 another field in the pattern buffer; @pxref{GNU Pattern Buffers}).  It sets
1954 the extra element to @math{-1} and @code{regs_allocated} to @code{REGS_REALLOCATE}.
1955 @vindex REGS_REALLOCATE
1956 Then on subsequent calls with the same pattern buffer and @var{regs}
1957 arguments, the matcher reallocates more space if necessary.
1958
1959 It would perhaps be more logical to make the @code{regs_allocated} field
1960 part of the @code{re_registers} structure, instead of part of the
1961 pattern buffer.  But in that case the caller would be forced to
1962 initialize the structure before passing it.  Much existing code doesn't
1963 do this initialization, and it's arguably better to avoid it anyway.
1964
1965 @code{re_compile_pattern} sets @code{regs_allocated} to
1966 @code{REGS_UNALLOCATED}, so if you use the @sc{gnu} regular expression
1967 functions, you get this behavior by default.
1969
1970 @c xx document re_set_registers
1971
1972 @sc{posix}, on the other hand, requires a different interface:  the
1973 caller is supposed to pass in a fixed-length array which the matcher
1974 fills.  Therefore, if @code{regs_allocated} is @code{REGS_FIXED},
1975 @vindex REGS_FIXED
1976 the matcher simply fills that array.
1977
1978 The following examples illustrate the information recorded in the
1979 @code{re_registers} structure.  (In all of them, @samp{(} represents the
1980 open-group and @samp{)} the close-group operator.  The first character
1981 in the string @var{string} is at index 0.)
1982
1983 @c xx i'm not sure this is all true anymore.
1984
1985 @itemize @bullet
1986
1987 @item
1988 If the regular expression has an @w{@var{i}-th}
1989 group not contained within another group that matches a
1990 substring of @var{string}, then the function sets
1991 @code{@w{@var{regs}->}start[@var{i}]} to the index in @var{string} where
1992 the substring matched by the @w{@var{i}-th} group begins, and
1993 @code{@w{@var{regs}->}end[@var{i}]} to the index just beyond that
1994 substring's end.  The function sets @code{@w{@var{regs}->}start[0]} and
1995 @code{@w{@var{regs}->}end[0]} to analogous information about the entire
1996 pattern.
1997
1998 For example, when you match @samp{((a)(b))} against @samp{ab}, you get:
1999
2000 @itemize
2001 @item
2002 0 in @code{@w{@var{regs}->}start[0]} and 2 in @code{@w{@var{regs}->}end[0]}
2003
2004 @item
2005 0 in @code{@w{@var{regs}->}start[1]} and 2 in @code{@w{@var{regs}->}end[1]}
2006
2007 @item
2008 0 in @code{@w{@var{regs}->}start[2]} and 1 in @code{@w{@var{regs}->}end[2]}
2009
2010 @item
2011 1 in @code{@w{@var{regs}->}start[3]} and 2 in @code{@w{@var{regs}->}end[3]}
2012 @end itemize
2013
2014 @item
2015 If a group matches more than once (as it might if followed by,
2016 e.g., a repetition operator), then the function reports the information
2017 about what the group @emph{last} matched.
2018
2019 For example, when you match the pattern @samp{(a)*} against the string
2020 @samp{aa}, you get:
2021
2022 @itemize
2023 @item
2024 0 in @code{@w{@var{regs}->}start[0]} and 2 in @code{@w{@var{regs}->}end[0]}
2025
2026 @item
2027 1 in @code{@w{@var{regs}->}start[1]} and 2 in @code{@w{@var{regs}->}end[1]}
2028 @end itemize
2029
2030 @item
2031 If the @w{@var{i}-th} group does not participate in a
2032 successful match, e.g., it is an alternative not taken or a
2033 repetition operator allows zero repetitions of it, then the function
2034 sets @code{@w{@var{regs}->}start[@var{i}]} and
2035 @code{@w{@var{regs}->}end[@var{i}]} to @math{-1}.
2036
2037 For example, when you match the pattern @samp{(a)*b} against
2038 the string @samp{b}, you get:
2039
2040 @itemize
2041 @item
2042 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]}
2043
2044 @item
2045 @math{-1} in @code{@w{@var{regs}->}start[1]} and @math{-1} in @code{@w{@var{regs}->}end[1]}
2046 @end itemize
2047
2048 @item
2049 If the @w{@var{i}-th} group matches a zero-length string, then the
2050 function sets @code{@w{@var{regs}->}start[@var{i}]} and
2051 @code{@w{@var{regs}->}end[@var{i}]} to the index just beyond that
2052 zero-length string.
2053
2054 For example, when you match the pattern @samp{(a*)b} against the string
2055 @samp{b}, you get:
2056
2057 @itemize
2058 @item
2059 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]}
2060
2061 @item
2062 0 in @code{@w{@var{regs}->}start[1]} and 0 in @code{@w{@var{regs}->}end[1]}
2063 @end itemize
2064
2065 @ignore
2066 The function sets @code{@w{@var{regs}->}start[0]} and
2067 @code{@w{@var{regs}->}end[0]} to analogous information about the entire
2068 pattern.
2069
2070 For example, when you match the pattern @samp{(a*)} against the empty
2071 string, you get:
2072
2073 @itemize
2074 @item
2075 0 in @code{@w{@var{regs}->}start[0]} and 0 in @code{@w{@var{regs}->}end[0]}
2076
2077 @item
2078 0 in @code{@w{@var{regs}->}start[1]} and 0 in @code{@w{@var{regs}->}end[1]}
2079 @end itemize
2080 @end ignore
2081
2082 @item
2083 If an @w{@var{i}-th} group contains a @w{@var{j}-th} group
2084 in turn not contained within any other group within group @var{i} and
2085 the function reports a match of the @w{@var{i}-th} group, then it
2086 records in @code{@w{@var{regs}->}start[@var{j}]} and
2087 @code{@w{@var{regs}->}end[@var{j}]} the last match (if it matched) of
2088 the @w{@var{j}-th} group.
2089
2090 For example, when you match the pattern @samp{((a*)b)*} against the
2091 string @samp{abb}, @w{group 2} last matches the empty string, so you
2092 get:
2093
2094 @itemize
2095 @item
2096 0 in @code{@w{@var{regs}->}start[0]} and 3 in @code{@w{@var{regs}->}end[0]}
2097
2098 @item
2099 2 in @code{@w{@var{regs}->}start[1]} and 3 in @code{@w{@var{regs}->}end[1]}
2100
2101 @item
2102 2 in @code{@w{@var{regs}->}start[2]} and 2 in @code{@w{@var{regs}->}end[2]}
2103 @end itemize
2104
2105 When you match the pattern @samp{((a)*b)*} against the string
2106 @samp{abb}, @w{group 2} doesn't participate in the last match, so you
2107 get:
2108
2109 @itemize
2110 @item
2111 0 in @code{@w{@var{regs}->}start[0]} and 3 in @code{@w{@var{regs}->}end[0]}
2112
2113 @item
2114 2 in @code{@w{@var{regs}->}start[1]} and 3 in @code{@w{@var{regs}->}end[1]}
2115
2116 @item
2117 0 in @code{@w{@var{regs}->}start[2]} and 1 in @code{@w{@var{regs}->}end[2]}
2118 @end itemize
2119
2120 @item
2121 If an @w{@var{i}-th} group contains a @w{@var{j}-th} group
2122 in turn not contained within any other group within group @var{i}
2123 and the function sets
2124 @code{@w{@var{regs}->}start[@var{i}]} and
2125 @code{@w{@var{regs}->}end[@var{i}]} to @math{-1}, then it also sets
2126 @code{@w{@var{regs}->}start[@var{j}]} and
2127 @code{@w{@var{regs}->}end[@var{j}]} to @math{-1}.
2128
2129 For example, when you match the pattern @samp{((a)*b)*c} against the
2130 string @samp{c}, you get:
2131
2132 @itemize
2133 @item
2134 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]}
2135
2136 @item
2137 @math{-1} in @code{@w{@var{regs}->}start[1]} and @math{-1} in @code{@w{@var{regs}->}end[1]}
2138
2139 @item
2140 @math{-1} in @code{@w{@var{regs}->}start[2]} and @math{-1} in @code{@w{@var{regs}->}end[2]}
2141 @end itemize
2142
2143 @end itemize
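
The simpler cases above can be checked with a sketch like the following.
It assumes glibc's @file{regex.h} with @code{_GNU_SOURCE} defined, and
that the matcher allocates the @code{start} and @code{end} arrays with
@code{malloc} so that @code{free} releases them; the helper
@code{group_offsets} is hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU API */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match PATTERN against STRING from index 0 and
   copy the offsets of up to MAX_GROUPS groups (group 0 is the whole
   match) into STARTS and ENDS.  Returns re_match's result, or -2 if
   PATTERN does not compile.  */
int
group_offsets (const char *pattern, const char *string,
               int starts[], int ends[], int max_groups)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  int result, i;

  assert (pattern != NULL && string != NULL);
  assert (starts != NULL && ends != NULL);

  memset (&buf, 0, sizeof buf);
  memset (&regs, 0, sizeof regs);
  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  /* `regs_allocated' is REGS_UNALLOCATED after compiling, so on a
     successful match the matcher allocates regs.start and regs.end.  */
  result = re_match (&buf, string, (int) strlen (string), 0, &regs);
  if (result >= 0)
    for (i = 0; i < max_groups && i <= (int) buf.re_nsub; i++)
      {
        starts[i] = regs.start[i];
        ends[i] = regs.end[i];
      }

  free (regs.start);            /* both are still NULL if nothing matched */
  free (regs.end);
  regfree (&buf);
  return result;
}
```

Matching @samp{((a)(b))} against @samp{ab} this way fills in the four
pairs of offsets listed in the first example above.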
2144
2145 @node Freeing GNU Pattern Buffers
2146 @subsection Freeing GNU Pattern Buffers
2147
2148 To free any allocated fields of a pattern buffer, you can use the
2149 @sc{posix} function described in @ref{Freeing POSIX Pattern Buffers},
2150 since the type @code{regex_t}---the type for @sc{posix} pattern
2151 buffers---is equivalent to the type @code{re_pattern_buffer}.  After
2152 freeing a pattern buffer, you need to again compile a regular expression
2153 in it (@pxref{GNU Regular Expression Compiling}) before passing it to
2154 a matching or searching function.
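
As an illustration, one pattern buffer can serve several patterns in turn
if it is freed and recompiled between uses.  The sketch assumes glibc's
@file{regex.h} with @code{_GNU_SOURCE} defined, and that freeing leaves
the buffer's pointer fields zeroed so it can be compiled into again; the
helper @code{match_each} is hypothetical.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU API */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: match each of the N patterns against STRING from
   index 0, reusing a single pattern buffer, and store each re_match
   result in RESULTS.  Returns 0, or -1 if some pattern fails to compile.  */
int
match_each (const char *patterns[], int n, const char *string, int results[])
{
  struct re_pattern_buffer buf;
  int i;

  assert (patterns != NULL && string != NULL && results != NULL);

  memset (&buf, 0, sizeof buf);
  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;

  for (i = 0; i < n; i++)
    {
      if (re_compile_pattern (patterns[i], strlen (patterns[i]), &buf)
          != NULL)
        {
          regfree (&buf);
          return -1;
        }
      results[i] = re_match (&buf, string, (int) strlen (string), 0, NULL);

      /* Free the compiled pattern; the buffer must be compiled into
         again before the next match.  */
      regfree (&buf);
    }
  return 0;
}
```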
2155
2156
2157 @node POSIX Regex Functions
2158 @section POSIX Regex Functions
2159
2160 If you're writing code that has to be @sc{posix} compatible, you'll need
2161 to use these functions. Their interfaces are as specified by @sc{posix},
2162 draft 1003.2/D11.2.
2163
2164 @menu
2165 * POSIX Pattern Buffers::               The regex_t type.
2166 * POSIX Regular Expression Compiling::  regcomp ()
2167 * POSIX Matching::                      regexec ()
2168 * Reporting Errors::                    regerror ()
2169 * Using Byte Offsets::                  The regmatch_t type.
2170 * Freeing POSIX Pattern Buffers::       regfree ()
2171 @end menu
2172
2173
2174 @node POSIX Pattern Buffers
2175 @subsection POSIX Pattern Buffers
2176
2177 To compile or match a given regular expression the @sc{posix} way, you
2178 must supply a pattern buffer exactly the way you do for @sc{gnu}
2179 (@pxref{GNU Pattern Buffers}).  @sc{posix} pattern buffers have type
2180 @code{regex_t}, which is equivalent to the @sc{gnu} pattern buffer
2181 type @code{re_pattern_buffer}.
2182
2183
2184 @node POSIX Regular Expression Compiling
2185 @subsection POSIX Regular Expression Compiling
2186
2187 With @sc{posix}, you can only search for a given regular expression; you
2188 can't match it.  To do this, you must first compile it in a
2189 pattern buffer, using @code{regcomp}.
2190
2191 @ignore
2192 Before calling @code{regcomp}, you must initialize this pattern buffer
2193 as you do for @sc{gnu} (@pxref{GNU Regular Expression Compiling}).  See
2194 below, however, for how to choose a syntax with which to compile.
2195 @end ignore
2196
2197 To compile a pattern buffer, use:
2198
2199 @findex regcomp
2200 @example
2201 int
2202 regcomp (regex_t *@var{preg}, const char *@var{regex}, int @var{cflags})
2203 @end example
2204
2205 @noindent
2206 @var{preg} is the initialized pattern buffer's address, @var{regex} is
2207 the regular expression's address, and @var{cflags} is the compilation
2208 flags, which Regex considers as a collection of bits.  Here are the
2209 valid bits, as defined in @file{regex.h}:
2210
2211 @table @code
2212
2213 @item REG_EXTENDED
2214 @vindex REG_EXTENDED
2215 says to use @sc{posix} Extended Regular Expression syntax; if this isn't
2216 set, @code{regcomp} uses @sc{posix} Basic Regular Expression syntax.
2217 @code{regcomp} sets @var{preg}'s @code{syntax} field accordingly.
2218
2219 @item REG_ICASE
2220 @vindex REG_ICASE
2221 @cindex ignoring case
2222 says to ignore case; @code{regcomp} sets @var{preg}'s @code{translate}
2223 field to a translate table which ignores case, replacing anything you've
2224 put there before.
2225
2226 @item REG_NOSUB
2227 @vindex REG_NOSUB
2228 says to set @var{preg}'s @code{no_sub} field; see @ref{POSIX Matching},
2229 for what this means.
2230
2231 @item REG_NEWLINE
2232 @vindex REG_NEWLINE
2233 says that a:
2234
2235 @itemize @bullet
2236
2237 @item
2238 match-any-character operator (@pxref{Match-any-character
2239 Operator}) doesn't match a newline.
2240
2241 @item
2242 nonmatching list not containing a newline (@pxref{List
2243 Operators}) matches a newline.
2244
2245 @item
2246 match-beginning-of-line operator (@pxref{Match-beginning-of-line
2247 Operator}) matches the empty string immediately after a newline,
2248 regardless of how @code{REG_NOTBOL} is set (@pxref{POSIX Matching}, for
2249 an explanation of @code{REG_NOTBOL}).
2250
2251 @item
2252 match-end-of-line operator (@pxref{Match-end-of-line
2253 Operator}) matches the empty string immediately before a newline,
2254 regardless of how @code{REG_NOTEOL} is set (@pxref{POSIX Matching},
2255 for an explanation of @code{REG_NOTEOL}).
2256
2257 @end itemize
2258
2259 @end table
2260
2261 If @code{regcomp} successfully compiles the regular expression, it
2262 returns zero and sets @code{*@var{preg}} to the compiled
2263 pattern. Except for @code{syntax} (which it sets as explained above), it
2264 also sets the same fields the same way as does the @sc{gnu} compiling
2265 function (@pxref{GNU Regular Expression Compiling}).
2266
2267 If @code{regcomp} can't compile the regular expression, it returns one
2268 of the error codes listed here.  (Except when noted differently, the
2269 syntax in all examples below is basic regular expression syntax.)
2270
2271 @table @code
2272
2273 @comment repetitions
2274 @item REG_BADRPT
2275 For example, the consecutive repetition operators @samp{**} in
2276 @samp{a**} are invalid.  As another example, if the syntax is extended
2277 regular expression syntax, then the repetition operator @samp{*} with
2278 nothing on which to operate in @samp{*} is invalid.
2279
2280 @item REG_BADBR
2281 For example, the @var{count} @samp{-1} in @samp{a\@{-1} is invalid.
2282
2283 @item REG_EBRACE
2284 For example, @samp{a\@{1} is missing a close-interval operator.
2285
2286 @comment lists
2287 @item REG_EBRACK
2288 For example, @samp{[a} is missing a close-list operator.
2289
2290 @item REG_ERANGE
2291 For example, the range ending point @samp{a} that collates lower than
2292 does its starting point @samp{z} in @samp{[z-a]} is invalid.  Also, the
2293 range with the character class @samp{[:alpha:]} as its starting point in
2294 @samp{[[:alpha:]-|]} is invalid.
2295
2296 @item REG_ECTYPE
2297 For example, the character class name @samp{foo} in @samp{[[:foo:]]} is
2298 invalid.
2299
2300 @comment groups
2301 @item REG_EPAREN
2302 For example, @samp{a\)} is missing an open-group operator and @samp{\(a}
2303 is missing a close-group operator.
2304
2305 @item REG_ESUBREG
2306 For example, the back reference @samp{\2} that refers to a nonexistent
2307 subexpression in @samp{\(a\)\2} is invalid.
2308
2309 @comment unfinished business
2310
2311 @item REG_EEND
2312 Returned when a regular expression causes no other more specific error.
2313
2314 @item REG_EESCAPE
2315 For example, the trailing backslash @samp{\} in @samp{a\} is invalid, as is the
2316 one in @samp{\}.
2317
2318 @comment kitchen sink
2319 @item REG_BADPAT
2320 For example, in the extended regular expression syntax, the empty group
2321 @samp{()} in @samp{a()b} is invalid.
2322
2323 @comment internal
2324 @item REG_ESIZE
2325 Returned when a regular expression needs a pattern buffer larger than
2326 65536 bytes.
2327
2328 @item REG_ESPACE
2329 Returned when a regular expression causes Regex to run out of memory.
2330
2331 @end table
2332
2333
2334 @node POSIX Matching
2335 @subsection POSIX Matching
2336
2337 Matching the @sc{posix} way means trying to match a null-terminated
2338 string starting at its first character.  Once you've compiled a pattern
2339 into a pattern buffer (@pxref{POSIX Regular Expression Compiling}), you
2340 can ask the matcher to match that pattern against a string using:
2341
2342 @findex regexec
2343 @example
2344 int
2345 regexec (const regex_t *@var{preg}, const char *@var{string},
2346          size_t @var{nmatch}, regmatch_t @var{pmatch}[], int @var{eflags})
2347 @end example
2348
2349 @noindent
2350 @var{preg} is the address of a pattern buffer for a compiled pattern.
2351 @var{string} is the string you want to match.
2352
2353 @xref{Using Byte Offsets}, for an explanation of @var{pmatch}.  If you
2354 pass zero for @var{nmatch} or you compiled @var{preg} with the
2355 compilation flag @code{REG_NOSUB} set, then @code{regexec} will ignore
2356 @var{pmatch}; otherwise, you must allocate it to have at least
2357 @var{nmatch} elements.  @code{regexec} will record at most @var{nmatch}
2358 pairs of byte offsets in @var{pmatch}, and set to @math{-1} any unused
2359 elements up to @code{@var{pmatch}[@var{nmatch} - 1]}.
2360
2361 @var{eflags} specifies @dfn{execution flags}---namely, the two bits
2362 @code{REG_NOTBOL} and @code{REG_NOTEOL} (defined in @file{regex.h}).  If
2363 you set @code{REG_NOTBOL}, then the match-beginning-of-line operator
2364 (@pxref{Match-beginning-of-line Operator}) always fails to match.
2365 This lets you match against pieces of a line, as you would need to if,
2366 say, searching for repeated instances of a given pattern in a line; it
2367 would work correctly for patterns both with and without
2368 match-beginning-of-line operators.  @code{REG_NOTEOL} works analogously
2369 for the match-end-of-line operator (@pxref{Match-end-of-line
2370 Operator}); it exists for symmetry.
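
As a rough sketch of the repeated-search case described above, the
following helper counts the non-overlapping matches in one line.  The
name @code{count_matches}, the use of @code{REG_EXTENDED}, and the way
empty matches are stepped over are assumptions made for illustration,
not part of Regex.  Every call after the first passes
@code{REG_NOTBOL}, because by then the string handed to @code{regexec}
no longer starts at the true beginning of the line.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Count non-overlapping matches of PATTERN (extended syntax) in LINE.
   Returns -1 if PATTERN does not compile.  count_matches is a
   hypothetical helper, not a Regex function.  */
int
count_matches (const char *pattern, const char *line)
{
  regex_t re;
  regmatch_t m;
  const char *p = line;
  int eflags = 0;   /* The first call really starts at the line's beginning.  */
  int count = 0;

  assert (pattern != NULL && line != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;

  while (regexec (&re, p, 1, &m, eflags) == 0)
    {
      count++;
      if (m.rm_eo > m.rm_so)
        p += m.rm_eo;           /* Resume just past this match.  */
      else if (p[m.rm_eo] == '\0')
        break;                  /* Empty match at the end of LINE.  */
      else
        p += m.rm_eo + 1;       /* Step over one byte after an empty match.  */
      eflags = REG_NOTBOL;      /* P now points into the middle of LINE.  */
    }

  regfree (&re);
  return count;
}
```

With this sketch, counting @samp{^a} in @samp{aaa} should report one
match rather than three, since the later calls cannot match the
match-beginning-of-line operator.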
2371
2372 @code{regexec} tries to find a match for @var{preg} in @var{string}
2373 according to the syntax in @var{preg}'s @code{syntax} field.
2374 (@xref{POSIX Regular Expression Compiling}, for how to set it.)  The
2375 function returns zero if the compiled pattern matches @var{string} and
2376 @code{REG_NOMATCH} (defined in @file{regex.h}) if it doesn't.
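
A minimal sketch of the compile, match, and free sequence follows.  The
helper name @code{posix_match} and its return convention (1, 0, or
@math{-1}) are assumptions made for this example.  It compiles with
@code{REG_NOSUB} and passes zero for @var{nmatch}, so @code{regexec}
ignores @var{pmatch} and a null pointer can be passed for it.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN (extended syntax) matches STRING, 0 if it does
   not, and -1 if PATTERN cannot be compiled or matching fails for some
   other reason.  posix_match is a hypothetical helper.  */
int
posix_match (const char *pattern, const char *string)
{
  regex_t preg;
  int status;

  assert (pattern != NULL && string != NULL);
  if (regcomp (&preg, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  /* No offsets are wanted, so NMATCH is 0 and PMATCH is ignored.  */
  status = regexec (&preg, string, 0, NULL, 0);
  regfree (&preg);

  if (status == 0)
    return 1;
  return status == REG_NOMATCH ? 0 : -1;
}
```

For instance, @code{posix_match ("^fo+$", "foo")} should return 1, and
the same pattern tried against @samp{bar} should return 0.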
2377
2378 @node Reporting Errors
2379 @subsection Reporting Errors
2380
2381 If either @code{regcomp} or @code{regexec} fails, it returns a nonzero
2382 error code, the possibilities for which are defined in @file{regex.h}.
2383 @xref{POSIX Regular Expression Compiling}, and @ref{POSIX Matching}, for
2384 what these codes mean.  To get an error string corresponding to these
2385 codes, you can use:
2386
2387 @findex regerror
2388 @example
2389 size_t
2390 regerror (int @var{errcode},
2391           const regex_t *@var{preg},
2392           char *@var{errbuf},
2393           size_t @var{errbuf_size})
2394 @end example
2395
2396 @noindent
2397 @var{errcode} is an error code, @var{preg} is the address of the pattern
2398 buffer which provoked the error, @var{errbuf} is the error buffer, and
2399 @var{errbuf_size} is @var{errbuf}'s size.
2400
2401 @code{regerror} returns the size in bytes of the error string
2402 corresponding to @var{errcode} (including its terminating null).  If
2403 @var{errbuf} and @var{errbuf_size} are nonzero, it also returns in
2404 @var{errbuf} the first @math{@var{errbuf_size} - 1} characters of the
2405 error string, followed by a null.
2406 @var{errbuf_size} must be a nonnegative number less than or equal to the
2407 size in bytes of @var{errbuf}.
2408
2409 You can call @code{regerror} with a null @var{errbuf} and a zero
2410 @var{errbuf_size} to determine how large @var{errbuf} need be to
2411 accommodate @code{regerror}'s error string.
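
The two-call idiom described above might look like the following
sketch.  The helper name @code{compile_error_message} and the choice to
return a @code{malloc}ed string are assumptions made for illustration.
The first call to @code{regerror} only asks for the required size; the
second fills a buffer of exactly that size.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/* Try to compile PATTERN (extended syntax).  Return NULL if it
   compiles, otherwise a malloc'ed copy of the error string, which the
   caller must free.  compile_error_message is a hypothetical helper.  */
char *
compile_error_message (const char *pattern)
{
  regex_t preg;
  int errcode;
  size_t needed;
  char *errbuf;

  assert (pattern != NULL);
  errcode = regcomp (&preg, pattern, REG_EXTENDED);
  if (errcode == 0)
    {
      regfree (&preg);
      return NULL;
    }

  /* First call: null buffer and zero size, to learn the size needed
     (including the terminating null).  */
  needed = regerror (errcode, &preg, NULL, 0);
  errbuf = malloc (needed);
  if (errbuf != NULL)
    regerror (errcode, &preg, errbuf, needed);   /* Second call fills it.  */
  return errbuf;
}
```

The exact wording of the message depends on the implementation, so a
caller should treat it as text for people to read rather than something
to compare against.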
2412
2413 @node Using Byte Offsets
2414 @subsection Using Byte Offsets
2415
2416 In @sc{posix}, variables of type @code{regmatch_t} hold information
2417 analogous, but not identical, to that in @sc{gnu}'s registers (@pxref{Using
2418 Registers}).  To get information about registers in @sc{posix}, pass to
2419 @code{regexec} a nonzero @var{pmatch} of type @code{regmatch_t}, i.e.,
2420 the address of a structure of this type, defined in
2421 @file{regex.h}:
2422
2423 @tindex regmatch_t
2424 @example
2425 typedef struct
2426 @{
2427   regoff_t rm_so;
2428   regoff_t rm_eo;
2429 @} regmatch_t;
2430 @end example
2431
2432 When reading in @ref{Using Registers}, about how the matching function
2433 stores the information into the registers, substitute @var{pmatch} for
2434 @var{regs}, @code{@w{@var{pmatch}[@var{i}].}rm_so} for
2435 @code{@w{@var{regs}->}start[@var{i}]} and
2436 @code{@w{@var{pmatch}[@var{i}].}rm_eo} for
2437 @code{@w{@var{regs}->}end[@var{i}]}.
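
As an illustration of reading the offsets, the sketch below copies the
text matched by one group out of the string.  The helper name
@code{copy_group}, the limit @code{MAX_GROUPS}, and the return
convention are assumptions made for this example.  Element 0 of
@var{pmatch} describes the whole match, element @var{i} describes group
@var{i}, and a group that took no part in the match has @code{rm_so}
set to @math{-1}.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10   /* Hypothetical limit chosen for this sketch.  */

/* Match PATTERN (extended syntax) against STRING and copy the text of
   group number GROUP into OUT.  Return 0 on success, or -1 if there is
   no match, the group was unused, or OUT is too small.  */
int
copy_group (const char *pattern, const char *string, size_t group,
            char *out, size_t out_size)
{
  regex_t preg;
  regmatch_t pmatch[MAX_GROUPS];
  int result = -1;

  assert (pattern != NULL && string != NULL && out != NULL);
  if (group >= MAX_GROUPS || regcomp (&preg, pattern, REG_EXTENDED) != 0)
    return -1;

  if (regexec (&preg, string, MAX_GROUPS, pmatch, 0) == 0
      && pmatch[group].rm_so != -1)
    {
      /* rm_so and rm_eo are byte offsets into STRING.  */
      size_t len = (size_t) (pmatch[group].rm_eo - pmatch[group].rm_so);
      if (len < out_size)
        {
          memcpy (out, string + pmatch[group].rm_so, len);
          out[len] = '\0';
          result = 0;
        }
    }

  regfree (&preg);
  return result;
}
```

Matching @samp{([a-z]+)@@([a-z.]+)} against @samp{bob@@example.org},
group 1 should come back as @samp{bob} and group 2 as
@samp{example.org}.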
2438
2439 @node Freeing POSIX Pattern Buffers
2440 @subsection Freeing POSIX Pattern Buffers
2441
2442 To free any allocated fields of a pattern buffer, use:
2443
2444 @findex regfree
2445 @example
2446 void
2447 regfree (regex_t *@var{preg})
2448 @end example
2449
2450 @noindent
2451 @var{preg} is the pattern buffer whose allocated fields you want freed.
2452 @code{regfree} also sets @var{preg}'s @code{allocated} and @code{used}
2453 fields to zero.  After freeing a pattern buffer, you need to again
2454 compile a regular expression in it (@pxref{POSIX Regular Expression
2455 Compiling}) before passing it to the matching function (@pxref{POSIX
2456 Matching}).
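
One way to reuse a single pattern buffer for several patterns is
sketched below.  The helper name @code{count_matching_patterns} is an
assumption made for illustration.  Each pattern is compiled into the
same @code{regex_t}, used once, and freed before the next
@code{regcomp} call.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return how many of the N patterns (extended syntax) match STRING.
   count_matching_patterns is a hypothetical helper.  */
int
count_matching_patterns (const char *const patterns[], size_t n,
                         const char *string)
{
  regex_t preg;   /* One buffer, recompiled for each pattern.  */
  int count = 0;
  size_t i;

  assert (patterns != NULL && string != NULL);
  for (i = 0; i < n; i++)
    {
      /* Skip patterns that fail to compile; there is then no compiled
         pattern to free.  */
      if (regcomp (&preg, patterns[i], REG_EXTENDED | REG_NOSUB) != 0)
        continue;
      if (regexec (&preg, string, 0, NULL, 0) == 0)
        count++;
      regfree (&preg);   /* Release before the next regcomp reuses PREG.  */
    }
  return count;
}
```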
2457
2458
2459 @node BSD Regex Functions
2460 @section BSD Regex Functions
2461
2462 If you're writing code that has to be Berkeley @sc{unix} compatible,
2463 you'll need to use these functions, whose interfaces are the same as those
2464 in Berkeley @sc{unix}.
2465
2466 @menu
2467 * BSD Regular Expression Compiling::    re_comp ()
2468 * BSD Searching::                       re_exec ()
2469 @end menu
2470
2471 @node BSD Regular Expression Compiling
2472 @subsection BSD Regular Expression Compiling
2473
2474 With Berkeley @sc{unix}, you can only search for a given regular
2475 expression; you can't match one.  To search for it, you must first
2476 compile it.  Before you compile it, you must indicate the regular
2477 expression syntax you want it compiled according to by setting the
2478 variable @code{re_syntax_options} (declared in @file{regex.h}) to some
2479 syntax (@pxref{Regular Expression Syntax}).
2480
2481 To compile a regular expression, use:
2482
2483 @findex re_comp
2484 @example
2485 char *
2486 re_comp (char *@var{regex})
2487 @end example
2488
2489 @noindent
2490 @var{regex} is the address of a null-terminated regular expression.
2491 @code{re_comp} uses an internal pattern buffer, so you can use only the
2492 most recently compiled pattern buffer.  This means that if you want to
2493 use a given regular expression that you've already compiled---but it
2494 isn't the latest one you've compiled---you'll have to recompile it.  If
2495 you call @code{re_comp} with a null pointer (@emph{not} the empty
2496 string) as the argument, it doesn't change the contents of the pattern
2497 buffer.
2498
2499 If @code{re_comp} successfully compiles the regular expression, it
2500 returns zero.  If it can't compile the regular expression, it returns
2501 an error string.  @code{re_comp}'s error messages are identical to those
2502 of @code{re_compile_pattern} (@pxref{GNU Regular Expression
2503 Compiling}).
2504
2505 @node BSD Searching
2506 @subsection BSD Searching
2507
2508 Searching the Berkeley @sc{unix} way means searching in a string
2509 starting at its first character and trying successive positions within
2510 it to find a match.  Once you've compiled a pattern using @code{re_comp}
2511 (@pxref{BSD Regular Expression Compiling}), you can ask Regex
2512 to search for that pattern in a string using:
2513
2514 @findex re_exec
2515 @example
2516 int
2517 re_exec (char *@var{string})
2518 @end example
2519
2520 @noindent
2521 @var{string} is the address of the null-terminated string in which you
2522 want to search.
2523
2524 @code{re_exec} returns either 1 for success or 0 for failure.  It
2525 automatically uses a @sc{gnu} fastmap (@pxref{Searching with Fastmaps}).
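
The pairing of @code{re_comp} and @code{re_exec} might be wrapped as in
the sketch below.  It rests on several assumptions: that it is built
against the GNU C Library, where defining @code{_REGEX_RE_COMP} before
including @file{regex.h} is what makes the two declarations visible;
that @code{re_syntax_options} is left at its initial value, so the
patterns used should stick to ordinary characters; and that the helper
name @code{bsd_search} is hypothetical.  Other systems may declare
these functions elsewhere or not provide them at all.

```c
/* Assumption: the GNU C Library declares re_comp and re_exec only
   when this macro is defined before <regex.h> is included.  */
#define _REGEX_RE_COMP 1
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compile REGEX into the single internal pattern buffer and search
   STRING for it.  A null REGEX reuses the most recently compiled
   pattern.  Return 1 if found, 0 if not, and -1 if compiling fails.
   bsd_search is a hypothetical helper.  */
int
bsd_search (const char *regex, const char *string)
{
  assert (string != NULL);
  if (re_comp (regex) != NULL)   /* Non-null means an error string.  */
    return -1;
  return re_exec (string);
}
```

Because there is only one internal pattern buffer, a call such as
@code{bsd_search (NULL, "a word")} made right after searching for
@samp{wor} should search for @samp{wor} again.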