org.apache.regexp

Class RE

Implemented Interfaces:
Serializable

public class RE
extends java.lang.Object
implements Serializable

RE is an efficient, lightweight regular expression evaluator/matcher class. Regular expressions are pattern descriptions which enable sophisticated matching of strings. In addition to being able to match a string against a pattern, you can also extract parts of the match. This is especially useful in text parsing! Details on the syntax of regular expression patterns are given below.

To compile a regular expression (RE), you can simply construct an RE matcher object from the string specification of the pattern, like this:

  RE r = new RE("a*b");
 

Once you have done this, you can call either of the RE.match methods to perform matching on a String. For example:

  boolean matched = r.match("aaaab");
 
will cause the boolean matched to be set to true because the pattern "a*b" matches the string "aaaab".

If you were interested in the number of a's which matched the first part of our example expression, you could change the expression to "(a*)b". Then when you compiled the expression and matched it against something like "xaaaab", you would get results like this:

  RE r = new RE("(a*)b");                  // Compile expression
  boolean matched = r.match("xaaaab");     // Match against "xaaaab"

  String wholeExpr = r.getParen(0);        // wholeExpr will be 'aaaab'
  String insideParens = r.getParen(1);     // insideParens will be 'aaaa'

  int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
  int endWholeExpr = r.getParenEnd(0);     // endWholeExpr will be index 6
  int lenWholeExpr = r.getParenLength(0);  // lenWholeExpr will be 5

  int startInside = r.getParenStart(1);    // startInside will be index 1
  int endInside = r.getParenEnd(1);        // endInside will be index 5
  int lenInside = r.getParenLength(1);     // lenInside will be 4
 
You can also refer to the contents of a parenthesized expression within a regular expression itself. This is called a 'backreference'. The first backreference in a regular expression is denoted by \1, the second by \2 and so on. So the expression:
  ([0-9]+)=\1
 
will match any string of the form n=n (like 0=0 or 2=2).

The full regular expression syntax accepted by RE is described here:


  Characters

    unicodeChar   Matches any identical unicode character
    \                    Used to quote a meta-character (like '*')
    \\                   Matches a single '\' character
    \0nnn                Matches a given octal character
    \xhh                 Matches a given 8-bit hexadecimal character
    \\uhhhh              Matches a given 16-bit hexadecimal character
    \t                   Matches an ASCII tab character
    \n                   Matches an ASCII newline character
    \r                   Matches an ASCII return character
    \f                   Matches an ASCII form feed character


  Character Classes

    [abc]                Simple character class
    [a-zA-Z]             Character class with ranges
    [^abc]               Negated character class
 
NOTE: Incomplete ranges will be interpreted as "starts from zero" or "ends with last character".
I.e. [-a] is the same as [\\u0000-a], and [a-] is the same as [a-\\uFFFF], [-] means "all characters".

  Standard POSIX Character Classes

    [:alnum:]            Alphanumeric characters.
    [:alpha:]            Alphabetic characters.
    [:blank:]            Space and tab characters.
    [:cntrl:]            Control characters.
    [:digit:]            Numeric characters.
    [:graph:]            Characters that are printable and are also visible.
                         (A space is printable, but not visible, while an
                         `a' is both.)
    [:lower:]            Lower-case alphabetic characters.
    [:print:]            Printable characters (characters that are not
                         control characters.)
    [:punct:]            Punctuation characters (characters that are not letter,
                         digits, control characters, or space characters).
    [:space:]            Space characters (such as space, tab, and formfeed,
                         to name a few).
    [:upper:]            Upper-case alphabetic characters.
    [:xdigit:]           Characters that are hexadecimal digits.


  Non-standard POSIX-style Character Classes

    [:javastart:]        Start of a Java identifier
    [:javapart:]         Part of a Java identifier


  Predefined Classes

    .         Matches any character other than newline
    \w        Matches a "word" character (alphanumeric plus "_")
    \W        Matches a non-word character
    \s        Matches a whitespace character
    \S        Matches a non-whitespace character
    \d        Matches a digit character
    \D        Matches a non-digit character


  Boundary Matchers

    ^         Matches only at the beginning of a line
    $         Matches only at the end of a line
    \b        Matches only at a word boundary
    \B        Matches only at a non-word boundary


  Greedy Closures

    A*        Matches A 0 or more times (greedy)
    A+        Matches A 1 or more times (greedy)
    A?        Matches A 1 or 0 times (greedy)
    A{n}      Matches A exactly n times (greedy)
    A{n,}     Matches A at least n times (greedy)
    A{n,m}    Matches A at least n but not more than m times (greedy)


  Reluctant Closures

    A*?       Matches A 0 or more times (reluctant)
    A+?       Matches A 1 or more times (reluctant)
    A??       Matches A 0 or 1 times (reluctant)


  Logical Operators

    AB        Matches A followed by B
    A|B       Matches either A or B
    (A)       Used for subexpression grouping
   (?:A)      Used for subexpression clustering (just like grouping but
              no backrefs)


  Backreferences

    \1    Backreference to 1st parenthesized subexpression
    \2    Backreference to 2nd parenthesized subexpression
    \3    Backreference to 3rd parenthesized subexpression
    \4    Backreference to 4th parenthesized subexpression
    \5    Backreference to 5th parenthesized subexpression
    \6    Backreference to 6th parenthesized subexpression
    \7    Backreference to 7th parenthesized subexpression
    \8    Backreference to 8th parenthesized subexpression
    \9    Backreference to 9th parenthesized subexpression
 

All closure operators (+, *, ?, {m,n}) are greedy by default, meaning that they match as many elements of the string as possible without causing the overall match to fail. If you want a closure to be reluctant (non-greedy), you can simply follow it with a '?'. A reluctant closure will match as few elements of the string as possible when finding matches. {m,n} closures don't currently support reluctancy.

Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

RE runs programs compiled by the RECompiler class. But the RE matcher class does not include the actual regular expression compiler for reasons of efficiency. In fact, if you want to pre-compile one or more regular expressions, the 'recompile' class can be invoked from the command line to produce compiled output like this:

    // Pre-compiled regular expression "a*b"
    char[] re1Instructions =
    {
        0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
        0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
        0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
        0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
        0x0000,
    };


    REProgram re1 = new REProgram(re1Instructions);
 
You can then construct a regular expression matcher (RE) object from the pre-compiled expression re1 and thus avoid the overhead of compiling the expression at runtime. If you require more dynamic regular expressions, you can construct a single RECompiler object and re-use it to compile each expression. Similarly, you can change the program run by a given matcher object at any time. However, RE and RECompiler are not threadsafe (for efficiency reasons, and because requiring thread safety in this class is deemed to be a rare requirement), so you will need to construct a separate compiler or matcher object for each thread (unless you do thread synchronization yourself). Once expression compiled into the REProgram object, REProgram can be safely shared across multiple threads and RE objects.


ISSUES:

Version:
$Id: RE.java 518156 2007-03-14 14:31:26Z vgritsenko $
Authors:
Jonathan Locke
Tobias Schäfer
See Also:
recompile, RECompiler

Field Summary

(package private) static char
E_ALNUM
(package private) static char
E_BOUND
(package private) static char
E_DIGIT
(package private) static char
E_NALNUM
(package private) static char
E_NBOUND
(package private) static char
E_NDIGIT
(package private) static char
E_NSPACE
(package private) static char
E_SPACE
static int
MATCH_CASEINDEPENDENT
Flag to indicate that matching should be case-independent (folded)
static int
MATCH_MULTILINE
Newlines should match as BOL/EOL (^ and $)
static int
MATCH_NORMAL
Specifies normal, case-sensitive matching behaviour.
static int
MATCH_SINGLELINE
Consider all input a single body of text - newlines are matched by .
(package private) static int
MAX_PAREN
(package private) static char
OP_ANY
(package private) static char
OP_ANYOF
(package private) static char
OP_ATOM
(package private) static char
OP_BACKREF
(package private) static char
OP_BOL
(package private) static char
OP_BRANCH
(package private) static char
OP_CLOSE
(package private) static char
OP_CLOSE_CLUSTER
(package private) static char
OP_CONTINUE
(package private) static char
OP_END
* The format of a node in a program is: * * [ OPCODE ] [ OPDATA ] [ OPNEXT ] [ OPERAND ] * * char OPCODE - instruction * char OPDATA - modifying data * char OPNEXT - next node (relative offset) * *
(package private) static char
OP_EOL
(package private) static char
OP_ESCAPE
(package private) static char
OP_GOTO
(package private) static char
OP_MAYBE
(package private) static char
OP_NOTHING
(package private) static char
OP_OPEN
(package private) static char
OP_OPEN_CLUSTER
(package private) static char
OP_PLUS
(package private) static char
OP_POSIXCLASS
(package private) static char
OP_RELUCTANTMAYBE
(package private) static char
OP_RELUCTANTPLUS
(package private) static char
OP_RELUCTANTSTAR
(package private) static char
OP_STAR
(package private) static char
POSIX_CLASS_ALNUM
(package private) static char
POSIX_CLASS_ALPHA
(package private) static char
POSIX_CLASS_BLANK
(package private) static char
POSIX_CLASS_CNTRL
(package private) static char
POSIX_CLASS_DIGIT
(package private) static char
POSIX_CLASS_GRAPH
(package private) static char
POSIX_CLASS_JPART
(package private) static char
POSIX_CLASS_JSTART
(package private) static char
POSIX_CLASS_LOWER
(package private) static char
POSIX_CLASS_PRINT
(package private) static char
POSIX_CLASS_PUNCT
(package private) static char
POSIX_CLASS_SPACE
(package private) static char
POSIX_CLASS_UPPER
(package private) static char
POSIX_CLASS_XDIGIT
static int
REPLACE_ALL
Flag bit that indicates that subst should replace all occurrences of this regular expression.
static int
REPLACE_BACKREFERENCES
Flag bit that indicates that subst should replace backreferences
static int
REPLACE_FIRSTONLY
Flag bit that indicates that subst should only replace the first occurrence of this regular expression.
(package private) int
end0
(package private) int
end1
(package private) int
end2
(package private) int[]
endBackref
(package private) int[]
endn
(package private) int
matchFlags
(package private) static int
maxNode
(package private) int
maxParen
(package private) static int
nodeSize
(package private) static int
offsetNext
(package private) static int
offsetOpcode
(package private) static int
offsetOpdata
(package private) int
parenCount
(package private) REProgram
program
(package private) CharacterIterator
search
(package private) int
start0
(package private) int
start1
(package private) int
start2
(package private) int[]
startBackref
(package private) int[]
startn

Constructor Summary

RE()
Constructs a regular expression matcher with no initial program.
RE(String pattern)
Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler.
RE(String pattern, int matchFlags)
Constructs a regular expression matcher from a String by compiling it using a new instance of RECompiler.
RE(REProgram program)
Construct a matcher for a pre-compiled regular expression from program (bytecode) data.
RE(REProgram program, int matchFlags)
Construct a matcher for a pre-compiled regular expression from program (bytecode) data.

Method Summary

private void
allocParens()
Performs lazy allocation of subexpression arrays
private int
compareChars(char c1, char c2, boolean caseIndependent)
Compares two characters.
int
getMatchFlags()
Returns the current match behaviour flags.
String
getParen(int which)
Gets the contents of a parenthesized subexpression after a successful match.
int
getParenCount()
Returns the number of parenthesized subexpressions available after a successful match.
int
getParenEnd(int which)
Returns the end index of a given paren level.
int
getParenLength(int which)
Returns the length of a given paren level.
int
getParenStart(int which)
Returns the start index of a given paren level.
REProgram
getProgram()
Returns the current regular expression program in use by this matcher object.
String[]
grep(Object[] search)
Returns an array of Strings, whose toString representation matches a regular expression.
protected void
internalError(String s)
Throws an Error representing an internal error condition probably resulting from a bug in the regular expression compiler (or possibly data corruption).
private boolean