Java supports regular expressions through the classes in the java.util.regex
package in the standard Java library. While the Java regular expression
library differs from PCRE in some advanced features, the two share most of
their syntax, so many patterns and expressions can be used unchanged in Java
and other languages.
In the examples below, be sure to import the following package at the top of the source file if you are trying out the code:
import java.util.regex.*;
Regular expression String patterns
In Java, regular strings can contain special characters (also known
as escape sequences), which are characters that are preceded by a
backslash (\) and identify a special piece of text like a newline (\n)
or a tab character (\t). As a result, when writing regular expressions
in Java code, you need to escape the backslash in each metacharacter to
let the compiler know that it's not an errant escape sequence.
For example, take the pattern "There are \d dogs". In Java, you would
escape the backslash of the digit metacharacter by using the escape
sequence \\ (effectively escaping the backslash with itself) to create
the pattern "There are \\d dogs".
This is only necessary when hard-coding patterns into Java code, as
strings that are read in from user input or from files are read
character by character and escape sequences are not interpreted.
A common way to get around this problem is to put the patterns in a
Properties or resource file so they are easier to read and understand.
Keep in mind that the .properties format applies its own backslash
escaping when it is loaded, so a plain text resource file avoids the
issue more cleanly.
Other languages like C# or Python support the notion of raw strings, but Java has yet to add this useful feature into the core language.
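The escaping rule described above can be sketched as follows. This is a minimal illustration, not a definitive implementation; the class name EscapeDemo and the helper isDogCount are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: true when the whole input reads "There are <digit> dogs".
    static boolean isDogCount(String input) {
        // "\\d" in Java source becomes the two characters \d in the compiled pattern.
        return Pattern.matches("There are \\d dogs", input);
    }

    public static void main(String[] args) {
        System.out.println(isDogCount("There are 5 dogs"));   // true
        System.out.println(isDogCount("There are no dogs"));  // false
        // The literal holds a single backslash, so the string is 17 characters long.
        System.out.println("There are \\d dogs".length());    // 17
    }
}
```

The length check at the end shows that the doubled backslash exists only in the source code; the string the regular expression engine sees contains a single backslash.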
Matching a string
Working with regular expressions in Java generally involves
instantiating a Pattern and matching it against some text. The simplest
way to do this is to call the static method Pattern.matches(), which
takes the regular expression and an input string to match it against,
and simply returns whether the pattern matches the entire string.
boolean isMatch = Pattern.matches(String regex, String inputStr)
However, this does not give you any additional information such as
where in the input string the pattern matches, or the groups that
matched. So for most purposes, it is both more useful and more
efficient to compile a Pattern once and then use it to create a new
Matcher for each input string that you are matching against, which will
hold the results of the match.
Pattern ptrn = Pattern.compile(String regex)
Matcher matcher = ptrn.matcher(String inputStr)
// Let's use a regular expression to match a date string.
Pattern ptrn = Pattern.compile("([a-zA-Z]+) (\\d+)");
Matcher matcher = ptrn.matcher("June 24");
if (matcher.matches()) {
    // Indeed, the expression "([a-zA-Z]+) (\d+)" matches the date string

    // To get the indices of the match, you can read the Matcher object's
    // start and end values.
    // This will print "Match at index [0, 7)", since it matches at the
    // beginning and end of the string
    System.out.println("Match at index [" + matcher.start() + ", " + matcher.end() + ")");

    // To get the fully matched text, you can read the Matcher object's group
    // This will print "Match: June 24"
    System.out.println("Match: " + matcher.group());
}
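One detail worth illustrating is that matches() only succeeds when the pattern covers the entire input, whereas find() looks for the pattern anywhere inside it. The sketch below assumes the same date pattern as above; the class MatchDemo and the helpers isDate and containsDate are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class MatchDemo {
    // Compile once and reuse the Pattern for every input string.
    static final Pattern DATE = Pattern.compile("([a-zA-Z]+) (\\d+)");

    // True only when the entire input is a date such as "June 24".
    static boolean isDate(String input) {
        return DATE.matcher(input).matches();
    }

    // True when a date appears anywhere inside the input.
    static boolean containsDate(String input) {
        return DATE.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isDate("June 24"));              // true
        System.out.println(isDate("Due June 24th"));        // false
        System.out.println(containsDate("Due June 24th"));  // true
    }
}
```

Keeping the compiled Pattern in a static field avoids recompiling the expression on every call, which is the efficiency benefit mentioned above.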
Capturing groups
Capturing groups in a regular expression is as straightforward as
matching a string in the example above. After using a Pattern to match
an input string, you can just iterate through the extracted groups in
the returned Matcher.
// Let's use a regular expression to capture data from a few date strings.
String pattern = "([a-zA-Z]+) (\\d+)";
Pattern ptrn = Pattern.compile(pattern);
Matcher matcher = ptrn.matcher("June 24, August 9, Dec 12");

// This will print each of the matches and the index in the input string
// where the match was found:
//   Match: June 24 at index [0, 7)
//   Match: August 9 at index [9, 17)
//   Match: Dec 12 at index [19, 25)
while (matcher.find()) {
    System.out.println(String.format("Match: %s at index [%d, %d)",
        matcher.group(), matcher.start(), matcher.end()));
}

// If we are iterating over the groups in the match again, first reset the
// matcher to start at the beginning of the input string.
matcher.reset();

// For each match, we can extract the captured information by reading the
// captured groups.
while (matcher.find()) {
    // This will print the number of capturing groups in the pattern
    // (2 here; group 0, the whole match, is not counted)
    System.out.println(String.format("%d groups captured", matcher.groupCount()));

    // This will print the month and day of each match. Remember that
    // group 0 is always the whole matched text, so the month starts at
    // index 1 instead.
    System.out.println("Month: " + matcher.group(1) + ", Day: " + matcher.group(2));

    // Each group in the match also has a start and end index, which is the
    // index in the input string where the group was found.
    System.out.println(String.format("Month found at [%d, %d)",
        matcher.start(1), matcher.end(1)));
}
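In practice the captured groups are often collected into a data structure rather than printed. The following is a minimal sketch under that assumption; the class GroupDemo, the helper extractDates, and the "month/day" output format are all hypothetical choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns each date in the input as "month/day".
    static List<String> extractDates(String input) {
        Pattern ptrn = Pattern.compile("([a-zA-Z]+) (\\d+)");
        Matcher matcher = ptrn.matcher(input);
        List<String> dates = new ArrayList<>();
        // find() advances to the next match on each call until none remain.
        while (matcher.find()) {
            // Group 1 is the month and group 2 is the day.
            dates.add(matcher.group(1) + "/" + matcher.group(2));
        }
        return dates;
    }

    public static void main(String[] args) {
        // Prints [June/24, August/9, Dec/12]
        System.out.println(extractDates("June 24, August 9, Dec 12"));
    }
}
```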
Finding and replacing strings
Another common task is to find and replace a part of a string using
regular expressions, for example, to replace all instances of an old
email domain, or to swap the order of some text. You can do this in
Java with the Matcher.replaceAll() and Matcher.replaceFirst() methods.
Both of these methods first reset the matcher to start at the beginning
of the input string. replaceAll() then replaces every match up to the
end of the string, while replaceFirst() replaces only the first match.
The replacement string can contain references to captured groups in the pattern (using the dollar sign $), or just a regular literal string.
String replacedString = matcher.replaceAll(String replacement)
String replacedString = matcher.replaceFirst(String replacement)
// Let's try to reverse the order of the day and month in a few date
// strings. Notice how the replacement string also contains metacharacters
// (the back references to the captured groups, written as $1 and $2).
Pattern ptrn = Pattern.compile("([a-zA-Z]+) (\\d+)");
Matcher matcher = ptrn.matcher("June 24, August 9, Dec 12");

// This will reorder the dates and print:
//   24 of June, 9 of August, 12 of Dec
// Remember that group 0 is always the full matched text, so the
// month and day indices start from 1 instead of zero.
String replacedString = matcher.replaceAll("$2 of $1");
System.out.println(replacedString);
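Two related cases may be worth sketching: replacing only the first match, and inserting replacement text that itself contains a dollar sign. For the second case the standard Matcher.quoteReplacement() method escapes the replacement so that it is treated literally. The class ReplaceDemo and the helpers swapFirst and replaceWithLiteral are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {
    static final Pattern DATE = Pattern.compile("([a-zA-Z]+) (\\d+)");

    // Swaps month and day in the first date only.
    static String swapFirst(String input) {
        return DATE.matcher(input).replaceFirst("$2 of $1");
    }

    // Replaces every date with the given text, treating any '$' or '\' in it literally.
    static String replaceWithLiteral(String input, String literal) {
        return DATE.matcher(input).replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        // Prints: 24 of June, August 9, Dec 12
        System.out.println(swapFirst("June 24, August 9, Dec 12"));
        // Prints: $1, $1
        System.out.println(replaceWithLiteral("June 24, August 9", "$1"));
    }
}
```

Without quoteReplacement(), the "$1" passed to replaceWithLiteral would be interpreted as a reference to the first captured group instead of literal text.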
Pattern Flags
When compiling a Pattern, you will notice that you can pass in
additional flags to change how input strings are matched. Most of the
available flags are a convenience and can be written into the regular
expression itself directly, but some can be useful in certain cases.
Pattern.CASE_INSENSITIVE makes the pattern case insensitive so that it matches strings of different capitalizations.
Pattern.MULTILINE is useful if your input string has newline characters (\n): it allows the start and end metacharacters (^ and $ respectively) to match at the beginning and end of each line instead of at the beginning and end of the whole input string.
Pattern.DOTALL allows the dot metacharacter (.) to match newline characters as well.
Pattern.LITERAL makes the pattern literal, in the sense that metacharacters and escape sequences in it are matched as-is. For example, the pattern "\d" will match a backslash followed by a 'd' character as opposed to a digit character.
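The effect of each flag listed above can be sketched with a few small checks. The class FlagDemo and its helper methods are hypothetical names for this illustration; the flags themselves are the constants on java.util.regex.Pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // CASE_INSENSITIVE: "june", "June" and "JUNE" all match.
    static boolean isJuneDate(String input) {
        return Pattern.compile("june \\d+", Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    // MULTILINE: ^ matches at the start of every line, not just the start of the input.
    static int countLinesStartingWithWord(String input) {
        Matcher m = Pattern.compile("^[a-zA-Z]+", Pattern.MULTILINE).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // DOTALL: the dot also matches a newline character.
    static boolean dotMatchesNewline(String input) {
        return Pattern.compile("a.b", Pattern.DOTALL).matcher(input).matches();
    }

    // LITERAL: the pattern \d matches a backslash followed by 'd', not a digit.
    static boolean isLiteralBackslashD(String input) {
        return Pattern.compile("\\d", Pattern.LITERAL).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isJuneDate("JUNE 24"));                                      // true
        System.out.println(countLinesStartingWithWord("June 24\nAugust 9\nDec 12"));    // 3
        System.out.println(dotMatchesNewline("a\nb"));                                  // true
        System.out.println(isLiteralBackslashD("5"));                                   // false
    }
}
```

As an example of writing a flag into the expression itself, the inline form "(?i)june \\d+" should behave like the CASE_INSENSITIVE version above.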