Regular expressions in Java


Java supports regular expressions through the classes in the java.util.regex package of the standard Java library. Although some advanced features differ between the Java regular expression library and PCRE, the two share most of their syntax, so many patterns can be used unchanged in Java and in other languages.

In the examples below, be sure to import the following package at the top of the source file if you are trying out the code:

import java.util.regex.*;

Regular expression String patterns

In Java, ordinary string literals can contain escape sequences: characters preceded by a backslash ( \ ) that stand for a special piece of text such as a newline ( \n ) or a tab character ( \t ). As a result, when writing regular expressions in Java code, you need to escape the backslash in each metacharacter so that the compiler does not treat it as an errant escape sequence.

For example, take the pattern "There are \d dogs" . In Java, you would escape the backslash of the digit metacharacter by using the escape sequence \\ (effectively escaping the backslash with itself) to create the pattern "There are \\d dogs" .
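As a minimal sketch of the escaping rule described above, the following class compiles the escaped pattern and tests it against two inputs. The class and method names (EscapedPatternDemo, matchesDogCount) are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapedPatternDemo {
    // Returns true if the whole input reads "There are <one digit> dogs".
    static boolean matchesDogCount(String input) {
        // "\\d" in Java source becomes the two-character regex \d (a digit).
        return Pattern.matches("There are \\d dogs", input);
    }

    public static void main(String[] args) {
        // The string the regex engine sees contains a single backslash.
        System.out.println("There are \\d dogs");                  // prints There are \d dogs
        System.out.println(matchesDogCount("There are 3 dogs"));   // prints true
        System.out.println(matchesDogCount("There are no dogs"));  // prints false
    }
}
```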

This is only necessary when hard-coding patterns into Java code: strings that are read from user input or from files are taken character by character, and escape sequences are not interpreted. A common way around the problem is therefore to put the patterns in a resource file, where they are easier to read and understand. Be aware that a .properties file loaded through java.util.Properties applies its own backslash escaping, so backslashes still have to be doubled there; a plain text file avoids this.
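The sketch below illustrates the point about patterns that do not come from Java source. It assumes a plain text file that holds the pattern on its first line; the temporary file and the names PatternFromFileDemo, loadPattern and matchesStoredPattern are hypothetical and exist only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

class PatternFromFileDemo {
    // Reads the first line of a text file and compiles it as a regular expression.
    static Pattern loadPattern(Path file) {
        try {
            String regex = Files.readAllLines(file, StandardCharsets.UTF_8).get(0);
            return Pattern.compile(regex);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a pattern to a temporary file, loads it back and tests the input.
    static boolean matchesStoredPattern(String input) {
        try {
            Path file = Files.createTempFile("date-pattern", ".txt");
            try {
                // On disk the pattern reads ([a-zA-Z]+) (\d+) with one backslash;
                // it is doubled here only because this line is Java source.
                Files.write(file, "([a-zA-Z]+) (\\d+)".getBytes(StandardCharsets.UTF_8));
                return loadPattern(file).matcher(input).matches();
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchesStoredPattern("June 24")); // prints true
        System.out.println(matchesStoredPattern("24 June")); // prints false
    }
}
```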

Other languages such as C# and Python support raw (or verbatim) strings, but Java has yet to add this useful feature to the core language; the text blocks introduced in Java 15 still interpret escape sequences.

Matching a string

Working with regular expressions in Java generally involves instantiating a Pattern and matching it against some text. The simplest way to do this is to call the static method Pattern.matches(), which takes the regular expression and the input string to match it against, and simply returns whether the pattern matches the entire string.

Method
boolean isMatch = Pattern.matches(String regex, String inputStr)
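A short sketch of the call above. Note that Pattern.matches() tests the whole input, so a date followed by extra text does not count as a match. The class and method names (PatternMatchesDemo, isDate) are made up for illustration.

```java
import java.util.regex.Pattern;

class PatternMatchesDemo {
    // Returns true only if the entire input looks like "<letters> <digits>".
    static boolean isDate(String input) {
        return Pattern.matches("[a-zA-Z]+ \\d+", input);
    }

    public static void main(String[] args) {
        System.out.println(isDate("June 24"));    // prints true
        System.out.println(isDate("June 24th"));  // prints false: trailing text is left over
        System.out.println(isDate("24 June"));    // prints false
    }
}
```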

However, this does not give you any additional information such as where in the input string the pattern matches, or the groups that matched. So for most purposes, it is both more useful and more efficient to compile a Pattern once and then use it to create a new Matcher for each input string that you are matching against. The Matcher holds the results of the match.

Methods
Pattern ptrn = Pattern.compile(String regex)
Matcher matcher = ptrn.matcher(String inputStr)
Example
// Let's use a regular expression to match a date string.
Pattern ptrn = Pattern.compile("([a-zA-Z]+) (\\d+)");
Matcher matcher = ptrn.matcher("June 24");
if (matcher.matches()) {
    // Indeed, the expression "([a-zA-Z]+) (\d+)" matches the date string.

    // To get the indices of the match, you can read the Matcher object's
    // start and end values.
    // This will print "Match at index [0, 7)", since the match spans the
    // whole string (the end index is exclusive).
    System.out.println("Match at index [" + matcher.start() + ", " + matcher.end() + ")");

    // To get the fully matched text, you can read the Matcher object's group.
    // This will print "Match: June 24"
    System.out.println("Match: " + matcher.group());
}
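To illustrate the efficiency point, here is a minimal sketch that compiles the pattern once and creates a fresh Matcher for each input. The names ReusePatternDemo, DATE and keepDates are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ReusePatternDemo {
    // Compiled once and shared; a Pattern is immutable and safe to reuse.
    private static final Pattern DATE = Pattern.compile("([a-zA-Z]+) (\\d+)");

    // Returns the inputs that match the date pattern in full.
    static List<String> keepDates(List<String> inputs) {
        List<String> dates = new ArrayList<>();
        for (String input : inputs) {
            // A new Matcher is created per input; the Pattern is not recompiled.
            if (DATE.matcher(input).matches()) {
                dates.add(input);
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("June 24", "24 June", "Dec 12");
        System.out.println(keepDates(inputs)); // prints [June 24, Dec 12]
    }
}
```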

Capturing groups

Capturing groups in a regular expression is as straightforward as matching a string in the example above. After using a Pattern to match an input string, you can just iterate through the extracted groups in the returned Matcher .

Example
// Let's use a regular expression to capture data from a few date strings.
String pattern = "([a-zA-Z]+) (\\d+)";
Pattern ptrn = Pattern.compile(pattern);
Matcher matcher = ptrn.matcher("June 24, August 9, Dec 12");

// This will print each of the matches and the index in the input string
// where the match was found:
//   Match: June 24 at index [0, 7)
//   Match: August 9 at index [9, 17)
//   Match: Dec 12 at index [19, 25)
while (matcher.find()) {
    System.out.println(String.format("Match: %s at index [%d, %d)",
        matcher.group(), matcher.start(), matcher.end()));
}

// If we are iterating over the groups in the match again, first reset the
// matcher to start at the beginning of the input string.
matcher.reset();

// For each match, we can extract the captured information by reading the
// captured groups.
while (matcher.find()) {
    // This will print the number of capturing groups in the pattern (2 here;
    // group 0, the whole match, is not included in the count).
    System.out.println(String.format("%d groups captured", matcher.groupCount()));

    // This will print the month and day of each match. Remember that the
    // first group is always the whole matched text, so the month starts at
    // index 1 instead.
    System.out.println("Month: " + matcher.group(1) + ", Day: " + matcher.group(2));

    // Each group in the match also has a start and end index, which is the
    // position in the input string where the group was found.
    System.out.println(String.format("Month found at [%d, %d)",
        matcher.start(1), matcher.end(1)));
}
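In practice the captured groups are often collected into a data structure rather than printed. The following is one possible sketch, assuming the same date pattern; the names CaptureGroupsDemo and parseDates are hypothetical, and the day is assumed to be small enough to fit in an int.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupsDemo {
    private static final Pattern DATE = Pattern.compile("([a-zA-Z]+) (\\d+)");

    // Maps each month name found in the text to its day, in order of appearance.
    static Map<String, Integer> parseDates(String text) {
        Map<String, Integer> days = new LinkedHashMap<>();
        Matcher matcher = DATE.matcher(text);
        while (matcher.find()) {
            // group(1) is the month, group(2) the day; group(0) would be the whole match.
            days.put(matcher.group(1), Integer.parseInt(matcher.group(2)));
        }
        return days;
    }

    public static void main(String[] args) {
        System.out.println(parseDates("June 24, August 9, Dec 12"));
        // prints {June=24, August=9, Dec=12}
    }
}
```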

Finding and replacing strings

Another common task is to find and replace a part of a string using regular expressions, for example, to replace all instances of an old email domain, or to swap the order of some text. You can do this in Java with the Matcher.replaceAll() and Matcher.replaceFirst() methods. Both methods first reset the matcher to the beginning of the input string; replaceAll() then replaces every match up to the end of the string, while replaceFirst() replaces only the first match.

The replacement string can contain references to captured groups in the pattern (using a dollar sign followed by the group number, such as $1), or it can be just a regular literal string.

Method
String replacedString = matcher.replaceAll(String replacement)
String replacedString = matcher.replaceFirst(String replacement)
Example
// Let's try to reverse the order of the day and month in a few date
// strings. Notice how the replacement string also contains metacharacters
// (the back references to the captured groups). The dollar sign is not a
// Java escape character, so it needs no extra escaping in the string literal.
Pattern ptrn = Pattern.compile("([a-zA-Z]+) (\\d+)");
Matcher matcher = ptrn.matcher("June 24, August 9, Dec 12");

// This will reorder the dates and print:
//   24 of June, 9 of August, 12 of Dec
// Remember that the first group is always the full matched text, so the
// month and day indices start from 1 instead of zero.
String replacedString = matcher.replaceAll("$2 of $1");
System.out.println(replacedString);
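The email-domain case mentioned earlier, and the difference between replaceAll() and replaceFirst(), can be sketched as follows. The domains old.example.com and new.example.com and the names ReplaceDemo, migrateDomain and swapFirstDate are placeholders chosen for illustration, and the local-part pattern is deliberately simplistic.

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    // Captures the part before the @ and escapes the dots so they match literally.
    private static final Pattern OLD_EMAIL = Pattern.compile("([\\w.]+)@old\\.example\\.com");
    private static final Pattern DATE = Pattern.compile("([a-zA-Z]+) (\\d+)");

    // Rewrites every address on the old domain, keeping the local part ($1).
    static String migrateDomain(String text) {
        return OLD_EMAIL.matcher(text).replaceAll("$1@new.example.com");
    }

    // Swaps month and day in the first date only; later dates are left alone.
    static String swapFirstDate(String text) {
        return DATE.matcher(text).replaceFirst("$2 of $1");
    }

    public static void main(String[] args) {
        System.out.println(migrateDomain("ann@old.example.com, bob@old.example.com"));
        // prints ann@new.example.com, bob@new.example.com
        System.out.println(swapFirstDate("June 24, August 9"));
        // prints 24 of June, August 9
    }
}
```

If the replacement text comes from user input and may itself contain a dollar sign or backslash, Matcher.quoteReplacement() can be used to have it treated literally.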

Pattern Flags

When compiling a Pattern, you will notice that you can pass in additional flags to change how input strings are matched. Most of the available flags are a convenience and can be written directly into the regular expression itself, but some can be useful in certain cases.

  • Pattern.CASE_INSENSITIVE makes the pattern case insensitive so that it matches strings of different capitalizations
  • Pattern.MULTILINE is useful when your input string contains newline characters (\n): it allows the start and end metacharacters (^ and $ respectively) to match at the beginning and end of each line instead of only at the beginning and end of the whole input string
  • Pattern.DOTALL allows the dot metacharacter (.) to match new line characters as well
  • Pattern.LITERAL makes the pattern literal, in the sense that the escaped characters are matched as-is. For example, the pattern "\d" will match a backslash followed by a 'd' character as opposed to a digit character
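
The effect of each flag in the list above can be checked with a small sketch like the one below. The helper names (PatternFlagsDemo, fullMatch, countMatches) are made up for illustration; the flags are passed as the second argument to Pattern.compile() and can be combined with the | operator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternFlagsDemo {
    // Tests whether the whole input matches the regex under the given flags.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Counts how many times the regex is found in the input under the given flags.
    static int countMatches(String regex, int flags, String input) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String lines = "one\ntwo\nthree";

        // CASE_INSENSITIVE: capitalization is ignored.
        System.out.println(fullMatch("june", Pattern.CASE_INSENSITIVE, "JUNE")); // true

        // MULTILINE: ^ matches at the start of every line, not just the input.
        System.out.println(countMatches("^\\w+", 0, lines));                 // 1
        System.out.println(countMatches("^\\w+", Pattern.MULTILINE, lines)); // 3

        // DOTALL: the dot also matches the newline character.
        System.out.println(fullMatch("a.b", 0, "a\nb"));              // false
        System.out.println(fullMatch("a.b", Pattern.DOTALL, "a\nb")); // true

        // LITERAL: \d means a backslash followed by 'd', not a digit.
        System.out.println(fullMatch("\\d", Pattern.LITERAL, "\\d")); // true
        System.out.println(fullMatch("\\d", Pattern.LITERAL, "5"));   // false

        // Flags can be combined, or embedded in the pattern, e.g. (?i).
        int combined = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
        System.out.println(fullMatch("^june.*12$", combined, "June\n12")); // true
        System.out.println(fullMatch("(?i)june", 0, "JUNE"));              // true
    }
}
```

The embedded equivalents of the first three flags are (?i), (?m) and (?s), which is what makes those flags a convenience rather than a necessity.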
