Q1 How will you go about implementing the following validation rules for a user name?
— user name must be between 2 and 17 characters long.
— valid characters are A to Z, a to z, 0 to 9, . (full-stop), _ (underscore) and – (hyphen)
— user name must begin with an alphabetic character.
— user name must not end with a . (full stop) or _ (underscore) or – (hyphen).
A1 The above rules can be implemented with a regular expression as shown below:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
import java.util.regex.Matcher; import java.util.regex.Pattern; public class UserNameRegex { public static final String PATTERN = "^[a-zA-Z][a-zA-Z0-9._-]{0,15}[a-zA-Z0-9]$"; public static final Pattern p = Pattern.compile(PATTERN); public static boolean apply(String userName) { Matcher matcher = p.matcher(userName); return matcher.find(); } } |
Not compiling the regular expression can be costly if Pattern.matches( ) is used over and over again with the same expression in a loop or frequent method calls because the matches( ) method will re-compile the expression every time it is used. You can also re-use the Matcher object for different input strings by calling the method reset( ).
What does this pattern mean?
How will you test this? Test with JUnit. The junit.jar file must be in the classpath.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
import org.junit.Assert; import org.junit.Test; public class UserNameRegexTest { @Test public void testMinLength() { Assert.assertFalse("can't be <2", UserNameRegex.apply("P")); } @Test public void testMaxLength() { Assert.assertFalse("Can't be >17", UserNameRegex .apply("Jonathon-Christopher")); } @Test public void testCantEndWith() { Assert.assertFalse("can't end with . - _", UserNameRegex .apply("s.g.r.")); } @Test public void testMustStartWith() { Assert.assertFalse("Must start with an alphabet", UserNameRegex.apply("23Lucky")); } @Test public void validNames() { Assert.assertTrue("Min Length 2", UserNameRegex.apply("Jo")); Assert.assertTrue("Max Length 17", UserNameRegex .apply("Peter-Christopher")); Assert.assertTrue(". - _ allowed in the middle", UserNameRegex.apply("s.g-h_k.r")); Assert.assertTrue("end with numeric", UserNameRegex .apply("Lucky23")); } } |
Q2 How would you go about validating a supplied password in your code that must adhere to the following conditions?
— Must contain a digit from 0-9.
— Must contain one uppercase character.
— Must contain one lowercase character.
— The length must be between 6 to 15 characters long.
A2 It can be achieved with a regular expression. But unlike the previous question, this is a bit more complicated than it looks. You might be tempted to write something like,
|
1 |
[0-9A-Za-z]{6,15} |
But the above regex is not going to meet the requirement. The above regex will only ensure that the valid characters are used and the password is of correct length (i.e. between 6 and 15 characters). But it won’t ensure that there is at least “one digit from 0-9, one lower case and one uppercase”. So, the above regex will return true for passwords like
|
1 2 |
Johnpwd //no digit in the password john123 //no uppercase character in the password |
This is where the regex with LOOK AHEAD comes to the rescue. The pattern below uses “look ahead” with the syntax ?=. The regex pattern with look ahead is shown below:
|
1 |
((?=[A-Za-z]*[0-9])(?=[0-9A-Z]*[a-z])(?=[0-9a-z]*[A-Z])){6,15} |
|
1 2 3 4 5 6 7 8 9 10 11 |
public static void correct() { String PATTERN = "((?=[A-Za-z]*[0-9])(?=[0-9A-Z]*[a-z])(?=[0-9a-z]*[A-Z])){6,15}"; Pattern p = Pattern.compile(PATTERN); Matcher matcher = p.matcher("John123"); System.out.println(matcher.find( )); //prints true; matcher = p.matcher("john123"); System.out.println(matcher.find( )); //prints false; matcher = p.matcher("johnpwd"); System.out.println(matcher.find( )); //prints false; } |
Don’t be overwhelmed by long regex patterns as shown above. Once you divide them into smaller chunks as shown below, things become much clearer.
Note: The look ahead and look behind constructs are collectively known as “look around” constructs. Unlike the greedy quantifiers discussed, the “look arounds” actually match characters, but then give up the match and only return the result with “match” or “no match”. That is why they are called “assertions”. They do not consume characters in the string like [0-9A-Za-z]{6,15}, but only assert whether a match is possible or not.
The look ahead constructs are non-capturing. If you want to capture the values matched, for example to debug or use the captured values, the above regex can be modified with additional parentheses to capture the matched values.
|
1 |
(?=([A-Za-z]*)([0-9]))(?=([0-9A-Z]*)([a-z]))(?=([0-9a-z]*)([A-Z]))){6,15} |
You can display the captured values as shown below:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 |
public static void correctWithGroup() { String PATTERN = "((?=([A-Za-z]*)([0-9]))(?=([0-9A-Z]*)([a-z]))(?=([0-9a-z]*)([A-Z]))){6,15}"; Pattern p = Pattern.compile(PATTERN); Matcher matcher = p.matcher("John123"); System.out.println(matcher.find()); int count = matcher.groupCount(); //0, and 1 are outside parentheses, which will be empty for (int i = 2; i <= count; i++) { System.out.println(matcher.group(i)); } } |
Outputs:
|
1 2 3 4 5 6 7 8 9 |
true John 1 J o J |
As you can see,
(?=([A-Za-z]*)([0-9])) matches John followed by 1, i.e. presence of a digit.
(?=([0-9A-Z]*)([a-z])) matches J followed by o, i.e. presence of a lower-case alphabet.
(?=([0-9a-z]*)([A-Z])) matches nothing followed by J, i.e. presence of an uppercase letter.
Q3 What does .*, +, and ? mean with respect to regular expressions? Where do you look for the summary of Java regular expression constructs?
A3 *, +, and ? are known as (greedy) quantifiers as they quantify the preceding character(s). For example,
. → matches any character.
.* → matches any character repeated 0 or more times.
.+ → matches any character repeated 1 or more times.
.? → matches any character repeated 0 or 1 time. This means optional.
[a-zA-Z]* → Any alphabet repeated 0 or more times.
[a-zA-Z]+ → Any alphabet repeated 1 or more times.
[a-zA-Z]? → Any alphabet repeated 0 or 1.
Note: These are not wild characters. *, +, and ? are regex repetitions. The {x,y} is also used for repetition. For example, [a]{3,5} means the letter “a” repeated at least 3 times or a maximum of 5 times. In fact, internally the Pattern.java class implements,
a* as a{0, 0x7FFFFFFF}
a+ as a{1, 0x7FFFFFFF}
The values in square brackets (i.e. [ ] ) are known as character sets. The ., *, ?, etc are escaped and used as literal values within the square brackets. Here are some examples of the character sets.
[aeiou] → matches exactly one lowercase vowel.
[^aeiou] → matches a character that ISN’T a lowercase vowel (^ inside [ ] means NOT).
^[a-z&&[^aeiou]]*$ → matches any character other than a vowel anchored between start (^) and end ($). This is a character class subtraction regex. The “&&” means intersection.
[a-d[m-p]] → Matches characters a to d and m to p. This is a union regex.
[a-z&&[d-f]] → Matches only d, e, and f. This is an intersection regex.
[x-z[\\p{Digit}]] → matches x-z and 0-9. Similar to [x-z0-9] or [x-z[\\d]]. The “\p” stands for POSIX character classes.
^[aeiou] – matches a lowercase vowel anchored at the beginning of a line
[^^] → matches a character that isn’t a caret ‘^’ .
^[^^] → matches a character that isn’t a caret at the beginning of a line.
^[^.]. → matches anything but a literal period, followed by “any” character, at the beginning of a line
[.*]* → matches a contiguous sequence of optional dots (.) and asterisks (*)
[aeiou]{3} – matches 3 consecutive lowercase vowels (all not necessarily the same vowel)
\[aeiou\] → matches the string “[aeiou]”. “\” means escape.
Q4 What does grouping mean with regards to regular expressions? Would you prefer capturing or non-capturing group?
A4 A group is a pair of parentheses used to group sub patterns. For example, c(a|u)t matches cat or cut. A group also captures the matching text within the parentheses. The groups are numbered from left to right, outside to inside. For example, (x(y*))+(z*) has 3 explicit and 1 implicit groups. The implicit group 0 is the entire match.
implicit group 0: (x(y*))+(z*)
explicit group 1: (x(y*))
explicit group 2: (y*)
explicit group 3: (z*)
Capturing groups are numbered by counting their opening parentheses from left to right. The captured sub sequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete. Groups beginning with ?: are pure, non-capturing groups that do not capture text and do not count towards the group total. For example, (?:x(?:y*))+(?:z*) will only have group 0 being the entire match.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
import java.util.regex.Matcher; import java.util.regex.Pattern; public class RegexGroup { public static void main(String[ ] args) { CharSequence inputStr = "xyyxyyzz"; String patternCapturing = "(x(y*))+(z*)"; String patternNonCapturing= "(?:x(?:y*))+(?:z*)"; System.out.println("====Capturing====="); apply(inputStr,patternCapturing); System.out.println("====Non Capturing====="); apply(inputStr,patternNonCapturing); } public static void apply(CharSequence inputStr, String patternStr) { // Compile and use regular expression Pattern pattern = Pattern.compile(patternStr); Matcher matcher = pattern.matcher(inputStr); boolean matchFound = matcher.find(); if (matchFound) { // Get all groups for this match for (int i = 0; i <= matcher.groupCount( ); i++) { System.out.println("group " + i + " --> " + matcher.group(i)); } } } } |
Output:
|
1 2 3 4 5 6 7 8 9 |
====Capturing===== group 0 --> xyyxyyzz group 1 --> xyy group 2 --> yy group 3 --> zz ====Non Capturing===== group 0 --> xyyxyyzz |
The following sample code illustrates substitution, where the captured group numbers can be used for the subsequent matches. For example, the term “marketing” and “sales” can be captured and reused as these terms appear more than once. Once within the URL and once outside the URL as per the code snippet shown below.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
import java.util.regex.Matcher; import java.util.regex.Pattern; public class CapturingParenthesis { /** * To match a URI that starts with one of the following: * http://www.site.com/marketing/ or http://www.site.com/sales/ */ public static void main(String[] args) { final String input = "You can try either " + "http://www.site.com/marketing/ for marketing related " + "queries or http://www.site.com/sales/ for sales " + "related queries and further information."; System.out.println("========Capturing========="); String capPat1 = "http://www.site.com/(marketing|sales)/"; execute(input, capPat1); System.out.println("=====Capturing with substitution===="); String capPat2 = "http://www.site.com/(marketing|sales)/.*\\1"; execute(input, capPat2); System.out.println("========Non capturing========="); // can be more efficient String nonCapPat1 = "http://www.site.com/(?:marketing|sales)/"; execute(input, nonCapPat1); System.out.println("===Non capturing no substitution=="); // can't use it for substituting the previous match String nonCapPat2 = "http://www.site.com/(?:marketing|sales)/.*\\1"; execute(input, nonCapPat2); } public static void execute(String input, String pattern) { Pattern p = Pattern.compile(pattern); Matcher matcher = p.matcher(input); while (matcher.find()) { System.out.print("Start index: " + matcher.start( )); System.out.print(" End index: " + matcher.end( ) + " "); System.out.println("group= " + matcher.group( )); } } } |
Output:
|
1 2 3 4 5 6 7 8 9 10 11 12 |
========Capturing========= Start index: 19 End index: 49 group= http://www.site.com/marketing/ Start index: 83 End index: 109 group= http://www.site.com/sales/ =====Capturing with substitution==== Start index: 19 End index: 63 group= http://www.site.com/marketing/ for marketing Start index: 83 End index: 119 group= http://www.site.com/sales/ for sales ========Non capturing========= Start index: 19 End index: 49 group= http://www.site.com/marketing/ Start index: 83 End index: 109 group= http://www.site.com/sales/ ===Non capturing no substitution== |
Note: The “\1” (group 1) captures either “marketing” or “sales” when the capturing pattern is used. Since “\” has to be escaped in Java, it is represented as “\\1”. The .* matches the word “for” as it means match any character 0 or more times.
Group capturing incur a small-time penalty each time you use them. If you don’t really need to capture the text inside a group, prefer using non-capturing groups.
Q5 What do you understand by greedy, reluctant, and possessive quantifiers? What do you understand by the term “backtracking”?
A5
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
import java.util.regex.Matcher; import java.util.regex.Pattern; public class GreedyReluctantPossessive { public static void main(String[ ] args) { String input = "xfooxxfooxxxxfooxxxxxxxxfoofo"; // this is the default for *, +, and ?. // fetches the whole String eagerly and then start to backtrack String eagerPattern = ".*foo"; System.out.println("=======Greedy========"); execute(input, eagerPattern); // Backtrack to the beginning and start fetching one at a time String reluctantPattern = ".*?foo"; System.out.println("=======Reluctant========"); execute(input, reluctantPattern); // fetches the whole String eagerly, but won't let go of what was fetched String possessivePattern = ".*+foo"; System.out.println("=======Possessive========"); execute(input, possessivePattern); // possessive pattern is more suited for matching text in quotes. String input2 = "a=\"xfoo\" some text b=\"xfooo\"" + "some more text c=\"xfo\""; possessivePattern = "\"xfoo\"*+"; System.out.println("=====Possessive match for text in quotes====="); execute(input2, possessivePattern); } private static void execute(String input, String pattern) { Pattern p = Pattern.compile(pattern); Matcher matcher = p.matcher(input); while (matcher.find( )) { System.out.print("Start index: " + matcher.start( )); System.out.print(" End index: " + matcher.end( ) + " "); System.out.println("group= " + matcher.group( )); } } } |
Output:
|
1 2 3 4 5 6 7 8 9 10 11 12 |
=======Greedy======== Start index: 0 End index: 27 group= xfooxxfooxxxxfooxxxxxxxxfoo =======Reluctant======== Start index: 0 End index: 4 group= xfoo Start index: 4 End index: 9 group= xxfoo Start index: 9 End index: 16 group= xxxxfoo Start index: 16 End index: 27 group= xxxxxxxxfoo =======Possessive=============Possessive match for text in quotes===== Start index: 2 End index: 8 group= "xfoo" Start index: 21 End index: 26 group= "xfoo |
Q6 Why did the string “xfooxxfooxxxxfooxxxxxxxxfoofo” return no match when used with a possessive quantifier?
A6 Even if you try to match “xfooxxfooxxxxfooxxxxxxxxfoo” without the last “fo”, the possessive quantifier would match nothing because the possessive quantifiers will not give up the matches on a backtrack, and the .*+ matches your entire string including the last “foo” and then there’s nothing for “foo” to match in the pattern “.*+foo”. So use possessive quantifiers only when you know that what you’ve matched should never be backtracked. For example, “[^f]*+.*foo” or more importantly matching text in quotes as demonstrated above with “\”xfoo\”*+”.
Performance testing
Always test your regular expressions with negative and positive test cases. Generally the possessive quantifiers are more efficient than the reluctant quantifiers, and the reluctant quantifiers are more efficient than the greedy quantifiers. When a regex is used in a performance critical area of your application, it would be wise to test it first. Write a benchmark application that tests it against different inputs. Remember to test different lengths of inputs, and also inputs that almost match your pattern.
What is “backtracking”?
Like most regex engines, Java uses NFA (Non deterministic Finite Automaton) approach. This means the engine scans the regex components one by one, and progresses on the input string accordingly. However, it will go back in order to explore matching alternatives when a “dead end” is found. Alternatives result from regex structures such as quantifiers (*, +, ?) and alternation (e.g. a|b|c|d). This exploration technique is called backtracking. In a more simpler term, backtracking means return to attempt an untried option.
Optimization tips
The .* and alternation (e.g. a|b|c) regexes must be used sparingly. For example, if you want to retrieve everything between two “@” characters in an input string, instead of using “@(.*)@”, it’s much better to use “@([^@]*)@”. This regex can be further optimized by minimizing the backtracking with the help of a possessive quantifier “*+” instead of a greedy quantifier “*”.
Say you are using the pattern “@([^@]*)@” with a long input string that does not have any “@”. Because “*” is a greedy quantifier, it will grab all the characters until the end of the string, and then it will backtrack, giving back one character at a time. The expression will fail only when it can’t backtrack anymore. If you change the pattern to use “@([^@]*+)@” with a possessive quantifier “*+”, the new pattern fails faster because once it has tried to match all the characters that are not a “@”, it does not backtrack; Instead it fails right there. Take care not to compromise correctness for a small optimization.
Regular expressions like “(COLON|COMMA|COLUMN)” have a reputation for being slow, so watch out for them. Firstly, the order of alternation counts. So, place the more common options in the front to be matched faster. Secondly, extract out the more common ones. For example, CO(MMA|L(ON|UMN)). Since COMMA is more common, it is used first. The “CO” and “L” are extracted out as they are more common.
Q7 How will you use “negative look ahead” to split a string of digits say 230_000L to individual array of digits using a regex?
A7
|
1 2 3 4 5 6 7 8 9 10 11 12 13 |
import java.util.Arrays; public class Test { public static void main(String[] args) { long input = 230_000L; String inputStr = Long.valueOf(input).toString(); String[] split = inputStr.split("(?!^)"); //regex applied System.out.println(Arrays.asList(split)); } } |
(?!X) –> is NEGATIVE LOOK AHEAD syntax. X is the “^”. The ^ matches the beginning of a string, so the regex matches at every position that is not the beginning of the string, and inserts a split there.
Output:
|
1 2 3 |
[2, 3, 0, 0, 0, 0] |
Q8 How will you split the following CSV values on “,” ?
|
1 2 3 |
String input = "John, Peter, Samuel , Erik"; |
A8 Using a regex in the String.split(….) method.
|
1 2 3 4 |
String input = "John, Peter, Samuel , Erik"; String[] splits = input.split("\\s*,\\s*"); |
\s -> space
\\s* –> first “\” is an escape character, and “*” means zero or more spaces.
Q9 What if you have “,” as a value within the quotes as in “Peter, Smith”, how will you split the CSV values with regex?
|
1 2 3 |
String input = "John, \"Peter, Smith\", Samuel , Erik"; |
A9
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
package com.mytutorial; public class CsvSplitter { public static void main(String[] args) { String input = "John, \"Peter, Smith\", Samuel , Erik"; String[] split = input.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1); //try with different limit values instead of -1 for (String s : split) { System.out.println(s.trim()); } } } |
Output:
|
1 2 3 4 5 6 |
John "Peter, Smith" Samuel Erik |
(?=) –> Positive Look Ahead
[^\”]*\” –> NO quotes 0 or more times, and followed by a quote.
,([^\”]*\”[^\”]*\”)* –> , followed by zero or more of anything within quotes.
[^\”]*$ –> ends with anything other than a quote 0 or more times.


