Parsing and manipulating string data is a common task in Java programming. This in-depth tutorial explains the main methods of parsing strings by breaking them into tokens for easier processing.

We will compare three approaches:

  1. The Java string split() method
  2. The Java Scanner class
  3. The Apache Commons Lang StringUtils class

By the end, you'll understand the strengths and weaknesses of each approach so you can confidently parse strings in your own code.

Why Parse Strings in Java?

Let's first discuss why string parsing is so ubiquitous in Java applications.

Strings in Java are immutable sequences of characters, and string handling takes up a noticeable share of processing time in many Java applications. Much of that work is parsing.

Common reasons you may need to parse Java strings include:

  • Tokenizing – Breaking input strings into meaningful tokens based on delimiters
  • Data processing – Accessing specific substrings or fields within large strings
  • Validation – Verifying strings match required formats
  • Transformation – Manipulating string data formats

As data flows through your Java program, parsing it into easily processed chunks with these methods improves robustness and flexibility.

1. Parsing Strings with Split()

Java's String class provides a split() method that divides strings around matches of a regular expression delimiter. It's one of the easiest ways to parse and tokenize strings in Java.

Here is a sample usage:

String data = "123|Alice|25";
String[] tokens = data.split("\\|"); 
// tokens = [0] "123", [1] "Alice", [2] "25"

This splits the pipe-delimited string and returns the tokenized values in an array.

How Split Works

When you call split(), the string is divided where the regex matches, generating substring tokens that are returned in an array:

[Diagram: split() dividing a string around matches of the regex]

The delimiting regular expression determines the division points, and the text between matches becomes the tokens.

Splitting on Different Delimiters

The delimiter can be any regular expression pattern. Here are some examples:

Whitespace:

"Alice Bob Eve".split("\\s+"); 
// ["Alice", "Bob", "Eve"]

Commas:

"A,B,C".split(",");
// ["A", "B", "C"]  

Underscores:

"X_Y_Z".split("_"); 
// ["X", "Y", "Z"]

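The regex \s+ matches one or more whitespace characters. The comma and underscore have no special meaning in a regex, so they need no escaping. Metacharacters such as |, . or * do, which is why the pipe in the first example is written as \\|.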

Choose delimiters that fit your data format – commas/tabs for CSV data, custom strings for other cases.
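When a delimiter may contain regex metacharacters, one option is to let Pattern.quote() from java.util.regex do the escaping instead of writing backslashes by hand. The sketch below illustrates the idea; the class and method names (LiteralSplitDemo, splitLiteral) are made up for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LiteralSplitDemo {

    // Splits on the delimiter taken literally, whatever characters it contains.
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // An unescaped "." matches every character, so every token is empty
        // and split() drops them all.
        System.out.println(Arrays.toString("a.b.c".split(".")));                // []

        // Quoting the delimiter makes it match only a literal dot or pipe.
        System.out.println(Arrays.toString(splitLiteral("a.b.c", ".")));        // [a, b, c]
        System.out.println(Arrays.toString(splitLiteral("123|Alice|25", "|"))); // [123, Alice, 25]
    }
}
```

This is mostly useful when the delimiter comes from configuration or user input and you cannot know in advance whether it needs escaping.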

Limiting Splits with Limit Parameter

The split() method has an overload that accepts a limit parameter:

public String[] split(String regex, int limit)  

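A positive limit caps the number of tokens in the result: the pattern is applied at most limit - 1 times, and the last token holds whatever text remains:

"1,2,3,4".split(",", 2);
// ["1", "2,3,4"] 

Here a limit of 2 allows a single split, so the rest of the string ends up in the second token.

A limit of 0 (what the one-argument split() uses) splits as many times as possible and discards trailing empty strings. A negative limit such as -1 also splits as many times as possible but keeps trailing empty strings.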
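A quick way to see how the limit argument behaves is to run a positive, zero, and negative limit against the same input. The following is a small sketch; SplitLimitDemo and show are illustrative names, and the input was chosen to end in empty fields so the differences are visible.

```java
import java.util.Arrays;

class SplitLimitDemo {

    // Returns the split result rendered as text so the cases are easy to compare.
    static String show(String input, int limit) {
        return Arrays.toString(input.split(",", limit));
    }

    public static void main(String[] args) {
        String input = "1,2,3,,";

        // Positive limit: at most 2 tokens, the rest stays in the last one.
        System.out.println(show(input, 2));   // [1, 2,3,,]

        // Zero: split everywhere, trailing empty strings are discarded.
        System.out.println(show(input, 0));   // [1, 2, 3]

        // Negative: split everywhere, trailing empty strings are kept.
        System.out.println(show(input, -1));  // [1, 2, 3, , ]
    }
}
```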

Regular Expression Primer

To fully utilize split() for parsing, you should understand regular expressions. Here's a quick primer of common examples:

Regex    Matches
.        Any character (except line terminators by default)
\d       A digit
\w       A word character (letter, digit, or underscore)
\s       Any whitespace character
[abc]    a, b, or c
[^abc]   Any character except a, b, or c
X?       X occurs 0 or 1 times
X+       X occurs 1 or more times
X*       X occurs 0 or more times

You can combine these building blocks to match very complex patterns.
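As a small illustration of combining these pieces, the sketch below splits on a comma or semicolon surrounded by optional whitespace, and uses \d+ to check that a token is all digits. The class and method names are hypothetical and exist only for this example.

```java
import java.util.Arrays;

class RegexBlocksDemo {

    // \s*  optional whitespace, [,;] a comma or a semicolon, \s* optional whitespace
    static String[] splitFields(String line) {
        return line.split("\\s*[,;]\\s*");
    }

    // \d+ means one or more digits; matches() requires the whole string to match.
    static boolean isAllDigits(String token) {
        return token.matches("\\d+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitFields("a , b;c ;  d"))); // [a, b, c, d]
        System.out.println(isAllDigits("2024")); // true
        System.out.println(isAllDigits("20x4")); // false
    }
}
```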

Now let's look at more full-featured parsing approaches.

2. Parsing Strings with Scanner

The Java Scanner class provides scanning functionality that parses input strings similar to split(), but also supports many additional parsing capabilities.

Scanner acts as an iterator over tokens in the string, making it easy to process the parsed contents.

Here is basic usage of Scanner for string parsing:

String input = "1 fish 2 fish red fish blue fish";
Scanner scanner = new Scanner(input); 

while(scanner.hasNext()) {
  System.out.println(scanner.next()); 
}

This prints:

1
fish 
2
fish
red
fish 
blue
fish

By default, Scanner splits tokens on whitespace, but it can be configured to use any regex delimiter, like split().

How Scanner Works

Internally, Scanner iterates through the string seeking matches of the delimiter regex.

When delimiters are found, it returns the intervening substrings as tokens:

[Diagram: Scanner parsing a string and returning tokens]

These tokens are exposed through methods like next() and hasNext().

Configuring Delimiters

To configure the delimiting pattern, call useDelimiter():

Scanner scanner = new Scanner(input);
scanner.useDelimiter(","); // split on commas  

Because useDelimiter() returns the Scanner itself, you can also chain it onto the constructor:

Scanner scanner = new Scanner(input).useDelimiter("\\s*fish\\s*");

This sets the pattern \s*fish\s* to extract tokens split on the word "fish" with optional whitespace.
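To make the effect of a custom delimiter concrete, here is a minimal runnable sketch that collects every token a Scanner produces for a given delimiter regex. ScannerDelimiterDemo and tokens are illustrative names, not part of the Scanner API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerDelimiterDemo {

    // Collects all tokens that Scanner yields for the given delimiter regex.
    static List<String> tokens(String input, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        // try-with-resources closes the Scanner when the loop is done
        try (Scanner scanner = new Scanner(input).useDelimiter(delimiterRegex)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "1 fish 2 fish red fish blue fish";
        System.out.println(tokens(input, "\\s*fish\\s*")); // [1, 2, red, blue]
        System.out.println(tokens("a,b,c", ","));          // [a, b, c]
    }
}
```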

Accessing Tokens

The hasNext() and next() methods iterate through tokens:

  • next() returns the next token
  • hasNext() checks if there is another token

For example:

while(scanner.hasNext()) {
  String token = scanner.next();

  // process token
}

You can also call nextLine() to return the rest of the current line, regardless of the configured delimiter.

Scanner enables building complex string parsing logic around these core methods.
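One of the capabilities Scanner adds over split() is typed token access through methods such as hasNextInt() and nextInt(). The sketch below, under the assumption that the input mixes words and plain integers, sums the integers and skips everything else; sumInts is a hypothetical helper.

```java
import java.util.Scanner;

class ScannerTypedDemo {

    // Adds up every whitespace-separated token that parses as an int.
    static int sumInts(String input) {
        int sum = 0;
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNext()) {
                if (scanner.hasNextInt()) {
                    sum += scanner.nextInt(); // token is an int: consume it as a number
                } else {
                    scanner.next();           // anything else: consume and ignore
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("1 fish 2 fish red fish blue fish")); // 3
        System.out.println(sumInts("10 apples and 32 pears"));           // 42
    }
}
```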

Performance Impact

One downside to Scanner is performance overhead. Constructing a Scanner on your string introduces more initialization costs compared to basic split().

So if all you need is simple splitting, split() will be faster. Use Scanner when you require its advanced parsing capabilities.

3. Parsing Strings with StringUtils

The Apache Commons Lang library provides StringUtils – a set of utilities for manipulating strings in Java.

The StringUtils class has over 50 methods, including advanced capabilities for parsing and transforming strings.

Here is an example parsing a string with a StringUtils method:

String text = "Hello world!";
String[] result = StringUtils.substringsBetween(text, "Hello", "!");
// [" world"]

This extracts the substring from text that falls between "Hello" and "!" into an array of tokens.

Let's explore some other useful parsing functions in StringUtils.

Tokenizing Strings

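Methods like StringUtils.split() tokenize strings. Unlike String.split(), the separator argument is a set of literal characters rather than a regex:

// Splits on commas; adjacent separators count as one, so empty tokens are dropped
String[] tokens = StringUtils.split(text, ","); 

// Splits on commas but keeps empty tokens
String[] allTokens = StringUtils.splitPreserveAllTokens(text, ",");

splitPreserveAllTokens() keeps the empty tokens produced by adjacent separators, which matters when, for example, a CSV row contains blank fields.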

Extracting Substrings

Methods in the substring family extract portions of larger strings:

// Text before delim
String start = StringUtils.substringBefore(text, ",");    

// Text after delim  
String end = StringUtils.substringAfter(text, ",");   

// Text between delims
String middle = StringUtils.substringBetween(text, "--");

These make isolating substrings cleaner than concatenating output from multiple split() calls.

Comparing Parsing Approaches

Let's recap the pros and cons of the string parsing approaches:

                  split()            Scanner                 StringUtils
Speed             Very fast          Slower due to overhead  Fast with most methods
Memory            Low overhead       Higher memory usage     Medium overhead
Features          Simple splitting   Advanced parsing logic  Specialized manipulation & extra parsing functions
Code Readability  Straightforward    Complex full system     Easy to read method calls
Setup Effort      Just call methods  Must construct Scanner  Import Lang dependency

For many cases, split() provides an ideal balance of speed, efficiency, and ease of use.

When you need extensibility or advanced parsing capabilities, Scanner and StringUtils shine.

Now let's look beyond splitting – how to parse strings into other data types.

Parsing Strings into Other Data Types

The parsing we've examined simply divides strings and extracts substrings. But often the end goal is converting textual data into other formats for easier manipulation in code.

Luckily, Java has great support for parsing strings into useful types like numbers, dates, enums, and more.

Parsing into Numbers & Primitives

For numeric strings, methods in the primitive wrapper classes parse into number types:

int value = Integer.parseInt("123"); // 123
double pi = Double.parseDouble("3.14");
boolean ready = Boolean.parseBoolean("true"); 

This is faster than using Scanner for number parsing. Methods follow a standard naming convention – "parse"+PrimitiveType.
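Keep in mind that Integer.parseInt() and Double.parseDouble() throw NumberFormatException on input they cannot parse, while Boolean.parseBoolean() simply returns false for anything other than "true" (ignoring case). One possible way to wrap the exception is sketched below; tryParseInt is a hypothetical helper, not a JDK method.

```java
import java.util.OptionalInt;

class SafeNumberParsing {

    // Returns the parsed value, or an empty OptionalInt when the text is not a valid int.
    static OptionalInt tryParseInt(String text) {
        if (text == null) {
            return OptionalInt.empty();
        }
        try {
            // trim() because parseInt rejects surrounding whitespace
            return OptionalInt.of(Integer.parseInt(text.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParseInt("123"));           // OptionalInt[123]
        System.out.println(tryParseInt("12a"));           // OptionalInt.empty
        System.out.println(Boolean.parseBoolean("yes"));  // false
    }
}
```

Returning an OptionalInt is only one design choice; rethrowing with a clearer message or falling back to a default value are equally reasonable, depending on the caller.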

Parsing into Dates & Times

The Java Date/Time API provides flexible parsing of date/time strings:

LocalDate date = LocalDate.parse("2023-01-01"); 

LocalDateTime datetime = LocalDateTime.parse("2017-06-12T11:42");  

Instant instant = Instant.parse("2020-05-03T12:30:00Z");

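Formats are customizable through DateTimeFormatter. These parse methods are not null-safe: a null argument throws NullPointerException, and text that does not match the expected format throws DateTimeParseException.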
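For a non-ISO format, a DateTimeFormatter built with ofPattern() can be passed to parse(). The following sketch assumes day/month/year input such as "31/12/2023"; the class name, the DAY_MONTH_YEAR constant, and the tryParse helper are illustrative.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

class DateParsingDemo {

    // Assumed input format for this example: two-digit day, two-digit month, four-digit year.
    private static final DateTimeFormatter DAY_MONTH_YEAR =
            DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // Returns the parsed date, or an empty Optional when the text does not match the pattern.
    static Optional<LocalDate> tryParse(String text) {
        try {
            return Optional.of(LocalDate.parse(text, DAY_MONTH_YEAR));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("31/12/2023")); // Optional[2023-12-31]
        System.out.println(tryParse("2023-12-31")); // Optional.empty
    }
}
```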

Parsing Custom Formats

For complex or custom data types, create parser classes to handle conversion:

// Custom parser
public class UserParser {

  public static User parse(String data) {
    String[] parts = data.split(",");
    User user = new User(); 
    user.firstName = parts[0]; 
    user.lastName = parts[1];
    return user;
  }

}

// Using custom parser  
User user = UserParser.parse("John,Doe");

Custom parsers keep code modular and focused on a single responsibility.
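A parser like this will throw ArrayIndexOutOfBoundsException if a field is missing, so in practice you may want it to validate its input. Below is one possible self-contained variant; the nested User class, the two-field format, and the choice of IllegalArgumentException are assumptions made for illustration.

```java
class ValidatingUserParser {

    // Minimal stand-in for the User type; the real class is assumed to live elsewhere.
    static final class User {
        final String firstName;
        final String lastName;

        User(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    // Parses "first,last" and rejects anything that does not have exactly two non-blank fields.
    static User parse(String data) {
        if (data == null) {
            throw new IllegalArgumentException("input is null");
        }
        // A limit of -1 keeps trailing empty fields so "John,Doe," is counted as three fields.
        String[] parts = data.split(",", -1);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected 2 fields but found " + parts.length);
        }
        String first = parts[0].trim();
        String last = parts[1].trim();
        if (first.isEmpty() || last.isEmpty()) {
            throw new IllegalArgumentException("first and last name must not be blank");
        }
        return new User(first, last);
    }

    public static void main(String[] args) {
        User user = parse("John, Doe");
        System.out.println(user.firstName + " " + user.lastName); // John Doe
    }
}
```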

Watch Out for These String Parsing Pitfalls!

While Java's string parsing capabilities are powerful, some common issues can make implementations fragile:

Ignoring Empty Tokens

When delimiters appear consecutively, empty strings may be generated:

"1,,2".split(","); 
// ["1", "", "2"]

Code consuming the array must defensively handle empty elements.

Assuming Delimited Length

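Don't assume N delimiters always yield N + 1 elements. By default, split() silently drops trailing empty strings:

"1,2,".split(",");
// ["1", "2"]

"1,2,".split(",", -1);
// ["1", "2", ""]

Check the array length before indexing into it, or pass a negative limit when trailing empty fields matter.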

Matching Delimiters

Unbalanced or mismatched delimiters during parsing can lead to subtle bugs. Validate structure or carefully track state when parsing nested strings.

Handling Malformed Data

Defend against invalid characters, improperly formatted data, excessive memory use from very long inputs, and other cases your parsing logic may not anticipate.

Build validation checks into your parsers and always sanitize untrusted data.

By understanding these potential pitfalls when working with string data, you can parse safely.

Putting it All Together

You should now have a solid grasp of string parsing approaches in Java. Let's review the key takeaways:

  • Use split() for simple, fast string tokenization. Specify regex delimiters to tokenize as needed
  • Scanner adds advanced parsing capabilities for complex tokenizing logic when needed
  • If using an external library, StringUtils provides convenient manipulation functions
  • Combine parsing primitives with data type conversions for real-world parsing use cases
  • Validate inputs and handle empty tokens/exceptions to keep parsers robust

Java's built-in facilities make string parsing smooth. For production systems, benchmark the alternatives and select the best approach for your architecture.

Happy (string) parsing!
