Pitfalls of Handling Large String Objects in Java

Most Java developers know to be careful about String concatenation. Some argue that this matters less nowadays, thanks to improved JVM garbage collection. That may be true for small objects, but for large String objects, which are common when dealing with web services and XML, developers need to be even more careful with String operations in general, not just concatenation.

OOM in Production, Why?

I recently looked into an OutOfMemoryError in production, and the culprit turned out to be the following code.
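The original snippet looked roughly like this (a reconstruction based on the description below; the namespace prefixes and field names are assumptions):

```java
// Hypothetical reconstruction of the problematic code; the prefix and
// field names ("ns1:", "oldName", etc.) are assumptions, not the real ones.
public class ResponseCleaner {
    static String clean(String xml) {
        return xml
                .replaceAll("ns1:", "")                  // strip prefix: 1st large copy
                .replaceAll("ns2:", "")                  // strip prefix: 2nd large copy
                .replaceAll("<oldName>", "<newName>")    // rename field: 3rd large copy
                .replaceAll("</oldName>", "</newName>"); // closing tag:  4th large copy
    }

    public static void main(String[] args) {
        String in = "<ns1:root><ns2:oldName>v</ns2:oldName></ns1:root>";
        System.out.println(clean(in));  // <root><newName>v</newName></root>
    }
}
```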

The application receives a response from a web service and needs to pass the result, as a String field, to a downstream web service. The downstream service is old and does not support namespace prefixes, so a developer wrote the above code to strip the prefixes and also rename a field. Not an ideal situation, but innocent enough?

However, when the XML returned by the first web service turned out to be pretty big, around 200 KB, it caused frequent OutOfMemoryErrors on the production servers. 200 KB is big for simple text, but still, the server has a 1.5 GB heap. Why did it run out of memory?

Java's String Implementation is NOT Optimized for Large Strings

The JVM hates large objects. The heap may have enough total space, but it can be highly fragmented. When a request for a large object comes in, the JVM often has to go through a series of GC cycles, moving objects around to accommodate it. In our case, it is unfortunate that the input String is already huge, ~200 KB, and the above code makes even more copies of it, stressing the JVM and resulting in the OutOfMemoryError.

In the code above, each replaceAll call produces a new large String, which means four additional large objects. But that is not ALL! Even more large objects are allocated inside Java's String implementation.

  • replaceAll is not optimized for large Strings

String's replaceAll method eventually delegates to replaceAll in java.util.regex.Matcher. Its implementation, as explained below, is not optimized for large Strings.
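What Matcher.replaceAll does internally can be mirrored with the public API; the sketch below follows the same find/append loop as the JDK source, including the two details discussed next:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch mirroring the loop inside Matcher.replaceAll: the result is
// accumulated in a StringBuffer created with the default constructor
// (16-char capacity), then converted with toString() at the end.
public class ReplaceAllSketch {
    static String replaceAllLike(String input, String regex, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();  // default capacity: only 16 chars
        while (m.find()) {
            m.appendReplacement(sb, replacement);
        }
        m.appendTail(sb);
        return sb.toString();                  // may copy the backing array again
    }

    public static void main(String[] args) {
        System.out.println(replaceAllLike("<ns1:a>x</ns1:a>", "ns1:", ""));  // <a>x</a>
    }
}
```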
  • Always specify size when StringBuffer is created for large Strings 

First of all, when replaceAll creates its StringBuffer, it uses the default constructor, which allocates a capacity of only 16 characters: perfect for small Strings.

As more characters are appended, the StringBuffer roughly doubles its capacity every time it reaches its limit.

This means that, in our case, replaceAll generates garbage objects of roughly 16, 32, ... 32 KB, 64 KB, 128 KB, and 256 KB by the time it finishes. The 32 KB, 64 KB, and 128 KB objects are all very large objects from the garbage collector's point of view.
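The growth can be observed directly, and pre-sizing avoids it entirely (the exact growth formula in the JDK sources is old capacity * 2 + 2, i.e. roughly doubling; the `input.length()` hint below is a sketch of what the fix looks like):

```java
// Demonstrates StringBuffer's capacity growth from the 16-char default,
// and pre-sizing the buffer so no intermediate arrays are allocated.
public class BufferGrowth {
    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer();  // default capacity: 16
        System.out.println(sb.capacity());     // 16
        sb.append("xxxxxxxxxxxxxxxxx");        // 17 chars: exceeds 16, buffer grows
        System.out.println(sb.capacity());     // 34 = 16 * 2 + 2

        // Sketch of the fix: size the buffer up front for the known input.
        String input = "<root>some large xml payload</root>";
        StringBuffer sized = new StringBuffer(input.length() + 32);
        System.out.println(sized.capacity() >= input.length());  // true
    }
}
```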

  • For large Strings, StringBuffer.toString() is bad; use new String(StringBuffer) instead

Secondly, when replaceAll returns its result, it calls StringBuffer.toString().

toString() checks how much space is wasted: if it is more than 768 bytes, it uses a different String constructor that makes another copy of the underlying char array in order to free up the 'wasted' space.

For large Strings, wasting 768 bytes is nothing compared to the overhead of allocating another large object from the heap. new String(StringBuffer sb), on the other hand, simply re-uses the char array inside the StringBuffer object.
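Both conversions produce the same text either way; the difference is only in allocation behavior, and only in the JDK version described above (later JDKs changed both paths, so verify against your own runtime):

```java
// Compares the two ways of turning a StringBuffer into a String. In the
// JDK discussed above, new String(sb) re-uses the backing char array,
// while sb.toString() may allocate a trimmed copy; later JDKs behave
// differently, so measure before relying on this.
public class ToStringVsCtor {
    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer(1 << 20);  // large pre-sized buffer
        sb.append("<root>payload</root>");

        String viaToString = sb.toString();  // may copy to trim wasted space
        String viaCtor = new String(sb);     // described above as array re-use

        System.out.println(viaToString.equals(viaCtor));  // true
    }
}
```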

  • Always be extra careful with large Strings; read Java's source code when in doubt

Java may change its String implementation in later versions, so read the source code if you are not sure. If you cannot avoid large String objects for whatever reason, treat them with extra care. The difference between generating 2 large objects and 20 can mean multiple GC cycles of the expensive kind, and in our case, an OutOfMemoryError.


Code Optimized

Eventually we decided to parse the String manually instead of using replaceAll. This allows us to transform the input String in a single pass and avoid generating additional large Java objects.
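A single-pass version might look like this (a sketch only; the prefixes and field names are the same assumptions as before, while the real code handled the application's actual tags):

```java
// Single-pass transform sketch: one scan over the input, one pre-sized
// output buffer, no intermediate large Strings. The prefix and field
// names are assumptions.
public class SinglePassTransform {
    static String transform(String xml) {
        StringBuilder out = new StringBuilder(xml.length());  // pre-sized: no regrowth
        int i = 0;
        while (i < xml.length()) {
            if (xml.startsWith("ns1:", i) || xml.startsWith("ns2:", i)) {
                i += 4;                       // drop the namespace prefix
            } else if (xml.startsWith("oldName", i)) {
                out.append("newName");        // rename the field on the fly
                i += "oldName".length();
            } else {
                out.append(xml.charAt(i++));  // plain copy
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "<ns1:root><ns2:oldName>v</ns2:oldName></ns1:root>";
        System.out.println(transform(in));  // <root><newName>v</newName></root>
    }
}
```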
