So far so good, but we started to see the GC getting more and more inefficient, and at some point, all the 4 VMs where we've the API running used to run out of heap space. The GC attempts got bigger, at times I'd see 3-4000 collections during the process. To add to my surprises, all VMs started crashing within minutes of each other once every 3-4 days. We use a super awesome product called NewRelic to monitor our services farm, and the behavior I noticed was that of a somewhat massive memory leak. Not only that, the GC CPU Utilization hit as much as 70% right before the VM crashes. We did an internal code review, not once, not twice but many times over and over. We used yourkit and tried to simulate this on local and stage, still no help.
This was going on for a good few weeks till we launched a new product, the IGN Stock Exchange. The API volume increased, as did the number of PersonNotFoundExceptions given the various touchpoints across ign.com. Now instead of the servers blowing up every 3-4 days, the blowups started happening every 2 hours. I tried changing application log levels to FATAL from WARN in a desperate attempt to save the servers but no help. The logs kept rolling, and so did the memory on the VMs.
We run a completely stateless (although not share-nothing) architecture for the API so there was no question of any leaks from the code - which is what baffled me a lot. At this point I dumped the heap from one of the VMs using jmap, compressed it, and scp'd it to my local machine. Then I used Eclipse MAT to open it up and there it was:
One instance of "java.util.logging.ConsoleHandler" loaded by "system class loader" occupies 423,661,904 (87.89%) bytes. The instance is referenced by org.apache.juli.ClassLoaderLogManager$RootLogger @ 0x2aaac3945b38 , loaded by "sun.misc.Launcher$AppClassLoader @ 0x2aaac39010e8". The memory is accumulated in one instance of "byte[]" loaded by "system class loader".
Keywords
java.util.logging.ConsoleHandler
sun.misc.Launcher$AppClassLoader @ 0x2aaac39010e8
byte[]
First off, we do not use juli. The app uses Log4j - thats when I noticed that the logs where most of the stuff was getting written were not the application logs, but Tomcat logs, like catalina.out and catalina.YYYY-MM-DD.log. This gave me a hint to look at Tomcat's logging configuration and here is what I found in CATALINA_HOME/conf/logging.properties:
1catalina.org.apache.juli.FileHandler.level = FINE
1catalina.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
1catalina.org.apache.juli.FileHandler.prefix = catalina.
2localhost.org.apache.juli.FileHandler.level = FINE
2localhost.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
2localhost.org.apache.juli.FileHandler.prefix = localhost.
3manager.org.apache.juli.FileHandler.level = FINE
3manager.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
3manager.org.apache.juli.FileHandler.prefix = manager.
4host-manager.org.apache.juli.FileHandler.level = FINE
4host-manager.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
4host-manager.org.apache.juli.FileHandler.prefix = host-manager.
java.util.logging.ConsoleHandler.level = FINE
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
These are the default settings for Tomcat logging, and too brutal for Production given the FINE level. Although Tomcat does support customizing the logging provider, I did not have much time to experiment while the boxes kept blowing up.
I immediately changed FINE to SEVERE (Since Tomcat uses juli and juli does not have FATAL, which is the Log4j equivalent) and bounced the server. Surprise! No more logging to console.
Earlier, the exceptions were logged at the throw new PersonNotFoundException() line and the traces were dumped in Tomcat's console. I am very curious to find out what could be causing the leak in Tomcat's ConsoleAppender (and quite possibly, juli) but for now I am glad that all VMs are behaving normally, the sawtooth GC pattern is back to normal. I'll change the provider to Log4j and see if that makes any difference, and report back my findings.
At the end of the day, NewRelic was instrumental in helping us isolate the problem, and the offline heap profiling with jmap and Eclipse/MAT saved the day.
So if you're deploying under Tomcat in production, you may want to look at logging.properties, more so if you have a public API that can lead to lot of exceptions being thrown by the app due to validation/bad data.

