Saturday, March 26, 2011

Beware of Tomcat Logging


Here are IGN we have built a Social API, which is built on top of OpenSocial specs, deployed in Tomcat (6.0.20 to be precise) on Centos5. Lately we changed our API so that a method that was returning a POJO was refactored to throw an Exception for a specific condition - a PersonNotFoundException instead of returning a null on getPersonById().
So far so good, but we started to see the GC getting more and more inefficient, and at some point, all the 4 VMs where we've the API running used to run out of heap space. The GC attempts got bigger, at times I'd see 3-4000 collections during the process. To add to my surprises, all VMs started crashing within minutes of each other once every 3-4 days. We use a super awesome product called NewRelic to monitor our services farm, and the behavior I noticed was that of a somewhat massive memory leak. Not only that, the GC CPU Utilization hit as much as 70% right before the VM crashes. We did an internal code review, not once, not twice but many times over and over. We used yourkit and tried to simulate this on local and stage, still no help.


This was going on for a good few weeks till we launched a new product, the IGN Stock Exchange. The API volume increased, as did the number of PersonNotFoundExceptions given the various touchpoints across ign.com. Now instead of the servers blowing up every 3-4 days, the blowups started happening every 2 hours. I tried changing application log levels to FATAL from WARN in a desperate attempt to save the servers but no help. The logs kept rolling, and so did the memory on the VMs.
We run a completely stateless (although not share-nothing) architecture for the API so there was no question of any leaks from the code - which is what baffled me a lot. At this point I dumped the heap from one of the VMs using jmap, compressed it, and scp'd it to my local machine. Then I used Eclipse MAT to open it up and there it was:


One instance of "java.util.logging.ConsoleHandler" loaded by "system class loader" occupies 423,661,904 (87.89%) bytes. The instance is referenced by org.apache.juli.ClassLoaderLogManager$RootLogger @ 0x2aaac3945b38 , loaded by "sun.misc.Launcher$AppClassLoader @ 0x2aaac39010e8". The memory is accumulated in one instance of "byte[]" loaded by "system class loader".
Keywords
java.util.logging.ConsoleHandler
sun.misc.Launcher$AppClassLoader @ 0x2aaac39010e8
byte[]
First off, we do not use juli. The app uses Log4j - thats when I noticed that the logs where most of the stuff was getting written were not the application logs, but Tomcat logs, like catalina.out and catalina.YYYY-MM-DD.log. This gave me a hint to look at Tomcat's logging configuration and here is what I found in CATALINA_HOME/conf/logging.properties:

1catalina.org.apache.juli.FileHandler.level = FINE
1catalina.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
1catalina.org.apache.juli.FileHandler.prefix = catalina.
2localhost.org.apache.juli.FileHandler.level = FINE
2localhost.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
2localhost.org.apache.juli.FileHandler.prefix = localhost.
3manager.org.apache.juli.FileHandler.level = FINE
3manager.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
3manager.org.apache.juli.FileHandler.prefix = manager.
4host-manager.org.apache.juli.FileHandler.level = FINE
4host-manager.org.apache.juli.FileHandler.directory = ${catalina.base}/logs
4host-manager.org.apache.juli.FileHandler.prefix = host-manager.
java.util.logging.ConsoleHandler.level = FINE
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter

These are the default settings for Tomcat logging, and too brutal for Production given the FINE level. Although Tomcat does support customizing the logging provider, I did not have much time to experiment while the boxes kept blowing up.

I immediately changed FINE to SEVERE (Since Tomcat uses juli and juli does not have FATAL, which is the Log4j equivalent) and bounced the server. Surprise! No more logging to console.
Earlier, the exceptions were logged at the throw new PersonNotFoundException() line and the traces were dumped in Tomcat's console. I am very curious to find out what could be causing the leak in Tomcat's ConsoleAppender (and quite possibly, juli) but for now I am glad that all VMs are behaving normally, the sawtooth GC pattern is back to normal. I'll change the provider to Log4j and see if that makes any difference, and report back my findings.


At the end of the day, NewRelic was instrumental in helping us isolate the problem, and the offline heap profiling with jmap and Eclipse/MAT saved the day.
So if you're deploying under Tomcat in production, you may want to look at logging.properties, more so if you have a public API that can lead to lot of exceptions being thrown by the app due to validation/bad data.

Tuesday, February 15, 2011

Using MongoDB for IGN's Social Platform

My Presentation at SF Bay Area MongoDB Meetup on 02/15/2011.

Monday, October 11, 2010

Social Platforms : What goes on under the hood

My Presentation at Silicon Valley Code Camp on 10/09/2010 at 11:15 AM, Room 5001.


Sunday, September 19, 2010

Eclipse fails to start on Ubuntu

Often times I've found Eclipse (Galileo) failing to start on Ubuntu. The progress bar on the splash shows up almost half way and the process hangs. Retries do not help. I've the usual plugins (Maven2, subclipse) and nothing fancy.

The only solution I've found to this problem is getting rid of the metadata folder. The downside to this is that I've to reimport the projects in my workspace but at least Eclipse comes back to life.

Since I use multiple workspaces it sucks even more as I've to make sure which one I was using last which is the exact one preventing Eclipse to start.

mpandit@mpandit-ubuntu:~$ find . -name '.metadata' -print
./.local/share/Trash/files/.metadata
./Manish_Laptop_Backup/refactor/.metadata
./Manish_Laptop_Backup/workspace/temp/.metadata
./Manish_Laptop_Backup/workspace/.metadata
./Documents/workspace-sts-2.3.2.RELEASE/.metadata
./Documents/mongoworkspace/.metadata
./workspace/.metadata
./newworkspace/.metadata

Since I was working last in .newworkspace, I'll go and verify the timestamp of the files in it.

mpandit@mpandit-ubuntu:~$ cd newworkspace/
mpandit@mpandit-ubuntu:~/newworkspace$ cd .metadata/
mpandit@mpandit-ubuntu:~/newworkspace/.metadata$ ls -al
total 1876
drwxr-xr-x 4 mpandit mpandit 4096 2010-09-13 23:05 .
drwxr-xr-x 4 mpandit mpandit 4096 2010-07-03 23:03 ..
-rw-r--r-- 1 mpandit mpandit 1025091 2010-09-13 23:05 .bak_0.log
-rw-r--r-- 1 mpandit mpandit 0 2010-06-02 11:03 .lock
-rw-r--r-- 1 mpandit mpandit 860893 2010-09-18 13:21 .log
drwxr-xr-x 3 mpandit mpandit 4096 2010-09-06 22:51 .mylyn
drwxr-xr-x 31 mpandit mpandit 4096 2010-09-12 14:56 .plugins
-rw-r--r-- 1 mpandit mpandit 26 2010-09-18 13:21 version.ini

version.ini's timestamp confirmed that this is the workspace I was using.

Next, we delete the .metadata folder.

mpandit@mpandit-ubuntu:~/newworkspace/.metadata$ cd ..
mpandit@mpandit-ubuntu:~/newworkspace$ rm -rf .metadata/
mpandit@mpandit-ubuntu:~/newworkspace$


Restart Eclipse and it works!

Thursday, May 20, 2010

Ruby Memcached Gem on Centos and Ubuntu

Tried installing memcached ruby gem on Ubuntu 10.04 (lucid) and ran into an error.

mpandit@vaio-ubuntu:~/work$ sudo gem install memcached
[sudo] password for mpandit:
Building native extensions. This could take a while...
ERROR: Error installing memcached:
ERROR: Failed to build gem native extension.

/usr/bin/ruby1.8 extconf.rb
Libmemcached already built; run 'rake clean' first if you need to rebuild.
creating Makefile

make
gcc -I. -I/usr/lib/ruby/1.8/i486-linux -I/usr/lib/ruby/1.8/i486-linux -I. -fPIC -I/var/lib/gems/1.8/gems/memcached-0.19.3/ext/include -L/var/lib/gems/1.8/gems/memcached-0.19.3/ext/lib -fno-strict-aliasing -g -g -O2 -fPIC -fno-strict-aliasing -g -g -O2 -fPIC -c rlibmemcached_wrap.c
rlibmemcached_wrap.c: In function ‘_wrap_MemcachedSt_sasl_callbacks_set’:
rlibmemcached_wrap.c:6669: error: ‘sasl_callback_t’ undeclared (first use in this function)
rlibmemcached_wrap.c:6669: error: (Each undeclared identifier is reported only once
rlibmemcached_wrap.c:6669: error: for each function it appears in.)
rlibmemcached_wrap.c:6669: error: ‘arg2’ undeclared (first use in this function)
rlibmemcached_wrap.c:6669: error: expected expression before ‘)’ token
rlibmemcached_wrap.c:6687: error: expected expression before ‘)’ token
rlibmemcached_wrap.c:6688: error: ‘struct memcached_st’ has no member named ‘sasl_callbacks’
rlibmemcached_wrap.c:6688: error: expected ‘)’ before ‘const’
rlibmemcached_wrap.c:6688: error: expected ‘;’ before ‘arg2’
rlibmemcached_wrap.c: In function ‘_wrap_MemcachedSt_sasl_callbacks_get’:
rlibmemcached_wrap.c:6700: error: ‘sasl_callback_t’ undeclared (first use in this function)
rlibmemcached_wrap.c:6700: error: ‘result’ undeclared (first use in this function)
rlibmemcached_wrap.c:6711: error: expected expression before ‘)’ token
rlibmemcached_wrap.c:6711: error: ‘struct memcached_st’ has no member named ‘sasl_callbacks’
rlibmemcached_wrap.c: In function ‘_wrap_memcached_set_sasl_callbacks’:
rlibmemcached_wrap.c:12021: error: ‘sasl_callback_t’ undeclared (first use in this function)
rlibmemcached_wrap.c:12021: error: ‘arg2’ undeclared (first use in this function)
rlibmemcached_wrap.c:12021: error: expected expression before ‘)’ token
rlibmemcached_wrap.c:12039: error: expected expression before ‘)’ token
rlibmemcached_wrap.c:12040: error: expected ‘)’ before ‘const’
rlibmemcached_wrap.c:12040: error: expected ‘)’ before ‘arg2’
rlibmemcached_wrap.c: In function ‘_wrap_memcached_get_sasl_callbacks’:
rlibmemcached_wrap.c:12122: error: ‘sasl_callback_t’ undeclared (first use in this function)
rlibmemcached_wrap.c:12122: error: ‘result’ undeclared (first use in this function)
rlibmemcached_wrap.c:12133: error: expected expression before ‘)’ token
rlibmemcached_wrap.c: In function ‘Init_rlibmemcached’:
rlibmemcached_wrap.c:13285: error: ‘SASL_OK’ undeclared (first use in this function)
make: *** [rlibmemcached_wrap.o] Error 1


Gem files will remain installed in /var/lib/gems/1.8/gems/memcached-0.19.3 for inspection.
Results logged to /var/lib/gems/1.8/gems/memcached-0.19.3/ext/gem_make.out


I googled around, turns out I need libsasl2 and libsasl2-devel installed. Thats easy.

mpandit@vaio-ubuntu:~/work$ sudo apt-get install libsasl2 libsasl2-dev

After these got installed, I ran into the same error as above. After wasting 15 minutes, I realized that I should try deleting the gem crap it has already put in the folder. This did the trick:

mpandit@vaio-ubuntu:~/work$ sudo rm -rf /var/lib/gems/1.8/gems/memcache*
mpandit@vaio-ubuntu:~/work$ sudo gem install memcached
Building native extensions. This could take a while...
Successfully installed memcached-0.19.3
1 gem installed
Installing ri documentation for memcached-0.19.3...


Now it was turn to do this on my work servers. Unfortunately they run Centos 5.4, and I have a bias against non-Ubuntu distributions, simply because they are a few years/months behind Ubuntu when it comes to packages.

So, lesson learnt from Ubuntu, I tried this:

[mpandit@dev-services-vm-01:~] $ sudo yum install libsasl2
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* addons: distro
* base: distro
* epel: mirrors.kernel.org
* extras: distro
* updates: distro
addons | 951 B 00:00
base | 1.1 kB 00:00
citrix | 951 B 00:00
elff | 1.9 kB 00:00
epel | 3.4 kB 00:00
extras | 2.1 kB 00:00
updates | 1.9 kB 00:00
Excluding Packages in global exclude list
Finished
Excluding Packages from CentOS-5 - Base
Finished
Setting up Install Process
No package libsasl2 available.
Nothing to do


Googled around again, turns out the sasl2 library is called cyrus-sasl2 (dont ask me why). After searching around rpmfind and pbone, I was able to grab the sasl2 RPM and sasl2-dev RPM for Centos 5.4. Low and behold, I had sasl2 already installed, and dev was missing.

[mpandit@dev-services-vm-01:/tmp] $ wget ftp://ftp.pbone.net/mirror/ftp.centos.org/5.4/os/x86_64/CentOS/cyrus-sasl-2.1.22-5.el5.x86_64.rpm
[mpandit@dev-services-vm-01:/tmp] $ sudo rpm -vi cyrus-sasl-2.1.22-5.el5.x86_64.rpm
Preparing packages for installation...
package cyrus-sasl-2.1.22-5.el5.x86_64 is already installed

[mpandit@dev-services-vm-01:/tmp] $ wget ftp://ftp.pbone.net/mirror/ftp.centos.org/5.4/os/x86_64/CentOS/cyrus-sasl-devel-2.1.22-5.el5.x86_64.rpm
[mpandit@dev-services-vm-01:/tmp] $ sudo rpm -iv cyrus-sasl-devel-2.1.22-5.el5.x86_64.rpm
Preparing packages for installation...
cyrus-sasl-devel-2.1.22-5.el5

Now that this was installed, I ran sudo gem install memcached, and saw the same ugly 1000 line error. Again, deleted the directory and it worked like a charm.

Success at the end!

[mpandit@dev-services-vm-01:/tmp] $ sudo rm -rf /usr/lib/ruby/gems/1.8/gems/mem*
[mpandit@dev-services-vm-01:/tmp] $ sudo gem install memcachedBuilding native extensions. This could take a while...
Successfully installed memcached-0.19.3
1 gem installed
Installing ri documentation for memcached-0.19.3...


Not sure if its a bug or a feature, but I really wish gem install would check to see if half-ass installation from a failed attempt already exists in the LIB_DIR. If so, delete it, or at least move it to a .prev folder or something.

Thats why whenever someone tells me to just "apt-get install" or "yum install" anything, I always take it with a grain of salt.

Monday, February 8, 2010

Falling in love with Ruby!

So my newfound interest is Ruby. Starting as a n00b but picking it up quite a bit. However I doubt I'd be as comfortable with or accustomed to Ruby as I am with Java, given the 12 years of hardcore programming experience with the latter. Anyway, so the task at hand was a small piece of code that I could use to generate timestamps for OAuth requests (oauth_timestamp). I did not realize how verbose Java programming can get till I tried to compare apples to apples.

Here is the Java way to do it:

public class TimestampUtil{
public static void main(String ... args) throws Throwable{
System.out.println(System.currentTimeMillis()/1000);
}
}

mpandit@vaio-ubuntu:~$ javac TimestampUtil
mpandit@vaio-ubuntu:~$ java TimestampUtil
1265696612
mpandit@vaio-ubuntu:~$


Here is the Ruby way:

vi time.rb

puts Time.now.to_i

mpandit@vaio-ubuntu:~$ ruby time.rb

1265696797
mpandit@vaio-ubuntu:~$

Sunday, February 7, 2010

Apache commons-lang StringUtils and Numbers

I had my unit tests fail because I was using the StringUtils.isNumeric( ) on a string containing a negative number. The javadoc on this API is excellent, including the sample request/response but it does not talk about negative numbers. The definition of this API seems debatable from a negative number standpoint : "Checks if the String contains only unicode digits. A decimal point is not a unicode digit and returns false."
I'd call the method something else, like isWholeNumber( ) if the method is going to say that decimals (and negatives) are going to be ignored.
Anyway, here is a simple way I used to figure out if the string contains a number (negative or with decimal, doesnt matter).

String string = "-12.49";
boolean isNumeric = string!=null && string.length() > 0 && string.toUpperCase().equals(string.toLowerCase());