Executive summary: Complex queries can cause a memory leak in ColdFusion MX. It seems that when ColdFusion caches the prepared statement of a complex query, that statement never gets freed for garbage collection. The solution is to set the value of the Max Pooled Statements in the datasource settings to zero.
Read on if you want the gory details of how we came to this conclusion.
We recently took the giant leap from ColdFusion 5.0 to ColdFusion MX 6.1. There are several reasons why we haven't made the migration until now, some were well-founded, but none of them are going to be of any interest to anyone, so I won't bother you with them. The transition was done as we upgraded to newer servers. We set up the new servers with MX 6.1 and gradually moved traffic over to them. As the MX servers took on more traffic, they became more unstable. The JRun service would just randomly die without any trace of a problem.
We were less than a week away from the scheduled release of a large MX-dependent project. To prepare, we moved the last remaining traffic to the MX servers, and the servers became extremely unstable. During peak traffic time, if one server happened to die, the load balancer would pull it out. Then all of the traffic would slam the other server, it would slow down, and get pulled by the load balancer, starting a vicious cycle that would go on until traffic died down.
After no small amount of work and research, we ended up adding physical RAM to the servers and increasing the heap size and the max perm size in the JVM arguments. We increased these parameters as high as they would go and still allow the service to start. It seemed to help. We had two solid days without any problem, so we decided to roll out the new project.
It promptly fell on its face.
The servers could not stay up, under normal daytime load, for more than 30 minutes. We had to roll everything back.
At this point, it's a disastrous mystery. The new code is suddenly suspect, of course, but there are plenty of other variables to consider as well. A few days after the code was rolled back, we continued to see instability with the MX servers, so that pointed to a more general issue rather than anything in the new code. The MX servers are around 3-4 years newer than the CF 5 servers they are replacing, and they can barely keep up with the traffic. (Whoever said CF MX 6.1 would perform anywhere close to CF 5 must be in marketing, because anyone with any familiarity with Java knows it's not going to outperform a native application.)
The new code was the first to rely heavily on CFC's, one of which was particularly large. Our initial hypothesis was that this particularly large object was being instantiated but not freed for the garbage collector, leading to a rapid increase in memory usage that eventually chokes the server. We looked extensively into the reported memory leak ColdFusion MX. We even went so far as to refactor a major section of the project to avoid using CFC's altogether.
Another major suspect was the JVM itself. ColdFusion MX 6.1 uses Java 1.4.2, which is old. I have no idea how old because I didn't see any version history after a very brief look at Wikipedia, but it's the first flavor of Java I ever knew. Of course, Macromedia/Adobe does not recommend using a newer JVM on 6.1, which could be an honest recommendation or a ploy to get people to upgrade. Either way, we started taking a serious look at ColdFusion MX 8.0.
When we started load-testing (this point in the software life cycle is a terrible time to start load-testing), it immediately became obvious it was indeed a problem with memory. The old code, which was stable during the testing, when ran under simulated load would use 80-85% of the memory allocated to the JVM. The new code, ran under the same load, would climb to around 95% memory usage and induce a constant cycle of full garbage collections. These full garbage collections would not recover any memory, however. Since a full garbage collection has to pause all threads in order to run, the server would be come unresponsive for all intents and purposes.
We scrounged up another server that we could install ColdFusion MX 8.0 onto. Although the code failed in ColdFusion 8.0 just like it did in 6.1, this was really a godsend, because ColdFusion 8.0 uses Java 1.6 by default, which has much better tools for investigating these types of issues than 1.4.2 does. After a few heap dumps taken under load, and a lot of help analyzing them from our in-house Java experts, we figured out most of the memory was being used by database connections. We had over 2000 instances of database connections, and all of their supporting objects, instantiated. That's pretty impressive when you consider the load test only simulated ten concurrent users.
We eventually narrowed the problem down to one query. Replace the real query with a simple "SELECT staticValue AS columnName" type query, and everything ran like a swiss watch. The query in question was fairly complex, with several joins. The funny thing is that it existed in the old code, and was only made a little more complex in the new code.
Before we set off trying to optimize this query by using views, stored procedures and every other trick we knew, we started playing with the datasource settings. We noticed that the default value of the Max Pooled Statements setting is 1000. We just happen to have two datasources used in the application, and 2 x 1000 = 2000, the approximate number of database connection objects in our heap. We cut this back to 100, with negligible effect. We set it to zero, and, wow, did that ever help. Not only did that make the application stable (which is the most important metric), memory was stable at 20-25% percent usage (a dramatic change), CPU load was down slightly, average request time was slightly faster, total requests per second was slightly higher, and the database performance was not effected at all. All good and no bad.
We made the change to our production servers, and memory usage seems to be stable around 10%, rather than near 60-70%.
So there you go - if you happen to have a ColdFusion MX server that is operating at high memory usage, you may want to check your Max Pooled Statements setting. Of course, try it in a non-production environment first. There are so many possible causes of a memory leaks in ColdFusion (and in general, Java and other virtual machines as well), I hate to add one more thing to the check list, but maybe it will help someone out. I should mention we verified this behavior against Microsoft SQL Server 2000 and 2005, and I can't say if any other DBMS are affected.
In a piece of wonderful irony, we discovered that someone had posted this very same advice under the username of bert.d to the 6.1 livedocs; after we had figured it out on our own of course.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment