Nate's Technical Blog

Monday, March 8, 2010

Directory Head in Windows

I had a need to display the top 10 most recently updated files in a directory containing somewhere around 25,000 files. The files are being updated on an irregular, but continuous, basis. The purpose is to provide a very simple monitor / status report of the process that is updating these files.

My first thought is to do

ls -l | head

That would work perfect, if we were running on a *nix platform. Sometimes, it's hard to be a programmer in a windows environment.

My next thought was to use ColdFusion; something along the lines of

<cfdirectory
action="list"
name="dirContents"
directory="#GetDirectoryFromPath(GetCurrentTemplatePath())#"
sort="DateLastModified DESC">
<cfloop query="dirContents" endrow="10">
<cfoutput>#dirContents.Name# #dirContents.DateLastModified#
</cfoutput>
</cfloop>

Unsurprisingly, this took way too much time, as ColdFusion was fetching meta data for each and every file in the directory. I would have liked to try using an event gateway, but alas, this is Standard Edition ColdFusion instead of Enterprise. I also looked into Java, but found similar limitations. It was quite obvious that the only way to do this in a reasonable time was to interact with the file system at the OS level.

I'll spare everyone the gruesome details of the Google searches, but here is the batch file that I came up with.

@echo off & setLocal enabledelayedexpansion
set dirpath=.
set limit=10
set order=/O:-D
if not [%1]==[] set dirpath=%1
if not [%2]==[] set limit=%2
if not [%3]==[] set order=%3
SET /a N=0
for /F "skip=5 tokens=* delims= " %%i IN ('dir %dirpath% %order%') do (
SET /a N+=1
echo %%i
IF !N! GEQ %limit% GOTO end
)

:end

The trickiest thing was getting the counter to increment inside the loop iterations. Sounds easy, but in a batch file it is not intuitive at all, and I had trouble finding good examples. It came down to a combination of the setLocal enabledelayedexpansion at the top and the syntax using the exclamation marks, !N! instead of percent signs.

Wednesday, November 11, 2009

Is Object-Oriented ColdFusion a Myth?

ColdFusion has come through quite a metamorphosis over the last decade. Anyone who is familiar with ColdFusion 5, or even 4.5, can attest to that. In those days, there were no CFC's, and techniques for modularizing an application's code were limited to simply physically breaking the code into separate files and relying on the <cfinclude> tag and custom tags.

When ColdFusion MX was released, CFC's were introduced, opening the door for the basic object-oriented approaches. With each successive version of ColdFusion, new features have been introduced that bring it more in line with contemporary languages that are generally considered the pillars of OOP: Java and Microsoft's answer to Java, .NET. Even so, there are certain limitations to ColdFusion's OO abilities. The lack of true constructors is a glaring omission that I'd love to hear somebody at Adobe rationalize. ColdFusion does not support polymorphism per say, nor does it provide for multiple inheritance.

The community using ColdFusion has learned to slide around these restrictions though; everyone has seen an init function in a CFC, right? CFC's are quite popular, but I am constantly surprised how few people actually use a CFC as a true object – an encapsulation of data and methods. Very few people I have come in contact with can clearly explain how to store instance data safely in an object defined in a CFC. Many of the ones that are actually storing data in objects, rely on a framework to define, instantiate and initialize the objects and do so just because the frameworks help make their code pretty. The vast majority, however, seem to use CFC's only as containers for static methods. What could be a very powerful framework is being reduced to a utility for organizing methods, which is not all that far from the custom tags of ColdFusion 5.

There are probably many reasons for this lackadaisical use of the OOP paradigm that I've observed. Maybe I have seen a bad sample population; I know there are good ColdFusion programmers out there that understand and use OOP principles every day. But I'll go ahead and offer up some hypotheses:

No ORM Needed – One of the fundamental strengths of ColdFusion has always been the ease with which it handles interaction with database systems. While other languages force the programmer to piece the SQL of the query together and then marshal the resulting data into an object or array of objects, ColdFusion takes in SQL and produces a query object (and yes, it is an object). Who needs to go to the work to implement an object-relational mapping when there is <cfquery>? I know ColdFusion 9 introduced ORM implemented in Hibernate, but I'm far from being onboard on that because I have yet to see an example of a ColdFusion application that would significantly benefit from Hibernate. The point is that ColdFusion itself is already really good, probably better than any other language out there, at handling data from a database system without user-defined objects, so the need to store data in CFC's has never been critical to an application's design.

Individual Programming Experience – How many ColdFusion programmers came from an OO background? How many OO programmers go into ColdFusion? How many ColdFusion programmers have a computer science degree?

Performance – Instantiating objects in ColdFusion is expensive. An average data-driven web site is going to have a whole lot of data objects being instantiated on the fly. If these data objects are just POJO-type objects, then one would have to justify the additional expense of instantiating all these objects that do nothing more than encapsulate data without offering any complex methods to operate on the data.

Ease of Coding – Let's face it, in ColdFusion it's hard to instantiate an object and store data inside it. It's much easier to declare a struct and shove the data in there, not to mention the all-powerful <cfquery> tag.

Informal Class Definitions – No formal definition of the data members of a CFC is ever made. Any method can add a private data member at any time (of course some other OO languages like Javascript allow this as well). Without a formal way to define the data members, it just isn't obvious that the data members should even be used. There is the <cfproperty> tag, but its use is limited to helping generate the WSDL when a CFC is exposed as a web service.

History – Applications written for CF 5 or lower do not change to OOP without a lot of work. Similarly, programmers comfortable in procedural programming will tend to stick to that.

In the end, it might be the uniqueness of the ColdFusion language that is the underlying issue here. On one hand, it is one of the easiest languages to pick up and become fluent in, so understandably there are a lot of people using it that have had no formal OOP experience or schooling. On another, you have the extreme ease of database interaction with the <cfquery> tag and the loosely-typed variables that tend to make most of the formal programming paradigms out there seem like overkill. Even after considering all of that, can somebody tell me why it would be a bad idea to allow constructors to be defined in CFC's?

Tuesday, February 3, 2009

Pitfalls of Mylyn

If you don't know, Mylyn is a plugin for the Eclipse IDE that provides a task-oriented view of the IDE and helps automate a few tasks. It integrates with your project tracking tools, like Trac or Bugzilla, so you can get a list of the tasks (a generic word for tickets or bugs) that you need to work on right there in Eclipse. You activate one of the tasks, and Mylyn will automatically restore the IDE to the state it was at when you left that task by:

Opening the files you were working on
Changing the perspective, if necessary
Limiting the navigator to show just the files falling inside the context of the task

As you work on the task, it keeps track of which files you access (even those you just view) and adds them automatically to the context of the task. When you're done with the task, you can attach the context as a .zip file to the ticket so another developer would be able to reference it later.

That's a short overview of some of the handy features of Mylyn that I use. As I've used it over the last few months, I have discovered two issues that I would like to detail.

First off, the task context will absorb any files that are touched by a SVN snychronize or SVN Update. When working in a team development environment, it's best to update your local working copies frequently to receive the latest code updates from the rest of the team to avoid any integration issues. Eclipse makes this easy by allowing you to schedule a refresh of the synchronize view, which will alert you to any incoming changes. However, any file identified to have incoming changes during a refresh of the scynchronize view will be added to the context of the active task. Additionally, if you perform a SVN update while a task is active, any files updated will also be added to the context of that task. I consider this to simply be a bug in Mylyn; it is needlessly adding files to the context of the task, to the point where the context becomes so bloated it is useless. There are really only two options for maintaining your task context:

Deactivate the task before any SVN operations (which is hard to do if you are using the scheduled refresh of the synchronize view).
Constanly remove extra files from the context.

The second issue is a by-product of what is really very useful feature that Mylyn provides: the commit comment template. This feature allows you to set up a default format to use for the comment when commiting code that can automatically insert properties of the active task. For example, you could set it up to default to "Refs [insert ticket number here]". This is especially handy when you are using the trac-post-commit-hook in your subversion repository to automatically update the ticket when you commit a revision for it, because you don't accidently mistype a commit comment and fail to trigger the hook.

The problem comes when a developer uses this template to completely formulate their commit message and fails to provide any additional information about the specifc changes they are committing. If a ticket ends up with several revisions, and you ever have to review those reivisions, you end up looking at a list of revisions with exactly the same comment. Not very helpful. This is purely a case of developers being too lazy to properly document their revisions (the infamous "lack of documentation" problem), but I would nevertheless categorize it as a potential pitfall of using Mylyn. It is not a reason to avoid using Mylyn in of itself, but it is something to be aware of and instruct your team to avoid.

Thursday, August 14, 2008

ColdFusion Looping Techniques

For those of you that may not be familiar with ColdFusion, it has a lot of different ways to loop over a collection of data. There is, of course, the query loop:

There are other looping methods for other data collections like lists, arrays, and structures. When it comes to arrays versus lists, the mantra has always been that arrays are always more efficient than arrays. Lists, however, are deeply ingrained in the programming habits of most ColdFusion developers, which is undoubtedly helped by the presence of functions like ValueList.

I recently heard about a comparison of the performance of the various looping techniques that was posted on a blog by Brian and Eric. It's a well written article, and I recommend reading the whole thing. However, the important point that I took away was that looping over an array was only marginally better than looping over a list. All other looping techniques were significantly less efficient.

Saturday, March 15, 2008

Solving a Memory Leak in ColdFusion MX

Executive summary: Complex queries can cause a memory leak in ColdFusion MX. It seems that when ColdFusion caches the prepared statement of a complex query, that statement never gets freed for garbage collection. The solution is to set the value of the Max Pooled Statements in the datasource settings to zero.

Read on if you want the gory details of how we came to this conclusion.

We recently took the giant leap from ColdFusion 5.0 to ColdFusion MX 6.1. There are several reasons why we haven't made the migration until now, some were well-founded, but none of them are going to be of any interest to anyone, so I won't bother you with them. The transition was done as we upgraded to newer servers. We set up the new servers with MX 6.1 and gradually moved traffic over to them. As the MX servers took on more traffic, they became more unstable. The JRun service would just randomly die without any trace of a problem.

We were less than a week away from the scheduled release of a large MX-dependent project. To prepare, we moved the last remaining traffic to the MX servers, and the servers became extremely unstable. During peak traffic time, if one server happened to die, the load balancer would pull it out. Then all of the traffic would slam the other server, it would slow down, and get pulled by the load balancer, starting a vicious cycle that would go on until traffic died down.

After no small amount of work and research, we ended up adding physical RAM to the servers and increasing the heap size and the max perm size in the JVM arguments. We increased these parameters as high as they would go and still allow the service to start. It seemed to help. We had two solid days without any problem, so we decided to roll out the new project.

It promptly fell on its face.

The servers could not stay up, under normal daytime load, for more than 30 minutes. We had to roll everything back.

At this point, it's a disastrous mystery. The new code is suddenly suspect, of course, but there are plenty of other variables to consider as well. A few days after the code was rolled back, we continued to see instability with the MX servers, so that pointed to a more general issue rather than anything in the new code. The MX servers are around 3-4 years newer than the CF 5 servers they are replacing, and they can barely keep up with the traffic. (Whoever said CF MX 6.1 would perform anywhere close to CF 5 must be in marketing, because anyone with any familiarity with Java knows it's not going to outperform a native application.)

The new code was the first to rely heavily on CFC's, one of which was particularly large. Our initial hypothesis was that this particularly large object was being instantiated but not freed for the garbage collector, leading to a rapid increase in memory usage that eventually chokes the server. We looked extensively into the reported memory leak ColdFusion MX. We even went so far as to refactor a major section of the project to avoid using CFC's altogether.

Another major suspect was the JVM itself. ColdFusion MX 6.1 uses Java 1.4.2, which is old. I have no idea how old because I didn't see any version history after a very brief look at Wikipedia, but it's the first flavor of Java I ever knew. Of course, Macromedia/Adobe does not recommend using a newer JVM on 6.1, which could be an honest recommendation or a ploy to get people to upgrade. Either way, we started taking a serious look at ColdFusion MX 8.0.

When we started load-testing (this point in the software life cycle is a terrible time to start load-testing), it immediately became obvious it was indeed a problem with memory. The old code, which was stable during the testing, when ran under simulated load would use 80-85% of the memory allocated to the JVM. The new code, ran under the same load, would climb to around 95% memory usage and induce a constant cycle of full garbage collections. These full garbage collections would not recover any memory, however. Since a full garbage collection has to pause all threads in order to run, the server would be come unresponsive for all intents and purposes.

We scrounged up another server that we could install ColdFusion MX 8.0 onto. Although the code failed in ColdFusion 8.0 just like it did in 6.1, this was really a godsend, because ColdFusion 8.0 uses Java 1.6 by default, which has much better tools for investigating these types of issues than 1.4.2 does. After a few heap dumps taken under load, and a lot of help analyzing them from our in-house Java experts, we figured out most of the memory was being used by database connections. We had over 2000 instances of database connections, and all of their supporting objects, instantiated. That's pretty impressive when you consider the load test only simulated ten concurrent users.

We eventually narrowed the problem down to one query. Replace the real query with a simple "SELECT staticValue AS columnName" type query, and everything ran like a swiss watch. The query in question was fairly complex, with several joins. The funny thing is that it existed in the old code, and was only made a little more complex in the new code.

Before we set off trying to optimize this query by using views, stored procedures and every other trick we knew, we started playing with the datasource settings. We noticed that the default value of the Max Pooled Statements setting is 1000. We just happen to have two datasources used in the application, and 2 x 1000 = 2000, the approximate number of database connection objects in our heap. We cut this back to 100, with negligible effect. We set it to zero, and, wow, did that ever help. Not only did that make the application stable (which is the most important metric), memory was stable at 20-25% percent usage (a dramatic change), CPU load was down slightly, average request time was slightly faster, total requests per second was slightly higher, and the database performance was not effected at all. All good and no bad.

We made the change to our production servers, and memory usage seems to be stable around 10%, rather than near 60-70%.

So there you go - if you happen to have a ColdFusion MX server that is operating at high memory usage, you may want to check your Max Pooled Statements setting. Of course, try it in a non-production environment first. There are so many possible causes of a memory leaks in ColdFusion (and in general, Java and other virtual machines as well), I hate to add one more thing to the check list, but maybe it will help someone out. I should mention we verified this behavior against Microsoft SQL Server 2000 and 2005, and I can't say if any other DBMS are affected.

In a piece of wonderful irony, we discovered that someone had posted this very same advice under the username of bert.d to the 6.1 livedocs; after we had figured it out on our own of course.

Wednesday, March 5, 2008

Real-Life Search Engine Optimization

An acquaintance of mine was launching a web site for his business and wanted some advice on how to publicize it. Ah, the classic Search Engine Optimization (SEO) problem. I have always wanted to formate a proof that this was an NP-Complete, as results can be a polynomial in time and effort.

This is a huge question for all sorts of people, as it affects every web site on the internet in one way or another. The reality of the problem only hits you when you stop to consider the position of the search engine, whose sole purpose (forget about selling ad space for a minute) is to completely categorize and index the world wide web to allow people to find the information they need with minimal effort. This is a hopelessly impossible situation the search engines are in, so they are forced to take short cuts, cheat off each other, make assumptions, and, in the end, usually provide sub-optimal results.

The best advice I have: think like a search engine. This may be easier for programmers like myself, but probably not. I would venture a guess that very few people in Google would be able to tell you how to best present a site to the Google god for processing.

The next best advice: know your target audience. Try to imagine someone looking for your web site, or more appropriately, the information on your web site. What search term(s) would they use? Make sure those terms are on your web site, and draw as much attention to them as possible. It's no small irony that the tech community has it easy here. Programmers often search for specific terms (strings) that aren't even real dictionary words that can often only mean one specific thing. If I am looking for documentation on a certain ColdFusion tag, maybe <cfinvokeargument>, I can simply enter "cfinvokeargument" into any search engine and find what I need. If I were to type in "orange", however, the search engine wouldn't know if I was looking for the proper RGB values of the color orange, what varieties of oranges are grown in Texas, or the location of the county courthouse in Orange county, CA.

Enough with the abstract advice. Here are a few suggestions that I have used in real-life situations, in no particular order.

Domain Name is a Key
Get a good domain name, like AcmeRocketSales.com or something closely related to your company or web site name. This may seem like a no-brainer, but search engines will see this as a mark of legitimacy and permanency, just like a human would. Darn, those computers are smart.

The Power of the People
Get your site listed on the Open Directory Project. This is a project with the goal of indexing the entire web - with a twist that it doesn't rely on robot/spider software to "crawl" the web to find and categorize web sites. Instead volunteer human editors must review and approve every submission. This results in a very accurate listing of legitimate sites. As a result, normal search engines, like Google and Yahoo, pay very close attention to web sites added to the ODP. Getting listed here will go a long way to getting your site more value in the eye of a search engine.

Meta Tags do no Harm
Use the <meta> tags for keywords and descriptions in the HTML code of your web site. Make them the same on every page of your site. These tags used to be the primary thing that search engines used to automatically index web pages, but they were abused so much by people that the search engines started to just ignore them. So then people quit using them. Rumor is they are on the rebound though. Whatever the case may be, they can't hurt.

Self-Linking is like Self-Promotion
Make sure every page on your web site has links to the home page and the top level sections of your web site. This is pretty standard nowadays with site layouts where every page has a common navigation area.

Embrace the Power of the Web
Get other web sites to link to your site. Every web site the links to yours increases the importance of your site in the eyes of a search engine. If any of your employees have a blog, facebook, myspace, etc., have them add a link someplace. You could approach some of your customers and ask if they would consider putting a link as a recommendation (business cards with your URL help a lot here). By the way, Google happens to run a small blogging site. You think it scans those blogs more frequently or more efficiently? I bet so...

Geocode Yourself
Your physical location is one of the most unique attributes you have, so you had better use it. Make sure your address is displayed, in full, on your web site. The more often, the better, so consider sticking it in the footer or somewhere. Register with Google local; provide as much information as you can.

Feed the Beast, and the Beast May Not Eat You
You can always throw in the towel and advertise with search engines themselves, like Google. I'm not real sure how this works, but I think you can set something up so your ad shows up if someone searches for certain search terms. I would consider this a last resort. It's a sure-fire solution, but it's not free.

Thursday, February 28, 2008

SQL Update From Same Table

I recently had a situation where I had to update some columns in several rows of a table with values taken from other rows in the same table. I thought this was the perfect situation to use the SQL "UPDATE ... FROM" strategy, but I ran into some problems.

Here's the set up:

CREATE TABLE table1 (x int not null, mod10 int null )
-- insert one batch of data
INSERT INTO table1 (x, mod10) VALUES (1, 1)
INSERT INTO table1 (x, mod10) VALUES (2, 0)
INSERT INTO table1 (x, mod10) VALUES (3, 1)
INSERT INTO table1 (x, mod10) VALUES (4, 2)
INSERT INTO table1 (x, mod10) VALUES (5, 5)
INSERT INTO table1 (x, mod10) VALUES (6, 6)
INSERT INTO table1 (x, mod10) VALUES (7, 7)
INSERT INTO table1 (x, mod10) VALUES (8, 8)
INSERT INTO table1 (x, mod10) VALUES (9, 9)
INSERT INTO table1 (x, mod10) VALUES (10, 0)

-- insert second batch of data
INSERT INTO table1 (x, mod10) VALUES (11, NULL)
INSERT INTO table1 (x, mod10) VALUES (12, NULL)
INSERT INTO table1 (x, mod10) VALUES (13, NULL)
INSERT INTO table1 (x, mod10) VALUES (14, NULL)
INSERT INTO table1 (x, mod10) VALUES (15, NULL)
INSERT INTO table1 (x, mod10) VALUES (16, NULL)
INSERT INTO table1 (x, mod10) VALUES (17, NULL)
INSERT INTO table1 (x, mod10) VALUES (18, NULL)
INSERT INTO table1 (x, mod10) VALUES (19, NULL)
INSERT INTO table1 (x, mod10) VALUES (20, NULL)

I want to update the mod10 column in the rows where mod10 is null, and the values I want to use should come from earlier entries in the table. Here's the query I tried:

UPDATE table1
SET mod10 = t.mod10
FROM table1 t
WHERE table1.x = t.x+10
AND table1.mod10 IS NULL

Looks like it should work just fine, but I get errors like

Server: Msg 107, Level 16, State 3, Line 1
The column prefix 'table1' does not match with a table name or alias name used in the query.

It took me a while to figure this out, but there is a sneaky way around this:

UPDATE table1
SET mod10 = t.mod10
FROM (SELECT * FROM table1) AS t
WHERE table1.x = t.x+10
AND table1.mod10 IS NULL