Wednesday, February 11, 2009

The Trouble with Enumerables

Quite a lot of political postings for what's suppose to be a technical journal recently... time for a return to our traditional values!

Today's topic is Enumerables. Originally I thought this post was going to be about iterators, but on reflection, iterators aren't really the trouble here... but I'm getting ahead of myself. Ruby, dynamic scripting language of MVC fame, has this Enumerable concept. What it does is takes a set of objects and performs actions on that set of objects, sometimes returning a modified set, sometimes returning a single element from the set. The available actions are known as iterators, and some of the more common ones are each, select, and collect.

The underlying mechanic for all of these iterators is the each iterator, that does nothing more than return each item in the set, one at a time. You can then wrap the each iterator with additional functionality to generate all the additional iterators... so, if you wanted to go through each item in the set until you find an item that meets an established criteria, you would use detect. Or, if you wanted all of the items that meet the criteria, you would use select. The big upside is that you get really clean looking code since you are using all this built in magic instead of writing your own.

Let's look at a bit of code to see what I'm talking about. We start with an array of hashes describing a person, with their name and address:
  person_array = [
{
:name => 'alice',
:address => '123 Fake Street'
},{
:name => 'bob',
:address => '1600 Pennsylvania Avenue'
},{
:name => 'charlie'
:address => 'Infinite Loop'
}
]
Now, we want to know alice's address, so we can use the detect iterator to find a hash with a :name value that equals "alice"
  person = person_array.detect { |person| person[:name] == 'alice' }
puts person[:address]
Which, one must admit, is pretty slick. In that one little line of code we go all the looping, variable checking, and returning behavior. You can even take it a step further and wrap the whole thing in a method:
  def address_for(name,person_array)
person = person_array.detect { |person| person[:name] == name }
return person[:address]
end
Now, you're probably asking, if I think all of this is so slick, why title this post The Trouble with Enumerables? The reason these little guys are trouble is because they can hide inefficient implementations behind clean looking code because it teaches developers to treat all sets of data as equivalent.

But, the truth is, sets of data are not equal in the eye's of their maker. In fact, in the dynamic language world there are really two different kinds of sets, each with their own particular strengths and weaknesses. On one hand, you have the arrays, which is ordered data, meaning that each item in the set is stored in a specific order, 0..n, and you can loop through that safe in the knowledge that you're going to get the data in the same order every time. On the other hand you have hashes, which is keyed data, meaning that each item is stored based on a hashed key value, such that you can find it again quickly by just repeating the hash algorithm.

Let's return to the code example from above. I used an array there, but if I know I want to use that data to find addresses based on a name, it would be much faster if I used a hash that looked like this:
  person_hash = {
:alice => '123 Fake Street',
:bob => '1600 Pennsylvania Avenue',
:charlie => 'Infinite Loop'
}
Then to access alice's address all we would need to do is:
  puts person_hash[:alice]
We can even wrap it in the same method as above:
  def address_for(name,person_hash)
return person_hash[name]
end
Now, the output is the same in both cases, "123 Fake Street", but the two implementations differ in important ways. The first one (with the array) is an O(n) speed function, meaning that to find the desired result under the worse case scenario, it will take n runs of the function to do so (n being the number of items in the array). That's because we have to look at each item in the array to see which one has a :name value of "alice". The second implementation (with the hash) is just O(1), or whatever constant number you want to put in between the parentheses. No matter how many items are in the hash, it will take the same amount of time to find alice's address.

Which is why Enumerables are trouble. Because they are so darn easy to use they hide the fact that you may have traded a possible constant time function for a linear time function with no actual gain in functionality.

So, my fellow ruby developers (and any other language that has enumerable like functionality), the next time you reach for your favorite iterator, ask yourself, am I using this because it's easy and looks clean, or am I using it because it's the best tool for the job? There are plenty of good reasons to using an enumerable, but if your sole reason is because it's "easy and clean," then you are only asking for trouble down the road when you put your code into production and suddenly that array of four items you tested during development has become 4,000 items and that one function is slowing everything down, to say nothing of all the other functions where you made the same short-sighted decision.

No comments: