Pro Bono Geek: The Tagging Paradigm

Welcome to my first installment on technical rumination... let's all hope by the time it's posted I've actually had something valuable to say that is more than just jargon-filled ramblings. I've been thinking about this topic for a while now, ever since a conversation I had with David (sorry Mr. Morgan, you are no longer the only David in my life). He (David) is in the process of laying the initial foundations for our company's new CMS. It's actually a very interesting process, building a CMS, because you are making decisions with significant long-term impact that are not easily changed--regardless of how agile one might be--because once the system is in use with clients, there's no easy way to retrain them. So David and I spend several hours every week discussing long-term implications, naming strategies, and general paradigms. Most recently we discussed the implications of a tagging-based system.

For those not immediately familiar with tagging, here's a brief intro as I see it (which means get out your salt shakers, because I'm not going to fall into the boosters club on this one). Tagging is part of the semantic web (SW) movement. For purposes of fair play, here is a direct quote from the W3C's FAQ on the main goals of SW:

The Semantic Web allows two things.

1. It allows data to be surfaced in the form of real data, so that a program doesn’t have to strip the formatting and pictures and ads off a Web page and guess where the data on it is.
2. it allows people to write (or generate) files which explain—to a machine—the relationship between different sets of data. For example, one is able to make a “semantic link” between a database with a “zip-code” column and a form with a “zip” field that they actually mean the same – they are the same abstract concept. This allows machines to follow links and hence automatically integrate data from many different sources.

So, the idea is to take human-readable content and convert it into machine-readable content in a way that makes sense and allows for all sorts of cool functionality. The best example of this in today's world is an RSS feed, which takes blog posts (just like this one) and converts them into data that an RSS reader can understand and synthesize into whatever format you, as the reader, desire. So far, so good.

Tagging fits into the semantic web concept as the chief mechanism for aggregating data. So, imagine we have three bits of data: bit one [1] is tagged as "blue" and "red"; bit two [2] is tagged as "red" and "yellow"; and the final bit [3] is tagged as "yellow" and "blue". Now we can ask for all the bits tagged as blue [1,3], all the bits tagged as blue and red [1], the bits tagged as blue or red [1,2,3], or even all the bits tagged blue and not red [3]. The power of this approach is obvious... the problem is that, in the rush to embrace tagging, no one seems to be talking about its weaknesses.
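The four queries in the bits example map directly onto set operations. Here is a minimal sketch in Python; the `bits` dictionary and the `tagged` helper are names I made up for illustration, not part of any real tagging library.

```python
# Hypothetical data: each bit id maps to its set of tags, as in the example above.
bits = {
    1: {"blue", "red"},
    2: {"red", "yellow"},
    3: {"yellow", "blue"},
}

def tagged(tag):
    """Return the set of bit ids that carry the given tag."""
    return {bit_id for bit_id, tags in bits.items() if tag in tags}

blue = tagged("blue")
red = tagged("red")

# sorted() is used only to get a stable, printable order; the sets have none.
blue_only = sorted(blue)            # tagged blue
blue_and_red = sorted(blue & red)   # intersection: blue AND red
blue_or_red = sorted(blue | red)    # union: blue OR red
blue_not_red = sorted(blue - red)   # difference: blue AND NOT red

print(blue_only, blue_and_red, blue_or_red, blue_not_red)
```

The results match the bracketed answers in the paragraph above: [1, 3], [1], [1, 2, 3], and [3].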

Before I get into that, let me say something about my general philosophy when it comes to Computer Science. My general belief is that the original computer science folks were really smart and figured out nearly everything truly neat that was to be discovered based on the hardware available at the time. Linus Torvalds said it best, when commenting on Microsoft's claim that they held patents on technology in the Linux Kernel: pretty much everything about operating systems was figured out in the 60s and 70s. If you believe this--as I do--then you begin to see that the only true innovations left are a combination of hardware and software... here I'm thinking parallel programming made possible by dual-core CPUs.

Which brings us back to tags... why, if they have such power, are they really only now coming into vogue? Several possible reasons to be suspicious come to mind.

1) Tags are unstructured data.

When talking about this with David, he said he sees tags as lists, but I don't think that's right. A list implies an order, with a start, an end, and a clearly defined sequence between the two. But going back to our bits, if I ask for all the blue bits, the results could be [1,3] or they could be [3,1]. There is nothing about a set of tagged items that says either is right or wrong. It's better to say that what a tag gives you is a collection. Of course, I can impose an order on the collection by sorting the elements, but then the ordering is coming from within the data and is not part of the collection itself. This means (a) I have to order the collection every time I want to work with it, and (b) I cannot apply an arbitrary order to the collection unless I start storing metadata about the order in the items themselves.
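To make the list-versus-collection distinction concrete, here is a small Python sketch. The `titles` and `position` dictionaries are hypothetical stand-ins for "data inside the items" and "stored ordering metadata" respectively.

```python
# As collections (sets), the two answers are the same thing.
same_collection = {1, 3} == {3, 1}   # True: a set has no order
# As lists, they are different things.
same_list = [1, 3] == [3, 1]         # False: a list's order is part of its identity

blue = {1, 3}

# (a) Order derived from the data itself has to be recomputed on every use.
titles = {1: "Zebra", 3: "Apple"}    # hypothetical item data
by_title = sorted(blue, key=lambda bit_id: titles[bit_id])

# (b) An arbitrary, hand-picked order needs extra metadata stored somewhere.
position = {1: 0, 3: 1}              # hypothetical ordering metadata
by_hand = sorted(blue, key=position.get)

print(same_collection, same_list, by_title, by_hand)
```

Sorting by title puts bit 3 first, while the hand-picked order puts bit 1 first, and neither order lives in the tag itself.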

2) Tags will never be as efficient as an actual list.

In the world of web development, the easiest way to create a list is to just use a table in your relational database. This is handy because it's just one table and you don't have to mess with JOIN statements, which are generally slower than a single-table SELECT statement. Tagging, however, requires three tables: (1) a table of the data to be tagged, (2) a table of the tags, and (3) a table pairing the first and second tables together. So, now you've got three tables and two joins, which is just not going to be as fast as a simple SELECT.
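For the curious, the three-table layout looks something like the sketch below, using Python's built-in sqlite3 module and an in-memory database. The table and column names (`bits`, `tags`, `bit_tags`) are my own assumptions for illustration, not taken from any particular CMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bits (id INTEGER PRIMARY KEY, body TEXT);          -- the data
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE);   -- the tags
    CREATE TABLE bit_tags (                                         -- the pairing
        bit_id INTEGER REFERENCES bits(id),
        tag_id INTEGER REFERENCES tags(id),
        PRIMARY KEY (bit_id, tag_id)
    );
    INSERT INTO bits VALUES (1, 'bit one'), (2, 'bit two'), (3, 'bit three');
    INSERT INTO tags VALUES (1, 'blue'), (2, 'red'), (3, 'yellow');
    INSERT INTO bit_tags VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 3), (3, 1);
""")

# Plain list: one table, no joins.
all_bits = [row[0] for row in conn.execute("SELECT id FROM bits ORDER BY id")]

# Tag lookup: three tables, two joins.
blue_bits = [row[0] for row in conn.execute("""
    SELECT bits.id
    FROM bits
    JOIN bit_tags ON bit_tags.bit_id = bits.id
    JOIN tags     ON tags.id = bit_tags.tag_id
    WHERE tags.name = ?
    ORDER BY bits.id
""", ("blue",))]

print(all_bits, blue_bits)
```

Note the explicit ORDER BY in the tag query: without it the database is free to return the blue bits in any order, which is the point of reason 1 above.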

3) Over-reliance on tags could replace good design.

Of course, developers already know this even if they are rushing to embrace the tagging concept. They know it because they are still using foreign keys in their tables to provide the usual one-to-one and one-to-many relationships. To demonstrate this, consider a recent project of mine, an online election tool. For this I had a table of polls, a table of choices, and a table of votes. The choices table had a poll_id field that linked each choice to its poll, thus allowing me to ask for all the choices associated with a particular poll, or find the poll of a particular choice. The votes then had a choice_id and a poll_id, so I could do the same lookups. I did this because it was efficient and easy to understand... but one could envision doing the same thing with tags. Come up with a tag name for the poll itself and then just tag all the choices and votes with the poll tag. We'd get the same result, but it would require a lot more joining and be less efficient overall.
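Here is roughly what that foreign-key design looks like, again as a hedged sketch with Python's sqlite3 module. Only `poll_id` and `choice_id` come from the description above; the other columns and the sample rows are invented for the example and are not the actual schema of the election tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE polls   (id INTEGER PRIMARY KEY, question TEXT);
    CREATE TABLE choices (id INTEGER PRIMARY KEY,
                          poll_id INTEGER REFERENCES polls(id),
                          label TEXT);
    CREATE TABLE votes   (id INTEGER PRIMARY KEY,
                          poll_id INTEGER REFERENCES polls(id),
                          choice_id INTEGER REFERENCES choices(id));
    INSERT INTO polls   VALUES (1, 'Favorite color?');
    INSERT INTO choices VALUES (1, 1, 'blue'), (2, 1, 'red');
    INSERT INTO votes   VALUES (1, 1, 1), (2, 1, 1), (3, 1, 2);
""")

# All the choices for a particular poll: a single-table SELECT.
choices_for_poll = [row[0] for row in conn.execute(
    "SELECT label FROM choices WHERE poll_id = ? ORDER BY id", (1,))]

# The poll of a particular choice: also a single-table SELECT.
poll_of_choice = conn.execute(
    "SELECT poll_id FROM choices WHERE id = ?", (2,)).fetchone()[0]

# Vote tally for a poll, still without a join because votes carry poll_id.
votes_per_choice = conn.execute(
    "SELECT choice_id, COUNT(*) FROM votes WHERE poll_id = ? "
    "GROUP BY choice_id ORDER BY choice_id", (1,)).fetchall()

print(choices_for_poll, poll_of_choice, votes_per_choice)
```

Every lookup here touches one table. The tag-based version of the same design would route each of them through a tag table and a pairing table.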

4) In a collaborative web setting, tagging makes too many assumptions about users.

While researching for this post (that's right, research!) I discovered an excellent essay by internet luminary Cory Doctorow entitled Metacrap: Putting the torch to seven straw-men of the meta-utopia. He has seven reasons why the meta-data system championed by the semantic web is doomed to failure. It's short and worth a read, but the items I want to draw attention to are: (2) people are lazy and (7) there's more than one way to describe something.

4.a) People are lazy.

This blog has a tagging system, provided at no cost by the fine folks at blogger.com. If you scroll down you'll see a GIANT list of tags I've used over the years on various stories. It's a mess, and as you'll see, very few tags have more than one or two posts. I don't consider myself a lazy person, but the truth is that I haven't put in the time to organize and categorize each story... and since the tagging options available to me are limited only by my imaginative vocabulary, chances are the list of tags is only going to continue to grow... making an increasingly useless tag system for my readers.

4.b) There's more than one way to describe something.

This is the kicker, and Cory hits it right on the head when he says, "No, I'm not watching cartoons! It's cultural anthropology." Both are fair ways of describing something like The Simpsons or Family Guy. Unless we want to go around tagging everything with all the possible synonyms (whether agreed to or not), those left with the dubious task of aggregating data based on the tags are going to have a difficult time finding everything they seek.

5) Namespace Collision

The final argument is especially geeky, so hold on tight. In the real world, when I'm talking with Sarah about "David" she knows I'm talking about David at work, because our namespace has "David" mapped to "David Chelimsky who works at Articulated Man," but when I talk to Sheridan about "David" he knows I'm talking about "David Morgan who works at Ernst & Young" because that is our shared namespace. Two people, both tagged with "David", but the tag points to a different person depending on the context in which I say the name. You run into the same problem whenever you search on Google and the results are not 100% what you are looking for. Humans don't really think in terms of namespaces, but for computers they are a necessity. Without a clear context, you end up with namespace collision. So, how do we solve the two Davids dilemma? We could provide more exhaustive tagging... so, now I tag them as "David Morgan" and "David Chelimsky"... but what if I know two David Morgans? Do I provide a tag that is really long and includes date of birth and social security number? Thing is, developers already use the social security number approach, by assigning every object a unique identification number. It's not especially human readable, but it does resolve the namespace collision.
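The two Davids dilemma can be sketched in a few lines of Python. The id numbers (101, 102), the `find_by_tag` and `resolve` helpers, and the per-listener `namespaces` table are all hypothetical, made up to illustrate the idea.

```python
# Every person gets a unique id (the "social security number" approach).
people = {
    101: {"name": "David Chelimsky", "org": "Articulated Man"},
    102: {"name": "David Morgan", "org": "Ernst & Young"},
}

def find_by_tag(tag):
    """Look people up by a bare first-name tag, with no context at all."""
    return sorted(pid for pid, person in people.items()
                  if tag in person["name"].split())

# Without context, the tag collides: both Davids come back.
davids = find_by_tag("David")

# With a namespace per listener, the same tag resolves to exactly one id.
namespaces = {
    "Sarah": {"David": 101},
    "Sheridan": {"David": 102},
}

def resolve(listener, tag):
    """Resolve a tag inside one listener's namespace to a full name."""
    return people[namespaces[listener][tag]]["name"]

print(davids, resolve("Sarah", "David"), resolve("Sheridan", "David"))
```

The bare tag returns two ids, while the namespaced lookup returns one person per listener. The unique id does the real work; the tag is just a convenient human label layered on top of it.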

Just like Cory, I don't walk away from tagging with a sense of abandonment, just healthy skepticism. I think there are three primary lessons to pull out of all of this that govern when tags make sense. 1) When the aggregation of data is truly arbitrary. This is how we avoid a tagging regime for the polling system I described above. There was a clear object hierarchy there and using the tables to describe that hierarchy was far more concise than tags. 2) When the data has a clear enough context to avoid namespace collisions. You don't want to be tagging by a person's name... there are just too few names in the world. But, if you are working with a theatre company, tagging by the particular performance ("Romeo & Juliet", "Midsummer Night's Dream", etc.) might do the trick. 3) When the act of tagging itself is centrally controlled. This is to say, allowing the public at large to apply tags leads to pretty useless tags (see slashdot.org as an excellent example of useless tags... not to say they aren't funny). But if the tagging regime is centrally controlled, then you can rely on a uniform scheme of description and prevent the laziness problem that I face. If I had come up with a uniform system for this blog at the beginning, my tags would be useful today.

I will wait to see how tags unfold in the larger internet. Done in limited situations and with the proper structure, there is real power... but just beyond that ridge is a desert of poorly organized and inefficiently accessible data.

Saturday, March 22, 2008

Sean Kellogg
