Showing posts with label technology. Show all posts

Sunday, February 15, 2009

Time for a Change

I started this blog with a tip of the hat to my earlier efforts at an online journal and a hope that Blogger would be a more permanent home than past attempts. By any objective measure, Blogger has been a complete success, clocking a total of 289 posts since April of 2005. Less prolific than, say, Wonkette's twenty posts a day, but not too shabby for a dude who's never kept a blog longer than a year.

Recently, I've grown frustrated with Blogger. It's a fine platform, and having someone else deal with the hosting is certainly a plus... but in the end, it's a service over which I have no control. It does exactly what Google wants it to do, and nothing more. For a long time I didn't want more... but times, they are a changing. The first hint of longing came when friends launched two new blogs with WordPress, Minor Failures and GeekBeer. Both blogs have gravatar support, an idea with which I am absolutely smitten. Then, most recently, I posted some code examples and found the Blogger support for showing that code was most disappointing. Combined with the byzantine theming system, the inability to change the blog's domain name, and the general need to refresh the look & feel of the site, one gets a very compelling case to switch blogging platforms.

I administer dozens of WordPress blogs for work and thus have a good deal of familiarity not only with general hosting and administration concerns, but also the internal code structure. PHP isn't my first choice of languages, but it's always fun to contribute when you can. So, gentle readers, this marks my second-to-last post with Blogger. I'm going to take a few days to write a conversion script that will port as much content as possible from the old blog to the new one and find a reasonable theme from which to start customization. As soon as the new blog isn't embarrassing to look at, I will switch the Apache configuration so that blog.probonogeek.org points to the WordPress deployment instead of redirecting you to Blogger. Once that is done I will make a final post letting readers know where to find the new RSS feed and then leave good old reliable http://probonogeek.blogspot.com behind.

Wednesday, February 11, 2009

The Trouble with Enumerables

Quite a lot of political postings recently for what's supposed to be a technical journal... time for a return to our traditional values!

Today's topic is Enumerables. Originally I thought this post was going to be about iterators, but on reflection, iterators aren't really the trouble here... but I'm getting ahead of myself. Ruby, dynamic scripting language of MVC fame, has this Enumerable concept. It takes a set of objects and performs actions on that set, sometimes returning a modified set, sometimes returning a single element from the set. The available actions are known as iterators, and some of the more common ones are each, select, and collect.

The underlying mechanic for all of these iterators is the each iterator, which does nothing more than return each item in the set, one at a time. You can then wrap the each iterator with additional functionality to generate all the other iterators... so, if you wanted to go through each item in the set until you find an item that meets an established criterion, you would use detect. Or, if you wanted all of the items that meet the criterion, you would use select. The big upside is that you get really clean looking code, since you are using all this built-in magic instead of writing your own.

Let's look at a bit of code to see what I'm talking about. We start with an array of hashes describing a person, with their name and address:
  person_array = [
    {
      :name => 'alice',
      :address => '123 Fake Street'
    },{
      :name => 'bob',
      :address => '1600 Pennsylvania Avenue'
    },{
      :name => 'charlie',
      :address => 'Infinite Loop'
    }
  ]
Now, we want to know alice's address, so we can use the detect iterator to find a hash with a :name value that equals "alice":
  person = person_array.detect { |person| person[:name] == 'alice' }
  puts person[:address]
Which, one must admit, is pretty slick. In that one little line of code we get all the looping, variable checking, and returning behavior. You can even take it a step further and wrap the whole thing in a method:
  def address_for(name, person_array)
    person = person_array.detect { |person| person[:name] == name }
    return person[:address]
  end
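To see that the other iterators really are just wrappers around each, here's a rough sketch of hand-rolled detect and select equivalents. The names my_detect and my_select are made up for this illustration; the real versions live in Ruby's Enumerable module:

```ruby
# Hand-rolled versions of detect and select, built on nothing but each.
module MyEnumerable
  def my_detect
    # Stop and return the first item for which the block is true.
    each { |item| return item if yield(item) }
    nil
  end

  def my_select
    # Collect every item for which the block is true.
    results = []
    each { |item| results << item if yield(item) }
    results
  end
end

class Array
  include MyEnumerable
end

person_array = [
  { :name => 'alice', :address => '123 Fake Street' },
  { :name => 'bob',   :address => '1600 Pennsylvania Avenue' }
]

puts person_array.my_detect { |person| person[:name] == 'alice' }[:address]
# => 123 Fake Street
```

Every other iterator in Enumerable can be built the same way: loop with each, do something with each yielded item, and decide what to return.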
Now, you're probably asking, if I think all of this is so slick, why title this post The Trouble with Enumerables? The reason these little guys are trouble is that they can hide inefficient implementations behind clean looking code, because they teach developers to treat all sets of data as equivalent.

But, the truth is, sets of data are not equal in the eyes of their maker. In fact, in the dynamic language world there are really two different kinds of sets, each with their own particular strengths and weaknesses. On one hand, you have arrays, which hold ordered data, meaning that each item in the set is stored in a specific order, 0..n, and you can loop through that safe in the knowledge that you're going to get the data in the same order every time. On the other hand you have hashes, which hold keyed data, meaning that each item is stored based on a hashed key value, such that you can find it again quickly by just repeating the hash algorithm.

Let's return to the code example from above. I used an array there, but if I know I want to use that data to find addresses based on a name, it would be much faster if I used a hash that looked like this:
  person_hash = {
    :alice => '123 Fake Street',
    :bob => '1600 Pennsylvania Avenue',
    :charlie => 'Infinite Loop'
  }
Then to access alice's address all we would need to do is:
  puts person_hash[:alice]
We can even wrap it in the same method as above:
  def address_for(name, person_hash)
    return person_hash[name]
  end
Now, the output is the same in both cases, "123 Fake Street", but the two implementations differ in important ways. The first one (with the array) is an O(n) speed function, meaning that to find the desired result under the worst-case scenario, it will take n runs of the function to do so (n being the number of items in the array). That's because we have to look at each item in the array to see which one has a :name value of "alice". The second implementation (with the hash) is just O(1), or whatever constant number you want to put in between the parentheses. No matter how many items are in the hash, it will take the same amount of time to find alice's address.
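You can watch the gap grow with Ruby's built-in Benchmark module. A sketch (the sizes and loop counts are arbitrary, just big enough to make the difference visible):

```ruby
require 'benchmark'

# Build the same data both ways: an array of hashes, and a flat hash.
n = 10_000
person_array = (1..n).map { |i| { :name => "person#{i}", :address => "#{i} Fake Street" } }
person_hash  = {}
person_array.each { |person| person_hash[person[:name]] = person[:address] }

Benchmark.bm(10) do |bm|
  # O(n): detect scans the array until :name matches (worst case: the last item)
  bm.report('detect:') do
    100.times { person_array.detect { |person| person[:name] == "person#{n}" } }
  end
  # O(1): the hash jumps straight to the value, no matter how big it gets
  bm.report('hash:') do
    100.times { person_hash["person#{n}"] }
  end
end
```

On my understanding, the detect timing climbs with n while the hash timing stays flat; try doubling n and watch only one of the two numbers move.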

Which is why Enumerables are trouble. Because they are so darn easy to use, they hide the fact that you may have traded away a possible constant-time function for a linear-time function with no actual gain in functionality.

So, my fellow ruby developers (and any other language that has enumerable like functionality), the next time you reach for your favorite iterator, ask yourself, am I using this because it's easy and looks clean, or am I using it because it's the best tool for the job? There are plenty of good reasons to use an enumerable, but if your sole reason is because it's "easy and clean," then you are only asking for trouble down the road when you put your code into production and suddenly that array of four items you tested during development has become 4,000 items and that one function is slowing everything down, to say nothing of all the other functions where you made the same short-sighted decision.

Wednesday, October 22, 2008

My First Rails Application

Last week Articulated Man launched my first solo Rails application, a voter's guide for New England federal races. The site was sponsored by the New England Alliance for Children's Health, whose site we did earlier in the year as just a standard site. The voter's guide required a bit more functionality, thus the decision was made to develop in Rails and use it as my first stab at Rails development.

The site turned out great, mostly because we have outstanding designers who can make anything look good. I also learned quite a bit about the nuts and bolts of a Rails site, which is something you don't really get from just reading the book.

While it's still too soon to make any firm declarations about Rails, I will say it was very nice to have some provided structure when building the application. With LegSim, and pretty much any other project I've done, I had to build everything out of whole cloth... thus, LegSim is rather amorphous, having changed throughout the years and never following a clearly defined structure. With Rails you get that out of the box... perhaps more structure than I would prefer, but I think I'd prefer too much structure over too little, at least at my current stage as a developer.

Speaking of LegSim, the UW Congress course has started up again this quarter, giving me a boost of excitement to get developing again. Already some good stuff happening there as I integrate what I've learned since doing web development full time. Sadly, Archon and LegSim v5 have been put on hold until later, as I need to have a finished product well before either of those technologies will be ready for prime time. But some day--some day soon--LegSim will be rewritten and be better than ever!

Tuesday, September 02, 2008

Chrome: Speculation

If you are a geek and you haven't heard about Chrome, then you've been living under a rock since Monday when it was first leaked. If you aren't a geek, your failure to notice the news is acceptable, understandable, forgivable. But now it's on my blog, and you have no excuse, so get wise.

There are more than a handful of interesting things to say about Chrome, and none of them require me to even have tried Chrome, since it's not yet available for Linux users... here are each of those interesting things in no particular order.

1) The Comic Book

Google used an unorthodox approach to explaining the technology driving their fancy new browser. Instead of your standard, boring white paper, Google released a freaking comic book! It's still a point-by-point review of the problems of current day browsers and Google's proposed solutions, but it goes a step further with use of clever pictures to describe complex technical problems. It reminds me of an excellent video on Trusted Computing circulated years back (worth a watch if you haven't seen it before). Now, let's not fool ourselves, the Chrome comic book is not for the faint-hearted... processes versus threads, memory footprint, hidden class transitions, incremental garbage collection... this isn't kids' stuff and certainly not for public consumption. Where it excels is communicating complex ideas to folks with a shared vocabulary but without shared expertise. I don't develop browsers, and probably never will, but I still understood the message. A contributor to Debian Planet quipped, "I think it would be good if we had a set of comics that explained all the aspects of how computers work," and I couldn't agree more. I suppose that's one advantage of having serious cash to throw around.

2) Open Source as Market Motivator

It's my belief that Google has zero interest in competing with the likes of Firefox and Internet Explorer, giants that they are... or even the lesser three: Safari, Opera, and Konqueror (being the origins of WebKit... KDE for the win!). Chrome will never be as big as those browsers and Google doesn't care. Google's purpose, stated in various press releases, developer conferences, and in the freakin' comic itself, is to improve the ecosystem in which they operate: the web. Google wants more content online, and more users searching for that content, in order to feed the growing advertising business on which Google's billions are based. Chrome isn't about challenging FF or IE for market share, it is about challenging FF and IE to be better.

To accomplish these goals they have open-sourced the browser and all of its fancy doodads. Some clever things here. First, they used WebKit as their rendering engine, and as I mentioned, I love WebKit because it is based on KHTML, which was one of the first good open-source HTML renderers and is still in use by Konqueror. What's unique about WebKit is that neither FF (which uses Gecko) nor IE (which uses something I will refer to simply as the suck) uses it. So, here you've got an entire implementation of a radical new way of building a web browser, with all sorts of cool features just begging for adoption, and neither of the big players has a leg up... both will have to tear out parts and re-implement based around their own rendering systems. And re-implement they shall! If Chrome can deliver on all of Google's lofty promises, then users are going to gravitate to whichever browser can best deliver the same results.

3) Process vs. Threads

This is the big thing that Chrome is supposed to offer. Modern day browsers utilize tabs to allow users to visit many pages at once, which is handy... but in order to visit multiple pages like that, the browser has to be able to do many things at once. Until now, that was done with threads.

To help visualize a thread, imagine you have a fourteen year old kid and you tell him to deliver newspapers along a street. Off he goes and does his thing and he does it very well. Then, the next day, you tell the kid while he's delivering the papers you'd also like him to compose an opera. So, he goes and delivers a few papers, and then stops and jots down a few notes, maybe a harmony or two, then back to paper delivery. He gets it done, but all that bouncing from one task to another causes him to do it a bit slower. The next day you ask him to do all those things he was already doing and do your taxes (does anyone else get a cat on the second result?!). This time, when he switches over to doing your taxes, his poor little fourteen year old brain can't handle it and the whole operation goes to hell... no papers get delivered, no opera is composed, and certainly no tax returns get filed. That's threading... one "person" switching between various jobs.

Now, with processes, it's like you have THREE fourteen year old boys to do your bidding... one goes off to deliver the papers, one composes the opera, and the third does your taxes. Even if the third kid can't deliver, his epic failure doesn't impact the performance of the other two. You may still get audited, but at least you'll know the papers are delivered and opera lovers can rave about the latest wunderkind.

IE and FF use threads (though, rumor on the street is that IE8 beta is process based)... so if one thread goes wonky, you probably lose the entire browser. Chrome is different, it uses separate processes for each tab, so that if one has a problem, the others aren't impacted. If, at this point, you are saying "big deal, how often does my browser crash?" you are right where I am. I use my browser for everything all day... 10 - 15 tabs at once is standard operating procedure for me. Maybe I'm not visiting the nefarious parts of the internets. But here's what is cool about their concept. It's not one process per HTTP request or page fetch, it's one process per tab/domain. Which means that so long as you are browsing around CNN.com, you operate within a single process, sharing memory for various javascript fun within that domain. But once you leave CNN.com to visit, say, nytimes.com, the old process is killed and a new one, with fresh uncluttered memory, is spawned. Which, if you don't know much about the AJAX security model, is really a clever approach. AJAX is sandboxed by design, meaning AJAX scripts running on a page at cnn.com can ONLY talk with cnn.com servers... it cannot make a request off to washingtonpost.com or whatever... it's all isolated. So now, when you go to gmail.com and sit there for HOURS, with its memory consuming javascript, it is all washed away the moment you move to a new domain. Now that, my friends, is good news.
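The isolation idea is easy to demonstrate in miniature. This Ruby sketch (a toy model, not how Chrome actually does it) forks one child process per "tab"; one of them crashes hard, and the others finish unharmed:

```ruby
# Toy model of process-per-tab isolation: each "tab" is a forked child.
tabs = ['cnn.com', 'nytimes.com', 'gmail.com']

pids = tabs.map do |domain|
  fork do
    exit!(1) if domain == 'nytimes.com'  # this tab goes wonky...
    exit!(0)                             # ...the rest finish normally
  end
end

# Collect each child's exit status; a crash in one never touches another.
statuses = pids.map { |pid| Process.wait2(pid).last }
tabs.zip(statuses).each do |domain, status|
  puts "#{domain}: #{status.success? ? 'fine' : 'crashed'}"
end
```

With threads, by contrast, all three jobs share one address space, so one of them corrupting memory can take the whole "browser" down with it.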

Of course, it comes with a cost... those processes each need their own memory, and while it may be virtual memory at first, once they start doing a lot of writing, and you get all those page faults, it's gonna be real memory... and then we'll see what happens on less-than-modern computers that don't have 2 GBs of memory to throw around just to read their daily web comics.

4) Javascript: V8

I like javascript and have no patience for its detractors. If you haven't used the likes of prototype or jquery, you have no concept of what javascript is capable of or how it can be extended to do whatever you might possibly want to do. Having said that, Javascript can be slow... painfully slow... on underpowered computers (like my laptop, now three years old). You can hear it chugging away on some javascript code. It's my observation, however, that it's not the javascript engine at fault, it's the javascript itself... folks relying too much on their framework and object oriented design and not enough on smart coding.

For example, I recently retooled a javascript library that reordered a sequence of pulldown menus (known as select elements in HTML lingo). The previous version of the library iterated through the list of selects SO many times, it wasn't even funny (and I find most HTML/javascript-based conversations to be hilarious!). So, although I had to sacrifice a bit of encapsulation to do it, I was able to rewrite the library to be significantly faster... and my CPU thanked me for the effort. So, what does this have to do with Chrome?

Well, Chrome has a new javascript engine, V8, which is supposed to be a lot faster for various reasons. I guess that's great... but, at least for the vast majority of javascript code out there, the real problem isn't the engine, it's the code. Google has an answer for that too, but the day I choose to learn Java is the day I choose to dust off the law degree.
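In Ruby terms, the kind of retooling I mean usually amounts to collapsing several passes over the data into one. A sketch with made-up data:

```ruby
words = %w[alpha beta gamma delta epsilon]

# Clean but wasteful: select and reject each walk the whole list again.
short = words.select { |word| word.length <= 5 }
long  = words.reject { |word| word.length <= 5 }

# One pass: classify every item on a single trip through the list.
short_in_one, long_in_one = [], []
words.each { |word| (word.length <= 5 ? short_in_one : long_in_one) << word }
```

The single-pass version is a little uglier, but it touches each item exactly once, which is the difference my CPU was thanking me for.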

5) Gears Out-of-the-Box

When I first learned about Gears, I wasn't excited. Then I went to Google I/O and I got a little excited, so I tried it out... Firebug threw so many errors, and everything ran so slowly, that I lost all my excitement and threw it out. I will say that the idea of a more robust javascript interface to the filesystem and to other hardware resources is a great idea... as is a persistent data storage system beyond cookies. But Google's got an uphill battle here. Until the majority of users have Gears installed, or a browser with Gears-like features, no web developer is going to utilize those tools, thus there will be no incentive for users to actually install them. I honestly have no clue how Flash managed to get installed on nearly every browser out there... but I don't see how any plugin as invasive as Gears is going to be able to repeat that miracle a second time. So, Gears out of the box?! Yeah, just another browser with proprietary extensions that are tempting, but should not be used.

6) User Interface

I haven't seen it yet, so I don't know... one friend says it's really hard to get used to. I reserve the right to be obstinate.

In Conclusion

Hell if I know... Google is a complete mystery. But, by and large, they haven't steered me wrong, even if some believe what they are doing is more like sharecropping than software development. I'll be the first to try Chrome as soon as they release that Linux version... and while Google's at it, maybe a Linux Picasa client?

Tuesday, July 01, 2008

Don't be Fooled by .us.com

I got an email today from Network Solutions declaring "Is the .COM Domain You Want Taken? Get the .US.COM & Save" and thought to myself, "wow, they are finally starting to advertise the .us TLD!" Here in the States we sort of take the .com and .org top level domains for granted. But in much of the rest of the world websites use their country code TLD... so, in the United Kingdom you will see lots of .uk domains. Personally, I prefer this, as it helps identify the site's situs (to use a legal term)... don't believe the hype of pure virtual existence, websites have tangible form in the physical world.

Trouble with this advertisement from Network Solutions is that they are not, in fact, advertising a .us TLD... they are advertising subdomains of the .us.com domain. Note the .com is at the end, not preceding the .us like with .com.au (I just set one of these up yesterday, nothing special about Australia). So, what we've got going here is somebody (presumably Network Solutions or a subsidiary) spent the $20 necessary to register us.com--a process that is no different than when I registered probonogeek.org--and is now going to sell subdomains of their domain for $20 per year and are passing it off as a ".COM Alternative!"

Now, I can't speak for anyone else, but the idea of giving $20 to some dude who happened to buy the us.com domain when I could just as easily purchase a .us domain for the same price through a legitimate registrar, seems awfully silly. To further bolster my claim, have a look at the actual us.com site... looks like a google link farm to me. Having said that, if anyone wants to purchase subdomains for probonogeek.org, I'm offering them at the competitive price of only $15/y!

Tuesday, June 24, 2008

Getting Back Up...

The probonogeek.org server is starting to come back from the dead. I took down the slice following my recent hack and awaited instructions from my hosting provider. Sadly, this experience made them reconsider entering this business and they have terminated the beta slice program in which I was a part. They pointed me towards Slicehost, a competitor of Linode, which we use at work. Anyway, I thought it would be a good opportunity to try something new, so I signed up for a slice and got the ball rolling on a new server.

Remember kids, security first...
niles@zion:~/exploit$ ./exploit
-----------------------------------
Linux vmsplice Local Root Exploit
By qaaz
-----------------------------------
[+] mmap: 0x100000000000 .. 0x100000001000
[+] page: 0x100000000000
[+] page: 0x100000000038
[+] mmap: 0x4000 .. 0x5000
[+] page: 0x4000
[+] page: 0x4038
[+] mmap: 0x1000 .. 0x2000
[+] page: 0x1000
[+] mmap: 0x2b7638001000 .. 0x2b7638033000
[-] vmsplice: Bad address
Now I just need to restore my Subversion and Apache servers and I'll be rocking and rolling once again!

Wednesday, June 18, 2008

Hacked

Today I received a very unhappy email from a fellow saying my webserver had launched an attack against his FTP server and that I needed to stop it or he would contact the Federal Authorities. I was unbelieving at first, to be perfectly honest, and asked him to produce logs verifying the attack. But then I went and checked my server and discovered it was running a script named ftp_scanner, which seemed to be attempting brute force attacks against random FTP servers. ack.

I quickly killed all the ftp_scanner processes and found the offending script on the server (cleverly hidden in /tmp/.../ so as to be both hidden from a standard 'ls' and to appear like a system file when running 'ls -a'). The immediate problem addressed, I tried to figure out how this could have happened. To my horror, I discovered that Thursday of last week someone had run a brute force attack against my SSH server and happened upon one of my users whose password was the same as her username. double ack!

A little back story is useful here... on Friday my server went down in a sort of funky way. I could still ping it, but http and ssh access were denied. It took all weekend working with my provider to get it re-enabled. They said it was because CPU usage had spiked, and since it's a virtualized server, my slice was shutoff to prevent damage to the larger system. I should have investigated then, but I just figured the detection systems were borked and thought nothing of it. Bad idea.

Two days later, the intrepid attackers struck again... and I would never have known if not for the email from the poor guy whose server my server was attacking. But that's not the worst of it. While cleaning things up, I noticed an SSH login to the 'news' account, which is a system user account that you cannot usually log into. It was then that I discovered the /etc/shadow password file had been compromised to enable a variety of logins that should not have been. This, unfortunately, was the worst possible news. If the attackers could change /etc/shadow, it meant they had managed to obtain root level access to my server. ack, ack, ack.

I went back to the /tmp/.../ folder to poke around the contents. It was then that I discovered the Linux vmsplice Local Root Exploit. And indeed, running the tests described, my system was vulnerable, and the entire slice had been compromised. Since I don't run tripwire, or anything like that, I was pretty much screwed. oh, ack...

All user data is now backed up onto my local desktop and the slice is scheduled to be cleared. Once the kernel is secured I will have to start building the system from the ground up all over again.

Oh, and if "Not Rick" is out there, I'm sorry to have caused you any trouble... but contacting me via means that prevent me from replying makes it difficult to apologize or explain the situation.

Monday, June 02, 2008

Google vs. Privately Owned Community

This isn't really a story about Google, but I was tipped off by a tech-legal blogger about the story because of Google's involvement with the St. Paul suburb of North Oaks, Minnesota. The basic story boils down to (1) North Oaks residents actually own the roads in their town and have a trespassing ordinance, (2) Google violated that ordinance when it took photos of the town for its Street View program, (3) North Oaks city council requested the photos of the entire city be removed, (4) Google complied.

From a Public Relations standpoint, I have no argument with Google's decision... however, I do think there is a dangerous first amendment precedent waiting in the wings here. In Marsh v. Alabama the U.S. Supreme Court ruled that First Amendment activity was still protected in the town of Chickasaw, Alabama even though every square inch of the town was private property owned by the Gulf Shipbuilding Corporation. The company had banned religious leafleting and the Court said the company was the State in that situation and thus must abide by the First Amendment.

I think the situation in Chickasaw, Alabama is analogous to North Oaks, Minnesota... except, instead of a for-profit company owning the streets, individuals bound by their deeds through the North Oaks Home Owners Association own the streets. But the situation is otherwise the same in that a private entity is attempting to get around the State Action doctrine by abolishing the State. But in so doing, they create a new State in all but name, and thus under Marsh must allow First Amendment activities. There remains the question of whether taking photos from streets is a First Amendment activity, a question I am not immediately familiar with, although I believe it is protected.

Either way, I imagine Google complied for the same reason it complies with requests from private citizens... it doesn't have to under the law, but it does out of respect for privacy. My question now is what happens if a "citizen" of North Oaks, Minnesota writes to Google saying they wish to opt back into Street View?

Sunday, June 01, 2008

Why is this so upsetting?

Regular users of Google properties will have noticed that the Google favicon has changed. Here's a side-by-side comparison from blogoscoped.com.
Side-by-side comparison of Google favicons
The old icon reflected Google culture as I saw it, colorful, yet professional. This new logo drops the color scheme and switches to a lowercase "g". It's the sort of favicon I would expect on a kids-oriented site. If their goal here is to appear as an "underdog"--as suggested by the blogoscoped.com article--then they are seriously misreading their audience.

Maybe this will grow on me, but it had better start growing soon, 'cause at the moment it is nothing more than an eyesore on my bookmark toolbar.

Saturday, May 31, 2008

Putting Ruby into Words

Since I started learning about Ruby and reading some of the community blogs and books, I have had this sense. I am the first to admit that it's a poorly defined sense, but somewhere deep inside of me, something felt wrong with the community. Thankfully, there are Debian Developers out there with the same feelings who have a better way with words. Too lazy to read the link? No problem, here's the critical bit:
What's troubled me for some time about the post-Rails Ruby community is that it has a distinct bent away from its Free Software roots. I understand Matz actually used to use (not sure about today) Debian Unstable, and Ruby traditionally displayed its roots quite strongly, with a Perl heritage and a community consisting largely of hardcore *NIX people. With the advent of Rails, the move has been towards things like TextMate and OSX. Software like Gems (no relation to Gemstone) fits in fine with one of these systems, but not so well with modern Free Software systems, and I think it's symptomatic of the change. Given this propensity in the Ruby community, and given the numbers Gemstone is posting, I'd be surprised if lots of Rubyists don't move that way as soon as it's available.
I couldn't agree more! When I first learned the preferred editor for Rails development is an OS X only commercial app, I was literally speechless.

There are other examples of this divergence from the Free Software world. For example, Rails' recent decision to abandon Trac, a reliable ticketing system used by a whole set of large FOSS projects. Rails now uses Lighthouse, itself a Rails application, which is decidedly closed source. If this sort of behavior continues, I think you'll see a spike in useful stuff coming out of commercial shops followed by a slow decline as the ecosystem that comprises free Ruby code begins to shrink and eventually die off. At which point you've got a free language whose community and ecosystem are more about commercial interests than free software.

Monday, May 26, 2008

NGINX... why?!

Anyone who has any relationship with Rails development has, at this point, heard of Nginx. The point of Nginx is to replace Apache, the definitive global webserver, which Rails devs feel is simply too slow for their lightning-fast development framework. It's not the first time the Rails community has snubbed Apache, nor will it be the last. Those Rails devs are simply fickle folks.

So, fine, let the Rails devs frolic with their uberfast webserver... what about the rest of us mere mortals? Is Nginx a good route for you? Let me say here and now, the answer to that question is almost always a strong, resilient, and durable no. The reasons for the rejection are many, so let's start with the funny ones first and proceed to the more technical ones.

First, it behaves in inexplicable ways for different browsers. Check out this screen shot of Penny-Arcade loaded in Firefox (on the top) and Konqueror (on the bottom) at the same time.

Click to see full resolution
This happened with multiple reloads (cache disabled)... it always worked with Firefox, always "failed" with Konqueror. Oh, and that "Bad Gateway" message is something you should get used to if you are thinking about deploying Nginx, because it's an all too common sight (more about that later on).

Second, the primary documentation is in Russian. Yes, Русский. From what I can gather, the primary developers are Russian, which is great... yay global open source development! But a webserver is a complicated beast, hence the great forests that are clear-cut each year to produce the necessary library of books on Apache and Microsoft's Internet Information Server. Let me be clear that when I say primary, I do mean to imply there is secondary documentation. This is secondary documentation in the same way that warning labels will list sixteen life-threatening things you could do written in English, followed by a single warning in Spanish that translates to "Danger."

Third, Nginx does not support .htaccess files. Anyone who spends much time building custom websites knows the power of these magic little files that alter the way Apache treats a particular folder. Securing a folder with basic authentication is two simple lines and a password file. Nginx takes a different approach, where different means stop bugging us to add .htaccess support. Instead, every directive, for every folder, regardless of its scope, must go into a master configuration file. You can split the conf file into many smaller files, but they are all loaded when the server starts and given global effect. The common approach is to split each hosted domain into its own conf file... but that only keeps things organized, because at the end of the day, every conf file has global implications.
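For the curious, the Nginx equivalent lives in the master conf rather than in a per-folder file. A hedged sketch of what that might look like (server name, paths, and the password file location are all illustrative, not from any real deployment):

```nginx
# Illustrative only: basic auth on one folder, declared globally in nginx.conf
server {
    listen      80;
    server_name example.com;

    location /private/ {
        auth_basic           "Restricted";
        # The password file still has to be generated with Apache's htpasswd
        auth_basic_user_file /etc/nginx/htpasswd;
    }
}
```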

Third and a half, Nginx requires you to have Apache's support tools lying around to do stuff. This really isn't worth a whole new point, because everyone already has Apache lying around... but let's say you wanted to create a password file for basic authentication. There is no Nginx utility to generate those handy hash values; you have to use htpasswd, available from your Apache distribution.

Fourth, Nginx doesn't actually do anything beyond serve static HTML and binary assets... which is to say, it doesn't run PHP or Perl or any of the other P's you might find in the LAMP stack. What it does is take requests and proxy them to other servers that do know how to execute that code. This is great in the Rails world, which long ago decided to have Rails be its own little server that you submit requests to and get responses back from. Even under Apache, the standard approach is to run Rails as a cluster of Mongrel servers that Apache talks to via a proxy connection. In the world of PHP and Perl, this approach is somewhat counter-intuitive. Apache's mod_php loads a PHP interpreter into Apache, allowing Apache to do all the heavy lifting for you... ditto mod_perl. Even Ruby has a mod_ruby (although it's still premature). With Nginx, everything is its own standalone server.

So, what if your PHP project needs to know something about the webserver (like the root folder, or a basic auth username)? Well, you need to know that ahead of time and set up the proxy (defined in that global conf file I mentioned in #3) to pass those variables to your application server, otherwise they won't be around for you to use. Better yet, what if the proxy server is down? Nginx will greet you with a handy "Bad Gateway" message and no further information. Good luck debugging the underlying server, since Nginx really only knows how to talk in HTTP requests... perhaps you can code your own debugger with LWP.
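To make the variable-passing problem concrete, here's a sketch of what such a proxy setup might look like (the backend address and forwarded header names are invented for illustration):

```nginx
# Illustrative sketch: every request for dynamic content gets proxied,
# and anything the app server needs to know must be forwarded by hand.
location /app/ {
    proxy_pass       http://127.0.0.1:8000;
    proxy_set_header Host            $host;
    proxy_set_header X-Real-IP       $remote_addr;
    proxy_set_header X-Document-Root $document_root;
}
```

Forget to forward something here and your application simply never sees it; there is no mod_php-style ambient environment to fall back on.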

Finally, I am left with the question: why? The ostensible reason is that it's faster and can therefore handle more requests. Even if we accept that as true (*grumble, grumble*), it only achieves that speed by passing the buck to other servers. When you find a non-responsive site, it's not because the static assets like images and HTML text are being served slowly... it's because the dynamic content generated by PHP/Perl/Python/Ruby/whatever, and the underlying database from which the data is drawn, cannot keep up. Nginx suffers that same failing... while requiring just as many resources, because you now have to run a separate server for each of the languages you want to code in.

If you are developing Rails, then by all means, enjoy this flavor of the month until some new exciting technology comes along and all the little Ruby lemmings go marching off in a new direction. For everyone else, writing applications that are meant to stand the test of time, stick with Apache; it hasn't let us down yet.

Tuesday, May 13, 2008

Working with SPF Records

Today I finally sat down and learned enough about SPF records to actually get one deployed on a site I'm setting up. What's an SPF record, you're wondering? Perhaps you are too lazy to click one of my provided links. No problem, here is a description anyone can understand.

So email is more like a normal letter than you might have expected--not surprising, since most systems are modeled after existing ones--and includes things like a sending address and a return address. In the world of email, these are the "To:" header and the "From:" header, respectively. So, if I were to send you an email, the top of it would look something like:
To: "John Doe" <jdoe@example.com>
From: "Sean Kellogg" <skellogg@probonogeek.org>
Subject: A message
...
Thus, you would know the email was from me and treat it appropriately.

Trouble is, just like a return address on an envelope, there is no way to be certain the return address is accurate. I could stamp all of my envelopes with 1600 Pennsylvania Ave. and it would still get delivered (well... maybe, not sure how USPS would respond to that particular address). Point is, I could send the following email to you just as easily as the one above.
To: "John Doe" <jdoe@example.com>
From: "Bill Gates" <bill.gates@microsoft.com>
Subject: A message
...
and the mail system would happily send it off to your mailbox. So, you've got one class of people who are trying to steal your identity. The other class of folks are those more interested in masking their own. This group is known as spammers. I would say nearly all spam today is sent using forged headers, such that the From: header is set to either a non-existent email address or some poor unsuspecting bystander.
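To see just how little stands in the way, here's a quick Python sketch that builds a message with a forged From: header, using the addresses from the examples above (no mail is actually sent; the point is that nothing checks the header):

```python
from email.message import Message

# Build a message with a forged From: header -- nothing validates it.
msg = Message()
msg["To"] = '"John Doe" <jdoe@example.com>'
msg["From"] = '"Bill Gates" <bill.gates@microsoft.com>'  # trivially forged
msg["Subject"] = "A message"
msg.set_payload("Hello!")

print(msg["From"])
```

Hand that to any SMTP server willing to relay it and, absent SPF checking, it lands in the recipient's inbox claiming to be from Redmond.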

Enter SPF records, which are a mechanism to validate the From headers. Basically, as a domain manager, you declare a set number of machines which are authorized to send mail on behalf of the domain. Then, mail service providers are responsible for checking that declaration to ensure that the originating server is one of the authorized senders. In the microsoft.com example, the mailhost would figure out all the valid servers that can send email on behalf of microsoft.com, realize my server is not one of them, and reject the email.

The only part left is to figure out how to write SPF records. Turns out it's not as hard as I expected, once you know how. I recommend the following wizard as a great starting point for defining your SPF records. All you need to do is specify the domain you are managing, and then list the various servers you want to authorize to send on your behalf.

Of course, that last part is easier said than done in some cases. The domain I was doing this for uses Google Apps for mail delivery, and lord only knows how many different servers are involved in the Gmail setup. Thankfully, the SPF folks were prepared for that! There is an "include" directive as part of the SPF spec that allows you to say, "in addition to these settings, include the settings from this other SPF record." Then you just point at the Gmail SPF record and you're set.
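For reference, the published record is just a TXT entry in your DNS zone. A record using the include directive might look roughly like this (treat the include target as illustrative and check your mail provider's documentation for the current one):

```
probonogeek.org.  IN  TXT  "v=spf1 a mx include:aspmx.googlemail.com ~all"
```

Reading left to right: the domain's A and MX hosts may send mail, so may anything Google's record authorizes, and everything else gets a softfail (`~all`).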

I'll be honest though, I'm not certain about this whole SPF system. For example, I use the washingtonpost.com article sender to send stuff to friends and colleagues. Those emails are generated by washingtonpost.com servers and set the From: header to my address. Except, if the recipient host is set to enforce SPF records, it's going to get the email and say, whoa, washingtonpost.com is not authorized to send for probonogeek.org! Not sure how this problem gets resolved, but there needs to be a way for address holders to authorize third-party sites to send email on their behalf on a one-at-a-time basis. Any bright ideas out there?

Sunday, May 11, 2008

My Parts Per Million

My company did a website for a water-testing device manufacturer (I realize, not our usual political fare... not every client can be running for President). The client was so pleased with the new site they sent us a nice gift basket and a few of their products. One quick USPS shipment later and I am the proud owner of an HMDigital TDS-4.

So, I drew myself a glass of tap water (I don't filter my water, but Sarah does... I'll test hers once she gets a new filter) and gave it a go. My Santa Cruz tap water measures in at 216 TDS PPM. That's right, 216 Total Dissolved Solids parts per million. The back of the product says the EPA's Maximum Contaminant Levels of TDS for human consumption is 500 ppm. I'm not sure if 216 is good, great, acceptable, below average, I just know it's not the maximum contaminant level... huzzah?!

The great thing is, this device is portable, so I can start taking it to restaurants and providing reviews on water quality. It's a whole new world of eating-out metrics. Oh sure, we could go out there, but the TDS was a little high last time...

Tuesday, May 06, 2008

AJAX File Upload: The Cake is a Lie

For a long time I have been smitten with the idea of AJAX. By now everyone has experienced AJAX, even if they don't know it. AJAX powers web 2.0 sites like Flickr and Gmail. Allowing the user to interact with a website without a page refresh is a strangely liberating technology... finally my applications have state! But the true holy grail of AJAX lies with the mysterious mechanism of file uploading. No doubt you've done this before, in a non-AJAX fashion. While filling out some innocuous HTML form you are presented with a seemingly innocent file selection dialog box, perhaps selecting the latest photo of your kitty to send along with the other information. This basic file uploading capability is made possible by creating a special HTML form, like so:
<form action='upload.cgi' method='post' enctype='multipart/form-data'>
...HTML form fields go here...
<input type='file' name='my_picture'>
...maybe some more HTML form fields go here...
<input type='submit' value='Down the Tubes!'>
</form>
That enctype business there tells your browser to send a special sort of HTTP request that can contain binary data. Generally requests just send text, but by enabling binary data transmission, we can send photos, mp3s, pdfs, anything within the size limit of the protocol. Trouble is, ajax requests are built such that you cannot change the enctype to multipart/form-data! Even with the cross-browser prowess of Prototype (my preferred javascript framework), there is just no way to change the nature of the HTTP request. It's either text or bust.

So, how do internet giants like Flickr and Facebook do it? What is the secret ingredient? A little googling reveals the answer as satisfactory, yet unsatisfying. Allow me to explain. To start, we need to redefine our objectives... since we can't "use AJAX to upload a file," our objective needs to be "make it appear like we are using AJAX to upload a file." When we say "use AJAX," what we really mean is communicate with the server without a page reload. But we must remember the earlier lesson: you can only upload a file using a multipart/form-data form. Put another way, we have to call submit() on that form... there is no other path to the promised land.

HTML forms are a tricky thing. Left to their own devices, when you call submit(), the entire page reloads. So that's out. But we can set a target for the form, such that calling submit() causes the form to load in the target window. Setting target='_new' will create an entirely new window where the form will be processed. This is sort of cool, in that the underlying window remains unchanged. But we certainly don't want new windows popping up all the time. Yuck.

We could set the target to an embedded iframe in the main window itself. This is a lot closer, because there is no messy popup business. But now you've got this iframe reloading, which isn't exactly the seamless experience we are shooting for. The final piece to our puzzle then is to make the iframe hidden with style='display: none;' attribute.

So, now our form from above looks like this.
<form action='upload.cgi' method='post' target='empty_iframe' 
enctype='multipart/form-data'>
...HTML form fields go here...
<input type='file' name='my_picture'>
...maybe some more HTML form fields go here...
<input type='submit' value='Down the Tubes!'>
</form>
<iframe src='about:blank' name='empty_iframe'
style='display: none;'></iframe>
Now, when you hit the submit button the form sends the data, including the file, off to the server and the response comes back to the invisible iframe. To the user, nothing seems to have changed. You can add a little pre-process magic with javascript, like hiding the form, but what if you want to do post-process magic? With a traditional AJAX request you could get an XML payload back, or javascript if you use a framework like Prototype. Turns out you can do something similar with the iframe trick. You can call methods on the parent window from within the iframe by sending javascript inside of a <script> tag.

Your output to the iframe will look something like this:
<script type='text/javascript'>
window.top.window.function();
</script>
You can call as many functions from within the script tags as you like; just remember that the iframe has no sense of the variables available in the parent window, so that can complicate things. But a little forethought can go a long way toward making the magic happen. You can also do cool things like insert and remove the iframe on the fly so that it's only there during the form-processing bit.
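Since upload.cgi can be written in anything, here's a hedged Python sketch of the server side of this trick. The function names passed in (hideSpinner, showThumbnail) are hypothetical stand-ins for whatever post-process magic your parent window defines:

```python
def iframe_response(*js_calls):
    """Build the HTML payload returned to the hidden iframe.

    Each entry in js_calls is a JavaScript call to run in the parent
    window (the names used below are hypothetical examples).
    """
    body = "\n".join(f"window.top.window.{call};" for call in js_calls)
    return f"<script type='text/javascript'>\n{body}\n</script>"

# upload.cgi would print this after saving the uploaded file:
payload = iframe_response("hideSpinner()", "showThumbnail('kitty.jpg')")
print(payload)
```

The browser loads that response into the invisible iframe, the script tag executes, and the parent window's functions fire as if a real AJAX callback had arrived.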

Now that you know how it works, it should be obvious how this is all a lie... a horrible, horrible lie. There isn't anything the least bit AJAX-y about this. In fact, if you accept this as a valid method of asynchronous server communication, then you can pretty much never use the XMLHttpRequest object ever again... just communicate via hidden iframes! I realize that file uploading is a serious security concern (we don't want malicious coders to be able to upload files from your harddrive without your knowledge), and I know that AJAX presents its own security concerns... but there has got to be a better way. I hope that future revisions to the XMLHttpRequest object provide a way to send multipart/form-data requests so we can ditch this awful, messy hack.

Monday, April 21, 2008

The Danger of Arbitrary Strings

A friend of mine recently pointed out a rather unique "feature" that cnn.com appears to be introducing: the ability to print headlines directly to a t-shirt for, I don't really know... live blogging purposes? Maybe.

Anyway, whether by his own brilliance or an act of God, Tom noted that the webpage used to generate these print-on-demand t-shirts is driven by nothing more than your average GET string. So, when you click on the headline "Nail polish color may tip off politics," the browser sends that string to the t-shirt generator as "headline=Nail%20polish%20color%20may%20tip%20off%20politics". But GET strings have significant security implications, in that it is trivial for the end user to alter them, with rather humorous results.

For example, you could request a t-shirt that says "CNN is stupid". Tom had other ideas. I figured, if you can say something silly, what's to stop you from saying something newsworthy.


Just in case CNN fixes this little glitch, here's a snapshot of the last link's output


As Tom noted, it all seems very familiar. While CNN was stupid enough to allow their tool to be used for subversive ends, they learned Nike's lesson and prohibit you from actually purchasing t-shirts with unauthorized headlines. Which gets to the point of this post. Developers of applications are always looking for flexibility in the constant drive to make code do whatever the client requests. My own company is certainly not immune to this siren's call. But sometimes that flexibility can lead to real issues.

The right way to build such a t-shirt tool is to pass a unique ID to the t-shirt application that corresponds to the particular story in the CNN database. The tool then fetches the headline and publication date and prepares a delightful, non-embarrassing headline t-shirt. Of course, that leads to the dreaded coupling, where two seemingly unrelated pieces of technology become reliant on one another... a significant violation of Agile principles. But you know what, I think I prefer heresy over being fired.
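A minimal sketch of that ID-based approach in Python (the headline table and the story ID are invented for illustration; CNN's real tool would hit their CMS database):

```python
# Hypothetical headline store; in CNN's case this would be their CMS database.
HEADLINES = {
    1374: ("Nail polish color may tip off politics", "2008-04-21"),
}

def shirt_text(story_id):
    """Look up the headline server-side; reject IDs we don't recognize."""
    try:
        headline, published = HEADLINES[story_id]
    except KeyError:
        raise ValueError("unknown story id")
    return f"{headline} ({published})"

print(shirt_text(1374))
```

Because the user only ever supplies an ID, the worst they can do is request a t-shirt for a story that actually ran; there is no arbitrary string for them to tamper with.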

Saturday, March 22, 2008

The Tagging Paradigm

Welcome to my first installment of technical rumination... let's all hope that by the time it's posted I've actually had something valuable to say that is more than just jargon-filled ramblings. I've actually been thinking about this topic for a while now, ever since a conversation I had with David (sorry Mr. Morgan, you are no longer the only David in my life). He (David) is in the process of laying the initial foundations for our company's new CMS. It's actually a very interesting process, building a CMS, because you are making decisions with significant long-term impact that are not easily changed--regardless of how agile one might be--because once the system is in use with clients, there's no easy way to retrain. So David and I spend several hours every week discussing long-term implications, naming strategies, and general paradigms. Most recently we discussed the implications of a tagging-based system.

For those not immediately familiar with tagging, here's a brief intro as I see it (which means get out your salt shakers, because I'm not going to fall in with the boosters club on this one). Tagging is part of the semantic web (SW) movement. For the sake of fair play, here is a direct quote from the W3C's FAQ on the main goals of SW:
The Semantic Web allows two things.

1. It allows data to be surfaced in the form of real data, so that a program doesn’t have to strip the formatting and pictures and ads off a Web page and guess where the data on it is.
2. it allows people to write (or generate) files which explain—to a machine—the relationship between different sets of data. For example, one is able to make a “semantic link” between a database with a “zip-code” column and a form with a “zip” field that they actually mean the same – they are the same abstract concept. This allows machines to follow links and hence automatically integrate data from many different sources.
So, the idea is to take the human-readable content and convert it into machine readable content in a way that makes sense and allows for all sorts of cool functionality. The best example of this in today's world is an RSS feed, which takes blog posts (just like this one) and converts it into data that an RSS reader can understand and synthesize into a format that you, as the reader, desire. So far, so good.

Tagging fits into the semantic web concept as the chief mechanism for aggregating data. So, imagine we have three bits of data: bit one [1] is tagged as "blue" and "red"; bit two [2] is tagged as "red" and "yellow"; and the final bit [3] is tagged as "yellow" and "blue". Now we can ask for all the bits tagged blue [1,3], all the bits tagged blue and red [1], the bits tagged blue or red [1,2,3], or even all the bits tagged blue and not red [3]. The power of this approach is obvious... the problem with tagging is that in the rush to embrace the approach, no one seems to be talking about the weaknesses.
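Those queries are really just set operations. A small Python sketch using the three bits above (an "or" query would simply be a union across single-tag lookups):

```python
# The three bits from the example, each with its set of tags.
bits = {
    1: {"blue", "red"},
    2: {"red", "yellow"},
    3: {"yellow", "blue"},
}

def tagged(include=(), exclude=()):
    """Return the sorted ids whose tags contain all of `include`
    and none of `exclude`."""
    return sorted(
        bit for bit, tags in bits.items()
        if set(include) <= tags and not (set(exclude) & tags)
    )

print(tagged(include=["blue"]))                   # -> [1, 3]
print(tagged(include=["blue", "red"]))            # -> [1]
print(tagged(include=["blue"], exclude=["red"]))  # -> [3]
```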

Before I get into that, let me say something about my general philosophy when it comes to computer science. My general belief is that the original computer science folks were really smart and figured out nearly everything truly neat that was to be discovered with the hardware available at the time. Linus Torvalds said it best when commenting on Microsoft's claim that they held patents on technology in the Linux kernel: pretty much everything about operating systems was figured out in the 60s and 70s. If you believe this--as I do--then you begin to see that the only true innovations left come from a combination of hardware and software... here I'm thinking of parallel programming made possible by dual-core CPUs.

Which brings us back to tags... why, if they have such power, are they only now coming into vogue? Several possible reasons for suspicion come to mind.

1) Tags are unstructured data.

When talking about this with David, he said he sees tags as lists, but I don't think that's right. A list implies an order, with a start, an end, and a clearly defined order between the two. But going back to our bits, if I ask for all the blue bits, the result could be [1,3] or it could be [3,1]. There is nothing about a list of tags that says either is right or wrong. It's better to say a tag identifies a collection. Of course, I can impose an order on the collection by sorting the elements, but then the ordering comes from within the data and is not part of the collection itself. This means (a) I have to order the collection every time I want to work with it, and (b) I cannot apply an arbitrary order to the collection unless I start storing metadata about the order in the items themselves.

2) Tags will never be as efficient as an actual list.

In the world of web development, the easiest way to create a list is to just use a table in your relational database. This is handy because it's just one table and you don't have to mess with JOIN statements, which are inherently slower than a single-table SELECT statement. Tagging, however, requires three tables: 1) a table of the data to be tagged, 2) a table of the tags, and 3) a table pairing the first and second tables together. So now you've got three tables and two joins, which is just not going to be as fast as a simple SELECT.
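A minimal sketch of that three-table schema, using Python's sqlite3 so the two JOINs are visible (table and column names are illustrative):

```python
import sqlite3

# Minimal version of the three-table tagging schema.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post_tags (post_id INTEGER, tag_id INTEGER);
""")
db.execute("INSERT INTO posts VALUES (1, 'First post')")
db.execute("INSERT INTO tags  VALUES (1, 'blue')")
db.execute("INSERT INTO post_tags VALUES (1, 1)")

# The two JOINs that every tag lookup pays for:
rows = db.execute("""
    SELECT posts.title
      FROM posts
      JOIN post_tags ON post_tags.post_id = posts.id
      JOIN tags      ON tags.id = post_tags.tag_id
     WHERE tags.name = 'blue'
""").fetchall()
print(rows)  # -> [('First post',)]
```

Compare that to `SELECT title FROM posts WHERE category_id = 1` against a single foreign key: same answer, two fewer joins.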

3) Over reliance on tags could replace good design.

Of course, developers already know this, even if they are rushing to embrace the tagging concept. They know it because they are still using foreign keys in their tables to provide the usual one-to-one and one-to-many relationships. To demonstrate, consider a recent project involving an online election tool I wrote. For this I had a table of polls, a table of choices, and a table of votes. The choices table had a poll_id field that linked the poll together with its choices, thus allowing me to ask for all the choices associated with a particular poll, or find the poll of a particular choice. The votes then had a choice_id and a poll_id, so I could do the same lookups. I did this because it was efficient and easy to understand... but one could envision doing the same thing with tags. Come up with a tag name for the poll itself and then just tag all the choices and votes with the poll tag. We'd get the same result, but it would require a lot more joining and be less efficient overall.

4) In a collaborative web setting, tagging makes too many assumptions about users.

While researching this post (that's right, research!) I discovered an excellent essay by internet luminary Cory Doctorow entitled Metacrap: Putting the torch to seven straw-men of the meta-utopia. He gives seven reasons why the metadata system championed by the semantic web is doomed to failure. It's short and worth a read, but the items I want to draw attention to are: (2) people are lazy and (7) there's more than one way to describe something.

4.a) People are lazy.

This blog has a tagging system, provided at no cost by the fine folks at blogger.com. If you scroll down you'll see a GIANT list of tags I've used over the years on various stories. It's a mess, and as you'll see, very few tags have more than one or two posts. I don't consider myself a lazy person, but the truth is that I haven't put in the time to organize and categorize each story... and since the tagging options available to me are limited only by my imaginative vocabulary, chances are the list of tags is only going to continue to grow... making an increasingly useless tag system for my readers.

4.b) There's more than one way to describe something.

This is the kicker, and Cory hits it right on the head when he says, "No, I'm not watching cartoons! It's cultural anthropology." Both are fair ways of describing something like The Simpsons or Family Guy. Unless we want to go around tagging everything with all the possible synonyms (agreed upon or not), those left with the dubious task of aggregating data based on the tags are going to have a difficult time finding everything they seek.

5) Namespace Collision

The final argument is especially geeky, so hold on tight. In the real world, when I'm talking with Sarah about "David" she knows I'm talking about David at work, because our namespace has "David" mapped to "David Chelimsky who works at Articulated Man," but when I talk to Sheridan about "David" he knows I'm talking about "David Morgan who works at Ernst & Young" because that is our shared namespace. Two people, both tagged with "David", but they point to different people depending on the context in which I say the name. You run into the same problem whenever you search on google and the results are not 100% what you are looking for. Humans don't really think in terms of namespace, but for computers it's a necessity. Without a clear context, you end up with namespace collision. So, how do we solve the two Davids dilemma? We could provide more exhaustive tagging... so, now I tag them as "David Morgan" and "David Chelimsky"... but what if I know two David Morgans? Do I provide a tag that is really long and includes date of birth and social security number? Thing is, developers already use the social security number approach, by assigning every object a unique identification number. It's not especially human readable, but it does resolve the namespace collision.

Just like Cory, I don't walk away from tagging with a sense of abandonment, just healthy skepticism. I think there are three primary lessons to pull out of all of this that govern when tags make sense. 1) When the aggregation of data is truly arbitrary. This is how we avoid a tagging regime for the polling system I described above: there was a clear object hierarchy, and using tables to describe that hierarchy was far more concise than tags. 2) When the data has a clear enough context to avoid namespace collisions. You don't want to be tagging by a person's name... there are just too few names in the world. But if you are working with a theatre company, tagging by the particular performance ("Romeo & Juliet", "Midsummer Night's Dream", etc.) might do the trick. 3) When the act of tagging itself is centrally controlled. Which is to say, allowing the public at large to apply tags leads to pretty useless tags (see slashdot.org as an excellent example of useless tags... not to say they aren't funny). But if the tagging regime is centrally controlled, then you can rely on a uniform scheme of description and prevent the laziness problem I face. If I had come up with a uniform system at the beginning, it would be useful today.

I will wait to see how tags unfold in the larger internet. Done in limited situations and with the proper structure, there is real power... but just beyond that ridge is a desert of poorly organized and inefficiently accessible data.

Monday, May 28, 2007

A New Playbook in Dealing with the Internet

Some of you may have read about the number-which-must-not-be-named incident a few weeks back. For those who didn't: there is this number that certain powers that be wish to keep secret. In so doing, they issued various cease and desist orders, which caused quite a stir and spread the number far more widely than if they had never done anything. A classic case of misunderstanding the reality in which you find yourself.

Today the Washington Post brings a story of a young girl from California who has found herself in the middle of a media storm. Due to her excellence in sports and attractive looks, her photo has spread across the internet on blogs and message boards. Someone even set up a fake Facebook account under her name. The attention has often been sexual in nature and caused grief for the girl and her family.

What impresses me about her story is that she, or at least those advising her, have rejected the misguided approach of the copyright maximalists. Instead of sending out cease and desist letters to anyone who ever touched the photo, instead of threatening slander suits against those who speak her name, instead of crying to the media about how unfair it all is, this young girl has chosen to embrace the media storm. Not in the Britney Spears, "It'll make me famous" sort of way, but in the "okay, if you're really interested, here's my story" sort of way. I predict that by opening up, embracing the storm instead of fighting against it, the frenzy will die down much quicker than otherwise.

Of course, I could be wrong and this may end up stoking the fires, but my gut says that now that she's obtainable, in that her life is not clouded in mystery, she'll be far less of a target for those who obsess about the impossible. Nothing destroys a fantasy like a healthy dose of real life.

Saturday, February 03, 2007

Don't Believe Everything you Read About Security

The Washington Post has a Q&A up that asks, "When I log into my Internet provider's Web-mail page, I don't see the usual lock icon. Isn't it dangerous to send a password over the Internet without encryption?" It proceeds to tell people to fear sites that don't employ the little lock.

It's true, sending passwords over public lines in clear text is asking for trouble. But that doesn't mean the little lock is the only way to avoid it. In fact, that little lock costs a lot of money for websites to purchase (and repurchase, on an annual basis). But there are alternatives that are just as good. LegSim uses such a system, relying on basic cryptography and some intelligence. Just because a site doesn't choose to buy into the SSL certificate racket doesn't mean it can't be trusted.
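For the curious, here's a sketch of one common nonce-and-hash scheme of the sort I have in mind (the general idea only, not necessarily LegSim's actual implementation): the server issues a one-time challenge, the client hashes the password together with it (in practice via in-browser JavaScript), and only the digest crosses the wire.

```python
import hashlib
import secrets

def make_nonce():
    """Server: issue a fresh one-time challenge for this login attempt."""
    return secrets.token_hex(16)

def client_response(password, nonce):
    """Client: hash the password with the nonce so only the digest
    crosses the wire, never the cleartext password."""
    return hashlib.sha256((nonce + password).encode()).hexdigest()

def server_check(stored_password, nonce, response):
    """Server: recompute the digest from its own copy and compare."""
    expected = hashlib.sha256((nonce + stored_password).encode()).hexdigest()
    return secrets.compare_digest(expected, response)

nonce = make_nonce()
resp = client_response("hunter2", nonce)
print(server_check("hunter2", nonce, resp))  # -> True
```

Because the nonce is used once, a sniffer who captures the digest can't replay it on the next login; the password itself never traveled in the clear.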

Sunday, January 28, 2007

More Useful Than Originally Intended

There is a new Linux-oriented website I just learned of today called goodbye-microsoft.com. Aside from the likely Uniform Domain-Name Dispute-Resolution claim Microsoft has against the venture, I've got to say I'm very impressed with what's going on here.

The idea is simple: using nothing more than a web browser and Windows, go to a website and download all the necessary bits to install Debian Linux onto a desktop. No optical discs necessary. Besides the obvious marketing value of such a website, it has tremendous functional value for those of us with laptops.

See, more and more laptops these days ship without an optical disc drive. Or, if they have one, it connects via a strange PCMCIA card. Which means it is very difficult to start the Linux install process (though not impossible). The process itself is nearly painless these days (in stark contrast to my first experience in 2000). But if you can't get the damn thing started, you've got serious problems.

But now, thanks to this brilliant invention, it will be trivial to install Linux on my next laptop.

Thursday, December 14, 2006

Webcomics For Life

A friend of mine sent me a humorous webcomic that deserves to be shared with a wider audience. I've never heard of this particular strip before, but if you are into math, technology, and humour, it may be for you.

If not, I still think this one is a particularly excellent reflection of me, as a person. I'm not saying the girl has ever left before, but the scenario certainly started out the same way.

Make sure to hang your mouse over the image for a few seconds to uncover one of my life's guiding philosophies.

Here's the strip.