Category Archives: Tech

A Jinja macro for generating an HTML select box with US states

More on Jinja macros

{% macro states_select(name, value='', class='', id='') -%}
 {% set states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA",
"ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH",
"OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"] %}
 <select name="{{name}}" class="{{class}}" id="{{id}}">
 {% for state in states %}
 <option value="{{state}}" {{'selected' if value==state else ''}}>{{state}}</option>
 {% endfor %}
 </select>
{%- endmacro %}
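
To use it, import the macro into a template and call it like any other macro. Here's a minimal sketch that exercises it from Python with the jinja2 API, assuming the macro has been saved to a hypothetical templates/macros.html:

from jinja2 import Environment, FileSystemLoader

# Assumes the macro above lives in templates/macros.html (hypothetical path).
env = Environment(loader=FileSystemLoader('templates'))

html = env.from_string(
    "{% from 'macros.html' import states_select %}"
    "{{ states_select('state', value='CA', id='billing_state') }}"
).render()

print(html)  # prints the <select> markup with CA pre-selected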

How-to Set Up Ubuntu w/ MongoDB Replica Sets on Amazon EC2

This tutorial is intended for beginners who aren't familiar with EC2 yet but are generally familiar with MongoDB. EC2 is actually pretty easy, but a lot of the basic info you need to get started is scattered across numerous websites and articles. This post hopefully puts all the necessary details in one place.

The first thing to understand is that every EC2 instance runs an AMI (Amazon Machine Image), which is basically a bundle of one or more EBS (Elastic Block Storage) snapshots. The physical machine that your instance is hosted on has built-in hard drive space, but it isn't persistent: when the instance is stopped or terminated, whatever is on that disk is wiped. Amazon already has a catalog of community AMIs, including basic Ubuntu installs. We can use one of these, install the necessary packages, update configs, etc., and save the configured snapshot as our own AMI.

The problem is that when you search the community AMIs for 'ubuntu' you get some 500 results, so which one do we pick? http://alestic.com is a good resource for things related to EC2 and Ubuntu, and they maintain a list of 'official' AMIs from Canonical. I'm basing my EC2 instance in Amazon's us-east-1 region, so the AMI identifier for Ubuntu 11.04 EBS 64-bit is ami-1aad5273. If your EC2 instances are located somewhere else, you'll need the corresponding AMI identifier for that region, which can also be found on alestic.com.

To start off, you can follow the EC2 getting started guide, except instead of the Basic Linux AMI you can use the Ubuntu AMI that I mentioned above. There’s also no need to terminate the instance at the end since we’ll just roll right into customizing this instance for MongoDB.

I like to start by getting any system updates that have come out since the AMI was created:

sudo apt-get update
sudo apt-get upgrade

I also like to install the Linux tools dstat and htop to monitor system performance.

After following Amazon's Getting Started Guide you should have a blank Ubuntu box and be SSH'ed into it. The Linux root partition is usually an EBS volume, and I like to make a second EBS volume that I can mount for just the MongoDB database directory. This way I can detach the database volume and move it to another running instance. So go into the AWS Management Console and click on Volumes on the left. Create a new volume that has ample space for your database; you can't resize these things, so leave room to grow. After you create the EBS volume you need to attach it to your EC2 instance and choose a device name. I usually use /dev/sde.
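
If you'd rather script this step than click through the console, here is a rough sketch using boto3, the AWS SDK for Python. This isn't part of the original console-based walkthrough, and the instance ID and availability zone below are placeholders:

import boto3

# Sketch only: create a 100 GB volume and attach it to the instance as /dev/sde.
# 'i-0123456789abcdef0' and 'us-east-1a' are hypothetical placeholders; the
# volume must be created in the same availability zone as the instance.
ec2 = boto3.client('ec2', region_name='us-east-1')

volume = ec2.create_volume(AvailabilityZone='us-east-1a', Size=100)
ec2.get_waiter('volume_available').wait(VolumeIds=[volume['VolumeId']])

ec2.attach_volume(
    VolumeId=volume['VolumeId'],
    InstanceId='i-0123456789abcdef0',
    Device='/dev/sde',
)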

Next, let's log into the EC2 instance over SSH. We need to format the new volume, mount it, and add it to /etc/fstab so it auto-mounts when we restart. (Note: on Ubuntu Natty 11.04 the drive ends up appearing as /dev/xvde, but on older systems and other flavors of Linux it might still be /dev/sde.)

sudo mkfs -t ext4 /dev/xvde

I’m going to mount my new volume at /db

sudo mkdir /db
sudo vim /etc/fstab

Add the following line to the bottom of your /etc/fstab:

/dev/xvde        /db     auto    noatime,noexec,nodiratime 0 0

We can either restart to auto-mount it, or we can mount it manually now using:

sudo mount /dev/xvde /db

Now let's install MongoDB. Here are the official docs.

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv 7F0CEB10
sudo vim /etc/apt/sources.list

Add this line to the end of sources.list:

deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen

Then update the package list and install:

sudo apt-get update
sudo apt-get install mongodb-10gen

sudo mkdir /db/mongodb
sudo chown mongodb:mongodb /db/mongodb

Now let's edit /etc/mongodb.conf and change the location of the database. Near the top, change dbpath so it looks like this:

dbpath=/db/mongodb

I also like to change my oplogSize to something larger than the default (the value is in megabytes) so that if a secondary instance is down I have longer to bring it back up before it becomes too stale to re-sync. I also recommend turning on journaling to prevent data corruption.

oplogSize = 10000
replSet = myReplicaSet
journal = true

If you’re using a hostname in the replica set configuration instead of the IP address, you need to configure that in /etc/hostname and /etc/hosts

/etc/hostname:

db1

/etc/hosts:

127.0.0.1     db1   db1.mydomain.com    localhost.localdomain    localhost
xxx.xxx.xxx.xxx    db1   db1.mydomain.com

(where xxx.xxx.xxx.xxx is this machine's IP address that you use in the replica set config, usually the Elastic IP)

After changing hostname information you'll need to restart the instance for it to take effect.

You need to open a hole in the EC2 firewall for the other replica set nodes. Do this by going to the Security Groups section of the EC2 dashboard. Click on the security group you're using and add a custom TCP rule for port 27017 with xxx.xxx.xxx.xxx/32 as the source, one rule per node (where xxx.xxx.xxx.xxx is that node's IP address). Each node of the replica set needs to be able to reach every other node. The easiest way to do this is to use the same security group for all of them and add every node's IP address to the allowed list.
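
If you prefer to script the firewall rule instead of using the dashboard, here's a hedged sketch with boto3 (the security group ID and node IPs are hypothetical placeholders, and this isn't part of the original walkthrough):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Hypothetical security group ID and node IPs; list every replica set member.
node_ips = ['10.0.0.11', '10.0.0.12', '10.0.0.13']

ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 27017,
        'ToPort': 27017,
        'IpRanges': [{'CidrIp': ip + '/32'} for ip in node_ips],
    }],
)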

When you have the instance basically set, go back into the AWS control panel, right-click the instance, and choose Create Image. You can start up any number of these for the replica set, but you need to change the /etc/hostname and /etc/hosts files on each one to reflect that box's own IP address and hostname (db1, db2, db3, etc.).

From here on, the instructions in the MongoDB Replica Set Configuration docs apply. You don't need to specify the replSet name on the command line since we already set it in the config file. MongoDB should already be running, but you can restart it with /etc/init.d/mongodb restart if you change any configuration parameters.
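
For reference, initiating the replica set can also be scripted from Python. This is a minimal sketch using pymongo rather than the mongo shell the official docs use; the db1/db2/db3 hostnames follow the convention above:

from pymongo import MongoClient

# Connect directly to one member (directConnection requires pymongo 3.11+)
# and initiate the set. Member hostnames assume the db1/db2/db3 setup above.
client = MongoClient('db1', 27017, directConnection=True)

config = {
    '_id': 'myReplicaSet',
    'members': [
        {'_id': 0, 'host': 'db1:27017'},
        {'_id': 1, 'host': 'db2:27017'},
        {'_id': 2, 'host': 'db3:27017'},
    ],
}

client.admin.command('replSetInitiate', config)
print(client.admin.command('replSetGetStatus'))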

Running cherrypy on multiple ports (example)

As a continuation of my previous post on how to run cherrypy as an SSL server over HTTPS (port 443), this tutorial shows how to run a single cherrypy instance on multiple ports for both HTTP (port 80) and HTTPS (port 443).

We need to do a few things differently from most examples out there, like setting configs without using the quickstart() function and creating multiple Server() objects. But after reading through the source code a little, it becomes clear.

import cherrypy

class RootServer:
    @cherrypy.expose
    def index(self, **keywords):
        return "it works!"

if __name__ == '__main__':
    site_config = {
        '/static': {
            'tools.staticdir.on': True,
            'tools.staticdir.dir': "/home/ubuntu/my_website/static"
        },
        '/support': {
            'tools.staticfile.on': True,
            'tools.staticfile.filename': "/home/ubuntu/my_website/templates/support.html"
        }
    }
    cherrypy.tree.mount(RootServer(), '/', config=site_config)

    cherrypy.server.unsubscribe()

    server1 = cherrypy._cpserver.Server()
    server1.socket_port=443
    server1.socket_host='0.0.0.0'
    server1.thread_pool=30
    server1.ssl_module = 'pyopenssl'
    server1.ssl_certificate = '/home/ubuntu/my_cert.crt'
    server1.ssl_private_key = '/home/ubuntu/my_cert.key'
    server1.ssl_certificate_chain = '/home/ubuntu/gd_bundle.crt'
    server1.subscribe()

    server2 = cherrypy._cpserver.Server()
    server2.socket_port=80
    server2.socket_host="0.0.0.0"
    server2.thread_pool=30
    server2.subscribe()

    cherrypy.engine.start()
    cherrypy.engine.block()

Using SSL HTTPS with cherrypy 3.2.0 Example

It took me more time than it should have to piece together the right bits of current information for using SSL with cherrypy. Here’s a fully working example of cherrypy 3.2.0 serving up HTTPS requests.

Quick notes – if you haven't tried cherrypy, do it. It's awesome in its simplicity. Also, I got my SSL cert from GoDaddy, which was the cheapest I found. This particular cert uses a certificate chain, so when all is said and done we have my_cert.crt, my_cert.key, and gd_bundle.crt.

ssl_server.py:

import cherrypy

class RootServer:
    @cherrypy.expose
    def index(self, **keywords):
        return "it works!"

if __name__ == '__main__':
    server_config={
        'server.socket_host': '0.0.0.0',
        'server.socket_port':443,

        'server.ssl_module':'pyopenssl',
        'server.ssl_certificate':'/home/ubuntu/my_cert.crt',
        'server.ssl_private_key':'/home/ubuntu/my_cert.key',
        'server.ssl_certificate_chain':'/home/ubuntu/gd_bundle.crt'
    }

    cherrypy.config.update(server_config)
    cherrypy.quickstart(RootServer())

Launch the server like:

sudo python ssl_server.py

You need to use sudo because the server binds to port 443, which is a privileged port. You should be asked to "Enter PEM pass phrase", which is the passphrase you set when generating your key.

Update: In a follow-up post I show how you run an HTTPS server (port 443) and an HTTP server (port 80) at the same time.

Inspiring Civic Hacking

Mick Ebeling recently gave a TED talk about the homemade eye-tracking device he and a bunch of hackers made to allow a paralyzed man to communicate, Stephen Hawking style. They did this with an off-the-shelf PS3 camera and some open source software for $50. That's what I call a righteous hack. Most importantly, it has real-world significance. And it's totally something I, or many people I know, could have done if we had thought of the idea.

I think a lot of hackers are hungry for this kind of meaningful work. We need a repository of project ideas like the Eyewriter – immediate needs that have a tangible social effect and can be done in a weekend or two. Organize the ideas by the skills required and offer a platform for organizing groups of hackers to tackle the problem. There are a lot of developers out there looking for a side project and a way to have an impact. And we also need idea people: social workers, NGOs, and everyday people to tell us how technology could solve the problems they see in the field.

Some resources:
Random Hacks of Kindness
Code For America
Public Equals Online
Applications for Good

For goodness sake, hack!

A False Sense of Security with Test-driven Development

Test-driven development is great as long as you have proper tests. The problem is that it's very hard to predict enough edge cases to cover the field of possible scenarios. Code coverage analysis helps developers make sure all code blocks are executed, but it doesn't do anything to ensure an application correctly handles variations in data, user interaction, failure scenarios, or how it behaves under different stress conditions.

The fact that tests are helpful but never complete is something most developers are already conscious of. The danger is that better tests make worse developers! It's very easy to lean too heavily on passing tests, wildly changing code until the light goes green, without spending enough time thinking through the application's logic.

I'm basically saying that, psychologically speaking, passing tests give us a false sense of security. They can be a distraction from carefully crafted and thought-through code. That's why I advocate writing tests only for the purpose of regression testing. Testing should be a follow-up step, not an integral part of initial development.

Getting Wikipedia Summary from the Page ID

While working on my forthcoming checkin.to project, I needed to use the MediaWiki API to get the summary paragraph of Wikipedia articles pertaining to places. Checkin.to relies on Yahoo Where On Earth Identifiers (WOEIDs). Yahoo also conveniently offers a concordance API, so from the WOEID I get the Geonames ID and the Wikipedia page ID, among other things. As far as I can tell, the MediaWiki API doesn't allow you to request page content using the page ID, so the first step here is to resolve the page ID into a unique page title. This can be done using the query action like so:

http://en.wikipedia.org/w/api.php?action=query&pageids=49728&format=json

It gives a response resembling:

{"query":{"pages":{"49728":{"pageid":49728,"ns":0,"title":"San Francisco"}}}}

Step 2 is to get the actual page content. There are a variety of formats available including the raw wiki markup, but for my purpose the formatted HTML is much more useful. We also need to convert the spaces in the page title to underscores. The request looks like this:

http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=San_Francisco&format=json

And a response resembling:

{"parse":{"text":{"*":"<div class=\"dablink\">This article is about the place in California. [...] "}}}

Step 3 is to parse the resulting article HTML and extract just the first body paragraph, which typically summarizes the whole article. The problem here is that a bunch of other stuff, including all the sidebar content, comes before the first body paragraph, and that other stuff can itself include p tags. jQuery is a big help here, as usual. First, let's wrap the entire resulting wiki page in a div element to give everything a root. Then we can select just the children of that wrapper element to find the first root-level p tag.

wikipage = $("<div>"+data.parse.text['*']+"</div>").children('p:first');

Below is the entire resulting function that goes from page ID to summary paragraph and appends it to a <div> somewhere in my DOM called #wiki_container. I also perform some optional cleanup, including removing citations, updating the relative hrefs to absolute hrefs pointing to http://en.wikipedia.org, and adding a read-more link.

function getAreaMetaInfo_Wikipedia(page_id) {
  $.ajax({
    url: 'http://en.wikipedia.org/w/api.php',
    data: {
      action:'query',
      pageids:page_id,
      format:'json'
    },
    dataType:'jsonp',
    success: function(data) {
      var title = data.query.pages[page_id].title.replace(/ /g, '_');
      $.ajax({
        url: 'http://en.wikipedia.org/w/api.php',
        data: {
          action:'parse',
          prop:'text',
          page:title,
          format:'json'
        },
        dataType:'jsonp',
        success: function(data) {
          var wikipage = $("<div>"+data.parse.text['*']+"</div>").children('p:first');
          wikipage.find('sup').remove();
          wikipage.find('a').each(function() {
            $(this)
              .attr('href', 'http://en.wikipedia.org'+$(this).attr('href'))
              .attr('target','wikipedia');
          });
          $("#wiki_container").append(wikipage);
          $("#wiki_container").append("<a href='http://en.wikipedia.org/wiki/"+title+"' target='wikipedia'>Read more on Wikipedia</a>");
        }
      });
    }
  });
}

A continuous, blocking python interface for streaming Flickr photos

As I explained in my last post, Yahoo! claims its Firehose is a real-time streaming API, and it's not. So to make life a bit easier for app developers I wrote a Python wrapper that provides a continuous, blocking interface to the Flickr polling API. Effectively, it emulates a streaming API by stringing together frequent requests to flickr.photos.getRecent. And it's dead simple.

from PyFlickrStreamr import PyFlickrStreamr

fs = PyFlickrStreamr('your_api_key_here', extras=['date_upload','url_m'])
for row in fs:
    print str(row['id'])+"   "+row['url_m']
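
Under the hood the idea is simple: poll flickr.photos.getRecent on an interval, remember which photo IDs have already been yielded, and block in between. Here's a rough sketch of that loop using the requests library; it's a simplified illustration, not the actual PyFlickrStreamr source:

import time
import requests

def flickr_recent_stream(api_key, extras=None, interval=15):
    # Illustrative generator: strings together flickr.photos.getRecent calls
    # so the caller can iterate over photos as if it were a stream.
    seen = set()
    params = {
        'method': 'flickr.photos.getRecent',
        'api_key': api_key,
        'format': 'json',
        'nojsoncallback': 1,
        'per_page': 100,
    }
    if extras:
        params['extras'] = ','.join(extras)
    while True:
        resp = requests.get('https://api.flickr.com/services/rest/', params=params)
        for photo in resp.json()['photos']['photo']:
            if photo['id'] not in seen:  # skip photos we've already yielded
                seen.add(photo['id'])
                yield photo
        time.sleep(interval)  # block until the next poll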

You can download the package from PyPI or fork the source code on GitHub. Have fun.

The Yahoo Firehose "feed" isn’t a feed at all

The web has been on a big real-time trend for the past couple of years. FriendFeed was one of the first services to show real-time updates across your social network, and real-time feeds took the stage in a big way when Twitter started its streaming API. In April, Yahoo! announced its Firehose API, claiming "it includes a real-time feed of every public action taken on our network". The thing is, this isn't a "feed" or a "stream" in the same sense that Twitter's streaming API is. It's a database you can poll with Yahoo's YQL, an SQL-like query language. Sure, the updates may be available in their database in near real-time, but to receive them you need to issue a new request. In fact, the only way you know whether there are updates is to continuously poll the service. A feed would be something like long polling with HTTP server push (what Twitter does) or PubSubHubbub.

It may be just semantics to some, but this bothers me. To those of us who build applications that publish or consume real-time information, this is a very important distinction. I plan on writing a Python library that wraps Flickr's polling API into a "real-time" blocking continuous stream for a project I'm working on. I'll publish the code on GitHub and post it here when it's done.