Climate Data Mashup

The problem: people generally have very little perspective on the actual scale of the contributing components of climate change or the effects of different proposed measures to stop it. What percentage of CO2 emissions comes from city residential electricity consumption vs agriculture vs vehicles? How much of a difference will legislation X make in the big picture? When Obama says that the United States will cut greenhouse gas emissions 80% by 2050, what kind of effect does that actually have? What would happen to the weather in 10 years if everyone in the world stopped driving tomorrow?

The solution: let people build hypothetical scenarios themselves. Design an interface centered around an attractive timeline graph showing climate data in its various forms, including temperature increase, carbon emissions, and sea level. Curious users can toggle different proposed solutions on and off to see the overall effect on projected emissions, along with dollar cost over time. Group data by current relevance, such as the proposals being discussed at the climate summit in Copenhagen this week. Include competing predictions from different agencies and scientific groups to communicate the level of uncertainty.
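
To make the interaction concrete, here's a minimal sketch of the arithmetic behind the toggles (all numbers and proposal names here are made up for illustration; real curves would come from the agencies' published projections): a baseline emissions projection with each active proposal's per-year reduction subtracted out.

# Toy scenario engine: the timeline shows the baseline projection minus
# the per-year reductions of whichever proposals are toggled on.
# All figures below are hypothetical placeholders, not real climate data.
BASELINE = {year: 30.0 + 0.3 * (year - 2010) for year in range(2010, 2051)}  # GtCO2/yr

PROPOSALS = {
    "us_cut_80_by_2050": {year: 0.15 * (year - 2010) for year in range(2010, 2051)},
    "no_more_driving":   {year: 5.0 for year in range(2010, 2051)},
}

def projected_emissions(active):
    """Recompute the curve shown on the timeline for the active toggles."""
    return {year: max(0.0, BASELINE[year] - sum(PROPOSALS[p][year] for p in active))
            for year in BASELINE}

print(projected_emissions({"us_cut_80_by_2050"})[2050])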

Extra credit for building a system to automatically pull in current data from a variety of sources.

A General Update

I haven’t really used this space to send out personal updates as of yet, but that’s partly what I intend it for. In years past I kept a frequent blog about my travels and adventures, but since becoming boring, that kind of tapered off. Tonight I’m feeling a little inspired.

I kept up with my nomadic tendencies by moving (somewhat blindly) back to San Francisco last month. The intention is mostly to surround myself with techies and the Bay’s zany brand of artistic expression while staying away from the Midwest’s bone-chilling winter. So far so good. I’m constantly surprised by all the times I hear guys at a bar intensely debating their new iPhone app, or walk into a coffee shop at 2 in the afternoon to find it full of people on laptops adorned with jQuery or Facebook or name-your-startup stickers. I’ve already been to a few of Silicon Valley’s famously excessive dot-com parties with flashy performances and dance clubs offering an all-night open bar for hundreds of people. And only in San Francisco will you find Friday night events taking place in some converted warehouse, combining lectures on neuroscience and synesthesia with street art and house music until 2am. My favorite. I’m subletting a room in a nice apartment for the time being (pic, pic).

So far I have tended toward reclusive productivity, trying to finish a couple of long-overdue freelance jobs that have been hovering about for quite some time. I also started working on some exciting startup ideas. Check out tweet-pulse.com for a little tease of one. Mysterious, I know. Others lean more heavily on my ambitions for artificial intelligence, but I keep discovering how difficult that can be with no capital and no graduate degree. I’m contemplating graduate school, which would be the expected path, but I really prefer to learn by doing something practical. I would love for someone to pay me to work on something related to robotics while I hone my skills in more abstract machine learning techniques, but those opportunities are few and far between. They do exist – I had one interview for a job I would have loved – but it feels like 90% of the work these days is with web-related ventures.

I’ve had enough of that. In fact, I’m setting a guideline for my upcoming job search: I won’t look at companies that deal only in the web space, because frankly I’m tired of that work. I’ve done it for the past seven years, both freelancing and fully employed, and it lost most of its appeal when it was no longer a theoretical challenge and only an implementation challenge. I like to work on problems that I go to bed thinking about and wake up having been enlightened by some other-dimensional thought pattern. Or perhaps to experiment with methods mysterious enough that the results surprise me.

Over the next 30 days I need to decide which side of the three-sided fence I’ll hop over. I know I would benefit from working with a company on a project larger than myself, with mentors more experienced than me, but I’m scared that I’ll find the work less than fulfilling. I also know I’ll miss, in some ways, being my own boss, which has been my natural state almost exclusively so far. Some of the biggest and most successful names in technology have gone their own way and found it better than any other path, but I could just as likely find that road longer and more strenuous. The third side of this curious tri-fence is graduate school. Oh what potential, oh what a loss of time. I welcome your counseling.

How head hunters use Google searches to find candidates

In the past two months I’ve been randomly contacted by a number of head hunters hiring for positions both in Chicago and on the West Coast. In a time when so many people are looking for jobs, some recruiters actually seem to be coming to me. I know that a number of friends in the industry with at least a little proven track record have been experiencing the same thing. One of the keys, it turns out, is to have your own blog and resume (CV) online. Check out this Google search that showed up in my site’s analytics yesterday, for example:

“.pdf” and “Java” “chicago” CV

This particular searcher was in the UK (Chicago financial companies often use London-based recruiting agencies). I’ve seen similar searches coming from the networks of big companies that we all know.

So keep your resume online, keep it current, and blog.

Upgrading to MySQL 5.1 and PHP 5.3.0 on CentOS with Plesk

This is an account of my experience upgrading MySQL to version 5.1 on my MediaTemple (dv) 3.5 dedicated virtual server. It may also help people not on MediaTemple servers, especially if you’re trying to upgrade MySQL without damaging Plesk. There are a lot of forum posts out there that lead you down the wrong path, so beware.

Before you go any further: this could really mess up your server, and it’s not thoroughly tested, so if you have anything you depend on, be very careful. In this process I actually had to do a complete restore because I screwed the server up so royally before I figured everything out.

First, obviously, you need to install the developer tools through the (mt) control panel, which enables root access and SSH. Then you need to install yum, following the directions in the (mt) knowledge base article. There’s also a note at the bottom of that article about removing SiteBuilder, which you should do, but don’t bother installing the atomix repository.

Then you’ll see that MySQL 5.0.x is the latest version in the existing repositories. Remi’s repos have all the latest goodies, so we need to enable them. This page is helpful: http://blog.famillecollet.com/pages/Config-en

wget http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
wget http://rpms.famillecollet.com/enterprise/remi-release-5.rpm
rpm -Uvh remi-release-5*.rpm epel-release-5*.rpm

A quirk of CentOS is that a newly added repo isn’t actually enabled by default. There are two ways to get access to its packages: either add the option --enablerepo=remi to the yum command, or edit /etc/yum.repos.d/remi.repo and set enabled=1. Now, if you run

yum --enablerepo=remi info mysql

it SHOULD report the latest 5.1.x version. But there are a couple of other tricky bits.

If you try upgrading MySQL you’ll notice that two packages create conflicts: php-mhash and php-ncurses. Both are safe to remove:

yum remove php-mhash php-ncurses

Now you should be able to upgrade MySQL, which will also upgrade PHP to 5.3.0 (if you try to upgrade PHP alone, you’ll get errors):

yum --enablerepo=remi upgrade mysql

You’re almost there. You’ll notice that MySQL won’t start, which is because there are old settings in /etc/my.cnf that trip it up. I’m doing this upgrade on a completely fresh system and don’t have to worry about losing any settings or existing databases, so I moved /etc/my.cnf to a new location and restarted MySQL, which regenerates a fresh my.cnf. The final step is to run “mysql_upgrade -u admin -p”. The password is your Plesk admin password.

The one unsolved part of this mystery is updating the Plesk tables: all of them fail the update script. Still, I’ve clicked around and Plesk seems to be working OK.

Summary of Free Graphical Data Modeling Tools

The Goal: To graphically design a relational data model and generate both the DDL to create the DB structure and the PHP5 class structure, including public/private member variables and getters, setters, and create/update/delete methods with a working DB link. Icing on the cake would be if the data model could be updated graphically after the code is generated, producing a patch file for the PHP5 code and an SQL script with the appropriate ALTER/ADD/DROP statements to update the existing schema.
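
To make that goal concrete, here is a toy sketch in Python (with a made-up entity description; purely illustrative of the pipeline, not any particular tool’s output) of a single model definition driving both the DDL and an application-layer class skeleton:

# One entity definition drives both generated artifacts. The entity
# description and field list here are hypothetical.
ENTITY = {
    "name": "user",
    "fields": [("id", "INT AUTO_INCREMENT PRIMARY KEY"),
               ("email", "VARCHAR(255) NOT NULL"),
               ("created", "DATETIME")],
}

def to_ddl(entity):
    """Emit a CREATE TABLE statement for the entity."""
    cols = ",\n  ".join("%s %s" % f for f in entity["fields"])
    return "CREATE TABLE %s (\n  %s\n);" % (entity["name"], cols)

def to_php_class(entity):
    """Emit a PHP5 class skeleton with private members and accessors."""
    name = entity["name"].capitalize()
    props = "\n".join("    private $%s;" % n for n, _ in entity["fields"])
    accessors = "\n".join(
        "    public function get%s() { return $this->%s; }" % (n.capitalize(), n)
        for n, _ in entity["fields"])
    return "<?php\nclass %s {\n%s\n%s\n}\n" % (name, props, accessors)

print(to_ddl(ENTITY))
print(to_php_class(ENTITY))

The icing-on-the-cake step would amount to diffing two such model definitions and emitting the corresponding ALTER statements plus a patch for the generated classes.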

Eclipse Frameworks

Eclipse has a complex collection of layered frameworks; some seem a little redundant, and for others it isn’t immediately clear what functionality they provide. Most of these frameworks are in the “incubation” phase. There are also proprietary plugins that utilize these frameworks but are not distributed through the official Eclipse repositories. The frameworks serve simply as platforms to aid the development of functional plugins, primarily by third parties. The Eclipse project itself doesn’t provide any implemented solutions except for UML2Tools, which is a component of MDT.

The Eclipse Modeling Project consists of the following:

  • Eclipse Modeling Framework Project (EMF) – (link) The EMF project is a modeling framework and code generation facility for building tools and other applications based on a structured data model. From a model specification described in XMI, EMF provides tools and runtime support to produce a set of Java classes for the model, along with a set of adapter classes that enable viewing and command-based editing of the model, and a basic editor.
  • Model Development Tools (MDT) – (link) The Model Development Tools (MDT) project focuses on big “M” modeling within the Modeling project. Its purpose is twofold: to provide an implementation of industry-standard metamodels, and to provide exemplary tools for developing models based on those metamodels.
    • UML2Tools (link) is a set of GMF-based editors for viewing and editing UML models. It is an optional component of the MDT project. It does not provide entity relationship modeling, so I’m not going to say much more about it.
  • The Eclipse Graphical Modeling Framework (GMF) (link) provides a generative component and runtime infrastructure for developing graphical editors based on EMF and GEF. The project aims to provide these components, in addition to exemplary tools for select domain models which illustrate its capabilities.
  • The Graphical Editing Framework (GEF) (link) allows developers to create a rich graphical editor from an existing application model. GEF consists of 2 plug-ins. The org.eclipse.draw2d plug-in provides a layout and rendering toolkit for displaying graphics. The developer can then take advantage of the many common operations provided in GEF and/or extend them for the specific domain. GEF employs an MVC (model-view-controller) architecture which enables simple changes to be applied to the model from the view. GEF is completely application neutral and provides the groundwork to build almost any application, including but not limited to: activity diagrams, GUI builders, class diagram editors, state machines, and even WYSIWYG text editors. The Logic Example, one of the provided examples, demonstrates this.

Free Standalone and Plugin Implementations

Clay Database Modeling

Clay (link) is an Eclipse plugin by Azzurri, a Japanese company. The free version provides nearly all the features needed to build a proper ERD. The Pro version adds support for enterprise databases, printing and exporting images, and document generation. Clay exports clean DDL code for MySQL. It can also reverse engineer databases.

Conclusion: This is a decent choice, and it won’t lock you into using it. But it doesn’t generate an interface for the application layer, so it doesn’t fulfill the goal of this study.

RISE

RISE (link) is a freemium, proprietary, Windows-only solution, but it does some good stuff. RISE is the only free solution I’ve found that will generate both the data layer and the application layer based on an entity relationship diagram. It has one of the easier-to-use interfaces for creating entities, attributes, relations, and views. The stereotypes concept makes it easier to build common structures such as trees, lists, classifications, and extensions. Connecting a customizable Interface to an entity or view allows C# or PHP code to be generated, giving your application layer access to the data layer. The application code fully implements methods that perform standard operations on your data layer. RISE will even generate a SOAP web service interface and provides an AJAX framework.

Despite all that praise, there are a few problems with RISE (beyond it being Windows-only). There doesn’t seem to be a way to create indexes, primary keys, or auto-incremented values, or to adjust the precision of data types.

What RISE generates is not proper DDL or a database that actually reflects your ERD. It produces a stored procedure that creates four tables for keeping track of the log and model versions in addition to the actual entity tables. This is presumably so it can gracefully upgrade the structure to a new version without losing data, but it results in an unclean database.

Conclusion: RISE may be a good choice for some applications, but the fact that it doesn’t follow standards disqualifies it for me: it lacks extensibility if you should ever decide to stop using it.

ArgoUML

ArgoUML (link) is an open source, standalone UML and code-generation application written in Java and maintained by Tigris.org, the same group behind Subversion. It doesn’t support modeling database schemas out of the box, but it does have an officially supported plugin that does, called argouml-db. As of this writing the latest argouml-db plugin isn’t supported in the latest build of ArgoUML, so you’ll need to get the bundled version from the SourceForge site https://sourceforge.net/projects/dbuml.

My impression is that the db plugin is a hacked-together job and isn’t very well maintained. It’s tricky to even get running, and clumsy for adding attributes to a table. The properties panel also lacks basic options you would expect as part of DDL, such as column lengths, auto_increment, or NULL / NOT NULL. Generating code from the bundled version gives the option of generic “SQL” and Java. The SQL didn’t work out of the box, and the generated file contained only an error. The Java code didn’t include any getters, setters, or a link to the database. Instead of primitive data types, the generated Java code imported types.sql.VARCHAR and similar.

Conclusion: not worth anyone’s time for ERDs.

Umbrello

Umbrello (link) is an open source application that is part of the KDE project, which is primarily targeted at Linux distributions. Getting the latest Umbrello working on Mac and Windows can be a bit more of a pain. MacPorts and Fink are package managers for Mac, and both have up-to-date versions of most mainstream Linux software. You can compile most of KDE, including Umbrello, from source; it might be something you want to do overnight. Expect to encounter errors that you’ll have to find your way around. Needless to say, this is not a user-friendly installation in non-Linux environments.

Umbrello does a decent job of modeling relational databases (entity relationship models) and includes all the expected features, including data types and properties, auto-increment, foreign key constraints, etc. It does a fairly solid job of generating MySQL or PostgreSQL DDL, but it won’t export the data model to PHP. You’ll have to create a separate class diagram for that.

Conclusion: if you run Linux this is a decent option considering the alternatives.

Non-free solutions

The main non-free tool evaluated was Visual Paradigm’s Database Visual Architect, an enterprise-level proprietary solution discussed in the conclusions below.

Conclusions

This is a relatively uncrowded market for robust solutions. The only two tools I found that satisfy the core requirements of my goal are Visual Paradigm’s Database Visual Architect, an enterprise-level proprietary solution, and, in a somewhat broken way, RISE, the free solution. No other tools produce both data- and application-layer code. I discount RISE for not keeping with standard methods and not producing extensible, clean code. I discount DB Visual Architect for its prohibitive cost in non-enterprise environments.

My chosen solution is the Clay Database Modeling plugin for Eclipse, which does a good job of modeling and exporting DDL for the data layer. The application layer will need to be designed and implemented separately.

Comments are welcome on these solutions or any others not included.

Conficker C and a future with self-evolving computer viruses

I’m absolutely enthralled by the Conficker C virus after reading this analysis from SRI International. The C variant is the third major generation of the Conficker virus and demonstrates the highest level of sophistication found in any computer virus or worm to date. What excites me most about it is the decentralized nature of its peer-to-peer method of quickly propagating updates to itself in the roughly 12 million infected computers around the globe.

Researchers have identified April 1st as the date when the virus wakes from hibernation. The event has been simulated with a copy of the virus in computer science laboratories, but since the virus gets its instructions through the peer-to-peer network and rendezvous points, it is impossible to tell what kind of code will be executed until it happens. Speculation has ranged from the biggest April Fools’ Day joke ever to a massive dragnet or “Dark Google” allowing the virus’s authors to search infected machines en masse for sensitive information and sell it to criminal organizations or governments. That is especially worrisome since Conficker has infected many government and military networks.

The authors have demonstrated the most cutting-edge knowledge in multiple disciplines, so this is not just a kid sitting in his mother’s basement. This is a closely coordinated effort by a group of extremely talented individuals, and I personally wouldn’t be surprised if the authors, if caught, turn out to be part of a government initiative. The SRI report says “those responsible for this outbreak have demonstrated Internet-wide programming skills, advanced cryptographic skills, custom dual-layer code packing and code obfuscation skills, and in-depth knowledge of Windows internals and security products.”

Conficker C does a mutex check with pseudo-randomly generated names when it initially installs to avoid overwriting itself. Then it patches the Win32 net API to inhibit antivirus software and block antivirus websites. Because it patches only in-memory DLLs and not the persistently stored DLLs, removal tools can’t simply replace the compromised files with clean ones. It also patches the same Windows vulnerability it initially used to enter the system but leaves a back door so that only new variants of Conficker can use it. This prevents other viruses from piggybacking on Conficker and competing for control of the system.
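
The mutex trick itself is a standard single-instance pattern. Here’s a minimal sketch (illustrative only; the mutex name below is hypothetical, and this is not Conficker’s actual scheme):

# Single-instance check via a named Windows mutex: if another process
# already owns a mutex with this name, a second copy backs off.
import ctypes

ERROR_ALREADY_EXISTS = 183
kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

def already_running(name):
    # CreateMutexW succeeds either way; the last-error code tells us
    # whether a mutex with this name already existed.
    kernel32.CreateMutexW(None, False, name)
    return ctypes.get_last_error() == ERROR_ALREADY_EXISTS

if already_running("Global\\example-pseudorandom-name"):  # hypothetical name
    raise SystemExit("another instance is active")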

Conficker spawns a thread that searches for a static list of known antivirus applications and terminates them to defend itself from attack, and it blocks the services that allow antivirus software to auto-update. It also deletes all Windows restore points and removes safe mode as a boot option. Conficker uses dual-layer encryption and code obfuscation to hinder efforts at reverse engineering it. It released an update just a few weeks after the new MD6 hashing method became publicly available from the original researchers at MIT.

The authors use two similar methods of propagating updates to infected machines. Previous Conficker variants use a clever “rendezvous” system that pseudo-randomly generates a huge list of possible rendezvous locations, changing daily, where the authors may have placed a distribution server. The randomized nature and the large number of possible locations make efforts to block those domains impractical. Once a machine has an update it can also assist in spreading it to other machines via the peer-to-peer network. Currently, almost all p2p networks require some kind of “seed” or predefined peer list to be introduced into the network, but Conficker doesn’t. It uses the same kind of pseudo-randomly generated destination list as the rendezvous system to generate an initial peer list, essentially bootstrapping itself into the network. There is absolutely no bottleneck that can be attacked to stop Conficker from communicating with its peers.
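
The general technique behind such rendezvous lists is a domain-generation algorithm. A minimal sketch of the idea (an illustration of the technique only; Conficker’s actual algorithm, domain counts, and TLDs differ): seed a PRNG with the date so every infected machine independently derives the same candidate list with no central coordination.

# Illustrative DGA sketch: same date -> same pseudo-random domain list
# on every peer. Not Conficker's real algorithm or parameters.
import random
import string
from datetime import date

def candidate_domains(day, count=500):
    rng = random.Random(day.toordinal())  # deterministic daily seed
    tlds = [".com", ".net", ".org", ".info", ".biz"]  # hypothetical set
    domains = []
    for _ in range(count):
        name = "".join(rng.choice(string.ascii_lowercase)
                       for _ in range(rng.randint(6, 12)))
        domains.append(name + rng.choice(tlds))
    return domains

# Defenders must block or pre-register all of today's candidates;
# the authors only need to control one of them.
print(candidate_domains(date.today())[:5])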

Some of my own ideas

Antivirus filters and coordinated strategies to thwart the spread of viral software rely on patterns to identify uninvited guests. I see future decentralized malware using a randomized approach to avoid detection. If the same application is also capable of virally updating its peers, a system of natural selection will emerge. This is essentially how genetic algorithms work. In this case, the natural fitness function is the ability to infect new systems (spawning offspring) and the defensive ability to ward off efforts to remove it from the host (self-preservation). In this sense, randomized configurations (the genetic code) of the virus will be propagated to new systems and over existing instances at a higher rate, and thus the more successful variants become the most prevalent. And there we have a model of evolution.
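
To illustrate the selection dynamic, here’s a toy sketch of standard genetic-algorithm machinery with a made-up fitness function standing in for “ability to spread and persist” (nothing here resembles real malware):

# Toy genetic algorithm: bit-string "configurations" compete, and the
# fittest half propagates mutated copies into the next generation.
import random

GENES, POP, GENERATIONS = 20, 50, 40
rng = random.Random(0)

def fitness(genome):
    # Hypothetical stand-in for infection/self-preservation success.
    return sum(genome)

def mutate(genome, rate=0.05):
    return [g ^ 1 if rng.random() < rate else g for g in genome]

population = [[rng.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP // 2]       # "selection": best spreaders live on
    offspring = [mutate(rng.choice(survivors)) for _ in range(POP - len(survivors))]
    population = survivors + offspring

print("best fitness:", max(fitness(g) for g in population))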

The process works much like HIV and flu viruses and has the effect of a self-healing, growing network that can autonomously adapt to new countermeasures developed by antivirus companies. A self-evolving computer virus has one advantage over biological evolution: electronic speed. Time is the greatest enemy of evolution, and digital organisms have the chance to excel beyond anything we have observed in nature.

Fortunately, genetic programming hasn’t yet advanced to the stage where the scenario I proposed is practical. Current genetic programming algorithms are limited to a constant set of numerically variable attributes that are randomly modified in each generation. Those attributes could never be complex enough to resemble an actual evolving organism, and too much of the logic for a computer virus has to be pre-programmed and static. I expect that to change over the next decade as more work is done in this area, so watch out.
