Visualizing Wikipedia As One Large Node Graph

As a network visualization tool, node graphs are an intuitive method of communicating relationships between entities. I’ve been thinking a lot about the semantic web lately and thought it would be cool to visualize all of the links between articles in Wikipedia at once. I want to pull back and get the 10,000 foot view of the state of modern knowledge, which I don’t think has been done before in a comprehensible way. Chris Harrison’s WikiViz project comes closest but it quickly becomes incomprehensible and is not dynamic.

I have not yet found a tool capable of pulling this off. There are two key ideas that go into representing information at such vast scale. We need to be able to show detailed information in a narrow focus but not get bogged down when zoomed out, which means you need to represent the graph at different resolutions. This has been a problem solved for seeing images at scale. Google earth represents the earth at vastly different resolutions and gigapan is able to zoom into images of many gigapixel size. Second, the kind of information you’re displaying needs to make sense at any height. That means when you’re looking at the graph from 10,000 feet it shouldn’t devolve into a gray blur. Google maps also demonstrates this by removing detail such as building names and street names, cities, and states when you zoom out. Because I’m a gamer I’m inspired by Supreme Commander which developed an innovative way of showing tactical information. You can zoom out to see the playing field as a whole and seamlessly zoom in to examine an area in detail. When zoomed out, individual units become symbols that still convey what the unit is.

At a detailed level a single node could contain basic information including the name, some classification, and perhaps a summery. We can use a sunburst style visualization to represent links. As you zoom out that detail gradually disappears. At a high level less significant articles can be represented by generalized concepts. Higher yet, even more general concepts begin to swallow up the more specific ones. The higher you get the more general the concept. Less significant links between concepts fade into the background. The big challenge is reliably building a concept tree for any node in Wikipedia. A lot of research and effort has gone into that area, but it’s not quite there yet. People would be forgiving of the accuracy to start with.

So here’s a summary of the requirements for a tool to visualize Wikipedia

  • Must handle 3.2 million nodes and tens of millions of edges (links)
  • Must be able to modify the graph dynamically to highlight changes in real-time. This means we need something other than the standard spring plotting algorithm which runs in computational complexity O(n^2).
  • Must be able to represent clusters of nodes symbolically as concepts and gradually move between level of detail
  • Must be able to operate with partial graph data. The client application will see only a small slice of the network at once or a high level view of a larger area at a low resolution.

In my brief analysis there are very few tools designed to handle large data sets dynamically. AI3 has a list of 26 candidate tools for large scale graph visualization and although some are visually stunning and some are large scale, none satisfy the requirements above. It seems like the major innovation needed here is a map-reduce style algorithm for undirected graphs. Map-reduce works well with a tree structure, but not as well with unstructured, cyclic graphs. In Wikipedia any node can be linked to any other node and there’s no consistent parent-child relationship. Everything is an “unowned” entity. If a comprehensive and reliable concept hierarchy could be generated from Wikipedia links and text we might be able to use that as the tree-like structure where each level of the tree roughly represents one resolution of the network.

Anyway – that’s something to think on.

UPDATE: A new open-source project called Gephi looks really interesting. http://gephi.org

Here are some more links of interest:
http://cytoscape.org/screenshots.php
http://arxiv.org/abs/cs/0512085
http://blog.semanticfoundry.com/2009/06/01/dynamic-visualization-introduction-theory/