When we’re faced with large amounts of data, we often scurry to aggregate, summarise, tally and total so we can glean meaning from the mass of numbers. The trouble is we don’t always know what the data is concealing, so we don’t know which analysis will reveal insight. One computing device that is extraordinarily good at finding patterns in large amounts of data is the human brain, together with its sophisticated peripheral, our eyes. All we need to do to activate this capability is visualise the raw data. Here are a couple of examples.
Last week I came across some fun visualisations by Vijay Pandurangan showing the historical change in movie poster colours. Starting back in 1914 and running through to this year, Vijay collected a stack of movie posters for each year. He then wrote a pretty simple program to chart the different colours used in each poster he could find:
I first made a unified view of colour trends in movie posters since 1914. Ignoring black and white colours, I generated a horizontal strip of hues in HSL. The width of each hue represents the amount of that hue across all images for that year, and the saturation and lightness were the weighted average for all matching pixels. Since HSL has a fixed order, comparisons can be made between years visually.
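The per-pixel tally he describes can be sketched in a few lines of Python. To be clear, this is a guess at the shape of the computation, not Vijay’s actual code: the bucket count, the thresholds for skipping black, white and grey pixels, and the use of the standard library’s colorsys for the conversion are all my assumptions.

```python
import colorsys
from collections import defaultdict

def hue_distribution(pixels, buckets=12, min_sat=0.15,
                     min_light=0.05, max_light=0.95):
    """Tally pixels into hue buckets, ignoring near-black/white/grey.

    pixels: iterable of (r, g, b) tuples with 0-255 channels.
    Returns {bucket: (share, avg_lightness, avg_saturation)} where
    share is the fraction of counted pixels falling in that hue bucket
    (the width of the strip), and the averages are per-bucket means.
    """
    counts = defaultdict(int)
    light_sum = defaultdict(float)
    sat_sum = defaultdict(float)
    total = 0
    for r, g, b in pixels:
        # Note: colorsys returns (h, l, s), not (h, s, l).
        h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
        if s < min_sat or not (min_light < l < max_light):
            continue  # skip black, white and grey pixels
        bucket = int(h * buckets) % buckets
        counts[bucket] += 1
        light_sum[bucket] += l
        sat_sum[bucket] += s
        total += 1
    return {
        b: (counts[b] / total, light_sum[b] / counts[b], sat_sum[b] / counts[b])
        for b in counts
    }

# Three orange pixels, one blue, plus white pixels that get ignored:
demo = hue_distribution([(255, 128, 0)] * 3 + [(0, 0, 255)] + [(255, 255, 255)] * 10)
```

Because hue has a fixed circular order, sorting the buckets by key before drawing them as a strip gives the year-to-year comparability Vijay mentions.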
The results show an obvious trend.
It’s pretty clear that, over time, there’s a trend towards a 50/50 split between orange and blue. And sure enough, this is a known fad in movie posters. The high contrast of orange and blue has been road tested to death and can be seen on billboards everywhere.
A large number of posters we see today are dominated by this complementary colour mix, juxtaposing warm and cool colours. The great thing about Vijay’s visualisation is that it makes no presumption about the underlying data. It just presents the frequencies as proportionally sized areas in the graphic. From this we instantly see two interesting facts: the dominance of orange-blue today, and the shift away from predominantly orange over the last century.
Another favourite example, from 160 years ago, is the often-referenced map of London used by John Snow when tracking the 1853–54 cholera outbreak. Dr Snow marked a map with a black bar for each death, each bar drawn parallel to the street, with the bars forming stacks that indicate larger numbers of deaths at each location. Here is the full map, drawn a year after the outbreak, and a detailed view of the highest concentration of deaths.
Snow theorised that cholera was transmitted by water, not by foul air (miasma) as the accepted theory of the time held. By plotting incidents of cholera on a map of London, a cluster around the Broad Street water pump became obvious. The visualisation may have come too late to stop this particular outbreak, but it provided evidence strong enough to convince ardent skeptics that cholera was borne by contaminated water.
There’s a subtle detail in Snow’s visualisation: the care taken to draw each death-mark. Each mark is the same size, with the same stroke thickness and length, and each is neatly aligned to the street the house faced onto. This is not just a question of tidiness; the consistency keeps the black marks in proportion to the effect of cholera in each location. The graphic would be nowhere near as powerful if the number of deaths were written as a numeral. The reader would have to interpret each numeral, scanning the map for the largest number, which is cognitively much harder (and more error prone) than looking for the biggest grouping of black lines.
Both these examples demonstrate data visualisation, not analysis. Snow did not need an algorithm to take the geo-coordinates and find a likely centre. By rendering the data on the map, the association with the Broad Street pump is instantly obvious. When working with large data sets with unknown trends, we can apply the same principle.
Even just a few pixels per data point, like in the movie posters, can be enough for human visual processing to detect trends. Use the computer to visualise as much data as possible on one screen and let the humans do the visual analysis.
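The underlying trick in both examples, area in proportion to frequency, with no prior assumptions about the data, takes only a couple of lines. Here is a minimal sketch rendering it as a text strip rather than a coloured image; the category names and counts are made up for illustration.

```python
def strip(counts, width=60):
    """Render category frequencies as one proportional-width text strip.

    Each category occupies a run of its initial letter, sized by its
    share of the total -- a textual analogue of the poster-colour strips.
    """
    total = sum(counts.values())
    return "".join(key[0] * round(width * n / total) for key, n in counts.items())

# The dominant category jumps out at a glance, no sorting or stats needed:
print(strip({"orange": 30, "blue": 25, "grey": 5}))
```

Stack one strip per year (or per location, or per category) and the screen becomes the kind of display both Vijay and Snow built: the computer does the rendering, the eye does the analysis.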