Guest Post by Ian Bailey, CTO of illumr
Computers are reasonably good at analyzing large datasets, but there is one class of problem where they require a bit of help from puny humans – high dimensional datasets. By “high-dimensional” we mean “wide”, as in lots of columns. When we have wide data, it’s very hard to spot commonalities across a number of those columns. For example, if we have data from a large number of sensors, and all of them have something to say about what’s going on, it’s very hard to detect what is similar about all those readings when a particular type of event occurs.
A typically difficult problem is predicting cancer outcomes based on levels of gene expression. We have measures of hundreds (sometimes thousands) of genes, all of which interact to influence medical outcomes – e.g. survival rates, response to particular chemotherapies, growth rates, etc. We also have thousands of patients, so we end up with a table that is hundreds (or thousands) of columns wide and thousands of rows deep. It goes without saying that it’s impossible for a human to work out what combinations of levels of gene expression predict a given outcome. However, it’s also a very difficult problem for a computer to solve as it has to conduct huge numbers of comparisons of all the possible data values – this is known as the “Curse of Dimensionality”. A computer also needs to know what to look for if it’s going to come up with anything useful in a reasonable timeframe, which pretty much means you need to know the outcome before you run the analysis. Fat lot of good that is.
One answer to this problem is to combine human cognition with machine learning to circumvent some of the heavier computation. The computer organizes the data in an unbiased fashion and presents it as a graphical model to the analyst. At this point, the (as yet) uniquely human skill of pattern recognition comes into play, and the analyst is able to spot shapes and clusters in the data that indicate something interesting is going on. The tricky bit is in getting the computer to present something that helps the analyst. The traditional approach is to conduct what is known as a dimensional collapse (or dimension reduction) on the data. This sounds like a cataclysmic event from an episode of Doctor Who, but is actually just a way to simplify the data.
Before looking at dimension reduction, let’s look at a simpler example; a dataset with three columns of numbers. We can treat each row as a coordinate in 3D and plot the points. When the points are visualized, we can immediately tell if there is a random distribution (a cloud of points) or if some clusters are present. If we’re getting clusters, it’s a good indication that there is some commonality in the data. If we then colour-code our points based on some other variable (e.g. patient survival rate) we can immediately see if any of the clusters have a dominant colour. If they do, then we’ve found a pattern in our data that is worth investigation.
But what if I’ve got a thousand columns? How do I visualise that? This is where the dimensional collapse comes in. Traditional approaches such as Principal Component Analysis (PCA), and more recent machine learning algorithms such as T-SNE, allow us to collapse thousands of dimensions down to 2 or 3 whilst still (hopefully) maintaining proximity between the data points – i.e. if two points were close in n-dimensions, they’ll also be close to each other in 2 or 3 dimensions. What this means is that if we see clusters in 2D or 3D it is direct result of there being some similarity in the data, and we may have something worth investigating.
Such approaches have been around for years, and work well for spotting clusters. They’re not without their drawbacks though. PCA can have the effect of introducing a bias to the data if not used carefully. T-SNE is a more neutral approach, but if the data is noisy, and the signal is weak, it becomes very hard to find the signal in a cloud of points. What is required is a way to increase the contrast and reveal the patterns in a more obvious way to the user.
This is what topological data analysis (TDA) is all about. TDA attempts to build a topological model of the data by grouping and linking data points that are similar in n dimensions. This can then be visualized as a network chart to show the “shape of the data”. This has the effect of making the clusters more obvious to the human eye, and is also able to pick out smaller clusters that would be just lost in the noise in a traditional dimensional collapse approach. It also reveals a shape to the data which tends to draw the eye to particularly interesting features.
The market leader in this sort of analytic is Ayasdi, and they make some bold, but supportable, claims about what the shape of the data can reveal – e.g. circular plots are indicative of time series data.
The company I work for (illumr) takes a slightly different approach, but still with an emphasis on using shapes to draw the analyst’s eye to interesting aspects of the data. We don’t emphasise TDA, though our approach results in a topological model of the data. Debate rages (as only mathematicians can rage) on the interwebs about how TDA differs from dimensional collapse. This is somewhat missing the point, as the end result is still a dimensional collapse but with topology of the n-dimensional shape retained and presented in 2D (Ayasdi) or 3D (illumr).
There are different ways to achieve this, but Ayasdi tend to focus on a persistent homology approach – extracting the overall topology and displaying it as nodes and links. In the case of Ayasdi, each node corresponds to either a single data point or a cluster of points depending on how they connect (the topology). The illumr approach results in a node for every data point, as the emphasis here is on finding weak signals in data.
For instance, illumr worked with HouseMark – the leading provider of social housing data and insight in the UK. All 3151 groups of households aggregated by postcode and age of lead tenant are shown visualised (each node=household). The software self-organises to reveal the inherent structure of the dataset in 3D. Nodes are coloured by number of repairs (where red=low no. of repairs/green = high no. of repairs).
The dataset has some very definable clusters associated with number of repairs. By examining these clusters, we can reveal non-intuitive insights missed by other methodologies.
For example, there are important exceptions to the rule that the maintenance costs increase with age of property. In some cases, the effects of age of property, its size and also its type (house/flat) have negligible influence on number of repairs carried out.
Such findings enable the housing providers to better understand and predict the patterns of behaviour associated with responsive repairs thus helping to improve their operational efficiencies.
For more information on illumr, please visit illumr.comPublished in