We humans are much more efficient to analyze a complex phenomena if it is well structured and organized. Say we are given 1M points and asked to do some analysis. We can inspect each point separately (very daunting task) but we can also inspect 10 point groups or clusters. Finding such clusters is important nowadays since we are literally drowning in data! In machine learning community this sort of problem is called clustering and is often addressed in unsupervised [COURSE] manner i.e. no point labels are given. Only the data.
One smart fella may tell you that using the standard k-means algorithm [WIKI] he will get the problem solved. What he don’t realize is that he needs also to specify the number of clusters which is not known a priori. This is a very important parameter if we want to inspect our data based on natural clusters.
Another concern is data dimensionality which is much less evident for our smart fella. If his data has more hundreds, thousands or even more dimensions, without knowing he may be cursed – curse of dimensionality [WIKI]! In short, distances in such high-dimensional spaces loose their meaning and it is difficult to tell which point is closer than another!
One of solutions is to use unsupervised Bayesian methods modeling the data as finite mixtures and working on each dimension independently. Of course, this independence assumption does not hold in general for all possible data but this is a very viable hypothesis in data exploration.
- real-valued and discrete data;
- natural data clusters;
- processing is roughly linear in the number of data points;
- returns probabilistic membership for each data point;
- allows feature correlation within each cluster data points;
- works with missing data.
The algorithm finds the classes that are most probable with respect to the data and the model. In the case of real-valuead data the model uses mixture of Gaussians, while with discrete data – mixture of Bernoulli distributions. It has been successfully used as data exploration tool to discover new classes of infra-red stars during NASA IRAS project [WWW], as well as to discover new classes of proteins in DNA sequencing applications. In my research I used it as a data exploration tool with some attempts to learn better visual features from document images [DOCEXPLORE].
The software is distributed as a source tarball and can be compiled pretty easily on any major operating system. The problem is that the software was written back in the wild 90’s with only single processor computation capabilities. As it will be clear from the very first run, the computations are pretty heavy even on modern processors and becomes even more difficult on large datasets. So, the goal of this blog post is to share parallelized version of the AutoClass algorithm. The very first search with Google “parallel autoclass” will show you an academic publication [IEEE] explaining how the algorithm was modified to run on distributed computation system. It is embarrassing that they did not share the code of their P-Autoclass! To my best knowledge, nobody else attempted to give second chance to this forgotten but useful data exploration tool!
Nevertheless, thanks to their paper it was straightforward to identify the code blocks with most intensive computation and luckily – the easiest to parallelize. Relatively easy. I used the industry standard OpenMP framework [WWW] to parallelize the for loops under the question. The catch was to watchout for shared and private variables in the loop, as well as to deal with few reductions…
Ok, ok! No more technical implementation details, [HERE] is the source tarball. Just run “make” from the console and within few moments you will get the “autoclass” binary. I used the Intel C compiler so you may want to comment out the CC variable and enable the standard GNU C compiler.
The first barrier would be to prepare the data files. The format of data files is explained well in the accompanying documentation to the original AutoClass distribution [TAR]. If you use MATLAB, [THIS] toolbox simplifies the task greatly and besides allows you to call the AutoClass binary and get results back to the environment.
Final word of warning: Do not expect the software to be a magic black box spitting out the best data clusters the way you want or is expected in your application! You need at least basic understanding of the algorithm and be an expert (you can become one) on the data you work on! Make sure you understand how your data is generated, what are the properties, try different non-destructive data transformations and then you can hope the tool to be useful for you.
Credits go of course to NASA Ames Research Center [WWW] and to the brave fellas who developed it there. Thanks also to Michael Lewicki for writing the Matlab toolbox interfacing with the binary.