TransClust is a high-throughput clustering software that is based on Weighted Transitive Graph Projection.
It's main advantage over other approaches is that it's underlying model directly reflects hidden transitive substructures typical e.g. for biomedical data sets. In comparison to other clustering methods the desitiy parameter (the threshold) can be choosen as intuitively as for k-means.
Clustering is a widely established technique for exploratory data analysis with applications in statistics, computer science, biology, social sciences, or psychology. It is applied to empirical data in basically any scientific field to gain an initial impression of structural similarities. For this purpose, it is of great advantage to have an efficient and easy-to-use tool that can be applied ubiquitously to a large scope of data types.
The graph-theoretical problem underlying TransClust is called "Weighted Transitive Graph Projection" or "Weighted Cluster Editing". The goal is to transform a given intransitive graph into a transitive one by adding and removing edges. The idea is to find a solution with minimal number of edge modifications for the unweighted case. In our case, a weighted cost function is used. It reflects the distance between a user-given threshold and a pairwise similarity function, i.e. removing an edge labeld with a very high similarity is expensive, as is adding an edge with a very high dissimilarity. This problem can be used as model for clustering (see the picture below). The threshold, given by the user, strongly reflects the cluster density: High (restrictive) values make it more expensive to add most of the edges (resulting in many small clusters) while lower values make it cheap to add edges but expensive to remove them (resulting in few big clusters). The toy example picture below illustrates that the preferred solution strongly depends on the user-defined density parameter and the similarity function. The advantage of our approach is the ability to reflect hidden transitive substructures directly within the clustering model. Briefly, we cluster such that the average similarity within the clusters is above the threshold while the avarge similarity of objects from diufferent clusters is below the user-given threshold. To tackle this NP-hard problem, we combine exact (fixed-parameter) and heuristic (graph-layout-based) algorithms.
TransClust has only two simple requirements: The first is a measure of similarity (or distance) for each unordered pair in the dataset, saved in a similarity matrix. The second is a densitiy parameter (similarity threshold) that gives the boundary between similar and dissimilar objects. Note that these two requirements hold for any clustering method. However, most other tools ask the user to define further essential parameters. Furthermore, the densitivity normally cannot be set intuitively.
TransCluster needs the following input:
Note that for the first case an implementation for processing all-vs-all BLAST results of protein sequences exists and is accessible through this web interface.
Download the following example files to 'play' with the application of TransClust to protein sequence clustering.
For tutorials on how to use TransClust, please refer to our tutorials page.
Wittkop T, Emig D, Lange SJ, Rahmann S, Albrecht M, Morris JH, Böcker S, Stoye J, Baumbach J (2010) Partitioning biological data with Transitivity Clustering. Nature Methods 2010 Jun;7(6):419-20. PubMed