Description

What is TransClust?

A feature-rich clustering tool for biomedical data sets

TransClust is a high-throughput clustering software that is based on Weighted Transitive Graph Projection.

It's main advantage over other approaches is that it's underlying model directly reflects hidden transitive substructures typical e.g. for biomedical data sets. In comparison to other clustering methods the desitiy parameter (the threshold) can be choosen as intuitively as for k-means.

What is Clustering?

Defining a long-standing computational problem

Clustering is a widely established technique for exploratory data analysis with applications in statistics, computer science, biology, social sciences, or psychology. It is applied to empirical data in basically any scientific field to gain an initial impression of structural similarities. For this purpose, it is of great advantage to have an efficient and easy-to-use tool that can be applied ubiquitously to a large scope of data types.

What is the Weighted Transitive Graph Projection problem?

Solving a NP-hard problem for clustering

The graph-theoretical problem underlying TransClust is called "Weighted Transitive Graph Projection" or "Weighted Cluster Editing". The goal is to transform a given intransitive graph into a transitive one by adding and removing edges. The idea is to find a solution with minimal number of edge modifications for the unweighted case. In our case, a weighted cost function is used. It reflects the distance between a user-given threshold and a pairwise similarity function, i.e. removing an edge labeld with a very high similarity is expensive, as is adding an edge with a very high dissimilarity. This problem can be used as model for clustering (see the picture below). The threshold, given by the user, strongly reflects the cluster density: High (restrictive) values make it more expensive to add most of the edges (resulting in many small clusters) while lower values make it cheap to add edges but expensive to remove them (resulting in few big clusters). The toy example picture below illustrates that the preferred solution strongly depends on the user-defined density parameter and the similarity function. The advantage of our approach is the ability to reflect hidden transitive substructures directly within the clustering model. Briefly, we cluster such that the average similarity within the clusters is above the threshold while the avarge similarity of objects from diufferent clusters is below the user-given threshold. To tackle this NP-hard problem, we combine exact (fixed-parameter) and heuristic (graph-layout-based) algorithms. Weighted Transitive Graph Projection

What user-input is required?

TransClust as an easy-to-use clustering tool

TransClust has only two simple requirements: The first is a measure of similarity (or distance) for each unordered pair in the dataset, saved in a similarity matrix. The second is a densitiy parameter (similarity threshold) that gives the boundary between similar and dissimilar objects. Note that these two requirements hold for any clustering method. However, most other tools ask the user to define further essential parameters. Furthermore, the densitivity normally cannot be set intuitively.

TransCluster needs the following input:

  1. a pairwise similarity function and a threshold to compute a similarity matrix,
  2. a similarity matrix and a threshold to compute a cost matrix, or
  3. a cost matrix.
  4. (optionally) a gold standard file to determine the optimal threshold (density parameter) for the given problem automatically.

Note that for the first case an implementation for processing all-vs-all BLAST results of protein sequences exists and is accessible through this web interface.

Example Files

Get to know TransClust really fast

Download the following example files to 'play' with the application of TransClust to protein sequence clustering.

Proteins

  1. Example cost matrix file - It was generated from protein sequences of the amidohydrolases superfamily from the SFLD gold standard; see the original paper (Brown et al.) for more details. Similarities were calculated using -LOG10 of birectional best BLAST hits (BeH), the cost matrix created with a threshold of 100.
  2. Example similarity file - It was generated from the above mentioned amidohydrolases protein sequences from the SFLD gold standard; similarities calculated using -LOG10 of birectional best BLAST hits (BeH).
  3. Example FASTA file and all-vs-all BLAST file - The FASTA file contains the above mentioned protein sequences; the all-vs-all BLAST file was generated with an E-Value threshold of 100.
  4. Gold standard file - The gold standard file for the above introduced amidohydrolases proteins assigns each protein to a protein family. See the original paper (Brown et al.) for more details.

Social network

  1. Example cost matrix file - This traditional example for clustering applications was generated from social interactions of 34 karate club members. The club members splitted into two parts (two clubs) and we use the number of common social activities between each pair of club members as similarity function. See the original paper for more background information. This cost matrix was created by using the below provided similarity file (threshold: 0.1).
  2. Example similarity file - The file was generated from the above mentioned social network of the Zachary karate club using the social interactions of the club members as similarity values.
  3. Gold standard file - This gold standard file shows that the social network (the karate club) was splitted into two subgroups. Using the files provided above, will cluster the club into exactly the same two clusters.

For tutorials on how to use TransClust, please refer to our tutorials page.

Other popular protein sequence clustering tools

Compare TransClust to other tools

Citation

More papers can be found when you browse to "Cite TransClust"

Wittkop T, Emig D, Lange SJ, Rahmann S, Albrecht M, Morris JH, Böcker S, Stoye J, Baumbach J (2010) Partitioning biological data with Transitivity Clustering. Nature Methods 2010 Jun;7(6):419-20. PubMed