Documentation -- CostMatrixCreator for Missing Values

Prerequisites

All you need to get started

The CostMatrixCreator (CMC) is written in Java and can therefore be run under Windows, Linux or macOS. As CMC calls the NCBI BLAST tool, the system also requires a working binary installation of BLAST+. Please find below the links to download the required versions.

How do Missing Values Work?

Everything you need to know about our approach to save your time

[Figure: The CMC cost matrix creation]

Equipped with sophisticated biochemical measurement techniques, we generate massive amounts of biomedical data that need to be analyzed computationally. To process these data, most clustering tools require a similarity matrix containing the pairwise comparisons of all proteins. Given the tremendous amount of data available, calculating and storing the similarity matrix is increasingly becoming the bottleneck of large-scale studies. At the same time, most of the calculated similarities are not needed for the clustering, as they carry only little information. CMC exploits this fact.

CMC enables the user to create cost matrices with missing values suitable for clustering with TransClust, either via the command line or via an easy-to-use graphical interface. The tool is specifically designed for clustering biological data with the BLAST E-value as the similarity measure. Either the user manually selects, or CMC randomly selects, the proteins that are excluded from the all-vs-all BLAST comparison. As all included proteins are still BLASTed against all others, enough information remains to ensure high-quality clustering results. The figure above depicts how CMC processes the data in order to create cost matrices.
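The selection idea described above can be illustrated with a small Python sketch. It is not CMC's implementation: the function names and the fixed random seed are hypothetical, and the rule that a pair is missing only when both proteins are excluded is our reading of "all included proteins are BLASTed against all others".

```python
import random
from itertools import combinations


def split_proteins(protein_ids, percent_missing, seed=0):
    """Randomly split the IDs into (included, excluded).

    percent_missing is a hypothetical parameter: the share of proteins
    left out of the all-vs-all comparison.
    """
    rng = random.Random(seed)
    ids = list(protein_ids)
    n_excluded = round(len(ids) * percent_missing / 100)
    excluded = set(rng.sample(ids, n_excluded))
    included = [p for p in ids if p not in excluded]
    return included, sorted(excluded)


def pair_status(protein_ids, excluded):
    """Label every unordered pair as 'computed' or 'missing'.

    Assumption: a similarity is missing only if BOTH proteins are
    excluded, because included proteins are compared against all others.
    """
    excluded = set(excluded)
    status = {}
    for a, b in combinations(protein_ids, 2):
        both_excluded = a in excluded and b in excluded
        status[(a, b)] = "missing" if both_excluded else "computed"
    return status
```

Under these assumptions, excluding 3 of 10 proteins leaves only the 3 pairs among the excluded proteins without a value, while the other 42 pairs are still computed.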

Tutorial: Graphical User Interface

Follow this step-by-step description for a quick start with CMC

For this tutorial, you need a working installation of Java and NCBI BLAST+. We also recommend using our test dataset from Bown et al. To start the graphical interface, run cmc.jar with the single parameter -gui:

java -jar cmc.jar -gui

The graphical user interface now starts up and presents the following window:

[Figure: The CMC cost matrix creation]

  1. In area (A), the user specifies the locations of the FASTA file, the BLAST binaries, a temporary working directory and the directory where the resulting cost matrices should be stored.
  2. The runtime parameters are specified in area (B). Here, the user specifies which range of thresholds should be covered and how many threads the BLAST run should use.
  3. Area (C) is the most important configuration box. In this box, all parameters for missing values are configured. First of all, the check-box defines whether the user wants to use missing values at all. If so, there are basically two further options: CMC randomly selects the proteins to be excluded, or the user selects them manually.
  4. In the last area (D), the different BLAST parameters are specified. The most important one is the E-value. In the text field below it, the user can also specify additional BLAST parameters. The actual command with which BLAST will be called for this run is displayed in the text field below that.

To execute the specified run, click the start button at the bottom. CMC then performs the BLAST run and creates the cost matrices for you.
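To give an idea of how the settings from areas (A), (B) and (D) could end up in a BLAST call, here is a minimal Python sketch. It is not the command CMC actually generates (the GUI displays that one). The function name, the database name and the output file are hypothetical. The flags -query, -db, -evalue, -num_threads, -outfmt and -out are standard blastp options of BLAST+.

```python
import os
import shlex


def build_blastp_command(blast_dir, fasta, db, out_file, evalue, threads, extra=""):
    """Assemble a blastp argument list (hypothetical helper, not CMC code)."""
    cmd = [
        os.path.join(blast_dir, "blastp"),  # binary inside the BLAST directory
        "-query", fasta,                    # FASTA file with the proteins
        "-db", db,                          # assumed: a database built beforehand
        "-evalue", str(evalue),             # E-value cutoff
        "-num_threads", str(threads),       # number of threads for the run
        "-outfmt", "6",                     # tabular output
        "-out", out_file,
    ]
    # Additional BLAST parameters, as entered in a free-text field
    cmd += shlex.split(extra)
    return cmd


# Example with made-up paths; the command is only printed, not executed
print(" ".join(build_blastp_command(
    "/home/research/blast/bin", "proteins.fasta", "proteins_db",
    "hits.tsv", 0.01, 4, "-max_target_seqs 500")))
```

Building the command as a list rather than a single string avoids quoting problems if it is later passed to subprocess.run.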

Tutorial: Command Line Execution

Follow this step-by-step description for a quick start with CMC

For this tutorial, you need a working installation of Java and NCBI BLAST+. We also recommend using our test dataset from Bown et al. Then run the following on your command line:

java -jar [java virtual machine options] cmc.jar [-key value]

COMPULSORY OPTIONS
  One of the following must be entered.

  -f {FASTAfile.fasta}
    Location of the FASTA file.

  -b {BLASTdir}
    Directory of the BLAST binaries.

  -o {outdir}
    Output directory of the Cost Matrices

  -tmp {tmpdir}
    Temporary directory, where e.g. the BLAST file will be saved.

  -tmin {threshold}
    Starting Threshold.

  -tmax {threshold}
    Ending Threshold.

  -sz {stepsize}
    Size of the steps between the thresholds.

  -evalue {evalue}
    The evalue to use for BLASTing.

  OR

  -gui
    Start the program with the graphical user interface.

  OR

  -help --help
    Show this help manual.

OTHER OPTIONS
  These are optional. Use one of the following:

  -p {percent}
    Create costmatrices with specified percentage of missing input values.

  OR

  -in {existingObjectsFile.txt}
    Specify a file containing the objects IDs that shall be included in the process.

  OR

  -out {missingObjectsFile.txt}
    Specify a file containing the objects IDs that shall be excluded from the process.

The listing above shows all available configuration options for CMC. If you want to use missing values, you need to provide either a list of proteins that should not be considered (option -out), a list of proteins that should be considered (option -in), or the percentage of missing values (option -p). If you provide a list of proteins, the file must have the following format:

protein_id_1
protein_id_2
protein_id_3
...
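A file in this format could, for example, be derived from the FASTA file itself. The following Python sketch assumes that the protein ID is the first whitespace-delimited token of each FASTA header line. That is an assumption, as CMC may derive IDs differently, and both function names are hypothetical.

```python
def fasta_ids(fasta_text):
    """Return the protein IDs found in FASTA-formatted text.

    Assumption: the ID is the first token after '>' on a header line.
    """
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.append(header.split()[0])
    return ids


def id_file_content(ids):
    """Format the IDs as file content, one ID per line."""
    return "".join(f"{protein_id}\n" for protein_id in ids)


example = ">protein_id_1 example description\nMKVL\n>protein_id_2\nMAAT\n"
print(id_file_content(fasta_ids(example)), end="")
```

The resulting text can be written to a file and, after removing the IDs you do not want, passed to CMC with -in or -out.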

Each line corresponds to one protein ID of your FASTA file. The following command calculates the cost matrices for thresholds 10 to 50 with a step size of 5, using only 30% of all proteins for an all-vs-all BLAST with E-value 0.01:

java -jar cmc.jar \
  -f /home/research/proteins.fasta \
  -o /home/research/costmatrizes \
  -tmp /tmp/cmc \
  -b /home/research/blast/bin/ \
  -evalue 0.01 \
  -tmin 10 -tmax 51 -sz 5 \
  -p 30
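
To see which thresholds such a run would cover, the following Python sketch enumerates them. It assumes that CMC starts at -tmin and advances by -sz as long as the value does not exceed -tmax. This is our assumption about the tool's behavior, and the function name is hypothetical.

```python
def thresholds(tmin, tmax, step):
    """List the thresholds from tmin up to tmax (inclusive) in steps of step."""
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    t = tmin
    while t <= tmax:
        values.append(t)
        t += step
    return values


# Values from the example command: -tmin 10 -tmax 51 -sz 5
print(thresholds(10, 51, 5))
```

With the values from the command above, this yields the thresholds 10, 15, ..., 50, which matches the range given in the description.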