CLAIM-MS README

LICENSE

The program is distributed under the GNU GPL.

INFORMATION

This program is designed mainly to work under Linux and has been tested under
different distributions. It should nevertheless also work under Windows with
the Cygwin environment (http://www.cygwin.com/).

Proper installation requires compiling the program. Although these instructions
describe the process, elementary knowledge of a Linux-like environment might be
required. Note also that the program DOES NOT have a GUI (Graphical User
Interface), which may make it difficult for computer beginners.

The following sections give instructions for configuring, compiling, installing
and running the program, together with simple examples.

INSTALLATION

The Boost library and the R binary are required for compiling the program. R
is, however, optional for the proper functioning of the program; it is used in
the example. Both can be installed from the distribution's package repository
under Linux, or downloaded as sources from the developers' websites,
www.boost.org/ and www.r-project.org/ respectively, where installation
instructions can also be found. Both the library and the binary should be
located in the default path, so that the configuration script and the claim-ms
binary can find them.

To configure and compile the program, simply type:

./configure && make

For more information on the configuration options, run:

./configure --help

RUNNING

Before running the application, check its help. Simply type:

./claim-ms -h

All arguments may be passed either as ordinary command-line arguments or on the
standard input. To check that the program works, simply type:

./claim-ms < data/input.args

QUICK HELP

The main purpose of this work is to provide an easy tool that lets biologists
manipulate different types of biological data and helps identify functional
modules, i.e. sets of genes performing similar tasks in living organisms.
It is easily extensible and configurable: users may add their own packages to
process the data.

At present, claim-ms implements the following packages:

- microarray: reads/writes microarray data from/to a file;
- corr: computes a correlation matrix from a set of vectors (such as
  microarray data);
- shortest_path: computes the shortest-path matrix between the nodes of a graph;
- ppi: reads PPI data from a file;
- graph: represents a graph as a data structure; reads a graph from and writes
  it to a file;
- limit: defines the intersection of node sets;
- claim-ms: performs the claim-ms analysis;
- kmeans: finds clusters using the kmeans algorithm;
- results: summarizes the results of an analysis into a list of cliques.
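
To give an idea of what the corr, shortest_path and limit packages compute,
here is a minimal Python sketch. It is an illustration only, not the claim-ms
implementation: this README does not say which correlation measure or
shortest-path algorithm the program uses, so the Pearson coefficient and the
Floyd-Warshall algorithm are assumed here, and all function names are
hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(vectors):
    """corr-like step: one row and one column per input vector."""
    return [[pearson(u, v) for v in vectors] for u in vectors]


def shortest_path_matrix(weights):
    """shortest_path-like step (Floyd-Warshall, non-negative weights).

    weights[i][j] is the edge weight, or math.inf when there is no edge.
    """
    n = len(weights)
    dist = [row[:] for row in weights]
    for i in range(n):
        dist[i][i] = 0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])
    return dist


def common_nodes(*node_sets):
    """limit-like step: nodes present in every input set."""
    return set.intersection(*(set(s) for s in node_sets))


# Three made-up expression profiles (assumed data, for illustration only).
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
corr = correlation_matrix(profiles)

# A made-up undirected graph with three nodes.
graph = [[0, 1, 5], [1, 0, 2], [5, 2, 0]]
dist = shortest_path_matrix(graph)

shared = common_nodes({"g1", "g2", "g3"}, {"g2", "g3", "g4"}, {"g3", "g2"})
```

In this toy data the first two profiles are perfectly correlated and the third
is perfectly anti-correlated with the first; the direct edge of weight 5
between nodes 0 and 2 is replaced by the path through node 1 (total weight 3);
and only g2 and g3 are common to all three node sets.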

The analysis is defined by a data flow, i.e. graph-like dependencies that pass
the output of one or more packages as the input of another. The user may define
the input of a package in three different ways:

<filename>, {-p <package> args}, <package number>.

In general, a program launch looks like this:

./claim-ms <program options>
-p <package name> <package options>
-p <package name> -i {-p <package name> <options>: <options2>}
-p <name> -i 1

where ':' indicates that the program should use the same package as the
preceding one, but with different arguments: <options2>.

In order to obtain more information on the available packages, one should

run:

./claim-ms --help
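
Package numbers, as used in "-i 1" above, appear to follow the order in which
"-p" occurs in the configuration, nested packages included. This rule is
inferred from the comments in the example configuration shipped with the
program, not from the program's documentation, so treat it as an assumption.
Under that assumption, the following Python sketch (a hypothetical helper, not
part of claim-ms) lists the number each package would receive:

```python
def number_packages(args_text):
    """Number packages by the order in which '-p <name>' appears.

    Assumption: numbering starts at 1 and counts nested packages too.
    Lines whose first non-blank character is '#' are skipped as comments.
    """
    numbered = []
    for line in args_text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        tokens = line.split()
        for flag, name in zip(tokens, tokens[1:]):
            if flag == "-p":
                numbered.append((len(numbered) + 1, name))
    return numbered


# Abridged configuration in the style of the example in this README.
example = """
# Correlation of the microarray data.
-p corr -i {
-p microarray -i data/AffyNaCl_Time-course_for_cliques.csv.bz2
} -r
-p graph -f list -i data/Gene-Ontology-Universal-distance.bz2
-p shortest_path -i {
-p ppi -d ':' -i data/AI_interactions.csv.bz2
}
"""

packages = number_packages(example)
for number, name in packages:
    print(number, name)
```

With this input, corr is package 1, the nested microarray package is 2, graph
is 3, shortest_path is 4 and the nested ppi package is 5, which matches the
numbering used in the example's comments.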

EXAMPLE

An example has been prepared and can be used as a reference. Besides designing
new data flows, a user can apply the claim-ms analysis described in the related
paper simply by changing the names of the input and output files, provided that
the data format requirements (see the end of this file) are respected.

To run the example, execute:

./claim-ms < data/input-ms.args

Here "< data/input-ms.args" means that the file "data/input-ms.args" contains
the actual configuration, whose content is passed to the executable on its
standard input.

The referenced example is based on the most recent work on the CLAIM software
and has the following form (lines beginning with "#" are comments):

# General parameters. Set verbose level to debug and output directory to
# MA-PPI-GO.
-v debug -O MA-PPI-GO

# Define the first package; calculate the correlation (package corr) out of the
# input package (-i {...}).
-p corr -i {
# Read the microarray data from a file. The delimiter in the file is a tab
# (-d "\t") and the data are read from
# data/AffyNaCl_Time-course_for_cliques.csv.bz2.
# Look into the file to see its format.
-p microarray -d "\t" -i data/AffyNaCl_Time-course_for_cliques.csv.bz2
} -r

# Define the third package; read the file with the Gene Ontology distance based
# on the GO-Universal similarity measure.
-p graph -f list -i data/Gene-Ontology-Universal-distance.bz2

# Define the fourth package; calculate the shortest paths between the nodes of
# the input graph (-i {...}) and store them as weights in the graph.
-p shortest_path -i {
# Define the fifth package. Read the graph from the file
# data/AI_interactions.csv.bz2, store it in the boost adjacency_list structure
# (good for sparse graphs) and store the weights in the short data type
# (2 bytes per edge). See data/AI_interactions.csv.bz2 for the file format.
-p ppi -d ':' -g adjacency_list -t short -i data/AI_interactions.csv.bz2
}

# Define the sixth and seventh packages, which take package 4 as input and
# limit its vertices to the common subset of packages 1, 3 and 4; store the
# resulting matrix in the ppi.bz2 file.
-p limit -i -t 3 {
-p limit -i 4 -t 1 -s
} -f csv -d "\t" -o ppi.bz2

# Define the eighth package, which takes package 1 as input and limits its
# vertices to the common subset of packages 6 and 1; store the resulting matrix
# in ma.bz2.
-p limit -t 6 -i 1 -f csv -d "\t" -o ma.bz2

# Define the ninth package, which takes package 3 as input and limits its
# vertices to the common subset of packages 6 and 3; store the resulting matrix
# in go.bz2.
-p limit -t 6 -i 3 -f csv -d "\t" -o go.bz2

# Define the tenth package, claim-ms, which computes the output clusters from
# the Microarray and PPI sets clustered with the use of the kmeans algorithm.
-p claim2 -i {
# Perform neural gas clustering on the graph output by the sixth package for
# different numbers of clusters, using at most 5000 iterations.
-p density -i { -p neuralgas -i 6 -N 5 -I 5000 }
-p density -i { -p neuralgas -i 6 -N 10 -I 5000 }
-p density -i { -p neuralgas -i 6 -N 15 -I 5000 }
-p density -i { -p neuralgas -i 6 -N 20 -I 5000 }
-p density -i { -p neuralgas -i 6 -N 25 -I 5000 }
-p density -i { -p neuralgas -i 6 -N 30 -I 5000 }
} -i {
-p density -i { -p neuralgas -i 8 -N 5 -I 5000 }
-p density -i { -p neuralgas -i 8 -N 10 -I 5000 }
-p density -i { -p neuralgas -i 8 -N 15 -I 5000 }
-p density -i { -p neuralgas -i 8 -N 20 -I 5000 }
-p density -i { -p neuralgas -i 8 -N 25 -I 5000 }
-p density -i { -p neuralgas -i 8 -N 30 -I 5000 }
} -i {
-p density -i { -p neuralgas -i 9 -N 5 -I 5000 }
-p density -i { -p neuralgas -i 9 -N 10 -I 5000 }
-p density -i { -p neuralgas -i 9 -N 15 -I 5000 }
-p density -i { -p neuralgas -i 9 -N 20 -I 5000 }
-p density -i { -p neuralgas -i 9 -N 25 -I 5000 }
-p density -i { -p neuralgas -i 9 -N 30 -I 5000 }
} -o claim_out_MA_PPI_GO_dist_as_score.bz2
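
The three "-i { ... }" groups of the last package differ only in the input
package number (6, 8, 9) and in the value passed to -N (5 to 30 in steps of 5),
so such a block can be generated rather than typed by hand. The Python sketch
below is a hypothetical helper, not part of claim-ms; it only reproduces the
layout of the listing above, and its function names are illustrative.

```python
def neuralgas_sweep(package_number, cluster_counts, max_iterations=5000):
    """One 'density over neuralgas' line per requested -N value."""
    return [
        f"-p density -i {{ -p neuralgas -i {package_number} "
        f"-N {n} -I {max_iterations} }}"
        for n in cluster_counts
    ]


def claim_block(package_numbers, cluster_counts, output_file):
    """Assemble the claim2 package with one -i { ... } group per input."""
    groups = [
        "\n".join(neuralgas_sweep(number, cluster_counts))
        for number in package_numbers
    ]
    body = "\n} -i {\n".join(groups)
    return f"-p claim2 -i {{\n{body}\n}} -o {output_file}"


block = claim_block(
    package_numbers=[6, 8, 9],
    cluster_counts=range(5, 31, 5),
    output_file="claim_out_MA_PPI_GO_dist_as_score.bz2",
)
print(block)
```

The printed text can be pasted into an argument file in place of the
hand-written block; changing the range or the list of package numbers yields a
different sweep.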

For clarity, note that different packages accept different data structures as
input and deliver different ones as output. Although the internal, low-level
representations differ, the user only needs to be aware of three structures: a
vector of vectors (representing the microarray data), a graph (represented
either by an adjacency matrix or by an adjacency list) and, finally, the
results (sets of genes). There is also an additional type, multiple, which
represents a set of structures. In the current set of packages, the structures
are taken as input and produced as output as follows:

* the graph package takes a graph or a set of graphs as input and returns a
  graph as output;
* the claim-ms package takes results or a set of results as input and returns
  a multiple of results as output;
* the kmeans package takes a graph as input and returns results as output;
* the microarray package takes a vector of vectors, or a multiple of them, as
  input and returns a vector of vectors as output;
* the ppi package takes a graph as input and returns a graph as output;
* the shortest_path package takes a graph as input and returns a graph as
  output;
* the corr package takes a vector of vectors as input and returns a graph as
  output;
* the limit package takes a vector of vectors or a graph as input and returns
  the same structure as output;
* the results package takes results as input and returns the same as output.

The user should keep these data structures in mind when defining the data flow:
the output of a package must be compatible with the input of the package it is
passed to.
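
The compatibility rule can be checked mechanically. The Python sketch below
encodes the input and output structures listed above and verifies a simple
linear chain of packages. It is a simplified illustration (real data flows are
graph-like, and the type names and helper functions are hypothetical), not the
check performed by claim-ms itself.

```python
VECTORS, GRAPH, RESULTS = "vectors", "graph", "results"


def multiple(kind):
    """Name of the 'multiple' type holding a set of structures of one kind."""
    return f"multiple[{kind}]"


def always(kind):
    """Output rule for packages whose output type is fixed."""
    return lambda _input_kind: kind


def same_as_input(input_kind):
    """Output rule for packages that return what they were given."""
    return input_kind


# package name -> (accepted input types, output rule), as listed above
PACKAGES = {
    "graph": ({GRAPH, multiple(GRAPH)}, always(GRAPH)),
    "claim-ms": ({RESULTS, multiple(RESULTS)}, always(multiple(RESULTS))),
    "kmeans": ({GRAPH}, always(RESULTS)),
    "microarray": ({VECTORS, multiple(VECTORS)}, always(VECTORS)),
    "ppi": ({GRAPH}, always(GRAPH)),
    "shortest_path": ({GRAPH}, always(GRAPH)),
    "corr": ({VECTORS}, always(GRAPH)),
    "limit": ({VECTORS, GRAPH}, same_as_input),
    "results": ({RESULTS}, same_as_input),
}


def output_kind(package, input_kind):
    """Return the output type of a package, or raise if the input is invalid."""
    accepted, rule = PACKAGES[package]
    if input_kind not in accepted:
        raise ValueError(f"{package} does not accept {input_kind}")
    return rule(input_kind)


def check_chain(start_kind, packages):
    """Feed each package's output to the next one and return the final type."""
    kind = start_kind
    for package in packages:
        kind = output_kind(package, kind)
    return kind


final = check_chain(
    VECTORS, ["microarray", "corr", "shortest_path", "kmeans", "results"]
)
print(final)
```

A chain such as check_chain(GRAPH, ["corr"]) raises ValueError, because corr
expects a vector of vectors rather than a graph.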

For more details check the help of the program.
