LICENSE
The program is distributed under the GNU GPL.
INFORMATION
This program is mainly designed to work under Linux and has been tested under
different distributions. It should nevertheless also work under Windows with
the use of the Cygwin environment (http://www.cygwin.com/).
Proper installation requires compiling the program. Although these
instructions describe the process, elementary knowledge of a Linux-like
environment might be required. It is also worth mentioning that the program
DOES NOT have a GUI (Graphical User Interface); this may make it difficult for
computer beginners.
Below you will find instructions for configuring, compiling, installing and
running the program, together with simple examples.
INSTALLATION
Compiling the program requires the Boost library and the R binary. R is
optional for the proper functioning of the program, but it is used in the
example. Both can be installed from the package repository of your Linux
distribution, or downloaded as sources from the developers' websites:
www.boost.org/ and www.r-project.org/ respectively. Installation instructions
are available on these websites. Both the library and the binary should be
located in the default path, so that the configuration script and the claim-ms
binary can find them.
To install, simply type:
./configure; make
In order to obtain more information one should execute:
./configure --help
RUNNING
To run the application, check the help first. Simply type:
./claim-ms -h
All the arguments may be passed either as ordinary command-line arguments or
through the standard input. To check that the program works, simply type:
./claim-ms < data/input.args
QUICK HELP
The main purpose of this work is to provide an easy tool for biologists to
manipulate different types of biological data and to help identify functional
modules, i.e. sets of genes performing similar tasks in living organisms.
It is easily extensible and configurable. Users may add their own packages to
process the data.
At present, claim-ms implements the following packages:
- microarray: reads/writes microarray data from/to a file;
- corr: computes a correlation matrix from a set of vectors (such as
microarray data);
- shortest_path: computes the shortest path matrix between nodes of a graph;
- ppi: reads ppi data from a file;
- graph: represents a graph as a data structure; reads and writes a graph
from/to a file;
- limit: defines the intersection of node sets;
- claim-ms: performs the claim-ms analysis;
- kmeans: finds clusters using the kmeans algorithm;
- results: summarizes the results of the analysis into a list of cliques.
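As a rough illustration of what "computes correlation matrix from a set of
vectors" means for the corr package, the sketch below does this in plain
Python. It assumes the Pearson correlation coefficient, which this file does
not state; the names pearson and correlation_matrix are hypothetical and are
not part of claim-ms.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if den == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return num / den


def correlation_matrix(vectors):
    """Symmetric matrix of pairwise correlations between the given vectors."""
    n = len(vectors)
    # The diagonal is 1 for any non-constant vector.
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(vectors[i], vectors[j])
            matrix[i][j] = matrix[j][i] = r
    return matrix


# Three made-up expression profiles (assumption: one vector per gene).
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
m = correlation_matrix(profiles)
```

In this toy data the first two profiles rise together and the third falls, so
the matrix contains both a strongly positive and a strongly negative entry.
The actual corr package may use a different measure or matrix layout.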
The analysis is defined by a data flow, i.e. a graph of dependencies in which
the output of one or more packages is passed as the input of another. The
user may define the input of a package in three different ways:
<filename>, {-p <package> args}, <package number>.
In general the program launch looks like this:
./claim-ms <program options>
-p <package name> <package options>
-p <package name> -i {-p <package name> <options>: <options2>}
-p <name> -i 1
where ':' indicates that the program should use the same package as the
preceding one, but with different arguments: <options2>.
In order to obtain more information on the available packages, one should
run:
./claim-ms --help
EXAMPLE
An example has been prepared and can be used as a reference. Besides designing
new data flows, a user can reproduce the application of claim-ms described in
the related paper by simply changing the names of the input and output files,
provided that the data format requirements (see the end of this file) are
observed.
In order to run the example one should execute:
./claim-ms < data/input-ms.args
"< data/input-ms.args" means that the file "data/input-ms.args" contains the
actual configuration, whose content is passed to the executable through the
standard input.
The referenced example is based on the most recent work on the CLAIM software
and has the following form (lines beginning with "#" are comments):
# General parameters. Set verbose level to debug and output directory to
# MA-PPI-GO.
-v debug -O MA-PPI-GO
# Define the first package; calculate the correlation (package corr) out of the
# input package (-i {...}).
-p corr -i {
# Read the microarray data from a file. The delimiter in the file is a tab
# (-d "t") and the data should be read from
# data/AffyNaCl_Time-course_for_cliques.csv. See the file itself for its format.
-p microarray -d "t" -i data/AffyNaCl_Time-course_for_cliques.csv.bz2
} -r
# Define the third package; read the file with Gene Ontology distance based on
# the GO-Universal similarity measure
-p graph -f list -i data/Gene-Ontology-Universal-distance.bz2
# Define the fourth package; calculate the shortest paths between the nodes of
# the input graph (-i {...}) and store them as weights in the graph.
-p shortest_path -i {
# Define the fifth package. Read the graph from the file
# data/AI_interactions.csv, store it in the boost adjacency_list structure
# (good for sparse graphs) and store the weights in the short data type
# (2 bytes per edge). See data/AI_interactions.csv for the file format.
-p ppi -d ':' -g adjacency_list -t short -i data/AI_interactions.csv.bz2
}
# Define the sixth and seventh packages, which take package 4 as the input and
# limit its vertices to the common subset of packages 1, 3 and 4; store the
# resulting matrix in the ppi.bz2 file.
-p limit -t 3 -i {
-p limit -i 4 -t 1 -s
} -f csv -d "t" -o ppi.bz2
# Define the eighth package, which takes package 1 as the input and limits its
# vertices to the common subset of packages 6 and 1; store the resulting matrix
# in ma.bz2.
-p limit -t 6 -i 1 -f csv -d "t" -o ma.bz2
# Define the ninth package, which takes package 3 as the input and limits its
# vertices to the common subset of packages 6 and 3; store the resulting matrix
# in go.bz2.
-p limit -t 6 -i 3 -f csv -d "t" -o go.bz2
# Define the tenth package (claim2), which computes the output clusters from the
# PPI, Microarray and GO sets, each clustered with the neural gas algorithm.
-p claim2 -i {
# Perform neural gas clustering on the graph that is the output of the sixth
# package, for different numbers of clusters, using at most 5000 iterations.
-p density -i { -p neuralgas -i 6 -N 5 -I 5000 }
-p density -i { -p neuralgas -i 6 -N 10 -I 5000 }
-p density -i { -p neuralgas -i 6 -N 15 -I 5000 }
-p density -i { -p neuralgas -i 6 -N 20 -I 5000 }
-p density -i { -p neuralgas -i 6 -N 25 -I 5000 }
-p density -i { -p neuralgas -i 6 -N 30 -I 5000 }
} -i {
-p density -i { -p neuralgas -i 8 -N 5 -I 5000 }
-p density -i { -p neuralgas -i 8 -N 10 -I 5000 }
-p density -i { -p neuralgas -i 8 -N 15 -I 5000 }
-p density -i { -p neuralgas -i 8 -N 20 -I 5000 }
-p density -i { -p neuralgas -i 8 -N 25 -I 5000 }
-p density -i { -p neuralgas -i 8 -N 30 -I 5000 }
} -i {
-p density -i { -p neuralgas -i 9 -N 5 -I 5000 }
-p density -i { -p neuralgas -i 9 -N 10 -I 5000 }
-p density -i { -p neuralgas -i 9 -N 15 -I 5000 }
-p density -i { -p neuralgas -i 9 -N 20 -I 5000 }
-p density -i { -p neuralgas -i 9 -N 25 -I 5000 }
-p density -i { -p neuralgas -i 9 -N 30 -I 5000 }
} -o claim_out_MA_PPI_GO_dist_as_score.bz2
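The comments in the listing above refer to packages by number (first, third,
fourth, fifth, ...). These numbers are consistent with counting every -p option
in order of appearance, nested packages included. This is an inference from the
comments, not documented behaviour, so check it against the program help. A
minimal Python sketch under that assumption (number_packages is a hypothetical
helper, not part of claim-ms):

```python
import shlex


def number_packages(args_text):
    """Return [(number, package_name), ...], counting -p in order of appearance.

    Assumption: nested packages inside {...} are counted like top-level ones.
    """
    tokens = []
    for line in args_text.splitlines():
        stripped = line.strip()
        # Lines beginning with "#" are comments, as stated in this file.
        if not stripped or stripped.startswith("#"):
            continue
        tokens.extend(shlex.split(stripped))
    packages = []
    for index, token in enumerate(tokens):
        if token == "-p" and index + 1 < len(tokens):
            packages.append((len(packages) + 1, tokens[index + 1]))
    return packages


example = """
# Shortened version of the data flow shown above.
-p corr -i {
  -p microarray -d "t" -i data/AffyNaCl_Time-course_for_cliques.csv.bz2
} -r
-p graph -f list -i data/Gene-Ontology-Universal-distance.bz2
-p shortest_path -i {
  -p ppi -d ':' -g adjacency_list -t short -i data/AI_interactions.csv.bz2
}
"""

for number, name in number_packages(example):
    print(number, name)
```

Under this assumption, graph is package 3, shortest_path is package 4 and ppi
is package 5, which agrees with the comments in the listing.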
For the sake of clarity, it should be mentioned that different packages accept
different data structures as input and deliver different structures as output.
Although the internal, low-level representations differ, the user needs to be
aware of only 3 structures: a vector of vectors (representing the microarray
data), a graph (represented either by an adjacency matrix or by an adjacency
list) and, finally, the results (sets of genes). There is also an additional
type representing a set of structures: multiple. In the current set of
packages, the structures are taken as input and output as follows:
* the graph package takes a graph or a set of graphs as input and returns a
graph as output;
* the claim-ms package takes results or a set of results as input and returns
a multiple of results as output;
* the kmeans package takes a graph as input and returns results as output;
* the microarray package takes a vector of vectors, or a multiple of them, as
input and returns a vector of vectors as output;
* the ppi package takes a graph as input and returns a graph as output;
* the shortest_path package takes a graph as input and returns a graph as
output;
* the corr package takes a vector of vectors as input and returns a graph as
output;
* the limit package takes a vector of vectors or a graph as input and returns
the same structure as output;
* the results package takes results as input and returns results as output.
The user should be aware of these data structures while defining the data flow.
The output of a package should be compatible with the input of the package it
is passed to.
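The compatibility rule can be sketched as a small lookup table. The Python
below only encodes the input and output structures listed above. The names
SIGNATURES, output_type and check_chain are hypothetical, and the sketch is
not how claim-ms itself validates a data flow.

```python
VECTORS, GRAPH, RESULTS = "vectors", "graph", "results"


def multiple(kind):
    """Label for the 'multiple' type: a set of structures of one kind."""
    return "multiple:" + kind


# package -> (accepted input types, output type); None means "same as input".
SIGNATURES = {
    "graph": ({GRAPH, multiple(GRAPH)}, GRAPH),
    "claim-ms": ({RESULTS, multiple(RESULTS)}, multiple(RESULTS)),
    "kmeans": ({GRAPH}, RESULTS),
    "microarray": ({VECTORS, multiple(VECTORS)}, VECTORS),
    "ppi": ({GRAPH}, GRAPH),
    "shortest_path": ({GRAPH}, GRAPH),
    "corr": ({VECTORS}, GRAPH),
    "limit": ({VECTORS, GRAPH}, None),
    "results": ({RESULTS}, RESULTS),
}


def output_type(package, input_type):
    """Return the output type of a package, or raise if the input is rejected."""
    accepted, output = SIGNATURES[package]
    if input_type not in accepted:
        raise TypeError(f"{package} does not accept {input_type}")
    return input_type if output is None else output


def check_chain(input_type, packages):
    """Pass a structure through a linear chain of packages, checking each step."""
    current = input_type
    for package in packages:
        current = output_type(package, current)
    return current


# vectors -> corr -> graph -> kmeans -> results -> results
final = check_chain(VECTORS, ["microarray", "corr", "kmeans", "results"])
print(final)
```

A chain such as corr followed by corr is rejected here, because corr returns a
graph while it accepts only a vector of vectors. Real data flows are graphs
rather than linear chains, so this covers only the simplest case.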
For more details check the help of the program.