Bioserver

Studies conducted during the last few years have brought a great change in our understanding of the role that RNA molecules play in the biological systems. They do not only take part in the protein synthesis or serve as adaptors translating information encoded in nucleotide sequences but also are involved in regulation of geneexpression. Most of them are produced from the larger molecules due to enzyme processing or spontaneous degradation and play a crucial role in the cellular processes.In our laboratory, we work on the problem of a degradation of RNA molecules. In the studies we use the artificial RNA  molecules, which were designed in such a way that they should be unstable, according to the rules developed by Kierzek and co-workers. On the basis of the results of their degradation we identify the regions of RNA molecules which are weak and the most susceptible to the cleavage. We have also proposed a formulation of a new problem, called RNA Partial Degradation Problem (RNA PDP) and the algorithm that allows to reconstruct the cleavage sites of the given RNA molecule.

Viral infection modeling

HCV is one of the most prevalent human pathogens, infecting more than 170 million people worldwide. Similar to other RNA-based viruses, it exists as a quasispecies in an individual organism, i.e. as a pool of phylogenetically related but genetically slightly distinct variants. With its capacity for long-lasting persistence in the host, HCV causes chronic infections that can lead to liver fibrosis, cirrhosis and hepatocellular carcinoma.We investigate a correlation between genetic diversity of hepatitis C virus population and the level of viral RNA accumulation in patient blood. Genetic diversity is defined as the mean Hamming distance between all pairs of virus RNA sequences representing the population. We have found that a low Hamming distance (i.e. low genetic diversity) correlates with a high RNA level; symmetrically, high diversity corresponds to a low RNA level. In our opinion the obtained correlation strength justifies the use of the viral RNA level as a measure enabling prediction of efficiency of an established therapy.Structure of the IRES located in the 5'-UTR of HCV[source]

Peptide sequencing

Determining amino acid sequences of protein molecules is one of the most important issues in molecular biology. These sequences determine protein structure and functionality. Unfortunately, direct biochemical methods for reading amino acid sequences can be used for reading short sequences only. This is the reason, which makes peptide assembly algorithms an important complement of these methods.In our lab we have developed two algorithms solving the problem of short amino acid sequence assembly:- tabu search based method- genetic algorithm.Both algorithms have been tested in computational experiments and their performance compared.

Phylogenetic Trees

One of the most important aspect of molecular and computational biology is the reconstruction of evolutionary relationships. The area is well explored after decades of intensive research. Despite this fact there remains a need for good and efficient algorithms that are capable of reconstructing the evolutionary relationship in reasonable time.Since the problem is computationally intractable, exact algorithms are used only for small groups of species. In the Maximum Parsimony approach the time of computation grows so fast when number of sequences increases, that in practice it is possible to find the optimal solution for instances containing about 20 sequences only. It is this reason that in practical applications, heuristic methods are used.In our laboratory we have developed parallel adaptive memory programming algorithms based on Maximum Parsimony and some known neighborhood search methods for phylogenetic tree construction. The proposed algorithms achieve a superlinear speedup and find solutions of good quality.

Sequencing by hybridization

Sequencing by hybridization (SBH) is one of the methods of reading a DNA sequence. It consists of two phases: the biochemical and computational ones. In the biochemical phase a DNA chip, which contains a library of several different oligonucleotides, is exposed to the solution containing many copies of target DNA. Hybridization occurs between the DNA sequence and oligonucleotides in their complementary places. In most of the hybridization experiments, the library is a set of oligonucleotides of the same length. After the reaction, one obtains a set of short subfragments of the examined sequence by reading a fluorescent or radioactive image of the DNA chip. This set is called spectrum.The computational phase of DNA sequencing consists in reconstructing the target DNA sequence from the oligonucleotides in the spectrum. It is known that in the case of an ideal hybridization experiment, when no errors in spectrum are involved, the reconstruction of the sequence can be solved in polynomial time.In real hybridization experiments we can get two different kinds of errors:* positive errors - these are additional oligonucleotides that appear in the spectrum, but do not fit to the target DNA,* negative errors - missing oligonucleotides in the spectrum, comparing to the ideal spectrum; a special case are negative errors coming from repetitions, that are repeated subsequences in the original sequence but present only in one copy in the spectrum.When the above errors occur in the spectrum, the reconstruction of the DNA sequence becomes strongly NP-hard.Tabu search algorithm presented in the first article can be downloaded here.In section Standard and isothermic spectra you may find instances used in computational tests of algorithms solving the standard and isothermic DNA sequencing by hybridization problem with both negative and positive errors. The following papers contain descriptions of the algorithms and the tests:* J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz, A. Swiercz, "Tabu search algorithm for DNA sequencing by hybridization with isothermic libraries", Computational Biology and Chemistry 28, 2004, pp. 11-19.* J. Blazewicz, C. Oguz, A. Swiercz, J. Weglarz, "DNA sequencing by hybridization via genetic search", Operations Research 54, 2006, pp. 1185-1192.* J. Blazewicz, F. Glover, M. Kasprzak, W.T. Markiewicz, C. Oguz, D. Rebholz-Schuhmann, A. Swiercz, "Dealing with repetitions in sequencing by hybridization", Computational Biology and Chemistry 30, 2006, pp 313-320.In section Standard spectra you may find the instances which were used in computational tests of algorithms solving the standard DNA sequencing by hybridization problem with both negative and positive errors. The following papers contain descriptions of the algorithms and the tests.* J. Blazewicz, M. Kasprzak, W. Kuroczycki, Hybrid genetic algorithm for DNA sequencing with errors, Journal of Heuristics 8 (2002) 495-502.* J. Blazewicz, F. Glover, M. Kasprzak, DNA sequencing - tabu and scatter search combined, INFORMS Journal on Computing 16 (2004) 232-240.* J. Blazewicz, F. Glover, M. Kasprzak, Evolutionary approaches to DNA sequencing with errors, Annals of Operations Research 138 (2005) 67-78.Most of the sequences for experiments were taken from GenBank, National Institute of Health, USA. Sequences are composed of at least 600 nucleotides.From the prefixes of the sequences of length 200, 400, 500, and 600, the ideal spectra were generated (by selecting all of the oligonucleotides being members of the library that are contained in the sequence). Next, errors were randomly added to the spectra. Some of the sequences (in the section: standard spectra) were generated according to the uniform distribution.Spectra contain both negative and positive errors (with error rate ranging from 5% up to 20%). 5% error rate means that from the ideal spectrum 5% of all the oligonucleotides were randomly deleted and the same number of oligonucleotides were added (different from the ones already existing in the spectrum). Spectra were created progressively, which means that 10% of errors contained the same errors as in the spectrum with 5% error rate plus additional 5% of errors of both types. This provides that the case with spectra containing less errors (5%) is not more difficult than the case with spectra containing more errors (10%). Spectra with 20% of errors were created accordingly. At the end, oligonucleotides in the spectra were sorted alphabetically due to lose the information about order of the oligonucleotides within the target.In specific subarticle you can read definitions of used libraries.

SR-ASM algorithm

SR-ASM (Short Reads ASseMbly) algorithm is designed for DNA assembly of the short sequences coming from 454 sequencers. The 454 sequencer protocol is based on measuring light intensities generated during the sequencing process. The sequences are given as nucleotide chains with numbers of their consecutive appearances. Additionally to the sequences, one gets rates of confidence for every nucleotide.The constructed SR-ASM algorithm is a heuristic based on a graph model, the graph being built on the set of input sequences. In the algorithm three parts can be distinguished. The first part computes feasible overlaps for input sequences. It requires as the parameters, values of the minimum overlap between two sequences and of the error bound, the latter being the percentage of the mismatches allowed in the overlap of two sequences. The comparison is done also for the sequences reverse complementary to the input ones, due to the assumption that the fragments come from both strands of a DNA helix. In the second phase, a graph is constructed with the fragments as the vertices. Two vertices are connected by an arc if there is a feasible overlap between the two fragments. Next, a path is searched for, which passes through one of the vertices from every pair: either through the straightforward fragment or its reverse complementary counterpart. Usually it is not possible to find a single path in the graph and several paths corresponding to contigs are returned as solutions. At the end, in the third part, a consensus sequence (sequences) is determined on the basis of the alignment.The strength of SR-ASM comes from the combination of the following new propositions in the assembling procedure: temporary compression of the input sequences, new method of selecting promising pairs, operations repairing lack of some arcs or excess of some vertices in the overlap graph, and finally, the system of voting sequences in creating the consensus sequence.Download the source code of  SR-ASM algorithm.Usefulness of the algorithm has been proven in tests on raw data generated during sequencing of the whole 1.84 Mbp genome of bacteria Prochlorococcus marinus. The paper with the description of the algorithm and with the computational experiment is:* J. Blazewicz, M. Bryja, M. Figlerowicz, P. Gawron, M. Kasprzak, E. Kirton, D. Platt, J. Przybytek, A. Swiercz, L. Szajkowski, "Whole genome assembly from 454 sequencing output via modified DNA graph concept", Computational Biology and Chemistry 33 (2009) 224-230.

Simplified partial digest problem

In order to recognize linear structure of a long DNA fragment (i.e. its sequence of nucleotides), the DNA molecule is divided into smaller parts, sequenced and then combined together back into the whole. Restriction mapping is one of the approaches used for the purpose of dividing and subsequent joining smaller parts. Very short DNA fragments (patterns) of the molecule uniquely recognized by restriction enzymes provide information necessary to carry out this process. The patterns are called restriction sites.During the process of restriction mapping of DNA, many copies (clones) of a DNA molecule are exposed to restriction enzymes, which cut them at particular patterns (sites). Depending on the time of this reaction (also on the enzyme used and on the length of DNA), the experiment returns clones of the original molecule cut at few or many sites (usually the shorter reaction time, the smaller the number of sites recognized by the enzyme). The cutting process leads to the loss of information about the location of the fragments and restriction sites in the original molecule, their order is next reconstructed on the base of the fragment lengths.There are two classical variants of the experiment: double digestion, which employs two restriction enzymes, and partial digestion, which uses one enzyme but in a series of reactions differing by time. At the output of the error-free partial digestion one gets a multiset of all interpoint distances, where the points are the restriction sites and the ends of the molecule. The aim of the Partial Digest Problem (PDP) is to reconstruct the restriction map basing on the multiset of fragment lengths. This variant was not widely employed in biological laboratories because of the difficulty in obtaining fragments between every pair of sites. From the point of view of computational complexity theory, the status of error-free PDP (being a known problem also in computational geometry) is still an open question.The simplified partial digestion was proposed as a new variant of the restriction mapping in [1]. It overcomes the drawback of the partial digestion outlined above. In this simplified method, one enzyme is used on two sets of clones of the DNA molecule. The corresponding experiment consists only of two parts. In the first part, the time of the chemical reaction is very short and chosen so that target cloned molecules from the first set are cut at (statistically) one restriction site at most. In the second part, the reaction time span is long enough to cut the cloned molecules from the second set at all restriction sites. This approach reduces the number of reactions and presents an easier adjustment of the reaction times. The aim of the Simplified Partial Digest Problem (SPDP) is the reconstruction of the map of restriction sites from two multisets resulting from the two reactions. The computational complexity of SPDP was solved in [2], the problem was proved to be strongly NP-hard. On the other hand, further studies have shown computational usefulness of this approach, confirmed by tests of a series of new algorithms for the problem [3-6].[1] J. Blazewicz, P. Formanowicz, M. Kasprzak, M. Jaroszewski, W.T. Markiewicz, "Construction of DNA restriction maps based on a simplified experiment", Bioinformatics 17 (2001) 398–404.[2] J. Blazewicz, M. Kasprzak, "Combinatorial optimization in DNA mapping – a computational thread of the Simplified Partial Digest Problem", RAIRO – Operations Research 39 (2005) 227–241.[3] J. Blazewicz, M. Jaroszewski, "New algorithm for the Simplified Partial Digest Problem", Lecture Notes in Bioinformatics 2812 (2003) 95–110.[4] J. Blazewicz, E.K. Burke, M. Kasprzak, A. Kovalev, M.Y. Kovalyov, "The Simplified Partial Digest Problem: enumerative and dynamic programming algorithms", IEEE/ACM Transactions on Computational Biology and Bioinformatics 4 (2007) 668–680.[5] J. Blazewicz, E.K. Burke, M. Kasprzak, A. Kovalev, M.Y. Kovalyov, "On the approximability of the Simplified Partial Digest Problem", Discrete Applied Mathematics 157 (2009) 3586-3592.[6] J. Blazewicz, E.K. Burke, M. Kasprzak, A. Kovalev, M.Y. Kovalyov, "The Simplified Partial Digest Problem: approximation and a graph-theoretic model", European Journal of Operational Research 208 (2011) 142-152.

RNA structural databases

A number of bioinformatic tools have been proposed to explore two major structural databases (PDB, NDB) in order to analyze various aspects of RNA tertiary structures. One of these tools is RNA FRABASE, a web-accessible database with an engine for automatic search of 3D fragments within PDB-derived RNA structures.RNA FRABASE 2.0 stores a wide selection of structural data about PDB-deposited RNA structures:- core data: RNA sequences, secondary structures, the PDB header data (identification codes, experimental method used, resolution, PDB deposition date and number of models), the NDB identification codes);- a collection of atom coordinates of unmodified and modified nucleotide residues occurring in RNA structures;- a description of torsion angles and sugar pucker parameters, including values for the pseudotorsion angles;- base-base and inter base pair parameters;- base pair location, i.e. information specifying whether one or both bases of the pair belong to the resulting fragment;- a complete classification of canonical and non-canonical base pair types, in both Westhof's and Saenger's notation and information about multiplets;- information about original residue and chain numbering;- graphical views of RNA secondary structures.The database is served by a search engine designed to search for user-specified single- or multi-stranded RNA fragments. Search algorithms operate on the database of the RNA sequences and the library of RNA secondary structures, coded in the dot-bracket format. Due to an efficiency of the engine, RNA FRABASE is the only web-accessible tool that enables fast automatic exploration of such a substantial selection of structural parameters available for user-defined 3D RNA fragments. The tool is widely used by the researchers all over the world in the study of RNA features.RNA FRABASE has been developed in our lab, in cooperation with a group of biochemists from the Laboratory of Structural Chemistry of Nucleic Acids, IChB PAS. The first version of the tool has been released in 2007, version 2.0 appeared in 2010. More information can be found in the following papers:M. Popenda, M. Blazewicz, M. Szachniuk, R.W. Adamiak.RNA FRABASE version 1.0: an engine with a database to search for the three-dimensional fragments within RNA structures. Nucleic Acids Research 36, 2008, D386-D391 (published online on October 5, 2007, doi:10.1093/nar/gkm786).M. Popenda, M. Szachniuk, M. Blazewicz, S. Wasik, E.K. Burke, J. Blazewicz, R.W. Adamiak.RNA FRABASE 2.0: an advanced web-accessible database with the capacity to search the three-dimensional fragments within RNA structures. BMC Bioinformatics, 2010, 11:231 (published online on May 6, 2010, doi:10.1186/1471-2105-11-231).

Pairwise Alignment

Pairwise alignment algorithms are of great importance in computational biology. Methods based on dynamic programming, despite their high computational complexity, have been widely used over a few decades due to their accuracy. However, in real-life applications constantly increasing number of biological sequences has started to cause serious difficulties.Therefore, in response to presented needs, we developed an efficient tool for pairwise alignment called G-PAS (previously known as gpu-pairAlign).More information in article.

CLAIM

CLAIM - coupling co-expression data and protein-protein interaction networks for functional protein analysisCLAIM (CLusterAnalysis Integration Method) is a new method for integrating co-expression data obtained through microarray experiments (MA) and protein-protein interaction (PPI) network data. Microarray and PPI data are separately clustered; the clusters are then merged in a special graph; cliques of this graph would identify a group of functionally related proteins. The biological insight provided by these groups is analyzed on the basis of co-localization and mRNA developmental expression, pointing out the new information that can be obtained by this method.CLAIM can be also used to assign proteins whose functional role is unknown to pathways using the cliques that are strongly associated with known pathways. The basic assumption is that, if a protein belongs to a clique and the other proteins in that clique are in a known pathway, then that protein is likely to belong to that pathway. Based on this assumption, pathway assignment was performed through a score prediction function, based on the presence of a protein in pathway enriched cliques.The prediction power of the algorithm appears to be sufficiently high to make this method a useful semi-automated tool for protein functional analysis.Method CLAIM has been tested on the model organism Arabidopsis thaliana.For more detailed information please read CLAIM README.Download CLAIM

GRASShopPER

GRASShopPER (GPU overlap GRaph ASSembler using Paired End Reads) is the novel assembly method that follows the approach of overlap–layout–consensus (OLC). In the method, a very efficient GPU  implementation of the exact reads alignment algorithm has been used for calculating the scores and shifts on the arcs of the graph. Two-part fork detection strategy has been introduced, which highly reduces misassembly rate in the resulting contigs. The first part is carried out during the graph traversal. In the second part, a greedy hyper-heuristic identifies undetected forks on the basis of paired-end reads information. The results of computational experiments show high coverage of the tested genome.More information