Sequencing by hybridization
Sequencing by hybridization (SBH) is one of the methods of reading a DNA sequence. It consists of two phases: the biochemical and computational ones. In the biochemical phase a DNA chip, which contains a library of several different oligonucleotides, is exposed to the solution containing many copies of target DNA. Hybridization occurs between the DNA sequence and oligonucleotides in their complementary places. In most of the hybridization experiments, the library is a set of oligonucleotides of the same length. After the reaction, one obtains a set of short subfragments of the examined sequence by reading a fluorescent or radioactive image of the DNA chip. This set is called spectrum.The computational phase of DNA sequencing consists in reconstructing the target DNA sequence from the oligonucleotides in the spectrum. It is known that in the case of an ideal hybridization experiment, when no errors in spectrum are involved, the reconstruction of the sequence can be solved in polynomial time.In real hybridization experiments we can get two different kinds of errors:* positive errors - these are additional oligonucleotides that appear in the spectrum, but do not fit to the target DNA,* negative errors - missing oligonucleotides in the spectrum, comparing to the ideal spectrum; a special case are negative errors coming from repetitions, that are repeated subsequences in the original sequence but present only in one copy in the spectrum.When the above errors occur in the spectrum, the reconstruction of the DNA sequence becomes strongly NP-hard.Tabu search algorithm presented in the first article can be downloaded here.In section Standard and isothermic spectra you may find instances used in computational tests of algorithms solving the standard and isothermic DNA sequencing by hybridization problem with both negative and positive errors. The following papers contain descriptions of the algorithms and the tests:* J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz, A. Swiercz, "Tabu search algorithm for DNA sequencing by hybridization with isothermic libraries", Computational Biology and Chemistry 28, 2004, pp. 11-19.* J. Blazewicz, C. Oguz, A. Swiercz, J. Weglarz, "DNA sequencing by hybridization via genetic search", Operations Research 54, 2006, pp. 1185-1192.* J. Blazewicz, F. Glover, M. Kasprzak, W.T. Markiewicz, C. Oguz, D. Rebholz-Schuhmann, A. Swiercz, "Dealing with repetitions in sequencing by hybridization", Computational Biology and Chemistry 30, 2006, pp 313-320.In section Standard spectra you may find the instances which were used in computational tests of algorithms solving the standard DNA sequencing by hybridization problem with both negative and positive errors. The following papers contain descriptions of the algorithms and the tests.* J. Blazewicz, M. Kasprzak, W. Kuroczycki, Hybrid genetic algorithm for DNA sequencing with errors, Journal of Heuristics 8 (2002) 495-502.* J. Blazewicz, F. Glover, M. Kasprzak, DNA sequencing - tabu and scatter search combined, INFORMS Journal on Computing 16 (2004) 232-240.* J. Blazewicz, F. Glover, M. Kasprzak, Evolutionary approaches to DNA sequencing with errors, Annals of Operations Research 138 (2005) 67-78.Most of the sequences for experiments were taken from GenBank, National Institute of Health, USA. Sequences are composed of at least 600 nucleotides.From the prefixes of the sequences of length 200, 400, 500, and 600, the ideal spectra were generated (by selecting all of the oligonucleotides being members of the library that are contained in the sequence). Next, errors were randomly added to the spectra. Some of the sequences (in the section: standard spectra) were generated according to the uniform distribution.Spectra contain both negative and positive errors (with error rate ranging from 5% up to 20%). 5% error rate means that from the ideal spectrum 5% of all the oligonucleotides were randomly deleted and the same number of oligonucleotides were added (different from the ones already existing in the spectrum). Spectra were created progressively, which means that 10% of errors contained the same errors as in the spectrum with 5% error rate plus additional 5% of errors of both types. This provides that the case with spectra containing less errors (5%) is not more difficult than the case with spectra containing more errors (10%). Spectra with 20% of errors were created accordingly. At the end, oligonucleotides in the spectra were sorted alphabetically due to lose the information about order of the oligonucleotides within the target.In specific subarticle you can read definitions of used libraries.