Spectra derived from real and random DNA sequences for standard SBH

The instances available in this section were used in computational tests of algorithms solving the standard DNA sequencing by hybridization (SBH) problem with both negative and positive errors. The following papers describe the algorithms and the tests.
* J. Blazewicz, M. Kasprzak, W. Kuroczycki, Hybrid genetic algorithm for DNA sequencing with errors, Journal of Heuristics 8 (2002) 495-502.
* J. Blazewicz, F. Glover, M. Kasprzak, DNA sequencing - tabu and scatter search combined, INFORMS Journal on Computing 16 (2004) 232-240.
* J. Blazewicz, F. Glover, M. Kasprzak, Evolutionary approaches to DNA sequencing with errors, Annals of Operations Research 138 (2005) 67-78.
Table 1 contains instances derived from real DNA sequences coding human proteins, with 20% random negative errors and 20% random positive errors. These instances were used in the tests presented in all the papers listed above. Instances generated from random DNA sequences with random errors are shown in Table 2. Table 3 contains instances with negative errors coming from repetitions of oligonucleotides in real DNA sequences. Tests on the data from Tables 2 and 3 were presented in the last paper.
Spectra derived from real DNA sequences, with 20% of random negative errors and 20% of random positive errors

The sequences on which these instances are based were taken from GenBank. They are prefixes of several genes coding human proteins. Their GenBank accession numbers are as follows:
D00723, D11428, D13510, X13440, X51535, X00351, X02994, X04350, Y00264, X58794, Y00649, X05299, X51841, X02160, X04772, X13561, X14758, X15005, X06537, Y00711, X05908, X07994, X13452, Y00651, X07982, X05875, X53799, X05451, X14322, X14618, X55762, X14894, X57548, X51408, X54867, X02874, X06985, Y00093, X15610, X52104.
These instances were used in the tests presented in all the papers listed above. The lengths of the sequences varied between 109 and 509 nucleotides (in steps of 100), and the length of the oligonucleotides was always set to 10. First, error-free spectra were generated from these sequences; their cardinalities were between 100 and 500 oligonucleotides. Next, randomly generated errors were introduced into these spectra: 20% negative errors and 20% positive errors, which left the cardinality of each spectrum unchanged.

Files "specM.tar.gz" (e.g. spec500.tar.gz) contain spectra of cardinality M saved in files "A.M-B+B" (e.g. 1.500-100+100), where A is the number of the original sequence and B is the number of negative errors, and equally of positive errors, introduced into the spectrum. Oligonucleotides in the spectra are sorted alphabetically so that no information about their original order in the sequences is retained.

Files "seqsN.gz" (e.g. seqs509.gz) contain the original sequences of length N from which the spectra were generated. Every such file contains 40 sequences, sorted in increasing order of their numbers A.
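The generation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the generator used by the authors: the source does not say how positive errors were drawn, so the sketch assumes they are uniformly random 10-mers absent from the error-free spectrum. The function names `ideal_spectrum` and `add_random_errors` are hypothetical.

```python
import random

ALPHABET = "ACGT"

def ideal_spectrum(sequence, k=10):
    """Return the set of all k-mers that occur in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def add_random_errors(spectrum, error_rate=0.2, k=10, rng=None):
    """Remove a fraction of the k-mers (negative errors) and add the same
    number of k-mers not present in the spectrum (positive errors).

    Assumption: positive errors are uniformly random k-mers; the source
    does not specify how they were chosen.
    """
    rng = rng or random.Random()
    n_errors = round(error_rate * len(spectrum))
    # Negative errors: drop n_errors oligonucleotides chosen at random.
    removed = set(rng.sample(sorted(spectrum), n_errors))
    kept = spectrum - removed
    # Positive errors: add n_errors oligonucleotides foreign to the sequence.
    added = set()
    while len(added) < n_errors:
        candidate = "".join(rng.choice(ALPHABET) for _ in range(k))
        if candidate not in spectrum:
            added.add(candidate)
    # Alphabetical order hides the original order of the oligonucleotides.
    return sorted(kept | added)
```

For a sequence of length 509 without repeated 10-mers, `ideal_spectrum` yields 500 oligonucleotides, and `add_random_errors` removes 100 of them and adds 100 foreign ones, so the result again has 500 elements, matching a file name such as 1.500-100+100.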
Spectra derived from random DNA sequences, with 20% of random negative errors and 20% of random positive errors

The random sequences were generated according to the uniform distribution. The lengths of the sequences varied between 109 and 509 nucleotides (in steps of 100), and the length of the oligonucleotides was always set to 10. First, error-free spectra were generated from these sequences; their cardinalities were between 100 and 500 oligonucleotides. Next, randomly generated errors were introduced into these spectra: 20% negative errors and 20% positive errors, which left the cardinality of each spectrum unchanged. Tests on these data were presented in the last paper mentioned earlier.
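As a rough illustration of what drawing sequences from the uniform distribution means here, the following sketch picks each nucleotide independently with equal probability. The helper name `random_dna` and the use of the sequence length as a seed are assumptions made for reproducibility of the example; they are not taken from the original generator.

```python
import random

def random_dna(length, rng=None):
    """Draw each nucleotide independently and uniformly from A, C, G, T."""
    rng = rng or random.Random()
    return "".join(rng.choice("ACGT") for _ in range(length))

# Sequence lengths used in the data sets: 109, 209, 309, 409, 509.
lengths = range(109, 510, 100)

# One example sequence per length; seeding with the length is an arbitrary
# choice that only makes this sketch repeatable.
sequences = {n: random_dna(n, random.Random(n)) for n in lengths}
```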
Files "rspecM.tar.gz" (e.g. rspec500.tar.gz) contain spectra of cardinality M saved in files "A.M-B+B" (e.g. 1.500-100+100), where A is the number of the original sequence and B is the number of negative errors, and equally of positive errors, introduced into the spectrum. Oligonucleotides in the spectra are sorted alphabetically so that no information about their original order in the sequences is retained. Files "rseqsN.tar.gz" (e.g. rseqs509.tar.gz) contain the original sequences of length N, from which the spectra were generated, saved in files "seq#A.N" (e.g. seq#1.509).
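When processing many instances it can help to recover A, M and B from the file names. The sketch below parses the "A.M-B+B" pattern described above; the names `SpectrumFileName` and `parse_spectrum_name` are hypothetical, and the internal layout of the spectrum files themselves is not assumed here.

```python
import re
from typing import NamedTuple

class SpectrumFileName(NamedTuple):
    sequence_number: int   # A: number of the original sequence
    cardinality: int       # M: number of oligonucleotides in the spectrum
    negative_errors: int   # B: oligonucleotides removed
    positive_errors: int   # B: oligonucleotides added

# Matches names such as "1.500-100+100".
_NAME_PATTERN = re.compile(r"(\d+)\.(\d+)-(\d+)\+(\d+)")

def parse_spectrum_name(name):
    """Split a spectrum file name of the form A.M-B+B into its fields."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a spectrum file name: {name!r}")
    return SpectrumFileName(*(int(group) for group in match.groups()))

info = parse_spectrum_name("1.500-100+100")
print(info.sequence_number, info.cardinality, info.negative_errors)
```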
Spectra derived from real DNA sequences, with negative errors coming from repetitions of oligonucleotides

The sequences on which these instances are based were taken from GenBank, National Institutes of Health, USA. They are prefixes of length 509 of several genes coding human proteins. Their GenBank accession numbers are as follows:
X58377, X56088, X03350, X01098, X00318, X53279, X07577, X03663, X07173, Y00503, X07696, X03444, X03445, Y00815, Y00062, X13967, X17206, X01393, Y00809, X53331, X07362, X12510, X05450, Y00695, X54304, X13403, X13097, X04217, X04808, X03795, X04741, X52997, X04412, X07767, Y00345, X12385, X13405, X53605, Y00971, X13973, X00129, X54534, X04654, X06617, X13697, X12496, X02317, X07898, X02812, X05615, X01394, X16316, D10570, D28468, D12686, D90224, D14012, D11327, D16105.
The spectra were extracted from these sequences with the oligonucleotide length set to 10, which resulted in some negative errors coming from repetitions of 10-mers within the sequences. The instances contain from 1 to 32 such errors.
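The number of repetition errors in a sequence follows directly from the definition: a sequence of length n has n - 10 + 1 positions for a 10-mer, and every repeated occurrence of a 10-mer already seen contributes one negative error, because the spectrum is a set. A minimal sketch, with the hypothetical helper name `repetition_errors`:

```python
def repetition_errors(sequence, k=10):
    """Count negative errors caused by repeated k-mers in the sequence.

    The spectrum is a set, so each k-mer is recorded once however often
    it occurs; the error count is the number of k-mer positions minus the
    number of distinct k-mers.
    """
    positions = len(sequence) - k + 1
    distinct = len({sequence[i:i + k] for i in range(positions)})
    return positions - distinct

# Toy example with k=4: "ACGTACGTACGT" has 9 positions but only
# 4 distinct 4-mers (ACGT, CGTA, GTAC, TACG), hence 5 errors.
print(repetition_errors("ACGTACGTACGT", k=4))  # prints 5
```

For the sequences of length 509 used here there are 500 positions, so an instance with e errors has a spectrum of cardinality 500 - e, that is, between 468 and 499 for the stated range of 1 to 32 errors.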
File "rep500.tar.gz" contains spectra of cardinality less than 500, saved in files "A.rep500", where A is the number of the original sequence. Oligonucleotides in the spectra are sorted alphabetically so that no information about their original order in the sequences is retained. File "rep509.gz" contains the 59 original sequences of length 509 from which the spectra were generated. The sequences are sorted in the file in increasing order of their numbers A.
Table 3. Spectra derived from real DNA sequences, with negative errors coming from repetitions of oligonucleotides.

| spectrum cardinality | spectra | sequences |
|---|---|---|
| <500 | rep500.tar.gz | rep509.gz |