Department of Computer Science, Michigan Technological University, Houghton 49931.
An effective computer program for assembling DNA fragments, the contig assembly program (CAP), has been developed. In the CAP program, a filter is used to quickly eliminate fragment pairs that could not possibly overlap, a dynamic programming algorithm is applied to compute the maximal-scoring overlapping alignment between each remaining pair of fragments, and a simple greedy approach is employed to assemble fragments in order of alignment scores. To identify the true fragment overlaps, the dynamic programming algorithm uses specially chosen sets of alignment parameters to tolerate sequencing errors and to penalize "mutational" changes between different copies of a repetitive sequence. Performance tests of the program on fragment data from genomic sequencing projects produced satisfactory results. The CAP program is efficient in computer time and memory; it took about 4 h to assemble a set of 1015 fragments into long contigs on a Sun workstation.
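The three stages named above (filter, overlap alignment, greedy assembly) can be illustrated with a minimal Python sketch. It is not the CAP implementation: the shared k-mer test in `shares_kmer`, the scoring values (`match=2`, `mismatch=-3`, `gap=-4`), the `min_score` threshold, and the naive merge that simply appends the non-overlapping tail are all assumptions chosen for illustration, and reverse-complement fragments are ignored.

```python
def shares_kmer(a, b, k=4):
    """Assumed filter: keep a pair only if the fragments share some k-mer."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[j:j + k] in kmers for j in range(len(b) - k + 1))


def overlap(a, b, match=2, mismatch=-3, gap=-4):
    """Best-scoring alignment of a suffix of a with a prefix of b.

    Returns (score, n), where n is the number of leading bases of b
    covered by the overlap. Scoring values are illustrative only.
    """
    # Row 0: a prefix of b aligned against nothing costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [0]  # skipping any prefix of a is free
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    # The alignment must reach the end of a; the rest of b is free.
    best_j = max(range(len(b) + 1), key=lambda j: prev[j])
    return prev[best_j], best_j


def greedy_assemble(frags, min_score=6):
    """Join fragments in decreasing order of overlap score."""
    cands = []
    for i, a in enumerate(frags):
        for j, b in enumerate(frags):
            if i != j and shares_kmer(a, b):
                score, n = overlap(a, b)
                if score >= min_score:
                    cands.append((score, i, j, n))
    cands.sort(reverse=True)

    nxt, has_pred = {}, set()
    root = list(range(len(frags)))  # tiny union-find to reject cycles

    def find(x):
        while root[x] != x:
            x = root[x]
        return x

    for score, i, j, n in cands:
        if i in nxt or j in has_pred or find(i) == find(j):
            continue  # fragment end already used, or join would form a cycle
        nxt[i] = (j, n)
        has_pred.add(j)
        root[find(j)] = find(i)

    contigs = []
    for start in range(len(frags)):
        if start in has_pred:
            continue  # not the leftmost fragment of its chain
        seq, cur = frags[start], start
        while cur in nxt:
            cur, n = nxt[cur]
            seq += frags[cur][n:]  # naive merge: append the tail of b
        contigs.append(seq)
    return contigs


fragments = ["ACGTTGCA", "TGCAGGTC", "GGTCATTAGC"]
contigs = greedy_assemble(fragments)
```

The quadratic pair loop and the string-append merge keep the sketch short; a real assembler would build a consensus from the aligned columns rather than trusting one fragment's bases in the overlap.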
Department of Computer Science, Michigan Technological University, Houghton, Michigan, 49931, USA.
We describe a number of improvements to the CAP sequence assembly program. These improvements include the development of methods for solving the problem caused by simple repetitive sequences, for automatically editing fragment alignments and consensus sequences, and for identifying chimeric fragments. The improved program (CAP2) assembled each of seven data sets, six of which contain repetitive sequences of very strong similarity, into a single sequence. As an example, CAP2 assembled a set of 1467 fragments into a single sequence of 73,328 bp that has only eight differences from the original sequence. The effects of fragment length, coverage, and error rate on the performance of CAP2 were evaluated using artificial data sets.
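The abstract does not say how the artificial data sets were generated. As a hedged illustration of how fragment length, coverage, and error rate interact, the sketch below samples fixed-length fragments uniformly from a reference and introduces substitution errors only. The function name `simulate_fragments`, the uniform sampling, and the substitution-only error model are assumptions, not the authors' procedure. The number of fragments follows from the usual relation coverage = (number of fragments × fragment length) / reference length.

```python
import random


def simulate_fragments(reference, frag_len, coverage, error_rate, seed=0):
    """Sample fragments from reference (hypothetical, substitution errors only)."""
    rng = random.Random(seed)
    # coverage = n_frags * frag_len / len(reference)
    n_frags = round(coverage * len(reference) / frag_len)
    frags = []
    for _ in range(n_frags):
        start = rng.randrange(len(reference) - frag_len + 1)
        bases = []
        for base in reference[start:start + frag_len]:
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                base = rng.choice([x for x in "ACGT" if x != base])
            bases.append(base)
        frags.append("".join(bases))
    return frags


ref_rng = random.Random(1)
reference = "".join(ref_rng.choice("ACGT") for _ in range(1000))
frags = simulate_fragments(reference, frag_len=100, coverage=5, error_rate=0.02)
clean = simulate_fragments(reference, frag_len=100, coverage=5, error_rate=0.0)
```

Varying one of the three parameters while holding the other two fixed gives the kind of controlled comparison the evaluation describes.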