Mid-range inhomogeneity (MRI) is the significant enrichment of particular nucleotides in genomic sequences extending from 30 to 10,000 nucleotides. MRI can be observed for all nucleotide pairings (e.g., G+C, A+G, and G+T) as well as for individual bases. Various types of MRI regions are enriched 4- to 20-fold in mammalian genomes relative to their occurrence in random models. We first show how different types of mutations change MRI regions. Human, chimpanzee, and Macaca mulatta genomes were aligned to study the projected effects of substitutions and indels on human sequence evolution within both MRI regions and control regions of average nucleotide composition. Over 18.8 million fixed point substitutions, 3.9 million SNPs, and indels spanning 6.9 Mb were collected and evaluated in human; of these, 1.8 Mb substitutions and 1.9 Mb indels fall within MRI regions. Ancestral and mutant alleles were determined for substitutions. Substitutions were grouped according to their degree of fixation within human populations: fixed substitutions (from the human-chimp-macaca alignment), major SNPs (> 80% mutant allele frequency within humans), medium SNPs (20%–80%), minor SNPs (3%–20%), and rare SNPs (< 3%). Short (< 3 bp) and medium-length (3–50 bp) insertions and deletions within MRI regions and appropriate control regions were analyzed for their effect on the expansion or diminution of such regions and on their nucleotide composition. MRI regions show levels of de novo mutation comparable to those of control genomic sequences. Newer mutations rapidly erode MRI regions, bringing their nucleotide composition toward genome-average levels. However, substitutions that favor the maintenance of MRI properties have a higher chance of spreading throughout the human population. Indels show a clear tendency to maintain MRI features but have a smaller impact than substitutions. Overall, the observed fixation bias for mutations helps maintain MRI regions during evolution.
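
The grouping of substitutions by fixation level, and the question of whether a substitution erodes or maintains an MRI region, can be illustrated with a minimal sketch. The function names, the handling of the exact 20% and 3% boundaries, and the +1/0/-1 scoring are assumptions made for illustration only; they are not taken from the analysis pipeline used in this work.

```python
def frequency_class(mutant_allele_freq, fixed=False):
    """Bin a substitution by mutant allele frequency (a fraction in [0, 1]).

    Thresholds follow the classes named in the text. Which class the exact
    boundary values (0.20, 0.03) belong to is an assumption of this sketch.
    """
    if fixed:
        return "fixed"
    if mutant_allele_freq > 0.80:
        return "major"
    if mutant_allele_freq >= 0.20:
        return "medium"
    if mutant_allele_freq >= 0.03:
        return "minor"
    return "rare"


def effect_on_region(ancestral, mutant, enriched):
    """Score a substitution against the bases an MRI region is enriched in.

    Returns +1 if the substitution adds an enriched base (maintains MRI),
    -1 if it removes one (erodes MRI), and 0 if it is neutral.
    """
    return int(mutant in enriched) - int(ancestral in enriched)


if __name__ == "__main__":
    # Hypothetical substitution in a G+C-rich region: A -> G at 10% frequency.
    print(frequency_class(0.10), effect_on_region("A", "G", "GC"))
```

Summing such scores within each frequency class would give one simple way to compare how rare and fixed substitutions differ in their effect on a region.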
Next, we discuss the splicing of large introns (over 50,000 base pairs) in mammals. Large introns must be spliced out of the pre-mRNA in a timely fashion, which involves bringing together distant 5' and 3' splice sites. In Drosophila, large introns can be spliced efficiently through a process known as recursive splicing. We computationally demonstrate that vertebrates lack the proper enrichment of RP-sites in their large introns and therefore require some other mechanism to aid splicing. We analyzed over 15,000 non-redundant large introns from six mammals, 1,600 from chicken and zebrafish, and 560 from five invertebrates. Unlike the studied invertebrates, the studied vertebrate genomes consistently contain abundant interspersed repetitive elements (mainly SINEs and LINEs) on both the direct and complementary strands, which may form stems with each other within large introns. Indeed, predicted stems were abundant and stable in the large introns of mammals. We hypothesize that stable stems with long loops within large introns allow splice sites to find each other more quickly by folding the intronic RNA upon itself.
Finally, we extend and complement existing Markov model algorithms by developing and testing a novel binary-abstracted Markov model (BAMM) algorithm. BAMM can emphasize selected portions of genomic sequence signals according to specific abstraction rules. We present abstraction rules that generalize genomic sequence patterns from the single-nucleotide level up to the level of tetranucleotides, using both in-frame data and data of mixed reading frames. We also develop context-dependent abstraction rules that emphasize genomic sequence repetition. Unlike traditional Markov models, BAMM can analyze nucleotide patterns from the short-range (< 20 bp) up to the mid-range (20 to 50 bp) scale. Abstraction rules can be either frame-sensitive or frame-independent. We build classifiers for both coding sequences and introns as well as for 5' and 3' UTR data. Using support vector machines, we demonstrate that multiple BAMM classifiers can be combined to further improve exon-intron classification accuracy.
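
The general idea of abstracting a sequence to a binary alphabet before fitting a Markov model can be sketched as follows. This is not the BAMM algorithm itself: the single-nucleotide rule (G/C mapped to 1, A/T to 0), the add-one smoothing, and all function names are assumptions chosen for illustration. One plausible motivation, also an assumption here, is that a binary alphabet has only 2^k contexts of length k instead of 4^k, so longer contexts remain tractable.

```python
import math
from collections import defaultdict


def abstract(seq, ones="GC"):
    """Apply a hypothetical single-nucleotide abstraction rule:
    bases listed in `ones` become '1', all other bases become '0'."""
    return "".join("1" if base in ones else "0" for base in seq.upper())


def train(seqs, order, ones="GC"):
    """Count next-bit occurrences for every binary context of length `order`.
    Counts start at 1 (add-one smoothing) so no probability is zero."""
    counts = defaultdict(lambda: [1, 1])
    for seq in seqs:
        bits = abstract(seq, ones)
        for i in range(order, len(bits)):
            counts[bits[i - order:i]][int(bits[i])] += 1
    return dict(counts)


def log_likelihood(seq, counts, order, ones="GC"):
    """Log-probability of the abstracted sequence under a trained model.
    Unseen contexts fall back to the smoothed uniform counts [1, 1]."""
    bits = abstract(seq, ones)
    total = 0.0
    for i in range(order, len(bits)):
        c = counts.get(bits[i - order:i], [1, 1])
        total += math.log(c[int(bits[i])] / (c[0] + c[1]))
    return total


if __name__ == "__main__":
    # Toy training data, invented for this example only.
    model_a = train(["GCGCGGCCGCGC" * 3], order=3)
    model_b = train(["ATATTAATTATA" * 3], order=3)
    query = "GCGGCGCC"
    score = log_likelihood(query, model_a, 3) - log_likelihood(query, model_b, 3)
    print("class A" if score > 0 else "class B")
```

Log-likelihood differences obtained under several abstraction rules could then serve as input features to a support vector machine, which is one way to read the combination of classifiers described above.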