Overview
My doctoral thesis, entitled “Comparison of Nucleotide and Protein Sequences for Genome-based Classification and Identification”, was prepared as a cumulative thesis and focuses on many aspects of phylogenomics and microbial taxonomy. A summary of the main topics follows. The thesis can be retrieved via the German National Library.
Microbial taxonomy and phylogenetics
Microorganisms influence life on Earth more than any other group of organisms and hold all records with respect to abundance, biomass, genetic diversity and diversity of metabolic pathways. The identification, characterization and classification of microbes are of utmost importance for medicine, biotechnology, ecology and society as a whole. These tasks are carried out by microbial taxonomists with the help of various techniques. Rapid genome-sequencing technologies are the latest in a long chain of technological innovations that have affected microbial taxonomy over the last 120 years. Genome-based applications significantly increase phylogenetic resolution. My doctoral thesis describes several novel approaches for the classification and identification of microorganisms based on genome or proteome sequences.
Digital DNA:DNA hybridization
First, an improved bioinformatics method was developed to replace the laborious and error-prone wet-lab DNA:DNA hybridization (DDH) method. It infers genome-to-genome distances between pairs of entirely or partially sequenced genomes, yielding a digital, highly reliable estimator of genome relatedness. Among several improvements, a novel addition was the determination of confidence intervals for predicted DDH estimates, which enables the user to evaluate the outcomes statistically.
When is DDH mandatory in microbial taxonomy?
Second, I investigated when a DDH experiment is necessary at all. Currently, a 16S rRNA gene sequence similarity of 97% is the threshold below which an additional DDH determination need not be conducted to assess whether or not two strains belong to the same species. A statistical analysis suggested raising this threshold from 97% to about 99.0%, without a significant risk of wrongly differentiating species.
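The threshold rule described above can be written as a simple decision function. This is only a minimal sketch and not code from the thesis: the function name `ddh_required` and its parameters are hypothetical, and treating a similarity exactly equal to the threshold as requiring DDH is an assumption.

```python
def ddh_required(similarity_percent: float, threshold: float = 99.0) -> bool:
    """Decide whether a DDH determination is still needed for a strain pair.

    similarity_percent: 16S rRNA gene sequence similarity in percent.
    threshold: similarity below which the two strains are treated as
        different species without DDH. The default of 99.0 reflects the
        raised value suggested above; pass 97.0 for the traditional one.
    """
    if not 0.0 <= similarity_percent <= 100.0:
        raise ValueError("similarity must be given in percent (0-100)")
    # Assumption: a similarity exactly at the threshold still calls for DDH.
    return similarity_percent >= threshold


# A pair with 98.5% similarity needs DDH under the 97% rule
# but not under the raised threshold.
print(ddh_required(98.5, threshold=97.0))
print(ddh_required(98.5))
```

Under the raised threshold, fewer strain pairs fall into the range in which a DDH determination is called for.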
The taxonomic use of the G+C content in the genomic age
Third, I used genomic datasets to re-examine the assumption from the literature that the G+C content can vary by up to 3-5% within species. The results indicated that it varies by no more than 1% within species if computed from complete genome sequences.
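For illustration, the G+C content of a sequence and the difference between two genomes can be computed as follows. This is a minimal sketch, not the analysis pipeline of the thesis: the function names are hypothetical, sequences are assumed to be plain strings, and ambiguous bases such as N are excluded from the denominator by assumption.

```python
def gc_content(sequence: str) -> float:
    """Return the G+C content of a nucleotide sequence in percent."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    # Assumption: only unambiguous bases (A, C, G, T) enter the denominator,
    # so N or gap characters do not bias the value.
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


def gc_difference(genome_a: str, genome_b: str) -> float:
    """Absolute difference in G+C content, in percentage points."""
    return abs(gc_content(genome_a) - gc_content(genome_b))


def within_species_range(genome_a: str, genome_b: str,
                         max_difference: float = 1.0) -> bool:
    """Check whether two genomes differ by at most max_difference points.

    The default of 1.0 mirrors the within-species bound reported above.
    """
    return gc_difference(genome_a, genome_b) <= max_difference


# Toy sequences only; real input would be complete genome sequences.
print(gc_content("ATGC"))
print(within_species_range("ATGC", "AATT"))
```

A difference above the bound argues against two strains belonging to the same species, whereas a difference within the bound does not by itself show that they do.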
Highly parallelized inference of large genome-based phylogenies
The methods described above for calculating intergenomic distances were applied to phylogenetic inference using a novel high-performance computing implementation. This allowed the analysis of large genome-scale nucleotide and amino acid datasets. Among several datasets, this approach was successfully applied to a set of Basidiomycota genomes, which are an order of magnitude larger than prokaryotic ones.