Integrating traits in polymorphism-aware trees to better model speciation

Efforts to understand the speciation history of taxa have been hampered by incongruence among phylogenetic trees from different genomic regions. Three different biological processes can cause incongruence: horizontal gene transfer, gene duplication and loss, and incomplete lineage sorting (ILS). Horizontal gene transfer has a major role in bacterial evolution, and gene duplication and loss are common throughout the entire tree of life. ILS has received considerable attention from a theoretical point of view (Degnan & Rosenberg, 2006). It occurs when genes coalesce not in extant species, but in the ancestral populations that gave rise to them. As a result, some genes from a species may cluster with sequences from a sister species rather than their own. This project aims to see the “wood for the trees”.
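To give a feel for how often ILS alone can produce incongruence, the sketch below uses the standard multispecies coalescent result for a rooted three-species tree: a gene tree disagrees with the species tree with probability (2/3)·exp(−T), where T is the length of the internal branch in coalescent units. The function name is illustrative and not taken from any software mentioned here.

```python
import math

def ils_discordance_probability(internal_branch_length):
    """Probability that a gene tree disagrees with a rooted three-species tree.

    internal_branch_length is measured in coalescent units
    (generations divided by twice the effective population size for diploids).
    """
    if internal_branch_length < 0:
        raise ValueError("branch length must be non-negative")
    # With probability exp(-T) the two sister lineages fail to coalesce on the
    # internal branch; two of the three equally likely topologies that can
    # then arise in the ancestral population are discordant.
    return (2.0 / 3.0) * math.exp(-internal_branch_length)

if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 2.0):
        print(f"T = {t:>4}: P(discordant) = {ils_discordance_probability(t):.3f}")
```

Short internal branches (rapid successive speciation events) therefore make discordant gene trees common even without gene transfer or duplication.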

In my group, we have developed an approach called Polymorphism-aware phylogenetic Models (PoMo), which is based on allele frequencies and so overcomes these limitations. Standard models treat substitutions as instantaneous events, but PoMo describes them as a process: substitutions start as mutations to new, low-frequency alleles, then experience a series of changes in allele frequency. The changes in allele frequencies are modelled by a continuous-time Markov chain based on DNA models (introduction of variation due to mutations) and the continuous Moran model (removal of variation due to genetic drift and natural selection). In this PhD project the approach will be extended to trait evolution (see Figure 1 below).
The project will develop the PoMo approach in a Bayesian framework with the following objectives:
(i) Integrate trait evolution into the model, so the method can be used to study genotype and binary phenotype data in a unified analysis. We will use self-incompatibility in plants as an application, as it is well studied and can act as a proof of principle.
(ii) Expand the approach from binary traits to multiple discrete states. Together with Andreanna Welch’s group, we will work on applications to sea birds (order Procellariiformes) using traits such as flight and foraging, as well as data from killer whales and their ecological traits.
(iii) Tackle time-series data of gene expression (RNA-Seq) as continuous function-valued traits in the context of species trees.
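The allele-frequency process described above can be sketched for a single binary character (two alleles, or two states of a binary trait). This is a minimal illustration under stated assumptions, not the group's implementation: it assumes a virtual population of N individuals, mutations that arise only in the fixed states, neutral Moran drift at rate n(N−n)/N, and no selection. The function and parameter names are hypothetical.

```python
def pomo_rate_matrix(n_virtual, mu_01, mu_10):
    """Rate matrix of a two-allele, PoMo-style continuous-time Markov chain.

    State n (0..n_virtual) is the number of individuals carrying allele 1 in
    a virtual population of size n_virtual. States 0 and n_virtual are fixed.
    Assumptions of this sketch: boundary mutations only, neutral drift.
    """
    if n_virtual < 1:
        raise ValueError("n_virtual must be at least 1")
    size = n_virtual + 1
    q = [[0.0] * size for _ in range(size)]

    # Mutation introduces variation: a fixed state gains one copy of the other allele.
    q[0][1] = mu_01
    q[n_virtual][n_virtual - 1] = mu_10

    # Moran drift removes variation: polymorphic states move one step up or down.
    for n in range(1, n_virtual):
        drift = n * (n_virtual - n) / n_virtual
        q[n][n + 1] = drift
        q[n][n - 1] = drift

    # Diagonal entries make every row sum to zero, as required for a rate matrix.
    for i in range(size):
        q[i][i] = -sum(q[i])
    return q

if __name__ == "__main__":
    for row in pomo_rate_matrix(4, mu_01=0.01, mu_10=0.02):
        print(["{:+.3f}".format(x) for x in row])
```

A substitution in this picture is a path from one fixed state to the other through the polymorphic states, rather than a single instantaneous jump.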

Image Captions

Individual nucleotide sites evolve through point mutations. Each gene evolves according to duplication, loss, and transfer events. Species and their genomes evolve according to a diversification process. When attempting to infer a species tree, only a fraction of the existing species and their genomes can be sampled. The PhD project will allow the use of phenotypic/trait data. Traits can be binary, discrete, continuous, or even function-valued.


The implementation of polymorphism-aware trait evolution in a Bayesian framework provides a new, flexible way to model evolutionary processes and obtain reliable estimates of biological parameters. The PhD project will couple the approach with numerical methods, such as Markov chain Monte Carlo (MCMC), for approximating the posterior probability distribution of parameters. Bayesian inference methods can be extremely powerful and have revolutionized the range of evolutionary questions that can be tackled. In particular, the Bayesian framework allows us to integrate different types of data: the molecular sequence data and (importantly) the phenotype/trait data. As the Bayesian implementations are anything but trivial, we will collaborate with Sebastian Höhna (Ludwig Maximilian University Munich, Germany) and Tracy Heath (Iowa State University, US). Both are internationally known for their contributions to the RevBayes software project (Höhna et al., 2016) and will provide training for the student in this field.
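As a toy illustration of how MCMC approximates a posterior distribution, the sketch below runs a random-walk Metropolis sampler for a single parameter: the population frequency p of one state of a binary trait, given that k of n sampled individuals show that state, under a uniform prior. This is an assumption-laden teaching example with hypothetical names; it is not RevBayes code and not the model to be developed in the project.

```python
import math
import random

def log_posterior(p, k, n):
    """Unnormalised log posterior: binomial likelihood with a uniform prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def metropolis(k, n, n_iter=20000, step=0.2, seed=1):
    """Random-walk Metropolis sampler for the trait frequency p."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, k, n)
    samples = []
    for _ in range(n_iter):
        # Symmetric proposal, so the acceptance ratio is just the posterior ratio.
        proposal = p + rng.uniform(-step, step)
        candidate = log_posterior(proposal, k, n)
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            p, current = proposal, candidate
        samples.append(p)
    return samples

if __name__ == "__main__":
    draws = metropolis(k=7, n=10)[2000:]  # discard the first 2000 draws as burn-in
    print("posterior mean estimate:", sum(draws) / len(draws))
    print("analytical posterior mean:", (7 + 1) / (10 + 2))
```

In this simple case the posterior is known exactly (a Beta distribution), which makes it easy to check the sampler; for phylogenetic models no such closed form exists, which is why numerical methods are needed.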

Project Timeline

Year 1

Develop new Bayesian tools for polymorphic binary traits and select plant groups for proof of principle. Creation of RNA-Seq libraries that are sent away for sequencing.

Year 2

Further development of the software for multiple discrete traits, testing of the software, and applications to real data sets. Cleaning and processing of the raw RNA-Seq data.

Year 3

Integration of continuous RNA-Seq expression data in the Bayesian approach. Analysis of RNA-Seq data.

Year 3.5

Integrate the approach into a user-friendly environment and write up the thesis.

Training & Skills

The student will receive training in (1) development and programming of Bayesian approaches for phylogenomics; (2) analysis of genome-wide data sets; (3) molecular methods for next-generation sequencing.

References & further reading

Borges R, Boussau B, Szöllősi GJ, Kosiol C (2022). Nucleotide usage biases distort inferences of the species tree. Genome Biology and Evolution, 14(1):evab290.

Borges R, Szöllősi GJ, Kosiol C (2019). Quantifying GC-Biased Gene Conversion in Great Ape Genomes Using Polymorphism-Aware Models. Genetics, 212(4):1321.

Degnan J, Rosenberg N (2006). Discordance of Species Trees with Their Most Likely Gene Trees. PLoS Genetics, 2(5):762.

De Maio N, Schrempf D, Kosiol C (2015). PoMo: An allele frequency-based approach for species tree estimation. Systematic Biology, 64(6):1018.

Höhna S et al. (2016). RevBayes: Bayesian Phylogenetic Inference Using Graphical Models and an Interactive Model-Specification Language. Systematic Biology, 65(4):726.

Rogers J et al. (2019). The comparative genomics and complex population history of Papio baboons. Science Advances, 5(1):eaau6947.
