Two algorithms based on Markov chains and their application to recognition of protein coding genes in prokaryotic genomes

Małgorzata Grabińska; Paweł Błażej; Paweł Mackiewicz

doi:10.4064/am40-4-5


Volume 40 / 2013

Małgorzata Grabińska, Paweł Błażej, Paweł Mackiewicz. Applicationes Mathematicae 40 (2013), 447-457. MSC: Primary 92D20, 60J10; Secondary 60J20, 62P10. DOI: 10.4064/am40-4-5

Abstract

Methods based on the theory of Markov chains are most commonly used in the recognition of protein coding sequences. However, they require large learning sets to fill all elements of the transition probability matrices that describe the dependence between nucleotides in the analyzed sequences. Moreover, gene prediction is strongly influenced by nucleotide bias, measured, e.g., by G+C content. In this paper we compare two methods: (i) the classical GeneMark (GM) algorithm, which uses a three-periodic non-homogeneous Markov chain, and (ii) an algorithm called PMC, which uses six independent homogeneous Markov chains to describe transitions between nucleotides separately for each of the three codon positions in the two DNA strands. We have tested the efficiency (in terms of true positive rate) of these two Markov chain methods on the model bacterial genome of Escherichia coli, depending on the size of the learning set, the uncertainty of the ORFs' function annotation, and the model order of the algorithms. We have also applied the methods with different model orders to 163 prokaryotic genomes covering a wide range of G+C content. Across different chain orders, the PMC algorithm turns out to be more stable than the GM algorithm. PMC also outperforms GM, giving a higher fraction of coding sequences in the tested set of annotated genes. Moreover, it requires much smaller learning sets than GM to work properly.
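The idea of separate homogeneous chains per codon position can be illustrated with a minimal Python sketch. This is not the authors' PMC implementation. It assumes, as one reading of the abstract, that each chain is a first-order chain over the subsequence of nucleotides found at a single codon position (every third nucleotide). It covers only one strand, so it builds three chains; the other three would be trained in the same way on reverse-complemented sequences. The function names `train_position_chains` and `log_likelihood`, the pseudocount smoothing, and the toy training data are all hypothetical choices made for illustration.

```python
import math

ALPHABET = "ACGT"


def train_position_chains(sequences, pseudocount=1.0):
    """Estimate one first-order transition matrix per codon position.

    Assumption (not from the paper): chain k models the subsequence
    seq[k], seq[k+3], seq[k+6], ... of in-frame coding sequences.
    """
    # counts[pos][a][b] = smoothed number of a -> b transitions at codon position pos
    counts = [
        {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
        for _ in range(3)
    ]
    for seq in sequences:
        for pos in range(3):
            sub = seq[pos::3]
            for a, b in zip(sub, sub[1:]):
                if a in ALPHABET and b in ALPHABET:  # skip ambiguous symbols
                    counts[pos][a][b] += 1

    # Normalize each row so that it becomes a probability distribution.
    chains = []
    for pos in range(3):
        chain = {}
        for a in ALPHABET:
            row_total = sum(counts[pos][a].values())
            chain[a] = {b: counts[pos][a][b] / row_total for b in ALPHABET}
        chains.append(chain)
    return chains


def log_likelihood(seq, chains):
    """Sum the log transition probabilities over the three position chains."""
    total = 0.0
    for pos in range(3):
        sub = seq[pos::3]
        for a, b in zip(sub, sub[1:]):
            if a in ALPHABET and b in ALPHABET:
                total += math.log(chains[pos][a][b])
    return total


# Toy learning set (fabricated for illustration only, not real genes).
training = ["ATGATGATGATG"] * 5
chains = train_position_chains(training)

in_frame = log_likelihood("ATGATGATG", chains)
other = log_likelihood("GTAGTAGTA", chains)
print(in_frame > other)
```

Because each chain has only 4 x 4 transition probabilities at first order, far fewer parameters must be estimated than for a single model conditioned on longer contexts, which is consistent with the abstract's remark that PMC works with smaller learning sets. A real classifier would compare such scores against models for the other reading frames and for non-coding sequence.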

Authors

Małgorzata Grabińska
Department of Genomics
Faculty of Biotechnology
University of Wrocław
Przybyszewskiego 63/77
51-148 Wrocław, Poland

Paweł Błażej
Department of Genomics
Faculty of Biotechnology
University of Wrocław
Przybyszewskiego 63/77
51-148 Wrocław, Poland

Paweł Mackiewicz
Department of Genomics
Faculty of Biotechnology
University of Wrocław
Przybyszewskiego 63/77
51-148 Wrocław, Poland

Free download under CC-BY license
