Semi-supervised learning of Hidden Markov Models for biological sequence analysis

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Semi-supervised learning of Hidden Markov Models for biological sequence analysis. / Tamposis, Ioannis A; Tsirigos, Konstantinos D.; Theodoropoulou, Margarita C; Kontou, Panagiota I; Bagos, Pantelis G.

In: Bioinformatics, Vol. 35, No. 13, 2019, p. 2208-2215.


Harvard

Tamposis, IA, Tsirigos, KD, Theodoropoulou, MC, Kontou, PI & Bagos, PG 2019, 'Semi-supervised learning of Hidden Markov Models for biological sequence analysis', Bioinformatics, vol. 35, no. 13, pp. 2208-2215. https://doi.org/10.1093/bioinformatics/bty910

APA

Tamposis, I. A., Tsirigos, K. D., Theodoropoulou, M. C., Kontou, P. I., & Bagos, P. G. (2019). Semi-supervised learning of Hidden Markov Models for biological sequence analysis. Bioinformatics, 35(13), 2208-2215. https://doi.org/10.1093/bioinformatics/bty910

Vancouver

Tamposis IA, Tsirigos KD, Theodoropoulou MC, Kontou PI, Bagos PG. Semi-supervised learning of Hidden Markov Models for biological sequence analysis. Bioinformatics. 2019;35(13):2208-2215. https://doi.org/10.1093/bioinformatics/bty910

Author

Tamposis, Ioannis A ; Tsirigos, Konstantinos D. ; Theodoropoulou, Margarita C ; Kontou, Panagiota I ; Bagos, Pantelis G. / Semi-supervised learning of Hidden Markov Models for biological sequence analysis. In: Bioinformatics. 2019 ; Vol. 35, No. 13. pp. 2208-2215.

Bibtex

@article{ba3025dc5c934b0b8341705b800c8889,
title = "Semi-supervised learning of Hidden Markov Models for biological sequence analysis",
abstract = "MOTIVATION: Hidden Markov Models (HMMs) are probabilistic models widely used in applications in computational sequence analysis. HMMs are basically unsupervised models. However, in the most important applications, they are trained in a supervised manner. Training examples accompanied by labels corresponding to different classes are given as input and the set of parameters that maximize the joint probability of sequences and labels is estimated. A main problem with this approach is that, in the majority of the cases, labels are hard to find and thus the amount of training data is limited. On the other hand, there are plenty of unclassified (unlabeled) sequences deposited in the public databases that could potentially contribute to the training procedure. This approach is called semi-supervised learning and could be very helpful in many applications. RESULTS: We propose here, a method for semi-supervised learning of HMMs that can incorporate labeled, unlabeled and partially labeled data in a straightforward manner. The algorithm is based on a variant of the Expectation-Maximization (EM) algorithm, where the missing labels of the unlabeled or partially labeled data are considered as the missing data. We apply the algorithm to several biological problems, namely, for the prediction of transmembrane protein topology for alpha-helical and beta-barrel membrane proteins and for the prediction of archaeal signal peptides. The results are very promising, since the algorithms presented here can significantly improve the prediction performance of even the top-scoring classifiers. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.",
author = "Tamposis, {Ioannis A} and Tsirigos, {Konstantinos D.} and Theodoropoulou, {Margarita C} and Kontou, {Panagiota I} and Bagos, {Pantelis G}",
year = "2019",
doi = "10.1093/bioinformatics/bty910",
language = "English",
volume = "35",
pages = "2208--2215",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "13",
}

RIS

TY - JOUR

T1 - Semi-supervised learning of Hidden Markov Models for biological sequence analysis

AU - Tamposis, Ioannis A

AU - Tsirigos, Konstantinos D.

AU - Theodoropoulou, Margarita C

AU - Kontou, Panagiota I

AU - Bagos, Pantelis G

PY - 2019

Y1 - 2019

N2 - MOTIVATION: Hidden Markov Models (HMMs) are probabilistic models widely used in applications in computational sequence analysis. HMMs are basically unsupervised models. However, in the most important applications, they are trained in a supervised manner. Training examples accompanied by labels corresponding to different classes are given as input and the set of parameters that maximize the joint probability of sequences and labels is estimated. A main problem with this approach is that, in the majority of the cases, labels are hard to find and thus the amount of training data is limited. On the other hand, there are plenty of unclassified (unlabeled) sequences deposited in the public databases that could potentially contribute to the training procedure. This approach is called semi-supervised learning and could be very helpful in many applications. RESULTS: We propose here, a method for semi-supervised learning of HMMs that can incorporate labeled, unlabeled and partially labeled data in a straightforward manner. The algorithm is based on a variant of the Expectation-Maximization (EM) algorithm, where the missing labels of the unlabeled or partially labeled data are considered as the missing data. We apply the algorithm to several biological problems, namely, for the prediction of transmembrane protein topology for alpha-helical and beta-barrel membrane proteins and for the prediction of archaeal signal peptides. The results are very promising, since the algorithms presented here can significantly improve the prediction performance of even the top-scoring classifiers. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

AB - MOTIVATION: Hidden Markov Models (HMMs) are probabilistic models widely used in applications in computational sequence analysis. HMMs are basically unsupervised models. However, in the most important applications, they are trained in a supervised manner. Training examples accompanied by labels corresponding to different classes are given as input and the set of parameters that maximize the joint probability of sequences and labels is estimated. A main problem with this approach is that, in the majority of the cases, labels are hard to find and thus the amount of training data is limited. On the other hand, there are plenty of unclassified (unlabeled) sequences deposited in the public databases that could potentially contribute to the training procedure. This approach is called semi-supervised learning and could be very helpful in many applications. RESULTS: We propose here, a method for semi-supervised learning of HMMs that can incorporate labeled, unlabeled and partially labeled data in a straightforward manner. The algorithm is based on a variant of the Expectation-Maximization (EM) algorithm, where the missing labels of the unlabeled or partially labeled data are considered as the missing data. We apply the algorithm to several biological problems, namely, for the prediction of transmembrane protein topology for alpha-helical and beta-barrel membrane proteins and for the prediction of archaeal signal peptides. The results are very promising, since the algorithms presented here can significantly improve the prediction performance of even the top-scoring classifiers. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

U2 - 10.1093/bioinformatics/bty910

DO - 10.1093/bioinformatics/bty910

M3 - Journal article

C2 - 30445435

VL - 35

SP - 2208

EP - 2215

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 13

ER -
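The abstract describes an EM variant in which the missing state labels of unlabeled (or partially labeled) sequences are treated as missing data. Below is a minimal sketch of that general idea on a toy discrete HMM, not the authors' implementation: hard counts from labeled (observation, state-path) pairs are pooled with forward-backward expected counts from unlabeled sequences in a single M-step. The function names (`forward_backward`, `semi_supervised_em`) and all toy parameters are illustrative assumptions.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    # Scaled forward-backward pass for one discrete observation
    # sequence. Returns per-position state posteriors (gamma),
    # expected transition counts summed over time (xi), and the
    # sequence log-likelihood.
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta
    xi = np.zeros((N, N))
    for t in range(T - 1):
        xi += (alpha[t][:, None] * A
               * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / scale[t + 1]
    return gamma, xi, np.log(scale).sum()

def semi_supervised_em(labeled, unlabeled, n_states, n_symbols,
                       n_iter=25, pseudo=0.1):
    # Pools hard counts from labeled (observation, state-path) pairs
    # with expected counts from unlabeled sequences in one M-step;
    # `pseudo` is a smoothing pseudocount.
    rng = np.random.default_rng(0)
    pi = np.full(n_states, 1.0 / n_states)
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)
    for _ in range(n_iter):
        pi_c = np.full(n_states, pseudo)
        A_c = np.full((n_states, n_states), pseudo)
        B_c = np.full((n_states, n_symbols), pseudo)
        for obs, states in labeled:           # supervised hard counts
            pi_c[states[0]] += 1
            for t in range(len(obs) - 1):
                A_c[states[t], states[t + 1]] += 1
            for t, o in enumerate(obs):
                B_c[states[t], o] += 1
        for obs in unlabeled:                 # E-step expected counts
            gamma, xi, _ = forward_backward(obs, pi, A, B)
            pi_c += gamma[0]
            A_c += xi
            for t, o in enumerate(obs):
                B_c[:, o] += gamma[t]
        pi = pi_c / pi_c.sum()                # M-step: normalise
        A = A_c / A_c.sum(axis=1, keepdims=True)
        B = B_c / B_c.sum(axis=1, keepdims=True)
    return pi, A, B

pi, A, B = semi_supervised_em(
    labeled=[([0, 0, 1, 1], [0, 0, 1, 1]), ([1, 1, 0, 0], [1, 1, 0, 0])],
    unlabeled=[[0, 0, 0, 1, 1]],
    n_states=2, n_symbols=2)
```

Partially labeled sequences, which the paper also handles, could be treated analogously by constraining the forward-backward pass to the known labels at the labeled positions; that extension is omitted here for brevity.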
