title: Application of latent semantic analysis to protein remote homology detection
abstract: A summary of an article recently published in Bioinformatics.
Motivation: Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods such as the Support Vector Machine (SVM) are among the most effective approaches. Many SVM-based methods focus on finding useful representations of protein sequences, using either explicit feature vector representations or kernel functions. Such representations may suffer from the peaking phenomenon seen in many machine-learning methods, because the number of features is usually very large and noisy data may be introduced. Based on these observations, this research focuses on feature extraction and efficient representation of protein vectors for SVM protein classification.
Results: In this study, a latent semantic analysis model, an efficient feature extraction technique from natural language processing, is introduced into protein remote homology detection. Several basic building blocks of protein sequences are investigated as the “words” of the “protein sequence language”, including N-grams, patterns and motifs. Each protein sequence is treated as a “document” composed of a bag of words. The word-document matrix is constructed first. Latent semantic analysis is then performed on this matrix to produce latent semantic representation vectors of the protein sequences, which removes noise and gives a compact description of each sequence. The latent semantic representation vectors are then evaluated with an SVM. The method is tested on the SCOP 1.53 database. The results show that the latent semantic analysis model significantly improves the performance of remote homology detection compared with the basic formalisms. Furthermore, the performance of this method is comparable with that of complex kernel methods such as SVM-LA and better than that of other sequence-based methods such as PSI-BLAST and SVM-pairwise.
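
The pipeline described in the Results can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes 3-grams as the "words", raw counts as matrix entries, and a plain truncated SVD as the latent semantic analysis step. The toy sequences, the function names and the rank of 2 are hypothetical choices made for illustration only; the article's actual weighting scheme, rank and SVM training step are not shown here.

```python
from collections import Counter

import numpy as np


def ngrams(seq, n=3):
    """Return the overlapping N-grams ("words") of one protein sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def word_document_matrix(seqs, n=3):
    """Build a word-document count matrix: rows are N-grams, columns are sequences."""
    counts = [Counter(ngrams(s, n)) for s in seqs]
    vocab = sorted(set().union(*counts))
    index = {w: i for i, w in enumerate(vocab)}
    W = np.zeros((len(vocab), len(seqs)))
    for j, c in enumerate(counts):
        for w, k in c.items():
            W[index[w], j] = k
    return W, vocab


def lsa_vectors(W, rank):
    """Project each document onto the top `rank` singular directions of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Row j is the latent representation of document j (scaled by singular values).
    return (np.diag(s[:rank]) @ Vt[:rank]).T


# Hypothetical toy sequences, used only to exercise the code.
seqs = ["MKVLAAGIVG", "MKVLSAGIVG", "GSTPPLWQNC", "GSTPALWQNC"]
W, vocab = word_document_matrix(seqs, n=3)
vectors = lsa_vectors(W, rank=2)  # assumed rank; the article's setting may differ
print(W.shape, vectors.shape)
```

Each row of `vectors` could then be passed to an SVM classifier as the feature vector of the corresponding sequence. Keeping only the leading singular directions is what discards the low-variance components that the abstract refers to as noise.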