The Identification of Chlamydomonas Poly(A) Site Based on Cis-elements

Abstract: Because the genome sequences of species represent the first closed data set in biology, structural annotation of genomes, which includes predicting gene composition, gene structure, and gene regulators in genomic DNA sequences, has become a core issue in bioinformatics. As the number of sequenced genomes grows, analyzing this large volume of sequence data and extracting valuable information from it becomes increasingly important. Identifying poly(A) sites in mRNA sequences is an important component of gene identification: predicting the poly(A) site correctly helps to predict gene boundaries and is therefore significant for genome analysis. It is generally accepted that the signals required for recognizing polyadenylation sites reside near the cleavage site (the poly(A) site). A hexamer, AAUAAA or a close variant, is located between 10 and 35 nt upstream of many mammalian poly(A) sites and is usually referred to as the polyadenylation signal (PAS). We found that most cis elements found in yeast and mammals also exist in Arabidopsis poly(A) regions. We therefore suggest that many cis elements surrounding the poly(A) site are evolutionarily conserved among all eukaryotes and that these cis elements take part in regulating mRNA polyadenylation. In this thesis, I build a recognition model based on hierarchical clustering to find Chlamydomonas poly(A) sites. First, hexamers are extracted from the training data set using the Z algorithm and then screened with a supervised attribute filter in WEKA to obtain the cis elements. Second, a distance between cis elements is defined, and cis element groups are obtained by agglomerative hierarchical clustering. Third, these cis element groups are converted to position-specific scoring matrices (PSSMs) according to specific rules, sequences are transformed into numeric features, and a decision tree classifier and Bayesian networks are used to complete the model. The model derives clusters from the training data set and judges whether each sequence in the test data set contains a poly(A) site. I tested the model on the test data and analyzed the results, which were satisfactory. The test results show that the model is feasible and effective for recognizing poly(A) sites…
Keywords: poly(A) site; cis element; feature extraction; hierarchical clustering
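
The three steps outlined in the abstract (hexamer extraction, grouping of cis elements by agglomerative clustering, and conversion of each group into a PSSM) can be illustrated with a small sketch. This is not the thesis implementation. The abstract does not state the distance measure, the linkage rule, or the PSSM rules, so the sketch assumes Hamming distance, single linkage with a cutoff, and log-odds scores with a pseudocount against a uniform background. A plain sliding window stands in for the Z algorithm, and the WEKA filtering and the final classifiers are omitted. All function names and the toy motifs are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def count_hexamers(seqs, k=6):
    """Count every k-mer (hexamers by default) with a sliding window."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def hamming(a, b):
    """Number of mismatching positions (assumed distance between elements)."""
    return sum(x != y for x, y in zip(a, b))


def group_distance(g1, g2):
    """Single linkage: distance between the closest pair of members."""
    return min(hamming(a, b) for a in g1 for b in g2)


def cluster(elements, max_dist=1):
    """Agglomerative clustering: repeatedly merge the two closest groups
    until the closest pair is farther apart than max_dist."""
    groups = [[e] for e in elements]
    while len(groups) > 1:
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]]),
        )
        if group_distance(groups[i], groups[j]) > max_dist:
            break
        groups[i] += groups.pop(j)  # j > i, so index i stays valid
    return groups


def build_pssm(group, alphabet="ACGU", pseudo=1.0, background=0.25):
    """Log-odds PSSM with a pseudocount (assumed scoring rule)."""
    total = len(group) + pseudo * len(alphabet)
    pssm = []
    for pos in range(len(group[0])):
        col = Counter(m[pos] for m in group)
        pssm.append({
            b: math.log2(((col[b] + pseudo) / total) / background)
            for b in alphabet
        })
    return pssm


def best_score(seq, pssm):
    """Numeric feature: highest PSSM score over all windows of the sequence."""
    k = len(pssm)
    return max(
        sum(pssm[i][seq[j + i]] for i in range(k))
        for j in range(len(seq) - k + 1)
    )


# Toy usage: AAUAAA and a one-mismatch variant fall into one group.
groups = cluster(["AAUAAA", "AUUAAA", "GGGCCC"], max_dist=1)
pssm = build_pssm(groups[0])
feature = best_score("GGAAUAAAGG", pssm)
```

In this sketch, each group yields one numeric feature per sequence (its best window score), and a vector of such features is the kind of input a decision tree or Bayesian network classifier could then be trained on.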
