Gene Identification and Prediction

GENE IDENTIFICATION AND PREDICTION

11.1 INTRODUCTION

The objective of gene prediction is to identify regions of genomic DNA that encode proteins. It is based on the statistical analysis of sequence bias in genome coding regions. Gene prediction programs are used to sift through new sequences and then annotate the sequence database entry with this information. Many of the gene-finding programs are similarity-based, i.e. the prediction is done by sequence similarity at either the protein or nucleotide level to known proteins or Expressed Sequence Tags (ESTs) to identify the genome regions likely to be protein coding. If a region has similarity to a known EST or protein, it provides support that the region likely encodes an exon. This information can be used to refine regions that are likely exons for closer examination in building a gene model.

Many programs have features in common and can distinguish gene sequences characteristic of exons, introns, splicing sites, and other regulatory sites of expressed genes from non-gene sequences that lack these patterns. However, a program trained on one organism may not work as efficiently for another. DNA sequencing can be used to confirm gene identification. If EST sequences covering a large part of the genome are available, they can also be used to confirm predicted gene sequences.

11.2 BASIS OF GENE PREDICTION

A gene is a segment of DNA that is expressed to yield a functional product such as a polypeptide. The features of gene structure that are important for gene recognition are summarized in Table 11.1 and shown in Figure 11.1.
FIGURE 11.1 A generalized structure of genes transcribed by RNA polymerase II, displaying various structural and functional domains. (Figure labels: upstream regulatory elements (enhancers and silencers), CAAT box, TATA box, transcription start site, 5' untranslated leader, translation start site (ATG), exon 1, intron with GT and AG boundaries, exon 2, termination site, poly A signals, 3' untranslated trailer; upstream regulatory region, transcription unit, downstream region.)

TABLE 11.1 Summary of Gene Structure

Region                        Features
Upstream intergenic region    Intergenic region
Promoter                      For example: TATA box with consensus sequence TATA(A/T)A(A/T), and Inr with consensus sequence YYAN(T/A)YY
First exon                    Translational start, coding sequences (CDS)/open reading frames (ORF)
Intron(s)                     Frequent stop codons and enhancer sites
Exon(s)                       CDS/ORF
Intron(s)                     Frequent stop codons and enhancer sites
Last exon                     CDS/ORF, translational stop, 3'-UTR, polyA-insertion site, transcriptional stop
Downstream intergenic region  Intergenic region

Most prokaryotic genes are represented only once in the genome. This is not true of eukaryotic genes, which may be present in multiple copies and may be part of gene families. A comparison of prokaryotic and eukaryotic genes is summarized in Table 11.2 for reference.

TABLE 11.2 Comparison of Prokaryotes and Eukaryotes

Prokaryotes                 Eukaryotes
Small genomes               Large genomes
High gene density           Low gene density
No introns (or splicing)    Introns (splicing)
No RNA processing           RNA processing
Similar promoters           Heterogeneous promoters
Terminators important       Terminators not important
                            Polyadenylation

Most introns have no known cellular function. However, some of them encode functional RNAs or proteins. It is hypothesized that introns represent vestiges of sequences that were important earlier in evolution.
It is thought that introns may have helped accelerate evolution by exon shuffling (facilitating recombination between exons of different genes). Sometimes, exons encode functionally distinct protein domains, and recombination between introns of different genes results in novel genes. Some genes are chimeras of exons from several other genes, providing direct evidence that new genes can be formed by recombination between intron sequences.

As discussed earlier, the process of copying the portion of the DNA containing a gene into RNA is transcription (Figure 11.2). There are signals in the DNA sequence that start transcription by the enzyme RNA polymerase. RNA pol binds to a specific pattern, the consensus sequence TTGACAN17TATAAT, where N17 stands for any 17 bases. (If we compare all very strong promoters we get an average sequence for strong promoters; this is called a consensus sequence.) This sequence is positioned 10 bp upstream (i.e. at -10) of the first nucleotide (labelled +1) of each gene. There are other binding sites upstream from the gene, called promoters, that signal when to express the gene or inhibit it from being expressed as RNA.

FIGURE 11.2 A generalized structure of a typical RNA polymerase II transcribed gene showing various structural and functional domains. The upstream regulatory elements include enhancers and silencers. (Figure labels: upstream site, GC, CAAT, TATA (Hogness box), transcription start site, 100 base pairs.)

Transcription stops when the polymerase encounters the signal to terminate: sometimes a GC-rich region that can form a hairpin loop, followed by a run of Us in the transcript (As in the template strand). This sequence is referred to as the transcriptional terminator. The GC hairpin disrupts the binding of the mRNA to the DNA template.

Some genes have weak and others strong promoters. Strong promoters have a sequence close to the ideal consensus TTGACAN17TATAAT.
RNA pol does not overlook these promoters under normal circumstances and these genes may be transcribed at a frequency of once every 2 seconds. Other promoters can have slightly different sequences. For example, there may be changes in the TTGACA (-35 box), in the TATAAT (-10 or Pribnow box), or in the spacing between the two motifs (Figure 11.3). Thus strongly and weakly expressed genes may simply have different promoters: some proteins are needed at 10,000 molecules/cell, others at 10 copies/cell.

FIGURE 11.3 The prokaryotic promoter showing the -10 sequence (Pribnow box) and the -35 sequence separated by a distance of between 16 and 18 bp. The first nucleotide of the DNA template is transcribed into RNA at the transcriptional start site. (Figure labels: -35 sequence TTGACA, 16-19 bases, -10 sequence TATAAT, 5-9 bases, transcription start site +1, mRNA.)

Translation

Translation is the process of building proteins from the 5' end of the mRNA according to the template RNA. Proteins are made in an N to C direction, from the amino to the carboxyl terminus (see Figure 11.4). Ribosomes bind to a region just upstream of the AUG (methionine) codon and are thus positioned ready to begin translation. This sequence (which is analogous to the promoter in DNA) is called a Shine-Dalgarno sequence, or Ribosome Binding Site (RBS). The typical sequence is AGGAGGA and it is complementary to the 3' end of the 16S RNA, UCCUCCUA-OH 3'. In polycistronic messengers this sequence often occurs just upstream of each initiation codon. No similar sequence is found in eukaryotes, where the 5' end of the mRNA is modified by a cap. Ribosomes are believed to bind and scan until they find the first AUG. The sequences surrounding the AUG help in recognition by the ribosome; ACC AUG G (Kozak consensus sequence) is optimal.

FIGURE 11.4 Transcription begins at the specific promoter site on the DNA template (Pribnow box). PB: Pribnow box, Pr: promoter, I: initiation, CS: coding sequence, T: terminator, hnRNA: heterogeneous RNA, mRNA: messenger RNA.
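The promoter layout described above (a -35 box, a variable spacer, then a -10 box) can be searched for with a regular expression. The following is a minimal sketch, not a real promoter predictor: it only reports exact matches to the two consensus boxes, the function names are hypothetical, and the 16 to 19 base spacer range is an assumption taken from the labels of Figure 11.3.

```python
import re

def promoter_regex(min_gap=16, max_gap=19):
    """Build a pattern for TTGACA, a spacer of min_gap..max_gap bases, then TATAAT.

    The lookahead (?=...) lets overlapping candidates be reported too.
    The spacer range is an assumption for illustration.
    """
    spacer = "[ACGT]{%d,%d}" % (min_gap, max_gap)
    return re.compile("(?=(TTGACA)(" + spacer + ")(TATAAT))")

def find_promoters(seq, min_gap=16, max_gap=19):
    """Return (start of -35 box, spacer length, start of -10 box) for each hit."""
    hits = []
    for m in promoter_regex(min_gap, max_gap).finditer(seq.upper()):
        hits.append((m.start(1), len(m.group(2)), m.start(3)))
    return hits

# Toy sequence invented for illustration: an ideal promoter with a 17-base spacer.
toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
print(find_promoters(toy))
```

Real promoters rarely match the consensus exactly, so a practical search would tolerate mismatches or score each box against a weight matrix.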
Codon Bias

Genes are usually predicted as the regions of a genome that encode proteins and have a biased sequence composition compared to non-coding regions. There is a bias in sequence due to the properties of coding regions or to the tendencies for bases to occur at particular positions within some codons. The unequal use of possible codons is characteristic of organisms and is known as codon bias. It arises from the unequal use of amino acids in proteins and of the alternative codons for an amino acid. A comparison of the codon frequency in a region of the genome with the typical codon frequency in the same species may give an idea of the regions coding for a particular protein. Thus, regions in which codons are used with frequencies similar to the frequencies observed in known protein coding regions can be identified as possible coding regions.

If you consider the G + C composition of a region, you would note that there are many organisms with a skewed composition of G + C bases that leads to a biased codon composition. It is hypothesized that in many such cases, non-coding regions tend to reflect the overall G + C composition of the organism. In the coding regions, natural selection and the restriction of nucleotides for coding lead to codon bias.

Streptomyces coelicolor is a model representative of a group of soil-dwelling organisms with a complex lifecycle involving mycelial growth and formation of spores. In Streptomyces coelicolor, the chromosome is 8,667,507 bp long with a G + C content of 72.1%, and is predicted to contain 7825 protein encoding genes. The coding regions have about 70% G + C at the first codon position, about 50% at the second codon position, and more than 95% at the third codon position. In the non-coding DNA sequence, by contrast, all three positions show about the same G + C content, reflecting the overall composition of the organism.
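The position-specific G + C skew described for Streptomyces coelicolor can be measured directly. The sketch below is an illustration only: the function name is hypothetical, the input is assumed to be an in-frame coding sequence, and the example sequence is invented.

```python
def gc_by_codon_position(cds):
    """Return the G + C fraction at codon positions 1, 2 and 3.

    Assumes `cds` starts in frame; trailing bases that do not fill a codon
    are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    fractions = []
    for pos in range(3):
        bases = cds[pos:usable:3]          # every third base, starting at pos
        gc = sum(1 for b in bases if b in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return tuple(fractions)

# Invented four-codon example: ATG GCC GAC TGA
print(gc_by_codon_position("ATGGCCGACTGA"))
```

Applied in a sliding window to all three frames, a strong and consistent difference between the three positions is one hint that a region is coding.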
A coding statistic is a function that uses a DNA sequence to compute a real number related to the likelihood that the sequence is coding for a protein. Most coding statistics measure one or more of the following:

• Codon usage bias
• Base compositional bias (among various codon positions)
• Base occurrence periodicity

These biases can be used to make predictions of the likely coding potential for a region of DNA. When an expressed sequence is aligned to genomic DNA, introns are identified as large gaps in the alignment, typically (but not necessarily) flanked by the consensus GT and AG dinucleotides at the donor and acceptor sites, respectively.

11.3 PATTERN RECOGNITION

Pattern recognition is the method of scanning a nucleic acid or a protein sequence for matches to short sequence patterns. These short patterns can be important indicators of some biological function. The presence of the matching pattern in the target nucleic acid or protein sequence is a signal of the same function for the target gene or protein sequence. Patterns most often examined in DNA sequences are given in Table 11.3.

TABLE 11.3 Summary of Gene Features and Corresponding DNA Characteristics

Gene Feature                          DNA Characteristic
Coding sequences (CDS)                Open reading frames (ORFs); GC-rich, CpG-content
Translational start and stop sites    Codons: Start (ATG) and Stop (TAA, TAG, TGA)
Splice sites (Exon/Intron borders)    Consensus sequences
Promoter regions                      TATA, Shine-Dalgarno, Pribnow, Kozak consensus; CpG-content
PolyA-signals                         Consensus sequences (characteristic nucleotide combinations at about 10-20 nucleotides upstream of the insertion site for the polyA-tail)
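Consensus patterns like those in Table 11.3 are often written with IUPAC ambiguity codes (for example W for A or T, Y for C or T). As a rough sketch of pattern scanning, the code below converts such a pattern into a regular expression and lists every match; the function names are hypothetical, and the TATA pattern TATAWAW is the consensus TATA(A/T)A(A/T) from Table 11.1 rewritten in IUPAC form.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as TATAWAW into regex syntax."""
    return "".join(IUPAC[ch] for ch in pattern.upper())

def find_pattern(pattern, seq):
    """Return (position, matched text) for every, possibly overlapping, match."""
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(1), m.group(1)) for m in regex.finditer(seq.upper())]

# Invented example sequence containing one TATA-like site.
print(find_pattern("TATAWAW", "GGCTATAAAAGGC"))
```

Variable spacing between conserved positions can be expressed in the same way by inserting a repeat such as `[ACGT]{16,19}` between two converted sub-patterns.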
Options for making the pattern search more flexible include:

• Presence of ambiguous symbols in the specified patterns
• Variable spacing between matched positions
• Choice of alternative matches to particular positions
• Matches that include gaps in the pattern or target sequence

Sequences of related protein families sometimes also have multiple consensus patterns that improve the prediction of function (for example, using the BLOCKS database). Pattern matching in sequences is also the basis for performing rapid searches through a sequence database for the closest matches to a given sequence by the BLAST and FASTA programs. FASTA has been successful in locating unidentified DNA binding sites for E. coli LexA protein.

11.4 GENE PREDICTION METHODS

There are several methods of gene prediction, differing in the approach, algorithm, and the efficacy of prediction. Important methods of gene prediction are as follows:

• Laboratory-based approaches
• Feature-based approaches
• Homology-based approaches
• Statistical and HMM-based approaches

Laboratory-based Approaches

This is the traditional way to find a gene, and includes experimental procedures for locating genes in a sample of DNA. These can now be discussed.

Identification using blotting methods

Blotting is a technique used for detection of nucleic acids and proteins. The technique relies on the transfer of biomolecules onto a membrane support. The entire procedure involves the following steps:

• Preparing a cell-free extract containing the biomolecule(s) of interest
• Resolving the mixture by gel electrophoresis
• Transferring the resolved mixture onto a membrane support such as nitrocellulose
• Incubating the paper with a detection system that specifically hybridizes to the molecule of interest.
When DNA is blotted, the technique is termed Southern blotting, whereas the terms Northern and Western blotting are used for RNA and protein, respectively.

Southern blotting. Suppose one of the cDNA clones you isolate and sequence represents a new gene. If you are interested in studying this gene further, you might want to determine the structure of the gene by identifying introns, exons, and regulatory elements. An essential step for such analysis is Southern blotting, which is a very popular technique used for a variety of purposes. It is used to determine the size and arrangement of the genomic copy of a gene, to determine the number of genes related to a clone of interest, and to investigate the evolutionary conservation of a gene. The Southern blotting technique has been used for understanding a variety of biological processes such as RNA splicing and genomic rearrangements to form antibodies and T cell receptors. This technique has also played a major role in the identification of numerous rearranged genes that are associated with a variety of human genetic disorders and cancers. With the introduction of high-resolving gel systems, it is now possible to use Southern blotting to detect gene mutations involving single base-pair changes. This has led to the early diagnosis and prevention of potentially harmful diseases.

Northern blotting. Northern blotting is a technique used to examine the size, and the temporal and spatial expression pattern, of specific RNAs. Usually, total cellular RNA or poly(A) RNA is isolated and separated by size on an agarose gel. The RNA molecules in the gel are transferred to nitrocellulose paper or nylon as described above and detected using an appropriate DNA or RNA probe.

Although Southern and Northern blotting techniques exhibit a number of similarities, there are also several important differences. The major difference is in the extreme care required to isolate non-degraded RNA.
Full-length mRNA isolation is an important goal in generating high quality cDNA libraries. The difficulty in RNA isolation is that most ribonucleases are very stable and active and do not require activators or cofactors to function. As a precaution, the first step in all RNA purification procedures is to lyse the cell in a solution that denatures, and thus inactivates, ribonucleases. Another difference between the two procedures is in the type of gel used to resolve RNA. Unlike double-stranded DNA, which migrates as a function of its length, RNA can engage in non-uniform amounts of intra-molecular base pairing. Therefore, RNA must undergo electrophoresis under denaturing conditions if it is to migrate as a function of nucleotide length.

Northern blotting is useful because the size of a specific mRNA can be compared with the size of cloned cDNAs, revealing whether the cloned cDNA is full-length. This technique can also indicate which tissues express a particular gene or the factors that regulate its expression. As an example, if you have isolated a cDNA that is suspected to be induced by a growth factor, you could first stimulate cells with the growth factor and isolate total RNA at intervals following stimulation. The RNA isolated at each point of time would then be subjected to Northern blotting using the cloned cDNA as a probe. If the result shows that the abundance of the mRNA in question is low in untreated cells but increases significantly following stimulation, this is good evidence that the expression of the gene is indeed regulated by growth factors.

Zoo blots. Zoo blots are used to look for sequences in other organisms that have homology to the new gene; such conservation can be predictive of function.

S1 nuclease mapping

With S1 nuclease mapping and primer extension, you can identify the 5'-end of the gene (see Figure 11.5).

FIGURE 11.5 Identification of the 5'-end of the gene.

S1 nuclease is an endonuclease that degrades single-stranded RNA or DNA to mononucleotides.
S1 nuclease is useful for the removal of unpaired regions following hybridization and is used in S1 nuclease mapping, a technique for the removal of single-stranded regions of DNA and RNA from double-stranded hybrids of DNA and RNA. In S1 nuclease mapping, a DNA probe that is labeled at its 5' end and overlaps the 5'-end of the gene or the end of an exon is hybridized to the RNA transcribed from the gene. The resulting 5'-labeled DNA probe is then "sized" using a Southern blot. Its size pinpoints the 5'-end of the gene or exon (Figure 11.6).

FIGURE 11.6 S1 nuclease mapping to identify the 5'-end of the gene. (Figure steps: digest with S1 endonuclease; denature and electrophorese.)

Primer extension

Primer extension is the technique used to map the 5' ends of DNA or RNA fragments. It involves annealing a specific oligonucleotide primer to a position downstream of the 5' end. The primer is labeled, usually at its 5' end, with 32P. This is extended with reverse transcriptase, which can copy an RNA or a DNA template, making a fragment that ends at the 5' end of the template molecule. DNA polymerase can also be used with DNA templates (Figure 11.7).

FIGURE 11.7 Primer extension technique to map 5' ends of DNA or RNA. (Figure steps: extend with reverse transcriptase; denature and electrophorese.)

Exon trapping

Exon trapping or exon amplification is a method to find expressed DNA sequences in a genome sequence and is based on selection for functional splice sites in genomic DNA. The advantages of exon trapping are that it does not require any prior knowledge about tissue-specific gene expression and can easily be performed on complex genomes. It can identify constitutive exons as well as alternative exons but cannot be used to identify intronless genes. In exon trapping, the genomic sequence is cloned into an intron (flanked by two exons) using a specialized exon trapping vector. This construct is expressed through a strong promoter.
If the genomic fragment contains an exon, it will be spliced into the resulting mRNA, changing its size and allowing its detection.

Reverse transcriptase-polymerase chain reaction (RT-PCR)

RT-PCR can be used to detect the RNA transcript of any gene, irrespective of the quantity of the specific mRNA. In RT-PCR, an RNA template is copied into a complementary DNA (cDNA) using a retroviral reverse transcriptase. The cDNA is then amplified exponentially by PCR. As with nuclease protection assays (NPAs), RT-PCR is somewhat tolerant of degraded RNA. As long as the RNA is intact within the region spanned by the primers, the target will be amplified.

Although RT-PCR is the most sensitive method of mRNA detection available, it also has some drawbacks. It can be the most technically challenging method of detection and quantitation, often requiring substantial pre-experimental planning and design. Also, because of its extreme sensitivity, even minute amounts of contamination by genomic DNA or previously amplified PCR products can lead to aberrant results, so steps must be taken to avoid any contamination.

In situ hybridization (ISH)

ISH is a powerful and versatile tool for the localization of specific mRNAs in cells or tissues. Unlike Northern blotting and nuclease mapping assays, ISH does not require isolation or electrophoretic separation of RNA. Hybridization of the probe takes place within the cell or tissue. Since cellular structure is maintained throughout the procedure, ISH provides information about the location of mRNA within the tissue sample.

The procedure begins by fixing samples in neutral-buffered formalin, and embedding the tissue in paraffin. The samples are then sliced into thin sections and mounted onto microscope slides. Alternatively, tissue can be sectioned frozen and post-fixed in paraformaldehyde. After a series of washings to de-wax and rehydrate the sections, a Proteinase K digestion is performed to increase probe accessibility.
A labeled probe is then hybridized to the sample sections. (Proteinase K digestion is a crucial step for successful ISH. Insufficient digestion will result in a diminished hybridization signal. On the other hand, if the sample is over-digested, tissue morphology will be poor, making localization of the hybridization signal difficult. The concentration of Proteinase K needed depends on the tissue type, length of fixation, and size of the tissue core.) Radio-labeled probes are visualized with liquid film dried onto the slides, while non-isotopically labeled probes are conveniently detected with colorimetric or fluorescent reagents.

The major drawback to ISH is the procedure itself. Standard protocols are cumbersome, time-consuming and laborious and require specialized equipment for preparing samples and visualizing results of the experiment. Additionally, quantitation of gene expression is not as straightforward as with the other techniques.

Feature-based Approach

Feature-based approaches are based on pattern recognition, treating DNA fragments as sequences (see Table 11.1).

Gene finding by ORF prediction

Long ORFs are strongly suggestive of genes. An ORF is a long series of codons in a DNA sequence that is not interrupted by a termination codon. An ORF signal is enhanced even further by the presence of sequence patterns for starting and stopping transcription before and after the ORF. Dynamic programming can be used to identify the highest scoring regions. The best gene recognition systems tend to be species-specific, trained on examples of known genes in the given organism. In the simplest models, the initiation site is taken to be an ATG codon lying about 30 base pairs downstream from a TATA-like sequence (TAATAA).

GRAIL (Gene Recognition and Analysis Internet Link) (http://compbio.ornl.gov/Grail-13/) is perhaps the most widely used ORF identification tool. (It was also one of the first to be made available.)
It provides analysis of the protein coding potential of a DNA sequence. GRAIL uses variable-length windows tailored to each potential exon candidate, defined as an open reading frame bounded by a pair of start/donor, acceptor/donor or acceptor/stop sites. This scheme facilitates the use of more genomic context information (splice junctions, translation starts, non-coding scores of 60-base regions on either side of a putative exon) in the exon recognition process. GRAIL finds about 91% of all coding regions with an apparent false positive rate of 8.6%.

GRAIL 1 helps in analyzing protein-coding regions, poly(A) sites, and promoters; it constructs gene models, predicts encoded protein sequences, and provides database searching capabilities. A list of the most likely exon candidates is first established, and these are evaluated further using a neural network approach. The algorithm makes its final prediction by selecting the best candidates. A DP approach is then used to define the most probable gene models.

FindPatterns (http://www.accelrys.com/products/gcg_wisconsin_package/program_list.html#FindPatterns) can be used for scanning for ORF patterns.

Frames (http://www.accelrys.com/products/gcg_wisconsin_package/) can show ORFs for the six translation frames of a DNA sequence. Frames can superimpose the pattern of rare codon choices if you provide it with a codon frequency table.

MacVector 6.5 (http:/www-sxstit/foxm_mev-htm) does ORF detection based on Fickett's statistical method, or on the designation of sequence ends as start and stop codons. ORFs can be found in 3 or 6 frames.
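The basic ORF search that such tools perform can be sketched in a few lines. This is a simplified illustration, not the method used by any of the programs named above: an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, the function names are hypothetical, and the minimum length is a parameter (a cutoff of 100 codons is a common choice).

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_strand(seq, min_codons):
    """Scan the three frames of one strand; return (start, end) pairs.

    `start` is the index of the ATG, `end` is one past the stop codon.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None                   # close it and keep scanning
    return found

def find_orfs(seq, min_codons=100):
    """Search all six frames. Minus-strand coordinates refer to the
    reverse-complemented sequence, not to the original."""
    seq = seq.upper()
    hits = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    hits += [("-", s, e) for s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return hits

# Invented toy sequence with one short ORF: ATG AAA TTT GGG TAA
print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

Lowering `min_codons` recovers small proteins at the cost of many spurious ORFs, which is the trade-off a length cutoff always involves.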
Sequencher (http://www.genecodes.com/index.html) can also be used for ORF analysis. Sequencher can also be used for contig assembly, restriction enzyme mapping, heterozygote detection, cDNA to genomic DNA large gap alignment, motif, and SNP analysis.

ORFs are easy to find with automated tools; however, there are two major problems in their identification:

Small proteins. The issue is to decide the "cutoff" to be used for a minimum sized protein. A cutoff of 100 amino acids is often used. However, in so doing, some true small proteins containing fewer than 100 amino acids are not annotated and some ORFs containing more than 100 amino acids are annotated even though they do not encode a protein.

Small exons. Exons smaller than about 30 nucleotides cannot be reliably predicted by normal computational methods. Missing a small exon can result in prediction of a protein sequence that has an internal "frame shift", i.e. the protein coding frame has shifted. Such a shift changes all the amino acids after the frame shift position, resulting in major errors in prediction of the protein sequence.

There are various tests to verify that a predicted ORF is in fact likely to encode a protein. Some of these are as follows:

1. The first method is based on an unusual type of sequence variation that is found in ORFs: every third base tends to be the same one much more often than by chance alone. This property is due to non-random use of codons in ORFs and is true for any ORF, independent of the species. The program TestCode (http://www.accelrys.com/products/gcg_wisconsin_package/) provides a plot of the non-randomness of every third base in the sequence.
2. The second method determines whether the codons in the ORF correspond to those used in other genes of the same organism. For this, information on codon use for the organism is necessary, averaged over all genes.
3. The ORF may be translated into an amino acid sequence and the resulting sequence then compared to the databases of existing sequences. If one or more sequences of significant similarity are found, there will be much more confidence in the predicted ORFs.

Homology-based approach

Eukaryotic genes are more difficult to identify than those of the prokaryotes. Searching for homologues is the most widely used method for identifying genes. Homology searching relies only on evolutionary relatedness, and so is widely applicable. A major advantage of finding a homologous product is that some of the information about the gene is already known. The search for genes can include searching for matches to one or more of the following:

• Known proteins
• Protein motifs (e.g. zinc finger, ATP and GTP-binding motifs, etc.)
• ESTs (Expressed Sequence Tags) and ACRs (Ancient Conserved Regions)

Homology-based gene prediction systems can find similarities to previously identified coding regions. A different homology-based approach, which can identify totally unknown genes, is to compare two whole genomes and look for conserved regions, on the theory that sequence is only conserved if it is important.

Procrustes (http://hto-usc.edu/software/procrustes/) is a homology-based program (currently not fully functional), which accepts as input one genomic DNA sequence and one or several protein sequences. The proteins (targets) are assumed to be similar to the protein encoded in the genomic fragment. Procrustes finds the chain of exons with the best fit to the target proteins; if several targets are specified, it makes one gene prediction per target. Procrustes also outputs the amino acid sequence of the predicted protein and the alignment between the predicted and target proteins.
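Both the third ORF test above and protein-level homology searches start by translating DNA into amino acids. The sketch below shows one way to translate a sequence in all six reading frames using the standard genetic code; the function names are hypothetical, and codons containing characters other than A, C, G, T are rendered as X by assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translations(seq):
    """Return translations for frames +1..+3 and -1..-3 (reverse complement)."""
    seq = seq.upper()
    rc = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames

# Invented nine-base example: ATG GCC TAA
for name, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
    print(name, protein)
```

Each of the six protein strings can then be compared against a protein database; frames with long stretches free of `*` are the more plausible coding frames.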
Finding coding regions can also be done by similarity searching using TBLASTX for finding exons. The approach is to translate the sequence in all six reading frames (three forward and three reverse) and do a similarity search against the databanks. TBLASTX translates a DNA query sequence in six frames and compares it with a nucleotide database that is also translated in six frames; BLASTX performs the corresponding search of a translated DNA query against protein databanks. If a protein sequence matches, get its DNA sequence and align it with your unknown sequence. The start and stop codons would get identified. If the query sequence were genomic, then the introns would be identified.

Statistical and HMM approaches

In Chapter 10, you have learned that one of the important applications of HMMs is in gene identification. One of the programs discussed there was GeneMark, which uses HMMs for gene identification. There are several others and some can now be discussed.

HMMs for gene prediction can be developed by using simplified gene grammar rules like start-codon, stop-codon, coding region length divisible by 3 and no stop-codons in the reading frame. The "language" may also consider dinucleotide preferences (some dinucleotides being more common than others), so that nucleotides are not necessarily independent.

HMMgene (http://www.cbs.dtu.dk/services/HMMgene/) is a program for the prediction of genes in anonymous DNA. The program predicts whole genes, so the predicted exons always splice correctly. It can predict several whole or partial genes in one sequence, so it can be used on whole cosmids or even longer sequences. HMMgene can also be used to predict splice sites and start/stop codons. If some features of a sequence are known, such as repeat elements, these can be locked and the program will then find the best gene structure under these constraints. The program can also report several best gene predictions for a sequence. This is useful if there are several equally likely gene structures, and may even indicate alternative splicing.

GLIMMER (Gene Locator and Interpolated Markov Modeler) is a system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. It uses interpolated Markov models (IMMs) to identify coding regions and distinguish them from non-coding DNA. The IMM approach uses a combination of Markov models from 1st through 8th-order, weighting each model according to its predictive power.

VEIL stands for "the Viterbi Exon-Intron Locator" and finds genes in eukaryotic DNA.

GENSCAN is a program to predict complete gene structures in genomic sequences. It can identify introns, exons, promoter sites and polyA signals. The algorithm uses probabilistic models to model the different functional units. A fifth-order Markov model is used for coding regions. A weight matrix model is used for modeling polyadenylation signals, the translation initiation signal and promoters. A modified version of the weight array model (WAM) is used for acceptor splice sites. In GENSCAN, the training set is divided into four categories depending on C + G content:

1. less than 43% C + G
2. 43 - 51% C + G
3. 51 - 57% C + G
4. more than 57% C + G

The initial state probabilities are computed by estimating the relative frequencies of the functional units in each of these categories.

GENSCAN was given a sequence of 20 kbp. The outputs are given in Figure 11.8 and Figure 11.9.

FIGURE 11.8 Sample of GENSCAN output.

Genie (http://www.fruitfly.org/seq_tools/genie.html) is another tool that uses generalized HMMs and neural networks.

Searching for CG islands

As discussed earlier in Chapter 4, CG islands are regions where the dimer CG occurs repeatedly. An HMM can be used to train a program on identified examples of CG islands so that it learns to recognize them. The HMM can be used to decide whether a given short sequence comes from a CpG island or not. The HMM can also be trained to find the CpG islands in a long sequence.
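To make the idea of locating islands in a long sequence concrete, here is a deliberately small two-state HMM decoded with the Viterbi algorithm. It is a toy under stated assumptions, not a trained CpG island model: the states (I for island, B for background), all probabilities, and the function name are invented for illustration, and the emissions model only G + C richness rather than the CG dinucleotide itself.

```python
import math

# All numbers below are invented for illustration, not estimated from data.
STATES = ("I", "B")                      # I = island, B = background
START = {"I": 0.5, "B": 0.5}
TRANS = {"I": {"I": 0.9, "B": 0.1},
         "B": {"I": 0.1, "B": 0.9}}
EMIT = {"I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "B": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

def viterbi(seq):
    """Return the most probable state path for seq as a string of I/B."""
    seq = seq.upper()
    if not seq:
        return ""
    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []                            # back-pointers, one dict per step
    for ch in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][ch]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

# Invented sequence: AT-rich flanks around a CG-rich block.
toy = "AT" * 5 + "CG" * 5 + "AT" * 5
print(toy)
print(viterbi(toy))
```

In a real application the transition and emission probabilities would be estimated from annotated islands, and a model over dinucleotides (for example, eight states, one per base inside and outside islands) captures the CG dimer itself.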
FIGURE 11.9 GENSCAN predicted genes in sequence. (Horizontal axis: position in kb, 0 to 20. Key: initial exon, internal exon, terminal exon, single-exon gene; optimal and suboptimal exons.)

Searching for protein binding sites in DNA sequences

DNA and protein sequences that have a related function may share conserved sequence patterns, which can be found by sequence analysis methods. An example is a set of DNA sequences that contains signals for transcriptional promoters. Sequences of protein families may have conserved amino acid patterns or motifs. These patterns serve as a signature of the family and may be used to identify other sequences that may have the same function.

SIGNAL SCAN (http://www.chs.umn.edu/software/sigscan.html) is a program that uses a database of transcription factor sequences to find potential transcription factor binding sites in DNA sequences.

If the members of a set of sequences are similar to each other, the simplest method of finding consensus patterns is to align the sequences by Multiple Sequence Alignment, make a profile and search for patterns with a statistical method such as the expectation maximization method.

DNA binding sites for proteins may be composed of several conserved patterns separated by variable spaces between the patterns. The Expectation Maximization Algorithm has been devised for finding such regions in unaligned sequence fragments. It goes through the following four steps:

1. The best scoring comparison matrix is obtained.
2. This matrix is then used to find the approximate locations of the binding sites in the original sequences.
3. The predicted binding sites are then used to make a new matrix.
4. This matrix is again used to define even better the locations of the binding sites in the sequences.

This process is repeated until the method converges on a single set of patterns in the sequences.

GenLang

GenLang (http://www.upenn.edu/genlang/genlang_home.html) is a syntactic pattern recognition system, which uses the tools and techniques of computational linguistics to find genes and other higher-order features in biological sequence data. Patterns are specified by means of rule sets called grammars, and a general purpose parser, implemented in the logic programming language Prolog, then performs the search.

BCM GeneFinder

GeneFinder (originally from the Baylor Institute and now with the Sanger Institute; not available now) offers some unusual custom algorithms. The algorithm first predicts all possible potential internal exons, and potential 5' and 3' exons for each internal exon, by linear discriminant functions that combine characteristics describing various contextual features of these exons. Then, by the method of dynamic programming, it searches for the optimal combination of these exons and constructs gene models.

GeneParser

GeneParser (http://beagle.colorado.edu/~eesnyder/GeneParser.html) predicts the most likely combination of exons and introns in a genomic sequence by a DP approach. Scores for the types of sequence pattern that make up intron and exon regions are derived for each sequence position. The introns and exons are then aligned with the constraint that they must alternate within a gene, and the most likely intron and exon structure is found. GeneParser uses a neural network scheme for adjusting the weights used for the sequence features that comprise a gene structure.

GeneID

GeneID (http://www1.imim.es/software/geneid/index.html) is a hierarchically designed program to predict genes in anonymous genomic sequences. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using Position Weight Arrays (PWAs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov Model for coding DNA. Finally, from the set of predicted exons, the gene structure is assembled, maximizing the sum of the scores of the assembled exons.

11.5 OTHER GENE PREDICTION TOOLS

Some of the other tools available for searching for various patterns are given below.

Poly-A site prediction

HCpolyA (http://l25.itba.mi.cnr.it/~webgene/wwwHC_polya.html) is used for Poly-A Site Prediction using the Hamming Clustering Method on the WebGene Server.

TATA signal, promoter and trans-factor binding site prediction

HCtata (http://l25.itba.mi.cnr.it/~webgene/wwwHC_tata.html) is a similar tool that uses the Hamming Clustering Method for TATA Signal Prediction in Eukaryotic Genes. This is also from WebGene.

MatInspector (http://www.genomatix.de/cgi-bin/matinspector/matinspector.pl) searches a sequence for transcription factor binding sites.

Signal Scan is designed to find homologies of published signal sequences with the input DNA sequence.

TFSiteScan (http://www.ifti.org/cgi-bin/ifti/Tfsitescan.pl) is a tool intended for promoter sequence analysis, and works best with sequences […].

Open Reading Frame (ORF) prediction

FramePlot (http://www.nih.go.jp/~jun/cgi-bin/frameplot-3.0b.pl) performs Protein Coding Region Prediction in Bacterial DNA (NIH, Japan).

ORF Finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) is a graphical analysis tool which finds all open reading frames of a selectable minimum size in a user's sequence or in a sequence already in the database. This tool identifies all open reading frames using the standard or alternative genetic codes.

Exon prediction

Genefinder is a tool for exon prediction. It has the following facilities: […]: internal exons, FEXH: all exons, and FGENEH: gene structure.

The Exon Prediction Program (Perceval) (http://compbio.ornl.gov/grailexp/…shtml) stands for Protein-coding Exon, Repetitive, and CpG-Island EVALuator. Perceval reads in a DNA sequence and produces a list of possible Grail Exon Candidates, and filters these candidates against a repetitive element database.
It also locates repetitive elements and CpG islands.

Splice site prediction

Genefinder is a tool for exon prediction. It also has the following facility: HSPL: splice sites.

Repetitive DNA and CpG islands analyses

RepeatMasker2 (http://ftp.genome.washington.edu/cgi-bin/RepeatMasker) is used for analysis of Repetitive Elements in DNA Sequences. RepeatMasker screens DNA sequences in fasta format against a library of repetitive elements and returns a masked query sequence ready for database searches, as well as a table annotating the masked regions.

tRNA gene prediction

tRNAscan is used for genomic tRNA Identification (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/).

A comparison of the various tools in terms of their performance is given in Table 11.4.

TABLE 11.4 Comparison of Various Gene Prediction Tools

Tool          Prediction type   Sensitivity   Specificity   Sensitivity      Specificity      Missing
                                (%) Nucl.     (%) Nucl.     (%) Exact exon   (%) Exact exon   exons
FGENES        Gene structure    83            93            [illegible]      [illegible]      7
GeneID        Gene structure    69            [illegible]   [illegible]      46               [illegible]
Gene Parser   Gene structure    66            [illegible]   35               40               [illegible]
GENSCAN       Gene structure    93            93            [illegible]      81               4
GRAIL I       Gene structure    83            87            [illegible]      52               [illegible]
MZEF          Internal exons    87            95            78               86               [illegible]

Gene identification and prediction are very important. In prokaryotes, genes are represented only once in the genome, whereas in eukaryotes they may be present in multiple copies. You learned that a number of gene prediction methods are available, which differ in their approach and efficiency. A number of gene prediction tools and programs were also described and compared to show their relative performance levels.
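The accuracy measures named in the column headings of Table 11.4 are not defined in the text. The sketch below assumes the definitions commonly used when benchmarking gene finders: sensitivity is the fraction of annotated items that are predicted, and specificity is the fraction of predicted items that are correct (TP / (TP + FP)), computed either per nucleotide or per exactly matching exon. The function names and the 0-based, end-exclusive interval convention are choices made for this example only.

```python
def coding_positions(exons):
    """Expand (start, end) exon intervals into a set of nucleotide positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) in percent at the nucleotide level."""
    actual = coding_positions(annotated)
    called = coding_positions(predicted)
    true_pos = len(actual & called)
    sensitivity = 100.0 * true_pos / len(actual) if actual else 0.0
    # Assumed convention: specificity = TP / (TP + FP).
    specificity = 100.0 * true_pos / len(called) if called else 0.0
    return sensitivity, specificity


def exon_accuracy(annotated, predicted):
    """Return (sensitivity, specificity, missing) in percent at the exon level.

    An exon counts as correct only if both boundaries match exactly.
    An annotated exon is missing if no predicted exon overlaps it.
    """
    actual, called = set(annotated), set(predicted)
    exact = len(actual & called)
    missing = sum(
        1 for a_start, a_end in actual
        if not any(p_start < a_end and a_start < p_end
                   for p_start, p_end in called)
    )
    sensitivity = 100.0 * exact / len(actual) if actual else 0.0
    specificity = 100.0 * exact / len(called) if called else 0.0
    missing_pct = 100.0 * missing / len(actual) if actual else 0.0
    return sensitivity, specificity, missing_pct


if __name__ == "__main__":
    # Hypothetical annotation and prediction, invented for this example.
    annotated = [(100, 200), (300, 400)]
    predicted = [(100, 200), (320, 420), (500, 550)]
    print(nucleotide_accuracy(annotated, predicted))
    print(exon_accuracy(annotated, predicted))
```

A tool can score well per nucleotide and still do poorly on exact exons, because an exon whose boundary is off by a few bases counts as wrong at the exon level.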
