Building A Multiple Sequence Alignment

Multiple sequence alignments are useful for predicting protein structures and functions, and are essential for phylogenetic analysis. Important amino acids like those in active enzyme sites are highly conserved between sequences, while less important residues can mutate more easily. Multiple sequence alignments may not be effective for assembling short, partially overlapping sequences or sequences with no homologs in databases. Key criteria for building multiple alignments include sequence similarity according to biochemical properties of amino acids or nucleotides.

Diapo 2 Building a Multiple Sequence Alignment

Multiple alignments are useful for predicting protein structures (see Chapter 11),
central to predicting the function of proteins, and indispensable for phylogenetic
analysis.
Important amino acids (or nucleotides) tolerate very few mutations. For instance,
the active sites of enzymes are highly conserved. Less-important residues change
more easily, sometimes randomly and sometimes in order to adapt a function.
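
To make the contrast between conserved and variable positions concrete, here is a minimal Python sketch that scores each column of an alignment by the fraction of sequences sharing its most common residue. The function name column_conservation and the toy alignment are invented for illustration; real analyses usually rely on more refined scores.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, for each column, the fraction of sequences sharing the most common residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        # Gaps ("-") are not counted as residues.
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Toy alignment, invented for illustration (not real protein data).
toy_alignment = [
    "MDKEL",
    "MDRE-",
    "MDKQL",
    "MDHEL",
]
scores = column_conservation(toy_alignment)
```

Columns that score close to 1.0 are candidates for functionally important positions; columns with low scores are the ones that mutate freely.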

Diapo 2 Identifying situations where multiple alignments do not help


Multiple alignments do not work well for assembling the pieces of a sequence in a
sequencing project. If you have a set of short, partially overlapping sequences,
multiple sequence alignment is the wrong tool. If you face this particular problem,
you may want to use specialized sequence-assembly tools, such as Phred and Phrap.
Another such case is when the sequence you’re interested in has no homolog in any
of the sequence databases.

DIAPO 3 Main Criteria for Building a Multiple Sequence Alignment

The idea behind a multiple alignment is to put amino acids or nucleotides in the
same column because they are similar according to some criterion.
You can use four main criteria, each with different properties, to build a multiple
sequence alignment.

DIAPO 4 Choosing the Right Sequences


Multiple-sequence-alignment methods are at their best when aligning protein
sequences. The reason is that protein sequences are three times shorter than the
corresponding DNA, and they use a more informative alphabet of 20 amino acids.
If you think that very similar sequences give very good alignments, you’re right!
However, a multiple sequence alignment that’s correct isn’t enough; it must also be
useful.
For instance, an alignment that contains only very similar sequences brings little
information. You need to be able to observe mutation patterns in every column, which
isn’t possible if most columns of the alignment are entirely conserved.
If you can, make sure that each sequence is between 30 and 70 percent identical
with more than half of the sequences in the set. This way, you’re making a
reasonable trade-off between new information and alignment quality.
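
The 30 to 70 percent rule of thumb can be checked with a few lines of Python. The sketch below is only an illustration: it assumes the sequences are already aligned (same length, "-" for gaps), counts identity over columns where both sequences have a residue, and compares each sequence with the other members of the set. The function names and the toy sequences are invented.

```python
def percent_identity(a, b):
    """Percent identity of two aligned sequences, over columns where both have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def in_useful_range(aligned, low=30.0, high=70.0):
    """Map each name to True if it is low..high percent identical to more than half of the others."""
    result = {}
    for name, seq in aligned.items():
        others = [s for n, s in aligned.items() if n != name]
        hits = sum(low <= percent_identity(seq, other) <= high for other in others)
        result[name] = hits > len(others) / 2
    return result


# Toy aligned sequences, invented for illustration.
toy_set = {
    "seq1": "ACDEFGHIKL",
    "seq2": "ACDEFGHIKL",
    "seq3": "ACDEFGHIKM",
    "seq4": "ACDEFWWWWW",
}
flags = in_useful_range(toy_set)
```

In this toy set, seq1, seq2, and seq3 are nearly identical to one another, so they are flagged as bringing little new information, while seq4 falls in the useful range.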

DIAPO 5 Gathering your sequences with online BLAST servers


• Characterized sequences: These are sequences for which you have good
annotations and experimental information. You’d definitely want to include these
sequences in your alignment because they bring biological information with them
and also allow feature propagation.
• Uncharacterized sequences: This category can include your sequence(s) of interest
as well as database sequences. Uncharacterized sequences must be members of
the same family. Your main motivation in including them in your multiple alignment
is to distinguish between the conserved positions that cannot mutate and the other,
less-important columns. They help in getting some contrast on your sequence of
interest.

The main reason for using BLAST is to identify database sequences that are so
similar to the query that they probably are homologous. We commonly refer to such
sequences as hits or matches.

DIAPO 6 Selecting sequences on the ExPASy server


You can use this server only to retrieve protein sequences. If you’re interested in
gathering DNA sequences, use the European Bioinformatics Institute SRS server
(srs.ebi.ac.uk) instead.

1. Point your browser to www.expasy.ch/tools/blast/. The BLAST page of the
ExPASy server appears.
2. Enter the Sequence Accession Number P20472, as shown in Figure 9-2.
This is the accession number of human parvalbumin. If you prefer, you
can also paste the sequence in raw format (that is, use the sequence only,
without any header). The program ignores spaces and numbers.
3. Select the BLAST flavor that you’re interested in. If you gave a protein
sequence in Step 2, select blastp. If you gave a coding DNA sequence in Step
2, select tblastn.

4. Keep the default option, Complete Database, in the pull-down menu. This
amounts to simultaneously searching Swiss-Prot + TrEMBL + TrEMBL_NEW.
If the search reports too many sequences that are very similar to your
sequence of interest, you can decrease the number of identical hits by
selecting a smaller database from the Database pull-down menu (Swiss-Prot,
for example).

DIAPO 7
5. Scroll down to the Options section and set the Number of Best Scoring Sequences
to Show option to 1000. Doing this makes it more likely that you’ll find appropriate
sequences in the BLAST result for your multiple alignment.
6. In the same Options section, set the Number of Best Alignments to Show option
to 1000. This choice makes it possible to judge the quality of the alignment before
selecting a sequence.

DIAPO 8
7. Click the Run BLAST button. After a brief pause, a Results page appears.
8. Scroll down the page to select the sequences you want. You select a sequence
by checking the box to its left.
This is the most delicate part of the process. There is no absolute rule to selecting
your sequences, but you can use the following guidelines:
• Select the top sequence. This top sequence is usually your sequence of interest.
If your sequence of interest is not at the top, you may have to add it to the list later
on.
• For a first analysis, you want to select ten sequences or fewer. Ideally, the ten
sequences to select should be evenly spaced between the very good E-values
(10^-40) and less-good E-values (10^-5).
• Before selecting a sequence, check to make sure it’s similar to the query
sequence along its entire length. The alignment section is at the bottom of the
BLAST output. You must be especially careful with hits that have E-values higher
than 10^-10. They are equally likely to correspond to a good partial match, a
global overall match, or a match between a protein fragment and your sequence.
Inspecting the alignment is the only way to distinguish between these situations.
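
The guideline about spacing your selection evenly across E-values can be sketched in Python. This is an illustration under stated assumptions, not part of the ExPASy server: hits are given as (identifier, E-value) pairs, "evenly spaced" is interpreted on a log10 scale, and the function name, parameters, and demo hits are all invented.

```python
import math


def pick_evenly_spaced_hits(hits, n=10, best=1e-40, worst=1e-5):
    """Pick up to n hit identifiers whose E-values are spread evenly (log10 scale) from best to worst."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # Keep only hits inside the E-value window; this also drops E-values of 0.
    in_range = [(name, math.log10(e)) for name, e in hits if best <= e <= worst]
    if len(in_range) <= n:
        return [name for name, _ in sorted(in_range, key=lambda h: h[1])]
    lo, hi = math.log10(best), math.log10(worst)
    targets = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    remaining = list(in_range)
    chosen = []
    for target in targets:
        # Take the not-yet-chosen hit closest to the target exponent.
        nearest = min(remaining, key=lambda h: abs(h[1] - target))
        chosen.append(nearest[0])
        remaining.remove(nearest)
    return chosen


# Invented demo hits with E-values 1e-5, 1e-6, ..., 1e-40, plus one weak hit.
demo_hits = [(f"hit_1e-{k}", float(f"1e-{k}")) for k in range(5, 41)]
demo_hits.append(("weak_hit", 1e-2))
chosen = pick_evenly_spaced_hits(demo_hits, n=8)
```

The top hit (usually your sequence of interest) is outside the scope of this helper and would be added separately. A selection made this way still needs the manual inspection of each alignment described above.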

DIAPO 9
9. Choose the method you want to use to export your sequences from
the Send Selected Sequences pull-down menu:
• FASTA: Generates a file that contains your sequences in FASTA format. You can
save this file with the File➪Save As option of your browser. When you need to, you
can reopen this file with your browser in order to cut and paste its content into
another server.

• ClustalW, Tcoffee, and MAFFT: These are multiple-sequence-alignment packages
running on the EMBnet server. Select any of these to align the selected sequences.
• Reduce Redundancy: This option extracts the most meaningful sequences from
your dataset. It is ideal if you have too many sequences and you don’t know how to
choose.
• Pratt: This option searches for conserved motifs in your sequences without
aligning them.
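
If you save your selection in FASTA format, you can read it back with a short script before pasting it into another server. The following Python sketch parses FASTA text into (header, sequence) pairs and writes it out again. The function names and the example sequences are invented for illustration, and the parser handles only the plain layout of a ">" header line followed by sequence lines.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before the first '>' header")
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text, wrapping sequences at width characters."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Invented example, not real database entries.
example = """>seq_A toy example
ACDEFGHIKL
MNPQRSTVWY
>seq_B toy example
ACDEFGHIKL
"""
records = parse_fasta(example)
```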

Gathering a known collection of sequences from Swiss-Prot


If you already know the name or accession number of every sequence you want to
include in your multiple alignment, and if these sequences are in Swiss-Prot or in
TrEMBL, you can directly access them by using a special online ExPASy facility at
www.expasy.ch/sprot/sprot-retrievelist.html.

DIAPO 10 Choosing the Right Method of Multiple Sequence Alignment


Before you start making multiple sequence alignments, you must know that none of
the methods available today is perfect. They all use approximations. Building a
multiple alignment that lets you make a real discovery requires some practice. The
usual strategy requires comparing several alternative results and looking for
robustness and stability.
DIAPO 11 Using ClustalW
ClustalW is by far the most commonly used program for making multiple sequence
alignments.
ClustalW uses a progressive method to build its alignments. Instead of aligning all
the sequences at the same time, it adds them one by one.
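
Progressive methods build on pairwise alignment steps. As a rough illustration of such a step, here is a minimal Python sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch algorithm). It is not ClustalW’s implementation: the match, mismatch, and gap scores are arbitrary assumptions, whereas real programs use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the best score for aligning a[:i] with b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), table[len(a)][len(b)]


aligned_a, aligned_b, best_score = needleman_wunsch("GATTACA", "GATACA")
```

A progressive program repeats this kind of step, aligning each new sequence (or group of sequences) to what has already been aligned, which is why early mistakes cannot be corrected later.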
Before you head off to a ClustalW server, you must do a little spadework ahead of
time. Specifically, you need to gather together all the sequences you want to work
with.

1. Point your browser to the EBI ClustalW server page at
www.ebi.ac.uk/clustalw.
The ClustalW page dutifully appears.
2. Paste the sequences you collected in the Sequence window.
3. Choose Fast from the Alignment pull-down menu (Figure 9-6).
4. Use the Output Format pull-down menu to select the format of your choice.
Output formats have various pros and cons. (See Chapter 10 for a discussion
on this.) It is safe to use Aln Without Numbers, the default
ClustalW format.
It is never too late to change a format. If you didn’t generate your
multiple alignment in the format that suits you best, DON’T recompute
it! You can easily reformat alignments by using an online reformat utility
(such as Fmtseq) at www.bimcore.emory.edu/Pise/. (For more on
reformatting, see Chapter 10.)
5. Choose Input from the Output Order pull-down menu. (Refer to
Figure 9-6.)
6. Click the Run button at the bottom of the page.
An intermediate page appears. Wait until your browser displays the
Results page.
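
Reformatting an alignment changes only its layout, never its content, which is why you don’t need to recompute anything. The Python sketch below illustrates this by writing an already aligned set of sequences in a Clustal-style interleaved layout. It is a simplified assumption of that layout: the header text and padding are chosen for illustration, and the conservation line that ClustalW prints under each block is omitted.

```python
def to_clustal_like(aligned, width=60):
    """Write aligned sequences (name -> gapped sequence) as Clustal-style interleaved blocks."""
    if not aligned:
        raise ValueError("no sequences given")
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(aligned[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    pad = max(len(name) for name in names) + 6  # spacing between name and residues
    lines = ["CLUSTAL W multiple sequence alignment", ""]
    for start in range(0, length, width):
        for name in names:
            lines.append(name.ljust(pad) + aligned[name][start:start + width])
        lines.append("")  # blank line between blocks
    return "\n".join(lines)


# Invented toy alignment.
text = to_clustal_like({"seqA": "ACDEFGH", "seqB": "AC-EFGH"}, width=5)
```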
DIAPO 12 Aligning sequences and structures with Tcoffee
Tcoffee is comparable to ClustalW, but it produces more accurate alignments at the
cost of a slightly longer running time.
Tcoffee builds a progressive alignment like ClustalW, but it compares segments
across the entire set of sequences.
1. Point your browser to the Tcoffee server home page at
www.tcoffee.org.
2. Click the Regular button on the TCOFFEE line (first line).
The Build a Multiple Alignment page appears (Figure 9-7).
3. Paste your sequences into the large window.
You can use most formats. If your sequences are in a text file, you can
upload this file by using the Browse button.
4. Click the Submit button at the top or the bottom of the page.
Tcoffee can be slow at times. If you’d prefer to be notified when your
computation is done, enter your e-mail address in the Web form.

DIAPO 14
5. Examine your results.
Tcoffee returns a table that contains hyperlinks to your results, as
shown in Figure 9-8.
The first row of the table is dedicated to multiple sequence alignments
and includes
• msf_aln, clustalw_aln, fasta_aln: Text files containing your alignment in various
formats. Keep these files if you want to use your alignment as input for another
program.
• score_html, score_ascii: A colorized alignment where every residue appears on
a background that indicates the quality of its alignment. Red indicates high-quality
segments; blue indicates regions of your alignment that you have no reason to trust.
The score_ascii file is a text version of the .html file. These last two files are
meant only for display; you can’t use them as input for other sequence-analysis
programs.
The second row is dedicated to phylogenetic trees:
• dnd: The guide tree or dendrogram generated by Tcoffee in Newick
format (see Chapter 13). You should not use it in place of the true
phylogenetic tree.
• phylogenetic_tree: The true phylogenetic tree in Newick format,
generated from the Tcoffee multiple alignment by using the Neighbor
Joining method (see Chapter 13). This is not a guide tree but a real
phylogenetic tree.
• pdf: A pdf picture of the phylogenetic tree that corresponds to the
phylogenetic_tree file.
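
Because both tree files are plain text in Newick format, you can inspect them with a short script, for example to check that the guide tree and the phylogenetic tree contain the same sequences. The Python sketch below extracts the leaf names from a Newick string. It is a simplification that assumes unquoted names and a tree with at least one pair of parentheses; the function name and the example trees are invented.

```python
import re


def newick_leaf_names(newick):
    """Return leaf names from a Newick string (unquoted names only), in order of appearance."""
    # A leaf name directly follows "(" or ","; internal node labels follow ")" and are skipped.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Invented example trees with made-up branch lengths.
guide_tree = "((seqA:0.10,seqB:0.20):0.05,seqC:0.30);"
nj_tree = "((seqA:0.12,seqC:0.25)node1:0.04,seqB:0.22);"
same_leaves = set(newick_leaf_names(guide_tree)) == set(newick_leaf_names(nj_tree))
```

Here the two trees group the sequences differently but contain the same leaves, which is the situation you can expect between a guide tree and a Neighbor Joining tree built from the same alignment.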

Crunching large datasets with MUSCLE


MUSCLE is a newcomer in the multiple-sequence-alignment arena, but it is a
remarkably efficient package for making fast, high-quality multiple sequence
alignments. MUSCLE is ideal if you want to align several hundred sequences. You
can access it on various servers, including its home page (at
www.drive5.com/muscle/). Running MUSCLE is very straightforward: it is only a
matter of cutting and pasting your sequences into the designated window.

DIAPO 16
We know that structures contain surface loops that evolve rapidly. Loops are the
softer portions of the protein that connect its more rigid portions. Protein
structures also contain core regions that act as the supporting walls of the protein.
These supporting walls evolve less rapidly than the surface loops.
In your multiple alignment, you can expect to find nice ungapped blocks that
correspond to the core regions, and gap-rich regions that correspond to the loops.
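
Ungapped blocks are easy to locate automatically. The Python sketch below scans an alignment for runs of columns in which no sequence has a gap. The function name, the min_length threshold, and the toy alignment are invented for illustration; an ungapped block is only a hint of a core region, not proof of one.

```python
def ungapped_blocks(alignment, min_length=3):
    """Return (start, end) column ranges (0-based, end exclusive) where no sequence has a gap."""
    if not alignment:
        return []
    ncols = len(alignment[0])
    if any(len(seq) != ncols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    blocks = []
    start = None  # first column of the block currently being scanned
    for col in range(ncols):
        has_gap = any(seq[col] == "-" for seq in alignment)
        if not has_gap and start is None:
            start = col
        elif has_gap and start is not None:
            if col - start >= min_length:
                blocks.append((start, col))
            start = None
    if start is not None and ncols - start >= min_length:
        blocks.append((start, ncols))
    return blocks


# Toy alignment, invented for illustration.
toy = [
    "MKVLAA--GHTREWQ",
    "MKILAAPSGHTKEWQ",
    "MRVLA---GHSREW-",
]
blocks = ungapped_blocks(toy)
```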
Another criterion for a useful multiple alignment is knowing the type of amino acids
you can expect to see conserved.
