BIO101 Module 3
Bioinformatics
Gene & Protein Structures :
Lecture 1 :
Central dogma:
DNA (Replication) → Transcription → RNA → Translation → Protein
RNA → Reverse Transcription → DNA
Formats:
1. FASTA format:
* begins with an arrowhead '>' followed by ID/Accession Number
* can remove or add certain details of the format
* no spaces in sequence
2. EMBL format:
* No arrowhead
* spaces after a certain number of nucleotides, as well as a number of indents
* For writing the bottom strand, the description will say "reverse complementary" (this is written with respect to the top strand)
* DNA sequence has both introns and exons
* CDS means coding sequence (mRNA)
* A protein coding sequence will always have a start codon, a stop codon and sequence in between
Reading Frames:
Open Reading Frames:
* A part within the reading frame
* Contains a start codon, stop codon, and some sequence in between
* Out of the 6 reading frames, only one contains the true ORF, which codes for the protein
* The sequence before the start codon is the 5' untranslated region, and the sequence after the stop codon is the 3' untranslated region
* The longest ORF is most likely the protein coding sequence. This is because as the distance b/w start codon & stop codon increases (as the ORF's length increases), the probability of a stop codon occurring by chance increases; the fact that the ORF is long and still doesn't have a stop codon in the middle of the sequence means that it is likely the protein coding sequence
* The longest ORF is not conclusive evidence for the presence of a gene; it is only indicative of probability
Promoters:
* Help identify gene in text file
* Upstream of gene, next to it
* Eukaryotic promoters only have the TATAA box
* The SD sequence is the ribosomal binding site used by the ribosome to bind to mRNA
Restriction Site Signals:
* These are palindromic sequences (top and bottom strand mean the same thing when read 5' → 3') of 4 to 8 base pairs length
* The sequences indicate restriction sites
* The restriction enzymes recognise restriction sites and cut the DNA into different pieces
· known as DNA endonucleases, with the function to cut at specific restriction sites
· makes DNA smaller
* Example of a palindromic site:
5' TCGA 3'
3' AGCT 5'
* These cuts result in 2 types of ends: sticky ends (overhangs) and blunt ends
* Total DNA molecules = number of cuts + 1 (for linear DNA)
* Directions need to match
Expressed Sequence Tags (ESTs):
* mRNA is turned back into DNA (cDNA); this results in cDNA with no introns, only the exons of the gene sequence
* ESTs are short sequences of cDNA used to identify gene transcripts, and are useful in gene sequence determination
* There will be a difference between the original sequence and the mRNA, as mRNA does not have introns
* ESTs are fragments of mRNA sequences derived through single sequencing reactions performed on randomly selected clones from cDNA libraries
* ESTs represent portions of expressed genes
* ESTs contain untranslated regions as well
Predicting Genes:
* Easier in prokaryotes
· promoter and start sites conserved
* Codon usage (codon bias): different codons are used differently in prokaryotes and eukaryotes
~ based on a species specific manner, certain codons are more biased than other codons for a particular amino acid
* Signals
~ start, stop, splicing sites, transcription factor
* Content: a region of genomic DNA
~ exons, introns, UTR, promoters, enhancers, silencers
Lecture 2 - Comparative Sequence Analysis
· One gene codes for multiple proteins
· Databases:
* Gene Prediction - GeneMark
Splicing:
* DNA is transcribed, for which the transcript still has introns and exons
* The transcript is alternately spliced to form mRNAs of just exons
* Alternative splicing is the different ways exons can be joined, which happens because of the tissue/organ you are in
· point is, exons can be alternatively spliced or 'skipped' to give different proteins (isoforms)
· different organs/tissues have differently spliced exons
· results in different variants of proteins
· these isoforms will have a slightly varying function from each other
~ last step is post translational modifications (phosphorylation, glycosylation, lipidation, ubiquitination, cleavage)
Comparative Genomics:
* Shows evolutionary changes among organisms
Phylogenetic Tree:
Homologs:
* Orthologs: homologs separated because of speciation events, derived from the common ancestor, not within same species
BLAST: (Compares novel query sequence with known sequences)
* Local and global alignments
* Local alignment: takes the original sequence, cuts it into bits and compares the query with the database to find matches
* Global alignment: once high scoring segment pairs are found, it compares the entire query with the database
BLAST's algorithm:
* BLAST creates a list of all "words" (short sequences) in the query sequence that have a certain "threshold" score when compared with the database (3 amino acids in a row, or 16-256 nucleotides)
* Candidate sequences from the database are picked based on matches to these words
* When a BLAST search is run, it is run on small sub-sequences in the query sequence
* Local alignment reduces the computation power: it narrows the search down for global alignment based on 2 hits with the words
* Scoring is done on the basis of the BLOSUM62 matrix
* The more positive the score, the more likely the substitution takes place
* tblastx: translated nucleotide against translated nucleotide (ORFs against ORFs)
Scoring alignments (BLAST scoring):
· Different types of BLOSUM exist (BLOSUM 90, 62)
· The number refers to the percentage identity of the sequences used to build the matrix
· blastn: nucleotide vs nucleotide; blastp: protein vs protein
Peptide Bonds:
* Bond b/w the carbonyl carbon and the amino group
* Angles:
1. Omega (Cα, C, N, Cα): 180°
2. Phi and Psi
· exist as allowable angles, shown by Ramachandran plot: a range which allows a proper structure to be formed
· Regions of the plot: beta sheet, right-handed alpha helix, left-handed alpha helix
Edman Degradation:
* Used to sequence proteins
* Starts from N-terminal, removing one amino acid at a time
* TFA (trifluoroacetic acid) is then added, which removes the amino acid which PITC is attached to
* Drawbacks: laborious, and limited in the number of residues that can be sequenced
Mass Spectrometry:
* Sequences proteins using mass/charge ratio
* Order: Ionization · Mass Analyzer · Detector
* protein is injected and vaporised using a heater
* once vapourised, it is ionized by addition of a proton
* the mass analyzer separates ions according to their m/z
* spectrum assembly entails a proteomics software which is interfaced to the MS and assembles spectra
* m/z = mass / charge on protein (mass by charge ratio)
* Drawback: hard ionization techniques (proteins break on addition of protons / ionization)
Spectra:
* TOP-DOWN (analyze the protein as it is); BOTTOM-UP (remove PTMs, fragment and analyze: PTMs not conserved)
* Run MS1 on mass spec data
~ results in the mass/charge of the intact protein
* MS2 results in fragments
* Adding masses of fragments gets you to the intact protein mass
* MS1 is tuned by the change between the added MS2 fragment masses and the MS1 data
· improves accuracy as well as precision
*
MSLs
intensity is always I
* MS2s
intensity is never
Lecture 1:
• Central dogma: DNA → RNA → Protein; Transcription →
Translation (direction is 5' to 3')
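A minimal sketch of the two steps in code, assuming the DNA given is the coding (top) strand written 5' to 3' and using the standard genetic code. The function names `transcribe` and `translate` are made up for these notes, not taken from any library.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand):
    """Coding-strand DNA (5'->3') to mRNA (5'->3'): T becomes U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA codons
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTGA"          # made-up example: start codon, one codon, stop codon
mrna = transcribe(dna)
print(mrna)                # AUGGCCUGA
print(translate(mrna))     # MA
```

Real sequences need extra handling (ambiguous bases such as N, lower-case input to `translate`), which this sketch leaves out.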
• FASTA format and EMBL format are methods to write
down the DNA sequences in a text file to access
information easily and on a large scale.
• FASTA Format:
o Header line begins with the arrowhead '>', followed by
ID/Accession, Description and Length (these are the
usual elements; you can add or remove elements in the
header, which is the first line, before the sequence)
• EMBL Format:
o ID/Accession, Description, Length (again the usual
elements; you can add or remove elements in the
header, which comes before the sequence)
• Differences between the two:
o Arrowhead '>' in FASTA, not in EMBL
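To make the FASTA layout concrete, here is a small sketch of a reader that splits text into (header, sequence) pairs. The function name `parse_fasta` and the two example records are invented for illustration; for real work an established parser would be the safer choice.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):           # arrowhead marks a new record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:           # sequence line of the current record
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 hypothetical example, length 12
ATGGCC
TGATAA
>seq2 another made-up record
GGGCCC
"""

for header, sequence in parse_fasta(example):
    accession = header.split()[0]          # first token of the header is the ID
    print(accession, len(sequence), sequence)
```

The sequence may be wrapped over several lines in the file, so the reader joins them back into one string per record.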
• The description will tell us certain things:
o If sequence is top strand, description will say
nothing.
o If sequence is bottom strand, description will say it
is reverse complementary.
o It will tell you if the sequence is mRNA or DNA
(either way no uracils are present; the sequence is
written with T)
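The bottom strand can be derived from the top strand by complementing each base and reversing the result, so that it is again read 5' to 3'. A short sketch, where `reverse_complement` is just a helper name chosen for these notes:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Bottom strand, written 5'->3', for a top strand written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


top = "ATGGCCTGA"                 # made-up example sequence
print(reverse_complement(top))    # TCAGGCCAT
```

Applying the function twice returns the original top strand.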
• Start Codons: ATG
• Stop Codons: TGA, TAA, TAG
• Between a start and stop codon, there will be a region that
can code for a protein.
• mRNA has only 3 reading frames, as only one of the
DNA strands will be translated into protein; DNA has 6
reading frames, 3 for top strand and 3 for the bottom.
• Eukaryotes have introns and exons, but Prokaryotes
generally have only exons.
• Introns and exons can both contain ORFs, so ORFs alone
cannot differentiate between them.
• Longest ORF is most likely the true ORF.
• Can have multiple ORFs in a single reading frame.
• Looked at the yeast genome: the true ORF (the sequence
which makes the protein) is most probably the longest
ORF.
• Exceptions to this do exist, but they are not that relevant
here.
• The point of the graph is that the longest ORF is a good
prediction of the true ORF.
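The six-frame search and the longest-ORF heuristic can be sketched as follows. This assumes an ORF runs from an ATG to the first in-frame stop codon; the function names and the test sequence are made up, and real gene finders use much more evidence than ORF length.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield ORFs (ATG through the stop codon) read in one frame (0, 1 or 2)."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield seq[start:i + 3]
            start = None   # look for another ORF further along the same frame


def six_frame_orfs(seq):
    """ORFs in all 6 frames as (strand, frame number 1-3, ORF sequence)."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for orf in orfs_in_frame(s, frame):
                found.append((strand, frame + 1, orf))
    return found


def longest_orf(seq):
    """Best guess at the true ORF: the longest one (indicative only)."""
    orfs = six_frame_orfs(seq)
    return max(orfs, key=lambda item: len(item[2])) if orfs else None


seq = "ATGTAACATGAAACCCGGGTTTTAA"   # made-up sequence with two ORFs
print(six_frame_orfs(seq))
print(longest_orf(seq))             # ('+', 2, 'ATGAAACCCGGGTTTTAA')
```

An ATG with no stop codon after it in the same frame is not reported, which matches the definition of an ORF used in these notes.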
• Promoters:
o Promoters start transcription of a specific gene
o On the sense strand of the gene
o Eukaryotic promoters have a TATAA box, whereas
prokaryotic genes have a similar TATA-like box as
well as a Shine-Dalgarno (SD) sequence, which is a
ribosomal binding site.
o The SD sequence is a conserved signal and is
downstream of the TATAA box, just before the start
codon.
o Promoters are always found with the gene
(immediately upstream of it).
• Restriction Site Signals:
o Cut DNA using restriction enzymes
(endonucleases) which cut at specific sequences
called restriction sites.
o These sites are 4-8 base pairs long and are
palindromic.
o Palindromic means if you read it 5' → 3' for the top
and bottom strand, it would read the same way.
o Sticky ends are the overhangs of sequences after a
cut.
o Blunt ends are when there are no overhangs of
sequences after a cut.
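A small sketch of both ideas: checking whether a site is palindromic, and cutting a linear DNA molecule at every occurrence of a site. EcoRI (site GAATTC, cut after the first G, leaving sticky ends) is used purely as an illustration; the function names, the `cut_offset` parameter and the DNA string are invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_palindromic(site):
    """True if the site reads the same 5'->3' on the top and bottom strand."""
    site = site.upper()
    return site == site.translate(COMPLEMENT)[::-1]


def digest(dna, site, cut_offset):
    """Cut linear DNA at every occurrence of `site`.

    cut_offset is where the top strand is cut, counted from the start of
    the site (1 for EcoRI: G^AATTC).
    """
    dna, site = dna.upper(), site.upper()
    fragments, last = [], 0
    pos = dna.find(site)
    while pos != -1:
        fragments.append(dna[last:pos + cut_offset])
        last = pos + cut_offset
        pos = dna.find(site, pos + 1)
    fragments.append(dna[last:])
    return fragments


print(is_palindromic("GAATTC"))    # True
print(is_palindromic("GATTC"))     # False

fragments = digest("AAGAATTCCCGGGAATTCTT", "GAATTC", 1)
print(fragments)                   # ['AAG', 'AATTCCCGGG', 'AATTCTT']
```

Two cuts give three fragments here, since a linear molecule yields one more piece than the number of cuts. Only the top strand is modelled, so the overhang itself is not shown.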
• Expressed Sequence Tags:
o Short cDNA sequences that represent portions of
expressed genes.
o Take the mRNA and reverse transcribe it. This is
called cDNA.
o Compare the genome with the cDNA, the gaps
represent the introns, and the areas that match are
the exons. This helps you find out the intron-exon
boundaries.
• Prokaryotes are not tolerant of promoter mutations.
• Codon Usage Bias:
o Certain codons which code for certain amino acids
have a bias in a species-specific manner.
o To do gene prediction in a species, you must look at
codon usage bias.
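Codon usage can be estimated by counting codons in known coding sequences and comparing the synonymous codons for one amino acid. A minimal sketch, assuming the sequence is already in frame; the coding sequence below is made up, and `codon_counts` / `relative_usage` are illustrative names.

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in a coding sequence read in frame from its first base."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def relative_usage(counts, synonymous_codons):
    """Fraction of each synonymous codon among those seen for one amino acid."""
    total = sum(counts[c] for c in synonymous_codons)
    if total == 0:
        return {c: 0.0 for c in synonymous_codons}
    return {c: counts[c] / total for c in synonymous_codons}


LEUCINE = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]   # the six leucine codons

cds = "ATGCTGCTGCTGTTACTGTAA"     # made-up coding sequence
usage = relative_usage(codon_counts(cds), LEUCINE)
print(usage["CTG"], usage["TTA"])  # 0.8 0.2
```

In practice the counts would come from many genes of the same species, since the bias is species-specific.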
• Signals are short recurring patterns (stop codon, start
codon, splicing sites and transcription factor binding sites)
• Contents are regions of genomic DNA (exons, introns,
UTR, promoters, enhancers, silencers)
• Observable traits from gene sequences:
o Nucleotide frequencies and their correlations
o Functional sites: splice sites, promoters, enhancers,
UTRs
Lecture 3:
• Each amino acid can be represented by a single-letter
code.
• Amino Acids & Peptide Bonds:
o Chiral carbon is C alpha
o Carbonyl carbon (C') is the C = O carbon
o Omega dihedral angle is C alpha, C', N, C
alpha (180 degrees)
o Psi: N – C alpha – C' – N
o Phi: C' – N – C alpha – C'
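Each of these angles is a dihedral defined by four consecutive backbone atoms. As a sketch of how such an angle could be computed from 3D coordinates (the coordinates below are toy points, not real atom positions, and `dihedral` is an illustrative helper):

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Parts of b0 and b2 that are perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Toy points: a flat zig-zag (trans) arrangement, like omega in a peptide bond.
angle = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
print(abs(angle))   # 180.0
```

Feeding in the four atoms listed for psi or phi from a real structure would give the values that are plotted on a Ramachandran plot.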
• Ramachandran Plot:
o Phi and Psi angles exist as allowable angles.
o Top left is beta sheet, top right is left-handed alpha
helix, bottom left is right-handed alpha helix.
• Edman Degradation:
o PITC binds to the N-terminus of the amino acid
chain.
o TFA is added, which is able to cut the peptide bond.
o The isolated amino acid is separated and identified.
o This process is repeated.
o Drawbacks are that it is laborious (50 amino acids a
day limit), and it is restricted to 60 residues.
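The cycle is simple enough to mimic in a few lines: each round removes and reports the N-terminal residue. This is only a toy model of the bookkeeping, not of the chemistry; `edman_degradation` and its `max_cycles` default (taken from the 60-residue limit in these notes) are assumptions for illustration.

```python
def edman_degradation(peptide, max_cycles=60):
    """Return (residues read from the N-terminus, unread remainder)."""
    released = []
    remaining = peptide
    for _ in range(max_cycles):
        if not remaining:
            break
        # One cycle: PITC labels the N-terminal residue, TFA cleaves it off.
        released.append(remaining[0])
        remaining = remaining[1:]
    return "".join(released), remaining


print(edman_degradation("MKTAYIAK"))        # ('MKTAYIAK', '')
read, unread = edman_degradation("A" * 70)
print(len(read), len(unread))               # 60 10
```

The second call shows the drawback: anything beyond the cycle limit stays unsequenced.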
• Mass Spectrometry:
o Large scale
o Works on principle of ionizing proteins, adding a
positive charge to them
o Inject sample, vaporise, ionize, deflect in magnetic
field, mass separator, then final detector.
o Mass/charge ratio (m/z) = (mass + charge) / charge,
since each added proton adds about 1 Da of mass and
one unit of charge
o Drawbacks are hard ionization techniques, low
resolution of mass analysers, and search algorithms
(isotopic envelope deconvolution, post-translational
modifications detector).
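As a worked example of the mass/charge ratio, assuming ionization adds `charge` protons to a protein of known neutral mass, so that m/z = (M + z × proton mass) / z. The helper name `mz` and the 10 000 Da protein are made up.

```python
PROTON_MASS = 1.007276  # Da, approximate mass of one proton


def mz(neutral_mass, charge):
    """m/z of a molecule of `neutral_mass` (Da) carrying `charge` extra protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# The same protein shows up at several m/z values, one per charge state.
for z in (1, 5, 10):
    print(z, round(mz(10000.0, z), 3))
```

This is why one protein gives a series of peaks in the spectrum, and why the charge state has to be worked out before the mass can be read off.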
• SPECTRUM is a top-down approach:
o MS1 gives you one value.
o MS2 gives you fragments of the protein.
o Method is to add the fragments to try to attain the
MS1 value in order to ‘tune’ it.
o There can be reasons such as isotopes, PTMs that
result in MS1 not equal to the tuned value.
Lecture 2:
• Alternative Splicing:
o Protein isoforms can be formed through different
joining of exons – it adds complexity.
o PTMs get added (cleavage, glycosylation,
lipidation, ubiquitination, phosphorylation)
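To see how quickly exon skipping adds complexity, here is a sketch that lists every transcript obtainable by keeping or skipping each internal exon. It assumes, as a simplification, that the first and last exons are always kept and that exon order never changes; `exon_skipping_isoforms` and the exon labels are invented.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """All exon combinations that keep the first and last exon, in order."""
    first, internal, last = exons[0], exons[1:-1], exons[-1]
    isoforms = []
    for k in range(len(internal) + 1):
        for kept in combinations(internal, k):   # combinations preserve order
            isoforms.append([first, *kept, last])
    return isoforms


for isoform in exon_skipping_isoforms(["E1", "E2", "E3", "E4"]):
    print("-".join(isoform))
# E1-E4
# E1-E2-E4
# E1-E3-E4
# E1-E2-E3-E4
```

With n internal exons this gives 2^n possible transcripts, although a real tissue only produces a subset of them.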
• Evolutionary Relationships:
o Organisms are related to each other through a
common ancestor.
• Homologs:
o Similar proteins or DNA sequences are homologs.
o Either orthologs, or paralogs.
o Orthologs are similar proteins or DNA sequences
across species and can be explained through
speciation events.
o Paralogs are similar proteins or DNA sequences
within the same species and can be explained
through gene duplication.
o To tell between
• Databases:
o UniProt – Protein Sequences
o PDB – Protein Structures
o GenBank – DNA Sequences
o GeneMark – Gene Prediction
o Expasy – Translates DNA in its reading frames to
find ORFs.
• BLAST:
o Takes a query and fragments it into overlapping
k-mers, called words (can be 3-mers, 2-mers etc.)
o Now it compares these words to sequences in
the database
o Within these similarities, it uses the BLOSUM62
matrix to score the matches.
o BLOSUM looks at the substitution of one amino
acid for another and scores it accordingly.
o BLAST's algorithm looks for any similarities
between the words and the database sequences;
these similarities do not have to be exact, but can
involve a similar residue. Each is scored, and
matches are filtered according to the user-defined
threshold score.
o The threshold score defines what is a good score or
not, i.e. it separates the good matches from the bad
matches.
o The comparisons which score above the threshold
are then extended outwards from the matching word.
This confirms whether the word match was a
coincidence or a genuinely good match.
o A global alignment is when you match the entire query
to the database sequence; BLAST itself reports local
alignments (the best-matching regions).
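A stripped-down sketch of the seed-and-extend idea: break the query into words, find exact word hits in a database sequence, then extend one hit and keep the best-scoring stretch. This is not BLAST's actual implementation: a simple +1/-1 match/mismatch score stands in for BLOSUM62, only exact word hits are used, and only rightward ungapped extension is shown. All names, parameters and sequences are made up.

```python
def words(seq, k):
    """All overlapping k-mers of seq, with their start positions."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def find_seeds(query, subject, k=3):
    """Exact word hits as (query position, subject position) pairs."""
    index = {}
    for j, word in words(subject, k):
        index.setdefault(word, []).append(j)
    return [(i, j) for i, word in words(query, k) for j in index.get(word, [])]


def extend_right(query, subject, qpos, spos, k, match=1, mismatch=-1, drop=2):
    """Extend a seed to the right without gaps.

    Stops once the running score falls `drop` below the best score seen,
    and returns (best score, matched stretch of the query).
    """
    score = best = k * match
    best_len = k
    i = k
    while qpos + i < len(query) and spos + i < len(subject):
        score += match if query[qpos + i] == subject[spos + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= drop:
            break
    return best, query[qpos:qpos + best_len]


query = "MKTAYIAKQR"
subject = "GGMKTAYLAKQRLL"
print(find_seeds(query, subject))             # [(0, 2), (1, 3), (2, 4), (6, 8), (7, 9)]
print(extend_right(query, subject, 0, 2, 3))  # (8, 'MKTAYIAKQR')
```

The extension runs through the single I/L mismatch because the score recovers afterwards, which is the sense in which similarities do not have to be exact.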
• Types of BLAST (N = nucleotide, P = protein):
Type of BLAST   Query            Database         Alignment Type   Output
blastn          N                N                N                N
blastp          P                P                P                P
blastx          N (translated)   P                P                P
tblastn         P                N (translated)   P                N
tblastx         N (translated)   N (translated)   P                N