BIO101 Module 3

Uploaded by

Abdullah Qureshi

MODULE 3

Bioinformatics
Gene & Protein Structures:

Lecture 1:

Central dogma:
DNA (replication) -> transcription -> RNA -> translation -> Protein
RNA -> reverse transcription -> DNA

* The 4 nucleotides of DNA can be put into a text file to represent that sequence
* The sequence will conventionally always be written 5' to 3' for both strands
  · however, for the bottom strand this will be stated in the description portion of the sequence format (the bottom strand is written in reverse, since it runs 3' to 5' relative to the top strand)

Formats:

1. FASTA format:
* begins with an arrowhead '>' followed by the ID/Accession Number
* arrowhead, ID, followed by description and then length
* can remove/add certain details of the format
* no spaces in the sequence

2. EMBL format:
* no arrowhead
* rest same as FASTA (ID - description - length)
* can add/remove details such as length etc.
* spaces after a certain number of nucleotides, as well as a number of indents

* For writing the bottom strand, the description will say "reverse complementary" (this is written with respect to the top strand)
* Description will clarify if the sequence is mRNA or DNA
* Sequence will not have uracils (mRNA is also written with T in place of U)
* DNA sequence has both introns and exons
* mRNA sequence has only exons
* CDS means coding sequence (mRNA)
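As a rough illustration of the points above, here is a minimal Python sketch that reads a FASTA-style record and writes the bottom strand 5' to 3' (the reverse complement). The record ID `SEQ001`, its description and the sequence are made up for the example, and the helper names `parse_fasta` and `reverse_complement` are assumptions, not part of any particular library.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new arrowhead line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def reverse_complement(seq):
    """Return the bottom strand written 5' to 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


# Hypothetical record: ID, then description, then length, as in the notes.
example = """>SEQ001 hypothetical example gene, DNA, length 12
ATGGCC
TTTTAA
"""

records = parse_fasta(example)
header, top = records[0]
bottom = reverse_complement(top)
print(header)
print("top strand:   ", top)
print("bottom strand:", bottom)
```

Taking the reverse complement of the bottom strand gives back the top strand, which is why only one strand needs to be stored in the file.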

Start and Stop Signals:

* Start codon: ATG
* Stop codons: TGA, TAA, TAG
* A protein coding sequence will always have a start codon, a stop codon, and a sequence in between

Reading Frames:

* DNA can be read in 6 different ways ~ 3 for the top strand, 3 for the bottom strand
* Each successive reading frame skips one nucleotide (the frame starts one base further along)
* Expasy translates a DNA sequence into reading frames, and further into open reading frames (it can make mistakes by highlighting a sequence as an ORF even if there isn't a stop codon)
* Only one reading frame is the true reading frame

Open Reading Frames:

* A part within the reading frame
* Contains a start codon, a stop codon, and some sequence in between
* Can have multiple ORFs, or none
* Out of the 6 reading frames, only one is the true reading frame; this contains the true ORF, which codes for a protein
* The sequence before the start codon is the 5' untranslated region, and the sequence after the stop codon is the 3' untranslated region
* The true ORF is usually the longest ORF
  · this is because as the distance b/w start codon & stop codon increases, the ORF's length increases, and so does the probability of a stop codon occurring by chance
  · the fact that the ORF is long and still doesn't have a stop codon in the middle of the sequence means that it is most likely the protein coding sequence
* The longest ORF is not conclusive evidence for the presence of a gene, it is only indicative of probability
* It is possible that the true ORF is a shorter ORF
  · exceptions exist
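The reading-frame and ORF ideas above can be sketched in a few lines of Python. This is only an illustrative scan, not how Expasy works internally: it walks all 6 frames, reports every stretch that starts with ATG and ends at the first in-frame stop codon, and then picks the longest one as the candidate true ORF. The test sequence and the function name `find_orfs` are made up for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the bottom strand written 5' to 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def find_orfs(seq):
    """Return (strand, frame, start, end, orf) for each ATG...stop in the 6 frames."""
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Step through the strand one codon at a time in this frame.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    orfs.append((strand, frame + 1, start, i + 3, s[start:i + 3]))
                    start = None
            # An ATG with no stop codon after it is deliberately not reported.
    return orfs


dna = "CCATGAAATTTGGGTAACCATGCCCTGATT"  # made-up example sequence
orfs = find_orfs(dna)
longest = max(orfs, key=lambda orf: len(orf[4]))
for orf in orfs:
    print(orf)
print("longest ORF:", longest[4])
```

Positions on the "-" strand are counted along the reverse complement. In this example the bottom strand contains ATG codons with no stop codon after them, so nothing is reported there, which mirrors the caveat about tools highlighting an ORF without a stop codon.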

Promoters:

* Help identify a gene in the text file
* Upstream of the gene, next to it
* TATAA box on the same strand as the gene
* Bacterial promoters have a TATAA-like box at -10 as well as a second conserved element at -35 (positions counted upstream from the TSS)
* Eukaryotic promoters only have the TATAA box
* The Shine-Dalgarno (SD) sequence is the ribosomal binding site used for the ribosome to bind to the mRNA; it lies on the transcript just upstream of the start codon rather than at -35 in the promoter

Restriction Site Signals:

* These are palindromic sequences (top and bottom strand mean the same thing when read 5' to 3') of 4 to 8 base pairs in length
* The sequences indicate restriction sites
* The restriction enzymes recognise restriction sites and cut the DNA into different pieces
  · known as DNA endonucleases, with the function to cut at specific restriction sites
  · makes DNA smaller
* Example of a palindromic site: 5' TCGA 3' on the top strand, 3' AGCT 5' on the bottom strand
* These cuts result in 2 types of ends: sticky ends (overhangs) and blunt ends
* Total DNA molecules = number of cuts + 1 (for a linear DNA molecule)
* Directions need to match
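A small Python sketch of the two ideas above: a site is palindromic when it equals its own reverse complement, and cutting a linear molecule n times gives n + 1 pieces. The input sequence, the cut position inside the site (`cut_offset=1`) and the function names are assumptions chosen for illustration; only the top strand is modelled, so sticky versus blunt ends are not represented.

```python
def reverse_complement(seq):
    """Return the bottom strand written 5' to 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def is_palindromic(site):
    """A site is palindromic if both strands read the same 5' to 3'."""
    return site == reverse_complement(site)


def digest(dna, site, cut_offset):
    """Cut linear DNA at each occurrence of site, cut_offset bases into the site."""
    fragments, last = [], 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[last:cut])
        last = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[last:])
    return fragments


site = "TCGA"                # the site shown in the notes
dna = "AATCGAGGGTCGATT"      # made-up linear sequence with two TCGA sites
fragments = digest(dna, site, cut_offset=1)
print(is_palindromic(site))  # TCGA reads the same on both strands
print(fragments)             # two cuts give three fragments
```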

Expressed Sequence Tags:

* mRNA turned back into DNA (cDNA)
  · results in cDNA with no introns, only exons
  · short sequences of cDNA used to identify gene transcripts and are useful in gene sequence determination
  · generated by sequencing one or both ends of an expressed gene
  · fragments of mRNA sequences derived through single sequencing reactions performed on randomly selected clones from cDNA libraries
* There will be a difference between the original sequence and the mRNA, as mRNA does not have introns
* Aligning the mRNA and the original sequence shows the following:
  · the parts that contain sequence to be expressed (translated) are exons
  · the gaps represent the introns and show the exon-intron boundaries
  · ESTs can be aligned with the DNA gene sequence to identify exon-intron boundaries, as mRNA has no introns
* ESTs represent amino acid peptide sequences from expressed genes
  · ESTs contain untranslated regions as well

Predicting Genes:

* Easier in prokaryotes
  · promoter and start sites are conserved
  · prokaryotes cannot tolerate mutations in promoters
  · prokaryotes are also less complex, so promoter diversity is minimal
* Codon usage (codon bias): different codons are used differently in prokaryotes and eukaryotes
  ~ in a species-specific manner, certain codons are favoured over other codons for a particular amino acid
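Codon usage bias can be made concrete by counting how often each synonymous codon appears in a coding sequence. The sketch below does this for the six leucine codons; the coding sequence is invented for the example, and `codon_counts` and `usage_fraction` are hypothetical helper names.

```python
from collections import Counter

# The six codons that code for leucine in the standard genetic code.
LEUCINE_CODONS = ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"]


def codon_counts(cds):
    """Count codons in a coding sequence, ignoring any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def usage_fraction(cds, synonymous):
    """Fraction of each synonymous codon among those actually used in cds."""
    counts = codon_counts(cds)
    total = sum(counts[codon] for codon in synonymous)
    return {codon: counts[codon] / total for codon in synonymous if counts[codon]}


cds = "ATGCTGCTGCTGTTATAA"  # made-up CDS: Met, Leu, Leu, Leu, Leu, stop
fractions = usage_fraction(cds, LEUCINE_CODONS)
print(fractions)  # CTG is used for 3 of the 4 leucines here
```

Running the same count over many genes from one species would give the species-specific bias that gene prediction tools rely on.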

Signals & Contents:

* Signals: a small pattern within the genomic DNA
  ~ start, stop, splicing sites, transcription factor binding sites
* Content: a region of genomic DNA
  ~ exons, introns, UTR, promoters, enhancers, silencers
Lecture 2 - Comparative Sequence Analysis

* Biological data is immense
  · requires smart ways to access and analyse
* One gene codes for multiple proteins

Databases:

* Gene Sequences - GenBank
* Protein Sequences - UniProt
* Protein Structures - PDB (also has protein sequences, but has a separate tab for 3D structures)
* Translate DNA into RFs & ORFs - Expasy
* Gene Prediction - GeneMark

Alternative Splicing: leads to combinations of different proteins being made

* DNA is transcribed, and the transcript still has introns and exons
* The transcript is alternatively spliced to form mRNAs of just exons, joined in different combinations
* Alternative splicing is the different ways exons can be joined, which depends on the tissue/organ you are in
  · point is, exons can be alternatively spliced or 'skipped' to give different proteins
  · different organs/tissues have differently spliced exons
  · results in different variants of proteins (isoforms)
  · these isoforms will have a slightly varying function from each other
  ~ last step is post-translational modifications (phosphorylation, glycosylation, lipidation, ubiquitination, cleavage)
* Proteome is the entire set of proteins
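To see how exon skipping multiplies the number of possible transcripts, the sketch below lists every ordered combination of exons under one simplifying assumption: the first and last exon are always kept and only the middle exons can be skipped. The exon names `E1` to `E4` and the function name `isoforms` are hypothetical, and real splicing is more constrained than this.

```python
from itertools import combinations


def isoforms(exons):
    """List exon combinations that keep the first and last exon, in order."""
    first, *middle, last = exons
    result = []
    # Keep progressively fewer middle exons: all of them, then some, then none.
    for n_kept in range(len(middle), -1, -1):
        for kept in combinations(middle, n_kept):
            result.append([first, *kept, last])
    return result


variants = isoforms(["E1", "E2", "E3", "E4"])
for variant in variants:
    print("-".join(variant))
```

With two skippable exons there are already four possible mRNAs from one gene, which is the sense in which one gene codes for multiple proteins.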

Comparative Genomics:

* The common ancestors of organisms can be used to see how different organisms are related to each other
* Provides a view of how organisms are related to each other
* Shows evolutionary changes among organisms
* Identifies genes that are conserved among species

Phylogenetic Tree:

* Shows evolutionary distance from the common ancestor

Homologs:

* Similar genes between species
* Paralogs: homologs within the same species because of a gene duplication event
* Orthologs: homologs separated because of speciation events, derived from the common ancestor, not within the same species
BLAST: (Compares a novel query sequence with known sequences)

* Local and global alignments
* Takes the original sequence, cuts it into bits, and compares the query with the database to find matches: local alignment
* Once high scoring segment pairs are found, the match is extended along the entire query and database sequence (described in the lecture as the global alignment step; the alignments BLAST reports are local)
* BLAST creates a list of all "words" (short sequences) that have a certain "threshold" score when compared with the query sequence
  · words are 16-256 nucleotides, or 3 amino acids, put together consecutively in a row
* When a BLAST search is run, candidate sequences from the database are picked based on perfect matches to the small sub-sequences in the query sequence
* Local alignment reduces the computation power and narrows the search down for global alignment, based on 2 hits with the words
* Threshold score is user-defined
* The small fragment is called a k-mer
* Scoring is done on the basis of the BLOSUM62 matrix
* BLOSUM62 was developed by taking proteins that are 62% or more similar and looking at the probability of one amino acid (e.g. alanine) being substituted by another; the matrix is formed on these probabilities
* The more positive the score, the more likely the substitution takes place

* blastn: nucleotide query against a nucleotide database, output is nucleotide
* blastp: protein query against a protein database
* blastx: translated nucleotide query against a protein sequence database
* tblastn: protein query against a translated nucleotide database, to get an output of nucleotides
* tblastx: translated nucleotide query against a translated nucleotide database (ORFs against ORFs)
* Computational complexity (power) of the search increases with the number of translated reading frames compared

BLAST scoring (scoring alignments):
  · Different types of BLOSUM exist (BLOSUM 90, 62)
  · The number is the percent identity (e.g. 62% for BLOSUM62)

Type      Query        Database     Alignment Type   Output       Computational Power
blastn    Nucleotide   Nucleotide   Nucleotide       Nucleotide   1x
blastp    Protein      Protein      Protein          Protein      1x
blastx    Nucleotide   Protein      Protein          Protein      6x
tblastn   Protein      Nucleotide   Protein          Nucleotide   6x
tblastx   Nucleotide   Nucleotide   Protein          Nucleotide   36x

BLAST's algorithm:

* Global alignment is the entire sequence being compared
* Local alignment looks for parts of sequences that are similar
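The word-matching idea can be sketched as follows. This is a heavily simplified toy, not BLAST's actual implementation: words must match exactly, seeds are extended without gaps only while the letters keep matching, the score is just the matched length, and no BLOSUM62 scoring is used. The sequences, the names `seqA` and `seqB`, and the default values `k=3` and `threshold=5` are all assumptions made for the example.

```python
def kmers(seq, k):
    """All (position, word) pairs of length k in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def find_seeds(query, subject, k):
    """Exact word matches as (query_pos, subject_pos) pairs."""
    index = {}
    for j, word in kmers(subject, k):
        index.setdefault(word, []).append(j)
    return [(i, j) for i, word in kmers(query, k) for j in index.get(word, [])]


def extend(query, subject, i, j, k):
    """Grow a seed left and right while letters match; return (q_start, s_start, length)."""
    left = 0
    while i - left - 1 >= 0 and j - left - 1 >= 0 \
            and query[i - left - 1] == subject[j - left - 1]:
        left += 1
    right = 0
    while i + k + right < len(query) and j + k + right < len(subject) \
            and query[i + k + right] == subject[j + k + right]:
        right += 1
    return i - left, j - left, k + left + right


def search(query, database, k=3, threshold=5):
    """Return sorted (name, q_start, s_start, length) hits scoring at least threshold."""
    hits = set()
    for name, subject in database.items():
        for i, j in find_seeds(query, subject, k):
            q_start, s_start, length = extend(query, subject, i, j, k)
            if length >= threshold:  # user-defined threshold filters weak matches
                hits.add((name, q_start, s_start, length))
    return sorted(hits)


query = "GATTACA"
database = {"seqA": "TTGATTACAGG", "seqB": "CCGATCCC"}
hits = search(query, database)
print(hits)
```

`seqB` shares the word GAT with the query, but the match cannot be extended, so it falls below the threshold and is filtered out; this is the sense in which the seed step narrows the search before the more expensive comparison.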
Lecture 3 - Protein Sequence Analysis

Peptide Bonds:

* Bond b/w the carbonyl carbon (C) of one amino acid and the amino group (N) of the next
* Angles:
  1. Omega dihedral angle
     · Cα - C - N - Cα
     · 180°
  2. Phi (C - N - Cα - C)
  3. Psi (N - Cα - C - N)
  · phi and psi exist as allowable angles, shown by the Ramachandran plot: the range which allows a proper structure to be formed

[Ramachandran plot: regions labelled beta, right-handed α, and left-handed α]
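Omega, phi and psi are all dihedral angles, meaning the angle defined by four consecutive atoms. A minimal sketch of that calculation in plain Python is given below; the coordinates are made-up points rather than real atom positions, the function name `dihedral` is an assumption, and sign conventions for the angle differ between references.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points (e.g. Ca, C, N, Ca for omega)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Made-up planar zigzag: the outer points lie on opposite sides (trans).
trans = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
# Same backbone with the outer points on the same side (cis).
cis = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
print(trans, cis)
```

The trans arrangement gives 180 degrees, which corresponds to the omega value listed above; applying the same function to C, N, Cα, C and to N, Cα, C, N would give phi and psi.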
Edman Degradation:

* Used to sequence proteins
* Starts from the N-terminal, removing one amino acid at a time
* Adds phenyl isothiocyanate (PITC) to the N-terminus amino acid
* TFA (trifluoroacetic acid) is then added, which removes the amino acid which PITC is attached to
* This is repeated, separating all amino acids by breaking the bond on the N-terminus
* An isolated amino acid's weight is known, as well as PITC's weight
* Weights are subtracted to find the protein sequence
* Drawbacks:
  - restricted to 60 residues due to protein degradation (misfolds)
  - laborious: 50 amino acids a day
  - can only sequence one protein at a time

Mass Spectrometry:

[Diagram, in order: Ionization -> Mass Analyzer -> Detector -> Spectrum Assembly]

* Sequences proteins using the mass/charge ratio
* protein is injected and vaporised using a heater
* once vapourised, it is ionized by the addition of a proton
* then passed through a magnet, where it deflects due to its mass/charge ratio - the lightest mass deflects most
* the mass analyzer separates ions according to their m/z
* the detector is where the selected molecules hit
* spectrum assembly entails a proteomics software which is interfaced to the MS and assembles spectra
* m/z = (mass + charge on protein) / charge on protein -> mass by charge ratio
* Drawbacks:
  · Hard ionization techniques: proteins break on addition of protons (ionization)
  · Low resolution of mass analyzers: couldn't differentiate close-by masses
  · Search algorithms: isotopic envelope deconvolution & post-translational modifications detection
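The mass by charge relation above can be turned into a short worked example. The sketch assumes ionization by adding protons, with each proton adding about 1.007 Da (the notes round this to 1 per charge); the 10000 Da protein is a made-up value and the function names are hypothetical.

```python
PROTON_MASS = 1.007276  # approximate mass of one proton in daltons


def mz(mass, charge):
    """m/z of a molecule of the given neutral mass carrying `charge` extra protons."""
    return (mass + charge * PROTON_MASS) / charge


def neutral_mass(mz_value, charge):
    """Recover the neutral mass from an observed m/z and a known charge."""
    return mz_value * charge - charge * PROTON_MASS


protein_mass = 10000.0  # hypothetical intact protein mass in Da
for z in (1, 5, 10):
    print(z, round(mz(protein_mass, z), 4))
```

The same protein appears at a lower m/z as its charge increases, which is why the charge state has to be known before an observed peak can be converted back to a mass.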

Spectra:
  · TOP-DOWN (analyze the protein as it is)
  · BOTTOM-UP (remove PTMs, fragment and analyze: PTMs not conserved)

* Run MS1 on the mass spec data
* MS1 is the protein's intact mass
* The fragmented protein is the MS2 data
* Adding the masses of the fragments gets you to the intact protein mass
* MS1 is tuned by the difference between the summed MS2 fragment masses and the MS1 data
  · improves accuracy as well as precision
  ~ tune it to be more accurate
* MS1's intensity is always 1
* MS2's intensity is never
Lecture 1:
• Central dogma: DNA → RNA → Protein; Transcription →
Translation (direction 5’ to 3’)
• FASTA format and EMBL format are methods to write
down the DNA sequences in a text file to access
information easily and on a large scale.
• FASTA Format:
o Arrow header ‘>’, ID/Accession, Description, Length
(in general; you can add or remove elements in the
header, which is the first line, before the sequence)
• EMBL Format:
o ID/Accession, Description, Length (in general; you
can add or remove elements in the header, which is
the first line, before the sequence)
• Differences between the two:
o Arrowhead ‘>’ in FASTA, not in EMBL
• The description will tell us certain things:
o If sequence is top strand, description will say
nothing.
o If sequence is bottom strand, description will say it
is reverse complementary.
o It will tell you if the sequence is mRNA or DNA
(no uracils are present)
• Start Codons: ATG
• Stop Codons: TGA, TAA, TAG
• Between a start and stop codon, there will be a region that
can code for a protein.
• mRNA has only 3 reading frames, as only one of the
DNA strands will be translated into protein; DNA has 6
reading frames, 3 for top strand and 3 for the bottom.
• Eukaryotes have introns and exons, but Prokaryotes have
only exons.
• Introns and exons both have ORFs, so we cannot
differentiate between them.
• Longest ORF is usually the true ORF.
• Can have multiple ORFs in a single reading frame.
• Looking at the yeast genome shows that the true ORF (the
sequence which makes the protein) is most probably the
longest ORF.
• Exceptions to this do exist, but they are not that relevant.
• The point of the graph is that the longest ORF is a good
prediction of the true ORF.
• Promoters:
o Promoters start transcription of a specific gene
o On the sense strand of the gene
o Eukaryotic promoters have a TATAA box, whereas
the Prokaryotic promoters have a TATAA-like box
at -10 as well as a conserved element at -35.
o The Shine Dalgarno (SD) sequence is a conserved
ribosomal binding site; it is downstream of the
TATAA box, just upstream of the start codon.
o Promoters are always next to the gene, upstream of it.
• Restriction Site Signals:
o Cut DNA using restriction enzymes
(endonucleases) which cut at specific sequences
called restriction sites.
o These sites are 4-8 base pairs long and are
palindromic.
o Palindromic means if you read it 5’ → 3’ for the top
and bottom strand, it would read the same way.
o Sticky ends are the overhangs of sequences after a
cut.
o Blunt ends are when there are no overhangs of
sequences after a cut.
• Expressed Sequence Tags:
o Represent amino acid peptides sequences from
expressed genes.
o Take the mRNA and reverse transcribe it. This is
called cDNA.
o Compare the genome with the cDNA, the gaps
represent the introns, and the areas that match are
the exons. This helps you find out the intron-exon
boundaries.
• Prokaryotes are not tolerant of promoter mutations.
• Codon Usage Bias:
o Certain codons which code for certain amino acids
have a bias in a species-specific manner.
o To do gene prediction in a species, you must look at
codon usage bias.
• Signals are short recurring patterns (stop codon, start
codon, splicing sites and transcription factors)
• Contents are regions of genomic DNA (exons, introns,
UTR, promoters, enhancers, silencers)
• Observable traits from gene sequences:
o Nucleotide frequencies and their correlations
o Functional sites: splice sites, promoters, enhancers,
UTRs
Lecture 3:
• Each amino acid can be represented by a single-letter
code.
• Amino Acids & Peptide Bonds:
o Chiral carbon is C alpha
o Carbonyl carbon (C) is the C = O carbon
o Omega dihedral angle is C alpha, C, N, C alpha
(180 degrees)
o Psi: N – C alpha – C – N
o Phi: C – N – C alpha – C
• Ramachandran Plot:
o Phi and Psi angles exist as allowable angles.
o Top left is beta sheet, top right is left-handed alpha
helix, bottom left is right-handed alpha helix.
• Edman Degradation:
o PITC binds to the N – terminus of the amino acid
chain.
o TFA is added which is able to cut the peptide bond.
o The isolated amino acid is separated.
o This process is repeated.
o Drawbacks are that it is laborious (50 amino acids a
day limit), and it is restricted to 60 residues.
• Mass Spectrometry:
o Large scale
o Works on principle of ionizing proteins, adding a
positive charge to them
o Inject sample, vaporise, ionize, deflect in magnetic
field, mass separator, then final detector.
o Mass/charge ratio = (mass + charge) / charge
o Drawbacks are hard ionization techniques, low
resolution of mass analysers, and search algorithms
(isotopic envelope deconvolution, post-translational
modifications detection).
• SPECTRUM is a top-down approach:
o MS1 gives you one value.
o MS2 gives you fragments of the protein.
o Method is to add the fragments to try to attain the
MS1 value in order to ‘tune’ it.
o There can be reasons such as isotopes, PTMs that
result in MS1 not equal to the tuned value.

Lecture 2:
• Alternative Splicing:
o Protein isoforms can be formed through different
joining of exons – it adds complexity.
o PTMs get added (cleavage, glycosylation,
lipidation, ubiquitination, phosphorylation)
• Evolutionary Relationships:
o Organisms are related to each other through a
common ancestor.
• Homologs:
o Similar proteins or DNA sequences are homologs.
o Either orthologs, or paralogs.
o Orthologs are similar proteins or DNA sequences
across species and can be explained through
speciation events.
o Paralogs are similar proteins or DNA sequences
within the same species and can be explained
through gene duplication.
o To tell between
• Databases:
o UniProt – Protein Sequences
o PDB – Protein Structures
o GenBank – DNA Sequences
o GeneMark – Gene Prediction
o Expasy – Translates frames of DNA to RFs and
ORFs.
• BLAST:
o Takes a query and fragments it into any number of
k-mers (can be 3-mers, 2-mers etc.)
o Now it compares these fragments to sequences in
the database
o Within these similarities, it uses the BLOSUM 62
matrix to score the matches.
o BLOSUM looks at the substitution of one amino
acid for another and scores it accordingly.
o BLAST’s algorithm looks for any similarities
between the fragments and the database sequences –
these similarities do not have to be exact, but can
have a similar nucleotide etc. This is scored, and
eventually filtered according to the user-defined
threshold score.
o The threshold score defines what is a good score or
not, i.e. it separates the good matches from the bad
matches.
o The comparisons which score above the threshold
are then expanded, using a global alignment. This
confirms whether the fragment match was a
coincidence, or a genuinely good match.
o A global alignment is when you match the entire query
to the database
• Types of BLAST:
Type of BLAST   Query            Database         Alignment Type   Output
blastn          N                N                N                N
blastp          P                P                P                P
blastx          N (translated)   P                P                P
tblastn         P                N (translated)   P                N
tblastx         N (translated)   N (translated)   P                N
(N = nucleotide, P = protein)
