Monday, December 8, 2014

Pathway Analysis using IPA

Description of IPA:

  • IPA stands for Ingenuity Pathway Analysis.
  • It mainly deals with creating molecular networks (algorithmically generated pathways).
  • It groups data into the diseases and biological functions that are over-represented in your data.
  • It determines the over-represented signaling and metabolic canonical pathways.
There are 3 major aspects that can be taken from IPA:
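To make "over-represented" concrete, here is a minimal sketch of a right-tailed hypergeometric test, a common way to score pathway over-representation. This post does not describe the statistic IPA actually uses, so the function name `overrepresentation_p` and all the numbers below are assumptions for illustration only.

```python
from math import comb


def overrepresentation_p(overlap, list_size, pathway_size, background_size):
    """P(X >= overlap) when list_size genes are drawn at random from a
    background of background_size genes, pathway_size of which belong
    to the pathway (hypergeometric upper tail)."""
    upper = min(list_size, pathway_size)
    favourable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background_size, list_size)


# Hypothetical numbers: 5 of my 5 genes fall in a 5-gene pathway,
# out of a background of only 10 genes.
p_value = overrepresentation_p(overlap=5, list_size=5, pathway_size=5, background_size=10)
print(p_value)
```

A small p-value means the overlap between the gene list and the pathway is unlikely to have happened by chance.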
  • Drug Interactions
  • Growing/Generating a pathway
  • Canonical Pathways
As my pet gene is CYP2C9, I first had to find out whether it is present in IPA for pathway analysis. Searching for it in IPA gives the results in tabular form. I found my gene there, but there were no drug interactions associated with it.

Growing Pathway:
  • I searched for CYP2C9, created my pathway and then grew it.
  • Add to my pathway -> Build -> Grow -> (direct / human / remove chemical molecule types and biologic drug)
  • Then click on Apply.


  • In this scenario there are only 10 molecules and 10 relationships under direct association.
  • So we need to add some more pathways associated with CYP2C9 so that we get at least 20-25 relationships.


  • With the number of molecules kept constant, the number of relationships increased to 23.
  • If there are more than 25 relationships, we use the "Trim" option to reduce the number of relationships.

  • We can see that there are no inhibitors or activators, but there are definitely protein-protein interactions, for example between CYP2C9 and CYP2D6.
 
Canonical pathway:
  • Canonical means the standard, well-established pathway.
  • Here I considered Xenobiotic metabolism as my canonical pathway.

  • I chose this pathway because the major condition associated with my gene involves anticoagulation: variants of the gene cause slow warfarin metabolism.

XENOBIOTICS

A xenobiotic is a foreign chemical substance found within an organism that is not normally produced by or expected to be present within that organism. The term can also cover substances which are present in much higher concentrations than usual. Specifically, drugs such as antibiotics are xenobiotics in humans because the human body does not produce them itself, nor are they part of a normal diet.

 

  • The major gene associated with my pathway, Xenobiotic metabolism signaling, is CYP2C9, which is the obvious reason I chose this pathway! In addition we observe CYP3A4 and CYP2C. The figure below shows the disease associated with my pet gene. This gene plays a vital role in deep vein thrombosis, a disease that is treated with anticoagulation.

     

     

Monday, December 1, 2014

GENOME ANALYSIS

What's a genome?

In modern molecular biology and genetics the genome is the genetic material of an organism. It is encoded either in DNA or, for many types of viruses, in RNA. The genome includes both the genes and the non-coding sequences of the DNA/RNA. This genetic material plays a vital role in every person. It helps in understanding a disease condition, present or future. In my opinion, analyzing the full genome of a person can definitely help in the prevention of disease.

How can genome analysis be done?

Bioinformatics is playing a great role in genetics. The synergy of these two fields is creating a new entity in science and technology. A bioinformatics approach to genomic analysis needs to take various things into consideration:

1a. Sequencing
1b. Analysis of nucleic acid seq.
2. Analysis of protein seq.
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene & Protein expression data
7. Drug screening: ab initio drug design or drug compound screening in a database of molecules
8. Genetic variability

There are various types of genome analysis.

Quick references on how these analyses and DNA sequencing are done:

1.) https://www.youtube.com/watch?v=Q6UxQf1sVe8
2.) https://www.youtube.com/watch?v=fFzeGGrV7io

Whole genome sequencing and its uses in prevention of disease

Technological advances often outpace our ability to effectively use them, a situation that certainly could pertain to modern genomics. Breathtaking advances in genetic sequencing technology have the potential to make whole genome sequencing (WGS) available for healthcare and disease prevention. 

In the near future, WGS will transform diagnostic testing in the subset of patients with disorders resulting from disruption of a single gene or chromosomal region. Burgeoning application of WGS in a variety of clinical settings will allow assessment of the diagnostic yield in various subsets of symptomatic patients, guiding its widespread use in this setting. However, although WGS will almost certainly be a powerful diagnostic tool for patients with such disorders, whether such analysis will be a valuable clinical tool for those with common diseases is doubtful, for the simple reason that such disorders have many contributing non-genetic etiologies and because our ability to interpret the combinatorial effects of common genetic variants remains limited.

Three major factors that can be influential during WGS are:

1.) Family history
2.) Protein structure prediction
3.) Sequence comparison

Family History:

Family history may or may not play a crucial role in disease, because many things influence your overall health and your likelihood of developing a disease. Sometimes it's not clear what causes a disease. Many diseases are thought to be caused by a combination of genetic, lifestyle, and environmental factors. The importance of any particular factor varies from person to person. If you have a disease, does that mean your children and grandchildren will get it, too? Not necessarily. They may have a greater chance of developing the disease than someone without a similar family history, but they are not certain to get it.

Common health problems that can run in a family include:
  • Alzheimer's disease/dementia
  • Arthritis
  • Asthma
  • Blood clots
  • Cancer
  • Depression
  • Diabetes
  • Heart disease
  • High cholesterol
  • High blood pressure
  • Pregnancy losses and birth defects
  • Stroke.  

 

Genetic Diseases

Some diseases are clearly genetic. This means the disease comes from a mutation, or harmful change, in a gene inherited from one or both parents. Genes are small structures in your body's cells that determine how you look and tell your body how to work. Examples of genetic diseases are Huntington's disease, cystic fibrosis, and muscular dystrophy.

Protein Structure prediction and Sequence comparison:

Protein structure should be predicted when analyzing a particular disease, as this is the initial step that tells us about the interaction mechanisms with the receptors associated with the protein. There are various databases which provide protein structures, such as the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do).
In some cases the protein structure cannot be found, or only hypothetical structures are present. In those situations we can use sequence comparison to design our required protein. Sequence comparison can be done using the BLAST tool (http://blast.ncbi.nlm.nih.gov/Blast.cgi).

WGS on Alzheimer's Disease:

This was a project in which various families participated. Whole genome sequencing (WGS) was performed for selected subjects from large multigenerational extended families with late-onset Alzheimer's disease, to identify novel genes and alleles associated with the occurrence of late-onset AD. The full pedigrees of the individual families had to be considered, as the study concentrates on inheritance.

https://www.niagads.org/adsp/content/study-design
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Let's consider this example scenario in explaining genome analysis:

John has been confirmed as homozygous for a variant in the CYP2C9 gene that is associated with warfarin sensitivity, and he has deep vein thrombosis. He has had complex responses to multiple medications and has multiple clinical conditions that are not explained by the disease variant or by 1 year of testing at a traditional diagnostic center.


Warfarin sensitivity:

Warfarin is the most commonly prescribed anticoagulant and is among the top 20 most prescribed drugs in the US. Patients who suffer from blood clotting diseases like deep vein thrombosis (DVT) are given this medication. John is a patient suffering from DVT and warfarin sensitivity. The cause of the warfarin sensitivity is the variant *2 or *3 allele of CYP2C9. Traditional anticoagulant treatment was not working for him, so on the physician's suggestion John's parents decided to go with WGS.

I think participating in a clinical trial that offers full exome analysis for John and his parents at no personal cost is a good option. The various medications didn't work on him, which might mean that other variants in his genome contribute to the disease. His parents also get the exome analysis free of cost, which is useful because blood clotting disorders can be inherited. I feel this is an advantage for both the researchers and the participants.

Paying $5-$10k for this procedure is not of much use, as they have the option of participating in research instead. If the research turns up incidental findings that were not found at the traditional diagnostic testing center, they can use the money for future medications. This depends on John's financial and health situation.

If there are any incidental findings, they could play a major role in John's life. WGS will turn out to be a pioneer in disease treatment.


 Part 2


Converting variant to VCF format:

1.) OMIM was used to identify the "rs ID" of the variant that is associated with my disease.


2.) Then I navigated to the "Gene" section of NCBI. From the "Genomic Region and Transcripts" section, I selected "GRCh38 Primary Assembly" and looked up my variant by searching for "rs1057910" in the search box provided. The view was zoomed in to the DNA level. In this view the CYP2C9 gene (chromosome 10) starts at position 1, and the variant is at position 47639.



3.) The NCBI viewer shows my variant in exon 7.


VCF format according to VCF 4.1:

#CHROM   POS     ID           REF   ALT   QUAL   FILTER   INFO             FORMAT   CB00001

10       47639   rs1057910    A     C     25     PASS     NS=1;DP=35;DB    GT:GQ    1|1:52
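The record above can also be written out programmatically. This is only a minimal sketch: the helper names `vcf_header` and `vcf_record` are made up for illustration, and the values are the ones from this post. A real VCF file is tab-separated and starts with a `##fileformat` meta line.

```python
# Fixed column order from the VCF 4.1 specification; sample columns follow FORMAT.
VCF_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def vcf_header(sample_names):
    """Build the tab-separated column header line."""
    return "\t".join(VCF_COLUMNS + list(sample_names))


def vcf_record(chrom, pos, var_id, ref, alt, qual, flt, info, fmt, samples):
    """Build one tab-separated data line; `samples` holds one string per sample."""
    fields = [chrom, str(pos), var_id, ref, alt, str(qual), flt, info, fmt]
    return "\t".join(fields + list(samples))


header = vcf_header(["CB00001"])
record = vcf_record("10", 47639, "rs1057910", "A", "C", 25, "PASS",
                    "NS=1;DP=35;DB", "GT:GQ", ["1|1:52"])

print("##fileformat=VCFv4.1")
print(header)
print(record)
```

The genotype `1|1` means both copies carry the alternate allele, which matches John being homozygous for the variant.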

Monday, November 10, 2014

MOLECULAR DIAGNOSTICS

ClinVar

  • ClinVar accessions submissions reporting human variation, interpretations of the relationship of that variation to human health and the evidence supporting each interpretation. 
  • The database is tightly coupled with dbSNP and dbVar, which maintain information about the location of variation on human assemblies.

I used ClinVar to document the number of pathogenic variants associated with my gene and how many have been reviewed by an expert panel or professional society.

Fig 1 shows the ClinVar page for the CYP2C9 gene
  • There are 16 allelic variants in total in the database.
  • Of these, 4 are pathogenic allelic variants.
  • There were no reviews by an expert panel or professional society, as we can see in the review status filter of this database.
 Document how many labs provide tests related to the CYP2C9 gene

  • NCBI Genetic Testing Registry (GTR)
  • The Genetic Testing Registry (GTR) provides a central location for voluntary submission of genetic test information by providers.
  • The scope includes the test's purpose, methodology, validity, evidence of the test's usefulness, and laboratory contacts and credentials. 
  • The overarching goal of the GTR is to advance the public health and research into the genetic basis of health and disease.

CYP2C9 GENE:

  • There are 4 different laboratories that offer diagnostic testing related to the disease associated with CYP2C9.

Fig shows the number of laboratories offering tests for the disease associated with CYP2C9
  • The condition associated with my pet gene is "Warfarin response". Here we can see that the variant change 430C>T (p.Arg144Cys) leads to this particular condition.
  • Every laboratory has different methods to diagnose the disease.
DNA STAR to simulate RFLP:
  • RFLP stands for restriction fragment length polymorphism, a technique that exploits variations in homologous DNA sequences. It refers to a difference between samples of homologous DNA molecules that comes from differing locations of restriction enzyme sites, and to a related laboratory technique by which these segments can be illustrated.
  • First perform RFLP on the natural DNA without making a variant change, and then perform a virtual agarose gel simulation using DNA STAR.
Fig represents the SeqBuilder (DNA STAR) module showing the restriction enzymes (6 cutters) around position 430 in the sequence.
  • GeneQuest is a module in DNA STAR where we can perform RFLP analysis.
  • We observe that there is no restriction enzyme site at the position where the variant is, so we need to consider another restriction enzyme with a site within 7 positions of our variant site. We found "EarI".
  • So the agarose gel simulation is performed using this enzyme.
Results of RFLP:





Fig represents RFLP summary and agarose gel simulation
  • Repeat the same steps for the variant change C>T at position 430.
  • Use EditSeq to make the variant change in the DNA sequence.



Conclusion:

We can see that there are restriction sites near the site of my variant, but the variant does not affect the digestion of the DNA.
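The idea behind this result can be sketched in a few lines of code. This is not what DNA STAR does internally. It is a simplified single-strand digest on a made-up toy sequence, using EcoRI (GAATTC, cut after the G) as a stand-in six-cutter, not the EarI site used above. The function names are hypothetical.

```python
def cut_positions(seq, site, cut_offset):
    """0-based cut indices for every occurrence of `site` on this strand."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = seq.find(site, start + 1)
    return positions


def fragment_lengths(seq, site, cut_offset):
    """Fragment sizes after a complete digest of a linear sequence."""
    edges = [0] + cut_positions(seq, site, cut_offset) + [len(seq)]
    return [end - begin for begin, end in zip(edges, edges[1:])]


def substitute(seq, index, base):
    """Return a copy of `seq` with the base at 0-based `index` replaced."""
    return seq[:index] + base + seq[index + 1:]


# Toy sequence (an assumption, not CYP2C9): 10 bases, one EcoRI site, 10 bases.
wild_type = "ACGTACGTAC" + "GAATTC" + "TTGACCTGAA"
variant_outside_site = substitute(wild_type, 3, "C")   # change away from the site
variant_inside_site = substitute(wild_type, 12, "G")   # change destroys the site

print(fragment_lengths(wild_type, "GAATTC", 1))
print(fragment_lengths(variant_outside_site, "GAATTC", 1))
print(fragment_lengths(variant_inside_site, "GAATTC", 1))
```

A change outside the recognition site leaves the fragment pattern untouched, which is the situation described in the conclusion. Only a change inside the site would alter the bands on the gel.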

Monday, November 3, 2014

ANTIGENICITY PREDICTION OF EBOLA VIRUS PROTEIN VP40 (GROUP)


  • Analysis of ebolavirus proteins done by our group: Saikrishna, Naushad and Xi.
  • We used 3 different patients from the NCBI BioProject module. We tried to analyze whether there is any variation in the antigenicity (MHC-I binding) prediction, which was done using the IEDB Analysis Resource.
 Fig represents the 3 patients we used to analyze the MHC-I binding prediction.

  • The 3 patient IDs are KM233115.1, KM233091.1 and KM233042.1.
  • I then obtained the VP40 protein sequences from these individual patients and did my analysis in IEDB.
  • I chose one particular allele, HLA-A*01, for all 3 patients and then checked whether there is any change in the MHC-I binding peptides (epitopes).
  • My teammate Naushad used all the alleles for the viral protein VP24, and Xi used VP30 and did the analysis using DNASTAR.
  • You can view their blogs through the following web links:
  • Naushad - http://www.noshiahmad.blogspot.com
  • Xi - http://www.missxiwangbioinformatics.wordpress.com
 Analysis of a particular protein is done using IEDB for individual patients:
  1. Patient - KM233115.1 - GenBank: AIG96606.1
  2. Patient - KM233091.1 - GenBank: AIG96390.1
  3. Patient - KM233042.1 - GenBank: AIG95949.1

                           Fig Represents the patient 1 MHC-I binding prediction.

Fig Represents the patient 2 MHC-I binding prediction.

Fig Represents the patient 3 MHC-I binding prediction

My group mates used DNASTAR for this experiment. For cross-reference I also used the online tool PROPRED-I, a server that calculates MHC-I binding peptides.
  • On this server too, the "WTDDTPTGS" peptide is a binder, but not the best one based on the calculations performed using log scale values.
    Fig represents the Propred-I page after submission of the sequence
    The peptide mentioned above can be seen in the sequence below
    According to this server the best binders for this sequence are:
    Fig represents the ranking of peptide binders using log scale

    Conclusion: There is no change in the T-cell antigenic determinants; the VP40 gene shows no difference between the three patients. We can also see how the binders are ranked by the different algorithms used.

ANTIGENICITY OF EBOLA VIRUS VP40 PROTEIN

EBOLA VIRUS
  • Ebola is a disease of humans and other primates caused by Ebola virus. Ebola virus disease (EVD), formerly known as Ebola haemorrhagic fever, is a severe, often fatal illness in humans.
  • The virus is transmitted to people from wild animals and spreads in the human population through human-to-human transmission.
TRANSMISSION
  • It is thought that fruit bats of the Pteropodidae family are natural Ebola virus hosts. Ebola is introduced into the human population through close contact with the blood, secretions, organs or other bodily fluids of infected animals such as chimpanzees, gorillas, fruit bats, monkeys, forest antelope and porcupines found ill or dead in the rainforest.
VIROLOGY
  • Ebolaviruses contain single-stranded, non-infectious RNA genomes. Ebolavirus genomes are approximately 19 kilobases long and contain seven genes in the order 3'-UTR-NP-VP35-VP40-GP-VP30-VP24-L-5'-UTR. The genomes of the five different ebolaviruses (BDBV, EBOV, RESTV, SUDV and TAFV) differ in sequence and in the number and location of gene overlaps. As with all filoviruses, ebolavirions are filamentous particles that may appear in the shape of a shepherd's crook, of a "U" or of a "6", and they may be coiled, toroid or branched. In general, ebolavirions are 80 nanometers (nm) in width and may be as long as 14,000 nm.
  • In this blog my emphasis is on the VP40 gene of ebolavirus, and I would like to describe the antigenicity of its protein.

 STRUCTURE OF EBOLA VP-40 PROTEIN
 http://www.rcsb.org/pdb/images/4ldb_bio_r_500.jpg 
  •  Fig Represents 3D Structure of VP-40 protein obtained from PDB Database
FUNCTION
  • Promotes virus assembly and budding by interacting with host proteins of the multivesicular body pathway. May facilitate virus budding by interacting with the nucleocapsid and the plasma membrane. 
  • Specific interactions with membrane-associated GP and VP24 during the budding process may also occur. The hexamer form seems to be involved in budding.
  •  The octamer form binds RNA, and may play a role in genome replication.
 ANTIGENICITY
  • Antigenicity is the capacity of a chemical structure (either an antigen or hapten) to bind specifically with certain products of adaptive immunity: T cell receptors or antibodies (a.k.a. B cell receptors).
  • We can determine the antigenicity of a protein sequence using the PROTEAN 3D module of DNA STAR.
  • We performed this experiment using DNA STAR and obtained the regions in the protein showing a high level of antigenicity.
  • We can even obtain the B-cell epitopes, MHC II epitopes and T-cell epitopes.
Fig represents the output of antigenicity obtained from DNASTAR
Fig represents the epitope predicted by DNASTAR
  • The above figure represents the regions in the protein sequence which exhibit antigenicity. The protein sequence was obtained from a patient.
  • This was done using the NCBI BioProject module, where we can find information regarding the ebolavirus and its outbreaks in some patients.
  • For T-cell epitope prediction, DNASTAR uses the Rothbard-Taylor algorithm, which scans the sequence with an amino acid window of 4 or 5 to predict the epitopes.
  • The epitopes should follow a recurring pattern.
  • If there are four amino acids, the pattern should be charged/G, hydrophobic, hydrophobic and polar/G.
  • In the above example, the four amino acids selected in the figure are G, A, L, R, which follow this pattern.
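The four-residue pattern can be scanned for with a short script. This is only a sketch of the pattern as described in this post (charged or glycine, two hydrophobic residues, then polar or glycine, as in the G-A-L-R example), not DNASTAR's implementation. The residue groupings below are my own assumptions, and the test sequence is made up.

```python
# Residue classes are assumptions for illustration; tools may group them differently.
CHARGED = set("DEKRH")
HYDROPHOBIC = set("AVLIFMWY")
POLAR = set("STNQCY") | CHARGED  # charged residues are also counted as polar here


def rothbard_taylor_4mers(seq):
    """Return (1-based start, motif) for each 4-residue window matching
    [charged or G]-[hydrophobic]-[hydrophobic]-[polar or G]."""
    hits = []
    for i in range(len(seq) - 3):
        first, second, third, fourth = seq[i:i + 4]
        if ((first in CHARGED or first == "G")
                and second in HYDROPHOBIC
                and third in HYDROPHOBIC
                and (fourth in POLAR or fourth == "G")):
            hits.append((i + 1, seq[i:i + 4]))
    return hits


# Hypothetical toy peptide containing the G-A-L-R example from the figure.
print(rothbard_taylor_4mers("MGALRS"))
```

The same loop could be extended to the five-residue window, but the post only spells out the four-residue pattern, so the sketch stops there.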
IEDB T-CELL EPITOPE PREDICTION
  • Every patient has an ID, so we picked the patient with ID KM233090.1 and analyzed the VP40 protein.
  • We obtained the VP40 sequence from this patient and then predicted the epitopes with IEDB for one particular HLA allele.
  • In this case I chose the HLA-A*01 allele and an epitope length of 9 amino acids. An epitope is considered a good binder if it has a low percentile rank.
  • By this procedure I found the epitope "WTDDTPTGS" with a percentile rank of 0.9.
  • The best peptide binder is the one with the lowest percentile rank.
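The selection logic can be sketched as follows: cut the protein into overlapping 9-mers, then pick the peptide with the lowest percentile rank. The IEDB scoring itself is not reproduced here. The only real value is the 0.9 rank for "WTDDTPTGS" from this post. The other peptides and ranks are invented placeholders, and the function names are hypothetical.

```python
def kmers(seq, k=9):
    """All overlapping peptides of length k, in sequence order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def best_binder(percentile_ranks):
    """Return the (peptide, rank) pair with the lowest percentile rank."""
    return min(percentile_ranks.items(), key=lambda item: item[1])


# Toy input: a 10-residue string gives two 9-mers.
print(kmers("ABCDEFGHIJ"))

# Placeholder ranks; only the WTDDTPTGS value comes from the IEDB result above.
ranks = {
    "WTDDTPTGS": 0.9,
    "PEPTIDEAA": 5.0,   # invented placeholder
    "AAAAAAAAA": 40.0,  # invented placeholder
}
print(best_binder(ranks))
```

In practice the ranks would come from the IEDB output table for the chosen allele and peptide length.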

Monday, October 13, 2014

Protein 3D structure

Protein Data Bank

It is a repository for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. This data is freely available to everyone around the world. The structures are obtained by two major methods, X-ray crystallography and NMR spectroscopy.

  • Structures in PDB have their own IDs (identifiers), and we can access them by typing their IDs into the PDB search engine.
  • My pet gene is CYP2C9, and the protein associated with this gene has the PDB ID 1OG5. On further research I found that 1OG2 also corresponds to my pet gene, but the ligands bound to the protein differ.
  • 1OG5 has two ligands associated with it, S-WARFARIN and a Heme C group (this is the structure I am considering right now).
  • 1OG2 has a Heme C group.
  • 1OG5 is from Homo sapiens, and S-WARFARIN is the ligand bound to it.
                                                               3D structure of 1OG5


Cn3D Protein Visualization Tool:

  • Cn3D ("see in 3D") is a helper application for your web browser that allows you to view 3-dimensional structures from NCBI's Entrez Structure database. Cn3D is provided for Windows and Macintosh.
  •  Cn3D simultaneously displays structure, sequence, and alignment, and now has powerful annotation and alignment editing features.
  • From the Structure module in NCBI we obtained our structure and visualized it using Cn3D.
  • We can change the type of view by going to Style menu -> Rendering shortcuts -> Space fill.
  • We select the disease-causing site, ILE at position 359, in the sequence window.


 Fig represents Molecule(pink color) in space fill format and ILE at 359th position is represented in the sequence window (yellow color) 

  • To show position 359 in "Worms" view: Style menu -> Rendering shortcuts -> Worms.

 Fig represents ILE site 359th position in "Worms" view
  • Then Select menu ------> Show selected residues.

 Fig represents ILE (specifically) site 359th position in "Worms" view

Secondary Structures:

The secondary structure associated with the site (ILE at position 359) is shown below according to the PDB secondary structure assignment, which uses CATH and SCOP.


         Fig represents ILE at position 359 has an Alpha helix


DNA STAR Protein Secondary Structure Analysis:

The Protean software from DNA STAR was used to view the secondary structures of the protein. It uses 2 algorithms, Garnier-Robson and Chou-Fasman. Both algorithms predict that this site is a beta sheet.

 Fig representing that ILE at 359th position is a Beta sheet

We can see that the predictions disagree: the DNA STAR algorithms predict the site as a sheet, whereas PDB (SCOP) assigns it to a helix.

Monday, October 6, 2014

Protein Structure prediction and human variation

This post is about the protein and allelic variant analysis that I have done using DNA STAR.

Online Mendelian Inheritance in Man (OMIM) provides the allelic variants that are required for the analysis. My gene has 3 allelic variants. These changes correspond to changes in the amino acids by substitution, deletion etc.

TOLBUTAMIDE POOR METABOLIZER - ILE359LEU
WARFARIN SENSITIVITY - ARG144CYS
WARFARIN SENSITIVITY - LEU208VAL


 I considered the first allelic variant and performed analysis for my protein.

The ile359-to-leu (I359L) substitution results from a 1075A-C transversion in the CYP2C9 gene and is also known as rs1057910 and CYP2C9*3. The variant leads to reduced warfarin metabolism and increased risk of bleeding.
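As a quick check of the numbering, the coding position 1075 can be mapped to its codon. This is a minimal sketch: the helper names are hypothetical, and because this post does not state which isoleucine codon CYP2C9 uses at position 359, all three isoleucine codons are tried.

```python
# Small slice of the standard genetic code, enough for this check.
CODON_TABLE = {
    "ATT": "Ile", "ATC": "Ile", "ATA": "Ile",
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu",
}


def codon_location(cds_position):
    """Map a 1-based coding-sequence position to
    (1-based codon number, 0-based offset within the codon)."""
    return (cds_position - 1) // 3 + 1, (cds_position - 1) % 3


def mutate_codon(codon, offset, new_base):
    """Return the codon with the base at `offset` replaced by `new_base`."""
    return codon[:offset] + new_base + codon[offset + 1:]


codon_number, offset = codon_location(1075)
for codon in ("ATT", "ATC", "ATA"):
    variant = mutate_codon(codon, offset, "C")
    print(codon_number, codon, CODON_TABLE[codon], "->", variant, CODON_TABLE[variant])
```

Position 1075 is the first base of codon 359, and changing that A to C turns any isoleucine codon into a leucine codon, which agrees with I359L.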

This variation is made using the EditSeq module of DNA STAR, where we can edit the protein sequence and convert it into the allelic variant protein sequence.
A quick screenshot showing the change in the protein sequence at position 359:




Note:
If you want to recheck that the amino acid substitution is correct, try reverse translating the edited protein and check that there is a corresponding change in the gene sequence.
The amino acid change converts isoleucine to leucine.
After replacing the amino acid, we need to analyze the changes in the protein structure, hydrophobicity and other physical properties. For this we have applications called Protean and Protean 3D (the next version of Protean, which is better at visualization but a little more difficult to navigate). I selected a window of amino acids 355-365 to look for changes in either the secondary structure or the hydrophobicity. We use a window because the hydrophobicity values vary between algorithms. We can use the DNA STAR Protean tool to view the changes in the physical properties.


                                                                        CYP2C9 actual protein



                                                                      Allelic variant (ILE359LEU)


We notice that there is no significant change in the variant region compared with the features of the original protein. The Kyte-Doolittle hydrophobicity plot shows no significant difference even after the amino acid change. The threshold values for amino acid hydrophobicity obtained by different algorithms are given in the table.


Table 1: Representing threshold values for hydrophobicity in amino acids
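The small size of the change is easy to see with the Kyte-Doolittle scale itself, where isoleucine scores 4.5 and leucine 3.8. Below is a minimal sketch of a window average. The 11-residue window is a made-up stand-in for positions 355-365 (I have not copied the real CYP2C9 sequence here), and `mean_hydropathy` is a hypothetical helper, not a Protean function.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(window):
    """Average Kyte-Doolittle score over a window of residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in window) / len(window)


# Hypothetical 11-residue window with Ile in the centre (index 5).
wild_window = "GTSLAIDLFGA"
variant_window = wild_window[:5] + "L" + wild_window[6:]

shift = mean_hydropathy(variant_window) - mean_hydropathy(wild_window)
print(round(shift, 3))
```

Whatever the neighbouring residues are, swapping Ile for Leu moves an 11-residue window average by only (3.8 - 4.5) / 11, which is why the plot barely changes.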

Secondary structure prediction:

I used Protean 3D for analyzing the secondary structure of natural and variant proteins. This prediction is done by Chou-Fasman algorithm which is an empirical technique developed for prediction of secondary structures in proteins.

                                                                      Natural protein




                                                                         Allelic Variant




First we need to select a window of 10 amino acids with position 359 in the middle. Comparing the two proteins with this procedure, we see that there is not much of a change, but the beta sheets show a little difference. Apart from flexibility, I did not observe much change. I suppose a more detailed analysis of the protein in the future could explain why there is no change when an amino acid that matters for the function of the protein is replaced.


My gene is not associated with the membrane, so there is not much of a difference in the transmembrane region, which we can visualize in Protean 3D.

Key points:
A codon change in the CDS region of a gene should generally correspond to a functional or structural change. Here the amino acid (codon) change results in a functional change, but not in a structural or any other physical change.