Wednesday, December 15, 2010

Random Patterns

Two recent projects have involved the idea of commonly found sequences.  The first was done for a Computer Science for Bioinformatics course.  The idea was to find a pattern in phosphorylation sites by investigating whether particular spatial or sequential regions were heavily involved in these phosphorylations.  There exists a known list of phosphorylation sites.  Each entry lists the amino acids (AAs) to the left and right of the phosphorylated AA.  These flanking sequences were analyzed to see whether there was a relationship between the sequence and being phosphorylated.  The sites were also analyzed by spatial distance, using an arbitrary cutoff of 10 angstroms, but that will not be discussed here.  The next part of the experiment was to take a random sample from the same genome and analyze it for similarities.  While this may sound commonplace, I question its use in this example.
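As a rough sketch of the flanking-sequence analysis described above, the residues at each position around the phosphorylated AA can simply be tallied. The function name `positional_counts` and the example windows are hypothetical and made up for illustration; they are not taken from the course project or from any real list of sites.

```python
from collections import Counter


def positional_counts(windows):
    """Count which residue appears at each position across equal-length windows.

    Each window is assumed to be a string of one-letter AA codes with the
    phosphorylated residue in the middle and its flanking residues on each side.
    """
    if not windows:
        return []
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")
    # One Counter per column of the aligned windows.
    return [Counter(w[i] for w in windows) for i in range(width)]


# Hypothetical windows: 3 residues left, the phosphorylated residue, 3 right.
windows = ["RRASVAL", "KRLSLEA", "ARKTPLA"]
counts = positional_counts(windows)

for offset, column in zip(range(-3, 4), counts):
    print(offset, dict(column))
```

A position where one residue dominates the column would be a candidate pattern, although whether that dominance means anything depends on what it is compared against.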
  I don't believe that AAs are distributed perfectly at random, or that every AA occurs in equal amounts in any genome.  I believe randomization should be replaced by normalization.  The normalization should also account for poly A tails, GC content, or any known repeating pattern in non-coding regions.  But what if those non-coding regions also play some sort of role?  Should a machine learning technique be used?  If so, which requirements should be valued?  Phosphorylation is often carried out by proteins of different sizes and charges; might that play a role in this experiment?  Should steric hindrance and spatial orientation play a role?  These are good questions, and I think that at a minimum the individual genome should be taken into consideration.  A and L may play a large role in these sample sequences, but do they play a large role overall?  While this project was completed, the question has not been resolved.
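One way to picture the normalization idea is to compare each AA's frequency in the sample against its frequency in a background drawn from the same genome, instead of assuming all AAs are equally likely. This is only a minimal sketch: the function name `log_enrichment`, the pseudocount of 1, and the example strings are assumptions for illustration, and it does not address GC content or repeats.

```python
import math
from collections import Counter


def log_enrichment(sample, background, pseudocount=1.0):
    """Return log2(sample frequency / background frequency) for each residue.

    A score near 0 means the residue is about as common in the sample as in
    the background; positive means enriched, negative means depleted.
    The pseudocount (an assumed smoothing choice) avoids division by zero
    for residues missing from one of the two sequences.
    """
    alphabet = sorted(set(sample) | set(background))
    sample_counts = Counter(sample)
    background_counts = Counter(background)
    sample_total = len(sample) + pseudocount * len(alphabet)
    background_total = len(background) + pseudocount * len(alphabet)

    scores = {}
    for aa in alphabet:
        sample_freq = (sample_counts[aa] + pseudocount) / sample_total
        background_freq = (background_counts[aa] + pseudocount) / background_total
        scores[aa] = math.log2(sample_freq / background_freq)
    return scores


# Hypothetical data: flanking residues pooled from the sample, and a longer
# stretch standing in for the organism's overall composition.
sample = "AALAKLRA"
background = "AKLGSTAVLEDRKAGLPSTA"
for aa, score in log_enrichment(sample, background).items():
    print(aa, round(score, 2))
```

Under this view, A and L only count as interesting in the sample if they are more common there than they already are in the organism overall.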
  Another project has popped up recently involving a binding site problem.  A small clip of DNA was analyzed to see whether it contained common patterns.  The frequency of each AA was counted and the percentages were calculated.  These percentages were compared against control samples of equal length, all taken from the same genome.  I believe these specific examples should be compared not against control sequences, but against the genome with, at a minimum, the known patterns of GC content removed.  That is what I consider a normalization process.  I do not know the best way to compare these samples.  Another question is the chance that a certain AA is replaced by another with the same binding properties.  Should all of these differences be weighted equally, or should similarly charged residues be weighted more heavily?  While this idea is only a small piece of a puzzle involving machine learning, the learning is at best as accurate as the book it reads.  As I enter probability and statistics this next semester, I hope to better understand the ways in which I can use mathematics to show the relationships among data.
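The weighting question can be made concrete with a small scoring sketch. Here identical residues score 1, residues that differ but carry the same charge score a partial weight, and everything else scores 0. The charge grouping (K, R, H as positive; D, E as negative), the weight of 0.5, and the function names are all assumptions chosen for illustration, not an established scoring scheme.

```python
# Assumed grouping by side-chain charge at roughly neutral pH.
# Histidine is only weakly positive; including it here is a judgment call.
CHARGE_CLASS = {"K": "+", "R": "+", "H": "+", "D": "-", "E": "-"}


def charge_class(residue):
    """Return '+', '-', or '0' (uncharged) for a one-letter AA code."""
    return CHARGE_CLASS.get(residue, "0")


def weighted_similarity(a, b, same_charge_weight=0.5):
    """Score two equal-length sequences position by position.

    Identical residues add 1.0. Different residues that are both charged
    with the same sign add same_charge_weight. Anything else adds 0.
    The total is divided by the length, so the result lies in [0, 1].
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0.0
    score = 0.0
    for x, y in zip(a, b):
        if x == y:
            score += 1.0
        elif charge_class(x) != "0" and charge_class(x) == charge_class(y):
            score += same_charge_weight
    return score / len(a)


# Hypothetical pair: K->R keeps the positive charge, A->G does not match.
print(weighted_similarity("KDA", "RDG"))
print(weighted_similarity("KDA", "RDG", same_charge_weight=0.0))
```

Setting `same_charge_weight` to 0 treats every substitution as equally different, so running both versions on the same data shows how much the weighting choice alone changes the comparison.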
