Methylation Patterns

Wednesday, December 15, 2010

Laptop Design for Higher Performance Computing

While my previous posts were about biological problems, I wanted to mention another problem on my mind: purchasing a laptop. I decided to purchase an AMD-based laptop over an Intel-based laptop because of its superior price-to-performance ratio. After researching, I've found that AMD's larger 45 nm technology is not as good as Intel's. While I'm in no way a computer engineer, the benchmarks typically give the price-point performance to AMD. A major drawback of the 45 nm AMD technology versus the newer 32 nm Intel technology is the energy used. There is also a difference in the amount of on-chip cache, which as I understand it favors most Intel chips. Processor technology has not had many recent advances outside of multiple cores. As clock speeds increase, energy usage increases dramatically, and with that energy comes heat. On a normal desktop the processor rarely approaches the ~70 degrees Celsius mark. Laptops, on the other hand, are notorious for their poor ability to dissipate heat. Analyzing my new laptop has revealed a lot of flaws in my system.
I purchased my laptop 3 months ago: an HP with an AMD quad-core 2.1 GHz processor, 6 GB of RAM, a 640 GB 7200 rpm hard drive, and a 1 GB Radeon 5650 video card. I figured the quad-core 2.1 GHz could easily overpower a dual-core Intel i5, even if the i5 uses threads better. It took a fellow classmate's computer failure to change my mind. The classmate used her AMD quad core day and night to process genetic data. She said it was common for her one-year-old computer, with similar specs, to run around the clock 4 out of 7 days. I decided to monitor the temperatures of my processor. It's been said that AMD's mobile chips differ very little from the desktop versions (Intel typically has completely different designs for its mobile chips). While my desktop AMD six core typically stays within the 34-44 degree Celsius range, the laptop idles in the high 50s and sometimes reaches 70-71. The idea of melting a chip is disheartening at best. Does the chip only reach temps in the low 70s because it's designed for raw power and heat is secondary? A freeware tool answered this question: I found that the BIOS was cutting power to the processor to prevent a meltdown. The reality was that my cores were not running anywhere near capacity. Each core was running at around 900-1000 MHz, roughly 50% of its rated 2.1 GHz. I was astonished. Intel has reportedly not had these problems. I thought of solutions. A chill pad? The heat is dissipated through the far left edge of the computer, so a chill pad would not even be near the hot area. A fan pad would be equally ineffective. Unlike with a desktop computer, I can't use a large aftermarket heatsink coupled with an enormous fan or a set of case fans. That also leaves liquid cooling out of the picture. It's been too long to return the laptop. Reselling has ethical issues. The laptop does function well, just better for shorter durations. I decided on building a PC.
I admit to liking my PC and having an ergonomic system that doesn't cram my hands into a tight space or keep me constantly stooped over a monitor designed for a strange little person. Still, I'm disappointed that more internet reviewers have not weighed a laptop's ability to dissipate heat against its processor's ability to function properly. It reminds me of a Ferrari in a school zone. I plan on doing more research to see if this issue is simply a poor HP design or a sign that AMD processors are best left out of laptops because they cannot maintain safe operating temperatures. I hope this helps with any future laptop purchases.

Performance vs. Memory

I can vividly remember being told in CS class that you can always trade speed for memory. Conserving memory almost always involves a reduction in speed. Lately I've felt the headaches of multiple scripts: scripts that run other scripts, that run other scripts. A pipeline built of very, very little pipes. While this complexity can be viewed from a variety of angles, the more scripts, the more angles. After a while only the experienced mouse can make it out of the maze. As I take over a protein binding project built from many scripts, I find it confusing overall. Part of the confusion comes from the transfer from shell scripting to Perl, using one script to translate another. A question that resonates in my head is the clarity of labeling. So often we name our files based upon our moods or a generalized belief about the script, which requires a time-consuming process to work out the true purpose. Often all of the scripts are placed in a single folder, when using the folder system to organize them could eliminate the confusion and the excessive commenting. A second problem is the consistent hard-coding of file locations. I have a firm belief in the idea of a config file and a soft-coded script: a script that can be used on any computer that meets the basic requirements, and that is edited from the outside (config) rather than the inside. The extra effort of a well-thought-out script can save time in the long run. Utilizing an interpreter-based system may also decrease the problems at hand. Imagine you have 500,000 jobs to run one day and 500 to run the next, and you're using the same cluster. A script could be told manually how to break up the jobs based upon your estimate of the speed. Instead, what if the script could evaluate the hard drive and RAM usage and group files according to your processor's abilities?
It could divide the jobs among nodes without much thought, or thread the jobs on a single machine, based upon either a config file or an auto-detecting system. In most cases I believe a complex script is simply a poorly designed script, with a structure that looks like a plate of spaghetti rather than a box of it.
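To make the config idea concrete, here is a minimal sketch in Python (not the Perl and shell of the actual project). The section name `pipeline`, the keys `workers` and `data_dir`, and both function names are made up for illustration. Anything missing from the config falls back to auto-detection, and the jobs are simply dealt out round-robin.

```python
import configparser
import os


def load_settings(text):
    """Parse INI-style config text; auto-detect anything left out."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # "pipeline", "workers" and "data_dir" are hypothetical names.
    section = parser["pipeline"] if parser.has_section("pipeline") else {}
    workers = int(section.get("workers", os.cpu_count() or 1))
    data_dir = section.get("data_dir", ".")
    return {"workers": workers, "data_dir": data_dir}


def split_jobs(jobs, workers):
    """Deal jobs round-robin into at most `workers` batches."""
    count = max(1, min(workers, len(jobs)))
    batches = [[] for _ in range(count)]
    for index, job in enumerate(jobs):
        batches[index % count].append(job)
    return batches


if __name__ == "__main__":
    settings = load_settings("[pipeline]\nworkers = 3\ndata_dir = /tmp/data\n")
    for batch in split_jobs([f"job{i}" for i in range(7)], settings["workers"]):
        print(batch)
```

The same script handles 500 jobs or 500,000 without being edited on the inside; only the config text changes. A smarter version could size the batches from measured RAM and disk usage instead of dealing them out evenly.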

Random Patterns

Two recent projects have involved the idea of commonly found sequences. The first was done in my Computer Science for Bioinformatics course. The idea was to find a pattern in phosphorylation sites, investigating whether spatial or sequential regions were highly involved in these phosphorylations. There exists a known list of phosphorylation sites. The list gave the AAs to the left and right of the phosphorylated AA. These flanking sequences were analyzed to see if there was a relationship between the sequence and being phosphorylated. This was also analyzed by spatial distance, using an arbitrary cutoff of 10 angstroms, but that will not be discussed here. The next part of the experiment was to take a random sample from the same genome and analyze it for similarities. While this may sound commonplace, I question its use in this example.

I don't believe that AAs are perfectly randomly distributed, or that there exists an equal amount of each AA in any genome. I believe randomization should be replaced by normalization. The normalization should also account for poly A tails or GC content, or any known repeating pattern in non-coding regions. But what if those non-coding regions also play some sort of role? Should a machine learning technique be used? If so, what requirements should be valued? Phosphorylation is often carried out by differently sized proteins with different charges; might that play a role in this experiment? Should steric hindrance and spatial orientation play a role? Good questions, and I think at the minimum the individual genome should be taken into consideration. A and L may play a large role in these sample sequences, but do they play a large role overall? While this project was completed, the question has not been resolved.
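One way to picture the normalization I have in mind, as a rough Python sketch rather than what the class project actually did: compare the residue frequencies in the flanking sequences against the frequencies in a background set from the same genome, and report the log2 ratio. The function names and the toy sequences are made up for illustration.

```python
from collections import Counter
from math import log2


def frequencies(sequences):
    """Fraction of each residue across all of the given sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


def enrichment(flanks, background):
    """log2(flank frequency / background frequency) for each residue.

    Residues that never occur in the background are skipped, since
    the ratio is undefined for them.
    """
    fg = frequencies(flanks)
    bg = frequencies(background)
    return {r: log2(f / bg[r]) for r, f in fg.items() if r in bg}


if __name__ == "__main__":
    # Toy data only: two flanking windows and one background sequence.
    scores = enrichment(["AAAL", "AAAL"], ["AALLGGSS"])
    for residue, score in sorted(scores.items()):
        print(residue, round(score, 3))
```

A score near zero means the residue is about as common around the sites as it is in the genome overall, which is exactly the A and L question above.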

Another project has popped up recently involving a binding site problem. A small clip of DNA was analyzed to see if there were common patterns. The frequencies of all AAs were taken and the percentages were calculated. These percentages were compared against control samples of equal length. All of these were from the same genome. I believe these specific examples should be compared not against control sequences, but against the genome with, at the minimum, the known patterns of GC content removed. This is what I consider a normalization process. I do not know the best way to compare these samples. Another question is the chance that a certain AA is replaced by another with the same binding properties. Should all of these differences be weighted equally, or should similarly charged residues be weighted more heavily? While this idea is only a small piece of a puzzle involving machine learning, the accuracy of the learning is at best as good as the book it reads. As I enter into probability and statistics this next semester, I hope to better understand the ways in which I can use mathematics to show the relationships among data.
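To illustrate the weighting question, here is a small Python sketch that gives full credit for an identical residue and partial credit for a different residue from the same property group. The grouping below is one simple assumed classification (others exist), and the 0.5 weight is an arbitrary placeholder, not a value from the project.

```python
# One assumed grouping of the 20 amino acids by side-chain property.
GROUPS = {
    "positive": "KRH",
    "negative": "DE",
    "polar": "STNQCY",
    "hydrophobic": "AVLIMFWPG",
}
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def weighted_similarity(seq_a, seq_b, same_class_weight=0.5):
    """Average per-position score between two equal-length sequences.

    Identical residues score 1.0, different residues from the same
    property group score `same_class_weight`, everything else 0.0.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    if not seq_a:
        return 0.0
    score = 0.0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            score += 1.0
        elif a in CLASS_OF and CLASS_OF.get(a) == CLASS_OF.get(b):
            score += same_class_weight
    return score / len(seq_a)


if __name__ == "__main__":
    print(weighted_similarity("KDAL", "RDGL"))
    print(weighted_similarity("KDAL", "RDGL", same_class_weight=0.0))
```

Setting `same_class_weight` to 0.0 recovers the "all differences weighted equally" case, so the two choices can be compared on the same data.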

Wednesday, October 6, 2010

BRB-ArrayTools

When I tried to use BRB-ArrayTools to analyze microarray data, I was unable to load the .cel files. I initially thought the problem was my installation of BRB-ArrayTools. I removed and reinstalled it, to no avail. I then thought it could be my use of the older Office 2003. I then tried another set of .cel files from online, and they loaded into BRB. I emailed my project partners, but somehow they did not read the email before our next meeting. It turns out they were also unable to load the files, and they had different versions of Office. BRB errors out stating that it is missing a package, but all of the packages are installed. We ended up using my second set of .cel files. After the initial loading, BRB-ArrayTools worked extremely well in creating scatterplots. In the real world, we don't have the ability to simply choose another data set.
From my previous use of the Gene Expression Console, I knew that it was necessary to load the library and the annotation files. After reading the BRB error more closely, I thought that BRB was missing the library files. I looked into BRB-ArrayTools, and it turns out that library files can sometimes be added manually. I followed the directions and opened up BRB-ArrayTools in the VB Editor. I then tried to manually add the StatConnector feature that was mentioned in the manual. It turns out that the options described in the BRB manual are not available inside the VB Editor. I'm not sure what the cause is, but I was never able to get those specific .cel files to work with BRB-ArrayTools.
I then decided to try loading the files into the Gene Expression Console. After I added the required exon library, the files loaded easily. I used the Gene Expression Console and TMEV to analyze the data. I plan on emailing the BRB-ArrayTools team about the problems with this specific set of files and the possibility of library loading problems.

Sunday, October 3, 2010

Gene Expression Console

Microarray Analysis

As I go through the analysis of Dr. Fan’s Microarrays I have come across a variety of questions. Some of these questions will be answered in a distinct and quantified way, and some won’t be answered at all. Some are based upon previous historical references, and some are simply the limitations of a growing field.

Dr. Fan’s molecular biology based lab focuses on methylation patterns. She initially gave me 7 quite extensive journal papers that covered the details of DNA and histone methylation. These methylations are being studied from a variety of aspects. Dr. Fan has developed a line of cells missing 3 out of 4 of the histone 1 linkers, which allows her to analyze how histone 1 linker proteins can alter a mouse, but there are other questions being asked in her lab. Which areas are methylated? Are methylation patterns based upon the DNA? Or the histones? Or simply the length of the DNA? How do differentiated cell methylation patterns change? How does cancer alter these patterns?...

Dr. Fan’s initial assignment was for me to analyze microarray data in Affymetrix’s Gene Expression Console. She normally gives this analysis to a collaborator. Microarrays contain probes for a large number of genes, which allow someone to know which genes are present. The arrays can be standard chips, or custom arrays can be made for the right price. The rest of this post goes through the steps of using the Gene Expression Console with Affymetrix .cel files and a known, standard Affymetrix library. The console simply normalizes the samples and attaches the known information for the probe sets.

I started by downloading the microarray samples and the console itself. The console requires registering with Affymetrix, like most free programs, and the registration email may be filtered as spam. My .cel files did not need to be unzipped. The Expression Console is quite simple. The first step is to create a new study. The next logical step is to add the intensity (.cel) files. If the library files are missing, the study will automatically close and errors will be generated below. Expression Console (EC) requires that the library files are downloaded prior to adding the intensity files. Click on File and go to Download Library Files. You now need the login that you obtained while downloading the EC itself. Hold onto this password, because you will need it every time you download something. I then downloaded the library that Dr. Fan used. Now the intensity files should pull in properly. The next step is to analyze the data. Click Run Analysis. After you take a small nap, the data should be ready. Now, how do we export this data? And what kind of data do I want? Dr. Fan simply wanted Excel data linking the probe sets with genes. If you export the data now, you will only get normalized probe set data. The key is to also get the annotation files. So, go to File and download the annotations. Now go to Export and get the probe sets with annotation data. The next blog post will look into utilizing an advanced open source data analysis program. This program only needs the probe sets and their intensities. Dr. Fan and I then analyzed the data by standard deviations and log comparisons. The analysis of this data through TMEV will be covered in the next post.
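As a rough illustration of what "standard deviations and log comparisons" can look like, here is a minimal Python sketch, not the actual analysis we ran. It assumes linear-scale intensities (if the export is already log2-transformed, subtract instead of dividing), and the probe set IDs, function names, and the cutoff of 2 standard deviations are all made up for the example.

```python
from math import log2
from statistics import mean, stdev


def log_ratios(control, treated):
    """log2(treated / control) for every probe set present in both samples."""
    return {p: log2(treated[p] / control[p]) for p in control if p in treated}


def outliers(ratios, k=2.0):
    """Probe sets whose log ratio is more than k standard deviations from the mean."""
    mu = mean(ratios.values())
    sd = stdev(ratios.values())
    return sorted(p for p, r in ratios.items() if abs(r - mu) > k * sd)


if __name__ == "__main__":
    # Hypothetical linear-scale intensities keyed by made-up probe set IDs.
    control = {"ps1": 100, "ps2": 100, "ps3": 100, "ps4": 100, "ps5": 100, "ps6": 100}
    treated = {"ps1": 100, "ps2": 100, "ps3": 100, "ps4": 100, "ps5": 100, "ps6": 1600}
    ratios = log_ratios(control, treated)
    print(ratios["ps6"])     # log2 ratio for the changed probe set
    print(outliers(ratios))  # probe sets beyond the cutoff
```

The flagged probe sets can then be matched back to gene names through the annotation export described above.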