Measuring gene expression — the degree to which a gene is active — is key to genetic research and development of new ways to treat disease.
The presence of specific messenger RNA indicates what genes are involved, while their total count can be used to estimate the level of gene expression. But the computer process traditionally used to identify messenger RNA (mRNA) in cell samples takes 10 to 15 hours to get results.
Now a team of researchers led by Carnegie Mellon University computational biologist Carl Kingsford has stepped forward with its Sailfish program, featuring an algorithm that estimates gene expression 20 to 30 times faster than current methods.
In other words, Sailfish gets results, sometimes more accurate ones, in 10 to 15 minutes.
That explains why the team, which includes Stephen M. Mount of the University of Maryland and Rob Patro, a CMU postdoctoral researcher, named the program after the world’s fastest fish, whose top speed would earn a car a speeding ticket on Interstate 79.
“Understanding when a gene is on and off is an important tool in basic biology,” said Mr. Kingsford, an associate professor in the school of computer science, who wrote the Sailfish algorithm. “The goal is to increase science and understand biology better.”
The journal Nature Biotechnology published a report in April describing Sailfish and how it advances the computational process. Now available online for free, Sailfish is drawing praise from scientists who are using the program to speed up their research, with the novel opportunity of double-checking their results.
“It’s benefitted my research because it’s an efficient, elegant program that has streamlined gene-expression analysis,” said John Stanton-Geddes, a University of Vermont professor with a doctoral degree in ecology, evolution and behavior. He said he uses Sailfish to identify genes that change expression in response to temperature in two eastern ant species.
While any individual organism’s genetic makeup is static, a CMU news release explains, activity of individual genes varies greatly over time, “making gene expression an important factor in understanding how organisms work and what occurs during disease processes.”
“Gene activity can’t be measured directly but can be inferred by monitoring RNA, the molecules that carry information from the genes for producing proteins and other cellular activities,” it says.
The math and science explaining Sailfish and its algorithm are complicated. Here’s a much-simplified explanation:
In research, cell samples of interest are ground up and analyzed in a sequencing machine, which spells out the combination of the four molecules that make up the RNA, each identified by a letter — namely A, C, G and U. Messenger RNA can have 100 to 1,000 of these letters.
Current methods require a process known as mapping, which takes RNA segments of 100 letters, known as “reads,” and tries to find an inexact match of those letters in the 100,000 sequenced RNA. Because of the large number of letters in the “reads” and the often-complicated notion of what constitutes a good match, the computer process takes many hours of computation to identify the most likely RNA represented, along with the count of how many are present in the sample. That count is used to evaluate the level of gene expression.
Another problem with mapping is that genetic variations in an individual’s RNA may cause no match to be found. The notion of matching used in mapping also can be too restrictive, which can lead to close but sufficiently different “reads” remaining unmatched, potentially skewing results.
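To make the mapping idea concrete, here is a toy sketch in Python. It is not the software the article refers to, only an illustration under simplifying assumptions: the two short transcripts, the 12-letter read and the function names are all invented for the example, and "inexact match" is reduced to counting substituted letters. It slides a read along each known RNA and accepts the best position that stays within an allowed number of mismatches.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, transcripts, max_mismatches):
    """Slide the read along every transcript and return the best hit as
    (name, offset, mismatches), or None if nothing is close enough."""
    best = None
    for name, seq in transcripts.items():
        for i in range(len(seq) - len(read) + 1):
            d = hamming(read, seq[i:i + len(read)])
            if d <= max_mismatches and (best is None or d < best[2]):
                best = (name, i, d)
    return best

# Hypothetical toy data: real transcripts and reads are far longer.
transcripts = {
    "tx1": "AUGGCUACGUUAGCCGAUAGGCUA",
    "tx2": "AUGCCCGGGAAAUUUCCCGGGUAA",
}
# Taken from tx1 at offset 6, with one letter changed (a "variation").
read = "ACGUUAGCGGAU"

print(map_read(read, transcripts, max_mismatches=1))
print(map_read(read, transcripts, max_mismatches=0))
```

With one mismatch allowed, the read is placed on tx1. With none allowed, the same read goes unmatched, which is the "too restrictive" failure described above. The nested loops also hint at why the approach is slow: every read is compared against every position of every RNA.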
But Sailfish completely eliminates that time-consuming mapping process.
Instead, it uses 20-letter RNA segments, known as k-mers, which the Sailfish algorithm can match more easily against the 100,000 RNAs. The number of matches and a complex analysis of those matches not only accurately identify the messenger RNA but also count how many are present, as a measure of gene activity, or expression. Normal-functioning genes have a standard level of activity. Too little or too much activity can indicate a disease process.
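A rough sketch of the short-segment idea, again in Python and again only an illustration: the transcripts, the read and the helper names are made up, and a segment length of 5 stands in for the 20 letters mentioned above so the toy data stays readable. The published Sailfish method involves considerably more statistical analysis than this lookup. The point shown here is that once every short segment of every known RNA is stored in a table, checking a segment from a sample becomes an exact dictionary lookup rather than a search.

```python
from collections import defaultdict

def kmers(seq, k):
    """All overlapping k-letter windows of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(transcripts, k):
    """Map each k-letter segment to the set of transcripts containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index

K = 5  # assumption for the toy example; the article describes 20 letters
transcripts = {  # hypothetical sequences
    "tx1": "AUGGCUACGUUAGCCGAUAGGCUA",
    "tx2": "AUGCCCGGGAAAUUUCCCGGGUAA",
}
index = build_index(transcripts, K)

read = "ACGUUAGCCGAU"  # a fragment of tx1
# One exact lookup per segment; unknown segments give an empty set.
hits = [index.get(km, set()) for km in kmers(read, K)]
print(hits)
```

In this toy case every segment of the read points back to tx1 alone. The table is built once and reused for every read in the sample.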
Mr. Kingsford, who holds a doctoral degree in computer science, compares Sailfish’s method to matching a small phrase of 20 letters taken from a big pile of chopped-up children’s books, each book having 100 to 1,000 words, to determine which book is involved and how many copies of that book exist in the pile.
The Sailfish method looks for those exact words and can identify how many different books use the phrase. If the 20-letter combination (k-mer) occurs in only one book (RNA) but 60 copies of that phrase are found in the sample, then one can conclude that the big pile included 60 copies of that one book (RNA). That might indicate an active gene (or popular book), because a high number of RNA copies indicates a high level of gene expression.
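The book-counting analogy can be sketched as follows. This is a simplified illustration, not the Sailfish algorithm itself: the sequences, the 5-letter phrase length and the helper names are assumptions made for the example, and it only handles phrases that belong to a single "book." Phrases shared by several RNAs need the more complex analysis the article mentions.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping k-letter windows of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def unique_kmers(transcripts, k):
    """Segments found in exactly one transcript, mapped to that transcript."""
    owners = {}
    for name, seq in transcripts.items():
        for km in set(kmers(seq, k)):
            owners.setdefault(km, set()).add(name)
    return {km: next(iter(names))
            for km, names in owners.items() if len(names) == 1}

def count_kmers(reads, k):
    """How many times each segment appears across all reads in the sample."""
    counts = Counter()
    for r in reads:
        counts.update(kmers(r, k))
    return counts

K = 5  # toy length; the article describes 20-letter segments
transcripts = {  # hypothetical "books"
    "tx1": "AUGGCUACGUUAGCCGAUAGGCUA",
    "tx2": "AUGCCCGGGAAAUUUCCCGGGUAA",
}
# Hypothetical "pile": 60 fragments of tx1 and 5 fragments of tx2.
reads = ["ACGUUAGCCGAU"] * 60 + ["GGGAAAUUUCCC"] * 5

owner = unique_kmers(transcripts, K)
counts = count_kmers(reads, K)

phrase = "GUUAG"  # appears only in tx1
print(owner[phrase], counts[phrase])
```

Here the phrase belongs to tx1 only and turns up 60 times in the sample, so the pile is inferred to hold 60 copies of tx1, while a phrase unique to tx2 turns up only 5 times.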
Genetic variations would be less likely in a smaller, 20-letter segment, which increases accuracy while speeding up the computational process.
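The arithmetic behind that claim can be checked with a small experiment. A 100-letter read contains 100 - 20 + 1 = 81 overlapping 20-letter segments, and a single changed letter can fall inside at most 20 of them, so at least 61 segments are untouched. The sketch below is an illustration only: the random sequence, the position of the change and the variable names are assumptions, not data from the study.

```python
import random

def kmers(seq, k):
    """All overlapping k-letter windows of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

random.seed(0)   # fixed seed so the run is repeatable
L, K = 100, 20   # read length and segment length from the article

# Hypothetical reference stretch and a copy with one letter changed.
original = "".join(random.choice("ACGU") for _ in range(L))
pos = 50
new_base = "A" if original[pos] != "A" else "C"
variant = original[:pos] + new_base + original[pos + 1:]

# Whole-read exact matching fails outright...
print(variant == original)

# ...but most 20-letter segments of the variant are still found.
reference = set(kmers(original, K))
surviving = sum(km in reference for km in kmers(variant, K))
print(surviving, "of", len(kmers(variant, K)), "segments still match")
```

A single variation therefore costs the short-segment approach only a minority of its evidence for that read, whereas a strict whole-read match would lose the read entirely.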
With Sailfish, Mr. Stanton-Geddes said, he can complete the analysis of samples and recheck the results within a day.
“I’m in the process of collecting data where I have 120 samples,” he said. “Previously, the expression analysis would have been a major bottleneck in my procedure, taking days of computation. With Sailfish, I can process a lot of samples and analyze them quickly, which encourages reproducibility of my work.”
Getting what might be a 15-hour analysis down to minutes is important, especially considering the already huge repositories of RNA sequencing data available, Mr. Kingsford said. To date, the lengthy computational effort has limited the amount of information that could be drawn from data.
Fifteen hours for each analysis “really starts to add up, particularly if you want to look at 100 experiments,” Mr. Kingsford said. “With Sailfish, we can give researchers everything they got from previous methods, but faster.”
David Templeton: firstname.lastname@example.org or 412-263-1578.