Between the DNA that stores our genetic information and the proteins that build our cells lies RNA — the middleman that relays the instructions. It carries genetic information out of the cell nucleus and into the cytoplasm. There, it’s translated into proteins, the movers and shakers that actually control what happens in our bodies.
Because RNA allows that all-important leap from DNA to protein, it’s a useful target for laboratory researchers who are interested in identifying the functions of genes. Using a technique called RNA interference (RNAi), researchers can selectively destroy particular RNA transcripts to “knock down” specific genes. Without the RNA transcript, the protein is no longer made.
But designing these RNA destroyers — called short hairpin RNAs (shRNAs) — can be tricky. Not all parts of a particular RNA transcript are equally susceptible to being chopped up. There’s traditionally been a fair amount of trial and error in finding the best target. This can slow research and even produce inconclusive results.
To Raphael Pelossof, an assistant attending computational biologist in the Department of Surgery at Memorial Sloan Kettering, the problem sounded similar to ones he had tackled before. A computer scientist by training, Dr. Pelossof did his PhD with the scientist Mike Jones, who developed the first face-recognition software. When you point your smartphone at a scene, the little box that appears around a person’s face is the direct result of that work.
Dr. Pelossof realized that he could apply principles of face-recognition technology to the design of shRNAs. Working as a postdoctoral fellow in the lab of Christina Leslie at the Sloan Kettering Institute (SKI), and in collaboration with Lauren Fairchild, a graduate student in the Tri-Institutional PhD Program, he developed a software program called SplashRNA to do just that. The software allows researchers to predict the best shRNAs for a given gene of interest with a high degree of certainty.
“It used to be that you would test ten shRNAs and four or five wouldn’t work,” Dr. Pelossof says. “Now, with SplashRNA, researchers can count on the process being a lot more accurate and efficient.” The team describes their approach in a new paper published in Nature Biotechnology.
Recognizing Patterns
Face-recognition technology relies on machine-learning tools called classifiers. These are sets of simple yes-no questions applied in a particular sequence. One classifier might evaluate simple facelike features (“Is there a vertical bar of light indicating a nose bridge?” “Is there a horizontal bar indicating eyebrows?”) to reject shapes that are obviously not faces. A second classifier might evaluate the more refined facelike features of the retained potential faces. Parts of the image that pass all of the classifiers are then determined to be faces.
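The staged screening described above can be sketched in a few lines of Python. This is only an illustration of the cascade idea, not the code behind any real face detector: the feature names (`has_vertical_bright_bar`, `has_horizontal_dark_bar`, `eye_symmetry`), the 0.8 threshold, and the sample patches are all invented for the example.

```python
from typing import Callable, List, Sequence

# A stage is a list of yes-no questions; a candidate must pass all of them.
Stage = List[Callable[[dict], bool]]


def run_cascade(candidate: dict, stages: Sequence[Stage]) -> bool:
    """Return True only if the candidate passes every question in every stage."""
    for stage in stages:
        if not all(question(candidate) for question in stage):
            return False  # rejected early; later stages never run
    return True


# Hypothetical coarse questions that cheaply reject obvious non-faces.
coarse_stage: Stage = [
    lambda c: c["has_vertical_bright_bar"],   # possible nose bridge
    lambda c: c["has_horizontal_dark_bar"],   # possible eyebrows
]

# Hypothetical finer question, applied only to survivors of the coarse stage.
fine_stage: Stage = [
    lambda c: c["eye_symmetry"] > 0.8,
]

# Made-up image patches described by made-up features.
candidates = [
    {"name": "patch_a", "has_vertical_bright_bar": True,
     "has_horizontal_dark_bar": True, "eye_symmetry": 0.9},
    {"name": "patch_b", "has_vertical_bright_bar": False,
     "has_horizontal_dark_bar": True, "eye_symmetry": 0.95},
    {"name": "patch_c", "has_vertical_bright_bar": True,
     "has_horizontal_dark_bar": True, "eye_symmetry": 0.5},
]

faces = [c["name"] for c in candidates
         if run_cascade(c, [coarse_stage, fine_stage])]
print(faces)
```

In this toy run only `patch_a` survives both stages. The point of ordering the stages this way is that cheap questions discard most candidates before the more demanding ones are ever asked.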
Dr. Pelossof realized that there were parallels with shRNA prediction. “We know that there are certain features of the RNA sequence that tend to predict it will work,” Dr. Pelossof says. For example, certain nucleotides and nucleotide combinations seem to pop up more or less frequently at specific positions in shRNAs that work. “So we decided to see if we could build a set of classifiers that could screen out shRNAs that are unlikely to work,” he says.
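To make the idea of position-specific sequence features concrete, here is a minimal Python sketch. It is not SplashRNA's model: the weights, the threshold, and the four-letter toy sequences are assumptions made up for illustration, and a real predictor would learn its weights from experimental data.

```python
from typing import Dict, Tuple

NUCLEOTIDES = "ACGU"


def position_features(seq: str) -> Dict[Tuple[int, str], int]:
    """One 0/1 indicator per (position, nucleotide) pair; positions are 1-based."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return {
        (i, nt): int(base == nt)
        for i, base in enumerate(seq, start=1)
        for nt in NUCLEOTIDES
    }


# Hypothetical weights: positive values mark a nucleotide assumed to be
# favorable at that position, negative values an unfavorable one.
WEIGHTS: Dict[Tuple[int, str], float] = {
    (1, "U"): 1.0,
    (1, "G"): -1.0,
    (2, "A"): 0.5,
}


def score(seq: str) -> float:
    """Sum the weights of the position-specific features present in seq."""
    feats = position_features(seq)
    return sum(w * feats.get(key, 0) for key, w in WEIGHTS.items())


def passes_filter(seq: str, threshold: float = 0.0) -> bool:
    """Screen out sequences whose score does not exceed the threshold."""
    return score(seq) > threshold


for toy in ["UAGC", "GCAA", "UCGG"]:
    print(toy, score(toy), passes_filter(toy))
```

A filter like `passes_filter` plays the role of one stage in a cascade: it does not need to pick the best candidate, only to discard the ones that look unlikely to work.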
There was more to solving the problem than clever computing skills, however. “The key to a good prediction algorithm is to have the right data sets,” says Christof Fellmann, a co-senior author on the paper. Dr. Fellmann and his peers have spent the last ten years building those data sets — more than 300,000 individual shRNAs. “Simply put, if you want to learn what a face is but you don’t know what a face looks like and don’t have any examples of a face, the task is pretty impossible,” he says.
Dr. Fellmann began this type of work as a graduate student in Scott Lowe’s lab at Cold Spring Harbor Laboratory. Subsequently, he co-founded a company called Mirimus, where he previously served as chief scientific officer. Through the data sets that he and others have assembled, researchers have come to learn what a good shRNA “face” looks like. With this abundant data as a guide, the team sought to train computers to recognize the patterns that separate the good shRNAs from the bad ones.
That task was made more complicated by the fact that shRNA technology itself has evolved over the years. Data sets from the early days are not entirely equivalent to those obtained more recently. (The so-called shRNA “backbone” has changed, from one called miR-30 to a more efficient one called miR-E.) Combining these data sets to get the most information out of each represented a machine-learning challenge. Fortunately, the right people with the right skills came together at the right time, Dr. Fellmann says.
Machine Logic
The collaboration began when Dr. Pelossof learned of efforts in Dr. Lowe’s lab (now at MSK) to improve shRNA prediction. The Leslie lab and the Lowe lab are both located on the 11th floor of the Zuckerman Research Center at SKI, where Dr. Pelossof was doing his postdoc. “Rafi met Christof, and that’s when the project really got going,” Dr. Leslie says.
The machine-learning approach they took draws on computational methods pioneered by Dr. Leslie, particularly those involving what are called sequence, or string, kernels. This is a way of teaching computers to recognize patterns in large bodies of text. A kernel is a mathematical representation of the data points in these texts.
As an illustration of the technology’s uses, Dr. Leslie gives the example of an Internet news feed that automatically directs preferred content to interested users. “Just from the statistics of words in a document, you can train algorithms that can predict with very high accuracy this is political news, this is sports,” she says.
Dr. Leslie’s big innovation was to realize that this same approach could be usefully applied to biological problems. She first used it to accurately predict the classification of proteins solely on the basis of short amino acid sequences — the kernels in this case. Applying the technique to potent shRNA prediction was a logical next step.
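One simple way to compare sequences by the short substrings they share is a k-mer counting kernel, often called a spectrum kernel. The Python sketch below is an illustration under stated assumptions, not the kernel used in SplashRNA: the function names, the choice of k = 3, and the toy sequences are invented for the example.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Similarity of two sequences: dot product of their k-mer count vectors."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for k-mers that never occur in b.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


# Toy sequences, made up for illustration.
seq_1 = "ACGUACGU"
seq_2 = "ACGUUU"

print(spectrum_kernel(seq_1, seq_2))  # shared 3-mers, weighted by their counts
print(spectrum_kernel(seq_1, seq_1))  # a sequence compared with itself
```

Two sequences that share many short substrings get a high score, and sequences with nothing in common score zero. A learning algorithm such as a support vector machine can then work from these pairwise similarities without ever needing a hand-built list of features.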
Setting a New Standard
If one were to design all of the possible shRNAs for a given gene, without using any criteria to eliminate likely duds, only about 5% would work. Using previous tools, researchers can get the success rate up to about 60% — much better, but still far from perfect. With SplashRNA, the likelihood of accurately predicting a good shRNA rises to about 90%, and the potency of each is also higher. “That’s a huge savings in time and reagents,” Dr. Pelossof notes.
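A rough calculation shows what those success rates mean at the bench. The sketch below assumes, purely for illustration, that each shRNA succeeds or fails independently of the others at the quoted rate; real candidates for the same gene are unlikely to be fully independent, so the numbers are a guide rather than a guarantee. The function names and the choice of three candidates are hypothetical.

```python
def chance_at_least_one_works(success_rate: float, n_tested: int) -> float:
    """Probability that at least one of n_tested shRNAs works (independence assumed)."""
    return 1.0 - (1.0 - success_rate) ** n_tested


def expected_tests_until_first_hit(success_rate: float) -> float:
    """Average number of shRNAs tested before the first one works (independence assumed)."""
    return 1.0 / success_rate


# Success rates quoted in the article.
RATES = {"no filtering": 0.05, "previous tools": 0.60, "SplashRNA": 0.90}

for label, rate in RATES.items():
    p3 = chance_at_least_one_works(rate, 3)
    avg = expected_tests_until_first_hit(rate)
    print(f"{label}: P(at least 1 of 3 works) = {p3:.3f}, "
          f"average tests to first hit = {avg:.1f}")
```

Under these assumptions, an unfiltered design would need about 20 attempts on average to find one working shRNA, while a 90% success rate brings that close to one.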
To validate their face-detection approach, the team invited labs at MSK and elsewhere to test it against their usual methods. Typically, what you want to see with a good shRNA is a near-complete knockdown of the gene, such that no detectable protein is made. SplashRNA achieved that in spades.
In a series of assays performed in Dr. Lowe’s lab by graduate student Chun-Hao Huang, the team found that the top SplashRNA predictions matched or outperformed the best-known shRNAs for several genes tested. Indeed, the software works so well that it has already been adopted by the RNAi core at MSK, as well as by labs at several other academic centers, including Harvard, Weill Cornell, and the University of California, Berkeley. MSK has filed a patent on the technology.
With help and insight from Ralph Garippa’s group in the RNAi core at MSK, the team also created a website that enables researchers to freely access and use the software.
“RNA interference is one of the most fundamental tools available to geneticists,” Dr. Pelossof says. “With this prediction software, we’ve hopefully made their jobs a little easier.”
Dr. Fellmann adds: “We’ve built a tool that integrates all that we’ve learned over the years about RNAi biology and its technical implementation, and combined it with face detection. It’s a really simple interface, a one-stop shop for potent RNAi, like going to Amazon or Zappos to buy shoes.”