Storyline
I have been working with a group of budding, brave bioinformaticians at the NYU Center for Genomics and Systems Biology to help light a fire under my Next Generation Sequencing project on the venom duct transcriptome of a marine snail called Terebra anilis. The anilis species are little guys, pretty tiny in fact, but feisty when they go after worms. But not nearly as feisty and daring as our bioinformatics group in the face of untold gigabytes of what is currently being labeled as... BIG DATA.
A Digression
Avid readers of my blog may recall that when I first started this next generation sequencing affair (massive amounts of DNA sequence that need to be reassembled into something resembling a transcriptome, we hope), I knew bupkis.
Things have changed, somewhat: my knowledge of Linux, scripts, and computer languages has grown exponentially, interspersed with periodic plateaus. Soon I will be going to a two-week NGS seminar in Michigan, led by the redoubtable Dr. C. Titus Brown, to get back into exponential learning mode. Standard modes of travel would point toward getting on a plane, but is that what I will be doing? No! I will be riding my new motorcycle, and it's a straight shot on 80 West, so hi ho silver away :)
Back to the Storyline
Where were we? Oh yeah, team anilis. A word about the players, with the names abbreviated to protect the guilty: A is a PhD student in computational biology and neuroscience who got this project off the ground, as he and his cohorts needed something interesting and challenging to do for their bioinformatics class at NYU, taught by Rich Bonneau. I am going to "out" Ramakrishnan Rajaram Srinivasan (call him Ram), since he and I have made a pact to see who can get banned from the internet first, or at the very least kicked out of lab. Ram has been wildly instrumental and enthusiastic in pushing this assembly project forward, as has M, a comp sci graduate whose future plans and ambitions I don't know much about, but I do know she is one smart cookie. Rounding out the group are Y and K, two women doing research in the Lionel Christiaen lab, both pursuing advanced degrees in their respective fields.
Team anilis is a great group. I must confess, I was a little resistant to the collaboration at first. I didn't think they would be able to do anything meaningful with such a huge chunk of data (in excess of 288 million raw reads of 100 bp each) in the short time they would have to work with it. I was also, understandably perhaps, a bit protective of *my* data. So at first I couldn't decide if I was going to let them twist in the wind or actually try to help. My good sense and opportunism ultimately came to the fore, so I decided to help! And in the end, who is being helped here, massively? Moi.
The Methods of our Madness
The first thing that must be done to reconstruct a transcriptome from raw sequencing reads, or to put Humpty Dumpty back together again, is to run an assembly program or programs on a high performance computing platform, if you've got one. Choosing these programs has a Wild West aspect as there are many of them... Trinity, Trans-ABySS, SOAPdenovo, Rnnotator, Velvet/Oases... the list goes on. We have chosen three programs, some of which can be run on multiple "kmer" values (think short DNA words of a fixed length k). Once the assemblies are complete, the goal is to compare them, to see if we can glean where they overlap, where they diverge, and which, if any, offers the best results. This is a tricky business because no one assembly program appears to definitively outperform another, and all of them are a tad suspect (understatement) in how well they can actually achieve the desired goal of giving back a comprehensive transcriptome reconstruction with minimal errors. Ask Titus how well these programs perform, but be prepared to duck.
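For the curious, here is a minimal sketch of what "kmers" and "comparing assemblies" can look like in Python. To be clear, this is not what our assemblers or our group actually run; it is just one simple way to ask how much two sets of contigs have in common, by chopping each into k-letter words and measuring the overlap (Jaccard similarity). The contigs and the function names are made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    seq = seq.upper()
    # range() is empty when the sequence is shorter than k, giving an empty set
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assembly_kmers(contigs, k):
    """Union of the k-mers found in every contig of one assembly."""
    result = set()
    for contig in contigs:
        result |= kmers(contig, k)
    return result


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy contigs, invented for illustration only (not real Terebra data)
assembly_one = ["ATGGCGTACGTTAGC", "TTGACCGATAGG"]
assembly_two = ["ATGGCGTACGTTAGC", "TTGACCGTTAGG"]

k = 5
similarity = jaccard(assembly_kmers(assembly_one, k), assembly_kmers(assembly_two, k))
print(f"k={k} Jaccard similarity: {similarity:.2f}")
```

A real comparison would also have to deal with reverse complements, sequencing errors, and sheer scale, all of which this sketch happily ignores.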
Discussion
We, the collaborative group of brave budding bioinformaticians, are not to be deterred. We have hope that ultimately, our assembled transcriptome will shine some light on putative new toxins with therapeutic potential. This will be a result of the downstream analysis of our assemblies, where we will perform a whole variety of blasting techniques, mining for cysteine frameworks, annotation of orthologous proteins, and so forth. So far, we have basically completed three different contig assemblies from mRNA-seq data, after digital normalization, and over a wide range of kmer values for two of them. We are ready to rock and roll.
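Since "mining for cysteine frameworks" may sound mysterious, here is a small hedged sketch of the basic idea in Python: take a peptide sequence, keep only the cysteines (C), and note which ones sit next to each other. This is an assumption about how one might do it, not our actual pipeline, and the peptide and function name below are invented for illustration.

```python
import re


def cysteine_framework(peptide):
    """Summarize the cysteine pattern of a peptide.

    Runs of adjacent cysteines stay together; runs separated by other
    residues are joined with '-', e.g. 'CC-C-C'.
    """
    runs = re.findall(r"C+", peptide.upper())
    return "-".join(runs)


# Made-up peptide, for illustration only
toy_peptide = "GCCSDPRCAWRC"
print(cysteine_framework(toy_peptide))  # prints CC-C-C
```

Grouping predicted peptides by this kind of pattern is one way to flag candidates that resemble known toxin families before doing anything more expensive with them.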
Everyone knows that collaboration is given a lot of lip service, but here I have seen it in action, and it's working. We can file that under awesome. Who knows where the whole project is going to lead (to a publication, God willing), but it is a hundred percent certain that all of us will have learned a lot in the process. And that, ladies and gentlemen, is what counts.