Confessions of a NGS Newbie: 2013

Saturday, June 22, 2013

A blog about it > NGS 2013

I know, there was a time before NGS 2013 when you probably thought that > was just an innocent greater than symbol. You didn't know what the hell to do with something like this | and in your mind "cd" was an outmoded flat disc that was once used to play music. And then... you met the power and the glory of the command line, which totally rocked your biological world. That is, when you weren't hating on it for returning output like "putty, putty, putty" (repeat ad nauseam in weird computer voice). Then you took it to a whole new level with commands like "gunzip" and "tar" - the latter followed by those 4 alphabetical letters that no one ever remembers (xvzf?). You also quickly learned that it is going to be necessary to type at the command line with lightning speed, keeping the error rate down to about 1 typo every 100 words, at least for amateurs. And to think that this was all happening on day 1! Thanks Adina Chung Howe.

So what, pray tell, am I talking about? MSU NGS Summer Course 2013 is a two week course on next generation sequencing technologies that took place at the WK Kellogg Biological Station, a summer retreat built by the vaguely weird tycoon of Kellogg cereal fame. Why weird? Small fact communicated by Rachel Prunier, one of the NGS students: apparently said tycoon was fascinated by the anti-masturbatory properties of excessive consumption of raisin bran. Upon reflection, I can sort of buy into that.

Ah yes, where were we? NGS 2013! In short, the best experience ever. Learning and otherwise. We were a group of 24 (hard to put this in the past tense, the memories being so fresh) that came from all over geographically - a motley crew of PhD students, industry types, post-docs, professors, and even a stray masters student. We came to mecca primarily to learn about what to do with our BIG DATA. For those of you who are not well acquainted with the term BIG DATA, well trust me, it's big. In more ways than one. Moi, for example, has taken a venom duct from a marine snail and turned it into 289 million reads of raw sequence data, aka cDNA. Actually I came to the course with some assemblies as well, but no matter how you slice it, the DATA is BIG. The drill is, we dumb biologists (hah, Titus Brown, I said it!) don't always know so so well what to do with it once we have it. For this we rely on the more computationally advanced types, highly trained computer scientists, who probably could not work their way out of a wet paper bag in most situations, but who do excel when it comes to BIG DATA.

Unraveling, or perhaps re-raveling, this big data is in fact a hugely important challenge, if we want to get at the secrets of nature as encoded by genetic material spelled out by those four simple nucleotide bases - adenine, guanine, cytosine and thymine (uracil if you are counting RNA). Which in every case, for every student, is why we came to the course. We 24 lucky ones were selected from a pool of over 200 applicants. Jokes about challenged biologists aside, we were a group of pretty smart cookies, if a bit at sea when it came to bioinformatics. Participants were working on all sorts of organisms -- sea anemone, bacteria, human, equine, and so forth, each with their own assorted set of biological questions.

But wait, I am many paragraphs into this blog and I have not even explained C. TITUS BROWN. Boy, I bet he's a little ticked off. The fact of the matter is, he's a little hard to explain, but he is kind of awesome. When he's not being snarky and stubborn. Which is of course, not most of the time. C. Titus Brown is the founder and leader of NGS 2013, or the guy who got us to drink the "big data and what are we going to do with it" kool-aid. The course is now in its fourth year, though I suspect it has never seen a group quite equivalent to the likes of this year. Suffice to say we exceeded expectations on all fronts. Objective assessment, I promise.

So how did this course go structurally? Well we arose at the crack of dawn, more or less, and headed to the Carriage House for lectures, tutorials, and desperate pleas for help from the crack group of teaching assistants, who really came for the volleyball and the beer. Around 3 or 4 PM we would break for, you guessed it, volleyball! or frisbee or swimming in the lake. Or just a bit of lazing about. In the morning we learned about assembly programs, SAM, BEDTOOLS, BAM TO BED, BWA, Bowtie, TopHat, Cufflinks, you name it (all very useful), and in the evening (post physical recreation) various lectures on research (a couple of really astoundingly wonderful talks by Erich Schwarz, morning or evening) or other NGS topics of interest. I would also be totally remiss if I did not mention Istvan Albert, who has an unprecedented combination of razor sharp intellect and pure joie de vivre.

I could write a whole lot more about this. But for now, I will just leave you with the flavor. For me, and I suspect for a number of other participants and teachers as well, it was the experience of a lifetime. It was the perfect blend of rapid fire learning, intellectual stimulation, camaraderie, sports, firepits, judicious amounts of alcohol (mostly), and great good fun. I would go back in a heartbeat. Ultimately, we delved far deeper into the world of bioinformatics and NGS than the basic linux commands, referenced in paragraph one, that we learned on day one. So yes, Titus Brown, next year, three weeks! For that final week, I think I can speak for most if not all of us, we are IN! :)

Saturday, May 4, 2013

Collaboration nation

Storyline

I have been working with a group of budding, brave bioinformaticians at the NYU Center for Genomics and Structural Biology to help light a fire under my Next Generation Sequencing project on the venom duct transcriptome of a marine snail called Terebra anilis. The anilis species are little guys, pretty tiny in fact, but feisty when they go after worms. But not nearly as feisty and daring as our bioinformatics group in the face of untold gigabytes of what is currently being labeled as... BIG DATA.

A Digression

Avid readers of my blog may recall that when I first started this next generation sequencing affair (massive amounts of DNA sequence that need to be reassembled into something resembling a transcriptome, we hope), I knew bupkis.

Things have changed, somewhat: my exponential acquisition of knowledge of linux and scripts and computer languages has been interspersed with periodic plateaus. Soon I will be going to a two week NGS seminar in Michigan, led by the redoubtable Dr. C. Titus Brown, to get back into exponential learning mode. Standard modes of travel would point toward getting on a plane but is that what I will be doing? No! I will be riding my new motorcycle, and it's a straight shot on 80 West, so hi ho silver away :)

Back to the Storyline

Where were we? Oh yeah, team anilis. A word about the players, with the names abbreviated to protect the guilty: A is a PhD student in computational biology and neuroscience who got this project off the ground, as he and his cohorts needed something interesting and challenging to do for their bioinformatics class at NYU, taught by Rich Bonneau. I am going to "out" Ramakrishnan Rajaram Srinivasan (call him Ram), since he and I have made a pact to see who can get banned from the internet first, or at the very least kicked out of lab. Ram has been wildly instrumental and enthusiastic in pushing this assembly project forwards, as has M, a comp sci graduate whose future plans and ambitions I don't know much about, but I do know she is one smart cookie. Rounding out the group are Y and K, two women doing research in the Lionel Christiaen lab, and both pursuing advanced degrees in their respective fields.

Team anilis is a great group. I must confess, I was a little resistant to the collaboration at first. I didn't think they would have time to do anything meaningful with such a huge chunk of data (in excess of 288 mil raw reads of 100bp each) in the short time they would have to work with it. I was also, understandably perhaps, a bit protective of *my* data. So at first I couldn't decide if I was going to let them twist in the wind or actually try to help. My good sense and opportunism ultimately came to the fore, so I decided to help! And in the end, who is being helped here, massively? Moi.

The Methods of our Madness

The first thing that must be done to reconstruct a transcriptome from raw sequencing reads, or putting humpty dumpty back together again, is to run an assembly program or programs on a high performance computing platform, if you've got one. Choosing these programs has a wild west aspect as there are many of them... Trinity, Trans-ABySS, SOAPdenovo, Rnnotator, Velvet/Oases... the list goes on. We have chosen three programs, some of which can be run on multiple "kmer" values (think short DNA words). Once the assemblies are complete, the goal is to compare them, to see if we can glean where they overlap, where they diverge, and which, if any of them, offer the best results. This is a tricky business because no one assembly program appears to definitively outperform another, and all of them are a tad suspect (understatement) in how well they can actually achieve the desired goal of giving back a comprehensive transcriptome reconstruction with minimal errors. Ask Titus how well these programs perform, but be prepared to duck.
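
For the lay person, here is a rough Python sketch of what a "kmer" is. It only illustrates the idea of chopping reads into short overlapping DNA words; it is not how Trinity or any other assembler actually works under the hood, and the function names and toy reads are made up for the example.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping substring ("word") of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(reads, k):
    """Count how often each kmer shows up across a list of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Two made-up toy reads, purely for illustration.
toy_reads = ["ACGTAC", "CGTACG"]
print(kmers(toy_reads[0], 3))
print(kmer_counts(toy_reads, 3))
```

A bigger k gives longer, more specific words but fewer of them per read, which is presumably part of why it is worth trying a range of values.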

Discussion

We, the collaborative group of brave budding bioinformaticians, are not to be deterred. We have hope that ultimately, our assembled transcriptome will shine some light on putative new toxins with therapeutic potential. This will be a result of the downstream analysis of our assemblies, where we will perform a whole variety of blasting techniques, mining for cysteine frameworks, annotation of orthologous proteins, and so forth. So far, we have basically completed three different contig assemblies from mRNA-seq data, post digital normalization and on a wide range of kmer values for two of them. We are ready to rock and roll.
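
For the curious, "mining for cysteine frameworks" roughly means looking at where the cysteines sit in a candidate toxin peptide. Here is a minimal Python sketch, assuming the peptide comes as a string of one-letter amino acid codes. The function name and the example peptide are invented for illustration, and this is not the pipeline we actually ran.

```python
import re


def cysteine_framework(peptide):
    """Collapse a peptide to its cysteine pattern, e.g. 'CC-C-C'."""
    # Replace every run of non-cysteine residues with a single dash,
    # then trim the dashes left over at either end.
    return re.sub(r"[^C]+", "-", peptide.upper()).strip("-")


# Hypothetical toy peptide, not a real toxin from our data.
print(cysteine_framework("AGCCTNPKCMWRCG"))
```

Peptides that collapse to the same pattern can then be grouped together as candidates sharing a framework.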

Everyone knows that collaboration is given a lot of lip service, but here I have seen it in action, and it's working. We can file that under awesome. Who knows where the whole project is going to lead (to a publication god willing), but it is a hundred percent certain that all of us will have learned a lot in the process. And that, ladies and gentlemen, is what counts.

Monday, March 25, 2013

in Mozambique the sunny sky is aqua blue

I really suck at writing in my lab book. So I figure if I can harness my desire for self promotion into a way to actually keep track of what I am doing in lab -- especially my next generation sequencing work -- I can kill two birds with one stone: have an actual record of what I have done and achieve fame, fortune and maybe even notoriety in the process. Besides, it gets lonely rattling around here inside the computer. All those 0's and 1's with only A, C, T and G to keep me company. And the occasional U of course.

Now I know y'all are dying for a recap of my work so here goes: This particular project started in Mozambique, with the collection of marine specimens, namely Terebrids. Terebrids, much like the more renowned cone snails, are venomous marine snails that use a delicious cocktail of up to 200 toxins to snare their prey. My theory on the 200 toxins is that a slow moving, not terribly bright snail needs all the help it can get to score lunch (usually worms). And such a cornucopia of toxins provides a biochemist with genetic leanings such as myself with a veritable field day of things to do in extracting RNA from venom ducts and performing next generation sequencing (NGS) and analysis.

Just a word or two for the lay person about NGS. Here is a simple recipe: Extract some RNA. Copy it into cDNA (which has no introns, only exons and handy stuff like polyA tails). Blast it to bits (wheeee!). Take those bits and throw them willy nilly on a massively parallel platform (e.g. Illumina) that will start sequencing away like there is no tomorrow. Millions of sequencing reactions are taking place simultaneously; the excitement is almost boundless. And then soon, one day very soon, you will find that you have in excess of 500 million "raw reads" (translation: your chopped up cDNA, now sequenced in lengths of say, 100 bp). All this data (post quality checks) is now your baby, and to think that in this case it all started with these itsy bitsy venom ducts (think very small fingernail trimmings) from the species Terebra anilis.
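
To put numbers on that: 500 million reads of 100 bp each comes to 50 billion bases. The little Python sketch below mimics the "blast it to bits" step by pulling random fixed-length pieces out of a sequence. It is a toy under loud assumptions (uniform random start positions, no sequencing errors, no paired ends, a made-up sequence and function name), not a model of what an Illumina machine really does.

```python
import random


def shred(cdna, read_len, n_reads, seed=0):
    """Sample n_reads random substrings of length read_len from cdna."""
    last_start = len(cdna) - read_len
    if last_start < 0:
        raise ValueError("sequence is shorter than the read length")
    rng = random.Random(seed)  # fixed seed so the toy run is repeatable
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [cdna[s:s + read_len] for s in starts]


# Back-of-the-envelope size of the data set described above.
total_bases = 500_000_000 * 100
print(total_bases)  # 50000000000

# Made-up toy "cDNA" and tiny reads, purely for illustration.
toy_cdna = "ATGGCTTGCTGCAGCAACCCGAAATGCATGTGGCGTTGCGGTTAA"
for read in shred(toy_cdna, read_len=10, n_reads=5):
    print(read)
```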

You decide to throw a party. That's a lot of reads! Visions of a paper in Nature Genetics start dancing in your head. So you do indeed throw a party, and guzzle lots of alcohol. Maybe other substances are involved (I couldn't possibly comment my dear. But you might think so.)

And then.... the RECKONING. Not only is your head splitting in two, but you realize you have no freaking clue how you are going to approach this data in order to turn it into something meaningful. I mean you have read all kinds of papers on the subject and have marveled at contig assembly and its associated statistics, pondered the myriad approaches to blasting the data against all those lovely NCBI databases, thrilled to the challenges of doing this all de novo (i.e. no reference genome or even EST database to map anything to) and you realize.... well quite frankly you realize you don't have a fucking clue.

Fortunately your advisor comes to the rescue. She assures you that this is not a problem, and that it will simply be necessary to just learn how to do it. No need for assistance from any quarter, that bioinformatics class I took in the first year of grad school should see me through. I start babbling about linux and ftp sites and contig assembly programs and perl and python and bash. About which I most assuredly know nothing. As in nada, niente, rien, nichts.

And here my friends is the bare bones of the thing. Will our heroine have the moxie and smarts to triumph over these adverse circumstances? Or will she be flattened by a 2 terabyte computing cluster? Is there a knight in shining armor on the horizon, in the form of the deeply coveted computer geek? Oh Lancelot, where are you in my hour of need?

(stay tuned, but a little spoiler, Lancelot is going to be a she.)