BioInformatics: A Data Deluge with Hadoop to the Rescue

Datanami



Sponsored Content by Cloudera
Marty Lurie, marty@cloudera.com
The Data Deluge
Have you had your genome mapped today?  A quick search on the web reveals that genome sequencing is available as a consumer product for only $299.  How far we’ve come from the 13-year effort to map the human genome.  (See http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml.)
Genome mapping is only one aspect of BioInformatics, however. Medical records, clinical trials, adverse drug reactions, and medical imaging are just a few of the many applications that generate the current deluge of medical data. 
BioInformatics as a discipline has been advancing by leaps and bounds.  What started as a few computer-savvy researchers writing Perl scripts now has hundreds of formal education courses offered at leading universities.  Numerous open source projects assist researchers in dealing with all the new information sources coming their way.
Apache Hadoop and BioInformatics
Your genomic data is “big.”  Reading DNA accurately is difficult, so sequencers make bi-directional, multi-pass scans, which produces larger files.  The data is big enough that traditional computers struggle to process all the base-pair sequences that describe how you, as a human, are constructed.  Apache Hadoop is an open source project for managing big data that uses commodity computers, lashed together in a cluster, to operate on massive files rapidly. You can take your genomic file, even if it is 300GB, and let Hadoop sequence, sort, and look for variants in your DNA to help doctors provide better medical care.
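The MapReduce model behind Hadoop can be sketched in a few lines of Python.  This is a local toy illustration only (the reads and counts are invented, and real Hadoop jobs run across a cluster): a map phase emits a count for each base it sees, and a reduce phase sums the counts per base.

```python
from collections import defaultdict

# Toy DNA reads; a real Hadoop job would split a multi-GB file across the cluster.
reads = ["ACGTAC", "GGTACA", "TTACGG"]

# Map phase: emit (base, 1) for every base in every read.
mapped = [(base, 1) for read in reads for base in read]

# Shuffle/reduce phase: sum the counts for each base.
counts = defaultdict(int)
for base, n in mapped:
    counts[base] += n

print(dict(counts))
```

On a cluster, the map calls run in parallel on separate chunks of the input file and the framework routes each base's counts to one reducer; the per-base logic stays exactly this simple.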
There are several BioInformatics software applications that run on Hadoop to help researchers analyze DNA, including Seal, Bowtie, Cloudburst, and Crossbow.  If you go to pubmed.gov, a National Institutes of Health (NIH) site, and search for the term “hadoop”, you’ll see refereed publications on how to use this excellent parallel processing environment for medical research.
Cloudera and BioInformatics
Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.
“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)
Cloudera is active in many other areas of BioInformatics.  Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera's 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera, let’s work through an example with BioInformatics data – specifically FAERS.
FDA Adverse Drug Reaction Data, Cloudera, and Impala
Use case description: query the FDA data to determine which drugs listed in the incident reporting database resulted in hospitalization.
We will make a number of simplifying assumptions in the data model and query.  Please don’t debate whether we are double-counting outcomes in the result – the purpose of this example is to show how quickly you can get started.  The fact that we can debate double counting illustrates the versatility of Hadoop: rather than struggling to get an answer with existing database technology, we can load all the data, start working with it, and then debate what the right answer is based on real experience with the data.
Here are the steps:
  1. Download and run the Cloudera Impala VMware image
  2. Use wget to download the FDA incident reporting files
  3. Unzip and load the data into HDFS – the Hadoop file system
  4. Create the Hive metadata to access the files
  5. Start up the Impala processes and run the query in Impala, the new open source query engine that complements Hive
  6. Just for fun, run the same query in Hive to compare performance
Here we go:
  • Download and run the Cloudera Impala VMware image
  • Use wget to download the FDA incident reporting files

  • Unzip and load the data into HDFS – the Hadoop file system
unzip UCM319844.zip
hadoop fs -mkdir outcomeDir
hadoop fs -mkdir drugDir
hadoop fs -put OUTC12Q2.TXT outcomeDir
hadoop fs -put DRUG12Q2.TXT drugDir
  • Create the Hive metadata to access the files
hive
You are now at the Hive prompt and can enter the table definitions:
drop table outcomes;
create external table outcomes (
isr string,
outcome string)
row format delimited fields terminated by '$'
location '/user/cloudera/outcomeDir'
;
drop table drugs;
create external table drugs (
isr string,
drug_seq string,
role_cod string,
drugname string,
val_vbm string,
route string,
dose_vbm string,
dechal string,
rechal string,
lot_num string,
exp_dt string,
nda_num string
)
row format delimited fields terminated by '$'
location '/user/cloudera/drugDir'
;
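The external tables above simply describe files of $-delimited text.  Here is a quick local sketch, in Python, of how one record maps onto the drugs columns declared in the DDL.  The sample line and drug name are invented for illustration; the real rows come from the FDA download.

```python
# Column names taken from the "create external table drugs" DDL above.
drug_columns = ["isr", "drug_seq", "role_cod", "drugname", "val_vbm",
                "route", "dose_vbm", "dechal", "rechal", "lot_num",
                "exp_dt", "nda_num"]

# Invented sample record; empty fields between $ delimiters are allowed.
sample = "7029384$1$PS$EXAMPLEDRUG$1$ORAL$$Y$$$$"

# Hive's "fields terminated by '$'" does essentially this split per line.
record = dict(zip(drug_columns, sample.split("$")))
print(record["isr"], record["drugname"])
```

Because the tables are external, Hive never transforms the files – the split happens at query time, which is why loading the raw data into HDFS is all the "ETL" this example needs.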
Exit Hive with quit; (Ctrl+C also works).
  • Start up the Impala processes and run the query in Impala, the new open source query engine that complements Hive
 impalascripts/start-impala-state-store.sh
 impalascripts/start-impalad.sh

time impala-shell --impalad=127.0.0.1:21000 --query_file=drugQuery.sql

  • Just for fun, run the same query in Hive to compare performance
time hive -f drugQuery.sql
Oh, right, you need the contents of drugQuery.sql:
$ cat drugQuery.sql
select drugname, outcome, count(*) nmocc
from drugs
join outcomes on (drugs.isr = outcomes.isr)
where outcome in ('DE', 'LT', 'HO')
group by drugname, outcome
order by nmocc desc
limit 100;
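What the query computes can be checked locally on a toy sample.  The sketch below (in Python, with invented rows – not real FAERS data) joins drugs to outcomes on isr, keeps the death ('DE'), life-threatening ('LT'), and hospitalization ('HO') outcomes, and counts occurrences per (drugname, outcome) pair, mirroring the SQL's join, where, group by, and order by.

```python
from collections import Counter

# Invented sample rows mirroring the two tables; isr joins them.
drugs = [("1", "DRUGA"), ("2", "DRUGA"), ("3", "DRUGB")]         # (isr, drugname)
outcomes = [("1", "HO"), ("2", "HO"), ("3", "DE"), ("3", "OT")]  # (isr, outcome)

serious = {"DE", "LT", "HO"}

# Join on isr, filter to serious outcomes, count per (drugname, outcome).
counts = Counter(
    (drugname, outcome)
    for isr_d, drugname in drugs
    for isr_o, outcome in outcomes
    if isr_d == isr_o and outcome in serious
)

# Equivalent of: order by nmocc desc limit 100
for (drugname, outcome), nmocc in counts.most_common(100):
    print(drugname, outcome, nmocc)
```

Impala and Hive produce the same result set from the same SQL; the point of running both is only to compare elapsed time, since Impala executes the query directly while Hive compiles it to MapReduce jobs.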