BioInformatics: A Data Deluge with Hadoop to the Rescue

Datanami



Sponsored Content by Cloudera
Marty Lurie, marty@cloudera.com
The Data Deluge
Have you had your genome mapped today?  A quick search on the web reveals that genome sequencing is available as a consumer product for only $299.  How far we’ve come from the 13-year effort to map the human genome.  (See http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml.)
Genome mapping is only one aspect of BioInformatics, however. Medical records, clinical trials, adverse drug reactions, and medical imaging are just a few of the many applications that generate the current deluge of medical data. 
BioInformatics as a discipline has been advancing by leaps and bounds.  What started as a few computer-savvy researchers writing Perl scripts now has hundreds of formal education courses offered at leading universities.  Numerous open source projects assist researchers in dealing with all the new information sources coming their way.
Apache Hadoop and BioInformatics
Your genomic data is “big.”  Reading DNA accurately is difficult, so sequencers make bi-directional, multi-pass scans, which produces larger files.  The data is big enough that traditional computers struggle to process all the base-pair sequences that describe how you, as a human, are constructed.  Apache Hadoop is an open source project for managing big data that uses commodity computers, lashed together in a cluster, to operate on massive files rapidly. You can take your genomic file, even if it is 300GB, and let Hadoop sequence, sort, and look for variants in your DNA to help doctors provide better medical care.
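The MapReduce model behind Hadoop can be sketched in a few lines of Python.  This is a local toy illustration only (the reads and counts are invented, and real Hadoop jobs run across a cluster): a map phase emits a count for each base it sees, and a reduce phase sums the counts per base.

```python
from collections import defaultdict

# Toy DNA reads; a real Hadoop job would split a multi-GB file across the cluster.
reads = ["ACGTAC", "GGTACA", "TTACGG"]

# Map phase: emit (base, 1) for every base in every read.
mapped = [(base, 1) for read in reads for base in read]

# Shuffle/reduce phase: sum the counts for each base.
counts = defaultdict(int)
for base, n in mapped:
    counts[base] += n

print(dict(counts))
```

On a cluster, the map calls run in parallel on separate chunks of the input file and the framework routes each base's counts to one reducer; the per-base logic stays exactly this simple.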
There are several BioInformatics software applications that run on Hadoop to help researchers analyze DNA, including Seal, Bowtie, Cloudburst, and Crossbow.  If you go to pubmed.gov, a National Institutes of Health (NIH) site, and search for the term “hadoop”, you’ll see refereed publications on how to use this excellent parallel processing environment for medical research.
Cloudera and BioInformatics
Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.
“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)
Cloudera is active in many other areas of BioInformatics.  Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera's 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera, let’s work through an example with BioInformatics data – specifically FAERS.
FDA Adverse Drug Reaction Data, Cloudera, and Impala
Use case description: query the FDA data to determine which drugs listed in the incident reporting database resulted in hospitalization.
We will make a number of simplifying assumptions in the data model and query.  Please don’t debate whether we are double-counting outcomes in the result – the purpose of this example is to show how quickly you can get started.  The fact that we can debate double counting illustrates the versatility of Hadoop: rather than struggling to get an answer with existing database technology, we can load all the data, start working with it, and then debate what the right answer is based on real experience with the data.
Here are the steps:
  1. Download and run the Cloudera Impala VMware image
  2. Use wget to download the FDA incident reporting files
  3. Unzip and load the data into HDFS – the Hadoop file system
  4. Create the Hive metadata to access the files
  5. Start up the Impala processes and run the query in Impala, the new open source query engine that complements Hive
  6. Just for fun, run the same query in Hive to compare performance
Here we go:
  • Download and run the Cloudera Impala VMware image
  • Use wget to download the FDA incident reporting files

  • Unzip and load the data into HDFS – the Hadoop file system
unzip UCM319844.zip
hadoop fs -mkdir outcomeDir
hadoop fs -mkdir drugDir
hadoop fs -put OUTC12Q2.TXT outcomeDir
hadoop fs -put DRUG12Q2.TXT drugDir
  • Create the Hive metadata to access the files
hive
You are now at the Hive prompt and can enter the table definitions:
drop table outcomes;
create external table outcomes (
isr string,
outcome string)
row format delimited fields terminated by '$'
location '/user/cloudera/outcomeDir'
;
drop table drugs;
create external table drugs (
isr string,
drug_seq string,
role_cod string,
drugname string,
val_vbm string,
route string,
dose_vbm string,
dechal string,
rechal string,
lot_num string,
exp_dt string,
nda_num string
)
row format delimited fields terminated by '$'
location '/user/cloudera/drugDir'
;
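The external tables above simply describe files of $-delimited text.  Here is a quick local sketch, in Python, of how one record maps onto the drugs columns declared in the DDL.  The sample line and drug name are invented for illustration; the real rows come from the FDA download.

```python
# Column names taken from the "create external table drugs" DDL above.
drug_columns = ["isr", "drug_seq", "role_cod", "drugname", "val_vbm",
                "route", "dose_vbm", "dechal", "rechal", "lot_num",
                "exp_dt", "nda_num"]

# Invented sample record; empty fields between $ delimiters are allowed.
sample = "7029384$1$PS$EXAMPLEDRUG$1$ORAL$$Y$$$$"

# Hive's "fields terminated by '$'" does essentially this split per line.
record = dict(zip(drug_columns, sample.split("$")))
print(record["isr"], record["drugname"])
```

Because the tables are external, Hive never transforms the files – the split happens at query time, which is why loading the raw data into HDFS is all the "ETL" this example needs.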
Exit Hive with quit; (Ctrl+C also works).
  • Start up the Impala processes and run the query in Impala, the new open source query engine that complements Hive
 impalascripts/start-impala-state-store.sh
 impalascripts/start-impalad.sh

time impala-shell --impalad=127.0.0.1:21000 --query_file=drugQuery.sql

  • Just for fun, run the same query in Hive to compare performance
time hive -f drugQuery.sql
Oh, right, you need the contents of drugQuery.sql:
$ cat drugQuery.sql
select drugname, outcome, count(*) nmocc
from drugs
join outcomes on (drugs.isr = outcomes.isr)
where outcome in ('DE', 'LT', 'HO')
group by drugname, outcome
order by nmocc desc
limit 100;
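What the query computes can be checked locally on a toy sample.  The sketch below (in Python, with invented rows – not real FAERS data) joins drugs to outcomes on isr, keeps the death ('DE'), life-threatening ('LT'), and hospitalization ('HO') outcomes, and counts occurrences per (drugname, outcome) pair, mirroring the SQL's join, where, group by, and order by.

```python
from collections import Counter

# Invented sample rows mirroring the two tables; isr joins them.
drugs = [("1", "DRUGA"), ("2", "DRUGA"), ("3", "DRUGB")]         # (isr, drugname)
outcomes = [("1", "HO"), ("2", "HO"), ("3", "DE"), ("3", "OT")]  # (isr, outcome)

serious = {"DE", "LT", "HO"}

# Join on isr, filter to serious outcomes, count per (drugname, outcome).
counts = Counter(
    (drugname, outcome)
    for isr_d, drugname in drugs
    for isr_o, outcome in outcomes
    if isr_d == isr_o and outcome in serious
)

# Equivalent of: order by nmocc desc limit 100
for (drugname, outcome), nmocc in counts.most_common(100):
    print(drugname, outcome, nmocc)
```

Impala and Hive produce the same result set from the same SQL; the point of running both is only to compare elapsed time, since Impala executes the query directly while Hive compiles it to MapReduce jobs.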