Bioinformatics — computing with biotechnology and molecular biology data

Figure 1: Simplified bioinformatics application data flow

By Grant Jacobs

This is the first part of a small retrospective series on my own field, bioinformatics, or as I prefer to label my research area, computational biology.
This article was originally published in NZBioScience in 2003 and has been reproduced here with the editor’s permission. I’ve left it as it stands, warts and all! When reading it, you will want to remember that all the dates and so on were written six years ago. (My plugs for my consultancy probably look a little up-front on-line, but I’ve left them in so that I might remain faithful to the original article.) I apologise for not providing links for the references, but my time is limited.
Later I would like to take a present-day look at what I was thinking back then and comment on what, if anything, has changed. You’re welcome to share your thoughts on this article.

Bioinformatics is essential to success in modern biotechnology and molecular biology. It has risen in prominence due to the increased use of sophisticated equipment directly linked to computers in biological research, and to the new data resources and biotechnology applications that have resulted. Rather than study specific applications in bioinformatics at the expense of others, here I present a personal perspective on the field and on why theoretical biology and structural bioinformatics will have particularly important roles.
Dr. Jacobs has been involved in bioinformatics for more than ten years, with an undergraduate degree in Microbiology and Computer Science (B.Sc. Hons., Canterbury, 1986) and a Ph.D. studying DNA-binding proteins (Cambridge, 1992). He currently works as an independent bioinformatics software developer, recently founding BioinfoTools (www.bioinfotools.com). In addition to developing bioinformatics tools, his services are available for made-to-order software development, consulting and training.
Increasingly, modern biotechnology and molecular biology data is first available in electronic, rather than physical, form. Bioinformatics, which processes this computer-bound biological data, has risen in prominence as a consequence. Bioinformatics was not created by this trend, but it has certainly gained popularity from it.
As many people join biotechnology from non-biological science, technological, management and business backgrounds, they bring with them a wide range of views of what bioinformatics is. Having been trained by one of the early bioinformatics scientists (bioinformaticians?) and having studied in the area from the ‘pre-hype’ era (i.e. more than 10 years ago), I believe I have some useful perspectives on bioinformatics and where it might be headed next. Support for my views can be found in references 1-3.

Just what exactly is bioinformatics?

Frequently described as ‘computing + biology’, a closer inspection reveals many quite different activities. Statistical and mathematical models of biological systems. Squirreling away biological data into relational databases. Data mining. Development of new techniques for data storage and processing suited to biological data. Standardisation of data formats. Surveys of gene/protein families, their structure, function and evolution. Simulations and modelling studies of atomic structures of molecules. These and many other applications could be considered part of bioinformatics. Do all these really share a common theme, or are researchers just hanging a trendy moniker on their latest work?
A simplified data flow of a bioinformatics application (Figure 1) contains three broad areas: data management (informatics), computational and theoretical biology. All three are important for successful modern bioinformatics. (For brevity, I leave aside early steps of a full data pipeline such as data collection from hardware, laboratory information management systems, signal extraction, sample error detection/correction, etc.)

A linguist might define bioinformatics as informatics applied to biology. The Shorter OED (4) defines informatics as the translation of the Russian informatik — information science and technology. Information science, in turn, is defined as ‘the branch of knowledge that deals with storage, retrieval and dissemination of information’. Informatics in its narrowest sense is about manipulation of data, without necessarily understanding the meaning of the data being manipulated. In practice, bioinformatics is more than just informatics and the name bioinformatics is something of a misnomer.
The quality of the results from most bioinformatics applications lies with the accuracy of the computational model; the remainder is largely infrastructure that feeds data to and collects data from this step. The infrastructure can often be built using the latest ‘standard’ computing protocols and practices. Constructing the computational model by contrast requires an understanding of the biological data at hand, the biological knowledge wanted and the computer science, mathematical, etc., techniques to be applied. Good bioinformatics methods are built on a strong understanding of molecular systems.
When developing a bioinformatics method, a computational (mathematical, statistical, physical, chemical) model must be built upon basic biological knowledge. If we are seeking to locate protein-binding sites given a DNA sequence, we need to understand what makes up a protein-binding site in DNA and incorporate that knowledge into a computational model using the DNA sequence. If we want to find protein-protein interaction sites and partners given protein sequences, we need to understand what makes up a protein-protein interaction, how they evolve, and derive a computational model using the protein sequence data.
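As a minimal sketch of what incorporating such knowledge into a computational model can look like, the Python below builds a position weight matrix from a few aligned example sites and scores every window of a DNA sequence against it. The example sites, the pseudocount, the uniform background frequency and all function names are illustrative assumptions; none of them comes from the original article or from any real DNA-binding protein.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        scores = {}
        for base in BASES:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm

def scan(sequence, pwm):
    """Return (offset, score) for every full-width window of the sequence.

    Only the letters A, C, G and T are handled; anything else raises KeyError.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

def best_hit(sequence, pwm):
    """Highest-scoring window as an (offset, score) pair."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])

# Hypothetical aligned binding sites, made up for illustration only.
EXAMPLE_SITES = ["TATAAT", "TATAAT", "TATGAT", "TACAAT", "TATACT"]
EXAMPLE_PWM = build_pwm(EXAMPLE_SITES)
```

Calling `best_hit("GGGCGCTATAATGCGC", EXAMPLE_PWM)` picks out the window that matches the consensus of the example sites. A model this simple captures only per-position base preferences; the structural knowledge discussed in the text would have to be added on top of it.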
The information used to construct these computational models is often derived from biophysical studies of the molecules concerned. Surveys of these data lead to basic understanding of the nature of the molecules and their function. The author mainly bases his bioinformatics methods on knowledge derived from structural biology: since they are built on data directly related to the physical reality of the molecules, they are more likely to accurately reflect that reality.
Examples of theoretical biology knowledge for proteins include the allowed conformation of amino acids (5) for building and assessing models of protein structures, the frequency of amino acids at specific positions of the termini of α-helices (6) for predicting secondary structure, the correlation between hydrophobicity and the occurrence of amino acids in the interior of proteins (7 p254, 8 p98-106) or the characteristic packing angles of α-helices (9). Equivalent data can be found for DNA and RNA molecules.
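To make the hydrophobicity example concrete, here is a small sketch of a sliding-window hydropathy profile. It uses the Kyte-Doolittle scale (1982), a widely used scale chosen here for illustration; it is not one of the sources cited in this article, and the window size and function name are likewise assumptions.

```python
# Kyte-Doolittle hydropathy values; positive means more hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=7):
    """Mean hydropathy of each full window along a protein sequence.

    Expects one-letter codes for the 20 standard amino acids;
    any other character raises KeyError.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

Stretches with a high mean are candidates for buried or membrane-spanning regions, which is the kind of correlation the text refers to. The profile is a rough indicator only, not a structure prediction.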
Many early molecular biologists were émigrés from physics and chemistry. These scientists were used to fields with underlying layers of principles upon which further work could be built. Modern molecular biology is to an extent guilty of not deriving further underlying principles from the data generated from experimental studies. The large amount of data being generated by the high-throughput methods would, I’d like to think, bring us back to the need to create further ‘first principles’.
It is easy to be sold on the hype of databases and high-powered computers. Data management is essential, but to solve biological problems bioinformatics methods must have sound biological principles behind them or captured within them. To paraphrase Ouzounis (1): “By approaching bioinformatics (or computational biology) as a science, the field will not be [misinterpreted] as a technology platform” (see also Claverie, 3).

Where did bioinformatics come from?

“Bioinformatics is a new science, which arose in the last 5 to 10 years or so”. We’ve all heard phrases along these lines at seminars, conference talk introductions and by groups trying to persuade their powers-that-be to fund them. True? No.
In fact, while popular interest in bioinformatics is new, activity in the field has been around since the beginnings of molecular biology. This point has been re-iterated in several recent articles (1-3), e.g.: “This is one of the first and foremost misconceptions: people new to the field think that the field is new” (3). Examples of bioinformatics have been around since the 1960s – early 1970s and certainly it was established as a small, but active, area by the 1980s. Computer applications have always arisen where they have been seen to be useful or interesting for as long as computers and appropriate biological applications have been around. The nature of bioinformatics has adapted to the biological applications of the day, hence its apparent ‘recent invention’.
Early work includes phylogenetics (1960s-), crystallographic applications, investigation of the genetic code, the foundation of the PIR (Protein Identification Resource) database in the late 1960s by Margaret Dayhoff, early observations of conserved protein sequence and structural features (late 1960s), formal sequence and structure comparison techniques along with similarity matrices and studies exploring parameters for sequence alignment (1960s – early 1970s). Bioinformatics can be traced back to these pioneers, blending chemistry, physics, genetics and computing in their efforts to learn of the molecular basis of life. As molecular biology emerged as a force, generating increasing amounts of sequence and structural data, more database-oriented applications appeared such as searching techniques and identification of common patterns (1980s-).
Readers wishing to delve into early bioinformatics could read the history of molecular biology (e.g. 10), early issues of CABIOS (now Bioinformatics, founded in 1985) or use any of the bioinformatics texts of the 1980s as a starting point (references 8, 11-13).
Systematic biology (genomes, proteomics, etc.) is placing emphasis on data management perhaps at the expense of theoretical and computational biology (see Fig. 1). Early bioinformatics had little data management because there was little data to manage. This change has not created bioinformatics, but rather popularised it and presented new challenges. Part of this popularisation is no doubt due to commercial marketing hype.

Where is bioinformatics heading?

Below is a decidedly incomplete list of future directions for bioinformatics. As bioinformatics has historically tended to reflect what is going on in molecular biology, one simple way to predict future developments is to look at recent developments in biology and biotechnology. I’ve avoided the purely biologically-driven predictions to conserve space and focussed on what I see as the conceptual and technological driving forces.

Smart database interactions and smart data formats

As the volume and sources of data grow, larger and better-organised databases are needed. All these databases cannot be brought together at a single site. Thus, databases around the world must query each other for information the local database does not hold. Database developers need to agree on standard database intercommunication approaches and data formats. Development of these protocols and methods requires a deep understanding of data representation in computer science and an understanding of the kinds of data found in biology and the questions asked of this data.
History suggests the scale problem itself will be largely solvable. After all, in 1986 Lewin wrote a paper titled [The] ‘DNA databases are swamped’ (14), but we seem to be fine.

High-end computer systems

Improved computer capability has had an obvious impact on science. While modern workstations are surprisingly capable, some applications in bioinformatics still require high-end performance.
Clustering many cheap computers together can provide a lot of computing capacity at low cost, but not all applications are suitable to being distributed across an array of computers.
Providing access to large computers located at another site can benefit financially-strained groups. However, as commercial firms are unlikely to want their data or algorithms on these public machines, funding considerations may arise. This approach possibly has its best use in making available the spare capacity of large computer systems purchased on the merit of some specific (academic) project.
Several popular CPU manufacturers have 64-bit CPUs planned, so desktop (i.e. relatively cheap) computers with very large main memory (RAM) capacity may be available in the near future. RAM is thousands of times faster than disk drives; one approach to speed processing is simply to place all the data into RAM. New algorithms are possible by assuming the presence of large amounts of RAM.
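One simple illustration of an approach that assumes the data fits in RAM is an in-memory index of every short subsequence (k-mer) in a set of sequences, so that later lookups never touch the disk. The sketch below is an assumption-laden example and not a method described in this article; the function names, the value of k and the toy sequences are all made up.

```python
from collections import defaultdict

def build_kmer_index(sequences, k=4):
    """Map each k-mer to the (sequence name, offset) pairs where it occurs.

    `sequences` is a dict of name -> sequence string held entirely in memory.
    """
    index = defaultdict(list)
    for name, seq in sequences.items():
        for start in range(len(seq) - k + 1):
            index[seq[start:start + k]].append((name, start))
    return index

def lookup(index, kmer):
    """All recorded positions of a k-mer, or an empty list if it is absent."""
    return index.get(kmer, [])

# Toy sequences, invented for illustration.
TOY_SEQUENCES = {"s1": "ACGTACGT", "s2": "TTACGA"}
TOY_INDEX = build_kmer_index(TOY_SEQUENCES, k=4)
```

The trade-off is memory for speed: the index is several times larger than the sequences themselves, which is affordable only when a large amount of RAM can be assumed.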
Possibly the largest untapped resource is the idle time of computers in large organisations. Here the hardware has been paid for and the costs lie in systems management (a common factor to all high-end systems) and a degree of control over / access to users’ computers.

Data mining

Data-driven science has proved popular recently. Take a data set and try to identify within it properties that distinguish that data set from others. Try finding subsets within a data set or relationships amongst the elements of the data. Approaches like this are not entirely new in bioinformatics (see 3), but recent times have seen the applications of techniques from information science.
These attempts are valuable, but to the author’s mind they are most valuable if the results can be cast into a biological model so that they can be understood, rather than effectively left as a set of weights to multivariate elements whose relationship with the overall biological problem or behaviour is not really understood. These approaches have the potential for generating new ‘first principles’ information, but this appears difficult to do in practice and seems to be rarely applied. A related issue is that many of the workers in this area perhaps lack the detailed knowledge to recast what they find in (theoretical) biological terms, upon which new analytical approaches could then be built.

bio-IT: yet another jazz word

Bio-IT is the realm of the major computer database and hardware vendors. Biotechnology and bioinformatics are high projected-growth areas, with forecasts of up to 45-55% per annum for the next 5 years (see also 15). These large IT players continually seek new growth areas for their businesses and are pitching their wares in this area, (re)packaged as bio-IT. In time it may prove that much of the basic infra-structural component will be pre-packaged. There is more money for some sectors of bioinformatics from the traditional computer giants than from the biotechnology sector.

Literature mining, literature databases and nomenclature

Historically, continuity of the record has played a major role in science and modern culture. The success of Medline in a curious way is disrupting this continuity. Few people now look up, in the manual way, the older literature that is not accessible either via Medline or the cited references of a paper. Thus, the pre-Medline literature is vanishing from sight. Some recent papers I have seen are almost total re-inventions and I can’t help but suspect that this is partly influenced by this trend (as well as possible laziness on the researchers’ part!). The older literature badly needs to be incorporated into electronic form to prevent a loss of the continuity of the scientific record.
No scientist seems to be able to keep up with the flood of papers (can any of you, readers?!). The text in abstracts, and perhaps eventually full papers, can be mined for relationships between words to provide leads for further investigation. Databases of putative word relationships (so-called knowledge-bases) can be constructed and queried.
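A very crude version of mining abstracts for word relationships is to count how often two terms of interest appear in the same abstract. The sketch below assumes a hypothetical list of abstract strings and a hypothetical term list; real literature mining needs far more (synonym handling, sentence parsing, statistics), so treat it as an illustration of the idea only.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(abstracts, terms):
    """Count abstracts in which each pair of terms appears together.

    Matching is case-insensitive and on whole words only.
    Keys are alphabetically ordered (term, term) tuples in lower case.
    """
    wanted = {term.lower() for term in terms}
    counts = Counter()
    for text in abstracts:
        words = set(re.findall(r"[a-z0-9\-]+", text.lower()))
        present = sorted(wanted & words)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Invented abstracts and placeholder names, for illustration only.
TOY_ABSTRACTS = [
    "Alpha1 binds beta2 in vitro.",
    "Beta2 regulates alpha1 stability.",
    "Gamma3 interacts with alpha1.",
]
TOY_COUNTS = cooccurrence(TOY_ABSTRACTS, ["alpha1", "beta2", "gamma3"])
```

Pairs with high counts become putative relationships to store in a knowledge-base and follow up. A co-occurrence says nothing about what the relationship is, which is where the nomenclature and annotation issues discussed next come in.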
Nomenclature issues are set to continue to play a role as it is impractical to do computational work on items with inconsistent naming. This has obvious impact on literature mining, as well as searching gene and protein databases.
Better annotation of the origin of information needs to be included so that annotation can be revised appropriately. For example, function might be assigned by homology to a protein or gene whose function has previously been assigned. If the function of the gene is subsequently revised, then so must the function of the genes whose functions were assigned by homology to it.
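The homology example can be sketched as a small data structure that records where each annotation came from, so that a revision can be pushed through to everything derived from it. The class and method names and the gene names in the usage note are hypothetical, and overwriting dependants automatically is a simplifying assumption; a real system might flag them for review instead.

```python
from collections import defaultdict, deque

class AnnotationStore:
    """Gene function annotations that remember their origin."""

    def __init__(self):
        self.function = {}                 # gene -> assigned function
        self.source = {}                   # gene -> gene it was copied from, or None
        self.derived = defaultdict(list)   # gene -> genes annotated from it

    def assign_experimental(self, gene, function):
        """Record a function established directly, with no source gene."""
        if gene in self.function:
            raise ValueError(f"{gene} is already annotated")
        self.function[gene] = function
        self.source[gene] = None

    def assign_by_homology(self, gene, source_gene):
        """Copy the function of an already annotated gene and record the link."""
        if gene in self.function:
            raise ValueError(f"{gene} is already annotated")
        self.function[gene] = self.function[source_gene]  # KeyError if unknown
        self.source[gene] = source_gene
        self.derived[source_gene].append(gene)

    def revise(self, gene, new_function):
        """Revise a gene and every annotation derived from it, directly or not.

        Returns the genes that were updated, in breadth-first order.
        """
        if gene not in self.function:
            raise KeyError(gene)
        self.function[gene] = new_function
        updated = [gene]
        queue = deque([gene])
        while queue:
            current = queue.popleft()
            for child in self.derived.get(current, []):
                self.function[child] = new_function
                updated.append(child)
                queue.append(child)
        return updated
```

For example, if `geneB` was annotated from `geneA` and `geneC` from `geneB`, revising `geneA` updates all three, while an unrelated `geneD` is left alone. Because each gene can be annotated only once here, the provenance links form a tree and the traversal always terminates.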

Annotation of genomes

This remains a major endeavour and will for some time to come. I agree with Claverie (3) that researchers should stop the “fallacious analogy that the genome is a text to be deciphered” and look instead to the physical structure and properties of genomes. Some of the current problems are discussed in reference 16 from a ‘sequence’ perspective. Fundamental to this is the interaction of proteins with the genomic DNA in its structural context in the nucleus and the higher-order structural complexes that arise as a consequence of this, a focus of the author’s work.

Pathways and interactions: systems biology

Pathway simulations and systems biology are emerging from many years of obscurity to be one of the big players. Workers aim to simulate biological systems on a large scale – ultimately complete cells. Present studies naturally focus on smaller systems!
A major problem for this area is that a number of the “dimensions” of the data are not known particularly well or at all. It is not enough to know just the molecules and their putative interactions. Dynamic factors such as association/dissociation constants and concentration levels play a role. The space dimension includes the issue of diffusion and/or transport of molecules. Then there is the time dimension. Given the (complete) absence of some of these data, it would seem that it may be some time before this area moves beyond small systems which can either acquire or assume some of these inputs.
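To show why the rate constants and concentrations matter, here is a minimal sketch of the smallest possible "system": one reversible binding reaction, A + B <-> AB, integrated with a simple forward-Euler step. The rate constants, starting concentrations, step size and function name are arbitrary assumptions in arbitrary units. The sketch also ignores the space dimension entirely by assuming a well-mixed volume.

```python
def simulate_binding(a0, b0, k_on, k_off, dt=1e-3, steps=20000):
    """Forward-Euler simulation of A + B <-> AB in a well-mixed volume.

    Returns a list of (time, [A], [B], [AB]) tuples, starting with no complex.
    """
    a, b, ab = a0, b0, 0.0
    trajectory = [(0.0, a, b, ab)]
    for step in range(1, steps + 1):
        # Net rate of complex formation: association minus dissociation.
        rate = k_on * a * b - k_off * ab
        a -= rate * dt
        b -= rate * dt
        ab += rate * dt
        trajectory.append((step * dt, a, b, ab))
    return trajectory

# Arbitrary illustrative parameters, not measured values.
TOY_RUN = simulate_binding(a0=1.0, b0=1.0, k_on=1.0, k_off=0.5)
```

Even here the outcome depends on four numbers that must be measured or assumed. At equilibrium the ratio [A][B]/[AB] settles at k_off / k_on, so without those constants the model cannot say how much complex forms, let alone how fast. Scaling this up to a pathway multiplies the number of unknown inputs accordingly.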

More emphasis on computational biology?

I prefer to think of myself as someone who does computational biology, i.e. biology, using computers as the tool as opposed to general informatics on data that happens to be biological. Like most computational biologists, while I am interested in the benefits that new data management techniques bring, I am more interested in what I can learn from basic biology to create methods that reveal new knowledge about biological systems and, vice versa, what biology can be learnt from new computational models.
Some computational biologists may not write software, but instead apply a deep knowledge of bioinformatics to the biological question in mind. I believe research groups and companies need to place more emphasis on this level of bioinformatics, encouraging specialist groups to develop the analytical software needed and outsourcing this software development where appropriate. It is worth considering that algorithm development in bioinformatics is likely to become increasingly complex and competitive.
Computational biologists who write their own software need to ensure that their software interfaces with the new biological data management systems.
The concepts and ‘facts’ used to develop computational models in bioinformatics are vital, but regrettably perhaps not highly regarded amongst experimental biologists. Chemists and physicists rarely ignore their theoretical components; they look to them for answers. And biological systems are, after all, chemical systems behaving in physically-defined ways. These concepts and ‘facts’ provide a bridge from the hard sciences that is valuable to all biologists, and they need to be taught and improved upon.

Structural bioinformatics

The structural genome project aims to obtain a representative protein structure for every protein fold, enabling every sequence to be mapped onto a fold. This project is a (rare) example of an experimental project which rests on a computational procedure for (part of) its success. The various sequence-structure fitting methods are critical to the ability of the elucidated structures to provide a fold for a related sequence of unknown structure.
Mapping sequence, functional and evolutionary data onto representative protein structures dramatically improves protein bioinformatics as the data is laid out on a 3D surface, making it much easier to reliably identify sequence-based features.
Combining this approach with other methods yields information about protein substrate binding sites and protein-substrate interactions. Molecular interactions are at the heart of biology, so these are vital clues. (The experimental approaches to yield the same, should, of course, not be ignored.) Higher-order structural biology and systems biology will need to be integrated at some point, combining structural genomics, structural bioinformatics and the large scale chemical and physical properties of the biological systems involved. This integration could possibly be considered to be one of the ultimate bioinformatics projects, providing a physically-defined systems biology.
As a stepping-stone towards this, I believe there should be more projects focussing on one region/organelle/process within the cell, aiming to vertically integrate the experimental observations at the cellular and molecular levels, structural biology and the bioinformatics. The author applies his computational methods to eukaryote gene regulation using this style of approach.

Physics and chemistry in bioinformatics

Perhaps the most fundamental development to the author’s mind is that more chemistry, and especially physics, is needed to make best use of the new biological data. To accurately model sample behaviour in high-throughput equipment, integrate structural genomics data, higher-order structures and the behaviour of molecular systems, physical models are needed. These are physical in both senses of the word. Physical as in molecular and organelle structure and organisation. And physical in the sense of the mechanical and dynamic properties of physical systems. So, perhaps, biophysics will dominate the next era of bioinformatics science?