Biocomputing and Bioinformatics

Biotechnology is expected to become a key factor in the next century in a diversity of application areas such as medicine, chemistry and materials engineering. The application of computer technology to biological information, known as bioinformatics, is considered an important part of biotechnology.

We consider biocomputing to be different from bioinformatics, although the two terms are often used interchangeably. By biocomputing we mean the construction and use of computers that function like living organisms or contain biological components. In order to reason about and use such `machines', special algorithms have to be designed and new complexity theories have to be developed. Since some of these algorithms are also used in bioinformatics, the confusion between the two areas is understandable. Although biocomputing is an interesting research topic in its own right, bioinformatics is considered to have more scientific and economic impact.

In this note we discuss several aspects of bioinformatics. First we give an introduction to the area. Then we survey the research topics in bioinformatics for which a program could be defined within the mathematics and computing science department, and we conclude with a possible bioinformatics profile for the TU Eindhoven.

Proteins are the fundamental building blocks of life. Cell structures are either made up of proteins or produced by enzymes, which are themselves proteins. Proteins are variable-length linear polymers built from 20 different amino acids. These linear polymers fold upon themselves to generate a shape characteristic of each protein, and this shape, along with the different chemical properties of the 20 amino acids, determines the function of the protein. Since the sequence of a protein can be determined from the DNA sequence which encodes it, most protein sequences are in fact inferred from DNA sequences.
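The inference of a protein sequence from a DNA sequence can be illustrated with a small sketch. It assumes the standard genetic code and a coding strand that is read in frame from its first base; the names `CODON_TABLE` and `translate` are ours and serve only as an illustration.

```python
# Illustrative sketch: infer a protein sequence from a DNA coding sequence.
# Assumptions: standard genetic code, input starts in frame, only A/C/G/T.
BASES = "TCAG"
# One-letter amino acid codes ('*' = stop) for all 64 codons in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate codon by codon and stop at the first stop codon."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# ATG GCC TGG TAA encodes Met, Ala, Trp, followed by a stop codon.
print(translate("ATGGCCTGGTAA"))
```

Real sequence data also requires finding the correct reading frame and handling ambiguous bases, which this sketch leaves out.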

understanding the functions of the proteins which are encoded in the sequence,

Improvements in these areas lead to a better understanding of organisms, their metabolism and their evolution. Health care and drug design, new (bio)materials and their engineering, and food engineering and food production are obvious examples of fields that may profit directly from this improved knowledge.

Several research areas in computing science and mathematics play an important role in these areas; some of them are dealt with in the next section.

As described in the previous section, much attention in molecular biology research is devoted to analysing sequences. Large databases of DNA sequences have been collected: GenBank in the USA, EMBL in Europe and DDBJ in Japan. These databases are very large: the latest release of GenBank exceeded one billion base pairs. Not only is the amount of sequence data increasing rapidly; the number of characterized genes from many organisms, and of protein structures, also doubles about every two years.

The earliest tasks in bioinformatics were therefore the creation and maintenance of such databases of biological information. DNA sequences (and the protein sequences derived from them) make up the majority of the entries in such databases. The storage and organization of millions of nucleotides is far from trivial, and designing a database and an interface through which researchers can both access existing information and submit new entries is equally challenging. New database and datamining techniques need to be developed to handle this.

In order to compare a sequence efficiently with a vast number of other sequences, algorithms have been developed and further ones are needed. Most algorithms are based upon a similarity measure for two or more sequences. This measure is used to determine the alignments of the sequences, i.e., the arrangements of the sequences that show where they are similar and where they differ. Finding the optimal alignment is a problem to which techniques from dynamic programming, combinatorial optimization, heuristic search methods, neural network theory, and statistics are applied.
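As an illustration of the dynamic programming approach to alignment, the following sketch computes one optimal global alignment of two sequences in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, gap -2) and the function name `align` are assumptions chosen for the example, not values taken from any particular tool.

```python
# Illustrative sketch: global pairwise alignment by dynamic programming
# (Needleman-Wunsch style). Scoring parameters are example assumptions.
def align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)

    def pair_score(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = align("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

The table has (n+1)(m+1) entries, so time and memory grow with the product of the sequence lengths. This is why heuristic search methods are of interest when one sequence has to be compared against an entire database.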

Besides the analysis of sequences, much bioengineering research is devoted to developing methods to predict the structure and/or function of (newly discovered) proteins. As noted above, the structure of a protein is produced by the folding of the polymer chain back onto itself, and by the association of multiple chains. Current research on protein folding and structure prediction uses two basic approaches: homology-based and ab initio.

Ab initio approaches try to determine the protein structure that minimizes the free energy. Large-scale computing techniques from molecular modelling, such as molecular dynamics, Monte Carlo techniques and genetic algorithms, are successfully applied in this area.
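The Monte Carlo idea can be sketched with a Metropolis search combined with a simple cooling schedule. The energy function below is a purely hypothetical toy landscape over a few angles and has no physical meaning; a real application would use a molecular force field. All names and parameter values are assumptions made for the illustration.

```python
# Illustrative sketch: Metropolis Monte Carlo search for a low-energy state.
# The energy function is a hypothetical toy, NOT a real protein force field.
import math
import random


def toy_energy(angles):
    """Toy rugged landscape: global minimum 0 when every angle is 0 (mod 2*pi)."""
    return sum(
        1.0 - math.cos(x) + 0.3 * (1.0 - math.cos(3.0 * x)) for x in angles
    )


def metropolis_minimize(energy, start, steps=20000, temperature=1.0,
                        cooling=0.9995, step_size=0.5, seed=0):
    """Return (best_state, best_energy) found during the random walk."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    current = list(start)
    e_current = energy(current)
    best, e_best = list(current), e_current
    for _ in range(steps):
        # Propose a small random change to one coordinate.
        candidate = list(current)
        idx = rng.randrange(len(candidate))
        candidate[idx] += rng.uniform(-step_size, step_size)
        e_candidate = energy(candidate)
        delta = e_candidate - e_current
        # Metropolis criterion: always accept downhill moves, accept uphill
        # moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = list(current), e_current
        temperature *= cooling  # gradually lower the temperature
    return best, e_best


start = [2.0, -2.5, 1.0]
best, e_best = metropolis_minimize(toy_energy, start)
print(toy_energy(start), e_best)
```

Accepting occasional uphill moves is what allows the search to escape local minima, at the cost of many energy evaluations. For realistic protein models each evaluation is expensive, which explains the need for large-scale computing.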

Homology-based approaches attempt to determine the structure of a protein by comparing its sequence to those of related proteins whose structures are known. Clustering protein sequences into families of related sequences and developing protein models are important topics here. Datamining techniques, statistics and genetic algorithms are applied to generate phylogenetic trees in order to examine evolutionary relationships.
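To make the idea of clustering related sequences into a tree concrete, the sketch below uses UPGMA (average-linkage clustering), a classical distance-based method that we substitute here for illustration; it is not one of the techniques named above. It assumes the sequences are already aligned and of equal length, and the function names and example sequences are hypothetical.

```python
# Illustrative sketch: build a tree of related sequences with UPGMA
# (average-linkage clustering). Assumes pre-aligned, equal-length sequences.
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which the two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(sequences: dict):
    """Return the tree topology as nested tuples of sequence names."""
    # Map each (sub)tree to the list of leaf names it contains.
    clusters = {name: [name] for name in sequences}

    def cluster_distance(c1, c2) -> float:
        # UPGMA distance: mean of all pairwise leaf distances.
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        total = sum(p_distance(sequences[x], sequences[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters into a new subtree.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members
    return next(iter(clusters))


# Hypothetical example data: two pairs of closely related sequences.
example = {
    "a": "ACGTACGT",
    "b": "ACGTACGA",
    "c": "TTGTACCA",
    "d": "TTGTACCT",
}
print(upgma(example))
```

This version recomputes the cluster distances in every round, which keeps the code short but is too slow for large families; it also returns only the topology, without branch lengths.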

Moreover, algorithms are developed to study the evolutionary process of DNA sequences. From the DNA sequence a genetic algorithm is derived by which a protein model can be simulated. The results of these simulations are then compared to experimental results, and when necessary the protein model is improved. In this scheme the principle of evolution is used to model structures and to simulate biomolecular reactions. Heuristic methods and energy minimization techniques are applied here as computational methods.

Before indicating what the bioinformatics profile for the TU Eindhoven could be, we emphasize that, although knowledge and expertise are available within the mathematics and computing science department in several of the above-mentioned research areas (heuristic methods, neural networks, computer simulations, statistics, combinatorial optimization), there is hardly any experience with biosystems. To be successful in the bioinformatics area, cooperation with experts from the biosciences is therefore needed.

Among the research areas highlighted above, the most prominent gap is in (bio)datamining techniques combined with computer simulation expertise for biosystems. A research group within computing science focussed on protein (simulation) algorithms and datamining techniques applicable to biosystems is expected to collaborate intensively with research groups in the biomedical engineering, chemistry and mechanical engineering departments on the engineering of new biomaterials with the appropriate mechanical and physicochemical properties.