
Bioinformatics

Databases And Analysis Programs



A good deal of the early work in bioinformatics focused on processing and analyzing gene and protein sequences catalogued in databases such as GenBank, EMBL, and SWISS-PROT. Such databases were developed in academia or by government-sponsored groups and served as repositories where scientists could store and share their sequence data with other researchers. With the start of the Human Genome Project in 1990, efforts in bioinformatics intensified, rising to the challenge of handling large amounts of DNA sequence data that were being generated at an unprecedented rate. By the mid- to late 1990s, much of the effort in bioinformatics centered on genomic data, generated by the Human Genome Project and by private companies, and on proteomic data.



Early analysis of sequence information focused on looking for similarities between genes and between proteins. Algorithms were developed to help researchers rapidly identify similar gene or protein sequences. Such tools were extremely useful for determining whether a newly sequenced piece of DNA was at all similar to sequences already entered in a database. To determine how multiple sequences align and to view their similarities, multiple-alignment programs were developed. Such programs helped scientists compare the sequences of closely related genes or compare the sequence of a particular gene or protein as it appears in several species.
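To make the idea of sequence comparison concrete, the following is a minimal sketch of a pairwise global alignment by dynamic programming (the Needleman-Wunsch approach). It is an illustration only, not the method used by any particular database search tool: the function names, the scoring values (match +1, mismatch -1, gap -1), and the example sequences are assumptions chosen for readability. Real tools use substitution matrices, more refined gap penalties, and faster heuristics.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Align two sequences end to end; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions, not standard parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Each cell holds the best score for aligning a[:i] with b[:j].
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two residues
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Share of alignment columns holding the same residue (one of several conventions)."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    total, top, bottom = global_alignment("GATTACA", "GATACA")
    print(total)
    print(top)
    print(bottom)
    print(round(percent_identity(top, bottom), 1))
```

Database search programs apply the same basic idea, scoring a query against many stored sequences, while multiple-alignment programs extend it from two sequences to several.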

To better understand the functional roles of new nucleotide and amino acid sequences, researchers developed algorithms to look for particular sequence "domains." Domains are regions where a particular sequence of nucleotides or amino acids is indicative of function in the protein. For example, a protein may have a domain that binds to ATP or GTP, two nucleotides that are important regulators of protein activity.

Figure: The computer monitor of an automated gel sequencer displays a digital gel image. This data is more easily analyzed in this environment.

In addition, these algorithms can detect sequences that denote a region involved in particular types of post-translational modifications, such as tyrosine phosphorylation. Tools such as PROSITE, BLOCKS, PRINTS, and Pfam can be used to detect and predict such protein domains in sequence data.
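As a rough illustration of pattern-based detection, the sketch below translates a PROSITE-style pattern into a regular expression and scans a protein sequence with it. The pattern shown, [AG]-x(4)-G-K-[ST], is the widely cited PROSITE signature for an ATP/GTP-binding site (the "P-loop"). The function names and the test sequence are made up for this example, and the converter is an assumption-laden simplification: it handles only the basic syntax and is not how PROSITE itself is implemented. Profile- and model-based resources such as Pfam rely on statistical methods that a regular expression cannot capture.

```python
import re


def prosite_to_regex(pattern):
    """Convert basic PROSITE pattern syntax to a Python regular expression.

    Handles: '-' separators, 'x' (any residue), [ABC] (any of), {ABC} (none of),
    (n) and (n,m) repeats, and '<' / '>' terminal anchors.
    Simplification: anchors written inside square brackets are not supported.
    """
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                      # separators carry no meaning
    regex = regex.replace("x", ".")                     # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # (4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex


def find_motifs(sequence, pattern):
    """Return (1-based start, matched text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]


if __name__ == "__main__":
    p_loop = "[AG]-x(4)-G-K-[ST]"     # ATP/GTP-binding site motif
    toy_protein = "MSTAVLGPDTSGKTQWE"  # hypothetical sequence for illustration
    print(prosite_to_regex(p_loop))
    print(find_motifs(toy_protein, p_loop))
```

A match of this kind is only a hint that a region may bind ATP or GTP. Short patterns also occur by chance, so hits are normally treated as predictions to be checked against other evidence.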

Structure is central to protein function, and another set of tools, including SWISS-MODEL, allows researchers to use gene and protein sequence data to predict a protein's three-dimensional structure. Such tools can help predict how mutations in a gene sequence could alter the three-dimensional structure of the corresponding protein. They accomplish such molecular modeling by comparing a novel sequence to the sequences of genes whose protein structures are known.

The majority of tools were developed as academic freeware distributed on the Internet. In the early to mid-1990s, commercial companies began to develop their own proprietary algorithms and tools, as well as their own proprietary databases. Those databases were then marketed to pharmaceutical and biotech companies as well as to academic research groups. In the mid- to late 1990s, the most commercially viable and profitable businesses focused on the production and sale of proprietary DNA- and gene-sequence databases. These databases primarily contained genetic information that was not available in public-domain databases such as GenBank, and they thus offered potential competitive advantages to the drug discovery groups of large pharmaceutical and biotech companies.
