Chemoinformatics for Computer Scientists

Guest Post by Aaron Sterling

I recently completed a review of

*Handbook of Chemoinformatics Algorithms*(HCA) for SIGACT News. (See here for the full 12 page review. I have tried to recast the language of HCA into something more accessible to a computer scientist.) Somewhere along the way, my goal for the project changed from just a review of a book, to an attempt to build a bridge between theoretical computer science and computational chemistry. I was inspired by two things: (1) none of the computer scientists I talked to about this -- not even ones who did work in bioinformatics -- had ever heard of chemoinformatics; and (2) the state of the art of chemoinformatics algorithms remains rudimentary from a TCS perspective (though the applications and the problems being solved are quite complex). I believe this represents a tremendous interdisciplinary research opportunity: hundreds of millions of dollars are riding on the speed and accuracy of the techniques presented in HCA, and I suspect that "small" mathematical improvements could yield large payoffs.

My quick-and-dirty definition of

*chemoinformatics*is, "Algorithms, databases and code to help chemists." A more thorough description can be found at this Wikipedia article. (The linked article provides a gentle overview of chemoinformatics, with links to several more specialized articles.) The most discussed applications in HCA are

*in silico*pharmaceutical discovery, solvent discovery, and petroleum reaction improvement and analysis.

The TCS community has formally recognized the importance of working more closely with chemists since at least the 2007 Computational Worldview and the Sciences Workshops, which discussed the tradeoff between "chemical cost" and "computational cost" of producing nanodevices. The report on those workshops speculates,

*While the computational costs are fairly straightforward to quantify, the same is not true of the chemical costs. Perhaps the Computer Science lens can be used to construct a formal, quantitative model for the relevant chemical processes, which can then be used to optimize the above tradeoff.*" After reading HCA, I believe formalizing such tradeoffs is important for all computational chemistry, not just nanochemistry.

Unlike the field of bioinformatics, which enjoys a rich academic literature going back many years, HCA is the first book of its kind. There are a handful of graduate textbooks on chemoinformatics, but HCA is the first attempt to collect all chemoinformatics algorithms into one place. The difference in academic development is due to the proprietary nature of chemical databases, in contrast to biological data, which has a long history of being publicly available. As a result, thoroughgoing academic investigation of chemoinformatics is quite new, and there does not appear to be an overarching mathematical theory for any of the application areas considered in HCA.

To provide an intuition for the type of problems considered, suppose you want to find a molecule that can do a particular thing. We assume that if the new molecule is structurally similar to other molecules that can do the thing, then it will have the same property. (This is called a "structure-activity relationship," or SAR.) However, we also need the molecule to be sufficiently different from known molecules so that it is possible to create a new patent estate for our discovery. The naive way to check for structural similarity would be to compare two molecules by solving the Subgraph Isomorphism Problem. There are some algorithms currently in practice that do this, but we expect that problem to be infeasible to solve in general. Therefore, we take graph-theoretic representations of the molecules we want to compare, and extract structural information from them in the form of real numbers called

*molecular descriptors*. (An example of a molecular descriptor that comes up in TCS is the number of distinct spanning trees that spans the molecular graph. There are over 2000 descriptors in the literature, and most require knowledge of chemistry to describe.) If our two molecules are close with respect to a metric in a descriptor space, we predict that they have the same functionality. Then we can test in a wetlab whether or not the prediction is true. The objective is to use computational resources to save time and money by "preprocessing" the laboratory experimental steps.

As this is the Computational Complexity blog, I will provide a quote from HCA on "complexity indices" for molecules. HCA, in turn, is partially quoting from

*Complexity and Chemistry: Introduction and Fundamentals*by Bonchev and Rouvray.

*A complexity index*should

- Increase with the number of vertices and edges
- Reflect the degree of connectedness
- Differentiate nonisomorphic systems
- Increase with the size of the graph, branching, cyclicity, and the number of multiple edges

Still, this is an ongoing discussion, with even conflicting positions.

Both Chapter 4 of HCA and (in much more detail)

*Complexity in Chemistry*provide many complexity indices that appear to have these properties. However, the argumentation is one of induction on small examples. There is no formal mathematics comparing different indices, and the explanation of usefulness of a complexity index is limited to, "It worked for this particular application." This currently ad hoc state of the field leads me to believe that there could be a significant interdisciplinary research opportunity available for theoretical computer scientists willing to put in the time to learn the vocabulary and perspective of the computational chemist.

I believe chemoinformatics, like bioinformatics, will provide an important source of problems for computer scientists, and I hope the publication of HCA, and (to a lesser but real extent) this guest blog and my review, encourage greater participation between computational chemistry and TCS.

**Update 2/13**: Chemist Rajarshi Guha has posted a response on his own blog