Monday, January 31, 2011

Is Cheminformatics the new Bioinformatics? (Guest Post by Aaron Sterling)

Chemoinformatics for Computer Scientists

Guest Post by Aaron Sterling

I recently completed a review of Handbook of Chemoinformatics Algorithms (HCA) for SIGACT News. (See here for the full 12 page review. I have tried to recast the language of HCA into something more accessible to a computer scientist.) Somewhere along the way, my goal for the project changed from just a review of a book, to an attempt to build a bridge between theoretical computer science and computational chemistry. I was inspired by two things: (1) none of the computer scientists I talked to about this -- not even ones who did work in bioinformatics -- had ever heard of chemoinformatics; and (2) the state of the art of chemoinformatics algorithms remains rudimentary from a TCS perspective (though the applications and the problems being solved are quite complex). I believe this represents a tremendous interdisciplinary research opportunity: hundreds of millions of dollars are riding on the speed and accuracy of the techniques presented in HCA, and I suspect that "small" mathematical improvements could yield large payoffs.

My quick-and-dirty definition of chemoinformatics is, "Algorithms, databases and code to help chemists." A more thorough description can be found at this Wikipedia article. (The linked article provides a gentle overview of chemoinformatics, with links to several more specialized articles.) The most discussed applications in HCA are in silico pharmaceutical discovery, solvent discovery, and petroleum reaction improvement and analysis.

The TCS community has formally recognized the importance of working more closely with chemists since at least the 2007 Computational Worldview and the Sciences Workshops, which discussed the tradeoff between "chemical cost" and "computational cost" of producing nanodevices. The report on those workshops speculates, "While the computational costs are fairly straightforward to quantify, the same is not true of the chemical costs. Perhaps the Computer Science lens can be used to construct a formal, quantitative model for the relevant chemical processes, which can then be used to optimize the above tradeoff." After reading HCA, I believe formalizing such tradeoffs is important for all computational chemistry, not just nanochemistry.

Unlike the field of bioinformatics, which enjoys a rich academic literature going back many years, HCA is the first book of its kind. There are a handful of graduate textbooks on chemoinformatics, but HCA is the first attempt to collect all chemoinformatics algorithms into one place. The difference in academic development is due to the proprietary nature of chemical databases, in contrast to biological data, which has a long history of being publicly available. As a result, thoroughgoing academic investigation of chemoinformatics is quite new, and there does not appear to be an overarching mathematical theory for any of the application areas considered in HCA.

To provide an intuition for the type of problems considered, suppose you want to find a molecule that can do a particular thing. We assume that if the new molecule is structurally similar to other molecules that can do the thing, then it will have the same property. (This is called a "structure-activity relationship," or SAR.) However, we also need the molecule to be sufficiently different from known molecules so that it is possible to create a new patent estate for our discovery. The naive way to check for structural similarity would be to compare two molecules by solving the Subgraph Isomorphism Problem. Some algorithms in current practice do exactly that, but Subgraph Isomorphism is NP-complete, so we expect it to be infeasible to solve in general. Therefore, we take graph-theoretic representations of the molecules we want to compare, and extract structural information from them in the form of real numbers called molecular descriptors. (An example of a molecular descriptor that comes up in TCS is the number of distinct spanning trees of the molecular graph. There are over 2000 descriptors in the literature, and most require knowledge of chemistry to describe.) If our two molecules are close with respect to a metric in a descriptor space, we predict that they have the same functionality. Then we can test in a wetlab whether or not the prediction is true. The objective is to use computational resources to save time and money by "preprocessing" the laboratory experimental steps.
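To illustrate how cheap a descriptor can be to compute, here is a minimal Python sketch (my own, not taken from HCA) of the spanning-tree descriptor mentioned above. It uses Kirchhoff's matrix-tree theorem: the number of spanning trees of a connected graph equals any cofactor of the graph Laplacian.

```python
def spanning_tree_count(n, edges):
    """Number of spanning trees of a graph on vertices 0..n-1,
    via Kirchhoff's matrix-tree theorem: delete one row and
    column of the Laplacian and take the determinant."""
    # Build the graph Laplacian L = D - A.
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    # Cofactor: drop row 0 and column 0.
    M = [row[1:] for row in L[1:]]
    m = n - 1
    det = 1.0
    # Gaussian elimination with partial pivoting.
    for i in range(m):
        p = max(range(i, m), key=lambda r: abs(M[r][i]))
        if abs(M[p][i]) < 1e-12:
            return 0          # singular: graph is disconnected
        if p != i:
            M[i], M[p] = M[p], M[i]
            det = -det
        det *= M[i][i]
        for r in range(i + 1, m):
            f = M[r][i] / M[i][i]
            for c in range(i, m):
                M[r][c] -= f * M[i][c]
    return round(det)

# The 6-cycle (the carbon skeleton of a benzene ring) has 6 spanning trees.
ring = [(i, (i + 1) % 6) for i in range(6)]
print(spanning_tree_count(6, ring))  # → 6
```

Once each molecule is reduced to a vector of such numbers, "structural similarity" becomes an ordinary distance computation in descriptor space.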

As this is the Computational Complexity blog, I will provide a quote from HCA on "complexity indices" for molecules. HCA, in turn, is partially quoting from Complexity and Chemistry: Introduction and Fundamentals by Bonchev and Rouvray.

A complexity index should
  1. Increase with the number of vertices and edges
  2. Reflect the degree of connectedness
  3. Differentiate nonisomorphic systems
  4. Increase with the size of the graph, branching, cyclicity, and the number of multiple edges
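Requirement 3 is harder to satisfy than it looks. As a toy Python check (my own example, not from HCA), consider the first Zagreb index, the sum of squared vertex degrees, a classical and very weak complexity descriptor: it increases with size and branching, but it cannot differentiate graphs as simple as one hexagon versus two disjoint triangles.

```python
def zagreb_index(n, edges):
    # First Zagreb index: sum of squared vertex degrees.
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(d * d for d in deg)

hexagon = [(i, (i + 1) % 6) for i in range(6)]                    # one 6-ring
two_triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]  # two 3-rings
# Both graphs are 2-regular on 6 vertices, so the index cannot tell
# these nonisomorphic graphs apart, violating requirement 3.
print(zagreb_index(6, hexagon), zagreb_index(6, two_triangles))  # → 24 24
```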

Still, this is an ongoing discussion, and there are even conflicting positions.

Both Chapter 4 of HCA and (in much more detail) Complexity in Chemistry provide many complexity indices that appear to have these properties. However, the argument proceeds by induction from small examples. There is no formal mathematics comparing different indices, and the explanation of the usefulness of a complexity index is limited to, "It worked for this particular application." This currently ad hoc state of the field leads me to believe that there could be a significant interdisciplinary research opportunity available for theoretical computer scientists willing to put in the time to learn the vocabulary and perspective of the computational chemist.

I believe chemoinformatics, like bioinformatics, will provide an important source of problems for computer scientists, and I hope the publication of HCA, and (to a lesser but real extent) this guest blog and my review, encourage greater collaboration between computational chemistry and TCS.

Update 2/13: Chemist Rajarshi Guha has posted a response on his own blog.


  1. Please let me say that Aaron's guest post is one of the best-ever here on Computational Complexity.

    At my university I regularly attend the weekly seminar on synthetic biology that is hosted by David Baker's group ... this seminar is standing-room-only for a predominantly young audience ... and all the computational themes that Aaron's guest-post emphasizes are prominently on display.

    As a unifying context for the various links that Aaron provides, I would like to recommend the International Roadmap Committee (IRC) "More than Moore" White Paper. The IRC abstracts five elements for rapid progress in STEM enterprises: FOM, a Figure of Merit for assessing progress; LEP, a Law of Expected Progress describing the cadence of expected improvement; WAT, Wide Applicability of Technology; SHR, willingness to SHaRe the key elements responsible for progress; and ECO, an Existing COmmunity upon which to build what the IRC calls "virtuous circles" of technological progress.

    In abstracting these elements, the IRC has, I think, done us all a tremendous service.

    As the links of Aaron's post show implicitly, the Moore-style acceleration of progress in computational biology is associated to a culture that broadly and consciously embraces the IRC's "More than Moore" enterprise elements of FOM, LEP, WAT, SHR, and ECO. And I can personally testify, from attending the Baker Group's lively synthetic biology seminars, that this roadmap for linking math, science, and engineering is workable and fun! :)

    Hmmmm ... to borrow a theme from Lance's previous post The Ideal Conference, if the computer science community were to consciously embrace FOM, LEP, WAT, SHR, and ECO, then what informatic themes might a CS conference emphasize? Here I think Bill Thurston provides us with a mighty good answer:

    "Mathematics is an art of human understanding. ... Mathematical concepts are abstract, so it ends up that there are many different ways that they can sit in our brains. A given mathematical concept might be primarily a symbolic equation, a picture, a rhythmic pattern, a short movie---or best of all, an integrated combination of several different representations."

    To conceive of computational complexity as a Thurston-style "integrated combination of mathematical representations" that "sit in our brains in many different ways," with a deliberate view toward fostering in CS the "More than Moore" enterprise elements of FOM, LEP, WAT, SHR, and ECO, in order to grasp the opportunities and challenges in chemoinformatics (and many other enterprises) that Aaron's post identifies ... aye, lasses and laddies ... now *that* would be "An Ideal Conference". Not least because job opportunities in this field are burgeoning. Good! :)

    I have to say, though, that at any such conference, the computer scientists will learn at least as much from the biologists, chemists, and medical researchers, as the biologists, chemists, and medical researchers will learn from the computer scientists. Also good! :)

  2. Clearly I was ahead of my time :). My thesis in 1999 was on chemoinformatic algorithms, specifically finding pharmacophores for drug design.

  3. The problem you refer to is similar to problems in computer vision and machine learning.

  4. It is sad that the first example in this post involves patent law as a motivation for a computational problem. There are already enough interesting problems given to us by nature! But great post otherwise.

  5. Dear Aaron, would you be so kind to update your book review to list my actual last name, "Willighagen"?


  6. (Returning after having read the full blog post, and a good part of the review.)

    @Aaron, interesting review! I believe you conclude that the algorithms in the book are pretty basic... sadly, that was deliberate... the book is oriented more at showing the casual chemist what cheminformatics algorithms are about than at explaining the cutting-edge algorithms there are in cheminformatics (...), which would have made the book unreadable to the target audience. But in doing so, we alienated the people we (well, I do) would love to collaborate more with! :(

    As was clear from Rajarshi's chapter in particular, there is a large open source cheminformatics community that is very open (at least I am!) to collaboration, and within the CDK we have actually had such collaborations in the past.

    One more exciting problem in chemistry you may find attractive is the enumeration, or even just the counting, of possible graphs given a number of atoms and bonds (vertices and edges). Now, a chemical graph is a colored graph, and not all edges are allowed. Moreover, we are only interested in graphs that are distinct up to symmetry.

    One outstanding problem here is to calculate the number of chemical graphs given a number of atoms (as in a molecular formula, like C4H10O) without enumerating all structures.

    Secondly, 'we' would love an open source implementation of an efficient algorithm to enumerate all chemical graphs. The efficiency here lies primarily in not computing solutions that are symmetrically equivalent to solutions already generated.

    Now, this problem has been solved in the proprietary Molgen software, but it may provide you with the right amount of complexity you are looking for.
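For a feel of the enumeration problem described above, here is a brute-force Python sketch (entirely my own, and nothing like what Molgen does) that counts the carbon skeletons of the alkanes C_nH_{2n+2}: trees on n vertices with maximum degree 4, deduplicated up to isomorphism by trying all vertex relabelings. It reproduces the familiar isomer counts (butane 2, pentane 3) but scales horribly; the open problem is precisely to avoid ever generating symmetric duplicates.

```python
import heapq
from itertools import product, permutations

def pruefer_to_tree(seq, n):
    # Decode a Pruefer sequence into the edge list of a labeled tree.
    deg = [1] * n
    for x in seq:
        deg[x] += 1
    leaves = [v for v in range(n) if deg[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for x in seq:
        leaf = heapq.heappop(leaves)
        edges.append((leaf, x))
        deg[x] -= 1
        if deg[x] == 1:
            heapq.heappush(leaves, x)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

def canonical(edges, n):
    # Crude canonical form: minimum relabeled edge list over all n!
    # vertex permutations (fine for tiny n, hopeless beyond that).
    return min(tuple(sorted(tuple(sorted((p[u], p[v])) ) for u, v in edges))
               for p in permutations(range(n)))

def alkane_skeletons(n):
    """Count trees on n vertices with max degree 4 (carbon skeletons
    of C_nH_{2n+2}) up to isomorphism, by exhaustive enumeration."""
    if n <= 2:
        return 1
    seen = set()
    for seq in product(range(n), repeat=n - 2):
        # Degree of v in the decoded tree = (occurrences in seq) + 1;
        # carbon forms at most 4 bonds.
        if any(seq.count(v) + 1 > 4 for v in range(n)):
            continue
        seen.add(canonical(pruefer_to_tree(seq, n), n))
    return len(seen)

print([alkane_skeletons(n) for n in range(1, 6)])  # → [1, 1, 1, 2, 3]
```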

  7. Thank you for the long and critical review. I very much appreciate a critical view on the "computer science" side of cheminformatics before I have to read yet another questionable analysis mixing combinatorial and continuous problems and then concluding that it just works fine for "this non-benchmark data set".

    As pointed out multiple times, the lack of multi-label graph standards is still a major problem in this area, and things are just getting worse when going large-scale, or to 3D modelling problems.

    And it's Wegner, not Wagner (chapter 4). Thanks again, it was fun to read.

  8. This comment has been removed by the author.

  9. I would like to throw another (depressing) book into the discussion: the Handbook of Molecular Descriptors. There you will find yet another 1001 descriptors, leaving me, at least, with the feeling that there are too many names for the very same algorithm, differing only in a tunable parameter (often the labeling function). So, I do not understand why people would rather create unrelated and unoptimizable combinatorial problems when they could just turn things into smooth and optimizable ones. Then it would be so much clearer to everyone that the optimal parameter set has to differ with the underlying data, while the computing procedure remains the same. Besides, in the long term we would learn so much more. That said, I do not believe that many computer scientists even remotely understand the challenges we observe in the life science arena; at the end of the day, silicon and carbon are different.

    Anyway, developing a common understanding is already a good start.

  10. One bigger problem with getting involved with this work is access to data. From my understanding, a lot of the more interesting data sets (molecule collections etc) are proprietary and under lock and key in drug companies. Or am I wrong here ?

  11. Suresh asks: a lot of the more interesting data sets (molecule collections etc) are proprietary and under lock and key in drug companies. Or am I wrong here ?

    Nowadays, especially for younger scientists, the scientific ideal of "data" is flexing and bending like the Tacoma Narrows Bridge ... e.g., from NMR spectra ("raw data") is deduced a set of distance constraints ("reduction #1") from which is deduced a set of candidate ground-state structures ("reduction #2") from which is deduced a set of binding energies ("reduction #3") from which is deduced a set of enhanced-binding mutations ("reduction #4") ... which are synthesized and tested for binding affinity ... at which point the synthetic cycle begins anew.

    After a while, the notion of a linear hierarchy of quality becomes indistinct ... rather, quality in synthetic biology is regarded in much the same way that Terry Tao regards quality in mathematics:

    "the concept of mathematical quality [read 'quality in synthetic biology'] is a high-dimensional one, and lacks an obvious canonical total ordering. ... There does however seem to be some undefinable sense that a certain piece of mathematics [read 'synthetic biology'] is 'on to something,' that it is a piece of a larger puzzle waiting to be explored further."

  12. Aaron Sterling concludes: I believe chemoinformatics, like bioinformatics, will provide an important source of problems for computer scientists.

    The following conclusion is logically equivalent, yet psychologically opposite: "I believe computer science, like quantum information theory, is providing important new computational resources to synthetic chemists and biologists."

    This illustrates how "mathematics can sit in our brains in different ways" (in Bill Thurston's phrase) ... these choices obviously relate to Reinhard Selten's thought-provoking aphorism "game theory is for proving theorems, not for playing games."

    These cognitive choices are not binary. Instead, mixed cognitive strategies like "Theorems help computer scientists to conceive new computational resources" are globally more nearly optimal.

    That is why it is very desirable—essential, really—that everyone not think alike, regarding these key mathematical issues.

    For this reason, the emergence of diverse, open mathematical forums, like Computational Complexity, Gödel's Lost Letter, Shtetl Optimized, Combinatorics and More, and Math Overflow (and many more), is contributing greatly to accelerating progress across a broad span of STEM enterprises.

    The resulting cross-disciplinary fertilization can be discomfiting, ridiculous, and even painful ... but also irresistibly thought-provoking, playful, and fun. Good! :)

  13. Thanks to everyone for their interest, and for the exciting discussion -- and my apologies for misspelling names, now fixed. To respond to a few points:

    @matt: I explicitly chose a problem that included the building of a patent estate as a parameter because I wanted to construct a snapshot intuition of the type of problems encountered in HCA. Sad or not, economic profit is a central player in multiple chapters, both in what types of problems are "interesting," and in what kinds of tools are readily available. I believe the same could be said for most problems in computer science, though perhaps the profit influence is more veiled in TCS. As a reviewer, I felt my responsibility was to convey the lay of the land, as best I understood it -- and, as a computer scientist, I feel it's unwise to consider ourselves "purer" than chemists because we are somehow above financial or corporate pressures. (We're not.)

    @Egon W: I'm quite intrigued by your project suggestions, though I doubt I fully understand them. I will follow up with you directly, if that is ok.

    @Joerg KW: Thanks very much for your comments; it's intriguing to hear your perspective, seeing some of these issues from "inside." If you don't mind, could you (or any of the chemists reading) elaborate on, "the lack of multi-label graph standards is still a major problem in this area, and things are just getting worse when going large-scale, or to 3D modelling problems" ? I am not sure what you are referring to here.

    @Suresh: My (very limited) understanding is that both data and functionality are slowly becoming more accessible. The PubChem Project now has 31 million chemicals in its database, and access is free. On the other hand, manipulation of that data can be expensive. Many of the descriptors mentioned in Joerg Kurt Wegner's comment can only be calculated by expensive proprietary software. In addition to cost, the lack of code transparency means that an error in code could propagate errors into many results in the academic literature without being discovered. That is part of the motivation for the open-source chemoinformatics projects currently underway, like CDK, which Egon Willighagen has been part of.

    @John Sidles: Your constant enthusiasm for interdisciplinary research is inspiring. :-)

    Finally, I will say a word about the review, as it seems a fair number of people outside theoretical computer science might be reading this. I wrote this review for the newsletter of SIGACT (Special Interest Group on Algorithms and Computation Theory), a professional association of theoretical computer scientists. Bill Gasarch, co-owner of this blog, is the SIGACT book review editor; and Lance Fortnow, the other co-owner of this blog, is SIGACT Chair. Previous book reviews can be found here. It will be a while (Bill might be able to provide a timeframe) before this goes to press, so I can correct inaccuracies or address concerns before this becomes unchangeable.

    Thanks again to everyone.

  14. Aaron says: @John Sidles: Your constant enthusiasm for interdisciplinary research is inspiring. :-)

    It's not enthusiasm, Aaron ... it's a concrete roadmap ... a roadmap that was originally laid out by von Neumann, Shannon, and Feynman for sustained exponential expansion in sensing, metrology, and dynamical simulation capabilities.

    Recent advances in CS/QIT are providing new, concrete math-and-physics foundations for sustaining the expansion that von Neumann, Shannon, and Feynman envisioned. Good!

    Several posters have diffidently expressed concerns relating to increasing tensions between openness, curation, and property rights. But there is no need to address these key concerns with diffidence ... my wife highly recommends Philip Pullman's thoughtful and plain-spoken analysis of these issues.

    All in all, as Al Jolson sang in 1919 "You Ain't Heard Nothing Yet!" ... in the specific sense that the capabilities and challenges that Aaron's review addresses, are almost surely destined to continue their "More than Moore" expansion. Good! :)

  15. Hi Aaron, wrt your comment about descriptors - I'd argue that you can actually evaluate many commonly used descriptors with open source software. For example, the CDK implements many descriptors - certainly not all of those noted in the Handbook of Molecular Descriptors - but, as Joerg points out, many descriptors are minor variations of others. A recent paper showed that CDK descriptors give results equivalent to those obtained using a commercial tool (MOE). But it is also true that certain descriptors (logP, for example) depend on having access to large datasets, for which there isn't always a freely available version.

  16. @Aaron: I didn't say that the use of patents in the example was sad because of the financial aspect. Of course, without the financial aspect no one would have the money to do this research. Rather, I felt it was sad because in many cases now patent law harms innovation rather than helping it. I would have been completely happy with an example that involved other important financial considerations, such as return on investment, time to market, first mover advantage, economy of scale, and so on, all of which are important even when separated from patent law.

  17. "the lack of multi-label graph standards is still a major problem in this area, and things are just getting worse when going large-scale, or to 3D modelling problems"

    I will break it down into some examples.

    A molecular graph is ... a connection of some atoms with certain bonds. However, chemists will look at such a pattern and they might say that this ring (e.g. benzene) is an aromatic and conjugated system, while cyclohexane is a non-aromatic, aliphatic system. So, implicitly, many chemists will assign multiple properties to a molecular graph at the same time, e.g. aromaticity, hybridization, electronegativity, and so on.
    Now, algorithms like PATTY [1] and MQL [2] can be used for assigning such implicit properties in an explicit form, allowing us to work with them. This converts an 'unchemical' graph into a multi-labelled molecular graph with 'chemistry' knowledge. One of the remaining problems is that to this day we have not one standard grammar definition, but only mixed assignment cascades in various software packages. Making it worse, sometimes the assignment cascades are cyclic and depend on the execution order, i.e., the assignment process is 'unstable' and can produce varying results.

    Now, let us simply call the whole process a 'chemical expert system' (with all the limitations of an expert system) or a 'cheminformatics kernel'.

    The resulting problem is that any subsequent analysis (e.g. a descriptor calculation, 3D conformer generation, docking, and so on) depends on the initial assignment, which might be unstable. Approximately it might work; still, strictly speaking it remains a problem, and in the end people are comparing dockingX against dockingY while there are many steps in between that are only 'approximately the same'.

    One classical example is a paper showing that 3D conformer generation can be influenced by different SMILES inputs (a line notation for molecules) for the very same molecule [3]. Under the assumption that a SMILES string defines a multi-label molecular graph, why should an internal numbering (the order within the SMILES) change the output 3D conformations? This stochastic element at that stage looks strange to me, but I know that it is 'daily business' we have to account for.

    [1] B. L. Bush and R. P. Sheridan, PATTY: A Programmable Atom Typer and Language for Automatic Classification of Atoms in Molecular Databases, J. Chem. Inf. Comput. Sci., 33, 756-762, 1993.

    [2] E. Proschak, J. K. Wegner, A. Schüller, G. Schneider, U. Fechner, Molecular Query Language (MQL)-A Context-Free Grammar for Substructure Matching, J. Chem. Inf. Model., 2007, 47, 295-301. doi:10.1021/ci600305h

    [3] G. Carta, V. Onnis, A. J. S. Knox, D. Fayne and D. G. Lloyd, Permuting Input for More Effective Sampling of 3D Conformer Space, Journal of Computer-Aided Molecular Design, 20(3), 179-190, 2006. doi:10.1007/s10822-006-9044-4
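The cascade-instability point described above can be boiled down to a toy Python sketch (the rule names and labels are entirely hypothetical, and have nothing to do with the internals of PATTY or MQL): two typing rules both match the same atom, so the final label depends on the order in which the cascade applies them.

```python
def rule_aromatic(labels, atom):
    # Toy rule: a still-untyped ring carbon becomes aromatic.
    return "C.aromatic" if labels[atom] == "C.in_ring" else labels[atom]

def rule_sp3(labels, atom):
    # Toy rule: a still-untyped ring carbon becomes sp3 / aliphatic.
    return "C.sp3" if labels[atom] == "C.in_ring" else labels[atom]

def run_cascade(rules, labels):
    # Apply each rule in turn to every atom; later rules see the
    # labels produced by earlier ones -- hence the order dependence.
    for rule in rules:
        labels = {a: rule(labels, a) for a in labels}
    return labels

atoms = {0: "C.in_ring"}
print(run_cascade([rule_aromatic, rule_sp3], dict(atoms)))  # → {0: 'C.aromatic'}
print(run_cascade([rule_sp3, rule_aromatic], dict(atoms)))  # → {0: 'C.sp3'}
```

Whichever rule fires first "wins," which is exactly why the same molecule can end up with different atom types in different software packages.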

  18. Thanks for the thoughtful post. We plan to open up more of this public data in the very near future for folks to analyze further (the idea being that community SAR is a nice foundation for future community QSAR):

  19. To provide a mathematical context for appreciating Barry Bunin's post (above), and in particular, for appreciating the objectives of his company Collaborative Drug Discovery Inc., a recommended recent article is "The mycobacterium tuberculosis drugome and its polypharmacological implications" (2010, free on-line at PLOS Computational Biology).

    There are several different ways that this PLOS article can "sit in our minds" (to borrow Bill Thurston's wonderful phrase) ... and there is no need to restrict ourselves to just one way.

    The Bill & Melinda Gates Foundation regards these new methods as essential to finding cures for diseases like tuberculosis and malaria. From an engineering point-of-view, the PLOS article shows how new enterprises in systems and synthetic biology are embracing the systems engineering methods that NASA applies in space missions ... except that the "autonomous vehicles" are nanoscale molecules that are designed to navigate cellular environments and seek specific molecular targets.

    So it is natural to ask, in what various ways might these ideas "rest in our brains" mathematically?

    This week on Gödel's Lost Letter and P=NP, Dick Lipton and Ken Regan give high praise to Mircea Pitici's new collection Best Writing on Mathematics: 2010, and I would like to draw attention especially to the Foreword by Bill Thurston, which begins:

    "Mathematics is commonly thought to be the pursuit of universal truths, of patterns that are not anchored to any single fixed concept. But on a deeper level the goal of mathematics is to develop enhanced ways for humans to see and think about the world. Mathematics is a transforming journey, and progress in it can better be measured by changes in how we think than by the external truths we discover."

    Thurston goes on to suggest that as we read articles, we ask ourselves "What's the author trying to say? What is the author really thinking?"

    Let's apply Thurston's reading methods to the PLOS mycobacterium article. For me as a medical researcher, whose primary interest is in regenerative medicine, the best answers to Thurston's question are intimately bound-up with mathematical advances in complexity theory and quantum metrology. Because what the PLOS article is talking about, and what we are really thinking about in Thurston's sense, is a roadmap that was laid down decades ago by von Neumann and Feynman, to "see individual atoms distinctly" (Feynman 1959), and thereby to find out "where every individual nut and bolt is located ... by developments of which we can already foresee the character, the caliber, and the duration" (von Neumann 1946).

    Progress along this decades-old roadmap—it is a centuries-old roadmap, really—has in recent decades begun accelerating at a more-than-Moore rate, in part because of advances in sensing and metrology, and in part because of advances in simulation algorithms, but most of all, because of advances in our Thurston-style mathematical understanding of biological dynamics and complexity (both classical and quantum).

    Provided that this progress continues, in sensing and metrology, and in simulation capability, and most essentially of all, in our Thurston-style mathematical understanding of how it all works, then it seems to me that enterprises like Collaborative Drug Discovery Inc. have unbounded scope for growth, and even more significantly (for medical researchers) there is unbounded scope for progress in 21st century medicine. Good! :)

  20. "To provide an intuition for the type of problems considered, suppose you want to find a molecule that can do a particular thing. We assume that if the new molecule is structurally similar to other molecules that can do the thing, then it will have the same property. (This is called a "structure-activity relationship," or SAR.) However, we also need the molecule to be sufficiently different from known molecules so that it is possible to create a new patent estate for our discovery."

    I refuse to believe that this is a valid form of research. Yes, it has been mentioned before. The very idea is still outrageous.

  21. "Those who love justice and good sausage should never watch either one being made." -- attributed to Otto von Bismarck

  22. @Rajarshi: Thanks for the correction and explanation.

    (Rajarshi is Rajarshi Guha, author of Chapter 12 of HCA, the open source software chapter.)

    @Joerg KW: What an amazing comment! Thank you. I will read those references very soon.

    @Barry B: Thank you too. Your news was exciting.

  23. (I emailed Steve Salzberg, biocomp prof at UMCP, a pointer to Aaron's post. He emailed me this response. I asked him if I could post it as a comment and he agreed.)

    Interesting guest post. Chemoinformatics is definitely an important area. In my opinion, it is not a "hot" field, though, in part for some of the reasons mentioned in the post - particularly the fact that the data in the field is mostly proprietary and/or secret. So they hurt themselves by that behavior. But the other reason I don't think it is moving that fast is that, unlike bioinformatics, chemoinformatics is not being spurred by dramatic new technological advances. In bioinformatics, the amazing progress in automated DNA sequencing has driven the science forward at a tremendous pace.

    I'm at a conference this week (by coincidence) with about 1000 people, all discussing the latest advances in sequencing technology. There are many academics here, and also vendors from all the major sequencing companies. DNA sequencing also has multiple very, very high profile successes to point to, such as the Human Genome Project and others. Chemoinformatics, in contrast, does not - at least I'm not aware of any.

    So it's important, yes, but it's harder to argue that it is a rapidly advancing field. Maybe if they shared all their data that would change.
  24. When comparing chemistry and biology, I must agree that data production and throughput are lower in chemistry. Still, data growth is exponential and we are simply drowning in structural and activity data, not only at the small-molecule level but also at the structural biology level (X-ray, NMR, protein-ligand complexes). See also this data explosion collection. Besides, I would encourage more cross-disciplinary work, which in itself can create "hot"ness, no matter whether other disciplines produce more data. If that were all that mattered, we should all work for Google analyzing YouTube videos.

  25. Joerg Kurt Wegner is correct that there is a gaping capability mismatch between (fast and accelerating) sequence throughput and (relatively slow) structure throughput. An even more serious mismatch is that sequence coverage is strikingly comprehensive, while structure coverage is exceedingly sparse.

    Chromatin structure provides a good example. What Francis Crick called in the 1950s "the central dogma of molecular biology: the one-way flow of information from genome to cell" is now understood to be grossly wrong.

    Broadly speaking, the heritable trait of being a neuron (brain cell) rather than a hepatocyte (liver cell) is associated not to DNA, but to the conformational winding of DNA around histones. Thus, for purposes of regenerative medicine (my own main interest), sequences alone are very far from being all we need to know; conformational information is equally vital.

    We have wonderfully comprehensive instruments for showing us the pair-by-pair sequence of the DNA strands, but (at present) no similarly comprehensive instruments for showing us the histone-by-histone structural winding of DNA in the cell nucleus.

    Still, structure determination capabilities are advancing at an incredibly rapid pace, and are largely paced by advances in CCT/QIT/QSE.

    There is every reason to anticipate that eventually (even reasonably soon) our structure-determining capabilities will begin to match our sequence-determining capabilities in comprehensive scope, speed, and cost. These fundamental capabilities will be much-discussed at the ENC Conference in Asilomar this coming April. It will be exciting!

    It is striking too that the 11-nanometer size of histone complexes is comparable to the resist half-pitch dimensions of coming generations of VLSI technologies ... according to the ITRS Roadmaps, anyway. Thus problems of structure-determination in biology, and in nanoelectronics, are foreseeably going to be solved together (or not at all).

    Just as Hilbert's motto for the 20th century was "We must know, we will know", so for the 21st century, in fields as various as biology, astronomy, and chemistry, the motto is "We must see, we will see." This age-old dream was shared by von Neumann and Feynman, and now in our century it is coming true. Good!

  26. How does Cheminformatics intersect with QSAR and Systems Biology?

    Has there been much progress with bioinfo in the last several years? After the hype of the HGP, the proteome is a long way from being mapped. I was under the impression that there are approximately 500,000 proteins in the human body, most of which are hidden by high-abundance proteins such as albumin. Most discovered proteins have not had their 3D structure determined (X-ray crystallography and NMR are expensive), and in silico structure prediction has hit a wall.

  27. Dear Steve Salzberg (via GASARCH), I do think it is rapidly moving. The Blue Obelisk movement has repeated in some 15 years of open source cheminformatics what the whole community did in the 30 years before that, and more. Indeed, one problem when I started in this field 15 years ago was that cheminformatics was not considered academic, and it was long pushed into commercial entities built on source code as IP, resulting in a slowdown. But with the open source cheminformatics movement, things have picked up speed again, and very fast too.

    That bioinformatics is moving faster is not intrinsic to these problems; that just reflects the amount of funding, IMHO. In fact, most cheminformaticians I know actually work as bioinformaticians. Moreover, do not underestimate the contributions that bioinformatics fields like metabolomics, flux analysis, assay data, and chemogenomics have made to cheminformatics. That said, 98% of the current cheminformatics literature is about applications rather than methodological work.

    The adoption of XML (CML) and RDF as semantic representations of chemical information is a nice example where the open source cheminformatics community is ahead of its field, and it highlights many of the simplifications that proprietary solutions made in the past.
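    To make the CML point above concrete, here is a minimal sketch of serializing a molecule to CML-style XML using only the Python standard library. The function name and the two-atom methanol example are illustrative assumptions, not taken from any particular toolkit; the element names (molecule, atomArray, bondArray) and the schema namespace follow the published CML conventions.

```python
# Minimal sketch: emit a CML-style XML fragment for a molecular graph,
# using only the standard library. Illustrative, not a full CML writer.
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

def molecule_to_cml(atoms, bonds, mol_id):
    """atoms: list of element symbols; bonds: list of (i, j, order),
    with 1-based atom indices. Returns a CML string."""
    mol = ET.Element("molecule", {"xmlns": CML_NS, "id": mol_id})
    atom_array = ET.SubElement(mol, "atomArray")
    for i, symbol in enumerate(atoms, start=1):
        ET.SubElement(atom_array, "atom",
                      {"id": f"a{i}", "elementType": symbol})
    bond_array = ET.SubElement(mol, "bondArray")
    for i, j, order in bonds:
        ET.SubElement(bond_array, "bond",
                      {"atomRefs2": f"a{i} a{j}", "order": str(order)})
    return ET.tostring(mol, encoding="unicode")

# Methanol's heavy atoms: one carbon bonded to one oxygen
# (hydrogens left implicit for brevity).
print(molecule_to_cml(["C", "O"], [(1, 2, 1)], "methanol"))
```

    The appeal of such representations is exactly what the comment notes: atoms, bonds, and identifiers become explicit, named data rather than positional conventions baked into a proprietary binary format.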

  28. Hi Anon26,

    Several of the commenters on this thread are more qualified to respond to you than I am, but the post is a week old now, and I don't know who will see it. I'll say one thing.

    It seems to me that the opening up of dramatically understudied chemical databases should provide areas for new research, even if a wall has been hit elsewhere (and I don't know that it has). Here's a link to a 2010 Wall Street Journal article you might find interesting.

  29. Here we go again: do you think cheminformatics graph canonicalization is a solved problem? Think again! And any subsequent (large-scale) data mining efforts are impacted.

    BTW, how many large-scale molecule-mining tools do you know of that are part of active scientific research?
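    For readers unfamiliar with the canonicalization problem raised above, here is a minimal sketch of Morgan-style iterative refinement, the classical starting point for canonical atom numbering. The function name and the path-graph example are illustrative assumptions, not code from any cheminformatics toolkit.

```python
# Minimal sketch of Morgan-style invariant refinement: repeatedly
# re-label each atom by its own class plus its neighbors' classes
# until the partition of atoms into classes stabilizes.

def morgan_invariants(adjacency):
    """adjacency: list of neighbor-index lists, one per atom.
    Returns a tuple of invariant-class labels, one per atom."""
    n = len(adjacency)
    # Seed with atom degree (connectivity), as in Morgan's 1965 scheme.
    inv = [len(nbrs) for nbrs in adjacency]
    for _ in range(n):
        # New raw label: own class plus sorted neighbor classes.
        raw = [(inv[a], tuple(sorted(inv[b] for b in adjacency[a])))
               for a in range(n)]
        # Compress raw labels to consecutive small integers.
        compressed = {lab: i for i, lab in enumerate(sorted(set(raw)))}
        new_inv = [compressed[raw[a]] for a in range(n)]
        if new_inv == inv:  # refinement has stabilized
            break
        inv = new_inv
    return tuple(inv)

# A propane-like path graph C-C-C: the two terminal atoms are
# symmetric, so they must end up in the same invariant class.
print(morgan_invariants([[1], [0, 2], [1]]))  # prints (0, 1, 0)
```

    Refinement like this is fast, but it is only a heuristic first stage: on highly symmetric (e.g. regular) graphs it fails to separate atoms that a true canonical labeling must distinguish, which is precisely why canonicalization, and everything downstream of it in large-scale molecule mining, remains subtle.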