Comments on Computational Complexity: Is Cheminformatics the new Bioinformatics? (Guest Post by Aaron Sterling)

Here we go again, do you think cheminformatics gra...

2011-02-13T11:39:25.179-06:00

Here we go again: do you think cheminformatics graph canonicalization is a solved problem? Think again! And any subsequent (large-scale) data mining efforts are affected.

BTW, how many large-scale molecule mining tools do you know of that are part of active scientific research?

Hi Anon26, Several of the commenters on this thre...

2011-02-10T13:43:00.762-06:00

Hi Anon26,

Several of the commenters on this thread are more qualified to respond to you than I am, but the post is a week old now, and I don't know who will see it. I'll say one thing.

It seems to me that the opening up of dramatically understudied chemical databases should provide areas for new research, even if a wall has been hit elsewhere (and I don't know that it has). Here's a link to a 2010 Wall Street Journal article you might find interesting.

http://online.wsj.com/article/SB10001424052748703341904575266583403844888.html?mod=WSJ_Tech_RIGHTTopCarousel

Dear Steve Salzberg (via GASARCH), I do think it i...

2011-02-10T00:39:03.662-06:00

Dear Steve Salzberg (via GASARCH), I do think it is rapidly moving. In some 15 years, the Blue Obelisk movement has repeated in open source cheminformatics what the whole community did in the 30 years before that, and more. Indeed, one problem when I started in this field 15 years ago was that cheminformatics was not considered academic, and was for a long time pushed into commercial entities that treated source code as IP, resulting in a slowdown. But with the open source cheminformatics movement things have picked up speed again, and very fast too.

That bioinformatics is going faster is not intrinsic to these problems. That just reflects the amount of funding, IMHO. In fact, most cheminformaticians I know actually work as bioinformaticians. Moreover, do not underestimate the contributions that bioinformatics fields like metabolomics, flux analysis, assay data, chemogenomics, etc. make to cheminformatics. That said, 98% of the current cheminformatics literature is about applications, rather than methodological work.

The adoption of XML (CML) and RDF as semantic representations of chemical information is a nice example of where the open source cheminformatics community is ahead of its field, and it highlights many of the simplifications proprietary solutions made in the past.

How does Cheminformatics intersect with QSAR and S...

2011-02-09T06:36:46.055-06:00

How does Cheminformatics intersect with QSAR and Systems Biology?

Has there been much progress with Bioinfo in the last several years? After the hype of the HGP, the proteome is a long way from being mapped. I was under the impression that there were approx. 500,000 proteins in the human body, most of which are hidden by high-abundance proteins such as albumin. Most discovered proteins have not had their 3D structure determined (X-ray crystallography / NMR are expensive), and in silico structure prediction has hit a wall.

Joerg Kurt Wegner is correct that there is a gapin...

2011-02-03T06:16:21.737-06:00

Joerg Kurt Wegner is correct that there is a gaping capability mismatch between (fast and accelerating) sequence throughput and (relatively slow) structure throughput. An even more serious mismatch is that sequence coverage is strikingly comprehensive, while structure coverage is exceedingly sparse.

Chromatin structure provides a good example. What Francis Crick called in the 1950s "the central dogma of molecular biology: the one-way flow of information from genome to cell" is now understood to be grossly wrong.

Broadly speaking, the heritable trait of being a neuron (brain cell) rather than a hepatocyte (liver cell) is associated not with DNA, but with the conformational winding of DNA around histones. Thus, for purposes of regenerative medicine (my own main interest), sequences alone are very far from being all we need to know; conformational information is equally vital.

We have wonderfully comprehensive instruments for showing us the pair-by-pair sequence of the DNA strands, but (at present) no similarly comprehensive instruments for showing us the histone-by-histone structural winding of DNA in the cell nucleus.

Still, structure determination capabilities are advancing at an incredibly rapid pace, and are largely paced by advances in CCT/QIT/QSE.

There is every reason to anticipate that eventually (even reasonably soon) our structure-determining capabilities will begin to match our sequence-determining capabilities in comprehensive scope, speed, and cost. These fundamental capabilities will be much-discussed at the ENC Conference in Asilomar this coming April. It will be exciting!

It is striking too that the 11-nanometer size of histone complexes is comparable to the resist half-pitch dimensions of coming generations of VLSI technologies ... according to the ITRS Roadmaps, anyway. Thus problems of structure-determination in biology, and in nanoelectronics, are foreseeably going to be solved together (or not at all).

Just as Hilbert's motto for the 20th century was "We must know, we will know", so for the 21st century, in fields as various as biology, astronomy, and chemistry, the motto is "We must see, we will see." This age-old dream was shared by von Neumann and Feynman, and now in our century it is coming true. Good!

When comparing chemistry and biology I must agree ...

2011-02-03T00:42:34.191-06:00

When comparing chemistry and biology I must agree that data production and throughput are lower in chemistry. Still, data growth is exponential and we are simply drowning in structural and activity data, not only on a small-molecule level, but also on a structural biology level (X-ray, NMR, protein-ligand complexes). See also this data explosion collection. Besides, I would encourage more cross-disciplinary work, which in itself can create "hot"ness, no matter if other disciplines produce more data. If data volume were all that mattered, we should all work for Google analyzing YouTube videos.

(I emailed Steve Salzberg, biocomp prof at UMCP, ...

2011-02-02T17:46:21.847-06:00

(I emailed Steve Salzberg, biocomp prof at UMCP, a pointer to Aaron's post. He emailed me this
response. I asked him if I could post it as a comment and he agreed.)

Interesting guest post.
Chemoinformatics is definitely an important area.
In my opinion, it is not a "hot" field, though, in part
for some of the reasons mentioned in the post - particularly the fact that the data in
the field is mostly proprietary and/or secret. So they hurt themselves by that behavior.
But the other reason I don't think it is moving that fast is that,
unlike bioinformatics, chemoinformatics is not being spurred by dramatic new technological
advances. In bioinformatics, the amazing progress in automated DNA sequencing has
driven the science forward at a tremendous pace.

I'm at a conference this week (by coincidence)
with about 1000 people, all discussing the latest advances in sequencing technology.
There are many academics here, and also vendors from all the major sequencing
companies. DNA sequencing also has multiple very, very high profile successes to
point to, such as the Human Genome Project and others. Chemoinformatics, in contrast,
does not - at least I'm not aware of any.

So it's important, yes, but it's harder to argue that it is a
rapidly advancing field. Maybe
if they shared all their data that would change.

@Rajarshi: Thanks for the correction and explanati...

2011-02-02T09:20:19.610-06:00

@Rajarshi: Thanks for the correction and explanation.

(Rajarshi is Rajarshi Guha, author of Chapter 12 of HCA, the open source software chapter.)

@Joerg KW: What an amazing comment! Thank you. I will read those references very soon.

@Barry B: Thank you too. Your news was exciting.

"Those who love justice and good sausage shou...

2011-02-02T08:50:45.842-06:00

"Those who love justice and good sausage should never watch either one being made." -- attributed to Otto von Bismarck

"To provide an intuition for the type of prob...

2011-02-02T08:20:50.623-06:00

"To provide an intuition for the type of problems considered, suppose you want to find a molecule that can do a particular thing. We assume that if the new molecule is structurally similar to other molecules that can do the thing, then it will have the same property. (This is called a "structure-activity relationship," or SAR.) However, we also need the molecule to be sufficiently different from known molecules so that it is possible to create a new patent estate for our discovery."

I refuse to believe that this is a valid form of research. Yes, it has been mentioned before. The very idea is still outrageous.

To provide a mathematical context for appreciating...

2011-02-02T08:16:07.532-06:00

To provide a mathematical context for appreciating Barry Bunin's post (above), and in particular, for appreciating the objectives of his company Collaborative Drug Discovery Inc., a recommended recent article is "The mycobacterium tuberculosis drugome and its polypharmacological implications" (2010, free on-line at PLOS Computational Biology).

There are several different ways that this PLOS article can "sit in our minds" (to borrow Bill Thurston's wonderful phrase) ... and there is no need to restrict ourselves to just one way.

The Bill & Melinda Gates Foundation regards these new methods as essential to finding cures for diseases like tuberculosis and malaria. From an engineering point-of-view, the PLOS article shows how new enterprises in systems and synthetic biology are embracing the systems engineering methods that NASA applies in space missions ... except that the "autonomous vehicles" are nanoscale molecules that are designed to navigate cellular environments and seek specific molecular targets.

So it is natural to ask, in what various ways might these ideas "rest in our brains" mathematically?

This week on Gödel's Lost Letter and P=NP, Dick Lipton and Ken Regan give high praise to Mircea Pitici's new collection Best Writing on Mathematics: 2010, and I would like to draw attention especially to the Foreword by Bill Thurston, which begins:

---------------------------
"Mathematics is commonly thought to be the pursuit of universal truths, of patterns that are not anchored to any single fixed concept. But on a deeper level the goal of mathematics is to develop enhanced ways for humans to see and think about the world. Mathematics is a transforming journey, and progress in it can better be measured by changes in how we think than by the external truths we discover."
---------------------------

Thurston goes on to suggest that as we read articles, we ask ourselves "What's the author trying to say? What is the author really thinking?"

Let's apply Thurston's reading methods to the PLOS mycobacterium article. For me as a medical researcher, whose primary interest is in regenerative medicine, the best answers to Thurston's question are intimately bound-up with mathematical advances in complexity theory and quantum metrology. Because what the PLOS article is talking about, and what we are really thinking about in Thurston's sense, is a roadmap that was laid down decades ago by von Neumann and Feynman, to "see individual atoms distinctly" (Feynman 1959), and thereby to find out "where every individual nut and bolt is located ... by developments of which we can already foresee the character, the caliber, and the duration" (von Neumann 1946).

Progress along this decades-old roadmap (it is a centuries-old roadmap, really) has in recent decades accelerated at a more-than-Moore rate, in part because of advances in sensing and metrology, and in part because of advances in simulation algorithms, but most of all, because of advances in our Thurston-style mathematical understanding of biological dynamics and complexity (both classical and quantum).

Provided that this progress continues, in sensing and metrology, and in simulation capability, and most essentially of all, in our Thurston-style mathematical understanding of how it all works, then it seems to me that enterprises like Collaborative Drug Discovery Inc. have unbounded scope for growth, and even more significantly (for medical researchers) there is unbounded scope for progress in 21st century medicine. Good! :)

Thanks for the thoughtful post. We plan to open u...

2011-02-02T03:05:44.221-06:00

Thanks for the thoughtful post. We plan to open up more of this public data in the very near future for folks to analyze further (the idea being that community SAR is a nice foundation for future community QSAR): http://www.collaborativedrug.com/pages/public_access

"the lack of multi-label graph standards is s...

2011-02-01T13:58:08.396-06:00

"the lack of multi-label graph standards is still a major problem in this area, and things are just getting worse when going large-scale, or to 3D modelling problems"

I will break it down into some examples.

A molecular graph is ... a connection of some atoms with certain bonds. Chemists, though, will look at such a pattern and might say that this ring (e.g. benzene) is an aromatic and conjugated system, while cyclohexyl is a non-aromatic and aliphatic system. So, implicitly, many chemists will assign multiple properties to a molecular graph at the same time, e.g. aromaticity, hybridization, electronegativity, and so on.
Now, algorithms like PATTY [1] and MQL [2] can be used for assigning such implicit properties in an explicit form, allowing us to work with them. This converts an unchemical graph into a multi-labelled molecular graph with 'chemistry' knowledge. One of the remaining problems is that to this day we have not one standard grammar definition, but only mixed assignment cascades in various software packages. For details of this dilemma see OpenSmiles.org. To make it worse, sometimes the assignment cascades are cyclic and depend on the execution order; that is, the assignment process is 'unstable' and can produce varying results.
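To make the order dependence concrete, here is a minimal toy sketch in Python. The two rules, the atom fields, and the function names are all hypothetical and invented for illustration; they are not taken from PATTY, MQL, or any real toolkit. The point is only that a single pass over the same rules in a different order can leave the same atom with different labels.

```python
# Toy atom typing: every field and rule here is an illustrative assumption,
# not the data model or rule set of any real cheminformatics package.

def make_atom():
    # A ring atom with a lone pair, initially typed sp3 and non-aromatic.
    return {"in_ring": True, "lone_pair": True,
            "hybridization": "sp3", "aromatic": False}

def rule_conjugation(atom):
    # Re-type a ring atom with a lone pair as sp2 (planar, conjugated).
    if atom["in_ring"] and atom["lone_pair"]:
        atom["hybridization"] = "sp2"

def rule_aromaticity(atom):
    # Flag only sp2 ring atoms as aromatic.
    if atom["in_ring"] and atom["hybridization"] == "sp2":
        atom["aromatic"] = True

def run_cascade(rules):
    # Apply each rule exactly once, in the given order.
    atom = make_atom()
    for rule in rules:
        rule(atom)
    return atom

def run_to_fixed_point(rules, max_rounds=10):
    # Repeat the whole cascade until no label changes any more.
    atom = make_atom()
    for _ in range(max_rounds):
        before = dict(atom)
        for rule in rules:
            rule(atom)
        if atom == before:
            return atom
    raise RuntimeError("cascade did not converge")

one_order = run_cascade([rule_conjugation, rule_aromaticity])
other_order = run_cascade([rule_aromaticity, rule_conjugation])
print(one_order["aromatic"], other_order["aromatic"])  # True False
```

In this toy, iterating to a fixed point makes both orders agree, because each rule only ever adds information. That does not rescue the cyclic cascades described above, where rules can undo each other and the iteration need not converge.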

Now, let us simply call the whole process a 'chemical expert system' (with all limitations of an expert system) or a 'cheminformatics kernel'.

The resulting problem is that any subsequent analysis, e.g. a descriptor calculation, 3D conformer generation, docking, and so on ... depends on the initial assignment, which might be unstable. It might work approximately, but strictly speaking it remains a problem, and in the end people are comparing dockingX against dockingY, while there might be many steps in between that are only 'approximately the same'.

One classical example is a paper showing that 3D conformer generation can be influenced by different SMILES inputs (SMILES is a line notation for molecules) for the very same molecules [3]. Under the assumption that a SMILES defines a multi-label molecular graph, why should an internal numbering (the order of the SMILES) then change the output of 3D conformations? So this stochastic element at that stage looks strange to me, but I know that this is 'daily business' we have to account for.
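As a sketch of what "the numbering should not matter" means, the following Python toy computes a canonical key for a small labelled graph by brute force: it tries every renumbering of the atoms and keeps the smallest encoding. The function name and the tuple encoding are my own assumptions for illustration. The cost is factorial in the number of atoms, so this only works for tiny graphs; real toolkits use refinement schemes such as Morgan-style algorithms instead.

```python
from itertools import permutations

def canonical_key(labels, bonds):
    """Smallest (labels, bonds) encoding over all atom renumberings.

    labels: list of atom labels, e.g. element symbols
    bonds:  list of (atom_a, atom_b, bond_order) tuples
    Brute force over n! renumberings: a toy for very small graphs only.
    """
    n = len(labels)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] is the new index of that atom.
        new_labels = [None] * n
        for old, new in enumerate(perm):
            new_labels[new] = labels[old]
        new_bonds = tuple(sorted(
            (min(perm[a], perm[b]), max(perm[a], perm[b]), order)
            for a, b, order in bonds))
        key = (tuple(new_labels), new_bonds)
        if best is None or key < best:
            best = key
    return best

# Heavy atoms of ethanol, numbered as in "CCO" and as in "OCC".
ethanol_cco = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_occ = (["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
# Dimethyl ether ("COC"): same atoms, different connectivity.
ether_coc = (["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

print(canonical_key(*ethanol_cco) == canonical_key(*ethanol_occ))  # True
print(canonical_key(*ethanol_cco) == canonical_key(*ether_coc))    # False
```

If every downstream step (descriptor calculation, conformer generation, docking) started from a numbering derived this way, the order in which the SMILES was written could not influence the result. The catch raised elsewhere in this thread is that the labels themselves must first be assigned consistently.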

[1] B. L. Bush and R. P. Sheridan, PATTY: A Programmable Atom Typer and Language for Automatic Classification of Atoms in Molecular Databases, J. Chem. Inf. Comput. Sci., 33, 756-762, 1993.

[2] E. Proschak, J. K. Wegner, A. Schüller, G. Schneider, U. Fechner, Molecular Query Language (MQL)-A Context-Free Grammar for Substructure Matching, J. Chem. Inf. Model., 2007, 47, 295-301. doi:10.1021/ci600305h

[3] G. Carta, V. Onnis, A. J. S. Knox, D. Fayne, D. G. Lloyd, Permuting input for more effective sampling of 3D conformer space, J. Comput.-Aided Mol. Des., 20, 3, 179-190. doi:10.1007/s10822-006-9044-4

@Aaron: I didn't say that the use of patents i...

2011-02-01T11:29:28.820-06:00

@Aaron: I didn't say that the use of patents in the example was sad because of the financial aspect. Of course, without the financial aspect no one would have the money to do this research. Rather, I felt it was sad because in many cases now patent law harms innovation rather than helping it. I would have been completely happy with an example that involved other important financial considerations, such as return on investment, time to market, first mover advantage, economy of scale, and so on, all of which are important even when separated from patent law.

Hi Aaron, wrt your comment about descriptors - I'...

2011-02-01T11:05:26.344-06:00

Hi Aaron, wrt your comment about descriptors - I'd argue that you can actually evaluate many commonly used descriptors with open source software. For example, the CDK implements many descriptors - certainly not all of those noted in the Handbook of Molecular Descriptors - but as Joerg points out, many descriptors are minor variations of others. A recent paper (dx.doi.org/10.1124/dmd.110.034918) showed that CDK descriptors give equivalent results to those obtained using a commercial tool (MOE). But it is also true that certain descriptors (logP, for example) depend on having access to large datasets, for which there isn't always a freely available version.

Aaron says: @John Sidles: Your constant enthusiasm...

2011-02-01T10:07:33.179-06:00

Aaron says: @John Sidles: Your constant enthusiasm for interdisciplinary research is inspiring. :-)

It's not enthusiasm, Aaron ... it's a concrete roadmap ... a roadmap that was originally laid out by von Neumann, Shannon, and Feynman for sustained exponential expansion in sensing, metrology, and dynamical simulation capabilities.

Recent advances in CS/QIT are providing new, concrete math-and-physics foundations for sustaining the expansion that von Neumann, Shannon, and Feynman envisioned. Good!

Several posters have diffidently expressed concerns relating to increasing tensions between openness, curation, and property rights. But there is no need to address these key concerns with diffidence ... my wife highly recommends Philip Pullman's thoughtful and plain-spoken analysis of these issues.

All in all, as Al Jolson sang in 1919 "You Ain't Heard Nothing Yet!" ... in the specific sense that the capabilities and challenges that Aaron's review addresses, are almost surely destined to continue their "More than Moore" expansion. Good! :)

Thanks to everyone for their interest, and for the...

2011-02-01T08:51:56.368-06:00

Thanks to everyone for their interest, and for the exciting discussion -- and my apologies for misspelling names, now fixed. To respond to a few points:

@matt: I explicitly chose a problem that included the building of a patent estate as a parameter because I wanted to construct a snapshot intuition of the type of problems encountered in HCA. Sad or not, economic profit is a central player in multiple chapters, both in what types of problems are "interesting," and in what kinds of tools are readily available. I believe the same could be said for most problems in computer science, though perhaps the profit influence is more veiled in TCS. As a reviewer, I felt my responsibility was to convey the lay of the land, as best I understood it -- and, as a computer scientist, I feel it's unwise to consider ourselves "purer" than chemists because we are somehow above financial or corporate pressures. (We're not.)

@Egon W: I'm quite intrigued by your project suggestions, though I doubt I fully understand them. I will follow up with you directly, if that is ok.

@Joerg KW: Thanks very much for your comments; it's intriguing to hear your perspective, seeing some of these issues from "inside." If you don't mind, could you (or any of the chemists reading) elaborate on, "the lack of multi-label graph standards is still a major problem in this area, and things are just getting worse when going large-scale, or to 3D modelling problems" ? I am not sure what you are referring to here.

@Suresh: My (very limited) understanding is that both data and functionality are slowly becoming more accessible. The PubChem Project now has 31 million chemicals in its database, and access is free. On the other hand, manipulation of that data can be expensive. Many of the descriptors mentioned in Joerg Kurt Wegner's comment can only be calculated by expensive proprietary software. In addition to cost, the lack of code transparency means that an error in code could propagate errors into many results in the academic literature without being discovered. That is part of the motivation for the open-source chemoinformatics projects currently underway, like CDK, which Egon Willighagen has been part of.

@John Sidles: Your constant enthusiasm for interdisciplinary research is inspiring. :-)

Finally, I will say a word about the review, as it seems a fair number of people outside theoretical computer science might be reading this. I wrote this review for the newsletter of SIGACT (Special Interest Group on Algorithms and Computation Theory), a professional association of theoretical computer scientists. Bill Gasarch, co-owner of this blog, is the SIGACT book review editor; and Lance Fortnow, the other co-owner of this blog, is SIGACT Chair. Previous book reviews can be found here. It will be a while (Bill might be able to provide a timeframe) before this goes to press, so I can correct inaccuracies or address concerns before this becomes unchangeable.

Thanks again to everyone.

Aaron Sterling concludes: I believe chemoinformati...

2011-02-01T07:37:30.340-06:00

Aaron Sterling concludes: I believe chemoinformatics, like bioinformatics, will provide an important source of problems for computer scientists.

The following conclusion is logically equivalent, yet psychologically opposite: "I believe computer science, like quantum information theory, is providing important new computational resources to synthetic chemists and biologists."

This illustrates how "mathematics can sit in our brains in different ways" (in Bill Thurston's phrase) ... these choices obviously relate to Reinhard Selten's thought-provoking aphorism "game theory is for proving theorems, not for playing games."

These cognitive choices are not binary. Instead, mixed cognitive strategies like "Theorems help computer scientists to conceive new computational resources" are globally more nearly optimal.

That is why it is very desirable—essential, really—that everyone not think alike, regarding these key mathematical issues.

For this reason, the emergence of diverse, open mathematical forums, like Computational Complexity, Gödel's Lost Letter, Shtetl Optimized, Combinatorics and More, and Math Overflow (and many more), is contributing greatly to accelerating progress across a broad span of STEM enterprises.

The resulting cross-disciplinary fertilization can be discomfiting, ridiculous, and even painful ... but also irresistibly thought-provoking, playful, and fun. Good! :)

Suresh asks: a lot of the more interesting data se...

2011-01-31T17:15:20.963-06:00

Suresh asks: a lot of the more interesting data sets (molecule collections etc) are proprietary and under lock and key in drug companies. Or am I wrong here?

Nowadays, especially for younger scientists, the scientific ideal of "data" is flexing and bending like the Tacoma Narrows Bridge ... e.g., from NMR spectra ("raw data") is deduced a set of distance constraints ("reduction #1") from which is deduced a set of candidate ground-state structures ("reduction #2") from which is deduced a set of binding energies ("reduction #3") from which is deduced a set of enhanced-binding mutations ("reduction #4") ... which are synthesized and tested for binding affinity ... at which point the synthetic cycle begins anew.

After a while, the notion of a linear hierarchy of quality becomes indistinct ... rather, quality in synthetic biology is regarded in much the same way that Terry Tao regards quality in mathematics:

--------------------
"the concept of mathematical quality [read 'quality in synthetic biology'] is a high-dimensional one, and lacks an obvious canonical total ordering. ... There does however seem to be some undefinable sense that a certain piece of mathematics [read 'synthetic biology'] is 'on to something,' that it is a piece of a larger puzzle waiting to be explored further."
--------------------

One bigger problem with getting involved with this...

2011-01-31T16:19:57.403-06:00

One bigger problem with getting involved with this work is access to data. From my understanding, a lot of the more interesting data sets (molecule collections etc) are proprietary and under lock and key in drug companies. Or am I wrong here?

I would like to throw another (depressing) book in...

2011-01-31T15:46:42.314-06:00

I would like to throw another (depressing) book into the discussion: the Handbook of Molecular Descriptors. There you will find yet another 1001 descriptors, leaving at least me with the feeling that there are too many names for the very same algorithm, obtained by just changing a tunable parameter (often the labeling function). So I do not understand why people create unrelated and unoptimizable combinatorial problems when they could just turn things into smooth and optimizable problems. Then it would be so much clearer for everyone that the optimal parameter set has to differ with the underlying data, while the computing procedure remains the same. Besides, in the long term we would learn so much more. Though I do not believe that many computer scientists even remotely understand the challenges we observe in the life science arena; at the end of the day, silicon and carbon are different.
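A minimal Python sketch of that point, under my own assumptions: one fixed procedure (count pairs of labelled atoms at each topological distance) and three interchangeable labeling functions. None of the names below come from the Handbook or from any toolkit. Each labeling function would traditionally get its own descriptor name, although only the parameter changes.

```python
from collections import Counter, deque

def adjacency(n_atoms, bonds):
    # Build neighbour lists from a list of (atom_a, atom_b) bonds.
    adj = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj

def distances_from(source, adj):
    # Breadth-first search: topological distance from source to each atom.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def pair_descriptor(elements, bonds, label, max_dist=3):
    """Count (label_a, label_b, distance) over all atom pairs.

    `label` is the tunable part: a function (index, elements, adj) -> str.
    """
    adj = adjacency(len(elements), bonds)
    labels = [label(i, elements, adj) for i in range(len(elements))]
    counts = Counter()
    for i in range(len(elements)):
        for j, d in distances_from(i, adj).items():
            if i < j and d <= max_dist:
                a, b = sorted((labels[i], labels[j]))
                counts[(a, b, d)] += 1
    return counts

# Three labeling functions, one computing procedure.
def by_element(i, elements, adj):
    return elements[i]

def by_element_and_degree(i, elements, adj):
    return f"{elements[i]}{len(adj[i])}"

def unlabelled(i, elements, adj):
    return "*"

# Heavy atoms of ethanol: C-C-O.
elements, bonds = ["C", "C", "O"], [(0, 1), (1, 2)]
for label in (by_element, by_element_and_degree, unlabelled):
    print(label.__name__, dict(pair_descriptor(elements, bonds, label)))
```

Seen this way, choosing a descriptor becomes a search over labeling functions and `max_dist` for a given data set, while the counting procedure stays untouched.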

Anyway, developing a common understanding is already a good start.

2011-01-31T13:55:48.386-06:00

This comment has been removed by the author.

Thank you for the long and critical review. I pret...

2011-01-31T13:54:42.108-06:00

Thank you for the long and critical review. I very much appreciate a critical view on the "computer science" side of cheminformatics before I have to read yet another questionable analysis mixing combinatorial and continuous problems and then concluding that it just works fine for "this non-benchmark data set".

As pointed out multiple times, the lack of multi-label graph standards is still a major problem in this area, and things are just getting worse when going large-scale, or to 3D modelling problems.

And it's Wegner, not Wagner (chapter 4). Thanks again, it was fun to read.

(Returning after having read the full blog post, a...

2011-01-31T13:16:22.595-06:00

(Returning after having read the full blog post, and a good part of the review.)

@Aaron, interesting review! I believe you conclude that the algorithms in the book are pretty basic... sadly, that was deliberate... it's more oriented at showing the casual chemist what cheminformatics algorithms are about, rather than explaining the cutting-edge algorithms there are in cheminformatics (...), which would make the book unreadable to the target audience. But in doing so, we alienated the people we (well, I do) would love to collaborate more with! :(

As was clear from Rajarshi's chapter in particular, there is a large open source cheminformatics community, which is very open (at least I am!) to collaboration, and within the CDK we have actually had such collaborations in the past.

One more exciting problem in chemistry you may find attractive is the enumeration, or even just the counting, of possible graphs given a number of atoms and bonds (vertices and edges). Now, a chemical graph is a colored graph, and not all edges are allowed. Moreover, we are only interested in graphs that are not symmetry-equivalent to one another.

An outstanding problem here is to calculate the number of chemical graphs given a number of atoms (as in a molecular formula, like C4H10O) without enumerating all structures.

Secondly, 'we' would love an open source implementation of an efficient algorithm to enumerate all chemical graphs. The efficiency here lies primarily in not calculating a solution when a different, symmetry-equivalent solution has already been calculated.

Now, this problem has been solved in the proprietary Molgen software, but may provide you with the right amount of complexity you are looking for.
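For readers who want to see the shape of the problem, here is a deliberately naive Python sketch. It is restricted to acyclic carbon skeletons (alkanes, with hydrogens implicit and no heteroatoms, rings, or multiple bonds), which is a much simpler case than a general formula such as C4H10O. It grows skeletons one carbon at a time and discards symmetry-equivalent duplicates after the fact, using a canonical string for trees. The function names are my own, and this is not the approach used by Molgen, whose algorithm I do not know.

```python
def canonical_tree(adj):
    """Canonical string for an unrooted tree given as neighbour lists."""
    n = len(adj)
    # Find the one or two centre atoms by repeatedly peeling off leaves.
    degree = [len(neighbours) for neighbours in adj]
    leaves = [i for i in range(n) if degree[i] <= 1]
    remaining = n
    while remaining > 2:
        remaining -= len(leaves)
        new_leaves = []
        for leaf in leaves:
            for nb in adj[leaf]:
                degree[nb] -= 1
                if degree[nb] == 1:
                    new_leaves.append(nb)
            degree[leaf] = 0
        leaves = new_leaves

    def encode(node, parent):
        # Sorted child encodings make the string independent of numbering.
        children = sorted(encode(c, node) for c in adj[node] if c != parent)
        return "(" + "".join(children) + ")"

    return min(encode(centre, None) for centre in leaves)

def alkane_skeletons(n_carbons, max_valence=4):
    """All distinct acyclic carbon skeletons with n_carbons atoms."""
    level = {"()": [[]]}  # a single carbon with no neighbours
    for _ in range(n_carbons - 1):
        next_level = {}
        for adj in level.values():
            for atom in range(len(adj)):
                if len(adj[atom]) < max_valence:
                    # Attach one new carbon to this atom.
                    grown = [list(nb) for nb in adj] + [[atom]]
                    grown[atom].append(len(adj))
                    # Duplicates collapse onto the same canonical string.
                    next_level.setdefault(canonical_tree(grown), grown)
        level = next_level
    return list(level.values())

print([len(alkane_skeletons(n)) for n in range(1, 9)])
# [1, 1, 1, 2, 3, 5, 9, 18]
```

The counts match the known numbers of alkane isomers (two butanes, three pentanes, and so on). The sketch builds many duplicates only to throw them away, which is exactly the inefficiency described above; an efficient enumerator would avoid constructing the symmetry-equivalent candidates in the first place.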

Dear Aaron, would you be so kind to update your bo...

2011-01-31T12:55:43.265-06:00

Dear Aaron, would you be so kind to update your book review to list my actual last name, "Willighagen"?

Thanx!