Monday, January 31, 2011

Is Cheminformatics the new Bioinformatics? (Guest Post by Aaron Sterling)



Chemoinformatics for Computer Scientists

Guest Post by Aaron Sterling

I recently completed a review of Handbook of Chemoinformatics Algorithms (HCA) for SIGACT News. (See here for the full 12 page review. I have tried to recast the language of HCA into something more accessible to a computer scientist.) Somewhere along the way, my goal for the project changed from just a review of a book, to an attempt to build a bridge between theoretical computer science and computational chemistry. I was inspired by two things: (1) none of the computer scientists I talked to about this -- not even ones who did work in bioinformatics -- had ever heard of chemoinformatics; and (2) the state of the art of chemoinformatics algorithms remains rudimentary from a TCS perspective (though the applications and the problems being solved are quite complex). I believe this represents a tremendous interdisciplinary research opportunity: hundreds of millions of dollars are riding on the speed and accuracy of the techniques presented in HCA, and I suspect that "small" mathematical improvements could yield large payoffs.

My quick-and-dirty definition of chemoinformatics is, "Algorithms, databases and code to help chemists." A more thorough description can be found at this Wikipedia article. (The linked article provides a gentle overview of chemoinformatics, with links to several more specialized articles.) The most discussed applications in HCA are in silico pharmaceutical discovery, solvent discovery, and petroleum reaction improvement and analysis.

The TCS community has formally recognized the importance of working more closely with chemists since at least the 2007 Computational Worldview and the Sciences Workshops, which discussed the tradeoff between "chemical cost" and "computational cost" of producing nanodevices. The report on those workshops speculates, While the computational costs are fairly straightforward to quantify, the same is not true of the chemical costs. Perhaps the Computer Science lens can be used to construct a formal, quantitative model for the relevant chemical processes, which can then be used to optimize the above tradeoff." After reading HCA, I believe formalizing such tradeoffs is important for all computational chemistry, not just nanochemistry.

Unlike the field of bioinformatics, which enjoys a rich academic literature going back many years, HCA is the first book of its kind. There are a handful of graduate textbooks on chemoinformatics, but HCA is the first attempt to collect all chemoinformatics algorithms into one place. The difference in academic development is due to the proprietary nature of chemical databases, in contrast to biological data, which has a long history of being publicly available. As a result, thoroughgoing academic investigation of chemoinformatics is quite new, and there does not appear to be an overarching mathematical theory for any of the application areas considered in HCA.

To provide an intuition for the type of problems considered, suppose you want to find a molecule that can do a particular thing. We assume that if the new molecule is structurally similar to other molecules that can do the thing, then it will have the same property. (This is called a "structure-activity relationship," or SAR.) However, we also need the molecule to be sufficiently different from known molecules so that it is possible to create a new patent estate for our discovery. The naive way to check for structural similarity would be to compare two molecules by solving the Subgraph Isomorphism Problem. There are some algorithms currently in practice that do this, but we expect that problem to be infeasible to solve in general. Therefore, we take graph-theoretic representations of the molecules we want to compare, and extract structural information from them in the form of real numbers called molecular descriptors. (An example of a molecular descriptor that comes up in TCS is the number of distinct spanning trees that spans the molecular graph. There are over 2000 descriptors in the literature, and most require knowledge of chemistry to describe.) If our two molecules are close with respect to a metric in a descriptor space, we predict that they have the same functionality. Then we can test in a wetlab whether or not the prediction is true. The objective is to use computational resources to save time and money by "preprocessing" the laboratory experimental steps.

As this is the Computational Complexity blog, I will provide a quote from HCA on "complexity indices" for molecules. HCA, in turn, is partially quoting from Complexity and Chemistry: Introduction and Fundamentals by Bonchev and Rouvray.

A complexity index should
  1. Increase with the number of vertices and edges
  2. Reflect the degree of connectedness
  3. Differentiate nonisomorphic systems
  4. Increase with the size of the graph, branching, cyclicity, and the number of multiple edges


Still, this is an ongoing discussion, with even conflicting positions.

Both Chapter 4 of HCA and (in much more detail) Complexity in Chemistry provide many complexity indices that appear to have these properties. However, the argumentation is one of induction on small examples. There is no formal mathematics comparing different indices, and the explanation of usefulness of a complexity index is limited to, "It worked for this particular application." This currently ad hoc state of the field leads me to believe that there could be a significant interdisciplinary research opportunity available for theoretical computer scientists willing to put in the time to learn the vocabulary and perspective of the computational chemist.

I believe chemoinformatics, like bioinformatics, will provide an important source of problems for computer scientists, and I hope the publication of HCA, and (to a lesser but real extent) this guest blog and my review, encourage greater participation between computational chemistry and TCS.

Update 2/13: Chemist Rajarshi Guha has posted a response on his own blog

Thursday, January 27, 2011

The Ideal Conference

I found the perfect CS conference. A meeting where computer scientists from all its subdisciplines come together. Not with the purpose of presenting their current research and padding their CVs, but to hear about the latest ideas in the field, learn how to be stronger members of the CS community and above all network, meeting other computer scientists making connections and sharing ideas. This meeting draws a large segment of its intended audience, so popular it has to close registration. A conference that fulfills that most important conference mission: building community.

The only problem: I'm not invited to the party, the Grace Hopper Celebration of Women in Computing and its over 2000 attendees.

The Grace Hopper is a great event but why can't computer science also have such a meeting the embraces the entire CS community? Are we just too big? We could have a large meeting in January built around academic recruiting, where job candidates and hiring committees can have initial discussion and narrow the number of interviews needed later on. Many other academic fields, including mathematics and economics, follow this model.

The most important the purpose of this meeting would be for people to meet, build a strong CS community and make us feel proud to call ourselves computer scientists.

Monday, January 24, 2011

Why My Kids Trust Wikipedia

Guest post from Annie and Molly Fortnow

Our teachers used to tell us not to use Wikipedia because anybody can edit it and therefore it isn't trustworthy. A couple of years ago we decided to test it out. [Not with my knowledge - Lance]

We went to the Cow page but it was locked showing that some subjects can't be edited. We then proceeded to the Grapes page which was unlocked. So we added to the bottom: "Grapes are good. Nerds are cows." We immediately got a message popping up on the screen that said something like "That's an inappropriate remark. It's being deleted. You are getting a warning."

Now we know that if someone edits Wikipedia with something silly it will always be edited back. So, now we trust Wikipedia and use it often. Every time one of our friends asks us why we use Wikipedia we tell this story and everyone always believes us. Now all our friends trust Wikipedia too.

Thursday, January 20, 2011

Does Tiger Woods know what a Venn Diagram is?



In prior blogs I noted that the terms Turing Test and Prisoner's Dilemma have been used in articles for non-math people. In the age of Google people can look things up (recall that Google makes us smarter). I have since seen Prisoner's Dilemma used as the name of an episode of the TV show White Collar. They used it mostly correctly in the show.

I have spotted some more math term in a non-math context.

VENN DIAGRAMS!

In an article entitled Rachel Uchitel is not a Madam., which is about the world Tiger Woods was involved in, the following was mentioned. (The context is a comparison between the options men have for affairs: a prostitute or a civilian.)
Both methods of slaking the hunger have their pros and cons. Men like to hunt, and there is no need to hunt a prostitute. Men like to cheat without strings, and you can't stop a civilian from falling in love. But (Tiger) Woods found a way to enjoy the best of both worlds in one type of woman, a Venn diagram of sexual satisfaction. Most of his women lived in a nebulous in-between world.
Will this enlighten the masses as to what a Venn diagram is? If they look it up then yes; however, the actual statement is wrong. What they really mean is an intersection of sexual satisfaction, or, as it is commonly known, an intersextion.

ZENO'S PARADOX

An article entitled Harry Potter and the Dragged out final act began as follows.
When Warner brothers announced that the seventh and final book of J.K. Rowling's Harry Potter series, Harry Potter and the Deathly Hallows, would be two movies, it occurred to me that the company had been insufficiently ambitious. If, as reported, Warner executives are scared of running short of tentpoles (i.e., the so-called franchises that prop up a studio), they should at the very least divide the next half in half. Following Zeno's paradox, they could even turn Deathly Hallows into an infinite number of sequels with Becket-like arcs of nonaction: "Let's apparate." [They do not move.]
Is this the correct usage of Zeno's paradox? Becket? Apparate? I was inspired to look up apparate and put in a pointer so that you can learn what it means too!

n+1

There is a general interest magazine called n+1. I emailed the editors to ask why they chose the name and got this enlightening response:
Well, as a non-math person and one of the founding editors of n+1, I can tell you that we still don't know very much about math, but we did have some vague high school memories of set theory and algebra and knew that n+1 could mean, if it were a set, an infinite series or open-ended expansion, or just that for any quantity (n), there's often more than meets the eye, or is commonly thought or known (+1). That was the sense that Chad Harbach, another founding editor, had in mind when he first thought of the title as a placeholder, a math metaphor for human potential, back when he was a Harvard undergrad. Over time, the title also seemed to work in response to the "End of History" crowd, all those people who told us that no new ideas were really possible in the humanities, no new writing was possible, that it was foolish to start a magazine of politics, literature, and culture in these times. So we took on n+1 as a rallying cry, of sorts. Someone might also hear it as "end+1," after the end, a new beginning, that sort of thing. We did have a math PhD friend who suggested that, if we really wanted to designate an infinite universe of possibilities, we should have called it omega plus one, but that seemed too much for us non-math types. As far as I know, omega plus one is still available as a title.

Thanks for writing in to ask and best of luck with the blog and other endeavors,
I wish them well too!

Will any of these terms enter the common vocabulary? I suspect that Prisoner's Dilemma will as it is a nice shorthand for a common phenomena. I suspect that Turing Test, Venn Diagram, Zeno's Paradox, and n+1 do not come up often enough to break out into common use.

What math terms have you seen used in articles for non-math people? Were they used correctly? Will they become common? Are they common already? If so have then are they related to the original meaning?

Monday, January 17, 2011

Coloring Maps

The four color theorem means you can color the United States in four colors. But can you color it in three? Try it before you read on.


The answer is no. Consider Nevada. It takes three colors to color that states that surround Nevada (Oregon, Idaho, Utah, Arizona and California) and then one more for Nevada.

I use this as an example of a heuristic in the P v NP book I'm working on. But what about the converse: Can a map have no internal states with an odd number of neighbors and still require four colors. I can come up with some examples but they all require either a lake separating states (like Lake Michigan) or more than three states coming together at a point (like the Four Corners of Utah, Arizona, New Mexico and Colorado).

I wrote this question in terms of planar graphs and asked it on Theory Q&A. Turns out there are no other examples. Take any map, where all internal states have an even number of neighbors with no lakes and no point that borders more than three states and that map can be three colored. Cool.

I'd like to find natural examples, real maps of anywhere where
1) At least two internal regions (to make it interesting)
2) All internal regions have an even number of neighbors
3a) The map is not three colorable (has lakes or more than three regions meeting in a point), or
3b) There are no lakes or more than three regions meeting in a point (so three colorable).

If you are the first to give me an example I use in the book, I'll send you a free copy of the book when it is published.


Thursday, January 13, 2011

Are you a Ringer? A Reverse Ringer?

A ringer is an impostor, especially one whose pretense is intended to gain an advantage in a competition. This definition is from Wikipedia and agrees with what I thought the term meant. (Wiktionary has a similar definition. Spellcheck insists that Wiktionary should be either Dictionary or Visionary.)

The following problem appeared in The Bent Winter 2010 issue. (The Bent is a publication of Tau Beta Pi, and Engineering Honor Society.)
Al's job is testing bowling balls. He has two identical bowling balls and is to test their impact resistance by dropping them out of windows on various floors of a 100-story building. He is to determine from which exact floor a dropped ball will shatter on impact with the pavement below. Al knows nothing about the strength of the balls. They may shatter when dropped from the first floor or not until dropped from the 100th floor. What is the minimum number of ball drops needed to guarantee that Al can uniquely determine the floor fro which the balls will shatter. Balls that do not shatter may be dropped again. Both balls may be destroyed during the test. Include a brief outline of how testing is done.
This problem raises questions and metaquestions.
  1. What is the answer?
  2. A while back I had an undergrad work on the general problem of f floors and e eggs (we used eggs not bowling balls). Hence I know the answer with matching upper and lower bounds for all f and e. Should I submit my answer? If I did would I be a ringer? All you get for submitting a correct solution is your name in the next issue so that would be okay(?). Even so, its seems like cheating. (See here for my students paper. We didn't publish it since a similar paper had already appeared: The Egg Drop Number by Michael Boardman, Mathematics Magazine, Vol 77, No. 5, Ded 2004, 368-372. You can find it here.)
  3. When is one a ringer? In a later issue there was a problem I had not seen before but was able to solve easily with graph theory. For that problem am I a ringer?
  4. The intent of the problems (I think) is to test your cleverness not your repository of knowledge. With this in mind, clearly if I send in a solution to the Bowling Ball problem, I am being a ringer. For the graph theory problem it is less clear.
  5. Once in a restaurant I was doing the kids math puzzles on the paper placemats with my great nieces and nephews. There was one problem that I could solve by brute force but tried instead to find a clever solution (there probably wasn't one). Hence I could not solve it. Or at least that is what my great nieces and nephews think. This might be called being a reverse-ringer. Is there a better term?
  6. Here is a problem from Activity book from the NSA on Codes, Ciphers, Puzzles) that is geared towards kids. I was able to do every puzzle in it very quickly except this one. If you know the answer please tell me before some kid asks me:
    Logic Puzzle Number 1: If a railroad train is moving northward, there is a part of each car on the train that is moving southward at each instant, no matter how fast the train is going. what is the contrary piece moving southward? Hint- sketch a train moving on a track and examine the parts you sketch.
What has been your experience with solving (or not) problems that are, in some sense, below your ability and knowledge level?

Monday, January 10, 2011

LICS and TAMC call for papers

Two Call For Papers Announcements:
  1. LICS 2011 (Logic in Computer Science) has posted its call-for-papers here. (It was probably posted a while back- the submission deadline is Jan 12, two days from now.) I noticed some people on the committee that are regularly at CCC. How much do LICS and CCC overlap?

    In 1987 CCC (then called Structures) co-located with LICS at Cornell. I do not know if that many people went to both (I did not). Some years later I noticed that LICS and CCC were at the same time on different coasts, hence it would be impossible to go to both. Nobody complained (in fact, nobody seemed to notice) so I assumed there is not that much overlap.

    SO-I throw the question to those who go to both conferences- is there much overlap? Would people from CCC benefit by going to an occasional LICS? Would people from LICS benefit by going to an occasional CCC? Surely the answer is YES in terms of getting exposure to some new things. But is there a benefit beyond that?
  2. (Disclosure- I am on the program committee for this one.) TAMC 2011: (Theory and Applications of Models of Computation) has posted its call-for-papers here. Note that the Submission deadline is Feb 5 and its in Japan. Is it being in Japan make it more or less likely for you to want to go? If you are a theorist in Japan (or close to Japan) then I would assume this is GREAT since its a local. If not then do you view it as (1) too expensive, too much time away from home, so DON"T want to go, or (2) get to see another country! so do want to go! Obviously different people think different things at different times.

Thursday, January 06, 2011

The Enduring Legacy of the Turing Machine

Last Februrary Peter Wegner asked if I would be interested in writing an article for a series in ACM Ubiquity on "What is Computation?"
Our expanding collaboration with other fields is broadening our understanding of computation, and it is appropriate to take stock of where we are.  It is likely that the question "What is computation?" will never be completely settled, just as the question "What is life?" is never settled in biology and "what are the fundamental forces?" is never settled in physics.  Engaging with our question is valuable even if we may not find a completely satisfactory answer.
Well I know exactly what computation is, thanks to the beautiful paper of Alan Turing in 1936. My initial reaction was to stay far away but then I realized I had a forum to truly make the point that Turing had the right notion from the start. But as I started writing I realized I didn't have to make any new arguments, Turing himself anticipated the future objections. I just used his words.

The Ubiquity "What is Computation?" papers are coming out once a week. My article, The Enduring Legacy of the Turing Machine, was publshed last week.

Also check out the article by David Bacon who turns the question around. Instead of asking "Can the universe compute beyond Turing Machines", he asks "Why can the universe compute at all?".

Monday, January 03, 2011

What is a breakthrough? Lets have an intelligent discussion!!!!!!

In 2010 this blog announced the following Breakthrough!!!! results: (Listed chronologically.)
  1. Better Algorithms for Unique Games, by Arora, Barak, Steurer. (We refer to this as AUG.)
  2. Better Circuit Lower Bounds, by Williams. (We refer to this as CLB.)
  3. Erdos Distance Problem mostly resolved, by Guth and Katz. (We refer to this as ED.)
  4. Progress on Density Needed to Guarantee a 3-AP, by Sanders. (We refer to this as 3AP.)
  5. Better Approximation Algorithm for (a version of) Metric TSP, by Gharan, Saberi, Singh. (We refer to this as TSP.)
Some of the comments on those posts questioned if these results really were breakthroughs. This is a fair question; however, in order to answer it the question arises What is a breakthrough? I list some criteria. I'm not sure how many of these a breakthrough needs.
  1. The result is correct and the paper is posted. This is mandatory.
  2. The problem that the result concerns has to be important. This might be subjective.
  3. There has to be substantial progress on the problem. This could be a matter of debate.
  4. There has to be a reason why the problem was thought to be hard. E.g., a proof that a new technique is needed, problem has been open for a long time, smart folks say its hard.
So, how do the five breakthrough results look with these criteria? I will now do the very silly exercise of actually giving numeric values to these criteria. The scale is 1 to 10. I do not take these numbers seriously; however, being forced to assign numbers forces one to think about the criteria.
  1. AUG: The Unique Game Conjecture is important, hence progress on it in either direction is important. The result was surprising since it may indicate that UGC is not true. IMPORTANT: 8. PROGRESS: 8. THOUGHT HARD: 8. TOTAL SCORE: 24.
  2. CLB: If you view this as an attack on P vs NP then the problem being considered is important but the progress on it is not impressive in absolute terms. However, it is very impressive in terms of getting around current barriers. IMPORTANT: 10. PROGRESS: 8. THOUGHT HARD: 10. TOTAL SCORE: 28.
  3. ED is certainly very interesting, but is it important? There is a website devoted to it but does that make it important? It has lead to mathematics of importance but that's not quite the same thing. The result makes substantial progress in that it SOLVED the problem (up to a log factor). Smart people thought the proof would require new techniques. It did. IMPORTANT: 6. PROGRESS: 10. THOUGHT HARD: 9. TOTAL SCORE: 25.
  4. 3AP is certainly important and has lead to important mathematics. But there are people working in complexity, even complexity bloggers, who do not think this sort of thing is important. They are wrong. Hold the flames- I am kidding. The progress made is more one of technique then result. Many people thought it was hard. IMPORTANT: 7. PROGRESS: 7. THOUGHT HARD: 9. TOTAL SCORE: 23.
  5. TSP: The metric TSP problem has had that constant of 3/2 for a very long time. It was possible that 3/2 was optimal. Getting any kind of improvement on it, even a very tiny one, would be very important and would be real progress. Alas the paper didn't quite do that--- it made progress on a version of the problem. Still impressive. I will leave it to people in algorithms to argue if it is a breakthrough or not.
  6. In November I heard about the result of on Network flows by Christiano, Kelner, Madry, Spielma, Teng. I had heard that it was a breakthrough. I emailed 5 friends in algorithms asking if they wanted to guest post on it. Nobody wanted to. This is NOT a statement about the paper itself. Reasons were a combination of (a) too busy, (b) don't know the area well enough, and (c) don't want to deal with the idiotic comments your blog often gets. This does NOT mean it was not a breakthrough. It may have to do with my choice of friends. Was it a breakthrough? I still don't know. However, I will now open it up: If someone wants to guest blog about it, let me know.
These are my opinions. What are yours? On these results or on any other ones?