## Sunday, February 18, 2024

### ChatGPT thinks Graph Isomorphism has real applications. Is it right?

Lance did a post on Babai's result on Graph Isomorphism (see here). I then did a post asking if Graph Isomorphism has real applications (see here).  Lance proofread my post (There were some typos! Really!) and then he was inspired to ask ChatGPT what it thought. The conversation is below (I added some fonts, and my opinion clearly marked,  but nothing else).  I leave it to the reader to determine what ChatGPT got right and got wrong. In areas that I am unfamiliar with I tend to think that ChatGPT is right, perhaps a case of Gell-man Amnesia (see here).

LANCE: What are some applications of Graph Isomorphism?

ChatGPT:

Isomorphism, a concept from the field of graph theory in mathematics and computer science, refers to the problem of determining whether two graphs are isomorphic, meaning there is a one-to-one correspondence between their vertex sets that preserves adjacency. Despite its theoretical underpinnings, graph isomorphism has practical applications in various domains:

Chemical Informatics: It's crucial for identifying chemical compounds. Since chemical structures can be represented as graphs (with atoms as vertices and bonds as edges), determining if two graphs are isomorphic can indicate whether two compounds are structurally identical.

BILL'S OPINION: I suspect simple heuristics work fine. Also I suspect that chemicals are more complicated than graphs. More generally, instead of asking Does GI have any real applications? I should have asked Do the last 45 years of graph isomorphism algorithms produced by our community have any applications? I suspect the answer is no.

Pattern Recognition: In image and pattern recognition tasks, objects can be represented as graphs, and finding isomorphic graphs helps in identifying similar patterns or shapes within different datasets.

BILL'S OPINION: I suspect that in real world pattern recognition, especially for images, approx matches is what you really want. What makes GI hard/interesting is that you need exact Isomorphism. So I doubt GI really helps for Pattern Recognition.

Network Analysis: Graph isomorphism is used to analyze and compare different network structures in social networks, biological networks (like neural connections and protein interactions), and the Internet, to understand their underlying properties and dynamics.

BILL'S OPINION: This sounds plausible to me. The people doing the studying are other academics, however, that can still count.

Computer Vision: For object recognition and classification, where objects within an image can be represented as graphs, and their identification involves finding isomorphic subgraphs within larger scene graphs.

BILL'S OPINION: Same issue as with Pattern matching.

Cryptography: Graph isomorphism problems can serve as the basis for cryptographic protocols, including public key cryptography, due to the computational difficulty of solving graph isomorphism for general cases.

BILL'S OPINION: Just NO.  One reason factoring is so useful for crypto is that one can easily generate hard instances. For GI you cannot (at least not yet). I think ChatGPT confused GI being in Zero-Knowledge with GI actually being used for Crypto. More on this later.

Database Indexing: In databases, especially graph databases, graph isomorphism helps in query optimization and efficient data retrieval by identifying identical or similar graph structures.

BILL'S OPINION: I can believe this! Gell-man Amnesia?

These applications demonstrate the wide-ranging impact of graph isomorphism beyond theoretical computer science, influencing fields where structural and relational data analysis is crucial.

LANCE: Can you give me specific examples of these?

(BILL: I won't comment on these, except in one case,  as it would be the same comments as above.)

ChatGPT:

Certainly! Here are specific examples for the practical applications of graph isomorphism mentioned:

Chemical Informatics: A classic example is the identification of isomers in chemistry. Isomers are compounds with the same molecular formula but different structures. Graph isomorphism algorithms can distinguish between different isomers by comparing their graph representations, where vertices represent atoms and edges represent chemical bonds. For instance, butane and isobutane are structural isomers with different connectivity among the same set of atoms, which can be identified as distinct through graph isomorphism.

Pattern Recognition: In character recognition systems, letters or symbols can be represented as graphs, where the strokes are edges and intersections are vertices. Graph isomorphism algorithms can then be used to match a scanned, handwritten, or printed character against a database of graph-represented characters to identify the letter or symbol.

Network Analysis: In social network analysis, researchers might use graph isomorphism to identify similar subnetworks within larger networks. For example, detecting communities or cliques within social networks that have identical or similar interaction patterns, which can be crucial for understanding social dynamics or the spread of information.

Computer Vision: An application in computer vision could be recognizing objects regardless of their orientation in an image. By representing objects as graphs, where vertices correspond to object features and edges to spatial relationships between these features, graph isomorphism can help in identifying an object by finding a match between its graph representation and that of a known object.

Cryptography: A specific use case in cryptography is the design of cryptographic primitives based on the Graph Isomorphism problem, such as the Goldreich-Goldwasser-Halevi (GGH) cryptosystem. This cryptosystem constructs a public key from a hard instance of a lattice problem, which is closely related to finding isomorphisms between noisy graph representations, leveraging the computational difficulty of graph isomorphism for security.

BILL: The pointer to the Wikipedia page on the GGH system is here. It has nothing to do with graph isomorphism. So this one is complete garbage.

Database Indexing: In the context of graph databases used for storing knowledge graphs or social network data, an example application is the optimization of query processing. When a user queries the database to find data patterns that match a specific graph structure, graph isomorphism algorithms can help quickly locate all instances of the data that are isomorphic to the query graph, enhancing search efficiency and performance.

These examples highlight how graph isomorphism plays a crucial role in solving real-world problems across different domains by leveraging the structure and properties of graphs.

LANCE: Give me a specific research paper in Chemistry that uses graph isomorphism

ChatGPT
A notable research paper that applies graph isomorphism in the field of chemistry is:

Title: "Development and Application of a Novel Graph-Theoretical Approach for Molecular Similarity Analysis"

Authors: Raymond E. Carhart, David H. Smith, Robert Venkataraghavan

Journal: Journal of Chemical Information and Computer Sciences

Year: 1985, Volume: 25, Issue: 1, Pages: 64–73, DOI: 10.1021/ci00045a002

Abstract:

In this paper, the authors introduce a novel method for analyzing molecular similarity based on graph-theoretical concepts. They present an algorithm, known as the MCS (Maximum Common Subgraph) algorithm, which is used to identify the largest subgraph common to two chemical compounds represented as graphs. This process relies on graph isomorphism to determine the structural similarity between molecules. The approach was groundbreaking at the time because it provided a quantitative method for assessing molecular similarity, which is crucial for drug discovery and the development of chemical databases.

The significance of this work lies in its application of graph isomorphism to solve a practical problem in chemistry, demonstrating the interdisciplinary nature of graph theory and its potential to impact research beyond theoretical computer science. This paper has been influential in the fields of cheminformatics and drug design, where molecular similarity plays a key role in identifying potential therapeutic agents.

BILL'S COMMENT: This information from ChatGPT would be more impressive if this paper actually existed.

DOES NOT EXIST: Bill could not find it on the web.

DOES NOT EXIST: Bill went to the website of that journal and did a search and did not find it.

DOES NOT EXIST: The DOI does not point to it.

DOES EXIST: Some slide packet pointed to it as a reference.

If someone leaves a comment with a pointer to it, or emails me a pointer to it, I will update this post.)

1. That paper appears to exist: R.E. Carhart, D.H. Smith, R. Venkataraghavan JCICS 25:64-73 (1985), and it is referenced here: https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf in the context of atomic structure similarity.

1. (Bill) Before posting I went to the website of JCICS and searched for the paper and could not find it. The website I used was

https://dl.acm.org/journal/jchm

So what is going on? Possibilities
1) It is there and either I am not good at seraching or the serach engine is bad.
2) The slide packet you pointed to that had the paper as a reference is incorrect.

Both are quite plausible. If you find a pointer to the paper (or just the abstract- its prob behind a paywall) then I will correct my post.

2. At least we ought to be able to agree that the DOI that the ChatBot provided is to a completely-unrelated article, entitled "Keeping up with Japanese chemical technology at Chemical Abstracts Service". At least, I assume that it's completely unrelated. It is, indeed, behind a paywall, and the Rutgers library system does not provide me with access to it.

3. The only Google hits for "Development and Application of a Novel Graph-Theoretical Approach for Molecular Similarity Analysis" are tied to this post.

4. (Copying from an email I sent to Bill)

So chemists use two related problems, but not vanilla GI as far as I can tell.

Induced subgraph isomorphism: used to find the maximal common induced subgraph of a collection of molecular graphs, and in some cases used to do database queries like "which chemicals are available for sale by our suppliers that contain a specific chemical core?" Many use cases here seem to be in ad-hoc visualization and exploration. They want to plot a bunch of chemical graphs in a way that highlights a particular part (the common core) and lets the chemist use their "chemical intuition" to figure out how that relates to a physical reaction. The leading algorithm here is called VF2++, after a line of work from VF->VF2->VF2+->VF2++. It's a branch and bound algorithm.

Graph canonicalization: used to give a string/hash identifier for fast database lookups/search. Notably, chemists gave up on using GI for database search because in the time it took to get fast GI solvers like nauty/bliss, the industry came up with a few different kinds of string identifiers for chemicals that are "good enough" for large swaths of drug discovery work. The "SMILES" string is one I read about in detail. Maybe someone with more theory background can explain the relationship between GI and canonicalization. Clearly canonicalization is at least as hard as GI, and chemists probably relax the "canonical" part slightly.

I did find two chemistry projects that use GI directly. Both of these projects seem relatively unpopular in the chemistry world compared to the big open source projects whose maintainers I interviewed.
https://github.com/StructureGenerator/surge which " generates all non-isomorphic constitutional isomers of a given molecular formula." It was written by Brendan McKay (the author of nauty), so I feel there is a potential conflict of interest and I shouldn't conclude this is a valid use case unless I find someone else using this package, which I have yet to. I have to talk to my chemistry contacts to see if this task is a useful one for most chemists.

https://github.com/RuleWorld/bionetgen which is a chemistry modeling framework that uses GI for some sort of data structure optimizations I didn't get to the bottom of.

A few other things to mention. In re "chemicals are more complicated than graphs," yes absolutely, but for the most part they annotate the edges and vertices with enough extra information to make it work, and the parts where it cannot work (e.g., knotted molecules, weird quantum effects) they resort to other, more detailed representations, and it's more academic rather than what people use at drug design companies. The general field of trying to infer physical/biological activity from the chemical/graph structure of a molecule is called Quantitative Structure-Activity Relationship, and I found there is a lot of successful work in trying to formulate chemically-relevant graph invariants.

One of my interviewees gave this talk that covers the history of cheminformatics through 1990. You might find it enjoyable at least as a love letter to how intertwined the history of cheminformatics and computer science are.

Summarizing ChatGPT: it's wrong in the sense that chemists don't use the atomic formula to determine if two compounds are identical, they use the SMILES string or similar, which has enough information to distinguish simple isomers like the one listed.

1. "One of my interviewees gave this talk that covers the history of cheminformatics through 1990." Which talk? Looks like your copy from an email was lossy.

5. On subgraph isomorphism things are clearer. There is huge literature on related problems.
In network resource orchestration one of the topics of intense research is the embedding of Service Function Chains (SFCs) in large networks, like datacenters.
In brief, an SFC is modelled as a small weighted graph and a datacenter as a large weighted graph. Embedding is the mapping of the small graph on a subgraph of the large network so that the resulting embedding will have certain properties, like optimal resource usage. It is a generalization of subgraph isomorphism as we search for the best approximation of an isomorphic subgraph. Of course there are variations to this problem, like letting more than one nodes from the SFC to be mapped in one node of the large network.
The problem is NP-hard and we try to apply AI methods to achieve efficient solutions for real time usage. Although even using advanced methods like genetic algorithms it is still hard to get exact solutions; see section 4 in
https://doi.org/10.1007/s10922-023-09771-y

6. Regarding "Graph canonicalization" as "give a string/hash identifier"
(see j2kun's comment) versus "canonisation" as "canonical order of a graph". The abstracts of the following two papers explain a close relationship between "a logic capturing PTIME," "Choiceless Polynomial Time with counting (CPT)," and "canonisation," which is absent for the weaker "give a string/hash identifier":

Choiceless Computation and Symmetry: Limitations of Definability (https://arxiv.org/abs/2401.07147) by Benedikt Pago
The search for a logic capturing PTIME is a long standing open problem in finite model theory. One of the most promising candidate logics for this is Choiceless Polynomial Time with counting (CPT). Abstractly speaking, CPT is an isomorphism-invariant computation model working with hereditarily finite sets as data structures. While it is easy to check that the evaluation of CPT-sentences is possible in polynomial time, the converse has been open for more than 20 years: Can every PTIME-decidable property of finite structures be expressed in CPT? We attempt to make progress towards a negative answer and show that Choiceless Polynomial Time cannot compute a preorder with colour classes of logarithmic size in every hypercube. The reason is that such preorders have super-polynomially many automorphic images, which makes it impossible for CPT to define them.

Deep Weisfeiler Leman (https://arxiv.org/abs/2003.10935) by Martin Grohe, Pascal Schweitzer, Daniel Wiebking
We introduce the framework of Deep Weisfeiler Leman algorithms (DeepWL), which allows the design of purely combinatorial graph isomorphism tests that are more powerful than the well-known Weisfeiler-Leman algorithm.
We prove that, as an abstract computational model, polynomial time DeepWL-algorithms have exactly the same expressiveness as the logic Choiceless Polynomial Time (with counting) introduced by Blass, Gurevich, and Shelah (Ann. Pure Appl. Logic., 1999)
It is a well-known open question whether the existence of a polynomial time graph isomorphism test implies the existence of a polynomial time canonisation algorithm. Our main technical result states that for each class of graphs (satisfying some mild closure condition), if there is a polynomial time DeepWL isomorphism test then there is a polynomial canonisation algorithm for this class. This implies that there is also a logic capturing polynomial time on this class.

7. Especially ChatGPT-3.5 has a tendency to confabulate references, ChatGPT-4 is better, but for such exact/discrete questions it is wise not to ask a lossy compressed language model :)

The paper it tried to refer to seems to be: https://pubs.acs.org/doi/pdf/10.1021/ci00046a002 "Atom pairs as molecular features in structure-activity studies: definition and applications" Raymond E. Carhart, Dennis H. Smith, and R. Venkataraghavan (J. Chem. Inf. Comput. Sci. 1985, 25, 2, 64–73)

So the idea was in the relevant latent direction, but the execution less so.

Really appreciated your Frozen Kolmogorov Concentrate intuition pump for language models.

I once used subgraph isomorphism to detect credit card/transaction fraud, but mostly focused on graph edit distance because real-life graphs often not exactly the same, but can be very similar.