Thursday, January 12, 2023

Semantic Search for the Blog

As Google started to limit academic storage, I began looking at Google Takeout and wondering what I could do with all that data. I downloaded all the posts from the blog, since we use Google's Blogger, and ran them through OpenAI's Ada embedding. The Ada embedding maps text of up to 8192 tokens to a point on the unit sphere in 1536-dimensional space. You can measure the similarity between two embeddings via a simple dot product, which gives you the cosine of the angle between them.
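To see why a dot product is all you need, here is a minimal sketch in plain Python. The vectors are made-up 3-dimensional stand-ins for the real 1536-dimensional embeddings, and `normalize` and `dot` are illustrative helper names, not part of any OpenAI API.

```python
import math

def normalize(v):
    """Scale v to unit length so it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    """Dot product; for unit vectors this equals the cosine of the angle between them."""
    return sum(a * b for a, b in zip(u, v))

# Toy stand-ins for real embeddings (assumed values, for illustration only).
a = normalize([1.0, 0.0, 0.0])
b = normalize([1.0, 1.0, 0.0])

cosine = dot(a, b)                                # cosine of the angle between a and b
angle_degrees = math.degrees(math.acos(cosine))   # the angle itself, in degrees
```

Because both vectors already have length one, no division by norms is needed at comparison time; the dot product is the cosine.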

So I created a semantic search for the blog. Go ahead and try it out.

You can enter a search term, phrase, or the full URL (including https) of a blog post. It will return a list of the 5 closest posts, with the percentage match, computed as the square of the cosine. I don't have a mechanism for automatically updating the files, so you'll only see posts from 2022 and earlier.
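The ranking step can be sketched roughly as follows. This is an illustration under assumptions, not the code running behind the search box: the post titles and 2-dimensional vectors are invented, and `unit` and `closest_posts` are hypothetical helper names. It ranks posts by cosine and reports the squared cosine as a percentage.

```python
import math

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def closest_posts(query_vec, post_vecs, k=5):
    """Return up to k (title, percent) pairs, ranked by cosine to the query.

    post_vecs maps a post title to its unit-length embedding.
    percent is 100 times the square of the cosine.
    """
    scored = []
    for title, vec in post_vecs.items():
        cosine = sum(a * b for a, b in zip(query_vec, vec))
        scored.append((cosine, title))
    scored.sort(reverse=True)  # largest cosine first
    return [(title, 100.0 * cosine ** 2) for cosine, title in scored[:k]]

# Invented toy data standing in for the real post embeddings.
posts = {
    "Baseball": unit([1.0, 0.0]),
    "Theorem": unit([0.0, 1.0]),
    "Cities": unit([1.0, 1.0]),
}
query = unit([1.0, 0.0])
results = closest_posts(query, posts, k=2)
```

In a real setup the query vector would come from embedding the search phrase, or from looking up the stored embedding of the post at the given URL.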

This was an OpenAI-assisted affair, as I used ChatGPT and GitHub Copilot to help with the Python and pandas data frames. It took me longer to figure out how to create a web application so you can try the search. Similarity matching doesn't work like a normal search. For example, if you search for a city like "Detroit", you'll get posts that mention other cities. There are some other oddities: "mad" seems to match "Madhu". It probably says something about me that my happiest post is not about some great new theorem but about baseball.

1 comment:

  1. Does the search engine for the blog that you built differ from the one that is already on the blog (upper left corner)?