
Creating a Semantic Search Engine for My Photos

In this article, we explore the results of using a CLIP model to find photos in a personal image library with open-text search queries and image similarity.
Created on May 16 | Last edited on July 29
I like photography, and I try to be organized with my photo collection.
I save files to folders arranged by date, and I keep a few collections in both Lightroom and Capture One. Nevertheless, I usually have a hard time locating photos I know exist in my library, but I can't remember where!
What if I could search using a free-form text query without going through the effort of captioning all my photos beforehand? I've had this in the back of my mind for some time, and last Saturday I finally attempted an initial proof-of-concept prototype.
My motivation to tackle the project now was that Jeremy Howard is going to discuss Transformers in his next lecture of the 2022 fast.ai course, which I'm attending and enjoying a lot. I have used Transformers in the past, and I'm familiar with the Hugging Face transformers library he is going to use. I also wanted to test multimodal (vision + text) transformers for this project. Finally, I was encouraged by the projects my friends at the Delft fastai study group are doing every week, and I wanted to contribute something of my own.
Before explaining how the system works, let me invite you to explore some of the results I got from the prototype. The following panel shows the 4 images in my library that the model thinks are most similar to the text query displayed underneath (look for "red flower" below).
The neat thing about writing this post as a W&B live report is that you can view any example you like (from the ones I gathered) by dragging the slider below the photo strip and the caption: select a different Step index, and you'll see a different prompt:


As I'll explain later, this works great for my purposes! In this article, we'll find out how it was built. Here's what we'll be covering:

Table of Contents



Approach

For this test, I selected one of the CLIP models created by OpenAI, several of which are available pre-trained in the transformers library, ready to use.
CLIP is a multimodal vision and text model that attempts to encode both images and the textual descriptions of those images in the same latent space. This means that if a text sentence is a good description of an image, both items will be represented as very close points in that vector space.
CLIP is usually applied to measure the similarity scores between some images and some text descriptions. In fact, the code snippet shown in the Hugging Face documentation for the CLIP models does exactly that: it uses an input composed of an image and two descriptions. It asks the model to determine which one of those descriptions is closest to what's shown in the image.
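For reference, here is a minimal sketch along the lines of that documentation snippet: one image is scored against two candidate captions, and a softmax turns the similarity scores into probabilities. The COCO image URL is just the usual example; the exact printing is my own.

```python
# Sketch: score one image against two candidate descriptions with pre-trained CLIP.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image works; this COCO URL is just a placeholder example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; softmax converts them
# into probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```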
In my case, I selected model openai/clip-vit-base-patch32, and then took the following approach:
First, I created embeddings for about 30 thousand photos in my library. I first exported them to a manageable size of 480 pixels on the longest side, then ran all those small images through the visual part of the CLIP model, ignoring the text inputs (a sketch of this step follows the list below). This gave me a big table with ~30K items:
  • The first column of the table is the path of the image on the disk. The second column is the vector the image gets assigned in the representation space.
  • This process took just a couple of minutes on my computer.
  • Note that no training or fine-tuning was applied: I just used the default representations extracted by the pre-trained CLIP model. CLIP was trained on several million image/caption pairs, so I assumed it already knew how to extract information from a wide variety of photos, whatever the subject.
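This is roughly what the indexing step could look like. It is a sketch, not my exact script: the export folder path, file pattern, batch size, and output file name ("image_index.pt") are all placeholders.

```python
# Sketch of the indexing step: embed every exported photo with the vision tower
# of CLIP and keep the file path next to each vector.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

EXPORT_DIR = Path("exports/480px")          # placeholder: folder with the small exports
paths = sorted(EXPORT_DIR.rglob("*.jpg"))

all_features, batch_size = [], 64
with torch.no_grad():
    for i in range(0, len(paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in paths[i : i + batch_size]]
        inputs = processor(images=batch, return_tensors="pt").to(device)
        feats = model.get_image_features(**inputs)        # (batch, 512)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize for dot-product search
        all_features.append(feats.cpu())

image_features = torch.cat(all_features)                  # (~30K, 512)
torch.save({"paths": [str(p) for p in paths], "features": image_features},
           "image_index.pt")
```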
To search for photos using a free-form text query, these are the operations involved (a code sketch follows this list):
  • Process the query text through the text portion of CLIP. This gives us a vector that represents the input text.
  • Find the similarity between that vector and each one of the vectors that represent the images in my library. The similarity can be calculated using the dot product between the text vector and each one of the image vectors (on normalized embeddings, this is just the cosine similarity). Fortunately, this can be done very efficiently using torch.matmul, the matrix multiplication function provided in PyTorch.
  • Sort the similarity results, and keep the first few items. Then go to disk and retrieve the images associated with those results (remember that we stored the file paths alongside the vectors).
  • The query process is very fast: a query takes about 0.3 seconds to complete (including image retrieval and display in a notebook cell).
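A sketch of the query step, assuming the "image_index.pt" file produced by the indexing sketch above; the search function name and the top-k of 4 are my own choices.

```python
# Sketch: embed the text query, score it against every stored image vector with a
# single matrix multiplication, and return the file paths of the top hits.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

index = torch.load("image_index.pt")            # produced by the indexing sketch
paths, image_features = index["paths"], index["features"]

def search(query: str, k: int = 4):
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_features = model.get_text_features(**inputs)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Dot products against all image vectors at once: (1, 512) @ (512, N) -> (1, N)
    scores = torch.matmul(text_features, image_features.T).squeeze(0)
    best = scores.topk(k)
    return [(paths[i], score) for i, score in zip(best.indices.tolist(), best.values.tolist())]

for path, score in search("red flower"):
    print(f"{score:.3f}  {path}")
```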

Results for free-form text queries

In addition to the dynamic figure shown in the introduction, this is a table with results from a few queries I ran on my photos. Some of them were suggested during a discussion in the Delft fastai study group (thanks, everyone!). You can click on the small arrows in the table footer to navigate the pages.



This prototype is already working great for my purposes. My main goal was to find photos I know exist in my library, and it does that most of the time. For example, I remember I took a photo in Japan of a young lady leaning on a red wall, and this is the result when I used the query "girl leaning on red wall":
Text query: girl leaning on red wall. I think I only have one of those in my library!
The first image is the one I was looking for. The others don't show girls leaning on red walls, but I don't have that many photos of girls on red walls!
I was surprised to see that the system can also be used to filter by technique or style. It understands what slow shutter speed is, it recognizes motion blur, and it responds well to queries like "colorful abstract subject":
Text query: colorful abstract subject.
Text query: colorful Ferris wheel.
It can also handle locations. I asked the model for photos taken in Madrid, San Francisco, Japan, Napoli, or Tunisia, and it was spot on. The model is not using any GPS or embedded EXIF information: it just seems to recognize distinctive features of big cities or of subjects typical of different areas.
Text query: Japan nature.
During the meeting of our study group on Sunday, we also tried a couple of more conceptual queries, like "gloomy day" or "happiness" (thanks for the suggestions!). This is what the system came up with:
Text query: gloomy day.
Text query: happiness.
Given that this prototype needs a vector to locate other nearby vectors, I thought it could also be used to look for images visually similar to a given image. For example, I googled photos of the Aurora Borealis and used one of them as input for the model. These were the first results it found in my library:




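The same index supports this kind of image-to-image query: embed the query image with the vision tower and rank the library by dot product, exactly as for text. This is a sketch; the aurora file path is a placeholder, and it reuses the "image_index.pt" file assumed above.

```python
# Sketch: visual similarity search against the pre-computed image index.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
index = torch.load("image_index.pt")
paths, image_features = index["paths"], index["features"]

query_image = Image.open("downloads/aurora_borealis.jpg").convert("RGB")  # placeholder path
inputs = processor(images=query_image, return_tensors="pt")
with torch.no_grad():
    q = model.get_image_features(**inputs)
q = q / q.norm(dim=-1, keepdim=True)

scores = torch.matmul(q, image_features.T).squeeze(0)
print([paths[i] for i in scores.topk(4).indices.tolist()])
```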
Visual and Text search, Vector Arithmetic

I finally tried to combine both image and text representations, but admittedly, this didn't work too well. My idea was to combine two vectors using some arithmetic operation and then try to find the points closest to the new location.
For example, take the visual embedding for the photo we saw above about the girl leaning on a red wall, then subtract the embedding of the text "wall." Would that locate photos of women?
In this situation, I found that the embeddings for images appear to carry more weight than those from texts. I tried a few scale factors and finally settled on a scale of 4 for the text embeddings when combining them with image embeddings. Even so, the results appeared to lean toward one modality or the other rather than a true combination. These are the results (remember the slider):



In the first case (removing the "wall" from the original photo), the model seems to find photos of people, mostly women. In the second case, however (removing "girl"), it still gets dragged to photos very similar to the original one.
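For completeness, this is a sketch of the combination step: subtract a scaled text embedding from an image embedding, re-normalize, and search with the result. The scale of 4 is the value mentioned above; the photo path is a placeholder, and the index file is the one assumed in the earlier sketches.

```python
# Sketch: vector arithmetic in CLIP space ("image minus wall"), then nearest-neighbor search.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
index = torch.load("image_index.pt")
paths, image_features = index["paths"], index["features"]

image = Image.open("exports/480px/girl_red_wall.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    img_vec = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_vec = model.get_text_features(**processor(text=["wall"], return_tensors="pt", padding=True))

img_vec = img_vec / img_vec.norm(dim=-1, keepdim=True)
txt_vec = txt_vec / txt_vec.norm(dim=-1, keepdim=True)

combined = img_vec - 4.0 * txt_vec                         # scale of 4 on the text embedding
combined = combined / combined.norm(dim=-1, keepdim=True)

scores = torch.matmul(combined, image_features.T).squeeze(0)
print([paths[i] for i in scores.topk(4).indices.tolist()])
```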
This technique looks promising, but it needs more exploration!

Final thoughts

  • Multimodal models such as CLIP can encode images and text to the same representation space.
  • We can take advantage of that to look for images in a private library using free-form text descriptions. We need to create a big table with all the image embeddings.
  • Visual search (give me photos similar to this one) is also possible. Vector arithmetic seems challenging.
  • Ideas for improvement:
    • Try other models! There are larger and smaller variants of CLIP. There are also open-source versions of CLIP trained on public datasets such as LAION.
    • Pre-processing: the model requires square images of 224x224 pixels. Mine were downscaled and center-cropped, which might have lost details on the edges.
    • Thematic fine-tuning. For example, fine-tuning on a dataset of high-quality or stock photos would probably work better for professional photographers who want to explore their own massive libraries.
    • Data exploration. For example, index all of ImageNet, then query to find things that may or may not be there.
  • It was fun!