GIPHY Search Gets Glasses
September 26, 2017
Hey! My name is Bethany and I was a software engineering intern on the Search team this summer. I was also a 2017 hackNY fellow. I study computer science at the University of Pennsylvania. In this post, I discuss my internship project: leveraging optical character recognition (OCR) to help you find the perfect GIF.
At the beginning of the summer, my friend was starting her first full-time job. I wanted to ~GIF~ her a pep talk before her first day, and I had the perfect movie reference in mind: Becca from Bridesmaids saying, “You are more beautiful than Cinderella! You smell like pine needles and have a face like sunshine!”
The GIF I was envisioning:
I searched GIPHY for “you are more beautiful than Cinderella” to no avail, then searched for “bridesmaids” and scrolled through several dozen results before giving up.
Searching for Bridesmaids or the direct quote did not yield any useful results:
It was easy to search for GIFs with popular tags, but it was unrealistic to expect that someone would have tagged this GIF with the full line from the movie. And yet, I knew this GIF was out there. I wished there was a way to find the exact GIF that was pulled from the line in that movie / scene from that TV show / lyric from that song. Luckily, I was about to start my internship at GIPHY, with the opportunity to tackle this problem head-on!
GIF Me the Tools and I’ll Finish the Job
When I started my internship, GIPHY engineers had already generated metadata about our collection of GIFs using Google Cloud Vision, an out-of-the-box image recognition tool that is powered by machine learning. Specifically, Cloud Vision had performed optical character recognition (OCR) on our GIF corpus to detect text or captions within the image. The OCR results we got back from Google Cloud Vision were so good that my team felt confident about incorporating the data directly into our search engine. I was tasked with parsing the data and indexing each GIF, then updating our search query to leverage the new, bolstered metadata.
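To make the parsing step concrete, here is a minimal sketch in Python of pulling caption text out of Cloud Vision OCR output. It relies on one documented detail of the Cloud Vision response: the first entry of `textAnnotations` holds the full detected text, and later entries hold individual words. Everything else is an assumption for illustration, not GIPHY's actual format: the one-JSON-record-per-line layout, the `gif_id` and `response` keys, and the function names.

```python
import json


def extract_caption(vision_response):
    """Return the detected text for one image, or "" if nothing was found."""
    annotations = vision_response.get("textAnnotations", [])
    if not annotations:
        return ""
    # The first annotation carries the whole detected text; the remaining
    # entries are individual words, which this sketch does not need.
    raw_text = annotations[0].get("description", "")
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(raw_text.split())


def captions_from_lines(lines):
    """Map gif_id -> caption for every record that contains detected text.

    Each line is assumed (hypothetically) to be a JSON object shaped like
    {"gif_id": "...", "response": {...Cloud Vision response...}}.
    """
    captions = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        caption = extract_caption(record["response"])
        if caption:
            captions[record["gif_id"]] = caption
    return captions


# Made-up sample records: one GIF with a caption, one without any text.
sample = [
    json.dumps({"gif_id": "abc123", "response": {"textAnnotations": [
        {"description": "WHERE ARE\nTHE TURTLES\n"},
        {"description": "WHERE"},
    ]}}),
    json.dumps({"gif_id": "def456", "response": {}}),
]
print(captions_from_lines(sample))
```

GIFs with no detected text are skipped, so only documents that gain a caption would need to be updated in the index.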
The Search team uses several tools that helped me along the way:
- Luigi: I used this Python framework to write a batch job that processed the JSON data generated from Google Cloud Vision
- AWS Simple Queue Service: I used this message queueing service to coordinate data transfer from Google Cloud Vision to documents in our search index
- Elasticsearch: GIPHY search is built on top of this open source search engine. GIF documents are stored here, and the search query returns results based on the data in our Elasticsearch index.
Bringing all these components together looks something like this:
The biggest technical challenge I faced during my internship was writing code that would scale. Most of my projects in school ran on such small datasets that my failure to optimize runtime was never an impediment. At GIPHY, that didn’t fly. I started testing my first worker, schedulable PHP code that prepares GIF updates for Elasticsearch, and realized that it would take 80+ hours (that’s more than 3 days) to process data for millions of GIFs. Clearly, scale and speed weren’t things I could ignore.
I spent a week and a half focusing on ways to speed up my Luigi task and my PHP worker. First, I investigated expensive operations involved in each process and how I might make fewer calls to them. For example, instead of making a new network connection for each individual piece of data, I rewrote the method so that it would hit each service in batches.
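The batching idea can be sketched as follows. This is an illustration in Python rather than the PHP worker itself. It groups caption updates into chunks and builds one request body per chunk in the newline-delimited format that the Elasticsearch bulk API expects, an action line followed by a partial document. The index name `gifs`, the `caption` field, and the batch size are assumptions. Depending on the Elasticsearch version, the action line may need extra fields such as `_type`.

```python
import json
from itertools import islice


def batched(items, size):
    """Yield successive lists containing at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_update_body(index, updates):
    """Build one newline-delimited bulk body for a batch of caption updates."""
    lines = []
    for gif_id, caption in updates:
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_index": index, "_id": gif_id}}))
        # Partial document: the fields to merge into that document.
        lines.append(json.dumps({"doc": {"caption": caption}}))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Five made-up updates sent as three requests instead of five.
updates = [(f"gif{i}", f"caption {i}") for i in range(5)]
bodies = [bulk_update_body("gifs", batch) for batch in batched(updates, 2)]
print(len(bodies))
```

With a realistic batch size in the hundreds or thousands, the number of network round trips drops by the same factor.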
Additionally, I sought out ways to run the work in parallel. For example, I refactored my first worker to operate over specific date parameters, so that I could distribute the work across multiple workers, each focusing on one week of data.
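A rough sketch of that partitioning, again in Python and under stated assumptions: the date range is split into week-long windows, and each window is handed to a separate worker. Here `process_window` is a hypothetical placeholder, and a thread pool stands in for whatever mechanism actually scheduled the real workers.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def weekly_windows(start, end):
    """Split the half-open range [start, end) into windows of at most 7 days."""
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=7), end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


def process_window(window):
    """Hypothetical stand-in for a worker handling one window of GIFs."""
    start, end = window
    # A real worker would load and index the GIFs created in this window;
    # this placeholder only reports how many days the window covers.
    return (end - start).days


windows = weekly_windows(date(2017, 6, 1), date(2017, 6, 20))

# Each window is independent, so the windows can be processed concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    days_processed = list(pool.map(process_window, windows))

print(windows)
print(days_processed)
```

Because the windows do not overlap, no GIF is processed twice, and a failed window can be retried without redoing the others.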
Ultimately, this sped up the processing time from 80 hours to only 8 hours, and in the process I acquired some new strategies and perspectives on writing scalable code.
The Home “Stretch”: Querying Elasticsearch
Finally, all the data was indexed and it was time to incorporate text/caption metadata into our query. To perform a text-based search against Elasticsearch, I would be utilizing some form of a match query.
There were a few variables to consider: first, I had to decide whether to compare the caption data against the search input using a match query with an “and” operator or using a match phrase query. The “and” match makes sense because it searches for captions that contain all of the words in the search input, which is what I wanted when I was searching for a line from a movie. However, I ultimately chose to use a match phrase query, which goes a step further: the words in the search input must appear in the caption next to each other and in the same order. This guarantees that whatever part of the movie quote I type appears intact in the caption of every result that Elasticsearch returns.
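The difference between the two options can be sketched in Python. The first two functions build the corresponding Elasticsearch query bodies, where the field name `caption` is an assumption. The last two are a deliberately simplified local imitation of each query's matching rule, using lowercase whitespace tokens. Real Elasticsearch analysis and scoring are more involved.

```python
def match_and_query(field, text):
    """Match query in which every search term must appear, in any order."""
    return {"query": {"match": {field: {"query": text, "operator": "and"}}}}


def match_phrase_query(field, text):
    """Match phrase query: the terms must appear adjacent and in order."""
    return {"query": {"match_phrase": {field: text}}}


def tokens(text):
    """Very rough stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def matches_all_terms(caption, search):
    """Simplified rule for the "and" match: every search term is present."""
    caption_tokens = set(tokens(caption))
    return all(term in caption_tokens for term in tokens(search))


def matches_phrase(caption, search):
    """Simplified rule for match phrase: search terms form a contiguous run."""
    caption_tokens, search_tokens = tokens(caption), tokens(search)
    if not search_tokens:
        return False
    width = len(search_tokens)
    return any(
        caption_tokens[i:i + width] == search_tokens
        for i in range(len(caption_tokens) - width + 1)
    )


search = "you are more beautiful than Cinderella"
scrambled = "Cinderella is more beautiful than you are"
quote = "You are more beautiful than Cinderella you smell like pine needles"

print(match_phrase_query("caption", search))
print(matches_all_terms(scrambled, search), matches_phrase(scrambled, search))
print(matches_all_terms(quote, search), matches_phrase(quote, search))
```

The scrambled caption contains every search term, so the “and” rule accepts it, but only the caption with the quote intact passes the phrase rule.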
Another variable to consider was how much to trust data from Google Cloud Vision relative to other sources of data we have about a GIF. For example, how meaningful is a GIF’s caption compared to its tags, the frequency with which users click on it, or its description on the page it came from? In most cases, the relevance of a caption inside an image will be greater than or equal to the relevance of the image’s description from its source, so I decided to incorporate the data from Cloud Vision directly above this level in the relevance model hierarchy.
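One way to express this kind of ordering is a bool query whose clauses carry decreasing boosts. The Python sketch below is purely illustrative. The post does not describe GIPHY's actual relevance model, so the field names (`tags`, `caption`, `source_description`), the choice of signals, and the boost values are all assumptions. Boosts also only bias the score and do not enforce a strict hierarchy.

```python
def layered_query(text, signals):
    """Build a bool query from signals ordered from most to least trusted.

    `signals` is a list of (field, query_type) pairs, where query_type is
    "match" or "match_phrase". Earlier entries receive larger boosts.
    """
    should = []
    for rank, (field, query_type) in enumerate(signals):
        boost = float(len(signals) - rank)
        should.append({query_type: {field: {"query": text, "boost": boost}}})
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


# Hypothetical ordering: the OCR caption sits directly above the description
# taken from the GIF's source page.
signals = [
    ("tags", "match"),
    ("caption", "match_phrase"),
    ("source_description", "match"),
]
query = layered_query("where are the turtles", signals)
print(query)
```

Moving the caption up or down the list changes how much weight OCR text gets relative to the other signals, which is the decision this paragraph describes.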
Internal Testing: See the GIFference?
Updated query in hand, it was time to find out whether my change actually made any difference. I used GIPHY’s suite of internal tools to visualize how a change to the search query would affect search results. The first tool, Search UX, demonstrates the impact on the scale of a single search. It shows the Elasticsearch score for each GIF in the result set, and it can be expanded to view the raw JSON response that explains why the GIF was included in the results. Search UX shows me a very dramatic before-and-after when I search for “where are the turtles,” a quote from The Office:
The second tool, Side-by-Side, examines the query change on a larger scale by running the old and new queries against a random set of search terms and aggregating the affected metrics in an interactive dashboard. It’s useful to investigate the results and ensure that the change will not disrupt popular searches, like “cat” or “happy birthday,” that already deliver high-quality content.
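The core of such a comparison can be sketched in a few lines of Python. This is not the Side-by-Side tool itself: the function names, the overlap metric, and the hard-coded result lists standing in for the old and new queries are all assumptions for illustration.

```python
def compare_results(old_ids, new_ids):
    """Summarize how one search term's results differ between two queries."""
    old_set, new_set = set(old_ids), set(new_ids)
    union = old_set | new_set
    return {
        # GIFs that only the new query returns, in their new ranking order.
        "added": [gif_id for gif_id in new_ids if gif_id not in old_set],
        # GIFs that the new query no longer returns.
        "removed": [gif_id for gif_id in old_ids if gif_id not in new_set],
        # Share of all returned GIFs that both queries agree on.
        "overlap": len(old_set & new_set) / len(union) if union else 1.0,
    }


def side_by_side(terms, old_search, new_search):
    """Run both search functions for each term and collect the differences."""
    return {
        term: compare_results(old_search(term), new_search(term))
        for term in terms
    }


# Made-up result lists standing in for real search responses.
old_results = {"cat": ["g1", "g2", "g3"], "spirit fingers": ["g4", "g5"]}
new_results = {"cat": ["g1", "g2", "g3"], "spirit fingers": ["g6", "g4"]}

report = side_by_side(["cat", "spirit fingers"], old_results.get, new_results.get)
print(report)
```

A popular term like “cat” shows full overlap, meaning the change leaves it alone, while a phrase search shows which GIFs the new query adds and removes.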
Zooming in on any of the queries from the side-by-side test yields more specific insights, like the set of GIFs that would be added and removed by the proposed update. Here’s what that looks like when we search for “spirit fingers”:
So far so good!
Time to GIF It a Go
The internal tools indicate a positive change, but it’s time to let our users decide. I launched the updated query as an A/B experiment, and the results look promising: across all search traffic for the duration of the experiment, the lift in click-through rate was 0.5%. However, my change affects a very specific type of search, especially longer phrases, so the impact of the change is even more noticeable for queries in this category. For example, click-through rate when searching for the phrase “never give up never surrender” increased 32%, and click-through rate for the phrase “gotta be quicker than that” increased 31%. In addition to famous quotes from movies and TV shows, we saw improvements for general phrases like “everything will be ok” and “there you go”. The final click-through rate for these queries is almost 100%. To put my project to the ultimate test, however, I went to giphy.com to revisit my search query from the beginning of the summer:
Success! The search results are much improved. Now, the next time you use GIPHY to search for a specific scene or a direct quote, the results will show you exactly what you were looking for.
Learn more about my project? Click here.
Original “GIPHY Search Gets Glasses” GIF by Leroy Patterson, GIPHY Studios Artist
– Bethany Davis, Engineering Intern