GIPHY Gets Tagged by K-Nearest Neighbors
October 16, 2019 by Alice Phan
When a brand new GIF is uploaded to GIPHY.com, there's usually not much data associated with it to make it discoverable via search. While we do rely on machine learning for tag generation, we also allow uploading users and GIPHY's content team to manually add "tags": keywords that help ensure good content is easy to find. But how does one decide on the most effective tags?
Tagging can be a fairly laborious, subjective process. Additionally, some GIFs are less straightforward to tag than others, which makes the process even more time-consuming.
What if there was a way to suggest tags for a GIF using underlying data? Not only would automating the process save time, it would also increase the likelihood that every GIF appears in relevant search results. To test this idea, I built a service called GIPHY Tagz, which recommends tags for a GIF based on its metadata and a K-Nearest Neighbors model.
What Is the K-Nearest Neighbors Algorithm?
K-Nearest Neighbors (KNN) is a machine learning algorithm that classifies unlabeled data based on labeled data given to the model. For example, the model can learn that GIFs with hand-waving motions have previously been tagged with "hello," and should therefore suggest "hello" for new GIFs with hand-waving motions.
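As a minimal sketch of the idea (the feature scores and tags here are invented, and scikit-learn's KNeighborsClassifier stands in for the real model):

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: each row holds a GIF's confidence scores
# for two features, say "hand" and "waving".
X_train = [
    [0.90, 0.80],  # a waving GIF, tagged "hello"
    [0.95, 0.70],  # another waving GIF, tagged "hello"
    [0.10, 0.05],  # no waving, tagged "other"
    [0.20, 0.10],  # no waving, tagged "other"
]
y_train = ["hello", "hello", "other", "other"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# A new GIF with strong hand/waving scores lands among the "hello"
# examples, so a majority of its 3 nearest neighbors vote "hello".
print(model.predict([[0.85, 0.75]])[0])  # hello
```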
To determine which GIFs have hand-waving motions (or other features), we use Google Cloud Vision, which detects features in images. In addition to the feature list, GCV data includes a "confidence score" for each feature, which conveys how confident the model is that the image contains that feature.
In this simplified model, the x- and y-axes are the confidence scores for two features. In a real KNN model, there would be one dimension per GCV feature. The trained KNN model can take a new GIF, find the K most similar GIFs based on feature similarity, and recommend those GIFs' tags.
Training and Testing the Model
As with other machine learning models, before we can use the KNN model, we need to train it on a set of training data. For GIPHY Tagz, I created a training set by starting with a list of the top 300 tags/search queries on GIPHY. I used this list to write a SQL query against GIPHY's Amazon Redshift database that selected, for each search term, the top 10 GIFs by click-through-rate difference. After fetching the metadata for each GIF, I dumped it all into one large JSON file, which supplied the features and confidence scores.
Fortunately, Python's scikit-learn library implements K-Nearest Neighbors. The training data took the form of a sparse matrix in which the rows are GIF IDs, the columns are features, and the values are the confidence scores. I loaded the training data into the KNN model and was ready to test it.
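A rough sketch of how such a model can be assembled (the GIF IDs, feature scores, and tags below are invented, and `suggest_tags` is a hypothetical helper, not GIPHY Tagz itself):

```python
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical training set: GIF id -> (GCV feature scores, tags).
training = {
    "gif1": ({"cat": 0.97, "kitten": 0.90}, ["cat", "cute"]),
    "gif2": ({"cat": 0.95, "yarn": 0.80}, ["cat", "play"]),
    "gif3": ({"pizza": 0.99, "cheese": 0.90}, ["food"]),
}
ids = list(training)

# DictVectorizer produces a sparse matrix by default: rows are GIFs,
# columns are features, values are confidence scores.
vec = DictVectorizer()
X = vec.fit_transform([features for features, _ in training.values()])

knn = NearestNeighbors(n_neighbors=2).fit(X)

def suggest_tags(features, k=2):
    """Return the most common tags among the k nearest training GIFs."""
    _, idx = knn.kneighbors(vec.transform([features]), n_neighbors=k)
    tags = Counter()
    for i in idx[0]:
        tags.update(training[ids[i]][1])
    return [tag for tag, _ in tags.most_common()]

# A cat-like query GIF lands nearest the two cat GIFs,
# so "cat" comes back as the top suggestion.
print(suggest_tags({"cat": 0.96, "kitten": 0.85}))
```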
Major props to anyone who specializes in machine learning, because the testing part is incredibly painful. The first time I fed a cat GIF into my KNN model, the tags returned were "spa" and "food." Of course, my KNN model was not fully accurate, due to noise in the data and the limitation of working with only the top 300 search queries. I spent weeks trying to improve the precision and recall of the returned tags, and I lost track of how many times I asked our data engineers for different approaches. I reduced the dimensionality of the training data with principal component analysis and normalized the feature scores with TF-IDF, both of which actually decreased the model's accuracy. One successful change was using cosine similarity instead of Euclidean distance to measure the distance between GIFs. When I was finally satisfied with the accuracy of the tags my KNN model returned, I collaborated with our internal React expert, Kyle, on a simple UI so that the service could be used by anyone in the company.
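In scikit-learn's NearestNeighbors, the distance metric is a single parameter, which makes that kind of comparison easy to try (the toy vectors below are invented to show how the two metrics can disagree):

```python
from sklearn.neighbors import NearestNeighbors

# Two hypothetical GIFs: index 0 has the same feature *mix* as the
# query but weaker scores overall; index 1 has strong scores in a
# different mix, with an overall magnitude close to the query's.
X = [
    [0.2, 0.1],  # same direction as the query, low magnitude
    [0.8, 0.8],  # different direction, similar magnitude
]
query = [[0.84, 0.42]]

euclid = NearestNeighbors(n_neighbors=1).fit(X)
cosine = NearestNeighbors(n_neighbors=1, metric="cosine").fit(X)

# Euclidean distance favors the GIF with similar overall magnitudes...
print(euclid.kneighbors(query)[1][0][0])  # index 1
# ...while cosine distance ignores magnitude and favors the GIF
# whose feature mix points in the same direction.
print(cosine.kneighbors(query)[1][0][0])  # index 0
```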
GIPHY, at its core, aims to deliver the perfect GIF for every situation, and GIPHY Tagz demonstrated that it could play an effective role in future tagging efforts.
I would like to give a special shoutout to Jesse Ling, who was my mentor, and to the Ad Products team for all the support they have given me throughout my internship, whether giving career and life advice, introducing data engineering 101, explaining React basics, or fixing my very many git conflicts.
— Alice Phan, Engineering Intern