Elasticsearch: Custom Analysis

March 2, 2020 by Utah Ingersoll

GIPHY uses Elasticsearch to deliver all the best GIFs. Elasticsearch is an extremely fast, open source search engine supported by a great community. It has a robust Query API, which allows us to iterate quickly on our search algorithm. The Mapping API enables us to prototype new signals and account for the quirks in GIF metadata.

This article describes text analysis as it relates to Elasticsearch, covers built-in analysis, and introduces the development of custom analysis. While we will not cover text analysis exhaustively, we aim to provide solid tools for further exploration.

You are encouraged to follow along using the Docker environment described below.



Docker Setup

To follow the exercises in this tutorial you will need to install the following:

  1. Docker Desktop Community Edition
  2. HTTPie

Installing Docker Desktop may require a full restart.

Once Docker is running, let’s pull the Elasticsearch container by entering the following in your console:

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.5.2

Now let’s start your local container:

docker run -p 9200:9200 -e "discovery.type=single-node" \
docker.elastic.co/elasticsearch/elasticsearch:7.5.2

From a new console, confirm Elasticsearch is running using HTTPie:

http localhost:9200

You should receive a response similar to the following:

HTTP/1.1 200 OK

{
    "cluster_name": "docker-cluster",
    "cluster_uuid": "_LQxOs63Rte4xlC8AQqLvw",
    "name": "cce47c60c1fd",
    "tagline": "You Know, for Search",
    "version": {
        "build_date": "2020-01-15T12:11:52.313576Z",
        "build_flavor": "default",
        "build_hash": "8bec50e1e0ad29dad5653712cf3bb580cd1afcdf",
        "build_snapshot": false,
        "build_type": "docker",
        "lucene_version": "8.3.0",
        "minimum_index_compatibility_version": "6.0.0-beta1",
        "minimum_wire_compatibility_version": "6.8.0",
        "number": "7.5.2"
    }
}

Congratulations! You are all set up for the exercises below!


Introduction to Text Analysis

Text Analysis is the process of decomposing text into small components called tokens. Frequently, tokens are just words.

Tokens produced by analysis are used to build the inverted indices which Elasticsearch uses to retrieve and rank documents. Analysis also collects term counts, positions, and other data for ranking documents.

Elasticsearch documents are composed of fields. Each field is assigned a data type either by mappings or through inference. Each data type has an implicit analyzer, but you may configure custom analyzers when the defaults do not suit your needs.

Incoming queries are parsed using the same analysis used at index time to ensure searches are operating on the same set of tokens.


How Analysis Works

Analysis consists of three parts:

  1. Character Filters transform the original string by replacing or adding characters.
  2. A Tokenizer decomposes the text into tokens, usually splitting at word boundaries (whitespace and punctuation) to form words.
  3. Token Filters then remove or transform the tokens created by the Tokenizer. Common Token Filters include stop-word removal and stemming.
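
To see all three stages working together, you can pass an ad-hoc pipeline directly to the Analyze API (covered in the next section). Here is a minimal sketch using three built-ins: the html_strip character filter, the standard tokenizer, and the lowercase token filter:

http localhost:9200/_analyze <<< '{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Lost</b> in Translation"
}'

The character filter strips the HTML tags, the tokenizer splits the remaining text into words, and the token filter lowercases them, yielding the tokens “lost”, “in” and “translation”.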

[Figure: the analysis pipeline, from character filters through the tokenizer to token filters]

Analysis in Action

You can inspect analysis before indexing using the Analyze API.


Example 1: Standard Analysis

Use HTTPie to post the phrase “lost in translation” to your local Elasticsearch Analyze API:

http localhost:9200/_analyze <<< '{
  "text": "lost in translation"
}'

You should receive the following in response:

{
    "tokens": [
        {
            "end_offset": 4,
            "position": 0,
            "start_offset": 0,
            "token": "lost",
            "type": "<ALPHANUM>"
        },
        {
            "end_offset": 7,
            "position": 1,
            "start_offset": 5,
            "token": "in",
            "type": "<ALPHANUM>"
        },
        {
            "end_offset": 19,
            "position": 2,
            "start_offset": 8,
            "token": "translation",
            "type": "<ALPHANUM>"
        }
    ]
}

Since we did not specify an analyzer, Elasticsearch used the Standard Analyzer. The phrase “lost in translation” has been broken into the three tokens “lost”, “in” and “translation”.
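
You can make the default explicit by naming the analyzer; this request should return the same three tokens:

http localhost:9200/_analyze <<< '{
  "analyzer": "standard",
  "text": "lost in translation"
}'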


Built-in Analysis

Elasticsearch has default analyzers for each data type. The Text data type defaults to the Standard Analyzer. There are also language-specific analyzers which will outperform the default when the language is known.


Example 2: English Analysis

Let’s try analyzing “lost in translation” using the English Language Analyzer. The English analyzer has no Character Filters; it uses the standard tokenizer and passes the resulting tokens through a lowercase filter, a stop word filter, and a stemmer.

Enter the following in your terminal:

http localhost:9200/_analyze <<< '{
  "analyzer": "english",
  "text": "lost in translation"
}'

This time we will receive only two tokens, “lost” and “translat”.

{
    "tokens": [
        {
            "end_offset": 4,
            "position": 0,
            "start_offset": 0,
            "token": "lost",
            "type": "<ALPHANUM>"
        },
        {
            "end_offset": 19,
            "position": 2,
            "start_offset": 8,
            "token": "translat",
            "type": "<ALPHANUM>"
        }
    ]
}

The English analyzer removed the stop word “in” and stemmed “translation” to “translat” (stemming is funny like that). Stop words are very frequently occurring words like “a” or “it.” Adding stop words to the index adversely impacts performance while doing little to improve the relevance of results. Stemming folds words with similar meanings, like “translate” and “translation,” down to a single token, “translat,” which has the overall effect of improving recall.
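
You can watch stemming in isolation using the built-in porter_stem token filter, which implements the Porter stemming algorithm (a quick sketch):

http localhost:9200/_analyze <<< '{
  "tokenizer": "standard",
  "filter": ["porter_stem"],
  "text": "translate translation"
}'

Both words come back as the single stem “translat”.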


Example 3: Phrase Matching using English Analysis

Let’s post mappings defining a single field named caption with English analysis.

http PUT localhost:9200/gifs <<< '{
  "mappings": {
      "properties": {
        "caption": {
          "type": "text",
          "analyzer": "english"
        }
      }
  }
}'

Next, let’s add some documents using the bulk API.

http PUT localhost:9200/_bulk <<< '
  { "index" : { "_index" : "gifs", "_id" : "1" } }
  { "caption": "Happy birthday my love" }
  { "index" : { "_index" : "gifs", "_id" : "2" } }
  { "caption": "happy birthday to me" }
  { "index" : { "_index" : "gifs", "_id" : "3" } }
  { "caption": "happy birthday my friend" }
'

Now let’s run a query:

http GET localhost:9200/gifs/_search <<< '{
  "query": {
    "match_phrase" : {
      "caption" : "Happy birthday to"
    }
  }
}'

You should receive the following results (response abridged to the hits array):

[
  {
    "_id": "2",
    "_index": "gifs",
    "_score": 0.28852317,
    "_source": {
      "caption": "happy birthday to me"
    },
    "_type": "_doc"
  },
  {
    "_id": "1",
    "_index": "gifs",
    "_score": 0.25748682,
    "_source": {
      "caption": "Happy birthday my love"
    },
    "_type": "_doc"
  },
  {
    "_id": "3",
    "_index": "gifs",
    "_score": 0.25748682,
    "_source": {
      "caption": "happy birthday my friend"
    },
    "_type": "_doc"
  }
]

The query “Happy birthday to” matches all documents. This is because the English analyzer removed the stop word “to,” both at index time and at query time. Our effective query was “happy birthday,” which matched all three documents.

If we wanted to match with more precision we could switch to an analyzer without a stop word filter. Let’s explore that further in the next example.


Example 4: Standard Analysis

Let’s post mappings with the caption field set to standard analysis.

http PUT localhost:9200/gifs-standard <<< '{
  "mappings": {
      "properties": {
        "caption": {
          "type": "text",
          "analyzer": "standard"
        }
      }
  }
}'

Let’s add the same documents as before:

http PUT localhost:9200/_bulk <<< '
  { "index" : { "_index" : "gifs-standard", "_id" : "1" } }
  { "caption": "Happy birthday my love" }
  { "index" : { "_index" : "gifs-standard", "_id" : "2" } }
  { "caption": "happy birthday to me" }
  { "index" : { "_index" : "gifs-standard", "_id" : "3" } }
  { "caption": "happy birthday my friend" }
'

Now let’s rerun our query against the new index:

http GET localhost:9200/gifs-standard/_search <<< '{
  "query": {
    "match_phrase" : {
      "caption" : "Happy birthday to"
    }
  }
}'

This time we should receive only the result matching the entire phrase:

[
  {
    "_id": "2",
    "_index": "gifs-standard",
    "_score": 1.247892,
    "_source": {
      "caption": "happy birthday to me"
    },
    "_type": "_doc"
  }
]

Custom Analysis

If we wanted the query “Happy birthday my” to match only the document captioned “Happy birthday my love,” we would need to define a custom analyzer without the lowercase filter found in the standard and English analyzers.

http PUT localhost:9200/gifs-custom <<< '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", 
          "tokenizer": "standard",
          "char_filter": [],
          "filter": []
        }
      }
    }
  },
  "mappings": {
      "properties": {
        "caption": {
          "type": "text",
          "analyzer": "my_custom_analyzer"
        }
      }
  }
}'

Now let’s add our documents:

http PUT localhost:9200/_bulk <<< '
  { "index" : { "_index" : "gifs-custom", "_id" : "1" } }
  { "caption": "Happy birthday my love" }
  { "index" : { "_index" : "gifs-custom", "_id" : "2" } }
  { "caption": "happy birthday to me" }
  { "index" : { "_index" : "gifs-custom", "_id" : "3" } }
  { "caption": "happy birthday my friend" }
'

Now run the query:

http GET localhost:9200/gifs-custom/_search <<< '{
  "query": {
    "match_phrase" : {
      "caption" : "Happy birthday my"
    }
  }
}'

And you will receive only the document we expect:

[
  {
    "_id": "1",
    "_index": "gifs-custom",
    "_score": 1.5843642,
    "_source": {
      "caption": "Happy birthday my love"
    },
    "_type": "_doc"
  }
]

You can combine different character filters, tokenizers, and token filters to tailor text analysis to your needs.
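
For example, a pipeline that keeps stop words but folds case and accents might combine the standard tokenizer with the built-in lowercase and asciifolding token filters (a hypothetical sketch):

http localhost:9200/_analyze <<< '{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Café Olé"
}'

This yields the tokens “cafe” and “ole”, so queries with or without accents would match the same documents.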


Inspecting Mappings

Let’s take a closer look at what happened in Example 3.

We can invoke the analysis defined on a specific mapping like this:

http GET localhost:9200/gifs/_analyze <<< '{
  "field" : "caption",
  "text" : "Happy birthday to"
}'

You should receive two tokens:

[
    {
        "end_offset": 5,
        "position": 0,
        "start_offset": 0,
        "token": "happi",
        "type": "<ALPHANUM>"
    },
    {
        "end_offset": 14,
        "position": 1,
        "start_offset": 6,
        "token": "birthdai",
        "type": "<ALPHANUM>"
    }
]

“Happy” and “birthday” were stemmed to “happi” and “birthdai” respectively. The algorithm that produced these odd stems is called the Porter stemmer.

Most importantly, the stop word filter removed the word “to”.
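
You can verify this in isolation by defining a stop filter inline with the built-in _english_ stop word list (a minimal sketch):

http localhost:9200/_analyze <<< '{
  "tokenizer": "standard",
  "filter": [{ "type": "stop", "stopwords": "_english_" }],
  "text": "Happy birthday to"
}'

Only “Happy” and “birthday” survive the filter; “to” is dropped.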

Let’s now see what happens when we use standard analysis:

http GET localhost:9200/gifs-standard/_analyze <<< '{
  "field" : "caption",
  "text" : "happy birthday to"
}'

You will receive three tokens:

[
  {
    "end_offset": 5,
    "position": 0,
    "start_offset": 0,
    "token": "happy",
    "type": "<ALPHANUM>"
  },
  {
    "end_offset": 14,
    "position": 1,
    "start_offset": 6,
    "token": "birthday",
    "type": "<ALPHANUM>"
  },
  {
    "end_offset": 17,
    "position": 2,
    "start_offset": 15,
    "token": "to",
    "type": "<ALPHANUM>"
  }
]

The standard analyzer only separated the words at word boundaries (and lowercased them); no stop words were removed. This allows the phrase match to find “happy birthday to” in the document we expect to be returned.


Further Exploration

With the tools outlined above in hand, you should be well prepared to dive into custom analysis. Elasticsearch has extensive documentation on analysis which, when paired with these examples, will help you craft custom analysis pipelines suited to your data.

— Utah Ingersoll, Senior Software Engineer