How to Implement Autocomplete with Edge N-Grams in Elasticsearch


Introduction

If you want to provide the best possible search experience for your users, autocomplete functionality is a must-have feature. This functionality, which predicts the rest of a search term or phrase as the user types it, can be implemented with many databases. In this article, you’ll learn how to implement autocomplete with edge n-grams in Elasticsearch. Though the following tutorial provides step-by-step instructions for this implementation, feel free to jump to Just the Code if you’re already familiar with edge n-grams.

Understanding Autocomplete

If you’ve ever used Google, you know how helpful autocomplete can be. Autocomplete is sometimes referred to as “type-ahead search”, or “search-as-you-type”. It helps guide a user toward the results they want by prompting them with probable completions of the text that they’re typing. This reduces the amount of typing required by the user and helps them find what they want quickly.

Edge n-grams

In Elasticsearch, edge n-grams are one common way to implement autocomplete functionality. Though the terminology may sound unfamiliar, the underlying concepts are straightforward. An n-gram can be thought of as a sequence of n characters. Elasticsearch can be configured to break up searchable text not just into individual terms, but into even smaller chunks. Let’s say a text field in Elasticsearch contained the word “Database”. After lowercasing, this word could be broken up into single letters, called unigrams:

[ d, a, t, a, b, a, s, e]

When these individual letters are indexed, it becomes possible to search for “Database” just based on the letter “D”. N-grams work in a similar fashion, breaking terms up into smaller chunks of n characters each. Let’s look at the same example of the word “Database”, this time being indexed as n-grams where n=2:

[ da, at, ta, ab, ba, as, se]

Similarly, trigrams (n=3) would yield:

[ dat, ata, tab, aba, bas, ase]
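
To make the sliding-window idea concrete, here is a minimal shell sketch that lists the n-grams of a word. The ngrams function is a hypothetical helper written for this illustration, not part of Elasticsearch, and it only mimics the chunking described above:

```shell
# ngrams WORD N: print every run of N consecutive characters in WORD, one per line.
# Hypothetical helper for illustration only; Elasticsearch does this internally.
ngrams() {
  awk -v word="$1" -v n="$2" 'BEGIN {
    # slide a window of width n across the word, one character at a time
    for (i = 1; i + n - 1 <= length(word); i++)
      print substr(word, i, n)
  }'
}

ngrams database 2   # prints da, at, ta, ab, ba, as, se (one per line)
ngrams database 3   # prints dat, ata, tab, aba, bas, ase (one per line)
```

A word of length L yields L - n + 1 chunks, which is why “database” produces seven bigrams but only six trigrams.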

Now, it’s unlikely that a user will search for “Database” by typing the “ase” chunk of characters at the end of the word. That’s where edge n-grams come into play. Edge n-grams only index the n-grams that start at the beginning of the word. Depending on the value of n, the edge n-grams for our previous example would include “D”, “Da”, and “Dat”.
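
As a rough illustration, these prefixes can be generated with a few lines of shell. The edge_ngrams helper and its minimum and maximum length arguments are assumptions made for this sketch; Elasticsearch does this work internally through its edge_ngram filter:

```shell
# edge_ngrams WORD MIN MAX: print the prefixes of WORD whose length lies
# between MIN and MAX, one per line. Hypothetical helper for illustration only.
edge_ngrams() {
  awk -v word="$1" -v min="$2" -v max="$3" 'BEGIN {
    # only prefixes are emitted, so every chunk starts at the first character
    for (n = min + 0; n <= max + 0 && n <= length(word); n++)
      print substr(word, 1, n)
  }'
}

edge_ngrams Database 1 3   # prints D, Da, Dat (one per line)
```

Because every chunk is anchored at the start of the word, the number of chunks grows with the maximum length rather than with the length of the word.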

Use Edge N-Grams with a Custom Filter and Analyzer

The request shown below creates the store index with a custom edge n-gram filter and an analyzer that uses it. It’s a bit complex, but the explanations that follow will clarify what’s going on:

curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/store?pretty" -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  }
}
'

In this example, a custom analyzer called autocomplete_analyzer was created. It uses the standard tokenizer, followed by the lowercase filter and the custom autocomplete_filter, which is of type edge_ngram. The min_gram and max_gram settings define the sizes of the n-grams that will be produced. Here, the n-grams range in length from 1 to 5.
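
If it helps to picture the analyzer chain, the sketch below approximates its three stages in plain shell: splitting text into words, lowercasing them, and emitting edge n-grams of 1 to 5 characters. The analyze_autocomplete function is hypothetical, and splitting on spaces is only a crude stand-in for the standard tokenizer, which also handles punctuation and other word boundaries:

```shell
# analyze_autocomplete TEXT: approximate the autocomplete_analyzer chain.
# Stage 1: split on spaces (rough substitute for the standard tokenizer).
# Stage 2: lowercase each word (the lowercase filter).
# Stage 3: emit prefixes of 1 to 5 characters (edge_ngram, min_gram=1, max_gram=5).
analyze_autocomplete() {
  printf '%s\n' "$1" |
    tr -s ' ' '\n' |
    tr '[:upper:]' '[:lower:]' |
    awk '{ for (n = 1; n <= 5 && n <= length($0); n++) print substr($0, 1, n) }'
}

analyze_autocomplete "Wheat Bread"
# prints w, wh, whe, whea, wheat, b, br, bre, brea, bread (one per line)
```

Each word is expanded independently, so a multi-word product name can be found by the first few letters of any of its words.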

To test this analyzer on a string, use the Analyze API as follows:

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_analyze?pretty" -d '
{
  "analyzer": "autocomplete_analyzer",
  "text": "Database"
}
'

{
  "tokens" : [
    {
      "token" : "d",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "da",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dat",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "data",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "datab",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

In the example above, the custom analyzer has broken up the string “Database” into the n-grams “d”, “da”, “dat”, “data”, and “datab”. The first n-gram, “d”, is the n-gram with a length of 1, and the final n-gram, “datab”, is the n-gram with the max length of 5.

This test confirms that the edge n-gram analyzer works exactly as expected, so the next step is to implement it in an index.

The following example uses the store index, which represents a grocery store. This index will contain a type called products. (Mapping types such as products apply to Elasticsearch 6.x and earlier; they were deprecated in 7.0 and removed in 8.0.) Our example dataset contains just a handful of products, and each product has only a few fields: id, name, price, quantity, and department. The table below lists the dataset:

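| id | name | price | quantity | department |
|----|------|-------|----------|------------|
| 1 | Multi-Grain Cereal | 4.99 | 4 | Packaged Foods |
| 2 | 1lb Ground Beef | 3.99 | 29 | Meat and Seafood |
| 3 | Dozen Apples | 2.49 | 12 | Produce |
| 4 | Chocolate Bar | 1.29 | 2 | Packaged Foods, Checkout |
| 5 | 1 Gallon Milk | 3.29 | 16 | Dairy |
| 6 | 0.5lb Jumbo Shrimp | 5.29 | 12 | Meat and Seafood |
| 7 | Wheat Bread | 1.29 | 5 | Bakery |
| 8 | Pepperoni Pizza | 2.99 | 5 | Frozen |
| 9 | 12 Pack Cola | 5.29 | 6 | Packaged Foods |
| 10 | Lime Juice | 0.99 | 20 | Produce |
| 11 | 12 Pack Cherry Cola | 5.59 | 95 | Packaged Foods |
| 12 | 1 Gallon Soy Milk | 3.39 | 10 | Dairy |
| 13 | 1 Gallon Vanilla Soy Milk | 3.49 | 9 | Dairy |
| 14 | 1 Gallon Orange Juice | 3.29 | 4 | Juice |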

Now that we have a dataset, it’s time to set up a mapping for the index using the autocomplete_analyzer:

curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/store/_mapping/products?pretty" -d '
{
  "products": {
    "properties": {
      "name": { "type": "text", "analyzer": "autocomplete_analyzer" },
      "price": { "type": "double" },
      "quantity": { "type": "integer" },
      "department": { "type": "keyword" }
    }
  }
}
'

The key line in this code is the following one, which sets the custom analyzer for the name field:

"name": { "type": "text", "analyzer": "autocomplete_analyzer"},

Next, the data is imported:

curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/_bulk?pretty" --data-binary @demo_store_db.json

{
  "took" : 227,
  "errors" : false,
  "items" : [
    {
      "create" : {
        "_index" : "store",
        "_type" : "products",
        "_id" : "1",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "create" : {
        "_index" : "store",
        "_type" : "products",
        "_id" : "2",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 201
      }
    },
...
( MORE RESULTS)
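
The contents of demo_store_db.json aren’t shown in this tutorial. Assuming the file follows the bulk API’s newline-delimited JSON format, with a create action line followed by a document line, a two-product sample might be written as in the sketch below. The file name demo_store_sample.json is hypothetical, and the document shape is inferred from the bulk response above and from the search result shown later in this article:

```shell
# Write a two-product sample in the bulk API's newline-delimited JSON format.
# demo_store_sample.json is a hypothetical name, chosen so the real data file is not overwritten.
cat > demo_store_sample.json <<'EOF'
{"create": {"_index": "store", "_type": "products", "_id": "1"}}
{"id": "1", "name": "Multi-Grain Cereal", "price": 4.99, "quantity": 4, "department": ["Packaged Foods"]}
{"create": {"_index": "store", "_type": "products", "_id": "7"}}
{"id": "7", "name": "Wheat Bread", "price": 1.29, "quantity": 5, "department": ["Bakery"]}
EOF

# Each document takes two lines (action, then source), so two products give four lines.
wc -l < demo_store_sample.json
```

Each JSON object has to stay on a single line, and the file has to end with a newline, which the heredoc takes care of here.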

Once the data is indexed, you can test whether the autocomplete functionality works correctly. To do this, query for “Whe” and confirm that “Wheat Bread” is returned as a result:

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/products/_search?pretty" -d '
{
  "query": {
    "match": {
      "name": { "query": "Whe", "analyzer": "standard" }
    }
  }
}
'

{
  "took" : 108,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.513645,
    "hits" : [
      {
        "_index" : "store",
        "_type" : "products",
        "_id" : "7",
        "_score" : 1.513645,
        "_source" : {
          "id" : "7",
          "name" : "Wheat Bread",
          "price" : 1.29,
          "quantity" : 5,
          "department" : [
            "Bakery"
          ]
        }
      }
    ]
  }
}

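As you can see in the output above, “Wheat Bread” was returned from a query for just “Whe”. Note that the query sets "analyzer": "standard", so the search text itself is not broken into edge n-grams; it is matched as the single term “whe” against the prefixes stored in the index.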
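
To see why the query-time analyzer matters, the sketch below simulates matching against three product names from the dataset. The index_terms and hits helpers are hypothetical and only approximate Elasticsearch: words are split on spaces, and a product counts as a hit if any query term equals one of its stored prefixes, which mirrors the default OR behavior of a match query:

```shell
# index_terms NAME: the lowercased 1-5 character prefixes that would be stored for NAME.
index_terms() {
  printf '%s\n' "$1" | tr -s ' ' '\n' | tr '[:upper:]' '[:lower:]' |
    awk '{ for (n = 1; n <= 5 && n <= length($0); n++) print substr($0, 1, n) }'
}

# hits TERM...: print each sample product whose stored prefixes include any TERM.
hits() {
  for name in "Chocolate Bar" "12 Pack Cola" "12 Pack Cherry Cola"; do
    for term in "$@"; do
      if index_terms "$name" | grep -qxF -- "$term"; then
        printf '%s\n' "$name"
        break
      fi
    done
  done
}

echo "Query 'Che' analyzed with the standard analyzer (one term: che):"
hits che

echo "Query 'Che' analyzed with autocomplete_analyzer (terms: c, ch, che):"
hits c ch che
```

In this simulation, the single term “che” finds only “12 Pack Cherry Cola”, while the expanded terms also pull in “Chocolate Bar” and “12 Pack Cola” through the one-letter prefix “c”. The same sketch also suggests a limit of max_gram: running hits cherry prints nothing, because only prefixes of up to five characters are stored.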

Conclusion

There’s no doubt that autocomplete functionality can help your users save time on their searches and find the results they want. If you’re interested in adding autocomplete to your search applications, Elasticsearch makes it simple. With this step-by-step guide, you can gain a better understanding of edge n-grams and learn how to use them in your code to create an optimal search experience for your users.

Just the Code

If you’re already familiar with edge n-grams and understand how they work, the following code includes everything needed to add autocomplete functionality in Elasticsearch:

curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/store?pretty" -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  }
}
'
curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_analyze?pretty" -d '
{
  "analyzer": "autocomplete_analyzer",
  "text": "Database"
}
'

curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/store/_mapping/products?pretty" -d '
{
  "products": {
    "properties": {
      "name": { "type": "text", "analyzer": "autocomplete_analyzer" },
      "price": { "type": "double" },
      "quantity": { "type": "integer" },
      "department": { "type": "keyword" }
    }
  }
}
'
curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/_bulk?pretty" --data-binary @demo_store_db.json

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/products/_search?pretty" -d '
{
  "query": {
    "match": {
      "name": { "query": "Whe", "analyzer": "standard" }
    }
  }
}
'
