How to Implement Autocomplete with Edge N-Grams in Elasticsearch

Introduction

If you want to provide the best possible search experience for your users, autocomplete functionality is a must-have feature. This functionality, which predicts the rest of a search term or phrase as the user types it, can be implemented with many databases. In this article, you’ll learn how to implement autocomplete with edge n-grams in Elasticsearch. Though the following tutorial provides step-by-step instructions for this implementation, feel free to jump to Just the Code if you’re already familiar with edge n-grams.

Understanding Autocomplete

If you’ve ever used Google, you know how helpful autocomplete can be. Autocomplete is sometimes referred to as “type-ahead search”, or “search-as-you-type”. It helps guide a user toward the results they want by prompting them with probable completions of the text that they’re typing. This reduces the amount of typing required by the user and helps them find what they want quickly.

Edge n-grams

In Elasticsearch, edge n-grams are used to implement autocomplete functionality. Though the terminology may sound unfamiliar, the underlying concepts are straightforward. An n-gram can be thought of as a sequence of n characters. Elasticsearch breaks up searchable text not just by individual terms, but by even smaller chunks. Let’s say a text field in Elasticsearch contained the word “Database”. This word could be broken up into single letters, called unigrams:

[ d, a, t, a, b, a, s, e]

When these individual letters are indexed, it becomes possible to search for “Database” based on just the letter “D”. N-grams work in a similar fashion, breaking terms up into smaller chunks of n characters each. Let’s look at the same example of the word “Database”, this time indexed as n-grams where n=2:

[ da, at, ta, ab, ba, as, se]

Similarly, trigrams, where n=3, would yield:

[ dat, ata, tab, aba, bas, ase]

Now, it’s obvious that no user is going to search for “Database” using the “ase” chunk of characters at the end of the word. That’s where edge n-grams come into play. Edge n-grams index only the n-grams that start at the beginning of the word. Depending on the value of n, the edge n-grams for our previous examples would include “D”, “Da”, and “Dat”.
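To make the idea concrete, edge n-grams are simply a word’s prefixes of increasing length. The following shell sketch (which uses plain `cut`, not Elasticsearch itself) generates the edge n-grams of “database” for n from 1 to 5, mirroring the min_gram and max_gram values used later in this article:

```shell
# Edge n-grams are a word's prefixes: the first n characters
# for each n from min_gram (1) up to max_gram (5).
word="database"
for n in 1 2 3 4 5; do
  printf '%s\n' "$word" | cut -c"1-$n"
done
# prints: d, da, dat, data, datab (one per line)
```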

Use Edge N-Grams with a Custom Filter and Analyzer

The code shown below is used to implement edge n-grams in Elasticsearch. It’s a bit complex, but the explanations that follow will clarify what’s going on:

curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/store?pretty" -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "autocomplete_filter" ]
        }
      }
    }
  }
}
'

In this example, a custom analyzer called autocomplete_analyzer was created. It uses the autocomplete_filter, which is of type edge_ngram. The min_gram and max_gram values specified in the code define the sizes of the n-grams that will be produced. Here, the n-grams range in length from 1 to 5 characters.

To test this analyzer on a string, use the Analyze API as follows:

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_analyze?pretty" -d '
{
  "analyzer": "autocomplete_analyzer",
  "text": "Database"
}
'

{
  "tokens" : [
    {
      "token" : "d",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "da",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dat",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "data",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "datab",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

In the example above, the custom analyzer has broken up the string “Database” into the n-grams “d”, “da”, “dat”, “data”, and “datab”. The first n-gram, “d”, is the n-gram with a length of 1, and the final n-gram, “datab”, is the n-gram with the max length of 5.

This test confirms that the edge n-gram analyzer works exactly as expected, so the next step is to implement it in an index.

In the following example, we’ll use an index called store that represents a grocery store. This store index will contain a type called products. Our example dataset contains just a handful of products, and each product has only a few fields: id, name, price, quantity, and department. The following table shows the dataset:

| id | name                      | price | quantity | department       |
|----|---------------------------|-------|----------|------------------|
| 1  | Multi-Grain Cereal        | 4.99  | 4        | Packaged Foods   |
| 2  | 1lb Ground Beef           | 3.99  | 29       | Meat and Seafood |
| 3  | Dozen Apples              | 2.49  | 12       | Produce          |
| 4  | Chocolate Bar             | 1.29  | 2        | Packaged Foods   |
| 5  | 1 Gallon Milk             | 3.29  | 16       | Dairy            |
| 6  | 0.5lb Jumbo Shrimp        | 5.29  | 12       | Meat and Seafood |
| 7  | Wheat Bread               | 1.29  | 5        | Bakery           |
| 8  | Pepperoni Pizza           | 2.99  | 5        | Frozen           |
| 9  | 12 Pack Cola              | 5.29  | 6        | Packaged Foods   |
| 10 | Lime Juice                | 0.99  | 20       | Produce          |
| 11 | 12 Pack Cherry Cola       | 5.59  | 95       | Packaged Foods   |
| 12 | 1 Gallon Soy Milk         | 3.39  | 10       | Dairy            |
| 13 | 1 Gallon Vanilla Soy Milk | 3.49  | 9        | Dairy            |
| 14 | 1 Gallon Orange Juice     | 3.29  | 4        | Juice            |

Now that we have a dataset, it’s time to set up a mapping for the index using the autocomplete_analyzer:

curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/store/_mapping/products?pretty" -d '
{
  "products": {
    "properties" : {
      "name": { "type": "text", "analyzer": "autocomplete_analyzer" },
      "price": { "type": "double" },
      "quantity": { "type": "integer" },
      "department": { "type": "keyword" }
    }
  }
}
'

The key line to pay attention to in this code is the following line, where the custom analyzer is set for the name field:

"name": { "type": "text", "analyzer": "autocomplete_analyzer"},
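As a side note, instead of overriding the analyzer in each search request, Elasticsearch also lets a mapping specify a separate search_analyzer, so that queries against the field automatically use the standard analyzer while indexing still uses the edge n-gram analyzer. A sketch of the name field with that option (the rest of the mapping would be unchanged):

```json
"name": {
  "type": "text",
  "analyzer": "autocomplete_analyzer",
  "search_analyzer": "standard"
}
```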

Next, the data is imported:

curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/_bulk?pretty" --data-binary @demo_store_db.json
{
  "took" : 227,
  "errors" : false,
  "items" : [
    {
      "create" : {
        "_index" : "store",
        "_type" : "products",
        "_id" : "1",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "create" : {
        "_index" : "store",
        "_type" : "products",
        "_id" : "2",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    ... (more results)
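The contents of demo_store_db.json aren’t shown in this article, but the _bulk API expects newline-delimited JSON in which every document is preceded by an action line (and the file must end with a newline). Based on the dataset above, its first few lines would look roughly like this:

```json
{ "create" : { "_index" : "store", "_type" : "products", "_id" : "1" } }
{ "id" : "1", "name" : "Multi-Grain Cereal", "price" : 4.99, "quantity" : 4, "department" : "Packaged Foods" }
{ "create" : { "_index" : "store", "_type" : "products", "_id" : "2" } }
{ "id" : "2", "name" : "1lb Ground Beef", "price" : 3.99, "quantity" : 29, "department" : "Meat and Seafood" }
```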

Once the data is indexed, testing can be done to see whether the autocomplete functionality works correctly. To do this, try querying for “Whe”, and confirm that “Wheat Bread” is returned as a result. Note that the query explicitly specifies the standard analyzer at search time; without this override, the query string itself would also be broken into edge n-grams, causing the search to match far more documents than intended (for example, “Whe” would match anything containing a term that starts with “w”):

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/products/_search?pretty" -d '
{
  "query": {
    "match": {
      "name": { "query": "Whe", "analyzer": "standard" }
    }
  }
}
'

{
  "took" : 108,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.513645,
    "hits" : [
      {
        "_index" : "store",
        "_type" : "products",
        "_id" : "7",
        "_score" : 1.513645,
        "_source" : {
          "id" : "7",
          "name" : "Wheat Bread",
          "price" : 1.29,
          "quantity" : 5,
          "department" : [
            "Bakery"
          ]
        }
      }
    ]
  }
}

As you can see in the output above, “Wheat Bread” was returned from a query for just “Whe”.

Conclusion

There’s no doubt that autocomplete functionality can help your users save time on their searches and find the results they want. If you’re interested in adding autocomplete to your search applications, Elasticsearch makes it simple. With this step-by-step guide, you can gain a better understanding of edge n-grams and learn how to use them in your code to create an optimal search experience for your users.

Just the Code

If you’re already familiar with edge n-grams and understand how they work, the following code includes everything needed to add autocomplete functionality in Elasticsearch:

curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/store?pretty" -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "autocomplete_filter" ]
        }
      }
    }
  }
}
'
curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_analyze?pretty" -d '
{
  "analyzer": "autocomplete_analyzer",
  "text": "Database"
}
'
curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/store/_mapping/products?pretty" -d '
{
  "products": {
    "properties" : {
      "name": { "type": "text", "analyzer": "autocomplete_analyzer" },
      "price": { "type": "double" },
      "quantity": { "type": "integer" },
      "department": { "type": "keyword" }
    }
  }
}
'
curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/_bulk?pretty" --data-binary @demo_store_db.json
curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/products/_search?pretty" -d '
{
  "query": {
    "match": {
      "name": { "query": "Whe", "analyzer": "standard" }
    }
  }
}
'
