How to Use Fuzzy Query Matches in Elasticsearch

Introduction

If you want to provide the best possible search experience for your users, you need to make sure they’re getting the results they want. The problem is, sometimes users make mistakes. If you’re only querying for exact matches, simple typos and spelling errors can lead to empty results, which is not an ideal user experience. This is where fuzzy query matches can help: they tolerate small user errors and return the results users were most likely searching for. In this tutorial, we’ll provide step-by-step instructions on how to implement fuzzy query matching in Elasticsearch. If you’re already comfortable with the concept and would prefer to skip the explanations, feel free to jump to Just the Code.

The Levenshtein Edit Distance

Before taking a look at some sample code, it’s important to understand the concept that fuzzy matching is based on: the Levenshtein edit distance. While that term might seem unfamiliar, it’s easy to understand. The Levenshtein edit distance is a measure of how dissimilar two strings are. More precisely, it is the minimum number of single-character operations (substitutions, insertions, or deletions) needed to transform one string into another. Let’s look at some examples using the string “Database” and figure out the Levenshtein distance for a few variations on this string:

  1. Substitution: “Database” vs. “Databese”. The Levenshtein edit distance for this example is 1 because only one letter was substituted. Each substitution counts as 1 unit of distance. The string “Dataxxxx”, on the other hand, would have a Levenshtein edit distance of 4 because four letters were substituted.

  2. Insertion: “Database” vs. “Dattabase”. The Levenshtein edit distance for this example is 1 because only one letter was inserted. Each insertion counts as 1 unit of distance. Therefore, the string “Dattabaseee” would have a Levenshtein edit distance of 3 because three letters were inserted.

  3. Deletion: “Database” vs. “Databse”. Here’s another example where the Levenshtein edit distance is 1, this time because one letter was deleted. Like substitution and insertion, each deletion counts as 1 unit of distance. “Databa” would have a Levenshtein edit distance of 2 because two letters were deleted.

Let’s put all these ideas together and try a more complex example. Comparing the strings “Database” and “atabasoy” would result in a Levenshtein edit distance of 3 because we have one substitution (o for e), one insertion (the letter y at the end), and one deletion (the D at the beginning).
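To check these numbers yourself, here is a minimal Python sketch of the standard dynamic-programming approach to Levenshtein distance. The function name levenshtein is our own choice for illustration, and this is not the code Elasticsearch runs internally; it only reproduces the distances worked out above.

```python
def levenshtein(source, target):
    """Return the minimum number of single-character substitutions,
    insertions, and deletions needed to turn source into target."""
    # previous[j] holds the distance between the first i-1 characters
    # of source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# The variations of "Database" discussed in this section.
for variant in ["Databese", "Dataxxxx", "Dattabase", "Dattabaseee",
                "Databse", "Databa", "atabasoy"]:
    print(variant, levenshtein("Database", variant))
```

Running the loop prints the same distances as the examples above: 1, 4, 1, 3, 1, 2, and 3.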

Fuzzy Query Matching

In Elasticsearch, you can write queries that implement fuzzy matching and specify the maximum edit distance that will be allowed.

Let’s look at an example that uses an index called store, which represents a small grocery store. This store index contains a type called products, which lists the store’s products. To keep things simple, our example dataset contains only a handful of products with just the following fields: id, name, price, quantity, and department. The table below shows the dataset:

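| id | name | price | quantity | department |
|----|------|-------|----------|------------|
| 1 | Multi-Grain Cereal | 4.99 | 4 | Packaged Foods |
| 2 | 1lb Ground Beef | 3.99 | 29 | Meat and Seafood |
| 3 | Dozen Apples | 2.49 | 12 | Produce |
| 4 | Chocolate Bar | 1.29 | 2 | Packaged Foods, Checkout |
| 5 | 1 Gallon Milk | 3.29 | 16 | Dairy |
| 6 | 0.5lb Jumbo Shrimp | 5.29 | 12 | Meat and Seafood |
| 7 | Wheat Bread | 1.29 | 5 | Bakery |
| 8 | Pepperoni Pizza | 2.99 | 5 | Frozen |
| 9 | 12 Pack Cola | 5.29 | 6 | Packaged Foods |
| 10 | Lime Juice | 0.99 | 20 | Produce |
| 11 | 12 Pack Cherry Cola | 5.59 | 5 | Packaged Foods |
| 12 | 1 Gallon Soy Milk | 3.39 | 10 | Dairy |
| 13 | 1 Gallon Vanilla Soy Milk | 3.49 | 9 | Dairy |
| 14 | 1 Gallon Orange Juice | 3.29 | 4 | Juice |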

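The curl command to create our index mapping would look like this. Note that this syntax, with the products type nested under mappings, applies to Elasticsearch 6.x; from version 7.0 onward, mapping types are deprecated and the properties block sits directly under mappings: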

curl -H "Content-Type: application/json" -XPUT "127.0.0.1:9200/store" -d '
{
  "mappings": {
    "products": {
      "properties": {
        "name": { "type": "text" },
        "price": { "type": "double" },
        "quantity": { "type": "integer" },
        "department": { "type": "keyword" }
      }
    }
  }
}
'
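Once the index exists, the products need to be loaded into it. One common way to do that is the Elasticsearch bulk API, which takes newline-delimited JSON: an action line followed by the document itself, with a trailing newline at the end of the body. The Python sketch below only builds that request body for the first three products from the table. The helper name build_bulk_body is hypothetical, and storing department as a one-element list for rows 2 and 3 is an assumption based on how row 1 appears in the search response later in this tutorial.

```python
import json

# First three rows of the product table. The list-valued department for
# rows 2 and 3 is an assumption modeled on row 1 in the search response.
PRODUCTS = [
    {"id": "1", "name": "Multi-Grain Cereal", "price": 4.99,
     "quantity": 4, "department": ["Packaged Foods"]},
    {"id": "2", "name": "1lb Ground Beef", "price": 3.99,
     "quantity": 29, "department": ["Meat and Seafood"]},
    {"id": "3", "name": "Dozen Apples", "price": 2.49,
     "quantity": 12, "department": ["Produce"]},
]


def build_bulk_body(products):
    """Build a newline-delimited JSON body for the bulk API.

    Each product becomes two lines: an "index" action carrying the
    document id, then the document source.
    """
    lines = []
    for product in products:
        lines.append(json.dumps({"index": {"_id": product["id"]}}))
        lines.append(json.dumps(product))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


print(build_bulk_body(PRODUCTS))
```

On Elasticsearch 6.x, a body like this could be sent with a POST to 127.0.0.1:9200/store/products/_bulk using the Content-Type application/x-ndjson, so that the index and type come from the URL rather than from each action line.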

In our grocery store example, we’ll assume that users are searching our index to find products they want to buy. One user would like to buy cereal but keeps misspelling it as “cereel”. Fortunately, we can write a query that handles the mistake and still returns the results the user is looking for. In this case, the user is off by just one substitution (e typed in place of a), so the fuzziness value in our query must be at least 1:

Request:

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?pretty" -d '
{
  "query": {
    "fuzzy": {
      "name": { "value": "cereel", "fuzziness": 1 }
    }
  }
}
'

Response:

{
  "took" : 80,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.85222125,
    "hits" : [
      {
        "_index" : "store",
        "_type" : "products",
        "_id" : "1",
        "_score" : 0.85222125,
        "_source" : {
          "id" : "1",
          "name" : "Multi-Grain Cereal",
          "price" : 4.99,
          "quantity" : 4,
          "department" : [
            "Packaged Foods"
          ]
        }
      }
    ]
  }
}

Note that the query accounted for the mistake and still returned the result the user expected.
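To see why only one product comes back, it can help to imitate the matching locally. The Python sketch below is a rough simulation, not how Elasticsearch works internally: it splits each product name into lowercase terms with a simple regular expression (an approximation of what the analyzer does to a text field) and keeps the products that have at least one term within one edit of the search value. The names within_one_edit, tokenize, and fuzzy_hits are hypothetical. One known difference: a real fuzzy query also counts swapping two adjacent characters as a single edit by default, which this sketch does not.

```python
import re

PRODUCT_NAMES = [
    "Multi-Grain Cereal", "1lb Ground Beef", "Dozen Apples",
    "Chocolate Bar", "1 Gallon Milk", "0.5lb Jumbo Shrimp",
    "Wheat Bread", "Pepperoni Pizza", "12 Pack Cola", "Lime Juice",
    "12 Pack Cherry Cola", "1 Gallon Soy Milk",
    "1 Gallon Vanilla Soy Milk", "1 Gallon Orange Juice",
]


def within_one_edit(a, b):
    """True if a and b differ by at most one substitution,
    insertion, or deletion."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution
    return a[i:] == b[i + 1:]          # one extra character in b


def tokenize(name):
    """Rough stand-in for the analyzer: lowercase terms only."""
    return re.findall(r"[a-z0-9.]+", name.lower())


def fuzzy_hits(value, names):
    """Names with at least one term within one edit of value."""
    return [name for name in names
            if any(within_one_edit(value, term) for term in tokenize(name))]


print(fuzzy_hits("cereel", PRODUCT_NAMES))
```

With the dataset above, the only term within one edit of “cereel” is “cereal”, so the sketch returns just Multi-Grain Cereal, matching the single hit in the response.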

Fuzziness Amount and Default AUTO

The fuzziness parameter in Elasticsearch defaults to AUTO, which means that the maximum allowed edit distance depends on the length of your string. Shorter strings get a smaller fuzziness value, which means that there is less tolerance for errors. The exact default values are:

* Distance of 0 for strings with a length of 1-2 characters
* Distance of 1 for strings with a length of 3-5 characters
* Distance of 2 for strings with a length greater than 5 characters
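The length thresholds above can be written as a small Python function. This is only an illustration of the rule as described here, with a hypothetical function name; the low and high parameters are our own way of naming the two cut-off lengths (3 and 6 by default), not values read from Elasticsearch.

```python
def auto_fuzziness(term, low=3, high=6):
    """Return the maximum edit distance the AUTO rule described
    above would allow for a term of this length."""
    length = len(term)
    if length < low:
        return 0  # 1-2 characters: must match exactly
    if length < high:
        return 1  # 3-5 characters: one edit allowed
    return 2      # more than 5 characters: two edits allowed


for term in ["ab", "soy", "pizza", "cereel"]:
    print(term, auto_fuzziness(term))
```

For our misspelled “cereel”, which has six characters, the rule allows up to two edits, so the query from the previous section would also have matched with fuzziness left at its default.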

These default settings generally yield good results, so if you’re just getting started with fuzzy query matching and you’re not sure how to calibrate your fuzziness setting, it’s often best to stick with the defaults at first. Once you’ve seen how real searches behave, you can decide whether to adjust the fuzziness parameter to provide more or less tolerance.

Conclusion

Nobody’s perfect, and that includes your users. Typos and misspellings are common in searches, but it’s important to be able to give users the results they were looking for. Fortunately, fuzzy query matching in Elasticsearch makes it easy to handle user errors and still deliver accurate results. With the instructions provided in this tutorial, you’ll be able to add fuzzy query matching to your own search applications and take your search functionality to the next level.

Just the Code

If you’re already familiar with the concept of Levenshtein edit distance and fuzzy matching, the following query shows what you’ll need to implement fuzzy query matching in your search applications:

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?pretty" -d '
{
  "query": {
    "fuzzy": {
      "name": { "value": "cereel", "fuzziness": 1 }
    }
  }
}
'
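If you’d rather issue the same search from application code than from curl, the sketch below shows one way to do it with only the Python standard library. The function names, the base_url default, and the choice of POST (the search endpoint accepts a request body with either GET or POST) are our own assumptions for illustration; the query body is the same one used in the curl command above.

```python
import json
import urllib.request


def build_fuzzy_query(field, value, fuzziness=1):
    """Build the body of a fuzzy query for a single field."""
    return {
        "query": {
            "fuzzy": {
                field: {"value": value, "fuzziness": fuzziness}
            }
        }
    }


def build_search_request(body, base_url="http://127.0.0.1:9200",
                         index="store"):
    """Wrap a query body in an HTTP request for the _search endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(request):
    """Send the request and return the decoded JSON response.
    Requires a reachable Elasticsearch node at the request URL."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request(build_fuzzy_query("name", "cereel"))
print(request.full_url)
```

Calling run_search(request) against a node that holds the store index should return the same hit for Multi-Grain Cereal shown earlier; it is left uncalled here because it needs a running cluster.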
