How to Use Fuzzy Query Matches in Elasticsearch
Introduction
If you want to provide the best possible search experience for your users, you need to make sure they’re getting the results they want. The problem is, sometimes users make mistakes. If you’re only querying for exact matches, simple typos and spelling errors can lead to empty results, which is not an ideal user experience. This is where fuzzy query matches can help, handling user errors with ease and giving users the results they were likely searching for. In this tutorial, we’ll provide step-by-step instructions on how to implement fuzzy query matching in Elasticsearch. If you’re already comfortable with the concept and would prefer to skip the explanations, feel free to jump to Just the Code.
The Levenshtein Edit Distance
Before taking a look at some sample code, it’s important to understand the concept that fuzzy matching is based on: the Levenshtein edit distance. While that term might seem unfamiliar, it’s easy to understand. The Levenshtein edit distance is a measure of how dissimilar two strings are. In other words, it counts the minimum number of single-character operations (substitutions, insertions, and deletions) needed to transform one string into another. Let’s look at some examples using the string “Database” and figure out the Levenshtein distance for a few variations on this string:
Substitution: “Database” vs. “Databese”. The Levenshtein edit distance for this example is 1 because only one letter was substituted. Each substitution counts as 1 unit of distance. The string “Dataxxxx”, on the other hand, would have a Levenshtein edit distance of 4 because four letters were substituted.
Insertion: “Database” vs. “Dattabase”. The Levenshtein edit distance for this example is 1 because only one letter was inserted. Each insertion counts as 1 unit of distance. Therefore, the string “Dattabaseee” would have a Levenshtein edit distance of 3 because three letters were inserted.
Deletion: “Database” vs. “Databse”. Here’s another example where the Levenshtein edit distance would be 1 because one letter was deleted. Like substitution and insertion, each deletion counts as 1 unit of distance. “Databa” would have a Levenshtein edit distance of 2 because two letters were deleted.
Let’s put all these ideas together and try a more complex example. Comparing the strings “Database” and “atabasoy” would result in a Levenshtein edit distance of 3 because we have one substitution (o for e), one insertion (the letter y at the end), and one deletion (the D at the beginning).
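The counting described above can be sketched in a few lines of Python. This is a minimal illustration of the textbook dynamic-programming approach, not the code Elasticsearch runs internally; the function name `levenshtein` is our own. Note also that Elasticsearch's fuzzy query, by default, additionally treats a swap of two adjacent characters as a single edit (its `transpositions` option), which this sketch does not model.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character edits turning source into target."""
    # previous[j] = distance between the source prefix handled so far and target[:j].
    # Before any source character is read, reaching target[:j] takes j insertions.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning source[:i] into the empty string takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)  # free if chars match
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# The variations of "Database" discussed in this section.
for variant in ["Databese", "Dataxxxx", "Dattabaseee", "Databse", "Databa", "atabasoy"]:
    print(variant, levenshtein("Database", variant))
# Distances printed, in order: 1, 4, 3, 1, 2, 3
```

Only two rows of the distance table are kept in memory at a time, which is enough because each cell depends only on the current and previous row.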
Fuzzy Query Matching
In Elasticsearch, you can write queries that implement fuzzy matching and specify the maximum edit distance that will be allowed.
Let’s look at an example that uses an index called `store`, which represents a small grocery store. This `store` index contains a type called `products`, which lists the store’s products. To keep things simple, our example dataset will only contain a handful of products with just the following fields: id, name, price, quantity, and department. The table below shows the dataset:
id | name | price | quantity | department
---|---|---|---|---
1 | Multi-Grain Cereal | 4.99 | 4 | Packaged Foods
2 | 1lb Ground Beef | 3.99 | 29 | Meat and Seafood
3 | Dozen Apples | 2.49 | 12 | Produce
4 | Chocolate Bar | 1.29 | 2 | Packaged Foods, Checkout
5 | 1 Gallon Milk | 3.29 | 16 | Dairy
6 | 0.5lb Jumbo Shrimp | 5.29 | 12 | Meat and Seafood
7 | Wheat Bread | 1.29 | 5 | Bakery
8 | Pepperoni Pizza | 2.99 | 5 | Frozen
9 | 12 Pack Cola | 5.29 | 6 | Packaged Foods
10 | Lime Juice | 0.99 | 20 | Produce
11 | 12 Pack Cherry Cola | 5.59 | 5 | Packaged Foods
12 | 1 Gallon Soy Milk | 3.39 | 10 | Dairy
13 | 1 Gallon Vanilla Soy Milk | 3.49 | 9 | Dairy
14 | 1 Gallon Orange Juice | 3.29 | 4 | Juice
The curl command to create our index mapping would look like this:
```js
curl -H "Content-Type: application/json" -XPUT 127.0.0.1:9200/store -d '
{
  "mappings": {
    "products": {
      "properties": {
        "name": { "type": "text" },
        "price": { "type": "double" },
        "quantity": { "type": "integer" },
        "department": { "type": "keyword" }
      }
    }
  }
}
'
```
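With the mapping in place, the rows from the product table still need to be indexed. One possible way is the `_bulk` endpoint, which takes newline-delimited JSON: an action line followed by the document itself. The sketch below only builds that body for the first three products; the helper name `bulk_body` and the `PRODUCTS` list are our own, and storing `department` as a list is an assumption based on the shape of the search response shown later. The `_type` entry matches the typed mapping used in this tutorial; newer Elasticsearch releases have removed mapping types, so it would need to be dropped there.

```python
import json

# First three rows of the product table; the remaining rows follow the same shape.
PRODUCTS = [
    {"id": "1", "name": "Multi-Grain Cereal", "price": 4.99, "quantity": 4,
     "department": ["Packaged Foods"]},
    {"id": "2", "name": "1lb Ground Beef", "price": 3.99, "quantity": 29,
     "department": ["Meat and Seafood"]},
    {"id": "3", "name": "Dozen Apples", "price": 2.49, "quantity": 12,
     "department": ["Produce"]},
]


def bulk_body(products, index="store", doc_type="products"):
    """Build a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for product in products:
        # Action line: says where the document on the next line should be indexed.
        action = {"index": {"_index": index, "_type": doc_type, "_id": product["id"]}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(product))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


print(bulk_body(PRODUCTS), end="")
```

Saved to a file such as `products.ndjson` (a hypothetical name), the output could then be sent with `curl -H "Content-Type: application/x-ndjson" -XPOST 127.0.0.1:9200/_bulk --data-binary @products.ndjson`.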
In our grocery store example, we’ll assume that users are searching our index to find products they want to buy. One user would like to buy cereal but keeps misspelling it as “cereel”. Fortunately, we can write a query that handles the typo and still returns the results the user is looking for. In this case, the search term is only off by one substitution (e for a), so the `fuzziness` value in our query must be at least 1:
Request:
```js
curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?pretty" -d '
{
  "query": {
    "fuzzy": {
      "name": { "value": "cereel", "fuzziness": 1 }
    }
  }
}
'
```
Response:
```js
{
  "took" : 80,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.85222125,
    "hits" : [
      {
        "_index" : "store",
        "_type" : "products",
        "_id" : "1",
        "_score" : 0.85222125,
        "_source" : {
          "id" : "1",
          "name" : "Multi-Grain Cereal",
          "price" : 4.99,
          "quantity" : 4,
          "department" : [
            "Packaged Foods"
          ]
        }
      }
    ]
  }
}
```
Note that the query accounted for the mistake and still returned the result the user expected.
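To see why this document matched, it helps to remember that a fuzzy query compares the search value against the individual terms stored in the index, not against the whole field value. The following sketch imitates that behavior in memory. It is a rough approximation under our own assumptions: `terms` is a crude stand-in for the analyzer applied to a `text` field (lowercase, split on anything that is not a letter or digit), and options such as `prefix_length`, `max_expansions`, and `transpositions` are ignored.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, a_char in enumerate(a, start=1):
        current = [i]
        for j, b_char in enumerate(b, start=1):
            current.append(min(
                previous[j - 1] + (a_char != b_char),  # substitution (or match)
                current[j - 1] + 1,                    # insertion
                previous[j] + 1,                       # deletion
            ))
        previous = current
    return previous[-1]


def terms(text: str) -> list:
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def fuzzy_match(names, value, fuzziness):
    """Return the names containing at least one term within the allowed distance."""
    return [
        name for name in names
        if any(edit_distance(value, term) <= fuzziness for term in terms(name))
    ]


NAMES = ["Multi-Grain Cereal", "1lb Ground Beef", "Dozen Apples",
         "Chocolate Bar", "1 Gallon Milk"]

print(fuzzy_match(NAMES, "cereel", 1))  # ['Multi-Grain Cereal']
print(fuzzy_match(NAMES, "cereel", 0))  # []
```

With a fuzziness of 0 the misspelled value matches nothing, which is the empty-result situation that fuzzy matching is meant to avoid.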
Fuzziness Amount and Default AUTO
The `fuzziness` parameter in Elasticsearch defaults to AUTO, which means that the maximum allowed edit distance will depend on the length of your string. Shorter strings will have a smaller fuzziness value, which means that there will be less tolerance for errors. The exact default values are:
* Distance of 0 for strings with length of 1-2 characters
* Distance of 1 for strings with length of 3-5 characters
* Distance of 2 for strings with length greater than 5 characters
These defaults are a sensible starting point, so if you’re just getting started with fuzzy query matching and you’re not sure how to calibrate your fuzziness setting, it’s often best to stick with them at first. With use, you can determine whether you need to adjust the `fuzziness` parameter to provide more or less tolerance.
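The length thresholds listed above can be written down as a small lookup. The function below is only an illustration of those default rules; `auto_fuzziness` is a name of our own and is not part of any Elasticsearch client.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance allowed for a term under the default AUTO rules."""
    length = len(term)
    if length <= 2:
        return 0  # very short terms must match exactly
    if length <= 5:
        return 1  # 3-5 characters: one edit allowed
    return 2      # longer than 5 characters: two edits allowed


for term in ["ok", "milk", "cereel"]:
    print(term, auto_fuzziness(term))
# ok 0
# milk 1
# cereel 2
```

One consequence worth noticing: “cereel” has six characters, so leaving `fuzziness` at AUTO would allow up to two edits for that search, which is more permissive than the explicit value of 1 used in the earlier query.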
Conclusion
Nobody’s perfect, and that includes your users. Typos and misspellings are common in searches, but it’s important to be able to give users the results they were looking for. Fortunately, fuzzy query matching in Elasticsearch makes it easy to handle user errors and still deliver accurate results. With the instructions provided in this tutorial, you’ll be able to add fuzzy query matching to your own search applications and take your search functionality to the next level.
Just the Code
If you’re already familiar with the concept of Levenshtein edit distance and fuzzy matching, the following request shows what you’ll need to implement fuzzy query matching in your search applications:
```js
curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?pretty" -d '
{
  "query": {
    "fuzzy": {
      "name": { "value": "cereel", "fuzziness": 1 }
    }
  }
}
'
```
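If your application is written in Python rather than shell, the same request could be assembled with the standard library alone. This is a hedged sketch, not an official client: `build_fuzzy_search` and `run` are hypothetical helper names, the host and index are the ones assumed throughout this tutorial, and POST is used because the `_search` endpoint accepts it as well as GET.

```python
import json
import urllib.request


def build_fuzzy_search(field, value, fuzziness,
                       base_url="http://127.0.0.1:9200", index="store"):
    """Build (but do not send) a fuzzy search request for one field."""
    body = {"query": {"fuzzy": {field: {"value": value, "fuzziness": fuzziness}}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run(request):
    """Send the request and decode the JSON response (needs a running cluster)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_fuzzy_search("name", "cereel", 1)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `run(request)` against a cluster that holds the `store` index should return a response shaped like the one shown earlier in this tutorial; nothing is sent until `run` is called.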