How to Use Fuzzy Query Matches in Elasticsearch using Kibana
Introduction
If you want to provide the best possible search experience for your users, you need to make sure they’re getting the results they want. The problem is, sometimes users make mistakes. If you’re only querying for exact matches, simple typos and spelling errors can lead to empty results, which is not an ideal user experience. This is where fuzzy query matches can help, handling user errors with ease and giving users the results they were likely searching for. In this tutorial, we’ll provide step-by-step instructions on how to implement fuzzy query matching in Elasticsearch using Kibana. If you’re already comfortable with the concept of fuzzy matches and would prefer to skip the explanations, feel free to jump to Just the Code.
Prerequisites
Before we show you how to perform fuzzy query matches with Elasticsearch in Kibana, it’s important to make sure a few prerequisites are in place. There are only a few system requirements for this task:
* Elasticsearch needs to be installed and running. In our example, we have Elasticsearch installed locally, listening on the default port of 9200. If your Elasticsearch installation is running on a different server, you’ll need to adjust the host and port in the examples accordingly.
* Kibana needs to be installed and running. We have ours running on http://localhost:5601 in case you need to make adjustments.
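If you want to confirm that Elasticsearch is reachable before moving on, a quick sanity check from the Kibana Dev Tools Console might look like this (the requests below assume the local setup described above):

# Returns the cluster name and Elasticsearch version if the node is reachable
GET /

# Shows the overall cluster status (green, yellow, or red)
GET _cluster/health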
The Levenshtein Edit Distance
Before taking a look at some sample code, it’s important to understand the concept that fuzzy matching is based on: the Levenshtein edit distance. While that term might seem unfamiliar, it’s easy to understand. The Levenshtein edit distance is a measure of how dissimilar two strings are. In other words, it counts how many single-character operations (substitutions, insertions, or deletions) would be needed to transform one string into another. Let’s look at some examples using the string “Database” and figure out the Levenshtein distance for a few variations on this string:
Substitution: “Database” vs. “Databese”. The Levenshtein edit distance here is 1 because only one letter was substituted. Each substitution counts as 1 unit of distance. The string “Dataxxxx”, on the other hand, would have a Levenshtein edit distance of 4 because four letters were substituted.
Insertion: “Database” vs. “Dattabase”. The Levenshtein edit distance here is 1 because only one letter was inserted. Each insertion counts as 1 unit of distance. Therefore, the string “Dattabaseee” would have a Levenshtein edit distance of 3 because three letters were inserted.
Deletion: “Database” vs. “Databse”. Here the Levenshtein edit distance is again 1 because one letter was deleted. Like substitutions and insertions, each deletion counts as 1 unit of distance. “Databa” would have a Levenshtein edit distance of 2 because two letters were deleted.
Let’s put all these ideas together and try a more complex example. Comparing the strings “Database” and “atabasoy” would result in a Levenshtein edit distance of 3 because we have one substitution (o for e), one insertion (the letter y at the end), and one deletion (the D at the beginning).
Fuzzy Query Matching
In Elasticsearch, you can write queries that implement fuzzy matching and specify the maximum edit distance that will be allowed.
Let’s look at an example that uses an index called store, which represents a small grocery store. This store index contains a type called products, which lists the store’s products. To keep things simple, our example dataset will only contain a handful of products with just the following fields: id, name, price, quantity, and department. The table below shows the documents in the dataset:
id | name | price | quantity | department |
---|---|---|---|---|
1 | Multi-Grain Cereal | 4.99 | 4 | Packaged Foods |
2 | 1lb Ground Beef | 3.99 | 29 | Meat and Seafood |
3 | Dozen Apples | 2.49 | 12 | Produce |
4 | Chocolate Bar | 1.29 | 2 | Packaged Foods, Checkout |
5 | 1 Gallon Milk | 3.29 | 16 | Dairy |
6 | 0.5lb Jumbo Shrimp | 5.29 | 12 | Meat and Seafood |
7 | Wheat Bread | 1.29 | 5 | Bakery |
8 | Pepperoni Pizza | 2.99 | 5 | Frozen |
9 | 12 Pack Cola | 5.29 | 6 | Packaged Foods |
10 | Lime Juice | 0.99 | 20 | Produce |
11 | 12 Pack Cherry Cola | 5.59 | 5 | Packaged Foods |
12 | 1 Gallon Soy Milk | 3.39 | 10 | Dairy |
13 | 1 Gallon Vanilla Soy Milk | 3.49 | 9 | Dairy |
14 | 1 Gallon Orange Juice | 3.29 | 4 | Juice |
Here is the JSON we used to define the mapping of our index:
{
  "mappings": {
    "products": {
      "properties" : {
        "name": { "type": "text" },
        "price": { "type": "double" },
        "quantity": { "type": "integer" },
        "department": { "type": "keyword" }
      }
    }
  }
}
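If you’d like to follow along, you can create the index by sending the mapping above in a PUT request from the Kibana Dev Tools Console and then load the sample data with the _bulk API. Below is a minimal sketch showing only the first two products; the remaining rows in the table follow the same pattern:

# Create the store index using the mapping shown above
PUT /store
{
  "mappings": {
    "products": {
      "properties" : {
        "name": { "type": "text" },
        "price": { "type": "double" },
        "quantity": { "type": "integer" },
        "department": { "type": "keyword" }
      }
    }
  }
}

# Index the first two sample products (repeat for the rest of the table)
POST /store/products/_bulk
{ "index": { "_id": "1" } }
{ "id": "1", "name": "Multi-Grain Cereal", "price": 4.99, "quantity": 4, "department": ["Packaged Foods"] }
{ "index": { "_id": "2" } }
{ "id": "2", "name": "1lb Ground Beef", "price": 3.99, "quantity": 29, "department": ["Meat and Seafood"] }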
In our grocery store example, we’ll assume that users are searching our index to find products they want to buy. One user would like to buy cereal but keeps misspelling it as “cereel”. Fortunately, we can write a query that handles this typo and returns the results the user is looking for. In this case, the user is only off by one substitution (e for a), so our query must allow a Levenshtein edit distance of at least 1. Let’s see how a fuzzy query match would work.
We run this command in our Kibana Dev Tools Console:
GET /store/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "cereel",
        "fuzziness": 1
      }
    }
  }
}
What we’ve done is define a query of type fuzzy. We’ve given it the name of the field we want to evaluate for matches (name) and specified the fuzziness along with the value we are searching for ("value": "cereel"). Notice that the index name store appears in the GET request and specifies which index to search. Now let’s look at the response:
Response:
{
  "took" : 69,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.85222125,
    "hits" : [
      {
        "_index" : "store",
        "_type" : "products",
        "_id" : "1",
        "_score" : 0.85222125,
        "_source" : {
          "id" : "1",
          "name" : "Multi-Grain Cereal",
          "price" : 4.99,
          "quantity" : 4,
          "department" : [ "Packaged Foods" ]
        }
      }
    ]
  }
}
Note that the query accounted for the mistake in the spelling of “cereel” and still returned the “Multi-Grain Cereal” result the user expected.
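One thing to keep in mind is that the fuzzy query compares the exact term you supply against the terms in the index. If you want your search text to be analyzed the same way the name field was at index time, the match query also accepts a fuzziness parameter; here’s a minimal sketch of an equivalent request:

GET /store/_search
{
  "query": {
    "match": {
      "name": {
        "query": "cereel",
        "fuzziness": 1
      }
    }
  }
}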
Fuzziness Amount and Default AUTO
The fuzziness parameter in Elasticsearch defaults to AUTO, which means that the maximum allowed edit distance will depend on the length of your string. Shorter strings will have a smaller fuzziness value, which means that there will be less tolerance for errors. The exact default values are:
* Distance of 0 for strings with a length of 1-2 characters
* Distance of 1 for strings with a length of 3-5 characters
* Distance of 2 for strings with a length greater than 5 characters
These default settings have been proven to yield good results, so if you’re just getting started with fuzzy query matching and you’re not sure how to calibrate your fuzziness setting, it’s often best to stick with the defaults at first. With use, you can determine whether you need to adjust the fuzziness parameter to provide more or less tolerance.
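To use the default behavior explicitly, you can pass the string "AUTO" as the fuzziness value instead of a number. Here’s a sketch of our earlier cereal query using the AUTO setting:

GET /store/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "cereel",
        "fuzziness": "AUTO"
      }
    }
  }
}

Because “cereel” is six characters long, AUTO allows an edit distance of 2 here, so this query still matches “Multi-Grain Cereal”.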
Conclusion
Nobody’s perfect, and that includes your users. Typos and misspellings are common in searches, but it’s important to be able to give users the results they were looking for. Fortunately, fuzzy query matching in Elasticsearch makes it easy to handle user errors and still deliver accurate results. With the instructions provided in this tutorial, you’ll be able to add fuzzy query matching to your own search applications and take your search functionality to the next level.
Just the Code
If you’re already familiar with the concepts of Levenshtein edit distance and fuzzy matching, and you’re comfortable using Kibana, the following code is all you need to implement fuzzy query matching in your own search applications:
GET /store/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "cereel",
        "fuzziness": 1
      }
    }
  }
}