How to Find the Number of Distinct Values with Cardinality Aggregations in Elasticsearch
Introduction
There are times when you may want to analyze a dataset to find the number of distinct values for that set. For example, you may want to count the number of unique visitors to a website or the number of unique customers a business had in the past month. Regardless of the exact purpose, Elasticsearch makes this type of analysis easy with cardinality aggregation, which is really just another way of saying “counting the number of unique values in a set”. In this tutorial, we’ll provide step-by-step instructions for finding the number of distinct values with cardinality aggregations in Elasticsearch; however, if you’d prefer to skip the explanations and dive straight into the sample code, feel free to jump to Just the Code.
Prerequisites
Before we attempt to use cardinality aggregation in Elasticsearch, it’s important to make sure a few prerequisites are in place. The system requirements are minimal: Elasticsearch needs to be installed and running, and Node.js also needs to be installed. In this tutorial, we assume that Elasticsearch is running locally on the default port, so our curl commands will reflect that with the syntax localhost:9200 or 127.0.0.1:9200. If Elasticsearch is running on a different server, your curl commands will take a slightly different form: YOURDOMAIN.com:9200.
Using cardinality aggregation
Let’s take a look at an example of using cardinality aggregation in Elasticsearch. For our example, we’ll create a sample index called store, which represents a small grocery store. Our store index contains a type called products, which lists all of the store’s products. We’ll keep our sample dataset simple by including just a handful of products with a small number of fields: id, name, price, quantity, and department. The dataset is shown in the table below:
id | name | price | quantity | department
---|---|---|---|---
1 | Multi-Grain Cereal | 4.99 | 4 | Packaged Foods
2 | 1lb Ground Beef | 3.99 | 29 | Meat and Seafood
3 | Dozen Apples | 2.49 | 12 | Produce
4 | Chocolate Bar | 1.29 | 2 | Packaged Foods, Checkout
5 | 1 Gallon Milk | 3.29 | 16 | Dairy
6 | 0.5lb Jumbo Shrimp | 5.29 | 12 | Meat and Seafood
7 | Wheat Bread | 1.29 | 5 | Bakery
8 | Pepperoni Pizza | 2.99 | 5 | Frozen
9 | 12 Pack Cola | 5.29 | 6 | Packaged Foods
10 | Lime Juice | 0.99 | 20 | Produce
11 | 12 Pack Cherry Cola | 5.59 | 5 | Packaged Foods
12 | 1 Gallon Soy Milk | 3.39 | 10 | Dairy
13 | 1 Gallon Vanilla Soy Milk | 3.49 | 9 | Dairy
14 | 1 Gallon Orange Juice | 3.29 | 4 | Juice
The code below shows the mapping:
    curl -H "Content-Type: application/json" -XPUT 127.0.0.1:9200/store -d '
    {
      "mappings": {
        "products": {
          "properties" : {
            "name": { "type": "text"},
            "price": { "type": "double"},
            "quantity": { "type": "integer"},
            "department": { "type": "keyword"}
          }
        }
      }
    }
    '
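The tutorial does not show how the rows in the table get into the index, so here is a minimal sketch of one way to do it with the bulk API. It is written in Python so the payload can be generated from the table rather than typed by hand. Several things here are assumptions and not part of the original: the helper names `build_bulk_body` and `bulk_index`, the use of the table's id as the document `_id`, the typed `/store/products/_bulk` path (which matches the mapping above but will not apply to newer Elasticsearch versions that have removed mapping types), and the reading of "Checkout" as a second department value for product 4.

```python
import json
import urllib.request

# Rows from the sample table: (id, name, price, quantity, departments).
# Assumption: product 4 carries two department values, which is what makes
# the distinct department count come out to eight.
PRODUCTS = [
    (1, "Multi-Grain Cereal", 4.99, 4, ["Packaged Foods"]),
    (2, "1lb Ground Beef", 3.99, 29, ["Meat and Seafood"]),
    (3, "Dozen Apples", 2.49, 12, ["Produce"]),
    (4, "Chocolate Bar", 1.29, 2, ["Packaged Foods", "Checkout"]),
    (5, "1 Gallon Milk", 3.29, 16, ["Dairy"]),
    (6, "0.5lb Jumbo Shrimp", 5.29, 12, ["Meat and Seafood"]),
    (7, "Wheat Bread", 1.29, 5, ["Bakery"]),
    (8, "Pepperoni Pizza", 2.99, 5, ["Frozen"]),
    (9, "12 Pack Cola", 5.29, 6, ["Packaged Foods"]),
    (10, "Lime Juice", 0.99, 20, ["Produce"]),
    (11, "12 Pack Cherry Cola", 5.59, 5, ["Packaged Foods"]),
    (12, "1 Gallon Soy Milk", 3.39, 10, ["Dairy"]),
    (13, "1 Gallon Vanilla Soy Milk", 3.49, 9, ["Dairy"]),
    (14, "1 Gallon Orange Juice", 3.29, 4, ["Juice"]),
]


def build_bulk_body(products):
    """Return newline-delimited JSON: one action line plus one document line per product."""
    lines = []
    for product_id, name, price, quantity, departments in products:
        lines.append(json.dumps({"index": {"_id": str(product_id)}}))
        lines.append(json.dumps({
            "name": name,
            "price": price,
            "quantity": quantity,
            "department": departments,
        }))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


def bulk_index(body, base_url="http://127.0.0.1:9200"):
    """POST the payload to the (hypothetical, 6.x-style) typed bulk endpoint."""
    request = urllib.request.Request(
        f"{base_url}/store/products/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a cluster running locally, calling `bulk_index(build_bulk_body(PRODUCTS))` should load all 14 products. Nothing is sent until `bulk_index` is called, so the payload can be inspected first.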
Let’s say we wanted to find out the number of distinct departments in our store. We could determine this using a cardinality aggregation. The following code can be used to accomplish the task:
    curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?size=0&pretty" -d '
    {
      "aggs" : {
        "department_count" : {
          "cardinality" : {
            "field" : "department"
          }
        }
      }
    }
    '
Let’s take a closer look at what we just did. First, we created an aggregator using "aggs". We named our aggregator "department_count". Note that we defined the type of the aggregator as "cardinality". This is an important step, because aggregation can be used for a variety of purposes in Elasticsearch: calculating minimums, maximums, averages, and much more. We also set the field value as "department", which just means that we’ll be evaluating the "department" field for distinct values.
Here’s the response we got from Elasticsearch:
    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 14,
        "max_score" : 0.0,
        "hits" : [ ]
      },
      "aggregations" : {
        "department_count" : {
          "value" : 8
        }
      }
    }
Looking at the output, you can see that the value for department_count was 8. We can verify this by examining our dataset and counting the eight distinct departments: Meat and Seafood, Produce, Packaged Foods, Checkout, Dairy, Frozen, Bakery, and Juice.
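The manual check can also be sketched in code: read the value out of the aggregation response and compare it with an exact distinct count computed from the dataset. The response text below is trimmed to the keys that are actually read. The helper names are hypothetical, and the department list assumes that "Checkout" is a second department value for product 4.

```python
import json

# Trimmed copy of the search response: only the keys read below are kept.
RESPONSE_JSON = """
{
  "hits": {"total": 14},
  "aggregations": {"department_count": {"value": 8}}
}
"""

# Department values for products 1-14, in table order.
# Assumption: product 4 has two department values.
DEPARTMENTS_BY_PRODUCT = [
    ["Packaged Foods"],
    ["Meat and Seafood"],
    ["Produce"],
    ["Packaged Foods", "Checkout"],
    ["Dairy"],
    ["Meat and Seafood"],
    ["Bakery"],
    ["Frozen"],
    ["Packaged Foods"],
    ["Produce"],
    ["Packaged Foods"],
    ["Dairy"],
    ["Dairy"],
    ["Juice"],
]


def aggregation_value(response_text, agg_name):
    """Pull a single metric aggregation's value out of a search response."""
    return json.loads(response_text)["aggregations"][agg_name]["value"]


def exact_distinct_count(values_per_doc):
    """Count unique values across all documents, multi-valued fields included."""
    return len({value for values in values_per_doc for value in values})


reported = aggregation_value(RESPONSE_JSON, "department_count")
exact = exact_distinct_count(DEPARTMENTS_BY_PRODUCT)
print(reported, exact)  # 8 8
```

The two numbers agree here. On large, high-cardinality fields they may not, because the cardinality aggregation returns an approximate count based on the HyperLogLog++ algorithm, not an exact one.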
Other Cardinality Options
The example we just reviewed shows how to use cardinality aggregation for a single field; in this case, it was "department". However, cardinality aggregations can also be applied to multiple fields. To accomplish this, you would use the script parameter in your aggregator code, supplying an inline script or a stored script. Adding the scripting option to a cardinality aggregation does have an impact on performance, but the ability to find unique combinations across multiple fields usually makes it worthwhile. For additional information on the script option and other cardinality options, consult the Elasticsearch documentation.
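As a rough illustration of these options, the sketch below builds two request bodies: one that counts distinct combinations of several fields with an inline Painless script, and one that sets the precision_threshold option on a single field. The helper names and the choice of "department" plus "quantity" are assumptions made for the example. The script option applies to the Elasticsearch versions this tutorial targets, so check the documentation for your own version before relying on it.

```python
import json
import urllib.request


def script_cardinality_body(fields, agg_name="combo_count"):
    """Build a search body that counts distinct combinations of `fields`."""
    # Join the field values with a separator so that ("a", "bc") and
    # ("ab", "c") do not collapse into the same string.
    source = " + '|' + ".join(f"doc['{field}'].value" for field in fields)
    return {
        "size": 0,
        "aggs": {
            agg_name: {
                "cardinality": {"script": {"lang": "painless", "source": source}}
            }
        },
    }


def field_cardinality_body(field, precision_threshold, agg_name="department_count"):
    """Build a single-field cardinality body with an explicit precision_threshold."""
    return {
        "size": 0,
        "aggs": {
            agg_name: {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


def search(body, base_url="http://127.0.0.1:9200", index="store"):
    """Send a search body to a local cluster (assumed address) and return the parsed response."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Print the script-based body so it can be pasted into a curl command.
print(json.dumps(script_cardinality_body(["department", "quantity"]), indent=2))
```

Passing either body to `search(...)` runs it against a local cluster. Note that `doc['department'].value` reads only one value per document, so a product with several department values contributes a single combination under this script.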
Conclusion
Being able to find the number of distinct values in a dataset can help you gain new insights and understand your data better. With this step-by-step guide, it will be quick and simple to accomplish this task in Elasticsearch and use this analysis in your own applications.
Just the Code
If you’re already familiar with the concept of aggregation, here’s all the code you’ll need to find the number of distinct values in a dataset.
    curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?size=0&pretty" -d '
    {
      "aggs" : {
        "department_count" : {
          "cardinality" : {
            "field" : "department"
          }
        }
      }
    }
    '