How to Find the Number of Distinct Values with Cardinality Aggregations in Elasticsearch

Introduction

There are times when you may want to analyze a dataset to find the number of distinct values for that set. For example, you may want to count the number of unique visitors to a website or the number of unique customers a business had in the past month. Regardless of the exact purpose, Elasticsearch makes this type of analysis easy with cardinality aggregation, which is really just another way of saying “counting the number of unique values in a set”. In this tutorial, we’ll provide step-by-step instructions for finding the number of distinct values with cardinality aggregations in Elasticsearch; however, if you’d prefer to skip the explanations and dive straight into the sample code, feel free to jump to Just the Code.

Prerequisites

Before we attempt to use cardinality aggregation in Elasticsearch, it’s important to make sure a few prerequisites are in place. The system requirements are minimal: Elasticsearch needs to be installed and running, and NodeJS also needs to be installed. In this tutorial, we assume that Elasticsearch is running locally on the default port, so our curl commands will reflect that with the syntax localhost:9200 or 127.0.0.1:9200. If Elasticsearch is running on a different server, your curl commands will take a slightly different form: YOURDOMAIN.com:9200.

Using cardinality aggregation

Let’s take a look at an example of using cardinality aggregation in Elasticsearch. For our example, we’ll create a sample index called store, which represents a small grocery store. Our store index contains a type called products, which lists all of the store’s products. We’ll keep our sample dataset simple by including just a handful of products with a small number of fields: id, name, price, quantity, and department. The sample dataset is shown below:

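| id | name | price | quantity | department |
|----|------|-------|----------|------------|
| 1 | Multi-Grain Cereal | 4.99 | 4 | Packaged Foods |
| 2 | 1lb Ground Beef | 3.99 | 29 | Meat and Seafood |
| 3 | Dozen Apples | 2.49 | 12 | Produce |
| 4 | Chocolate Bar | 1.29 | 2 | Checkout |
| 5 | 1 Gallon Milk | 3.29 | 16 | Dairy |
| 6 | 0.5lb Jumbo Shrimp | 5.29 | 12 | Meat and Seafood |
| 7 | Wheat Bread | 1.29 | 5 | Bakery |
| 8 | Pepperoni Pizza | 2.99 | 5 | Frozen |
| 9 | 12 Pack Cola | 5.29 | 6 | Packaged Foods |
| 10 | Lime Juice | 0.99 | 20 | Produce |
| 11 | 12 Pack Cherry Cola | 5.59 | 5 | Packaged Foods |
| 12 | 1 Gallon Soy Milk | 3.39 | 10 | Dairy |
| 13 | 1 Gallon Vanilla Soy Milk | 3.49 | 9 | Dairy |
| 14 | 1 Gallon Orange Juice | 3.29 | 4 | Juice |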

The code below shows the mapping:

curl -H "Content-Type: application/json" -XPUT 127.0.0.1:9200/store -d '
{
  "mappings": {
    "products": {
      "properties": {
        "name": { "type": "text" },
        "price": { "type": "double" },
        "quantity": { "type": "integer" },
        "department": { "type": "keyword" }
      }
    }
  }
}
'
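
The mapping on its own does not load any products. The following is a minimal sketch of one way to prepare the sample rows for the _bulk API, assuming the products type from the mapping above. The file name products.ndjson is an arbitrary choice for this example, and the Chocolate Bar row is assumed to belong to the Checkout department.

```shell
#!/bin/sh
# Build a newline-delimited JSON body for the _bulk API from the sample rows.
# products.ndjson is a hypothetical file name chosen for this sketch.
out=products.ndjson
: > "$out"

while IFS='|' read -r id name price quantity department; do
  # Action line: index the document that follows under this _id.
  printf '{"index":{"_id":"%s"}}\n' "$id" >> "$out"
  # Source line: the document itself.
  printf '{"name":"%s","price":%s,"quantity":%s,"department":"%s"}\n' \
    "$name" "$price" "$quantity" "$department" >> "$out"
done <<'EOF'
1|Multi-Grain Cereal|4.99|4|Packaged Foods
2|1lb Ground Beef|3.99|29|Meat and Seafood
3|Dozen Apples|2.49|12|Produce
4|Chocolate Bar|1.29|2|Checkout
5|1 Gallon Milk|3.29|16|Dairy
6|0.5lb Jumbo Shrimp|5.29|12|Meat and Seafood
7|Wheat Bread|1.29|5|Bakery
8|Pepperoni Pizza|2.99|5|Frozen
9|12 Pack Cola|5.29|6|Packaged Foods
10|Lime Juice|0.99|20|Produce
11|12 Pack Cherry Cola|5.59|5|Packaged Foods
12|1 Gallon Soy Milk|3.39|10|Dairy
13|1 Gallon Vanilla Soy Milk|3.49|9|Dairy
14|1 Gallon Orange Juice|3.29|4|Juice
EOF

# Two lines per product: one action line and one source line.
bulk_lines=$(wc -l < "$out" | tr -d ' ')
echo "wrote $bulk_lines lines to $out"
```

The resulting file could then be sent with a command along the lines of curl -H "Content-Type: application/x-ndjson" -XPOST "127.0.0.1:9200/store/products/_bulk?pretty" --data-binary @products.ndjson. The --data-binary flag matters here because the bulk format depends on the newlines being preserved.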

Let’s say we wanted to find out the number of distinct departments in our store. We could determine this using a cardinality aggregation. The following code can be used to accomplish the task:

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?size=0&pretty" -d '
{
  "aggs": {
    "department_count": {
      "cardinality": {
        "field": "department"
      }
    }
  }
}
'

Let’s take a closer look at what we just did. First, we created an aggregator using "aggs". We named our aggregator "department_count". Note that we defined the type of the aggregator as "cardinality". This is an important step, because aggregation can be used for a variety of purposes in Elasticsearch: calculating minimums, maximums, averages, and much more. We also set the field value as "department", which just means that we’ll be evaluating the "department" field for distinct values.

Here’s the response we got from Elasticsearch:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 14,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "department_count" : {
      "value" : 8
    }
  }
}

Looking at the output, you can see that the value for department_count was 8. We can verify this by examining our dataset and counting the eight distinct departments: Meat and Seafood, Produce, Packaged Foods, Checkout, Dairy, Frozen, Bakery, and Juice.
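
For a dataset this small, the count can also be cross-checked locally without Elasticsearch. The sketch below is not part of the Elasticsearch workflow; it simply lists the department of each of the 14 sample products (assuming Chocolate Bar is in Checkout) and counts the unique values with standard shell tools.

```shell
#!/bin/sh
# One department per product, in id order (1 through 14).
# sort -u keeps a single copy of each value; wc -l counts what remains.
distinct=$(printf '%s\n' \
  "Packaged Foods" "Meat and Seafood" "Produce" "Checkout" "Dairy" \
  "Meat and Seafood" "Bakery" "Frozen" "Packaged Foods" "Produce" \
  "Packaged Foods" "Dairy" "Dairy" "Juice" \
  | sort -u | wc -l | tr -d ' ')

echo "distinct departments: $distinct"
```

A check like this is only practical for tiny datasets. The cardinality aggregation itself returns an approximate count based on the HyperLogLog++ algorithm, which trades a small amount of accuracy at high cardinalities for bounded memory use; with only a handful of distinct values, as here, the result is expected to match the exact count.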

Other Cardinality Options

The example we just reviewed shows how to use cardinality aggregation for a single field; in this case, it was "department". However, cardinality aggregations can also be applied to multiple fields. To accomplish this, you would use the script parameter in your aggregator code, supplying an inline script or a stored script. Adding the scripting option to a cardinality aggregation does have an impact on performance, but the ability to find unique combinations across multiple fields usually makes it worthwhile. For additional information on the script option and other cardinality options, consult the Elasticsearch documentation.
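
As a rough illustration of the script option, the sketch below writes a request body that counts distinct combinations of department and quantity. The aggregation name department_quantity_count, the file name cardinality_script.json, and the choice of fields are all assumptions made for this example, and the script syntax assumes a Painless-enabled Elasticsearch version matching the rest of this tutorial. The body is written to a file with a quoted heredoc because the script contains single quotes, which would otherwise clash with the single-quoted -d argument used elsewhere in this article.

```shell
#!/bin/sh
# Write the request body to a file; the quoted 'EOF' marker prevents the
# shell from interpreting anything inside the JSON or the Painless script.
cat > cardinality_script.json <<'EOF'
{
  "aggs": {
    "department_quantity_count": {
      "cardinality": {
        "script": {
          "lang": "painless",
          "source": "doc['department'].value + '|' + doc['quantity'].value"
        }
      }
    }
  }
}
EOF

echo "request body written to cardinality_script.json"
```

The file could then be passed to the same search endpoint used earlier, for example with curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?size=0&pretty" -d @cardinality_script.json. The script concatenates the two field values with a separator so that each unique pair is counted once.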

Conclusion

Being able to find the number of distinct values in a dataset can help you gain new insights and understand your data better. With this step-by-step guide, it will be quick and simple to accomplish this task in Elasticsearch and use this analysis in your own applications.

Just the Code

If you’re already familiar with the concept of aggregation, here’s all the code you’ll need to find the number of distinct values in a dataset.

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?size=0&pretty" -d '
{
  "aggs": {
    "department_count": {
      "cardinality": {
        "field": "department"
      }
    }
  }
}
'