How to Find the Number of Distinct Values with Cardinality Aggregations in Elasticsearch

Written by Data Pilot

April 07, 2019

Elasticsearch


Introduction

There are times when you may want to analyze a dataset to find the number of distinct values for that set. For example, you may want to count the number of unique visitors to a website or the number of unique customers a business had in the past month. Regardless of the exact purpose, Elasticsearch makes this type of analysis easy with cardinality aggregation, which is really just another way of saying “counting the number of unique values in a set”. In this tutorial, we’ll provide step-by-step instructions for finding the number of distinct values with cardinality aggregations in Elasticsearch; however, if you’d prefer to skip the explanations and dive straight into the sample code, feel free to jump to Just the Code.

Prerequisites

Before we attempt to use cardinality aggregation in Elasticsearch, it’s important to make sure a few prerequisites are in place. The system requirements are minimal: Elasticsearch needs to be installed and running, and NodeJS also needs to be installed. The examples in this tutorial are issued with curl, so that needs to be available as well. We assume that Elasticsearch is running locally on the default port, so our curl commands use localhost:9200 or 127.0.0.1:9200. If Elasticsearch is running on a different server, replace that address with your own host, for example YOURDOMAIN.com:9200.
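Before continuing, it can help to confirm that the cluster is reachable. The snippet below is a minimal sketch and not part of the original tutorial: the ES_HOST variable and the es_is_up function are hypothetical names introduced here, and the default value assumes the local setup described above.

```shell
#!/bin/sh
# ES_HOST is a hypothetical variable; override it (for example
# ES_HOST=YOURDOMAIN.com:9200) when Elasticsearch runs on another server.
ES_HOST="${ES_HOST:-127.0.0.1:9200}"

# es_is_up (hypothetical helper) succeeds when the root endpoint answers
# with a successful HTTP status within five seconds.
es_is_up() {
  curl -s -f -o /dev/null --max-time 5 "$ES_HOST"
}
```

With Elasticsearch running, calling es_is_up && echo "reachable at $ES_HOST" should print the confirmation line; nothing is sent until the function is called.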

Using cardinality aggregation

Let’s take a look at an example of using cardinality aggregation in Elasticsearch. For our example, we’ll create a sample index called store, which represents a small grocery store. Our store index contains a type called products, which lists all of the store’s products. We’ll keep our sample dataset simple by including just a handful of products with a small number of fields: id, name, price, quantity, and department. The dataset is shown below:

id	name	price	quantity	department
1	Multi-Grain Cereal	4.99	4	Packaged Foods
2	1lb Ground Beef	3.99	29	Meat and Seafood
3	Dozen Apples	2.49	12	Produce
4	Chocolate Bar	1.29	2	Checkout
5	1 Gallon Milk	3.29	16	Dairy
6	0.5lb Jumbo Shrimp	5.29	12	Meat and Seafood
7	Wheat Bread	1.29	5	Bakery
8	Pepperoni Pizza	2.99	5	Frozen
9	12 Pack Cola	5.29	6	Packaged Foods
10	Lime Juice	0.99	20	Produce
11	12 Pack Cherry Cola	5.59	5	Packaged Foods
12	1 Gallon Soy Milk	3.39	10	Dairy
13	1 Gallon Vanilla Soy Milk	3.49	9	Dairy
14	1 Gallon Orange Juice	3.29	4	Juice

The code below shows the mapping:

curl -H "Content-Type: application/json" -XPUT 127.0.0.1:9200/store -d '
{
  "mappings": {
    "products": {
      "properties": {
        "name": { "type": "text" },
        "price": { "type": "double" },
        "quantity": { "type": "integer" },
        "department": { "type": "keyword" }
      }
    }
  }
}
'
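The tutorial does not show how the sample products get into the index, so here is one possible way to load them with the _bulk API. This is a sketch under stated assumptions: the file name products.ndjson and the load_products function are hypothetical, product 4 is assumed to belong to the Checkout department, and the URL assumes an Elasticsearch version that still uses mapping types, as the mapping above does. On versions without types the path would be /store/_bulk instead.

```shell
#!/bin/sh
# Build a bulk request body (one action line plus one document line per product).
BULK_FILE="${BULK_FILE:-products.ndjson}"   # hypothetical file name
: > "$BULK_FILE"                            # start with an empty file

while IFS='|' read -r id name price quantity department; do
  printf '{"index":{"_id":"%s"}}\n' "$id" >> "$BULK_FILE"
  printf '{"name":"%s","price":%s,"quantity":%s,"department":"%s"}\n' \
    "$name" "$price" "$quantity" "$department" >> "$BULK_FILE"
done <<'EOF'
1|Multi-Grain Cereal|4.99|4|Packaged Foods
2|1lb Ground Beef|3.99|29|Meat and Seafood
3|Dozen Apples|2.49|12|Produce
4|Chocolate Bar|1.29|2|Checkout
5|1 Gallon Milk|3.29|16|Dairy
6|0.5lb Jumbo Shrimp|5.29|12|Meat and Seafood
7|Wheat Bread|1.29|5|Bakery
8|Pepperoni Pizza|2.99|5|Frozen
9|12 Pack Cola|5.29|6|Packaged Foods
10|Lime Juice|0.99|20|Produce
11|12 Pack Cherry Cola|5.59|5|Packaged Foods
12|1 Gallon Soy Milk|3.39|10|Dairy
13|1 Gallon Vanilla Soy Milk|3.49|9|Dairy
14|1 Gallon Orange Juice|3.29|4|Juice
EOF

# load_products (hypothetical helper) sends the file to Elasticsearch.
# --data-binary keeps the newlines that the bulk format requires.
load_products() {
  curl -s -H "Content-Type: application/x-ndjson" -XPOST \
    "127.0.0.1:9200/store/products/_bulk?pretty" --data-binary "@$BULK_FILE"
}
```

Running the script only writes the file; call load_products once the index exists to send the request.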

Let’s say we wanted to find out the number of distinct departments in our store. We could determine this using a cardinality aggregation. The following code can be used to accomplish the task:

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?size=0&pretty" -d '
{
  "aggs" : {
    "department_count" : {
      "cardinality" : {
        "field" : "department"
      }
    }
  }
}
'

Let’s take a closer look at what we just did. First, we created an aggregator using "aggs". We named our aggregator "department_count". Note that we defined the type of the aggregator as "cardinality". This is an important step, because aggregation can be used for a variety of purposes in Elasticsearch: calculating minimums, maximums, averages, and much more. We also set the field value as "department", which just means that we’ll be evaluating the "department" field for distinct values.

Here’s the response we got from Elasticsearch:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 14,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "department_count" : {
      "value" : 8
    }
  }
}

Looking at the output, you can see that the value for department_count was 8. We can verify this by examining our dataset and counting the eight distinct departments: Meat and Seafood, Produce, Packaged Foods, Checkout, Dairy, Frozen, Bakery, and Juice.
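The same check can be done outside Elasticsearch. The sketch below lists the department of each of the 14 products, assuming product 4 belongs to Checkout, and counts the distinct values with standard shell tools. The variable names are illustrative only.

```shell
#!/bin/sh
# One department per product, in the order of the sample dataset.
departments='Packaged Foods
Meat and Seafood
Produce
Checkout
Dairy
Meat and Seafood
Bakery
Frozen
Packaged Foods
Produce
Packaged Foods
Dairy
Dairy
Juice'

# sort -u drops duplicate lines; grep -c '' counts the lines that remain.
distinct_count=$(printf '%s\n' "$departments" | sort -u | grep -c '')
echo "$distinct_count"   # prints 8
```

This exact count is practical for 14 rows; the cardinality aggregation exists so that the counting happens inside the cluster instead of on the client.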

Other Cardinality Options

The example we just reviewed shows how to use cardinality aggregation for a single field; in this case, it was "department". However, cardinality aggregations can also be applied to multiple fields. To accomplish this, you would use the script parameter in your aggregator code, supplying an inline script or a stored script. Adding the scripting option to a cardinality aggregation does have an impact on performance, but the ability to find unique combinations across multiple fields usually makes it worthwhile. For additional information on the script option and other cardinality options, consult the Elasticsearch documentation.
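As a rough illustration of the script option, the request below counts distinct combinations of department and quantity by joining the two values in an inline Painless script. Treat it as a sketch: the aggregation name department_quantity_count, the QUERY variable, the run_script_cardinality function, and the choice of fields are all illustrative, and whether this form is accepted depends on your Elasticsearch version.

```shell
#!/bin/sh
# Request body: the script builds one string per document from two fields,
# and the cardinality aggregation counts the distinct strings.
QUERY=$(cat <<'EOF'
{
  "aggs" : {
    "department_quantity_count" : {
      "cardinality" : {
        "script" : {
          "lang" : "painless",
          "source" : "doc['department'].value + ' ' + doc['quantity'].value"
        }
      }
    }
  }
}
EOF
)

# run_script_cardinality (hypothetical helper) sends the request to the
# same store index used in the earlier examples.
run_script_cardinality() {
  curl -s -H "Content-Type: application/json" -XGET \
    "127.0.0.1:9200/store/_search?size=0&pretty" -d "$QUERY"
}
```

The script reads doc values, which is why it uses the keyword field department and the integer field quantity and leaves out the text field name.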

Conclusion

Being able to find the number of distinct values in a dataset can help you gain new insights and understand your data better. With this step-by-step guide, it will be quick and simple to accomplish this task in Elasticsearch and use this analysis in your own applications.

Just the Code

If you’re already familiar with the concept of aggregation, here’s all the code you’ll need to find the number of distinct values in a dataset.

curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?size=0&pretty" -d '
{
  "aggs" : {
    "department_count" : {
      "cardinality" : {
        "field" : "department"
      }
    }
  }
}
'
