How to Compute a Weighted Average with Aggregations in Elasticsearch
Introduction
When you’re determining the average for values in a dataset, sometimes it doesn’t make sense to treat all the values equally. For example, if you’re calculating student grades, doing inventory cost accounting, or computing average bond yields, it’s helpful to assign a relative importance to each value in your dataset. That’s where a weighted average comes into play. Elasticsearch makes this task simple through the use of aggregations. In this tutorial, we’ll provide step-by-step instructions for computing a weighted average with aggregations in Elasticsearch. If you’d prefer to skip the explanations and dive into the sample code, feel free to jump to Just the Code.
Prerequisites
Before we attempt to compute a weighted average in Elasticsearch, it’s important to make sure a few prerequisites are in place. There are only a couple of system requirements for this task: NodeJS needs to be installed, and Elasticsearch also needs to be installed and running. In our example, we’ll assume that Elasticsearch is installed locally and is using the default port, so our curl
commands will take the form: localhost:9200
. If your Elasticsearch installation is running on a different server, you’ll need to modify your curl
syntax accordingly: YOURDOMAIN.com:9200
.
Note: Weighted average functionality is only available in Elasticsearch 6.4.0+. If you’re running an older version of Elasticsearch, it’s necessary to upgrade before proceeding with this tutorial.
What is a weighted average?
Before we look at the math involved in a weighted average, let’s review a simple example of how a weighted average might be used. Let’s say that the average height of men worldwide is 180 cm, and the average height of women is 170 cm. You might jump to the conclusion that the average height of humans worldwide must be 175 cm. Why is this assumption incorrect? It doesn’t take into account the fact that there aren’t equal numbers of men and women in the world. Using a weighted average, we could account for the fact that there are more women than men by assigning the 170 cm value more weight in the average.
Here’s the math behind the concept: In a weighted average, each value in a dataset can be assigned a different weight. The following equation shows how to calculate the weighted average of a set of values, assuming you know the weight assigned to each value.
Weighted Average = ∑(value * weight) / ∑(weight)
Using the weighted_avg aggregation
Now that we’ve talked about the math behind weighted averages, we’ll look at an example of how to calculate one using aggregations in Elasticsearch. For our example, we’ll use a sample index called store
, which represents a small grocery store. Our index will contain a type called products
which lists all of the store’s products. We’ll keep our dataset simple by having just a handful of products with a small number of fields: id, price, quantity, and department. The JSON below can be used to create this dataset:
id | name | price | quantity | department | |
---|---|---|---|---|---|
1 | Multi-Grain Cereal | 4.99 | 4 | Packaged Foods | |
2 | 1lb Ground Beef | 3.99 | 29 | Meat and Seafood | |
3 | Dozen Apples | 2.49 | 12 | Produce | |
4 | Chocolate Bar | 1.29 | 2 | Packaged Foods | Checkout |
5 | 1 Gallon Milk | 3.29 | 16 | Dairy | |
6 | 0.5lb Jumbo Shrimp | 5.29 | 12 | Meat and Seafood | |
7 | Wheat Bread | 1.29 | 5 | Bakery | |
8 | Pepperoni Pizza | 2.99 | 5 | Frozen | |
9 | 12 Pack Cola | 5.29 | 6 | Packaged Foods | |
10 | Lime Juice | 0.99 | 20 | Produce | |
11 | 12 Pack Cherry Cola | 5.59 | 5 | Packaged Foods | |
12 | 1 Gallon Soy Milk | 3.39 | 10 | Dairy | |
13 | 1 Gallon Vanilla Soy Milk | 3.49 | 9 | Dairy | |
14 | 1 Gallon Orange Juice | 3.29 | 4 | Juice |
The mapping is shown below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | curl -H "Content-Type: application/json" -XPUT 127.0.0.1:9200/store -d ' { "mappings": { "products": { "properties" : { "name": { "type": "text"}, "price": { "type": "double"}, "quantity": { "type": "integer"}, "department": { "type": "keyword"} } } } } ' |
Let’s imagine we want to calculate the average price of a product in our store. A standard non-weighted average isn’t going to give us the most accurate result in this situation, because we have a large quantity of certain products and a small quantity of others. For this case, a weighted average is the ideal choice. With a weighted average, we’ll assign twice the weight to the $1.29 price of our “Chocolate Bars”, because we have two of them in inventory. Similarly, we’ll assign ten times the weight to the $0.99 price of our “Lime Juice”, because there are ten units in stock.
Our calculation would look something like this: (4.99 x 4) + (3.99 x 29) + (2.49 x 12) + … / ( 4 + 29 + 12 + …)
Fortunately, we don’t need to worry much about the math, because the code below will do the work for us:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?pretty" -d ' { "size": 0, "aggs" : { "weighted_avg_price": { "weighted_avg": { "value": { "field": "price" }, "weight": { "field": "quantity" } } } } } ' |
Let’s take a closer look at what’s happening in this code. First, we created an aggregator using "aggs"
. We named our aggregator "weighted_avg_price"
, and we set its type to "weighted_avg"
. Then, we set the "value"
field to "price"
, which tells Elasticsearch that we want to calculate an average for the "price"
field. Finally, we set the "weight"
field to "quantity"
, which lets Elasticsearch know to use the value of "quantity"
to assign a relative weight to each price.
Let’s see what the results have to say:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | { "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : 14, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "weighted_avg_price" : { "value" : 3.3353237410071945 } } } |
The weighted_avg_price
returned was $3.34. Using the equation shown earlier in this tutorial, this is the answer you would get if you calculated the weighted average by hand.
Other weighted_avg Options
While the example we just reviewed showed a fairly simple weighted average calculation, a few options are available to handle more complex cases. For example, you can use the missing
parameter to specify your own value
or weight
to use if either of these fields are missing. You can also provide a script to determine values or weights if needed. For more information on these options and other Elasticsearch aggregation options, see their documentation.
Conclusion
When some values in a dataset have a greater relative importance than others, a standard average calculation won’t give you the most accurate result– a weighted average is the better choice. Fortunately, Elasticsearch makes this task easy with their weighted average aggregation. With the instructions provided in this tutorial, it will be easy to apply this analysis to your own Elasticsearch datasets.
Just the Code
If you’re already familiar with the concept of aggregation, here’s all the code you’ll need to calculate a weighted average:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | curl -H "Content-Type: application/json" -XGET "127.0.0.1:9200/store/_search?pretty" -d ' { "size": 0, "aggs" : { "weighted_avg_price": { "weighted_avg": { "value": { "field": "price" }, "weight": { "field": "quantity" } } } } } ' |
Pilot the ObjectRocket Platform Free!
Try Fully-Managed CockroachDB, Elasticsearch, MongoDB, PostgreSQL (Beta) or Redis.
Get Started