The problem
A question on the Elasticsearch Slack today received an interesting answer that used a trick I haven’t noticed before. The trick is noteworthy as it has some handy use-cases.
The question boiled down to finding the number of documents where a field value is true
, then using that count in a bucket_script
pipeline aggregation.
Ben Trent - an ML Engineer at Elastic - replied with a trick involving the sum
aggregation type. When applied to a boolean
field, sum
will treat values of false
as 0
and true
as 1
. This results in it returning the number of documents with a value of true
.
The Gist Ben provided gave a great example of how this can be used to address the OP’s problem. I’ll go over how it works but change the mapping to be a bit more specific.
The solution
Setup
I’ll create the mapping and index some documents first:
PUT products/
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"properties": {
"productCategory": {
"type": "keyword"
},
"inStock": {
"type": "boolean"
}
}
}
}
PUT products/_doc/1
{
"productCategory": "book",
"inStock": true
}
PUT products/_doc/2
{
"productCategory": "book",
"inStock": false
}
PUT products/_doc/3
{
"productCategory": "book",
"inStock": true
}
PUT products/_doc/4
{
"productCategory": "toy",
"inStock": true
}
PUT products/_doc/5
{
"productCategory": "toy",
"inStock": true
}
PUT products/_doc/6
{
"productCategory": "toy",
"inStock": false
}
PUT products/_doc/7
{
"productCategory": "toy",
"inStock": false
}
Start at the top
I’ll start by analysing the products by category.
GET products/_search
{
"size": 0,
"aggs": {
"categories": {
"terms": {
"field": "productCategory"
}
}
}
}
This creates a bucket per category - I’m removing the metadata and hits
blocks from all results for brevity:
{
"aggregations" : {
"categories" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "toy",
"doc_count" : 4
},
{
"key" : "book",
"doc_count" : 3
}
]
}
}
}
We need to know the following in order to work out the percentage of products that are in stock:
- Number of products in the category with a value of
true
for theinStock
field - Total number of products in the category
Counting in stock products
Finding the number of in stock products is the more tricky of the two but this is where the sum
aggregation will make life much easier. To see how it works, use a terms
aggregation on the inStock
field:
GET products/_search
{
"size": 0,
"aggs": {
"categories": {
"terms": {
"field": "productCategory"
},
"aggs": {
"inStock": {
"terms": {
"field": "inStock"
}
}
}
}
}
}
The bucket keys in the results show how Elasticsearch is treating the values:
{
"aggregations" : {
"categories" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "toy",
"doc_count" : 4,
"inStock" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 0,
"key_as_string" : "false",
"doc_count" : 2
},
{
"key" : 1,
"key_as_string" : "true",
"doc_count" : 2
}
]
}
},
{
"key" : "book",
"doc_count" : 3,
"inStock" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1,
"key_as_string" : "true",
"doc_count" : 2
},
{
"key" : 0,
"key_as_string" : "false",
"doc_count" : 1
}
]
}
}
]
}
}
}
Switching the terms
to a sum
will give us the number we need straight away, as Elasticsearch adds up all the ones (in stock) and zeroes (not in stock):
GET products/_search
{
"size": 0,
"aggs": {
"categories": {
"terms": {
"field": "productCategory"
},
"aggs": {
"numInStock": {
"sum": {
"field": "inStock"
}
}
}
}
}
}
{
"aggregations" : {
"categories" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "toy",
"doc_count" : 4,
"numInStock" : {
"value" : 2.0,
"value_as_string" : "true"
}
},
{
"key" : "book",
"doc_count" : 3,
"numInStock" : {
"value" : 2.0,
"value_as_string" : "true"
}
}
]
}
}
}
Counting all products
Ben used a value_count
aggregation to count the number of documents in the categories
buckets. This is an unnecessary step; the document count is already available for us to use in a pipeline aggregation.
Calculating the percentage
Here’s the pipeline aggregation to calculate the percentage of in stock products:
GET products/_search
{
"size": 0,
"aggs": {
"categories": {
"terms": {
"field": "productCategory"
},
"aggs": {
"numInStock": {
"sum": {
"field": "inStock"
}
},
"percInStock": {
"bucket_script": {
"buckets_path": {
"inStock": "numInStock",
"total": "_count"
},
"script": "(params.inStock / params.total) * 100"
}
}
}
}
}
}
The buckets_path
section may look a bit strange.
A bucket_script
pipeline aggregation is a parent pipeline aggregation, meaning it can compute new values from the output of its parent aggregation. In this case, the parent is the categories
aggregation. _count
is a special bucket path that will use the document count as its input. This is the number of products in the category bucket.
numInStock
is a metric aggregation, so using it in a buckets_path
will use the metric aggregation value.
The script simply does the calculation to find the percentage and the output is just what we need:
{
"aggregations" : {
"categories" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "toy",
"doc_count" : 4,
"numInStock" : {
"value" : 2.0,
"value_as_string" : "true"
},
"percInStock" : {
"value" : 50.0
}
},
{
"key" : "book",
"doc_count" : 3,
"numInStock" : {
"value" : 2.0,
"value_as_string" : "true"
},
"percInStock" : {
"value" : 66.66666666666666
}
}
]
}
}
}
Summary
Using sum
aggregations on boolean
fields is a really useful trick for finding the number of documents where that field value is true
.
Despite pipeline aggregations being removed from the Elastic Certified Engineer exam, they’re an incredibly useful feature and I will be writing more about them in the future.