A common mistake I’ve seen, and made, is misunderstanding how should clauses in a bool query work. They’re often understood to be the OR part of your query, but that’s true only some of the time.
It’s important to know exactly what a should clause does, as bool queries are the bread and butter of most Elasticsearch queries and getting it wrong can return more documents than you would expect. There can be certification exam questions on the mechanics, too.
should this match?
To demonstrate this issue, we’ll create an index, add some documents, then run some queries and dig into the results.
PUT myindex
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"properties": {
"category": {
"type": "keyword"
},
"comment": {
"type": "text"
}
}
}
}
POST myindex/_doc
{
"category": "video",
"comment": "The video was good"
}
POST myindex/_doc
{
"category": "video",
"comment": "The video was terrible"
}
POST myindex/_doc
{
"category": "video",
"comment": "I didn't watch the video"
}
POST myindex/_doc
{
"category": "blog",
"comment": "Which blog?"
}
POST myindex/_doc
{
"category": "blog",
"comment": "The blog is good"
}
POST myindex/_doc
{
"category": "blog",
"comment": "I time my watch by the regularity of the posts"
}
We want to find all the documents where the comment contains either watch or good, which stinks of a should query.
GET myindex/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"comment": "watch"
}
},
{
"match": {
"comment": "good"
}
}
]
}
}
}
The results are exactly what we’re expecting; all the documents returned contain either watch or good:
{
...
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : 1.1077526,
"hits" : [
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "IdL6YHEB_3GqsWyQHD8Y",
"_score" : 1.1077526,
"_source" : {
"category" : "video",
"comment" : "The video was good"
}
},
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "JdL6YHEB_3GqsWyQHj-z",
"_score" : 1.1077526,
"_source" : {
"category" : "blog",
"comment" : "The blog is good"
}
},
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "I9L6YHEB_3GqsWyQHj89",
"_score" : 1.0152972,
"_source" : {
"category" : "video",
"comment" : "I didn't watch the video"
}
},
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "JtL6YHEB_3GqsWyQHz8B",
"_score" : 0.71635467,
"_source" : {
"category" : "blog",
"comment" : "I time my watch by the regularity of the posts"
}
}
]
}
}
We now want to filter down those documents to just the comments that are in the video category. Most people will simply add a must block to the top of the bool query and expect it to do exactly what we’re asking:
GET myindex/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"category": "video"
}
}
],
"should": [
{
"match": {
"comment": "watch"
}
},
{
"match": {
"comment": "good"
}
}
]
}
}
}
Looking at the results, we can see that this is where the trouble starts:
{
...
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.8008997,
"hits" : [
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "IdL6YHEB_3GqsWyQHD8Y",
"_score" : 1.8008997,
"_source" : {
"category" : "video",
"comment" : "The video was good"
}
},
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "I9L6YHEB_3GqsWyQHj89",
"_score" : 1.7084444,
"_source" : {
"category" : "video",
"comment" : "I didn't watch the video"
}
},
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "ItL6YHEB_3GqsWyQHT_r",
"_score" : 0.6931472,
"_source" : {
"category" : "video",
"comment" : "The video was terrible"
}
}
]
}
}
minimum_should_match and its changing default value
You were probably expecting two documents to match; what’s that third document doing in the results? It is a video comment but it doesn’t contain either watch or good. Understanding why this is happening requires being aware of what the minimum_should_match parameter does and knowing that it has different default values depending on what else is in the bool query.
minimum_should_match specifies how many of the should clauses in our query, as a number or a percentage, must match a document. For example, if we have four should clauses and we want at least two of them to match a document, we can specify either 2 or 50%. Under most circumstances, you’ll want to match one of the clauses. Even if you are aware of what minimum_should_match does, the default value is what will trip up some people.
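To make the number-versus-percentage idea concrete, here is a minimal Python sketch of how such a value could be resolved into a clause count. The function name required_should_matches is hypothetical, and the sketch only covers positive integers and positive percentages; Elasticsearch accepts other forms (negative values, combinations) that are left out here. It assumes the percentage result is rounded down, as the Elasticsearch documentation for the parameter describes.

```python
def required_should_matches(spec, num_should_clauses):
    """Resolve a minimum_should_match value into a number of clauses.

    Simplified model: accepts a positive integer (2 or "2") or a
    positive percentage ("50%"). Other forms are not handled.
    """
    text = str(spec).strip()
    if text.endswith("%"):
        percent = int(text[:-1])
        # Integer division rounds the percentage result down.
        return num_should_clauses * percent // 100
    return int(text)


# Four should clauses, at least two of which must match:
print(required_should_matches(2, 4))      # 2
print(required_should_matches("50%", 4))  # 2
# 75% of three clauses is 2.25, which rounds down to 2:
print(required_should_matches("75%", 3))  # 2
```

With four should clauses, 2 and 50% resolve to the same requirement, which is why the two forms are interchangeable in the example above.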
This is the key part in the docs:
If the bool query includes at least one should clause and no must or filter clauses, the default value is 1. Otherwise, the default value is 0.
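As a rough illustration, that sentence from the docs can be written as a small Python function over a bool query body expressed as a dict. The function name default_minimum_should_match is made up for this sketch; it models only the quoted rule, not anything else Elasticsearch does.

```python
def default_minimum_should_match(bool_query):
    """Default minimum_should_match for a bool query body (a dict),
    following the rule quoted from the docs."""
    has_should = bool(bool_query.get("should"))
    has_must_or_filter = any(bool_query.get(key) for key in ("must", "filter"))
    return 1 if has_should and not has_must_or_filter else 0


watch = {"match": {"comment": "watch"}}
video = {"match": {"category": "video"}}
blog = {"match": {"category": "blog"}}

print(default_minimum_should_match({"should": [watch]}))                      # 1
print(default_minimum_should_match({"must": [video], "should": [watch]}))     # 0
print(default_minimum_should_match({"filter": [video], "should": [watch]}))   # 0
print(default_minimum_should_match({"must_not": [blog], "should": [watch]}))  # 1
```

Note that must_not is not mentioned in the rule, so a bool query with only should and must_not clauses still defaults to 1.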
In the first query containing only should clauses, minimum_should_match took a default of 1 so we got the expected results.
In our query above, however, our should is combined with a must. As we don’t specify a value for minimum_should_match, the default is 0. Therefore, none of our should clauses are required to match a document. Documents will only be filtered down by the must clause; anything that does actually match the should clauses will only have its score increased. The third document in the results is a video document but doesn’t match any of the should clauses. It is therefore returned as a match but has a much lower score than the other two, which each match a should clause.
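The behaviour described above can be imitated with a toy Python model of our six documents. This is only a sketch: the search function is hypothetical, the tokenizer is a crude stand-in for the real analyzer, and the "score" is just a count of matching should clauses rather than Elasticsearch’s actual relevance score.

```python
import re

DOCS = [
    {"category": "video", "comment": "The video was good"},
    {"category": "video", "comment": "The video was terrible"},
    {"category": "video", "comment": "I didn't watch the video"},
    {"category": "blog", "comment": "Which blog?"},
    {"category": "blog", "comment": "The blog is good"},
    {"category": "blog", "comment": "I time my watch by the regularity of the posts"},
]


def tokens(text):
    # Crude stand-in for the analyzer: lowercase, keep letters and apostrophes.
    return set(re.findall(r"[a-z']+", text.lower()))


def search(category, should_terms, minimum_should_match):
    hits = []
    for doc in DOCS:
        if doc["category"] != category:  # the must clause filters documents
            continue
        matched = sum(term in tokens(doc["comment"]) for term in should_terms)
        if matched >= minimum_should_match:
            hits.append((matched, doc["comment"]))
    # More matching should clauses ranks higher (a stand-in for the score).
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


for msm in (0, 1):
    comments = [comment for _, comment in search("video", ["watch", "good"], msm)]
    print(msm, comments)
```

With a minimum of 0, all three video comments come back and "The video was terrible" simply ranks last; with a minimum of 1, it drops out.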
Several fixes
There are several ways to fix the query but the easiest one is to simply apply a minimum_should_match of 1:
GET myindex/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"category": "video"
}
}
],
"should": [
{
"match": {
"comment": "watch"
}
},
{
"match": {
"comment": "good"
}
}
],
"minimum_should_match": 1
}
}
}
We’ll now get the results we were expecting:
{
...
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.8008997,
"hits" : [
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "IdL6YHEB_3GqsWyQHD8Y",
"_score" : 1.8008997,
"_source" : {
"category" : "video",
"comment" : "The video was good"
}
},
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "I9L6YHEB_3GqsWyQHj89",
"_score" : 1.7084444,
"_source" : {
"category" : "video",
"comment" : "I didn't watch the video"
}
}
]
}
}
Another fix, without using minimum_should_match, is putting the should clauses in a nested bool query inside the must. The nested bool contains only should clauses, so its own default is 1:
GET myindex/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"category": "video"
}
},
{
"bool": {
"should": [
{
"match": {
"comment": "watch"
}
},
{
"match": {
"comment": "good"
}
}
]
}
}
]
}
}
}
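In our case, as we only have video and blog comments, we could instead have excluded the blog comments with a must_not and omitted minimum_should_match. A must_not clause isn’t a must or filter clause, so the default stays at 1: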
GET myindex/_search
{
"query": {
"bool": {
"must_not": [
{
"match": {
"category": "blog"
}
}
],
"should": [
{
"match": {
"comment": "watch"
}
},
{
"match": {
"comment": "good"
}
}
]
}
}
}
Which one you choose depends on your use case, but the clearest way is to simply apply a minimum_should_match. Then there’s no need to try to remember which default value the bool query is using.
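If you build request bodies in code, one way to follow that advice is to have the builder always spell the value out. Here is a minimal Python sketch; build_comment_query is a hypothetical helper, not part of any Elasticsearch client, and it only produces the same body as the first fix above.

```python
import json


def build_comment_query(category, terms, minimum_should_match=1):
    """Build a bool query body with minimum_should_match always explicit."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"category": category}}],
                "should": [{"match": {"comment": term}} for term in terms],
                "minimum_should_match": minimum_should_match,
            }
        }
    }


body = build_comment_query("video", ["watch", "good"])
print(json.dumps(body, indent=2))
```

Because the value is always present in the body, the result no longer depends on which other clauses happen to be in the bool query.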
Conclusion
It’s easy to see where we’re getting incorrect results in this contrived example. In the wild, when you’re filtering down billions of documents, it’s harder to spot that you’re getting documents you’re not expecting.
bool queries are used everywhere and it’s important to know how they work so you can make your queries efficient, relevant, and - most importantly - correct.