Solution 01
This is a case of updating the elasticsearch.yml and jvm.options files.
In elasticsearch/config/elasticsearch.yml, these two properties need to be configured:
cluster.name: lab-cluster
node.name: node01
The heap size setting needs two changes in elasticsearch/config/jvm.options:
-Xms2g
-Xmx2g
Elasticsearch can then be started by changing to the directory containing the Elasticsearch distribution and running bin/elasticsearch.
Elasticsearch configuration notes
This solution assumes you’re running Elasticsearch from the command line in this lab environment. In production systems, however, the configuration files can be elsewhere, depending on how Elasticsearch is being run and the operating system being used. The documentation does a good job of describing where the files can live and how to modify them.
Environment variables can be very useful in an Elasticsearch configuration. For example, node.name could have a value of ${NODE_NAME} that we then set to node01 when starting the node.
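A minimal sketch of that approach, assuming a Unix-like shell. In elasticsearch/config/elasticsearch.yml:

node.name: ${NODE_NAME}

Then supply the variable when starting the node:

NODE_NAME=node01 bin/elasticsearch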
Solution 02
You need to configure the elasticsearch.hosts property in kibana/config/kibana.yml with the REST endpoint of your Elasticsearch node:
elasticsearch.hosts: ["http://localhost:9200"]
Solution 03
I wrote up full instructions in a blog post here.
Solution 04
The _cat API is ideal for tasks like this. It provides details on many cluster and index settings and gives a response in human-readable format.
Index data can be fetched from _cat/indices. The data can be filtered to a specific index by adding an index name, using an optional wildcard, to the end of the URL. The attributes in the response can also be filtered by using the h component of the query string, and they will be returned in the order you ask for them. Headers can be included by adding the v component.
GET _cat/indices/olympic-events?v&h=index,health,docs.count,pri.store.size
Solution 05
The clue is in the question text! The explain API will provide this information:
GET _cluster/allocation/explain
This call will explain the reason for the yellow index status. It provides plenty of information about the index, shard, and state causing the problem, as well as an explanation:
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[olympic-events][0], node[07lbZT6ORAi99ewhrplKnA], [P], s[STARTED], a[id=dyKLbCNIR6Ke01knDeWbjQ]]"
Solution 06
The problem is that, by default, Elasticsearch will create a new index with 1 primary shard and 1 replica shard. We only have one node in the cluster, and a replica shard can’t be assigned to the same node as its primary. This results in the replica shard becoming unassigned and the index (and therefore the cluster) going into a yellow status.
The replica shard needs to be either assigned to a different node or removed completely. The question is asking for cluster or index settings to be changed, indicating that the replica should be removed. The index settings can be used to set the number of replicas to 0 with the following:
PUT olympic-events/_settings
{
  "number_of_replicas": 0
}
Test that the index status is now green using the same call to the _cat API used in solution 04.
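That call again, for convenience:

GET _cat/indices/olympic-events?v&h=index,health,docs.count,pri.store.size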
Solution 07
Have a look at the mapping by using GET olympic-events/_mapping. Elasticsearch has created the Age field as a multi-field text-based mapping. Finding the unique values for the field requires a terms aggregation. A terms aggregation will only return 10 buckets by default, so a size property is required to make sure we get all the values back:
GET olympic-events/_search
{
  "size": 0,
  "aggs": {
    "ages": {
      "terms": {
        "field": "Age.keyword",
        "size": 100
      }
    }
  }
}
The suspicious value is NA. Data ‘pollution’ like this is fairly common. Its bucket in the response looks like this:
{
  "key" : "NA",
  "doc_count" : 9474
}
Solution 08
A reindex like this isn’t a real backup but it’s going to be good enough for us in this situation. A real backup would be a snapshot, which we’ll come on to later.
POST _reindex
{
  "source": {
    "index": "olympic-events"
  },
  "dest": {
    "index": "olympic-events-backup"
  }
}
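As a quick sanity check, assuming the reindex completed without errors, the document counts of the source and backup indices can be compared with the _cat API:

GET _cat/indices/olympic-events*?v&h=index,docs.count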
Solution 09
Deleting a selection of documents from an index is a great use-case for delete_by_query. Writing the query first to make sure it’s only returning documents you want deleted is a sensible thing to do. You will have greater confidence that you’re only deleting the documents you want to.
There are different ways of writing the query. Matching the same value in multiple fields can be done using multi_match:
POST olympic-events/_delete_by_query
{
  "query": {
    "multi_match": {
      "query": "NA",
      "fields": ["Height", "Weight", "Age"]
    }
  }
}
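Following the advice above, the same query can first be dry-run as an ordinary search to confirm what would match. A minimal sketch, fetching only the hit count:

GET olympic-events/_search
{
  "size": 0,
  "query": {
    "multi_match": {
      "query": "NA",
      "fields": ["Height", "Weight", "Age"]
    }
  }
}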
A bool query is more verbose but equally effective.
POST olympic-events/_delete_by_query
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "Age": "NA"
          }
        },
        {
          "match": {
            "Height": "NA"
          }
        },
        {
          "match": {
            "Weight": "NA"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
Notice I’m using the minimum_should_match attribute, even though it’s not required here. Make your intention clear with bool queries and remove any ambiguity or need for knowledge about how the defaults work.
I have seen people use the term query on text fields, only to wonder why they’re not getting documents returned. A term query does an exact match, without using an analyser. This is fine when querying keyword fields but not text fields. The standard analyser lowercases all text, but a term query does an exact match, so it would have matched na but not NA.
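To see the trap in action (before the deletes above have run), a minimal sketch; the query below finds nothing because the standard analyser indexed the lowercased token na:

GET olympic-events/_search
{
  "query": {
    "term": {
      "Age": "NA"
    }
  }
}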
Solution 10
Ingest pipelines are created via the _ingest API. Our new pipeline will need three stages:
- Split the Games field on a space.
- Put the first component into a new year field and the second into a new season field.
- Remove the Games field.
The first and last steps in the pipeline are easily dealt with by the split and remove processors. You may think that the second step could be done with the set processor, but it doesn’t work with array notation and will result in empty strings in your new fields. The script processor can come to the rescue.
PUT _ingest/pipeline/split_games
{
  "processors": [
    {
      "split": {
        "field": "Games",
        "separator": " "
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          ctx.year = ctx.Games[0];
          ctx.season = ctx.Games[1];
        """
      }
    },
    {
      "remove": {
        "field": "Games"
      }
    }
  ]
}
Solution 11
Ingest pipelines can be tested with the _simulate API:
POST _ingest/pipeline/split_games/_simulate
{
  "docs": [
    {
      "_source": {
        "Games": "1998 Summer"
      }
    }
  ]
}
The response shows the Games field replaced by the new year and season fields:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "season" : "Summer",
          "year" : "1998"
        },
        "_ingest" : {
          "timestamp" : "2020-07-12T20:33:02.628755Z"
        }
      }
    }
  ]
}
Solution 12
A new index can be created with a PUT request to the index endpoint. A good place to start for the request body is the mapping of the existing index: take the response from GET olympic-events/_mapping and transform it into the fields we need. The settings for the shards and the mappings can be applied in one request:
PUT olympic-events-fixed
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "athleteId": {
        "type": "integer"
      },
      "age": {
        "type": "short"
      },
      "height": {
        "type": "short"
      },
      "weight": {
        "type": "short"
      },
      "athleteName": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "gender": {
        "type": "keyword"
      },
      "team": {
        "type": "keyword"
      },
      "noc": {
        "type": "keyword"
      },
      "year": {
        "type": "short"
      },
      "season": {
        "type": "keyword"
      },
      "city": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "sport": {
        "type": "keyword"
      },
      "event": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "medal": {
        "type": "keyword"
      }
    }
  }
}
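Once created, the new index’s mapping can be double-checked with the same call used at the start of solution 07:

GET olympic-events-fixed/_mapping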
Next steps
The solutions to the next set of exercises can be found here.