Solutions 01 - 12

Solution 01

This is a case of updating the elasticsearch.yml and jvm.options files.

In elasticsearch/config/elasticsearch.yml, these two properties need to be configured:

  • cluster.name: lab-cluster
  • node.name: node01

The heap size setting needs two changes in elasticsearch/config/jvm.options:

  • -Xms2g
  • -Xmx2g

Elasticsearch can then be started by changing to the directory containing the Elasticsearch distribution and running bin/elasticsearch.

Elasticsearch configuration notes

This solution assumes you’re running Elasticsearch from the command line in this lab environment. In production systems, however, the configuration files can be elsewhere, depending on how Elasticsearch is being run and the operating system being used. The documentation does a good job of describing where the files can live and how to modify them.

Environment variables can be very useful in an Elasticsearch configuration. For example, node.name could be set to ${NODE_NAME}, with the NODE_NAME environment variable given the value node01 when the process is started.
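A minimal sketch of that approach (the NODE_NAME variable name is just an example):

```yaml
# elasticsearch/config/elasticsearch.yml
node.name: ${NODE_NAME}
```

The variable is then supplied when starting the node, for example with NODE_NAME=node01 bin/elasticsearch.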

Solution 02

You need to configure the elasticsearch.hosts property in kibana/config/kibana.yml with the REST endpoint of your Elasticsearch node:

elasticsearch.hosts: ["http://localhost:9200"]

Solution 03

Simply follow the instructions and you’ll have the data loaded into the cluster.

Solution 04

The _cat API is ideal for tasks like this. It provides details on many cluster and index settings and returns its responses in a human-readable format.

Index data can be fetched from _cat/indices. The data can be filtered to a specific index by adding an index name, using an optional wildcard, to the end of the URL. The attributes in the response can also be filtered by using the h component of the query string. The attributes in the response will be in the order you ask for them. Headers can be included by adding the v component.

GET _cat/indices/olympic-events?v&h=index,health,docs.count,pri.store.size

Solution 05

The clue is in the question text! The explain API will provide this information:

GET _cluster/allocation/explain

This call will explain the reason for the yellow index status. It provides plenty of information about the index, shard, and state causing the problem, as well as an explanation:

"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[olympic-events][0], node[07lbZT6ORAi99ewhrplKnA], [P], s[STARTED], a[id=dyKLbCNIR6Ke01knDeWbjQ]]"

Solution 06

The problem is that, by default, Elasticsearch will create a new index with 1 primary shard and 1 replica shard. We only have one node in the cluster and a replica shard can’t be assigned to the same node as its primary. This results in the replica shard becoming unassigned and the index (and therefore, cluster) going into a yellow status.

The replica shard needs to be either assigned to a different node, or removed completely. The question is asking for cluster or index settings to be changed, indicating that the replica should be removed. The index settings can be used to set the number of replicas to 0 with the following:

PUT olympic-events/_settings
{
  "number_of_replicas": 0
}

Test that the index status is now green using the same call to the _cat API used in solution 04.
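For convenience, that's the same request as in Solution 04; the health column should now read green:

```
GET _cat/indices/olympic-events?v&h=index,health,docs.count,pri.store.size
```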

Solution 07

Have a look at the mapping by using GET olympic-events/_mapping. Elasticsearch has created the Age field as a multi-field text-based mapping. Finding the unique values for the field requires a terms aggregation. A terms aggregation will only return 10 buckets by default, so a size property is required to make sure we get all the values back:

GET olympic-events/_search
{
  "size": 0,
  "aggs": {
    "ages": {
      "terms": {
        "field": "Age.keyword",
        "size": 100
      }
    }
  }
}

The suspicious value is NA. Data ‘pollution’ like this is fairly common.

{
  "key" : "NA",
  "doc_count" : 9474
}

Solution 08

A reindex like this isn’t a real backup but it’s going to be good enough for us in this situation. A real backup would be a snapshot, which we’ll come on to later.

POST _reindex
{
  "source": {
    "index": "olympic-events"
  },
  "dest": {
    "index": "olympic-events-backup"
  }
}
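Once the reindex completes, a quick sanity check is to confirm the backup's document count matches the original. Using the same _cat style as Solution 04, with a wildcard to catch both indices:

```
GET _cat/indices/olympic-events*?v&h=index,docs.count
```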

Solution 09

Deleting a selection of documents from an index is a great use case for delete_by_query. It's sensible to run the query first as a search, to confirm it only returns documents you want deleted; that gives you greater confidence before running the destructive call.

There are different ways of writing the query. Matching the same value in multiple fields can be done using multi_match:

POST olympic-events/_delete_by_query
{
  "query": {
    "multi_match": {
      "query": "NA",
      "fields": ["Height", "Weight", "Age"]
    }
  }
}
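As suggested above, the same query body can be run against the _count API first to preview how many documents would be deleted (an illustrative check, not part of the required solution):

```
GET olympic-events/_count
{
  "query": {
    "multi_match": {
      "query": "NA",
      "fields": ["Height", "Weight", "Age"]
    }
  }
}
```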

A bool query is more verbose but equally effective.

POST olympic-events/_delete_by_query
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "Age": "NA"
          }
        },
        {
          "match": {
            "Height": "NA"
          }
        },
        {
          "match": {
            "Weight": "NA"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

Notice I’m using the minimum_should_match attribute, even though it’s not required here. Making your intention explicit in bool queries removes ambiguity and means readers don’t need to remember the default behaviour.

I have seen people use the term query on text fields, only to wonder why no documents come back. A term query does an exact match, without using an analyser. This is fine when querying keyword fields but not text fields: the standard analyser lowercases all text at index time, so an exact-match term query would have matched na but not NA.
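The difference is easy to demonstrate against the lab index (illustrative requests; the first should return hits, the second shouldn't):

```
GET olympic-events/_search
{
  "query": { "term": { "Age.keyword": "NA" } }
}

GET olympic-events/_search
{
  "query": { "term": { "Age": "NA" } }
}
```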

Solution 10

Ingest pipelines are created via the _ingest API. Our new pipeline will need three stages:

  1. Split the Games field on a space.
  2. Put the first component into a new year field and the second into a new season field.
  3. Remove the Games field.

The first and last steps in the pipeline are easily dealt with by the split and remove processors. You may think that the second step could be done with the set processor but it doesn’t work with array notation and will result in empty strings in your new fields. The script processor can come to the rescue.

PUT _ingest/pipeline/split_games
{
  "processors": [
    {
      "split": {
        "field": "Games",
        "separator": " "
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
ctx.year = ctx.Games[0];
ctx.season = ctx.Games[1];
"""
      }
    },
    {
      "remove": {
        "field": "Games"
      }
    }
  ]
}
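Outside of Elasticsearch, the pipeline's transformation is just a string split. A minimal Python sketch of the same logic (the function name is mine, not part of any API):

```python
def split_games(doc):
    """Mimic the split, script and remove processors above."""
    year, season = doc["Games"].split(" ")  # e.g. "1998 Summer"
    doc["year"] = year
    doc["season"] = season
    del doc["Games"]  # the remove processor
    return doc

print(split_games({"Games": "1998 Summer"}))
# → {'year': '1998', 'season': 'Summer'}
```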

Solution 11

Ingest pipelines can be tested with the _simulate API:

POST _ingest/pipeline/split_games/_simulate
{
  "docs": [
    {
      "_source": {
        "Games": "1998 Summer"
      }
    }
  ]
}
The response shows the Games field replaced by the new year and season fields:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "season" : "Summer",
          "year" : "1998"
        },
        "_ingest" : {
          "timestamp" : "2020-07-12T20:33:02.628755Z"
        }
      }
    }
  ]
}

Solution 12

A new index can be created with a PUT request to the index endpoint. A good place to start for the request body is the mapping of the existing index: take the response from GET olympic-events/_mapping and transform it into the fields we need. The settings for the shards and the mappings can be set in one request:

PUT olympic-events-fixed
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "athleteId": {
        "type": "integer"
      },
      "age": {
        "type": "short"
      },
      "height": {
        "type": "short"
      },
      "weight": {
        "type": "short"
      },
      "athleteName": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "gender": {
        "type": "keyword"
      },
      "team": {
        "type": "keyword"
      },
      "noc": {
        "type": "keyword"
      },
      "year": {
        "type": "short"
      },
      "season": {
        "type": "keyword"
      },
      "city": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "sport": {
        "type": "keyword"
      },
      "event": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "medal": {
        "type": "keyword"
      }
    }
  }
}