Update (July 2020): I am now building lab exercises to help people prepare for the exam. You can see the list of available labs here.
Introduction
I find lab-based exercises to be the best way to cement information as knowledge. While I am going back and forth on how to deliver training material, I wanted to publish a draft of some Elasticsearch exercises I have put together and solicit some feedback.
I’m not releasing data at this stage but would really appreciate some feedback on the format and writing style. Would a series of blog posts be suitable, or is a PDF a better approach?
The environment used here is similar to that of the Certification, but these exercises aren’t meant to be part of a mock Certification exam; they’re for someone learning Elasticsearch, though they aren’t a million miles from the type of questions in the exam. I also don’t want to spoon-feed which API to use in each scenario, as a traditional training course would. I’m trying to strike a balance between the two.
Get in touch on LinkedIn or comment in the post on Reddit if you have feedback.
Requirements
You will need a machine capable of running a small, single-node Elasticsearch v7.2 cluster and Kibana. A modern laptop with 8 or 16 GB of RAM should suffice for our purposes. Later exercises will require a multi-node cluster; we will discuss how best to create a suitable lab environment nearer the time, or you can read how to do this using Vagrant.
The environment used in the Certification exam will be reproduced as closely as possible, so Elasticsearch will need to be run directly on a host, virtual machine, or cloud instance where you can access the shell directly, ideally over SSH. The Elasticsearch and Kibana distributions (extracted from the `.tar.gz` or `.zip` archive) are required on the node, as well as the data files used in these exercises.
All REST calls to Elasticsearch in these exercises assume that Elasticsearch is running on `localhost`. You will need to modify those addresses with the host of your cluster if it is different.
Topics covered
- Creating indices
- Defining mappings
- Reindexing
- Ingest pipelines
- Delete by query
- Aggregations
Exercises
Exercise 01
Configure Elasticsearch with the following settings, then start it:

| Property | Value |
|---|---|
| Cluster name | lab-cluster |
| Node name | node01 |
| Heap size | 2g |
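One way to satisfy these requirements, assuming you are working from the root of the extracted `.tar.gz`/`.zip` distribution, is a quick sketch like this; editing the files by hand works just as well:

```bash
# Append the cluster and node names to the node's configuration file.
echo 'cluster.name: lab-cluster' >> config/elasticsearch.yml
echo 'node.name: node01' >> config/elasticsearch.yml

# Set minimum and maximum heap to 2g. Editing -Xms/-Xmx in
# config/jvm.options achieves the same thing.
export ES_JAVA_OPTS="-Xms2g -Xmx2g"

# Start Elasticsearch in the foreground.
./bin/elasticsearch
```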
Exercise 02
The Bulk API may be covered in the exam but, for now, we’re only going to use it to get data into the cluster. There will be exercises later to ensure you can craft a suitable `_bulk` request body.
The volume of data being passed to the Bulk API here is far more than you would normally post in one batch. A more efficient and mechanically sympathetic strategy would be to split the file into batches and post each batch individually, as sketched below. Here I am sacrificing efficiency for the sake of platform portability.
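For the curious, a minimal sketch of that batched approach, assuming the file follows the standard bulk format of an action line followed by a source line for every document (so each chunk must contain an even number of lines):

```bash
# Split the file into chunks of 20,000 lines (10,000 documents each);
# an even line count keeps every chunk a valid bulk body.
split -l 20000 olympic-events.ndjson bulk-chunk-

# Post each chunk separately, discarding the (large) responses.
for chunk in bulk-chunk-*; do
  curl -s -X POST 'http://localhost:9200/olympic-events/_bulk' \
    -H 'Content-Type: application/x-ndjson' \
    --data-binary "@$chunk" > /dev/null
done

rm bulk-chunk-*
```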
The `olympic-events.ndjson` file contains all the data for these exercises, formatted for use with the Bulk API. The file contains 271,116 documents. Run the following command from the same location as the `.ndjson` file to import this data into a new index called `olympic-events`:

```bash
curl -X POST 'http://localhost:9200/olympic-events/_bulk' \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @olympic-events.ndjson > /dev/null
```
Exercise 03
Start Kibana.
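Assuming you are using the extracted Kibana distribution with its default configuration (which points at `http://localhost:9200`), this is a one-liner:

```bash
# Start Kibana in the foreground, from the root of the extracted distribution.
./bin/kibana
```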
Exercise 04
Validate that the data was imported correctly by using a single API call to show the index name, index health, number of documents, and the size of the primary store. The details in the response must be in that order, with headers, and for the new index only.
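One possible answer uses the `_cat/indices` API, with the `h` parameter selecting and ordering the columns and `v` adding headers:

```bash
curl 'http://localhost:9200/_cat/indices/olympic-events?v&h=index,health,docs.count,pri.store.size'
```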
Exercise 05
The cluster health is yellow. Use a cluster API that can explain the problem.
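A sketch of one approach: called with no body, the allocation explain API reports on the first unassigned shard it finds, which is enough to diagnose this cluster:

```bash
curl 'http://localhost:9200/_cluster/allocation/explain?pretty'
```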
Exercise 06
Change the cluster or index settings as required to get the cluster to a green status.
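On a single-node cluster the problem is usually a replica shard with no second node to live on. A sketch of one fix, dropping the replica count in the index settings:

```bash
curl -X PUT 'http://localhost:9200/olympic-events/_settings' \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "number_of_replicas": 0 } }'
```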
Exercise 07
Look at how Elasticsearch has applied very general-purpose mappings to the data. Why has it chosen to use a `text` type for the `Age` field? Find all unique values for the `Age` field; there are fewer than 100 of them. Look for any suspicious values.
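One way to list the unique values is a `terms` aggregation. This sketch assumes dynamic mapping has created an `Age.keyword` sub-field, which it does by default for string data:

```bash
curl -X POST 'http://localhost:9200/olympic-events/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{
    "size": 0,
    "aggs": {
      "unique_ages": {
        "terms": { "field": "Age.keyword", "size": 100 }
      }
    }
  }'
```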
Exercise 08
We will be deleting data in the next exercise; making a backup is always prudent. Without making any changes to the data, reindex the `olympic-events` index into a new index called `olympic-events-backup`.
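A minimal sketch with the Reindex API; with no script or query involved, documents are copied as-is:

```bash
curl -X POST 'http://localhost:9200/_reindex' \
  -H 'Content-Type: application/json' \
  -d '{
    "source": { "index": "olympic-events" },
    "dest": { "index": "olympic-events-backup" }
  }'
```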
Exercise 09
The `Height` and `Weight` fields suffer from the same problem as the `Age` field. Later exercises will require numeric-type queries for these fields, so we want to exclude any document we can’t use in our analyses. In a single request, delete all documents from the `olympic-events` index that have a value of `NA` for the `Age`, `Height`, or `Weight` field.
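One approach is `_delete_by_query` with a `bool` query whose `should` clauses match `NA` in any of the three fields (again assuming the dynamically mapped `.keyword` sub-fields):

```bash
curl -X POST 'http://localhost:9200/olympic-events/_delete_by_query' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "bool": {
        "should": [
          { "term": { "Age.keyword": "NA" } },
          { "term": { "Height.keyword": "NA" } },
          { "term": { "Weight.keyword": "NA" } }
        ]
      }
    }
  }'
```

With only `should` clauses and no `must` or `filter`, at least one clause has to match, so this deletes exactly the documents with `NA` in any of the three fields.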
Exercise 10
Notice how the `Games` field contains both the Olympic year and season. Create an ingest pipeline called `split_games` that will split this field into two new fields, `year` and `season`, and remove the original `Games` field.
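Several processors could do this (`split`, `grok`, `dissect`); here is a sketch using `dissect`, which pulls both values out in one step:

```bash
curl -X PUT 'http://localhost:9200/_ingest/pipeline/split_games' \
  -H 'Content-Type: application/json' \
  -d '{
    "description": "Split the Games field into year and season",
    "processors": [
      { "dissect": { "field": "Games", "pattern": "%{year} %{season}" } },
      { "remove": { "field": "Games" } }
    ]
  }'
```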
Exercise 11
Ensure your new pipeline is working correctly by simulating it with these values:

- `1998 Summer`
- `2014 Winter`
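The Simulate Pipeline API takes test documents inline, so a sketch looks like this:

```bash
curl -X POST 'http://localhost:9200/_ingest/pipeline/split_games/_simulate?pretty' \
  -H 'Content-Type: application/json' \
  -d '{
    "docs": [
      { "_source": { "Games": "1998 Summer" } },
      { "_source": { "Games": "2014 Winter" } }
    ]
  }'
```

Each document in the response should contain `year` and `season` fields and no `Games` field.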
Exercise 12
We’ll now start to clean up the mappings. Create a new index called `olympic-events-fixed` with 1 shard, 0 replicas, and the following mapping:
| Field | Type |
|---|---|
| athleteId | integer |
| age | short |
| height | short |
| weight | short |
| athleteName | text + keyword |
| gender | keyword |
| team | keyword |
| noc | keyword |
| year | short |
| season | keyword |
| city | text + keyword |
| sport | keyword |
| event | text + keyword |
| medal | keyword |
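A sketch of the index creation request; `text + keyword` is expressed as a `text` field with a `keyword` sub-field, mirroring what dynamic mapping produces:

```bash
curl -X PUT 'http://localhost:9200/olympic-events-fixed' \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "mappings": {
      "properties": {
        "athleteId": { "type": "integer" },
        "age": { "type": "short" },
        "height": { "type": "short" },
        "weight": { "type": "short" },
        "athleteName": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
        "gender": { "type": "keyword" },
        "team": { "type": "keyword" },
        "noc": { "type": "keyword" },
        "year": { "type": "short" },
        "season": { "type": "keyword" },
        "city": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
        "sport": { "type": "keyword" },
        "event": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
        "medal": { "type": "keyword" }
      }
    }
  }'
```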