Introduction
This is the first round of Elasticsearch exercises. In this set, we will load in the data and get the index ready to start cleaning up the documents.
A video version of this round is also available on YouTube.
Please get in touch or comment on YouTube if you have any questions or feedback.
Topics covered
- Creating indices
- Defining mappings
- Reindexing
- Ingest pipelines
- Delete by query
- Data Visualizer
Exercises
Exercise 01
Configure Elasticsearch with the following criteria and start Elasticsearch:
Property | Value |
---|---|
Cluster name | lab-cluster |
Node name | node01 |
Heap size | 2g |
Exercise 02
Configure Kibana to point to your Elasticsearch node and start Kibana.
Exercise 03
Download the dataset from here and use Kibana’s Data Visualizer to upload the file into a new index called `olympic-events`.
Exercise 04
Validate that the data was imported correctly by using a single API call to show the index name, index health, number of documents, and the size of the primary store. The details in the response must be in that order, with headers, and for the new index only.
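One possible approach is the cat indices API, where the `h` parameter selects and orders the columns and `v` switches on the header row. The sketch below only builds the request line in Kibana Dev Tools form and does not contact a cluster; the helper name `cat_indices_request` is illustrative and not part of any library.

```python
from urllib.parse import urlencode


def cat_indices_request(index, columns):
    """Build a Dev Tools style request line for the cat indices API."""
    # h= lists the columns in the order they should appear; v=true adds headers.
    query = urlencode({"v": "true", "h": ",".join(columns)}, safe=",")
    return f"GET /_cat/indices/{index}?{query}"


request_line = cat_indices_request(
    "olympic-events",
    ["index", "health", "docs.count", "pri.store.size"],
)
print(request_line)
```

Naming the index in the path restricts the response to that index alone.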
Exercise 05
The cluster health is yellow. Use a cluster API that can explain the problem.
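A candidate here is the cluster allocation explain API. The sketch below prints a request that asks about one specific shard copy; it assumes the unassigned copy is the replica of shard 0 of `olympic-events`, which may not match your cluster. Sending the request without a body asks Elasticsearch to explain an arbitrary unassigned shard instead.

```python
import json

# Assumption: the replica (primary = False) of shard 0 is the unassigned copy.
explain_body = {"index": "olympic-events", "shard": 0, "primary": False}

print("GET /_cluster/allocation/explain")
print(json.dumps(explain_body, indent=2))
```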
Exercise 06
Change the cluster or index settings as required to get the cluster to a green status.
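On a single-node cluster, yellow health usually means replica shards cannot be assigned, because a replica is never placed on the same node as its primary. Assuming that is the cause here, one fix is to set the replica count of the index to zero. The helper name `replica_settings` is illustrative.

```python
import json


def replica_settings(replicas):
    """Body for the update index settings API that changes the replica count."""
    return {"index": {"number_of_replicas": replicas}}


settings_body = replica_settings(0)
print("PUT /olympic-events/_settings")
print(json.dumps(settings_body, indent=2))
```

If other indices on the node also have unassigned replicas, the same setting would need to be applied to them before the cluster reports green.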
Exercise 07
Look at how Elasticsearch has applied very general-purpose mappings to the data. Why has it chosen to use a `keyword` type for the `Age` field? Find all unique values for the `Age` field; there are fewer than 100 of them. Look for any suspicious values.
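A terms aggregation is one way to list the distinct values of a keyword field. The sketch below builds such a search body with a bucket limit of 100 and adds a small local check for keys that are not whole numbers. The helper names and the sample keys passed to `non_numeric` are made up for illustration; they are not results from the dataset.

```python
import json


def unique_values_query(field, max_buckets=100):
    """Search body: no hits, one terms bucket per distinct value of `field`."""
    return {
        "size": 0,  # skip the hits, only the aggregation is of interest
        "aggs": {
            "unique_values": {"terms": {"field": field, "size": max_buckets}}
        },
    }


def non_numeric(keys):
    """Return the bucket keys that cannot be read as whole numbers."""
    return [key for key in keys if not key.isdigit()]


age_query = unique_values_query("Age")
print("GET /olympic-events/_search")
print(json.dumps(age_query, indent=2))

# Made-up bucket keys, only to demonstrate the local check.
print(non_numeric(["24", "31", "NA"]))
```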
Exercise 08
We will be deleting data in the next exercise; making a backup is always prudent. Without making any changes to the data, reindex the `olympic-events` index into a new index called `olympic-events-backup`.
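A minimal sketch of a reindex request body follows. It copies documents unchanged because it specifies no script, query, or pipeline. The helper name `reindex_body` is illustrative.

```python
import json


def reindex_body(source_index, dest_index):
    """Body for the reindex API: copy every document from source to dest."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }


backup_body = reindex_body("olympic-events", "olympic-events-backup")
print("POST /_reindex")
print(json.dumps(backup_body, indent=2))
```

Comparing the document counts of the two indices afterwards is a quick way to confirm the copy is complete.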
Exercise 09
The `Height` and `Weight` fields suffer from the same problem as the `Age` field. Later exercises will require numeric-type queries for these fields, so we want to exclude any document we can’t use in our analyses. In a single request, delete all documents from the `olympic-events` index that have a value of `NA` for either the `Age`, `Height` or `Weight` field.
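One way to express "any of these fields equals `NA`" is a `bool` query with one `term` clause per field under `should`. The sketch below assumes all three fields are mapped as `keyword`, so that `term` matches the exact string; the helper name `delete_where_any_equals` is illustrative.

```python
import json


def delete_where_any_equals(fields, value):
    """Delete-by-query body matching documents where any field equals value."""
    return {
        "query": {
            "bool": {
                # One exact-match clause per field; a document qualifies
                # as soon as a single clause matches.
                "should": [{"term": {field: value}} for field in fields],
                "minimum_should_match": 1,
            }
        }
    }


delete_body = delete_where_any_equals(["Age", "Height", "Weight"], "NA")
print("POST /olympic-events/_delete_by_query")
print(json.dumps(delete_body, indent=2))
```

Running the same query through `_search` or `_count` first shows how many documents would be removed.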
Exercise 10
Notice how the `Games` field contains both the Olympic year and season. Create an ingest pipeline called `split_games` that will split this field into two new fields, `year` and `season`, and remove the original `Games` field.
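Several processors could do this job. The sketch below picks the `dissect` processor, which extracts named parts from a string according to a pattern, followed by a `remove` processor. This is one option rather than the intended solution, and the `description` text is made up.

```python
import json

pipeline_body = {
    "description": "Split Games into year and season",  # free-text label
    "processors": [
        # "%{year} %{season}" captures the text before and after the space.
        {"dissect": {"field": "Games", "pattern": "%{year} %{season}"}},
        # Drop the original field once both parts are extracted.
        {"remove": {"field": "Games"}},
    ],
}

print("PUT /_ingest/pipeline/split_games")
print(json.dumps(pipeline_body, indent=2))
```

Note that `dissect` produces string values, so `year` stays a string in the document source unless a further processor or the target mapping converts it.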
Exercise 11
Ensure your new pipeline is working correctly by simulating it with these values:
1998 Summer
2014 Winter
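The simulate pipeline API takes a list of sample documents under `docs`. The sketch below builds that body for the two values above and, as a purely local cross-check, computes what a correct split should look like in plain Python. The helper names are illustrative, and `expected_split` does not call Elasticsearch.

```python
import json


def simulate_body(values):
    """Body for the simulate API: one sample document per Games value."""
    return {"docs": [{"_source": {"Games": value}} for value in values]}


def expected_split(games):
    """Local reference: split on the first space into year and season."""
    year, season = games.split(" ", 1)
    return {"year": year, "season": season}


samples = ["1998 Summer", "2014 Winter"]
print("POST /_ingest/pipeline/split_games/_simulate")
print(json.dumps(simulate_body(samples), indent=2))

for sample in samples:
    print(expected_split(sample))
```

In the simulated response, each document's source should contain `year` and `season` and no longer contain `Games`.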
Exercise 12
We’ll now start to clean up the mappings. Create a new index called `olympic-events-fixed` with 1 shard, 0 replicas, and the following mapping:
Field | Type |
---|---|
athleteId | integer |
age | short |
height | short |
weight | short |
athleteName | text + keyword |
gender | keyword |
team | keyword |
noc | keyword |
year | short |
season | keyword |
city | text + keyword |
sport | keyword |
event | text + keyword |
medal | keyword |
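The table above can be turned into a create index body mechanically. The sketch below assumes that "text + keyword" means a `text` field with a `keyword` sub-field named `keyword`, and that the cluster accepts typeless mappings (Elasticsearch 7 or later). The helper name `build_properties` and the `text+keyword` marker are illustrative.

```python
import json

FIELD_TYPES = {
    "athleteId": "integer",
    "age": "short",
    "height": "short",
    "weight": "short",
    "athleteName": "text+keyword",
    "gender": "keyword",
    "team": "keyword",
    "noc": "keyword",
    "year": "short",
    "season": "keyword",
    "city": "text+keyword",
    "sport": "keyword",
    "event": "text+keyword",
    "medal": "keyword",
}


def build_properties(field_types):
    """Translate the field/type table into a mapping properties object."""
    properties = {}
    for name, field_type in field_types.items():
        if field_type == "text+keyword":
            # Full-text field with an exact-match sub-field (name.keyword).
            properties[name] = {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            }
        else:
            properties[name] = {"type": field_type}
    return properties


index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {"properties": build_properties(FIELD_TYPES)},
}

print("PUT /olympic-events-fixed")
print(json.dumps(index_body, indent=2))
```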
Next steps
Part two of the exercises can be found here.