Software Engineer, Before Inserting One Iota of Data in an Elasticsearch Index, You Must Do This.

6 min read · Jun 28, 2021

It’s all about mapping and we’d better be careful.

Introduction

The problem I’m going to share with you happened when I first started learning Elasticsearch. It will help me introduce the need to define a good mapping for your indices. So what happened?

What was the problem?

I was trying to set up a datastore of mangas to practice searching with Elasticsearch. Let me give you an overview of my hands-on.

Overview of my hands-on.

You can follow me along the way.

1. I first created and seeded an index with my list of mangas.

2. Then I ran a search query, hoping to get the “One Piece” manga as a result.

Can you guess what I got? You can let me know in the comments.🤔

Here is the result I got.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Nothing! Why? As you may have guessed, the reason was simple: a bad mapping.
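The index and query snippets are not reproduced here, so the exact cause can't be shown. One common way to end up with zero hits like this, offered purely as an assumption, is running an exact-match term query for "One Piece" against a field that was mapped as text. The sketch below imitates that in plain Python. The `simple_analyze` function is a deliberately crude stand-in for the standard analyzer (lowercase, split into word tokens), and the sample titles are hypothetical.

```python
import re


def simple_analyze(value):
    """Crude stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


def build_inverted_index(titles, analyzed):
    """Map each indexed term to the ids of the documents containing it."""
    index = {}
    for doc_id, title in enumerate(titles):
        # A text field stores analyzed tokens; a keyword field stores the raw value.
        terms = simple_analyze(title) if analyzed else [title]
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


def term_lookup(index, term):
    """Imitates a term query: the search input is NOT analyzed."""
    return sorted(index.get(term, set()))


titles = ["One Piece", "Naruto", "One Punch Man"]  # hypothetical sample data

text_index = build_inverted_index(titles, analyzed=True)      # behaves like "text"
keyword_index = build_inverted_index(titles, analyzed=False)  # behaves like "keyword"

print(term_lookup(text_index, "One Piece"))     # nothing: only "one" and "piece" were indexed
print(term_lookup(keyword_index, "One Piece"))  # exact value is indexed as-is
print(term_lookup(text_index, "piece"))         # a single lowercase token does match
```

The point of the sketch is only that the field type decides which terms exist in the index, so the same query can hit or miss depending on the mapping.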

What does mapping really mean?

Each document we store in an Elasticsearch index is a collection of fields. Mapping is the process of describing to Elasticsearch how those documents and their fields must be stored and indexed. When we don’t define an index mapping, Elasticsearch is smart enough to detect the type of each field; we call that process dynamic mapping. On the other hand, as we’re going to see, it’s recommended to define the index mapping explicitly. This is known as (you got it right) explicit mapping.
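To get a feel for what dynamic mapping decides on your behalf, here is a small Python imitation of the default type-detection rules. It is a simplification under stated assumptions: date detection on strings is left out, and `guess_dynamic_type` and the sample document are hypothetical names made up for this sketch, not part of any Elasticsearch client.

```python
def guess_dynamic_type(value):
    """Roughly imitate the default dynamic-mapping rules (date detection omitted)."""
    if value is None:
        return None  # null values do not create a field
    if isinstance(value, bool):  # test bool first: in Python, bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # An array takes the type of its first non-null element.
        first = next((item for item in value if item is not None), None)
        return guess_dynamic_type(first)
    if isinstance(value, str):
        return "text with a keyword sub-field"
    raise TypeError(f"unsupported JSON value: {value!r}")


# Hypothetical document, similar to what the manga index might contain.
document = {
    "title": "One Piece",
    "volumes": 100,
    "rating": 4.8,
    "ongoing": True,
    "genres": ["adventure", "fantasy"],
    "author": {"name": "Eiichiro Oda"},
}

guessed = {field: guess_dynamic_type(value) for field, value in document.items()}
for field, field_type in guessed.items():
    print(f"{field}: {field_type}")
```

Notice that every string ends up as full-text by default, which is convenient but not always what your queries expect.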

While searching, I came across reasons why it’s so important to think carefully about the right schema for an index, and to define it, before you start inserting data. Throughout this article, I’ll share with you what I gathered.

Why does a good mapping matter?

Let’s now address the crucial point. Why does it matter to define a good schema for your index? There are essentially four reasons for that:

Bad mapping leads to frustrating and unexpected search results.
Good mapping protects you from mapping explosions and saves you from headaches.
Unfortunately (but logically), you can’t change the data type of an existing field in place. Once the index holds data, the way out is to reindex.
Bad mapping cuts you off from advanced Elasticsearch features that depend on data types, such as the geodata features.
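As a sketch of the second point, here is one way to guard an index against mapping explosions: turn off automatic field creation with "dynamic": "strict" and keep an explicit cap on the number of fields. The body is built as a Python dict and printed as JSON so it can be sent with any client or pasted into the console. The `title` field is a hypothetical example, and 1000 is simply the default value of the limit written out.

```python
import json

# Request body for creating an index (for example: PUT /mangas).
create_index_body = {
    "settings": {
        # Upper bound on the number of fields in the index (1000 is the default).
        "index.mapping.total_fields.limit": 1000,
    },
    "mappings": {
        # Reject documents that contain fields missing from the mapping
        # instead of silently adding them.
        "dynamic": "strict",
        "properties": {
            "title": {"type": "text"},  # hypothetical field
        },
    },
}

print(json.dumps(create_index_body, indent=2))
```

With strict mode, a typo in a field name fails loudly at indexing time rather than quietly creating a new field.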

Types of data in Elasticsearch

Defining a mapping naturally requires data types, so let’s explore the ones Elasticsearch provides. The good news is that Elasticsearch groups them by category.

Common field types

Yes, common, because they are well known and used in most programming languages and databases. We have binary, boolean, keyword, constant_keyword, wildcard, long, integer, short, byte, double, float, half_float, scaled_float, unsigned_long, date, and date_nanos. For the date type, you can specify the expected format of the field with its format parameter.
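For instance, a date field that accepts several formats could be declared as in the sketch below. The `published` field name is borrowed from the books example used later in this article, and the format string is just one possible choice; formats are separated with `||`.

```python
import json

# Mapping fragment for a date field that accepts three alternative formats.
date_mapping = {
    "properties": {
        "published": {
            "type": "date",
            # Tried in order: full timestamp, plain date, milliseconds since the epoch.
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis",
        }
    }
}

accepted_formats = date_mapping["properties"]["published"]["format"].split("||")
print(json.dumps(date_mapping, indent=2))
print(accepted_formats)
```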

Objects and relational types

Here is a tricky notion to keep in mind. You don’t actually have to declare a field explicitly as an object in Elasticsearch. Instead, we fill the properties attribute of the field with its sub-fields and their types. In the example below, name is an object.

"mappings": {
  "properties": {
    "age":  { "type": "integer" },
    "name": {
      "properties": {
        "first": { "type": "text" },
        "last":  { "type": "text" }
      }
    }
  }
}

The nested type is a special version of the object type that allows arrays of objects to be indexed so that they can be queried independently of each other. The flattened type is used to avoid mapping explosions for object fields that contain too many sub-fields. The join type is a special field that creates parent/child relationships between documents of the same index.
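Why does "queried independently" matter? With the plain object type, an array of objects is flattened into one list of values per sub-field, so the link between values of the same object is lost. The Python sketch below imitates both behaviors on a tiny example. It is a simulation for intuition only, with hypothetical helper names, and it ignores text analysis.

```python
def flatten_as_object(items, field):
    """Imitate the object type: each sub-field becomes one multi-valued list."""
    flat = {}
    for item in items:
        for key, value in item.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def match_object(flat, field, first, last):
    """'first AND last' on flattened data: values may come from different objects."""
    return first in flat[f"{field}.first"] and last in flat[f"{field}.last"]


def match_nested(items, first, last):
    """'first AND last' on nested data: both must match inside the same object."""
    return any(item["first"] == first and item["last"] == last for item in items)


users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]

flat = flatten_as_object(users, "user")
print(flat)

# Nobody is called "Alice Smith", yet the flattened version matches.
print(match_object(flat, "user", "Alice", "Smith"))
print(match_nested(users, "Alice", "Smith"))
print(match_nested(users, "Alice", "White"))
```

If your queries need conditions that hold within a single array element, that is the signal to reach for nested.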

Structured data types

Among the structured data types, we have the range types (long_range, double_range, date_range, and ip_range), which let you store a range of values in a single field. You can also store IPv4 and IPv6 addresses with the ip type, software version strings with the version type, and hashes of values with murmur3.
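Here is a minimal sketch of what a range field and an ip field could look like, together with a document that fits them. The field names `contract_period` and `last_login_ip` are hypothetical, chosen only for illustration. A range value is written as an object with bounds such as gte and lte.

```python
import json

# Mapping fragment with one range field and one ip field (hypothetical names).
structured_mapping = {
    "properties": {
        "contract_period": {"type": "date_range", "format": "yyyy-MM-dd"},
        "last_login_ip": {"type": "ip"},
    }
}

# A document matching that mapping: the range is given by its two bounds.
structured_document = {
    "contract_period": {"gte": "2021-01-01", "lte": "2021-12-31"},
    "last_login_ip": "192.168.1.10",
}

print(json.dumps(structured_mapping, indent=2))
print(json.dumps(structured_document, indent=2))
```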

Aggregate data types

Aggregate data types store data that is mainly used for aggregation purposes. We have aggregate_metric_double and histogram.

Text search types

If your goal is to perform real full-text searches on your data, these are the data types you’ll be using. We have text, annotated-text, completion, search_as_you_type, and token_count.

Document ranking types

As the name of their category implies, document ranking types are used to set up an advanced way to compare documents and provide better-quality search results. They can be used in many applications, such as semantic search and text similarity search. We have dense_vector, sparse_vector, rank_feature, and rank_features.

Spatial data types

I do think Elasticsearch won’t stop amazing me. Its geolocation features are a world of possibilities to explore. To work with spatial data, Elasticsearch offers geo_point and geo_shape for geographic coordinates, and point and shape for arbitrary Cartesian coordinates.
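To give a taste of what a correct spatial mapping unlocks, the sketch below declares a geo_point field and builds a geo_distance filter that keeps documents within 10 km of a given point. The `location` field name and the coordinates are hypothetical. The filter only works if the field really is mapped as geo_point, which is exactly why the mapping has to be right before data goes in.

```python
import json

# Mapping fragment: "location" (hypothetical name) holds a latitude/longitude pair.
geo_mapping = {
    "properties": {
        "location": {"type": "geo_point"},
    }
}

# A document with a point expressed as an object with lat and lon.
shop_document = {"location": {"lat": 35.68, "lon": 139.69}}

# Search body: keep only documents within 10 km of the given point.
nearby_query = {
    "query": {
        "bool": {
            "filter": {
                "geo_distance": {
                    "distance": "10km",
                    "location": {"lat": 35.68, "lon": 139.69},
                }
            }
        }
    }
}

print(json.dumps(geo_mapping, indent=2))
print(json.dumps(nearby_query, indent=2))
```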

Special field data types

Here are some other amazing things about Elasticsearch data types:

array: actually, it’s not really a data type; there is no dedicated array type. By default, any field can contain zero or more values. However, all values in the array must be of the same data type. So you can store an array of strings, an array of integers, or even an array of arrays.
multi-fields: let’s assume you need to index the same value as text and, at the same time, as keyword. That’s the purpose of multi-fields in Elasticsearch.
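A multi-field could look like the sketch below: `title` is indexed as text for full-text search, and the same value is also indexed as keyword under a sub-field for exact matching, sorting, and aggregations. The sub-field name `raw` is a hypothetical choice; the sub-field is addressed with dot notation, here `title.raw`.

```python
import json

# Mapping fragment: one source value, indexed two ways.
multi_field_mapping = {
    "properties": {
        "title": {
            "type": "text",  # analyzed, for full-text search
            "fields": {
                "raw": {"type": "keyword"},  # exact value, for term queries and sorting
            },
        }
    }
}

# Full-text search goes through the main field...
full_text_query = {"query": {"match": {"title": "one piece"}}}
# ...while an exact match targets the keyword sub-field.
exact_query = {"query": {"term": {"title.raw": "One Piece"}}}

for body in (multi_field_mapping, full_text_query, exact_query):
    print(json.dumps(body, indent=2))
```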

How to define an index mapping the right way?

As we saw earlier, bad mapping decisions can lead to frustrating situations where you’ll be obliged to reindex your data. Here are the steps I propose you follow when defining your index mappings.

Analyze your data: What kind of information are you going to store in your field (strings, sentences, numeric values, IP addresses, raw text)? What kinds of operations will you perform on the data stored in it (aggregations, searches, …)? If you want to search this field, will you need exact-term matching (which points to “keyword” rather than “text”, for example)?
Choose the right data type for your field: Look at the different data types we went through earlier and choose the best one for your field.
Then define your mapping: There are mainly two ways to define your index’s mapping. Firstly, you can set your index’s mapping when creating it.

PUT /books
{
  "mappings": {
    "properties": {
      "title":    { "type": "text" },  
      "author":   { "type": "keyword" },
      "published":   { "type": "date" }     
    }
  }
}

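Or you can set it after you have created your index. In this case, you update the mapping of the existing index, because Elasticsearch creates a mapping as soon as the index is created. Keep in mind that this request can add new fields, but it cannot change the type of a field that already exists.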

PUT /books/_mapping
{
  "properties": {
    "title":     { "type": "text" },
    "author":    { "type": "keyword" },
    "published": { "type": "date" }
  }
}

Hands-on

Our hands-on is going to be very straightforward. Let’s assume that you’re the CTO of a large company and you want to have a central data store for the employees' data. Our goal is to create an index and store the data of all the employees of your company.

I know we can’t go through everything you need to set up the ideal index right now. But just for the sake of this hands-on, this is the solution I found for your task.

I hope you already installed Elasticsearch.
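The solution snippet itself is not reproduced here, so what follows is only a hypothetical sketch of what an explicit mapping for an employees index might look like. Every field name and type choice below is an assumption made for illustration, not the original solution. The body is built in Python and printed as JSON so that it can be sent as the body of an index-creation request such as PUT /employees.

```python
import json

# Hypothetical explicit mapping for an "employees" index.
employees_index_body = {
    "mappings": {
        "dynamic": "strict",  # refuse fields that were not planned for
        "properties": {
            # Searchable by words, with an exact-value sub-field for sorting.
            "full_name": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            },
            "email": {"type": "keyword"},        # always matched exactly
            "department": {"type": "keyword"},   # used for filters and aggregations
            "hire_date": {"type": "date", "format": "yyyy-MM-dd"},
            # Stored with two decimal places of precision.
            "salary": {"type": "scaled_float", "scaling_factor": 100},
            "is_active": {"type": "boolean"},
            # Object field: no explicit "type", only sub-field properties.
            "address": {
                "properties": {
                    "city": {"type": "keyword"},
                    "country": {"type": "keyword"},
                }
            },
            "office_location": {"type": "geo_point"},
        },
    }
}

employee_fields = sorted(employees_index_body["mappings"]["properties"])
print(json.dumps(employees_index_body, indent=2))
print(employee_fields)
```

Each choice follows the questions from the steps above: exact matching points to keyword, free-text search to text, and anything you want to filter by distance to geo_point.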

Great. Now you don’t have to worry anymore about the mapping of your employees' index. You can start inserting data without any problem.

Conclusion

Mapping is a critical aspect of working with Elasticsearch indices. For that reason, we have to carefully pick the right data types for our fields. It will surely save us from the nightmare of reindexing data.

Let’s summarize. Before inserting any data in an index:

You need to look at your data and ask yourself the right questions.
Then you have to choose the right data type for each field from the plethora of data types Elasticsearch provides.
And eventually, you have to set the mapping of your index.

Thank you for reading. I’d like to know more about your experiences with mapping while working with Elasticsearch.

Till next time, take care.