Aggregation in Elasticsearch: 3 Must-Knows as a Software Engineer
People tend to like Elasticsearch at the beginning and when they learn about aggregation they fall in love for the product. — Thijs Feryn
What is Elasticsearch aggregation used for and why does a software engineer have to get to know it❓
Welcome engineer! In this tutorial, we’re going to learn a lot together about …aggregation.
I discovered Elasticsearch thanks to a friend of mine many years ago. I wanted a simple way to build a complete, fully customizable search engine for an image search website.
But in the process of learning and using Elasticsearch, I came across a surprising statement: Elasticsearch is actually used more for aggregation than for searching.
It’s simply mind-boggling to discover that Elasticsearch’s aggregation features are used more than its search features. Throughout this article, we’ll explore the power of those aggregation features. So without further ado, let’s dive in.
Aggregation: What are we talking about?
In this age of data and metadata, companies increasingly collect data on their online customers and website visitors. All this data is used to derive numerous metrics, such as user preferences, number of transactions, the age range or geographic area of the best clients, and so on.
Elasticsearch offers aggregation features that are both advanced and simple to implement. This article walks us through different use cases of Elasticsearch aggregation with real-life, hands-on examples. So hang on.
Before starting, we need to seed an Elasticsearch index to execute our queries.
Let us seed our experiment index
For our hands-on, we’ll use some fictitious phone call data. First of all, let’s define the mapping of our index. The index we’re going to create is called phone_calls_details ☎️. Let’s proceed.
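As a rough sketch of what such a mapping could look like, the snippet below builds the request body as a Python dictionary. Only the index name and the caller_name and duration fields come from this article; the field types (keyword, and integer seconds) are assumptions made for illustration.

```python
import json

INDEX_NAME = "phone_calls_details"  # index name used throughout the article

# Assumed mapping: the field types below are guesses, not the original mapping.
mapping_body = {
    "mappings": {
        "properties": {
            # keyword keeps the exact value, which terms aggregations need
            "caller_name": {"type": "keyword"},
            # assumed to hold the call length in seconds
            "duration": {"type": "integer"},
        }
    }
}

# The equivalent REST request is a PUT on the index with this JSON body.
print(f"PUT /{INDEX_NAME}")
print(json.dumps(mapping_body, indent=2))
```

Using keyword rather than text for caller_name is a deliberate choice here: aggregating on an analyzed text field is not enabled by default.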
Now we’ll add the customers’ phone call details.
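One way to load several documents at once is the _bulk endpoint, which expects newline-delimited JSON: an action line followed by the document itself. The sketch below only builds that payload; the three calls are made-up sample values, not the data set used in this article, and build_bulk_payload is a hypothetical helper.

```python
import json

# Made-up sample calls for illustration only (durations assumed to be seconds).
sample_calls = [
    {"caller_name": "John", "duration": 30},
    {"caller_name": "John", "duration": 12},
    {"caller_name": "Torsten", "duration": 45},
]


def build_bulk_payload(index_name, docs):
    """Return the newline-delimited JSON body expected by the _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index the next document into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # A bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


bulk_payload = build_bulk_payload("phone_calls_details", sample_calls)
print(bulk_payload)
```

The payload would be sent as the body of a POST to /_bulk with the Content-Type header set to application/x-ndjson.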
Well done. Before going further, I have a simple challenge for you.
Take a look at the data we’ve just inserted and answer the questions below.
- Which phone call lasted the longest?
- How many calls has every caller made so far?
- Who is the customer that called the most so far?
- How many minutes did John and Torsten spend on the phone?
- What is the average time of a call?
Now let’s assume that we have millions of call details. There is no way to answer these questions manually. That’s where aggregation comes into play.
What is aggregation?
An aggregation computes, summarizes and extracts new information from your original dataset as metrics, statistics, or other analytics.
In Elasticsearch, we have three kinds of aggregations: metric, bucket, and pipeline aggregations. Ready? Let’s dive in!
A metric is an aggregation that calculates a value, such as a sum or an average, from field values. To hone our understanding, let’s go through the use cases below.
Use case 1: Which phone call lasted the longest?
In this example, we created an aggregation called max_call_duration that computes the maximum (max) of all the duration field values.
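A request body for such a max aggregation might look like the sketch below, written as a Python dictionary. The aggregation name max_call_duration and the duration field come from this article; setting size to 0 (to skip the search hits) and the sample calls are assumptions added for illustration.

```python
# Request body for a max metric aggregation on the duration field.
max_query = {
    "size": 0,  # assumption: we only want the aggregation, not the hits
    "aggs": {
        "max_call_duration": {"max": {"field": "duration"}},
    },
}

# Made-up sample calls, used to show locally what the metric computes.
sample_calls = [
    {"caller_name": "John", "duration": 30},
    {"caller_name": "John", "duration": 12},
    {"caller_name": "Torsten", "duration": 45},
]

# Pure-Python equivalent of the max metric over the sample.
local_max = max(call["duration"] for call in sample_calls)
print(local_max)  # prints 45 for the sample above
```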
Use case 2: What is the average time of a call?
In this example, our aggregation (a metric) is called average_time. The clause we used here to compute the average is avg. As its name suggests, avg computes the average of the numeric values extracted from the aggregated documents.
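The sketch below shows one possible shape for this avg request, plus a small helper for reading a single-value metric out of a search response. The name average_time and the avg clause come from this article; read_metric, the sample calls, and the hand-written response are hypothetical and only illustrate the structure.

```python
# Request body for an avg metric aggregation on the duration field.
average_time_query = {
    "size": 0,  # assumption: skip the hits, keep only the aggregation
    "aggs": {
        "average_time": {"avg": {"field": "duration"}},
    },
}


def read_metric(response, agg_name):
    """Return the value of a single-value metric from a search response body."""
    return response["aggregations"][agg_name]["value"]


# Made-up sample calls, not the article's data set.
sample_calls = [
    {"caller_name": "John", "duration": 30},
    {"caller_name": "John", "duration": 12},
    {"caller_name": "Torsten", "duration": 45},
]

# Pure-Python equivalent of the avg metric over the sample.
durations = [call["duration"] for call in sample_calls]
local_average = sum(durations) / len(durations)

# Hand-written response with the shape a single-value metric comes back in.
illustrative_response = {"aggregations": {"average_time": {"value": local_average}}}
print(read_metric(illustrative_response, "average_time"))
```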
After running this query, I got average_time = 23.44 s. Let’s move on to the second type of aggregation.
Buckets, as the name implies, are aggregations that group documents into buckets (also called bins) based on field values, ranges, or other criteria. Let’s look at some examples.
Use case 1: How many calls did every caller make so far?
This aggregation (a bucket request) shows, per caller_name, the number of documents (that is, the number of calls) we have.
So we’ve got:
- John: 6 calls
- Torsten: 3 calls.
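A terms aggregation is a common way to get one bucket per distinct field value, so a request of this kind could be sketched as below. The aggregation name calls_per_caller_name is borrowed from later in this article, and the sample calls are made up, so the local counts will not match the figures above.

```python
from collections import Counter

# Request body for a terms bucket aggregation: one bucket per caller_name.
calls_per_caller_query = {
    "size": 0,  # assumption: skip the hits, keep only the buckets
    "aggs": {
        "calls_per_caller_name": {"terms": {"field": "caller_name"}},
    },
}

# Made-up sample calls, not the article's data set.
sample_calls = [
    {"caller_name": "John", "duration": 30},
    {"caller_name": "John", "duration": 12},
    {"caller_name": "Torsten", "duration": 45},
]

# Pure-Python equivalent: each bucket key maps to its document count.
local_counts = Counter(call["caller_name"] for call in sample_calls)
for caller, count in local_counts.items():
    print(f"{caller}: {count} calls")
```

In a real response, each bucket carries a key (the caller name) and a doc_count (the number of calls), which is what the Counter mimics here.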
Use case 2: How many total minutes did every caller spend on the phone?
Here things start to get interesting. We’ll mix a bucket and a metric to produce analytics data, as happens in real life.
In the example above, we used two operations to get our results:
- First of all, we create a bucket using the caller_name as key so that we have the calls grouped by caller.
- Secondly, inside every bucket we apply an aggregation, which is a metric, to sum up the duration field.
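The two steps above can be sketched as a terms aggregation with a nested sum. The names calls_per_caller_name and duration follow the path used later in this article; the sample calls are made up and the durations are assumed to be in seconds.

```python
from collections import defaultdict

# Bucket by caller_name, then sum the duration field inside every bucket.
duration_per_caller_query = {
    "size": 0,  # assumption: skip the hits, keep only the aggregation
    "aggs": {
        "calls_per_caller_name": {
            "terms": {"field": "caller_name"},
            # Sub-aggregation: evaluated once per bucket.
            "aggs": {"duration": {"sum": {"field": "duration"}}},
        }
    },
}

# Made-up sample calls, not the article's data set.
sample_calls = [
    {"caller_name": "John", "duration": 30},
    {"caller_name": "John", "duration": 12},
    {"caller_name": "Torsten", "duration": 45},
]

# Pure-Python equivalent: group by caller, then sum the durations per group.
local_sums = defaultdict(int)
for call in sample_calls:
    local_sums[call["caller_name"]] += call["duration"]

print(dict(local_sums))  # prints {'John': 42, 'Torsten': 45}
```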
We’ve reached the last but not least type. Pipelines are a particular kind of aggregation. Instead of taking information from the documents, they take data from other aggregations to produce the desired result. Let’s have a look at an example.
Use case: How long on average did callers spend on the phone?
We want to know how long, on average, a caller spends on the phone. There are many possibilities, but here we’ll bucket the calls per caller, sum the durations per caller, and use a pipeline to compute the average of those per-bucket sums. Figure 1 shows the process of our aggregation.
This process leads to the query below. Let’s run it to get our aggregation result.
Okay! You’re right. It’s a little bit rough. Let’s break it down.
When it comes to talking about aggregation with pipelines, the way to build requests is a bit different.
- calls_per_caller_name is a normal aggregation to bucket and sum the durations per caller_name
- duration_sum is our pipeline entry; the type of aggregation we are using here is avg_bucket (you can use sum_bucket and many others) to compute the average of calls_per_caller_name>duration across the buckets.
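Putting the two entries together, a request of this shape could be sketched as below. The names calls_per_caller_name, duration, duration_sum, and the avg_bucket type come from the breakdown above; the exact layout, the size setting, and the sample calls are assumptions.

```python
# A sibling pipeline aggregation sits next to the bucket aggregation it reads.
pipeline_query = {
    "size": 0,  # assumption: skip the hits, keep only the aggregations
    "aggs": {
        "calls_per_caller_name": {
            "terms": {"field": "caller_name"},
            "aggs": {"duration": {"sum": {"field": "duration"}}},
        },
        "duration_sum": {
            # buckets_path points at the duration metric inside each bucket.
            "avg_bucket": {"buckets_path": "calls_per_caller_name>duration"},
        },
    },
}

# Made-up sample calls, not the article's data set.
sample_calls = [
    {"caller_name": "John", "duration": 30},
    {"caller_name": "John", "duration": 12},
    {"caller_name": "Torsten", "duration": 45},
]

# Step 1 (bucket + metric): total duration per caller.
sums_per_caller = {}
for call in sample_calls:
    name = call["caller_name"]
    sums_per_caller[name] = sums_per_caller.get(name, 0) + call["duration"]

# Step 2 (pipeline): average of the per-caller totals.
local_avg_of_sums = sum(sums_per_caller.values()) / len(sums_per_caller)
print(local_avg_of_sums)  # prints 43.5 for the sample above
```

Note that the pipeline averages the per-caller totals (42 and 45 here), which is not the same as averaging the individual calls.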
Now imagine all of the kinds of requests you can make on a data set using Elasticsearch aggregations.
As we approach the end of this article, let’s review some points about Elasticsearch aggregation.
- Elasticsearch aggregation is mainly used for analytics.
- There are three main kinds of aggregation in Elasticsearch: metrics, buckets, and pipelines.
- It’s possible to mix different kinds of aggregation to get statistics from an index.
Elasticsearch is just a wonderful product with excellent features for analytics and data extraction. All we have to do as engineers is to take advantage of that.
Till next time, take care. I’d like to know more about your experience with Elasticsearch aggregations in the comments 📝.