How to Query a Large Data Set in Elasticsearch

Introduction

On the surface, the Elasticsearch scroll API behaves much like a cursor in a basic database after you initiate a search request. The difference is that Elasticsearch is flexible enough to retrieve large quantities of data, not just a single page of search results. With a single query against a large data set, the scroll API can return all of the available results.

Before conducting a search, remember this:

> "The primary purpose of scrolling is for querying large data sets, not for real-time queries." —Vanderbush

Use Scan and Scroll to Retrieve Large Data Results

With the scan search_type and the scroll API, you can bypass what can seem like the bottomless pit of deep search pagination. Paging deeply into results takes time, and wasted time equates to money lost. Scan and scroll search through large quantities of data quickly, skipping intense pagination. A scroll search retrieves batches of results from Elasticsearch based on the initial query and keeps retrieving them until all the results have been returned.

How Scanning with Elasticsearch Works:

A scan completes quickly without intense pagination because global sorting is turned off. The result is a large amount of data returned quickly, batch by batch. You get complete data without sacrificing time.

Let’s look at an example of how you can use Scan and the Scroll API to query a large data set. We’re going to do three things:

1) Make a GET request
2) Set scan as the URL search_type parameter
3) Set a 2-minute scroll parameter time limit for the initial scroll search in Elasticsearch

GET /old_index/_search?search_type=scan&scroll=2m
{
    "query": { "match_all": {}},
    "size": 500
}

The response to this request includes a Base64-encoded string called the _scroll_id. To get the first batch of results, send that _scroll_id to the _search/scroll API endpoint.

GET /_search/scroll?scroll=2m
DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==

For ordinary paged searches, you can instead step through results with the from and size parameters. This example shows query requests that display results in batches of 250: set the size parameter to 250, then increase the from parameter by 250 on each request to get the next increment of 250 results.

GET /_search?size=250            # first 250 results
GET /_search?size=250&from=250   # next 250 results
GET /_search?size=250&from=500   # next 250 results
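The offset arithmetic behind those three requests can be sketched as a small helper. This is an illustrative sketch only; `page_params` is a hypothetical name, not an Elasticsearch API.

```python
def page_params(total_hits, size):
    """Yield (size, from) pairs that cover total_hits results page by page."""
    for offset in range(0, total_hits, size):
        yield size, offset

# For 750 results in batches of 250, this yields the offsets used in the
# three requests above: from=0, from=250, from=500.
for size, start in page_params(750, 250):
    print(f"GET /_search?size={size}&from={start}")
```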

Scrolling

Even though scan and scroll let you obtain large query results by initiating a search just once, their true purpose is re-indexing data. Querying for real-time search results isn't what they were designed for, so that usage isn't recommended.

Here's why. The scroll timeout parameter tells Elasticsearch how long to keep the search context alive between batch requests. Once that time expires, the scroll context is cleared and additional results cease to appear. Be sure to set the time parameter long enough to process each batch before requesting the next one. That's why scrolling works for large data queries, but not for real-time results.

Assistance with Scrolling and Re-indexing of Documents

Client support for scrolling and document re-indexing is available.

To get help with Perl, visit Search::Elasticsearch::Client::5_0::Bulk and Search::Elasticsearch::Client::5_0::Scroll

For assistance with Python, visit elasticsearch.helpers

Let's talk a little more about setting timeouts. Elasticsearch must know how long to keep the search context alive while you page through a large data set. For example, a scroll parameter of ?scroll=3m sets a 3-minute time limit.

POST /cars/_search?scroll=3m
{
    "size": 100,
    "query": {
        "match": {
            "title": "volvo"
        }
    }
}

Below is the same POST request made with cURL, also using a 3-minute timeout parameter.

curl -X POST "localhost:9200/cars/_search?scroll=3m" -H 'Content-Type: application/json' -d'
{
    "size": 100,
    "query": {
        "match": {
            "title": "volvo"
        }
    }
}
'

When you query a large data set in Elasticsearch, configure the size parameter to return the maximum number of results you want per batch.

The scroll API retrieves the next group of results when it receives the _scroll_id request.

POST /_search/scroll
{
    "scroll": "1m",
    "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}

In the example above, the scroll parameter instructs Elasticsearch to keep the search context active for another minute (1m). Either GET or POST can be used, and the URL should not include the index name, because that is specified in the original search request. When Elasticsearch receives the scroll_id, it returns the next batch of results based on the original query.
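The full request/response cycle can be sketched as a loop in Python. This is a minimal sketch, assuming a client object that exposes search() and scroll() methods in the style of the official elasticsearch package; the function name scroll_all is illustrative, not part of any library.

```python
def scroll_all(client, index, query, scroll="1m", size=100):
    """Yield every hit for a query, batch by batch, via the scroll API.

    Assumes `client` exposes search()/scroll() like the official Python
    client; this is an illustrative sketch, not a drop-in helper.
    """
    # Initial search opens the scroll context and returns the first batch.
    resp = client.search(index=index, scroll=scroll,
                         body={"size": size, "query": query})
    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break  # scroll is exhausted; no more batches
        for hit in hits:
            yield hit
        # Pass the latest _scroll_id back to keep the context alive
        # and fetch the next batch of results.
        resp = client.scroll(scroll_id=resp["_scroll_id"], scroll=scroll)
```

Each scroll() call renews the timeout, so the keep-alive only needs to be long enough to process a single batch.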

Conclusion

This step-by-step tutorial explained how to query a large data set in Elasticsearch and why it's fast and easy when you use the scan and scroll API features. Because Elasticsearch gives you the ability to skip global data sorting, you quickly receive results, batch by batch. You work around the time consumption of deep pagination, yet get the results you need.

The scan search type and the scroll API aren't for real-time queries, but for indexing large amounts of data into a newly created index. Keeping that in mind, it's still very good to know that with Elasticsearch, you can query large data sets using the correct code. For more information, see the Elasticsearch documentation.
