How to Query a Large Data Set in Elasticsearch
Introduction
In Elasticsearch, on the surface, the scroll API acts much like any basic database after you initiate a search request. The difference is that Elasticsearch is flexible enough to retrieve large quantities of data, not just a single page of search results. In addition, the scroll API can return all of the available results for a single query against a large data set.
Before conducting a search, remember this:
>"The primary purpose of scrolling is for querying large data sets, not for real-time queries." —Vanderbush
Use Scan and Scroll to Retrieve Large Data Results
With the scan search_type and the Scroll API, you can bypass what seems like the bottomless pit of deep search pagination. As you know, deep pagination takes time, and wasted time equates to money lost. Scan and Scroll searches through large quantities of data quickly, skipping the cost of deep pagination. A scrolled search takes a snapshot of the matching data and then keeps retrieving batches from Elasticsearch, based on the initial query, until all of the results have been returned.
How Scanning with Elasticsearch Works:
A scan completes quickly, without the cost of deep pagination, because global sorting is turned off. The result is large amounts of data returned quickly, batch by batch. You get complete results without sacrificing time.
Let's look at an example of how you can use scan and the Scroll API to query a large data set. We're going to do three things:

1) Make a GET request.
2) Set scan as the search_type parameter in the URL.
3) Set a 2-minute scroll parameter time limit for the initial scroll search in Elasticsearch.
GET /old_index/_search?search_type=scan&scroll=2m
{
    "query": { "match_all": {} },
    "size": 500
}
The response to that request includes a Base64-encoded string, the _scroll_id. To get the first batch of results, you'll need to send that _scroll_id to the _search/scroll API endpoint.
GET /_search/scroll?scroll=2m
CiMgRWxhc3RpY3NlYXJjaCAtIEhvdyB0byAuLi4uCiMjIEludHJvZHVjdGlvbjoKX19OT1RFIFRPIFdSSVRFUlM6X18gUGxlYXNlIHdyaXRlIGEgYnJpZWYgZGVzY3JpcHRpb24gYWJvdXQgdGhlIFhYWFhYIHVzaW5nIHRoaXMgc291cmNlIG1hdGVyaWFsIGFzIGluc==
To page through results manually instead, set the from parameter to a fixed offset. This example shows query requests that display results in batches of 250. Set the size parameter to 250 (size=250) and increase the from parameter by 250 on each request.
GET /_search?size=250              ---- returns the first 250 results
GET /_search?size=250&from=250     ---- the next 250 results
GET /_search?size=250&from=500     ---- the next 250 results
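As a quick illustration, the from offsets used in the requests above can be computed for any result count and batch size. The helper below is not part of Elasticsearch; it's just a minimal sketch of the arithmetic:

```python
# Hypothetical helper, not an Elasticsearch API: compute the "from"
# offsets needed to page through `total_hits` results `size` at a time.
def page_offsets(total_hits, size):
    return list(range(0, total_hits, size))

# 750 matching documents in batches of 250 yields the offsets used
# in the three requests above.
print(page_offsets(750, 250))  # [0, 250, 500]
```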
Scrolling
Even though Scan and Scroll lets you obtain large query results by initiating a search just once, its true purpose is re-indexing data. It wasn't designed for querying real-time search results, so using it that way isn't recommended.
Here's why. When you set the scroll timeout parameter, results are only available within that window; once the time expires, the search context is released and additional results cease to appear. Be sure to set the time parameter long enough to retrieve all of the results you need. That's why scrolling works for large data queries, but not for real-time results.
Assistance with Scrolling and Re-indexing of Documents
Client support for scrolling and document re-indexing is available.
To get help with Perl, see Search::Elasticsearch::Client::5_0::Bulk and Search::Elasticsearch::Client::5_0::Scroll
For assistance with Python, see the elasticsearch.helpers module
Let's talk a little more about setting timeouts. Elasticsearch needs to know the timeout duration to keep the search context alive while it works through a large data set. For example, a scroll parameter of ?scroll=3m sets a 3-minute time limit.
POST /cars/_search?scroll=3m
{
    "size": 100,
    "query": {
        "match" : {
            "title" : "volvo"
        }
    }
}
Below is the same POST request made with cURL, also using a 3-minute timeout parameter.
curl -X POST "localhost:9200/cars/_search?scroll=3m" -H 'Content-Type: application/json' -d'
{
    "size": 100,
    "query": {
        "match" : {
            "title" : "volvo"
        }
    }
}
'
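If you'd rather build the same request from Python, here is a minimal standard-library sketch. The host, index, and helper name are assumptions chosen for illustration, and the request is only constructed, not sent:

```python
import json
import urllib.request

def build_scroll_search(host, index, scroll="3m", size=100, query=None):
    """Construct (but do not send) the initial scrolled-search request.

    Hypothetical helper: `host` and `index` are placeholders for your
    own cluster address and index name.
    """
    body = json.dumps({"size": size, "query": query or {"match_all": {}}})
    return urllib.request.Request(
        f"http://{host}/{index}/_search?scroll={scroll}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_scroll_search("localhost:9200", "cars",
                          query={"match": {"title": "volvo"}})
print(req.full_url)      # http://localhost:9200/cars/_search?scroll=3m
print(req.get_method())  # POST
```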
When you query a large data set in Elasticsearch, configure the size parameter to return the maximum number of results you want per batch. The scroll API then retrieves the next group of results each time it receives a request with the _scroll_id.
POST /_search/scroll
{
    "scroll" : "1m",
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
In the example above, the scroll parameter instructs Elasticsearch to keep the search context active for another minute (1m). Either GET or POST can be used, and the URL should not include the index name, because that was already specified in the original search request. When Elasticsearch receives the request with the scroll_id, it returns the next batch of results based on the original query.
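Putting the pieces together, the whole scroll loop can be sketched as a small generator. The function names and the injected callables here are assumptions, stand-ins for real client or HTTP calls, so the control flow can be shown without a live cluster:

```python
def scroll_all(initial_search, scroll_next, scroll="1m"):
    """Yield every hit from a scrolled search, batch by batch.

    `initial_search()` returns the first response dict;
    `scroll_next(scroll_id, scroll)` returns each follow-up response.
    Scrolling stops when a batch comes back empty.
    """
    response = initial_search()
    while True:
        hits = response["hits"]["hits"]
        if not hits:
            break
        yield from hits
        response = scroll_next(response["_scroll_id"], scroll)

# Canned responses standing in for a real cluster:
batches = [
    {"_scroll_id": "abc", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    {"_scroll_id": "abc", "hits": {"hits": [{"_id": "3"}]}},
    {"_scroll_id": "abc", "hits": {"hits": []}},
]
it = iter(batches)
docs = list(scroll_all(lambda: next(it), lambda sid, s: next(it)))
print([d["_id"] for d in docs])  # ['1', '2', '3']
```

Against a real cluster, the two lambdas would be replaced with actual search and scroll requests carrying the scroll and scroll_id parameters shown above.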
Conclusion
This step-by-step tutorial explained how to query a large data set in Elasticsearch and why it's fast and easy when you use the scan and Scroll API features. Because Elasticsearch lets you skip global sorting of the data, you receive results quickly, batch by batch. You work around the time consumption of deep pagination, yet still get the results you need.
Scan and the Scroll API aren't for real-time queries, but for indexing large amounts of data into a newly created index. Keeping that in mind, it's still very good to know that with Elasticsearch, you can query large data sets using the correct requests. For more information, see the Elasticsearch documentation.