What Is New in Elasticsearch 2.3
Elasticsearch 2.3 delivers significant enhancements to query performance, aggregation capabilities, and cluster management. This release focuses on making complex operations faster and more resource-efficient for large-scale deployments.
| Category | Key Updates |
|---|---|
| New Features | Reindex API (new in 2.3); pipeline aggregations and doc values by default (both available since 2.0) |
| Performance | Faster terms aggregations, Improved global ordinals |
| Mapping & Querying | GeoPoint as multi-field, Percolator updates, Query validation |
| Cluster Management | Shard allocation filtering, Disk-based allocation settings |
| Deprecations | Items flagged in the deprecation log; scripting and percolator changes to plan for ahead of 5.0 |
How did pipeline aggregations change data analysis?
Pipeline aggregations, available since Elasticsearch 2.0, compute new metrics from the output of other aggregations instead of from documents. This enables calculations such as moving averages or derivatives directly within an aggregation request.
Instead of post-processing results client-side, you can chain aggregations together. For example, you can calculate the cumulative sum of a daily metric or find the bucket with the maximum value. A parent pipeline aggregation such as cumulative_sum is nested inside the histogram whose buckets it reads, while a sibling such as max_bucket sits beside that histogram and references it with a path like sales_per_month>sales. Either way, the computation stays on the Elasticsearch cluster.
Example Usage
{
  "aggs": {
    "sales_per_month": {
      "date_histogram": {"field": "date", "interval": "month"},
      "aggs": {
        "sales": {"sum": {"field": "price"}},
        "cumulative_sales": {
          "cumulative_sum": {"buckets_path": "sales"}
        }
      }
    }
  }
}
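The "bucket with the maximum value" case mentioned above can be sketched with a sibling pipeline aggregation. Unlike cumulative_sum, max_bucket sits next to the histogram it reads and addresses the metric with a `>`-separated buckets_path. The aggregation name best_month is illustrative, and the date and price fields are assumed to exist in your mapping.

```json
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {"field": "date", "interval": "month"},
      "aggs": {"sales": {"sum": {"field": "price"}}}
    },
    "best_month": {
      "max_bucket": {"buckets_path": "sales_per_month>sales"}
    }
  }
}
```

The response should report the highest monthly sum under best_month, along with the key of the bucket (or buckets) that produced it.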
Why are doc values now the default?
Doc values have been enabled by default since Elasticsearch 2.0 for all fields except analyzed strings, and 2.3 keeps that default. This reduces heap memory usage because the columnar data structure is written to disk at index time and served through the file system cache instead of being built on the JVM heap.
In practice, this means your aggregations and sorts are more memory-efficient out of the box, with less garbage collection pressure. The old fielddata caching mechanism, which was heap-heavy, is no longer the default for most use cases.
Analyzed string fields cannot use doc values, so aggregating or sorting on them still loads fielddata onto the heap. Prefer a not_analyzed sub-field for those operations; this keeps cluster sizing sustainable for heavy analytical workloads.
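One way to keep aggregations and sorts on string data off the heap is to add a not_analyzed sub-field, which uses doc values by default. The following is a minimal mapping sketch for the body of a create-index request. The type name product and the field names title and raw are illustrative assumptions, not names from this release.

```json
{
  "mappings": {
    "product": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "raw": {"type": "string", "index": "not_analyzed"}
          }
        }
      }
    }
  }
}
```

With a mapping like this, full-text queries would target title, while terms aggregations and sorting would target title.raw.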
What makes the Reindex API so useful?
The Reindex API provides a built-in way to copy documents from one index to another. This is essential for index maintenance, mapping changes, and data migration tasks that were previously cumbersome.
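You can use it to change an index's shard count or apply new mappings: create the destination index with the desired settings first, then copy the documents into it. The source index must have _source enabled. Reindexing from a remote cluster is not part of 2.3; that capability arrived in a later release. The API handles the scrolling, bulk indexing, and version conflict management for you, making it more reliable than custom scripts.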
Basic Reindex Command
POST _reindex
{
"source": {"index": "old_index"},
"dest": {"index": "new_index"}
}
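As a sketch of the version conflict handling mentioned above, the request body below copies only a subset of documents and creates only those missing from the destination. Setting op_type to create makes existing documents count as version conflicts, and conflicts set to proceed tells the request to tally them instead of aborting. The status field and its value are hypothetical and stand in for whatever filter fits your data.

```json
{
  "conflicts": "proceed",
  "source": {
    "index": "old_index",
    "query": {"term": {"status": "active"}}
  },
  "dest": {
    "index": "new_index",
    "op_type": "create"
  }
}
```

This body is sent to the same POST _reindex endpoint as the basic command.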
How were geo queries improved?
GeoPoint fields can now be configured as multi-fields. This allows you to index the same geographic point multiple times with different precisions or for different purposes, like a high-precision search and a low-precision aggregation.
The percolator has also been updated to support geo queries. You can now register percolator queries that use geo bounding box, distance, or polygon filters, opening up new real-time alerting scenarios based on location.
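A location-based alert could be sketched as a percolator query registered under the reserved .percolator type, for example with a request such as PUT my_index/.percolator/1 carrying the body below. The index name, the document ID, and the location field are assumptions for illustration, and location is assumed to be mapped as geo_point in that index.

```json
{
  "query": {
    "geo_distance": {
      "distance": "5km",
      "location": {"lat": 40.73, "lon": -74.0}
    }
  }
}
```

Documents sent to the index's _percolate endpoint whose location falls within the radius would then match this registered query.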
What should I know about the deprecations?
Groovy remains the default scripting language throughout the 2.x line, so 2.3 itself requires no script migration. Painless, a scripting language designed for safety and performance, only arrives in Elasticsearch 5.0, where Groovy and the JavaScript and Python language plugins are deprecated in its favor.
Percolation in 2.3 still works by registering queries under the reserved .percolator type and matching documents through the _percolate API. In 5.0 that API is replaced by the percolate query together with a percolator field type, so plan to update your percolation setup when you move past the 2.x line.
Always check the deprecation logs after upgrading to identify any changes needed in your application code or cluster configuration.
FAQ
Do I need to reindex to benefit from doc values being default?
No, existing indices will continue to use their current settings. The new default only applies to newly created indices. To enable doc values on an existing index, you would need to reindex with an updated mapping.
Can I still use Groovy scripts in 2.3?
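Yes. Groovy is the default scripting language in 2.3 and is not deprecated in the 2.x line. Inline and indexed scripts are disabled by default and must be enabled in the elasticsearch.yml configuration file (for example, script.inline: true). Migration to Painless only becomes relevant from 5.0, where Painless is introduced.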
What is the main advantage of pipeline aggregations?
They allow for complex, multi-stage analytics entirely on the server side. This eliminates the need to transfer large datasets to a client application for post-processing, reducing network overhead and simplifying application code.
Is the Reindex API suitable for large indices?
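Yes, but monitor it carefully. The API uses scrolling and bulk indexing, so a very large index can take a long time to copy. Parallel slicing (slices) only arrived in a later release, so in 2.3 the practical levers are the scroll batch size and splitting the work into several requests with a query in source.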
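How do I control fielddata on analyzed string fields?
In 2.3, analyzed string fields still load fielddata on first use unless the mapping turns it off with "fielddata": {"format": "disabled"}; setting the format back to paged_bytes re-enables it. The explicit opt-in ("fielddata": true on text fields) only applies from 5.0 onward. It's generally recommended to aggregate and sort on a not_analyzed sub-field instead, which uses doc values.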
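For reference, here is a hedged sketch of the 2.x mapping fragment that turns fielddata loading off for an analyzed string field, so that accidental sorts or aggregations on it fail fast instead of filling the heap. The field name body is an illustrative assumption, and the fragment is meant as the body of a put-mapping request for an existing type.

```json
{
  "properties": {
    "body": {
      "type": "string",
      "fielddata": {"format": "disabled"}
    }
  }
}
```

The field remains fully searchable; only operations that need fielddata, such as sorting, aggregations, and script access, are affected.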