Elasticsearch has a built-in mechanism, called Disk-based Shard Allocation, which prevents it from flooding the disk with index data.
The impact of this feature differs slightly between the DataCenter edition and the other editions, because ES runs on several nodes only in the former.
When ES runs on a single node, ES makes all indices read-only as soon as the 95% (default value) disk watermark is reached.
When ES runs on multiple nodes, the impact is smoother:
- 85% watermark reached on some ES node: no more shards will be allocated to this node
- 90% watermark reached on some ES node: ES will try to move shard(s) away from this node
- 95% watermark reached on some ES node: ES will now make read-only any index whose shards cannot be moved away from this node
Freeing disk space is not enough to recover from indices being read-only, and restarting SQ (and therefore ES) is not enough either. Each index must be made read-write again with a command such as:
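For illustration, here is a minimal Python sketch of such a call. It assumes the block set by ES is `index.blocks.read_only_allow_delete`, which is cleared by setting it to `null` through the index settings API. The endpoint `http://localhost:9001` and the helper names are hypothetical and must be adapted to the actual ES host and port.

```python
import json
import urllib.request

# Assumption: HTTP endpoint of the ES node; host and port are hypothetical.
ES_URL = "http://localhost:9001"


def build_unblock_request(base_url: str = ES_URL, index: str = "_all") -> urllib.request.Request:
    """Build a PUT <index>/_settings request that clears the read-only block."""
    # A JSON null resets the setting to its default (no block).
    body = json.dumps({"index.blocks.read_only_allow_delete": None}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def unblock(base_url: str = ES_URL, index: str = "_all") -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_unblock_request(base_url, index)) as resp:
        return resp.status
```

Calling `unblock()` targets every index through `_all`; passing an index name restricts the reset to that index.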
Because of this, users who got into that situation had no option other than deleting their indices to recover, see:
Two immediate actions should be taken:
- users should be provided with a means to recover from their indices being read-only without having to delete and rebuild them
  - rebuilding the indices can be very costly
  - it's too strong a punishment for what could be just a transient or accidental lack of disk space
- documentation should be updated to inform users
We are currently using the ES default disk watermark values (85%, 90% and 95%). While there is no reason to disable this feature (ES offers this option), these values probably do not make sense in some (many?) situations. E.g.:
- SQ ES indices barely take 1 GB on disk. They were made read-only on my personal computer because I reached the 85% watermark of a 70 GB disk, although I still had 9 GB left!
- I have a huge enterprise machine with 1 TB of disk: ES indices will be made read-only while I still have 150 GB of free disk!
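To make the arithmetic behind these examples explicit, the following sketch computes how much free space remains when each default watermark is reached. It assumes the percentages apply to used disk space; the `WATERMARKS` table and the function name are illustrative only.

```python
# Default thresholds, in percent of disk used, as listed above.
WATERMARKS = {"low": 85, "high": 90, "flood_stage": 95}


def free_space_at_watermarks(disk_size_gb: float) -> dict:
    """Return the free space (in GB) left when each watermark is reached."""
    return {
        name: disk_size_gb * (100 - pct) / 100
        for name, pct in WATERMARKS.items()
    }


for size_gb in (70, 1000):
    print(size_gb, free_space_at_watermarks(size_gb))
# 70 {'low': 10.5, 'high': 7.0, 'flood_stage': 3.5}
# 1000 {'low': 150.0, 'high': 100.0, 'flood_stage': 50.0}
```

The free space left at a percentage threshold grows linearly with disk size, which is why fixed percentages look wasteful on large disks.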
Shall we offer the possibility to configure the watermark thresholds (they can be percentages or byte values)? Should SQ override them based on some algorithm?
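If the thresholds were made configurable, one option would be to push them through the ES cluster settings API (`PUT /_cluster/settings`). The sketch below only builds the request body. It assumes the standard `cluster.routing.allocation.disk.watermark.*` setting names, and it assumes that percentages and byte values must not be mixed, which ES is expected to reject. Whether SQ would expose that API or write the values to the ES configuration instead is an open question; the function name is hypothetical.

```python
import json

PREFIX = "cluster.routing.allocation.disk.watermark."


def watermark_settings(low: str, high: str, flood_stage: str) -> dict:
    """Build a cluster settings body for the three disk watermarks.

    Values are either percentages of used disk ("85%") or byte values
    of free disk ("10gb"); all three must use the same kind.
    """
    values = {"low": low, "high": high, "flood_stage": flood_stage}
    kinds = {value.strip().endswith("%") for value in values.values()}
    if len(kinds) != 1:
        raise ValueError("use either percentages or byte values for all three watermarks")
    return {"persistent": {PREFIX + name: value for name, value in values.items()}}


# Byte values express the minimum free space to keep on disk.
print(json.dumps(watermark_settings("10gb", "5gb", "2gb"), indent=2))
```

With byte values, the 70 GB and 1 TB machines from the examples above would hit the thresholds at the same amount of free space rather than at the same ratio.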
At SQ startup, reset any index that is read-only.
If there is still no free disk, ES will either put the indices back into read-only or reject the reset. This should be confirmed.
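As a sketch of the detection step at startup, the function below lists the indices that carry the read-only block. It assumes the input is the parsed JSON response of `GET /_all/_settings` and that the block appears as `index.blocks.read_only_allow_delete` in each index's settings. The function name and the sample index names are illustrative only.

```python
def read_only_indices(settings_response: dict) -> list:
    """Return the sorted names of indices flagged read_only_allow_delete."""
    blocked = []
    for name, entry in settings_response.items():
        blocks = entry.get("settings", {}).get("index", {}).get("blocks", {})
        # ES reports setting values as strings ("true"/"false").
        if str(blocks.get("read_only_allow_delete", "false")).lower() == "true":
            blocked.append(name)
    return sorted(blocked)


# Hypothetical response with one blocked and one healthy index.
sample = {
    "issues": {"settings": {"index": {"blocks": {"read_only_allow_delete": "true"}}}},
    "components": {"settings": {"index": {"number_of_shards": "5"}}},
}
print(read_only_indices(sample))  # ['issues']
```

Running the same check again after the reset would show whether ES put the block back, which is one way to confirm the behavior described above.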
A requirement for 15% free disk was [added recently|https://github.com/SonarSource/sonar-enterprise/commit/4b3e712a29fe14a48d6646372783b9e12071163d].
Documentation should be extended to:
- mention how to recover from read-only indices?
- mention the behavior for the DataCenter edition?
- mention how to tune Disk-based Shard Allocation in SQ?