Maintaining production systems is one of those unfortunate tasks that we need to deal with… I mean, why can’t they just run themselves? I get tired of daily tasks extremely quickly. Now that I have a few ongoing Elasticsearch clusters to deal with, I had to come up with a way to keep them singing.
As a developer, I usually don’t have to deal with these kinds of things, but in the startup world I get to do it all: maintenance, monitoring, development, and so on.
Jenkins makes this kind of stuff super easy. With a slew of Python programs that use parameters and environment variables to connect to the right Elasticsearch cluster, I’m able to perform the following tasks, in order (order is key):
- Create Snapshot
- Monitor Snapshot until it’s done
- Delete Old Data (this is especially interesting in our use case, since we have a lot of intentional False Positive data for connectivity testing)
- Force Merge Indices
I have Jenkins set up to trigger each downstream job after the prior one completes.
I could do a cool Jenkins Pipeline… in my spare time.
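The scripts themselves don't need to be fancy. Here is a minimal sketch of how one of them might pick up its target cluster from the environment that Jenkins hands it. The variable names `ES_URL` and `ES_SNAPSHOT_REPO`, and their defaults, are placeholders for illustration only.

```python
import os
from collections import namedtuple

# Hypothetical settings a Jenkins job could pass in as environment variables.
ClusterConfig = namedtuple("ClusterConfig", "url repo")


def cluster_from_env(env=None):
    """Read the target cluster from environment variables (names are assumptions)."""
    env = os.environ if env is None else env
    # Strip a trailing slash so URLs can be joined with "/" later on.
    url = env.get("ES_URL", "http://localhost:9200").rstrip("/")
    repo = env.get("ES_SNAPSHOT_REPO", "s3_backups")
    return ClusterConfig(url=url, repo=repo)
```

With something like this, the same script can serve every cluster, and each Jenkins job only differs in the parameters it sets.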
Daily snapshots are critical in case of cluster failure. With a four-node cluster, I’m running in a fairly safe setup, but if something goes catastrophically wrong, I can always restore from a snapshot. My snapshots go to AWS S3 buckets.
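The create-then-monitor pair could look roughly like the sketch below. It uses only the standard library and the Elasticsearch snapshot REST endpoints, and it assumes the S3 repository is already registered on the cluster. The function names and the 30 second poll interval are my own choices for illustration.

```python
import json
import time
import urllib.request


def snapshot_request(base_url, repo, name):
    """Build the PUT that starts a snapshot without blocking on it."""
    url = "{}/_snapshot/{}/{}?wait_for_completion=false".format(base_url, repo, name)
    return urllib.request.Request(url, method="PUT")


def snapshot_state(base_url, repo, name):
    """Ask the cluster for the snapshot's state, e.g. IN_PROGRESS or SUCCESS."""
    url = "{}/_snapshot/{}/{}".format(base_url, repo, name)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["snapshots"][0]["state"]


def wait_for_snapshot(get_state, poll_seconds=30, sleep=time.sleep):
    """Poll until the snapshot leaves IN_PROGRESS, then return the final state."""
    while True:
        state = get_state()
        if state != "IN_PROGRESS":
            return state
        sleep(poll_seconds)
```

A job would send `snapshot_request(...)` with `urllib.request.urlopen`, then call `wait_for_snapshot(lambda: snapshot_state(url, repo, name))` and exit non-zero on anything other than `SUCCESS`, so that Jenkins does not kick off the delete step after a failed snapshot.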
Delete Old Data
When dealing with network monitoring, network sensors, and storing NSM data (see Suricata NSM Fields), we determined that one easy way to test end-to-end integration is to insert some obviously fake False Positives into our system. We stood up a Threat Intelligence Platform (Soltra Edge) to serve some fake Indicators/Observables: Google.com, Yahoo.com, etc. They show up in everyone’s networks if there is user traffic. Now, this is great for verifying connectivity, but long term it adds up to LOTS of traffic that I really don’t need to store… so it gets deleted.
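One way to do that cleanup is the `_delete_by_query` endpoint with a `terms` query on the fake indicators. The sketch below only builds the request. The index pattern `nsm-*` and the field name `dns.rrname` are assumptions for illustration, and the field would need to be an exact-match (not analyzed) field for a `terms` query to hit it.

```python
import json
import urllib.request


def delete_by_query_request(base_url, index, field, values):
    """Build a POST that deletes every document whose field matches one of values."""
    body = {"query": {"terms": {field: list(values)}}}
    url = "{}/{}/_delete_by_query".format(base_url, index)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical usage: index pattern and field name are placeholders.
request = delete_by_query_request(
    "http://localhost:9200", "nsm-*", "dns.rrname", ["google.com", "yahoo.com"]
)
```

Passing the result to `urllib.request.urlopen` would actually run the delete, so it is worth trying the same query against `_search` first to see what it matches.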
Force Merge Indices
There is a lot of magic that happens in Elasticsearch. That’s fantastic. Force merging lets ES reduce the number of segments in a shard, which improves query performance. This is really only useful for indices that are no longer receiving data. In our use case, that’s historical data. I delete the old data first, then force merge those indices, so the merge also clears out the deleted documents.
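A sketch of that last step: pick the indices that are old enough to be done receiving data, then build a `_forcemerge` request for each. It assumes daily indices whose names end in a `YYYY.MM.DD` date, which may not match your naming. The two-day cutoff and the function names are also just illustrative.

```python
import urllib.request
from datetime import date, datetime, timedelta


def historical_indices(names, today, min_age_days=2):
    """Keep indices whose date suffix (assumed YYYY.MM.DD) is at least min_age_days old."""
    cutoff = today - timedelta(days=min_age_days)
    old = []
    for name in names:
        try:
            day = datetime.strptime(name[-10:], "%Y.%m.%d").date()
        except ValueError:
            continue  # no date suffix, so leave this index alone
        if day <= cutoff:
            old.append(name)
    return old


def forcemerge_request(base_url, index, max_num_segments=1):
    """Build the POST that merges an index down to max_num_segments per shard."""
    url = "{}/{}/_forcemerge?max_num_segments={}".format(base_url, index, max_num_segments)
    return urllib.request.Request(url, method="POST")
```

Merging down to a single segment is the aggressive option. It makes sense here only because these indices will never be written to again.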
A day in the life… of Jenkins.