I did an upgrade with the team from 1.7 to 7.5.2 a few years ago; we used Terraform to build the 7.5.2 cluster with about 28 nodes. First we took a snapshot to move the data from 1.7 to 2.4, and we kept the two clusters in sync by having our applications write to both. To get them to a synced state right before snapshotting, we set a Redis key that told our application servers to start writing every document changed or created to a Redis set, so we would have a set of everything changed since the snapshot. This was to account for the time between snapshotting and getting the new cluster up. Once we had the set of changes synced, we could test queries by switching a customer account to read from 2.4 via another Redis set of upgrade accounts. Once we were confident and saw no new deprecations, we repeated the process for 5.6 and then 7.5... as I recall we could skip 6.x. It was an intense few weeks but definitely worth it for us. We also cleaned up our deployment to have dedicated sets of master, data and client nodes.
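A hedged sketch (not our actual code) of that change-tracking idea: while the snapshot/restore is in flight, every create or update also records the document ID in a Redis set, so the delta can be replayed into the new cluster afterwards. Key names and client setup here are invented for illustration.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

DUAL_WRITE_FLAG = "es_upgrade:dual_write_enabled"   # hypothetical flag the app servers watch
CHANGED_DOCS_SET = "es_upgrade:changed_doc_ids"     # hypothetical set of changed document IDs

def on_document_write(doc_id: str) -> None:
    """Called by the application after any document create or update."""
    if r.get(DUAL_WRITE_FLAG):
        r.sadd(CHANGED_DOCS_SET, doc_id)  # remember it for the post-snapshot sync

def replay_changes(reindex_fn) -> None:
    """Once the new cluster is restored, re-index everything changed since the snapshot."""
    for doc_id in r.smembers(CHANGED_DOCS_SET):
        reindex_fn(doc_id.decode())
```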
FogBugz was still on twelve ElasticSearch 1.6 nodes when I left in 2018. We also had a custom plugin (essentially requesting facets that weren't stored in ElasticSearch back from FogBugz), which was the main reason we hadn't spent much time thinking about upgrading it. To keep performance adequate, we scheduled cache flush operations that, even at the time, we knew were pants-on-head crazy to be doing in production. I can't remember if we were running 32-bit or 64-bit with Compressed OOPs.
Kiln was on an even older version, v1.4 if I remember correctly. And one of the shards had a corruption warning, yet it didn't seem to affect stability or results. But that wasn't a fun cluster to operate, since it refused to do certain types of maintenance because of the supposed corruption.
Hopefully the newer versions are easier to migrate between. I don't remember what exactly was preventing us from upgrading, but I'm sure part of it was wanting to avoid a full reindex.
It's good to hear stories of real-world systems. If you only look at blog posts you get the idea that everyone is doing everything perfectly, but of course it's not really like that at all...
I've heard horror stories from friends about working at meltwater. Setting that aside for a moment, this is an amazing software engineering achievement.
Pulling off this level of scale with Elasticsearch is no easy feat and very impressive from a technical perspective. When you're running ES with petabytes of mission critical data as a core service powering the universe of a business, cluster rebuilds aren't an option (or maybe they are, as a last resort, but absolutely will not be acceptable on an ongoing basis).
Relying on Elasticsearch mega-clusters in this manner is akin to running an ultra-marathon with a really sharp pair of scissors glued in each hand. Or maybe even more extreme than my (admittedly lame) analogy.
Running nodes with such high shard counts is an appreciably precarious proposition, because there is a fair amount of overhead in the Elasticsearch management protocol. I wonder what the performance testing strategy entailed.
I have a lot of respect for the engineers working to make this project and service a success story. When it comes to Elasticsearch at scale, such outcomes are the exception.
I usually don't explain my downvotes, but I thought your comment was good overall; the "horror stories from friends about working at meltwater" bit, without explaining what they are, just makes it a bit unfair.
As criticism, it's very vague, and as someone who doesn't work at Meltwater (for the last 5 years or so at least) it doesn't give me any information either. Well except that there are rumors about Meltwater, but that would be true about any large corporation.
Maybe I misunderstood and the horror stories were about ES, but I got it as being about the company itself. Could you expand? What type of stories? :)
One time, a few years ago, a particularly nasty query was executed over and over again, and it took a few hours to find it and then block it.
And during that time so many nodes had become slow and unresponsive that another (for us) previously unseen memory leak started to occur.
Nodes kept building up queues of unanswered ping requests, and those requests contained our 100 MB cluster state, so the heaps filled up and even more nodes became unresponsive.
And from then on the whole thing turned into a death spiral of doom.
After trying, and failing, to get it under control for 48 hours, we gave up and rebuilt the whole cluster from scratch, using the snapshots we store on S3.
The recovery took another 90 hours or so. That was not a fun week.
Non-technically, it was a horror show. I worked at the company from 2005/06 to 2012, when I quit after witnessing shameful behaviour towards women, a party culture that literally led to rape allegations, and a CEO who looted the company for money and shipped it to tax havens.
One of the area managers - Kaveh, IIRC - also had a double standard in line with Trump's. He was very "don't put your pen in the company ink", and proceeded to get one of his subordinates pregnant.
I remember there was an "anti Meltwater blog" at some point, but I can't find it now. I don't remember the URL either, so I can't look it up on archive.org. However, this site[0] seems to contain copy-pasted stuff from it.
As I said, my experiences are 10+ years old, and hopefully things have improved.
Is there no other search database with persistent storage besides Elastic/Lucene/Solr?
I get that there's little money to be made in these things, but it's surprising. It seems like most full-text search offerings are either relatively simple plug-ins to existing databases or in-memory only.
Yes, there is. We moved from ES to Vespa (vespa.ai) and never looked back. We got better results, better speed and WAY lower maintenance costs. I really don't understand how underrated this project is.
Vespa seems like a great match for Elastic's text and vector search, but not for classic "OLTP"-style queries.
For example, until very recently Vespa did not even have case-sensitive string field matching [1]; doing a strict equality query on a string field was not possible, and the authors did not seem to see why it would be useful. Vespa lacks a lot of this kind of basic search functionality, making it less general-purpose than Elasticsearch.
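For comparison, here's roughly what that kind of strict-equality lookup looks like against Elasticsearch; the index name, field and host are made up for this sketch, and it assumes the field has a keyword mapping (as the default dynamic mapping provides).

```python
import requests

query = {
    "query": {
        "term": {  # exact, case-sensitive match: no analysis, no lowercasing
            "author.keyword": "Alice Smith"
        }
    }
}

resp = requests.get("http://localhost:9200/documents/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_source"])
```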
It's clear that it's a very powerful search engine, though it also feels antiquated in many ways. It's very obvious that it's an ancient project that has been worked on by many different people throughout the years, with no cohesive vision or design, though it does seem like they're slowly cleaning things up. (The documentation used to be much worse, for one.)
Nearly a decade ago (oh god) I converted some overdesigned five node ES mess to https://github.com/mchaput/whoosh. It's (obviously) not the fastest or anything, but it was more than good enough for low-dozens of GBs of mostly static data.
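For anyone curious, a minimal Whoosh setup along those lines looks roughly like this; the schema, paths and documents are made up for the example, not the original index.

```python
import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

# On-disk index with a stored ID field and a full-text body field.
schema = Schema(doc_id=ID(stored=True, unique=True), body=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

# Index a couple of documents.
writer = ix.writer()
writer.add_document(doc_id="1", body="mostly static reference data")
writer.add_document(doc_id="2", body="a few dozen gigabytes of documents")
writer.commit()

# Query them back.
with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse("documents")
    for hit in searcher.search(query):
        print(hit["doc_id"])
```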
> In order to control how queries are executed, we have built a plugin which exposes a set of custom query types. We use these query types to provide functionality and performance optimisations not available in stock Elasticsearch. For example, we have implemented wildcards within phrases, with support for executing within SpanNear queries. We optimise “*” to a match-all-query. And a whole lot of other things.
Did you port the in-house plugins? Seems like a big blocker.
I don't want to spoil the other blog posts but we managed to solve almost all of our custom use cases without modifying elasticsearch itself. We still have one custom plugin but only to enhance functionality, not for performance and stability reasons.
While I fully understand why you run this thing with 300+ nodes as you do, I have to wonder, just for fun - could you actually fit this whole thing on a single large server? Looks like something with 16 TiB RAM and 2 PiB SSD storage is actually a server you could theoretically buy today?