Half a year of intense work to eliminate technical debt accumulated over a decade.
In 2016, we built APUtime by replicating data from a relational database into a Search Engine = unlimitedpossibilities for searching, filtering, and aggregating data at low latency (5–50ms).
We were locked on Elasticsearch 5.6. Upgrading wasn't just a "click a button" task; it required a complete re-engineering because the data structure faced breaking changes.
👎 In SaaS, you strive for well-architected data design. Looking back, we missed the mark here.
The "WTF" Facts (The Legacy State):
(fighting the engine rather than using it)
❌ Tenant-per-Index
❌ Immediate Refresh
❌ Basic app traffic not collected for bulk writes
The New Reality (OpenSearch 3.3+):
We moved from a localized, fragmented mess to a consolidated, high-performance architecture.
✅ Shared Indices: All standard tenants share just 8 indices (one per entity type).
✅ Whale Isolation: Our largest tenants ("Whales") get dedicated, isolated indices (more than GBs) to protect the shared pool.
✅ Consolidation: We went from thousands of indices to roughly a few.
We moved from legacy x86 instances to AWS r8gd (Graviton + NVMe). The price-to-performance ratio is incomparable. The NVMe drives allow us to leverage high-speed I/O for aggressive data merging.
Old Way: 1,000+ tenants × 1+ shards = 5,000+ shards. The cluster spent more CPU managing its own state (metadata) than searching data.
New Way: 8+ Shared Indices + a few Whales = a few shards total. The Cluster State size dropped by 99%, making master node operations instant.
Old Way: refresh=true on every write created thousands of micro-segments per minute.
New Way: We switched to a 3s refresh interval combined with an Aggressive Merge Policy (segments_per_tier: 4). We now use the idle CPU to constantly "defrag" the indices in the background, keeping the segment count low and search speeds high.
Old Way: CPU was choked by context switching and managing thousands of idle indices.
New Way: We use Index Thread Throttling (max_thread_count: 2). We implemented Quality of Service (QoS) by throttling "Whale" indices to 1 thread while giving Shared indices 2 threads. This ensures a massive bulk import from one client never starves the API for everyone else.
Old Way: A huge portion of the Heap was wasted just holding the "Terms Index" (metadata) for thousands of shards.
New Way: With only a few Shards, our memory overhead is negligible. We tuned the Translog (128+MB for shared, 64MB for whales) to act as a proper buffer, ensuring we only commit to disk when it makes sense, not on every packet.
✅ The Result?