Major API Core Rewrite - Happy New Year

Removing 10 Years of Tech Debt: The OpenSearch Re-engineering Journey

Half a year of intense work to eliminate technical debt accumulated over a decade.

In 2016, we built APUtime by replicating data from a relational database into a search engine, which opened up unlimited possibilities for searching, filtering, and aggregating data at low latency (5–50 ms).

We were locked on Elasticsearch 5.6. Upgrading wasn't just a "click a button" task; it required a complete re-engineering because the data structure faced breaking changes.

👎 In SaaS, you strive for well-architected data design. Looking back, we missed the mark here.

The "WTF" Facts (The Legacy State):
(fighting the engine rather than using it)
❌ Tenant-per-Index
❌ Immediate Refresh
❌ Ordinary app writes sent one at a time instead of being collected into bulk writes
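For contrast with the last item, here is a minimal sketch of collecting individual writes into a single bulk request body. The post does not show how writes are batched, so the function name and the (index, id, document) tuple shape are hypothetical. The newline-delimited action/source layout is the format the `_bulk` endpoint expects.

```python
import json


def build_bulk_body(writes: list[tuple[str, str, dict]]) -> str:
    """Turn (index, doc_id, source) tuples into one NDJSON _bulk body.

    Hypothetical helper: the post does not describe its batching code.
    """
    if not writes:
        return ""  # nothing to send; an empty bulk request is not valid
    lines = []
    for index, doc_id, source in writes:
        # Action line: which index and document id this write targets.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

One request carrying many writes, sent without refresh=true, replaces many single-document requests that each force a refresh.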

The New Reality (OpenSearch 3.3+):
We moved from a localized, fragmented mess to a consolidated, high-performance architecture.
✅ Shared Indices: All standard tenants share just 8 indices (one per entity type).
✅ Whale Isolation: Our largest tenants ("Whales") get dedicated, isolated indices (multiple GBs each) to protect the shared pool.
✅ Consolidation: We went from thousands of indices to just a handful.
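A minimal sketch of how a request could be routed under this layout. Everything named here is an assumption for illustration: the index naming scheme, the `tenant_id` field, and the idea that a whale gets one index per entity type are not stated in the post.

```python
def resolve_index(tenant_id: str, entity_type: str, whale_tenants: set[str]) -> str:
    """Pick the index for a tenant's documents of one entity type.

    Hypothetical naming scheme; the post does not give real index names.
    """
    if tenant_id in whale_tenants:
        # Whale: a dedicated index keeps its load out of the shared pool.
        return f"{entity_type}-whale-{tenant_id}"
    # Standard tenant: one shared index per entity type.
    return f"{entity_type}-shared"


def tenant_filter(tenant_id: str) -> dict:
    """Filter clause every query on a shared index needs (assumed field name)."""
    return {"term": {"tenant_id": tenant_id}}
```

The trade-off of shared indices is that tenant isolation moves from the index boundary into the query, so the filter has to be applied on every search.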

🦾 Instance Evolution:

We moved from legacy x86 instances to AWS r8gd (Graviton + NVMe). The price-to-performance ratio is incomparable. The NVMe drives allow us to leverage high-speed I/O for aggressive data merging.

💥 Shard Explosion (Solved):

Old Way: 1,000+ tenant indices, each with 1+ shards, added up to 5,000+ shards. The cluster spent more CPU managing its own state (metadata) than searching data.

New Way: 8+ Shared Indices + a few Whales = a few shards total. The Cluster State size dropped by 99%, making master node operations instant.

🧱 Tiny Segment Explosion

Old Way: refresh=true on every write created thousands of micro-segments per minute.

New Way: We switched to a 3s refresh interval combined with an Aggressive Merge Policy (segments_per_tier: 4). We now use the idle CPU to constantly "defrag" the indices in the background, keeping the segment count low and search speeds high.
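As a sketch, the two values mentioned above map onto index settings like this. The full setting names `index.refresh_interval` and `index.merge.policy.segments_per_tier` are the standard ones; the post only gives the short form, so treat the mapping as an assumption. The function name is hypothetical.

```python
import json


def segment_settings() -> dict:
    """Index settings body for a slower refresh and more aggressive merging."""
    return {
        # Refresh every 3 seconds instead of on every write.
        "index.refresh_interval": "3s",
        # Fewer segments allowed per tier means merges kick in sooner.
        "index.merge.policy.segments_per_tier": 4,
    }


if __name__ == "__main__":
    # Body that could be sent with PUT /<index>/_settings.
    print(json.dumps(segment_settings(), indent=2))
```

For the interval to have any effect, writes must also stop passing refresh=true.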

🔥 CPU Overhead

Old Way: CPU was choked by context switching and managing thousands of idle indices.

New Way: We throttle merge threads per index (max_thread_count). This gives us a form of Quality of Service (QoS): "Whale" indices are limited to 1 thread, while Shared indices get 2. This ensures a massive bulk import from one client never starves the API for everyone else.
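A sketch of how the per-index thread limits could be generated, assuming `max_thread_count` refers to the `index.merge.scheduler.max_thread_count` index setting. The helper and the index names in the usage example are hypothetical.

```python
def thread_limits(shared_indices: list[str], whale_indices: list[str]) -> dict[str, dict]:
    """Map each index name to its merge-thread settings body.

    Assumes max_thread_count means index.merge.scheduler.max_thread_count.
    """
    key = "index.merge.scheduler.max_thread_count"
    limits = {}
    for name in shared_indices:
        limits[name] = {key: 2}  # shared pool serves everyone: 2 threads
    for name in whale_indices:
        limits[name] = {key: 1}  # whales are capped at 1 thread
    return limits
```

For example, `thread_limits(["order-shared"], ["order-whale-t9"])` yields one settings body per index, each of which could be applied through the index settings API.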

🧠 JVM Heap Overhead

Old Way: A huge portion of the Heap was wasted just holding the "Terms Index" (metadata) for thousands of shards.

New Way: With only a few shards, our memory overhead is negligible. We tuned the Translog (128+ MB for shared, 64 MB for whales) to act as a proper buffer, ensuring we only commit to disk when it makes sense, not on every request.
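A sketch of the translog tuning, assuming the sizes above refer to `index.translog.flush_threshold_size` (the post does not name the setting). The function name is hypothetical, and "128mb" stands in for the "128+ MB" lower bound.

```python
def translog_settings(is_whale: bool) -> dict:
    """Settings body controlling how much translog accumulates before a flush.

    Assumption: the post's sizes map to index.translog.flush_threshold_size.
    """
    # Whales flush at 64 MB; shared indices buffer at least 128 MB.
    size = "64mb" if is_whale else "128mb"
    return {"index.translog.flush_threshold_size": size}
```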

✅ The Result?

60%+ cheaper
Modern hardware that idles at 3–5% CPU, with JVM heap usage oscillating between 5% and 65% (G1GC is lazy and waits for a reasonable moment to collect)
Handling the same traffic with zero overhead