
5 Surprising Lessons from Achieving Sub-Microsecond RT Latency

When writing code for low-latency trading, many "well-known best practices" must be ignored, or even reversed. What works in theory often falls apart once you measure real-world latency.

  1. Atomics Lie
  After extensive testing, the only atomic that remained in our code was std::atomic_flag, in a few rare spinlocks. Other atomics look good in theory (and are easy to use) but face-plant in practice when we measure latency.
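A minimal sketch of the kind of spinlock the article alludes to. std::atomic_flag is the one atomic guaranteed lock-free, and test-and-set typically compiles to a single exchange instruction; everything else here (the class name, the counter demo) is illustrative, not the article's actual code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Spinlock built on std::atomic_flag: no extra state, guaranteed lock-free.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Acquire ordering: work in the critical section cannot be
        // reordered above this test-and-set.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // Busy-wait; a production version might add a CPU pause hint.
        }
    }
    void unlock() {
        // Release ordering: critical-section writes are visible to the
        // next thread that acquires the lock.
        flag_.clear(std::memory_order_release);
    }
};

SpinLock lock;
long counter = 0;

void work(int iters) {
    for (int i = 0; i < iters; ++i) {
        lock.lock();
        ++counter; // protected by the spinlock
        lock.unlock();
    }
}
```

Spinlocks only make sense when the critical section is shorter than the cost of a context switch, which is exactly the regime this article operates in.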

  2. Big O is Misleading
  Cache efficiency beats algorithmic complexity every time. On a modern CPU with wide SIMD units (AVX), a linear search can outperform a hash table: the scan streams through parallel data paths, and each path touches a small, predictable subset of the data.
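To make the comparison concrete, here is a hedged sketch (function names are mine): an O(n) scan over a contiguous array next to an O(1) hash lookup. The scan is one forward pass the prefetcher can follow and a loop the compiler can auto-vectorize; the hash lookup hides a hashing step and at least one dependent, cache-unfriendly pointer chase.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Linear scan over contiguous memory: branch-predictable, prefetchable,
// and auto-vectorizable with AVX when the element type allows it.
int find_linear(const std::vector<uint64_t>& keys, uint64_t key) {
    for (std::size_t i = 0; i < keys.size(); ++i)
        if (keys[i] == key) return static_cast<int>(i);
    return -1;
}

// Baseline hash lookup: O(1) on paper, but each call pays for hashing
// plus a dependent load into a node that is rarely in cache.
int find_hash(const std::unordered_map<uint64_t, int>& index, uint64_t key) {
    auto it = index.find(key);
    return it == index.end() ? -1 : it->second;
}
```

For the small, hot key sets typical of an order book or symbol table, the linear version often wins; the crossover point depends on the data size and must be measured, not assumed.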

  3. Forget Portability
  In HFT, optimizing for specific hardware always wins over writing software that runs everywhere. Find the best hardware first, then perfect your code for that CPU. Buying the hardware you need isn't cheap, but it's simple.

  4. Memory Layout is Everything
  Everyone knows about memory pools, but the real game-changer is reducing the number of pools (ideally to a single massive pool) and ensuring memory is laid out exactly in the order it's accessed. Never make your CPU jump backward or skip large chunks: sequential access keeps caches hot and minimizes stalls. This had a far bigger impact than expected.
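A minimal sketch of the single-pool idea, under my own naming (the article doesn't show its allocator): one bump-allocated arena where objects are placed in the exact order a later pass will read them, so the hot loop walks memory strictly forward.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One big arena; allocation is a pointer bump, so objects land in memory
// in exactly the order they are allocated (and will later be read).
class Arena {
    std::vector<std::byte> buf_;
    std::size_t off_ = 0;
public:
    explicit Arena(std::size_t bytes) : buf_(bytes) {}

    void* alloc(std::size_t n, std::size_t align) {
        // Round the offset up to the requested alignment (power of two).
        std::size_t p = (off_ + align - 1) & ~(align - 1);
        assert(p + n <= buf_.size()); // a real pool would handle exhaustion
        off_ = p + n;
        return buf_.data() + p;
    }
};
```

Because consecutive allocations are physically adjacent, iterating them in allocation order never jumps backward or skips a large gap, which is precisely the access pattern the article credits for the biggest win.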

  5. Less General = More Optimized
  The more specialized the code, the better it performs. Third-party libraries are too general to be truly optimal. It's shocking how poorly some well-established (and widely used) libraries performed and how much code had to be rewritten in-house. Queues and loggers weren't something we expected to implement, but here we are.
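As a sketch of what such an in-house queue might look like (this is my generic version, not the article's code, and it still uses plain atomics where their final version presumably went further): a single-producer single-consumer ring buffer. Fixing one producer, one consumer, and a power-of-two capacity removes the generality a library queue pays for, leaving one load, one store, and an AND per operation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// SPSC ring buffer: specialized to exactly one producer thread and one
// consumer thread, so no CAS loops and no locks are needed.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf_[N];
    std::atomic<std::size_t> head_{0}; // consumer's read position
    std::atomic<std::size_t> tail_{0}; // producer's write position
public:
    bool push(const T& v) { // producer thread only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false; // full
        buf_[t & (N - 1)] = v;           // wrap is a single AND
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) { // consumer thread only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false; // empty
        out = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

A production version would also pad head_ and tail_ onto separate cache lines to avoid false sharing between the two threads; that is omitted here for brevity.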