5 Quick Wins for Ultra-Low-Latency Systems
When optimizing for the lowest latency, small tweaks can make a huge difference. Here are five quick wins everyone should know:
- Use Prefetching – If you know you'll be accessing some memory soon, use `__builtin_prefetch` in C++ to pull it into cache early. Avoiding cache misses is an easy latency win. Surprisingly, even a prefetch issued immediately before the data was used improved latency in most cases I measured!
- Kernel Isolation – Everyone isolates the kernel from their hot cores, but you can go further: restrict the kernel's L3 footprint by controlling which CPU cache ways it may use (e.g. via Intel's Cache Allocation Technology through resctrl). This keeps the kernel from polluting your L3 cache.
- Enable Hyperthreading! – I've seen posts suggesting you disable Hyperthreading for "low latency". I strongly disagree, and it's not just a matter of opinion: there are no real drawbacks to leaving HT enabled if you pin your threads properly. If you need a full physical core and don't want anything running on its sibling, simply don't pin anything there! Meanwhile, leaving HT enabled gives each sibling thread its own architectural register set (registers being the fastest memory there is), which can improve aggregate instruction throughput.
- HT SPSC Queue Optimization – Building on the previous point: place the producer and consumer threads of a Single-Producer Single-Consumer (SPSC) queue on two hyperthread siblings. The siblings share their L1 and L2 caches, so data hand-off between them is significantly faster.
- Enable Hugepages – Standard 4 KiB pages slow you down. Enable hugepages and map memory with `MAP_HUGETLB` in C++ to reduce TLB misses and speed up memory access.
🚀 These are easy wins that can make a real impact. What’s your favorite low-latency trick?