Notes from a talk by Ken Mitchell of AMD
- Have to pay attention to cross core memory latencies in Threadripper
- Improvements in Windows Insider Program for TR memory interconnects
- Instruction set evolution: Zen+ has a ton of new features—pay attention to CLZERO, SSE3
- Avoid mixing legacy SIMD and AVX instructions (has a penalty for switching)—move everything to AVX, or if you can’t, first do a VZEROUPPER or VZEROALL
- CLZERO is an AMD-specific instruction
- An atomic non-temporal move (MOVNT), designed to recover from otherwise fatal machine check architecture (MCA) errors caused by uncorrectable corrupt memory
- E.g., kill a corrupt user process but keep the system running
- Not for zeroing out memory
- If a load misses its own core complex’s (CCX’s) L1, L2, and L3 caches, the data may come from another CCX’s cache (rather than having to come from RAM)
- God help you if you need to pull something from RAM attached to another chip
- AMD μProf profiler
- Remote profiling
- Thread concurrency tool
- “Assess Performance (Extended)” view is a good place to start
- Before benchmarking, set your BIOS to highest perf
- Use AMD Ryzen Master overclocking to set a fixed clock (no boost)
- Optimizations & lessons learned
- General: Use the latest Visual Studio compiler (better autovectorization in VS 2019)
- Test CPUID before calling newer instructions like AVX2, SSE3, FMA4
- Note that Windows 10 x64 requires SSE2 and PrefetchW (since these are required by the AMD64 ISA)
- …but Windows 7 x86 does not
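A minimal sketch of the CPUID check, not code from the talk. The names `CpuFeatures`, `cpuidLeaf`, `readXcr0`, and `detectCpuFeatures` are hypothetical; the bit positions are the ones documented in the vendor CPUID references. It also assumes that AVX-class features should only be reported when the OS saves the YMM registers (OSXSAVE plus XCR0), since the CPUID bit alone does not guarantee that.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  #include <intrin.h>
  #include <immintrin.h>
  #define NOTES_X86 1
#elif defined(__x86_64__) || defined(__i386__)
  #include <cpuid.h>
  #define NOTES_X86 1
#else
  #define NOTES_X86 0   // non-x86 build: every feature reports false
#endif

struct CpuFeatures {
    bool sse3 = false, avx = false, avx2 = false, fma4 = false;
};

// Runs CPUID for (leaf, subleaf) into out = {EAX, EBX, ECX, EDX}.
// Returns false if the CPU does not implement that leaf.
static bool cpuidLeaf(std::uint32_t leaf, std::uint32_t subleaf, std::uint32_t out[4]) {
#if NOTES_X86 && defined(_MSC_VER)
    int r[4];
    __cpuid(r, static_cast<int>(leaf & 0x80000000u));  // highest leaf in this range
    if (static_cast<std::uint32_t>(r[0]) < leaf) return false;
    __cpuidex(r, static_cast<int>(leaf), static_cast<int>(subleaf));
    for (int i = 0; i < 4; ++i) out[i] = static_cast<std::uint32_t>(r[i]);
    return true;
#elif NOTES_X86
    return __get_cpuid_count(leaf, subleaf, &out[0], &out[1], &out[2], &out[3]) != 0;
#else
    (void)leaf; (void)subleaf; (void)out;
    return false;
#endif
}

// Reads XCR0, which tells us which register state the OS saves on context switch.
// Only call this after CPUID reports OSXSAVE.
static std::uint64_t readXcr0() {
#if NOTES_X86 && defined(_MSC_VER)
    return _xgetbv(0);
#elif NOTES_X86
    std::uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
    return 0;
#endif
}

CpuFeatures detectCpuFeatures() {
    CpuFeatures f;
    std::uint32_t r[4];
    if (!cpuidLeaf(1, 0, r)) return f;
    f.sse3 = ((r[2] >> 0) & 1u) != 0;                  // leaf 1, ECX bit 0
    const bool osxsave = ((r[2] >> 27) & 1u) != 0;     // leaf 1, ECX bit 27
    const bool avxHw   = ((r[2] >> 28) & 1u) != 0;     // leaf 1, ECX bit 28
    // XCR0 bits 1 and 2: the OS saves XMM and YMM state.
    const bool ymmSaved = osxsave && (readXcr0() & 0x6u) == 0x6u;
    f.avx = avxHw && ymmSaved;
    if (f.avx && cpuidLeaf(7, 0, r))
        f.avx2 = ((r[1] >> 5) & 1u) != 0;              // leaf 7, EBX bit 5
    if (f.avx && cpuidLeaf(0x80000001u, 0, r))
        f.fma4 = ((r[2] >> 16) & 1u) != 0;             // leaf 0x80000001, ECX bit 16
    return f;
}
```

Typical use would be to call `detectCpuFeatures()` once at startup and pick a code path from the result, e.g. an AVX2 routine when `avx2` is true and an SSE2 routine otherwise.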
- Best practices when counting cores
- Use all the physical cores (see the AMD-specific sample code, getDefaultThreadCount())
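A rough sketch of counting physical cores, assuming Windows and the Win32 call `GetLogicalProcessorInformationEx`. This is not AMD's `getDefaultThreadCount()` sample, which may apply additional processor-specific rules; `physicalCoreCount` is a hypothetical name. On other platforms the sketch falls back to the logical processor count, which overcounts when SMT is enabled.

```cpp
#include <thread>
#include <vector>
#ifdef _WIN32
  #ifndef NOMINMAX
  #define NOMINMAX
  #endif
  #include <windows.h>
#endif

// Hypothetical helper: number of physical cores, or the logical
// processor count when the physical count cannot be determined.
unsigned physicalCoreCount() {
    // hardware_concurrency() counts logical processors and may return 0.
    unsigned logical = std::thread::hardware_concurrency();
    if (logical == 0) logical = 1;
#ifdef _WIN32
    // First call only asks for the required buffer size.
    DWORD bytes = 0;
    GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &bytes);
    if (bytes == 0) return logical;

    std::vector<unsigned char> buf(bytes);
    auto* first =
        reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buf.data());
    if (!GetLogicalProcessorInformationEx(RelationProcessorCore, first, &bytes))
        return logical;

    // The buffer holds variable-sized records, one per physical core.
    unsigned cores = 0;
    for (DWORD off = 0; off < bytes;) {
        auto* info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(
            buf.data() + off);
        if (info->Relationship == RelationProcessorCore) ++cores;
        off += info->Size;
    }
    return cores != 0 ? cores : logical;
#else
    return logical;  // fallback: logical processors, not physical cores
#endif
}
```

The result would normally feed the size of a worker thread pool, leaving the final decision (for example, reserving a core for the main thread) to the application.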
- Build command lists in parallel in DX12
- Aim for 250 draws per physical core
- Best practices for spinlocks (see AMD’s code sample)
- Avoid lock prefix instructions
- Use the pause instruction
- Align the lock variable
- AMD’s profiler can show you when you’re doing this wrong
- ALUTokenStall should have near-zero stalls per thousand instructions run
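The spinlock advice above can be sketched as a test-and-test-and-set lock. This is an illustration, not AMD's code sample: `SpinLock` and `cpuRelax` are hypothetical names, and the 64-byte cache line size is an assumption. The waiting loop spins on a plain load with `pause`, so the locked read-modify-write instruction is only issued when the lock looks free, and the lock variable has a cache line to itself.

```cpp
#include <atomic>
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
  #include <immintrin.h>
  // PAUSE hints to the core that this is a spin-wait loop.
  static inline void cpuRelax() { _mm_pause(); }
#else
  static inline void cpuRelax() {}  // no-op placeholder on other architectures
#endif

constexpr std::size_t kCacheLineBytes = 64;  // assumed cache line size

class alignas(kCacheLineBytes) SpinLock {
public:
    void lock() {
        for (;;) {
            // One atomic exchange attempts to take the lock.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // Lock is held: wait using plain loads, not atomic RMW instructions.
            while (locked_.load(std::memory_order_relaxed)) cpuRelax();
        }
    }

    bool try_lock() {
        // Cheap load first so a held lock never triggers the exchange.
        return !locked_.load(std::memory_order_relaxed) &&
               !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};
```

Because it provides `lock`, `try_lock`, and `unlock`, the class works with `std::lock_guard<SpinLock>`. Note that heap-allocating an over-aligned type only honors the alignment from C++17 onward.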
- Avoid the memcpy and memset regression (introduced in VS 2017, only affects AMD)
- You might hit this when the length is > 32 and <= 128 and is not known at compile time
- There’s a complicated workaround by statically linking a DLL
- Avoid false sharing
- Occurs when threads running on different processors modify data in the same cache line
- Often see this when you create a bunch of threads, but pack all the data for the threads on the same cache line
- Data cache refills CCX per thousand instructions is the metric to look at here in AMD’s profiler
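To illustrate the false sharing point, here is a small sketch comparing per-thread counters packed into one cache line with counters padded to a line each. All names (`PackedCounter`, `PaddedCounter`, `countInParallel`, `kMaxThreads`) are hypothetical and the 64-byte line size is an assumption. Both variants compute the same total; on a multi-core machine the packed one would be expected to run slower because every increment invalidates the line the other threads are using.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;   // assumed cache line size
constexpr unsigned kMaxThreads = 16;     // arbitrary cap for this sketch

// Eight of these fit in one 64-byte line, so neighboring threads collide.
struct PackedCounter {
    std::atomic<std::uint64_t> value{0};
};

// Aligning to the line size also pads sizeof to one full line per counter.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Each thread increments only its own slot; returns the sum of all slots.
template <typename Counter>
std::uint64_t countInParallel(unsigned threads, std::uint64_t iters) {
    Counter slots[kMaxThreads];          // stack array keeps the alignment
    if (threads > kMaxThreads) threads = kMaxThreads;

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&slots, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                slots[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();

    std::uint64_t total = 0;
    for (unsigned t = 0; t < threads; ++t)
        total += slots[t].value.load(std::memory_order_relaxed);
    return total;
}
```

Timing `countInParallel<PackedCounter>(n, iters)` against `countInParallel<PaddedCounter>(n, iters)` is one way to see the effect; the profiler metric mentioned above is the more direct way to confirm it in real code.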