Ryzen Optimization

Notes from a talk by Ken Mitchell of AMD

  • Pay attention to cross-core memory latencies on Threadripper (TR)
    • Windows Insider Program builds have improvements for TR memory interconnects
  • Instruction set evolution: Zen+ has a ton of new features—pay attention to CLZERO, SSE3
  • Avoid mixing legacy SSE and AVX instructions (switching between them has a penalty). Move everything to AVX or, if you can’t, issue a VZEROUPPER or VZEROALL before switching
  • CLZERO is an AMD-specific instruction
    • Zeroes a whole cache line with non-temporal (MOVNT-like) semantics; designed to recover from otherwise fatal machine check architecture (MCA) errors caused by uncorrectable corrupt memory
      • E.g., kill a corrupt user process but keep the system running
    • Not meant as a general-purpose way to zero out memory
  • If a load misses the core complex’s (CCX’s) own L1, L2, and L3 caches, the line may come from another CCX’s cache (rather than from RAM)
    • God help you if you need to pull something from RAM attached to another chip
  • AMD μProf profiler
    • Remote profiling
    • Thread concurrency tool
    • “Assess Performance (Extended)” view is a good place to start
  • Before benchmarking, set your BIOS to its highest-performance settings
    • Use AMD Ryzen Master overclocking to set a fixed clock (no boost)
  • Optimizations & lessons learned
    • General: Use the latest Visual Studio compiler (better autovectorization in VS 2019)
    • Test CPUID before calling newer instructions like AVX2, SSE3, FMA4
      • Note that Windows 10 x64 requires SSE2 and PrefetchW (SSE2 is part of the baseline AMD64 ISA; 64-bit Windows has required PrefetchW since 8.1)
      • …but Windows 7 x86 does not, so 32-bit builds still need the check
    • Best practices when counting cores
      • Use all the physical cores (see AMD’s sample code: getDefaultThreadCount())
    • Build command lists in parallel in DX12
      • Aim for 250 draws per physical core
    • Best practices for spinlocks (see AMD’s code sample)
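      • Avoid lock-prefixed instructions while spinning (test the lock variable with a plain read first)
      • Use the pause instruction in the spin loop
      • Align the lock variable so it sits in its own cache line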
      • AMD’s profiler can show you when you’re doing this wrong
        • ALUTokenStall should have near-zero stalls per thousand instructions run
    • Avoid the memcpy and memset regression (introduced in VS 2017, only affects AMD)
      • You might hit this when the length is > 32 and <= 128 and is not known at compile time
      • There’s a complicated workaround by statically linking a DLL
    • Avoid false sharing
      • Occurs when threads running on different processors modify data in the same cache line
      • Often seen when you create a bunch of threads but pack all of their per-thread data into the same cache line
      • Data cache refills CCX per thousand instructions is the metric to look at here in AMD’s profiler
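
To make the false-sharing note concrete, here is a minimal C++ sketch (not from the talk) that gives each worker thread its own cache line. The 64-byte line size and the names PaddedCounter, kWorkers, and count_in_parallel are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size
constexpr unsigned kWorkers = 4;        // hypothetical worker count

// Each counter is aligned (and therefore padded) to a full cache line,
// so two workers never write to the same line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Every worker increments only its own counter; the totals are summed
// after all workers have joined.
std::uint64_t count_in_parallel(std::uint64_t iters) {
    std::array<PaddedCounter, kWorkers> counters{};
    assert(reinterpret_cast<std::uintptr_t>(&counters[0]) % kCacheLine == 0);

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < kWorkers; ++t) {
        pool.emplace_back([&counters, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) counters[t].value += 1;
        });
    }
    for (auto& th : pool) th.join();

    std::uint64_t total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

Dropping the alignas would pack all four counters into one line, which is the layout the note warns about; the result would stay the same but every increment would contend for that line.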

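The spinlock practices listed above can be sketched roughly as follows. This is not AMD’s code sample, only a test-and-test-and-set lock written under the same three rules; SpinLock, cpu_relax, and run_counter are hypothetical names, and the 64-byte alignment is an assumed cache-line size.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }  // emits the x86 PAUSE instruction
#else
inline void cpu_relax() { std::this_thread::yield(); }  // fallback for non-x86 builds
#endif

// Aligned so the lock variable does not share a cache line with other data.
struct alignas(64) SpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            // One locked exchange attempt; on failure, fall through to the spin.
            if (!locked.exchange(true, std::memory_order_acquire)) return;
            // Wait on a plain load so spinning cores issue no locked operations.
            while (locked.load(std::memory_order_relaxed)) cpu_relax();
        }
    }

    void unlock() {
        assert(locked.load(std::memory_order_relaxed));  // must be held
        locked.store(false, std::memory_order_release);
    }
};

// Starts `threads` workers that each add `iters` to a shared counter
// while holding the lock, then returns the final count.
long run_counter(int threads, int iters) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling run_counter(4, 10000) should return 40000 if the lock is doing its job. For long critical sections an OS mutex is usually the better choice; a spinlock like this only makes sense when the lock is held very briefly.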