
This article is a follow-up to our earlier post on the hardware-friendliness of HyperPlonk, expanding on claims made about the Sumcheck protocol. In that post, we argued that the performance bottleneck in Sumcheck stems not from raw computation, but from memory access patterns. This is a critical distinction: ideally, cryptographic protocols like Sumcheck should be compute-bound rather than memory-bound, since modern hardware — particularly GPUs and specialized accelerators — is optimized for high-throughput arithmetic but often struggles when constrained by memory bandwidth.
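To make the compute-bound vs. memory-bound distinction concrete, here is a rough roofline-style sketch of a single Sumcheck folding round. The folding rule f'(x) = f(0,x) + r * (f(1,x) - f(0,x)) is the standard multilinear Sumcheck table update; the per-entry operation count, the 32-byte field-element size, and the machine balance figure are illustrative assumptions rather than measurements.

```rust
fn main() {
    // Illustrative assumptions (not measured): a 256-bit prime field, so one
    // evaluation occupies 32 bytes, and one folding step
    //   f'(x) = f(0,x) + r * (f(1,x) - f(0,x))
    // costs roughly 1 multiplication and 2 additions per output entry.
    let bytes_per_elem = 32.0;
    let field_ops_per_output = 3.0;

    // Each output entry reads two inputs and writes one result.
    let bytes_moved_per_output = 3.0 * bytes_per_elem; // 96 bytes
    let arithmetic_intensity = field_ops_per_output / bytes_moved_per_output;

    // Hypothetical accelerator balance point: how many field ops it can
    // execute per byte of DRAM traffic before memory becomes the limit.
    // This figure is an assumption, chosen only for illustration.
    let machine_balance_ops_per_byte = 1.0;

    println!("arithmetic intensity: {:.3} field ops/byte", arithmetic_intensity);
    println!("machine balance:      {:.3} field ops/byte", machine_balance_ops_per_byte);
    if arithmetic_intensity < machine_balance_ops_per_byte {
        println!("=> the kernel is expected to be memory-bound");
    } else {
        println!("=> the kernel is expected to be compute-bound");
    }
}
```

Under these assumptions the fold performs only a handful of field operations for every ~100 bytes it moves, which is well below what a bandwidth-limited accelerator can keep fed, so the round spends its time waiting on memory rather than on arithmetic.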
This memory-bound behavior is especially evident in the use of Number Theoretic Transforms (NTTs), a fundamental building block for polynomial operations in Sumcheck. Although NTTs are highly parallelizable, they require frequent and structured access to large arrays of field elements, placing significant pressure on memory subsystems and cache hierarchies. As a result, even when the underlying arithmetic is relatively inexpensive, memory access overhead can dominate overall performance.
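To illustrate where that pressure comes from, below is a self-contained toy radix-2 NTT over the common NTT-friendly prime 998244353 (a sketch for exposition, not ICICLE's implementation). The arithmetic per butterfly is one modular multiplication and two additions, yet the bit-reversal shuffle and the stride-doubling butterfly stages stream the entire array through the cache hierarchy log2(n) times.

```rust
/// Toy in-place radix-2 NTT over p = 998244353 (= 119 * 2^23 + 1), whose
/// multiplicative group is generated by 3. Sketch only; not ICICLE code.
const P: u64 = 998_244_353;

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

fn ntt(a: &mut [u64]) {
    let n = a.len();
    assert!(n.is_power_of_two());

    // Bit-reversal permutation: a scattered, cache-unfriendly shuffle of the
    // whole array before any arithmetic happens.
    let mut j = 0usize;
    for i in 1..n {
        let mut bit = n >> 1;
        while j & bit != 0 {
            j ^= bit;
            bit >>= 1;
        }
        j |= bit;
        if i < j {
            a.swap(i, j);
        }
    }

    // Butterfly stages: stage `len` pairs elements len/2 apart, so the access
    // stride grows from 1 up to n/2, and every stage streams the full array
    // through the cache hierarchy again.
    let mut len = 2;
    while len <= n {
        let w_len = pow_mod(3, (P - 1) / len as u64); // primitive len-th root of unity
        for start in (0..n).step_by(len) {
            let mut w = 1u64;
            for k in 0..len / 2 {
                let u = a[start + k];
                let v = a[start + k + len / 2] * w % P;
                a[start + k] = (u + v) % P;
                a[start + k + len / 2] = (u + P - v) % P;
                w = w * w_len % P;
            }
        }
        len <<= 1;
    }
}

fn main() {
    // 2^20 elements: each butterfly is one mulmod and two addmods, but the
    // array is revisited end to end 20 times.
    let mut data: Vec<u64> = (0..(1u64 << 20)).map(|i| i % P).collect();
    ntt(&mut data);
    println!("first output element: {}", data[0]);
}
```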
Addressing this limitation calls for optimized memory layouts, hardware-aware scheduling, and a rethinking of system design that pushes hardware toward its computational limits rather than its memory ceilings.
This follow-up aims to empirically validate our earlier claim by profiling the Sumcheck implementations within ICICLE. Through this analysis, we offer concrete evidence that memory access is indeed the dominant bottleneck, and provide insights that can inform future optimization and hardware acceleration strategies for Sumcheck and similar protocols.
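As a sketch of the measurement methodology (not the ICICLE code itself), the micro-benchmark below times the full sequence of Sumcheck folding rounds over a large evaluation table and reports the effective memory bandwidth it sustains. It uses wrapping 64-bit arithmetic as a stand-in field and an arbitrary table size; a real profile would run ICICLE's field types on the target backend and compare the reported figure against that platform's peak DRAM bandwidth.

```rust
use std::time::Instant;

// A stand-in "field": 64-bit integers with wrapping arithmetic. This keeps the
// sketch dependency-free; an actual profile would use real field arithmetic
// and the accelerator backend under test.
fn fold_round(evals: &[u64], r: u64) -> Vec<u64> {
    let half = evals.len() / 2;
    let (lo, hi) = evals.split_at(half);
    // f'(x) = f(0,x) + r * (f(1,x) - f(0,x)): ~2 additions + 1 multiplication
    // per output, but 2 reads + 1 write of table entries per output.
    lo.iter()
        .zip(hi.iter())
        .map(|(&a, &b)| a.wrapping_add(r.wrapping_mul(b.wrapping_sub(a))))
        .collect()
}

fn main() {
    let log_n = 24; // 2^24 evaluations of the multilinear polynomial (arbitrary)
    let mut evals: Vec<u64> = (0..(1u64 << log_n))
        .map(|i| i.wrapping_mul(0x9E37_79B9))
        .collect();

    let mut r = 7u64;
    let mut bytes_moved = 0u64;
    let start = Instant::now();
    while evals.len() > 1 {
        // Each round reads the whole table and writes a half-size table.
        bytes_moved += (evals.len() as u64 + evals.len() as u64 / 2) * 8;
        evals = fold_round(&evals, r);
        r = r.wrapping_mul(r).wrapping_add(1); // arbitrary next challenge
    }
    let secs = start.elapsed().as_secs_f64();

    println!(
        "folded 2^{} evaluations in {:.3} s, effective bandwidth {:.2} GB/s",
        log_n,
        secs,
        bytes_moved as f64 / secs / 1e9
    );
}
```

If the reported figure sits close to the platform's peak memory bandwidth while the arithmetic units remain mostly idle, that is direct evidence for the memory-bound behavior discussed above.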
Follow Ingonyama
Twitter / X: https://twitter.com/Ingo_zk
YouTube: https://www.youtube.com/@ingo_zk
GitHub: https://github.com/ingonyama-zk
LinkedIn: https://www.linkedin.com/company/ingonyama
Join us: https://www.ingonyama.com/careers
Snark Chocolate: Spotify / Apple Podcasts