With the rising popularity of sumchecks in SNARKs, you are asked to improve your naive sumcheck prover implementation. Your code supports a multilinear polynomial of dimension n=28 over a 128-bit field, is based on VSBW13, and runs on an Nvidia L4 GPU.
As you look into recent research, you discover Blendy, which seems to offer a better time-space tradeoff. You implement Blendy for k=4 on the same GPU and, to your surprise, your old naive algorithm has better latency.
Explain why!
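For reference, here is a minimal CPU sketch of the kind of table-folding prover the "naive" implementation above refers to, for a multilinear polynomial given by its evaluations on the Boolean hypercube. It is an illustration under assumptions, not the actual GPU code: the modulus `P` is a toy 61-bit prime rather than a 128-bit field, the names `sumcheck_prove` and `sumcheck_verify` are hypothetical, the challenges are passed in up front instead of being drawn interactively, and the verifier's final oracle check is replaced by comparing against the prover's folded value.

```python
P = 2**61 - 1  # toy prime modulus (assumption); the puzzle uses a 128-bit field


def sumcheck_prove(evals, challenges):
    """Table-folding prover: keeps the whole table and halves it each round."""
    table = [e % P for e in evals]
    messages = []
    for r in challenges:
        half = len(table) // 2
        lo, hi = table[:half], table[half:]  # current variable = 0 / = 1
        # The round polynomial is linear, so g(0) and g(1) determine it.
        messages.append((sum(lo) % P, sum(hi) % P))
        # Bind the current variable to r: (1 - r) * lo + r * hi.
        table = [(a + r * (b - a)) % P for a, b in zip(lo, hi)]
    return messages, table[0]


def sumcheck_verify(claim, messages, challenges, final_eval):
    """Checks each round message against the running claim."""
    for (g0, g1), r in zip(messages, challenges):
        if (g0 + g1) % P != claim % P:
            return False
        claim = (g0 + r * (g1 - g0)) % P  # evaluate the linear g at r
    return claim == final_eval % P


if __name__ == "__main__":
    evals = [3, 1, 4, 1, 5, 9, 2, 6]  # n = 3 variables, 2**3 evaluations
    challenges = [7, 11, 13]
    msgs, final = sumcheck_prove(evals, challenges)
    print(sumcheck_verify(sum(evals) % P, msgs, challenges, final))
```

The table starts with 2^n entries and shrinks by half after every round, so the sketch touches each stored value a constant number of times.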
To apply, please send your answer to hr@ingonyama.com