TL;DR:
- We (still) have the best MSM hardware design
- It doesn’t matter
(Closing the gap) from PipeMSM to ZPrize
ZPrize, a competition dedicated to ZK acceleration, ran from early June 2022 to early October 2022. While we didn’t participate in the competition (note to ZPrize lords: please make a v2 so we can join!), mid-competition we released our MSM hardware design and algorithms in the form of an academic paper (PipeMSM). According to Google Scholar, the paper has been cited 4 times so far.
We were delighted to learn that our ideas inspired and were even used by some of the competitors.
One thing to note is that PipeMSM was implemented on the U55C FPGA, while the MSM track of the FPGA competition targeted the AWS F1 card. The U55C has more compute resources, so one might argue that the results from the paper cannot be compared to the ZPrize winners. To address that, a month after the paper came out we released “Cloud-ZK”, the PipeMSM design implemented for the F1.
As a side note, Cloud-ZK is dev friendly and we keep improving it — check out how it was used for accelerating Groth16 in Sparkworks or visit the repo!
Back to the topic: Let’s compare PipeMSM to the FPGA MSM track winner, Hardcaml:
A little less than 6 [sec]. If we extrapolate the PipeMSM results to 2²⁶, assuming an approximately linear trend (i.e. the latency of 2²³ is 2x the latency of 2²², etc.), we get 17.4 [sec]. That is almost a 3x gap. Let’s now close it:
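For clarity, here is that extrapolation spelled out in a few lines of Python. The 2²² latency is backed out from the 17.4 [sec] figure rather than quoted directly from the paper, so treat it as illustrative:

```python
# Linear extrapolation: doubling the MSM size doubles the latency.
def extrapolate(measured_log2_size, measured_latency_s, target_log2_size):
    return measured_latency_s * 2 ** (target_log2_size - measured_log2_size)

implied_latency_2_22 = 17.4 / 16                    # ~1.09 s, implied by the 17.4 s figure above
print(extrapolate(22, implied_latency_2_22, 26))    # ~17.4 s, vs. ~5.9 s for the ZPrize winner
```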
In PipeMSM, page 9, at the bottom of the first column, we suggest future improvements. The first is going up in frequency: Hardcaml ran at 278MHz while PipeMSM operates at 125MHz, approximately a 2x speed-up. So the first thing we aimed to do was double our frequency — while this was far from trivial and involved some kick-ass engineering, we made it happen.
The second suggestion we make in the paper is to add more EC-Adders. As a reminder, the EC-Adder is the workhorse of the MSM design — everything else basically serves a single mighty EC-Adder. If we can find a way to squeeze in a second EC-Adder, we can double the performance yet again. Doing so would bring us to 4x the original PipeMSM paper and to a total latency of 4.35 [sec], faster than the 5.9 [sec] of the ZPrize winner. However, even before doing that, we discovered we could improve the utilization of the EC-Adder. In PipeMSM, the EC-Adder was busy only 70% of the time, a result of a sub-optimal scheduler and a penalty of several clock cycles whenever we skipped to the next point. After fixing this and reaching almost 100% utilization, we added a simple precomputation algorithm, trading off some of the EC-Adder work for memory: the trick is that the results fetched from memory are actually precomputed. This made the EC-Adder utilization sub-optimal again, so we went through another iteration of utilization optimization. Altogether, we made the implementation 1.5x faster!
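To make the precomputation idea concrete, here is a toy Python sketch of one common flavor of it: precomputing the 2^(j·c) multiples of each base point so that every scalar window shares the same buckets. Integers modulo a small prime stand in for curve points, and this is an illustration of the trade-off, not a description of our exact RTL:

```python
# Toy illustration of trading EC-Adder work for memory via precomputation.
# The bucket logic only needs an additive group, so integers mod a prime
# stand in for elliptic-curve points here.

P = 2**13 - 1  # toy group order

def msm_with_precompute(scalars, points, c=4, scalar_bits=16):
    num_windows = (scalar_bits + c - 1) // c
    # Memory cost: num_windows precomputed multiples 2^(j*c) * P_i per base point.
    pre = [[(p * (1 << (j * c))) % P for j in range(num_windows)] for p in points]

    buckets = [0] * (1 << c)  # a single bucket set shared by all windows
    for s, row in zip(scalars, pre):
        for j in range(num_windows):
            digit = (s >> (j * c)) & ((1 << c) - 1)
            if digit:
                buckets[digit] = (buckets[digit] + row[j]) % P  # one "EC add"

    # Standard Pippenger running-sum reduction: sum over d of d * buckets[d].
    acc = total = 0
    for b in reversed(buckets[1:]):
        acc = (acc + b) % P
        total = (total + acc) % P
    return total

scalars, points = [1234, 5678, 910], [3, 7, 11]
assert msm_with_precompute(scalars, points) == sum(s * p for s, p in zip(scalars, points)) % P
```

In this variant the cross-window aggregation of vanilla Pippenger disappears because every window indexes the same buckets; the price is the memory holding the precomputed multiples, which is exactly the kind of trade-off described above.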
Combined with the frequency doubling, we ended up with exactly 3x, closing the gap to the ZPrize winner.
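A quick back-of-the-envelope check, using only the numbers already quoted above:

```python
baseline_2_26 = 17.4               # s, original PipeMSM extrapolated to 2^26
freq_gain = 2.0                    # clock doubled
sched_and_precompute_gain = 1.5    # scheduler, utilization and precomputation improvements
print(baseline_2_26 / (freq_gain * sched_and_precompute_gain))   # ~5.8 s, on par with ~5.9 s
```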
(Closing the gap) from SOTA to ZPrize
Our design is not unique. There are more similarities than differences among all the MSM hardware designs we reviewed — for instance, all are based on the Pippenger algorithm. In fact, a large common ground can also be found in GPU implementations. We project that unless a completely new algorithm is found to replace Pippenger, we are likely to see only incremental progress in MSM design and implementation.
In striking contrast to this analysis, the current state-of-the-art MSM takes 900ms for a 2³⁰ MSM (Cysic).
That is 281x the original PipeMSM paper, or 90x the ZPrize winner and the updated PipeMSM above. Yeah, no typo. Yes, still Pippenger, but 90x.
To explain this gap, it is critical to understand that we are not comparing apples-to-apples in the slide above!
First, in another slide from the same deck we observe an illustration of the system under test:
There is no way to know exactly how many FPGAs are in this server, but it is clear that there are multiple cards and each card has multiple FPGAs on it. ZPrize results are measured, as mentioned before, on a single AWS F1 card (a single FPGA). Add to that:
1. From an earlier slide we can assume that the FPGAs used for the 90x measurement are a model (VU13P) with many more resources than the F1’s FPGA (1.46x bigger, to be precise).
2. AWS F1 occupies 25% of the FPGA resources with a special shell (F1 lives in the cloud, so the shell includes a lot of security logic)¹
3. AWS F1 has a strict maximum power limit of 85W. FPGAs are excellent low-power accelerators and we work hard to always maintain a low power level. There is no such limitation for the 90x server.
The main missing piece is how many VU13P FPGA chips in total are utilized in the server. If we knew this number, we could normalize the results for a valid comparison as follows: first, use linear scaling (explained in a moment) to scale down from X FPGA chips to 1 — that is, if X=90, divide the 90x result by 90 to get 1x. Second, normalize the amount of resources to the amount available on the F1 (ignoring power considerations for now).
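If X ever becomes public, the normalization would look something like the Python sketch below, relying on the linear-scaling assumption explained right after it. The 1.46x and 25% figures come from the list above; the value of X in the example is made up purely for illustration:

```python
def normalize_to_single_f1(speedup_vs_f1, num_chips,
                           chip_vs_f1_resources=1.46,   # VU13P vs. the F1's FPGA
                           f1_shell_overhead=0.25):     # shell's share of the F1's resources
    per_chip = speedup_vs_f1 / num_chips                               # step 1: linear scaling
    resource_ratio = chip_vs_f1_resources / (1 - f1_shell_overhead)    # step 2: usable-logic ratio
    return per_chip / resource_ratio

# Purely hypothetical example: if the server held 32 chips, the 90x headline becomes
print(normalize_to_single_f1(90, num_chips=32))   # ~1.4x per F1-equivalent FPGA
```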
Linear scaling: MSM is an (almost) embarrassingly parallel problem, which is part of why we get such fantastic results running it on GPUs. For hardware design, it means that linear scaling in the number of resources is the best we can hope for. Concretely, if we have two FPGAs, the best thing we can do is divide the MSM in half and run each half on a dedicated FPGA; with 4 FPGAs, divide the MSM into 4, and so on. If we add a high-bandwidth interconnect between the FPGAs we might squeeze out a little more performance. Take the mighty EC-Adder mentioned before: say we can fit 1.5 EC-Adders in one FPGA; we can’t do much with the extra 0.5, so in a way we are under-utilizing the FPGA. If we had a couple of FPGAs with a fast interconnect, we might be able to treat them as one FPGA with double the resources and fit 3 EC-Adders instead of just 2 in the case of independent FPGAs. This is still an O(1) improvement — no magic, just physics².
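In code, linear scaling is just this split-and-sum structure (again using a toy additive group as in the earlier sketch; the final aggregation is a handful of additions and is negligible):

```python
P = 2**13 - 1  # toy group order, standing in for curve points

def msm(scalars, points):
    return sum(s * p for s, p in zip(scalars, points)) % P

def msm_on_k_fpgas(scalars, points, k):
    chunk = (len(points) + k - 1) // k
    partials = [msm(scalars[i:i + chunk], points[i:i + chunk])  # each chunk on its own FPGA
                for i in range(0, len(points), chunk)]
    return sum(partials) % P                                    # cheap final aggregation

scalars, points = [5, 9, 2, 7], [3, 11, 6, 8]
assert msm_on_k_fpgas(scalars, points, k=2) == msm(scalars, points)
```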
To conclude, since the number X is a commercial secret we cannot do the apples-to-apples comparison. Therefore, we will just have to take the designer’s word for it:
¹Shell is like the OS of the FPGA
²Our analysis is design dependent
Fast Forward — I have a killer MSM design, now what?
MSM is to ZK what matrix multiplication is to AI.
Even if we had a way to push the cost and time of MSM to nil (and we are not far from that today), it would not be enough to accelerate a ZK prover by the orders of magnitude we need in order to enable new verticals. Simply replacing calls to MSM with calls to specialized hardware will not bring even one order of magnitude improvement over a CPU. The real challenge lies in taking ZK to hardware end-to-end: managing the data-flow, the memory, the compute, I/O, power, and more. An MSM IP is merely one tool used to achieve a bigger goal.
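A quick Amdahl-style calculation makes the point. The MSM share of prover time varies widely across proof systems and circuits, so the fractions below are illustrative assumptions rather than measurements:

```python
def prover_speedup(msm_fraction, msm_speedup):
    """End-to-end speedup when only the MSM portion is accelerated."""
    return 1 / ((1 - msm_fraction) + msm_fraction / msm_speedup)

for f in (0.6, 0.7, 0.8):                        # assumed MSM share of total prover time
    print(f, round(prover_speedup(f, 1e9), 1))   # even a "free" MSM gives only 2.5x, 3.3x, 5x
```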
Another analogy to AI, perhaps more speculative, is that the norm will be for GPUs to thrive while most ASICs fail — such is the nature of a fast-moving industry and of extremely powerful GPUs. In most cases today (ZPrize, for example), MSM on GPU is not just faster than MSM on FPGA, it is potentially cheaper, considering the low cost of GPU hardware compared to FPGAs. It takes a lot of skill and engineering to find the right places for FPGAs in our industry. Usually these are the more mature ecosystems, where the acceleration targets a problem that is more complex than just MSM (yet well defined).
Where to with MSMs
MSM designs are table stakes for zk hardware acceleration. We expect that following the lead of ZPrize, new ideas and implementations will be brought to the open-source, public stage where they can be shaped, re-shaped and adjusted by users, among them zk hardware companies, to solve real world ZK infrastructure problems.
There is still room for innovation with MSM in hardware (ZPrize 2, baby!) — for one, MSMs come in many sizes and fields (hey, even extension fields! Talk to us about G2 MSM), which can benefit from different algorithms and improvements.
For us, the next update in the pipeline is, as mentioned before, adding a second EC-Adder — projected to improve latency by another 2x. We will update our Cloud-ZK image for projects that want to use this MSM on AWS, and we will update this blog post whenever we make significant changes to our design. Most importantly — we will go fully open source with our MSM source code.
Exciting times ahead!
Acknowledgements:
We would like to thank Michael, Oren, Leon, Tony, Danny, Stas, and Elan for insightful discussions and comments.