ICICLE-Snark: The Fastest Groth16 Implementation in the World

Published on: Mar 18, 2025

ICICLE-snark is now the fastest Groth16 prover implementation, delivering record-breaking performance that unlocks new possibilities for ZK applications. This note contains benchmarks, a usage guide, and other treats.

Intro

Zero-knowledge proofs have rapidly advanced in recent years, with Groth16 emerging as one of the most widely used proving systems with characteristics ideal for blockchains:

  • Verification requires only three pairings, making it one of the fastest ZK proofs to verify.
  • It generates constant-size proofs, in just three group elements, which reduces on-chain storage and transaction costs compared to other proving systems.
  • While the prover side is computationally heavy, hardware acceleration dramatically speeds it up, making Groth16 feasible even for high-throughput applications.

The biggest trade-off of Groth16 is that it requires a trusted setup per circuit. Despite this disadvantage, Groth16 remains popular, and its properties make it well suited to blockchain applications and beyond: a single proof can attest to the correctness of a complex computation without revealing inputs, and anyone can verify it with only three pairings.

How Groth16 Works

Generating a Groth16 proof requires intensive arithmetic over large prime fields and elliptic curves. In this post we won't delve into QAP and R1CS. The rest of the prover can be divided into two steps: evaluating the quotient and computing the proof. Together these two steps contain 3 IFFTs, 3 FFTs, and 5 MSMs, plus element-wise multiplication and subtraction. Two primitives dominate the proving process: multi-scalar multiplication (MSM) and Fast Fourier Transform-based polynomial evaluation (FFT/NTT). MSM can consume ~60% of proving time and FFT ~30%.
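To make the two primitives concrete, here is a minimal sketch of what an NTT and an MSM compute. It is a toy: the field F_17, the root of unity 4, and the use of integers mod 101 as a stand-in for elliptic-curve points are all assumptions chosen for readability, and the function names are hypothetical. Real provers work over a ~254-bit field and use O(n log n) NTTs and bucket-based MSMs rather than these naive loops.

```rust
// Toy parameters (assumptions for illustration, not the field Groth16 uses).
const P: u64 = 17; // small prime field F_17
const OMEGA: u64 = 4; // primitive 4th root of unity mod 17 (4^4 = 256 = 1 mod 17)

/// Modular exponentiation by repeated squaring.
fn pow_mod(mut base: u64, mut exp: u64, modulus: u64) -> u64 {
    let mut acc = 1;
    base %= modulus;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % modulus;
        }
        base = base * base % modulus;
        exp >>= 1;
    }
    acc
}

/// Naive O(n^2) NTT: evaluates the polynomial with the given coefficients
/// at root^0, root^1, ..., root^(n-1).
fn ntt_naive(coeffs: &[u64], root: u64) -> Vec<u64> {
    let n = coeffs.len() as u64;
    (0..n)
        .map(|i| {
            coeffs.iter().enumerate().fold(0, |acc, (j, &c)| {
                (acc + c * pow_mod(root, i * j as u64, P)) % P
            })
        })
        .collect()
}

/// Inverse NTT: the same transform with root^-1, scaled by n^-1.
fn intt_naive(evals: &[u64], root: u64) -> Vec<u64> {
    let n = evals.len() as u64;
    let root_inv = pow_mod(root, P - 2, P); // inverse via Fermat's little theorem
    let n_inv = pow_mod(n, P - 2, P);
    ntt_naive(evals, root_inv)
        .into_iter()
        .map(|v| v * n_inv % P)
        .collect()
}

/// Naive MSM: the sum of scalar_i * point_i. The "points" are elements of the
/// additive group of integers mod q, standing in for elliptic-curve points.
fn msm_naive(scalars: &[u64], points: &[u64], q: u64) -> u64 {
    scalars
        .iter()
        .zip(points)
        .fold(0, |acc, (&s, &g)| (acc + s % q * (g % q)) % q)
}

fn main() {
    let coeffs = vec![1, 2, 3, 4];
    let evals = ntt_naive(&coeffs, OMEGA);
    // A forward transform followed by the inverse recovers the coefficients.
    assert_eq!(intt_naive(&evals, OMEGA), coeffs);

    let msm = msm_naive(&[2, 3, 5], &[7, 11, 13], 101);
    println!("evals = {:?}, msm = {}", evals, msm);
}
```

Both loops do the same independent work for every output element, which is why these primitives map well onto GPUs.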

The prover needs two files: witness and zkey.

  • The witness file contains the intermediate values that the circuit produces for a given input, including both public and private inputs. It is unique per input.
  • The zkey file combines the circuit, the trusted setup, and the proving and verification keys. It is generated during the trusted setup phase and is unique per circuit.

The State of Existing Groth16 Implementations

There are currently many Groth16 implementations. The most popular ones are:

  • SnarkJS, which focuses on accessibility and ease of use. It provides developers with 3 different zkSNARK proof systems as well as setup and verification commands.
  • Rapidsnark, a prover written in C++ and assembly. It is approximately 4–10x faster than the snarkjs implementation.
  • Arkworks
  • Lambdaworks
  • Gnark

Breaking the Speed Barrier with ICICLE

ICICLE is a cutting-edge cryptography library engineered to accelerate advanced algorithms and protocols — starting with ZKPs — across diverse compute backends, including GPUs, CPUs, Metal, and more. It supports multiple frontends in C++, Rust, and Go, allowing seamless integration across different development environments. With a single codebase, you can leverage three popular languages and run on multiple hardware platforms.

Accelerating Groth16

We first approached this problem by integrating only MSM and NTT into an existing prover. However, because of data transfer and data conversion overhead, we couldn't reach the performance we wanted. As a solution, we decided to build a Groth16 prover from scratch using ICICLE. Our reference code was the snarkjs repository built by the iden3 team. This allowed us to build an optimized prover that works with existing zkey files, so anyone can easily switch to ICICLE-snark from snarkjs or rapidsnark. The primitives explained in the “How Groth16 Works” section are already implemented and well optimized in the ICICLE library.

Another problem with existing tools is that they are designed to be used as CLIs, so the process exits after every proof. If you prove the same circuit multiple times, there are many values you could cache and reuse, such as the proving key, the bases, and the NTT domain, but a CLI recomputes them on every run. To take advantage of caching, we developed our prover as a background worker and a Rust library.

Data Conversion and Data Transfer

One of the biggest problems with moving any code to the GPU is data transfer and conversion. To use ICICLE with CUDA you need to convert your data to ICICLE types and move it to the device. In most cases, preparing data for a CUDA kernel takes much more time than executing the kernel itself.

Data conversion can be slow when it calls `from` functions 2^n times, once per element. In most cases it is possible, and better, to reinterpret the memory directly with `transmute` calls in Rust. This changes the type of existing memory without copying it, provided the developer can guarantee that the two types have compatible layouts. The conversion cost practically disappears: a `transmute` call takes only 20–30 ns, while the same conversion takes 100 ms when done naively.
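The difference between the two approaches can be sketched as follows. The types `ParsedScalar` and `DeviceScalar` are hypothetical stand-ins, not ICICLE's actual types, and the sketch assumes both hold the same four 64-bit limbs in the same order and representation. It uses a pointer cast through `std::slice::from_raw_parts`, which is the form the standard library documentation recommends for slices; the idea is the same as a `transmute`.

```rust
use std::mem::{align_of, size_of};
use std::time::Instant;

/// Hypothetical scalar type as produced by a file parser (4 x 64-bit limbs).
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(transparent)]
struct ParsedScalar([u64; 4]);

/// Hypothetical scalar type expected by the GPU library, with the same layout.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(transparent)]
struct DeviceScalar([u64; 4]);

impl From<ParsedScalar> for DeviceScalar {
    fn from(s: ParsedScalar) -> Self {
        DeviceScalar(s.0)
    }
}

// Compile-time checks that the reinterpretation below is layout-compatible.
const _: () = assert!(size_of::<ParsedScalar>() == size_of::<DeviceScalar>());
const _: () = assert!(align_of::<ParsedScalar>() == align_of::<DeviceScalar>());

/// Naive conversion: one `from` call per element plus a fresh allocation.
fn convert_naive(input: &[ParsedScalar]) -> Vec<DeviceScalar> {
    input.iter().map(|&s| DeviceScalar::from(s)).collect()
}

/// Zero-copy conversion: view the same memory as the other type.
fn convert_zero_copy(input: &[ParsedScalar]) -> &[DeviceScalar] {
    // SAFETY: both types are `repr(transparent)` wrappers around `[u64; 4]`,
    // so size, alignment, and valid bit patterns are identical, and the
    // returned slice borrows `input` for the same lifetime.
    unsafe { std::slice::from_raw_parts(input.as_ptr().cast::<DeviceScalar>(), input.len()) }
}

fn main() {
    let parsed: Vec<ParsedScalar> = (0..1u64 << 16)
        .map(|i| ParsedScalar([i, 0, 0, 0]))
        .collect();

    let timer = Instant::now();
    let copied = convert_naive(&parsed);
    let naive_time = timer.elapsed();

    let timer = Instant::now();
    let viewed = convert_zero_copy(&parsed);
    let zero_copy_time = timer.elapsed();

    // Same values either way; only the cost differs.
    assert_eq!(viewed, copied.as_slice());
    println!("naive: {:?}, zero-copy: {:?}", naive_time, zero_copy_time);
}
```

The zero-copy version does constant work regardless of the input length, but it is only sound when the layouts really match, which is why the layout checks are done at compile time here.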

Data transfer speed is limited by your hardware, so it's crucial to avoid transfers wherever possible. This was our initial motivation for building Groth16 from scratch instead of simply replacing the MSM and NTT calls in existing implementations.

MSM and NTT

MSM and NTT are cryptographic primitives used in many proving systems, such as Groth16, Hyrax, Plonk, and Libra. ICICLE provides fast implementations for multiple backends (CPU, CUDA, and Metal, with more upcoming). Replacing the 5 MSMs with ICICLE MSM gave a 63x improvement on average (size=2²²), and replacing the 3 IFFTs and 3 FFTs with the ICICLE FFT API gave a 320x improvement on average (size=2²²).

VecOps

There are many places where we need to multiply and subtract two vectors element-wise. These tasks are highly parallelizable on a GPU. Calling the VecOps API in ICICLE instead of using the CPU to process long arrays gave a 200x boost on average (size=2²²).
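As a rough illustration of why these operations parallelize so well, the sketch below computes a[i] * b[i] - c[i] over a toy prime field, first sequentially and then split across CPU threads. The modulus, the function names, and the thread-based splitting are assumptions for illustration only; this is not the ICICLE VecOps API, which runs the same kind of per-element work on the device.

```rust
// Toy 31-bit prime (15 * 2^27 + 1); an assumption, not the field Groth16 uses.
const P: u64 = 2_013_265_921;

/// One output element: x * y - z (mod P).
fn mul_sub_one(x: u64, y: u64, z: u64) -> u64 {
    (x % P * (y % P) % P + P - z % P) % P
}

/// Sequential element-wise a[i] * b[i] - c[i] (mod P).
fn mul_sub(a: &[u64], b: &[u64], c: &[u64]) -> Vec<u64> {
    assert!(a.len() == b.len() && b.len() == c.len());
    a.iter()
        .zip(b)
        .zip(c)
        .map(|((&x, &y), &z)| mul_sub_one(x, y, z))
        .collect()
}

/// The same computation split into independent chunks, one per thread.
fn mul_sub_parallel(a: &[u64], b: &[u64], c: &[u64], threads: usize) -> Vec<u64> {
    assert!(threads > 0);
    assert!(a.len() == b.len() && b.len() == c.len());
    let mut out = vec![0u64; a.len()];
    let chunk = ((a.len() + threads - 1) / threads).max(1);
    std::thread::scope(|s| {
        for (i, out_chunk) in out.chunks_mut(chunk).enumerate() {
            let start = i * chunk;
            let end = start + out_chunk.len();
            let (a, b, c) = (&a[start..end], &b[start..end], &c[start..end]);
            // No element depends on any other, so chunks need no synchronization.
            s.spawn(move || {
                for (k, o) in out_chunk.iter_mut().enumerate() {
                    *o = mul_sub_one(a[k], b[k], c[k]);
                }
            });
        }
    });
    out
}

fn main() {
    let n = 1 << 12;
    let a: Vec<u64> = (0..n).collect();
    let b: Vec<u64> = (0..n).map(|i| i + 1).collect();
    let c: Vec<u64> = (0..n).map(|i| 2 * i).collect();
    assert_eq!(mul_sub(&a, &b, &c), mul_sub_parallel(&a, &b, &c, 4));
    println!("sequential and parallel results match for {} elements", n);
}
```

A GPU takes the same idea much further by assigning each element, rather than each chunk, to its own thread.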

Cache

In many applications (like a proving service or an L2 rollup), you'll reuse the same circuit and proving key for many proofs. ICICLE takes advantage of this by keeping the data in device memory and caching computations. To do so we introduced CacheManager. The first time it sees a zkey file, it computes the values that depend on that file and stores them keyed by the zkey file. If the values have already been computed, the prover reuses them instead of computing them again.
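The compute-once, reuse-afterwards pattern described above can be sketched as follows. This is not the actual CacheManager implementation: `ZkeyCache`, `ProverData`, and `load_prover_data` are hypothetical names, and the "expensive" load is a placeholder that does no file I/O.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for everything derived from a zkey file
/// (proving key, bases, NTT domain, ...).
#[derive(Debug)]
struct ProverData {
    domain_size: usize,
}

/// Placeholder for the expensive step. A real prover would parse the zkey
/// and upload data to the device; here we fake a result from the path.
fn load_prover_data(zkey: &Path) -> ProverData {
    let log_n = 10 + zkey.as_os_str().len() % 4;
    ProverData { domain_size: 1 << log_n }
}

/// Cache keyed by zkey path: compute on first use, reuse afterwards.
#[derive(Default)]
struct ZkeyCache {
    entries: HashMap<PathBuf, ProverData>,
    loads: usize, // how many times the expensive step actually ran
}

impl ZkeyCache {
    fn get_or_load(&mut self, zkey: &Path) -> &ProverData {
        if !self.entries.contains_key(zkey) {
            self.loads += 1;
            let data = load_prover_data(zkey);
            self.entries.insert(zkey.to_path_buf(), data);
        }
        &self.entries[zkey]
    }
}

fn main() {
    let mut cache = ZkeyCache::default();
    let zkey = Path::new("circuit_final.zkey");

    // The first proof pays for the load; later proofs reuse the cached data.
    for round in 0..3 {
        let data = cache.get_or_load(zkey);
        println!("proof {}: domain size {}", round, data.domain_size);
    }
    assert_eq!(cache.loads, 1);
}
```

Keying by path is the simplest choice for a sketch; a long-running service might prefer a content hash so that a replaced zkey file invalidates its entry.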

[Figure: cache improvement]

Performance Benchmarks & Results

We benchmarked the code on 2 different setups:

  • NVIDIA RTX 4080 GPU & Intel Core i9-13900K CPU
  • NVIDIA RTX 4090 GPU & AMD Ryzen 9 9950X CPU

We used the circuits in MoPro's benchmark repository to compare the proving systems.

  • Complex Circuits: These circuits are for pure benchmarking purposes. They allow us to compare the performance of the provers as the number of constraints grows.
  • Anon Aadhaar: Anon Aadhaar is a zero-knowledge protocol that allows Aadhaar ID owners to prove their identity in a privacy preserving way.
  • Aptos Keyless: Aptos Keyless lets users create self-custodial Aptos accounts with OIDC credentials (e.g., Google, Apple) instead of secret keys or mnemonics.

As mentioned before, a production project is likely to prove the same circuits repeatedly, and our cache system takes advantage of this. However, the other tools we compare against are CLIs, so they terminate after a single proof. To keep things fair, we provide benchmarks both with and without the cache.

How to Integrate ICICLE

You can now start integrating ICICLE-snark into your codebase. Depending on your codebase, you have two options.

Use It in a Rust Project

If your codebase is written in Rust, the best practice is to use the prover as a Rust crate. Add it to your dependencies by calling `cargo add icicle-snark`. After that, import the proving function, create a CacheManager instance, and start proving by providing the paths of the witness and zkey files.

use icicle_snark::{groth16_prove, CacheManager};

fn main() {
    let mut cache_manager = CacheManager::default();
    
    let witness = "witness.wtns";
    let zkey = "circuit_final.zkey";
    let proof = "proof.json";
    let public = "public.json";
    let device = "CUDA"; //replace with CPU or METAL if needed
        
    groth16_prove(
        &witness, 
        &zkey, 
        &proof, 
        &public, 
        device, 
        &mut cache_manager
    ).unwrap();
}

Use It in Other Programming Languages

If your codebase is written in a different programming language, you can still use the ICICLE prover. In that case, fetch the repository, run the prover in `worker` mode, and communicate with it from your own codebase. We shared an example of how to do that under the ICICLE-snark/examples directory.

ICICLE Grant Program

We’re here to support innovative projects using ICICLE. If you have a unique idea, let’s explore the possibility of a grant to accelerate your work.

Choose any research paper, identify an algorithm with reported benchmarks, and re-implement it using ICICLE. The more you improve upon the original implementation, the larger the grant you’ll receive.

👉 ICICLE Developer Docs

Follow Ingonyama

Twitter / X: https://twitter.com/Ingo_zk

YouTube: https://www.youtube.com/@ingo_zk

GitHub: https://github.com/ingonyama-zk

LinkedIn: https://www.linkedin.com/company/ingonyama

Join us: https://www.ingonyama.com/career

Snark Chocolate: Spotify / Apple Podcasts
