TL;DR:
The goal of this project was to add a CUDA backend to Starkware's Stwo prover without introducing any changes to Stwo's frontend design.
Here we outline the progress of that integration, share performance metrics, and look ahead to future enhancements. Across numerous CPU/GPU configurations, ICICLE-Stwo demonstrates a 3.25x–7x performance improvement over Starkware's heavily optimized SIMD backend. The ICICLE-Stwo repository is fully open source and functions as a drop-in replacement for the standard Stwo prover.
From Circle STARK to Stwo Acceleration in ICICLE
Our collaboration with Starkware took off in 2023. We first announced our strategic partnership around Stwo prover acceleration in August 2024.
Stwo is a new-generation prover developed by Starkware, based on the Circle STARK breakthrough. As a first step towards GPU acceleration, we needed to extend the ICICLE library with support for M31-based polynomial arithmetic.

Circle STARK brings several new primitives that we had to adapt for GPU usage, including the Discrete Circle-Curve Transform (DCCT), which acts similarly to an NTT.
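For context, here is the algebraic setting in a nutshell (standard Circle STARK background, stated in general notation and not tied to our code):

p = 2^{31} - 1 \quad (\text{the M31 Mersenne prime})
C(\mathbb{F}_p) = \{ (x, y) \in \mathbb{F}_p^2 : x^2 + y^2 = 1 \}
(x_1, y_1) \cdot (x_2, y_2) = (x_1 x_2 - y_1 y_2, \; x_1 y_2 + y_1 x_2)
|C(\mathbb{F}_p)| = p + 1 = 2^{31}

Because the circle group has power-of-two order, radix-2, FFT-style recursions such as the DCCT are available over M31 even though p - 1 itself has almost no power-of-two factors.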

Equipped with these assets, we moved on to integration.
Early Integration: Straw-Man Solution
When we started integrating ICICLE into Stwo, the strategy was to follow ICICLE's methodology and replace Stwo's existing mechanisms with ICICLE constructs. One of the main challenges of any CPU-GPU integration is the non-unified memory architecture, which often leads to time wasted copying data back and forth between the two platforms. A related problem is the need to translate data when moving between platforms, due to redundant abstractions commonly used in CPU-side software. Our first attempt failed miserably: more than 90% of the runtime was spent moving data between host and device.
Building on Stwo’s Backend Trait
Stwo is already built around Rust backend traits, with CPU and SIMD implementations, so we decided to tackle the task again within the constraints of this backend interface. Making changes outside the backend would have worked, but it would also have changed the Stwo framework completely, resulting in two independent implementations. We therefore worked under a strict condition: only Stwo backend structures may be affected by the integration, and all frontend code remains unchanged.
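To make this concrete, below is a minimal, self-contained sketch of the trait-based backend swap. The names (ColumnOps, Backend, CpuBackend, IcicleBackend, DeviceSlice) are simplified stand-ins chosen for illustration; they mirror the shape of Stwo's backend traits rather than reproducing their exact signatures.

// Illustrative sketch: the prover frontend is generic over a backend trait,
// so a GPU-resident backend can be dropped in without touching frontend code.
trait ColumnOps {
    type Column;
    fn zeros(len: usize) -> Self::Column;
    fn len(col: &Self::Column) -> usize;
}

// In the real prover this bundles many more operations
// (polynomial ops, quotients, FRI, Merkle hashing, ...).
trait Backend: ColumnOps {}

// CPU reference backend: columns live in host memory.
struct CpuBackend;
impl ColumnOps for CpuBackend {
    type Column = Vec<u32>; // M31 elements as u32 (illustrative)
    fn zeros(len: usize) -> Self::Column { vec![0; len] }
    fn len(col: &Self::Column) -> usize { col.len() }
}
impl Backend for CpuBackend {}

// GPU backend: the column type wraps a device allocation, so backend
// functions operate on device memory without any frontend changes.
struct DeviceSlice { len: usize /* device pointer omitted in this sketch */ }
struct IcicleBackend;
impl ColumnOps for IcicleBackend {
    type Column = DeviceSlice;
    fn zeros(len: usize) -> Self::Column { DeviceSlice { len } }
    fn len(col: &Self::Column) -> usize { col.len }
}
impl Backend for IcicleBackend {}

// Frontend code is written once, generically over the backend.
fn frontend_routine<B: Backend>(n: usize) -> usize {
    let col = B::zeros(n);
    B::len(&col)
}

fn main() {
    assert_eq!(frontend_routine::<CpuBackend>(1 << 10), 1 << 10);
    assert_eq!(frontend_routine::<IcicleBackend>(1 << 10), 1 << 10);
}

The key point is that frontend_routine never changes when the backend does, which is exactly the constraint we imposed on the integration.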
Our implementation is based on the example:
test_wide_fib_prove_with_blake() in “crates/prover/src/examples/wide_fibonacci/mod.rs”.
The example generates a trace, adds it to an instance of CommitmentSchemeProver, and extends and commits to it. The CommitmentSchemeProver instance, together with an instance of a Fiat-Shamir (FS) channel and a description of the trace components' constraints, is passed to the prove() function to produce a proof.
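Schematically, and with illustrative stub types standing in for Stwo's real CommitmentSchemeProver, channel, component, and proof types (the signatures below are paraphrased for readability, not copied from Stwo), the flow is:

// Sketch of the example's flow with placeholder types.
struct Trace(Vec<Vec<u32>>);                       // columns of M31 elements
struct CommitmentSchemeProver { committed: Vec<Trace> }
struct Channel;                                    // Fiat-Shamir transcript
struct Components;                                 // constraint description
struct Proof;

impl CommitmentSchemeProver {
    fn new() -> Self { Self { committed: Vec::new() } }
    // Extend the trace to the evaluation domain and commit to it.
    fn commit(&mut self, trace: Trace, _channel: &mut Channel) {
        self.committed.push(trace);
    }
}

// Constraint evaluation, FRI and decommitment happen inside prove().
fn prove(_components: &Components, _channel: &mut Channel,
         _scheme: CommitmentSchemeProver) -> Proof {
    Proof
}

fn main() {
    let trace = Trace(vec![vec![0u32; 1 << 16]; 128]); // generate the trace
    let mut channel = Channel;
    let mut scheme = CommitmentSchemeProver::new();
    scheme.commit(trace, &mut channel);                // extend + commit
    let _proof = prove(&Components, &mut channel, scheme);
}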
Prior to proving, we generate the trace. The trace is a collection of columns of M31 elements and is typically one of the largest data structures involved: it is required for constructing the proof, but it is not part of the final proof output. As such, it makes the most sense either to generate it on the GPU, or to copy it once, immediately after generation, and keep a device pointer to it throughout the scope of the prove() function. Note that our design assumes all memory structures fit in the GPU's global memory. For proofs where the trace is less than 3GB in size, this restriction poses no issue. Larger proofs would need a slightly different treatment, which would typically affect the frontend and is beyond our scope for now.
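A minimal sketch of this "copy once, keep on device" pattern follows; DeviceColumn and copy_host_to_device are placeholders for illustration, not ICICLE's actual API.

// Sketch: upload the trace once, right after generation, and hand only
// device handles to the prover from then on.
struct DeviceColumn { len: usize /* device pointer omitted */ }

fn copy_host_to_device(host: &[u32]) -> DeviceColumn {
    // In the real backend this is a single host-to-device transfer per column.
    DeviceColumn { len: host.len() }
}

struct DeviceTrace { columns: Vec<DeviceColumn> }

fn upload_trace(host_trace: &[Vec<u32>]) -> DeviceTrace {
    DeviceTrace { columns: host_trace.iter().map(|c| copy_host_to_device(c)).collect() }
}

fn prove(trace: &DeviceTrace) {
    // All later stages (extension, commitment, quotients, FRI) read the
    // device-resident trace; nothing comes back until decommitment.
    let _total_elements: usize = trace.columns.iter().map(|c| c.len).sum();
}

fn main() {
    let host_trace = vec![vec![0u32; 1 << 16]; 128]; // columns of M31 elements
    let device_trace = upload_trace(&host_trace);    // copied exactly once
    drop(host_trace);                                // host copy no longer needed
    prove(&device_trace);
}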
The basic virtual structure in the backend is Col<T>, which implements a column of elements of type T. Implementing it for the GPU allowed us to apply ICICLE mechanisms to this data structure (and extensions of it) directly on the GPU, which is very efficient. A second basic structure in the Stwo implementation is SecureColumnByCoords, which represents columns of secure field elements (secure field elements live in the degree-4 extension of the M31 base field, namely QM31). Unlike Col<T>, SecureColumnByCoords is not virtualized and forces an implementation as an array of Col<T>'s. Natively, ICICLE works on columns of field elements, so for secure field elements the natural ICICLE representation would be Col<QM31>; the Stwo implementation instead forced us to rewrite some of our constructs to support the multiple-M31-column representation. This could have been avoided by virtualizing SecureColumnByCoords, but as noted above, that was out of scope. Once both Col<T> and SecureColumnByCoords were implemented in the backend and all backend functions were adjusted to work with them, things fell into place.
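The difference between the two layouts is easiest to see side by side. The sketch below is deliberately simplified (plain u32 arrays instead of real field types, no backend generics), but it captures the coordinate-wise layout that SecureColumnByCoords imposes versus the packed Col<QM31> layout that ICICLE naturally prefers.

type M31 = u32;          // base field element (illustrative)
type QM31 = [M31; 4];    // secure field element: degree-4 extension of M31

// Stwo-style layout: a secure-field column stored coordinate-wise,
// i.e. four base-field columns, one per extension coordinate.
struct SecureColumnByCoords {
    columns: [Vec<M31>; 4],
}

// ICICLE's natural layout: a single column of packed QM31 elements.
struct SecureColumn {
    values: Vec<QM31>,
}

impl SecureColumnByCoords {
    // Gather the i-th secure element from the four coordinate columns.
    fn at(&self, i: usize) -> QM31 {
        [self.columns[0][i], self.columns[1][i], self.columns[2][i], self.columns[3][i]]
    }
}

fn main() {
    let by_coords = SecureColumnByCoords { columns: [vec![1], vec![2], vec![3], vec![4]] };
    let packed = SecureColumn { values: vec![[1, 2, 3, 4]] };
    assert_eq!(by_coords.at(0), packed.values[0]);
}

Supporting the coordinate-wise layout is what required rewriting some of our constructs to read four M31 columns instead of one packed QM31 column.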
Below is a list of the main memory structures involved in the process:
1. Trace — A collection of M31 columns, potentially grouped into a few groups called components.
2. Composition Polys (CP) — A collection of a few M31 columns, four per trace component.
3. Out-of-Domain (OOD) Samples — A collection of QM31 samples, at most two per column of trace and CP, per query.
4. Quotient Polys (QP) — A few QM31 columns, one per trace component and one for CP.
5. Merkle Tree commitments for trace, CP, and QP.
The input to proving is the trace and its components' constraints, and the output is a number of paths from the committed Merkle Trees along with the OOD Samples. Therefore, the bulk of the data (see below) can live entirely on the GPU and be discarded upon completion of the proof. This is schematically depicted in the figure below.

Profiling
Below is a profile of an execution of test_wide_fib_prove_with_blake() on an i9-12900K + RTX 3090 Ti platform.

Let us now break down the profile:
The function fn-prove corresponds to complete proof generation excluding the trace; the trace and its Merkle Tree commitment are produced prior to fn-prove. The first part of fn-prove is fn-compute_composition_polynomial. There are three hardware options for this part: CPU, SIMD, or GPU. The GPU version, depicted above, is currently limited and does not support lookups (see appendix), but it is superior to both CPU and SIMD. It still accounts for about a quarter of the proof time, and there is still much room for improvement.
The image below shows the Merkle Tree construction for CP. As is noticeable, the profile is dominated by CUDA kernel execution (blue blocks). One thing that can be improved here is the first half of the time, where twiddle factors are generated for the DCCT operation. These twiddles can be generated at a setup stage and reused later, as needed.
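As a sketch of the setup-time caching we have in mind (compute_twiddles and the cache below are illustrative placeholders, not the actual ICICLE or Stwo code):

// Compute DCCT twiddles once per domain size and reuse them across
// commitments, instead of regenerating them inside every DCCT call.
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

type Twiddles = Vec<u32>; // in practice: a device-resident buffer

fn compute_twiddles(log_size: u32) -> Twiddles {
    // Placeholder for the expensive generation seen in the profile.
    (0..(1u32 << log_size)).collect()
}

fn twiddle_cache() -> &'static Mutex<HashMap<u32, Twiddles>> {
    static CACHE: OnceLock<Mutex<HashMap<u32, Twiddles>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

// Called from the DCCT/commitment path: hits the cache after the first use.
fn get_twiddles(log_size: u32) -> Twiddles {
    let mut cache = twiddle_cache().lock().unwrap();
    cache.entry(log_size).or_insert_with(|| compute_twiddles(log_size)).clone()
}

fn main() {
    let a = get_twiddles(20); // computed once, at "setup"
    let b = get_twiddles(20); // reused afterwards
    assert_eq!(a.len(), b.len());
}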

The OOD sampling, shown in the zoomed image below, is the next stage. This function takes about a fifth of the runtime and can still be greatly improved. A simple improvement, which could shrink it by perhaps ten-fold, is to move the current sampling loop from the frontend to the backend (see this PR). This would allow the computation of all samples to be parallelized, resulting in much better GPU utilization.
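Roughly, the change looks like this (illustrative code; evaluate_at stands in for the real point-evaluation routine, and the types are simplified):

type QM31 = [u32; 4];

fn evaluate_at(column: &[u32], _point: QM31) -> QM31 {
    // Placeholder: in the real backend this is a reduction over the column.
    [column.len() as u32, 0, 0, 0]
}

// Current shape: the frontend drives the loop, one backend call per column,
// which on GPU means many small, under-utilized launches.
fn frontend_sample_loop(columns: &[Vec<u32>], point: QM31) -> Vec<QM31> {
    columns.iter().map(|c| evaluate_at(c, point)).collect()
}

// Proposed shape: the whole batch crosses the backend boundary at once,
// so the GPU backend is free to evaluate all columns in parallel.
fn backend_eval_batch(columns: &[Vec<u32>], point: QM31) -> Vec<QM31> {
    columns.iter().map(|c| evaluate_at(c, point)).collect()
}

fn main() {
    let columns = vec![vec![0u32; 8]; 3];
    let point = [1, 2, 3, 4];
    assert_eq!(frontend_sample_loop(&columns, point), backend_eval_batch(&columns, point));
}

The two functions are semantically identical; the difference is purely where the loop lives, and therefore how much parallel work a single GPU call can see.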

FRI quotients and the FRI commit are already dominated by CUDA kernel execution, as can be seen in the image below. Therefore FRI could potentially benefit from further internal kernel parallelization.

The last profile snapshot is of the decommit logic where proof elements are copied back to the CPU.

Finally, we note that further optimizations can be done to support batching. We demonstrate this point as part of the benchmarks below.
Runtime Comparison
We compared ICICLE-Stwo with Starkware's SIMD backend and Nethermind's Stwo-GPU. We ran benchmarks on four machines, each with a different CPU and GPU.
We benchmarked a single run of test_wide_fib_prove_with_blake(). The raw data and instructions for reproducing the results can be found in the following spreadsheet.

Notes:
- ICICLE-Stwo uses the GPU for the composition polynomial (see appendix below). Using the SIMD backend for it would have resulted in slightly slower results.
- ICICLE-Stwo runs out of memory (OOM) from 2²³ onward. This can be handled as discussed above. Stwo-GPU is already OOM at 2²⁰ for the same trace.
- As we go from 128 columns to 1024 columns, the speedup relative to Stwo-GPU shrinks, reflecting the lack of batching optimizations in our code.
Appendix — On Lookups and Running the Composition Polynomial on GPU
The Fibonacci example used in our profile does not make use of lookups and is thus limited; a compiled Cairo program, for instance, will likely utilize lookups. The missing functionality affects the ability to evaluate trace constraints that involve lookups in the function compute_composition_polynomial() in “crates/prover/src/core/prover/mod.rs”. Extending this functionality on GPU is doable but was out of scope for now. As an alternative, we provide an option to use Stwo's SIMD backend for compute_composition_polynomial() only. This option lets us take advantage of the acceleration provided by SIMD without affecting the frontend code and without limiting the lookup functionality. It should be revisited, because the GPU path outperforms SIMD: around 2x with our current implementation.
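Schematically, the hybrid option looks as follows (trait and type names are simplified stand-ins for the real Stwo backend traits, and the bodies are placeholders):

type SecureColumn = Vec<[u32; 4]>; // QM31 values, layout details elided

trait CompositionOps {
    fn compute_composition_polynomial(trace_len: usize) -> SecureColumn;
}

// CPU/SIMD path: supports lookups today.
struct SimdBackend;
impl CompositionOps for SimdBackend {
    fn compute_composition_polynomial(trace_len: usize) -> SecureColumn {
        vec![[0; 4]; trace_len]
    }
}

// GPU backend: can either run its own (faster, but lookup-free) evaluation
// or hand this one step to the SIMD path; everything else stays on the GPU.
struct IcicleBackend;
impl CompositionOps for IcicleBackend {
    fn compute_composition_polynomial(trace_len: usize) -> SecureColumn {
        const USE_GPU_COMPOSITION: bool = false; // the option described above
        if USE_GPU_COMPOSITION {
            // GPU constraint evaluation (around 2x faster than SIMD in our
            // measurements, but without lookup support for now).
            vec![[0; 4]; trace_len]
        } else {
            // Delegate this step only; the result is then moved to the device
            // and the remainder of the proof proceeds on the GPU.
            SimdBackend::compute_composition_polynomial(trace_len)
        }
    }
}

fn main() {
    let cp = IcicleBackend::compute_composition_polynomial(1 << 16);
    assert_eq!(cp.len(), 1 << 16);
}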
Acknowledgements
The work on this project was supported by the Starknet Foundation.
Further Reading
As part of the work on Circle STARK acceleration, we also published our RC-STARKs paper.