This release is the third major step toward the 1.0.0 release of Zama’s low-level crypto library, Concrete-core. Check out the previous blog posts on the topic (V1.0.0-alpha and V1.0.0-beta), where we explain how Concrete-core is designed to make it easy to experiment with and integrate new FHE-related hardware acceleration.
Today, it is our pleasure to deliver Concrete-core V1.0.0-gamma, with Cuda acceleration! Visit the GitHub release note for the full list of changes. Overall, this version introduces multi-platform support, Cuda acceleration, and two new APIs.
This release comes with two new APIs on top of Concrete-core: a C API and a WASM API. Both only wrap a subset of Concrete-core at the moment. The C API covers the use case of the Concrete Framework’s Compiler, while the WASM API provides all necessary engines for a client application. We plan to make their generation automatic and to offer full coverage of Concrete-core’s features. The aim is to enable the widest possible range of applications to be built on top of Concrete-core.
In the sections below, more details are given on the new multi-platform support and the Cuda acceleration introduced in this release.
From the start, the V1 design has been aimed at supporting a wide variety of platforms and hardware. Adapting Concrete-core V0.1.10 required an extensive restructuring of the code base; in order to see it through safely, we needed a testing infrastructure. This infrastructure was introduced with the V1.0.0-beta release in April 2022, which set us free to proceed with the main restructuring of Concrete-core: multi-backend support. The former `Core` backend is now split into:
- A default backend that does not depend on specialized hardware unless specific compile-time configuration is performed.
- An FFTW backend that implements operations involving polynomial multiplication (bootstrap, Cmux, external product) with FFTW acceleration.
- Additionally, a new Cuda backend is introduced that provides GPU acceleration for the bootstrap and keyswitch.
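The split described above can be pictured as a small lookup from operation to the backends that accelerate it. The following is only an illustrative sketch: `Operation`, `AcceleratedBackend`, and `backends_for` are hypothetical names invented for this example and are not part of Concrete-core's API. The default backend is left out because its list of operations is not given here.

```rust
// Illustrative sketch only: these types are hypothetical and do not
// mirror Concrete-core's actual engines or traits.

/// Operations mentioned in the backend list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    Bootstrap,
    Keyswitch,
    Cmux,
    ExternalProduct,
}

/// The two hardware- or library-accelerated backends named in the post.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AcceleratedBackend {
    Fftw,
    Cuda,
}

impl AcceleratedBackend {
    pub const ALL: [AcceleratedBackend; 2] =
        [AcceleratedBackend::Fftw, AcceleratedBackend::Cuda];

    /// Returns true when the post lists `op` as accelerated by this backend.
    pub fn accelerates(self, op: Operation) -> bool {
        match self {
            // FFTW: operations involving polynomial multiplication.
            AcceleratedBackend::Fftw => matches!(
                op,
                Operation::Bootstrap | Operation::Cmux | Operation::ExternalProduct
            ),
            // Cuda: bootstrap and keyswitch on the GPU.
            AcceleratedBackend::Cuda => {
                matches!(op, Operation::Bootstrap | Operation::Keyswitch)
            }
        }
    }
}

/// Lists the accelerated backends that cover a given operation.
pub fn backends_for(op: Operation) -> Vec<AcceleratedBackend> {
    AcceleratedBackend::ALL
        .iter()
        .copied()
        .filter(|backend| backend.accelerates(op))
        .collect()
}
```

With this table, `backends_for(Operation::Bootstrap)` yields both backends, while `backends_for(Operation::Keyswitch)` yields only the Cuda one.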
This new structure is represented in the figure below:

Now let’s dive into more details of the new Cuda backend introduced in this release.
Cuda acceleration is now available in Concrete-core. TFHE’s programmable bootstrap (PBS) is the performance bottleneck, which is why it is the first operation we’ve targeted for Cuda acceleration. Since it usually comes together with a keyswitch, we also provide a Cuda-accelerated version of the keyswitch. In order to cover different use cases, we offer two implementations of the bootstrap, referred to below as the Amortized Bootstrap and the Low Latency Bootstrap. In both, the bootstrap operation is accelerated via a single Cuda kernel: a function executed on the GPU that runs a set of instructions on Cuda threads, which are themselves grouped into Cuda blocks.
Below is a comparison between the nuFHE implementation, the Amortized Bootstrap, and the Low Latency Bootstrap. The parameter set is fixed to one that nuFHE supports: 32-bit integers, an LWE dimension of 500, a polynomial size of 1024, a GLWE dimension of 1, two decomposition levels, and a decomposition base logarithm of 10. nuFHE exposes two PBS implementations: one relying on an NTT (in yellow) and another relying on an FFT (in green). The figure compares the time taken per bootstrap as the number of bootstraps launched at once grows from 1 to 10,000.
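To make the parameter set concrete, here is a minimal sketch that stores it in a struct and derives two quantities from it. The struct and its field names are hypothetical (they are not Concrete-core types), and the size formula assumes the textbook TFHE layout of a bootstrapping key in the standard (non-Fourier) domain, which may differ from how a given backend actually stores the key.

```rust
// Hypothetical container for the benchmark parameters quoted in the text.
// Names are illustrative and are not Concrete-core's own types.
pub struct PbsParameters {
    pub lwe_dimension: usize,
    pub polynomial_size: usize,
    pub glwe_dimension: usize,
    pub decomposition_level_count: usize,
    pub decomposition_base_log: usize,
    pub integer_bits: usize,
}

/// The parameter set used for the comparison with nuFHE.
pub const BENCHMARK_PARAMETERS: PbsParameters = PbsParameters {
    lwe_dimension: 500,
    polynomial_size: 1024,
    glwe_dimension: 1,
    decomposition_level_count: 2,
    decomposition_base_log: 10,
    integer_bits: 32,
};

impl PbsParameters {
    /// Number of integer coefficients in the bootstrapping key, assuming the
    /// textbook layout: one GGSW ciphertext per input LWE secret key bit,
    /// each made of (k + 1) * levels GLWE ciphertexts of (k + 1) polynomials.
    pub fn bootstrap_key_coefficients(&self) -> usize {
        let glwe_size = self.glwe_dimension + 1;
        self.lwe_dimension
            * self.decomposition_level_count
            * glwe_size
            * glwe_size
            * self.polynomial_size
    }

    /// Size of that key in bytes for the chosen integer width.
    pub fn bootstrap_key_bytes(&self) -> usize {
        self.bootstrap_key_coefficients() * self.integer_bits / 8
    }

    /// Dimension of the LWE ciphertext extracted after the bootstrap (k * N).
    pub fn output_lwe_dimension(&self) -> usize {
        self.glwe_dimension * self.polynomial_size
    }
}
```

Under these assumptions, the benchmark parameters give a bootstrapping key of 4,096,000 coefficients (about 16.4 MB with 32-bit integers) and an output LWE dimension of 1024, which is why a keyswitch back to dimension 500 usually accompanies the bootstrap.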
The Amortized Bootstrap of Concrete-core is plotted in blue and the Low Latency one in red (the latter can only launch a restricted number of PBS at once, which is why there are only two points on its curve). The figure above shows that when only a few bootstraps are launched at once, the Low Latency implementation of Concrete-core performs best. When a large number of bootstraps are launched at once, the nuFHE implementations and the Amortized Bootstrap perform similarly. For intermediate batch sizes, the nuFHE implementation relying on the FFT performs best, though it only supports a very limited set of cryptographic parameters. These benchmark results were obtained on an Nvidia Tesla V100-SXM2-16GB GPU. For reference, one bootstrap using the same parameters on a CPU takes 27 ms (11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz).
The benchmarking results shown above only relate to one set of cryptographic parameters, so bear in mind that performance varies depending on the parameters chosen. To make life easier for users, we have introduced cost and noise models for the Cuda-accelerated operations. These are being integrated into Zama’s Optimizer and Compiler, which take care of choosing the best parameters and hardware for the user.
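The metric in this comparison is the time per bootstrap when a whole batch is launched at once. A minimal sketch of how such a figure can be derived is shown below; the helper names are hypothetical, and the closure passed to `measure_batch` is assumed to block until the batch has finished (for a GPU, that means waiting for the device to complete its work).

```rust
use std::time::{Duration, Instant};

/// Measures the wall-clock time of one batch launch.
/// Assumption: `launch_batch` returns only once all results are available.
pub fn measure_batch<F: FnOnce()>(launch_batch: F) -> Duration {
    let start = Instant::now();
    launch_batch();
    start.elapsed()
}

/// Amortized time per bootstrap: total batch time divided by the batch size.
/// Returns `None` for an empty batch instead of dividing by zero.
pub fn per_bootstrap_time(total: Duration, batch_size: u32) -> Option<Duration> {
    total.checked_div(batch_size)
}

/// Converts a per-bootstrap time into a throughput in bootstraps per second.
/// A zero duration yields infinity.
pub fn bootstraps_per_second(per_bootstrap: Duration) -> f64 {
    1.0 / per_bootstrap.as_secs_f64()
}
```

As a point of comparison, the 27 ms CPU figure quoted above corresponds to roughly 37 bootstraps per second on a single core.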
Check out our tutorial to see how to start using the Cuda backend.
With this release, we’re getting very close to the final V1 release. We hope this V1.0.0-gamma version will give you the opportunity to try out your applications with Cuda acceleration. Here are the links to both our user documentation and the Rust documentation to get you started.
We’re excited to see what you build!