Integer arithmetic is the backbone of cryptographic algorithms. However, the integer widths needed for acceptable security are far larger than the 32- or 64-bit words that standard hardware such as CPUs and GPUs supports natively. Therefore, we need to represent our integers as arrays of multiple "limbs", each either 32 or 64 bits in size.
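To make the limb representation concrete, here is a minimal sketch in C of a 256-bit integer stored as four 64-bit limbs, with an addition that propagates the carry from limb to limb. The type name `u256`, the function `u256_add`, and the little-endian limb order are assumptions chosen for illustration. They are not part of the task.

```c
#include <stdint.h>

#define U256_LIMBS 4 /* 4 x 64-bit limbs = 256 bits (assumed layout) */

/* Hypothetical 256-bit integer type; limb[0] is the least significant. */
typedef struct {
    uint64_t limb[U256_LIMBS];
} u256;

/* r = (a + b) mod 2^256. Returns the carry out of the top limb (0 or 1). */
static uint64_t u256_add(u256 *r, const u256 *a, const u256 *b)
{
    uint64_t carry = 0;
    for (int i = 0; i < U256_LIMBS; i++) {
        uint64_t s = a->limb[i] + carry;
        uint64_t c1 = s < carry;        /* wrapped while adding the carry */
        s += b->limb[i];
        uint64_t c2 = s < b->limb[i];   /* wrapped while adding b's limb */
        r->limb[i] = s;
        carry = c1 | c2;                /* at most one of the two can be set */
    }
    return carry;
}
```

The same pattern applies to 32-bit limbs: only the limb type and the limb count change.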
Design a kernel that multiplies pairs of 256-bit numbers, producing 512-bit results. The goal is to maximize the throughput of the multiplier.
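As a functional reference for what such a kernel must compute, the sketch below multiplies two 256-bit numbers with the schoolbook method. It assumes eight little-endian 32-bit limbs per operand, so each partial product fits in a 64-bit word. The function name `mul256_schoolbook` and the limb layout are illustrative assumptions. This is a correctness baseline, not a throughput-optimized design.

```c
#include <stdint.h>

#define LIMBS 8 /* 8 x 32-bit limbs = 256 bits (assumed layout) */

/*
 * out = a * b, where a and b are 256-bit numbers and out is 512 bits.
 * All arrays hold little-endian 32-bit limbs. out must not alias a or b.
 */
static void mul256_schoolbook(uint32_t out[2 * LIMBS],
                              const uint32_t a[LIMBS],
                              const uint32_t b[LIMBS])
{
    for (int i = 0; i < 2 * LIMBS; i++)
        out[i] = 0;

    for (int i = 0; i < LIMBS; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < LIMBS; j++) {
            /* (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so t cannot overflow. */
            uint64_t t = (uint64_t)a[i] * b[j] + out[i + j] + carry;
            out[i + j] = (uint32_t)t;   /* low 32 bits stay in place */
            carry = t >> 32;            /* high 32 bits move to the next limb */
        }
        out[i + LIMBS] = (uint32_t)carry; /* not yet written by earlier rows */
    }
}
```

Each call is independent of every other call, so one plausible route to high throughput is to process many operand pairs in parallel (for example, one pair per GPU thread or one pair per pipeline slot) and use this routine only to check the results.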
More info on our GitHub