Ed448-Goldilocks

Fast, strong elliptic curve cryptography

Ed448-Goldilocks is a new elliptic curve for cryptography. Specifically, it is the curve:

y2 + x2 ≡ 1 - 39081 x2 y2     (mod 2448 - 2224 - 1)
I propose that Goldilocks be considered for new implementations and new standards.

This website contains information about Goldilocks, as well as a fast and portable implementation.

Why a new elliptic curve?

Elliptic curve cryptography is popular due to its solid performance and small key size. As the mathematics of ECC advances, new curve shapes and implementation techniques are discovered which offer simpler, faster, stronger cryptography. A particularly important advance was Harold Edwards' 2007 discovery of a new curve form. Dan Bernstein, Tanja Lange and other cryptographers soon realized that this form supports faster and simpler operations than older "short Weierstrass" curves.

Most current deployments of ECC use the NIST elliptic curves. While nothing is wrong with these curves per se, they are beginning to show their age. In addition to using the older short Weierstrass form, these curves' moduli were designed for one specific implementation strategy on 32-bit computers. As a result, they perform poorly on 64-bit machines. Even on 32-bit machines, the required strategy is often not very efficient.

As a result, cryptographers have been considering alternative curve designs for some time. Bernstein's 2006 design Curve25519 has achieved a broad acceptance in the security community, and is used in some places that NIST's P-256 curve might be used. However, no stronger design has achieved as broad acceptance. Since mid-2013, I have been involved in discussions with other cryptographers to propose new options for a stronger curve.

Why this curve?

The discussions of the past year have produced many criteria for the design of a new curve. The SafeCurves criteria are relatively conservative, which is appropriate since a stronger curve should be oriented to conservatism as well as speed. Essentially:

Between discussion on the curves@moderncrypto.org mailing list and a paper I wrote in early 2014, it seems that an Edwards curve is the best. Specifying an Edwards curve retains the performance advantages of the other forms, but makes it clear how to build a simple, secure implementation.

That leaves the question of prime choice. My goal was to find a curve meeting the SafeCurves criteria which has a strong performance vs security tradeoff. For security reasons, I wanted a prime which was at least 384 bits long and not more than 521 bits. I looked at families of primes which have high performance on a variety of architectures, and the "golden-ratio Solinas" form 2n - 2n/2 - 1 appeared to be the strongest candidate. This form is efficient whenever the prime is broken up into an even number of limbs, and supports Karatsuba multiplication essentially for free.

This prime form gives very fast arithmetic. On Haswell, a multiplication takes about 122 CPU cycles with either n = 448 or n = 480. An addition or subtraction takes only a few instructions, and so is too fast to meaningfully measure.

In the desired range, there are four such primes, with n = 416, 448, 450 or 480. On a 64-bit machine, n can be as high as 480 without causing a drop in performance, but on a 32-bit machine the limit is 448. So this is the one I chose. I intend to measure the performance of the 480-bit alternative, "Ed480-Ridinghood", to validate that 448 was the correct choice.

Within these contraints, the coefficient d := -39081 is the least coefficient in absolute value which gives a curve with order and twist-order that are 4 times a prime. There is one more constraint which I imposed but didn't use: the curve should have order less than p. This simplifies the implementation. If the curve had had order greater than p, I would have chosen +39082 instead. These curves are isogenous to each others' twists.

Other authors have proposed different prime shapes, most commonly 2n - c. This is a good choice. However, the multiplication by c causes extra carry propagation, which causes performance to drop off sooner than for a Solinas prime. This is why Bernstein et al.'s design prime is 2414 - 17: larger primes would generate extra carries on a 32-bit platform.

Arithmetic modulo p

To explain in more detail why the chosen p works so well, I will outline an implementation strategy. Most implementations of larger fields use reduced-radix arithmetic, in which the digit size is less than the machine's native word size. For example, instead of using 14 digits of 32 bits each, my implementation of Goldilocks uses 16 digits of 28 bits each.

The advantage of the reduced-radix approach is that carries can accumulate in the upper bits of the word, and do not need to be propagated immediately. The disadvantage is that field elements take more memory, and require more additions and multiplications to process them. In particular, multiply-and-accumulate needs to take about 1-(14/16)2 = 23% less time in order to make up for the greater number of multiplies; but this is the case on most modern CPUs. Furthermore, element-wise vector additions or subtractions are much faster than handling a carry chain, and the representation can use signed arithmetic more easily than a densely packed one.

This strategy works best if all the carries can be delayed until the end. Whether this is feasible depends on how many are accumulated at once, and how big the digits were to begin with (usually they are not perfectly reduced). In the case of Ed448-Goldilocks with 16 limbs, the largest magnification is 38. This can be computed as the largest coefficient of ((x16-1)/(x-1))2 mod (x16-x8-1). This means that we will need at least 6 bits of headroom in the multiplier output. Since the output of the multiply instruction is twice the size of its input, we need at least 3 bits free in the input. Signed arithmetic improves this by a factor of 2, but this does not change the conclusion that 3 bits are needed.

We have 4 bits free. This gives one more bit of headroom, which means that field elements can be added (or subtracted, if signed arithmetic is used) once without reduction, and the multiplication algorithm will still produce the correct result. Likewise, with 8 limbs, multiplies magnify by a factor of 18, again requiring 3 bits of headroom in the input (2 if signed arithmetic is used). This means that the limbs can be up to 60 bits (enough for 2480 - 2240 - 1) with one addition or subtraction before a multiply.

The same analysis can be run with primes in other forms. For a pseudo-Mersenne prime of the form 2n - c, the magnification is (l-1)c+1, where l is the number of limbs. So for example, with 2414-17 and 16 limbs of 26 bits each, we get a magnification of 256, i.e. 8 bits, or 4 bits in the input. In fact 6 bits are free, so there are 2 bits of extra headroom. But this is the largest pseudo-Mersenne prime that fits into 16 limbs on a 32-bit machine without carry propagation in the middle of a multiply.

Karatsuba

With fields this size, the Karatsuba multiplication technique becomes useful:

(aw+b)(cw+d) = acw2 + ((a+b)(c+d)-ac-bd)w + bd
This formula works well with reduced-radix implementations, because they compute (a+b) and (c+d) without extra carry propagation. Furthermore, because Karatsuba is a polynomial identity, it does not introduce new carry propagation problems on the product side so long as delayed propagation is used.

If w is set to 2240 in the above identity, we see that

(aw+b)(cw+d) ≡ ((a+b)(c+d) - bd)w + ac + bd   (mod p)
or alternatively,
(aw+b)(cw+d) ≡ ((a+b)c + ad)w + (a+b)c + b(d-c)   (mod p)
That is, Karatsuba multiplication can be fused with reduction mod p, and it doesn't make the multiplication algorithm more complex.