Ed448-Goldilocks is a new elliptic curve for cryptography. Specifically, it is the curve:

This website contains information about Goldilocks, as well as a fast and portable implementation.

Elliptic curve cryptography is popular due to its solid performance and small key size. As the mathematics of ECC advances, new curve shapes and implementation techniques are discovered which offer simpler, faster, stronger cryptography. A particularly important advance was Harold Edwards' 2007 discovery of a new curve form. Dan Bernstein, Tanja Lange and other cryptographers soon realized that this form supports faster and simpler operations than older "short Weierstrass" curves.

Most current deployments of ECC use the NIST elliptic curves. While nothing is wrong
with these curves *per se*, they are beginning to show their age. In addition to
using the older short Weierstrass form, these curves' moduli were designed for one specific
implementation strategy on 32-bit computers. As a result, they perform poorly on
64-bit machines. Even on 32-bit machines, the required strategy is often not very efficient.

As a result, cryptographers have been considering alternative curve designs for some time. Bernstein's 2006 design Curve25519 has achieved a broad acceptance in the security community, and is used in some places that NIST's P-256 curve might be used. However, no stronger design has achieved as broad acceptance. Since mid-2013, I have been involved in discussions with other cryptographers to propose new options for a stronger curve.

The discussions of the past year have produced many criteria for the design of a new curve. The SafeCurves criteria are relatively conservative, which is appropriate since a stronger curve should be oriented to conservatism as well as speed. Essentially:

- The design should be a Montgomery, Edwards or twisted Edwards curve. These curves are simpler to implement securely than short Weierstrass curves.
- The curve's order and its twist's order should each be a prime times a small cofactor. The minimal cofactor for these curve forms is 4.
- The curve should not support nontrivial endomorphisms. These allow increased performance and are not known to lead to attacks, but we are trying to be extra conservative.
- The curve should have large CM discriminant and embedding degree.
- The choice of parameters should be rigid, i.e. the lowest possible coefficient should be used.

That leaves the question of prime choice. My goal was to find a curve meeting the SafeCurves
criteria which has a strong performance vs security tradeoff. For security reasons, I wanted a prime
which was at least 384 bits long and not more than 521 bits. I looked at families of primes which have
high performance on a variety of architectures, and the "golden-ratio Solinas" form
2^{n} - 2^{n/2} - 1 appeared to be the strongest candidate. This form is
efficient whenever the prime is broken up into an even number of limbs, and supports Karatsuba
multiplication essentially for free.

This prime form gives very fast arithmetic. On Haswell, a multiplication takes about 122
CPU cycles with either *n* = 448 or *n* = 480. An addition or subtraction takes
only a few instructions, and so is too fast to meaningfully measure.

In the desired range, there are four such primes, with *n* = 416, 448, 450 or 480. On a
64-bit machine, *n* can be as high as 480 without causing a drop in performance, but on
a 32-bit machine the limit is 448. So this is the one I chose. I intend to measure the
performance of the 480-bit alternative, "Ed480-Ridinghood", to validate that 448
was the correct choice.

Within these contraints, the coefficient *d* := -39081 is the least coefficient in
absolute value which gives a curve with order and twist-order that are 4 times a prime.
There is one more constraint which I imposed but didn't use: the curve should have order
less than *p*. This simplifies the implementation. If the curve had had order
greater than *p*, I would have chosen +39082 instead. These curves are isogenous
to each others' twists.

Other authors have proposed different prime shapes, most commonly 2^{n} - *c*.
This is a good choice.
However, the multiplication by *c* causes extra carry propagation, which causes
performance to drop off sooner than for a Solinas prime. This is why Bernstein et al.'s design
prime is 2^{414} - 17: larger primes would generate extra carries on a 32-bit platform.

To explain in more detail why the chosen *p* works so well, I will outline an
implementation strategy. Most implementations of larger fields use reduced-radix arithmetic,
in which the digit size is less than the machine's native word size. For example, instead of
using 14 digits of 32 bits each, my implementation of Goldilocks uses 16 digits of 28 bits each.

The advantage of the reduced-radix approach is that carries can accumulate in the upper bits
of the word, and do not need to be propagated immediately. The disadvantage is that field
elements take more memory, and require more additions and multiplications to process them.
In particular, multiply-and-accumulate needs to take about 1-(14/16)^{2} = 23% less
time in order to make up for the greater number of multiplies; but this is the case on most
modern CPUs.
Furthermore, element-wise vector additions or subtractions are much faster than handling a
carry chain, and the representation can use signed arithmetic more easily than a densely packed
one.

This strategy works best if all the carries can be delayed until the end. Whether this is
feasible depends on how many are accumulated at once, and how big the digits were to begin
with (usually they are not perfectly reduced). In the case of Ed448-Goldilocks with 16 limbs,
the largest magnification is 38. This can be computed as the largest coefficient of
((x^{16}-1)/(x-1))^{2} mod (x^{16}-x^{8}-1). This means that
we will need at least 6 bits of headroom in the multiplier output. Since the output of the
multiply instruction is twice
the size of its input, we need at least 3 bits free in the input. Signed arithmetic improves
this by a factor of 2, but this does not change the conclusion that 3 bits are needed.

We have 4 bits free. This gives one more bit of headroom, which means that field elements
can be added (or subtracted, if signed arithmetic is used) once without reduction, and the
multiplication algorithm will still produce the correct result.
Likewise, with 8 limbs, multiplies magnify by a factor of 18, again requiring 3 bits of
headroom in the input (2 if signed arithmetic is used). This means that the limbs can be
up to 60 bits (enough for 2^{480} - 2^{240} - 1)
with one addition or subtraction before a multiply.

The same analysis can be run with primes in other forms. For a pseudo-Mersenne prime
of the form 2^{n} - *c*, the magnification is (*l*-1)*c*+1, where
*l* is the number of limbs. So for example, with 2^{414}-17 and 16 limbs
of 26 bits each, we get a magnification of 256, i.e. 8 bits, or 4 bits in the input. In
fact 6 bits are free, so there are 2 bits of extra headroom. But this is the largest
pseudo-Mersenne prime that fits into 16 limbs on a 32-bit machine without carry propagation
in the middle of a multiply.

With fields this size, the Karatsuba multiplication technique becomes useful:

If w is set to 2^{240} in the above identity, we see that