Be very careful with division, and avoid it when possible. For example, hoist `float inverse = 1.0f / divisor;`

out of a loop and multiply by `inverse`

inside the loop. (If the rounding error in `inverse`

is acceptable)

Usually `1.0/x`

will not be exactly-representable as a `float`

or `double`

. It will be exact when `x`

is a power of 2. This lets compilers optimize `x / 2.0f`

to `x * 0.5f`

without any change in the result.

To let the compiler do this optimization for you even when the result won't be exact (or with a runtime-variable divisor), you need options like `gcc -O3 -ffast-math`

. Specifically, `-freciprocal-math`

(enabled by `-funsafe-math-optimizations`

enabled by `-ffast-math`

) lets the compiler replace `x / y`

with `x * (1/y)`

when that's useful. Other compilers have similar options, and ICC may enable some "unsafe" optimization by default (I think it does, but I forget).

`-ffast-math`

is often important to allow auto-vectorization of FP loops, especially reductions (e.g. summing an array into one scalar total), because FP math is not associative. Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

Also note that C++ compilers can fold `+`

and `*`

into an FMA in some cases (when compiling for a target that supports it, like `-march=haswell`

), but they can't do that with `/`

.

**Division has worse latency than multiplication or addition (or FMA) by a factor of 2 to 4 on modern x86 CPUs, and worse throughput by a factor of 6 to 40**^{1} (for a tight loop doing *only* division instead of *only* multiplication).

The divide / sqrt unit is not fully pipelined, for reasons explained in @NathanWhitehead's answer. The worst ratios are for 256b vectors, because (unlike other execution units) the divide unit is usually not full-width, so wide vectors have to be done in two halves. A not-fully-pipelined execution unit is so unusual that Intel CPUs have an `arith.divider_active`

hardware performance counter to help you find code that bottlenecks on divider throughput instead of the usual front-end or execution port bottlenecks. (Or more often, memory bottlenecks or long latency chains limiting instruction-level parallelism causing instruction throughput to be less than ~4 per clock).

**However, FP division and sqrt on Intel and AMD CPUs (other than KNL) is implemented as a single uop, so it doesn't necessarily have a big throughput impact on surrounding code**. The best case for division is when out-of-order execution can hide the latency, and when there are lots of multiplies and adds (or other work) that can happen in parallel with the divide.

(Integer division is microcoded as multiple uops on Intel, so it always has more impact on surrounding code that integer multiply. There's less demand for high-performance integer division, so the hardware support isn't as fancy. Related: microcoded instructions like `idiv`

can cause alignment-sensitive front-end bottlenecks.)

So for example, this will be really bad:

```
for ()
a[i] = b[i] / scale; // division throughput bottleneck
// Instead, use this:
float inv = 1.0 / scale;
for ()
a[i] = b[i] * inv; // multiply (or store) throughput bottleneck
```

All you're doing in the loop is load/divide/store, and they're independent so it's throughput that matters, not latency.

A reduction like `accumulator /= b[i]`

would bottleneck on divide or multiply latency, rather than throughput. But with multiple accumulators that you divide or multiply at the end, you can hide the latency and still saturate the throughput. Note that `sum += a[i] / b[i]`

bottlenecks on `add`

latency or `div`

throughput, but not `div`

latency because the division isn't on the critical path (the loop-carried dependency chain).

**But in something like this (approximating a function like **`log(x)`

with a ratio of two polynomials), the divide can be pretty cheap:

```
for () {
// (not shown: extracting the exponent / mantissa)
float p = polynomial(b[i], 1.23, -4.56, ...); // FMA chain for a polynomial
float q = polynomial(b[i], 3.21, -6.54, ...);
a[i] = p/q;
}
```

For `log()`

over the range of the mantissa, a ratio of two polynomials of order N has much less error than a single polynomial with 2N coefficients, and evaluating 2 in parallel gives you some instruction-level parallelism within a single loop body instead of one massively long dep chain, making things a LOT easier for out-of-order execution.

In this case, we don't bottleneck on divide latency because out-of-order execution can keep multiple iterations of the loop over the arrays in flight.

We don't bottleneck on divide *throughput* as long as our polynomials are big enough that we only have one divide for every 10 FMA instructions or so. (And in a real `log()`

use case, there's be a bunch of work extracting exponent / mantissa and combining things back together again, so there's even more work to do between divides.)

### When you do need to divide, usually it's best to just divide instead of `rcpps`

x86 has an approximate-reciprocal instruction (`rcpps`

), which only gives you 12 bits of precision. (AVX512F has 14 bits, and AVX512ER has 28 bits.)

You can use this to do `x / y = x * approx_recip(y)`

without using an actual divide instruction. (`rcpps`

itsef is fairly fast; usually a bit slower than multiplication. It uses a table lookup from a table internal to the CPU. The divider hardware may use the same table for a starting point.)

For most purposes, `x * rcpps(y)`

is too inaccurate, and a Newton-Raphson iteration to double the precision is required. But that costs you 2 multiplies and 2 FMAs, and has latency about as high as an actual divide instruction. If *all* you're doing is division, then it can be a throughput win. (But you should avoid that kind of loop in the first place if you can, maybe by doing the division as part of another loop that does other work.)

But if you're using division as part of a more complex function, the `rcpps`

itself + the extra mul + FMA usually makes it faster to just divide with a `divps`

instruction, except on CPUs with very low `divps`

throughput.

(For example Knight's Landing, see below. KNL supports AVX512ER, so for `float`

vectors the `VRCP28PS`

result is already accurate enough to just multiply without a Newton-Raphson iteration. `float`

mantissa size is only 24 bits.)

### Specific numbers from Agner Fog's tables:

Unlike every other ALU operation, division latency/throughput is data-dependent on some CPUs. Again, this is because it's so slow and not fully pipelined. Out-of-order scheduling is easier with fixed latencies, because it avoids write-back conflicts (when the same execution port tries to produce 2 results in the same cycle, e.g. from running a 3 cycle instruction and then two 1-cycle operations).

Generally, the fastest cases are when the divisor is a "round" number like `2.0`

or `0.5`

(i.e. the base2 `float`

representation has lots of trailing zeros in the mantissa).

`float`

**latency** (cycles) **/ throughput** (cycles per instruction, running just that back to back with independent inputs):

```
scalar & 128b vector 256b AVX vector
divss | mulss
divps xmm | mulps vdivps ymm | vmulps ymm
Nehalem 7-14 / 7-14 | 5 / 1 (No AVX)
Sandybridge 10-14 / 10-14 | 5 / 1 21-29 / 20-28 (3 uops) | 5 / 1
Haswell 10-13 / 7 | 5 / 0.5 18-21 / 14 (3 uops) | 5 / 0.5
Skylake 11 / 3 | 4 / 0.5 11 / 5 (1 uop) | 4 / 0.5
Piledriver 9-24 / 5-10 | 5-6 / 0.5 9-24 / 9-20 (2 uops) | 5-6 / 1 (2 uops)
Ryzen 10 / 3 | 3 / 0.5 10 / 6 (2 uops) | 3 / 1 (2 uops)
Low-power CPUs:
Jaguar(scalar) 14 / 14 | 2 / 1
Jaguar 19 / 19 | 2 / 1 38 / 38 (2 uops) | 2 / 2 (2 uops)
Silvermont(scalar) 19 / 17 | 4 / 1
Silvermont 39 / 39 (6 uops) | 5 / 2 (No AVX)
KNL(scalar) 27 / 17 (3 uops) | 6 / 0.5
KNL 32 / 20 (18uops) | 6 / 0.5 32 / 32 (18 uops) | 6 / 0.5 (AVX and AVX512)
```

`double`

**latency** (cycles) **/ throughput** (cycles per instruction):

```
scalar & 128b vector 256b AVX vector
divsd | mulsd
divpd xmm | mulpd vdivpd ymm | vmulpd ymm
Nehalem 7-22 / 7-22 | 5 / 1 (No AVX)
Sandybridge 10-22 / 10-22 | 5 / 1 21-45 / 20-44 (3 uops) | 5 / 1
Haswell 10-20 / 8-14 | 5 / 0.5 19-35 / 16-28 (3 uops) | 5 / 0.5
Skylake 13-14 / 4 | 4 / 0.5 13-14 / 8 (1 uop) | 4 / 0.5
Piledriver 9-27 / 5-10 | 5-6 / 1 9-27 / 9-18 (2 uops) | 5-6 / 1 (2 uops)
Ryzen 8-13 / 4-5 | 4 / 0.5 8-13 / 8-9 (2 uops) | 4 / 1 (2 uops)
Low power CPUs:
Jaguar 19 / 19 | 4 / 2 38 / 38 (2 uops) | 4 / 2 (2 uops)
Silvermont(scalar) 34 / 32 | 5 / 2
Silvermont 69 / 69 (6 uops) | 5 / 2 (No AVX)
KNL(scalar) 42 / 42 (3 uops) | 6 / 0.5 (Yes, Agner really lists scalar as slower than packed, but fewer uops)
KNL 32 / 20 (18uops) | 6 / 0.5 32 / 32 (18 uops) | 6 / 0.5 (AVX and AVX512)
```

Ivybridge and Broadwell are different too, but I wanted to keep the table small. (Core2 (before Nehalem) has better divider performance, but its max clock speeds were lower.)

Atom, Silvermont, and **even Knight's Landing (Xeon Phi based on Silvermont) have exceptionally low divide performance**, and even a 128b vector is slower than scalar. AMD's low-power Jaguar CPU (used in some consoles) is similar. A high-performance divider takes a lot of die area. Xeon Phi has low power *per-core*, and packing lots of cores on a die gives it tighter die-area constraints that Skylake-AVX512. It seems that AVX512ER `rcp28ps`

/ `pd`

is what you're "supposed" to use on KNL.

(See this InstLatx64 result for Skylake-AVX512 aka Skylake-X. Numbers for `vdivps zmm`

: 18c / 10c, so half the throughput of `ymm`

.)

Long latency chains become a problem when they're loop-carried, or when they're so long that they stop out-of-order execution from finding parallelism with other independent work.

**Footnote 1: how I made up those div vs. mul performance ratios:**

FP divide vs. multiple performance ratios are even worse than that in low-power CPUs like Silvermont and Jaguar, and even in Xeon Phi (KNL, where you should use AVX512ER).

**Actual divide/multiply throughput ratios for scalar (non-vectorized) **`double`

: 8 on Ryzen and Skylake with their beefed-up dividers, but 16-28 on Haswell (data-dependent, and more likely towards the 28 cycle end unless your divisors are round numbers). These modern CPUs have very powerful dividers, but their 2-per-clock multiply throughput blows it away. (Even more so when your code can auto-vectorize with 256b AVX vectors). Also note that with the right compiler options,
those multiply throughputs also apply to FMA.

Numbers from http://agner.org/optimize/ instruction tables for Intel Haswell/Skylake and AMD Ryzen, for SSE scalar (not including x87 `fmul`

/ `fdiv`

) and for 256b AVX SIMD vectors of `float`

or `double`

. See also the x86 tag wiki.

`200f / 2`

into`100f`

.4more comments