Workaround for Speculative Execution vulnerabilities

Memory loads are one type of instruction that can benefit greatly from speculative execution. They are very common, they have relatively long execution latencies, the addresses they use are often available in advance, and the result can be stored in a new temporary variable without destroying the value of any other variable. Unfortunately, memory loads can raise exceptions if their addresses are illegal, so speculatively accessing illegal addresses may cause a correct program to halt unexpectedly. In addition, mispredicted memory loads can cause extra cache misses and page faults, which are extremely costly.

For example, in the fragment

  if (P != nullptr)
    U = *P;

dereferencing P speculatively will cause this correct program to halt in error if P is nullptr.

Many high-performance processors provide special features to support speculative memory access:

Prefetching

The prefetch instruction was invented to bring data from memory to the cache before it is used. A prefetch instruction indicates to the processor that the program is likely to use a particular memory word in the near future. If the location specified is invalid or if accessing it causes a page fault, the processor can simply ignore the operation. Otherwise, the processor will bring the data from memory to the cache if it is not already there.
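As a rough illustration of prefetching from source code, the sketch below uses `__builtin_prefetch`, a GCC/Clang builtin (other compilers may not provide it). The function name `sum_indirect` and the prefetch distance `kAhead` are assumptions made for this example, not values taken from any particular processor manual.

```cpp
#include <cstddef>
#include <vector>

// Sums data[idx[i]] for every i, hinting the cache about an element
// that will be needed a few iterations later.
long sum_indirect(const std::vector<long>& data,
                  const std::vector<std::size_t>& idx) {
    constexpr std::size_t kAhead = 8;  // assumed prefetch distance; tune per machine
    long total = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + kAhead < idx.size()) {
            // Arguments: address, 0 = prefetch for read, 1 = low temporal locality.
            // A prefetch is only a hint; it does not change the program's result.
            __builtin_prefetch(&data[idx[i + kAhead]], 0, 1);
        }
        total += data[idx[i]];
    }
    return total;
}
```

Whether this helps depends on the access pattern and the hardware prefetcher, so it is worth measuring before keeping such a hint.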

Poison Bits

Another architectural feature, called poison bits, was invented to allow speculative loads of data from memory into the register file. Each register on the machine is augmented with a poison bit. If illegal memory is accessed or the accessed page is not in memory, the processor does not raise the exception immediately but instead just sets the poison bit of the destination register. An exception is raised only if the contents of a register with a marked poison bit are used.
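The deferred-exception idea can be mimicked in software. The following is only a toy model: `PoisonReg`, `speculative_load`, and `use` are hypothetical names, and a null pointer stands in for "illegal address or page not in memory".

```cpp
#include <cstdint>
#include <stdexcept>

// Toy model of a register augmented with a poison bit.
struct PoisonReg {
    std::int64_t value = 0;
    bool poisoned = false;
};

// Speculative load: never faults. On a bad address it only marks the
// destination register as poisoned.
inline PoisonReg speculative_load(const std::int64_t* addr) {
    PoisonReg r;
    if (addr == nullptr) {  // stand-in for an illegal or unmapped address
        r.poisoned = true;
    } else {
        r.value = *addr;
    }
    return r;
}

// Using the register is the point where the deferred exception is raised.
inline std::int64_t use(const PoisonReg& r) {
    if (r.poisoned) {
        throw std::runtime_error("deferred memory fault");
    }
    return r.value;
}
```

If the speculation turns out to be wrong and the register is never used, the fault is never observed, which is the behaviour the poison bit is meant to provide.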

Predicated Execution

Because branches are expensive, and mispredicted branches are even more so, predicated instructions were invented to reduce the number of branches in a program. A predicated instruction is like a normal instruction but has an extra predicate operand to guard its execution; the instruction is executed only if the predicate is found to be true.

As an example, a conditional move instruction CMOVZ R2, R3, R1 has the semantics that the contents of register R3 are moved to register R2 only if register R1 is zero. Code such as:

  if (A == 0)
    B = C + D;
  

can be implemented with two machine instructions, assuming that A, B, C, and D are allocated to registers R1, R2, R4, and R5, respectively, as follows:

  ADD R3, R4, R5
  CMOVZ R2, R3, R1
  

This conversion replaces a series of instructions sharing a control dependence with instructions sharing only data dependences. These instructions can then be combined with adjacent basic blocks to create a larger basic block. More importantly, with this code, the processor does not have a chance to mispredict, thus guaranteeing that the instruction pipeline will run smoothly.
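The same ADD-then-conditional-move shape can be written in C++ without a branch in the source. This is a minimal sketch with a hypothetical helper name, `select_if_zero`; it uses unsigned arithmetic to avoid overflow concerns, and a compiler is free to emit a conditional move, a mask sequence, or even a branch for it.

```cpp
#include <cstdint>

// Branch-free form of:  if (a == 0) b = c + d;
// Returns the new value of b.
inline std::uint32_t select_if_zero(std::uint32_t a, std::uint32_t b,
                                    std::uint32_t c, std::uint32_t d) {
    const std::uint32_t sum = c + d;  // always computed, like ADD R3, R4, R5
    // mask is all ones when a == 0 and all zeros otherwise.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(a == 0);
    // Pick sum when the predicate holds, keep b otherwise (the CMOVZ step).
    return (sum & mask) | (b & ~mask);
}
```

For example, `select_if_zero(0, 9, 2, 3)` yields 5, while `select_if_zero(1, 9, 2, 3)` leaves the value at 9. Note that the addition is performed in both cases, which is the cost discussed next.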

Predicated execution does come with a cost. Predicated instructions are fetched and decoded, even though they may not be executed in the end. Static schedulers must reserve all the resources needed for their execution and ensure that all the potential data dependences are satisfied. Predicated execution should not be used aggressively unless the machine has many more resources than can possibly be used otherwise.

Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, identified by CVE-2017-5715, "Branch Target Injection", which is one of the two halves of Spectre.

Chandler implemented:

For the X86 target in LLD, he also implemented RetpolinePic and RetpolineNoPic for 32-bit x86 by inheriting from <Target>, and Retpoline and RetpolineZNow for the 64-bit backend, in each case overriding writeGotPlt, writePltHeader, and writePlt.
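To make the override idea concrete, here is a heavily simplified sketch. `PltWriterSketch` and `RetpolineSketch` are hypothetical classes, the `writePltHeader` signature is an assumption that does not match LLD's real interface, and the code assumes the branch target is already in `%r11` (a real PLT header also has to load it from the GOT). The bytes are the usual x86-64 retpoline sequence: a `call` over a `pause`/`lfence` loop, then a `mov` that overwrites the return address, then `ret`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a linker target class.
struct PltWriterSketch {
    virtual ~PltWriterSketch() = default;
    // Conventional style: a plain indirect jump through %r11.
    virtual void writePltHeader(std::vector<std::uint8_t>& buf) const {
        buf.insert(buf.end(), {0x41, 0xff, 0xe3});  // jmp *%r11
    }
};

// Retpoline variant: reaches the same target without an indirect jmp.
struct RetpolineSketch : PltWriterSketch {
    void writePltHeader(std::vector<std::uint8_t>& buf) const override {
        buf.insert(buf.end(), {
            0xe8, 0x07, 0x00, 0x00, 0x00,  //  0:       call next (pushes address of loop)
            0xf3, 0x90,                    //  5: loop: pause
            0x0f, 0xae, 0xe8,              //  7:       lfence
            0xeb, 0xf9,                    // 10:       jmp loop (speculation is trapped here)
            0x4c, 0x89, 0x1c, 0x24,        // 12: next: mov %r11, (%rsp) (replace return address)
            0xc3,                          // 16:       ret (architecturally jumps to *%r11)
        });
    }
};
```

The return predictor sends speculative execution of the `ret` to the harmless loop at offset 5, while the architectural path lands on the real target; the bytes are only written to a buffer here and never executed.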

So a few minor things might be worth factoring out, for example a Retpoline framework for working around speculative execution issues; code that is reusable for other targets is better :)

Add -mindirect-branch= (thunk, thunk-inline, thunk-extern), -mindirect-branch-loop, -mindirect-branch-register, -mno-indirect-branch-register, and the indirect_branch attribute to GCC.

The implementation is mainly focused on gcc/config/i386/i386.c, which is 50K+ lines of code!!! Snapshot of the previous GCC 6.x i386.c:

Good luck! Don't get lost in GCC :)

Reference