Memory loads are one type of instruction that can benefit greatly from speculative execution. Memory loads are quite common, of course. They have relatively long execution latencies, addresses used in the loads are commonly available in advance, and the result can be stored in a new temporary variable without destroying the value of any other variable. Unfortunately, memory loads can raise exceptions if their addresses are illegal, so speculatively accessing illegal addresses may cause a correct program to halt unexpectedly. Besides, mispredicted memory loads can cause extra cache misses and page faults, which are extremely costly.
For example, in the fragment
if (P != nullptr) U = *P;
dereferencing P speculatively will cause this correct program to halt in error if P is nullptr.
Many high-performance processors provide special features to support speculative memory access:
The prefetch instruction was invented to bring data from memory to the cache before it is used. A prefetch instruction indicates to the processor that the program is likely to use a particular memory word in the near future. If the location specified is invalid or if accessing it causes a page fault, the processor can simply ignore the operation. Otherwise, the processor will bring the data from memory to the cache if it is not already there.
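As a rough illustration of how a program can issue such a hint, the sketch below uses the GCC/Clang builtin `__builtin_prefetch` while summing elements through an index array. The function name `sum_indirect` and the look-ahead distance `kAhead` are made up for this example; a useful distance depends on the machine and the access pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum the elements of `data` selected by `index`, prefetching a few
// iterations ahead. kAhead is an arbitrary value chosen for illustration.
long sum_indirect(const std::vector<long>& data,
                  const std::vector<std::size_t>& index) {
    constexpr std::size_t kAhead = 8;
    long total = 0;
    for (std::size_t i = 0; i < index.size(); ++i) {
        if (i + kAhead < index.size()) {
            assert(index[i + kAhead] < data.size());
#if defined(__GNUC__)
            // Hint only: 0 = read access, 1 = low temporal locality.
            __builtin_prefetch(&data[index[i + kAhead]], 0, 1);
#endif
        }
        assert(index[i] < data.size());
        total += data[index[i]];
    }
    return total;
}
```

The prefetch never changes the result; removing it leaves the sum identical, which is what makes it safe to issue speculatively.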
Another architectural feature called poison bits was invented to allow speculative load of data from memory into the register file. Each register on the machine is augmented with a poison bit. If illegal memory is accessed or the accessed page is not in memory, the processor does not raise the exception immediately but instead just sets the poison bit of the destination register. An exception is raised only if the contents of the register with a marked poison bit are used.
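The behaviour of a poison bit can be modelled in software. The sketch below is a toy model, not how any real processor exposes the feature: `Reg`, `speculative_load`, `use`, and `guarded_read` are hypothetical names, and a null pointer stands in for "illegal or non-resident address".

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Toy model of one register augmented with a poison bit.
struct Reg {
    std::int64_t value = 0;
    bool poison = false;
};

// Speculative load: the fault is recorded in the poison bit, not raised.
Reg speculative_load(const std::int64_t* addr) {
    Reg r;
    if (addr == nullptr) {
        r.poison = true;
    } else {
        r.value = *addr;
    }
    return r;
}

// Using a register raises the deferred exception if it is poisoned.
std::int64_t use(const Reg& r) {
    if (r.poison) throw std::runtime_error("deferred memory exception");
    return r.value;
}

// The guarded fragment with the load hoisted above the test.
std::int64_t guarded_read(const std::int64_t* p, std::int64_t fallback) {
    Reg tmp = speculative_load(p);  // issued early, cannot fault here
    if (p != nullptr) {
        assert(!tmp.poison);        // a legal address never poisons
        return use(tmp);            // value is consumed only on this path
    }
    return fallback;                // a poisoned tmp is simply discarded
}
```

The exception surfaces only when `use` touches a poisoned register, so a correct program that never consumes the speculative value is unaffected.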
Because branches are expensive, and mispredicted branches are even more so, predicated instructions were invented to reduce the number of branches in a program. A predicated instruction is like a normal instruction but has an extra predicate operand to guard its execution; the instruction is executed only if the predicate is found to be true.
As an example, a conditional move instruction CMOVZ R2, R3, R1 has the semantics that the contents of register R3 are moved to register R2 only if register R1 is zero. Code such as:
if (A == 0) B = C + D;
can be implemented with two machine instructions, assuming that A, B, C, and D are allocated to registers R1, R2, R4, and R5, respectively, as follows:
ADD R3, R4, R5
CMOVZ R2, R3, R1
This conversion replaces a series of instructions sharing a control dependence with instructions sharing only data dependences. These instructions can then be combined with adjacent basic blocks to create a larger basic block. More importantly, with this code, the processor does not have a chance to mispredict, thus guaranteeing that the instruction pipeline will run smoothly.
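The same transformation can be mimicked in source code. The sketch below models CMOVZ with mask arithmetic so that no branch appears in the source; `cmovz` and `predicated_update` are hypothetical helper names, and whether a compiler actually emits a conditional move for this is its own decision.

```cpp
#include <cassert>
#include <cstdint>

// Software model of CMOVZ dst, src, cond: the result is src when cond == 0,
// otherwise dst is kept. The mask is all ones or all zeros.
inline std::int32_t cmovz(std::int32_t dst, std::int32_t src, std::int32_t cond) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond == 0);
    return static_cast<std::int32_t>((static_cast<std::uint32_t>(src) & mask) |
                                     (static_cast<std::uint32_t>(dst) & ~mask));
}

// if (A == 0) B = C + D;  with A, B, C, D playing the roles of R1, R2, R4, R5.
// Returns the new value of B.
inline std::int32_t predicated_update(std::int32_t A, std::int32_t B,
                                      std::int32_t C, std::int32_t D) {
    const std::int32_t R3 = C + D;  // ADD   R3, R4, R5  (always executed)
    return cmovz(B, R3, A);         // CMOVZ R2, R3, R1
}
```

Note that the addition is performed whether or not A is zero, which is exactly the cost that predication trades for the missing branch.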
Predicated execution does come with a cost. Predicated instructions are fetched and decoded, even though they may not be executed in the end. Static schedulers must reserve all the resources needed for their execution and ensure that all the potential data dependences are satisfied. Predicated execution should not be used aggressively unless the machine has many more resources than can possibly be used otherwise.
Introduce the "retpoline" x86 mitigation technique for variant #2 of the speculative execution vulnerabilities disclosed today, specifically identified by CVE-2017-5715, "Branch Target Injection", which is one of the two halves of Spectre.
IndirectBrExpandPass overrides runOnFunction to rewrite each indirectbr so that it casts its loaded pointer to an integer and switches on it using the integer map. The Subtarget is checked for whether or not useRetpoline is set during the Instruction Selection phase.
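To give a feel for the shape of the rewritten control flow, here is a minimal source-level analogy, not LLVM IR and not the pass itself. Block addresses are replaced by small integers (`BlockId`, a made-up enum) and the indirect jump becomes a switch; `run` and its toy "program" are hypothetical.

```cpp
#include <cassert>

// In the indirectbr form, `next` would hold a block address. After expansion
// it holds a small integer that a switch maps back to the block.
enum BlockId : int { kInc = 1, kDouble = 2, kDone = 3 };

int run(int x, const int* program, int length) {
    int pc = 0;
    for (;;) {
        assert(pc < length);
        const int next = program[pc++];  // the "loaded pointer", now an integer
        switch (next) {                  // replaces the jump through an address
        case kInc:    x += 1; break;
        case kDouble: x *= 2; break;
        case kDone:   return x;
        default:      return -1;         // not reached for well-formed input
        }
    }
}
```

Because every destination is now reached through ordinary compare-and-branch code, there is no indirect jump left to protect in this routine.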
<Target>RetpolineThunksPass creates a RET-implemented trampoline that is used to lower indirect calls in a way that prevents speculation on some x86 processors and can be used to mitigate security vulnerabilities due to targeted speculative execution and side channels. It overrides runOnModule, then calls createThunk(M, "r11", X86::R11); for 64-bit and createThunk(M, "eax", X86::EAX); createThunk(M, "ecx", X86::ECX); createThunk(M, "edx", X86::EDX); createThunk(M, "push"); for 32-bit.
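For orientation, the body of a register-based thunk can be sketched as assembly text. The helper below only builds a string for reading and is not the code LLVM uses; `retpoline_thunk_asm` is a hypothetical name, and the symbol and label names are assumptions written from memory of the commonly published retpoline sequence. The "push" variant is not covered.

```cpp
#include <cassert>
#include <string>

// Builds illustrative AT&T-syntax text for a register-based retpoline thunk.
// `reg` holds the real call target: "r11" on 64-bit, "eax"/"ecx"/"edx" on
// 32-bit. Symbol and label names here are assumptions for illustration.
std::string retpoline_thunk_asm(const std::string& reg, bool is64) {
    assert(!reg.empty());
    const std::string sfx = is64 ? "q" : "l";
    const std::string sp  = is64 ? "%rsp" : "%esp";
    std::string s;
    s += "__llvm_retpoline_" + reg + ":\n";
    s += "  call" + sfx + " .Lcall_target\n";  // pushes the capture loop address
    s += ".Lcapture_spec:\n";                  // speculation is trapped here
    s += "  pause\n";
    s += "  lfence\n";
    s += "  jmp .Lcapture_spec\n";
    s += ".Lcall_target:\n";
    s += "  mov" + sfx + " %" + reg + ", (" + sp + ")\n";  // overwrite return address
    s += "  ret" + sfx + "\n";                 // "returns" to the real target
    return s;
}
```

The return predictor sends speculation into the pause/lfence loop, while the architectural path follows the overwritten return address to the intended callee.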
getOpcodeForRetpoline picks its return value based on the Retpoline Opc type. EmitLoweredRetpoline copies the virtual register into the R11 physical register and calls the retpoline thunk. On 32-bit it carefully finds an available scratch register to hold the callee; when no register is available, it just uses PUSH. In that case the call must not be a tail call, and the target must not be x64.
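The scratch-register choice described above can be sketched as a small selection function. This is an assumption-laden illustration rather than the LLVM code: `pick_thunk32` is a hypothetical helper, registers are represented as strings, and the returned thunk symbol names are assumed.

```cpp
#include <array>
#include <cassert>
#include <set>
#include <string>

// Chooses a 32-bit thunk given the registers already used to pass call
// arguments. Hypothetical helper; the thunk symbol names are assumptions.
std::string pick_thunk32(const std::set<std::string>& used_by_call) {
    static const std::array<const char*, 3> kScratch = {"eax", "ecx", "edx"};
    for (const char* reg : kScratch) {
        if (used_by_call.count(reg) == 0) {
            // Free scratch register: the callee address travels in it.
            return std::string("__llvm_retpoline_") + reg;
        }
    }
    // No register available: the callee address is pushed on the stack.
    return "__llvm_retpoline_push";
}
```

The stack-based fallback is why the text rules out tail calls in that case: the pushed callee address occupies the slot a tail call would need to leave untouched.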
Similarly, for the X86 target, RetpolineNoPic is also implemented by inheriting from <Target> for LLD's X86 32-bit backend, and RetpolineZNow is implemented for the 64-bit backend, to override
So a few minor things might be factored out, for example by implementing a Retpoline framework to work around speculative execution issues; code reusable for other targets is better :)
Add -mindirect-branch=thunk,thunk-inline,thunk-extern, -mindirect-branch-loop, -mindirect-branch-register, -mno-indirect-branch-register and the indirect_branch attribute.
The implementation is mainly focused on gcc/config/i386/i386.c, whose LOC is 50+K!!! Snapshot for previous GCC 6.x