Half&Half: Demystifying Intel's Directional Branch Predictors for Fast, Secure Partitioned Execution

IEEE Symposium on Security and Privacy (IEEE S&P), May 2023.

Hosein Yavarzadeh, Mohammadkazem Taram, Shravan Narayan, Deian Stefan, Dean Tullsen

[ Paper | bibtex | Slides | Teaser | Video ]

In this article, I will discuss our discoveries regarding the structure of the modern branch predictors within high-end Intel processors. For a thorough examination of our findings, please refer to our paper presented at the IEEE Symposium on Security and Privacy in 2023.

Branch Prediction

Before getting into the details of our findings, let me provide a brief overview of branch prediction in modern processors.

Without branch prediction, the processor would have to wait until the branch instruction is executed before it can determine the next instruction to fetch. This would result in a significant performance penalty. To overcome this issue, modern processors use branch predictors to predict the outcome of a branch instruction before it is executed. This enables the processor to speculatively execute the predicted instructions, which can significantly improve performance if the branch predictor is accurate enough.

The branch prediction unit (BPU) in modern processors is resposible for making three key predictions:

  1. Identifying Branches - The BPU must identify branch instructions in the instruction stream in order to predict their target addresses and outcomes (Taken or Not-Taken).
  2. Predicting Target Addresses
  3. Predicting Outcomes

The BPU incoporates multiple predictors to make these predictions. Branch Target Buffer (BTB) is used to identify branch instructions and predict their targets. Conditional Branch Predictor (CBP) is used to predict the outcomes of conditional branches. Return Address Stack (RAS) is used to predict the return addresses of function calls. Indirect Branch Predictor (IBP) is used to predict the targets of indirect branches.

In this article, we will focus on the Conditional Branch Predictor (CBP) and our findings regarding its architecture.

Intel's Conditional Branch Predictor (CBP)

Intel's conditional branch predictor (CBP) is a TAGE-like predictor that uses a combination of global and local history to predict the outcome of conditional branches.

Let's take a look at our findings regarding the architecture of Intel's CBP.

Intel's Conditional Branch Predictor
Intel's CBP consists of two components: a local history component (Table0) and a set of global history components (Tables1-3).

The local history component (Table0) is indexed using the lower bits of the branch instruction's address and is useful for capturing the local behavior of branches. For example, it can capture the behavior of a conditional branch that is almost-always taken or not taken.

The global history components (Tables1-3) are indexed/tagged using the combination of the branch instruction's address and the path history register (PHR). In the literature, the PHR is often referred to as the global history register (GHR) or the branch history buffer (BHB).

We found that in intel processors, the global history (PHR) gets updated only when the branch is taken which is different from the conventional GHR used in TAGE-like predictors. GHR is like a shift register that gets updated based on the outcome of the branch --- inserts the outcome of the branch at the least significant bit and shifts the rest of the bits to the right. However, in Intel's CBP, the PHR is updated only when the branch is taken, using a different mechanism. When a branch is taken, a so called "footprint" value is computed based on the branch's address and the target address. This "footprint" is then used to update the PHR. For example, in Alder Lake processors, the footprint is computed (as depicted in the following figure) using the 16 lower bits of the branch's address and the 6 lower bits of the target address.
Footprint Computation
The PHR undergoes a two-step update process upon a branch being taken: first, a leftward shift of two bits, followed by XORing the 16-bit footprint into the PHR. This can be expressed as follows:

PHRnew = (PHRold << 2 bits) XOR footprint

From the PHR update policy, we can conclude that the PHR functions as a 194*2 bits shift-register, with the even and odd bits operating independently as the PHR shifts two bits to the left per taken branch.

Each global history component is indexed using a different number of bits from the PHR (somewhat similar to geometric lengths of the PHR). The Table1 is indexed using the 34 lower bits of the PHR, Table2 is indexed using the 66 lower bits, and Table3 is indexed using the entire PHR size. This allows the CBP to not only capture the near correlated branches but also the long-range correlations.

Index Hash Function

We found that the PHR is folded into 8-bit index using a simple XOR-based hash function.

For Table1, 34 lower bits of PHR are folded into a 8-bit index. For Table2, 66 lower bits are folded into a 8-bit index. For Table3, entire PHR is folded into a 8-bit index.

The surprising result is that in addition to the 8-bit PHR value, one single bit of the branch's address (bit 5) is also used in the index hash function. This bit is not XORed with the PHR, but rather it is used solely as an independent bit in the index hash function.

Partitioning the CBP

From a security standpoint, this final observation has enormous implications. It means that no branch for which PC[5] is 0 can possibly be influenced by branches for which PC[5] is 1, in any of the PHTs. They cannot cause evictions to reduce branch accuracy. They cannot detect evictions, eliminating side channels. They cannot mistrain branches because they cannot induce aliasing across this partition.

Partitioning the Branch Predictor
We use this observation to propose a new defense mechanism against Spectre-like attacks, which we call Half&Half. The idea is to partition the branch predictor into two halves, such that the branches with PC[5] = 0 are predicted by one half and the branches with PC[5] = 1 are predicted by the other half. This partitioning is achieved by masking the PC[5] bit and using it to select the appropriate half of the branch predictor.

We have implemented this defense mechanism on top of LLVM and Swivel (hardened web-assembly compiler) compilers and evaluated it on a variety of Spectre-like attacks. Our results show that Half&Half is highly effective in mitigating these attacks, with minimal performance overhead.

We believe that Half&Half has the potential to significantly improve the security of modern processors against Spectre-like attacks, and we are excited to continue our work in this area.


Please do not hesitate to reach out to me at hyavarzadeh@ucsd.edu.