VLIW

Introduction

VLIW (very long instruction word) processors execute several instructions in parallel. For example:

Load/store #1
Load/store #2
Floating-point operation #1
Floating-point operation #2
Integer or branch operation

VLIW requires comprehensive compiler. Usually binaries are incompatible. If one operation slot stalls, entire processor is stalled. Long instructions with unfilled operation slots waste storage room.

Itanium

Itanium had 128x 64-bit general purpose registers; 128x 82-bit floating-point registers; 64x 1-bit predicate registers; 8x 64-bit registers for function calls and returns.

Each instruction bundle was 128 bits long and it consisted of 3x 41-bit instructions and 5-bit template. Each instruction consisted of 14-bits, 3x 7-bit registers and 6-bit predicate. Template refers to what kind of combination of instructions are referred to by instruction bundle.

Itanium has explicit support for software pipelining. Itanium has speculative load which means that if ld.s detects exception, poison bit of destination register is set. Exceptions are raised by chk.s which checks the poison bit.

Software pipelining

Software pipelining is a technique used to optimize loops so they are parallelized by hardware pipelining. It interleaves instructions from different iterations. Software pipelining separates parallelized code execution to three steps:

prologue - Start the loop
kernel - Actually parallelized part of the loop
epilogue - Finish the loop

Software pipelining places the register name assignment burden on compiler.

Predicated execution

Each instruction in loop is predicated on rotating predicate register. The predicate register is used to determine if in prologue, kernel or epilogue phase. Software pipelining is essentially out-of-order execution, except that compiler takes care of reordering. In compiler theory a GCD (greatest common divisor) test is used for loop dependence analysis and loop optimization to check if a loop can be parallelized. According to GCD test comparing the indices i and j of two arrays present in two or more statements, it can be calculated whether there may exist some dependency between loop statements and whether it is legal to parallelize the loop or not.

for (int i = 0; i < 100; i++) {
    a[2*i] = b[i];
    c[i] = a[4*i+1];
}

The indices 2*i and 4*i + 1 could be substituted to following formula:

\begin{equation*} a \times i + b = c \times j + d \rightarrow a \times i - c \times j = d - b \end{equation*}

As following:

\begin{equation*} 1 - 0 = 2i - 4i \rightarrow -2i = 1 \end{equation*}

Now if m - k can be divided by greatest common denominator of x and y then there may be data dependency between those two. As m - k = 1 is not divisible by GCD(2,4) which is 2 then we can assume that no data dependency exists between those two statements.

Predicated execution avoids branches, and simplifies compiler optimization by converting a control dependence to a data dependence. In predicated code the branch and both paths are executed simultaneously, when the outcome of branch is known, the processor discards results from invalid path.

Considering following C snippet:

if (a < 0) {
    b = 1;
    c = 2;
} else {
    d = 3;
    e = 4;
}

Predicated assembly would look something like this:

cmp.lt p1,p2 = r1,r2;

(p1) b = 1;
(p1) c = 2;
(p2) d = 3;
(p2) e = 4;

Trace scheduling

Trace selection means finding the most likely sequence of basic block that will be executed consecutively. Trace compaction means optimizing such trace as much as possible: move operations as early as possible in the trace; pack the operations in as few VLIW instructions as possible; additional code required on exit points.

computer architecture TU Berlin