MIPS64 pipeline01. Mar '14

Introduction

MIPS architecture is of RISC processor familier and it attempts to keep processor design as simple as possible. It has load/store instructions to access memory, to process data it has to be loaded into register first. MIPS was designed for pipelining efficiency.

Registers

MIPS64 registers:

  • 32x 64-bit general-purpose registers (R0, R1, ..., R31)

  • R0 is always 0

  • 32x double-preicision floating-points

  • hi and lo for multiplication and quotinent/remainder of division

  • floating-point status register

MIPS64 instructions:

  • Moving between special registers and general-purpose registers

  • Moving between floating-point registers and general-purpose registers

Addressing modes

  • Register direct, eg. add R1,R2,R3 → R1 = R2 + R3

  • Immediate, eg. addi R1,R2,#3 → R1 = R2 + 3

  • Register indirect with displacement, eg. lw R1,100(R2) → R1 = Mem[R2+100]

Memory accesses must be aligned. Load half word, load byte and load byte unsigned perform sign-extension since all registers are 64-bit by default. Single-precision floating-point numbers are loaded to the highest 32 bits.

Instruction formats

R-type instructions include ALU oprations, logic operations rely on following format where rd = funct(rs, rt):

  • opcode - 6 bits opcode

  • rs - 5 bits operand s register index

  • rt - 5 bits operand t register index

  • rd - 5 bits result d register index

  • shamt - 5 bits

  • funct - 5 bits

I-type instructions include load, store and branch:

  • opcode - 6 bits

  • rs - 5 bits operand s register index

  • rt - 5 bits operand t register index

  • immediate - 16 bits immediate

J-type instructions include jump; jump and link:

  • opcode - 6 bits

  • offset - 26 bits offset

Arithmetic instructions

  • dadd r1,r2,r3 - Add signed 64-bit integers, R1 = R2 + R3, overflow detection

  • daddu r1,r2,r3 - Add unsigned 64-bit integers, R1 = R2 + R3, no overflow detection

  • daddiu r1,r1,#-1 - Add unsigned immediate, R1 -= 1, no overflow detection

Logic instructions

  • and r1,r2,r3 - Bitwise AND: R1 = R2 & R3

  • ori r1,r2,#7 - Bitwise OR, no sign extension: R1 = R2 | 7

  • lui r1,#42 - Load upper immediate: R1 = 00000000000000000000000000000000 ## 42 ## 0000000000000000

  • dsll r1,r2,#5 - Logical shift left: R1 = R2 << 5

  • dslt r1,r2,r3 - Set less than: R1 = R2 < R3

Control flow instruction

  • j 1000 - Jump to label: PC(31 downto 0) = 4 * 1000

  • jal 1000 - Jump and link: R31 = PC+4; PC(31 downto 0) = (PC+4)(31 downto 28) ## 4 * 1000

  • jr r31 - Jump register: PC = R31

  • bne r1,r2,100 - Branch on not equal: if (R1 != R2) PC = PC + 4 + 4*100

  • bltz r1,100 - Branch on less than 0: if (R1 < 0) PC = PC + 4 + 4*100

  • movz r1,r2,r3 - Move if zero: if (R3 == 0) R1 = R2

Floating-point operations

  • {add,sub,mul,div}.{s,d,ps} - Add, subtract, multiply, divide single-precision, double-precision or paired single-precision numbers.

  • madd.{s,d,ps} - Multiply and add

  • mov.{s,d} - Move

  • c.{lt,le,eq}.{s,d} - Compare numbers, sets bits in floating-point status register

  • bc1t, bc1f - Branch if floating-point condition bit is either true or false respectively

  • cvt.d.s f0,f1 - Convert single to double: F0 = (double)F1

  • cvt.w.d f0,f1 - Convert double to word: F0 = (int)F1

  • mfc1 r1,f1 - Move from floating-point register to general-purpose register: R1 = F1

  • mtc1 f1,r1 - Move from general-purpose register to floating-point register: F1 = R1

Sample assembly

Commented assembly for MIPS64, assume $v0 stores the function's return value:

    add $t0, $zero, $zero    # T0 = 0
loop:
    beq $a1, $zero, finish   # if (A1 == 0) goto finish
    add $t0, $t0, $a0        # T0 += A0
    sub $a1, $a1, 1          # A1++
    j loop                   # goto loop
finish:
    daddi $t0,$t0,100        # T0 += 100
    dadd $v0,$t0,$zero       # return T0

Unrolling loops

Consider following sample:

loop:
    l.d f1,0(r1)     ; F1 = ptr1[0]
    l.d f2,0(r2)     ; F2 = ptr2[0]
    ; stall
    mul.d f1,f1,f2   ; F1 = F1 * F2
    ; stall
    ; stall
    ; stall
    add.d f0,f0,f1   ; F0 += F1
    add r1,r1,8      ; R1 += 8
    add r2,r2,8      ; R2 += 8
    bne r1,r9,loop   ; if (R1 != R9) goto loop
    ; delay slot

Unrolling the loop by a factor of two results in following code:

loop:
    l.d f1,0(r1)
    l.d f2,0(r2)
    ; stall
    mul.d f1,f1,f2
    l.d f3,8(r1)
    l.d f4,8(r3)
    ; stall
    mul.d f3,f3,f4
    add.d f0,f0,f1
    add.d f0,f0,f3
    add r1,r1,16
    add r2,r2,16
    bne r1,r9,loop

This improves potential for instruction scheduling and it reduces loop overhead. The code size is increased which may be a concern in embedded world. This may cause increase of instruction cache miss rate. Unrolling implies register renaming which increases register pressure, that is the lack of registers. Which means that some values may need to be stored in/loaded from memory.

Compiler takes following steps to reschedule a loop: Eliminate extra test and branch instructions and adjusts loop termination and iteration code. Rename registers to avoid name dependencies. Determine loads and stores in unrolled loop can be interchanged by observing that loads and stores from different iterations are independent. This requires analysing memory addresses and finding that hey do not refer to the same address. Schedule the code preserving andy dependencies needed to yield the same result as original code.

Leaf function

Leaf function is a function that does not call another function. Return address (R31) does not need to be saved. By convention arguments are passed in R4-R7 and results are passed in R2 and R3.

Processor types

In a single-cycle processor all instructions take one cycle, but cycle must accommodate slowest instruction. In a multi-cycle processor instruction execution is split into number of steps. Different instructions can take different number of cycles. Allows the reuse of functional-units in different cycles In a Pipelined processor execution of an instruction is overlapped with other steps a processor takes.

MIPS pipeline

All MIPS instructions are of same 32-bit length. Few instruction formats, sources register indexes (rs, rt, rd) always in the same place. Only load and store instructions cause memory access. MIPS has simple branch instructions.

MIPS splits execution steps into five steps:

  1. Instruction Fetch (IF) - Fetch instruction by PC and increment PC by 4

  2. Instruction Decode (ID) and register fetch - Decode the instruction, read registers specified by register indexes rs and rt in case we need them eg. for ALU operations. The register values are read in the second half of this step.

  3. Execute (EX) - ALU instructions perform operation specified by opcode and function code; memory references add base address to offset; branch instructions compute branch condition and branch target

  4. Memory access (MEM) - Load reads from memory using address calculated in previous step; store writes to memory using effective address

  5. Write-back (WB) - Only ALU and load instructions write result back to register file. Note that this happens in the first half of the cycle.

For example load instruction results in IF, ID, EX, MEM and WB. Store instrcution results in IF, ID, EX, MEM.

Latencies

Depending on the opcode, execution latencies differ:

  • Integer ALU executes in one cycle. That means the latency of ALU is 0.

  • Floating-point adder executes in 4 cycles, it is pipelined. Latency is 3 cycles.

  • Floating-point multiplier executes in 6 cycles and it is pipelined. Latency is 5 cycles.

  • Floating point division executes in 25 cycles and it is NOT pipelined. Latency is 24 cycles.

Due to varying latencies, multiple instructions may need to write to register file in same cycle, thus stalling or increasing number of read/write ports is neccessary.

Processors can be made faster by:

  • Pipelining

  • Instruction-level parallelism (ILP)

  • Data-level parallelism (DLP)

  • Thread-level parallelism (TLP)

CPU performance:

\begin{equation*} CPU time = \frac{seconds}{program} = \frac{instructions}{program} = \frac{cycles}{instruction} = \frac{seconds}{cycle} \end{equation*}

Hazards

Data hazard: read after write (RAW), write after read (WAR), write after write (WAW). MIPS implements forwarding/bypassing which makes it possible to avoid read after write hazards. Destination register between EX/MEM and/or MEM/WB pipeline stages equal the source register of instruction in ID/EX pipeline register. This means that ALU output of previous cycle can directly be used by current cycle. This also means that is memory access happens in previous cycle, in can be used by current cycle:

add r1,r2,r3 ;  IF   ID   EX   MEM  WB
sub r4,r1,r5 ;       IF   ID   ---  EX   MEM  WB

If a load to a register is followed immediately by an instruction that uses that register the hazard detection unit detects such hazard and stalls the pipeline. That is because load instruction uses ALU to calculate the address of memory access.

If forwarding is disabled, dependant instruction stalls in its instrcution decode (ID) stage until source operand is written back to register file:

add r1,r2,r3 ;  IF   ID   EX   MEM  WB
sub r4,r1,r5 ;       IF   ID   ---  ---  EX   MEM  WB

Structural hazard: processor's hardware is needed by two or more instructions. For example if floating-point multiplication result is delayed 7 cycles and floating-point addition is delayed 4 cycles and they end up requesting memory-access at the same time if there is only one memory port. This problem can be solved by replicating functional units, eg. several ALU-s, multiple memory ports.

Control/branching hazard: delay between fetching control flow instruction (branch or jump) and actual jump. For that reason MIPS introduced branch delay slot. MIPS has simplified branch testing (rx == ry, rx != ry, rx == 0, rx != 0), the branch condition evaluation and branch target address calculation BOTH happen in instruction decode (ID) stage. The instruction after branch is always executed. This is also known as branch delay slot. The compiler should fill that slot with instructions after the branch (predict branch not taken) or instructions from the target (predict branch taken). If no instruction is found, it should be filled with nop.

Sample execution

Execution with full forwarding on a 5-stage pipeline, X stands for stall:

sll  r3,r1,2    IF   ID   EX   MEM  WB
add  r3,r2,r3        IF   ID   EX   MEM  WB
lw   r4,0(r3)             IF   ID   EX   MEM  WB
lw   r5,0(r3)                  IF   ID   EX   MEM  WB
sw   r5,0(r3)                       IF   ID   X    EX   MEM  WB
sw   r4,4(r3)                            IF   X    ID   EX   MEM  WB
TU Berlin computer architecture MIPS MIPS64