MIPS64 pipeline01. Mar '14
Introduction
MIPS architecture is of RISC processor familier and it attempts to keep processor design as simple as possible. It has load/store instructions to access memory, to process data it has to be loaded into register first. MIPS was designed for pipelining efficiency.
Registers
MIPS64 registers:
32x 64-bit general-purpose registers (R0, R1, ..., R31)
R0 is always 0
32x double-preicision floating-points
hi and lo for multiplication and quotinent/remainder of division
floating-point status register
MIPS64 instructions:
Moving between special registers and general-purpose registers
Moving between floating-point registers and general-purpose registers
Addressing modes
Register direct, eg. add R1,R2,R3 → R1 = R2 + R3
Immediate, eg. addi R1,R2,#3 → R1 = R2 + 3
Register indirect with displacement, eg. lw R1,100(R2) → R1 = Mem[R2+100]
Memory accesses must be aligned. Load half word, load byte and load byte unsigned perform sign-extension since all registers are 64-bit by default. Single-precision floating-point numbers are loaded to the highest 32 bits.
Instruction formats
R-type instructions include ALU oprations, logic operations rely on following format where rd = funct(rs, rt):
opcode - 6 bits opcode
rs - 5 bits operand s register index
rt - 5 bits operand t register index
rd - 5 bits result d register index
shamt - 5 bits
funct - 5 bits
I-type instructions include load, store and branch:
opcode - 6 bits
rs - 5 bits operand s register index
rt - 5 bits operand t register index
immediate - 16 bits immediate
J-type instructions include jump; jump and link:
opcode - 6 bits
offset - 26 bits offset
Arithmetic instructions
dadd r1,r2,r3 - Add signed 64-bit integers, R1 = R2 + R3, overflow detection
daddu r1,r2,r3 - Add unsigned 64-bit integers, R1 = R2 + R3, no overflow detection
daddiu r1,r1,#-1 - Add unsigned immediate, R1 -= 1, no overflow detection
Logic instructions
and r1,r2,r3 - Bitwise AND: R1 = R2 & R3
ori r1,r2,#7 - Bitwise OR, no sign extension: R1 = R2 | 7
lui r1,#42 - Load upper immediate: R1 = 00000000000000000000000000000000 ## 42 ## 0000000000000000
dsll r1,r2,#5 - Logical shift left: R1 = R2 << 5
dslt r1,r2,r3 - Set less than: R1 = R2 < R3
Control flow instruction
j 1000 - Jump to label: PC(31 downto 0) = 4 * 1000
jal 1000 - Jump and link: R31 = PC+4; PC(31 downto 0) = (PC+4)(31 downto 28) ## 4 * 1000
jr r31 - Jump register: PC = R31
bne r1,r2,100 - Branch on not equal: if (R1 != R2) PC = PC + 4 + 4*100
bltz r1,100 - Branch on less than 0: if (R1 < 0) PC = PC + 4 + 4*100
movz r1,r2,r3 - Move if zero: if (R3 == 0) R1 = R2
Floating-point operations
{add,sub,mul,div}.{s,d,ps} - Add, subtract, multiply, divide single-precision, double-precision or paired single-precision numbers.
madd.{s,d,ps} - Multiply and add
mov.{s,d} - Move
c.{lt,le,eq}.{s,d} - Compare numbers, sets bits in floating-point status register
bc1t, bc1f - Branch if floating-point condition bit is either true or false respectively
cvt.d.s f0,f1 - Convert single to double: F0 = (double)F1
cvt.w.d f0,f1 - Convert double to word: F0 = (int)F1
mfc1 r1,f1 - Move from floating-point register to general-purpose register: R1 = F1
mtc1 f1,r1 - Move from general-purpose register to floating-point register: F1 = R1
Sample assembly
Commented assembly for MIPS64, assume $v0 stores the function's return value:
add $t0, $zero, $zero # T0 = 0
loop:
beq $a1, $zero, finish # if (A1 == 0) goto finish
add $t0, $t0, $a0 # T0 += A0
sub $a1, $a1, 1 # A1++
j loop # goto loop
finish:
daddi $t0,$t0,100 # T0 += 100
dadd $v0,$t0,$zero # return T0
Unrolling loops
Consider following sample:
loop:
l.d f1,0(r1) ; F1 = ptr1[0]
l.d f2,0(r2) ; F2 = ptr2[0]
; stall
mul.d f1,f1,f2 ; F1 = F1 * F2
; stall
; stall
; stall
add.d f0,f0,f1 ; F0 += F1
add r1,r1,8 ; R1 += 8
add r2,r2,8 ; R2 += 8
bne r1,r9,loop ; if (R1 != R9) goto loop
; delay slot
Unrolling the loop by a factor of two results in following code:
loop:
l.d f1,0(r1)
l.d f2,0(r2)
; stall
mul.d f1,f1,f2
l.d f3,8(r1)
l.d f4,8(r3)
; stall
mul.d f3,f3,f4
add.d f0,f0,f1
add.d f0,f0,f3
add r1,r1,16
add r2,r2,16
bne r1,r9,loop
This improves potential for instruction scheduling and it reduces loop overhead. The code size is increased which may be a concern in embedded world. This may cause increase of instruction cache miss rate. Unrolling implies register renaming which increases register pressure, that is the lack of registers. Which means that some values may need to be stored in/loaded from memory.
Compiler takes following steps to reschedule a loop: Eliminate extra test and branch instructions and adjusts loop termination and iteration code. Rename registers to avoid name dependencies. Determine loads and stores in unrolled loop can be interchanged by observing that loads and stores from different iterations are independent. This requires analysing memory addresses and finding that hey do not refer to the same address. Schedule the code preserving andy dependencies needed to yield the same result as original code.
Leaf function
Leaf function is a function that does not call another function. Return address (R31) does not need to be saved. By convention arguments are passed in R4-R7 and results are passed in R2 and R3.
Processor types
In a single-cycle processor all instructions take one cycle, but cycle must accommodate slowest instruction. In a multi-cycle processor instruction execution is split into number of steps. Different instructions can take different number of cycles. Allows the reuse of functional-units in different cycles In a Pipelined processor execution of an instruction is overlapped with other steps a processor takes.
MIPS pipeline
All MIPS instructions are of same 32-bit length. Few instruction formats, sources register indexes (rs, rt, rd) always in the same place. Only load and store instructions cause memory access. MIPS has simple branch instructions.
MIPS splits execution steps into five steps:
Instruction Fetch (IF) - Fetch instruction by PC and increment PC by 4
Instruction Decode (ID) and register fetch - Decode the instruction, read registers specified by register indexes rs and rt in case we need them eg. for ALU operations. The register values are read in the second half of this step.
Execute (EX) - ALU instructions perform operation specified by opcode and function code; memory references add base address to offset; branch instructions compute branch condition and branch target
Memory access (MEM) - Load reads from memory using address calculated in previous step; store writes to memory using effective address
Write-back (WB) - Only ALU and load instructions write result back to register file. Note that this happens in the first half of the cycle.
For example load instruction results in IF, ID, EX, MEM and WB. Store instrcution results in IF, ID, EX, MEM.
Latencies
Depending on the opcode, execution latencies differ:
Integer ALU executes in one cycle. That means the latency of ALU is 0.
Floating-point adder executes in 4 cycles, it is pipelined. Latency is 3 cycles.
Floating-point multiplier executes in 6 cycles and it is pipelined. Latency is 5 cycles.
Floating point division executes in 25 cycles and it is NOT pipelined. Latency is 24 cycles.
Due to varying latencies, multiple instructions may need to write to register file in same cycle, thus stalling or increasing number of read/write ports is neccessary.
Processors can be made faster by:
Pipelining
Instruction-level parallelism (ILP)
Data-level parallelism (DLP)
Thread-level parallelism (TLP)
CPU performance:
Hazards
Data hazard: read after write (RAW), write after read (WAR), write after write (WAW). MIPS implements forwarding/bypassing which makes it possible to avoid read after write hazards. Destination register between EX/MEM and/or MEM/WB pipeline stages equal the source register of instruction in ID/EX pipeline register. This means that ALU output of previous cycle can directly be used by current cycle. This also means that is memory access happens in previous cycle, in can be used by current cycle:
add r1,r2,r3 ; IF ID EX MEM WB
sub r4,r1,r5 ; IF ID --- EX MEM WB
If a load to a register is followed immediately by an instruction that uses that register the hazard detection unit detects such hazard and stalls the pipeline. That is because load instruction uses ALU to calculate the address of memory access.
If forwarding is disabled, dependant instruction stalls in its instrcution decode (ID) stage until source operand is written back to register file:
add r1,r2,r3 ; IF ID EX MEM WB
sub r4,r1,r5 ; IF ID --- --- EX MEM WB
Structural hazard: processor's hardware is needed by two or more instructions. For example if floating-point multiplication result is delayed 7 cycles and floating-point addition is delayed 4 cycles and they end up requesting memory-access at the same time if there is only one memory port. This problem can be solved by replicating functional units, eg. several ALU-s, multiple memory ports.
Control/branching hazard: delay between fetching control flow instruction (branch or jump) and actual jump. For that reason MIPS introduced branch delay slot. MIPS has simplified branch testing (rx == ry, rx != ry, rx == 0, rx != 0), the branch condition evaluation and branch target address calculation BOTH happen in instruction decode (ID) stage. The instruction after branch is always executed. This is also known as branch delay slot. The compiler should fill that slot with instructions after the branch (predict branch not taken) or instructions from the target (predict branch taken). If no instruction is found, it should be filled with nop.
Sample execution
Execution with full forwarding on a 5-stage pipeline, X stands for stall:
sll r3,r1,2 IF ID EX MEM WB
add r3,r2,r3 IF ID EX MEM WB
lw r4,0(r3) IF ID EX MEM WB
lw r5,0(r3) IF ID EX MEM WB
sw r5,0(r3) IF ID X EX MEM WB
sw r4,4(r3) IF X ID EX MEM WB