Dynamic scheduling

Introduction

ILP (instruction-level parallelism) means that the execution of instructions is partially overlapped. Static techniques rely on the compiler and are mostly used in embedded systems; dynamic techniques depend on hardware to find the ILP. Dynamic scheduling is essentially loop unrolling performed in hardware. To improve performance, instructions need to be executed out of order and/or in parallel.
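To make the loop-unrolling analogy concrete, here is a compiler-side sketch in C (the function and variable names are invented, and the length n is assumed to be a multiple of 4): the unrolled body contains four independent additions that a processor can overlap, which is exactly the kind of parallelism dynamic scheduling extracts at run time.

double sum_unrolled(const double *x, int n)
{
    /* unrolled by 4: the four accumulators are independent,
       so the four adds per iteration carry no dependences
       between each other and may execute in parallel */
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return s0 + s1 + s2 + s3;
}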

Data hazards

Data hazards fall into three categories:

  • Data dependence, also known as a read-after-write (RAW) hazard: an instruction uses a result produced by a previous instruction.

  • Output dependence, also known as a write-after-write (WAW) hazard: an instruction writes to the same register as a previous instruction.

  • Antidependence, also known as a write-after-read (WAR) hazard: an instruction writes to a register that a previous instruction reads.
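As a concrete illustration, consider the following hypothetical C fragment (all names invented); the comments mark the dependence each pair of statements creates once the variables are kept in registers:

#include <stdio.h>

int main(void)
{
    double a = 1.0, b = 2.0, d = 3.0, c, t;

    t = a / b;  /* writes t                                         */
    c = t + d;  /* data dependence (RAW): reads the t written above */
    t = a * d;  /* output dependence (WAW) with the division,
                   antidependence (WAR) with the addition           */
    printf("%f %f\n", c, t);
    return 0;
}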

Dependences are properties of the program; hazards are properties of the machine organization. If two instructions are independent, they can be executed in parallel or in any order. If they are dependent, they must be executed in order, although part of their execution may still overlap. Program order needs to be preserved only where it affects the program outcome. Name dependences (WAR and WAW) can be removed by changing the names of the registers involved, which is known as register renaming.
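Applied to the fragment above, renaming the second write of t to a fresh name t2 removes the WAW and WAR dependences while preserving the true RAW dependence (again a hypothetical sketch):

#include <stdio.h>

int main(void)
{
    double a = 1.0, b = 2.0, d = 3.0, c, t, t2;

    t  = a / b;  /* writes t                             */
    c  = t + d;  /* the true RAW dependence on t remains */
    t2 = a * d;  /* renamed destination: the WAW and WAR
                    dependences from above are gone      */
    printf("%f %f\n", c, t2);
    return 0;
}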

Control hazards

Control dependences are not fundamental performance limits; a control dependence can be violated as long as correctness is not affected. The properties critical to program correctness are data flow and exception behaviour: instruction reordering must not cause new exceptions.
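For example, hoisting a memory access above the branch that guards it can introduce an exception the original program never raises (a minimal sketch with invented names):

#include <stddef.h>

double read_guarded(const double *p)
{
    double x = 0.0;
    if (p != NULL)  /* the load below is control dependent on this
                       test; moving it above the branch would fault
                       whenever p is NULL, a new exception          */
        x = *p;
    return x;
}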

Improving performance

Dynamic scheduling (hardware loop unrolling) reduces data hazard stalls. Dynamic branch prediction reduces control stalls. Issuing multiple instructions per cycle improves the ideal CPI. Speculation reduces data and control stalls. Dynamic memory disambiguation reduces data hazard stalls involving memory. Compiler dependence analysis, software pipelining and trace scheduling improve the ideal CPI and reduce data hazard stalls. Hardware support for compiler speculation improves the ideal CPI and reduces data and control hazard stalls.

Dynamic scheduling

A dynamically scheduled processor rearranges the instruction execution order. This better handles cases where a dependence is unknown at compile time, and it simplifies the compiler at the expense of increased hardware complexity. For example:

div.d f0,f2,f4     ; stall 24 cycles
add.d f10,f0,f8    ; depends on previous instruction
sub.d f12,f8,f13   ; depends on neither of the previous instructions

In an in-order pipeline sub.d would stall behind add.d; with dynamic scheduling it can execute while the division is still in progress.

Tomasulo algorithm

The Tomasulo algorithm was first implemented in the floating-point unit of the IBM System/360 Model 91. In the Tomasulo algorithm, instructions are fetched into a FIFO queue.

The issue step takes the next instruction from the FIFO queue. If there is no structural hazard, the instruction is issued to a reservation station, either with its register values, if they are available, or with the identifiers of the reservation stations that will produce the operands. This implicitly performs register renaming and eliminates WAR and WAW hazards.

The execute step begins execution once all operands are available; until then the reservation station snoops the common data bus for the missing operands.

The write result step broadcasts the result on the common data bus to all waiting reservation stations and to the register file. Afterwards the reservation station is marked available.

Loads and stores compute their memory address as soon as the base register is available. A load then executes as soon as the memory unit is free; a store additionally waits in its reservation station for the value to be stored, unless it is immediate. Loads and stores execute in program order with respect to each other to preserve correctness.

The common data bus also behaves like a bypassing/forwarding mechanism. Instructions are issued and dispatched in order, but executed and retired out of order. Retired instructions are those whose outputs were actually used rather than computed speculatively.
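A minimal C sketch of this bookkeeping, with invented structure and field names (following the common Vj/Vk/Qj/Qk convention, where a tag of 0 means the operand value is already present); it mirrors the issue and write-result steps described above, and a station is ready to execute once both of its tags are 0:

#include <stdbool.h>

#define NUM_RS   8
#define NUM_REGS 32

/* One reservation station (hypothetical layout). A busy station
   whose qj and qk are both 0 has all operands and may execute. */
struct rs {
    bool busy;
    int op;            /* operation to perform                   */
    double vj, vk;     /* operand values, valid when qj/qk are 0 */
    int qj, qk;        /* tags of producing stations, 0 = ready  */
};

static struct rs stations[NUM_RS];
static int reg_status[NUM_REGS];  /* tag of producing station, 0 = value in file */
static double reg_file[NUM_REGS];

/* Issue: allocate a free station and fetch either the operand values
   or the tags of the stations that will produce them. Writing the
   station's tag into reg_status[rd] is the implicit renaming step. */
static bool issue(int op, int rs1, int rs2, int rd)
{
    for (int i = 0; i < NUM_RS; i++) {
        if (!stations[i].busy) {
            struct rs *r = &stations[i];
            r->busy = true;
            r->op = op;
            r->qj = reg_status[rs1];
            if (r->qj == 0)
                r->vj = reg_file[rs1];
            r->qk = reg_status[rs2];
            if (r->qk == 0)
                r->vk = reg_file[rs2];
            reg_status[rd] = i + 1;   /* rd is now named by this station */
            return true;
        }
    }
    return false;                     /* structural hazard: stall issue */
}

/* Write result: broadcast (tag, value) on the common data bus; all
   waiting stations and the register file snoop the bus. */
static void write_result(int tag, double value)
{
    for (int i = 0; i < NUM_RS; i++) {
        if (!stations[i].busy)
            continue;
        if (stations[i].qj == tag) { stations[i].vj = value; stations[i].qj = 0; }
        if (stations[i].qk == tag) { stations[i].vk = value; stations[i].qk = 0; }
    }
    for (int r = 0; r < NUM_REGS; r++)
        if (reg_status[r] == tag) { reg_file[r] = value; reg_status[r] = 0; }
    stations[tag - 1].busy = false;   /* the station becomes available */
}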

Intel Nehalem and AMD K6 used centralized reservation stations bound to the instruction window, meaning all functional units shared one set of reservation stations. This makes effective use of the available reservation stations at the expense of hardware complexity. IBM PowerPC used distributed reservation stations bound to particular functional units, which reduces the hardware complexity. Intel Pentium used a hybrid scheme: one reservation station per integer ALU, one per memory access unit and one per floating-point unit.

Register renaming

Register renaming is implemented implicitly in the Tomasulo algorithm. In the MIPS R10000 the physical register file is larger than the logical register file defined by the instruction set architecture. A mapping table associates logical registers with physical registers. When an instruction is decoded, its physical source registers are obtained from the mapping table and its destination register is obtained from a free list; the mapping table is then updated. When an instruction commits, that is, writes its result back to the register file, the processor frees the physical register that was formerly associated with its logical destination register.
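A sketch of that mechanism in C, with invented names and sizes (32 logical and 64 physical registers); rename_instr corresponds to decode and commit_instr to commit:

#define LOGICAL  32
#define PHYSICAL 64

static int map[LOGICAL];        /* logical -> physical mapping table */
static int free_list[PHYSICAL]; /* stack of free physical registers  */
static int free_top;

void rename_init(void)
{
    for (int l = 0; l < LOGICAL; l++)
        map[l] = l;             /* identity mapping at reset */
    free_top = 0;
    for (int p = LOGICAL; p < PHYSICAL; p++)
        free_list[free_top++] = p;
}

/* Rename a decoded instruction rd = rs1 op rs2; returns the physical
   register previously mapped to rd so it can be freed at commit. */
int rename_instr(int rs1, int rs2, int rd, int *ps1, int *ps2, int *pd)
{
    *ps1 = map[rs1];              /* sources come from the map table */
    *ps2 = map[rs2];
    int old_pd = map[rd];         /* remember the previous mapping   */
    *pd = free_list[--free_top];  /* destination from the free list  */
    map[rd] = *pd;                /* update the map table            */
    return old_pd;
}

/* At commit, the physical register formerly mapped to the logical
   destination becomes free for reuse. */
void commit_instr(int old_pd)
{
    free_list[free_top++] = old_pd;
}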

Example of a dual-issue superscalar without speculation

Consider the following loop:

double x[1024];
for (int i = 0; i < 1023; i++)
    x[i] = x[i] + x[i+1];

This could be compiled to the following assembly:

loop:
    l.d f0,0(r1)
    l.d f2,8(r1)
    add.d f4,f0,f2
    s.d f4,0(r1)
    add r1,r1,8
    bne r1,r9,loop

Assume a dynamically scheduled, dual-issue superscalar processor with:

  • One integer functional unit used for ALU operations and effective address calculation; load/store execution and integer execution share it.

  • A separate pipelined floating-point functional unit; floating-point execution is independent.

  • A separate functional unit to evaluate branch conditions; branch execution is independent.

  • Issue takes one cycle.

  • Write result takes one cycle.

  • Perfect branch prediction: instructions are fetched and issued, but not actually executed until the branch has completed.

  • The latency between a producing and a consuming instruction is 1 cycle for the integer ALU, 2 for loads, 3 for floating-point addition and 4 for floating-point multiplication.

  • Instructions after a branch have to wait for the branch to complete, because there is no speculation.

The schedule for the first two iterations is (a dash marks a stage the instruction does not use):

iter  instruction     issue  exec  mem  write CDB  comment
1     l.d f0,0(r1)        1     2    3          4  first issue
1     l.d f2,8(r1)        1     3    4          5  wait for ALU
1     add.d f4,f0,f2      2     6    -          9  wait for f2
1     s.d f4,0(r1)        2     4   10          -  wait for ALU, f4
1     add r1,r1,8         3     5    -          6  wait for ALU
1     bne r1,r9,loop      3     7    -          -  wait for r1
2     l.d f0,0(r1)        4     8    9         10  wait for bne
2     l.d f2,8(r1)        4     9   10         11  wait for ALU
2     add.d f4,f0,f2      5    12    -         15  wait for f2
2     s.d f4,0(r1)        5    10   16          -  wait for ALU, f4
2     add r1,r1,8         6    11    -         12  wait for ALU
2     bne r1,r9,loop      6    13    -          -  wait for r1
