Single Instruction, Multiple-Data

01. Mar '14

Introduction

Single Instruction, Multiple Data, or SIMD for short, means that the processor applies one instruction to several operands at the same time. The Intel x86 architecture initially did not have any SIMD instructions; several extensions were added over time.

MMX

MMX (MultiMedia eXtensions) was introduced in Pentium processors released in 1997 [1]. It defined eight 64-bit registers MM0 .. MM7, which were actually aliases of the x87 FPU stack registers [2]. Because operating systems already saved the x87 state, context switches preserved the MMX register state as well. The instructions allowed boolean logic, addition padd{b,w,d} and subtraction psub{b,w,d} of (un)signed integers packed as 2x 32-bit doublewords, 4x 16-bit words or 8x 8-bit bytes. Multiplication is handled a bit differently: each 16-bit by 16-bit product is 32 bits wide, so the result register is filled with either the higher (pmulhw) or the lower (pmullw) 16 bits of each product, depending on the instruction. In MMX, data accesses should be aligned to 64 bits to avoid performance penalties.
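The lane-wise behaviour described above can be sketched in plain C. This is only a scalar model, not MMX code: the function names model_paddb and model_pmul are made up for illustration, and lane 0 is assumed to be the least significant element of the register.

```c
#include <stdint.h>

/* Scalar model of paddb: eight independent 8-bit additions.
 * Each lane wraps around on overflow; no carry crosses into the next lane. */
void model_paddb(uint8_t dst[8], const uint8_t src[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}

/* Scalar model of pmullw and pmulhw on 4x 16-bit signed words.
 * Every product is 32 bits wide; lo[] receives the bits pmullw would keep,
 * hi[] the bits pmulhw would keep. Results are stored as raw bit patterns. */
void model_pmul(const int16_t a[4], const int16_t b[4],
                uint16_t lo[4], uint16_t hi[4])
{
    for (int i = 0; i < 4; i++) {
        uint32_t p = (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
        lo[i] = (uint16_t)(p & 0xFFFFu);
        hi[i] = (uint16_t)(p >> 16);
    }
}
```

For example, 1000 * 1000 = 1000000 = 0x000F4240, so the low half is 0x4240 and the high half is 0x000F; the full product can be rebuilt by combining the two results.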

MMX also introduced saturating arithmetic (padds{b,w}, paddus{b,w}) and packing instructions (packsswb, packssdw, packuswb), eg. packssdw packs 2x 32-bit and 2x 32-bit operands into 4x 16-bit operands. Packing allows converting from wider datatypes to narrower ones while clipping (saturating) the value to the narrower range. Unpacking isn't exactly doing the opposite; instead, unpacking (punpckl*, punpckh*) interleaves the elements from the lower or higher halves of two operands into one result, which can then be read as wider elements.
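To make packing and unpacking more concrete, here is a minimal scalar sketch in C of what packssdw and punpcklbw compute. The helper names (saturate_i16, model_packssdw, model_punpcklbw) are hypothetical, the first operand plays the role of the destination register, and lane 0 is assumed to be the least significant element.

```c
#include <stdint.h>

/* Clamp a 32-bit value into the signed 16-bit range (signed saturation). */
int16_t saturate_i16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of packssdw: two 2x 32-bit operands become one 4x 16-bit
 * result; the first operand fills the low lanes, the second the high lanes. */
void model_packssdw(const int32_t a[2], const int32_t b[2], int16_t out[4])
{
    out[0] = saturate_i16(a[0]);
    out[1] = saturate_i16(a[1]);
    out[2] = saturate_i16(b[0]);
    out[3] = saturate_i16(b[1]);
}

/* Scalar model of punpcklbw: interleave the low four bytes of a and b.
 * out must not alias a or b. */
void model_punpcklbw(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}
```

Note that the pack clips 70000 to 32767 instead of truncating it, while the unpack only rearranges bytes and never changes their values.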

Padding with zeros (zero extension) is quite easy: unpack the operand with another operand filled with zeros. Sign extension is a bit more complicated: the second operand has to be filled with copies of the sign bit, which an arithmetic right shift can produce (eg. psraw mm7,15). Unpacking can also be used to broadcast operands, that is, copying one operand to each slot [3].
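The three unpacking tricks can be modelled in scalar C as follows. This is a sketch under assumptions, not real MMX code: registers are represented as arrays of four 16-bit words with lane 0 least significant, and all function names are invented for this example.

```c
#include <stdint.h>

/* Scalar model of punpcklwd: interleave the low two words of a and b.
 * out must not alias a or b. */
void model_punpcklwd(const uint16_t a[4], const uint16_t b[4], uint16_t out[4])
{
    out[0] = a[0]; out[1] = b[0];
    out[2] = a[1]; out[3] = b[1];
}

/* Scalar model of punpckldq: interleave the low doubleword of a and b. */
void model_punpckldq(const uint16_t a[4], const uint16_t b[4], uint16_t out[4])
{
    out[0] = a[0]; out[1] = a[1];
    out[2] = b[0]; out[3] = b[1];
}

/* Scalar model of psraw by 15: every word becomes 0x0000 or 0xFFFF,
 * depending on its sign bit. */
void model_psraw15(const uint16_t a[4], uint16_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (a[i] & 0x8000u) ? 0xFFFFu : 0x0000u;
}

/* Read words 2k and 2k+1 as one 32-bit doubleword (low word first). */
uint32_t dword_at(const uint16_t w[4], int k)
{
    return (uint32_t)w[2 * k] | ((uint32_t)w[2 * k + 1] << 16);
}
```

With a = {0xFFFE, 0x0005, ...}, unpacking a with zeros yields the doublewords 0x0000FFFE and 0x00000005 (zero extension), unpacking a with model_psraw15(a) yields 0xFFFFFFFE and 0x00000005 (sign extension), and model_punpcklwd(a, a) followed by model_punpckldq on that result with itself copies word 0 into all four slots (broadcast).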

1: MMX (instruction set)
2: MMX Instruction Set
3: x64 assembly and C++ Tutorial 44: MMX Packing and Unpacking

Streaming SIMD Extensions

SSE (Streaming SIMD Extensions) was added to the Pentium III, released in 1999 [4]. It contains 70 new instructions, most of which operate on single-precision floating-point data, and it added eight new 128-bit registers known as XMM0 .. XMM7. Later the 64-bit counterpart amd64 added another eight registers XMM8 .. XMM15, which are accessible only in 64-bit mode. SSE2 added more instructions which made it possible to operate on 2x 64-bit double-precision floating-point numbers, 2x 64-bit integers, 4x 32-bit integers, 8x 16-bit short integers or 16x 8-bit characters. The OS has to be aware of the extensions so that the register states are preserved over context switches. SSE3 added DSP-oriented mathematical instructions. SSSE3 is an incremental upgrade to SSE3 adding 16 new instructions. SSE4 added dot product instructions and additional integer instructions.
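One way to picture the different element layouts is a C union in which every member views the same 16 bytes. This is only an illustrative model: the type xmm_model_t and the function model_paddd are hypothetical names, and the layout assumes a mainstream platform where float is 32 bits and double is 64 bits.

```c
#include <stdint.h>

/* Model of one 128-bit XMM register; every member views the same 16 bytes. */
typedef union {
    float    f32[4];   /* 4x single-precision floats (SSE)  */
    double   f64[2];   /* 2x double-precision floats (SSE2) */
    int64_t  i64[2];   /* 2x 64-bit integers                */
    int32_t  i32[4];   /* 4x 32-bit integers                */
    int16_t  i16[8];   /* 8x 16-bit short integers          */
    uint8_t  u8[16];   /* 16x 8-bit characters              */
} xmm_model_t;

/* Scalar model of a packed 32-bit integer addition (paddd on XMM registers):
 * four independent additions, each wrapping around on overflow. */
xmm_model_t model_paddd(xmm_model_t a, xmm_model_t b)
{
    xmm_model_t r;
    for (int i = 0; i < 4; i++)
        r.i32[i] = (int32_t)((uint32_t)a.i32[i] + (uint32_t)b.i32[i]);
    return r;
}
```

The instruction chooses the interpretation, not the register: the same 128 bits can be treated as four floats by one instruction and as sixteen bytes by the next.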

4: Streaming SIMD Extension

Advanced Vector Extensions

AVX (Advanced Vector Extensions) were proposed by Intel in March 2008. AVX widened the SSE registers to 256 bits and named them YMM0 .. YMM15; the XMM registers are the lower 128-bit halves of the corresponding YMM registers, so the SSE registers are a subset of AVX. AVX2, also known as Haswell New Instructions, expanded most SSE and AVX integer instructions to 256 bits. AVX-512 is scheduled to be supported in 2015; it extends the registers to 512 bits and doubles their number in 64-bit mode, giving XMM0 .. XMM31, YMM0 .. YMM31 and ZMM0 .. ZMM31.

Sample

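Consider the following C snippet; the operands are unsigned 8-bit integers:

#include <stdint.h>

uint8_t img_in1[SIZE], img_in2[SIZE], img_out[SIZE];

/* Copy image 1, but take the pixel from image 2 wherever image 1 exceeds 200 */
for (int i = 0; i < SIZE; i++) {
    img_out[i] = img_in1[i];
    if (img_in1[i] > 200)
        img_out[i] = img_in2[i];
}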

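Assuming a pseudo-architecture that combines MIPS64 with MMX extensions, and a hypothetical 8-byte constant named threshold that holds 200 in every byte:

add r1,r0,r0            ; Initialize counter
ld r2,SIZE(r0)          ; Load SIZE
ld f6,threshold(r0)     ; Load 8 bytes, each holding 200 (hypothetical constant)
loop:
    ld f1,img_in1(r1)   ; Load 8 bytes of input image 1
    ld f2,img_in2(r1)   ; Load 8 bytes of input image 2
    mov.d f3,f1         ; Copy F1 to F3
    pcmpgtb f3,f3,f6    ; Bit mask to F3: 0xFF where F1 > 200 (assumes an unsigned compare; x86 pcmpgtb is signed)
    mov.d f4,f2         ; Copy F2 to F4
    pand f4,f4,f3       ; Select values of F2 by bit mask
    mov.d f5,f1         ; Copy F1 to F5
    pandn f5,f5,f3      ; Select values of F1 by negated bit mask
    por f5,f5,f4        ; Merge values
    sd f5,img_out(r1)   ; Store
    addi r1,r1,8        ; Advance to the next 8 bytes
    bne r1,r2,loop      ; Repeat until SIZE bytes are done (SIZE assumed to be a multiple of 8)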
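For comparison, the same mask-and-merge selection can be sketched in C with SSE2 intrinsics, processing 16 bytes per iteration. This is one possible formulation under stated assumptions, not the only one: the function name select_above_200 is made up, unaligned loads are used so the buffers need no special alignment, and a plain scalar loop handles the tail (and the whole array when SSE2 is not available). SSE2 only offers a signed byte greater-than compare, so the unsigned test "a > 200" is expressed here as "max(a, 201) == a".

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* out[i] = in1[i] > 200 ? in2[i] : in1[i], for i in [0, n). */
void select_above_200(const uint8_t *in1, const uint8_t *in2,
                      uint8_t *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i limit = _mm_set1_epi8((char)201);
    for (; i + 16 <= n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(in1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(in2 + i));
        /* Unsigned max(a, 201) equals a exactly where a >= 201, i.e. a > 200;
         * the compare turns those lanes into 0xFF and the others into 0x00. */
        __m128i mask = _mm_cmpeq_epi8(_mm_max_epu8(a, limit), a);
        /* Take b where the mask is set, a elsewhere, then merge. */
        __m128i res = _mm_or_si128(_mm_and_si128(mask, b),
                                   _mm_andnot_si128(mask, a));
        _mm_storeu_si128((__m128i *)(out + i), res);
    }
#endif
    /* Scalar tail, also the fallback when SSE2 is not available. */
    for (; i < n; i++)
        out[i] = (in1[i] > 200) ? in2[i] : in1[i];
}
```

A call such as select_above_200(img_in1, img_in2, img_out, SIZE) would then replace the scalar loop from the C snippet. The branch inside the loop disappears because both candidates are computed for every byte and the mask decides which one survives.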

Tags: MMX, SIMD, SSE, computer architecture, TU Berlin