Goldschmidt division implementation12. Dec '13

Introduction

The Goldschmidt division algorithm may seem straightforward, but the implementation might get quite complex. In this case Goldschmidt divisor is composed of:

Bit-vector based 4:2 counter
Carry lookahead adder
Reciprocal approximation circuit
Radix-4 Booth encoder
Booth encoder based multiplier
Registers to hold intermediate N, D and F values
Controller circuit which counts iterations
Glue code

Adders and counters are not discussed in this article since they've been covered in various others.

Booth encoder

The Booth encoder converts multiplier bit triplets into addition and subtraction operations of multiplicand:

library ieee;
use ieee.std_logic_1164.all;

-- Consider multiplication of a and b:
-- d is radix-4 booth encoded bits of digit of the multiplier b
-- a is the bit vector of multiplicand
-- n is the 2's complement of a
-- m is the multiplication of d and a
entity booth4 is
    Port ( a : in  std_logic_vector (15 downto 0);
           d : in  std_logic_vector (2 downto 0);
           m : out std_logic_vector (16 downto 0));
end booth4;

architecture behavioral of booth4 is
    component cla is
        generic (N : integer := 16);
        port (
            a  : in  std_logic_vector (N-1 downto 0);
            b  : in  std_logic_vector (N-1 downto 0);
            ci : in  std_logic;
            s  : out std_logic_vector (N-1 downto 0);
            co : out std_logic
        );
    end component;

    -- Signals for two's complement
    signal ai : std_logic_vector(15 DOWNTO 0);
    signal an : std_logic_vector(15 DOWNTO 0);

begin
    -- Two's complement of a,
    -- assuming that of course compiler takes care of
    -- getting rid of duplicate two's complement circuits
    ai <= not a;
    cla_stage: cla generic map (N=>16)  port map(
      a => ai, b => "0000000000000001", ci => '0', s => an
    );

    -- Dummy lookup table
    with d select
        m <=a(15)  & a   when "001", -- Sign expanded addition
            a(15)  & a   when "010", -- Same as previous
            a      & "0" when "011", -- Add double
            an     & "0" when "100", -- Subtract double
            an(15) & an  when "101", -- Sign expanded subtraction
            an(15) & an  when "110",
            "00000000000000000" when others;
end behavioral;

Booth encoder based multiplier

library ieee;
use ieee.std_logic_1164.all;

entity multiplier is
    port (
        a : in  std_logic_vector (15 downto 0);
        b : in  std_logic_vector (15 downto 0);
        m : out std_logic_vector (31 downto 0)
    );
end multiplier;

architecture behavioral of multiplier is
    -- Booth encoder
    component booth4 is
        port (
            a  : in  std_logic_vector (15 downto 0);
            d  : in  std_logic_vector (2 downto 0);
            m  : out std_logic_vector (16 downto 0)
        );
    end component;

    -- Bit-vector based 4:2 counter
    component counter42 is
        generic (N : integer := 24);
        port (
            a  : in  std_logic_vector (N-1 downto 0);
            b  : in  std_logic_vector (N-1 downto 0);
            c  : in  std_logic_vector (N-1 downto 0);
            d  : in  std_logic_vector (N-1 downto 0);
            s  : out std_logic_vector (N-1 downto 0);
            co : out std_logic_vector (N-1 downto 0)
        );
    end component;

    -- Carry lookahead adder
    component cla is
        generic (N : integer := 32);
        port (
            a  : in  std_logic_vector (N-1 downto 0);
            b  : in  std_logic_vector (N-1 downto 0);
            ci : in  std_logic;
            s  : out std_logic_vector (N-1 downto 0);
            co : out std_logic
        );
    end component;

    -- First Booth digit
    signal d0   : std_logic_vector(2 downto 0);

    -- Products
    signal p0   : std_logic_vector(16 downto 0);
    signal p1   : std_logic_vector(16 downto 0);
    signal p2   : std_logic_vector(16 downto 0);
    signal p3   : std_logic_vector(16 downto 0);
    signal p4   : std_logic_vector(16 downto 0);
    signal p5   : std_logic_vector(16 downto 0);
    signal p6   : std_logic_vector(16 downto 0);
    signal p7   : std_logic_vector(16 downto 0);

    -- Sign extension
    signal se0  : std_logic_vector(23 downto 0);
    signal se1  : std_logic_vector(23 downto 0);
    signal se2  : std_logic_vector(23 downto 0);
    signal se3  : std_logic_vector(23 downto 0);
    signal se4  : std_logic_vector(23 downto 0);
    signal se5  : std_logic_vector(23 downto 0);
    signal se6  : std_logic_vector(23 downto 0);
    signal se7  : std_logic_vector(23 downto 0);

    -- Intermediary sums and carries
    signal s1   : std_logic_vector(23 downto 0);
    signal s2   : std_logic_vector(23 downto 0);
    signal co1  : std_logic_vector(23 downto 0);
    signal co2  : std_logic_vector(23 downto 0);

    -- Final sum and carry
    signal s3   : std_logic_vector(31 downto 0);
    signal co3  : std_logic_vector(31 downto 0);

    -- Padded sums abd carries
    signal s1p  : std_logic_vector(31 downto 0);
    signal s2p  : std_logic_vector(31 downto 0);
    signal co1p : std_logic_vector(31 downto 0);
    signal co2p : std_logic_vector(31 downto 0);
    signal co3p : std_logic_vector(31 downto 0);
begin
    -- Pad first booth digit
    d0 <= b( 1 downto  0) & "0";

    -- Product stages
    p0_stage: booth4 port map(a=>a, d=>d0, m=>p0);
    p1_stage: booth4 port map(a=>a, d=>b( 3 downto  1), m=>p1);
    p2_stage: booth4 port map(a=>a, d=>b( 5 downto  3), m=>p2);
    p3_stage: booth4 port map(a=>a, d=>b( 7 downto  5), m=>p3);
    p4_stage: booth4 port map(a=>a, d=>b( 9 downto  7), m=>p4);
    p5_stage: booth4 port map(a=>a, d=>b(11 downto  9), m=>p5);
    p6_stage: booth4 port map(a=>a, d=>b(13 downto 11), m=>p6);
    p7_stage: booth4 port map(a=>a, d=>b(15 downto 13), m=>p7);

    -- Reduced sign extension, assuming that compiler takes
    -- care of getting rid of constants of course
    se0 <= "0000" & (not p0(16)) & p0(16) & p0(16) & p0;
    se1 <= "0001" & (not p1(16))            & p1 & "00";
    se2 <= "01" & (not p2(16))            & p2 & "0000";
    se3 <= not(p3(16))                  & p3 & "000000";

    se4 <= "000001" & (not p4(16))                 & p4;
    se5 <= "0001" & (not p5(16))            & p5 & "00";
    se6 <= "01" & (not p6(16))            & p6 & "0000";
    se7 <= not(p7(16))                  & p7 & "000000";

    -- Add first 4 products
    counter_stage1: counter42 port map(
        a => se0, b => se1, c => se2, d => se3,
        s => s1, co => co1);

    -- Add remaining 4 products
    counter_stage2: counter42 port map(
        a => se4, b => se5, c => se6, d => se7,
        s => s2, co => co2);

    -- Pad carries
    s1p  <= "00000001" & s1;
    s2p  <= s2 & "00000000";
    co1p <= "0000000" & co1 & "0";
    co2p <= co2(22 downto 0) & "000000000";

    -- Add shifted operands
    counter_stage3: counter42 generic map (N=>32) port map(
        a => s1p,
        b => s2p,
        c => co1p,
        d => co2p,
        s => s3,
        co => co3
    );

    -- Pad carries
    co3p <= co3(30 downto 0) & "0";

    -- Add sum and carries
    cla_stage2: cla port map(
        a => s3, b => co3p, ci => '0', s => m
    );
end behavioral;

Reciprocal approximator

The reciprocal approximator circuit determines approximate value of the inverse of a number.

library ieee;
use ieee.std_logic_1164.all;

entity reciprocal is
    port (
        b : in  std_logic_vector (15 downto 0);
        r : out std_logic_vector (15 downto 0)
    );
end reciprocal;

architecture behavioral of reciprocal is

begin
    -- Note that this is not the most compact solution,
    -- but it leaves room for experimentation.
    r <= "1100000000000000" when b               = "0000000000000000" else
         "0110000000000000" when b(15 downto  1) = "000000000000000" else
         "0011000000000000" when b(15 downto  2) = "00000000000000" else
         "0001100000000000" when b(15 downto  3) = "0000000000000" else
         "0000110000000000" when b(15 downto  4) = "000000000000" else
         "0000011000000000" when b(15 downto  5) = "00000000000" else
         "0000001100000000" when b(15 downto  6) = "0000000000" else
         "0000000110000000" when b(15 downto  7) = "000000000" else
         "0000000011000000" when b(15 downto  8) = "00000000" else
         "0000000001100000" when b(15 downto  9) = "0000000" else
         "0000000000110000" when b(15 downto 10) = "000000" else
         "0000000000011000" when b(15 downto 11) = "00000" else
         "0000000000001100" when b(15 downto 12) = "0000" else
         "0000000000000110" when b(15 downto 13) = "000" else
         "0000000000000011" when b(15 downto 14) = "00" else
         "0000000000000001" when b(15)           = '0' else
         "0000000000000000";
end behavioral;

Glue code

Goldschmidt division algorithm glues together all previously learned components:

library ieee;
use ieee.STD_LOGIC_1164.all;
use std.textio.all;

entity goldschmidt_division is
    port (
        a   : in std_logic_vector (15 downto 0);   -- Dividend
        b   : in std_logic_vector (15 downto 0);   -- Divisor
        c   : in std_logic_vector (2 downto 0);    -- Iteration count
        clk : in std_logic;
        rst : in std_logic;
        q   : out std_logic_vector (15 downto 0)   -- Quotinent
    );
end goldschmidt_division;

architecture behavioral of goldschmidt_division is
    -- Booth multiplier
    component multiplier is
        port (
            a : in  std_logic_vector (15 downto 0);
            b : in  std_logic_vector (15 downto 0);
            m : out std_logic_vector (31 downto 0)
        );
    end component;

    -- Carry lookahead adder
    component cla is
        generic (N : integer := 16);
        port (
            a  : in  std_logic_vector (N-1 downto 0);
            b  : in  std_logic_vector (N-1 downto 0);
            ci : in  std_logic;
            s  : out std_logic_vector (N-1 downto 0);
            co : out std_logic
        );
    end component;

    -- Reciprocal approximator
    component reciprocal is
        port (
            b : in  std_logic_vector (15 downto 0);
            r : out std_logic_vector (15 downto 0)
        );
    end component;

    -- Counter register
    signal counter : std_logic_vector(2 downto 0);
    signal counter_decremented : std_logic_vector(2 downto 0);

    -- Intermediate output signals
    signal nop : std_logic_vector(31 downto 0);
    signal dop : std_logic_vector(31 downto 0);
    signal dos : std_logic_vector(15 downto 0);
    signal don : std_logic_vector(15 downto 0);

    -- Current iteration outputs
    signal no : std_logic_vector(15 downto 0);
    signal do : std_logic_vector(15 downto 0);
    signal fo : std_logic_vector(15 downto 0);

    -- Previous iteration registers
    signal np : std_logic_vector(15 downto 0);
    signal dp : std_logic_vector(15 downto 0);
    signal fp : std_logic_vector(15 downto 0);

    -- Initial reciprocal
    signal ir : std_logic_vector(15 downto 0);

begin
    -- Dump output process
    dump_proc: process
    file OUTPUT_FILE : text open write_mode is "dump.txt";
    variable output_line: LINE;
    begin
            for i in 1 to 10 loop
            wait until rising_edge(clk);
            write(output_line,to_bitvector(no));
            writeline(output_file,output_line);
            end loop;
    end process;

    -- Registers for N, D and F
    sync_process: process(clk, counter)
    begin
        if (rst = '1') then
            counter <= c;
            np <= a;
            dp <= b;
            fp <= ir;
        elsif rising_edge(clk) then
            counter <= counter_decremented;
            np <= no;
            dp <= do;
            fp <= fo;
        end if;
    end process;

    -- Have initial reciprocal always ready to go
    initial_reciprocal: reciprocal port map(b=>b, r=>ir);

    -- Decrement iteration counter
    counter_decremented <=
        "110" when counter = "111" else
        "101" when counter = "110" else
        "100" when counter = "101" else
        "011" when counter = "100" else
        "010" when counter = "011" else
        "001" when counter = "010" else
        "000";

    -- Multiply input F and input N
    n_stage: multiplier port map(a=>fp, b=>np, m=>nop);

    -- Multiply input F and input D
    d_stage: multiplier port map(a=>fp, b=>dp, m=>dop);

    -- Round output N
    no <= nop(23 downto 8);

    -- Round output D
    dos <= dop(23 downto 8);

    -- Find two's complement of output D
    don <= not dos;

    -- Subtract D from 2
    f_stage: cla port map(a=>"0000001000000000", b=>don, ci=>'1', s=>fo);
    do <= dos;

    -- Bind outputs
    q <= no;

end behavioral;

Conclusion

The main drawbacks of such simplistic design is that the intermediate multiplication results are truncated and there is precision loss for every iteration. In this case the division of 86 by 7 should result in number approximately 12.285714(285714), but after 4th iteration the precision loss starts to affect the result and it deviates from that number:

 8.062500000
10.832031250
12.140625000
12.328125000
12.375000000
12.421875000
12.468750000
12.515625000
12.562500000
12.609375000
12.656250000
12.703125000
12.750000000
12.796875000
12.843750000
12.890625000
12.937500000
12.984375000
13.031250000

On Xilinx Spartan 3E XC3S500E the latency of this circuit was 21.872ns which means that we should be able to operate it on base frequency of:

\begin{equation*} \frac{1}{21.872ns} \approx 45.7MHz \end{equation*}

The user of course has to take care of feeding the correct number of iterations to the circuit and delaying the output reading operation by that amount of cycles.

Goldschmidt computer arithmetic VHDL division TU Berlin