# Introduction

The Goldschmidt division algorithm looks straightforward on paper, but a hardware implementation can get quite complex. In this case the Goldschmidt divider is composed of:

• Bit-vector based 4:2 counter

• Reciprocal approximation circuit

• Booth encoder based multiplier

• Registers to hold intermediate N, D and F values

• Controller circuit which counts iterations

• Glue code
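
For reference, the iteration these blocks carry out can be written as follows. The symbols N, D and F follow the register names used in the glue code, with a the dividend and b the divisor; the subscripts are added here for illustration only.

\begin{equation*} N_0 = a, \qquad D_0 = b, \qquad F_0 \approx \frac{1}{b} \end{equation*}

\begin{equation*} N_{i+1} = F_i N_i, \qquad D_{i+1} = F_i D_i, \qquad F_{i+1} = 2 - D_{i+1} \end{equation*}

Because N and D are always multiplied by the same factor, the ratio N_i / D_i stays equal to a / b, so N_i approaches the quotient as D_i approaches 1.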

# Booth encoder

The Booth encoder converts overlapping bit triplets of the multiplier into additions and subtractions of the multiplicand:

library ieee;
use ieee.std_logic_1164.all;

-- Consider multiplication of a and b:
-- d is one radix-4 Booth digit (a bit triplet) of the multiplier b
-- a is the bit vector of the multiplicand
-- an (internal signal) is the two's complement of a
-- m is the product of the Booth digit d and a
entity booth4 is
Port ( a : in  std_logic_vector (15 downto 0);
d : in  std_logic_vector (2 downto 0);
m : out std_logic_vector (16 downto 0));
end booth4;

architecture behavioral of booth4 is
component cla is
generic (N : integer := 16);
port (
a  : in  std_logic_vector (N-1 downto 0);
b  : in  std_logic_vector (N-1 downto 0);
ci : in  std_logic;
s  : out std_logic_vector (N-1 downto 0);
co : out std_logic
);
end component;

-- Signals for two's complement
signal ai : std_logic_vector(15 DOWNTO 0);
signal an : std_logic_vector(15 DOWNTO 0);

begin
-- Two's complement of a,
-- assuming that the synthesis tool takes care of
-- getting rid of duplicate two's complement circuits
ai <= not a;
cla_stage: cla generic map (N=>16)  port map(
a => ai, b => "0000000000000001", ci => '0', s => an
);

-- Dummy lookup table
with d select
m <=a(15)  & a   when "001", -- Sign expanded addition
a(15)  & a   when "010", -- Same as previous
a      & "0" when "011", -- Add double
an     & "0" when "100", -- Subtract double
an(15) & an  when "101", -- Sign expanded subtraction
an(15) & an  when "110",
"00000000000000000" when others;
end behavioral;
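
The lookup table can be read as the usual radix-4 Booth recoding. As a sketch, with the triplet written as d = (d_2, d_1, d_0) (index names chosen here for illustration), the digit value selected by the table is:

\begin{equation*} v(d) = -2 d_2 + d_1 + d_0, \qquad v(d) \in \{-2, -1, 0, 1, 2\} \end{equation*}

and the output is m = v(d) · a. For example, the multiplier value 7 (binary 0111, with an implicit 0 appended below the least significant bit) yields the triplets 110 and 011, that is the digits -1 and +2, while all higher triplets are 000:

\begin{equation*} (-1) \cdot 4^0 + 2 \cdot 4^1 = 7 \end{equation*}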

# Booth encoder based multiplier

library ieee;
use ieee.std_logic_1164.all;

entity multiplier is
port (
a : in  std_logic_vector (15 downto 0);
b : in  std_logic_vector (15 downto 0);
m : out std_logic_vector (31 downto 0)
);
end multiplier;

architecture behavioral of multiplier is
-- Booth encoder
component booth4 is
port (
a  : in  std_logic_vector (15 downto 0);
d  : in  std_logic_vector (2 downto 0);
m  : out std_logic_vector (16 downto 0)
);
end component;

-- Bit-vector based 4:2 counter
component counter42 is
generic (N : integer := 24);
port (
a  : in  std_logic_vector (N-1 downto 0);
b  : in  std_logic_vector (N-1 downto 0);
c  : in  std_logic_vector (N-1 downto 0);
d  : in  std_logic_vector (N-1 downto 0);
s  : out std_logic_vector (N-1 downto 0);
co : out std_logic_vector (N-1 downto 0)
);
end component;

component cla is
generic (N : integer := 32);
port (
a  : in  std_logic_vector (N-1 downto 0);
b  : in  std_logic_vector (N-1 downto 0);
ci : in  std_logic;
s  : out std_logic_vector (N-1 downto 0);
co : out std_logic
);
end component;

-- First Booth digit
signal d0   : std_logic_vector(2 downto 0);

-- Products
signal p0   : std_logic_vector(16 downto 0);
signal p1   : std_logic_vector(16 downto 0);
signal p2   : std_logic_vector(16 downto 0);
signal p3   : std_logic_vector(16 downto 0);
signal p4   : std_logic_vector(16 downto 0);
signal p5   : std_logic_vector(16 downto 0);
signal p6   : std_logic_vector(16 downto 0);
signal p7   : std_logic_vector(16 downto 0);

-- Sign extension
signal se0  : std_logic_vector(23 downto 0);
signal se1  : std_logic_vector(23 downto 0);
signal se2  : std_logic_vector(23 downto 0);
signal se3  : std_logic_vector(23 downto 0);
signal se4  : std_logic_vector(23 downto 0);
signal se5  : std_logic_vector(23 downto 0);
signal se6  : std_logic_vector(23 downto 0);
signal se7  : std_logic_vector(23 downto 0);

-- Intermediary sums and carries
signal s1   : std_logic_vector(23 downto 0);
signal s2   : std_logic_vector(23 downto 0);
signal co1  : std_logic_vector(23 downto 0);
signal co2  : std_logic_vector(23 downto 0);

-- Final sum and carry
signal s3   : std_logic_vector(31 downto 0);
signal co3  : std_logic_vector(31 downto 0);

signal s1p  : std_logic_vector(31 downto 0);
signal s2p  : std_logic_vector(31 downto 0);
signal co1p : std_logic_vector(31 downto 0);
signal co2p : std_logic_vector(31 downto 0);
signal co3p : std_logic_vector(31 downto 0);
begin
d0 <= b( 1 downto  0) & "0";

-- Product stages
p0_stage: booth4 port map(a=>a, d=>d0, m=>p0);
p1_stage: booth4 port map(a=>a, d=>b( 3 downto  1), m=>p1);
p2_stage: booth4 port map(a=>a, d=>b( 5 downto  3), m=>p2);
p3_stage: booth4 port map(a=>a, d=>b( 7 downto  5), m=>p3);
p4_stage: booth4 port map(a=>a, d=>b( 9 downto  7), m=>p4);
p5_stage: booth4 port map(a=>a, d=>b(11 downto  9), m=>p5);
p6_stage: booth4 port map(a=>a, d=>b(13 downto 11), m=>p6);
p7_stage: booth4 port map(a=>a, d=>b(15 downto 13), m=>p7);

-- Reduced sign extension, assuming that the synthesis tool
-- takes care of getting rid of the constants
se0 <= "0000" & (not p0(16)) & p0(16) & p0(16) & p0;
se1 <= "0001" & (not p1(16))            & p1 & "00";
se2 <= "01" & (not p2(16))            & p2 & "0000";
se3 <= not(p3(16))                  & p3 & "000000";

se4 <= "000001" & (not p4(16))                 & p4;
se5 <= "0001" & (not p5(16))            & p5 & "00";
se6 <= "01" & (not p6(16))            & p6 & "0000";
se7 <= not(p7(16))                  & p7 & "000000";

counter_stage1: counter42 port map(
a => se0, b => se1, c => se2, d => se3,
s => s1, co => co1);

counter_stage2: counter42 port map(
a => se4, b => se5, c => se6, d => se7,
s => s2, co => co2);

s1p  <= "00000001" & s1;
s2p  <= s2 & "00000000";
co1p <= "0000000" & co1 & "0";
co2p <= co2(22 downto 0) & "000000000";

counter_stage3: counter42 generic map (N=>32) port map(
a => s1p,
b => s2p,
c => co1p,
d => co2p,
s => s3,
co => co3
);

co3p <= co3(30 downto 0) & "0";

cla_stage2: cla port map(
a => s3, b => co3p, ci => '0', s => m
);
end behavioral;
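
A small self-checking testbench is one way to exercise the multiplier in simulation. The sketch below is not part of the original design: the entity name `multiplier_tb`, the operand list and the 50 ns settling time are illustrative choices, and it assumes that both operands are meant to be signed two's complement numbers, which is how radix-4 Booth recoding is normally used. It compares the output against the product computed with `ieee.numeric_std`.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench; the entity name and the operand list
-- are illustrative and not part of the original design.
entity multiplier_tb is
end multiplier_tb;

architecture sim of multiplier_tb is
  component multiplier is
    port (
      a : in  std_logic_vector (15 downto 0);
      b : in  std_logic_vector (15 downto 0);
      m : out std_logic_vector (31 downto 0)
    );
  end component;

  type int_array is array (natural range <>) of integer;
  constant OPERANDS : int_array := (0, 1, 7, 24, 344, 2064, -1, -7);

  signal a : std_logic_vector(15 downto 0) := (others => '0');
  signal b : std_logic_vector(15 downto 0) := (others => '0');
  signal m : std_logic_vector(31 downto 0);
begin
  dut: multiplier port map(a => a, b => b, m => m);

  check: process
    variable expected : signed(31 downto 0);
  begin
    for i in OPERANDS'range loop
      for j in OPERANDS'range loop
        a <= std_logic_vector(to_signed(OPERANDS(i), 16));
        b <= std_logic_vector(to_signed(OPERANDS(j), 16));
        wait for 50 ns;  -- arbitrary settling time for simulation
        -- Reference product, assuming signed operands
        expected := to_signed(OPERANDS(i), 16) * to_signed(OPERANDS(j), 16);
        assert signed(m) = expected
          report "Mismatch for " & integer'image(OPERANDS(i)) &
                 " * " & integer'image(OPERANDS(j))
          severity error;
      end loop;
    end loop;
    report "Multiplier check finished";
    wait;
  end process;
end sim;
```

Whether every pair passes also depends on the `counter42` and `cla` components, which are not shown in this section.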

# Reciprocal approximator

The reciprocal approximator circuit determines an approximate value of the reciprocal of a number, based on the position of its leading one bit:

library ieee;
use ieee.std_logic_1164.all;

entity reciprocal is
port (
b : in  std_logic_vector (15 downto 0);
r : out std_logic_vector (15 downto 0)
);
end reciprocal;

architecture behavioral of reciprocal is

begin
-- Note that this is not the most compact solution,
-- but it leaves room for experimentation.
r <= "1100000000000000" when b               = "0000000000000000" else
"0110000000000000" when b(15 downto  1) = "000000000000000" else
"0011000000000000" when b(15 downto  2) = "00000000000000" else
"0001100000000000" when b(15 downto  3) = "0000000000000" else
"0000110000000000" when b(15 downto  4) = "000000000000" else
"0000011000000000" when b(15 downto  5) = "00000000000" else
"0000001100000000" when b(15 downto  6) = "0000000000" else
"0000000110000000" when b(15 downto  7) = "000000000" else
"0000000011000000" when b(15 downto  8) = "00000000" else
"0000000001100000" when b(15 downto  9) = "0000000" else
"0000000000110000" when b(15 downto 10) = "000000" else
"0000000000011000" when b(15 downto 11) = "00000" else
"0000000000001100" when b(15 downto 12) = "0000" else
"0000000000000110" when b(15 downto 13) = "000" else
"0000000000000011" when b(15 downto 14) = "00" else
"0000000000000001" when b(15)           = '0' else
"0000000000000000";
end behavioral;
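
As a worked example, assume the operands are read as fixed-point numbers with 8 integer and 8 fractional bits. This interpretation is inferred from the bit slice 23 downto 8 and the constant used for 2 in the glue code; it is not stated explicitly in the design. The divisor 7 is then 0000011100000000, whose bits 15 downto 11 are all zero, so the table selects 0000000000011000:

\begin{equation*} r = \frac{24}{256} = 0.09375, \qquad \frac{1}{7} \approx 0.142857 \end{equation*}

The estimate is coarse because it depends only on the position of the leading one bit of b; the iterations are left to refine it.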

# Glue code

The Goldschmidt division entity glues together all of the components described above:

library ieee;
use ieee.STD_LOGIC_1164.all;
use std.textio.all;

entity goldschmidt_division is
port (
a   : in std_logic_vector (15 downto 0);   -- Dividend
b   : in std_logic_vector (15 downto 0);   -- Divisor
c   : in std_logic_vector (2 downto 0);    -- Iteration count
clk : in std_logic;
rst : in std_logic;
q   : out std_logic_vector (15 downto 0)   -- Quotient
);
end goldschmidt_division;

architecture behavioral of goldschmidt_division is
-- Booth multiplier
component multiplier is
port (
a : in  std_logic_vector (15 downto 0);
b : in  std_logic_vector (15 downto 0);
m : out std_logic_vector (31 downto 0)
);
end component;

component cla is
generic (N : integer := 16);
port (
a  : in  std_logic_vector (N-1 downto 0);
b  : in  std_logic_vector (N-1 downto 0);
ci : in  std_logic;
s  : out std_logic_vector (N-1 downto 0);
co : out std_logic
);
end component;

-- Reciprocal approximator
component reciprocal is
port (
b : in  std_logic_vector (15 downto 0);
r : out std_logic_vector (15 downto 0)
);
end component;

-- Counter register
signal counter : std_logic_vector(2 downto 0);
signal counter_decremented : std_logic_vector(2 downto 0);

-- Intermediate output signals
signal nop : std_logic_vector(31 downto 0);
signal dop : std_logic_vector(31 downto 0);
signal dos : std_logic_vector(15 downto 0);
signal don : std_logic_vector(15 downto 0);

-- Current iteration outputs
signal no : std_logic_vector(15 downto 0);
signal do : std_logic_vector(15 downto 0);
signal fo : std_logic_vector(15 downto 0);

-- Previous iteration registers
signal np : std_logic_vector(15 downto 0);
signal dp : std_logic_vector(15 downto 0);
signal fp : std_logic_vector(15 downto 0);

-- Initial reciprocal
signal ir : std_logic_vector(15 downto 0);

begin
-- Dump output process
dump_proc: process
file OUTPUT_FILE : text open write_mode is "dump.txt";
variable output_line: LINE;
begin
for i in 1 to 10 loop
wait until rising_edge(clk);
write(output_line,to_bitvector(no));
writeline(output_file,output_line);
end loop;
end process;

-- Registers for N, D and F
sync_process: process(clk, rst)
begin
if (rst = '1') then
counter <= c;
np <= a;
dp <= b;
fp <= ir;
elsif rising_edge(clk) then
counter <= counter_decremented;
np <= no;
dp <= do;
fp <= fo;
end if;
end process;

-- Have initial reciprocal always ready to go
initial_reciprocal: reciprocal port map(b=>b, r=>ir);

-- Decrement iteration counter
counter_decremented <=
"110" when counter = "111" else
"101" when counter = "110" else
"100" when counter = "101" else
"011" when counter = "100" else
"010" when counter = "011" else
"001" when counter = "010" else
"000";

-- Multiply input F and input N
n_stage: multiplier port map(a=>fp, b=>np, m=>nop);

-- Multiply input F and input D
d_stage: multiplier port map(a=>fp, b=>dp, m=>dop);

-- Truncate output N to 16 bits
no <= nop(23 downto 8);

-- Truncate output D to 16 bits
dos <= dop(23 downto 8);

-- Invert output D; the carry-in of the adder below
-- completes the two's complement
don <= not dos;

-- Subtract D from 2
f_stage: cla port map(a=>"0000001000000000", b=>don, ci=>'1', s=>fo);
do <= dos;

-- Bind outputs
q <= no;

end behavioral;

# Conclusion

The main drawback of such a simplistic design is that the intermediate multiplication results are truncated, so precision is lost on every iteration. In this case the division of 86 by 7 should give approximately 12.285714(285714), but after the 4th iteration the precision loss starts to affect the result and it drifts away from that value:

 8.062500000
10.832031250
12.140625000
12.328125000
12.375000000
12.421875000
12.468750000
12.515625000
12.562500000
12.609375000
12.656250000
12.703125000
12.750000000
12.796875000
12.843750000
12.890625000
12.937500000
12.984375000
13.031250000
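
The steady drift can be traced by hand. The sketch below assumes that the registers hold fixed-point values with 8 integer and 8 fractional bits (inferred from the slice 23 downto 8 in the glue code) and that each product is truncated. Under these assumptions a hand calculation of the 86 / 7 example reproduces the listed values, and D stops improving once it reaches 255/256:

\begin{equation*} D = \frac{255}{256}, \qquad F = 2 - D = \frac{257}{256}, \qquad D' = \frac{1}{256} \left\lfloor \frac{255 \cdot 257}{256} \right\rfloor = \frac{255}{256} \end{equation*}

From then on F stays at 257/256, so every further iteration multiplies N by slightly more than 1. Writing N_int = 256 N for the raw register value, with N close to 12:

\begin{equation*} N' - N = \frac{1}{256} \left\lfloor \frac{N_{int}}{256} \right\rfloor = \frac{12}{256} = 0.046875 \end{equation*}

which matches the constant step between consecutive values from 12.328125 onward.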

On a Xilinx Spartan 3E XC3S500E the latency of this circuit was 21.872 ns, which means that we should be able to clock it at a frequency of up to:

\begin{equation*} \frac{1}{21.872\,\mathrm{ns}} \approx 45.7\,\mathrm{MHz} \end{equation*}

The user has to supply the correct number of iterations to the circuit and delay reading the output by that number of cycles.
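
A testbench along the following lines could be used to drive the divider and watch the output settle cycle by cycle. It is a minimal sketch, not the setup that produced the numbers above: the entity name `goldschmidt_division_tb`, the 25 ns clock period and the constant `ITERATIONS` are illustrative, and the operand encodings assume 8 integer and 8 fractional bits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench; entity name, clock period and cycle count
-- are illustrative choices, not part of the original design.
entity goldschmidt_division_tb is
end goldschmidt_division_tb;

architecture sim of goldschmidt_division_tb is
  component goldschmidt_division is
    port (
      a   : in  std_logic_vector (15 downto 0);
      b   : in  std_logic_vector (15 downto 0);
      c   : in  std_logic_vector (2 downto 0);
      clk : in  std_logic;
      rst : in  std_logic;
      q   : out std_logic_vector (15 downto 0)
    );
  end component;

  constant CLK_PERIOD : time    := 25 ns;  -- longer than the 21.872 ns latency
  constant ITERATIONS : natural := 4;

  signal a, b, q : std_logic_vector(15 downto 0);
  signal c       : std_logic_vector(2 downto 0);
  signal clk     : std_logic := '0';
  signal rst     : std_logic := '1';
  signal done    : boolean   := false;
begin
  dut: goldschmidt_division port map(
    a => a, b => b, c => c, clk => clk, rst => rst, q => q
  );

  -- Free-running clock that stops once the stimulus process is done
  clk <= not clk after CLK_PERIOD / 2 when not done else '0';

  stimulus: process
  begin
    a   <= x"5600";  -- 86, assuming 8 integer and 8 fractional bits
    b   <= x"0700";  -- 7, same assumption
    c   <= std_logic_vector(to_unsigned(ITERATIONS, 3));
    rst <= '1';
    wait until falling_edge(clk);
    rst <= '0';

    -- Print q once per cycle, sampled half a period after each rising edge
    for i in 1 to ITERATIONS loop
      wait until falling_edge(clk);
      report "cycle " & integer'image(i) & ": q = " &
             integer'image(to_integer(unsigned(q))) & " / 256";
    end loop;

    done <= true;
    wait;
  end process;
end sim;
```

Note that the dump process inside the divider also writes `dump.txt` while this testbench runs.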
