Getting started with Zynq-7000 boards

1.1 Introduction

Studying FPGA development is rather expensive process and so far the cheapest way to get started in my opinion is to get equipped with Digilent ZYBO development board, OV7670 camera, Salea logic analyzer, a breadboard, light emitting diodes, segment displays and some 220 ohm resistors. If you get lucky you can push it down to 200€ budget.

ZYBO is a Zynq-7000 SoC based board from Digilent. This is a little HOWTO for getting started with ZYBO, but this should serve well anyone who wants to get on the track with Zynq-7000 based boards such as Zedboard, Microzed, ZC702 evaluation board etc.

img/zybo-board.jpg

ZYBO is a Zynq-7000 SoC based entry level board.

Previously I attempted to outline detailed process of creating hardware design for ZYBO using Vivado, but as it turned out to consume significant amount of time spent on learning vendor specific stuff I decided to take the shortcut. Digilent provides ZYBO Base System Design 1 which includes prepared high level block design and IP cores for VGA output, HDMI output, and audio codec input/output.

As it is easier to remove useless blocks than to create block design from scratch this is recommended course of action. Base designs should be available for all Zynq-7000 boards from the corresponding hardware vendors. Otherwise you have to follow pretty complex guide to set up the high level block design from scratch 2.

1(1,2)

ZYBO Base System

2

http://zedboard.org/content/creating-base-zynq-design-vivado-ipi-20132

1.2 Installing Vivado

Bitstream is a 2MB binary file which configures the programmable logic of ZYBO. Unfortunately the bitstream file can currently be compiled only with proprietary tools from Xilinx. When you're purchasing ZYBO from Digilent you have option to get 20 USD accessories kit for ZYBO which includes Vivado voucher. With the Voucher you get license for Vivado and it's updates for a year since the date of purchase. You can download Vivado at Xilinx web 3, just follow release notes to install Vivado and the license file Xilinx.lic. Note that license will be bound to the MAC address of the particular machine.

3

http://www.xilinx.com/support/download.html

1.3 Working around Vivado quirks

If you encounter floating point formatting exception try disabling locales by appeding following to Vivado's settings64.sh:

export LANG=C

If you bump into NullPointerException while attempting to run Tools → Create Package and IP try to:

  • Open from main menu ToolsProject settings

  • Click on IP

  • Select Packager tab

  • Make sure Vendor field value if anything but (none)

1.4 Preparing bitstream

ZYBO Base System Design 1 includes all you need to compile the bitstream file for ZYBO. Uncompress the .zip file and open the project file sourcevivadohwzybo_bsdzybo_bsd.xpr with Vivado.

With newer Vivado versions you get notifications that IP blocks can be upgraded. Follow instructions on the screen to update the components. Once you have finished with the hardware design click on the lefthand panel Program and DebugGenerate Bitstream this will generate the 2MB blob that represents your FPGA configuration.

1.5 Preparing microSD card

The easiest is to use Xillinux as a starting point 4. Grab the Xillinux image and dump it to your microSD card. Note that previous contents of the card will be lost:

umount /dev/mmcblk0p*
wget -c http://xillybus.com/downloads/xillinux-1.3.img.gz
zcat xillinux-1.3.img.gz > /dev/mmcblk0
partprobe

Now you have your memory card populated with FAT32 filesystem for u-boot and ext4 filesystem which hosts Ubuntu 12.04 root filesystem. Next step is to place device tree definition file and first stage bootloader to the first (FAT32) filesystem:

wget -c http://xillybus.com/downloads/xillinux-eval-zybo-1.3v.zip
unzip xillinux-eval-zybo-1.3v.zip
mkdir /mnt/xillyboot
mount /dev/mmcblk0p1 /mnt/xillyboot
cp xillinux-eval-zybo-1.3v/bootfiles/boot.bin /mnt/xillyboot
cp xillinux-eval-zybo-1.3v/bootfiles/devicetree.dtb /mnt/xillyboot
cp path/to/zybo_base_system/source/vivado/hw/zybo_bsd/zybo_bsd.runs/impl_1/system_wrapper.bit /mnt/xillyboot/xillydemo.bit
umount /mnt/xillyboot
sync
4

Xillinux: A Linux distribution for Zedboard, ZyBo, MicroZed and SocKit

1.6 Booting Zynq-7000 board

In order to boot ZYBO the first (FAT32) filesystem has to contain:

  • boot.bin, the first stage bootloader shipper by Xillinux.

  • xillydemo.bit, the bitstream file produced by Vivado.

  • devicetree.dtb is the device tree which should be generated by Vivado but I currently can't seem to find a way to do it there. Instead I rely on the one provided by Xillinux.

  • uImage, the Linux kernel shipped by Xillinux.

Xillinux provided image has u-boot embedded to the the beginning of the card image which use the files mentioned above to set up hardware using the bitstream file, load kernel to RAM, load device tree to RAM and continue booting the kernel. Once kernel is up and running the kernel looks for root filesystem on the second partition which contains ext4 filesystem. If root filesystem is mounted successfully, the /sbin/init binary is invoked to continue the boot process.

If you have connected ZYBO via micro USB cable you can access the console via the built-in FTDI USB-UART bridge. Once you connect the cable to your Ubuntu laptop /dev/ttyUSB0 and /dev/ttyUSB1 should appear. Use picocom to connect to the virtual serial port:

sudo apt-get install picocom
sudo picocom -b 115200 /dev/ttyUSB0

1.7 Connecting via OpenSSH

As ZYBO has ethernet port onboard it makes sense to connect to the command-line via OpenSSH instead. Hook up an ethernet cable to ZYBO and connect it to a router or anything that provides DHCP service for automatically configuring the IP address of ZYBO. Otherwise you have to resort to manually configuring IP-s on your laptop, ZYBO and setting up masquerading and packet forwarding 5

Once the network is up and running you may install OpenSSH on ZYBO:

sudo apt-get install openssh-server

Generate SSH keys using ssh-keygen on your laptop if you haven't done so yet and copy-paste the public key to ZYBO's /root/.ssh/authorized_keys2, this way you can avoid typing password every time you log in via OpenSSH 6.

As inserting and ejecting microSD card becomes eventually tedious I recommend using OpenSSH to transfer bitstream to the device. Make sure the first FAT32 partition is mounted at /boot on the Zynq-7000 board. OpenSSH should be installed on Ubuntu laptop by default, so you can simply issue:

scp \
  path/to/zybo_base_system/source/vivado/hw/zybo_bsd/zybo_bsd.runs/impl_1/system_wrapper.bit \
  root@xillinux:/boot/xillydemo.bit

This method of course does not work if the board does not boot at all and in that case you have to resort to inserting the microSD card to your laptop.

5

https://help.ubuntu.com/community/Internet/ConnectionSharing

6

https://help.ubuntu.com/community/SSH/OpenSSH/Keys

1.8 About device tree

Device tree concept was introduced in Linux 3.15 with the purpose of easing Linux adoption on embedded devices. In a traditional PC sophisticated mechanisms such as PCI and PCI Express are used to detect what kind of hardware has been connected and what drivers should be loaded for them. That is not the case for ARM as it does not have such peripherial arbitration capabilities. For usual ARM devices a device tree is provided from the vendor and it's available in the mainline Linux source tree. The programmable logic of Zynq-7000 SoC makes the hardware configurable to a high degree via the bitstream file and this is where the device tree is used to describe:

  • What kind of peripherial devices are connected via PL

  • Which memory addresses have been allocated to peripherial devices for memory mapped input/output

  • Which interrupts kernel knows about

Note that device tree is not necessary to enable memory mapped input/output. The Xillinux device tree seems to pass through the shared interrupt numbered 91. There are basically 16 peripherial devices that can be connected to that shared interrupt. Note that kernel module has to be written for handling hardware interrupts as it is not possible to hook up an interrupt handler from a userspace program. The kernel module may however translate interrupts to whatever a userspace program can understand eg. signals or blocking FIFO. Shared interrupt means that you need to use bitmasks to determine which event actually happened and afterwards still read the particular PL output from the memory mapped region.

1.9 Blinking LED-s!

Linux provides access to the physical memory addresses via /dev/mem. It's a character device whose lower 512MB represent the physical DDR RAM present on the board. Peripherial devices are accessible via memory ranges predefined in the bitstream using memory mapped input/output. The ZYBO base system Vivado project contains following setup:

img/zybo-address-editor.png

Address editor

To blink the LED-s on the board you can just grab the LEDs_4bits Offset Address from the Address Editor and using mmap() you can write and read that memory range as if it was regular array using following Python snippet.

from time import sleep
import mmap

with open("/dev/mem", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4, offset=0x41210000)
    while True:
        try:
            mm[0] = chr(0xff)
            sleep(0.2)
            mm[0] = chr(0x00)
            sleep(0.2)
        except KeyboardInterrupt:
            break
    mm.close()

Just dump the contents to a file and invoke with python at command-line.

In the source tree you'll find base.xdc which contains pin mapping for button block, switch block, LED block onboard and other preconfigured ports:

# Button block
set_property PACKAGE_PIN R18 [get_ports {btns_4bits_tri_i[0]}]
set_property PACKAGE_PIN P16 [get_ports {btns_4bits_tri_i[1]}]
set_property PACKAGE_PIN V16 [get_ports {btns_4bits_tri_i[2]}]
set_property PACKAGE_PIN Y16 [get_ports {btns_4bits_tri_i[3]}]
set_property IOSTANDARD LVCMOS33 [get_ports {btns_4bits_tri_i[*]}]

# LED block
set_property PACKAGE_PIN M14 [get_ports {leds_4bits_tri_o[0]}]
set_property PACKAGE_PIN M15 [get_ports {leds_4bits_tri_o[1]}]
set_property PACKAGE_PIN G14 [get_ports {leds_4bits_tri_o[2]}]
set_property PACKAGE_PIN D18 [get_ports {leds_4bits_tri_o[3]}]
set_property IOSTANDARD LVCMOS33 [get_ports {leds_4bits_tri_o[*]}]

# Switch block
set_property PACKAGE_PIN G15 [get_ports {sws_4bits_tri_i[0]}]
set_property PACKAGE_PIN P15 [get_ports {sws_4bits_tri_i[1]}]
set_property PACKAGE_PIN W13 [get_ports {sws_4bits_tri_i[2]}]
set_property PACKAGE_PIN T16 [get_ports {sws_4bits_tri_i[3]}]
set_property IOSTANDARD LVCMOS33 [get_ports {sws_4bits_tri_i[*]}]

Pmod connectors on the board are not by default connected to any ports. In addition to attaching port in the high level block design constraints have to be added for the corresponding pins. On the wide side ZYBO has standard Pmod connector JE connected via built-in 200Ω resistors and three hi-speed Pmod connectors JD, JC, JB with no resistors 7:

H15 J15 W16 V12 Y17 T17 U17 V13 3.3V GND 3.3V GND R14 P14 T15 T14 V18 V17 U15 U14 3.3V GND 3.3V GND JE (Standard Pmod, 200Ω resistors) JD (Hi-Speed Pmod, no resistors)

Two leftmost Pmod connectors JE, JD on the wider edge of ZYBO

T10 T11 W15 V15 U12 T12 Y14 W14 3.3V GND 3.3V GND JC (Hi-Speed Pmod, no resistors) W20 V20 U20 T20 W19 W18 Y19 Y18 3.3V GND 3.3V GND JB (Hi-Speed Pmod, no resistors)

Two rightmost Pmod connectors JC, JB on the wider edge of ZYBO

Pin mappings for other ports can be found in ZYBO reference manual 7 and they're virtually impossible to locate simply by searching the Internet.

7(1,2)

ZYBO Reference Manual

2 Signal analysis using Sigrok

2.1 Introduction

Sigrok is an open-source software suite of signal analysis tools compromised of signal capture, protocol decoders and graphical tools. Sigrok's hardware support is pretty extensive and it's growing 9.

2.2 Installing

To install Sigrok on Ubuntu 14.04, enable their PPA repository:

sudo add-apt-repository ppa:jorik-kippendief/sigrok
sudo apt-get install pulseview

2.3 PulseView

Sigrok has command-line utilities for signal capture, but for newbies there is PulseView tool:

http://sigrok.org/wimg/e/ee/PulseView_I2C_DS1307_Decode.png

PulseView can parse various protocols such as I²C, SPI, UART etc.

2.4 Salea logic analyzer clones

Salea logic analyzer is a Cypress FX2 chipset based logic analyzer which can record up to 8 channels at 24MHz ranging from 0V to 5V.

http://sigrok.org/wimg/0/02/Mcu123_saleae_logic_clone_package_contents.jpg

Salea logic analyzer can sample 8 channels at 24MHz.

It has been discontinued by Salea but its clones are still available, anything with that particular chipset works. In order to use that chipset with Sigrok tools firmware for the logic analyzer has to be installed, otherwise you get "Firmware upload failed" once you fire up PulseView:

sudo apt-get install sigrok-firmware-fx2lafw

Salea logic analyzer clones can be purchased at eBay for less than 8€ per item 8.

8

http://www.ebay.com/itm/251637912969

2.5 Summary

Sigrok tools in conjunction with Salea logic analyzer clone can be used to interface FPGA design with already existing hardware and debug the design.

9

http://sigrok.org/wiki/Supported_hardware

3 Decimal counter on ZYBO

3.1 Introduction

In this article I'll attempt to outline steps needed to convert VHDL design into a usable application on ZYBO board. I am going to assume that person attempting to follow the guide has familiarized herself with ZYBO basics and is using ZYBO Base System Design as starting point for the high level block design.

3.2 Sample files

In this case we're connecting two segment displays to Pmod connectors and use programmable fabric to convert bus frequency to seconds shown on the segment display:

library ieee;
use ieee.std_logic_1164.all;

entity frequency_divider is
    generic (
        RATIO : integer := 50000000);
    port (
        clk_in  : in  std_logic;
        reset   : in  std_logic;
        clk_out : out std_logic);
end;

architecture behavioral of frequency_divider is
    signal temporal: std_logic;
    signal counter : integer range 0 to RATIO := 0;
begin
    frequency_divider_process: process (reset, clk_in) begin
        if (reset = '1') then
            temporal <= '0';
            counter <= 0;
        elsif rising_edge(clk_in) then
            if (counter = RATIO) then
                temporal <= not(temporal);
                counter <= 0;
            else
                counter <= counter + 1;
            end if;
        end if;
    end process;
    clk_out <= temporal;
end;

BCD counter counts up to 10 and then overflows 11:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity bcd_counter is
    port (
        clk_in  : in  std_logic;
        reset   : in  std_logic;
        inc     : in  std_logic;
        bcd     : out std_logic_vector(3 downto 0);
        clk_out : out std_logic);
end;

architecture behavioral of bcd_counter is
    signal temporal: std_logic;
    signal counter : integer range 0 to 10;
begin
    counter_process: process (reset, clk_in) begin
        if (reset = '1') then
            temporal <= '0';
            counter <= 0;
        elsif rising_edge(clk_in) then
            if inc = '1' then
                if (counter = 9) then
                    temporal <= '1';
                    counter <= 0;
                else
                    temporal <= '0';
                    counter <= counter + 1;
                end if;
            else
                if (counter = 0) then
                    temporal <= '1';
                    counter <= 9;
                else
                    temporal <= '0';
                    counter <= counter - 1;
                end if;
            end if;
        end if;
    end process;

    clk_out <= temporal;
    bcd <= std_logic_vector(to_unsigned(counter,4));
end;

Segment driver decodes BCD signal to segment display lanes:

library ieee;
use ieee.std_logic_1164.all;


entity bcd_segment_driver is
    port (
        bcd      : in  std_logic_vector(3 downto 0);
        segments : out std_logic_vector(6 downto 0));
end;

architecture behavioral of bcd_segment_driver is
begin
    segments <=
        "1111110" when bcd = "0000" else -- 0
        "0110000" when bcd = "0001" else -- 1
        "1101101" when bcd = "0010" else -- 2
        "1111001" when bcd = "0011" else -- 3
        "0110011" when bcd = "0100" else -- 4
        "1011011" when bcd = "0101" else -- 5
        "1011111" when bcd = "0110" else -- 6
        "1110000" when bcd = "0111" else -- 7
        "1111111" when bcd = "1000" else -- 8
        "1111011" when bcd = "1001" else -- 9
        "0000000";
end;

These files should be added by:

  • Open ToolsCreate and Package IP

  • Select Package your current project

  • Point IP location to the folder containing these three files

  • Click Next

  • Click Finish

  • IP editor window should be opened now

  • Make adjustments to the VHDL code if necessary

  • Under Package IP tab open Review and Pacakage step, click Package IP button

3.3 Block design

In the Vivado high level block design window click Add IP, search for the IP core that was added and double click on it. Apply the same for remaining blocks. Use block design editor to combine frequency divider, BCD counter and BCD segment drivers:

high-level-design/zybo-counter.png

Frequency divider is connected to the ZYBO bus clocked at 50MHz

3.4 Pin mapping

Each ZYBO Pmod connector is basically connected to 8-bit port of programmable fabric. Each port's 1st pin is marked with square and pins 1-4 are in the top row and pins 5-8 are in the row closer to the PCB. Remaining four pins are connected to 3.3V and ground rails and they're explicitly marked on the board.

Pin mapping is described in ZYBO reference manual PDF documentation 10. Edit Constraintsconstrs_1base.xdc to reflect your setup:

# Connect BTN0 to reset line
set_property PACKAGE_PIN R18 [get_ports {reset}]
set_property IOSTANDARD LVCMOS33 [get_ports {reset}]

# Pmod connector JB
set_property PACKAGE_PIN T20 [get_ports {digit_2[0]}]
set_property PACKAGE_PIN U20 [get_ports {digit_2[1]}]
set_property PACKAGE_PIN V20 [get_ports {digit_2[2]}]
set_property PACKAGE_PIN W20 [get_ports {digit_2[3]}]
set_property PACKAGE_PIN Y18 [get_ports {digit_2[4]}]
set_property PACKAGE_PIN Y19 [get_ports {digit_2[5]}]
set_property PACKAGE_PIN W18 [get_ports {digit_2[6]}]
set_property IOSTANDARD LVCMOS33 [get_ports {digit_2[*]}]

# Pmod connector JC
set_property PACKAGE_PIN V15 [get_ports {digit_1[0]}]
set_property PACKAGE_PIN W15 [get_ports {digit_1[1]}]
set_property PACKAGE_PIN T11 [get_ports {digit_1[2]}]
set_property PACKAGE_PIN T10 [get_ports {digit_1[3]}]
set_property PACKAGE_PIN W14 [get_ports {digit_1[4]}]
set_property PACKAGE_PIN Y14 [get_ports {digit_1[5]}]
set_property PACKAGE_PIN T12 [get_ports {digit_1[6]}]
set_property IOSTANDARD LVCMOS33 [get_ports {digit_1[*]}]

# Connect SW0 to increment/decrement toggle
set_property PACKAGE_PIN G15 [get_ports {inc}]
set_property IOSTANDARD LVCMOS33 [get_ports {inc}]

3.5 Final steps

Press Generate Bitstream in the left-hand panel under Program and Debug, the file will be written to zybo_base_system/source/vivado/hw/zybo_bsd/zybo_bsd.runs/impl_1/system_wrapper.bit. Transfer the file to the first FAT32 partition of the microSD card and reset the device.

img/zybo-counter.jpg

ZYBO in action

10

http://marsee101.blog19.fc2.com/blog-entry-2902.html

11

http://www.codeproject.com/Articles/443644/Frequency-Divider-with-VHDL

4 Piping OV7670 video to VGA output on ZYBO

4.1 Introduction

Before getting into more complex topics such as AXI Stream and direct memory access, it's recommended to first get familiar with pixel data encoding schemes and video timing signals. Hamsterworks has great examples for Zynq boards 12. In this example VGA frames are grabbed from OV7670 chipset based camera and stored in Block RAM based framebuffer.

http://www.elecfreaks.com/store/images/OV7670%20Camera%20Module_A.JPG

Omnivision OV7670 is a cheap 640x480 30fps camera module.

ZYBO is however more resource constrained so several modifications were required. In this case we're reducing the vertical resolution twofold since ZYBO does not have enough Block RAM to contain whole VGA frame. The example is basically working on ZYBO, but there are still few bugs that need to be ironed out.

12

http://hamsterworks.co.nz/mediawiki/index.php/OV7670_camera

4.2 Capture block

The capture block parses VSYNC and HREF signals and converts them into block RAM address. Pixel data is also a bit tricky - OV7670 transmits half of an 16-bit RGB (5:6:5) pixel during one PCLK cycle. Capture block latches the previous half and combines two halves into 12-bit RGB (4:4:4) pixel which is stored in block RAM. You are encouraged to use logic analyzer to debug video timing signals, as connecting wires to VGA output while display is connected is troublesome you might have to route extra pins onto Pmod connectors for debugging purposes.

----------------------------------------------------------------------------------
-- Engineer: Mike Field <hamster@snap.net.nz>
--
-- Description: Captures the pixels coming from the OV7670 camera and
--              Stores them in block RAM
----------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.ALL;
use ieee.NUMERIC_STD.ALL;

entity ov7670_capture is
    port (
        pclk  : in   std_logic;
        vsync : in   std_logic;
        href  : in   std_logic;
        d     : in   std_logic_vector ( 7 downto 0);
        addr  : out  std_logic_vector (17 downto 0);
        dout  : out  std_logic_vector (11 downto 0);
        we    : out  std_logic
    );
end ov7670_capture;

architecture behavioral of ov7670_capture is
   signal d_latch      : std_logic_vector(15 downto 0) := (others => '0');
   signal address      : std_logic_vector(18 downto 0) := (others => '0');
   signal address_next : std_logic_vector(18 downto 0) := (others => '0');
   signal wr_hold      : std_logic_vector( 1 downto 0)  := (others => '0');

begin
   addr <= address(18 downto 1);
   process(pclk)
   begin
      if rising_edge(pclk) then
         -- This is a bit tricky href starts a pixel transfer that takes 3 cycles
         --        Input   | state after clock tick
         --         href   | wr_hold    d_latch           d                 we address  address_next
         -- cycle -1  x    |    xx      xxxxxxxxxxxxxxxx  xxxxxxxxxxxxxxxx  x   xxxx     xxxx
         -- cycle 0   1    |    x1      xxxxxxxxRRRRRGGG  xxxxxxxxxxxxxxxx  x   xxxx     addr
         -- cycle 1   0    |    10      RRRRRGGGGGGBBBBB  xxxxxxxxRRRRRGGG  x   addr     addr
         -- cycle 2   x    |    0x      GGGBBBBBxxxxxxxx  RRRRRGGGGGGBBBBB  1   addr     addr+1

         if vsync = '1' then
            address <= (others => '0');
            address_next <= (others => '0');
            wr_hold <= (others => '0');
         else
            -- This should be a different order, but seems to be GRB!
            dout    <= d_latch(15 downto 12) & d_latch(10 downto 7) & d_latch(4 downto 1);
            address <= address_next;
            we      <= wr_hold(1);
            wr_hold <= wr_hold(0) & (href and not wr_hold(0));
            d_latch <= d_latch( 7 downto  0) & d;

            if wr_hold(1) = '1' then
               address_next <= std_logic_vector(unsigned(address_next)+1);
            end if;
         end if;
      end if;
   end process;
end behavioral;

4.3 Video output block

VGA output block generates HSYNC and VSYNC signals for the video outputs and corresponding input for the read address.

----------------------------------------------------------------------------------
-- Engineer: Mike Field <hamster@snap.net.nz>
--
-- Description: Generate analog 640x480 VGA, double-doublescanned from 19200 bytes of RAM
--
----------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity ov7670_vga is
    port (
        clk25       : in  STD_LOGIC;
        vga_red     : out STD_LOGIC_VECTOR(4 downto 0);
        vga_green   : out STD_LOGIC_VECTOR(5 downto 0);
        vga_blue    : out STD_LOGIC_VECTOR(4 downto 0);
        vga_hsync   : out STD_LOGIC;
        vga_vsync   : out STD_LOGIC;
        frame_addr  : out STD_LOGIC_VECTOR(17 downto 0);
        frame_pixel : in  STD_LOGIC_VECTOR(11 downto 0)
    );
end ov7670_vga;

architecture Behavioral of ov7670_vga is
   -- Timing constants
   constant hRez       : natural := 640;
   constant hStartSync : natural := 640+16;
   constant hEndSync   : natural := 640+16+96;
   constant hMaxCount  : natural := 800;

   constant vRez       : natural := 480;
   constant vStartSync : natural := 480+10;
   constant vEndSync   : natural := 480+10+2;
   constant vMaxCount  : natural := 480+10+2+33;

   constant hsync_active : std_logic := '0';
   constant vsync_active : std_logic := '0';

   signal hCounter : unsigned( 9 downto 0) := (others => '0');
   signal vCounter : unsigned( 9 downto 0) := (others => '0');
   signal address  : unsigned(18 downto 0) := (others => '0');
   signal blank    : std_logic := '1';

begin
   frame_addr <= std_logic_vector(address(18 downto 1));

   process(clk25)
   begin
      if rising_edge(clk25) then
         -- Count the lines and rows
         if hCounter = hMaxCount-1 then
            hCounter <= (others => '0');
            if vCounter = vMaxCount-1 then
               vCounter <= (others => '0');
            else
               vCounter <= vCounter+1;
            end if;
         else
            hCounter <= hCounter+1;
         end if;

         if blank = '0' then
            vga_red   <= frame_pixel(11 downto 8) & "0";
            vga_green <= frame_pixel( 7 downto 4) & "00";
            vga_blue  <= frame_pixel( 3 downto 0) & "0";
         else
            vga_red   <= (others => '0');
            vga_green <= (others => '0');
            vga_blue  <= (others => '0');
         end if;

         if vCounter  >= vRez then
            address <= (others => '0');
            blank <= '1';
         else
            if hCounter  < 640 then
               blank <= '0';
               address <= address+1;
            else
               blank <= '1';
            end if;
         end if;

         -- Are we in the hSync pulse? (one has been added to include frame_buffer_latency)
         if hCounter > hStartSync and hCounter <= hEndSync then
            vga_hSync <= hsync_active;
         else
            vga_hSync <= not hsync_active;
         end if;

         -- Are we in the vSync pulse?
         if vCounter >= vStartSync and vCounter < vEndSync then
            vga_vSync <= vsync_active;
         else
            vga_vSync <= not vsync_active;
         end if;
      end if;
   end process;
end Behavioral;

It essentially plays the role of video card in a PC.

4.4 Controller block

Omnivision OV7670 uses Omnivision Serial Camera Control Bus (SCCB) protocol to set up the camera parameters. SCCB actually is I²C-compliant interface, but avoids the usage of I²C brand due to licensing fees 13. The controller component is composed of three components: I²C bus master, OV7670 instructions and glue code.

The first one is used to emulate I²C bus master:

----------------------------------------------------------------------------------
-- Engineer: <mfield@concepts.co.nz
--
-- Description: Send the commands to the OV7670 over an I2C-like interface
----------------------------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity i2c_sender is
    port (
        clk   : in    std_logic;
        siod  : inout std_logic;
        sioc  : out   std_logic;
        taken : out   std_logic;
        send  : in    std_logic;
        id    : in    std_logic_vector(7 downto 0);
        reg   : in    std_logic_vector(7 downto 0);
        value : in    std_logic_vector(7 downto 0)
    );
end i2c_sender;

architecture behavioral of i2c_sender is
    -- this value gives a 254 cycle pause before the initial frame is sent
    signal   divider  : unsigned (7 downto 0) := "00000001";
    signal   busy_sr  : std_logic_vector(31 downto 0) := (others => '0');
    signal   data_sr  : std_logic_vector(31 downto 0) := (others => '1');
begin
    process(busy_sr, data_sr(31))
    begin
        if busy_sr(11 downto 10) = "10" or
           busy_sr(20 downto 19) = "10" or
           busy_sr(29 downto 28) = "10"  then
            siod <= 'Z';
        else
            siod <= data_sr(31);
        end if;
    end process;

    process(clk)
    begin
        if rising_edge(clk) then
            taken <= '0';
            if busy_sr(31) = '0' then
                SIOC <= '1';
                if send = '1' then
                    if divider = "00000000" then
                        data_sr <= "100" &   id & '0'  &   reg & '0' & value & '0' & "01";
                        busy_sr <= "111" & "111111111" & "111111111" & "111111111" & "11";
                        taken <= '1';
                    else
                        divider <= divider+1; -- this only happens on powerup
                    end if;
                end if;
            else

                case busy_sr(32-1 downto 32-3) & busy_sr(2 downto 0) is
                    when "111"&"111" => -- start seq #1
                        case divider(7 downto 6) is
                            when "00"   => SIOC <= '1';
                            when "01"   => SIOC <= '1';
                            when "10"   => SIOC <= '1';
                            when others => SIOC <= '1';
                        end case;
                    when "111"&"110" => -- start seq #2
                        case divider(7 downto 6) is
                            when "00"   => SIOC <= '1';
                            when "01"   => SIOC <= '1';
                            when "10"   => SIOC <= '1';
                            when others => SIOC <= '1';
                        end case;
                    when "111"&"100" => -- start seq #3
                        case divider(7 downto 6) is
                            when "00"   => SIOC <= '0';
                            when "01"   => SIOC <= '0';
                            when "10"   => SIOC <= '0';
                            when others => SIOC <= '0';
                        end case;
                    when "110"&"000" => -- end seq #1
                        case divider(7 downto 6) is
                            when "00"   => SIOC <= '0';
                            when "01"   => SIOC <= '1';
                            when "10"   => SIOC <= '1';
                            when others => SIOC <= '1';
                        end case;
                    when "100"&"000" => -- end seq #2
                        case divider(7 downto 6) is
                            when "00"   => SIOC <= '1';
                            when "01"   => SIOC <= '1';
                            when "10"   => SIOC <= '1';
                            when others => SIOC <= '1';
                        end case;
                    when "000"&"000" => -- Idle
                        case divider(7 downto 6) is
                            when "00"   => SIOC <= '1';
                            when "01"   => SIOC <= '1';
                            when "10"   => SIOC <= '1';
                            when others => SIOC <= '1';
                        end case;
                    when others      =>
                        case divider(7 downto 6) is
                            when "00"   => SIOC <= '0';
                            when "01"   => SIOC <= '1';
                            when "10"   => SIOC <= '1';
                            when others => SIOC <= '0';
                        end case;
                end case;

                if divider = "11111111" then
                    busy_sr <= busy_sr(32-2 downto 0) & '0';
                    data_sr <= data_sr(32-2 downto 0) & '1';
                    divider <= (others => '0');
                else
                    divider <= divider+1;
                end if;
            end if;
        end if;
    end process;
end behavioral;

The second one contains OV7670 setup instructions:

-- Company:
-- Engineer: Mike Field <hamster@sanp.net.nz>
--
-- Description: Register settings for the OV7670 Caamera (partially from OV7670.c
--              in the Linux Kernel
-- Edited by : Christopher Wilson <wilson@chrec.org>
------------------------------------------------------------------------------------
--
-- Notes:
-- 1) Regarding the WITH SELECT Statement:
--      WITH sreg(sel) SELECT
--           finished    <= '1' when x"FFFF",
--                        '0' when others;
-- This means the transfer is finished the first time sreg ends up as "FFFF",
-- I.E. Need Sequential Addresses in the below case statements
--
-- Common Debug Issues:
--
-- Red Appearing as Green / Green Appearing as Pink
-- Solution: Register Corrections Below
--
--
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity ov7670_registers is
    Port ( clk      : in  STD_LOGIC;
           resend   : in  STD_LOGIC;
           advance  : in  STD_LOGIC;
           command  : out  std_logic_vector(15 downto 0);
           finished : out  STD_LOGIC);
end ov7670_registers;

architecture Behavioral of ov7670_registers is
   signal sreg   : std_logic_vector(15 downto 0);
   signal address : std_logic_vector(7 downto 0) := (others => '0');
begin
   command <= sreg;
   with sreg select finished  <= '1' when x"FFFF", '0' when others;

   process(clk)
   begin
      if rising_edge(clk) then
         if resend = '1' then
            address <= (others => '0');
         elsif advance = '1' then
            address <= std_logic_vector(unsigned(address)+1);
         end if;

         case address is
            when x"00" => sreg <= x"1280"; -- COM7   Reset
            when x"01" => sreg <= x"1280"; -- COM7   Reset
            when x"02" => sreg <= x"1204"; -- COM7   Size & RGB output
            when x"03" => sreg <= x"1100"; -- CLKRC  Prescaler - Fin/(1+1)
            when x"04" => sreg <= x"0C00"; -- COM3   Lots of stuff, enable scaling, all others off
            when x"05" => sreg <= x"3E00"; -- COM14  PCLK scaling off

            when x"06" => sreg <= x"8C00"; -- RGB444 Set RGB format
            when x"07" => sreg <= x"0400"; -- COM1   no CCIR601
             when x"08" => sreg <= x"4010"; -- COM15  Full 0-255 output, RGB 565
            when x"09" => sreg <= x"3a04"; -- TSLB   Set UV ordering,  do not auto-reset window
            when x"0A" => sreg <= x"1438"; -- COM9  - AGC Celling
            when x"0B" => sreg <= x"4f40"; --x"4fb3"; -- MTX1  - colour conversion matrix
            when x"0C" => sreg <= x"5034"; --x"50b3"; -- MTX2  - colour conversion matrix
            when x"0D" => sreg <= x"510C"; --x"5100"; -- MTX3  - colour conversion matrix
            when x"0E" => sreg <= x"5217"; --x"523d"; -- MTX4  - colour conversion matrix
            when x"0F" => sreg <= x"5329"; --x"53a7"; -- MTX5  - colour conversion matrix
            when x"10" => sreg <= x"5440"; --x"54e4"; -- MTX6  - colour conversion matrix
            when x"11" => sreg <= x"581e"; --x"589e"; -- MTXS  - Matrix sign and auto contrast
            when x"12" => sreg <= x"3dc0"; -- COM13 - Turn on GAMMA and UV Auto adjust
            when x"13" => sreg <= x"1100"; -- CLKRC  Prescaler - Fin/(1+1)
            when x"14" => sreg <= x"1711"; -- HSTART HREF start (high 8 bits)
            when x"15" => sreg <= x"1861"; -- HSTOP  HREF stop (high 8 bits)
            when x"16" => sreg <= x"32A4"; -- HREF   Edge offset and low 3 bits of HSTART and HSTOP
            when x"17" => sreg <= x"1903"; -- VSTART VSYNC start (high 8 bits)
            when x"18" => sreg <= x"1A7b"; -- VSTOP  VSYNC stop (high 8 bits)
            when x"19" => sreg <= x"030a"; -- VREF   VSYNC low two bits
            when x"1A" => sreg <= x"0e61"; -- COM5(0x0E) 0x61
            when x"1B" => sreg <= x"0f4b"; -- COM6(0x0F) 0x4B
            when x"1C" => sreg <= x"1602"; --
            when x"1D" => sreg <= x"1e37"; -- MVFP (0x1E) 0x07  -- FLIP AND MIRROR IMAGE 0x3x
            when x"1E" => sreg <= x"2102";
            when x"1F" => sreg <= x"2291";
            when x"20" => sreg <= x"2907";
            when x"21" => sreg <= x"330b";
            when x"22" => sreg <= x"350b";
            when x"23" => sreg <= x"371d";
            when x"24" => sreg <= x"3871";
            when x"25" => sreg <= x"392a";
            when x"26" => sreg <= x"3c78"; -- COM12 (0x3C) 0x78
            when x"27" => sreg <= x"4d40";
            when x"28" => sreg <= x"4e20";
            when x"29" => sreg <= x"6900"; -- GFIX (0x69) 0x00
            when x"2A" => sreg <= x"6b4a";
            when x"2B" => sreg <= x"7410";
            when x"2C" => sreg <= x"8d4f";
            when x"2D" => sreg <= x"8e00";
            when x"2E" => sreg <= x"8f00";
            when x"2F" => sreg <= x"9000";
            when x"30" => sreg <= x"9100";
            when x"31" => sreg <= x"9600";
            when x"32" => sreg <= x"9a00";
            when x"33" => sreg <= x"b084";
            when x"34" => sreg <= x"b10c";
            when x"35" => sreg <= x"b20e";
            when x"36" => sreg <= x"b382";
            when x"37" => sreg <= x"b80a";
            when others => sreg <= x"ffff";
         end case;
      end if;
   end process;
end Behavioral;

Third one contains glue code for the IP core that can actually be instantiated:

----------------------------------------------------------------------------------
-- Engineer: Mike Field <hamster@snap.net.nz>
--
-- Description: Controller for the OV760 camera - transferes registers to the
--              camera over an I2C like bus
----------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity ov7670_controller is
    Port ( clk   : in    STD_LOGIC;
                          resend :in    STD_LOGIC;
                          config_finished : out std_logic;
           sioc  : out   STD_LOGIC;
           siod  : inout STD_LOGIC;
           reset : out   STD_LOGIC;
           pwdn  : out   STD_LOGIC;
                          xclk  : out   STD_LOGIC
);
end ov7670_controller;

architecture Behavioral of ov7670_controller is
        COMPONENT ov7670_registers
        PORT(
                clk      : IN std_logic;
                advance  : IN std_logic;
                resend   : in STD_LOGIC;
                command  : OUT std_logic_vector(15 downto 0);
                finished : OUT std_logic
                );
        END COMPONENT;

        COMPONENT i2c_sender
        PORT(
                clk   : IN std_logic;
                send  : IN std_logic;
                taken : out std_logic;
                id    : IN std_logic_vector(7 downto 0);
                reg   : IN std_logic_vector(7 downto 0);
                value : IN std_logic_vector(7 downto 0);
                siod  : INOUT std_logic;
                sioc  : OUT std_logic
                );
        END COMPONENT;

        signal sys_clk  : std_logic := '0';
        signal command  : std_logic_vector(15 downto 0);
        signal finished : std_logic := '0';
        signal taken    : std_logic := '0';
        signal send     : std_logic;

        constant camera_address : std_logic_vector(7 downto 0) := x"42"; -- 42"; -- Device write ID - see top of page 11 of data sheet
begin
   config_finished <= finished;

        send <= not finished;
        Inst_i2c_sender: i2c_sender PORT MAP(
                clk   => clk,
                taken => taken,
                siod  => siod,
                sioc  => sioc,
                send  => send,
                id    => camera_address,
                reg   => command(15 downto 8),
                value => command(7 downto 0)
        );

        reset <= '1';                                           -- Normal mode
        pwdn  <= '0';                                           -- Power device up
        xclk  <= sys_clk;

        Inst_ov7670_registers: ov7670_registers PORT MAP(
                clk      => clk,
                advance  => taken,
                command  => command,
                finished => finished,
                resend   => resend
        );

        process(clk)
        begin
                if rising_edge(clk) then
                        sys_clk <= not sys_clk;
                end if;
        end process;
end Behavioral;
13

http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/99/t/6092

4.5 Importing VHDL code

To insert VHDL code snippets into Vivado:

  • From the main menu select ToolsCreate and Package IP, click Next.

  • Select Package a specified directory, click Next.

  • Locate the directory which contains VHDL files for IP location, click Next.

  • Set Project name to main component name.

  • Set Project location to the parent folder of the VHDL files.

  • Click Finish

Once you have added everything to the library it's time to instantiate the code in the design, for each component add the corresponding block:

  • Click on Open Block Design under IP Integrator to open up the high level block design.

  • Right click in the designer area and select Add IP...

  • Locate the components added earlier

  • Repeat same steps for all components

Next step is to insert block RAM, clocking wizard and connect the components.

4.6 Instantiating block RAM

Since block RAM is highly platform specific a Xilinx block has to be inserted. Right click in the high level design → Add IP...Block Memory Generator to insert block RAM. Right click on the block → Customize block... opens up the dialog for editing block RAM parameters.

Stand Alone mode makes it possible to generate Simple Dual Port RAM which is essentially memory with write port and read port. Port width refers to amount of bits that can be read/written at once or in other words the size of a memory slot. Port depth refers to count of such slots which translates to address bit width.

img/block-ram-generator.png

Block RAM generator parameters

4.7 Routing pins

The base.xdc important chunks are following:

# Debounce button and config finished LED
set_property PACKAGE_PIN R18 [get_ports button_debounce]
set_property PACKAGE_PIN M14 [get_ports led_config_finished]

# Top JE
set_property PACKAGE_PIN H15 [get_ports ov7670_reset]
set_property PACKAGE_PIN J16 [get_ports {ov7670_d[1]}]
set_property PACKAGE_PIN W16 [get_ports {ov7670_d[3]}]
set_property PACKAGE_PIN V12 [get_ports {ov7670_d[5]}]

# Bottom JE
set_property PACKAGE_PIN Y17 [get_ports ov7670_pwdn]
set_property PACKAGE_PIN T17 [get_ports {ov7670_d[0]}]
set_property PACKAGE_PIN U17 [get_ports {ov7670_d[2]}]
set_property PACKAGE_PIN V13 [get_ports {ov7670_d[4]}]

# Top JD
set_property PACKAGE_PIN R14 [get_ports {ov7670_d[7]}]
set_property PACKAGE_PIN P14 [get_ports ov7670_pclk]
set_property PACKAGE_PIN T15 [get_ports ov7670_vsync]
set_property PACKAGE_PIN T14 [get_ports ov7670_sioc]

# Bottom JD
set_property PACKAGE_PIN V18 [get_ports {ov7670_d[6]}]
set_property PACKAGE_PIN V17 [get_ports ov7670_xclk]
set_property PACKAGE_PIN U15 [get_ports ov7670_href]
set_property PACKAGE_PIN U14 [get_ports ov7670_siod]

# Red channel of VGA output
set_property PACKAGE_PIN M19 [get_ports {RED_O[0]}]
set_property PACKAGE_PIN L20 [get_ports {RED_O[1]}]
set_property PACKAGE_PIN J20 [get_ports {RED_O[2]}]
set_property PACKAGE_PIN G20 [get_ports {RED_O[3]}]
set_property PACKAGE_PIN F19 [get_ports {RED_O[4]}]

# Green channel of VGA output
set_property PACKAGE_PIN H18 [get_ports {GREEN_O[0]}]
set_property PACKAGE_PIN N20 [get_ports {GREEN_O[1]}]
set_property PACKAGE_PIN L19 [get_ports {GREEN_O[2]}]
set_property PACKAGE_PIN J19 [get_ports {GREEN_O[3]}]
set_property PACKAGE_PIN H20 [get_ports {GREEN_O[4]}]
set_property PACKAGE_PIN F20 [get_ports {GREEN_O[5]}]
# Blue channel of VGA output
set_property PACKAGE_PIN P20 [get_ports {BLUE_O[0]}]
set_property PACKAGE_PIN M20 [get_ports {BLUE_O[1]}]
set_property PACKAGE_PIN K19 [get_ports {BLUE_O[2]}]
set_property PACKAGE_PIN J18 [get_ports {BLUE_O[3]}]
set_property PACKAGE_PIN G19 [get_ports {BLUE_O[4]}]

# Horizontal and vertical synchronization of VGA output
set_property PACKAGE_PIN P19 [get_ports HSYNC_O]
set_property PACKAGE_PIN R19 [get_ports VSYNC_O]

# Voltage levels
set_property IOSTANDARD LVCMOS33 [get_ports button_debounce]
set_property IOSTANDARD LVCMOS33 [get_ports led_config_finished]
set_property IOSTANDARD LVCMOS33 [get_ports ov7670_pclk]
set_property IOSTANDARD LVCMOS33 [get_ports ov7670_sioc]
set_property IOSTANDARD LVCMOS33 [get_ports ov7670_vsync]
set_property IOSTANDARD LVCMOS33 [get_ports ov7670_reset]
set_property IOSTANDARD LVCMOS33 [get_ports ov7670_pwdn]
set_property IOSTANDARD LVCMOS33 [get_ports ov7670_href]
set_property IOSTANDARD LVCMOS33 [get_ports ov7670_xclk]
set_property IOSTANDARD LVCMOS33 [get_ports ov7670_siod]
set_property IOSTANDARD LVCMOS33 [get_ports {ov7670_d[*]}]
set_property IOSTANDARD LVCMOS33 [get_ports {RED_O[*]}]
set_property IOSTANDARD LVCMOS33 [get_ports {GREEN_O[*]}]
set_property IOSTANDARD LVCMOS33 [get_ports {BLUE_O[*]}]
set_property IOSTANDARD LVCMOS33 [get_ports HSYNC_O]
set_property IOSTANDARD LVCMOS33 [get_ports VSYNC_O]

# Magic
set_property CLOCK_DEDICATED_ROUTE FALSE [get_nets ov7670_pclk_IBUF]

Using the pin mapping above the camera can be connected cleanly to the board:

img/zybo-ov7670.jpg

Omnivision OV7670 attached to Pmod connectors JD and JE.

Remember to connect GND and 3.3V rails of the ZYBO to cameras GND and 3.3V rails.

4.8 Final high level design

high-level-design/zybo-ov7670-to-vga.png

High level design

Click on Generate bitstream button and transfer resulting bitstream file to the boot partition and restart ZYBO.

4.9 Summary

If you've connected camera correctly you should see the video feed from the camera on the screen attached to VGA output. Capture and controller blocks can be re-used in other examples involving Omnivision OV7670 camera, so it's important to get expected outcome at this point.

5 AXI Direct Memory Access

5.1 Introduction

Getting started with direct memory access on Xilinx boards may be initially overwhelming. First of all Xilinx distinguishes AXI DMA and AXI VDMA in programmable fabric. AXI DMA refers to traditional FPGA direct memory access which roughly corresponds to transferring arbitrary streams of bytes from FPGA to a slice of DDR memory and vice versa. VDMA refers to video DMA which adds mechanisms to handle frame synchronization using ring buffer in DDR, on-the-fly video resolution changes, cropping and zooming. Video DMA is covered in next article. In addition to AXI DMA and AXI VDMA there is a DMA engine built into the ARM core which is also out of the scope of this article. Both AXI DMA and AXI VDMA have optional scatter-gather support which means that instead of writing memory addresses or framebuffer addresses to control registers the DMA controller grabs them from linked list in DDR memory. Scatter-gather features are out of scope of this article.

5.2 Internals

AXI DMA distinguishes two channels: MM2S (memory-mapped to stream) transports data from DDR memory to FPGA and S2MM (stream to memory-mapped) transports arbitrary data stream to DDR memory.

img/axi-dma-block-diagram.png

AXI DMA internals

5.3 Minimal working hardware

The simplest way to instantiate AXI DMA on Zynq-7000 based boards is to take board vendor's base design, strip unnecessary components, add AXI Direct Memory Access IP-core and connect the output stream port to it's input stream port. This essentially implements memcpy functionality which can be triggered from ARM core but offloaded to programmable fabric.

C code running on ARM cores VHDL code running on programmable fabric Userspace test application AXI4 Direct Memory Access DDR memory /dev/mem

AXI Direct Memory Access stream output is looped back to stream input

To be more precise, following is the corresponding high level block design. High speed clock line is highlighted in yellow as it runs on higher frequency of 150MHz while the general purpose port runs at 100MHz. Clock domain errors can usually be tracked back to conflicting clock lines. This is further explained in the end of this article.

high-level-design/axi-dma-loopback.png

High level block design corresponding to abstract design presented earlier

In the AXI Direct Memory Access IP-core customization dialog read channel and write channel correspond respectively to MM2S and S2MM portions of the DMA block. Memory map data width of 32 bits means that 4 bytes will be transferred during one bus cycle. This means the tdata port of the stream interface will be 32 bits wide.

img/axi-dma-parameters.png

Both read/write channels are enabled and scatter-gather engine is disabled

AXI Direct Memory Access component's control register, status register and transfer address registers are accessible via the AXI Lite slave port which is memory mapped to address range of 0x40400000 - 0x4040FFFF. The whole memory range of 0x00000000-0x1FFFFFFF is accessible via both stream to memory-mapped and memory-mapped to stream channel. AXI DMA 14 documentation has the offsets of the registers accessible via AXI Lite port. In this case MM2S control register of 32-bits is accessible at 0x40400000, MM2S status register of 32-bits at 0x40400004 and so forth.

img/axi-dma-address-editor.png

Important

Note that customizing the AXI Direct Memory Access IP-core parameters causes memory ranges to be reset under Address Editor!

5.4 Minimal working software

When it comes to writing C code I see alarming tendency of defaulting to vendor provided components: stand-alone binary compilers, Linux distributions, board support packages, wrappers while avoiding learning what actually happens in the hardware/software.

As described in my earlier article physical memory can be accessed in Linux via /dev/mem block device. This makes it possible to access AXI Lite registers simply by reading/writing to a memory mapped range from /dev/mem. To use DMA component minimally four steps have to be taken:

  1. Start the DMA channel (MM2S, S2MM or both) by writing 1 to control register

  2. Write start/destination addresses to corresponding registers

  3. To initiate the transfer(s) write transfer length(s) to corresponding register(s).

  4. Monitor status register for IOC_Irq flag.

In this case we're copying 32 bytes from physical address of 0x0E000000 to physical addres of 0x0F000000. Note that kernel may allocate memory for other processes in that range and that is the primary reason to write a kernel module which would request_mem_region so no other processes would overlap with the memory range. Besides reserving memory ranges the kernel module provides a sanitized way of accessing the hardware from userspace applications via /dev/blah block devices.

/**
 * Proof of concept offloaded memcopy using AXI Direct Memory Access v7.1
 */

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <termios.h>
#include <sys/mman.h>

#define MM2S_CONTROL_REGISTER 0x00
#define MM2S_STATUS_REGISTER 0x04
#define MM2S_START_ADDRESS 0x18
#define MM2S_LENGTH 0x28

#define S2MM_CONTROL_REGISTER 0x30
#define S2MM_STATUS_REGISTER 0x34
#define S2MM_DESTINATION_ADDRESS 0x48
#define S2MM_LENGTH 0x58

unsigned int dma_set(unsigned int* dma_virtual_address, int offset, unsigned int value) {
    dma_virtual_address[offset>>2] = value;
}

unsigned int dma_get(unsigned int* dma_virtual_address, int offset) {
    return dma_virtual_address[offset>>2];
}

int dma_mm2s_sync(unsigned int* dma_virtual_address) {
    unsigned int mm2s_status =  dma_get(dma_virtual_address, MM2S_STATUS_REGISTER);
    while(!(mm2s_status & 1<<12) || !(mm2s_status & 1<<1) ){
        dma_s2mm_status(dma_virtual_address);
        dma_mm2s_status(dma_virtual_address);

        mm2s_status =  dma_get(dma_virtual_address, MM2S_STATUS_REGISTER);
    }
}

int dma_s2mm_sync(unsigned int* dma_virtual_address) {
    unsigned int s2mm_status = dma_get(dma_virtual_address, S2MM_STATUS_REGISTER);
    while(!(s2mm_status & 1<<12) || !(s2mm_status & 1<<1)){
        dma_s2mm_status(dma_virtual_address);
        dma_mm2s_status(dma_virtual_address);

        s2mm_status = dma_get(dma_virtual_address, S2MM_STATUS_REGISTER);
    }
}

void dma_s2mm_status(unsigned int* dma_virtual_address) {
    unsigned int status = dma_get(dma_virtual_address, S2MM_STATUS_REGISTER);
    printf("Stream to memory-mapped status (0x%08x@0x%02x):", status, S2MM_STATUS_REGISTER);
    if (status & 0x00000001) printf(" halted"); else printf(" running");
    if (status & 0x00000002) printf(" idle");
    if (status & 0x00000008) printf(" SGIncld");
    if (status & 0x00000010) printf(" DMAIntErr");
    if (status & 0x00000020) printf(" DMASlvErr");
    if (status & 0x00000040) printf(" DMADecErr");
    if (status & 0x00000100) printf(" SGIntErr");
    if (status & 0x00000200) printf(" SGSlvErr");
    if (status & 0x00000400) printf(" SGDecErr");
    if (status & 0x00001000) printf(" IOC_Irq");
    if (status & 0x00002000) printf(" Dly_Irq");
    if (status & 0x00004000) printf(" Err_Irq");
    printf("\n");
}

void dma_mm2s_status(unsigned int* dma_virtual_address) {
    unsigned int status = dma_get(dma_virtual_address, MM2S_STATUS_REGISTER);
    printf("Memory-mapped to stream status (0x%08x@0x%02x):", status, MM2S_STATUS_REGISTER);
    if (status & 0x00000001) printf(" halted"); else printf(" running");
    if (status & 0x00000002) printf(" idle");
    if (status & 0x00000008) printf(" SGIncld");
    if (status & 0x00000010) printf(" DMAIntErr");
    if (status & 0x00000020) printf(" DMASlvErr");
    if (status & 0x00000040) printf(" DMADecErr");
    if (status & 0x00000100) printf(" SGIntErr");
    if (status & 0x00000200) printf(" SGSlvErr");
    if (status & 0x00000400) printf(" SGDecErr");
    if (status & 0x00001000) printf(" IOC_Irq");
    if (status & 0x00002000) printf(" Dly_Irq");
    if (status & 0x00004000) printf(" Err_Irq");
    printf("\n");
}

void memdump(void* virtual_address, int byte_count) {
    char *p = virtual_address;
    int offset;
    for (offset = 0; offset < byte_count; offset++) {
        printf("%02x", p[offset]);
        if (offset % 4 == 3) { printf(" "); }
    }
    printf("\n");
}


int main() {
    int dh = open("/dev/mem", O_RDWR | O_SYNC); // Open /dev/mem which represents the whole physical memory
    unsigned int* virtual_address = mmap(NULL, 65535, PROT_READ | PROT_WRITE, MAP_SHARED, dh, 0x40400000); // Memory map AXI Lite register block
    unsigned int* virtual_source_address  = mmap(NULL, 65535, PROT_READ | PROT_WRITE, MAP_SHARED, dh, 0x0e000000); // Memory map source address
    unsigned int* virtual_destination_address = mmap(NULL, 65535, PROT_READ | PROT_WRITE, MAP_SHARED, dh, 0x0f000000); // Memory map destination address

    virtual_source_address[0]= 0x11223344; // Write random stuff to source block
    memset(virtual_destination_address, 0, 32); // Clear destination block

    printf("Source memory block:      "); memdump(virtual_source_address, 32);
    printf("Destination memory block: "); memdump(virtual_destination_address, 32);

    printf("Resetting DMA\n");
    dma_set(virtual_address, S2MM_CONTROL_REGISTER, 4);
    dma_set(virtual_address, MM2S_CONTROL_REGISTER, 4);
    dma_s2mm_status(virtual_address);
    dma_mm2s_status(virtual_address);

    printf("Halting DMA\n");
    dma_set(virtual_address, S2MM_CONTROL_REGISTER, 0);
    dma_set(virtual_address, MM2S_CONTROL_REGISTER, 0);
    dma_s2mm_status(virtual_address);
    dma_mm2s_status(virtual_address);

    printf("Writing destination address\n");
    dma_set(virtual_address, S2MM_DESTINATION_ADDRESS, 0x0f000000); // Write destination address
    dma_s2mm_status(virtual_address);

    printf("Writing source address...\n");
    dma_set(virtual_address, MM2S_START_ADDRESS, 0x0e000000); // Write source address
    dma_mm2s_status(virtual_address);

    printf("Starting S2MM channel with all interrupts masked...\n");
    dma_set(virtual_address, S2MM_CONTROL_REGISTER, 0xf001);
    dma_s2mm_status(virtual_address);

    printf("Starting MM2S channel with all interrupts masked...\n");
    dma_set(virtual_address, MM2S_CONTROL_REGISTER, 0xf001);
    dma_mm2s_status(virtual_address);

    printf("Writing S2MM transfer length...\n");
    dma_set(virtual_address, S2MM_LENGTH, 32);
    dma_s2mm_status(virtual_address);

    printf("Writing MM2S transfer length...\n");
    dma_set(virtual_address, MM2S_LENGTH, 32);
    dma_mm2s_status(virtual_address);

    printf("Waiting for MM2S synchronization...\n");
    dma_mm2s_sync(virtual_address);

    printf("Waiting for S2MM sychronization...\n");
    dma_s2mm_sync(virtual_address); // If this locks up make sure all memory ranges are assigned under Address Editor!

    dma_s2mm_status(virtual_address);
    dma_mm2s_status(virtual_address);

    printf("Destination memory block: "); memdump(virtual_destination_address, 32);
}

Successful run should look something like this:

Source memory block:      44332211 7dcddfdf 5a7fefa4 36aa3c9b ca2eea6a 5bf64f81 ebf7ffbb b7f710d2
Destination memory block: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Resetting DMA
Stream to memory-mapped status (0x00000001@0x34): halted
Memory-mapped to stream status (0x00000001@0x04): halted
Halting DMA
Stream to memory-mapped status (0x00000001@0x34): halted
Memory-mapped to stream status (0x00000001@0x04): halted
Writing destination address
Stream to memory-mapped status (0x00000001@0x34): halted
Writing source address...
Memory-mapped to stream status (0x00000001@0x04): halted
Starting S2MM channel with all interrupts masked...
Stream to memory-mapped status (0x00000000@0x34): running
Starting MM2S channel with all interrupts masked...
Memory-mapped to stream status (0x00000000@0x04): running
Writing S2MM transfer length...
Stream to memory-mapped status (0x00000000@0x34): running
Writing MM2S transfer length...
Memory-mapped to stream status (0x00000000@0x04): running
Waiting for MM2S synchronization...
Waiting for S2MM sychronization...
Stream to memory-mapped status (0x00001002@0x34): running idle IOC_Irq
Memory-mapped to stream status (0x00001002@0x04): running idle IOC_Irq
Destination memory block: 44332211 7dcddfdf 5a7fefa4 36aa3c9b ca2eea6a 5bf64f81 ebf7ffbb b7f710d2

Note that IOC_Irq signifies that transfer completion interrupt was triggered.

5.5 Clocks

img/axi-dma-clocks.png

Processing system may generate up to 4 clocks

High-speed slave ports (S_AXI_HP0 .. S_AXI_HP1) and associated ports (M00_AXI, S00_AXI, S01_AXI, M_AXI_MM2S, M_AXI_S2MM) run at 150MHz dictated by FCLK_CLK1. Master in this case means that the bus transfers are initiated by the master which in this case is the AXI Direct Memory Access component. AXI Interconnect in this case is acting merely as a switch in an ethernet network multiplexing multiple AXI ports (S00_AXI, S01_AXI) to single M00_AXI.

General-purpose port (M_AXI_GP0) including all AXI Lite slaves (run at 100MHz. In this case Zynq7 Processing System is the transfer initiator. AXI Protocol Converter similarily to AXI Interconnect allows access to multiple AXI Lite slaves (S_AXI_LITE in this case) via single AXI Lite master port (M_AXI_GP0) on the Zynq7 Processing System.

14

LogiCORE IP AXI DMA v7.1, Product Guide PG021

6 Arbitrary data streams

6.1 Introduction

AMBA interface specification is published by ARM Ltd 15. AXI4-Stream one of many AMBA protocols designed to transport data streams of arbitrary width in hardware. Most usually 32-bit bus width is used, which means that 4 bytes get transferred during one cycle. At 100MHz of programmable logic frequency on FPGA-s this yields throughput of magnitude of hundreds of megabytes per second depending on memory management unit capabilities and configuration.

15

http://www.arm.com/products/system-ip/amba-specifications.php

6.2 AXI4-Stream

AXI4-Stream is a protocol designed to transport arbitrary unidirectional data streams.

data consumerdata producerPlace initial TDATA, TLAST and TUSER on the busSignal that initial data is ready via TVALIDSignal data received via TREADYStart transmitting TDATA, TLAST and TUSER in sync with ACLKAXI4-Stream SlaveAXI4-Stream Master

AXI4-Stream handshake

In AXI4-Stream TDATA width of bits is transferred per clock cycle. The transfer is started once sender signals TVALID and received responds with TREADY. TLAST signals the last byte of the stream.

img/axi-stream-valid-handshake.png

Example of READY/VALID Handshake, Start of a New Frame

AXI4-Stream has additional optional features: sending positional data with TKEEP and TSTRB ports which make it possible to multiplex both data position and data itself on TDATA lines; routing streams by TID and TDIST which roughly corresponds to stream identifier and stream destination identifier 16

16

http://wiki.fpganotes.com/doku.php/ip:bus:axi4_stream

6.3 AXI4-Stream Video

AXI4-Stream Video is a subset of AXI4-Stream designed for transporting video frames. AXI4-Stream Video is compatible with AXI4-Stream components, it simply has conventions for the use of ports already defined by AXI4-Stream:

  • The TLAST signal designates the last pixel of each line, and is also known as end of line (EOL).

  • The TUSER signal designates the first pixel of a frame and is known as start of frame (SOF).

These two flags are necessary to identify pixel locations on the AXI4 stream interface because there are no sync or blank signals. 17. Video DMA component makes use of the TUSER signal to synchronize frame buffering. Note that TUSER flag which is part of AXI4-Stream specification replaces FSYNC signal that has been used in the past by legacy applications.

17

http://www.xilinx.com/support/documentation/ip_documentation/v_axi4s_vid_out/v1_0/pg044_v_axis_vid_out.pdf

7 Video capture with VDMA

7.1 Introduction

The S2MM portion of Video DMA component can be used for video capture.

Camera input Convert to AXI4-Stream Video DDR memory AXI4-Stream to Video Out v3.0 Video output (VGA, HDMI, etc) Clocking Wizard Video Timing Controller v6.1 AXI Video Direct Memory Access v6.2 C code running on ARM cores VHDL code running on programmable fabric I²S controller block Camera control interface AXI4-Stream Broadcaster v1.1 Video4Linux2 kernel module Video4Linux2 userspace application AX4-Stream Subset Converter v1.1

Ideal pipeline for video capture employing single VDMA instance with only write channel, read channel is disabled.

7.2 Minimal hardware design

As getting everything working at the first attempt is tricky it makes sense to substitute actual camera with test pattern generator and kernel module with a userspace snippet which triggers the DMA transfer.

high-level-design/axi-tpg-via-vdma.png

High level block design for transferring frames from Test Pattern Generator to DDR memory using S2MM portion of single VDMA instance.

Note that in this case there are two clock domains: AXI4-Lite slaves are communicating at 100MHz bus speed, but video signals are transferred at bus frequency of 150MHz. High speed port clock is highlighted with yellow so if you get errors regarding clock domains double check the clock signal routing.

img/axi-vdma-s2mm-address-editor.png

Address mapping with AXI Video Direct Memory Access and AXI Test Pattern Generator

In this case VDMA controller control and status registers are mapped at 0x43000000 using AXI-Lite and that memory address can be written to in order to initiate a DMA transfer. In this example MM2S portion is disabled and S2MM portion of the VDMA controller has access to the whole physical memory range of 512MB on ZYBO via AXI High Performance port. This also bears potential security risk as malicious or buggy FPGA bitstream could make it possible to transmit sensitive DDR memory contents for instance RSA keys to third parties.

Note that without kernel module approach Linux may allocate the DMA memory ranges to applications and that combination may end up with memory corruption. In order to avoid that mem=224M should be added to kernel boot arguments so kernel would not use last 32MB for other processes and threads. Better solution would be of course to implement kernel module which ioremaps DMA memory ranges aswell as control/status register memory ranges.

img/axi-tpg-parameters.png

Test pattern generator 18 is configured to output AXI4-Stream of 24-bit RGB pixels at resolution of 640x480

Such configuration should produce tartan bars pattern.

img/axi-tpg-tartan-bars.png

Tartan bars pattern

img/axi-vdma-parameters-1.png

Only write channel (stream to memory-mapped) is enabled

img/axi-vdma-parameters-2.png

s2mm tuser signal emitted by test pattern generator 18 is used for frame synchronization

18(1,2)

Test Pattern Generator v6.0

7.3 Minimal software design

Following example for managing triple-buffered VDMA component should be pretty explainatory. Code is roughtly based on Ales Ruda's work 19 with heavy modifications based on Xilinx reference manual:

/*
 * Triple buffering example for Xilinx VDMA v6.2 IP-core,
 * loosely based on Ales Ruda's work.
 *
 *  Created on: 17.3.2013
 *      Author: Ales Ruda
 *         web: www.arbot.cz
 *
 *  Modified on: 18.12.2014
 *       Author: Lauri Vosandi
 *          web: lauri.vosandi.com
 */

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

/* Register offsets */
#define OFFSET_PARK_PTR_REG                     0x28
#define OFFSET_VERSION                          0x2c

#define OFFSET_VDMA_MM2S_CONTROL_REGISTER       0x00
#define OFFSET_VDMA_MM2S_STATUS_REGISTER        0x04
#define OFFSET_VDMA_MM2S_VSIZE                  0x50
#define OFFSET_VDMA_MM2S_HSIZE                  0x54
#define OFFSET_VDMA_MM2S_FRMDLY_STRIDE          0x58
#define OFFSET_VDMA_MM2S_FRAMEBUFFER1           0x5c
#define OFFSET_VDMA_MM2S_FRAMEBUFFER2           0x60
#define OFFSET_VDMA_MM2S_FRAMEBUFFER3           0x64
#define OFFSET_VDMA_MM2S_FRAMEBUFFER4           0x68

#define OFFSET_VDMA_S2MM_CONTROL_REGISTER       0x30
#define OFFSET_VDMA_S2MM_STATUS_REGISTER        0x34
#define OFFSET_VDMA_S2MM_IRQ_MASK               0x3c
#define OFFSET_VDMA_S2MM_REG_INDEX              0x44
#define OFFSET_VDMA_S2MM_VSIZE                  0xa0
#define OFFSET_VDMA_S2MM_HSIZE                  0xa4
#define OFFSET_VDMA_S2MM_FRMDLY_STRIDE          0xa8
#define OFFSET_VDMA_S2MM_FRAMEBUFFER1           0xac
#define OFFSET_VDMA_S2MM_FRAMEBUFFER2           0xb0
#define OFFSET_VDMA_S2MM_FRAMEBUFFER3           0xb4
#define OFFSET_VDMA_S2MM_FRAMEBUFFER4           0xb8

/* S2MM and MM2S control register flags */
#define VDMA_CONTROL_REGISTER_START                     0x00000001
#define VDMA_CONTROL_REGISTER_CIRCULAR_PARK             0x00000002
#define VDMA_CONTROL_REGISTER_RESET                     0x00000004
#define VDMA_CONTROL_REGISTER_GENLOCK_ENABLE            0x00000008
#define VDMA_CONTROL_REGISTER_FrameCntEn                0x00000010
#define VDMA_CONTROL_REGISTER_INTERNAL_GENLOCK          0x00000080
#define VDMA_CONTROL_REGISTER_WrPntr                    0x00000f00
#define VDMA_CONTROL_REGISTER_FrmCtn_IrqEn              0x00001000
#define VDMA_CONTROL_REGISTER_DlyCnt_IrqEn              0x00002000
#define VDMA_CONTROL_REGISTER_ERR_IrqEn                 0x00004000
#define VDMA_CONTROL_REGISTER_Repeat_En                 0x00008000
#define VDMA_CONTROL_REGISTER_InterruptFrameCount       0x00ff0000
#define VDMA_CONTROL_REGISTER_IRQDelayCount             0xff000000

/* S2MM status register */
#define VDMA_STATUS_REGISTER_HALTED                     0x00000001  // Read-only
#define VDMA_STATUS_REGISTER_VDMAInternalError          0x00000010  // Read or write-clear
#define VDMA_STATUS_REGISTER_VDMASlaveError             0x00000020  // Read-only
#define VDMA_STATUS_REGISTER_VDMADecodeError            0x00000040  // Read-only
#define VDMA_STATUS_REGISTER_StartOfFrameEarlyError     0x00000080  // Read-only
#define VDMA_STATUS_REGISTER_EndOfLineEarlyError        0x00000100  // Read-only
#define VDMA_STATUS_REGISTER_StartOfFrameLateError      0x00000800  // Read-only
#define VDMA_STATUS_REGISTER_FrameCountInterrupt        0x00001000  // Read-only
#define VDMA_STATUS_REGISTER_DelayCountInterrupt        0x00002000  // Read-only
#define VDMA_STATUS_REGISTER_ErrorInterrupt             0x00004000  // Read-only
#define VDMA_STATUS_REGISTER_EndOfLineLateError         0x00008000  // Read-only
#define VDMA_STATUS_REGISTER_FrameCount                 0x00ff0000  // Read-only
#define VDMA_STATUS_REGISTER_DelayCount                 0xff000000  // Read-only

typedef struct {
    unsigned int baseAddr;
    int vdmaHandler;
    int width;
    int height;
    int pixelLength;
    int fbLength;
    unsigned int* vdmaVirtualAddress;
    unsigned char* fb1VirtualAddress;
    unsigned char* fb1PhysicalAddress;
    unsigned char* fb2VirtualAddress;
    unsigned char* fb2PhysicalAddress;
    unsigned char* fb3VirtualAddress;
    unsigned char* fb3PhysicalAddress;

    pthread_mutex_t lock;
} vdma_handle;



int vdma_setup(vdma_handle *handle, unsigned int baseAddr, int width, int height, int pixelLength, unsigned int fb1Addr, unsigned int fb2Addr, unsigned int fb3Addr) {
    handle->baseAddr=baseAddr;
    handle->width=width;
    handle->height=height;
    handle->pixelLength=pixelLength;
    handle->fbLength=pixelLength*width*height;
    handle->vdmaHandler = open("/dev/mem", O_RDWR | O_SYNC);
    handle->vdmaVirtualAddress = (unsigned int*)mmap(NULL, 65535, PROT_READ | PROT_WRITE, MAP_SHARED, handle->vdmaHandler, (off_t)handle->baseAddr);
    if(handle->vdmaVirtualAddress == MAP_FAILED) {
        perror("vdmaVirtualAddress mapping for absolute memory access failed.\n");
        return -1;
    }

    handle->fb1PhysicalAddress = fb1Addr;
    handle->fb1VirtualAddress = (unsigned char*)mmap(NULL, handle->fbLength, PROT_READ | PROT_WRITE, MAP_SHARED, handle->vdmaHandler, (off_t)fb1Addr);
    if(handle->fb1VirtualAddress == MAP_FAILED) {
        perror("fb1VirtualAddress mapping for absolute memory access failed.\n");
        return -2;
    }

    handle->fb2PhysicalAddress = fb2Addr;
    handle->fb2VirtualAddress = (unsigned char*)mmap(NULL, handle->fbLength, PROT_READ | PROT_WRITE, MAP_SHARED, handle->vdmaHandler, (off_t)fb2Addr);
    if(handle->fb2VirtualAddress == MAP_FAILED) {
        perror("fb2VirtualAddress mapping for absolute memory access failed.\n");
        return -3;
    }

    handle->fb3PhysicalAddress = fb3Addr;
    handle->fb3VirtualAddress = (unsigned char*)mmap(NULL, handle->fbLength, PROT_READ | PROT_WRITE, MAP_SHARED, handle->vdmaHandler, (off_t)fb3Addr);
    if(handle->fb3VirtualAddress == MAP_FAILED)
    {
     perror("fb3VirtualAddress mapping for absolute memory access failed.\n");
     return -3;
    }

    memset(handle->fb1VirtualAddress, 255, handle->width*handle->height*handle->pixelLength);
    memset(handle->fb2VirtualAddress, 255, handle->width*handle->height*handle->pixelLength);
    memset(handle->fb3VirtualAddress, 255, handle->width*handle->height*handle->pixelLength);
    return 0;
}


void vdma_halt(vdma_handle *handle) {
    vdma_set(handle, OFFSET_VDMA_S2MM_CONTROL_REGISTER, VDMA_CONTROL_REGISTER_RESET);
    vdma_set(handle, OFFSET_VDMA_MM2S_CONTROL_REGISTER, VDMA_CONTROL_REGISTER_RESET);
    munmap((void *)handle->vdmaVirtualAddress, 65535);
    munmap((void *)handle->fb1VirtualAddress, handle->fbLength);
    munmap((void *)handle->fb2VirtualAddress, handle->fbLength);
    munmap((void *)handle->fb3VirtualAddress, handle->fbLength);
    close(handle->vdmaHandler);
}

unsigned int vdma_get(vdma_handle *handle, int num) {
    return handle->vdmaVirtualAddress[num>>2];
}

void vdma_set(vdma_handle *handle, int num, unsigned int val) {
    handle->vdmaVirtualAddress[num>>2]=val;
}

void vdma_status_dump(int status) {
    if (status & VDMA_STATUS_REGISTER_HALTED) printf(" halted"); else printf("running");
    if (status & VDMA_STATUS_REGISTER_VDMAInternalError) printf(" vdma-internal-error");
    if (status & VDMA_STATUS_REGISTER_VDMASlaveError) printf(" vdma-slave-error");
    if (status & VDMA_STATUS_REGISTER_VDMADecodeError) printf(" vdma-decode-error");
    if (status & VDMA_STATUS_REGISTER_StartOfFrameEarlyError) printf(" start-of-frame-early-error");
    if (status & VDMA_STATUS_REGISTER_EndOfLineEarlyError) printf(" end-of-line-early-error");
    if (status & VDMA_STATUS_REGISTER_StartOfFrameLateError) printf(" start-of-frame-late-error");
    if (status & VDMA_STATUS_REGISTER_FrameCountInterrupt) printf(" frame-count-interrupt");
    if (status & VDMA_STATUS_REGISTER_DelayCountInterrupt) printf(" delay-count-interrupt");
    if (status & VDMA_STATUS_REGISTER_ErrorInterrupt) printf(" error-interrupt");
    if (status & VDMA_STATUS_REGISTER_EndOfLineLateError) printf(" end-of-line-late-error");
    printf(" frame-count:%d", (status & VDMA_STATUS_REGISTER_FrameCount) >> 16);
    printf(" delay-count:%d", (status & VDMA_STATUS_REGISTER_DelayCount) >> 24);
    printf("\n");
}

void vdma_s2mm_status_dump(vdma_handle *handle) {
    int status = vdma_get(handle, OFFSET_VDMA_S2MM_STATUS_REGISTER);
    printf("S2MM status register (%08x):", status);
    vdma_status_dump(status);
}

void vdma_mm2s_status_dump(vdma_handle *handle) {
    int status = vdma_get(handle, OFFSET_VDMA_MM2S_STATUS_REGISTER);
    printf("MM2S status register (%08x):", status);
    vdma_status_dump(status);
}

void vdma_start_triple_buffering(vdma_handle *handle) {
    // Reset VDMA
    vdma_set(handle, OFFSET_VDMA_S2MM_CONTROL_REGISTER, VDMA_CONTROL_REGISTER_RESET);
    vdma_set(handle, OFFSET_VDMA_MM2S_CONTROL_REGISTER, VDMA_CONTROL_REGISTER_RESET);

    // Wait for reset to finish
    while((vdma_get(handle, OFFSET_VDMA_S2MM_CONTROL_REGISTER) & VDMA_CONTROL_REGISTER_RESET)==4);
    while((vdma_get(handle, OFFSET_VDMA_MM2S_CONTROL_REGISTER) & VDMA_CONTROL_REGISTER_RESET)==4);

    // Clear all error bits in status register
    vdma_set(handle, OFFSET_VDMA_S2MM_STATUS_REGISTER, 0);
    vdma_set(handle, OFFSET_VDMA_MM2S_STATUS_REGISTER, 0);

    // Do not mask interrupts
    vdma_set(handle, OFFSET_VDMA_S2MM_IRQ_MASK, 0xf);

    int interrupt_frame_count = 3;

    // Start both S2MM and MM2S in triple buffering mode
    vdma_set(handle, OFFSET_VDMA_S2MM_CONTROL_REGISTER,
        (interrupt_frame_count << 16) |
        VDMA_CONTROL_REGISTER_START |
        VDMA_CONTROL_REGISTER_GENLOCK_ENABLE |
        VDMA_CONTROL_REGISTER_INTERNAL_GENLOCK |
        VDMA_CONTROL_REGISTER_CIRCULAR_PARK);
    vdma_set(handle, OFFSET_VDMA_MM2S_CONTROL_REGISTER,
        (interrupt_frame_count << 16) |
        VDMA_CONTROL_REGISTER_START |
        VDMA_CONTROL_REGISTER_GENLOCK_ENABLE |
        VDMA_CONTROL_REGISTER_INTERNAL_GENLOCK |
        VDMA_CONTROL_REGISTER_CIRCULAR_PARK);


    while((vdma_get(handle, 0x30)&1)==0 || (vdma_get(handle, 0x34)&1)==1) {
        printf("Waiting for VDMA to start running...\n");
        sleep(1);
    }

    // Extra register index, use first 16 frame pointer registers
    vdma_set(handle, OFFSET_VDMA_S2MM_REG_INDEX, 0);

    // Write physical addresses to control register
    vdma_set(handle, OFFSET_VDMA_S2MM_FRAMEBUFFER1, handle->fb1PhysicalAddress);
    vdma_set(handle, OFFSET_VDMA_MM2S_FRAMEBUFFER1, handle->fb1PhysicalAddress);
    vdma_set(handle, OFFSET_VDMA_S2MM_FRAMEBUFFER2, handle->fb2PhysicalAddress);
    vdma_set(handle, OFFSET_VDMA_MM2S_FRAMEBUFFER2, handle->fb2PhysicalAddress);
    vdma_set(handle, OFFSET_VDMA_S2MM_FRAMEBUFFER3, handle->fb3PhysicalAddress);
    vdma_set(handle, OFFSET_VDMA_MM2S_FRAMEBUFFER3, handle->fb3PhysicalAddress);

    // Write Park pointer register
    vdma_set(handle, OFFSET_PARK_PTR_REG, 0);

    // Frame delay and stride (bytes)
    vdma_set(handle, OFFSET_VDMA_S2MM_FRMDLY_STRIDE, handle->width*handle->pixelLength);
    vdma_set(handle, OFFSET_VDMA_MM2S_FRMDLY_STRIDE, handle->width*handle->pixelLength);

    // Write horizontal size (bytes)
    vdma_set(handle, OFFSET_VDMA_S2MM_HSIZE, handle->width*handle->pixelLength);
    vdma_set(handle, OFFSET_VDMA_MM2S_HSIZE, handle->width*handle->pixelLength);

    // Write vertical size (lines), this actually starts the transfer
    vdma_set(handle, OFFSET_VDMA_S2MM_VSIZE, handle->height);
    vdma_set(handle, OFFSET_VDMA_MM2S_VSIZE, handle->height);
}

int vdma_running(vdma_handle *handle) {
    // Check whether VDMA is running, that is ready to start transfers
    return (vdma_get(handle, 0x34)&1)==1;
}

int vdma_idle(vdma_handle *handle) {
    // Check whtether VDMA is transferring
    return (vdma_get(handle, OFFSET_VDMA_S2MM_STATUS_REGISTER) & VDMA_STATUS_REGISTER_FrameCountInterrupt)!=0;
}

int main() {
    int j, i;
    vdma_handle handle;

    // Setup VDMA handle and memory-mapped ranges
    vdma_setup(&handle, 0x43000000, 640, 480, 4, 0x0e000000, 0x0f000000, 0x10000000);

    // Start triple buffering
    vdma_start_triple_buffering(&handle);

    // Run for 10 seconds, just monitor status registers
    for(i=0; i<10; i++) {
        vdma_s2mm_status_dump(&handle);
        vdma_mm2s_status_dump(&handle);
        printf("FB1:\n");
        for (j = 0; j < 256; j++) printf(" %02x", handle.fb1VirtualAddress[j]); printf("\n");
        sleep(1);
    }

    // Halt VDMA and unmap memory ranges
    vdma_halt(&handle);
}

Note that this is just a demo code which is not exactly usable for any practical application mainly because the memory ranges assigned for framebuffers are not reserved by any kernel module. For real applications AXI (V)DMA driver should be used. It builds proper abstraction such as /dev/axi_dma_0 or /dev/axi_vdma_0 which can be accessed from userspace applications 20.

19

http://arbot.cz/post/2013/03/20/VDMA-on-ZedBoard.aspx

20

https://github.com/Xilinx/linux-xlnx/tree/master/drivers/dma/xilinx

7.4 Grabbing frames over HTTP

Once the VDMA transfer is running you can use following Python snippet on the ZYBO to grab a frame from DDR memory and serve it over HTTP:

import os, png, mmap, BaseHTTPServer

FRAMEBUFFER_OFFSET=0x0e000000
WIDTH = 640
HEIGHT = 480
PIXEL_SIZE = 4

fh = os.open("/dev/mem", os.O_SYNC | os.O_RDONLY) # Disable cache, read-only
mm = mmap.mmap(fh, WIDTH*HEIGHT*PIXEL_SIZE, mmap.MAP_SHARED, mmap.PROT_READ, offset=FRAMEBUFFER_OFFSET)
class MyHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(s):
        writer = png.Writer(WIDTH, HEIGHT, alpha=True)
        s.send_response(200)
        s.send_header("Content-type", "image/png")
        s.end_headers()
        writer.write_array(s.wfile,[ord(j) for j in mm[0:WIDTH*HEIGHT*PIXEL_SIZE]] )

httpd = BaseHTTPServer.HTTPServer(("0.0.0.0", 80), MyHandler)
try:
    httpd.serve_forever()
except KeyboardInterrupt:
    pass
httpd.server_close()

mm.close()
fh.close()

Simply open http://zybo-ip-address:80 on your laptop assuming that the laptop and ZYBO are attached to same network.

7.5 Interfacing with OV7670 camera module

The Hamsterworks controller block can be reused to initialize the camera, there are no modifications required there. The Hamsterworks capture component however is not suitable for interfacing with AXI4-Stream Video compatible cores. Thus we need a slightly modified block which generates corresponding frame and line synchronization primitives.

----------------------------------------------------------------------------------
-- Authors: Mike Field <hamster@snap.net.nz>
--          Lauir Vosandi <lauri.vosandi@gmail.com>
----------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ov7670_axi_stream_capture is
    port (
        pclk              : in  std_logic;
        vsync             : in  std_logic;
        href              : in  std_logic;
        d                 : in  std_logic_vector (7 downto 0);
        m_axis_tvalid     : out std_logic;
        m_axis_tready     : in  std_logic;
        m_axis_tlast      : out std_logic;
        m_axis_tdata      : out std_logic_vector ( 31 downto 0 );
        m_axis_tuser      : out std_logic;
        aclk              : out std_logic
    );
end ov7670_axi_stream_capture;

architecture behavioral of ov7670_axi_stream_capture is
    signal d_latch          : std_logic_vector(15 downto 0) := (others => '0');
    signal address          : std_logic_vector(18 downto 0) := (others => '0');
    signal line             : std_logic_vector(1 downto 0)  := (others => '0');
    signal href_last        : std_logic_vector(6 downto 0)  := (others => '0');
    signal we_reg           : std_logic := '0';
    signal href_hold        : std_logic := '0';
    signal latched_vsync    : std_logic := '0';
    signal latched_href     : std_logic := '0';
    signal latched_d        : std_logic_vector (7 downto 0) := (others => '0');
    signal sof              : std_logic := '0';
    signal eol              : std_logic := '0';
begin
     -- Expand 16-bit RGB (5:6:5) to 32-bit RGBA (8:8:8:8)
     m_axis_tdata  <= "11111111"  & d_latch(4 downto 0) & d_latch(0) & d_latch(0) & d_latch(0) & d_latch(10 downto 5) & d_latch(5) & d_latch(5) & d_latch(15 downto 11) & d_latch(11) & d_latch(11) & d_latch(11);
     m_axis_tvalid <= we_reg;
     m_axis_tlast <= eol;
     m_axis_tuser <= sof;
     aclk <= not pclk;

capture_process: process(pclk)
begin
    if rising_edge(pclk) then
        if we_reg = '1' then
            address <= std_logic_vector(unsigned(address)+1);
        end if;

        if href_hold = '0' and latched_href = '1' then
            case line is
                when "00" => line <= "01";
                when "01" => line <= "10";
                when "10" => line <= "11";
                when others => line <= "00";
            end case;
        end if;
        href_hold <= latched_href;

        -- Capturing the data from the camera
        if latched_href = '1' then
            d_latch <= d_latch( 7 downto 0) & latched_d;
        end if;
        we_reg  <= '0';

        -- Is a new screen about to start (i.e. we have to restart capturing)
        if latched_vsync = '1' then
            address        <= (others => '0');
            href_last     <= (others => '0');
            line            <= (others => '0');
        else
            -- If not, set the write enable whenever we need to capture a pixel
            if href_last(0) = '1' then
                we_reg <= '1';
                href_last <= (others => '0');
            else
                href_last <= href_last(href_last'high-1 downto 0) & latched_href;
            end if;
        end if;

        case unsigned(address) mod 640 = 639 is
            when true => eol <= '1';
            when others => eol <= '0';
        end case;

        case unsigned(address) = 0 is
            when true => sof <= '1';
            when others => sof <= '0';
        end case;
    end if;
    if falling_edge(pclk) then
        latched_d     <= d;
        latched_href  <= href;
        latched_vsync <= vsync;
    end if;
end process;
end behavioral;

Modified block converts 16-bit RGB (5:6:5) signal to 32-bit RGBA (8:8:8:8) signal with fake opaque alpha channel. This way whole pixel is transferred during one AXI bus cycle and start-of-frame and end-of-line signals are perfectly aliged with the content.

Substituting test pattern generator with the modified capture block and adding controller block should be enough to have the video input from the camera connected to AXI4-Stream Video compatible pipeline.

21

http://www.voti.nl/docs/OV7670.pdf

7.6 Video4Linux2 driver

As Zynq-7000 boards have I²C bus master built-in, it make sense to take advantage of that feature instead of implementing controller block from scratch. On ZYBO the EEPROM and audio codec are connected to the I²C bus, but it should be possible to route I²C bus to Pmod connectors using IIC_0 port on Zynq7 processing system block. It should also be possible to access the I²C bus via /dev/i2c-0 device node if corresponding kernel modules have been loaded 22. This should make it possible to take advantage of OV7670 kernel module 23 which was written for One Laptop Per Child project. This way the camera initialization can be done by kernel and the camera can be configured via any Video4Linux application instead of static bitstream. How transferring the frames could be done in this case is not however clear yet.

22

Scanning a I²C bus for available slave devices

23

http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/drivers/media/video/ov7670.c

8 Complete video processing pipeline on ZYBO

8.1 Introduction

Xilinx libraries contain Video Direct Memory Access (VDMA) IP-core which can be used to transfer AXI4-Stream protocol based video stream to DDR memory and vice versa. Corresponding sub-components are S2MM (Stream to memory-mapped) also known as write channel and MM2S (Memory-mapped to stream) also known as read channel. Using both of them a video buffer can be implemented with optional crop and zoom features 24.

24

http://www.xilinx.com/support/documentation/ip_documentation/axi_videoip/v1_0/ug934_axi_videoIP.pd

8.2 Minimal hardware design

This is the last and the most complex example of what I have tried on the ZYBO.

high-level-design/ov7670-axi-stream-capture-pipeline.png

High level block design for complete hardware pipeline

In this example we connect Omnivision OV7670 camera via modified capture block to AXI4-Stream Video compatible pipeline. The VDMA controller maintains a ring buffer in the DDR memory and transfers the frame data to these buffers as the frames are received by S2MM portion of the VDMA controller.

img/axi-vdma-both-address-editor.png

Address mapping with AXI Video Direct Memory Access

25

LogiCORE IP AXI Video Direct Memory Access v6.2

8.3 Conclusion

If you've followed the guide you should know by now more or less how the video streams are handled on Zynq-7000 based boards. If that's not the case then I have failed. The code snippets can be found at GitHub repository 26. Most up to date VHDL and Zynq-7000 materials can be found on Lauri's blog 27.

26

https://github.com/laurivosandi/hdl/tree/master/zynq/

27

http://lauri.vosandi.com/hdl/