UltraScale Architecture-Based FPGAs Memory IP v1.4 (PG150)
• DDR3 v1.4
• DDR4 v2.2
• LPDDR3 v1.0
• QDR II+ v1.4
• QDR-IV v2.0
• RLDRAM 3 v1.4
IP Facts
The core connects FPGA user designs to DDR3 and DDR4 SDRAM, LPDDR3 SDRAM, QDR II+ SRAM,
QDR-IV SRAM, and RLDRAM 3 devices.

This product guide provides information about using, customizing, and simulating a
LogiCORE™ IP DDR3 or DDR4 SDRAM, LPDDR3 SDRAM, QDR II+ SRAM, QDR-IV SRAM, or RLDRAM 3
interface core for UltraScale architecture-based FPGAs. It also covers the following topics
for each core: Overview, Product Specification, Core Architecture, Example Design, and
Test Bench.

Resources: See Resource Utilization (DDR3/DDR4), Resource Utilization (LPDDR3),
Resource Utilization (QDR II+), Resource Utilization (QDR-IV), and
Resource Utilization (RLDRAM 3).

Provided with Core:
Design Files: RTL
Example Design: Verilog
Test Bench: Verilog
Constraints File: XDC
Overview
IMPORTANT: This document supports DDR3 SDRAM core v1.4 and DDR4 SDRAM core v2.2.
• Hardware, IP, and Platform Development: Creating the PL IP blocks for the hardware
platform, creating PL kernels, subsystem functional simulation, and evaluating the
Vivado® timing, resource, and power closure. It also involves developing the hardware
platform for system integration. Topics in this document that apply to this design
process include:
° Clocking
° Resets
° Protocol Description
° Example Design
Core Overview
The Xilinx UltraScale™ architecture includes the DDR3/DDR4 SDRAM cores. These cores
provide solutions for interfacing with these SDRAM memory types. Both a complete
Memory Controller and a physical (PHY) layer only solution are supported. The DDR3/DDR4
cores are organized into the following high-level blocks:
• Controller – The controller accepts burst transactions from the user interface and
generates transactions to and from the SDRAM. The controller takes care of the SDRAM
timing parameters and refresh. It coalesces write and read transactions to reduce the
number of dead cycles involved in turning the bus around. The controller also reorders
commands to improve the utilization of the data bus to the SDRAM.
• Physical Layer – The physical layer provides a high-speed interface to the SDRAM. This
layer includes the hard blocks inside the FPGA and the soft calibration logic
necessary to ensure optimal timing of the hard blocks interfacing to the SDRAM.
Results of the calibration process are available through the Xilinx debug tools.
After completion of calibration, the PHY layer presents a raw interface to the
SDRAM. When the PHY only solution is used, the application logic is responsible for
all SDRAM transactions, timing, and refresh.
• Application Interface – The user interface layer provides a simple FIFO-like interface
to the application. Data is buffered and read data is presented in request order.
The user interface is layered on top of the native interface to the controller. The
native interface is not accessible by the user application; it has no buffering and
presents return data to the user interface as it is received from the SDRAM, which is
not necessarily in the original request order. The user interface then buffers the read
and write data and reorders the data as needed.
Feature Summary
DDR3 SDRAM
• Component support for interface width of 8 to 80 bits (RDIMM, UDIMM, and SODIMM
support)
° The maximum component limit is nine; this restriction applies to components only,
not to DIMMs
• DDR3 (1.5V) and DDR3L (1.35V)
• Dual slot support for RDIMMs, SODIMMs, and UDIMMs
• Quad-rank RDIMM support
• Density support
° Support for other memory device densities is available through custom part
selection
• 8-bank support
• x4 (x4 devices must be used in even multiples), x8, and x16 device support
• AXI4 Slave Interface
Note: The x4-based component interfaces do not support AXI4, while x4-based RDIMMs and
LRDIMMs do support AXI4.
• x4, x8, and x16 components are supported
• 8-word burst support
• Support for 5 to 14 cycles of column-address strobe (CAS) latency (CL)
• On-die termination (ODT) support
• Support for 5 to 10 cycles of CAS write latency
• Source code delivery in Verilog
• 4:1 memory to FPGA logic interface clock ratio
• Open, closed, and transaction-based precharge controller policy
• Interface calibration and training information available through the Vivado hardware
manager
• Optional Error Correcting Code (ECC) support for non-AXI4 72-bit interfaces
DDR4 SDRAM
• Component support for interface width of 8 to 80 bits (RDIMM, LRDIMM, UDIMM, and
SODIMM support)
° The maximum component limit is nine; this restriction applies to components only,
not to DIMMs
• Density support
° Support for other memory device densities is available through custom part
selection
• AXI4 Slave Interface
Note: The x4-based component interfaces do not support AXI4, while x4-based RDIMMs and
LRDIMMs do support AXI4.
• x4, x8, and x16 components are supported
• Dual slot support for DDR4 RDIMMs, SODIMMs, LRDIMMs, and UDIMMs
• 8-word burst support
• Support for 9 to 24 cycles of column-address strobe (CAS) latency (CL)
• ODT support
• 3DS RDIMM and LRDIMM support
• 3DS component support
• Support for 9 to 18 cycles of CAS write latency
• Source code delivery in Verilog
• 4:1 memory to FPGA logic interface clock ratio
• Open, closed, and transaction-based precharge controller policy
• Interface calibration and training information available through the Vivado hardware
manager
• Optional Error Correcting Code (ECC) support for non-AXI4 72-bit interfaces
• CRC for write operations is not supported
• 2T timing for the address/command bus is not supported
RECOMMENDED: Use x8 or x4-based interfaces for maximum efficiency. These devices have four bank
groups and 16 banks, which allow greater efficiency than the x16-based devices, which only
have two bank groups and eight banks. The DDR4 devices have better access timing among the bank
groups, so the larger number can increase the efficiency. Note that x16 DDP DDR4 DRAM is composed
of two x8 devices and therefore has the larger number of banks and bank groups. For more
information, see AR: 71209.
IMPORTANT: DBI should be enabled when repeated single Burst Length = 8 (BL8) read accesses with all
"0" data on the DQ bus, followed by idle (NOP/DESELECT) inserted between each BL8 read burst, occur
as shown in Figure 1-2. Enabling the DBI feature effectively mitigates the resulting power supply noise.
If DBI is not an option, then encoding the data to remove all-"0" bursts in the application before it
reaches the memory controller is an equally effective method for mitigating power supply noise. For
x4-based RDIMM/LRDIMM interfaces, which lack the DM/DBI pin, the power supply noise is mitigated by the
ODT settings used for these topologies. For x4-based component interfaces wider than 16 bits, the data
encoding method is recommended.
Figure 1-2: Repeated BL8 reads of all-"0" data with idle cycles between bursts (CK_t, DQS_t, and DQ0–DQn waveforms)
Information about other Xilinx LogiCORE IP modules is available at the Xilinx Intellectual
Property page. For information on pricing and availability of other Xilinx LogiCORE IP
modules and tools, contact your local Xilinx sales representative.
License Checkers
If the IP requires a license key, the key must be verified. The Vivado® design tools have
several license checkpoints for gating licensed IP through the flow. If the license check
succeeds, the IP can continue generation. Otherwise, generation halts with an error. License
checkpoints are enforced by the following tools:
• Vivado synthesis
• Vivado implementation
• write_bitstream (Tcl command)
IMPORTANT: IP license level is ignored at checkpoints. The test confirms a valid license exists. It does
not check IP license level.
Product Specification
Standards
This core supports DRAMs that are compliant with the JESD79-3F DDR3 SDRAM Standard and
the JESD79-4 DDR4 SDRAM Standard, JEDEC® Solid State Technology Association [Ref 1]. It
also supports the DDR4 3DS Addendum.
For more information on UltraScale™ architecture documents, see References, page 789.
Performance
Maximum Frequencies
For more information on the maximum frequencies, see the data sheet for the target UltraScale
or UltraScale+ device.

Efficiency

Table 2-1 shows the controller efficiency for the workloads described in the following
sections. In each workload, the refresh interval is 7.8 µs and the periodic read interval is set
to 1.0 µs. The address mapping option is ROW_COLUMN_BANK and ORDERING is NORMAL
unless noted otherwise.
Efficiency Workloads
Sequential Read
• Simple address increment pattern
• 100% reads
Sequential Write
• Simple address increment pattern
• 100% writes (except for periodic reads generated by the controller for VT tracking)
Sequential write efficiency is lower than sequential read due to the injection of periodic
reads in the write sequence. The burst workload achieves an efficiency between
sequential read and sequential write. The burst workload has read transactions frequently
enough that periodic reads are not injected by the controller, but due to read/write bus
turnaround its efficiency is still somewhat lower than a pure sequential read.
The short burst workload shows the effect of more frequent bus turnaround compared to
the 64 transaction bursts. The random workload shows the effect of frequent bus
turnaround and page misses. In all the cases in Table 2-1, efficiency is primarily limited by
DRAM specifications and the Memory Controller is scheduling transactions as efficiently as
possible.
The example DDR3/DDR4 "Idle" read latencies are shown in the following section for both
the user interface and PHY only interfaces. Actual read latency in hardware might vary due
to command and data bus flight times, package delays, CAS latency, etc.
The latency numbers are for an "Idle" case, where the Memory Controller starts off with no
pending transactions, no pending refreshes or periodic reads, and any DRAM protocol or
timing restrictions from previous commands have elapsed. When a new read transaction is
received, there is nothing blocking progress and read data is returned with the minimum
latency.
Table 2-2 shows the user interface idle latency in DRAM clock cycles from assertion of
app_en and app_rdy to assertion of app_rd_data_valid.
Table 2-3 shows the PHY Only interface read CAS command to read data latency. This is
equivalent to page hit latency.
Table 2-3: PHY Only Interface Read CAS Command to Read Data Latency
Latency Category DDR3 1600 CL = 11 [tCK] DDR4 2400 CL = 16 [tCK]
Page Hit 40 44
Resource Utilization
For full details about performance and resource utilization, visit Performance and Resource
Utilization (DDR3) and Performance and Resource Utilization (DDR4).
Port Descriptions
For a complete Memory Controller solution, there are three port categories at the top level
of the memory interface core, called the "user design."
• The first category is the memory interface signals that directly interface with the
SDRAM. These are defined by the JEDEC specification.
• The second category is the application interface signals. These are described in the
Protocol Description, page 118.
• The third category includes other signals necessary for proper operation of the core.
These include the clocks, reset, and status signals from the core. The clocking and reset
signals are described in their respective sections.
For a PHY layer only solution, the top-level application interface signals are replaced with
the PHY interface. These signals are described in the PHY Only Interface, page 157.
The signals that interface directly with the SDRAM and the clocking and reset signals are the
same as for the Memory Controller solution.
Core Architecture
This chapter describes the UltraScale™ architecture-based FPGAs Memory Interface
Solutions core with an overview of the modules and interfaces.
Overview
The UltraScale architecture-based FPGAs Memory Interface Solution core is shown in
Figure 3-1.
Figure 3-1: UltraScale Architecture-Based FPGAs Memory Interface Solution Core Architecture (the FPGA user logic connects through the user interface to the Memory Controller, initialization/calibration logic, and physical layer, which drives the DDR3/DDR4 SDRAM; CalDone and read data are returned to the user logic)
Memory Controller
The Memory Controller (MC) is designed to take Read, Write, and Read-Modify-Write
transactions from the user interface (UI) block and issue them to memory efficiently with
low latency, meeting all DRAM protocol and timing requirements while using minimal FPGA
resources. The controller operates with a DRAM to system clock ratio of 4:1 and can issue
one Activate, one CAS, and one Precharge command on each system clock cycle.
The controller supports an open page policy and can achieve very high efficiencies for
workloads with a high degree of spatial locality. The controller also supports a closed page
policy and the ability to reorder transactions to efficiently schedule workloads with address
patterns that are more random. The controller also allows a degree of control over low-level
functions with a UI control signal for AutoPrecharge on a per-transaction basis, as well as
signals that can be used to determine when DRAM refresh commands are issued.

The controller command path is built from three main stages:

1. The Group FSMs that queue up transactions, check DRAM timing, and decide when to
request Precharge, Activate, and CAS DRAM commands.
2. The "Safe" logic and arbitration units that reorder transactions between Group FSMs
based on additional DRAM timing checks while also ensuring forward progress for all
DRAM command requests.
3. The Final Arbiter that makes the final decision about which commands are issued to the
PHY and feeds the result back to the previous stages.
Figure 3-2: Memory Controller Block Diagram (the UI feeds read/write transactions and data to Group FSMs 0–3, which generate Precharge, Activate, and CAS requests; read/write transaction reorder, Precharge, Activate, and CAS arbitration with the "safe" logic feed the Final Arbiter, which sends commands/addresses to the PHY; the datapath includes ECC on read and write data, and a maintenance block handles Refresh, ZQCS, and VT tracking)
Native Interface
The UI block is connected to the Memory Controller by the native interface, and provides
the controller with address decode and read/write data buffering. On writes, data is
requested by the controller one cycle before it is needed by presenting the data buffer
address on the native interface. This data is expected to be supplied by the UI block on the
next cycle. Hence there is no buffering of any kind for data (except due to the barrel shifting
to place the data on a particular DDR clock).
On reads, the data is offered by the MC on the cycle it is available. Read data, along with a
buffer address, is presented on the native interface as soon as it is ready. The data has to be
accepted by the UI block.
Read and write transactions are mapped to an mcGroup instance based on the bank group and
bank address bits of the decoded address from the UI block. Although DDR3 has no bank
groups, the name group is used because an mcGroup either services a real bank group in
DDR4 x4 and x8 devices (serving the four banks of that group) or a pair of banks. For DDR3,
each mcGroup module services two banks.

In the case of a DDR4 x16 interface, the mcGroup represents one group bit (there is only
one group bit in x16) and one bank bit, whereby the mcGroup serves two banks.
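As an illustration only, the mcGroup selection described above can be sketched as follows; the
module, signal names, and exact bit choices are hypothetical and do not necessarily match the
generated RTL.

// Hypothetical sketch of mcGroup selection from decoded bank group (BG) and bank
// address (BA) bits; illustrative only, the IP's actual bit mapping can differ.
module mcgroup_select #(
  parameter bit DDR4       = 1'b1,
  parameter bit X16_DEVICE = 1'b0
) (
  input  logic [1:0] bg,        // bank group bits (DDR4 only)
  input  logic [2:0] ba,        // bank address bits
  output logic [1:0] group_sel  // selected Group FSM (0 to 3)
);
  always_comb begin
    if (DDR4 && !X16_DEVICE)     group_sel = bg;              // x4/x8: one mcGroup per real bank group (four banks each)
    else if (DDR4 && X16_DEVICE) group_sel = {bg[0], ba[1]};  // x16: one group bit plus one bank bit (two banks each)
    else                         group_sel = ba[2:1];         // DDR3: each mcGroup services a pair of banks
  end
endmodule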
The total number of outstanding requests depends on the number of mcGroup instances, as
well as the round trip delay from the controller to memory and back. When the controller
issues an SDRAM CAS command to memory, an mcGroup instance becomes available to
take a new request, while the previous CAS commands, read return data, or write data might
still be in flight.
Datapath
Read and write data pass through the Memory Controller. If ECC is enabled, a SECDED code
word is generated on writes and checked on reads. For more information, see ECC, page 30.
The MC generates the requisite control signals to the mcRead and mcWrite modules telling
them the timing of read and write data. The two modules acquire or provide the data as
required at the right time.
Reordering
Requests that map to the same mcGroup are never reordered. Reordering between the
mcGroup instances is controlled with the ORDERING parameter. When set to "NORM,"
reordering is enabled and the arbiter implements a round-robin priority plan, selecting in
priority order among the mcGroups with a command that is safe to issue to the SDRAM.
The timing of when it is safe to issue a command to the SDRAM can vary depending on the target
bank or bank group and its page status. This often contributes to reordering.
When the ORDERING parameter is set to "STRICT," all requests have their CAS commands
issued in the order in which the requests were accepted at the native interface. STRICT
ordering overrides all other controller mechanisms, such as the tendency to coalesce read
requests, and can therefore degrade data bandwidth utilization in some workloads.
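For reference, ORDERING is set as a top-level parameter on the generated core. A hypothetical
instantiation fragment (the wrapper and instance names are illustrative, not the names produced
by the tool) looks like this:

// Hypothetical instantiation fragment showing the ORDERING parameter.
example_ddr4_core #(
  .ORDERING ("NORM")   // "NORM" enables reordering; "STRICT" forces CAS commands in request order
) u_mem_core (
  // ... ports as generated by the tool
);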
Group Machines
In the Memory Controller, there are four group state machines. These state machines are
allocated based on the bank group and bank address bits of the decoded address, depending on
the technology (DDR3 or DDR4) and device width (x4, x8, and x16). In this description, GM refers
to the Group Machine (0 to 3), BG refers to the group address, and BA refers to the bank address.
Note that group in the context of a group state machine denotes a notional group and does
not necessarily refer to a real bank group (except in the case of DDR4 x4 and x8 parts).
Figure 3-3 shows the Group FSM block diagram for one instance. There are two main
sections to the Group FSM block, stage 1 and stage 2, each containing a FIFO and an FSM.
Stage 1 interfaces to the UI, issues Precharge and Activate commands, and tracks the DRAM
page status.
Stage 2 issues CAS commands and manages the RMW flow. There is also a set of DRAM
timers for each rank and bank used by the FSMs to schedule DRAM commands at the
earliest safe time. The Group FSM block is designed so that each instance queues up
multiple transactions from the UI, interleaves DRAM commands from multiple transactions
onto the DDR bus for efficiency, and executes CAS commands strictly in order.
Figure 3-3: Group FSM Block Diagram (stage 1: transaction FIFO, page table/status, and FSM issuing Precharge and Activate requests; stage 2: CAS FIFO and FSM issuing CAS requests; per-rank/bank DRAM timers for tRCD, tRP, and tRAS; the winning command is fed back from the arbiter)
When a new transaction is accepted from the UI, it is pushed into the stage 1 transaction
FIFO. The page status of the transaction at the head of the stage 1 FIFO is checked and
provided to the stage 1 transaction FSM. The FSM decides if a Precharge or Activate
command needs to be issued, and when it is safe to issue them based on the DRAM timers.
When the page is open and not already scheduled to be closed due to a pending RDA or
WRA in the stage 2 FIFO, the transaction is transferred from the stage 1 FIFO to the stage 2
FIFO. At this point, the stage 1 FIFO is popped and the stage 1 FSM begins processing the
next transaction. In parallel, the stage 2 FSM processes the CAS command phase of the
transaction at the head of the stage 2 FIFO. The stage 2 FSM issues a CAS command request
when it is safe based on the tRCD timers. The stage 2 FSM also issues both a read and write
CAS request for RMW transactions.
ECC
The MC supports an optional SECDED ECC scheme that detects and corrects read data
errors of one bit per DQ bus burst and detects all 2-bit errors per burst. The 2-bit
errors are not corrected. Three or more bit errors per burst might or might not be detected,
but are never corrected. Enabling ECC adds four DRAM clock cycles of latency to all reads,
whether errors are detected/corrected or not.
Note: When ECC is enabled, initialize (or write to) the memory space prior to performing partial
writes (RMW).
Read-Modify-Write Flow
When a wr_bytes command is accepted at the user interface, it is eventually assigned to a
group state machine like other write or read transactions. The group machine breaks the
partial write into a read phase and a write phase. The read phase reads the addressed data
from memory, checks it for errors, and stores it in the controller.

Data from the read phase is not returned to the user interface. If errors are detected in the
read data, an ECC error signal is asserted at the native interface. After the read data is
stored in the controller, the write phase begins as follows:
1. Write data is merged with the stored read data based on the write data mask bits.
2. New ECC check bits are generated for the merged data and check bits are written to
memory.
3. Any multiple-bit errors in the read phase result in the error being made undetectable in
the write phase because new check bits are generated for the merged data. This is why the ECC
error signal is generated on the read phase even though data is not returned to the user
interface. This allows the system to know when an uncorrectable error has been turned into
an undetectable error.
When the write phase completes, the group machine becomes available to process a new
transaction. The RMW flow ties up a group machine for a longer time than a simple read or
write, and therefore might impact performance.
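The following is a minimal sketch of the byte-wise merge performed in the write phase, assuming
one write data mask bit per byte; the module and signal names are illustrative, not the
controller's internal names.

// Illustrative RMW merge: masked bytes keep the stored read data, unmasked bytes
// take the new write data; the merged word is then re-encoded with new check bits.
module rmw_merge #(
  parameter int DATA_BYTES = 8
) (
  input  logic [DATA_BYTES*8-1:0] stored_rd_data,  // corrected data from the read phase
  input  logic [DATA_BYTES*8-1:0] wr_data,         // new write data from the user interface
  input  logic [DATA_BYTES-1:0]   wr_data_mask,    // 1 = byte masked (not written)
  output logic [DATA_BYTES*8-1:0] merged_data
);
  always_comb begin
    for (int b = 0; b < DATA_BYTES; b++) begin
      merged_data[b*8 +: 8] = wr_data_mask[b] ? stored_rd_data[b*8 +: 8]
                                              : wr_data[b*8 +: 8];
    end
  end
endmodule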
ECC Module
The ECC module is instantiated inside the DDR3/DDR4 Memory Controller. It is made up of
five submodules as shown in Figure 3-4.
Figure 3-4: ECC Block Diagram (submodules: ECC Gen for H-matrix generation; Merge and Encode, which merges write_data_ni2mc with buffered read data under write_data_mask_ni2mc control and generates check bits; XOR Block for error injection onto wr_data_mc2phy; Decode, which checks read data from the PHY and drives ecc_single, ecc_multiple, and ecc_err_addr; and a 32-entry ECC Buffer, PAYLOAD_WIDTH × 2 × nCK_PER_CLK wide, with separate read and write address ports)
Read data and check bits from the PHY are sent to the Decode block, and on the next
system clock cycle data and error indicators ecc_single/ecc_multiple are sent to the
NI. ecc_single asserts when a correctable error is detected and the read data has been
corrected. ecc_multiple asserts when an uncorrectable error is detected.
Read data is not modified by the ECC logic on an uncorrectable error. Error indicators are
never asserted for “periodic reads,” which are read transactions generated by the controller
only for the purposes of VT tracking and are not returned to the user interface or written
back to memory in an RMW flow.
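As a minimal sketch of the error classification only, assuming an odd-column-weight
(Hsiao-style) SECDED code in which a single-bit error yields an odd-weight syndrome and a
double-bit error yields a nonzero even-weight syndrome; the module and signal names are
illustrative and this is not the IP's decode RTL.

// Illustrative SECDED error classification from a precomputed syndrome.
module secded_classify (
  input  logic [7:0] syndrome,      // recomputed check bits XOR received check bits
  output logic       ecc_single,    // correctable single-bit error detected
  output logic       ecc_multiple   // uncorrectable double-bit error detected
);
  wire has_error  = |syndrome;
  wire odd_weight = ^syndrome;      // odd syndrome weight => single-bit error (Hsiao code assumption)
  assign ecc_single   = has_error &&  odd_weight;
  assign ecc_multiple = has_error && !odd_weight;
endmodule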
Write data is merged in the Encode block with read data stored in the ECC Buffer. The merge
is controlled on a per byte basis by the write data mask signal. All writes use this flow, so full
writes are required to have all data mask bits deasserted to prevent unintended merging.
After the Merge stage, the Encode block generates check bits for the write data. The data
and check bits are output from the Encode block with a one system clock cycle delay.
The ECC Gen block implements an algorithm that generates an H-matrix for ECC check bit
generation and error checking/correction. The generated code depends only on the
PAYLOAD_WIDTH and DQ_WIDTH parameters, where DQ_WIDTH = PAYLOAD_WIDTH +
ECC_WIDTH. Currently only DQ_WIDTH = 72 and ECC_WIDTH = 8 is supported.
Error Address
Each time a read CAS command is issued, the full DRAM address is stored in a FIFO in the
decode block. When read data is returned and checked for errors, the DRAM address is
popped from the FIFO and ecc_err_addr[51:0] is returned on the same cycle as signals
ecc_single and ecc_multiple for the purposes of error logging or debug. Table 3-1 is
a common definition of this address for DDR3 and DDR4.
Latency
When the parameter ECC is ON, the ECC modules are instantiated and read and write data
latency through the MC increases by one system clock cycle. When ECC is OFF, the data
buses just pass through the MC and all ECC logic should be optimized out.
Address Parity
The Memory Controller generates even command/address parity with a one DRAM clock
delay after the chip select asserts Low. This signal is only used in DDR4 RDIMM
configurations where parity is required by the DIMM RCD component.
Address parity is supported only for DDR4 RDIMM and LRDIMM configurations, which
includes 3DS RDIMMs and LRDIMMs. The Memory Controller does not monitor the
Alert_n parity error status output from the RDIMM/LRDIMM and it might return
corrupted data to the User Interface after a parity error.
To detect this issue, you need to add a pin to your design to monitor the Alert_n signal.
If an Alert_n event is detected, the memory contents should be considered corrupt. To
recover from a parity error the Memory Controller must be reset, and all DRAM contents are
lost.
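A minimal sketch of the even parity generation is shown below, registered so that the parity
output lags the covered command/address bits by one DRAM clock as described above; the covered
bit set, widths, and signal names are assumptions, not the IP's RTL.

// Illustrative even-parity generation for the DDR4 command/address bus.
module ca_parity #(
  parameter int CA_WIDTH = 24                 // number of covered address/command bits (illustrative)
) (
  input  logic                dram_clk,
  input  logic [CA_WIDTH-1:0] ca_bus,         // address/command bits covered by parity
  output logic                par             // even parity output to the RDIMM RCD
);
  // Even parity: drive PAR so that the XOR of the covered bits plus PAR is 0.
  always_ff @(posedge dram_clk)
    par <= ^ca_bus;                           // one DRAM clock delay relative to the covered command
endmodule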
PHY
The PHY is the low-level physical interface to an external DDR3 or DDR4 SDRAM device and
includes all of the calibration logic for ensuring reliable operation of the physical interface
itself. The PHY generates the signal timing and sequencing required to interface to the
memory device. It contains the following:
• Clock/address/control-generation logic
• Write and read datapaths
• Logic for initializing the SDRAM after power-up
In addition, the PHY contains calibration logic to perform timing training of the read and
write datapaths to account for system static and dynamic delays.
The PHY is included in the complete Memory Interface Solution core, but can also be
implemented as a standalone PHY only block. A PHY only solution can be selected if you
plan to implement a custom Memory Controller. For details about interfacing to the PHY
only block, see the PHY Only Interface, page 157.
The Memory Controller and calibration logic communicate with this dedicated PHY in the
slow frequency clock domain, which is the memory clock divided by four or divided by two,
depending on the DDR3 or DDR4 memory clock. A more detailed block diagram of the PHY
design is shown in Figure 3-5.
Figure 3-5: PHY Block Diagram (the Memory Controller and user interface connect through cal_top to the XIPHY and IOBs, which drive the DDR address/control, write data, and mask; a MicroBlaze MCS with the calAddrDecode block performs calibration and provides CalDone, status, and calibration debug support; the pll, pllGate, and infrastructure blocks generate the PHY clocks)
The Memory Controller is designed to separate out the command processing from the
low-level PHY requirements to ensure a clean separation between the controller and
physical layer. The command processing can be replaced with custom logic if desired, while
the logic for interacting with the PHY stays the same and can still be used by the calibration
logic.
The memory initialization is executed in Verilog RTL. The calibration and training are
implemented by an embedded MicroBlaze™ processor. The MicroBlaze Controller System
(MCS) is configured with an I/O Module and a block RAM. The
<module>_...cal_addr_decode.sv module provides the interface for the processor to
the rest of the system and implements helper logic. The <module>_...config_rom.sv
module stores settings that control the operation of initialization and calibration, providing
run time options that can be adjusted without having to recompile the source code.
The address unit connects the MCS to the local register set and the PHY by performing
address decode and control translation on the I/O module bus from spaces in the memory
map and MUXing return data (<module>_...cal_addr_decode.sv). In addition, it
provides address translation (also known as “mapping”) from a logical conceptualization of
the DRAM interface to the appropriate pinout-dependent location of the delay control in
the PHY address space.
Although the calibration architecture presents a simple and organized address map for
manipulating the delay elements for individual data, control and command bits, there is
flexibility in how those I/O pins are placed. For a given I/O placement, the path to the FPGA
logic is locked to a given pin. To enable a single binary software file to work with any
memory interface pinout, a translation block converts the simplified RIU addressing into
the pinout-specific RIU address for the target design (see Table 3-5).
The specific address translation is written by DDR3/DDR4 SDRAM after a pinout is selected
and cannot be modified. The code shows an example of the RTL structure that supports
this.
In this example, DQ0 is pinned out on Bit[0] of nibble 0 (nibble 0 according to instantiation
order). The RIU address for the ODELAY for Bit[0] is 0x0D. When DQ0 is addressed
(indicated by address 0x000_4100), this snippet of code is active. It enables nibble 0
(decoded to one-hot downstream) and forwards the address 0x0D to the RIU address bus.
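The following is a minimal sketch of what such a translation entry can look like for this
example; the module, port names, and widths are illustrative, and the generated
<module>_...cal_addr_decode.sv differs in structure and content.

// Hypothetical sketch of the pinout-specific RIU address translation for DQ0.
module riu_addr_xlate (
  input  logic [27:0] cal_addr,        // simplified calibration address from the MicroBlaze I/O module
  output logic [5:0]  riu_nibble_sel,  // selected nibble (decoded to one-hot downstream)
  output logic [5:0]  riu_addr         // pinout-specific RIU register address
);
  always_comb begin
    riu_nibble_sel = '0;
    riu_addr       = '0;
    case (cal_addr)
      28'h000_4100: begin              // ODELAY access for DQ0, pinned to Bit[0] of nibble 0
        riu_nibble_sel = 6'd0;
        riu_addr       = 6'h0D;        // RIU ODELAY address for Bit[0] of the nibble
      end
      // ... one entry per calibrated pin, generated from the selected pinout
      default: ;
    endcase
  end
endmodule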
The MicroBlaze I/O module interface is not always fast enough for implementing all of the
functions required in calibration. A helper circuit implemented in
<module>_...cal_addr_decode.sv is required to obtain commands from the
registers and translate at least a portion into single-cycle accuracy for submission to the
PHY. In addition, it supports command repetition to enable back-to-back read transactions
and read data comparison.
The PHY initialization and calibration follow this overall sequence:

1. The built-in self-check of the PHY (BISC) is run. BISC is used in the PHY to compute
internal skews for use in voltage and temperature tracking after calibration is
completed.
2. After BISC is completed, calibration logic performs the required power-on initialization
sequence for the memory.
3. This is followed by several stages of timing calibration for the write and read datapaths.
4. After calibration is completed, PHY calculates internal offsets to be used in voltage and
temperature tracking.
5. PHY indicates calibration is finished and the controller begins issuing commands to the
memory.
Figure 3-6 shows the overall flow of memory initialization and the different stages of
calibration. Stages shown in dark gray are not available in this release.
Figure 3-6: PHY Overall Initialization and Calibration Sequence (flow: System Reset → XIPHY BISC → XSDB Setup → read and write calibration stages with rank checks, including DQS gate, write leveling, write/read DQS-to-DQ simple and complex centering, and write/read sanity checks → Read DQS Centering Multi-Rank Adjustment → rank count incremented until all ranks are done → Calibration Done)

Note: Sanity Check 5 runs for multi-rank configurations and for a rank other than the first
rank. For example, if there were two ranks, it would run on the second rank only. Sanity
Check 6 runs for multi-rank configurations and goes through all of the ranks.
When simulating a design generated by DDR3/DDR4 SDRAM, calibration is set to be bypassed
to enable you to generate traffic to and from the DRAM as quickly as possible. When
running in hardware, or when simulating with calibration enabled, signals are provided to
indicate which step of calibration is running or, if an error occurs, where the error occurred.
The first step in determining calibration status is to check the CalDone port. After the
CalDone port is checked, the status bits should be checked to indicate the steps that were
run and completed. Calibration halts on the very first error encountered, so the status bits
indicate which step of calibration was last run. The status and error signals can be checked
either by connecting the Vivado logic analyzer signals to these ports or through the XSDB
tool (also through Vivado).
The calibration status is provided through the XSDB port, which stores useful information
regarding calibration for display in the Vivado IDE. The calibration status and error signals
are also provided as ports to allow for debug or triggering. Table 3-6 lists the
pre-calibration status signal description.
Table 3-7 lists the status signals in the port as well as how they relate to the core XSDB data.
In the status port, the mentioned bits are valid and the rest are reserved.
Table 3-9 lists the error signals and a description of each error. To decode an error, first look
at the status to determine which calibration stage failed (the start bit would be asserted and the
associated done bit deasserted), then look at the error code provided. The error asserts the
first time an error is encountered.
Table 3-9: Error Signal Descriptions (excerpt)
... 0xF – Byte – N/A – Timeout error waiting for read data to return.
Write Read Sanity Check (stage 27):
0x1 – Nibble – RIU Nibble – Read data comparison failure.
0xF – N/A – N/A – Timeout error waiting for read data to return.
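As a small illustration of decoding one start/done pair from the 128-bit status port (the port
and signal names here are placeholders; the bit positions below are taken from the dual-rank
LRDIMM example in Table 3-10 later in this chapter, so substitute the positions from the status
table for your configuration):

// Illustrative decode of one start/done status pair (inside your user design module).
// Bits 28/29 correspond to Write Leveling start/done for the dual-rank LRDIMM case
// in Table 3-10; dbg_cal_status is a placeholder name for the 128-bit status port.
localparam int WRLVL_START_BIT = 28;
localparam int WRLVL_DONE_BIT  = 29;
wire wrlvl_started = dbg_cal_status[WRLVL_START_BIT];
wire wrlvl_done    = dbg_cal_status[WRLVL_DONE_BIT];
// A stage that started but never completed is the stage to inspect when an error asserts.
wire wrlvl_stalled = wrlvl_started && !wrlvl_done;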
DQS Gate
During this stage of calibration, the read DQS preamble is detected and the gate to enable
data capture within the FPGA is calibrated to be one clock cycle before the first valid data
on DQ. The coarse and fine DQS gate taps (RL_DLY_COARSE and RL_DLY_FINE) are adjusted
during this stage. Read commands are issued with gaps in between to continually search for
the DQS preamble position. The DDR4 preamble training mode is enabled during this stage
to increase the low preamble period and aid in detection. During this stage of calibration,
only the read DQS signals are monitored and not the read DQ signals. DQS Preamble
Detection is performed sequentially on a per byte basis.
During this stage of calibration, the coarse taps are first adjusted while searching for the
low preamble position and the first rising DQS edge, in other words, a DQS pattern of 00X1.
Figure 3-7: DQS Preamble Patterns for DDR3 and DDR4 (sampled DQS pattern 0 0 X 1 X 0 X 1 X 0; the pattern searched for is 00X1)
If the preamble is not found, the read latency is increased by one. The coarse taps are reset
and then adjusted again while searching for the low preamble and first rising DQS edge.
After the preamble position is properly detected, the fine taps are then adjusted to fine
tune and edge align the position of the sample clock with the DQS.
Write Leveling
DDR3/DDR4 write leveling allows the controller to adjust each write DQS phase
independently with respect to the CK forwarded to the DDR3/DDR4 SDRAM device. This
compensates for the skew between DQS and CK and meets the tDQSS specification.
During write leveling, DQS is driven by the FPGA memory interface and DQ is driven by the
DDR3/DDR4 SDRAM device to provide feedback. DQS is delayed until the 0 to 1 edge
transition on DQ is detected. The DQS delay is achieved using both ODELAY and coarse tap
delays.
After the edge transition is detected, the write leveling algorithm centers on the noise
region around the transition to maximize margin. This second step is completed with only
the use of ODELAY taps. Any reference to “FINE” is the ODELAY search.
Read per-bit deskew maximizes the DQ eye by removing skew and OCV effects using per-bit read
DQ deskew. At the end of this stage, the DQ bits are internally deskewed to the left edge of the
incoming DQS.
Write DQS-to-DQ
This stage of calibration is required to center align the write DQS in the write DQ window
per bit. At the start of Write DQS Centering and Per-Bit Deskew, DQS is aligned to CK but no
adjustments on the write window have been made. Write window adjustments are made in two
sequential stages: write DQS-to-DQ per-bit deskew followed by write DQS-to-DQ centering.
Write DQS-to-DM/DBI
When the write DBI option is selected for DDR4, the pin itself is calibrated as a DM and write
DBI is enabled at the end of calibration.
In all previous stages of calibration, data mask signals are driven low before and after the
required amount of time to ensure they have no impact on calibration. Now, both the read
and the writes have been calibrated and data mask can reliably be adjusted. If DM signals
are not used within the interface, this stage of calibration is skipped.
The 0F0F0F0F pattern is written to the DRAM and read back with read DBI enabled. The
DRAM sends the data back as FFFFFFFF, but the DBI pin carries the clock pattern 01010101,
which is used to measure the data valid window of the DBI input pin itself. The final DQS
location is determined based on the aggregate window for the DQ and DBI pins.
Depending on the interface type (UDIMM, RDIMM, LRDIMM, or component), the DQS could
either be one CK cycle earlier than, two CK cycles earlier than, or aligned to the CK edge that
captures the write command.
This is a pattern-based calibration where coarse adjustments are made on a per-byte basis
and repeated until the on-time write pattern is read back, signifying DQS is aligned to the
correct CK cycle, or until an incorrect pattern is received, resulting in a Write Latency
failure. Reads are then performed against the expected data patterns to verify the alignment.
Write Latency Calibration can fail in cases that signify a board violation of the DQS-to-CK
trace matching requirements.
For the same reasons as described in the Read DQS Centering (Complex), a complex data
pattern is used on the write path to adjust the Write DQS-to-DQ alignment. The same steps
as detailed in the Write DQS-to-DQ Centering are repeated just with a complex data
pattern.
The coarse taps are adjusted so the timing of the gate opening stays the same for any given
rank, where four coarse taps are equal to a single read latency adjustment in the general
interconnect. During this step, the algorithm tries to find a common clb2phy_rd_en
setting where across all ranks for a given byte the coarse setting would not overflow or
underflow, starting with the lowest read latency setting found for the byte during
calibration. If the lowest setting does not work for all ranks, the clb2phy_rd_en
increments by one and the check is repeated. The fine tap setting is < 90°, so it is not
included in the adjustment.
If the check reaches the maximum clb2phy_rd_en setting initially found during
calibration without finding a value that works between all ranks for a byte, an error is
asserted. If after the adjustment is made and the coarse taps are larger than 360° (four
coarse tap settings), a different error is asserted. For the error codes, see Table 3-9, “Error
Signal Descriptions,” on page 45.
For multi-rank systems, the coarse taps must be seven or less so additional delay is added
using the general interconnect read latency to compensate for the coarse tap requirement.
Enable VT Tracking
After the DQS gate multi-rank adjustment (if required), a signal is sent to the XIPHY to
recalibrate internal delays to start voltage and temperature tracking. The XIPHY asserts a
signal when complete, phy2clb_phy_rdy_upp for upper nibbles and
phy2clb_phy_rdy_low for lower nibbles.
For multi-rank systems, when all nibbles are ready for normal operation, the XIPHY requires
two write-read bursts to be sent to the DRAM before normal traffic starts. A data pattern of
F00FF00F is used for the first burst and 0FF00FF0 for the second. The data itself is not
checked and is expected to fail.
After all stages are completed across all ranks without any error, calDone gets asserted to
indicate user traffic can begin. In XSDB, DBG_END contains 0x1 if calibration completes and
0x2 if there is a failure.
If you would like to manually re-enable the read and write VREF training stages, follow these steps:
1. Follow the steps in Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14]
for modifying IP in the "Editing IP Sources" section.
2. Open the core_name/rtl/ip_top/core_name_ddr4.sv in a text editor outside of
the Vivado Integrated Design Environment.
3. Locate the following lines:
parameter CAL_RD_VREF = "SKIP",
parameter CAL_RD_VREF_PATTERN = "SIMPLE",
parameter CAL_WR_VREF = "SKIP",
parameter CAL_WR_VREF_PATTERN = "SIMPLE",
Note: These lines occur twice: once under `ifdef SIMULATION and again under `else. You need
to modify the lines within the `else branch.
4. Modify the SKIP setting to FULL:
parameter CAL_RD_VREF = "FULL",
parameter CAL_RD_VREF_PATTERN = "SIMPLE",
parameter CAL_WR_VREF = "FULL",
parameter CAL_WR_VREF_PATTERN = "SIMPLE",
Figure 3-8 shows the overall flow of memory initialization and the different stages of the
LRDIMM calibration sequence.
Figure 3-8: LRDIMM Initialization and Calibration Sequence (flow: System Reset → XIPHY BISC → XSDB Setup → data buffer training stages, MREP through DWL and MWD Cycle training, repeated per data buffer rank while DB Rank < Ranks/Slot → host-side calibration stages)
The following data buffer calibration stages are added to meet the timing between the data
buffer and the DRAMs; they are repeated for each rank of the LRDIMM card/slot:
• MREP Training
• MRD Cycle Training
• MRD Center Training
• DWL Training
• MWD Cycle Training
• MWD Center Training
The host-side calibration stages, by contrast, exercise the timing between the host and the
data buffer and are performed once per LRDIMM card/slot.
All of the calibration stages between the data buffer and the DRAMs are exercised first, and
then the host-side calibration stages are exercised.
At the end of each of the data buffer calibration stages, Per Buffer Addressing (PBA) mode
is enabled to program the calibrated latency and the delay values into the data buffer
registers.
MREP Training
This training aligns the read MDQS phase with the data buffer clock. In this training
mode, the host drives the read commands, the DRAM sends out MDQS, and the data buffer samples
the strobe with its clock and feeds the result back on DQ. Calibration continues to perform this
training until it finds the 1-to-0 transition on the read MDQS sampled with the data buffer clock.
DWL Training
This training aligns the write MDQS phase with the DRAM clock. In this training mode, the
DB drives the MDQS pulses, the DRAM samples the clock with MDQS and feeds the result onto
MDQ, and the data buffer forwards this result from MDQ to DQ. Calibration continues to perform
this training until it finds the 0-to-1 transition on the clock sampled with the write MDQS at the DRAM.
CAL_STATUS
There are two types of LRDIMM devices available: dual-rank cards and quad-rank cards.
Because the data buffer calibration stages are repeated for every rank of the card, the
calibration sequence numbering differs between dual-rank and quad-rank cards.
The calibration status is provided through the XSDB port, which stores useful information
regarding calibration for display in the Vivado IDE. The calibration status is provided as
ports to allow for debug or triggering.
Table 3-10 lists the calibration status signals in the port as well as how they relate to the
core XSDB data for dual-rank LRDIMM card.
Table 3-10: XSDB Status Signal Description for Dual-Rank LRDIMM Card
Columns: XSDB Status Register (one name per group of rows); XSDB Status Register Bits [8:0]; Port Bits [127:0]; Description; Calibration Stage Name; Calibration Stage Number
0 0 Start Data Buffer Rank 0 MREP 1
1 1 Done – –
2 2 Start Data Buffer Rank 0 MRD Cycle 2
3 3 Done – –
DDR_CAL_STATUS_SLOTx_0 4 4 Start Data Buffer Rank 0 MRD Center 3
5 5 Done – –
6 6 Start Data Buffer Rank 0 DWL 4
7 7 Done – –
8 8 Start Data Buffer Rank 0 MWD Cycle 5
0 9 Done – –
1 10 Start Data Buffer Rank 0 MWD Center 6
2 11 Done – –
3 12 Start Data Buffer Rank 1 MREP 7
DDR_CAL_STATUS_SLOTx_1 4 13 Done – –
5 14 Start Data Buffer Rank 1 MRD Cycle 8
6 15 Done – –
7 16 Start Data Buffer Rank 1 MRD Center 9
8 17 Done – –
0 18 Start Data Buffer Rank 1 DWL 10
1 19 Done – –
2 20 Start Data Buffer Rank 1 MWD Cycle 11
3 21 Done – –
DDR_CAL_STATUS_SLOTx_2 4 22 Start Data Buffer Rank 1 MWD Center 12
5 23 Done – –
6 24 Start DQS Gate 13
7 25 Done – –
8 26 Start DQS Gate Sanity Check 14
0 27 Done – –
1 28 Start Write Leveling 15
2 29 Done – –
3 30 Start Read Per-Bit Deskew 16
DDR_CAL_STATUS_SLOTx_3 4 31 Done – –
5 32 Start Read Per-Bit DBI Deskew 17
6 33 Done – –
7 34 Start Read DQS Centering (Simple) 18
8 35 Done – –
0 36 Start Read Sanity Check 19
1 37 Done – –
2 38 Start Write DQS to DQ Deskew 20
3 39 Done – –
DDR_CAL_STATUS_SLOTx_4 4 40 Start Write DQS to DM/DBI Deskew 21
5 41 Done – –
6 42 Start Write DQS to DQ (Simple) 22
7 43 Done – –
8 44 Start Write DQS to DM/DBI (Simple) 23
0 45 Done – –
1 46 Start Read DQS Centering DBI (Simple) 24
2 47 Done – –
3 48 Start Write Latency Calibration 25
DDR_CAL_STATUS_SLOTx_5 4 49 Done – –
5 50 Start Write Read Sanity Check 0 26
6 51 Done – –
7 52 Start Read DQS Centering (Complex) 27
8 53 Done – –
0 54 Start Write Read Sanity Check 1 28
1 55 Done – –
2 56 Start Read VREF Training 29
3 57 Done – –
DDR_CAL_STATUS_SLOTx_6 4 58 Start Write Read Sanity Check 2 30
5 59 Done – –
6 60 Start Write DQS to DQ (Complex) 31
7 61 Done – –
8 62 Start Write DQS to DM/DBI (Complex) 32
0 63 Done – –
1 64 Start Write Read Sanity Check 3 33
2 65 Done – –
3 66 Start Write VREF Training 34
4 67 Done – –
DDR_CAL_STATUS_SLOTx_7
5 68 Start Write Read Sanity Check 4 35
6 69 Done – –
7 70 Start Read DQS Centering Multi-Rank Adjustment 36
8 71 Done – –
0 72 Start Write Read Sanity Check 5 37
1 73 Done – –
2 74 Start Multi Rank Adjustment and Checks 38
DDR_CAL_STATUS_SLOTx_8
3 75 Done – -
4 76 Start Write Read Sanity Check 6 39
5 77 Done – –
Table 3-11 lists the calibration status signals in the port as well as how they relate to the
core XSDB data for quad-rank LRDIMM card.
Table 3-11: XSDB Status Signal Description for Quad-Rank LRDIMM Card (Cont’d)
Columns: XSDB Status Register (one name per group of rows); XSDB Status Register Bits [8:0]; Port Bits [127:0]; Description; Calibration Stage Name; Calibration Stage Number
0 27 Done – –
1 28 Start Data Buffer Rank 2 MRD Center 15
2 29 Done – –
3 30 Start Data Buffer Rank 2 DWL 16
DDR_CAL_STATUS_SLOTx_3 4 31 Done – –
5 32 Start Data Buffer Rank 2 MWD Cycle 17
6 33 Done – –
7 34 Start Data Buffer Rank 2 MWD Center 18
8 35 Done – –
0 36 Start Data Buffer Rank 3 MREP 19
1 37 Done – –
2 38 Start Data Buffer Rank 3 MRD Cycle 20
3 39 Done – –
DDR_CAL_STATUS_SLOTx_4 4 40 Start Data Buffer Rank 3 MRD Center 21
5 41 Done – –
6 42 Start Data Buffer Rank 3 DWL 22
7 43 Done – –
8 44 Start Data Buffer Rank 3 MWD Cycle 23
0 45 Done – –
1 46 Start Data Buffer Rank 3 MWD Center 24
2 47 Done – –
3 48 Start DQS Gate 25
DDR_CAL_STATUS_SLOTx_5 4 49 Done – –
5 50 Start DQS Gate Sanity Check 26
6 51 Done – –
7 52 Start Write Leveling 27
8 53 Done – –
0 54 Start Read Per-Bit Deskew 28
1 55 Done – –
2 56 Start Read Per-Bit DBI Deskew 29
3 57 Done – –
DDR_CAL_STATUS_SLOTx_6 4 58 Start Read DQS Centering (Simple) 30
5 59 Done – –
6 60 Start Read Sanity Check 31
7 61 Done – –
8 62 Start Write DQS to DQ Deskew 32
0 63 Done – –
1 64 Start Write DQS to DM/DBI Deskew 33
2 65 Done – –
3 66 Start Write DQS to DQ (Simple) 34
DDR_CAL_STATUS_SLOTx_7 4 67 Done – –
5 68 Start Write DQS to DM/DBI (Simple) 35
6 69 Done – –
7 70 Start Read DQS Centering DBI (Simple) 36
8 71 Done – –
0 72 Start Write Latency Calibration 37
1 73 Done – –
2 74 Start Write Read Sanity Check 0 38
3 75 Done – –
DDR_CAL_STATUS_SLOTx_8 4 76 Start Read DQS Centering (Complex) 39
5 77 Done – –
6 78 Start Write Read Sanity Check 1 40
7 79 Done – –
8 80 Start Read VREF Training 41
0 81 Done – –
1 82 Start Write Read Sanity Check 2 42
2 83 Done – –
3 84 Start Write DQS to DQ (Complex) 43
DDR_CAL_STATUS_SLOTx_9 4 85 Done – –
5 86 Start Write DQS to DM/DBI (Complex) 44
6 87 Done – –
7 88 Start Write Read Sanity Check 3 45
8 89 Done – –
0 90 Start Write VREF Training 46
1 91 Done – –
2 92 Start Write Read Sanity Check 4 47
3 93 Done – –
DDR_CAL_STATUS_SLOTx_10 4 94 Start Read DQS Centering Multi-Rank Adjustment 48
5 95 Done – –
6 96 Start Write Read Sanity Check 5 49
7 97 Done – –
8 98 Start Multi-Rank Adjustment and Checks 50
0 99 Done – –
DDR_CAL_STATUS_SLOTx_11 1 100 Start Write Read Sanity Check 6 51
2 101 Done - -
ERROR STATUS
The error signal descriptions of the host calibration stages in Table 3-9 also apply to the
LRDIMM host calibration stages, except that the stage numbering follows the LRDIMM dual-rank
or quad-rank configuration.
Table 3-12 lists the error signals of the dual-rank LRDIMM data buffer calibration stages and
their description.
Table 3-12: Error Signal Description of Dual-Rank LRDIMM Data Buffer Calibration Stages
Table 3-13 lists the error signals of the quad-rank LRDIMM data buffer calibration stages
and their description.
Table 3-13: Error Signal Description of Quad-Rank LRDIMM Data Buffer Calibration Stages
Save Restore
This feature saves the calibration data into an external memory and restores the same
information at a later time for a quick calibration completion. The IP provides a set
of XSDB ports in the user interface through which you can save and restore the Memory
Controller calibration data.

When the FPGA is programmed and calibrates in regular mode, all required calibration stages
are executed. You can start accessing the DRAM when calibration completes and issue a save
request at any time to save the calibration data. This is called the save cycle. The FPGA can
be reprogrammed or turned off after the save cycle.

At a later time, the same design can be reprogrammed and calibrated in restore mode, in which
calibration completes very quickly. This is called the restore cycle. The storage that keeps
the calibration data inside the Memory Controller while the FPGA is powered is called the
XSDB block RAM.
It is required to save and restore the entire XSDB block RAM. Its end address can be
obtained from the END_ADDR0/1 locations of the XSDB debug information. An example of
calculating the end address is available in step 2, page 604.
Table 3-14: User Interface Ports Description for Save and Restore

app_save_req (I, 1 bit): Request for saving the calibration data. No further memory requests
are accepted when it is asserted. Must be asserted from Low to High only after calibration
completion.

app_save_ack (O, 1 bit): Save request acknowledgment. The signal stays High after it is
asserted until a system reset is issued.

app_restore_en (I, 1 bit): XSDB block RAM restore enable. It must be asserted High within 50
general interconnect cycles after ui_clk_sync_rst is deasserted in the restore cycle and held
until calibration completes. Assert this to notify MicroBlaze to wait for XSDB block RAM
restoration completion. After the XSDB block RAM is restored, assert app_restore_complete to
notify MicroBlaze to continue calibration. When asserted:
• MicroBlaze waits for app_restore_complete before proceeding to calibration
• Disables all calibration stages except DQS gating

app_restore_complete (I, 1 bit): XSDB block RAM restore complete. It should be asserted High
after the entire XSDB block RAM is restored and held until calibration completes.

app_dbg_out (O): Debug output. Do not connect any signals to app_dbg_out and keep the port
open during instantiation.

app_xsdb_select (I, 1 bit): Save restore XSDB port select. Assert for XSDB block RAM read or
write access. It should be asserted as long as the access is required.

app_xsdb_rd_en (I, 1 bit): XSDB block RAM read enable. Asserting this for one cycle issues one
read command.

app_xsdb_wr_en (I, 1 bit): XSDB block RAM write enable. Asserting this for one cycle issues one
write command. The corresponding write address (app_xsdb_addr) and write data
(app_xsdb_wr_data) are taken in the same cycle.

app_xsdb_addr (I, 16 bits): XSDB block RAM address. This address is used for both read and
write commands. app_xsdb_addr is taken in the same cycle when app_xsdb_rd_en or
app_xsdb_wr_en is valid.

app_xsdb_wr_data (I, 9 bits): XSDB block RAM write data. app_xsdb_wr_data is taken in the same
cycle when app_xsdb_wr_en is valid.
1. Save cycle
a. Memory Controller boots up in a normal manner, completes calibration, and runs the
user traffic.
b. Issue a save request to the Memory Controller by asserting app_save_req. Any
read or write request that comes along with or after the save request is dropped and
the controller behavior is not guaranteed. Thus, the traffic must be stopped before
requesting the calibration data save.
c. Memory Controller asserts the app_save_ack after finishing all pending DRAM
commands. Figure 3-10 shows the save request and acknowledge assertions.
d. When app_save_ack is asserted, save the XSDB block RAM content into an
external memory through the XSDB ports provided in the user interface, as shown in
Figure 3-11. The saved data can be used to restore the calibration in a shorter time
later.
Figure 3-10: Save Request and Acknowledge Assertions
Figure 3-11: XSDB Interface Timing for Reading XSDB Block RAM Content
2. Restore cycle
a. Assert the app_restore_en signal within 50 general interconnect cycles after the
user interface reset (ui_clk_sync_rst) is deasserted in the restore cycle. It should
stay asserted until the calibration completes.
b. Restore the XSDB block RAM content from the external saved space into the Memory
Controller through the XSDB ports provided in the user interface. The XSDB write
timing is shown in Figure 3-12. Assert the app_restore_complete after the entire
XSDB block RAM is restored as shown in Figure 3-13.
c. The Memory Controller recognizes this as a restore boot-up when app_restore_en is
asserted, and the calibration sequence is shortened in restore mode.
When app_restore_complete is asserted, the entire calibration data from the XSDB
block RAM is restored into the PHY with minimal calculations.
d. Memory Controller skips all calibration stages except the DQS gating stage and
finishes calibration. User traffic starts after the calibration as usual.
Figure 3-12: XSDB Interface Timing for Writing XSDB Block RAM Content
Figure 3-13: Asserting app_restore_complete After Writing Entire Block RAM Content
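The following is a behavioral sketch of the save-cycle sequencing in user test logic, assuming
the Table 3-14 port names; the read-data port name (app_xsdb_rd_data), the ui_clk and
init_calib_complete signals, the xsdb_end_addr value derived from END_ADDR0/1, the external_mem
storage, and the read-data timing are assumptions for illustration only.

// Illustrative save-cycle sequencing (not the IP's RTL): stop traffic, request the
// save, wait for the acknowledge, then read the XSDB block RAM word by word.
initial begin
  app_save_req    = 1'b0;
  app_xsdb_select = 1'b0;
  app_xsdb_rd_en  = 1'b0;
  wait (init_calib_complete);            // calibration done; user traffic may have run

  // 1. Stop issuing traffic, then request the save.
  @(posedge ui_clk) app_save_req <= 1'b1;
  wait (app_save_ack);                   // controller has drained all pending DRAM commands

  // 2. Read the entire XSDB block RAM through the save/restore XSDB ports.
  @(posedge ui_clk) app_xsdb_select <= 1'b1;
  for (int addr = 0; addr <= xsdb_end_addr; addr++) begin
    @(posedge ui_clk);
    app_xsdb_addr  <= addr[15:0];
    app_xsdb_rd_en <= 1'b1;
    @(posedge ui_clk);
    app_xsdb_rd_en <= 1'b0;
    @(posedge ui_clk);                   // assumed read-data valid timing (see Figure 3-11)
    external_mem[addr] = app_xsdb_rd_data;
  end
  @(posedge ui_clk) app_xsdb_select <= 1'b0;
end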
Reset Sequence
The sys_rst signal resets the entire memory design, which includes the general interconnect
(fabric) logic that is driven by the MMCM clock (clkout0) and the RIU logic. The MicroBlaze™
and calibration logic are driven by the MMCM clock (clkout6). The sys_rst input signal is
synchronized internally to create the ui_clk_sync_rst signal. The ui_clk_sync_rst
reset signal is synchronously asserted and synchronously deasserted.

Figure 3-14 shows that ui_clk_sync_rst (the fabric reset) is asserted synchronously a few
clock cycles after sys_rst is asserted. After ui_clk_sync_rst is asserted, the clocks run for
a few more cycles before they are shut off.
Figure 3-14: Reset Sequence
Clamshell Topology
This feature is supported for the DDR4 Controller/PHY Mode option in the Controller and
physical layer pull-down, for the User Interface, AXI interfaces, and Physical Layer Only
interface. Clamshell topology supports the Physical Layer Ping Pong interface.
Note: Only DDR4 single-rank components are supported with this feature.
The clamshell topology saves component area by placing the components on both sides (top and
bottom) of the board, mimicking the address mirroring concept of multi-rank RDIMMs.
Address mirroring improves the signal integrity of the address and control ports and makes
the PCB routing easier. The clamshell feature is available in the Basic tab as shown in
Figure 3-15.
The components are split into two categories called non-mirrored and mirrored. One
additional chip select signal is added to the design for the mirrored components.
Figure 3-16 shows the difference between the regular component topology and the
clamshell topology.
Figure 3-16: Regular Component Topology versus Clamshell Topology (regular topology: the FPGA Memory Controller drives DRAM 0 through DRAM 8 with a single CS0_n; clamshell topology: the top DRAMs are non-mirrored and share CS0_n, while the bottom DRAMs are mirrored and share CS1_n)
Migration Feature
This feature is supported for the DDR4 Controller/PHY Mode option in the Controller and
physical layer, for the User Interface, AXI interfaces, and Physical Layer Only interface.
Migration does not support the Physical Layer Ping Pong interface. This feature is helpful
when migrating a design from an existing FPGA package to another compatible package.
It also supports pin-compatible packages within and across the UltraScale and UltraScale+
families. For more information on pin-compatible FPGAs, see the UltraScale
Architecture PCB Design and Pin Planning User Guide (UG583) [Ref 11].
The migration option compensates the package skews of all address/command signals on
the targeted device to keep the phase relationship of the source device intact. It is required
only for the address/command bus as there is no calibration for these signals.
The data bus (DQ and DQS) skews are not required to compensate because it is completed
during the regular calibration sequence. The tool supports a skew difference of 0 to 75 ps
only.
Figure 3-17 shows the Advanced Options tab to enable the migration feature.
Table 3-15 to Table 3-17 show examples of the skew calculations that need to be entered in
Figure 3-18 when migrating the FPGA device. The procedure to retrieve the delay values for
the source and target devices is available in the Migration chapter of the UltraScale
Architecture PCB Design and Pin Planning User Guide (UG583) [Ref 11].
The delay values for all used pins are listed in columns 2 and 3 for the source and target
devices, respectively. The difference in delay of the target device relative to the source is
given in column 4. Note that the skew can be positive or negative. Because the GUI
accepts only positive skew values, the column 4 values are adjusted in column 5
such that the lowest skew difference becomes zero. The calculated values in column 5 are
entered in Vivado as shown in Figure 3-18.
The lowest skew among all entries of column 4 (Table 3-15) is +11 ps. Therefore, column 5
gets formed by subtracting this lowest skew value (+11 ps) from column 4.
The lowest skew among all entries of column 4 (Table 3-16) is -39 ps. Then, column 5 gets
formed by subtracting this lowest skew value (-39 ps) from column 4.
The lowest skew among all entries of column 4 (Table 3-17) is -18 ps. Hence, column 5 gets
formed by subtracting this lowest skew value (-18 ps) from column 4.
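Restating the column 5 adjustment above as a formula (this is only a summary of the stated procedure, where Δ_i is the column 4 skew difference of pin i):

\text{column 5}_i = \Delta_i - \min_j \Delta_j

so the smallest entry becomes 0 ps and every entry stays non-negative; for Table 3-16, for example, the −39 ps entry becomes 0 ps.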
The MicroBlaze MCS ECC can be selected from the MicroBlaze MCS ECC option section in
the Advanced Options tab. The block RAM size increases if the ECC option for MicroBlaze
MCS is selected.
Memory Settings
This section captures the settings of memory components and DIMMs.
Clocking
The memory interface requires one mixed-mode clock manager (MMCM), one TXPLL per I/O
bank used by the memory interface, and two BUFGs. These clocking components are used
to create the proper clock frequencies and phase shifts necessary for the proper operation
of the memory interface.
There are two TXPLLs per bank. If a bank is shared by two memory interfaces, both TXPLLs
in that bank are used.
Note: DDR3/DDR4 SDRAM generates the appropriate clocking structure and no modifications to
the RTL are supported.
The DDR3/DDR4 SDRAM tool generates the appropriate clocking structure for the desired
interface. This structure must not be modified. The allowed clock configuration is as
follows:
Requirements
GCIO
• Must use a differential I/O standard
• Must be in the same I/O column as the memory interface
• Must be in the same SLR of memory interface for the SSI technology devices
• The I/O standard and termination scheme are system dependent. For more information,
consult the UltraScale Architecture SelectIO Resources User Guide (UG571) [Ref 7].
MMCM
• MMCM is used to generate the FPGA logic system clock (1/4 of the memory clock)
• Must be located in the center bank of memory interface
• Must use internal feedback
• Input clock frequency divided by input divider must be ≥ 70 MHz (CLKINx / D ≥
70 MHz)
• Must use integer multiply and output divide values
° For two bank systems, the bank with the higher number of bytes selected is chosen
as the center bank. If the same number of bytes is selected in two banks, then the
top bank is chosen as the center bank.
° For four bank systems, either of the center banks can be chosen. DDR3/DDR4
SDRAM refers to the second bank from the top-most selected bank as the center
bank.
TXPLL
• CLKOUTPHY from TXPLL drives XIPHY within its bank
• TXPLL must be set to use a CLKFBOUT phase shift of 90°
• TXPLL must be held in reset until the MMCM lock output goes High
• Must use internal feedback
Figure 4-1 shows an example of the clocking structure for a three bank memory interface.
The GCIO drives the MMCM located at the center bank of the memory interface. MMCM
drives both the BUFGs located in the same bank. The BUFG (which is used to generate
system clock to FPGA logic) output drives the TXPLLs used in each bank of the interface.
Figure 4-1: Clocking Structure for a Three Bank Memory Interface (differential GCIO input → MMCM → BUFGs → TXPLL per I/O bank)
• For two bank systems, the MMCM is placed in the bank with the most bytes selected. If
both banks have the same number of bytes selected, the MMCM is placed in the top bank.
• For four bank systems, the MMCM is placed in the second bank from the top.
For designs generated with System Clock configuration of No Buffer, MMCM must not be
driven by another MMCM/PLL. Cascading clocking structures MMCM → BUFG → MMCM
and PLL → BUFG → MMCM are not allowed.
If the MMCM is driven by the GCIO pin of the other bank, then the
CLOCK_DEDICATED_ROUTE constraint with value "BACKBONE" must be set on the net that
is driving MMCM or on the MMCM input. Setting up the CLOCK_DEDICATED_ROUTE
constraint on the net is preferred. But when the same net is driving two MMCMs, the
CLOCK_DEDICATED_ROUTE constraint must be managed by considering which MMCM
needs the BACKBONE route.
In such cases, the CLOCK_DEDICATED_ROUTE constraint can be set on the MMCM input. To
use the "BACKBONE" route, a clock buffer that exists in the same CMT tile as the GCIO
must be placed between the GCIO and the MMCM input. The clock buffers that exist in the
I/O CMT are BUFG, BUFGCE, BUFGCTRL, and BUFGCE_DIV. Therefore, DDR3/DDR4 SDRAM
instantiates a BUFG between the GCIO and MMCM when the GCIO pins and MMCM are not in
the same bank (see Figure 4-1).
If the GCIO pin and MMCM are allocated in different banks, DDR3/DDR4 SDRAM generates
CLOCK_DEDICATED_ROUTE constraints with value as "BACKBONE." If the GCIO pin and
MMCM are allocated in the same bank, there is no need to set any constraints on the
MMCM input.
Similarly, when designs are generated with the System Clock Configuration set to the No
Buffer option, you are responsible for the "BACKBONE" constraint and for instantiating a
BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV between the GCIO and MMCM if the GCIO pin and MMCM are
allocated in different banks. DDR3/DDR4 SDRAM does not generate clock constraints in the
XDC file for No Buffer configurations, so you must add the clock constraints yourself. For
more information on clocking, see the UltraScale Architecture Clocking
Resources User Guide (UG572) [Ref 8].
For DDR3:
set_property CLOCK_DEDICATED_ROUTE BACKBONE [get_pins -hier -filter {NAME =~ */u_ddr3_infrastructure/gen_mmcme*.u_mmcme_adv_inst/CLKIN1}]
For DDR4:
set_property CLOCK_DEDICATED_ROUTE BACKBONE [get_pins -hier -filter {NAME =~ */u_ddr4_infrastructure/gen_mmcme*.u_mmcme_adv_inst/CLKIN1}]
For more information on the CLOCK_DEDICATED_ROUTE constraints, see the Vivado Design
Suite Properties Reference Guide (UG912) [Ref 9].
Note: If two different GCIO pins are used for two DDR3/DDR4 SDRAM IP cores in the same bank,
the center bank of the memory interface is different for each IP. DDR3/DDR4 SDRAM generates
MMCM LOC and CLOCK_DEDICATED_ROUTE constraints accordingly.
1. DDR3/DDR4 SDRAM generates a single-ended input for system clock pins, such as
sys_clk_i. Connect the differential buffer output to the single-ended system clock
inputs (sys_clk_i) of both the IP cores.
2. System clock pins must be allocated within the same I/O column of the memory
interface pins allocated. Add the pin LOC constraints for system clock pins and clock
constraints in your top-level XDC.
3. You must add a "BACKBONE" constraint on the net that is driving the MMCM or on the
MMCM input if GCIO pin and MMCM are not allocated in the same bank. Apart from
this, BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV must be instantiated between GCIO and
MMCM to use the "BACKBONE" route.
Note:
° Skew spanning across multiple BUFGs is not a concern because a single point of
contact exists between BUFG → TXPLL and the same BUFG → System Clock Logic.
° The system input clock cannot span I/O columns because the longer the clock lines
span, the more jitter is picked up.
TXPLL Usage
There are two TXPLLs per bank. If a bank is shared by two memory interfaces, both TXPLLs
in that bank are used. One PLL per bank is used if a bank is used by a single memory
interface. You can use the second PLL for other purposes. To use the second PLL, perform
the following steps:
1. Generate the design for the System Clock Configuration option as No Buffer.
2. DDR3/DDR4 SDRAM generates a single-ended input for system clock pins, such as
sys_clk_i. Connect the differential buffer output to the single-ended system clock
inputs (sys_clk_i) and also to the input of PLL (PLL instance that you have in your
design).
3. You can use the PLL output clocks.
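The connection described in steps 1 through 3 can be sketched as follows. This is a minimal illustration only: ddr4_0 and user_pll are placeholder module names (not the actual generated names), and all ports not related to the shared clock are omitted.

// Hedged sketch: sharing one GCIO system clock between the memory IP
// (generated with System Clock Configuration = No Buffer) and a second PLL.
// ddr4_0 and user_pll are placeholder module names for illustration only.
module second_pll_clk_share (
  input  wire c0_sys_clk_p,   // differential GCIO system clock
  input  wire c0_sys_clk_n
  // ... memory and user ports omitted in this sketch
);
  wire sys_clk_se;   // single-ended system clock after the input buffer
  wire user_clk;     // example output clock from the spare PLL

  // Step 2: differential input buffer instantiated in user logic because the
  // IP does not instantiate one in the No Buffer configuration.
  IBUFDS u_ibufds_sys_clk (
    .I  (c0_sys_clk_p),
    .IB (c0_sys_clk_n),
    .O  (sys_clk_se)
  );

  // The buffered clock drives the IP single-ended system clock input ...
  ddr4_0 u_ddr4_0 (
    .sys_clk_i (sys_clk_se)
    // ... remaining IP connections omitted
  );

  // ... and also the spare PLL in the same bank (step 3 uses its outputs).
  user_pll u_user_pll (
    .clk_in  (sys_clk_se),
    .clk_out (user_clk)
  );
endmodule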
Additional Clocks
You can produce up to four additional clocks which are created from the same MMCM that
generates ui_clk. Additional clocks can be selected from the Clock Options section in the
Advanced tab. The GUI lists the possible clock frequencies from MMCM and the
frequencies for additional clocks vary based on selected memory frequency (Memory
Device Interface Speed (ps) value in the Basic tab), selected FPGA, and FPGA speed grade.
For situations where the memory interface is reset and recalibrated without a
reconfiguration of the FPGA, the SEM IP must be set to the IDLE state to disable the memory
scan and then sent back into the scanning (Observation or Detect only) state afterwards.
This can be done in one of two ways: through the Command Interface or through the UART
interface. See Chapter 3 of the UltraScale Architecture Soft Error Mitigation Controller
LogiCORE IP Product Guide (PG187) [Ref 10] for more information.
Resets
An asynchronous reset (sys_rst) input is provided. This is an active-High reset and
sys_rst must be asserted for a minimum pulse width of 5 ns. The sys_rst signal can be
driven internally or from an external pin.
IMPORTANT: If two controllers share a bank, they cannot be reset independently. The two controllers
must have a common reset input.
For more information on reset, see the Reset Sequence in Chapter 3, Core Architecture.
Note: The best possible calibration results are achieved when the FPGA activity is minimized from
the release of this reset input until the memory interface is fully calibrated as indicated by the
init_calib_complete port (see the User Interface section of this document).
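One possible way to satisfy the 5 ns minimum pulse width is to stretch a reset request over several cycles of a free-running clock before driving sys_rst. The sketch below is a hedged illustration only: the 100 MHz clock assumption, the counter width, and the rst_request signal are not part of the IP, and sys_rst can equally be driven from an external pin.

// Hedged sketch: stretching a reset request so that sys_rst is asserted for
// well over the 5 ns minimum pulse width. Assumes free_clk is a free-running
// clock (for example, 100 MHz, so 16 cycles = 160 ns >> 5 ns).
module sys_rst_stretch (
  input  wire free_clk,        // free-running clock, independent of ui_clk
  input  wire rst_request,     // single-cycle reset request from user logic
  output reg  sys_rst = 1'b0   // active-High reset to the memory IP
);
  reg [3:0] cnt = 4'd0;

  always @(posedge free_clk) begin
    if (rst_request) begin
      cnt     <= 4'd15;   // reload the stretch counter on every request
      sys_rst <= 1'b1;
    end else if (cnt != 4'd0) begin
      cnt     <= cnt - 4'd1;
      sys_rst <= 1'b1;    // hold reset while the counter runs down
    end else begin
      sys_rst <= 1'b0;
    end
  end
endmodule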
• Address/control means cs_n, ras_n, cas_n, we_n, ba, ck, cke, a, parity (valid for
RDIMMs only), and odt. Multi-rank systems have one cs_n, cke, odt, and one ck pair
per rank.
• Pins in a byte lane are numbered N0 to N12.
• Byte lanes in a bank are designated as T0, T1, T2, or T3. Nibbles within a byte lane are
distinguished by a “U” or “L” designator added to the byte lane designator (T0, T1, T2,
or T3). Thus they are T0L, T0U, T1L, T1U, T2L, T2U, T3L, and T3U.
Note: There are two PLLs per bank and a controller uses one PLL in every bank that is being used by
the interface.
1. dqs, dq, and dm location.
a. Designs using x8 or x16 components – dqs must be located on a dedicated byte
clock pair in the upper nibble designated with “U” (N6 and N7). dq associated with
a dqs must be in same byte lane on any of the other pins except pins N1 and N12.
b. Designs using x4 components – dqs must be located on the dedicated dqs pair in
the nibble (N0 and N1 in the lower nibble, N6 and N7 in the upper nibble). dq’s
associated with a dqs must be in the same nibble on any of the other pins except pin
N12 (upper nibble).
c. dm (if used) must be located on pin N0 in the byte lane with the corresponding dqs.
When dm is disabled, pin N0 can be used for dq, but it must not be used for
address/control signals (with the exception of the reset# pin).
Note: dm is not supported with x4 devices.
d. dm, if not used, must be pulled low on the PCB. Typical values used for this are equal
to the DQ trace impedance, such as 40 or 50Ω. Consult the memory vendor for their
specific recommendation. Unpredictable failures occur if this pin is not pulled low
appropriately.
IMPORTANT: Also, ensure that the interface is configured in the GUI to not use the data mask.
Otherwise, the calibration logic attempts to train this pin which results in a calibration failure.
2. The x4 components must be used in pairs. Odd numbers of x4 components are not
permitted. Both the upper and lower nibbles of a data byte must be occupied by a x4
dq/dqs group.
3. Byte lanes with a dqs are considered to be data byte lanes. Pins N1 and N12 can be used
for address/control in a data byte lane. If the data byte is in the same bank as the
remaining address/control pins, see step #4.
4. Address/control can be on any of the 13 pins in the address/control byte lanes. Address/
control must be contained within the same bank.
5. For dual slot configurations of RDIMMs and UDIMMs: cs, odt, cke, and ck port widths
are doubled. For exact mapping of the signals, see the DIMM Configurations.
6. One vrp pin per bank is used and DCI is required for the interfaces. A vrp pin is
required in I/O banks containing inputs as well as in output only banks. It is required in
output only banks because address/control signals use SSTL15_DCI/SSTL135_DCI to
enable usage of controlled output impedance. DCI cascade is allowed. When DCI
cascade is selected, the vrp pin can be used as a normal I/O. All rules for the DCI in the
UltraScale™ Architecture SelectIO™ Resources User Guide (UG571) [Ref 7] must be
followed.
RECOMMENDED: Xilinx strongly recommends that the DCIUpdateMode option is kept with the default
value of ASREQUIRED so that the DCI circuitry is allowed to operate normally.
power up. When dm is disabled, the reset pin can be allocated to the N0 pin of a data byte
lane or to any other free pin of that byte lane, as long as other rules are not violated.
IMPORTANT: If two controllers share a bank, they cannot be reset independently. The two controllers
must share a common reset input.
10. All I/O banks used by the memory interface must be in the same column.
11. All I/O banks used by the memory interface must be in the same SLR of the column for
the SSI technology devices.
12. The maximum height of the interface is five contiguous banks. The maximum supported
interface width is 80 bits.
The maximum component limit is nine; this restriction applies to components only and
not to DIMMs.
Note: If PCB compatibility between x4 and x8 based DIMMs is desired, additional restrictions apply.
The upper x4 DQS group must be placed within the lower byte nibble (N0 to N5). This allows DM to
be placed on N0 for the x8 pinout, pin compatibility for all DQ bits, and the added DQS pair for x4
to be placed on N0/N1.
For example, a typical DDR3 x4 based RDIMM data sheet shows the DQS9 associated with DQ4, DQ5,
DQ6, and DQ7. This DQS9_p is used for the DM in an x8 configuration. This nibble must be
connected to the lower nibble of the byte lane. The Vivado® generated XDC labels this DQS9 as
DQS1 (for more information, see the Pin Mapping for x4 RDIMMs/LRDIMMs). Table 4-1 and Table 4-2
include an example for one of the configurations of x4/x8/x16.
Table 4-1: Byte Lane View of Bank on FPGA Die for x8 and x16 Support
I/O Type Byte Lane Pin Number Signal Name
– T0U N12 –
N T0U N11 DQ[7:0]
P T0U N10 DQ[7:0]
N T0U N9 DQ[7:0]
P T0U N8 DQ[7:0]
DQSCC-N T0U N7 DQS0_N
DQSCC-P T0U N6 DQS0_P
N T0L N5 DQ[7:0]
P T0L N4 DQ[7:0]
N T0L N3 DQ[7:0]
P T0L N2 DQ[7:0]
DQSCC-N T0L N1 –
DQSCC-P T0L N0 DM0
Table 4-2: Byte Lane View of Bank on FPGA Die for x4, x8, and x16 Support
I/O Type Byte Lane Pin Number Signal Name
– T0U N12 –
N T0U N11 DQ[3:0]
P T0U N10 DQ[3:0]
N T0U N9 DQ[3:0]
P T0U N8 DQ[3:0]
DQSCC-N T0U N7 DQS0_N
DQSCC-P T0U N6 DQS0_P
N T0L N5 DQ[7:4]
P T0L N4 DQ[7:4]
N T0L N3 DQ[7:4]
P T0L N2 DQ[7:4]
Table 4-2: Byte Lane View of Bank on FPGA Die for x4, x8, and x16 Support (Cont’d)
I/O Type Byte Lane Pin Number Signal Name
DQSCC-N T0L N1 –/DQS9_N
DQSCC-P T0L N0 DM0/DQS9_P
Pin Swapping
• Pins can swap freely within each byte group (data and address/control), except for the
DQS pair, which must be on the dedicated dqs pair in the nibble (for more information,
see dqs, dq, and dm location, page 87).
• Byte groups (data and address/control) can swap easily with each other.
• Pins in the address/control byte groups can swap freely within and between their byte
groups.
• No other pin swapping is permitted.
Table 4-3 shows an example of a 16-bit DDR3 interface contained within one bank. This
example is for a component interface using two x8 DDR3 components.
Table 4-3: 16-Bit DDR3 (x8/x16 Part) Interface Contained in One Bank
Bank Signal Name Byte Group I/O Type
1 a0 T3U_12 –
1 a1 T3U_11 N
1 a2 T3U_10 P
1 a3 T3U_9 N
1 a4 T3U_8 P
1 a5 T3U_7 N
1 a6 T3U_6 P
1 a7 T3L_5 N
1 a8 T3L_4 P
1 a9 T3L_3 N
1 a10 T3L_2 P
1 a11 T3L_1 N
Table 4-3: 16-Bit DDR3 (x8/x16 Part) Interface Contained in One Bank (Cont’d)
1 a12 T3L_0 P
1 a13 T2U_12 –
1 a14 T2U_11 N
1 we_n T2U_10 P
1 cas_n T2U_9 N
1 ras_n T2U_8 P
1 ck_n T2U_7 N
1 ck_p T2U_6 P
1 cs_n T2L_5 N
1 ba0 T2L_4 P
1 ba1 T2L_3 N
1 ba2 T2L_2 P
1 sys_clk_n T2L_1 N
1 sys_clk_p T2L_0 P
1 cke T1U_12 –
1 dq15 T1U_11 N
1 dq14 T1U_10 P
1 dq13 T1U_9 N
1 dq12 T1U_8 P
1 dqs1_n T1U_7 N
1 dqs1_p T1U_6 P
1 dq11 T1L_5 N
1 dq10 T1L_4 P
1 dq9 T1L_3 N
1 dq8 T1L_2 P
1 odt T1L_1 N
1 dm1 T1L_0 P
1 vrp T0U_12 –
1 dq7 T0U_11 N
1 dq6 T0U_10 P
Table 4-3: 16-Bit DDR3 (x8/x16 Part) Interface Contained in One Bank (Cont’d)
1 dq5 T0U_9 N
1 dq4 T0U_8 P
1 dqs0_n T0U_7 N
1 dqs0_p T0U_6 P
1 dq3 T0L_5 N
1 dq2 T0L_4 P
1 dq1 T0L_3 N
1 dq0 T0L_2 P
1 reset_n T0L_1 N
1 dm0 T0L_0 P
Table 4-4 shows an example of a 16-bit DDR3 interface contained within one bank. This
example is for a component interface using four x4 DDR3 components.
Table 4-4: 16-Bit DDR3 Interface (x4 Part) Contained in One Bank
Bank Signal Name Byte Group I/O Type
1 a0 T3U_12 –
1 a1 T3U_11 N
1 a2 T3U_10 P
1 a3 T3U_9 N
1 a4 T3U_8 P
1 a5 T3U_7 N
1 a6 T3U_6 P
1 a7 T3L_5 N
1 a8 T3L_4 P
1 a9 T3L_3 N
1 a10 T3L_2 P
1 a11 T3L_1 N
1 a12 T3L_0 P
1 a13 T2U_12 –
1 a14 T2U_11 N
1 we_n T2U_10 P
1 cas_n T2U_9 N
1 ras_n T2U_8 P
Table 4-4: 16-Bit DDR3 Interface (x4 Part) Contained in One Bank (Cont’d)
Bank Signal Name Byte Group I/O Type
1 ck_n T2U_7 N
1 ck_p T2U_6 P
1 cs_n T2L_5 N
1 ba0 T2L_4 P
1 ba1 T2L_3 N
1 ba2 T2L_2 P
1 sys_clk_n T2L_1 N
1 sys_clk_p T2L_0 P
1 cke T1U_12 –
1 dq15 T1U_11 N
1 dq14 T1U_10 P
1 dq13 T1U_9 N
1 dq12 T1U_8 P
1 dqs3_n T1U_7 N
1 dqs3_p T1U_6 P
1 dq11 T1L_5 N
1 dq10 T1L_4 P
1 dq9 T1L_3 N
1 dq8 T1L_2 P
1 dqs2_n T1L_1 N
1 dqs2_p T1L_0 P
1 vrp T0U_12 –
1 dq7 T0U_11 N
1 dq6 T0U_10 P
1 dq5 T0U_9 N
1 dq4 T0U_8 P
1 dqs1_n T0U_7 N
1 dqs1_p T0U_6 P
1 dq3 T0L_5 N
1 dq2 T0L_4 P
1 dq1 T0L_3 N
1 dq0 T0L_2 P
Table 4-4: 16-Bit DDR3 Interface (x4 Part) Contained in One Bank (Cont’d)
Bank Signal Name Byte Group I/O Type
1 dqs0_n T0L_1 N
1 dqs0_p T0L_0 P
Two DDR3 32-bit interfaces can fit in three banks by using all of the pins in the banks. To fit
the configuration in three banks for various scenarios, different Vivado IDE options can be
selected (based on requirement). Various Vivado IDE options that lead to pin savings are
listed as follows:
• In data byte group, pins 1 and 12 are unused. Unused pins of the data byte group can
be used for Address/Control pins if all Address/Control pins are allocated in the same
bank.
For example, if the T3 byte group of Bank #2 is selected for data, pins T3L_1 and T3U_12 are
not used by data and can be used for Address/Control if all Address/Control
pins are allocated in Bank #2.
• If DCI cascade is selected, the vrp pin can be used as a normal I/O.
• Memory reset pin (reset_n pin) can be allocated anywhere as long as timing is met.
• System clock pins can be allocated in different banks and must be within the same
column of the memory interface banks selected.
• Disabling the Enable Chip Select Pin option in the Vivado IDE frees up a pin; the cs#
ports are not generated.
• Disabling the Data Mask option in the Vivado IDE frees up a pin; the data mask (dm)
port is not generated.
One of the configurations with two 32-bit DDR3 interfaces in three banks is given in
Table 4-5 (valid for x8/x16 memory parts). The signals of the two interfaces are
distinguished by the prefixes c0_ and c1_. In this example, interface-0 (c0) is allocated in
banks 0 and 1, and interface-1 (c1) is allocated in banks 1 and 2.
Table 4-5: Two 32-Bit DDR3 Interfaces Contained in Three Banks
Bank Signal Name Byte Group I/O Type
2 c1_ddr3_we_n T3U_12 –
2 c1_ddr3_ck_c[0] T3U_11 N
2 c1_ddr3_ck_t[0] T3U_10 P
2 c1_ddr3_cas_n T3U_9 N
2 c1_ddr3_ras_n T3U_8 P
2 c1_ddr3_ba[2] T3U_7 N
2 c1_ddr3_ba[1] T3U_6 P
2 c1_ddr3_ba[0] T3L_5 N
Table 4-5: Two 32-Bit DDR3 Interfaces Contained in Three Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
2 c1_ddr3_adr[15] T3L_4 P
2 c1_ddr3_adr[14] T3L_3 N
2 c1_ddr3_adr[13] T3L_2 P
2 c1_ddr3_adr[12] T3L_1 N
2 c1_ddr3_adr[11] T3L_0 P
2 c1_ddr3_adr[10] T2U_12 –
2 c1_ddr3_adr[9] T2U_11 N
2 c1_ddr3_adr[8] T2U_10 P
2 c1_ddr3_adr[7] T2U_9 N
2 c1_ddr3_adr[6] T2U_8 P
2 c1_ddr3_adr[5] T2U_7 N
2 c1_ddr3_adr[4] T2U_6 P
2 c1_ddr3_adr[3] T2L_5 N
2 c1_ddr3_adr[2] T2L_4 P
2 c1_ddr3_adr[1] T2L_3 N
2 c1_ddr3_adr[0] T2L_2 P
2 c1_sys_clk_n T2L_1 N
2 c1_sys_clk_p T2L_0 P
2 c1_ddr3_cke[0] T1U_12 –
2 c1_ddr3_dq[31] T1U_11 N
2 c1_ddr3_dq[30] T1U_10 P
2 c1_ddr3_dq[29] T1U_9 N
2 c1_ddr3_dq[28] T1U_8 P
2 c1_ddr3_dqs_n[3] T1U_7 N
2 c1_ddr3_dqs_p[3] T1U_6 P
2 c1_ddr3_dq[27] T1L_5 N
2 c1_ddr3_dq[26] T1L_4 P
2 c1_ddr3_dq[25] T1L_3 N
2 c1_ddr3_dq[24] T1L_2 P
2 c1_ddr3_odt[0] T1L_1 N
2 c1_ddr3_dm[3] T1L_0 P
2 vrp T0U_12 –
Table 4-5: Two 32-Bit DDR3 Interfaces Contained in Three Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
2 c1_ddr3_dq[23] T0U_11 N
2 c1_ddr3_dq[22] T0U_10 P
2 c1_ddr3_dq[21] T0U_9 N
2 c1_ddr3_dq[20] T0U_8 P
2 c1_ddr3_dqs_n[2] T0U_7 N
2 c1_ddr3_dqs_p[2] T0U_6 P
2 c1_ddr3_dq[19] T0L_5 N
2 c1_ddr3_dq[18] T0L_4 P
2 c1_ddr3_dq[17] T0L_3 N
2 c1_ddr3_dq[16] T0L_2 P
2 c1_ddr3_cs_n[0] T0L_1 N
2 c1_ddr3_dm[2] T0L_0 P
1 c1_ddr3_reset_n T3U_12 –
1 c1_ddr3_dq[15] T3U_11 N
1 c1_ddr3_dq[14] T3U_10 P
1 c1_ddr3_dq[13] T3U_9 N
1 c1_ddr3_dq[12] T3U_8 P
1 c1_ddr3_dqs_n[1] T3U_7 N
1 c1_ddr3_dqs_p[1] T3U_6 P
1 c1_ddr3_dq[11] T3L_5 N
1 c1_ddr3_dq[10] T3L_4 P
1 c1_ddr3_dq[9] T3L_3 N
1 c1_ddr3_dq[8] T3L_2 P
1 – T3L_1 N
1 c1_ddr3_dm[1] T3L_0 P
1 – T2U_12 –
1 c1_ddr3_dq[7] T2U_11 N
1 c1_ddr3_dq[6] T2U_10 P
1 c1_ddr3_dq[5] T2U_9 N
1 c1_ddr3_dq[4] T2U_8 P
1 c1_ddr3_dqs_n[0] T2U_7 N
1 c1_ddr3_dqs_p[0] T2U_6 P
1 c1_ddr3_dq[3] T2L_5 N
Table 4-5: Two 32-Bit DDR3 Interfaces Contained in Three Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
1 c1_ddr3_dq[2] T2L_4 P
1 c1_ddr3_dq[1] T2L_3 N
1 c1_ddr3_dq[0] T2L_2 P
1 – T2L_1 N
1 c1_ddr3_dm[0] T2L_0 P
1 – T1U_12 –
1 c0_ddr3_dq[31] T1U_11 N
1 c0_ddr3_dq[30] T1U_10 P
1 c0_ddr3_dq[29] T1U_9 N
1 c0_ddr3_dq[28] T1U_8 P
1 c0_ddr3_dqs_n[3] T1U_7 N
1 c0_ddr3_dqs_p[3] T1U_6 P
1 c0_ddr3_dq[27] T1L_5 N
1 c0_ddr3_dq[26] T1L_4 P
1 c0_ddr3_dq[25] T1L_3 N
1 c0_ddr3_dq[24] T1L_2 P
1 – T1L_1 N
1 c0_ddr3_dm[3] T1L_0 P
1 – T0U_12 –
1 c0_ddr3_dq[23] T0U_11 N
1 c0_ddr3_dq[22] T0U_10 P
1 c0_ddr3_dq[21] T0U_9 N
1 c0_ddr3_dq[20] T0U_8 P
1 c0_ddr3_dqs_n[2] T0U_7 N
1 c0_ddr3_dqs_p[2] T0U_6 P
1 c0_ddr3_dq[19] T0L_5 N
1 c0_ddr3_dq[18] T0L_4 P
1 c0_ddr3_dq[17] T0L_3 N
1 c0_ddr3_dq[16] T0L_2 P
1 c0_ddr3_reset_n T0L_1 N
1 c0_ddr3_dm[2] T0L_0 P
0 c0_ddr3_cs_n[0] T3U_12 –
Table 4-5: Two 32-Bit DDR3 Interfaces Contained in Three Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
0 c0_ddr3_dq[15] T3U_11 N
0 c0_ddr3_dq[14] T3U_10 P
0 c0_ddr3_dq[13] T3U_9 N
0 c0_ddr3_dq[12] T3U_8 P
0 c0_ddr3_dqs_n[1] T3U_7 N
0 c0_ddr3_dqs_p[1] T3U_6 P
0 c0_ddr3_dq[11] T3L_5 N
0 c0_ddr3_dq[10] T3L_4 P
0 c0_ddr3_dq[9] T3L_3 N
0 c0_ddr3_dq[8] T3L_2 P
0 c0_ddr3_cke[0] T3L_1 N
0 c0_ddr3_dm[1] T3L_0 P
0 c0_ddr3_odt[0] T2U_12 –
0 c0_ddr3_dq[7] T2U_11 N
0 c0_ddr3_dq[6] T2U_10 P
0 c0_ddr3_dq[5] T2U_9 N
0 c0_ddr3_dq[4] T2U_8 P
0 c0_ddr3_dqs_n[0] T2U_7 N
0 c0_ddr3_dqs_p[0] T2U_6 P
0 c0_ddr3_dq[3] T2L_5 N
0 c0_ddr3_dq[2] T2L_4 P
0 c0_ddr3_dq[1] T2L_3 N
0 c0_ddr3_dq[0] T2L_2 P
0 c0_ddr3_we_n T2L_1 N
0 c0_ddr3_dm[0] T2L_0 P
0 c0_ddr3_cas_n T1U_12 –
0 c0_ddr3_ck_c[0] T1U_11 N
0 c0_ddr3_ck_t[0] T1U_10 P
0 c0_sys_clk_n T1U_9 N
0 c0_sys_clk_p T1U_8 P
0 c0_ddr3_ras_n T1U_7 N
0 c0_ddr3_ba[2] T1U_6 P
0 c0_ddr3_ba[1] T1L_5 N
Table 4-5: Two 32-Bit DDR3 Interfaces Contained in Three Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
0 c0_ddr3_ba[0] T1L_4 P
0 c0_ddr3_addr[15] T1L_3 N
0 c0_ddr3_addr[14] T1L_2 P
0 c0_ddr3_addr[13] T1L_1 N
0 c0_ddr3_addr[12] T1L_0 P
0 vrp T0U_12 –
0 c0_ddr3_addr[11] T0U_11 N
0 c0_ddr3_addr[10] T0U_10 P
0 c0_ddr3_addr[9] T0U_9 N
0 c0_ddr3_addr[8] T0U_8 P
0 c0_ddr3_addr[7] T0U_7 N
0 c0_ddr3_addr[6] T0U_6 P
0 c0_ddr3_addr[5] T0L_5 N
0 c0_ddr3_addr[4] T0L_4 P
0 c0_ddr3_addr[3] T0L_3 N
0 c0_ddr3_addr[2] T0L_2 P
0 c0_ddr3_addr[1] T0L_1 N
0 c0_ddr3_addr[0] T0L_0 P
• Address/control means cs_n, ras_n (a16), cas_n (a15), we_n (a14), ba, bg, ck, cke,
a, odt, act_n, and parity (valid for RDIMMs and LRDIMMs only). Multi-rank systems
have one cs_n, cke, odt, and one ck pair per rank.
• Pins in a byte lane are numbered N0 to N12.
• Byte lanes in a bank are designated as T0, T1, T2, or T3. Nibbles within a byte lane are
distinguished by a “U” or “L” designator added to the byte lane designator (T0, T1, T2,
or T3). Thus they are T0L, T0U, T1L, T1U, T2L, T2U, T3L, and T3U.
Note: There are two PLLs per bank and a controller uses one PLL in every bank that is being used by
the interface.
1. dqs, dq, and dm/dbi location.
a. Designs using x8 or x16 components – dqs must be located on a dedicated byte
clock pair in the upper nibble designated with “U” (N6 and N7). dq associated with
a dqs must be in same byte lane on any of the other pins except pins N1 and N12.
b. Designs using x4 components – dqs must be located on a dedicated byte clock pair
in the nibble (N0 and N1 in the lower nibble, N6 and N7 in the upper nibble). dq
associated with a dqs must be in same nibble on any of the other pins except pin
N12 (upper nibble). The lower nibble dq and upper nibble dq must be allocated in
the same byte lane.
Note: The dm/dbi port is not supported in x4 DDR4 devices.
c. dm/dbi must be on pin N0 in the byte lane with the associated dqs.
Note: When the IP is configured as NO_DM_NO_DBI, the IP always generates the dm/dbi ports
and they cannot be disabled, but they are not used during calibration or normal operation. On
the FPGA side, these pins can be left floating. At the memory device, follow the
manufacturer's recommendation, which is to leave them floating or to use a weak pull-up to
VDDQ. If the ports are not routed to the memory device and the IP is not set to
NO_DM_NO_DBI, the design fails calibration.
d. The x16 components must have the ldqs connected to the even dqs and the udqs
must be connected to the ldqs + 1. The first x16 component has ldqs connected
to dqs0 and udqs connected to dqs1 in the XDC file. The second x16 component
has ldqs connected to dqs2 and udqs connected to dqs3. This pattern continues
as needed for the interface. This does not restrict the physical location of the byte
lanes. The byte lanes associated with the dqs’s might be moved as desired in the
Vivado IDE to achieve optimal PCB routing.
Consider an x16 part with a data width of 32 where all data bytes are allocated in a single
bank. In such cases, DQS must be mapped as given in Table 4-6.
In Table 4-6, the Bank-Byte and Selected Memory Data Bytes columns indicate the byte
allocation in the I/O pin planner. The example is given for one of the generated
configurations in the I/O pin planner. Based on pin allocation, the DQ byte allocation
might vary.
DQS Allocated (in IP on the FPGA) indicates DQS that is allocated on the FPGA end.
Memory device mapping indicates how DQS needs to be mapped on the memory
end.
2. The x4 components must be used in pairs. Odd numbers of x4 components are not
permitted. Both the upper and lower nibbles of a data byte must be occupied by a x4
dq/dqs group. Each byte lane containing two x4 nibbles must have sequential nibbles
with the even nibble being the lower number. For example, a byte lane can have nibbles
0 and 1, or 2 and 3, but must not have 1 and 2. The ordering of the nibbles within a byte
lane is not important.
3. Byte lanes with a dqs are considered to be data byte lanes. Pins N1 and N12 can be used
for address/control in a data byte lane. If the data byte is in the same bank as the
remaining address/control pins, see step #4.
4. Address/control can be on any of the 13 pins in the address/control byte lanes. Address/
control must be contained within the same bank.
5. One vrp pin per bank is used and DCI is required for the interfaces. A vrp pin is
required in I/O banks containing inputs as well as in output only banks. It is required in
output only banks because address/control signals use SSTL12_DCI to enable usage of
controlled output impedance. DCI cascade is allowed for data rates of 2,133 Mb/s and
lower. When DCI cascade is used, the vrp pin can be used as a normal I/O. All rules for the
DCI in the UltraScale™ Architecture SelectIO™ Resources User Guide (UG571) [Ref 7] must
be followed.
RECOMMENDED: Xilinx strongly recommends that the DCIUpdateMode option is kept with the default
value of ASREQUIRED so that the DCI circuitry is allowed to operate normally.
IMPORTANT: If two controllers share a bank, they cannot be reset independently. The two controllers
must share a common reset input.
9. All I/O banks used by the memory interface must be in the same column.
10. All I/O banks used by the memory interface must be in the same SLR of the column for
the SSI technology devices.
11. For dual slot configurations of RDIMMs, LRDIMMs, and UDIMMs: cs, odt, cke, and ck
port widths are doubled. For exact mapping of the signals, see the DIMM
Configurations.
12. The maximum height of the interface is five contiguous banks. The maximum supported
interface width is 80 bits.
The maximum component limit is nine; this restriction applies to components only and
not to DIMMs.
IMPORTANT: Component interfaces should be created with the same component for all components in
the interface. x16 components have a different number of bank groups than the x8 components. For
example, a 72-bit wide component interface should be created by using nine x8 components or five x16
components where half of one component is not used. Four x16 components and one x8 component
are not permissible.
Note: Pins N0 and N6 within the byte lane used by a memory interface can be utilized for other
purposes when not needed for the memory interface. However, the functionality of these pins is not
available until VTC_RDY asserts on the BITSLICE_CONTROL. For more information, see the
UltraScale™ Architecture SelectIO™ Resources User Guide (UG571) [Ref 7].
If PCB compatibility between x4 and x8 based DIMMs is desired, additional restrictions apply. The
upper x4 DQS group must be placed within the lower byte nibble (N0 to N5). This allows DM to be
placed on N0 for the x8 pinout, pin compatibility for all DQ bits, and the added DQS pair for x4 to be
placed on N0/N1.
For example, a typical DDR4 x4 based RDIMM/LRDIMM data sheet shows the DQS9 associated with
DQ4, DQ5, DQ6, and DQ7. This DQS9_t is used for the DM/DBI in an x8 configuration. This nibble
must be connected to the lower nibble of the byte lane. The Vivado generated XDC labels this DQS9
as DQS1 (for more information, see the Pin Mapping for x4 RDIMMs/LRDIMMs). Table 4-7 and
Table 4-8 include an example for one of the configurations of x4/x8/x16.
Table 4-7: Byte Lane View of Bank on FPGA Die for x8 and x16 Support
I/O Type Byte Lane Pin Number Signal Name
– T0U N12 –
N T0U N11 DQ[7:0]
P T0U N10 DQ[7:0]
N T0U N9 DQ[7:0]
P T0U N8 DQ[7:0]
DQSCC-N T0U N7 DQS0_c
DQSCC-P T0U N6 DQS0_t
N T0L N5 DQ[7:0]
P T0L N4 DQ[7:0]
N T0L N3 DQ[7:0]
P T0L N2 DQ[7:0]
DQSCC-N T0L N1 –
DQSCC-P T0L N0 DM0/DBI0
Table 4-8: Byte Lane View of Bank on FPGA Die for x4, x8, and x16 Support
I/O Type Byte Lane Pin Number Signal Name
– T0U N12 –
N T0U N11 DQ[3:0]
P T0U N10 DQ[3:0]
N T0U N9 DQ[3:0]
P T0U N8 DQ[3:0]
DQSCC-N T0U N7 DQS0_c
DQSCC-P T0U N6 DQS0_t
N T0L N5 DQ[7:4]
Table 4-8: Byte Lane View of Bank on FPGA Die for x4, x8, and x16 Support (Cont’d)
I/O Type Byte Lane Pin Number Signal Name
P T0L N4 DQ[7:4]
N T0L N3 DQ[7:4]
P T0L N2 DQ[7:4]
DQSCC-N T0L N1 –/DQS9_c
DQSCC-P T0L N0 DM0/DBI0/DQS9_t
Pin Swapping
• Pins can swap freely within each byte group (data and address/control), except for the
DQS pair, which must be on the dedicated dqs pair in the nibble (for more information,
see dqs, dq, and dm/dbi location, page 101).
• Byte groups (data and address/control) can swap easily with each other.
• Pins in the address/control byte groups can swap freely within and between their byte
groups.
• No other pin swapping is permitted.
Table 4-9 shows an example of a 32-bit DDR4 interface contained within two banks. This
example is for a component interface using four x8 DDR4 components.
Table 4-9: 32-Bit DDR4 Interface (x8 Part) Contained in Two Banks
Bank Signal Name Byte Group I/O Type
Bank 1
1 – T3U_12 –
1 – T3U_11 N
1 – T3U_10 P
1 – T3U_9 N
1 – T3U_8 P
1 – T3U_7 N
1 – T3U_6 P
1 – T3L_5 N
1 – T3L_4 P
1 – T3L_3 N
1 – T3L_2 P
1 – T3L_1 N
1 – T3L_0 P
1 – T2U_12 –
1 – T2U_11 N
1 – T2U_10 P
1 – T2U_9 N
1 – T2U_8 P
1 – T2U_7 N
1 – T2U_6 P
1 – T2L_5 N
1 – T2L_4 P
1 – T2L_3 N
1 – T2L_2 P
1 – T2L_1 N
1 – T2L_0 P
1 reset_n T1U_12 –
1 dq31 T1U_11 N
1 dq30 T1U_10 P
1 dq29 T1U_9 N
1 dq28 T1U_8 P
1 dqs3_c T1U_7 N
1 dqs3_t T1U_6 P
1 dq27 T1L_5 N
1 dq26 T1L_4 P
1 dq25 T1L_3 N
1 dq24 T1L_2 P
1 unused T1L_1 N
1 dm3/dbi3 T1L_0 P
1 vrp T0U_12 –
1 dq23 T0U_11 N
1 dq22 T0U_10 P
1 dq21 T0U_9 N
1 dq20 T0U_8 P
1 dqs2_c T0U_7 N
1 dqs2_t T0U_6 P
1 dq19 T0L_5 N
1 dq18 T0L_4 P
1 dq17 T0L_3 N
1 dq16 T0L_2 P
1 – T0L_1 N
1 dm2/dbi2 T0L_0 P
Bank 2
2 a0 T3U_12 –
2 a1 T3U_11 N
2 a2 T3U_10 P
2 a3 T3U_9 N
2 a4 T3U_8 P
2 a5 T3U_7 N
2 a6 T3U_6 P
2 a7 T3L_5 N
2 a8 T3L_4 P
2 a9 T3L_3 N
2 a10 T3L_2 P
2 a11 T3L_1 N
2 a12 T3L_0 P
2 a13 T2U_12 –
2 we_n/a14 T2U_11 N
2 cas_n/a15 T2U_10 P
2 ras_n/a16 T2U_9 N
2 act_n T2U_8 P
2 ck_c T2U_7 N
2 ck_t T2U_6 P
2 ba0 T2L_5 N
2 ba1 T2L_4 P
2 bg0 T2L_3 N
2 bg1 T2L_2 P
2 sys_clk_n T2L_1 N
2 sys_clk_p T2L_0 P
2 cs_n T1U_12 –
2 dq15 T1U_11 N
2 dq14 T1U_10 P
2 dq13 T1U_9 N
2 dq12 T1U_8 P
2 dqs1_c T1U_7 N
2 dqs1_t T1U_6 P
2 dq11 T1L_5 N
2 dq10 T1L_4 P
2 dq9 T1L_3 N
2 dq8 T1L_2 P
2 odt T1L_1 N
2 dm1/dbi1 T1L_0 P
2 vrp T0U_12 –
2 dq7 T0U_11 N
2 dq6 T0U_10 P
2 dq5 T0U_9 N
2 dq4 T0U_8 P
2 dqs0_c T0U_7 N
2 dqs0_t T0U_6 P
2 dq3 T0L_5 N
2 dq2 T0L_4 P
2 dq1 T0L_3 N
2 dq0 T0L_2 P
2 cke T0L_1 N
2 dm0/dbi0 T0L_0 P
Table 4-10 shows an example of a 16-bit DDR4 interface contained within a single bank.
This example is for a component interface using four x4 DDR4 components.
Table 4-10: 16-Bit DDR4 Interface (x4 Part) Contained in One Bank
Bank Signal Name Byte Group I/O Type
1 a0 T3U_12 –
1 a1 T3U_11 N
1 a2 T3U_10 P
1 a3 T3U_9 N
1 a4 T3U_8 P
1 a5 T3U_7 N
1 a6 T3U_6 P
1 a7 T3L_5 N
1 a8 T3L_4 P
1 a9 T3L_3 N
1 a10 T3L_2 P
1 a11 T3L_1 N
1 a12 T3L_0 P
1 a13 T2U_12 –
1 we_n/a14 T2U_11 N
1 cas_n/a15 T2U_10 P
1 ras_n/a16 T2U_9 N
1 act_n T2U_8 P
1 ck_c T2U_7 N
1 ck_t T2U_6 P
1 ba0 T2L_5 N
1 ba1 T2L_4 P
1 bg0 T2L_3 N
1 bg1 T2L_2 P
Table 4-10: 16-Bit DDR4 Interface (x4 Part) Contained in One Bank (Cont’d)
Bank Signal Name Byte Group I/O Type
1 odt T2L_1 N
1 cke T2L_0 P
1 cs_n T1U_12 –
1 dq15 T1U_11 N
1 dq14 T1U_10 P
1 dq13 T1U_9 N
1 dq12 T1U_8 P
1 dqs3_c T1U_7 N
1 dqs3_t T1U_6 P
1 dq11 T1L_5 N
1 dq10 T1L_4 P
1 dq9 T1L_3 N
1 dq8 T1L_2 P
1 dqs2_c T1L_1 N
1 dqs2_t T1L_0 P
1 vrp T0U_12 –
1 dq7 T0U_11 N
1 dq6 T0U_10 P
1 dq5 T0U_9 N
1 dq4 T0U_8 P
1 dqs1_c T0U_7 N
1 dqs1_t T0U_6 P
1 dq3 T0L_5 N
1 dq2 T0L_4 P
1 dq1 T0L_3 N
1 dq0 T0L_2 P
1 dqs0_c T0L_1 N
1 dqs0_t T0L_0 P
Note: System clock pins (sys_clk_p and sys_clk_n) are allocated in different banks.
Two DDR4 32-bit interfaces can fit in three banks by using all of the pins in the banks. To fit
the configuration in three banks for various scenarios, different Vivado IDE options can be
selected (based on requirement). Various Vivado IDE options that lead to pin savings are
listed as follows:
• In data byte group, pins 1 and 12 are unused. Unused pins of the data byte group can
be used for Address/Control pins if all Address/Control pins are allocated in the same
bank.
For example, if the T3 byte group of Bank #2 is selected for data, pins T3L_1 and T3U_12 are
not used by data and can be used for Address/Control if all Address/Control
pins are allocated in Bank #2.
• If DCI cascade is selected, the vrp pin can be used as a normal I/O. DCI cascade is
allowed for data rates of 2,133 Mb/s and lower.
• Memory reset pin (reset_n pin) can be allocated anywhere as long as timing is met.
• System clock pins can be allocated in different banks and must be within the same
column of the memory interface banks selected.
One of the configurations with two 32-bit DDR4 interfaces in three banks is given in
Table 4-11 (valid for x8/x16 memory parts). The signals of the two interfaces are
distinguished by the prefixes c0_ and c1_. In this example, interface-0 (c0) is allocated in
banks 0 and 1, and interface-1 (c1) is allocated in banks 1 and 2.
Table 4-11: Two 32-Bit DDR4 Interfaces Contained in Three Banks
Bank Signal Name Byte Group I/O Type
2 c1_ddr4_adr[10] T2U_12 –
2 c1_ddr4_adr[9] T2U_11 N
2 c1_ddr4_adr[8] T2U_10 P
2 c1_ddr4_adr[7] T2U_9 N
2 c1_ddr4_adr[6] T2U_8 P
2 c1_ddr4_adr[5] T2U_7 N
Table 4-11: Two 32-Bit DDR4 Interfaces Contained in Three Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
2 c1_ddr4_adr[4] T2U_6 P
2 c1_ddr4_adr[3] T2L_5 N
2 c1_ddr4_adr[2] T2L_4 P
2 c1_ddr4_adr[1] T2L_3 N
2 c1_ddr4_adr[0] T2L_2 P
2 c1_sys_clk_n T2L_1 N
2 c1_sys_clk_p T2L_0 P
2 c1_ddr4_act_n T1U_12 –
2 c1_ddr4_dq[31] T1U_11 N
2 c1_ddr4_dq[30] T1U_10 P
2 c1_ddr4_dq[29] T1U_9 N
2 c1_ddr4_dq[28] T1U_8 P
2 c1_ddr4_dqs_c[3] T1U_7 N
2 c1_ddr4_dqs_t[3] T1U_6 P
2 c1_ddr4_dq[27] T1L_5 N
2 c1_ddr4_dq[26] T1L_4 P
2 c1_ddr4_dq[25] T1L_3 N
2 c1_ddr4_dq[24] T1L_2 P
2 c1_ddr4_odt[0] T1L_1 N
2 c1_ddr4_dm_dbi[3] T1L_0 P
2 vrp T0U_12 –
2 c1_ddr4_dq[23] T0U_11 N
2 c1_ddr4_dq[22] T0U_10 P
2 c1_ddr4_dq[21] T0U_9 N
2 c1_ddr4_dq[20] T0U_8 P
2 c1_ddr4_dqs_c[2] T0U_7 N
2 c1_ddr4_dqs_t[2] T0U_6 P
2 c1_ddr4_dq[19] T0L_5 N
2 c1_ddr4_dq[18] T0L_4 P
2 c1_ddr4_dq[17] T0L_3 N
2 c1_ddr4_dq[16] T0L_2 P
2 c1_ddr4_cs_n[0] T0L_1 N
2 c1_ddr4_dm_dbi[2] T0L_0 P
Table 4-11: Two 32-Bit DDR4 Interfaces Contained in Three Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
1 c1_ddr4_reset_n T3U_12 –
1 c1_ddr4_dq[15] T3U_11 N
1 c1_ddr4_dq[14] T3U_10 P
1 c1_ddr4_dq[13] T3U_9 N
1 c1_ddr4_dq[12] T3U_8 P
1 c1_ddr4_dqs_c[1] T3U_7 N
1 c1_ddr4_dqs_t[1] T3U_6 P
1 c1_ddr4_dq[11] T3L_5 N
1 c1_ddr4_dq[10] T3L_4 P
1 c1_ddr4_dq[9] T3L_3 N
1 c1_ddr4_dq[8] T3L_2 P
1 – T3L_1 N
1 c1_ddr4_dm_dbi[1] T3L_0 P
1 – T2U_12 –
1 c1_ddr4_dq[7] T2U_11 N
1 c1_ddr4_dq[6] T2U_10 P
1 c1_ddr4_dq[5] T2U_9 N
1 c1_ddr4_dq[4] T2U_8 P
1 c1_ddr4_dqs_c[0] T2U_7 N
1 c1_ddr4_dqs_t[0] T2U_6 P
1 c1_ddr4_dq[3] T2L_5 N
1 c1_ddr4_dq[2] T2L_4 P
1 c1_ddr4_dq[1] T2L_3 N
1 c1_ddr4_dq[0] T2L_2 P
1 – T2L_1 N
1 c1_ddr4_dm_dbi[0] T2L_0 P
1 – T1U_12 –
1 c0_ddr4_dq[31] T1U_11 N
1 c0_ddr4_dq[30] T1U_10 P
1 c0_ddr4_dq[29] T1U_9 N
1 c0_ddr4_dq[28] T1U_8 P
1 c0_ddr4_dqs_c[3] T1U_7 N
Table 4-11: Two 32-Bit DDR4 Interfaces Contained in Three Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
1 c0_ddr4_dqs_t[3] T1U_6 P
1 c0_ddr4_dq[27] T1L_5 N
1 c0_ddr4_dq[26] T1L_4 P
1 c0_ddr4_dq[25] T1L_3 N
1 c0_ddr4_dq[24] T1L_2 P
1 – T1L_1 N
1 c0_ddr4_dm_dbi[3] T1L_0 P
1 – T0U_12 –
1 c0_ddr4_dq[23] T0U_11 N
1 c0_ddr4_dq[22] T0U_10 P
1 c0_ddr4_dq[21] T0U_9 N
1 c0_ddr4_dq[20] T0U_8 P
1 c0_ddr4_dqs_c[2] T0U_7 N
1 c0_ddr4_dqs_t[2] T0U_6 P
1 c0_ddr4_dq[19] T0L_5 N
1 c0_ddr4_dq[18] T0L_4 P
1 c0_ddr4_dq[17] T0L_3 N
1 c0_ddr4_dq[16] T0L_2 P
1 c0_ddr4_reset_n T0L_1 N
1 c0_ddr4_dm_dbi[2] T0L_0 P
0 c0_ddr4_bg[1] T3U_12 –
0 c0_ddr4_dq[15] T3U_11 N
0 c0_ddr4_dq[14] T3U_10 P
0 c0_ddr4_dq[13] T3U_9 N
0 c0_ddr4_dq[12] T3U_8 P
0 c0_ddr4_dqs_c[1] T3U_7 N
0 c0_ddr4_dqs_t[1] T3U_6 P
0 c0_ddr4_dq[11] T3L_5 N
0 c0_ddr4_dq[10] T3L_4 P
0 c0_ddr4_dq[9] T3L_3 N
0 c0_ddr4_dq[8] T3L_2 P
0 c0_ddr4_cke[0] T3L_1 N
0 c0_ddr4_dm_dbi[1] T3L_0 P
Table 4-11: Two 32-Bit DDR4 Interfaces Contained in Three Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
0 c0_ddr4_act_n T2U_12 –
0 c0_ddr4_dq[7] T2U_11 N
0 c0_ddr4_dq[6] T2U_10 P
0 c0_ddr4_dq[5] T2U_9 N
0 c0_ddr4_dq[4] T2U_8 P
0 c0_ddr4_dqs_c[0] T2U_7 N
0 c0_ddr4_dqs_t[0] T2U_6 P
0 c0_ddr4_dq[3] T2L_5 N
0 c0_ddr4_dq[2] T2L_4 P
0 c0_ddr4_dq[1] T2L_3 N
0 c0_ddr4_dq[0] T2L_2 P
0 c0_ddr4_cs_n[0] T2L_1 N
0 c0_ddr4_dm_dbi[0] T2L_0 P
0 c0_ddr4_odt[0] T1U_12 –
0 c0_ddr4_ck_c[0] T1U_11 N
0 c0_ddr4_ck_t[0] T1U_10 P
0 c0_sys_clk_n T1U_9 N
0 c0_sys_clk_p T1U_8 P
0 c0_ddr4_bg[0] T1U_7 N
0 c0_ddr4_ba[1] T1U_6 P
0 c0_ddr4_ba[0] T1L_5 N
0 c0_ddr4_adr[16] T1L_4 P
0 c0_ddr4_adr[15] T1L_3 N
0 c0_ddr4_adr[14] T1L_2 P
0 c0_ddr4_adr[13] T1L_1 N
0 c0_ddr4_adr[12] T1L_0 P
0 vrp T0U_12 –
0 c0_ddr4_adr[11] T0U_11 N
0 c0_ddr4_adr[10] T0U_10 P
0 c0_ddr4_adr[9] T0U_9 N
0 c0_ddr4_adr[8] T0U_8 P
0 c0_ddr4_adr[7] T0U_7 N
Table 4-11: Two 32-Bit DDR4 Interfaces Contained in Three Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
0 c0_ddr4_adr[6] T0U_6 P
0 c0_ddr4_adr[5] T0L_5 N
0 c0_ddr4_adr[4] T0L_4 P
0 c0_ddr4_adr[3] T0L_3 N
0 c0_ddr4_adr[2] T0L_2 P
0 c0_ddr4_adr[1] T0L_1 N
0 c0_ddr4_adr[0] T0L_0 P
Table 4-13 is an example showing the pin mapping for x4 DDR4 registered DIMMs between
the memory data sheet and the XDC.
Table 4-13: Pin Mapping for x4 DDR4 DIMMs
Memory Data Sheet DDR4 SDRAM XDC
DQ[63:0] DQ[63:0]
CB3 to CB0 DQ[67:64]
CB7 to CB4 DQ[71:68]
DQS0 DQS[0]
DQS1 DQS[2]
DQS2 DQS[4]
DQS3 DQS[6]
DQS4 DQS[8]
DQS5 DQS[10]
DQS6 DQS[12]
DQS7 DQS[14]
DQS8 DQS[16]
DQS9 DQS[1]
DQS10 DQS[3]
DQS11 DQS[5]
DQS12 DQS[7]
DQS13 DQS[9]
DQS14 DQS[11]
DQS15 DQS[13]
DQS16 DQS[15]
DQS17 DQS[17]
Protocol Description
This core has the following interfaces:
• User Interface
• AXI4 Slave Interface
• PHY Only Interface
User Interface
The user interface signals are described in Table 4-14. The user interface connects to an FPGA
user design to allow access to an external memory device and is layered on top of the
native interface, which is described earlier in the controller description.
Notes:
1. This port appears when "Enable Precharge Input" option is enabled in the Vivado IDE.
2. These ports appear upon enabling "Enable User Refresh and ZQCS Input" option in the Vivado IDE.
app_addr[APP_ADDR_WIDTH – 1:0]
This input indicates the address for the request currently being submitted to the user
interface. The user interface aggregates all the address fields of the external SDRAM and
presents a flat address space.
The controller does not support burst ordering, so these low-order bits are ignored, making
the effective minimum app_addr step size hex 8.
Note: The “+:” notation indicates an indexed vector part select. For example, for the given input signal
input [511:0] app_addr;
app_addr[255+:64] selects the 64 bits of app_addr starting at bit 255, that is, bits 318 down to 255.
The “ROW_COLUMN_BANK” setting maps app_addr[4:3] to the DDR4 bank group bits or
DDR3 bank bits used by the controller to interleave between its group FSMs. The address
bits from app_addr[5] upward map to the remaining SDRAM bank and column address bits.
The highest-order address bits map to the SDRAM row. This
mapping is ideal for workloads that have address streams that increment linearly by a
constant step size of hex 8 for long periods. With this configuration and workload,
transactions sent to the user interface are evenly interleaved across the controller group
FSMs, making the best use of the controller resources.
In addition, this arrangement tends to generate hits to open pages in the SDRAM. The
combination of group FSM interleaving and SDRAM page hits results in very high SDRAM
data bus utilization.
Address streams other than the simple increment pattern tend to have lower SDRAM bus
utilization. You can recover this performance loss by tuning the mapping of your design flat
address space to the app_addr input port of the user interface. If you have knowledge of
your address sequence, you can add logic to map your address bits with the highest toggle
rate to the lowest app_addr bits, starting with app_addr[3] and working up from there.
For example, if you know that your workload address Bits[4:3] toggle much less than
Bits[10:9], which toggle at the highest rate, you could add logic to swap these bits so that
your address Bits[10:9] map to app_addr[4:3]. The result is an improvement in how the
address stream interleaves across the controller group FSMs, resulting in better controller
throughput and higher SDRAM data bus utilization.
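A hedged sketch of the bit swap described above is shown below. The design_addr name, the 28-bit width, and the choice of bits [10:9] as the fastest-toggling bits are assumptions for illustration only; the mapping itself is just the swap suggested in the previous paragraph.

// Hedged sketch: remapping a user flat address so that its fastest-toggling
// bits land on app_addr[4:3], which the ROW_COLUMN_BANK map uses to
// interleave across the controller group FSMs. Signal names and the 28-bit
// width are assumptions for illustration only.
module app_addr_remap (
  input  wire [27:0] design_addr, // user flat address (bits [10:9] toggle fastest)
  output wire [27:0] app_addr     // address presented to the user interface
);
  // Swap bits [10:9] and [4:3]; leave every other bit in place.
  assign app_addr = {design_addr[27:11],
                     design_addr[4:3],    // slow-toggling bits moved up
                     design_addr[8:5],
                     design_addr[10:9],   // fast-toggling bits now drive app_addr[4:3]
                     design_addr[2:0]};
endmodule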
The ROW_COLUMN_BANK_INTLV is a mapping option that swaps a column and bank bit.
With this option, a sequential address stream maps the first eight transactions across four
banks instead of eight banks. Then the next eight transactions map to the next four banks,
and so on. This helps with the performance when there are short bursts of sequential
addresses instead of very long bursts.
Table 4-27 through Table 4-32 show the “ROW_COLUMN_BANK_INTLV” mapping for DDR3
and DDR4 with examples.
app_cmd[2:0]
This input specifies the command for the request currently being submitted to the user
interface. The available commands are shown in Table 4-33. With ECC enabled, the
wr_bytes operation is required for writes with any non-zero app_wdf_mask bits. The
wr_bytes triggers a read-modify-write flow in the controller, which is needed only for
writes with masked data in ECC mode.
app_autoprecharge
This input specifies the state of the A10 autoprecharge bit for the DRAM CAS command
for the request currently being submitted to the user interface. When this input is Low, the
Memory Controller issues a DRAM RD or WR CAS command. When this input is High, the
controller issues a DRAM RDA or WRA CAS command. This input provides per request
control, but can also be tied off to configure the controller statically for open or closed
page mode operation. The Memory Controller also has an option to automatically
determine when to issue an AutoPrecharge. This option disables the app_autoprecharge
input. For more information on the automatic mode, see Performance, page 189.
app_en
This input strobes in a request. Apply the desired values to app_addr[], app_cmd[2:0], and
app_hi_pri, and then assert app_en to submit the request to the user interface. This
initiates a handshake that the user interface acknowledges by asserting app_rdy.
app_wdf_data[APP_DATA_WIDTH – 1:0]
This bus provides the data currently being written to the external memory.
APP_DATA_WIDTH is 2 × nCK_PER_CLK × DQ_WIDTH when ECC is disabled (ECC parameter
value is OFF) and 2 × nCK_PER_CLK × (DQ_WIDTH – ECC_WIDTH) when ECC is enabled (ECC
parameter is ON).
PAYLOAD_WIDTH indicates the effective DQ_WIDTH on which the user interface data has
been transferred.
app_wdf_end
This input indicates that the data on the app_wdf_data[] bus in the current cycle is the
last data for the current request.
app_wdf_mask[APP_MASK_WIDTH – 1:0]
This bus indicates which bits of app_wdf_data[] are written to the external memory and
which bits remain in their current state. APP_MASK_WIDTH is APP_DATA_WIDTH/8.
app_wdf_wren
This input indicates that the data on the app_wdf_data[] bus is valid.
app_rdy
This output indicates whether the request currently being submitted to the user interface is
accepted. If the user interface does not assert this signal after app_en is asserted, the
current request must be retried. The app_rdy output is not asserted if:
° All the controller group FSMs are occupied (this can be viewed as the command buffer
being full).
° A read is requested and the read buffer is full.
° A write is requested and no write buffer pointers are available.
app_rd_data[APP_DATA_WIDTH – 1:0]
This output contains the data read from the external memory.
app_rd_data_end
This output indicates that the data on the app_rd_data[] bus in the current cycle is the
last data for the current request.
app_rd_data_valid
This output indicates that the data on the app_rd_data[] bus is valid.
app_wdf_rdy
This output indicates that the write data FIFO is ready to receive data. Write data is accepted
when both app_wdf_rdy and app_wdf_wren are asserted.
app_ref_req
When asserted, this active-High input requests that the Memory Controller send a refresh
command to the DRAM. It must be pulsed for a single cycle to make the request and then
deasserted at least until the app_ref_ack signal is asserted to acknowledge the request
and indicate that it has been sent.
app_ref_ack
When asserted, this active-High output acknowledges a refresh request and indicates that
the command has been sent from the Memory Controller to the PHY.
app_zq_req
When asserted, this active-High input requests that the Memory Controller send a ZQ
calibration command to the DRAM. It must be pulsed for a single cycle to make the request
and then deasserted at least until the app_zq_ack signal is asserted to acknowledge the
request and indicate that it has been sent.
app_zq_ack
When asserted, this active-High output acknowledges a ZQ calibration request and indicates
that the command has been sent from the Memory Controller to the PHY.
ui_clk_sync_rst
This is the reset from the user interface, which is synchronous with ui_clk.
ui_clk
This is the output clock from the user interface. It is one quarter of the frequency of the
clock going out to the external SDRAM, which corresponds to the 4:1 mode selected in the Vivado IDE.
init_calib_complete
PHY asserts init_calib_complete when calibration is finished. The application has no
need to wait for init_calib_complete before sending commands to the Memory
Controller.
Command Path
When the user logic app_en signal is asserted and the app_rdy signal is asserted from the
user interface, a command is accepted and written to the FIFO by the user interface. The
command is ignored by the user interface whenever app_rdy is deasserted. The user logic
needs to hold app_en High along with the valid command, autoprecharge, and address
values until app_rdy is asserted as shown for the "write with autoprecharge" transaction in
Figure 4-2.
Figure 4-2: User Interface Command Timing Diagram with app_rdy Asserted
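The hold-until-accepted behavior in Figure 4-2 can be sketched in user logic as follows. This is a minimal sketch only: cmd_valid, cmd, cmd_addr, and the 28-bit address width are placeholder user-side assumptions, not IP ports.

// Hedged sketch of the command handshake: drive app_en with a valid command
// and hold everything stable until the cycle where app_rdy is also High.
// cmd_valid/cmd/cmd_addr are placeholder user-side signals.
module ui_cmd_issue_sketch (
  input  wire        app_rdy,        // from the user interface
  input  wire        cmd_valid,      // user design has a command to issue
  input  wire [2:0]  cmd,            // encoded command (see Table 4-33)
  input  wire [27:0] cmd_addr,       // example address width (assumption)
  output wire        app_en,
  output wire [2:0]  app_cmd,
  output wire [27:0] app_addr,
  output wire        cmd_accepted    // back-pressure indication to user logic
);
  // Combinationally present the command; the request is only considered
  // accepted on a cycle where both app_en and app_rdy are High.
  assign app_en       = cmd_valid;
  assign app_cmd      = cmd;
  assign app_addr     = cmd_addr;
  assign cmd_accepted = app_en & app_rdy;
endmodule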
A non back-to-back write command can be issued as shown in Figure 4-3. This figure
depicts three scenarios for the app_wdf_data, app_wdf_wren, and app_wdf_end
signals. For write data that is output after the write command has been registered, as shown in
Note 3 of Figure 4-3, the maximum delay is two clock cycles.
Figure 4-3: 4:1 Mode User Interface Write Timing Diagram (Memory Burst Type = BL8)
Write Path
The write data is registered in the write FIFO when app_wdf_wren is asserted and
app_wdf_rdy is High (Figure 4-4). If app_wdf_rdy is deasserted, the user logic needs to
hold app_wdf_wren and app_wdf_end High along with the valid app_wdf_data value
until app_wdf_rdy is asserted. The app_wdf_mask signal can be used to mask out the
bytes to write to external memory.
Figure 4-4: 4:1 Mode User Interface Back-to-Back Write Commands Timing Diagram (Memory Burst Type = BL8)
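Similarly, a hedged sketch of the write data handshake follows. The wr_data_valid, wr_data, and wr_mask names are placeholder user-side signals, and a single user-interface beat per BL8 burst (4:1 mode) is assumed so that app_wdf_end can simply follow app_wdf_wren.

// Hedged sketch of the write data path: hold app_wdf_wren/app_wdf_end and the
// data stable until the cycle where app_wdf_rdy is also High. Assumes one
// user-interface beat per BL8 burst (4:1 mode), so app_wdf_end = app_wdf_wren.
module ui_write_data_sketch #(
  parameter APP_DATA_WIDTH = 512,
  parameter APP_MASK_WIDTH = APP_DATA_WIDTH/8
) (
  input  wire                      app_wdf_rdy,    // from the user interface
  input  wire                      wr_data_valid,  // user data beat available
  input  wire [APP_DATA_WIDTH-1:0] wr_data,
  input  wire [APP_MASK_WIDTH-1:0] wr_mask,
  output wire                      app_wdf_wren,
  output wire                      app_wdf_end,
  output wire [APP_DATA_WIDTH-1:0] app_wdf_data,
  output wire [APP_MASK_WIDTH-1:0] app_wdf_mask,
  output wire                      wr_data_accepted
);
  assign app_wdf_wren     = wr_data_valid;
  assign app_wdf_end      = app_wdf_wren;          // single beat per burst
  assign app_wdf_data     = wr_data;
  assign app_wdf_mask     = wr_mask;
  assign wr_data_accepted = app_wdf_wren & app_wdf_rdy;
endmodule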
The map of the application interface data to the DRAM output data can be explained with
an example.
For a 4:1 Memory Controller to DRAM clock ratio with an 8-bit memory, at the application
interface, if the 64-bit data driven is 0000_0806_0000_0805 (Hex), the data at the DRAM
interface is as shown in Figure 4-5. This is for a BL8 (Burst Length 8) transaction.
Figure 4-5: Data at the DDR3 Interface for a BL8 Write of 0000_0806_0000_0805 (Hex)
The data values at different clock edges are as shown in Table 4-34.
Table 4-35 shows a generalized representation of how DRAM DQ bus data is concatenated
to form application interface data signals. app_wdf_data is shown in Table 4-35, but the
table applies equally to app_rd_data. Each byte of the DQ bus has eight bursts, Rise0
(burst 0) through Fall3 (burst 7) as shown previously in Table 4-34, for a total of 64 data bits.
When concatenated with Rise0 in the LSB position and Fall3 in the MSB position, a 64-bit
chunk of the app_wdf_data signal is formed.
For example, the eight bursts of ddr3_dq[7:0] corresponds to DQ bus byte 0, and when
concatenated as described here, they map to app_wdf_data[63:0]. To be clear on the
concatenation order, ddr3_dq[0] from Rise0 (burst 0) maps to app_wdf_data[0], and
ddr3_dq[7] from Fall3 (burst 7) maps to app_wdf_data[63]. The table shows a second
example, mapping DQ byte 1 to app_wdf_data[127:64], as well as the formula for DQ
byte N.
Table 4-35: DRAM DQ Byte to Application Interface Data Mapping
DQ Bus Byte | App Interface Signal | DDR Bus Signal at Each BL8 Burst Position (Rise0, Fall0, Rise1, …, Fall3)
N | app_wdf_data[(N + 1) × 64 – 1:N × 64] | ddr3_dq[(N + 1) × 8 – 1:N × 8] at every burst position
1 | app_wdf_data[127:64] | ddr3_dq[15:8] at every burst position
0 | app_wdf_data[63:0] | ddr3_dq[7:0] at every burst position
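The byte-N concatenation described above can be written out as a small Verilog sketch. The per-burst byte inputs are an illustration of the ordering only; the PHY delivers this data as one app_rd_data/app_wdf_data word, not as separate signals.

// Hedged sketch of the Table 4-35 ordering for one DQ byte: Rise0 occupies the
// least significant byte of the 64-bit chunk and Fall3 the most significant.
module dq_byte_concat_sketch (
  input  wire [7:0]  rise0, fall0, rise1, fall1,  // DQ byte N at bursts 0..3
  input  wire [7:0]  rise2, fall2, rise3, fall3,  // DQ byte N at bursts 4..7
  output wire [63:0] app_chunk                    // maps to app_wdf_data[(N+1)*64-1 : N*64]
);
  // ddr3_dq[8N] at Rise0 lands on app_chunk[0];
  // ddr3_dq[8N+7] at Fall3 lands on app_chunk[63].
  assign app_chunk = {fall3, rise3, fall2, rise2, fall1, rise1, fall0, rise0};
endmodule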
Read Path
The read data is returned by the user interface in the requested order and is valid when
app_rd_data_valid is asserted (Figure 4-6 and Figure 4-7). The app_rd_data_end
signal indicates the end of each read command burst and is not needed in user logic.
Figure 4-6: 4:1 Mode User Interface Read Timing Diagram (Memory Burst Type = BL8) #1
Figure 4-7: 4:1 Mode User Interface Read Timing Diagram (Memory Burst Type = BL8) #2
In Figure 4-7, the read data returned is always in the same order as the requests made on
the address/control bus.
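A hedged sketch of capturing the returned read data is shown below; the rd_data_q capture register and its name are illustrative only.

// Hedged sketch of the read return path: app_rd_data is simply registered on
// every cycle where app_rd_data_valid is High. Data is returned in request
// order, so no reordering logic is needed; app_rd_data_end can be ignored.
module ui_read_capture_sketch #(
  parameter APP_DATA_WIDTH = 512
) (
  input  wire                      ui_clk,
  input  wire                      app_rd_data_valid,
  input  wire [APP_DATA_WIDTH-1:0] app_rd_data,
  output reg                       rd_data_q_valid,
  output reg  [APP_DATA_WIDTH-1:0] rd_data_q
);
  always @(posedge ui_clk) begin
    rd_data_q_valid <= app_rd_data_valid;
    if (app_rd_data_valid)
      rd_data_q <= app_rd_data;   // capture each returned beat
  end
endmodule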
Maintenance Commands
The UI can be configured by the Vivado IDE to enable two DRAM Refresh modes. The
default mode configures the UI and the Memory Controller to automatically generate
DRAM Refresh and ZQCS commands, meeting all DRAM protocol and timing requirements.
The controller interrupts normal system traffic on a regular basis to issue these
maintenance commands on the DRAM bus.
The User mode is enabled by checking the Enable User Refresh and ZQCS Input option in
the Vivado IDE. In this mode, you are responsible for issuing Refresh and ZQCS commands
at the rate required by the DRAM component specification after init_calib_complete
asserts High. You use the app_ref_req and app_zq_req signals on the UI to request
Refresh and ZQCS commands, and monitor app_ref_ack and app_zq_ack to know when
the commands have completed. The controller manages all DRAM timing and protocol for
these commands, other than the overall Refresh or ZQCS rate, just as it does for the default
DRAM Refresh mode. These request/ack ports operate independently of the other UI
command ports, like app_cmd and app_en.
The controller might not preserve the exact ordering of maintenance transactions
presented to the UI relative to regular read and write transactions. When you request a
Refresh or ZQCS, the controller interrupts system traffic, just as in the default mode, and
inserts the maintenance commands. To take the best advantage of this mode, you should
request maintenance commands when the controller is idle or at least not very busy,
keeping in mind that the DRAM Refresh rate and ZQCS rate requirements cannot be
violated.
Figure 4-8 shows how the User mode ports are used and how they affect the DRAM
command bus. This diagram shows the general idea about this mode of operation and is
not timing accurate. Assuming the DRAM is idle with all banks closed, a short time after
app_ref_req or app_zq_req are asserted High for one system clock cycle, the controller
issues the requested commands on the DRAM command bus. The app_ref_req and
app_zq_req can be asserted on the same cycle or different cycles, and they do not have to
be asserted at the same rate. After a request signal is asserted High for one system clock,
you must keep it deasserted until the acknowledge signal asserts.
Figure 4-8: User Mode Ports on DRAM Command Bus Timing Diagram
Figure 4-9 shows a case where the app_en is asserted and read transactions are presented
continuously to the UI when the app_ref_req and app_zq_req are asserted. The
controller interrupts the DRAM traffic following DRAM protocol and timing requirements,
issues the Refresh and ZQCS, and then continues issuing the read transactions. Note that
the app_rdy signal deasserts during this sequence. It is likely to deassert during a
sequence like this since the controller command queue can easily fill up during tRFC or
tZQCS. After the maintenance commands are issued and normal traffic resumes on the bus,
the app_rdy signal asserts and new transactions are accepted again into the controller.
Figure 4-9: (waveform signals: system clk, app_cmd = Read, app_en, app_rdy, app_ref_req, app_zq_req, app_ref_ack, app_zq_ack, DRAM clk, CS_n; DRAM command sequence: Read CAS, Precharge, Refresh, ZQCS, Activate, Read CAS)
Figure 4-9 shows the operation for a single-rank system. In a multi-rank system, a single refresh
request generates a DRAM Refresh command to each rank, in series, staggered by tRFC/2.
The Refresh commands are staggered because Refresh is a relatively high-power operation. A
ZQCS command request generates a ZQCS command to all ranks in parallel.
The overall design is composed of separate blocks to handle each AXI channel, which allows
for independent read and write transactions. Read and write commands to the UI rely on a
simple round-robin arbiter to handle simultaneous requests.
The address read/address write modules are responsible for chopping the AXI4 incr/wrap
requests into smaller memory size burst lengths of either four or eight, and also conveying
the smaller burst lengths to the read/write data modules so they can interact with the user
interface. Fixed burst type is not supported.
If ECC is enabled, write commands with any of the mask bits enabled are issued as
read-modify-write operations, while write commands with none of the mask bits enabled are
issued as normal write operations.
AXI Addressing
The AXI address from the AXI master is a true byte address. The AXI shim converts the
address from the AXI master to the memory address based on the AXI SIZE and the memory data width. The
LSBs of the AXI byte address are masked to 0, depending on the data width of the memory
array. If the memory array is 64 bits (8 bytes) wide, AXI address[2:0] are ignored and treated
as 0. If the memory array is 16 bits (2 bytes) wide, AXI address[0] is ignored and treated as
0. DDR3/DDR4 DRAM is accessed in blocks of DRAM bursts and this memory controller
always uses a fixed burst length of 8. The UI Data Width is always eight times the
PAYLOAD_WIDTH.
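As a minimal sketch of the LSB masking just described (an illustration under assumed module and parameter names, not the shim RTL), the following module zeroes the AXI byte-address bits that fall below the memory array width:
// Illustrative only: force the AXI byte-address LSBs to 0 based on the
// memory array width (module and parameter names are assumptions).
module axi_addr_align #(
  parameter integer ADDR_WIDTH = 32,
  parameter integer MEM_BYTES  = 8   // 8 = 64-bit array, 2 = 16-bit array
)(
  input  wire [ADDR_WIDTH-1:0] axi_byte_addr,
  output wire [ADDR_WIDTH-1:0] mem_aligned_addr
);
  localparam integer LSB = $clog2(MEM_BYTES);
  assign mem_aligned_addr = {axi_byte_addr[ADDR_WIDTH-1:LSB], {LSB{1'b0}}};
endmodule
For a 64-bit array this forces axi_byte_addr[2:0] to 0; for a 16-bit array only bit 0 is forced to 0, matching the behavior described above.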
Example AXI transfers (AXI data width = 32-bit):
• Aligned write (ADDR A): AWID = 0, AWADDR = 'h0, AWSIZE = 2, AWLEN = 3, AWBURST = INCR
• Unaligned write (ADDR B): AWID = 1, AWADDR = 'h3, AWSIZE = 2, AWLEN = 3, AWBURST = INCR
• Aligned read (ADDR A): ARID = 0, ARADDR = 'h0, ARSIZE = 2, ARLEN = 3, ARBURST = INCR
• Unaligned read (ADDR B): ARID = 1, ARADDR = 'h3, ARSIZE = 2, ARLEN = 3, ARBURST = INCR
• Aligned write (ADDR A): AWID = 0, AWADDR = 'h0, AWSIZE = 1, AWLEN = 3, AWBURST = INCR
• Unaligned write (ADDR B): AWID = 1, AWADDR = 'h3, AWSIZE = 1, AWLEN = 3, AWBURST = INCR
• Aligned read (ADDR A): ARID = 0, ARADDR = 'h0, ARSIZE = 1, ARLEN = 3, ARBURST = INCR
• Unaligned read (ADDR B): ARID = 1, ARADDR = 'h3, ARSIZE = 1, ARLEN = 3, ARBURST = INCR
Equal priority is given to the read and write address channels in this mode. The grant to the
read and write address channels alternates every clock cycle. The read or write requests from
the AXI master have no bearing on the grants. For example, read requests are served only in
alternate clock cycles, even when there are no write requests. The slots are fixed, and requests
are served only in their respective slots.
Round-Robin
Equal priority is given to the read and write address channels in this mode. The grant to the
read and write channels depends on the last request served. For example, if the last operation
performed was a write, precedence is given to serving a read over a write. Similarly, if the last
operation performed was a read, precedence is given to serving a write over a read.
Read and write address channels are served with equal priority in this mode. The requests
from the write address channel are processed when one of the following occurs:
The requests from the read address channel are processed in a similar manner.
The read address channel is always given priority in this mode. The requests from the write
address channel are processed when there are no pending requests from the read address
channel or the starve limit for read is reached.
The write address channel is always given priority in this mode. The requests from the read
address channel are processed when there are no pending requests from the write address
channel. Arbitration outputs are registered in WRITE_PRIORITY_REG mode.
The AXI4-Lite Control/Status register interface block is implemented in parallel to the AXI4
memory-mapped interface. The block monitors the output of the native interface to
capture correctable (single bit) and uncorrectable (multiple bit) errors. When a correctable
and/or uncorrectable error occurs, the interface also captures the byte address of the failure
along with the failing data bits and ECC bits. Fault injection is provided by an XOR block
placed in the write datapath after the ECC encoding has occurred.
Only the first memory beat in a transaction can have errors inserted. For example, in a
memory configuration with a data width of 72 and a mode register set to burst length 8,
only the first 72 bits are corruptible through the fault injection interface. Interrupt
generation based on either a correctable or uncorrectable error can be independently
configured with the register interface. An SLVERR response is returned on the read response bus
(rresp) for uncorrectable errors (if ECC is enabled).
ECC Enable/Disable
Two vectored signals from the Memory Controller indicate an ECC error: ecc_single and
ecc_multiple. The ecc_single signal indicates if there has been a correctable error and
the ecc_multiple signal indicates if there has been an uncorrectable error. The widths of
ecc_multiple and ecc_single are based on the C_NCK_PER_CLK parameter.
There can be between 0 and C_NCK_PER_CLK × 2 errors per cycle with each data beat
signaled by one of the vector bits. Multiple bits of the vector can be signaled per cycle
indicating that multiple correctable errors or multiple uncorrectable errors have been
detected. The ecc_err_addr signal (discussed in Fault Collection) is valid during the
assertion of either ecc_single or ecc_multiple.
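As a hedged illustration of how user logic might observe these vectors (this is not the register interface RTL; the module, parameter, and port names are assumptions), the following module latches ecc_err_addr the first time any bit of ecc_single or ecc_multiple asserts:
// Illustrative monitor: capture the first failing address reported on
// ecc_err_addr when any correctable or uncorrectable error bit asserts.
module ecc_first_err_monitor #(
  parameter integer NCK_PER_CLK = 4,   // matches C_NCK_PER_CLK
  parameter integer ADDR_WIDTH  = 56
)(
  input  wire                     clk,
  input  wire                     rst,
  input  wire [2*NCK_PER_CLK-1:0] ecc_single,
  input  wire [2*NCK_PER_CLK-1:0] ecc_multiple,
  input  wire [ADDR_WIDTH-1:0]    ecc_err_addr,
  output reg                      err_seen,
  output reg  [ADDR_WIDTH-1:0]    first_err_addr
);
  always @(posedge clk) begin
    if (rst) begin
      err_seen       <= 1'b0;
      first_err_addr <= {ADDR_WIDTH{1'b0}};
    end else if (!err_seen && (|ecc_single || |ecc_multiple)) begin
      err_seen       <= 1'b1;
      first_err_addr <= ecc_err_addr;  // valid while the error vectors assert
    end
  end
endmodule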
The ECC_STATUS register sets the CE_STATUS bit and/or UE_STATUS bit for correctable error
detection and uncorrectable error detection, respectively.
CAUTION! A multiple-bit error is a serious memory failure because it is uncorrectable. In such cases,
the application cannot rely on the contents of the memory, and it is recommended not to perform any
further transactions to the memory.
Interrupt Generation
When interrupts are enabled with the CE_EN_IRQ and/or UE_EN_IRQ bits of the ECC_EN_IRQ
register, and a correctable error or uncorrectable error occurs, the interrupt signal is
asserted.
Fault Collection
To aid the analysis of ECC errors, there are two banks of storage registers that collect
information on the failing ECC decode. One bank of registers is for correctable errors, and
another bank is for uncorrectable errors. The failing address, undecoded data, and ECC bits
are saved into these register banks as CE_FFA, CE_FFD, and CE_FFE for correctable errors.
UE_FFA, UE_FFD, and UE_FFE are for uncorrectable errors. The data in combination with the
ECC bits can help determine which bit(s) have failed. CE_FFA stores the address from the
ecc_err_addr signal and converts it to a byte address. Upon error detection, the data is
latched into the appropriate register. Only the first data beat with an error is stored.
When a correctable error occurs, a counter also counts the number of
correctable errors that have occurred. The counter can be read from the CE_CNT register
and is fixed as an 8-bit counter. It does not roll over when the maximum value is
reached; it stops incrementing at the maximum count.
Fault Injection
The ECC Fault Injection registers, FI_D and FI_ECC, facilitate testing of the software drivers.
When set, the ECC Fault Injection register XORs with the DDR3/DDR4 SDRAM datapath to
simulate errors in the memory. It is ideal for injection to occur here because this is after the
encoding has been completed. Error insertion is supported only on the first data beat;
therefore, there are two to four FI_D registers to accommodate this. During operation, after
the error has been inserted into the datapath, the register clears itself.
Table 4-41 lists the AXI4 slave interface specific signals. Clock/reset to the interface is
provided from the Memory Controller.
The ECC register map is shown in Table 4-42. The register map is Little Endian. Write accesses to
read-only or reserved values are ignored. Read accesses to write-only or reserved values
return the value 0xDEADDEAD.
Notes:
1. Data bits 64–127 are only enabled if the DQ width is 144 bits.
2. FI_D* and FI_ECC* are only enabled if ECC_TEST parameter has been set to 1.
IMPORTANT: The ECC_TEST parameter must be manually set to ON in the IP output products when
they are generated globally for the AXI4-Lite Slave Control/Status Register Interface block to be
generated.
ECC_STATUS
This register holds information on the occurrence of correctable and uncorrectable errors.
The status bits are independently set to 1 for the first occurrence of each error type. The
status bits are cleared by writing a 1 to the corresponding bit position; that is, the status bits
can only be cleared to 0 and not set to 1 using a register write. The ECC Status register
operates independently of the ECC Enable Interrupt register.
ECC_EN_IRQ
This register determines if the values of the CE_STATUS and UE_STATUS bits in the ECC
Status register assert the Interrupt output signal (ECC_Interrupt). If both CE_EN_IRQ and
UE_EN_IRQ are set to 1 (enabled), the value of the Interrupt signal is the logical OR between
the CE_STATUS and UE_STATUS bits.
ECC_ON_OFF
The ECC On/Off Control register allows the application to enable or disable ECC checking.
The design parameter, C_ECC_ONOFF_RESET_VALUE (default on) determines the reset value
for the enable/disable setting of ECC. This facilitates start-up operations when ECC might or
might not be initialized in the external memory. When disabled, ECC checking is disabled
for read but ECC generation is active for write operations.
CE_CNT
This register counts the number of occurrences of correctable errors. It can be cleared or
preset to any value using a register write. When the counter reaches its maximum value, it
does not wrap around, but instead it stops incrementing and remains at the maximum
value. The width of the counter is defined by the C_CE_COUNTER_WIDTH
parameter, which is fixed at eight bits.
CE_FFA[31:0]
This register stores the lower 32 bits of the decoded DRAM address (Bits[31:0]) of the first
occurrence of an access with a correctable error. The address format is defined in Table 3-1,
page 32. When the CE_STATUS bit in the ECC Status register is cleared, this register is
re-enabled to store the address of the next correctable error. Storing of the failing address
is enabled after reset.
CE_FFA[63:32]
This register stores the upper 32 bits of the decoded DRAM address (Bits[55:32]) of the first
occurrence of an access with a correctable error. The address format is defined in Table 3-1,
page 32. In addition, the upper byte of this register stores the ecc_single signal. When
the CE_STATUS bit in the ECC Status register is cleared, this register is re-enabled to store
the address of the next correctable error. Storing of the failing address is enabled after reset.
CE_FFD[31:0]
This register stores the (corrected) failing data (Bits[31:0]) of the first occurrence of an
access with a correctable error. When the CE_STATUS bit in the ECC Status register is cleared,
this register is re-enabled to store the data of the next correctable error. Storing of the
failing data is enabled after reset.
CE_FFD[63:32]
This register stores the (corrected) failing data (Bits[63:32]) of the first occurrence of an
access with a correctable error. When the CE_STATUS bit in the ECC Status register is cleared,
this register is re-enabled to store the data of the next correctable error. Storing of the
failing data is enabled after reset.
CE_FFD[95:64]
Note: This register is only used when DQ_WIDTH == 144.
This register stores the (corrected) failing data (Bits[95:64]) of the first occurrence of an
access with a correctable error. When the CE_STATUS bit in the ECC Status register is cleared,
this register is re-enabled to store the data of the next correctable error. Storing of the
failing data is enabled after reset.
CE_FFD[127:96]
Note: This register is only used when DQ_WIDTH == 144.
This register stores the (corrected) failing data (Bits[127:96]) of the first occurrence of an
access with a correctable error. When the CE_STATUS bit in the ECC Status register is cleared,
this register is re-enabled to store the data of the next correctable error. Storing of the
failing data is enabled after reset.
CE_FFE
This register stores the ECC bits of the first occurrence of an access with a correctable error.
When the CE_STATUS bit in the ECC Status register is cleared, this register is re-enabled to
store the ECC of the next correctable error. Storing of the failing ECC is enabled after reset.
Table 4-53 describes the register bit usage when DQ_WIDTH = 72.
Table 4-53: Correctable Error First Failing ECC Register for 72-Bit External Memory Width
Bits | Name | Core Access | Reset Value | Description
7:0 | CE_FFE | R | 0 | ECC (Bits[7:0]) of the first occurrence of a correctable error.
Table 4-54 describes the register bit usage when DQ_WIDTH = 144.
Table 4-54: Correctable Error First Failing ECC Register for 144-Bit External Memory Width
Bits | Name | Core Access | Reset Value | Description
15:0 | CE_FFE | R | 0 | ECC (Bits[15:0]) of the first occurrence of a correctable error.
UE_FFA[31:0]
This register stores the decoded DRAM address (Bits[31:0]) of the first occurrence of an
access with an uncorrectable error. The address format is defined in Table 3-1, page 32.
When the UE_STATUS bit in the ECC Status register is cleared, this register is re-enabled to
store the address of the next uncorrectable error. Storing of the failing address is enabled
after reset.
UE_FFA[63:32]
This register stores the decoded address (Bits[55:32]) of the first occurrence of an access
with an uncorrectable error. The address format is defined in Table 3-1, page 32. In addition,
the upper byte of this register stores the ecc_multiple signal. When the UE_STATUS bit
in the ECC Status register is cleared, this register is re-enabled to store the address of the
next uncorrectable error. Storing of the failing address is enabled after reset.
UE_FFD[31:0]
This register stores the (uncorrected) failing data (Bits[31:0]) of the first occurrence of an
access with an uncorrectable error. When the UE_STATUS bit in the ECC Status register is
cleared, this register is re-enabled to store the data of the next uncorrectable error. Storing
of the failing data is enabled after reset.
UE_FFD[63:32]
This register stores the (uncorrected) failing data (Bits[63:32]) of the first occurrence of an
access with an uncorrectable error. When the UE_STATUS bit in the ECC Status register is
cleared, this register is re-enabled to store the data of the next uncorrectable error. Storing
of the failing data is enabled after reset.
UE_FFD[95:64]
Note: This register is only used when the DQ_WIDTH == 144.
This register stores the (uncorrected) failing data (Bits[95:64]) of the first occurrence of an
access with an uncorrectable error. When the UE_STATUS bit in the ECC Status register is
cleared, this register is re-enabled to store the data of the next uncorrectable error. Storing
of the failing data is enabled after reset.
UE_FFD[127:96]
Note: This register is only used when the DQ_WIDTH == 144.
This register stores the (uncorrected) failing data (Bits[127:96]) of the first occurrence of an
access with an uncorrectable error. When the UE_STATUS bit in the ECC Status register is
cleared, this register is re-enabled to store the data of the next uncorrectable error. Storing
of the failing data is enabled after reset.
UE_FFE
This register stores the ECC bits of the first occurrence of an access with an uncorrectable
error. When the UE_STATUS bit in the ECC Status register is cleared, this register is
re-enabled to store the ECC of the next uncorrectable error. Storing of the failing ECC is
enabled after reset.
Table 4-61 describes the register bit usage when DQ_WIDTH = 72.
Table 4-61: Uncorrectable Error First Failing ECC Register for 72-Bit External Memory Width
Bits | Name | Core Access | Reset Value | Description
7:0 | UE_FFE | R | 0 | ECC (Bits[7:0]) of the first occurrence of an uncorrectable error.
Table 4-62 describes the register bit usage when DQ_WIDTH = 144.
Table 4-62: Uncorrectable Error First Failing ECC Register for 144-Bit External Memory Width
Bits | Name | Core Access | Reset Value | Description
15:0 | UE_FFE | R | 0 | ECC (Bits[15:0]) of the first occurrence of an uncorrectable error.
FI_D0
This register is used to inject errors in data (Bits[31:0]) written to memory and can be used
to test the error correction and error signaling. The bits set in the register toggle the
corresponding data bits (word 0 or Bits[31:0]) of the subsequent data written to the
memory without affecting the ECC bits written. After the fault has been injected, the Fault
Injection Data register is cleared automatically.
Injecting faults should be performed in a critical region in software; that is, writing this
register and the subsequent write to the memory must not be interrupted.
Special consideration must be given across FI_D0, FI_D1, FI_D2, and FI_D3 such that only a
single error condition is introduced.
FI_D1
This register is used to inject errors in data (Bits[63:32]) written to memory and can be used
to test the error correction and error signaling. The bits set in the register toggle the
corresponding data bits (word 1 or Bits[63:32]) of the subsequent data written to the
memory without affecting the ECC bits written. After the fault has been injected, the Fault
Injection Data register is cleared automatically.
Injecting faults should be performed in a critical region in software; that is, writing this
register and the subsequent write to the memory must not be interrupted.
FI_D2
Note: This register is only used when DQ_WIDTH = 144.
This register is used to inject errors in data (Bits[95:64]) written to memory and can be used
to test the error correction and error signaling. The bits set in the register toggle the
corresponding data bits (word 2 or Bits[95:64]) of the subsequent data written to the
memory without affecting the ECC bits written. After the fault has been injected, the Fault
Injection Data register is cleared automatically.
Injecting faults should be performed in a critical region in software; that is, writing this
register and the subsequent write to the memory must not be interrupted.
Special consideration must be given across FI_D0, FI_D1, FI_D2, and FI_D3 such that only a
single error condition is introduced.
FI_D3
Note: This register is only used when DQ_WIDTH = 144.
This register is used to inject errors in data (Bits[127:96]) written to memory and can be
used to test the error correction and error signaling. The bits set in the register toggle the
corresponding data bits (word 3 or Bits[127:96]) of the subsequent data written to the
memory without affecting the ECC bits written. After the fault has been injected, the Fault
Injection Data register is cleared automatically.
Injecting faults should be performed in a critical region in software; that is, writing this
register and the subsequent write to the memory must not be interrupted.
FI_ECC
This register is used to inject errors in the generated ECC written to the memory and can be
used to test the error correction and error signaling. The bits set in the register toggle the
corresponding ECC bits of the next data written to memory. After the fault has been
injected, the Fault Injection ECC register is cleared automatically.
Injecting faults should be performed in a critical region in software; that is, writing this
register and the subsequent write to memory must not be interrupted.
Table 4-67 describes the register bit usage when DQ_WIDTH = 72.
Table 4-67: Fault Injection ECC Register for 72-Bit External Memory Width
Bits | Name | Core Access | Reset Value | Description
7:0 | FI_ECC | W | 0 | Bit positions set to 1 toggle the corresponding bit of the next ECC written to the memory. The register is automatically cleared after the fault has been injected.
Table 4-68 describes the register bit usage when DQ_WIDTH = 144.
Table 4-68: Fault Injection ECC Register for 144-Bit External Memory Width
Bits | Name | Core Access | Reset Value | Description
15:0 | FI_ECC | W | 0 | Bit positions set to 1 toggle the corresponding bit of the next ECC written to the memory. The register is automatically cleared after the fault has been injected.
Unlike the user and AXI interfaces, which translate transactions into one or more DRAM
commands that meet DRAM protocol and timing requirements, the PHY does not take in
“memory transactions.” The PHY interface does no DRAM protocol or timing checking. When
using a PHY Only option, you are responsible for meeting all DRAM protocol requirements
and timing specifications of all DRAM components in the system.
The PHY runs at the system clock frequency, which is 1/4 of the DRAM clock frequency. The PHY
therefore accepts four DRAM commands per system clock and issues them serially on
consecutive DRAM clock cycles on the DRAM bus. In other words, the PHY interface has four
command slots: slots 0, 1, 2, and 3, which are accepted each system clock. The command in slot
position 0 is issued on the DRAM bus first, and the command in slot 3 is issued last. The
PHY does have limitations as to which slots can accept read and write CAS commands. For
more information, see CAS Command Timing Limitations, page 176. Except for CAS
commands, each slot can accept arbitrary DRAM commands.
The PHY FPGA logic interface has an input port for each pin on a DDR3 or DDR4 bus. Each
PHY command/address input port has a width that is eight times wider than its
corresponding DRAM bus pin. For example, a DDR4 bus has one act_n pin, and the PHY
has an 8-bit mc_ACT_n input port. Each pair of bits in the mc_ACT_n port corresponds to
a "command slot." The two LSBs are slot0 and the two MSBs are slot3. The PHY address
input port for a DDR4 design with 18 address pins is 144 bits wide, with each byte
corresponding to the four command slots for one DDR4 address pin. There are two bits for
each command slot in each input port of the PHY.
This is due to the underlying design of the PHY and its support for double data rate data
buses. Because the DRAM command/address bus is single data rate, however, you must always
drive the two bits that correspond to a command slot to the same value. See the following
interface tables for additional descriptions, and the examples in the timing diagrams, which show
how bytes and bits correspond to DRAM pins and command slots.
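A minimal sketch of this packing is shown below (the module and port names are illustrative, not part of the PHY interface): each single-data-rate slot value is replicated into its bit pair, with slot0 occupying the two LSBs.
// Illustrative packing of four command-slot values into one 8-bit-per-pin
// PHY command/address byte. Both bits of each pair carry the same value
// because the DRAM command/address bus is single data rate.
module phy_ca_slot_pack (
  input  wire [3:0] slot_val,  // slot_val[0] = slot0 ... slot_val[3] = slot3
  output wire [7:0] phy_byte   // value to drive on one PHY command/address byte
);
  genvar s;
  generate
    for (s = 0; s < 4; s = s + 1) begin : g_pack
      assign phy_byte[2*s +: 2] = {2{slot_val[s]}};
    end
  endgenerate
endmodule
For example, an active-Low command in slot1 with NOP/DESELECT on the other slots corresponds to slot_val = 4'b1101 for that pin, which packs to 8'hF3.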
The PHY interface has read and write data ports with eight bits for each DRAM DQ pin. Each
port bit represents one data bit on the DDR DRAM bus for a BL8 burst. Therefore one BL8
data burst for the entire DQ bus is transferred across the PHY interface on each system
clock. The PHY only supports BL8 data transfers. The data format is the same as the user
interface data format. For more information, see PHY, page 34.
The PHY interface also has several control signals that you must drive and/or respond to
when a read or write CAS command is issued. The control signals are used by the PHY to
manage the transfer of read and write data between the PHY interface and the DRAM bus.
See the following signal tables and timing diagrams.
Your custom Memory Controller must wait until the PHY output calDone is asserted before
sending any DRAM commands to the PHY. The PHY initializes and trains the DRAM before
asserting calDone. For more information on the PHY internal structures and training
algorithms, see the PHY, page 34. After calDone is asserted, the PHY is ready to accept any
DRAM commands.
The only required DRAM or PHY commands are related to VT tracking and DRAM refresh/
ZQ. These requirements are detailed in VT Tracking, page 178 and Refresh and ZQ,
page 181.
Clocking, Reset, and Debug signals are described in other sections or documents; see
the corresponding references. In this section, a description is given for each signal in the
remaining four groups, and timing diagrams show examples of the signals in use.
For more information on the clocking and reset, see the Clocking, page 81 section.
Table 4-69 shows the command and address signals for a PHY only option.
Figure 4-14 shows the functional relationship between the PHY command/address input
signals and a DDR4 command/address bus. The diagram shows an Activate command on
system clock cycle N in the slot1 position. The mc_ACT_n[3:2] and mc_CS_n[3:2] are
both asserted Low in cycle N, and all the other bits in cycle N are asserted High, generating
an Activate in the slot1 position roughly two system clocks later and NOP/DESELECT
commands on the other command slots.
On cycle N + 3, mc_CS_n and the mc_ADR bits corresponding to CAS/A15 are set to 0xFC.
This asserts mc_ADR[121:120] and mc_CS_n[1:0] Low, and all other bits in cycle N + 3
High, generating a read command on slot0 and NOP/DESELECT commands on the other
command slots two system clocks later. With the Activate and read command separated by
three system clock cycles and taking into account the command slot position of both
commands within their system clock cycle, expect the separation on the DDR4 bus to be 11
DRAM clocks, as shown in the DDR bus portion of Figure 4-14.
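Using that packing, the cycle N and cycle N + 3 drive values of this example could be written out as follows (a hedged sketch with illustrative names; only mc_ACT_n, mc_CS_n, and the CAS/A15 byte of mc_ADR are shown):
// Hedged sketch of the Figure 4-14 drive values for the relevant
// command/address bytes (active-Low pins: a 0-pair selects the slot,
// a 1-pair is NOP/DESELECT on that slot).
module fig4_14_drive_example (
  input  wire       clk,
  input  wire [2:0] cycle,             // 0 = cycle N ... 3 = cycle N + 3
  output reg  [7:0] mc_ACT_n_byte,
  output reg  [7:0] mc_CS_n_byte,
  output reg  [7:0] mc_ADR_cas_byte    // the CAS/A15 byte of mc_ADR
);
  always @(posedge clk) begin
    mc_ACT_n_byte   <= 8'hFF;          // default: all four slots deselected
    mc_CS_n_byte    <= 8'hFF;
    mc_ADR_cas_byte <= 8'hFF;
    case (cycle)
      3'd0: begin                      // cycle N: Activate in slot1
        mc_ACT_n_byte <= 8'hF3;        // bits [3:2] Low
        mc_CS_n_byte  <= 8'hF3;
      end
      3'd3: begin                      // cycle N + 3: Read CAS in slot0
        mc_ADR_cas_byte <= 8'hFC;      // bits [1:0] Low on CAS/A15
        mc_CS_n_byte    <= 8'hFC;
      end
    endcase
  end
endmodule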
Note: Figure 4-14 shows the relative position of commands on the DDR bus based on the PHY input
signals. Although the diagram shows some latency in going through the PHY to be somewhat
realistic, this diagram does not represent the absolute command latency through the PHY to the DDR
bus, or the system clock to DRAM clock phase alignment. The intention of this diagram is to show the
concept of command slots at the PHY interface.
Figure 4-14: PHY Command/Address Input Signal with DDR4 Command/Address Bus (waveform: Activate command in slot1 on cycle N and Read command in slot0 on cycle N + 3, separated by tRCD = 11 tCK on the DDR4 bus; signals shown: System Clock, DRAM Clock, DDR4_ACT_n, DDR4_RAS/A16, DDR4_CAS/A15, DDR4_WE/A14, DDR4_CS_n)
Figure 4-15 shows an example of using all four command slots in a single system clock. This
example shows three commands to rank0 and one to rank1 in cycle N. BG and BA address
pins are included in the diagram to spread the commands over different banks so that DRAM
protocol is not violated. Table 4-70 lists the command in each command slot.
Figure 4-15: (waveform: all four command slots used in cycle N; DDR4_BG[1:0] = 0, 1, 2, 0 and DDR4_BA[1:0] = 0, 3, 1, 0 across slots 0 to 3; signals shown: System Clock, DRAM Clock, DDR4_ACT_n, DDR4_RAS/A16, DDR4_CAS/A15, DDR4_WE/A14, DDR4_CS_n[1], DDR4_CS_n[0])
To understand how DRAM commands to different command slots are packed together, the
following detailed example shows how to convert DRAM commands at the PHY interface to
commands on the DRAM command/address bus. To convert PHY interface commands to
DRAM commands, write out the PHY signal for one system clock in binary and reverse the
bit order of each byte. You can also drop every other bit after the reversal because the bit
pairs are required to have the same value. In the subsequent example, the mc_BA[15:0]
signal has a cycle N value of 0x0C3C:
Hex: 0x0C3C
Binary: 16'b0000_1100_0011_1100
Reverse bits in each byte: 16'b0011_0000_0011_1100
Take the upper eight bits for DRAM BA[1] and the lower eight bits for DRAM BA[0]; the
expected pattern on the DRAM bus is:
BA[1]: bit pairs 00 11 00 00, per-slot values 0 1 0 0 (Low, High, Low, Low)
BA[0]: bit pairs 00 11 11 00, per-slot values 0 1 1 0 (Low, High, High, Low)
This matches the DRAM BA[1:0] signal values of 0, 3, 1, and 0 shown in Figure 4-15.
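The reverse operation, recovering the per-slot values from a PHY byte, amounts to taking one bit of each pair, as in the following illustrative sketch (not PHY RTL; names are assumptions):
// Illustrative unpacking: either bit of each pair gives the per-slot value.
// For mc_BA[15:8] = 8'h0C this yields slot values 0, 1, 0, 0 (BA[1]) and for
// mc_BA[7:0] = 8'h3C it yields 0, 1, 1, 0 (BA[0]), matching the example above.
module phy_ca_slot_unpack (
  input  wire [7:0] phy_byte,
  output wire [3:0] slot_val   // slot_val[0] = slot0 ... slot_val[3] = slot3
);
  genvar s;
  generate
    for (s = 0; s < 4; s = s + 1) begin : g_unpack
      assign slot_val[s] = phy_byte[2*s];
    end
  endgenerate
endmodule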
Write Data
Table 4-71 shows the write data signals for a PHY only option.
Read Data
Table 4-72 shows the read data signals for a PHY only option.
PHY Control
Table 4-73 shows the PHY control signals for a PHY only option.
Figure 4-16 shows a write command example. On cycle N, write command “A” is asserted on
the PHY command/address inputs in the slot0 position. The mcWrCAS input is also asserted
on cycle N, and a valid rank value is asserted on the winRank signal. In Figure 4-16, there
is only one CS_n pin, so the only valid winRank value is 0x0. The mcCasSlot[1:0] and
mcCasSlot2 signals are valid on cycle N, and specify slot0.
Write command “B” is then asserted on cycle N + 1 in the slot2 position, with mcWrCAS,
winRank, mcCasSlot[1:0], and mcCasSlot2 asserted to valid values as well. On cycle
M, PHY asserts wrDataEn to indicate that wrData and wrDataMask values corresponding
to command A need to be driven on cycle M + 1.
Figure 4-16 shows the data and mask widths assuming an 8-bit DDR4 DQ bus width. The
delay between cycle N and cycle M is controlled by the PHY, based on the CWL and AL
settings of the DRAM. wrDataEn also asserts on cycle M + 1 to indicate that wrData and
wrDataMask values for command B are required on cycle M + 2. Although this example
shows that wrDataEn is asserted on two consecutive system clock cycles, you should not
assume this will always be the case, even if mcWrCAS is asserted on consecutive clock cycles
as is shown here. There is no data buffering in the PHY and data is pulled into the PHY just
in time. Depending on the CWL/AL settings and the command slot used, consecutive
mcWrCAS assertions might not result in consecutive wrDataEn assertions.
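A hedged sketch of this handshake is shown below (illustrative only; the data queue that supplies the next burst is not shown, and widths assume an 8-bit DQ bus): wrDataEn is sampled and the corresponding data and mask are presented one system clock later.
// Illustrative write-data stage: when the PHY asserts wrDataEn in cycle M,
// drive wrData/wrDataMask in cycle M + 1 with the next queued BL8 burst.
module wr_data_stage (
  input  wire        clk,
  input  wire        wrDataEn,
  input  wire [63:0] next_burst_data,  // one BL8 burst for an 8-bit DQ bus
  input  wire [7:0]  next_burst_mask,
  output reg  [63:0] wrData,
  output reg  [7:0]  wrDataMask
);
  always @(posedge clk) begin
    if (wrDataEn) begin
      wrData     <= next_burst_data;   // presented the cycle after wrDataEn
      wrDataMask <= next_burst_mask;
    end
  end
endmodule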
Figure 4-16: (waveform: write commands A and B in slot0 and slot2; signals shown: mcWrCAS, mcRdCAS, mcCasSlot2, wrDataEn, wrDataMask[7:0] with values DM A and DM B)
Figure 4-17 shows a read command example. Read commands are issued on cycles N and
N + 1 in slot positions 0 and 2, respectively. The mcRdCAS, winRank, mcCasSlot, and
mcCasSlot2 are asserted on these cycles as well. On cycles M + 1 and M + 2, the PHY asserts
rdDataEn and drives rdData.
Note: The separation between N and M + 1 is much larger than in the write example (Figure 4-16).
In the read case, the separation is determined by the full round trip latency of command output,
DRAM CL/AL, and data input through PHY.
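On the read side, rdData is valid in the same cycles that rdDataEn asserts, so it can simply be registered as it arrives (illustrative sketch; widths again assume an 8-bit DQ bus and the module name is an assumption):
// Illustrative read-data capture for a custom controller.
module rd_data_capture (
  input  wire        clk,
  input  wire        rdDataEn,
  input  wire [63:0] rdData,     // one BL8 burst for an 8-bit DQ bus
  output reg         rd_valid,
  output reg  [63:0] rd_burst
);
  always @(posedge clk) begin
    rd_valid <= rdDataEn;
    if (rdDataEn)
      rd_burst <= rdData;        // capture each burst as it is returned
  end
endmodule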
Figure 4-17: (waveform: read commands A and B in slot0 and slot2; signals shown: mcWrCAS, mcRdCAS, mcCasSlot2, rdDataEn)
Debug
EXTRA_CMD_DELAY Parameter
Depending on the number of ranks, ECC mode, and DRAM latency configuration, the PHY must
be programmed to add latency on the DRAM command/address bus. This provides enough
pipeline stages in the PHY programmable logic to close timing and to process mcWrCAS.
Added command latency is generally needed at very low CWL in single-rank configurations,
or in multi-rank configurations. Enabling ECC might also require adding command latency,
but this depends on whether your controller design (outside the PHY) depends on receiving
the wrDataEn signal a system clock cycle early to allow for generating ECC check bits.
The EXTRA_CMD_DELAY parameter is used to add one or two system clock cycles of delay
on the DRAM command/address path. The parameter does not delay the mcWrCAS or
mcRdCAS signals. This gives the PHY more time from the assertion of mcWrCAS or mcRdCAS
to generate XIPHY control signals. To the PHY, an EXTRA_CMD_DELAY setting of one or two
is the same as having a higher CWL or AL setting.
Table 4-75 shows the required EXTRA_CMD_DELAY setting for various configurations of
CWL, CL, and AL.
DM_DBI Parameter
The PHY supports the DDR4 DBI function on the read path and write path. Table 4-76 shows
how read and write DBI can be enabled separately or in combination.
When write DBI is enabled, Data Mask is disabled. The DM_DBI parameter only configures
the PHY and the MRS parameters must also be set to configure the DRAM for DM/DBI.
The allowed values for the DM_DBI option in the GUI are as follows for x8 and x16 parts (“X”
indicates supported and “–” indicates not supported):
Table 4-77: DM_DBI Options
Option Value | Native, ECC Disable | Native, ECC Enable | AXI, ECC Disable | AXI, ECC Enable
DM_NO_DBI (1) | X | – | X | –
DM_DBI_RD | X | – | X | –
NO_DM_DBI_RD | X | X | – | X
NO_DM_DBI_WR | X | X | – | X
NO_DM_DBI_WR_RD | X | X | – | X
NO_DM_NO_DBI (2) | – | X | – | X
Notes:
1. Default option for ECC disabled interfaces.
2. Default option for ECC enabled interfaces.
IMPORTANT: DBI should be enabled when the traffic pattern consists of repeated single Burst Length = 8
(BL8) read accesses with all "0" data on the DQ bus, with idle (NOP/DESELECT) inserted between each
BL8 read burst, as shown in Figure 1-2. Enabling the DBI feature effectively mitigates the resulting
excessive power supply noise.
If DBI is not an option, then encoding the data in the application to remove all “0” bursts before it reaches
the memory controller is an equally effective method for mitigating power supply noise. For x4-based
RDIMM/LRDIMM interfaces which lack the DM/DBI pin, the power supply noise is mitigated by the ODT
settings used for these topologies. For x4-based component interfaces wider than 16 bits, the data
encoding method is recommended.
DBI can be enabled to reduce power consumption in the interface by reducing the total
number of DQ signals driven Low, thereby reducing noise on the VCCO supply. For further
information on where this might be useful for improved signal integrity, see Answer
Record 70006.
The PHY logic that processes CAS commands is complex because it generates XIPHY control
signals based on the DRAM CWL and CL values with DRAM clock resolution, not just system
clock resolution.
Supporting two different command slots for CAS commands adds a significant amount of
logic on the XIPHY control paths. There are very few pipeline stages available to break up
the logic due to protocol requirements of the XIPHY. CAS command support on all four slots
would further increase the complexity and degrade timing.
Following the memory system layout guidelines ensures that a spacing of eight DRAM
clocks is sufficient for correct operation. Write to Write timing to the same rank is limited
only by the DRAM specification and the command slot limitations for CAS commands
discussed earlier.
Consider Read to Write command spacing: the JEDEC® DRAM specification [Ref 1] gives
the component requirement as RL + BL/2 + 2 – WL. This formula only spaces the Read DQS
post-amble and Write DQS preamble by one DRAM clock on an ideal bus with no timing
skews. Any DQS flight time, write leveling uncertainty, jitter, etc. reduces this margin. When
these timing errors add up to more than one DRAM clock, there is a drive fight at the FPGA
DQS pins which likely corrupts the Read transaction. A DDR3/DDR4 SDRAM generated
controller uses the following formula to delay Write CAS after a Read CAS to allow for a
worst case timing budget for a system following the layout guidelines: RL + BL/2 + 4 – WL.
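As a worked example with assumed settings (CL = RL = 15, CWL = WL = 11, AL = 0, BL = 8), the JEDEC component minimum is 15 + 4 + 2 – 11 = 10 DRAM clocks from Read CAS to Write CAS, while the generated controller's budget is 15 + 4 + 4 – 11 = 12 DRAM clocks, adding two clocks of margin for DQS flight time, write leveling uncertainty, and jitter.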
Read CAS to Read CAS commands to different ranks must also be spaced by your custom
controller to avoid drive fights, particularly when reading first from a "far" rank and then
from a "near" rank. A DDR3/DDR4 SDRAM generated controller spaces the Read CAS
commands to different ranks by at least six DRAM clock cycles.
Write CAS to Read CAS to the same rank is defined by the JEDEC DRAM specification
[Ref 1]. Your controller must follow this DRAM requirement, and it ensures that there is no
possibility of drive fights for Write to Read to the same rank. Write CAS to Read CAS spacing
to different ranks, however, must also be limited by your controller. This spacing is not
defined by the JEDEC DRAM specification [Ref 1] directly.
Write to Read to different ranks can be spaced much closer together than Write to Read to
the same rank, but factors to consider include write leveling uncertainty, jitter, and tDQSCK.
A DDR3/DDR4 SDRAM generated controller spaces Write CAS to Read CAS to different
ranks by at least six DRAM clocks.
Additive Latency
The PHY supports DRAM additive latency. The only effect on the PHY interface due to
enabling Additive Latency in the MRS parameters is in the timing of the wrDataEn signal
after mcWrCAS assertion. The PHY takes the AL setting into account when scheduling
wrDataEn. The rdDataEn signal also asserts much later after mcRdCAS because the
DRAM returns data much later. The AL setting also has an impact on whether or not the
EXTRA_CMD_DELAY parameter needs to be set to a non-zero value.
VT Tracking
The PHY requires read commands to be issued at a minimum rate to keep the read DQS
gate signal aligned to the read DQS preamble after calDone is asserted. In addition, the
gt_data_ready signal needs to be pulsed at regular intervals to instruct the PHY to
update its read DQS training values in the RIU. Finally, the PHY requires periodic gaps in
read traffic to allow the XIPHY to update its gate alignment circuits with the values the PHY
programs into the RIU. Specifically, the PHY requires the following after calDone asserts:
1. At least one read command every 1 µs. For a multi-rank system, any rank is acceptable
within the same channel. For a Ping Pong PHY, there are multiple channels; in that case,
it is necessary to issue a read command on each channel.
2. The gt_data_ready signal is asserted for one system clock cycle after the rdDataEn or
per_rd_done signal asserts, at least once within each 1 µs interval.
For a Ping Pong PHY, there are multiple channels. In that case, it is necessary to assert
the gt_data_ready signal for multiple channels at the same time, as shown in the following
figure.
Figure 4-18: (waveform: gt_data_ready asserted for Ch0 and Ch1 at the same time)
When the read is tagged as a special type of read, it is possible to assert the
gt_data_ready signal after the per_rd_done signal on each channel, as shown in the
following figure.
Figure 4-19: (waveform: gt_data_ready asserted per channel after per_rd_done on Ch0 and Ch1)
3. A period of three contiguous system clock cycles with no read CAS commands asserted
at the PHY interface must occur every 1 µs.
The PHY cannot interrupt traffic to meet these requirements. It is therefore your custom
Memory Controller's responsibility to issue DRAM commands and assert the
gt_data_ready input signal in a way that meets the above requirements.
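One possible structure for tracking these requirements is a simple 1 µs watchdog, sketched below (illustrative only; the interval count assumes a 250 MHz system clock, and the actions themselves, issuing an injected read and opening the three-cycle read gap, are left to the controller's scheduler):
// Illustrative 1 us VT-tracking watchdog: pulse gt_data_ready once per
// interval after a read has completed, and flag when the controller still
// needs to schedule a read for the current interval.
module vt_track_watchdog #(
  parameter integer CYCLES_PER_US = 250   // assumption: 250 MHz system clock
)(
  input  wire clk,
  input  wire rst,
  input  wire calDone,
  input  wire rdDataEn,        // or per_rd_done for tagged reads
  output reg  gt_data_ready,
  output reg  need_read        // no read completed yet in this interval
);
  reg [15:0] timer;
  reg        read_seen;

  always @(posedge clk) begin
    if (rst || !calDone) begin
      timer         <= 16'd0;
      read_seen     <= 1'b0;
      gt_data_ready <= 1'b0;
      need_read     <= 1'b0;
    end else begin
      gt_data_ready <= 1'b0;
      if (rdDataEn) read_seen <= 1'b1;
      if (timer == CYCLES_PER_US - 1) begin
        timer         <= 16'd0;
        gt_data_ready <= read_seen;   // pulse once per interval after a read
        need_read     <= !read_seen;  // scheduler must inject a read
        read_seen     <= 1'b0;
      end else begin
        timer <= timer + 16'd1;
      end
    end
  end
endmodule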
Figure 4-20 shows two examples where the custom controller must interrupt normal traffic
to meet the VT tracking requirements. The first example is a high read bandwidth workload
with mcRdCAS asserted continuously for almost 1 µs. The controller must stop issuing read
commands for three contiguous system clocks once each 1 µs period, and assert
gt_data_ready once per period.
The second example is a high write bandwidth workload with mcWrCAS asserted
continuously for almost 1 µs. The controller must stop issuing writes, issue at least one read
command, and then assert gt_data_ready once per 1 µs period.
IMPORTANT: The controller must not violate DRAM protocol or timing requirements during this
process.
Figure 4-20: (waveforms: 1 µs intervals marked on the continuous read and continuous write workload examples)
A workload that has a mix of read and write traffic in every 1 µs interval might naturally
meet the first and third VT tracking requirements listed above. In this case, the only extra
step required is to assert the gt_data_ready signal every 1 µs and regular traffic would
not be interrupted at all. The custom controller, however, is responsible for ensuring all
three requirements are met for all workloads. DDR3/DDR4 SDRAM generated controllers
monitor the mcRdCAS and mcWrCAS signals and decide each 1 µs period what actions, if
any, need to be taken to meet the VT tracking requirements. Your custom controller can
implement any scheme that meets the requirements described here.
Refresh and ZQ
After calDone is asserted by the PHY, periodic DRAM refresh and ZQ calibration are the
responsibility of your custom Memory Controller. Your controller must issue refresh and ZQ
commands and meet DRAM refresh and ZQ interval requirements, while also meeting all other
DRAM protocol and timing requirements. For example, if a refresh is due and you have open
pages in the DRAM, you must precharge the pages, wait tRP, and then issue a refresh
command, etc. The PHY does not perform the precharge or any other part of this process
for you.
This section describes the Ping Pong PHY in the UltraScale architecture. It includes the Ping
Pong PHY overview, configuration supported, and interface.
RECOMMENDED: The Ping Pong PHY is based on the PHY only design. Read the PHY Only Interface
section before starting this section.
In the Ping Pong PHY, two memory channels are supported. The two channels share most of
the control/address signals; only CS_n, CKE, and ODT are duplicated for Channel1. Each
channel has its own data (DQ/DQS/DM) signals. The advantage of using the Ping Pong PHY is
that sharing the control/address signals saves pins.
Figure 4-21 shows a Ping Pong PHY design with a total channel width of DQ_WIDTH. The
total width, DQ_WIDTH, is split evenly into two channels, each with a
width of DQ_WIDTH/2. The solid arrows indicate shared control/address signals. The dashed
arrows indicate CS_n[1:0], CKE[1:0], and ODT[1:0] connected to the two separate
channels. The dotted arrows indicate the DQ/DQS/DM signals connected to the two separate
channels.
Figure 4-21: (block diagram: Channel1 connects DQ[DQ_WIDTH–1:DQ_WIDTH/2], DQS[DQS_WIDTH–1:DQS_WIDTH/2], and DM_DBI_N[DM_WIDTH–1:DM_WIDTH/2]; Channel0 connects DQ[DQ_WIDTH/2–1:0], DQS[DQS_WIDTH/2–1:0], and DM_DBI_N[DM_WIDTH/2–1:0]; each channel is made up of DDR4 devices)
Supported Configuration
The following rules outline the configuration supported by the Ping Pong PHY:
The Ping Pong PHY interface is very similar to the PHY only interface except command/
address signals are shared by both Channel0 and Channel1 in the Ping Pong PHY. Because
command/address signals are shared between Channel0 and Channel1, they are qualified
separately by CS_n, CKE, and ODT per channel.
Table 4-79 to Table 4-82 show the Ping Pong PHY signal interfaces.
mc_CS_n[2 × CS_WIDTH × 8 – 1:0] (I): In the Ping Pong PHY, bits [CS_WIDTH × 8/2 – 1:0] are used for Channel0 and bits [CS_WIDTH × 8 – 1:CS_WIDTH × 8/2] are used for Channel1. In the case of a dual-rank design, mc_CS_n is defined as {Ch1-CS1, Ch1-CS0, Ch0-CS1, Ch0-CS0}.
mc_ODT[2 × ODT_WIDTH × 8 – 1:0] (I): DRAM ODT. Eight bits for each DRAM pin. Active-High. In the Ping Pong PHY, bits [ODT_WIDTH × 8/2 – 1:0] are used for Channel0 and bits [ODT_WIDTH × 8 – 1:ODT_WIDTH × 8/2] are used for Channel1. In the case of a dual-rank design, mc_ODT is defined as {Ch1-ODT1, Ch1-ODT0, Ch0-ODT1, Ch0-ODT0}.
mc_C[LR_WIDTH × 8 – 1:0] (I): DRAM (3DS) logical rank select address. Eight bits for each DRAM pin.
Performance
The efficiency of a memory system is affected by many factors including limitations due to
the memory, such as cycle time (tRC) within a single bank, or Activate to Activate spacing to
the same DDR4 bank group (tRRD_L). When given multiple transactions to work on, the
Memory Controller schedules commands to the DRAM in a way that attempts to minimize
the impact of these DRAM timing requirements. But there are also limitations due to the
Memory Controller architecture itself. This section explains the key controller limitations
and options for obtaining the best performance out of the controller.
Address Map
The app_addr to the DRAM address map is described in the User Interface. Six mapping
options are included:
• ROW_COLUMN_BANK
• ROW_BANK_COLUMN
• BANK_ROW_COLUMN
• ROW_COLUMN_LRANK_BANK
• ROW_LRANK_COLUMN_BANK
• ROW_COLUMN_BANK_INTLV
For a purely random address stream at the user interface, all of the options would result in
a similar efficiency. For a sequential app_addr address stream, or any workload that tends
to have a small stride through the app_addr memory space, the ROW_COLUMN_BANK
mapping generally provides a better overall efficiency. This is due to the Memory Controller
architecture and the interleaving of transactions across the Group FSMs. The Group FSMs
are described in the Memory Controller, page 25. This controller architecture's impact on
efficiency should be considered even for situations where DRAM timing is not limiting
efficiency. Table 4-83 shows two mapping options for the 4 Gb (x8) DRAM components.
Table 4-83: DDR3/DDR4 4 Gb (x8) DRAM Address Mapping without 3DS Options
Note: Highlighted bits are used to map addresses to Group FSMs in the controller.
From the DDR3 map, you might expect reasonable efficiency with the
ROW_BANK_COLUMN option with a simple address increment pattern. The increment
pattern would generate page hits to a single bank, which DDR3 could handle as a stream of
back-to-back CAS commands resulting in high efficiency. But the italic bank bits
in Table 4-84 show that the address increment pattern also maps the long stream of page
hits to the same controller Group FSM.
For example, Table 4-84 shows how the first 12 app_addr addresses decode to the DRAM
addresses and map to the Group FSMs for both mapping options. The
ROW_BANK_COLUMN option only maps to the Group FSM 0 over this address range.
The same address to Group FSM mapping issue applies to x16 DRAMs. The map for DDR4
4 Gb (x16) is shown in Table 4-85. The ROW_COLUMN_BANK option gives the best
efficiency with sequential address patterns. The bits used to map to the Group FSMs are
highlighted.
For example, Table 4-86 shows how the first 12 app_addr decodes to the DRAM address
and maps to the Group FSMs for the ROW_COLUMN_BANK mapping option.
As mentioned in the Memory Controller, page 25, a Group FSM can issue one CAS
command every three system clock cycles, or every 12 DRAM clock cycles, even for page
hits. Therefore with only a single Group FSM issuing page hit commands to the DRAM for
long periods, the maximum efficiency is 33%.
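As a quick check of this limit: each BL8 CAS occupies the DQ bus for BL/2 = 4 DRAM clocks, so one CAS every 12 DRAM clocks from a single Group FSM can use at most 4/12 ≈ 33% of the bus, whereas four Group FSMs issuing in parallel can supply 4 × 4 = 16 DRAM clocks of data every 12 clocks, which is enough to keep the DQ bus fully utilized.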
Table 4-84 shows that the ROW_COLUMN_BANK option maps these same 12 addresses
evenly across all eight DRAM banks and all four controller Group FSMs. This generates eight
“page empty” transactions which open up all eight DRAM banks, followed by page hits to
the open banks.
With all four Group FSMs issuing page hits, the efficiency can hit 100%, for as long as the
address increment pattern continues, or until a refresh interrupts the pattern, or there is bus
dead time for a DQ bus turnaround, etc. Figure 4-22 shows the Group FSM issue over a
larger address range for the ROW_BANK_COLUMN option. Note that the first 2k addresses
map to two DRAM banks, but only one Group FSM.
The address map graph for the ROW_COLUMN_BANK option is shown in Figure 4-23. Note
that the address range in this graph is only 64 bytes, not 8k bytes. This graph is showing the
same information as in the Address Decode in Table 4-84. With an address pattern that
tends to stride through memory in minimum-sized steps, efficiency tends to be high with
the ROW_COLUMN_BANK option.
Note that the ROW_COLUMN_BANK option does not result in high bus efficiency for all
strides through memory. Consider the case of a stride of 16 bytes. This maps to only two
Group FSMs resulting in a maximum efficiency of 67%. A stride of 32 bytes maps to only one
Group FSM and the maximum efficiency is the same as the ROW_BANK_COLUMN option,
just 33%. For an address pattern with variable strides, but strides that tend to be < 1k in the
app_addr address space, the ROW_COLUMN_BANK option is much more likely to result in
good efficiency.
The same Group FSM issue exists for DDR4. With an address increment pattern and the
DDR4 ROW_BANK_COLUMN option, the first 4k transactions map to a single Group FSM, as
well as mapping to banks within a single DRAM bank group. The DRAM would limit the
address increment pattern efficiency due to the tCCD_L timing restriction. The controller
limitation in this case is even more restrictive, due to the single Group FSM. Again the
efficiency would be limited to 33%.
With the ROW_COLUMN_BANK option, the address increment pattern interleaves across all
the DRAM banks and bank groups and all of the Group FSMs over a small address range.
Figure 4-24 shows the DDR4 4 Gb (x8) ROW_COLUMN_BANK address map for the first
128 bytes of app_addr. This graph shows how the addresses map evenly across all DRAM
banks and bank groups, and all four controller Group FSMs.
Figure 4-25 shows the first 64 bytes of app_addr mapping evenly across banks, bank
groups, and Group FSMs.
When considering whether an address pattern at the user interface results in good DRAM
efficiency, the mapping of the pattern to the controller Group FSMs is just as important as
the mapping to the DRAM address. The app_addr bits that map app_addr addresses to
the Group FSMs are shown in Table 4-87 for 4 Gb and 8 Gb components.
Consider an example where you try to obtain good efficiency using only four DDR3 banks
at a time. Assume you are using a 4 Gb (x8) with the ROW_COLUMN_BANK option and you
decide to open a page in banks 0, 1, 2, and 3, and issue transactions to four column
addresses in each bank. Using the address map from Address Map, determine the
app_addr pattern that decodes to this DRAM sequence. Applying the Group FSM map
from Table 4-87, determine how this app_addr pattern maps to the FSMs. The result is
shown in Table 4-88.
The four bank pattern in Table 4-88 works well from a DRAM point of view, but the
controller only uses two of its four Group FSMs and the maximum efficiency is 67%. In
practice it is even lower due to other timing restrictions like tRCD. A better bank pattern
would be to open all the even banks and send four transactions to each as shown in
Table 4-89.
The “even bank” pattern uses all of the Group FSMs and therefore has better efficiency than
the previous pattern.
For good efficiency, you want to keep as many Group FSMs busy in parallel as you can. You
could try changing the transaction presented to the user interface to one that maps to a
different FSM, but you do not have visibility at the user interface as to which FSMs have
space to take new transactions. The transaction FIFOs prevent this type of head of line
blocking until a UI command maps to an FSM with a full FIFO.
A Group FSM FIFO structure can hold up to six transactions, depending on the page status
of the target rank and bank. The FIFO structure is made up of two stages that also
implement a “Look Ahead” function. New transactions are placed in the first FIFO stage and
are operated on when they reach the head of the FIFO. Then depending on the transaction
page status, the Group FSM either arbitrates to open the transaction page, or if the page is
already open, the FSM pushes the page hit into the second FIFO stage. This scheme allows
multiple page hits to be queued up while the FSM looks ahead into the logical FIFO
structure for pages that need to be opened. Looking ahead into the queue allows an FSM to
interleave DRAM commands for multiple transactions on the DDR bus. This helps to hide
DRAM tRCD and tRP timing associated with opening and closing pages.
The following conceptual timing diagram shows the transaction flow from the UI to the
DDR command bus, through the Group FSMs, for a series of transactions. The diagram is
conceptual in that the latency from the UI to the DDR bus is not considered and not all
DRAM timing requirements are met. Although not completely timing accurate, the diagram
does follow DRAM protocol well enough to help explain the controller features under
discussion.
Four transactions are presented at the UI, the first three mapping to the Group FSM0 and
the fourth to FSM1. On system clock cycle 1, FSM0 accepts transaction 1 to Row 0, Column
0, and Bank 0 into its stage 1 FIFO and issues an Activate command.
On clock 2, transaction 1 is moved into the FSM0 stage 2 FIFO and transaction 2 is accepted
into FSM0 stage 1 FIFO. On clock cycles 2 through 4, FSM0 is arbitrating to issue a CAS
command for transaction 1, and an Activate command for transaction 2. FSM0 is looking
ahead to schedule commands for transaction 2 even though transaction 1 is not complete.
Note that the time when these DRAM commands win arbitration is determined by DRAM
timing such as tRCD and controller pipeline delays, which explains why the commands are
spaced on the DDR command bus as shown.
On cycle 3, transaction 3 is accepted into FSM0 stage 1 FIFO, but it is not processed until
clock cycle 5 when it comes to the head of the stage 1 FIFO. Cycle 5 is where FSM0 begins
looking ahead at transaction 3 while also arbitrating to issue the CAS command for
transaction 2. Finally on cycle 4, transaction 4 is accepted into FSM1 stage 1 FIFO. If FSM0
did not have at least a three deep FIFO, transaction 4 would have been blocked until cycle 6.
This diagram does not show a high efficiency transaction pattern. There are no page hits
and only two Group FSMs are involved. But the example does show how a single Group FSM
interleaves DRAM commands for multiple transactions on the DDR bus and minimizes
blocking of the UI, thereby improving efficiency.
Autoprecharge
The Memory Controller defaults to a page open policy. It leaves banks open, even when
there are no transactions pending. It only closes banks when a refresh is due, a page miss
transaction is being processed, or when explicitly instructed to issue a transaction with a
RDA or WRA CAS command. The app_autoprecharge port on the UI allows you to
explicitly instruct the controller to issue a RDA or WRA command in the CAS command
phase of processing a transaction, on a per transaction basis. You can use this signal to
improve efficiency when you have knowledge of what transactions will be sent to the UI in
the future.
The following diagram is a modified version of the “look ahead” example from the previous
section. The page miss transaction that was previously presented to the UI in cycle 3 is now
moved out to cycle 9. The controller can no longer “look ahead” and issues the Precharge to
Bank 0 in cycle 6 because it does not know about the page miss until cycle 9. But if you
know that transaction 1 in cycle 1 is the only transaction to Row 0 in Bank0, assert the
app_autoprecharge port in cycle 1. Then, the CAS command for transaction 1 in cycle 5
is a RDA or WRA, and the transaction to Row 1, Bank 0 in cycle 9 is no longer a page miss.
The transaction in cycle 9 then needs only an Activate command instead of a Precharge
followed by an Activate tRP later.
addresses before switching to a different random address. Patterns like this are often seen
in typical AXI configurations.
If you have knowledge of the UI traffic pattern, you might be able to schedule DRAM
maintenance commands with less impact on system efficiency. You can use the app_ref
and app_zq ports at the UI to schedule these commands when the controller is configured
for User Refresh and ZQCS. In this mode, the controller does not schedule the DRAM
maintenance commands and only issues them based on the app_ref and app_zq ports.
You are responsible for meeting all DRAM timing requirements for refresh and ZQCS.
Consider a case where the system needs to move a large amount of data into or out of the
DRAM with the highest possible efficiency over a 50 µs period. If the controller schedules
the maintenance commands, this 50 µs data burst would be interrupted multiple times for
refresh, reducing efficiency roughly 4%. In User Refresh mode, however, you can decide to
postpone refreshes during the 50 µs burst and make them up later. The DRAM specification
allows up to eight refreshes to be postponed, giving you flexibility to schedule refreshes
over a 9 × tREFI period, more than enough to cover the 50 µs in this example.
While User Refresh and ZQCS enable you to optimize efficiency, their incorrect use can lead
to DRAM timing violations and data loss in the DRAM. Use this mode only if you thoroughly
understand DRAM refresh and ZQCS requirements as well as the operation of the app_ref
and app_zq UI ports. The UI port operation is described in the User Interface.
Periodic Reads
The FPGA DDR PHY requires at least one DRAM RD or RDA command to be issued every
1 µs. This requirement is described in the User Interface. If this requirement is not met by
the transaction pattern at the UI, the controller detects the lack of reads and injects a read
transaction into Group FSM0. This injected read is issued to the DRAM following the normal
mechanisms of the controller issuing transactions. The key difference is that no read data is
returned to the UI. This is wasted DRAM bandwidth.
User interface patterns with long strings of write transactions are affected the most by the
PHY periodic read requirement. Consider a pattern with a 50/50 read/write transaction
ratio, but organized such that the pattern alternates between 2 µs bursts of 100% page hit
reads and 2 µs bursts of 100% page hit writes. There is at least one injected read in the 2 µs
write burst, resulting in a loss of efficiency due to the read command and the turnaround
time to switch the DRAM and DDR bus from writes to reads back to writes. This 2 µs
alternating burst pattern is slightly more efficient than alternating between reads and
writes every 1 µs. A 1 µs or shorter alternating pattern would eliminate the need for the
controller to inject reads, but there would still be more read-write turnarounds.
Bus turnarounds are expensive in terms of efficiency and should be avoided if possible.
Long bursts of page hit writes, > 2 µs in duration, are still the most efficient way to write to
the DRAM, but the impact of one write-read-write turnaround each 1 µs must be taken into
account when calculating the maximum write efficiency.
DIMM Configurations
DDR3/DDR4 SDRAM memory interface supports UDIMM, RDIMM, LRDIMM, and SODIMM
in multiple slot configurations.
IMPORTANT: Note that the chip select order generated by Vivado depends on your board design.
Also, the DDR3/DDR4 IP core does not read the SPD. If the DIMM configuration changes, the IP must be
regenerated.
In the following configurations, the empty slot is not used; implementing it on the board is
optional.
DDR3/DDR4 UDIMM/SODIMM
Table 4-92 and Figure 4-26 show the four configurations supported for DDR3/DDR4
UDIMM and SODIMM.
For a dual-rank DIMM, Dual Slot configuration, follow the chip select order shown in
Figure 4-26, where CS0 and CS1 are connected to Slot0 and CS2 and CS3 are connected to
Slot1.
DDR3/DDR4 UDIMM (Figure 4-26) supported configurations:
• Single slot (Slot0), single rank: CS0
• Single slot (Slot0), dual rank: CS0, CS1
• Dual slot, dual rank per slot: Slot0 = CS0, CS1; Slot1 = CS2, CS3
• Dual slot, single rank per slot: Slot0 = CS0; Slot1 = CS1
DDR3 RDIMM
Table 4-93 and Figure 4-27 show the five configurations supported for DDR3 RDIMM.
DDR3 RDIMM requires two chip selects for a single-rank RDIMM to program the register
chip.
For a single-rank DIMM, Dual slot configuration, you must follow the chip select order
shown in Figure 4-27, where CS0 and CS2 are connected to Slot0 and CS1 and CS3 are
connected to Slot1.
For a dual-rank DIMM, Dual Slot configuration, follow the chip select order shown in
Figure 4-27, where CS0 and CS1 are connected to Slot0 and CS2 and CS3 are connected to
Slot1.
DDR3 RDIMM (Figure 4-27) supported configurations:
• Single slot (Slot0), single rank: CS0, CS1 (two chip selects required for the register chip)
• Dual slot, single rank per slot: Slot0 = CS0, CS2; Slot1 = CS1, CS3
• Single slot (Slot0), dual rank: CS0, CS1
• Dual slot, dual rank per slot: Slot0 = CS0, CS1; Slot1 = CS2, CS3
• Single slot (Slot0), quad rank: CS0, CS1, CS2, CS3
DDR4 RDIMM
Table 4-94 and Figure 4-28 show the four configurations supported for DDR4 RDIMM. For
dual-rank DIMM, Dual Slot configuration, follow the chip select order shown in Figure 4-28,
where CS0 and CS1 are connected to Slot0 and CS2 and CS3 are connected to Slot1.
DDR4 RDIMM (Figure 4-28) supported configurations:
• Single slot (Slot0), single rank: CS0
• Dual slot, single rank per slot: Slot0 = CS0; Slot1 = CS1
• Single slot (Slot0), dual rank: CS0, CS1
• Dual slot, dual rank per slot: Slot0 = CS0, CS1; Slot1 = CS2, CS3
SLOT0_CONFIG
In a given DIMM configuration, the logical chip selects are mapped to physical slots using an
8-bit number per slot. Each bit indicates whether the corresponding logical chip select is
connected to that slot.
SLOT0_FUNC_CS
A DDR3 single-rank RDIMM needs two chip selects to access the register chip; however, only
the lower rank chip select is used as the functional chip select. SLOT0_FUNC_CS describes
the functional chip selects per slot. For any DIMM other than a DDR3 single-rank RDIMM,
SLOT0_CONFIG is the same as SLOT0_FUNC_CS and SLOT1_CONFIG is the same as SLOT1_FUNC_CS.
For example, in a dual-slot, dual-rank configuration (CS0/CS1 connected to Slot0 and CS2/CS3
connected to Slot1), SLOT0_CONFIG and SLOT0_FUNC_CS are both 0000_0011, and SLOT1_CONFIG and
SLOT1_FUNC_CS are both 0000_1100.
DDR4 LRDIMM
Table 4-95 and Figure 4-29 show the three configurations supported for DDR4 LRDIMM.
For Dual Slot, dual-rank configuration, follow the chip select order shown in Figure 4-29,
where CS0 and CS1 are connected to Slot0 and CS2 and CS3 are connected to Slot1.
DDR4 LRDIMM (Figure 4-29) supported configurations:
• Single slot (Slot0), dual rank: CS0, CS1
• Single slot (Slot0), quad rank: CS0, CS1, CS2, CS3
• Dual slot, dual rank per slot: Slot0 = CS0, CS1; Slot1 = CS2, CS3
The default values of four parameters are given in Table 4-96. These parameters can be
changed through a Tcl command using the user parameter TIMING_OP1 or TIMING_OP2 when the
Controller/PHY Mode is set to Controller and Physical Layer. These Tcl options are not valid
for any PHY_ONLY (Physical Layer Only and Physical Layer Ping Pong) designs.
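For example, a minimal Tcl sketch (the property name CONFIG.TIMING_OP1 and the value shown are assumptions based on the user parameter name; confirm the exact property for your generated IP in the Tcl Console):
# Assumption: the user parameter is exposed as CONFIG.TIMING_OP1
set_property -dict [list CONFIG.TIMING_OP1 {true}] [get_ips <ip_name>]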
4. Generate output files by selecting Generate Output Products after right-clicking IP. See
Figure 4-31.
The generated output files have the RTL parameter values set as per Table 4-96.
Table 4-97: Parameter Values Based on Tcl Command Option for 3DS
Parameters     Default     Better Timing (TIMING_3DS Tcl Option)
ALIAS_PAGE     OFF         ON
ALIAS_P_CNT    OFF         ON
DRAM pages are kept open as long as possible to reduce the number of precharges. The
controller contains a page table per bank and rank for each bank group. With 3DS, a third
dimension is added to these page tables for logical ranks. This increases gate count and
makes timing closure harder, but DRAM access performance is improved. ALIAS_PAGE
= ON removes this dimension.
Similarly for 3DS, another logical rank dimension is added to the per rank/bank counters
that keep track of tRAS, tRTP, and tWTP. ALIAS_P_CNT = ON removes the logical
rank dimension.
Removing the third dimension does not affect correct operation of DRAM. However, it
removes some of the performance advantages.
The default values of two parameters are given in Table 4-97. These parameters can be
changed through a Tcl command using the user parameter TIMING_3DS when the Controller/PHY
Mode is set to Controller and Physical Layer. These Tcl options are not valid for any PHY_ONLY
(Physical Layer Only and Physical Layer Ping Pong) designs.
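For example, a minimal Tcl sketch (the property name CONFIG.TIMING_3DS and the value shown are assumptions; confirm the exact property for your generated IP in the Tcl Console):
# Assumption: the user parameter is exposed as CONFIG.TIMING_3DS
set_property -dict [list CONFIG.TIMING_3DS {true}] [get_ips <ip_name>]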
4. Generate output files by selecting Generate Output Products after right-clicking IP. See
Figure 4-33.
• Memory IP lists the possible Reference Input Clock Speed values based on the targeted
memory frequency (based on the selected Memory Device Interface Speed).
• Otherwise, select M and D Options and target the desired Reference Input Clock Speed,
which is calculated from the CLKFBOUT_MULT (M), DIVCLK_DIVIDE (D), and
CLKOUT0_DIVIDE (D0) values selected in the Advanced Clocking tab.
The required Reference Input Clock Speed is calculated from the M, D, and D0 values entered
in the GUI using the following relationships, where tCK is the Memory Device Interface Speed
selected in the Basic tab:
• MMCM CLKOUT0 frequency = 1/(4 × tCK), that is, one quarter of the memory clock frequency
• Reference Input Clock Speed = (MMCM CLKOUT0 frequency × D × D0)/M
The Reference Input Clock Speed calculated from the M, D, and D0 values is validated against
the clocking guidelines. For more information on clocking rules, see Clocking.
Apart from the memory-specific clocking rules, validation of the possible MMCM input
frequency range, MMCM VCO frequency range, and MMCM PFD frequency range values is
completed for M, D, and D0 in the GUI.
For UltraScale devices, see Kintex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS892) [Ref 2] and Virtex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS893) [Ref 3] for MMCM Input frequency range, MMCM VCO frequency
range, and MMCM PFD frequency range values.
For UltraScale+ devices, see Kintex UltraScale+ FPGAs Data Sheet: DC and AC Switching
Characteristics (DS922) [Ref 4], Virtex UltraScale+ FPGAs Data Sheet: DC and AC Switching
Characteristics (DS923) [Ref 5], and Zynq UltraScale+ MPSoC Data Sheet: DC and AC
Switching Characteristics (DS925) [Ref 6] for MMCM Input frequency range, MMCM VCO
frequency range, and MMCM PFD frequency range values.
For possible M, D, and D0 values and detailed information on clocking and the MMCM, see
the UltraScale Architecture Clocking Resources User Guide (UG572) [Ref 8].
• Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
[Ref 13]
• Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14]
• Vivado Design Suite User Guide: Getting Started (UG910) [Ref 15]
• Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16]
This section includes information about using Xilinx ® tools to customize and generate the
core in the Vivado Design Suite.
If you are customizing and generating the core in the IP integrator, see the Vivado Design
Suite User Guide: Designing IP Subsystems using IP Integrator (UG994) [Ref 13] for detailed
information. IP integrator might auto-compute certain configuration values when
validating or generating the design. To check whether the values change, see the
description of the parameter in this chapter. To view the parameter value, run the
validate_bd_design command in the Tcl Console.
You can customize the IP for use in your design by specifying values for the various
parameters associated with the IP core using the following steps:
For more information about generating the core in Vivado, see the Vivado Design Suite User
Guide: Designing with IP (UG896) [Ref 14] and the Vivado Design Suite User Guide: Getting
Started (UG910) [Ref 15].
Note: Figures in this chapter are illustrations of the Vivado Integrated Design Environment (IDE).
This layout might vary from the current version.
Basic Tab
Figure 5-1 and Figure 5-2 show the Basic tab when you start up the DDR3/DDR4 SDRAM.
IMPORTANT: All parameters shown in the controller options dialog box are limited selection options in
this release.
For the Vivado IDE, all controllers (DDR3, DDR4, LPDDR3, QDR II+, QDR-IV, and RLDRAM 3)
can be created and available for instantiation.
In IP integrator, only one controller instance can be created and only two kinds of
controllers are available for instantiation:
• DDR3
• DDR4
1. After a controller is added from the pull-down menu, select the Mode and Interface for
the controller. Select the AXI4 Interface, or optionally select Generate the PHY
component only.
2. Select the settings in the Clocking, Controller Options, Memory Options, and
Advanced User Request Controller Options.
In Clocking, the Memory Device Interface Speed sets the speed of the interface. The
speed entered drives the available Reference Input Clock Speeds. For more
information on the clocking structure, see the Clocking, page 81.
3. To use memory parts that are not available by default through the DDR3/DDR4
SDRAM Vivado IDE, you can create a custom parts CSV file, as specified in AR:
63462. This CSV file has to be provided after enabling the Custom Parts Data File
option. After selecting this option, you can see the custom memory parts along
with the default memory parts. Note that simulations are not supported for custom
parts. Custom part simulations require manually adding the memory model to the
simulation and might require modifying the test bench instantiation.
4. All available options of Data Mask and DBI and their functionality are described in
Table 4-76. The dependency of ECC on the DM_DBI input is described in
Table 4-77 for both the user and AXI interfaces.
IMPORTANT: To support partial writes, AXI designs require Data Mask (DM) to always be selected, and
the option is grayed out. This applies to all AXI interfaces except 72-bit, which requires the use of ECC.
Having ECC and DM in the same design causes the ECC to fail, so DM must be turned off when ECC is
enabled.
Figure 5-3: Vivado Customize IP Dialog Box for DDR4 – AXI Options
Figure 5-4: Vivado Customize IP Dialog Box for DDR3 – Advanced Clocking
Figure 5-5: Vivado Customize IP Dialog Box for DDR3 – Advanced Options
Figure 5-6: Vivado Customize IP Dialog Box for DDR4 – Advanced Options
Figure 5-7: Vivado Customize IP Dialog Box for DDR4 – Migration Options
Figure 5-8: Vivado Customize IP Dialog Box – DDR3 SDRAM I/O Planning and Design Checklist
Figure 5-9: Vivado Customize IP Dialog Box – DDR4 SDRAM I/O Planning and Design Checklist
User Parameters
Table 5-1 shows the relationship between the fields in the Vivado IDE and the User
Parameters (which can be viewed in the Tcl Console).
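For example, an individual user parameter value can be queried from the Tcl Console (a sketch using a property named elsewhere in this chapter; substitute your own IP instance name):
get_property CONFIG.C0.DDR3_BurstType [get_ips <ip_name>]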
2. In the Generate Output Products option, do not select Generate; instead, select Skip
(Figure 5-10).
3. Set the Burst Type value by running the following command on the Tcl Console:
a. For DDR3 IP:
set_property -dict [list CONFIG.C0.DDR3_BurstType <value_to_be_set>] [get_ips <ip_name>]
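For example, a sketch with an illustrative value and instance name (both are assumptions; replace them with the value and IP instance name used in your project):
# Illustrative value and instance name only
set_property -dict [list CONFIG.C0.DDR3_BurstType {Sequential}] [get_ips ddr3_0]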
The generated output files have the Burst Type value set as per the selected value.
3. Set the Additive Latency value by running the following command on the Tcl Console:
set_property -dict [list config.AL_SEL <value_to_be_set>] [get_ips <ip_name>]
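For example, a sketch with an illustrative value (the accepted value strings depend on the IP configuration and are an assumption here; replace the value and instance name with your own):
# Illustrative value only
set_property -dict [list config.AL_SEL {0}] [get_ips <ip_name>]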
The generated output files have the Additive Latency value set as per the selected value.
IMPORTANT: The values entered are not validated; it is your responsibility to enter correct values.
Output Generation
For details, see the Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14].
I/O Planning
DDR3/DDR4 SDRAM I/O pin planning is completed with the full design pin planning using
the Vivado I/O Pin Planner. DDR3/DDR4 SDRAM I/O pins can be selected through several
Vivado I/O Pin Planner features including assignments using I/O Ports view, Package view,
or Memory Bank/Byte Planner. Pin assignments can additionally be made through
importing an XDC or modifying the existing XDC file.
These options are available for all DDR3/DDR4 SDRAM designs and multiple DDR3/DDR4
SDRAM IP instances can be completed in one setting. To learn more about the available
Memory IP pin planning options, see the Vivado Design Suite User Guide: I/O and Clock
Planning (UG899) [Ref 18].
Required Constraints
For DDR3/DDR4 SDRAM Vivado IDE, you specify the pin location constraints. For more
information on I/O standard and other constraints, see the Vivado Design Suite User Guide:
I/O and Clock Planning (UG899) [Ref 18]. The location is chosen by the Vivado IDE
according to the banks and byte lanes chosen for the design.
The I/O standard is chosen by the memory type selection and options in the Vivado IDE and
by the pin type. A sample for dq[0] is shown here.
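As a sketch only, for a DDR4 interface with default options (the I/O standard and port name below are assumptions and must match the XDC generated for your design):
# Assumed I/O standard and port name; use the values from the generated XDC
set_property IOSTANDARD POD12_DCI [get_ports {c0_ddr4_dq[0]}]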
For HR banks, update the output_impedance of all the ports assigned to HR bank pins
using the reset_property command. For more information, see AR: 63852.
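For example, a sketch that resets the property on a group of data ports (the port name pattern is a placeholder for the ports your design assigns to HR bank pins):
# Placeholder port pattern; adjust to your design's port names
reset_property OUTPUT_IMPEDANCE [get_ports c0_ddr3_dq*]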
IMPORTANT: Do not alter these constraints. If the pin locations need to be altered, rerun the DDR3/
DDR4 SDRAM Vivado IDE to generate a new XDC file.
Clock Frequencies
This section is not applicable for this IP core.
Clock Management
For more information on clocking, see Clocking, page 81.
Clock Placement
This section is not applicable for this IP core.
Banking
This section is not applicable for this IP core.
Transceiver Placement
This section is not applicable for this IP core.
IMPORTANT: The set_input_delay and set_output_delay constraints are not needed on the
external memory interface pins in this design due to the calibration process that automatically runs at
start-up. Warnings seen during implementation for the pins can be ignored.
Simulation
For comprehensive information about Vivado simulation components, as well as
information about using supported third-party tools, see the Vivado Design Suite User
Guide: Logic Simulation (UG900) [Ref 16]. For more information on simulation, see
Chapter 6, Example Design and Chapter 7, Test Bench.
Note: The Example Design is a Mixed Language IP and simulations should be run with the
Simulation Language set to Mixed. If the Simulation Language is set to Verilog, then it attempts
to run a netlist simulation.
Example Design
This chapter contains information about the example design provided in the Vivado®
Design Suite. Vivado supports Open IP Example Design flow. To create the example design
using this flow, right-click the IP in the Source Window, as shown in Figure 6-1 and select
Open IP Example Design.
This option creates a new Vivado project. Upon selecting the menu, a dialog box to enter
the directory information for the new design project opens.
Select a directory, or use the defaults, and click OK. This launches a new Vivado with all of
the example design files and a copy of the IP.
Figure 6-2 shows the example design with the PHY only option selected (controller module
does not get generated).
Figure 6-2: Open IP Example Design with PHY Only Option Selected
The example design can be simulated using one of the methods in the following sections.
RECOMMENDED: If a custom wrapper is used to simulate the example design, the following parameter
should be used in the custom wrapper:
parameter SIMULATION = "TRUE"
The parameter SIMULATION is used to disable the calibration during simulation.
Project-Based Simulation
This method can be used to simulate the example design using the Vivado Integrated
Design Environment (IDE). Memory IP delivers memory models for DDR3 and IEEE
encrypted memory models for DDR4.
The Vivado simulator, Questa Advanced Simulator, IES, and VCS tools are used for DDR3/
DDR4 IP verification at each software release. The Vivado simulator has been used for DDR3/
DDR4 IP verification since the 2015.1 Vivado software release. The following subsections
describe the steps to run a project-based simulation using each supported simulator tool.
5. In the Flow Navigator window, select Run Simulation and select Run Behavioral
Simulation option as shown in Figure 6-4.
6. Vivado invokes Vivado simulator and simulations are run in the Vivado simulator tool.
For more information, see the Vivado Design Suite User Guide: Logic Simulation (UG900)
[Ref 16].
4. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 6-6.
5. Vivado invokes Questa Advanced Simulator and simulations are run in the Questa
Advanced Simulator tool. For more information, see the Vivado Design Suite User Guide:
Logic Simulation (UG900) [Ref 16].
4. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 6-6.
5. Vivado invokes IES and simulations are run in the IES tool. For more information, see the
Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16].
4. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 6-6.
5. Vivado invokes VCS and simulations are run in the VCS tool. For more information, see
the Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16].
Simulation Speed
DDR3/DDR4 SDRAM provides a Vivado IDE option to reduce simulation time by selecting a
behavioral XIPHY model instead of the UNISIM XIPHY model. Behavioral XIPHY model
simulation is the default option for DDR3/DDR4 SDRAM designs. To select the simulation
mode, click the Advanced Options tab and find the Simulation Options as shown in
Figure 5-5.
The SIM_MODE parameter in the RTL is given a different value based on the Vivado IDE
selection.
• SIM_MODE = BFM – If BFM mode is selected in the Vivado IDE, the RTL parameter
reflects this value for the SIM_MODE parameter. This is the default option.
• SIM_MODE = FULL – If UNISIM mode is selected in the Vivado IDE, XIPHY UNISIMs are
selected and the parameter value in the RTL is FULL.
If the design is generated with the Reference Input Clock option selected as No Buffer (at
Advanced > FPGA Options > Reference Input), the CLOCK_DEDICATED_ROUTE
constraints and BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV instantiation based on GCIO and
MMCM allocation needs to be handled manually for the IP flow. DDR3/DDR4 SDRAM does
not generate clock constraints in the XDC file for No Buffer configurations and you must
take care of the clock constraints for No Buffer configurations for the IP flow.
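As a sketch of the kind of constraint involved (the hierarchical pin path is an assumption and must be adapted to the instance names in your generated IP):
# Assumed hierarchy; verify the MMCM instance path in your generated design
set_property CLOCK_DEDICATED_ROUTE BACKBONE [get_pins -hier -filter {NAME =~ */u_ddr4_infrastructure/gen_mmcme*.u_mmcme_adv_inst/CLKIN1}]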
For an example design flow with No Buffer configurations, DDR3/DDR4 SDRAM generates
the example design with differential buffer instantiation for system clock pins. DDR3/DDR4
SDRAM generates clock constraints in the example_design.xdc. It also generates a
CLOCK_DEDICATED_ROUTE constraint as the “BACKBONE” and instantiates BUFG/BUFGCE/
BUFGCTRL/BUFGCE_DIV between GCIO and MMCM input if the GCIO and MMCM are not in
same bank to provide a complete solution. This is done for the example design flow as a
reference when it is generated for the first time.
If in the example design, the I/O pins of the system clock pins are changed to some other
pins with the I/O pin planner, the CLOCK_DEDICATED_ROUTE constraints and BUFG/
BUFGCE/BUFGCTRL/BUFGCE_DIV instantiation need to be managed manually. A DRC error
is reported for the same.
Test Bench
This chapter contains information about the test bench provided in the Vivado ® Design
Suite.
The intent of the performance test bench is for you to obtain an estimate of the efficiency
for a given traffic pattern with the DDR3/DDR4 SDRAM controller. The test bench passes
your supplied commands and addresses to the Memory Controller and measures the
efficiency for the given pattern. The efficiency is measured by the occupancy of the dq bus.
The primary use of the test bench is for efficiency measurements, so no data integrity checks
are performed. Static data is written into the memory during write transactions and the
same data is always read back.
Stimulus Pattern
The stimulus pattern for a non-3DS part is 48 bits and the format is described in Table 7-2. For a
3DS part, the stimulus pattern is 52 bits and is described in Table 7-3. The stimulus pattern
descriptions for non-3DS and 3DS parts are shown in Table 7-4.
For example, in an eight-bank configuration, only bank Bits[2:0] are sent to the Memory
Controller and the remaining bits are ignored. The extra bits in an address field are
provided so that you can enter the address in hexadecimal format. You must confirm that the
value entered corresponds to the width of the given configuration.
The address is assembled based on the top-level MEM_ADDR_ORDER parameter and sent to
the user interface.
Bus Utilization
The bus utilization is calculated at the user interface, taking the total number of reads and
writes into consideration. The reported value, bw_cumulative, is the percentage of dq bus
cycles carrying read or write data out of the total cycles in the measurement period.
Example Patterns
These examples are based on the MEM_ADDR_ORDER set to BANK_ROW_COLUMN.
00_0_2_000F_00A_1 – This pattern is a single read from 10th column, 15th row, and second bank.
0A_0_0_0010_000_1 – This corresponds to 11 reads with the column address starting at 0 and incrementing up to 80, as can be seen in the column address.
After opening the example_design project, follow the steps to run the performance
traffic generator.
1. In the Vivado Integrated Design Environment (IDE), open the Simulation Sources
section and double-click the sim_tb_top.sv file to open it in Edit mode. Or open the
file from the following location, <project_dir>/example_project/
<component_name>_example/<component_name>_example.srcs/sim_1/
imports/tb/sim_tb_top.sv.
2. Add a `define BEHV line in the sim_tb_top.sv file and save it.
3. Go to the Simulation Settings in the Vivado IDE.
a. Select Target Simulator from the supported simulators (supported simulators are
Questa Advanced Simulator, Incisive Enterprise Simulator (IES), Verilog Compiler
Simulator (VCS), and Vivado simulator). Browse to the compiled libraries location
and set the path on the Compiled Libraries Location option as per the Target
Simulator.
b. Under the Simulation tab, set the simulation run-time to 1 ms (there are simulation
RTL directives which stop the simulation after a certain period of time, which is less
than 1 ms). The Generate Scripts Only option generates simulation scripts only.
Overview
Product Specification
Core Architecture
Example Design
Test Bench
Overview
IMPORTANT: This document supports LPDDR3 SDRAM core v1.0.
• Hardware, IP, and Platform Development: Creating the PL IP blocks for the hardware
platform, creating PL kernels, subsystem functional simulation, and evaluating the
Vivado ® timing, resource and power closure. Also involves developing the hardware
platform for system integration. Topics in this document that apply to this design
process include:
° Clocking
° Resets
° Protocol Description
° Example Design
Core Overview
The Xilinx UltraScale™ architecture includes the LPDDR3 SDRAM core. This core provides
solutions for interfacing with the SDRAM memory type. The UltraScale architecture for the
LPDDR3 core is organized in the following high-level blocks:
• Controller – The controller accepts burst transactions from the user interface and
generates transactions to and from the SDRAM. The controller takes care of the SDRAM
timing parameters and refresh. It coalesces write and read transactions to reduce the
number of dead cycles involved in turning the bus around. The controller also reorders
commands to improve the utilization of the data bus to the SDRAM.
• Physical Layer – The physical layer provides a high-speed interface to the SDRAM. This
layer includes the hard blocks inside the FPGA and the soft blocks calibration logic
necessary to ensure optimal timing of the hard blocks interfacing to the SDRAM.
The new hard blocks in the UltraScale architecture allow interface rates of up to
1,600 Mb/s to be achieved. The application logic is responsible for all SDRAM
transactions, timing, and refresh.
Results of the calibration process are available through the Xilinx debug tools.
After completion of calibration, the PHY layer presents a raw interface to the
SDRAM.
• Application Interface – The user interface layer provides a simple FIFO-like interface
to the application. Data is buffered and read data is presented in request order.
This user interface is layered on top of the native interface to the controller. The
native interface is not accessible by the user application; it has no buffering and
presents return data to the user interface as it is received from the SDRAM, which is not
necessarily in the original request order. The user interface then buffers the read and
write data and reorders the data as needed.
[Figure: UltraScale FPGAs Memory Interface Solution block diagram. User FPGA logic connects
through the user interface (ui_clk, ui_clk_sync_rst, init_calib_complete, app_addr, app_cmd,
app_en, app_hi_pri, app_wdf_data, app_wdf_end, app_wdf_mask, app_wdf_wren, app_wdf_rdy,
app_rdy, app_rd_data, app_rd_data_end, app_rd_data_valid) and the native MC/PHY interface to
the Memory Controller and physical layer, which drive the LPDDR3 SDRAM signals (lpddr3_ca,
lpddr3_ck_t/c, lpddr3_cke, lpddr3_cs_n, lpddr3_dm, lpddr3_odt, lpddr3_dq, lpddr3_dqs_t/c)
through the IOBs. System clock (sys_clk_p and sys_clk_n/sys_clk_i) and system reset
(sys_rst_n) port connections are not shown in the block diagram.]
Feature Summary
• Density support
° Support for other memory device densities is available through custom part
selection
• 8-bank support
• x32 device support
Information about other Xilinx LogiCORE IP modules is available at the Xilinx Intellectual
Property page. For information on pricing and availability of other Xilinx LogiCORE IP
modules and tools, contact your local Xilinx sales representative.
License Checkers
If the IP requires a license key, the key must be verified. The Vivado design tools have
several license checkpoints for gating licensed IP through the flow. If the license check
succeeds, the IP can continue generation. Otherwise, generation halts with error. License
checkpoints are enforced by the following tools:
• Vivado synthesis
• Vivado implementation
• write_bitstream (Tcl command)
IMPORTANT: IP license level is ignored at checkpoints. The test confirms a valid license exists. It does
not check IP license level.
Product Specification
Standards
This core supports DRAMs that are compliant to the JESD209-3C, LPDDR3 SDRAM Standard,
JEDEC ® Solid State Technology Association [Ref 1].
For more information on UltraScale™ architecture documents, see References, page 789.
Performance
Maximum Frequencies
For more information on the maximum frequencies, see the following documentation:
Resource Utilization
For full details about performance and resource utilization, visit Performance and Resource
Utilization.
Port Descriptions
For a complete Memory Controller solution there are three port categories at the top-level
of the memory interface core called the “user design.”
• The first category is the memory interface signals that directly interface with the
SDRAM. These are defined by the JEDEC specification.
• The second category is the application interface signals. These are described in the
Protocol Description, page 296.
• The third category includes other signals necessary for proper operation of the core.
These include the clocks, reset, and status signals from the core. The clocking and reset
signals are described in their respective sections.
Core Architecture
This chapter describes the UltraScale™ architecture-based FPGAs Memory Interface
Solutions core with an overview of the modules and interfaces.
Overview
The UltraScale architecture-based FPGAs Memory Interface Solutions is shown in
Figure 10-1.
[Figure shows user FPGA logic connected through the user interface and Memory Controller,
with initialization/calibration logic providing CalDone, and the physical layer driving the
LPDDR3 SDRAM and returning read data.]
Figure 10-1: UltraScale Architecture-Based FPGAs Memory Interface Solution Core Architecture
Memory Controller
In the core default configuration, the Memory Controller (MC) resides between the user
interface (UI) block and the physical layer. This is depicted in Figure 10-2.
[Figure 10-2: Memory Controller block diagram. The user interface block connects to the bank
machines, column machine, and arbiter, which drive the MC/PHY interface to the physical
layer. The interface carries the rank, bank, row, column, command, and data buffer address
for each request, the write data and mask, the read data return signals, and the
self-refresh, refresh, and ZQ request/acknowledge signals (app_sr_req, app_sr_active,
app_ref_req, app_ref_ack, app_zq_req, app_zq_ack).]
The Memory Controller is the primary logic block of the memory interface. The Memory
Controller receives requests from the UI and stores them in a logical queue. Requests are
optionally reordered to optimize system throughput and latency.
Bank Machines
Most of the Memory Controller logic resides in the bank machines. Bank machines
correspond to DRAM banks. A given bank machine manages a single DRAM bank at any
given time. However, bank machine assignment is dynamic, so it is not necessary to have a
bank machine for each physical bank. The number of bank machines can be configured to
trade off between area and performance. This is discussed in greater detail in the Precharge
Policy section.
The duration of a bank machine assignment to a particular DRAM bank is coupled to user
requests rather than the state of the target DRAM bank. When a request is accepted, it is
assigned to a bank machine. When a request is complete, the bank machine is released and
is made available for assignment to another request. Bank machines issue all the commands
necessary to complete the request.
On behalf of the current request, a bank machine must generate row commands and
column commands to complete the request. Row and column commands are independent
but must adhere to DRAM timing requirements.
The following example illustrates this concept. Consider the case when the Memory
Controller and DRAM are idle when a single request arrives. The bank machine at the head
of the pool:
1. Accepts the request.
2. Issues the activate command to open the target row.
3. Issues the CAS (read or write) command.
4. Issues the precharge command to close the row.
5. Returns to the idle pool.
Similar functionality applies when multiple requests arrive targeting different rows or banks.
Now consider the case when a request arrives targeting an open DRAM bank, managed by
an already active bank machine. The already active bank machine recognizes that the new
request targets the same DRAM bank and skips the precharge step (step 4). The bank
machine at the head of the idle pool accepts the new user request and skips the activate
step (step 2).
Finally, when a request arrives in between both a previous and subsequent request all to the
same target DRAM bank, the controller skips both the activate (step 2) and precharge
(step 4) operations.
A bank machine precharges a DRAM bank as soon as possible unless another pending
request targets the same bank. This is discussed in greater detail in the Precharge Policy
section.
Column commands can be reordered for the purpose of optimizing memory interface
throughput. The ordering algorithm nominally ensures data coherence. The reordering
feature is explained in greater detail in the Reordering section.
Rank Machines
The rank machines correspond to DRAM ranks. Rank machines monitor the activity of the
bank machines and track rank or device-specific timing parameters. For example, a rank
machine monitors the number of activate commands sent to a rank within a time window.
After the allowed number of activates have been sent, the rank machine generates an
inhibit signal that prevents the bank machines from sending any further activates to the
rank until the time window has shifted enough to allow more activates. Rank machines are
statically assigned to a physical DRAM rank.
Column Machine
The single column machine generates the timing information necessary to manage the DQ
data bus. Although there can be multiple DRAM ranks, because there is a single DQ bus, all
the columns in all DRAM ranks are managed as a single unit. The column machine monitors
commands issued by the bank machines and generates inhibit signals back to the bank
machines so that the DQ bus is utilized in an orderly manner.
Arbitration Block
The arbitration block receives requests to send commands to the DRAM array from the bank
machines. Row commands and column commands are arbitrated independently. For each
command opportunity, the arbiter block selects a row and a column command to forward to
the physical layer. The arbitration block implements a round-robin protocol to ensure
forward progress.
Reordering
DRAM accesses are broken into two quasi-independent parts, row commands and column
commands. Each request occupies a logical queue entry, and each queue entry has an
associated bank machine. Each bank machine tracks the state of the DRAM rank and bank it
is currently bound to, if any.
If necessary, the bank machine attempts to activate the proper rank, bank, or row on behalf
of the current request. In the process of doing so, the bank machine looks at the current
state of the DRAM to decide if various timing parameters are met. Eventually, all timing
parameters are met and the bank machine arbitrates to send the activate. The arbitration is
done in a simple round-robin manner. Arbitration is necessary because several bank
machines might request to send row commands (activate and precharge) at the same time.
Not all requests require an activate. If a preceding request has activated the same rank,
bank, or row, a subsequent request might inherit the bank machine state and avoid the
precharge/activate penalties.
After the necessary rank, bank, or row is activated and the RAS to CAS delay timing is met,
the bank machine tries to issue the CAS-READ or CAS-WRITE command. Unlike the row
command, all requests issue a CAS command. Before arbitrating to send a CAS command,
the bank machine must look at the state of the DRAM, the state of the DQ bus, priority, and
ordering. Eventually, all these factors assume their favorable states and the bank machine
arbitrates to send a CAS command. In a manner similar to row commands, a round-robin
arbiter uses a priority scheme and selects the next column command.
The round-robin arbiter itself is a source of reordering. Assume for example that an
otherwise idle Memory Controller receives a burst of new requests while processing a
refresh. These requests queue up and wait for the refresh to complete. After the DRAM is
ready to receive a new activate, all waiting requests assert their arbitration requests
simultaneously. The arbiter selects the next activate to send based solely on its round-robin
algorithm, independent of request order. Similar behavior can be observed for column
commands.
The controller supports NORM ordering mode. In this mode, the controller reorders reads
but not writes as needed to improve efficiency. All write requests are issued in the request
order relative to all other write requests, and requests within a given rank-bank retire in
order. This ensures that it is not possible to observe the result of a later write before an
earlier write completes.
Precharge Policy
The controller implements an aggressive precharge policy. The controller examines the
input queue of requests as each transaction completes. If no requests are in the queue for
a currently open bank/row, the controller closes it to minimize latency for requests to other
rows in the bank. Because the queue depth is equal to the number of bank machines,
greater efficiency can be obtained by increasing the number of bank machines
(nBANK_MACHS). As this number is increased, FPGA logic timing becomes more
challenging. In some situations, the overall system efficiency can be greater with an
increased number of bank machines and a lower memory clock frequency. Simulations
should be performed with the target design command behavior to determine the optimum
setting.
PHY
The PHY is considered the low-level physical interface to an external LPDDR3 SDRAM device
as well as all calibration logic for ensuring reliable operation of the physical interface itself.
The PHY generates the signal timing and sequencing required to interface to the memory
device. The PHY contains the following features:
• Clock/address/control-generation logic
• Write and read datapaths
• Logic for initializing the SDRAM after power-up
In addition, the PHY contains calibration logic to perform timing training of the read and
write datapaths to account for system static and dynamic delays.
The Memory Controller and calibration logic communicate with this dedicated PHY in the
slow frequency clock domain, which is the memory clock divided by four. A more detailed block
diagram of the PHY design is shown in Figure 10-3.
[Figure 10-3: PHY block diagram. The Memory Controller and user interface connect through
the DDR address/control, write data, and mask path and the cal_top block (MicroBlaze MCS,
calAddrDecode, and calibration debug support) to the XIPHY and I/O buffers, with the
infrastructure PLL clocking (pllclks, pllGate) and CalDone/status outputs.]
The Memory Controller is designed to separate out the command processing from the
low-level PHY requirements to ensure a clean separation between the controller and
physical layer. The command processing can be replaced with custom logic if desired, while
the logic for interacting with the PHY stays the same and can still be used by the calibration
logic.
The address unit connects the MCS to the local register set and the PHY by performing
address decode and control translation on the I/O module bus from spaces in the memory
map and MUXing return data (<module>_...cal_addr_decode.sv). In addition, it
provides address translation (also known as “mapping”) from a logical conceptualization of
the DRAM interface to the appropriate pinout-dependent location of the delay control in
the PHY address space.
Although the calibration architecture presents a simple and organized address map for
manipulating the delay elements for individual data, control and command bits, there is
flexibility in how those I/O pins are placed. For a given I/O placement, the path to the FPGA
logic is locked to a given pin. To enable a single binary software file to work with any
memory interface pinout, a translation block converts the simplified RIU addressing into
the pinout-specific RIU address for the target design (see Table 10-2).
The specific address translation is written by LPDDR3 SDRAM after a pinout is selected and
cannot be modified. The code shows an example of the RTL structure that supports this.
In this example, DQ0 is pinned out on Bit[0] of nibble 0 (nibble 0 according to instantiation
order). The RIU address for the ODELAY for Bit[0] is 0x0D. When DQ0 is addressed
(indicated by address 0x000_4100), this snippet of code is active. It enables nibble 0
(decoded to one-hot downstream) and forwards the address 0x0D to the RIU address bus.
The MicroBlaze I/O module interface is not always fast enough for implementing all of the
functions required in calibration. A helper circuit implemented in
<module>_...cal_addr_decode.sv is required to obtain commands from the
registers and translate at least a portion into single-cycle accuracy for submission to the
PHY. In addition, it supports command repetition to enable back-to-back read transactions
and read data comparison.
1. The built-in self-check of the PHY (BISC) is run. BISC is used in the PHY to compute
internal skews for use in voltage and temperature tracking after calibration is
completed.
2. After BISC is completed, calibration logic performs the required power-on initialization
sequence for the memory.
3. This is followed by several stages of timing calibration for the write and read datapaths.
4. After calibration is completed, PHY calculates internal offsets to be used in voltage and
temperature tracking.
5. PHY indicates calibration is finished and the controller begins issuing commands to the
memory.
Figure 10-4 shows the overall flow of memory initialization and the different stages of
calibration. Stages shown in dark gray are not available in this release.
[Figure 10-4: Memory initialization and calibration flow: System Reset → XIPHY BISC →
XSDB Setup → Write Leveling → DQS Gating → Read Leveling → Enable VT Tracking →
Calibration Complete.]
When simulating a design out of LPDDR3 SDRAM, calibration is set to be bypassed to
enable you to generate traffic to and from the DRAM as quickly as possible. When running
in hardware, or simulating with calibration enabled, signals are provided to indicate which
step of calibration is running or, if an error occurs, where the error occurred.
The first step in determining calibration status is to check the CalDone port. After the
CalDone port is checked, the status bits should be checked to indicate the steps that were
run and completed. Calibration halts on the very first error encountered, so the status bits
indicate which step of calibration was last run. The status and error signals can be checked
either by connecting the Vivado analyzer signals to these ports or through the XSDB
tool (also through Vivado).
The calibration status is provided through the XSDB port, which stores useful information
regarding calibration for display in the Vivado IDE. The calibration status and error signals
are also provided as ports to allow for debug or triggering. Table 10-3 lists the
pre-calibration status signal description.
Table 10-4 lists the status signals in the port as well as how they relate to the core XSDB
data. In the status port, the mentioned bits are valid and the rest are reserved.
During the CA training mode, the data sent on the address bus is returned on the DQ bus.
Each CA bit is delayed until a 0 to 1 transition is detected to find the left margin. Once the
left edge detection is complete, CK is moved to find the right edge. The ODELAY elements
are used on both CA and CK during this alignment.
Write Leveling
LPDDR3 write leveling allows the controller to adjust each write DQS phase independently
with respect to the CK forwarded to the LPDDR3 SDRAM device. This compensates for the
skew between DQS and CK and meets the tDQSS specification.
During write leveling, DQS is driven by the FPGA memory interface and DQ is driven by the
LPDDR3 SDRAM device to provide feedback. DQS is delayed until the 0 to 1 edge transition
on DQ is detected. The DQS delay is achieved using both ODELAY and coarse tap delays.
After the edge transition is detected, the write leveling algorithm centers on the noise
region around the transition to maximize margin. This second step is completed with only
the use of ODELAY taps. Any reference to “FINE” is the ODELAY search.
DQS Gate
During this stage of calibration, the read DQS preamble is detected and the gate to enable
data capture within the FPGA is calibrated to be one clock cycle before the first valid data
on DQ. The coarse and fine DQS gate taps (RL_DLY_COARSE and RL_DLY_FINE) are adjusted
during this stage. Read commands are issued with gaps in between to continually search for
the DQS preamble position. During this stage of calibration, only the read DQS signals are
monitored and not the read DQ signals. DQS Preamble Detection is performed sequentially
on a per byte basis.
During this stage of calibration, the coarse taps are first adjusted while searching for the
low preamble position and the first rising DQS edge, in other words, a DQS pattern of 00X1.
[Figure 10-5: DQS gate coarse resolution. Coarse taps 0 through 9 step across one memory
clock cycle of the LPDDR3 DQS pattern (00X1X0X1X0) while searching for the low preamble and
first rising edge.]
If the preamble is not found, the read latency is increased by one. The coarse taps are reset
and then adjusted again while searching for the low preamble and first rising DQS edge.
After the preamble position is properly detected, the fine taps are then adjusted to fine
tune and edge align the position of the sample clock with the DQS.
Read Leveling
Read Leveling is performed over multiple stages to maximize the data eye and center the
internal read sampling clock in the read DQ window for robust sampling. To perform this,
Read Leveling performs the following sequential steps:
1. Maximizes the DQ eye by removing skew and OCV effects using per bit read DQ deskew.
2. Sweeps DQS across all DQ bits and finds the center of the data eye using the
Multi-Purpose register data pattern. Centering of the data eye is completed for both the
DQS and DQS#.
3. Post calibration, continuously maintains the relative delay of DQS versus DQ across the
VT range.
At the end of this stage, the DQ bits are internally deskewed to the left edge of the
incoming DQS.
Depending on the interface type, the DQS could either be one CK cycle earlier than, two CK
cycles earlier than, or aligned to the CK edge that captures the write command.
This is a pattern based calibration where coarse adjustments are made on a per byte basis
until the expected on time write pattern is read back. The process is as follows:
Reads are then performed where the following patterns can be calibrated:
Write Latency Calibration can fail for the following cases and signify a board violation
between DQS and CK trace matching:
Enable VT Tracking
After all stages of calibration, a signal is sent to the XIPHY to recalibrate internal delays to
start voltage and temperature tracking. The XIPHY asserts a signal when complete,
phy2clb_phy_rdy_upp for upper nibbles and phy2clb_phy_rdy_low for lower
nibbles.
Reset Sequence
The sys_rst signal resets the entire memory design, which includes the general interconnect
(fabric) logic driven by the MMCM clock (clkout0) and the RIU logic. The MicroBlaze™
processor and calibration logic are driven by the MMCM clock (clkout6). The sys_rst input
signal is synchronized internally to create the ui_clk_sync_rst signal. The ui_clk_sync_rst
reset signal is synchronously asserted and synchronously deasserted.
Figure 10-6 shows that ui_clk_sync_rst (fabric reset) is synchronously asserted a few clock
cycles after sys_rst is asserted. When ui_clk_sync_rst is asserted, there are a few clocks
before the clocks are shut off.
Clocking
The memory interface requires one mixed-mode clock manager (MMCM), one TXPLL per I/O bank
used by the memory interface, and two BUFGs. These clocking components are used
to create the proper clock frequencies and phase shifts necessary for the proper operation
of the memory interface.
There are two TXPLLs per bank. If a bank is shared by two memory interfaces, both TXPLLs
in that bank are used.
Note: LPDDR3 SDRAM generates the appropriate clocking structure and no modifications to the
RTL are supported.
The LPDDR3 SDRAM tool generates the appropriate clocking structure for the desired
interface. This structure must not be modified. The allowed clock configuration is as
follows:
Requirements
GCIO
• Must use a differential I/O standard
• Must be in the same I/O column as the memory interface
• Must be in the same SLR of memory interface for the SSI technology devices
• The I/O standard and termination scheme are system dependent. For more information,
consult the UltraScale Architecture SelectIO Resources User Guide (UG571) [Ref 7].
MMCM
• MMCM is used to generate the FPGA logic system clock (1/4 of the memory clock)
• Must be located in the center bank of memory interface
• Must use internal feedback
• Input clock frequency divided by input divider must be ≥ 70 MHz (CLKINx / D ≥
70 MHz)
• Must use integer multiply and output divide values
° For two bank systems, the bank with the higher number of bytes selected is chosen
as the center bank. If the same number of bytes is selected in two banks, then the
top bank is chosen as the center bank.
TXPLL
• CLKOUTPHY from TXPLL drives XIPHY within its bank
• TXPLL must be set to use a CLKFBOUT phase shift of 90°
• TXPLL must be held in reset until the MMCM lock output goes High
• Must use internal feedback
Figure 11-1 shows an example of the clocking structure for a three bank memory interface.
The GCIO drives the MMCM located at the center bank of the memory interface. MMCM
drives both the BUFGs located in the same bank. The BUFG (which is used to generate
system clock to FPGA logic) output drives the TXPLLs used in each bank of the interface.
X-Ref Target - Figure 11-1
Memory Interface
BUFG
TXPLL
BUFG Differential
GCIO Input
I/O Bank 4
X24432-082420
• For two bank systems, MMCM is placed in a bank with the most number of bytes
selected. If they both have the same number of bytes selected in two banks, then
MMCM is placed in the top bank.
• For four bank systems, MMCM is placed in a second bank from the top.
For designs generated with System Clock configuration of No Buffer, MMCM must not be
driven by another MMCM/PLL. Cascading clocking structures MMCM → BUFG → MMCM
and PLL → BUFG → MMCM are not allowed.
If the MMCM is driven by the GCIO pin of the other bank, then the
CLOCK_DEDICATED_ROUTE constraint with value "BACKBONE" must be set on the net that
is driving MMCM or on the MMCM input. Setting up the CLOCK_DEDICATED_ROUTE
constraint on the net is preferred. But when the same net is driving two MMCMs, the
CLOCK_DEDICATED_ROUTE constraint must be managed by considering which MMCM
needs the BACKBONE route.
In such cases, the CLOCK_DEDICATED_ROUTE constraint can be set on the MMCM input. To
use the "BACKBONE" route, a clock buffer that exists in the same CMT tile as the GCIO
must be placed between the GCIO and the MMCM input. The clock buffers that exist in the I/O
CMT are BUFG, BUFGCE, BUFGCTRL, and BUFGCE_DIV. So LPDDR3 SDRAM instantiates a BUFG
between the GCIO and MMCM when the GCIO pins and MMCM are not in the same bank
(see Figure 11-1).
If the GCIO pin and MMCM are allocated in different banks, LPDDR3 SDRAM generates
CLOCK_DEDICATED_ROUTE constraints with value as "BACKBONE." If the GCIO pin and
MMCM are allocated in the same bank, there is no need to set any constraints on the
MMCM input.
Similarly when designs are generated with System Clock Configuration as a No Buffer
option, you must take care of the "BACKBONE" constraint and the BUFG/BUFGCE/
BUFGCTRL/BUFGCE_DIV between GCIO and MMCM if GCIO pin and MMCM are allocated in
different banks. LPDDR3 SDRAM does not generate clock constraints in the XDC file for No
Buffer configurations and you must take care of the clock constraints for No Buffer
configurations. For more information on clocking, see the UltraScale Architecture Clocking
Resources User Guide (UG572) [Ref 8].
For LPDDR3:
set_property CLOCK_DEDICATED_ROUTE BACKBONE [get_pins -hier -filter {NAME =~ */u_ddr_infrastructure/gen_mmcme*.u_mmcme_adv_inst/CLKIN1}]
For more information on the CLOCK_DEDICATED_ROUTE constraints, see the Vivado Design
Suite Properties Reference Guide (UG912) [Ref 9].
Note: If two different GCIO pins are used for two LPDDR3 SDRAM IP cores in the same bank, the center
bank of the memory interface is different for each IP. LPDDR3 SDRAM generates the MMCM LOC and
CLOCK_DEDICATED_ROUTE constraints accordingly.
1. LPDDR3 SDRAM generates a single-ended input for system clock pins, such as
sys_clk_i. Connect the differential buffer output to the single-ended system clock
inputs (sys_clk_i) of both the IP cores.
2. System clock pins must be allocated within the same I/O column of the memory
interface pins allocated. Add the pin LOC constraints for system clock pins and clock
constraints in your top-level XDC.
3. You must add a "BACKBONE" constraint on the net that is driving the MMCM or on the
MMCM input if GCIO pin and MMCM are not allocated in the same bank. Apart from
this, BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV must be instantiated between GCIO and
MMCM to use the "BACKBONE" route.
Note:
° The UltraScale architecture includes an independent XIPHY power supply and TXPLL
for each XIPHY. This results in clean, low jitter clocks for the memory system.
° Skew spanning across multiple BUFGs is not a concern because a single point of
contact exists between BUFG → TXPLL and the same BUFG → system clock logic.
° System input clock cannot span I/O columns because the longer the clock lines
span, the more jitter is picked up.
TXPLL Usage
There are two TXPLLs per bank. If a bank is shared by two memory interfaces, both TXPLLs
in that bank are used. One PLL per bank is used if a bank is used by a single memory
interface. You can use a second PLL for other usage. To use a second PLL, you can perform
the following steps:
1. Generate the design for the System Clock Configuration option as No Buffer.
2. LPDDR3 SDRAM generates a single-ended input for system clock pins, such as
sys_clk_i. Connect the differential buffer output to the single-ended system clock
inputs (sys_clk_i) and also to the input of PLL (PLL instance that you have in your
design).
3. You can use the PLL output clocks.
Additional Clocks
You can produce up to four additional clocks which are created from the same MMCM that
generates ui_clk. Additional clocks can be selected from the Clock Options section in the
Advanced tab. The GUI lists the possible clock frequencies from MMCM and the
frequencies for additional clocks vary based on selected memory frequency (Memory
Device Interface Speed (ps) value in the Basic tab), selected FPGA, and FPGA speed grade.
For situations where the memory interface is reset and recalibrated without a
reconfiguration of the FPGA, the SEM IP must be set into IDLE state to disable the memory
scan and to send the SEM IP back into the scanning (Observation or Detect only) states
afterwards. This can be done in two methods, through the “Command Interface” or “UART
interface.” See Chapter 3 of the UltraScale Architecture Soft Error Mitigation Controller
LogiCORE IP Product Guide (PG187) [Ref 10] for more information.
Resets
An asynchronous reset (sys_rst) input is provided. This is an active-High reset and the
sys_rst must assert for a minimum pulse width of 5 ns. The sys_rst can be an internal
or external pin.
IMPORTANT: If two controllers share a bank, they cannot be reset independently. The two controllers
must have a common reset input.
For more information on reset, see the Reset Sequence in Chapter 10, Core Architecture.
Note: The best possible calibration results are achieved when the FPGA activity is minimized from
the release of this reset input until the memory interface is fully calibrated as indicated by the
init_calib_complete port (see the User Interface section of this document).
a. If the DCI cascade option is disabled, a vrp pin is needed for DCI termination in each
bank that contains memory pins, so one vrp pin per such bank is reserved during pin
allocation.
b. If the bank contains any memory port(s), vrp must be reserved and must not be
allocated to any memory port or any other general I/O.
c. If the bank contains only system clock signals (sys_clk_p and sys_clk_n) and status
output pins (init_calib_complete, data_compare_error, and sys_rst_n),
vrp can be used as a normal I/O.
d. If the DCI cascade option is enabled, the vrp pin can be used for any memory port or any
other general I/O. DCI cascade rules are the same as the I/O DCI rules; there are
no memory-specific DCI cascade rules.
e. DCI cascade is valid for HP banks only.
RECOMMENDED: Xilinx strongly recommends that the DCIUpdateMode option is kept with the default
value of ASREQUIRED so that the DCI circuitry is allowed to operate normally.
5. All I/O banks used by the memory interface must be in the same column and must be in
the same SLR.
6. Maximum height of interface is two contiguous banks.
7. Bank skipping is not allowed.
8. The input clock must be connected to GCIO. The highest performance is achieved when
the input clock for the MMCM in the interface comes from the clock capable pair in the
I/O column used for the memory interface.
9. System clock pins (sys_clk_p and sys_clk_n) are restricted to the same I/O column as the banks allocated to the memory interface. They must also be in the same SLR as the memory interface for SSI technology devices.
10. System clock pins can also be allocated in the memory banks.
11. System clock pins must be allocated in the same SLR as the memory pins.
12. System control/status signals (init_calib_complete, data_compare_error, and
sys_rst_n) can be allocated in any bank in the device which also includes memory
banks. These signals can also be allocated across SLR.
13. There are dedicated VREF pins (not included in the rules above). Either internal or external VREF is permitted. If an external VREF is not used, the VREF pins must be pulled to ground by a resistor value specified in the UltraScale™ Architecture SelectIO™ Resources User Guide (UG571) [Ref 7]. These pins must be connected appropriately for the standard in use. When using external VREF for an LPDDR3 interface, provide the FPGA VREF pins with a 0.75V reference.
IMPORTANT: The system reset pin (sys_rst_n) must not be allocated to pin N0 or N6 if the byte is used in a memory interface. Consult the UltraScale Architecture SelectIO Resources User Guide (UG571) [Ref 7] for more information.
Pinout Swapping
• Pins can swap freely within each byte group (data and address/control), except for the
DQS pair which must be on the dedicated DQS pair in the nibble (for more information,
see the dqs, dq, and dm location in LPDDR3 Pin Rules).
• Byte groups (data and address/control) can swap easily with each other.
• Pins in the address/control byte groups can swap freely within and between their byte
groups.
• No other pin swapping is permitted.
Pinout Examples
IMPORTANT: Due to the calibration stage, there is no need for set_input_delay/
set_output_delay on the LPDDR3 SDRAM. Ignore the unconstrained inputs and outputs for
LPDDR3 SDRAM and the signals which are calibrated.
Table 11-1 shows an example of a 32-bit LPDDR3 interface contained in two banks. This
example is for a component interface using x32 LPDDR3 components.
Table 11-1: 32-Bit LPDDR3 Interface Contained in Two Banks
Bank Signal Name Byte Group I/O Type
Bank 1
1 – T3U_12 –
1 – T3U_11 N
1 – T3U_10 P
1 – T3U_9 N
1 – T3U_8 P
1 – T3U_7 N
1 – T3U_6 P
1 – T3L_5 N
1 – T3L_4 P
1 – T3L_3 N
1 – T3L_2 P
1 – T3L_1 N
1 – T3L_0 P
1 – T2U_12 –
1 – T2U_11 N
1 – T2U_10 P
1 – T2U_9 N
1 – T2U_8 P
1 – T2U_7 N
1 – T2U_6 P
1 – T2L_5 N
1 – T2L_4 P
1 – T2L_3 N
1 – T2L_2 P
1 – T2L_1 N
1 – T2L_0 P
1 – T1U_12 –
1 dq31 T1U_11 N
1 dq30 T1U_10 P
1 dq29 T1U_9 N
1 dq28 T1U_8 P
1 dqs3_c T1U_7 N
1 dqs3_t T1U_6 P
1 dq27 T1L_5 N
1 dq26 T1L_4 P
1 dq25 T1L_3 N
1 dq24 T1L_2 P
1 – T1L_1 N
1 dm3 T1L_0 P
1 vrp T0U_12 –
1 dq23 T0U_11 N
1 dq22 T0U_10 P
1 dq21 T0U_9 N
1 dq20 T0U_8 P
1 dqs2_c T0U_7 N
1 dqs2_t T0U_6 P
1 dq19 T0L_5 N
1 dq18 T0L_4 P
1 dq17 T0L_3 N
1 dq16 T0L_2 P
1 – T0L_1 N
1 dm2 T0L_0 P
Bank 2
2 ca0 T3U_12 –
2 ca1 T3U_11 N
2 ca2 T3U_10 P
2 ca3 T3U_9 N
2 ca4 T3U_8 P
2 ca5 T3U_7 N
2 ca6 T3U_6 P
2 ca7 T3L_5 N
2 ca8 T3L_4 P
2 ca9 T3L_3 N
2 cs_n T3L_2 P
2 ck_c T3L_1 N
2 ck_t T3L_0 P
2 – T2U_12 –
2 – T2U_11 N
2 – T2U_10 P
2 – T2U_9 N
2 – T2U_8 P
2 – T2U_7 N
2 – T2U_6 P
2 – T2L_5 N
2 – T2L_4 P
2 – T2L_3 N
2 – T2L_2 P
2 sys_clk_n T2L_1 N
2 sys_clk_p T2L_0 P
2 – T1U_12 –
2 dq15 T1U_11 N
2 dq14 T1U_10 P
2 dq13 T1U_9 N
2 dq12 T1U_8 P
2 dqs1_c T1U_7 N
2 dqs1_t T1U_6 P
2 dq11 T1L_5 N
2 dq10 T1L_4 P
2 dq9 T1L_3 N
2 dq8 T1L_2 P
2 odt T1L_1 N
2 dm1 T1L_0 P
2 vrp T0U_12 –
2 dq7 T0U_11 N
2 dq6 T0U_10 P
2 dq5 T0U_9 N
2 dq4 T0U_8 P
2 dqs0_c T0U_7 N
2 dqs0_t T0U_6 P
2 dq3 T0L_5 N
2 dq2 T0L_4 P
2 dq1 T0L_3 N
2 dq0 T0L_2 P
2 cke T0L_1 N
2 dm0 T0L_0 P
Protocol Description
This core has a user interface.
User Interface
The user interface signals are described in Table 11-2. The user interface connects to an FPGA user design to allow access to an external memory device and is layered on top of the native interface, which is described earlier in the controller description.
app_addr[APP_ADDR_WIDTH – 1:0]
This input indicates the address for the request currently being submitted to the user
interface. The user interface aggregates all the address fields of the external SDRAM and
presents a flat address space.
The address mapping in ROW_BANK_COLUMN ordering is depicted in Table 11-3 and Figure 11-2.
Figure 11-2: User Address to Memory Address Mapping (ROW_BANK_COLUMN Ordering)
The address mapping in BANK_ROW_COLUMN ordering is depicted in Table 11-4 and Figure 11-3.
Figure 11-3: User Address to Memory Address Mapping (BANK_ROW_COLUMN Ordering)
app_cmd[2:0]
This input specifies the command for the request currently being submitted to the user
interface. The available commands are shown in Table 11-5.
app_en
This input strobes in a request. Apply the desired values to app_addr[], app_cmd[2:0], and
app_hi_pri, and then assert app_en to submit the request to the user interface. This
initiates a handshake that the user interface acknowledges by asserting app_rdy.
app_hi_pri
This input indicates that the current request is a high priority.
app_wdf_data[APP_DATA_WIDTH – 1:0]
This bus provides the data currently being written to the external memory.
app_wdf_end
This input indicates that the data on the app_wdf_data[] bus in the current cycle is the
last data for the current request.
app_wdf_mask[APP_MASK_WIDTH – 1:0]
This bus indicates which bytes of app_wdf_data[] are written to the external memory and which bytes remain in their current state. A byte is masked by setting the corresponding bit of app_wdf_mask to 1. For example, if the application data width is 256, the mask width is 32. The least significant byte, app_wdf_data[7:0], is masked by Bit[0] of app_wdf_mask, and the most significant byte, app_wdf_data[255:248], is masked by Bit[31] of app_wdf_mask. Hence, to mask bytes 0, 1, 2, and 3 of app_wdf_data (the least significant DWORD), set app_wdf_mask to 32'h0000_000F.
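A minimal Verilog sketch of this byte-mask convention is shown below, assuming the 256-bit application data width from the example; the module and signal names (wdf_mask_example, wr_payload) are illustrative only.

    // Sketch: drive a constant mask that protects bytes 0..3 (the least
    // significant DWORD) while writing all other bytes.
    module wdf_mask_example (
      input  wire [255:0] wr_payload,       // user write data (assumed name)
      output wire [255:0] app_wdf_data,
      output wire [31:0]  app_wdf_mask
    );
      assign app_wdf_data = wr_payload;
      assign app_wdf_mask = 32'h0000_000F;  // mask bit[i] = 1 masks byte i
    endmodule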
app_wdf_wren
This input indicates that the data on the app_wdf_data[] bus is valid.
app_rdy
This output indicates whether the request currently being submitted to the user interface is
accepted. If the user interface does not assert this signal after app_en is asserted, the
current request must be retried. The app_rdy output is not asserted if:
° All the bank machines are occupied (can be viewed as the command buffer being full).
° A read is requested and the read buffer is full.
° A write is requested and no write buffer pointers are available.
app_rd_data[APP_DATA_WIDTH – 1:0]
This output contains the data read from the external memory.
app_rd_data_end
This output indicates that the data on the app_rd_data[] bus in the current cycle is the
last data for the current request.
app_rd_data_valid
This output indicates that the data on the app_rd_data[] bus is valid.
app_wdf_rdy
This output indicates that the write data FIFO is ready to receive data. Write data is accepted
when both app_wdf_rdy and app_wdf_wren are asserted.
ui_clk_sync_rst
This is the reset from the user interface, which is synchronous with ui_clk.
ui_clk
This is the output clock from the user interface. It is one-half or one-quarter of the frequency of the clock going out to the external SDRAM, depending on whether 2:1 or 4:1 mode is selected in the Vivado IDE.
init_calib_complete
PHY asserts init_calib_complete when calibration is finished. The application has no
need to wait for init_calib_complete before sending commands to the Memory
Controller.
Command Path
When the user logic app_en signal is asserted and the app_rdy signal is asserted from the
user interface, a command is accepted and written to the FIFO by the user interface. The
command is ignored by the user interface whenever app_rdy is deasserted. The user logic
needs to hold app_en High along with the valid command, autoprecharge, and address
values until app_rdy is asserted as shown for the "write with autoprecharge" transaction in
Figure 11-4.
X-Ref Target - Figure 11-4
Figure 11-4: User Interface Command Timing Diagram with app_rdy Asserted
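A minimal user-logic sketch of this handshake is shown below; the module name, the cmd_pending/next_cmd/next_addr signals, and the APP_ADDR_WIDTH default are assumptions, not part of the generated IP.

    // Sketch: present a request and hold it until app_rdy accepts it.
    module app_cmd_driver #(parameter APP_ADDR_WIDTH = 29) (
      input  wire                      ui_clk,
      input  wire                      ui_clk_sync_rst,
      input  wire                      app_rdy,
      input  wire                      cmd_pending,   // user request available
      input  wire [2:0]                next_cmd,      // command code from Table 11-5
      input  wire [APP_ADDR_WIDTH-1:0] next_addr,
      output reg                       app_en,
      output reg  [2:0]                app_cmd,
      output reg  [APP_ADDR_WIDTH-1:0] app_addr
    );
      always @(posedge ui_clk)
        if (ui_clk_sync_rst)
          app_en <= 1'b0;
        else if (!app_en || app_rdy) begin
          // No request outstanding, or the previous one was just accepted.
          app_en   <= cmd_pending;
          app_cmd  <= next_cmd;
          app_addr <= next_addr;
        end
        // While app_en is High and app_rdy is Low, the request is held stable.
    endmodule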
A non back-to-back write command can be issued as shown in Figure 11-5. This figure depicts three scenarios for the app_wdf_data, app_wdf_wren, and app_wdf_end signals, which differ in when the write data is presented relative to the write command (with the command, before it, or after it). For write data that is output after the write command has been registered, as shown in Note 3 (Figure 11-5), the maximum delay is two clock cycles.
Figure 11-5: 4:1 Mode User Interface Write Timing Diagram (Memory Burst Type = BL8)
Write Path
The write data is registered in the write FIFO when app_wdf_wren is asserted and
app_wdf_rdy is High (Figure 11-6). If app_wdf_rdy is deasserted, the user logic needs to
hold app_wdf_wren and app_wdf_end High along with the valid app_wdf_data value
until app_wdf_rdy is asserted. The app_wdf_mask signal can be used to mask out the
bytes to write to external memory.
Figure 11-6: 4:1 Mode User Interface Back-to-Back Write Commands Timing Diagram (Memory Burst Type = BL8)
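The same hold-until-ready rule applies to the write data FIFO, as in the hedged sketch below; wr_data_pending and next_wr_data are assumed user-side names, and the single-beat app_wdf_end behavior reflects the 4:1 BL8 case shown in the figures.

    // Sketch: write data is accepted only when app_wdf_wren and app_wdf_rdy
    // are both High, so hold wren/end/data whenever app_wdf_rdy is Low.
    module app_wdf_driver #(parameter APP_DATA_WIDTH = 256) (
      input  wire                      ui_clk,
      input  wire                      ui_clk_sync_rst,
      input  wire                      app_wdf_rdy,
      input  wire                      wr_data_pending,   // user data available
      input  wire [APP_DATA_WIDTH-1:0] next_wr_data,
      output reg                       app_wdf_wren,
      output reg                       app_wdf_end,
      output reg  [APP_DATA_WIDTH-1:0] app_wdf_data
    );
      always @(posedge ui_clk)
        if (ui_clk_sync_rst) begin
          app_wdf_wren <= 1'b0;
          app_wdf_end  <= 1'b0;
        end else if (!app_wdf_wren || app_wdf_rdy) begin
          app_wdf_wren <= wr_data_pending;
          app_wdf_end  <= wr_data_pending;  // one UI beat per BL8 burst in 4:1 mode
          app_wdf_data <= next_wr_data;
        end
    endmodule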
The map of the application interface data to the DRAM output data can be explained with
an example. For a 4:1 Memory Controller to DRAM clock ratio with an 8-bit memory, at the
application interface, if the 64-bit data driven is 0000_0806_0000_0805 (Hex), the data
values at different clock edges are as shown in Table 11-6. This is for a BL8 (Burst Length 8)
transaction.
Table 11-7 shows a generalized representation of how DRAM DQ bus data is concatenated
to form application interface data signals. app_wdf_data is shown in Table 11-7, but the
table applies equally to app_rd_data. Each byte of the DQ bus has eight bursts, Rise0
(burst 0) through Fall3 (burst 7) as shown previously in Table 11-6, for a total of 64 data bits.
When concatenated with Rise0 in the LSB position and Fall3 in the MSB position, a 64-bit
chunk of the app_wdf_data signal is formed.
For example, the eight bursts of lpddr3_dq[7:0] corresponds to DQ bus byte 0, and
when concatenated as described here, they map to app_wdf_data[63:0]. To be clear on
the concatenation order, lpddr3_dq[0] from Rise0 (burst 0) maps to app_wdf_data[0],
and lpddr3_dq[7] from Fall3 (burst 7) maps to app_wdf_data[63]. The table shows a
second example, mapping DQ byte 1 to app_wdf_data[127:64], as well as the formula
for DQ byte N.
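A hedged Verilog fragment of this concatenation for one DQ byte is shown below; it is intended to sit inside a module, the rise/fall input names are illustrative, and only the bit ordering is taken from the description above.

    // Sketch: concatenate the eight BL8 bursts of one 8-bit DQ byte into the
    // 64-bit chunk of the user interface data bus, Rise0 in the LSBs and
    // Fall3 in the MSBs.
    function [63:0] byte_bursts_to_ui;
      input [7:0] rise0, fall0, rise1, fall1, rise2, fall2, rise3, fall3;
      begin
        // lpddr3_dq[0] of Rise0 maps to bit 0; lpddr3_dq[7] of Fall3 maps to bit 63.
        byte_bursts_to_ui = {fall3, rise3, fall2, rise2, fall1, rise1, fall0, rise0};
      end
    endfunction
    // DQ byte N then occupies app_wdf_data[64*N +: 64]
    // (byte 0 -> [63:0], byte 1 -> [127:64], and so on); the same mapping
    // applies to app_rd_data.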
Read Path
The read data is returned by the user interface in the requested order and is valid when
app_rd_data_valid is asserted (Figure 11-7 and Figure 11-8). The app_rd_data_end
signal indicates the end of each read command burst and is not needed in user logic.
X-Ref Target - Figure 11-7
Figure 11-7: 4:1 Mode User Interface Read Timing Diagram (Memory Burst Type = BL8) #1
Figure 11-8: 4:1 Mode User Interface Read Timing Diagram (Memory Burst Type = BL8) #2
In Figure 11-8, the read data returned is always in the same order as the requests made on
the address/control bus.
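A hedged fragment for capturing read data is shown below; it assumes ui_clk, ui_clk_sync_rst, app_rd_data, app_rd_data_valid, and APP_DATA_WIDTH are in scope, and the rd_capture/rd_beat_cnt names are illustrative.

    // Sketch: read data is sampled only while app_rd_data_valid is High.
    // Because data returns in request order, a simple beat counter (or a FIFO)
    // is enough to associate beats with outstanding read requests.
    reg [APP_DATA_WIDTH-1:0] rd_capture;
    reg [31:0]               rd_beat_cnt;

    always @(posedge ui_clk)
      if (ui_clk_sync_rst)
        rd_beat_cnt <= 32'd0;
      else if (app_rd_data_valid) begin
        rd_capture  <= app_rd_data;
        rd_beat_cnt <= rd_beat_cnt + 32'd1;
      end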
Periodic Reads
The FPGA DDR PHY requires two back-to-back DRAM RD or RDA commands to be issued every 1 µs. This requirement is described in the User Interface section. When the controller is
writing and the 1 µs periodic reads are due, the reads are injected by the controller to the
address of the next read/write in the queue. When the controller is idle and no reads or
writes are requested, the periodic reads use the last address accessed. If this address has
been closed, an activate is required. This injected read is issued to the DRAM following the
normal mechanisms of the controller issuing transactions. The key difference is that no read
data is returned to the UI. This is wasted DRAM bandwidth.
User interface patterns with long strings of write transactions are affected the most by the
PHY periodic read requirement. Consider a pattern with a 50/50 read/write transaction
ratio, but organized such that the pattern alternates between 2 µs bursts of 100% page hit
reads and 2 µs bursts of 100% page hit writes. The periodic reads are injected in the 2 µs
write burst, resulting in a loss of efficiency due to the read command and the turnaround
time to switch the DRAM and DDR bus from writes to reads back to writes. This 2 µs
alternating burst pattern is slightly more efficient than alternating between reads and
writes every 1 µs. A 1 µs or shorter alternating pattern would eliminate the need for the
controller to inject reads, but there would still be more read-write turnarounds.
Bus turnarounds are expensive in terms of efficiency and should be avoided if possible.
Long bursts of page hit writes, > 2 µs in duration, are still the most efficient way to write to
the DRAM, but the impact of one write-read-write turnaround each 1 µs must be taken into
account when calculating the maximum write efficiency.
• Memory IP lists the possible Reference Input Clock Speed values based on the targeted
memory frequency (based on selected Memory Device Interface Speed).
• Otherwise, select the M and D Options and target the desired Reference Input Clock Speed, which is calculated from the selected CLKFBOUT_MULT (M), DIVCLK_DIVIDE (D), and CLKOUT0_DIVIDE (D0) values in the Advanced Clocking tab.
The required Reference Input Clock Speed is calculated from the M, D, and D0 values
entered in the GUI using the following formulas:
Where tCK is the Memory Device Interface Speed selected in the Basic tab.
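The formulas themselves do not survive in this copy of the text. As a reconstruction from the standard MMCM frequency relationships, assuming CLKOUT0 runs at one quarter of the memory clock for the 4:1 LPDDR3 interface (verify against the values reported in the Vivado IDE), the relationship can be written as:

\[
  f_{CLKOUT0} \;=\; f_{REF} \times \frac{M}{D \times D0},
  \qquad
  f_{CLKOUT0} \;=\; \frac{1}{4\, t_{CK}}
  \;\;\Rightarrow\;\;
  f_{REF} \;=\; \frac{D \times D0}{4\, M\, t_{CK}}
\]

where f_REF is the Reference Input Clock Speed, f_CLKOUT0 is the MMCM CLKOUT0 (system clock) frequency, and tCK is the Memory Device Interface Speed (the memory clock period).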
The Reference Input Clock Speed calculated from the M, D, and D0 values is validated against the clocking guidelines. For more information on clocking rules, see Clocking.
In addition to the memory-specific clocking rules, the GUI validates the M, D, and D0 values against the allowed MMCM input frequency range, MMCM VCO frequency range, and MMCM PFD frequency range.
For UltraScale devices, see Kintex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS892) [Ref 2] and Virtex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS893) [Ref 3] for MMCM Input frequency range, MMCM VCO frequency
range, and MMCM PFD frequency range values.
For UltraScale+ devices, see Kintex UltraScale+ FPGAs Data Sheet: DC and AC Switching
Characteristics (DS922) [Ref 4], Virtex UltraScale+ FPGAs Data Sheet: DC and AC Switching
Characteristics (DS923) [Ref 5], and Zynq UltraScale+ MPSoC Data Sheet: DC and AC
Switching Characteristics (DS925) [Ref 6] for MMCM Input frequency range, MMCM VCO
frequency range, and MMCM PFD frequency range values.
For possible M, D, and D0 values and detailed information on clocking and the MMCM, see
the UltraScale Architecture Clocking Resources User Guide (UG572) [Ref 8].
• Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
[Ref 13]
• Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14]
• Vivado Design Suite User Guide: Getting Started (UG910) [Ref 15]
• Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16]
This section includes information about using Xilinx ® tools to customize and generate the
core in the Vivado Design Suite.
If you are customizing and generating the core in the IP integrator, see the Vivado Design
Suite User Guide: Designing IP Subsystems using IP Integrator (UG994) [Ref 13] for detailed
information. IP integrator might auto-compute certain configuration values when
validating or generating the design. To check whether the values change, see the
description of the parameter in this chapter. To view the parameter value, run the
validate_bd_design command in the Tcl Console.
You can customize the IP for use in your design by specifying values for the various parameters associated with the IP core, as described in the following sections.
For more information about generating the core in Vivado, see the Vivado Design Suite User
Guide: Designing with IP (UG896) [Ref 14] and the Vivado Design Suite User Guide: Getting
Started (UG910) [Ref 15].
Note: Figures in this chapter are illustrations of the Vivado Integrated Design Environment (IDE).
This layout might vary from the current version.
Basic Tab
Figure 12-1 shows the Basic tab when you start up the LPDDR3 SDRAM.
X-Ref Target - Figure 12-1
IMPORTANT: All parameters shown in the controller options dialog box are limited selection options in
this release.
For the Vivado IDE, all controllers (DDR3, DDR4, LPDDR3, QDR II+, QDR-IV, and RLDRAM 3)
can be created and available for instantiation.
1. Select the settings in the Clocking, Controller Options, Memory Options, and
Advanced User Request Controller Options.
In Clocking, the Memory Device Interface Speed sets the speed of the interface. The
speed entered drives the available Reference Input Clock Speeds. For more
information on the clocking structure, see the Clocking, page 284.
2. To use memory parts that are not available by default through the LPDDR3 SDRAM Vivado IDE, you can create a custom parts CSV file, as specified in AR: 63462. This CSV file has to be provided after enabling the Custom Parts Data File option. After selecting this option, you can see the custom memory parts along with the default memory parts. Note that simulations are not supported for custom parts. Custom part simulations require manually adding the memory model to the simulation and might require modifying the test bench instantiation.
Figure 12-2: Vivado Customize IP Dialog Box for LPDDR3 – Advanced Clocking
Figure 12-3: Vivado Customize IP Dialog Box for LPDDR3 – Advanced Options
Figure 12-4: Vivado Customize IP Dialog Box – LPDDR3 SDRAM I/O Planning and Design Checklist
User Parameters
Table 12-1 shows the relationship between the fields in the Vivado IDE and the User
Parameters (which can be viewed in the Tcl Console).
Output Generation
For details, see the Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14].
I/O Planning
LPDDR3 SDRAM I/O pin planning is completed with the full design pin planning using the
Vivado I/O Pin Planner. LPDDR3 SDRAM I/O pins can be selected through several Vivado
I/O Pin Planner features including assignments using I/O Ports view, Package view, or
Memory Bank/Byte Planner. Pin assignments can additionally be made through importing
an XDC or modifying the existing XDC file.
These options are available for all LPDDR3 SDRAM designs, and pin planning for multiple LPDDR3 SDRAM IP instances can be completed in one session. To learn more about the available Memory IP pin
planning options, see the Vivado Design Suite User Guide: I/O and Clock Planning (UG899)
[Ref 18].
Required Constraints
For LPDDR3 SDRAM Vivado IDE, you specify the pin location constraints. For more
information on I/O standard and other constraints, see the Vivado Design Suite User Guide:
I/O and Clock Planning (UG899) [Ref 18]. The location is chosen by the Vivado IDE
according to the banks and byte lanes chosen for the design.
The I/O standard is chosen by the memory type selection and options in the Vivado IDE and
by the pin type. A sample for dq[0] is shown here.
For HR banks, update the output_impedance of all the ports assigned to HR bank pins using the reset_property command. For more information, see AR: 63852.
IMPORTANT: Do not alter these constraints. If the pin locations need to be altered, rerun the LPDDR3
SDRAM Vivado IDE to generate a new XDC file.
Clock Frequencies
This section is not applicable for this IP core.
Clock Management
For more information on clocking, see Clocking, page 284.
Clock Placement
This section is not applicable for this IP core.
Banking
This section is not applicable for this IP core.
Transceiver Placement
This section is not applicable for this IP core.
IMPORTANT: The set_input_delay and set_output_delay constraints are not needed on the
external memory interface pins in this design due to the calibration process that automatically runs at
start-up. Warnings seen during implementation for the pins can be ignored.
Simulation
For comprehensive information about Vivado simulation components, as well as
information about using supported third-party tools, see the Vivado Design Suite User
Guide: Logic Simulation (UG900) [Ref 16]. For more information on simulation, see
Chapter 13, Example Design and Chapter 14, Test Bench.
Example Design
This chapter contains information about the example design provided in the Vivado®
Design Suite. Vivado supports Open IP Example Design flow. To create the example design
using this flow, right-click the IP in the Source Window, as shown in Figure 13-1 and select
Open IP Example Design.
X-Ref Target - Figure 13-1
This option creates a new Vivado project. Upon selecting the menu, a dialog box to enter
the directory information for the new design project opens.
Select a directory, or use the defaults, and click OK. This launches a new Vivado with all of
the example design files and a copy of the IP.
The example design can be simulated using one of the methods in the following sections.
Project-Based Simulation
This method can be used to simulate the example design using the Vivado Integrated
Design Environment (IDE). Memory IP delivers memory models for LPDDR3.
The Vivado simulator, Questa Advanced Simulator, IES, and VCS tools are used for LPDDR3
IP verification at each software release. The Vivado simulation tool has been used for LPDDR3 IP verification starting with the 2017.1 Vivado software release. The following subsections describe the steps to run a project-based simulation using each supported simulator tool.
5. In the Flow Navigator window, select Run Simulation and select Run Behavioral
Simulation option as shown in Figure 13-3.
X-Ref Target - Figure 13-3
6. Vivado invokes Vivado simulator and simulations are run in the Vivado simulator tool.
For more information, see the Vivado Design Suite User Guide: Logic Simulation (UG900)
[Ref 16].
4. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 13-5.
X-Ref Target - Figure 13-5
5. Vivado invokes Questa Advanced Simulator and simulations are run in the Questa
Advanced Simulator tool. For more information, see the Vivado Design Suite User Guide:
Logic Simulation (UG900) [Ref 16].
4. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 13-5.
5. Vivado invokes IES and simulations are run in the IES tool. For more information, see the
Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16].
4. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 13-5.
5. Vivado invokes VCS and simulations are run in the VCS tool. For more information, see
the Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16].
If the design is generated with the Reference Input Clock option selected as No Buffer (at
Advanced > FPGA Options > Reference Input), the CLOCK_DEDICATED_ROUTE
constraints and BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV instantiation based on GCIO and MMCM allocation need to be handled manually for the IP flow. LPDDR3 SDRAM does not generate clock constraints in the XDC file for No Buffer configurations, so you must add the clock constraints yourself for No Buffer configurations in the IP flow.
For an example design flow with No Buffer configurations, LPDDR3 SDRAM generates the
example design with differential buffer instantiation for system clock pins. LPDDR3 SDRAM
generates clock constraints in the example_design.xdc. It also generates a
CLOCK_DEDICATED_ROUTE constraint as the “BACKBONE” and instantiates BUFG/BUFGCE/
BUFGCTRL/BUFGCE_DIV between GCIO and MMCM input if the GCIO and MMCM are not in
same bank to provide a complete solution. This is done for the example design flow as a
reference when it is generated for the first time.
If in the example design, the I/O pins of the system clock pins are changed to some other
pins with the I/O pin planner, the CLOCK_DEDICATED_ROUTE constraints and BUFG/
BUFGCE/BUFGCTRL/BUFGCE_DIV instantiation need to be managed manually. A DRC error
is reported for the same.
Test Bench
This chapter contains information about the test bench provided in the Vivado ® Design
Suite.
The example design of the LPDDR3 Memory Controller generates either a simple test bench
or an Advanced Traffic Generator based on the Example Design Test Bench input in the
Vivado Integrated Design Environment wizard. For more information on the traffic
generators, see Chapter 36, Traffic Generator.
Overview
Product Specification
Core Architecture
Example Design
Test Bench
Overview
IMPORTANT: This document supports QDR II+ SRAM core v1.4.
• Hardware, IP, and Platform Development: Creating the PL IP blocks for the hardware
platform, creating PL kernels, subsystem functional simulation, and evaluating the
Vivado ® timing, resource and power closure. Also involves developing the hardware
platform for system integration. Topics in this document that apply to this design
process include:
° Clocking
° Resets
° Protocol Description
° Example Design
Core Overview
The Xilinx UltraScale™ architecture includes the QDR II+ SRAM core. This core provides
solutions for interfacing with the QDR II+ SRAM memory type.
The QDR II+ SRAM core is a physical layer for interfacing Xilinx UltraScale FPGA user
designs to the QDR II+ SRAM devices. QDR II+ SRAMs offer high-speed data transfers on
separate read and write buses on the rising and falling edges of the clock. These memory
devices are used in high-performance systems as temporary data storage, such as:
The QDR II+ SRAM solutions core is a PHY that takes simple user commands, converts them
to the QDR II+ protocol, and provides the converted commands to the memory. The design
enables you to provide one read and one write request per cycle, eliminating the need for a Memory Controller and the associated overhead, thereby reducing the latency through the core.
Figure 15-1 shows a high-level block diagram of the QDR II+ SRAM interface solution.
X-Ref Target - Figure 15-1
Figure 15-1: High-Level Block Diagram of the QDR II+ SRAM Interface Solution
The physical layer includes the hard blocks inside the FPGA and the soft calibration logic
necessary to ensure optimal timing of the hard blocks interfacing to the memory part.
The QDR II+ memories do not require an elaborate initialization procedure. However,
you must ensure that the Doff_n signal is provided to the memory as required by the
vendor. The QDR II+ SRAM interface design provided by the QDR II+ IP drives the
Doff_n signal from the FPGA. After the internal MMCM has locked, the Doff_n signal
is asserted High for 100 µs without issuing any commands to the memory device.
For memory devices that require the Doff_n signal to be terminated at the memory
and not be driven from the FPGA, you must perform the required initialization
procedure.
• Calibration – The calibration modules provide a complete method to set all delays in
the hard blocks and soft IP to work with the memory interface. Each bit is individually
trained and then combined to ensure optimal interface performance. Results of the calibration process are available through the Xilinx debug tools. After completion of calibration, the PHY layer presents a raw interface to the memory part.
Feature Summary
• Component support for interface widths up to 36 bits
• x18 and x36 memory device support
• 4-word and 2-word burst support
• Only HSTL_I I/O standard support
• Cascaded data width support is available only for BL-4 designs
• Data rates up to 1,266 Mb/s for BL-4 designs
• Data rates up to 900 Mb/s for BL-2 designs
• Memory device support with 72 Mb density
• Support for 2.0 and 2.5 cycles of Read Latency
• Other densities for memory device support are available through custom part selection
• Source code delivery in Verilog and System Verilog
• 2:1 memory to FPGA logic interface clock ratio
• Interface calibration and training information available through the Vivado hardware
manager
Information about other Xilinx LogiCORE IP modules is available at the Xilinx Intellectual
Property page. For information on pricing and availability of other Xilinx LogiCORE IP
modules and tools, contact your local Xilinx sales representative.
License Checkers
If the IP requires a license key, the key must be verified. The Vivado® design tools have
several license checkpoints for gating licensed IP through the flow. If the license check
succeeds, the IP can continue generation. Otherwise, generation halts with an error. License
checkpoints are enforced by the following tools:
• Vivado synthesis
• Vivado implementation
• write_bitstream (Tcl command)
IMPORTANT: IP license level is ignored at checkpoints. The test confirms a valid license exists. It does
not check IP license level.
Product Specification
Standards
This core complies with the QDR II+ SRAM standard defined by the QDR Consortium. For
more information on UltraScale™ architecture documents, see References, page 789.
Performance
Maximum Frequencies
For more information on the maximum frequencies, see the following documentation:
Resource Utilization
For full details about performance and resource utilization, visit Performance and Resource
Utilization.
Port Descriptions
There are three port categories at the top-level of the memory interface core called the
“user design.”
• The first category is the memory interface signals that directly interface with the memory part. These are defined by the QDR II+ SRAM specification.
• The second category is the application interface signals, which are referred to as the “user interface.” This is described in the Protocol Description, page 370.
• The third category includes other signals necessary for proper operation of the core.
These include the clocks, reset, and status signals from the core. The clocking and reset
signals are described in their respective sections.
Core Architecture
This chapter describes the UltraScale™ architecture-based FPGAs Memory Interface
Solutions core with an overview of the modules and interfaces.
Overview
The UltraScale architecture-based FPGAs Memory Interface Solutions core is shown in Figure 17-1.
X-Ref Target - Figure 17-1
Figure 17-1: UltraScale Architecture-Based FPGAs Memory Interface Solution Core
The user interface uses a simple protocol based entirely on SDR signals to make read and
write requests. For more details describing this protocol, see User Interface in Chapter 18.
There is no requirement for a controller in the QDR II+ SRAM protocol; thus, the Memory Controller contains only the physical interface. It takes commands from the user interface and adheres to the protocol requirements of the QDR II+ SRAM device. It is responsible for generating the proper timing relationships and DDR signaling to communicate with the external memory device. For more details, see Memory Interface in Chapter 18.
PHY
The PHY is considered the low-level physical interface to an external QDR II+ SRAM device.
It contains the entire calibration logic for ensuring reliable operation of the physical
interface itself. The PHY generates the signal timing and sequencing required to interface to
the memory device. The PHY contains the following:
• Clock/address/control-generation logic
• Write and read datapaths
• Logic for initializing the memory after power-up
In addition, the PHY contains calibration logic to perform timing training of the read and
write datapaths to account for system static and dynamic delays.
The user interface and calibration logic communicate with this dedicated PHY in the slow
frequency clock domain, which is divided by 2. A more detailed block diagram of the PHY
design is shown in Figure 17-2.
X-Ref Target - Figure 17-2
Figure 17-2: PHY Block Diagram (pll.sv, qdriip_phy.sv, qdriip_cal.sv, qdriip_cal_addr_decode.sv, qdriip_xiphy.sv, qdriip_iob.sv, config_rom.sv, and the MicroBlaze processor)
The PHY architecture encompasses all of the logic contained in qdriip_xiphy.sv. The
PHY contains wrappers around dedicated hard blocks to build up the memory interface
from smaller components. A byte lane contains all of the clocks, resets, and datapaths for a
given subset of I/O. Multiple byte lanes are grouped together, along with dedicated
clocking resources, to make up a single bank memory interface. For more information on
the hard silicon physical layer architecture, see the UltraScale™ Architecture SelectIO™
Resources User Guide (UG571) [Ref 7].
The address unit connects the MCS to the local register set and the PHY by performing
address decode and control translation on the I/O module bus from spaces in the memory
map and MUXing return data (qdriip_cal_adr_decode.sv). In addition, it provides
address translation (also known as “mapping”) from a logical conceptualization of the
DRAM interface to the appropriate pinout-dependent location of the delay control in the
PHY address space.
Although the calibration architecture presents a simple and organized address map for
manipulating the delay elements for individual data, control and command bits, there is
flexibility in how those I/O pins are placed. For a given I/O placement, the path to the FPGA
logic is locked to a given pin. To enable a single binary software file to work with any
memory interface pinout, a translation block converts the simplified Register Interface Unit
(RIU) addressing into the pinout-specific RIU address for the target design. The specific
address translation is written by QDR II+ SRAM after a pinout is selected. The code shows
an example of the RTL structure that supports this.
In this example, DQ0 is pinned out on Bit[0] of nibble 0 (nibble 0 according to instantiation order). The RIU address for the ODELAY for Bit[0] is 0x0D. When DQ0 is addressed (indicated by address 0x000_4100), this snippet of code is active. It enables nibble 0 (decoded to one-hot downstream) and forwards the address 0x0D to the RIU address bus.
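The referenced snippet is not reproduced in this copy of the text; the following Verilog sketch only illustrates the kind of structure described, and the signal names and widths (cal_addr, riu_nibble_sel, riu_addr_out, riu_sel_valid) are assumptions rather than the generated RTL.

    // Sketch: translate the logical calibration address for DQ0 into the
    // pinout-specific nibble select and RIU address.
    always @(*) begin
      riu_sel_valid  = 1'b0;
      riu_nibble_sel = 4'd0;        // binary index, one-hot decoded downstream
      riu_addr_out   = 6'h00;
      case (cal_addr)
        28'h000_4100: begin         // logical address for DQ0
          riu_sel_valid  = 1'b1;
          riu_nibble_sel = 4'd0;    // DQ0 sits on Bit[0] of nibble 0 in this pinout
          riu_addr_out   = 6'h0D;   // RIU ODELAY address for Bit[0]
        end
        // ... one entry per calibrated bit for the selected pinout
        default: ;
      endcase
    end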
The MicroBlaze I/O interface operates at a much slower frequency, which is not fast enough for implementing all the functions required in calibration. A helper circuit implemented in
qdriip_cal_adr_decode.sv is required to obtain commands from the registers and
translate at least a portion into single-cycle accuracy for submission to the PHY. In addition,
it supports command repetition to enable back-to-back read transactions and read data
comparison.
1. The built-in self-check (BISC) of the PHY is run. It is used to compensate the internal
skews among the data bits and the strobe on the read path.
2. After BISC completion, the required steps for the power-on initialization of the memory
part starts.
3. Several stages of calibration are required to tune the write and read datapath skews, as shown in Figure 17-3.
4. After calibration is completed, the PHY calculates internal offsets for voltage and temperature tracking purposes by considering the taps used until the end of step 3.
5. When PHY indicates the calibration completion, the user interface command execution
begins.
Figure 17-3 shows the overall flow of memory initialization and the different stages of
calibration.
X-Ref Target - Figure 17-3
Figure 17-3: Memory Initialization and Calibration Flow (Calibration Start → BISC Calibration → ... → Calibration Complete)
BISC Calibration
Built-in Self Calibration (BISC) is the first stage of calibration. BISC is enabled by configuring the SELF_CALIBRATE parameter to 2'b11 for all the byte lanes. BISC compensates for the on-chip delay variations among the read bits and center-aligns the read clock in the read data window (if enabled). BISC does not compensate for PCB delay variations; thus, the output of BISC gives a fine center alignment but not an accurate one.
Memory Initialization
The memory initialization sequence is done as per the vendor requirements.
Read Leveling
The aim of this stage is to deskew all read data bits in a nibble and then keep the rise and
fall edges of the read strobe inside the valid window at an approximate 90° position.
After the completion of BISC, the capture clock position is within the valid window but not
the center. Use this initial position to find the left and right edges of the valid window and
then center align in it.
To create a clock pattern, write one burst of 1s and one burst of 0s into two address
locations. Writing an entire burst of 1 or 0 eliminates toggles on the write data bits during
the write transaction. Read leveling has to be done per nibble as each nibble generates its own capture clock. You have to perform back-to-back continuous reads from those two locations to find the two edges of the read data window. The following terminology is used in the read leveling algorithm:
• PQTR – The delay element on the CQ_p capture clock. Its output is used to capture the rise data.
• NQTR – The delay element on the CQ_n capture clock. Its output is used to capture the fall data.
• IDELAY – The delay element on each data bit.
• INFIFO OUTPUT – The read data to the user interface.
Case 1: RL of 2
Aligning PQTR to Left Edge
The first step in the deskew process is to decrement PQTR and NQTR delays until one of
them acquires a 0 value. After the decrement for deskew only, the P data for all the bits in
the nibble are analyzed to find the left edge.
In Figure 17-4, if PQTR is in the window for all the bits in the nibble then increment IDELAY
for each bit until they fail. This deskews all the bits in the nibble and PQTR is aligned at the
left edge for all the bits.
X-Ref Target - Figure 17-4
For conditions in which the PQTR is outside the window for all the bits or any of the bits in
the nibble, the PQTR/NQTR delays are incremented until there is a pass for all the bits. The
next step would be to increment the IDELAY for each data bit in the nibble in Figure 17-5.
This deskews all the bits in the nibble and PQTR is aligned at the left edge for all the bits.
X-Ref Target - Figure 17-5
During this process, both the NQTR and PQTR delays are moved to find the right edge and only the N data is used for comparison. Because the deskew is already completed, if the N data for any of the bits in a byte changes, the right edge is considered found. Figure 17-6 shows the condition when the N data is aligned to the right edge.
The data is written into the INFIFO using the falling edge of the divided clock. The divided
clock is derived from NQTR and it is being moved during the calibration stage.
X-Ref Target - Figure 17-6
In this stage, the NQTR and PQTR values from the BISC calibration stage are used to center them in the data window. This is an initial calibration stage; read leveling with a complex data pattern is used after the write calibration.
PQTR_90 = (PQTR value after BISC calibration) – PQTR_ALIGN, which gives the tap count needed for the 90° offset.
NQTR_90 = (NQTR value after BISC calibration) – NQTR_ALIGN, which gives the tap count needed for the 90° offset.
These values are retained as soon as the calibration algorithm starts. PQTR is then placed at the PQTR value at the end of left-edge alignment + PQTR_90, and NQTR is placed at the NQTR value at the end of right-edge alignment – NQTR_90.
Figure 17-7: NQTR and PQTR with 90° Offset from BISC
Case 2: RL of 2.5
Aligning NQTR to Left Edge
The first step in the deskew process is to decrement PQTR and NQTR delays until one of
them acquires a 0 value. After the decrement for deskew only, the N data for all the bits in
the nibble are analyzed to find the left edge.
In Figure 17-8, if NQTR is in the window for all the bits in the nibble then increment IDELAY
for each bit until they fail. This deskews all the bits in the nibble and NQTR is aligned at the
left edge for all the bits.
For conditions in which the NQTR is outside the window for all the bits or any of the bits in
the nibble, the PQTR/NQTR delays are incremented until there is a pass for all the bits. The
next step would be to increment the IDELAY for each data bit in the nibble in Figure 17-9.
This deskews all the bits in the nibble and NQTR will be aligned at the left edge for all the
bits.
During this process, both the NQTR and PQTR delays are moved to find the right edge and only the P data is used for comparison. Because the deskew is already completed, if the P data for any of the bits in a byte changes, the right edge is considered found. Figure 17-10 shows the condition when the P data is aligned to the right edge.
The data is written into the INFIFO using the falling edge of the divided clock. The divided
clock is derived from NQTR and it is being moved during the calibration stage.
In this stage, the NQTR and PQTR values from the BISC calibration stage are used to center them in the data window. This is an initial calibration stage; read leveling with a complex data pattern is used after the write calibration.
PQTR_90 = (PQTR value after BISC calibration) – PQTR_ALIGN, which gives the tap count needed for the 90° offset.
NQTR_90 = (NQTR value after BISC calibration) – NQTR_ALIGN, which gives the tap count needed for the 90° offset.
These values are retained as soon as the calibration algorithm starts. NQTR is then placed at the NQTR value at the end of left-edge alignment + NQTR_90, and PQTR is placed at the PQTR value at the end of right-edge alignment – PQTR_90.
Figure 17-11: NQTR and PQTR with 90° Offset from BISC
1. To start, the algorithm writes one data burst each into two address locations.
a. Writes 11111 (all 1s) into the address location 1111111 (all 1s).
b. Writes 00000 (all 0s) into the address location 1111110 when calibrating A[0].
c. When these two addresses are sent on the rise and fall edges of every memory clock cycle, a clock pattern is created on A[0] and a constant 1 on all other address bits. A similar cycle repeats for all address bits.
d. Because the same data is written on all the edges of a write burst, the data is kept free from any toggling, so write calibration is not required at this stage.
e. Continuous reads are then started.
2. As there are no delay taps added on the address bits until this point of calibration,
assume the initial relation between the memory clock and the address bit A[0] as shown
in Figure 17-12.
X-Ref Target - Figure 17-12
6. The first and second edge taps are noted and the clock is centered in the middle of them as shown in Figure 17-14. As the maximum frequency supported is 450 MHz, there is enough margin even if the algorithm is not able to find the first and second edges.
X-Ref Target - Figure 17-14
A static phase shift of 90° is applied on the K-clock at all times; thus, the initial position of the K-clock with respect to a write data bit is assumed to be one of the following three cases:
• Case 1 – Clock is aligned inside the valid window, which is termed in Figure 17-15 as
Current Rise Window (CR) for a selected rise edge of the K-clock.
• Case 2 – Clock is aligned inside the Left Noise region (LN) as shown in Figure 17-16.
X-Ref Target - Figure 17-16
• Case 3 – Clock is aligned inside the Right Noise region (RN) as shown in Figure 17-17.
X-Ref Target - Figure 17-17
When the initial placement of the clock with respect to a data bit is as mentioned in cases
1 and 2, you can only find the two edges of the previous fall window by moving the data
delay taps. Therefore, you can only center in the previous fall window as shown in
Figure 17-18. A separate bitslip stage is required to move the clock from previous fall to
current rise window.
X-Ref Target - Figure 17-18
However, the clock can be centered in the proper data window (that is, the current rise window) if the initial clock placement is as mentioned in case 3. The final placement is described in Figure 17-19. No bitslip is required in this scenario.
The immediate next step is to align the clock in the proper data window; the bitslip calculation is done in the next stage of calibration.
Figure 17-20: Typical Clock Placement Inside Write Bus Before K-Centering
After clock centering is completed, a few bits are centered in the previous fall window and the others in the current rise window. Figure 17-21 shows how the data bits in Figure 17-20 are aligned after centering.
Figure 17-21 explains that clock placement for D2 is proper but it is improper for bits D0
and D1, which are delayed by one bit time. The only method to correct the clock alignment
for D0 and D1 is to delay the address/control bits by the same number of bit times.
However, one bit time cannot be added on address/control bits as they are SDR signals.
Thus, delay the address/control bits by two bit times (one clock cycle) and delay D0 and D1
by one more bit time.
The alignment in Figure 17-21 is modified to Figure 17-22 after adding one clock cycle
delay.
Figure 17-22 explains that adding one bit time delay on D0 and D1 completes the clock
alignment process. However, the data bit D2 is not done yet because it requires no bitslip
before delaying the address/control bus. Therefore, it is required to add the same delay on
D2 as that of address bus to complete its alignment. Figure 17-22 confirms the same as it
takes two bit times (one clock cycle) for D2 to align the rise edge of the clock with CR
(current rise window) of D2.
X-Ref Target - Figure 17-22
Figure 17-22: Clock Placement After Adding One Clock Cycle Delay on Address/Control Bus
Figure 17-23 shows the final alignment after adding corresponding delays on all data bits.
X-Ref Target - Figure 17-23
Figure 17-23: Final Alignment After Adding Corresponding Delays on All Data Bits (+1 bitslip on D0, +1 bitslip on D1, +2 bitslip on D2)
Reset Sequence
The sys_rst signal resets the entire memory design which includes general interconnect
(fabric) logic which is driven by the MMCM clock (clkout0) and RIU logic. MicroBlaze™ and
calibration logic are driven by the MMCM clock (clkout6). The sys_rst input signal is
synchronized internally to create the qdriip_rst_clk signal. The qdriip_rst_clk
reset signal is synchronously asserted and synchronously deasserted.
Figure 17-24 shows that the qdriip_rst_clk (fabric reset) is synchronously asserted with a
few clock delays after sys_rst is asserted. When qdriip_rst_clk is asserted, there are
a few clocks before the clocks are shut off.
X-Ref Target - Figure 17-24
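One common way to obtain a reset that asserts and deasserts synchronously, as described above, is a simple flop chain; the sketch below is illustrative only (the chain depth, clock name, and output name are assumptions, not the generated qdriip_rst_clk logic).

    // Sketch: synchronize the asynchronous sys_rst into the fabric clock domain
    // so that the resulting reset asserts and deasserts synchronously, a few
    // clocks after sys_rst changes.
    reg [3:0] rst_sync = 4'hF;

    always @(posedge fabric_clk)        // fabric_clk: MMCM clkout0 domain (assumed)
      rst_sync <= {rst_sync[2:0], sys_rst};

    wire fabric_rst = |rst_sync;        // remains High a few clocks after release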
The MicroBlaze MCS ECC can be selected from the MicroBlaze MCS ECC option section in
the Advanced Options tab. The block RAM size increases if the ECC option for MicroBlaze
MCS is selected.
Clocking
The memory interface requires one mixed-mode clock manager (MMCM), one TXPLL per I/O bank used by the memory interface, and two BUFGs. These clocking components are used
to create the proper clock frequencies and phase shifts necessary for the proper operation
of the memory interface.
There are two TXPLLs per bank. If a bank is shared by two memory interfaces, both TXPLLs
in that bank are used.
Note: QDR II+ SRAM generates the appropriate clocking structure and no modifications to the RTL
are supported.
The QDR II+ SRAM tool generates the appropriate clocking structure for the desired
interface. This structure must not be modified. The allowed clock configuration is as
follows:
Requirements
GCIO
• Must use a differential I/O standard
• Must be in the same I/O column as the memory interface
• Must be in the same SLR of memory interface for the SSI technology devices
• The I/O standard and termination scheme are system dependent. For more information,
consult the UltraScale Architecture SelectIO Resources User Guide (UG571) [Ref 7].
MMCM
• MMCM is used to generate the FPGA logic system clock (1/2 of the memory clock)
• Must be located in the center bank of memory interface
• Must use internal feedback
• Input clock frequency divided by input divider must be ≥ 70 MHz (CLKINx / D ≥
70 MHz)
• Must use integer multiply and output divide values
° For two bank systems, the bank with the higher number of bytes selected is chosen
as the center bank. If the same number of bytes is selected in two banks, then the
top bank is chosen as the center bank.
TXPLL
• CLKOUTPHY from TXPLL drives XIPHY within its bank
Figure 18-1 shows an example of the clocking structure for a three bank memory interface.
The GCIO drives the MMCM located at the center bank of the memory interface. MMCM
drives both the BUFGs located in the same bank. The BUFG (which is used to generate
system clock to FPGA logic) output drives the TXPLLs used in each bank of the interface.
X-Ref Target - Figure 18-1
Figure 18-1: Clocking Structure for a Three Bank Memory Interface
• For two bank systems, the MMCM is placed in the bank with the most bytes selected. If both banks have the same number of bytes selected, the MMCM is placed in the top bank.
• For four bank systems, MMCM is placed in a second bank from the top.
For designs generated with System Clock configuration of No Buffer, MMCM must not be
driven by another MMCM/PLL. Cascading clocking structures MMCM → BUFG → MMCM
and PLL → BUFG → MMCM are not allowed.
If the MMCM is driven by the GCIO pin of the other bank, then the
CLOCK_DEDICATED_ROUTE constraint with value "BACKBONE" must be set on the net that
is driving MMCM or on the MMCM input. Setting up the CLOCK_DEDICATED_ROUTE
constraint on the net is preferred. But when the same net is driving two MMCMs, the
CLOCK_DEDICATED_ROUTE constraint must be managed by considering which MMCM
needs the BACKBONE route.
In such cases, the CLOCK_DEDICATED_ROUTE constraint can be set on the MMCM input. To
use the "BACKBONE" route, any clock buffer that exists in the same CMT tile as the GCIO
must exist between the GCIO and MMCM input. The clock buffers that exist in the I/O CMT
are BUFG, BUFGCE, BUFGCTRL, and BUFGCE_DIV. So QDR II+ SRAM instantiates BUFG
between the GCIO and MMCM when the GCIO pins and MMCM are not in the same bank
(see Figure 18-1).
If the GCIO pin and MMCM are allocated in different banks, QDR II+ SRAM generates
CLOCK_DEDICATED_ROUTE constraints with value as "BACKBONE." If the GCIO pin and
MMCM are allocated in the same bank, there is no need to set any constraints on the
MMCM input.
Similarly when designs are generated with System Clock Configuration as a No Buffer
option, you must take care of the "BACKBONE" constraint and the BUFG/BUFGCE/
BUFGCTRL/BUFGCE_DIV between GCIO and MMCM if GCIO pin and MMCM are allocated in
different banks. QDR II+ SRAM does not generate clock constraints in the XDC file for No
Buffer configurations and you must take care of the clock constraints for No Buffer
configurations. For more information on clocking, see the UltraScale Architecture Clocking
Resources User Guide (UG572) [Ref 8].
For more information on the CLOCK_DEDICATED_ROUTE constraints, see the Vivado Design
Suite Properties Reference Guide (UG912) [Ref 9].
Note: If two different GCIO pins are used for two QDR II+ SRAM IP cores in the same bank, center
bank of the memory interface is different for each IP. QDR II+ SRAM generates MMCM LOC and
CLOCK_DEDICATED_ROUTE constraints accordingly.
1. QDR II+ SRAM generates a single-ended input for system clock pins, such as
sys_clk_i. Connect the differential buffer output to the single-ended system clock
inputs (sys_clk_i) of both the IP cores.
2. System clock pins must be allocated within the same I/O column of the memory
interface pins allocated. Add the pin LOC constraints for system clock pins and clock
constraints in your top-level XDC.
3. You must add a "BACKBONE" constraint on the net that is driving the MMCM or on the
MMCM input if GCIO pin and MMCM are not allocated in the same bank. Apart from
this, BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV must be instantiated between GCIO and
MMCM to use the "BACKBONE" route.
Note:
° The UltraScale architecture includes an independent XIPHY power supply and TXPLL
for each XIPHY. This results in clean, low jitter clocks for the memory system.
° Skew spanning across multiple BUFGs is not a concern because a single point of contact exists between BUFG → TXPLL and the same BUFG → System Clock Logic.
° System input clock cannot span I/O columns because the longer the clock lines
span, the more jitter is picked up.
TXPLL Usage
There are two TXPLLs per bank. If a bank is shared by two memory interfaces, both TXPLLs in that bank are used. If a bank is used by a single memory interface, only one TXPLL in that bank is used and the second is available for other purposes. To use the second PLL, perform the following steps:
1. Generate the design for the System Clock Configuration option as No Buffer.
2. QDR II+ SRAM generates a single-ended input for system clock pins, such as
sys_clk_i. Connect the differential buffer output to the single-ended system clock
inputs (sys_clk_i) and also to the input of PLL (PLL instance that you have in your
design).
3. You can use the PLL output clocks.
Additional Clocks
You can produce up to four additional clocks which are created from the same MMCM that
generates ui_clk. Additional clocks can be selected from the Clock Options section in the
Advanced Options tab. The GUI lists the possible clock frequencies from MMCM and the
frequencies for additional clocks vary based on selected memory frequency (Memory
Device Interface Speed (ps) value in the Basic tab), selected FPGA, and FPGA speed grade.
For situations where the memory interface is reset and recalibrated without a
reconfiguration of the FPGA, the SEM IP must be set into IDLE state to disable the memory
scan and to send the SEM IP back into the scanning (Observation or Detect only) states
afterwards. This can be done in one of two ways: through the Command Interface or the UART Interface. See Chapter 3 of the UltraScale Architecture Soft Error Mitigation Controller
LogiCORE IP Product Guide (PG187) [Ref 10] for more information.
Resets
An asynchronous reset (sys_rst) input is provided. This is an active-High reset and sys_rst must be asserted for a minimum pulse width of 5 ns. The sys_rst signal can be driven from an internal signal or an external pin.
IMPORTANT: If two controllers share a bank, they cannot be reset independently. The two controllers
must have a common reset input.
For more information on reset, see the Reset Sequence in Chapter 17, Core Architecture.
All Read Data pins of a single component must not span more than three consecutive byte lanes, and CQ/CQ# must always be allocated in the center byte lane.
RECOMMENDED: Xilinx strongly recommends that the DCIUpdateMode option is kept with the default
value of ASREQUIRED so that the DCI circuitry is allowed to operate normally.
9. There are dedicated VREF pins (not included in the rules above). Either internal or
external VREF is permitted. If an external VREF is not used, the VREF pins must be pulled
to ground by a resistor value specified in the UltraScale™ Device FPGAs SelectIO™
Resources User Guide (UG571) [Ref 7]. These pins must be connected appropriately for
the standard in use.
10. The system reset pin (sys_rst_n) must not be allocated to Pins N0 and N6 if the byte
is used for the memory I/Os.
Pin Swapping
• Pins can swap freely within each Write Data byte group.
• Pins can swap freely within each Read Data byte group, except the CQ/CQ# pins. Pins can swap freely within and between their corresponding byte groups, but should not violate the above-mentioned Read Data pin/byte lane allocation rules.
• Pins can swap freely within each Address/Control byte group. Pins can swap freely within and between their corresponding byte groups, but should not violate the above-mentioned Address/Control pin/byte lane allocation rules.
• Write Data byte groups can swap freely with each other, but should not violate the above-mentioned Write Data pin/byte lane allocation rules.
• Read Data byte groups can swap freely with each other, but should not violate the above-mentioned Read Data pin/byte lane allocation rules.
• Address/Control byte groups can swap freely with each other, but should not violate the above-mentioned Address/Control pin/byte lane allocation rules.
• No other pin swapping is permitted.
Table 18-1 shows an example of an 18-bit QDR II+ SRAM interface contained within two
banks.
Table 18-1: 18-Bit QDR II+ Interface Contained in Two Banks (Cont’d)
Bank Signal Name Byte Group I/O Type
1 sys_clk_n T1U_10 P
1 – T1U_9 N
1 q17 T1U_8 P
1 q16 T1U_7 N
1 cq_p T1U_6 P
1 q15 T1L_5 N
1 q14 T1L_4 P
1 q13 T1L_3 N
1 q12 T1L_2 P
1 q11 T1L_1 N
1 cq_n T1L_0 P
1 vrp T0U_12 –
1 – T0U_11 N
1 q10 T0U_10 P
1 q9 T0U_9 N
1 q8 T0U_8 P
1 q7 T0U_7 N
1 q6 T0U_6 P
1 q5 T0L_5 N
1 q4 T0L_4 P
1 q3 T0L_3 N
1 q2 T0L_2 P
1 q1 T0L_1 N
1 q0 T0L_0 P
0 – T3U_12 –
0 – T3U_11 N
0 – T3U_10 P
0 d17 T3U_9 N
0 d16 T3U_8 P
0 d15 T3U_7 N
0 d14 T3U_6 P
0 d13 T3L_5 N
0 d12 T3L_4 P
0 d11 T3L_3 N
0 d10 T3L_2 P
0 bwsn1 T3L_1 N
0 d9 T3L_0 P
0 – T2U_12 –
0 d8 T2U_11 N
0 d7 T2U_10 P
0 d6 T2U_9 N
0 d5 T2U_8 P
0 k_n T2U_7 N
0 k_p T2U_6 P
0 d4 T2L_5 N
0 d3 T2L_4 P
0 d2 T2L_3 N
0 d1 T2L_2 P
0 bwsn0 T2L_1 N
0 d0 T2L_0 P
0 doff T1U_12 –
0 a21 T1U_11 N
0 a20 T1U_10 P
0 a19 T1U_9 N
0 a18 T1U_8 P
0 a17 T1U_7 N
0 a16 T1U_6 P
0 a15 T1L_5 N
0 a14 T1L_4 P
0 a13 T1L_3 N
0 a12 T1L_2 P
0 rpsn T1L_1 N
0 a11 T1L_0 P
0 vrp T0U_12 –
0 a10 T0U_11 N
0 a9 T0U_10 P
0 a8 T0U_9 N
0 a7 T0U_8 P
0 a6 T0U_7 N
0 a5 T0U_6 P
0 a4 T0L_5 N
0 a3 T0L_4 P
0 a2 T0L_3 N
0 a1 T0L_2 P
0 wpsn T0L_1 N
0 a0 T0L_0 P
Protocol Description
This core has the following interfaces:
• User Interface
• Memory Interface
User Interface
The user interface connects an FPGA user design to the QDR II+ SRAM solutions core to
simplify interactions between the user logic and the external memory device. The user
interface provides a set of signals used to issue a read or write command to the memory
device. These signals are summarized in Table 18-2.
Notes:
1. These ports are available and valid only in the BL2 configuration. For the BL4 configuration, these ports are not available or, if present, do not need to be driven.
Figure 18-2 (X24450-082420): user interface write/read timing showing clk, init_calib_complete, app_wr_cmd0, app_rd_cmd0, app_rd_valid0, and app_rd_data0 (RD_DATA).
Before any requests can be made, the init_calib_complete signal must be asserted High. As shown in Figure 18-2, until init_calib_complete is asserted, no read or write requests can take place and any assertion of app_wr_cmd0 or app_rd_cmd0 on the client interface is ignored. A write request is issued by asserting app_wr_cmd0 as a single-cycle pulse. At this time, the app_wr_addr0, app_wr_data0, and app_wr_bw_n0 signals must be valid.
On the following cycle, a read request is issued by asserting app_rd_cmd0 as a single-cycle pulse. At this time, app_rd_addr0 must be valid. After one cycle of idle time, a read and a write request are both asserted on the same clock cycle. In this case, the read to the memory occurs first, followed by the write. The write and read commands can be applied in any order at the user interface; two examples are shown in Figure 18-2.
Also, Figure 18-2 shows data returning from the memory device to the user design. The
app_rd_valid0 signal is asserted, indicating that app_rd_data0 is now valid. This
should be sampled on the same cycle when app_rd_valid0 is asserted because the core
does not buffer returning data.
In case of BL2, the same protocol should be followed on two independent ports: port-0 and
port-1. Figure 18-2 shows the user interface signals on port-0 only.
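A minimal Verilog sketch of this port-0 handshake is shown below. The module name and the ADDR_WIDTH, DATA_WIDTH, and BW_WIDTH parameters are placeholders (the actual widths depend on the configured memory part), and do_write/do_read and the address/data inputs are assumed to come from user logic.

module qdriip_ui_sketch #(
  parameter ADDR_WIDTH = 19,   // placeholder
  parameter DATA_WIDTH = 144,  // placeholder
  parameter BW_WIDTH   = 16    // placeholder
) (
  input  wire                   clk,                  // user-interface clock from the core
  input  wire                   init_calib_complete,
  input  wire                   do_write,             // from user logic
  input  wire                   do_read,              // from user logic
  input  wire [ADDR_WIDTH-1:0]  wr_addr,
  input  wire [DATA_WIDTH-1:0]  wr_data,
  input  wire [ADDR_WIDTH-1:0]  rd_addr,
  input  wire                   app_rd_valid0,        // from the core
  input  wire [DATA_WIDTH-1:0]  app_rd_data0,         // from the core
  output reg                    app_wr_cmd0,
  output reg  [ADDR_WIDTH-1:0]  app_wr_addr0,
  output reg  [DATA_WIDTH-1:0]  app_wr_data0,
  output reg  [BW_WIDTH-1:0]    app_wr_bw_n0,
  output reg                    app_rd_cmd0,
  output reg  [ADDR_WIDTH-1:0]  app_rd_addr0,
  output reg  [DATA_WIDTH-1:0]  captured_rd_data
);
  always @(posedge clk) begin
    app_wr_cmd0 <= 1'b0;
    app_rd_cmd0 <= 1'b0;

    // Requests are ignored by the core until calibration completes.
    if (init_calib_complete) begin
      if (do_write) begin
        app_wr_cmd0  <= 1'b1;               // single-cycle pulse
        app_wr_addr0 <= wr_addr;            // must be valid with the command
        app_wr_data0 <= wr_data;
        app_wr_bw_n0 <= {BW_WIDTH{1'b0}};   // active-Low byte writes all enabled
      end
      if (do_read) begin
        app_rd_cmd0  <= 1'b1;               // single-cycle pulse
        app_rd_addr0 <= rd_addr;            // must be valid with the command
      end
    end

    // The core does not buffer return data: sample it when app_rd_valid0 is High.
    if (app_rd_valid0)
      captured_rd_data <= app_rd_data0;
  end
endmodule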
Memory Interface
The memory interface is the connection from the FPGA memory solution to an external QDR II+ SRAM device. The I/O signals for this interface are defined in Table 18-3. These signals can be directly connected to the corresponding signals on the memory device.
Figure 18-3 shows the timing diagram for the sample write and read operations at the
memory interface of a BL4 QDR II+ SRAM device and Figure 18-4 is that of a BL2 device.
Figure 18-3 (X24451-082420): memory interface write and read timing for a BL4 device, showing qdriip_k_p, qdriip_k_n, qdriip_w_n, qdriip_r_n, qdriip_cq_p, and qdriip_cq_n.
Figure 18-4 (X24452-082420): memory interface write and read timing for a BL2 device, showing qdriip_k_p, qdriip_k_n, qdriip_w_n, qdriip_r_n, and qdriip_cq_p.
• Memory IP lists the possible Reference Input Clock Speed values based on the targeted
memory frequency (based on selected Memory Device Interface Speed).
• Otherwise, select M and D Options and target for desired Reference Input Clock Speed
which is calculated based on selected CLKFBOUT_MULT (M), DIVCLK_DIVIDE (D), and
CLKOUT0_DIVIDE (D0) values in the Advanced Clocking Tab.
The required Reference Input Clock Speed is calculated from the M, D, and D0 values
entered in the GUI using the following formulas:
Where tCK is the Memory Device Interface Speed selected in the Basic tab.
The Reference Input Clock Speed calculated from the M, D, and D0 values is validated against the clocking guidelines. For more information on clocking rules, see Clocking.
Apart from the memory-specific clocking rules, the possible MMCM input frequency range, MMCM VCO frequency range, and MMCM PFD frequency range values are also validated for M, D, and D0 in the GUI.
For UltraScale devices, see Kintex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS892) [Ref 2] and Virtex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS893) [Ref 3] for MMCM Input frequency range, MMCM VCO frequency
range, and MMCM PFD frequency range values.
For UltraScale+ devices, see Kintex UltraScale+ FPGAs Data Sheet: DC and AC Switching
Characteristics (DS922) [Ref 4], Virtex UltraScale+ FPGAs Data Sheet: DC and AC Switching
Characteristics (DS923) [Ref 5], and Zynq UltraScale+ MPSoC Data Sheet: DC and AC
Switching Characteristics (DS925) [Ref 6] for MMCM Input frequency range, MMCM VCO
frequency range, and MMCM PFD frequency range values.
For possible M, D, and D0 values and detailed information on clocking and the MMCM, see
the UltraScale Architecture Clocking Resources User Guide (UG572) [Ref 8].
• Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
[Ref 13]
• Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14]
• Vivado Design Suite User Guide: Getting Started (UG910) [Ref 15]
• Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16]
This section includes information about using Xilinx ® tools to customize and generate the
core in the Vivado Design Suite.
If you are customizing and generating the core in the IP integrator, see the Vivado Design
Suite User Guide: Designing IP Subsystems using IP Integrator (UG994) [Ref 13] for detailed
information. IP integrator might auto-compute certain configuration values when
validating or generating the design. To check whether the values change, see the
description of the parameter in this chapter. To view the parameter value, run the
validate_bd_design command in the Tcl Console.
You can customize the IP for use in your design by specifying values for the various
parameters associated with the IP core using the following steps:
For more information about generating the core in Vivado, see the Vivado Design Suite User
Guide: Designing with IP (UG896) [Ref 14] and the Vivado Design Suite User Guide: Getting
Started (UG910) [Ref 15].
Note: Figures in this chapter are illustrations of the Vivado Integrated Design Environment (IDE).
This layout might vary from the current version.
Basic Tab
Figure 19-1 shows the Basic tab when you start up the QDR II+ SRAM.
X-Ref Target - Figure 19-1
IMPORTANT: All parameters shown in the controller options dialog box are limited selection options in
this release.
For the Vivado IDE, all controllers (DDR3, DDR4, LPDDR3, QDR II+, QDR-IV, and RLDRAM 3) can be created and are available for instantiation.
1. In Clocking, the Memory Device Interface Speed sets the speed of the interface. The speed entered drives the available Reference Input Clock Speeds. For more information on the clocking structure, see Clocking, page 359.
2. To use memory parts that are not available by default through the QDR II+ SRAM Vivado IDE, you can create a custom parts CSV file, as specified in AR 63462. This CSV file has to be provided after enabling the Custom Parts Data File option. After selecting this option, you can see the custom memory parts along with the default memory parts. Note that simulations are not supported for custom parts. Custom part simulations require manually adding the memory model to the simulation and might require modifying the test bench instantiation.
Figure 19-4: Vivado Customize IP Dialog Box – I/O Planning and Design Checklist
User Parameters
Table 19-1 shows the relationship between the fields in the Vivado IDE and the User
Parameters (which can be viewed in the Tcl Console).
Output Generation
For details, see the Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14].
I/O Planning
For details on I/O planning, see I/O Planning, page 235.
Required Constraints
The QDR II+ SRAM Vivado IDE generates the required constraints. A location constraint and
an I/O standard constraint are added for each external pin in the design. The location is
chosen by the Vivado IDE according to the banks and byte lanes chosen for the design.
The I/O standard is chosen by the memory type selection and options in the Vivado IDE and
by the pin type. A sample for qdriip_d[0] is shown here.
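The generated constraints take the following form; the pin location and I/O standard shown below are illustrative placeholders only, because the actual values are chosen by the Vivado IDE from the selected banks, byte lanes, and memory options.

# Illustrative only: the actual pin and I/O standard are chosen by the Vivado IDE.
set_property PACKAGE_PIN AK25 [get_ports {qdriip_d[0]}]
set_property IOSTANDARD HSTL_I [get_ports {qdriip_d[0]}]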
Clock Frequencies
This section is not applicable for this IP core.
Clock Management
For more information on clocking, see Clocking, page 359.
Clock Placement
This section is not applicable for this IP core.
Banking
This section is not applicable for this IP core.
Transceiver Placement
This section is not applicable for this IP core.
IMPORTANT: The set_input_delay and set_output_delay constraints are not needed on the
external memory interface pins in this design due to the calibration process that automatically runs at
start-up. Warnings seen during implementation for the pins can be ignored.
Simulation
This section contains information about simulating the QDR II+ SRAM generated IP. The Vivado simulator, Questa Advanced Simulator, IES, and VCS simulation tools are used for verification of the QDR II+ SRAM IP at each software release. For more information on simulation, see Chapter 20, Example Design and Chapter 21, Test Bench.
Example Design
This chapter contains information about the example design provided in the Vivado®
Design Suite. Vivado supports Open IP Example Design flow. To create the example design
using this flow, right-click the IP in the Source Window, as shown in Figure 20-1 and select
Open IP Example Design.
X-Ref Target - Figure 20-1
This option creates a new Vivado project. Upon selecting the menu, a dialog box to enter
the directory information for the new design project opens.
Select a directory, or use the defaults, and click OK. This launches a new Vivado with all of
the example design files and a copy of the IP.
The example design can be simulated using one of the methods in the following sections.
Project-Based Simulation
This method can be used to simulate the example design using the Vivado Integrated
Design Environment (IDE). Memory IP does not deliver the QDR II+ memory models. The
memory model required for the simulation must be downloaded from the memory vendor’s
website. The memory model file must be added to the example design using the Add Sources option to run simulation.
The Vivado simulator, Questa Advanced Simulator, IES, and VCS tools are used for QDR II+
IP verification at each software release. The Vivado simulation tool is used for QDR II+ IP
verification from 2015.1 Vivado software release. The following subsections describe steps
to run a project-based simulation using each supported simulator tool.
2. Add the memory model in the Add or create simulation sources page and click Finish
as shown in Figure 20-3.
X-Ref Target - Figure 20-3
3. In the Open IP Example Design Vivado project, under Flow Navigator, select
Simulation Settings.
4. Select Target simulator as Vivado Simulator.
7. In the Flow Navigator window, select Run Simulation and select Run Behavioral
Simulation option as shown in Figure 20-5.
8. Vivado invokes Vivado simulator and simulations are run in the Vivado simulator tool.
For more information, see the Vivado Design Suite User Guide: Logic Simulation (UG900)
[Ref 16].
2. Add the memory model in the Add or create simulation sources page and click Finish
as shown in Figure 20-7.
X-Ref Target - Figure 20-7
3. In the Open IP Example Design Vivado project, under Flow Navigator, select
Simulation Settings.
4. Select Target simulator as Questa Advanced Simulator.
a. Browse to the compiled libraries location and set the path on Compiled libraries
location option.
b. Under the Simulation tab, set the modelsim.simulate.runtime to 1 ms (there
are simulation RTL directives which stop the simulation after a certain period of time,
which is less than 1 ms) as shown in Figure 20-8. The Generate Scripts Only option
generates simulation scripts only. To run behavioral simulation, Generate Scripts
Only option must be de-selected.
5. Apply the settings and select OK.
6. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 20-9.
7. Vivado invokes Questa Advanced Simulator and simulations are run in the Questa
Advanced Simulator tool. For more information, see the Vivado Design Suite User Guide:
Logic Simulation (UG900) [Ref 16].
6. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 20-9.
7. Vivado invokes IES and simulations are run in the IES tool. For more information, see the
Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16].
6. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 20-9.
7. Vivado invokes VCS and simulations are run in the VCS tool. For more information, see
the Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16].
Simulation Speed
QDR II+ SRAM provides a Vivado IDE option to reduce the simulation time by selecting a behavioral XIPHY model instead of the UNISIM XIPHY model. Behavioral XIPHY model simulation is the default option for QDR II+ SRAM designs. To select the simulation mode,
click the Advanced Options tab and find the Simulation Options as shown in Figure 19-3.
The SIM_MODE parameter in the RTL is given a different value based on the Vivado IDE
selection.
• SIM_MODE = BFM – If fast mode is selected in the Vivado IDE, the RTL parameter
reflects this value for the SIM_MODE parameter. This is the default option.
• SIM_MODE = FULL – If UNISIM mode is selected in the Vivado IDE, XIPHY UNISIMs are
selected and the parameter value in the RTL is FULL.
IMPORTANT: QDR II+ memory models from Cypress® Semiconductor need to be modified with the
following two timing parameter values to run the simulations successfully:
`define tcqd #0
`define tcqdoh #0.15
If the design is generated with the Reference Input Clock option selected as No Buffer (at
Advanced > FPGA Options > Reference Input), the CLOCK_DEDICATED_ROUTE
constraints and BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV instantiation based on GCIO and
MMCM allocation needs to be handled manually for the IP flow. QDR II+ SRAM does not
generate clock constraints in the XDC file for No Buffer configurations and you must take
care of the clock constraints for No Buffer configurations for the IP flow.
For an example design flow with No Buffer configurations, QDR II+ SRAM generates the
example design with differential buffer instantiation for system clock pins. QDR II+ SRAM
generates clock constraints in the example_design.xdc. It also generates a
CLOCK_DEDICATED_ROUTE constraint as the “BACKBONE” and instantiates BUFG/BUFGCE/
BUFGCTRL/BUFGCE_DIV between GCIO and MMCM input if the GCIO and MMCM are not in
same bank to provide a complete solution. This is done for the example design flow as a
reference when it is generated for the first time.
If, in the example design, the system clock pins are moved to other I/O pins with the I/O pin planner, the CLOCK_DEDICATED_ROUTE constraints and the BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV instantiation need to be managed manually; otherwise, a DRC error is reported.
Test Bench
This chapter contains information about the test bench provided in the Vivado ® Design
Suite.
The Memory Controller is generated along with a simple test bench to verify the basic read
and write operations. The stimulus contains 10 consecutive writes followed by 10
consecutive reads for data integrity check.
Overview
Product Specification
Core Architecture
Example Design
Test Bench
Overview
IMPORTANT: This document supports QDR-IV SRAM core v2.0.
• Hardware, IP, and Platform Development: Creating the PL IP blocks for the hardware
platform, creating PL kernels, subsystem functional simulation, and evaluating the
Vivado ® timing, resource and power closure. Also involves developing the hardware
platform for system integration. Topics in this document that apply to this design
process include:
° Clocking
° Resets
° Protocol Description
° Example Design
Core Overview
The Xilinx UltraScale™ architecture includes the QDR-IV SRAM core. This core provides
solutions for interfacing with the QDR-IV SRAM memory type.
The QDR-IV SRAM core is a physical layer with a controller for interfacing Xilinx UltraScale
FPGA user designs to the QDR-IV SRAM devices. QDR-IV SRAMs offer high-speed data
transfers on separate read and write buses on the rising and falling edges of the clock.
These memory devices are used in high-performance systems as temporary data storage,
such as:
The QDR-IV SRAM solutions core includes a PHY and a controller that takes user commands, processes them to make them compatible with the QDR-IV protocol, and provides the converted commands to the QDR-IV memory. The controller inside the core enables you to provide four commands per cycle simultaneously.
Figure 22-1 shows a high-level block diagram of the QDR-IV SRAM interface solution.
Figure 22-1 (X14925-040820): high-level block diagram of the QDR-IV SRAM interface solution.
The QDR-IV core includes the hard blocks inside the FPGA and the soft calibration logic
necessary to ensure optimal timing of the hard blocks interfacing to the memory part.
The QDR-IV SRAM must be initialized before it can operate in the normal functional
mode. Initialization uses four special pins:
a. Apply power to the QDR-IV SRAM. Follow instructions described in the power-up
sequence section in the memory data sheet.
b. Apply reset to the QDR-IV SRAM. Follow reset sequence instruction in the memory
data sheet.
c. Assert Config (CFG_n = 0) and program the impedance control register.
d. After the input impedance is updated, allow the PLL time (tPLL) to lock to the input clock. See the memory data sheet for the tPLL value.
• Calibration – The calibration modules provide a complete method to set all delays in
the hard blocks and soft IP to work with the memory interface. Each bit is individually
trained and then combined to ensure optimal interface performance. Results of the
calibration process are available through the Xilinx debug tools. After completion of
calibration, the PHY layer presents the raw interface to the memory part.
Feature Summary
• Component support for interface widths up to 36 bits
• Single component interface with x18 and x36 memory device support
• 2-word burst support (BL2 only)
• Only POD12 standard support
• Memory device support with 72 Mb density and 144 Mb density
• Other densities for memory device support are available through custom part selection
• Support for 5 (for HP memory part) and 8 (for XP memory part) cycles of read latency
• Support for 3 (for HP memory part) and 5 (for XP memory part) cycles of write latency
• Source code delivery in Verilog and SystemVerilog
• 4:1 memory to FPGA logic interface clock ratio
• Interface calibration and training information available through the Vivado hardware
manager
• Programmable On-die Termination (ODT) support for address, clock, and data
Information about other Xilinx LogiCORE IP modules is available at the Xilinx Intellectual
Property page. For information on pricing and availability of other Xilinx LogiCORE IP
modules and tools, contact your local Xilinx sales representative.
License Checkers
If the IP requires a license key, the key must be verified. The Vivado® design tools have
several license checkpoints for gating licensed IP through the flow. If the license check
succeeds, the IP can continue generation. Otherwise, generation halts with error. License
checkpoints are enforced by the following tools:
• Vivado synthesis
• Vivado implementation
• write_bitstream (Tcl command)
IMPORTANT: IP license level is ignored at checkpoints. The test confirms a valid license exists. It does
not check IP license level.
Product Specification
Standards
This core complies with the QDR-IV SRAM standard defined by the QDR Consortium. For
more information on UltraScale™ architecture documents, see References, page 789.
Performance
Maximum Frequencies
For more information on the maximum frequencies, see the following documentation:
Resource Utilization
For full details about performance and resource utilization, visit Performance and Resource
Utilization.
Port Descriptions
There are three port categories at the top-level of the memory interface core called the
“user design.”
• The first category is the memory interface signals that directly interface with the memory part. These are defined by the QDR-IV SRAM specification.
• The second category is the application interface signals, which are referred to as the “user interface.” This is described in the Protocol Description, page 436.
• The third category includes other signals necessary for proper operation of the core.
These include the clocks, reset, and status signals from the core. The clocking and reset
signals are described in their respective sections.
Ensure that commands are issued only after c0_init_calib_complete is High. Any commands issued before the c0_init_calib_complete signal is High are lost.
Core Architecture
This chapter describes the UltraScale™ architecture-based FPGAs Memory Interface
Solutions core with an overview of the modules and interfaces.
Overview
The UltraScale architecture-based FPGAs Memory Interface Solutions is shown in
Figure 24-1.
Figure 24-1 (X14879-040820): UltraScale architecture-based FPGAs memory interface solution, showing the user FPGA logic connected through the user interface/Memory Controller and initialization/calibration blocks to the physical layer and the QDR-IV SRAM memory, with CalDone and read data returned to the user logic.
The user interface uses a simple protocol based entirely on SDR signals to make read and
write requests. For more details describing this protocol, see User Interface in Chapter 25.
PHY
The PHY is considered the low-level physical interface to an external QDR-IV SRAM device.
It contains the entire calibration logic for ensuring reliable operation of the physical
interface itself. The PHY generates the signal timing and sequencing required to interface to
the memory device.
• Clock/address/control-generation logics
• Write and read datapaths
• Logic for initializing the QDR-IV SRAM after power-up
In addition, the PHY contains calibration logic to perform timing training of the read and
write datapaths to account for system static and dynamic delays.
The user interface/controller and calibration logic communicate with this dedicated PHY in
the slow frequency clock domain, which is divided by 4. A more detailed block diagram of
the PHY design is shown in Figure 24-2.
Figure 24-2 (X14883-031616): PHY block diagram, showing the differential c0_sys_clk driving the MMCM/PLL, and the user interface/controller, calibration logic, MicroBlaze processor, and configuration ROM connecting through the QDR-IV calibration address decoder to the QDR-IV XIPHY and IOB, with read data, status, and CalDone returned.
The PHY architecture encompasses all of the logic contained in QDR-IV XIPHY module. The
PHY contains wrappers around dedicated hard blocks to build up the memory interface
from smaller components. A byte lane contains all of the clocks, resets, and datapaths for a
given subset of I/O. Multiple byte lanes are grouped together, along with dedicated
clocking resources, to make up a single bank memory interface. For more information on
the hard silicon physical layer architecture, see the UltraScale™ Architecture SelectIO™
Resources User Guide (UG571) [Ref 7].
The address unit connects the MCS to the local register set and the PHY by performing
address decode and control translation on the I/O module bus from spaces in the memory
map and MUXing return data (QDR-IV Calibration Address Decoder). In addition, it provides
address translation (also known as “mapping”) from a logical conceptualization of the
SRAM interface to the appropriate pinout-dependent location of the delay control in the
PHY address space.
Although the calibration architecture presents a simple and organized address map for
manipulating the delay elements for individual data, control and command bits, there is
flexibility in how those I/O pins are placed. For a given I/O placement, the path to the FPGA
logic is locked to a given pin. To enable a single binary software file to work with any
memory interface pinout, a translation block converts the simplified Register Interface Unit
(RIU) addressing into the pinout-specific RIU address for the target design. The specific
address translation is written by QDR-IV SRAM after a pinout is selected. The code shows an
example of the RTL structure that supports this.
In this example, DQ0 is pinned out on Bit[0] of nibble 0 (nibble 0 according to instantiation order). The RIU address for the ODELAY for Bit[0] is 0x0D. When DQ0 is addressed (indicated by address 0x000_4100), this snippet of code is active. It enables nibble 0 (decoded to one-hot downstream) and forwards the address 0x0D to the RIU address bus.
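The generated translation code is pinout-specific; the following is only a hedged Verilog sketch of the structure described above, with illustrative signal names (riu_logical_addr, riu_nibble_valid, riu_nibble_sel, riu_phys_addr) and widths that are not the IP's actual identifiers.

module riu_addr_translate_sketch (
  input  wire [27:0] riu_logical_addr,  // logical calibration address from the MCS
  output reg         riu_nibble_valid,  // a nibble is being addressed
  output reg  [2:0]  riu_nibble_sel,    // nibble number, one-hot decoded downstream
  output reg  [5:0]  riu_phys_addr      // pinout-specific RIU register address
);
  always @(*) begin
    riu_nibble_valid = 1'b0;
    riu_nibble_sel   = 3'd0;
    riu_phys_addr    = 6'h00;
    case (riu_logical_addr)
      // DQ0: logical address 0x000_4100 selects nibble 0 and forwards
      // RIU address 0x0D (the ODELAY for Bit[0] of nibble 0).
      28'h000_4100: begin
        riu_nibble_valid = 1'b1;
        riu_nibble_sel   = 3'd0;
        riu_phys_addr    = 6'h0D;
      end
      // One entry per calibrated pin is written out for the selected pinout.
      default: ;
    endcase
  end
endmodule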
The MicroBlaze I/O module interface updates at a maximum rate of once every three clock
cycles, which is not always fast enough for implementing all of the functions required in
calibration. A helper circuit implemented in Calibration Address Decoder module is
required to obtain commands from the registers and translate at least a portion into
single-cycle accuracy for submission to the PHY. In addition, it supports command
repetition to enable back-to-back read transactions and read data comparison.
1. The built-in self-check (BISC) of the PHY is run. It is used to compensate the internal
skews among the data bits and the strobe on the read path. The computed skews are
used in the voltage and temperature tracking after calibration is completed.
2. After BISC completion, calibration logic performs the required power-on initialization
sequence for the memory. This is followed by several stages of timing calibration for the
write and read datapaths.
3. After calibration is completed, PHY calculates internal offsets to be used in voltage and
temperature tracking.
4. When PHY indicates the calibration completion, the user interface command execution
begins.
Figure 24-3 shows the overall flow of memory initialization and the different stages of
calibration.
Figure 24-3 (X14890-082415): memory initialization and calibration sequence: initialization/calibration start, BISC calibration, QDR-IV initialization, address calibration, DK to CK alignment, read centering, read sanity, write centering, write sanity, sanity test, calibration complete.
Reset Sequence
The sys_rst signal resets the entire memory design which includes general interconnect
(fabric) logic which is driven by the MMCM clock (clkout0) and RIU logic. MicroBlaze™ and
calibration logic are driven by the MMCM clock (clkout6). The sys_rst input signal is
synchronized internally to create the qdriv_rst_clk signal. The qdriv_rst_clk reset
signal is synchronously asserted and synchronously deasserted.
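The synchronizer itself is internal to the IP; the following generic Verilog sketch only illustrates the synchronous-assert/synchronous-deassert behavior described above and is not the IP's implementation.

module rst_sync_sketch (
  input  wire clk,       // fabric clock (MMCM clkout0 domain)
  input  wire sys_rst,   // active-High input reset
  output wire rst_sync   // synchronized fabric reset (analogous to qdriv_rst_clk)
);
  reg [1:0] sync_ff = 2'b11;  // start in reset for simulation

  always @(posedge clk)
    sync_ff <= {sync_ff[0], sys_rst};  // both assertion and deassertion pass through the flops

  assign rst_sync = sync_ff[1];
endmodule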
Figure 24-4 shows that qdriv_rst_clk (the fabric reset) is synchronously asserted a few clock cycles after sys_rst is asserted. After qdriv_rst_clk is asserted, the clocks run for a few more cycles before they are shut off.
X-Ref Target - Figure 24-4
The MicroBlaze MCS ECC can be selected from the MicroBlaze MCS ECC option section in
the Advanced Options tab. The block RAM size increases if the ECC option for MicroBlaze
MCS is selected.
Clocking
The memory interface requires one mixed-mode clock manager (MMCM), one TXPLL per I/O bank used by the memory interface, and two BUFGs. These clocking components are used
to create the proper clock frequencies and phase shifts necessary for the proper operation
of the memory interface.
There are two TXPLLs per bank. If a bank is shared by two memory interfaces, both TXPLLs
in that bank are used.
The QDR-IV IP generates the appropriate clocking structure for the desired interface. This
structure must not be modified. The allowed clock configuration is as follows:
Requirements
GCIO
• Must use a differential I/O standard
• Must be in the same I/O column as the memory interface
• The I/O standard and termination scheme are system dependent. For more information,
consult the UltraScale Architecture SelectIO Resources User Guide (UG571) [Ref 7].
MMCM
• MMCM is used to generate the FPGA logic system clock (1/2 of the memory clock)
• Must be located in the center bank of memory interface
• Must use internal feedback
• Input clock frequency divided by input divider must be ≥ 70 MHz (CLKINx / D ≥
70 MHz)
• Must use integer multiply and output divide values
° For two bank systems, the bank with the higher number of bytes selected is chosen
as the center bank. If the same number of bytes is selected in two banks, then the
top bank is chosen as the center bank.
TXPLL
• CLKOUTPHY from TXPLL drives XIPHY within its bank
• TXPLL must be set to use a CLKFBOUT phase shift of 90°
• TXPLL must be held in reset until the MMCM lock output goes High
• Must use internal feedback
Figure 25-1 shows an example of the clocking structure for a three bank memory interface.
The GCIO drives the MMCM located at the center bank of the memory interface. MMCM
drives both the BUFGs located in the same bank. The BUFG (which is used to generate
system clock to FPGA logic) output drives the TXPLLs used in each bank of the interface.
Figure 25-1 (X24449-081021): clocking structure for a three-bank memory interface, with the differential GCIO input driving the MMCM, the MMCM driving the BUFGs, and the BUFG output driving the TXPLL in each I/O bank.
• For two bank systems, MMCM is placed in a bank with the most number of bytes
selected. If they both have the same number of bytes selected in two banks, then
MMCM is placed in the top bank.
• For four bank systems, MMCM is placed in a second bank from the top.
For designs generated with System Clock configuration of No Buffer, MMCM must not be
driven by another MMCM/PLL. Cascading clocking structures MMCM → BUFG → MMCM
and PLL → BUFG → MMCM are not allowed.
If the MMCM is driven by the GCIO pin of the other bank, then the
CLOCK_DEDICATED_ROUTE constraint with value "BACKBONE" must be set on the net that
is driving MMCM or on the MMCM input. Setting up the CLOCK_DEDICATED_ROUTE
constraint on the net is preferred. But when the same net is driving two MMCMs, the
CLOCK_DEDICATED_ROUTE constraint must be managed by considering which MMCM
needs the BACKBONE route.
In such cases, the CLOCK_DEDICATED_ROUTE constraint can be set on the MMCM input. To
use the "BACKBONE" route, any clock buffer that exists in the same CMT tile as the GCIO
must exist between the GCIO and MMCM input. The clock buffers that exist in the I/O CMT are BUFG, BUFGCE, BUFGCTRL, and BUFGCE_DIV. Therefore, QDR-IV SRAM instantiates BUFG
between the GCIO and MMCM when the GCIO pins and MMCM are not in the same bank
(see Figure 25-1).
If the GCIO pin and MMCM are allocated in different banks, QDR-IV SRAM generates
CLOCK_DEDICATED_ROUTE constraints with value as "BACKBONE." If the GCIO pin and
MMCM are allocated in the same bank, there is no need to set any constraints on the
MMCM input.
Similarly when designs are generated with System Clock Configuration as a No Buffer
option, you must take care of the "BACKBONE" constraint and the BUFG/BUFGCE/
BUFGCTRL/BUFGCE_DIV between GCIO and MMCM if GCIO pin and MMCM are allocated in
different banks. QDR-IV SRAM does not generate clock constraints in the XDC file for No
Buffer configurations and you must take care of the clock constraints for No Buffer
configurations. For more information on clocking, see the UltraScale Architecture Clocking
Resources User Guide (UG572) [Ref 8].
For more information on the CLOCK_DEDICATED_ROUTE constraints, see the Vivado Design
Suite Properties Reference Guide (UG912) [Ref 9].
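As an illustration only, the constraint has the following form in the XDC; the net name sys_clk_ibuf is a placeholder for whichever net drives the MMCM in your design.

# Apply when the GCIO pin and the MMCM are in different banks.
set_property CLOCK_DEDICATED_ROUTE BACKBONE [get_nets sys_clk_ibuf]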
Note: If two different GCIO pins are used for two QDR-IV SRAM IP cores in the same bank, center
bank of the memory interface is different for each IP. QDR-IV SRAM generates MMCM LOC and
CLOCK_DEDICATED_ROUTE constraints accordingly.
1. QDR-IV SRAM generates a single-ended input for system clock pins, such as
sys_clk_i. Connect the differential buffer output to the single-ended system clock
inputs (sys_clk_i) of both the IP cores.
2. System clock pins must be allocated within the same I/O column as the memory interface pins. Add the pin LOC constraints for the system clock pins and the clock constraints in your top-level XDC.
3. You must add a "BACKBONE" constraint on the net that is driving the MMCM or on the
MMCM input if GCIO pin and MMCM are not allocated in the same bank. Apart from
this, BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV must be instantiated between GCIO and
MMCM to use the "BACKBONE" route.
Note:
° The UltraScale architecture includes an independent XIPHY power supply and TXPLL
for each XIPHY. This results in clean, low jitter clocks for the memory system.
° Skew spanning across multiple BUFGs is not a concern because a single point of contact exists between BUFG → TXPLL and the same BUFG → System Clock Logic.
° System input clock cannot span I/O columns because the longer the clock lines
span, the more jitter is picked up.
TXPLL Usage
There are two TXPLLs per bank. If a bank is shared by two memory interfaces, both TXPLLs
in that bank are used. One PLL per bank is used if a bank is used by a single memory interface. You can use the second PLL for other purposes. To use the second PLL, perform the following steps:
1. Generate the design for the System Clock Configuration option as No Buffer.
2. QDR-IV SRAM generates a single-ended input for system clock pins, such as
sys_clk_i. Connect the differential buffer output to the single-ended system clock
inputs (sys_clk_i) and also to the input of PLL (PLL instance that you have in your
design).
3. You can use the PLL output clocks.
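The following Verilog sketch illustrates these steps with a user PLL; the instance names, the commented-out core instantiation, and all PLLE3_BASE parameter values are placeholders only (see the UltraScale Architecture Clocking Resources User Guide (UG572) [Ref 8] for legal settings and the full port list).

module second_pll_sketch (
  input  wire c0_sys_clk_p,   // GCIO pair shared with the IP core
  input  wire c0_sys_clk_n,
  output wire user_clk0,      // additional clock from the spare PLL
  output wire pll_locked
);
  wire clk_in_se;
  wire pll_fb;

  // Differential buffer instantiated outside the IP (No Buffer configuration).
  IBUFDS u_ibufds_sys_clk (
    .I  (c0_sys_clk_p),
    .IB (c0_sys_clk_n),
    .O  (clk_in_se)
  );

  // The same single-ended clock drives the IP core, for example:
  // qdriv_0 u_qdriv_0 (.sys_clk_i (clk_in_se) /* remaining ports omitted */);

  // Spare PLL in the bank used for other clock generation.
  PLLE3_BASE #(
    .CLKIN_PERIOD   (3.334),  // placeholder: 300 MHz reference
    .DIVCLK_DIVIDE  (1),
    .CLKFBOUT_MULT  (4),      // placeholder: 1200 MHz VCO
    .CLKOUT0_DIVIDE (6)       // placeholder: 200 MHz output
  ) u_user_pll (
    .CLKIN    (clk_in_se),
    .CLKFBIN  (pll_fb),
    .CLKFBOUT (pll_fb),
    .CLKOUT0  (user_clk0),
    .RST      (1'b0),
    .PWRDWN   (1'b0),
    .LOCKED   (pll_locked)
  );
endmodule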
Additional Clocks
You can produce up to four additional clocks which are created from the same MMCM that
generates ui_clk. Additional clocks can be selected from the Clock Options section in the
Advanced Options tab. The GUI lists the possible clock frequencies from MMCM and the
frequencies for additional clocks vary based on selected memory frequency (Memory
Device Interface Speed (ps) value in the Basic tab), selected FPGA, and FPGA speed grade.
For situations where the memory interface is reset and recalibrated without a
reconfiguration of the FPGA, the SEM IP must be set into IDLE state to disable the memory
scan and to send the SEM IP back into the scanning (Observation or Detect only) states
afterwards. This can be done in one of two ways: through the “Command Interface” or the “UART interface.” See Chapter 3 of the UltraScale Architecture Soft Error Mitigation Controller
LogiCORE IP Product Guide (PG187) [Ref 10] for more information.
Resets
An asynchronous reset (sys_rst) input is provided. This active-High reset must assert for
a minimum of 20 cycles of the FPGA logic clock.
For more information on reset, see the Reset Sequence in Chapter 24, Core Architecture.
All the DQ pins of a single data group can be allocated to any I/O pin except pin 1,
pin 7, and pin 12 of the given byte lane.
The QK/QK# pair of a single data group must be allocated to the pin 0/pin 1 pair or the pin 6/pin 7 pair of the given byte lane.
The QK/QK# pair of a single data group must be allocated to the pin 0/pin 1 pair or the pin 6/pin 7 pair of byte lanes 1 or 2.
All the DK/DK# pair of a single data group can be allocated to any differential pin
pair.
RECOMMENDED: Xilinx strongly recommends that the DCIUpdateMode option is kept with the default
value of ASREQUIRED so that the DCI circuitry is allowed to operate normally.
13. There are dedicated VREF pins (not included in the rules above). Either internal or external VREF is permitted. If an external VREF is not used, the VREF pins must be pulled
to ground by a resistor value specified in the UltraScale™ Device FPGAs SelectIO™
Resources User Guide (UG571) [Ref 7]. These pins must be connected appropriately for
the standard in use.
14. The system reset pin (sys_rst_n) must not be allocated to Pins N0 and N6 if the byte
is used for the memory I/Os.
IMPORTANT: QDR-IV IP does not support data inversion. Contact your memory vendor for terminating
DINVA and DINVB at memory.
DQB12 N0 – AY30
QVLDB0 N1 – AY31
DQB9 N2 – AW31
DQB11 N3 – AY32
DQB15 N4 – AW29
DQB10 N5 – AW30
44 QKB0_P T1 N6 – AV33
QKB0_N N7 – AW33
DQB17 N8 GCIO_P_3 AU30
DQB13 N9 GCIO_N_3 AU31
DQB14 N10 GCIO_P_4 AV31
DQB16 N11 GCIO_N_4 AV32
– N12 – AV29
Bank | Signal Name | Byte Group Number | Byte Group I/O | Special Designation | Pin Number
DQB24 N0 GCIO_P_1 AT32
– N1 GCIO_N_1 AU32
DQB19 N2 GCIO_P_2 AT29
DQB20 N3 GCIO_N_2 AU29
DQB22 N4 – AR32
DQB21 N5 – AR33
44 QKB1_P T2 N6 – AR28
QKB1_N N7 – AT28
DQB23 N8 – AP30
DQB26 N9 – AR31
DQB25 N10 – AR30
DQB18 N11 – AT30
– N12 – AT33
DQB27 N0 – AN31
QVLDB1 N1 – AP31
DQB29 N2 – AN32
DQB33 N3 – AP33
DQB32 N4 – AP28
DQB34 N5 – AP29
44 DKB1_P T3 N6 – AM30
DKB1_N N7 – AM31
DQB31 N8 – AM29
DQB35 N9 – AN29
DQB28 N10 – AL29
DQB30 N11 – AL30
– N12 – AN28
– N0 – BC28
– N1 – BD28
– N2 – BD25
– N3 – BD26
– N4 – BB27
– N5 – BC27
45 – T0 N6 – BC24
– N7 – BD24
– N8 – BB26
– N9 – BC26
– N10 – BB24
– N11 – BB25
– N12 VRP BA28
– N0 – AW28
PE_N N1 – AY28
RWB_N N2 – BA24
LDB_N N3 – BA25
RWA_N N4 – AY27
LDA_N N5 – BA27
45 LBK1_N T1 N6 – AW25
LBK0_N N7 – AY25
– N8 GCIO_P_4 AV26
– N9 GCIO_N_4 AV27
RST_N N10 GCIO_P_3 AW26
CFG_N N11 GCIO_N_3 AY26
– N12 – AV28
A18 N0 GCIO_P_2 AU25
A20 N1 GCIO_N_2 AU26
A17 N2 GCIO_P_1 AR25
A16 N3 GCIO_N_1 AT25
A15 N4 – AT27
A14 N5 – AU27
45 CK_P T2 N6 – AP25
CK_N N7 – AP26
A13 N8 – AR26
A12 N9 – AR27
A11 N10 – AN24
A10 N11 – AP24
AP N12 – AT24
A9 N0 – AN26
A19 N1 – AN27
A8 N2 – AK25
A7 N3 – AL25
A6 N4 – AM26
A5 N5 – AM27
45 A4 T3 N6 – AK27
AINV N7 – AL27
A3 N8 – AM24
A2 N9 – AM25
A1 N10 – AJ26
A0 N11 – AK26
– N12 – AL24
DQA12 N0 – BA33
– N1 – BB34
DQA15 N2 – BC33
DQA11 N3 – BD33
DQA10 N4 – BA34
DQA16 N5 – BA35
46 DKA0_P T0 N6 – BC34
DKA0_N N7 – BD34
DQA9 N8 – BB35
DQA17 N9 – BC36
DQA14 N10 – BD35
DQA13 N11 – BD36
– N12 VRP AY33
DQA7 N0 – BC39
QVLDA0 N1 – BD39
DQA5 N2 – BB36
DQA3 N3 – BC37
DQA8 N4 – BA38
DQA1 N5 – BB39
46 QKA0_P T1 N6 – BC38
QKA0_N N7 – BD38
DQA0 N8 GCIO_P_1 AY37
DQA4 N9 GCIO_N_1 AY38
DQA2 N10 GCIO_P_2 BA37
DQA6 N11 GCIO_N_2 BB37
– N12 – BA39
DQA19 N0 GCIO_P_4 AV36
– N1 GCIO_N_4 AW36
DQA18 N2 GCIO_P_3 AY35
DQA26 N3 GCIO_N_3 AY36
DQA25 N4 – AW38
DQA20 N5 – AW39
46 QKA1_P T2 N6 – AW34
QKA1_N N7 – AW35
DQA24 N8 – AU39
DQA23 N9 – AV39
DQA22 N10 – AV37
DQA21 N11 – AV38
– N12 – AV34
DQA32 N0 – AT34
QVLDA1 N1 – AT35
DQA35 N2 – AU34
DQA31 N3 – AU35
DQA30 N4 – AR39
DQA27 N5 – AT39
46 DKA1_P T3 N6 – AU36
DKA1_N N7 – AU37
DQA33 N8 – AR37
DQA29 N9 – AR38
DQA34 N10 – AT37
DQA28 N11 – AT38
– N12 – AR36
Table 25-2 shows an example of an 18-bit QDR-IV SRAM interface contained within three
banks.
Bank | Signal Name | Byte Group Number | Byte Group I/O | Special Designation | Pin Number
– N0 – L38
– N1 – L39
– N2 – K38
– N3 – J38
– N4 – J39
– N5 – H39
49 – T0 N6 – H37
– N7 – G37
– N8 – H38
– N9 – G39
DKB1_P N10 – F38
DKB1_N N11 – F39
– N12 VRP K37
QKB1_P N0 – K36
QKB1_N N1 – J36
DQB9 N2 – J33
DQB13 N3 – H34
DQB12 N4 – J35
DQB15 N5 – H36
49 DQB17 T1 N6 – H33
QVLDB1 N7 – G34
DQB16 N8 GCIO_P_1 F34
DQB10 N9 GCIO_N_1 F35
DQB14 N10 GCIO_P_2 G35
DQB11 N11 GCIO_N_2 G36
– N12 – F33
QKB0_P N0 GCIO_P_3 F37
QKB0_N N1 GCIO_N_3 E38
DQB6 N2 GCIO_P_4 E36
DQB8 N3 GCIO_P_3 E37
DQB5 N4 – D39
DQB7 N5 – C39
49 DQB4 T2 N6 – D38
QVLDB0 N7 – C38
DQB0 N8 – D36
DQB3 N9 – C36
DQB1 N10 – B37
DQB2 N11 – A37
– N12 – C37
DKB0_P N0 – E35
DKB0_N N1 – D35
– N2 – D34
– N3 – C34
– N4 – D33
– N5 – C33
49 – T3 N6 – B35
– N7 – B36
– N8 – B34
– N9 – A35
– N10 – A33
– N11 – A34
– N12 – E33
– N0 – R27
– N1 – R28
– N2 – M30
– N3 – L30
– N4 – P28
– N5 – P29
50 – T0 N6 – N29
– N7 – M29
– N8 – N27
– N9 – N28
– N10 – L28
– N11 – L29
– N12 VRP T28
– N0 – K30
PE_N N1 – J30
RWB_N N2 – K31
LDB_N N3 – J31
RWA_N N4 – J28
LDA_N N5 – J29
50 LBK1_N T1 N6 – H32
LBK0_N N7 – G32
– N8 GCIO_P_4 H28
– N9 GCIO_N_4 H29
RST_N N10 GCIO_P_3 H31
CFG_N N11 GCIO_N_3 G31
– N12 – K28
A18 N0 GCIO_P_2 G30
A20 N1 GCIO_N_2 F30
A17 N2 GCIO_P_1 G29
A16 N3 GCIO_N_1 F29
A15 N4 – F32
A14 N5 – E32
50 CK_P T2 N6 – E30
CK_N N7 – D30
A13 N8 – E31
A12 N9 – D31
A11 N10 – F28
A10 N11 – E28
AP N12 – D29
A9 N0 – C31
A19 N1 – C32
A8 N2 – B30
A7 N3 – B31
A6 N4 – B32
A5 N5 – A32
50 A4 T3 N6 – A29
AINV N7 – A30
A3 N8 – C29
A2 N9 – B29
A1 N10 – D28
A0 N11 – C28
– N12 – A28
QKA0_P N0 – T25
QKA0_N N1 – R25
DQA1 N2 – T23
DQA8 N3 – R23
DQA4 N4 – R26
DQA6 N5 – P26
51 DQA0 T0 N6 – P24
QVLDA0 N7 – N24
DQA7 N8 – P25
DQA5 N9 – N26
DQA2 N10 – P23
DQA3 N11 – N23
– N12 VRP M25
DKA0_P N0 – M26
DKA0_N N1 – M27
– N2 – M24
– N3 – L25
– N4 – L27
– N5 – K27
51 – T1 N6 – L23
– N7 – L24
– N8 GCIO_P_4 K25
– N9 GCIO_N_4 J25
DKA1_P N10 GCIO_P_2 K26
DKA1_N N11 GCIO_N_2 J26
– N12 – J24
QKA1_P N0 GCIO_P_3 G25
QKA1_N N1 GCIO_N_3 F25
DQA12 N2 GCIO_P_1 H26
DQA9 N3 GCIO_N_1 G26
DQA10 N4 – F27
DQA13 N5 – E27
51 DQA15 T2 N6 – H27
QVLDA1 N7 – G27
DQA16 N8 – E25
DQA17 N9 – E26
DQA11 N10 – G24
DQA14 N11 – F24
– N12 – H24
– N0 – D26
– N1 – C27
– N2 – B27
– N3 – A27
– N4 – C26
– N5 – B26
51 – T3 N6 – B25
– N7 – A25
– N8 – D24
– N9 – D25
– N10 – C24
– N11 – B24
– N12 – A24
Protocol Description
This core has the following interfaces:
• Memory Interface
• User Interface
• Physical Interface
Memory Interface
The QDR-IV SRAM core is customizable to support several configurations. The specific
configuration is defined by Verilog parameters in the top-level of the core.
User Interface
The user interface connects an FPGA user design to the QDR-IV SRAM core to simplify
interactions between the user design and the external memory device. The user interface
provides a set of signals used to issue a read or write command to the memory device.
These signals are summarized in Table 25-3.
Controller Features
The QDR-IV SRAM memory controller is designed to take read and write commands from the user interface and convert them so that they are compatible with the QDR-IV SRAM memory protocol. It also ensures that the commands to the memory are handled with low latencies, meeting all the QDR-IV SRAM memory timing requirements.
The best efficiency from the controller is achieved when there is unidirectional traffic on each port, without any bank collisions and without command switches from read to write or vice versa. When there are alternating read/write commands, efficiency is lost because the bidirectional QDR-IV SRAM data bus needs to be turned around. Also, when there is a bank collision, the controller has to add latency to avoid the collision at the memory interface, which reduces efficiency. Because there are four channels per port that can be used for sending commands to the memory, you should know the command order and priorities. The following sections describe these in detail.
Command order at the memory interface (X16056-022216): READ ch0 PORT A, READ ch0 PORT B, READ ch1 PORT A, READ ch1 PORT B, READ ch2 PORT A, READ ch2 PORT B, READ ch3 PORT A, READ ch3 PORT B.
Bank Collision
Note: HP memory devices do not have bank access restriction so bank collision does not apply
when you are dealing with HP memory devices.
The last three bits of the address denote which bank out of the eight available banks in the
memory device is being accessed. For XP memory parts, the access rule is that PORT B cannot access the same bank in the same clock cycle as PORT A. Because there are four channels, the bank comparison for collision is done on a per-channel basis. If a collision is found on any of the four channels, all four channels of the corresponding port are affected, as explained in the later sections.
For detecting whether there is a collision, the last three bits of the channel address are
compared. The following conditions are checked for collision detection:
1. Comparison is done channel-wise. That is, PORT A channel 0 is compared with PORT B
channel 0. PORT A channel 0 is never compared with PORT B channel 1 or any other
channel.
2. Only the last three bits of the channel address are compared. They all must match for a collision to be detected.
3. The restriction on accessing the same bank in the same cycle applies only to PORT B, not to PORT A. This means that for detecting a collision, the last three bits of channel 0 of PORT A are compared with the last three bits of channel 0 of PORT B.
PORT B channel 0 is not compared with PORT A channel 1 for collision detection. This is
illustrated in Figure 25-3.
Figure 25-3 (X16057-022216): channel-wise collision comparison of app_addr_a_ch0[2:0] with app_addr_b_ch0[2:0], app_addr_a_ch1[2:0] with app_addr_b_ch1[2:0], app_addr_a_ch2[2:0] with app_addr_b_ch2[2:0], and app_addr_a_ch3[2:0] with app_addr_b_ch3[2:0].
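The controller performs this comparison internally; the following Verilog sketch (with a placeholder address width) only illustrates the per-channel check described above and ignores command-valid qualification for brevity.

module bank_collision_sketch #(
  parameter ADDR_WIDTH = 21   // placeholder address width
) (
  input wire [ADDR_WIDTH-1:0] app_addr_a_ch0, app_addr_b_ch0,
  input wire [ADDR_WIDTH-1:0] app_addr_a_ch1, app_addr_b_ch1,
  input wire [ADDR_WIDTH-1:0] app_addr_a_ch2, app_addr_b_ch2,
  input wire [ADDR_WIDTH-1:0] app_addr_a_ch3, app_addr_b_ch3,
  output wire                 bank_collision
);
  // Channel-wise comparison of the bank bits (the last three address bits).
  wire coll_ch0 = (app_addr_a_ch0[2:0] == app_addr_b_ch0[2:0]);
  wire coll_ch1 = (app_addr_a_ch1[2:0] == app_addr_b_ch1[2:0]);
  wire coll_ch2 = (app_addr_a_ch2[2:0] == app_addr_b_ch2[2:0]);
  wire coll_ch3 = (app_addr_a_ch3[2:0] == app_addr_b_ch3[2:0]);

  // A collision on any one channel affects all four channels of the port.
  assign bank_collision = coll_ch0 | coll_ch1 | coll_ch2 | coll_ch3;
endmodule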
Because each port has four channels sending commands every clock cycle, there is a defined order in which the controller issues the commands to the memory.
Figure 25-4 shows how the bank collision signals for PORT A and PORT B are asserted. To
begin, the priority remains with PORT A for accessing any bank. If PORT B tries to access
(read command or write command) the same bank in the same clock, it is considered a
PORT B collision. Therefore, Bank_collision_B is asserted and controller delays the
processing of PORT B command by one user clock.
If PORT A accesses the same bank again after the pending PORT B command is serviced, it
is considered a PORT A collision. PORT A command processing is delayed by one user clock.
This is done to provide equal opportunity to both the ports in case they are trying to access
the same bank back-to-back.
Figure 25-4 takes channel 0 as an example and is true for all the channels.
Figure 25-4 (X16079-022216): bank collision waveform showing ui_clk, app_addr_a_ch0[2:0], app_addr_b_ch0[2:0], Bank_collision_A, Bank_collision_B, app_cmd_rdy_a, and app_cmd_rdy_b.
As shown in Figure 25-4, when there is a bank collision on PORT A or PORT B, the corresponding ready signal is deasserted. It is your responsibility not to issue any command after sampling the ready Low (that is, hold the next command transaction until the ready is asserted again). A command issued while the ready is Low is lost for all four channels of the corresponding port, because app_cmd_rdy_a and app_cmd_rdy_b are common to all four channels.
Note: Because there are four channels per port, collision on one channel delays the command
processing for all four channels for that port. The following simulation snapshot explains this
scenario (Figure 25-5).
For example, the first collision occurs on PORT B channel 0 because PORT B wants to access
the same bank in the same user clock. Then, the controller stores PORT B commands for all
four channels and deasserts the PORT B ready signal. It is your responsibility to hold the
next set of commands for PORT B until the user interface asserts the ready signal. If you
issue another set of commands when the ready is Low, those commands for all four channels are lost and do not reach the memory interface.
Simulation snapshot (X16091-022216): per-channel comparison of app_addr_a_ch0–ch3[2:0] against app_addr_b_ch0–ch3[2:0].
Examples
The following section shows the command sequence at the input and output of the controller after it handles command switching and collisions and distributes the commands to the memory interface.
In the first case, the command sequence at the input and output of the controller is shown when there is no collision. The PORT A command has switched from read to write. As explained in the earlier section, the data bus is common for read and write and has to switch direction. Therefore, the write has to wait for the read command to be completed.
The controller introduces eight NOPs on PORT A to avoid bus contention at the memory interface (see Figure 25-7). Because the write latency of the memory device is less than the read latency, when the command switches from write to read the controller inserts four NOPs between the commands. All PORT A commands are sent on the rising edge of the memory clock (the CK clock shown in Figure 25-7) and all PORT B commands are sent on the falling edge.
Figure 25-7 (X16146-091316): command sequence at the memory interface for a read-to-write switch on PORT A with no collision, showing the NOPs inserted on PORT A while the PORT B commands continue on the falling edges.
In the second case, there is a collision between channel 0 of PORT A and PORT B, but there is no command switching. The collision on PORT B results in its command processing being delayed by one user clock. The controller now serves the next PORT A command to avoid a bank-rule violation at the memory. This is shown in Figure 25-8, where all PORT A commands are sent to the memory on the rising edge; on the falling edge, four NOPs are inserted first and then the pending PORT B commands.
Figure 25-8 (X16147-091316): command sequence for a bank collision with no command switch, showing the PORT A read commands on the rising edges and the delayed PORT B commands following the inserted NOPs on the falling edges.
Finally, the worst case is when there are both a bank collision and a command switch. First, PORT B has a collision and its execution is delayed by one clock. One clock later, when the controller serves the PORT B command, the next command on PORT A is a write, which is a command switch. This is shown in Figure 25-9, where four NOPs are inserted on PORT B because of the collision and eight NOPs are inserted on PORT A for the read-to-write command switch.
Figure 25-9 (X16148-091316): command sequence for a bank collision combined with a read-to-write switch, showing the NOPs inserted on both ports around the PORT A read commands.
Command Table
There are three types of commands supported by the controller. Table 25-4 lists the
command encoding.
If you assert write and read commands in the same user clock, the controller takes care of asserting the NOP command before asserting a write command that follows a read command. It then asserts a busy signal to stop you from sending any further commands until it completes execution of all the accepted commands.
If you address the same bank on PORT A and PORT B, the controller delays the command from PORT B by 1.5 memory clock cycles and asserts busy until all the commands have completed execution.
Command Sequence
Because the read and write latencies are different for a given memory, the user interface ensures that the read and write command sequence issued to the memory is in the correct order. If this were not guaranteed by the user interface, the device might be damaged because the QDR-IV memory data bus is bidirectional. Suppose that for a given memory device the read latency is eight and the write latency is five.
For example, suppose you issue one read command followed by three write commands on the four channels. From the time the read command is executed, eight memory clocks are required to retrieve the read data. On the eighth clock, the memory drives the data bus. Because of the following write commands, the FPGA tries to drive the data bus on the sixth, seventh, and eighth cycles.
On the eighth cycle, both the FPGA and the memory would try to drive the same bus. This might damage the device, so the user interface prevents it by inserting NOPs. This is explained further in Table 25-5 (W stands for a write command, R stands for a read command, and NP stands for a No Operation command).
Commands at the User Interface | Commands at the Memory Interface
WR-RD-WR-RD | WR-NP-NP-NP-NP-RD-NP-NP-NP-NP-NP-NP-NP-NP-WR-NP-NP-NP-NP-RD
RD-WR-RD-WR | RD-NP-NP-NP-NP-NP-NP-NP-NP-WR-NP-NP-NP-NP-RD-NP-NP-NP-NP-NP-NP-NP-NP-WR
Figure 25-10 (X14926-022216): user interface timing showing ui_clk, ui_rst, app_cmd_en_a, app_cmd_rdy_a, and init_calib_complete, with write commands (app_cmd_a_ch0 = “11”, app_cmd_a_ch2 = “11”) and their address and write data on channels 0 and 2, and read commands (app_cmd_a_ch1 = “10”, app_cmd_a_ch3 = “10”) with read data and app_rddata_valid on channels 1 and 3.
Wait until the init_calib_complete signal is asserted High before sending any command, as shown in Figure 25-10. No read or write requests are processed before init_calib_complete is High (that is, app_wr_cmd or app_rd_cmd on the client interface is ignored).
Figure 25-10 shows various commands being issued on the different channels. Channels 0 and 2 issue write commands, and channels 1 and 3 issue read commands. For details, see the command table (Table 25-4).
For the write commands, the write address and write data have to be valid in the same clock cycle as the write command. This means that for channel 0, app_wrdata_a_ch0 gets written at location app_addr_a_ch0. The same applies to the other write channel, channel 2.
For the read commands, the read address has to be present when the read command is asserted. The read data is available after a few clock cycles along with the read valid signal. In Figure 25-10, for channel 1, app_rddata_a_ch1 becomes available with the app_rddata_valid[1] signal.
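The following Verilog sketch illustrates driving PORT A as described above; the module name and the address/data widths are placeholders, only channels 0 (write) and 1 (read) are shown, and the “11”/“10” command encodings follow Table 25-4 and Figure 25-10.

module qdriv_porta_sketch #(
  parameter ADDR_WIDTH = 21,   // placeholder
  parameter DATA_WIDTH = 72    // placeholder
) (
  input  wire                   ui_clk,
  input  wire                   init_calib_complete,
  input  wire                   app_cmd_rdy_a,
  input  wire [3:0]             app_rddata_valid,
  input  wire [DATA_WIDTH-1:0]  app_rddata_a_ch1,
  input  wire [ADDR_WIDTH-1:0]  wr_addr,          // from user logic
  input  wire [ADDR_WIDTH-1:0]  rd_addr,          // from user logic
  input  wire [DATA_WIDTH-1:0]  wr_data,          // from user logic
  output reg                    app_cmd_en_a,
  output reg  [1:0]             app_cmd_a_ch0,
  output reg  [1:0]             app_cmd_a_ch1,
  output reg  [ADDR_WIDTH-1:0]  app_addr_a_ch0,
  output reg  [ADDR_WIDTH-1:0]  app_addr_a_ch1,
  output reg  [DATA_WIDTH-1:0]  app_wrdata_a_ch0,
  output reg  [DATA_WIDTH-1:0]  rd_capture_ch1
);
  always @(posedge ui_clk) begin
    app_cmd_en_a <= 1'b0;

    // Issue commands only after calibration and only while the port is ready;
    // commands issued while app_cmd_rdy_a is Low are lost.
    if (init_calib_complete && app_cmd_rdy_a) begin
      app_cmd_en_a     <= 1'b1;
      app_cmd_a_ch0    <= 2'b11;   // write on channel 0
      app_addr_a_ch0   <= wr_addr; // address valid with the command
      app_wrdata_a_ch0 <= wr_data; // write data valid with the command
      app_cmd_a_ch1    <= 2'b10;   // read on channel 1
      app_addr_a_ch1   <= rd_addr;
    end

    // Read data returns a few cycles later with its valid flag.
    if (app_rddata_valid[1])
      rd_capture_ch1 <= app_rddata_a_ch1;
  end
endmodule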
Physical Interface
The physical interface is the connection from the FPGA memory interface solution to an
external QDR-IV SRAM device. The I/O signals for this interface are defined in Table 25-6.
These signals can be directly connected to the corresponding signals on the memory
device.
QKA[1:0], QKA_n[1:0] (O) – QKA[0]/QKA#[0] controls the DQA[17:0] outputs for the x36 configuration and the DQA[8:0] outputs for the x18 configuration, respectively. QKA[1]/QKA#[1] controls the DQA[35:18] outputs for the x36 configuration and the DQA[17:9] outputs for the x18 configuration, respectively.
QKB[1:0], QKB_n[1:0] (O) – QKB[0]/QKB#[0] controls the DQB[17:0] outputs for the x36 configuration and the DQB[8:0] outputs for the x18 configuration, respectively. QKB[1]/QKB#[1] controls the DQB[35:18] outputs for the x36 configuration and the DQB[17:9] outputs for the x18 configuration, respectively.
LDA_n (I) – Synchronous Load Input. LDA_n is sampled on the rising edge of the CK clock. LDA_n enables commands for data PORT A. LDA_n enables the commands when LDA_n is Low and disables the commands when LDA_n is High. When the command is disabled, new commands are ignored, but internal operations continue.
LDB_n (I) – Synchronous Load Input. LDB_n is sampled on the falling edge of the CK clock. LDB_n enables commands for data PORT B. LDB_n enables the commands when LDB_n is Low and disables the commands when LDB_n is High. When the command is disabled, new commands are ignored, but internal operations continue.
RWA_n (I) – Synchronous Read/Write Input. The RWA_n input is sampled on the rising edge of the CK clock. The RWA_n input is used in conjunction with the LDA_n input to select a read or write operation.
RWB_n (I) – The RWB_n input is sampled on the falling edge of the CK clock. The RWB_n input is used in conjunction with the LDB_n input to select a read or write operation.
QVLDA[1:0] (O) – Output Data Valid Indicator. The QVLDA pin indicates valid output data. QVLD is edge-aligned with QKA. For example, QVLDA[0] is edge-aligned with QKA[1:0] and QVLDA[1] is edge-aligned with QKA_n[1:0].
QVLDB[1:0] (O) – Output Data Valid Indicator. The QVLDB pin indicates valid output data. QVLD is edge-aligned with QKB. For example, QVLDB[0] is edge-aligned with QKB[1:0] and QVLDB[1] is edge-aligned with QKB_n[1:0].
CFG_n (I) – Configuration bit. This pin is used to configure different mode registers.
Figure 25-11 shows the timing diagram for the sample write and read operations at the
memory interface with write latency of three clock cycles and read latency of five clock
cycles, respectively.
The command is detected by the memory only when LDA_n or LDB_n is Low, for PORT A and
PORT B, respectively. When RWA_n is Low, it is a write command, and when it is High, it is a
read command. The same applies to RWB_n for PORT B. The address bus is DDR: the address
is sampled on the rising edge of CK for PORT A and on the falling edge for PORT B.
In Figure 25-11, the cursor position points to a PORT A write command. The write address is
0x050EE8. The DDR data is written into the memory as 0xC_6B7* and 0x0_57B* with the
write latency of three clock cycles.
The following falling edge is a PORT B write command at address 0x0A7BC4, and the DDR
data written to this memory address at PORT B is 0xF_754* and 0x7_7B2.
The next CK rising edge is a PORT A read command at address 0x0E6741. The corresponding
data becomes available on the DQA data bus after five CK clock cycles, aligned to the rising
edge of the QK clock, because the read latency is five. The DDR read data is 0xC_818* and
0xA_150*. QVLDA is also asserted along with the data. For more information on read and
write timing, see the QDR-IV memory specification.
X-Ref Target - Figure 25-11
• Memory IP lists the possible Reference Input Clock Speed values based on the targeted
memory frequency (based on selected Memory Device Interface Speed).
• Otherwise, select M and D Options and target the desired Reference Input Clock Speed,
which is calculated from the CLKFBOUT_MULT (M), DIVCLK_DIVIDE (D), and
CLKOUT0_DIVIDE (D0) values selected in the Advanced Clocking Tab.
The required Reference Input Clock Speed is calculated from the M, D, and D0 values
entered in the GUI using the following formulas:
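A hedged sketch of the relationship, using the standard MMCM equations (the VCO frequency equals the input frequency times M/D, and CLKOUT0 divides the VCO by D0):

$$f_{VCO} = f_{REF}\cdot\frac{M}{D},\qquad f_{CLKOUT0} = \frac{f_{VCO}}{D0} = f_{REF}\cdot\frac{M}{D\cdot D0}\;\;\Rightarrow\;\; f_{REF} = f_{CLKOUT0}\cdot\frac{D\cdot D0}{M}$$

where $f_{CLKOUT0}$ is the MMCM output frequency that the IP derives from the selected $t_{CK}$ (for example, $1/(4\cdot t_{CK})$ when the system clock is one quarter of the memory clock).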
Where tCK is the Memory Device Interface Speed selected in the Basic tab.
Calculated Reference Input Clock Speed from M, D, and D0 values are validated as per
clocking guidelines. For more information on clocking rules, see Clocking.
Apart from the memory specific clocking rules, validation of the possible MMCM input
frequency range, MMCM VCO frequency range, and MMCM PFD frequency range values are
completed for M, D, and D0 in the GUI.
For UltraScale devices, see Kintex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS892) [Ref 2] and Virtex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS893) [Ref 3] for MMCM Input frequency range, MMCM VCO frequency
range, and MMCM PFD frequency range values.
For UltraScale+ devices, see Kintex UltraScale+ FPGAs Data Sheet: DC and AC Switching
Characteristics (DS922) [Ref 4], Virtex UltraScale+ FPGAs Data Sheet: DC and AC Switching
Characteristics (DS923) [Ref 5], and Zynq UltraScale+ MPSoC Data Sheet: DC and AC
Switching Characteristics (DS925) [Ref 6] for MMCM Input frequency range, MMCM VCO
frequency range, and MMCM PFD frequency range values.
For possible M, D, and D0 values and detailed information on clocking and the MMCM, see
the UltraScale Architecture Clocking Resources User Guide (UG572) [Ref 8].
• Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
[Ref 13]
• Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14]
• Vivado Design Suite User Guide: Getting Started (UG910) [Ref 15]
• Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16]
This section includes information about using Xilinx ® tools to customize and generate the
core in the Vivado Design Suite.
If you are customizing and generating the core in the IP integrator, see the Vivado Design
Suite User Guide: Designing IP Subsystems using IP Integrator (UG994) [Ref 13] for detailed
information. IP integrator might auto-compute certain configuration values when
validating or generating the design. To check whether the values change, see the
description of the parameter in this chapter. To view the parameter value, run the
validate_bd_design command in the Tcl Console.
You can customize the IP for use in your design by specifying values for the various
parameters associated with the IP core using the following steps:
For more information about generating the core in Vivado, see the Vivado Design Suite User
Guide: Designing with IP (UG896) [Ref 14] and the Vivado Design Suite User Guide: Getting
Started (UG910) [Ref 15].
Note: Figures in this chapter are illustrations of the Vivado Integrated Design Environment (IDE).
This layout might vary from the current version.
Basic Tab
Figure 26-1 shows the Basic tab when you start up the QDR-IV SRAM.
X-Ref Target - Figure 26-1
IMPORTANT: All parameters shown in the controller options dialog box are limited selection options in
this release.
For the Vivado IDE, all controllers (DDR3, DDR4, LPDDR3, QDR II+, QDR-IV, and RLDRAM 3)
can be created and available for instantiation.
1. Select the settings in the Clocking, Controller Options, and Memory Options.
In Clocking, the Memory Device Interface Speed sets the speed of the interface. The
speed entered drives the available Reference Input Clock Speeds. For more
information on the clocking structure, see the Clocking, page 415.
2. To use memory parts that are not available by default through the QDR-IV SRAM
Vivado IDE, you can create a custom parts CSV file, as specified in AR: 63462. This
CSV file has to be provided after enabling the Custom Parts Data File option. After
selecting this option, you are able to see the custom memory parts along with the
default memory parts. Note that simulations are not supported for custom parts.
Custom part simulations require manually adding the memory model to the simulation
and might require modifying the test bench instantiation.
IMPORTANT: The Data Mask (DM) option is always selected for AXI designs and is grayed out (you cannot
deselect it). For AXI interfaces, Read-Modify-Write (RMW) is supported, and Data Mask bits must be present
for RMW to mask certain bytes. Therefore, DM is always enabled for AXI interface designs. This is the case
for all data widths except 72-bit.
For 72-bit interfaces, ECC is enabled and DM is deselected and grayed out. Computing ECC is not compatible
with DM, so DM is disabled for 72-bit designs.
Figure 26-4: Vivado Customize IP Dialog Box – I/O Planning and Design Checklist
User Parameters
Table 26-1 shows the relationship between the fields in the Vivado IDE and the User
Parameters (which can be viewed in the Tcl Console).
Output Generation
For details, see the Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14].
I/O Planning
For details on I/O planning, see I/O Planning, page 235.
Required Constraints
The QDR-IV SRAM Vivado IDE generates the required constraints. A location constraint and
an I/O standard constraint are added for each external pin in the design. The location is
chosen by the Vivado IDE according to the banks and byte lanes chosen for the design.
The I/O standard is chosen by the memory type selection and options in the Vivado IDE and
by the pin type. A sample for qdriv_a[0] is shown here.
Clock Frequencies
This section is not applicable for this IP core.
Clock Management
For more information on clocking, see Clocking, page 359.
Clock Placement
This section is not applicable for this IP core.
Banking
This section is not applicable for this IP core.
Transceiver Placement
This section is not applicable for this IP core.
IMPORTANT: The set_input_delay and set_output_delay constraints are not needed on the
external memory interface pins in this design due to the calibration process that automatically runs at
start-up. Warnings seen during implementation for the pins can be ignored.
Simulation
This section contains information about simulating the QDR-IV SRAM generated IP. Vivado
simulator, Questa Advanced Simulator, IES, and VCS simulation tools are used for
verification of the QDR-IV SRAM IP at each software release. For more information on
simulation, see Chapter 27, Example Design and Chapter 28, Test Bench.
Example Design
This chapter contains information about the example design provided in the Vivado®
Design Suite. Vivado supports Open IP Example Design flow. To create the example design
using this flow, right-click the IP in the Source Window, as shown in Figure 27-1 and select
Open IP Example Design.
X-Ref Target - Figure 27-1
This option creates a new Vivado project. Upon selecting the menu, a dialog box to enter
the directory information for the new design project opens.
Select a directory, or use the defaults, and click OK. This launches a new Vivado with all of
the example design files and a copy of the IP.
The example design can be simulated using one of the methods in the following sections.
Project-Based Simulation
This method can be used to simulate the example design using the Vivado Integrated
Design Environment (IDE). Memory IP does not deliver the QDR-IV memory models. The
memory model required for the simulation must be downloaded from the memory vendor's
website. The memory model file must be added to the example design using the Add Sources
option to run the simulation.
The Vivado simulator, Questa Advanced Simulator, IES, and VCS tools are used for QDR-IV
IP verification at each software release. The Vivado simulation tool is used for QDR-IV IP
verification from 2015.1 Vivado software release. The following subsections describe steps
to run a project-based simulation using each supported simulator tool.
2. Add the memory model in the Add or create simulation sources page and click Finish
as shown in Figure 27-3.
X-Ref Target - Figure 27-3
3. In the Open IP Example Design Vivado project, under Flow Navigator, select
Simulation Settings.
4. Select Target simulator as Vivado Simulator.
7. In the Flow Navigator window, select Run Simulation and select Run Behavioral
Simulation option as shown in Figure 27-5.
8. Vivado invokes Vivado simulator and simulations are run in the Vivado simulator tool.
For more information, see the Vivado Design Suite User Guide: Logic Simulation (UG900)
[Ref 16].
2. Add the memory model in the Add or create simulation sources page and click Finish
as shown in Figure 27-7.
X-Ref Target - Figure 27-7
3. In the Open IP Example Design Vivado project, under Flow Navigator, select
Simulation Settings.
4. Select Target simulator as Questa Advanced Simulator.
a. Browse to the compiled libraries location and set the path on Compiled libraries
location option.
b. Under the Simulation tab, set the modelsim.simulate.runtime to 1 ms (there
are simulation RTL directives which stop the simulation after a certain period of time,
which is less than 1 ms) as shown in Figure 27-8. The Generate Scripts Only option
generates simulation scripts only. To run behavioral simulation, the Generate Scripts
Only option must be deselected.
5. Apply the settings and select OK.
6. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 27-9.
7. Vivado invokes Questa Advanced Simulator and simulations are run in the Questa
Advanced Simulator tool. For more information, see the Vivado Design Suite User Guide:
Logic Simulation (UG900) [Ref 16].
6. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 27-9.
7. Vivado invokes IES and simulations are run in the IES tool. For more information, see the
Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16].
6. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 27-9.
7. Vivado invokes VCS and simulations are run in the VCS tool. For more information, see
the Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16].
Simulation Speed
QDR-IV SRAM provides a Vivado IDE option to reduce the simulation run time by selecting
the behavioral XIPHY model instead of the UNISIM XIPHY model. Behavioral XIPHY model
simulation is the default option for QDR-IV SRAM designs. To select the simulation mode,
click the Advanced Options tab and find the Simulation Options as shown in Figure 26-3.
The SIM_MODE parameter in the RTL is given a different value based on the Vivado IDE
selection.
• SIM_MODE = BFM – If fast mode is selected in the Vivado IDE, the RTL parameter
reflects this value for the SIM_MODE parameter. This is the default option.
• SIM_MODE = FULL – If UNISIM mode is selected in the Vivado IDE, XIPHY UNISIMs are
selected and the parameter value in the RTL is FULL.
IMPORTANT: QDR-IV memory models from Cypress® Semiconductor need to be modified with the
following two timing parameter values to run the simulations successfully:
`define tcqd #0
`define tcqdoh #0.15
If the design is generated with the Reference Input Clock option selected as No Buffer (at
Advanced > FPGA Options > Reference Input), the CLOCK_DEDICATED_ROUTE
constraints and the BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV instantiation based on GCIO and
MMCM allocation need to be handled manually for the IP flow. QDR-IV SRAM does not
generate clock constraints in the XDC file for No Buffer configurations, and you must take
care of the clock constraints for No Buffer configurations for the IP flow.
For an example design flow with No Buffer configurations, QDR-IV SRAM generates the
example design with differential buffer instantiation for system clock pins. QDR-IV SRAM
generates clock constraints in the example_design.xdc. It also generates a
CLOCK_DEDICATED_ROUTE constraint as the “BACKBONE” and instantiates BUFG/BUFGCE/
BUFGCTRL/BUFGCE_DIV between GCIO and MMCM input if the GCIO and MMCM are not in
same bank to provide a complete solution. This is done for the example design flow as a
reference when it is generated for the first time.
If in the example design, the I/O pins of the system clock pins are changed to some other
pins with the I/O pin planner, the CLOCK_DEDICATED_ROUTE constraints and BUFG/
BUFGCE/BUFGCTRL/BUFGCE_DIV instantiation need to be managed manually. A DRC error
is reported for the same.
Test Bench
This chapter contains information about the test bench provided in the Vivado ® Design
Suite.
The Memory Controller is generated along with a simple test bench to verify the basic read
and write operations. The stimulus contains 16 consecutive writes followed by 16
consecutive reads for data integrity check.
Overview
Product Specification
Core Architecture
Example Design
Test Bench
Overview
IMPORTANT: Contact Xilinx Support if the overall system design includes the SEM IP prior to
attempting to use the RLDRAM 3 memory interface.
Xilinx does not recommend using the RLD3 IP with an interface rate of 800 MHz or higher when the
SEM IP is enabled.
There is a risk of post-calibration data errors with RLD3 designs that span multiple FPGA banks when
the SEM IP is enabled. For RLD3 designs with an 18-bit data bus and address multiplexing enabled, it
is possible to fit the entire interface in one FPGA bank. Other configurations will not be able to fit in a
single FPGA bank and are at risk when the SEM IP is enabled.
• Hardware, IP, and Platform Development: Creating the PL IP blocks for the hardware
platform, creating PL kernels, subsystem functional simulation, and evaluating the
Vivado ® timing, resource and power closure. Also involves developing the hardware
platform for system integration. Topics in this document that apply to this design
process include:
° Clocking
° Resets
° Protocol Description
° Example Design
Core Overview
The Xilinx UltraScale™ architecture includes the RLDRAM 3 core. This core provides a
solution for interfacing with RLDRAM 3 memory devices. The UltraScale architecture for
the RLDRAM 3 core is organized in the following high-level blocks:
• Controller – The controller accepts burst transactions from the User Interface and
generates transactions to and from the RLDRAM 3. The controller takes care of the
DRAM timing parameters and refresh.
• Physical Layer – The physical layer provides a high-speed interface to the DRAM. This
layer includes the hard blocks inside the FPGA and the soft blocks calibration logic
necessary to ensure optimal timing of the hard blocks interfacing to the DRAM.
The new hard blocks in the UltraScale architecture allow interface rates of up to
2,133 Mb/s to be achieved.
Feature Summary
• Component support for interface widths of 18, 36, and 72 bits
Table 29-1: Supported Configurations
Interface Width                  Burst Length       Number of Devices
36                               BL2, BL4           1, 2
18                               BL2, BL4, BL8      1, 2
36 with address multiplexing     BL4                1, 2
18 with address multiplexing     BL4, BL8           1, 2
• ODT support
• Memory device support with 576 Mb and 1.125 Gb densities
• RLDRAM 3 initialization support
Information about other Xilinx LogiCORE IP modules is available at the Xilinx Intellectual
Property page. For information on pricing and availability of other Xilinx LogiCORE IP
modules and tools, contact your local Xilinx sales representative.
License Checkers
If the IP requires a license key, the key must be verified. The Vivado® design tools have
several license checkpoints for gating licensed IP through the flow. If the license check
succeeds, the IP can continue generation. Otherwise, generation halts with an error. License
checkpoints are enforced by the following tools:
• Vivado synthesis
• Vivado implementation
• write_bitstream (Tcl command)
IMPORTANT: IP license level is ignored at checkpoints. The test confirms a valid license exists. It does
not check IP license level.
Product Specification
Standards
For more information on UltraScale™ architecture documents, see References, page 789.
Performance
Maximum Frequencies
For more information on the maximum frequencies, see the following documentation:
Resource Utilization
For full details about performance and resource utilization, visit Performance and Resource
Utilization.
Port Descriptions
There are three port categories at the top level of the memory interface core, called the
“user design.”
• The first category is the memory interface signals that directly interface with the
RLDRAM 3 device. These are defined by the Micron® RLDRAM 3 specification.
• The second category is the application interface signals, which make up the “user interface.”
These are described in the Protocol Description, page 511.
• The third category includes other signals necessary for proper operation of the core.
These include the clocks, reset, and status signals from the core. The clocking and reset
signals are described in their respective sections.
Core Architecture
This chapter describes the UltraScale™ architecture-based FPGAs Memory Interface
Solutions core with an overview of the modules and interfaces.
Overview
Figure 31-1 shows the UltraScale architecture-based FPGAs Memory Interface Solutions
diagram.
X-Ref Target - Figure 31-1
Figure 31-1 (block diagram): the user FPGA logic connects through the user interface to the Memory Controller, which drives the physical layer and the RLDRAM 3 initialization/calibration logic (CalDone, read data) interfacing to the RLDRAM 3 device.
The user interface uses a simple protocol based entirely on SDR signals to make read and
write requests. See User Interface in Chapter 32 for more details describing this protocol.
The Memory Controller takes commands from the user interface and adheres to the
protocol requirements of the RLDRAM 3 device. See Memory Controller for more details.
The physical interface generates the proper timing relationships and DDR signaling to
communicate with the external memory device, while conforming to the RLDRAM 3
protocol and timing requirements. See Physical Interface in Chapter 32 for more details.
Memory Controller
The Memory Controller (MC) enforces the RLDRAM 3 access requirements and interfaces
with the PHY. The controller processes read and write commands in order for BL4 and BL8,
so the order in which commands are presented to the controller is the order in which they
are issued to the memory device. For BL2, the read commands are processed in order but the
write commands are rearranged to increase throughput.
The MC first receives commands from the user interface and determines if the command
can be processed immediately or needs to wait. When all requirements are met, the
command is placed on the PHY interface. For a write command, the controller generates a
signal for the user interface to provide the write data to the PHY. This signal is generated
based on the memory configuration to ensure the proper command-to-data relationship.
Auto-refresh commands are inserted into the command flow by the controller to meet the
memory device refresh requirements.
The data bus is shared between read and write data in RLDRAM 3. Switching from read
commands to write commands, and vice versa, introduces gaps in the command stream due
to bus turnaround. For better throughput, read/write turnarounds on the bus should be
minimized when possible.
PHY
The PHY is considered the low-level physical interface to an external RLDRAM 3 device as
well as all calibration logic for ensuring reliable operation of the physical interface itself. The
PHY generates the signal timing and sequencing required to interface to the memory
device. The PHY contains the following:
• Clock/address/control generation logic
• Write and read datapaths
• Logic for initializing the RLDRAM 3 after power-up
In addition, the PHY contains calibration logic to perform timing training of the read and
write datapaths to account for system static and dynamic delays.
The MC and calibration logic communicate with this dedicated PHY in the slow frequency
clock domain, which is divided by 4. A more detailed block diagram of the PHY design is
shown in Figure 31-1.
The MC is designed to separate out the command processing from the low-level PHY
requirements to ensure a clean separation between the controller and physical layer. The
command processing can be replaced with custom logic if desired, while the logic for
interacting with the PHY stays the same and can still be used by the calibration logic.
The PHY architecture encompasses all of the logic contained in rld_xiphy.sv. The PHY
contains wrappers around dedicated hard blocks to build up the memory interface from
smaller components. A byte lane contains all of the clocks, resets, and datapaths for a given
subset of I/O. Multiple byte lanes are grouped together, along with dedicated clocking
resources, to make up a single bank memory interface. For more information on the hard
silicon physical layer architecture, see the UltraScale™ Architecture SelectIO™ Resources User
Guide (UG571) [Ref 7].
Figure 31-2 shows the overall flow of memory initialization and the different stages of
calibration.
X-Ref Target - Figure 31-2
Figure 31-2 (calibration flow): System Reset, RLDRAM 3 Initialization, Read DQ Deskew, QVALID Training, and finally Calibration Complete.
When simulating the RLDRAM 3 example design, the calibration process is bypassed to
allow for quick traffic generation to and from the RLDRAM 3 device. Calibration is always
enabled when running the example design in hardware. The hardware manager GUI
provides information on the status of each calibration step or description of error in case of
calibration failure.
If the hardware manager GUI is not used, the first step in determining the calibration status
is to check the status of the init_calib_complete and calib_error signals. The
init_calib_complete signal only asserts if calibration passes successfully; otherwise
calib_error is asserted. Calibration halts on the very first error encountered. There are
three status registers, dbg_pre_cal_status, dbg_cal_status, and
dbg_post_cal_status, that provide information on the failing calibration stage. Each bit
of the dbg_cal_status register represents a successful start/end of a calibration step,
while the bits of dbg_pre_cal_status and dbg_post_cal_status represent the
successful completion of certain events during and after calibration. Not all bits are
assigned and some bits might be reserved. Table 31-2 lists the pre-calibration status signal
descriptions.
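As a minimal sketch (the clock/reset names and the width of dbg_cal_status are assumptions; consult the status register tables for your configuration), user logic might latch the first failure for later decoding:

module rld3_cal_status_sketch (
  input  wire        clk,                    // general interconnect clock
  input  wire        rst,
  input  wire        init_calib_complete,
  input  wire        calib_error,
  input  wire [31:0] dbg_cal_status,         // width is an assumption
  output reg         cal_done,
  output reg         cal_failed,
  output reg  [31:0] dbg_cal_status_at_fail
);
  always @(posedge clk) begin
    if (rst) begin
      cal_done   <= 1'b0;
      cal_failed <= 1'b0;
    end else begin
      // init_calib_complete only asserts when calibration passes.
      if (init_calib_complete)
        cal_done <= 1'b1;
      // Calibration halts on the first error, so latch the stage status once.
      if (calib_error && !cal_failed) begin
        cal_failed             <= 1'b1;
        dbg_cal_status_at_fail <= dbg_cal_status;
      end
    end
  end
endmodule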
Read DQ Deskew
The read deskew routine helps eliminate delay variation within the DQ bits of a byte,
which in turn improves the read DQ window size. During this stage of calibration, all DQ bits
within a byte are deskewed by aligning them to the internal capture clock belonging to the
same byte. The internal capture clock is a delayed version of QK and/or PQTR/NQTR delay
taps of the capture clocks. The alignment is done by changing the IDELAY taps of individual
DQs and/or of the capture clocks until all the bits in a byte are aligned.
A pattern of all 0s and all 1s is written to various locations in the RLDRAM 3 device. The
write is done one location at a time with the data available on the memory bus four memory
clock cycles ahead of the actual BL4 write data transaction and stays on the bus for two
more memory clock cycles. Because the datapath is not calibrated at this point, this
eliminates any critical timing between DQ and DK clock and ensures correct data is getting
registered in the RLDRAM 3 device. The data read back appears as all 0s and all 1s over
alternate general interconnect cycles. As an example, read data for a single DQ bit appears
as a continuous stream of 00000000_11111111_00000000_11111111 over several
memory clock cycles. Eight 0s represent data over four memory clock cycles (one general
interconnect clock cycle).
The routine initially searches for the left edge and when successful, looks for the right edge.
This is done by moving the PQTR/NQTR delays of the capture clock. When both left and
right edges of the read DQ window have been found, the routine centers the capture clock.
Initially the qvalid signal is assigned the same IDELAY value as the corresponding QK
clock of the byte it resides in. The qvalid IDELAY taps are then either incremented or
decremented to align it to the negative edge of internal capture clock.
DQ bits associated with each DK clock are initially phase shifted by 90° to roughly align
them with the DK clock. A repetitive pattern of 10101010 is written and read back from
memory. Each DQ ODELAY tap is changed to fine tune the alignment with DK clock. When
all DQs are edge aligned to DK clock, the 90° phase shift on DQs is removed, leaving the DK
clock center aligned in the write data window.
The same 90° shift is done on the DM bit during DM calibration. To deskew DM, certain bits
of the original pattern are masked and the pattern is changed to all 0s. Alignment is
achieved when the data bits with value 1 fail to get masked and are overwritten by value 0.
The 90° phase shift on DM bits is removed at the end of alignment. DM deskew calibration
is only performed when it is enabled at the time of RLDRAM 3 IP generation in Vivado
Integrated Design Environment (IDE).
Table 31-5: QVLD Align Training/Byte Align Training for Two Bytes
                      Byte 0                                                 Byte 1
Memory Clock Cycle    General Interconnect    General Interconnect    General Interconnect    General Interconnect    General Interconnect    General Interconnect
                      Cycle 0                 Cycle 1                 Cycle 2                 Cycle 0                 Cycle 1                 Cycle 2
Rise 0                0x1                     0x9                     0x17                    0xx                     0x7                     0x15
Fall 0                0x2                     0x10                    0x18                    0xx                     0x8                     0x16
Rise 1                0x3                     0x11                    0x19                    0x1                     0x9                     0x17
Fall 1                0x4                     0x12                    0x20                    0x2                     0x10                    0x18
Rise 2                0x5                     0x13                    0x21                    0x3                     0x11                    0x19
Fall 2                0x6                     0x14                    0x22                    0x4                     0x12                    0x20
Rise 3                0x7                     0x15                    0x23                    0x5                     0x13                    0x21
Fall 3                0x8                     0x16                    0x24                    0x6                     0x14                    0x22
Byte slip calibration assigns a slip value of 0 to Byte 0 and a value of 6 to Byte 1. As a result,
the data in the two bytes is offset by one general interconnect cycle. qvalid and byte align
training are used to align the data between bytes. This is done by analyzing the spatial
location of specific data within a byte relative to all other bytes in the general interconnect
domain and adding an additional slip value of eight, on top of the slip value from the previous
step, to the bytes arriving one general interconnect cycle ahead of the other bytes.
Similar to the slip stage, qvalid and byte align calibration are done independently because one
qvalid spans two bytes in certain configurations, and assigning the slip value of one byte to it
might cause the other byte to go out of sync.
When all calibration stages are completed, the calib_complete signal is asserted at the
User Interface and the control of the write/read datapath through the XIPHY gets
transferred from calibration module to the User Interface.
Reset Sequence
The sys_rst signal resets the entire memory design, which includes the general interconnect
(fabric) logic driven by the MMCM clock (clkout0) and the RIU logic. The MicroBlaze™ and
calibration logic are driven by the MMCM clock (clkout6). The sys_rst input signal is
synchronized internally to create the ui_clk_sync_rst signal. The ui_clk_sync_rst
reset signal is synchronously asserted and synchronously deasserted.
Figure 31-3 shows the ui_clk_sync_rst (fabric reset) is synchronously asserted with a
few clock delays after sys_rst is asserted. When ui_clk_sync_rst is asserted, there are
a few clocks before the clocks are shut off.
X-Ref Target - Figure 31-3
The MicroBlaze MCS ECC can be selected from the MicroBlaze MCS ECC option section in
the Advanced Options tab. The block RAM size increases if the ECC option for MicroBlaze
MCS is selected.
Clocking
The memory interface requires one mixed-mode clock manager (MMCM), one TXPLL per I/
O bank used by the memory interface, and two BUFGs. These clocking components are used
to create the proper clock frequencies and phase shifts necessary for the proper operation
of the memory interface.
There are two TXPLLs per bank. If a bank is shared by two memory interfaces, both TXPLLs
in that bank are used.
Note: RLDRAM 3 generates the appropriate clocking structure and no modifications to the RTL are
supported.
The RLDRAM 3 tool generates the appropriate clocking structure for the desired interface.
This structure must not be modified. The allowed clock configuration is as follows:
Requirements
GCIO
• Must use a differential I/O standard
• Must be in the same I/O column as the memory interface
• Must be in the same SLR of memory interface for the SSI technology devices
• The I/O standard and termination scheme are system dependent. For more information,
consult the UltraScale Architecture SelectIO Resources User Guide (UG571) [Ref 7].
MMCM
• MMCM is used to generate the FPGA logic system clock (1/4 of the memory clock)
• Must be located in the center bank of memory interface
• Must use internal feedback
• Input clock frequency divided by input divider must be ≥ 70 MHz (CLKINx / D ≥
70 MHz)
• Must use integer multiply and output divide values
° For two bank systems, the bank with the higher number of bytes selected is chosen
as the center bank. If the same number of bytes is selected in two banks, then the
top bank is chosen as the center bank.
° For four bank systems, either of the center banks can be chosen. RLDRAM 3 refers
to the second bank from the top-most selected bank as the center bank.
TXPLL
• CLKOUTPHY from TXPLL drives XIPHY within its bank
• TXPLL must be set to use a CLKFBOUT phase shift of 90°
• TXPLL must be held in reset until the MMCM lock output goes High
• Must use internal feedback
Figure 32-1 shows an example of the clocking structure for a three bank memory interface.
The GCIO drives the MMCM located at the center bank of the memory interface. MMCM
drives both the BUFGs located in the same bank. The BUFG (which is used to generate
system clock to FPGA logic) output drives the TXPLLs used in each bank of the interface.
X-Ref Target - Figure 32-1
Figure 32-1 (clocking structure): a differential GCIO input drives the MMCM in the center bank of the memory interface; the MMCM drives the BUFGs in that bank, and the BUFG-driven system clock feeds the TXPLL in each I/O bank of the interface.
• For two bank systems, MMCM is placed in a bank with the most number of bytes
selected. If they both have the same number of bytes selected in two banks, then
MMCM is placed in the top bank.
• For four bank systems, MMCM is placed in a second bank from the top.
For designs generated with System Clock configuration of No Buffer, MMCM must not be
driven by another MMCM/PLL. Cascading clocking structures MMCM → BUFG → MMCM
and PLL → BUFG → MMCM are not allowed.
If the MMCM is driven by the GCIO pin of the other bank, then the
CLOCK_DEDICATED_ROUTE constraint with value "BACKBONE" must be set on the net that
is driving MMCM or on the MMCM input. Setting up the CLOCK_DEDICATED_ROUTE
constraint on the net is preferred. But when the same net is driving two MMCMs, the
CLOCK_DEDICATED_ROUTE constraint must be managed by considering which MMCM
needs the BACKBONE route.
In such cases, the CLOCK_DEDICATED_ROUTE constraint can be set on the MMCM input. To
use the "BACKBONE" route, a clock buffer that exists in the same CMT tile as the GCIO
must be present between the GCIO and the MMCM input. The clock buffers that exist in the
I/O CMT are BUFG, BUFGCE, BUFGCTRL, and BUFGCE_DIV. Therefore, RLDRAM 3 instantiates
a BUFG between the GCIO and MMCM when the GCIO pins and MMCM are not in the same bank
(see Figure 32-1).
If the GCIO pin and MMCM are allocated in different banks, RLDRAM 3 generates
CLOCK_DEDICATED_ROUTE constraints with value as "BACKBONE." If the GCIO pin and
MMCM are allocated in the same bank, there is no need to set any constraints on the
MMCM input.
Similarly when designs are generated with System Clock Configuration as a No Buffer
option, you must take care of the "BACKBONE" constraint and the BUFG/BUFGCE/
BUFGCTRL/BUFGCE_DIV between GCIO and MMCM if GCIO pin and MMCM are allocated in
different banks. RLDRAM 3 does not generate clock constraints in the XDC file for No
Buffer configurations and you must take care of the clock constraints for No Buffer
configurations. For more information on clocking, see the UltraScale Architecture Clocking
Resources User Guide (UG572) [Ref 8].
For more information on the CLOCK_DEDICATED_ROUTE constraints, see the Vivado Design
Suite Properties Reference Guide (UG912) [Ref 9].
Note: If two different GCIO pins are used for two RLDRAM 3 IP cores in the same bank, center bank
of the memory interface is different for each IP. RLDRAM 3 generates MMCM LOC and
CLOCK_DEDICATED_ROUTE constraints accordingly.
1. RLDRAM 3 generates a single-ended input for system clock pins, such as sys_clk_i.
Connect the differential buffer output to the single-ended system clock inputs
(sys_clk_i) of both the IP cores.
2. System clock pins must be allocated within the same I/O column of the memory
interface pins allocated. Add the pin LOC constraints for system clock pins and clock
constraints in your top-level XDC.
3. You must add a "BACKBONE" constraint on the net that is driving the MMCM or on the
MMCM input if GCIO pin and MMCM are not allocated in the same bank. Apart from
this, BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV must be instantiated between GCIO and
MMCM to use the "BACKBONE" route.
Note:
° The UltraScale architecture includes an independent XIPHY power supply and TXPLL
for each XIPHY. This results in clean, low jitter clocks for the memory system.
° Skew spanning across multiple BUFGs is not a concern because single point of
contact exists between BUFG → TXPLL and the same BUFG → System Clock Logic.
° System input clock cannot span I/O columns because the longer the clock lines
span, the more jitter is picked up.
TXPLL Usage
There are two TXPLLs per bank. If a bank is shared by two memory interfaces, both TXPLLs
in that bank are used. One PLL per bank is used if a bank is used by a single memory
interface. You can use a second PLL for other usage. To use a second PLL, you can perform
the following steps:
1. Generate the design for the System Clock Configuration option as No Buffer.
2. RLDRAM 3 generates a single-ended input for system clock pins, such as sys_clk_i.
Connect the differential buffer output to the single-ended system clock inputs
(sys_clk_i) and also to the input of PLL (PLL instance that you have in your design).
3. You can use the PLL output clocks.
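A hedged sketch of steps 1 through 3, assuming a differential system clock pair named sys_clk_p/sys_clk_n (the IP instance and the user PLL are shown only as comment placeholders):

module rld3_sysclk_share_sketch (
  input  wire sys_clk_p,    // assumed differential GCIO pin names
  input  wire sys_clk_n,
  output wire sys_clk_se    // feeds sys_clk_i of the IP and the second PLL
);
  // Differential input buffer for the GCIO reference clock (step 2).
  IBUFDS u_sysclk_ibufds (
    .I  (sys_clk_p),
    .IB (sys_clk_n),
    .O  (sys_clk_se)
  );
  // The buffered clock then drives both of the following (instance names are
  // placeholders):
  //   rld3_0     u_rld3     (.sys_clk_i (sys_clk_se), /* ... */);
  //   <user PLL> u_user_pll (.CLKIN     (sys_clk_se), /* ... */);
endmodule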
Additional Clocks
You can produce up to four additional clocks which are created from the same MMCM that
generates ui_clk. Additional clocks can be selected from the Clock Options section in the
Advanced Options tab. The GUI lists the possible clock frequencies from MMCM and the
frequencies for additional clocks vary based on selected memory frequency (Memory
Device Interface Speed (ps) value in the Basic tab), selected FPGA, and FPGA speed grade.
Resets
An asynchronous reset (sys_rst) input is provided. This is an active-High reset and the
sys_rst must assert for a minimum pulse width of 5 ns. The sys_rst can be an internal
or external pin.
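For example, a test bench might drive the reset as in the following hedged sketch; the 100 ns hold time is an arbitrary choice comfortably above the 5 ns minimum:

reg sys_rst;

initial begin
  sys_rst = 1'b1;   // assert the active-High reset
  #100;             // hold for 100 time units (for example, 100 ns with a 1 ns timescale)
  sys_rst = 1'b0;   // release; initialization and calibration then proceed
end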
IMPORTANT: If two controllers share a bank, they cannot be reset independently. The two controllers
must have a common reset input.
For more information on reset, see the Reset Sequence in Chapter 31, Core Architecture.
Note: There are two PLLs per bank and a controller uses one PLL in every bank that is being used by
the interface.
1. RLDRAM 3 interface can only be assigned to HP banks of the FPGA device.
2. Read Clock (qk/qk_n), Write Clock (dk/dk_n), dq, qvld, and dm.
a. Read Clock pairs (qkx_p/n) must be placed on N0 and N1 pins. dq associated with
a qk/qk_n pair must be in same byte lane on pins N2 to N11.
b. For the data mask off configurations, ensure that dm pin on the RLDRAM 3 device is
grounded. When data mask is enabled, one dm pin is associated with nine bits in x18
devices or with 18 bits in x36 devices. It must be placed in its associated dq byte
lanes as listed:
- For x18 part, dm[0] must be allocated in dq[8:0] allocated byte group and
dm[1] must be allocated in dq[17:9].
- For x36 part, dm[0] must be allocated in dq[8:0] or dq[26:18] allocated
byte lane. Similarly dm[1] must be allocated in dq[17:9] or dq[35:27]
allocated byte group. dq and dm must be placed on one of the pins from N2 to
N11 in the byte lane.
c. dk/dk_n must be allocated to any P-N pair in the same byte lane as ck/ck_n in the
address/control bank.
Note: Pin 12 is not part of a pin pair and must not be used for differential clocks.
d. qvld (x18 device) or qvld0 (x36 device) must be placed on one of the pins from N2
to N12 in the qk0 or qk1 data byte lane. qvld1 (x36 device) must be placed on one
of the pins from N2 to N12 in the qk2 or qk3 data byte lane.
3. Byte lanes are configured as either data or address/control.
a. Pin N12 can be used for address/control in a data byte lane.
b. No data signals (qvalid, dq, dm) can be placed in an address/control byte lane.
4. Address/control can be on any of the 13 pins in the address/control byte lanes. Address/
control must be contained within the same bank. For three bank RLDRAM 3 interfaces,
address/control must be in the centermost bank.
5. One vrp pin per bank is used and a DCI is required for the interfaces. A vrp pin is
required in I/O banks containing inputs as well as output only banks. It is required in
output only banks because address/control signals use SSTL12_DCI to enable usage of
controlled output impedance. DCI cascade is allowed. When DCI cascade is selected,
vrp pin can be used as a normal I/O. All rules for the DCI in the UltraScale™ Architecture
SelectIO™ Resources User Guide (UG571) [Ref 7] must be followed.
6. ck must be on the PN pair in the Address/Control byte lane.
7. reset_n can be on any pin as long as FPGA logic timing is met and I/O standard can be
accommodated for the chosen bank (SSTL12).
8. Banks can be shared between two controllers.
IMPORTANT: If two controllers share a bank, they cannot be reset independently. The two controllers
must share a common reset input.
9. All I/O banks used by the memory interface must be in the same column.
10. All I/O banks used by the memory interface must be in the same SLR of the column for
the SSI technology devices.
11. Maximum height of interface is three contiguous banks for 72-bit wide interface.
12. Bank skipping is not allowed.
13. The input clock for the MMCM in the interface must come from a GCIO pair in the
I/O column used for the memory interface. Information on the clock input specifications
can be found in the AC and DC Switching Characteristics data sheets (LVDS input
requirements and MMCM requirements should be considered). For more information,
see Clocking, page 501.
14. There are dedicated VREF pins (not included in the rules above). If an external VREF is not
used, the VREF pins must be pulled to ground by a resistor value specified in the
UltraScale™ Architecture SelectIO™ Resources User Guide (UG571) [Ref 7]. These pins
must be connected appropriately for the standard in use.
15. The interface must be contained within the same I/O bank type (High Range or High
Performance). Mixing bank types is not permitted with the exceptions of the reset_n
in step 7 and the input clock mentioned in step 13.
16. RLDRAM 3 pins not mentioned in the cited pin rules (JTAG, MF, etc.) or ones that you
choose not to use in your design must be connected as per Micron® RLDRAM 3 data
sheet specification.
17. The system reset pin (sys_rst_n) must not be allocated to Pins N0 and N6 if the byte
is used for the memory I/Os.
Pin Swapping
• Pins can swap freely within each byte group (data and address/control) (for more
information, see the RLDRAM 3 Pin Rules, page 506).
• Byte groups (data and address/control) can swap easily with each other.
• Pins in the address/control byte groups can swap freely within and between their byte
groups.
• No other pin swapping is permitted.
Table 32-1 shows an example of an 18-bit RLDRAM 3 interface contained within one bank.
This example is for a component interface using one x18 RLDRAM3 component with
Address Multiplexing.
Bank Signal Pin P/N Special Function
1 qvld0 T3U_12 – –
1 dq8 T3U_11 N –
1 dq7 T3U_10 P –
1 dq6 T3U_9 N –
1 dq5 T3U_8 P –
1 dq4 T3U_7 N DBC-N
1 dq3 T3U_6 P DBC-P
1 dq2 T3L_5 N –
1 dq1 T3L_4 P –
1 dq0 T3L_3 N –
1 dm0 T3L_2 P –
1 qk0_n T3L_1 N DBC-N
1 qk0_p T3L_0 P DBC-P
1 reset_n T2U_12 – –
1 we# T2U_11 N –
1 a18 T2U_10 P –
1 a17 T2U_9 N –
1 a14 T2U_8 P –
1 a13 T2U_7 N QBC-N
1 a10 T2U_6 P QBC-P
1 a9 T2L_5 N –
1 a8 T2L_4 P –
1 a5 T2L_3 N –
1 a4 T2L_2 P –
1 a3 T2L_1 N QBC-N
1 a0 T2L_0 P QBC-P
1 – T1U_12 – –
1 ba3 T1U_11 N –
1 ba2 T1U_10 P –
1 ba1 T1U_9 N –
1 ba0 T1U_8 P –
1 dk1_n T1U_7 N QBC-N
1 dk1_p T1U_6 P QBC-P
1 dk0_n T1L_5 N –
1 dk0_p T1L_4 P –
1 ck_n T1L_3 N –
1 ck_p T1L_2 P –
1 ref_n T1L_1 N QBC-N
1 cs_n T1L_0 P QBC-P
1 vrp T0U_12 – –
1 dq17 T0U_11 N –
1 dq16 T0U_10 P –
1 dq15 T0U_9 N –
1 dq14 T0U_8 P –
1 dq13 T0U_7 N DBC-N
1 dq12 T0U_6 P DBC-P
1 dq11 T0L_5 N –
1 dq10 T0L_4 P –
1 dq9 T0L_3 N –
1 dm1 T0L_2 P –
1 qk1_n T0L_1 N DBC-N
1 qk1_p T0L_0 P DBC-P
Protocol Description
This core has the following interfaces:
• Memory Interface
• User Interface
• Physical Interface
Memory Interface
The RLDRAM 3 core is customizable to support several configurations. The specific
configuration is defined by Verilog parameters in the top-level of the core.
User Interface
The user interface connects the FPGA user design to the RLDRAM 3 core to simplify
interactions between the user design and the external memory device.
Note: Issuing both write and read commands in the same user_cmd cycle is not allowed.
Figure 32-2 shows the user_cmd signal and how it is made up of multiple commands
depending on the configuration.
X-Ref Target - Figure 32-2
2nd 1st
FPGA Logic Clock
X24454-082420
As shown in Figure 32-2, four command slots are present in a single user interface clock
cycle for BL2. Similarly, two command slots are present in a single user interface clock cycle
for BL4. These command slots are serviced sequentially and the return data for read
commands are presented at the user interface in the same sequence. Note that the read
data might not be available in the same slot as that of its read command. The slot of a read
data is determined by the timing requirements of the controller and its command slot. One
such example is mentioned in the following BL2 design configuration.
Assume that two read commands, READ0 and READ1, together with NOPs in the remaining
slots, are presented at the user interface in a given user interface cycle.
It is not guaranteed that the read data appears in {DATA0, NOP, DATA1, NOP} order. It might
also appear in {NOP, DATA0, NOP, DATA1}, {NOP, NOP, DATA0, DATA1}, or other orders. In
any case, the sequence of the commands is maintained.
The address bits at the user interface are concatenated based on the burst length as shown
in Table 32-4. Pad the unused address bits with zero. An example for the x36, burst length 4,
576 Mb device configuration is shown after the table.
Burst     RLDRAM 3 Device    Address Width at RLDRAM 3 Interface                          Address Width at
Length    Data Width         Non-Multiplexed Mode         Multiplexed Mode                User Interface
576 Mb
2         18                 20                           Not supported by RLDRAM 3       {20, 20, 20, 20}
2         36                 19                           Not supported by RLDRAM 3       ({0, 19}, {0, 19}, {0, 19}, {0, 19})
4         18                 19                           11                              ({0, 19}, {0, 19})
4         36                 18                           11                              ({00, 18}, {00, 18})
8         18                 18                           11                              ({00, 18})
8         36                 Not supported by RLDRAM 3    Not supported by RLDRAM 3       N/A
1.125 Gb
2         18                 21                           Not supported by RLDRAM 3       {21, 21, 21, 21}
2         36                 20                           Not supported by RLDRAM 3       ({0, 20}, {0, 20}, {0, 20}, {0, 20})
4         18                 20                           11                              ({0, 20}, {0, 20})
4         36                 19                           11                              ({00, 19}, {00, 19})
8         18                 19                           11                              ({00, 19})
8         36                 Not supported by RLDRAM 3    Not supported by RLDRAM 3       N/A
Notes:
1. Two device configurations (2x18, 2x36) follow the same address mapping as the one device configuration mentioned.
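Continuing the x36, burst length 4, 576 Mb row above, a hedged sketch of the address concatenation at the user interface (the slot ordering assumes the first command occupies the least-significant field, consistent with the 1st/2nd labeling of Figure 32-2; port names other than user_addr are illustrative):

module rld3_addr_pack_sketch (
  input  wire [17:0] addr_slot0,   // first command slot (assumed least significant)
  input  wire [17:0] addr_slot1,   // second command slot
  output wire [39:0] user_addr     // ({00, 18}, {00, 18}) per the table above
);
  // Each 18-bit RLDRAM 3 address is padded with two zero bits, and the two
  // BL4 command slots are concatenated into the user interface address.
  assign user_addr = {2'b00, addr_slot1, 2'b00, addr_slot0};
endmodule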
The user interface protocol for the RLDRAM 3 four-word burst architecture is shown in
Figure 32-3.
X-Ref Target - Figure 32-3
Figure 32-3 (waveform): CLK, user_cmd_en, user_wr_en, user_wr_dm and the corresponding write data packed as {fall, rise} word pairs (rise0/fall0 through rise7/fall7), user_afifo_full, and user_wdfifo_full.
Before any requests can be accepted, the ui_clk_sync_rst signal must be deasserted
Low. After the ui_clk_sync_rst signal is deasserted, the user interface FIFOs can accept
commands and data for storage. The init_calib_complete signal is asserted after the
memory initialization procedure and PHY calibration are complete, and the core can begin
to service client requests.
Figure 32-4 (waveform): CLK, user_cmd_en, user_addr (A0 through A4), user_wr_en, user_afifo_full, and user_wdfifo_full.
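A hedged sketch of a conservative gate built from these conditions (cmd_pending is a hypothetical request from user logic; commands can also be stored in the FIFOs before calibration completes, so gating on init_calib_complete is stricter than required):

module rld3_cmd_gate_sketch (
  input  wire ui_clk_sync_rst,
  input  wire init_calib_complete,
  input  wire user_afifo_full,
  input  wire user_wdfifo_full,
  input  wire cmd_pending,        // hypothetical request from user logic
  output wire user_cmd_en
);
  // Only push a command when reset is released, calibration is done, and the
  // address/write-data FIFOs have room.
  assign user_cmd_en = !ui_clk_sync_rst && init_calib_complete &&
                       !user_afifo_full && !user_wdfifo_full && cmd_pending;
endmodule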
When a read command is issued, then some time later (based on the configuration and latency
of the system) the user_rd_valid[0] signal is asserted, indicating that its portion of
user_rd_data is now valid, while user_rd_valid[1] is asserted to indicate that its portion
of user_rd_data is valid, as shown in Figure 32-5. The read data should be sampled on the
same cycle that user_rd_valid[0] and user_rd_valid[1] are asserted because the core
does not buffer returning data. This functionality can be added, if desired.
The Memory Controller only puts commands on certain slots to the PHY such that the
user_rd_valid signals are all asserted together and return the full width of data, but the
extra user_rd_valid signals are provided in case of controller modifications.
X-Ref Target - Figure 32-5
Figure 32-5 (waveform): CLK, user_rd_valid[0], user_rd_valid[1], and user_rd_data returned as {fall, rise} word pairs ({fall1, rise1, fall0, rise0}, {fall3, rise3, fall2, rise2}, and so on).
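A hedged sketch of sampling the returned data on the cycle the valid signals assert (the data width is an assumption that depends on the configuration):

module rld3_rd_capture_sketch #(
  parameter DATA_WIDTH = 144       // assumption: depends on width and burst length
)(
  input  wire                  clk,             // user interface clock
  input  wire [1:0]            user_rd_valid,
  input  wire [DATA_WIDTH-1:0] user_rd_data,
  output reg  [DATA_WIDTH-1:0] rd_data_q,
  output reg                   rd_data_q_valid
);
  // The core does not buffer returning data, so capture it in the same cycle
  // that the valid signals assert (both assert together for the full width).
  always @(posedge clk) begin
    rd_data_q_valid <= &user_rd_valid;
    if (&user_rd_valid)
      rd_data_q <= user_rd_data;
  end
endmodule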
Physical Interface
The physical interface is the connection from the FPGA core to an external RLDRAM 3
device. The I/O signals for this interface are defined in Table 32-5. These signals can be
directly connected to the corresponding signals on the RLDRAM 3 device.
• Memory IP lists the possible Reference Input Clock Speed values based on the targeted
memory frequency (based on selected Memory Device Interface Speed).
• Otherwise, select M and D Options and target the desired Reference Input Clock Speed,
which is calculated from the CLKFBOUT_MULT (M), DIVCLK_DIVIDE (D), and
CLKOUT0_DIVIDE (D0) values selected in the Advanced Clocking Tab.
The required Reference Input Clock Speed is calculated from the M, D, and D0 values
entered in the GUI using the following formulas:
Where tCK is the Memory Device Interface Speed selected in the Basic tab.
Calculated Reference Input Clock Speed from M, D, and D0 values are validated as per
clocking guidelines. For more information on clocking rules, see Clocking.
Apart from the memory specific clocking rules, validation of the possible MMCM input
frequency range, MMCM VCO frequency range, and MMCM PFD frequency range values are
completed for M, D, and D0 in the GUI.
For UltraScale devices, see Kintex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS892) [Ref 2] and Virtex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS893) [Ref 3] for MMCM Input frequency range, MMCM VCO frequency
range, and MMCM PFD frequency range values.
For UltraScale+ devices, see Kintex UltraScale+ FPGAs Data Sheet: DC and AC Switching
Characteristics (DS922) [Ref 4], Virtex UltraScale+ FPGAs Data Sheet: DC and AC Switching
Characteristics (DS923) [Ref 5], and Zynq UltraScale+ MPSoC Data Sheet: DC and AC
Switching Characteristics (DS925) [Ref 6] for MMCM Input frequency range, MMCM VCO
frequency range, and MMCM PFD frequency range values.
For possible M, D, and D0 values and detailed information on clocking and the MMCM, see
the UltraScale Architecture Clocking Resources User Guide (UG572) [Ref 8].
• Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
[Ref 13]
• Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14]
• Vivado Design Suite User Guide: Getting Started (UG910) [Ref 15]
• Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16]
This section includes information about using Xilinx ® tools to customize and generate the
core in the Vivado Design Suite.
If you are customizing and generating the core in the IP integrator, see the Vivado Design
Suite User Guide: Designing IP Subsystems using IP Integrator (UG994) [Ref 13] for detailed
information. IP integrator might auto-compute certain configuration values when
validating or generating the design. To check whether the values change, see the
description of the parameter in this chapter. To view the parameter value, run the
validate_bd_design command in the Tcl Console.
You can customize the IP for use in your design by specifying values for the various
parameters associated with the IP core using the following steps:
For more information about generating the core in Vivado, see the Vivado Design Suite User
Guide: Designing with IP (UG896) [Ref 14] and the Vivado Design Suite User Guide: Getting
Started (UG910) [Ref 15].
Note: Figures in this chapter are illustrations of the Vivado Integrated Design Environment (IDE).
This layout might vary from the current version.
Basic Tab
Figure 33-1 shows the Basic tab when you start up the RLDRAM 3 SDRAM.
X-Ref Target - Figure 33-1
IMPORTANT: All parameters shown in the controller options dialog box are limited selection options in
this release.
For the Vivado IDE, all controllers (DDR3, DDR4, LPDDR3, QDR II+, QDR-IV, and RLDRAM 3)
can be created and available for instantiation.
1. Select the settings in the Clocking, Controller Options, and Memory Options.
In Clocking, the Memory Device Interface Speed sets the speed of the interface. The
speed entered drives the available Reference Input Clock Speeds. For more
information on the clocking structure, see the Clocking, page 501.
2. To use memory parts that are not available by default through the RLDRAM 3 SDRAM
Vivado IDE, you can create a custom parts CSV file, as specified in AR: 63462. This
CSV file has to be provided after enabling the Custom Parts Data File option. After
selecting this option, you are able to see the custom memory parts along with the
default memory parts. Note that simulations are not supported for custom parts.
Custom part simulations require manually adding the memory model to the simulation
and might require modifying the test bench instantiation.
Figure 33-4: Vivado Customize IP Dialog Box – I/O Planning and Design Checklist
User Parameters
Table 33-1 shows the relationship between the fields in the Vivado IDE and the User
Parameters (which can be viewed in the Tcl Console).
Note: Do not turn this timing check off unless the access pattern will never cause a TWTR failure.
Table 33-2: TWTR_CHECK_OFF User Parameter
User Parameter     Value Format    Default Value    Possible Values
TWTR_CHECK_OFF     String          false            false – TWTR_CHECK parameter set to ON
                                                    true – TWTR_CHECK parameter set to OFF
2. In the Generate Output Products option, do not select Generate; instead, select Skip
(Figure 33-5).
X-Ref Target - Figure 33-5
3. Set the TWTR_CHECK_OFF value by running the following command on the Tcl Console:
set_property -dict [list config.TWTR_CHECK_OFF <value_to_be_set>] [get_ips
<ip_name>]
For example:
set_property -dict [list config.TWTR_CHECK_OFF true] [get_ips <ip_name>]
The generated output files have the TWTR_CHECK parameter value set as per the selected
value.
Output Generation
For details, see the Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14].
I/O Planning
For details on I/O planning, see I/O Planning, page 235.
Required Constraints
This section is not applicable for this IP core.
Clock Frequencies
This section is not applicable for this IP core.
Clock Management
For information on clocking, see Clocking, page 501.
Clock Placement
This section is not applicable for this IP core.
Banking
This section is not applicable for this IP core.
Transceiver Placement
This section is not applicable for this IP core.
IMPORTANT: The set_input_delay and set_output_delay constraints are not needed on the
external memory interface pins in this design due to the calibration process that automatically runs at
start-up. Warnings seen during implementation for the pins can be ignored.
Simulation
For comprehensive information about Vivado simulation components, as well as
information about using supported third-party tools, see the Vivado Design Suite User
Guide: Logic Simulation (UG900) [Ref 16].
Example Design
This chapter contains information about the example design provided in the Vivado®
Design Suite. Vivado supports Open IP Example Design flow. To create the example design
using this flow, right-click the IP in the Source Window, as shown in Figure 34-1 and select
Open IP Example Design.
X-Ref Target - Figure 34-1
This option creates a new Vivado project. Upon selecting the menu, a dialog box to enter
the directory information for the new design project opens.
Select a directory, or use the defaults, and click OK. This launches a new Vivado with all of
the example design files and a copy of the IP.
The example design can be simulated using one of the methods in the following sections.
Project-Based Simulation
This method can be used to simulate the example design using the Vivado Integrated
Design Environment (IDE). Memory IP delivers memory models for RLDRAM 3.
The Vivado simulator, Questa Advanced Simulator, IES, and VCS tools are used for RLDRAM 3
IP verification at each software release. The Vivado simulation tool is used for RLDRAM 3
IP verification from the 2015.1 Vivado software release. The following subsections describe
steps to run a project-based simulation using each supported simulator tool.
5. In the Flow Navigator window, select Run Simulation and select Run Behavioral
Simulation option as shown in Figure 34-3.
6. Vivado invokes Vivado simulator and simulations are run in the Vivado simulator tool.
For more information, see the Vivado Design Suite User Guide: Logic Simulation (UG900)
[Ref 16].
4. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 34-5.
5. Vivado invokes Questa Advanced Simulator and simulations are run in the Questa
Advanced Simulator tool. For more information, see the Vivado Design Suite User Guide:
Logic Simulation (UG900) [Ref 16].
4. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 34-5.
5. Vivado invokes IES and simulations are run in the IES tool. For more information, see the
Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16].
4. In the Flow Navigator window, select Run Simulation and select Run
Behavioral Simulation option as shown in Figure 34-5.
5. Vivado invokes VCS and simulations are run in the VCS tool. For more information, see
the Vivado Design Suite User Guide: Logic Simulation (UG900) [Ref 16].
Simulation Speed
RLDRAM 3 provides a Vivado IDE option to reduce the simulation run time by selecting a behavioral XIPHY model instead of the UNISIM XIPHY model. Behavioral XIPHY model simulation is the default option for RLDRAM 3 designs. To select the simulation mode, click
the Advanced Options tab and find the Simulation Options as shown in Figure 33-3.
The SIM_MODE parameter in the RTL is given a different value based on the Vivado IDE
selection.
• SIM_MODE = BFM – If fast mode is selected in the Vivado IDE, the RTL parameter
reflects this value for the SIM_MODE parameter. This is the default option.
• SIM_MODE = FULL – If UNISIM mode is selected in the Vivado IDE, XIPHY UNISIMs are
selected and the parameter value in the RTL is FULL.
If the design is generated with the Reference Input Clock option selected as No Buffer (at Advanced > FPGA Options > Reference Input), the CLOCK_DEDICATED_ROUTE constraints and the BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV instantiation based on GCIO and MMCM allocation need to be handled manually for the IP flow. RLDRAM 3 does not generate clock constraints in the XDC file for No Buffer configurations, so you must supply the clock constraints yourself for the IP flow.
For the example design flow with No Buffer configurations, RLDRAM 3 generates the example design with a differential buffer instantiation for the system clock pins.
If, in the example design, the I/O pins for the system clock are changed to other pins with the I/O Pin Planner, the CLOCK_DEDICATED_ROUTE constraints and BUFG/BUFGCE/BUFGCTRL/BUFGCE_DIV instantiation need to be managed manually; otherwise, a DRC error is reported.
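A hedged sketch of the kind of constraints involved for a No Buffer configuration is shown below. The port name, net name, period, and CLOCK_DEDICATED_ROUTE value are placeholders that depend on your pinout and placement, and the input buffer/BUFG instantiation itself still has to be added in the RTL:

# Placeholder input clock constraint for a No Buffer system clock port
create_clock -period 3.334 -name sys_clk_ref [get_ports sys_clk_p]
# Placeholder routing waiver; apply only if the GCIO-to-MMCM routing requires it,
# and choose the value that matches your clocking topology
set_property CLOCK_DEDICATED_ROUTE ANY_CMT_COLUMN [get_nets sys_clk_ibuf]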
Test Bench
This chapter contains information about the test bench provided in the Vivado ® Design
Suite.
The Memory Controller is generated along with a simple test bench to verify the basic read
and write operations. The stimulus contains 100 consecutive writes followed by 100
consecutive reads for data integrity check.
Traffic Generator
Overview
This section describes the setup and behavior of the Traffic Generator. In the UltraScale™ architecture, the Traffic Generator is instantiated in the example design (example_top.sv) to drive the memory design through the application interface (Figure 36-1).
Figure 36-1: Traffic Generator driving the memory design through the application interface (app_en, app_cmd, app_addr, app_wdf_data, app_wdf_wren, and so on, with app_rdy, app_wdf_rdy, app_rd_data_valid, and app_rd_data returned).
Two Traffic Generators are available to drive the memory design: a Simple Traffic Generator and an Advanced Traffic Generator.
By default, Vivado ® connects the memory design to the Simple Traffic Generator. You can
choose to use the Advanced Traffic Generator by defining a switch HW_TG_EN in the
example_top.sv. The Simple Traffic Generator is referred to as STG and the Advanced
Traffic Generator is referred to as ATG for the remainder of this section.
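As an alternative to editing example_top.sv directly, the same compile-time switch can usually be supplied through the Verilog define list of the synthesis and simulation filesets. This is a hedged sketch only; the fileset names are the Vivado defaults and may differ in your project:

# Add HW_TG_EN to the Verilog defines of the default filesets
# Note: this sets the complete define list; append to any existing defines as needed
set_property verilog_define {HW_TG_EN} [get_filesets sources_1]
set_property verilog_define {HW_TG_EN} [get_filesets sim_1]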
After memory initialization and calibration are done, the ATG starts sending write
commands and read commands. If the memory read data does not match with the expected
read data, the ATG flags compare errors through the status interface. For VIO or ILA debug,
you have the option to connect status interface signals.
IMPORTANT: For DDR3 and DDR4 interfaces, ATG is disabled with the AXI interface.
The ATG repeats memory writes and reads on each of the two patterns infinitely. For simulations, the ATG performs a 1,000-command PRBS23 pattern, followed by a 1,000-command Hammer Zero pattern and a 1,000-command PRBS address pattern.
Feature Support
In this section, the basic ATG feature support and modes of operation are described. The ATG allows you to program different traffic patterns, a read-write mode, and the duration of the traffic burst based on your application.
Provide one traffic pattern for a simple traffic test in the direct instruction mode, or program up to 32 traffic patterns into the traffic pattern table for a regression test in the traffic table mode.
The second choice sends write and read commands pseudo-randomly. This submode is valid for DDR3/DDR4 and RLDRAM 3 only.
Create one traffic pattern for simple traffic test using the direct instruction mode
(vio_tg_direct_instr_en).
The example in Table 36-1 shows four traffic patterns programmed in the table mode. The first pattern has PRBS data traffic written in linear address space. 1,000 write commands are issued, followed by 1,000 read commands. Twenty cycles of NOPs are inserted between every 100 cycles of commands. After completion of instruction0, the instruction pointer advances to instruction1.
Similarly, instruction1, instruction2, and instruction3 are executed, and execution then loops back to instruction0.
The ATG performs error check when a traffic pattern is programmed to read/write modes
that have write requests followed by read request (that is, Write-read-mode or
Write-once-Read-forever-mode). The ATG first sends all write requests to the memory. After
all write requests are sent, the ATG sends read requests to the same addresses as the write
requests. Then the read data returning from memory is compared with the expected read
data.
If there is no mismatch error and the ATG is not programmed into an infinite loop,
vio_tg_status_done asserts to indicate run completion.
The ATG has watchdog logic. The watchdog logic checks if the ATG has any request sent to
the application interface or the application interface has any read data return within N
(parameter TG_WATCH_DOG_MAX_CNT) number of cycles. This provides information on
whether memory traffic is running or stalled (because of reasons other than data
mismatch).
Usage
In this section, basic usage and programming of the ATG are covered. The ATG is programmed and controlled using the VIO interface. Table 36-2 shows the instruction table programming options.
Note: Application interface signals are not shown in this section. See the corresponding memory section for the application interface data format.
VIO is instantiated for the DDR3/DDR4 example design to exercise the Traffic Generator
modes when the design is generated with the ATG option.
The expected write data and the data that is read back are added to the ILA instance. Write and read data can be viewed in the ILA for one byte only. Data for other bytes can be viewed by driving the appropriate value on vio_rbyte_sel through the VIO. vio_rbyte_sel is a 4-bit signal; based on the value driven on it through the VIO, the corresponding DQ byte write and read data are listed in the ILA.
The VIO to drive ATG modes is disabled in the default example design. To enable the VIO that drives ATG modes for DDR3/DDR4 interfaces, define the VIO_ATG_EN macro in the example_top module as follows:
`define VIO_ATG_EN
You have to manually instantiate VIO for other interfaces to exercise the Traffic Generator
modes.
The ATG default control connectivity in the example design created by Vivado is listed in
Table 36-3.
Note: Application interface signals are not shown in this table. See the corresponding memory
section for application interface address/data width.
Table 36-3: Default Traffic Generator Control Connection

Signal | I/O | Width | Description | Default Value
clk | I | 1 | Traffic Generator Clock | Traffic Generator Clock
rst | I | 1 | Traffic Generator Reset | Traffic Generator Reset
init_calib_complete | I | 1 | Calibration Complete | Calibration Complete
General Control
vio_tg_start | I | 1 | Enable the traffic generator to proceed from the "START" state to the "LOAD" state after calibration completes. If you do not plan to program the instruction table or PRBS data seed, tie this signal to 1'b1. If you plan to program the instruction table or PRBS data seed, set this bit to 0 during reset. After reset deassertion and when instruction/seed programming is done, set this bit to 1 to start the traffic generator. | Reserved signal. Tie to 1'b1.
vio_tg_rst | I | 1 | Reset the traffic generator (synchronous reset, level sensitive). If there is outstanding traffic in the memory pipeline, assert the signal for some number of clock cycles until all outstanding transactions have completed. | Reserved signal. Tie to 0.
vio_tg_restart | I | 1 | Restart the traffic generator after the generator is done with traffic, paused, or stopped with an error (level sensitive). If there is outstanding traffic in the memory pipeline, assert the signal for some number of clock cycles until all outstanding transactions have completed. | Reserved signal. Tie to 0.
vio_tg_pause | I | 1 | Pause the traffic generator (level sensitive). | Reserved signal. Tie to 0.
vio_tg_err_chk_en | I | 1 | If enabled, stop after the first error is detected. A read test is performed to determine whether a "READ" or "WRITE" error occurred. If not enabled, continue traffic without stopping. | Reserved signal. Tie to 0.
The ATG includes multiple data error reporting features. When using the Traffic Generator
Default Behavior, check if there is a memory error in the Status register
(vio_tg_status_err_sticky_valid) or if memory traffic stops
(vio_tg_status_watch_dog_hang).
After the first memory error is seen, the ATG logs the error address
(vio_tg_status_first_err_addr) and bit mismatch
(vio_tg_status_first_err_bit).
Table 36-4 shows the common Traffic Generator Status register output which can be used
for debug.
Table 36-4: Common Traffic Generator Status Register for Debug (Cont'd)

Signal | I/O | Width | Description
vio_tg_status_first_exp_bit | O | APP_DATA_WIDTH | If vio_tg_status_first_exp_bit_valid is set to 1, the expected read data is stored in this register.
vio_tg_status_first_read_bit_valid | O | 1 | If vio_tg_err_chk_en is set to 1, this represents the read data valid when the first mismatch error is encountered.
vio_tg_status_first_read_bit | O | APP_DATA_WIDTH | If vio_tg_status_first_read_bit_valid is set to 1, the read data from memory is stored in this register.
vio_tg_status_err_bit_sticky_valid | O | 1 | Accumulated error mismatch valid over time. This register is reset by vio_tg_err_clear, vio_tg_err_continue, and vio_tg_restart.
vio_tg_status_err_bit_sticky | O | APP_DATA_WIDTH | If vio_tg_status_err_bit_sticky_valid is set to 1, this represents the accumulated error bits.
vio_tg_status_done | O | 1 | All programmed traffic completes. Note: If an infinite loop is programmed, vio_tg_status_done does not assert.
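When the VIO is included in the design, these status registers can also be read from the Hardware Manager Tcl Console. This is a hedged sketch only; the hw_vio name and probe names are assumptions that depend on how the VIO was generated and connected in your design:

# Refresh the VIO inputs and read back selected ATG status probes
set vio [get_hw_vios hw_vio_1]
refresh_hw_vio $vio
get_property INPUT_VALUE [get_hw_probes *vio_tg_status_done* -of_objects $vio]
get_property INPUT_VALUE [get_hw_probes *vio_tg_status_err_bit_sticky_valid* -of_objects $vio]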
The VIO signal vio_tg_err_chk_en is used to enable error checking and can report read
versus write data errors on vio_tg_status_err_type when
vio_tg_status_err_type_valid is High. When using vio_tg_err_chk_en, the ATG
can be programmed to have two different behaviors when traffic error is detected.
• The ATG stops traffic after the first error. The ATG then performs a read-check to detect whether the mismatch seen is a "WRITE" error or a "READ" error. When vio_tg_status_state reaches the ERRDone state, the read-check is completed. The vio_tg_restart can be pulsed to clear and restart the ATG, or the vio_tg_err_continue can be pulsed to continue traffic.
• The ATG continues sending traffic. The traffic can be restarted by asserting pause (vio_tg_pause), followed by a restart pulse (vio_tg_restart), and then deasserting pause.
In both cases, bitwise sticky bit mismatch is available in VIO for accumulated mismatch.
For design debug, the vio_tg_status_err* signals track errors seen on the current read data return. The vio_tg_status_first* signals store the first error seen. The vio_tg_status_err_bit_sticky* signals accumulate all error bits seen.
Error bit buses can be very wide. It is recommended to add a MUX stage and a flop stage before connecting the bus to an ILA or VIO.
Error status can be cleared when the ATG is in either ERRDone or Pause states. Send a pulse
to the vio_tg_clear to clear all error status except sticky bit. Send a pulse to the
vio_tg_clear_all to clear all error status including sticky bit.
For additional information on how to debug data errors using the ATG, see Debugging Data
Errors in Chapter 38, Debugging.
After calibration is completed, the ATG starts sending the current traffic pattern presented at the VIO interface if direct instruction mode is ON, or the default traffic sequence from the traffic pattern table if direct instruction mode is OFF.
To run a custom traffic pattern, either program the instruction table before the ATG starts, or pause the ATG, program the instruction table, and restart the test traffic through the VIO.
Steps to program the instruction table (wait for at least one general interconnect cycle
between each step) are listed here.
Common steps:
In Figure 36-2, after c0_init_calib_complete signal is set, the ATG starts executing
default instructions preloaded in the instruction table. Then, the vio_tg_pause is set to
pause the ATG, and then pulse vio_tg_restart. Three ATG instructions are being
re-programmed and the ATG is started again by deasserting vio_tg_pause and asserting
tg_start.
Figure 36-3 zooms into the VIO instruction programming in Figure 36-2. After pausing the
traffic pattern, vio_tg_restart is pulsed. Then vio_tg_instr_num and
vio_tg_instr* are set, followed by vio_tg_program_en pulse (note that
vio_tg_instr_num and vio_tg_instr* are stable for four general interconnect cycles
before and after the vio_tg_program_en pulse). After instruction programming is finished, vio_tg_pause is deasserted and vio_tg_start is asserted.
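The same pause/restart/program/start sequence can be driven from the Hardware Manager Tcl Console instead of the VIO dashboard. This is only a hedged sketch; the hw_vio name and probe names are assumptions that depend on your design, and the instruction programming itself is elided:

# Pause the ATG, pulse restart, then (re)program and restart traffic
set vio [get_hw_vios hw_vio_1]
set_property OUTPUT_VALUE 1 [get_hw_probes *vio_tg_pause* -of_objects $vio]
commit_hw_vio $vio
set_property OUTPUT_VALUE 1 [get_hw_probes *vio_tg_restart* -of_objects $vio]
commit_hw_vio $vio
set_property OUTPUT_VALUE 0 [get_hw_probes *vio_tg_restart* -of_objects $vio]
commit_hw_vio $vio
# ... set vio_tg_instr_num / vio_tg_instr_* and pulse vio_tg_program_en here ...
set_property OUTPUT_VALUE 0 [get_hw_probes *vio_tg_pause* -of_objects $vio]
set_property OUTPUT_VALUE 1 [get_hw_probes *vio_tg_start* -of_objects $vio]
commit_hw_vio $vio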
Important Note:
1. For Write-read mode or Write-once-Read-forever modes, this ATG issues all write traffic,
followed by all read traffic. During read data check, expected read traffic is generated
on-the-fly and compared with read data.
If a memory address is written more than once with different data patterns, the ATG reports a false error. Xilinx recommends that, for a given traffic pattern programmed, the number of commands be less than the available address space programmed.
2. The ATG performs error check when read/write mode of a traffic pattern is programmed
to be "Write-read mode" or "Write-once-Read-forever modes." For "Write-only" or
"Read-only" modes error check is not performed.
The ATG data flow is summarized in Figure 36-4. The ATG is controlled and programmed
through the VIO interface. Based on current instruction pointer value, an instruction is
issued by the ATG state machine shown in Figure 36-5.
Based on the traffic pattern programmed in the Read/Write mode, Read and/or Write requests are sent to the application interface. Write patterns are generated by the Write Data Generation, Write Victim Pattern, and Write Address Generation engines (gray in the figure). Similarly, Read patterns are generated by the Read Address Generation engine (dark gray).
Figure 36-4: Traffic Generator data flow, including the Error Checker and the Read Address generation used for the read test.
Figure 36-5 and Table 36-5 show the ATG state machine and its states. The ATG resets at the
"Start" state. After calibration completion (init_calib_complete) and the tg_start is
asserted, the ATG state moves to instruction load called the "Load" state. The "Load" state
performs next instruction load. When the instruction load is completed, the ATG state
moves to Data initialization called the "Dinit" state. The "Dinit" state initializes all Data/
Address generation engines. After completion of data initialization, the ATG state moves to
execution called the "exe" state. The "Exe" state issues Read and/or Write requests to the
APP interface.
At the "Exe" state, you can pause the ATG and the ATG state moves to the "Pause" state. At
the "Pause" state, the ATG can be restarted by issuing tg_restart through the VIO, or
un-pause the ATG back to the "Exe" state.
At the "Exe" state, the ATG state goes through RWWait → RWload → Dinit states if
Write-Read mode or Write-once-Read-forever modes are used. At the RWWait, the ATG
waits for all Read requests to have data returned (for QDR II+ SRAM).
At the RWload state, the ATG transitions from Write mode to Read mode for DDR3/DDR4,
RLDRAM II/RLDRAM 3, or from Write/Read mode to Read only mode for QDR II+ SRAM
Write-once-Read-forever mode.
At the "Exe" state, the ATG state goes through LDWait → Load if the current instruction is
completed. At the LDWait, the ATG waits for all Read requests to have data returned.
At the "Exe" state, the ATG state goes through DNWait → Done if the last instruction is
completed. At the DNWait, the ATG waits for all Read requests to have data returned.
At the "Exe" state, the ATG state goes through ERRWait → ERRChk if an error is detected. At
the ERRWait, the ATG waits for all Read requests to have data returned. The "ERRChk" state
performs read test by issuing read requests to the application interface and determining
whether "Read" or "Write" error occurred. After read test completion, the ATG state moves
to "ERRDone."
At "Done," "Pause," and "ErrDone" states, the ATG can be restarted ATG by issuing
tg_restart.
Figure 36-5: ATG state machine. The states shown include Start, Load, Dinit, RWload, Pause, ERRWait, ERRChk, and Done; the transitions shown include init_calib_complete && tg_start, tg_restart, and Error Detected.
Table 36-6: CMD_PER_CLK Setting for 4:1 General Interconnect Cycle to Memory Clock Cycle
Table 36-7: CMD_PER_CLK Setting for 2:1 General Interconnect Cycle to Memory Clock Cycle
Note: For designs with a 2:1 general interconnect cycle to memory clock cycle ratio and burst length 8 (BL = 8), the ATG error status interface vio_tg_status_* presents data as a full burst (that is, double the APP_DATA_WIDTH).
Note: CAL_CPLX is a Xilinx internal mode that is used for the Calibration Complex Pattern.
The following are steps to program PRBS Data Seed (wait for at least one general
interconnect cycle between each step):
1. Set vio_tg_start to 0 to stop the traffic generator before reset deassertion.
2. Check that vio_tg_status_state is TG_INSTR_START (hex 0).
3. Set vio_tg_seed_num and vio_tg_seed_data with the desired seed address number and seed.
4. Wait for four general interconnect cycles (optional for relaxing VIO write timing).
5. Set vio_tg_seed_program_en to 1. This enables seed programming.
6. Wait for four general interconnect cycles (optional for relaxing VIO write timing).
7. Set vio_tg_seed_program_en to 0. This disables seed programming.
8. Wait for four general interconnect cycles (optional for relaxing VIO write timing).
9. Repeat steps 3 to 8 if more than one seed (data bit) is programmed.
10. Set vio_tg_start to 1. This starts the traffic generator with the new seed programming.
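This sequence can also be driven from the Hardware Manager Tcl Console when the seed ports are connected to a VIO. The following is only a hedged sketch; the hw_vio name, probe names, and the seed values shown are assumptions (multi-bit OUTPUT_VALUE strings follow each probe's radix):

set vio [get_hw_vios hw_vio_1]
set_property OUTPUT_VALUE 0 [get_hw_probes *vio_tg_start* -of_objects $vio]
commit_hw_vio $vio
# Hypothetical seed address number and seed data
set_property OUTPUT_VALUE 05 [get_hw_probes *vio_tg_seed_num* -of_objects $vio]
set_property OUTPUT_VALUE 5A5A5A [get_hw_probes *vio_tg_seed_data* -of_objects $vio]
commit_hw_vio $vio
# Pulse the seed program enable
set_property OUTPUT_VALUE 1 [get_hw_probes *vio_tg_seed_program_en* -of_objects $vio]
commit_hw_vio $vio
set_property OUTPUT_VALUE 0 [get_hw_probes *vio_tg_seed_program_en* -of_objects $vio]
commit_hw_vio $vio
# Restart the traffic generator with the new seed
set_property OUTPUT_VALUE 1 [get_hw_probes *vio_tg_start* -of_objects $vio]
commit_hw_vio $vio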
For a 4:1 general interconnect cycle to memory clock cycle ratio, the seed format consists of eight data bursts of linear seed; for a 2:1 ratio, it consists of four data bursts. Each linear seed has a width of DQ_WIDTH.
For example, for a 72-bit wide memory design with a 4:1 general interconnect cycle to memory clock cycle ratio, a linear seed starting with a base of decimal 1,024 is represented by {72'd1031, 72'd1030, 72'd1029, 72'd1028, 72'd1027, 72'd1026, 72'd1025, 72'd1024}.
As a second example, for a 16-bit wide memory design with a 2:1 general interconnect cycle to memory clock cycle ratio, a linear seed starting with a base of zero is represented by {16'd3, 16'd2, 16'd1, 16'd0}.
Table 36-8: Linear Address Seed Look Up Table for 4:1 General Interconnect Cycle to Memory
Clock Cycle
Table 36-9: Linear Address Seed Look Up Table for 2:1 General Interconnect Cycle to Memory
Clock Cycle
The least significant bit(s) of the linear address seed are padded with zeros. For DDR3/DDR4, three bits of zero are padded because a burst length of eight is always used. For RLDRAM 3, four bits of zero are padded because the ATG cycles through the 16 RLDRAM 3 banks automatically. For QDR II+ and QDR-IV SRAM interfaces, zero padding is not required.
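As an illustration of the padding for DDR3/DDR4: because three zero bits are appended, a linear address seed value of 1 corresponds to memory address 8, so consecutive seed values step through the address space one BL8 burst at a time.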
Read/Write Submode
When Read/Write mode is programmed to Write/Read mode in an instruction, there are
two options to perform the data write and read.
IMPORTANT: This mode is not supported in QDR II+ or QDR-IV SRAM interfaces.
For QDR II+ SRAM, the ATG supports separate Write and Read command signals in an
application interface. When Write-Read mode is selected, the ATG issues Write and Read
command simultaneously.
For QDR-IV SRAM, the memory controller supports two ports. In each port, there are four
read/write channels. The QDR-IV ATG top-level module is mig_v1_2_hw_tg_qdriv. The
mig_v1_2_hw_tg_qdriv instantiates two regular ATG (mig_v1_2_hw_tg) and has two
ATG status register interfaces. Each status register interface maps to one of the two ports.
QDR-IV ATG supports four different modes of traffic setup. The traffic modes are
programmed using vio_tg_glb_qdriv_rw_submode.
• Both PortA and PortB send Write traffic, followed by Read traffic.
Table 36-10: QDR-IV Read/Write Mask and Read/Write Channel Sharing Sequence

Cycle 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Write Mask 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Read Mask 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Cycle 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Write Mask 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Read Mask 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Cycle 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Write Mask 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Read Mask 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Cycle 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
Write Mask 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Read Mask 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Cycle 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
Write Mask 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Read Mask 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Multiple IP Cores
This chapter describes the specifications and pin rules for generating multiple IP cores.
1. Generate the target Memory IP. If the design includes multiple instances of the same
Memory IP configuration, the IP only needs to be generated once. The same IP can be
instantiated multiple times within the design.
° If the IP cores share the input sys_clk, select the No Buffer clocking option during IP generation, with the same frequency value selected for the Reference Input Clock Period (ps) option. Memory IP cores that share sys_clk must be allocated in the same I/O column. For more information, see the Sharing of Input Clock Source section for each controller.
2. Create a wrapper file to instantiate the target Memory IP cores.
3. Assign the pin locations for the Memory IP I/O signals. For more information on the pin rules of the respective interface, see the Sharing of a Bank section for each controller. Also, to learn more about the available Memory IP pin planning options, see the Vivado Design Suite User Guide: I/O and Clock Planning (UG899) [Ref 18].
4. Ensure the following specifications are followed.
Sharing of a Bank
Pin rules of each controller must be followed during IP generation. For more information on
pin rules of each interface, see the respective IP sections:
The same bank can be shared across multiple IP cores; Memory IP allows banks to be shared across multiple IP cores provided the rules for combining I/O standards in the same bank are followed.
IMPORTANT: If two controllers share a bank, they cannot be reset independently. The two controllers
must have a common reset input.
For more information on the rules for combining I/O standards in the same bank, see the "Rules for Combining I/O Standards in the Same Bank" section in the UltraScale™ Architecture SelectIO™ Resources User Guide (UG571) [Ref 7]. The DCI I/O banking rules are also captured in UG571.
In the wrapper file in which multiple Memory IP cores are instantiated, do not connect any
signal to dbg_clk and keep the port open during instantiation. Vivado takes care of the
dbg_clk connection to the dbg_hub.
MMCM Constraints
The MMCM must be allocated in the center bank of the banks selected for the memory I/Os. Memory IP generates the LOC constraints for the MMCM such that there is no conflict if the same bank is shared across multiple IP cores.
Debugging
This appendix includes details about resources available on the Xilinx ® Support website
and debugging tools.
TIP: If the IP generation halts with an error, there might be a license issue. See License Checkers in
Chapter 1 for more details.
Documentation
This product guide is the main document associated with the Memory IP. This guide, along
with documentation related to all products that aid in the design process, can be found on
the Xilinx Support web page or by using the Xilinx Documentation Navigator.
Download the Xilinx Documentation Navigator from the Downloads page. For more
information about this tool and the features available, open the online help after
installation.
Solution Centers
See the Xilinx Solution Centers for support on devices, software tools, and intellectual
property at all stages of the design cycle. Topics include design assistance, advisories, and
troubleshooting tips.
The Solution Center specific to the Memory IP core is located at Xilinx Memory IP Solution
Center.
Answer Records
Answer Records include information about commonly encountered problems, helpful
information on how to resolve these problems, and any known issues with a Xilinx product.
Answer Records are created and maintained daily ensuring that users have access to the
most accurate information available.
Answer Records for this core can be located by using the Search Support box on the main
Xilinx support web page. To maximize your search results, use proper keywords such as:
• Product name
• Tool message(s)
• Summary of the issue encountered
A filter search is available after results are returned to further target the results.
AR: 58435
Technical Support
Xilinx provides technical support at the Xilinx support web page for this LogiCORE™ IP
product when used as described in the product documentation. Xilinx cannot guarantee
timing, functionality, or support if you do any of the following:
• Implement the solution in devices that are not defined in the documentation.
• Customize the solution beyond that allowed in the product documentation.
• Change any section of the design labeled DO NOT MODIFY.
To contact Xilinx Technical Support, navigate to the Xilinx Support web page.
Debug Tools
There are many tools available to address Memory IP design issues. It is important to know
which tools are useful for debugging various situations.
XSDB Debug
Memory IP includes XSDB debug support. The Memory IP stores useful core configuration,
calibration, and data window information within internal block RAM. The Memory IP debug
XSDB interface can be used at any point to read out this information and get valuable
statistics and feedback from the Memory IP. The information can be viewed through a
Memory IP Debug GUI or through available Memory IP Debug Tcl commands.
To export information about the properties to a spreadsheet, see Figure 38-2 which shows
the Memory IP Core Properties window. Under the Properties tab, right-click anywhere in
the field, and select the Export to Spreadsheet option in the context menu. Select the
location and name of the file to save, use all the default options, and then select OK to save
the file.
For more information on the Properties window menu commands, see the “Properties
Window Popup Menu Commands” section in the Vivado Design Suite User Guide: Using the
Vivado IDE (UG893) [Ref 22].
This outputs all XSDB Memory IP content that is displayed in the GUIs.
report_debug_core example:
Example Design
Generation of a DDR3/DDR4 design through the Memory IP tool allows an example design
to be generated using the Vivado Generate IP Example Design feature. The example
design includes a synthesizable test bench with a traffic generator that is fully verified in
simulation and hardware. This example design can be used to observe the behavior of the
Memory IP design and can also aid in identifying board-related problems.
For complete details on the example design, see Chapter 6, Example Design. The following
sections describe using the example design to perform hardware validation.
Debug Signals
The Memory IP UltraScale designs include an XSDB debug interface that can be used to
very quickly identify calibration status and read and write window margin. This debug
interface is always included in the generated Memory IP UltraScale designs.
Additional debug signals for use in the Vivado Design Suite debug feature can be enabled
using the Debug Signals option on the FPGA Options Memory IP GUI screen. Enabling this
feature allows example design signals to be monitored using the Vivado Design Suite
debug feature. Selecting this option brings the debug signals to the top-level and creates a
sample ILA core that debug signals can be port mapped into.
Furthermore, a VIO core can be added as needed. For details on enabling this debug
feature, see Customizing and Generating the Core, page 217. The debug port is disabled for
functional simulation and can only be enabled if the signals are actively driven by the user
design.
The Vivado logic analyzer is used with the logic debug IP cores, including:
See the Vivado Design Suite User Guide: Programming and Debugging (UG908) [Ref 20].
Reference Boards
The KCU105 evaluation kit is a Xilinx development board that includes FPGA interfaces to a
64-bit (4 x16 components) DDR4 interface. This board can be used to test user designs and
analyze board layout.
Hardware Debug
Hardware issues can range from link bring-up to problems seen after hours of testing. This
section provides debug steps for common issues. The Vivado Design Suite debug feature is
a valuable resource to use in hardware debug. The signal names mentioned in the following
individual sections can be probed using the Vivado Design Suite debug feature for
debugging the specific problems.
Memory IP Usage
To focus the debug of calibration or data errors, use the provided Memory IP example
design on the targeted board with the Debug Feature enabled through the Memory IP
UltraScale GUI.
Note: Using the Memory IP example design and enabling the Debug Feature is not required to
capture calibration and window results using XSDB, but it is useful to focus the debug on a known
working solution.
However, the debug signals and example design are required to analyze the provided ILA
and VIO debug signals within the Vivado Design Suite debug feature. The latest Memory IP
release should be used to generate the example design.
General Checks
Ensure that all the timing constraints for the core were properly incorporated from the
example design and that all constraints were met during implementation.
1. If using MMCMs in the design, ensure that all MMCMs have obtained lock by
monitoring the locked port.
2. If your outputs go to 0, check your licensing.
3. Ensure all guidelines referenced in Chapter 4, Designing with the Core and the
UltraScale Architecture PCB Design and Pin Planning User Guide (UG583) [Ref 11] have
been followed.
4. Chapter 4, Designing with the Core, includes information on clocking, pin/bank, and reset requirements. The UltraScale Architecture PCB Design and Pin Planning User Guide (UG583) [Ref 11] includes PCB guidelines such as trace matching, topology and routing, noise, termination, and I/O standard requirements. Adherence to these requirements, along with proper board design and signal integrity analysis, is critical to the success of high-speed memory interfaces.
5. Measure all voltages on the board during idle and non-idle times to ensure the voltages
are set appropriately and noise is within specifications.
° Ensure VREF is measured when External VREF is used and that it is set to VCCO/2.
6. When applicable, check vrp resistors.
7. Look at the clock inputs to ensure that they are clean.
8. Information on the clock input specifications can be found in the AC and DC Switching
Characteristics data sheets (LVDS input requirements and PLL requirements should be
considered).
9. Check the reset to ensure the polarity is correct and the signal is clean.
10. Check terminations. The UltraScale Architecture PCB Design and Pin Planning User Guide
(UG583) [Ref 11] should be used as a guideline.
11. Perform general signal integrity analysis.
° Memory IP sets the most ideal ODT setting based on the memory parts; the setting is described in the RTL as MR1. The RTL file is ddr3_0_ddr3.sv for DDR3 and ddr4_0_ddr4.sv for DDR4. IBIS simulations should be run to ensure that the terminations, the most ideal ODT, and the output drive strength settings are appropriate.
° For DDR3/DDR4, observe dq/dqs on a scope at the memory. View the alignment of
the signals, VIL/VIH, and analyze the signal integrity during both writes and reads.
° Observe the Address and Command signals on a scope at the memory. View the
alignment, VIL/VIH, and analyze the signal integrity.
12. Verify the memory parts on the board(s) in test are the correct part(s) set through the
Memory IP. The timing parameters and signals widths (that is, address, bank address)
must match between the RTL and physical parts. Read/write failures can occur due to a
mismatch.
13. If Data Mask (DM) is not being used for DDR3, ensure DM pin is tied low appropriately.
For more information, see DDR3 Pin Rules in Chapter 4. Also, make sure that the GUI
option for the DM selection is set correctly. If the DM is enabled in the IP but is not
connected to the controller on the board, the calibration fails unpredictably.
14. For DDR3/DDR4, driving Chip Select (cs_n) from the FPGA is not required in single-rank
designs. It can instead be tied low at the memory device according to the memory
vendor’s recommendations. Ensure the appropriate selection (cs_n enable or disable) is
made when configuring the IP. Calibration sends commands differently based on
whether cs_n is enabled or disabled. If the pin is tied low at the memory, ensure cs_n
is disabled during IP configuration.
15. ODT is required for all DDR3/DDR4 interfaces and therefore must be driven from the
FPGA. Memory IP sets the most ideal ODT setting based on extensive simulation. The
most ideal ODT value is described in the RTL as MR1. The RTL file is ddr3_0_ddr3.sv
for DDR3 and ddr4_0_ddr4.sv for DDR4. External to the memory device, terminate
ODT as specified in the UltraScale Architecture PCB Design and Pin Planning User Guide
(UG583) [Ref 11].
16. Check for any floating pins.
° The par input for command and address parity, alert_n input/output, and the
TEN input for Connectivity Test Mode are not supported by the DDR4 UltraScale
interface. Consult UltraScale Architecture PCB Design and Pin Planning User Guide
(UG583) [Ref 11] on how to connect these signals when not used.
Note: The par is required for DDR3 RDIMM interfaces and is optional for DDR4 RDIMM/
LRDIMM interfaces.
21. Verify trace matching requirements are met as documented in the UltraScale
Architecture PCB Design and Pin Planning User Guide (UG583) [Ref 11].
22. Bring the init_calib_complete out to a pin and check with a scope or view whether
calibration completed successfully in Hardware Manager in the Memory IP Debug GUI.
23. Verify the configuration of the Memory IP. The XSDB output can be used to verify the
Memory IP settings. For example, the clock frequencies, version of Memory IP, Mode
register settings, and the memory part configuration (see step 12) can be determined
using Table 38-1.
24. Copy all of the data reported and submit it as part of a WebCase. For more information
on opening a WebCase, see Technical Support, page 580.
Figure: Overview of the DDR3/DDR4 calibration flow. After System Reset, XIPHY BISC, and XSDB Setup, the flow runs Read calibration with a sanity check, Write DQS-to-DQ (Simple) with Write/Read Sanity Check 2, Write DQS-to-DQ (Complex) with Write/Read Sanity Checks 3 and 4, and Read DQS Centering Multi-Rank Adjustment with Write/Read Sanity Check 5, incrementing the rank count until all ranks are done and calibration is complete.
Note: Sanity Check 5 runs for multi-rank and for a rank other than the first rank. For example, if there were two ranks, it would run on the second only. Sanity Check 6 runs for multi-rank and goes through all of the ranks.
Memory Initialization
Debug Signals
There are two types of debug signals used in Memory IP UltraScale debug. The first set is a
part of a debug interface that is always included in generated Memory IP UltraScale
designs. These signals include calibration status and tap settings that can be read at any
time throughout operation when the Hardware Manager is open using either Tcl commands
or the Memory IP Debug GUI.
The second type of debug signals are fully integrated in the IP when the Debug Signals
option in the Memory IP tool is enabled and when using the Memory IP Example Design.
However, these signals are currently only brought up in the RTL and not connected to the
debug VIO/ILA cores. Manual connection into either custom ILA/VIOs or the ILA generated
when the Debug Signals option is enabled is currently required. These signals are
documented in Table 38-2.
Table 38-2: DDR3/DDR4 Debug Signals Used in Vivado Design Suite Debug Feature

Signal | Signal Width | Signal Description
init_calib_complete | [0:0] | Signifies the status of calibration. 1'b0 = Calibration not complete. 1'b1 = Calibration completed successfully.
cal_pre_status | [8:0] | Signifies the status of the memory core before calibration has started. See Table 38-3 for decoding information.
cal_r*_status | [127:0] | Signifies the status of each stage of calibration. See Table 38-4 for decoding information. See the following relevant debug sections for usage information. Note: The * indicates the rank value. Each rank has a separate cal_r*_status bus.
dbg_cal_seq | [2:0] | Calibration sequence indicator, when RTL is issuing commands to the DRAM. [0] = 1'b0 -> Single Command Mode, one DRAM command only; 1'b1 -> Back-to-Back Command Mode, RTL is issuing back-to-back commands. [1] = Write Leveling Mode. [2] = Extended write mode enabled, where extra data and DQS pulses are sent to the DRAM before and after the regular write burst.
dbg_cal_seq_cnt | [31:0] | Calibration command sequence count used when RTL is issuing commands to the DRAM. Indicates how many DRAM commands are requested (counts down to 0 when all commands are sent out).
dbg_cal_seq_rd_cnt | [7:0] | Calibration read data burst count (counts down to 0 when all expected bursts return), used when RTL is issuing read commands to the DRAM.
dbg_rd_valid | [0:0] | Read Data Valid
(not shown) | (not shown) | Calibration byte selection (used to determine which byte is currently selected and displayed in dbg_rd_data).
dbg_rd_data | [63:0] | Read Data from Input FIFOs
dbg_rd_data_cmp | [63:0] | Comparison of dbg_rd_data and dbg_expected_data
dbg_expected_data | [63:0] | Displays the expected data during calibration stages that use general interconnect-based data pattern comparison such as Read per-bit deskew or read DQS centering (complex).
dbg_cplx_config | [15:0] | Complex Calibration Configuration. [0] = Start. [1] = 1'b0 selects the read pattern; 1'b1 selects the write pattern. [3:2] = Rank selection. [8:4] = Byte selection. [15:9] = Number of loops through the data pattern.
dbg_cplx_status | [1:0] | Complex Calibration Status. [0] = Busy. [1] = Done.
dbg_cplx_err_log | [63:0] | Complex calibration bitwise comparison result for all bits in the selected byte. The comparison is stored for each bit (1'b1 indicates a compare mismatch): {fall3, rise3, fall2, rise2, fall1, rise1, fall0, rise0}. [7:0] = Bit[0] of the byte, [15:8] = Bit[1], [23:16] = Bit[2], [31:24] = Bit[3], [39:32] = Bit[4], [47:40] = Bit[5], [55:48] = Bit[6], [63:56] = Bit[7].
dbg_io_address | [27:0] | MicroBlaze I/O Address Bus
dbg_pllGate | [0:0] | PLL Lock Indicator
dbg_phy2clb_fixdly_rdy_low | [BYTES x 1 - 1:0] | XIPHY fixed delay ready signal (lower nibble)
dbg_phy2clb_fixdly_rdy_upp | [BYTES x 1 - 1:0] | XIPHY fixed delay ready signal (upper nibble)
dbg_phy2clb_phy_rdy_low | [BYTES x 1 - 1:0] | XIPHY PHY ready signal (lower nibble)
dbg_phy2clb_phy_rdy_upp | [BYTES x 1 - 1:0] | XIPHY PHY ready signal (upper nibble)
Traffic_error | [BYTES x 8 x 8 - 1:0] | Reserved
Traffic_clr_error | [0:0] | Reserved
Win_start | [3:0] | Reserved
Configure the device and, while the Hardware Manager is open, perform one of the
following:
1. Use the available XSDB Memory IP GUI to identify which stages have completed, which,
if any, has failed, and review the Memory IP properties window for a message on the
failure. Here is a sample of the GUI for a passing and failing case:
2. Manually analyze the XSDB output by running the following commands in the Tcl
prompt:
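A hedged sketch of such commands, assuming a single Memory IP debug core (hw_mig object) on the target device:

# Refresh the device and the Memory IP debug core, then dump its XSDB properties
refresh_hw_device [lindex [get_hw_devices] 0]
refresh_hw_mig [lindex [get_hw_migs] 0]
report_property [lindex [get_hw_migs] 0]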
When the rank and calibration stage causing the failure are known, the failing byte, nibble,
and/or bit position and error status for the failure can be identified using the signals listed
in Table 38-6.
Table 38-6: DDR3/DDR4 DDR_CAL_ERROR_0/_1/_CODE Decoding

Variable Name | Description
DDR_CAL_ERROR_0 | Bit position failing
DDR_CAL_ERROR_1 | Nibble or byte position failing
DDR_CAL_ERROR_CODE | Error code specific to the failing stage of calibration. See the failing stage section below for details.
With these error codes, the failing stage of calibration, failing bit, nibble, and/or byte
positions, and error code are known. The next step is to review the failing stage in the
following section for specific debugging steps.
A warning flag indicates something unexpected occurred but calibration can continue.
Warnings can occur for multiple bits or bytes. Therefore, a limit on the number of warnings
stored is not set. Warnings are outputs from the PHY, where the cal_warning signal is
asserted for a single clock cycle to indicate a new warning.
In XSDB, the warnings are stored as part of the leftover address space in the block RAM
used to store the XSDB data. The amount of space left over for warnings is dependent on
the memory configuration (bus width, ranks, etc.).
The Vivado IDE displays warnings as highlighted in the example shown in Figure 38-8.
The same warnings are displayed in the Properties window where the rest of the XSDB
information is presented, as shown Figure 38-9. Apply a search filter of "warning" to find
only the warning information.
The following steps show how to manually read out the warnings.
1. Check the XSDB warnings fields to see if any warnings have occurred as listed in
Table 38-7. If CAL_WARNINGS_END is non-zero then at least one warning has occurred.
Table 38-7: DDR3/DDR4 DDR_CAL_ERROR_0/_1/_CODE Decoding

Variable Name | Description
CAL_WARNINGS_START | Number of block RAM address locations used to store a single warning (set to 2).
CAL_WARNINGS_END | Total number of warnings stored in the block RAM.
2. Determine the end of the regular XSDB address range. END_ADDR0 and END_ADDR1
together form the end of the XSDB address range in the block RAM. The full address is
made up by concatenating the two addresses together in binary (each made up of nine
bits). For example, END_ADDR0 = 0x0AA and END_ADDR1 = 0x004 means the end address is 0x8AA (18'b00_0000_1000_1010_1010).
3. At the Hardware Manager Tcl Console, use the following command to read out a single
warning:
read_hw -hw_core [lindex [get_hw_cores] 0] 0 0x8AB 0x02
This command reads out the XSDB block RAM location for the address provided up
through the number of address locations requested. In the example above, the XSDB
end address is 0x8AA. Add 1 to this value to get to the warning storage area. The next
field (0x02 in the above example command) is the number of addresses to read from
the starting location. Multiple addresses can be read out by changing 0x02 to whatever
value is required.
4. The hex value read out is the raw data from the block RAM with four digits representing
one register value. For example:
A value of 00140000 is broken down into 0014 as the second register field and 0000
as the first register field where:
Notes:
1. Unit refers to value stored in the XSDB block RAM. Three locations are used in the block RAM for storage of a single warning,
the first contains the code, Unit 1 is the next address, and Unit 2 is the following address.
The XIPHY uses an internal clock to sample the DQS during a read burst and provides a
single binary value back called GT_STATUS. This sample is used as part of a training
algorithm to determine where the first rising edge of the DQS is in relation to the sampling
clock.
Calibration logic issues individual read commands to the DRAM and asserts the
clb2phy_rd_en signal to the XIPHY to open the gate which allows the sample of the DQS
to occur. The clb2phy_rd_en signal has control over the timing of the gate opening on a
DRAM-clock-cycle resolution (DQS_GATE_READ_LATENCY_RANK#_BYTE#). This signal is
controlled on a per-byte basis in the PHY and is set in the ddr_mc_pi block for use by both
calibration and the controller.
Calibration is responsible for determining the value used on a per-byte basis for use by the
controller. The XIPHY provides for additional granularity in the time to open the gate
through coarse and fine taps. Coarse taps offer 90° DRAM clock-cycle granularity (16
available) and each fine tap provides a 2.5 to 15 ps granularity for each tap (512 available).
BISC provides the number of taps for 1/4 of a memory clock cycle by taking
(BISC_PQTR_NIBBLE#-BISC_ALIGN_PQTR_NIBBLE#) or
(BISC_NQTR_NIBBLE#-BISC_ALIGN_NQTR_NIBBLE#). These are used to estimate the per-tap
resolution for a given nibble.
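As a purely illustrative example with hypothetical values: for a 1,250 ps memory clock period, if BISC_PQTR_NIBBLE# = 100 and BISC_ALIGN_PQTR_NIBBLE# = 25, the quarter cycle spans 75 taps and the estimated per-tap resolution is (1,250/4)/75, or roughly 4.2 ps, which is within the 2.5 to 15 ps per-tap range noted earlier.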
The search for the DQS begins with an estimate of when the DQS is expected back. The total
latency for the read is a function of the delay through the PHY, PCB delay, and the
configured latency of the DRAM (CAS latency, Additive latency, etc.). The search starts three
DRAM clock cycles before the expected return of the DQS. The algorithm must start
sampling before the first rising edge of the DQS, preferably in the preamble region. DDR3
and DDR4 have different preambles for the DQS as shown in Figure 38-10.
Figure 38-10: DQS preamble for DDR4 (pulled High) and DDR3 (3-stated).
Given that DDR3 starts in the 3-state region before the burst, any accepted sample taken
can either be a 0 or 1. To avoid this result, 20 samples (in hardware) are taken for each
individual sample such that the probability of the 3-state region or noise in the sampling
clock/strobe being mistaken for the actual DQS is low. This probability is given by the
binomial probability shown in the binomial probability equation.
X = expected outcome
x = number of occurrences of the outcome
n = number of tries
p = probability of the outcome on a single try

P(X = x) = (n! / (x!(n - x)!)) × p^x × (1 - p)^(n - x)
When sampling in the 3-state region, the result can be 0 or 1, so the probability of 20 samples all arriving at the same value is roughly 9.5 × 10^-7 (that is, 0.5^20). Figure 38-11 shows an example
of samples of a DQS burst with the expected sampling pattern to be found as the coarse
taps are adjusted. The pattern is the expected level seen on the DQS over time as the
sampling clock is adjusted in relation to the DQS.
Figure 38-11: Expected DQS sampling pattern 0 0 X 1 X 0 X 1 X 0 (where X can be a 0 or 1) for DDR3 and DDR4, observed as the coarse taps (0 through 9, one memory clock cycle at coarse-tap resolution) are adjusted.
The coarse taps in the XIPHY are incremented and the value is recorded at each individual coarse tap location, looking for the full pattern "00X1X0X1X0." For the algorithm to incorrectly interpret the 3-state region as the actual DQS pattern, you would have to take 20 samples of all 0s at a given coarse tap, another 20 samples of all 0s at another, and then 20 samples of all 1s for the initial pattern ("00X1"). The probability of this occurring is 8.67 × 10^-19, that is, (0.5^20)^3. This also only covers the initial scan and does not include the full pattern, which scans over 10 coarse taps.
While the probability is fairly low, there is a chance of coupling or noise being mistaken as
a DQS pattern. In this case, each sample is no longer random but a signal that can be fairly
repeatable. To guard against mistaking the 3-state region in DDR3 systems with the actual
DQS pulse, an extra step is taken to read data from the MPR register to validate the gate
alignment. The read path is set up by BISC for capture of data, placing the capture clock
roughly in the middle of the expected bit time back from the DRAM.
Because the algorithm is looking for a set pattern and does not know the exact alignment
of the DQS with the clock used for sampling the data, there are four possible patterns, as
shown in Figure 38-12.
Figure 38-12: Possible patterns depending on the DQS alignment: 00X1X0X1X0, X00X1X0X1X0, XX00X1X0X1X0, and XXX00X1X0X1X0, observed over coarse taps 0 through 12 (one memory clock cycle).
For DDR4, if the algorithm samples 1XX or 01X, this means it started sampling too late in relation to the DQS burst. The algorithm decreases the clb2phy_rd_en general interconnect control and tries again. If clb2phy_rd_en is already at the low limit, it issues an error.
If all allowable values of clb2phy_rd_en for a given latency are checked and the expected
pattern is still not found, the search begins again from the start but this time the sampling
is offset by an estimated 45° using fine taps (half a coarse tap). This allows the sampling to
occur at a different phase than the initial relationship. Each time through if the pattern is
not found, the offset is reduced by half until all offset values have been exhausted.
Figure 38-13 shows an extreme case of DCD on the DQS that would result in the pattern not being found until an offset is applied using fine taps.
Figure 38-13: Extreme DCD example. With coarse taps alone (taps 0 through 9) the training mode fails to find the pattern (all samples return 0); with a fine offset applied, the pattern 0 0 0 1 0 0 0 1 0 0 is found.
After the pattern has been found, the final coarse tap (DQS_GATE_COARSE_RANK#_BYTE#)
is set based on the alignment of the pattern previously checked (shown in Figure 38-12).
The coarse tap is set to be the last 0 seen before the 1 (3 is used to indicate an unstable
region, where multiple samples return 0 and 1) was found in the pattern shown in
Figure 38-14. During this step, the final value of the coarse tap is set between 3 to 6. If the
coarse value of 7 to 9 is chosen, the coarse taps are decremented by 4 and the general
interconnect read latency is incremented by 1, so the value falls in the 3 to 5 range instead.
Figure 38-14: DDR3 DQS pattern examples used when setting the final coarse tap, for example 00X1X011X011X011X0 or 00X1X031X031X031X0 (where 3 indicates an unstable region).
For example, when the sample clock and DQS are lined up, taking multiple samples might give a different result each time a new sample is taken. The fine search begins in an area where all samples returned a 0, so it is relatively stable, as shown in Figure 38-15. The fine taps are incremented until a non-zero value is returned (which indicates the left edge of the unstable region) and that value is recorded, as shown in Figure 38-17 (DQS_GATE_FINE_LEFT_RANK#_BYTE#).
Figure 38-15: DQS and sample clock alignment at the start of the fine search, where all samples return 0.
The fine taps are then incremented until all samples taken return a 1, as shown in
Figure 38-16. This is recorded as the right edge of the uncertain region as shown in
Figure 38-17 (DQS_GATE_FINE_RIGHT_RANK#_BYTE#).
Figure 38-16: DQS and sample clock alignment when all samples return 1 (right edge of the uncertain region).
Figure 38-17: Left and right edges of the uncertain region relative to the DQS.
The final fine tap is computed as the midpoint of the uncertain region, (right – left)/2 + left
(DQS_GATE_FINE_CENTER_RANK#_BYTE#). This ensures optimal placement of the gate in
relation to the DQS. For simulation, a faster search is implemented for the fine tap adjustment to reduce run time. This is performed by using a binary search to jump the fine taps by larger values to quickly find the 0 to 1 transition.
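For example, with hypothetical edge values of left = 20 and right = 60 fine taps, the final fine tap would be (60 - 20)/2 + 20 = 40.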
For multi-rank systems, separate control exists in the XIPHY for each rank and every rank
can be trained separately for coarse and fine taps. After calibration is complete,
adjustments are made so that for each byte, the clb2phy_rd_en
(DQS_GATE_READ_LATENCY_RANK#_BYTE#) value for a given byte matches across all ranks.
The coarse taps are incremented/decremented accordingly to adjust the timing of the gate
signal to match the timing found in calibration. If a common clb2phy_rd_en setting
cannot be found for a given byte across all ranks, an error is asserted.
Debug
To determine the status of DQS Gate Calibration, click the DQS_GATE stage under the
Status window and view the results within the Memory IP Properties window. The
message displayed in the Memory IP Properties identifies how the stage failed, or notes if
it passed successfully.
The status of DQS Gate can also be determined by decoding the DDR_CAL_ERROR_0 and
DDR_CAL_ERROR_1 results according to Table 38-9. Execute the Tcl commands noted in the
XSDB Debug section to generate the XSDB output containing the signal results.
Table 38-9: DDR_CAL_ERROR Decode for DQS Preamble Detection Calibration (Cont'd)

DQS Gate Code | DDR_CAL_ERROR_1 | DDR_CAL_ERROR_0 | Description | Recommended Debug Steps
0x6 | Byte | Logical Nibble | Could not find the 0->1 transition with fine taps in at least ½ tCK (estimated) of fine taps. | Check the BISC values in XSDB (for the nibbles associated with the DQS) to determine the 90° offset value in taps. Check if any warnings are generated; look if any are 0x13 or 0x014. For DDR3, BISC must be run and a data check is used to confirm the DQS gate settings, but if the data is wrong the algorithm keeps searching and could end up in this failure. Check data connections, vrp settings, and the VREF resistor on the PCB (or that internal VREF is set properly for all bytes).
0x7 | Byte | Logical Nibble | Underflow of coarse taps when trying to limit the maximum coarse tap setting. | Check the calibrated coarse tap (DQS_GATE_COARSE_RANK*_BYTE*) setting for the failing DQS to be sure the value is in the range of 1 to 6.
0x8 | Byte | Logical Nibble | Violation of the maximum read latency limit. | Check DQS and CK trace lengths. Ensure the maximum trace length is not violated. For debug purposes, try a lower frequency where more search range is available and check if the stage is successful.
Table 38-10 shows the signals and values adjusted or used during the DQS Preamble
Detection stage of calibration. The values can be analyzed in both successful and failing
calibrations to determine the resultant values and the consistency in results across resets.
These values can be found within the Memory IP core properties in the Hardware Manager
or by executing the Tcl commands noted in the XSDB Debug section.
Table 38-10: Additional XSDB Signals of Interest during DQS Preamble Detection
Signal (Usage): Signal Description

DQS_GATE_COARSE_RANK*_BYTE* (one value per rank and DQS group): Final RL_DLY_COARSE tap value.

DQS_GATE_FINE_CENTER_RANK*_BYTE* (one value per rank and DQS group): Final RL_DLY_FINE tap value. This is adjusted during alignment of the sample clock to DQS.

DQS_GATE_FINE_LEFT_RANK*_BYTE* (one value per rank and DQS group): RL_DLY_FINE tap value when the left edge was detected.

DQS_GATE_FINE_RIGHT_RANK*_BYTE* (one value per rank and DQS group): RL_DLY_FINE tap value when the right edge was detected.

DQS_GATE_PATTERN_0/1/2_RANK*_BYTE* (one value per rank and DQS group): The DQS pattern detected during DQS preamble detection. When a DQS Preamble Detection error occurs where the pattern is not found (DDR_CAL_ERROR code 0x0, 0x2, 0x4, or 0x5), the pattern seen during CL+1 is saved here. The full pattern could be up to 13 bits. The first nine bits are stored on _0, overflow bits are stored on _1, and _2 is currently reserved. For example: 9'b0_1100_1100, 9'b1_1001_1000, 9'b1_0011_0000, 9'b0_0110_0000. The examples shown here are not comprehensive, as the expected pattern looks like 10'b0X1X0X1X00, where X can be a 0 or 1. The LSB within this signal is the pattern detected when Coarse = 0, the next bit is the pattern detected when Coarse = 1, and so on. Additionally, there can be up to three padded zeros before the start of the pattern. In some cases, extra information of interest is stored in the overflow register. The full pattern stored can be 13'b0_0110_1100_0000, so the pattern is broken up and stored in two locations: 9'b0_0110_0000 <- PATTERN_0 and 9'b0_0001_0011 <- PATTERN_1.

DQS_GATE_READ_LATENCY_RANK*_BYTE* (one value per rank and DQS group): Read latency value last used during DQS Preamble Detection. The read latency field is limited to CAS latency - 3 to CAS latency + 7. If the DQS is toggling yet was not found, check the latency of the DQS signal coming back in relation to the chip select.

BISC_ALIGN_PQTR_NIBBLE* (one per nibble): Initial 0° offset value provided by BISC at power-up.

BISC_ALIGN_NQTR_NIBBLE* (one per nibble): Initial 0° offset value provided by BISC at power-up.

BISC_PQTR_NIBBLE* (one per nibble): Initial 90° offset value provided by BISC at power-up. Compute the 90° value in taps by taking (BISC_PQTR - BISC_ALIGN_PQTR). To estimate tap resolution, take (¼ of the memory clock period)/(BISC_PQTR - BISC_ALIGN_PQTR). Useful for error code 0x6.

BISC_NQTR_NIBBLE* (one per nibble): Initial 90° offset value provided by BISC at power-up. Compute the 90° value in taps by taking (BISC_NQTR - BISC_ALIGN_NQTR). To estimate tap resolution, take (¼ of the memory clock period)/(BISC_NQTR - BISC_ALIGN_NQTR). Useful for error code 0x6.
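As an illustration of the tap-resolution estimate given for BISC_PQTR_NIBBLE* and BISC_NQTR_NIBBLE* above, the short Tcl sketch below uses hypothetical example values (a 1,250 ps memory clock period and sample BISC readings); substitute the values read from XSDB for the nibble under debug.

    # Hedged sketch: estimate the fine tap resolution from the BISC XSDB values (example numbers only).
    set tck_ps          1250.0  ;# memory clock period in ps (example)
    set bisc_pqtr       45      ;# example BISC_PQTR_NIBBLE* reading
    set bisc_align_pqtr 5       ;# example BISC_ALIGN_PQTR_NIBBLE* reading
    set taps_per_90deg  [expr {$bisc_pqtr - $bisc_align_pqtr}]
    set ps_per_tap      [expr {($tck_ps / 4.0) / $taps_per_90deg}]
    puts "90 degrees = $taps_per_90deg taps, approximately [format %.1f $ps_per_tap] ps per tap"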
This is a sample of the results for the DQS Preamble Detection XSDB debug signals:
Expected Results
Table 38-11 provides expected results for the coarse, fine, and read latency parameters
during DQS Preamble Detection. These values can be compared to the results found in
hardware testing.
Table 38-11: Expected Results for DQS Preamble Detection Coarse/Fine Tap and RL
Parameter Description
Hardware Measurements
This is the first stage of calibration. Therefore, any general setup issue can result in a failure
during DQS Preamble Detection Calibration. The first items to verify are proper clocking
and reset setup as well as usage of unmodified Memory IP RTL that is generated specifically
for the SDRAM(s) in hardware. The General Checks, page 588 section should be verified
when a failure occurs during DQS Preamble Detection.
After the General Checks, page 588 have been verified, hardware measurements on DQS,
and specifically the DQS byte that fails during DQS Preamble Detection, should be captured
and analyzed. DQS must be toggling during DQS Preamble Detection. If this stage fails,
after failure, probe the failing DQS at the FPGA using a high quality scope and probes.
When a failure occurs, the calibration goes into an error loop routine, continually issuing
read commands to the DRAM to allow for probing of the PCB. While probing DQS, validate:
1. Continuous DQS pulses exist with gaps between each BL8 read.
2. The signal integrity of DQS:
° Ensure VIL and VIH are met for the specific I/O Standard in use. For more
information, see the Kintex UltraScale FPGAs Data Sheet: DC and AC Switching
Characteristics (DS892) [Ref 2].
° Ensure that the signals have low jitter/noise that can result from any power supply
or board noise.
If DQS pulses are not present and the General Checks, page 588 have been verified, probe
the read commands at the SDRAM and verify:
° Ensure VIL and VIH are met. For more information, see the JESD79-3F, DDR3 SDRAM
Standard and JESD79-4, DDR4 SDRAM Standard, JEDEC Solid State Technology
Association [Ref 1].
3. CK to command timing.
4. RESET# voltage level.
5. Memory initialization routine.
During write leveling, DQS is driven by the FPGA memory interface and DQ is driven by the
DDR3/DDR4 SDRAM device to provide feedback. To start write leveling, an MRS command
is sent to the DRAM to enable the feedback feature, while another MRS command is sent to
disable write leveling at the end. Figure 38-19 shows the block diagram for the write
leveling implementation.
X-Ref Target - Figure 38-19
Figure 38-19: Write leveling implementation block diagram - the FPGA drives CK and a DQS pulse train through 8-to-1 serializers, ODELAY, and coarse delay (WRLVL_MODE/WL_TRAIN select), and samples the DQ feedback returned by the DDR3/DDR4 SDRAM. (X24463-081021)
The XIPHY is set up for write leveling by setting various attributes in the RIU. WL_TRAIN is
set to decouple the DQS and DQ when driving out the DQS. This allows the XIPHY to
capture the returning DQ from the DRAM. Because the DQ is returned without the returning
DQS strobe for capture, the RX_GATE is set to 0 in the XIPHY to disable DQS gate operation.
While the write leveling algorithm acts on a single DQS at a time, all the XIPHY bytes are set
up for write leveling to ensure there is no contention on the bus for the DQ.
DQS is delayed with ODELAY and coarse delay (WL_DLY_CRSE[12:9] applies to all bits in a
nibble) provided in the RIU WL_DLY_RNKx register. The WL_DLY_FINE[8:0] location in the
RIU is used to store the ODELAY value for write leveling for a given nibble (used by the
XIPHY when switching ranks).
A DQS train of pulses is output by the FPGA to the DRAM to detect the relationship of CK
and DQS at the DDR3/DDR4 memory device. DQS is delayed using the ODELAY and coarse
taps in unit tap increments until a 0 to 1 transition is detected on the feedback DQ input. A single burst-length-of-eight pattern is first put out on the DQS (four clock pulses), followed by a gap, and then 100 burst-length-of-eight patterns are sent to the DRAM (Figure 38-20).
The first part is to ensure the DRAM updates the feedback sample on the DQ being sent
back, while the second provides a clock that is used by the XIPHY to clock into the XIPHY
the level seen on the DQ. Sampling the DQ while driving the DQS helps to avoid ringing on
the DQS at the end of a burst that can be mistaken as a clock edge by the DRAM.
X-Ref Target - Figure 38-20
Figure 38-20: DQS pulse train driven to the DRAM relative to CK/CK# during write leveling.
To avoid false edge detection around the CK negative edge due to jitter, the DQS is delayed across the entire window to find the large stable 0 and stable 1 regions (a stable 0 or 1 means that all samples taken return the same value). The algorithm checks that it stays to the left of the stable 1 region, because the right side of that region is the CK negative edge being captured with the DQS, as shown in Figure 38-21.
Figure 38-21: Stable 0 and stable 1 regions of the write leveling feedback relative to CK/CK# and DQS/DQS#. (X24465-082420)
1. Find the transition from 0 to 1 using coarse taps and ODELAY taps (if needed).
During the first step, look for a static 0 to be returned from all samples taken. This
means 64 samples were taken and it is certain the data is a 0. Record the coarse tap
setting and keep incrementing the coarse tap.
° If the algorithm never sees a transition from a stable 0 to the noise or stable 1 using
the coarse taps, the ODELAY of the DQS is set to an offset value (first set at 45°,
WRLVL_ODELAY_INITIAL_OFFSET_BYTE) and the coarse taps are checked again from
0. Check for the stable 0 to stable 1 transition (the algorithm might need to perform
this if the noise region is close to 90° or there is a large amount of DCD).
° If the transition is still not found, the offset is halved and the algorithm tries again.
The final offset value used is stored at WRLVL_ODELAY_LAST_OFFSET_RANK_BYTE.
Because the algorithm is aligning the DQS with the nearest clock edge the coarse
tap sweep is limited to five, which is 1.25 clock cycles. The final coarse setting is
stored at WRLVL_COARSE_STABLE0_RANK_BYTE.
2. Find the center of the noise region around that transition from 0 to 1 using ODELAY
taps.
The second step is to sweep with ODELAY taps and find both edges of the noise region
(WRLVL_ODELAY_STABLE0_RANK_BYTE, WRLVL_ODELAY_STABLE1_RANK_BYTE while
WRLVL_ODELAY_CENTER_RANK_BYTE holds the final value). The number of ODELAY taps
used is determined by the initial alignment of the DQS and CK and the size of this noise
region as shown in Figure 38-22.
Figure 38-22: Required ODELAY range (maximum and minimum cases) as a function of the coarse tap alignment between CK/CK# and DQS/DQS#. (X24466-082420)
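To make step 2 concrete, the sketch below shows the presumed centering arithmetic using the XSDB names mentioned above, with WRLVL_ODELAY_STABLE0/STABLE1 as the two edges of the noise region and WRLVL_ODELAY_CENTER as the result; the tap values are hypothetical examples.

    # Hedged sketch: center the write DQS ODELAY within the noise region found in step 2.
    set wrlvl_odelay_stable0 18   ;# example: edge of the stable 0 region (ODELAY taps)
    set wrlvl_odelay_stable1 34   ;# example: edge of the stable 1 region (ODELAY taps)
    set wrlvl_odelay_center  [expr {($wrlvl_odelay_stable0 + $wrlvl_odelay_stable1) / 2}]
    puts "WRLVL_ODELAY_CENTER = $wrlvl_odelay_center"   ;# 26 for the example values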
After the final ODELAY setting is found, the value of ODELAY is loaded in the RIU in the
WL_DLY_RNKx[8:0] register. This value is also loaded in the ODELAY register for the DQ and
the DM to match the DQS. If any deskew has been performed on the DQS/DQ/DM when
reaching this point (multi-rank systems), the deskew information is preserved and the offset
is applied.
After write leveling, an MRS command is sent to the DRAM to disable the write leveling
feature, the WL_TRAIN is set back to the default OFF setting, and the DQS gate is turned
back on to allow for capture of the DQ with the returning strobe DQS.
Debug
To determine the status of Write Leveling Calibration, click the Write Leveling stage under
the Status window and view the results within the Memory IP Properties window. The
message displayed in Memory IP Properties identifies how the stage failed or notes if it
passed successfully.
The status of Write Leveling can also be determined by decoding the DDR_CAL_ERROR_0
and DDR_CAL_ERROR_1 results according to Table 38-12. Execute the Tcl commands noted
in the XSDB Debug section to generate the XSDB output containing the signal results.
Table 38-13 describes the signals and values adjusted or used during the Write Leveling
stage of calibration. The values can be analyzed in both successful and failing calibrations
to determine the resultant values and the consistency in results across resets. These values
can be found within the Memory IP Core Properties within Hardware Manager or by
executing the Tcl commands noted in the XSDB Debug.
This is a sample of results for the Write Leveling XSDB debug signals:
Expected Results
The tap variance across DQS byte groups varies greatly due to the difference in trace lengths
with fly-by-routing. When an error occurs, an error loop is started that generates DQS
strobes to the DRAM while still in WRLVL mode. This error loop runs continuously until a
reset or power cycle to aid in debug. Table 38-14 provides expected results for the coarse
and fine parameters during Write Leveling.
Hardware Measurements
The following measurements can be made during the error loop or when triggering on the
status bit that indicates the start of WRLVL (dbg_cal_seq[1] = 1’b1).
• Verify DQS and CK are toggling on the board. The FPGA sends DQS and CK during
Write Leveling. If they are not toggling, something is wrong with the setup and the
General Checks, page 588 section should be thoroughly reviewed.
• Verify fly-by-routing is implemented correctly on the board.
• Verify CK to DQS trace matching. The required matching is documented with the
UltraScale Architecture PCB Design and Pin Planning User Guide (UG583) [Ref 11].
Failure to adhere to this spec can result in Write Leveling failures.
• Trigger on the start of Write Leveling by bringing dbg_cal_seq[1] to an I/O and
using the rising edge (1’b1) as the scope trigger.
Monitor the following:
° MRS command at the memory to enable Write Leveling Mode. The Mode registers
must be properly set up to enable Write Leveling. Specifically, address bit A7 must
be correct. If the part chosen in the Memory IP is not accurate or there is an issue
with the connection of the address bits on the board, this could be an issue. If the
Mode registers are not set up to enable Write Leveling, the 0-to-1 transition is not
seen.
Note: For dual-rank design when address mirroring is used, address bit A7 is not the same
between the two ranks.
° Verify the ODT pin is connected and being asserted properly during the DQS
toggling.
° Check the signal levels of all the DQ bits being returned. Any stuck-at-bits (Low/
High) or floating bits that are not being driven to a given rail can cause issues.
° For DDR3, check the VREF voltage; for DDR4, check that the VREF settings are correct in the design.
• Using the Vivado Hardware Manager and while running the Memory IP Example Design
with Debug Signals enabled, set the trigger to dbg_cal_seq = 0R0 (R signifies rising
edge). The following simulation example shows how the debug signals should behave
during successful Write Leveling.
X-Ref Target - Figure 38-24
To perform per-bit deskew, a non-repeating pattern is useful to deal with or diagnose cases
of extreme skew between different bits in a byte. Because this is limited by the DDR3 MPR
pattern, a long pattern is first written to the DRAM and then read back to perform per-bit
deskew (only done on the first rank of a multi-rank system). When per-bit deskew is
complete, the simple repeating pattern available through both DDR3 and DDR4 MPR is
used to center the DQS in the DQ read eye.
The XIPHY provides separate delay elements (2.5 to 15 ps per tap, 512 total) for the DQS to
clock the rising and falling edge DQ data (PQTR for rising edge, NQTR for falling edge) on
a per-nibble basis (four DQ bits per PQTR/NQTR). This allows the algorithm to center the
rising and falling edge DQS strobe independently to ensure more margin when dealing with
DCD. The data captured in the PQTR clock domain is transferred to the NQTR clock domain
before being sent to the read FIFO and to the general interconnect clock domain.
Due to this transfer of clock domains, the PQTR and NQTR clocks must be roughly 180° out
of phase. This relationship between the PQTR/NQTR clock paths is set up as part of the BISC
start-up routine, and thus calibration needs to maintain this relationship as part of the
training (BISC_ALIGN_PQTR, BISC_ALIGN_NQTR, BISC_PQTR, BISC_NQTR).
Figure 38-25: Write of data pattern 0x00 to address 0x000 for per-bit deskew (DQS/DQS# and DQ shown across the extended burst). (X24467-082420)
Next, write 0xFF to a different address to allow for back-to-back reads (Figure 38-26). For
DDR3 address 0x008 is used, while for DDR4 address 0x000 and bank group 0x1 is used.
At higher frequencies, DDR4 requires a change in the bank group to allow for back-to-back
bursts of eight.
X-Ref Target - Figure 38-26
Figure 38-26: Write of data pattern 0xFF to the second address used for back-to-back reads (DQS/DQS# and DQ shown). (X24468-082420)
After the data is written, back-to-back reads are issued to the DRAM to perform per-bit
deskew (Figure 38-27).
X-Ref Target - Figure 38-27
Figure 38-27: Back-to-back reads of the 0x00/0xFF pattern used for per-bit deskew (DQS/DQS# and DQ shown). (X24469-082420)
Using this pattern, each bit in a byte is left-edge aligned with the DQS strobe (PQTR/NQTR). More than a bit time of skew can also be seen and corrected.
RECOMMENDED: In general, a bit time of skew between bits is not ideal. Ensure the DDR3/DDR4 trace
matching guidelines within DQS byte are met. See PCB Guidelines for DDR3, page 87 and PCB
Guidelines for DDR4, page 87.
At the start of deskew, the PQTR/NQTR are decreased down together until one of them hits
0 (to preserve the initial relationship setup by BISC). Next, the data for a given bit is checked
for the matching pattern. Only the rising edge data is checked for correctness. The falling
edge comparison is thrown away to allow for extra delay on the PQTR/NQTR relative to the
DQ.
In the ideal case, the PQTR/NQTR are edge aligned with the DQ when the delays are set to 0. However, due to extra delay in the PQTR/NQTR path, the NQTR might be pushed into the next burst transaction at higher frequencies, so it is excluded from the comparison (Figure 38-28 and Figure 38-29). More of the rising edge data of a given burst would need to be discarded to deal with more than a bit time of skew. If the last part of the burst were not excluded, the failure would cause the PQTR/NQTR to be pushed instead of the DQ IDELAY.
X-Ref Target - Figure 38-28
Figure 38-28 and Figure 38-29: PQTR/NQTR alignment against a burst of 8, showing the end of the burst that is excluded from the comparison. (X24470-082420, X24471-082420)
If the pattern is found, the given IDELAY on that bit is incremented by 1, then checked again.
If the pattern is not seen, the PQTR/NQTR are incremented by 1 and the data checked again.
The algorithm checks for the passing and failing region for a given bit, adjusting either the
PQTR/NQTR delays or the IDELAY for that bit.
To guard against noise in the uncertain region, the passing region is defined by a minimum
window size (10), hence the passing region is not declared as found unless the PQTR/NQTR
are incremented and a contiguous region of passing data is found for a given bit. All of the
bits are cycled through to push the PQTR/NQTR out to align with the latest bit in a given
nibble. Figure 38-30 through Figure 38-33 show an example of the PQTR/NQTR and various
bits being aligned during the deskew stage.
X-Ref Target - Figure 38-30
Figure 38-30: Per-Bit Deskew - initial alignment of PQTR/NQTR with early and late DQ bits. (X24472-082420)
The algorithm takes the result of one bit at a time and makes decisions based on the results of that bit only. The common PQTR/NQTR delays are increased as needed to align with each bit but are never decremented, which ensures they are pushed out to the latest bit.
X-Ref Target - Figure 38-31
Figure 38-31: Per-Bit Deskew - early DQ bits delayed (1), (2), (3) to align with the DQS (PQTR/NQTR). (X24473-082420)
Figure 38-32: Per-Bit Deskew – PQTR/NQTR Delayed to Align with Late Bit (X24474-082420)
When completed, the PQTR/NQTR are pushed out to align with the latest DQ bit
(RDLVL_DESKEW_PQTR_nibble, RDLVL_DESKEW_NQTR_nibble), but DQ bits calibrated first
might have been early as shown in the example. Accordingly, all bits are checked once again
and aligned as needed (Figure 38-33).
X-Ref Target - Figure 38-33
Figure 38-33: Per-Bit Deskew - early DQ bits re-checked and delayed again (6), (7) to align with the final PQTR/NQTR position. (X24475-082420)
Debug
To determine the status of Read Per-Bit Deskew Calibration, click the Read Per-Bit Deskew
stage under the Status window and view the results within the Memory IP Properties
window. The message displayed in Memory IP Properties identifies how the stage failed or
notes if it passed successfully.
Figure 38-34: Memory IP XSDB Debug GUI Example – Read Per-Bit Deskew
The status of Read Per-Bit Deskew can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to Table 38-15. Execute the
Tcl commands noted in the XSDB Debug section to generate the XSDB output containing
the signal results.
Table 38-16 describes the signals and values adjusted or used during the Read Per-Bit
Deskew stage of calibration. The values can be analyzed in both successful and failing
calibrations to determine the resultant values and the consistency in results across resets.
These values can be found within the Memory IP Core Properties within Hardware
Manager or by executing the Tcl commands noted in the XSDB Debug section.
Figure 38-35: Read Per-Bit Deskew BUS_DATA_BURST captures - BUS_DATA_BURST_0/1 are taken with PQTR/NQTR at 0 taps and BUS_DATA_BURST_2/3 with PQTR/NQTR at a 90° offset, with DQ reading 0x00 then 0xFF. (X14783-070915)
Data swizzling (bit reordering) is completed within the UltraScale PHY. Therefore, the data
visible on BUS_DATA_BURST and a scope in hardware is ordered differently compared to
what would be seen in ChipScope™. Figure 38-36 is an example of how the data is
converted.
Note: For this stage of calibration which is using a data pattern of all 0s or all 1s, the conversion is
not visible.
This is a sample of results for the Read Per-Bit Deskew XSDB debug signals:
Expected Results
• Look at the individual IDELAY taps for each bit. The IDELAY taps should only vary by 0 to 20 taps, and are dependent on PCB trace delays. For deskew, the IDELAY taps are typically in the 50 to 70 tap range, while PQTR and NQTR are usually in the 0 to 5 tap range (a quick spread check is sketched after this list).
• Determine if any bytes completed successfully. The per-bit algorithm sequentially steps
through each DQS byte.
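A quick spread check across the per-bit deskew IDELAY values can be sketched as follows; the readings shown are hypothetical examples of RDLVL_DESKEW_IDELAY_BYTE_BIT* values for one byte.

    # Hedged sketch: flag a byte whose per-bit deskew IDELAY values spread by more than ~20 taps.
    set idelay_taps {55 58 62 60 57 66 59 61}   ;# example per-bit IDELAY readings for one byte
    set spread [expr {[tcl::mathfunc::max {*}$idelay_taps] - [tcl::mathfunc::min {*}$idelay_taps]}]
    if {$spread > 20} {
        puts "WARNING: IDELAY spread of $spread taps; check PCB trace matching within the byte"
    }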
Hardware Measurements
1. Probe the write commands and read commands at the memory:
6. Probe the read burst after the write and check if the expected data pattern is being
returned.
7. Check for floating address pins if the expected data is not returned.
8. Check for any stuck-at level issues on DQ pins whose signal level does not change. If at all possible, probe at the receiver to check termination and signal integrity.
9. Check the DBG port signals and the full read data and comparison result to check the
data in general interconnect. The calibration algorithm has RTL logic issue the
commands and check the data.
Check if the dbg_rd_valid aligns with the data pattern or is off (which can indicate an
issue with DQS gate calibration). Set up a trigger when the error gets asserted to
capture signals in the hardware debugger for analysis.
10. Re-check results from DQS gate or other previous calibration stages. Compare passing
byte lanes against failing byte lanes for previous stages of calibration. If a failure occurs
during simple pattern calibration, check the values found during deskew for example.
11. All of the data comparison for read deskew occurs in the general interconnect, so it can
be useful to pull in the debug data in the hardware debugger and take a look at what the
data looks like coming back as taps are adjusted, see Figure 38-38. The screen captures
are from simulation, with a small burst of five reads. Look at dbg_rd_data,
dbg_rd_data_cmp, and dbg_rd_valid.
12. Using the Vivado Hardware Manager and while running the Memory IP Example Design
with Debug Signals enabled, set the Read Deskew trigger to cal_r*_status[6] = R
(rising edge). To view each byte, add an additional trigger on dbg_cmp_byte and set to
the byte of interest. The following simulation example shows how the debug signals
should behave during successful Read Deskew.
X-Ref Target - Figure 38-38
Figure 38-38: RTL Debug Signals during Read Deskew (No Error)
13. After failure during this stage of calibration, the design goes into a continuous loop of
read commands to allow board probing.
The regular deskew algorithm performs a per-bit deskew on every DQ bit in a nibble against
the PQTR/NQTR, pushing early DQ bits to line up with late bits. Because the DBI pin is an
input to one of the nibbles, it could have an effect on the PQTR/NQTR settings or even the
other DQ pins if the DQ pins need to be pushed to align with the DBI pin. A mechanism similar to the DQ per-bit deskew is run, but the DBI pin is deskewed in relation to the PQTR/NQTR instead.
1. Turn on DBI on the read path (MRS setting in the DRAM and a fabric switch that inverts
the read data when the value read from the DBI pin is asserted).
2. If the nibble does not contain the DBI pin, skip the nibble and go to the next nibble.
3. Start from the previous PQTR/NQTR settings found during DQ deskew (edge alignment
for bits in the nibble).
4. Issue back-to-back reads to address 0x000/Bank Group 0 and 0x000/Bank Group 1.
This is repeated until per-bit DBI deskew is complete as shown in Figure 38-39.
X-Ref Target - Figure 38-39
Figure 38-39: Back-to-back reads for per-bit DBI deskew - data in the DRAM array is 0x00/0xFF/0x00 (DQS/DQS# and DBI_n shown). (X15984-021616)
5. Delay the DBI pin with IDELAY to edge align with the PQTR/NQTR clock. If the PQTR/
NQTR delay needs to be adjusted, the other DQ bits in the nibble are adjusted
accordingly. This occurs if the DBI pin arrives later than all other bits in the nibble.
6. Loop through all nibbles in the interface for the rank.
7. Turn off DBI on the read path (MRS setting in the DRAM and fabric switch).
Debug
To determine the status of Read Per-Bit DBI Deskew Calibration, click the Read Per-Bit DBI
Deskew Calibration stage under the Status window and view the results within the
Memory IP Properties window. The message displayed in Memory IP Properties
identifies how the stage failed or notes if it passed successfully.
The status of Read Per-Bit DBI Deskew can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to Table 38-17.
Execute the Tcl commands noted in the XSDB Debug section to generate the XSDB output
containing the signal results.
Table 38-18 shows the signals and values adjusted or used during the Read Per-Bit Deskew
stage of calibration. The values can be analyzed in both successful and failing calibrations
to determine the resultant values and the consistency in results across resets. These values
can be found within the Memory IP Core Properties within Hardware Manager or by
executing the Tcl commands noted in the XSDB Debug section.
Table 38-18: Signals of Interest for Read Per-Bit DBI Deskew Calibration
Signal (Usage): Signal Description

RDLVL_DESKEW_DBI_PQTR (one per nibble): Read leveling PQTR when the left edge of the read data valid window is detected during per-bit read DBI deskew.

RDLVL_DESKEW_DBI_NQTR (one per nibble): Read leveling NQTR when the left edge of the read data valid window is detected during per-bit read DBI deskew.

RDLVL_DESKEW_DBI_IDELAY_BYTE (one per byte): Read leveling IDELAY delay value found during per-bit read DBI deskew.

RDLVL_DESKEW_PQTR_NIBBLE (one per nibble): Read leveling PQTR when the left edge of the read data valid window is detected during per-bit read DQ deskew.

RDLVL_DESKEW_NQTR_NIBBLE (one per nibble): Read leveling NQTR when the left edge of the read data valid window is detected during per-bit read DQ deskew.

RDLVL_DESKEW_IDELAY_BYTE_BIT* (one per bit): Read leveling IDELAY delay value found during per-bit read DQ deskew.

BUS_DATA_BURST: When a failure occurs during deskew, some data is saved to indicate what the data looks like across some tap settings for the byte the failure occurred on (DQ IDELAY is left wherever the algorithm left it). Deskew (Figure 38-35): BUS_DATA_BURST_0 holds the first part of the two-burst data (should be all 0) when PQTR/NQTR are set to 0 taps. BUS_DATA_BURST_1 holds the second part of the two-burst data (should be all 1) when PQTR/NQTR are set to 0 taps. BUS_DATA_BURST_2 holds the first part of the two-burst data (should be all 0) when PQTR/NQTR are set to 90°. BUS_DATA_BURST_3 holds the second part of the two-burst data (should be all 1) when PQTR/NQTR are set to 90°.
Data swizzling (bit reordering) is completed within the UltraScale PHY. Therefore, the data
visible on BUS_DATA_BURST and a scope in hardware is ordered differently compared to
what would be seen in ChipScope™. Figure 38-40 is an example of how the data is
converted.
Note: For this stage of calibration which is using a data pattern of all 0s or all 1s, the conversion is
not visible.
This is a sample of results for the Read Per-Bit DBI Deskew XSDB debug signals:
Expected Results
• Look at the individual IDELAY taps for each bit. The IDELAY taps should only vary by 0 to 20 taps, and are dependent on PCB trace delays. For deskew, the IDELAY taps are typically in the 50 to 70 tap range, while PQTR and NQTR are usually in the 0 to 5 tap range.
• Determine if any bytes completed successfully. The per-bit algorithm sequentially steps
through each DQS byte.
Hardware Measurements
1. Probe the write commands and read commands at the memory:
6. Check for floating address pins if the expected data is not returned.
7. Check for any stuck-at level issues on DQ/DBI pins whose signal level does not change.
If at all possible, probe at the receiver to check termination and signal integrity.
8. Check the DBG port signals and the full read data and comparison result to check the
data in general interconnect. The calibration algorithm has RTL logic issue the
commands and check the data.
Check if the dbg_rd_valid aligns with the data pattern or is off (which can indicate an
issue with DQS gate calibration). Set up a trigger when the error gets asserted to
capture signals in the hardware debugger for analysis.
9. Re-check results from DQS gate or other previous calibration stages. Compare passing
byte lanes against failing byte lanes for previous stages of calibration. If a failure occurs
during simple pattern calibration, check the values found during deskew for example.
10. All of the data comparison for read deskew occurs in the general interconnect, so it can
be useful to pull in the debug data in the hardware debugger and take a look at what the
data looks like coming back as taps are adjusted, see Figure 38-42. The screen captures
are from simulation, with a small burst of five reads. Look at dbg_rd_data,
dbg_rd_data_cmp, and dbg_rd_valid.
11. Using the Vivado Hardware Manager and while running the Memory IP Example Design
with Debug Signals enabled, set the Read DBI Deskew trigger to cal_r*_status[8]
= R (rising edge). To view each byte, add an additional trigger on dbg_cmp_byte and
set to the byte of interest. The following simulation example shows how the debug
signals should behave during successful Read DBI Deskew.
X-Ref Target - Figure 38-42
Figure 38-42: RTL Debug Signals during Read DBI Deskew (No Error)
12. After failure during this stage of calibration, the design goes into a continuous loop of
read commands to allow board probing.
Figure 38-43: Repeating 0-1 clock pattern read on DQ relative to DQS/DQS# for read DQS centering. (X24476-082420)
To properly account for jitter on the data and clock returned from the DRAM, multiple data
samples are taken at a given tap value. 64 read bursts are used in hardware while five are
used in simulation. More samples help in finding the best alignment within the data valid window.
Given that the PHY has two capture strobes PQTR/NQTR that need to be centered
independently yet moved together, calibration needs to take special care to ensure the
clocks stay in a certain phase relationship with one another.
The data and PQTR/NQTR delays start with the value found during deskew. Data is first
delayed with IDELAY such that both the PQTR and NQTR clocks start out just to the left of
the data valid window for all bits in a given nibble so the entire read window can be scanned
with each clock (Figure 38-44, RDLVL_IDELAY_VALUE_Rank_Byte_Bit). Scanning the window
with the same delay element and computing the center with that delay element helps to
minimize uncertainty in tap resolution that might arise from using different delay lines to
find the edges of the read window.
Figure 38-44: Data delayed with IDELAY so that PQTR/NQTR start just to the left of the data valid window (original and delayed DQ shown). (X24477-082420)
At the start of training, the PQTR/NQTR and data are roughly edge aligned, but because the
pattern is different from the deskew step the edge might have changed a bit. Also, during
deskew the aggregate edge for both PQTR/NQTR is found while you want to find a separate
edge for each clock.
After making sure both PQTR/NQTR start outside the data valid region, the clocks are
incremented to look for the passing region (Figure 38-45). Rising edge data is checked for PQTR while falling edge data is checked for NQTR, with a separate check kept for each clock to indicate where its passing and failing regions are.
X-Ref Target - Figure 38-45
Figure 38-45: PQTR and NQTR Delayed to Find Passing Region (Left Edge)
When searching for the edge, a minimum window size of 10 is used to guarantee the noise
region has been cleared and the true edge is found. The PQTR/NQTR delays are increased
past the initial passing point until the minimum window size is found before the left edge
is declared as found. If the minimum window is not located across the entire tap range for
either clock, an error is asserted.
Again, the PQTR/NQTR delays are incremented together and checked for error
independently to keep track of the right edge of the window. Because the data from the
PQTR domain is transferred into the NQTR clock domain in the XIPHY, the edge for NQTR is
checked first, keeping track of the results for PQTR along the way (Figure 38-46).
When the NQTR edge is located, a flag is checked to see if the PQTR edge is found as well.
If the PQTR edge was not found, the PQTR delay continues to search for the edge, while the
NQTR delay stays at its right edge (RDLVL_PQTR_RIGHT_Rank_Nibble,
RDLVL_NQTR_RIGHT_Rank_Nibble). For simulation, the right edge detection is sped up by
having the delays adjusted by larger than one tap at a time.
X-Ref Target - Figure 38-46
Figure 38-46: PQTR and NQTR Delayed to Find Failing Region (Right Edge)
After both rising and falling edge windows are found, the final center point is calculated based on the left and right edges for each clock. The final delay for each clock (RDLVL_PQTR_CENTER_Rank_Nibble, RDLVL_NQTR_CENTER_Rank_Nibble) is computed as the midpoint between its left and right edges.
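A sketch of the presumed midpoint computation is shown below, using the XSDB left/right edge names from the text; the tap values are hypothetical examples and integer truncation is assumed.

    # Hedged sketch: center PQTR and NQTR between their measured left and right edges.
    set pqtr_left 20;  set pqtr_right 60   ;# example RDLVL_PQTR_LEFT/RIGHT_Rank_Nibble
    set nqtr_left 24;  set nqtr_right 62   ;# example RDLVL_NQTR_LEFT/RIGHT_Rank_Nibble
    set pqtr_center [expr {($pqtr_left + $pqtr_right) / 2}]   ;# -> RDLVL_PQTR_CENTER_Rank_Nibble
    set nqtr_center [expr {($nqtr_left + $nqtr_right) / 2}]   ;# -> RDLVL_NQTR_CENTER_Rank_Nibble
    puts "PQTR center = $pqtr_center, NQTR center = $nqtr_center"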
For multi-rank systems deskew only runs on the first rank, while read DQS centering using
the PQTR/NQTR runs on all ranks. After calibration is complete for all ranks, for a given DQ
bit the IDELAY is set to the center of the range of values seen for all ranks
(RDLVL_IDELAY_FINAL_BYTE_BIT). The PQTR/NQTR final value is also computed based on
the range of values seen between all of the ranks (RDLVL_PQTR_CENTER_FINAL_NIBBLE,
RDLVL_NQTR_CENTER_FINAL_NIBBLE).
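The multi-rank combination described above (the final per-bit IDELAY set to the center of the range seen across ranks) can be sketched as follows; the per-rank values are hypothetical examples, and the same approach is presumed for the PQTR/NQTR final values.

    # Hedged sketch: pick the final IDELAY for one DQ bit as the center of the per-rank range.
    set per_rank_idelay {52 57 61 55}   ;# example RDLVL_IDELAY_VALUE_RANK*_BYTE*_BIT* readings
    set lo [tcl::mathfunc::min {*}$per_rank_idelay]
    set hi [tcl::mathfunc::max {*}$per_rank_idelay]
    set idelay_final [expr {($lo + $hi) / 2}]   ;# -> RDLVL_IDELAY_FINAL_BYTE_BIT
    puts "per-rank range = $lo..$hi, final = $idelay_final"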
IMPORTANT: For multi-rank systems, there must be overlap in the read window computation. Also,
there is a limit in the allowed skew between ranks, see the PCB Guidelines for DDR3 in Chapter 4 and
PCB Guidelines for DDR4 in Chapter 4.
Debug
To determine the status of Read MPR DQS Centering Calibration, click the Read DQS
Centering (Simple) stage under the Status window and view the results within the
Memory IP Properties window. The message displayed in Memory IP Properties
identifies how the stage failed or notes if it passed successfully.
X-Ref Target - Figure 38-47
Figure 38-47: Memory IP XSDB Debug GUI Example – Read DQS Centering (Simple)
The status of Read MPR DQS Centering can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to Table 38-19. Execute the
Tcl commands noted in the XSDB Debug section to generate the XSDB output containing
the signal results.
Table 38-20 shows the signals and values adjusted or used during the Read MPR DQS
Centering stage of calibration. The values can be analyzed in both successful and failing
calibrations to determine the resultant values and the consistency in results across resets.
These values can be found within the Memory IP Core Properties within Hardware
Manager or by executing the Tcl commands noted in the XSDB Debug section.
Figure 38-48: Read DQS Centering BUS_DATA_BURST captures - single bursts of the 0-F pattern taken with PQTR/NQTR at 0 taps (BUS_DATA_BURST_0), a 90° offset (BUS_DATA_BURST_1), a 180° offset (BUS_DATA_BURST_2), and a 270° offset (BUS_DATA_BURST_3). (X14785-070915)
Data swizzling (bit reordering) is completed within the UltraScale PHY. Therefore, the data
visible on BUS_DATA_BURST and a scope in hardware is ordered differently compared to
what would be seen in ChipScope. Figure 38-49 and Figure 38-50 are examples of how the
data is converted.
X-Ref Target - Figure 38-49
This is a sample of results for Read MPR DQS Centering using the Memory IP Debug GUI
within the Hardware Manager.
Note: Either the “Table” or “Chart” view can be used to look at the window.
Figure 38-51 and Figure 38-52 are screen captures from 2015.1 and might vary from the
current version.
Figure 38-51: Example Read Calibration Margin from Memory IP Debug GUI
This is a sample of results for the Read MPR DQS Centering XSDB debug signals:
Expected Results
• Look at the individual PQTR/NQTR tap settings for each nibble. The taps should only
vary by 0 to 20 taps. Use the BISC values to compute the estimated bit time in taps.
° For example, Byte 7 Nibble 0 in Figure 38-52 is shifted and smaller compared to the
remaining nibbles. This type of result is not expected. For this specific example, the
FPGA was not properly loaded into the socket.
• Determine if any bytes completed successfully. The read DQS Centering algorithm
sequentially steps through each DQS byte group detecting the capture edges.
• To analyze the window size in ps, see the Determining Window Size in ps, page 773. In
some cases, simple pattern calibration might show a better than ideal rise or fall
window. Because a simple pattern (clock pattern) is used, it is possible for the rising
edge clock to always find the same value (for example, 1) and the falling edge to always
find the opposite (for example, 0). This can occur due to a non-ideal starting VREF value, which causes duty cycle distortion that makes the rise or fall window larger than the other. If the
rise and fall window sizes are added together and compared against the expected clock
cycle time, the result should be more reasonable.
As a general rule of thumb, the window size for a healthy system should be ≥ 30% of the
expected UI size.
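One way to apply this rule of thumb is sketched below: convert the window from taps to picoseconds using the BISC values (see Determining Window Size in ps, page 773) and compare against 30% of the UI. All numeric values are hypothetical examples.

    # Hedged sketch: estimate a read window in ps and compare it against 30% of the UI.
    set tck_ps          1250.0                       ;# example memory clock period in ps
    set bisc_pqtr       45;  set bisc_align_pqtr 5   ;# example BISC readings for the nibble
    set pqtr_left       20;  set pqtr_right     52   ;# example RDLVL_PQTR_LEFT/RIGHT readings
    set ps_per_tap [expr {($tck_ps / 4.0) / ($bisc_pqtr - $bisc_align_pqtr)}]
    set window_ps  [expr {($pqtr_right - $pqtr_left) * $ps_per_tap}]
    set ui_ps      [expr {$tck_ps / 2.0}]
    if {$window_ps < 0.30 * $ui_ps} {
        puts "WARNING: rise window [format %.0f $window_ps] ps is below 30% of the [format %.0f $ui_ps] ps UI"
    } else {
        puts "Rise window [format %.0f $window_ps] ps is acceptable for a [format %.0f $ui_ps] ps UI"
    }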
Hardware Measurements
1. Using high quality probes and scope, probe the address/command to ensure the load
register command to the DRAM that enables MPR was correct. To enable the MPR, a
Mode register set (MRS) command is issued to the MR3 register with bit A2 = 1. To make
this measurement, bring a scope trigger to an I/O based on the following conditions:
° To view each byte, add an additional trigger on dbg_cmp_byte and set to the byte
of interest.
Figure 38-53: RTL Debug Signals during Read DQS Centering (No Error)
X-Ref Target - Figure 38-54
Figure 38-54: RTL Debug Signals during Read DQS Centering (Error Case Shown)
11. After failure during this stage of calibration, the design goes into a continuous loop of
read commands to allow board probing.
The DRAM requires the write DQS to be center-aligned with the DQ to ensure maximum
write margin. Initially the write DQS is set to be roughly 90° out of phase with the DQ using
the XIPHY TX_DATA_PHASE set for the DQS. The TX_DATA_PHASE is an optional per-bit
adjustment that uses a fast internal XIPHY clock to generate a 90° offset between bits. The
DQS and DQ ODELAY are used to fine tune the 90° phase alignment to ensure maximum
margin at the DRAM.
A simple clock pattern of 10101010 is used initially because the write latency has not yet
been determined. Due to fly-by routing on the PCB/DIMM module, the command to data
timing is unknown until the next stage of calibration. Just as in read per-bit deskew when
issuing a write to the DRAM, the DQS and DQ toggle for eight clock cycles before and after the expected write latency. This is used to ensure the data is written into the DRAM even if the command-to-write data relationship is still unknown.
Figure 38-55: Initial Write DQS and DQ with Skew between Bits
1. Set TX_DATA_PHASE to 1 for DQ to add the 90° shift on the DQS relative to the DQ for
a given byte (Figure 38-56). The data read back on some DQ bits is 10101010, while other DQ bits might return 01010101.
X-Ref Target - Figure 38-56
2. If all the data for the byte does not match the expected data pattern, increment DQS
ODELAY one tap at a time until the expected data pattern is found on all bits and save
the delay as WRITE_DQS_TO_DQ_DESKEW_DELAY_Byte (Figure 38-57). As the DQS
ODELAY is incremented, it moves away from the edge alignment with the CK. The
deskew data is the inner edge of the data valid window for writes.
Figure 38-57: Increment Write DQS ODELAY until All Bits Captured with Correct Pattern
3. Increment each DQ ODELAY until each bit fails to return the expected data pattern (the
data is edge aligned with the write DQS, Figure 38-58).
X-Ref Target - Figure 38-58
4. Return the DQ to the original position at the 0° shift using the TX_DATA_PHASE. Set DQS
ODELAY back to starting value (Figure 38-59).
Debug
To determine the status of Write Per-Bit Deskew Calibration, click the Write DQS to DQ
Deskew stage under the Status window and view the results within the Memory IP
Properties window. The message displayed in Memory IP Properties identifies how the
stage failed or notes if it passed successfully.
Figure 38-60: Memory IP XSDB Debug GUI Example – Write DQS to DQ Deskew
The status of Write Per-Bit Deskew can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to the Table 38-21. Execute
the Tcl commands noted in the XSDB Debug section to generate the XSDB output
containing the signal results.
Table 38-22 shows the signals and values adjusted or used during the Write Per-Bit Deskew
stage of calibration. The values can be analyzed in both successful and failing calibrations
to determine the resultant values and the consistency in results across resets. These values
can be found within the Memory IP Core Properties within the Hardware Manager or by
executing the Tcl commands noted in the XSDB Debug section.
Figure 38-62: Write DQS-to-DQ Debug Data (XSDB BUS_DATA_BURST, Associated Read Data Saved)
This is a sample of results for the Write DQS Centering XSDB debug signals:
Hardware Measurements
Probe the DQ bit alignment at the memory during writes. Trigger at the start
(cal_r*_status[14] = R for Rising Edge) and again at the end of per bit deskew
(cal_r*_status[15] = R for Rising Edge) to view the starting and ending alignments. To
look at each byte, add a trigger on the byte using dbg_cmp_byte.
Expected Results
Hardware measurements should show the write DQ bits are deskewed at the end of these
calibration stages.
Using the Vivado Hardware Manager and while running the Memory IP Example Design
with Debug Signals enabled, set the trigger (cal_r*_status[14] =R for Rising Edge).
The following simulation examples show how the debug signals should behave during
successful Write Per-Bit Deskew:
X-Ref Target - Figure 38-63
1. Issue a set of write and read bursts with the data pattern 10101010 and check the read data. Just as in read and write per-bit deskew, when issuing a write to the DRAM the DQS and DQ toggle for eight clock cycles before and after the expected write latency. This is used to ensure the data is written into the DRAM even if the command-to-write data relationship is still unknown.
2. Increment DQ ODELAY taps together until the read data pattern on all DQ bits changes
from the expected data pattern 10101010. The amount of delay required to find the
failing point is saved as WRITE_DQS_TO_DQ_PRE_ADJUST_MARGIN_LEFT_BYTE as shown
in Figure 38-66.
X-Ref Target - Figure 38-66
4. Find the right edge of the window by incrementing the DQS ODELAY taps until the data
changes from the expected data pattern 10101010. The amount of delay required to
find the failing point is saved as
WRITE_DQS_TO_DQ_PRE_ADJUST_MARGIN_RIGHT_BYTE as shown in Figure 38-67.
X-Ref Target - Figure 38-67
5. Calculate the center tap location for the DQS ODELAY based on deskew and the left and right edges, where dly0 is the original DQS delay + left margin and dly1 is the original DQS delay + right margin.
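A sketch of the presumed step 5 arithmetic, using the pre-adjust margins saved in XSDB, is shown below; all tap values are hypothetical examples.

    # Hedged sketch: center the write DQS ODELAY between the two measured edges.
    set dqs_odelay_start 30   ;# example: DQS ODELAY setting entering this step
    set margin_left      25   ;# example WRITE_DQS_TO_DQ_PRE_ADJUST_MARGIN_LEFT_BYTE
    set margin_right     35   ;# example WRITE_DQS_TO_DQ_PRE_ADJUST_MARGIN_RIGHT_BYTE
    set dly0 [expr {$dqs_odelay_start + $margin_left}]    ;# left edge in DQS ODELAY terms
    set dly1 [expr {$dqs_odelay_start + $margin_right}]   ;# right edge in DQS ODELAY terms
    set dqs_odelay_center [expr {($dly0 + $dly1) / 2}]
    puts "final DQS ODELAY = $dqs_odelay_center"          ;# 60 for the example values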
Debug
To determine the status of Write DQS Centering Calibration, click the Write DQS to DQ
(Simple) stage under the Status window and view the results within the Memory IP
Properties window. The message displayed in Memory IP Properties identifies how the
stage failed or notes if it passed successfully.
Figure 38-68: Memory IP XSDB Debug GUI Example – Write DQS to DQ (Simple)
The status of Write DQS Centering can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to Table 38-23. Execute the
Tcl commands noted in the XSDB Debug section to generate the XSDB output containing
the signal results.
Table 38-24 shows the signals and values adjusted or used during the Write DQS Centering
stage of calibration. The values can be analyzed in both successful and failing calibrations
to determine the resultant values and the consistency in results across resets. These values
can be found within the Memory IP Core Properties within the Hardware Manager or by
executing the Tcl commands noted in the XSDB Debug section.
Data swizzling (bit reordering) is completed within the UltraScale PHY. Therefore, the data
visible on BUS_DATA_BURST and a scope in hardware is ordered differently compared to
what would be seen in ChipScope. Figure 38-69 is an example of how the data is converted.
X-Ref Target - Figure 38-69
This is a sample of results for the Write DQS Centering XSDB debug signals:
Hardware Measurements
Probe the DQS to DQ write phase relationship at the memory. DQS should be center aligned
to DQ at the end of this stage of calibration. Trigger at the start (cal_r*_status[18] =
R for Rising Edge) and again at the end (cal_r*_status[19] = R for Rising Edge) of
Write DQS Centering to view the starting and ending alignments.
Expected Results
Hardware measurements should show that the write DQ bits are deskewed and that the
write DQS are centered in the write DQ window at the end of these calibration stages.
Using the Vivado Hardware Manager and while running the Memory IP Example Design
with Debug Signals enabled, set the trigger (cal_r*_status[18] = R for Rising Edge).
The simulation examples shown in the Debugging Write Per-Bit Deskew Failures > Expected
Results section can be used to additionally monitor the expected behavior for Write DQS
Centering.
During the calibration stages up to this point, the Data Mask (DM) signals are not used; they are deasserted during any writes before/after the required amount of time to ensure they have no impact on the pattern being written to the DRAM. If the DM signals are not used in the design, this step of calibration is skipped.
Two patterns are used to calibrate the DM pin. The first pattern is written to the DRAM with
the DM deasserted, ensuring the pattern is written to the DRAM properly. The second
pattern overwrites the first pattern at the same address but with the DM asserted in a
known position in the burst, as shown in Figure 38-71.
Because this stage takes place before Write Latency Calibration when issuing a write to the
DRAM, the DQS and DQ/DM toggles for eight clock cycles before and after the expected
write latency. This is used to ensure the data is written into the DRAM even though the
command-to-write data relationship is still unknown.
X-Ref Target - Figure 38-70
Figure 38-70: First write for DM calibration - data pattern 5555555555_55555555 written with DM deasserted (DQS, DQ, and DM shown for DDR3 and DDR4). (X24487-082420)
Figure 38-71: Second write for DM calibration - data pattern BBBBBBBB_BBBBBBBB written with DM asserted in a known position in the burst (DQS, DQ, and DM shown for DDR3 and DDR4). (X24488-082420)
The read back data for any given nibble is 5B5B_5B5B, where the location of the 5 in the
burst indicates where the DM is asserted. Because the data is constant during this step, the
DQS-to-DQ alignment is not stressed. Only the DQS-to-DM is checked as the DQS and DM
phase relationship is adjusted with each other.
This step is similar to Write DQS-to-DQ Per-Bit Deskew but involves the DM instead of the
DQ bits. See Write Calibration Overview, page 670 for an in-depth overview of the
algorithm. The DQS ODELAY value used to edge align the DQS with the DM is stored as
WRITE_DQS_TO_DM_DESKEW_BYTE. The ODELAY value for the DM is stored as
WRITE_DQS_TO_DM_DM_ODELAY_BYTE.
This step is similar to Write DQS-to-DQ Centering but involves the DM instead of the DQ
bits. See Write Calibration Overview, page 670 for an in-depth overview of the algorithm.
The tap value the DM was set to when the left edge was found is saved as WRITE_DQS_TO_DM_PRE_ADJUST_MARGIN_LEFT_BYTE. The tap value the DQS was set to when the right edge was found is saved as WRITE_DQS_TO_DM_PRE_ADJUST_MARGIN_RIGHT_BYTE.
Because the DQS ODELAY can only hold a single value, the aggregate smallest left/right margin between the DQ and DM is computed, and the DQS ODELAY value is set in the middle of this aggregate window. The final values of the DQS and DM can be found at WRITE_DQS_ODELAY_FINAL and WRITE_DM_ODELAY_FINAL.
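The aggregate-window computation described above can be sketched as follows; the margin values are hypothetical examples and the variable names are not XSDB signals.

    # Hedged sketch: aggregate the smallest left/right margins across DQ and DM, then
    # place the DQS ODELAY in the middle of that aggregate window.
    set dq_margin_left 28;  set dq_margin_right 28   ;# example DQ margins in taps
    set dm_margin_left 22;  set dm_margin_right 34   ;# example DM margins in taps
    set agg_left  [expr {min($dq_margin_left,  $dm_margin_left)}]
    set agg_right [expr {min($dq_margin_right, $dm_margin_right)}]
    set dqs_shift [expr {($agg_right - $agg_left) / 2}]   ;# shift DQS toward the larger margin
    puts "aggregate window = -$agg_left..+$agg_right taps, adjust DQS ODELAY by $dqs_shift taps"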
Debug
To determine the status of Write Data Mask Calibration, click the Write DQS to DM/DBI
(Simple) stage under the Status window and view the results within the Memory IP
Properties window. The message displayed in Memory IP Properties identifies how the
stage failed or notes if it passed successfully.
Figure 38-72: Memory IP XSDB Debug GUI Example – Write DQS to DM/DBI (Simple)
The status of Write Data Mask Calibration can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to Table 38-25. Execute the
Tcl commands noted in the XSDB Debug section to generate the XSDB output containing
the signal results.
Table 38-26 shows the signals and values adjusted or used during the Write Data Mask
stage of calibration. The values can be analyzed in both successful and failing calibrations
to determine the resultant values and the consistency in results across resets. These values
can be found within the Memory IP Core Properties within the Hardware Manager or by
executing the Tcl commands noted in the XSDB Debug section.
Table 38-26: Signals of Interest for Write Data Mask Calibration (Cont'd)
Signal (Usage): Signal Description

WRITE_DQS_ODELAY_FINAL_BYTE*_BIT* (one per byte): Final DQS ODELAY value.

WRITE_DM_ODELAY_FINAL_BYTE*_BIT* (one per bit): Final DM ODELAY value.

BUS_DATA_BURST (2014.3+): During calibration for a byte, an example data burst is saved for later analysis in case of failure. BUS_DATA_BURST_0 holds an initial read data burst pattern for a given byte with the starting alignment prior to write DM deskew (TX_DATA_PHASE set to 1 for DQS, 0 for DM and DQ). BUS_DATA_BURST_1 holds a read data burst after write DM deskew and at the start of write DQS-to-DM centering, after TX_DATA_PHASE for DQS is set to 1 and the TX_DATA_PHASE for DQ/DM is set to 1. After a byte calibrates, the example read data saved in the BUS_DATA_BURST registers is cleared. BUS_DATA_BURST_2 and BUS_DATA_BURST_3 are not used.
Data swizzling (bit reordering) is completed within the UltraScale PHY. Therefore, the data
visible on BUS_DATA_BURST and a scope in hardware is ordered differently compared to
what would be seen in ChipScope. Figure 38-73 and Figure 38-74 are examples of how the
data is converted.
This is a sample of results for the Write Data Mask XSDB debug signals:
Hardware Measurements
• Probe the DM to DQ bit alignment at the memory during writes. Trigger at the start
(cal_r*_status[20] = R for Rising Edge) and again at the end
(cal_r*_status[21] = R for Rising Edge) of Simple Pattern Write Data Mask
Calibration to view the starting and ending alignments.
• Probe the DM to DQ bit alignment at the memory during writes. Trigger at the start
(cal_r*_status[38] = R for Rising Edge) and again at the end
(cal_r*_status[39] = R for Rising Edge) of Complex Pattern Write Data Mask
Calibration to view the starting and ending alignments.
The following simulation examples show how the debug signals should behave during
successful Write DQS-to-DM Calibration.
X-Ref Target - Figure 38-75
Expected Results
• Look at the individual WRITE_DQS_TO_DM_DQS_ODELAY and
WRITE_DQS_TO_DM_DM_ODELAY tap settings for each nibble. The taps should only
vary by 0 to 20 taps. See Determining Window Size in ps, page 773 to calculate the
write window.
• Determine if any bytes completed successfully. The write calibration algorithm
sequentially steps through each DQS byte group detecting the capture edges.
• If the incorrect data pattern is detected, determine if the error is due to the write access
or the read access. See Determining If a Data Error is Due to the Write or Read,
page 770.
• Both edges need to be found. This is possible at all frequencies because the algorithm
uses 90° of ODELAY taps to find the edges.
1. Turn on DBI on the read path (MRS setting in the DRAM and a fabric switch that inverts
the read data when the value read from the DBI pin is asserted).
2. Write the pattern 0-F-0-F-0-F-0-F to the DRAM (extending the data pattern before/
after the burst due to this step happening before write latency calibration) to address
0x000/Bank Group 0.
3. If the nibble does not contain the DBI pin, skip the nibble and go to next nibble.
4. Start from the current setting of PQTR/NQTR, which is the center of the data valid
window for the DQ found so far.
5. Issue reads to address 0x000/Bank Group 0. This is repeated until read DQS Centering
with DBI is completed.
X-Ref Target - Figure 38-77
Figure 38-77: Reads for read DQS centering with DBI - data in the DRAM array is 0-F-0-F-0-F-0-F; DQ reads back as 0xFF with the inverted beats indicated on DBI_n. (X15992-021716)
6. Find the left edge of read DBI pin. Decrement PQTR/NQTR to find the left edge of the
read DBI pin until the data pattern changes from the expected pattern.
7. Find the right edge of the read DBI pin. Compute the aggregate window given the XSDB results for Read DQS Centering (Simple) and the new result from DBI. This means take the (largest left + smallest right)/2 + largest left. This gives the center result for the given nibble + DBI pin (aggregate center).
8. Turn off DBI on the read path (MRS setting in the DRAM and fabric switch).
Debug
To determine the status of Read DQS centering with DBI Calibration, click the Read DQS
Centering DBI (Simple) Calibration stage under the Status window and view the results
within the Memory IP Properties window. The message displayed in Memory IP
Properties identifies how the stage failed or notes if it passed successfully.
The status of Read DQS Centering DBI (Simple) can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to Table 38-27. Execute the
Tcl commands noted in the XSDB Debug section to generate the XSDB output containing
the signal results.
Table 38-27: DDR_CAL_ERROR Decode for Read DQS Centering with DBI Per-Bit DBI Deskew

DDR_CAL_ERROR_CODE 0x1 (DDR_CAL_ERROR_1 = Nibble, DDR_CAL_ERROR_0 = N/A)
Description: No valid data found for a given nibble.
Recommended Debug Steps: Check the BUS_DATA_BURST fields in XSDB. Check the dbg_rd_data, dbg_rd_data_cmp, and dbg_expected_data signals in the ILA. Check the pinout for the DBI pin. Probe the board and check for the returning pattern to determine if the initial write to the DRAM happened properly, or if it is a read failure. Probe the DBI pin during the read.

DDR_CAL_ERROR_CODE 0xF (DDR_CAL_ERROR_1 = Nibble, DDR_CAL_ERROR_0 = N/A)
Description: Timeout error waiting for all read data bursts to return.
Recommended Debug Steps: Check the dbg_cal_seq_rd_cnt and dbg_cal_seq_cnt.
Table 38-28 describes the signals and values adjusted or used during the Read DQS
Centering DBI (Simple) stage of calibration. The values can be analyzed in both successful
and failing calibrations to determine the resultant values and the consistency in results
across resets. These values can be found within the Memory IP Core Properties within
Hardware Manager or by executing the Tcl commands noted in the XSDB Debug section.
Table 38-28: Signals of Interest for Read DQS Centering with DBI
RDLVL_DBI_PQTR_LEFT_RANK*_NIBBLE* (One per nibble): Read leveling PQTR when the left edge of the read data valid window is detected during Read DQS Centering DBI (Simple).
RDLVL_DBI_PQTR_RIGHT_RANK*_NIBBLE* (One per nibble): Read leveling PQTR when the right edge of the read data valid window is detected during Read DQS Centering DBI (Simple).
RDLVL_DBI_PQTR_CENTER_RANK*_NIBBLE* (One per nibble): Read leveling PQTR center point between right and left during Read DQS Centering DBI (Simple).
RDLVL_DBI_NQTR_LEFT_RANK*_NIBBLE* (One per nibble): Read leveling NQTR when the left edge of the read data valid window is detected during Read DQS Centering DBI (Simple).
RDLVL_DBI_NQTR_RIGHT_RANK*_NIBBLE* (One per nibble): Read leveling NQTR when the right edge of the read data valid window is detected during Read DQS Centering DBI (Simple).
RDLVL_DBI_NQTR_CENTER_RANK*_NIBBLE* (One per nibble): Read leveling NQTR center point between right and left during Read DQS Centering DBI (Simple).
RDLVL_IDELAY_DBI_RANK*_BYTE* (One per rank per Byte): Read leveling IDELAY delay value for the DBI pin set during DBI deskew.
RDLVL_PQTR_LEFT_RANK*_NIBBLE* (One per rank per nibble): Read leveling PQTR tap position when the left edge of the read data valid window is detected (simple pattern).
RDLVL_NQTR_LEFT_RANK*_NIBBLE* (One per rank per nibble): Read leveling NQTR tap position when the left edge of the read data valid window is detected (simple pattern).
RDLVL_PQTR_RIGHT_RANK*_NIBBLE* (One per rank per nibble): Read leveling PQTR tap position when the right edge of the read data valid window is detected (simple pattern).
RDLVL_NQTR_RIGHT_RANK*_NIBBLE* (One per rank per nibble): Read leveling NQTR tap position when the right edge of the read data valid window is detected (simple pattern).
RDLVL_PQTR_CENTER_RANK*_NIBBLE* (One per rank per nibble): Read leveling PQTR center tap position found at the end of read DQS centering (simple pattern).
RDLVL_NQTR_CENTER_RANK*_NIBBLE* (One per rank per nibble): Read leveling NQTR center tap position found at the end of read DQS centering (simple pattern).
RDLVL_IDELAY_VALUE_RANK*_BYTE*_BIT* (One per rank per Bit): Read leveling IDELAY delay value found during per-bit read DQS centering (simple pattern).
BISC_ALIGN_PQTR_NIBBLE* (One per nibble): Initial 0° offset value provided by BISC at power-up.
BISC_ALIGN_NQTR_NIBBLE* (One per nibble): Initial 0° offset value provided by BISC at power-up.
BISC_PQTR_NIBBLE* (One per nibble): Initial 90° offset value provided by BISC at power-up. Compute the 90° value in taps by taking (BISC_PQTR – BISC_ALIGN_PQTR). To estimate the tap resolution, take (¼ of the memory clock period)/(BISC_PQTR – BISC_ALIGN_PQTR).
BISC_NQTR_NIBBLE* (One per nibble): Initial 90° offset value provided by BISC at power-up. Compute the 90° value in taps by taking (BISC_NQTR – BISC_ALIGN_NQTR). To estimate the tap resolution, take (¼ of the memory clock period)/(BISC_NQTR – BISC_ALIGN_NQTR).
BUS_DATA_BURST: When a failure occurs during Read DQS Centering with DBI, some data is saved to indicate what the data looks like across several tap settings for the byte where the failure occurred (DQ IDELAY is not adjusted). See Figure 38-48 for an example of the delays used for the capture. BUS_DATA_BURST_0 holds a single burst of data when PQTR/NQTR are set to 0 taps. BUS_DATA_BURST_1 holds a single burst of data when PQTR/NQTR are set to 90°. BUS_DATA_BURST_2 holds a single burst of data when PQTR/NQTR are set to 180°. BUS_DATA_BURST_3 holds a single burst of data when PQTR/NQTR are set to 270°.
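As a worked example of the tap-resolution estimate described for BISC_PQTR_NIBBLE* above, the arithmetic can be checked in the Vivado Tcl console. The BISC readings and memory clock period below are hypothetical values chosen only for illustration:
set mem_clk_period_ps 1250.0   ;# example: 1,600 Mb/s data rate, 800 MHz memory clock
set bisc_pqtr         0x64     ;# example BISC_PQTR_NIBBLE0 reading
set bisc_align_pqtr   0x08     ;# example BISC_ALIGN_PQTR_NIBBLE0 reading
set taps_per_90 [expr {$bisc_pqtr - $bisc_align_pqtr}]              ;# taps spanning 90 degrees
set tap_res_ps  [expr {($mem_clk_period_ps / 4.0) / $taps_per_90}]  ;# roughly 3.4 ps per tap here
puts "90 degrees = $taps_per_90 taps, estimated tap resolution = [format %.2f $tap_res_ps] ps"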
Data swizzling (bit reordering) is completed within the UltraScale PHY. Therefore, the data
visible on BUS_DATA_BURST and a scope in hardware is ordered differently compared to
what would be seen in ChipScope. Figure 38-78 and Figure 38-79 are examples of how the
data is converted.
This is a sample of results for Read DQS Centering DBI (Simple) XSDB debug signals (nibbles
that do not contain the DBI pin are skipped and hence the fields are all 0):
Expected Results
• Look at the window measured during Read DQS Centering (Simple) and compare what
is found during Read DQS Centering DBI (Simple). The eye size found should be similar,
and the PQTR/NQTR should not move by more than 10 taps typically.
• Determine if any bytes completed successfully. The algorithm steps through each DQS byte sequentially.
Hardware Measurements
1. Probe the write commands and read commands at the memory:
set to the byte of interest. The following simulation example shows how the debug
signals should behave during successful Read DQS Centering with DBI.
X-Ref Target - Figure 38-80
Figure 38-80: RTL Debug Signals during Read DQS Centering with DBI (No Error)
13. After failure during this stage of calibration, the design goes into a continuous loop of
read commands to allow board probing.
Write latency calibration makes use of the coarse tap in the WL_DLY_RNK of the XIPHY for
adjusting the write latency on a per byte basis. Write leveling uses up a maximum of three
coarse taps of the XIPHY delay to ensure each write DQS is aligned to the nearest clock
edge. The Memory Controller provides the write data 1 tCK early to the PHY, which write leveling then delays by up to one memory clock cycle. This means that, for the zero-PCB-delay case of a typical simulation, the data would be aligned at the DRAM without additional delay added by write latency calibration.
Write latency calibration can only account for early data, because in the case where the data
arrives late at the DRAM there is no push back on the controller to provide the data earlier.
With 16 XIPHY coarse taps available (each tap is 90°), four memory clock cycles of shift are
available in the XIPHY with one memory clock used by write leveling. This leaves three
memory clocks of delay available for write latency calibration.
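A minimal sketch of the coarse-tap budget described above (each XIPHY coarse tap is 90°, so four taps make one memory clock):
set total_coarse_taps 16
set taps_per_mem_clk  [expr {360 / 90}]                               ;# 4 coarse taps = 1 tCK
set total_mem_clks    [expr {$total_coarse_taps / $taps_per_mem_clk}] ;# 4 memory clocks of range
set wrlvl_mem_clks    1                                               ;# up to 1 tCK consumed by write leveling
set wrlat_mem_clks    [expr {$total_mem_clks - $wrlvl_mem_clks}]      ;# 3 tCK left for write latency calibration
puts "Write latency calibration range: $wrlat_mem_clks memory clocks"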
Figure 38-81 shows the calibration flow to determine the setting required for each byte.
X-Ref Target - Figure 38-81
Figure 38-81: Write Latency Calibration Flow (X24489-082420)
The write DQS for the write command is extended for longer than required to ensure the
DQS is toggling when the DRAM expects it to clock in the write data. A specific data pattern
is used to check when the correct data pattern gets written into the DRAM, as shown in
Figure 38-82.
In the example, at the start of write latency calibration for the given byte, the target write latency falls in the middle of the data pattern. The returned data would be 55AA9966FFFFFFFF rather than the expected FF00AA5555AA9966. The write DQS and
data are delayed using the XIPHY coarse delay and the operation is repeated, until the
correct data pattern is found or there are no more coarse taps available. After the pattern is
found, the amount of coarse delay required is indicated by
WRITE_LATENCY_CALIBRATION_COARSE_Rank_Byte.
X-Ref Target - Figure 38-82
Figure 38-82: Write Latency Calibration Data Pattern (CK/CK#, DQS/DQS#, target cWL) (X24490-082420)
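The search in Figure 38-81 can be summarized with the following sketch. Here read_back_burst is a hypothetical placeholder for the burst captured by the calibration logic, and the expected pattern and the four-tap (one memory clock) step are taken from the text and flowchart above:
set expected_pattern FF00AA5555AA9966
set coarse 0
set max_coarse 12                       ;# three memory clocks x 4 coarse taps
while {1} {
    set readback [read_back_burst]      ;# hypothetical: data read back from the DRAM
    if {$readback eq $expected_pattern} { break }
    if {$coarse >= $max_coarse} { puts "ERROR: write latency calibration failed"; break }
    incr coarse 4                       ;# delay write DQS and data by one more memory clock
}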
• If the data pattern is not found for a given byte, the data pattern found is checked to
see if the data at the maximum delay available still arrives too early (indicating not
enough adjustment was available in the XIPHY to align to the correct location) or if the
first burst with no extra delay applied is already late (indicating at the start the data
would need to be pulled back). The following data pattern is checked:
Debug
To determine the status of Write Latency Calibration, click the Write Latency Calibration
stage under the Status window and view the results within the Memory IP Properties
window. The message displayed in Memory IP Properties identifies how the stage failed or
notes if it passed successfully.
X-Ref Target - Figure 38-83
Figure 38-83: Memory IP XSDB Debug GUI Example – Write Latency Calibration
The status of Write Latency Calibration can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to Table 38-29. Execute the
Tcl commands noted in the XSDB Debug section to generate the XSDB output containing
the signal results.
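For reference, a minimal example of dumping the XSDB content from the Hardware Manager Tcl console, assuming a single Memory IP core in the design (see the XSDB Debug section for the complete procedure):
set mig [lindex [get_hw_migs] 0]
report_hw_mig $mig    ;# dumps the calibration status and the XSDB debug signals
list_property $mig    ;# browse the individual property names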
Table 38-30 shows the signals and values adjusted or used during the Write Latency stage
of calibration. The values can be analyzed in both successful and failing calibrations to
determine the resultant values and the consistency in results across resets. These values can
be found within the Memory IP Core Properties in the Hardware Manager or by executing
the Tcl commands noted in the XSDB Debug section.
Data swizzling (bit reordering) is completed within the UltraScale PHY. Therefore, the data
visible on BUS_DATA_BURST and a scope in hardware is ordered differently compared to
what would be seen in ChipScope. Figure 38-84 to Figure 38-86 show examples of how the
data is converted. Because all Fs are written before this expected Write Latency pattern and all 0s after it, the captured pattern can have Fs before and 0s after until Write Latency calibration is completed, at which point Figure 38-84 to Figure 38-86 are accurate representations.
This is a sample of results for the Write Latency XSDB debug signals:
Hardware Measurements
If the design is stuck in the Write Latency stage, the issue could be related to either the
write or the read. Determining whether the write or read is causing the failure is critical. The
following steps should be completed. For additional details and examples, see the
Determining If a Data Error is Due to the Write or Read, page 770 section.
Using the Vivado Hardware Manager and while running the Memory IP Example Design
with Debug Signals enabled, set the trigger.
The following simulation example shows how the debug signals should behave during
successful Write Latency Calibration.
Figure 38-87: RTL Debug Signals during Write Latency Calibration (x4 Example Shown)
Expected Results
Complex data patterns are used for advanced read DQS centering for memory systems to
improve read timing margin. Long and complex data patterns on both the victim and
aggressor DQ lanes impact the size and location of the data eye. The objective of the
complex calibration step is to generate the worst case data eye on each DQ lane so that the
DQS signal can be aligned, resulting in good setup/hold margin during normal operation
with any workload.
There are two long data patterns stored in a block RAM, one for a victim DQ lane, and an
aggressor pattern for all other DQ lanes. These patterns are used to generate write data, as
well as expected data on reads for comparison and error logging. Each pattern consists of
157 8-bit chunks or BL8 bursts.
Each DQ lane of a byte takes a turn at being the victim. An RTL state machine automatically selects each DQ lane in turn, MUXes the victim or aggressor pattern onto the appropriate DQ lanes, issues the read/write transactions, and records errors. The victim pattern is only
walked across the DQ lanes of the selected byte to be calibrated, and all other DQ lanes
carry the aggressor pattern, including all lanes in unselected bytes if there is more than one byte lane.
Similar steps to those described in Read DQS Centering are performed, with the PQTR/
NQTR starting out at the left edge of the simple window found previously. The complex
pattern is written and read back. All bits in a nibble are checked to find the left edge of the
window, incrementing the bits together as needed or the PQTR/NQTR to find the aggregate
left edge. After the left and right edges are found, it steps through the entire data eye.
Debug
To determine the status of Complex Read Leveling Calibration, click the Read DQS
Centering (Complex) stage under the Status window and view the results within the
Memory IP Properties window. The message displayed in Memory IP Properties
identifies how the stage failed or notes if it passed successfully.
Figure 38-88: Memory IP XSDB Debug GUI Example – Read DQS Centering (Complex)
The status of Read Leveling Complex can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to Table 38-31. Execute the
Tcl commands noted in the XSDB Debug section to generate the XSDB output containing
the signal results.
Table 38-32 shows the signals and values adjusted or used during the Read Leveling
Complex stage of calibration. The values can be analyzed in both successful and failing
calibrations to determine the resultant values and the consistency in results across resets.
These values can be found within the Memory IP Core Properties within the Hardware
Manager or by executing the Tcl commands noted in the XSDB Debug section.
This is a sample of results for Complex Read Leveling using the Memory IP Debug GUI
within the Hardware Manager.
Note: Either the “Table” or “Chart” view can be used to look at the calibration windows.
Figure 38-89 and Figure 38-90 are screen captures from 2015.1 and might vary from the
current version.
This is a sample of results for the Read Leveling Complex XSDB debug signals:
Expected Results
• Look at the individual PQTR/NQTR tap settings for each nibble. The taps should only
vary by 0 to 20 taps. Use the BISC values to compute the estimated bit time in taps.
° For example, Byte 7 Nibble 0 in Figure 38-90 is shifted and smaller compared to the
remaining nibbles. This type of result is not expected. For this specific example, the
SDRAM was not properly loaded into the socket.
• Look at the individual IDELAY taps for each bit. The IDELAY taps should only vary by 0 to 20 taps and are dependent on PCB trace delays. For deskew, the IDELAY taps are typically in the 50 to 70 tap range, while the PQTR and NQTR taps are usually in the 0 to 5 tap range.
• Determine if any bytes completed successfully. The read leveling algorithm sequentially
steps through each DQS byte group detecting the capture edges.
• If the incorrect data pattern is detected, determine if the error is due to the write access
or the read access. See Determining If a Data Error is Due to the Write or Read,
page 770.
• To analyze the window size in ps, see Determining Window Size in ps, page 773. As a general rule of thumb, the window size for a healthy system should be ≥ 30% of the expected UI size (see the worked example after this list).
• Compare the read leveling window (read margin) results from the simple pattern calibration versus the complex pattern calibration. The windows should all shrink, but the reduction in window size should be relatively consistent across the data byte lanes.
° Use the Memory IP Debug GUI to quickly compare simple versus complex window
sizes.
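A worked example of the 30%-of-UI rule of thumb referenced in the list above, assuming a 2,400 Mb/s data rate (the rate is an assumption chosen only for illustration):
set data_rate_mbps 2400.0
set ui_ps          [expr {1.0e6 / $data_rate_mbps}]   ;# 1 UI is about 416.7 ps at 2,400 Mb/s
set min_window_ps  [expr {0.30 * $ui_ps}]             ;# healthy window of roughly 125 ps or more
puts "UI = [format %.1f $ui_ps] ps, minimum healthy window = [format %.1f $min_window_ps] ps"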
Figure 38-91 is a screen capture from 2015.1 and might vary from the current version.
X-Ref Target - Figure 38-91
Hardware Measurements
1. Probe the write commands and read commands at the memory:
commands and check the data. Check if the dbg_rd_valid aligns with the data pattern
or is off. Set up a trigger when the error gets asserted to capture signals in the hardware
debugger for analysis.
9. Re-check results from previous calibration stages. Compare passing byte lanes against
failing byte lanes for previous stages of calibration. If a failure occurs during complex
pattern calibration, check the values found during simple pattern calibration for
example.
10. All of the data comparisons for complex read calibration occur in the general interconnect, so it can be useful to capture the debug data in the hardware debugger and examine the data coming back as the taps are adjusted (see Figure 38-92 and Figure 38-93). The screenshots shown are from simulation, with a small
loop count set for the data pattern. Look at dbg_rd_data, dbg_rd_valid, and
dbg_cplx_err_log.
11. Using the Vivado Hardware Manager and while running the Memory IP Example Design
with Debug Signals enabled, set the Read Complex calibration trigger to
cal_r*_status[28] = R (rising edge). To view each byte, add an additional trigger on
dbg_cmp_byte and set to the byte of interest. The following simulation example shows
how the debug signals should behave during Read Complex Calibration.
Figure 38-92 shows the start of the complex calibration data pattern, with emphasis on the dbg_cplx_config bus. The “read start” bit is Bit[0] and the number of loops is set by Bits[15:9]; in Figure 38-92 the complex read pattern starts with the loop count set to 1 (for simulation only). The dbg_cplx_status goes
to 1 to indicate the pattern is in progress. See Table 38-2, page 594 for the list of all
debug signals.
X-Ref Target - Figure 38-92
Figure 38-93: RTL Debug Signals during Read Complex (Writes and Reads)
12. Analyze the debug signal dbg_cplx_err_log. This signal shows comparison
mismatches on a per-bit basis. When a bit error occurs, signifying an edge of the
window has been found, typically a single bit error is shown on dbg_cplx_err_log.
That is, all bits of this bus are 0 except for the single bit that had a comparison mismatch, which is set to 1. When an unexpected data error occurs during complex read calibration, for example a byte shift, the entire bus would be 1. This is not the expected bit mismatch found during window detection but points to a true read-versus-write issue.
Now, the read data should be compared with the expected (compare) data and the error
debugged to determine if it is a read or write issue. Use dbg_rd_data and
dbg_rd_dat_cmp to compare the received data to the expected data.
13. For more information, see Debugging Data Errors, page 758.
14. After failure during this stage of calibration, the design goes into a continuous loop of
read commands to allow board probing.
The final stage of Write DQS-to-DQ centering that is completed before normal operation is
repeating the steps performed during Write DQS-to-DQ centering but with a difficult/
complex pattern. The purpose of using a complex pattern is to stress the system for SI
effects such as ISI and noise while calculating the write DQS center and write DQ positions.
This ensures the write center position can reliably capture data with margin in a true system.
Debug
To determine the status of Write Complex Pattern Calibration, click the Write DQS to DQ
(Complex) stage under the Status window and view the results within the Memory IP
Properties window. The message displayed in Memory IP Properties identifies how the
stage failed or notes if it passed successfully.
Figure 38-94: Memory IP XSDB Debug GUI Example – Write DQS to DQ (Complex)
The status of Write Complex Pattern Calibration can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to Table 38-33. Execute the
Tcl commands noted in the XSDB Debug section to generate the XSDB output containing
the signal results.
Table 38-33: DDR_CAL_ERROR Decode for Read Leveling and Write DQS Centering Calibration
Stage: Write DQS to DQ
DDR_CAL_ERROR_CODE 0x1 (DDR_CAL_ERROR_1 = Byte, DDR_CAL_ERROR_0 = N/A) – No valid data found. Recommended debug steps: Check if the design meets timing. Check the margin found for the simple pattern for the given nibble/byte. Check if the ODELAY values used for each bit are reasonable compared to the others in the byte. Check the dbg_cplx_config, dbg_cplx_status, dbg_cplx_err_log, dbg_rd_data, and dbg_expected_data during this stage of calibration. Check that the default VREF value being used is correct for the configuration.
DDR_CAL_ERROR_CODE 0xF (DDR_CAL_ERROR_1 = Byte, DDR_CAL_ERROR_0 = N/A) – Timeout error waiting for read data to return. Recommended debug steps: Check the dbg_cal_seq_rd_cnt and dbg_cal_seq_cnt.
Table 38-34 shows the signals and values adjusted or used during the Write Complex
Pattern stage of calibration. The values can be analyzed in both successful and failing
calibrations to determine the resultant values and the consistency in results across resets.
These values can be found within the Memory IP Core Properties within the Hardware
Manager or by executing the Tcl commands noted in the XSDB Debug section.
Expected Results
• Look at the individual WRITE_COMPLEX_DQS_TO_DQ_DQS_ODELAY and
WRITE_COMPLEX_DQS_TO_DQ_DQ_ODELAY tap settings for each nibble. The taps
should only vary by 0 to 20 taps. To calculate the write window, see Determining
Window Size in ps, page 773.
• Determine if any bytes completed successfully. The write calibration algorithm
sequentially steps through each DQS byte group detecting the capture edges.
• If the incorrect data pattern is detected, determine if the error is due to the write access
or the read access. See Determining If a Data Error is Due to the Write or Read,
page 770.
• Both edges need to be found. This is possible at all frequencies because the algorithm
uses 90° of ODELAY taps to find the edges.
• To analyze the window size in ps, see Determining Window Size in ps, page 773. As a
general rule of thumb, the window size for a healthy system should be ≥ 30% of the
expected UI size.
Using the Vivado Hardware Manager and while running the Memory IP Example Design
with the Debug Signals enabled, set the trigger (cal_r*_status[36] = R for Rising
Edge).
The following simulation example shows how the debug signals should behave during
successful Write DQS-to-DQ.
X-Ref Target - Figure 38-95
Hardware Measurements
1. If the write complex pattern fails, use high quality probes and scope the DQS-to-DQ
phase relationship at the memory during a write. Trigger at the start
(cal_r*_status[36] = R for Rising Edge) and again at the end
(cal_r*_status[37] = R for Rising Edge) of Write Complex DQS Centering to view
the starting and ending alignments. The alignment should be approximately 90°.
2. If the DQS-to-DQ alignment is correct, observe the we_n-to-DQS relationship to see if
it meets CWL again using cal_r*_status[25] = R for Rising Edge as a trigger.
3. For all stages of write/read leveling, probe the write commands and read commands at
the memory:
For multi-rank designs, previously calibrated positions must be validated and adjusted
across each rank within the system. The previously calibrated areas that need further
adjustment for multi-rank systems are Read Level, DQS Preamble, and Write Latency. The
adjustments are described in the following sections.
Each DQS has a single IDELAY/PQTR/NQTR value that is used across ranks. During Read
Leveling Calibration, each rank is allowed to calibrate independently to find the ideal
IDELAY/PQTR/NQTR tap positions for each DQS to each separate rank. During the
multi-rank checks, the minimum and maximum value found for each DQS IDELAY/PQTR/
NQTR positions are checked, the range is computed, and the center point is used as the
final setting. For example, if a DQS has a PQTR that sees values of rank0 = 50, rank1 = 50,
rank2 = 50, and rank3 = 75, the final value would be 62. This is done to ensure a value can
work well across all ranks rather than averaging the values and giving preference to values
that happen more frequently.
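A minimal sketch of the midpoint-of-range rule described above, using the rank values from the example in the text:
set pqtr_per_rank {50 50 50 75}
set lo [tcl::mathfunc::min {*}$pqtr_per_rank]
set hi [tcl::mathfunc::max {*}$pqtr_per_rank]
set final [expr {($lo + $hi) / 2}]   ;# (50 + 75) / 2 = 62 with integer math, matching the example
puts "Final PQTR setting: $final"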
During DQS gate calibration for multi-rank systems, each rank is allowed to calibrate
independently. After all ranks have been calibrated, an adjustment is required before
normal operation to ensure fast rank-to-rank switching.
Across all ranks within a byte, the read latency and general interconnect delay
(clb2phy_rd_en) must match. During the DQS Gate Adjustment stage of calibration, the
coarse taps found during DQS Preamble Detection for each rank are adjusted such that a
common read latency and clb2phy_rd_en can be used. Additionally, the coarse taps have to be within four taps of each other for the same byte lane across all ranks. Table 38-35 shows the DQS
Gate adjustment examples.
The write leveling and write latency values are calibrated separately for each rank. After all
ranks have been calibrated, a check is made to ensure certain XIPHY requirements are met
on the write path. The difference in write latency between the ranks is allowed to be 180° (or
two XIPHY coarse taps). This is checked during this stage.
Debug
To determine the status of Multi-Rank Adjustments and Checks, click the Read DQS
Centering Multi Rank Adjustment or Multi Rank Adjustments and Checks stage under
the Status window and view the results within the Memory IP Properties window. The
message displayed in Memory IP Properties identifies how the stage failed or notes if it
passed successfully.
X-Ref Target - Figure 38-96
Figure 38-96: Memory IP XSDB Debug GUI Example – Read DQS Centering Multi-Rank Adjustment and
Multi-Rank Adjustment and Checks
The status of Read Level Multi-Rank Adjustment can also be determined by decoding the
DDR_CAL_ERROR_0 and DDR_CAL_ERROR_1 results according to Table 38-36. Execute the
Tcl commands noted in the XSDB Debug section to generate the XSDB output containing
the signal results.
Table 38-37 shows the signals and values adjusted or used during Read Level Multi-Rank
Adjustment and Multi-Rank DQS Gate. The values can be analyzed in both successful and
failing calibrations to determine the resultant values and the consistency in results across
resets. These values can be found within the Memory IP Core Properties within the
Hardware Manager or by executing the Tcl commands noted in the XSDB Debug section.
Table 38-37: Signals of Interest for Multi-Rank Adjustments and Checks
RDLVL_IDELAY_FINAL_BYTE*_BIT* (One per Bit): Final IDELAY tap position from the XIPHY.
MULTI_RANK_DQS_GATE_READ_LATENCY_BYTE* (One per Byte): Final common general interconnect read latency setting used for a given byte.
MULTI_RANK_DQS_GATE_COARSE_RANK*_BYTE* (One per Rank per Byte): Final RL_DLY_COARSE tap value used for a given byte (might differ from the calibrated value).
Expected Results
If no adjustments are required, the MULTI_RANK_* signals can be blank as shown; the fields are only populated when a change is made to the values.
MULTI_RANK_DQS_GATE_COARSE_RANK0_BYTE0 000
MULTI_RANK_DQS_GATE_COARSE_RANK0_BYTE1 000
MULTI_RANK_DQS_GATE_COARSE_RANK0_BYTE2 000
MULTI_RANK_DQS_GATE_COARSE_RANK0_BYTE3 000
MULTI_RANK_DQS_GATE_COARSE_RANK0_BYTE4 000
MULTI_RANK_DQS_GATE_COARSE_RANK0_BYTE5 000
MULTI_RANK_DQS_GATE_COARSE_RANK0_BYTE6 000
MULTI_RANK_DQS_GATE_COARSE_RANK0_BYTE7 000
MULTI_RANK_DQS_GATE_COARSE_RANK0_BYTE8 000
MULTI_RANK_DQS_GATE_COARSE_RANK1_BYTE0 000
MULTI_RANK_DQS_GATE_COARSE_RANK1_BYTE1 000
MULTI_RANK_DQS_GATE_COARSE_RANK1_BYTE2 000
MULTI_RANK_DQS_GATE_COARSE_RANK1_BYTE3 000
MULTI_RANK_DQS_GATE_COARSE_RANK1_BYTE4 000
MULTI_RANK_DQS_GATE_COARSE_RANK1_BYTE5 000
MULTI_RANK_DQS_GATE_COARSE_RANK1_BYTE6 000
MULTI_RANK_DQS_GATE_COARSE_RANK1_BYTE7 000
MULTI_RANK_DQS_GATE_COARSE_RANK1_BYTE8 000
MULTI_RANK_DQS_GATE_READ_LATENCY_BYTE0 000
MULTI_RANK_DQS_GATE_READ_LATENCY_BYTE1 000
MULTI_RANK_DQS_GATE_READ_LATENCY_BYTE2 000
MULTI_RANK_DQS_GATE_READ_LATENCY_BYTE3 000
MULTI_RANK_DQS_GATE_READ_LATENCY_BYTE4 000
MULTI_RANK_DQS_GATE_READ_LATENCY_BYTE5 000
MULTI_RANK_DQS_GATE_READ_LATENCY_BYTE6 000
MULTI_RANK_DQS_GATE_READ_LATENCY_BYTE7 000
MULTI_RANK_DQS_GATE_READ_LATENCY_BYTE8 000
The Read Leveling Multi-Rank Adjustment changes the values of the “FINAL” fields for the read path. The margin for each individual rank is given in the table and chart, but the final value is stored here.
RDLVL_IDELAY_FINAL_BYTE0_BIT0 04d
RDLVL_IDELAY_FINAL_BYTE0_BIT1 052
RDLVL_IDELAY_FINAL_BYTE0_BIT2 055
RDLVL_IDELAY_FINAL_BYTE0_BIT3 051
RDLVL_IDELAY_FINAL_BYTE0_BIT4 04f
RDLVL_IDELAY_FINAL_BYTE0_BIT5 04e
RDLVL_IDELAY_FINAL_BYTE0_BIT6 050
RDLVL_IDELAY_FINAL_BYTE0_BIT7 04b
RDLVL_IDELAY_FINAL_BYTE1_BIT0 04d
RDLVL_IDELAY_FINAL_BYTE1_BIT1 050
RDLVL_IDELAY_FINAL_BYTE1_BIT2 04f
RDLVL_IDELAY_FINAL_BYTE1_BIT3 04c
RDLVL_IDELAY_FINAL_BYTE1_BIT4 050
RDLVL_IDELAY_FINAL_BYTE1_BIT5 051
RDLVL_IDELAY_FINAL_BYTE1_BIT6 052
RDLVL_IDELAY_FINAL_BYTE1_BIT7 04e
RDLVL_IDELAY_FINAL_BYTE2_BIT0 04f
RDLVL_IDELAY_FINAL_BYTE2_BIT1 052
RDLVL_IDELAY_FINAL_BYTE2_BIT2 053
RDLVL_IDELAY_FINAL_BYTE2_BIT3 049
RDLVL_IDELAY_FINAL_BYTE2_BIT4 04f
RDLVL_IDELAY_FINAL_BYTE2_BIT5 052
RDLVL_IDELAY_FINAL_BYTE2_BIT6 04e
RDLVL_IDELAY_FINAL_BYTE2_BIT7 04c
RDLVL_IDELAY_FINAL_BYTE3_BIT0 051
RDLVL_IDELAY_FINAL_BYTE3_BIT1 056
RDLVL_IDELAY_FINAL_BYTE3_BIT2 04c
RDLVL_IDELAY_FINAL_BYTE3_BIT3 04b
RDLVL_IDELAY_FINAL_BYTE3_BIT4 04f
RDLVL_IDELAY_FINAL_BYTE3_BIT5 050
RDLVL_IDELAY_FINAL_BYTE3_BIT6 055
RDLVL_IDELAY_FINAL_BYTE3_BIT7 050
RDLVL_IDELAY_FINAL_BYTE4_BIT0 04b
RDLVL_IDELAY_FINAL_BYTE4_BIT1 04c
RDLVL_IDELAY_FINAL_BYTE4_BIT2 046
RDLVL_IDELAY_FINAL_BYTE4_BIT3 048
RDLVL_IDELAY_FINAL_BYTE4_BIT4 054
RDLVL_IDELAY_FINAL_BYTE4_BIT5 055
RDLVL_IDELAY_FINAL_BYTE4_BIT6 054
RDLVL_IDELAY_FINAL_BYTE4_BIT7 04f
RDLVL_IDELAY_FINAL_BYTE5_BIT0 044
RDLVL_IDELAY_FINAL_BYTE5_BIT1 049
RDLVL_IDELAY_FINAL_BYTE5_BIT2 04a
RDLVL_IDELAY_FINAL_BYTE5_BIT3 045
RDLVL_IDELAY_FINAL_BYTE5_BIT4 04d
RDLVL_IDELAY_FINAL_BYTE5_BIT5 052
RDLVL_IDELAY_FINAL_BYTE5_BIT6 04e
RDLVL_IDELAY_FINAL_BYTE5_BIT7 04b
RDLVL_IDELAY_FINAL_BYTE6_BIT0 03d
RDLVL_IDELAY_FINAL_BYTE6_BIT1 03e
RDLVL_IDELAY_FINAL_BYTE6_BIT2 039
RDLVL_IDELAY_FINAL_BYTE6_BIT3 03c
RDLVL_IDELAY_FINAL_BYTE6_BIT4 053
RDLVL_IDELAY_FINAL_BYTE6_BIT5 052
RDLVL_IDELAY_FINAL_BYTE6_BIT6 04d
RDLVL_IDELAY_FINAL_BYTE6_BIT7 04c
RDLVL_IDELAY_FINAL_BYTE7_BIT0 040
RDLVL_IDELAY_FINAL_BYTE7_BIT1 03f
RDLVL_IDELAY_FINAL_BYTE7_BIT2 040
RDLVL_IDELAY_FINAL_BYTE7_BIT3 03c
RDLVL_IDELAY_FINAL_BYTE7_BIT4 046
RDLVL_IDELAY_FINAL_BYTE7_BIT5 047
RDLVL_IDELAY_FINAL_BYTE7_BIT6 048
RDLVL_IDELAY_FINAL_BYTE7_BIT7 045
RDLVL_IDELAY_FINAL_BYTE8_BIT0 04b
RDLVL_IDELAY_FINAL_BYTE8_BIT1 050
RDLVL_IDELAY_FINAL_BYTE8_BIT2 051
RDLVL_IDELAY_FINAL_BYTE8_BIT3 04e
RDLVL_IDELAY_FINAL_BYTE8_BIT4 04a
RDLVL_IDELAY_FINAL_BYTE8_BIT5 04c
RDLVL_IDELAY_FINAL_BYTE8_BIT6 04d
RDLVL_IDELAY_FINAL_BYTE8_BIT7 04a
RDLVL_NQTR_CENTER_FINAL_NIBBLE0 064
RDLVL_NQTR_CENTER_FINAL_NIBBLE1 06b
RDLVL_NQTR_CENTER_FINAL_NIBBLE2 066
RDLVL_NQTR_CENTER_FINAL_NIBBLE3 06b
RDLVL_NQTR_CENTER_FINAL_NIBBLE4 062
RDLVL_NQTR_CENTER_FINAL_NIBBLE5 06c
RDLVL_NQTR_CENTER_FINAL_NIBBLE6 067
RDLVL_NQTR_CENTER_FINAL_NIBBLE7 069
RDLVL_NQTR_CENTER_FINAL_NIBBLE8 065
RDLVL_NQTR_CENTER_FINAL_NIBBLE9 05d
RDLVL_NQTR_CENTER_FINAL_NIBBLE10 05d
RDLVL_NQTR_CENTER_FINAL_NIBBLE11 05c
RDLVL_NQTR_CENTER_FINAL_NIBBLE12 061
RDLVL_NQTR_CENTER_FINAL_NIBBLE13 051
RDLVL_NQTR_CENTER_FINAL_NIBBLE14 054
RDLVL_NQTR_CENTER_FINAL_NIBBLE15 04f
RDLVL_NQTR_CENTER_FINAL_NIBBLE16 063
RDLVL_NQTR_CENTER_FINAL_NIBBLE17 06d
RDLVL_PQTR_CENTER_FINAL_NIBBLE0 064
RDLVL_PQTR_CENTER_FINAL_NIBBLE1 06a
RDLVL_PQTR_CENTER_FINAL_NIBBLE2 066
RDLVL_PQTR_CENTER_FINAL_NIBBLE3 068
RDLVL_PQTR_CENTER_FINAL_NIBBLE4 061
RDLVL_PQTR_CENTER_FINAL_NIBBLE5 06d
RDLVL_PQTR_CENTER_FINAL_NIBBLE6 067
RDLVL_PQTR_CENTER_FINAL_NIBBLE7 06c
RDLVL_PQTR_CENTER_FINAL_NIBBLE8 069
RDLVL_PQTR_CENTER_FINAL_NIBBLE9 060
RDLVL_PQTR_CENTER_FINAL_NIBBLE10 061
RDLVL_PQTR_CENTER_FINAL_NIBBLE11 061
RDLVL_PQTR_CENTER_FINAL_NIBBLE12 066
RDLVL_PQTR_CENTER_FINAL_NIBBLE13 056
RDLVL_PQTR_CENTER_FINAL_NIBBLE14 058
RDLVL_PQTR_CENTER_FINAL_NIBBLE15 058
RDLVL_PQTR_CENTER_FINAL_NIBBLE16 061
RDLVL_PQTR_CENTER_FINAL_NIBBLE17 06b
Hardware Measurements
No hardware measurements are available because no commands or data are sent to the memory during this stage. The algorithm only processes previously collected data.
Throughout calibration, read and write/read sanity checks are performed to ensure that as
each stage of calibration completes, proper adjustments and alignments are made allowing
writes and reads to be completed successfully. Sanity checks are performed as follows:
Each sanity check performed uses a different data pattern to expand the number of patterns
checked during calibration.
Notes:
1. For 3DS systems, the Write/Read Sanity Check 6 is repeated for each stack in a given rank. For each stack, the data
pattern is adjusted by adding 0x100 to the data pattern (as stored) for the base rank pattern. For example, rank 0, stack 0 would use data pattern 0xE5742542 as shown in the table, while for rank 0, stack 1 the pattern would be 0xE5742642 (and show up as 83E0_4ED8 on the DQ bus).
Data swizzling (bit reordering) is completed within the UltraScale PHY. Therefore, the data
visible on BUS_DATA_BURST and a scope in hardware is ordered differently compared to
what would be seen in ChipScope. Figures are examples of how the data is converted for the
sanity check data patterns.
Figure 38-97: Expected Read Pattern of DQS Gate and Read Sanity Checks
Debug
To determine the status of each sanity check, analyze the Memory IP Status window to
view the completion of each check. Click the sanity check of interest to view the specific
results within the Memory IP Properties window. The message displayed in Memory IP
Properties identifies how the stage failed or notes if it passed successfully.
Figure 38-104: Memory IP XSDB Debug GUI Example – Write and Read Sanity Checks
Table 38-40 shows the signals and values used to help determine which bytes the error
occurred on, as well as to provide some data returned for comparison with the expected
data pattern. These values can be found within the Memory IP Core Properties within the
Hardware Manager or by executing the Tcl commands noted in the XSDB Debug section.
Hardware Measurements
The calibration status bits (cal_r*_status) can be used as hardware triggers to capture
the write (when applicable) and read command and data on the scope. The entire interface
is checked with one write followed by one read command, so any bytes or bits that need to
be probed can be checked on a scope. The cal_r*_status triggers are as follows for the
independent sanity checks:
VT Tracking
Tracking Overview
Calibration occurs one time at start-up, at a set voltage and temperature, to ensure reliable capture of the data, but during normal operation the voltage and temperature can change or drift. Voltage and temperature (VT) changes can alter the relationship between DQS and DQ used for read capture and change the time at which the DQS/DQ arrive at the FPGA as part of a read.
The arrival of the DQS at the FPGA as part of a read is calibrated at start-up, but as VT changes the time at which the DQS arrives can change. DQS gate tracking monitors the arrival of the DQS with a signal from the XIPHY and makes small adjustments as required if the DQS arrives earlier or later than a sampling clock in the XIPHY. This adjustment is recorded as shown in Table 38-41.
Debug
DQS_TRACK_COARSE_BYTE* One per Byte Last recorded value for DQS gate coarse setting.
DQS_TRACK_FINE_BYTE* One per Byte Last recorded value for DQS gate fine setting.
DQS_TRACK_COARSE_MAX_BYTE* One per Byte Maximum coarse tap recorded during DQS gate tracking.
DQS_TRACK_FINE_MAX_BYTE* One per Byte Maximum fine tap recorded during DQS gate tracking.
DQS_TRACK_COARSE_MIN_BYTE* One per Byte Minimum coarse tap recorded during DQS gate tracking.
DQS_TRACK_FINE_MIN_BYTE* One per Byte Minimum fine tap recorded during DQS gate tracking.
BISC_ALIGN_PQTR One per nibble Initial 0° offset value provided by BISC at power-up.
BISC_ALIGN_NQTR One per nibble Initial 0° offset value provided by BISC at power-up.
Expected Results
DQS_TRACK_COARSE_MAX_RANK0_BYTE0 string true true 007
DQS_TRACK_COARSE_MAX_RANK0_BYTE1 string true true 006
DQS_TRACK_COARSE_MAX_RANK0_BYTE2 string true true 007
DQS_TRACK_COARSE_MAX_RANK0_BYTE3 string true true 007
DQS_TRACK_COARSE_MAX_RANK0_BYTE4 string true true 008
DQS_TRACK_COARSE_MAX_RANK0_BYTE5 string true true 008
DQS_TRACK_COARSE_MAX_RANK0_BYTE6 string true true 008
DQS_TRACK_COARSE_MAX_RANK0_BYTE7 string true true 008
DQS_TRACK_COARSE_MAX_RANK0_BYTE8 string true true 008
DQS_TRACK_COARSE_MIN_RANK0_BYTE0 string true true 006
DQS_TRACK_COARSE_MIN_RANK0_BYTE1 string true true 006
DQS_TRACK_COARSE_MIN_RANK0_BYTE2 string true true 007
DQS_TRACK_COARSE_MIN_RANK0_BYTE3 string true true 007
DQS_TRACK_COARSE_MIN_RANK0_BYTE4 string true true 008
DQS_TRACK_COARSE_MIN_RANK0_BYTE5 string true true 008
DQS_TRACK_COARSE_MIN_RANK0_BYTE6 string true true 008
DQS_TRACK_COARSE_MIN_RANK0_BYTE7 string true true 007
DQS_TRACK_COARSE_MIN_RANK0_BYTE8 string true true 007
DQS_TRACK_COARSE_RANK0_BYTE0 string true true 007
DQS_TRACK_COARSE_RANK0_BYTE1 string true true 006
DQS_TRACK_COARSE_RANK0_BYTE2 string true true 007
DQS_TRACK_COARSE_RANK0_BYTE3 string true true 007
DQS_TRACK_COARSE_RANK0_BYTE4 string true true 008
DQS_TRACK_COARSE_RANK0_BYTE5 string true true 008
DQS_TRACK_COARSE_RANK0_BYTE6 string true true 008
DQS_TRACK_COARSE_RANK0_BYTE7 string true true 008
DQS_TRACK_COARSE_RANK0_BYTE8 string true true 007
DQS_TRACK_FINE_MAX_RANK0_BYTE0 string true true 02d
DQS_TRACK_FINE_MAX_RANK0_BYTE1 string true true 02d
DQS_TRACK_FINE_MAX_RANK0_BYTE2 string true true 027
DQS_TRACK_FINE_MAX_RANK0_BYTE3 string true true 01a
DQS_TRACK_FINE_MAX_RANK0_BYTE4 string true true 021
DQS_TRACK_FINE_MAX_RANK0_BYTE5 string true true 020
DQS_TRACK_FINE_MAX_RANK0_BYTE6 string true true 012
DQS_TRACK_FINE_MAX_RANK0_BYTE7 string true true 02e
DQS_TRACK_FINE_MAX_RANK0_BYTE8 string true true 02e
BISC VT Tracking
The change in the relative delay through the FPGA for the DQS and DQ is monitored in the XIPHY, and adjustments are made to the delays to account for the change in resolution of the delay elements. The changes in the delays are recorded in the XSDB.
Debug
VT_TRACK_PQTR_NIBBLE* One per nibble PQTR position last read during BISC VT Tracking.
VT_TRACK_NQTR_NIBBLE* One per nibble NQTR position last read during BISC VT Tracking.
VT_TRACK_PQTR_MAX_NIBBLE* One per nibble Maximum PQTR value found during BISC VT Tracking.
VT_TRACK_NQTR_MAX_NIBBLE* One per nibble Maximum NQTR value found during BISC VT Tracking.
VT_TRACK_PQTR_MIN_NIBBLE* One per nibble Minimum PQTR value found during BISC VT Tracking.
VT_TRACK_NQTR_MIN_NIBBLE* One per nibble Minimum NQTR value found during BISC VT Tracking.
RDLVL_PQTR_CENTER_FINAL_NIBBLE* One per nibble Final PQTR position found during calibration.
RDLVL_NQTR_CENTER_FINAL_NIBBLE* One per nibble Final NQTR position found during calibration.
BISC_ALIGN_PQTR One per nibble Initial 0° offset value provided by BISC at power-up.
BISC_ALIGN_NQTR One per nibble Initial 0° offset value provided by BISC at power-up.
Expected Results
To see where the PQTR and NQTR positions have moved since calibration, compare the
VT_TRACK_PQTR_NIBBLE* and VT_TRACK_NQTR_NIBBLE* XSDB values to the final calibrated
positions which are stored in RDLVL_PQTR_CENTER_FINAL_NIBBLE* and
RDLVL_NQTR_CENTER_FINAL_NIBBLE*.
To see how much movement the PQTR and NQTR taps exhibit over environmental changes,
monitor:
VT_TRACK_PQTR_NIBBLE*
VT_TRACK_NQTR_NIBBLE*
VT_TRACK_PQTR_MAX_NIBBLE*
VT_TRACK_NQTR_MAX_NIBBLE*
VT_TRACK_PQTR_MIN_NIBBLE*
VT_TRACK_NQTR_MIN_NIBBLE*
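A minimal sketch of the comparison described above; the two values would be read from the XSDB output for the nibble of interest, and the numbers here are hypothetical:
set cal_center 0x64   ;# RDLVL_PQTR_CENTER_FINAL_NIBBLE0 (final calibrated position)
set vt_now     0x66   ;# VT_TRACK_PQTR_NIBBLE0 (position last read during BISC VT tracking)
set drift_taps [expr {$vt_now - $cal_center}]
puts "PQTR drift since calibration: $drift_taps taps"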
Calibration Times
Calibration time depends on a number of factors, such as:
Table 38-43 gives an example of calibration times for a DDR memory interface.
Each entry lists the data rate followed by the calibration time in parentheses.
DDR3, x8 components, 72-bit: 2,133 (0.83), 1,866 (1.10), 1,600 (0.75), 1,333 (1.04), 1,066 (1.56), 800 (0.84)
DDR3, Dual-Rank SO-DIMM x8, 72-bit: 1,600 (0.85), 1,333 (1.15)
DDR3, Dual-Rank RDIMM x8, 72-bit: 1,600 (0.94), 1,333 (1.28)
2,400 (0.61), 2,133 (0.79), 1,600 (0.73), 1,333 (1.06)
2,133 (0.83), 1,333 (1.15)
2,133 (0.93), 1,333 (1.22)
Dual-Rank RDIMM x4, 72-bit: 1,600 (0.88), 1,333 (1.18)
As with calibration error debug, the General Checks section should be reviewed. Strict
adherence to proper board design is critical in working with high speed memory interfaces.
Violation of these general checks is often the root cause of data errors.
When data errors are seen during normal operation, the Memory IP Advanced Traffic
Generator (ATG) should be used to replicate the error. The ATG is a verified solution that can
be configured to send a wide range of data, address, and command patterns. It additionally
presents debug status information for general memory traffic debug post calibration. The
ATG stores the write data and compares it to the read data. This allows comparison of
expected and actual data when errors occur. This is a critical step in data error debug, as this section describes in detail.
ATG Setup
The default ATG configuration exercises predefined traffic instructions which are included
in the mem_v1_2_tg_instr_bram.sv module. To move away from the default
configuration and use the ATG for data error debug, use the provided VIO and ILA cores
that are generated with the example design. For more information, see the Using VIO to
Control ATG in Chapter 36, Traffic Generator.
This document assumes debug using “Direct Instruction through VIO.” The same concepts
extend to both “Instruction Block RAM” and “Program Instruction Table.” “Direct Instruction
through VIO” is enabled using vio_tg_direct_instr_en. After
vio_tg_direct_instr_en is set to 1, all of the traffic instruction fields can be driven by
the targeted traffic instruction.
ATG identifies whether a traffic error is a Read or Write Error when vio_tg_err_chk_en is set to 1. Assume EXP_WR_DATA is the expected write data. After the first traffic error is seen from a read (with a value of EXP_WR_DATA’), ATG issues multiple read commands to the failed memory address. If all reads return data EXP_WR_DATA’, ATG classifies the error as a WRITE_ERROR(0). Otherwise, ATG classifies the error as a READ_ERROR(1). ATG also tracks the first error bit and the first error address seen.
Example 1: The following VIO setting powers on Read/Write Error Type check.
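As a hedged illustration, the same control can also be driven from the Hardware Manager Tcl console. This minimal sketch assumes the example design VIO core is the only VIO on the device and that the probe is visible under the name used in this section (the search pattern and list index are assumptions):
set atg_vio [lindex [get_hw_vios -of_objects [current_hw_device]] 0]
set_property OUTPUT_VALUE 1 [get_hw_probes *vio_tg_err_chk_en* -of_objects $atg_vio]
commit_hw_vio $atg_vio   ;# drive the new value onto the VIO output probe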
When vio_tg_err_chk_en is set to 1, ATG stops after the first error. When vio_tg_err_chk_en is set to 0, ATG does not stop after the first error and instead tracks errors continuously using vio_tg_status_err_bit_valid, vio_tg_status_err_bit, and vio_tg_status_err_addr.
Example 2: The following VIO setting powers off Read/Write Error Type check:
Figure 38-107 shows six addresses with read errors (this is the same example used with “Write Error” earlier; “Write Error” is not reported here because vio_tg_err_chk_en is disabled):
ATG expects the application interface to accept a command within a certain wait time. ATG
also looks for the application interface to return data within a certain wait time after a read
command is issued. If either case is violated, ATG flags a WatchDog Hang.
Example 3: The following example shows ATG asserting WatchDogHang. This example shares the same VIO control setting as Example 2. In this example, vio_tg_status_state shows a “DNWait” state; hence, ATG is waiting for read data to return.
With Linear Data, Figure 38-109 shows that when an error is detected, the read data (vio_tg_status_read_bit) is one request ahead of the expected data (vio_tg_status_exp_bit). One possibility is that the read command with address 0x1B0 was dropped, so the next returned data (read address 0x1B8) is compared against the expected data for read address 0x1B0.
Figure 38-109: ATG Debug Watchdog Hang Waveform with Linear Data
Using either the Advanced Traffic Generator or the user design, the first step in data error
debug is to isolate when and where the data errors occur. To perform this, the expected data
and actual data must be known and compared. Looking at the data errors, the following
should be identified:
° For designs that can support multiple varieties of DIMM modules, all possible address and bank bit combinations should be supported.
• Do the errors only occur for certain data patterns or sequences?
° This can indicate a shorted or open connection on the PCB. It can also indicate an
SSO or crosstalk issue.
• Determine the frequency and reproducibility of the error
The next step is to isolate whether the data corruption is due to writes or reads.
Determining whether a data error is due to the write or the read can be difficult because if
writes are the cause, read back of the data is bad as well. In addition, issues with control or
address timing affect both writes and reads.
• If the errors are intermittent, issue a small initial number of writes, followed by
continuous reads from those locations. If the reads intermittently yield bad data, there
is a potential read issue. If the reads always yield the same (wrong) data, there is a write
issue.
• Using high quality probes and scope, capture the write at the memory and the read at
the FPGA to view data accuracy, appropriate DQS-to-DQ phase relationship, and signal
integrity. To ensure the appropriate transaction is captured on DQS and DQ, look at the
initial transition on DQS from 3-state to active. During a Write, DQS does not have a
low preamble. During a read, the DQS has a low preamble. The following is an example
of a DDR3 Read and a Write to illustrate the difference:
X-Ref Target - Figure 38-110
° Check the PQTR/NQTR values after calibration. Look for variations between PQTR/
NQTR values. PQTR/NQTR values should be very similar for DQs in the same DQS
group.
The XSDB output can be used to determine the available read and write margins during
calibration. Starting with 2014.3, an XSDB Memory IP GUI is available through the Hardware
Manager to view the read calibration margins for both the rising edge clock and the falling edge
clock. The margins are provided for both simple and complex pattern calibration. The
complex pattern results are more representative of the margin expected during post
calibration traffic.
X-Ref Target - Figure 38-111
The following Tcl command can also be used when the Hardware Manager is open to get an
output of the window values:
report_hw_mig [get_hw_migs]
Table 38-48: Signals of Interest for Read and Write Margin Analysis
MARGIN_CONTROL (Per Interface): Reserved.
MARGIN_STATUS (Per Interface): Reserved.
RDLVL_MARGIN_PQTR_LEFT_RANK*_BYTE*_BIT* (Per Bit): Number of taps from the center of the window to the left edge.
RDLVL_MARGIN_NQTR_LEFT_RANK*_BYTE*_BIT* (Per Bit): Number of taps from the center of the window to the left edge.
RDLVL_MARGIN_PQTR_RIGHT_RANK*_BYTE*_BIT* (Per Bit): Number of taps from the center of the window to the right edge.
RDLVL_MARGIN_NQTR_RIGHT_RANK*_BYTE*_BIT* (Per Bit): Number of taps from the center of the window to the right edge.
WRITE_DQS_DQ_MARGIN_LEFT_RANK*_BYTE*_BIT* (Per Bit): Number of taps from the center of the window to the left edge.
WRITE_DQS_DQ_MARGIN_RIGHT_RANK*_BYTE*_BIT* (Per Bit): Number of taps from the center of the window to the right edge.
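As a sketch of how the Table 38-48 values relate to a window size, the left and right margins for a bit can be summed and scaled by the tap resolution. The tap counts and resolution below are hypothetical; see Determining Window Size in ps, page 773 for the complete procedure:
set tap_res_ps 3.4    ;# estimated tap resolution, for example from the BISC values
set left_taps  25     ;# e.g., RDLVL_MARGIN_PQTR_LEFT_RANK0_BYTE0_BIT0
set right_taps 27     ;# e.g., RDLVL_MARGIN_PQTR_RIGHT_RANK0_BYTE0_BIT0
set window_ps  [expr {($left_taps + $right_taps) * $tap_res_ps}]   ;# approximately 176.8 ps
puts "Approximate read window: [format %.1f $window_ps] ps"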
When data errors occur, the results of calibration should be analyzed to ensure that the
results are expected and accurate. Each of the debugging calibration sections notes what
the expected results are such as how many edges should be found, how much variance
across byte groups should exist, etc. Follow these sections to capture and analyze the
calibration results.
However, within a specific process, each tap within the delay chain has the same precise resolution.
BISC is run on a per nibble basis for both PQTR and NQTR. The write tap results are given on
a per byte basis. To use the BISC results to determine the write window, take the average of
the BISC PQTR and NQTR results for each nibble. For example, ((BISC_NQTR_NIBBLE0 +
BISC_NQTR_NIBBLE1 + BISC_PQTR_NIBBLE0 + BISC_PQTR_NIBBLE1) / 4).
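A minimal sketch of the per-byte averaging described above, with hypothetical nibble readings:
set bisc_pqtr_nibble0 0x5e
set bisc_pqtr_nibble1 0x60
set bisc_nqtr_nibble0 0x5f
set bisc_nqtr_nibble1 0x61
set byte_avg [expr {($bisc_nqtr_nibble0 + $bisc_nqtr_nibble1 + $bisc_pqtr_nibble0 + $bisc_pqtr_nibble1) / 4.0}]
puts "Per-byte BISC average: $byte_avg taps"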
Conclusion
If this document does not help to resolve calibration or data errors, create a WebCase with
Xilinx Technical Support (see Technical Support). Attach all of the captured waveforms,
XSDB and debug signal results, and the details of your investigation and analysis.
Upgrading
There are no port or parameter changes for upgrading the Memory IP core in the Vivado
Design Suite at this time.
For general information on upgrading the Memory IP, see the “Upgrading IP” section in
Vivado Design Suite User Guide: Designing with IP (UG896) [Ref 14].
XCKU095/XCVU095 Recommended
Memory Pinout Configurations
Introduction
The UltraScale™ devices, XCKU095 and XCVU095, have only one clock region between the
two I/O columns in the center of the device, which might require special pinout
considerations. Other devices in the UltraScale and UltraScale+™ families do not require
special pinout considerations because they have two or more clock regions between the
I/O columns.
During implementation, a large proportion of the user logic needs to be placed in the
center of the device for connectivity and timing reasons. The reduced space between the
I/O columns in conjunction with the presence of several Memory Interface IPs, or any large
high performance I/O modules, can increase the placement complexity and challenge
routing resources. Following the guidelines in this section ensures the most efficient use of
available routing resources for faster and predictable timing closure.
For architectural and performance reasons, the memory interface logic needs to be placed
in the clock regions located on the right-hand side of the I/Os. The memory interface
controller logic is usually placed next to the Address/Command I/Os. A high overall device
utilization or user floorplanning constraints in the area next to the Address/Command I/Os
can result in reduced available routing resources.
When placing two Memory Interface IPs side-by-side with the Address/Command I/Os
located on the same clock region row, several adjacent clock regions become highly utilized
which limits the amount of user logic that can cross over or be placed in the same area.
When the interfaces are vertically shifted by one or more I/O banks, these placement and routing challenges become less common.
In addition to memory interface pin planning, migration to the 2015.3 or later version of the
Memory Interface IP helps with timing closure due to updates to the IP clocking and
constraints. The next section discusses pinout options for different packages that result in the most efficient use of available routing resources.
The maximum number of possible 72-bit DDR4 memory interfaces in each package is used to illustrate the pin placement suggestions. This is just an example; the goal is to offset the memory interfaces or, at the very least, offset the Address/Command I/O banks. The double-headed arrow in each figure represents the routing channel that is created by offsetting the Address/Command banks.
Figure (X15878-011416): Recommended bank placement for Memory Interface A and Memory Interface B with offset Address/Command banks.
Figure (X15879-011416): Recommended bank placement for Memory Interface A and Memory Interface B with offset Address/Command banks.
Figure (X15880-011416): Recommended bank placement for Memory Interface A and Memory Interface B with offset Address/Command banks.
Figure (X15881-011416): Recommended bank placement for Memory Interface B and Memory Interface C with offset Address/Command banks.
Figure (X15882-011416): Recommended bank placement for Memory Interface B and Memory Interface C with offset Address/Command banks.
Additional Recommendations
1. Migrate to 2015.3 or later version of the Vivado Design Suite:
a. Take advantage of the Quality of Results (QoR) improvements from newer releases.
b. Upgrade the Memory Interface IPs to benefit from the clocking and constraint
improvements.
2. Offset the placement of the Address/Command banks in horizontally adjacent interfaces
by at least one clock region to make routing resources available to user logic. The
placement of the Address/Command bank within a 72-bit three bank interface depends
on whether it is a DIMM or component interface.
For a DIMM interface, Xilinx recommends placing the Address/Command bank in the
center as shown in Figure B-5 for Memory Interface C. This placement enables better
PCB routing from the FPGA to the DIMM socket as shown in Figure B-7.
X-Ref Target - Figure B-6
Figure B-6: Address/Command Placement Recommendation for Five Components with Fly-By Topology
X-Ref Target - Figure B-7
Figure B-7: PCB Routing from the FPGA (Address/Command Bank in the Center) to the DIMM (X15884-011416)
3. Avoid high device utilization, especially for LUTs, because the logic needs room to spread out in areas of high-density placement.
4. Design top-level connectivity to minimize crossings over the Memory Interface IPs.
5. Force spreading of the memory interface logic placement over a wider area by using pblock constraints.
a. By default, the memory interface logic is only placed in the clock regions that
include the I/O columns.
b. Use a two-clock-region-wide pblock for the Memory Interface IPs located on the right I/O columns.
c. Do not apply this technique to the Memory Interface IPs located on the left I/O
columns.
For example, consider an XCVU095 design with four wide Memory Interface IPs. Only two of them can have their placement relaxed: Memory Interface 2 and Memory Interface 3.
X-Ref Target - Figure B-8
Figure B-8: XCVU095 Example with Four Memory Interface IPs (Memory Interface 0 Through 3) and Their Address/Command Banks (X15885-011516)
create_pblock MemoryInterface2_pblock
# Allow the Memory Interface 2 logic to spread over a two-clock-region-wide area
resize_pblock MemoryInterface2_pblock -add CLOCKREGION_X3Y1:CLOCKREGION_X4Y3
add_cells_to_pblock MemoryInterface2_pblock [get_cells a/b/mig_2]
create_pblock MemoryInterface3_pblock
# Allow the Memory Interface 3 logic to spread over a two-clock-region-wide area
resize_pblock MemoryInterface3_pblock -add CLOCKREGION_X3Y5:CLOCKREGION_X4Y7
add_cells_to_pblock MemoryInterface3_pblock [get_cells a/b/mig_3]
6. When migrating, ensure banks selected in one device exist in the target device. See the
"Migration between UltraScale Devices and Packages" chapter in the UltraScale Architecture
PCB Design and Pin Planning User Guide (UG583) [Ref 11].
Xilinx Resources
For support resources such as Answers, Documentation, Downloads, and Forums, see
Xilinx Support.
• From the Vivado ® IDE, select Help > Documentation and Tutorials.
• On Windows, select Start > All Programs > Xilinx Design Tools > DocNav.
• At the Linux command prompt, enter docnav.
Xilinx Design Hubs provide links to documentation organized by design tasks and other
topics, which you can use to learn key concepts and address frequently asked questions. To
access the Design Hubs:
• In the Xilinx Documentation Navigator, click the Design Hubs View tab.
• On the Xilinx website, see the Design Hubs page.
Note: For more information on Documentation Navigator, see the Documentation Navigator page
on the Xilinx website.
References
These documents provide supplemental material useful with this product guide:
1. JESD79-3F, DDR3 SDRAM Standard, JESD79-4, DDR4 SDRAM Standard, and JESD209-3C,
LPDDR3 SDRAM Standard, JEDEC Solid State Technology Association
2. Kintex UltraScale FPGAs Data Sheet: DC and AC Switching Characteristics (DS892)
3. Virtex UltraScale FPGAs Data Sheet: DC and AC Switching Characteristics (DS893)
4. Kintex UltraScale+ FPGAs Data Sheet: DC and AC Switching Characteristics (DS922)
5. Virtex UltraScale+ FPGAs Data Sheet: DC and AC Switching Characteristics (DS923)
6. Zynq UltraScale+ MPSoC Data Sheet: DC and AC Switching Characteristics (DS925)
7. UltraScale Architecture SelectIO Resources User Guide (UG571)
8. UltraScale Architecture Clocking Resources User Guide (UG572)
9. Vivado Design Suite Properties Reference Guide (UG912)
10. UltraScale Architecture Soft Error Mitigation Controller LogiCORE IP Product Guide
(PG187)
11. UltraScale Architecture PCB Design and Pin Planning User Guide (UG583)
12. Arm AMBA Specifications
13. Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
14. Vivado Design Suite User Guide: Designing with IP (UG896)
15. Vivado Design Suite User Guide: Getting Started (UG910)
16. Vivado Design Suite User Guide: Logic Simulation (UG900)
17. Vivado Design Suite User Guide: Implementation (UG904)
18. Vivado Design Suite User Guide: I/O and Clock Planning (UG899)
19. Vivado Design Suite User Guide: Release Notes, Installation, and Licensing (UG973)
20. Vivado Design Suite User Guide: Programming and Debugging (UG908)
21. UltraScale Maximum Memory Performance Utility (XTP414)
22. Vivado Design Suite User Guide: Using the Vivado IDE (UG893)
23. Fast Calibration and Daisy Chaining Functions in DDR4 Memory Interfaces Application
Note (XAPP1321)
Revision History
The following table shows the revision history for this document.
Continued RLDRAM 3
• Updated bits in Feature Summary section.
• Updated description in Memory Controller section.
• Added Important Note in Pin and Bank Rules section and #16 description.
Traffic Generator
• Updated Advanced Traffic Generator section.
Multiple IP Cores
• Added Important note in Sharing of a Bank section.
Debugging
• Added steps in Understanding Calibration Warnings (Cal_warning) section.
• Updated Signal Width to 127 in cal_r*_status in DDR3/DDR4 Debug Signals
Used in Vivado Design Suite Debug Feature table.
• Updated Debugging Data Errors section.