Notes of Digital Electronics
Professor: Maurizio Martina
Student: Nicola Antonio Travaglini 235881
2 Processor architecture
2.1 Introduction
2.2 Basic architecture
2.2.1 Working principles
2.2.2 Memory access
2.2.3 Instructions
2.2.4 Architectural Considerations
2.2.5 Pipelining
2.2.6 Two memory architecture
2.2.7 Performance improvement methods
3 Peripherals
3.1 Introduction
3.2 I/O extensions
3.3 Memory-mapped and standard I/O
3.4 Microprocessor interfacing: interrupts
3.5 Management of processor registers
3.6 Maskable/non-maskable interrupts
3.7 Peripheral data managing
3.8 Timer
4 Memories
4.1 Categories
4.1.1 Definitions
4.2 General organization
4.3 Static-RAM (SRAM)
4.3.1 SRAM-cell analysis
4.3.2 6T: structure
4.4 Dual Port SRAM cell
4.5 SRAM timing
4.5.1 Read
4.6 Synchronous SRAM (SSRAM)
4.6.1 Reading
4.7 DRAM
4.7.1 Refresh
4.7.2 Accessing data
4.7.3 Timing: read
4.7.4 Speed
4.7.5 Refresh handling
4.7.6 Writing
4.7.7 SDRAM
4.8 CACHE
4.8.1 Cache organization
4.8.2 Direct mapping cache
4.8.3 Fully Associative Cache
4.8.4 N-way Set Associative Cache
4.8.5 What to do if a miss happens?
4.8.6 Writing
4.8.7 Write-back strategy
4.8.8 Write miss event
4.9 Non volatile memories
4.9.1 Read Only Memory (ROM)
4.9.2 MOS-based ROM
4.9.3 The MOS threshold voltage
4.9.4 Floating Gate Structures
4.10 Flash memories
4.10.1 Architectures
4.10.2 Cell sensing
4.10.3 NOR-flash
4.10.4 NAND-flash
4.10.5 NOR Vs NAND
4.10.6 Interface
4.10.7 Reliability
4.10.8 Wear levelling
5 Interfacing
5.1 Introduction
5.1.1 L-H transition
5.1.2 H-L transition
5.1.3 Minimum period
5.2 Lumped model
5.3 Transmission lines
5.3.1 Multiple reflection lattice case
5.3.2 Loading the line with a capacitor
5.4 Matching
5.4.1 Incident Wave Switching (IWS)
5.4.2 Reflected Wave Switching (RWS)
6 Serial Communications
6.1 Serial and parallel transmission
6.1.1 Parallel Connection
6.1.2 Serial link
6.2 Communication glossary
6.2.1 Basic serial connection system
6.3 Asynchronous and synchronous links
6.3.1 Link glossary
6.3.2 Terminology
6.3.3 Serial asynchronous protocol
6.3.4 Serial Synchronous Protocols
Chapter 1
Thanks to De Morgan's Law, we can obtain an AND gate by simply adding an inverter at each of the inputs A and B, or an OR gate by inserting an inverter at the output O.
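As a quick check of this De Morgan construction, here is a minimal sketch (in Python, not part of the original notes) that builds NOT, AND and OR out of a single NOR primitive and prints their truth tables:

    def NOR(a, b):
        return int(not (a or b))

    def NOT(a):           # a NOR with both inputs tied together
        return NOR(a, a)

    def OR(a, b):         # inverter at the output O of the NOR
        return NOT(NOR(a, b))

    def AND(a, b):        # inverters at the inputs A and B (De Morgan)
        return NOR(NOT(a), NOT(b))

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "NOR:", NOR(a, b), "AND:", AND(a, b), "OR:", OR(a, b))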
These examples show that it is easy to obtain a NOR gate and, from it, an AND or an OR gate. Is it possible to make the circuit more flexible, so that the function performed is programmable? The answer is yes: if we replace each transistor in our circuit with a pair made of a transistor and a switch, the structure becomes programmable.
Depending on which of the AND/OR logic arrays is programmable, we have three basic
organizations:
1.2.1 PROM
The AND array is fixed (it decodes all the input combinations), while the OR array is programmable. If we increase the number of inputs, the number of AND gates becomes quite high, since it must cover all the possible input combinations (2^n gates for n inputs).
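To make the organization concrete, here is an illustrative sketch (mine, with hypothetical names): the fixed AND plane is modelled as a full decoder producing every minterm, and the programmable OR plane as one bitmask per output, where bit i of the mask selects minterm i:

    def prom(or_plane, inputs):
        # Fixed AND plane: the input combination selects exactly one minterm.
        minterm = 0
        for bit in inputs:                  # MSB first
            minterm = (minterm << 1) | bit
        # Programmable OR plane: one mask per output, bit i = minterm i.
        return [(mask >> minterm) & 1 for mask in or_plane]

    # Example programming: a 2-input half adder (sum = minterms 1,2; carry = minterm 3).
    SUM, CARRY = 0b0110, 0b1000
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, prom([SUM, CARRY], [a, b]))

Note how the AND plane always has 2^n gates regardless of the functions implemented: only the OR masks change.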
1.2.2 PLA
It is the most flexible organization: both the AND and the OR gates are programmable. Here, the best way to proceed is to reuse product terms. Let us now consider an example:
1.2.3 PAL
The high flexibility of PLAs is paid in terms of area and, as a consequence, of integration density and cost. For these reasons, in the past, some solutions involving PALs were exploited for programmable arrays (originally PROMs were used just as memories). With PALs, the AND array is programmable while the OR array is fixed at fabrication: a given column of the OR array has access to only a subset of the possible product terms.
PALs are more restricted than PLAs (we trade the number of OR terms against the number of outputs). Many device variations are needed, but each device is cheaper than a PLA.
We don't use all of the combinations of BCD, just 0000 to 1001. All the other configurations (1010 to 1111) are unused, so from the logic function point of view we can use them as we prefer, for example with the goal of minimizing the complexity of the logic. Gray code is a way of coding sequences of bits such that two neighbouring sequences differ in just one bit.
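As an aside, the standard binary-to-Gray conversion is g = b xor (b >> 1) (a well-known formula, not stated in the notes); a quick sketch verifying the one-bit-difference property:

    def to_gray(n):
        return n ^ (n >> 1)

    codes = [to_gray(i) for i in range(8)]
    for prev, curr in zip(codes, codes[1:]):
        # neighbouring codewords differ in exactly one bit
        assert bin(prev ^ curr).count("1") == 1
    print([format(c, "03b") for c in codes])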
In this example we can see that with a PAL we can perform all the logic functions we want, but we need to perform some logic minimization because the OR plane is fixed (in this case we added some zeros). A PLA would give greater flexibility; here we don't have it, but we save area.
PALs are a nice way of building combinational functions. In practice it is useful to implement sequential circuits too, but PALs and PLAs alone cannot provide sequential logic. By combining PALs with FFs we can obtain programmable sequential circuits: with a FF at the output of each OR gate and a multiplexer, we can choose between a pure combinational function and a sequential one. The output of the FF is fed back. This structure is used in CPLDs.
Complex PLDs are the natural evolution of PALs and PLAs into modern programmable devices. They include several PAL-like blocks (their number depends on the size of the chip, and thus on the cost of the CPLD) connected through interconnection wires, plus some I/O blocks to connect the programmable logic to the external world.
Figure 1.6 shows the architecture of the Xilinx CoolRunner. This architecture includes some PLAs, and the core of the device (the ability to create programmable logic) is handled by function blocks, each made of 16 macrocells. Each macrocell is almost like a PLA.
1.2.5 FPGA
FPGAs have much more logic than CPLDs. FPGAs can be RAM-based or Flash-based:
• RAM FPGAs must be programmed at power-on;
• Flash FPGAs store their configuration data in non-volatile memory.
The basic idea of the FPGA is different from PLAs and PALs, since it exploits a multiplexer-based approach: the output is a function of the selector. This can be made programmable by changing the values of A and B (the inputs). Acting on the value of the selector and of the inputs, we can obtain any logic function of these parameters. By using more muxes, we can build more complex structures.
This is the basic architecture of an FPGA: there is a table where we can store constants used to produce a logic function as an output. This table is usually called a LUT. The logic implementing combinational functions is made of LUTs (no more AND or OR planes), then the usual FF is used for the sequential part (and a mux chooses between the combinational and the sequential path).
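A minimal behavioural model of this idea (an illustrative sketch, not from the notes): a k-input LUT is a table of 2^k configuration bits, and the inputs act as the selector of a multiplexer picking one of those bits:

    def lut(config_bits, inputs):
        # The inputs (MSB first) form the mux selector into the config table.
        sel = 0
        for bit in inputs:
            sel = (sel << 1) | bit
        return config_bits[sel]

    # "Program" a 3-input LUT as a majority function by listing its truth table.
    MAJ = [0, 0, 0, 1, 0, 1, 1, 1]   # index = 4*a + 2*b + c
    print(lut(MAJ, [1, 0, 1]))       # -> 1

Reprogramming the cell means rewriting config_bits; the hardware itself never changes.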
The typical organization of an FPGA is a bidimensional array, where we can find:
• CLBs (Configurable Logic Blocks) containing combinational functions and FFs;
• IOBs: used to connect the outside world to the logic array, they usually have some
additional sequential element in the block.
Cyclone V - SoC Cyclone V is a family of FPGAs by Intel (formerly Altera). The chip is rather complex: it includes PLLs, a lot of elements for logic operations, memories, etc.
Each logic array block (LAB) contains dedicated logic for driving the control signals to its ALMs; it also has two unique clock sources and three clock enable signals.
The ALMs can be seen as an advanced version of the basic architecture seen before: instead of a simple LUT there is an adaptive LUT, which can use up to 8 inputs. To implement arithmetic operations effectively, some full adders (with optimized connections for carry propagation) are included, together with multiplexers and registers (for sequential logic).
Cyclone V – memory
Two types of memory blocks:
– M10K: dedicated memory resources (10 kb each); it can be programmed with different depths and widths. The 10 kb in the name also counts the parity bits;
– Memory Logic Array Blocks (MLABs, 640 bits each): fixed depth (32 entries) and programmable width (from 1 bit up to 20 bits).
Usually, in integrated circuit technology, the choice of the memory type depends on the amount of complexity that can be handled. On an FPGA, instead, the memories are already there and we just configure them.
Simplified I/O
With OE disabled we can get an input from the external world, otherwise we can send an output to the external world. The OE and Output registers can be used to control the critical path, while the Input register can be used when the delay of the input signal is unknown.
Chapter 2
Processor architecture
2.1 Introduction
- Special-Purpose: used in dedicated devices that are designed to perform only specific operations;
- General-Purpose: performs generic operations, can be used in many fields and is more flexible.
We will focus on general-purpose processors. This kind of processor is designed for a variety of computation tasks and is characterized by:
- low unit cost, in part because the manufacturer spreads NRE over large numbers of units;
- careful design, since a higher NRE is acceptable (it can yield good performance, size and power);
- user appeal.
2.2 Basic architecture
The basic idea is to have a piece of hardware able to do two things: read instructions and
process binary data. This translates into having three elements:
• Datapath (DP): portion of hardware able to do computations, basically arithmetic and
logic operations;
• Control unit (CU): drives the datapath in the correct way (the CU does not store the algorithm);
• Memory: stores instructions and data (The algorithm is programmed into the memory).
2.2.1 Working principles
The working principle is based on two registers: the program counter (PC) stores the address of the cell we want to access, while the instruction register (IR) stores the instruction coming from the memory and makes it available to the processor.
2.2.2 Memory access
Different computer architectures manage memory access in different ways. We can consider two families of processors: one is able to directly access and modify the content of memory (higher performance, higher complexity), the other is not able to do these operations with just one instruction. The latter is the case of several modern processors, especially for embedded applications, which rely on the load/store approach: the processor takes the value from the memory (LOAD), modifies it and then writes the value back to the memory (STORE). RISC processors are commonly based on a LOAD/STORE architecture.
2.2.3 Instructions
Let us now consider an example showing that, every time a LOAD/STORE processor has to perform some computation and update the content of the memory, three steps are needed: load the content of a memory cell, perform some ALU operation, store the value back into the memory.
Now we have the value of the content inside our processor and we can modify it.
Instruction execution
Usually, each instruction is divided into several steps and each step may require one or more clock cycles, depending on its complexity.
Instruction format
There are many types of instructions, so there is a different format depending on the purpose.
• R format This is the register format. It is used for arithmetic and logic operations
(ADD, SUB, AND, OR).
• I format Used for instructions where one operand is an immediate value (ALU operations, conditional branches). This is the case of adding a fixed quantity to a value, or of an immediate address displacement with respect to the first register.
• J format It is used only for unconditional branches (jumps).
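To make the field layout concrete, here is a decoding sketch assuming MIPS-like field widths (an assumption of mine: the notes do not fix the exact widths or opcodes):

    def decode(word):
        opcode = (word >> 26) & 0x3F                  # 6-bit opcode
        if opcode == 0:                               # R format
            return ("R", (word >> 21) & 31, (word >> 16) & 31,   # rs, rt
                    (word >> 11) & 31, (word >> 6) & 31,         # rd, shamt
                    word & 0x3F)                                 # funct
        if opcode in (2, 3):                          # J format
            return ("J", word & 0x03FFFFFF)           # 26-bit jump target
        return ("I", opcode, (word >> 21) & 31,       # I format: rs, rt, imm
                (word >> 16) & 31, word & 0xFFFF)

    print(decode(0x012A4020))   # MIPS encoding of add $8, $9, $10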
Let us now consider as an example the following sequence of instructions: LOAD a value from the memory, ADD 1 to it, STORE the result in the memory.
Let us assume that our first instruction (which is a LOAD) is stored at address 100.
Therefore, we want to load the content of the memory at location 500 (whose content is 10)
and store it in the register R0.
The first step is the instruction FETCH: the PC must contain the value 100 (the address of the first instruction we want to fetch).
The LOAD instruction is now inside the IR. Inside the Controller, there is a decoding logic
which, knowing the format of the instruction, understands that there is a LOAD instruction
and starts to drive all the signals to the DP to execute the instruction.
Figure 2.4: Decode phase
At this point we have to access the memory, so, during the memory phase, we take the value
at location 500 and, in the next phase, we write it back in the register R0.
Figure 2.5: Memory access phase
Figure 2.6: Write back phase
Let us have a look at the timing. We need one clock cycle for the EXECUTION even if we
are doing nothing.
Figure 2.7: LOAD instruction timing
Then, the following operation is an ADD. Here we still have a clock cycle for the Memory,
even if we don’t access it.
This example shows how even simple operations (in this case: take a value from the memory, increment it by 1 and store it at a different location) can be quite complex: from a timing point of view, we need to fetch, decode, execute, access the memory and write back. Even very simple tasks carry a kind of overhead, because several steps and basic operations must be handled. From the hardware point of view, the solution seen here is probably not the best one (we go through the same steps in all instructions, even when some of them are not needed), but its rigid structure (we know what happens in each step) makes things easier, even if not optimized.
Let us consider that location 103 contains a JUMP. We have to repeat almost the same steps: after the fetch, we decode the instruction and understand that it is a jump; the execution, memory and write back phases do almost nothing. During the decoding operation, we have to take one portion of the instruction itself and carry it to the logic that updates the program counter (so, in this case, the PC is not just a simple counter).
If we have an N-bit processor, the ALU, buses and registers are N bits wide. The size of the PC determines the address space.
The clock frequency is related to the combinational delay inside the processor. The clock period must be longer than the longest register-to-register delay in the entire processor; therefore the maximum clock frequency is limited by the longest path delay. Memory access is often the longest delay. We have a path from PC to IR, then one from IR to one of the DP registers, then another path towards the ALU, and a last one from the registers to the memory (STORE operation) and from the memory to the registers (LOAD operation). So, if we want to maximize the clock frequency, we have to work on two aspects to be sure the processor is fast enough: the processor-memory interface and the inside of the processor (especially the DECODE and EXECUTE stages).
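In formula form (a generic restatement of the constraint; the symbol names are mine, not from the notes): for every register-to-register path i,

    T_{clk} \geq \max_i \left( t_{cq,i} + t_{logic,i} + t_{setup,i} \right), \qquad f_{max} = \frac{1}{T_{clk,min}}

so the slowest path, often the one crossing the processor-memory interface, is what limits f_max.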
Figure 2.11: Paths inside the processor
2.2.5 Pipelining
A technique used to increase the performance of a circuit is pipelining. If the algorithm requires many operations to be completed and the instructions don't share resources, then, while executing one, we can begin to execute the next one, without waiting for the end of the previous one. This solution must be managed carefully to avoid overlap problems.
This technique allows reaching high performance, with a throughput close to 1 instruction per cycle. The minimum clock period is set by the slowest individual stage.
The major problem to be avoided is the resource conflict: in the example shown in fig. 2.12, the memory is requested both by data access and by instruction fetch. This problem can be solved by using two memories.
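A back-of-the-envelope check of the throughput claim, under the simplest possible model (no hazards, illustrative only):

    k = 5                              # stages: fetch, decode, execute, memory, write back
    for n in (1, 10, 100, 10_000):     # number of instructions
        cycles = k + n - 1             # fill the pipe once, then one result per cycle
        print(n, cycles, round(n / cycles, 3))   # throughput tends to 1 instr/cycle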
2.2.6 Two memory architecture
Figure 2.15: Superscalar approach
Chapter 3
Peripherals
3.1 Introduction
A microcontroller contains a Microprocessor, a Memory, I/O ports and some peripherals. These
sections are connected through buses.
There are many kinds of peripherals: timers, converters, Standard Interfaces (to communi-
cate with the external world), and so on.
Each peripheral can be seen, from the microprocessor point of view, as a set of registers that
can be grouped into three main families:
• status registers give information about the status busy or ready of the peripheral, used
to get some feedback from the peripheral (e.g useful to know when the conversion of a
converter is finished);
• data registers are used to move data from peripheral to microprocessor and vice-versa
• Port-based I/O (parallel I/O): the μP uses ports (made of pins) to communicate with the peripherals. The software reads/writes them just like registers and the access is direct. The problem is scalability: if we increase the number of peripherals, the number of ports (and bits) required may become too high. It is not widely used today.
• Bus-based I/O: control, address and data lines form a bus which is accessed through a protocol implemented within the micro-controller: a single instruction carries out the read or write protocol on the bus. The peripheral is accessed like any other memory address. The downside is that, since we are using a bus (a shared link), multiple resources cannot be accessed concurrently. There is no direct access.
• Parallel I/O: when the processor supports only bus-based I/O but parallel I/O is needed, the parallel peripheral is connected to a register of the μC and the peripherals are accessed independently. Moreover, the parallel I/O peripheral can easily be replaced by a larger one;
• Standard I/O (I/O-mapped I/O): an additional pin (M/IO) on the bus indicates whether the address refers to memory or to a peripheral. The address decoding is simpler and there is no loss of memory addresses to peripherals. Special instructions are needed to move data between peripheral registers and memory. When the number of peripherals is much smaller than the address space, the high-order address bits can be ignored, allowing for smaller and/or faster comparators. E.g., if the bus has a 16-bit address:
– all 64K addresses correspond to memory when M/IO is set to 0;
– all 64K addresses correspond to peripherals when M/IO is set to 1.
PRO: we can fully exploit the bit-width of the address bus.
• Fixed interrupt: the addresses of the ISRs are fixed in the processor and cannot be changed. If there are not enough bytes, the ISR contains a jump. It is a simple but not very flexible solution.
• Vectored interrupt: the address is provided by the peripheral; this is commonly used when the μP has more than one peripheral connected by a system bus. When a peripheral raises an interrupt request, the μP asks it for the ISR address.
• some processors have a dedicated set of registers, that are used only by ISRs (fast, but
with HW cost);
• other processors automatically save some registers to the stack (flexible, but slow);
• in other cases the ISR itself saves the registers it uses to the stack (flexible and efficient,
“RISC-like”);
• sometimes the compiler reserves some registers for ISRs (useful for simple ISRs which
need fast response).
• Maskable: the programmer can set a bit in a register to ignore the interrupt. It is useful in the middle of time-critical code, because the MPU cannot be interrupted, while servicing an interrupt, by another one from the same peripheral.
• Non-maskable: it cannot be ignored by the programmer; it is reserved for critical events.
DMA
Without DMA, the data transfer is based on interrupts: every time a datum has to be written to the memory, an interrupt request has to be asserted. Then the data is read from the peripheral and, after that, written to the data memory.
With DMA
Before starting the execution of the main task, the MPU configures the DMA, setting the origin and the destination of the data. After the DMA controller has been configured:
• after executing an instruction, the μP sees the request, asserts Dack and releases the system bus. It stalls only if it needs the bus;
• the μP resumes execution;
• the DMA reads the data from P1 and writes it into the data memory (while the μP is executing the other instructions).
Question from the audience: what happens when both the program memory and the DMA have to drive the bus?
In this architecture, if the DMA is using the bus to move data from the peripheral to the memory, the bus cannot be used for anything else, so we need separate memories for program and data. If the processor is executing a LOAD instruction while there is a DMA transaction on the bus, only one takes control of the bus and the other waits. In the other cases, since the program memory is a distinct entity from the data memory, the processor can keep running while the DMA is working.
3.8 Timer
The timer is basically a programmable counter that consists of 3 registers. The result of the comparison between the counter and the threshold can be used as an interrupt.
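A behavioural sketch of such a timer (the register names are mine; the notes only say there are three registers and a comparison):

    class Timer:
        def __init__(self, threshold):
            self.counter = 0
            self.threshold = threshold   # compare/threshold register
            self.irq = False             # status flag raised by the comparison

        def tick(self):                  # called once per (prescaled) clock cycle
            self.counter += 1
            if self.counter >= self.threshold:
                self.irq = True          # comparison result used as interrupt
                self.counter = 0         # auto-reload behaviour assumed here

    t = Timer(threshold=3)
    for cycle in range(7):
        t.tick()
        print(cycle, t.counter, t.irq)
        t.irq = False                    # the ISR would clear the flag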
Chapter 4
Memories
4.1 Categories
Memories can be divided into many categories depending on:
• Access:
– Read only: useful for storing fixed code and constants;
– Read/write: useful for storing variables and data to be processed.
• Type:
– Volatile: data are lost if power supply goes down;
– Non-volatile: data are preserved even with no power supply.
• Interface:
Memory accesses can read or write the data serially or in parallel, in a synchronous or asynchronous way, giving different types of memories.
• Addressing:
- Explicit addressing is used when specific commands to access the information are available. In this case the specific address where to read or write must be provided;
- Implicit addressing is used in memories where only the read/write commands are present and the access happens at a fixed position (at the beginning or at the end of the memory, as in FIFOs and LIFOs);
- Addressing by content means that, if the requested value is present, the memory provides the address of its location (Content Addressable Memory - CAM).
• Static/Dynamic:
- Static memory has no need to refresh its data, because they remain stable in time;
- Dynamic memory needs to be rewritten periodically to avoid a loss of information, because the stored charge fades away.
4.1.1 Definitions
• Storage cell: 1 bit memory element;
• Word: collection of bits. Its size is the unit of access for the memory;
• Word-Line (WL): the line in the memory array corresponding to one word.
• Bit-Line (BL): the line in the memory array corresponding to one bit.
4.2 General organization
Fig. 4.1 shows the general organization of a memory as an array of words. A decoder generates a signal for each word and selects only the cells of the same word line.
This organization has some issues because, as the number of words increases, the size of the decoder becomes bigger and bigger: this means that, for high values of n, the decoder is slow. Therefore this scheme is good just for small memories.
Usually a memory is organized as a matrix.
Here, as shown in fig. 4.2, every row contains multiple words and the row decoder is smaller, but an additional multiplexer, driven by the column decoder, is required to select the correct word in the line. This organization reduces the decoding time and optimizes the array silicon area (aspect ratio). The width of each column is the word width p in bits.
The basic blocks of a hierarchical organization are array memories; to select one of them there are wires for the block address and for the block selector. This solution saves power, because a block that isn't used doesn't consume energy (another advantage: shorter wires within blocks).
4.3 Static-RAM (SRAM)
The basic cell of an SRAM is based on a latch. The input of N1 is A and its output is B; for N2 the input is B and the output is A. To work as a storage cell, the circuit must operate in its stable points.
The stability is analysed by applying small variations to one input and looking at what happens to the output. The two blue dots behave symmetrically, therefore only one case will be treated. The starting condition is V_A = Vcc: if we apply a small variation on the input we will have a small variation on the output (even smaller than the one on the input); this variation is on the input of N2, which becomes a variation on the input of N1, and so on. The variation propagates from one gate to the other but, at every step, it is reduced.
The case of the unstable point (red dot) is very different. We consider V_A = Vcc/2: a small variation on the input of N1 will give a larger variation of the output. This output is on the input of N2 and, due to the slope of the curve, the variation of its output will be greater still. This output is then applied to N1 and the variation keeps growing at every pass. This is an unstable point. The cell has problems when small variations are applied around the central point of the plot, i.e. when the voltage on A or B gets close to Vcc/2.
So, this cell works correctly only in the two stable points, therefore V_A and V_B must stay far from Vcc/2. The logic value in B is NOT(A). It is perfect for storing binary symbols.
4.3.2 6T: structure
Fig. 4.6 shows the 4T structure: there is a form of redundancy, since both Q, BL and their negations are present. The four transistors (two for each inverter) are used exclusively for the storage cell.
In order to access the content of the cell, two switches are introduced: they are the transistors
M5 and M6. The structure obtained in this way is called 6T and it is shown in fig.4.7.
6T: read operation
The read operation is a critical one for this kind of memory, since it can lead to metastability. Let us suppose that the last read operation left BLi = 1 and that a storage cell containing a 0 is then read. What happens on BLi? WLj is activated (it goes from 0 to 1); this means that the pass transistor M6 is ON and, since it has a resistance R_ON, the capacitance on BL can force a dV_Q > Vcc/2: instead of reading a 0, we end up writing a 1.
This problem can be avoided by precharging the bit lines to Vcc/2 before the read operation. In this way, if the cell contained a 0 the voltage on the BL decreases, if it contained a 1 the voltage increases. This behaviour is shown in fig. 4.10.
However, because of the capacitance of the bit line, a certain amount of time must pass before the exact value is known, limiting the access speed (a problem for large memories). If this settling time is too high, it can be reduced by adding a comparator between BLi and its complement BLi': if BLi is increasing, BLi' is decreasing, and comparing the two of them gives a 0 at the output in a faster way. In a similar way, if BLi is decreasing, BLi' is increasing and the comparison gives a 1.
Figure 4.11: Comparator used to reduce the transition time
In order to have a very dense array, we should reduce the number of transistors. Ideally we just need 4 transistors per cell (for the 2 inverters), but then 2 more are needed for the access and, if for each bit line we add a comparator, we waste even more silicon. Therefore this comparator should be as small as possible, so that most of the silicon is used for the cells. The circuit used to implement the comparator is a sense amplifier; the sense amplifier is shown in fig. 4.12.
When we start reading, we activate a word line and the sensing signal, so that MN3 and MP3 are turned ON and the circuit is activated. If the voltage on node B is increasing, it means that the BL was precharged at Vcc/2 but the cell we are reading contains a 1. As a consequence, the voltage on A, precharged at Vcc/2, is decreasing, so MN2 is turning OFF while MP2 is turning ON. Since MP2 is turning ON, node B is pushed towards Vdd, while B is still rising by itself since the bitline is precharged. This means that MN1 is turning ON and MP1 is turning OFF, so node A is pushed towards GND. We thus have a positive feedback speeding up the reading of what is inside the cell.
6T: transistor sizing
As stated previously, transistors should be of minimum size to increase integration. The sizes of the transistors must be chosen so that the read and write operations are performed correctly.
Transistor sizing: read operation Let us consider a cell storing a 0. During the read operation, transistors M6 and M3 are ON; due to their R_ON, the voltage Vcc/2 precharged on the bitline is partitioned, producing a voltage ∆V on the cell node. If this value reaches the threshold voltage V_tn of M1, M1 turns on.
Analyzing the currents, it can be seen that the current in M6 can only flow into M3 (because the other node is the gate of M2), therefore ∆V must be as low as possible. Considering that M3 is turned ON (one node at 0 and the other at 1) and works in the parabolic region:
This usually means CR > 1, so M6 has the minimum width and M3 the maximum width. The resistance introduced by M3 is smaller than the one introduced by M6, therefore the voltage drop on M6 is larger than the one on M3. The same holds true for M5 and M1, since the structure is symmetric.
Transistor sizing: write operation The current flowing in M2 must be the same as the one flowing in M5 (because of the gate of M3) during the write operation.
V_Q can be written as:
In order to have V_Q as low as possible, PR must be lower than 1. This condition usually means that M2 has a smaller width than M5. The same constraint also holds for M4 and M6.
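Summarizing the two sizing constraints in the usual notation (my restatement of the standard 6T analysis, not a formula from the notes):

    CR = \frac{(W/L)_{M3}}{(W/L)_{M6}} > 1   (read: M3 strong and M6 weak, so that \Delta V < V_{tn})

    PR = \frac{(W/L)_{M2}}{(W/L)_{M5}} < 1   (write: M5 must overpower M2, so that V_Q stays low)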
4.5 SRAM timing
The interface of an SRAM usually contains:
• a data bus;
• an address bus, which specifies the cell or the word we want to access.
From the outside the memory is usually seen as an array of cells. But this is not the whole truth: there is a lot of logic around it (decoding, logic for handling the output, sense amplifiers, ...); therefore keeping everything powered means consuming power even if the memory is doing nothing. The idea is that, if we are not using the memory, we keep only the array connected to the power supply, so as to save power. With CE we can disconnect all the logic that is not involved in retaining the information. OE enables the output logic of the array only when we want to read (write operations don't use this output logic), thus saving power.
4.5.1 Read
The read operation is used to get data from the memory. It requires that the address is valid, that WE is high and that CE is activated. After OE is activated and a certain time t_AA, during which the address is stable, has passed, the output data are valid and remain so for a certain time t_OH after the address changes.
A new read operation can start after the minimum read cycle time t_RC (during which the address remains stable) and new data are ready after the maximum address access time, at least t_AA (worst case scenario). To avoid consuming power when the memory is not used, CE is deactivated. When the memory is on a bus, the output cannot always be active (OE): if a read operation is required, OE = 0 and the other devices must not write on the bus (Z); if one device has to write to the memory, OE = Z.
Figure 4.20: Example of read timing for a two port SRAM with CE1 and CE2
Write The write operation is used to store data into the memory. The needed signals are
the memory location (address), the chip enable and the write enable.
Figure 4.21: Example of write timing for a two port SRAM with CE1 and CE2
4.6.1 Reading
Four reading methods are possible:
• Flow thru: the address and the control signals are set up before the clock rising edge, and on that rising edge the read cycle begins. Data will be available after some delay, but within the same clock cycle.
• Pipeline: address and control signals are set up before the clock rising edge. Data is read from the memory cells and stored in output registers. At the next clock cycle, data is transferred from the output registers to the data output.
• Register to Latch: address and control signals are set up before the clock rising edge. Data is read from the memory cells and stored in output latches. Data is transferred from the output latches to the data output on the falling edge of the clock.
Figure 4.24: Register to Latch
• Burst: Several bits of data are selected using a single address, which is incremented by
an on-chip counter. Both flow thru and pipelined SRAMs may have the burst feature.
Writing
• Standard: similar to the reading modes above: address, control signals and data are set up before the clock rising edge. On the clock rising edge, the data is written.
Figure 4.26: Standard Mode
• Late: useful in single-port read/write memories with pipeline. Address and control signals are set up before clock rising edge 1, while data are set up before clock rising edge 2.
4.7 DRAM
We have seen in section 4.3.2 how the SRAM requires 6 transistors (4 for the cell and 2 for the access). This is a limiting factor if we want high integration. To increase integration, simpler cells are required. The idea is to exploit just one transistor: we use one MOS capacitor Cs to store the information and a pass transistor to access the cell. Charging the capacitor means storing a logic 1, discharging it means storing a logic 0. In this way we achieve a smaller cell with just one BL.
Figure 4.28: Cell of DRAM
The capacitance depends on the dielectric constant and on the geometry: C = εA/d. If we want a capacitor with a large capacitance we need a big area. In order to save space, instead of planar technology, we can work in 3D: a trench is dug in the silicon and filled with a layer of insulator and a layer of metal. In this way the surface of the capacitor is no longer on the plane of the semiconductor: depending on the depth of the trench we can change the area of the capacitor and get a large capacitance in a small footprint.
If we want to write something into the cell, we have to put a value on the BL and activate the WL to access the cell; at that point Cs charges or discharges, then the WL is deactivated. Since the cell is rather small, a large integration is possible. Therefore we have a lot of cells connected to the same BL and a long wire for the BL. As a result the capacitance on the BL is large (C_BL) and, as in SRAMs, the read operation affects the content of the cell.
• Let us assume:
- the BL is precharged at Vcc/2;
- t = 0− is the time immediately before WL activation.
So:
- V_S(t = 0−) = Vcc (the cell stores a 1);
- V_BL(t = 0−) = Vcc/2;
- V_S(t = ∞) = V_BL(t = ∞) = V_X; this is what we want to find: V_X is the voltage inside the cell after the read operation.
• Charge analysis: the total charge is conserved, so
Q_tot(t = 0−) = C_S·Vcc + C_BL·Vcc/2 = Q_tot(t = ∞) = (C_S + C_BL)·V_X
(in steady state we assume the same voltage on the cell and on the bit line).
• Solution:
V_X = (C_S·Vcc + C_BL·Vcc/2) / (C_S + C_BL); since C_S << C_BL, V_X = Vcc/2 + ε, with ε small.
Figure 4.30: Circuit with precharging and comparator
V_X by itself is not so useful: it is something in between logic 0 and logic 1. With this solution we are able to read the content of the cell but not to preserve it. We have to find a way to read without destroying the content of the cell.
What we can do is exploit the output of the comparator:
1. Precharge BL at Vcc/2;
2. Enable WL;
3. Use the comparator output, fed back onto the BL, to restore the full value in the cell;
4. Disable WL.
The output of the comparator provides an answer almost immediately after the activation of the WL, and it is used to restore the cell voltage (by closing the feedback in the circuit). Even if the answer from the memory is very fast, we need to wait a certain time before removing the WL signal, to be sure that the content of the cell is correct, thus limiting the access time.
4.7.1 Refresh
Every time we read a value, the system must do two things: give the answer and restore the cell. Moreover, even when the access transistor is turned off, so that the capacitor should be isolated from the outside world, there is still a leakage current discharging it. This is due to the source/drain-to-bulk junctions, which are reverse biased. So even when the voltage on the WL is zero, the voltage on Cs is not constant but slowly decreases. Therefore a refresh operation (a dummy read) is mandatory after every certain amount of time.
This means that a dynamic RAM needs refreshing logic (additional logic): with this cell architecture (one transistor and one capacitor) we obtain a very dense array, but the price is the additional logic (which should be small, so as not to waste much silicon).
Figure 4.34: DRAM array organization
4.7.2 Accessing data
While in SRAMs row and column addresses are provided concurrently, in DRAMs (for historical reasons) row and column addresses are time-multiplexed: RAS and CAS tell whether the address on the bus refers to a row or to a column. The interface contains:
• Address bus;
• Data bus;
• Write Enable → WE.
From the outside the DRAM is seen as a combinational circuit (no clock signal involved).
4.7.3 Timing: read
Fig. 4.35 shows the operations needed to perform a read. First we put the row address on the address bus, then we move the RAS signal from 1 to 0. We can notice here a first timing constraint, from when we change the address to when we move RAS: we need a set-up time t_ASR for the address before moving the RAS signal. Then the address must be stable for a certain amount of time (hold) t_RAH. RAS then stays active for a time t_RAS.
At this point we change the value on the address bus (column address); it must be stable for a time t_ASC, then we can move CAS from 1 to 0. This signal stays active for a time t_CAS and the address can change after t_CAH.
Write enable must be high at least tCAS before CAS is asserted and at least tRCH after CAS
is de-asserted.
Data is valid after the maximum access time (from address tAA , from RAS tRAC and from
CAS tCAC ).
The read cycle ends after RAS and CAS are de-asserted (tCRP , tRP ).
4.7.4 Speed
The minimum time to complete the read cycle is given by:
t_RS = t_RAS + t_RP + switching time
How can we improve the performance?
• Fast Page Mode;
• Extended Data Out (EDO);
• Synchronous DRAM (SDRAM).
N.B. Page: group of cells with the same row address.
Figure 4.40: Timing: Fast page mode
First, we put the row address on the address bus, then we start moving the RAS signal. At
this point, we can put the column address on the address bus, then we can start moving the
CAS signal. Now we can remove the CAS signal but without removing the RAS signal: RAS
is kept active, we have just to change the column address and CAS. In this way we get several
values from the same page.
t_PC is the time to complete an operation (read or write); the time from the CAS transition to the end of the precharge is (t_CAS + t_PC).
Usually, when working with RAS and CAS, we first drive RAS active, then we start moving the CAS signal and, after a certain amount of time, we get the value. Once we remove CAS we cannot be sure the data is still valid. This means that if we want to switch the RAS and CAS signals very quickly, to be very quick in accessing the content of the memory, we also have to be very quick in getting the value from it. So the amount of time we have to read the value from the memory is the one underlined in red in fig. 4.41.
In EDO DRAMs, even when CAS is de-asserted (back to 1), the data is kept valid, so we have more time to read it. In this way we can narrow the CAS pulse without reducing the DQ time slot.
4.7.5 Refresh handling
Every DRAM contains a refresh controller and a refresh counter to generate the row addresses.
The refresh, as was said previously, is a dummy read and write operation that must be done
periodically. There are two main refresh methods:
• distributed: in between refresh cycles the memory is free;
• burst: provides a lot of free time, but we have to be sure that the first cells are still valid.
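For a feel of the numbers (typical figures, assumed here for illustration, not taken from the notes):

    # Assumed: 64 ms retention time, 8192 rows to refresh.
    retention_s, rows = 64e-3, 8192
    print(retention_s / rows * 1e6, "us between row refreshes")   # -> 7.8125 (distributed)
    # Burst mode instead refreshes all 8192 rows back-to-back once every 64 ms.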
The problem of refresh is not the bandwidth but the uncertainty: if we perform a read operation right after a refresh, we have 100% uncertainty on the latency (from the outside we don't know when the refresh happens).
Usually the refresh is not automatically handled by the DRAM itself: it is handled by the designer, who has to correctly drive the interface signals (RAS and CAS) and the refresh. There are several types of refresh.
RAS-Only-Refresh (ROR)
Figure 4.43: Timing of ROR: 1)Row address is applied 2)RAS is set active and CAS must
remain inactive 3)After a specified amount of time, RAS is set inactive.
We put the row address on the address bus (identifying one of the words in the array); normally, with the column address, we would then select something in the same line (page). If we don't provide any column address, so that RAS is active and CAS stays inactive, it means that we want to perform a refresh operation of the whole page. After a specified amount of time (enough to finish the refresh), RAS is set inactive. This solution is annoying from the designer's point of view: one has to keep track of which rows have been refreshed.
CAS-Before-RAS (CBR)
Figure 4.44: Timing of CBR: 1)CAS is set active 2) WE must be inactive 3) After a specified
amount of time, RAS is set active 4)The refresh counter determines which row to refresh; –
After the required time, CAS and RAS are set inactive.
The main advantage of this strategy is that we don't have to keep track of the addresses of the lines we are refreshing. During the whole operation WE remains inactive and CAS is set active; after a certain amount of time RAS is activated too. An internal counter determines the row to refresh. In this way the work of the designer is simplified.
Hidden refresh
During a refresh operation we are not able to access the content of the memory. We can "hide" the refresh by keeping a value available on the bus.
First we perform a read operation: RAS is activated and the same is done for CAS. At this point CAS is kept active while RAS is removed and then re-activated (CBR-like). In this way the data on the bus is still valid.
Figure 4.46: Timing: hidden refresh 2
Self refresh
4.7.6 Writing
Two modes:
• CAS based (early write), like the read operation: WE is taken low prior to the CAS falling edge.
• WE based (read-modify-write): allows accessing one cell, reading it, modifying it and writing it back with a single bus transaction. WE falls after CAS is taken low.
Read-Modify-Write detail:
Figure 4.47: Timing: read-modify-write
4. Disable OE
5. Enable WE
4.7.7 SDRAM
SDRAMs were developed to have higher storage capability and lower price than SSRAMs.
Two main families: SDR (single data rate) and DDR (double data rate).
Main characteristics:
• Multiple bank architecture;
• All inputs and outputs are synchronous with the clock (so there are registers at the interface).
• Control is easier: the memory executes commands (combinations of the logic levels of the control signals instead of complex timings).
• The memory array is split into banks: we can refresh one bank while accessing another.
• Controls can be performed at bank level (interleaving the control of each bank separately to hide the precharge time).
Separated power supply: synchronous logic draws a large current from the power supply during the rising (or falling) edge of the clock. This can inject noise into the array, which can be dangerous. By separating the power supply of the (synchronization) logic from that of the array, we reduce the amount of noise injected into the array.
Mode register: it stores configuration parameters, e.g. the latency:
• the number of clock cycles that occur from the input of a command to the output of the data.
Figure 4.48: MODE Register
SDRAM block diagram
Figure 4.49: SDRAM block diagram: notice the multiple bank architecture
In order to be sure that the memory works correctly, we have to perform different steps, driven by several control signals. This means that we need a controller between the μP and the SDRAM.
The controller interacts with the SDRAM at start-up to configure it; then, when the memory is ready, the controller receives commands (read and write) from the μP and translates these macro-commands into specific commands for the SDRAM.
Commands
Figure 4.50: SDRAM commands, the # symbol means that the signal is active low
SDRAM use
Operations with an SDRAM begin with the ACTIVE command.
The ACTIVE command is very important because after it we have to wait a minimum delay t_RCD before a READ or WRITE. t_RCD corresponds to a number n of clock periods in which no operation can be issued. Therefore, changing the clock frequency of the memory changes the number of clock cycles we have to wait in order to respect the timing of the memory.
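This is exactly the computation the controller designer has to do; a sketch with illustrative numbers (the tRCD value below is assumed):

    import math

    def wait_cycles(t_ns, clk_mhz):
        period_ns = 1000 / clk_mhz
        return math.ceil(t_ns / period_ns)    # round up to whole clock cycles

    print(wait_cycles(15, 100))   # tRCD = 15 ns at 100 MHz -> 2 idle cycles
    print(wait_cycles(15, 200))   # same tRCD at 200 MHz -> 3 idle cycles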
SDRAM Read
Random: we can issue the read command and specify with the address the bank and the column we want to read.
Figure 4.52: Burst read
Burst mode requires loading the Mode register with the LOAD MODE REGISTER command. The value to be written into the Mode register is taken from the address bus: inside the controller we need a multiplexer that puts on the address bus either the address or the value for the Mode register.
SDRAM write
SDRAM: DDR DDR exploits both the rising and the falling edge of the clock, so this memory should be very fast in reacting, since the available time (to get the data) is no longer a clock period but half of it. DDR is very useful in high-performance systems, where the memory should run very fast.
A problem of DDR SDRAMs is that at high rates there can be some skew between clock and data on the PCB → correct read/write operations become difficult! What we can do is add a data-edge-aligned strobe signal (DQS).
Figure 4.55: DQS
The wire of the DQS must have the same length as the wires that carry the data to be sampled (so that they have the same delay). In this way the DDR memory uses the DQS to perfectly synchronize the sampling.
4.8 CACHE
We are used to working with fast μPs; on the other hand, we want memories that are large, so that they can store a lot of information, and fast. However, large memories (lots of BLs → large capacitance) are slow. In practice, we can solve the problem of coupling the speed of the processor with the speed of the memory by resorting to a hierarchical system: we put small, very fast memories next to the processor and we place the large memories a bit farther from it.
The memory hierarchy can therefore be seen as a pyramid: as the level increases, the distance from the processor and the size of the memory increase, while the speed decreases.
Figure 4.57: Current memory hierarchy
Let us consider now, as an example of memory access, a μP with L1, L2, and main memory,
which executes a load word (lw) instruction.
Sequence of operations:
1. Look for the word in L1: miss;
2. Look for the word in L2: miss;
3. Not present in L2 either: read the block of data (containing the requested word) from main memory and put it in L2;
4. Put the block in L1;
5. Deliver the requested word to the processor;
6. lw is completed.
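The benefit of the hierarchy can be quantified with the usual average-memory-access-time model (a standard formula; the hit times and miss rates below are assumed for illustration):

    # AMAT = hit_time + miss_rate * miss_penalty, applied level by level
    l1_hit, l1_miss = 1, 0.05      # cycles, miss rate (assumed)
    l2_hit, l2_miss = 10, 0.20
    mem = 100
    amat = l1_hit + l1_miss * (l2_hit + l2_miss * mem)
    print(amat)   # -> 2.5 cycles: close to L1 speed, far from the 100 of main memory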
Cache fundamentals
The elementary component of a cache memory is the cache line, which is made of several fields (data, address-related information, ...).
• The data field is matched to the next level of the hierarchy: to connect this memory successfully to the next level, the width of the data field must match the data width of the next level;
• The format and the size of the address-related information depend on the mapping algorithm.
Let us consider the following example: assume we have a 16 MByte main memory → 16 MByte = 2^4 · 2^20 bytes = 2^24 → 24 bits for the address. Let us suppose the data is represented on 32 bits (4 bytes per data). The 24 bits are used in this way: 22 are used to choose one line and 2 bits choose one byte among the 4 available. Let us assume we use a 64 kB cache: 64 kB = 2^6 · 2^10 → 16 bits. Since we have to match the data width of the cache with the data width of the main memory, we need a 32-bit bus for data. As a consequence, the 16 bits that would address single bytes reduce to 14 bits addressing 32-bit words.
4.8.1 Cache organization
Two questions: where can a block be placed in the cache, and how do we find it?
The answers depend on the type and organization of the cache. Usually byte-access support is required!
N.B. At the beginning, the valid field of every line is 0, so initially we have a lot of misses; when we get the correct value from the memory and put it in the cache, the valid bit is set to 1.
In this example, the address next to 000000 cannot be 000001, since 000000 identifies the
byte 68, 000001 the byte 24, 000002 the byte 57 and 000003 the byte 13.
To ease the access to the memory, it is better to translate the byte-level address into the block-level address (just by neglecting the two least significant bits). If we use direct mapping, it is clearly defined where each element of the main memory can be stored inside the cache. The position inside the cache is given by the index, which is a portion of the address: the least significant bits of the block address are the index, while the most significant ones must be stored inside the cache as the tag, to avoid ambiguities.
The microprocessor provides an address on m bits (e.g. m = 24). First, we remove the w LSBs identifying the byte in one block (e.g. w = 2). In this way we get an index on r bits (in our example r = 14), used as the address of the cache, and a tag on s-r bits (in this case 22-14 = 8 bits). From the cache we get a value which contains the data, the valid bit (1 bit) and the tag. At this point we can compare the input and output tags to be sure that what we are reading from the line is exactly what we are looking for. If they are equal and the valid bit is 1, we have a hit.
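Using the numbers of the running example (m = 24, w = 2, r = 14, so an 8-bit tag), the address slicing can be sketched as:

    W, R = 2, 14                      # byte-offset bits, index bits

    def split(addr):
        offset = addr & ((1 << W) - 1)          # byte inside the block
        index = (addr >> W) & ((1 << R) - 1)    # selects one cache line
        tag = addr >> (W + R)                   # stored and compared on each read
        return tag, index, offset

    print(split(0x123456))   # -> (18, 3349, 2), i.e. tag 0x12, index 0x0D15, offset 2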
N.B. To build our cache we can exploit the logic available on an FPGA, where the memory is usually synchronous. When the μP gives the address, it is sampled at the next clock cycle, so the memory does not answer immediately. We have to be sure that we compare TAGs corresponding to the same index value; therefore we add a register at the input of the comparator (on the s-r side), otherwise we would compare the TAG we get from the cache with a TAG that is not consistent from the timing point of view. Moreover, we have to match the number of registers added in this way with the pipeline depth of the memory.
• A direct-mapped cache is simple to design and fast: given a memory address, the index in the cache is easily found and no search is needed.
• Problem: conflict misses, that is, misses caused by accessing different memory locations mapped to the same cache index: there is no flexibility in where a memory block can be placed in the cache.
4.8.3 Fully Associative Cache
In a fully associative cache a block of main memory can be placed in any cache line. The Most Significant s bits identify the address of the block, and these s bits are the tag.
With the direct mapping approach we pay with a rigid placement but gain a rather small tag. With the fully associative approach we have high flexibility, which is paid for with a large tag.
Example
Figure 4.60: Fully Associative Cache example: this time we have to store the full block address
Fully Associative Cache: reading
Since everything can be mapped everywhere, every time we want to check whether the wanted data is inside the cache we have to read all the lines of the cache concurrently and compare each TAG field with the current TAG. Since we read all the lines concurrently, the cache can be seen as a bunch of registers (each line is a register). Moreover, we need a lot of comparators and a decoding logic. This means that, in terms of hardware, a fully associative cache is more complex than a direct-mapped one.
Fully Associative Cache: notes
When a read operation is performed, the tag must be searched in the whole cache, because the item can be in any cache block. The search for the tag must be done in parallel by hardware (a serial search would be slow). The necessary parallel comparator hardware is very expensive (one comparator for each line), so the fully associative approach is practical only for very small caches.
4.8.4 N-way Set Associative Cache
Each block of the main memory maps to a set of cache lines. A block is made of n bits. The address (m bits) is split into two parts:
• the Most Significant s bits identify the address of the block; the most significant s - r + log2(N) bits are the tag;
• the Least Significant r - log2(N) bits identify the set in the cache (index).
The TAG is slightly larger than in direct mapping, but much smaller than in the fully
associative case.
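For instance (reusing the earlier numbers as an illustrative assumption): with m = 24, w = 2 (so s = 22), r = 14 and N = 4 ways, the index shrinks to r − log2(N) = 14 − 2 = 12 bits and the tag grows to s − r + log2(N) = 8 + 2 = 10 bits, in between the 8-bit direct-mapped tag and the 22-bit fully associative one.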
N-way Set Associative Cache: reading
The address comes from the microprocessor. From it we take the w LSBs to identify the
byte inside the word; then the other part of the address is split in 2. The first part (the LSBs)
is used as the index to identify one line in the cache and, since these caches are direct-mapped
ones working concurrently, we access all of them with the same index. Each line is composed of
address-related information and data. The address-related information is basically the TAG.
We compare these TAGs with the second part of the address, corresponding to the TAG. The
decoder takes as input the results of the comparisons and drives the multiplexer accordingly
(its inputs are the data from the caches). The decoder also drives the multiplexer of the
valid field.
N-way Set Associative Cache: note
Direct mapped cache and fully associative cache can be seen as just variations of set associative
cache:
• Direct mapped cache is a 1-way set associative cache;
• Fully associative cache is an n-way set associative cache (where n is the number of lines in
the cache).
If all valid bits are set, one of the lines in the set must be replaced with the block read from
memory!
Replacement strategies:
- Random line;
- LRU → Least Recently Used line;
LRU replacement
It replaces the Least Recently Used (LRU) line, since the least recently used line is (probably)
not going to be used again soon.
We must add logic in order to store information about the use of the stored lines.
In a 2-way set associative cache one bit per set is needed:
• When a line has to be replaced, if the "access bit" is 0, replace the first line (and vice versa).
If N is greater than 2, one additional bit is not enough to find the least recently used line.
E.g. the Intel 80486 uses an 8-kbyte 4-way set associative cache. It implements a pseudo-LRU
strategy, with three additional bits per set. Lines are grouped into two couples (the policy is
sketched right after this list):
• The first bit indicates which couple has been accessed last;
• The second bit indicates which line in the first couple has been accessed last;
• The third bit indicates which line in the second couple has been accessed last.
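A minimal C sketch of this 3-bit pseudo-LRU policy; the bit layout follows the grouping above, and all names are illustrative:

#include <stdio.h>

/* b0: which couple was accessed last (0 = lines 0/1, 1 = lines 2/3);
   b1/b2: which line inside each couple was accessed last.            */
typedef struct { unsigned b0 : 1, b1 : 1, b2 : 1; } plru_t;

static void plru_touch(plru_t *s, int line) {      /* line in 0..3 */
    if (line < 2) { s->b0 = 0; s->b1 = (unsigned)(line & 1); }
    else          { s->b0 = 1; s->b2 = (unsigned)(line & 1); }
}

static int plru_victim(const plru_t *s) {
    /* go to the couple NOT used last, then to the line NOT used last */
    if (s->b0 == 0) return 2 + (int)(s->b2 ^ 1u);  /* evict in 2nd couple */
    else            return 0 + (int)(s->b1 ^ 1u);  /* evict in 1st couple */
}

int main(void) {
    plru_t s = {0, 0, 0};
    plru_touch(&s, 0); plru_touch(&s, 3); plru_touch(&s, 1);
    printf("victim = %d\n", plru_victim(&s));      /* line 2, never touched */
    return 0;
}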
• More data than can fit in the cache (capacity miss; it can be reduced by working on the size
of the cache);
• The block replacement policy discarded a block that is now being referenced and must be
reloaded (conflict miss).
4.8.6 Writing
When a write operation is performed on data present in the cache, we have to update the
value stored in the cache. What about the value stored in the memory?
Let us suppose we write data only to the cache: main memory and cache would be inconsistent,
so this approach must be discarded. If the line in the cache is later replaced and the memory
has not been updated, the new value (the updated one) is lost!
Write-through strategy
When a write operation is performed, we write both the block in the cache and the block in
the main memory. In this way memory and cache are always coherent, but the write operation
is performed at the speed of the slower memory! It is a simple strategy with poor performance.
This strategy works fine if the average time between two writes is greater than the memory
write-cycle time.
• Write-through with buffer → consistency problem (risk of reading the block when it is
not up-to-date).
• Write-back based (write allocate):
- Read the block from the main memory;
- Place it in the cache;
- Write (update) in the cache;
- Update the main memory when the cache line is replaced (as in write back).
A toy model of this policy is sketched below.
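A minimal toy model of write-allocate plus write-back in C (a hypothetical one-line cache, with addresses used directly as tags, just to show when memory is touched):

#include <stdio.h>
#include <stdint.h>

static uint32_t memory[256];                 /* toy main memory             */
typedef struct { uint32_t tag, data; int valid, dirty; } line_t;
static line_t line;                          /* toy single-line cache       */

static void store(uint32_t addr, uint32_t value) {
    if (!line.valid || line.tag != addr) {   /* write miss                  */
        if (line.valid && line.dirty)
            memory[line.tag] = line.data;    /* write back the evicted line */
        line.data = memory[addr];            /* write allocate: fetch block */
        line.tag = addr; line.valid = 1;
    }
    line.data = value;                       /* update the cache only       */
    line.dirty = 1;                          /* memory updated on eviction  */
}

int main(void) {
    store(7, 111);                           /* miss, allocate line 7       */
    store(9, 222);                           /* miss, evicts and flushes 7  */
    printf("memory[7] = %u (flushed), memory[9] = %u (stale until eviction)\n",
           (unsigned)memory[7], (unsigned)memory[9]);
    return 0;
}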
More on writing
The write-back strategy is based on the fact that writes to main memory are performed only
once, when the line in the cache is replaced. What happens if more "actors" can access main
memory (multi-processor systems, DMA controller, ...)? If a modified line is accessed from the
main memory, an error occurs! In order to ensure consistency of the cache, a protocol called
MESI (Modified Exclusive Shared Invalid) has been developed: inside each line of the cache we
have data, TAG, valid and other bits, just to be sure that the value in the cache is coherent
with the value in the main memory and in all the other caches. We are paying the price of a
fast system (if we use a cache we want a fast system) with some overhead.
4.9 ROM
• WLj = 1 → BLi = 1 (diode off);
The true structure contains all the diodes: instead of placing diodes just where we need them, a
simpler solution is to have them in each position of the array and then decide whether to
connect them or not. From the fabrication point of view each ROM is equal to any other; the
difference is in the last step, the metallization, which defines whether the diode is connected or
not (no diode → 1 on BL, connected diode → 0 on BL). We have to be sure to drive just one
WL at a time, otherwise we can create a short.
This solution is, however, not the smartest one: the BL is not an ideal wire, it has a
capacitance, so having a direct connection between WL and BL is not a good idea.
Figure 4.68: MOS-based ROM
Compared with the diode approach, here we have an area overhead as a downside: we must
connect the Source to GND, so we need an additional line. An idea to reduce this problem is
to flip transistors, so that two physically close WLs have their transistors mirrored and their
Sources can share the GND line. In this way we halve the number of GND lines.
We build the array of transistors and with the metallization we choose the connections. To get
a programmable ROM (PROM) we put a fuse between the Drain and the BL: programmability
is achieved by blowing or not blowing the fuse.
Figure 4.69: PROM
- permanently OFF (VT > Vmax);
- switchable (Vmin < VT < Vmax).
When we connect together metallization, oxide and silicon, there is a movement of charges
inside the structure and, if the Si is p-doped, a depleted region appears (the majority carriers
are pushed outside this area). At equilibrium the Fermi level is the same in all the materials.
Due to Schrödinger's theory, electrons inside this structure are organized into bands and,
depending on the type of material, we have different characteristics (metals, semiconductors,
insulators). In metals, conduction and valence band overlap, so the distance between the free
energy level and the Fermi level is given by the extraction potential (qχm). In semiconductors
the two bands are separated but the energy gap (distance between the bottom of the
conduction band and the top of the valence band) is rather narrow. In insulators the energy
gap is rather large.
Inside the semiconductor the intrinsic Fermi level lies close to the lower limit of the conduction
band (Ec), so the distance between E0 and EF is given by the extraction energy of the
semiconductor (qχs), plus Ec − EFi, plus qΦp.
To find VT we need a further step: we apply a voltage, adding to the qVfb step and bending
the band structure. At a certain point, the intrinsic Fermi energy EFi crosses the EF of the
semiconductor. If the two levels cross, it means we are changing the type of semiconductor we
are using: with EFi < EF, at the oxide interface, the behaviour of the semiconductor is n-type.
So, by increasing the voltage on the structure we are able to attract electrons to the interface
between semiconductor and oxide. When we apply a voltage such that EFi − EF = qΦp we are
in the strong inversion condition. During strong inversion the structure is the one shown in
fig. 4.71.
If we apply a voltage between gate and bulk high enough to create strong inversion, we
create a channel of electrons even though the structure has a depleted region. To build a
transistor we need something to exploit this channel, so we add source and drain. In this way,
with an electric field, we are able to move the charges.
We want to be able to change the threshold voltage of the transistor, so we need to change
the physical parameters that define the strong inversion condition.
A way to change the threshold voltage is to exploit the fact that some charges can be
trapped inside the oxide: the complete threshold voltage has a contribution given by the
standard threshold voltage of the transistor, and we can modify this value by trapping charges
inside the oxide. We can also implant some charges into the structure (in this way we get the
wanted value for the threshold voltage, but we are not able to change it again).
VT = VT0 − Qoxi/Cox − α·Qox/Cox
• Electrons in the floating gate increase VT;
• the current decreases.
• Reading:
– programmed transistors do not conduct and the corresponding BIT lines are at 1;
– unprogrammed transistors conduct and the corresponding BIT lines are at 0.
The erasing procedure is done through UV radiation: electrons in the floating gate gain
sufficient energy to return to the semiconductor.
N.B.
- Programming → by cell;
- Erasing → by whole memory.
The erasing operation requires a ceramic package with a small quartz window for exposure.
The FLOTOX cell has a very thin oxide next to the Drain; with a high gate voltage (more
than VDD) and Source and Drain at ground, the electrons pass through the oxide barrier by
tunnelling.
Figure 4.77: FLOTOX
At a first glance we may think to reproduce the EPROM structure with a different elementary
cell. However, this does not work! Let us consider the example in fig. 4.78:
• To program A: WL1 → VPP and BL1 → GND;
• To avoid programming B: BL2 → VPP;
• To avoid programming C: WL2 → GND;
• Side effect: WL2 → GND and BL2 → VPP erases D!
Figure 4.78: Example: EEPROM based on EPROM structure
Therefore, the problem is that programming a transistor interferes with the other ones.
What we can do is add an access transistor to each cell, as shown in fig. 4.79:
• To program A: W0 → VDD and P0 → VDD;
• To avoid programming B: BIT1 → VPP;
• To avoid programming the other cells: W1, ..., WN → 0.
The price of using EEPROMs is that we have to handle several voltage values inside the chip,
at least three (GND, VDD, VPP).
4.10 Flash memories
Flash memories can be seen as an extension of EPROMs and of EEPROMs:
• One transistor per cell (thanks to this they can reach high integration);
• Byte/Word.
• NOR: every BL is connected to the supply voltage through a pull-up resistor Rpu; each
transistor is connected to the BL through the Drain, the Source is connected to GND and the
Gate to the WL.
• NAND: the Drain of each transistor is connected to the BL; the Source is connected to
the Drain of the next transistor.
4.10.1 Architectures
In NOR architectures, due to the fact that the Source must be connected to GND, we need
a metal wire connecting the Source to GND. Exploiting symmetries (consecutive pages share
the same wire) we can reduce this metallization. The width of the source metal layer cannot
be too narrow, otherwise the reliability is compromised (and it is difficult to realise).
The NAND cell solves the problem of achieving high integrability by putting the cells
sufficiently close to one another, so that the n+ region is shared by two transistors, acting as
the source of one transistor and the drain of the other one.
4.10.3 NOR-flash
NOR-Flash Memory: erasing
When we perform an erasing operation we must be sure that all the transistors then act in
the standard way, so we need to remove the trapped charges from the floating gate.
This is done exploiting the Fowler–Nordheim effect:
• we apply a high voltage to the bulk (in this architecture the bulk is connected to
the source).
So, if we then apply a high voltage on the gates, the transistors turn ON and there will be a
relevant current variation, meaning a logic 1.
To perform the program operation we exploit the hot-electron effect: we need a high voltage
(6 V) between source and drain so as to accelerate the electrons, and a high voltage (12 V) on
the gate so that the electrons are able to jump into the floating gate. This increases the
threshold voltage of the transistor.
When, with a regular voltage, we try to access the transistor, it will not turn ON, so the
current variation at the node is very low and it will be detected as a logic 0.
Figure 4.84: NOR-Flash Memory: programming
Let us consider that the red-circled transistor in fig. 4.85 is programmed, so it has electrons in
the floating gate and its threshold voltage is high. On the other hand, the blue-circled
transistor is not programmed, so it has no electrons in the floating gate and its threshold
voltage is regular. Applying 5 V on their WL, the first transistor won't turn ON (high Vt) and
a logic 0 will be read, whereas the second one will turn ON (regular Vt) and a logic 1 will be
detected.
4.10.4 NAND-flash
NAND-Flash Memory: erasing
We apply a high voltage to the bulk and connect the gates to ground so as to discharge the
floating gate. If the bulk voltage is high enough, we can over-discharge the floating gate,
removing its electrons. In this way there is a positive net charge in the floating gate and the
threshold voltage of the erased cells becomes negative.
With a negative threshold, even with 0 V on the WLs, the transistors are turned ON.
NAND-Flash Memory: reading
We apply 5 V on the pages we don't want to read, so that their transistors conduct regardless of their stored value.
NOR-flash:
• Large cell area, small capacity;
• Fast read (tens of ns), random access;
• Slow write (some μs) compared to read;
• Used mainly for code/instructions.

NAND-flash:
• Small cell area, large capacity;
• Slow read (tens of μs), sequential access (shadowing);
• Fast write (several μs), comparable with read;
• Used mainly for data.
4.10.6 Interface
The interfacing is similar to the one seen for the SRAM; the difference is that signals are
grouped to create commands (as in SDRAMs). A complex FSM with a DC-DC converter is
used to handle erasing and programming operations.
Several flash memories contain a Status Register to check erasing and programming
(successful/error), and they can implement write-protection mechanisms.
The information about the memory (block size, density, ...) is contained in the Common
Flash-Memory Interface (CFI).
ONFI contains the specifications of the NAND-flash interface.
4.10.7 Reliability
As time passes and the memory is used, Flash memories lose reliability. This is due to:
• Erasure causes a large and asymmetric distribution of VT (endurance): VT,erased and
VT,programmed tend to become closer.
Erasing/programming can be performed correctly only a limited number of times (about
10^4–10^5).
By checking the status register we know if the block is reliable: if programming/erasing fails,
the block is marked as damaged (thus reducing the size of the usable memory).
• Dynamic: the next block to write is chosen among the ones with the lower erase count
(see the sketch after this list). This type of levelling is used when only dynamic data (data
that change frequently) are involved. If we always wrote/erased the same blocks, they would
age faster and become useless.
• Static: static levelling is used for both static and dynamic data. We track the write/erase
rate of all good blocks; in this case also static data (data that are almost never changed)
are moved to the blocks with higher erase counts, to keep the ageing almost uniform.
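A minimal C sketch of the dynamic policy (the erase counters below are hypothetical):

#include <stdio.h>
#define NBLOCKS 8

int main(void) {
    /* hypothetical per-block erase counters */
    unsigned erase_count[NBLOCKS] = {120, 95, 300, 95, 410, 88, 250, 130};
    int best = 0;
    for (int i = 1; i < NBLOCKS; i++)        /* pick the least-worn block */
        if (erase_count[i] < erase_count[best]) best = i;
    printf("write next data to block %d (erased %u times)\n",
           best, erase_count[best]);
    return 0;
}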
ECC
In a NAND-flash read, the cells of the selected page have the gate at 0 V, while the cells of
unselected pages have the gate at 5 V. The cells of unselected pages undergo an electrical
stress and programmed cells become weak. We need Error Correcting Codes (ECC) to improve
reliability: the following ones are taken from the family of Block Codes, used for the reliability
of memories (another family is that of Convolutional Codes, mainly used for data transmission):
• BCH codes;
• k: number of information symbols;
The graph below shows, for t = 1, 2, that the bit error rate decreases with respect to the
case t = 0, i.e. no error correction.
• Lower performance;
• Lower reliability;
• Lower price.
Chapter 5
Interfacing
5.1 Introduction
Let us consider a digital system with a transmitter (TX) and a receiver (RX). Since we are
dealing with a digital system, the TX transmits logic 0s and 1s. We need to define which
voltages and currents are needed to correctly communicate these values.
Figure 5.1: Digital system transmitting a logic 1 (left) and a logic 0 (right)
Usually, systems are made in a way to have VIH < VOH and VIL > VOL in order to have some
margin.
Figure 5.2: Voltage values and margins
With these conditions the system can properly work: we have introduced a noise margin
(NM):
• NMH = VOH − VIH > 0;
• NML = VIL − VOL > 0.
The NM is static: it does not change with the transmission of different symbols (static com-
patibility check).
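As a quick numeric check (using the classic TTL levels as an illustrative assumption): with VOH = 2.4 V, VIH = 2.0 V, VIL = 0.8 V and VOL = 0.4 V, we get NMH = 2.4 − 2.0 = 0.4 V and NML = 0.8 − 0.4 = 0.4 V, both positive, so TX and RX are statically compatible.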
Let us consider the system from the current point of view. If the output current is too high,
the output voltage drops below VOH. This current depends on what is connected to the TX.
To be sure to have a correct communication, we have to check the NM, but also that the
amount of current required by the RX is compatible with the amount of current the TX is
able to produce.
• IOH is the maximum output current for a high logic value;
• IOL is the maximum output current for a low logic value;
• IIH is the maximum input current for a high logic value;
• IIL is the maximum input current for a low logic value.
A TX can also be connected with several RXs. In this case, the TX can drive an output
current IOH , but each RX can sink a current IIH . So, if we want to connect more than one RX,
we have to check that the sum of all these currents is compatible with IOH .
• nH = ⌊IOH / IIH⌋;
• nL = ⌊IOL / IIL⌋.
The minimum between these two parameters is the static fan-out of the system (maximum
number of RXs we can connect to one TX to be sure we do not violate the interfacing rules).
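A small C sketch of this computation, with hypothetical datasheet currents:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* hypothetical datasheet values, in amperes */
    double IOH = 4e-3,  IOL = 8e-3;     /* TX drive capability     */
    double IIH = 40e-6, IIL = 1.6e-3;   /* RX input currents       */
    int nH = (int)floor(IOH / IIH);     /* nH = floor(IOH / IIH)   */
    int nL = (int)floor(IOL / IIL);     /* nL = floor(IOL / IIL)   */
    int fanout = nH < nL ? nH : nL;     /* static fan-out = min    */
    printf("nH = %d, nL = %d, static fan-out = %d\n", nH, nL, fanout);
    return 0;
}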
Which is the maximum speed at which the system can work correctly?
From the static compatibility we know that TX and RX can be connected together, but we
don't have any information about the speed of communication.
We can derive a model where the TX is represented by its Thevenin equivalent and the RXs
are modelled as capacitors (CMOS circuits). Depending on the number n of RXs, we have
many capacitors connected in parallel (they all share the same nodes: the TX output wire and
GND). The wire too has its own model, which includes a capacitance CL. A more accurate
model also includes a line resistance RL (in series with RTX).
The voltage generator at the TX is very specific: when the TX switches from 0 to 1, the
voltage goes from VOL to VOH as a step, and vice-versa when it switches from 1 to 0.
Figure 5.5: L-H transition: at RX side we need at least VIH to understand there is a switch
Figure 5.6: H-L transition: at RX side we need at least VIL to understand there is a switch
VIRX(t) = (v0 − v∞)e^(−t/τ) + v∞
where τ = (RTX + RL)(CL + nCRX) is the time constant (we will neglect RL from now on).
L-H transition: (v0 = VOL ) v∞ = VOH ;
H-L transition: (v0 = VOH ) v∞ = VOL .
VIRX starts increasing exponentially from VOL and, after a time T, it reaches VIH:
VIRX(T) = (VOL − VOH)e^(−T/τ) + VOH ≥ VIH
VOL, VOH and VIH are given, and τ is known (from the electrical parameters), so the only
unknown is T:
T ≥ τ ln((VOL − VOH)/(VIH − VOH))
This formula provides the minimum period to have a correct transition from L to H.
If the system were symmetric, the dynamic analysis would end here, but this is not the case.
5.1.2 H-L transition
VIRX(t) = (VOH − VOL)e^(−t/τ) + VOL
VIRX must reach at least VIL in one period.
Let M = max{ ln((VOL − VOH)/(VIH − VOH)), ln((VOH − VOL)/(VIL − VOL)) },
so:
Tmin = RTX(CL + nCRX)·M
M depends on VOH, VOL, VIH, VIL (so we know it from the electrical characteristics); CL, CRX
and RTX can be found in datasheets. Therefore, Tmin depends on n!
This equation allows us to answer to:
• Given n, which is the maximum clock frequency we can work with?
• Given a clock frequency, which is the maximum value of n?
Let us now consider an example.
N.B. the sign of the currents depends on the load connection: e.g. if the current flows out of
the TX (when we send a 1) it is negative.
This means that, even if, from the static point of view, we could connect more than 300 RXs
to the same TX with no electrical problems, if we want the system to work at 10 MHz we can
use at most 3 RXs, while at 100 MHz it won't work at all.
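A small C sketch of this computation; the electrical parameters below are hypothetical, not the ones of the original example:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* hypothetical electrical parameters (volts, ohms, farads) */
    double VOH = 3.3, VOL = 0.0, VIH = 2.0, VIL = 0.8;
    double RTX = 100.0, CL = 20e-12, CRX = 10e-12;
    double M = fmax(log((VOL - VOH) / (VIH - VOH)),
                    log((VOH - VOL) / (VIL - VOL)));
    for (int n = 1; n <= 4; n++) {
        double Tmin = RTX * (CL + n * CRX) * M;   /* Tmin = RTX(CL+nCRX)M */
        printf("n = %d  Tmin = %.2f ns  fmax = %.0f MHz\n",
               n, Tmin * 1e9, 1e-6 / Tmin);
    }
    return 0;
}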
but how steep the slopes of the digital signal are. When the rising and falling times are
sufficiently small, so that the transitions are very fast, the period of the signal may no longer
be the main contribution.
As a rule of thumb, an interconnection should be treated as a transmission line when its time
delay T is greater than 1/10 of the signal rise time tr.
How do we compute the time delay?
Signals in a PCB propagate at roughly half the speed of light:
v = c/2 = 150·10^6 m/s = 0.15 m/ns = 15 cm/ns
Given the wire length, we obtain the time delay.
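A minimal C sketch of this rule of thumb, assuming v = 15 cm/ns and hypothetical trace length and rise time:

#include <stdio.h>

int main(void) {
    double length_cm = 20.0;              /* hypothetical trace length  */
    double tr_ns     = 1.0;               /* hypothetical rise time     */
    double T_ns = length_cm / 15.0;       /* delay at v = 15 cm/ns      */
    printf("T = %.2f ns, tr/10 = %.2f ns -> %s\n", T_ns, tr_ns / 10.0,
           T_ns > tr_ns / 10.0 ? "treat as a transmission line"
                               : "lumped model is fine");
    return 0;
}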
If we have a rather large and complex PCB with a lot of digital circuits working at high
frequencies, the drivers of these signals probably produce very steep edges, and the problem is
not the speed of the system itself but the steepness of the edges. With very steep edges and
long wires on the PCB, the model described so far is no longer reliable and we need a more
refined one.
From the previous example, we can notice how the steeper the slope, the more accentuated
the ringing effect.
Figure 5.11: Transmission line model
v(z, t) states that the voltage at any section and any time instant is given by the superposition
of two waves: v+, the forward-propagating one, and v−, which propagates in the opposite
direction.
The signals we will analyse are square waves, not the sinusoidal waves of the usual
transmission-line theory.
Sending a signal
Consider the following system:
This result shows how, at the beginning of time, the effect of the TL is equal to a load of
impedance Z0, so there is a voltage division between Z0 and Rg.
Propagation
Receiving a signal
At the receiver side, if we model the RX as Rl, then v(t, L) = Rl·i(t, L).
With the TL equations it becomes:
Possible scenarios:
• if Rl is equal to Z0, Γ is 0 and there is no reflection;
• depending on Rl and Z0 there can be a reflection: in this case, there is a back-propagating
wave.
The amplitudes of the forward- and back-propagating waves are different, because Γ
modulates the amplitude of the back-propagating one; the shape is the same and so are the
propagation parameters.
After a certain time, the back-propagating wave reaches the beginning of the transmission
line, where it sees a reflection coefficient Γg = (Rg − Z0)/(Rg + Z0), and it may be reflected
again if the TX impedance is not matched to the characteristic impedance of the line.
This phenomenon goes on for a certain amount of time and can give problems to the
electronic system, like providing voltages that are out of the range the system was designed for.
Figure 5.13: Multiple reflections
This phenomenon does not last forever, because Γ is bounded between +1 and −1 and each
reflection attenuates the effect.
If the transmitted signal is limited in time (like a pulse) the phenomenon is even less
accentuated. However, a case closer to real applications is the transmission of a step function.
When both Γg and Γl are positive, each reflection adds a positive value to the far-end signal:
we get staircase waveforms.
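A small lattice-diagram sketch of the staircase build-up in C, with hypothetical values (far end nearly open, Γg = 0.8 at the near end): each arriving wave adds v1(1 + Γl)(ΓlΓg)^k to the far-end voltage.

#include <stdio.h>

int main(void) {
    /* hypothetical values: Z0 = 50 ohm, Rg = 450 ohm (Gg = 0.8),
       far end nearly open (Gl ~ 1), unit step Vs                  */
    double Vs = 1.0, Z0 = 50.0, Rg = 450.0, Rl = 1e9;
    double Gl = (Rl - Z0) / (Rl + Z0);
    double Gg = (Rg - Z0) / (Rg + Z0);
    double v1 = Vs * Z0 / (Rg + Z0);      /* first incident wave       */
    double term = v1 * (1.0 + Gl);        /* far-end step at t = tp    */
    double vfar = 0.0;
    for (int k = 0; k < 6; k++) {
        vfar += term;                     /* wave arriving at (2k+1)tp */
        printf("t = %2d tp   Vfar = %.3f V\n", 2 * k + 1, vfar);
        term *= Gl * Gg;                  /* one more round trip       */
    }
    return 0;
}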
We obtain a first-order differential equation in canonical form.
This result shows that the behaviour of the waveform reflected by a capacitive load is
rather similar to the voltage of an RC circuit. The unit step function u(t) is a reminder that
this solution is valid only for the positive time axis.
5.4 Matching
We have just seen that:
• if the resistance at the driver side is equal to the characteristic impedance, the driver is
matched to the transmission line;
• if the resistance at the load side is equal to the characteristic impedance, the load is
matched to the transmission line.
5.4.1 Incident Wave Switching (IWS)
If the load is matched there is no reflection (parallel termination). The system can be sized
such that the incident wave switches the receiver: without reflection the wave arrives with
low delay and the RX switches very fast.
In this case we should not match the driver, since there is no reflected wave. Moreover,
matching the driver would halve the amplitude of the incident wave, so it is better to have
Rg < Z0 (such that the voltage divider ratio is close to 1).
The voltage-divider effect is very low (difference between VA and VB). Since the load is
matched, VC is equal to VB except for a delay (tp, the propagation delay). The problem is that
the presence of Rl implies a current flowing, hence high power consumption.
At tp, VB is equal to ½VA and the RX would not be able to switch; however, since Γ is 1,
VC is the sum of two contributions equal to ½VA. Therefore, almost instantaneously after tp,
the RX switches. At 2tp the voltage VB is stable.
This solution (series termination) reduces the power-consumption issue of the previous one
and avoids the reflection problems.
Figure 5.14: Near end matching: Destination Open - Source matches line impedance (series
termination)
Figure 5.15: Mismatch at both ends: Γ=1 at far end (open) Γ=0.8 at near end
Figure 5.16: Mismatch at both ends: Γ=1 at far end (open) Γ=- 0.5 at near end
Figure 5.17: Capacitor at far end: remember that the effect of a capacitor as a load has the same
effect of a capacitor in a RC circuit (exponential behaviour that depends on RC). Matching at
the near end side makes the signal at the RX side almost ideal
Chapter 6
Serial Communications
Parallel transfer is very effective in terms of bandwidth: if TX and RX work with the same
clock frequency, we can transmit an N-bit word in just one clock cycle. On the other hand, in
serial communications, N bits are transferred from TX to RX in at least N clock cycles.
6.1.1 Parallel Connection
• Data and timing (Clock) on separate wires → skew (especially at very high clock fre-
quency, see DDR memories).
• Few drivers, N cycles → similar power consumption to the parallel link (because the driver
is used for a much longer time);
• Data and Clock on the same wire → no skew (requires proper protocol).
6.2 Communication glossary
Figure 6.5: Serial transmission rate Bit Rate = 1/Tbit
Tbit (bit time) can be seen as the period of the clock of the transmission system. We need to
sample with correct set-up and hold times. A good solution is having the transmitter and the
receiver work on different clock edges.
Example: we want to transmit 45h. First we need the PISO; we can start from the LSB or
the MSB. At the receiver side, if the clock signal has a phase shift of half a clock period, we
are able to sample each bit correctly at its half period. Therefore we fill the SIPO register and,
after 8 clock cycles, it is completely loaded (with 45h) and the Ready signal goes to 1. The
time we have to wait from when the data is loaded at the transmitter side to when it is read
at the receiver side is quite long; it is called latency.
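A minimal C sketch of the PISO/SIPO mechanism for 45h, LSB first, with the shift registers reduced to plain variables:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t tx = 0x45, rx = 0;
    for (int i = 0; i < 8; i++) {                 /* one bit per clock     */
        int bit = (tx >> i) & 1;                  /* PISO: LSB shifted out */
        rx = (uint8_t)((rx >> 1) | (bit << 7));   /* SIPO: shift in at MSB */
    }
    printf("received 0x%02X after 8 cycles\n", rx); /* prints 0x45        */
    return 0;
}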
6.3 Asynchronous and synchronous links
In serial asynchronous links, bits are organized in characters (8 bits of data) and the
transmission is discontinuous (we have to wait to check that we can start again). Bit
synchronization is therefore required on each character. Synchronization is usually achieved
with some special characters.
• PRO: the overhead is very low;
• CONS: we need a perfectly synchronized clock at RX and TX.
In serial synchronous links, bits are organized in packets (frames; the size of the information
may be quite large) and the transmission is continuous. The bit synchronization is continuous,
but additional information for frame synchronization is needed.
6.3.2 Terminology
• Transmitter: device that sends data to the bus;
• Receiver: device that receives data from the bus. TX and RX are roles defined at the
electrical level of the system;
• Master: device initiating a transfer, generates the clock and terminates a transfer (e.g.
processor);
• Slave: device addressed by the master (e.g. memory);
• Multi-master: more than one master can attempt to control the bus;
• Arbitration: procedure to ensure that only one master has control of the bus at any
instant;
• Synchronization: procedure to synchronize the clocks of two or more devices.
6.3.3 Serial asynchronous protocol
1. The resting line has a defined and steady state (usually high);
4. The CKRX clock is generated at the receiver and phase-synchronized with the falling edge
of the Start bit. Bit synchronization is maintained for a limited time;
5. To guarantee the sensing of the Start bit, there is at least one Stop bit between adjacent
characters. The Stop bit indicates that the data is ended and the line is brought back to the
idle state.
In fig. 6.9 we can notice there is an overhead due to the difference between Character and Data.
Example: UART (Universal Asynchronous Receiver/Transmitter)
• TX side:
- Conversion with a PISO register (a PISO is used to convert parallel data into a stream
of bits);
- Start and Stop bit insertion → insertion of the parity bit, if required.
• RX side:
Fig. 6.11 lists the configuration for this example. The main clock at the RX side is 50 MHz
(even though the communication is asynchronous, we need a clock to sample the data).
The RX is a FF which samples the data line according to a clock signal; then we need logic
to detect the Start bit and a CU to correctly fill the SIPO.
In the timing diagram we can notice how we sample the data line with Tck = 20 ns
(corresponding to 50 MHz). With such a high clock frequency we obtain more than one sample
per bit (since the baud rate is 19200, Tbit is close to 52 μs). Therefore we have a lot of samples
corresponding to just one transmitted bit.
Figure 6.12: Eye diagram: the best sampling point is in the middle of the “Eye”
However, due to noise, one single sample, even if taken in the middle of the eye, might not be
the best choice. A possible solution: take 3 samples and use a voter (to decide the correct
value). In this way we trade complexity for reliability.
It is not mandatory to take 3 consecutive samples, since consecutive samples can all be
affected by the same noise.
We choose the resolution and keep only the needed samples. Since there are a lot of samples
available we can consider, for example, one sample every 16 and then perform the voting
operation on the 3 samples in the region of interest. It is a valid solution involving just one
counter, three registers and a voter.
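A minimal C sketch of the 3-sample majority vote on an oversampled bit (the sample values below are hypothetical):

#include <stdio.h>

static int majority3(int a, int b, int c) { return (a & b) | (a & c) | (b & c); }

int main(void) {
    /* hypothetical 16x oversampling of one bit; s[6] corrupted by noise */
    int s[16] = {1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1};
    int bit = majority3(s[7], s[8], s[9]);   /* 3 samples around the centre */
    printf("decoded bit = %d\n", bit);       /* 1 despite the glitch        */
    return 0;
}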
Figure 6.14: E.g. 16 samples (1 sample every 2604/16= 162 cycles)
I2C:
- Introduced by Philips;
- Designed to connect a few chips;
- Short distance;
- 2 wires (low cost).
SPI:
- Introduced by Motorola;
- Same purpose as I2C;
- 3/4 wires (more complex but simpler to use).
I2C
I2C speed
We need a proper way to connect the devices to the bus. In order to avoid shorts, I2C
devices are connected in open drain. Each device is able to drive the line to the low logic
level, while the logic 1 is imposed by the pull-up resistors. The open-drain characteristic is a
limiting factor for the speed of the system.
Figure 6.15: Only two wires: Serial Data Line (SDA) and Serial Clock Line (SCL).
• Rp sizing:
- Static condition → when one device pulls down a line, the current must be no more
than the IOL of the device:
(VDD − VOL)/Rp ≤ IOL → Rp ≥ (VDD − VOL)/IOL = Rp,min
N.B. increasing Rp we increase, in dynamic conditions, the time constant of the
system, making it slower.
- Dynamic condition → maximum rise time:
The static condition gives the lower bound for Rp, while the dynamic condition gives the
upper bound. We have to be sure that the maximum rise time (required to go from VIL to
VIH) is compatible with our needs. Therefore, the question is: which is the maximum Rp?
Let VIL = αVDD and VIH = βVDD, let C be the maximum capacitance allowed (400 pF) and
R the pull-up resistor.
V(t1) = αVDD = VDD(1 − e^(−t1/RC)) → t1 = RC ln(1/(1−α))
V(t2) = βVDD = VDD(1 − e^(−t2/RC)) → t2 = RC ln(1/(1−β))
tr = t2 − t1 = RC ln((1−α)/(1−β)) ≤ tr,MAX
R ≤ tr,MAX / (C ln((1−α)/(1−β))) = Rp,MAX
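A small C sketch of both bounds, with hypothetical (though I2C-like) values for the electrical parameters:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* hypothetical (I2C-like) values */
    double VDD = 3.3, VOL = 0.4, IOL = 3e-3;       /* static side            */
    double a = 0.3, b = 0.7;                       /* VIL/VDD and VIH/VDD    */
    double C = 400e-12, tr_max = 1000e-9;          /* dynamic side           */
    double Rp_min = (VDD - VOL) / IOL;             /* static lower bound     */
    double Rp_max = tr_max / (C * log((1 - a) / (1 - b))); /* dynamic upper  */
    printf("Rp_min = %.0f ohm, Rp_max = %.0f ohm\n", Rp_min, Rp_max);
    return 0;
}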
The maximum capacitance (CMAX) limits the number of devices which can be connected.
What if C ≥ CMAX? The standard suggests 4 possibilities:
• Reduced frequency;
• Bus buffers;
I2C transitions
• IDLE state: Serial Data Line (SDA) and Serial Clock Line (SCL) both at 1.
In this way we can sample when SCL is 1 and get a proper value, since SDA is kept stable.
Even though I2C is a synchronous standard, Philips decided to use START and STOP
conditions to have better control of what happens on the bus. Since our system is made of
open-drain devices, so the resting line is at logic 1, we need a way to indicate the start of the
data transmission.
Figure 6.17: I2C transitions - 2
Both start and stop conditions are generated by the bus master.
The bus is considered busy after a start condition, until a stop condition occurs.
Transfers are byte-oriented:
- each transferred byte is followed by an acknowledge (the acknowledge is given by the slave
receiving the command);
- SDA is pulled to 0 if the transfer is OK;
- SDA is left at 1 if the transfer is KO.
Figure 6.18: Acknowledge (N.B. the Master does not wait an infinite amount of time for the
acknowledge)
I2C Addressing and Data transfer
The bus master starts a transaction by issuing a start condition. Then the master addresses
a slave.
Addressing: 7-bit address + 1 bit for R/W (0 → write, 1 → read). The slave with the
corresponding address answers with an ACK on SDA.
Data transfer:
• Start Condition (Master)
• Slave address + R/W (Master)
• Acknowledges with ACK (Slave)
I2C multimaster
If the system has more than one master connected to the bus, it is important to decide which
one takes control of the bus. This is based on a two-step procedure:
1. Clock Synchronization:
The synchronization requires a common clock, so SCL is used. One master drives SCL
low and starts counting its low period; another one detects SCL low, starts driving SCL
low too and counts its own low period. When a master ends its low period it releases
SCL and checks its value. Two options are possible: SCL is high, so it starts counting
its high period; or SCL is still low, and it waits for SCL to become high before starting
to count its high period. The first master finishing its high period pulls SCL low again.
The low period of the synchronized clock is thus defined by the slowest master and the
high period by the fastest one.
Figure 6.22: Clock Synchronization: CLK1 is the first master, CLK2 is the second master
In the case of equally fast masters, more than one master can get access concurrently, so
we need an arbitration mechanism.
2. Arbitration
Two or more masters can start a transaction concurrently. They drive SDA low while SCL
is high. Arbitration decides which one completes the transaction.
The procedure is bit by bit and each master reads SDA when SCL is high: if SDA matches
what was sent, then the bit is OK; if SDA does not match what was sent, then the arbitration
is lost and the SDA driver is turned off. Two or more masters with identical transmissions
complete their transaction.
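A one-bit C sketch of the arbitration rule (the open-drain bus behaves as a wired-AND):

#include <stdio.h>

int main(void) {
    int m1 = 1, m2 = 0;            /* bits the two masters try to send   */
    int sda = m1 & m2;             /* open-drain bus acts as a wired-AND */
    printf("master1 %s\n", (m1 != sda) ? "loses arbitration" : "continues");
    printf("master2 %s\n", (m2 != sda) ? "loses arbitration" : "continues");
    return 0;
}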
Figure 6.23: Arbitration: the master understands whether it is winning or losing access to the
bus not just from the ACK of the slave but also by sampling SDA.
• Clock stretching: any slave can hold SCL low to slow down the transfer, so as to sample the
line correctly (the master understands the slave is not on time and waits before continuing
the transmission). In this way we can slow down the transmission without aborting it (thus
still providing an ACK);
SPI
• Up to 10 MHz.
• No standard.
- Clock (SCLK);
- Master Out Slave In (MOSI);
- Master In Slave Out (MISO);
- Slave Select (SS) / Chip Select (CS).
SPI connections: single slave
The Master drives SCLK, SS and MOSI; the Slave drives MISO.
The Master selects one Slave via its SSx line. Unselected slaves must put MISO in high
impedance.
2. Daisy Chain:
The Master has MOSI connected to the first Slave and MISO connected to the last one.
The Slaves in the middle are connected as a chain.
PRO: just one SS, we don't need high impedance and all MISO signals are standard (we
don't have electrical problems). CONS: lower speed, less simple protocol (we select all the
slaves and one of them understands the call).
• Clock polarity (CPOL): defines the idle level of the clock;
• Clock phase (CPHA): defines the edges for sampling and outputting.
The following two figures show an example of a write and a read operation to and from an
SPI memory.
– Command (instruction);
– Data (2 bytes).
Figure 6.28: SPI write