
Outline: - Introduction - Different Scratch Pad Memories - Cache and Scratch Pad For Embedded Applications

This document discusses efficient utilization of scratch-pad memory in embedded processor applications. It describes scratch-pad memory and how it differs from cache memory. It outlines an approach for partitioning application variables between on-chip scratch-pad memory and off-chip DRAM to minimize execution time. Key factors that affect this partitioning like variable lifetimes, access frequencies, and loop conflicts are described. An algorithm is presented for determining how to map variables to each memory type.
Copyright
© Attribution Non-Commercial (BY-NC)

Outline

- Introduction
- Different Scratch Pad Memories
- Cache and Scratch Pad for Embedded Applications

Memories in Embedded Systems


Each memory has its own advantages.
[Diagram: CPU with internal ROM, internal SRAM, and external DRAM]
For better performance, memory accesses have to be fast.

Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications

What is Scratch-Pad Memory?

- Fast on-chip SRAM
- Abbreviated as SPM
- 2 types of SPM:
  - Static: SPM locations don't change at runtime
  - Dynamic: SPM locations change at runtime

Objective
Find a technique for efficiently exploiting on-chip SPM by partitioning the application's scalar and array variables between off-chip DRAM and on-chip SPM, so as to minimize the total execution time of the application.

SPM and Cache


Similarities
- Connected to the same address and data buses.
- Access latency of 1 processor cycle (for the cache, on a hit).

Difference
- SPM guarantees single-cycle access time, while an access to cache is subject to a miss.

Block Diagram of Embedded Processor Application

Division of Data Address Space between SRAM and DRAM

Example: Histogram Evaluation Code


Builds a histogram of 256 brightness levels for the pixels of an N × N image:

char BrightnessLevel[512][512];
int Hist[256]; /* Elements initialized to 0 */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        /* For each pixel (i, j) in image */
        level = BrightnessLevel[i][j];
        Hist[level] = Hist[level] + 1;
    }

Problem Description
If the code is executed on a processor configured with a data cache of size 1 KB, performance will be degraded by conflict misses in the cache between elements of the two arrays Hist and BrightnessLevel.
Solution: selectively map to SPM those variables that cause the maximum number of conflicts in the data cache.
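To see why the two arrays can collide, the sketch below computes which cache line an address maps to. The direct-mapped organisation, the 16-byte line size, and the example addresses are assumptions chosen for illustration; the slides state only the cache size.

```c
#include <stdint.h>

/* Assumed geometry: direct-mapped, 1 KB total, 16-byte lines (illustrative only). */
enum { CACHE_SIZE = 1024, LINE_SIZE = 16, NUM_LINES = CACHE_SIZE / LINE_SIZE };

/* Cache line that a byte address maps to. */
static unsigned cache_line_index(uintptr_t addr) {
    return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
}

/* Two addresses conflict when they map to the same cache line
   but belong to different memory blocks. */
static int addresses_conflict(uintptr_t a, uintptr_t b) {
    return cache_line_index(a) == cache_line_index(b)
        && a / LINE_SIZE != b / LINE_SIZE;
}
```

Under these assumptions, and if int is 4 bytes, Hist alone spans 1 KB and therefore touches every cache line, so any element of BrightnessLevel shares a line with some element of Hist.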

Partitioning Strategy
Features affecting partitioning
- Scalar variables and constants
- Size of arrays
- Lifetimes of array variables
- Access frequency of array variables
- Conflicts in loops

Partitioning Algorithm

Features affecting partitioning


Scalar variables and constants
All scalar variables and scalar constants are mapped onto SPM.

Size of Arrays
Arrays that are larger than SRAM are mapped onto off-chip memory.

Lifetime of an Array Variable
Definition: the period between a variable's definition and its last use.
Variables with disjoint lifetimes can be stored in the same processor register.
Likewise, arrays with disjoint lifetimes can share the same memory space.

Intersecting Life Times ILT(u)
Definition: the number of array variables having a non-null intersection of lifetimes with u.
ILT(u) indicates the number of other arrays that u could possibly interact with in the cache.
So map arrays with the highest ILT values into SPM, thereby eliminating a large number of potential conflicts.
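The ILT definition can be sketched as a simple interval-overlap count. The Lifetime struct and the use of closed integer intervals are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical lifetime record: closed interval [start, end] in program order. */
typedef struct { long start, end; } Lifetime;

static int intersects(Lifetime a, Lifetime b) {
    return a.start <= b.end && b.start <= a.end;
}

/* ILT(u): number of other arrays whose lifetime overlaps that of arrays[u]. */
size_t ilt(const Lifetime *arrays, size_t n, size_t u) {
    size_t count = 0;
    for (size_t v = 0; v < n; v++)
        if (v != u && intersects(arrays[u], arrays[v]))
            count++;
    return count;
}
```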

Access frequency of Array Variables
Variable Access Count VAC(u)
Definition: the number of accesses to elements of u during its lifetime.
Interference Access Count IAC(u)
Definition: the number of accesses to other arrays during the lifetime of u.
Interference Factor
IF(u) = VAC(u) × IAC(u)
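As a rough illustration of these three quantities, the sketch below counts accesses in a recorded trace. The Access record and the idea of a timestamped trace are assumptions for the example; in practice the counts might come from static analysis or profiling.

```c
#include <stddef.h>

/* One element access in a hypothetical trace: which array, and when. */
typedef struct { int array_id; long time; } Access;

/* Computes VAC(u) and IAC(u) over the lifetime [start, end] of array u
   and returns IF(u) = VAC(u) * IAC(u). */
long interference_factor(const Access *trace, size_t n, int u,
                         long start, long end, long *vac, long *iac) {
    *vac = 0;
    *iac = 0;
    for (size_t i = 0; i < n; i++) {
        if (trace[i].time < start || trace[i].time > end)
            continue;                 /* outside the lifetime of u */
        if (trace[i].array_id == u)
            (*vac)++;                 /* access to u itself */
        else
            (*iac)++;                 /* access to some other array */
    }
    return *vac * *iac;
}
```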

Conflicts in Loops
for i = 0 to N-1
    access a[i]
    access b[i]
    access c[2i]
    access c[2i + 1]
end for

Loop Conflict Graph (LCG)
Nodes a, b, and c; edges a-c and b-c, each of weight 3N.
Edge weight: $e(u, v) = \sum_{i=1}^{p} k_i$, where $k_i$ is the total number of accesses to u and v in loop i.
Total number of accesses to a and c combined: (1 + 2) × N = 3N, so e(a, c) = 3N; e(b, c) = 3N; e(a, b) = 0.

Loop Conflict Factor
Definition: the sum of the weights of the edges incident on node u.
$LCF(u) = \sum_{v \in LCG - \{u\}} e(u, v)$
The higher the LCF, the more conflicts are likely for an array, and the more desirable it is to map the array to the SPM.
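The edge weights and LCF values for the three-array loop example can be reproduced with a small sketch. The matrix representation and the function names are assumptions; which pairs of arrays can conflict in a loop is taken as given here, as in the example (a with c, and b with c).

```c
/* Array identifiers for the loop example (hypothetical names). */
enum { ARR_A, ARR_B, ARR_C, NUM_ARRAYS };

/* Adds one loop's contribution k_i = accesses to u + accesses to v
   to the edge e(u, v) of the loop conflict graph. */
void add_loop_conflict(long e[NUM_ARRAYS][NUM_ARRAYS], int u, int v,
                       long acc_u, long acc_v) {
    e[u][v] += acc_u + acc_v;
    e[v][u] = e[u][v];               /* the graph is undirected */
}

/* LCF(u): sum of the weights of all edges incident on u. */
long lcf(long e[NUM_ARRAYS][NUM_ARRAYS], int u) {
    long sum = 0;
    for (int v = 0; v < NUM_ARRAYS; v++)
        if (v != u)
            sum += e[u][v];
    return sum;
}
```

With N = 100, adding the a-c and b-c conflicts gives e(a, c) = e(b, c) = 300, so c ends up with the largest LCF and would be the first candidate for SPM.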

Partitioning Algorithm
An algorithm for deciding whether each program variable (scalar or array) is mapped to SPM or to DRAM/cache.
- First, assign scalar constants and variables to SPM.
- Arrays that are larger than the SPM are mapped onto DRAM.

- For the remaining n arrays, generate lifetime intervals and compute LCF and IF values.
- Sort the 2n interval points thus generated and traverse them in increasing order.
- For each array u encountered: if there is sufficient SRAM space for u and for all arrays whose lifetimes intersect the lifetime interval of u and whose LCF and IF values are more critical, map u to SPM; otherwise map it to DRAM/cache.
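The steps above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the ArrayInfo record is hypothetical, arrays are visited in order of lifetime start rather than by walking all 2n interval points, and "more critical" is taken to mean a larger LCF + IF, which is an assumption since the slides do not say how the two values are combined.

```c
#include <stdlib.h>

/* Hypothetical record for one candidate array. */
typedef struct {
    const char *name;
    long size;          /* bytes */
    long start, end;    /* lifetime interval [start, end] */
    long lcf, ifac;     /* loop conflict factor, interference factor */
    int decided;        /* set once a mapping decision has been made */
    int in_spm;         /* 1 = SPM, 0 = DRAM/cache */
} ArrayInfo;

/* Assumed criticality measure: LCF + IF. */
static long criticality(const ArrayInfo *a) { return a->lcf + a->ifac; }

static int lifetimes_intersect(const ArrayInfo *a, const ArrayInfo *b) {
    return a->start <= b->end && b->start <= a->end;
}

static int by_start(const void *x, const void *y) {
    const ArrayInfo *a = x, *b = y;
    return (a->start > b->start) - (a->start < b->start);
}

/* spm_free: SPM bytes left after scalars and constants have been placed. */
void partition_arrays(ArrayInfo *arr, size_t n, long spm_free) {
    for (size_t i = 0; i < n; i++)
        arr[i].decided = arr[i].in_spm = 0;
    qsort(arr, n, sizeof *arr, by_start);
    for (size_t i = 0; i < n; i++) {
        ArrayInfo *u = &arr[i];
        long need = u->size;
        /* Reserve room for overlapping arrays that are at least as critical
           and have not already been sent to DRAM. */
        for (size_t j = 0; j < n; j++) {
            const ArrayInfo *v = &arr[j];
            if (v == u || !lifetimes_intersect(u, v)) continue;
            if (v->decided && !v->in_spm) continue;
            if (criticality(v) >= criticality(u)) need += v->size;
        }
        u->decided = 1;
        u->in_spm = (need <= spm_free);  /* also rejects arrays larger than SPM */
    }
}
```

In this sketch, arrays with disjoint lifetimes never count against each other, which reflects the earlier point that such arrays can share the same memory space.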

Performance Details for Beamformer Example

Typical Applications
- Dequant: de-quantization routine in MPEG decoder application
- IDCT: Inverse Discrete Cosine Transform
- SOR: Successive Over-Relaxation algorithm
- MatrixMult: matrix multiplication
- FFT: Fast Fourier Transform
- DHRC: Differential Heat Release Computation algorithm

Performance Comparison of Configurations A, B, C and D

Conclusion
- Average improvement of 31.4% over A (only SRAM)
- Average improvement of 30.0% over B (only cache)
- Average improvement of 33.1% over C (random partitioning)

Compiler Decided Dynamic Memory allocation for Scratch Pad Based Embedded Systems.

Cache Is One Option for On-Chip Memory

[Diagram: CPU with internal ROM, cache, and external DRAM]

Why Not All Embedded Systems Have Cache Memory

The reasons could be:
- Increased on-chip area
- Increased energy
- Increased cost
- Hit latency and nondeterministic cache access

A method for allocating program data to non-cached SRAM:
- Dynamic, i.e. allocation changes at runtime
- Compiler-decided transfers
- Zero overhead per memory instruction, unlike software or hardware caching
- Has no software caching tags
- Requires no run-time checks
- Highly predictable memory access times

Static Approach
Program: int a[100]; int b[100]; while(i<100) ...a...; while(i<100) ...b...
Allocation, fixed for the whole run:
- Internal SRAM: int a[100]
- External DRAM: int b[100]

Dynamic Approach
Same program; the allocator changes the allocation between the two loops.
During the first loop:
- Internal SRAM: int a[100]
- External DRAM: int b[100]
During the second loop:
- Internal SRAM: int b[100]
- External DRAM: int a[100]

It is similar to caching, but under compiler control

Compiler-Decided Dynamic Approach


int a[100];
int b[100];
// a is in SRAM
while (i < 100) ...a...
// Copy a out to DRAM
// Copy b in to SRAM
while (i < 100) ...b...
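A runnable approximation of this idea is sketched below. The spm buffer stands in for the on-chip SRAM region and the dram_a and dram_b arrays for the DRAM home locations; these names, the loop bodies, and the use of memcpy for the transfers are all assumptions for illustration. In the actual scheme the compiler would insert the copies itself.

```c
#include <string.h>

enum { LEN = 100 };

static int dram_a[LEN], dram_b[LEN];  /* home locations in (simulated) DRAM */
static int spm[LEN];                  /* stand-in for the on-chip SRAM region */

long run_two_phases(void) {
    long sum = 0;

    memcpy(spm, dram_a, sizeof spm);  /* a is in SRAM for the first loop */
    for (int i = 0; i < LEN; i++) {
        spm[i] = i;                   /* every access to a hits SRAM */
        sum += spm[i];
    }
    memcpy(dram_a, spm, sizeof spm);  /* copy a out to DRAM (it was modified) */

    memcpy(spm, dram_b, sizeof spm);  /* copy b in to SRAM */
    for (int i = 0; i < LEN; i++) {
        spm[i] += 1;                  /* every access to b hits SRAM */
        sum += spm[i];
    }
    memcpy(dram_b, spm, sizeof spm);  /* write b back before SRAM is reused */
    return sum;
}
```

The two loops never pay DRAM latency on individual accesses; the cost is concentrated in the block copies at the region boundary.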

Need to minimize transfer costs for greater benefit.
Requirements:
- Accounts for changing program behavior at run time
- Decides on dynamic behavior statically
- Compiler manages and decides the transfers between SRAM and DRAM

Transfer cost

Approach
The method is to:
- Use profiling to estimate reuse
- Copy variables in to SRAM when reused
- Use a cost model to ensure that benefit exceeds cost
- Transfer data between on-chip and off-chip memory under compiler supervision
- Keep the data allocation known to the compiler at each point in the code

Advantages
- Benefits with no software translation overhead
- Predictable SRAM accesses, ensuring better real-time guarantees than hardware or software caching
- No more data transfers than caching

Overview of Strategy
Divide the complete program into different regions.
For (starting point of each region)
<
    Remove some variables from SRAM
    Copy some variables into SRAM from DRAM
>

Some Important Questions

- What are regions?
- What to bring in to SRAM?
- What to evict from SRAM?
The problem has an exponential number of candidate solutions (NP-complete).

Regions
A region is the code between successive program points.
Regions coincide with changes in program behavior.
New regions start at:
- The start of each procedure
- Before the start of each loop
- Before conditional statements containing loops or procedures

What to Bring in to SRAM ?


Bring in variables that are re-used in the region, provided the cost of transfer is recovered. These transfers will reduce the memory access time.
The cost model accounts for:
- Profile-estimated re-use
- Benefit from reuse
- Detailed cost of transfer
  - Bring-in cost
  - Eviction cost
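One way to picture such a cost model is the check below. The TransferEstimate record, its field names, and the linear cycle-count formula are assumptions made for this sketch; the slides only state that benefit must exceed cost.

```c
/* Hypothetical estimate for bringing one variable into SRAM for one region.
   All quantities are in processor cycles except reuse_count. */
typedef struct {
    long reuse_count;    /* profile-estimated accesses within the region */
    long dram_latency;   /* cycles per access if the variable stays in DRAM */
    long sram_latency;   /* cycles per access once it is in SRAM */
    long bring_in_cost;  /* cycles to copy the variable into SRAM */
    long eviction_cost;  /* cycles to write back what it displaces */
} TransferEstimate;

/* Cycles saved by the transfer minus cycles spent on it. */
long net_benefit(const TransferEstimate *t) {
    long saved = t->reuse_count * (t->dram_latency - t->sram_latency);
    return saved - (t->bring_in_cost + t->eviction_cost);
}

/* The transfer is made only if its cost is recovered. */
int worth_bringing_in(const TransferEstimate *t) {
    return net_benefit(t) > 0;
}
```

For example, with a 10-cycle DRAM, a 1-cycle SRAM, and 400 cycles of combined transfer cost, 50 reuses save 450 cycles and justify the transfer, while 20 reuses save only 180 and do not.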

What to Remove from SRAM?


- Evict the data variables whose next access is furthest in the future.
- This requires a concept of time order among the different code regions.
- The time order can be obtained by assigning timestamps to each of the nodes.
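The eviction rule can be sketched as a scan for the largest next-use timestamp. The Resident record and the assumption that each SRAM-resident variable carries the timestamp of its next use are illustrative only.

```c
#include <stddef.h>

/* Hypothetical record for a variable currently held in SRAM. */
typedef struct {
    const char *name;
    long next_use;   /* timestamp of the next region that accesses it */
} Resident;

/* Returns the index of the variable whose next use is furthest in the
   future, or -1 if nothing is resident. */
int pick_victim(const Resident *vars, size_t n) {
    int victim = -1;
    for (size_t i = 0; i < n; i++)
        if (victim < 0 || vars[i].next_use > vars[(size_t)victim].next_use)
            victim = (int)i;
    return victim;
}
```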

The DPRG is a new data structure that helps in the identification of regions and the marking of time stamps. It is essentially a program's call graph extended with additional nodes:
- Loop nodes
- Variable nodes

The Data-Program Relationship Graph

[Figure: DPRG with nodes main, Proc_A, Proc_B, Proc_C and loop nodes, annotated with time stamps]
- Defines regions
- Depth-first search order reveals execution time order
- Allocation-change points at region changes

Time Stamps
- The method associates a time stamp with every program point.
- The time stamps form a total order.
- The program points are reached at run time in time stamp order.
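A much-simplified version of depth-first timestamping is sketched below. The Node layout and the rule of numbering each node once on first visit are assumptions; the actual DPRG marks more program points, and nodes reachable along several paths can carry several timestamps.

```c
enum { MAX_CHILDREN = 4 };

/* Hypothetical DPRG node: procedures and loops, children in program order. */
typedef struct {
    int num_children;
    int children[MAX_CHILDREN];
    int timestamp;               /* 0 = not yet visited */
} Node;

/* Visits nodes depth first, numbering each on its first visit.
   Returns the next unused timestamp. */
int assign_timestamps(Node *g, int node, int next) {
    if (g[node].timestamp != 0)
        return next;             /* already numbered */
    g[node].timestamp = next++;
    for (int i = 0; i < g[node].num_children; i++)
        next = assign_timestamps(g, g[node].children[i], next);
    return next;
}
```

Starting from the root with timestamp 1, a graph in which main calls Proc_A (containing a loop) and then Proc_B would be numbered main = 1, Proc_A = 2, loop = 3, Proc_B = 4.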

Optimizations
- There is no need to write back unmodified or dead SRAM variables into DRAM.
- Optimize data transfer code using DMA when it is available.
- Data transfer code can be placed in special memory block copy procedures.

Multiple Allocations due to Multiple Paths


The contents of SRAM could be different on different incoming paths to a node in the DPRG.
The problem can happen in:
- Loops
- Conditional execution
- Multiple calls to the same procedure

Conditional join nodes

[Figure: join node]
- Favor the most frequent path.
- A consensus allocation is chosen, assuming the incoming allocation from the most probable predecessor.

Procedure join nodes


- A few program points have multiple timestamps.
- The nodes with multiple timestamps are called join nodes, as they join multiple paths from main().
- A strategy is used that adopts different allocation strategies for different paths, but with the same code.

Offsets in SRAM
- SRAM can get fragmented when variables are swapped out.
- An intelligent offset mechanism is required.

In this method:
- Place memory variables with similar lifetimes together, so that larger fragments are freed when they are evicted together.

Experimental Setup
- Architecture: Motorola MCORE
- Memory architecture: 2 levels of memory
- SRAM size: estimated as 25% of the total data requirement
- DRAM latency: 10 cycles
- Compiler: GCC

Results

Conclusion
The designer has to choose the right mix of Scratch pad and Cache for performance advantages.

References
- Sumesh U, Rajeev B. Compiler Decided Dynamic Memory Allocation for Scratch Pad Based Embedded Systems.
- Alexandru N, Preeti P, N Dutt. Efficient Use of Scratch Pads in Embedded Applications.
- Josh Pfrimmer, Kin F. Li, and Daler Rakhmatov. Balancing Scratch Pad and Cache in Embedded Systems for Power and Speed Performance.

Questions

Thank you
