Outline:
- Introduction
- Different Scratch Pad Memories
- Cache and Scratch Pad for Embedded Applications
Objective
Find a technique for efficiently exploiting on-chip SPM by partitioning the application's scalar and array variables between off-chip DRAM and on-chip SPM. Minimize the total execution time of the application.
Difference
SPM guarantees single-cycle access time, while an access to the cache is subject to misses.
Problem Description
If the code is executed on a processor configured with a 1 KB data cache, performance will be degraded by conflict misses in the cache between elements of the two arrays Hist and BrightnessLevel. Solution: selectively map to the SPM those variables that cause the maximum number of conflicts in the data cache.
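To make the conflict concrete, here is a minimal C sketch of the kind of histogram loop the slide refers to; the array sizes, element types, and loop body are assumptions for illustration, and only the names Hist and BrightnessLevel come from the slide.

    #define N 4096          /* number of pixels (assumed)                 */
    #define LEVELS 256      /* number of brightness levels (assumed)      */

    unsigned char BrightnessLevel[N];   /* input pixel brightness values  */
    int Hist[LEVELS];                   /* histogram of brightness levels */

    void build_histogram(void)
    {
        /* Every iteration touches both arrays.  With a small (e.g. 1 KB)
           direct-mapped data cache, elements of the two arrays can map to
           the same cache lines and keep evicting each other, causing
           conflict misses.  Mapping the small, heavily reused Hist[] to
           the SPM removes these conflicts. */
        for (int i = 0; i < N; i++)
            Hist[BrightnessLevel[i]]++;
    }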
Partitioning Strategy
Features affecting partitioning
Scalar variables and constants
Size of arrays
Life-times of array variables
Access frequency of array variables
Conflicts in loops
Partitioning Algorithm
Size of Arrays
Arrays that are larger than SRAM are mapped onto off-chip memory.
[Figure: interference graph with nodes a, b, c; the edges a-c and b-c are labelled 3N]
Conflicts in loops: edge weight e(u, v) = Σ_{i=1}^{p} k_i, where k_i is the total number of accesses to u and v in loop i.
Example: the total number of accesses to a and c combined in the loop is (1 + 2) * N = 3N, so e(a,c) = 3N; e(b,c) = 3N; e(a,b) = 0.
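As a concrete reading of the edge-weight formula, here is a hedged C sketch. The two-loop access pattern, the N = 100 figure, and the rule that a loop contributes only when both arrays are accessed in it (which is what makes e(a,b) = 0 in the example) are assumptions, not text from the paper.

    #define NUM_LOOPS  2    /* p: number of loops (assumed for the example) */
    #define NUM_ARRAYS 3    /* the arrays a, b, c                           */

    enum { A, B, C };

    /* acc[i][v] = number of accesses to array v inside loop i, assuming
       N = 100 iterations: loop 0 reads a once and c twice per iteration,
       loop 1 reads b once and c twice per iteration. */
    static const long acc[NUM_LOOPS][NUM_ARRAYS] = {
        { 100,   0, 200 },
        {   0, 100, 200 }
    };

    /* e(u, v) = sum over loops i of k_i, where k_i is the combined number
       of accesses to u and v in loop i; a loop contributes only when both
       arrays are accessed in it, since otherwise they cannot conflict. */
    long edge_weight(int u, int v)
    {
        long e = 0;
        for (int i = 0; i < NUM_LOOPS; i++)
            if (acc[i][u] > 0 && acc[i][v] > 0)
                e += acc[i][u] + acc[i][v];
        return e;   /* e(A,C) = 300 = 3N, e(B,C) = 300 = 3N, e(A,B) = 0 */
    }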
Partitioning Algorithm
The algorithm determines the mapping decision for each (scalar and array) program variable to SPM or DRAM/cache. It first assigns scalar constants and variables to SPM. Arrays that are larger than the SPM are mapped onto DRAM.
For the remaining n arrays, it generates lifetime intervals and computes LCF (Loop Conflict Factor) and IF (Interference Factor) values. It sorts the 2n interval points thus generated and traverses them in increasing order. For each array u encountered, if there is sufficient SRAM space for u and for all arrays whose lifetimes intersect the lifetime interval of u and that have more critical LCF and IF values, it maps u to SPM, else to DRAM/cache.
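A hedged C sketch of the mapping decision just described; it skips the explicit sweep over the 2n sorted interval points and checks each array directly against the overlapping, more-critical arrays, and it collapses the LCF/IF comparison into a single criticality number. All names here (ArrayVar, crit, map_arrays, ...) are illustrative, not the paper's.

    typedef struct {
        const char *name;
        long   size;        /* bytes                                    */
        int    start, end;  /* lifetime interval [start, end]           */
        double crit;        /* criticality, standing in for LCF and IF  */
        int    in_spm;      /* 1 = mapped to SPM, 0 = DRAM/cache        */
    } ArrayVar;

    /* SPM space needed by the arrays whose lifetimes intersect v's and
       which are at least as critical as v (they would be preferred). */
    static long competing_space(const ArrayVar *a, int n, const ArrayVar *v)
    {
        long s = 0;
        for (int i = 0; i < n; i++) {
            const ArrayVar *u = &a[i];
            if (u == v)
                continue;
            int overlaps = !(u->end < v->start || v->end < u->start);
            if (overlaps && u->crit >= v->crit)
                s += u->size;
        }
        return s;
    }

    /* Map an array to SPM only if it fits together with every more
       critical array whose lifetime intersects its own; otherwise leave
       it in DRAM/cache.  spm_free is the SPM space left after scalars
       have been placed. */
    void map_arrays(ArrayVar *a, int n, long spm_free)
    {
        for (int i = 0; i < n; i++) {
            ArrayVar *v = &a[i];
            v->in_spm = (v->size + competing_space(a, n, v) <= spm_free);
        }
    }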
Typical Applications
Dequant - de-quantization routine in an MPEG decoder application
IDCT - Inverse Discrete Cosine Transform
SOR - Successive Over-Relaxation algorithm
MatrixMult - matrix multiplication
FFT - Fast Fourier Transform
DHRC - Differential Heat Release Computation algorithm
Conclusion
Average improvement of 31.4% over A (only SRAM)
Average improvement of 30.0% over B (only cache)
Average improvement of 33.1% over C (random partitioning)
Compiler-Decided Dynamic Memory Allocation for Scratch-Pad Based Embedded Systems
A method for allocating program data to non-cached SRAM
Dynamic, i.e. the allocation changes at run time
Compiler-decided transfers
Zero per-memory-instruction overhead, unlike software or hardware caching
No software caching tags
Requires no run-time checks
Highly predictable memory access times
Static Approach
[Figure: example code — int a[100]; int b[100]; while (i < 100) { ...a... } while (i < 100) { ...b... } — with a[100] placed in internal SRAM for the entire program]
Dynamic Approach
[Figure: the same code — a[100] resides in internal SRAM during the first loop; at the second loop the allocator moves a[100] out to external DRAM and copies b[100] into internal SRAM]
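To make the two figures concrete, here is a hedged C sketch of the dynamic approach (the static approach would simply keep a[] in internal SRAM and b[] in external DRAM for the whole program). The memcpy-based transfers, the single shared SRAM buffer, and the loop bodies are assumptions, not the compiler's actual generated code.

    #include <string.h>

    #define N 100

    static int a_dram[N];   /* off-chip DRAM home of a[] */
    static int b_dram[N];   /* off-chip DRAM home of b[] */

    /* One on-chip SRAM buffer, reused for whichever array is currently
       "hot"; a real toolchain would place it via a linker-script section. */
    static int sram_buf[N];

    void process(void)
    {
        int i;

        /* Region 1: a[] is heavily reused, so compiler-inserted code
           copies it into SRAM before the first loop. */
        memcpy(sram_buf, a_dram, sizeof a_dram);
        for (i = 0; i < N; i++)
            sram_buf[i] += 1;                       /* ...uses of a... */
        memcpy(a_dram, sram_buf, sizeof a_dram);    /* write back: a was modified */

        /* Region 2: a[] is no longer needed; evict it and bring in b[]. */
        memcpy(sram_buf, b_dram, sizeof b_dram);
        for (i = 0; i < N; i++)
            sram_buf[i] *= 2;                       /* ...uses of b... */
        memcpy(b_dram, sram_buf, sizeof b_dram);
    }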
Requirements
Decide on the dynamic allocation statically (at compile time)
Account for changing program behavior at run time
Minimize transfer costs for greater benefit
The compiler manages and decides the transfers between SRAM and DRAM
Approach
The method:
Uses profiling to estimate reuse
Copies variables into SRAM when they are reused
A cost model ensures that the benefit exceeds the cost
Transfers data between on-chip and off-chip memory under compiler supervision
The data allocation is known to the compiler at each point in the code
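A hedged sketch of the kind of cost/benefit test the cost model implies; the latency and bandwidth numbers and the function name worth_copying are illustrative assumptions, not values from the paper.

    #define DRAM_CYCLES     10   /* assumed cycles per DRAM access       */
    #define SRAM_CYCLES      1   /* assumed cycles per SRAM access       */
    #define TRANSFER_SETUP  20   /* assumed fixed cost of one transfer   */
    #define BYTES_PER_CYCLE  4   /* assumed transfer bandwidth           */

    /* Decide whether copying a variable of `size` bytes into SRAM pays
       off, given the profiled number of accesses expected before it is
       evicted again.  Benefit = cycles saved on those accesses; cost =
       cycles spent copying it in (plus writing it back if modified). */
    int worth_copying(long size, long profiled_accesses, int modified)
    {
        long benefit   = profiled_accesses * (DRAM_CYCLES - SRAM_CYCLES);
        long transfers = modified ? 2 : 1;
        long cost      = transfers * (TRANSFER_SETUP + size / BYTES_PER_CYCLE);
        return benefit > cost;
    }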
Advantages
Benefits with no software translation overhead
Predictable SRAM accesses, ensuring better real-time guarantees than hardware or software caching
No more data transfers than caching
Overview of Strategy
Divide the complete program into regions
At the starting point of each region:
    Remove some variables from SRAM
    Copy some variables into SRAM from DRAM
Regions
A region is the code between successive program points
Regions coincide with changes in program behavior
New regions start at:
    The start of each procedure
    Before the start of each loop
    Before conditional statements containing loops or procedure calls
The DPRG (Data Program Relationship Graph) is a new data structure that helps in identifying regions and marking time stamps
It is essentially the program's call graph appended with additional nodes for:
    Loop nodes
    Variable nodes
Defines regions
Depth-first search order reveals execution-time order
Allocation-change points occur at region changes
Time Stamps
The method associates a time stamp with every program point. The time stamps form a total order among themselves, and the program points are reached at run time in time stamp order.
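A hedged C sketch of assigning such time stamps by a depth-first walk over a DPRG-like structure; the Node layout and the assumption that the graph is traversed as a tree (ignoring repeated calls to the same procedure) are simplifications for illustration.

    #define MAX_CHILDREN 8

    typedef struct Node {
        struct Node *child[MAX_CHILDREN];   /* callees, loops, ... in program order */
        int nchildren;
        int timestamp;                      /* assigned in DFS order                */
    } Node;

    /* Depth-first traversal: children are visited in program order, so
       the resulting time stamps form a total order that approximates the
       order in which the corresponding program points are reached at
       run time. */
    void assign_timestamps(Node *n, int *next)
    {
        n->timestamp = (*next)++;
        for (int i = 0; i < n->nchildren; i++)
            assign_timestamps(n->child[i], next);
    }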
Optimizations
There is no need to write back unmodified or dead SRAM variables to DRAM
Optimize the data transfer code using DMA when it is available
Data transfer code can be placed in special memory-block-copy procedures
Loops
Conditional execution
Multiple calls to the same procedure
Join Node
Favor the most frequent path: the consensus allocation is chosen assuming the incoming allocation from the most probable predecessor.
Offsets in SRAM
SRAM can get fragmented when variables are swapped out
An intelligent offset-assignment mechanism is required
In this method
Place variables with similar lifetimes together, so that larger contiguous fragments are freed when they are evicted together.
Experimental Setup
Architecture: Motorola MCORE
Results
Conclusion
The designer has to choose the right mix of scratch pad and cache for performance advantages.
References
Sumesh Udayakumaran and Rajeev Barua. Compiler-Decided Dynamic Memory Allocation for Scratch-Pad Based Embedded Systems.
Alexandru Nicolau, Preeti Ranjan Panda, and Nikil Dutt. Efficient Use of Scratch Pads in Embedded Applications.
Josh Pfrimmer, Kin F. Li, and Daler Rakhmatov. Balancing Scratch Pad and Cache in Embedded Systems for Power and Speed Performance.
Questions
Thank you