ARM Cortex-M memory model
Introduction
The ARM Cortex-M memory model is a crucial aspect of embedded system design, shaping
how data is managed, accessed, and protected within microcontroller-based applications.
This comprehensive guide delves into key concepts of memory management, including
memory space, alignment, endianness, semaphores, synchronization, atomicity, memory
attributes, caches, and the Memory Protection Unit (MPU). By understanding these building
blocks, developers can create efficient, secure, and high-performance embedded systems,
ensuring memory operations run smoothly and predictably every time.
Understanding Address Alignment in ARM Cortex-M Processors
ARM Cortex-M processors are designed with a carefully structured address space and alignment
rules to ensure efficient and reliable operation. These rules govern two critical processes:
instruction execution and instruction fetching. Let's explore these concepts step by step.
ARM Cortex-M processors feature a 4 GB address space (32-bit), spanning from 0x00000000 to
0xFFFFFFFF. This space is divided into regions designated for different purposes, such as code,
SRAM, peripherals, and system memory. Each region has specific access requirements and
alignment constraints, critical for maintaining system stability and performance.
Within this space, alignment rules determine how the processor accesses and interprets memory
addresses during instruction execution and fetching. These rules ensure that memory accesses
remain efficient and compliant with the architecture's Thumb state requirements.
• Word-Aligned Access:
o Addresses divisible by 4 (0b...00 in binary).
o Required for certain 32-bit operations or memory accesses.
• Half-Word Aligned Access:
o Addresses divisible by 2 (0b...0 in binary).
o Typical for Thumb and Thumb-2 instructions.
• Aligned Access: Occurs when the address adheres to the alignment rules of the instruction.
These operations execute smoothly without issue.
• Unaligned Access:
o Some instructions (e.g., LDR, STR) support unaligned access and execute without
error.
o Others (e.g., LDM, STM) do not support unaligned access. If such instructions
encounter unaligned addresses, they trigger a UsageFault.
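This distinction matters in everyday C as well. A minimal host-runnable sketch of the portable idiom for reading from a possibly unaligned buffer (the function name `read_u32` is illustrative; the expected values assume a little-endian machine):

```c
#include <stdint.h>
#include <string.h>

/* Reading a 32-bit value from a possibly unaligned buffer.
   A direct pointer cast compiles to a single LDR, which ARMv7-M
   cores execute even when unaligned (unless UNALIGN_TRP is set),
   but it is undefined behavior in C. memcpy is the portable idiom:
   the compiler emits byte loads or an unaligned LDR as appropriate. */
uint32_t read_u32(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* safe at any alignment */
    return v;
}
```

On Cortex-M, the compiler typically folds this `memcpy` into one instruction when it can prove the access is allowed, so there is no penalty for writing it safely.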
While instruction execution offers some flexibility, instruction fetching adheres to stricter rules.
Fetching instructions involves reading the next instruction from memory, and this process always
requires half-word alignment.
At first glance, addresses with PC[0] = 1 appear unaligned, since they are not divisible by 2.
However:
1. During Fetching: The processor automatically resets bit 0 to 0 internally to ensure proper
alignment.
2. During Execution: The T-bit is preserved for instructions like BX, BLX, or POP {…, PC}, which
rely on it to maintain the Thumb state.
This behavior ensures that instructions fetched for execution are always properly aligned, while
the T-bit remains intact for Thumb state transitions.
ARM Cortex-M processors manage a fine balance between execution and fetching:
• Instruction Execution: Flexible, supporting both aligned and unaligned access, depending
on the instruction and system configuration.
• Instruction Fetching: Strictly enforces half-word alignment by clearing PC[0] during fetch
operations.
This dual approach ensures system integrity, efficient execution, and adherence to architectural
requirements. Developers have the tools to manage and monitor unaligned access, while the
architecture handles alignment and Thumb state transitions seamlessly.
Key Takeaway
The ARM Cortex-M address space and alignment rules are integral to the processor's operation:
• Instruction execution offers flexibility with aligned and unaligned access options.
• Instruction fetching enforces strict half-word alignment while preserving the T-bit for
Thumb state transitions.
By understanding these principles, developers can optimize their applications and leverage the full
potential of ARM Cortex-M processors in embedded systems.
Understanding Endianness
Endianness determines the order in which bytes are arranged within a word in memory. In little-
endian systems, the least significant byte (LSB) is stored at the lowest memory address, whereas,
in big-endian systems, the most significant byte (MSB) occupies the lowest address. For example,
consider a 32-bit value 0x12345678 stored in memory:
• Little-Endian:
Memory: 0x00 → 0x78, 0x01 → 0x56, 0x02 → 0x34, 0x03 → 0x12
• Big-Endian:
Memory: 0x00 → 0x12, 0x01 → 0x34, 0x02 → 0x56, 0x03 → 0x78
Endianness applies at the granularity of the element being transferred: byte accesses are
unaffected, while halfword and word loads and stores have their bytes ordered according to the
selected scheme.
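The layout above is easy to verify at run time. A small host-runnable sketch using a union (the function name is illustrative):

```c
#include <stdint.h>

/* Inspecting byte order at run time: store 0x12345678 and look at
   the byte placed at the lowest address of the word. */
union word_bytes {
    uint32_t word;
    uint8_t  bytes[4];
};

int is_little_endian(void)
{
    union word_bytes u = { .word = 0x12345678u };
    return u.bytes[0] == 0x78;   /* LSB at lowest address => little-endian */
}
```

On a Cortex-M device configured little-endian (the common case, including STM32 parts), this returns 1.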
ARM Cortex-M processors, including those in the STM32 family, support selectable endianness,
determined at reset by a configuration input; it cannot be changed by software afterwards. This
feature comes with specific rules:
• Data Access: The selected endianness applies only to data accesses. Instruction fetches
always follow little-endian format.
• System Control Space (SCS): Accesses to the SCS, including critical system registers, are
fixed to little-endian.
• Configuration: The read-only AIRCR.ENDIANNESS bit in the Application Interrupt and Reset
Control Register reports the current data-access endianness. It is set at reset and cannot be
changed at run time.
Instruction alignment and byte ordering follow the processor’s little-endian conventions. A 32-bit
Thumb instruction is treated as two 16-bit halfwords (hw1 and hw2):
In memory, hw1 occupies the lower halfword address and hw2 the next one, each with its own
bytes stored in little-endian order.
Peripheral Endianness
While ARM Cortex-M processors support configurable data endianness, peripherals in the system
may follow a fixed or independent convention.
• SPI and Communication Protocols: SPI peripherals typically shift data MSB-first. Strictly,
this is a bit-transmission order rather than a memory endianness, but when multi-byte
values cross the interface, software on a little-endian processor must ensure the byte
order on the wire matches what the peer expects.
• DMA Transfers: Peripherals like the DMA2D controller can swap byte order during
transfers, which is especially useful in graphics systems. For instance, pixel data in formats
such as ARGB8888 or RGB565 might have a different endianness requirement between
memory and display controllers. The DMA controller handles these transformations
automatically, optimizing performance and reducing the need for software intervention.
• Cases Without Built-in Support: When a peripheral lacks native support for byte-order
transformations, software-level adjustments are required. ARM Cortex-M processors
provide efficient instructions for this purpose:
o REV: Reverses the byte order of a 32-bit word.
o REV16: Reverses the byte order of each halfword in a 32-bit word.
o REVSH: Reverses the byte order of a 16-bit signed halfword and sign-extends it to
32 bits.
Key Takeaway
ARM Cortex-M processors and peripherals provide flexible options for managing endianness,
enabling seamless data handling even in systems with mixed endianness conventions. By
leveraging built-in features like configurable DMA transfers or byte-swapping instructions,
developers can ensure efficient and accurate data processing across heterogeneous systems.
Semaphores and Synchronization: The Memory Backbone of RTOS
Efficient synchronization is crucial for managing shared resources in RTOS environments. At its
core lies atomicity, a fundamental concept underpinning both synchronization primitives like
semaphores and memory operations. In ARM architectures, atomicity is more than a design
principle—it’s intricately embedded in the memory model and instruction set architecture (ISA).
Defining Atomicity
In the context of the ARM architecture, atomicity refers to the indivisibility of memory operations.
An operation is atomic if it is completed entirely or not at all, leaving the memory in a consistent
state throughout.
1. Single-Copy Atomicity
o A memory access (read or write) is single-copy atomic if:
▪ After a series of writes, the value of the operand is always the result of one
complete write—not a mix of two writes.
▪ A read operation on the operand either retrieves the value before a write or
the value after the write, never an inconsistent mix of both.
o Example: In the ARMv7-M architecture, byte, halfword (aligned), and word (aligned)
accesses are guaranteed to be single-copy atomic.
2. Multi-Copy Atomicity
o In multiprocessing systems, multi-copy atomicity ensures that:
▪ All writes to a memory location are observed in the same order by all
processors.
▪ A write is not considered complete until all observers (e.g., cores or devices)
recognize it.
o Not all memory types support multi-copy atomicity; for example, writes to normal
memory are not multi-copy atomic.
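The practical consequence of single-copy atomicity stopping at word size: a 64-bit object on ARMv7-M is moved as two 32-bit accesses, so an interrupt between the halves can produce a "torn" mix of old and new values. A host-runnable sketch of the classic retry-loop defense (the type and function names are illustrative):

```c
#include <stdint.h>

/* A 64-bit counter an ISR updates as two 32-bit words. Byte,
   aligned-halfword, and aligned-word accesses are single-copy
   atomic on ARMv7-M; a 64-bit read is not, so the reader must
   detect and retry a torn read. */
typedef struct { volatile uint32_t lo, hi; } split64_t;

uint64_t read_consistent(const split64_t *c)
{
    uint32_t h1, l, h2;
    do {                 /* retry until hi is stable across the read */
        h1 = c->hi;
        l  = c->lo;
        h2 = c->hi;
    } while (h1 != h2);
    return ((uint64_t)h1 << 32) | l;
}
```

This pattern assumes the high word changes rarely (e.g. a tick counter); when that does not hold, a critical section is the simpler fix.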
Synchronization mechanisms like semaphores rely on atomically reading and modifying data to
prevent interference during critical sections. ARM implements semaphores using hardware-level
atomic instructions that ensure mutual exclusion, preventing other tasks from accessing a shared
resource. This operation is vital for avoiding race conditions and maintaining consistency in real-
time systems. ARM's approach to semaphores and other synchronization primitives evolved
through key architectural advancements.
In simpler systems, atomicity was achieved by disabling interrupts using instructions like CPSID.
This approach worked by temporarily halting other operations during a critical section.
While effective for uniprocessor systems, this method had limitations in multiprocessor or multi-
bus systems:
• Other processors or DMA masters could still access memory, leading to potential
contention.
• It was inefficient, blocking all interrupts and delaying unrelated tasks.
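On a single-core Cortex-M this interrupt-masking pattern remains the simplest critical section. A minimal sketch using CMSIS intrinsics (which emit CPSID/MSR under the hood); saving and restoring PRIMASK, rather than blindly re-enabling interrupts, keeps the function safe when called from an already-masked context:

```c
#include <stdint.h>
#include "cmsis_compiler.h"   /* __get_PRIMASK, __disable_irq, __set_PRIMASK */

/* Increment a counter shared with an ISR (target-only sketch). */
volatile uint32_t shared_count;

void shared_count_inc(void)
{
    uint32_t primask = __get_PRIMASK();  /* remember the current mask    */
    __disable_irq();                     /* CPSID i                      */
    shared_count++;                      /* critical section             */
    __set_PRIMASK(primask);              /* restore, not blanket enable  */
}
```

The limitations listed above still apply: this protects only against interrupts on the same core, not against DMA masters or other processors.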
The introduction of the SWP (Swap) instruction provided a hardware-based solution for atomicity:
• Mechanism: SWP locked the system bus during a read-modify-write operation, preventing
other masters from accessing memory.
• Usage: Commonly used for implementing binary semaphores.
• Drawbacks:
o Performance Bottlenecks: The bus lock delayed all memory operations, even those
unrelated to the semaphore.
o Scalability Issues: In systems with high memory contention, SWP's blocking nature
hindered real-time performance.
As systems grew more complex, it became clear that SWP was not scalable, prompting ARM to
seek a more efficient solution.
With the ARMv6 architecture, ARM introduced the LDREX/STREX instruction pair, which
revolutionized synchronization mechanisms in multiprocessor systems. These instructions are
designed to provide atomic operations on memory without requiring a bus lock, allowing for
improved scalability and efficiency.
How It Works:
1. LDREX (Load-Exclusive):
o This instruction reads a value from memory and simultaneously marks the memory
address with an "exclusive access monitor." This monitor tracks whether any other
processor or memory bus operation modifies the memory address after it has been
loaded.
o In essence, LDREX reserves the address for the current processor, signaling that a
subsequent STREX operation will commit a new value at that address only if no other
operation has intervened.
2. STREX (Store-Exclusive):
o After loading the value with LDREX, the STREX instruction attempts to store a new
value to the same memory address.
o STREX succeeds (i.e., the store operation is performed) only if the exclusive access
monitor confirms that no other processor or operation has modified the memory
location in the meantime.
o If another processor or bus operation has updated the address in question, the
exclusive monitor becomes invalid, causing STREX to fail (store not completed). The
CPU can then retry the operation or handle the failure according to the
synchronization protocol.
The memory monitor in LDREX and STREX ensures atomicity by preventing concurrent writes to
the same address across processors. It doesn't require a global bus lock, allowing other operations
to continue without delay, making it non-blocking and improving performance in multi-core
systems.
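The LDREX/STREX sequence described above can be sketched with the CMSIS exclusive-access intrinsics. The semaphore encoding (1 = free, 0 = taken) and the function name are assumptions for illustration; this is target-only code:

```c
#include <stdint.h>
#include "cmsis_compiler.h"   /* __LDREXW, __STREXW, __CLREX, __DMB */

/* Try to take a binary semaphore. Returns 1 on success, 0 if the
   semaphore was already taken or if another master intervened. */
int sem_try_take(volatile uint32_t *sem)
{
    if (__LDREXW(sem) == 0u) {   /* already taken                     */
        __CLREX();               /* drop the exclusive reservation    */
        return 0;
    }
    if (__STREXW(0u, sem) != 0u) /* monitor invalidated: store failed */
        return 0;                /* caller may retry                  */
    __DMB();                     /* order later accesses after the take */
    return 1;
}
```

Note that on failure the caller simply retries or backs off; no bus is ever locked, which is exactly the scalability advantage over SWP.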
The monitor is scalable through two mechanisms: a local monitor and a global monitor. The local
monitor tracks exclusive access for a specific processor, ensuring atomicity within that processor's
context. The global monitor ensures consistency across multiple processors, allowing each core to
access memory without contention. This two-tier approach improves scalability and performance,
especially in multi-core environments.
Figure 2: Local Monitor Finite-State-Machine
On Cortex-M microcontrollers, ARM also introduced bit-banding that could be used to simplify
synchronization:
• Mechanism: Bit-banding maps each bit of a specific memory region (bit-band region) to an
alias memory address in a separate address space. Writing to the alias address directly and
atomically sets or clears the corresponding bit in the original memory location—eliminating
the need for traditional bitwise operations.
• Use Case: In synchronization, tasks or clients can treat individual bits in a shared variable
as semaphores. By writing to their corresponding alias addresses, these bits can be
atomically toggled, ensuring proper synchronization without the complexity of manual
masking or bit manipulation.
Example:
On an ARM Cortex-M3, the bit-band region starts at 0x20000000 (SRAM), while its alias
region begins at 0x22000000. A single bit in the bit-band region is linked to a specific alias
address. Writing 1 or 0 to the alias address directly sets or clears the bit. For instance, a
shared variable can be mapped, and each bit can act as a flag or semaphore for different
tasks.
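The alias-address arithmetic is fixed by the architecture and can be computed in portable C. A host-runnable sketch of the mapping (the helper name is illustrative):

```c
#include <stdint.h>

/* SRAM bit-band mapping on Cortex-M3/M4:
     alias = alias_base + (byte_offset * 32) + (bit_number * 4)
   Each bit of the bit-band region gets its own word-sized alias. */
#define SRAM_BB_BASE   0x20000000u
#define SRAM_BB_ALIAS  0x22000000u

static inline uint32_t bitband_alias(uint32_t addr, uint32_t bit)
{
    return SRAM_BB_ALIAS + ((addr - SRAM_BB_BASE) * 32u) + (bit * 4u);
}

/* On target, writing 1 to *(volatile uint32_t *)bitband_alias(a, b)
   atomically sets bit b of the byte at address a; writing 0 clears it. */
```

For instance, bit 7 of the byte at 0x20000004 maps to alias address 0x2200009C.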
Figure 5: Bit-band mapping
Atomicity or exclusivity?
Exclusive instructions (e.g., LDREX/STREX) and atomic operations both ensure safe and reliable
access to shared resources, providing mechanisms to prevent race conditions. While atomicity in
the ARM architecture offers hardware-backed guarantees for indivisible memory operations,
LDREX/STREX uses a reservation-based approach to achieve atomicity in software. In smaller
systems, their behavior may appear similar, but as systems scale, challenges like performance
bottlenecks and contention arise. This raises the question: Are exclusive instructions sufficient for
ensuring atomicity, or do atomic operations offer a more scalable and efficient alternative?
• Scope: Atomicity guarantees for aligned memory accesses are inherently provided by the
hardware. These operations (e.g., single-copy atomic loads/stores) ensure that reads or
writes happen as indivisible units, requiring no software intervention.
• Consistency Across Systems: In multi-core systems, atomicity depends on the memory
type:
o Strongly-Ordered or Device memory ensures global consistency using multi-copy
atomicity.
o Normal memory does not guarantee multi-copy atomicity, leading to potential
inconsistencies between cores.
• Performance: These operations are fast and efficient, as no retries are required—atomicity
is achieved directly through the hardware memory subsystem.
In ARMv8.1-A and beyond, new atomic instructions like LDADD and CAS replace the reservation
model by directly performing RMW operations atomically in hardware, with no retry loop and no
exclusive monitor to invalidate.
Key Comparisons
The choice of approach depends on the application and system scale. For simple tasks, disabling
interrupts and re-enabling them after critical operations may be sufficient. For multi-core or highly
concurrent systems, developers and silicon designers must balance complexity, performance, and
reliability to optimize atomicity for their specific needs.
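From C, all of these mechanisms hide behind the same portable source: C11 atomics. On ARMv8.1-A the operations below can compile to single LDADD and CAS instructions, while on ARMv7-M the same code lowers to a LDREX/STREX retry loop. A host-runnable sketch (the function names are illustrative):

```c
#include <stdatomic.h>

static atomic_uint next_ticket;   /* shared ticket counter */

unsigned take_ticket(void)
{
    /* atomic read-modify-write: returns the previous value */
    return atomic_fetch_add(&next_ticket, 1u);
}

int try_claim(atomic_uint *flag)
{
    unsigned expected = 0u;
    /* succeeds only if *flag was 0; maps to CAS where available */
    return atomic_compare_exchange_strong(flag, &expected, 1u);
}
```

Writing the portable form and letting the compiler pick the best instruction for the target is usually the right trade-off between complexity and performance.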
Memory Attributes in ARM Cortex Processors
Memory attributes play a vital role in embedded systems, defining how processors access, order,
and synchronize memory regions. By tailoring these attributes, developers can optimize
performance, ensure data consistency, and safeguard against unexpected behavior.
1. Normal Memory:
o Used for program code and general data storage.
o Examples: Flash, SRAM, ROM, DRAM.
o Accesses can be reordered and buffered for performance optimization.
o Suitable for storage without side effects.
2. Device Memory:
o Designed for peripherals like FIFOs, interrupt controllers, and configuration
registers.
o Accesses can have side effects (e.g., modifying peripheral state).
o Enforces stricter rules to ensure system correctness.
3. Strongly-Ordered Memory:
o The strictest memory type: accesses reach memory in program order, and each
access must complete before the next begins.
o Always non-cacheable, and writes are not buffered.
o Example: certain system control and configuration registers.
Each memory type has additional attributes influencing access behavior, including:
• Cacheability: Determines if the memory region can be cached. A region can be:
o Write-Through Cacheable: Any write operation is immediately reflected in both the
cache and the main memory.
o Write-Back Cacheable: Writes are initially done to the cache and later written back
to memory, with options for:
▪ Write-Allocate: The cache is loaded with data on a write miss.
▪ No Write-Allocate: Data is not loaded into the cache on a write miss, avoiding
the need to update the cache for every write.
o Non-cacheable: Data is directly accessed from or written to memory, bypassing the
cache, ensuring no cache interference.
• Shareability: Indicates whether the memory is shared across multiple cores or processors.
o Shareable: The memory can be accessed by multiple processors, ensuring coherent
data sharing.
o Non-shareable: The memory is exclusive to a single processor, useful for local or
private data, where accessing it from different cores may lead to inconsistencies.
• Bufferability: Allows write operations to be buffered before being written to memory.
o For regions like Device memory, write operations can be buffered to optimize
performance, but the buffering must respect the order, size, and number of accesses
specified by the program.
o In Normal memory, buffering improves throughput, but in certain cases (such as
with Strongly-ordered memory), buffering may be restricted to maintain strict
access order.
In ARM Cortex processors, these attributes are managed using the Memory Protection Unit
(MPU) or the Memory Attribute Indirection Register (MAIR).
Memory attributes affect performance, determinism, and correctness. For example, Normal
memory allows reordering for higher throughput but isn't suitable for peripherals needing precise
access order. Strongly-ordered memory prioritizes correctness over performance. In embedded
systems, choosing the right memory attribute is crucial—Normal memory works well for program
storage (e.g., Flash or SRAM), while Device memory is used for peripheral registers (e.g., UART,
GPIO). ARM tools and system registers help configure these attributes to meet system
requirements.
Memory Barriers
To address synchronization and memory access reordering problems, ARM processors implement
several types of memory barriers. These barriers offer solutions for ensuring that operations are
completed in the correct order and that side effects from previous instructions are visible to
subsequent ones. Specifically:
• Data Memory Barrier (DMB): Ensures memory accesses (loads and stores) are completed
in the correct order. It guarantees that operations before the barrier finish before any that
follow it.
• Data Synchronization Barrier (DSB): A stronger barrier that makes sure all memory
operations and context changes are fully completed before any subsequent instructions
are executed. It’s crucial for synchronizing with peripherals or ensuring that state changes
are visible before proceeding.
• Instruction Synchronization Barrier (ISB): Forces the pipeline to flush, ensuring any
changes to system control settings or context are fully applied before fetching the next
instructions. This ensures consistency after altering processor states or peripherals.
• Speculative Store Bypass Barriers (SSBB, PSSBB): Prevent speculative loads from
bypassing earlier stores to the same location, ensuring they do not observe stale
data.
• Consumption of Speculative Data Barrier (CSDB): Ensures that speculative load
instructions don’t affect the results of later operations, particularly when a load has been
speculatively executed but not yet completed.
These barriers solve synchronization issues, particularly in cases of memory reordering or when
working with peripheral devices. They ensure that the software behaves predictably across
different execution stages.
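A typical use of these barriers: a peripheral must not be started until all preceding buffer writes have actually reached memory. A target-only sketch using the CMSIS barrier intrinsics; the register name and address are hypothetical, standing in for a real device's "go" register:

```c
#include <stdint.h>
#include "cmsis_compiler.h"   /* __DSB (CMSIS barrier intrinsic) */

/* Hypothetical memory-mapped "start transfer" register. */
#define DMA_START (*(volatile uint32_t *)0x40026000u)

void start_transfer(uint32_t *buf, uint32_t word)
{
    buf[0] = word;   /* Normal memory: write may be buffered/reordered */
    __DSB();         /* drain all outstanding writes before proceeding */
    DMA_START = 1u;  /* Device memory: kicks off the transfer          */
}
```

Without the DSB, the processor could issue the Device-memory write before the buffered Normal-memory write became visible, and the peripheral would read stale data.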
• Instruction Fetches: Instruction fetches must access only Normal memory. Fetching
instructions from Device or Strongly-ordered memory is architecturally unpredictable.
• Access Privileges: Memory regions can be restricted based on the privilege level:
o Privileged Accesses: Allowed during privileged execution (e.g., supervisor mode).
o Unprivileged Accesses: Allowed when running in non-privileged mode.
o A MemManage exception occurs if the processor tries to access a region with
insufficient privileges.
The XN attribute marks memory regions as non-executable, preventing code execution and
triggering a MemManage exception if execution is attempted. This protects against attacks like
code injection, ensuring that code cannot run from non-executable areas such as peripheral
memory or mounted devices (e.g., USB or SD cards).
Key Takeaway
Memory attributes in ARM Cortex processors define how memory regions are accessed, ordered,
and synchronized, affecting performance and correctness. Memory barriers, like DMB and DSB,
ensure operations occur in the correct order and synchronize with peripherals. Access can also be
restricted based on privilege levels, ensuring safe and predictable system behavior. These features
help optimize memory usage in embedded systems.
Protected Memory System Architecture in ARM Cortex-M Processors
In ARM Cortex-M processors, the MPU serves as an optional but vital component that controls
access rights to various memory regions. Its primary purpose is to protect system memory by
defining access permissions for different regions within the address space. This protection ensures
that memory is accessed only by authorized software, helping to prevent errors, crashes, or
unauthorized access to sensitive system data.
The MPU works by dividing the memory into regions and assigning specific attributes to each
region. These attributes dictate whether a region can be read from, written to, or executed,
depending on the access level (privileged or unprivileged). The number of regions the MPU can
manage varies across ARM architectures, from 16 regions in ARMv7-M to a more flexible
configuration in ARMv8-M.
The MPU in ARMv7-M processors supports up to 16 memory regions, and in architectures like
ARMv8-M, the regions are defined by a base and limit address. This allows developers to easily
configure the system memory map and protect critical sections of memory from unauthorized
access.
In ARMv7-M (Cortex-M3, M4, and M7), each region can be subdivided into up to eight
subregions, provided the region is large enough (at least 256 bytes). This flexibility ensures that
developers can fine-tune the memory protection based on the needs of their application.
However, in ARMv8-M (Cortex-M33), the regions are more flexible, and subregions are no longer
used, allowing for simpler and more flexible memory configurations.
When enabled, the MPU plays a central role in defining the system’s memory map. It manages
access rights to physical memory addresses, ensuring that the processor enforces proper access
control based on the configured memory regions. For example, the Private Peripheral Bus (PPB)
and system space always have default memory attributes, and any unauthorized access triggers a
fault.
For the MPU to function, it must first be enabled by setting a global enable bit in the control
register. If the MPU is not enabled, the processor will bypass the MPU configuration and follow
the default memory map. When enabled, the MPU checks memory accesses against the defined
permissions. If an access attempt violates the defined permissions, a fault is raised, ensuring that
only authorized software can interact with critical system resources.
When the MPU is disabled, memory accesses are not subject to the protection rules. This means
that both privileged and unprivileged accesses bypass permission checks and use the default
memory map. In this state, instruction accesses that attempt to execute from regions marked as
"Execute Never" will trigger a MemManage fault. However, data accesses do not undergo any
permission checks, and therefore cannot cause aborts.
The behavior of the system when the MPU is disabled also affects caching and speculative
operations. Cacheability is controlled by specific bits in the Control Register (CCR), and program
flow prediction or speculative fetches continue to operate based on the default memory
configuration. These features ensure that the processor maintains predictable behavior even
when the MPU is not actively enforcing memory protection.
Configuring the MPU involves interacting with a set of control registers that require privileged
access for reading and writing. If an unprivileged access is attempted on the MPU registers, a
BusFault is triggered. These registers include:
• MPU Type Register: This register provides details about the number of supported regions
and whether the MPU is present in the system.
• MPU Control Register: The MPU_CTRL register includes the global enable bit, which must
be set to activate the MPU.
• MPU Region Number Register: This register selects the current region, which links to its
associated base address and attributes.
• MPU Region Base Address Register: Defines the starting address of the region.
• MPU Region Attribute and Size Register: This controls various attributes of the region, such
as its size, access permissions, memory type, and sub-region access.
Each region in the MPU has its own enable bit. When a region is enabled, its associated access
rights and attributes are enforced. In ARMv7-M implementations without an MPU, only the
MPU_TYPE register is implemented (its region count reads as zero), and all other MPU registers
are reserved.
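Putting the registers together, a minimal ARMv7-M configuration sketch using the CMSIS register definitions. The region choice (32 KB of read-only, executable Flash at 0x08000000) is illustrative, not a recommendation; target-only code:

```c
#include "core_cm4.h"   /* CMSIS: MPU register/bit definitions, __DSB, __ISB */

void mpu_setup(void)
{
    MPU->RNR  = 0u;            /* select region 0                        */
    MPU->RBAR = 0x08000000u;   /* base address (must be size-aligned)    */
    MPU->RASR = (0x6u << MPU_RASR_AP_Pos)    /* AP=110: read-only, all levels */
              | (1u   << MPU_RASR_C_Pos)     /* C=1: cacheable (Flash)        */
              | (14u  << MPU_RASR_SIZE_Pos)  /* size = 2^(14+1) = 32 KB       */
              | MPU_RASR_ENABLE_Msk;         /* region enable bit             */

    /* Global enable; PRIVDEFENA keeps the default map as a background
       region for privileged code. */
    MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk | MPU_CTRL_ENABLE_Msk;
    __DSB();   /* ensure the MPU register writes have completed */
    __ISB();   /* flush the pipeline so the new map takes effect */
}
```

The trailing DSB/ISB pair is the standard sequence: without it, instructions already in the pipeline could execute under the old memory map.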
Key Takeaway
The MPU is crucial for managing memory access rights in ARM Cortex-M processors, dividing
memory into protected regions. By configuring and enabling the MPU, developers can enforce
security, prevent errors, and ensure the integrity of embedded systems. Understanding the MPU’s
configuration and behavior is essential for building secure, efficient systems on ARM Cortex-M
processors.
Cache Management: Optimizing Performance with Smart Caching
In modern processors, caches play a vital role in improving memory access speeds by storing
frequently accessed data closer to the CPU. Caches operate on two key principles: spatial locality
(nearby data is often accessed together) and temporal locality (recently accessed data is likely to
be used again soon). However, as workloads grow more dynamic, managing these caches
efficiently requires more than just traditional algorithms like LRU or FIFO.
Caches are often organized in a hierarchical memory system, with multiple levels of cache (L1, L2,
L3) to balance speed and capacity. While the L1 cache is closest to the CPU and the fastest, it is
limited in size, whereas L3 is larger but slower. The challenge arises when multiple agents (like
DMA or external processors) update memory simultaneously, leading to potential cache
coherency issues.
To maintain consistency, software often employs cache maintenance operations, ensuring that
data changes made in one part of the system are visible throughout the memory hierarchy.
Without this, a breakdown in coherency can occur, where outdated data is accessed from the
cache instead of the most recent version in memory.
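On a Cortex-M7 with the data cache enabled, the canonical coherency pattern around a DMA transfer looks like the sketch below, using the CMSIS cache-maintenance functions. Buffer placement and alignment (32-byte cache lines) are the caller's responsibility; target-only code:

```c
#include <stdint.h>
#include "core_cm7.h"   /* CMSIS: SCB_CleanDCache_by_Addr, SCB_InvalidateDCache_by_Addr */

/* Before DMA reads the buffer: push dirty cache lines out to RAM
   so the DMA engine sees the CPU's latest writes. */
void dma_send_prepare(uint8_t *buf, int32_t len)
{
    SCB_CleanDCache_by_Addr((uint32_t *)buf, len);
    /* ... start the DMA transfer from buf ... */
}

/* After DMA has written the buffer: discard stale cache lines so
   the CPU's next reads fetch the fresh data from RAM. */
void dma_receive_complete(uint8_t *buf, int32_t len)
{
    SCB_InvalidateDCache_by_Addr((uint32_t *)buf, len);
    /* ... CPU may now read buf ... */
}
```

Getting the direction wrong (cleaning where an invalidate is needed, or vice versa) is a classic source of the coherency breakdowns described above.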
Google’s CacheNet and NVIDIA’s AI-driven GPU caches are real-world examples of this approach,
where deep learning models optimize cache management dynamically. These systems
continuously learn and adapt to evolving access patterns, offering significant improvements over
static algorithms.
Modern processors, like ARM's Cortex-A series, combine these ideas with software prefetch
hints. With Preload Data (PLD) and Preload Instruction (PLI) hints, software can signal future
memory accesses so data is brought into faster cache levels ahead of use. Predictive or learned
policies can refine these preload strategies based on observed access patterns, further optimizing
performance in real time.
Conclusion
Mastering the ARM Cortex-M memory model is essential for creating reliable and optimized
embedded systems. By grasping concepts like memory space, alignment, and endianness, along
with the role of memory attributes, semaphores, and caches, developers can design systems that
effectively manage resources and maintain data consistency. The MPU adds an additional layer of
protection, safeguarding critical memory regions from unauthorized access. With a solid
understanding of these memory mechanisms, developers can leverage ARM Cortex-M processors
to their fullest potential, ensuring performance, security, and stability across applications.