07 Paging

The document discusses paging and segmentation techniques for virtual memory management. Paging allows for transparent support of large address spaces through fixed-size pages, while segmentation complicates memory allocation due to variable segment sizes and the need for contiguity. Page tables are used for page-based address translation between virtual and physical addresses.

Paging

Abhilash Jindal
Overview
• More flexible address translation with paging (OSTEP Ch 18-20)
• Paging hardware
• Demand paging: swapping pages to disk when memory becomes full (OSTEP Ch 21-22)
• Swapping mechanisms
• Page replacement algorithms
• Paging in action (xv6 book Ch 2, OSTEP Ch 23)
• Paging on xv6
• Fork with copy-on-write, guard pages
Paging Hardware
OSTEP Ch 18-20
Intel SDM Volume 3A Ch 4
Memory isolation and address space
[Figure: physical memory layout (bootloader at 0x00007C00, VGA display at 0x000A0000, BIOS ROM at 0x000F0000, OS code and stack at 0x00100000, Process 1 at 0x00200000, Process 2 at 0x00300000) next to Process 1's address space (code, heap, free space, stack)]
Due to address translation, the compiler need not worry about where the program will be loaded!
Segmentation
• Mapping large address spaces
• Place each segment independently so as not to map free space
[Figure: Process 1's and Process 2's stack, heap, and code segments placed independently in physical memory; Process 1 and 2 share a code segment; the OS sits below 0x00200000]
Allocating memory to a new process
• Find free spaces in physical memory
• Difficult because segments can be of arbitrary sizes
[Figure: stacks, heaps, and code segments of three processes scattered across physical memory above the OS]
Growing and shrinking address space
• Segments need to be contiguous in memory
• Growing might not succeed if there are other segments next to the heap segment
[Figure: same physical memory layout; each process's heap is hemmed in by neighbouring segments]
External fragmentation
• After many processes start and exit, memory might become "fragmented" (similar to disk)
• Example: cannot allocate a 20 KB segment even though 20 KB is free in total, because no single free hole is big enough
• Compaction: copy all allocated regions contiguously, then update segment base and bound registers
• Copying is expensive
• Growing the heap becomes impossible
Limitations of segmentation
Limited flexibility:
• To support large address spaces, the burden falls on the programmer/compiler to manage multiple segments
• Only an entire segment can be shared. Example: cannot share some part of CS (both processes use the same library)
Different sized segments, which need to be contiguous in physical memory:
• complicate the physical memory allocator
• lead to external fragmentation
• make growing/shrinking segments awkward
Paging

[Figure: fixed-size pages of P1's and P2's address spaces (virtual addresses) scattered non-contiguously across DRAM (physical memory)]
Notebook analogy: Segmentation
Preparing for an OS exam:
• Read the second letter from the 3rd page of the OS notes: "xv6 is an OS for x86"
• Read the second letter from the 4th page of the OS notes: "Write an SQL query"
[Figure: OS notes ("xv6 is an OS for x86", pages 0-5) and DB notes ("Write an SQL query", pages 0-3) copied contiguously into notebook pages 0-14; the index records only the base of each subject: OS:1, DB:7]
Notebook analogy: Paging
Preparing for an OS exam:
• Read the second letter from the 3rd page: "xv6 is an OS for x86"
• Read the second letter from the 8th page: "Write an SQL query"
[Figure: note pages scattered across notebook pages 0-14; the index points to per-subject page tables (OS:1, DB:5). Notebook page 1 holds the OS page table (0:3, 1:4, 2:2, 3:8, 4:6, 5:10) and page 5 holds the DB page table (0:9, 1:2, 2:7, 3:12)]
Paging vs segmentation
Segmentation:
• Large address spaces need a multi-segment model; the burden falls on the programmer/compiler to manage multiple segments
• Different sized segments lead to external fragmentation
• Different sized segments complicate the physical memory allocator
• A full segment needs to be contiguous in physical memory
• Growing/shrinking segments is awkward
• Address translation hardware is simple: an adder (va + base) and a comparator (va < limit)
Paging:
• Transparently supports large address spaces; the programmer/compiler works with a flat virtual address space
• Fixed-size pages (4KB, 4MB)
• Each page is contiguous, but neighbouring addresses (in different pages) may not be
• Growing/shrinking is simple: allocate another page, free a page
• Address translation hardware is much more complicated
How to do page-based address translation?
Page table: maintain a lookup table for each process

P1 page table          P2 page table
VPN  PPN               VPN  PPN
7    3                 7    5
6    x                 6    x
5    x                 5    x
4    x                 4    x
3    x                 3    7
2    4                 2    3
1    6                 1    6
0    0                 0    1

[Figure: P1's and P2's address spaces (virtual addresses) mapped through their page tables onto scattered DRAM pages (physical memory)]
How to do page-based address translation? (2)
Page table bits
• Present bit: valid mapping
• Permission bits: read-only, writeable, executable

P1 page table
VPN  PPN  Present  Permission
9    4    Y        rw
8    x    N
7    x    N
6    x    N
5    x    N
4    x    N
3    x    N
2    5    Y        rw
1    7    Y        rx
0    1    Y        rx

[Figure: P1's address space (code at VPNs 0-2, heap at 2, stack at 9) mapped onto DRAM pages 1, 4, 5, 7]
Page table size
• Virtual addresses: 2^32
• Size of a page = 4KB = 2^12 bytes
• Number of page table entries = 2^20
• Number of pages in a 4GB DRAM = 2^32 / 2^12 = 2^20
• Size of each page table entry = 20 bits ~ 3 bytes
• Size of the page table = 3 * 2^20 bytes = 3MB!
• 1000 processes => ~3GB of memory just for page tables!
Reducing page table size
• Bigger pages
• Virtual addresses: 2^32
• Size of a page = 4MB = 2^22 bytes
• Number of page table entries = 2^10
• Number of pages in a 4GB DRAM = 2^32 / 2^22 = 2^10
• Size of each page table entry = 10 bits ~ 2 bytes
• Size of the page table = 2 * 2^10 bytes = 2KB!
• But bigger pages increase internal fragmentation
Observation
• Lots of page table entries are invalid
[Figure: Process 1's address space with code and heap at the bottom, stack at the top, and a large free hole in the middle; the page table has valid entries only for VPNs 0, 1, 2, and 7]
Multi-level page table
• Page directory entries point to page table pages
• Unused portions of the virtual address space are skipped!
[Figure: a page directory (PPN 8) with entries 6,7 -> 9; 4,5 -> x; 2,3 -> 11; 0,1 -> 10, pointing to page table pages: PPN 9 holds 7 -> 3, 6 -> x; PPN 11 holds 3 -> x, 2 -> 4; PPN 10 holds 1 -> 6, 0 -> 0]
Notebook analogy: Page directories (call 4 pages a "section")
• Read the second letter from the 3rd page in Section 0: "xv6 is an OS for x86"
• Read the second letter from the 8th page: "Write an SQL query"
[Figure: the index now points to per-subject page directories (OS:1, DB:5); each directory entry maps a section to the notebook page holding that section's page table (OS: 0:11, 1:14; DB: 0:13), and each page-table page maps the 4 pages of its section (page 11: 0:3, 1:4, 2:2, 3:8; page 14: 0:9, 1:2, 2:7, 3:12; page 13: 0:6, 1:10)]
Address translation
Simple address space
• A 16KB address space has 2^14 addresses
• Each page has 64 (= 2^6) bytes
• Number of pages = 2^8
• The first 8 bits are the page number, the last 6 bits are the offset within the page
Address translation
Simple address space
• Example: 0x3F81 (VA) => 0x0DC1 (PA)

VA 0x3F81 = 11111110 000001: page dir idx = 1111 (15), page table idx = 1110 (14), offset = 000001
PA 0x0DC1 = 00110111 000001: physical page number = 00110111 (55), offset = 000001
x86 segmentation and paging
• Segmentation: virtual address (logical address) => "linear address"
• Paging: linear address => physical address
Address translation with paging on x86
• CR3 contains the page directory base
[Figure: two-level x86 walk: CR3 -> page directory -> page table -> 4KB page]
Address translation with paging on x86: PTEs, PDEs
• 2^20 4KB pages in a 4GB DRAM
• Page base address and page table base address are 20 bits
• Bits set by the OS, used by hardware:
• Present: it is a valid entry
• Read/write: can write if 1
• User/supervisor: CPL=3 can access if 1
• Bits set by hardware, used/cleared by the OS:
• Accessed: hardware accessed this page
• Dirty: hardware wrote to this page
Performance degradation!
• Accessing 1 memory location now requires accessing 3 memory locations: the page directory entry, the page table entry, and finally the data itself
Translation-lookaside buffer (TLB)
• First check the translation in the TLB before walking the page table
• TLB hit: ~0.5-1 cycle
• TLB miss: ~10-100 cycles
[Figure: TLB caching VPN -> PPN entries, e.g. VPN 11111110 -> PPN 00110111 from the earlier example]
Which programs will run faster?
Which programs will have fewer TLB misses?
• High spatial locality: after the program accesses a memory location, it will access a nearby memory location
• High temporal locality: after the program accesses a memory location, it will soon access it again
Example bad programs
• Low spatial and temporal locality: most accesses lead to TLB misses
• Large hash table with random access:

int get(int i) {
    return a[i];
}

[Figure: array a[] spread across VPNs 1-5, four elements per page]
• Traversing large graphs: two neighbours can be on different pages
• OK for small hash tables and small graphs
• Working set of the program: the amount of memory that it touches
• TLB reach = (number of TLB entries) * (page size)
• Bad when working set > TLB reach
Increasing TLB reach
• Larger TLBs
• 32 -> 64 entries
• But larger caches => slower hits
• Larger pages
• 1 TLB entry covers 4MB of addresses (with a 4MB page)
• vs 1024 TLB entries for 4MB of addresses (with 4KB pages)
• Allocating large pages on Linux:

int posix_memalign(void **memptr, size_t alignment, size_t size);
int madvise(void *addr, size_t length, int advice /* MADV_HUGEPAGE */);
TLB on my x86-64 machine
$ cpuid
..
(simple synth) = Intel Core (unknown type) (Kaby Lake / Coffee Lake) {Skylake}, 14nm
..
cache and TLB information (2):
0x63: data TLB: 2M/4M pages, 4-way, 32 entries
      data TLB: 1G pages, 4-way, 4 entries
0x03: data TLB: 4K pages, 4-way, 64 entries
0x76: instruction TLB: 2M/4M pages, fully, 8 entries
0xb5: instruction TLB: 4K, 8-way, 64 entries
0xc3: L2 TLB: 4K/2M pages, 6-way, 1536 entries
..
• Separate TLBs for different page sizes
• Separate TLBs for instruction and data
• Much larger (slower) L2 TLB

Data TLB reach: 4KB * 64 = 256KB; 4MB * 32 = 128MB; 1GB * 4 = 4GB
Instruction TLB reach: 4KB * 64 = 256KB; 4MB * 8 = 32MB
L2 TLB reach: 4KB * 1536 = 6MB; 2MB * 1536 = 3GB
TLB size, multiple TLBs example
Touch the first int on each of NUMPAGES pages, so one TLB entry is needed per page:

int jump = PAGESIZE / sizeof(int);
for (i = 0; i < NUMPAGES * jump; i += jump) {
    a[i] += 1;
}

[Figure: access time vs NUMPAGES: flat while the L1 TLB hits, a step up once the L1 TLB misses but the L2 TLB hits, another step once the L2 TLB also misses]
TLB replacement policies
• Need to replace an entry once the TLB is full. Which entry should be replaced to minimise the TLB miss rate?
• Least recently used (LRU):
• If an entry hasn't been used recently, it is unlikely to be used soon => assumes spatial and temporal locality of access
• Corner-case behaviour: a program cycling over N+1 pages with an N-entry TLB misses on every access
• Hardware typically implements something simpler:
• FIFO
• Random replacement: just pick an entry randomly
Context switching
• During a context switch, the OS changes the CR3 register to point to the new process's page table (a privileged operation):

movl %eax, %cr3

• Loading CR3 marks each TLB entry invalid. Every memory access after a context switch causes a TLB miss!
[Figure: P1's and P2's page tables and address spaces from the earlier slide; after CR3 changes, the cached P1 translations (0 -> 0, 2 -> 4) are stale for P2 (0 -> 1)]
INVLPG instruction
• The TLB is neither a write-back nor a write-through cache
• Need to run INVLPG <virtual address> when a page table entry is modified
• Similar to how the OS needs to run LGDT when GDT entries are changed
Tagged TLBs
• TLB entries are "tagged" with a process context identifier (PCID)
• Additional bits in the CR3 register tell hardware the current PCID
• Upon a context switch:
• the OS changes CR3: PCID and page directory base
• hardware need not invalidate TLB entries (saves ~5us)
• When the process gets back control, some of its TLB entries might still be present!
[Figure: TLB with columns VPN, PPN, PCID: 0 -> 0 (P1), 2 -> 4 (P1), 0 -> 1 (P2)]
Demand Paging
OSTEP Ch 21-22
Demand paging
Providing the illusion of large virtual address spaces
• Swap out a page to disk when physical memory is about to run out of space:
• Mark the page as not present in the page table
• Remember where the page is swapped out on disk
• Add the swapped-out physical page to the free list
[Figure: P1's and P2's address spaces mapped onto DRAM; one heap page has been moved out to swap space on disk]
Swap space
Disk layout with swap space
• Reserved swap blocks, not touched by the file system
• Swap space does not require crash consistency: it is garbage after a restart anyway (all processes are dead)
[Figure: disk layout: boot block, super block, swap blocks, index blocks, bitmap, data blocks]
Swapping out a page
• Find a free swap block on disk
• Copy the page to the free block
• Run the INVLPG instruction to remove the page from the TLB
• Mark the PTE not present; remember the swap block number in the PTE
• Add the physical page to the free list
[Figure: the page table entry for Proc 1 [VPN2] changes from "PFN = 1, present = 1" to "swap block # 2, present = 0"]
Swapping in a page
• Hardware does a page table walk and finds that the page is not present. It raises a page fault.
• The OS handles the page fault:
• copies the page from its swap block into a free physical page
• updates the page table entry
• Hardware retries the instruction. This time it finds the page, adds the entry to the TLB, and continues as normal.
[Figure: the page table entry for Proc 1 [VPN2] changes from "swap block # 2, present = 0" back to "PFN = 2, present = 1"]
Page replacement policies
• When to evict pages?
• How many pages to evict?
• Which page to evict?
When to evict pages? How many pages to evict?
• Swap out one page only when we completely run out of physical memory?
• What if the OS itself needed a new page?
• Better: start swapping out before we completely run out
• When there are fewer than N free pages left
• Swap out multiple pages in one shot until we have M (> N) free pages
• Sending multiple disk writes in one shot reduces seek delay
Which page to evict?
• Goal: minimise the number of swap ins/outs
Memory access sequence: 0, 1, 2, 0, 1, 3, 0, 3, 1, 2, 1
• Belady's algorithm for optimal page replacement:
• Evict the page required furthest in the future
• Optimal because all other pages will be required sooner
• But the future is unknown!
FIFO
• Evict the page that came first into the cache
• The OS appends the page to a queue when it swaps in a page (or when it allocates a new page)
Belady's anomaly
Bigger caches can have lower hit rates! FIFO on the access sequence 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5:

Cache of size 3 (9 misses):
Access  Hit/miss  Resulting cache
1       Miss      1
2       Miss      1, 2
3       Miss      1, 2, 3
4       Miss      2, 3, 4
1       Miss      3, 4, 1
2       Miss      4, 1, 2
5       Miss      1, 2, 5
1       Hit       1, 2, 5
2       Hit       1, 2, 5
3       Miss      2, 5, 3
4       Miss      5, 3, 4
5       Hit       5, 3, 4

Cache of size 4 (10 misses):
Access  Hit/miss  Resulting cache
1       Miss      1
2       Miss      1, 2
3       Miss      1, 2, 3
4       Miss      1, 2, 3, 4
1       Hit       1, 2, 3, 4
2       Hit       1, 2, 3, 4
5       Miss      2, 3, 4, 5
1       Miss      3, 4, 5, 1
2       Miss      4, 5, 1, 2
3       Miss      5, 1, 2, 3
4       Miss      1, 2, 3, 4
5       Miss      2, 3, 4, 5

FIFO does not follow the "stack property": a cache of size 4 may not contain all the elements of a cache of size 3.
Fairness in page replacement
• "Someone had lots of pages, I had very few, and yet my page was evicted"
• The OS maintains a "resident size" per process, e.g. 1, 7, 9 pages
• First select a victim process, the one with the highest resident size, then evict its pages
Least Recently Used (LRU)
• Most programs exhibit temporal locality:
• If a page was accessed recently, it shall be accessed soon
• Keep a list of pages ordered by recency of access
Difficulty in implementing LRU
• In FIFO, the list is updated by the OS when a new page is allocated or when a page is swapped out
• In LRU, the list needs to be updated at every access
• But the OS is not running during page accesses :-/
Implementation options
• Option 1: hardware maintains the LRU list
• 4GB / 4KB ~ 2^20 pages
• List size: 20 bits * 2^20 pages ~ 3MB
• The list cannot fit in the CPU => must be in memory
• So each memory access causes another set of memory accesses to update the list
Implementation options (2)
• Option 2: hardware updates a timestamp in the page table entry; the OS scans PTEs to find the page with the oldest timestamp
• PTEs live in memory. Updating the timestamp again touches memory at every access, and defeats the TLB.
• Lazily update the timestamp instead:
• when a mapping is brought into the TLB
• when a mapping is evicted from the TLB
• once in a while
• Still need many more bits in each PTE to store the timestamp
• The victim process has the highest resident size; scanning its (worst case 2^20) PTEs will be slow
[Figure: page table with per-page access timestamps (PPN 10: 8:11am, 11: 7:05am, 15: 7:00am, 12: 8:21am) next to a TLB mapping VPNs 0-3 to those PPNs]
Approximating LRU
• Give up on finding the least recently used page: it is OK to evict a less recently used page
• Hardware just lazily sets 1 access bit per PTE
Clock algorithm
• The OS sweeps over the victim process's pages like a clock hand, clearing access bits, and evicts the first page whose access bit is already 0
• Hardware sets the access bit to 1 whenever it touches the page
• The evicted page was "not recently used"
[Figure: the victim process's pages arranged in a circle with access bits (AB); the OS hand advances, clearing 1s and stopping at a 0]
Clock algorithm (2)
• Optimisation: prefer evicting a page that has not changed since it was brought in from disk
• Such pages can just be dropped without copying them back to disk
• The OS clears the dirty bit when it brings a page into memory from disk; hardware sets it on every write
• Evict dirty pages only if not able to find a clean page
[Figure: the victim process's pages arranged in a circle with dirty bits (DB) and access bits (AB); the hand prefers AB = 0, DB = 0 victims]
Thrashing: library analogy
[Figure: a student shuttling between the library and the desk over time, repeatedly checking out and returning the DB and OS books]
Thrashing: library analogy (2)
• Problem: the library only allows one book to be checked out at a time. The student is constantly running to/from the library and is not able to make progress on any assignment.
• Solutions:
• Reduce the "working set":
• "I will finish the OS assignment completely before worrying about the DB assignment"
• "I will not work on the DB assignment at all"
• Buy a book to avoid going to the library
Thrashing
• The total working set of the running processes is larger than physical memory. The system is constantly swapping pages to/from disk and not able to do useful work.
• Solutions:
• Reduce the working set
• Admission control: run some processes for some time, then some others
• Out-of-memory killer: Linux kills the most memory-intensive process
• Buy more memory to avoid copying to/from disk
Reducing memory pressure
Kernel same-page merging (KSM)
• The OS periodically scans a few pages and computes their hashes
• If two pages hash the same, deduplicate them:
• change the PTE of one process to point to the common page
• add the duplicate page to the free list
[Figure: P1 and P2 each map an identical code page; after merging, both PTEs point to one shared DRAM page]
Shared pages and demand paging
• Need to update multiple PTEs when a shared page is swapped out/in
Reverse maps
• When swapping a shared page in/out, all PTEs that map it must be updated
• rmap: PPN -> list[*PTE]
Paging in action in xv6
xv6 book Ch 2
Paging in xv6
[Figure: physical memory (OS code at 0x00100000, OS data, free space up to 0x0E000000) is mapped into every process's virtual memory at 0x80000000-0x8E000000, with MMIO above 0x8E000000; user addresses occupy 0x00000000 upward]
• The OS is mapped at the same location in all processes
• os_va = pa + 0x80000000
Mapping OS into process address space
• The user/supervisor bit is not set for OS pages
• Process pages and page table pages come from the free space
• To allocate physical memory, the kernel can just allocate its own virtual memory
[Figure: P1's and P2's virtual memories both map the OS region; CR3 points into physical memory's free space]
Mapping OS into process virtual memory: trap handling
• The kernel stack and IDT are mapped in the address space of every process => hardware need not switch page tables to handle interrupts
[Figure: P2's virtual memory with OS code/data mapped; eip and esp move into the OS region during a trap]
Visualising syscall handling (p19-syscall)

# sys_open("console", O_WRONLY)
pushl $1
pushl $console
pushl $0
movl $SYS_open, %eax
int $T_SYSCALL

void syscall(void) {
  int num = curproc->tf->eax;
  curproc->tf->eax = syscalls[num]();
}

int sys_open(void) {
  int fd, omode;
  if(argint(1, &omode) < 0)
    return -1;
  ..
  return fd;
}

int argint(int n, int *ip) {
  return fetchint((myproc()->tf->esp) + 4 + 4*n, ip);
}

int fetchint(uint addr, int *ip) {
  if(addr >= p->sz || addr+4 > p->sz)
    return -1;
  *ip = *(int*)(addr + p->offset);
}

[Figure: trap frame on the kernel stack (%ss, %esp, %eflags, %cs, %eip, T_SYSCALL, %ds.., %eax=SYS_open, %ecx, ..., %edi, tf) and the user stack in process memory holding the arguments 1, *console, 0 below p->sz]
Mapping OS into process virtual memory: reading syscall parameters

int fetchint(uint addr, int *ip) {
  if(addr >= p->sz || addr+4 > p->sz)
    return -1;
  *ip = *(int*)(addr + p->offset);
}

int argint(int n, int *ip) {
  return fetchint((myproc()->tf->esp) + 4 + 4*n, ip);
}

[Figure: the kernel (eip, esp in the OS region) dereferences user addresses directly, since the user part of the address space is mapped in the current page table]
Kernel page table isolation (KPTI)
• With KPTI, the kernel is not mapped into user page tables, so page tables must be switched to handle interrupts/syscalls
• Syscall- and interrupt-heavy workloads like Postgres see 7-17% overheads (16-23% without tagged TLBs)
[Figure: P1's and P2's virtual memories side by side. During a context switch, changing CR3 makes eip jump to another address space; because the OS is mapped at the same virtual addresses in both, the next instruction after the jump is identical to the one before.]
Boot up sequence: BIOS to bootloader to OS
[Figure: three snapshots of physical memory: eip starts in the BIOS ROM at 0x000F0000, jumps to the bootloader at 0x00007C00, then to the OS at 0x00100000]
Kernel has different physical and virtual addresses
• kernel.ld declares virtual address 0x80100000, physical address 0x100000
• kernel.ld marks _start as the entry point. _start is V2P_WO(entry), i.e., (0x8010000c - 0x80000000)
• Running readelf -l kernel shows:
$ readelf -l kernel
Elf file type is EXEC (Executable file)
Entry point 0x10000c
There are 3 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x001000 0x80100000 0x00100000 0x07aab 0x07aab R E 0x1000
LOAD 0x009000 0x80108000 0x00108000 0x02516 0x0d4a8 RW 0x1000
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RWE 0x10
Section to Segment mapping:
Segment Sections...
00 .text .rodata
01 .data .bss
02
Virtual memory: entry.S sets up an initial page table

entry:
  # Set page directory
  movl $(V2P_WO(entrypgdir)), %eax
  movl %eax, %cr3
  # Turn on paging.
  movl %cr0, %eax
  orl $(CR0_PG|CR0_WP), %eax
  movl %eax, %cr0
  # Set up the stack and jump to main's high (virtual) address
  movl $(stack + KSTACKSIZE), %esp
  mov $main, %eax
  jmp *%eax

__attribute__((__aligned__(PGSIZE)))
pde_t entrypgdir[NPDENTRIES] = {
  // Map VA's [0, 4MB) to PA's [0, 4MB)
  [0] = (0) | PTE_P | PTE_W | PTE_PS,
  // Map VA's [KERNBASE, KERNBASE+4MB) to PA's [0, 4MB)
  [KERNBASE>>PDXSHIFT] = (0) | PTE_P | PTE_W | PTE_PS,
};

[Figure: both VA [0, 4MB) and VA [KERNBASE, KERNBASE+4MB) map to PA [0, 4MB); eip runs at low addresses until the jmp into main's high address]
Virtual memory: main calls kinit1 to initialise the free list

int main (void) {
  kinit1(end, P2V(4*1024*1024));
  ...
}

• Now pages can be allocated from the free list!
[Figure: the first 4MB of physical memory, minus the kernel, is linked into the free list; both the identity mapping and the high mapping still point at it]
Virtual memory: mark code read-only, remove identity mapping

static struct kmap {
  void *virt;
  uint phys_start;
  uint phys_end;
  int perm;
} kmap[] = {
  { (void*)KERNBASE, 0,             EXTMEM,    PTE_W},  // I/O space
  { (void*)KERNLINK, V2P(KERNLINK), V2P(data), 0},      // kernel text, read-only
  { (void*)data,     V2P(data),     PHYSTOP,   PTE_W},  // kernel data + free memory
  { (void*)DEVSPACE, DEVSPACE,      0,         PTE_W},  // devices
};

main.c calls kvmalloc in vm.c:
• setupkvm allocates a new page for the page directory
• mappages adds PTEs to map the four areas above
• switchkvm changes CR3 to the new page directory
Mapping pages
• mappages takes a page directory, virtual address, size, physical address, and permissions
• It calls walkpgdir with alloc=1 (to allocate page table pages if they do not exist) to find the page table entry
• It puts the physical address in the PTE, marks it present, and sets the other permission bits
• walkpgdir:
• mimics the hardware's page table walk: it takes the first 10 bits to index into the page directory to find the page table page, then the next 10 bits to index into the page table page, and returns the page table entry
• if the page table page does not exist, it allocates a new page and adds it to the page directory
Virtual memory: main calls kinit2 to expand the free list

int main (void) {
  kinit1(end, P2V(4*1024*1024));
  kvmalloc();
  ...
  kinit2(P2V(4*1024*1024), P2V(PHYSTOP));
}

• The rest of the physical memory is made available to the allocator
Setting up a new process

pinit(){
  p = allocproc();
  memmove(p->offset, _binary_initcode_start,);
  p->tf->ds,es,ss = (SEG_UDATA<<3) | DPL_USR;
  p->tf->cs = (SEG_UCODE<<3) | DPL_USR;
  p->tf->eflags = FL_IF;
  p->tf->esp = PGSIZE;
  p->tf->eip = 0;
}

allocproc() {
  sp = (char*)(STARTPROC + (PROCSIZE<<12));
  sp -= sizeof *p->tf;
  p->tf = (struct trapframe*)sp;
  sp -= sizeof *p->context;
  p->context = (struct context*)sp;
  p->context->eip = (uint)trapret;
  return p;
}

scheduler() {
  ...
  swtch(p->context);
}

swtch:
  movl 4(%esp), %eax
  movl %eax, %esp
  movl $0, %eax
  ret

.globl trapret
trapret:
  popal
  popl %gs
  popl %fs
  popl %es
  popl %ds
  addl $0x8, %esp
  iret

[Figure: the trap frame (%ss, %esp, %eflags=FL_IF, %cs=UCODE, %eip=0, error code 0, %ds, %es, ..., %edi) and the context (%eip=trapret, ...) laid out at the top of the process's memory, above the process code at offset p->offset]
Key changes from paging

pinit(){
  p = allocproc();
  p->pgdir = setupkvm();
  inituvm(p->pgdir, _binary_initcode_start, (int)_binary_initcode_size);
  p->sz = PGSIZE;
  ...
}

• inituvm:
• allocates a page and clears it
• adds the page at va=0 in the process's page table; notice that the user bit is set
• copies the init code into the page
[Figure: the process's page directory maps the OS region (as in kmap) plus one user page at va 0]
Setting up the kernel stack

void seginit(void) {
  c->gdt[SEG_UCODE] = SEG(.., STARTPROC, (PROCSIZE-1) << 12, DPL_USER);
  c->gdt[SEG_UDATA] = SEG(.., STARTPROC, (PROCSIZE-1) << 12, DPL_USER);
}

static struct proc* allocproc(void) {
  sp = (char*)(STARTPROC + (PROCSIZE>>12));
  p->kstack = sp - KSTACKSIZE;
}

void scheduler(void) {
  // pick RUNNABLE process p
  switchuvm(p);
  swtch(p->context);
}

void switchuvm(struct proc *p) {
  mycpu()->gdt[SEG_TSS] = SEG16(STS_T32A, &mycpu()->ts, sizeof(mycpu()->ts)-1, 0);
  mycpu()->ts.ss0 = SEG_KDATA << 3;
  mycpu()->ts.esp0 = (uint)p->kstack + KSTACKSIZE;
  ltr(SEG_TSS << 3);
}

[Figure: process 1's kernel stack sits inside process 1's contiguous memory region, above OS memory]

Key changes from paging 1’s PT page

void seginit(void) { 1’s PD page


c->gdt[SEG_UCODE] = SEG(.., 0, 0xffffffff, DPL_USER);
1’s kstack
c->gdt[SEG_UDATA] = SEG(.., 0, 0xffffffff, DPL_USER);
}
void scheduler(void) {
static struct proc* allocproc(void) { Process 1
// pick RUNNABLE process p
p->kstack = kalloc();
switchuvm(p);
sp = p->kstack + KSTACKSIZE;
swtch(p->context); OS memory

}
}
void switchuvm(struct proc *p) {
mycpu()->gdt[SEG_TSS] = SEG16(STS_T32A, &mycpu()->ts,
sizeof(mycpu()->ts)-1, 0);
• User segments map to entire memory since
protection is done via paging.
mycpu()->ts.ss0 = SEG_KDATA << 3;
mycpu()->ts.esp0 = (uint)p->kstack + KSTACKSIZE; • kstack, process memory need not be
ltr(SEG_TSS << 3); contiguous
lcr3(V2P(p->pgdir)); // switch to process address space
}
• switchuvm changes page tables
sbrk system call
• sys_sbrk calls growproc(n)
• growproc(n) calls allocuvm / deallocuvm
• allocuvm checks that the process is not trying to grow into the OS area, then maps pages into the page table with the writeable and user-accessible bits set
• deallocuvm deallocates pages one by one from newsz to oldsz. If a page table page is not found, it moves directly to the next PDE. If the PTE is found and present, it frees the physical page and changes the PTE to zero.