07 Paging
Abhilash Jindal
Overview
• More flexible address translation with paging (OSTEP Ch 18-20)
• Paging hardware
• Demand paging: swapping pages to disk when memory becomes full (OSTEP Ch
21-22)
• Swapping mechanisms
• Page replacement algorithms
• Paging in action (xv6 book Ch 2, OSTEP Ch 23)
• Paging on xv6
• Fork with Copy-on-write, Guard pages
Paging Hardware
OSTEP Ch 18-20
Intel SDM Volume 3A Ch 4
Memory isolation and address space
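[Figure: physical memory layout with the bootloader at 0x00007C00, the VGA display at 0x000A0000-0x000C0000, the OS (OS code at eip, OS stack at esp) at 0x00100000, Process 1 and Process 2 loaded above it (0x00200000 and 0x00300000 are marked), and free space. Process 1's own address space, holding Code and Stack, is drawn next to it.]
• Due to address translation, the compiler need not worry about where the program will be loaded!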
Segmentation
• Mapping large address spaces
• Place each segment independently to not map free space
[Figure: Process 1's address space (Code, Heap, Stack, with free space between heap and stack) next to physical memory, where the OS, the shared Process 1,2 code, the Process 1 and Process 2 heaps, and the Process 1 and Process 2 stacks are placed as separate segments.]
Allocating memory to a new process
• Find free spaces in physical memory
[Figure: physical memory holding the OS, the shared Process 1,2 code, and the heaps and stacks of Process 1 and Process 2; Process 3's code, heap, and stack are placed in the free gaps between them.]
External fragmentation
• After many processes start and exit, memory might become “fragmented” (similar to disk)
• Copying segments to compact memory is expensive
• Growing a heap may become impossible
Limitations of segmentation
Limited flexibility:
[Figure: Code, Heap, and Stack segments.]
Paging
Segmentation:
• Large address spaces need a multi-segment model. The burden falls on the programmer/compiler to manage multiple segments
• Different sized segments lead to external fragmentation
Paging:
• Transparently supports large address spaces. The programmer/compiler works with a flat virtual address space
[Figure: P1 address space (virtual addresses) mapped through its page table onto DRAM (physical memory); Code, Heap, and Stack pages are scattered across physical pages 0-7.]
Virtual page number   Physical page number   Present   Permission
9                     4                      Y         rw
8                     x                      N
7                     x                      N
6                     x                      N
5                     x                      N
4                     x                      N
3                     x                      N
2                     5                      Y         rw
1                     7                      Y         rx
0                     1                      Y         rx
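The lookup that this table implies can be sketched in a few lines of C. This is only an illustration: the 4 KB page size, the `translate` function, and the `struct pte` layout are assumptions made here, and permission bits are left out.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KB pages */
#define NUM_VPAGES 10   /* the table above lists VPNs 0..9 */

struct pte {
    uint32_t ppn;       /* physical page number */
    int present;        /* 1 = Y, 0 = N */
};

/* Entries copied from the table above; unlisted VPNs are not present. */
static const struct pte page_table[NUM_VPAGES] = {
    [0] = {1, 1}, [1] = {7, 1}, [2] = {5, 1}, [9] = {4, 1},
};

/* Returns 0 and stores the physical address, or -1 for a page fault. */
int translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    assert(paddr != 0);
    if (vpn >= NUM_VPAGES || !page_table[vpn].present)
        return -1;
    *paddr = (page_table[vpn].ppn << PAGE_SHIFT) | offset;
    return 0;
}
```

Under these assumptions, virtual address 0x2010 (VPN 2, offset 0x10) maps to physical address 0x5010, while any address in VPN 3 faults.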
Page table size
• Virtual addresses: 2^32
• A lot of page table entries are invalid
Virtual page number   Physical page number
7                     3
6                     x
5                     x
4                     x
3                     x
2                     4
1                     6
0                     0
[Figure: Process 1 address space with Code and Heap at the bottom, Stack at the top, and free space in between.]
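To put a number on the size problem, the arithmetic can be written out as below. The helper name and its parameters are illustrative; the 4 KB page size and 4-byte entry size are the usual 32-bit x86 values and are assumptions here, not figures from the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a flat (single-level) page table:
 * one entry per virtual page, whether or not the page is used. */
uint64_t flat_page_table_bytes(unsigned addr_bits, unsigned page_shift,
                               unsigned pte_bytes)
{
    assert(addr_bits >= page_shift && addr_bits - page_shift < 64);
    uint64_t entries = 1ull << (addr_bits - page_shift);
    return entries * pte_bytes;
}
```

With 32-bit addresses, 4 KB pages, and 4-byte entries, `flat_page_table_bytes(32, 12, 4)` gives 2^20 entries and 4 MB per process, even when almost all of the entries are invalid.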
Multi-level page table
• Page directory entries point to page table pages
• Unused portions of the virtual address space are skipped!
[Figure: the flat page table (VPN 7 -> 3, 2 -> 4, 1 -> 6, 0 -> 0, all others x) split into a page directory and three page table pages.]
Page directory (PPN: 8)
Virtual page numbers   Physical page number
6,7                    9
4,5                    x
2,3                    11
0,1                    10
Page table page (PPN: 9)
Virtual page number   Physical page number
7                     3
6                     x
Page table page (PPN: 11)
Virtual page number   Physical page number
3                     x
2                     4
Page table page (PPN: 10)
Virtual page number   Physical page number
1                     6
0                     0
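A toy version of the two-step lookup, using the numbers from this slide, might look as follows. The array layout and the `walk` function are illustrative assumptions; real hardware stores physical addresses and flag bits in each entry rather than plain integers.

```c
#include <assert.h>

#define ABSENT (-1)
#define ENTRIES_PER_TABLE 2   /* toy value: each page table page maps 2 VPNs */
#define NUM_PHYS_PAGES 12

/* Page table pages, indexed by the physical page they live in (9, 10, 11). */
static const int table_pages[NUM_PHYS_PAGES][ENTRIES_PER_TABLE] = {
    [9]  = {ABSENT, 3},       /* VPN 6 -> x, VPN 7 -> 3 */
    [10] = {0, 6},            /* VPN 0 -> 0, VPN 1 -> 6 */
    [11] = {4, ABSENT},       /* VPN 2 -> 4, VPN 3 -> x */
};

/* Page directory (PPN 8): one entry per pair of VPNs. */
static const int page_dir[4] = {10, 11, ABSENT, 9};

/* Returns the physical page number for vpn, or ABSENT. */
int walk(int vpn)
{
    assert(vpn >= 0 && vpn < 8);
    int table_ppn = page_dir[vpn / ENTRIES_PER_TABLE];
    if (table_ppn == ABSENT)
        return ABSENT;        /* whole range skipped: no page table page exists */
    return table_pages[table_ppn][vpn % ENTRIES_PER_TABLE];
}
```

VPNs 4 and 5 are rejected at the directory level, so no page table page is ever allocated for them; that is where the space saving comes from.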
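Notebook analogy
Paging
Preparing for OS exam:
• Read second letter from 3rd page
• Read second letter from 8th page
[Figure: an OS notebook with pages 0-5 reading "xv6 is an OS for x86" and a DB notebook with pages 0-3 reading "Write an SQL query", both stored in one shared notebook. The notebook index says OS:1 and DB:5. The OS index lists 0:3, 1:4, 2:2, 3:8, 4:6, 5:10; the DB index lists 0:9, 1:2, 2:7, 3:12.]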
Notebook analogy
Page directories: call 4 pages a “section”
Preparing for OS exam:
• Read second letter from 3rd page in Section 0
• Read second letter from 8th page
[Figure: the OS notebook split into Section 0 and Section 1 (pages 0-3 each), reading "xv6 is an OS for x86"; the DB notebook has a single Section 0 (pages 0-3), reading "Write an SQL query"; the shared notebook has pages 0-14.]
• Number of pages = 28
• Paging:
• Linear address => Physical address
Address translation with paging on x86
[Figure: the TLB holds VPN -> PPN translations; the example bit string shown is 1111111000110111.]
Which programs will run faster?
Which programs will have fewer TLB misses?
• High spatial locality: after the program accesses a memory location, it will access a nearby memory location
• High temporal locality: after the program accesses a memory location, it will soon access it again
Example bad programs
• Low spatial and temporal locality: most accesses lead to a TLB miss
• Large hash table with random access
int get(int i) {
  return a[i];
}
[Figure: array a laid out four elements per page: VPN=1 holds a[0]-a[3], VPN=2 holds a[4]-a[7], VPN=3 holds a[8]-a[11], VPN=4 holds a[12]-a[15], VPN=5 holds a[16]-a[19].]
L1 TLB misses.
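The effect of locality on TLB misses can be made concrete with a small simulation. This is a sketch under stated assumptions: a tiny fully associative TLB with FIFO replacement, pages numbered from 0 (the figure starts at VPN=1), and a hypothetical `count_tlb_misses` helper. Real TLBs are larger and use other replacement policies.

```c
#include <assert.h>

#define MAX_TLB_ENTRIES 64

/* Counts misses for a sequence of array indices, given how many array
 * elements fit on one page and how many translations the TLB can hold. */
int count_tlb_misses(const int *indices, int n, int elems_per_page,
                     int tlb_entries)
{
    int tlb[MAX_TLB_ENTRIES];
    int used = 0, next = 0, misses = 0;

    assert(elems_per_page > 0);
    assert(tlb_entries > 0 && tlb_entries <= MAX_TLB_ENTRIES);
    for (int k = 0; k < n; k++) {
        int vpn = indices[k] / elems_per_page;
        int hit = 0;
        for (int j = 0; j < used; j++) {
            if (tlb[j] == vpn) {
                hit = 1;
                break;
            }
        }
        if (hit)
            continue;
        misses++;
        if (used < tlb_entries) {
            tlb[used++] = vpn;              /* fill an empty slot */
        } else {
            tlb[next] = vpn;                /* replace the oldest entry */
            next = (next + 1) % tlb_entries;
        }
    }
    return misses;
}
```

With four elements per page and a 4-entry TLB, walking a[0] to a[19] in order misses once per page, whereas jumping a[0], a[4], a[8], a[12], a[16] and repeating touches five pages in rotation and misses on every access.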
Context switching
• During context switch, OS changes CR3 register to change page table
movl %eax, %cr3
• Privileged operation
• Marks each TLB entry invalid. Every memory access after context switch causes TLB miss!
[Figure: P1 address space (virtual addresses), DRAM (physical memory), and P2 address space (virtual addresses), each holding Code, Heap, and Stack pages.]
P1 page table                                 P2 page table
Virtual page number   Physical page number    Virtual page number   Physical page number
7                     3                       7                     5
6                     x                       6                     x
5                     x                       5                     x
4                     x                       4                     x
3                     x                       3                     7
2                     4                       2                     3
1                     6                       1                     6
0                     0                       0                     1
TLB
Virtual page number   Physical page number
0                     0
2                     4
0                     1
INVLPG instruction
• TLB is neither write-back, nor write-through cache
• Need to run INVLPG <virtual page number> when a page table entry is modified
• Similar to how OS needs to run LGDT when GDT entries are changed
Tagged TLBs
• TLB entries are “tagged” with a process context identifier (PCID)
• Additional bits in CR3 register tell hardware about the current PCID
• Upon context switch:
• OS changes CR3: PCID, page directory base
• Hardware need not invalidate TLB entries.
TLB
Virtual page number   Physical page number   Process context identifier
0                     0                      P1
2                     4                      P1
0                     1                      P2
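A tagged lookup can be sketched as a search keyed on both the PCID and the virtual page number. The struct layout and the `tagged_tlb_lookup` name are illustrative, and the PCIDs 1 and 2 stand in for P1 and P2; hardware does this comparison in parallel rather than in a loop.

```c
#include <assert.h>

#define TLB_MISS (-1)

struct tlb_entry {
    int pcid;   /* process context identifier the entry belongs to */
    int vpn;    /* virtual page number */
    int ppn;    /* physical page number */
};

/* An entry matches only if both the PCID and the VPN match, so entries
 * left behind by another process are ignored instead of flushed. */
int tagged_tlb_lookup(const struct tlb_entry *tlb, int n,
                      int current_pcid, int vpn)
{
    assert(tlb != 0 || n == 0);
    for (int i = 0; i < n; i++) {
        if (tlb[i].pcid == current_pcid && tlb[i].vpn == vpn)
            return tlb[i].ppn;
    }
    return TLB_MISS;
}
```

With the three entries from the table, VPN 0 resolves to physical page 0 while P1 runs and to physical page 1 while P2 runs, without any invalidation in between.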
~5us
[Figure: disk layout: boot block, super block, three swap blocks, three index blocks, bitmap block (0010), four data blocks.]
Swapping out a page
• Find a free swap block on disk
• Copy page to the free block
• Run INVLPG instruction to remove page from TLB
[Figure: the swap block now holds Proc 1 [VPN2].]
• Swap out one page when we completely run out of physical memory
• What if OS itself needed a new page?
• Start swapping out before we completely run out
• When there are fewer than N free pages left
• Swap out multiple pages in one shot until we have M (> N) free pages left
• Sending multiple disk writes in one shot reduces seek delay
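The N and M thresholds above behave like a low and a high watermark. A minimal sketch of the decision follows; the function and parameter names are made up for illustration.

```c
#include <assert.h>

/* Returns how many pages to swap out in one batch.
 * low_watermark is N and high_watermark is M from the bullets above. */
int pages_to_swap_out(int free_pages, int low_watermark, int high_watermark)
{
    assert(0 <= low_watermark && low_watermark < high_watermark);
    if (free_pages >= low_watermark)
        return 0;                       /* enough free memory: do nothing */
    return high_watermark - free_pages; /* refill up to M free pages */
}
```

Keeping M well above N means the swapper runs rarely but writes a large batch each time, which is what lets the disk writes be issued together.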
Which page to evict?
• Goal: minimize number of swap ins/outs
• Belady’s algorithm for optimal page replacement
• Future is unknown!
Memory access sequence: 0, 1, 2, 0, 1, 3, 0, 3, 1, 2, 1
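Belady's optimal policy evicts the page whose next use lies furthest in the future, which is why it needs the whole access sequence in advance. The sketch below is an offline simulation only; the function names and the cache size of 3 used in the note are assumptions, since the slide does not state a cache size.

```c
#include <assert.h>

#define MAX_FRAMES 16

/* Position of the next access to page at or after from, or n if none. */
static int next_use(const int *seq, int n, int from, int page)
{
    for (int i = from; i < n; i++) {
        if (seq[i] == page)
            return i;
    }
    return n;
}

/* Counts misses when the page used furthest in the future is evicted. */
int optimal_misses(const int *seq, int n, int frames)
{
    int cache[MAX_FRAMES];
    int used = 0, misses = 0;

    assert(frames > 0 && frames <= MAX_FRAMES);
    for (int t = 0; t < n; t++) {
        int hit = 0;
        for (int j = 0; j < used; j++) {
            if (cache[j] == seq[t]) {
                hit = 1;
                break;
            }
        }
        if (hit)
            continue;
        misses++;
        if (used < frames) {
            cache[used++] = seq[t];
            continue;
        }
        int victim = 0, farthest = -1;
        for (int j = 0; j < used; j++) {
            int d = next_use(seq, n, t + 1, cache[j]);
            if (d > farthest) {
                farthest = d;
                victim = j;
            }
        }
        cache[victim] = seq[t];
    }
    return misses;
}
```

For the sequence on this slide and an assumed cache of 3 pages, the simulation reports 5 misses: the three cold misses, one for page 3, and one for the return of page 2.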
FIFO
• Evict the page that came first to the cache
• OS appends the page to a queue when it swaps in a page (or when it allocates a new page)
Memory access sequence: 0, 1, 2, 0, 1, 3, 0, 3, 1, 2, 1
Belady’s anomaly
Bigger caches can have lower hit rates!
Cache of size 3                          Cache of size 4
Access  Hit/miss  Resulting cache        Access  Hit/miss  Resulting cache
1       Miss      1                      1       Miss      1
2       Miss      1, 2                   2       Miss      1, 2
3       Miss      1, 2, 3                3       Miss      1, 2, 3
4       Miss      2, 3, 4                4       Miss      1, 2, 3, 4
1       Miss      3, 4, 1                1       Hit       1, 2, 3, 4
2       Miss      4, 1, 2                2       Hit       1, 2, 3, 4
5       Miss      1, 2, 5                5       Miss      2, 3, 4, 5
1       Hit       1, 2, 5                1       Miss      3, 4, 5, 1
2       Hit       1, 2, 5                2       Miss      4, 5, 1, 2
3       Miss      2, 5, 3                3       Miss      5, 1, 2, 3
4       Miss      5, 3, 4                4       Miss      1, 2, 3, 4
5       Hit       5, 3, 4                5       Miss      2, 3, 4, 5
FIFO does not have the “stack property”: a cache of size 4 may not contain every element held by a cache of size 3.
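The anomaly is easy to reproduce with a small FIFO simulation. The `fifo_misses` function below is an illustrative sketch that keeps the cache in a fixed array and overwrites the oldest slot on each eviction.

```c
#include <assert.h>

#define MAX_FRAMES 16

/* Counts misses for FIFO replacement with the given number of frames. */
int fifo_misses(const int *seq, int n, int frames)
{
    int cache[MAX_FRAMES];
    int used = 0, oldest = 0, misses = 0;

    assert(frames > 0 && frames <= MAX_FRAMES);
    for (int t = 0; t < n; t++) {
        int hit = 0;
        for (int j = 0; j < used; j++) {
            if (cache[j] == seq[t]) {
                hit = 1;
                break;
            }
        }
        if (hit)
            continue;                   /* FIFO order is not updated on a hit */
        misses++;
        if (used < frames) {
            cache[used++] = seq[t];
        } else {
            cache[oldest] = seq[t];     /* evict the page that arrived first */
            oldest = (oldest + 1) % frames;
        }
    }
    return misses;
}
```

Running it on the access sequence 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 from the table gives 9 misses with 3 frames and 10 misses with 4 frames, matching the Hit/miss columns.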
Fairness in page replacement
• Someone had lots of pages. I had very few. My page was evicted
• OS maintains “resident size” per process: 1, 7, 9
• First select a victim process with the largest resident size, then remove its pages
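Victim selection by resident size amounts to a maximum search over per-process page counts. The function name and array representation below are illustrative.

```c
#include <assert.h>

/* Returns the index of the process with the most resident pages. */
int pick_victim_process(const int *resident_pages, int nprocs)
{
    assert(nprocs > 0);
    int victim = 0;
    for (int i = 1; i < nprocs; i++) {
        if (resident_pages[i] > resident_pages[victim])
            victim = i;
    }
    return victim;
}
```

For the resident sizes 1, 7, 9 on this slide, the third process is chosen, so the process holding a single page keeps it.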
Least Recently Used (LRU)
• Evicted page was “not recently used”
[Figure: the OS scans the victim proc’s pages, each marked with a 0 or 1 bit.]
Clock algorithm (2)
• Optimisation:
• We should prefer evicting a page that has not changed since it was brought back from disk
• Such pages can just be deleted without doing a copy
[Figure: the OS scans pages that each carry two bits, labelled DB and AB.]
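One common way to combine the two bits is the textbook "enhanced second chance" sweep, sketched below. It is not necessarily the exact variant drawn on the slide; reading DB as a dirty bit and AB as an accessed bit is an assumption, as are the struct and function names.

```c
#include <assert.h>

struct page_bits {
    int accessed;   /* assumed meaning of AB: set by hardware on access */
    int dirty;      /* assumed meaning of DB: set by hardware on write */
};

/* Picks a victim, preferring pages that are not recently used and clean.
 * hand is the clock hand; it is advanced past the chosen victim. */
int clock_pick_victim(struct page_bits *pages, int n, int *hand)
{
    assert(n > 0 && *hand >= 0 && *hand < n);
    for (int pass = 0; pass < 4; pass++) {
        int want_dirty = pass % 2;   /* passes 0, 2: clean; passes 1, 3: dirty */
        for (int step = 0; step < n; step++) {
            int i = (*hand + step) % n;
            if (!pages[i].accessed && pages[i].dirty == want_dirty) {
                *hand = (i + 1) % n;
                return i;
            }
            if (want_dirty)
                pages[i].accessed = 0;   /* second chance used up */
        }
    }
    return -1;   /* not reached for n > 0 */
}
```

A clean victim can be dropped immediately, while a dirty one first has to be written to a swap block, which is why the clean passes come first.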
Thrashing: Library analogy (2)
• Problem: Library only allows one book to be checked out. Student is
constantly running to/from the library. Not able to do work on any assignment.
• Solution:
• Reduce “working set”
• I will work on OS assignment completely before worrying about DB
assignment
• Solution:
• Reduce working set
• Admission control: run some processes for some time and then some others
Paging in xv6
[Figure: xv6 virtual address space next to physical memory. Physical memory runs from 0x00000000 to 0x0E000000, with OS code at 0x00100000, OS data above it, and free space above that. It is mapped at virtual addresses 0x80000000 to 0x8E000000, so OS code appears at 0x80100000 with OS data above it. MMIO is drawn above 0x8E000000. CR3 points to the page table that holds this mapping.]
Trap handling
[Figure: memory holding OS code, OS data, and free space, with eip and esp marked.]
Visualising syscall handling p19-syscall
User code:
# sys_open("console", O_WRONLY)
pushl $1
pushl $console
pushl $0
movl $SYS_open, %eax
int $T_SYSCALL
pushl %eax
Kernel code:
void syscall(void) {
  int num = curproc->tf->eax;
  curproc->tf->eax = syscalls[num]();
}
int sys_open(void) {
  int fd, omode;
  if(argint(1, &omode) < 0) {
    return -1;
  }
  ..
  return fd;
}
int argint(int n, int *ip) {
  return fetchint((myproc()->tf->esp) + 4 + 4*n, ip);
}
int fetchint(uint addr, int *ip) {
  if(addr >= p->sz || addr+4 > p->sz)
    return -1;
  *ip = *(int*)(addr + p->offset);
  return 0;
}
[Figure: the trap frame (tf) holds %ss, %esp, %eflags, %cs, %eip, 0, T_SYSCALL, %ds.., %eax=SYS_open, %ecx, …, %edi. The process's memory, labelled with p->offset, 0, and p->sz, holds the process code and the user stack with 1, *console, and 0 on it.]
P2 virtual memory
int argint(int n, int *ip) {
  return fetchint((myproc()->tf->esp) + 4 + 4*n, ip);
}
[Figure: P2 virtual memory with OS code and OS data mapped; tf->esp points into the user stack.]
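The address arithmetic in argint can be tried out in user space by treating a byte array as the process's memory. Everything here is a simulation under stated assumptions: `fetch_int`, `arg_int`, and the `mem`/`size` parameters are illustrative stand-ins, not xv6 functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bounds-checked read of a 32-bit value at user address addr.
 * mem/size model the process's memory and p->sz. */
int fetch_int(const uint8_t *mem, uint32_t size, uint32_t addr, int32_t *out)
{
    assert(mem != 0 && out != 0);
    if (addr >= size || size - addr < 4)   /* same check, written to avoid overflow */
        return -1;
    memcpy(out, mem + addr, sizeof *out);
    return 0;
}

/* n-th syscall argument: skip the return-address slot at user_esp,
 * then step 4 bytes per argument. */
int arg_int(const uint8_t *mem, uint32_t size, uint32_t user_esp, int n,
            int32_t *out)
{
    return fetch_int(mem, size, user_esp + 4 + 4 * (uint32_t)n, out);
}
```

With the stack laid out as in the slide (0 at the saved esp, then the address of "console", then 1), `arg_int(..., 1, &omode)` reads the value 1, that is, O_WRONLY.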
Kernel page table isolation (KPTI)
[Figure: virtual address space and physical memory as in the earlier xv6 layout (physical memory 0x00000000-0x0E000000 with OS code at 0x00100000, mapped at virtual 0x80000000; CR3 selects the page table), with eip in OS code at 0x80100000.]
[Figure: physical memory map at boot, drawn three times, with eip marked at the BIOS ROM (0x000F0000), at the bootloader (0x00007C00), and at the OS (0x00100000); the VGA display occupies 0x000A0000-0x000C0000.]
Kernel has different physical and virtual addresses
• kernel.ld declares virtual address 0x80100000 and physical address 0x100000
• kernel.ld marks _start as the entry point. _start is V2P_WO(entry), i.e., (0x8010000c - 0x80000000)
• Running readelf -l kernel shows
$ readelf -l kernel
Elf file type is EXEC (Executable file)
Entry point 0x10000c
There are 3 program headers, starting at offset 52
Program Headers:
  Type       Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD       0x001000 0x80100000 0x00100000 0x07aab 0x07aab R E 0x1000
  LOAD       0x009000 0x80108000 0x00108000 0x02516 0x0d4a8 RW  0x1000
  GNU_STACK  0x000000 0x00000000 0x00000000 0x00000 0x00000 RWE 0x10
Section to Segment mapping:
  Segment Sections...
  00      .text .rodata
  01      .data .bss
  02
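The relation between the VirtAddr and PhysAddr columns is a fixed offset of 0x80000000. A small sketch of the conversion follows; the helper functions `v2p` and `p2v` are written here for illustration and mirror the idea behind the V2P_WO macro rather than reproducing xv6's headers.

```c
#include <assert.h>
#include <stdint.h>

#define KERNBASE 0x80000000u   /* virtual address where physical 0 is mapped */

/* Kernel virtual address -> physical address. */
uint32_t v2p(uint32_t va)
{
    assert(va >= KERNBASE);
    return va - KERNBASE;
}

/* Physical address -> kernel virtual address. */
uint32_t p2v(uint32_t pa)
{
    assert(pa < KERNBASE);     /* keeps the sum within 32 bits */
    return pa + KERNBASE;
}
```

Applied to the numbers above, `v2p(0x8010000c)` gives the entry point 0x10000c, and both LOAD segments differ by the same offset between their two address columns.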
Virtual memory
[Figure: virtual memory with the low addresses mapped: bootloader at 0x00007C00 and VGA display at 0x000A0000-0x000C0000, starting from 0x00000000.]
Virtual memory
• It puts the physical address in the PTE, marks it as present, and sets the other permission bits on the PTE
• walkpgdir:
• mimics hardware’s page table walk. It takes the first 10 bits to index into the page directory to find the page table page. It takes the next 10 bits to index into the page table page. It returns the page table entry.
• If the page table page does not exist, it allocates a new page and adds it to the page directory
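The 10/10/12 split and the allocate-on-demand step can be sketched in user space as below. This is a simplified model, not xv6's walkpgdir: the directory stores host pointers instead of physical addresses with flag bits, and `struct page_dir` and `walk` are names invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NENTRIES 1024                                /* 2^10 entries per level */
#define PDX(va) (((uint32_t)(va) >> 22) & 0x3FF)     /* first 10 bits */
#define PTX(va) (((uint32_t)(va) >> 12) & 0x3FF)     /* next 10 bits */

typedef uint32_t pte_t;

/* Simplification: each directory slot is a host pointer to a table page. */
struct page_dir {
    pte_t *tables[NENTRIES];
};

/* Returns a pointer to the PTE for va, or 0 if the page table page is
 * missing and alloc is 0 (or allocation fails). */
pte_t *walk(struct page_dir *pgdir, uint32_t va, int alloc)
{
    assert(pgdir != 0);
    pte_t **slot = &pgdir->tables[PDX(va)];
    if (*slot == 0) {
        if (!alloc)
            return 0;
        *slot = calloc(NENTRIES, sizeof(pte_t));     /* zeroed: no PTE present */
        if (*slot == 0)
            return 0;
    }
    return &(*slot)[PTX(va)];
}
```

For the kernel's link address 0x80100000, the directory index is 0x200 and the table index is 0x100; the first call with alloc set creates the table page, and later walks in the same 4 MB region reuse it.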
Virtual memory
[Figure: virtual memory with physical memory mapped at 0x80000000 (OS code at physical 0x00100000, OS data above it); CR3 points to the page table.]
• Rest of the physical memory is made available to allocator
Setting up new process
scheduler() {
  …
  swtch(p->context);
}
swtch:
  movl 4(%esp), %eax
  movl %eax, %esp
  movl $0, %eax
  ret
pinit(){
  p = allocproc();
  memmove(p->offset, _binary_initcode_start,);
  p->tf->ds,es,ss = (SEG_UDATA<<3) | DPL_USR;
  p->tf->cs = (SEG_UCODE<<3) | DPL_USR;
  p->tf->eflags = FL_IF;
  p->tf->esp = PGSIZE;
  p->tf->eip = 0;
}
allocproc() {
  sp = (char*)(STARTPROC + (PROCSIZE<<12));
  sp -= sizeof *p->tf;
  p->tf = (struct trapframe*)sp;
  sp -= sizeof *p->context;
  p->context = (struct context*)sp;
  p->context->eip = (uint)trapret;
  return p;
}
.globl trapret
trapret:
  popal
  popl %gs
  popl %fs
  popl %es
  popl %ds
  addl $0x8, %esp
  iret
[Figure: the new process's memory starting at p->offset, with the process code at the bottom (eip) and, at the top (sp), the trap frame p->tf holding %eflags=FL_IF, %cs=UCODE, %eip=0, 0, 0, %ds, %es, …, %edi, followed by p->context holding %eip=trapret; esp points at the return address.]
Virtual memory
pinit(){
  p = allocproc();
  p->pgdir = setupkvm();
  inituvm(p->pgdir, _binary_initcode_start, (int)_binary_initcode_size);
  p->sz = PGSIZE;
  …
}
[Figure: virtual memory with OS code at 0x80100000 and OS data above it, physical memory mapped at 0x80000000; CR3 points to the page table.]
• inituvm:
• allocates a page, clears it
• deallocuvm deallocates pages one by one from newsz to oldsz. If the page table page is not found, we move directly to the next PDE. If the PTE is found and present, we free the physical page and set the PTE to zero.