Swapping, Fragmentation, and Segmentation

16 Jul 2019

Swapping

If run time binding is used, then a process can be easily swapped back into a different area of memory.

If compile time or load time binding is used, then process swapping becomes very complicated and slow, which is undesirable: the process would have to be swapped back into exactly the same physical location.

Swapping Difficulties

Context-switch time of swapping is very slow.

  • Disks take on the order of 10s-100s of ms per access.
  • Adding the time to transfer the process image itself, a swap can take seconds.
  • Ideally, we can hide this latency by having other processes to run while a swap takes place behind the scenes.
    • e.g. in Round Robin (RR), swap out the just-run process, and have enough processes in RR to run before the swap-in completes and the newly swapped-in process is ready to run.
  • We can’t always hide this latency if in-memory processes are blocked on I/O.
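
As a rough sanity check on those numbers, the sketch below estimates one-way swap time as a fixed disk-access latency plus transfer time. The disk latency, bandwidth, and process size are illustrative assumptions, not measurements.

    #include <stdio.h>

    int main(void) {
        /* Illustrative assumptions, not measured values. */
        double access_ms     = 10.0;    /* one disk access: ~10 ms        */
        double bandwidth_mbs = 100.0;   /* sustained transfer: ~100 MB/s  */
        double process_mb    = 100.0;   /* size of the process image      */

        /* one-way swap time = fixed access latency + size / bandwidth */
        double swap_ms = access_ms + (process_mb / bandwidth_mbs) * 1000.0;

        printf("swapping a %.0f MB process takes roughly %.0f ms one way,\n",
               process_mb, swap_ms);
        printf("so a swap-out plus swap-in takes roughly %.1f s\n",
               2.0 * swap_ms / 1000.0);
        return 0;
    }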

Swapping of processes that are blocked or waiting on I/O becomes complicated. One rule is to simply avoid swapping processes with pending I/O. Furthermore, fragmentation of main memory becomes a big issue.

Modern OSes swap portions of processes in conjunction with virtual memory and demand paging.

Memory Allocation

As processes arrive, they are allocated a space in main memory (RAM)

Allocation Strategies:

  • best fit: find the smallest chunk that is big enough.
    • This results in more and more fragmentation.
  • worst-fit: find the largest chunk that is big enough.
    • This leaves the largest contiguous unallocated chunk for the next process.
  • first-fit: find the first chunk that is big enough.
    • This tends to fragment memory near the beginning of the list.
  • next fit: view fragments as forming a circular buffer.
    • Find the first chunk that is big enough after the most recently chosen fragment.
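
To make the contrast concrete, here is a minimal sketch that scans an array of free-chunk sizes (a simplified stand-in for the OS free list) and picks a chunk under first-fit, best-fit, or worst-fit; next-fit would additionally remember where the previous allocation was made and resume scanning from there. The chunk sizes and request size are made-up illustration values.

    #include <stddef.h>
    #include <stdio.h>

    typedef enum { FIRST_FIT, BEST_FIT, WORST_FIT } strategy_t;

    /* The free list is modeled as an array of free-chunk sizes in bytes.
     * Returns the index of the chosen chunk, or -1 if nothing is big enough. */
    static int choose_chunk(const size_t *chunks, int n, size_t request, strategy_t s) {
        int chosen = -1;
        for (int i = 0; i < n; i++) {
            if (chunks[i] < request)                /* too small: skip          */
                continue;
            if (s == FIRST_FIT)
                return i;                           /* first fitting chunk wins */
            if (chosen == -1 ||
                (s == BEST_FIT  && chunks[i] < chunks[chosen]) ||  /* smallest that fits */
                (s == WORST_FIT && chunks[i] > chunks[chosen]))    /* largest that fits  */
                chosen = i;
        }
        return chosen;
    }

    int main(void) {
        size_t free_list[] = { 600, 100, 350, 2000, 120 };  /* hypothetical free chunks */
        int n = (int)(sizeof free_list / sizeof free_list[0]);
        size_t request = 300;

        printf("first-fit -> chunk %d\n", choose_chunk(free_list, n, request, FIRST_FIT));
        printf("best-fit  -> chunk %d\n", choose_chunk(free_list, n, request, BEST_FIT));
        printf("worst-fit -> chunk %d\n", choose_chunk(free_list, n, request, WORST_FIT));
        return 0;
    }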

Over time, processes leave and memory is deallocated. This results in fragmentation of memory. There will be many small chunks of non-contiguous unallocated memory between allocated processes in memory. For the next process, the OS must find a large enough unallocated chunk in fragmented memory. There may be enough memory for this, but it may not be contiguous to allow a process to load. This results in external fragmentation of main memory.

De-fragmentation/Compaction

  • Algorithms for recombining adjacent free segments into larger contiguous chunks are used, but memory still gets fragmented.
  • Moving memory is expensive.

Paging

Avoiding external fragmentation:

  • Over time, repeated allocation and deallocation cause many small chunks of non-contiguous unallocated memory to form between allocated processes, resulting in external fragmentation. The OS must swap out current processes to create a large enough contiguous unallocated chunk. A solution is de-fragmentation, but moving memory is costly.

A better solution to external fragmentation is to divide the logical address space into fixed-size pages. We also break main memory into fixed-size frames, and each page can be located in any frame.

The user’s view of memory is still as one contiguous block of logical address space. The MMU performs run-time mapping of each logical address to a physical address using the page table.

A typical page size is 4-8 KB

  • Ex: a 4 GB 32-bit address space with 4 KB/page (2^12 B) -> (2^32)/(2^12) = 2^20 ≈ 1 million entries in the page table. Each entry needs >= 20 bits to hold a frame number, so the table is at least ~2.5 MB per process (≈4 MB with 4-byte entries).
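
Working that arithmetic through explicitly (a sketch using the 32-bit address width and 4 KB page size from the example above; the entry width is taken as the 20-bit minimum):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint64_t address_space = 1ULL << 32;   /* 4 GB of virtual addresses        */
        uint64_t page_size     = 1ULL << 12;   /* 4 KB pages                       */
        uint64_t entries       = address_space / page_size;  /* 2^20 ~= 1 million  */

        /* Each entry must hold at least a 20-bit frame number (2^20 frames). */
        double min_table_bytes = (double)entries * 20.0 / 8.0;

        printf("page table entries: %llu\n", (unsigned long long)entries);
        printf("minimum table size: %.1f MB per process\n",
               min_table_bytes / (1 << 20));
        return 0;
    }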

With paging we don’t get external fragmentation, but we do get some internal fragmentation

  • For example, suppose we have a process of size 4097 B, and each page size is 4 KB (4096 B). Then we have to allocate two pages (8 KB), and 4095 B of the second page is wasted due to fragmentation that is internal to a page.
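
A minimal sketch of that calculation, assuming the 4 KB page size and the 4097 B process from the example:

    #include <stdio.h>

    int main(void) {
        unsigned page_size    = 4096;   /* 4 KB pages                          */
        unsigned process_size = 4097;   /* one byte more than a single page    */

        /* Pages needed = ceiling(process_size / page_size). */
        unsigned pages  = (process_size + page_size - 1) / page_size;
        unsigned wasted = pages * page_size - process_size;  /* internal fragmentation */

        printf("%u B process -> %u pages (%u B allocated), %u B wasted\n",
               process_size, pages, pages * page_size, wasted);
        return 0;
    }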

The OS also has to maintain a frame table/pool that keeps track of what physical memory frames are free.

Conceptually, every logical/virtual address can now be divided into two parts:

  1. Most significant bits: logical page # p
    • Equals the virtual address / page size
    • Used to index into page table to retrieve the corresponding physical frame f
  2. Least significant bits: page offset d
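
In code, the split is just a shift and a mask. This is a sketch assuming 32-bit virtual addresses and 4 KB pages, as in the earlier example; the sample address is arbitrary.

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12                        /* 4 KB pages: 2^12 bytes   */
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)  /* low 12 bits = offset d   */

    int main(void) {
        uint32_t vaddr = 0x00403A7C;             /* arbitrary sample address */

        uint32_t p = vaddr >> PAGE_SHIFT;        /* logical page number p    */
        uint32_t d = vaddr & PAGE_MASK;          /* page offset d            */

        /* The MMU would use p to index the page table, retrieve frame f,
         * and form the physical address as (f << PAGE_SHIFT) | d.          */
        printf("vaddr 0x%08X -> page p = 0x%X, offset d = 0x%03X\n",
               (unsigned)vaddr, (unsigned)p, (unsigned)d);
        return 0;
    }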

Implementing a Page Table

Option 1: Use a dedicated bank of hardware registers or memory to store the page table.

  • Fast per-instruction translation.
  • Slow per context switch: the entire page table has to be reloaded for the new process.
  • Limited by cost (expensive hardware) to small tables. Some page tables are large, e.g. 1 million entries, which would be too expensive to hold in dedicated hardware.

Option 2: Store the page table in main memory.

  • Keep a pointer to the page table in a special CPU register called the Page Table Base Register (PTBR).
  • Can accommodate fairly large page tables.
  • Fast context switch: only the PTBR has to be reloaded.
  • Slow per-instruction translation, because each instruction fetch requires a two-step memory access:
    1. Find the page table in memory and index to the appropriate entry to retrieve the physical frame # f.
    2. Retrieve the instruction from physical memory frame f.

Option 3: Cache a subset of page table mappings/entries in a small set of CPU buffers called Translation Look-aside Buffers (TLBs).

  • Fast solution to Option 2’s slow two-step memory access.
  • Several TLB caching policies:
    • Cache the most popular or most frequently referenced pages in the TLB.
    • Cache the most recently used pages.

Paging and TLB Caching

The MMU in the CPU first looks in TLBs to find a match for a given logical address.

  1. If a match is found, then main memory can be accessed quickly using physical frame f (plus offset d).
    • This is called a TLB hit.
    • The TLB, implemented in hardware, does a fast parallel match of the input page number against all stored entries in the cache, adding only ~10% overhead in speed.
  2. If no match found, then this is a TLB miss.
    • Go through the regular two-step lookup procedure:
      • Go to main memory to find the page table and index into it to retrieve frame #f, then retrieve what is stored at address <f,d> in physical memory.
    • Update TLB cache with the new entry from the page table.
      • If the cache is full, then apply a cache replacement strategy, e.g. Least Recently Used (LRU). The goal is to maximize TLB hits and minimize TLB misses.
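
The hit/miss procedure above can be sketched as follows. Real hardware searches all TLB entries in parallel; the linear scan, the tiny 4-entry TLB, the LRU replacement policy, and the stand-in page_table_lookup function here are simplifying assumptions for illustration.

    #include <stdio.h>

    #define TLB_SIZE   4
    #define PAGE_SHIFT 12

    typedef struct {
        unsigned page;       /* logical page number p           */
        unsigned frame;      /* cached physical frame number f  */
        unsigned last_used;  /* timestamp for LRU replacement   */
        int      valid;
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_SIZE];
    static unsigned    ticks;

    /* Stand-in for the in-memory page table walk on a TLB miss:
     * here the "table" just maps page p to frame p + 100 for illustration. */
    static unsigned page_table_lookup(unsigned page) { return page + 100; }

    static unsigned translate(unsigned vaddr) {
        unsigned page   = vaddr >> PAGE_SHIFT;
        unsigned offset = vaddr & ((1u << PAGE_SHIFT) - 1);
        ticks++;

        /* 1. TLB hit: return the cached frame immediately. */
        for (int i = 0; i < TLB_SIZE; i++)
            if (tlb[i].valid && tlb[i].page == page) {
                tlb[i].last_used = ticks;
                printf("TLB hit  for page %u\n", page);
                return (tlb[i].frame << PAGE_SHIFT) | offset;
            }

        /* 2. TLB miss: walk the page table, then cache the new mapping,
         *    evicting the least recently used entry if the TLB is full. */
        printf("TLB miss for page %u\n", page);
        unsigned frame = page_table_lookup(page);
        int victim = 0;
        for (int i = 1; i < TLB_SIZE; i++)
            if (!tlb[i].valid || tlb[i].last_used < tlb[victim].last_used)
                victim = i;
        tlb[victim] = (tlb_entry_t){ page, frame, ticks, 1 };
        return (frame << PAGE_SHIFT) | offset;
    }

    int main(void) {
        unsigned addrs[] = { 0x1000, 0x1008, 0x2000, 0x1010, 0x5000 };
        for (int i = 0; i < 5; i++)
            printf("  -> physical 0x%X\n", translate(addrs[i]));
        return 0;
    }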

On a context switch, all TLB entries would typically have to be invalidated (completely flushed), since different processes have different page tables. An alternative is to include process IDs in the TLB, at the additional cost of hardware and an extra comparison per lookup. Only TLB entries with a process ID matching the current task are considered valid (e.g. the DEC Alpha RISC CPU).
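
A minimal sketch of the process-ID (address-space ID) alternative: each TLB entry carries the ID of the process that owns it, and a lookup only counts as a hit when that ID matches the currently running process, so the TLB does not need to be flushed on a context switch. The struct layout and field names are illustrative, not any particular CPU’s format.

    #include <stdint.h>

    typedef struct {
        uint32_t page;   /* logical page number             */
        uint32_t frame;  /* physical frame number           */
        uint16_t asid;   /* address-space (process) ID tag  */
        int      valid;
    } tagged_tlb_entry_t;

    /* An entry only counts as a hit for the process that owns it; entries
     * belonging to other processes stay cached but are simply ignored, so
     * the TLB does not need to be flushed on a context switch. */
    static int tlb_hit(const tagged_tlb_entry_t *e, uint32_t page, uint16_t current_asid) {
        return e->valid && e->asid == current_asid && e->page == page;
    }

    int main(void) {
        tagged_tlb_entry_t e = { .page = 7, .frame = 42, .asid = 3, .valid = 1 };
        /* Process 3 hits on page 7; process 5 misses even for the same page. */
        return !(tlb_hit(&e, 7, 3) && !tlb_hit(&e, 7, 5));
    }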

Another option is to prevent frequently used pages from being automatically invalidated in the TLBs on a task switch.

ARM allows flushing of individual entries from the TLB indexed by virtual address.

Paging and L1/L2 Caching

How does the MMU interact with L1 or L2 data or instruction caches?

  • It depends on whether the items in a cache are indexed as virtual or physical (for lookup purposes).

L1/L2 data/instruction caches can store their information and be indexed by either virtual or physical addresses.

  • If physical, then MMU must first convert virtual to physical before the cache can be consulted. This is slow, but each entry is uniquely identifiable by its physical address.
  • If virtual, then cache can be consulted quickly to see if there’s a hit without invoking the MMU (if miss, then MMU must still be invoked).

A virtually indexed L1/L2 data/instruction cache introduces some problems:

  • Homonym problem: when a new process is switched in, it may use the same virtual address V as the previous process.
    • The cache that indexes just by virtual address V will return the wrong information (cached information from the prior process).

Some solutions to the homonym problem:

  • Flush the cache on each context switch.
    • The process gets the entire cache to itself, but the cache has to be rebuilt after each switch.
  • Add an address space id (process id) to each entry of the cache.
    • So only data/instructions for the right process are returned for a given virtual address V.
    • Requires hardware support and an extra comparison.
    • Reduces available cache space for each process, since it has to be shared.
  • Each process uses non-overlapping virtual addresses in its address space.
    • Unlikely in practice; it violates the model that each process is compiled and executes independently in its own address space [0, MAX].

In practice,

  • Most L1 caches are virtually indexed - fast.
  • Most L2 caches are physically indexed.
    • Each entry is unique.
    • No collisions.
    • This is good for code/data from shared library pages, i.e. if multiple processes share the same code/data, then it only has to be stored once in the cache.
  • The virtually indexed cache is essentially a small L1 cache, and the physically indexed cache is a much larger L2 cache.