More Paging: The Kernel Side
We've already talked plenty about virtual memory, and I bet you're probably so
over entrypgdir by now; let's wrap up its story and get rid of it!
The vm.c file is HUGE; only proc.c and sysfile.c match its length. Some parts deal with the general paging implementation; we'll look at those here. The rest handles the details of paging for processes and user code, we'll need to know a bit more about processes in xv6 for that.
vm.c
After the include directives for the preprocessor, we have a declaraction for an external symbol defined in kernel.ld. This one is the beginning of the data section for the kernel.
Next we have a definition for a pointer to a global page directory: this is the
fancy new one that's gonna replace entrypgdir. Note that pde_t is a type for
page directory entries defined in types.h; it's just a type alias for int.
seginit
This first function gets called directly by the kernel's main(); it sets up
the segment descriptors in the GDT as identity maps to all of memory so that we
can ignore them from now on. Wait, didn't we already do that in the boot loader?
Yes, kind of, but that was before the kernel took over, so back then we had no notion of kernel space versus user space. Now that we do, we want to set the permission flags for each segment so that we can use the privilege ring levels, with the kernel in ring 0 and user code in ring 3. That way any misbehaving user code will get slapped with a segmentation fault the way we've all come to know and love in C.
We also have some permission flags for protection in the page directory and page table entries, so maybe we could get away without it? I mean, both kernel code and user code are read-only anyway, so maybe they could both have a Descriptor Privilege Level of 3. But no, x86 is gonna shut that right down by forbidding interrupts that take you from ring level 0 to ring level 3, so all the interrupt handler functions have to be in kernel space with a kernel code segment selector at ring level 0.
So we're just gonna have to do it all over again. Great. Well, maybe it's not too bad, let's take a look... oh god, it's awful. Okay, deep breath.
Each processor has its own GDT, so we're gonna need to call this function once
per CPU. First we figure out which CPU we're on with with the cpuid() function
that we'll see later on; for now it... (drumroll)... gets the CPU's ID. Then we
look that up in a global table of CPUs (there's an extern declaration for this
in the included proc.h) and store it in a struct cpu; we saw that before in
the spin-lock code, but we'll get around to talking about it more later.
That struct cpu has a field to hold the GDT, so we're gonna add entries for
the kernel code, kernel data, user code, and user data segment descriptors;
those entries are SEG_KCODE, SEG_KDATA, SEG_UCODE, and SEG_UDATA,
respectively. Recall that the permission bits are STA_X (executable), STA_R
(readable), and STA_W (writeable); now we're gonna pile on the descriptor
privilege levels for the kernel (0) and user (3, or DPL_USER) on top. Besides
those ring levels, we want to ignore segmentation, so each segment should be an
identity map for all virtual memory from 0 to 4 GB (0xffff_ffff).
void seginit(void)
{
// ...
c->gdt[SEG_KCODE] = SEG(STA_X|STA_R, 0, 0xffffffff, 0);
c->gdt[SEG_KDATA] = SEG(STA_W, 0, 0xffffffff, 0);
c->gdt[SEG_UCODE] = SEG(STA_X|STA_R, 0, 0xffffffff, DPL_USER);
c->gdt[SEG_UDATA] = SEG(STA_W, 0, 0xffffffff, DPL_USER);
// ...
}
SEG macro used here and the SEG_ASM one from
the boot loader is that this one is for C code and the other is for assembly.
Finally, we load up the new GDT into the processor with a C wrapper for the
x86 instruction lgdt.
Done with segmentation, now on to more paging.
walkpgdir
A page directory lets the paging hardware convert virtual addresses to physical
ones, but we're gonna need those mappings in the kernel too while we set up the
page directory, so this function does the conversion manually. Wait, but aren't
we setting up paging so that all of physical memory is mapped in the higher half
of the virtual address space? Can't we just add or subtract KERNBASE to do the
conversion? Well, that would work for kernel virtual addresses, but user virtual
addresses actually will use page directories and page tables in a non-obvious
way, so if we want to figure out where those go, we'll need a function for it.
In C, using the static keyword before a function limits its scope and makes it
visible only within its own file. The function returns a pte_t *, a pointer to
a page table entry (the type is defined in mmu.h as a type alias for uint).
Its arguments are a pointer to a page directory, a virtual address, and alloc
(a boolean variable, but as an int instead of bool). This alloc lets the
function play a dual role: if it's set, the function will allocate a page table
if needed; otherwise it reports failure if the page table doesn't exist. The
const keyword lets the compiler know a variable shouldn't be mutated so it'll
throw an error if we do. Here, const void *va is a pointer to a constant value
of any type; the address the pointer holds might change, but we can never write
to that address. The opposite is a void *const va: the address being pointed
to will never change, but we can overwrite the contents of that address all we
want. You can combine the two with const void *const va. What's that I hear? C
syntax is the worst? No, never...
Remember way back when, when we talked about how "linear" addresses are set up
and converted to physical ones? The first 10 bits are an index for the page
directory to pick a page directory entry, which points to a page table; the next
10 bits pick a page table entry that points to a page, and the last 12 bits are
an offset within that page; the PDX() and PTX() macros get first 10 bits and
the next 10 bits from a linear address, respectively. So we start by getting the
page directory index and using that to get the page directory entry.
static pte_t *walkpgdir(pde_t *pgdir, const void *va, int alloc)
{
pde_t *pde = &pgdir[PDX(va)];
// ...
}
Okay, so now pde points to a page directory entry which has two parts: a
pointer to the physical address of a page table, and some flags. But who knows
if this page table even exists; most page directory (and page table) entries
aren't mapped in order to save space. So we have to check whether *pde has the
PTE_P (present) flag set.
static pte_t *walkpgdir(pde_t *pgdir, const void *va, int alloc)
{
// ...
pte_t *pgtab;
if (*pde & PTE_P) {
// ...
} else {
// ...
}
// ...
}
If the page table exists, we should get rid of the flags and recover the pointer
to the page table using the PTE_ADDR() macro. But the hardware uses physical
addresses for these pointers, so we need to convert it to a virtual address
first, which is what this function does... recursion? Bootstrap problem? No,
it's actually easy because we can access the page table from within the kernel's
virtual address space in the higher half by adding KERNBASE to the physical
address with the P2V() macro.
static pte_t *walkpgdir(pde_t *pgdir, const void *va, int alloc)
{
// ...
pte_t *pgtab;
if (*pde & PTE_P) {
pgtab = (pte_t *) P2V(PTE_ADDR(*pde));
} else {
// ...
}
// ...
}
Now for the else clause, which happens if the page directory entry doesn't have
the PTE_P bit set. Well, if the boolean alloc is false (zero), then we're
done and we should just report failure by returning a null pointer. On the other
hand, if it's true, we just allocate a page for the page table. But wait,
remember how page allocation might fail and return a null pointer if we're out
of free pages in the free list? And remember how I said we should always check
for that? Okay well let's check for that; if allocation fails, we also return a
null pointer. Oh, and because this is C, we're gonna do a jillion things at once
in a single line: check if alloc is false, try to allocate a page table, and
check if that allocation failed. C lets us assign to a variable and then test
that variable's value in a single statement.
static pte_t *walkpgdir(pde_t *pgdir, const void *va, int alloc)
{
// ...
pte_t *pgtab;
if (*pde & PTE_P) {
// ...
} else {
if (!alloc || (pgtab = (pte_t *) kalloc()) == 0) {
return 0;
}
// ...
}
// ...
}
Okay, so now suppose: (1) the page table wasn't present, (2) alloc was set, and
(3) we successfully allocated a page. Now what? Remember how we filled all free
pages with garbage in kfree() using memset()? Let's undo that now by zeroing
it.
static pte_t *walkpgdir(pde_t *pgdir, const void *va, int alloc)
{
// ...
pte_t *pgtab;
if (*pde & PTE_P) {
// ...
} else {
// ...
memset(pgtab, 0, PGSIZE);
// ...
}
// ...
}
Now we'll update the page directory entry to point to this new page table and
add the PTE_P flag so it knows it's present. Wait, while we're at it, what
other permissions will it need? Is it writeable? Can users access it? Hmm, we'd
have to know whether we're looking up a user virtual address or a kernel one,
and whether it's gonna be used for code or data. Ah, screw it, we'll just throw
all the flags on there at once. Either way, the page table entries will have
their own flags too, so we can restrict the page's permissions there instead of
here at the page directory entry.
static pte_t *walkpgdir(pde_t *pgdir, const void *va, int alloc)
{
// ...
pte_t *pgtab;
if (*pde & PTE_P) {
// ...
} else {
// ...
*pde = V2P(pgtab) | PTE_P | PTE_W | PTE_U;
}
// ...
}
Finally, we return the address of the corresponding page table entry using the
index from the middle bits of va:
mappages
Okay, so walkpgdir() returns a pointer to a page table entry and can even
crate a page table if if it doesn't exist. That's not quite enough to add new
mappings for pages though; the page itself might not be mapped, and if we just
created a new page table, then certainly none of the pages are mapped yet.
mappages() will finish the job by installing mappings in page tables (possibly
newly-allocated ones) for a range of virtual addresses.
The arguments are a page directory, a virtual address for the beginning of the range, the size of the range, a physical address to map it to, and the flags for permissions we want to set. We start off by rounding the virtual address down to the nearest page boundary and getting a pointer to the end of the range, also page-aligned.
static int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm)
{
char *a = (char *) PGROUNDDOWN((uint) va);
char *last = (char *) PGROUNDDOWN(((uint) va) + size - 1);
// ...
}
Now we're gonna iterate over the pages in that range; for (;;) is a common C
idiom for an infinite loop. In this case, we need to increment a and pa by
PGSIZE each time, and we'll break out of the loop when a reaches last. To
be completely honest, I'm not really sure why the authors chose to write this as
an infinite loop with the condition/break statement and update statements inside
the loop rather than as a regular old for loop; I think the latter would be more
clear, but oh well, I didn't write this.
static int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm)
{
// ...
for (;;) {
// ...
if (a == last) {
break;
}
a += PGSIZE;
pa += PGSIZE;
}
// ...
}
Inside the for loop, we'll start each iteration by looking up the right page
table entry with walkpgdir(), with alloc set to true. Remember how that
function called kalloc(), which might fail, in which case it returns a null
pointer? Well that means we've gotta check for a null pointer here too. This
time however, we'll return -1 for failure and 0 for success, because why not?
static int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm)
{
// ...
for (;;) {
pte_t *pte;
if ((pte = walkpgdir(pgdir, a, 1)) == 0) {
return -1;
}
// ...
}
return 0;
}
We're supposed to be allocating brand-new pages for this range of addresses, so if a page has already been allocated, we'll just flip out in rage and panic.
static int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm)
{
// ...
for (;;) {
// ...
if (*pte & PTE_P) {
panic("remap");
}
// ...
}
// ...
}
The last thing before checking the loop condition and updating a and pa is
to install the mapping to the right physical address with the right permissions
in the page table. Then we're done!
static int mappages(pde_t *pgdir, void *va, uint size, uint pa, int perm)
{
// ...
for (;;) {
// ...
*pte = pa | perm | PTE_P;
// ...
}
// ...
}
Cool, now we have a way to map new pages into a page directory. We're well on
our way to leaving poor old entrypgdir behind for the shiny new kpgdir.
kmap
Each process is gonna have its own page directory, so its mappings in the lower
half of the virtual address space might be totally different from those of
another process. But the mappings in the higher half (where the kernel lives)
will always be the same -- that way, the kernel can always use the existing page
directory for whatever process it happens to be running. We'll only use kpgdir
when the kernel isn't currently running a process, e.g. while it's running the
scheduler.
So when we create a new process, we'll need to copy in all the mappings that the
kernel expects to find into a fresh page directory for that process. Those are:
memory-mapped I/O device space from physical address 0 to 0x10_0000 (the boot
loader is also here, but we don't need it any more), kernel code and read-only
data from 0x10_0000 to the physical address of data (one of the symbols
defined in kernel.ld), kernel data the rest of physical memory from there to
PHYSTOP, and more I/O devices from 0xFE00_0000 and up. Each of these ranges
needs its own permissions too.
We'll represent each of these mappings with a struct kmap, which has fields
for the starting virtual address, the starting and ending physical addresses,
and the permissions; then the mappings will get stored in a static global
variable kmap... oh come on, what fresh hell is THIS?
static struct kmap {
void *virt;
uint phys_start;
uint phys_end;
int perm;
} kmap[] = {
{ (void *)KERNBASE, 0, EXTMEM, PTE_W },
{ (void *)KERNLINK, V2P(KERNLINK), V2P(data), 0 },
{ (void *)data, V2P(data), PHYSTOP, PTE_W },
{ (void *)DEVSPACE, DEVSPACE, 0, PTE_W }.
};
Okay, there are a few things going on here. First, the static keyword for a
variable means that variable has a single fixed location in memory that it's
never gonna move out of.
Then it does that thing again where we simultaneously define a struct type and
define a variable of that type. So the type is
So then the static global variable kmap is an array of struct kmaps. I guess
we ran out of names or something. The array has four entries, and since each one
is a struct, it needs curly braces around it.
The first entry (for the lower of the two memory-mapped I/O device regions) has
a virt field of KERNBASE, a phys_start field of 0, a phys_end field of
EXTMEM (defined as 0x10_0000), and permission flag PTE_W. So it maps a
virtual address range starting at KERNBASE to the physical address range from
0x0 to 0x10_0000 and makes it writeable so we can communicate with the devices
there. The next two entries are similar, except that the kernel code isn't
writeable.
The last entry has phys_start of 0xFE00_0000 and a phys_end of 0. That's a
little strange, but it's because we want to map all the way up to the end of the
virtual address space at 0xFFFF_FFFF. The end should be one byte past that, but
it's impossible to represent 0x1_0000_0000 with 32 bits. Setting the end to 0
makes the size calculation (phys_end - phys_start) work out nicely: it'll just
overflow to the right number. This is okay since we're using unsigned integers,
but note that signed integer overflow is undefined behavior and thus VERY BAD
and the cause of many security vulnerabilities.
Okay, back to getting rid of entrypgdir!
setupkvm
This function sets up a fresh new page directory with all the mappings in kmap
in order to please the kernel when it encounters the page directory. So needy,
right?
It takes no arguments and returns a pointer to the new page directory. First, let's allocate a page of memory to hold the new directory. We'll be good and remember to check for null (in which case we return null too) and clear the page of the garbage values we wrote when we freed it.
pde_t *setupkvm(void)
{
pde_t *pgdir;
if ((pgdir = (pde_t *) kalloc()) == 0) {
return 0;
}
memset(pgdir, 0, PGSIZE);
// ...
}
The upper end of virtual memory after DEVSPACE has I/O devices, so PHYSTOP
should be below that; this is as good a place as any to make sure.
pde_t *setupkvm(void)
{
// ...
if (P2V(PHYSTOP) > (void *) DEVSPACE) {
panic("PHYSTOP too high");
}
// ...
}
Finally, we'll add all the mappings in kmap above into this page directory so
the kernel is happy. We'll use mappages(), which returns -1 if it fails, so
we should check for that. The freevm() function is defined below, and we'll
get to it soon, but for now just know that it gets rid of all the mappings we
just made, in case any of them fails.
pde_t *setupkvm(void)
{
// ...
struct kmap *k;
for (k = kmap; k < &kmap[NELEM(kmap)]; k++) {
if (mappages(pgdir,
k->virt,
k->phys_end - k->phys_start,
(uint) k->phys_start,
k->perm) < 0) {
freevm(pgdir);
return 0;
}
}
return pgdir;
}
k is a pointer to a struct kmap, and kmap
is an array of struct kmaps; in C, arrays decay to pointers, so they have the
same type. k starts off pointing to the first (zero) entry of kmap. Then
incrementing it with k++ shifts its value by the size of a struct kmap, so
it'll point to the next entry. The loop stops when k points beyond the last
entry of kmap, as determined by the NELEM() macro which counts the number of
entries in an array. Note that array element-counting only works in C if the
array is defined in the same function or as a global variable in the same file,
which is why it's so easy to do an out-of-bounds read or write in C (yet another
common security vulnerability).
Finally, if everything worked out okay, we return a pointer to the new page directory.
switchkvm
We said above that the kernel would usually just use the page directory of the
currently-running process, but it'll use kpgdir when no process is running,
i.e. during the kernel setup and while it's scheduling a new process. So we need
a way to tell the paging hardware to load kpgdir into register %cr3, which
holds a pointer to the page directory. That's this function.
It's a one-liner: get the physical address of kpgdir and stick it in %cr3
with the assembly instruction lcr3.
kvmalloc
FINALLY, we're here! We're gonna get rid of entrypgdir! The kernel's main()
calls this function right after kinit1().
We already did all the hard work, so this one's a breeze: we call setupkvm()
to allocate a new page directory and fill it with the kernel's mappings, then
call switchkvm() to load it into the paging hardware.
And we're DONE! Take that, entrypgdir, we don't need you anymore. We're big
kids now.
Summary
So far, it's been a serious odyssey just to move from no paging in the boot
loader, to super basic paging with entrypgdir in entry.S, to kpgdir now.
Along the way, we've looked at code to allocate and free pages and install new
mappings in page directories and page tables. That'll come in handy when we look
at processes next; the virtual memory story still isn't over.
Also, note that kpgdir still isn't at the height of its powers: at the point
when main() calls kvmalloc(), the free list only contains pages for physical
memory between 0 and 4 MB. The rest will have to wait until kinit2() unleashes
its full potential. (Maybe some self-actualization seminars would help...)