System Calls: Processes
In a previous post, I pointed out some of the most important functions a kernel has to fulfill. System calls take care of two of these: virtualizing resources via virtual memory and processes, and mediating communication between user-mode processes and the hardware. We'll wrap up the former now by looking at the system call functions relating to processes and scheduling.
proc.c
fork
Unlike some of the other functions we'll talk about in this post, fork() is
used almost exclusively by user code as a system call; the kernel never calls
it. That said, it has an extremely important role: after the first process has
started, it's the only way to create more processes. It does that by copying the
parent process's virtual address space into a new page directory. We haven't
talked about the file system yet, but hopefully you're familiar with file I/O in
Linux, so you know each process has its own list of open files and a current
working directory; fork() will clone those as well for the child process.
Let's start off by getting a pointer to the parent process and creating a slot
in the process table for the child process with allocproc(). Remember, that
function returns a pointer to the new process's struct proc, but it can fail
and return null (e.g., if there is no available slot in the process table, or if
its call to kalloc() fails), so we'll need to check for that.
int fork(void)
{
// Parent process
struct proc *curproc = myproc();
// Allocate process table slot for child process
struct proc *np;
if ((np = allocproc()) == 0) {
return -1;
}
// ...
}
allocproc() also sets up the new process's stack so that it'll return into
forkret(), then trapret(), before context switching into user mode, and sets
the process's state to EMBRYO.
Next we need a page directory for the new child process; it should be a copy of
the parent process's page directory. Luckily, we already did the hard work for
this back in the virtual memory posts, so we can just use copyuvm() now. That
function can also fail, in which case we'll free the stack that allocproc()
created and set the child process's state back to UNUSED.
int fork(void)
{
// ...
if ((np->pgdir = copyuvm(curproc->pgdir, curproc->sz)) == 0) {
kfree(np->kstack);
np->kstack = 0;
np->state = UNUSED;
return -1;
}
// ...
}
Next we'll copy the parent process's size and trap frame; the latter will make
sure the child starts executing after trapret() with the same register
contents as the parent. We'll set the child process's parent to, well, its
parent (the current process).
int fork(void)
{
// ...
np->sz = curproc->sz;
np->parent = curproc;
*np->tf = *curproc->tf;
// ...
}
The two processes will be nearly identical, so we need a way to distiguish them
from user space so that a user program can give different instructions to each.
xv6 follows the Unix convention that fork() should return the child process's
PID to the parent and return 0 for the child. The parent's return value is easy;
we'll just literally return the child's PID at the end. But the child didn't
actually call fork(), so how can we set a return value that it will see?
Well, the x86 convention is for return values to be passed in the %eax
register, right? And that register will be restored from the trap frame before
switching into user mode. So we'll just store the value 0 there.
Next we'll copy all the parent process's open files and its current working
current working directory. The files are stored in a per-process file array
curproc->ofile of size NOFILE, so we can copy them over with the function
filedup() (which we'll see later). The current working directory is in
curproc->cwd and can be copied with idup().
int fork(void)
{
// ...
for (int i = 0; i < NOFILE; i++) {
if curproc->ofile[i]) {
np->ofile[i] = filedup(curproc->ofile[i]);
}
}
np->cwd = idup(curproc->cwd);
// ...
}
Then we'll copy the parent process's name with safestrcpy(), defined in
string.c. You
might be familiar with the C standard library funtion strncpy(); this function
is almost identical, except that unlike strncpy() it's guaranteed to nul-
terminate the string it copies. If you haven't seen this kind of thing before,
it's a fairly common practice to write your own safe wrappers for some of the C
standard library functions, especially the ones in string.h which are so often
error-prone and dangerous.
Finally, we'll set the child process's state to RUNNABLE and return its PID
for the parent.
int fork(void)
{
// ...
int pid = np->pid;
acquire(&ptable.lock);
np->state = RUNNABLE;
release(&ptable.lock);
return pid;
}
kill
This is one of the functions that can get called both by the kernel and as a system call. The kernel will use it to terminate malicious or buggy processes, and user code can use it as a system call to kill another process too.
We said before that killing a process immediately would present all kinds of
risks (e.g. corrupting any kernel data structures it might be updating, etc.),
so all we're gonna do is give it the ominous mark of death with the p->killed
field. Then the code in trap() will handle the actual murder the next time the
process passes through there.
The argument is a process ID number, so let's just iterate over the process table until we find a process with a matching PID; we'll return -1 if we don't find any.
int kill(int pid)
{
acquire(&ptable.lock);
for (struct proc *p = ptable.proc; p < &ptable.proc[NPROC]; p++) {
if (p->pid == pid) {
// ...
}
}
release(&ptable.lock);
return -1;
}
If we do find a matching process, then we'll set p->killed. Also, some of the
calls to sleep() will occur inside a while loop that checks if p->killed has
been set since the process started sleeping, so let's hasten the process's death
a little by setting its state to RUNNABLE so it'll wake up and encounter those
checks faster. There's no risk of screwing up by waking up a process too early,
since each call to sleep() should be in a loop that will just put it back to
sleep if it's not ready to wake up yet.
int kill(int pid)
{
// ...
for (struct proc *p = ptable.proc; p < &ptable.proc[NPROC]; p++) {
if (p->pid == pid) {
p->killed = 1;
if (p->state == SLEEPING) {
p->state = RUNNABLE;
}
release(&ptable.lock);
return 0;
}
}
// ...
}
sleep
The last post went over the basics of sleep() and wakeup(); they act as
mechanisms for sequence coordination or conditional synchronization, which
allows processes to communicate with each other by sleeping while waiting for
conditions to be fulfilled and waking up other processes when those conditions
are satisfied.
Processes can go to sleep on a channel or wake up other processes sleeping on a channel. In many operating systems, this is achieved via channel queues or even more complex data structures, but xv6 makes it as simple as possible by simply using pointers (or equivalently, integers) as channels; the kernel can just use any convenient address as a pointer for one process to sleep on while other processes send a wakeup call using the same pointer.
This does mean that multiple processes might be sleeping on the same channel,
either because they are waiting for the same condition before resuming execution
or because two different sleep()/wakeup() pairs accidentally used the same
channel. The result would be that a process might be woken up before the
condition it's waiting for has been fulfilled. We can solve that problem by
requiring every call to sleep() to occur inside a loop that checks the
condition; that way, if a process receives a spurious wakeup call before it
really should have been woken up, the loop will put it right back to sleep
anyway. We saw one example of this in the sys_sleep() function, in which the
while loop checked if the right number of ticks had passed.
A common concurrency danger with conditional synchronization in any operating system is the problem of missed wakeup calls: if the process that's supposed to send the wakeup call runs before the process that's supposed to sleep, it's possible that the sleeping process will never be woken up again. The problem is more general than just processes; it applies to devices too.
Imagine this scenario: a process tries to read from the disk; it'll check whether the data is ready yet and go to sleep (inside a while loop) until it is. If the disk gets to run first, then the process will just find the data ready and waiting for it, so it can continue on to use the data. If the process runs before the disk does, then it'll see the data isn't ready yet and sleep in a loop until it is; the disk will wake the process up once the data is ready.
But suppose they run at the same time, or in between each other. The process does its check and finds the data isn't ready, but before it can go to sleep, a timer interrupt or some other trap goes off and the kernel switches processes. Then the disk finishes reading and starts a disk interrupt that sends a wakeup call to any sleeping processes, but the process isn't sleeping yet. When the process starts running again later on, it'll go to sleep -- having already missed its wakeup call.
The problem is that the process can get interrupted between checking the
condition and going to sleep, right? So why don't we just disable interrupts
there with pushcli() and popcli()? add a lock there? Ah, but there's another
problem: what if the disk driver is running simultaneously on another CPU?
Disabling interrupts on the process's CPU wouldn't stop the other CPU from
sending the disk's wakeup call too early.
Okay fine, so let's use a lock instead. The process will hold the lock while it checks the condition and sleeps, and the disk driver will have to acquire the lock before it can send its wakeup call... Can you see the problem here? If the process holds the lock while it's sleeping, the disk driver will never be able to acquire the lock in order to wake it up. That's a deadlock.
HEAD. DESK.
Ugh, okay, fine, you got me. So let's use a lock, but let's have sleep()
release it right away, then reacquire it before waking up; that way the lock
will be free while the process is sleeping so the disk driver can acquire it.
Done, right? Everybody's happy?
Nope. Now we're back to the original problem: if the lock gets released inside
sleep() before the process is actually sleeping, then the wakeup call might
happen in between those and get missed.
@*#&@#\(**&@#%\)!!!
So we need a lock. And we can't hold the lock while sleeping, or we'd get a deadlock. But we also can't release it before sleeping, or we might miss a wakeup call. So... ???
See, I told you: concurrency is your worst nightmare. Ever since we decided we'd
like our operating systems to do more than run a single basic process at a time,
we introduced all kinds of problems we have to reason through. Let's check out
how xv6 actually writes the sleep() function and think through it ourselves
and try to understand if it manages to solve this problem.
We'll start by making sure of two things: (1) this CPU is currently running a process and not the scheduler (which can't ever go to sleep), and (2) the caller passed in a lock (which can be any arbitrary lock).
void sleep(void *chan, struct spinlock *lk)
{
struct proc *p = myproc();
if (p == 0) {
panic("sleep");
}
if (lk == 0) {
panic("sleep without lk");
}
// ...
}
Next we need to release the lock and put the process to sleep. That will require modifying its state, so we should now acquire the lock for the process table. But if the lock that the process is already holding is the process table lock, then trying to acquire it again would cause a panic, so let's add a check for that; if we're already holding it then we'll keep using it and we don't need to release it.
void sleep(void *chan, struct spinlock *lk)
{
// ...
if (lk != &ptable.lock) {
acquire(&ptable.lock);
release(lk);
}
// ...
}
Okay, now it's nap time for this process. We just update its channel to chan
and its state to SLEEPING, then call sched() to perform a context switch
into the scheduler so it can run a new process. We have to be holding the
process table lock before calling sched(), remember?
void sleep(void *chan, struct spinlock *lk)
{
// ...
p->chan = chan;
p->state = SLEEPING;
sched();
// ...
}
When the process wakes up later on (if indeed it turns out that the code here works and doesn't miss any wakeup calls), it'll eventually be run by the scheduler, at which point it will context switch back here. So at that point we'll reset its channel and reacquire the original lock before returning.
void sleep(void *chan, struct spinlock *lk)
{
// ...
p->chan = 0;
if (lk != &ptable.lock) {
release(&ptable.lock);
acquire(lk);
}
}
Okay, well I don't know about you, but I'm still not convinced that this
implementation won't miss any wakeup calls. After all, we release the original
lock before putting the process to sleep, right? We're holding the process table
lock at that point, which at least means that interrupts are disabled, but the
process that will wake this one up might already be running on another CPU and
might send the wakeup signal in between releasing the original lock and
updating this process's channel and state. Hmm... Well, as always, xv6 is
brilliant, so we'll see how this gets solved in the code for wakeup().
But wait! Before we move on, I have a warning for you about using this function in your own code when you start hacking away at xv6. Remember that when we first talked about deadlocks, we saw we can cause a deadlock if two processes acquire two locks in opposite orders? If process 1 tries to acquire lock A, then lock B, and process 2 simultaneously tries to acquire lock B, then lock A, then the end result is that process 1 will acquire lock A and process 2 will acquire lock B, but neither will be able to acquire the other lock since it's already being held.
If you look at the code above, the process that called sleep() must have
already been holding a lock lk, then sleep() acquires ptable.lock before
releasing lk. You know what that means: there's potential for a deadlock. So
in order to avoid that, you should make sure that any lock you pass in to
sleep() must always get acquired before ptable.lock. If any other function
(or chain of function calls) could potentially acquire ptable.lock before lk,
then you might end up with a deadlock. As always, the xv6 authors have been
extremely careful to make sure that that never happens in the existing code, so
you'll have to do the same thing for any code you add.
wakeup
This function is short and sweet because it procrastinates all the work it has
to do by pushing it off to a helper function, wakeup1(). It just acquires the
process table lock, calls wakeup1(), then releases the process table lock. It
has to grab that lock since it's gonna modify the process's state in the process
table.
wakeup1() in order to let processes that are already holding the process table
lock send wakeup calls too.
Okay, now before we go look at wakeup1(), let's get back to figuring out
whether xv6's implementation of sleep() and wakeup() can lead to missed
wakeup calls. Take a look at the code in sleep() again where the original lock
gets released -- we have to acquire the process table lock before we can
release the other lock. So now there are always two locks in play whenever we
use sleep() and wakeup().
Let's go back to the example of a process waiting on a disk read. The process
acquires some disk-related lock first, then checks to see if the disk is done
reading; if not, it'll call sleep() inside a while loop. If the disk driver
runs now before the process gets to call sleep(), that's okay: the disk driver
also has to acquire the same lock before calling wakeup(), so the disk would
just end up spinning idly. Eventually, the process runs again and gets to
call sleep(); there, it will first acquire the process table lock before
releasing the original disk-related lock.
So what happens if the disk driver's code runs now? Now the disk would be able
to acquire the original lock, so there's nothing stopping it from calling
wakeup(). But the very first thing it has to do there is acquire the process
table lock, which the process is already holding, so it just spins idly again!
There's no way the disk driver could ever beat the process to acquiring this
second lock, because the process already held the first (disk-related) lock
before acquiring the second one (the process table lock). Now the process can
finish going to sleep and switch into the scheduler, which will eventually
release the process table lock. So then the disk driver can acquire it, release
the first lock, and finally send its wakeup call.
Moral of the story? There's no way for xv6 to ever have any missed wakeup calls! The trick was to use two locks, and acquire the second before releasing the first. But coming up with that solution isn't as easy as saying "oh, just use two locks!" The solution only works because of the way the process table lock is already being handled by so many other parts of the kernel code. For example, if the context switch into the scheduler wasn't guaranteed to release the process table lock, then the disk driver in the example would never be able to acquire it after the process goes to sleep, resulting in a deadlock. The solution works because of all the design decisions in xv6 up to this point.
wakeup1
Okay, I'll stop fawning over the intricacies of xv6 concurrency management now
so we can look at how wakeup calls actually happen. Remember, this is a separate
function from wakeup() because sometimes the scheduler needs to send a wakeup
call while it's already holding the process table lock. So we're gonna assume
that every function that ever calls this is already holding it.
The implementation here is actually pretty simple now: we'll just iterate over
the process table and set every single process that's sleeping on channel
chan to RUNNABLE.
static void wakeup1(void *chan)
{
for (struct proc *p = ptable.proc; p < &ptable.proc[NPROC]; p++) {
if (p->state == SLEEPING && p->chan == chan) {
p->state = RUNNABLE;
}
}
}
Now, there might be multiple processes sleeping on this channel, so this will
wake them all up. For some of those processes, this might be a spurious wakeup,
so again, we should always make sure to call sleep() in a loop that checks for
some condition to be satisfied. Even if multiple processes do have their
sleep conditions satisfied, they'll have to reacquire their original lock before
returning out of sleep(), so only one of them will do so and the others will
spin until the first one is done.
Why not just wake up the first process we find that's sleeping on chan? Then
we could avoid the extra overhead of a bunch of processes waking up, checking a
condition, and going back to sleep, or even spinning idly waiting to reacquire
the lock before returning. The issue is that the channels may not be unique, so
there's no way to know which of all the sleeping processes is the one whose
sleep condition has just been fulfilled. If we wake up the wrong process, it'll
just go back to sleep, but the right process didn't wake up, so that means we've
lost a wakeup call.
exit
Okay, so we saw above that kill() doesn't really kill a process immediately;
it shirks that responsibility and lets exit() handle it instead... except
even exit() won't really fully kill a process. Whew, a process's death just
keeps getting dragged out forever, doesn't it? It's starting to feel like a
cheesy death scene in a tragedy; I bet the process is tired of suffering the
slings and arrows of outrageous fortune by now.
But it does make sense. Think about what we have to do in order to wrap up a process and recycle its slot in the process table: we have to close out any open files and reset its current working directory, free its kernel stack and its entire page directory, then notify the parent that it's done running.
The trouble comes with freeing the kernel stack and process page directory. This
function runs in kernel mode, so while the user stack in the lower half of
memory will be unused now, the kernel stack is still needed in order to keep
executing the instructions for exit(). Also, with the exception of the times
when it's running the scheduling algorithm, the kernel uses the page directory
of the current process. The moment we free that page directory, the very next
memory access will be to an invalid page; the CPU would trigger an exception
then. That exception would eventually get routed to exit() again, except, oh
wait, we can't even run any instructions without generating another exception,
because the entire page directory and stack have been freed; that's a double
fault. So then the CPU would try to handle that exception, which would cause
the dreaded boogeyman of OS devs around the world: a triple fault. After a fault
triggers a second exception, which itself triggers a third exception, the CPU
just decides that the kernel in its current state doesn't have its shit together
enough to keep running, so it takes over and reboots the whole system. Oops.
Okay, so let's not do that. That means we can't free the kernel stack nor the
page directory until we're running on a different stack/page directory combo.
That could happen in scheduler() while we're using the page directory kpgdir,
or it could happen while we're running another process. xv6 does it while it's
running the parent process, in the wait() system call. If you haven't used
that in Linux before, wait() lets a parent process sleep until a child process
is done running. xv6 will use wait() to finish cleaning up after an exited
child process too.
Now, the very first process that starts running in xv6 (initproc, which loads
and runs the shell) obviously has no parent process, but that's okay because
that one should never exit as long as the system is up. So let's start this
function off by making sure that the process that's exiting isn't the initial
process.
void exit(void)
{
struct proc *curproc = myproc();
if (curproc == initproc) {
panic("init exiting");
}
// ...
}
Next we'll close all open files and clear the current working directory; again, we haven't seen the file system functions used here, but we'll get to them soon.
void exit(void)
{
// ...
// Close all open files
for (int fd = 0; fd < NOFILE; fd++) {
if (curproc->ofile[fd]) {
fileclose(curproc->ofile[fd]);
curproc->ofile[fd] = 0;
}
}
// Clear the current working directory
begin_op();
iput(curproc->cwd);
end_op();
curproc->cwd = 0;
// ...
}
Now we only have one thing left to do: notify the parent process that this
process has exited. If the parent process is currently sleeping in wait(),
then we'll need to wake it up. But maybe the parent process is currently in the
middle of executing other code before it gets to wait(); we don't want it to
miss the wakeup call... oh wait, but that's okay, remember? The implementations
of sleep() and wakeup()/wakeup1() guarantee that we can't miss a wakeup
call as long as we're holding the right lock; wait() will use the process
table lock for that. So let's acquire it now and send a wakeup call.
Now, remember that a sleeping process needs to check some condition in a loop;
how can the parent process know that the child has exited? Hmm, okay, let's set
the child's state to ZOMBIE. That'll also prevent the scheduler from trying to
run it again.
Ah, but hang on a sec... what if the parent process has itself been killed, i.e.
the current process has been orphaned? (Again with the melodrama...) A process
can't run any more user code after exit(), so an undead parent process would
never get to call wait() to clean up after its children. In that case, we'd
have to find another process that could adopt a child.
So let's just solve that problem now: this process is about to shuffle off its
mortal coil, so let's figure out if it has any children and pass them off to
another process that can keep raising them as its own. But which process is
guaranteed to live long enough to clean up after those children once they die?
Ah, initproc, of course! That first process is immortal, so it should be able
to look after any children that this process might leave behind after it makes
its quietus with a bare bodkin.
So we'll iterate over the process table, looking for any processes with parent
process equal to curproc; if we find any, we'll have initproc adopt them.
If any of our now-abandoned children has already exited before we did, we'll
send a wakeup signal to initproc too in case it's sleeping in wait().
void exit(void)
{
// ...
for (struct proc *p = ptable.proc; p < &ptable.proc[NPROC]; p++) {
if (p->parent == curproc) {
p->parent = initproc;
if (p->state == ZOMBIE) {
wakeup1(initproc);
}
}
}
// ...
}
Okay, now it's finally time for this process to find out what dreams may come in
that sleep of death. We'll set its state to ZOMBIE and context-switch into the
scheduler, never to return; if something goes wrong and the scheduler does
return, we'll panic in order to keep this function from returning into user code
again.
wait
Like we said above, this system call lets a parent process wait for a child process to exit; it also cleans up after the child process has exited.
First, we don't even know if this process has any children, so we'll have to check by iterating through the process table and checking each process's parent to see if it matches the current process. If it does, then we'll check if it's a zombie, in which case we can clean it up and return its process ID.
We should also deal with two edge cases: first, if the process has no children
at all, and second, if the process does have children but none of them are dead
yet. In the first case, we'll just return -1 to report failure; in the second
case we'll put the current process to sleep until one of its children exits. The
sleep() call means we'll have to do these checks inside an infinite loop.
Alright, let's get started by getting the current process and acquiring the process table lock, then starting an infinite loop.
Inside the loop, we'll use a variable havekids as a boolean to track whether
we've found any child processes. Then we can iterate over the process table,
skipping any processes for which the current process is not the parent. If we
find any children, we'll set havekids to 1.
int wait(void)
{
// ...
for (;;) {
int havekids = 0;
for (struct proc *p = ptable.proc; p < &ptable.proc[NPROC]; p++) {
if (p->parent != curproc) {
continue;
}
havekids = 1;
// ...
}
// ...
}
}
If we did find a child process, we should check if it's a zombie, in which case
it's time to finish its clean-up. That means freeing its kernel stack and its
page directory and recycling its struct proc so that it can be reallocated to
another process later on.
int wait(void)
{
// ...
for (;;) {
// ...
for (struct proc *p = ptable.proc; p < &ptable.proc[NPROC]; p++) {
// ...
if (p->state == ZOMBIE) {
int pid = p->pid;
// Free child's kernel stack
kfree(p->kstack);
p->kstack = 0;
// Free child's page directory
freevm(p->pgdir);
// Recycle child's struct proc
p->pid = 0;
p->parent = 0;
p->name[0] = 0;
p->killed = 0;
p->state = UNUSED;
release(&ptable.lock);
return pid;
}
}
// ...
}
}
Now, if havekids is still zero by the time we finish the for loop, that means
the process doesn't have any children, so we should report failure. We'll also
check if the process has been marked as killed in the meantime.
int wait(void)
{
// ...
for (;;) {
// ...
if (!havekids || curproc->killed) {
release(&ptable.lock);
return -1;
}
// ...
}
}
Finally, if it does have children, but none of them have exited yet, we'll put the process to sleep. It'll get woken up when a child exits, at which point it'll restart the outer for loop at the top and start looking through the process table again.
Summary
By now, we've looked at a good chunk of the system calls available in xv6. These
system calls wrap up the mechanisms that xv6 uses to create and exit processes
with fork(), kill(), exit(), and wait(), and introduced sleep() and
wakeup() as a means for (limited) inter-process communication.
So what's left now? The rest of the kernel code we're gonna look at will just focus on communicating with various hardware devices like the serial port, console, and keyboard. Those drivers are relatively short, but there's one device that will require a lot more work: the disk. Storing files on disk and making sure they persist across reboots require careful planning, and making files conveniently accessible to users requires an entire system of abstractions layered on top of each other, along with a whole host of file-related system calls.