diff --git a/assets/img/running-bison.jpg b/assets/img/running-bison.jpg
new file mode 100644
index 0000000..2eba41d
Binary files /dev/null and b/assets/img/running-bison.jpg differ
diff --git a/content/blog/MOP2/mutex-sync.adoc b/content/blog/MOP2/mutex-sync.adoc
new file mode 100644
index 0000000..b745228
--- /dev/null
+++ b/content/blog/MOP2/mutex-sync.adoc
@@ -0,0 +1,628 @@
= Mutexes and process suspension in kernels
Kamil Kowalczyk
2026-01-30
:jbake-type: post
:jbake-tags: MOP2 osdev
:jbake-status: published

Hello!

In this blog post I would like to share my thoughts and some implementation details about the *mutex*
mechanism in *MOP3*!

WAIT! MOP *3*?

Yes! MOP2 is undergoing a huge rewrite and redesign right now, due to me not being happy with **some**
design choices of the system. This is mostly about writing SMP-first/SMP-friendly code. The rewrite
turned out to be so big that it was just easier to start a new project, while reusing big portions
of older MOP2 code - and thus MOP3 was born. Expect more interesting updates in the future!

New GIT repository: https://git.kamkow1lair.pl/kamkow1/MOP3

== Multitasking is delicate

=== Preface

This portion of the article will explain critical sections and why they're so fragile. If you're an
advanced reader, you can easily skip it.

Writing multitasked/parallel code is notoriously difficult and can lead to bizarre, awful and
nondeterministic bugs. Keep in mind that multitasked code is not something you can debug with normal
developer tools like `gdb` or `lldb`, or other crash analyzers and code inspectors. Instead, such code
should be proven safe/correct, either by carefully reading it or by writing formal proofs.

Formal proofs can be written using the Promela modeling language: https://spinroot.com/spin/whatispin.html

=== The problem

Imagine this problem in a multitasked application:

.A disaster happens here
[source,c]
----
struct my_list_node {
    struct my_list_node* next, *prev;
    int value;
};

/* These are complex operations on multiple pointers */
#define LIST_ADD(list, x) ...
#define LIST_REMOVE(list, x) ...

struct my_list_node* my_list = ...;

void thread_A (void) {
    struct my_list_node* new_node = malloc (sizeof (*new_node));
    new_node->value = 67;

    LIST_ADD (my_list, new_node);
}

void thread_B (void) {
    LIST_REMOVE (my_list, my_list); /* remove head */
}
----

The issue here is that, while `thread_A` is adding a new node, `thread_B` can try to remove the list head
at the very same moment. This will lead to pointer corruption, since `LIST_REMOVE` and `LIST_ADD` will try
to re-link the list's nodes at the same time.

Such fragile sections of code are called "critical sections", meaning that one has to be VERY careful when
tasks enter them. So where are they in the code?

[source,c]
----
void thread_A (void) {
    /* NOT CRITICAL - new_node is not yet visible to other tasks */
    struct my_list_node* new_node = malloc (sizeof (*new_node));
    new_node->value = 67;

    /* CRITICAL */
    LIST_ADD (my_list, new_node);
}

void thread_B (void) {
    /* CRITICAL */
    LIST_REMOVE (my_list, my_list); /* remove head */
}
----

=== So how do we solve this?

This specific problem can be easily solved by what's called "Mutual Exclusion", or "MutEx" for short.

Why "mutual exclusion"? As you can see, to modify the list, a task needs *exclusive* access to it, i.e.
nobody else can be using the list at the same time.
Kind of like a bathroom - imagine someone happily walks
in while you're sitting on a toilet and tries to fit right beside you.

==== A poor man's mutex and atomicity

Let's park it here and move to a slightly different, but related topic - atomicity. What is it?

To be atomic is to be indivisible. An atomic read or write is an operation that can't be split up
and has to be completed from start to finish without being interrupted.

While atomic operations are mutually exclusive, they work at the level of individual reads/writes of
integers of various sizes. One can atomically read/write a byte or a `uint64_t`. This of course won't solve
our problem, because we're operating on many integers (here pointers) simultaneously - `struct my_list_node`'s
`prev` and `next` pointers - but despite that, we can leverage atomicity to implement higher-level
synchronization constructs.

BTW, documentation on atomic operations in C can be found here: https://en.cppreference.com/w/c/atomic.html.
The functions/macros are implemented internally by the compiler, e.g. the GCC atomic builtins: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html.
They have to be implemented internally, because they're very platform-dependent and each architecture may
handle atomicity slightly differently, although the same rules apply everywhere.

Now that we've discussed C atomics, let's implement the poor man's mutex!

What needs to be done? We've established that only one task can be using the list at a time. We'd probably
want some sort of a flag to know if it's our turn to use the list. If the flag is set, it means that someone
is currently using the list and thus we have to wait. If clear, we're good to go.

Obviously, the flag itself has to be atomic. Otherwise one task could read the flag while it's being updated,
leading to a thread entering the critical section when it's not its turn.

[source,c]
----
#include <stdatomic.h>

#define SPIN_LOCK_INIT ATOMIC_FLAG_INIT

typedef atomic_flag spin_lock_t;

void spin_lock (spin_lock_t* sl) {
    while (atomic_flag_test_and_set_explicit (sl, memory_order_acquire))
        ;
}

void spin_unlock (spin_lock_t* sl) {
    atomic_flag_clear_explicit (sl, memory_order_release);
}

// List same as before...

// the lock
spin_lock_t the_lock = SPIN_LOCK_INIT;

void thread_A (void) {
    struct my_list_node* new_node = malloc (sizeof (*new_node));
    new_node->value = 67;

    spin_lock (&the_lock);
    LIST_ADD (my_list, new_node);
    spin_unlock (&the_lock);
}

void thread_B (void) {
    spin_lock (&the_lock);
    LIST_REMOVE (my_list, my_list); /* remove head */
    spin_unlock (&the_lock);
}
----

As you can see, if a task tries to modify the list, it must first check whether it can do so. If not, it
waits until such an operation is possible. Pretty easy, eh?

There are some big and small improvements to be made, so let's pick up a magnifying glass and analyze!

===== Small inefficiency

When we fail to lock the spin lock, we're stuck in a spin-loop. This is really bad for performance, because
we're just eating up CPU cycles for nothing. A better spin lock would "relax the CPU", i.e. tell the
processor that it's stuck in a spin-loop, so it can consume less power or let another hardware thread run.

On x86 we have a `pause` instruction: https://www.felixcloutier.com/x86/pause.
ARM has a `yield` instruction: https://developer.arm.com/documentation/dui0802/b/A32-and-T32-Instructions/SEV--SEVL--WFE--WFI--and-YIELD.
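
Putting that hint to use, a relaxed version of our spin lock could look roughly like this (a minimal
x86-only sketch assuming GCC/Clang inline assembly - not the actual MOP3 implementation):

[source,c]
----
#include <stdatomic.h>

#define SPIN_LOCK_INIT ATOMIC_FLAG_INIT

typedef atomic_flag spin_lock_t;

/* Tell the CPU we're busy-waiting (x86 `pause`; on ARM this would be `yield`). */
static inline void cpu_relax (void) {
    __asm__ volatile ("pause");
}

void spin_lock (spin_lock_t* sl) {
    while (atomic_flag_test_and_set_explicit (sl, memory_order_acquire))
        cpu_relax (); /* relax instead of hammering the flag at full speed */
}

void spin_unlock (spin_lock_t* sl) {
    atomic_flag_clear_explicit (sl, memory_order_release);
}
----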
+ +===== Big inefficiency +Despite the `pause` or `yield` instructions and such, we still have an underlying flaw in our spin lock. +The issue is that we're just spinning when we fail to acquire the lock, when we could hand over control +to another task - we can't have access ourselves, so let someone else take the precious CPU time and get +a little bit of work done. + +Despite this "inefficiency", spin locks are the most primitive-yet-useful way of locking. This example shows +a simple solution to a simple problem, but in the real-world one would use spin locks to implement higher +level/smarter ways of providing task synchronization. + +=== The better mutex and it's building blocks + +So if spinning in place is "bad" for performance, what do we do then? Here come the "suspension queues". + +==== Scheduler states + +If we fail to acquire a mutex, it means that we're stuck waiting for it. If we're stuck, we shouldn't be +scheduled, ie. given CPU time (there's no work to be done). To denote this, we have to introduce another +process state: + +[source,c] +---- +/* process states */ +#define PROC_READY 0 +#define PROC_DEAD 1 +#define PROC_SUSPENDED 2 /* NEW */ +---- + +This new state describes a new situation in our scheduler - a process IS alive, but CANNOT be scheduled. +Before we only had alive (schedulable) processes and dead (unschedulable) processes. + +Now our scheduler won't pick up suspended processes: + +[source,c] +---- +// ... + +static struct proc* proc_find_sched (struct cpu* cpu) { + if (!cpu->proc_run_q) + return NULL; + + struct list_node_link *current, *start; + + if (cpu->proc_current) + current = cpu->proc_current->cpu_run_q_link.next; + else + current = cpu->proc_run_q; + + if (!current) + current = cpu->proc_run_q; + + start = current; + + do { + struct proc* proc = list_entry (current, struct proc, cpu_run_q_link); + + if (atomic_load (&proc->state) == PROC_READY) + return proc; + + current = current->next ? current->next : cpu->proc_run_q; + } while (current != start); + + return NULL; +} + +// ... + +void proc_sched (void) { + spin_lock_ctx_t ctxcpu; + + int s_cycles = atomic_fetch_add (&sched_cycles, 1); + + if (s_cycles % SCHED_REAP_FREQ == 0) + proc_reap (); + + struct proc* next = NULL; + struct cpu* cpu = thiscpu; + + spin_lock (&cpu->lock, &ctxcpu); + + next = proc_find_sched (cpu); + + if (next) { + cpu->proc_current = next; + + do_sched (next, &cpu->lock, &ctxcpu); + } else { + cpu->proc_current = NULL; + spin_unlock (&cpu->lock, &ctxcpu); + + spin (); + } +} +---- + +==== Suspension queues + +A suspension queue is a little tricky - it's not just a list of suspended processes. We need to establish +a many:many relationship (kinda like in databases). A process can be a member of many suspension queues +(image a process, which waits on many mutexes) and a single suspension queue has many member processes. +In order to achieve that, we can do some pointer trickery! 

[source,c]
----
struct proc {
    int pid;
    struct rb_node_link proc_tree_link;
    struct rb_node_link procgroup_memb_tree_link;
    struct list_node_link cpu_run_q_link;
    struct list_node_link reap_link;
    struct list_node_link* sq_entries; /* list of intermediate structs */
    struct procgroup* procgroup;
    struct proc_platformdata pdata;
    uint32_t flags;
    spin_lock_t lock;
    struct cpu* cpu;
    atomic_int state;
    uintptr_t uvaddr_argument;
};

struct proc_suspension_q {
    struct list_node_link* proc_list; /* list of processes via proc_sq_entry */
    spin_lock_t lock;
};

/* the intermediate struct */
struct proc_sq_entry {
    struct list_node_link sq_link; /* link in the suspension queue's list */
    struct list_node_link proc_link; /* link in the process's list */
    struct proc* proc; /* pointer back to the process */
    struct proc_suspension_q* sq; /* pointer back to the suspension queue */
};
----

Now we can talk about the suspend/resume algorithms.

[source,c]
----
bool proc_sq_suspend (struct proc* proc, struct proc_suspension_q* sq, spin_lock_t* resource_lock,
                      spin_lock_ctx_t* ctxrl) {
    spin_lock_ctx_t ctxpr, ctxcpu, ctxsq;
    struct cpu* cpu = proc->cpu;

    /* allocate the intermediate struct */
    struct proc_sq_entry* sq_entry = malloc (sizeof (*sq_entry));
    if (!sq_entry) {
        spin_unlock (resource_lock, ctxrl);
        return PROC_NO_RESCHEDULE;
    }

    /* set links on both ends */
    sq_entry->proc = proc;
    sq_entry->sq = sq;

    /* lock in compliance with the lock hierarchy */
    spin_lock (&cpu->lock, &ctxcpu);
    spin_lock (&proc->lock, &ctxpr);
    spin_lock (&sq->lock, &ctxsq);

    spin_unlock (resource_lock, ctxrl);

    /* transition the state to PROC_SUSPENDED */
    atomic_store (&proc->state, PROC_SUSPENDED);

    /* append to sq's list */
    list_append (sq->proc_list, &sq_entry->sq_link);

    /* append to proc's list */
    list_append (proc->sq_entries, &sq_entry->proc_link);

    /* remove from the CPU's run list and decrement the CPU load counter */
    list_remove (cpu->proc_run_q, &proc->cpu_run_q_link);
    atomic_fetch_sub (&cpu->proc_run_q_count, 1);

    if (cpu->proc_current == proc)
        cpu->proc_current = NULL;

    /* the process is now unschedulable, so it doesn't need a CPU */
    proc->cpu = NULL;

    /* release the locks */
    spin_unlock (&sq->lock, &ctxsq);
    spin_unlock (&proc->lock, &ctxpr);
    spin_unlock (&cpu->lock, &ctxcpu);

    return PROC_NEED_RESCHEDULE;
}

bool proc_sq_resume (struct proc* proc, struct proc_sq_entry* sq_entry) {
    spin_lock_ctx_t ctxsq, ctxpr, ctxcpu;
    struct cpu* cpu = cpu_find_lightest (); /* find the least-loaded CPU */
    struct proc_suspension_q* sq = sq_entry->sq;

    /* acquire the necessary locks */
    spin_lock (&cpu->lock, &ctxcpu);
    spin_lock (&proc->lock, &ctxpr);
    spin_lock (&sq->lock, &ctxsq);

    /* remove from sq's list */
    list_remove (sq->proc_list, &sq_entry->sq_link);

    /* remove from proc's list */
    list_remove (proc->sq_entries, &sq_entry->proc_link);

    /* give the CPU to the process */
    proc->cpu = cpu;

    /*
     * update the process state to PROC_READY, but only if it's not waiting inside
     * another suspension queue.
+ */ + if (proc->sq_entries == NULL) + atomic_store (&proc->state, PROC_READY); + + /* attach to CPU's run list and increment load counter */ + list_append (cpu->proc_run_q, &proc->cpu_run_q_link); + atomic_fetch_add (&cpu->proc_run_q_count, 1); + + /* unlock */ + spin_unlock (&sq->lock, &ctxsq); + spin_unlock (&proc->lock, &ctxpr); + spin_unlock (&cpu->lock, &ctxcpu); + + /* intermediate struct is no longer needed, so free it */ + free (sq_entry); + + return PROC_NEED_RESCHEDULE; +} +---- + +As you can see, this algorithm is quite simple actually. It's just a matter of safely +appending/removing links to specific lists. + +==== The mutex itself + +The mutex is quite simple too. + +[source,c] +---- +struct proc_mutex { + struct proc_resource* resource; /* pointer back to the resource */ + + bool locked; /* taken or free */ + struct proc_suspension_q suspension_q; /* suspension queue of waiting processes */ + struct proc* owner; /* current owner of the mutex / process, which can enter the critical section */ +}; +---- + +And now the lock and unlock algorithms. + +[source,c] +---- +bool proc_mutex_lock (struct proc* proc, struct proc_mutex* mutex) { + spin_lock_ctx_t ctxmt; + + spin_lock (&mutex->resource->lock, &ctxmt); + + /* we can enter if: the mutex is NOT locked or we're already the owner of the mutex */ + if (!mutex->locked || mutex->owner == proc) { + mutex->locked = true; + mutex->owner = proc; + spin_unlock (&mutex->resource->lock, &ctxmt); + return PROC_NO_RESCHEDULE; + } + + /* we failed to become an owner, thus we have to free up the CPU and suspend ourselves */ + return proc_sq_suspend (proc, &mutex->suspension_q, &mutex->resource->lock, &ctxmt); +} + +bool proc_mutex_unlock (struct proc* proc, struct proc_mutex* mutex) { + spin_lock_ctx_t ctxmt, ctxsq; + + spin_lock (&mutex->resource->lock, &ctxmt); + + /* + * we can't unlock a mutex, which we don't already own, + * otherwise we could just steal someone else's mutex. + */ + if (mutex->owner != proc) { + spin_unlock (&mutex->resource->lock, &ctxmt); + return PROC_NO_RESCHEDULE; + } + + spin_lock (&mutex->suspension_q.lock, &ctxsq); + + struct list_node_link* node = mutex->suspension_q.proc_list; + + /* + * This is a little tricky. What we're doing here is "direct handoff" - we basically + * transfer ownership, while keeping the mutex locked, instead of just unlocking and + * resuming everyone. + */ + + if (node) { + struct proc_sq_entry* sq_entry = list_entry (node, struct proc_sq_entry, sq_link); + struct proc* resumed_proc = sq_entry->proc; + /* transfer ownership */ + mutex->owner = resumed_proc; + mutex->locked = true; + + spin_unlock (&mutex->suspension_q.lock, &ctxsq); + spin_unlock (&mutex->resource->lock, &ctxmt); + + /* resume new owner - it can enter the critical section now */ + return proc_sq_resume (resumed_proc, sq_entry); + } + + /* no possible future owner found, so we just unlock */ + mutex->locked = false; + mutex->owner = NULL; + + spin_unlock (&mutex->suspension_q.lock, &ctxsq); + spin_unlock (&mutex->resource->lock, &ctxmt); + + return PROC_NEED_RESCHEDULE; +} +---- + +===== Direct handoff and "The Thundering Herd" + +image::/img/running-bison.jpg["A thundering herd of bison"] +~https://www.eurowildlife.org/wp-content/uploads/2013/11/European-Wildlife-American-Bison-Hybrididisation-Risk.jpg~ + +Why the direct handoff? Why not just unlock the mutex? + +The thundering herd problem is very common in multitasking and synchronization. + +We free the mutex, wake up all waiters and now anyone can take it. 
The issue here is that ONLY ONE
task will eventually get to acquire the mutex and enter the critical section. If only one gets to
take the mutex, why bother waking up everyone? The solution here would be to pass the ownership down
to the first encountered waiter, then the next one, and so on. Also, waking everyone up would cause
a performance penalty, because we'd be wasting CPU cycles just to be told not to execute any further.

====== Two birds, one stone

Via direct handoff, we've also just eliminated a fairness problem.

What does it mean for a mutex to be "fair"?

Fairness in synchronization means that each task at some point will get a chance to work in the
critical section.

Fairness example problem:

[source,c]
----
#include <stdatomic.h>

#define SPIN_LOCK_INIT ATOMIC_FLAG_INIT

typedef atomic_flag spin_lock_t;

void spin_lock (spin_lock_t* sl) {
    while (atomic_flag_test_and_set_explicit (sl, memory_order_acquire))
        __asm__ volatile ("pause");
}

void spin_unlock (spin_lock_t* sl) {
    atomic_flag_clear_explicit (sl, memory_order_release);
}

// the lock
spin_lock_t the_lock = SPIN_LOCK_INIT;

void thread_A (void) {
    for (;;) {
        spin_lock (&the_lock);
        {
            /* critical section goes here */
        }
        spin_unlock (&the_lock);
    }
}

void thread_B (void) {
    for (;;) {
        spin_lock (&the_lock);
        {
            /* critical section goes here */
        }
        spin_unlock (&the_lock);
    }
}
----

Let's take `thread_B` for example. `thread_B` locks, does its thing and unlocks. Because it's in an infinite
loop, there's a chance that it may re-lock too quickly, before `thread_A` gets a chance. This causes a
fairness issue, because `thread_A` is getting starved - it can't enter the critical section, even though it
plays by the rules, synchronizing properly with `thread_B` via a spinlock. On real CPUs (i.e. real hardware
or QEMU with `-enable-kvm`), there will always be some tiny delays/"clog ups" that mitigate this problem,
but theoretically it is possible for a task to starve.

Our algorithm avoids this, because we always select the next owner of the mutex ourselves, so we never end
up in a situation where, upon unlock, the new owner is chosen by chance - whoever grabs it first, owns it.
This makes the approach more deterministic/predictable.

== Final result

This is the final result of our work!

[source,c]
----
#include
#include
#include
#include
#include
#include

/* ID of the mutex resource */
#define MUTEX 2000

/* TLS (thread-local storage) support was added recently ;)! */
LOCAL volatile char letter = 'x';

void app_proc (void) {
    char arg_letter = (char)(uintptr_t)argument_ptr ();

    letter = arg_letter;

    for (;;) {
        /* enter the critical section, print the _letter_ 3 times and leave */

        mutex_lock (MUTEX);

        for (int i = 0; i < 3; i++)
            test (letter);

        mutex_unlock (MUTEX);
    }

    process_quit ();
}

void app_main (void) {
    mutex_create (MUTEX);

    letter = 'a';

    process_spawn (&app_proc, (void*)'b');
    process_spawn (&app_proc, (void*)'c');
    process_spawn (&app_proc, (void*)'d');

    for (;;) {
        mutex_lock (MUTEX);

        for (int i = 0; i < 3; i++)
            test (letter);

        mutex_unlock (MUTEX);
    }
}
----
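
Assuming `test ()` simply prints the single character it's given (the userland API isn't shown here, so
treat this as an illustration), the output should come out in groups of three identical letters - only one
process at a time is inside the critical section, and direct handoff rotates ownership between the waiting
processes - so something roughly like:

----
aaabbbcccdddaaabbbcccddd...
----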