= Mutexes and process suspension in kernels
Kamil Kowalczyk
2026-01-30
:jbake-type: post
:jbake-tags: MOP2 osdev
:jbake-status: published

Hello!

In this blog post I would like to share my thoughts and some implementation details about the *mutex*
mechanism in *MOP3*!

WAIT! MOP *3*?

Yes! MOP2 is undergoing a huge rewrite and redesign right now, because I wasn't happy with **some**
design choices of the system. This is mostly about writing SMP-first/SMP-friendly code. The rewrite
turned out to be so big that it was just easier to start a new project, while reusing big portions
of older MOP2 code - and thus MOP3 was born. Expect more interesting updates in the future!

New GIT repository: https://git.kamkow1lair.pl/kamkow1/MOP3

== Multitasking is delicate

=== Preface

This portion of the article explains critical sections and why they're so fragile. If you're an
advanced reader, you can easily skip it.

Writing multitasked/parallel code is notoriously difficult and can lead to bizarre, awful and nondeterministic
bugs. Keep in mind that multitasked code is not something you can reliably debug with normal developer tools
like `gdb` or `lldb` or some other crash analyzers and code inspectors. Instead, such code should be
proven to be safe/correct, either by carefully reading the code or by writing formal proofs.

Formal proofs can be written using the Promela modeling language: https://spinroot.com/spin/whatispin.html

=== The problem

Imagine this problem in a multitasked application:

.A disaster happens here
[source,c]
----
struct my_list_node {
    struct my_list_node *next, *prev;
    int value;
};

/* These are complex operations on multiple pointers */
#define LIST_ADD(list, x) ...
#define LIST_REMOVE(list, x) ...

struct my_list_node* my_list = ...;

void thread_A (void) {
    struct my_list_node* new_node = malloc (sizeof (*new_node));
    new_node->value = 67;

    LIST_ADD (my_list, new_node);
}

void thread_B (void) {
    LIST_REMOVE (my_list, my_list); /* remove head */
}
----

The issue here is that, while `thread_A` is adding a new node, `thread_B` can try to remove it. This will
lead to pointer corruption, since `LIST_REMOVE` and `LIST_ADD` will try to re-link the list's nodes at the
same time.

Such fragile sections of code are called "critical sections", meaning that one has to be VERY careful when
tasks enter them. So where are they in the code?

[source,c]
----
void thread_A (void) {
    /* NOT CRITICAL - new_node is not yet visible to other tasks */
    struct my_list_node* new_node = malloc (sizeof (*new_node));
    new_node->value = 67;

    /* CRITICAL */
    LIST_ADD (my_list, new_node);
}

void thread_B (void) {
    /* CRITICAL */
    LIST_REMOVE (my_list, my_list); /* remove head */
}
----

=== So how do we solve this?

This specific problem can be easily solved by what's called "Mutual Exclusion", or "MutEx" for short.

Why "mutual exclusion"? As you can see, to modify the list, a task needs *exclusive* access to it, ie.
nobody else can be using the list at the same time. Kind of like a bathroom - imagine someone happily walks
in while you're sitting on the toilet and tries to fit right beside you.

==== A poor man's mutex and atomicity

Let's park it here and move to a slightly different, but related, topic - atomicity. What is it?

To be atomic is to be indivisible. An atomic read or write is an operation that can't be split up
and has to be completed from start to finish without being interrupted.

While atomic operations are mutually exclusive, they work on the level of individual reads/writes of integers
of various sizes. One can atomically read/write a byte or a uint64_t. This of course won't solve our problem,
because we're operating on many integers (here pointers) simultaneously - struct my_list_node's prev and next
pointers - but despite that, we can leverage atomicity to implement higher-level synchronization constructs.

BTW, documentation on atomic operations in C can be found here: https://en.cppreference.com/w/c/atomic.html.
The functions/macros are implemented internally by the compiler, eg. GCC atomics: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html.
They have to be implemented internally, because they're very platform dependent and each architecture may handle
atomicity slightly differently, although the same rules apply everywhere.

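To make the "individual integer" part concrete, here is a tiny standalone example (not MOP3 code - just plain C with pthreads, assuming a POSIX userspace) of the kind of operation that IS atomic: a read-modify-write of a single integer. Two threads hammer the same counter, yet no increment is ever lost:

[source,c]
----
/* Illustrative only - assumes a POSIX system with pthreads. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int counter = 0;

static void* worker (void* arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        atomic_fetch_add (&counter, 1); /* indivisible read-modify-write */
    return NULL;
}

int main (void) {
    pthread_t a, b;
    pthread_create (&a, NULL, worker, NULL);
    pthread_create (&b, NULL, worker, NULL);
    pthread_join (a, NULL);
    pthread_join (b, NULL);

    /* always 2000000; with a plain int and counter++ you'd likely see less */
    printf ("%d\n", atomic_load (&counter));
    return 0;
}
----
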
Now that we've discussed C atomics, let's implement the poor man's mutex!

What needs to be done? We've established that only one task can be using the list at a time. We'd probably want
some sort of flag to know if it's our turn to use the list. If the flag is set, it means that someone is
currently using the list and thus we have to wait. If clear, we're good to go.

Obviously, the flag has to be atomic itself. Otherwise one task could read the flag while it's being updated,
leading to a thread entering the critical section when it's not its turn.

[source,c]
----
#include <stdatomic.h>

#define SPIN_LOCK_INIT ATOMIC_FLAG_INIT

typedef atomic_flag spin_lock_t;

void spin_lock (spin_lock_t* sl) {
    while (atomic_flag_test_and_set_explicit (sl, memory_order_acquire))
        ;
}

void spin_unlock (spin_lock_t* sl) {
    atomic_flag_clear_explicit (sl, memory_order_release);
}

// List same as before...

// the lock
spin_lock_t the_lock = SPIN_LOCK_INIT;

void thread_A (void) {
    struct my_list_node* new_node = malloc (sizeof (*new_node));
    new_node->value = 67;

    spin_lock (&the_lock);
    LIST_ADD (my_list, new_node);
    spin_unlock (&the_lock);
}

void thread_B (void) {
    spin_lock (&the_lock);
    LIST_REMOVE (my_list, my_list); /* remove head */
    spin_unlock (&the_lock);
}
----

As you can see, if a task tries to modify the list, it must first check whether it can do so. If not, it waits
until such an operation is possible. Pretty easy, eh?

There are some big and small improvements to be made, so let's pick up a magnifying glass and analyze!

===== Small inefficiency
When we fail to lock the spin lock, we're stuck in a spin-loop. This is really bad
for performance, because we're just eating up CPU cycles for nothing. A better spin lock would "relax the
CPU", ie. tell the processor that it's stuck in a spin-loop, so it can consume less power or
give resources to the sibling hardware thread.

On x86 we have the `pause` instruction: https://www.felixcloutier.com/x86/pause.
ARM has the `yield` instruction: https://developer.arm.com/documentation/dui0802/b/A32-and-T32-Instructions/SEV--SEVL--WFE--WFI--and-YIELD.

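A minimal sketch of what that could look like on x86 (GCC/Clang inline assembly - the fairness example later in this post uses the same trick; the `spin_lock_relaxed` name is just for illustration):

[source,c]
----
/* Same spin_lock as above, but "relaxed" - the pause hint tells the CPU
 * we're in a busy-wait loop, so it can save power and stop hammering
 * the contended cache line quite so hard. */
void spin_lock_relaxed (spin_lock_t* sl) {
    while (atomic_flag_test_and_set_explicit (sl, memory_order_acquire))
        __asm__ volatile ("pause");
}
----
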
===== Big inefficiency
Despite the `pause` or `yield` instructions and such, we still have an underlying flaw in our spin lock.
The issue is that we're just spinning when we fail to acquire the lock, when we could hand over control
to another task - we can't have access ourselves, so let someone else take the precious CPU time and get
a little bit of work done.

Despite this "inefficiency", spin locks are the most primitive-yet-useful way of locking. This example shows
a simple solution to a simple problem, but in the real world one would use spin locks to implement
higher-level, smarter ways of providing task synchronization.

=== The better mutex and its building blocks

So if spinning in place is "bad" for performance, what do we do then? Here come the "suspension queues".

==== Scheduler states

If we fail to acquire a mutex, it means that we're stuck waiting for it. If we're stuck, we shouldn't be
scheduled, ie. given CPU time (there's no work to be done). To denote this, we have to introduce another
process state:

[source,c]
----
/* process states */
#define PROC_READY     0
#define PROC_DEAD      1
#define PROC_SUSPENDED 2 /* NEW */
----

This new state describes a new situation in our scheduler - a process IS alive, but CANNOT be scheduled.
Before we only had alive (schedulable) processes and dead (unschedulable) processes.

Now our scheduler won't pick up suspended processes:

[source,c]
----
// ...

/* Walk the CPU's run queue round-robin, starting right after the currently
 * running process, and return the first PROC_READY process (suspended and
 * dead processes are skipped). */
static struct proc* proc_find_sched (struct cpu* cpu) {
    if (!cpu->proc_run_q)
        return NULL;

    struct list_node_link *current, *start;

    if (cpu->proc_current)
        current = cpu->proc_current->cpu_run_q_link.next;
    else
        current = cpu->proc_run_q;

    if (!current)
        current = cpu->proc_run_q;

    start = current;

    do {
        struct proc* proc = list_entry (current, struct proc, cpu_run_q_link);

        if (atomic_load (&proc->state) == PROC_READY)
            return proc;

        current = current->next ? current->next : cpu->proc_run_q;
    } while (current != start);

    return NULL;
}

// ...

void proc_sched (void) {
    spin_lock_ctx_t ctxcpu;

    int s_cycles = atomic_fetch_add (&sched_cycles, 1);

    /* every SCHED_REAP_FREQ scheduling cycles, clean up dead processes */
    if (s_cycles % SCHED_REAP_FREQ == 0)
        proc_reap ();

    struct proc* next = NULL;
    struct cpu* cpu = thiscpu;

    spin_lock (&cpu->lock, &ctxcpu);

    next = proc_find_sched (cpu);

    if (next) {
        cpu->proc_current = next;

        do_sched (next, &cpu->lock, &ctxcpu);
    } else {
        /* nothing runnable on this CPU - idle until the next scheduling event */
        cpu->proc_current = NULL;
        spin_unlock (&cpu->lock, &ctxcpu);

        spin ();
    }
}
----

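A quick note on `list_entry`, since it does a lot of heavy lifting above: MOP3's actual definition isn't shown in this post, but it's assumed to be the classic `container_of`-style macro that recovers the enclosing structure from a pointer to an embedded link, roughly like this sketch:

[source,c]
----
#include <stddef.h> /* offsetof */

/* Sketch of an assumed container_of-style helper: given a pointer to the
 * `member` field embedded inside a struct of type `type`, step back by the
 * member's offset to get a pointer to the whole struct. */
#define list_entry(node, type, member) \
    ((type*)((char*)(node) - offsetof (type, member)))
----
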
==== Suspension queues

A suspension queue is a little tricky - it's not just a list of suspended processes. We need to establish
a many-to-many relationship (kinda like in databases). A process can be a member of many suspension queues
(imagine a process which waits on many mutexes) and a single suspension queue has many member processes.
In order to achieve that, we can do some pointer trickery!

[source,c]
----
struct proc {
    int pid;
    struct rb_node_link proc_tree_link;
    struct rb_node_link procgroup_memb_tree_link;
    struct list_node_link cpu_run_q_link;
    struct list_node_link reap_link;
    struct list_node_link* sq_entries; /* list of intermediate structs */
    struct procgroup* procgroup;
    struct proc_platformdata pdata;
    uint32_t flags;
    spin_lock_t lock;
    struct cpu* cpu;
    atomic_int state;
    uintptr_t uvaddr_argument;
};

struct proc_suspension_q {
    struct list_node_link* proc_list; /* list of processes via proc_sq_entry */
    spin_lock_t lock;
};

/* the intermediate struct */
struct proc_sq_entry {
    struct list_node_link sq_link;   /* link into the suspension queue's list */
    struct list_node_link proc_link; /* link into the process's list */
    struct proc* proc;               /* pointer back to the process */
    struct proc_suspension_q* sq;    /* pointer back to the suspension queue */
};
----

Now we can talk about the suspend/resume algorithms.

[source,c]
----
bool proc_sq_suspend (struct proc* proc, struct proc_suspension_q* sq, spin_lock_t* resource_lock,
                      spin_lock_ctx_t* ctxrl) {
    spin_lock_ctx_t ctxpr, ctxcpu, ctxsq;
    struct cpu* cpu = proc->cpu;

    /* allocate intermediate struct */
    struct proc_sq_entry* sq_entry = malloc (sizeof (*sq_entry));
    if (!sq_entry) {
        spin_unlock (resource_lock, ctxrl);
        return PROC_NO_RESCHEDULE;
    }

    /* set links on both ends */
    sq_entry->proc = proc;
    sq_entry->sq = sq;

    /* lock in compliance with the lock hierarchy */
    spin_lock (&cpu->lock, &ctxcpu);
    spin_lock (&proc->lock, &ctxpr);
    spin_lock (&sq->lock, &ctxsq);

    spin_unlock (resource_lock, ctxrl);

    /* transition the state to PROC_SUSPENDED */
    atomic_store (&proc->state, PROC_SUSPENDED);

    /* append to sq's list */
    list_append (sq->proc_list, &sq_entry->sq_link);

    /* append to proc's list */
    list_append (proc->sq_entries, &sq_entry->proc_link);

    /* remove from CPU's run list and decrement CPU load counter */
    list_remove (cpu->proc_run_q, &proc->cpu_run_q_link);
    atomic_fetch_sub (&cpu->proc_run_q_count, 1);

    if (cpu->proc_current == proc)
        cpu->proc_current = NULL;

    /* the process is now unschedulable, so it doesn't need a CPU */
    proc->cpu = NULL;

    /* release the locks */
    spin_unlock (&sq->lock, &ctxsq);
    spin_unlock (&proc->lock, &ctxpr);
    spin_unlock (&cpu->lock, &ctxcpu);

    return PROC_NEED_RESCHEDULE;
}

bool proc_sq_resume (struct proc* proc, struct proc_sq_entry* sq_entry) {
    spin_lock_ctx_t ctxsq, ctxpr, ctxcpu;
    struct cpu* cpu = cpu_find_lightest (); /* find the least-loaded CPU */
    struct proc_suspension_q* sq = sq_entry->sq;

    /* acquire necessary locks */
    spin_lock (&cpu->lock, &ctxcpu);
    spin_lock (&proc->lock, &ctxpr);
    spin_lock (&sq->lock, &ctxsq);

    /* remove from sq's list */
    list_remove (sq->proc_list, &sq_entry->sq_link);

    /* remove from proc's list */
    list_remove (proc->sq_entries, &sq_entry->proc_link);

    /* give the CPU to the process */
    proc->cpu = cpu;

    /*
     * update process state to PROC_READY, but only if it's not waiting inside
     * another suspension queue.
     */
    if (proc->sq_entries == NULL)
        atomic_store (&proc->state, PROC_READY);

    /* attach to CPU's run list and increment load counter */
    list_append (cpu->proc_run_q, &proc->cpu_run_q_link);
    atomic_fetch_add (&cpu->proc_run_q_count, 1);

    /* unlock */
    spin_unlock (&sq->lock, &ctxsq);
    spin_unlock (&proc->lock, &ctxpr);
    spin_unlock (&cpu->lock, &ctxcpu);

    /* intermediate struct is no longer needed, so free it */
    free (sq_entry);

    return PROC_NEED_RESCHEDULE;
}
----

As you can see, this algorithm is actually quite simple - it's just a matter of safely
appending/removing links to/from the right lists. One subtle detail: `proc_sq_suspend` takes over the
resource lock (`resource_lock`/`ctxrl`) and only drops it once the suspension queue's lock is held, so an
unlock racing on another CPU can't scan the queue before we're actually enqueued and miss the wakeup.

==== The mutex itself

The mutex is quite simple too.

[source,c]
----
struct proc_mutex {
    struct proc_resource* resource; /* pointer back to the resource */

    bool locked;                           /* taken or free */
    struct proc_suspension_q suspension_q; /* suspension queue of waiting processes */
    struct proc* owner;                    /* current owner, ie. the process which can enter the critical section */
};
----

And now the lock and unlock algorithms.

[source,c]
----
bool proc_mutex_lock (struct proc* proc, struct proc_mutex* mutex) {
    spin_lock_ctx_t ctxmt;

    spin_lock (&mutex->resource->lock, &ctxmt);

    /* we can enter if: the mutex is NOT locked, or we're already the owner of the mutex */
    if (!mutex->locked || mutex->owner == proc) {
        mutex->locked = true;
        mutex->owner = proc;
        spin_unlock (&mutex->resource->lock, &ctxmt);
        return PROC_NO_RESCHEDULE;
    }

    /* we failed to become the owner, thus we have to free up the CPU and suspend ourselves */
    return proc_sq_suspend (proc, &mutex->suspension_q, &mutex->resource->lock, &ctxmt);
}

bool proc_mutex_unlock (struct proc* proc, struct proc_mutex* mutex) {
    spin_lock_ctx_t ctxmt, ctxsq;

    spin_lock (&mutex->resource->lock, &ctxmt);

    /*
     * we can't unlock a mutex which we don't own,
     * otherwise we could just steal someone else's mutex.
     */
    if (mutex->owner != proc) {
        spin_unlock (&mutex->resource->lock, &ctxmt);
        return PROC_NO_RESCHEDULE;
    }

    spin_lock (&mutex->suspension_q.lock, &ctxsq);

    struct list_node_link* node = mutex->suspension_q.proc_list;

    /*
     * This is a little tricky. What we're doing here is a "direct handoff" - we
     * transfer ownership while keeping the mutex locked, instead of just unlocking
     * and resuming everyone.
     */
    if (node) {
        struct proc_sq_entry* sq_entry = list_entry (node, struct proc_sq_entry, sq_link);
        struct proc* resumed_proc = sq_entry->proc;

        /* transfer ownership */
        mutex->owner = resumed_proc;
        mutex->locked = true;

        spin_unlock (&mutex->suspension_q.lock, &ctxsq);
        spin_unlock (&mutex->resource->lock, &ctxmt);

        /* resume the new owner - it can enter the critical section now */
        return proc_sq_resume (resumed_proc, sq_entry);
    }

    /* no possible future owner found, so we just unlock */
    mutex->locked = false;
    mutex->owner = NULL;

    spin_unlock (&mutex->suspension_q.lock, &ctxsq);
    spin_unlock (&mutex->resource->lock, &ctxmt);

    return PROC_NEED_RESCHEDULE;
}
----

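What the snippets above don't show is how the boolean return value gets used. A plausible call site - purely a sketch, the `sys_mutex_lock`/`sys_mutex_unlock` names are hypothetical and MOP3's real syscall layer may look different - would consume the reschedule hint like this:

[source,c]
----
/* Hypothetical syscall-handler glue, for illustration only. If the calling
 * process gave up the CPU (it suspended itself waiting for the mutex, or it
 * handed the mutex to a waiter), ask the scheduler to pick the next process. */
void sys_mutex_lock (struct proc* proc, struct proc_mutex* mutex) {
    if (proc_mutex_lock (proc, mutex) == PROC_NEED_RESCHEDULE)
        proc_sched ();
}

void sys_mutex_unlock (struct proc* proc, struct proc_mutex* mutex) {
    if (proc_mutex_unlock (proc, mutex) == PROC_NEED_RESCHEDULE)
        proc_sched ();
}
----
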
===== Direct handoff and "The Thundering Herd"

image::/img/running-bison.jpg["A thundering herd of bison"]
~https://www.eurowildlife.org/wp-content/uploads/2013/11/European-Wildlife-American-Bison-Hybrididisation-Risk.jpg~

Why the direct handoff? Why not just unlock the mutex?

The thundering herd problem is very common in multitasking and synchronization.

We free the mutex, wake up all waiters and now anyone can take it. The issue here is that ONLY ONE
task will eventually get to acquire the mutex and enter the critical section. If only one gets to
take the mutex, why bother waking up everyone? The solution here is to pass down the ownership
to the first encountered waiter, then the next one, and so on. Waking everyone would also cause
a performance penalty, because we'd be wasting CPU cycles just to be told not to execute any further.

====== Two birds, one stone

Via direct handoff, we've also just eliminated a fairness problem.

What does it mean for a mutex to be "fair"?

Fairness in synchronization means that each task at some point will get a chance to work in the
critical section.

Fairness example problem:

[source,c]
----
#include <stdatomic.h>

#define SPIN_LOCK_INIT ATOMIC_FLAG_INIT

typedef atomic_flag spin_lock_t;

void spin_lock (spin_lock_t* sl) {
    while (atomic_flag_test_and_set_explicit (sl, memory_order_acquire))
        __asm__ volatile ("pause");
}

void spin_unlock (spin_lock_t* sl) {
    atomic_flag_clear_explicit (sl, memory_order_release);
}

// the lock
spin_lock_t the_lock = SPIN_LOCK_INIT;

void thread_A (void) {
    for (;;) {
        spin_lock (&the_lock);
        {
            /* critical section goes here */
        }
        spin_unlock (&the_lock);
    }
}

void thread_B (void) {
    for (;;) {
        spin_lock (&the_lock);
        {
            /* critical section goes here */
        }
        spin_unlock (&the_lock);
    }
}
----

Let's take `thread_B` for example. `thread_B` locks, does its thing and unlocks. Because it's in an infinite
loop, there's a chance that it may re-lock too quickly, before `thread_A` gets the chance. This causes a
fairness issue, because `thread_A` is getting starved - it can't enter the critical section, even though it
plays by the rules, synchronizing properly with `thread_B` via a spinlock. On real CPUs (ie. real hardware
or QEMU with -enable-kvm), there will always be some tiny delays/"clog ups" that mitigate this problem,
but in theory a task can starve.

Our algorithm solves this by always selecting the next owner explicitly - the waiter at the front of the
suspension queue - instead of unlocking and letting the new owner be decided by whoever happens to grab the
mutex first. This makes the approach more deterministic/predictable.

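As an aside (this is not MOP3 code): the same fairness-by-ordering idea exists at the pure spin-lock level too, in the form of a classic ticket lock - each waiter takes a ticket and the lock serves tickets strictly in arrival order. A minimal sketch using C11 atomics:

[source,c]
----
#include <stdatomic.h>

/* Illustrative ticket lock. Each waiter grabs a ticket; the lock admits
 * tickets in FIFO order, so no waiter can be starved by a faster rival. */
typedef struct {
    atomic_uint next_ticket; /* ticket handed to the next arriving waiter */
    atomic_uint now_serving; /* ticket currently allowed into the critical section */
} ticket_lock_t;

void ticket_lock (ticket_lock_t* tl) {
    unsigned me = atomic_fetch_add_explicit (&tl->next_ticket, 1, memory_order_relaxed);
    while (atomic_load_explicit (&tl->now_serving, memory_order_acquire) != me)
        ; /* could also "pause"/"yield" here */
}

void ticket_unlock (ticket_lock_t* tl) {
    atomic_fetch_add_explicit (&tl->now_serving, 1, memory_order_release);
}
----

The trade-off is the same as before though: a ticket lock still spins, so inside a kernel we'd rather suspend the waiting process, exactly as shown above.
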
== Final result

This is the final result of our work! Each task locks the mutex, prints its letter three times and unlocks,
so in the output the letters should always appear in uninterrupted groups of three.

[source,c]
----
#include <limits.h>
#include <proc/local.h>
#include <proc/proc.h>
#include <stddef.h>
#include <stdint.h>
#include <string/string.h>

/* ID of the mutex resource */
#define MUTEX 2000

/* TLS (thread-local storage) support was added recently ;)! */
LOCAL volatile char letter = 'x';

void app_proc (void) {
    char arg_letter = (char)(uintptr_t)argument_ptr ();

    letter = arg_letter;

    for (;;) {
        /* enter the critical section, print the _letter_ 3 times and leave */

        mutex_lock (MUTEX);

        for (int i = 0; i < 3; i++)
            test (letter);

        mutex_unlock (MUTEX);
    }

    process_quit ();
}

void app_main (void) {
    mutex_create (MUTEX);

    letter = 'a';

    process_spawn (&app_proc, (void*)'b');
    process_spawn (&app_proc, (void*)'c');
    process_spawn (&app_proc, (void*)'d');

    for (;;) {
        mutex_lock (MUTEX);

        for (int i = 0; i < 3; i++)
            test (letter);

        mutex_unlock (MUTEX);
    }
}
----