diff --git a/assets/img/Elfdiagram.png b/assets/img/Elfdiagram.png
new file mode 100644
index 0000000..cfd26f2
Binary files /dev/null and b/assets/img/Elfdiagram.png differ
diff --git a/assets/img/compiler-pipeline.jpg b/assets/img/compiler-pipeline.jpg
new file mode 100644
index 0000000..aab90d4
Binary files /dev/null and b/assets/img/compiler-pipeline.jpg differ
diff --git a/assets/img/sisyphus.jpeg b/assets/img/sisyphus.jpeg
new file mode 100644
index 0000000..d5d21ce
Binary files /dev/null and b/assets/img/sisyphus.jpeg differ
diff --git a/content/blog/MOP2/AMD64-TLS.adoc b/content/blog/MOP2/AMD64-TLS.adoc
new file mode 100644
index 0000000..2168c0d
--- /dev/null
+++ b/content/blog/MOP2/AMD64-TLS.adoc
@@ -0,0 +1,565 @@
+= Implementing TLS (Thread Local Storage) for x86_64
+Kamil Kowalczyk
+2026-01-31
+:jbake-type: post
+:jbake-tags: MOP2 osdev
+:jbake-status: published
+:og-image: img/sisyphus.jpeg
+:og-title: x86_64 thread-local storage implementation
+
+
+In this article I'd like to explore the implementation details of thread local storage on x86_64/amd64
+for my operating system, in compliance with the System V ABI.
+
+Full code is, as always, at: https://git.kamkow1lair.pl/kamkow1/MOP3
+
+== Preface
+
+We're going to implement the bare working minimum of the ABI, just enough to make the `__thread`
+keyword work in Clang and GCC. The spec is more complicated than that. We're going to implement *static*
+TLS (there's also dynamic TLS, you can look up `__tls_get_addr` if you're interested in going further).
+
+Also I'd like to share this article as a very useful resource regarding TLS: https://maskray.me/blog/2021-02-14-all-about-thread-local-storage.
+It's more generally about TLS, but it made for a great learning resource for me and I really recommend you read it too.
+
+Other resources:
+
+* Spec reference: https://uclibc.org/docs/tls.pdf
+* OSDev Wiki: https://wiki.osdev.org/Thread_Local_Storage
+
+== What is thread local storage?
+
+Thread local storage is a type of storage in a multitasked application, where each task has its own copy
+of it, distinct from other tasks.
+
+.Example of TLS in C11
+[source,c]
+----
+#include <stddef.h>
+#include <stdio.h>
+#include <threads.h>
+
+thread_local int counter = 0;
+
+int thread_func(void *arg) {
+    int id = *(int*)arg;
+    counter++; // Each thread increments its own copy
+    printf("Thread %d: counter = %d\n", id, counter);
+    return 0;
+}
+
+int main() {
+    thrd_t threads[4];
+    int ids[4] = {1, 2, 3, 4};
+
+    for (int i = 0; i < 4; i++) {
+        thrd_create(&threads[i], thread_func, &ids[i]);
+    }
+
+    for (int i = 0; i < 4; i++) {
+        thrd_join(threads[i], NULL);
+    }
+
+    printf("Main thread counter: %d\n", counter); // Main's own copy
+    return 0;
+}
+----
+
+Although the application is accessing and modifying a global variable, each thread is actually using
+different memory under the hood. Each thread has its own copy to work with.
+
+What is `thread_local`? In the pre-C23 world it's a macro, which expands to the `_Thread_local` keyword, which
+is equivalent to the compiler-specific `__thread` in GCC and Clang.
+
+== Reverse engineering
+
+We're going to learn how TLS works via reverse engineering. We need to understand it before getting
+to implementing it ourselves. Let's look at the disassembly first, generated by Clang 21.1.0 on https://godbolt.org.
+
+I've added some comments here, so everything is nice and easy to read.
+
+.Assembly generated from Clang
+[source,x86asm]
+----
+/* int thread_func(void *arg) */
+thread_func:
+    /* Push new stack frame */
+    push rbp
+    mov rbp, rsp
+    mov qword ptr [rbp - 8], rdi /* store arg on the stack frame */
+
+
+    /* Read the ID value */
+    /* int id = *(int*)arg; */
+    mov rax, qword ptr [rbp - 8]
+    mov eax, dword ptr [rax]
+    mov dword ptr [rbp - 12], eax
+
+
+    /* counter++; */
+    mov rax, qword ptr fs:[0] /* ?????????? */
+    lea rax, [rax + counter@TPOFF]
+    mov ecx, dword ptr [rax]
+    add ecx, 1 /* do the ++ */
+    mov dword ptr [rax], ecx
+
+    /* return 0; */
+    xor eax, eax
+    pop rbp
+    ret
+
+/* The rest is irrelevant here... */
+
+counter:
+    .long 0
+----
+
+What is `fs:[0]` (also commonly written as `%fs:0` in GNU syntax)?
+
+We're going to refer to fs as `%fs` (GNU syntax), because that's how I write my assembly, but you can look
+up the analogous syntax for your assembler (like nasm or fasm).
+
+== x86 segmentation
+
+`%fs` is an x86 segment register. There are also other segment registers:
+
+- `%cs` code segment
+- `%ds` data segment
+- `%ss` stack segment
+- `%es` extra segment
+- `%fs`, `%gs` general segments
+
+=== Real mode (16 bit)
+
+x86_64 (yes, a 64 bit CPU) boots up first in 16 bit mode or the "real mode". In real mode we only have 16 bit
+registers, so one might think that we can address only up to 64K of memory. Segmentation lets us use more
+memory, because it changes the logical addressing scheme. Instead of pointing to a specific byte
+in memory, we can point to a block of memory and displace from the base of it to get the byte - and thus we
+can address more than 64K. In real mode the physical address is `segment * 16 + offset`, which is how
+early x86 CPUs (like the OG Intel 8086) could address up to 1MB.
+
+This explains the `%fs:0` syntax. We have a `%fs` base and a `0` displacement.
+
+A good explanation can also be found on the OSDev wiki: https://wiki.osdev.org/Segmentation.
+
+Also reading the `GDT` article will come in handy: https://wiki.osdev.org/Global_Descriptor_Table. From now on
+I will assume we're already working with 64 bit GDT and we're going to skip the 32 bit mode entirely in this
+article.
+
+=== Long mode (64 bit)
+
+In long mode segmentation is mostly switched off: the bases of `%cs`, `%ds`, `%es` and `%ss` are treated as 0.
+Only `%fs` and `%gs` keep a usable base, and that base is a full 64 bit address.
+
+=== Segment registers are different
+
+Segment registers are not like your typical `%rax` or `%rcx` - at least some. You can freely write to `%ds`,
+`%ss`, `%es` and that's it!
+`%cs`, `%fs` and `%gs` are special in that a plain `mov` won't get you what you want.
+`%cs` can be reloaded by, for example, the `lretq` instruction, while the 64 bit base of `%fs` and `%gs`
+is set by writing to an `MSR` (will explain in a bit).
+
+== Detour about MSRs
+
+MSR means Model-Specific Register. Intel basically wanted to add unstable features and didn't want to
+clutter up their architecture with experimental slop. Some of the MSRs were useful enough that they made it into
+future Intel CPUs and stayed with us. Generally speaking, MSRs control OS-related stuff about the CPU.
+
+MSRs are used with the `rdmsr`/`wrmsr` instructions. The scheme is like so:
+
+[source,x86asm]
+----
+movl $NUMBER_OF_MSR, %ecx
+movl $VALUE_BITS_LOW, %eax
+movl $VALUE_BITS_HIGH, %edx
+wrmsr
+
+movl $NUMBER_OF_MSR, %ecx
+rdmsr
+/* now %eax contains the low 32 bits and %edx the high 32 bits. Concatenate them as %edx:%eax to get the 64 bit value */
+----
+
+== `%fs` and MSRs
+
+I've mentioned previously that the base of `%fs` and `%gs` is set by writing to an MSR - but which one?
+
+The MSR we care about is called (in the Intel manual) `IA32_FS_BASE`. To address the confusion early on I'll say
+that some people call it slightly differently, e.g. in the Xen hypervisor code it's called `MSR_FS_BASE`. My
+kernel takes the definition header from Xen, so that's why I will use Xen's naming scheme, but `IA32_FS_BASE`
+would be the *official* name.
+
+Looking at the file `kernel/amd64/msr-index.h` we can see a juicy `#define`:
+
+.kernel/amd64/msr-index.h
+[source,c]
+----
+#define MSR_FS_BASE _AC (0xc0000100, U) /* 64bit FS base */
+----
+
+The magic MSR number is `0xc0000100`.
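To make the `%edx:%eax` convention concrete, here is a minimal sketch in plain C that runs in userspace, because it only does the bit arithmetic and never executes `rdmsr`/`wrmsr`. The names `msr_halves`, `msr_split`, `msr_join` and `msr_check_roundtrip` are made up for this illustration and are not part of the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the two 32 bit halves used by wrmsr/rdmsr. */
struct msr_halves {
  uint32_t eax; /* low 32 bits of the MSR value */
  uint32_t edx; /* high 32 bits of the MSR value */
};

/* Split a 64 bit value the way it has to be loaded before wrmsr. */
struct msr_halves msr_split (uint64_t value) {
  struct msr_halves h;
  h.eax = (uint32_t)(value & 0xFFFFFFFFu);
  h.edx = (uint32_t)(value >> 32);
  return h;
}

/* Rebuild the 64 bit value from what rdmsr leaves in %edx:%eax. */
uint64_t msr_join (struct msr_halves h) {
  return ((uint64_t)h.edx << 32) | (uint64_t)h.eax;
}

/* Splitting and joining again must give back the original value. */
void msr_check_roundtrip (uint64_t value) {
  assert (msr_join (msr_split (value)) == value);
}
```

For example, `msr_split (0x00007f1234567000)` gives `eax = 0x34567000` and `edx = 0x00007f12`, and `msr_join` turns those two halves back into the original address.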
+Here's how I'm using `MSR_FS_BASE`:
+
+.kernel/amd64/sched1.c
+[source,c]
+----
+void do_sched (struct proc* proc, spin_lock_t* cpu_lock, spin_lock_ctx_t* ctxcpu) {
+  spin_lock_ctx_t ctxpr;
+
+  spin_lock (&proc->lock, &ctxpr);
+
+  thiscpu->tss.rsp0 = proc->pdata.kernel_stack; /* set TSS kernel stack */
+  thiscpu->syscall_kernel_stack = proc->pdata.kernel_stack; /* set syscall entry stack */
+  amd64_wrmsr (MSR_FS_BASE, proc->pdata.fs_base); /* switch to proc's fs base */
+
+  spin_unlock (&proc->lock, &ctxpr);
+  spin_unlock (cpu_lock, ctxcpu);
+
+  amd64_do_sched ((void*)&proc->pdata.regs, (void*)proc->procgroup->pd.cr3_paddr);
+}
+----
+
+The MSR helpers are written like so:
+
+.kernel/amd64/msr.c
+[source,c]
+----
+/// Read a model-specific register
+uint64_t amd64_rdmsr (uint32_t msr) {
+  uint32_t low, high;
+  __asm__ volatile ("rdmsr" : "=a"(low), "=d"(high) : "c"(msr));
+  return ((uint64_t)high << 32 | (uint64_t)low);
+}
+
+/// Write a model-specific register
+void amd64_wrmsr (uint32_t msr, uint64_t value) {
+  uint32_t low = (uint32_t)(value & 0xFFFFFFFF);
+  uint32_t high = (uint32_t)(value >> 32);
+  __asm__ volatile ("wrmsr" ::"c"(msr), "a"(low), "d"(high));
+}
+----
+
+What we do is swap out the base value of `%fs` for each process, so every process has its own TLS!
+When processes are switched, the new `MSR_FS_BASE` is written.
+
+== So what is `%fs:0` again?
+
+We've managed to establish what `%fs` is, but what is `%fs:0`?
+
+The authors of System V TLS ABI for x86_64 were quite smart. `%fs` CANNOT be accessed on its own, sort of. We
+can't use it like a regular pointer to the TLS. We can only use segment registers with a displacement.
+So when we can't use `%fs`, we can use `%fs:0`! The `%fs` base points just past the TLS variables, at an 8 byte
+slot that holds a pointer back to itself, so reading `%fs:0` gives us a regular pointer into the TLS memory block.
+
+Also, the TLS variable offsets are negative! They are counted backwards from that pointer.
+
+[source,text]
+----
+The TLS memory:
+
+  Var 1   Var 2   Var 3   Var 4   ....                             The pointer
++-------------------------------------------------------------------------------+
+|       |       |       |       |       |       |       |       |               | <---+
++-------------------------------------------------------------------------------+     |
+                                                                                      |
+                                                                 ^                    |
+                                                                 |                    |
+                                                           TLS (fs base)              |
+                                                                                      |
+                                                                  %fs:0 --------------+
+----
+
+If this is too difficult to grasp (don't worry, I've spent days banging my head against a wall myself), I'll show you
+the code that handles the TLS in a bit. Now we're going to take another detour to discuss what TLS looks like
+from the perspective of the *ELF* file format.
+
+== TLS and ELF relationship
+
+I'm not going to go out of my way to explain the ELF format entirely - it's out of scope for today, but I'll link
+a useful article here: https://wiki.osdev.org/ELF. It's a great read on the basics of the ELF format.
+
+++++
+
+ ELF file diagram +
+++++
+~https://wiki.osdev.org/images/f/fe/Elfdiagram.png~
+
+ELF has the so-called "sections". A section is a piece of data that makes up the final executable. A section can
+be `.text` where your executable code resides or `.rodata` where your read-only data sits (like string literals).
+
+ELF also has a special TLS section. This may seem confusing, since why would ELF store some sort of TLS, when
+each task must have its own? The TLS section is actually a template/"meta" section. It's not the actual TLS, but
+rather a template of how the TLS should be constructed.
+
+For example:
+
+[source,c]
+----
+__thread int a = 123;
+
+void my_thread (void) {
+  printf ("a = %d\n", a);
+
+  a = 456;
+
+  printf ("a = %d\n", a);
+}
+----
+
+The first printf will display 123, because the TLS template says that `a` shall have an initial value of 123, but
+then the thread is free to modify its own version. It just starts out with what is provided by the ELF file.
+
+=== Linking the user application
+
+An ELF application has to be linked after we've compiled all the necessary object files.
+
+++++
+
+ Compiler pipeline +
+++++
+~https://media.geeksforgeeks.org/wp-content/uploads/20250208151053192719/linker-660.jpg~
+
+To get the exact ELF layout we need (remember, we're making our own OS), we can use a linker script.
+
+[source,text]
+----
+OUTPUT_FORMAT(elf64-x86-64)
+
+ENTRY(_start)
+
+PHDRS {
+    text PT_LOAD;
+    rodata PT_LOAD;
+    data PT_LOAD;
+    bss PT_LOAD;
+    tls PT_TLS; /* <------ !!!! */
+}
+
+SECTIONS {
+    . = 0x0000500000000000;
+
+    /* The executable code instructions */
+    .text : {
+        *(.text .text.*)
+        *(.ltext .ltext.*)
+    } :text
+
+    . = ALIGN(0x1000);
+
+    /* Read-only data */
+    .rodata : {
+        *(.rodata .rodata.*)
+    } :rodata
+
+    . = ALIGN(0x1000);
+
+    /* initialized data */
+    .data : {
+        *(.data .data.*)
+        *(.ldata .ldata.*)
+    } :data
+
+    . = ALIGN(0x1000);
+
+    __bss_start = .;
+
+    /* uninitialized data */
+    .bss : {
+        *(.bss .bss.*)
+        *(.lbss .lbss.*)
+    } :bss
+
+    __bss_end = .;
+
+    . = ALIGN(0x1000);
+
+    __tdata_start = .;
+
+    /* initialized TLS data */
+    .tdata : {
+        *(.tdata .tdata.*)
+    } :tls /* <------ !!!! */
+
+    __tdata_end = .;
+
+    __tbss_start = .;
+
+    /* uninitialized TLS data */
+    .tbss : {
+        *(.tbss .tbss.*)
+    } :tls /* <------ !!!! */
+
+    __tbss_end = .;
+
+    __tls_size = __tbss_end - __tdata_start;
+
+    /DISCARD/ : {
+        *(.eh_frame*)
+        *(.note .note.*)
+    }
+}
+----
+
+`PT_TLS` is the "program header" type - in this case we say that we want this part of the executable to be of
+TLS type. This will help our OS' loader distinguish between different parts of the app and decide how it should
+act upon them.
+
+Also note that we mark `.tdata` and `.tbss` both as `:tls`. This just tells the linker to place both sections
+in the same `tls` program header (which we mark as `PT_TLS`).
+
+== Loader
+
+Now let's take a look inside the ELF loader:
+
+[source,c]
+----
+    case PT_TLS: {
+#if defined(__x86_64__)
+      if (phdr->p_memsz > 0) {
+        /* What is the alignment we need to use? */
+        size_t tls_align = phdr->p_align ? phdr->p_align : sizeof (uintptr_t);
+        /* Size of the TLS memory block (variables go here) */
+        size_t tls_size = align_up (phdr->p_memsz, tls_align);
+        /* Size needed - TLS block size + 8 bytes (64 bits) for back pointer */
+        size_t tls_total_needed = tls_size + sizeof (uintptr_t);
+        /* amount of pages to allocate */
+        size_t blks = div_align_up (tls_total_needed, PAGE_SIZE);
+        /* Initialize TLS template in the procgroup. This will be copied into individual TLSes */
+        proc->procgroup->tls.tls_tmpl_pages = blks;
+        proc->procgroup->tls.tls_tmpl_size = tls_size;
+        proc->procgroup->tls.tls_tmpl_total_size = tls_total_needed;
+
+        /* malloc () and zero out */
+        proc->procgroup->tls.tls_tmpl = malloc (blks * PAGE_SIZE);
+        memset (proc->procgroup->tls.tls_tmpl, 0, blks * PAGE_SIZE);
+
+        /* copy initialized stuff (.tdata); the rest (.tbss) stays zeroed */
+        memcpy (proc->procgroup->tls.tls_tmpl, (void*)((uintptr_t)elf + phdr->p_offset),
+                phdr->p_filesz);
+
+        proc_init_tls (proc);
+      }
+#endif
+    } break;
+----
+
+[source,c]
+----
+void proc_init_tls (struct proc* proc) {
+  struct limine_hhdm_response* hhdm = limine_hhdm_request.response;
+
+  /* This application doesn't use TLS */
+  if (proc->procgroup->tls.tls_tmpl == NULL)
+    return;
+
+  size_t tls_size = proc->procgroup->tls.tls_tmpl_size;
+  size_t pages = proc->procgroup->tls.tls_tmpl_pages;
+
+  uintptr_t tls_paddr;
+  uint32_t flags = MM_PG_USER | MM_PG_PRESENT | MM_PG_RW;
+
+  /* allocate a new TLS memory space and map it into the procgroup's address space */
+  uintptr_t tls_vaddr = procgroup_map (proc->procgroup, 0, pages, flags, &tls_paddr);
+
+  uintptr_t k_tls_addr = (uintptr_t)hhdm->offset + tls_paddr;
+
+  /* zero and copy the template contents */
+  memset ((void*)k_tls_addr, 0, pages * PAGE_SIZE);
+  memcpy ((void*)k_tls_addr, (void*)proc->procgroup->tls.tls_tmpl, tls_size);
+
+  /* kernel address and user address + size will point to the tls pointer */
+  uintptr_t ktcb = k_tls_addr + tls_size;
+  uintptr_t utcb = tls_vaddr + tls_size;
+
+  /* write the pointer value, which makes the TLS point to itself */
+  *(uintptr_t*)ktcb = utcb;
+
+  /* store as fs_base for switching during scheduling */
+  proc->pdata.fs_base = utcb;
+  /* save allocation address to later free it when not needed */
+  proc->pdata.tls_vaddr = tls_vaddr;
+}
+----
+
+== Conclusion
+
+And that's it! We can now use TLS in user apps!
+
+[source,c]
+----
+#define MUTEX 2000
+
+LOCAL volatile char letter = 'c';
+
+void app_proc1 (void) {
+  letter = 'a';
+
+  for (;;) {
+    mutex_lock (MUTEX);
+
+    for (int i = 0; i < 3; i++)
+      test (letter);
+
+    mutex_unlock (MUTEX);
+  }
+
+  process_quit ();
+}
+
+void app_proc2 (void) {
+  letter = 'b';
+
+  for (;;) {
+    mutex_lock (MUTEX);
+
+    for (int i = 0; i < 3; i++)
+      test (letter);
+
+    mutex_unlock (MUTEX);
+  }
+
+  process_quit ();
+}
+
+void app_proc3 (void) {
+  letter = 'c';
+
+  for (;;) {
+    mutex_lock (MUTEX);
+
+    for (int i = 0; i < 3; i++)
+      test (letter);
+
+    mutex_unlock (MUTEX);
+  }
+
+  process_quit ();
+}
+
+void app_main (void) {
+  mutex_create (MUTEX);
+
+  letter = 'a';
+
+  process_spawn (&app_proc1, NULL);
+  process_spawn (&app_proc2, NULL);
+  process_spawn (&app_proc3, NULL);
+
+  for (;;) {
+    mutex_lock (MUTEX);
+
+    for (int i = 0; i < 3; i++)
+      test (letter);
+
+    mutex_unlock (MUTEX);
+  }
+}
+----
+
+=== My personal thoughts
+
+image::/img/sisyphus.jpeg["Literally me"]
+~https://miro.medium.com/1*zW3S02mX5hqkpBBx1YUWhQ.jpeg~
+
+This was difficult... Way too difficult to implement. When reading the spec and then trying to make it work, I
+noticed that all this pointer/size/alignment trickery is just so we can get around the fact that x86_64 doesn't
+have a built-in architectural mechanism to support such a thing as TLS. All you have is a bunch of free registers
+and it's up to you to make something out of that. I guess ARM is better in this case, because there's a single
+source of authority that produces the CPU and sets the rules to abide by.
diff --git a/templates/header.ftl b/templates/header.ftl index 240faac..40c182c 100644 --- a/templates/header.ftl +++ b/templates/header.ftl @@ -4,10 +4,13 @@ <#if (content.title)??>${content.title}<#else>JBake</#if> - - - + + + + <#if content['og-image']??> + + @@ -29,4 +32,4 @@
- \ No newline at end of file +