The peak size of the stack for a given thread of execution in a computer program is impossible to predict in general, and even when it can be predicted (in the absence of recursion and conditional allocation), the predicted peak is often a pessimistic estimate of the stack space the thread actually uses in practice.

As a consequence, threading support in modern operating systems, as well as application-level threads as implemented by some languages such as Go, tends to allocate each thread a pretty large stack. This is a good tradeoff on a machine with a virtual memory manager and plenty of address space per process, but is unreasonable on a memory-constrained system like a microcontroller.

Protothreads, or "stackless threads", are a tool for providing multiple threads of execution without allocating a separate stack for each. This feat is achieved by placing several constraints on the code within the thread: blocking is only permitted in the top-level function of the thread, the top-level function's local variables cannot be allocated on the stack or kept in registers across yields, and these "threads" can only yield to others at predefined points in the program (co-operative scheduling).

With these constraints imposed, it is possible to simulate blocking by returning control to the caller (a basic scheduler, presumably) after retaining a record of where in the function execution should resume on the next call. This technique has most commonly been implemented in C via preprocessor macros, such as in Adam Dunkels' Protothreads Library. Here is a raw, macro-free example using the GCC computed goto extension, which is fully general: unlike the switch-based trick used by some macro implementations, it can resume at a point nested inside any control structure.
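Something along these lines (the thread function, its iteration count, and the trivial driver loop standing in for a scheduler are all illustrative):

    #include <stdio.h>

    /* Returns nonzero while the protothread still has work to do. */
    static int thread(void)
    {
        /* The resume position must survive across calls, so it is
           static; it initially points at the top of the thread body. */
        static void *resume_at = &&begin;

        /* Any "local" variables must be static too, because the real
           stack frame is destroyed every time the thread yields. */
        static int count;

        goto *resume_at;    /* GCC computed goto */

    begin:
        for (count = 0; count < 3; count++) {
            printf("iteration %d\n", count);

            /* Yield: record where to pick up on the next call, then
               return control to the caller. */
            resume_at = &&resumed;
            return 1;
    resumed:
            ;
        }
        return 0;           /* the thread has exited */
    }

    int main(void)
    {
        /* A trivial "scheduler": keep running the one thread until
           it exits. */
        while (thread())
            ;
        return 0;
    }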

Notice that the position is retained in a static variable to ensure that its value persists across calls. Any other local variables must also be declared as static for this technique to work.

Although clever use of macros allows the details of this technique to be somewhat hidden in C, the fact that it is being implemented in spite of the language prevents the result from being fully natural.

However, a hypothetical language that has protothreads as a built-in feature can in principle hide the plumbing in a much more robust way, by providing native language constructs for yielding and by automatically hoisting all of the local variables into global variables.

Such a language would still have to generate something resembling the above C code in its backend, however. In that vein, I set about seeing what this technique could look like implemented in LLVM assembly, which is a somewhat-popular choice of intermediate representation between a language's parser and its code generation backend.

Let's dive in.

Data Structures

To implement a basic cooperative scheduler we need to start with a data structure representing a thread, along with a queue of threads that are ready to run. For simplicity's sake, I combined these two concepts into a doubly-linked list of runnable threads, whose members are of type %cont, which is short for "continuation":
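A sketch, with illustrative field order and an assumed signature for the implementation function:

    %cont = type {
      %cont*,               ; previous runnable continuation
      %cont*,               ; next runnable continuation
      void (i8*, i8*)*,     ; the thread's implementation function
      i8*,                  ; address at which to resume execution
      i8*                   ; the thread's "context" (its local variables)
    }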

This design is slightly more sophisticated than the C example above since it allows for multiple instances of the same thread code to be active at once, each with its own "context". The context here is typed as i8* but that type is just a placeholder for a pointer to some data structure that's constructed from the local variables of the implementation function.

We can wrap a straightforward "queue" concept around that:
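Again a sketch; the field order is deliberate, as explained below:

    %queue = type {
      %cont*,   ; last runnable continuation
      %cont*    ; first runnable continuation
    }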

In order to simplify management of the linked list, at the expense of some non-obvious trickery, the pointer to the queue instance itself (after a bitcast to %cont*) is used as the "end of list" sentinel. Because the queue structure's members are declared in "reverse order" (last pointer before first), they line up with the prev/next pointers of a %cont, so the list manipulation code can simply treat the queue as a list item, automatically updating the last/first pointers when items are added and removed from the edges of the list.

By embedding the linked list pointers inside the continuation structure we create the constraint that each continuation can be in only one queue at a time, which suffices for this test, since there's only one queue.

Finally, we create the type for the scheduler itself and the singleton global instance of it:
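Assuming the ready queue is (so far) the scheduler's only member:

    %sched = type {
      %queue    ; the ready queue; placed first so that a pointer to
                ; the scheduler is also a pointer to its queue
    }

    @sched = global %sched { %queue {
      %cont* bitcast (%sched* @sched to %cont*),  ; last: the sentinel itself
      %cont* bitcast (%sched* @sched to %cont*)   ; first: likewise (empty list)
    } }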

As promised, we bitcast the pointer to the queue so LLVM will accept it as a pointer to a %cont, and assign it as both the head and tail of the list, creating an empty list.

Queue Management

There are only two operations supported on the ready queue: appending an item, signalling that it has become ready, and removing an item, signalling that it has become blocked.

These are just normal linked list operations, although they are a little more verbose than usual when written in LLVM IR, due to the need to explicitly dereference all of the structs and pointers:
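In sketch form (@cont_ready is the name used later in this article; @cont_blocked is a name assumed here):

    define void @cont_ready(%cont* %c) {
    entry:
      ; The queue, viewed as a list item, is the end-of-list sentinel.
      %sentinel = bitcast %sched* @sched to %cont*

      ; Load the current last item in the list.
      %last.p = getelementptr inbounds %cont, %cont* %sentinel, i32 0, i32 0
      %last = load %cont*, %cont** %last.p

      ; Link the new item in between the old last item and the sentinel.
      %c.prev.p = getelementptr inbounds %cont, %cont* %c, i32 0, i32 0
      store %cont* %last, %cont** %c.prev.p
      %c.next.p = getelementptr inbounds %cont, %cont* %c, i32 0, i32 1
      store %cont* %sentinel, %cont** %c.next.p
      %last.next.p = getelementptr inbounds %cont, %cont* %last, i32 0, i32 1
      store %cont* %c, %cont** %last.next.p
      store %cont* %c, %cont** %last.p
      ret void
    }

    define void @cont_blocked(%cont* %c) {
    entry:
      ; Unlink %c by joining its neighbours to each other; if a
      ; neighbour is the sentinel, this transparently updates the
      ; queue's own first/last pointers.
      %prev.p = getelementptr inbounds %cont, %cont* %c, i32 0, i32 0
      %prev = load %cont*, %cont** %prev.p
      %next.p = getelementptr inbounds %cont, %cont* %c, i32 0, i32 1
      %next = load %cont*, %cont** %next.p
      %prev.next.p = getelementptr inbounds %cont, %cont* %prev, i32 0, i32 1
      store %cont* %next, %cont** %prev.next.p
      %next.prev.p = getelementptr inbounds %cont, %cont* %next, i32 0, i32 0
      store %cont* %prev, %cont** %next.prev.p
      ret void
    }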

Our strange handling of the sentinel values pays off here, since the code for managing the list "accidentally" updates the head/tail pointers in the queue structure when the item being manipulated happens to be at one of the ends of the list, just by dereferencing the sentinel pointer that refers to the queue.

The Scheduler Function

Next up is the scheduler itself. The flow here is pretty simple. First we locate the first item in the queue, then we remove it from the queue, and then we execute the given implementation function with the state data as parameters. Once the queue is empty (signalled by the first item being the queue itself), the scheduler returns:
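A sketch of that flow (the name @sched_run is assumed; the printing of the "Sched" trace lines shown later is omitted):

    define void @sched_run() {
    entry:
      %sentinel = bitcast %sched* @sched to %cont*
      br label %loop

    loop:
      ; The first runnable continuation is the sentinel's "next".
      %first.p = getelementptr inbounds %cont, %cont* %sentinel, i32 0, i32 1
      %first = load %cont*, %cont** %first.p
      %empty = icmp eq %cont* %first, %sentinel
      br i1 %empty, label %done, label %dispatch

    dispatch:
      ; Remove the continuation from the queue; it must call
      ; @cont_ready again if it wants another turn.
      call void @cont_blocked(%cont* %first)

      ; Load the implementation function, resume address and context...
      %impl.p = getelementptr inbounds %cont, %cont* %first, i32 0, i32 2
      %impl = load void (i8*, i8*)*, void (i8*, i8*)** %impl.p
      %resume.p = getelementptr inbounds %cont, %cont* %first, i32 0, i32 3
      %resume = load i8*, i8** %resume.p
      %ctx.p = getelementptr inbounds %cont, %cont* %first, i32 0, i32 4
      %ctx = load i8*, i8** %ctx.p

      ; ...and run the thread until it yields or exits.
      call void %impl(i8* %resume, i8* %ctx)
      br label %loop

    done:
      ret void
    }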

Thread Implementations

Now we just need some thread implementation functions to test with. To keep things simple, our thread functions will just loop a certain number of times, yielding to the scheduler after each iteration, before finally exiting. The final program has two threads, but since their implementations are largely identical I'll focus on only one here.

First we need to allocate the global variables that will represent the thread's state. In a real compiler the thread's context would be a structure type, but to keep things simple in this hand-written LLVM assembly I just used a single integer:
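For thread 1, assuming an i8 counter and the %cont layout sketched earlier:

    ; The thread's only "local variable": its loop counter.
    @thread1_count = global i8 0

    ; The continuation record for thread 1. The list pointers start out
    ; null and are set when @cont_ready links it into the ready queue.
    @thread1_cont = global %cont {
      %cont* null,
      %cont* null,
      void (i8*, i8*)* @thread1_run,
      i8* null,
      i8* bitcast (i8* @thread1_count to i8*)   ; the context pointer
    }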

The bitcast to i8* is of course unnecessary since the context type is already of this type, but it's included to show that in a real implementation such a cast would be required.

That just leaves the thread's implementation function:
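Sketched here with thread 1 looping three times, matching the output shown later (the prints themselves are omitted):

    define void @thread1_run(i8* %at, i8* %ctx) {
    entry:
      ; Jump to wherever this thread last left off.
      indirectbr i8* %at, [label %begin, label %loop]

    begin:
      ; Thread startup: zero the loop counter.
      store i8 0, i8* %ctx
      br label %loop

    loop:
      ; One iteration: bump the counter, then either yield or exit.
      %count = load i8, i8* %ctx
      %count1 = add i8 %count, 1
      store i8 %count1, i8* %ctx
      %again = icmp ult i8 %count1, 3
      br i1 %again, label %yield, label %exit

    yield:
      ; Put ourselves back on the ready queue, record that the next
      ; call should resume at %loop, and return to the scheduler.
      call void @cont_ready(%cont* @thread1_cont)
      store i8* blockaddress(@thread1_run, %loop),
            i8** getelementptr inbounds (%cont, %cont* @thread1_cont, i32 0, i32 3)
      ret void

    exit:
      ret void
    }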

The indirectbr at the beginning of the function is exactly equivalent to the computed goto we used in the original C example, but in this case the target is passed as a parameter since static local variables are not a concept in LLVM.

The code implementing the yield to the scheduler is worth isolating to better see how it works:
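Pulled out of the sketch above:

    yield:
      ; Re-queue this thread so it runs again later...
      call void @cont_ready(%cont* @thread1_cont)
      ; ...record where the next call should resume...
      store i8* blockaddress(@thread1_run, %loop),
            i8** getelementptr inbounds (%cont, %cont* @thread1_cont, i32 0, i32 3)
      ; ...and return control to the scheduler.
      ret void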

We use the @cont_ready function from earlier to place the thread back into the ready queue (we have no events to block on here, so we're just giving other threads an opportunity to run). The blockaddress expression serves the same purpose as the &&label syntax in GCC's computed goto C extension.

Testing It Out

Finally, we just need a little entry point function to kick things off:
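Assuming thread 2 is defined analogously to thread 1 (but looping four times), it might look like this; the initial call to @sched_run with a still-empty queue would account for the first "Sched Loop"/"Sched Done" pair in the output:

    define i32 @main() {
    entry:
      ; Run the scheduler once while the queue is still empty.
      call void @sched_run()

      ; Mark both threads as ready, each starting at its %begin block.
      store i8* blockaddress(@thread1_run, %begin),
            i8** getelementptr inbounds (%cont, %cont* @thread1_cont, i32 0, i32 3)
      call void @cont_ready(%cont* @thread1_cont)

      store i8* blockaddress(@thread2_run, %begin),
            i8** getelementptr inbounds (%cont, %cont* @thread2_cont, i32 0, i32 3)
      call void @cont_ready(%cont* @thread2_cont)

      ; Run the scheduler until every thread has exited.
      call void @sched_run()
      ret i32 0
    }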

Again we need to mark the threads as ready, but this time we make the continuation refer to the begin block in each function.

Once translated to native assembly, linked, and executed, the result looks something like this:

Hello
Sched Loop
Sched Done
Sched Loop
Sched Do
Thread 1 Start
Thread 1 Loop
Sched Loop
Sched Do
Thread 2 Start
Thread 2 Loop
Sched Loop
Sched Do
Thread 1 Loop
Sched Loop
Sched Do
Thread 2 Loop
Sched Loop
Sched Do
Thread 1 Loop
Thread 1 Exit
Sched Loop
Sched Do
Thread 2 Loop
Sched Loop
Sched Do
Thread 2 Loop
Thread 2 Exit
Sched Loop
Sched Done

This output (produced by printing some strings that weren't included in the above snippets, but are in the full program linked below) shows how each run of the scheduler runs a section of code from one of the threads, with each of them starting up, looping several times, and then exiting. The second thread runs one more iteration than the first before it exits.

Conclusion

Hand-writing LLVM assembly gets a bit verbose and tedious at times, but this example shows that a compiler using LLVM as a backend can implement a basic protothread scheduler with only a few modifications to the code generator:

  • Arrange for thread functions to load/store their local variables through the context structure rather than through memory allocated with alloca.

  • Ensure that the generated code does not depend on the values in named registers persisting across a blocking call.

  • Insert thread-scheduling calls and extra ret instructions into the program at points where a thread will yield.

The first of these points does, however, impose an optimization penalty on the generated function: ordinarily, simple local variables created with alloca are promoted to SSA registers by the mem2reg optimization pass, but this promotion is not possible when the local variables are effectively global. The resulting code will therefore have many more load and store operations than normal, though other optimization passes may help somewhat.

The full program, and a Makefile to build it, are available for those who would like to study it further or build something from it. It's released into the public domain in the hope that it will be useful, but with no warranty of quality or fitness for purpose: it's just a prototype.

I intend to apply a technique similar to this in my work-in-progress programming language Alamatic, which attempts to bring the power of modern abstractions to the limited environment of microcontrollers. It is my hope that protothreads will allow for a more natural programming model when dealing with asynchronous operations, as well as affording good composability between device driver code and application code, while avoiding the need to juggle multiple stacks in a small address space.