The peak size of the stack for a given thread of execution in a computer program is impossible to predict in general, and even when it can be predicted (in the absence of recursion and conditional allocation) the predicted bound can be a pretty pessimistic estimate of the actual stack requirements.

As a consequence, threading support in modern operating systems, as well as application-level threads as implemented by some languages such as Go, tends to allocate each thread a pretty large stack. This is a good tradeoff on a machine with a virtual memory manager and plenty of address space per process, but is unreasonable on a memory-constrained system like a microcontroller.

Protothreads, or "stackless threads", are a tool for providing multiple threads of execution without allocating a separate stack for each. This feat is achieved by placing several constraints on the code within the thread: blocking is only permitted in the top-level function of the thread, the top-level function's local variables cannot be allocated on the stack nor persisted in registers, and these "threads" can only yield to others at predefined points in the program (co-operative scheduling).

With these constraints imposed, it is possible to simulate blocking by returning control to the caller (a basic scheduler, presumably) after retaining a record of where in the function execution should resume on the next call. This technique has most commonly been implemented in C via preprocessor macros, such as in Adam Dunkels' Protothreads Library. Here is a raw, macro-free example using the GCC computed goto extension, which keeps the technique fully general:

#include <stdio.h>

/* Hypothetical scheduler hook: arranges for the task to be called again. */
void add_task_to_queue(void (*task)(void));

void protothread(void) {
    static void *pos = &&begin;
    goto *pos;

  begin:
    // Execution begins here on the first call.
    puts("Before I yield");
    // Register for some event and return to caller.
    // Caller will re-call us when the event occurs.
    pos = &&resume;
    add_task_to_queue(&protothread);
    return;
  resume:
    // Execution begins here on the second call.
    puts("After I resume");
}

Notice that the position is retained in a static variable to ensure that its value persists across calls. Any other local variables must also be declared as static for this technique to work.
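
For example, if the thread needs a loop counter, that counter must itself be static so that its value survives each return. Here is a sketch, reusing the hypothetical add_task_to_queue hook from above, of a protothread that yields on each pass through a loop:

#include <stdio.h>

void add_task_to_queue(void (*task)(void)); /* hypothetical, as above */

void counting_protothread(void) {
    static void *pos = &&begin;
    static int count; /* static so the value survives each return */
    goto *pos;

  begin:
    for (count = 0; count < 3; count++) {
        printf("iteration %d\n", count);
        // Yield: record the resume point and return to the caller.
        pos = &&resume;
        add_task_to_queue(&counting_protothread);
        return;
      resume:;
    }
    puts("done");
}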

Although clever use of macros allows the details of this technique to be somewhat hidden in C, the fact that it is being implemented in spite of the language prevents the result from being fully natural.
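
For illustration, here is a minimal sketch of the switch-on-line-number trick that such macro libraries are built on. This is the flavor of Dunkels' "local continuations", not his actual API; the saved line number plays the role that the computed-goto label played above:

#include <stdio.h>

/* Each yield records the current line number and returns; on the next
   call, the switch jumps straight to the matching case label. */
#define PT_BEGIN() static int lc = 0; switch (lc) { case 0:
#define PT_YIELD() do { lc = __LINE__; return; case __LINE__:; } while (0)
#define PT_END()   }

/* The scheduler is assumed to keep calling this until it completes. */
void macro_protothread(void) {
    PT_BEGIN();
    puts("Before I yield");
    PT_YIELD();
    puts("After I resume");
    PT_END();
}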

However, a hypothetical language that has protothreads as a built-in feature can in principle hide the plumbing in a much more robust way, by providing native language constructs for yielding and by automatically hoisting all of the local variables into global variables.

Such a language would still have to generate something resembling the above C code in its backend, however. In that vein, I set about seeing what this technique could look like implemented in LLVM assembly, which is a somewhat-popular choice of intermediate representation between a language's parser and its code generation backend.

Let's dive in.

Data Structures

To implement a basic cooperative scheduler we need to start with a data structure representing a thread, along with a queue of threads that are ready to run. For simplicity's sake, I combined these two concepts into a doubly-linked list of runnable threads, whose members are of type %cont, which is short for "continuation":

%cont = type {
    %cont*,          ; 0: previous item in queue linked list
    %cont*,          ; 1: next item in queue linked list
    void(i8*, i8*)*, ; 2: protothread function to resume into
    i8*,             ; 3: context (the "local scope" of the task)
    i8*              ; 4: address of block to resume into
}

This design is slightly more sophisticated than the C example above since it allows for multiple instances of the same thread code to be active at once, each with its own "context". The context here is typed as i8* but that type is just a placeholder for a pointer to some data structure that's constructed from the local variables of the implementation function.
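
In C terms, the equivalent declaration would look roughly like this:

struct cont {
    struct cont *prev;            /* previous item in queue */
    struct cont *next;            /* next item in queue */
    void (*func)(void *, void *); /* protothread function to resume into */
    void *context;                /* the "local scope" of the task */
    void *resume;                 /* address of block to resume into */
};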

We can wrap around that a straightforward "queue" concept:

%queue = type {
    %cont*, ; 0: last item in the queue linked list
    %cont*  ; 1: first item in the queue linked list
}

In order to simplify management of the linked list, at the expense of some non-obvious trickery, the pointer to the queue instance itself -- after a bitcast to %cont* -- is used as the "end of list" sentinel. By placing the queue structure members in "reverse order", the list manipulation code can simply treat the queue as a list item, automatically updating the last/first pointers when items are added to or removed from the edges of the list.
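
A C rendering makes the overlay easier to see: with the members in this order, a pointer to the queue can be punned into a pointer to a %cont whose prev and next fields are the queue's last and first pointers. The cast below is doing the job of the IR's bitcast:

struct cont; /* as sketched above */

struct queue {
    struct cont *last;  /* overlays cont's prev field when cast */
    struct cont *first; /* overlays cont's next field when cast */
};

/* An empty queue points at itself in both directions, making it the
   sentinel node of a circular doubly-linked list. */
void queue_init(struct queue *q) {
    q->last = (struct cont *)q;
    q->first = (struct cont *)q;
}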

By embedding the linked list pointers inside the continuation structure we create the constraint that each continuation can be in only one queue at a time, which suffices for this test, since there's only one queue.

Finally, we create the type for the scheduler itself and the singleton global instance of it:

%scheduler = type {
    %queue  ; 0: The ready queue
}

; The global scheduler state
@sched = global %scheduler {
    %queue {
        ; Initially the queue is empty, so the start/end elements are
        ; the queue itself, which acts as the sentinel value.
        %cont* bitcast(%queue* getelementptr(
            %scheduler* @sched, i32 0, i32 0) to %cont*
        ),
        %cont* bitcast(%queue* getelementptr (
            %scheduler* @sched, i32 0, i32 0) to %cont*
        )
    }
}

As promised, we bitcast the pointer to the queue so LLVM will accept it as a pointer to a %cont, and assign it as both the head and tail of the list, creating an empty list.

Queue Management

There are only two operations supported on the ready queue: appending an item to it, signalling that it has become ready, and removing an item from it, signalling that it has become blocked.

These are just normal linked list operations, although they are a little more verbose than usual when written in LLVM IR, due to the need to explicitly dereference all of the structs and pointers:

define void @cont_ready(%cont* %c, i8* %blockaddr) {
    ; Get a pointer to the last item in the ready queue
    %last_cont_pp = getelementptr %scheduler* @sched, i32 0, i32 0, i32 0
    %last_cont_p = load %cont** %last_cont_pp

    ; Get a pointer to the last item's "next" pointer
    %last_cont_next_pp = getelementptr %cont* %last_cont_p, i32 0, i32 1

    ; Get pointers to the new item's previous and next pointers
    %c_prev_pp = getelementptr %cont* %c, i32 0, i32 0
    %c_next_pp = getelementptr %cont* %c, i32 0, i32 1

    ; The last item's next pointer now points to the new item
    store %cont* %c, %cont** %last_cont_next_pp
    ; The new item's previous pointer points to the last item
    store %cont* %last_cont_p, %cont** %c_prev_pp

    ; Remember the %blockaddr we will resume with
    %cont_blockaddr_pp = getelementptr %cont* %c, i32 0, i32 4
    store i8* %blockaddr, i8** %cont_blockaddr_pp

    ; The new item's next pointer points to the dummy item, which
    ; represents the end of the list.
    %dummycont = bitcast %queue* getelementptr(
        %scheduler* @sched, i32 0, i32 0
    ) to %cont*
    store %cont* %dummycont, %cont** %c_next_pp

    ; Finally, the last item in the queue is now the new item
    store %cont* %c, %cont** %last_cont_pp

    ret void
}

define void @cont_blocked(%cont* %c) {

    ; Pointers to the pointers to the next and previous items
    %prev_cont_pp = getelementptr %cont* %c, i32 0, i32 0
    %next_cont_pp = getelementptr %cont* %c, i32 0, i32 1

    ; Pointers to the next and previous items
    %prev_cont_p = load %cont** %prev_cont_pp
    %next_cont_p = load %cont** %next_cont_pp

    ; Pointer to the previous item's pointer to its next item
    %prev_next_pp = getelementptr %cont* %prev_cont_p, i32 0, i32 1
    ; Pointer to the next item's pointer to its previous item
    %next_prev_pp = getelementptr %cont* %next_cont_p, i32 0, i32 0

    ; Point the previous at the next and vice-versa, eliminating
    ; %c from the queue altogether.
    store %cont* %next_cont_p, %cont** %prev_next_pp
    store %cont* %prev_cont_p, %cont** %next_prev_pp

    ret void

}

Our strange handling of the sentinel values pays off here, since the code for managing the list "accidentally" updates the head/tail pointers in the queue structure when the item being manipulated happens to be at one of the ends of the list, just by dereferencing the sentinel pointer that refers to the queue.
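
For comparison, here are the same two operations in C, with the supporting declarations repeated so the snippet stands alone; the sentinel punning is much easier to see at this level:

struct cont {
    struct cont *prev, *next;
    void (*func)(void *, void *);
    void *context;
    void *resume;
};
struct queue { struct cont *last, *first; };
struct scheduler { struct queue ready; };

/* Like the IR global, the empty queue initially points at itself. */
struct scheduler sched = {
    { (struct cont *)&sched.ready, (struct cont *)&sched.ready }
};

void cont_ready(struct cont *c, void *blockaddr) {
    struct cont *sentinel = (struct cont *)&sched.ready;
    struct cont *last = sentinel->prev; /* reads sched.ready.last */

    last->next = c;        /* old tail points at the new item */
    c->prev = last;
    c->resume = blockaddr; /* where to resume on the next run */
    c->next = sentinel;    /* new item is now the end of the list */
    sentinel->prev = c;    /* writes sched.ready.last */
}

void cont_blocked(struct cont *c) {
    /* When c is at either end of the list, c->prev or c->next is the
       sentinel, so these stores update the queue's first/last fields. */
    c->prev->next = c->next;
    c->next->prev = c->prev;
}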

The Scheduler Function

Next up is the scheduler itself. The flow here is pretty simple. First we locate the first item in the queue, then we remove it from the queue, and then we call its implementation function, passing the saved resume address and context as parameters. Once the queue is empty (signalled by the first item being the queue itself), the scheduler returns:

define void @scheduler_loop() {
    %dummycont = bitcast %queue* getelementptr(
        %scheduler* @sched, i32 0, i32 0
    ) to %cont*
    br label %loop

    loop:
        call void (i8*)* @puts(
            i8* getelementptr inbounds (
                [15 x i8]* @sched_loop, i32 0, i32 0
            )
        )
        ; Get the first continuation from the queue.
        %nextcont_pp = getelementptr %scheduler* @sched, i32 0, i32 0, i32 1
        %nextcont_p = load %cont** %nextcont_pp
        ; The queue is empty if the first item is the dummy item
        %is_empty = icmp eq %cont* %nextcont_p, %dummycont

        ; Exit the loop if the queue is empty.
        br i1 %is_empty, label %done, label %do

    do:
        call void (i8*)* @puts(
            i8* getelementptr inbounds (
                [15 x i8]* @sched_do, i32 0, i32 0
            )
        )

        ; Remove the continuation from the queue by marking it as blocked.
        ; (It's not really blocked but it'll put itself back in the
        ;  ready queue if it wants to run again.)
        call void (%cont*)* @cont_blocked(%cont* %nextcont_p)

        ; Get the pointers to the resume data in the cont struct
        %nextcont_func_pp = getelementptr %cont* %nextcont_p, i32 0, i32 2
        %nextcont_context_pp = getelementptr %cont* %nextcont_p, i32 0, i32 3
        %nextcont_blockaddr_p = getelementptr %cont* %nextcont_p, i32 0, i32 4

        ; Get the resume data from those pointers
        %nextcont_func_p = load void(i8*, i8*)** %nextcont_func_pp
        %nextcont_blockaddr = load i8** %nextcont_blockaddr_p
        %nextcont_context_p = load i8** %nextcont_context_pp

        ; Call into the protothread's function with the given
        ; blockaddress, causing it to resume from that point.
        call void(i8*, i8*)* %nextcont_func_p(
            i8* %nextcont_blockaddr, i8* %nextcont_context_p
        )

        br label %loop

    done:
        call void (i8*)* @puts(
            i8* getelementptr inbounds (
                [15 x i8]* @sched_done, i32 0, i32 0
            )
        )
        ret void
}
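
Stripped of the IR's verbosity, and reusing the C declarations from the previous sketch, the scheduler loop comes down to this:

#include <stdio.h>

void scheduler_loop(void) {
    struct cont *sentinel = (struct cont *)&sched.ready;

    for (;;) {
        puts("Sched Loop");
        struct cont *c = sched.ready.first;
        if (c == sentinel) /* the sentinel first means the queue is empty */
            break;

        puts("Sched Do");
        cont_blocked(c); /* dequeue; the thread re-queues itself to yield */
        c->func(c->resume, c->context);
    }
    puts("Sched Done");
}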

Thread Implementations

Now we just need some thread implementation functions to test with. To keep things simple, our thread functions will just loop a certain number of times, yielding to the scheduler after each iteration, before finally exiting. The final program has two threads, but since their implementations are largely identical I'll focus on only one here.

First we need to allocate the global variables that will represent the thread's state. In a real compiler the thread's context would be a structure type, but to keep things simple in this hand-written LLVM assembly I just used a single integer:

; Context type and instance
%proto1_context_t = type i8
@proto1_context = global %proto1_context_t 0

; Pre-allocate the thread's continuation object, and fill out the
; function and context pointers since they will never change.
@proto1_state = global %cont {
    %cont* null,
    %cont* null,
    void(i8*, i8*)* @proto1,
    i8* bitcast(%proto1_context_t* @proto1_context to i8*),
    i8* null
}

The bitcast to i8* is of course unnecessary since the context type is already of this type, but it's included to show that in a real implementation such a cast would be required.

That just leaves the thread's implementation function:

define void @proto1(i8* %blockaddr, i8* %context_p) {
    ; Jump immediately into the requested block.
    indirectbr i8* %blockaddr, [ label %begin, label %loop ]

    begin:
        call void (i8*)* @puts(
            i8* getelementptr inbounds (
                [15 x i8]* @proto_1_start, i32 0, i32 0
            )
        )
        br label %loop


    loop:
        call void (i8*)* @puts(
            i8* getelementptr inbounds (
                [15 x i8]* @proto_1_loop, i32 0, i32 0
            )
        )

        %count = load i8* %context_p
        %go_again = icmp ult i8 %count, 2
        br i1 %go_again, label %again, label %exit

    again:
        %new_count = add i8 %count, 1
        store i8 %new_count, i8* %context_p
        call void (%cont*, i8*)* @cont_ready(
            %cont* @proto1_state, i8* blockaddress(@proto1, %loop)
        )
        ret void

    exit:
        call void (i8*)* @puts(
            i8* getelementptr inbounds (
                [15 x i8]* @proto_1_exit, i32 0, i32 0
            )
        )
        ret void
}

The indirectbr at the beginning of the function is exactly equivalent to the computed goto we used in the original C example, but in this case the target is passed as a parameter since static local variables are not a concept in LLVM.

The code implementing the yield to the scheduler is worth isolating to better see how it works:

call void (%cont*, i8*)* @cont_ready(
    %cont* @proto1_state, i8* blockaddress(@proto1, %loop)
)
ret void

We use the @cont_ready function from earlier to place the thread back into the ready queue (we have no events to block on here, so we're just giving other threads an opportunity to run). The blockaddress expression serves the same purpose as the &&label syntax in GCC's computed goto C extension.

Testing It Out

Finally, we just need a little entry point function to kick things off:

define i32 @main() {
    call void (i8*)* @puts(
        i8* getelementptr inbounds (
            [7 x i8]* @hello, i32 0, i32 0
        )
    )

    ; Mark our two tasks as ready, starting from their 'begin' blocks.
    call void (%cont*, i8*)* @cont_ready(
        %cont* @proto1_state, i8* blockaddress(@proto1, %begin)
    )
    call void (%cont*, i8*)* @cont_ready(
        %cont* @proto2_state, i8* blockaddress(@proto2, %begin)
    )

    ; Run the scheduler to give the tasks a chance to run.
    ; Won't return until the tasks have both completed.
    call void ()* @scheduler_loop()

    ret i32 0
}

Again we need to mark the threads as ready, but this time we make the continuation refer to the begin block in each function.

Once translated to native assembly, linked, and executed, the result looks something like this:

Hello
Sched Loop
Sched Done
Sched Loop
Sched Do
Thread 1 Start
Thread 1 Loop
Sched Loop
Sched Do
Thread 2 Start
Thread 2 Loop
Sched Loop
Sched Do
Thread 1 Loop
Sched Loop
Sched Do
Thread 2 Loop
Sched Loop
Sched Do
Thread 1 Loop
Thread 1 Exit
Sched Loop
Sched Do
Thread 2 Loop
Sched Loop
Sched Do
Thread 2 Loop
Thread 2 Exit
Sched Loop
Sched Done

This output (produced by printing some strings that weren't included in the above snippets, but are in the full program linked below) shows how each pass of the scheduler runs a slice of code from one of the threads, with each of them starting up, looping several times, and then exiting. The second thread is set to loop one more time than the first, so it gets two more turns from the scheduler after the first thread exits.

Conclusion

Hand-writing LLVM assembly gets a bit verbose and tedious at times, but this example shows that a compiler using LLVM as a backend can implement a basic protothread scheduler with only a few modifications to the code generator:

  • Arrange for thread functions to load/store their local variables through the context structure rather than through pointers allocated with alloca.

  • Ensure that the generated code does not depend on the values in named registers persisting across a blocking call.

  • Insert thread-scheduling calls and extra ret instructions into the program at points where a thread will yield.

The first of these points does, however, impose an optimization penalty on the generated function: ordinarily, local variables created with alloca are promoted to SSA registers by the mem2reg optimization pass, but that promotion is not possible when the local variables are effectively global. The resulting code will therefore have many more load and store operations than normal, though other optimization passes may help somewhat.
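
In C terms, the penalty is the difference between these two functions: the first counter can live entirely in registers, while the hoisted version (with a hypothetical context struct) must go through memory on every access:

/* Hypothetical context structure holding the hoisted locals. */
struct ctx { int i; int sum; };

/* Ordinary locals: mem2reg promotes them to SSA registers, so the
   loop compiles down to pure register arithmetic. */
int sum_local(void) {
    int sum = 0;
    for (int i = 0; i < 10; i++)
        sum += i;
    return sum;
}

/* Hoisted "locals": every access is a real load or store, because the
   values must survive a return-and-resume at any yield point. */
int sum_hoisted(struct ctx *c) {
    c->sum = 0;
    for (c->i = 0; c->i < 10; c->i++)
        c->sum += c->i;
    return c->sum;
}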

The full program, and a Makefile to build it, are available for those who would like to study it further or build something from it. It's released into the public domain in the hope that it will be useful, but with no warranty of quality or fitness for purpose: it's just a prototype.

I intend to apply a technique similar to this in my work-in-progress programming language Alamatic, which attempts to bring the power of modern abstractions to the limited environment of microcontrollers. It is my hope that protothreads will allow for a more natural programming model when dealing with asynchronous operations, as well as affording good composability between device driver code and application code, while avoiding the need to juggle multiple stacks in a small address space.