In this article, we will begin the process of writing a microkernel OS for the Cortex-M4 by building some simple task switching logic. The initial goal for this OS is to have support for unprivileged tasks running in thread mode. We will use cooperative multitasking early on so that we can implement some I/O bound tasks, and we will extend this to a pre-emptive multitasking model later on using the system tick timer. In this article, we’ll write a very simple cooperative multitasking implementation, which we will extend to support a barebones implementation of the Actor Model in a subsequent article.
We need a few things to get started. First, we need to implement heap allocation by plugging in some facilities expected by Newlib. We’ll allocate control structures for tasks and task stack on the heap. Next, we need to implement a queue data structure which we’ll use in this article to manage tasks, and in the next article to manage messaging. Then, we will need to make some modifications to the linker control file and the interrupt vector table we implemented in the last article in order to support our expanded heap and the service call mechanism we will use to perform context switching. While we are at it, we will implement some dummy hooks for important exceptions so if something goes wrong, we can do some postmortem analysis with the debugger. Next, we’ll write code to set up the tasks and write a skeleton service call handler. We will run what we have so far on the board to make sure that we can enter and exit the service call handler successfully. Finally, we will implement the task switching logic for the service call handler and verify that we can successfully switch between two independent tasks. If all goes well, we will have the beginning of a simple microkernel that we can extend to do some real world work. So, let’s begin.
malloc and sbrk
Every C developer understands malloc
. The malloc
implementation for the
build of Newlib that we configured in the last article is significantly scaled
down. It is good enough for our purposes, but it’s of the ordinary variety.
Memory regions allocated include an 8 byte overhead that maintains malloc
’s
internal bookkeeping. Realistically, one should avoid dynamic memory allocation
in an embedded system. We are using malloc
because it sidesteps the need for
us to write our own slab allocator. We could save some overhead at the cost of
flexibility if, for instance, we wrote a simple bump allocator. But, this would
come at the cost of easy freeing of memory. In point of fact, we have to write
a bump allocator anyway in order to use malloc
at all, but this allocator will
just do some very simple memory management of our heap and defer the bigger
decisions to malloc
itself.
Because Newlib is platform agnostic, it provides hooks that allows us to connect
the C Runtime to the operating system, or in our case, bare metal. This is why,
for instance, we had to write the _exit
method in the last article. This
method typically deals with the operating-system specific details about how to
terminate a process. The malloc
function requires a similar hook method,
called _sbrk
. The _sbrk
method is a hook between Newlib and the OS for
growing the size of the application’s heap by the given number of bytes. It
returns the pointer to a memory region that is at least that number of bytes, or
-1
if memory allocation fails.
Our version of _sbrk
uses the symbols, __HeapBase
and __HeapLimit
, defined
in our linker control file to grow the heap based on requests from malloc
.
When the heap size has been exhausted, _sbrk
returns -1
. This
implementation is a very simple bump allocator that bumps along the heap_end
pointer until the heap is exhausted.
When malloc
gets a -1
from _sbrk
, it returns NULL
to the caller.
Alternatively, we could add a debug-time assertion to the body of the second
if
statement to catch an out-of-memory error at runtime. There is little
headroom in any embedded application, so such an assertion can come in handy.
Queueing
We will use queues quite a bit in this OS. They give us a light-weight way to manage linear data. We can enqueue and dequeue items in time, and we can even implement some more interesting views on top of this data structure for insertion, such as a priority queue, that allows us to enqueue an item with priority in time. More on that in a later article.
The data structure itself is made up of a queue that holds nodes. The queue has
a head
and a tail
. The nodes are double-linked list nodes, which have a
prev
and a next
pointer. We could implement a simple queue using only a
next
pointer, but this would make it more difficult to implement some of the
more interesting algorithms we will need later on. So, we’ll deal with the
added overhead of maintaining a doubly-linked data structure.
For now, we only need three operations. QINIT
initializes a new queue.
QENQUEUE
adds an item to the end of the queue. QDEQUEUE
removes an item
from the front of the queue and assigns the variable provided as the second
argument to this item. These operations are implemented as macros to force
inlining which is important when operating under a critical section. We want to
begin a critical section, do a minimal amount of work, and release this critical
section as quickly as possible. Commonly, we will be doing one of three things
under a critical section: dequeuing, enqueuing, or dequeuing and immediately
enqueuing. This latter operation might be to shift one item from one queue to
another, or it might be to shift one item from a global to a queue and another
from the queue back to the global. For instance, the current_task
pointer
will be such a global. During a task switch, the task pointed to by the
current_task
pointer will be added to a process queue – either the run queue
or a block queue. Another task will be removed from a queue to become the
current task, and the current_task
pointer will be updated to reference this
task. Later on, we will use a similar mechanism of shifting items between
queues and pointers, as well as copying data from one task space to
another, to facilitate message passing.
These queue operations should be familiar to anyone who has worked with the data
structure before. Most of the complexity comes from handling edge cases.
There are two of note. The first is adding an element to an empty queue. The
second is removing the last element from a queue. In both cases, the head
and
tail
pointers must be cleaned up to maintain the two possible states of the
data structure. We’ll add operations to splice queues, and add / remove
elements from the middle of a queue in a later article. For now, this is
sufficient for our needs.
Linker Control Changes
In the previous article, we implemented a linker control file for the K22 part on the K22F board. The heap size in that file was large enough for testing but not large enough for this example. To add some headroom, we’ll increase this to 32k.
Interrupt Vector Changes
We need to modify the interrupt vector table to implement the Supervisor Call mechanism and to handle a few potential exceptions. We will add stubs for these exceptions which will loop forever if the exceptions are hit. If the CPU locks up, we will see the exception we hit when we break in.
The Hard Fault exception occurs when the handling of one fault causes another fault, if there is a bus error when attempting to handle another type of exception, or if an attempt is made to make a Supervisor Call by code operating at the priority of this call or higher. The Bus Fault exception occurs when an attempt is made to access an invalid memory region, when a transfer size is invalid, or when the processor is in the wrong privilege level to write to a particular hardware register. The Usage Fault exception occurs when an undefined instruction is encountered, when an invalid interrupt return address is used or when the saved program counter for the interrupt return is invalid, or when an unaligned memory address is used for load / store multiple instructions. This fault can also be configured to trigger during a divide-by-zero or an unaligned memory access. In each of these cases, there is little we can do at this point other than to halt the processor and break in to see why the exception occurred.
The Supervisor Call exception occurs when the SVC
instruction is executed.
This mechanism allows unprivileged code to call into the kernel. There are
several different ways this instruction can be used. An optional byte argument
to the instruction provides a mechanism to identify different supervisor calls.
However, we will be treating this call as if it were an EABI
method call. We
will reserve the first argument, in R0
, for the service method and submethod.
Three additional arguments, passed in R1
, R2
, and R3
, will provide
arguments to the service method. Since this follows the EABI
for ARM, this
corresponds to the first four arguments to a C function, as long as the
arguments are word size or smaller. As such, with this bit of hackery, we can
treat our svc_handler
method as a native C method. We’ll wrap this method
with a simple thunk that will preserve the current task’s context on entry and
then restore the (potentially different) current task’s context on exit, but
other than this low-level hackery, the calling convention will follow EABI.
Finally, we will stub the PendSV handler for now. In a future article, when we
start enabling interrupts that may preempt each other, we will move the context
switch to our pendsv_handler
, which will ensure that high-priority interrupts
are serviced by the user-mode drivers before lower-priority interrupts.
Here are the changes to the first 16 interrupt vectors required for our example. Note that the address we want for our assembler handlers is actually the address incremented by 1, which sets the low bit of the address. This tells the CPU that these are Thumb mode functions.
The stub methods just enter an infinite loop. We implement basically the same code four times so that we can take advantage of the symbol printed by GDB when we break into one of these faults to give us the first hint regarding how to resolve the fault.
We will examine the svc_handler_thunk
function in another section below.
The Simplified Task
For this example, we will build a simple task data structure that provides just
enough information to implement cooperative multitasking. By giving the
task_t
structure a qnode_t
header, we can cast it down to a queue node,
which will allow us to enqueue and dequeue tasks on the task queue. We will add
a flags
attribute, a pointer to the stack for this task, the entry for this
task, an opaque argument that can be passed to the task as a context, and a
queue of tasks blocked by this task. The latter queue will be used later on
when we implement messaging to track tasks trying to send messages to this task
if the message queue for this task is full.
The task_entry
type is used to define the entry point for a task. It takes
one argument, which is an opaque pointer to a user-defined context structure.
Task Switching Structures
To support task switching, we instantiate a queue for managing tasks and a global pointer to the current task.
The kernel will keep current_task
pointed to the currently executing task and
will use task_queue
for tracking tasks ready to execute.
The ARM Embedded Application Binary Interface
Before we dig into task initialization, we need to consider the ARM EABI. This
summary is by no means a complete description, but it is enough for our
purposes. The R0
, R1
, R2
, and R3
registers are used for passing
parameters and as scratch registers in a function. When a function is called,
these registers must be preserved if they are to be used after the function
returns. These four registers are also used for return values. R0
holds a
32-bit return value. R0
and R1
are used to hold a 64-bit return value. The
other registers, with the exception of R12
, should be preserved by the called
function if they are changed and restored before the called function is
returned. The R12
register acts as an intra-procedural scratch register.
Generally, this register isn’t used for much in normal application code, but it
is preserved for an interrupt / supervisor call. The other registers are
callee-save registers. They can be used by the callee, but the callee must
preserve them before returning control to the caller.
Interrupts and exceptions in the Cortex-M architecture automatically preserve
the R0
through R3
registers on the currently active stack, as well as the
xPSR
register, the PC
register, the LR
register, and the R12
register.
If any other registers will be used by the interrupt handler, these must be
preserved in code. By design, this automated preservation method matches the
EABI specification, which means that we can use C functions compiled for the ARM
EABI specification directly as interrupt handlers. Unlike other platforms,
where some thunk code is needed to enter a C function from an interrupt, the ARM
hardware takes care of all of the heavy lifting for us.
Task Context Save and Restore
That being said, additional work is necessary to preserve the current task
context prior to initiating a task switch. Specifically, the R4
through R11
registers must be preserved. The other registers are preserved automatically by
the hardware. Code executing in an interrupt runs in privileged mode, which
means that stack changes are reflected on the MSR
stack. To preserve these
registers, we must store them on the shadowed PSP
stack, which is used by the
task. Hence, we make use of a simple thunk method called svc_handler_thunk
to
wrap the svc_handler
method. This thunk manages the task context save and
restore, and then the svc_handler
method handles the mechanism of the task
switch as appropriate.
In the svc_handler_thunk
, we start by preserving the LR
register on the
MSP
stack. The value provided by the LR
register on entry is a special
value used by the CPU to re-enter the unprivileged task. When this value is
encountered as a jump or branch link value in handler mode, the CPU
automatically returns from an interrupt or calls the next linked interrupt of
lower priority. We then save the remaining context of the unprivileged task
that was not preserved by the interrupt by storing these registers to the PSP
stack. We then call the C svc_handler
method, which performs a context switch
by changing the current task and updating the PSP
register. Finally, on
return from the svc_handler
, we pop the original LR
register, restore the
context of the current task – which might have changed due to the
svc_handler
– and then return to this task by branching to the LR
register.
We’ll tackle the svc_handler
once we’ve done some task setup work.
Task Setup
To set up a task, we need to replicate the stack context that would exist after
saving the context within the kernel. Entry into an interrupt or exception
preserves, in stack order, the xPSR
, PC
, LR
, R12
, R3
, R2
, R1
, and
R0
registers. Our context saving routines preserve, in stack order, the
R11
, R10
, R9
, R8
, R7
, R6
, R5
, and R4
registers. Because our
entry method is an EABI C function, we pass the context argument for this task
entry method in the R0
register. The complete task_init
method, which
initializes a task which we then place on the task queue, is presented below.
For illustration, the register values are placed onto the stack as they would be
pushed on the stack by the CPU. Typically, an RTOS would create a set of
structures to deal with the interrupt and context registers for the convenience
of struct dereferencing. But, this code and the struct code both compile down
to the same machine code once the GCC optimizer has a chance to examine it.
This method references a task_end
method, which is entered if the task exits.
This method for now just enters an infinite loop, which we can use in this
example to detect a task exit. This could easily be changed to a method that
makes a supervisor call letting the kernel know that this task is exiting.
Critical Sections
When modifying the global task structure, we want to enter a critical section before making changes, do these changes as quickly as possible, then exit the critical section and re-enable interrupts before performing a task switch.
Two inline functions, sys_enter_critical
, and sys_exit_critical
, perform
these operations.
Supervisor Calls
We define two supervisor calls in this example. The first, sys_start
, starts
up the system and switches to the first task. The second, sys_yield
, yields
the current task and switches to the next task.
The equivalent system calls are defined in the following inline functions.
When the svc
instruction is encountered, the Supervisor Call exception is
triggered by the CPU, which transfers control via the svc_handler_thunk
to the
svc_handler
.
Supervisor Call Handler
The svc_handler
method manages two calls in this example. The SVC_START
call begins task switching and enters the first task. The SVC_YIELD
call
switches from the current task to the next task in the queue, and places the old
task back onto the queue.
Two helper methods, sys_get_psp
and sys_set_psp
, are needed for setting the
PSP
register. These are included below. We also include the MSP
flavors of
these methods, which we will need when setting up our system to run in
unprivileged thread mode.
Thread Mode
To place the system into thread mode, we need to modify the CONTROL
register
to instruct the CPU to use the PSP
register for threads. We also need to
switch to unprivileged mode before initiating the first task switch. The
following bits are needed for the CONTROL
register.
These operations require some instruction barrier magic. For that, we define a
sys_isb
method that executes a barrier instruction. The sys_set_control
method updates the CONTROL
register.
In the main
method in this example, we create two tasks, named atask
and
btask
, which resolve to the entry methods, afunc
and bfunc
respectively.
We also need an initial PSP
stack, which we use ever-so-briefly before
switching to our first task. We’ll set the size of the stack to 512 bytes, but
this is much more than we’ll need.
In the main
method, we will set up the task structure, initialize our tasks,
then enter Thread Mode, before issuing the sys_start
Supervisor Call to switch
to the first task.
Example Tasks
For this example, we’ll create an afunc
task and a bfunc
task that each
increment a global variable and then yield. This will allow us to set
breakpoints in each function, and then confirm that task switching is working
from GDB just by tracking the global variables during each execution step.
The two global variables we track are simply named a
and b
.
The afunc
and bfunc
tasks are below.
Testing Task Switching
In order to test this task switching example, we will set a breakpoint on the
line incrementing the a
variable in the afunc
task. When this breakpoint is
hit, we will use the following commands in gdb
to print the current values of
the a
and b
variables. This snippet assumes that the breakpoint we created
is breakpoint 1.
When running the debugger, we should now see the following output:
(gdb) cont
Continuing.
Breakpoint 1, afunc (ctx=<optimized out>) at main.c:250
250 ++a;
$1 = 0
$2 = 0
At this point, the afunc
task is about to increment a
. The bfunc
task
has not yet run. Let’s continue.
(gdb) cont
Continuing.
Breakpoint 1, afunc (ctx=<optimized out>) at main.c:250
250 ++a;
$3 = 1
$4 = 1
Now, the bfunc
task has run and yielded, incrementing the b
variable by 1
and the afunc
task has also incremented the a
variable. From here, the
task switching will continue indefinitely with the current run queue:
(gdb) cont
Continuing.
Breakpoint 1, afunc (ctx=<optimized out>) at main.c:250
250 ++a;
$5 = 2
$6 = 2
(gdb) cont
Continuing.
Breakpoint 1, afunc (ctx=<optimized out>) at main.c:250
250 ++a;
$7 = 3
$8 = 3
This demonstrates that our cooperative multitasking example is working. In the next article, we will implement notifications and message passing, which will provide all of the OS-level facilities that we need to implement the Pomodoro clock.