Understand the story of a system call
A userspace request for kernel action
The mechanism to escalate privileges
The terms "syscall" and "system call" are used interchangeably
this arm64 program will KILL you
.text
.globl _start
_start:
    msr VBAR_EL1, x0
Execution contexts
Define kernelspace and userspace
Kernel representation of a process or thread
What do we want out of a system call?
The five steps of a system call
An execution context is a CPU register state
The {set,long}jmp(3) library functions
setjmp: save current state
longjmp: restore saved register state
example use of {set,long}jmp
Threads in a program
Threads share: address space (heap, code, static data)
Threads have their own: stack (via the stack pointer register), other registers, instruction pointer (IP)
Kernelspace and userspace are distinct execution contexts
Normal programs run in userspace
Kernelspace is the privileged execution context
Registers, stack, memory are familiar
Key difference: CPU capabilities are unrestricted
Kernelspace execution context can be further subdivided
Kernel code may be running on behalf of a particular userspace process
Other kernel code runs on its own behalf
A capture of the CPU register state
A load of a saved CPU register state
Switching to kernel context is like other context switches
Main difference: privileges escalated
Switching back to userspace is similar but privileges are dropped
The term "context switch" is sometimes used to refer to task-switching
"A computer program or subroutine is called reentrant if multiple invocations can safely run concurrently on multiple processors" (source)
struct task_struct
This is Linux's Process Control Block (PCB)
pid
A quick look at include/linux/sched.h
current
Refers to struct task_struct of process in current execution context
A quick look at two files:
arch/arm64/include/asm/current.h
arch/x86/include/asm/current.h
see get{p,t}id(2) in kernel/sys.c
getpid(2)
getpid(2) calls functions ... namespaces are taken into account ... locking is done task_pid_nr() { ... tsk->pid ... }
tgid
Why do we have different names?
Before Linux 2.6, there were only pids
The clone(2) call could share address space between processes
This allowed thread-like behavior
These processes were too independent
NPTL implements threads as specified by POSIX
The C library was hardened for concurrency
The C library introduced the tid concept
The tid subdivides a pid
The kernel introduced the tgid concept
The tgid groups kernel pids together
Each pid corresponds to unique struct task_struct
Can a program do anything useful without making any syscalls?
All useful programs depend on system calls
Let's trace a program's syscall usage with strace
Syscall-free prime-number detector program
speed
security
stability
re-entrancy
confused deputy problem
One example: validate address range of any pointer arguments
Linux provides a stable syscall API
A syscall can be broken down into 5 distinct steps
Userspace invocation
Hardware-assisted privilege escalation
Kernel code handler
Hardware-assisted privilege drop
Userspace program continues
Each step is separated from the next by a handoff of responsibility between software and hardware
All programs make system calls
Example: a shell as an abstraction over many syscalls
see /proc/PID/syscall
The C library provides wrapper functions for many syscalls
Main benefit: speed
We want to minimize the high-overhead syscalls
Checks such as input validation can run in userspace, avoiding a syscall when possible
Avoid architecture specific details
Example: write(2) vs write(3)
The number in parentheses refers to the manual page section number
Section 2 has system calls and section 3 has library calls
See man man for more information
ltrace: like strace for library calls
Common across architectures:
specify the syscall and arguments
give up control to the hardware
On arm64:
Specify syscall number in x8
Specify arguments 1-6 in x0, x1, x2, x3, x4, x5
Return value will land in x0
svc #0 gives up control to hardware
On x86_64:
Specify syscall number in rax
Args 1-6 in: rdi, rsi, rdx, r10, r8, r9
syscall gives up control to hardware
Return value will land in rax
Difference from normal function calling convention
The syscall instruction clobbers rcx (it stores the return address there) and r11
Use r10 instead of rcx
With arguments chosen and syscall selected
This step is handled by hardware
Rewind to boot
See __primary_switched()
Set VBAR_EL1 to address of vector table
Vector table defined in entry.S
See syscall_init()
Set MSR_LSTAR to entry_SYSCALL_64 address
LSTAR: Long System Target Address Register
Back to the present
The CPU is preconfigured to correctly transfer control
This makes privilege escalation safe
On arm64: elevate execution level
On x86_64: change to ring 0
Both of these are stored in a particular register
Part architecture-specific
Part architecture-generic
Execution resumes from a hardware-specified register state
At bottom, mostly assembly and C macros
Higher on call stack is more generic code
Start in VBAR_EL1
vectors
A function defined by macro in entry.S calls into C code
Execution reaches el0_svc_common()
The invoke_syscall() indexes into jump table of handlers
This architecture-generic handler is defined by a SYSCALL_DEFINE* macro
Start at entry_SYSCALL_64
do_syscall_64()
Using a few helper functions, index into jump table of system call handlers
x86_64
A closer look at the SYSCALL_DEFINE*() handlers
Defined in include/linux/syscalls.h
Resolve to __SYSCALL_DEFINEx(x,...
Five functions generated
See __do_sys##name(...
No SYSCALL_DEFINE7 and above
Take a look at the SYSCALL_DEFINE definition in include/linux/syscalls.h
Indicate error using the errno macros
Return to assembly for another context switch
el0t_64_sync() calls ret_to_user()
kernel_exit 0
Restore registers, including the stack pointer
eret gives up control to the hardware once again
entry_SYSCALL_64() prepares to return
Prefer sysret over the slower iret
Some conditions preclude usage of sysret
Via either instruction we give up control to the hardware once again
Less dangerous operation than escalation
Restore old register and stack
Drop privileges
Set rip to userspace return address
The svc #0 instruction saves a return address in hardware (the ELR_EL1 register)
The eret instruction sets the program counter to this value
iret loads the return address from the stack
sysret returns to the address held in rcx
Software takes control of execution
Always check for an error
Kernel functions return -errno
C library wrappers check for error
Store original error in errno
Convert return code to -1
Example: musl syscall return
The errno utility from moreutils package
See man 3 errno
The system call is complete
Linux provides a stable system call API
Most programs run in user execution context ("userspace")
Kernel code runs in several execution contexts (all "kernelspace")
Hardware plays two key roles in system calls
Raising privileges and entering kernel execution context
Dropping privileges and entering user execution context
Many syscall implementation details are architecture-specific
The kernel defines the main syscall handler using a SYSCALL_DEFINE* macro
The C library defines wrapper functions for many syscalls
These hide architecture-specific details
Provide POSIX-compatible behavior by hiding Linux eccentricities
Always check for an error after making a syscall