How System Calls Work
Why should you care about syscalls?
As a web developer, learning about syscalls and the infrastructure around them
can make you feel quite a bit more confident in debugging and reasoning about
how systems will perform. Ruby and C++ both have their own idiomatic ways of
opening files, but in the end they both end up using the syscall open(). This
is because userland processes (like web applications) have only one way of
communicating with the operating system: syscalls.
What to except when you’re excepting
In order for a process to communicate with the kernel, it has to pass execution to it somehow along with a number of arguments. It does that by issuing an exception, which moves the control flow from your process to the kernel’s interrupt handler, which processes the arguments and selects the correct syscall.
An exception is just one name for this concept - but there are a lot of names for the same thing: “different manufacturers have used terms like exceptions, faults, aborts, traps, and interrupts.” [1]
In order to better understand this, let’s take a look at a very simple syscall
in x86 assembly: getpid, which returns the id of the calling process. On 32-bit
x86 Linux its syscall number is 20, so we put that into the eax cpu register
since that’s where the kernel will look to determine which syscall to call.
mov eax, 20
int 0x80
The int instruction above triggers a software interrupt or exception, which
causes the CPU to suspend your process and run the kernel’s interrupt handler.
The kernel sees that the interrupt vector we specified was 0x80, or 128, which
corresponds to the syscall interrupt vector. It looks in the eax register and
checks whether it can find that number in its syscall table. If found, it calls
that syscall.
Let’s take a look at exactly where that takes you inside the Linux kernel, annotated with (my) comments:
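sysenter_do_call:
; cmpl - compare
; Compare the syscall number (%eax) against the total number of syscalls
; (a subtraction that only sets flags; the result is thrown away)
cmpl $(NR_syscalls), %eax
; jae - jump if Above or Equal (unsigned comparison)
; If the syscall number was out of range (%eax >= NR_syscalls), handle bad call
jae sysenter_badsys
; call - call a subroutine
; *sys_call_table(,%eax,4)
; - The * marks an indirect call: the target address is read from memory
; - sys_call_table(,%eax,4) is the address sys_call_table + %eax * 4,
;   i.e. entry number %eax in a table of 4-byte function pointers
; Call the syscall you wanted
call *sys_call_table(,%eax,4)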
As we saw before, the syscall number goes in register eax. The Linux kernel
knows nothing about syscall names. All it knows is their numbers, and this is
where it looks up the syscall’s function pointer and calls it. Here are some
examples of some syscalls you might recognize and their numbers:
- 5 - open(2) - open a file
- 12 - chdir(2) - your good friend, cd
- 34 - nice(2) - change a process’s nice value
Here’s a full table of syscalls and their arguments.
Once a syscall number is decided, it is never changed. As you can imagine, doing so would break every program that was already compiled against the old number.
Aside: when you see syscalls written like this: open(2), execve(2), the 2
is referring to the man page section for syscalls, which is 2.
Passing arguments to syscalls
Ok, so a syscall is just a function in the kernel you call in a special interrupt-y way. How do you pass it arguments?
We saw that you put the syscall number in register eax. The kernel looks for
arguments in registers ebx, ecx, and edx. Let’s take a look at a hello
world program using the syscalls write() and exit().
global _start
section .text
_start:
mov eax, 4 ; write
mov ebx, 1 ; stdout
mov ecx, msg
mov edx, msg.len
int 0x80 ; write(stdout, msg, strlen(msg));
mov eax, 1 ; exit
mov ebx, 0
int 0x80 ; exit(0)
section .data
msg: db "Hello, world!", 10
.len: equ $ - msg
The first argument (in ebx) is a file descriptor - in this case stdout. The
second argument (ecx) is a pointer to the start of the message, and the third
(edx) is the message’s length.
exit takes one argument, the exit code - which was 0.
If the syscall you’re using takes more arguments than comfortably fit in registers, then instead of putting every value in a register, you’ll pass pointers to data structures you own in userspace and the kernel reads the rest from there.
Done
If you want to learn more about syscalls, please consult these fine sources of good syscall information:
- tldp.org - How System Calls Work on Linux/i86
- Intel 80x86 Assembly Language OpCodes (mathemainzel.info)
- x86 Assembly Guide (cs.virginia.edu)
- Say hello to x64 Assembly (0xax.blogspot.com)
If you want to see syscalls in action, try using the strace command on Linux.
There’s a fantastic writeup on it by Julia Evans.
[1] Interrupts, Traps, and Exceptions: flint.cs.yale.edu