Mod05 Kernel Arch and Task Structure
KERNEL ARCHITECTURE
AND THE
PROCESS DESCRIPTOR
This courseware is both the product of the author and of freely available open-source and/or public
domain materials. Wherever external material has been shown, its source and ownership have been
clearly attributed. We acknowledge all copyrights and trademarks of the respective owners.
The contents of the courseware PDFs are considered proprietary and thus cannot be copied or
reproduced in any form whatsoever without the explicit written consent of the author.
Only the programs - source code and binaries (where applicable) - that form part of this
courseware, and that are made available to the participant, are released under the terms of the
permissive MIT license.
Under the terms of the MIT License, you can certainly use the source code provided here; you must
just attribute the original source (author of this courseware and/or other copyright/trademark
holders).
VERY IMPORTANT :: Before using this source(s) in your project(s), you *MUST* check with your
organization's legal staff that it is appropriate to do so.
The courseware PDFs are *not* under the MIT License; they are to be kept confidential and non-
distributable without consent, for your private internal use only.
The duration, contents, content matter, programs, etc. contained in this courseware and companion
participant VM are subject to change at any point in time without prior notice to individual
participants.
Care has been taken in the preparation of this material, but there is no warranty, expressed or
implied, of any kind, and we can assume no responsibility for any errors or omissions. No liability is
assumed for incidental or consequential damages in connection with or arising out of the use of the
information or programs contained herein.
Time Scale
From now onward, we shall occasionally come across statements such as “the timer
interrupt fires once every 10 ms”, or, “a typical timeslice for a task is between 100 –
200 ms”. For a modern computer, these time intervals are actually quite long – the
system can achieve a lot in that time. To get a better “human feel” for such timings,
consider the table below – a quick “thought experiment”:
Operation            Actual time    Scaled to a “human” timescale
Cache access         1 ns           2 s
Context switch (1)   19 us          10.55 hours
Order of Magnitude:
While we're at it, we also often hear statements like “disk speed is easily five orders of magnitude
slower than RAM”. What does “orders of magnitude” really mean? See this page for a simple
explanation. (Very quick summary: 'n' orders of magnitude => 'n' powers of 10).
1 More recently (September 2018) measurements show that context switching time is in the region of
just 1.2 to 1.5 us (microseconds) on a pinned-down CPU, and around 2.2 us without pinning (
https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/).
OLDER: a good article on context-switching time on modern Intel processors: “How long does it take to make
a context switch?”. Paraphrasing the author's conclusion: “Context switching is expensive. My rule of thumb is
that it'll cost you about 30µs of CPU overhead. This seems to be a good worst-case approximation.”
A Linux Journal article mentions an average switching time of 19 us.
A more modern look at pretty much the exact same thing – system latencies artificially scaled for
human context – is seen below; this is from Systems Performance, Brendan Gregg:
Have you ever asked yourself: when does the OS actually run??
See this article: “When does your OS run?”, by Gustavo Duarte.
The HZ Value
Linux programs a timer chip – traditionally the Programmable Interval Timer (PIT; usually the 8253/8254 chip
on x86 motherboards) – to issue a periodic clock "tick". The number of clock ticks that occur in one second
is what the kernel constant HZ is set to; thus each tick is 1/HZ seconds apart (e.g., 4 ms with HZ=250, 1 ms
with HZ=1000).
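(A quick way to check the value your kernel was built with: look for CONFIG_HZ in its configuration; for
example, grep CONFIG_HZ= /boot/config-$(uname -r) on a typical distro kernel, or zcat /proc/config.gz where
that option is available.)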
kernel/Kconfig.hz : CONFIG_HZ:
config HZ_250
bool "250 HZ"
help
250 Hz is a good compromise choice allowing server performance
while also showing good interactive responsiveness even
on SMP and NUMA systems. If you are going to be using NTSC video
or multimedia, selected 300Hz instead.
...
config HZ_1000
bool "1000 HZ"
help
1000 Hz is the preferred choice for desktop systems and other
systems requiring fast interactive responses to events.
This document describes Kconfig options and boot parameters that can
reduce the number of scheduling-clock interrupts, thereby improving energy
efficiency and reducing OS jitter. Reducing OS jitter is important for
some types of computationally intensive high-performance computing (HPC)
applications and for real-time applications.
[1] The boot CPU cannot run in the nohz_full mode, as at least one CPU must receive the timer
interrupt and perform basic housekeeping tasks.
However, continually running the ‘timer tick’ hardware interrupt on the boot CPU is considered
high overhead and is now unnecessary! With HRT (High Resolution Timer) support [2], the kernel
does not need to do this; from the kernel documentation:
“Once a system has switched to high resolution mode (early in the boot process), the periodic tick
is switched off. This disables the per system global periodic clock event device - e.g. the PIT on
i386 SMP systems. The periodic tick functionality is provided by an per-cpu hrtimer. The callback
function is executed in the next event interrupt context and updates jiffies and calls
update_process_times and profiling.
...”
[2]: HRT: The high-resolution timers infrastructure allows one to use the available hardware timers
to program interrupts at the right moment.
• Hardware timers are multiplexed, so that a single hardware timer is sufficient to handle a large
number of software-programmed timers.
• Usable directly from user space using the usual timer APIs.
This is why the number of hardware interrupts on IRQ 0 (the ‘timer tick’) is typically low; output
below from a 4-CPU x86_64 box:
$ w
12:50:19 up 4 days, 1:39, 1 user, load average: 1.79, 1.90, 1.99
...
$ grep "timer$" /proc/interrupts
0: 10 0 0 0 IR-IO-APIC 2-edge
timer
$ grep "Local timer interrupts$" /proc/interrupts
LOC: 44548113 40704708 41749178 41343538 Local timer interrupts
<<
As an actual example, here’s the commit to the kernel code for the One Plus Nord series Android
smartphone, setting HZ to 250:
(sm8250 is the part # for the Qualcomm Snapdragon 865 5G Mobile Platform.)
>>
• It is critical to understand that for every thread that is alive on the Linux OS, the kernel
maintains a corresponding “task structure” (or - the mis-named - process descriptor).
In other words, the mapping between a userspace and/or kernel thread and a kernel-space
task_struct is 1:1.
• All process descriptors, i.e., all task_struct's, are organized using a linked list; experience
has shown that using a circular doubly-linked list works best. This list is called the “task
list”.
• In fact, this scheme (of using circular linked lists) is so common in usage that it is built-in to
the mainline kernel: a header called “list.h” has the data structure and macro elements to
support building and manipulating sophisticated linked lists without re-inventing the wheel.
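<<
As a quick illustration of the list.h API just mentioned, here's a minimal sketch (module context assumed; the
node type and names below are purely illustrative, not kernel code):

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/printk.h>

struct mynode {
	int data;
	struct list_head list;		/* embed the kernel's list 'hook' */
};
static LIST_HEAD(myhead);		/* declare and initialize the list head */

static void demo_list(void)
{
	struct mynode *n, *tmp;
	int i;

	for (i = 0; i < 3; i++) {
		n = kzalloc(sizeof(*n), GFP_KERNEL);
		if (!n)
			return;
		n->data = i;
		list_add_tail(&n->list, &myhead);	/* append at the tail */
	}

	list_for_each_entry(n, &myhead, list)		/* iterate */
		pr_info("node data = %d\n", n->data);

	list_for_each_entry_safe(n, tmp, &myhead, list) {	/* delete safely while iterating */
		list_del(&n->list);
		kfree(n);
	}
}
>>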
The kernel maintains two stacks (one for each privilege level) – a user-mode and a kernel-mode
stack.
Thus, for every thread alive on the system, we have two stacks:
- a user-mode stack
- a kernel-mode stack
(The exception to the above rule: kernel threads. Kernel threads see only kernel virtual address
space; thus, they require only a kernel-mode stack.)
When a process (or thread) executes code in userspace, it is automatically using the usermode stack.
When it issues a system call, it switches to kernel-mode; now, the CPU “automatically”* uses the
kernel-mode stack for that process (or thread).
* This is usually done via microcode in the processor. See the end of the topic for an example (IA-
32).
Keep in mind that while the user-space stack can grow very large (typically an 8-10 MB resource
limit), the kernel-mode stack is very small: typically two, or at most four, page frames – i.e.,
8 KB on 32-bit systems, 16 KB on 64-bit systems.
So:
<<
Besides kernel text and data, the kernel dynamically allocates and manages space for several meta-
data structures and objects, among them the memory pools, kernel stacks, paging tables, etc.
$ uname -r
5.4.0-58-generic
$ grep -E "KernelStack|PageTables" /proc/meminfo
KernelStack: 20688 kB
PageTables: 52048 kB
$
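(A quick back-of-the-envelope check: with 16 KB kernel-mode stacks on x86_64, KernelStack: 20688 kB implies
roughly 20688 / 16 ≈ 1293 kernel-mode stacks – i.e., threads – currently alive.)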
herolte:/ $ uname -r
3.18.14-11104523
herolte:/ $ egrep "KernelStack|PageTables" /proc/meminfo
KernelStack: 52752 kB
PageTables: 80648 kB
herolte:/ $
>>
Q. How can we tell how big the kernel mode stack is?
Keep in mind that this size includes both the thread_info structure and the kernel-mode stack
space.
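<<
One way to check: THREAD_SIZE is the arch-specific macro that defines the kernel-mode stack size (on x86_64 it
works out to 16 KB without KASAN). A tiny module sketch to print it (illustrative only; not part of the
courseware codebase):

#include <linux/module.h>
#include <linux/thread_info.h>
#include <linux/mm.h>

static int __init stacksz_init(void)
{
	pr_info("THREAD_SIZE = %lu bytes (%lu page(s))\n",
		(unsigned long)THREAD_SIZE, (unsigned long)(THREAD_SIZE / PAGE_SIZE));
	return 0;
}
static void __exit stacksz_exit(void) { }
module_init(stacksz_init);
module_exit(stacksz_exit);
MODULE_LICENSE("GPL");
>>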
Resource:
“Kernel Small Stacks” on eLinux
Besides the kernel-mode stack of the task, the kernel also maintains another structure per task called
the thread_info structure. It is used to cache frequently referenced system data and provide a quick
way to access the task_struct.
It has evolved, over several iterations (and, being Linux, will keep evolving).
<<
thread_info : on 32-bit Linux (ARM, x86_32); small size, kept at base of kernel mode stack (of the
thread):
>>
<<
Src: Virtually mapped stacks 2: thread_info strikes back, Jon Corbet, June 2016, LWN
...
The existence of these two structures is something of a historical artifact. In the early days of Linux, only
the task_struct existed, and the whole thing lived on the kernel stack; as that structure grew, though, it
[was eventually moved off the stack].
But placement on the kernel stack conferred a significant advantage: the structure could be quickly located
by masking some bits out of the stack pointer, meaning there was no need to dedicate a scarce register to
storing its location.
For certain heavily used fields, this was not an optimization that the kernel developers wanted to lose. So,
when the task_struct was moved out of the kernel-stack area, a handful of important structure fields were
left there, in the newly created thread_info structure. The resulting two-structure solution is still present in
current kernels, but it doesn't necessarily have to be that way.
...
>>
[On the IA-32, the kernel-mode 'esp' register is used (suitably masked) to locate the thread_info structure;
on the ARM, it's the 'sp' register that is used.]
...
55 struct thread_info {
56 struct task_struct *task; /* main task structure */
57 __u32 flags; /* low level flags */
58 __u32 status; /* thread synchronous flags */
59 __u32 cpu; /* current CPU */
60 mm_segment_t addr_limit;
61 unsigned int sig_on_uaccess_error:1;
62 unsigned int uaccess_err:1; /* uaccess failed */
63 };
...
In https://elixir.bootlin.com/linux/v4.6/source/arch/arm/include/asm/thread_info.h#L49 :
...
45 /*
46 * low level task data that entry.S needs immediate access to.
47 * __switch_to() assumes cpu_context follows immediately after cpu_domain.
48 */
49 struct thread_info {
50 unsigned long flags; /* low level flags */
51 int preempt_count; /* 0 => preemptable, <0 => bug */
52 mm_segment_t addr_limit; /* address limit */
53 struct task_struct *task; /* main task structure */
54 __u32 cpu; /* cpu */
55 __u32 cpu_domain; /* cpu domain */
56 struct cpu_context_save cpu_context; /* cpu context */
57 __u32 syscall; /* syscall number */
ARM64 (AArch64):
struct thread_info {
	unsigned long flags; /* low level flags */
	mm_segment_t addr_limit; /* address limit */
#ifdef CONFIG_ARM64_SW_TTBR0_PAN << PAN - Privileged Access Never; don't allow kernel access to userspace >>
u64 ttbr0; /* saved TTBR0_EL1 */
#endif
union {
u64 preempt_count; /* 0 => preemptible, <0 => bug */
struct {
#ifdef CONFIG_CPU_BIG_ENDIAN
u32 need_resched;
u32 count;
#else
u32 count;
u32 need_resched;
#endif
} preempt;
};
};
x86[-64] https://elixir.bootlin.com/linux/v6.1/source/arch/x86/include/asm/thread_info.h#L56
/*
 * low level task data that entry.S needs immediate access to.
 */
struct thread_info {
	unsigned long		flags;		/* low level flags */
	unsigned long		syscall_work;	/* SYSCALL_WORK_ flags */
	u32			status;		/* thread synchronous flags */
#ifdef CONFIG_SMP
	u32			cpu;		/* current CPU */
#endif
};
Traditionally, the thread_info struct and the kernel-mode stack are clubbed together within the same few
contiguous physical memory pages – the thread_union:
union thread_union {
#ifndef CONFIG_THREAD_INFO_IN_TASK
struct thread_info thread_info;
#endif
unsigned long stack[THREAD_SIZE/sizeof(long)];
};
Diagrammatically (source):
Note though that, from Linux 2.6 onward, each thread's kernel-mode stack is typically either 2 pages (8 KB;
32-bit systems) or 4 pages (16 KB; 64-bit systems).
The traditional way to view the user-mode process/thread stack(s) was via the gstack utility. While
it works on some Linux distros, it doesn’t seem to work any longer on modern Ubuntu!
Thus, here’s an alternative script – doing much the same as gstack does: it runs GDB in batch mode
to query stacks!
Credit: poor man's profiler
sudo gdb \
-ex "set pagination 0" \
-ex "thread apply all bt" \
--batch -p <PID>
#6 0x0000562efef00040 in ?? ()
#7 0x0000562efef035ca in ?? ()
#8 0x0000562efef06ec8 in yyparse ()
#9 0x0000562efeefd29b in parse_command ()
#10 0x0000562efeefd3a7 in read_command ()
#11 0x0000562efeefd5ca in reader_loop ()
#12 0x0000562efeefbef9 in main ()
[Inferior 1 (process 3408355) detached]
That’s useful!
It works on multithreaded apps too showing the individual stacks of every thread alive within the
process.
and from
http://lxr.free-electrons.com/source/arch/arm/include/asm/thread_info.h?v=3.2#L94
[...]
/*
* how to get the thread information struct from C
*/
static inline struct thread_info *current_thread_info(void) __attribute_const__;

static inline struct thread_info *current_thread_info(void)
{
	register unsigned long sp asm ("sp");
	return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}
<<
Explanation of above:
With THREAD_SIZE (meaning, kernel-mode stack size) = 4096 :
'sp' is the register (normally r13) holding the head (start) of the stack.
If THREAD_SIZE = 4096, then (above), return value of current_thread_info
= sp & ~4095
= sp & ~(0000 0000 0000 0000 0000 1111 1111 1111)
= sp & (1111 1111 1111 1111 1111 0000 0000 0000)
=> low 12-bits of address are zeroed out, which is equivalent to truncating it to
the nearest (numerically lower) page boundary. This, in effect, yields the
pointer to the thread_info structure (as the ti will be placed in the beginning of
the page frame that holds both the ti structure and the kernel-mode stack)!
'sp' is the register (normally r13) holding the head (start) of the stack.
If THREAD_SIZE = 8192, then (above), return value of current_thread_info
= sp & ~8191
= sp & ~(0000 0000 0000 0000 0001 1111 1111 1111)
= sp & (1111 1111 1111 1111 1110 0000 0000 0000)
=> low 13-bits of address are zeroed out, which is equivalent to truncating it to
the nearest (numerically lower) 8 KB (two-page) boundary. This, in effect, yields the
pointer to the thread_info structure (as the ti will be placed at the beginning of
the two page frames that hold both the ti structure and the kernel-mode stack)!
This is looked up (below, in inline function get_current()) with offset 'task', which yields the
location of the task_struct.
The above discussion does imply that kernel stacks must be aligned on their own size (on 32-bit at
least) in order for the ‘current’ macro to work. Yes! Kernel stacks are always a power-of-2 size, and are
allocated aligned to that size.
Note that on modern 64-bit processors, current is implemented in a newer arch-dependent manner:
• ARM64 (AArch64) : in-register (GPR)
• PPC64 : in-register (GPR)
• x86_64 : per-CPU variable.
>>
arch/arm/include/asm/thread_info.h
/*
* how to get the current stack pointer in C
*/
register unsigned long current_stack_pointer asm ("sp");
[...]
/*
* how to get the thread information struct from C
*/
static inline struct thread_info *current_thread_info(void) __attribute_const__;

static inline struct thread_info *current_thread_info(void)
{
	return (struct thread_info *)
		(current_stack_pointer & ~(THREAD_SIZE - 1));
}
ARM64
thread_info evolution: Iteration ‘2’ (AArch64 and x86_64, ~ v4.10 onward):
“This patch moves arm64's struct thread_info from the task stack into
task_struct. This protects thread_info from corruption in the case of
stack overflows, and makes its address harder to determine if stack
addresses are leaked, making a number of attacks more difficult. Precise
detection and handling of overflow is left for subsequent patches.
Code View:
include/linux/thread_info.h
[...]
#ifdef CONFIG_THREAD_INFO_IN_TASK << will be true on recent ARM64 and x86_64 >>
/*
* For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the
* definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels,
 * including <asm/current.h> can cause a circular dependency on some platforms.
*/
#include <asm/current.h>
#define current_thread_info() ((struct thread_info *)current)
#endif
[...]
arch/arm64/include/asm/current.h
/*
* We don't use read_sysreg() as we want the compiler to cache the value where
* possible.
*/
static __always_inline struct task_struct *get_current(void)
{
	unsigned long sp_el0;

	asm ("mrs %0, sp_el0" : "=r" (sp_el0));

	return (struct task_struct *)sp_el0;
}
The arch-dependent entry_task_switch() code will ensure that the sp_el0 register is updated to point to
‘next’ every time we (are about to) context-switch.
https://stackoverflow.com/questions/29393677/armv8-exception-vector-significance-of-el0-sp).
The above arm32 code for obtaining current might at first look very optimized and cool; kernel
(and hardware) folks beg to differ! Check this out [source:arm64: Introduce IRQ stack [patch]]:
“…
It is a core concept to directly retrieve struct thread_info from
sp_el0. This approach helps to prevent text section size from being
increased largely as removing masking operation using THREAD_SIZE
in tons of places.
[Thanks to James Morse for his valuable feedbacks which greatly help
to figure out a better implementation. - Jungseok]
...
+/*
+ * struct thread_info can be accessed directly via sp_el0.
+ */
...
Virtually mapped stacks 2: thread_info strikes back, Jon Corbet, June 2016, LWN
arch/x86/include/asm/current.h
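<<
(The x86_64 listing appears to have dropped out here; as a reference sketch, in kernels of roughly this
vintage – i.e., ~5.x/6.1, before the later per-CPU 'hot' data rework – it is essentially a per-CPU variable
lookup:)

DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
	return this_cpu_read_stable(current_task);
}

#define current get_current()
>>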
On the other hand, on the PPC architecture implementation, the address of the task_struct is stored
as part of hardware context (in a CPU register); this makes the lookup extremely efficient.
FYI / Note-
Here’s the commit if interested: fork: Add generic vmalloced stack support.
From arch/Kconfig :
...
config HAVE_ARCH_VMAP_STACK
def_bool n
help
An arch should select this symbol if it can support kernel stacks
in vmalloc space. This means:
config VMAP_STACK
default y
bool "Use a virtually-mapped stack"
depends on HAVE_ARCH_VMAP_STACK && !KASAN
---help---
Enable this if you want the use of virtually-mapped kernel stacks
with guard pages. This causes kernel stack overflows to be
caught immediately rather than causing difficult-to-diagnose
corruption.
Miscellaneous / FYI
You will often see the “container_of” macro being used in kernel code. What does it mean, how
does it work?
In general, see:
MagicMacros on kernelnewbies : container_of() and ARRAY_SIZE().
Interesting:
The kernelnewbies FAQ page
What does !!(x) mean in C (esp. the Linux kernel)?
/ContainerOf What is container_of ? How does it work ?
/DoWhile0 Why do a lot of #defines in the kernel use do { ... } while (0)?
etc etc.
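<<
A minimal, userspace-compilable sketch of what container_of() boils down to (simplified; the struct and names
are purely illustrative):

#include <stdio.h>
#include <stddef.h>	/* offsetof() */

/* Given a pointer to a member, recover a pointer to the enclosing struct
 * by subtracting the member's offset within that struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mydata {
	int id;
	char name[16];
};

int main(void)
{
	struct mydata d = { .id = 42, .name = "demo" };
	char *pname = d.name;	/* we only have a pointer to a member... */
	struct mydata *pd = container_of(pname, struct mydata, name);

	printf("id = %d\n", pd->id);	/* ...yet we recovered the whole struct: prints 42 */
	return 0;
}

(The kernel's real version adds type-checking, but the core idea is the offsetof() subtraction above.)
>>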
We can look up a ‘live’ task structure by setting up a breakpoint in the kernel code that creates a (child)
process and examining the task_struct from within the debugger (gdb).
[Also Note:
● The details of KGDB setup/installation and usage are covered in the “LINUX Debugging
Techniques” training].
● An alternative to using KGDB is to use KDB.
● [UPDATE!]
Still another tool (perhaps the best in terms of analysis capabilities) is the kexec/kdump
facility in conjunction with the crash utility. Crash lets one look up detailed data structure,
stack, memory, machine state, etc., information.
● Shown below is sample output from tracing parts of the (now old) 2.6.17 kernel built with
kgdb support on an IA-32 system.
]
(gdb) info b
Num Type Disp Enb Address What
1 breakpoint keep y 0xc01189fb in panic at kernel/panic.c:76
2 breakpoint keep y 0xc0177a41 in sys_sync at fs/sync.c:41
breakpoint already hit 2 times
3 breakpoint keep n 0xc01069dd in timer_interrupt at
arch/i386/kernel/time.c:161
breakpoint already hit 4 times
(gdb) b do_fork
Breakpoint 8 at 0xc0117ed2: file kernel/fork.c, line 1358.
(gdb) c
Continuing.
<< Now, during our kgdb session, on the target system's shell type 'ps' (or any
executable, in fact) >>
With crash:
<< running on an x86_64 >>
atomic_t usage;
unsigned int flags;
unsigned int ptrace;
--snip--
● The Linux kernel is a (very fast!) moving target. Therefore, the material below is bound to
get outdated. The only way to “keep up” with the latest kernel source is to install git, clone
and regularly pull in the latest version (see the ‘xtra’ material on using git).
● The text within the "<<" and ">>" markers below consists of comments or further information introduced by
this author for better understanding; it is not part of the actual task_struct structure source.
<< Below: (mostly) as of Linux kernel ver 5.0.3 [Mar 2019] >>
include/linux/sched.h
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK << Recent: 4.9. Commit (now thread_info is just one 32-bit ‘flags’ field) >>
/*
* For reasons of header soup (see current_thread_info()), this
* must be the first element of task_struct.
*/
struct thread_info thread_info;
#endif
volatile long state; << renamed and retyped to 'unsigned int __state' in 5.14; commit 2f064a5 >>
<<
• state (__state from 5.14 on): This can be one of the following defines that appear higher up in sched.h :
...
/*
* Task state bitmask. NOTE! These bits are also
* encoded in fs/proc/array.c: get_task_state().
*
* We have two separate sets of flags: task->state
* is about runnability, while task->exit_state are
* about the task exiting. Confusing, but this way
* modifying one set can't modify the other one by
* mistake.
*/
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define __TASK_STOPPED 4
#define __TASK_TRACED 8
/* in tsk->exit_state */
#define EXIT_ZOMBIE 16
#define EXIT_DEAD 32
/* in tsk->state again */
#define TASK_DEAD 64
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
#define TASK_STATE_MAX 512
...
/* Convenience macros for the sake of set_task_state */
#define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED)
#define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED)
...
• type volatile implies that this member can be altered asynchronously from interrupt routines.
<< Good ref: http://www.netrino.com/Embedded-Systems/How-To/C-Volatile-Keyword >>
<<
(gdb) p p << print 'p' <-- the new child's task_struct .
As of now, it's a copy of the parent: bash
>>
$8 = (struct task_struct *) 0xc3d61510
(gdb) p p.state
$12 = 0x0
(gdb) p /x p.flags
$13 = 0x400040
(gdb) p *p << lets look it up >>
$9 = {state = 0x0, thread_info = 0xc2140000, usage = {counter = 0x2}, flags =
0x400040, ptrace = 0x0,
lock_depth = 0xffffffff, load_weight = 0x80, prio = 0x73, static_prio = 0x78,
normal_prio = 0x73, run_list = {
next = 0xc3d61538, prev = 0xc3d61538}, array = 0x0, ioprio = 0x0, sleep_avg =
0x35a4e900,
timestamp = 0x3c1fc3738de, last_ran = 0x3c1fc34be7e, sched_time = 0x0, sleep_type =
SLEEP_NORMAL,
policy = 0x0, cpus_allowed = {bits = {0x1}}, time_slice = 0xc, first_time_slice =
0x1, tasks = {
next = 0xc03c9d68, prev = 0xc3fe80d8}, ptrace_children = {next = 0xc3d61580, prev
= 0xc3d61580},
...
(gdb)
(gdb) set print pretty
(gdb) p *p
$10 = {
state = 0,
stack = 0xd6ba8000,
usage = {
counter = 2
},
flags = 4202562,
ptrace = 0,
wake_entry = 0x0,
on_cpu = 0,
on_rq = 0,
prio = 120,
static_prio = 120,
normal_prio = 120,
rt_priority = 0,
sched_class = 0xc159b420,
se = {
load = {
weight = 1024,
inv_weight = 4194304
},
…
…
memcg = 0x0,
nr_pages = 0,
memsw_nr_pages = 0
},
ptrace_bp_refcnt = {
counter = 1
}
}
>>
<<
From include/linux/sched/signal.h :
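The inline routine being discussed here is signal_pending(); in kernels of roughly this vintage it is
essentially (reproduced approximately):

static inline int signal_pending(struct task_struct *p)
{
	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
}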
TIF_SIGPENDING is one of the flags inside the task's thread_info structure; it is used to detect if a signal is
pending delivery upon the task. The above inline function returns True if a signal is pending, False otherwise.
<<
Ref:
likely()/unlikely() macros in the Linux kernel - how do they work? What's their benefit?
<<
'[un]likely' are compiler optimization hints; the programmer can provide a hint to the compiler regarding
branch prediction via these macros.
The “[un]likely” macros actually affect code generation at the call site; this way we try and avoid getting
off the “hot” code path. We optimize towards the ‘hot’ path; we will pay a performance penalty if the hint is
wrong – but that's unlikely, by definition!
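Under the hood, these are just thin macro wrappers over a gcc builtin – essentially (from
include/linux/compiler.h, ignoring the instrumentation variants):

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

(This also explains the '!!(x)' idiom referred to in the kernelnewbies FAQ mentioned earlier.)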
<<
Related: “... What is the difference between terms: "Slow path" and "Fast path" ?
In general, "fast path" is the commonly run code that should finish very quickly. For example, when it comes
to spinlocks the fast path is that nobody is holding the spinlock and the CPU that wants it can just take it.
Conversely, the slow path for spinlocks is that the lock is already taken by somebody else and the CPU will
have to wait for the lock to be freed.
The second one is not as important and does not need to be optimized much at all.
In this example, the reason for not optimizing the spinlock code for dealing with lock contention is that locks
should not be contended. If they are, we need to redesign the data structures or the code to avoid contention
in the first place!
You will see similar tradeoffs in the page locking code, the scheduler code (common cases are fast, unlikely
things are put out of line by the compiler and are "behind a jump") and many other places in the kernel.”
>>
<<
Static Keys and Jump Labels in the Linux Kernel
Motivation: to avoid, as far as possible, getting off the ‘hot path’, yet still support ‘unlikely-to-come-true’
if-conditions within performance-sensitive kernel code paths (a good example is the kernel tracepoint code:
it has to conditionally check ‘is this tracepoint enabled?’ every time; static keys optimize this check!).
Source: https://www.kernel.org/doc/Documentation/static-keys.txt
...
Static keys allows the inclusion of seldom used features in
performance-sensitive fast-path kernel code, via a GCC feature and a code
patching technique. A quick example::
DEFINE_STATIC_KEY_FALSE(key);
...
if (static_branch_unlikely(&key))
do unlikely code
else
do likely code
...
static_branch_enable(&key);
...
static_branch_disable(&key);
...
/*
* This begins the randomizable portion of task_struct. Only
* scheduling-critical items should be added above here.
*/
randomized_struct_fields_start
void *stack;
<< In dup_task_struct:
ti = alloc_thread_info(tsk);
...
tsk->stack = ti; << 'ti' is the memory for the kernel-mode stack
and thread_info structure >>
>>
atomic_t usage;
/* Per task flags (PF_*), defined further below: */
unsigned int flags;
<<
// include/linux/sched.h (v5.10.60)
...
#define PF_VCPU			0x00000001	/* I'm a virtual CPU */
#define PF_IDLE			0x00000002	/* I am an IDLE thread */
#define PF_EXITING		0x00000004	/* Getting shut down */
#define PF_IO_WORKER		0x00000010	/* Task is an IO worker */
#define PF_WQ_WORKER		0x00000020	/* I'm a workqueue worker */
#define PF_FORKNOEXEC		0x00000040	/* Forked but didn't exec */
#define PF_MCE_PROCESS		0x00000080	/* Process policy on mce errors */
#define PF_SUPERPRIV		0x00000100	/* Used super-user privileges */
#define PF_DUMPCORE		0x00000200	/* Dumped core */
#define PF_SIGNALED		0x00000400	/* Killed by a signal */
#define PF_MEMALLOC		0x00000800	/* Allocating memory */
#define PF_NPROC_EXCEEDED	0x00001000	/* set_user() noticed that RLIMIT_NPROC was exceeded */
#define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
#define PF_USED_ASYNC		0x00004000	/* Used async_schedule*(), used by module init */
#define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
#define PF_FROZEN		0x00010000	/* Frozen for system suspend */
#define PF_KSWAPD		0x00020000	/* I am kswapd */
#define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
#define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
#define PF_LOCAL_THROTTLE	0x00100000	/* Throttle writes only against the bdi I write to,
						 * I am cleaning dirty pages from some other bdi. */
#define PF_KTHREAD		0x00200000	/* I am a kernel thread */
#define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
[...]
>>
unsigned int ptrace;
<<
usage = {
counter = 0x2
},
flags = 0x400040,
ptrace = 0x0,
>>
...
<< Several members that follow relate directly to the scheduler; seen later >>
int on_rq;
int prio;
int static_prio;
int normal_prio;
unsigned int rt_priority;
<-- >= 2.6.23 : the CFS scheduler >
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
<<
prio = 0x73,
static_prio = 0x78,
normal_prio = 0x73,
>>
...
unsigned int policy;
<<
scheduling policy: one of:
SCHED_NORMAL aka SCHED_OTHER (the default, non real-time),
SCHED_RR and SCHED_FIFO ((soft) real-time),
SCHED_BATCH, SCHED_IDLE, and SCHED_DEADLINE (from 3.14).
Simple sample code that adds nodes to the tail of a list can be seen here:
samples/kmemleak/kmemleak-test.c
>>
<<
Using the powerful 'crash' utility:
Lets use crash to cycle through the task list, printing the PID and name of each task. To do so, we'll need a
starting point: lets look up the kernel virtual-address of init's task structure:
--snip--
crash>
Crash Tip: within crash, use the help <command> to get detailed and useful help, often with excellent
examples!
>>
Programmatically, one can use the for_each_process() macro to iterate over the processes (_not_
threads) on the task list; to iterate over every thread, use the do_each_thread() / while_each_thread() pair:
struct task_struct *g, *t; // 'g' : process ptr; 't': thread ptr !
do_each_thread(g, t) {
printk(KERN_DEBUG "%d %d %s\n", g->tgid, t->pid, g->comm);
} while_each_thread(g, t);
---
>>
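<<
A minimal sketch using for_each_process() (module context assumed; RCU read-side protection is one valid way
to safely traverse the task list – the function name here is illustrative):

#include <linux/sched.h>
#include <linux/sched/signal.h>	/* for_each_process() */
#include <linux/printk.h>

static void show_processes(void)
{
	struct task_struct *p;

	rcu_read_lock();		/* protect the task-list traversal */
	for_each_process(p)
		pr_info("%6d  %s\n", p->pid, p->comm);
	rcu_read_unlock();
}
>>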
...
struct mm_struct *mm;
struct mm_struct *active_mm; << VM:
mm : user address-space mapping; kernel threads have it as
NULL as they have no userspace mapping
active_mm : mapping for “anonymous” address space:
Details: https://github.com/torvalds/linux/blob/master/Documentation/vm/active_mm.txt
>>
<<
mm = 0xcf5a5300,
active_mm = 0xcf5a5300,
>>
...
int exit_state;
int exit_code;
int exit_signal;
/* The signal sent when the parent dies: */
int pdeath_signal;
/* JOBCTL_*, siglock protected: */
unsigned long jobctl;
<<
FAQ: Within the kernel, given a task's PID, how can we locate the corresponding task structure?
From here:
If you want to find the task_struct from a module, find_task_by_vpid(pid_t nr)
etc. are not going to work since these functions are not exported.
In a module, you can use the following (exported) function instead:
pid_task(find_vpid(pid), PIDTYPE_PID);
>>
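<<
A small sketch wrapping this up (module context assumed; as a precaution, hold the RCU read lock across the
lookup and take a reference before using the pointer outside it):

#include <linux/pid.h>
#include <linux/sched.h>
#include <linux/sched/task.h>	/* get_task_struct() / put_task_struct() */

static struct task_struct *lookup_task(pid_t pid)
{
	struct task_struct *p;

	rcu_read_lock();
	p = pid_task(find_vpid(pid), PIDTYPE_PID);
	if (p)
		get_task_struct(p);	/* take a reference before dropping RCU */
	rcu_read_unlock();

	return p;	/* may be NULL; caller must put_task_struct() when done */
}
>>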
<-- new feature of gcc 4.2 or above >
<--
From the official kernel doc: Kernel Self-Protection
config CC_STACKPROTECTOR
def_bool n
help
Set when a stack-protector mode is enabled, so that the build
can enable kernel-side support for the GCC feature.
choice
Also note that, from kernel ver 4.11 onward, many of these task-accessor macros and functions have been
moved into a new header: <linux/sched/signal.h> ; so take this into account in your code. E.g.:
...
#define next_task(p) \
	list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
>>
struct list_head thread_node;
struct completion *vfork_done; /* for vfork() */
...
u64 utime;
u64 stime;
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
u64 utimescaled;
u64 stimescaled;
#endif
u64 gtime;
struct prev_cputime prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
struct vtime vtime;
#endif
...
/* MM fault and swap info: this can arguably be seen as either mm-
specific or thread-specific: */
unsigned long min_flt;
unsigned long maj_flt;
#ifdef CONFIG_POSIX_TIMERS
struct task_cputime cputime_expires;
struct list_head cpu_timers[3];
#endif
/* Process credentials: */
<< >= 2.6.24 (?):
A broader notion of the security context of the task. Consists of a set of actionable objects, objective
and subjective contexts. See Documentation/credentials.txt and include/linux/cred.h for details.
real_cred : objective part of this context is used whenever that task is acted upon.
task->cred : subjective context that defines the details of how that task is going to act upon another
object. This may be overridden temporarily to point to another security context, but normally points
to the same context as task->real_cred.
>>
/* Objective and real subjective task credentials (COW): */
const struct cred __rcu *real_cred;
NOTE- If you don’t use the modern task_uid() / task_euid() / ... helpers, sparse (a static analyser for
the kernel) complains:
(To see this, trigger sparse static analysis via the sa_sparse target of our ‘better’ Makefile!).
>>
>>
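<<
A tiny sketch of using the modern accessor helpers instead of dereferencing ->cred directly (module context;
names illustrative):

#include <linux/sched.h>
#include <linux/cred.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>
#include <linux/printk.h>

static void show_creds(struct task_struct *p)
{
	pr_info("%s [%d]: RUID=%u EUID=%u\n", p->comm, p->pid,
		from_kuid(&init_user_ns, task_uid(p)),
		from_kuid(&init_user_ns, task_euid(p)));
}
>>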
<<
Security / Hack
Check out this article - “This is what a root debug backdoor in a Linux kernel looks like”, 09 May 2016.
Excerpts-
“A root backdoor for debugging ARM-powered Android gadgets managed to end up in shipped firmware
– and we're surprised this sort of colossal blunder doesn't happen more often.
The howler is the work of Chinese ARM SoC-maker Allwinner, which wrote its own kernel code
underneath a custom Android build for its devices.
Its Linux 3.4-based kernel code, on Github here, contains what looks to The Register like a debug
mode the authors forgot to kill. Although it doesn't appear to have made it into the mainstream kernel
source, it was picked up by firmware builders for various gadgets using Allwinner's chips.
if(!strncmp("rootmydevice",(char*)buf,12)){
cred = (struct cred *)__task_cred(current);
cred->uid = 0;
cred->gid = 0;
cred->suid = 0;
cred->euid = 0;
cred->euid = 0;
cred->egid = 0;
cred->fsuid = 0;
cred->fsgid = 0;
printk("now you are root\n");
}
Tkaiser, a moderator over at the forums of the Armbian operating system (a Linux distro for ARM-based
development boards) notes there's a number of vulnerable systems in the field.
--snip--
There are probably other products out there using the Allwinner SoC and the dodgy code. Tkaiser
pointed out that FriendlyARM was also quick to issue a patch.”
>>
<<
More on Linux kernel security / hacking [Optional]
• The Linux kernel code can always access userspace code/data regions
◦ BUT as noted here: https://www.kernel.org/doc/html/latest/security/self-protection.html :
“The kernel must never execute userspace memory. The kernel must also never access userspace
memory without explicit expectation to do so. These rules can be enforced either by support of
hardware-based restrictions (x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's
Memory Domains). By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing attacks to operate entirely in
kernel memory.”
• An attacker can carefully set up user memory with an attack payload, and then attempt to have the kernel
execute or access it – which is precisely what the protections mentioned above (SMEP/SMAP, PXN/PAN) prevent.
>>
/*
* executable name, excluding path.
*
* - normally initialized setup_new_exec()
* - access it with [gs]et_task_comm()
* - lock it with task_lock()
*/
char comm[TASK_COMM_LEN];
<<
comm = "bash\000-terminal\000",
…
(gdb) p p.comm
$11 = "bash\000-terminal\000"
>>
#ifdef CONFIG_SYSVIPC
struct sysv_sem sysvsem;
struct sysv_shm sysvshm;
#endif
#ifdef CONFIG_DETECT_HUNG_TASK
unsigned long last_switch_count;
unsigned long last_switch_time;
#endif
/* Filesystem information: */
struct fs_struct *fs;
Pic src: My First Kernel Module: A Debugging Nightmare, Ryan Eberhardt, Nov 2020: ‘a
must read’
Notice the circular doubly linked task list on the left; init_task is the pointer to the first task, the head of the
list. The OFDT – struct files_struct – points to the open files like this: it points to struct fdtable, which in
turn has an array of pointers – struct file **fd - to struct file, which represents open files on Linux. This
contains all open file attributes including the pointer to the inode structure (which is where the VFS stores all
file details).
>>
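<<
A corresponding sketch in code – going from an open file descriptor to its inode via the struct file
(module/process context; simplified and illustrative only):

#include <linux/file.h>
#include <linux/fs.h>
#include <linux/printk.h>

static void show_file_inode(int fd)
{
	struct file *filp;

	filp = fget(fd);	/* looks up current->files and takes a reference */
	if (!filp)
		return;
	pr_info("fd %d -> inode # %lu\n", fd, filp->f_inode->i_ino);
	fput(filp);		/* drop the reference */
}
>>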
/* Namespaces: */
struct nsproxy *nsproxy;
<<
Pic src: My First Kernel Module: A Debugging Nightmare, Ryan Eberhardt, Nov 2020:
A process on Linux always belongs to a namespace (the struct nsproxy structure has the details). The default
namespace is the system global PID namespace, where the init or systemd process PID 1 is the overall
parent.
But imagine a container running! Now, all the processes within this container also have their own process
hierarchy starting at PID 1! Thus, the host kernel understands this by using a separate namespace for that
container! (This is why a process can have a global PID different from its local namespace PID).
...
>>
/* Signal handlers: */
struct signal_struct *signal;
<< this has the nr_threads count, timers, some accounting stats, the array of
struct rlimit[]’s, etc >>
<<
Resource-limit info is per-process and lives in *signal. Also, all threads of a process share the resource
limits. Use 'ulimit -a' to see the resource limits (of the calling process).
struct sighand_struct {
atomic_t count;
struct k_sigaction action[_NSIG];
spinlock_t siglock;
wait_queue_head_t signalfd_wqh;
};
>>
sigset_t blocked; << the blocked sigmask: ‘regular’ and RT signals >>
sigset_t real_blocked;
/* Restored if set_restore_sigmask() was used: */
sigset_t saved_sigmask;
struct sigpending pending; << signals pending delivery >>
unsigned long sas_ss_sp;
size_t sas_ss_size;
unsigned int sas_ss_flags;
<<
Is it possible to have the kernel send a signal to a userspace process?
Yes, of course… this article describes a way to do this (via a kernel
module of course):
Sending Signal to User space, Lirah BH, May 2016
>>
<<
A lot of members that follow, are compile-time turned ON if the
corresponding CONFIG_XXX directive is selected (at kernel configuration
time).
>>
#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task: */
struct rb_root_cached pi_waiters;
/* Updated under owner's pi_lock and rq lock */
struct task_struct *pi_top_task;
/* Deadlock detection and priority inheritance handling: */
struct rt_mutex_waiter *pi_blocked_on;
#endif
#ifdef CONFIG_DEBUG_MUTEXES
/* Mutex deadlock detection: */
struct mutex_waiter *blocked_on;
#endif
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
unsigned long lowest_stack;
unsigned long prev_lowest_stack;
#endif
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
*/
randomized_struct_fields_end
/*
* WARNING: on x86, 'thread_struct' contains a variable-sized
* structure. It *MUST* be at the end of 'task_struct'.
*
* Do not put anything below here!
*/
}; << the task_struct ends; finally! >>
<<
thread_struct:
Holds the hardware context of the task; it is (obviously) arch-dependent.
Used for context-switching, fault handling, etc.
g = 0x0,
base2 = 0x0
}, {
limit0 = 0x0,
...
}},
sp = 0xffffb31643ec7cd0,
es = 0x0,
ds = 0x0,
fsindex = 0x0,
gsindex = 0x0,
fsbase = 0x7fe027164740,
gsbase = 0x0,
ptrace_bps = {0x0, 0x0, 0x0, 0x0},
debugreg6 = 0x0,
ptrace_dr7 = 0x0,
cr2 = 0x0,
trap_nr = 0x0,
error_code = 0x0,
io_bitmap_ptr = 0x0,
iopl = 0x0,
io_bitmap_max = 0x0,
addr_limit = {
seg = 0x7ffffffff000
},
...
xmm_space = {0x0, 0xffff00, 0xffff0000, 0xffffffff, 0x6e617769, 0x48434554,
0x6172632f, 0x685f6873, 0x65706c65, 0x4f000072, 0x511, 0x0, 0x0, 0x0, 0x0, 0x0,
0xa0a0a0a, 0xa0a0a0a, 0xa0a0a0a, 0xa0a0a0a, 0x4e18fdc0, 0x5579, 0x4e1961b0, 0x5579,
0x4e175270, 0x5579, 0x4e1752d0, 0x5579, 0x0, 0x0, 0x0, 0x0, 0x75722f2e, 0x72635f6e,
0x687361, 0x746c, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
padding = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
...
crash>
Suggested Assignments
1. show_monolithic: enhance the earlier “Hello, world” kernel module to print the process
context (just show the process name and PID for now) that the init and cleanup code runs in.
2. When the kernel begins to boot the system (early init), what context is the kernel code
running in? On which processor core is it running?
(Tip: edit the kernel source; add a printk in init/main.c to show this).
3. Determine if the current process context is a user or kernel thread (Hint: look up the flags
member of the task structure and check for the PF_KTHREAD flag, seen earlier).
4. Enhance the above kernel module to print out some process-context information; for
example, print out the process name, PID (actually TGID), VM information (look up some
members of the mm_struct, like start_data, end_data, etc etc).
Also: print the kernel virtual addresses of some variables in the module.
Print the current value of jiffies as well.
5. show_threads: Write a kernel module that iterates over all threads alive on the system
printing out relevant details (as above).
<<
Tip:
Sample code to correctly iterate over all threads, using kernel synchronization primitives - very
important! - is part of the Linux Kernel Programming, 2nd Ed codebase:
• via task_{un}lock() pair of APIs (wrapper over the task struct’s alloc_lock spinlock):
https://github.com/PacktPublishing/Linux-Kernel-Programming_2E/blob/main/ch6/foreach/
thrd_showall/thrd_showall.c
• (This demo’s not on iterating over the task list; you could refactor it for that purpose...). Via a
reader-writer (spin)lock: https://github.com/PacktPublishing/Linux-Kernel-
Programming_2E/blob/main/ch13/rdwr_concurrent/2_demo_rdwr_rwlock/
miscdrv_rdwr_rwlock.c
• Via RCU (Read-Copy-Update): https://github.com/PacktPublishing/Linux-Kernel-
Programming_2E/tree/main/ch13/3_lockfree/thrdshowall_rcu
taskdtl_raw $ ./run_taskdtl 1
rmmod: ERROR: Module taskdtl is not currently loaded
[60425.413768] pid=1, tp = 0xffff8e5740d22980
[60425.413773] Task struct @ 0xffff8e5740d22980 ::
Process/Thread: systemd, TGID 1, PID 1
RealUID : 0, EffUID : 0
login UID : -1
[60425.413778] Task state (1) :
[60425.413778] S: interruptible sleep
[60425.413780] thread_info (0xffff8e5740d22980) is within the task struct itself
[60425.413781] stack : 0xffffa85a80060000 ; vmapped? yes
flags : 0x400100
sched ::
curr CPU : 1
on RQ? : no
prio : 120
static prio : 120
normal prio : 120
RT priority : 0
vruntime : 148822521
[60425.413785] policy : Normal/Other
[60425.413786] cpus allowed: 12
# times run on cpu: 8140
time waiting on RQ: 153550022
[60425.413788] mm info ::
not a kernel thread; mm_struct : 0xffff8e57493af380
[60425.413789] PGD base addr : 0xffff8e57501e0000
mm_users = 1, mm_count = 1
PTE page table pages = 90112 bytes
# of VMAs = 143
High-watermark of RSS usage = 3801 pages
High-water virtual memory usage = 6352 pages
Total pages mapped = 6223 pages
Pages that have PG_mlocked set = 0 pages
Refcount permanently increased = 0 pages
data_vm: VM_WRITE & ~VM_SHARED & ~VM_STACK = 1467 pages
exec_vm: VM_EXEC & ~VM_WRITE & ~VM_STACK = 3058 pages
stack_vm: VM_STACK = 33 pages
def_flags = 0x0
[60425.413795] mm userspace mapings (high to low) ::
env : 0x7ffc62e1df29 - 0x7ffc62e1dfed [ 196 bytes]
args : 0x7ffc62e1df17 - 0x7ffc62e1df29 [ 18 bytes]
start stack: 0x7ffc62e1d5d0
heap : 0x560658363000 - 0x560658863000 [ 5120 KB, 5
MB]
data : 0x560658093fd0 - 0x560658095010 [ 4 KB, 0
MB]
code : 0x560658082000 - 0x560658095010 [ 76 KB, 0
MB]
[60425.413800] in execve()? no
in iowait ? no
stack canary : 0x999243d32ec42900
utime, stime : 2290000000, 1334000000
# vol c/s, # invol c/s : 7056, 1096
# minor, major faults : 90418, 106
task I/O accounting ::
read bytes : 25239552
written (or will) bytes : 0
cancelled write bytes : 0
# read syscalls : 49373
# write syscalls : 13542
accumulated RSS usage : 12088244908 (11804926 KB)
accumulated VM usage : 20159133960 (19686654 KB)
pressure stall state flags: 0x0
[60425.413806] Hardware ctx info location is thread struct: 0xffff8e5740d24dc0
X86_64 ::
thrd info: 0xffff8e5740d22980
sp : 0xffffa85a800638a0
es : 0x0, ds : 0x0
cr2 : 0x0, trap # : 0x0, error code : 0x0
taskdtl_raw $
[OPTIONAL / FYI]
IA-32 : How does the CPU switch the 'sp' register to kernel mode stack when
entering kernel?
Source
Special CPU segment-register: TR
TR is the ‘Task Register’
TR holds ‘selector’ for a GDT descriptor
Descriptor is for a ‘Task State Segment’
So TR points indirectly to current TSS
TSS stores address of kernel-mode stack
https://kaiwantech.com