file descriptors: close syscall

Learning objective

Gain a broad overview of many aspects of the kernel by understanding what's necessary to close a file descriptor

Overview

  1. Peel back the layers of close(2)

  2. Removing entries of the FDT

  3. Scheduling work to be done later

  4. Execution context design considerations

  5. Several more concurrency techniques

  6. Execution context sensitive code

What does close(2) do?

  • Can close fail?

Two main tasks

  1. Invalidate int fd index in FDT

  2. Close the struct file * if needed

close(2) from start to finish

demo

Verify with strace that close(3) indeed calls close(2)

Entry point

SYSCALL_DEFINE1(close)

  1. cannot restart syscall since struct file is gone

  2. If the file fails to be closed, the data may be hosed

First layer

close_fd()

  1. Use int fd argument to index into FDT

  2. Obtain underlying struct file

  3. What benefit is there to using spin_lock here?

    • Can call in atomic context

Safe indexing

pick_file()

  1. Index into FDT properly

  2. Do bounds checking on input value

  3. Use array_index_nospec() macro for security

  4. Use RCU to safely NULLify FDT entry without locks

  5. Concurrent readers of the FDT will see a value that makes sense

Mitigating speculative execution attacks

array_index_mask_nospec() (arm64) (x86)

  1. Create bitmask based on index

  2. All 1s if within bounds, else 0

  3. Bitwise AND index to zero if out-of-bounds

  4. Speculative indexing into the array always within bounds
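
The masking steps above can be sketched in portable userspace C. This is only an illustration modeled on a branch-free mask computation; the helper names index_mask_nospec() and lookup_clamped() are hypothetical, and the arm64 and x86 versions use inline assembly instead. The sketch assumes size fits in a signed long and that right-shifting a negative value is an arithmetic shift, which holds for GCC and Clang.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* All ones when index < size, zero otherwise, computed without a branch.
 * Assumes size <= LONG_MAX and an arithmetic right shift of negative
 * values (true for GCC and Clang, implementation-defined in ISO C). */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    const unsigned int top_bit = sizeof(long) * CHAR_BIT - 1;

    return (unsigned long)(~(long)(index | (size - 1UL - index)) >> top_bit);
}

/* Bounds-checked table lookup: architectural check first, then the mask. */
static int lookup_clamped(const int *table, size_t size, size_t index, int *out)
{
    assert(table != NULL && out != NULL);
    if (index >= size)
        return -1;                              /* ordinary bounds check */
    index &= index_mask_nospec(index, size);    /* out-of-bounds becomes 0 */
    *out = table[index];
    return 0;
}
```

In ordinary userspace code an optimizer may fold the mask away after the bounds check, so this shows the arithmetic only, not a working mitigation.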

Returning the file descriptor

__put_unused_fd()

  1. Why don't we call put_unused_fd()?
  • We already hold the files->file_lock spinlock

Returning the file descriptor

__clear_open_fd()

  1. Update bitmaps holding open file info

  2. full and low resolution maps used

  3. BITS_PER_LONG-sized ranges checked if all fds in use
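
A toy version of the two-level bitmap idea, assuming a fixed table of four words. The struct toy_fdmaps type and the toy_* helpers are hypothetical names for this sketch; only the overall shape (a per-fd map plus a summary map with one bit per BITS_PER_LONG-sized range) follows the points above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))
#define TOY_MAX_FDS   (4 * BITS_PER_LONG)    /* hypothetical table size */

struct toy_fdmaps {
    unsigned long open_fds[4];      /* full resolution: one bit per fd */
    unsigned long full_fds_bits;    /* low resolution: one bit per full word */
};

static void toy_set_open_fd(struct toy_fdmaps *m, unsigned int fd)
{
    assert(fd < TOY_MAX_FDS);
    unsigned int word = fd / BITS_PER_LONG;

    m->open_fds[word] |= 1UL << (fd % BITS_PER_LONG);
    if (m->open_fds[word] == ~0UL)          /* every fd in this range in use */
        m->full_fds_bits |= 1UL << word;
}

static void toy_clear_open_fd(struct toy_fdmaps *m, unsigned int fd)
{
    assert(fd < TOY_MAX_FDS);
    unsigned int word = fd / BITS_PER_LONG;

    m->open_fds[word] &= ~(1UL << (fd % BITS_PER_LONG));
    m->full_fds_bits &= ~(1UL << word);     /* the range now has a free slot */
}

static bool toy_range_full(const struct toy_fdmaps *m, unsigned int fd)
{
    return (m->full_fds_bits >> (fd / BITS_PER_LONG)) & 1UL;
}
```

The summary map lets a search for a free descriptor skip a whole word at a time instead of testing each bit.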

Returning the file descriptor

__put_unused_fd()

  1. Smallest available fd stored for next open(2)

  2. This free may require updating smallest fd
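
The "smallest available fd" bookkeeping can be sketched as a single comparison. struct toy_files and toy_put_unused_fd() are hypothetical stand-ins; the sketch assumes the hint means "no free descriptor exists below this value".

```c
#include <assert.h>
#include <stddef.h>

struct toy_files {
    unsigned int next_fd;   /* search hint: nothing free below this value */
};

/* Called after fd has been marked free in the bitmaps. */
static void toy_put_unused_fd(struct toy_files *files, unsigned int fd)
{
    assert(files != NULL);
    if (fd < files->next_fd)
        files->next_fd = fd;    /* the freed slot is the new lowest candidate */
}
```

Freeing a descriptor above the hint leaves it untouched, so the next allocation still starts its search at the lowest known free slot.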

Finish removing the FDT entry

pick_file()

  1. Return the struct file associated with the open fd

  2. Return NULL if fd not open
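
A userspace sketch of the "detach the entry and hand back the file" step, using C11 atomics as a stand-in for the kernel's RCU pointer helpers. All names here (struct toy_open_file, struct toy_fdtable, toy_pick_file()) are hypothetical, and the fixed table size is an assumption. As in close_fd(), the caller is assumed to hold a lock against other writers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_FDT_SIZE 8          /* hypothetical fixed table size */

struct toy_open_file { int id; };

struct toy_fdtable {
    _Atomic(struct toy_open_file *) fd[TOY_FDT_SIZE];
};

/* Detach and return the file stored at fd, or NULL if the slot is empty
 * or fd is out of range. */
static struct toy_open_file *toy_pick_file(struct toy_fdtable *fdt,
                                           unsigned int fd)
{
    assert(fdt != NULL);
    if (fd >= TOY_FDT_SIZE)
        return NULL;

    struct toy_open_file *file =
        atomic_load_explicit(&fdt->fd[fd], memory_order_relaxed);
    if (file != NULL)
        /* Lockless readers observe either the old pointer or NULL. */
        atomic_store_explicit(&fdt->fd[fd], NULL, memory_order_release);
    return file;
}
```

A second call for the same descriptor returns NULL, which is the case the caller turns into -EBADF.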

Return to the first layer

close_fd()

  1. No file? Then -EBADF

  2. Lastly, return whatever filp_close() returns
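
The -EBADF path is easy to observe from userspace: closing the same descriptor twice makes the second call fail. The helper name second_close_errno() is hypothetical, and the sketch assumes a single-threaded program so the descriptor number is not reused between the two calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Close the same descriptor twice and report the errno of the second call.
 * Returns -1 if the setup itself failed. */
static int second_close_errno(void)
{
    int pipefd[2];

    if (pipe(pipefd) != 0)
        return -1;
    close(pipefd[1]);

    int first = close(pipefd[0]);       /* valid fd: expected to succeed */
    assert(first == 0);

    errno = 0;
    if (close(pipefd[0]) == 0)          /* the FDT slot is now empty */
        return 0;
    return errno;
}
```

Calling second_close_errno() is expected to yield EBADF on Linux.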

Inverting filp_open()

filp_close()

  1. sanity check reference count

  2. Never should be 0

  3. Use CHECK_DATA_CORRUPTION()
    macro which may call BUG() on kernels configured to do so

Another way to crash

BUG()

  1. ASM_BUG_FLAGS() generates assembly from preprocessor macros

  2. Why use high numbers in assembly labels?

  • High label numbers are unlikely to collide with labels in surrounding inline assembly

Inverting filp_open()

filp_close()

  1. If implemented call the ->flush() file operation

  2. Flush performs pre-closure cleanup

  3. Example: writing buffered data to storage medium

Path mode

  1. Can open(2) a file with O_PATH

  2. Lightweight reference to filesystem path entry

  3. No I/O

  4. Example usage: permission checks, change of ownership

Inverting filp_open()

filp_close()

For files with I/O context

  1. Flush directory notifications using the dnotify system

  2. Remove POSIX locks associated with this file

dnotify: history

  1. First Linux filesystem event notification system

  2. Added in 2001 in Linux 2.4.0

  3. Monitor CRUD changes in directory

  4. Notified via SIGIO usually

dnotify: problems

  1. Only directory granularity

  2. Signal handling can be tricky

  3. Need open fd

  4. not much info about events

dnotify: obsolete

  1. No longer used

  2. Kept for legacy reasons

  3. Replacement: inotify

POSIX locks

  1. Can lock range of a file with fcntl(2)

demo

example posix locks program
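
A minimal sketch of what such a program might look like; it is not the demo these slides refer to. It locks a byte range with fcntl(2), then shows that closing any descriptor for the file (here a dup(2) copy) drops the process's POSIX locks. The helper names and the /tmp scratch path are assumptions. A forked child does the probing because F_GETLK only reports locks held by other processes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that asks whether another process holds a conflicting lock
 * on bytes 0..9 of fd. Returns 1 if locked, 0 if not, -1 on error. */
static int locked_by_other(int fd)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                               .l_start = 0, .l_len = 10 };
        if (fcntl(fd, F_GETLK, &probe) != 0)
            _exit(2);
        _exit(probe.l_type == F_UNLCK ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
        WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}

/* Lock a range, close a duplicate descriptor, and record what a second
 * process sees before and after. Returns 0 on success. */
static int posix_lock_demo(int *before, int *after)
{
    char path[] = "/tmp/posix-lock-demo-XXXXXX";    /* hypothetical path */
    int fd = mkstemp(path);

    assert(before != NULL && after != NULL);
    if (fd < 0)
        return -1;
    unlink(path);

    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 10 };
    if (fcntl(fd, F_SETLK, &lk) != 0) {
        close(fd);
        return -1;
    }
    *before = locked_by_other(fd);

    int dupfd = dup(fd);
    if (dupfd < 0) {
        close(fd);
        return -1;
    }
    close(dupfd);       /* closing any fd for the file drops our POSIX locks */
    *after = locked_by_other(fd);

    close(fd);
    return 0;
}
```

On Linux, posix_lock_demo() is expected to report the range as locked before the duplicate is closed and unlocked afterwards, which matches the lock removal that filp_close() performs on every close.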

Inverting filp_open()

filp_close()

  1. Call fput() to finish the job

  2. No error code from fput()

  3. Return value nonzero only when flush fails

Optimal procrastination

fput()

Decrement the file's reference count (file->f_count)

  1. Use atomic_dec_and_test()

  2. No other action taken when result is nonzero

  3. If count reaches zero, instigate the real work
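
The decrement-and-test pattern can be sketched with C11 atomics. struct toy_counted_file and both helpers are hypothetical, and setting a flag stands in for scheduling the real cleanup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_counted_file {
    atomic_long f_count;    /* number of outstanding references */
    bool released;          /* set once the final put has run */
};

/* Returns true only for the caller whose decrement reached zero. */
static bool toy_dec_and_test(atomic_long *count)
{
    /* atomic_fetch_sub returns the value held before the subtraction. */
    return atomic_fetch_sub(count, 1) == 1;
}

static void toy_fput(struct toy_counted_file *file)
{
    assert(atomic_load(&file->f_count) > 0);    /* a put needs a reference */
    if (toy_dec_and_test(&file->f_count))
        file->released = true;  /* stand-in for instigating the real work */
}
```

Because the decrement and the zero test are one atomic operation, exactly one caller sees the count reach zero, even when several threads drop references at once.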

Optimal procrastination

fput()

Why rush? Schedule a future callback

  1. First method: only for process context

  2. Second method: for any context

Checking our execution context: interface

in_interrupt()

A deprecated macro

  1. Transitively defined by irq_count()

  2. Bitwise OR three shifted values

  3. NMI, softirq, and hardirq counts

  4. Nonzero when any count is nonzero
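
A sketch of how three counts can share one word and be tested together. The field widths below are assumptions chosen for illustration, and toy_irq_count() and toy_in_interrupt() are hypothetical userspace stand-ins for the macros named above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative field widths; treat these as assumptions of the sketch. */
#define PREEMPT_BITS    8
#define SOFTIRQ_BITS    8
#define HARDIRQ_BITS    4
#define NMI_BITS        4

#define SOFTIRQ_SHIFT   (PREEMPT_BITS)
#define HARDIRQ_SHIFT   (SOFTIRQ_SHIFT + SOFTIRQ_BITS)
#define NMI_SHIFT       (HARDIRQ_SHIFT + HARDIRQ_BITS)

#define FIELD_MASK(bits, shift) (((1UL << (bits)) - 1) << (shift))
#define SOFTIRQ_MASK    FIELD_MASK(SOFTIRQ_BITS, SOFTIRQ_SHIFT)
#define HARDIRQ_MASK    FIELD_MASK(HARDIRQ_BITS, HARDIRQ_SHIFT)
#define NMI_MASK        FIELD_MASK(NMI_BITS, NMI_SHIFT)

static_assert(NMI_SHIFT + NMI_BITS <= 32, "fields must fit in 32 bits");

/* Keep only the NMI, hardirq, and softirq fields of the counter word. */
static unsigned long toy_irq_count(unsigned long preempt_count)
{
    return preempt_count & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK);
}

static bool toy_in_interrupt(unsigned long preempt_count)
{
    return toy_irq_count(preempt_count) != 0;
}
```

A word that only has preemption disabled (low bits set) does not count as interrupt context, while a nonzero value in any of the three upper fields does.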

Checking our execution context: backend

preempt_count()

Architecture-specific data source

  1. Value stored in current->thread_info

  2. Can directly cast current since struct thread_info is first member

  3. READ_ONCE() prevents racy compiler re-ordering

Kernel threads

  1. Process context without userspace

  2. Can sleep, be preempted

  3. Can call most kernel functions

  4. No userspace memory to access

Note on likely() macro

  1. Generates branch prediction hints

  2. Not on all CPUs

  3. unlikely() does the inverse

Why be careful with likely()?

  • Faster true case

  • Slower false case

  • Helpful only when very likely true

  • Otherwise considered harmful
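
For reference, hint macros of this kind are commonly built on the GCC and Clang builtin __builtin_expect(). The sketch below assumes one of those compilers; sum_positive() is a hypothetical example function in which negative inputs are assumed to be rare.

```c
#include <assert.h>
#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sum the non-negative entries; negative ones are assumed to be rare. */
static long sum_positive(const int *values, size_t n)
{
    long total = 0;

    assert(values != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (unlikely(values[i] < 0))    /* hinted as the cold path */
            continue;
        total += values[i];
    }
    return total;
}
```

The hint changes only code layout and possibly prediction, never the result, which is why a wrong guess costs performance rather than correctness.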

In the likely case of process context

Schedule callback to run on current's behalf

Task work

  1. init_task_work() wraps callback struct member assignment

  2. task_work_add() schedules the work

    1. Our callback is to ____fput()
  3. If this fails, just fallback to the other method

Scheduling options

  1. TWA_SIGNAL interrupts target task

  2. TWA_SIGNAL_NO_IPI is more chill

    1. Signal, but don't interrupt
  3. TWA_RESUME is the most relaxed

    1. Just wait for next kernel exit by target task

The userspaceless case

  1. Global delayed work queue

  2. Create a list of files to pass to callback

  3. Run them all in a jiffy (next timer tick)

Global procrastination

fput()

Use schedule_delayed_work() to access global queue

  1. Uses structure defined with DECLARE_DELAYED_WORK()

  2. Do work after delay timer ticks pass

Global procrastination

fput()

Avoid any extra scheduling

  1. Conditionally call schedule_delayed_work()

  2. Only on first list append

  3. Resulting work will empty this queue

  4. llist: lockless linked list implementation optimized for concurrent access
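
The "schedule only on the first append" trick depends on the push operation reporting whether the list was empty. A userspace sketch with C11 atomics follows; the toy_* names are hypothetical and a counter stands in for the call to schedule_delayed_work().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_llist_node { struct toy_llist_node *next; };
struct toy_llist_head { _Atomic(struct toy_llist_node *) first; };

/* Push node; returns true if the list was empty before the push. */
static bool toy_llist_add(struct toy_llist_node *node,
                          struct toy_llist_head *head)
{
    struct toy_llist_node *first = atomic_load(&head->first);

    do {
        node->next = first;     /* on CAS failure, first is refreshed */
    } while (!atomic_compare_exchange_weak(&head->first, &first, node));
    return first == NULL;
}

/* Detach the whole chain in one atomic step, leaving the list empty. */
static struct toy_llist_node *toy_llist_del_all(struct toy_llist_head *head)
{
    return atomic_exchange(&head->first, NULL);
}

static int scheduled;   /* counts stand-in calls to schedule the work */

static void toy_queue_file(struct toy_llist_node *node,
                           struct toy_llist_head *head)
{
    assert(node != NULL && head != NULL);
    if (toy_llist_add(node, head))
        scheduled++;    /* only the first append needs to schedule work */
}
```

Later appends find a non-empty list and skip the scheduling step, since the already pending work will detach and process everything queued so far.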

First callback

Two possible callers of __fput()

  1. ____fput() when using task work

  2. delayed_fput() when using global delayed work

From task work

____fput()

  1. Uses the container_of() macro

  2. Use struct member offset subtraction

  3. Pass containing struct file to __fput()
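
The offset subtraction behind container_of() can be sketched with offsetof(). The macro below is a simplified version without the kernel's type checking, and struct toy_work_file, its task_work member, and toy_deferred_fput() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified: subtract the member's offset from the member's address. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_callback_head {
    void (*func)(struct toy_callback_head *);
};

struct toy_work_file {
    int id;
    struct toy_callback_head task_work;     /* embedded callback handle */
    int closed;
};

/* The callback only receives the embedded handle, not the file itself. */
static void toy_deferred_fput(struct toy_callback_head *work)
{
    assert(work != NULL);
    struct toy_work_file *file =
        toy_container_of(work, struct toy_work_file, task_work);

    file->closed = 1;   /* stand-in for passing the file to the real cleanup */
}
```

Embedding the handle in the structure means no separate allocation is needed to schedule the callback, and the callback can always find its way back to the owning object.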

From global delayed work

delayed_fput()

  1. Detach list of files from caller handle

  2. Use special llist iterator

  3. Pass each file to __fput()

The "real guts of fput()"

__fput()

What do we need to do?

  1. Clean up file-associated resources

  2. Drop references held by file

  3. Free allocated memory

The "real guts of fput()"

__fput()

Is the file really open?

  1. Check FMODE_OPENED flag in file->f_mode

  2. Set by do_dentry_open() in open(2) path

  3. Without this flag, skip to memory freeing

The "real guts of fput()"

__fput()

A debugging helper: might_sleep()

  1. Dump stack trace when called in atomic context

The "real guts of fput()"

__fput()

Spread the news of this closure

  1. fsnotify provides fs event info to other kernel systems

  2. e.g. inotify consumes this data

The "real guts of fput()"

__fput()

Call eventpoll_release() to clean up all resources associated with event polling on this struct file

The "real guts of fput()"

__fput()

Safe to release the file's locks: locks_remove_file()

The "real guts of fput()"

__fput()

Integrity Measurement Architecture (IMA)

  1. Prevents tampering with file contents

  2. Allocates resources for each file

  3. Cleanup with ima_file_free()

The "real guts of fput()"

__fput()

Handle pending asynchronous operations

  1. Only if file has FASYNC flag set

  2. Call fasync() handler defined by underlying file implementation

The "real guts of fput()"

__fput()

Call any extant release() file operation

  1. A module implementing a character device may define this handler

The "real guts of fput()"

__fput()

Release reference to a character device and file operations

  1. Only if file is backed by one

  2. Dropping the fops reference also drops the reference to any module that implements them

The "real guts of fput()"

__fput()

Drop reference to pid of file owner

  1. Contained in struct pid in struct fown_struct

  2. Which is a member of struct file

The "real guts of fput()"

__fput()

Use put_file_access() to perform access mode specific tasks to clean up access to the file

The "real guts of fput()"

__fput()

Drop a reference to the dentry for this file with dput()

The "real guts of fput()"

__fput()

Some file modes will require an unmount at this point

  1. Handled by dissolve_on_fput()

  2. May be covered in later material on namespaces

The "real guts of fput()"

__fput()

mntput() drops the reference to the struct file's struct vfsmount member

  1. represents an abstraction of a mounted filesystem

The "real guts of fput()"

__fput()

Finish the job with file_free()

  1. Called even when file wasn't open

Penultimate cleanup

file_free()

Notify Linux Security Modules (LSM) framework users to clean up with security_file_free()

Penultimate cleanup

file_free()

If the file is a backing store for a device or file, drop reference to associated struct path

  1. Example: a loopback device

Penultimate cleanup

file_free()

Decrement open file counter

  1. Directly decrement local percpu counter

  2. Global total periodically calculated

Penultimate cleanup

file_free()

Schedule file_free_rcu()

  1. Use call_rcu()

  2. Ensure existing readers can finish

The final function

file_free_rcu()

Drop reference to file's struct cred

  1. Stores the security context of the task that opened the file

  2. Last step before freeing memory

The final function

file_free_rcu()

Free the last structure's memory

  1. Backing files free their backing file structure

  2. Otherwise, return the struct file to its kmem_cache

  3. Back to whence it came

That's it

We return to userspace, concluding the close(2) implementation

Summary

The close(2) system call contains plenty of complexity and many layers

Summary

Many different types of in-kernel resources may be associated with a file

Summary

The kernel employs creative lock-avoidant techniques to implement correct concurrency

Summary

Correct reference counting is essential

Summary

The codepath can invoke several file operations, including release(), flush(), and fasync()

End