file descriptors: write syscall

Learning objective:

Gain a deeper understanding of file descriptors by comparing read and write

Overview

  1. Userspace and kernel entry points

  2. Contrast with read(2)

  3. A look at security hooks

  4. Superblocks and filesystem snapshotting

In the beginning

SYSCALL_DEFINE(write,...)

  1. All it does is ksys_write()

  2. Only one other caller in s390 compat code

  3. Originally there were more callers

Where did these callers go?

While file descriptors are preferred as a userspace interface, the kernel is better off working directly with struct files

ksys_write() removed from init/initramfs.c

ksys_write() removed from init/do_mounts_rd.c

  1. Notice that ksys_lseek() is entirely removed

The other kernel interface

kernel_write()

  1. Verify the write operation

  2. Acquire a filesystem resource

  3. Perform the underlying operation

  4. Release the filesystem resource

Almost a simplified vfs_write()

Callable from userspace and the kernel

ksys_write()

  1. Obtain a reference to the file position or bail

  2. Create a local copy of the file position

  3. Perform the virtual filesystem (VFS) write via vfs_write()

  4. If needed, update the file position

  5. Drop any held reference
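
The steps above can be modeled from userspace. This is only a sketch, not the kernel code: it assumes a seekable file, uses lseek() and pwrite() as stand-ins for the position handling and vfs_write(), and model_ksys_write is a hypothetical name.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Userspace model of the ksys_write() steps (not the kernel code).
 * Assumes fd refers to a seekable file; model_ksys_write is a
 * hypothetical name used only for this illustration.
 */
ssize_t model_ksys_write(int fd, const void *buf, size_t count)
{
	/* Steps 1-2: take a local copy of the current file position. */
	off_t pos = lseek(fd, 0, SEEK_CUR);
	if (pos < 0)
		return -1;	/* bad fd or not seekable: bail */

	/*
	 * Step 3: write at the local position; pwrite() leaves the
	 * descriptor's own position untouched.
	 */
	ssize_t ret = pwrite(fd, buf, count, pos);

	/* Step 4: only on success, publish the advanced position. */
	if (ret >= 0 && lseek(fd, pos + ret, SEEK_SET) < 0)
		return -1;

	/*
	 * Step 5: nothing to drop in this model; the kernel would
	 * release its reference at this point.
	 */
	return ret;
}
```

Two back-to-back calls on a fresh file should leave the descriptor's position at the sum of the two counts, just as two write(2) calls would.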

Spot the difference

ksys_write()

How does the function differ from ksys_read()?

  • vfs_write() instead of vfs_read()

  • const char __user * buf instead of char __user * buf

Keeping these slides DRY

  1. DRY: "Don't Repeat Yourself"

  2. See the slides on read

  3. We will skip right to vfs_write()

Right into the meat

vfs_write()

  1. Verify and validate the operation

  2. Acquire filesystem resources

  3. Perform the write operation

  4. Account for the operation

  5. Release filesystem resources

First validation

vfs_write()

  1. Make sure the file is open for writing (FMODE_WRITE)

  2. Make sure writing makes sense (FMODE_CAN_WRITE)

  3. Make sure buf points to a userspace address range
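
The first check is easy to observe from userspace: writing through a descriptor that was opened read-only fails with EBADF. A minimal sketch, where write_errno_on_readonly_fd is a hypothetical helper name:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to write one byte through a descriptor opened read-only and
 * report the resulting errno (0 if the write unexpectedly succeeds).
 * write_errno_on_readonly_fd is a hypothetical helper name.
 */
int write_errno_on_readonly_fd(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return errno;

	ssize_t ret = write(fd, "x", 1);
	int err = (ret < 0) ? errno : 0;

	close(fd);
	return err;
}
```

Calling it on a path such as /dev/null should return EBADF, since the descriptor lacks write mode regardless of the file's permissions.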

Verifying the target

rw_verify_area

  1. Disallow count values with top bit set

  2. Sanity check the file position

    1. Signed offsets may wrap or exceed bounds

  3. Verify write access
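
The count and position checks can be sketched as a small userspace function. This is a simplified model under stated assumptions: it ignores the kernel's special case for files that allow unsigned offsets, leaves out the write access step, and model_rw_verify_area is a hypothetical name.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified userspace model of the range checks described above.
 * It ignores files that permit unsigned offsets and does not call
 * any security hook.
 */
int model_rw_verify_area(int64_t pos, size_t count)
{
	/*
	 * 1. Reject counts whose top bit is set: they would be
	 *    negative when viewed as a signed size.
	 */
	if (count > SIZE_MAX / 2)
		return -EINVAL;

	/* 2. A negative position is out of bounds in this model. */
	if (pos < 0)
		return -EINVAL;

	/* 2.1 pos + count must not exceed the largest signed offset. */
	if ((uint64_t)count > (uint64_t)(INT64_MAX - pos))
		return -EINVAL;

	return 0;
}
```

The overflow test is written as a comparison against INT64_MAX - pos so that the model itself never performs a signed addition that could wrap.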

Security checks

security_file_permission()

  1. Use MAY_WRITE as our mask

  2. Call an arbitrary number of file_permission security hooks

  3. If permission is granted, set off notifications

The hook caller

call_int_hook()

  1. IRC: initial return code, returned as-is when no hooks are registered

  2. Call each hook and stop if one fails

  3. Statement expression evaluates to return code
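
A userspace model of this pattern, as a sketch only. It relies on GNU C statement expressions (a gcc/clang extension), replaces the kernel's hlist walk with a NULL-terminated array, and every name in it (perm_hook_t, model_call_int_hook, allow_hook, deny_hook) is hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook type: returns 0 to allow, nonzero to deny. */
typedef int (*perm_hook_t)(int mask);

/*
 * Model of the hook caller as a GNU C statement expression. The
 * kernel walks an hlist; a NULL-terminated array stands in for it.
 * The value of the final expression (RC) is the value of the macro.
 */
#define model_call_int_hook(HOOKS, IRC, ...) ({			\
	int RC = (IRC);						\
	for (perm_hook_t *P = (HOOKS); *P != NULL; P++) {	\
		RC = (*P)(__VA_ARGS__);				\
		if (RC != 0)					\
			break;					\
	}							\
	RC;							\
})

static int calls;	/* counts how many hooks actually ran */

static int allow_hook(int mask)
{
	(void)mask;
	calls++;
	return 0;
}

static int deny_hook(int mask)
{
	(void)mask;
	calls++;
	return -EACCES;
}
```

With the hooks { allow_hook, deny_hook, allow_hook }, the walk stops at the second entry, so the third hook never runs and the macro evaluates to -EACCES.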

hlist detour

hlist_for_each_entry()

"Hash List"

  1. Head contains pointer to only first node

  2. Regular list head has first & last pointers

  3. Useful when list is frequently empty, like hash list buckets
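
The size difference is visible in a userspace sketch. The struct layouts below are modeled on the kernel's list types; the model_ helper names are hypothetical and only illustrate insertion and traversal.

```c
#include <stddef.h>

/* Layouts modeled on the kernel's list types. */
struct list_head  { struct list_head *next, *prev; };
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Insert n at the front of the list headed by h. */
static void model_hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Count nodes by following next pointers from the single head pointer. */
static size_t model_hlist_count(const struct hlist_head *h)
{
	size_t n = 0;

	for (const struct hlist_node *p = h->first; p; p = p->next)
		n++;
	return n;
}
```

An empty hlist_head costs one pointer where a list_head costs two, which adds up across a table with many empty buckets.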

LSM XARGS

struct security_hook_heads

  1. Define a macro in a particular way

  2. Resolve many instances of this macro

  3. Undefine the macro to allow later re-use

LSM_HOOK(..., file_permission, ...)
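
The define, resolve, undefine cycle can be sketched in a single file. This is an illustration under assumptions: the kernel keeps its LSM_HOOK() entries in a separate header that is included repeatedly, whereas this sketch uses a hypothetical list macro, MODEL_HOOK_LIST, with a few example hook names.

```c
/*
 * Hypothetical hook list for illustration; the kernel keeps its
 * LSM_HOOK() entries in a separate header instead.
 */
#define MODEL_HOOK_LIST(X)	\
	X(file_permission)	\
	X(file_open)		\
	X(inode_create)

/* First expansion: one struct member per hook. */
#define X(name) int name##_head;
struct model_hook_heads {
	MODEL_HOOK_LIST(X)
};
#undef X

/* Second expansion: a table of hook names in the same order. */
#define X(name) #name,
static const char *const model_hook_names[] = {
	MODEL_HOOK_LIST(X)
};
#undef X

#define MODEL_HOOK_COUNT \
	(sizeof(model_hook_names) / sizeof(model_hook_names[0]))
```

Adding a hook to the list updates the struct and the name table together, which is the point of resolving one list through differently defined macros.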

Example file_permission hooks

selinux_file_permission()

  1. Security Enhanced Linux: Fine-grained mandatory access control (MAC)

  2. Associated with file_permission hook here

  3. Registered with security subsystem by security_add_hooks()

  4. Quick demo: ls -lZ

Example file_permission hooks

apparmor_file_permission()

  1. AppArmor: Per-program security profiles

  2. Associated with file_permission hook here

  3. Registered with security subsystem by security_add_hooks()

More information about LSM

Upstream documentation

Notify of permission granted

fsnotify_perm()

  1. Called with MAY_WRITE

	if (!(mask & (MAY_READ | MAY_OPEN)))
		return 0;

Therefore, this is a no-op

Back to the VFS

vfs_write()

One last check:

  1. If count > MAX_RW_COUNT, clamp it to MAX_RW_COUNT

  2. Ensures maximum value is rounded down to page boundary

  3. Exactly the same as read
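
A sketch of the clamp, assuming 4 KiB pages. The MODEL_ names are hypothetical; the kernel derives its page mask from the architecture's page size rather than a fixed constant.

```c
#include <limits.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. */
#define MODEL_PAGE_SIZE		4096UL
#define MODEL_PAGE_MASK		(~(MODEL_PAGE_SIZE - 1))
/* Largest int value rounded down to a page boundary. */
#define MODEL_MAX_RW_COUNT	((size_t)INT_MAX & MODEL_PAGE_MASK)

/* Clamp a requested byte count to the page-aligned maximum. */
size_t model_clamp_count(size_t count)
{
	if (count > MODEL_MAX_RW_COUNT)
		count = MODEL_MAX_RW_COUNT;
	return count;
}
```

Under the 4 KiB assumption the limit works out to 0x7ffff000 bytes, so an oversized request is shortened rather than rejected.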

Acquire filesystem resources

file_start_write()

  1. Check whether this is a regular file

  2. A regular file is 0 or more bytes on disk

  3. Not regular: character devices, directories, links

  4. S_ISREG()
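
The same S_ISREG() macro is available in userspace through stat(2). A minimal sketch, where path_is_regular is a hypothetical helper name; note that stat() follows symbolic links, so lstat() would be needed to classify the link itself.

```c
#include <sys/stat.h>

/*
 * Return 1 if path names a regular file, 0 if it names something
 * else, and -1 if stat() fails. path_is_regular is a hypothetical
 * helper name.
 */
int path_is_regular(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return S_ISREG(st.st_mode) ? 1 : 0;
}
```

A directory such as / and a character device such as /dev/null should both report 0, while a freshly created temporary file should report 1.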

Acquire filesystem resources

sb_start_write()

  1. Acquire superblock write access

  2. Each filesystem has one superblock

  3. Contains meta-information about filesystem

  4. Only relevant for regular files

Don't freeze me!

SB_FREEZE_WRITE and struct super_block

  1. Freezing enables snapshot fs backups

  2. Select from an array of percpu reader-writer locks

  3. Read is CPU local, write is cross-core

demo

Freezing a filesystem

Back to the VFS

vfs_write()

Now we can actually write!

  1. f_op->write() calls into the filesystem or module

  2. Like read, fall back to f_op->write_iter

  3. We should never hit the -EINVAL case if FMODE_CAN_WRITE is set
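
The dispatch order can be modeled with plain function pointers. This is a sketch: struct model_fops is a hypothetical, cut-down stand-in for the kernel's file operations table, and the toy backends exist only to show which path was taken.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, cut-down stand-in for the file operations table. */
struct model_fops {
	ssize_t (*write)(const char *buf, size_t count);
	ssize_t (*write_iter)(const char *buf, size_t count);
};

/* Prefer ->write, fall back to ->write_iter, else report -EINVAL. */
ssize_t model_dispatch_write(const struct model_fops *fops,
			     const char *buf, size_t count)
{
	if (fops->write)
		return fops->write(buf, count);
	if (fops->write_iter)
		return fops->write_iter(buf, count);
	return -EINVAL;
}

/* Toy backend for the ->write path: reports count unchanged. */
static ssize_t toy_write(const char *buf, size_t count)
{
	(void)buf;
	return (ssize_t)count;
}

/* Toy backend for the ->write_iter path: reports count doubled. */
static ssize_t toy_write_iter(const char *buf, size_t count)
{
	(void)buf;
	return (ssize_t)count * 2;
}
```

A table with neither pointer set reaches the -EINVAL branch, which is the case the earlier FMODE_CAN_WRITE check is meant to rule out.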

Back to the VFS

vfs_write()

When we write some bytes:

  1. Notify of file modification

  2. Account for bytes written by this task

Back to the VFS

vfs_write()

Unconditionally:

  1. Account for write syscall count by this task

  2. Release any filesystem resources acquired earlier

  3. Return bytes written or errno to userspace

This concludes write(2)

Summary

Writing is quite similar to reading, but a bit more complex

Summary

Linux Security Modules (LSM) provides a flexible way to enforce sets of security policies at the kernel level

Summary

Minimizing memory footprint in the kernel is critical, and this justifies hlist, whose head stores one pointer instead of two

Summary

Kernel internal use of system call functionality is still evolving

End