Gain a deeper understanding of file descriptors by comparing read and write
Userspace and kernel entry points
Contrast with read(2)
read(2)
A look at security hooks
Superblocks and filesystem snapshotting
SYSCALL_DEFINE3(write, ...)
All it does is call ksys_write()
ksys_write()
Only one other caller in s390 compat code
Originally there were more callers
While file descriptors are preferred as a userspace interface, the kernel is better off working directly with struct files
struct file
ksys_write() removed from init/initramfs.c
ksys_write() removed from init/do_mounts_rd.c
ksys_lseek()
kernel_write()
Verify the write operation
Acquire a filesystem resource
Perform the underlying operation
Release the filesystem resource
Almost a simplified vfs_write()
vfs_write()
Obtain a reference to the file position or bail
Create a local copy of the file position
Perform the virtual filesystem (VFS) write
If needed, update the file position
Drop any held reference
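The position handling above is visible from userspace: after a successful write(2), the file position has advanced by the number of bytes written. A minimal sketch, assuming a writable temporary directory; `position_after_write` is a hypothetical helper name, not a kernel or libc function.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to a fresh temporary file and
 * return the file position afterwards, or -1 on any failure. */
long position_after_write(const char *data, size_t len)
{
    FILE *tmp = tmpfile();                /* anonymous regular file */
    if (!tmp)
        return -1;
    int fd = fileno(tmp);

    ssize_t written = write(fd, data, len);
    off_t pos = lseek(fd, 0, SEEK_CUR);   /* query the position without moving it */

    fclose(tmp);
    if (written < 0 || (size_t)written != len)
        return -1;
    return (long)pos;
}
```

Calling `position_after_write("hello", 5)` should report a position of 5 on a fresh file.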
How does the function differ from ksys_read()?
ksys_read()
vfs_write() instead of vfs_read()
vfs_read()
const char __user * buf instead of char __user * buf
const char __user * buf
char __user * buf
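The same const difference shows up in the userspace prototypes: write(2) takes a `const void *` because it only reads from the buffer, while read(2) takes a `void *` because it fills the buffer. A small sketch using a pipe; `pipe_round_trip` is a hypothetical helper name.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send `msg` through a pipe and read it back into `out`.
 * Returns the number of bytes read, or -1 on failure. */
ssize_t pipe_round_trip(const char *msg, char *out, size_t out_len)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    /* write(2) only reads from its buffer, so a const pointer is fine. */
    ssize_t written = write(fds[1], msg, strlen(msg));
    close(fds[1]);

    /* read(2) stores into its buffer, so it needs a mutable one. */
    ssize_t got = (written < 0) ? -1 : read(fds[0], out, out_len);
    close(fds[0]);
    return got;
}
```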
DRY: "Don't Repeat Yourself"
See the slides on read
We will skip right to vfs_write()
Verify and validate the operation
Acquire filesystem resources
Perform the write operation
Account for the operation
Release filesystem resources
Make sure the file is open for writing (FMODE_WRITE)
FMODE_WRITE
Make sure writing makes sense (FMODE_CAN_WRITE)
FMODE_CAN_WRITE
Make sure buf is a userspace address range
buf
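The FMODE_WRITE check can be observed from userspace by writing to a descriptor that was not opened for writing. A minimal sketch, assuming Linux behavior, where the read end of a pipe is read-only and the failure is expected to surface as EBADF; `errno_writing_to_read_end` is a hypothetical helper name.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try to write to the read end of a pipe and return
 * the errno that write(2) reports (0 if the write unexpectedly succeeds,
 * -1 if the pipe could not be created). */
int errno_writing_to_read_end(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    errno = 0;
    ssize_t ret = write(fds[0], "x", 1);  /* fds[0] is the read-only end */
    int err = (ret < 0) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```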
rw_verify_area
Disallow count values with top bit set
Sanity check the file position
Verify write access
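The first two checks can be modeled in userspace. This is a simplified sketch, not the kernel function: it assumes a file that does not allow unsigned offsets, and it leaves out the security hook call entirely. `model_rw_verify_area` is a hypothetical name.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <sys/types.h>

/* Simplified userspace model of the sanity checks (an assumption-laden
 * sketch, not kernel code). Returns 0 if the request looks sane. */
int model_rw_verify_area(long long pos, size_t count)
{
    if ((ssize_t)count < 0)      /* top bit of count is set */
        return -EINVAL;
    if (pos < 0)                 /* negative file position */
        return -EINVAL;
    /* pos + count must not overflow a signed 64-bit position */
    if (count > (unsigned long long)(LLONG_MAX - pos))
        return -EINVAL;
    return 0;
}
```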
security_file_permission()
Use MAY_WRITE as our mask
MAY_WRITE
Call an arbitrary number of file_permission security hooks
file_permission
If permission is granted, set off notifications
call_int_hook()
IRC: initial return code if all hooks return 0
Call each hook and stop if one fails
Statement expression evaluates to return code
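The control flow of the macro can be sketched as an ordinary function. This is a userspace model under stated assumptions: the hook signature, the array of hooks (the kernel walks an hlist), and all `model_*` names are hypothetical. The kernel macro wraps the same loop in a statement expression so that it evaluates to the return code.

```c
#include <errno.h>
#include <stddef.h>

typedef int (*model_hook_fn)(int mask);    /* hypothetical hook signature */

int model_allow_hook(int mask) { (void)mask; return 0; }
int model_deny_hook(int mask)  { (void)mask; return -EACCES; }

/* Start from `irc`, call the hooks in order, stop at the first nonzero result. */
int model_call_int_hook(int irc, const model_hook_fn *hooks, size_t n, int mask)
{
    int rc = irc;
    for (size_t i = 0; i < n; i++) {
        rc = hooks[i](mask);
        if (rc != 0)
            break;                         /* one hook failed: stop here */
    }
    return rc;
}
```

With no hooks in the array, the loop body never runs and the model returns `irc` unchanged.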
hlist_for_each_entry()
"Hash List"
Head contains a pointer to the first node only
Regular list head has first & last pointers
Useful when list is frequently empty, like hash list buckets
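The size difference is easy to check with a userspace model. The struct layouts below follow the shape described above, but the `model_*` names and helper functions are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

struct model_list_head  { struct model_list_head *next, *prev; };
struct model_hlist_node { struct model_hlist_node *next, **pprev; };
struct model_hlist_head { struct model_hlist_node *first; };

/* Insert `n` at the front of the list headed by `h`. */
void model_hlist_add_head(struct model_hlist_node *n, struct model_hlist_head *h)
{
    struct model_hlist_node *first = h->first;

    n->next = first;
    if (first)
        first->pprev = &n->next;   /* old first node now hangs off n */
    h->first = n;
    n->pprev = &h->first;          /* n is reachable through the head */
}

/* Count the nodes by walking from the head's single pointer. */
size_t model_hlist_length(const struct model_hlist_head *h)
{
    size_t len = 0;
    for (const struct model_hlist_node *n = h->first; n; n = n->next)
        len++;
    return len;
}
```

An empty hash bucket costs one pointer with this head, versus two with the regular list head.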
struct security_hook_heads
Define a macro in particular way
Resolve many instances of this macro
Undefine the macro to allow later re-use
LSM_HOOK(..., file_permission, ...)
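The define, expand, undefine pattern is often called an X-macro. A minimal single-file sketch follows; it assumes a list macro (`MODEL_HOOK_DEFS`, hypothetical) in place of the header that the kernel includes repeatedly, and the struct members are plain pointers rather than list heads.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hook list standing in for a repeatedly included header. */
#define MODEL_HOOK_DEFS          \
    MODEL_HOOK(file_permission)  \
    MODEL_HOOK(file_open)        \
    MODEL_HOOK(inode_create)

/* First expansion: one struct member per hook. */
#define MODEL_HOOK(name) void *name;
struct model_hook_heads { MODEL_HOOK_DEFS };
#undef MODEL_HOOK

/* Second expansion: one name string per hook. */
#define MODEL_HOOK(name) #name,
const char *const model_hook_names[] = { MODEL_HOOK_DEFS };
#undef MODEL_HOOK

#define MODEL_HOOK_COUNT (sizeof(model_hook_names) / sizeof(model_hook_names[0]))
```

Adding a hook to the list updates both the struct and the name table without touching either definition.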
selinux_file_permission()
Security Enhanced Linux: Fine-grained mandatory access control (MAC)
Associated with file_permission hook here
Registered with security subsystem by security_add_hooks()
security_add_hooks()
Quick demo: ls -lZ
ls -lZ
apparmor_file_permission()
AppArmor: Per-program security profiles
Upstream documentation
fsnotify_perm()
if (!(mask & (MAY_READ | MAY_OPEN))) return 0;
Therefore, this is a no-op
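The no-op follows from the mask bits alone. A small sketch of the early return; the `MODEL_MAY_*` values are assumed to mirror the kernel's MAY_* constants, and the function name is hypothetical.

```c
/* Assumed to match the kernel's MAY_WRITE, MAY_READ and MAY_OPEN bits. */
#define MODEL_MAY_WRITE 0x02
#define MODEL_MAY_READ  0x04
#define MODEL_MAY_OPEN  0x20

/* Returns 1 if the permission-event path would continue past the early
 * return for `mask`, 0 if it bails out immediately. */
int model_fsnotify_perm_does_work(int mask)
{
    if (!(mask & (MODEL_MAY_READ | MODEL_MAY_OPEN)))
        return 0;
    return 1;
}
```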
One last check:
count >= MAX_RW_COUNT
MAX_RW_COUNT
Ensures maximum value is rounded down to page boundary
Exactly the same as read
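The rounding can be worked through in userspace. This sketch assumes a 4096-byte page size; the `MODEL_*` names are hypothetical stand-ins for the kernel's PAGE_MASK and MAX_RW_COUNT.

```c
#include <limits.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE    4096UL                      /* assumed page size */
#define MODEL_PAGE_MASK    (~(MODEL_PAGE_SIZE - 1))
#define MODEL_MAX_RW_COUNT ((size_t)INT_MAX & MODEL_PAGE_MASK)

/* Clamp an oversized request instead of rejecting it. */
size_t model_clamp_count(size_t count)
{
    return count > MODEL_MAX_RW_COUNT ? MODEL_MAX_RW_COUNT : count;
}
```

With these assumptions the limit is 0x7ffff000, the largest page-aligned value that still fits in an int.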
file_start_write()
Check whether this is a regular file
A regular file is 0 or more bytes on disk
Not regular: character devices, directories, links
S_ISREG()
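S_ISREG() is also available in userspace through <sys/stat.h>, so the same test can be tried on real files. A minimal sketch, assuming a writable temporary directory; both helper names are hypothetical.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if `path` names a regular file, 0 if not, -1 if stat(2) fails. */
int is_regular_path(const char *path)
{
    struct stat st;
    if (stat(path, &st) < 0)
        return -1;
    return S_ISREG(st.st_mode) ? 1 : 0;
}

/* Creates an anonymous temporary file and reports whether it is regular. */
int tmpfile_is_regular(void)
{
    FILE *tmp = tmpfile();
    if (!tmp)
        return -1;

    struct stat st;
    int ret = (fstat(fileno(tmp), &st) < 0) ? -1 : (S_ISREG(st.st_mode) ? 1 : 0);
    fclose(tmp);
    return ret;
}
```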
sb_start_write()
Acquire superblock write access
Each filesystem has one superblock
Contains meta-information about filesystem
Only relevant for regular files
SB_FREEZE_WRITE
struct super_block
Freezing enables snapshot fs backups
Select from an array of percpu reader-writer locks
Read is CPU local, write is cross-core
Freezing a filesystem
Now we can actually write!
f_op->write() calls into the filesystem or module
f_op->write()
Like read, fall back to f_op->write_iter
f_op->write_iter
We should never hit the -EINVAL case if FMODE_CAN_WRITE is set
-EINVAL
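The dispatch order can be sketched with a table of function pointers. This is a userspace model with simplified, hypothetical signatures (the real operations also take the file and position); it only illustrates the write, then write_iter, then -EINVAL ordering.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-in for the write-related file operations. */
struct model_file_ops {
    ssize_t (*write)(const char *buf, size_t count);
    ssize_t (*write_iter)(const char *buf, size_t count);
};

/* Trivial backend that claims to have written every byte. */
ssize_t model_count_bytes(const char *buf, size_t count)
{
    (void)buf;
    return (ssize_t)count;
}

ssize_t model_dispatch_write(const struct model_file_ops *ops,
                             const char *buf, size_t count)
{
    if (ops->write)
        return ops->write(buf, count);        /* preferred entry point */
    if (ops->write_iter)
        return ops->write_iter(buf, count);   /* fallback */
    return -EINVAL;                           /* neither is implemented */
}
```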
When we write some bytes:
Notify of file modification
Account for bytes written by this task
Unconditionally:
Account for write syscall count by this task
Release any filesystem resources acquired earlier
Return bytes written or errno to userspace
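Because the return value is a byte count that may be smaller than requested, userspace callers commonly loop until everything is written. A minimal sketch of that pattern; `write_all` is a hypothetical helper, not a libc function.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write(2) until all bytes are written.
 * Returns 0 on success, or -1 with errno set by the failing write(2). */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted before any byte was written */
            return -1;
        }
        buf += n;                /* short write: advance and retry the rest */
        len -= (size_t)n;
    }
    return 0;
}
```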
This concludes write(2)
write(2)
Writing is quite similar to reading, but a bit more complex
Linux Security Modules (LSM) provide a flexible way to enforce sets of security policies at the kernel level
Minimizing memory footprint in the kernel is critical, which justifies hlist: its head holds one pointer instead of two
hlist
Kernel internal use of system call functionality is still evolving