file descriptors: read syscall

Learning objective

Gain a deeper understanding of file descriptors by seeing how read(2) uses them

Overview

  1. read(2) entry

  2. Advanced reference count optimization

  3. Reading through the virtual filesystem

Entry point

SYSCALL_DEFINE3(read, ...)

  1. Just calls ksys_read()

  2. Only one other caller in s390 compat code

  3. Originally there were more callers

Callable from userspace and the kernel

ksys_read()

  1. Obtain a reference to the file (locking its position if needed) or bail

  2. Create a local copy of the file position

  3. Perform virtual filesystem (vfs) read

  4. If needed, update the file position

  5. Drop any held reference
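The steps above can be sketched in userspace. This is a minimal model, not kernel code: struct model_file, model_vfs_read() and model_ksys_read() are hypothetical names, and the "file" is just an in-memory buffer. It shows the pattern of reading through a local copy of the position and writing it back only on success.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for struct file: a byte buffer plus a position. */
struct model_file {
	const char *data;
	size_t size;
	long long f_pos;
};

/* Stand-in for vfs_read(): copies from *pos and advances *pos. */
static ssize_t model_vfs_read(struct model_file *file, char *buf,
			      size_t count, long long *pos)
{
	assert(buf != NULL);
	if (*pos < 0)
		return -EINVAL;
	if ((size_t)*pos >= file->size)
		return 0;			/* end of file */
	if (count > file->size - (size_t)*pos)
		count = file->size - (size_t)*pos;
	memcpy(buf, file->data + *pos, count);
	*pos += (long long)count;
	return (ssize_t)count;
}

/* Mirrors the listed steps: check the file, copy the position, read, write back. */
static ssize_t model_ksys_read(struct model_file *file, char *buf, size_t count)
{
	long long pos;
	ssize_t ret;

	if (!file)
		return -EBADF;			/* nothing to read from: bail */

	pos = file->f_pos;			/* local copy of the position */
	ret = model_vfs_read(file, buf, count, &pos);
	if (ret >= 0)
		file->f_pos = pos;		/* update only if the read did not fail */
	return ret;
}
```

Working on a copy means a failed read leaves the shared position untouched.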

Optimizing the references

fdget_pos()

What is this struct fd and why might we want something more than just the struct file?

  • We don't always need to have our own reference to the struct file

  • We need to keep track of which references need to be dropped

Optimizing the references

fdget_pos()

  1. Get an unsigned long

  2. Split it into a struct fd

static inline struct fd __to_fd(unsigned long v)
{
	return (struct fd){(struct file *)(v & ~3),v & 3};
}
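The split only works because the pointer's two low bits are free. Below is a minimal userspace sketch of the round trip, assuming a stand-in type aligned to at least 4 bytes. model_file, model_fd, model_pack() and model_to_fd() are hypothetical names; the two flag values match the kernel's FDPUT_FPUT and FDPUT_POS_UNLOCK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FDPUT_FPUT       1	/* a reference must be dropped later */
#define FDPUT_POS_UNLOCK 2	/* the position lock must be released later */

/* Hypothetical stand-in; any type aligned to at least 4 bytes works. */
struct model_file {
	long long f_pos;
};

struct model_fd {
	struct model_file *file;
	unsigned int flags;
};

/* Pack a pointer and up to two flag bits into one word. */
static uintptr_t model_pack(struct model_file *file, unsigned int flags)
{
	/* Alignment guarantees that the low two bits of the pointer are zero. */
	assert(((uintptr_t)file & 3) == 0);
	return (uintptr_t)file | (flags & 3);
}

/* Same split as __to_fd(): high bits are the pointer, low two bits the flags. */
static struct model_fd model_to_fd(uintptr_t v)
{
	return (struct model_fd){ (struct model_file *)(v & ~(uintptr_t)3),
				  (unsigned int)(v & 3) };
}
```

A value of 0 splits into a NULL pointer with no flags, so "no file" needs no extra encoding.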

Optimizing the references

__fdget_pos()

  1. First, do we need our own reference to the file?

  2. Then, do we need the file position lock?

Optimizing the references

__fdget()

Get a reference to a file descriptor unless it's opened in path mode

Get what's needed

__fget_light()

  1. If the file descriptor table's refcount is 1, we can borrow the table's reference

  2. Otherwise, we need our own reference

    1. And we will need to free it later
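The borrow-or-own decision can be modeled in a few lines. This is a single-threaded sketch with hypothetical names (model_files, model_fget_light(), model_fdput()) and plain integers where the kernel uses atomics and RCU. The assumption is that a table with exactly one user cannot have its entry closed by anyone else during the call, so no extra reference is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FDPUT_FPUT   1		/* caller must drop the reference later */
#define MODEL_NR_FDS 4		/* hypothetical fixed table size */

struct model_file {
	long refcount;
};

/* Hypothetical stand-in for the per-process file descriptor table. */
struct model_files {
	int count;			/* number of tasks sharing this table */
	struct model_file *fdt[MODEL_NR_FDS];
};

static uintptr_t model_fget_light(struct model_files *files, unsigned int fd)
{
	struct model_file *file;

	assert(files != NULL);
	if (fd >= MODEL_NR_FDS)
		return 0;
	file = files->fdt[fd];
	if (!file)
		return 0;

	if (files->count == 1)
		return (uintptr_t)file;		/* sole user: borrow */

	file->refcount++;			/* shared table: take our own */
	return (uintptr_t)file | FDPUT_FPUT;	/* remember to drop it */
}

static void model_fdput(uintptr_t v)
{
	struct model_file *file = (struct model_file *)(v & ~(uintptr_t)3);

	if (v & FDPUT_FPUT)
		file->refcount--;
}
```

In the common single-threaded case, the lookup never touches the file's refcount.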

Get what's needed

__fget_light()

  1. Use atomic_read_acquire() to get the file descriptor table's current reference count

  2. Call files_lookup_fd_raw() directly

  3. The unsigned long return value will be cast

Many layers surrounding increment

  1. __fget()

    1. dunder since we already mask FMODE_PATH
  2. __fget_files()

  3. fget_files_rcu()

  4. get_file_rcu()

Get what's needed

__fget_light()

In the case we cannot borrow, mark the lower bits of the pointer

Optimizing the references

__fdget_pos()

The following line should now be clearer:

struct file *file = (struct file *)(v & ~3);

Check if we need the fpos lock

file_needs_f_pos_lock()

When do we need the file position lock?

It is standardized

Any regular file or directory has FMODE_ATOMIC_POS set

  1. in do_dentry_open()

  2. POSIX.1-2017 2.9.7

In addition, we check file_count() and whether the file has a shared iterator
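The condition can be written as a single predicate. A minimal sketch, assuming a hypothetical model_file with a placeholder flag bit (MODEL_ATOMIC_POS is not the kernel's value) and a boolean standing in for "the file operations provide a shared iterator":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ATOMIC_POS 0x1	/* placeholder bit, not the kernel's value */

struct model_file {
	unsigned int f_mode;
	long count;			/* references to this open file */
	bool has_iterate_shared;	/* stands in for a shared-iterator hook */
};

/* Lock only when atomic positions are required and the position can be contended. */
static bool model_needs_f_pos_lock(const struct model_file *file)
{
	assert(file != NULL);
	return (file->f_mode & MODEL_ATOMIC_POS) &&
	       (file->count > 1 || file->has_iterate_shared);
}
```

A regular file with a single reference and no shared iterator skips the lock entirely.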

Optimizing the references

__fdget_pos()

To finish up, lock and set another bit if needed

  1. The return value is split into pointer and flags in __to_fd()

A note on CLASS

Not used much yet, but may be soon

  1. DEFINE_CLASS(fd,...)

  2. #define DEFINE_CLASS(...)
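DEFINE_CLASS() builds on the compiler's cleanup variable attribute (a GCC/Clang extension). The sketch below shows that mechanism in userspace. MODEL_CLASS_FD, model_fd_get() and model_fd_cleanup() are hypothetical names, and a simple counter takes the place of real reference counting.

```c
#include <assert.h>

static int model_open_count;	/* handles currently held */

struct model_fd {
	int handle;
};

static struct model_fd model_fd_get(int h)
{
	if (h >= 0)
		model_open_count++;
	return (struct model_fd){ h };
}

/* Destructor: runs automatically when the variable goes out of scope. */
static void model_fd_cleanup(struct model_fd *f)
{
	if (f->handle >= 0) {
		assert(model_open_count > 0);
		model_open_count--;
	}
}

/* Hypothetical macro in the spirit of CLASS(): declare, acquire, auto-release. */
#define MODEL_CLASS_FD(var, h) \
	struct model_fd var __attribute__((cleanup(model_fd_cleanup))) = \
		model_fd_get(h)

static int model_use(int h)
{
	MODEL_CLASS_FD(f, h);

	if (f.handle < 0)
		return -1;		/* early return: cleanup still runs */
	return model_open_count;	/* value is computed before cleanup runs */
}
```

Every return path releases the handle without an explicit put call, which is the point of the pattern.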

Back where we came from

ksys_read()

First check whether the file is open with f.file

  1. Maybe soon to become fd_empty() and fd_file()

  2. Recent patchset by maintainer

No position in a stream

file_ppos()

Otherwise, this just gets the address of the file position

The meat of the read

vfs_read()

Overview:

  1. Validate the operation and its inputs

  2. Execute the specific read handler

  3. Notify of completion

The meat of the read

vfs_read()

First three checks

  1. Make sure the file is open for reading

  2. Make sure that the file can be read

  3. Make sure the output buffer is a sane address

Check the area to read from

rw_verify_area()

  1. Sanity check the file position

    1. Signed offsets may wrap or exceed bounds
  2. Verify read access

    1. security_file_permission()
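The position check can be sketched without the kernel's wrapping arithmetic. This is a simplified model for a file that does not allow unsigned offsets. model_verify_area() is a hypothetical name, and the permission hook is reduced to a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the position sanity checks; returns 0 or a negative errno. */
static int model_verify_area(const int64_t *ppos, size_t count)
{
	/* A count that would be negative as a signed size is rejected. */
	if (count > SIZE_MAX / 2)
		return -EINVAL;

	if (ppos) {
		int64_t pos = *ppos;

		if (pos < 0)
			return -EINVAL;
		/* pos + count must not exceed INT64_MAX; test without overflowing. */
		if ((uint64_t)count > (uint64_t)(INT64_MAX - pos))
			return -EINVAL;
	}

	/* The kernel would now ask security_file_permission() for read access. */
	return 0;
}
```

Comparing against INT64_MAX - pos avoids signed overflow, which is undefined behavior in standard C.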

The meat of the read

vfs_read()

Check that count isn't too big

  1. If count > MAX_RW_COUNT, clamp it to MAX_RW_COUNT

  2. Ensures maximum value is rounded down to page boundary
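A quick way to see the rounding is to build the limit by hand. In this sketch the 4 KiB page size is an assumption (the kernel takes it from the architecture), and the MODEL_ names and model_clamp_count() are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. */
#define MODEL_PAGE_SIZE    4096UL
#define MODEL_PAGE_MASK    (~(MODEL_PAGE_SIZE - 1))
/* INT_MAX with the in-page offset bits cleared: rounded down to a page. */
#define MODEL_MAX_RW_COUNT ((size_t)(INT_MAX & MODEL_PAGE_MASK))

_Static_assert(MODEL_MAX_RW_COUNT % MODEL_PAGE_SIZE == 0,
	       "limit must be page aligned");

/* Oversized requests are shortened, not rejected. */
static size_t model_clamp_count(size_t count)
{
	if (count > MODEL_MAX_RW_COUNT)
		count = MODEL_MAX_RW_COUNT;
	return count;
}
```

With 4 KiB pages and a 32-bit int the limit is 0x7ffff000, so a single read returns just under 2 GiB at most.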

The meat of the read

vfs_read()

Call the actual read!

  1. Call the read() member of file operations, if it is set

  2. Otherwise, call read_iter()
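The dispatch is a plain function pointer check. Below is a minimal sketch with a hypothetical, trimmed-down operations table. Both hooks share one simplified signature here, whereas the kernel's read_iter() takes different arguments and is reached through a wrapper. The two sample handlers exist only to make the choice visible.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct model_file;

/* Hypothetical operations table with only the two read hooks. */
struct model_file_operations {
	ssize_t (*read)(struct model_file *, char *, size_t);
	ssize_t (*read_iter)(struct model_file *, char *, size_t);
};

struct model_file {
	const struct model_file_operations *f_op;
};

/* Sample handlers: fill the buffer with a marker byte. */
static ssize_t model_plain_read(struct model_file *f, char *buf, size_t n)
{
	(void)f;
	memset(buf, 'r', n);
	return (ssize_t)n;
}

static ssize_t model_iter_read(struct model_file *f, char *buf, size_t n)
{
	(void)f;
	memset(buf, 'i', n);
	return (ssize_t)n;
}

/* Prefer read(); fall back to read_iter(); fail if neither exists. */
static ssize_t model_dispatch_read(struct model_file *file, char *buf,
				   size_t count)
{
	assert(file && file->f_op);
	if (file->f_op->read)
		return file->f_op->read(file, buf, count);
	if (file->f_op->read_iter)
		return file->f_op->read_iter(file, buf, count);
	return -EINVAL;
}
```

A file whose operations provide neither hook cannot be read, and the caller gets an error instead.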

The meat of the read

vfs_read()

If we are successful:

  1. Tell fsnotify to let others know of this access

  2. Account for task's bytes read

Unconditionally:

  1. Account for the task's read syscall

See struct task_io_accounting

Back where we came from

ksys_read()

Last steps to wrap up

  1. Update the file position if relevant

  2. Drop any references we may have

  3. Return the number of bytes read or an error

Drop any references we may have

fdput_pos()

  1. If we locked the file position: __f_unlock_pos()

  2. If we took our own reference: fdput() calls fput()
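Teardown just replays the flag bits recorded at lookup time. This is a userspace sketch with hypothetical stand-ins: a boolean replaces the position mutex and a plain counter replaces the file refcount. The flag values match the kernel's FDPUT_FPUT and FDPUT_POS_UNLOCK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FDPUT_FPUT       1
#define FDPUT_POS_UNLOCK 2

struct model_file {
	long refcount;
	bool pos_locked;	/* stands in for the position mutex */
};

struct model_fd {
	struct model_file *file;
	unsigned int flags;
};

static void model_fdput_pos(struct model_fd f)
{
	if (f.flags & FDPUT_POS_UNLOCK) {
		assert(f.file->pos_locked);
		f.file->pos_locked = false;	/* role of __f_unlock_pos() */
	}
	if (f.flags & FDPUT_FPUT)
		f.file->refcount--;		/* role of fput() */
}
```

With no flags set, the call does nothing, which is the payoff for borrowing instead of taking a reference.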

Summary

Read doesn't need to do as much as open or write

Summary

Small optimizations on file descriptor operations add up to significant performance improvements

Summary

Watch out for data storage in unexpected places like the lower bits of a pointer!

End