I have been meaning to learn about database internals. But before that, I wanted to understand the simplest database: the file. Doing this would also help me write down a mental model of the file system.

As we’ve discussed before, mental models are what you are really trying to develop when learning about systems. For file systems, your mental model should eventually include answers to questions like: what on-disk structures store the file system’s data and metadata? What happens when a process opens a file? Which on-disk structures are accessed during a read or write? By working on and improving your mental model, you develop an abstract understanding of what is going on, instead of just trying to understand the specifics of some file-system code (though that is also useful, of course!).

~ From OSTEP Ch 40 - File System Implementation

An OS is all about abstractions. A process is an abstraction over the CPU and memory (via address spaces). A thread is an abstraction over a schedulable unit of execution. The file system is an abstraction over persistent storage.

The Basics

Here are some key aspects of files from the OS perspective:

  1. Data Container: A file serves as a container for storing data, regardless of its type. It can hold anything from text and images to program code and multimedia content.
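  2. Metadata: Each file has associated metadata, which includes information like the file size, creation date, modification date, file permissions, and more. This metadata helps the OS manage and organize files efficiently. On UNIX-like systems the file name is not part of this per-file metadata; it lives in the directory entry that points to the file's inode.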
  3. File System: The OS manages files through a file system, which is a hierarchical structure that organizes files in directories or folders. The file system provides a way to locate, access, and organize files on storage devices, such as hard drives or SSDs.
  4. File Operations: The OS provides a set of operations to work with files, including creating, opening, reading, writing, closing, renaming, moving, and deleting files. These operations are essential for interacting with the data stored in files.
  5. File Access Control: The OS enforces file access control through permissions, determining which users or processes can read, write, or execute specific files. This helps protect sensitive data and maintain system security.
  6. File Extensions: In many operating systems, file names end in an extension that hints at the file type, for example “.txt” for text files, “.jpg” for image files, and “.mp3” for audio files. Extensions help the OS or desktop environment pick the application to open a file with; on UNIX-like systems they are only a naming convention and the kernel does not enforce them.
  7. Stream I/O: Files can be accessed through stream input/output (I/O) operations, allowing data to be read from or written to a file sequentially or randomly.
  8. Virtual File Systems: Modern operating systems often use a virtual file system layer to provide a unified interface for accessing different types of storage devices and file systems. This abstraction allows the OS to support various storage technologies transparently.
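
To make the metadata point concrete, here is a minimal sketch that asks the OS for a file's metadata through the POSIX stat() call. The helper name print_metadata is made up for this example, and the set of fields printed is just a selection.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper (not from the article): prints a few metadata
 * fields of `path`. Returns 0 on success, -1 if stat() fails. */
int print_metadata(const char *path) {
    assert(path != NULL);
    struct stat st;
    if (stat(path, &st) != 0) {
        perror("stat");
        return -1;
    }
    printf("inode:        %llu\n", (unsigned long long)st.st_ino);
    printf("size:         %lld bytes\n", (long long)st.st_size);
    printf("permissions:  %o\n", (unsigned)(st.st_mode & 0777));
    printf("hard links:   %lu\n", (unsigned long)st.st_nlink);
    printf("modified:     %lld (seconds since the epoch)\n",
           (long long)st.st_mtime);
    printf("regular file: %s\n", S_ISREG(st.st_mode) ? "yes" : "no");
    return 0;
}
```

Calling print_metadata("foo") from main prints one line per field. Note that the file name never appears in struct stat; it is only the key used to find the file.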

File I/O Operations

$ strace cat foo

...
read(3, "helo\n", 131072)               = 5
write(1, "helo\n", 5helo
)                   = 5
read(3, "", 131072)                     = 0
munmap(0x7fe3b475c000, 139264)          = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
...
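
The trace above boils down to a read()/write() loop that stops when read() returns 0. Below is a minimal sketch of that loop, not the actual source of cat. The helper name copy_file_to_fd and the 4096-byte buffer are choices made for this example (cat used 131072 in the trace).

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copies the file at `path` to descriptor `out_fd`
 * using the same read()/write() pattern seen in the trace.
 * Returns the number of bytes copied, or -1 on error. */
long copy_file_to_fd(const char *path, int out_fd) {
    assert(path != NULL);
    char buf[4096];                 /* buffer size is a tuning choice */
    long total = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)                 /* 0 means end of file */
            break;
        if (n < 0) {
            close(fd);
            return -1;
        }
        /* write() may accept fewer bytes than asked, so loop until done */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                close(fd);
                return -1;
            }
            off += w;
        }
        total += n;
    }
    close(fd);
    return total;
}
```

Calling copy_file_to_fd("foo", 1) writes the file to standard output (descriptor 1), which is what the write(1, ...) line in the trace is doing. The sketch does not retry on EINTR.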

Relating to Processes

From OSTEP Ch 39

How editors like Vim edit files

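// Needs <fcntl.h>, <unistd.h>, <sys/stat.h> and <stdio.h> (for rename).
// Write the new contents to a temporary file first.
int fd = open("foo.txt.tmp", O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IWUSR);
write(fd, buffer, size);  // write out new version of file
fsync(fd);                // force the data to disk before the rename
close(fd);
// Atomically replace the old file with the new version.
rename("foo.txt.tmp", "foo.txt");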
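
The snippet above leaves out error handling. Here is a fuller sketch of the same write-to-temp-then-rename pattern with the checks filled in. The function name atomic_save, the ".tmp" suffix rule and the 4096-byte path buffer are assumptions made for this example, not something a particular editor is known to use.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>      /* rename(), snprintf() */
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: replaces `path` with `size` bytes from `buf`
 * by writing "<path>.tmp" and renaming it over `path`.
 * Returns 0 on success, -1 on error. */
int atomic_save(const char *path, const void *buf, size_t size) {
    assert(path != NULL && buf != NULL);
    char tmp[4096];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (len < 0 || (size_t)len >= sizeof tmp)
        return -1;                      /* path too long for our buffer */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = size;
    while (left > 0) {                  /* handle short writes */
        ssize_t w = write(fd, p, left);
        if (w < 0)
            goto fail;
        p += w;
        left -= (size_t)w;
    }
    if (fsync(fd) != 0)                 /* data on disk before the rename */
        goto fail;
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }
    if (rename(tmp, path) != 0) {       /* swap old for new in one step */
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```

A reader of foo.txt sees either the old contents or the new contents, never a half-written file. One thing this sketch does not do is fsync the parent directory, which may be needed for the rename itself to survive a crash.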

The Mental Model

From OSTEP Ch 40

The Multi-Level Index

Multi-level index

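Let’s take some time here to talk about hard and soft links. A hard link is another directory entry that points to the same inode, so foo and foo-hard in the session below report the same inode number and a link count of 2. A soft (symbolic) link is a separate file with its own inode whose contents are the path of its target, which is why foo-soft has a size of 3: the length of the path foo.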


$ touch foo
$ echo hello > foo
$ ln foo foo-hard
$ stat foo
  File: foo
  Size: 6               Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d      Inode: 1289616 <--***----     Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-07-21 10:16:13.405015668 +0000
Modify: 2023-07-21 10:16:23.133033843 +0000
Change: 2023-07-21 10:16:29.941046564 +0000
 Birth: 2023-07-21 10:16:13.405015668 +0000
$
$ stat foo-hard
  File: foo-hard
  Size: 6               Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d      Inode: 1289616 <--***----     Links: 2
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-07-21 10:16:13.405015668 +0000
Modify: 2023-07-21 10:16:23.133033843 +0000
Change: 2023-07-21 10:16:29.941046564 +0000
 Birth: 2023-07-21 10:16:13.405015668 +0000
$
$
$ ln -s foo foo-soft
$ stat foo-soft
  File: foo-soft -> foo <--***----
  Size: 3               Blocks: 0          IO Block: 4096   symbolic link
Device: 801h/2049d      Inode: 1289887 <--***----     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-07-21 10:16:57.677098374 +0000
Modify: 2023-07-21 10:16:51.645087108 +0000
Change: 2023-07-21 10:16:51.645087108 +0000
 Birth: 2023-07-21 10:16:51.645087108 +0000
$
$ cat foo
hello
$ cat foo-hard
hello
$ cat foo-soft
hello
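
The same experiment can be sketched in C with link(), symlink(), stat() and lstat(). All helper names below are made up for this example. The key detail is that stat() follows a symbolic link to its target, while lstat() reports on the link itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: creates a hard link and a symbolic link to
 * `target`. Returns 0 on success, -1 on error. */
int make_links(const char *target, const char *hard, const char *soft) {
    assert(target != NULL && hard != NULL && soft != NULL);
    if (link(target, hard) != 0)
        return -1;
    if (symlink(target, soft) != 0)
        return -1;
    return 0;
}

/* Returns 1 if `a` and `b` resolve to the same inode on the same device,
 * 0 if they do not, -1 on error. stat() follows symbolic links. */
int same_file(const char *a, const char *b) {
    struct stat sa, sb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Returns 1 if `path` is itself a symbolic link, 0 if not, -1 on error. */
int is_symlink(const char *path) {
    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}

/* Prints the inode number and link count of `path` without following
 * a final symbolic link. Returns 0 on success, -1 on error. */
int print_link_info(const char *path) {
    struct stat st;
    if (lstat(path, &st) != 0) {
        perror(path);
        return -1;
    }
    printf("%s: inode=%llu links=%lu%s\n", path,
           (unsigned long long)st.st_ino, (unsigned long)st.st_nlink,
           S_ISLNK(st.st_mode) ? " (symbolic link)" : "");
    return 0;
}
```

After make_links("foo", "foo-hard", "foo-soft"), same_file("foo", "foo-hard") returns 1 because both names point at one inode, and is_symlink("foo-soft") returns 1 because the link has an inode of its own.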

Reading and Writing

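The following figures show the disk reads and writes issued when we call open() on a file /foo/bar and then read from it three times. Most of the I/O during open() goes into traversing the path from the root inode (inode number 2 in most UNIX file systems), reading each directory's inode and data along the way until the inode of bar is found.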

From OSTEP Ch 40
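
The traversal can be imitated with a toy in-memory "disk". Everything in this sketch is invented for illustration: the struct layouts, the inode numbers 12 and 13, and the rule that each inode lookup and each directory data lookup counts as one read. Real file systems store these structures in disk blocks.

```c
#include <assert.h>
#include <string.h>

#define ROOT_INO 2      /* root inode number, as in the text */
#define MAX_ENTRIES 4

/* Toy structures made up for this sketch. */
struct toy_dirent { const char *name; int ino; };
struct toy_inode {
    int ino;
    int is_dir;
    struct toy_dirent entries[MAX_ENTRIES];   /* directory "data block" */
    int n_entries;
};

/* A tiny tree: / (inode 2) -> foo (inode 12) -> bar (inode 13). */
static const struct toy_inode disk[] = {
    { ROOT_INO, 1, { { "foo", 12 } }, 1 },
    { 12,       1, { { "bar", 13 } }, 1 },
    { 13,       0, { { NULL, 0 } },   0 },
};

/* Looks up an inode by number and counts one simulated read. */
static const struct toy_inode *read_inode(int ino, int *reads) {
    (*reads)++;
    for (size_t i = 0; i < sizeof disk / sizeof disk[0]; i++)
        if (disk[i].ino == ino)
            return &disk[i];
    return NULL;
}

/* Resolves an absolute path to an inode number, or -1 if not found.
 * *reads is set to the number of simulated reads that were needed. */
int resolve(const char *path, int *reads) {
    assert(path != NULL && reads != NULL);
    *reads = 0;
    char copy[256];
    if (path[0] != '/' || strlen(path) >= sizeof copy)
        return -1;
    strcpy(copy, path);

    const struct toy_inode *cur = read_inode(ROOT_INO, reads);
    for (char *name = strtok(copy, "/"); name != NULL;
         name = strtok(NULL, "/")) {
        if (cur == NULL || !cur->is_dir)
            return -1;
        (*reads)++;                 /* read the directory's data */
        int next = -1;
        for (int i = 0; i < cur->n_entries; i++)
            if (strcmp(cur->entries[i].name, name) == 0)
                next = cur->entries[i].ino;
        if (next < 0)
            return -1;
        cur = read_inode(next, reads);
    }
    return cur ? cur->ino : -1;
}
```

For "/foo/bar" this sketch counts five reads: root inode, root data, foo inode, foo data, bar inode. Each extra path component adds two more, so the cost of open() grows with path depth.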

The diagram below shows the I/O for creating a new file /foo/bar. Notice the large number of reads and writes needed just to create a file. After that, every write that allocates a new block has to update the allocation bitmap and the inode in addition to the data block itself. Doing all of this on every single change (such as each save from an editor) would be very inefficient. The techniques discussed in the next section address exactly this problem.

From OSTEP Ch 40

Caching and Buffering

For reads,

For writes,

Misc stuff

To see what is mounted on your system, and at which points, simply run the mount program. You’ll see something like this:

$ mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
/dev/sda5 on /tmp type ext3 (rw)
/dev/sda7 on /var/vice/cache type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
AFS on /afs type afs (rw)

This crazy mix shows that a whole number of different file systems, including ext3 (a standard disk-based file system), the proc file system (a file system for accessing information about current processes), tmpfs (a file system just for temporary files), and AFS (a distributed file system) are all glued together onto this one machine’s file-system tree.
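
If you want to process such a listing programmatically, each line can be split into its fields. This is a minimal sketch that assumes the Linux-style "SOURCE on MOUNT_POINT type FS_TYPE (OPTIONS)" layout shown above and that no field contains spaces. The struct and function names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one line of mount output. */
struct mount_entry {
    char source[128];
    char mount_point[128];
    char fs_type[32];
};

/* Parses one "SOURCE on MOUNT_POINT type FS_TYPE (OPTIONS)" line.
 * Returns 1 on success, 0 if the line does not match that layout. */
int parse_mount_line(const char *line, struct mount_entry *out) {
    assert(line != NULL && out != NULL);
    return sscanf(line, "%127s on %127s type %31s",
                  out->source, out->mount_point, out->fs_type) == 3;
}

/* Counts how many of the `n` lines are mounts of type `fs_type`. */
int count_fs_type(const char *const lines[], int n, const char *fs_type) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        struct mount_entry e;
        if (parse_mount_line(lines[i], &e) &&
            strcmp(e.fs_type, fs_type) == 0)
            count++;
    }
    return count;
}
```

Fed the seven lines above, count_fs_type(lines, 7, "ext3") returns 3, one for each of /, /tmp and /var/vice/cache.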

Closing Notes

  1. Well, that’s it! Phew!
  2. Reviewing different file systems would be a natural next step for those interested in learning more.
  3. This article by Dan Luu goes further in talking about files. However, make sure to read the OSTEP chapters first before diving in: Files are hard