Posted on Jan 14, 2013

NUSE: Networking Stack in Userspace?

FUSE (Filesystem in Userspace) allows programmers to implement a custom file system in userspace. FUSE itself works as a virtual file system in the kernel. While real file systems, such as ext4 and ReiserFS, perform all operations within the kernel, FUSE redirects VFS operations to a user-level application that actually implements the custom file system. Such VFS operations often map one-to-one to familiar system calls, such as open, release, read, write, mkdir, and unlink.

The FUSE architecture divides the big picture into three players: user applications, FUSE applications, and the FUSE kernel module. Existing user applications can benefit from FUSE without any modification. The FUSE kernel module routes VFS operations to the proper FUSE application, based on the well-defined namespace (file paths and mount points).
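
To make this concrete, here is what a minimal FUSE file system looks like in C. This is only a sketch, modeled after the canonical libfuse "hello" example (error handling trimmed): it exposes a single read-only file, and each callback in fuse_operations corresponds almost one-to-one to a system call.

    /* hello_fuse.c: a single read-only file served from userspace.
     * Build: gcc hello_fuse.c -o hello_fuse `pkg-config fuse --cflags --libs`
     * Run:   ./hello_fuse /tmp/mnt && cat /tmp/mnt/hello */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <errno.h>
    #include <string.h>
    #include <sys/stat.h>

    static const char *hello_str = "Hello from userspace!\n";
    static const char *hello_path = "/hello";

    static int hello_getattr(const char *path, struct stat *st)
    {
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {              /* the mount point itself */
            st->st_mode = S_IFDIR | 0755;
            st->st_nlink = 2;
        } else if (strcmp(path, hello_path) == 0) {
            st->st_mode = S_IFREG | 0444;
            st->st_nlink = 1;
            st->st_size = strlen(hello_str);
        } else
            return -ENOENT;
        return 0;
    }

    static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
                             off_t off, struct fuse_file_info *fi)
    {
        if (strcmp(path, "/") != 0)
            return -ENOENT;
        fill(buf, ".", NULL, 0);
        fill(buf, "..", NULL, 0);
        fill(buf, hello_path + 1, NULL, 0);        /* skip the leading '/' */
        return 0;
    }

    static int hello_read(const char *path, char *buf, size_t size, off_t off,
                          struct fuse_file_info *fi)
    {
        size_t len = strlen(hello_str);
        if (strcmp(path, hello_path) != 0)
            return -ENOENT;
        if ((size_t)off >= len)
            return 0;
        if (off + size > len)
            size = len - off;
        memcpy(buf, hello_str + off, size);
        return size;
    }

    static struct fuse_operations hello_oper = {
        .getattr = hello_getattr,
        .readdir = hello_readdir,
        .read    = hello_read,
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &hello_oper, NULL);
    }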

This abstraction layer comes at a price (frequent boundary crossings between kernel and user space), but in most cases it is not a deal breaker, since the performance of a file system is typically bound by disk (or network, for network file systems) bottlenecks. Functionality and flexibility do matter. There are many creative and useful file system implementations on top of FUSE. One intriguing example is GmailFS, which allows you to use your Gmail account as free disk space; it can be considered an early prototype of cloud storage services. The advantages of FUSE are huge:

  • Kernel-level programming is pretty damn hard. Life is too short to waste on debugging kernel panics. FUSE effectively democratizes file system programming for all user-level programmers.

  • One corollary of the above is that you are not stuck with the C programming language, as you would be in the kernel. You have freedom of language. With FUSE, you can implement a non-trivial file system of your own in a few hundred lines of code, in your favorite language.

  • Once you implement a custom file system, you can easily port it to other operating systems, as long as they also support FUSE. This is possible because the user-level file system implementation is not tightly coupled with other kernel-level subsystems, such as the scheduler and the virtual memory system.

  • The user-level isolation also makes deployment easy. You do not need to worry about the operating system or its version when you release your file system, thanks to the well-defined FUSE API.

Those advantages have brought about a variety of file system implementations. In addition, FUSE has helped not only practitioners but also researchers: storage systems researchers can quickly realize their new ideas with FUSE, and quick prototyping allows them to focus on ideas instead of hairy implementation details.

Longing for a Network Equivalent to FUSE

It is often desirable to customize the behavior of the default networking stack in a programmable way, for fun, research, or practical reasons. A few ideas that have come to my mind:

  • Reordering HTTP header packets to confuse firewalls.
  • Marking packets of a particular TCP flow with header options.
  • Opportunistic compression for TCP payloads.
  • QoS based on application-layer information (e.g., prioritizing text over images).
  • Testing a new TCP congestion control algorithm.
  • Redirecting connections based on content (e.g., *.html files to a web application server, other files to an in-memory static HTTP server).
  • Firewalling with regular expressions.
  • Per-process rate-limiting.
  • Implementing your own layer-3/4 protocols.

What can we do to achieve such things? Actually, we have a bag of ingredients:

  • libpcap (or raw sockets) provides an interface to physical network interfaces, but it is too low-level in most cases. For incoming packets, the kernel duplicates them: one copy follows the default networking stack, while the other is redirected to the user application. This implies that the mechanism is not reliable, as user-side packets may be dropped (on buffer overrun) while the kernel-side packets are not. In addition, you need to set appropriate iptables rules to make sure you intercept packets, rather than just observe them.

  • TUN/TAP devices let you emulate network device drivers in software. TAP devices are essentially Ethernet links, while TUN devices do not bother with Ethernet headers (so the basic processing unit is IP packets, not Ethernet frames). These devices act as the network side: what you write is incoming packets for the kernel, and packets the kernel transmits are what you read from the device. All TCP/IP processing (layer 3 and above) is done as usual. (See the TUN sketch after this list.)

  • iptables performs various middlebox functions, such as firewalling and NAT.

  • libnetfilter_queue lets user-level applications implement custom netfilter operations (in contrast to the pre-defined iptables rules). (See the queue skeleton after this list.)

  • Proxy servers work best if what you want to do is at the application layer, but without modifications to existing applications. You basically deal with TCP streams, so you lose any packet-level fidelity. Typically you redirect connections to your proxy with an iptables REDIRECT rule.

  • You can also intercept libc functions or system calls, commonly with LD_PRELOAD. What you get is similar to proxy servers, though this method is a little trickier. This approach is often used to “torify” network traffic of unmodified applications. (See the connect() interceptor after this list.)

  • Custom TCP/IP stacks. lwIP is perhaps the most popular one. These networking stacks are mainly designed for embedded systems, where size and portability are king, so they tend not to be feature-rich. In my experience, unfortunately, none of those implementations was as mature as the Linux kernel networking stack; for example, I saw frequent TCP stalls or unacceptable performance over high-latency links.
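
To give a flavor of these ingredients, here is a minimal TUN sketch in C: it creates a tun0 device and prints the size of every IP packet the kernel routes into it. This assumes Linux and root privileges, and you would still need to steer traffic to the device (e.g., with ip addr/ip route); writing to the same fd injects packets back into the kernel.

    /* tun_dump.c: minimal TUN reader.  Build: gcc tun_dump.c -o tun_dump */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/if.h>
    #include <linux/if_tun.h>

    int main(void)
    {
        struct ifreq ifr;
        char buf[2048];
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0) { perror("open /dev/net/tun"); return 1; }

        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TUN | IFF_NO_PI;  /* raw IP packets, no extra header */
        strncpy(ifr.ifr_name, "tun0", IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) { perror("TUNSETIFF"); return 1; }

        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));  /* one packet per read() */
            if (n < 0) { perror("read"); break; }
            printf("kernel sent a %zd-byte IP packet\n", n);
            /* write(fd, ...) here would feed a packet *to* the kernel */
        }
        return 0;
    }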
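
And a skeleton of the libnetfilter_queue approach: the kernel diverts packets matched by an NFQUEUE rule into userspace, where a callback can inspect or rewrite them before issuing a verdict. This is a sketch that assumes queue number 0, a rule like "iptables -A OUTPUT -p tcp --dport 80 -j NFQUEUE --queue-num 0", and a recent library version (exact signatures vary slightly across versions).

    /* nfq_skel.c: accept-everything netfilter_queue skeleton.
     * Build: gcc nfq_skel.c -o nfq_skel -lnetfilter_queue */
    #include <stdint.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <linux/netfilter.h>               /* NF_ACCEPT */
    #include <libnetfilter_queue/libnetfilter_queue.h>

    static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
                  struct nfq_data *nfa, void *data)
    {
        unsigned char *payload;
        int len = nfq_get_payload(nfa, &payload);
        struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
        uint32_t id = ph ? ntohl(ph->packet_id) : 0;
        /* inspect or rewrite payload[0..len) here, then give a verdict */
        return nfq_set_verdict(qh, id, NF_ACCEPT, len, payload);
    }

    int main(void)
    {
        struct nfq_handle *h = nfq_open();
        struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL);
        char buf[65536];
        int n, fd;

        nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);  /* copy full packets */
        fd = nfq_fd(h);
        while ((n = recv(fd, buf, sizeof(buf), 0)) > 0)
            nfq_handle_packet(h, buf, n);             /* dispatches to cb() */

        nfq_destroy_queue(qh);
        nfq_close(h);
        return 0;
    }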
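
Finally, a sketch of the libc-interception approach using LD_PRELOAD: a shared object that wraps connect() to observe (or rewrite) every outgoing IPv4 connection of an unmodified application.

    /* intercept.c: log every IPv4 connect() of an unmodified program.
     * Build: gcc -shared -fPIC intercept.c -o intercept.so -ldl
     * Run:   LD_PRELOAD=./intercept.so curl http://example.com/ */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int connect(int fd, const struct sockaddr *addr, socklen_t len)
    {
        static int (*real_connect)(int, const struct sockaddr *, socklen_t);
        if (!real_connect)                   /* resolve libc's connect() once */
            real_connect = dlsym(RTLD_NEXT, "connect");

        if (addr->sa_family == AF_INET) {
            const struct sockaddr_in *in = (const struct sockaddr_in *)addr;
            fprintf(stderr, "connect() to %s:%d\n",
                    inet_ntoa(in->sin_addr), ntohs(in->sin_port));
            /* a "torifying" wrapper would rewrite the destination here */
        }
        return real_connect(fd, addr, len);
    }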

All those solutions are very case-specific; you must be able to choose the right tool for your problem. Expertise in one tool will not help with the others, as they share little in common in usage or interface. None of them provides the kind of unified, portable, and clean interface that FUSE does for user-level file systems.

Why can we not have an equivalent to FUSE for networking? One reason would be that the networking stack is arguably more complex than the file system by nature. The Linux TCP/IP stack consists of various protocols at several layers (link, network, transport, and socket), works both as a router and as an endhost, deals with various communication media rather than simple block devices, and must meet many functional requirements (e.g., filtering and tunneling). Look at this simplified diagram of the Linux networking stack:

http://www.linuxfoundation.org/images/1/1c/Network_data_flow_through_kernel.png

The Linux networking stack is very complex software, maintained by brilliant people who are masters of complexity. The software is very modular, but not in a unified way (e.g., the interface between Ethernet and IP is very different from that between IP and TCP). I argue that this lack of a clean abstraction for gluing modules together is the reason we have only fragmented, ad-hoc solutions for customizing the behavior of the stack.

What is the consequence of not having NUSE? In academia, lots of interesting research ideas come out of the networking community every year. Most of them make solid arguments, but unfortunately only a few are realized and make a real impact. I think the reason for this is the lack of means to quickly prototype ideas in a realistic environment and to contribute the outcome (code) to the community. In the file system community, many research projects release their results as real FUSE applications.

I, as an average programmer (and as a newbie researcher), would like to have a simple interface exposed to user-level applications, so that I can inject my little piece of code to customize data/control flow in the networking stack. I want to do this at any location in the stack, rather than at a few pre-defined hook points. It would also be great if the interface had the same API wherever the “hook” is, so that I do not have to learn a new programming interface every time. And I want my code to run anywhere, without tedious kernel patching or module building.
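
To be concrete about what I am wishing for, here is a purely imagined sketch. Nothing below exists: nuse.h, nuse_register_hook(), the hook-point names, and the verdict type are all hypothetical, just to illustrate the kind of FUSE-like API I have in mind.

    /* Hypothetical NUSE API -- none of this exists today. */
    #include <nuse.h>                       /* hypothetical header */

    /* The same callback signature at every hook point in the stack. */
    static enum nuse_verdict on_tcp_output(struct nuse_pkt *pkt, void *ctx)
    {
        if (nuse_pkt_len(pkt) > 1200)       /* e.g., tag big packets for QoS */
            nuse_pkt_mark(pkt, 0x1);
        return NUSE_PASS;                   /* or NUSE_DROP, NUSE_MODIFIED, ... */
    }

    int main(void)
    {
        /* Attach anywhere: "eth/rx", "ip/forward", "tcp/output", ... */
        nuse_register_hook("tcp/output", on_tcp_output, NULL);
        return nuse_run();                  /* userspace event loop */
    }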

As the Click modular router opened tremendous research opportunities in networking (though mostly on routers, not end-hosts), I am pretty sure NUSE would push the frontiers of computer networking research even further. If you are interested in doing research or writing code for NUSE, please let me know. I would be happy to collaborate.
