The Enhanced Network Block Device Linux Kernel Module

(Last revised: 26 May. 2005)

The Enhanced NBD is the result of an industrially funded academic research project with Realm Software of Atlanta, GA, to toughen up the kernel's NBD. It started back in 2.0 times, when I back-ported the nascent NBD by Pavel Machek from the 2.1 development kernel.

What is an NBD?
Requirements
Current Version
Documentation
HOWTO
Gotcha!
HOWTO-2
Resources > 2GB
Setting up failover
Late news - mirroring
Bugs
To Do
Mail List
Downloads
Contacts

What is an NBD?

An NBD is "a long pair of wires". It makes a remote disk on a different machine act as though it were a local disk on your machine. It looks like a block device on the local machine where it's typically going to appear as /dev/nda. The remote resource doesn't need to be a whole disk or even a partition. It can be a file.

NBD transports a physicalblock device over the net

The intended use for ENBD in particular is for RAID over the net. You can make any NBD device part of a RAID mirror in order to get real time mirroring to a distant (and safe!) backup. To make it clear: start up an NBD connection to a distant NBD server, and use its local device (probably /dev/nda) where you would normally use a local partition in a RAID setup.

The original kernel device has been hardened in many ways in moving to ENBD from kernel NBD: the ENBD uses block-journaled multichannel communications; there is internal failover and automatic balancing between the channels; the client and server daemons restart, authenticate and reconnect after dying or loss of contact; the code can be compiled to take the networking transparantly over SSL channels (see the Makefile for the compilation options).

To summarize briefly, the important changes in ENBD with respect to the standard kernel driver are

user-space networking, combined with a new multichannel self-balancing asynchronous architecture in the kernel driver, and
automatic restart, authentication, reconnect and recovery by the user space daemons, and now (in enbd post-2.4.27) ...
support for remote ioctls, and ...
support for removeable-media such as cdroms and floppies as the remote resource, and ...
in enbd 2.4.30, support for partitioning on NBD devices, and ...
in enbd-2.4.31, support for intelligent embedded RAID1 mirroring

Requirements

both

Current version as of Mar. 2004

2.2-current

ftp area

here

I repeat: you can use either code with 2.2 kernels, but the enbd-2.4.* codes are the ones being worked on and developed while the 2.2 codes are no longer developed.

Documentation

article

here

For a taste, here are some performance measurements:

NBD performance figures

These are old figures now - taken under 2.0.36, as I recall, with a much older version of NBD than the current one, but they're still useful. The testbed was a pair of 64MB P200s on a 100BT switched circuit using 3c905 NICs. The best speed I could get out of raw TCP between them was 58.3Mb/s, tested using netperf.

Of course, the current NFS implementations have improved too.

HOWTO

Do all the compilation on the client machine. It has the kernel configuration that we need to match during the compilation - the enbd server never talks to its own kernel.

Also set the environment variable LINUXDIR to the location of the kernel source directory for your target kernel, because nowadays /usr/src/linux (the classical location) seems to be some fake that points to whatever glibc was compiled against, not to the source of the kernel you are running. The kernel directory must also contain the target kernels .config file, as it's read during the make.

% tar xzvf nbd-2.2-current.tgz
% cd nbd-2.2.23

And from there on in, I'll quote the INSTALL directions:

It should be sufficient to type "make" in this directory. That will build enbd.o, enbd-server , enbd-client in /tmp. Change BUILD in the Makefile to change the build directory. Run " make config all" to really really really make sure that everything is set up for you, but just "make" should normally do the job. Look inside the Makefile if you want to understand why and what.
Edit the Makefile and replace SERVER and CLIENT with the name of your build machine and a willing target machine respectively. The target must be running a kernel into which the enbd.o module will load. You will have to figure out what that means and make it so - running the same kernel as the build machine should be enough. The target can be the build machine too. Setting both SERVER and CLIENT to "localhost" is a safe bet.
Make sure that sudo and ssh are installed on both SERVER and CLIENT machine, and that you are a sudoer, and have done the appropriate ssh-keygen and -export trickery both sides, so you can login seamlessly between the two machines. If you have never heard of these two utilities ... well, really, you are a real rookie of an administrator and you should not be touching this kind of thing! I'd have given odds of 5:1 against you getting this far!
Then run "make test". This depends on the presence of both sudo and ssh for its working. Observe that the module enbd.o is loaded (use /sbin/lsmod for that). Observe also that enbd-server and enbd-client are running (use pstree for that). Check that both server and client have branched off slave servers and clients to handle the connections (again, use pstree to visualize the situation). Check that the state of the device is good by doing a "cat /proc/nbdinfo". You should see indications of /dev/nda being up and running, and several subpartitions - which correspond to connections - also being good. If anything is wrong, look in your system logs for error messages and send me the state shown by /proc/nbdinfo.

What happens in the test above?

make test

/tmp

/dev/nda

mke2fs /dev/nda

% mke2fs /dev/nda
% mount /dev/nda /mnt
% cd /mnt
% bonnie ...

The ndxN devices must exist on the client for this to work. I've provided a script called MAKEDEV to make them. On the client, do "cd /dev; sh path_to_MAKEDEV".

Be careful ... there is already a script called MAKEDEV in /dev. Name yours something different or look inside it and see what it does and make the devices it makes by hand. You need block devices /dev/nda, /dev/nda1, /dev/nda2 , /dev/nda3, etc, with major 43 (or whatever the kernel sets for NBD_MAJOR) and minors 0, 1, 2, 3, etc. "mknod /dev/nda b 43 0; mknod /dev/nda1 b 43 1; ..." should do the trick.

To stop the test, you can try running "make stop" or "make rescue". I don't guarantee a rescue in all circumstances, but it'll try, and you can elaborate the Makefile to suit your circumstances.

The difficulty is in stopping the self-repairing code! Sending a kill -USR1 to the daemons should shut them down and error out the pending device queue requests. A kill -USR2 will try even harder to shut them down. A kill -TERM should then murder the daemons safely, allowing you to unload the kernel module.

Look at the output from /proc/nbdinfo to gauge the state of the device. In particular, you should see the number of active sockets, and the number of active client threads.

Device a:       Open
[a] State:      initialized, verify, rw, last error 0
[a] Queued:     +0 curr reqs/+0 real reqs/+10 max reqs
[a] Buffersize: 86016   (sectors=168)
[a] Blocksize:  1024    (log=10)
[a] Size:       2097152
[a] Blocks:     2048
[a] Sockets:    4       (+)     (+)     (*)     (+)
[a] Requested:  2048+0  (602)   (462)   (431)   (553)
[a] Dispatched: 2048    (602)   (462)   (431)   (553)
[a] Errored:    0       (0)     (0)     (0)     (0)
[a] Pending:    0+0     (0)     (0)     (0)     (0)
[a] Kthreads:   0       (0 waiting/0 running/1 max)
[a] Cthreads:   4       (+)     (+)     (+)     (+)
[a] Cpids:      4       (9489)  (9490)  (9491)  (9492)
Device b-p:     Closed

In the above I see four client threads (Cthreads) all currently within the kernel (+). They're probably waiting for work to do. I see four network sockets open and known good (+) with the third of them having been active last (*). The first socket seems to have taken more of the work available than the rest, but the difference is not significant. There are no errors reported and no requests waiting in internal queues. If you send in a bug report, make sure to include the output from /proc/nbdinfo.

GOTCHA!

The server generates a signature that is implanted into the clients nbd device at first contact. Any attempt to afterwards connect to a server with a different signature will be rejected. It's an anti-spoofing device. The client doesn't really know the signature either - it's buried in the kernel and the client can only ask if it's been given the right signature or not.

Some find out that they can remove the kernel module and then start again successfully. Of course! That wipes the embedded signature. But it's not the solution. The right thing to do is to

generate the same signature in the server every time you start it, using its "-i foobar" option.
if you restart the server without restarting the client, signal the client with SIGPWR ("kill -PWR 19645 " or whatever the pid is).

Most people are caught by GOTCHA! #1, but some people hit #2, which is why I mentiion it here.

The signal with SIGPWR is normally taken care of by the assistant daemons, nbd-sstatd and nbd-cstatd, but ten to one they haven't been installed yet. I'll explain briefly ... the handshake sequence is longer for a first contact than for a reconnect, and without the SIGPWR the clients will try the short sequence instead of the long.

HOWTO-2

choose a resource (file or partition) on the serving machine and choose some ports on which to serve it out to the client. Then start the server:

nbd-server 1100 1101 1102 1103 /dev/sda1

on the client, load the nbd module (make sure to get the right one, using absolute path names if in doubt)

insmod nbd.o

on the client machine, start the client:

nbd-client your.server 1100 1101 1102 1103 /dev/nda

choose a resource (file or partition) on the serving machine and a single control port. Then start the server:

enbd-server 1099 /dev/sda1

on the client, load the nbd module (make sure to get the right one, using absolute path names if in doubt)

insmod enbd.o

on the client machine, start the client. Note that you give the server control port plus the number of channels you want it to set up. It'll find and set up on its own different ports for the data channels:

enbd-client your.server:1099 -n 4 /dev/nda

What about resources > 2GB?

lseek()

If you don't have a native 64 bit server system, from what I can find out from the current confused state of affairs in the linux world ... under glibc2 and kernels 2.2.* and 2.4.* you need to compile the nbd-server code with _LARGEFILE64_SOURCE defined. It's all set up for you from nbd-2.2.26 and nbd-2.4.5 on.

If you do not have Large File Support on your system, the ENBD still supports resource aggregation, via either linear or striping RAID, to any size, unlimited by the 2GB file size maximum, provided only that the individual components of the aggregate resource are below 2GB in size. Check out the command line arguments for the server. Just listing multiple resources on the command line is enough to cause some form of aggregation to occur!

Setting up for failover

... or how to make ENBD work with heartbeat, the well known failover infrastructure. Matthew Sackman has written a very good HOWTO on this subject. You'll find his document here . I've added the scripts necessary to the distribution archive (enbd-2.4.30 on) under the nbd/etc/ha.d directory. Flash: Steve Purkis has adapted the scripts for RedHat-based platforms, and I've included his scripts in the latest archives (enbd-2.4.32 on) in the nbd/etc-RH/ha.d directory.

Sorry about the links. I hate non-inline documentation myself. In compensation, I'll describe something of what one is trying to achieve with failover; heartbeat is only a means to an end and in many instances a simple little shell script will be just as good or better and this description may help you construct it!

The idea is that server and client are both capable of using a single "floating IP address". This floating IP is normally held by the client, but it moves to the server when the client dies, and it moves back again when the client comes back up and has been brought up to date again. The floating IP is normally that announced in DNS for some vital service such as samba or http.

Heartbeat is simply a general mechanism for detecting when the client or server has failed, and for running the appropriate scripts in response.

Overview: the client will normally be running a raid1 (mirror) composed of the NBD device and a local resource. When the client dies, the floating IP is handed off to the server, which then starts serving from the physical resource of which the NBD device is/was a virtual image. When the client comes back up, its local mirror component has to be resynced from the NBD device component, but the client can take the IP immediately, as the mirror resyncs in the background while it continues working.

Abstraction: There are 4 possible states in which the pair of machines can be: (1) server alive, client alive, (2) server alive, client dead, (3) server dead, client alive, (4) server dead, client dead. Of these, (1) is "normal" and (4) is impossible, for our purposes - failover would have failed. The transitions (1)-(2), (2)-(1) and (1)-(3), (3)-(1) are what we are interested in. Heartbeat initiates actions on the surviving machine or machines after each transition.

More detail: Let's look at the (1)-(2) transition. The server is the survivor. It will run its 'endbd-client start' script because it now has to take up the role of the client. If the client got the chance before dying, it would also have run its 'enbd-client stop' script. Leave aside what these scripts do for the moment and just focus on the naming convention. In the (2)-(1) transition, when the client comes back up, it runs its 'enbd-client start' script, and the server runs its 'enbd-client stop' script.

Similarly, in the (1)-(3) transition, where the client is the survivor, it must take up the role of the server and so it runs 'enbd-server start'. The server, if it got the chance before dying, runs 'enbd-server stop'. On the reverse transition, (3)-(1), the client runs 'enbd-server stop' and the server runs 'enbd-server start'.

What do these scripts do?

Look at (1)-(2) again, where the client dies and the server survives and takes the clients role. The server has to kill its enbd server daemon, fsck the raw partition if it wasn't journaled, and then mount it in the place where its apache and samba services expect to find it. So that's what 'enbd-client start' does for it.

The matter of taking the IP is normally handled by heartbeat, but one can do it manually with a simple ifconfig eth0:0 foobar command in the script. The same goes for starting and stopping the apache and samba services - i.e. that's handled by heartbeat too.

If the client got a chance to run its 'enbd-client stop' script before dying, it would have unmounted the raid mirror, then stopped the mirror and stopped the enbd client daemon that it was running. So that's what 'enbd-client stop' does do.

The (2)-(1) transition is the one that restarts the client in the client role. Usually the server will live to see this transition through, and its 'enbd-client stop' script will unmount the raw partition, start the enbd-server daemon on it, and that's all. The client's 'enbd-client start' script, on the other hand, has to carefully start the enbd-daemon, wait for the NBD device to come up, then start the mirror with the NBD device as primary component. Oh yes, it'll also steal the floating IP address - well, that's normally handled by heartbeat itself.

The (1)-(3) transition should be thought about in the same way, but it's linked to an easier set of scripts than (1)-(2), since the apache and samba services don't need to be relocated - they stay on the client.

The client is the survivor. It takes the role of the server with 'enbd-server start', so this script should kill its enbd-client daemon (the mirror component was dead anyway). It does not need to do anything else since the mirror itself has survived. It could take the NBD component out of the mirror with raidhotremove, but it does not need to. If the server got to run 'enbd-server stop' before dying, it should have killed its enbd-server daemon and that's all.

The reverse transition (3)-(1) is harder. This is where the server has to be reintegrated. It runs 'enbd-server start', which starts up its enbd-server daemon. The client does the reintegration work - it runs 'enbd-server stop', which starts the enbd-client demon, waits for the NBD device to come up, then integrates it into the mirror as a secondary, using raidhotadd.

The scripts are in the HOWTO, and in the distribution archive. Phew!

Late news - intelligent mirroring

fr1

temporary network failures are much more common than real disk blow-ups, so we probably only need to catch up with a bit of the data when the network comes back, not write the whole disk from scratch!
the reason you are working over the network is probably to aggregate a large number of disks, with a total size in the terrabytes or petabytes (if you can get there, 16TB is currently the aggregated limit on 32bit systems on linux), so resyncing all of a disk is something you definitely do not want to do.
unless you are using Gigabit ethernet, or some other very fast medium, the transport is probably slower than the disks, so you want to avoid network transfers when possible.

The reengineered mirroring in fr1 notices exactly what block writes are missing on the missing server, and when it comes back into contact, updates only those blocks.

That's a great speed up and time saver. It can reduce the time period in which the servers don't have redundancy from a matter of hours to a matter of seconds. And the resync is automatic when contact with an enbd server is reestablished. It doesn't require human intervention, because the enbd client issues the hotadd instruction.

To set it up, run an enbd device as one component of a fr1 ("raid1") mirror. That's it (so sue me, MacDonalds).

There is now an fr5 driver too, but even without it you can get at least the automatic resync on reconnect by patching the kernel for fr1 and then using the patched md.o module under ordinary kernel raid5.

Bugs

Renegotiation on reconnect can probably still lock up under some circumstances. I just beat on that in 2.2.23. The networking is all in userspace, so volunteers should be able to make headway if they find a new deadlock. The problem is "networking is complicated when it breaks". There are lots of possible error conditions, and the trick is to get out of all of them and back to square one safely. If by any chance the recovery protocol does get stuck, you can always kill both the server and client slave daemons, however. They'll be restarted from fresh by their guardian daemons, and will connect up again just fine. I believe that it is possible for the userspace daemons to wait forever in a send or recv library call, since I can't set a timeout for the sockets on linux, at least not under older kernels. Please someone add a timeout.

I've disabled request-aggregation in the kernel driver for 2.2. The 2.2 kernels are doing something fundamentally different here, and I don't know what. Until I figure it out, make sure the tests are done with merge_requests=0. You get occasional corrupt blocks on read if you try merged kernel requests! It looks like 2.2 kernel block requests may have holes in their buffers. It's definitely in the kernel. Don't Do That Then.

I just fixed a bug in 2.2.23 that may have been more widespread: libc sleep() gets ruined if alarm signals fire. Look out for more uses of sleep() and replace them with the microsleep() code in nbd-client.c if you see any. Tell me too.
In the 2.4.0test kernels, sending a USR2 signal to nbd-client (or echo -n 1 > /proc/nbdinfo) usually oopses and burns the kernel. This signal forces a "hard reset" of the device. In particular it clears its request queue and its usage count. The latter is very rarely harmless, but you were in a pickle anyway if the usage count was nonzero and there were no processes holding the device open! Try a "soft reset" instead. That's a USR1 signal to nbd-client (or echo -n 0 > /proc/nbdinfo ).
Curing 64bit buglets led me to find that over-range block requests from the kernel used to cause a hang. Cured in 2.2.27 and 2.4.6 (might surreptitiously port this fix into earlier versions). I never expected such requests.
Outstanding bug - the 2.4.0t1 kernel burns and dies under pressure from the nbd device (tested nbd-2.4.2) after about 10MB of transfers. The same code compiled and running under the 2.2.15 kernel supports gigabytes of transfers without harm. So it's the kernel or my misunderstanding of newer kernel changes. Is this a throttling issue? It seems to have gone away completely in nbd-2.4.25 and kernels 2.4.[0-5]. This probably has been well and truly fixed during development.
The error paths from/to the device are not clearly defined. There is no way of erroring out a request under some circumstances. Need a stronger NAK (put in in 2.4.9, which cleaned up some error paths)
Fixed a bug in 2.4.13 in which the server would get 0 bytes from a network read and loop. Thanks to Daniel Shane for spotting it.
In kernels above the great VM ripout (2.4.10 plus), VM issues seem to make deadlock a possibility. One needs to alter /proc/sys/vm/bdflush to put the two limit numbers which are usually 40,60 something more like 25,95. The second number is the percentage of buffers dirtied at which the VM goes into synchronous mode. We don't want that to happen at all! 90% should be a safe number, and I wouldn't mind 99%. The lower number is the percentage of buffers dirtied at which the kernel starts flushing them to devices. We want that to be as low as possible, though some slack can be left to help with the feel. I think 25% is appropriate. In 2.4.27 I made the client daemon shift the numbers to reasonable ones by force.
When I reviewed the 2.4.31 remote ioctl code for the 2.4.32 fork, I noticed a possible memory leak in 2.4.31 remote ioctls, and fixed it in 2.4.32. I am confident neither of the fix nor if there was a problem in the first place. Please watch out.
There is a longstanding design problem when a write request is timed out and resubmitted. Subsequent writes to the same place should wait till the resubmit has finished, or else preempt the original if there are no reads waiting on it too. At the moment only the generic write ordering control enforces that and it is lossy - under sufficient stress it is abandoned temporarily. Yet making the write ordering enforcement absolutely strict would ruin the speed since network stalls are quite common. So we can sometimes deliver writes to the same spot in the wrong order. It's even worse if the first write is timed out and resubmitted because the timeout in itself authorises write-disordering to the device wrt it. We should keep timed out writes in the database until their final exit status is known, so that a subsequent write can at least choose to stall itself or preempt the first write. If the first write is preempted, we should discard it when it eventually shows up, which means we must keep the second write in the database too. This is all very complicated.

To Do

Make the choice of ports by the daemons more unix-like. Just one initial communication channel please, followed by splitting off private channels on randomly chosen ports (done in 2.4.4).
Fix request aggregation for 2.2 kernels (looks difficult)
Port forward to 2.4 kernels (mostly working very well in 2.2.25, 2.4.0, etc.)
Add a "close" command to the negotiation protocols so that either side can choose to break off, and can tell the other about it!
Add performance figures to this page, lifting from the listed paper. (done! 1/6/00)
Maybe incorporate the mmap trick for sharing bufferspace between kernel and user space without copying. But experiments indicate that that's not where the bottleneck is. Where is it? Why can't I go past about 20% cpu utilization on a 100BT net? (At that point we're flying at between 3 and 5 MB/s).
Fix the error paths!
Put in a proper hash map and thus lift the 2GB limitation on clientside caching. (done in 2.4.14)
Write FR5.

Latest News

Just added clientside cacheing to nbd-2.4.4. This turns a readonly export into a readwrite one. It also permits a single ro resource to be shared among several clients. each of which caches local changes. See the -journal option in the man page.
Also in 2.4.4, true unix client/server model over a single port, at last.
In 2.4.9, multiple client daemons can connect to the same server, allowing the same resource to be exported many times. If you use the clientside caching too, you get a modifiable local copy suitable as a network root FS (only works up to 2GB at present).
In 2.4.27, remote IOCTLs are supported. You will have to edit the ioctl.c (and its kernel twin, the nbd_ioctl.c file) to add the ioctls you need. See the comments there.
In 2.4.30, we now have support for partitions on the NBD devices, and recent breakage for 2.2 kernels is finally mended (kernel 2.2.20 checked).
2.4.31 supports the fr1 ("fast raid1") kernel patch. The support has also been backported into 2.4.30 as a special secret easter egg that will not be noticed by anybody (shhh!), as I thought it was only fair as 2.4.31 has taken so long to come out, thanks to my experiments in it.
Jan 2004 and 2.4.32 is now the development version, and 2.4.31 is stable, me having finally removed all experiments and received no important bug reports for a long while. I forked 2.4.32 because people kept complaining that the removable device support made floppies go click every second - yes, the server probed them. So in 2.4.32 I stopped the server probing and instead passed open, close and mount events across over the remote ioctl interface (which presumably now becomes essential for a healthy life), and got only them to trigger probes. That matches the way the kernel probes a local floppy device. It was rather a big change, hence the fork.