The Enhanced NBD is the result of an industrially funded academic research
project with Realm Software of Atlanta, GA, to toughen up the kernel's
NBD. It started back in 2.0 times, when I back-ported the nascent NBD
by Pavel Machek from the 2.1 development kernel.
An NBD is "a long pair of wires". It makes a remote disk on a different machine act as though it were a local disk on your machine. It looks like a block device on the local machine where it's typically going to appear as /dev/nda. The remote resource doesn't need to be a whole disk or even a partition. It can be a file.
The intended use for ENBD in particular is for RAID over the net. You can make any NBD device part of a RAID mirror in order to get real time mirroring to a distant (and safe!) backup. To make it clear: start up an NBD connection to a distant NBD server, and use its local device (probably /dev/nda) where you would normally use a local partition in a RAID setup.
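For instance, with a raidtools-style setup of that era, a raidtab along these lines would mirror a local partition onto the net (this is a sketch only; /dev/hda5 stands in for your local partition):

raiddev /dev/md0
    raid-level            1
    nr-raid-disks         2
    nr-spare-disks        0
    persistent-superblock 1
    chunk-size            4
    device                /dev/hda5
    raid-disk             0
    device                /dev/nda
    raid-disk             1

% mkraid /dev/md0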
The original kernel device has been hardened in many ways in moving to ENBD from kernel NBD: the ENBD uses block-journaled multichannel communications; there is internal failover and automatic balancing between the channels; the client and server daemons restart, authenticate and reconnect after dying or loss of contact; and the code can be compiled to take the networking transparently over SSL channels (see the Makefile for the compilation options).
To summarize briefly, the important changes in ENBD with respect to the standard kernel driver are the multichannel block-journaled communications, the internal failover and channel balancing, the self-restarting and reauthenticating daemons, and the optional SSL transport.
I repeat: you can use either code with 2.2 kernels, but the enbd-2.4.* code is the one being worked on and developed, while the 2.2 code is no longer developed.
For a taste, here are some old performance measurements - taken under 2.0.36, as I recall, with a much older version of NBD than the current one, but they're still useful. The testbed was a pair of 64MB P200s on a 100BT switched circuit using 3c905 NICs. The best speed I could get out of raw TCP between them was 58.3Mb/s, tested using netperf.
Of course, the current NFS implementations have improved too.
Do all the compilation on the client machine. It has the kernel configuration that we need to match during the compilation - the enbd server never talks to its own kernel.
Also set the environment variable LINUXDIR to the location of the
kernel source directory for your target kernel, because nowadays /usr/src/linux
(the classical location) seems to be some fake that points at whatever
glibc was compiled against, not at the source of the kernel you are running.
The kernel directory must also contain the target kernel's .config file,
as it's read during the make.
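So the preparation for the build is something like the following (the kernel tree path here is just an example - point it at your own target kernel):

% export LINUXDIR=/usr/src/linux-2.4.20
% make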
And from there on in, I'll quote the INSTALL directions:
% mke2fs /dev/nda
% mount /dev/nda /mnt
% cd /mnt
% bonnie ...
The ndxN devices must exist on the client
for this to work. I've provided a script called MAKEDEV
to make them. On the client, do "cd /dev; sh path_to_MAKEDEV".
Be careful ... there is already a script called MAKEDEV
in /dev. Name yours something different, or look inside
it to see what it does and make the devices it makes by hand. You need
block devices /dev/nda, /dev/nda1, /dev/nda2, /dev/nda3, etc.,
with major 43 (or whatever the kernel
sets for NBD_MAJOR) and minors 0, 1, 2, 3, etc. "mknod /dev/nda
b 43 0; mknod /dev/nda1 b 43 1; ..." should do the trick.
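As a one-off sketch of doing it by hand (assuming the stock major 43):

% cd /dev
% mknod nda b 43 0
% for i in 1 2 3 4; do mknod nda$i b 43 $i; done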
To stop the test, you can try running "make stop" or "make rescue". I don't guarantee a rescue in all circumstances, but it'll try, and you can elaborate the Makefile to suit your circumstances.
The difficulty is in stopping the self-repairing code! Sending a
kill -USR1 to the daemons should shut them down and error out
the pending device queue requests. A kill -USR2 will try even harder
to shut them down. A kill -TERM should then murder
the daemons safely, allowing you to unload the kernel module.
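In other words, escalate like this, where $PID stands for a daemon's process id:

% kill -USR1 $PID    # shut down and error out pending queue requests
% kill -USR2 $PID    # try harder
% kill -TERM $PID    # murder the daemon; the module can be unloaded afterwards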
Look at the output from /proc/nbdinfo to gauge the state
of the device. In particular, you should see the number of active sockets,
and the number of active client threads.
Device a: Open
[a] State: initialized, verify, rw, last error 0
[a] Queued: +0 curr reqs/+0 real reqs/+10 max reqs
[a] Buffersize: 86016 (sectors=168)
[a] Blocksize: 1024 (log=10)
[a] Size: 2097152
[a] Blocks: 2048
[a] Sockets: 4 (+) (+) (*) (+)
[a] Requested: 2048+0 (602) (462) (431) (553)
[a] Dispatched: 2048 (602) (462) (431) (553)
[a] Errored: 0 (0) (0) (0) (0)
[a] Pending: 0+0 (0) (0) (0) (0)
[a] Kthreads: 0 (0 waiting/0 running/1 max)
[a] Cthreads: 4 (+) (+) (+) (+)
[a] Cpids: 4 (9489) (9490) (9491) (9492)
Device b-p: Closed
In the above I see four client threads (Cthreads) all currently
within the kernel (+). They're probably waiting for work to do. I see
four network sockets open and known good (+) with the third of them having
been active last (*). The first socket seems to have taken more of the
work available than the rest, but the difference is not significant. There
are no errors reported and no requests waiting in internal queues. If
you send in a bug report, make sure to include the output from /proc/nbdinfo.
The server generates a signature that is implanted into the client's nbd device at first contact. Any attempt afterwards to connect to a server with a different signature will be rejected. It's an anti-spoofing device. The client doesn't really know the signature either - it's buried in the kernel, and the client can only ask whether it's been given the right signature or not.
Some find out that they can remove the kernel module and then start
again successfully. Of course! That wipes the embedded signature. But it's
not the solution. The right thing to do is to signal the client daemons
with SIGPWR, so that they run the long first-contact handshake and accept
the new server's signature.
Most people are caught by GOTCHA! #1, but some people hit #2, which
is why I mention it here.
The signal with SIGPWR is normally taken care of by the assistant
daemons, nbd-sstatd and nbd-cstatd, but ten to one they haven't been
installed yet. I'll explain briefly ... the handshake sequence is longer
for a first contact than for a reconnect, and without the SIGPWR the
clients will try the short sequence instead of the long.
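If the statd daemons aren't in place, you can send the signal by hand to the client daemon:

% kill -PWR $PID    # force the long first-contact handshake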
If you don't have a native 64 bit server system, from what I can find out from the current confused state of affairs in the linux world ... under glibc2 and kernels 2.2.* and 2.4.* you need to compile the nbd-server code with _LARGEFILE64_SOURCE defined. It's all set up for you from nbd-2.2.26 and nbd-2.4.5 on.
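On earlier releases you can add the define by hand when building the server. Exactly how depends on the Makefile in your version, but something along these lines is the idea (unverified - check your Makefile):

% make CFLAGS="-D_LARGEFILE64_SOURCE"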
If you do not have Large File Support on your system, the ENBD still
supports resource aggregation, via either linear or striping RAID, to
any size, unlimited by the 2GB file size maximum, provided only that
the individual components of the aggregate resource are below 2GB in
size. Check out the command line arguments for the server. Just listing
multiple resources on the command line is enough to cause some form of
aggregation to occur!
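As a hypothetical example (the port number and paths are illustrative, and the exact option syntax may differ - check the server's usage message), listing two sub-2GB files like this would serve them as one aggregated resource:

% nbd-server 1099 /export/part1 /export/part2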
... or how to make ENBD work with heartbeat, the well known failover
infrastructure. Matthew Sackman has written a very good HOWTO on this
subject. You'll find his document here. I've added the scripts
necessary to the distribution archive (enbd-2.4.30 on) under the nbd/etc/ha.d
directory. Flash: Steve Purkis has adapted the scripts for
RedHat-based platforms, and I've included his scripts in the latest archives
(enbd-2.4.32 on) in the nbd/etc-RH/ha.d directory.
Sorry about the links. I hate non-inline documentation myself. In compensation,
I'll describe something of what one is trying to achieve with failover; heartbeat
is only a means to an end, and in many instances a simple little shell script
will be just as good or better. This description may help you construct
it!
The idea is that server and client are both capable of using a single
"floating IP address". This floating IP is normally held by the client,
but it moves to the server when the client dies, and it moves back again
when the client comes back up and has been brought up to date again. The
floating IP is normally that announced in DNS for some vital service such
as samba or http.
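Taking and dropping the floating IP by hand looks like this (the address is illustrative; heartbeat normally does this for you):

% ifconfig eth0:0 192.168.1.1 up    # take the floating IP as an alias
% ifconfig eth0:0 down              # release it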
Heartbeat is simply a general mechanism for detecting when the client
or server has failed, and for running the appropriate scripts in response.
Overview: the client will normally be running a raid1 (mirror)
composed of the NBD device and a local resource. When the client dies,
the floating IP is handed off to the server, which then starts serving
from the physical resource of which the NBD device is/was a virtual image.
When the client comes back up, its local mirror component has to be resynced
from the NBD device component, but the client can take the IP immediately,
as the mirror resyncs in the background while it continues working.
Abstraction: There are 4 possible states in which the pair of
machines can be: (1) server alive, client alive, (2) server alive, client
dead, (3) server dead, client alive, (4) server dead, client dead. Of
these, (1) is "normal" and (4) is impossible, for our purposes - failover
would have failed. The transitions (1)-(2), (2)-(1) and (1)-(3), (3)-(1)
are what we are interested in. Heartbeat initiates actions on the surviving
machine or machines after each transition.
More detail: Let's look at the (1)-(2) transition. The server
is the survivor. It will run its 'enbd-client start' script because it
now has to take up the role of the client. If the client got the chance
before dying, it would also have run its 'enbd-client stop' script. Leave
aside what these scripts do for the moment and just focus on the naming convention.
In the (2)-(1) transition, when the client comes back up, it runs its 'enbd-client
start' script, and the server runs its 'enbd-client stop' script.
Similarly, in the (1)-(3) transition, where the client is the survivor,
it must take up the role of the server and so it runs 'enbd-server start'.
The server, if it got the chance before dying, runs 'enbd-server stop'.
On the reverse transition, (3)-(1), the client runs 'enbd-server stop'
and the server runs 'enbd-server start'.
What do these scripts do?
Look at (1)-(2) again, where the client dies and the server survives
and takes the client's role. The server has to kill its enbd server daemon,
fsck the raw partition if it wasn't journaled, and then mount it in the
place where its apache and samba services expect to find it. So that's
what 'enbd-client start' does for it.
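As a sketch (the daemon name, partition and mount point are stand-ins for your own):

#!/bin/sh
# server's 'enbd-client start': take over the client's role
killall nbd-server             # stop serving the raw partition
fsck -y /dev/hda5              # only if the filesystem isn't journaled
mount /dev/hda5 /var/export    # mount where apache and samba expect it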
The matter of taking the IP is normally handled by heartbeat, but one
can do it manually with a simple ifconfig eth0:0 foobar command in the
script. The same goes for starting and stopping the apache and samba services
- i.e. that's handled by heartbeat too.
If the client got a chance to run its 'enbd-client stop' script before
dying, it would have unmounted the raid mirror, then stopped the mirror
and stopped the enbd client daemon that it was running. So that's what 'enbd-client
stop' does.
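Sketched, again with stand-in names:

#!/bin/sh
# client's 'enbd-client stop': retire the client role cleanly
umount /var/export             # unmount the raid mirror
raidstop /dev/md0              # stop the mirror
killall nbd-client             # stop the enbd client daemon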
The (2)-(1) transition is the one that restarts the client in the client
role. Usually the server will live to see this transition through, and
its 'enbd-client stop' script will unmount the raw partition, start the
enbd-server daemon on it, and that's all. The client's 'enbd-client start'
script, on the other hand, has to carefully start the enbd client daemon, wait
for the NBD device to come up, then start the mirror with the NBD device
as primary component. Oh yes, it'll also steal the floating IP address
- well, that's normally handled by heartbeat itself.
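A sketch of that careful sequence (the client daemon's arguments are hypothetical - see its usage message - and the 'Device a: Open' test leans on the /proc/nbdinfo format shown earlier):

#!/bin/sh
# client's 'enbd-client start': resume the client role
nbd-client $SERVER $PORT /dev/nda    # start the enbd client daemon
while ! grep -q 'Device a: Open' /proc/nbdinfo; do sleep 1; done
raidstart /dev/md0                   # start the mirror with /dev/nda as primary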
The (1)-(3) transition should be thought about in the same way, but it's
linked to an easier set of scripts than (1)-(2), since the apache and samba
services don't need to be relocated - they stay on the client.
The client is the survivor. It takes the role of the server with 'enbd-server
start', so this script should kill its enbd-client daemon (the mirror component
was dead anyway). It does not need to do anything else since the mirror
itself has survived. It could take the NBD component out of the mirror
with raidhotremove, but it does not need to. If the server got to run 'enbd-server
stop' before dying, it should have killed its enbd-server daemon and that's
all.
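That makes this pair of scripts short. Sketched:

#!/bin/sh
# client's 'enbd-server start': the mirror survives, so just drop the daemon
killall nbd-client                   # its mirror component was dead anyway
# optional: raidhotremove /dev/md0 /dev/nda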
The reverse transition (3)-(1) is harder. This is where the server has
to be reintegrated. It runs 'enbd-server start', which starts up its
enbd-server daemon. The client does the reintegration work - it runs
'enbd-server stop', which starts the enbd-client daemon, waits for the NBD
device to come up, then integrates it into the mirror as a secondary, using
raidhotadd.
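Sketched, with the same hypothetical daemon arguments as before:

#!/bin/sh
# client's 'enbd-server stop': reintegrate the returned server
nbd-client $SERVER $PORT /dev/nda    # restart the client daemon
while ! grep -q 'Device a: Open' /proc/nbdinfo; do sleep 1; done
raidhotadd /dev/md0 /dev/nda         # rejoin the NBD device as a secondary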
The scripts are in the HOWTO, and in the distribution archive. Phew!
The reengineered mirroring in fr1 notices exactly which block writes the absent server has missed, and when it comes back into contact, updates only those blocks.
That's a great speed up and time saver. It can reduce the time period in which the servers don't have redundancy from a matter of hours to a matter of seconds. And the resync is automatic when contact with an enbd server is reestablished. It doesn't require human intervention, because the enbd client issues the hotadd instruction.
To set it up, run an enbd device as one component of a fr1 ("raid1") mirror. That's it (so sue me, MacDonalds).
There is now an fr5 driver too, but even without it you can get at least the automatic resync on reconnect by patching the kernel for fr1 and then using the patched md.o module under ordinary kernel raid5.
Peter T. Breuer ptb@inv.it.uc3m.es