Commit Graph

132514 Commits

Author SHA1 Message Date
59a252ff8c knfsd: avoid overloading the CPU scheduler with enormous load averages
Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads.  When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up.  As long as there are idle
threads, one will be woken up.

If there are lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them.  Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run.  This situation
causes two significant problems:

1. The CPU scheduler takes over 10% of each CPU, which is robbing
   the nfsd threads of valuable CPU time.

2. At a high enough load, the nfsd threads starve userspace threads
   of CPU time, to the point where daemons like portmap and rpc.mountd
   do not schedule for tens of seconds at a time.  Clients attempting
   to mount an NFS filesystem timeout at the very first step (opening
   a TCP connection to portmap) because portmap cannot wake up from
   select() and call accept() in time.

Disclaimer: these effects were observed on a SLES9 kernel, modern
kernels' schedulers may behave more gracefully.

The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.

Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server.  That tree is small
enough to fill in the server's RAM so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server.  The server was running 128 nfsds.

Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.

This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:

http://article.gmane.org/gmane.linux.nfs/10374

Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:41 -04:00
8bbfa9f388 knfsd: remove the nfsd thread busy histogram
Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
because the questionable data provided is not worth the scalability
impact of calculating it.  Instead, always report zeroes.  The current
approach suffers from three major issues:

1. update_thread_usage() increments buckets by call service
   time or call arrival time...in jiffies.  On lightly loaded
   machines, call service times are usually < 1 jiffy; on
   heavily loaded machines call arrival times will be << 1 jiffy.
   So a large portion of the updates to the buckets are rounded
   down to zero, and the histogram is undercounting.

2. As seen previously on the nfs mailing list, the format in which
   the histogram is presented is cryptic, difficult to explain,
   and difficult to use.

3. Updating the histogram requires taking a global spinlock and
   dirtying the global variables nfsd_last_call, nfsd_busy, and
   nfsdstats *twice* on every RPC call, which is a significant
   scaling limitation.

Testing on a 4 CPU 4 NIC Altix using 4 IRIX clients each doing
1K streaming reads at full line rate, shows the stats update code
(inlined into nfsd()) takes about 1.7% of each CPU.  This patch drops
the contribution from nfsd() into the profile noise.

This patch is a forward-ported version of knfsd-remove-nfsd-threadstats
which has been shipping in the SGI "Enhanced NFS" product since 2006.
In that time, exactly one customer has noticed that the threadstats
were missing.  It has been previously posted:

http://article.gmane.org/gmane.linux.nfs/10376

and more recently requested to be posted again.

Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:41 -04:00
5cb031b0af nfsd4: remove redundant check from nfsd4_open
Note that we already checked for this invalid case at the top of this
function.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:41 -04:00
05f4f678b0 nfsd4: don't do lookup within readdir in recovery code
The main nfsd code was recently modified to no longer do lookups from
withing the readdir callback, to avoid locking problems on certain
filesystems.

This (rather hacky, and overdue for replacement) NFSv4 recovery code has
the same problem.  Fix it to build up a list of names (instead of
dentries) and do the lookups afterwards.

Reported symptoms were a deadlock in the xfs code (called from
nfsd4_recdir_load), with /var/lib/nfs on xfs.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reported-by: David Warren <warren@atmos.washington.edu>
2009-03-18 17:38:40 -04:00
a1c8c4d1ff nfsd4: support putpubfh operation
Currently putpubfh returns NFSERR_OPNOTSUPP, which isn't actually
allowed for v4.  The right error is probably NFSERR_NOTSUPP.

But let's just implement it; though rarely seen, it can be used by
Solaris (with a special mount option), is mandated by the rfc, and is
trivial for us to support.

Thanks to Yang Hongyang for pointing out the original problem, and to
Mike Eisler, Tom Talpey, Trond Myklebust, and Dave Noveck for further
argument....

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:40 -04:00
31dec2538e Short write in nfsd becomes a full write to the client
If a filesystem being written to via NFS returns a short write count
(as opposed to an error) to nfsd, nfsd treats that as a success for
the entire write, rather than the short count that actually succeeded.

For example, given a 8192 byte write, if the underlying filesystem
only writes 4096 bytes, nfsd will ack back to the nfs client that all
8192 bytes were written.  The nfs client does have retry logic for
short writes, but this is never called as the client is told the
complete write succeeded.

There are probably other ways it could happen, but in my case it
happened with a fuse (filesystem in userspace) filesystem which can
rather easily have a partial write.

Here is a patch to properly return the short write count to the
client.

Signed-off-by: David Shaw <dshaw@jabberwocky.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:40 -04:00
1e685ec270 NFSD: return nfsv4 error code nfserr_notsupp rather than nfsv[23]'s nfserr_opnotsupp
Thanks for Bill Baker at sun.com for catching this
at Connectathon 2009.

This bug was introduced in 2.6.27

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:39 -04:00
a601caeda2 nfsd4: move rpc_client setup to a separate function
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:39 -04:00
418cd20aa1 nfsd4: fix do_probe_callback errors
The errors returned aren't used.  Just return 0 and make them available
to a dprintk().  Also, consistently use -ERRNO errors instead of nfs
errors.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
2009-03-18 17:38:39 -04:00
8b671b8070 nfsd4: remove use of mutex for file_hashtable
As part of reducing the scope of the client_mutex, and in order to
remove the need for mutexes from the callback code (so that callbacks
can be done as asynchronous rpc calls), move manipulations of the
file_hashtable under the recall_lock.

Update the relevant comments while we're here.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Alexandros Batsakis <batsakis@netapp.com>
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
2009-03-18 17:38:38 -04:00
d7fdcfe0aa nfsd4: put_nfs4_client does not require state lock
Since free_client() is guaranteed to only be called once, and to only
touch the client structure itself (not any common data structures), it
has no need for the state lock.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Alexandros Batsakis <batsakis@netapp.com>
2009-03-18 17:38:38 -04:00
18f82731b7 nfsd4: rename io_during_grace_disallowed
Use a slightly clearer, more concise name.  Also removed unused
argument.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:38 -04:00
6150ef0dc7 nfsd4: remove unused CHECK_FH flag
All users now pass this, so it's meaningless.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:37 -04:00
7e0f7cf582 nfsd4: fail when delegreturn gets a non-delegation stateid
Previous cleanup reveals an obvious (though harmless) bug: when
delegreturn gets a stateid that isn't for a delegation, it should return
an error rather than doing nothing.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:37 -04:00
203a8c8e66 nfsd4: separate delegreturn case from preprocess_stateid_op
Delegreturn is enough a special case for preprocess_stateid_op to
warrant just open-coding it in delegreturn.

There should be no change in behavior here; we're just reshuffling code.

Thanks to Yang Hongyang for catching a critical typo.

Reviewed-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:38:18 -04:00
3e633079e3 nfsd4: add a helper function to decide if stateid is delegation
Make this check self-documenting.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:52 -04:00
819a8f539a nfsd4: remove some dprintk's
I can't recall ever seeing these printk's used to debug a problem.  I'll
happily put them back if we see a case where they'd be useful.  (Though
if we do that the find_XXX() errors would probably be better
reported in find_XXX() functions themselves.)

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:52 -04:00
fd03b09906 nfsd4: remove unneeded local variable
We no longer need stidp.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:52 -04:00
dc9bf700ed nfsd4: remove redundant "if" in nfs4_preprocess_stateid_op
Note that we exit this first big "if" with stp == NULL if and only if we
took the first branch; therefore, the second "if" is redundant, and we
can just combine the two, simplifying the logic.

Reviewed-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:52 -04:00
0c2a498fa6 nfsd4: move check_stateid_generation check
No change in behavior.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:51 -04:00
a4455be085 nfsd4: trivial preprocess_stateid_op cleanup
Remove a couple redundant comments, adjust style; no change in behavior.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:51 -04:00
77f18f5e4e nfs: replace uses of __constant_{endian}
The base versions handle constant folding now, none of these headers
are exported to userspace, so the __ prefixed versions are not
necessary.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:51 -04:00
4ac35c2f79 nfsd(v2/v3): fix the failure of creation from HPUX client
sometimes HPUX nfs client sends a create request to linux nfs server(v2/v3).
the dump of the request is like:
    obj_attributes
        mode: value follows
            set_it: value follows (1)
            mode: 00
        uid: no value
            set_it: no value (0)
        gid: value follows
            set_it: value follows (1)
            gid: 8030
        size: value follows
            set_it: value follows (1)
            size: 0
        atime: don't change
            set_it: don't change (0)
        mtime: don't change
            set_it: don't change (0)

note that mode is 00(havs no rwx privilege even for the owner) and it requires
to set size to 0.

as current nfsd(v2/v3) implementation, the server does mainly 2 steps:
1) creates the file in mode specified by calling vfs_create().
2) sets attributes for the file by calling nfsd_setattr().

at step 2), it finally calls file system specific setattr() function which may
fail when checking permission because changing size needs WRITE privilege but
it has none since mode is 000.

for this case, a new file created, we may simply ignore the request of
setting size to 0, so that WRITE privilege is not needed and the open
succeeds.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
--
 vfs.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:50 -04:00
e33d1ea60c lockd: clean up blocking lock cases of nlsmvc_lock()
No change in behavior, just rearranging the switch so that we break out
of the switch if and only if we're in the wait case.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:50 -04:00
e37da04ed1 nfsd: lock state around put client and delegation in nfsd4_cb_recall
not having the state locked before putting the client/delegation causes a bug.
Also removed the comment from the function header about the state being already locked

Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:50 -04:00
6c02eaa1d1 nfsd4: use helper for copying delegation filehandle
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:49 -04:00
a4773c08f2 nfsd4: use helper for copying filehandles for replay
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:49 -04:00
13024b7b40 nfsd4: fix misplaced comment
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:49 -04:00
99f8872638 nfsd: clarify exclusive create bitmask result.
The use of |= is confusing--the bitmask is always initialized to zero in
this case, so we're effectively just doing an assignment here.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:48 -04:00
686665619e nfsd : Define NFSD only when FILE_LOCKING is enabled
Enable NFSD only when FILE_LOCKING is enabled, since we don't want to
support NFSD without FILE_LOCKING.

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:48 -04:00
12214cb781 NFSD: cleanup for nfs3proc.c
MSDOS_SUPER_MAGIC is defined in <linux/magic.h>,
so use MSDOS_SUPER_MAGIC directly.

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:48 -04:00
f044ff830f nfsd4: split open/lockowner release code
The caller always knows specifically whether it's releasing a lockowner
or an openowner, and the code is simpler if we use separate functions
(and the apparent recursion is gone).

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:47 -04:00
f1d110caf7 nfsd4: remove a forward declaration
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-03-18 17:30:47 -04:00
2283963f27 nfsd4: split lockstateid/openstateid release logic
The flags here attempt to make the code more general, but I find it
actually just adds confusion.

I think it's clearer to separate the logic for the open and lock cases
entirely.  And eventually we may want to separate the stateowner and
stateid types as well, as many of the fields aren't shared between the
lock and open cases.

Also move to eliminate forward references.

Start with the stateid's.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reviewed-by: Benny Halevy <bhalevy@panasas.com>
2009-03-18 17:30:47 -04:00
a1e4ee2286 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6:
  Staging: benet: remove driver now that it is merged in drivers/net/
2009-03-18 09:34:17 -07:00
85bff8857c Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.29' of git://linux-nfs.org/~bfields/linux:
  nfsd: nfsd should drop CAP_MKNOD for non-root
  NFSD: provide encode routine for OP_OPENATTR
2009-03-18 09:27:20 -07:00
d0573facf2 Staging: benet: remove driver now that it is merged in drivers/net/
The benet driver is now in the proper place in drivers/net/benet, so we
can remove the staging version.

Acked-by: Sathya Perla <sathyap@serverengines.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-18 09:22:17 -07:00
d941d0ed6b Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/ps3: ps3_defconfig updates
  powerpc/mm: Respect _PAGE_COHERENT on classic ppc32 SW
  powerpc/5200: Enable CPU_FTR_NEED_COHERENT for MPC52xx
  ps3/block: Replace mtd/ps3vram by block/ps3vram
2009-03-18 09:05:40 -07:00
8144737def Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  module: fix refptr allocation and release order
2009-03-18 09:04:25 -07:00
99dbe10968 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6:
  USB: storage: Unusual USB device Prolific 2507 variation added
  USB: Add device id for Option GTM380 to option driver
  USB: Add Vendor/Product ID for new CDMA U727 to option driver
  USB: Updated unusual-devs entry for USB mass storage on Nokia 6233
  USB: Option: let cdc-acm handle Sony Ericsson F3507g / Dell 5530
  USB: EHCI: expedite unlinks when the root hub is suspended
  USB: EHCI: Fix isochronous URB leak
  USB: option.c: add ZTE 622 modem device
  USB: wusbcore/wa-xfer, fix lock imbalance
  USB: misc/vstusb, fix lock imbalance
  USB: misc/adutux, fix lock imbalance
  USB: image/mdc800, fix lock imbalance
  USB: atm/cxacru, fix lock imbalance
  USB: unusual_devs: Add support for GI 0431 SD-Card interface
  USB: serial: new cp2101 device id
  USB: serial: ftdi: enable UART detection on gnICE JTAG adaptors blacklist interface0
  USB: serial: add FTDI USB/Serial converter devices
  USB: usbfs: keep async URBs until the device file is closed
  USB: usbtmc: add protocol 1 support
  USB: usbtmc: fix stupid bug in open()
2009-03-18 09:03:18 -07:00
bd27e6d3d2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: Fix vunmap and free order in snd_free_sgbuf_pages()
  ALSA: mixart, fix lock imbalance
  ALSA: pcm_oss, fix locking typo
  ALSA: oss-mixer - Fixes recording gain control
  ALSA: hda - Workaround for buggy DMA position on ATI controllers
  ALSA: hda - Fix DMA mask for ATI controllers
  ALSA: opl3sa2 - Fix NULL dereference when suspending snd_opl3sa2
2009-03-18 07:39:11 -07:00
f1aa298679 Merge branch 'fix/opl3sa2-suspend' into for-linus 2009-03-18 08:04:36 +01:00
a232ee66e0 Merge branch 'fix/hda' into for-linus 2009-03-18 08:04:16 +01:00
6af845e4eb ALSA: Fix vunmap and free order in snd_free_sgbuf_pages()
In snd_free_sgbuf_pags(), vunmap() is called after releasing the SG
pages, and it causes errors on Xen as Xen manages the pages
differently.  Although no significant errors have been reported on
the actual hardware, this order should be fixed other way round,
first vunmap() then free pages.

Cc: Jan Beulich <jbeulich@novell.com>
Cc: <stable@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2009-03-18 08:04:01 +01:00
82f5d57163 ALSA: mixart, fix lock imbalance
There is an omitted unlock in one snd_mixart_hw_params fail path. Fix it.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2009-03-18 08:03:49 +01:00
91054598f7 ALSA: pcm_oss, fix locking typo
s/mutex_lock/mutex_unlock/ on 2 fail paths in snd_pcm_oss_proc_write.
Probably a typo, lock should be unlocked when leaving the function.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2009-03-18 08:03:33 +01:00
36c7b833e5 ALSA: oss-mixer - Fixes recording gain control
At the time of initialization, SNDRV_MIXER_OSS_PRESENT_PVOLUME bit is not
set for MIC (slot 7).
So, the same should not be checked when an application tries to do gain
control for audio recording devices.

Just check slot->present for SNDRV_MIXER_OSS_PRESENT_CVOLUME independently.
Verified with a simple application which opens /dev/dsp for recording and
/dev/mixer for volume control.

Have tested two usb audio mic devices.

Signed-off-by: Viral Mehta <viral.mehta@einfochips.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2009-03-18 07:52:28 +01:00
c673ba1c23 ALSA: hda - Workaround for buggy DMA position on ATI controllers
The position-buffer on ATI controllers are unreliable as well as
on VIA chips, thus the same workaround for DMA position reading as
VIA is useful for ATI.

Cc: <stable@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2009-03-18 07:46:21 +01:00
09240cf429 ALSA: hda - Fix DMA mask for ATI controllers
ATI controllers (at least some SB0600 models) appear buggy to handle
64bit DMA.  As a workaround, reset GCAP bit0 and let the driver to
use only 32bit DMA on these controllers.

Cc: <stable@kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2009-03-18 07:45:41 +01:00
58cefd2b1e Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix bb_prealloc_list corruption due to wrong group locking
  ext4: fix bogus BUG_ONs in in mballoc code
  ext4: Print the find_group_flex() warning only once
  ext4: fix header check in ext4_ext_search_right() for deep extent trees.
2009-03-17 20:55:40 -07:00