Commit Graph

20284 Commits

Author SHA1 Message Date
d3b428f036 fs: create and use seq_show_option for escaping
commit a068acf2ee upstream.

Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g.  new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else.  This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed.  Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21 10:05:45 -07:00
668327e49d sched: Fix cpu_active_mask/cpu_online_mask race
commit dd9d384375 upstream.

There is a race condition in SMP bootup code, which may result
in

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()
or
    kernel BUG at kernel/smpboot.c:135!

It can be triggered with a bit of luck in Linux guests running
on busy hosts.

	CPU0                        CPUn
	====                        ====

	_cpu_up()
	  __cpu_up()
				    start_secondary()
				      set_cpu_online()
					cpumask_set_cpu(cpu,
						   to_cpumask(cpu_online_bits));
	  cpu_notify(CPU_ONLINE)
	    <do stuff, see below>
					cpumask_set_cpu(cpu,
						   to_cpumask(cpu_active_bits));

During the various CPU_ONLINE callbacks CPUn is online but not
active. Several things can go wrong at that point, depending on
the scheduling of tasks on CPU0.

Variant 1:

  cpu_notify(CPU_ONLINE)
    workqueue_cpu_up_callback()
      rebind_workers()
        set_cpus_allowed_ptr()

  This call fails because it requires an active CPU; rebind_workers()
  ends with a warning:

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()

Variant 2:

  cpu_notify(CPU_ONLINE)
    smpboot_thread_call()
      smpboot_unpark_threads()
       ..
        __kthread_unpark()
          __kthread_bind()
          wake_up_state()
           ..
            select_task_rq()
              select_fallback_rq()

  The ->wake_cpu of the unparked thread is not allowed, making a call
  to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
  find an allowed, active CPU and promptly resets the allowed CPUs, so
  that the task in question ends up on CPU0.

  When those unparked tasks are eventually executed, they run
  immediately into a BUG:

    kernel BUG at kernel/smpboot.c:135!

Just changing the order in which the online/active bits are set
(and adding some memory barriers), would solve the two issues
above. However, it would change the order of operations back to
the one before commit 6acbfb9697 ("sched: Fix hotplug vs.
set_cpus_allowed_ptr()"), thus, reintroducing that particular
problem.

Going further back into history, we have at least the following
commits touching this topic:
- commit 2baab4e904 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
- commit 5fbd036b55 ("sched: Cleanup cpu_active madness")

Together, these give us the following non-working solutions:

  - secondary CPU sets active before online, because active is assumed to
    be a subset of online;

  - secondary CPU sets online before active, because the primary CPU
    assumes that an online CPU is also active;

  - secondary CPU sets online and waits for primary CPU to set active,
    because it might deadlock.

Commit 875ebe940d ("powerpc/smp: Wait until secondaries are
active & online") introduces an arch-specific solution to this
arch-independent problem.

Now, go for a more general solution without explicit waiting and
simply set active twice: once on the secondary CPU after online
was set and once on the primary CPU after online was seen.

set_cpus_allowed_ptr()")

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Anton Blanchard <anton@samba.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Wilson <msw@amazon.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 6acbfb9697 ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
Link: http://lkml.kernel.org/r/1439408156-18840-1-git-send-email-jschoenh@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-21 10:05:29 -07:00
0c7e8b8a0a genirq: Introduce irq_chip_set_type_parent() helper
commit b7560de198 upstream.

This helper is required for irq chips which do not implement a
irq_set_type callback and need to call down the irq domain hierarchy
for the actual trigger type change.

This helper is required to fix further wreckage caused by the
conversion of TI OMAP to hierarchical irq domains and therefor tagged
for stable.

[ tglx: Massaged changelog ]

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: <linux@arm.linux.org.uk>
Cc: <nsekhar@ti.com>
Cc: <jason@lakedaemon.net>
Cc: <balbi@ti.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: <tony@atomide.com>
Cc: <marc.zyngier@arm.com>
Cc: stable@vger.kernel.org # 4.1
Link: http://lkml.kernel.org/r/1439554830-19502-3-git-send-email-grygorii.strashko@ti.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13 09:07:51 -07:00
533eb4c087 genirq: Don't return ENOSYS in irq_chip_retrigger_hierarchy
commit 6d4affea7d upstream.

irq_chip_retrigger_hierarchy() returns -ENOSYS if it was not able to
find at least one .irq_retrigger() callback implemented in the IRQ
domain hierarchy.

That's wrong, because check_irq_resend() expects a 0 return value from
the callback in case that the hardware assisted resend was not
possible. If the return value is non zero the core code assumes
hardware resend success and the software resend is not invoked.

This results in lost interrupts on platforms where none of the parent
irq chips in the hierarchy implements the retrigger callback.

This is observable on TI OMAP, where the hierarchy is:

 ARM GIC <- OMAP wakeupgen <- TI Crossbar

Return 0 instead so the software resend mechanism gets invoked.

[ tglx: Massaged changelog ]

Fixes: 85f08c17de ('genirq: Introduce helper functions...')
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: <linux@arm.linux.org.uk>
Cc: <nsekhar@ti.com>
Cc: <jason@lakedaemon.net>
Cc: <balbi@ti.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Cc: <tony@atomide.com>
Link: http://lkml.kernel.org/r/1439554830-19502-2-git-send-email-grygorii.strashko@ti.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13 09:07:51 -07:00
2f9de0cc23 cpuset: use trialcs->mems_allowed as a temp variable
commit 24ee3cf89b upstream.

The comment says it's using trialcs->mems_allowed as a temp variable but
it didn't match the code. Change the code to match the comment.

This fixes an issue when writing in cpuset.mems when a sub-directory
exists: we need to write several times for the information to persist:

| root@alban:/sys/fs/cgroup/cpuset# mkdir footest9
| root@alban:/sys/fs/cgroup/cpuset# cd footest9
| root@alban:/sys/fs/cgroup/cpuset/footest9# mkdir aa
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
|
| root@alban:/sys/fs/cgroup/cpuset/footest9# echo 0 > aa/cpuset.mems
| root@alban:/sys/fs/cgroup/cpuset/footest9# cat aa/cpuset.mems
| 0
| root@alban:/sys/fs/cgroup/cpuset/footest9#

This should help to fix the following issue in Docker:
https://github.com/opencontainers/runc/issues/133
In some conditions, a Docker container needs to be started twice in
order to work.

Signed-off-by: Alban Crequy <alban@endocode.com>
Tested-by: Iago López Galeiras <iago@endocode.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13 09:07:46 -07:00
58b9cca673 perf: Fix PERF_EVENT_IOC_PERIOD migration race
commit c7999c6f3f upstream.

I ran the perf fuzzer, which triggered some WARN()s which are due to
trying to stop/restart an event on the wrong CPU.

Use the normal IPI pattern to ensure we run the code on the correct CPU.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: bad7192b84 ("perf: Fix PERF_EVENT_IOC_PERIOD to force-reset the period")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13 09:07:39 -07:00
db8d87e1cd perf: Fix double-free of the AUX buffer
commit ee9397a6fb upstream.

If rb->aux_refcount is decremented to zero before rb->refcount,
__rb_free_aux() may be called twice resulting in a double free of
rb->aux_pages.  Fix this by adding a check to __rb_free_aux().

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 57ffc5ca67 ("perf: Fix AUX buffer refcounting")
Link: http://lkml.kernel.org/r/1437953468.12842.17.camel@decadent.org.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13 09:07:39 -07:00
b8cae722c4 perf: Fix running time accounting
commit 00a2916f7f upstream.

A recent fix to the shadow timestamp inadvertly broke the running time
accounting.

We must not update the running timestamp if we fail to schedule the
event, the event will not have ran. This can (and did) result in
negative total runtime because the stopped timestamp was before the
running timestamp (we 'started' but never stopped the event -- because
it never really started we didn't have to stop it either).

Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 72f669c008 ("perf: Update shadow timestamp before add event")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13 09:07:39 -07:00
75d370fe0b perf: Fix fasync handling on inherited events
commit fed66e2cdd upstream.

Vince reported that the fasync signal stuff doesn't work proper for
inherited events. So fix that.

Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
which upon the demise of the file descriptor ensures the allocation is
freed and state is updated.

Now for perf, we can have the events stick around for a while after the
original FD is dead because of references from child events. So we
cannot copy the fasync pointer around. We can however consistently use
the parent's fasync, as that will be updated.

Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho deMelo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-13 09:07:39 -07:00
52124831a3 signal: fix information leak in copy_siginfo_from_user32
commit 3c00cb5e68 upstream.

This function can leak kernel stack data when the user siginfo_t has a
positive si_code value.  The top 16 bits of si_code descibe which fields
in the siginfo_t union are active, but they are treated inconsistently
between copy_siginfo_from_user32, copy_siginfo_to_user32 and
copy_siginfo_to_user.

copy_siginfo_from_user32 is called from rt_sigqueueinfo and
rt_tgsigqueueinfo in which the user has full control overthe top 16 bits
of si_code.

This fixes the following information leaks:
x86:   8 bytes leaked when sending a signal from a 32-bit process to
       itself. This leak grows to 16 bytes if the process uses x32.
       (si_code = __SI_CHLD)
x86:   100 bytes leaked when sending a signal from a 32-bit process to
       a 64-bit process. (si_code = -1)
sparc: 4 bytes leaked when sending a signal from a 32-bit process to a
       64-bit process. (si_code = any)

parsic and s390 have similar bugs, but they are not vulnerable because
rt_[tg]sigqueueinfo have checks that prevent sending a positive si_code
to a different process.  These bugs are also fixed for consistency.

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16 20:52:26 -07:00
c08a75d950 signal: fix information leak in copy_siginfo_to_user
commit 26135022f8 upstream.

This function may copy the si_addr_lsb, si_lower and si_upper fields to
user mode when they haven't been initialized, which can leak kernel
stack data to user mode.

Just checking the value of si_code is insufficient because the same
si_code value is shared between multiple signals.  This is solved by
checking the value of si_signo in addition to si_code.

Signed-off-by: Amanieu d'Antras <amanieu@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16 20:52:26 -07:00
23713b4de7 ftrace: Fix breakage of set_ftrace_pid
commit e3eea1404f upstream.

Commit 4104d326b6 ("ftrace: Remove global function list and call function
directly") simplified the ftrace code by removing the global_ops list with a
new design. But this cleanup also broke the filtering of PIDs that are added
to the set_ftrace_pid file.

Add back the proper hooks to have pid filtering working once again.

Reported-by: Matt Fleming <matt@console-pimps.org>
Reported-by: Richard Weinberger <richard.weinberger@gmail.com>
Tested-by: Matt Fleming <matt@console-pimps.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10 12:21:55 -07:00
8ee1239a02 genirq: Prevent resend to interrupts marked IRQ_NESTED_THREAD
commit 75a06189fc upstream.

The resend mechanism happily calls the interrupt handler of interrupts
which are marked IRQ_NESTED_THREAD from softirq context. This can
result in crashes because the interrupt handler is not the proper way
to invoke the device handlers. They must be invoked via
handle_nested_irq.

Prevent the resend even if the interrupt has no valid parent irq
set. Its better to have a lost interrupt than a crashing machine.

Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10 12:21:53 -07:00
ae41bfc681 security_syslog() should be called once only
commit d194e5d666 upstream.

The final version of commit 637241a900 ("kmsg: honor dmesg_restrict
sysctl on /dev/kmsg") lost few hooks, as result security_syslog() are
processed incorrectly:

- open of /dev/kmsg checks syslog access permissions by using
  check_syslog_permissions() where security_syslog() is not called if
  dmesg_restrict is set.

- syslog syscall and /proc/kmsg calls do_syslog() where security_syslog
  can be executed twice (inside check_syslog_permissions() and then
  directly in do_syslog())

With this patch security_syslog() is called once only in all
syslog-related operations regardless of dmesg_restrict value.

Fixes: 637241a900 ("kmsg: honor dmesg_restrict sysctl on /dev/kmsg")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03 09:29:15 -07:00
01fed2338a PM / sleep: Increase default DPM watchdog timeout to 60
commit fff3b16d27 upstream.

Many harddisks (mostly WD ones) have firmware problems and take too
long, more than 10 seconds, to resume from suspend.  And this often
exceeds the default DPM watchdog timeout (12 seconds), resulting in a
kernel panic out of sudden.

Since most distros just take the default as is, we should give a bit
more safer value.  This patch increases the default value from 12
seconds to one minute, which has been confirmed to be long enough for
such problematic disks.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=91921
Fixes: 70fea60d88 (PM / Sleep: Detect device suspend/resume lockup and log event)
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03 09:29:15 -07:00
624dda42c3 tracing: Have branch tracer use recursive field of task struct
commit 6224beb12e upstream.

Fengguang Wu's tests triggered a bug in the branch tracer's start up
test when CONFIG_DEBUG_PREEMPT set. This was because that config
adds some debug logic in the per cpu field, which calls back into
the branch tracer.

The branch tracer has its own recursive checks, but uses a per cpu
variable to implement it. If retrieving the per cpu variable calls
back into the branch tracer, you can see how things will break.

Instead of using a per cpu variable, use the trace_recursion field
of the current task struct. Simply set a bit when entering the
branch tracing and clear it when leaving. If the bit is set on
entry, just don't do the tracing.

There's also the case with lockdep, as the local_irq_save() called
before the recursion can also trigger code that can call back into
the function. Changing that to a raw_local_irq_save() will protect
that as well.

This prevents the recursion and the inevitable crash that follows.

Link: http://lkml.kernel.org/r/20150630141803.GA28071@wfg-t540p.sh.intel.com

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03 09:29:12 -07:00
2161c86793 tracing: Fix typo from "static inlin" to "static inline"
commit cc9e4bde03 upstream.

The trace.h header when called without CONFIG_EVENT_TRACING enabled
(seldom done), will not compile because of a typo in the protocol
of trace_event_enum_update().

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03 09:29:12 -07:00
a27274be01 tracing/filter: Do not allow infix to exceed end of string
commit 6b88f44e16 upstream.

While debugging a WARN_ON() for filtering, I found that it is possible
for the filter string to be referenced after its end. With the filter:

 # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter

The filter_parse() function can call infix_get_op() which calls
infix_advance() that updates the infix filter pointers for the cnt
and tail without checking if the filter is already at the end, which
will put the cnt to zero and the tail beyond the end. The loop then calls
infix_next() that has

	ps->infix.cnt--;
	return ps->infix.string[ps->infix.tail++];

The cnt will now be below zero, and the tail that is returned is
already passed the end of the filter string. So far the allocation
of the filter string usually has some buffer that is zeroed out, but
if the filter string is of the exact size of the allocated buffer
there's no guarantee that the charater after the nul terminating
character will be zero.

Luckily, only root can write to the filter.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03 09:29:12 -07:00
baa7b46259 tracing/filter: Do not WARN on operand count going below zero
commit b4875bbe7e upstream.

When testing the fix for the trace filter, I could not come up with
a scenario where the operand count goes below zero, so I added a
WARN_ON_ONCE(cnt < 0) to the logic. But there is legitimate case
that it can happen (although the filter would be wrong).

 # echo '>' > /sys/kernel/debug/events/ext4/ext4_truncate_exit/filter

That is, a single operation without any operands will hit the path
where the WARN_ON_ONCE() can trigger. Although this is harmless,
and the filter is reported as a error. But instead of spitting out
a warning to the kernel dmesg, just fail nicely and report it via
the proper channels.

Link: http://lkml.kernel.org/r/558C6082.90608@oracle.com

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03 09:29:12 -07:00
25d8f169ee genirq: devres: Fix testing return value of request_any_context_irq()
commit 63781394c5 upstream.

request_any_context_irq() returns a negative value on failure.
It returns either IRQC_IS_HARDIRQ or IRQC_IS_NESTED on success.
So fix testing return value of request_any_context_irq().

Also fixup the return value of devm_request_any_context_irq() to make it
consistent with request_any_context_irq().

Fixes: 0668d30651 ("genirq: Add devm_request_any_context_irq()")
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1431334978.17783.4.camel@ingics.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21 10:10:05 -07:00
9da8e034da livepatch: add module locking around kallsyms calls
commit 9a1bd63cda upstream.

The list of loaded modules is walked through in
module_kallsyms_on_each_symbol (called by kallsyms_on_each_symbol). The
module_mutex lock should be acquired to prevent potential corruptions
in the list.

This was uncovered with new lockdep asserts in module code introduced by
the commit 0be964be0d ("module: Sanitize RCU usage and locking") in
recent next- trees.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21 10:10:04 -07:00
a6556a506e rcu: Correctly handle non-empty Tiny RCU callback list with none ready
commit 6e91f8cb13 upstream.

If, at the time __rcu_process_callbacks() is invoked,  there are callbacks
in Tiny RCU's callback list, but none of them are ready to be invoked,
the current list-management code will knit the non-ready callbacks out
of the list.  This can result in hangs and possibly worse.  This commit
therefore inserts a check for there being no callbacks that can be
invoked immediately.

This bug is unlikely to occur -- you have to get a new callback between
the time rcu_sched_qs() or rcu_bh_qs() was called, but before we get to
__rcu_process_callbacks().  It was detected by the addition of RCU-bh
testing to rcutorture, which in turn was instigated by Iftekhar Ahmed's
mutation testing.  Although this bug was made much more likely by
915e8a4fe4 (rcu: Remove fastpath from __rcu_process_callbacks()), this
did not cause the bug, but rather made it much more probable.   That
said, it takes more than 40 hours of rcutorture testing, on average,
for this bug to appear, so this fix cannot be considered an emergency.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21 10:10:01 -07:00
28dd1f346b sysfs: Create mountpoints with sysfs_create_mount_point
commit f9bb48825a upstream.

This allows for better documentation in the code and
it allows for a simpler and fully correct version of
fs_fully_visible to be written.

The mount points converted and their filesystems are:
/sys/hypervisor/s390/       s390_hypfs
/sys/kernel/config/         configfs
/sys/kernel/debug/          debugfs
/sys/firmware/efi/efivars/  efivarfs
/sys/fs/fuse/connections/   fusectl
/sys/fs/pstore/             pstore
/sys/kernel/tracing/        tracefs
/sys/fs/cgroup/             cgroup
/sys/kernel/security/       securityfs
/sys/fs/selinux/            selinuxfs
/sys/fs/smackfs/            smackfs

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21 10:10:01 -07:00
bdbdf7ee9d sysctl: Allow creating permanently empty directories that serve as mountpoints.
commit f9bd6733d3 upstream.

Add a magic sysctl table sysctl_mount_point that when used to
create a directory forces that directory to be permanently empty.

Update the code to use make_empty_dir_inode when accessing permanently
empty directories.

Update the code to not allow adding to permanently empty directories.

Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-21 10:10:00 -07:00
12709f95fd perf: Fix ring_buffer_attach() RCU sync, again
commit 2f993cf093 upstream.

While looking for other users of get_state/cond_sync. I Found
ring_buffer_attach() and it looks obviously buggy?

Don't we need to ensure that we have "synchronize" _between_
list_del() and list_add() ?

IOW. Suppose that ring_buffer_attach() preempts right_after
get_state_synchronize_rcu() and gp completes before spin_lock().

In this case cond_synchronize_rcu() does nothing and we reuse
->rb_entry without waiting for gp in between?

It also moves the ->rcu_pending check under "if (rb)", to make it
more readable imo.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@stgolabs.net
Cc: der.herr@hofr.at
Cc: josh@joshtriplett.org
Cc: tj@kernel.org
Fixes: b69cf53640 ("perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()")
Link: http://lkml.kernel.org/r/20150530200425.GA15748@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-29 12:35:28 -07:00
17fda38f15 Merge tag 'trace-fix-filter-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing filter fix from Steven Rostedt:
 "Vince Weaver reported a warning when he added perf event filters into
  his fuzzer tests.  There's a missing check of balanced operations when
  parenthesis are used, and this triggers a WARN_ON() and when reading
  the failure, the filter reports no failure occurred.

  The operands were not being checked if they match, this adds that"

* tag 'trace-fix-filter-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Have filter check for balanced ops
2015-06-17 20:56:57 -10:00
2cf30dc180 tracing: Have filter check for balanced ops
When the following filter is used it causes a warning to trigger:

 # cd /sys/kernel/debug/tracing
 # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
 # cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: No error

 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 1223 at kernel/trace/trace_events_filter.c:1640 replace_preds+0x3c5/0x990()
 Modules linked in: bnep lockd grace bluetooth  ...
 CPU: 3 PID: 1223 Comm: bash Tainted: G        W       4.1.0-rc3-test+ #450
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
  0000000000000668 ffff8800c106bc98 ffffffff816ed4f9 ffff88011ead0cf0
  0000000000000000 ffff8800c106bcd8 ffffffff8107fb07 ffffffff8136b46c
  ffff8800c7d81d48 ffff8800d4c2bc00 ffff8800d4d4f920 00000000ffffffea
 Call Trace:
  [<ffffffff816ed4f9>] dump_stack+0x4c/0x6e
  [<ffffffff8107fb07>] warn_slowpath_common+0x97/0xe0
  [<ffffffff8136b46c>] ? _kstrtoull+0x2c/0x80
  [<ffffffff8107fb6a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff81159065>] replace_preds+0x3c5/0x990
  [<ffffffff811596b2>] create_filter+0x82/0xb0
  [<ffffffff81159944>] apply_event_filter+0xd4/0x180
  [<ffffffff81152bbf>] event_filter_write+0x8f/0x120
  [<ffffffff811db2a8>] __vfs_write+0x28/0xe0
  [<ffffffff811dda43>] ? __sb_start_write+0x53/0xf0
  [<ffffffff812e51e0>] ? security_file_permission+0x30/0xc0
  [<ffffffff811dc408>] vfs_write+0xb8/0x1b0
  [<ffffffff811dc72f>] SyS_write+0x4f/0xb0
  [<ffffffff816f5217>] system_call_fastpath+0x12/0x6a
 ---[ end trace e11028bd95818dcd ]---

Worse yet, reading the error message (the filter again) it says that
there was no error, when there clearly was. The issue is that the
code that checks the input does not check for balanced ops. That is,
having an op between a closed parenthesis and the next token.

This would only cause a warning, and fail out before doing any real
harm, but it should still not caues a warning, and the error reported
should work:

 # cd /sys/kernel/debug/tracing
 # echo "((dev==1)blocks==2)" > events/ext4/ext4_truncate_exit/filter
-bash: echo: write error: Invalid argument
 # cat events/ext4/ext4_truncate_exit/filter
((dev==1)blocks==2)
^
parse_error: Meaningless filter expression

And give no kernel warning.

Link: http://lkml.kernel.org/r/20150615175025.7e809215@gandalf.local.home

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: stable@vger.kernel.org # 2.6.31+
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-06-17 07:13:30 -04:00
5bd2c2867f Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull lockdep fix from Ingo Molnar:
 "A lockdep/modules unload race fix that can oops"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Fix a race between /proc/lock_stat and module unload
2015-06-14 14:03:11 -10:00
cff100f5d7 Merge tag 'trace-rb-bm-fix-4.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ring buffer benchmark buglet fix from Steven Rostedt:
 "Wang Long fixed a minor bug in the module parameter for the ring
  buffer benchmark, where the produce_fifo was being ignored and the
  producer thread's priority was being set with the consumer_fifo
  parameter"

* tag 'trace-rb-bm-fix-4.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer-benchmark: Fix the wrong sched_priority of producer
2015-06-11 14:00:10 -07:00
1080293239 ring-buffer-benchmark: Fix the wrong sched_priority of producer
The producer should be used producer_fifo as its sched_priority,
so correct it.

Link: http://lkml.kernel.org/r/1433923957-67842-1-git-send-email-long.wanglong@huawei.com

Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-06-11 09:27:58 -04:00
8e76d4eecf sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings
Jovi Zhangwei reported the following problem

  Below kernel vm bug can be triggered by tcpdump which mmaped a lot of pages
  with GFP_COMP flag.

  [Mon May 25 05:29:33 2015] page:ffffea0015414000 count:66 mapcount:1 mapping:          (null) index:0x0
  [Mon May 25 05:29:33 2015] flags: 0x20047580004000(head)
  [Mon May 25 05:29:33 2015] page dumped because: VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page))
  [Mon May 25 05:29:33 2015] ------------[ cut here ]------------
  [Mon May 25 05:29:33 2015] kernel BUG at mm/migrate.c:1661!
  [Mon May 25 05:29:33 2015] invalid opcode: 0000 [#1] SMP

In this case it was triggered by running tcpdump but it's not necessary
reproducible on all systems.

  sudo tcpdump -i bond0.100 'tcp port 4242' -c 100000000000 -w 4242.pcap

Compound pages cannot be migrated and it was not expected that such pages
be marked for NUMA balancing.  This did not take into account that drivers
such as net/packet/af_packet.c may insert compound pages into userspace
with vm_insert_page.  This patch tells the NUMA balancing protection
scanner to skip all VM_MIXEDMAP mappings which avoids the possibility that
compound pages are marked for migration.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Jovi Zhangwei <jovi@cloudflare.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-10 16:43:43 -07:00
cee34d88ca lockdep: Fix a race between /proc/lock_stat and module unload
The lock_class iteration of /proc/lock_stat is not serialized against
the lockdep_free_key_range() call from module unload.

Therefore it can happen that we find a class of which ->name/->key are
no longer valid.

There is a further bug in zap_class() that left ->name dangling. Cure
this. Use RCU_INIT_POINTER() because NULL.

Since lockdep_free_key_range() is rcu_sched serialized, we can read
both ->name and ->key under rcu_read_lock_sched() (preempt-disable)
and be assured that if we observe a !NULL value it stays safe to use
for as long as we hold that lock.

If we observe both NULL, skip the entry.

Reported-by: Jerome Marchand <jmarchan@redhat.com>
Tested-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150602105013.GS3644@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-07 15:46:30 +02:00
a0e9c6efa5 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "The biggest chunk of the changes are two regression fixes: a HT
  workaround fix and an event-group scheduling fix.  It's been verified
  with 5 days of fuzzer testing.

  Other fixes:

   - eBPF fix
   - a BIOS breakage detection fix
   - PMU driver fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/pt: Fix a refactoring bug
  perf/x86: Tweak broken BIOS rules during check_hw_exists()
  perf/x86/intel/pt: Untangle pt_buffer_reset_markers()
  perf: Disallow sparse AUX allocations for non-SG PMUs in overwrite mode
  perf/x86: Improve HT workaround GP counter constraint
  perf/x86: Fix event/group validation
  perf: Fix race in BPF program unregister
2015-06-05 10:00:53 -07:00
9b7b819ca1 compat: cleanup coding in compat_get_bitmap() and compat_put_bitmap()
In the functions compat_get_bitmap() and compat_put_bitmap() the
variable nr_compat_longs stores how many compat_ulong_t words should be
copied in a loop.

The copy loop itself is this:
  if (nr_compat_longs-- > 0) {
      if (__get_user(um, umask)) return -EFAULT;
  } else {
      um = 0;
  }

Since nr_compat_longs gets unconditionally decremented in each loop and
since it's type is unsigned this could theoretically lead to out of
bounds accesses to userspace if nr_compat_longs wraps around to
(unsigned)(-1).

Although the callers currently do not trigger out-of-bounds accesses, we
should better implement the loop in a safe way to completely avoid such
warp-arounds.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2015-06-04 23:57:18 +02:00
6e49ba1bb1 Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull fixes for cpumask and modules from Rusty Russell:
 "** NOW WITH TESTING! **

  Two fixes which got lost in my recent distraction.  One is a weird
  cpumask function which needed to be rewritten, the other is a module
  bug which is cc:stable"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  cpumask_set_cpu_local_first => cpumask_local_spread, lament
  module: Call module notifier on failure after complete_formation()
2015-05-29 11:24:28 -07:00
aa319bcd36 perf: Disallow sparse AUX allocations for non-SG PMUs in overwrite mode
PMUs that don't support hardware scatter tables require big contiguous
chunks of memory and a PMI to switch between them. However, in overwrite
using a PMI for this purpose adds extra overhead that the users would
like to avoid. Thus, in overwrite mode for such PMUs we can only allow
one contiguous chunk for the entire requested buffer.

This patch changes the behavior accordingly, so that if the buddy allocator
fails to come up with a single high-order chunk for the entire requested
buffer, the allocation will fail.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1432308626-18845-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27 09:16:20 +02:00
dead9f29dd perf: Fix race in BPF program unregister
there is a race between perf_event_free_bpf_prog() and free_trace_kprobe():

	__free_event()
	  event->destroy(event)
	    tp_perf_event_destroy()
	      perf_trace_destroy()
		perf_trace_event_unreg()

which is dropping event->tp_event->perf_refcount and allows to proceed in:

	unregister_trace_kprobe()
	  unregister_kprobe_event()
	      trace_remove_event_call()
		    probe_remove_event_call()
	free_trace_kprobe()

while __free_event does:

	call_rcu(&event->rcu_head, free_event_rcu);
	  free_event_rcu()
	    perf_event_free_bpf_prog()

To fix the race simply move perf_event_free_bpf_prog() before
event->destroy(), since event->tp_event is still valid at that point.

Note, perf_trace_destroy() is not racing with trace_remove_event_call()
since they both grab event_mutex.

Reported-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lizefan@huawei.com
Cc: pi3orama@163.com
Fixes: 2541517c32 ("tracing, perf: Implement BPF programs attached to kprobes")
Link: http://lkml.kernel.org/r/1431717321-28772-1-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27 08:46:15 +02:00
c5db6a3bde Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "One more fix from the timer departement:

    - Handle division of negative nanosecond values proper on 32bit.

      A recent cleanup wrecked the sign handling of the dividend and
      dropped the check for negative divisors"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ktime: Fix ktime_divns to do signed division
2015-05-23 17:57:40 -07:00
1c8df7bd48 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Three small fixes that have been picked up the last few weeks.
  Specifically:

   - Fix a memory corruption issue in NVMe with malignant user
     constructed request.  From Christoph.

   - Kill (now) unused blk_queue_bio(), dm was changed to not need this
     anymore.  From Mike Snitzer.

   - Always use blk_schedule_flush_plug() from the io_schedule() path
     when flushing a plug, fixing a !TASK_RUNNING warning with md.  From
     Shaohua"

* 'for-linus' of git://git.kernel.dk/linux-block:
  sched: always use blk_schedule_flush_plug in io_schedule_out
  nvme: fix kernel memory corruption with short INQUIRY buffers
  block: remove export for blk_queue_bio
2015-05-22 15:15:30 -07:00
1173ff09b9 watchdog: fix double lock in watchdog_nmi_enable_all
Commit ab992dc38f ("watchdog: Fix merge 'conflict'") has introduced an
obvious deadlock because of a typo.  watchdog_proc_mutex should be
unlocked on exit.

Thanks to Miroslav Benes who was staring at the code with me and noticed
this.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Duh-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-19 10:57:03 -07:00
10d784eae2 sched: always use blk_schedule_flush_plug in io_schedule_out
block plug callback could sleep, so we introduce a parameter
'from_schedule' and corresponding drivers can use it to destinguish a
schedule plug flush or a plug finish. Unfortunately io_schedule_out
still uses blk_flush_plug(). This causes below output (Note, I added a
might_sleep() in raid1_unplug to make it trigger faster, but the whole
thing doesn't matter if I add might_sleep). In raid1/10, this can cause
deadlock.

This patch makes io_schedule_out always uses blk_schedule_flush_plug.
This should only impact drivers (as far as I know, raid 1/10) which are
sensitive to the 'from_schedule' parameter.

[  370.817949] ------------[ cut here ]------------
[  370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
[  370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
[  370.817971] Modules linked in: raid1
[  370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G        W       4.0.0+ #361
[  370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[  370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
[  370.817985]  ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
[  370.817988]  ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
[  370.817990]  ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
[  370.817993] Call Trace:
[  370.817999]  [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
[  370.818002]  [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
[  370.818004]  [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
[  370.818006]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
[  370.818008]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
[  370.818010]  [<ffffffff810776ef>] __might_sleep+0x7f/0x90
[  370.818014]  [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
[  370.818024]  [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
[  370.818028]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
[  370.818031]  [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
[  370.818033]  [<ffffffff819e3586>] bit_wait_io+0x36/0x50
[  370.818034]  [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
[  370.818041]  [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
[  370.818043]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
[  370.818045]  [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
[  370.818047]  [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
[  370.818050]  [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
[  370.818053]  [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
[  370.818058]  [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
[  370.818062]  [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
[  370.818064]  [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
[  370.818066]  [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
[  370.818068]  [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
[  370.818070]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
[  370.818072]  [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
[  370.818074]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
[  370.818076]  [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
[  370.818079]  [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
[  370.818081]  [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
[  370.818085]  [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
[  370.818088]  [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
[  370.818094]  [<ffffffff81149691>] do_writepages+0x21/0x50
[  370.818097]  [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
[  370.818099]  [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
[  370.818103]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
[  370.818105]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
[  370.818107]  [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
[  370.818109]  [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
[  370.818111]  [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
[  370.818116]  [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
[  370.818117]  [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
[  370.818119]  [<ffffffff8106c09b>] worker_thread+0x11b/0x470
[  370.818121]  [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
[  370.818124]  [<ffffffff81071868>] kthread+0xf8/0x110
[  370.818126]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
[  370.818129]  [<ffffffff819e9322>] ret_from_fork+0x42/0x70
[  370.818131]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
[  370.818132] ---[ end trace 7b4deb71e68b6605 ]---

V2: don't change ->in_iowait

Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-18 16:06:41 -06:00
ab992dc38f watchdog: Fix merge 'conflict'
Two watchdog changes that came through different trees had a non
conflicting conflict, that is, one changed the semantics of a variable
but no actual code conflict happened. So the merge appeared fine, but
the resulting code did not behave as expected.

Commit 195daf665a ("watchdog: enable the new user interface of the
watchdog mechanism") changes the semantics of watchdog_user_enabled,
which thereafter is only used by the functions introduced by
b3738d2932 ("watchdog: Add watchdog enable/disable all functions").

There further appears to be a distinct lack of serialization between
setting and using watchdog_enabled, so perhaps we should wrap the
{en,dis}able_all() things in watchdog_proc_mutex.

This patch fixes a s2r failure reported by Michal; which I cannot
readily explain. But this does make the code internally consistent
again.

Reported-and-tested-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-18 10:08:29 -07:00
14db1e8dc0 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two fixes: a suspend/resume related regression fix, and an RT priority
  boosting fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix regression in cpuset_cpu_inactive() for suspend
  sched: Handle priority boosted tasks proper in setscheduler()
2015-05-15 12:42:33 -07:00
ef4a293a44 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "Mostly tooling fixes, but also a lockdep annotation fix, a PMU event
  list fix and a new model addition"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tools/liblockdep: Fix compilation error
  tools/liblockdep: Fix linker error in case of cross compile
  perf tools: Use getconf to determine number of online CPUs
  tools: Fix tools/vm build
  perf/x86/rapl: Enable Broadwell-U RAPL support
  perf/x86/intel: Fix SLM cache event list
  perf: Annotate inherited event ctx->mutex recursion
2015-05-15 12:38:21 -07:00
f7bcb70eba ktime: Fix ktime_divns to do signed division
It was noted that the 32bit implementation of ktime_divns()
was doing unsigned division and didn't properly handle
negative values.

And when a ktime helper was changed to utilize
ktime_divns, it caused a regression on some IR blasters.
See the following bugzilla for details:
  https://bugzilla.redhat.com/show_bug.cgi?id=1200353

This patch fixes the problem in ktime_divns by checking
and preserving the sign bit, and then reapplying it if
appropriate after the division, it also changes the return
type to a s64 to make it more obvious this is expected.

Nicolas also pointed out that negative dividers would
cause infinite loops on 32bit systems, negative dividers
is unlikely for users of this function, but out of caution
this patch adds checks for negative dividers for both
32-bit (BUG_ON) and 64-bit(WARN_ON) versions to make sure
no such use cases creep in.

[ tglx: Hand an u64 to do_div() to avoid the compiler warning ]

Fixes: 166afb6451 'ktime: Sanitize ktime_to_us/ms conversion'
Reported-and-tested-by: Trevor Cordes <trevor@tecnopolis.ca>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1431118043-23452-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-05-13 10:19:35 +02:00
9d88f22a81 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "Two patches from the irq departement:

   - a simple fix to make dummy_irq_chip usable for wakeup scenarios

   - removal of the gic arch_extn hackery.  Now that all users are
     converted we really want to get rid of the interface so people wont
     come up with new use cases"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip: gic: Drop support for gic_arch_extn
  genirq: Set IRQCHIP_SKIP_SET_WAKE flag for dummy_irq_chip
2015-05-09 14:59:05 -07:00
95f3b1f4b1 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
 "A simple fix to actually shut down a detached device instead of
  keeping it active"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clockevents: Shutdown detached clockevent device
2015-05-09 14:57:49 -07:00
26b293e854 Merge tag 'trace-fixes-v4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
 "The newly added ftrace_print_array_seq() function had a bug in it.
  Luckily, the only user of it didn't make the 4.1 merge window.

  But the helper function should be fixed before 4.2 when the users
  start coming in"

* tag 'trace-fixes-v4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Make ftrace_print_array_seq compute buf_len
2015-05-08 18:22:05 -07:00
37815bf866 module: Call module notifier on failure after complete_formation()
The module notifier call chain for MODULE_STATE_COMING was moved up before
the parsing of args, into the complete_formation() call. But if the module failed
to load after that, the notifier call chain for MODULE_STATE_GOING was
never called and that prevented the users of those call chains from
cleaning up anything that was allocated.

Link: http://lkml.kernel.org/r/554C52B9.9060700@gmail.com

Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com>
Fixes: 4982223e51 "module: set nx before marking module MODULE_STATE_COMING"
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2015-05-09 03:29:24 +09:30
8b10c5e2b5 perf: Annotate inherited event ctx->mutex recursion
While fuzzing Sasha tripped over another ctx->mutex recursion lockdep
splat. Annotate this.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-08 11:59:40 +02:00