aboutsummaryrefslogtreecommitdiffstats
path: root/block/blk-throttle.c (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2017-09-25fs: Fix page cache inconsistency when mixing buffered and AIO DIOLukas Czerner3-21/+67
Currently when mixing buffered reads and asynchronous direct writes it is possible to end up with the situation where we have stale data in the page cache while the new data is already written to disk. This is permanent until the affected pages are flushed away. Despite the fact that mixing buffered and direct IO is ill-advised it does pose a thread for a data integrity, is unexpected and should be fixed. Fix this by deferring completion of asynchronous direct writes to a process context in the case that there are mapped pages to be found in the inode. Later before the completion in dio_complete() invalidate the pages in question. This ensures that after the completion the pages in the written area are either unmapped, or populated with up-to-date data. Also do the same for the iomap case which uses iomap_dio_complete() instead. This has a side effect of deferring the completion to a process context for every AIO DIO that happens on inode that has pages mapped. However since the consensus is that this is ill-advised practice the performance implication should not be a problem. This was based on proposal from Jeff Moyer, thanks! Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvmet: implement valid sqhd values in completionsJames Smart3-6/+12
To support sqhd, for initiators that are following the spec and paying attention to sqhd vs their sqtail values: - add sqhd to struct nvmet_sq - initialize sqhd to 0 in nvmet_sq_setup - rather than propagate the 0's-based qsize value from the connect message which requires a +1 in every sqhd update, and as nothing else references it, convert to 1's-based value in nvmt_sq/cq_setup() calls. - validate connect message sqsize being non-zero per spec. - updated assign sqhd for every completion that goes back. Also remove handling the NULL sq case in __nvmet_req_complete, as it can't happen with the current code. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvme-fabrics: Allow 0 as KATO valueGuilherme G. Piccoli1-9/+9
Currently, driver code allows user to set 0 as KATO (Keep Alive TimeOut), but this is not being respected. This patch enforces the expected behavior. Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvme: allow timed-out ios to retryJames Smart1-2/+0
Currently the nvme_req_needs_retry() applies several checks to see if a retry is allowed. On of those is whether the current time has exceeded the start time of the io plus the timeout length. This check, if an io times out, means there is never a retry allowed for the io. Which means applications see the io failure. Remove this check and allow the io to timeout, like it does on other protocols, and retries to be made. On the FC transport, a frame can be lost for an individual io, and there may be no other errors that escalate for the connection/association. The io will timeout, which causes the transport to escalate into creating a new association, but the io that timed out, due to this retry logic, has already failed back to the application and things are hosed. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvme: stop aer posting if controller state not liveJames Smart1-2/+3
If an nvme async_event command completes, in most cases, a new async event is posted. However, if the controller enters a resetting or reconnecting state, there is nothing to block the scheduled work element from posting the async event again. Nor are there calls from the transport to stop async events when an association dies. In the case of FC, where the association is torn down, the aer must be aborted on the FC link and completes through the normal job completion path. Thus the terminated async event ends up being rescheduled even though the controller isn't in a valid state for the aer, and the reposting gets the transport into a partially torn down data structure. It's possible to hit the scenario on rdma, although much less likely due to an aer completing right as the association is terminated and as the association teardown reclaims the blk requests via nvme_cancel_request() so its immediate, not a link-related action like on FC. Fix by putting controller state checks in both the async event completion routine where it schedules the async event and in the async event work routine before it calls into the transport. It's effectively a "stop_async_events()" behavior. The transport, when it creates a new association with the subsystem will transition the state back to live and is already restarting the async event posting. Signed-off-by: James Smart <james.smart@broadcom.com> [hch: remove taking a lock over reading the controller state] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvme-pci: Print invalid SGL only onceKeith Busch1-12/+18
The WARN_ONCE macro returns true if the condition is true, not if the warn was raised, so we're printing the scatter list every time it's invalid. This is excessive and makes debugging harder, so this patch prints it just once. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvme-pci: initialize queue memory before interruptsKeith Busch1-2/+2
A spurious interrupt before the nvme driver has initialized the completion queue may inadvertently cause the driver to believe it has a completion to process. This may result in a NULL dereference since the nvmeq's tags are not set at this point. The patch initializes the host's CQ memory so that a spurious interrupt isn't mistaken for a real completion. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvmet-fc: fix failing max io queue connectionsJames Smart1-3/+3
fc transport is treating NVMET_NR_QUEUES as maximum queue count, e.g. admin queue plus NVMET_NR_QUEUES-1 io queues. But NVMET_NR_QUEUES is the number of io queues, so maximum queue count is really NVMET_NR_QUEUES+1. Fix the handling in the target fc transport Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvme-fc: use transport-specific sgl formatJames Smart1-6/+7
Sync with NVM Express spec change and FC-NVME 1.18. FC transport sets SGL type to Transport SGL Data Block Descriptor and subtype to transport-specific value 0x0A. Removed the warn-on's on the PRP fields. They are unneeded. They were to check for values from the upper layer that weren't set right, and for the most part were fine. But, with Async events, which reuse the same structure and 2nd time issued the SGL overlay converted them to the Transport SGL values - the warn-on's were errantly firing. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvme: add transport SGL definitionsJames Smart1-0/+6
Add transport SGL defintions from NVMe TP 4008, required for the final NVMe-FC standard. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvme.h: remove FC transport-specific error valuesJames Smart1-13/+0
The NVM express group recinded the reserved range for the transport. Remove the FC-centric values that had been defined. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25qla2xxx: remove use of FC-specific error codesJames Smart1-1/+1
The qla2xxx driver uses the FC-specific error when it needed to return an error to the FC-NVME transport. Convert to use a generic value instead. Signed-off-by: James Smart <james.smart@broadcom.com> Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25lpfc: remove use of FC-specific error codesJames Smart1-1/+1
The lpfc driver uses the FC-specific error when it needed to return an error to the FC-NVME transport. Convert to use a generic value instead. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvmet-fcloop: remove use of FC-specific error codesJames Smart1-1/+1
The FC-NVME transport loopback test module used the FC-specific error codes in cases where it emulated a transport abort case. Instead of using the FC-specific values, now use a generic value (NVME_SC_INTERNAL). Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvmet-fc: remove use of FC-specific error codesJames Smart1-6/+3
The FC-NVME target transport used the FC-specific error codes in return codes when the transport or lldd failed. Instead of using the FC-specific values, now use a generic value (NVME_SC_INTERNAL). Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nvme-fc: remove use of FC-specific error codesJames Smart1-4/+4
The FC-NVME transport used the FC-specific error codes in cases where it had to fabricate an error to go back up stack. Instead of using the FC-specific values, now use a generic value (NVME_SC_INTERNAL). Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25loop: remove union of use_aio and ref in struct loop_cmdOmar Sandoval1-4/+2
When the request is completed, lo_complete_rq() checks cmd->use_aio. However, if this is in fact an aio request, cmd->use_aio will have already been reused as cmd->ref by lo_rw_aio*. Fix it by not using a union. On x86_64, there's a hole after the union anyways, so this doesn't make struct loop_cmd any bigger. Fixes: 92d773324b7e ("block/loop: fix use after free") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25blktrace: Fix potential deadlock between delete & sysfs opsWaiman Long3-6/+16
The lockdep code had reported the following unsafe locking scenario: CPU0 CPU1 ---- ---- lock(s_active#228); lock(&bdev->bd_mutex/1); lock(s_active#228); lock(&bdev->bd_mutex); *** DEADLOCK *** The deadlock may happen when one task (CPU1) is trying to delete a partition in a block device and another task (CPU0) is accessing tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that partition. The s_active isn't an actual lock. It is a reference count (kn->count) on the sysfs (kernfs) file. Removal of a sysfs file, however, require a wait until all the references are gone. The reference count is treated like a rwsem using lockdep instrumentation code. The fact that a thread is in the sysfs callback method or in the ioctl call means there is a reference to the opended sysfs or device file. That should prevent the underlying block structure from being removed. Instead of using bd_mutex in the block_device structure, a new blk_trace_mutex is now added to the request_queue structure to protect access to the blk_trace structure. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Fix typo in patch subject line, and prune a comment detailing how the code used to work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25nbd: ignore non-nbd ioctl'sJosef Bacik1-0/+6
In testing we noticed that nbd would spew if you ran a fio job against the raw device itself. This is because fio calls a block device specific ioctl, however the block layer will first pass this back to the driver ioctl handler in case the driver wants to do something special. Since the device was setup using netlink this caused us to spew every time fio called this ioctl. Since we don't have special handling, just error out for any non-nbd specific ioctl's that come in. This fixes the spew. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25bsg-lib: don't free job in bsg_prepare_jobChristoph Hellwig1-1/+0
The job structure is allocated as part of the request, so we should not free it in the error path of bsg_prepare_job. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25brd: fix overflow in __brd_direct_accessMikulas Patocka1-1/+1
The code in __brd_direct_access multiplies the pgoff variable by page size and divides it by 512. It can cause overflow on 32-bit architectures. The overflow happens if we create ramdisk larger than 4G and use it as a sparse device. This patch replaces multiplication and division with multiplication by the number of sectors per page. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 1647b9b959c7 ("brd: add dax_operations support") Cc: stable@vger.kernel.org # 4.12+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-19KVM: VMX: remove WARN_ON_ONCE in kvm_vcpu_trigger_posted_interruptHaozhong Zhang1-12/+21
WARN_ON_ONCE(pi_test_sn(&vmx->pi_desc)) in kvm_vcpu_trigger_posted_interrupt() intends to detect the violation of invariant that VT-d PI notification event is not suppressed when vcpu is in the guest mode. Because the two checks for the target vcpu mode and the target suppress field cannot be performed atomically, the target vcpu mode may change in between. If that does happen, WARN_ON_ONCE() here may raise false alarms. As the previous patch fixed the real invariant breaker, remove this WARN_ON_ONCE() to avoid false alarms, and document the allowed cases instead. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reported-by: "Ramamurthy, Venkatesh" <venkatesh.ramamurthy@intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-09-19KVM: VMX: do not change SN bit in vmx_update_pi_irte()Haozhong Zhang1-5/+1
In kvm_vcpu_trigger_posted_interrupt() and pi_pre_block(), KVM assumes that PI notification events should not be suppressed when the target vCPU is not blocked. vmx_update_pi_irte() sets the SN field before changing an interrupt from posting to remapping, but it does not check the vCPU mode. Therefore, the change of SN field may break above the assumption. Besides, I don't see reasons to suppress notification events here, so remove the changes of SN field to avoid race condition. Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reported-by: "Ramamurthy, Venkatesh" <venkatesh.ramamurthy@intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Fixes: 28b835d60fcc ("KVM: Update Posted-Interrupts Descriptor when vCPU is preempted") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-09-19KVM: x86: Fix the NULL pointer parameter in check_cr_write()Yu Zhang1-3/+5
Routine check_cr_write() will trigger emulator_get_cpuid()-> kvm_cpuid() to get maxphyaddr, and NULL is passed as values for ebx/ecx/edx. This is problematic because kvm_cpuid() will dereference these pointers. Fixes: d1cd3ce90044 ("KVM: MMU: check guest CR3 reserved bits based on its physical address width.") Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-09-19Revert "KVM: Don't accept obviously wrong gsi values via KVM_IRQFD"Jan H. Schönherr1-2/+0
This reverts commit 36ae3c0a36b7456432fedce38ae2f7bd3e01a563. The commit broke compilation on !CONFIG_HAVE_KVM_IRQ_ROUTING. Also, there may be cases with CONFIG_HAVE_KVM_IRQ_ROUTING, where larger gsi values make sense. As the commit was meant as an early indicator to user space that something is wrong, reverting just restores the previous behavior where overly large values are ignored when encountered (without any direct feedback). Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-09-18fcntl: Don't set si_code to SI_SIGIO when sig == SIGPOLLEric W. Biederman1-1/+1
When fixing things to avoid ambiguous cases I had a thinko and included SIGPOLL/SIGIO in with all of the other signals that have signal specific si_codes. Which is completely wrong. Fix that. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-09-17Update version of cifs moduleSteve French1-1/+1
Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-17cifs: hide unused functionsArnd Bergmann1-0/+2
The newly added SMB2+ attribute support causes unused function warnings when CONFIG_CIFS_XATTR is disabled: fs/cifs/smb2ops.c:563:1: error: 'smb2_set_ea' defined but not used [-Werror=unused-function] smb2_set_ea(const unsigned int xid, struct cifs_tcon *tcon, fs/cifs/smb2ops.c:513:1: error: 'smb2_query_eas' defined but not used [-Werror=unused-function] smb2_query_eas(const unsigned int xid, struct cifs_tcon *tcon, This adds another #ifdef around the affected functions. Fixes: 5517554e4313 ("cifs: Add support for writing attributes on SMB2+") Fixes: 95907fea4fd8 ("cifs: Add support for reading attributes on SMB2+") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Steve French <smfrench@gmail.com>
2017-09-17SMB3: Add support for multidialect negotiate (SMB2.1 and later)Steve French5-18/+139
With the need to discourage use of less secure dialect, SMB1 (CIFS), we temporarily upgraded the dialect to SMB3 in 4.13, but since there are various servers which only support SMB2.1 (2.1 is more secure than CIFS/SMB1) but not optimal for a default dialect - add support for multidialect negotiation. cifs.ko will now request SMB2.1 or later (ie SMB2.1 or SMB3.0, SMB3.02) and the server will pick the latest most secure one it can support. In addition since we are sending multidialect negotiate, add support for secure negotiate to validate that a man in the middle didn't downgrade us. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org> # 4.13+
2017-09-17CIFS/SMB3: Update documentation to reflect SMB3 and various changesSteve French4-91/+91
Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-16Linux 4.14-rc1Linus Torvalds1-2/+2
2017-09-16genirq: Fix cpumask check in __irq_startup_managed()Thomas Gleixner1-1/+1
The result of cpumask_any_and() is invalid when result greater or equal nr_cpu_ids. The current check is checking for greater only. Fix it. Fixes: 761ea388e8c4 ("genirq: Handle managed irqs gracefully in irq_startup()") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Yu <yu.c.chen@intel.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Alok Kataria <akataria@vmware.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: stable@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Len Brown <lenb@kernel.org> Link: http://lkml.kernel.org/r/20170913213152.272283444@linutronix.de
2017-09-16firmware: Restore support for built-in firmwareMarkus Trippelsdorf4-5/+71
Commit 5620a0d1aac ("firmware: delete in-kernel firmware") removed the entire firmware directory. Unfortunately it thereby also removed the support for built-in firmware. This restores the ability to build firmware directly into the kernel by pruning the original Makefile to the necessary minimum. The default for EXTRA_FIRMWARE_DIR is now the standard directory /lib/firmware/. Fixes: 5620a0d1aac ("firmware: delete in-kernel firmware") Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de> Acked-by: Greg K-H <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-16mlxsw: spectrum_router: Only handle IPv4 and IPv6 eventsIdo Schimmel1-1/+2
The driver doesn't support events from address families other than IPv4 and IPv6, so ignore them. Otherwise, we risk queueing a work item before it's initialized. This can happen in case a VRF is configured when MROUTE_MULTIPLE_TABLES is enabled, as the VRF driver will try to add an l3mdev rule for the IPMR family. Fixes: 65e65ec137f4 ("mlxsw: spectrum_router: Don't ignore IPv6 notifications") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Andreas Rammhold <andreas@rammhold.de> Reported-by: Florian Klink <flokli@flokli.de> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-16Documentation: link in networking docsPavel Machek1-1/+1
Fix link in filter.txt. Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-16tcp: fix data delivery rateEric Dumazet1-4/+3
Now skb->mstamp_skb is updated later, we also need to call tcp_rate_skb_sent() after the update is done. Fixes: 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15bpf/verifier: reject BPF_ALU64|BPF_ENDEdward Cree2-1/+18
Neither ___bpf_prog_run nor the JITs accept it. Also adds a new test case. Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Edward Cree <ecree@solarflare.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15sctp: do not mark sk dumped when inet_sctp_diag_fill returns errXin Long1-1/+0
sctp_diag would not actually dump out sk/asoc if inet_sctp_diag_fill returns err, in which case it shouldn't mark sk dumped by setting cb->args[3] as 1 in sctp_sock_dump(). Otherwise, it could cause some asocs to have no parent's sk dumped in 'ss --sctp'. So this patch is to not set cb->args[3] when inet_sctp_diag_fill() returns err in sctp_sock_dump(). Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15sctp: fix an use-after-free issue in sctp_sock_dumpXin Long3-39/+36
Commit 86fdb3448cc1 ("sctp: ensure ep is not destroyed before doing the dump") tried to fix an use-after-free issue by checking !sctp_sk(sk)->ep with holding sock and sock lock. But Paolo noticed that endpoint could be destroyed in sctp_rcv without sock lock protection. It means the use-after-free issue still could be triggered when sctp_rcv put and destroy ep after sctp_sock_dump checks !ep, although it's pretty hard to reproduce. I could reproduce it by mdelay in sctp_rcv while msleep in sctp_close and sctp_sock_dump long time. This patch is to add another param cb_done to sctp_for_each_transport and dump ep->assocs with holding tsp after jumping out of transport's traversal in it to avoid this issue. It can also improve sctp diag dump to make it run faster, as no need to save sk into cb->args[5] and keep calling sctp_for_each_transport any more. This patch is also to use int * instead of int for the pos argument in sctp_for_each_transport, which could make postion increment only in sctp_for_each_transport and no need to keep changing cb->args[2] in sctp_sock_filter and sctp_sock_dump any more. Fixes: 86fdb3448cc1 ("sctp: ensure ep is not destroyed before doing the dump") Reported-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15netvsc: increase default receive buffer sizeStephen Hemminger1-1/+1
The default receive buffer size was reduced by recent change to a value which was appropriate for 10G and Windows Server 2016. But the value is too small for full performance with 40G on Azure. Increase the default back to maximum supported by host. Fixes: 8b5327975ae1 ("netvsc: allow controlling send/recv buffer size") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15tcp: update skb->skb_mstamp more carefullyEric Dumazet1-7/+12
liujian reported a problem in TCP_USER_TIMEOUT processing with a patch in tcp_probe_timer() : https://www.spinics.net/lists/netdev/msg454496.html After investigations, the root cause of the problem is that we update skb->skb_mstamp of skbs in write queue, even if the attempt to send a clone or copy of it failed. One reason being a routing problem. This patch prevents this, solving liujian issue. It also removes a potential RTT miscalculation, since __tcp_retransmit_skb() is not OR-ing TCP_SKB_CB(skb)->sacked with TCPCB_EVER_RETRANS if a failure happens, but skb->skb_mstamp has been changed. A future ACK would then lead to a very small RTT sample and min_rtt would then be lowered to this too small value. Tested: # cat user_timeout.pkt --local_ip=192.168.102.64 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 `ifconfig tun0 192.168.102.64/16; ip ro add 192.0.2.1 dev tun0` +0 < S 0:0(0) win 0 <mss 1460> +0 > S. 0:0(0) ack 1 <mss 1460> +.1 < . 1:1(0) ack 1 win 65530 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0 +0 write(4, ..., 24) = 24 +0 > P. 1:25(24) ack 1 win 29200 +.1 < . 1:1(0) ack 25 win 65530 //change the ipaddress +1 `ifconfig tun0 192.168.0.10/16` +1 write(4, ..., 24) = 24 +1 write(4, ..., 24) = 24 +1 write(4, ..., 24) = 24 +1 write(4, ..., 24) = 24 +0 `ifconfig tun0 192.168.102.64/16` +0 < . 1:2(1) ack 25 win 65530 +0 `ifconfig tun0 192.168.0.10/16` +3 write(4, ..., 24) = -1 # ./packetdrill user_timeout.pkt Signed-off-by: Eric Dumazet <edumazet@googl.com> Reported-by: liujian <liujian56@huawei.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15fs/proc: Report eip/esp in /prod/PID/stat for coredumpingJohn Ogness1-0/+9
Commit 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat") stopped reporting eip/esp because it is racy and dangerous for executing tasks. The comment adds: As far as I know, there are no use programs that make any material use of these fields, so just get rid of them. However, existing userspace core-dump-handler applications (for example, minicoredumper) are using these fields since they provide an excellent cross-platform interface to these valuable pointers. So that commit introduced a user space visible regression. Partially revert the change and make the readout possible for tasks with the proper permissions and only if the target task has the PF_DUMPCORE flag set. Fixes: 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in> /proc/PID/stat") Reported-by: Marco Felsch <marco.felsch@preh.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Tycho Andersen <tycho.andersen@canonical.com> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: stable@vger.kernel.org Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Borislav Petkov <bp@alien8.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Linux API <linux-api@vger.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/87poatfwg6.fsf@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-09-15net: ipv4: fix l3slave check for index returned in IP_PKTINFODavid Ahern1-2/+6
rt_iif is only set to the actual egress device for the output path. The recent change to consider the l3slave flag when returning IP_PKTINFO works for local traffic (the correct device index is returned), but it broke the more typical use case of packets received from a remote host always returning the VRF index rather than the original ingress device. Update the fixup to consider l3slave and rt_iif actually getting set. Fixes: 1dfa76390bf05 ("net: ipv4: add check for l3slave for index returned in IP_PKTINFO") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15net: smsc911x: Quieten netif during suspendGeert Uytterhoeven1-1/+14
If the network interface is kept running during suspend, the net core may call net_device_ops.ndo_start_xmit() while the Ethernet device is still suspended, which may lead to a system crash. E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is driven by a PM controlled clock. If the Ethernet registers are accessed while the clock is not running, the system will crash with an imprecise external abort. As this is a race condition with a small time window, it is not so easy to trigger at will. Using pm_test may increase your chances: # echo 0 > /sys/module/printk/parameters/console_suspend # echo platform > /sys/power/pm_test # echo mem > /sys/power/state To fix this, make sure the network interface is quietened during suspend. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15net: systemport: Fix 64-bit stats deadlockFlorian Fainelli1-3/+0
We can enter a deadlock situation because there is no sufficient protection when ndo_get_stats64() runs in process context to guard against RX or TX NAPI contexts running in softirq, this can lead to the following lockdep splat and actual deadlock was experienced as well with an iperf session in the background and a while loop doing ifconfig + ethtool. [ 5.780350] ================================ [ 5.784679] WARNING: inconsistent lock state [ 5.789011] 4.13.0-rc7-02179-g32fae27c725d #70 Not tainted [ 5.794561] -------------------------------- [ 5.798890] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 5.804971] swapper/0/0 [HC0[0]:SC1[1]:HE0:SE0] takes: [ 5.810175] (&syncp->seq#2){+.?...}, at: [<c0768a28>] bcm_sysport_tx_reclaim+0x30/0x54 [ 5.818327] {SOFTIRQ-ON-W} state was registered at: [ 5.823278] bcm_sysport_get_stats64+0x17c/0x258 [ 5.828053] dev_get_stats+0x38/0xac [ 5.831776] rtnl_fill_stats+0x30/0x118 [ 5.835761] rtnl_fill_ifinfo+0x538/0xe24 [ 5.839921] rtmsg_ifinfo_build_skb+0x6c/0xd8 [ 5.844430] rtmsg_ifinfo_event.part.5+0x14/0x44 [ 5.849201] rtmsg_ifinfo+0x20/0x28 [ 5.852837] register_netdevice+0x628/0x6b8 [ 5.857171] register_netdev+0x14/0x24 [ 5.861051] bcm_sysport_probe+0x30c/0x438 [ 5.865280] platform_drv_probe+0x50/0xb0 [ 5.869418] driver_probe_device+0x2e8/0x450 [ 5.873817] __driver_attach+0x104/0x120 [ 5.877871] bus_for_each_dev+0x7c/0xc0 [ 5.881834] bus_add_driver+0x1b0/0x270 [ 5.885797] driver_register+0x78/0xf4 [ 5.889675] do_one_initcall+0x54/0x190 [ 5.893646] kernel_init_freeable+0x144/0x1d0 [ 5.898135] kernel_init+0x8/0x110 [ 5.901665] ret_from_fork+0x14/0x2c [ 5.905363] irq event stamp: 24263 [ 5.908804] hardirqs last enabled at (24262): [<c08eecf0>] net_rx_action+0xc4/0x4e4 [ 5.916624] hardirqs last disabled at (24263): [<c0a7da00>] _raw_spin_lock_irqsave+0x1c/0x98 [ 5.925143] softirqs last enabled at (24258): [<c022a7fc>] irq_enter+0x84/0x98 [ 5.932524] softirqs last disabled at (24259): [<c022a918>] irq_exit+0x108/0x16c [ 5.939985] [ 5.939985] other info that might help us debug this: [ 5.946576] Possible unsafe locking scenario: [ 5.946576] [ 5.952556] CPU0 [ 5.955031] ---- [ 5.957506] lock(&syncp->seq#2); [ 5.960955] <Interrupt> [ 5.963604] lock(&syncp->seq#2); [ 5.967227] [ 5.967227] *** DEADLOCK *** [ 5.967227] [ 5.973222] 1 lock held by swapper/0/0: [ 5.977092] #0: (&(&ring->lock)->rlock){..-...}, at: [<c0768a18>] bcm_sysport_tx_reclaim+0x20/0x54 So just remove the u64_stats_update_begin()/end() pair in ndo_get_stats64() since it does not appear to be useful for anything. No inconsistency was observed with either ifconfig or ethtool, global TX counts equal the sum of per-queue TX counts on a 32-bit architecture. Fixes: 10377ba7673d ("net: systemport: Support 64bit statistics") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15net: vrf: avoid gcc-4.6 warningArnd Bergmann1-3/+3
When building an allmodconfig kernel with gcc-4.6, we get a rather odd warning: drivers/net/vrf.c: In function ‘vrf_ip6_input_dst’: drivers/net/vrf.c:964:3: error: initialized field with side-effects overwritten [-Werror] drivers/net/vrf.c:964:3: error: (near initialization for ‘fl6’) [-Werror] I have no idea what this warning is even trying to say, but it does seem like a false positive. Reordering the initialization in to match the structure definition gets rid of the warning, and might also avoid whatever gcc thinks is wrong here. Fixes: 9ff74384600a ("net: vrf: Handle ipv6 multicast and link-local addresses") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15qed: remove unnecessary call to memsetHimanshu Jha1-1/+0
call to memset to assign 0 value immediately after allocating memory with kzalloc is unnecesaary as kzalloc allocates the memory filled with 0 value. Semantic patch used to resolve this issue: @@ expression e,e2; constant c; statement S; @@ e = kzalloc(e2, c); if(e == NULL) S - memset(e, 0, e2); Signed-off-by: Himanshu Jha <himanshujha199640@gmail.com> Signed-off-by: Himanshu Jha <himanshujha199640@gmail.com> Acked-by: Sudarsana Kalluru <sudarsana.kalluru@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-15scsi: sg: fixup infoleak when using SG_GET_REQUEST_TABLEHannes Reinecke1-3/+2
When calling SG_GET_REQUEST_TABLE ioctl only a half-filled table is returned; the remaining part will then contain stale kernel memory information. This patch zeroes out the entire table to avoid this issue. Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2017-09-15scsi: sg: factor out sg_fill_request_table()Hannes Reinecke1-26/+35
Factor out sg_fill_request_table() for better readability. [mkp: typos, applied by hand] Signed-off-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2017-09-15scsi: sd: Remove unnecessary condition in sd_read_block_limits()Lukas Czerner1-2/+0
After series of changes around WRITE_SAME and UNMAP setup we ended up with leftover unnecessary condition. Remove it. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>