aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/accounting/psi.txt12
-rw-r--r--Documentation/bpf/btf.rst8
-rw-r--r--Documentation/devicetree/bindings/arm/cpus.yaml2
-rw-r--r--Documentation/devicetree/bindings/hwmon/adc128d818.txt4
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-iop3xx.txt (renamed from Documentation/devicetree/bindings/i2c/i2c-xscale.txt)0
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-mt65xx.txt (renamed from Documentation/devicetree/bindings/i2c/i2c-mtk.txt)0
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-stu300.txt (renamed from Documentation/devicetree/bindings/i2c/i2c-st-ddci2c.txt)0
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-sun6i-p2wi.txt (renamed from Documentation/devicetree/bindings/i2c/i2c-sunxi-p2wi.txt)0
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-wmt.txt (renamed from Documentation/devicetree/bindings/i2c/i2c-vt8500.txt)0
-rw-r--r--Documentation/devicetree/bindings/net/dsa/qca8k.txt73
-rw-r--r--Documentation/devicetree/bindings/serial/mtk-uart.txt1
-rw-r--r--Documentation/filesystems/mount_api.txt367
-rw-r--r--Documentation/i2c/busses/i2c-i8011
-rw-r--r--Documentation/lzo.txt8
-rw-r--r--Documentation/networking/bpf_flow_dissector.rst126
-rw-r--r--Documentation/networking/index.rst1
-rw-r--r--Documentation/networking/msg_zerocopy.rst2
-rw-r--r--Documentation/networking/netdev-FAQ.rst13
-rw-r--r--Documentation/networking/nf_flowtable.txt8
-rw-r--r--Documentation/networking/snmp_counter.rst12
-rw-r--r--Documentation/powerpc/DAWR-POWER9.txt32
-rw-r--r--Documentation/virtual/kvm/api.txt91
-rw-r--r--Documentation/virtual/kvm/devices/vm.txt3
-rw-r--r--Documentation/virtual/kvm/devices/xive.txt197
-rw-r--r--Documentation/virtual/kvm/mmu.txt11
25 files changed, 732 insertions, 240 deletions
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
index b8ca28b60215..7e71c9c1d8e9 100644
--- a/Documentation/accounting/psi.txt
+++ b/Documentation/accounting/psi.txt
@@ -56,12 +56,12 @@ situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.
-The ratios are tracked as recent trends over ten, sixty, and three
-hundred second windows, which gives insight into short term events as
-well as medium and long term trends. The total absolute stall time is
-tracked and exported as well, to allow detection of latency spikes
-which wouldn't necessarily make a dent in the time averages, or to
-average trends over custom time frames.
+The ratios (in %) are tracked as recent trends over ten, sixty, and
+three hundred second windows, which gives insight into short term events
+as well as medium and long term trends. The total absolute stall time
+(in us) is tracked and exported as well, to allow detection of latency
+spikes which wouldn't necessarily make a dent in the time averages,
+or to average trends over custom time frames.
Cgroup2 interface
=================
diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst
index 9a60a5d60e38..7313d354f20e 100644
--- a/Documentation/bpf/btf.rst
+++ b/Documentation/bpf/btf.rst
@@ -148,16 +148,16 @@ The ``btf_type.size * 8`` must be equal to or greater than ``BTF_INT_BITS()``
for the type. The maximum value of ``BTF_INT_BITS()`` is 128.
The ``BTF_INT_OFFSET()`` specifies the starting bit offset to calculate values
-for this int. For example, a bitfield struct member has: * btf member bit
-offset 100 from the start of the structure, * btf member pointing to an int
-type, * the int type has ``BTF_INT_OFFSET() = 2`` and ``BTF_INT_BITS() = 4``
+for this int. For example, a bitfield struct member has:
+ * btf member bit offset 100 from the start of the structure,
+ * btf member pointing to an int type,
+ * the int type has ``BTF_INT_OFFSET() = 2`` and ``BTF_INT_BITS() = 4``
Then in the struct memory layout, this member will occupy ``4`` bits starting
from bits ``100 + 2 = 102``.
Alternatively, the bitfield struct member can be the following to access the
same bits as the above:
-
* btf member bit offset 102,
* btf member pointing to an int type,
* the int type has ``BTF_INT_OFFSET() = 0`` and ``BTF_INT_BITS() = 4``
diff --git a/Documentation/devicetree/bindings/arm/cpus.yaml b/Documentation/devicetree/bindings/arm/cpus.yaml
index 365dcf384d73..82dd7582e945 100644
--- a/Documentation/devicetree/bindings/arm/cpus.yaml
+++ b/Documentation/devicetree/bindings/arm/cpus.yaml
@@ -228,7 +228,7 @@ patternProperties:
- renesas,r9a06g032-smp
- rockchip,rk3036-smp
- rockchip,rk3066-smp
- - socionext,milbeaut-m10v-smp
+ - socionext,milbeaut-m10v-smp
- ste,dbx500-smp
cpu-release-addr:
diff --git a/Documentation/devicetree/bindings/hwmon/adc128d818.txt b/Documentation/devicetree/bindings/hwmon/adc128d818.txt
index 08bab0e94d25..d0ae46d7bac3 100644
--- a/Documentation/devicetree/bindings/hwmon/adc128d818.txt
+++ b/Documentation/devicetree/bindings/hwmon/adc128d818.txt
@@ -26,7 +26,7 @@ Required node properties:
Optional node properties:
- - ti,mode: Operation mode (see above).
+ - ti,mode: Operation mode (u8) (see above).
Example (operation mode 2):
@@ -34,5 +34,5 @@ Example (operation mode 2):
adc128d818@1d {
compatible = "ti,adc128d818";
reg = <0x1d>;
- ti,mode = <2>;
+ ti,mode = /bits/ 8 <2>;
};
diff --git a/Documentation/devicetree/bindings/i2c/i2c-xscale.txt b/Documentation/devicetree/bindings/i2c/i2c-iop3xx.txt
index dcc8390e0d24..dcc8390e0d24 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-xscale.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-iop3xx.txt
diff --git a/Documentation/devicetree/bindings/i2c/i2c-mtk.txt b/Documentation/devicetree/bindings/i2c/i2c-mt65xx.txt
index ee4c32454198..ee4c32454198 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-mtk.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-mt65xx.txt
diff --git a/Documentation/devicetree/bindings/i2c/i2c-st-ddci2c.txt b/Documentation/devicetree/bindings/i2c/i2c-stu300.txt
index bd81a482634f..bd81a482634f 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-st-ddci2c.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-stu300.txt
diff --git a/Documentation/devicetree/bindings/i2c/i2c-sunxi-p2wi.txt b/Documentation/devicetree/bindings/i2c/i2c-sun6i-p2wi.txt
index 49df0053347a..49df0053347a 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-sunxi-p2wi.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-sun6i-p2wi.txt
diff --git a/Documentation/devicetree/bindings/i2c/i2c-vt8500.txt b/Documentation/devicetree/bindings/i2c/i2c-wmt.txt
index 94a425eaa6c7..94a425eaa6c7 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-vt8500.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-wmt.txt
diff --git a/Documentation/devicetree/bindings/net/dsa/qca8k.txt b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
index bbcb255c3150..93a7469e70d4 100644
--- a/Documentation/devicetree/bindings/net/dsa/qca8k.txt
+++ b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
@@ -12,10 +12,15 @@ Required properties:
Subnodes:
The integrated switch subnode should be specified according to the binding
-described in dsa/dsa.txt. As the QCA8K switches do not have a N:N mapping of
-port and PHY id, each subnode describing a port needs to have a valid phandle
-referencing the internal PHY connected to it. The CPU port of this switch is
-always port 0.
+described in dsa/dsa.txt. If the QCA8K switch is connect to a SoC's external
+mdio-bus each subnode describing a port needs to have a valid phandle
+referencing the internal PHY it is connected to. This is because there's no
+N:N mapping of port and PHY id.
+
+Don't use mixed external and internal mdio-bus configurations, as this is
+not supported by the hardware.
+
+The CPU port of this switch is always port 0.
A CPU port node has the following optional node:
@@ -31,8 +36,9 @@ For QCA8K the 'fixed-link' sub-node supports only the following properties:
- 'full-duplex' (boolean, optional), to indicate that full duplex is
used. When absent, half duplex is assumed.
-Example:
+Examples:
+for the external mdio-bus configuration:
&mdio0 {
phy_port1: phy@0 {
@@ -55,12 +61,12 @@ Example:
reg = <4>;
};
- switch0@0 {
+ switch@10 {
compatible = "qca,qca8337";
#address-cells = <1>;
#size-cells = <0>;
- reg = <0>;
+ reg = <0x10>;
ports {
#address-cells = <1>;
@@ -108,3 +114,56 @@ Example:
};
};
};
+
+for the internal master mdio-bus configuration:
+
+ &mdio0 {
+ switch@10 {
+ compatible = "qca,qca8337";
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ reg = <0x10>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ label = "cpu";
+ ethernet = <&gmac1>;
+ phy-mode = "rgmii";
+ fixed-link {
+ speed = 1000;
+ full-duplex;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ label = "lan1";
+ };
+
+ port@2 {
+ reg = <2>;
+ label = "lan2";
+ };
+
+ port@3 {
+ reg = <3>;
+ label = "lan3";
+ };
+
+ port@4 {
+ reg = <4>;
+ label = "lan4";
+ };
+
+ port@5 {
+ reg = <5>;
+ label = "wan";
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/serial/mtk-uart.txt b/Documentation/devicetree/bindings/serial/mtk-uart.txt
index 742cb470595b..bcfb13194f16 100644
--- a/Documentation/devicetree/bindings/serial/mtk-uart.txt
+++ b/Documentation/devicetree/bindings/serial/mtk-uart.txt
@@ -16,6 +16,7 @@ Required properties:
* "mediatek,mt8127-uart" for MT8127 compatible UARTS
* "mediatek,mt8135-uart" for MT8135 compatible UARTS
* "mediatek,mt8173-uart" for MT8173 compatible UARTS
+ * "mediatek,mt8183-uart", "mediatek,mt6577-uart" for MT8183 compatible UARTS
* "mediatek,mt6577-uart" for MT6577 and all of the above
- reg: The base address of the UART register bank.
diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.txt
index 944d1965e917..00ff0cfccfa7 100644
--- a/Documentation/filesystems/mount_api.txt
+++ b/Documentation/filesystems/mount_api.txt
@@ -12,11 +12,13 @@ CONTENTS
(4) Filesystem context security.
- (5) VFS filesystem context operations.
+ (5) VFS filesystem context API.
- (6) Parameter description.
+ (6) Superblock creation helpers.
- (7) Parameter helper functions.
+ (7) Parameter description.
+
+ (8) Parameter helper functions.
========
@@ -41,12 +43,15 @@ The creation of new mounts is now to be done in a multistep process:
(7) Destroy the context.
-To support this, the file_system_type struct gains a new field:
+To support this, the file_system_type struct gains two new fields:
int (*init_fs_context)(struct fs_context *fc);
+ const struct fs_parameter_description *parameters;
-which is invoked to set up the filesystem-specific parts of a filesystem
-context, including the additional space.
+The first is invoked to set up the filesystem-specific parts of a filesystem
+context, including the additional space, and the second points to the
+parameter description for validation at registration time and querying by a
+future system call.
Note that security initialisation is done *after* the filesystem is called so
that the namespaces may be adjusted first.
@@ -73,9 +78,9 @@ context. This is represented by the fs_context structure:
void *s_fs_info;
unsigned int sb_flags;
unsigned int sb_flags_mask;
+ unsigned int s_iflags;
+ unsigned int lsm_flags;
enum fs_context_purpose purpose:8;
- bool sloppy:1;
- bool silent:1;
...
};
@@ -141,6 +146,10 @@ The fs_context fields are as follows:
Which bits SB_* flags are to be set/cleared in super_block::s_flags.
+ (*) unsigned int s_iflags
+
+ These will be bitwise-OR'd with s->s_iflags when a superblock is created.
+
(*) enum fs_context_purpose
This indicates the purpose for which the context is intended. The
@@ -150,17 +159,6 @@ The fs_context fields are as follows:
FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
- (*) bool sloppy
- (*) bool silent
-
- These are set if the sloppy or silent mount options are given.
-
- [NOTE] sloppy is probably unnecessary when userspace passes over one
- option at a time since the error can just be ignored if userspace deems it
- to be unimportant.
-
- [NOTE] silent is probably redundant with sb_flags & SB_SILENT.
-
The mount context is created by calling vfs_new_fs_context() or
vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
structure is not refcounted.
@@ -342,28 +340,47 @@ number of operations used by the new mount code for this purpose:
It should return 0 on success or a negative error code on failure.
-=================================
-VFS FILESYSTEM CONTEXT OPERATIONS
-=================================
+==========================
+VFS FILESYSTEM CONTEXT API
+==========================
-There are four operations for creating a filesystem context and
-one for destroying a context:
+There are four operations for creating a filesystem context and one for
+destroying a context:
- (*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
- struct dentry *reference,
- unsigned int sb_flags,
- unsigned int sb_flags_mask,
- enum fs_context_purpose purpose);
+ (*) struct fs_context *fs_context_for_mount(
+ struct file_system_type *fs_type,
+ unsigned int sb_flags);
- Create a filesystem context for a given filesystem type and purpose. This
- allocates the filesystem context, sets the superblock flags, initialises
- the security and calls fs_type->init_fs_context() to initialise the
- filesystem private data.
+ Allocate a filesystem context for the purpose of setting up a new mount,
+ whether that be with a new superblock or sharing an existing one. This
+ sets the superblock flags, initialises the security and calls
+ fs_type->init_fs_context() to initialise the filesystem private data.
- reference can be NULL or it may indicate the root dentry of a superblock
- that is going to be reconfigured (FS_CONTEXT_FOR_RECONFIGURE) or
- the automount point that triggered a submount (FS_CONTEXT_FOR_SUBMOUNT).
- This is provided as a source of namespace information.
+ fs_type specifies the filesystem type that will manage the context and
+ sb_flags presets the superblock flags stored therein.
+
+ (*) struct fs_context *fs_context_for_reconfigure(
+ struct dentry *dentry,
+ unsigned int sb_flags,
+ unsigned int sb_flags_mask);
+
+ Allocate a filesystem context for the purpose of reconfiguring an
+ existing superblock. dentry provides a reference to the superblock to be
+ configured. sb_flags and sb_flags_mask indicate which superblock flags
+ need changing and to what.
+
+ (*) struct fs_context *fs_context_for_submount(
+ struct file_system_type *fs_type,
+ struct dentry *reference);
+
+ Allocate a filesystem context for the purpose of creating a new mount for
+ an automount point or other derived superblock. fs_type specifies the
+ filesystem type that will manage the context and the reference dentry
+ supplies the parameters. Namespaces are propagated from the reference
+ dentry's superblock also.
+
+ Note that it's not a requirement that the reference dentry be of the same
+ filesystem type as fs_type.
(*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
@@ -390,20 +407,6 @@ context pointer or a negative error code.
For the remaining operations, if an error occurs, a negative error code will be
returned.
- (*) int vfs_get_tree(struct fs_context *fc);
-
- Get or create the mountable root and superblock, using the parameters in
- the filesystem context to select/configure the superblock. This invokes
- the ->validate() op and then the ->get_tree() op.
-
- [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
- ->reconfigure().
-
- (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
-
- Create a mount given the parameters in the specified filesystem context.
- Note that this does not attach the mount to anything.
-
(*) int vfs_parse_fs_param(struct fs_context *fc,
struct fs_parameter *param);
@@ -432,17 +435,80 @@ returned.
clear the pointer, but then becomes responsible for disposing of the
object.
- (*) int vfs_parse_fs_string(struct fs_context *fc, char *key,
+ (*) int vfs_parse_fs_string(struct fs_context *fc, const char *key,
const char *value, size_t v_size);
- A wrapper around vfs_parse_fs_param() that just passes a constant string.
+ A wrapper around vfs_parse_fs_param() that copies the value string it is
+ passed.
(*) int generic_parse_monolithic(struct fs_context *fc, void *data);
Parse a sys_mount() data page, assuming the form to be a text list
consisting of key[=val] options separated by commas. Each item in the
list is passed to vfs_mount_option(). This is the default when the
- ->parse_monolithic() operation is NULL.
+ ->parse_monolithic() method is NULL.
+
+ (*) int vfs_get_tree(struct fs_context *fc);
+
+ Get or create the mountable root and superblock, using the parameters in
+ the filesystem context to select/configure the superblock. This invokes
+ the ->get_tree() method.
+
+ (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
+
+ Create a mount given the parameters in the specified filesystem context.
+ Note that this does not attach the mount to anything.
+
+
+===========================
+SUPERBLOCK CREATION HELPERS
+===========================
+
+A number of VFS helpers are available for use by filesystems for the creation
+or looking up of superblocks.
+
+ (*) struct super_block *
+ sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *sb, struct fs_context *fc),
+ int (*set)(struct super_block *sb, struct fs_context *fc));
+
+ This is the core routine. If test is non-NULL, it searches for an
+ existing superblock matching the criteria held in the fs_context, using
+ the test function to match them. If no match is found, a new superblock
+ is created and the set function is called to set it up.
+
+ Prior to the set function being called, fc->s_fs_info will be transferred
+ to sb->s_fs_info - and fc->s_fs_info will be cleared if set returns
+ success (ie. 0).
+
+The following helpers all wrap sget_fc():
+
+ (*) int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc))
+
+ This creates/looks up a deviceless superblock. The keying indicates how
+ many superblocks of this type may exist and in what manner they may be
+ shared:
+
+ (1) vfs_get_single_super
+
+ Only one such superblock may exist in the system. Any further
+ attempt to get a new superblock gets this one (and any parameter
+ differences are ignored).
+
+ (2) vfs_get_keyed_super
+
+ Multiple superblocks of this type may exist and they're keyed on
+ their s_fs_info pointer (for example this may refer to a
+ namespace).
+
+ (3) vfs_get_independent_super
+
+ Multiple independent superblocks of this type may exist. This
+ function never matches an existing one and always creates a new
+ one.
=====================
@@ -454,35 +520,22 @@ There's a core description struct that links everything together:
struct fs_parameter_description {
const char name[16];
- u8 nr_params;
- u8 nr_alt_keys;
- u8 nr_enums;
- bool ignore_unknown;
- bool no_source;
- const char *const *keys;
- const struct constant_table *alt_keys;
const struct fs_parameter_spec *specs;
const struct fs_parameter_enum *enums;
};
For example:
- enum afs_param {
+ enum {
Opt_autocell,
Opt_bar,
Opt_dyn,
Opt_foo,
Opt_source,
- nr__afs_params
};
static const struct fs_parameter_description afs_fs_parameters = {
.name = "kAFS",
- .nr_params = nr__afs_params,
- .nr_alt_keys = ARRAY_SIZE(afs_param_alt_keys),
- .nr_enums = ARRAY_SIZE(afs_param_enums),
- .keys = afs_param_keys,
- .alt_keys = afs_param_alt_keys,
.specs = afs_param_specs,
.enums = afs_param_enums,
};
@@ -494,28 +547,24 @@ The members are as follows:
The name to be used in error messages generated by the parse helper
functions.
- (2) u8 nr_params;
-
- The number of discrete parameter identifiers. This indicates the number
- of elements in the ->types[] array and also limits the values that may be
- used in the values that the ->keys[] array maps to.
-
- It is expected that, for example, two parameters that are related, say
- "acl" and "noacl" with have the same ID, but will be flagged to indicate
- that one is the inverse of the other. The value can then be picked out
- from the parse result.
+ (2) const struct fs_parameter_specification *specs;
- (3) const struct fs_parameter_specification *specs;
+ Table of parameter specifications, terminated with a null entry, where the
+ entries are of type:
- Table of parameter specifications, where the entries are of type:
-
- struct fs_parameter_type {
- enum fs_parameter_spec type:8;
- u8 flags;
+ struct fs_parameter_spec {
+ const char *name;
+ u8 opt;
+ enum fs_parameter_type type:8;
+ unsigned short flags;
};
- and the parameter identifier is the index to the array. 'type' indicates
- the desired value type and must be one of:
+ The 'name' field is a string to match exactly to the parameter key (no
+ wildcards, patterns and no case-independence) and 'opt' is the value that
+ will be returned by the fs_parser() function in the case of a successful
+ match.
+
+ The 'type' field indicates the desired value type and must be one of:
TYPE NAME EXPECTED VALUE RESULT IN
======================= ======================= =====================
@@ -525,85 +574,65 @@ The members are as follows:
fs_param_is_u32_octal 32-bit octal int result->uint_32
fs_param_is_u32_hex 32-bit hex int result->uint_32
fs_param_is_s32 32-bit signed int result->int_32
+ fs_param_is_u64 64-bit unsigned int result->uint_64
fs_param_is_enum Enum value name result->uint_32
fs_param_is_string Arbitrary string param->string
fs_param_is_blob Binary blob param->blob
fs_param_is_blockdev Blockdev path * Needs lookup
fs_param_is_path Path * Needs lookup
- fs_param_is_fd File descriptor param->file
-
- And each parameter can be qualified with 'flags':
-
- fs_param_v_optional The value is optional
- fs_param_neg_with_no If key name is prefixed with "no", it is false
- fs_param_neg_with_empty If value is "", it is false
- fs_param_deprecated The parameter is deprecated.
-
- For example:
-
- static const struct fs_parameter_spec afs_param_specs[nr__afs_params] = {
- [Opt_autocell] = { fs_param_is flag },
- [Opt_bar] = { fs_param_is_enum },
- [Opt_dyn] = { fs_param_is flag },
- [Opt_foo] = { fs_param_is_bool, fs_param_neg_with_no },
- [Opt_source] = { fs_param_is_string },
- };
+ fs_param_is_fd File descriptor result->int_32
Note that if the value is of fs_param_is_bool type, fs_parse() will try
to match any string value against "0", "1", "no", "yes", "false", "true".
- [!] NOTE that the table must be sorted according to primary key name so
- that ->keys[] is also sorted.
-
- (4) const char *const *keys;
-
- Table of primary key names for the parameters. There must be one entry
- per defined parameter. The table is optional if ->nr_params is 0. The
- table is just an array of names e.g.:
+ Each parameter can also be qualified with 'flags':
- static const char *const afs_param_keys[nr__afs_params] = {
- [Opt_autocell] = "autocell",
- [Opt_bar] = "bar",
- [Opt_dyn] = "dyn",
- [Opt_foo] = "foo",
- [Opt_source] = "source",
- };
-
- [!] NOTE that the table must be sorted such that the table can be searched
- with bsearch() using strcmp(). This means that the Opt_* values must
- correspond to the entries in this table.
-
- (5) const struct constant_table *alt_keys;
- u8 nr_alt_keys;
-
- Table of additional key names and their mappings to parameter ID plus the
- number of elements in the table. This is optional. The table is just an
- array of { name, integer } pairs, e.g.:
+ fs_param_v_optional The value is optional
+ fs_param_neg_with_no result->negated set if key is prefixed with "no"
+ fs_param_neg_with_empty result->negated set if value is ""
+ fs_param_deprecated The parameter is deprecated.
- static const struct constant_table afs_param_keys[] = {
- { "baz", Opt_bar },
- { "dynamic", Opt_dyn },
+ These are wrapped with a number of convenience wrappers:
+
+ MACRO SPECIFIES
+ ======================= ===============================================
+ fsparam_flag() fs_param_is_flag
+ fsparam_flag_no() fs_param_is_flag, fs_param_neg_with_no
+ fsparam_bool() fs_param_is_bool
+ fsparam_u32() fs_param_is_u32
+ fsparam_u32oct() fs_param_is_u32_octal
+ fsparam_u32hex() fs_param_is_u32_hex
+ fsparam_s32() fs_param_is_s32
+ fsparam_u64() fs_param_is_u64
+ fsparam_enum() fs_param_is_enum
+ fsparam_string() fs_param_is_string
+ fsparam_blob() fs_param_is_blob
+ fsparam_bdev() fs_param_is_blockdev
+ fsparam_path() fs_param_is_path
+ fsparam_fd() fs_param_is_fd
+
+ all of which take two arguments, name string and option number - for
+ example:
+
+ static const struct fs_parameter_spec afs_param_specs[] = {
+ fsparam_flag ("autocell", Opt_autocell),
+ fsparam_flag ("dyn", Opt_dyn),
+ fsparam_string ("source", Opt_source),
+ fsparam_flag_no ("foo", Opt_foo),
+ {}
};
- [!] NOTE that the table must be sorted such that strcmp() can be used with
- bsearch() to search the entries.
-
- The parameter ID can also be fs_param_key_removed to indicate that a
- deprecated parameter has been removed and that an error will be given.
- This differs from fs_param_deprecated where the parameter may still have
- an effect.
-
- Further, the behaviour of the parameter may differ when an alternate name
- is used (for instance with NFS, "v3", "v4.2", etc. are alternate names).
+ An addition macro, __fsparam() is provided that takes an additional pair
+ of arguments to specify the type and the flags for anything that doesn't
+ match one of the above macros.
(6) const struct fs_parameter_enum *enums;
- u8 nr_enums;
- Table of enum value names to integer mappings and the number of elements
- stored therein. This is of type:
+ Table of enum value names to integer mappings, terminated with a null
+ entry. This is of type:
struct fs_parameter_enum {
- u8 param_id;
+ u8 opt;
char name[14];
u8 value;
};
@@ -621,11 +650,6 @@ The members are as follows:
try to look the value up in the enum table and the result will be stored
in the parse result.
- (7) bool no_source;
-
- If this is set, fs_parse() will ignore any "source" parameter and not
- pass it to the filesystem.
-
The parser should be pointed to by the parser pointer in the file_system_type
struct as this will provide validation on registration (if
CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from
@@ -650,9 +674,8 @@ process the parameters it is given.
int value;
};
- and it must be sorted such that it can be searched using bsearch() using
- strcmp(). If a match is found, the corresponding value is returned. If a
- match isn't found, the not_found value is returned instead.
+ If a match is found, the corresponding value is returned. If a match
+ isn't found, the not_found value is returned instead.
(*) bool validate_constant_table(const struct constant_table *tbl,
size_t tbl_size,
@@ -665,36 +688,36 @@ process the parameters it is given.
should just be set to lie inside the low-to-high range.
If all is good, true is returned. If the table is invalid, errors are
- logged to dmesg, the stack is dumped and false is returned.
+ logged to dmesg and false is returned.
+
+ (*) bool fs_validate_description(const struct fs_parameter_description *desc);
+
+ This performs some validation checks on a parameter description. It
+ returns true if the description is good and false if it is not. It will
+ log errors to dmesg if validation fails.
(*) int fs_parse(struct fs_context *fc,
- const struct fs_param_parser *parser,
+ const struct fs_parameter_description *desc,
struct fs_parameter *param,
- struct fs_param_parse_result *result);
+ struct fs_parse_result *result);
This is the main interpreter of parameters. It uses the parameter
- description (parser) to look up the name of the parameter to use and to
- convert that to a parameter ID (stored in result->key).
+ description to look up a parameter by key name and to convert that to an
+ option number (which it returns).
If successful, and if the parameter type indicates the result is a
boolean, integer or enum type, the value is converted by this function and
- the result stored in result->{boolean,int_32,uint_32}.
+ the result stored in result->{boolean,int_32,uint_32,uint_64}.
If a match isn't initially made, the key is prefixed with "no" and no
value is present then an attempt will be made to look up the key with the
prefix removed. If this matches a parameter for which the type has flag
- fs_param_neg_with_no set, then a match will be made and the value will be
- set to false/0/NULL.
-
- If the parameter is successfully matched and, optionally, parsed
- correctly, 1 is returned. If the parameter isn't matched and
- parser->ignore_unknown is set, then 0 is returned. Otherwise -EINVAL is
- returned.
-
- (*) bool fs_validate_description(const struct fs_parameter_description *desc);
+ fs_param_neg_with_no set, then a match will be made and result->negated
+ will be set to true.
- This is validates the parameter description. It returns true if the
- description is good and false if it is not.
+ If the parameter isn't matched, -ENOPARAM will be returned; if the
+ parameter is matched, but the value is erroneous, -EINVAL will be
+ returned; otherwise the parameter's option number will be returned.
(*) int fs_lookup_param(struct fs_context *fc,
struct fs_parameter *value,
diff --git a/Documentation/i2c/busses/i2c-i801 b/Documentation/i2c/busses/i2c-i801
index d1ee484a787d..ee9984f35868 100644
--- a/Documentation/i2c/busses/i2c-i801
+++ b/Documentation/i2c/busses/i2c-i801
@@ -36,6 +36,7 @@ Supported adapters:
* Intel Cannon Lake (PCH)
* Intel Cedar Fork (PCH)
* Intel Ice Lake (PCH)
+ * Intel Comet Lake (PCH)
Datasheets: Publicly available at the Intel website
On Intel Patsburg and later chipsets, both the normal host SMBus controller
diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt
index f79934225d8d..ca983328976b 100644
--- a/Documentation/lzo.txt
+++ b/Documentation/lzo.txt
@@ -102,9 +102,11 @@ Byte sequences
dictionary which is empty, and that it will always be
invalid at this place.
- 17 : bitstream version. If the first byte is 17, the next byte
- gives the bitstream version (version 1 only). If the first byte
- is not 17, the bitstream version is 0.
+ 17 : bitstream version. If the first byte is 17, and compressed
+ stream length is at least 5 bytes (length of shortest possible
+ versioned bitstream), the next byte gives the bitstream version
+ (version 1 only).
+ Otherwise, the bitstream version is 0.
18..21 : copy 0..3 literals
state = (byte - 17) = 0..3 [ copy <state> literals ]
diff --git a/Documentation/networking/bpf_flow_dissector.rst b/Documentation/networking/bpf_flow_dissector.rst
new file mode 100644
index 000000000000..b375ae2ec2c4
--- /dev/null
+++ b/Documentation/networking/bpf_flow_dissector.rst
@@ -0,0 +1,126 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+BPF Flow Dissector
+==================
+
+Overview
+========
+
+Flow dissector is a routine that parses metadata out of the packets. It's
+used in the various places in the networking subsystem (RFS, flow hash, etc).
+
+BPF flow dissector is an attempt to reimplement C-based flow dissector logic
+in BPF to gain all the benefits of BPF verifier (namely, limits on the
+number of instructions and tail calls).
+
+API
+===
+
+BPF flow dissector programs operate on an ``__sk_buff``. However, only the
+limited set of fields is allowed: ``data``, ``data_end`` and ``flow_keys``.
+``flow_keys`` is ``struct bpf_flow_keys`` and contains flow dissector input
+and output arguments.
+
+The inputs are:
+ * ``nhoff`` - initial offset of the networking header
+ * ``thoff`` - initial offset of the transport header, initialized to nhoff
+ * ``n_proto`` - L3 protocol type, parsed out of L2 header
+
+Flow dissector BPF program should fill out the rest of the ``struct
+bpf_flow_keys`` fields. Input arguments ``nhoff/thoff/n_proto`` should be
+also adjusted accordingly.
+
+The return code of the BPF program is either BPF_OK to indicate successful
+dissection, or BPF_DROP to indicate parsing error.
+
+__sk_buff->data
+===============
+
+In the VLAN-less case, this is what the initial state of the BPF flow
+dissector looks like::
+
+ +------+------+------------+-----------+
+ | DMAC | SMAC | ETHER_TYPE | L3_HEADER |
+ +------+------+------------+-----------+
+ ^
+ |
+ +-- flow dissector starts here
+
+
+.. code:: c
+
+ skb->data + flow_keys->nhoff point to the first byte of L3_HEADER
+ flow_keys->thoff = nhoff
+ flow_keys->n_proto = ETHER_TYPE
+
+In case of VLAN, flow dissector can be called with the two different states.
+
+Pre-VLAN parsing::
+
+ +------+------+------+-----+-----------+-----------+
+ | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
+ +------+------+------+-----+-----------+-----------+
+ ^
+ |
+ +-- flow dissector starts here
+
+.. code:: c
+
+ skb->data + flow_keys->nhoff point the to first byte of TCI
+ flow_keys->thoff = nhoff
+ flow_keys->n_proto = TPID
+
+Please note that TPID can be 802.1AD and, hence, BPF program would
+have to parse VLAN information twice for double tagged packets.
+
+
+Post-VLAN parsing::
+
+ +------+------+------+-----+-----------+-----------+
+ | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
+ +------+------+------+-----+-----------+-----------+
+ ^
+ |
+ +-- flow dissector starts here
+
+.. code:: c
+
+ skb->data + flow_keys->nhoff point the to first byte of L3_HEADER
+ flow_keys->thoff = nhoff
+ flow_keys->n_proto = ETHER_TYPE
+
+In this case VLAN information has been processed before the flow dissector
+and BPF flow dissector is not required to handle it.
+
+
+The takeaway here is as follows: BPF flow dissector program can be called with
+the optional VLAN header and should gracefully handle both cases: when single
+or double VLAN is present and when it is not present. The same program
+can be called for both cases and would have to be written carefully to
+handle both cases.
+
+
+Reference Implementation
+========================
+
+See ``tools/testing/selftests/bpf/progs/bpf_flow.c`` for the reference
+implementation and ``tools/testing/selftests/bpf/flow_dissector_load.[hc]``
+for the loader. bpftool can be used to load BPF flow dissector program as well.
+
+The reference implementation is organized as follows:
+ * ``jmp_table`` map that contains sub-programs for each supported L3 protocol
+ * ``_dissect`` routine - entry point; it does input ``n_proto`` parsing and
+ does ``bpf_tail_call`` to the appropriate L3 handler
+
+Since BPF at this point doesn't support looping (or any jumping back),
+jmp_table is used instead to handle multiple levels of encapsulation (and
+IPv6 options).
+
+
+Current Limitations
+===================
+BPF flow dissector doesn't support exporting all the metadata that in-kernel
+C-based implementation can export. Notable example is single VLAN (802.1Q)
+and double VLAN (802.1AD) tags. Please refer to the ``struct bpf_flow_keys``
+for a set of information that's currently can be exported from the BPF context.
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 5449149be496..984e68f9e026 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -9,6 +9,7 @@ Contents:
netdev-FAQ
af_xdp
batman-adv
+ bpf_flow_dissector
can
can_ucan_protocol
device_drivers/freescale/dpaa2/index
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
index 18c1415e7bfa..ace56204dd03 100644
--- a/Documentation/networking/msg_zerocopy.rst
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -50,7 +50,7 @@ the excellent reporting over at LWN.net or read the original code.
patchset
[PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
- http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
+ https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
Interface
diff --git a/Documentation/networking/netdev-FAQ.rst b/Documentation/networking/netdev-FAQ.rst
index 0ac5fa77f501..8c7a713cf657 100644
--- a/Documentation/networking/netdev-FAQ.rst
+++ b/Documentation/networking/netdev-FAQ.rst
@@ -131,6 +131,19 @@ it to the maintainer to figure out what is the most recent and current
version that should be applied. If there is any doubt, the maintainer
will reply and ask what should be done.
+Q: I made changes to only a few patches in a patch series should I resend only those changed?
+--------------------------------------------------------------------------------------------
+A: No, please resend the entire patch series and make sure you do number your
+patches such that it is clear this is the latest and greatest set of patches
+that can be applied.
+
+Q: I submitted multiple versions of a patch series and it looks like a version other than the last one has been accepted, what should I do?
+-------------------------------------------------------------------------------------------------------------------------------------------
+A: There is no revert possible, once it is pushed out, it stays like that.
+Please send incremental versions on top of what has been merged in order to fix
+the patches the way they would look like if your latest patch series was to be
+merged.
+
Q: How can I tell what patches are queued up for backporting to the various stable releases?
--------------------------------------------------------------------------------------------
A: Normally Greg Kroah-Hartman collects stable commits himself, but for
diff --git a/Documentation/networking/nf_flowtable.txt b/Documentation/networking/nf_flowtable.txt
index 54128c50d508..ca2136c76042 100644
--- a/Documentation/networking/nf_flowtable.txt
+++ b/Documentation/networking/nf_flowtable.txt
@@ -44,10 +44,10 @@ including the Netfilter hooks and the flowtable fastpath bypass.
/ \ / \ |Routing | / \
--> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit
\_________/ \__________/ ---------- \____________/ ^
- | ^ | | ^ |
- flowtable | | ____\/___ | |
- | | | / \ | |
- __\/___ | --------->| forward |------------ |
+ | ^ | ^ |
+ flowtable | ____\/___ | |
+ | | / \ | |
+ __\/___ | | forward |------------ |
|-----| | \_________/ |
|-----| | 'flow offload' rule |
|-----| | adds entry to |
diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
index 52b026be028f..38a4edc4522b 100644
--- a/Documentation/networking/snmp_counter.rst
+++ b/Documentation/networking/snmp_counter.rst
@@ -413,7 +413,7 @@ algorithm.
.. _F-RTO: https://tools.ietf.org/html/rfc5682
TCP Fast Path
-============
+=============
When kernel receives a TCP packet, it has two paths to handler the
packet, one is fast path, another is slow path. The comment in kernel
code provides a good explanation of them, I pasted them below::
@@ -681,6 +681,7 @@ The TCP stack receives an out of order duplicate packet, so it sends a
DSACK to the sender.
* TcpExtTCPDSACKRecv
+
The TCP stack receives a DSACK, which indicates an acknowledged
duplicate packet is received.
@@ -690,7 +691,7 @@ The TCP stack receives a DSACK, which indicate an out of order
duplicate packet is received.
invalid SACK and DSACK
-====================
+======================
When a SACK (or DSACK) block is invalid, a corresponding counter would
be updated. The validation method is base on the start/end sequence
number of the SACK block. For more details, please refer the comment
@@ -704,11 +705,13 @@ explaination:
.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32
* TcpExtTCPSACKDiscard
+
This counter indicates how many SACK blocks are invalid. If the invalid
SACK block is caused by ACK recording, the TCP stack will only ignore
it and won't update this counter.
* TcpExtTCPDSACKIgnoredOld and TcpExtTCPDSACKIgnoredNoUndo
+
When a DSACK block is invalid, one of these two counters would be
updated. Which counter will be updated depends on the undo_marker flag
of the TCP socket. If the undo_marker is not set, the TCP stack isn't
@@ -719,7 +722,7 @@ will be updated. If the undo_marker is set, TcpExtTCPDSACKIgnoredOld
will be updated. As implied in its name, it might be an old packet.
SACK shift
-=========
+==========
The linux networking stack stores data in sk_buff struct (skb for
short). If a SACK block acrosses multiple skb, the TCP stack will try
to re-arrange data in these skb. E.g. if a SACK block acknowledges seq
@@ -730,12 +733,15 @@ seq 14 to 20. All data in skb2 will be moved to skb1, and skb2 will be
discard, this operation is 'merge'.
* TcpExtTCPSackShifted
+
A skb is shifted
* TcpExtTCPSackMerged
+
A skb is merged
* TcpExtTCPSackShiftFallback
+
A skb should be shifted or merged, but the TCP stack doesn't do it for
some reasons.
diff --git a/Documentation/powerpc/DAWR-POWER9.txt b/Documentation/powerpc/DAWR-POWER9.txt
index 2feaa6619658..bdec03650941 100644
--- a/Documentation/powerpc/DAWR-POWER9.txt
+++ b/Documentation/powerpc/DAWR-POWER9.txt
@@ -56,3 +56,35 @@ POWER9. Loads and stores to the watchpoint locations will not be
trapped in GDB. The watchpoint is remembered, so if the guest is
migrated back to the POWER8 host, it will start working again.
+Force enabling the DAWR
+=============================
+Kernels (since ~v5.2) have an option to force enable the DAWR via:
+
+ echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
+
+This enables the DAWR even on POWER9.
+
+This is a dangerous setting, USE AT YOUR OWN RISK.
+
+Some users may not care about a bad user crashing their box
+(ie. single user/desktop systems) and really want the DAWR. This
+allows them to force enable DAWR.
+
+This flag can also be used to disable DAWR access. Once this is
+cleared, all DAWR access should be cleared immediately and your
+machine once again safe from crashing.
+
+Userspace may get confused by toggling this. If DAWR is force
+enabled/disabled between getting the number of breakpoints (via
+PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
+inconsistent view of what's available. Similarly for guests.
+
+For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
+enabled in the host AND the guest. For this reason, this won't work on
+POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
+dawr_enable_dangerous file will fail if the hypervisor doesn't support
+writing the DAWR.
+
+To double check the DAWR is working, run this kernel selftest:
+ tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
+Any errors/failures/skips mean something is wrong.
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index fac1887f25b5..73a501eb9291 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -5,25 +5,32 @@ The Definitive KVM (Kernel-based Virtual Machine) API Documentation
----------------------
The kvm API is a set of ioctls that are issued to control various aspects
-of a virtual machine. The ioctls belong to three classes
+of a virtual machine. The ioctls belong to three classes:
- System ioctls: These query and set global attributes which affect the
whole kvm subsystem. In addition a system ioctl is used to create
- virtual machines
+ virtual machines.
- VM ioctls: These query and set attributes that affect an entire virtual
machine, for example memory layout. In addition a VM ioctl is used to
- create virtual cpus (vcpus).
+ create virtual cpus (vcpus) and devices.
- Only run VM ioctls from the same process (address space) that was used
- to create the VM.
+ VM ioctls must be issued from the same process (address space) that was
+ used to create the VM.
- vcpu ioctls: These query and set attributes that control the operation
of a single virtual cpu.
- Only run vcpu ioctls from the same thread that was used to create the
- vcpu.
+ vcpu ioctls should be issued from the same thread that was used to create
+ the vcpu, except for asynchronous vcpu ioctl that are marked as such in
+ the documentation. Otherwise, the first ioctl after switching threads
+ could see a performance impact.
+ - device ioctls: These query and set attributes that control the operation
+ of a single device.
+
+ device ioctls must be issued from the same process (address space) that
+ was used to create the VM.
2. File descriptors
-------------------
@@ -32,18 +39,18 @@ The kvm API is centered around file descriptors. An initial
open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
handle will create a VM file descriptor which can be used to issue VM
-ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu
-and return a file descriptor pointing to it. Finally, ioctls on a vcpu
-fd can be used to control the vcpu, including the important task of
-actually running guest code.
+ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will
+create a virtual cpu or device and return a file descriptor pointing to
+the new resource. Finally, ioctls on a vcpu or device fd can be used
+to control the vcpu or device. For vcpus, this includes the important
+task of actually running guest code.
In general file descriptors can be migrated among processes by means
of fork() and the SCM_RIGHTS facility of unix domain socket. These
kinds of tricks are explicitly not supported by kvm. While they will
not cause harm to the host, their actual behavior is not guaranteed by
-the API. The only supported use is one virtual machine per process,
-and one vcpu per thread.
-
+the API. See "General description" for details on the ioctl usage
+model that is supported by KVM.
It is important to note that althought VM ioctls may only be issued from
the process that created the VM, a VM's lifecycle is associated with its
@@ -323,7 +330,7 @@ They must be less than the value that KVM_CHECK_EXTENSION returns for
the KVM_CAP_MULTI_ADDRESS_SPACE capability.
The bits in the dirty bitmap are cleared before the ioctl returns, unless
-KVM_CAP_MANUAL_DIRTY_LOG_PROTECT is enabled. For more information,
+KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled. For more information,
see the description of the capability.
4.9 KVM_SET_MEMORY_ALIAS
@@ -515,11 +522,15 @@ c) KVM_INTERRUPT_SET_LEVEL
Note that any value for 'irq' other than the ones stated above is invalid
and incurs unexpected behavior.
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+
MIPS:
Queues an external interrupt to be injected into the virtual CPU. A negative
interrupt number dequeues the interrupt.
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+
4.17 KVM_DEBUG_GUEST
@@ -1086,14 +1097,11 @@ struct kvm_userspace_memory_region {
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
-This ioctl allows the user to create or modify a guest physical memory
-slot. When changing an existing slot, it may be moved in the guest
-physical memory space, or its flags may be modified. It may not be
-resized. Slots may not overlap in guest physical address space.
-Bits 0-15 of "slot" specifies the slot id and this value should be
-less than the maximum number of user memory slots supported per VM.
-The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS,
-if this capability is supported by the architecture.
+This ioctl allows the user to create, modify or delete a guest physical
+memory slot. Bits 0-15 of "slot" specify the slot id and this value
+should be less than the maximum number of user memory slots supported per
+VM. The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS.
+Slots may not overlap in guest physical address space.
If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of "slot"
specifies the address space which is being modified. They must be
@@ -1102,6 +1110,10 @@ KVM_CAP_MULTI_ADDRESS_SPACE capability. Slots in separate address spaces
are unrelated; the restriction on overlapping slots only applies within
each address space.
+Deleting a slot is done by passing zero for memory_size. When changing
+an existing slot, it may be moved in the guest physical memory space,
+or its flags may be modified, but it may not be resized.
+
Memory for the region is taken starting at the address denoted by the
field userspace_addr, which must point at user addressable memory for
the entire memory slot size. Any object may back this memory, including
@@ -1961,6 +1973,7 @@ registers, find a list below:
PPC | KVM_REG_PPC_TLB3PS | 32
PPC | KVM_REG_PPC_EPTCFG | 32
PPC | KVM_REG_PPC_ICP_STATE | 64
+ PPC | KVM_REG_PPC_VP_STATE | 128
PPC | KVM_REG_PPC_TB_OFFSET | 64
PPC | KVM_REG_PPC_SPMC1 | 32
PPC | KVM_REG_PPC_SPMC2 | 32
@@ -2594,7 +2607,7 @@ KVM_S390_MCHK (vm, vcpu) - machine check interrupt; cr 14 bits in parm,
machine checks needing further payload are not
supported by this ioctl)
-Note that the vcpu ioctl is asynchronous to vcpu execution.
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
4.78 KVM_PPC_GET_HTAB_FD
@@ -3186,8 +3199,7 @@ KVM_S390_INT_EMERGENCY - sigp emergency; parameters in .emerg
KVM_S390_INT_EXTERNAL_CALL - sigp external call; parameters in .extcall
KVM_S390_MCHK - machine check interrupt; parameters in .mchk
-
-Note that the vcpu ioctl is asynchronous to vcpu execution.
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
4.94 KVM_S390_GET_IRQ_STATE
@@ -3924,7 +3936,7 @@ to I/O ports.
4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl)
-Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT
+Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
Architectures: x86
Type: vm ioctl
Parameters: struct kvm_dirty_log (in)
@@ -3945,8 +3957,9 @@ The ioctl clears the dirty status of pages in a memory slot, according to
the bitmap that is passed in struct kvm_clear_dirty_log's dirty_bitmap
field. Bit 0 of the bitmap corresponds to page "first_page" in the
memory slot, and num_pages is the size in bits of the input bitmap.
-Both first_page and num_pages must be a multiple of 64. For each bit
-that is set in the input bitmap, the corresponding page is marked "clean"
+first_page must be a multiple of 64; num_pages must also be a multiple of
+64 unless first_page + num_pages is the size of the memory slot. For each
+bit that is set in the input bitmap, the corresponding page is marked "clean"
in KVM's dirty bitmap, and dirty tracking is re-enabled for that page
(for example via write-protection, or by clearing the dirty bit in
a page table entry).
@@ -3956,10 +3969,10 @@ the address space for which you want to return the dirty bitmap.
They must be less than the value that KVM_CHECK_EXTENSION returns for
the KVM_CAP_MULTI_ADDRESS_SPACE capability.
-This ioctl is mostly useful when KVM_CAP_MANUAL_DIRTY_LOG_PROTECT
+This ioctl is mostly useful when KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
is enabled; for more information, see the description of the capability.
However, it can always be used as long as KVM_CHECK_EXTENSION confirms
-that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT is present.
+that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is present.
4.118 KVM_GET_SUPPORTED_HV_CPUID
@@ -4653,6 +4666,15 @@ struct kvm_sync_regs {
struct kvm_vcpu_events events;
};
+6.75 KVM_CAP_PPC_IRQ_XIVE
+
+Architectures: ppc
+Target: vcpu
+Parameters: args[0] is the XIVE device fd
+ args[1] is the XIVE CPU number (server ID) for this vcpu
+
+This capability connects the vcpu to an in-kernel XIVE device.
+
7. Capabilities that can be enabled on VMs
------------------------------------------
@@ -4946,7 +4968,7 @@ and injected exceptions.
* For the new DR6 bits, note that bit 16 is set iff the #DB exception
will clear DR6.RTM.
-7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT
+7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
Architectures: all
Parameters: args[0] whether feature should be enabled or not
@@ -4969,6 +4991,11 @@ while userspace can see false reports of dirty pages. Manual reprotection
helps reducing this time, improving guest performance and reducing the
number of dirty log false positives.
+KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
+KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
+it hard or impossible to use it correctly. The availability of
+KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed.
+Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.
8. Other capabilities.
----------------------
diff --git a/Documentation/virtual/kvm/devices/vm.txt b/Documentation/virtual/kvm/devices/vm.txt
index 95ca68d663a4..4ffb82b02468 100644
--- a/Documentation/virtual/kvm/devices/vm.txt
+++ b/Documentation/virtual/kvm/devices/vm.txt
@@ -141,7 +141,8 @@ struct kvm_s390_vm_cpu_subfunc {
u8 pcc[16]; # valid with Message-Security-Assist-Extension 4
u8 ppno[16]; # valid with Message-Security-Assist-Extension 5
u8 kma[16]; # valid with Message-Security-Assist-Extension 8
- u8 reserved[1808]; # reserved for future instructions
+ u8 kdsa[16]; # valid with Message-Security-Assist-Extension 9
+ u8 reserved[1792]; # reserved for future instructions
};
Parameters: address of a buffer to load the subfunction blocks from.
diff --git a/Documentation/virtual/kvm/devices/xive.txt b/Documentation/virtual/kvm/devices/xive.txt
new file mode 100644
index 000000000000..9a24a4525253
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/xive.txt
@@ -0,0 +1,197 @@
+POWER9 eXternal Interrupt Virtualization Engine (XIVE Gen1)
+==========================================================
+
+Device types supported:
+ KVM_DEV_TYPE_XIVE POWER9 XIVE Interrupt Controller generation 1
+
+This device acts as a VM interrupt controller. It provides the KVM
+interface to configure the interrupt sources of a VM in the underlying
+POWER9 XIVE interrupt controller.
+
+Only one XIVE instance may be instantiated. A guest XIVE device
+requires a POWER9 host and the guest OS should have support for the
+XIVE native exploitation interrupt mode. If not, it should run using
+the legacy interrupt mode, referred as XICS (POWER7/8).
+
+* Device Mappings
+
+ The KVM device exposes different MMIO ranges of the XIVE HW which
+ are required for interrupt management. These are exposed to the
+ guest in VMAs populated with a custom VM fault handler.
+
+ 1. Thread Interrupt Management Area (TIMA)
+
+ Each thread has an associated Thread Interrupt Management context
+ composed of a set of registers. These registers let the thread
+ handle priority management and interrupt acknowledgment. The most
+ important are :
+
+ - Interrupt Pending Buffer (IPB)
+ - Current Processor Priority (CPPR)
+ - Notification Source Register (NSR)
+
+ They are exposed to software in four different pages each proposing
+ a view with a different privilege. The first page is for the
+ physical thread context and the second for the hypervisor. Only the
+ third (operating system) and the fourth (user level) are exposed the
+ guest.
+
+ 2. Event State Buffer (ESB)
+
+ Each source is associated with an Event State Buffer (ESB) with
+ either a pair of even/odd pair of pages which provides commands to
+ manage the source: to trigger, to EOI, to turn off the source for
+ instance.
+
+ 3. Device pass-through
+
+ When a device is passed-through into the guest, the source
+ interrupts are from a different HW controller (PHB4) and the ESB
+ pages exposed to the guest should accommadate this change.
+
+ The passthru_irq helpers, kvmppc_xive_set_mapped() and
+ kvmppc_xive_clr_mapped() are called when the device HW irqs are
+ mapped into or unmapped from the guest IRQ number space. The KVM
+ device extends these helpers to clear the ESB pages of the guest IRQ
+ number being mapped and then lets the VM fault handler repopulate.
+ The handler will insert the ESB page corresponding to the HW
+ interrupt of the device being passed-through or the initial IPI ESB
+ page if the device has being removed.
+
+ The ESB remapping is fully transparent to the guest and the OS
+ device driver. All handling is done within VFIO and the above
+ helpers in KVM-PPC.
+
+* Groups:
+
+ 1. KVM_DEV_XIVE_GRP_CTRL
+ Provides global controls on the device
+ Attributes:
+ 1.1 KVM_DEV_XIVE_RESET (write only)
+ Resets the interrupt controller configuration for sources and event
+ queues. To be used by kexec and kdump.
+ Errors: none
+
+ 1.2 KVM_DEV_XIVE_EQ_SYNC (write only)
+ Sync all the sources and queues and mark the EQ pages dirty. This
+ to make sure that a consistent memory state is captured when
+ migrating the VM.
+ Errors: none
+
+ 2. KVM_DEV_XIVE_GRP_SOURCE (write only)
+ Initializes a new source in the XIVE device and mask it.
+ Attributes:
+ Interrupt source number (64-bit)
+ The kvm_device_attr.addr points to a __u64 value:
+ bits: | 63 .... 2 | 1 | 0
+ values: | unused | level | type
+ - type: 0:MSI 1:LSI
+ - level: assertion level in case of an LSI.
+ Errors:
+ -E2BIG: Interrupt source number is out of range
+ -ENOMEM: Could not create a new source block
+ -EFAULT: Invalid user pointer for attr->addr.
+ -ENXIO: Could not allocate underlying HW interrupt
+
+ 3. KVM_DEV_XIVE_GRP_SOURCE_CONFIG (write only)
+ Configures source targeting
+ Attributes:
+ Interrupt source number (64-bit)
+ The kvm_device_attr.addr points to a __u64 value:
+ bits: | 63 .... 33 | 32 | 31 .. 3 | 2 .. 0
+ values: | eisn | mask | server | priority
+ - priority: 0-7 interrupt priority level
+ - server: CPU number chosen to handle the interrupt
+ - mask: mask flag (unused)
+ - eisn: Effective Interrupt Source Number
+ Errors:
+ -ENOENT: Unknown source number
+ -EINVAL: Not initialized source number
+ -EINVAL: Invalid priority
+ -EINVAL: Invalid CPU number.
+ -EFAULT: Invalid user pointer for attr->addr.
+ -ENXIO: CPU event queues not configured or configuration of the
+ underlying HW interrupt failed
+ -EBUSY: No CPU available to serve interrupt
+
+ 4. KVM_DEV_XIVE_GRP_EQ_CONFIG (read-write)
+ Configures an event queue of a CPU
+ Attributes:
+ EQ descriptor identifier (64-bit)
+ The EQ descriptor identifier is a tuple (server, priority) :
+ bits: | 63 .... 32 | 31 .. 3 | 2 .. 0
+ values: | unused | server | priority
+ The kvm_device_attr.addr points to :
+ struct kvm_ppc_xive_eq {
+ __u32 flags;
+ __u32 qshift;
+ __u64 qaddr;
+ __u32 qtoggle;
+ __u32 qindex;
+ __u8 pad[40];
+ };
+ - flags: queue flags
+ KVM_XIVE_EQ_ALWAYS_NOTIFY (required)
+ forces notification without using the coalescing mechanism
+ provided by the XIVE END ESBs.
+ - qshift: queue size (power of 2)
+ - qaddr: real address of queue
+ - qtoggle: current queue toggle bit
+ - qindex: current queue index
+ - pad: reserved for future use
+ Errors:
+ -ENOENT: Invalid CPU number
+ -EINVAL: Invalid priority
+ -EINVAL: Invalid flags
+ -EINVAL: Invalid queue size
+ -EINVAL: Invalid queue address
+ -EFAULT: Invalid user pointer for attr->addr.
+ -EIO: Configuration of the underlying HW failed
+
+ 5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only)
+ Synchronize the source to flush event notifications
+ Attributes:
+ Interrupt source number (64-bit)
+ Errors:
+ -ENOENT: Unknown source number
+ -EINVAL: Not initialized source number
+
+* VCPU state
+
+ The XIVE IC maintains VP interrupt state in an internal structure
+ called the NVT. When a VP is not dispatched on a HW processor
+ thread, this structure can be updated by HW if the VP is the target
+ of an event notification.
+
+ It is important for migration to capture the cached IPB from the NVT
+ as it synthesizes the priorities of the pending interrupts. We
+ capture a bit more to report debug information.
+
+ KVM_REG_PPC_VP_STATE (2 * 64bits)
+ bits: | 63 .... 32 | 31 .... 0 |
+ values: | TIMA word0 | TIMA word1 |
+ bits: | 127 .......... 64 |
+ values: | unused |
+
+* Migration:
+
+ Saving the state of a VM using the XIVE native exploitation mode
+ should follow a specific sequence. When the VM is stopped :
+
+ 1. Mask all sources (PQ=01) to stop the flow of events.
+
+ 2. Sync the XIVE device with the KVM control KVM_DEV_XIVE_EQ_SYNC to
+ flush any in-flight event notification and to stabilize the EQs. At
+ this stage, the EQ pages are marked dirty to make sure they are
+ transferred in the migration sequence.
+
+ 3. Capture the state of the source targeting, the EQs configuration
+ and the state of thread interrupt context registers.
+
+ Restore is similar :
+
+ 1. Restore the EQ configuration. As targeting depends on it.
+ 2. Restore targeting
+ 3. Restore the thread interrupt contexts
+ 4. Restore the source states
+ 5. Let the vCPU run
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index f365102c80f5..2efe0efc516e 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -142,7 +142,7 @@ Shadow pages contain the following information:
If clear, this page corresponds to a guest page table denoted by the gfn
field.
role.quadrant:
- When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit
+ When role.gpte_is_8_bytes=0, the guest uses 32-bit gptes while the host uses 64-bit
sptes. That means a guest page table contains more ptes than the host,
so multiple shadow pages are needed to shadow one guest page.
For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
@@ -158,9 +158,9 @@ Shadow pages contain the following information:
The page is invalid and should not be used. It is a root page that is
currently pinned (by a cpu hardware register pointing to it); once it is
unpinned it will be destroyed.
- role.cr4_pae:
- Contains the value of cr4.pae for which the page is valid (e.g. whether
- 32-bit or 64-bit gptes are in use).
+ role.gpte_is_8_bytes:
+ Reflects the size of the guest PTE for which the page is valid, i.e. '1'
+ if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
role.nxe:
Contains the value of efer.nxe for which the page is valid.
role.cr0_wp:
@@ -173,6 +173,9 @@ Shadow pages contain the following information:
Contains the value of cr4.smap && !cr0.wp for which the page is valid
(pages for which this is true are different from other pages; see the
treatment of cr0.wp=0 below).
+ role.ept_sp:
+ This is a virtual flag to denote a shadowed nested EPT page. ept_sp
+ is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
role.smm:
Is 1 if the page is valid in system management mode. This field
determines which of the kvm_memslots array was used to build this