From 85f94b5ef0eede10e7071f9e7e5b864ffad96d72 Mon Sep 17 00:00:00 2001
From: Thomas Petazzoni
Date: Sat, 29 Apr 2017 11:06:42 +0200
Subject: dt-bindings: mtd: document new "on-die" nand-ecc-mode

A number of NAND chips support a feature called on-die ECC, where the
NAND chip itself is capable of doing error detection and correction.
The new "on-die" value for nand-ecc-mode indicates that we want this
functionality to be used.

Signed-off-by: Thomas Petazzoni
Acked-by: Rob Herring
Reviewed-by: Richard Weinberger
Signed-off-by: Boris Brezillon
---
 Documentation/devicetree/bindings/mtd/nand.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/mtd/nand.txt b/Documentation/devicetree/bindings/mtd/nand.txt
index b05601600083..133f3813719c 100644
--- a/Documentation/devicetree/bindings/mtd/nand.txt
+++ b/Documentation/devicetree/bindings/mtd/nand.txt
@@ -21,7 +21,7 @@ Optional NAND chip properties:
 
 - nand-ecc-mode : String, operation mode of the NAND ecc mode.
   Supported values are: "none", "soft", "hw", "hw_syndrome",
-  "hw_oob_first".
+  "hw_oob_first", "on-die".
   Deprecated values:
   "soft_bch": use "soft" and nand-ecc-algo instead
 - nand-ecc-algo: string, algorithm of NAND ECC.
-- 
cgit v1.2.3-59-g8ed1b

From a21512c1698d8106bcece0d24ff590dc92682678 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Thu, 18 May 2017 22:25:56 -0300
Subject: rtc.txt: standardize document format

Each text file under Documentation follows a different format. Some
don't even have titles!

Change its representation to follow the adopted standard,
using ReST markups for it to be parseable by Sphinx:

- adjust indentation of the titles;
- mark a table as such;
- don't capitalize chapter names.
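The sysfs table this patch converts says wakealarm accepts three input forms: plain seconds since the epoch, a leading "+" (seconds from now), and a leading "+=" (seconds past the current alarm). The sketch below models only that mapping in userspace, assuming the caller already knows "now" and the currently programmed alarm as epoch seconds. The function name resolve_wakealarm is hypothetical; this is not the kernel's rtc-sysfs.c code, which also checks alarm state and returns errors.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: map a string written to
 * /sys/class/rtc/rtcN/wakealarm to an absolute alarm time,
 * following the three formats in the sysfs table:
 *   "N"   -> N seconds since the epoch
 *   "+N"  -> N seconds after `now`
 *   "+=N" -> N seconds after the currently programmed alarm
 * Returns false if no decimal number follows the prefix.
 */
static bool resolve_wakealarm(const char *s, int64_t now,
                              int64_t current_alarm, int64_t *alarm)
{
    int64_t base = 0;
    char *end;
    long long secs;

    if (s[0] == '+' && s[1] == '=') {
        base = current_alarm;
        s += 2;
    } else if (s[0] == '+') {
        base = now;
        s += 1;
    }

    /* Require a digit so strtoll() does not accept signs or spaces. */
    if (*s < '0' || *s > '9')
        return false;

    secs = strtoll(s, &end, 10);

    /* Tolerate the trailing newline that `echo` appends. */
    if (*end != '\0' && *end != '\n')
        return false;

    *alarm = base + (int64_t)secs;
    return true;
}
```

For instance, `echo +60 > /sys/class/rtc/rtc0/wakealarm` corresponds to the "+N" branch, resolving to 60 seconds after the current time.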
Signed-off-by: Mauro Carvalho Chehab
Signed-off-by: Alexandre Belloni
---
 Documentation/rtc.txt | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/rtc.txt b/Documentation/rtc.txt
index ddc366026e00..47feb4414b7e 100644
--- a/Documentation/rtc.txt
+++ b/Documentation/rtc.txt
@@ -1,6 +1,6 @@
-
-	Real Time Clock (RTC) Drivers for Linux
-	=======================================
+=======================================
+Real Time Clock (RTC) Drivers for Linux
+=======================================
 
 When Linux developers talk about a "Real Time Clock", they usually mean
 something that tracks wall clock time and is battery backed so that it
@@ -32,8 +32,8 @@ only issue an alarm up to 24 hours in the future, other hardware may
 be able to schedule one any time in the upcoming century.
 
 
-	Old PC/AT-Compatible driver: /dev/rtc
-	--------------------------------------
+Old PC/AT-Compatible driver: /dev/rtc
+--------------------------------------
 
 All PCs (even Alpha machines) have a Real Time Clock built into them.
 Usually they are built into the chipset of the computer, but some may
@@ -105,8 +105,8 @@ that will be using this driver. See the code at the end of this document.
 (The original /dev/rtc driver was written by Paul Gortmaker.)
 
 
-	New portable "RTC Class" drivers: /dev/rtcN
-	--------------------------------------------
+New portable "RTC Class" drivers: /dev/rtcN
+--------------------------------------------
 
 Because Linux supports many non-ACPI and non-PC platforms, some of which
 have more than one RTC style clock, it needed a more portable solution
@@ -136,35 +136,37 @@ a high functionality RTC is integrated into the SOC.
 That system might read the system clock from the discrete RTC, but use
 the integrated one for all other tasks, because of its greater functionality.
-SYSFS INTERFACE
+SYSFS interface
 ---------------
 
 The sysfs interface under /sys/class/rtc/rtcN provides access to various
 rtc attributes without requiring the use of ioctls. All dates and times
 are in the RTC's timezone, rather than in system time.
 
-date: RTC-provided date
-hctosys: 1 if the RTC provided the system time at boot via the
+================ ==============================================================
+date             RTC-provided date
+hctosys          1 if the RTC provided the system time at boot via the
                  CONFIG_RTC_HCTOSYS kernel option, 0 otherwise
-max_user_freq: The maximum interrupt rate an unprivileged user may request
+max_user_freq    The maximum interrupt rate an unprivileged user may request
                  from this RTC.
-name: The name of the RTC corresponding to this sysfs directory
-since_epoch: The number of seconds since the epoch according to the RTC
-time: RTC-provided time
-wakealarm: The time at which the clock will generate a system wakeup
+name             The name of the RTC corresponding to this sysfs directory
+since_epoch      The number of seconds since the epoch according to the RTC
+time             RTC-provided time
+wakealarm        The time at which the clock will generate a system wakeup
                  event. This is a one shot wakeup event, so must be reset
-                 after wake if a daily wakeup is required. Format is seconds since
-                 the epoch by default, or if there's a leading +, seconds in the
-                 future, or if there is a leading +=, seconds ahead of the current
-                 alarm.
-offset: The amount which the rtc clock has been adjusted in firmware.
+                 after wake if a daily wakeup is required. Format is seconds
+                 since the epoch by default, or if there's a leading +, seconds
+                 in the future, or if there is a leading +=, seconds ahead of
+                 the current alarm.
+offset           The amount which the rtc clock has been adjusted in firmware.
                  Visible only if the driver supports clock offset adjustment.
                  The unit is parts per billion, i.e. The number of clock ticks
                  which are added to or removed from the rtc's base clock per
                  billion ticks.
                  A positive value makes a day pass more slowly,
                  longer, and a negative value makes a day pass more quickly.
+================ ==============================================================
 
-IOCTL INTERFACE
+IOCTL interface
 ---------------
 
 The ioctl() calls supported by /dev/rtc are also supported by the RTC class
-- 
cgit v1.2.3-59-g8ed1b

From d7e578c8118113789b7abd2977e208c64d6f8465 Mon Sep 17 00:00:00 2001
From: Stefan Agner
Date: Fri, 21 Apr 2017 18:23:36 -0700
Subject: mtd: gpmi: document current clock requirements

The clock requirements are completely missing, add the clocks
currently required by the driver.

Signed-off-by: Stefan Agner
Acked-by: Rob Herring
Signed-off-by: Boris Brezillon
---
 Documentation/devicetree/bindings/mtd/gpmi-nand.txt | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/mtd/gpmi-nand.txt b/Documentation/devicetree/bindings/mtd/gpmi-nand.txt
index d02acaff3c35..b289ef3c1b7e 100644
--- a/Documentation/devicetree/bindings/mtd/gpmi-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/gpmi-nand.txt
@@ -4,7 +4,12 @@ The GPMI nand controller provides an interface to control the
 NAND flash chips.
 
 Required properties:
-  - compatible : should be "fsl,<chip>-gpmi-nand"
+  - compatible : should be "fsl,<chip>-gpmi-nand", chip can be:
+    * imx23
+    * imx28
+    * imx6q
+    * imx6sx
+    * imx7d
   - reg : should contain registers location and length for gpmi and bch.
   - reg-names: Should contain the reg names "gpmi-nand" and "bch"
   - interrupts : BCH interrupt number.
@@ -13,6 +18,13 @@ Required properties:
                and GPMI DMA channel ID.
                Refer to dma.txt and fsl-mxs-dma.txt for details.
   - dma-names: Must be "rx-tx".
+  - clocks : clocks phandle and clock specifier corresponding to each clock
+    specified in clock-names.
+  - clock-names : The "gpmi_io" clock is always required.
Which clocks are + exactly required depends on chip: + * imx23/imx28 : "gpmi_io" + * imx6q/sx : "gpmi_io", "gpmi_apb", "gpmi_bch", "gpmi_bch_apb", "per1_bch" + * imx7d : "gpmi_io", "gpmi_bch_apb" Optional properties: - nand-on-flash-bbt: boolean to enable on flash bbt option if not -- cgit v1.2.3-59-g8ed1b From 2968698ba03187b09e84fbc709f0b5d19ddd94ff Mon Sep 17 00:00:00 2001 From: Xiaolei Li Date: Wed, 31 May 2017 16:26:38 +0800 Subject: mtd: nand: mediatek: update DT bindings Add MT2712 NAND Flash Controller dt bindings documentation. Signed-off-by: Xiaolei Li Reviewed-by: Matthias Brugger Signed-off-by: Boris Brezillon --- Documentation/devicetree/bindings/mtd/mtk-nand.txt | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/mtd/mtk-nand.txt b/Documentation/devicetree/bindings/mtd/mtk-nand.txt index 069c192ed5c2..dbf9e054c11c 100644 --- a/Documentation/devicetree/bindings/mtd/mtk-nand.txt +++ b/Documentation/devicetree/bindings/mtd/mtk-nand.txt @@ -12,7 +12,8 @@ tree nodes. The first part of NFC is NAND Controller Interface (NFI) HW. Required NFI properties: -- compatible: Should be "mediatek,mtxxxx-nfc". +- compatible: Should be one of "mediatek,mt2701-nfc", + "mediatek,mt2712-nfc". - reg: Base physical address and size of NFI. - interrupts: Interrupts of NFI. - clocks: NFI required clocks. @@ -141,7 +142,7 @@ Example: ============== Required BCH properties: -- compatible: Should be "mediatek,mtxxxx-ecc". +- compatible: Should be one of "mediatek,mt2701-ecc", "mediatek,mt2712-ecc". - reg: Base physical address and size of ECC. - interrupts: Interrupts of ECC. - clocks: ECC required clocks. -- cgit v1.2.3-59-g8ed1b From 4db4d35ebda390b5287d758fdd51b26c24fbc26b Mon Sep 17 00:00:00 2001 From: Chris Packham Date: Thu, 25 May 2017 11:49:12 +1200 Subject: mtd: mchp23k256: Add OF device ID table This allows registering of this device via a Device Tree. 
Signed-off-by: Chris Packham
Reviewed-by: Andrew Lunn
Tested-by: Andrew Lunn
Acked-by: Boris Brezillon
Acked-by: Rob Herring
Signed-off-by: Brian Norris
---
 .../devicetree/bindings/mtd/microchip,mchp23k256.txt | 18 ++++++++++++++++++
 drivers/mtd/devices/mchp23k256.c | 8 ++++++++
 2 files changed, 26 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt b/Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt
new file mode 100644
index 000000000000..25e5ad38b0f0
--- /dev/null
+++ b/Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt
@@ -0,0 +1,18 @@
+* MTD SPI driver for Microchip 23K256 (and similar) serial SRAM
+
+Required properties:
+- #address-cells, #size-cells : Must be present if the device has sub-nodes
+  representing partitions.
+- compatible : Must be "microchip,mchp23k256"
+- reg : Chip-Select number
+- spi-max-frequency : Maximum frequency of the SPI bus the chip can operate at
+
+Example:
+
+	spi-sram@0 {
+		#address-cells = <1>;
+		#size-cells = <1>;
+		compatible = "microchip,mchp23k256";
+		reg = <0>;
+		spi-max-frequency = <20000000>;
+	};
diff --git a/drivers/mtd/devices/mchp23k256.c b/drivers/mtd/devices/mchp23k256.c
index e237db9f1bdb..9d8306a15833 100644
--- a/drivers/mtd/devices/mchp23k256.c
+++ b/drivers/mtd/devices/mchp23k256.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include
 
 struct mchp23k256_flash {
 	struct spi_device *spi;
@@ -166,9 +167,16 @@ static int mchp23k256_remove(struct spi_device *spi)
 	return mtd_device_unregister(&flash->mtd);
 }
 
+static const struct of_device_id mchp23k256_of_table[] = {
+	{ .compatible = "microchip,mchp23k256" },
+	{}
+};
+MODULE_DEVICE_TABLE(of, mchp23k256_of_table);
+
 static struct spi_driver mchp23k256_driver = {
 	.driver = {
 		.name = "mchp23k256",
+		.of_match_table = of_match_ptr(mchp23k256_of_table),
 	},
 	.probe = mchp23k256_probe,
 	.remove =
 		mchp23k256_remove,
-- 
cgit v1.2.3-59-g8ed1b

From 4379075a870b8de43a9ecd5b46884953234fc669 Mon Sep 17 00:00:00 2001
From: Chris Packham
Date: Fri, 2 Jun 2017 15:21:19 +1200
Subject: mtd: mchp23k256: Add support for mchp23lcv1024

The mchp23lcv1024 is similar to the mchp23k256, the differences (from a
software point of view) are the capacity of the chip and the size of the
addresses used. There is no way to detect the specific chip so we must
be told via a Device Tree or default to mchp23k256 when device tree is
not used.

Signed-off-by: Chris Packham
Reviewed-by: Andrew Lunn
Acked-by: Rob Herring
Signed-off-by: Brian Norris
---
 .../bindings/mtd/microchip,mchp23k256.txt | 2 +-
 drivers/mtd/devices/mchp23k256.c | 66 ++++++++++++++++++----
 2 files changed, 57 insertions(+), 11 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt b/Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt
index 25e5ad38b0f0..7328eb92a03c 100644
--- a/Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt
+++ b/Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt
@@ -3,7 +3,7 @@
 Required properties:
 - #address-cells, #size-cells : Must be present if the device has sub-nodes
   representing partitions.
-- compatible : Must be "microchip,mchp23k256"
+- compatible : Must be one of "microchip,mchp23k256" or "microchip,mchp23lcv1024"
 - reg : Chip-Select number
 - spi-max-frequency : Maximum frequency of the SPI bus the chip can operate at
 
diff --git a/drivers/mtd/devices/mchp23k256.c b/drivers/mtd/devices/mchp23k256.c
index 3e5feb454644..8956b7dcc984 100644
--- a/drivers/mtd/devices/mchp23k256.c
+++ b/drivers/mtd/devices/mchp23k256.c
@@ -21,10 +21,18 @@
 #include
 #include
 
+#define MAX_CMD_SIZE 4
+
+struct mchp23_caps {
+	u8 addr_width;
+	unsigned int size;
+};
+
 struct mchp23k256_flash {
 	struct spi_device *spi;
 	struct mutex lock;
 	struct mtd_info mtd;
+	const struct mchp23_caps *caps;
 };
 
 #define MCHP23K256_CMD_WRITE_STATUS 0x01
@@ -34,22 +42,40 @@ struct mchp23k256_flash {
 
 #define to_mchp23k256_flash(x) container_of(x, struct mchp23k256_flash, mtd)
 
+static void mchp23k256_addr2cmd(struct mchp23k256_flash *flash,
+				unsigned int addr, u8 *cmd)
+{
+	int i;
+
+	/*
+	 * Address is sent in big endian (MSB first) and we skip
+	 * the first entry of the cmd array which contains the cmd
+	 * opcode.
+	 */
+	for (i = flash->caps->addr_width; i > 0; i--, addr >>= 8)
+		cmd[i] = addr;
+}
+
+static int mchp23k256_cmdsz(struct mchp23k256_flash *flash)
+{
+	return 1 + flash->caps->addr_width;
+}
+
 static int mchp23k256_write(struct mtd_info *mtd, loff_t to, size_t len,
 			    size_t *retlen, const unsigned char *buf)
 {
 	struct mchp23k256_flash *flash = to_mchp23k256_flash(mtd);
 	struct spi_transfer transfer[2] = {};
 	struct spi_message message;
-	unsigned char command[3];
+	unsigned char command[MAX_CMD_SIZE];
 
 	spi_message_init(&message);
 
 	command[0] = MCHP23K256_CMD_WRITE;
-	command[1] = to >> 8;
-	command[2] = to;
+	mchp23k256_addr2cmd(flash, to, command);
 
 	transfer[0].tx_buf = command;
-	transfer[0].len = sizeof(command);
+	transfer[0].len = mchp23k256_cmdsz(flash);
 	spi_message_add_tail(&transfer[0], &message);
 
 	transfer[1].tx_buf = buf;
@@ -73,17 +99,16 @@ static int mchp23k256_read(struct mtd_info *mtd, loff_t from, size_t len,
 	struct mchp23k256_flash *flash = to_mchp23k256_flash(mtd);
 	struct spi_transfer transfer[2] = {};
 	struct spi_message message;
-	unsigned char command[3];
+	unsigned char command[MAX_CMD_SIZE];
 
 	spi_message_init(&message);
 
 	memset(&transfer, 0, sizeof(transfer));
 	command[0] = MCHP23K256_CMD_READ;
-	command[1] = from >> 8;
-	command[2] = from;
+	mchp23k256_addr2cmd(flash, from, command);
 
 	transfer[0].tx_buf = command;
-	transfer[0].len = sizeof(command);
+	transfer[0].len = mchp23k256_cmdsz(flash);
 	spi_message_add_tail(&transfer[0], &message);
 
 	transfer[1].rx_buf = buf;
@@ -123,6 +148,16 @@ static int mchp23k256_set_mode(struct spi_device *spi)
 	return spi_sync(spi, &message);
 }
 
+static const struct mchp23_caps mchp23k256_caps = {
+	.size = SZ_32K,
+	.addr_width = 2,
+};
+
+static const struct mchp23_caps mchp23lcv1024_caps = {
+	.size = SZ_128K,
+	.addr_width = 3,
+};
+
 static int mchp23k256_probe(struct spi_device *spi)
 {
 	struct mchp23k256_flash *flash;
@@ -143,12 +178,16 @@ static int mchp23k256_probe(struct spi_device *spi)
 
 	data =
 		dev_get_platdata(&spi->dev);
 
+	flash->caps = of_device_get_match_data(&spi->dev);
+	if (!flash->caps)
+		flash->caps = &mchp23k256_caps;
+
 	mtd_set_of_node(&flash->mtd, spi->dev.of_node);
 	flash->mtd.dev.parent = &spi->dev;
 	flash->mtd.type = MTD_RAM;
 	flash->mtd.flags = MTD_CAP_RAM;
 	flash->mtd.writesize = 1;
-	flash->mtd.size = SZ_32K;
+	flash->mtd.size = flash->caps->size;
 
 	flash->mtd._read = mchp23k256_read;
 	flash->mtd._write = mchp23k256_write;
@@ -168,7 +207,14 @@ static int mchp23k256_remove(struct spi_device *spi)
 }
 
 static const struct of_device_id mchp23k256_of_table[] = {
-	{ .compatible = "microchip,mchp23k256" },
+	{
+		.compatible = "microchip,mchp23k256",
+		.data = &mchp23k256_caps,
+	},
+	{
+		.compatible = "microchip,mchp23lcv1024",
+		.data = &mchp23lcv1024_caps,
+	},
 	{}
 };
 MODULE_DEVICE_TABLE(of, mchp23k256_of_table);
-- 
cgit v1.2.3-59-g8ed1b

From 7de117fd5bfe0d84e50714ef5dcf5f3cec7f0eef Mon Sep 17 00:00:00 2001
From: Masahiro Yamada
Date: Wed, 7 Jun 2017 20:52:12 +0900
Subject: mtd: nand: denali: avoid hard-coding ECC step, strength, bytes

This driver was originally written for the Intel MRST platform with
several platform-specific parameters hard-coded.

Currently, the ECC settings are hard-coded as follows:

  #define ECC_SECTOR_SIZE 512
  #define ECC_8BITS 14
  #define ECC_15BITS 26

Therefore, the driver can only support two cases.
 - ecc.size = 512, ecc.strength = 8 --> ecc.bytes = 14
 - ecc.size = 512, ecc.strength = 15 --> ecc.bytes = 26

However, these are actually customizable parameters, for example,
UniPhier platform supports the following:

 - ecc.size = 1024, ecc.strength = 8 --> ecc.bytes = 14
 - ecc.size = 1024, ecc.strength = 16 --> ecc.bytes = 28
 - ecc.size = 1024, ecc.strength = 24 --> ecc.bytes = 42

So, we need to handle the ECC parameters in a more generic manner.
Fortunately, the Denali User's Guide explains how to calculate the
ecc.bytes.
The formula is:

  ecc.bytes = 2 * CEIL(13 * ecc.strength / 16)  (for ecc.size = 512)
  ecc.bytes = 2 * CEIL(14 * ecc.strength / 16)  (for ecc.size = 1024)

For DT platforms, it would be reasonable to allow DT to specify ECC
strength by either "nand-ecc-strength" or "nand-ecc-maximize". If
none of them is specified, the driver will try to meet the chip's ECC
requirement.

For PCI platforms, the max ECC strength is used to keep the original
behavior.

Newer versions of this IP need ecc.size and ecc.steps explicitly
set up via the following registers:
  CFG_DATA_BLOCK_SIZE       (0x6b0)
  CFG_LAST_DATA_BLOCK_SIZE  (0x6c0)
  CFG_NUM_DATA_BLOCKS       (0x6d0)

For older IP versions, write accesses to these registers are just
ignored.

Signed-off-by: Masahiro Yamada
Acked-by: Rob Herring
Signed-off-by: Boris Brezillon
---
 .../devicetree/bindings/mtd/denali-nand.txt | 7 ++
 drivers/mtd/nand/denali.c | 87 +++++++++++++---------
 drivers/mtd/nand/denali.h | 12 ++-
 drivers/mtd/nand/denali_dt.c | 5 ++
 drivers/mtd/nand/denali_pci.c | 4 +
 5 files changed, 78 insertions(+), 37 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/mtd/denali-nand.txt b/Documentation/devicetree/bindings/mtd/denali-nand.txt
index e593bbeb2115..b7742a7363ea 100644
--- a/Documentation/devicetree/bindings/mtd/denali-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/denali-nand.txt
@@ -7,6 +7,13 @@ Required properties:
   - reg-names: Should contain the reg names "nand_data" and "denali_reg"
   - interrupts : The interrupt number.
 
+Optional properties:
+  - nand-ecc-step-size: see nand.txt for details. If present, the value must be
+      512 for "altr,socfpga-denali-nand"
+  - nand-ecc-strength: see nand.txt for details. Valid values are:
+      8, 15 for "altr,socfpga-denali-nand"
+  - nand-ecc-maximize: see nand.txt for details
+
 The device tree may optionally contain sub-nodes describing partitions of the
 address space. See partition.txt for more detail.
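The two formulas in the commit message collapse into the single expression used by `denali_calc_ecc_bytes()`, because `fls(512 * 8)` is 13 and `fls(1024 * 8)` is 14 (the degree m of the GF(2^m) field a BCH code over such a step needs, with m parity bits per correctable bit). The userspace sketch below re-implements that arithmetic and, as a hypothetical extra helper named `ecc_fits_oob`, a simplified version of the OOB-space check the removed MRST code performed. `fls_u32` and `ecc_fits_oob` are illustrative names, not kernel APIs, and the real controller setup goes through `nand_check_ecc_caps()` and related helpers instead.

```c
#include <stdbool.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Userspace stand-in for the kernel's fls(): 1-based index of the
 * most significant set bit, or 0 when x == 0. */
static int fls_u32(unsigned int x)
{
    int r = 0;

    while (x) {
        r++;
        x >>= 1;
    }
    return r;
}

/* Same arithmetic as denali_calc_ecc_bytes() in this patch:
 * strength * m parity bits, rounded up to a multiple of 2 bytes. */
static int calc_ecc_bytes(int step_size, int strength)
{
    return DIV_ROUND_UP(strength * fls_u32((unsigned int)step_size * 8), 16) * 2;
}

/* Hypothetical helper modelled on the removed MRST check: do the ECC
 * bytes for every step fit in the OOB area left after bbtskipbytes? */
static bool ecc_fits_oob(int writesize, int oobsize, int bbtskipbytes,
                         int step_size, int strength)
{
    int steps = writesize / step_size;

    return steps * calc_ecc_bytes(step_size, strength) <=
           oobsize - bbtskipbytes;
}
```

With a 2048-byte page and 64-byte OOB, four 512-byte steps at strength 8 need 4 * 14 = 56 bytes and fit, while strength 15 needs 4 * 26 = 104 bytes and does not, which mirrors the MLC/SLC decision the old hard-coded branch made.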
diff --git a/drivers/mtd/nand/denali.c b/drivers/mtd/nand/denali.c
index b3c99d98fdee..2d3d9875dfaa 100644
--- a/drivers/mtd/nand/denali.c
+++ b/drivers/mtd/nand/denali.c
@@ -886,8 +886,6 @@ static int denali_hw_ecc_fixup(struct mtd_info *mtd,
 	return max_bitflips;
 }
 
-#define ECC_SECTOR_SIZE 512
-
 #define ECC_SECTOR(x) (((x) & ECC_ERROR_ADDRESS__SECTOR_NR) >> 12)
 #define ECC_BYTE(x) (((x) & ECC_ERROR_ADDRESS__OFFSET))
 #define ECC_CORRECTION_VALUE(x) ((x) & ERR_CORRECTION_INFO__BYTEMASK)
@@ -899,6 +897,7 @@ static int denali_sw_ecc_fixup(struct mtd_info *mtd,
 			       struct denali_nand_info *denali,
 			       unsigned long *uncor_ecc_flags, uint8_t *buf)
 {
+	unsigned int ecc_size = denali->nand.ecc.size;
 	unsigned int bitflips = 0;
 	unsigned int max_bitflips = 0;
 	uint32_t err_addr, err_cor_info;
@@ -928,9 +927,9 @@ static int denali_sw_ecc_fixup(struct mtd_info *mtd,
 			 * an erased sector.
 			 */
 			*uncor_ecc_flags |= BIT(err_sector);
-		} else if (err_byte < ECC_SECTOR_SIZE) {
+		} else if (err_byte < ecc_size) {
 			/*
-			 * If err_byte is larger than ECC_SECTOR_SIZE, means error
+			 * If err_byte is larger than ecc_size, means error
 			 * happened in OOB, so we ignore it. It's no need for
 			 * us to correct it err_device is represented the NAND
 			 * error bits are happened in if there are more than
@@ -939,7 +938,7 @@ static int denali_sw_ecc_fixup(struct mtd_info *mtd,
 			int offset;
 			unsigned int flips_in_byte;
 
-			offset = (err_sector * ECC_SECTOR_SIZE + err_byte) *
+			offset = (err_sector * ecc_size + err_byte) *
 					denali->devnum + err_device;
 
 			/* correct the ECC error */
@@ -1345,13 +1344,39 @@ static void denali_hw_init(struct denali_nand_info *denali)
 	denali_irq_init(denali);
 }
 
-/*
- * Althogh controller spec said SLC ECC is forceb to be 4bit,
- * but denali controller in MRST only support 15bit and 8bit ECC
- * correction
- */
-#define ECC_8BITS 14
-#define ECC_15BITS 26
+int denali_calc_ecc_bytes(int step_size, int strength)
+{
+	/* BCH code.
+	   Denali requires ecc.bytes to be multiple of 2 */
+	return DIV_ROUND_UP(strength * fls(step_size * 8), 16) * 2;
+}
+EXPORT_SYMBOL(denali_calc_ecc_bytes);
+
+static int denali_ecc_setup(struct mtd_info *mtd, struct nand_chip *chip,
+			    struct denali_nand_info *denali)
+{
+	int oobavail = mtd->oobsize - denali->bbtskipbytes;
+	int ret;
+
+	/*
+	 * If .size and .strength are already set (usually by DT),
+	 * check if they are supported by this controller.
+	 */
+	if (chip->ecc.size && chip->ecc.strength)
+		return nand_check_ecc_caps(chip, denali->ecc_caps, oobavail);
+
+	/*
+	 * We want .size and .strength closest to the chip's requirement
+	 * unless NAND_ECC_MAXIMIZE is requested.
+	 */
+	if (!(chip->ecc.options & NAND_ECC_MAXIMIZE)) {
+		ret = nand_match_ecc_req(chip, denali->ecc_caps, oobavail);
+		if (!ret)
+			return 0;
+	}
+
+	/* Max ECC strength is the last thing we can do */
+	return nand_maximize_ecc(chip, denali->ecc_caps, oobavail);
+}
 
 static int denali_ooblayout_ecc(struct mtd_info *mtd, int section,
 				struct mtd_oob_region *oobregion)
@@ -1588,34 +1613,26 @@ int denali_init(struct denali_nand_info *denali)
 	/* no subpage writes on denali */
 	chip->options |= NAND_NO_SUBPAGE_WRITE;
 
-	/*
-	 * Denali Controller only support 15bit and 8bit ECC in MRST,
-	 * so just let controller do 15bit ECC for MLC and 8bit ECC for
-	 * SLC if possible.
-	 * */
-	if (!nand_is_slc(chip) &&
-	    (mtd->oobsize > (denali->bbtskipbytes +
-	    ECC_15BITS * (mtd->writesize /
-	    ECC_SECTOR_SIZE)))) {
-		/* if MLC OOB size is large enough, use 15bit ECC*/
-		chip->ecc.strength = 15;
-		chip->ecc.bytes = ECC_15BITS;
-		iowrite32(15, denali->flash_reg + ECC_CORRECTION);
-	} else if (mtd->oobsize < (denali->bbtskipbytes +
-			ECC_8BITS * (mtd->writesize /
-			ECC_SECTOR_SIZE))) {
-		pr_err("Your NAND chip OOB is not large enough to contain 8bit ECC correction codes");
+	ret = denali_ecc_setup(mtd, chip, denali);
+	if (ret) {
+		dev_err(denali->dev, "Failed to setup ECC settings.\n");
 		goto failed_req_irq;
-	} else {
-		chip->ecc.strength = 8;
-		chip->ecc.bytes = ECC_8BITS;
-		iowrite32(8, denali->flash_reg + ECC_CORRECTION);
 	}
 
+	dev_dbg(denali->dev,
+		"chosen ECC settings: step=%d, strength=%d, bytes=%d\n",
+		chip->ecc.size, chip->ecc.strength, chip->ecc.bytes);
+
+	iowrite32(chip->ecc.strength, denali->flash_reg + ECC_CORRECTION);
+
+	iowrite32(chip->ecc.size, denali->flash_reg + CFG_DATA_BLOCK_SIZE);
+	iowrite32(chip->ecc.size, denali->flash_reg + CFG_LAST_DATA_BLOCK_SIZE);
+	/* chip->ecc.steps is set by nand_scan_tail(); not available here */
+	iowrite32(mtd->writesize / chip->ecc.size,
+		  denali->flash_reg + CFG_NUM_DATA_BLOCKS);
+
 	mtd_set_ooblayout(mtd, &denali_ooblayout_ops);
 
-	/* override the default read operations */
-	chip->ecc.size = ECC_SECTOR_SIZE;
 	chip->ecc.read_page = denali_read_page;
 	chip->ecc.read_page_raw = denali_read_page_raw;
 	chip->ecc.write_page = denali_write_page;
diff --git a/drivers/mtd/nand/denali.h b/drivers/mtd/nand/denali.h
index 37833535a7a3..a06ed741b550 100644
--- a/drivers/mtd/nand/denali.h
+++ b/drivers/mtd/nand/denali.h
@@ -259,6 +259,14 @@
 #define ECC_COR_INFO__MAX_ERRORS GENMASK(6, 0)
 #define ECC_COR_INFO__UNCOR_ERR BIT(7)
 
+#define CFG_DATA_BLOCK_SIZE 0x6b0
+
+#define CFG_LAST_DATA_BLOCK_SIZE 0x6c0
+
+#define CFG_NUM_DATA_BLOCKS 0x6d0
+
+#define CFG_META_DATA_SIZE 0x6e0
+
 #define DMA_ENABLE 0x700
 #define
 	DMA_ENABLE__FLAG BIT(0)
 
@@ -301,8 +309,6 @@
 #define MODE_10 0x08000000
 #define MODE_11 0x0C000000
 
-#define ECC_SECTOR_SIZE 512
-
 struct nand_buf {
 	int head;
 	int tail;
@@ -337,11 +343,13 @@ struct denali_nand_info {
 	int max_banks;
 	unsigned int revision;
 	unsigned int caps;
+	const struct nand_ecc_caps *ecc_caps;
 };
 
 #define DENALI_CAP_HW_ECC_FIXUP BIT(0)
 #define DENALI_CAP_DMA_64BIT BIT(1)
 
+int denali_calc_ecc_bytes(int step_size, int strength);
 extern int denali_init(struct denali_nand_info *denali);
 extern void denali_remove(struct denali_nand_info *denali);
 
diff --git a/drivers/mtd/nand/denali_dt.c b/drivers/mtd/nand/denali_dt.c
index b48430fe3cd4..bd1aa4cf4457 100644
--- a/drivers/mtd/nand/denali_dt.c
+++ b/drivers/mtd/nand/denali_dt.c
@@ -32,10 +32,14 @@ struct denali_dt {
 struct denali_dt_data {
 	unsigned int revision;
 	unsigned int caps;
+	const struct nand_ecc_caps *ecc_caps;
 };
 
+NAND_ECC_CAPS_SINGLE(denali_socfpga_ecc_caps, denali_calc_ecc_bytes,
+		     512, 8, 15);
 static const struct denali_dt_data denali_socfpga_data = {
 	.caps = DENALI_CAP_HW_ECC_FIXUP,
+	.ecc_caps = &denali_socfpga_ecc_caps,
 };
 
 static const struct of_device_id denali_nand_dt_ids[] = {
@@ -64,6 +68,7 @@ static int denali_dt_probe(struct platform_device *pdev)
 	if (data) {
 		denali->revision = data->revision;
 		denali->caps = data->caps;
+		denali->ecc_caps = data->ecc_caps;
 	}
 
 	denali->platform = DT;
diff --git a/drivers/mtd/nand/denali_pci.c b/drivers/mtd/nand/denali_pci.c
index ac843238b77e..37dc0934c24c 100644
--- a/drivers/mtd/nand/denali_pci.c
+++ b/drivers/mtd/nand/denali_pci.c
@@ -27,6 +27,8 @@ static const struct pci_device_id denali_pci_ids[] = {
 };
 MODULE_DEVICE_TABLE(pci, denali_pci_ids);
 
+NAND_ECC_CAPS_SINGLE(denali_pci_ecc_caps, denali_calc_ecc_bytes, 512, 8, 15);
+
 static int denali_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
 {
 	int ret;
@@ -65,6 +67,8 @@ static int denali_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	pci_set_master(dev);
 	denali->dev =
 		&dev->dev;
 	denali->irq = dev->irq;
+	denali->ecc_caps = &denali_pci_ecc_caps;
+	denali->nand.ecc.options |= NAND_ECC_MAXIMIZE;
 
 	ret = pci_request_regions(dev, DENALI_NAND_NAME);
 	if (ret) {
-- 
cgit v1.2.3-59-g8ed1b

From 91300dd67baec4f046aa76e3a2e8222d15cc76e9 Mon Sep 17 00:00:00 2001
From: Masahiro Yamada
Date: Wed, 7 Jun 2017 20:52:14 +0900
Subject: mtd: nand: denali_dt: add compatible strings for UniPhier SoC variants

Add two compatible strings for UniPhier SoC family.

"socionext,uniphier-denali-nand-v5a" is used on UniPhier sLD3, LD4,
Pro4, sLD8.

"socionext,uniphier-denali-nand-v5b" is used on UniPhier Pro5, PXs2,
LD6b, LD11, LD20.

Signed-off-by: Masahiro Yamada
Acked-by: Rob Herring
Signed-off-by: Boris Brezillon
---
 .../devicetree/bindings/mtd/denali-nand.txt | 6 ++++++
 drivers/mtd/nand/denali_dt.c | 25 +++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/mtd/denali-nand.txt b/Documentation/devicetree/bindings/mtd/denali-nand.txt
index b7742a7363ea..504291d2e5c2 100644
--- a/Documentation/devicetree/bindings/mtd/denali-nand.txt
+++ b/Documentation/devicetree/bindings/mtd/denali-nand.txt
@@ -3,6 +3,8 @@
 Required properties:
   - compatible : should be one of the following:
 	"altr,socfpga-denali-nand" - for Altera SOCFPGA
+	"socionext,uniphier-denali-nand-v5a" - for Socionext UniPhier (v5a)
+	"socionext,uniphier-denali-nand-v5b" - for Socionext UniPhier (v5b)
   - reg : should contain registers location and length for data and reg.
   - reg-names: Should contain the reg names "nand_data" and "denali_reg"
   - interrupts : The interrupt number.
@@ -10,8 +12,12 @@ Required properties:
 
 Optional properties:
   - nand-ecc-step-size: see nand.txt for details. If present, the value must be
       512 for "altr,socfpga-denali-nand"
+      1024 for "socionext,uniphier-denali-nand-v5a"
+      1024 for "socionext,uniphier-denali-nand-v5b"
   - nand-ecc-strength: see nand.txt for details.
                       Valid values are:
       8, 15 for "altr,socfpga-denali-nand"
+      8, 16, 24 for "socionext,uniphier-denali-nand-v5a"
+      8, 16 for "socionext,uniphier-denali-nand-v5b"
   - nand-ecc-maximize: see nand.txt for details
 
 The device tree may optionally contain sub-nodes describing partitions of the
diff --git a/drivers/mtd/nand/denali_dt.c b/drivers/mtd/nand/denali_dt.c
index bd1aa4cf4457..be598230c108 100644
--- a/drivers/mtd/nand/denali_dt.c
+++ b/drivers/mtd/nand/denali_dt.c
@@ -42,11 +42,36 @@ static const struct denali_dt_data denali_socfpga_data = {
 	.ecc_caps = &denali_socfpga_ecc_caps,
 };
 
+NAND_ECC_CAPS_SINGLE(denali_uniphier_v5a_ecc_caps, denali_calc_ecc_bytes,
+		     1024, 8, 16, 24);
+static const struct denali_dt_data denali_uniphier_v5a_data = {
+	.caps = DENALI_CAP_HW_ECC_FIXUP |
+		DENALI_CAP_DMA_64BIT,
+	.ecc_caps = &denali_uniphier_v5a_ecc_caps,
+};
+
+NAND_ECC_CAPS_SINGLE(denali_uniphier_v5b_ecc_caps, denali_calc_ecc_bytes,
+		     1024, 8, 16);
+static const struct denali_dt_data denali_uniphier_v5b_data = {
+	.revision = 0x0501,
+	.caps = DENALI_CAP_HW_ECC_FIXUP |
+		DENALI_CAP_DMA_64BIT,
+	.ecc_caps = &denali_uniphier_v5b_ecc_caps,
+};
+
 static const struct of_device_id denali_nand_dt_ids[] = {
 	{
 		.compatible = "altr,socfpga-denali-nand",
 		.data = &denali_socfpga_data,
 	},
+	{
+		.compatible = "socionext,uniphier-denali-nand-v5a",
+		.data = &denali_uniphier_v5a_data,
+	},
+	{
+		.compatible = "socionext,uniphier-denali-nand-v5b",
+		.data = &denali_uniphier_v5b_data,
+	},
 	{ /* sentinel */ }
 };
 MODULE_DEVICE_TABLE(of, denali_nand_dt_ids);
-- 
cgit v1.2.3-59-g8ed1b

From df1d178879f53d7f790d4bc372aa0fc5d5423328 Mon Sep 17 00:00:00 2001
From: Brian Norris
Date: Tue, 23 May 2017 07:30:19 +0200
Subject: dt-bindings: mtd: make partitions doc a bit more generic
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Currently the only documented partitioning is "fixed-partitions" but
there are more methods in use that we may want to support in the future.
Mention them and make it clear Fixed Partitions are just a single case.

Signed-off-by: Brian Norris
Signed-off-by: Rafał Miłecki
Acked-by: Rob Herring
---
 .../devicetree/bindings/mtd/partition.txt | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/devicetree/bindings/mtd/partition.txt b/Documentation/devicetree/bindings/mtd/partition.txt
index 81a224da63be..36f3b769a626 100644
--- a/Documentation/devicetree/bindings/mtd/partition.txt
+++ b/Documentation/devicetree/bindings/mtd/partition.txt
@@ -1,29 +1,49 @@
-Representing flash partitions in devicetree
+Flash partitions in device tree
+===============================
 
-Partitions can be represented by sub-nodes of an mtd device. This can be used
+Flash devices can be partitioned into one or more functional ranges (e.g. "boot
+code", "nvram", "kernel").
+
+Different devices may be partitioned in a different ways. Some may use a fixed
+flash layout set at production time. Some may use on-flash table that describes
+the geometry and naming/purpose of each functional region. It is also possible
+to see these methods mixed.
+
+To assist system software in locating partitions, we allow describing which
+method is used for a given flash device. To describe the method there should be
+a subnode of the flash device that is named 'partitions'. It must have a
+'compatible' property, which is used to identify the method to use.
+
+We currently only document a binding for fixed layouts.
+
+
+Fixed Partitions
+================
+
+Partitions can be represented by sub-nodes of a flash device. This can be used
 on platforms which have strong conventions about which portions of a flash are
 used for what purposes, but which don't use an on-flash partition table such
 as RedBoot.
 
-The partition table should be a subnode of the mtd node and should be named
+The partition table should be a subnode of the flash node and should be named
 'partitions'.
This node should have the following property: - compatible : (required) must be "fixed-partitions" Partitions are then defined in subnodes of the partitions node. -For backwards compatibility partitions as direct subnodes of the mtd device are +For backwards compatibility partitions as direct subnodes of the flash device are supported. This use is discouraged. NOTE: also for backwards compatibility, direct subnodes that have a compatible string are not considered partitions, as they may be used for other bindings. #address-cells & #size-cells must both be present in the partitions subnode of the -mtd device. There are two valid values for both: +flash device. There are two valid values for both: <1>: for partitions that require a single 32-bit cell to represent their size/address (aka the value is below 4 GiB) <2>: for partitions that require two 32-bit cells to represent their size/address (aka the value is 4 GiB or greater). Required properties: -- reg : The partition's offset and size within the mtd bank. +- reg : The partition's offset and size within the flash Optional properties: - label : The label / name for this partition. If omitted, the label is taken -- cgit v1.2.3-59-g8ed1b From fe496e23b74852cbb2df7e0a6e26752131c41bb6 Mon Sep 17 00:00:00 2001 From: Tom Rini Date: Wed, 21 Jun 2017 08:22:06 -0400 Subject: dt-bindings: mtd: elm: Correct compatible string requirement The binding says that the compatible string must be "ti,am33xx-elm" but the code checks only for, and all functioning users set, this as "ti,am3352-elm" so correct the binding. 
Cc: David Woodhouse Cc: Brian Norris Cc: Boris Brezillon Cc: Marek Vasut Cc: Richard Weinberger Cc: Cyrille Pitchen Cc: Rob Herring Cc: Mark Rutland Signed-off-by: Tom Rini Signed-off-by: Boris Brezillon --- Documentation/devicetree/bindings/mtd/elm.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/mtd/elm.txt b/Documentation/devicetree/bindings/mtd/elm.txt index 8c1528c421d4..59ddc61c1076 100644 --- a/Documentation/devicetree/bindings/mtd/elm.txt +++ b/Documentation/devicetree/bindings/mtd/elm.txt @@ -1,7 +1,7 @@ Error location module Required properties: -- compatible: Must be "ti,am33xx-elm" +- compatible: Must be "ti,am3352-elm" - reg: physical base address and size of the registers map. - interrupts: Interrupt number for the elm. -- cgit v1.2.3-59-g8ed1b From a7adb70a73b7cb220f4515745d2671d2226e0097 Mon Sep 17 00:00:00 2001 From: Tom Rini Date: Wed, 21 Jun 2017 08:14:54 -0400 Subject: dt-bindings: gpmc: Correct location of generic gpmc binding The binding bus/ti-gpmc.txt has been moved to memory-controllers/omap-gpmc.txt. Update all references to this in order to make reading and understanding a given binding easier. 
Cc: David Woodhouse Cc: Brian Norris Cc: Boris Brezillon Cc: Marek Vasut Cc: Richard Weinberger Cc: Cyrille Pitchen Cc: Rob Herring Cc: Mark Rutland Signed-off-by: Tom Rini Signed-off-by: Boris Brezillon --- Documentation/devicetree/bindings/mtd/gpmc-nand.txt | 2 +- Documentation/devicetree/bindings/mtd/gpmc-nor.txt | 4 ++-- Documentation/devicetree/bindings/mtd/gpmc-onenand.txt | 2 +- Documentation/devicetree/bindings/net/gpmc-eth.txt | 4 ++-- 4 files changed, 6 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt index 174f68c26c1b..dd559045593d 100644 --- a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt +++ b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt @@ -5,7 +5,7 @@ the GPMC controller with a name of "nand". All timing relevant properties as well as generic gpmc child properties are explained in a separate documents - please refer to -Documentation/devicetree/bindings/bus/ti-gpmc.txt +Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt For NAND specific properties such as ECC modes or bus width, please refer to Documentation/devicetree/bindings/mtd/nand.txt diff --git a/Documentation/devicetree/bindings/mtd/gpmc-nor.txt b/Documentation/devicetree/bindings/mtd/gpmc-nor.txt index 4828c17bb784..131d3a74d0bd 100644 --- a/Documentation/devicetree/bindings/mtd/gpmc-nor.txt +++ b/Documentation/devicetree/bindings/mtd/gpmc-nor.txt @@ -5,7 +5,7 @@ child nodes of the GPMC controller with a name of "nor". All timing relevant properties as well as generic GPMC child properties are explained in a separate documents. Please refer to -Documentation/devicetree/bindings/bus/ti-gpmc.txt +Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt Required properties: - bank-width: Width of NOR flash in bytes. GPMC supports 8-bit and
GPMC supports 8-bit and @@ -28,7 +28,7 @@ Required properties: Optional properties: - gpmc,XXX Additional GPMC timings and settings parameters. See - Documentation/devicetree/bindings/bus/ti-gpmc.txt + Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt Optional properties for partition table parsing: - #address-cells: should be set to 1 diff --git a/Documentation/devicetree/bindings/mtd/gpmc-onenand.txt b/Documentation/devicetree/bindings/mtd/gpmc-onenand.txt index 5d8fa527c496..b6e8bfd024f4 100644 --- a/Documentation/devicetree/bindings/mtd/gpmc-onenand.txt +++ b/Documentation/devicetree/bindings/mtd/gpmc-onenand.txt @@ -5,7 +5,7 @@ the GPMC controller with a name of "onenand". All timing relevant properties as well as generic gpmc child properties are explained in a separate documents - please refer to -Documentation/devicetree/bindings/bus/ti-gpmc.txt +Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt Required properties: diff --git a/Documentation/devicetree/bindings/net/gpmc-eth.txt b/Documentation/devicetree/bindings/net/gpmc-eth.txt index ace4a64b3695..f7da3d73ca1b 100644 --- a/Documentation/devicetree/bindings/net/gpmc-eth.txt +++ b/Documentation/devicetree/bindings/net/gpmc-eth.txt @@ -9,7 +9,7 @@ the GPMC controller with an "ethernet" name. All timing relevant properties as well as generic GPMC child properties are explained in a separate documents. Please refer to -Documentation/devicetree/bindings/bus/ti-gpmc.txt +Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt For the properties relevant to the ethernet controller connected to the GPMC refer to the binding documentation of the device. For example, the documentation @@ -43,7 +43,7 @@ Required properties: Optional properties: - gpmc,XXX Additional GPMC timings and settings parameters. 
See - Documentation/devicetree/bindings/bus/ti-gpmc.txt + Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt Example: -- cgit v1.2.3-59-g8ed1b From 3fde00a014ed1b0716a1d73fdebd96188e2ff6fe Mon Sep 17 00:00:00 2001 From: Florian Fainelli Date: Mon, 26 Jun 2017 14:15:02 -0700 Subject: dt-bindings: Document the Broadcom STB wake-up timer node Document the binding for the Broadcom STB SoCs wake-up timer node allowing the system to generate alarms and exit low-power states. Acked-by: Rob Herring Signed-off-by: Florian Fainelli Signed-off-by: Alexandre Belloni --- .../bindings/rtc/brcm,brcmstb-waketimer.txt | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) create mode 100644 Documentation/devicetree/bindings/rtc/brcm,brcmstb-waketimer.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/rtc/brcm,brcmstb-waketimer.txt b/Documentation/devicetree/bindings/rtc/brcm,brcmstb-waketimer.txt new file mode 100644 index 000000000000..1d990bcc0baf --- /dev/null +++ b/Documentation/devicetree/bindings/rtc/brcm,brcmstb-waketimer.txt @@ -0,0 +1,22 @@ +Broadcom STB wake-up Timer + +The Broadcom STB wake-up timer provides a 27 MHz resolution timer, with the +ability to wake up the system from low-power suspend/standby modes.
+ +Required properties: +- compatible : should contain "brcm,brcmstb-waketimer" +- reg : the register start and length for the WKTMR block +- interrupts : The TIMER interrupt +- interrupt-parent: The phandle to the Always-On (AON) Power Management (PM) L2 + interrupt controller node +- clocks : The phandle to the UPG fixed clock (27 MHz domain) + +Example: + +waketimer@f0411580 { + compatible = "brcm,brcmstb-waketimer"; + reg = <0xf0411580 0x14>; + interrupts = <0x3>; + interrupt-parent = <&aon_pm_l2_intc>; + clocks = <&upg_fixed>; +}; -- cgit v1.2.3-59-g8ed1b From ecebcd4da6cde7bfcde62d06488faba164b70b37 Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Tue, 4 Jul 2017 13:16:41 -0600 Subject: docs: Do not include from kernel/rcu/srcu.c That file went away with commit bd8cc5a062f4 (srcu: Remove Classic SRCU) during the 4.13 merge window, leading to errors like: Error: Cannot open file ./kernel/rcu/srcu.c during the docs build. Reported-by: Linus Torvalds Signed-off-by: Jonathan Corbet --- Documentation/driver-api/basics.rst | 3 --- 1 file changed, 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/driver-api/basics.rst b/Documentation/driver-api/basics.rst index 472e7a664d13..ab82250c7727 100644 --- a/Documentation/driver-api/basics.rst +++ b/Documentation/driver-api/basics.rst @@ -106,9 +106,6 @@ Kernel utility functions .. kernel-doc:: kernel/sys.c :export: -.. kernel-doc:: kernel/rcu/srcu.c - :export: - .. kernel-doc:: kernel/rcu/tree.c :export: -- cgit v1.2.3-59-g8ed1b From df8f4c6c0275216498340a64c7d2674377bd3e78 Mon Sep 17 00:00:00 2001 From: Ulrich Hecht Date: Thu, 27 Apr 2017 16:37:43 +0200 Subject: dt-bindings: pwm: Add R-Car M3-W device tree bindings Add device tree bindings for the PWM controller found on R-Car M3-W SoCs.
Signed-off-by: Ulrich Hecht Reviewed-by: Geert Uytterhoeven Reviewed-by: Simon Horman Acked-by: Rob Herring Signed-off-by: Thierry Reding --- Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt | 1 + 1 file changed, 1 insertion(+) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt b/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt index d6de64335022..7e94b802395d 100644 --- a/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt +++ b/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt @@ -8,6 +8,7 @@ Required Properties: - "renesas,pwm-r8a7791": for R-Car M2-W - "renesas,pwm-r8a7794": for R-Car E2 - "renesas,pwm-r8a7795": for R-Car H3 + - "renesas,pwm-r8a7796": for R-Car M3-W - reg: base address and length of the registers block for the PWM. - #pwm-cells: should be 2. See pwm.txt in this directory for a description of the cells format. -- cgit v1.2.3-59-g8ed1b From d7f673d8a0776f3f791fd795b409060ba808b62a Mon Sep 17 00:00:00 2001 From: Fabrice GASNIER Date: Wed, 14 Jun 2017 17:13:16 +0200 Subject: dt-bindings: pwm: Update STM32 timers clock names Clock name has been updated during driver/DT binding review: https://lkml.org/lkml/2016/12/13/718 Update DT binding doc to reflect this. 
Fixes: cd9a99c2f8e8 (dt-bindings: pwm: Add STM32 bindings) Signed-off-by: Fabrice Gasnier Acked-by: Rob Herring Signed-off-by: Thierry Reding --- Documentation/devicetree/bindings/pwm/pwm-stm32.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pwm/pwm-stm32.txt b/Documentation/devicetree/bindings/pwm/pwm-stm32.txt index 6dd040363e5e..3e6d55018d7a 100644 --- a/Documentation/devicetree/bindings/pwm/pwm-stm32.txt +++ b/Documentation/devicetree/bindings/pwm/pwm-stm32.txt @@ -24,7 +24,7 @@ Example: compatible = "st,stm32-timers"; reg = <0x40010000 0x400>; clocks = <&rcc 0 160>; - clock-names = "clk_int"; + clock-names = "int"; pwm { compatible = "st,stm32-pwm"; -- cgit v1.2.3-59-g8ed1b From c571123c8a94cfbc88e70be4e8883529181417ce Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Sat, 17 Jun 2017 12:26:44 -0300 Subject: pwm: Standardize document format Each text file under Documentation follows a different format. Some don't even have titles! Change its representation to follow the adopted standard, using ReST markup for it to be parseable by Sphinx: - mark document title; - mark literal blocks; - better format the parameters. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Thierry Reding --- Documentation/pwm.txt | 46 ++++++++++++++++++++++++++++------------------ 1 file changed, 28 insertions(+), 18 deletions(-) (limited to 'Documentation') diff --git a/Documentation/pwm.txt b/Documentation/pwm.txt index 789b27c6ec99..8fbf0aa3ba2d 100644 --- a/Documentation/pwm.txt +++ b/Documentation/pwm.txt @@ -1,4 +1,6 @@ +====================================== Pulse Width Modulation (PWM) interface +====================================== This provides an overview about the Linux PWM interface @@ -16,7 +18,7 @@ Users of the legacy PWM API use unique IDs to refer to PWM devices. 
Instead of referring to a PWM device via its unique ID, board setup code should instead register a static mapping that can be used to match PWM -consumers to providers, as given in the following example: +consumers to providers, as given in the following example:: static struct pwm_lookup board_pwm_lookup[] = { PWM_LOOKUP("tegra-pwm", 0, "pwm-backlight", NULL, @@ -40,9 +42,9 @@ New users should use the pwm_get() function and pass to it the consumer device or a consumer name. pwm_put() is used to free the PWM device. Managed variants of these functions, devm_pwm_get() and devm_pwm_put(), also exist. -After being requested, a PWM has to be configured using: +After being requested, a PWM has to be configured using:: -int pwm_apply_state(struct pwm_device *pwm, struct pwm_state *state); + int pwm_apply_state(struct pwm_device *pwm, struct pwm_state *state); This API controls both the PWM period/duty_cycle config and the enable/disable state. @@ -72,11 +74,14 @@ interface is provided to use the PWMs from userspace. It is exposed at pwmchipN, where N is the base of the PWM chip. Inside the directory you will find: -npwm - The number of PWM channels this chip supports (read-only). + npwm + The number of PWM channels this chip supports (read-only). -export - Exports a PWM channel for use with sysfs (write-only). + export + Exports a PWM channel for use with sysfs (write-only). -unexport - Unexports a PWM channel from sysfs (write-only). + unexport + Unexports a PWM channel from sysfs (write-only). The PWM channels are numbered using a per-chip index from 0 to npwm-1. @@ -84,21 +89,26 @@ When a PWM channel is exported a pwmX directory will be created in the pwmchipN directory it is associated with, where X is the number of the channel that was exported. The following properties will then be available: -period - The total period of the PWM signal (read/write). - Value is in nanoseconds and is the sum of the active and inactive - time of the PWM. 
+ period + The total period of the PWM signal (read/write). + Value is in nanoseconds and is the sum of the active and inactive + time of the PWM. -duty_cycle - The active time of the PWM signal (read/write). - Value is in nanoseconds and must be less than the period. + duty_cycle + The active time of the PWM signal (read/write). + Value is in nanoseconds and must be less than the period. -polarity - Changes the polarity of the PWM signal (read/write). - Writes to this property only work if the PWM chip supports changing - the polarity. The polarity can only be changed if the PWM is not - enabled. Value is the string "normal" or "inversed". + polarity + Changes the polarity of the PWM signal (read/write). + Writes to this property only work if the PWM chip supports changing + the polarity. The polarity can only be changed if the PWM is not + enabled. Value is the string "normal" or "inversed". -enable - Enable/disable the PWM signal (read/write). - 0 - disabled - 1 - enabled + enable + Enable/disable the PWM signal (read/write). + + - 0 - disabled + - 1 - enabled Implementing a PWM driver ------------------------- -- cgit v1.2.3-59-g8ed1b From 8517bb1f19679bf2bf6c29a98b7a4f3a78629554 Mon Sep 17 00:00:00 2001 From: Jerome Brunet Date: Thu, 8 Jun 2017 14:24:14 +0200 Subject: dt-bindings: pwm: meson: Add compatible for gxbb ao PWMs Add compatible string to properly handle the PWMs found in the AO domain of the gxbb (and gxl) family. 
Acked-by: Neil Armstrong Signed-off-by: Jerome Brunet Acked-by: Rob Herring Signed-off-by: Thierry Reding --- Documentation/devicetree/bindings/pwm/pwm-meson.txt | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/pwm/pwm-meson.txt b/Documentation/devicetree/bindings/pwm/pwm-meson.txt index 5376a4468cb6..5b07bebbf6f7 100644 --- a/Documentation/devicetree/bindings/pwm/pwm-meson.txt +++ b/Documentation/devicetree/bindings/pwm/pwm-meson.txt @@ -2,7 +2,9 @@ Amlogic Meson PWM Controller ============================ Required properties: -- compatible: Shall contain "amlogic,meson8b-pwm" or "amlogic,meson-gxbb-pwm". +- compatible: Shall contain "amlogic,meson8b-pwm" + or "amlogic,meson-gxbb-pwm" + or "amlogic,meson-gxbb-ao-pwm" - #pwm-cells: Should be 3. See pwm.txt in this directory for a description of the cells format. -- cgit v1.2.3-59-g8ed1b From cdcca896aee19e338adf3000512cade4befa5c69 Mon Sep 17 00:00:00 2001 From: Serge Semin Date: Wed, 14 Dec 2016 02:49:19 +0300 Subject: NTB: Add new Memory Windows API documentation Since the new API slightly changes the way a typical NTB client driver works, the documentation file needs to be appropriately updated. Signed-off-by: Serge Semin Acked-by: Allen Hubbe Signed-off-by: Jon Mason --- Documentation/ntb.txt | 99 ++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 91 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ntb.txt b/Documentation/ntb.txt index 1d9bbabb6c79..a5af4f0159f3 100644 --- a/Documentation/ntb.txt +++ b/Documentation/ntb.txt @@ -1,14 +1,16 @@ # NTB Drivers NTB (Non-Transparent Bridge) is a type of PCI-Express bridge chip that connects -the separate memory systems of two computers to the same PCI-Express fabric. -Existing NTB hardware supports a common feature set, including scratchpad -registers, doorbell registers, and memory translation windows. 
Scratchpad -registers are read-and-writable registers that are accessible from either side -of the device, so that peers can exchange a small amount of information at a -fixed address. Doorbell registers provide a way for peers to send interrupt -events. Memory windows allow translated read and write access to the peer -memory. +the separate memory systems of two or more computers to the same PCI-Express +fabric. Existing NTB hardware supports a common feature set: doorbell +registers and memory translation windows, as well as non-common features like +scratchpad and message registers. Scratchpad registers are read-and-writable +registers that are accessible from either side of the device, so that peers can +exchange a small amount of information at a fixed address. Message registers can +be utilized for the same purpose. Additionally, they are provided with +special status bits to make sure the information isn't rewritten by another +peer. Doorbell registers provide a way for peers to send interrupt events. +Memory windows allow translated read and write access to the peer memory. ## NTB Core Driver (ntb) @@ -26,6 +28,87 @@ as ntb hardware, or hardware drivers, are inserted and removed. The registration uses the Linux Device framework, so it should feel familiar to anyone who has written a pci driver. +### NTB Typical client driver implementation + +The primary purpose of NTB is to share a piece of memory between at least two +systems. So NTB device features like Scratchpad/Message registers are +mainly used to perform the proper memory window initialization. Typically +there are two types of memory window interfaces supported by the NTB API: +inbound translation configured on the local ntb port and outbound translation +configured by the peer, on the peer ntb port. The first type is
The first type is +depicted on the next figure + +Inbound translation: + Memory: Local NTB Port: Peer NTB Port: Peer MMIO: + ____________ + | dma-mapped |-ntb_mw_set_trans(addr) | + | memory | _v____________ | ______________ + | (addr) |<======| MW xlat addr |<====| MW base addr |<== memory-mapped IO + |------------| |--------------| | |--------------| + +So typical scenario of the first type memory window initialization looks: +1) allocate a memory region, 2) put translated address to NTB config, +3) somehow notify a peer device of performed initialization, 4) peer device +maps corresponding outbound memory window so to have access to the shared +memory region. + +The second type of interface, that implies the shared windows being +initialized by a peer device, is depicted on the figure: + +Outbound translation: + Memory: Local NTB Port: Peer NTB Port: Peer MMIO: + ____________ ______________ + | dma-mapped | | | MW base addr |<== memory-mapped IO + | memory | | |--------------| + | (addr) |<===================| MW xlat addr |<-ntb_peer_mw_set_trans(addr) + |------------| | |--------------| + +Typical scenario of the second type interface initialization would be: +1) allocate a memory region, 2) somehow deliver a translated address to a peer +device, 3) peer puts the translated address to NTB config, 4) peer device maps +outbound memory window so to have access to the shared memory region. + +As one can see the described scenarios can be combined in one portable +algorithm. 
+ Local device: + 1) Allocate memory for a shared window + 2) Initialize the memory window with the translated address of the allocated region + (it may fail if local memory window initialization is unsupported) + 3) Send the translated address and memory window index to a peer device + Peer device: + 1) Initialize the memory window with the retrieved address of the memory + region allocated by the other device (it may fail if peer memory window + initialization is unsupported) + 2) Map the outbound memory window + +In accordance with this scenario, the NTB Memory Window API can be used as +follows: + Local device: + 1) ntb_mw_count(pidx) - retrieve the number of memory ranges that can + be allocated for memory windows between the local device and the peer device + of the port with the specified index. + 2) ntb_get_align(pidx, midx) - retrieve parameters restricting the + shared memory region alignment and size. Then memory can be properly + allocated. + 3) Allocate a physically contiguous memory region in compliance with + restrictions retrieved in 2). + 4) ntb_mw_set_trans(pidx, midx) - try to set the translation address of + the memory window with the specified index for the defined peer device + (it may fail if local translated address setting is not supported) + 5) Send the translated base address (usually together with the memory window + number) to the peer device using, for instance, scratchpad or message + registers. + Peer device: + 1) ntb_peer_mw_set_trans(pidx, midx) - try to set the translated address received + from the other device (related to pidx) for the specified memory + window. It may fail if the retrieved address, for instance, exceeds the + maximum possible address or isn't properly aligned. + 2) ntb_peer_mw_get_addr(widx) - retrieve the MMIO address to map the memory + window so as to have access to the shared memory. + +It is also worth noting that ntb_mw_count(pidx) should return the +same value as ntb_peer_mw_count() on the peer with port index pidx.
+ ### NTB Transport Client (ntb\_transport) and NTB Netdev (ntb\_netdev) The primary client for NTB is the Transport client, used in tandem with NTB -- cgit v1.2.3-59-g8ed1b From 7f1e988dffbd808ad17f22b6b88a9aa42ebe739a Mon Sep 17 00:00:00 2001 From: Linus Walleij Date: Tue, 30 May 2017 09:53:31 +0200 Subject: rtc: gemini: Augment DT bindings for Faraday The Gemini RTC is actually a standard IP block from Faraday Technology called FTRTC010. Rename the bindings, add the generic compatible string and add definitions for the two available clocks. Cc: devicetree@vger.kernel.org Cc: Po-Yu Chuang Acked-by: Rob Herring Signed-off-by: Linus Walleij Signed-off-by: Alexandre Belloni --- .../devicetree/bindings/rtc/cortina,gemini.txt | 14 ----------- .../devicetree/bindings/rtc/faraday,ftrtc010.txt | 28 ++++++++++++++++++++++ 2 files changed, 28 insertions(+), 14 deletions(-) delete mode 100644 Documentation/devicetree/bindings/rtc/cortina,gemini.txt create mode 100644 Documentation/devicetree/bindings/rtc/faraday,ftrtc010.txt (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/rtc/cortina,gemini.txt b/Documentation/devicetree/bindings/rtc/cortina,gemini.txt deleted file mode 100644 index 4ce4e794ddbb..000000000000 --- a/Documentation/devicetree/bindings/rtc/cortina,gemini.txt +++ /dev/null @@ -1,14 +0,0 @@ -* Cortina Systems Gemini RTC - -Gemini SoC real-time clock. 
- -Required properties: -- compatible : Should be "cortina,gemini-rtc" - -Examples: - -rtc@45000000 { - compatible = "cortina,gemini-rtc"; - reg = <0x45000000 0x100>; - interrupts = <17 IRQ_TYPE_LEVEL_HIGH>; -}; diff --git a/Documentation/devicetree/bindings/rtc/faraday,ftrtc010.txt b/Documentation/devicetree/bindings/rtc/faraday,ftrtc010.txt new file mode 100644 index 000000000000..e3938f5e0b6c --- /dev/null +++ b/Documentation/devicetree/bindings/rtc/faraday,ftrtc010.txt @@ -0,0 +1,28 @@ +* Faraday Technology FTRTC010 Real Time Clock + +This RTC appears in, for example, the Storlink Gemini family of +SoCs. + +Required properties: +- compatible : Should be one of: + "faraday,ftrtc010" + "cortina,gemini-rtc", "faraday,ftrtc010" + +Optional properties: +- clocks: when present should contain clock references to the + PCLK and EXTCLK clocks. Faraday calls the latter CLK1HZ and + says the clock should be 1 Hz, but implementers actually seem + to choose different clocks here, like Cortina who chose + 32768 Hz (a typical low-power clock). +- clock-names: should name the clocks "PCLK" and "EXTCLK" + respectively. + +Examples: + +rtc@45000000 { + compatible = "cortina,gemini-rtc"; + reg = <0x45000000 0x100>; + interrupts = <17 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&foo 0>, <&foo 1>; + clock-names = "PCLK", "EXTCLK"; +}; -- cgit v1.2.3-59-g8ed1b From d2be279bcd8055ddfd92cc5f5d305eb3651e059b Mon Sep 17 00:00:00 2001 From: Amelie Delaunay Date: Thu, 6 Jul 2017 10:47:44 +0200 Subject: dt-bindings: rtc: stm32: add support for STM32H7 This patch documents support for STM32H7 Real Time Clock. It introduces a new compatible and reworks the clock definitions. On STM32H7 we have a 'pclk' clock for register access, in addition to the 'rtc_ck' clock.
Acked-by: Rob Herring Signed-off-by: Amelie Delaunay Signed-off-by: Alexandre Belloni --- .../devicetree/bindings/rtc/st,stm32-rtc.txt | 32 ++++++++++++++++++---- 1 file changed, 27 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt b/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt index e2837b951237..0a4c371a9b7a 100644 --- a/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt +++ b/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt @@ -1,17 +1,25 @@ STM32 Real Time Clock Required properties: -- compatible: "st,stm32-rtc". +- compatible: can be either "st,stm32-rtc" or "st,stm32h7-rtc", depending on whether + the device is compatible with stm32(f4/f7) or stm32h7. - reg: address range of rtc register set. -- clocks: reference to the clock entry ck_rtc. +- clocks: can use up to two clocks, depending on part used: + - "rtc_ck": RTC clock source. + It is required on stm32(f4/f7) and stm32h7. + - "pclk": RTC APB interface clock. + It is not present on stm32(f4/f7). + It is required on stm32h7. +- clock-names: must be "rtc_ck" and "pclk". + It is required only on stm32h7. - interrupt-parent: phandle for the interrupt controller. - interrupts: rtc alarm interrupt. - st,syscfg: phandle for pwrcfg, mandatory to disable/enable backup domain (RTC registers) write protection. -Optional properties (to override default ck_rtc parent clock): -- assigned-clocks: reference to the ck_rtc clock entry. -- assigned-clock-parents: phandle of the new parent clock of ck_rtc.
Example: @@ -25,3 +33,17 @@ Example: interrupts = <17 1>; st,syscfg = <&pwrcfg>; }; + + rtc: rtc@58004000 { + compatible = "st,stm32h7-rtc"; + reg = <0x58004000 0x400>; + clocks = <&rcc RTCAPB_CK>, <&rcc RTC_CK>; + clock-names = "pclk", "rtc_ck"; + assigned-clocks = <&rcc RTC_CK>; + assigned-clock-parents = <&rcc LSE_CK>; + interrupt-parent = <&exti>; + interrupts = <17 1>; + interrupt-names = "alarm"; + st,syscfg = <&pwrcfg>; + status = "disabled"; + }; -- cgit v1.2.3-59-g8ed1b From 697e5a47aa12cdab6f2a8b284cc923cdf704eafc Mon Sep 17 00:00:00 2001 From: Alexandre Belloni Date: Thu, 6 Jul 2017 11:42:02 +0200 Subject: rtc: add generic nvmem support Many RTCs have on-board non-volatile storage. It can be battery-backed RAM or an EEPROM. Use the nvmem subsystem to export it to both userspace and in-kernel consumers. This stays compatible with the previous (undocumented) ABI that was using /sys/class/rtc/rtcx/device/nvram to export that memory, but will warn about the deprecation. Signed-off-by: Alexandre Belloni --- Documentation/rtc.txt | 2 + drivers/rtc/Kconfig | 8 ++++ drivers/rtc/Makefile | 1 + drivers/rtc/class.c | 4 ++ drivers/rtc/nvmem.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++ drivers/rtc/rtc-core.h | 8 ++++ include/linux/rtc.h | 7 +++ 7 files changed, 143 insertions(+) create mode 100644 drivers/rtc/nvmem.c (limited to 'Documentation') diff --git a/Documentation/rtc.txt b/Documentation/rtc.txt index 47feb4414b7e..c0c977445fb9 100644 --- a/Documentation/rtc.txt +++ b/Documentation/rtc.txt @@ -164,6 +164,8 @@ offset The amount which the rtc clock has been adjusted in firmware. which are added to or removed from the rtc's base clock per billion ticks. A positive value makes a day pass more slowly, longer, and a negative value makes a day pass more quickly.
+*/nvmem The non volatile storage exported as a raw file, as described + in Documentation/nvmem/nvmem.txt ================ ============================================================== IOCTL interface diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig index d0e4b4a1c2a1..72419ac2c52a 100644 --- a/drivers/rtc/Kconfig +++ b/drivers/rtc/Kconfig @@ -77,6 +77,14 @@ config RTC_DEBUG Say yes here to enable debugging support in the RTC framework and individual RTC drivers. +config RTC_NVMEM + bool "RTC non volatile storage support" + select NVMEM + default RTC_CLASS + help + Say yes here to add support for the non volatile (often battery + backed) storage present on RTCs. + comment "RTC interfaces" config RTC_INTF_SYSFS diff --git a/drivers/rtc/Makefile b/drivers/rtc/Makefile index 4050fc8b9271..acd366b41c85 100644 --- a/drivers/rtc/Makefile +++ b/drivers/rtc/Makefile @@ -15,6 +15,7 @@ ifdef CONFIG_RTC_DRV_EFI rtc-core-y += rtc-efi-platform.o endif +rtc-core-$(CONFIG_RTC_NVMEM) += nvmem.o rtc-core-$(CONFIG_RTC_INTF_DEV) += rtc-dev.o rtc-core-$(CONFIG_RTC_INTF_PROC) += rtc-proc.o rtc-core-$(CONFIG_RTC_INTF_SYSFS) += rtc-sysfs.o diff --git a/drivers/rtc/class.c b/drivers/rtc/class.c index 58e2a05765bb..2ed970d61da1 100644 --- a/drivers/rtc/class.c +++ b/drivers/rtc/class.c @@ -290,6 +290,8 @@ EXPORT_SYMBOL_GPL(rtc_device_register); */ void rtc_device_unregister(struct rtc_device *rtc) { + rtc_nvmem_unregister(rtc); + mutex_lock(&rtc->ops_lock); /* * Remove innards of this RTC, then disable it, before @@ -448,6 +450,8 @@ int __rtc_register_device(struct module *owner, struct rtc_device *rtc) rtc_proc_add_device(rtc); + rtc_nvmem_register(rtc); + rtc->registered = true; dev_info(rtc->dev.parent, "registered as %s\n", dev_name(&rtc->dev)); diff --git a/drivers/rtc/nvmem.c b/drivers/rtc/nvmem.c new file mode 100644 index 000000000000..8567b4ed9ac6 --- /dev/null +++ b/drivers/rtc/nvmem.c @@ -0,0 +1,113 @@ +/* + * RTC subsystem, nvmem interface + * + * Copyright (C) 2017 
Alexandre Belloni + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include +#include +#include +#include +#include + +#include "rtc-core.h" + +/* + * Deprecated ABI compatibility, this should be removed at some point + */ + +static const char nvram_warning[] = "Deprecated ABI, please use nvmem"; + +static ssize_t +rtc_nvram_read(struct file *filp, struct kobject *kobj, + struct bin_attribute *attr, + char *buf, loff_t off, size_t count) +{ + struct rtc_device *rtc = attr->private; + + dev_warn_once(kobj_to_dev(kobj), nvram_warning); + + return nvmem_device_read(rtc->nvmem, off, count, buf); +} + +static ssize_t +rtc_nvram_write(struct file *filp, struct kobject *kobj, + struct bin_attribute *attr, + char *buf, loff_t off, size_t count) +{ + struct rtc_device *rtc = attr->private; + + dev_warn_once(kobj_to_dev(kobj), nvram_warning); + + return nvmem_device_write(rtc->nvmem, off, count, buf); +} + +static int rtc_nvram_register(struct rtc_device *rtc) +{ + int err; + + rtc->nvram = devm_kzalloc(rtc->dev.parent, + sizeof(struct bin_attribute), + GFP_KERNEL); + if (!rtc->nvram) + return -ENOMEM; + + rtc->nvram->attr.name = "nvram"; + rtc->nvram->attr.mode = 0644; + rtc->nvram->private = rtc; + + sysfs_bin_attr_init(rtc->nvram); + + rtc->nvram->read = rtc_nvram_read; + rtc->nvram->write = rtc_nvram_write; + rtc->nvram->size = rtc->nvmem_config->size; + + err = sysfs_create_bin_file(&rtc->dev.parent->kobj, + rtc->nvram); + if (err) { + devm_kfree(rtc->dev.parent, rtc->nvram); + rtc->nvram = NULL; + } + + return err; +} + +static void rtc_nvram_unregister(struct rtc_device *rtc) +{ + sysfs_remove_bin_file(&rtc->dev.parent->kobj, rtc->nvram); +} + +/* + * New ABI, uses nvmem + */ +void rtc_nvmem_register(struct rtc_device *rtc) +{ + if (!rtc->nvmem_config) + return; + + rtc->nvmem_config->dev = &rtc->dev; + 
rtc->nvmem_config->owner = rtc->owner; + rtc->nvmem = nvmem_register(rtc->nvmem_config); + if (IS_ERR_OR_NULL(rtc->nvmem)) + return; + + /* Register the old ABI */ + if (rtc->nvram_old_abi) + rtc_nvram_register(rtc); +} + +void rtc_nvmem_unregister(struct rtc_device *rtc) +{ + if (IS_ERR_OR_NULL(rtc->nvmem)) + return; + + /* unregister the old ABI */ + if (rtc->nvram) + rtc_nvram_unregister(rtc); + + nvmem_unregister(rtc->nvmem); +} diff --git a/drivers/rtc/rtc-core.h b/drivers/rtc/rtc-core.h index 7a4ed2f7c7d7..ecab76a3207c 100644 --- a/drivers/rtc/rtc-core.h +++ b/drivers/rtc/rtc-core.h @@ -45,3 +45,11 @@ static inline const struct attribute_group **rtc_get_dev_attribute_groups(void) return NULL; } #endif + +#ifdef CONFIG_RTC_NVMEM +void rtc_nvmem_register(struct rtc_device *rtc); +void rtc_nvmem_unregister(struct rtc_device *rtc); +#else +static inline void rtc_nvmem_register(struct rtc_device *rtc) {} +static inline void rtc_nvmem_unregister(struct rtc_device *rtc) {} +#endif diff --git a/include/linux/rtc.h b/include/linux/rtc.h index 8e4a5f44f59e..d53ecdc060cf 100644 --- a/include/linux/rtc.h +++ b/include/linux/rtc.h @@ -14,6 +14,7 @@ #include #include +#include #include extern int rtc_month_days(unsigned int month, unsigned int year); @@ -144,6 +145,12 @@ struct rtc_device { bool registered; + struct nvmem_config *nvmem_config; + struct nvmem_device *nvmem; + /* Old ABI support */ + bool nvram_old_abi; + struct bin_attribute *nvram; + #ifdef CONFIG_RTC_INTF_DEV_UIE_EMUL struct work_struct uie_task; struct timer_list uie_timer; -- cgit v1.2.3-59-g8ed1b From 1d278a879081ddc40286500e58868aaee47de257 Mon Sep 17 00:00:00 2001 From: David Howells Date: Wed, 5 Jul 2017 16:25:53 +0100 Subject: VFS: Kill off s_options and helpers Kill off s_options, save/replace_mount_options() and generic_show_options() as all filesystems now implement ->show_options() for themselves. 
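The RTC nvmem patch earlier in this log exports the storage to userspace as a plain binary file. The deprecated `nvram` attribute's read handler just forwards the offset and count to `nvmem_device_read()`, so it can be read at an arbitrary offset, and so can the new `nvmem` file. The sketch below shows this and assumes the caller already knows the file's path. Per the rtc.txt entry added by the patch, the file appears as `*/nvmem` below the RTC's sysfs directory, and the exact name depends on the nvmem core. `read_rtc_nvmem()` is a hypothetical helper, not part of the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to `len` bytes starting at byte `off` of an
 * exported nvmem (or legacy nvram) file. pread() is retried until `len`
 * bytes are read or the end of the storage area is reached.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_rtc_nvmem(const char *path, off_t off,
			      unsigned char *buf, size_t len)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	size_t done = 0;
	while (done < len) {
		ssize_t n = pread(fd, buf + done, len - done, off + (off_t)done);
		if (n < 0) {
			close(fd);
			return -1;
		}
		if (n == 0) /* end of the storage area */
			break;
		done += (size_t)n;
	}

	close(fd);
	return (ssize_t)done;
}
```

Using pread() keeps each request self-describing (offset plus length). This matches how the kernel side receives `off` and `count` in `rtc_nvram_read()`.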
This should make it easier to implement a context-based mount where the mount options can be passed individually over a file descriptor. Signed-off-by: David Howells Signed-off-by: Al Viro --- Documentation/filesystems/vfs.txt | 6 ---- fs/efivarfs/super.c | 1 - fs/namespace.c | 59 --------------------------------------- fs/super.c | 1 - include/linux/fs.h | 9 ------ 5 files changed, 76 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f42b90687d40..ee56a7d10da9 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -1187,12 +1187,6 @@ The underlying reason for the above rules is to make sure, that a mount can be accurately replicated (e.g. umounting and mounting again) based on the information found in /proc/mounts. -A simple method of saving options at mount/remount time and showing -them is provided with the save_mount_options() and -generic_show_options() helper functions. Please note, that using -these may have drawbacks. For more info see header comments for these -functions in fs/namespace.c. 
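The generic_show_options() helper removed by this commit printed the saved option string through mangle(), that is, seq_escape(m, s, " \t\n\\"). This kept spaces, tabs, newlines, and backslashes from breaking the whitespace-separated layout of /proc/mounts. seq_escape() writes each listed character as a backslash followed by three octal digits. A filesystem that now implements ->show_options() itself must escape user-supplied strings the same way. The following sketch re-creates that escaping in userspace for illustration only. It is not kernel code, and `escape_option()` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace re-creation of the escaping requested by the
 * removed mangle() helper: each character found in `esc` is written as
 * '\' plus three octal digits; all other characters are copied unchanged.
 * Returns 0 on success, -1 if `out` (including the terminator) is too small.
 */
static int escape_option(const char *in, const char *esc,
			 char *out, size_t outlen)
{
	size_t o = 0;

	if (outlen == 0)
		return -1;

	for (; *in; in++) {
		unsigned char c = (unsigned char)*in;

		if (strchr(esc, c)) {
			if (o + 4 >= outlen)
				return -1;
			out[o++] = '\\';
			out[o++] = (char)('0' + ((c >> 6) & 07));
			out[o++] = (char)('0' + ((c >> 3) & 07));
			out[o++] = (char)('0' + (c & 07));
		} else {
			if (o + 1 >= outlen)
				return -1;
			out[o++] = (char)c;
		}
	}
	out[o] = '\0';
	return 0;
}
```

For example, an option value of `a b` is shown as `a\040b`, and a backslash becomes `\134`.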
- Resources ========= diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c index d7a7c53803c1..5b68e4294faa 100644 --- a/fs/efivarfs/super.c +++ b/fs/efivarfs/super.c @@ -29,7 +29,6 @@ static const struct super_operations efivarfs_ops = { .statfs = simple_statfs, .drop_inode = generic_delete_inode, .evict_inode = efivarfs_evict_inode, - .show_options = generic_show_options, }; static struct super_block *efivarfs_sb; diff --git a/fs/namespace.c b/fs/namespace.c index 544ab84642eb..0e1fdb306133 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1237,65 +1237,6 @@ struct vfsmount *mnt_clone_internal(const struct path *path) return &p->mnt; } -static inline void mangle(struct seq_file *m, const char *s) -{ - seq_escape(m, s, " \t\n\\"); -} - -/* - * Simple .show_options callback for filesystems which don't want to - * implement more complex mount option showing. - * - * See also save_mount_options(). - */ -int generic_show_options(struct seq_file *m, struct dentry *root) -{ - const char *options; - - rcu_read_lock(); - options = rcu_dereference(root->d_sb->s_options); - - if (options != NULL && options[0]) { - seq_putc(m, ','); - mangle(m, options); - } - rcu_read_unlock(); - - return 0; -} -EXPORT_SYMBOL(generic_show_options); - -/* - * If filesystem uses generic_show_options(), this function should be - * called from the fill_super() callback. - * - * The .remount_fs callback usually needs to be handled in a special - * way, to make sure, that previous options are not overwritten if the - * remount fails. - * - * Also note, that if the filesystem's .remount_fs function doesn't - * reset all options to their default value, but changes only newly - * given options, then the displayed options will not reflect reality - * any more. 
- */ -void save_mount_options(struct super_block *sb, char *options) -{ - BUG_ON(sb->s_options); - rcu_assign_pointer(sb->s_options, kstrdup(options, GFP_KERNEL)); -} -EXPORT_SYMBOL(save_mount_options); - -void replace_mount_options(struct super_block *sb, char *options) -{ - char *old = sb->s_options; - rcu_assign_pointer(sb->s_options, options); - if (old) { - synchronize_rcu(); - kfree(old); - } -} -EXPORT_SYMBOL(replace_mount_options); - #ifdef CONFIG_PROC_FS /* iterator; we want it to have access to namespace_sem, thus here... */ static void *m_start(struct seq_file *m, loff_t *pos) diff --git a/fs/super.c b/fs/super.c index dfb56a9665d8..6bc3352adcf3 100644 --- a/fs/super.c +++ b/fs/super.c @@ -168,7 +168,6 @@ static void destroy_super(struct super_block *s) WARN_ON(!list_empty(&s->s_mounts)); put_user_ns(s->s_user_ns); kfree(s->s_subtype); - kfree(s->s_options); call_rcu(&s->rcu, destroy_super_rcu); } diff --git a/include/linux/fs.h b/include/linux/fs.h index bc0c054894b9..e265b2ea72c6 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1351,11 +1351,6 @@ struct super_block { */ char *s_subtype; - /* - * Saved mount options for lazy filesystems using - * generic_show_options() - */ - char __rcu *s_options; const struct dentry_operations *s_d_op; /* default d_op for dentries */ /* @@ -3033,10 +3028,6 @@ extern void setattr_copy(struct inode *inode, const struct iattr *attr); extern int file_update_time(struct file *file); -extern int generic_show_options(struct seq_file *m, struct dentry *root); -extern void save_mount_options(struct super_block *sb, char *options); -extern void replace_mount_options(struct super_block *sb, char *options); - static inline bool io_is_direct(struct file *filp) { return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host); -- cgit v1.2.3-59-g8ed1b From 7461279bba4a62d192eda6dd10de68ac0543bfcd Mon Sep 17 00:00:00 2001 From: Paul Burton Date: Sat, 17 Jun 2017 13:52:46 -0700 Subject: dt-bindings: Document 
img,boston-clock binding Add device tree binding documentation for the clocks provided by the MIPS Boston development board from Imagination Technologies, and a header file describing the available clocks for use by device trees & driver. Signed-off-by: Paul Burton Acked-by: Stephen Boyd Cc: Frank Rowand Cc: Michael Turquette Cc: Rob Herring Cc: devicetree@vger.kernel.org Cc: linux-clk@vger.kernel.org Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/16482/ Signed-off-by: Ralf Baechle --- .../devicetree/bindings/clock/img,boston-clock.txt | 31 ++++++++++++++++++++++ MAINTAINERS | 7 +++++ include/dt-bindings/clock/boston-clock.h | 14 ++++++++++ 3 files changed, 52 insertions(+) create mode 100644 Documentation/devicetree/bindings/clock/img,boston-clock.txt create mode 100644 include/dt-bindings/clock/boston-clock.h (limited to 'Documentation') diff --git a/Documentation/devicetree/bindings/clock/img,boston-clock.txt b/Documentation/devicetree/bindings/clock/img,boston-clock.txt new file mode 100644 index 000000000000..7bc5e9ffb624 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/img,boston-clock.txt @@ -0,0 +1,31 @@ +Binding for Imagination Technologies MIPS Boston clock sources. + +This binding uses the common clock binding[1]. + +[1] Documentation/devicetree/bindings/clock/clock-bindings.txt + +The device node must be a child node of the syscon node corresponding to the +Boston system's platform registers. + +Required properties: +- compatible : Should be "img,boston-clock". +- #clock-cells : Should be set to 1. + Values available for clock consumers can be found in the header file: + + +Example: + + system-controller@17ffd000 { + compatible = "img,boston-platform-regs", "syscon"; + reg = <0x17ffd000 0x1000>; + + clk_boston: clock { + compatible = "img,boston-clock"; + #clock-cells = <1>; + }; + }; + + uart0: uart@17ffe000 { + /* ... 
*/ + clocks = <&clk_boston BOSTON_CLK_SYS>; + }; diff --git a/MAINTAINERS b/MAINTAINERS index bcf2d258c0dd..0b977a6bf849 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8504,6 +8504,13 @@ F: arch/mips/include/asm/mach-loongson32/ F: drivers/*/*loongson1* F: drivers/*/*/*loongson1* +MIPS BOSTON DEVELOPMENT BOARD +M: Paul Burton +L: linux-mips@linux-mips.org +S: Maintained +F: Documentation/devicetree/bindings/clock/img,boston-clock.txt +F: include/dt-bindings/clock/boston-clock.h + MIROSOUND PCM20 FM RADIO RECEIVER DRIVER M: Hans Verkuil L: linux-media@vger.kernel.org diff --git a/include/dt-bindings/clock/boston-clock.h b/include/dt-bindings/clock/boston-clock.h new file mode 100644 index 000000000000..a6f009821137 --- /dev/null +++ b/include/dt-bindings/clock/boston-clock.h @@ -0,0 +1,14 @@ +/* + * Copyright (C) 2016 Imagination Technologies + * + * SPDX-License-Identifier: GPL-2.0 + */ + +#ifndef __DT_BINDINGS_CLOCK_BOSTON_CLOCK_H__ +#define __DT_BINDINGS_CLOCK_BOSTON_CLOCK_H__ + +#define BOSTON_CLK_INPUT 0 +#define BOSTON_CLK_SYS 1 +#define BOSTON_CLK_CPU 2 + +#endif /* __DT_BINDINGS_CLOCK_BOSTON_CLOCK_H__ */ -- cgit v1.2.3-59-g8ed1b From dc8d387210e3e2ab294031e8f6542329bc9141c4 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 11 Jul 2017 18:41:12 -0400 Subject: tracing: Update Documentation/trace/ftrace.txt The documentation of ftrace.txt has become rather outdated. Bring it closer to reality of todays kernel. Signed-off-by: Steven Rostedt (VMware) --- Documentation/trace/ftrace.txt | 508 ++++++++++++++++++++++++++++++++--------- 1 file changed, 396 insertions(+), 112 deletions(-) (limited to 'Documentation') diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index 94a987bd2bc5..a0bbf5fabe98 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -5,10 +5,11 @@ Copyright 2008 Red Hat Inc. 
Author: Steven Rostedt License: The GNU Free Documentation License, Version 1.2 (dual licensed under the GPL v2) -Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, - John Kacur, and David Teigland. +Original Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, + John Kacur, and David Teigland. Written for: 2.6.28-rc2 Updated for: 3.10 +Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt Introduction ------------ @@ -26,9 +27,11 @@ a task is woken to the task is actually scheduled in. One of the most common uses of ftrace is the event tracing. Through out the kernel is hundreds of static event points that -can be enabled via the debugfs file system to see what is +can be enabled via the tracefs file system to see what is going on in certain parts of the kernel. +See events.txt for more information. + Implementation Details ---------------------- @@ -39,34 +42,47 @@ See ftrace-design.txt for details for arch porters and such. The File System --------------- -Ftrace uses the debugfs file system to hold the control files as +Ftrace uses the tracefs file system to hold the control files as well as the files to display output. -When debugfs is configured into the kernel (which selecting any ftrace -option will do) the directory /sys/kernel/debug will be created. To mount +When tracefs is configured into the kernel (which selecting any ftrace +option will do) the directory /sys/kernel/tracing will be created. 
To mount this directory, you can add to your /etc/fstab file: - debugfs /sys/kernel/debug debugfs defaults 0 0 + tracefs /sys/kernel/tracing tracefs defaults 0 0 Or you can mount it at run time with: - mount -t debugfs nodev /sys/kernel/debug + mount -t tracefs nodev /sys/kernel/tracing For quicker access to that directory you may want to make a soft link to it: - ln -s /sys/kernel/debug /debug + ln -s /sys/kernel/tracing /tracing + + *** NOTICE *** + +Before 4.1, all ftrace tracing control files were within the debugfs +file system, which is typically located at /sys/kernel/debug/tracing. +For backward compatibility, when mounting the debugfs file system, +the tracefs file system will be automatically mounted at: + + /sys/kernel/debug/tracing -Any selected ftrace option will also create a directory called tracing -within the debugfs. The rest of the document will assume that you are in -the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate -on the files within that directory and not distract from the content with -the extended "/sys/kernel/debug/tracing" path name. +All files located in the tracefs file system will be located in that +debugfs file system directory as well. + + *** NOTICE *** + +Any selected ftrace option will also create the tracefs file system. +The rest of the document will assume that you are in the ftrace directory +(cd /sys/kernel/tracing) and will only concentrate on the files within that +directory and not distract from the content with the extended +"/sys/kernel/tracing" path name. That's it! (assuming that you have ftrace configured into your kernel) -After mounting debugfs, you can see a directory called -"tracing". This directory contains the control and output files +After mounting tracefs you will have access to the control and output files of ftrace. Here is a list of some of the key files: @@ -92,10 +108,20 @@ of ftrace. 
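Every control file is an ordinary text file, so a program can drive ftrace with plain open() and write() calls instead of echo. This is a minimal sketch. It assumes tracefs is mounted at /sys/kernel/tracing, or is reachable through the /sys/kernel/debug/tracing alias that the updated document describes for debugfs. `find_tracing_dir()` and `tracefs_write()` are hypothetical helpers. Opening with O_TRUNC mirrors the shell's `>` redirection. This matters for files such as set_ftrace_filter, where `>>` appends instead of replacing.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Candidate locations: tracefs first, then the debugfs compatibility path. */
static const char *const tracing_dirs[] = {
	"/sys/kernel/tracing",
	"/sys/kernel/debug/tracing",
};

/* Hypothetical helper: return the first candidate that exists, or NULL. */
static const char *find_tracing_dir(void)
{
	for (size_t i = 0; i < sizeof(tracing_dirs) / sizeof(tracing_dirs[0]); i++)
		if (access(tracing_dirs[i], F_OK) == 0)
			return tracing_dirs[i];
	return NULL;
}

/*
 * Hypothetical helper: write `val` into the existing file <dir>/<file>,
 * like `echo val > file` but without the trailing newline.
 * Returns 0 on success, -1 on error.
 */
static int tracefs_write(const char *dir, const char *file, const char *val)
{
	char path[512];
	int n = snprintf(path, sizeof path, "%s/%s", dir, file);
	if (n < 0 || (size_t)n >= sizeof path)
		return -1;

	int fd = open(path, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -1;

	size_t len = strlen(val);
	ssize_t w = write(fd, val, len);
	close(fd);
	return (w == (ssize_t)len) ? 0 : -1;
}
```

For example, `tracefs_write(dir, "tracing_on", "0")` stops writes to the ring buffer, as the tracing_on entry describes. Writing to these files normally requires root.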
Here is a list of some of the key files: writing to the ring buffer, the tracing overhead may still be occurring. + The kernel function tracing_off() can be used within the + kernel to disable writing to the ring buffer, which will + set this file to "0". User space can re-enable tracing by + echoing "1" into the file. + + Note, the function and event trigger "traceoff" will also + set this file to zero and stop tracing. Which can also + be re-enabled by user space using this file. + trace: This file holds the output of the trace in a human - readable format (described below). + readable format (described below). Note, tracing is temporarily + disabled while this file is being read (opened). trace_pipe: @@ -109,7 +135,8 @@ of ftrace. Here is a list of some of the key files: will not be read again with a sequential read. The "trace" file is static, and if the tracer is not adding more data, it will display the same - information every time it is read. + information every time it is read. This file will not + disable tracing while being read. trace_options: @@ -128,12 +155,14 @@ of ftrace. Here is a list of some of the key files: tracing_max_latency: Some of the tracers record the max latency. - For example, the time interrupts are disabled. - This time is saved in this file. The max trace - will also be stored, and displayed by "trace". - A new max trace will only be recorded if the - latency is greater than the value in this - file. (in microseconds) + For example, the maximum time that interrupts are disabled. + The maximum time is saved in this file. The max trace will also be + stored, and displayed by "trace". A new max trace will only be + recorded if the latency is greater than the value in this file + (in microseconds). + + By echoing in a time into this file, no latency will be recorded + unless it is greater than the time in this file. tracing_thresh: @@ -152,32 +181,34 @@ of ftrace. 
Here is a list of some of the key files: that the kernel uses for allocation, usually 4 KB in size). If the last page allocated has room for more bytes than requested, the rest of the page will be used, - making the actual allocation bigger than requested. + making the actual allocation bigger than requested or shown. ( Note, the size may not be a multiple of the page size due to buffer management meta-data. ) + Buffer sizes for individual CPUs may vary + (see "per_cpu/cpu0/buffer_size_kb" below), and if they do + this file will show "X". + buffer_total_size_kb: This displays the total combined size of all the trace buffers. free_buffer: - If a process is performing the tracing, and the ring buffer - should be shrunk "freed" when the process is finished, even - if it were to be killed by a signal, this file can be used - for that purpose. On close of this file, the ring buffer will - be resized to its minimum size. Having a process that is tracing - also open this file, when the process exits its file descriptor - for this file will be closed, and in doing so, the ring buffer - will be "freed". + If a process is performing tracing, and the ring buffer should be + shrunk "freed" when the process is finished, even if it were to be + killed by a signal, this file can be used for that purpose. On close + of this file, the ring buffer will be resized to its minimum size. + Having a process that is tracing also open this file, when the process + exits its file descriptor for this file will be closed, and in doing so, + the ring buffer will be "freed". It may also stop tracing if disable_on_free option is set. tracing_cpumask: - This is a mask that lets the user only trace - on specified CPUs. The format is a hex string - representing the CPUs. + This is a mask that lets the user only trace on specified CPUs. + The format is a hex string representing the CPUs. set_ftrace_filter: @@ -190,6 +221,9 @@ of ftrace. Here is a list of some of the key files: to be traced. 
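A small conversion shows the tracing_cpumask format clearly. Bit N of the hex value selects CPU N, so tracing only CPUs 0 and 2 means writing 5. The sketch below uses a hypothetical helper, `cpumask_hex()`, and only handles CPUs 0 to 31. On larger systems, the kernel's bitmap format splits the mask into comma-separated 32-bit groups, which this sketch does not generate.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the CPUs listed in `cpus` as the hex mask
 * written to tracing_cpumask (bit N set = trace CPU N).
 * Limited to CPUs 0-31, i.e. a single 32-bit group.
 * Returns 0 on success, -1 on an out-of-range CPU or a too-small buffer.
 */
static int cpumask_hex(const unsigned int *cpus, size_t n,
		       char *out, size_t outlen)
{
	unsigned long mask = 0;

	for (size_t i = 0; i < n; i++) {
		if (cpus[i] >= 32)
			return -1;
		mask |= 1UL << cpus[i];
	}

	int w = snprintf(out, outlen, "%lx", mask);
	return (w < 0 || (size_t)w >= outlen) ? -1 : 0;
}
```

The resulting string is what one would `echo` into tracing_cpumask. For example, CPUs 1 and 4 give `12`.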
Echoing names of functions into this file will limit the trace to only those functions. + The functions listed in "available_filter_functions" are what + can be written into this file. + This interface also allows for commands to be used. See the "Filter commands" section for more details. @@ -202,7 +236,14 @@ of ftrace. Here is a list of some of the key files: set_ftrace_pid: - Have the function tracer only trace a single thread. + Have the function tracer only trace the threads whose PID are + listed in this file. + + If the "function-fork" option is set, then when a task whose + PID is listed in this file forks, the child's PID will + automatically be added to this file, and the child will be + traced by the function tracer as well. This option will also + cause PIDs of tasks that exit to be removed from the file. set_event_pid: @@ -217,17 +258,28 @@ of ftrace. Here is a list of some of the key files: set_graph_function: - Set a "trigger" function where tracing should start - with the function graph tracer (See the section - "dynamic ftrace" for more details). + Functions listed in this file will cause the function graph + tracer to only trace these functions and the functions that + they call. (See the section "dynamic ftrace" for more details). + + set_graph_notrace: + + Similar to set_graph_function, but will disable function graph + tracing when the function is hit until it exits the function. + This makes it possible to ignore tracing functions that are called + by a specific function. available_filter_functions: - This lists the functions that ftrace - has processed and can trace. These are the function - names that you can pass to "set_ftrace_filter" or - "set_ftrace_notrace". (See the section "dynamic ftrace" - below for more details.) + This lists the functions that ftrace has processed and can trace. + These are the function names that you can pass to + "set_ftrace_filter" or "set_ftrace_notrace". 
+ (See the section "dynamic ftrace" below for more details.) + + dyn_ftrace_total_info: + + This file is for debugging purposes. The number of functions that + have been converted to nops and are available to be traced. enabled_functions: @@ -250,12 +302,21 @@ of ftrace. Here is a list of some of the key files: an 'I' will be displayed on the same line as the function that can be overridden. + If the architecture supports it, it will also show what callback + is being directly called by the function. If the count is greater + than 1 it most likely will be ftrace_ops_list_func(). + + If the callback of the function jumps to a trampoline that is + specific to a the callback and not the standard trampoline, + its address will be printed as well as the function that the + trampoline calls. + function_profile_enabled: When set it will enable all functions with either the function - tracer, or if enabled, the function graph tracer. It will + tracer, or if configured, the function graph tracer. It will keep a histogram of the number of functions that were called - and if run with the function graph tracer, it will also keep + and if the function graph tracer was configured, it will also keep track of the time spent in those functions. The histogram content can be displayed in the files: @@ -283,12 +344,11 @@ of ftrace. Here is a list of some of the key files: printk_formats: This is for tools that read the raw format files. If an event in - the ring buffer references a string (currently only trace_printk() - does this), only a pointer to the string is recorded into the buffer - and not the string itself. This prevents tools from knowing what - that string was. This file displays the string and address for - the string allowing tools to map the pointers to what the - strings were. + the ring buffer references a string, only a pointer to the string + is recorded into the buffer and not the string itself. This prevents + tools from knowing what that string was. 
This file displays the string + and address for the string allowing tools to map the pointers to what + the strings were. saved_cmdlines: @@ -298,6 +358,22 @@ of ftrace. Here is a list of some of the key files: comms for events. If a pid for a comm is not listed, then "<...>" is displayed in the output. + If the option "record-cmd" is set to "0", then comms of tasks + will not be saved during recording. By default, it is enabled. + + saved_cmdlines_size: + + By default, 128 comms are saved (see "saved_cmdlines" above). To + increase or decrease the amount of comms that are cached, echo + in a the number of comms to cache, into this file. + + saved_tgids: + + If the option "record-tgid" is set, on each scheduling context switch + the Task Group ID of a task is saved in a table mapping the PID of + the thread to its TGID. By default, the "record-tgid" option is + disabled. + snapshot: This displays the "snapshot" buffer and also lets the user @@ -336,6 +412,9 @@ of ftrace. Here is a list of some of the key files: # cat trace_clock [local] global counter x86-tsc + The clock with the square brackets around it is the one + in effect. + local: Default clock, but may not be in sync across CPUs global: This clock is in sync with all CPUs but may @@ -448,6 +527,23 @@ of ftrace. Here is a list of some of the key files: See events.txt for more information. + set_event: + + By echoing in the event into this file, will enable that event. + + See events.txt for more information. + + available_events: + + A list of events that can be enabled in tracing. + + See events.txt for more information. + + hwlat_detector: + + Directory for the Hardware Latency Detector. + See "Hardware Latency Detector" section below. + per_cpu: This is a directory that contains the trace per_cpu information. @@ -539,13 +635,25 @@ Here is the list of current tracers that may be configured. to draw a graph of function calls similar to C code source. + "blk" + + The block tracer. 
The tracer used by the blktrace user + application. + + "hwlat" + + The Hardware Latency tracer is used to detect if the hardware + produces any latency. See "Hardware Latency Detector" section + below. + "irqsoff" Traces the areas that disable interrupts and saves the trace with the longest max latency. See tracing_max_latency. When a new max is recorded, it replaces the old trace. It is best to view this - trace with the latency-format option enabled. + trace with the latency-format option enabled, which + happens automatically when the tracer is selected. "preemptoff" @@ -571,6 +679,26 @@ Here is the list of current tracers that may be configured. RT tasks (as the current "wakeup" does). This is useful for those interested in wake up timings of RT tasks. + "wakeup_dl" + + Traces and records the max latency that it takes for + a SCHED_DEADLINE task to be woken (as the "wakeup" and + "wakeup_rt" does). + + "mmiotrace" + + A special tracer that is used to trace binary module. + It will trace all the calls that a module makes to the + hardware. Everything it writes and reads from the I/O + as well. + + "branch" + + This tracer can be configured when tracing likely/unlikely + calls within the kernel. It will trace when a likely and + unlikely branch is hit and if it was correct in its prediction + of being correct. + "nop" This is the "trace nothing" tracer. To remove all @@ -582,7 +710,7 @@ Examples of using the tracer ---------------------------- Here are typical examples of using the tracers when controlling -them only with the debugfs interface (without using any +them only with the tracefs interface (without using any user-land utilities). Output format: @@ -674,7 +802,7 @@ why a latency happened. Here is a typical trace. This shows that the current tracer is "irqsoff" tracing the time for which interrupts were disabled. It gives the trace version (which never changes) and the version of the kernel upon which this was executed on -(3.10). 
Then it displays the max latency in microseconds (259 us). The number +(3.8). Then it displays the max latency in microseconds (259 us). The number of trace entries displayed and the total number (both are four: #4/4). VP, KP, SP, and HP are always zero and are reserved for later use. #P is the number of online CPUs (#P:4). @@ -709,6 +837,8 @@ explains which is which. '.' otherwise. hardirq/softirq: + 'Z' - NMI occurred inside a hardirq + 'z' - NMI is running 'H' - hard irq occurred inside a softirq. 'h' - hard irq is running 's' - soft irq is running @@ -757,24 +887,24 @@ nohex nobin noblock trace_printk -nobranch annotate nouserstacktrace nosym-userobj noprintk-msg-only context-info nolatency-format -sleep-time -graph-time record-cmd +norecord-tgid overwrite nodisable_on_free irq-info markers noevent-fork function-trace +nofunction-fork nodisplay-graph nostacktrace +nobranch To disable one of the options, echo in the option prepended with "no". @@ -830,8 +960,6 @@ Here are the available options: trace_printk - Can disable trace_printk() from writing into the buffer. - branch - Enable branch tracing with the tracer. - annotate - It is sometimes confusing when the CPU buffers are full and one CPU buffer had a lot of events recently, thus a shorter time frame, were another CPU may have only had @@ -850,7 +978,8 @@ Here are the available options: -0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock userstacktrace - This option changes the trace. It records a - stacktrace of the current userspace thread. + stacktrace of the current user space thread after + each trace event. sym-userobj - when user stacktrace are enabled, look up which object the address belongs to, and print a @@ -873,29 +1002,21 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] context-info - Show only the event data. Hides the comm, PID, timestamp, CPU, and other useful data. - latency-format - This option changes the trace. 
When - it is enabled, the trace displays - additional information about the - latencies, as described in "Latency - trace format". - - sleep-time - When running function graph tracer, to include - the time a task schedules out in its function. - When enabled, it will account time the task has been - scheduled out as part of the function call. - - graph-time - When running function profiler with function graph tracer, - to include the time to call nested functions. When this is - not set, the time reported for the function will only - include the time the function itself executed for, not the - time for functions that it called. + latency-format - This option changes the trace output. When it is enabled, + the trace displays additional information about the + latency, as described in "Latency trace format". record-cmd - When any event or tracer is enabled, a hook is enabled - in the sched_switch trace point to fill comm cache + in the sched_switch trace point to fill comm cache with mapped pids and comms. But this may cause some overhead, and if you only care about pids, and not the name of the task, disabling this option can lower the - impact of tracing. + impact of tracing. See "saved_cmdlines". + + record-tgid - When any event or tracer is enabled, a hook is enabled + in the sched_switch trace point to fill the cache of + mapped Thread Group IDs (TGID) mapping to pids. See + "saved_tgids". overwrite - This controls what happens when the trace buffer is full. If "1" (default), the oldest events are @@ -935,19 +1056,98 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] functions. This keeps the overhead of the tracer down when performing latency tests. + function-fork - When set, tasks with PIDs listed in set_ftrace_pid will + have the PIDs of their children added to set_ftrace_pid + when those tasks fork. Also, when tasks with PIDs in + set_ftrace_pid exit, their PIDs will be removed from the + file. 
+ display-graph - When set, the latency tracers (irqsoff, wakeup, etc) will use function graph tracing instead of function tracing. - stacktrace - This is one of the options that changes the trace - itself. When a trace is recorded, so is the stack - of functions. This allows for back traces of - trace sites. + stacktrace - When set, a stack trace is recorded after any trace event + is recorded. + + branch - Enable branch tracing with the tracer. This enables branch + tracer along with the currently set tracer. Enabling this + with the "nop" tracer is the same as just enabling the + "branch" tracer. Note: Some tracers have their own options. They only appear in this file when the tracer is active. They always appear in the options directory. +Here are the per tracer options: + +Options for function tracer: + + func_stack_trace - When set, a stack trace is recorded after every + function that is recorded. NOTE! Limit the functions + that are recorded before enabling this, with + "set_ftrace_filter" otherwise the system performance + will be critically degraded. Remember to disable + this option before clearing the function filter. + +Options for function_graph tracer: + + Since the function_graph tracer has a slightly different output + it has its own options to control what is displayed. + + funcgraph-overrun - When set, the "overrun" of the graph stack is + displayed after each function traced. The + overrun, is when the stack depth of the calls + is greater than what is reserved for each task. + Each task has a fixed array of functions to + trace in the call graph. If the depth of the + calls exceeds that, the function is not traced. + The overrun is the number of functions missed + due to exceeding this array. + + funcgraph-cpu - When set, the CPU number of the CPU where the trace + occurred is displayed. + + funcgraph-overhead - When set, if the function takes longer than + A certain amount, then a delay marker is + displayed. 
See "delay" above, under the + header description. + + funcgraph-proc - Unlike other tracers, the process' command line + is not displayed by default, but instead only + when a task is traced in and out during a context + switch. Enabling this options has the command + of each process displayed at every line. + + funcgraph-duration - At the end of each function (the return) + the duration of the amount of time in the + function is displayed in microseconds. + + funcgraph-abstime - When set, the timestamp is displayed at each + line. + + funcgraph-irqs - When disabled, functions that happen inside an + interrupt will not be traced. + + funcgraph-tail - When set, the return event will include the function + that it represents. By default this is off, and + only a closing curly bracket "}" is displayed for + the return of a function. + + sleep-time - When running function graph tracer, to include + the time a task schedules out in its function. + When enabled, it will account time the task has been + scheduled out as part of the function call. + + graph-time - When running function profiler with function graph tracer, + to include the time to call nested functions. When this is + not set, the time reported for the function will only + include the time the function itself executed for, not the + time for functions that it called. + +Options for blk tracer: + + blk_classic - Shows a more minimalistic output. + irqsoff ------- @@ -1711,6 +1911,85 @@ events. -0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep +Hardware Latency Detector +------------------------- + +The hardware latency detector is executed by enabling the "hwlat" tracer. + +NOTE, this tracer will affect the performance of the system as it will +periodically make a CPU constantly busy with interrupts disabled. 
+ + # echo hwlat > current_tracer + # sleep 100 + # cat trace +# tracer: hwlat +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + <...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940 + <...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365 + <...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062 + <...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633 + <...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961 + <...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1 + + +The above output is somewhat the same in the header. All events will have +interrupts disabled 'd'. Under the FUNCTION title there is: + + #1 - This is the count of events recorded that were greater than the + tracing_threshold (See below). + + inner/outer(us): 12/14 + + This shows two numbers as "inner latency" and "outer latency". The test + runs in a loop checking a timestamp twice. The latency detected within + the two timestamps is the "inner latency" and the latency detected + after the previous timestamp and the next timestamp in the loop is + the "outer latency". + + ts:1499801089.066141940 + + The absolute timestamp that the event happened. + + nmi-total:4 nmi-count:1 + + On architectures that support it, if an NMI comes in during the + test, the time spent in NMI is reported in "nmi-total" (in + microseconds). + + All architectures that have NMIs will show the "nmi-count" if an + NMI comes in during the test. + +hwlat files: + + tracing_threshold - This gets automatically set to "10" to represent 10 + microseconds. This is the threshold of latency that + needs to be detected before the trace will be recorded. 
+ + Note, when hwlat tracer is finished (another tracer is + written into "current_tracer"), the original value for + tracing_threshold is placed back into this file. + + hwlat_detector/width - The length of time the test runs with interrupts + disabled. + + hwlat_detector/window - The length of time of the window which the test + runs. That is, the test will run for "width" + microseconds per "window" microseconds + + tracing_cpumask - When the test is started. A kernel thread is created that + runs the test. This thread will alternate between CPUs + listed in the tracing_cpumask between each period + (one "window"). To limit the test to specific CPUs + set the mask in this file to only the CPUs that the test + should run on. + function -------- @@ -1821,15 +2100,15 @@ something like this simple program: #define STR(x) _STR(x) #define MAX_PATH 256 -const char *find_debugfs(void) +const char *find_tracefs(void) { - static char debugfs[MAX_PATH+1]; - static int debugfs_found; + static char tracefs[MAX_PATH+1]; + static int tracefs_found; char type[100]; FILE *fp; - if (debugfs_found) - return debugfs; + if (tracefs_found) + return tracefs; if ((fp = fopen("/proc/mounts","r")) == NULL) { perror("/proc/mounts"); @@ -1839,27 +2118,27 @@ const char *find_debugfs(void) while (fscanf(fp, "%*s %" STR(MAX_PATH) "s %99s %*s %*d %*d\n", - debugfs, type) == 2) { - if (strcmp(type, "debugfs") == 0) + tracefs, type) == 2) { + if (strcmp(type, "tracefs") == 0) break; } fclose(fp); - if (strcmp(type, "debugfs") != 0) { - fprintf(stderr, "debugfs not mounted"); + if (strcmp(type, "tracefs") != 0) { + fprintf(stderr, "tracefs not mounted"); return NULL; } - strcat(debugfs, "/tracing/"); - debugfs_found = 1; + strcat(tracefs, "/tracing/"); + tracefs_found = 1; - return debugfs; + return tracefs; } const char *tracing_file(const char *file_name) { static char trace_file[MAX_PATH+1]; - snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name); + snprintf(trace_file, MAX_PATH, 
"%s/%s", find_tracefs(), file_name); return trace_file; } @@ -1898,12 +2177,12 @@ Or this simple script! ------ #!/bin/bash -debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts` -echo nop > $debugfs/tracing/current_tracer -echo 0 > $debugfs/tracing/tracing_on -echo $$ > $debugfs/tracing/set_ftrace_pid -echo function > $debugfs/tracing/current_tracer -echo 1 > $debugfs/tracing/tracing_on +tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts` +echo nop > $tracefs/tracing/current_tracer +echo 0 > $tracefs/tracing/tracing_on +echo $$ > $tracefs/tracing/set_ftrace_pid +echo function > $tracefs/tracing/current_tracer +echo 1 > $tracefs/tracing/tracing_on exec "$@" ------ @@ -2145,13 +2424,18 @@ include the -pg switch in the compiling of the kernel.) At compile time every C file object is run through the recordmcount program (located in the scripts directory). This program will parse the ELF headers in the C object to find all -the locations in the .text section that call mcount. (Note, only -white listed .text sections are processed, since processing other -sections like .init.text may cause races due to those sections -being freed unexpectedly). - -A new section called "__mcount_loc" is created that holds -references to all the mcount call sites in the .text section. +the locations in the .text section that call mcount. Starting +with gcc verson 4.6, the -mfentry has been added for x86, which +calls "__fentry__" instead of "mcount". Which is called before +the creation of the stack frame. + +Note, not all sections are traced. They may be prevented by either +a notrace, or blocked another way and all inline functions are not +traced. Check the "available_filter_functions" file to see what functions +can be traced. + +A section called "__mcount_loc" is created that holds +references to all the mcount/fentry call sites in the .text section. The recordmcount program re-links this section back into the original object. 
The final linking stage of the kernel will add all these references into a single table. @@ -2679,7 +2963,7 @@ in time without stopping tracing. Ftrace swaps the current buffer with a spare buffer, and tracing continues in the new current (=previous spare) buffer. -The following debugfs files in "tracing" are related to this +The following tracefs files in "tracing" are related to this feature: snapshot: @@ -2752,7 +3036,7 @@ cat: snapshot: Device or resource busy Instances --------- -In the debugfs tracing directory is a directory called "instances". +In the tracefs tracing directory is a directory called "instances". This directory can have new directories created inside of it using mkdir, and removing directories with rmdir. The directory created with mkdir in this directory will already contain files and other -- cgit v1.2.3-59-g8ed1b From 7d2b39ab9b0d67c3e6c66e3b68644522d4e15392 Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Wed, 12 Jul 2017 16:39:31 -0600 Subject: docs: Include uaccess docs from the right file Documentation/core-api/kernel-api.rst was including kerneldoc comments from arch/x86/include/asm/uaccess_32.h, but the relevant comments moved to .../uaccess.h some time ago. Correct the include to pick up the comments and eliminate a warning. Signed-off-by: Jonathan Corbet --- Documentation/core-api/kernel-api.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst index 9ec8488319dc..17b00914c6ab 100644 --- a/Documentation/core-api/kernel-api.rst +++ b/Documentation/core-api/kernel-api.rst @@ -114,7 +114,7 @@ The Slab Cache User Space Memory Access ------------------------ -.. kernel-doc:: arch/x86/include/asm/uaccess_32.h +.. kernel-doc:: arch/x86/include/asm/uaccess.h :internal: .. 
kernel-doc:: arch/x86/lib/usercopy_32.c -- cgit v1.2.3-59-g8ed1b From 6d16e9143e9ef7e4c3f497f1111dfd48eaf2bab6 Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Wed, 12 Jul 2017 16:44:38 -0600 Subject: docs: Turn off section numbering for the input docs The input docs enable section numbering at multiple levels, leading to a lot of bright-red "nested numbered toctree" warnings in newer Sphinx versions. Just take that directive out for now to help alleviate the global red-pixel shortage. Signed-off-by: Jonathan Corbet --- Documentation/input/index.rst | 1 - 1 file changed, 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/input/index.rst b/Documentation/input/index.rst index 7a3e71c2bd00..9888f5cbf6d5 100644 --- a/Documentation/input/index.rst +++ b/Documentation/input/index.rst @@ -6,7 +6,6 @@ Contents: .. toctree:: :maxdepth: 2 - :numbered: input_uapi input_kapi -- cgit v1.2.3-59-g8ed1b From 3afadfd902a7477db0b121a3fcbd964d3606c29f Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 7 Jul 2017 03:21:17 +0900 Subject: memory-barriers.txt: Fix broken link to atomic_ops.txt A few obsolete links to atomic_ops.txt remain in memory-barriers.txt even though the file has moved to core-api/atomic_ops.rst. This commit fixes the obsolete links. Signed-off-by: SeongJae Park Signed-off-by: Jonathan Corbet --- Documentation/memory-barriers.txt | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index f1c9eaa45a57..3ed0d8bf2412 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -1876,8 +1876,8 @@ There are some more advanced barrier functions: This makes sure that the death mark on the object is perceived to be set *before* the reference counter is decremented. - See Documentation/atomic_ops.txt for more information. See the "Atomic - operations" subsection for information on where to use these.
+ See Documentation/core-api/atomic_ops.rst for more information. See the + "Atomic operations" subsection for information on where to use these. (*) lockless_dereference(); @@ -2584,7 +2584,7 @@ situations because on some CPUs the atomic instructions used imply full memory barriers, and so barrier instructions are superfluous in conjunction with them, and in such cases the special barrier primitives will be no-ops. -See Documentation/atomic_ops.txt for more information. +See Documentation/core-api/atomic_ops.rst for more information. ACCESSING DEVICES -- cgit v1.2.3-59-g8ed1b From 51e988f4092428e3d2c9f141fba9f86583bc82f3 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Fri, 7 Jul 2017 03:21:18 +0900 Subject: kokr/memory-barriers.txt: Fix obsolete link to atomic_ops.txt Obsolete links to atomic_ops.txt exist in ko_KR/memory-barriers.txt though the file has moved to core-api/atomic_ops.rst. This commit fixes the obsolete links. Signed-off-by: SeongJae Park Signed-off-by: Jonathan Corbet --- Documentation/translations/ko_KR/memory-barriers.txt | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt index c6f4ead76ce7..38310dcd6620 100644 --- a/Documentation/translations/ko_KR/memory-barriers.txt +++ b/Documentation/translations/ko_KR/memory-barriers.txt @@ -523,11 +523,11 @@ CPU 에게 기대할 수 있는 최소한의 보장사항 몇가지가 있습니 즉, ACQUIRE 는 최소한의 "취득" 동작처럼, 그리고 RELEASE 는 최소한의 "공개" 처럼 동작한다는 의미입니다. -atomic_ops.txt 에서 설명되는 어토믹 오퍼레이션들 중에는 완전히 순서잡힌 것들과 -(배리어를 사용하지 않는) 완화된 순서의 것들 외에 ACQUIRE 와 RELEASE 부류의 -것들도 존재합니다. 로드와 스토어를 모두 수행하는 조합된 어토믹 오퍼레이션에서, -ACQUIRE 는 해당 오퍼레이션의 로드 부분에만 적용되고 RELEASE 는 해당 -오퍼레이션의 스토어 부분에만 적용됩니다. +core-api/atomic_ops.rst 에서 설명되는 어토믹 오퍼레이션들 중에는 완전히 +순서잡힌 것들과 (배리어를 사용하지 않는) 완화된 순서의 것들 외에 ACQUIRE 와 +RELEASE 부류의 것들도 존재합니다. 
로드와 스토어를 모두 수행하는 조합된 어토믹 +오퍼레이션에서, ACQUIRE 는 해당 오퍼레이션의 로드 부분에만 적용되고 RELEASE 는 +해당 오퍼레이션의 스토어 부분에만 적용됩니다. 메모리 배리어들은 두 CPU 간, 또는 CPU 와 디바이스 간에 상호작용의 가능성이 있을 때에만 필요합니다. 만약 어떤 코드에 그런 상호작용이 없을 것이 보장된다면, 해당 @@ -1848,7 +1848,7 @@ Mandatory 배리어들은 SMP 시스템에서도 UP 시스템에서도 SMP 효 이 코드는 객체의 업데이트된 death 마크가 레퍼런스 카운터 감소 동작 *전에* 보일 것을 보장합니다. - 더 많은 정보를 위해선 Documentation/atomic_ops.txt 문서를 참고하세요. + 더 많은 정보를 위해선 Documentation/core-api/atomic_ops.rst 문서를 참고하세요. 어디서 이것들을 사용해야 할지 궁금하다면 "어토믹 오퍼레이션" 서브섹션을 참고하세요. @@ -2550,7 +2550,7 @@ CPU 에서는 사용되는 어토믹 인스트럭션 자체에 메모리 배리 있는데, 그런 경우에 이 특수 메모리 배리어 도구들은 no-op 이 되어 실질적으로 아무일도 하지 않습니다. -더 많은 내용을 위해선 Documentation/atomic_ops.txt 를 참고하세요. +더 많은 내용을 위해선 Documentation/core-api/atomic_ops.rst 를 참고하세요. 디바이스 액세스 -- cgit v1.2.3-59-g8ed1b From a711bdc095d2c9b6ad15e737d1cdc46409b09538 Mon Sep 17 00:00:00 2001 From: Bharat Bhushan Date: Wed, 12 Jul 2017 14:33:24 -0700 Subject: kexec/kdump: minor Documentation updates for arm64 and Image Minor updates in Documentation for arm64 as relocatable kernel. Also this patch updates documentation for using uncompressed image "Image" which is used for ARM64. Link: http://lkml.kernel.org/r/1495104793-6563-1-git-send-email-Bharat.Bhushan@nxp.com Signed-off-by: Bharat Bhushan Cc: Dave Young Cc: Baoquan He Cc: Vivek Goyal Cc: Jonathan Corbet Cc: AKASHI Takahiro Cc: Pratyush Anand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kdump/kdump.txt | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt index 615434d81108..51814450a7f8 100644 --- a/Documentation/kdump/kdump.txt +++ b/Documentation/kdump/kdump.txt @@ -112,8 +112,8 @@ There are two possible methods of using Kdump. 2) Or use the system kernel binary itself as dump-capture kernel and there is no need to build a separate dump-capture kernel. This is possible only with the architectures which support a relocatable kernel. 
As - of today, i386, x86_64, ppc64, ia64 and arm architectures support relocatable - kernel. + of today, i386, x86_64, ppc64, ia64, arm and arm64 architectures support + relocatable kernel. Building a relocatable kernel is advantageous from the point of view that one does not have to build a second kernel for capturing the dump. But @@ -339,7 +339,7 @@ For arm: For arm64: - Use vmlinux or Image -If you are using a uncompressed vmlinux image then use following command +If you are using an uncompressed vmlinux image then use following command to load dump-capture kernel. kexec -p \ @@ -361,6 +361,12 @@ to load dump-capture kernel. --dtb= \ --append="root= " +If you are using an uncompressed Image, then use following command +to load dump-capture kernel. + + kexec -p \ + --initrd= \ + --append="root= " Please note, that --args-linux does not need to be specified for ia64. It is planned to make this a no-op on that architecture, but for now -- cgit v1.2.3-59-g8ed1b From 77493f04b74cdff3a61fb3fb14b1f5a71d88fd5f Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Wed, 12 Jul 2017 14:34:25 -0700 Subject: procfs: fdinfo: extend information about epoll target files Since it is possible to have the same number in the tfd field (say a file is added, closed, then another file is dup'ed to the same number and added back), it is impossible to distinguish such target files solely by their numbers. Strictly speaking, regular applications don't need to recognize these targets at all, but for checkpoint/restore's sake we need to collect targets to be able to push them back at the restore stage in the proper order. Thus, let's add the file position, inode and device number where this target resides. These three fields can be used as a primary key for sorting, and together with kcmp they help CRIU find the exact file target (from the whole set of processes being checkpointed).
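To illustrate how the three new fields can serve as a sort key, here is a minimal userspace sketch that parses one extended "tfd:" line from an epoll fdinfo file and orders targets by (sdev, ino, pos). It assumes the field layout produced by the seq_printf() change to fs/eventpoll.c in this patch (tfd and pos in decimal; events, data, ino and sdev in hex). The struct, the helper names and the key order are hypothetical choices for illustration; they are not part of the kernel or of CRIU, and the kcmp() step CRIU uses for exact identification is not shown.

```c
#include <stdio.h>   /* sscanf() */
#include <stdlib.h>  /* qsort(), which cmp_epoll_target() is written for */

/* Hypothetical: one parsed "tfd:" line from /proc/<pid>/fdinfo/<epfd>. */
struct epoll_target {
	int tfd;                 /* target fd number (decimal) */
	unsigned int events;     /* event mask (hex) */
	unsigned long long data; /* user data (hex) */
	long long pos;           /* file offset (decimal) */
	unsigned long ino;       /* inode number (hex) */
	unsigned int sdev;       /* device number (hex) */
};

/*
 * Parse the extended format. Whitespace in the format string matches any
 * run of spaces, so the %8d/%8x/%16llx padding is handled. Returns 0 on
 * success, -1 if the line lacks the pos/ino/sdev fields.
 */
static int parse_tfd_line(const char *line, struct epoll_target *t)
{
	int n = sscanf(line,
		       " tfd: %d events: %x data: %llx pos:%lld ino:%lx sdev:%x",
		       &t->tfd, &t->events, &t->data, &t->pos, &t->ino, &t->sdev);
	return n == 6 ? 0 : -1;
}

/* qsort() comparator: order by device, then inode, then file position. */
static int cmp_epoll_target(const void *a, const void *b)
{
	const struct epoll_target *x = a, *y = b;

	if (x->sdev != y->sdev)
		return x->sdev < y->sdev ? -1 : 1;
	if (x->ino != y->ino)
		return x->ino < y->ino ? -1 : 1;
	if (x->pos != y->pos)
		return x->pos < y->pos ? -1 : 1;
	return 0;
}
```

A caller would collect the parsed entries into an array and call qsort(targets, n, sizeof(targets[0]), cmp_epoll_target); two entries that compare equal here are only candidates for the same file, which is where kcmp() comes in.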
Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com Signed-off-by: Cyrill Gorcunov Acked-by: Andrei Vagin Cc: Al Viro Cc: Pavel Emelyanov Cc: Michael Kerrisk Cc: Jason Baron Cc: Andy Lutomirski Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 6 +++++- fs/eventpoll.c | 8 ++++++-- 2 files changed, 11 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 4cddbce85ac9..adba21b5ada7 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1786,12 +1786,16 @@ pair provide additional information particular to the objects they represent. pos: 0 flags: 02 mnt_id: 9 - tfd: 5 events: 1d data: ffffffffffffffff + tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7 where 'tfd' is a target file descriptor number in decimal form, 'events' is events mask being watched and the 'data' is data associated with a target [see epoll(7) for more details]. + The 'pos' is current offset of the target file in decimal form + [see lseek(2)], 'ino' and 'sdev' are inode and device numbers + where target file resides, all in hex format. 
+ Fsnotify files ~~~~~~~~~~~~~~ For inotify files the format is the following diff --git a/fs/eventpoll.c b/fs/eventpoll.c index a6d194831ed8..322904c3ebdf 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -960,10 +960,14 @@ static void ep_show_fdinfo(struct seq_file *m, struct file *f) mutex_lock(&ep->mtx); for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) { struct epitem *epi = rb_entry(rbp, struct epitem, rbn); + struct inode *inode = file_inode(epi->ffd.file); - seq_printf(m, "tfd: %8d events: %8x data: %16llx\n", + seq_printf(m, "tfd: %8d events: %8x data: %16llx " + " pos:%lli ino:%lx sdev:%x\n", epi->ffd.fd, epi->event.events, - (long long)epi->event.data); + (long long)epi->event.data, + (long long)epi->ffd.file->f_pos, + inode->i_ino, inode->i_sb->s_dev); if (seq_has_overflowed(m)) break; } -- cgit v1.2.3-59-g8ed1b From e41d58185f1444368873d4d7422f7664a68be61d Mon Sep 17 00:00:00 2001 From: Dmitry Vyukov Date: Wed, 12 Jul 2017 14:34:35 -0700 Subject: fault-inject: support systematic fault injection Add /proc/self/task//fail-nth file that allows failing 0-th, 1-st, 2-nd and so on calls systematically. Excerpt from the added documentation: "Write to this file of integer N makes N-th call in the current task fail (N is 0-based). Read from this file returns a single char 'Y' or 'N' that says if the fault setup with a previous write to this file was injected or not, and disables the fault if it wasn't yet injected. Note that this file enables all types of faults (slab, futex, etc). This setting takes precedence over all other generic settings like probability, interval, times, etc. But per-capability settings (e.g. fail_futex/ignore-private) take precedence over it. This feature is intended for systematic testing of faults in a single system call. See an example below" Why add a new setting: 1. Existing settings are global rather than per-task. So parallel testing is not possible. 2. 
attr->interval is close, but it depends on attr->count, which is not reset to 0, so interval does not work as expected. 3. Trying to model this with existing settings requires manipulating all of probability, interval, times, space, task-filter, the unexposed count, and the per-task make-it-fail files. 4. Existing settings are per-failure-type, and the set of failure types is potentially expanding. 5. make-it-fail can't be changed by an unprivileged user, and aggressive stress testing is better done from an unprivileged user. Similarly, this would require opening the debugfs files to the unprivileged user, as they would need to reopen at least the times file (not possible to pre-open before dropping privs). The proposed interface solves all of the above (see the example). We want to integrate this into the syzkaller fuzzer. A prototype found 10 bugs in the kernel in its first day of usage: https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance I've made the current interface work with all types of our sandboxes. For setuid, the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to make /proc entries non-root owned. So I am fine with the current version of the code.
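The counter semantics in the excerpt above ("N-th call fails", read returns 'Y' or 'N' and disarms) can be modeled in plain userspace C. This is a minimal sketch, not kernel code: it mirrors the proc_fail_nth_write(), proc_fail_nth_read() and should_fail() changes in this patch, while struct model_task and every model_* name are hypothetical. The in_task() check and the generic probability/interval/times logic are deliberately omitted.

```c
#include <limits.h>  /* INT_MAX */
#include <stdbool.h>

/* Hypothetical stand-in for the new task_struct::fail_nth field. */
struct model_task {
	int fail_nth; /* 0 = disarmed; otherwise (calls left before the fault) + 1 */
};

/* Like proc_fail_nth_write(): arm the N-th (0-based) call to fail. */
static int model_write(struct model_task *t, int n)
{
	if (n < 0 || n == INT_MAX)
		return -1; /* the kernel returns -EINVAL here */
	t->fail_nth = n + 1;
	return 0;
}

/* Like the new should_fail() prologue: each injection point consumes one step. */
static bool model_should_fail(struct model_task *t)
{
	if (t->fail_nth) {
		if (--t->fail_nth == 0)
			return true; /* this is the armed call: inject the fault */
		return false;
	}
	return false; /* generic fault_attr checks omitted in this model */
}

/* Like proc_fail_nth_read(): 'Y' once the counter hit zero, else 'N'; then disarm. */
static char model_read(struct model_task *t)
{
	char c = t->fail_nth ? 'N' : 'Y';

	t->fail_nth = 0;
	return c;
}

/* A pretend system call with `points` injection points; fails on the first fault. */
static int model_syscall(struct model_task *t, int points)
{
	for (int i = 0; i < points; i++)
		if (model_should_fail(t))
			return -1;
	return 0;
}

/* The systematic loop from the socketpair() example: returns the number of 'Y' rounds. */
static int model_count_faults(int points)
{
	struct model_task t = { 0 };
	int i;

	for (i = 0;; i++) {
		model_write(&t, i);
		model_syscall(&t, points);
		if (model_read(&t) != 'Y')
			break;
	}
	return i;
}
```

Because the counter is stored as N + 1, a value of zero doubles as "disarmed" and "already fired". That is why, in this model as in the patch, a read with nothing armed also reports 'Y'.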
[akpm@linux-foundation.org: fix build] Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com Signed-off-by: Dmitry Vyukov Cc: Akinobu Mita Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/fault-injection/fault-injection.txt | 78 +++++++++++++++++++++++ fs/proc/base.c | 52 +++++++++++++++ include/linux/sched.h | 1 + kernel/fork.c | 4 ++ lib/fault-inject.c | 7 ++ 5 files changed, 142 insertions(+) (limited to 'Documentation') diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt index 415484f3d59a..192d8cbcc5f9 100644 --- a/Documentation/fault-injection/fault-injection.txt +++ b/Documentation/fault-injection/fault-injection.txt @@ -134,6 +134,22 @@ use the boot option: fail_futex= mmc_core.fail_request=,,, +o proc entries + +- /proc/self/task//fail-nth: + + Write to this file of integer N makes N-th call in the current task fail + (N is 0-based). Read from this file returns a single char 'Y' or 'N' + that says if the fault setup with a previous write to this file was + injected or not, and disables the fault if it wasn't yet injected. + Note that this file enables all types of faults (slab, futex, etc). + This setting takes precedence over all other generic debugfs settings + like probability, interval, times, etc. But per-capability settings + (e.g. fail_futex/ignore-private) take precedence over it. + + This feature is intended for systematic testing of faults in a single + system call. See an example below. + How to add new fault injection capability ----------------------------------------- @@ -278,3 +294,65 @@ allocation failure. 
# env FAILCMD_TYPE=fail_page_alloc \ ./tools/testing/fault-injection/failcmd.sh --times=100 \ -- make -C tools/testing/selftests/ run_tests + +Systematic faults using fail-nth +--------------------------------- + +The following code systematically faults 0-th, 1-st, 2-nd and so on +capabilities in the socketpair() system call. + +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/socket.h> +#include <sys/syscall.h> +#include <fcntl.h> +#include <unistd.h> +#include <string.h> +#include <stdlib.h> +#include <stdio.h> +#include <errno.h> + +int main() +{ + int i, err, res, fail_nth, fds[2]; + char buf[128]; + + system("echo N > /sys/kernel/debug/failslab/ignore-gfp-wait"); + sprintf(buf, "/proc/self/task/%ld/fail-nth", syscall(SYS_gettid)); + fail_nth = open(buf, O_RDWR); + for (i = 0;; i++) { + sprintf(buf, "%d", i); + write(fail_nth, buf, strlen(buf)); + res = socketpair(AF_LOCAL, SOCK_STREAM, 0, fds); + err = errno; + read(fail_nth, buf, 1); + if (res == 0) { + close(fds[0]); + close(fds[1]); + } + printf("%d-th fault %c: res=%d/%d\n", i, buf[0], res, err); + if (buf[0] != 'Y') + break; + } + return 0; +} + +An example output: + +0-th fault Y: res=-1/23 +1-th fault Y: res=-1/23 +2-th fault Y: res=-1/23 +3-th fault Y: res=-1/12 +4-th fault Y: res=-1/12 +5-th fault Y: res=-1/23 +6-th fault Y: res=-1/23 +7-th fault Y: res=-1/23 +8-th fault Y: res=-1/12 +9-th fault Y: res=-1/12 +10-th fault Y: res=-1/12 +11-th fault Y: res=-1/12 +12-th fault Y: res=-1/12 +13-th fault Y: res=-1/12 +14-th fault Y: res=-1/12 +15-th fault Y: res=-1/12 +16-th fault N: res=0/12 diff --git a/fs/proc/base.c b/fs/proc/base.c index f1e1927ccd48..88b773f318cd 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1355,6 +1355,53 @@ static const struct file_operations proc_fault_inject_operations = { .write = proc_fault_inject_write, .llseek = generic_file_llseek, }; + +static ssize_t proc_fail_nth_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task; + int err, n; + + task = get_proc_task(file_inode(file)); + if (!task) + return -ESRCH; +
put_task_struct(task); + if (task != current) + return -EPERM; + err = kstrtoint_from_user(buf, count, 10, &n); + if (err) + return err; + if (n < 0 || n == INT_MAX) + return -EINVAL; + current->fail_nth = n + 1; + return count; +} + +static ssize_t proc_fail_nth_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task; + int err; + + task = get_proc_task(file_inode(file)); + if (!task) + return -ESRCH; + put_task_struct(task); + if (task != current) + return -EPERM; + if (count < 1) + return -EINVAL; + err = put_user((char)(current->fail_nth ? 'N' : 'Y'), buf); + if (err) + return err; + current->fail_nth = 0; + return 1; +} + +static const struct file_operations proc_fail_nth_operations = { + .read = proc_fail_nth_read, + .write = proc_fail_nth_write, +}; #endif @@ -3311,6 +3358,11 @@ static const struct pid_entry tid_base_stuff[] = { #endif #ifdef CONFIG_FAULT_INJECTION REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations), + /* + * Operations on the file check that the task is current, + * so we create it with 0666 to support testing under unprivileged user. 
+ */ + REG("fail-nth", 0666, proc_fail_nth_operations), #endif #ifdef CONFIG_TASK_IO_ACCOUNTING ONE("io", S_IRUSR, proc_tid_io_accounting), diff --git a/include/linux/sched.h b/include/linux/sched.h index 20814b7d7d70..3822d749fc9e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -974,6 +974,7 @@ struct task_struct { #ifdef CONFIG_FAULT_INJECTION int make_it_fail; + int fail_nth; #endif /* * When (nr_dirtied >= nr_dirtied_pause), it's time to call diff --git a/kernel/fork.c b/kernel/fork.c index d2b9d7c31eaf..ade237a96308 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -573,6 +573,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node) kcov_task_init(tsk); +#ifdef CONFIG_FAULT_INJECTION + tsk->fail_nth = 0; +#endif + return tsk; free_stack: diff --git a/lib/fault-inject.c b/lib/fault-inject.c index 4ff157159a0d..09ac73c177fd 100644 --- a/lib/fault-inject.c +++ b/lib/fault-inject.c @@ -107,6 +107,12 @@ static inline bool fail_stacktrace(struct fault_attr *attr) bool should_fail(struct fault_attr *attr, ssize_t size) { + if (in_task() && current->fail_nth) { + if (--current->fail_nth == 0) + goto fail; + return false; + } + /* No need to check any other properties if the probability is 0 */ if (attr->probability == 0) return false; @@ -134,6 +140,7 @@ bool should_fail(struct fault_attr *attr, ssize_t size) if (!fail_stacktrace(attr)) return false; +fail: fail_dump(attr); if (atomic_read(&attr->times) != -1) -- cgit v1.2.3-59-g8ed1b From dcda9b04713c3f6ff0875652924844fae28286ea Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Wed, 12 Jul 2017 14:36:45 -0700 Subject: mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantics in the page allocator. This has been true, but only for allocation requests larger than PAGE_ALLOC_COSTLY_ORDER. It has always been ignored for smaller sizes.
This is a bit unfortunate because there is no way to express the same semantics for those requests, and they are considered too important to fail, so they might end up looping in the page allocator forever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of the __GFP_REPEAT flag has been removed for !costly requests, we can give the original flag a better name and, more importantly, more useful semantics. Let's rename it to __GFP_RETRY_MAYFAIL, which tells the user that the allocator would try really hard but there is no promise of success. This will work independently of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most lightweight mode, which doesn't even kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim. - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic allocation without any attempt to free memory from the current context, but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non-sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow-path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail, but there is no guarantee of that behavior, so failures have to be checked properly by callers (e.g.
an OOM killer victim is currently allowed to fail). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous, especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already relied on these semantics. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer), and a user-defined fallback behavior is more sensible than continuing to retry in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Alex Belits Cc: Chris Wilson Cc: Christoph Hellwig Cc: Darrick J.
Wong Cc: David Daney Cc: Johannes Weiner Cc: Mel Gorman Cc: NeilBrown Cc: Ralf Baechle Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/DMA-ISA-LPC.txt | 2 +- arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 +- arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 +- arch/sparc/kernel/mdesc.c | 2 +- drivers/mmc/host/wbsd.c | 2 +- drivers/s390/char/vmcp.c | 2 +- drivers/target/target_core_transport.c | 2 +- drivers/vhost/net.c | 2 +- drivers/vhost/scsi.c | 2 +- drivers/vhost/vsock.c | 2 +- include/linux/gfp.h | 56 +++++++++++++++++++++------- include/linux/slab.h | 3 +- include/trace/events/mmflags.h | 2 +- mm/hugetlb.c | 4 +- mm/internal.h | 2 +- mm/page_alloc.c | 14 +++++-- mm/sparse-vmemmap.c | 4 +- mm/util.c | 6 +-- mm/vmalloc.c | 2 +- mm/vmscan.c | 8 ++-- net/core/dev.c | 6 +-- net/core/skbuff.c | 2 +- net/sched/sch_fq.c | 2 +- tools/perf/builtin-kmem.c | 2 +- 24 files changed, 86 insertions(+), 47 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt index c41331398752..7a065ac4a9d1 100644 --- a/Documentation/DMA-ISA-LPC.txt +++ b/Documentation/DMA-ISA-LPC.txt @@ -42,7 +42,7 @@ requirements you pass the flag GFP_DMA to kmalloc. Unfortunately the memory available for ISA DMA is scarce so unless you allocate the memory during boot-up it's a good idea to also pass -__GFP_REPEAT and __GFP_NOWARN to make the allocator try a bit harder. +__GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder. (This scarcity also means that you should allocate the buffer as early as possible and not release it until the driver is unloaded.) 
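The guarantee levels listed in the commit message above can be condensed into a small decision function. The userspace C sketch below is a simplified, hypothetical model of what the page allocator slow path decides after one round of direct reclaim. It is not kernel code. The MODEL_GFP_* values mirror the ___GFP_* bits in include/linux/gfp.h from this patch. MODEL_COSTLY_ORDER assumes the mainline PAGE_ALLOC_COSTLY_ORDER value of 3. The enum and function names are invented for illustration. Watermarks, compaction retries and other special cases are deliberately ignored.

```c
#include <stdbool.h>

/* Bit values copied from the ___GFP_* definitions in include/linux/gfp.h. */
#define MODEL_GFP_RETRY_MAYFAIL 0x400u
#define MODEL_GFP_NOFAIL        0x800u
#define MODEL_GFP_NORETRY       0x1000u

/* Assumption: mainline PAGE_ALLOC_COSTLY_ORDER; orders above 3 are "costly". */
#define MODEL_COSTLY_ORDER 3u

enum slowpath_action {
	SLOWPATH_RETRY, /* go around the reclaim/compaction loop again */
	SLOWPATH_OOM,   /* invoke the OOM killer, then retry */
	SLOWPATH_FAIL,  /* return NULL and let the caller fall back */
};

/*
 * Hypothetical model: decide what happens after one round of direct
 * reclaim that did (made_progress) or did not free any memory.
 */
enum slowpath_action next_action(unsigned int gfp, unsigned int order,
				 bool made_progress)
{
	bool costly = order > MODEL_COSTLY_ORDER;

	/* __GFP_NORETRY, and costly requests without __GFP_RETRY_MAYFAIL,
	 * give up after the first round instead of looping. */
	bool backs_off_early = (gfp & MODEL_GFP_NORETRY) ||
			       (costly && !(gfp & MODEL_GFP_RETRY_MAYFAIL));

	if (!backs_off_early) {
		if (made_progress)
			return SLOWPATH_RETRY;
		/* No progress: the default !costly path is de facto nofail
		 * and falls back to the OOM killer ... */
		if (!(gfp & MODEL_GFP_RETRY_MAYFAIL))
			return SLOWPATH_OOM;
		/* ... while __GFP_RETRY_MAYFAIL admits defeat instead. */
	}

	/* __GFP_NOFAIL callers never see a failure; keep looping. */
	return (gfp & MODEL_GFP_NOFAIL) ? SLOWPATH_RETRY : SLOWPATH_FAIL;
}
```

For example, in this model next_action(MODEL_GFP_RETRY_MAYFAIL, 0, false) yields SLOWPATH_FAIL. A plain GFP_KERNEL request of the same order reaches SLOWPATH_OOM instead. That difference is the behavioral change this patch makes in __alloc_pages_may_oom().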
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h index 20b1485ff1e8..e2329db9d6f4 100644 --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h @@ -56,7 +56,7 @@ static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm) return (pgd_t *)__get_free_page(pgtable_gfp_flags(mm, PGALLOC_GFP)); #else struct page *page; - page = alloc_pages(pgtable_gfp_flags(mm, PGALLOC_GFP | __GFP_REPEAT), + page = alloc_pages(pgtable_gfp_flags(mm, PGALLOC_GFP | __GFP_RETRY_MAYFAIL), 4); if (!page) return NULL; diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 710e491206ed..8cb0190e2a73 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -93,7 +93,7 @@ int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order) } if (!hpt) - hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT + hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_RETRY_MAYFAIL |__GFP_NOWARN, order - PAGE_SHIFT); if (!hpt) diff --git a/arch/sparc/kernel/mdesc.c b/arch/sparc/kernel/mdesc.c index e4b4e790bf89..fa466ce45bc9 100644 --- a/arch/sparc/kernel/mdesc.c +++ b/arch/sparc/kernel/mdesc.c @@ -205,7 +205,7 @@ static struct mdesc_handle *mdesc_kmalloc(unsigned int mdesc_size) handle_size = (sizeof(struct mdesc_handle) - sizeof(struct mdesc_hdr) + mdesc_size); - base = kmalloc(handle_size + 15, GFP_KERNEL | __GFP_REPEAT); + base = kmalloc(handle_size + 15, GFP_KERNEL | __GFP_RETRY_MAYFAIL); if (!base) return NULL; diff --git a/drivers/mmc/host/wbsd.c b/drivers/mmc/host/wbsd.c index e15a9733fcfd..9668616faf16 100644 --- a/drivers/mmc/host/wbsd.c +++ b/drivers/mmc/host/wbsd.c @@ -1386,7 +1386,7 @@ static void wbsd_request_dma(struct wbsd_host *host, int dma) * order for ISA to be able to DMA to it. 
*/ host->dma_buffer = kmalloc(WBSD_DMA_SIZE, - GFP_NOIO | GFP_DMA | __GFP_REPEAT | __GFP_NOWARN); + GFP_NOIO | GFP_DMA | __GFP_RETRY_MAYFAIL | __GFP_NOWARN); if (!host->dma_buffer) goto free; diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c index 65f5a794f26d..98749fa817da 100644 --- a/drivers/s390/char/vmcp.c +++ b/drivers/s390/char/vmcp.c @@ -98,7 +98,7 @@ vmcp_write(struct file *file, const char __user *buff, size_t count, } if (!session->response) session->response = (char *)__get_free_pages(GFP_KERNEL - | __GFP_REPEAT | GFP_DMA, + | __GFP_RETRY_MAYFAIL | GFP_DMA, get_order(session->bufsize)); if (!session->response) { mutex_unlock(&session->mutex); diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index f1b3a46bdcaf..1bdc10651bcd 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -252,7 +252,7 @@ int transport_alloc_session_tags(struct se_session *se_sess, int rc; se_sess->sess_cmd_map = kzalloc(tag_num * tag_size, - GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); + GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL); if (!se_sess->sess_cmd_map) { se_sess->sess_cmd_map = vzalloc(tag_num * tag_size); if (!se_sess->sess_cmd_map) { diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index e3d7ea1288c6..06d044862e58 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -897,7 +897,7 @@ static int vhost_net_open(struct inode *inode, struct file *f) struct sk_buff **queue; int i; - n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_REPEAT); + n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL); if (!n) return -ENOMEM; vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof(*vqs), GFP_KERNEL); diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index fd6c8b66f06f..ff02a942c4d5 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1404,7 +1404,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) struct vhost_virtqueue **vqs; int r = -ENOMEM, i; 
- vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); + vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL); if (!vs) { vs = vzalloc(sizeof(*vs)); if (!vs) diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c index 3f63e03de8e8..c9de9c41aa97 100644 --- a/drivers/vhost/vsock.c +++ b/drivers/vhost/vsock.c @@ -508,7 +508,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file) /* This struct is large and allocation could fail, fall back to vmalloc * if there is no other way. */ - vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_REPEAT); + vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_RETRY_MAYFAIL); if (!vsock) return -ENOMEM; diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 4c6656f1fee7..bcfb9f7c46f5 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -25,7 +25,7 @@ struct vm_area_struct; #define ___GFP_FS 0x80u #define ___GFP_COLD 0x100u #define ___GFP_NOWARN 0x200u -#define ___GFP_REPEAT 0x400u +#define ___GFP_RETRY_MAYFAIL 0x400u #define ___GFP_NOFAIL 0x800u #define ___GFP_NORETRY 0x1000u #define ___GFP_MEMALLOC 0x2000u @@ -136,26 +136,56 @@ struct vm_area_struct; * * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim. * - * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt - * _might_ fail. This depends upon the particular VM implementation. + * The default allocator behavior depends on the request size. We have a concept + * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER). + * !costly allocations are too essential to fail so they are implicitly + * non-failing by default (with some exceptions like OOM victims might fail so + * the caller still has to check for failures) while costly requests try to be + * not disruptive and back off even without invoking the OOM killer. 
+ * The following three modifiers might be used to override some of these + * implicit rules + * + * __GFP_NORETRY: The VM implementation will try only very lightweight + * memory direct reclaim to get some memory under memory pressure (thus + * it can sleep). It will avoid disruptive actions like OOM killer. The + * caller must handle the failure which is quite likely to happen under + * heavy memory pressure. The flag is suitable when failure can easily be + * handled at small cost, such as reduced throughput + * + * __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim + * procedures that have previously failed if there is some indication + * that progress has been made else where. It can wait for other + * tasks to attempt high level approaches to freeing memory such as + * compaction (which removes fragmentation) and page-out. + * There is still a definite limit to the number of retries, but it is + * a larger limit than with __GFP_NORETRY. + * Allocations with this flag may fail, but only when there is + * genuinely little unused memory. While these allocations do not + * directly trigger the OOM killer, their failure indicates that + * the system is likely to need to use the OOM killer soon. The + * caller must handle failure, but can reasonably do so by failing + * a higher-level request, or completing it only in a much less + * efficient manner. + * If the allocation does fail, and the caller is in a position to + * free some non-essential memory, doing so could benefit the system + * as a whole. * * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller - * cannot handle allocation failures. New users should be evaluated carefully - * (and the flag should be used only when there is no reasonable failure - * policy) but it is definitely preferable to use the flag rather than - * opencode endless loop around allocator. 
- * - * __GFP_NORETRY: The VM implementation must not retry indefinitely and will - * return NULL when direct reclaim and memory compaction have failed to allow - * the allocation to succeed. The OOM killer is not called with the current - * implementation. + * cannot handle allocation failures. The allocation could block + * indefinitely but will never return with failure. Testing for + * failure is pointless. + * New users should be evaluated carefully (and the flag should be + * used only when there is no reasonable failure policy) but it is + * definitely preferable to use the flag rather than opencode endless + * loop around allocator. + * Using this flag for costly allocations is _highly_ discouraged. */ #define __GFP_IO ((__force gfp_t)___GFP_IO) #define __GFP_FS ((__force gfp_t)___GFP_FS) #define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */ #define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */ #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM)) -#define __GFP_REPEAT ((__force gfp_t)___GFP_REPEAT) +#define __GFP_RETRY_MAYFAIL ((__force gfp_t)___GFP_RETRY_MAYFAIL) #define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL) #define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY) diff --git a/include/linux/slab.h b/include/linux/slab.h index 04a7f7993e67..41473df6dfb0 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -471,7 +471,8 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags) * * %__GFP_NOWARN - If allocation fails, don't issue any warnings. * - * %__GFP_REPEAT - If allocation fails initially, try once more before failing. + * %__GFP_RETRY_MAYFAIL - Try really hard to succeed the allocation but fail + * eventually. * * There are other flags available as well, but these are not intended * for general use, and so are not documented here. 
For a full list of diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 10e3663a75a6..8e50d01c645f 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -34,7 +34,7 @@ {(unsigned long)__GFP_FS, "__GFP_FS"}, \ {(unsigned long)__GFP_COLD, "__GFP_COLD"}, \ {(unsigned long)__GFP_NOWARN, "__GFP_NOWARN"}, \ - {(unsigned long)__GFP_REPEAT, "__GFP_REPEAT"}, \ + {(unsigned long)__GFP_RETRY_MAYFAIL, "__GFP_RETRY_MAYFAIL"}, \ {(unsigned long)__GFP_NOFAIL, "__GFP_NOFAIL"}, \ {(unsigned long)__GFP_NORETRY, "__GFP_NORETRY"}, \ {(unsigned long)__GFP_COMP, "__GFP_COMP"}, \ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 1e516520433d..bc48ee783dd9 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1384,7 +1384,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid) page = __alloc_pages_node(nid, htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE| - __GFP_REPEAT|__GFP_NOWARN, + __GFP_RETRY_MAYFAIL|__GFP_NOWARN, huge_page_order(h)); if (page) { prep_new_huge_page(h, page, nid); @@ -1525,7 +1525,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h, { int order = huge_page_order(h); - gfp_mask |= __GFP_COMP|__GFP_REPEAT|__GFP_NOWARN; + gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN; if (nid == NUMA_NO_NODE) nid = numa_mem_id(); return __alloc_pages_nodemask(gfp_mask, order, nid, nmask); diff --git a/mm/internal.h b/mm/internal.h index 0e4f558412fb..24d88f084705 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -23,7 +23,7 @@ * hints such as HIGHMEM usage. 
*/ #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\ - __GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\ + __GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\ __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\ __GFP_ATOMIC) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 64b7d82a9b1a..6d30e914afb6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3284,6 +3284,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* The OOM killer will not help higher order allocs */ if (order > PAGE_ALLOC_COSTLY_ORDER) goto out; + /* + * We have already exhausted all our reclaim opportunities without any + * success so it is time to admit defeat. We will skip the OOM killer + * because it is very likely that the caller has a more reasonable + * fallback than shooting a random task. + */ + if (gfp_mask & __GFP_RETRY_MAYFAIL) + goto out; /* The OOM killer does not needlessly kill tasks for lowmem */ if (ac->high_zoneidx < ZONE_NORMAL) goto out; @@ -3413,7 +3421,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags, } /* - * !costly requests are much more important than __GFP_REPEAT + * !costly requests are much more important than __GFP_RETRY_MAYFAIL * costly ones because they are de facto nofail and invoke OOM * killer to move on while costly can fail and users are ready * to cope with that. 
1/4 retries is rather arbitrary but we @@ -3920,9 +3928,9 @@ retry: /* * Do not retry costly high order allocations unless they are - * __GFP_REPEAT + * __GFP_RETRY_MAYFAIL */ - if (costly_order && !(gfp_mask & __GFP_REPEAT)) + if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL)) goto nopage; if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags, diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index a56c3989f773..c50b1a14d55e 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -56,11 +56,11 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node) if (node_state(node, N_HIGH_MEMORY)) page = alloc_pages_node( - node, GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT, + node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL, get_order(size)); else page = alloc_pages( - GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT, + GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL, get_order(size)); if (page) return page_address(page); diff --git a/mm/util.c b/mm/util.c index 26be6407abd7..6520f2d4a226 100644 --- a/mm/util.c +++ b/mm/util.c @@ -339,7 +339,7 @@ EXPORT_SYMBOL(vm_mmap); * Uses kmalloc to get the memory but if the allocation fails then falls back * to the vmalloc allocator. Use kvfree for freeing the memory. * - * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_REPEAT + * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL * is supported only for large (>32kB) allocations, and it should be used only if * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks. * @@ -367,11 +367,11 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node) kmalloc_flags |= __GFP_NOWARN; /* - * We have to override __GFP_REPEAT by __GFP_NORETRY for !costly + * We have to override __GFP_RETRY_MAYFAIL by __GFP_NORETRY for !costly * requests because there is no other way to tell the allocator * that we want to fail rather than retry endlessly. 
*/ - if (!(kmalloc_flags & __GFP_REPEAT) || + if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL) || (size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) kmalloc_flags |= __GFP_NORETRY; } diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 6016ab079e2b..8698c1c86c4d 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1795,7 +1795,7 @@ fail: * allocator with @gfp_mask flags. Map them into contiguous * kernel virtual space, using a pagetable protection of @prot. * - * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT + * Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_RETRY_MAYFAIL * and __GFP_NOFAIL are not supported * * Any use of gfp flags outside of GFP_KERNEL should be consulted diff --git a/mm/vmscan.c b/mm/vmscan.c index e9210f825219..a1af041930a6 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2506,18 +2506,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat, return false; /* Consider stopping depending on scan and reclaim activity */ - if (sc->gfp_mask & __GFP_REPEAT) { + if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) { /* - * For __GFP_REPEAT allocations, stop reclaiming if the + * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the * full LRU list has been scanned and we are still failing * to reclaim pages. This full LRU scan is potentially - * expensive but a __GFP_REPEAT caller really wants to succeed + * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed */ if (!nr_reclaimed && !nr_scanned) return false; } else { /* - * For non-__GFP_REPEAT allocations which can presumably + * For non-__GFP_RETRY_MAYFAIL allocations which can presumably * fail without consequence, stop if we failed to reclaim * any pages from the last SWAP_CLUSTER_MAX number of * pages that were scanned. 
This will return to the diff --git a/net/core/dev.c b/net/core/dev.c index 02440518dd69..8515f8fe0460 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -7384,7 +7384,7 @@ static int netif_alloc_rx_queues(struct net_device *dev) BUG_ON(count < 1); - rx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT); + rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL); if (!rx) return -ENOMEM; @@ -7424,7 +7424,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev) if (count < 1 || count > 0xffff) return -EINVAL; - tx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT); + tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL); if (!tx) return -ENOMEM; @@ -7965,7 +7965,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, /* ensure 32-byte alignment of whole construct */ alloc_size += NETDEV_ALIGN - 1; - p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_REPEAT); + p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL); if (!p) return NULL; diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 8b11341ed69a..f990eb8b30a9 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4747,7 +4747,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len, gfp_head = gfp_mask; if (gfp_head & __GFP_DIRECT_RECLAIM) - gfp_head |= __GFP_REPEAT; + gfp_head |= __GFP_RETRY_MAYFAIL; *errcode = -ENOBUFS; skb = alloc_skb(header_len, gfp_head); diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c index 147fde73a0f5..263d16e3219e 100644 --- a/net/sched/sch_fq.c +++ b/net/sched/sch_fq.c @@ -648,7 +648,7 @@ static int fq_resize(struct Qdisc *sch, u32 log) return 0; /* If XPS was setup, we can allocate memory on right NUMA node */ - array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_REPEAT, + array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_RETRY_MAYFAIL, netdev_queue_numa_node_read(sch->dev_queue)); if (!array) return -ENOMEM; diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c index 0a8a1c45af87..a1497c516d85 100644 --- 
a/tools/perf/builtin-kmem.c +++ b/tools/perf/builtin-kmem.c @@ -643,7 +643,7 @@ static const struct { { "__GFP_FS", "F" }, { "__GFP_COLD", "CO" }, { "__GFP_NOWARN", "NWR" }, - { "__GFP_REPEAT", "R" }, + { "__GFP_RETRY_MAYFAIL", "R" }, { "__GFP_NOFAIL", "NF" }, { "__GFP_NORETRY", "NR" }, { "__GFP_COMP", "C" }, -- cgit v1.2.3-59-g8ed1b From efc479e6900c22bad9a2b649d13405ed9cde2d53 Mon Sep 17 00:00:00 2001 From: Roman Kagan Date: Thu, 22 Jun 2017 16:51:01 +0300 Subject: kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit There is a flaw in the Hyper-V SynIC implementation in KVM: when the message page or the event flags page is enabled by setting the corresponding MSR, KVM zeroes it out. This is problematic because on migration the corresponding MSRs are loaded on the destination, so the content of those pages is lost. This has gone unnoticed so far because the only users of those pages were the in-KVM Hyper-V SynIC timers, which could continue working despite that zeroing. Newer QEMU uses those pages for its Hyper-V VMBus implementation, and zeroing them breaks migration. Besides, in newer QEMU the content of those pages is fully managed by QEMU, so zeroing them is undesirable even when the MSRs are written from the guest side. To support this new scheme, introduce a new capability, KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the SynIC pages aren't zeroed out by KVM.
Signed-off-by: Roman Kagan Signed-off-by: Radim Krčmář --- Documentation/virtual/kvm/api.txt | 9 +++++++++ arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/hyperv.c | 13 +++++++++---- arch/x86/kvm/hyperv.h | 2 +- arch/x86/kvm/x86.c | 7 ++++++- include/uapi/linux/kvm.h | 1 + 6 files changed, 27 insertions(+), 6 deletions(-) (limited to 'Documentation') diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 3a9831b72945..78ac577c9378 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -4329,3 +4329,12 @@ Querying this capability returns a bitmap indicating the possible virtual SMT modes that can be set using KVM_CAP_PPC_SMT. If bit N (counting from the right) is set, then a virtual SMT mode of 2^N is available. + +8.11 KVM_CAP_HYPERV_SYNIC2 + +Architectures: x86 + +This capability enables a newer version of Hyper-V Synthetic interrupt +controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM +doesn't clear SynIC message and event flags pages when they are enabled by +writing to the respective MSRs. 
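The api.txt text above and the x86.c and hyperv.c hunks in this patch together boil down to two checks. The userspace C sketch below models them. struct synic_model, enum synic_cap and both functions are hypothetical names, not KVM interfaces. The real ioctl path performs further validation (for example on cap->flags) that is omitted here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for KVM_CAP_HYPERV_SYNIC and KVM_CAP_HYPERV_SYNIC2. */
enum synic_cap { SYNIC_CAP_V1, SYNIC_CAP_V2 };

/* Models the two fields of struct kvm_vcpu_hv_synic that matter here. */
struct synic_model {
	bool active;
	bool dont_zero_synic_pages;
};

/*
 * Models the KVM_ENABLE_CAP switch in kvm_vcpu_ioctl_enable_cap():
 * SYNIC2 rejects a non-zero args[0], then falls through to the SYNIC
 * case, which requires an in-kernel irqchip.
 */
int model_enable_synic(struct synic_model *s, enum synic_cap cap,
		       uint64_t arg0, bool irqchip_in_kernel)
{
	if (cap == SYNIC_CAP_V2 && arg0 != 0)
		return -EINVAL;
	if (!irqchip_in_kernel)
		return -EINVAL;
	s->active = true;
	s->dont_zero_synic_pages = (cap == SYNIC_CAP_V2);
	return 0;
}

/*
 * Models the condition guarding kvm_clear_guest() for the SIEFP and
 * SIMP MSRs in synic_set_msr().
 */
bool model_should_clear_page(const struct synic_model *s, bool enable_bit,
			     bool host_initiated)
{
	return enable_bit && !host_initiated && !s->dont_zero_synic_pages;
}
```

In this model, a guest write that enables the message page still clears it under the original capability, but never under KVM_CAP_HYPERV_SYNIC2. Host-initiated writes never clear it under either capability; an example is userspace restoring the MSRs on the migration destination.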
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index ef37d0dc61bd..9d8de5dd7546 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -462,6 +462,7 @@ struct kvm_vcpu_hv_synic { DECLARE_BITMAP(auto_eoi_bitmap, 256); DECLARE_BITMAP(vec_bitmap, 256); bool active; + bool dont_zero_synic_pages; }; /* Hyper-V per vcpu emulation context */ diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index ebae57ac5902..a8084406707e 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -221,7 +221,8 @@ static int synic_set_msr(struct kvm_vcpu_hv_synic *synic, synic->version = data; break; case HV_X64_MSR_SIEFP: - if (data & HV_SYNIC_SIEFP_ENABLE) + if ((data & HV_SYNIC_SIEFP_ENABLE) && !host && + !synic->dont_zero_synic_pages) if (kvm_clear_guest(vcpu->kvm, data & PAGE_MASK, PAGE_SIZE)) { ret = 1; @@ -232,7 +233,8 @@ static int synic_set_msr(struct kvm_vcpu_hv_synic *synic, synic_exit(synic, msr); break; case HV_X64_MSR_SIMP: - if (data & HV_SYNIC_SIMP_ENABLE) + if ((data & HV_SYNIC_SIMP_ENABLE) && !host && + !synic->dont_zero_synic_pages) if (kvm_clear_guest(vcpu->kvm, data & PAGE_MASK, PAGE_SIZE)) { ret = 1; @@ -687,14 +689,17 @@ void kvm_hv_vcpu_init(struct kvm_vcpu *vcpu) stimer_init(&hv_vcpu->stimer[i], i); } -int kvm_hv_activate_synic(struct kvm_vcpu *vcpu) +int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages) { + struct kvm_vcpu_hv_synic *synic = vcpu_to_synic(vcpu); + /* * Hyper-V SynIC auto EOI SINT's are * not compatible with APICV, so deactivate APICV */ kvm_vcpu_deactivate_apicv(vcpu); - vcpu_to_synic(vcpu)->active = true; + synic->active = true; + synic->dont_zero_synic_pages = dont_zero_synic_pages; return 0; } diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index cd1119538add..12f65fe1011d 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -56,7 +56,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu); void kvm_hv_irq_routing_update(struct 
kvm *kvm); int kvm_hv_synic_set_irq(struct kvm *kvm, u32 vcpu_id, u32 sint); void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector); -int kvm_hv_activate_synic(struct kvm_vcpu *vcpu); +int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages); void kvm_hv_vcpu_init(struct kvm_vcpu *vcpu); void kvm_hv_vcpu_uninit(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 08aa5e442aa7..4f41c5222ecd 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2659,6 +2659,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_HYPERV_VAPIC: case KVM_CAP_HYPERV_SPIN: case KVM_CAP_HYPERV_SYNIC: + case KVM_CAP_HYPERV_SYNIC2: case KVM_CAP_PCI_SEGMENT: case KVM_CAP_DEBUGREGS: case KVM_CAP_X86_ROBUST_SINGLESTEP: @@ -3382,10 +3383,14 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu, return -EINVAL; switch (cap->cap) { + case KVM_CAP_HYPERV_SYNIC2: + if (cap->args[0]) + return -EINVAL; case KVM_CAP_HYPERV_SYNIC: if (!irqchip_in_kernel(vcpu->kvm)) return -EINVAL; - return kvm_hv_activate_synic(vcpu); + return kvm_hv_activate_synic(vcpu, cap->cap == + KVM_CAP_HYPERV_SYNIC2); default: return -EINVAL; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index ebd604c222d8..38b2cfbc8112 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -927,6 +927,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_S390_CMMA_MIGRATION 145 #define KVM_CAP_PPC_FWNMI 146 #define KVM_CAP_PPC_SMT_POSSIBLE 147 +#define KVM_CAP_HYPERV_SYNIC2 148 #ifdef KVM_CAP_IRQ_ROUTING -- cgit v1.2.3-59-g8ed1b From 7d84120b5ba61912a5333f5fe2c4e8f35ef9514f Mon Sep 17 00:00:00 2001 From: Rafał Miłecki Date: Sun, 25 Jun 2017 13:11:54 +0200 Subject: Documentation: ABI: mtd: describe "offset" more precisely MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit So far Linux supported only two levels of MTD devices so we didn't need a very precise description for this sysfs 
file. With commit 97519dc52b44a ("mtd: partitions: add support for subpartitions") there is support for a tree structure, so we should have a more precise description. Using "parent" and "flash device" makes it more accurate. Signed-off-by: Rafał Miłecki Signed-off-by: Brian Norris --- Documentation/ABI/testing/sysfs-class-mtd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-class-mtd b/Documentation/ABI/testing/sysfs-class-mtd index 3b5c3bca9186..f34e592301d1 100644 --- a/Documentation/ABI/testing/sysfs-class-mtd +++ b/Documentation/ABI/testing/sysfs-class-mtd @@ -229,6 +229,6 @@ KernelVersion: 4.1 Contact: linux-mtd@lists.infradead.org Description: For a partition, the offset of that partition from the start - of the master device in bytes. This attribute is absent on - main devices, so it can be used to distinguish between - partitions and devices that aren't partitions. + of the parent (another partition or a flash device) in bytes. + This attribute is absent on flash devices, so it can be used + to distinguish them from partitions. -- cgit v1.2.3-59-g8ed1b From 7228b66aaf723a623e578aa4db7d083bb39546c9 Mon Sep 17 00:00:00 2001 From: Mat Martineau Date: Thu, 13 Jul 2017 13:17:03 +0100 Subject: KEYS: Add documentation for asymmetric keyring restrictions Provide more specific examples of keyring restrictions as applied to X.509 signature chain verification.
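The "chain" option described in this patch's asymmetric-keys.txt hunk can be modelled with a toy signer lookup. The C sketch below is hypothetical: struct cert, struct keyring and restricted_link() are invented names. A certificate is treated as validly signed whenever a key whose subject matches its issuer is found. The kernel, by contrast, actually verifies the signature with that key.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_KEYS 16

/* Hypothetical stand-in for an X.509 certificate. */
struct cert {
	const char *subject; /* name this certificate's key is known by */
	const char *issuer;  /* name of the key that signed it */
};

/* Hypothetical fixed-size keyring. */
struct keyring {
	const struct cert *keys[MODEL_MAX_KEYS];
	size_t n;
};

static bool keyring_has_subject(const struct keyring *kr, const char *name)
{
	for (size_t i = 0; i < kr->n; i++)
		if (strcmp(kr->keys[i]->subject, name) == 0)
			return true;
	return false;
}

/*
 * Models a keyring restricted with "key_or_keyring:$root_id[:chain]":
 * the link succeeds only if the signer is in the trusted keyring or,
 * with chain, already linked into the destination keyring. Signature
 * checking itself is omitted.
 */
bool restricted_link(struct keyring *dest, const struct keyring *trusted,
		     bool chain, const struct cert *c)
{
	bool signer_found = keyring_has_subject(trusted, c->issuer) ||
			    (chain && keyring_has_subject(dest, c->issuer));

	if (!signer_found || dest->n == MODEL_MAX_KEYS)
		return false;
	dest->keys[dest->n++] = c;
	return true;
}
```

In this model, linking intermediateB before intermediateA fails because its signer is not yet reachable. This is why the keyctl examples add each certificate starting with the one closest to the root.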
Signed-off-by: Mat Martineau Signed-off-by: David Howells Signed-off-by: James Morris --- Documentation/crypto/asymmetric-keys.txt | 65 ++++++++++++++++++++++++++++---- Documentation/security/keys/core.rst | 6 +++ 2 files changed, 63 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/crypto/asymmetric-keys.txt b/Documentation/crypto/asymmetric-keys.txt index b82b6ad48488..5969bf42562a 100644 --- a/Documentation/crypto/asymmetric-keys.txt +++ b/Documentation/crypto/asymmetric-keys.txt @@ -10,6 +10,7 @@ Contents: - Signature verification. - Asymmetric key subtypes. - Instantiation data parsers. + - Keyring link restrictions. ======== @@ -318,7 +319,8 @@ KEYRING LINK RESTRICTIONS ========================= Keyrings created from userspace using add_key can be configured to check the -signature of the key being linked. +signature of the key being linked. Keys without a valid signature are not +allowed to link. Several restriction methods are available: @@ -327,9 +329,10 @@ Several restriction methods are available: - Option string used with KEYCTL_RESTRICT_KEYRING: - "builtin_trusted" - The kernel builtin trusted keyring will be searched for the signing - key. The ca_keys kernel parameter also affects which keys are used for - signature verification. + The kernel builtin trusted keyring will be searched for the signing key. + If the builtin trusted keyring is not configured, all links will be + rejected. The ca_keys kernel parameter also affects which keys are used + for signature verification. (2) Restrict using the kernel builtin and secondary trusted keyrings @@ -337,8 +340,10 @@ Several restriction methods are available: - "builtin_and_secondary_trusted" The kernel builtin and secondary trusted keyrings will be searched for the - signing key. The ca_keys kernel parameter also affects which keys are used - for signature verification. + signing key. 
If the secondary trusted keyring is not configured, this + restriction will behave like the "builtin_trusted" option. The ca_keys + kernel parameter also affects which keys are used for signature + verification. (3) Restrict using a separate key or keyring @@ -346,7 +351,7 @@ Several restriction methods are available: - "key_or_keyring:[:chain]" Whenever a key link is requested, the link will only succeed if the key - being linked is signed by one of the designated keys. This key may be + being linked is signed by one of the designated keys. This key may be specified directly by providing a serial number for one asymmetric key, or a group of keys may be searched for the signing key by providing the serial number for a keyring. @@ -354,7 +359,51 @@ Several restriction methods are available: When the "chain" option is provided at the end of the string, the keys within the destination keyring will also be searched for signing keys. This allows for verification of certificate chains by adding each - cert in order (starting closest to the root) to one keyring. + certificate in order (starting closest to the root) to a keyring. For + instance, one keyring can be populated with links to a set of root + certificates, with a separate, restricted keyring set up for each + certificate chain to be validated: + + # Create and populate a keyring for root certificates + root_id=`keyctl add keyring root-certs "" @s` + keyctl padd asymmetric "" $root_id < root1.cert + keyctl padd asymmetric "" $root_id < root2.cert + + # Create and restrict a keyring for the certificate chain + chain_id=`keyctl add keyring chain "" @s` + keyctl restrict_keyring $chain_id asymmetric key_or_keyring:$root_id:chain + + # Attempt to add each certificate in the chain, starting with the + # certificate closest to the root. 
+ keyctl padd asymmetric "" $chain_id < intermediateA.cert + keyctl padd asymmetric "" $chain_id < intermediateB.cert + keyctl padd asymmetric "" $chain_id < end-entity.cert + + If the final end-entity certificate is successfully added to the "chain" + keyring, we can be certain that it has a valid signing chain going back to + one of the root certificates. + + A single keyring can be used to verify a chain of signatures by + restricting the keyring after linking the root certificate: + + # Create a keyring for the certificate chain and add the root + chain2_id=`keyctl add keyring chain2 "" @s` + keyctl padd asymmetric "" $chain2_id < root1.cert + + # Restrict the keyring that already has root1.cert linked. The cert + # will remain linked by the keyring. + keyctl restrict_keyring $chain2_id asymmetric key_or_keyring:0:chain + + # Attempt to add each certificate in the chain, starting with the + # certificate closest to the root. + keyctl padd asymmetric "" $chain2_id < intermediateA.cert + keyctl padd asymmetric "" $chain2_id < intermediateB.cert + keyctl padd asymmetric "" $chain2_id < end-entity.cert + + If the final end-entity certificate is successfully added to the "chain2" + keyring, we can be certain that there is a valid signing chain going back + to the root certificate that was added before the keyring was restricted. + In all of these cases, if the signing key is found the signature of the key to be linked will be verified using the signing key. The requested key is added diff --git a/Documentation/security/keys/core.rst b/Documentation/security/keys/core.rst index 0d831a7afe4f..1648fa80b3bf 100644 --- a/Documentation/security/keys/core.rst +++ b/Documentation/security/keys/core.rst @@ -894,6 +894,12 @@ The keyctl syscall functions are: To apply a keyring restriction the process must have Set Attribute permission and the keyring must not be previously restricted. 
+ One application of restricted keyrings is to verify X.509 certificate + chains or individual certificate signatures using the asymmetric key type. + See Documentation/crypto/asymmetric-keys.txt for specific restrictions + applicable to the asymmetric key type. + + Kernel Services =============== -- cgit v1.2.3-59-g8ed1b From 52a5c155cf79f1f059bffebf4d06d0249573e659 Mon Sep 17 00:00:00 2001 From: Wanpeng Li Date: Thu, 13 Jul 2017 18:30:42 -0700 Subject: KVM: async_pf: Let guest support delivery of async_pf from guest mode MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1, async page faults are delivered to L1 as #PF vmexits; if bit 2 is 0, kvm_can_do_async_pf returns 0 if in guest mode. This is similar to what svm.c wanted to do all along, but it is only enabled for Linux as L1 hypervisor. Foreign hypervisors must never receive async page faults as vmexits, because they'd probably be very confused about that. Cc: Paolo Bonzini Cc: Radim Krčmář Signed-off-by: Wanpeng Li Signed-off-by: Radim Krčmář --- Documentation/virtual/kvm/msr.txt | 5 +++-- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/include/uapi/asm/kvm_para.h | 1 + arch/x86/kernel/kvm.c | 7 ++++++- arch/x86/kvm/mmu.c | 2 +- arch/x86/kvm/vmx.c | 2 +- arch/x86/kvm/x86.c | 5 +++-- 7 files changed, 16 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt index 0a9ea515512a..1ebecc115dc6 100644 --- a/Documentation/virtual/kvm/msr.txt +++ b/Documentation/virtual/kvm/msr.txt @@ -166,10 +166,11 @@ MSR_KVM_SYSTEM_TIME: 0x12 MSR_KVM_ASYNC_PF_EN: 0x4b564d02 data: Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area which must be in guest RAM and must be - zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1 + zeroed. Bits 5-3 are reserved and should be zero. 
Bit 0 is 1 when asynchronous page faults are enabled on the vcpu 0 when disabled. Bit 1 is 1 if asynchronous page faults can be injected - when vcpu is in cpl == 0. + when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults + are delivered to L1 as #PF vmexits. First 4 byte of 64 byte memory location will be written to by the hypervisor at the time of asynchronous page fault (APF) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5e9ac508f718..da3261e384d3 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -653,6 +653,7 @@ struct kvm_vcpu_arch { bool send_user_only; u32 host_apf_reason; unsigned long nested_apf_token; + bool delivery_as_pf_vmexit; } apf; /* OSVW MSRs (AMD only) */ diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index cff0bb6556f8..a965e5b0d328 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -67,6 +67,7 @@ struct kvm_clock_pairing { #define KVM_ASYNC_PF_ENABLED (1 << 0) #define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1) +#define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT (1 << 2) /* Operations for KVM_HC_MMU_OP */ #define KVM_MMU_OP_WRITE_PTE 1 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 43e10d6fdbed..71c17a5be983 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -330,7 +330,12 @@ static void kvm_guest_cpu_init(void) #ifdef CONFIG_PREEMPT pa |= KVM_ASYNC_PF_SEND_ALWAYS; #endif - wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED); + pa |= KVM_ASYNC_PF_ENABLED; + + /* Async page fault support for L1 hypervisor is optional */ + if (wrmsr_safe(MSR_KVM_ASYNC_PF_EN, + (pa | KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT) & 0xffffffff, pa >> 32) < 0) + wrmsrl(MSR_KVM_ASYNC_PF_EN, pa); __this_cpu_write(apf_reason.enabled, 1); printk(KERN_INFO"KVM setup async PF for cpu %d\n", smp_processor_id()); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 3825a35cd752..9b1dd114956a 
100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3749,7 +3749,7 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu) kvm_event_needs_reinjection(vcpu))) return false; - if (is_guest_mode(vcpu)) + if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu)) return false; return kvm_x86_ops->interrupt_allowed(vcpu); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 5a3bb1a697a2..84e62acf2dd8 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8037,7 +8037,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) if (is_nmi(intr_info)) return false; else if (is_page_fault(intr_info)) - return enable_ept; + return !vmx->vcpu.arch.apf.host_apf_reason && enable_ept; else if (is_no_device(intr_info) && !(vmcs12->guest_cr0 & X86_CR0_TS)) return false; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f3f10154c133..6753f0982791 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2063,8 +2063,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data) { gpa_t gpa = data & ~0x3f; - /* Bits 2:5 are reserved, Should be zero */ - if (data & 0x3c) + /* Bits 3:5 are reserved, Should be zero */ + if (data & 0x38) return 1; vcpu->arch.apf.msr_val = data; @@ -2080,6 +2080,7 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data) return 1; vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS); + vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT; kvm_async_pf_wakeup_all(vcpu); return 0; } -- cgit v1.2.3-59-g8ed1b From d3457c877b14aaee8c52923eedf05a3b78af0476 Mon Sep 17 00:00:00 2001 From: Roman Kagan Date: Fri, 14 Jul 2017 17:13:20 +0300 Subject: kvm: x86: hyperv: make VP_INDEX managed by userspace MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Hyper-V identifies vCPUs by Virtual Processor Index, which can be queried via HV_X64_MSR_VP_INDEX msr. 
It is defined by the spec as a sequential number which can't exceed the maximum number of vCPUs per VM. APIC ids can be sparse and thus aren't a valid replacement for VP indices. Current KVM uses its internal vcpu index as VP_INDEX. However, to make it predictable and persistent across VM migrations, the userspace has to control the value of VP_INDEX. This patch achieves that, by storing vp_index explicitly on vcpu, and allowing HV_X64_MSR_VP_INDEX to be set from the host side. For compatibility it's initialized to KVM vcpu index. Also a few variables are renamed to make a clear distinction between this Hyper-V vp_index and KVM vcpu_id (== APIC id). Besides, a new capability, KVM_CAP_HYPERV_VP_INDEX, is added to allow the userspace to skip attempting msr writes where unsupported, to avoid spamming error logs. Signed-off-by: Roman Kagan Signed-off-by: Radim Krčmář --- Documentation/virtual/kvm/api.txt | 9 +++++++ arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/hyperv.c | 54 +++++++++++++++++++++++++-------------- arch/x86/kvm/hyperv.h | 1 + arch/x86/kvm/x86.c | 3 +++ include/uapi/linux/kvm.h | 1 + 6 files changed, 50 insertions(+), 19 deletions(-) (limited to 'Documentation') diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 78ac577c9378..e63a35fafef0 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -4338,3 +4338,12 @@ This capability enables a newer version of Hyper-V Synthetic interrupt controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM doesn't clear SynIC message and event flags pages when they are enabled by writing to the respective MSRs. + +8.12 KVM_CAP_HYPERV_VP_INDEX + +Architectures: x86 + +This capability indicates that userspace can load HV_X64_MSR_VP_INDEX msr. Its +value is used to denote the target vcpu for a SynIC interrupt. For +compatibility, KVM initializes this msr to KVM's internal vcpu index.
When this +capability is absent, userspace can still query this msr's value. diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index da3261e384d3..87ac4fba6d8e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -467,6 +467,7 @@ struct kvm_vcpu_hv_synic { /* Hyper-V per vcpu emulation context */ struct kvm_vcpu_hv { + u32 vp_index; u64 hv_vapic; s64 runtime_offset; struct kvm_vcpu_hv_synic synic; diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index a8084406707e..2695a34fa1c5 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -106,14 +106,27 @@ static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint, return 0; } -static struct kvm_vcpu_hv_synic *synic_get(struct kvm *kvm, u32 vcpu_id) +static struct kvm_vcpu *get_vcpu_by_vpidx(struct kvm *kvm, u32 vpidx) +{ + struct kvm_vcpu *vcpu = NULL; + int i; + + if (vpidx < KVM_MAX_VCPUS) + vcpu = kvm_get_vcpu(kvm, vpidx); + if (vcpu && vcpu_to_hv_vcpu(vcpu)->vp_index == vpidx) + return vcpu; + kvm_for_each_vcpu(i, vcpu, kvm) + if (vcpu_to_hv_vcpu(vcpu)->vp_index == vpidx) + return vcpu; + return NULL; +} + +static struct kvm_vcpu_hv_synic *synic_get(struct kvm *kvm, u32 vpidx) { struct kvm_vcpu *vcpu; struct kvm_vcpu_hv_synic *synic; - if (vcpu_id >= atomic_read(&kvm->online_vcpus)) - return NULL; - vcpu = kvm_get_vcpu(kvm, vcpu_id); + vcpu = get_vcpu_by_vpidx(kvm, vpidx); if (!vcpu) return NULL; synic = vcpu_to_synic(vcpu); @@ -320,11 +333,11 @@ static int synic_set_irq(struct kvm_vcpu_hv_synic *synic, u32 sint) return ret; } -int kvm_hv_synic_set_irq(struct kvm *kvm, u32 vcpu_id, u32 sint) +int kvm_hv_synic_set_irq(struct kvm *kvm, u32 vpidx, u32 sint) { struct kvm_vcpu_hv_synic *synic; - synic = synic_get(kvm, vcpu_id); + synic = synic_get(kvm, vpidx); if (!synic) return -EINVAL; @@ -343,11 +356,11 @@ void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector) kvm_hv_notify_acked_sint(vcpu, i); } -static int 
kvm_hv_set_sint_gsi(struct kvm *kvm, u32 vcpu_id, u32 sint, int gsi) +static int kvm_hv_set_sint_gsi(struct kvm *kvm, u32 vpidx, u32 sint, int gsi) { struct kvm_vcpu_hv_synic *synic; - synic = synic_get(kvm, vcpu_id); + synic = synic_get(kvm, vpidx); if (!synic) return -EINVAL; @@ -689,6 +702,13 @@ void kvm_hv_vcpu_init(struct kvm_vcpu *vcpu) stimer_init(&hv_vcpu->stimer[i], i); } +void kvm_hv_vcpu_postcreate(struct kvm_vcpu *vcpu) +{ + struct kvm_vcpu_hv *hv_vcpu = vcpu_to_hv_vcpu(vcpu); + + hv_vcpu->vp_index = kvm_vcpu_get_idx(vcpu); +} + int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages) { struct kvm_vcpu_hv_synic *synic = vcpu_to_synic(vcpu); @@ -983,6 +1003,11 @@ static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host) struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv; switch (msr) { + case HV_X64_MSR_VP_INDEX: + if (!host) + return 1; + hv->vp_index = (u32)data; + break; case HV_X64_MSR_APIC_ASSIST_PAGE: { u64 gfn; unsigned long addr; @@ -1094,18 +1119,9 @@ static int kvm_hv_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv; switch (msr) { - case HV_X64_MSR_VP_INDEX: { - int r; - struct kvm_vcpu *v; - - kvm_for_each_vcpu(r, v, vcpu->kvm) { - if (v == vcpu) { - data = r; - break; - } - } + case HV_X64_MSR_VP_INDEX: + data = hv->vp_index; break; - } case HV_X64_MSR_EOI: return kvm_hv_vapic_msr_read(vcpu, APIC_EOI, pdata); case HV_X64_MSR_ICR: diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index 12f65fe1011d..e637631a9574 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -59,6 +59,7 @@ void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector); int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages); void kvm_hv_vcpu_init(struct kvm_vcpu *vcpu); +void kvm_hv_vcpu_postcreate(struct kvm_vcpu *vcpu); void kvm_hv_vcpu_uninit(struct kvm_vcpu *vcpu); static inline struct kvm_vcpu_hv_stimer *vcpu_to_stimer(struct kvm_vcpu *vcpu, 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 6753f0982791..5b8f07889f6a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2666,6 +2666,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_HYPERV_SPIN: case KVM_CAP_HYPERV_SYNIC: case KVM_CAP_HYPERV_SYNIC2: + case KVM_CAP_HYPERV_VP_INDEX: case KVM_CAP_PCI_SEGMENT: case KVM_CAP_DEBUGREGS: case KVM_CAP_X86_ROBUST_SINGLESTEP: @@ -7688,6 +7689,8 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) struct msr_data msr; struct kvm *kvm = vcpu->kvm; + kvm_hv_vcpu_postcreate(vcpu); + if (vcpu_load(vcpu)) return; msr.data = 0x0; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 38b2cfbc8112..6cd63c18708a 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -928,6 +928,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_PPC_FWNMI 146 #define KVM_CAP_PPC_SMT_POSSIBLE 147 #define KVM_CAP_HYPERV_SYNIC2 148 +#define KVM_CAP_HYPERV_VP_INDEX 149 #ifdef KVM_CAP_IRQ_ROUTING -- cgit v1.2.3-59-g8ed1b From a966ac73d7772a2b067c50fa16bbbfe418fc6374 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Sun, 14 May 2017 07:49:15 -0300 Subject: bcache.txt: standardize document format Each text file under Documentation follows a different format. Some don't even have titles! Change its representation to follow the adopted standard, using ReST markups for it to be parseable by Sphinx: - Add a title for the document; - Use a list for the listed URLs; - mark literal blocks; - adjust whitespaces; - Don't capitalize section titles.
Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Jonathan Corbet --- Documentation/bcache.txt | 190 ++++++++++++++++++++++++++--------------------- 1 file changed, 107 insertions(+), 83 deletions(-) (limited to 'Documentation') diff --git a/Documentation/bcache.txt b/Documentation/bcache.txt index a9259b562d5c..c0ce64d75bbf 100644 --- a/Documentation/bcache.txt +++ b/Documentation/bcache.txt @@ -1,10 +1,15 @@ +============================ +A block layer cache (bcache) +============================ + Say you've got a big slow raid 6, and an ssd or three. Wouldn't it be nice if you could use them as cache... Hence bcache. Wiki and git repositories are at: - http://bcache.evilpiepirate.org - http://evilpiepirate.org/git/linux-bcache.git - http://evilpiepirate.org/git/bcache-tools.git + + - http://bcache.evilpiepirate.org + - http://evilpiepirate.org/git/linux-bcache.git + - http://evilpiepirate.org/git/bcache-tools.git It's designed around the performance characteristics of SSDs - it only allocates in erase block sized buckets, and it uses a hybrid btree/log to track cached @@ -37,17 +42,19 @@ to be flushed. Getting started: You'll need make-bcache from the bcache-tools repository. Both the cache device -and backing device must be formatted before use. +and backing device must be formatted before use:: + make-bcache -B /dev/sdb make-bcache -C /dev/sdc make-bcache has the ability to format multiple devices at the same time - if you format your backing devices and cache device at the same time, you won't -have to manually attach: +have to manually attach:: + make-bcache -B /dev/sda /dev/sdb -C /dev/sdc bcache-tools now ships udev rules, and bcache devices are known to the kernel -immediately. Without udev, you can manually register devices like this: +immediately. 
Without udev, you can manually register devices like this:: echo /dev/sdb > /sys/fs/bcache/register echo /dev/sdc > /sys/fs/bcache/register @@ -60,16 +67,16 @@ slow devices as bcache backing devices without a cache, and you can choose to ad a caching device later. See 'ATTACHING' section below. -The devices show up as: +The devices show up as:: /dev/bcache -As well as (with udev): +As well as (with udev):: /dev/bcache/by-uuid/ /dev/bcache/by-label/