From f7ab97f78a5c573e49474edbd260ea6898ddccda Mon Sep 17 00:00:00 2001 From: Oliver Hartkopp Date: Fri, 16 Nov 2007 16:09:28 -0800 Subject: [CAN]: Add documentation This patch adds documentation for the PF_CAN protocol family. Signed-off-by: Oliver Hartkopp Signed-off-by: Urs Thuermann Signed-off-by: David S. Miller --- Documentation/networking/00-INDEX | 2 + Documentation/networking/can.txt | 629 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 631 insertions(+) create mode 100644 Documentation/networking/can.txt (limited to 'Documentation/networking') diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index 563e442f2d42..b4eefadb9adf 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -24,6 +24,8 @@ baycom.txt - info on the driver for Baycom style amateur radio modems bridge.txt - where to get user space programs for ethernet bridging with Linux. +can.txt + - documentation on CAN protocol family. cops.txt - info on the COPS LocalTalk Linux driver cs89x0.txt diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt new file mode 100644 index 000000000000..f1b2de170929 --- /dev/null +++ b/Documentation/networking/can.txt @@ -0,0 +1,629 @@ +============================================================================ + +can.txt + +Readme file for the Controller Area Network Protocol Family (aka Socket CAN) + +This file contains + + 1 Overview / What is Socket CAN + + 2 Motivation / Why using the socket API + + 3 Socket CAN concept + 3.1 receive lists + 3.2 local loopback of sent frames + 3.3 network security issues (capabilities) + 3.4 network problem notifications + + 4 How to use Socket CAN + 4.1 RAW protocol sockets with can_filters (SOCK_RAW) + 4.1.1 RAW socket option CAN_RAW_FILTER + 4.1.2 RAW socket option CAN_RAW_ERR_FILTER + 4.1.3 RAW socket option CAN_RAW_LOOPBACK + 4.1.4 RAW socket option CAN_RAW_RECV_OWN_MSGS + 4.2 Broadcast Manager protocol sockets (SOCK_DGRAM) + 4.3 connected transport protocols (SOCK_SEQPACKET) + 4.4 unconnected transport protocols (SOCK_DGRAM) + + 5 Socket CAN core module + 5.1 can.ko module params + 5.2 procfs content + 5.3 writing own CAN protocol modules + + 6 CAN network drivers + 6.1 general settings + 6.2 local loopback of sent frames + 6.3 CAN controller hardware filters + 6.4 currently supported CAN hardware + 6.5 todo + + 7 Credits + +============================================================================ + +1. Overview / What is Socket CAN +-------------------------------- + +The socketcan package is an implementation of CAN protocols +(Controller Area Network) for Linux. CAN is a networking technology +which has widespread use in automation, embedded devices, and +automotive fields. While there have been other CAN implementations +for Linux based on character devices, Socket CAN uses the Berkeley +socket API, the Linux network stack and implements the CAN device +drivers as network interfaces. The CAN socket API has been designed +as similar as possible to the TCP/IP protocols to allow programmers, +familiar with network programming, to easily learn how to use CAN +sockets. + +2. Motivation / Why using the socket API +---------------------------------------- + +There have been CAN implementations for Linux before Socket CAN so the +question arises, why we have started another project. Most existing +implementations come as a device driver for some CAN hardware, they +are based on character devices and provide comparatively little +functionality. Usually, there is only a hardware-specific device +driver which provides a character device interface to send and +receive raw CAN frames, directly to/from the controller hardware. +Queueing of frames and higher-level transport protocols like ISO-TP +have to be implemented in user space applications. Also, most +character-device implementations support only one single process to +open the device at a time, similar to a serial interface. Exchanging +the CAN controller requires employment of another device driver and +often the need for adaption of large parts of the application to the +new driver's API. + +Socket CAN was designed to overcome all of these limitations. A new +protocol family has been implemented which provides a socket interface +to user space applications and which builds upon the Linux network +layer, so to use all of the provided queueing functionality. A device +driver for CAN controller hardware registers itself with the Linux +network layer as a network device, so that CAN frames from the +controller can be passed up to the network layer and on to the CAN +protocol family module and also vice-versa. Also, the protocol family +module provides an API for transport protocol modules to register, so +that any number of transport protocols can be loaded or unloaded +dynamically. In fact, the can core module alone does not provide any +protocol and cannot be used without loading at least one additional +protocol module. Multiple sockets can be opened at the same time, +on different or the same protocol module and they can listen/send +frames on different or the same CAN IDs. Several sockets listening on +the same interface for frames with the same CAN ID are all passed the +same received matching CAN frames. An application wishing to +communicate using a specific transport protocol, e.g. ISO-TP, just +selects that protocol when opening the socket, and then can read and +write application data byte streams, without having to deal with +CAN-IDs, frames, etc. + +Similar functionality visible from user-space could be provided by a +character device, too, but this would lead to a technically inelegant +solution for a couple of reasons: + +* Intricate usage. Instead of passing a protocol argument to + socket(2) and using bind(2) to select a CAN interface and CAN ID, an + application would have to do all these operations using ioctl(2)s. + +* Code duplication. A character device cannot make use of the Linux + network queueing code, so all that code would have to be duplicated + for CAN networking. + +* Abstraction. In most existing character-device implementations, the + hardware-specific device driver for a CAN controller directly + provides the character device for the application to work with. + This is at least very unusual in Unix systems for both, char and + block devices. For example you don't have a character device for a + certain UART of a serial interface, a certain sound chip in your + computer, a SCSI or IDE controller providing access to your hard + disk or tape streamer device. Instead, you have abstraction layers + which provide a unified character or block device interface to the + application on the one hand, and a interface for hardware-specific + device drivers on the other hand. These abstractions are provided + by subsystems like the tty layer, the audio subsystem or the SCSI + and IDE subsystems for the devices mentioned above. + + The easiest way to implement a CAN device driver is as a character + device without such a (complete) abstraction layer, as is done by most + existing drivers. The right way, however, would be to add such a + layer with all the functionality like registering for certain CAN + IDs, supporting several open file descriptors and (de)multiplexing + CAN frames between them, (sophisticated) queueing of CAN frames, and + providing an API for device drivers to register with. However, then + it would be no more difficult, or may be even easier, to use the + networking framework provided by the Linux kernel, and this is what + Socket CAN does. + + The use of the networking framework of the Linux kernel is just the + natural and most appropriate way to implement CAN for Linux. + +3. Socket CAN concept +--------------------- + + As described in chapter 2 it is the main goal of Socket CAN to + provide a socket interface to user space applications which builds + upon the Linux network layer. In contrast to the commonly known + TCP/IP and ethernet networking, the CAN bus is a broadcast-only(!) + medium that has no MAC-layer addressing like ethernet. The CAN-identifier + (can_id) is used for arbitration on the CAN-bus. Therefore the CAN-IDs + have to be chosen uniquely on the bus. When designing a CAN-ECU + network the CAN-IDs are mapped to be sent by a specific ECU. + For this reason a CAN-ID can be treated best as a kind of source address. + + 3.1 receive lists + + The network transparent access of multiple applications leads to the + problem that different applications may be interested in the same + CAN-IDs from the same CAN network interface. The Socket CAN core + module - which implements the protocol family CAN - provides several + high efficient receive lists for this reason. If e.g. a user space + application opens a CAN RAW socket, the raw protocol module itself + requests the (range of) CAN-IDs from the Socket CAN core that are + requested by the user. The subscription and unsubscription of + CAN-IDs can be done for specific CAN interfaces or for all(!) known + CAN interfaces with the can_rx_(un)register() functions provided to + CAN protocol modules by the SocketCAN core (see chapter 5). + To optimize the CPU usage at runtime the receive lists are split up + into several specific lists per device that match the requested + filter complexity for a given use-case. + + 3.2 local loopback of sent frames + + As known from other networking concepts the data exchanging + applications may run on the same or different nodes without any + change (except for the according addressing information): + + ___ ___ ___ _______ ___ + | _ | | _ | | _ | | _ _ | | _ | + ||A|| ||B|| ||C|| ||A| |B|| ||C|| + |___| |___| |___| |_______| |___| + | | | | | + -----------------(1)- CAN bus -(2)--------------- + + To ensure that application A receives the same information in the + example (2) as it would receive in example (1) there is need for + some kind of local loopback of the sent CAN frames on the appropriate + node. + + The Linux network devices (by default) just can handle the + transmission and reception of media dependent frames. Due to the + arbritration on the CAN bus the transmission of a low prio CAN-ID + may be delayed by the reception of a high prio CAN frame. To + reflect the correct* traffic on the node the loopback of the sent + data has to be performed right after a successful transmission. If + the CAN network interface is not capable of performing the loopback for + some reason the SocketCAN core can do this task as a fallback solution. + See chapter 6.2 for details (recommended). + + The loopback functionality is enabled by default to reflect standard + networking behaviour for CAN applications. Due to some requests from + the RT-SocketCAN group the loopback optionally may be disabled for each + separate socket. See sockopts from the CAN RAW sockets in chapter 4.1. + + * = you really like to have this when you're running analyser tools + like 'candump' or 'cansniffer' on the (same) node. + + 3.3 network security issues (capabilities) + + The Controller Area Network is a local field bus transmitting only + broadcast messages without any routing and security concepts. + In the majority of cases the user application has to deal with + raw CAN frames. Therefore it might be reasonable NOT to restrict + the CAN access only to the user root, as known from other networks. + Since the currently implemented CAN_RAW and CAN_BCM sockets can only + send and receive frames to/from CAN interfaces it does not affect + security of others networks to allow all users to access the CAN. + To enable non-root users to access CAN_RAW and CAN_BCM protocol + sockets the Kconfig options CAN_RAW_USER and/or CAN_BCM_USER may be + selected at kernel compile time. + + 3.4 network problem notifications + + The use of the CAN bus may lead to several problems on the physical + and media access control layer. Detecting and logging of these lower + layer problems is a vital requirement for CAN users to identify + hardware issues on the physical transceiver layer as well as + arbitration problems and error frames caused by the different + ECUs. The occurrence of detected errors are important for diagnosis + and have to be logged together with the exact timestamp. For this + reason the CAN interface driver can generate so called Error Frames + that can optionally be passed to the user application in the same + way as other CAN frames. Whenever an error on the physical layer + or the MAC layer is detected (e.g. by the CAN controller) the driver + creates an appropriate error frame. Error frames can be requested by + the user application using the common CAN filter mechanisms. Inside + this filter definition the (interested) type of errors may be + selected. The reception of error frames is disabled by default. + +4. How to use Socket CAN +------------------------ + + Like TCP/IP, you first need to open a socket for communicating over a + CAN network. Since Socket CAN implements a new protocol family, you + need to pass PF_CAN as the first argument to the socket(2) system + call. Currently, there are two CAN protocols to choose from, the raw + socket protocol and the broadcast manager (BCM). So to open a socket, + you would write + + s = socket(PF_CAN, SOCK_RAW, CAN_RAW); + + and + + s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM); + + respectively. After the successful creation of the socket, you would + normally use the bind(2) system call to bind the socket to a CAN + interface (which is different from TCP/IP due to different addressing + - see chapter 3). After binding (CAN_RAW) or connecting (CAN_BCM) + the socket, you can read(2) and write(2) from/to the socket or use + send(2), sendto(2), sendmsg(2) and the recv* counterpart operations + on the socket as usual. There are also CAN specific socket options + described below. + + The basic CAN frame structure and the sockaddr structure are defined + in include/linux/can.h: + + struct can_frame { + canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */ + __u8 can_dlc; /* data length code: 0 .. 8 */ + __u8 data[8] __attribute__((aligned(8))); + }; + + The alignment of the (linear) payload data[] to a 64bit boundary + allows the user to define own structs and unions to easily access the + CAN payload. There is no given byteorder on the CAN bus by + default. A read(2) system call on a CAN_RAW socket transfers a + struct can_frame to the user space. + + The sockaddr_can structure has an interface index like the + PF_PACKET socket, that also binds to a specific interface: + + struct sockaddr_can { + sa_family_t can_family; + int can_ifindex; + union { + struct { canid_t rx_id, tx_id; } tp16; + struct { canid_t rx_id, tx_id; } tp20; + struct { canid_t rx_id, tx_id; } mcnet; + struct { canid_t rx_id, tx_id; } isotp; + } can_addr; + }; + + To determine the interface index an appropriate ioctl() has to + be used (example for CAN_RAW sockets without error checking): + + int s; + struct sockaddr_can addr; + struct ifreq ifr; + + s = socket(PF_CAN, SOCK_RAW, CAN_RAW); + + strcpy(ifr.ifr_name, "can0" ); + ioctl(s, SIOCGIFINDEX, &ifr); + + addr.can_family = AF_CAN; + addr.can_ifindex = ifr.ifr_ifindex; + + bind(s, (struct sockaddr *)&addr, sizeof(addr)); + + (..) + + To bind a socket to all(!) CAN interfaces the interface index must + be 0 (zero). In this case the socket receives CAN frames from every + enabled CAN interface. To determine the originating CAN interface + the system call recvfrom(2) may be used instead of read(2). To send + on a socket that is bound to 'any' interface sendto(2) is needed to + specify the outgoing interface. + + Reading CAN frames from a bound CAN_RAW socket (see above) consists + of reading a struct can_frame: + + struct can_frame frame; + + nbytes = read(s, &frame, sizeof(struct can_frame)); + + if (nbytes < 0) { + perror("can raw socket read"); + return 1; + } + + /* paraniod check ... */ + if (nbytes < sizeof(struct can_frame)) { + fprintf(stderr, "read: incomplete CAN frame\n"); + return 1; + } + + /* do something with the received CAN frame */ + + Writing CAN frames can be done similarly, with the write(2) system call: + + nbytes = write(s, &frame, sizeof(struct can_frame)); + + When the CAN interface is bound to 'any' existing CAN interface + (addr.can_ifindex = 0) it is recommended to use recvfrom(2) if the + information about the originating CAN interface is needed: + + struct sockaddr_can addr; + struct ifreq ifr; + socklen_t len = sizeof(addr); + struct can_frame frame; + + nbytes = recvfrom(s, &frame, sizeof(struct can_frame), + 0, (struct sockaddr*)&addr, &len); + + /* get interface name of the received CAN frame */ + ifr.ifr_ifindex = addr.can_ifindex; + ioctl(s, SIOCGIFNAME, &ifr); + printf("Received a CAN frame from interface %s", ifr.ifr_name); + + To write CAN frames on sockets bound to 'any' CAN interface the + outgoing interface has to be defined certainly. + + strcpy(ifr.ifr_name, "can0"); + ioctl(s, SIOCGIFINDEX, &ifr); + addr.can_ifindex = ifr.ifr_ifindex; + addr.can_family = AF_CAN; + + nbytes = sendto(s, &frame, sizeof(struct can_frame), + 0, (struct sockaddr*)&addr, sizeof(addr)); + + 4.1 RAW protocol sockets with can_filters (SOCK_RAW) + + Using CAN_RAW sockets is extensively comparable to the commonly + known access to CAN character devices. To meet the new possibilities + provided by the multi user SocketCAN approach, some reasonable + defaults are set at RAW socket binding time: + + - The filters are set to exactly one filter receiving everything + - The socket only receives valid data frames (=> no error frames) + - The loopback of sent CAN frames is enabled (see chapter 3.2) + - The socket does not receive its own sent frames (in loopback mode) + + These default settings may be changed before or after binding the socket. + To use the referenced definitions of the socket options for CAN_RAW + sockets, include . + + 4.1.1 RAW socket option CAN_RAW_FILTER + + The reception of CAN frames using CAN_RAW sockets can be controlled + by defining 0 .. n filters with the CAN_RAW_FILTER socket option. + + The CAN filter structure is defined in include/linux/can.h: + + struct can_filter { + canid_t can_id; + canid_t can_mask; + }; + + A filter matches, when + + & mask == can_id & mask + + which is analogous to known CAN controllers hardware filter semantics. + The filter can be inverted in this semantic, when the CAN_INV_FILTER + bit is set in can_id element of the can_filter structure. In + contrast to CAN controller hardware filters the user may set 0 .. n + receive filters for each open socket separately: + + struct can_filter rfilter[2]; + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = CAN_SFF_MASK; + rfilter[1].can_id = 0x200; + rfilter[1].can_mask = 0x700; + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter)); + + To disable the reception of CAN frames on the selected CAN_RAW socket: + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, NULL, 0); + + To set the filters to zero filters is quite obsolete as not read + data causes the raw socket to discard the received CAN frames. But + having this 'send only' use-case we may remove the receive list in the + Kernel to save a little (really a very little!) CPU usage. + + 4.1.2 RAW socket option CAN_RAW_ERR_FILTER + + As described in chapter 3.4 the CAN interface driver can generate so + called Error Frames that can optionally be passed to the user + application in the same way as other CAN frames. The possible + errors are divided into different error classes that may be filtered + using the appropriate error mask. To register for every possible + error condition CAN_ERR_MASK can be used as value for the error mask. + The values for the error mask are defined in linux/can/error.h . + + can_err_mask_t err_mask = ( CAN_ERR_TX_TIMEOUT | CAN_ERR_BUSOFF ); + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_ERR_FILTER, + &err_mask, sizeof(err_mask)); + + 4.1.3 RAW socket option CAN_RAW_LOOPBACK + + To meet multi user needs the local loopback is enabled by default + (see chapter 3.2 for details). But in some embedded use-cases + (e.g. when only one application uses the CAN bus) this loopback + functionality can be disabled (separately for each socket): + + int loopback = 0; /* 0 = disabled, 1 = enabled (default) */ + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_LOOPBACK, &loopback, sizeof(loopback)); + + 4.1.4 RAW socket option CAN_RAW_RECV_OWN_MSGS + + When the local loopback is enabled, all the sent CAN frames are + looped back to the open CAN sockets that registered for the CAN + frames' CAN-ID on this given interface to meet the multi user + needs. The reception of the CAN frames on the same socket that was + sending the CAN frame is assumed to be unwanted and therefore + disabled by default. This default behaviour may be changed on + demand: + + int recv_own_msgs = 1; /* 0 = disabled (default), 1 = enabled */ + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS, + &recv_own_msgs, sizeof(recv_own_msgs)); + + 4.2 Broadcast Manager protocol sockets (SOCK_DGRAM) + 4.3 connected transport protocols (SOCK_SEQPACKET) + 4.4 unconnected transport protocols (SOCK_DGRAM) + + +5. Socket CAN core module +------------------------- + + The Socket CAN core module implements the protocol family + PF_CAN. CAN protocol modules are loaded by the core module at + runtime. The core module provides an interface for CAN protocol + modules to subscribe needed CAN IDs (see chapter 3.1). + + 5.1 can.ko module params + + - stats_timer: To calculate the Socket CAN core statistics + (e.g. current/maximum frames per second) this 1 second timer is + invoked at can.ko module start time by default. This timer can be + disabled by using stattimer=0 on the module comandline. + + - debug: (removed since SocketCAN SVN r546) + + 5.2 procfs content + + As described in chapter 3.1 the Socket CAN core uses several filter + lists to deliver received CAN frames to CAN protocol modules. These + receive lists, their filters and the count of filter matches can be + checked in the appropriate receive list. All entries contain the + device and a protocol module identifier: + + foo@bar:~$ cat /proc/net/can/rcvlist_all + + receive list 'rx_all': + (vcan3: no entry) + (vcan2: no entry) + (vcan1: no entry) + device can_id can_mask function userdata matches ident + vcan0 000 00000000 f88e6370 f6c6f400 0 raw + (any: no entry) + + In this example an application requests any CAN traffic from vcan0. + + rcvlist_all - list for unfiltered entries (no filter operations) + rcvlist_eff - list for single extended frame (EFF) entries + rcvlist_err - list for error frames masks + rcvlist_fil - list for mask/value filters + rcvlist_inv - list for mask/value filters (inverse semantic) + rcvlist_sff - list for single standard frame (SFF) entries + + Additional procfs files in /proc/net/can + + stats - Socket CAN core statistics (rx/tx frames, match ratios, ...) + reset_stats - manual statistic reset + version - prints the Socket CAN core version and the ABI version + + 5.3 writing own CAN protocol modules + + To implement a new protocol in the protocol family PF_CAN a new + protocol has to be defined in include/linux/can.h . + The prototypes and definitions to use the Socket CAN core can be + accessed by including include/linux/can/core.h . + In addition to functions that register the CAN protocol and the + CAN device notifier chain there are functions to subscribe CAN + frames received by CAN interfaces and to send CAN frames: + + can_rx_register - subscribe CAN frames from a specific interface + can_rx_unregister - unsubscribe CAN frames from a specific interface + can_send - transmit a CAN frame (optional with local loopback) + + For details see the kerneldoc documentation in net/can/af_can.c or + the source code of net/can/raw.c or net/can/bcm.c . + +6. CAN network drivers +---------------------- + + Writing a CAN network device driver is much easier than writing a + CAN character device driver. Similar to other known network device + drivers you mainly have to deal with: + + - TX: Put the CAN frame from the socket buffer to the CAN controller. + - RX: Put the CAN frame from the CAN controller to the socket buffer. + + See e.g. at Documentation/networking/netdevices.txt . The differences + for writing CAN network device driver are described below: + + 6.1 general settings + + dev->type = ARPHRD_CAN; /* the netdevice hardware type */ + dev->flags = IFF_NOARP; /* CAN has no arp */ + + dev->mtu = sizeof(struct can_frame); + + The struct can_frame is the payload of each socket buffer in the + protocol family PF_CAN. + + 6.2 local loopback of sent frames + + As described in chapter 3.2 the CAN network device driver should + support a local loopback functionality similar to the local echo + e.g. of tty devices. In this case the driver flag IFF_ECHO has to be + set to prevent the PF_CAN core from locally echoing sent frames + (aka loopback) as fallback solution: + + dev->flags = (IFF_NOARP | IFF_ECHO); + + 6.3 CAN controller hardware filters + + To reduce the interrupt load on deep embedded systems some CAN + controllers support the filtering of CAN IDs or ranges of CAN IDs. + These hardware filter capabilities vary from controller to + controller and have to be identified as not feasible in a multi-user + networking approach. The use of the very controller specific + hardware filters could make sense in a very dedicated use-case, as a + filter on driver level would affect all users in the multi-user + system. The high efficient filter sets inside the PF_CAN core allow + to set different multiple filters for each socket separately. + Therefore the use of hardware filters goes to the category 'handmade + tuning on deep embedded systems'. The author is running a MPC603e + @133MHz with four SJA1000 CAN controllers from 2002 under heavy bus + load without any problems ... + + 6.4 currently supported CAN hardware (September 2007) + + On the project website http://developer.berlios.de/projects/socketcan + there are different drivers available: + + vcan: Virtual CAN interface driver (if no real hardware is available) + sja1000: Philips SJA1000 CAN controller (recommended) + i82527: Intel i82527 CAN controller + mscan: Motorola/Freescale CAN controller (e.g. inside SOC MPC5200) + ccan: CCAN controller core (e.g. inside SOC h7202) + slcan: For a bunch of CAN adaptors that are attached via a + serial line ASCII protocol (for serial / USB adaptors) + + Additionally the different CAN adaptors (ISA/PCI/PCMCIA/USB/Parport) + from PEAK Systemtechnik support the CAN netdevice driver model + since Linux driver v6.0: http://www.peak-system.com/linux/index.htm + + Please check the Mailing Lists on the berlios OSS project website. + + 6.5 todo (September 2007) + + The configuration interface for CAN network drivers is still an open + issue that has not been finalized in the socketcan project. Also the + idea of having a library module (candev.ko) that holds functions + that are needed by all CAN netdevices is not ready to ship. + Your contribution is welcome. + +7. Credits +---------- + + Oliver Hartkopp (PF_CAN core, filters, drivers, bcm) + Urs Thuermann (PF_CAN core, kernel integration, socket interfaces, raw, vcan) + Jan Kizka (RT-SocketCAN core, Socket-API reconciliation) + Wolfgang Grandegger (RT-SocketCAN core & drivers, Raw Socket-API reviews) + Robert Schwebel (design reviews, PTXdist integration) + Marc Kleine-Budde (design reviews, Kernel 2.6 cleanups, drivers) + Benedikt Spranger (reviews) + Thomas Gleixner (LKML reviews, coding style, posting hints) + Andrey Volkov (kernel subtree structure, ioctls, mscan driver) + Matthias Brukner (first SJA1000 CAN netdevice implementation Q2/2003) + Klaus Hitschler (PEAK driver integration) + Uwe Koppe (CAN netdevices with PF_PACKET approach) + Michael Schulze (driver layer loopback requirement, RT CAN drivers review) -- cgit v1.2.3-59-g8ed1b From 8e8c71f1ab0ca1c4e74efad14533b991524dcb6c Mon Sep 17 00:00:00 2001 From: Gerrit Renker Date: Wed, 21 Nov 2007 09:56:48 -0200 Subject: [DCCP]: Honour and make use of shutdown option set by user This extends the DCCP socket API by honouring any shutdown(2) option set by the user. The behaviour is, as much as possible, made consistent with the API for TCP's shutdown. This patch exploits the information provided by the user via the socket API to reduce processing costs: * if the read end is closed (SHUT_RD), it is not necessary to deliver to input CCID; * if the write end is closed (SHUT_WR), the same idea applies, but with a difference - as long as the TX queue has not been drained, we need to receive feedback to keep congestion-control rates up to date. Hence SHUT_WR is honoured only after the last packet (under congestion control) has been sent; * although SHUT_RDWR seems nonsensical, it is nevertheless supported in the same manner as for TCP (and agrees with test for SHUTDOWN_MASK in dccp_poll() in net/dccp/proto.c). Furthermore, most of the code already honours the sk_shutdown flags (dccp_recvmsg() for instance sets the read length to 0 if SHUT_RD had been called); CCID handling is now added to this by the present patch. There will also no longer be any delivery when the socket is in the final stages, i.e. when one of dccp_close(), dccp_fin(), or dccp_done() has been called - which is fine since at that stage the connection is its final stages. Motivation and background are on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/shutdown A FIXME has been added to notify the other end if SHUT_RD has been set (RFC 4340, 11.7). Note: There is a comment in inet_shutdown() in net/ipv4/af_inet.c which asks to "make sure the socket is a TCP socket". This should probably be extended to mean `TCP or DCCP socket' (the code is also used by UDP and raw sockets). Signed-off-by: Gerrit Renker Signed-off-by: Ian McDonald Signed-off-by: Arnaldo Carvalho de Melo Signed-off-by: David S. Miller --- Documentation/networking/dccp.txt | 2 ++ net/dccp/input.c | 27 ++++++++++++++++++++------- net/dccp/proto.c | 2 +- 3 files changed, 23 insertions(+), 8 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index afb66f9a8aff..f771034e8f27 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt @@ -72,6 +72,8 @@ DCCP_SOCKOPT_CCID_TX_INFO Returns a `struct tfrc_tx_info' in optval; the buffer for optval and optlen must be set to at least sizeof(struct tfrc_tx_info). +On unidirectional connections it is useful to close the unused half-connection +via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs. Sysctl variables ================ diff --git a/net/dccp/input.c b/net/dccp/input.c index 1ce101062824..df0fb2c149a6 100644 --- a/net/dccp/input.c +++ b/net/dccp/input.c @@ -103,6 +103,21 @@ static void dccp_event_ack_recv(struct sock *sk, struct sk_buff *skb) DCCP_SKB_CB(skb)->dccpd_ack_seq); } +static void dccp_deliver_input_to_ccids(struct sock *sk, struct sk_buff *skb) +{ + const struct dccp_sock *dp = dccp_sk(sk); + + /* Don't deliver to RX CCID when node has shut down read end. */ + if (!(sk->sk_shutdown & RCV_SHUTDOWN)) + ccid_hc_rx_packet_recv(dp->dccps_hc_rx_ccid, sk, skb); + /* + * Until the TX queue has been drained, we can not honour SHUT_WR, since + * we need received feedback as input to adjust congestion control. + */ + if (sk->sk_write_queue.qlen > 0 || !(sk->sk_shutdown & SEND_SHUTDOWN)) + ccid_hc_tx_packet_recv(dp->dccps_hc_tx_ccid, sk, skb); +} + static int dccp_check_seqno(struct sock *sk, struct sk_buff *skb) { const struct dccp_hdr *dh = dccp_hdr(skb); @@ -209,8 +224,9 @@ static int __dccp_rcv_established(struct sock *sk, struct sk_buff *skb, case DCCP_PKT_DATAACK: case DCCP_PKT_DATA: /* - * FIXME: check if sk_receive_queue is full, schedule DATA_DROPPED - * option if it is. + * FIXME: schedule DATA_DROPPED (RFC 4340, 11.7.2) if and when + * - sk_shutdown == RCV_SHUTDOWN, use Code 1, "Not Listening" + * - sk_receive_queue is full, use Code 2, "Receive Buffer" */ __skb_pull(skb, dh->dccph_doff * 4); __skb_queue_tail(&sk->sk_receive_queue, skb); @@ -300,9 +316,7 @@ int dccp_rcv_established(struct sock *sk, struct sk_buff *skb, DCCP_SKB_CB(skb)->dccpd_seq, DCCP_ACKVEC_STATE_RECEIVED)) goto discard; - - ccid_hc_rx_packet_recv(dp->dccps_hc_rx_ccid, sk, skb); - ccid_hc_tx_packet_recv(dp->dccps_hc_tx_ccid, sk, skb); + dccp_deliver_input_to_ccids(sk, skb); return __dccp_rcv_established(sk, skb, dh, len); discard: @@ -543,8 +557,7 @@ int dccp_rcv_state_process(struct sock *sk, struct sk_buff *skb, DCCP_ACKVEC_STATE_RECEIVED)) goto discard; - ccid_hc_rx_packet_recv(dp->dccps_hc_rx_ccid, sk, skb); - ccid_hc_tx_packet_recv(dp->dccps_hc_tx_ccid, sk, skb); + dccp_deliver_input_to_ccids(sk, skb); } /* diff --git a/net/dccp/proto.c b/net/dccp/proto.c index 7a3bea9c28c1..0aec73592124 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -981,7 +981,7 @@ EXPORT_SYMBOL_GPL(dccp_close); void dccp_shutdown(struct sock *sk, int how) { - dccp_pr_debug("entry\n"); + dccp_pr_debug("called shutdown(%x)\n", how); } EXPORT_SYMBOL_GPL(dccp_shutdown); -- cgit v1.2.3-59-g8ed1b From ebe6f7e73c3efec1de295205806b4550fcb468cd Mon Sep 17 00:00:00 2001 From: Gerrit Renker Date: Wed, 21 Nov 2007 10:00:17 -0200 Subject: [DCCP]: Update documentation This updates the DCCP documentation, following input from Ian McDonald, clarifiying the status of DCCP, and adding a note about the test tree. Signed-off-by: Gerrit Renker Signed-off-by: Ian McDonald Signed-off-by: Arnaldo Carvalho de Melo Signed-off-by: David S. Miller --- Documentation/networking/dccp.txt | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index f771034e8f27..fdc93beec057 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt @@ -19,19 +19,23 @@ for real time and multimedia traffic. It has a base protocol and pluggable congestion control IDs (CCIDs). -It is at proposed standard RFC status and the homepage for DCCP as a protocol -is at: - http://www.read.cs.ucla.edu/dccp/ +DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol +is at http://www.ietf.org/html.charters/dccp-charter.html Missing features ================ -The DCCP implementation does not currently have all the features that are in -the RFC. +The Linux DCCP implementation does not currently support all the features that are +specified in RFCs 4340...42. The known bugs are at: http://linux-net.osdl.org/index.php/TODO#DCCP +For more up-to-date versions of the DCCP implementation, please consider using +the experimental DCCP test tree; instructions for checking this out are on: +http://linux-net.osdl.org/index.php/DCCP_Testing#Experimental_DCCP_source_tree + + Socket options ============== -- cgit v1.2.3-59-g8ed1b From e333b3edc489151afda2a4f6c798842c64cb67a4 Mon Sep 17 00:00:00 2001 From: Gerrit Renker Date: Wed, 21 Nov 2007 10:09:56 -0200 Subject: [DCCP]: Promote CCID2 as default CCID This patch addresses the following problems: 1. DCCP relies for its proper functioning on having at least one CCID module enabled (as in TCP plugable congestion control). Currently it is possible to disable both CCIDs and thus leave the DCCP module in a compiled, but entirely non-functional state: no sockets can be created when no CCID is available. Furthermore, the protocol is (again like TCP) not intended to be used without CCIDs. Last, a non-empty CCID list is needed for doing CCID feature negotiation. 2. Internally the default CCID that is advertised by the Linux host is set to CCID2 (DCCPF_INITIAL_CCID in include/linux/dccp.h). Disabling CCID2 in the Kconfig menu without changing the defaults leads to a failure `module not found' when trying to load the dccp module (which internally tries to load the default CCID). 3. The specification (RFC 4340, sec. 10) treats CCID2 somewhat like a `minimum common denominator'; the specification says that: * "New connections start with CCID 2 for both endpoints" * "A DCCP implementation intended for general use, such as an implementation in a general-purpose operating system kernel, SHOULD implement at least CCID 2. The intent is to make CCID 2 broadly available for interoperability [...]" Providing CCID2 as minimum-required CCID (like Reno/Cubic in TCP) thus seems reasonable. Hence this patch automatically selects CCID2 when DCCP is enabled. Documentation also added. Discussions with Ian McDonald on this subject are gratefully acknowledged. Signed-off-by: Gerrit Renker Signed-off-by: Ian McDonald Signed-off-by: Arnaldo Carvalho de Melo Signed-off-by: David S. Miller --- Documentation/networking/dccp.txt | 11 +++++++++-- net/dccp/Kconfig | 1 + net/dccp/ccids/Kconfig | 13 ++----------- 3 files changed, 12 insertions(+), 13 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index fdc93beec057..ffb9ca937d65 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt @@ -14,8 +14,15 @@ Introduction ============ Datagram Congestion Control Protocol (DCCP) is an unreliable, connection -based protocol designed to solve issues present in UDP and TCP particularly -for real time and multimedia traffic. +oriented protocol designed to solve issues present in UDP and TCP, particularly +for real-time and multimedia (streaming) traffic. +It divides into a base protocol (RFC 4340) and plugable congestion control +modules called CCIDs. Like plugable TCP congestion control, at least one CCID +needs to be enabled in order for the protocol to function properly. In the Linux +implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as +the TCP-friendly CCID3 (RFC 4342), are optional. +For a brief introduction to CCIDs and suggestions for choosing a CCID to match +given applications, see section 10 of RFC 4340. It has a base protocol and pluggable congestion control IDs (CCIDs). diff --git a/net/dccp/Kconfig b/net/dccp/Kconfig index 0549e4719b13..7aa2a7acc7ec 100644 --- a/net/dccp/Kconfig +++ b/net/dccp/Kconfig @@ -1,6 +1,7 @@ menuconfig IP_DCCP tristate "The DCCP Protocol (EXPERIMENTAL)" depends on INET && EXPERIMENTAL + select IP_DCCP_CCID2 ---help--- Datagram Congestion Control Protocol (RFC 4340) diff --git a/net/dccp/ccids/Kconfig b/net/dccp/ccids/Kconfig index 80f469887691..f37af17556ba 100644 --- a/net/dccp/ccids/Kconfig +++ b/net/dccp/ccids/Kconfig @@ -20,18 +20,9 @@ config IP_DCCP_CCID2 to the user. For example, a hypothetical application that transferred files over DCCP, using application-level retransmissions for lost packets, would prefer CCID 2 to CCID 3. On-line games may - also prefer CCID 2. + also prefer CCID 2. See RFC 4341 for further details. - CCID 2 is further described in RFC 4341, - http://www.ietf.org/rfc/rfc4341.txt - - This text was extracted from RFC 4340 (sec. 10.1), - http://www.ietf.org/rfc/rfc4340.txt - - To compile this CCID as a module, choose M here: the module will be - called dccp_ccid2. - - If in doubt, say M. + CCID2 is the default CCID used by DCCP. config IP_DCCP_CCID2_DEBUG bool "CCID2 debugging messages" -- cgit v1.2.3-59-g8ed1b From c28149016c24f4399c0a1eb0ebc15c92611223f0 Mon Sep 17 00:00:00 2001 From: Gerrit Renker Date: Wed, 21 Nov 2007 10:14:31 -0200 Subject: [DCCP]: Update documentation on ioctls Signed-off-by: Gerrit Renker Acked-by: Ian McDonald Signed-off-by: Arnaldo Carvalho de Melo Signed-off-by: David S. Miller --- Documentation/networking/dccp.txt | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'Documentation/networking') diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index ffb9ca937d65..d76905a5a087 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt @@ -136,6 +136,12 @@ sync_ratelimit = 125 ms sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit of this parameter is milliseconds; a value of 0 disables rate-limiting. +IOCTLS +====== +FIONREAD + Works as in udp(7): returns in the `int' argument pointer the size of + the next pending datagram in bytes, or 0 when no datagram is pending. + Notes ===== -- cgit v1.2.3-59-g8ed1b From cb75994ec311b2cd50e5205efdcc0696abd6675d Mon Sep 17 00:00:00 2001 From: Wang Chen Date: Mon, 3 Dec 2007 22:33:28 +1100 Subject: [UDP]: Defer InDataGrams increment until recvmsg() does checksum Thanks dave, herbert, gerrit, andi and other people for your discussion about this problem. UdpInDatagrams can be confusing because it counts packets that might be dropped later. Move UdpInDatagrams into recvmsg() as allowed by the RFC. Signed-off-by: Wang Chen Signed-off-by: Herbert Xu Signed-off-by: David S. Miller --- Documentation/networking/udplite.txt | 2 +- net/ipv4/udp.c | 7 +++---- net/ipv6/udp.c | 4 +++- 3 files changed, 7 insertions(+), 6 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/udplite.txt b/Documentation/networking/udplite.txt index b6409cab075c..3870f280280b 100644 --- a/Documentation/networking/udplite.txt +++ b/Documentation/networking/udplite.txt @@ -236,7 +236,7 @@ This displays UDP-Lite statistics variables, whose meaning is as follows. - InDatagrams: Total number of received datagrams. + InDatagrams: The total number of datagrams delivered to users. NoPorts: Number of packets received to an unknown port. These cases are counted separately (not as InErrors). diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 03c400ca14c5..3465d4ad301b 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -873,6 +873,8 @@ try_again: if (err) goto out_free; + UDP_INC_STATS_USER(UDP_MIB_INDATAGRAMS, is_udplite); + sock_recv_timestamp(msg, sk, skb); /* Copy the address. */ @@ -966,10 +968,8 @@ int udp_queue_rcv_skb(struct sock * sk, struct sk_buff *skb) int ret; ret = (*up->encap_rcv)(sk, skb); - if (ret <= 0) { - UDP_INC_STATS_BH(UDP_MIB_INDATAGRAMS, up->pcflag); + if (ret <= 0) return -ret; - } } /* FALLTHROUGH -- it's a UDP Packet */ @@ -1023,7 +1023,6 @@ int udp_queue_rcv_skb(struct sock * sk, struct sk_buff *skb) goto drop; } - UDP_INC_STATS_BH(UDP_MIB_INDATAGRAMS, up->pcflag); return 0; drop: diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index ee1cc3f8599f..b0474a618bbe 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -164,6 +164,8 @@ try_again: if (err) goto out_free; + UDP6_INC_STATS_USER(UDP_MIB_INDATAGRAMS, is_udplite); + sock_recv_timestamp(msg, sk, skb); /* Copy the address. */ @@ -292,7 +294,7 @@ int udpv6_queue_rcv_skb(struct sock * sk, struct sk_buff *skb) UDP6_INC_STATS_BH(UDP_MIB_RCVBUFERRORS, up->pcflag); goto drop; } - UDP6_INC_STATS_BH(UDP_MIB_INDATAGRAMS, up->pcflag); + return 0; drop: UDP6_INC_STATS_BH(UDP_MIB_INERRORS, up->pcflag); -- cgit v1.2.3-59-g8ed1b From b8599d20708fa3bde1e414689f3474560c2d990b Mon Sep 17 00:00:00 2001 From: Gerrit Renker Date: Thu, 13 Dec 2007 12:25:01 -0200 Subject: [DCCP]: Support for server holding timewait state This adds a socket option and signalling support for the case where the server holds timewait state on closing the connection, as described in RFC 4340, 8.3. Since holding timewait state at the server is the non-usual case, it is enabled via a socket option. Documentation for this socket option has been added. The setsockopt statement has been made resilient against different possible cases of expressing boolean `true' values using a suggestion by Ian McDonald. Signed-off-by: Gerrit Renker Signed-off-by: Ian McDonald Signed-off-by: Arnaldo Carvalho de Melo Signed-off-by: David S. Miller --- Documentation/networking/dccp.txt | 6 ++++++ include/linux/dccp.h | 3 +++ net/dccp/output.c | 6 ++++-- net/dccp/proto.c | 13 ++++++++++++- 4 files changed, 25 insertions(+), 3 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index d76905a5a087..39131a3c78f8 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt @@ -57,6 +57,12 @@ can be set before calling bind(). DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet size (application payload size) in bytes, see RFC 4340, section 14. +DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold +timewait state when closing the connection (RFC 4340, 8.3). The usual case is +that the closing server sends a CloseReq, whereupon the client holds timewait +state. When this boolean socket option is on, the server sends a Close instead +and will enter TIMEWAIT. This option must be set after accept() returns. + DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums always cover the entire packet and that only fully covered application data is diff --git a/include/linux/dccp.h b/include/linux/dccp.h index 312b989c7edb..c676021603f5 100644 --- a/include/linux/dccp.h +++ b/include/linux/dccp.h @@ -205,6 +205,7 @@ struct dccp_so_feat { #define DCCP_SOCKOPT_CHANGE_L 3 #define DCCP_SOCKOPT_CHANGE_R 4 #define DCCP_SOCKOPT_GET_CUR_MPS 5 +#define DCCP_SOCKOPT_SERVER_TIMEWAIT 6 #define DCCP_SOCKOPT_SEND_CSCOV 10 #define DCCP_SOCKOPT_RECV_CSCOV 11 #define DCCP_SOCKOPT_CCID_RX_INFO 128 @@ -492,6 +493,7 @@ struct dccp_ackvec; * @dccps_role - role of this sock, one of %dccp_role * @dccps_hc_rx_insert_options - receiver wants to add options when acking * @dccps_hc_tx_insert_options - sender wants to add options when sending + * @dccps_server_timewait - server holds timewait state on close (RFC 4340, 8.3) * @dccps_xmit_timer - timer for when CCID is not ready to send * @dccps_syn_rtt - RTT sample from Request/Response exchange (in usecs) */ @@ -528,6 +530,7 @@ struct dccp_sock { enum dccp_role dccps_role:2; __u8 dccps_hc_rx_insert_options:1; __u8 dccps_hc_tx_insert_options:1; + __u8 dccps_server_timewait:1; struct timer_list dccps_xmit_timer; }; diff --git a/net/dccp/output.c b/net/dccp/output.c index e97584aa4898..b2e17910930d 100644 --- a/net/dccp/output.c +++ b/net/dccp/output.c @@ -567,8 +567,10 @@ void dccp_send_close(struct sock *sk, const int active) /* Reserve space for headers and prepare control bits. */ skb_reserve(skb, sk->sk_prot->max_header); - DCCP_SKB_CB(skb)->dccpd_type = dp->dccps_role == DCCP_ROLE_CLIENT ? - DCCP_PKT_CLOSE : DCCP_PKT_CLOSEREQ; + if (dp->dccps_role == DCCP_ROLE_SERVER && !dp->dccps_server_timewait) + DCCP_SKB_CB(skb)->dccpd_type = DCCP_PKT_CLOSEREQ; + else + DCCP_SKB_CB(skb)->dccpd_type = DCCP_PKT_CLOSE; if (active) { dccp_write_xmit(sk, 1); diff --git a/net/dccp/proto.c b/net/dccp/proto.c index 8a73c8f98d76..cc87c500bfb8 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -551,6 +551,12 @@ static int do_dccp_setsockopt(struct sock *sk, int level, int optname, (struct dccp_so_feat __user *) optval); break; + case DCCP_SOCKOPT_SERVER_TIMEWAIT: + if (dp->dccps_role != DCCP_ROLE_SERVER) + err = -EOPNOTSUPP; + else + dp->dccps_server_timewait = (val != 0); + break; case DCCP_SOCKOPT_SEND_CSCOV: /* sender side, RFC 4340, sec. 9.2 */ if (val < 0 || val > 15) err = -EINVAL; @@ -653,6 +659,10 @@ static int do_dccp_getsockopt(struct sock *sk, int level, int optname, val = dp->dccps_mss_cache; len = sizeof(val); break; + case DCCP_SOCKOPT_SERVER_TIMEWAIT: + val = dp->dccps_server_timewait; + len = sizeof(val); + break; case DCCP_SOCKOPT_SEND_CSCOV: val = dp->dccps_pcslen; len = sizeof(val); @@ -918,7 +928,8 @@ static void dccp_terminate_connection(struct sock *sk) case DCCP_OPEN: dccp_send_close(sk, 1); - if (dccp_sk(sk)->dccps_role == DCCP_ROLE_SERVER) + if (dccp_sk(sk)->dccps_role == DCCP_ROLE_SERVER && + !dccp_sk(sk)->dccps_server_timewait) next_state = DCCP_ACTIVE_CLOSEREQ; else next_state = DCCP_CLOSING; -- cgit v1.2.3-59-g8ed1b From 558f82ef6e0d25e87f7468c07b6db1fbbf95a855 Mon Sep 17 00:00:00 2001 From: Masahide NAKAMURA Date: Thu, 20 Dec 2007 20:42:57 -0800 Subject: [XFRM]: Define packet dropping statistics. This statistics is shown factor dropped by transformation at /proc/net/xfrm_stat for developer. It is a counter designed from current transformation source code and defined as linux private MIB. See Documentation/networking/xfrm_proc.txt for the detail. Signed-off-by: Masahide NAKAMURA Signed-off-by: David S. Miller --- Documentation/networking/xfrm_proc.txt | 71 +++++++++++++++++++++++++ include/linux/snmp.h | 31 +++++++++++ include/net/snmp.h | 5 ++ include/net/xfrm.h | 18 +++++++ net/xfrm/Makefile | 1 + net/xfrm/xfrm_policy.c | 24 +++++++++ net/xfrm/xfrm_proc.c | 96 ++++++++++++++++++++++++++++++++++ 7 files changed, 246 insertions(+) create mode 100644 Documentation/networking/xfrm_proc.txt create mode 100644 net/xfrm/xfrm_proc.c (limited to 'Documentation/networking') diff --git a/Documentation/networking/xfrm_proc.txt b/Documentation/networking/xfrm_proc.txt new file mode 100644 index 000000000000..ec9045b5c34a --- /dev/null +++ b/Documentation/networking/xfrm_proc.txt @@ -0,0 +1,71 @@ +XFRM proc - /proc/net/xfrm_* files +================================== +Masahide NAKAMURA + + +Transformation Statistics +------------------------- +xfrm_proc is a statistics shown factor dropped by transformation +for developer. +It is a counter designed from current transformation source code +and defined like linux private MIB. + +Inbound statistics +~~~~~~~~~~~~~~~~~~ +XfrmInError: + All errors which is not matched others +XfrmInBufferError: + No buffer is left +XfrmInHdrError: + Header error +XfrmInNoStates: + No state is found + i.e. Either inbound SPI, address, or IPsec protocol at SA is wrong +XfrmInStateProtoError: + Transformation protocol specific error + e.g. SA key is wrong +XfrmInStateModeError: + Transformation mode specific error +XfrmInSeqOutOfWindow: + Sequence out of window +XfrmInStateExpired: + State is expired +XfrmInStateMismatch: + State has mismatch option + e.g. UDP encapsulation type is mismatch +XfrmInStateInvalid: + State is invalid +XfrmInTmplMismatch: + No matching template for states + e.g. Inbound SAs are correct but SP rule is wrong +XfrmInNoPols: + No policy is found for states + e.g. Inbound SAs are correct but no SP is found +XfrmInPolBlock: + Policy discards +XfrmInPolError: + Policy error + +Outbound errors +~~~~~~~~~~~~~~~ +XfrmOutError: + All errors which is not matched others +XfrmOutBundleGenError: + Bundle generation error +XfrmOutBundleCheckError: + Bundle check error +XfrmOutNoStates: + No state is found +XfrmOutStateProtoError: + Transformation protocol specific error +XfrmOutStateModeError: + Transformation mode specific error + e.g. Outer header space is not enough +XfrmOutStateExpired: + State is expired +XfrmOutPolBlock: + Policy discards +XfrmOutPolDead: + Policy is dead +XfrmOutPolError: + Policy error diff --git a/include/linux/snmp.h b/include/linux/snmp.h index 89f0c2b5f405..86d3effb2836 100644 --- a/include/linux/snmp.h +++ b/include/linux/snmp.h @@ -217,4 +217,35 @@ enum __LINUX_MIB_MAX }; +/* linux Xfrm mib definitions */ +enum +{ + LINUX_MIB_XFRMNUM = 0, + LINUX_MIB_XFRMINERROR, /* XfrmInError */ + LINUX_MIB_XFRMINBUFFERERROR, /* XfrmInBufferError */ + LINUX_MIB_XFRMINHDRERROR, /* XfrmInHdrError */ + LINUX_MIB_XFRMINNOSTATES, /* XfrmInNoStates */ + LINUX_MIB_XFRMINSTATEPROTOERROR, /* XfrmInStateProtoError */ + LINUX_MIB_XFRMINSTATEMODEERROR, /* XfrmInStateModeError */ + LINUX_MIB_XFRMINSEQOUTOFWINDOW, /* XfrmInSeqOutOfWindow */ + LINUX_MIB_XFRMINSTATEEXPIRED, /* XfrmInStateExpired */ + LINUX_MIB_XFRMINSTATEMISMATCH, /* XfrmInStateMismatch */ + LINUX_MIB_XFRMINSTATEINVALID, /* XfrmInStateInvalid */ + LINUX_MIB_XFRMINTMPLMISMATCH, /* XfrmInTmplMismatch */ + LINUX_MIB_XFRMINNOPOLS, /* XfrmInNoPols */ + LINUX_MIB_XFRMINPOLBLOCK, /* XfrmInPolBlock */ + LINUX_MIB_XFRMINPOLERROR, /* XfrmInPolError */ + LINUX_MIB_XFRMOUTERROR, /* XfrmOutError */ + LINUX_MIB_XFRMOUTBUNDLEGENERROR, /* XfrmOutBundleGenError */ + LINUX_MIB_XFRMOUTBUNDLECHECKERROR, /* XfrmOutBundleCheckError */ + LINUX_MIB_XFRMOUTNOSTATES, /* XfrmOutNoStates */ + LINUX_MIB_XFRMOUTSTATEPROTOERROR, /* XfrmOutStateProtoError */ + LINUX_MIB_XFRMOUTSTATEMODEERROR, /* XfrmOutStateModeError */ + LINUX_MIB_XFRMOUTSTATEEXPIRED, /* XfrmOutStateExpired */ + LINUX_MIB_XFRMOUTPOLBLOCK, /* XfrmOutPolBlock */ + LINUX_MIB_XFRMOUTPOLDEAD, /* XfrmOutPolDead */ + LINUX_MIB_XFRMOUTPOLERROR, /* XfrmOutPolError */ + __LINUX_MIB_XFRMMAX +}; + #endif /* _LINUX_SNMP_H */ diff --git a/include/net/snmp.h b/include/net/snmp.h index fbb66663a42c..ce2f48507510 100644 --- a/include/net/snmp.h +++ b/include/net/snmp.h @@ -118,6 +118,11 @@ struct linux_mib { unsigned long mibs[LINUX_MIB_MAX]; }; +/* Linux Xfrm */ +#define LINUX_MIB_XFRMMAX __LINUX_MIB_XFRMMAX +struct linux_xfrm_mib { + unsigned long mibs[LINUX_MIB_XFRMMAX]; +}; /* * FIXME: On x86 and some other CPUs the split into user and softirq parts diff --git a/include/net/xfrm.h b/include/net/xfrm.h index eea1c327c93e..a79702bcdcd0 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -19,6 +19,9 @@ #include #include #include +#ifdef CONFIG_XFRM_STATISTICS +#include +#endif #define XFRM_PROTO_ESP 50 #define XFRM_PROTO_AH 51 @@ -34,6 +37,17 @@ #define MODULE_ALIAS_XFRM_TYPE(family, proto) \ MODULE_ALIAS("xfrm-type-" __stringify(family) "-" __stringify(proto)) +#ifdef CONFIG_XFRM_STATISTICS +DECLARE_SNMP_STAT(struct linux_xfrm_mib, xfrm_statistics); +#define XFRM_INC_STATS(field) SNMP_INC_STATS(xfrm_statistics, field) +#define XFRM_INC_STATS_BH(field) SNMP_INC_STATS_BH(xfrm_statistics, field) +#define XFRM_INC_STATS_USER(field) SNMP_INC_STATS_USER(xfrm_statistics, field) +#else +#define XFRM_INC_STATS(field) +#define XFRM_INC_STATS_BH(field) +#define XFRM_INC_STATS_USER(field) +#endif + extern struct sock *xfrm_nl; extern u32 sysctl_xfrm_aevent_etime; extern u32 sysctl_xfrm_aevent_rseqth; @@ -1139,6 +1153,10 @@ static inline void xfrm6_fini(void) } #endif +#ifdef CONFIG_XFRM_STATISTICS +extern int xfrm_proc_init(void); +#endif + extern int xfrm_state_walk(u8 proto, int (*func)(struct xfrm_state *, int, void*), void *); extern struct xfrm_state *xfrm_state_alloc(void); extern struct xfrm_state *xfrm_state_find(xfrm_address_t *daddr, xfrm_address_t *saddr, diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile index 45744a3d3a51..332cfb0ff566 100644 --- a/net/xfrm/Makefile +++ b/net/xfrm/Makefile @@ -4,5 +4,6 @@ obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_hash.o \ xfrm_input.o xfrm_output.o xfrm_algo.o +obj-$(CONFIG_XFRM_STATISTICS) += xfrm_proc.o obj-$(CONFIG_XFRM_USER) += xfrm_user.o diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 521cb6e12561..32ddb7b12e7f 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -27,11 +27,19 @@ #include #include #include +#ifdef CONFIG_XFRM_STATISTICS +#include +#endif #include "xfrm_hash.h" int sysctl_xfrm_larval_drop __read_mostly; +#ifdef CONFIG_XFRM_STATISTICS +DEFINE_SNMP_STAT(struct linux_xfrm_mib, xfrm_statistics) __read_mostly; +EXPORT_SYMBOL(xfrm_statistics); +#endif + DEFINE_MUTEX(xfrm_cfg_mutex); EXPORT_SYMBOL(xfrm_cfg_mutex); @@ -2258,6 +2266,16 @@ static struct notifier_block xfrm_dev_notifier = { 0 }; +#ifdef CONFIG_XFRM_STATISTICS +static int __init xfrm_statistics_init(void) +{ + if (snmp_mib_init((void **)xfrm_statistics, + sizeof(struct linux_xfrm_mib)) < 0) + return -ENOMEM; + return 0; +} +#endif + static void __init xfrm_policy_init(void) { unsigned int hmask, sz; @@ -2294,9 +2312,15 @@ static void __init xfrm_policy_init(void) void __init xfrm_init(void) { +#ifdef CONFIG_XFRM_STATISTICS + xfrm_statistics_init(); +#endif xfrm_state_init(); xfrm_policy_init(); xfrm_input_init(); +#ifdef CONFIG_XFRM_STATISTICS + xfrm_proc_init(); +#endif } #ifdef CONFIG_AUDITSYSCALL diff --git a/net/xfrm/xfrm_proc.c b/net/xfrm/xfrm_proc.c new file mode 100644 index 000000000000..31d035415ecd --- /dev/null +++ b/net/xfrm/xfrm_proc.c @@ -0,0 +1,96 @@ +/* + * xfrm_proc.c + * + * Copyright (C)2006-2007 USAGI/WIDE Project + * + * Authors: Masahide NAKAMURA + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#include +#include +#include +#include + +static struct snmp_mib xfrm_mib_list[] = { + SNMP_MIB_ITEM("XfrmInError", LINUX_MIB_XFRMINERROR), + SNMP_MIB_ITEM("XfrmInBufferError", LINUX_MIB_XFRMINBUFFERERROR), + SNMP_MIB_ITEM("XfrmInHdrError", LINUX_MIB_XFRMINHDRERROR), + SNMP_MIB_ITEM("XfrmInNoStates", LINUX_MIB_XFRMINNOSTATES), + SNMP_MIB_ITEM("XfrmInStateProtoError", LINUX_MIB_XFRMINSTATEPROTOERROR), + SNMP_MIB_ITEM("XfrmInStateModeError", LINUX_MIB_XFRMINSTATEMODEERROR), + SNMP_MIB_ITEM("XfrmInSeqOutOfWindow", LINUX_MIB_XFRMINSEQOUTOFWINDOW), + SNMP_MIB_ITEM("XfrmInStateExpired", LINUX_MIB_XFRMINSTATEEXPIRED), + SNMP_MIB_ITEM("XfrmInStateMismatch", LINUX_MIB_XFRMINSTATEMISMATCH), + SNMP_MIB_ITEM("XfrmInStateInvalid", LINUX_MIB_XFRMINSTATEINVALID), + SNMP_MIB_ITEM("XfrmInTmplMismatch", LINUX_MIB_XFRMINTMPLMISMATCH), + SNMP_MIB_ITEM("XfrmInNoPols", LINUX_MIB_XFRMINNOPOLS), + SNMP_MIB_ITEM("XfrmInPolBlock", LINUX_MIB_XFRMINPOLBLOCK), + SNMP_MIB_ITEM("XfrmInPolError", LINUX_MIB_XFRMINPOLERROR), + SNMP_MIB_ITEM("XfrmOutError", LINUX_MIB_XFRMOUTERROR), + SNMP_MIB_ITEM("XfrmOutBundleGenError", LINUX_MIB_XFRMOUTBUNDLEGENERROR), + SNMP_MIB_ITEM("XfrmOutBundleCheckError", LINUX_MIB_XFRMOUTBUNDLECHECKERROR), + SNMP_MIB_ITEM("XfrmOutNoStates", LINUX_MIB_XFRMOUTNOSTATES), + SNMP_MIB_ITEM("XfrmOutStateProtoError", LINUX_MIB_XFRMOUTSTATEPROTOERROR), + SNMP_MIB_ITEM("XfrmOutStateModeError", LINUX_MIB_XFRMOUTSTATEMODEERROR), + SNMP_MIB_ITEM("XfrmOutStateExpired", LINUX_MIB_XFRMOUTSTATEEXPIRED), + SNMP_MIB_ITEM("XfrmOutPolBlock", LINUX_MIB_XFRMOUTPOLBLOCK), + SNMP_MIB_ITEM("XfrmOutPolDead", LINUX_MIB_XFRMOUTPOLDEAD), + SNMP_MIB_ITEM("XfrmOutPolError", LINUX_MIB_XFRMOUTPOLERROR), + SNMP_MIB_SENTINEL +}; + +static unsigned long +fold_field(void *mib[], int offt) +{ + unsigned long res = 0; + int i; + + for_each_possible_cpu(i) { + res += *(((unsigned long *)per_cpu_ptr(mib[0], i)) + offt); + res += *(((unsigned long *)per_cpu_ptr(mib[1], i)) + offt); + } + return res; +} + +static int xfrm_statistics_seq_show(struct seq_file *seq, void *v) +{ + int i; + for (i=0; xfrm_mib_list[i].name; i++) + seq_printf(seq, "%-24s\t%lu\n", xfrm_mib_list[i].name, + fold_field((void **)xfrm_statistics, + xfrm_mib_list[i].entry)); + return 0; +} + +static int xfrm_statistics_seq_open(struct inode *inode, struct file *file) +{ + return single_open(file, xfrm_statistics_seq_show, NULL); +} + +static struct file_operations xfrm_statistics_seq_fops = { + .owner = THIS_MODULE, + .open = xfrm_statistics_seq_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +int __init xfrm_proc_init(void) +{ + int rc = 0; + + if (!proc_net_fops_create(&init_net, "xfrm_stat", S_IRUGO, + &xfrm_statistics_seq_fops)) + goto stat_fail; + + out: + return rc; + + stat_fail: + rc = -ENOMEM; + goto out; +} -- cgit v1.2.3-59-g8ed1b From d9727bb2d516bc16bafee4216eec91cd2be5f30a Mon Sep 17 00:00:00 2001 From: Masahide NAKAMURA Date: Tue, 25 Dec 2007 20:56:26 -0800 Subject: [XFRM] Documentaion: Fix error example at XFRMOUTSTATEMODEERROR. Signed-off-by: Masahide NAKAMURA Signed-off-by: David S. Miller --- Documentation/networking/xfrm_proc.txt | 1 - 1 file changed, 1 deletion(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/xfrm_proc.txt b/Documentation/networking/xfrm_proc.txt index ec9045b5c34a..53c1a58b02f1 100644 --- a/Documentation/networking/xfrm_proc.txt +++ b/Documentation/networking/xfrm_proc.txt @@ -60,7 +60,6 @@ XfrmOutStateProtoError: Transformation protocol specific error XfrmOutStateModeError: Transformation mode specific error - e.g. Outer header space is not enough XfrmOutStateExpired: State is expired XfrmOutPolBlock: -- cgit v1.2.3-59-g8ed1b From 95766fff6b9a78d11fc2d3812dd035381690b55d Mon Sep 17 00:00:00 2001 From: Hideo Aoki Date: Mon, 31 Dec 2007 00:29:24 -0800 Subject: [UDP]: Add memory accounting. Signed-off-by: Takahiro Yasui Signed-off-by: Hideo Aoki Signed-off-by: David S. Miller --- Documentation/networking/ip-sysctl.txt | 27 ++++++++++++++++ include/net/udp.h | 9 ++++++ net/ipv4/af_inet.c | 5 +++ net/ipv4/proc.c | 3 +- net/ipv4/sysctl_net_ipv4.c | 31 ++++++++++++++++++ net/ipv4/udp.c | 57 ++++++++++++++++++++++++++++++++-- net/ipv6/udp.c | 32 ++++++++++++++++--- 7 files changed, 157 insertions(+), 7 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 6f7872ba1def..17a6e46fbd43 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -446,6 +446,33 @@ tcp_dma_copybreak - INTEGER and CONFIG_NET_DMA is enabled. Default: 4096 +UDP variables: + +udp_mem - vector of 3 INTEGERs: min, pressure, max + Number of pages allowed for queueing by all UDP sockets. + + min: Below this number of pages UDP is not bothered about its + memory appetite. When amount of memory allocated by UDP exceeds + this number, UDP starts to moderate memory usage. + + pressure: This value was introduced to follow format of tcp_mem. + + max: Number of pages allowed for queueing by all UDP sockets. + + Default is calculated at boot time from amount of available memory. + +udp_rmem_min - INTEGER + Minimal size of receive buffer used by UDP sockets in moderation. + Each UDP socket is able to use the size for receiving data, even if + total pages of UDP sockets exceed udp_mem pressure. The unit is byte. + Default: 4096 + +udp_wmem_min - INTEGER + Minimal size of send buffer used by UDP sockets in moderation. + Each UDP socket is able to use the size for sending data, even if + total pages of UDP sockets exceed udp_mem pressure. The unit is byte. + Default: 4096 + CIPSOv4 Variables: cipso_cache_enable - BOOLEAN diff --git a/include/net/udp.h b/include/net/udp.h index 98cb09ca3a27..93796beac8ff 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -65,6 +65,13 @@ extern rwlock_t udp_hash_lock; extern struct proto udp_prot; +extern atomic_t udp_memory_allocated; + +/* sysctl variables for udp */ +extern int sysctl_udp_mem[3]; +extern int sysctl_udp_rmem_min; +extern int sysctl_udp_wmem_min; + struct sk_buff; /* @@ -198,4 +205,6 @@ extern void udp_proc_unregister(struct udp_seq_afinfo *afinfo); extern int udp4_proc_init(void); extern void udp4_proc_exit(void); #endif + +extern void udp_init(void); #endif /* _UDP_H */ diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 03633b7b9b4a..0e12cf646071 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -139,6 +139,8 @@ void inet_sock_destruct(struct sock *sk) __skb_queue_purge(&sk->sk_receive_queue); __skb_queue_purge(&sk->sk_error_queue); + sk_mem_reclaim(sk); + if (sk->sk_type == SOCK_STREAM && sk->sk_state != TCP_CLOSE) { printk("Attempt to release TCP socket in state %d %p\n", sk->sk_state, sk); @@ -1417,6 +1419,9 @@ static int __init inet_init(void) /* Setup TCP slab cache for open requests. */ tcp_init(); + /* Setup UDP memory threshold */ + udp_init(); + /* Add UDP-Lite (RFC 3828) */ udplite4_register(); diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c index 41734db677be..53bc010beefd 100644 --- a/net/ipv4/proc.c +++ b/net/ipv4/proc.c @@ -56,7 +56,8 @@ static int sockstat_seq_show(struct seq_file *seq, void *v) sock_prot_inuse(&tcp_prot), atomic_read(&tcp_orphan_count), tcp_death_row.tw_count, atomic_read(&tcp_sockets_allocated), atomic_read(&tcp_memory_allocated)); - seq_printf(seq, "UDP: inuse %d\n", sock_prot_inuse(&udp_prot)); + seq_printf(seq, "UDP: inuse %d mem %d\n", sock_prot_inuse(&udp_prot), + atomic_read(&udp_memory_allocated)); seq_printf(seq, "UDPLITE: inuse %d\n", sock_prot_inuse(&udplite_prot)); seq_printf(seq, "RAW: inuse %d\n", sock_prot_inuse(&raw_prot)); seq_printf(seq, "FRAG: inuse %d memory %d\n", diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 844f26fab06f..a5a9f8e3bb25 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -812,6 +813,36 @@ static struct ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = &proc_dointvec, }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "udp_mem", + .data = &sysctl_udp_mem, + .maxlen = sizeof(sysctl_udp_mem), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &zero + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "udp_rmem_min", + .data = &sysctl_udp_rmem_min, + .maxlen = sizeof(sysctl_udp_rmem_min), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &zero + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "udp_wmem_min", + .data = &sysctl_udp_wmem_min, + .maxlen = sizeof(sysctl_udp_wmem_min), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &zero + }, { .ctl_name = 0 } }; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 1ce6b60b7f93..353284360751 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -82,6 +82,7 @@ #include #include #include +#include #include #include #include @@ -118,6 +119,17 @@ EXPORT_SYMBOL(udp_stats_in6); struct hlist_head udp_hash[UDP_HTABLE_SIZE]; DEFINE_RWLOCK(udp_hash_lock); +int sysctl_udp_mem[3] __read_mostly; +int sysctl_udp_rmem_min __read_mostly; +int sysctl_udp_wmem_min __read_mostly; + +EXPORT_SYMBOL(sysctl_udp_mem); +EXPORT_SYMBOL(sysctl_udp_rmem_min); +EXPORT_SYMBOL(sysctl_udp_wmem_min); + +atomic_t udp_memory_allocated; +EXPORT_SYMBOL(udp_memory_allocated); + static inline int __udp_lib_lport_inuse(__u16 num, const struct hlist_head udptable[]) { @@ -901,13 +913,17 @@ try_again: err = ulen; out_free: + lock_sock(sk); skb_free_datagram(sk, skb); + release_sock(sk); out: return err; csum_copy_err: + lock_sock(sk); if (!skb_kill_datagram(sk, skb, flags)) UDP_INC_STATS_USER(UDP_MIB_INERRORS, is_udplite); + release_sock(sk); if (noblock) return -EAGAIN; @@ -1072,7 +1088,15 @@ static int __udp4_lib_mcast_deliver(struct sk_buff *skb, skb1 = skb_clone(skb, GFP_ATOMIC); if (skb1) { - int ret = udp_queue_rcv_skb(sk, skb1); + int ret = 0; + + bh_lock_sock_nested(sk); + if (!sock_owned_by_user(sk)) + ret = udp_queue_rcv_skb(sk, skb1); + else + sk_add_backlog(sk, skb1); + bh_unlock_sock(sk); + if (ret > 0) /* we should probably re-process instead * of dropping packets here. */ @@ -1165,7 +1189,13 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct hlist_head udptable[], inet_iif(skb), udptable); if (sk != NULL) { - int ret = udp_queue_rcv_skb(sk, skb); + int ret = 0; + bh_lock_sock_nested(sk); + if (!sock_owned_by_user(sk)) + ret = udp_queue_rcv_skb(sk, skb); + else + sk_add_backlog(sk, skb); + bh_unlock_sock(sk); sock_put(sk); /* a return value > 0 means to resubmit the input, but @@ -1460,6 +1490,10 @@ struct proto udp_prot = { .hash = udp_lib_hash, .unhash = udp_lib_unhash, .get_port = udp_v4_get_port, + .memory_allocated = &udp_memory_allocated, + .sysctl_mem = sysctl_udp_mem, + .sysctl_wmem = &sysctl_udp_wmem_min, + .sysctl_rmem = &sysctl_udp_rmem_min, .obj_size = sizeof(struct udp_sock), #ifdef CONFIG_COMPAT .compat_setsockopt = compat_udp_setsockopt, @@ -1655,6 +1689,25 @@ void udp4_proc_exit(void) } #endif /* CONFIG_PROC_FS */ +void __init udp_init(void) +{ + unsigned long limit; + + /* Set the pressure threshold up by the same strategy of TCP. It is a + * fraction of global memory that is up to 1/2 at 256 MB, decreasing + * toward zero with the amount of memory, with a floor of 128 pages. + */ + limit = min(nr_all_pages, 1UL<<(28-PAGE_SHIFT)) >> (20-PAGE_SHIFT); + limit = (limit * (nr_all_pages >> (20-PAGE_SHIFT))) >> (PAGE_SHIFT-11); + limit = max(limit, 128UL); + sysctl_udp_mem[0] = limit / 4 * 3; + sysctl_udp_mem[1] = limit; + sysctl_udp_mem[2] = sysctl_udp_mem[0] * 2; + + sysctl_udp_rmem_min = SK_MEM_QUANTUM; + sysctl_udp_wmem_min = SK_MEM_QUANTUM; +} + EXPORT_SYMBOL(udp_disconnect); EXPORT_SYMBOL(udp_hash); EXPORT_SYMBOL(udp_hash_lock); diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index c9a97b405511..bf58acab2064 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -204,13 +204,17 @@ try_again: err = ulen; out_free: + lock_sock(sk); skb_free_datagram(sk, skb); + release_sock(sk); out: return err; csum_copy_err: + lock_sock(sk); if (!skb_kill_datagram(sk, skb, flags)) UDP6_INC_STATS_USER(UDP_MIB_INERRORS, is_udplite); + release_sock(sk); if (flags & MSG_DONTWAIT) return -EAGAIN; @@ -366,10 +370,21 @@ static int __udp6_lib_mcast_deliver(struct sk_buff *skb, struct in6_addr *saddr, while ((sk2 = udp_v6_mcast_next(sk_next(sk2), uh->dest, daddr, uh->source, saddr, dif))) { struct sk_buff *buff = skb_clone(skb, GFP_ATOMIC); - if (buff) - udpv6_queue_rcv_skb(sk2, buff); + if (buff) { + bh_lock_sock_nested(sk2); + if (!sock_owned_by_user(sk2)) + udpv6_queue_rcv_skb(sk2, buff); + else + sk_add_backlog(sk2, buff); + bh_unlock_sock(sk2); + } } - udpv6_queue_rcv_skb(sk, skb); + bh_lock_sock_nested(sk); + if (!sock_owned_by_user(sk)) + udpv6_queue_rcv_skb(sk, skb); + else + sk_add_backlog(sk, skb); + bh_unlock_sock(sk); out: read_unlock(&udp_hash_lock); return 0; @@ -482,7 +497,12 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct hlist_head udptable[], /* deliver */ - udpv6_queue_rcv_skb(sk, skb); + bh_lock_sock_nested(sk); + if (!sock_owned_by_user(sk)) + udpv6_queue_rcv_skb(sk, skb); + else + sk_add_backlog(sk, skb); + bh_unlock_sock(sk); sock_put(sk); return 0; @@ -994,6 +1014,10 @@ struct proto udpv6_prot = { .hash = udp_lib_hash, .unhash = udp_lib_unhash, .get_port = udp_v6_get_port, + .memory_allocated = &udp_memory_allocated, + .sysctl_mem = sysctl_udp_mem, + .sysctl_wmem = &sysctl_udp_wmem_min, + .sysctl_rmem = &sysctl_udp_rmem_min, .obj_size = sizeof(struct udp6_sock), #ifdef CONFIG_COMPAT .compat_setsockopt = compat_udpv6_setsockopt, -- cgit v1.2.3-59-g8ed1b From f1862b0ae2294f6970f695abf02392d025e02dbe Mon Sep 17 00:00:00 2001 From: Adrian Bunk Date: Sun, 27 Jan 2008 23:04:43 -0800 Subject: [SHAPER]: The scheduled shaper removal. This patch contains the scheduled removal of the shaper driver. Signed-off-by: Adrian Bunk Acked-by: Alan Cox Signed-off-by: David S. Miller --- Documentation/feature-removal-schedule.txt | 9 - Documentation/networking/00-INDEX | 2 - Documentation/networking/shaper.txt | 48 --- drivers/net/Kconfig | 17 - drivers/net/Makefile | 1 - drivers/net/shaper.c | 603 ----------------------------- include/linux/Kbuild | 1 - include/linux/if_shaper.h | 51 --- 8 files changed, 732 deletions(-) delete mode 100644 Documentation/networking/shaper.txt delete mode 100644 drivers/net/shaper.c delete mode 100644 include/linux/if_shaper.h (limited to 'Documentation/networking') diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 2c3ae6c02a47..cc02e67239ec 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -280,15 +280,6 @@ Who: Thomas Gleixner --------------------------- -What: shaper network driver -When: January 2008 -Files: drivers/net/shaper.c, include/linux/if_shaper.h -Why: This driver has been marked obsolete for many years. - It was only designed to work on lower speed links and has design - flaws that lead to machine crashes. The qdisc infrastructure in - 2.4 or later kernels, provides richer features and is more robust. -Who: Stephen Hemminger - --------------------------- What: i2c-i810, i2c-prosavage and i2c-savage4 diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index b4eefadb9adf..02e56d447a8f 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -84,8 +84,6 @@ policy-routing.txt - IP policy-based routing ray_cs.txt - Raylink Wireless LAN card driver info. -shaper.txt - - info on the module that can shape/limit transmitted traffic. sk98lin.txt - Marvell Yukon Chipset / SysKonnect SK-98xx compliant Gigabit Ethernet Adapter family driver info diff --git a/Documentation/networking/shaper.txt b/Documentation/networking/shaper.txt deleted file mode 100644 index 6c4ebb66a906..000000000000 --- a/Documentation/networking/shaper.txt +++ /dev/null @@ -1,48 +0,0 @@ -Traffic Shaper For Linux - -This is the current BETA release of the traffic shaper for Linux. It works -within the following limits: - -o Minimum shaping speed is currently about 9600 baud (it can only -shape down to 1 byte per clock tick) - -o Maximum is about 256K, it will go above this but get a bit blocky. - -o If you ifconfig the master device that a shaper is attached to down -then your machine will follow. - -o The shaper must be a module. - - -Setup: - - A shaper device is configured using the shapeconfig program. -Typically you will do something like this - -shapecfg attach shaper0 eth1 -shapecfg speed shaper0 64000 -ifconfig shaper0 myhost netmask 255.255.255.240 broadcast 1.2.3.4.255 up -route add -net some.network netmask a.b.c.d dev shaper0 - -The shaper should have the same IP address as the device it is attached to -for normal use. - -Gotchas: - - The shaper shapes transmitted traffic. It's rather impossible to -shape received traffic except at the end (or a router) transmitting it. - - Gated/routed/rwhod/mrouted all see the shaper as an additional device -and will treat it as such unless patched. Note that for mrouted you can run -mrouted tunnels via a traffic shaper to control bandwidth usage. - - The shaper is device/route based. This makes it very easy to use -with any setup BUT less flexible. You may need to use iproute2 to set up -multiple route tables to get the flexibility. - - There is no "borrowing" or "sharing" scheme. This is a simple -traffic limiter. We implement Van Jacobson and Sally Floyd's CBQ -architecture into Linux 2.2. This is the preferred solution. Shaper is -for simple or back compatible setups. - -Alan diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index a6728661c416..777bb81d9ada 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -3015,23 +3015,6 @@ config NET_FC adaptor below. You also should have said Y to "SCSI support" and "SCSI generic support". -config SHAPER - tristate "Traffic Shaper (OBSOLETE)" - depends on EXPERIMENTAL - ---help--- - The traffic shaper is a virtual network device that allows you to - limit the rate of outgoing data flow over some other network device. - The traffic that you want to slow down can then be routed through - these virtual devices. See - for more information. - - An alternative to this traffic shaper are traffic schedulers which - you'll get if you say Y to "QoS and/or fair queuing" in - "Networking options". - - To compile this driver as a module, choose M here: the module - will be called shaper. If unsure, say N. - config NETCONSOLE tristate "Network console logging support (EXPERIMENTAL)" depends on EXPERIMENTAL diff --git a/drivers/net/Makefile b/drivers/net/Makefile index 17a2ffd73d7d..a2f2662c244b 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -93,7 +93,6 @@ obj-$(CONFIG_NET_SB1000) += sb1000.o obj-$(CONFIG_MAC8390) += mac8390.o obj-$(CONFIG_APNE) += apne.o 8390.o obj-$(CONFIG_PCMCIA_PCNET) += 8390.o -obj-$(CONFIG_SHAPER) += shaper.o obj-$(CONFIG_HP100) += hp100.o obj-$(CONFIG_SMC9194) += smc9194.o obj-$(CONFIG_FEC) += fec.o diff --git a/drivers/net/shaper.c b/drivers/net/shaper.c deleted file mode 100644 index 228f650250f6..000000000000 --- a/drivers/net/shaper.c +++ /dev/null @@ -1,603 +0,0 @@ -/* - * Simple traffic shaper for Linux NET3. - * - * (c) Copyright 1996 Alan Cox , All Rights Reserved. - * http://www.redhat.com - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * - * Neither Alan Cox nor CymruNet Ltd. admit liability nor provide - * warranty for any of this software. This material is provided - * "AS-IS" and at no charge. - * - * - * Algorithm: - * - * Queue Frame: - * Compute time length of frame at regulated speed - * Add frame to queue at appropriate point - * Adjust time length computation for followup frames - * Any frame that falls outside of its boundaries is freed - * - * We work to the following constants - * - * SHAPER_QLEN Maximum queued frames - * SHAPER_LATENCY Bounding latency on a frame. Leaving this latency - * window drops the frame. This stops us queueing - * frames for a long time and confusing a remote - * host. - * SHAPER_MAXSLIP Maximum time a priority frame may jump forward. - * That bounds the penalty we will inflict on low - * priority traffic. - * SHAPER_BURST Time range we call "now" in order to reduce - * system load. The more we make this the burstier - * the behaviour, the better local performance you - * get through packet clustering on routers and the - * worse the remote end gets to judge rtts. - * - * This is designed to handle lower speed links ( < 200K/second or so). We - * run off a 100-150Hz base clock typically. This gives us a resolution at - * 200Kbit/second of about 2Kbit or 256 bytes. Above that our timer - * resolution may start to cause much more burstiness in the traffic. We - * could avoid a lot of that by calling kick_shaper() at the end of the - * tied device transmissions. If you run above about 100K second you - * may need to tune the supposed speed rate for the right values. - * - * BUGS: - * Downing the interface under the shaper before the shaper - * will render your machine defunct. Don't for now shape over - * PPP or SLIP therefore! - * This will be fixed in BETA4 - * - * Update History : - * - * bh_atomic() SMP races fixes and rewritten the locking code to - * be SMP safe and irq-mask friendly. - * NOTE: we can't use start_bh_atomic() in kick_shaper() - * because it's going to be recalled from an irq handler, - * and synchronize_bh() is a nono if called from irq context. - * 1999 Andrea Arcangeli - * - * Device statistics (tx_pakets, tx_bytes, - * tx_drops: queue_over_time and collisions: max_queue_exceded) - * 1999/06/18 Jordi Murgo - * - * Use skb->cb for private data. - * 2000/03 Andi Kleen - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include - -struct shaper_cb { - unsigned long shapeclock; /* Time it should go out */ - unsigned long shapestamp; /* Stamp for shaper */ - __u32 shapelatency; /* Latency on frame */ - __u32 shapelen; /* Frame length in clocks */ - __u16 shapepend; /* Pending */ -}; -#define SHAPERCB(skb) ((struct shaper_cb *) ((skb)->cb)) - -static int sh_debug; /* Debug flag */ - -#define SHAPER_BANNER "CymruNet Traffic Shaper BETA 0.04 for Linux 2.1\n" - -static void shaper_kick(struct shaper *sh); - -/* - * Compute clocks on a buffer - */ - -static int shaper_clocks(struct shaper *shaper, struct sk_buff *skb) -{ - int t=skb->len/shaper->bytespertick; - return t; -} - -/* - * Set the speed of a shaper. We compute this in bytes per tick since - * thats how the machine wants to run. Quoted input is in bits per second - * as is traditional (note not BAUD). We assume 8 bit bytes. - */ - -static void shaper_setspeed(struct shaper *shaper, int bitspersec) -{ - shaper->bitspersec=bitspersec; - shaper->bytespertick=(bitspersec/HZ)/8; - if(!shaper->bytespertick) - shaper->bytespertick++; -} - -/* - * Throw a frame at a shaper. - */ - - -static int shaper_start_xmit(struct sk_buff *skb, struct net_device *dev) -{ - struct shaper *shaper = dev->priv; - struct sk_buff *ptr; - - spin_lock(&shaper->lock); - ptr=shaper->sendq.prev; - - /* - * Set up our packet details - */ - - SHAPERCB(skb)->shapelatency=0; - SHAPERCB(skb)->shapeclock=shaper->recovery; - if(time_before(SHAPERCB(skb)->shapeclock, jiffies)) - SHAPERCB(skb)->shapeclock=jiffies; - skb->priority=0; /* short term bug fix */ - SHAPERCB(skb)->shapestamp=jiffies; - - /* - * Time slots for this packet. - */ - - SHAPERCB(skb)->shapelen= shaper_clocks(shaper,skb); - - { - struct sk_buff *tmp; - /* - * Up our shape clock by the time pending on the queue - * (Should keep this in the shaper as a variable..) - */ - for(tmp=skb_peek(&shaper->sendq); tmp!=NULL && - tmp!=(struct sk_buff *)&shaper->sendq; tmp=tmp->next) - SHAPERCB(skb)->shapeclock+=SHAPERCB(tmp)->shapelen; - /* - * Queue over time. Spill packet. - */ - if(time_after(SHAPERCB(skb)->shapeclock,jiffies + SHAPER_LATENCY)) { - dev_kfree_skb(skb); - dev->stats.tx_dropped++; - } else - skb_queue_tail(&shaper->sendq, skb); - } - - if(sh_debug) - printk("Frame queued.\n"); - if(skb_queue_len(&shaper->sendq)>SHAPER_QLEN) - { - ptr=skb_dequeue(&shaper->sendq); - dev_kfree_skb(ptr); - dev->stats.collisions++; - } - shaper_kick(shaper); - spin_unlock(&shaper->lock); - return 0; -} - -/* - * Transmit from a shaper - */ - -static void shaper_queue_xmit(struct shaper *shaper, struct sk_buff *skb) -{ - struct sk_buff *newskb=skb_clone(skb, GFP_ATOMIC); - if(sh_debug) - printk("Kick frame on %p\n",newskb); - if(newskb) - { - newskb->dev=shaper->dev; - newskb->priority=2; - if(sh_debug) - printk("Kick new frame to %s, %d\n", - shaper->dev->name,newskb->priority); - dev_queue_xmit(newskb); - - shaper->dev->stats.tx_bytes += skb->len; - shaper->dev->stats.tx_packets++; - - if(sh_debug) - printk("Kicked new frame out.\n"); - dev_kfree_skb(skb); - } -} - -/* - * Timer handler for shaping clock - */ - -static void shaper_timer(unsigned long data) -{ - struct shaper *shaper = (struct shaper *)data; - - spin_lock(&shaper->lock); - shaper_kick(shaper); - spin_unlock(&shaper->lock); -} - -/* - * Kick a shaper queue and try and do something sensible with the - * queue. - */ - -static void shaper_kick(struct shaper *shaper) -{ - struct sk_buff *skb; - - /* - * Walk the list (may be empty) - */ - - while((skb=skb_peek(&shaper->sendq))!=NULL) - { - /* - * Each packet due to go out by now (within an error - * of SHAPER_BURST) gets kicked onto the link - */ - - if(sh_debug) - printk("Clock = %ld, jiffies = %ld\n", SHAPERCB(skb)->shapeclock, jiffies); - if(time_before_eq(SHAPERCB(skb)->shapeclock, jiffies + SHAPER_BURST)) - { - /* - * Pull the frame and get interrupts back on. - */ - - skb_unlink(skb, &shaper->sendq); - if (shaper->recovery < - SHAPERCB(skb)->shapeclock + SHAPERCB(skb)->shapelen) - shaper->recovery = SHAPERCB(skb)->shapeclock + SHAPERCB(skb)->shapelen; - /* - * Pass on to the physical target device via - * our low level packet thrower. - */ - - SHAPERCB(skb)->shapepend=0; - shaper_queue_xmit(shaper, skb); /* Fire */ - } - else - break; - } - - /* - * Next kick. - */ - - if(skb!=NULL) - mod_timer(&shaper->timer, SHAPERCB(skb)->shapeclock); -} - - -/* - * Bring the interface up. We just disallow this until a - * bind. - */ - -static int shaper_open(struct net_device *dev) -{ - struct shaper *shaper=dev->priv; - - /* - * Can't open until attached. - * Also can't open until speed is set, or we'll get - * a division by zero. - */ - - if(shaper->dev==NULL) - return -ENODEV; - if(shaper->bitspersec==0) - return -EINVAL; - return 0; -} - -/* - * Closing a shaper flushes the queues. - */ - -static int shaper_close(struct net_device *dev) -{ - struct shaper *shaper=dev->priv; - struct sk_buff *skb; - - while ((skb = skb_dequeue(&shaper->sendq)) != NULL) - dev_kfree_skb(skb); - - spin_lock_bh(&shaper->lock); - shaper_kick(shaper); - spin_unlock_bh(&shaper->lock); - - del_timer_sync(&shaper->timer); - return 0; -} - -/* - * Revectored calls. We alter the parameters and call the functions - * for our attached device. This enables us to bandwidth allocate after - * ARP and other resolutions and not before. - */ - -static int shaper_header(struct sk_buff *skb, struct net_device *dev, - unsigned short type, - const void *daddr, const void *saddr, unsigned len) -{ - struct shaper *sh=dev->priv; - int v; - if(sh_debug) - printk("Shaper header\n"); - skb->dev = sh->dev; - v = dev_hard_header(skb, sh->dev, type, daddr, saddr, len); - skb->dev = dev; - return v; -} - -static int shaper_rebuild_header(struct sk_buff *skb) -{ - struct shaper *sh=skb->dev->priv; - struct net_device *dev=skb->dev; - int v; - if(sh_debug) - printk("Shaper rebuild header\n"); - skb->dev=sh->dev; - v = sh->dev->header_ops->rebuild(skb); - skb->dev=dev; - return v; -} - -#if 0 -static int shaper_cache(struct neighbour *neigh, struct hh_cache *hh) -{ - struct shaper *sh=neigh->dev->priv; - struct net_device *tmp; - int ret; - if(sh_debug) - printk("Shaper header cache bind\n"); - tmp=neigh->dev; - neigh->dev=sh->dev; - ret=sh->hard_header_cache(neigh,hh); - neigh->dev=tmp; - return ret; -} - -static void shaper_cache_update(struct hh_cache *hh, struct net_device *dev, - unsigned char *haddr) -{ - struct shaper *sh=dev->priv; - if(sh_debug) - printk("Shaper cache update\n"); - sh->header_cache_update(hh, sh->dev, haddr); -} -#endif - -#ifdef CONFIG_INET - -static int shaper_neigh_setup(struct neighbour *n) -{ -#ifdef CONFIG_INET - if (n->nud_state == NUD_NONE) { - n->ops = &arp_broken_ops; - n->output = n->ops->output; - } -#endif - return 0; -} - -static int shaper_neigh_setup_dev(struct net_device *dev, struct neigh_parms *p) -{ -#ifdef CONFIG_INET - if (p->tbl->family == AF_INET) { - p->neigh_setup = shaper_neigh_setup; - p->ucast_probes = 0; - p->mcast_probes = 0; - } -#endif - return 0; -} - -#else /* !(CONFIG_INET) */ - -static int shaper_neigh_setup_dev(struct net_device *dev, struct neigh_parms *p) -{ - return 0; -} - -#endif - -static const struct header_ops shaper_ops = { - .create = shaper_header, - .rebuild = shaper_rebuild_header, -}; - -static int shaper_attach(struct net_device *shdev, struct shaper *sh, struct net_device *dev) -{ - sh->dev = dev; - sh->get_stats=dev->get_stats; - - shdev->neigh_setup = shaper_neigh_setup_dev; - shdev->hard_header_len=dev->hard_header_len; - shdev->type=dev->type; - shdev->addr_len=dev->addr_len; - shdev->mtu=dev->mtu; - sh->bitspersec=0; - return 0; -} - -static int shaper_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd) -{ - struct shaperconf *ss= (struct shaperconf *)&ifr->ifr_ifru; - struct shaper *sh=dev->priv; - - if(ss->ss_cmd == SHAPER_SET_DEV || ss->ss_cmd == SHAPER_SET_SPEED) - { - if(!capable(CAP_NET_ADMIN)) - return -EPERM; - } - - switch(ss->ss_cmd) - { - case SHAPER_SET_DEV: - { - struct net_device *them=__dev_get_by_name(&init_net, ss->ss_name); - if(them==NULL) - return -ENODEV; - if(sh->dev) - return -EBUSY; - return shaper_attach(dev,dev->priv, them); - } - case SHAPER_GET_DEV: - if(sh->dev==NULL) - return -ENODEV; - strcpy(ss->ss_name, sh->dev->name); - return 0; - case SHAPER_SET_SPEED: - shaper_setspeed(sh,ss->ss_speed); - return 0; - case SHAPER_GET_SPEED: - ss->ss_speed=sh->bitspersec; - return 0; - default: - return -EINVAL; - } -} - -static void shaper_init_priv(struct net_device *dev) -{ - struct shaper *sh = dev->priv; - - skb_queue_head_init(&sh->sendq); - init_timer(&sh->timer); - sh->timer.function=shaper_timer; - sh->timer.data=(unsigned long)sh; - spin_lock_init(&sh->lock); -} - -/* - * Add a shaper device to the system - */ - -static void __init shaper_setup(struct net_device *dev) -{ - /* - * Set up the shaper. - */ - - shaper_init_priv(dev); - - dev->open = shaper_open; - dev->stop = shaper_close; - dev->hard_start_xmit = shaper_start_xmit; - dev->set_multicast_list = NULL; - - /* - * Intialise the packet queues - */ - - /* - * Handlers for when we attach to a device. - */ - - dev->neigh_setup = shaper_neigh_setup_dev; - dev->do_ioctl = shaper_ioctl; - dev->hard_header_len = 0; - dev->type = ARPHRD_ETHER; /* initially */ - dev->set_mac_address = NULL; - dev->mtu = 1500; - dev->addr_len = 0; - dev->tx_queue_len = 10; - dev->flags = 0; -} - -static int shapers = 1; -#ifdef MODULE - -module_param(shapers, int, 0); -MODULE_PARM_DESC(shapers, "Traffic shaper: maximum number of shapers"); - -#else /* MODULE */ - -static int __init set_num_shapers(char *str) -{ - shapers = simple_strtol(str, NULL, 0); - return 1; -} - -__setup("shapers=", set_num_shapers); - -#endif /* MODULE */ - -static struct net_device **devs; - -static unsigned int shapers_registered = 0; - -static int __init shaper_init(void) -{ - int i; - size_t alloc_size; - struct net_device *dev; - char name[IFNAMSIZ]; - - if (shapers < 1) - return -ENODEV; - - alloc_size = sizeof(*dev) * shapers; - devs = kzalloc(alloc_size, GFP_KERNEL); - if (!devs) - return -ENOMEM; - - for (i = 0; i < shapers; i++) { - - snprintf(name, IFNAMSIZ, "shaper%d", i); - dev = alloc_netdev(sizeof(struct shaper), name, - shaper_setup); - if (!dev) - break; - - if (register_netdev(dev)) { - free_netdev(dev); - break; - } - - devs[i] = dev; - shapers_registered++; - } - - if (!shapers_registered) { - kfree(devs); - devs = NULL; - } - - return (shapers_registered ? 0 : -ENODEV); -} - -static void __exit shaper_exit (void) -{ - int i; - - for (i = 0; i < shapers_registered; i++) { - if (devs[i]) { - unregister_netdev(devs[i]); - free_netdev(devs[i]); - } - } - - kfree(devs); - devs = NULL; -} - -module_init(shaper_init); -module_exit(shaper_exit); -MODULE_LICENSE("GPL"); - diff --git a/include/linux/Kbuild b/include/linux/Kbuild index a85e87fd60d7..bc33a5c87d64 100644 --- a/include/linux/Kbuild +++ b/include/linux/Kbuild @@ -231,7 +231,6 @@ unifdef-y += if_ltalk.h unifdef-y += if_link.h unifdef-y += if_pppol2tp.h unifdef-y += if_pppox.h -unifdef-y += if_shaper.h unifdef-y += if_tr.h unifdef-y += if_tun.h unifdef-y += if_vlan.h diff --git a/include/linux/if_shaper.h b/include/linux/if_shaper.h deleted file mode 100644 index 3b1b7ba19825..000000000000 --- a/include/linux/if_shaper.h +++ /dev/null @@ -1,51 +0,0 @@ -#ifndef __LINUX_SHAPER_H -#define __LINUX_SHAPER_H - -#ifdef __KERNEL__ - -#define SHAPER_QLEN 10 -/* - * This is a bit speed dependent (read it shouldn't be a constant!) - * - * 5 is about right for 28.8 upwards. Below that double for every - * halving of speed or so. - ie about 20 for 9600 baud. - */ -#define SHAPER_LATENCY (5*HZ) -#define SHAPER_MAXSLIP 2 -#define SHAPER_BURST (HZ/50) /* Good for >128K then */ - -struct shaper -{ - struct sk_buff_head sendq; - __u32 bytespertick; - __u32 bitspersec; - __u32 shapelatency; - __u32 shapeclock; - unsigned long recovery; /* Time we can next clock a packet out on - an empty queue */ - spinlock_t lock; - struct net_device *dev; - struct net_device_stats* (*get_stats)(struct net_device *dev); - struct timer_list timer; -}; - -#endif - -#define SHAPER_SET_DEV 0x0001 -#define SHAPER_SET_SPEED 0x0002 -#define SHAPER_GET_DEV 0x0003 -#define SHAPER_GET_SPEED 0x0004 - -struct shaperconf -{ - __u16 ss_cmd; - union - { - char ssu_name[14]; - __u32 ssu_speed; - } ss_u; -#define ss_speed ss_u.ssu_speed -#define ss_name ss_u.ssu_name -}; - -#endif -- cgit v1.2.3-59-g8ed1b From 9a6c686799346b6c95405c9e051f5023873504fa Mon Sep 17 00:00:00 2001 From: Jay Vosburgh Date: Tue, 13 Nov 2007 20:25:48 -0800 Subject: [BONDING]: Documentation update Update the bonding documentation: more discussion on initialization and configuration, changes to discussion of packet reordering in balance-rr, update some out of date information. Based in part on input from Rick Jones and Andy Gospodarek . Signed-off-by: Jay Vosburgh Acked-by: Andy Gospodarek Signed-off-by: David S. Miller --- Documentation/networking/bonding.txt | 204 ++++++++++++++++++++++++----------- 1 file changed, 143 insertions(+), 61 deletions(-) (limited to 'Documentation/networking') diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 6cc30e0d5795..a0cda062bc33 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -1,7 +1,7 @@ Linux Ethernet Bonding Driver HOWTO - Latest update: 24 April 2006 + Latest update: 12 November 2007 Initial release : Thomas Davis Corrections, HA extensions : 2000/10/03-15 : @@ -166,12 +166,17 @@ to use ifenslave. 2. Bonding Driver Options ========================= - Options for the bonding driver are supplied as parameters to -the bonding module at load time. They may be given as command line -arguments to the insmod or modprobe command, but are usually specified -in either the /etc/modules.conf or /etc/modprobe.conf configuration -file, or in a distro-specific configuration file (some of which are -detailed in the next section). + Options for the bonding driver are supplied as parameters to the +bonding module at load time, or are specified via sysfs. + + Module options may be given as command line arguments to the +insmod or modprobe command, but are usually specified in either the +/etc/modules.conf or /etc/modprobe.conf configuration file, or in a +distro-specific configuration file (some of which are detailed in the next +section). + + Details on bonding support for sysfs is provided in the +"Configuring Bonding Manually via Sysfs" section, below. The available bonding driver parameters are listed below. If a parameter is not specified the default value is used. When initially @@ -812,11 +817,13 @@ the system /etc/modules.conf or /etc/modprobe.conf configuration file. 3.2 Configuration with Initscripts Support ------------------------------------------ - This section applies to distros using a version of initscripts -with bonding support, for example, Red Hat Linux 9 or Red Hat -Enterprise Linux version 3 or 4. On these systems, the network -initialization scripts have some knowledge of bonding, and can be -configured to control bonding devices. + This section applies to distros using a recent version of +initscripts with bonding support, for example, Red Hat Enterprise Linux +version 3 or later, Fedora, etc. On these systems, the network +initialization scripts have knowledge of bonding, and can be configured to +control bonding devices. Note that older versions of the initscripts +package have lower levels of support for bonding; this will be noted where +applicable. These distros will not automatically load the network adapter driver unless the ethX device is configured with an IP address. @@ -864,11 +871,31 @@ USERCTL=no Be sure to change the networking specific lines (IPADDR, NETMASK, NETWORK and BROADCAST) to match your network configuration. - Finally, it is necessary to edit /etc/modules.conf (or -/etc/modprobe.conf, depending upon your distro) to load the bonding -module with your desired options when the bond0 interface is brought -up. The following lines in /etc/modules.conf (or modprobe.conf) will -load the bonding module, and select its options: + For later versions of initscripts, such as that found with Fedora +7 and Red Hat Enterprise Linux version 5 (or later), it is possible, and, +indeed, preferable, to specify the bonding options in the ifcfg-bond0 +file, e.g. a line of the format: + +BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=+192.168.1.254" + + will configure the bond with the specified options. The options +specified in BONDING_OPTS are identical to the bonding module parameters +except for the arp_ip_target field. Each target should be included as a +separate option and should be preceded by a '+' to indicate it should be +added to the list of queried targets, e.g., + + arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2 + + is the proper syntax to specify multiple targets. When specifying +options via BONDING_OPTS, it is not necessary to edit /etc/modules.conf or +/etc/modprobe.conf. + + For older versions of initscripts that do not support +BONDING_OPTS, it is necessary to edit /etc/modules.conf (or +/etc/modprobe.conf, depending upon your distro) to load the bonding module +with your desired options when the bond0 interface is brought up. The +following lines in /etc/modules.conf (or modprobe.conf) will load the +bonding module, and select its options: alias bond0 bonding options bond0 mode=balance-alb miimon=100 @@ -883,9 +910,10 @@ up and running. 3.2.1 Using DHCP with Initscripts --------------------------------- - Recent versions of initscripts (the version supplied with -Fedora Core 3 and Red Hat Enterprise Linux 4 is reported to work) do -have support for assigning IP information to bonding devices via DHCP. + Recent versions of initscripts (the versions supplied with Fedora +Core 3 and Red Hat Enterprise Linux 4, or later versions, are reported to +work) have support for assigning IP information to bonding devices via +DHCP. To configure bonding for DHCP, configure it as described above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp" @@ -895,18 +923,14 @@ is case sensitive. 3.2.2 Configuring Multiple Bonds with Initscripts ------------------------------------------------- - At this writing, the initscripts package does not directly -support loading the bonding driver multiple times, so the process for -doing so is the same as described in the "Configuring Multiple Bonds -Manually" section, below. - - NOTE: It has been observed that some Red Hat supplied kernels -are apparently unable to rename modules at load time (the "-o bond1" -part). Attempts to pass that option to modprobe will produce an -"Operation not permitted" error. This has been reported on some -Fedora Core kernels, and has been seen on RHEL 4 as well. On kernels -exhibiting this problem, it will be impossible to configure multiple -bonds with differing parameters. + Initscripts packages that are included with Fedora 7 and Red Hat +Enterprise Linux 5 support multiple bonding interfaces by simply +specifying the appropriate BONDING_OPTS= in ifcfg-bondX where X is the +number of the bond. This support requires sysfs support in the kernel, +and a bonding driver of version 3.0.0 or later. Other configurations may +not support this method for specifying multiple bonding interfaces; for +those instances, see the "Configuring Multiple Bonds Manually" section, +below. 3.3 Configuring Bonding Manually with Ifenslave ----------------------------------------------- @@ -977,15 +1001,58 @@ initialization scripts lack support for configuring multiple bonds. options, you may wish to use the "max_bonds" module parameter, documented above. - To create multiple bonding devices with differing options, it -is necessary to use bonding parameters exported by sysfs, documented -in the section below. + To create multiple bonding devices with differing options, it is +preferrable to use bonding parameters exported by sysfs, documented in the +section below. + + For versions of bonding without sysfs support, the only means to +provide multiple instances of bonding with differing options is to load +the bonding driver multiple times. Note that current versions of the +sysconfig network initialization scripts handle this automatically; if +your distro uses these scripts, no special action is needed. See the +section Configuring Bonding Devices, above, if you're not sure about your +network initialization scripts. + + To load multiple instances of the module, it is necessary to +specify a different name for each instance (the module loading system +requires that every loaded module, even multiple instances of the same +module, have a unique name). This is accomplished by supplying multiple +sets of bonding options in /etc/modprobe.conf, for example: + +alias bond0 bonding +options bond0 -o bond0 mode=balance-rr miimon=100 + +alias bond1 bonding +options bond1 -o bond1 mode=balance-alb miimon=50 + + will load the bonding module two times. The first instance is +named "bond0" and creates the bond0 device in balance-rr mode with an +miimon of 100. The second instance is named "bond1" and creates the +bond1 device in balance-alb mode with an miimon of 50. + + In some circumstances (typically with older distributions), +the above does not work, and the second bonding instance never sees +its options. In that case, the second options line can be substituted +as follows: + +install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ + mode=balance-alb miimon=50 + This may be repeated any number of times, specifying a new and +unique name in place of bond1 for each subsequent instance. + + It has been observed that some Red Hat supplied kernels are unable +to rename modules at load time (the "-o bond1" part). Attempts to pass +that option to modprobe will produce an "Operation not permitted" error. +This has been reported on some Fedora Core kernels, and has been seen on +RHEL 4 as well. On kernels exhibiting this problem, it will be impossible +to configure multiple bonds with differing parameters (as they are older +kernels, and also lack sysfs support). 3.4 Configuring Bonding Manually via Sysfs ------------------------------------------ - Starting with version 3.0, Channel Bonding may be configured + Starting with version 3.0.0, Channel Bonding may be configured via the sysfs interface. This interface allows dynamic configuration of all bonds in the system without unloading the module. It also allows for adding and removing bonds at runtime. Ifenslave is no @@ -1030,9 +1097,6 @@ To enslave interface eth0 to bond bond0: To free slave eth0 from bond bond0: # echo -eth0 > /sys/class/net/bond0/bonding/slaves - NOTE: The bond must be up before slaves can be added. All -slaves are freed when the interface is brought down. - When an interface is enslaved to a bond, symlinks between the two are created in the sysfs filesystem. In this case, you would get /sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and @@ -1622,6 +1686,15 @@ one for each switch in the network). This will insure that, regardless of which switch is active, the ARP monitor has a suitable target to query. + Note, also, that of late many switches now support a functionality +generally referred to as "trunk failover." This is a feature of the +switch that causes the link state of a particular switch port to be set +down (or up) when the state of another switch port goes down (or up). +It's purpose is to propogate link failures from logically "exterior" ports +to the logically "interior" ports that bonding is able to monitor via +miimon. Availability and configuration for trunk failover varies by +switch, but this can be a viable alternative to the ARP monitor when using +suitable switches. 12. Configuring Bonding for Maximum Throughput ============================================== @@ -1709,7 +1782,7 @@ balance-rr: This mode is the only mode that will permit a single interfaces. It is therefore the only mode that will allow a single TCP/IP stream to utilize more than one interface's worth of throughput. This comes at a cost, however: the - striping often results in peer systems receiving packets out + striping generally results in peer systems receiving packets out of order, causing TCP/IP's congestion control system to kick in, often by retransmitting segments. @@ -1721,22 +1794,20 @@ balance-rr: This mode is the only mode that will permit a single interface's worth of throughput, even after adjusting tcp_reordering. - Note that this out of order delivery occurs when both the - sending and receiving systems are utilizing a multiple - interface bond. Consider a configuration in which a - balance-rr bond feeds into a single higher capacity network - channel (e.g., multiple 100Mb/sec ethernets feeding a single - gigabit ethernet via an etherchannel capable switch). In this - configuration, traffic sent from the multiple 100Mb devices to - a destination connected to the gigabit device will not see - packets out of order. However, traffic sent from the gigabit - device to the multiple 100Mb devices may or may not see - traffic out of order, depending upon the balance policy of the - switch. Many switches do not support any modes that stripe - traffic (instead choosing a port based upon IP or MAC level - addresses); for those devices, traffic flowing from the - gigabit device to the many 100Mb devices will only utilize one - interface. + Note that the fraction of packets that will be delivered out of + order is highly variable, and is unlikely to be zero. The level + of reordering depends upon a variety of factors, including the + networking interfaces, the switch, and the topology of the + configuration. Speaking in general terms, higher speed network + cards produce more reordering (due to factors such as packet + coalescing), and a "many to many" topology will reorder at a + higher rate than a "many slow to one fast" configuration. + + Many switches do not support any modes that stripe traffic + (instead choosing a port based upon IP or MAC level addresses); + for those devices, traffic for a particular connection flowing + through the switch to a balance-rr bond will not utilize greater + than one interface's worth of bandwidth. If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order @@ -1936,6 +2007,10 @@ Failover may be delayed via the downdelay bonding module option. 13.2 Duplicated Incoming Packets -------------------------------- + NOTE: Starting with version 3.0.2, the bonding driver has logic to +suppress duplicate packets, which should largely eliminate this problem. +The following description is kept for reference. + It is not uncommon to observe a short burst of duplicated traffic when the bonding device is first used, or after it has been idle for some period of time. This is most easily observed by issuing @@ -2096,6 +2171,9 @@ The new driver was designed to be SMP safe from the start. EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes, devices need not be of the same speed. + Starting with version 3.2.1, bonding also supports Infiniband +slaves in active-backup mode. + 3. How many bonding devices can I have? There is no limit. @@ -2154,11 +2232,15 @@ switches currently available support 802.3ad. 8. Where does a bonding device get its MAC address from? - If not explicitly configured (with ifconfig or ip link), the -MAC address of the bonding device is taken from its first slave -device. This MAC address is then passed to all following slaves and -remains persistent (even if the first slave is removed) until the -bonding device is brought down or reconfigured. + When using slave devices that have fixed MAC addresses, or when +the fail_over_mac option is enabled, the bonding device's MAC address is +the MAC address of the active slave. + + For other configurations, if not explicitly configured (with +ifconfig or ip link), the MAC address of the bonding device is taken from +its first slave device. This MAC address is then passed to all following +slaves and remains persistent (even if the first slave is removed) until +the bonding device is brought down or reconfigured. If you wish to change the MAC address, you can set it with ifconfig or ip link: -- cgit v1.2.3-59-g8ed1b