Sandeep Gupta
independent study with
Prof. Satish K. Tripathi
Department of Computer Science
University of Maryland at College Park, MD 20742
&
Dr. Srinivasan Keshav
AT&T Bell Laboratories, Murray Hill, NJ 07974
during Summer '94
This note describes work related to the Unix port of the native mode ATM [6] protocol stack on Xunet II [4]. In particular, the note describes work on the IDLInet [9] networking protocol stack to eliminate memory copies and to adapt its relevant routines to the Unix kernel. It also describes the kernel modification that creates a simple test setup for easy debugging of the initial port in a real kernel environment.
ATM (Asynchronous Transfer Mode) refers to the set of networking standards that define a cell based data multiplexing and switching technique associated with the CCITT Broadband Integrated Services Digital Network (B-ISDN). There is considerable interest in connecting workstations to ATM based networks. Xunet is an experimental high speed wide area network, which connects several universities and AT&T Bell Labs using ATM switches. When we refer to ATM networks, we usually mean the ATM Adaptation Layers (AALs), the ATM Layer, and the Physical Layer. From the perspective of this work, access to the network is at the AAL. This implementation builds on a considerable body of existing work. The background, an outline of the related components, and a comparison of the native mode approach with related implementations of ATM on Unix workstations are included.
The purpose of this note is to provide the background and a description of the work done. The next section is about native mode ATM and the requirements of this work. The following sections describe the Xunet II framework, the native mode protocol stack, work done on the stack, the mapping to Unix, test environments, and further work. Finally, the appendix provides an environment related summary.
A native mode ATM protocol stack [6] allows user applications to see the per virtual circuit, end-to-end quality of service guarantees of the underlying ATM network. Most work on ATM networks focusses on the Data Link and Physical layers of the protocol stack. It is often assumed, with some reason, that IP and TCP will be the higher layer protocols.
An extremely large number of packet networks, most significantly Internet subnetworks, deploy the Internet Protocol (IP) at the network layer. IP is a connectionless protocol. Transport layer services such as the connection oriented TCP and the connectionless UDP account for most end-user use of the Internet. As ATM is now being deployed as a high speed network solution in both Wide Area and Local Area Networks, there is a natural need for these to interoperate.
In response to the above, there has been a lot of effort on one straightforward approach, akin to the CSNet solution in which IP was tunnelled through X.25 networks [3]. Likewise, tunnelling IP through ATM networks has some use, thus mimicking a connectionless network service. This allows the conventional layers of networking software to be retained. However, when we do that, there is no known way to effectively satisfy quality of service (QoS) requirements on a per user or per connection basis with a connectionless and multiplexed network layer.
Figure 1: Conventional and Native mode ATM architectures
Thus, with connectionless IP, higher layer service users cannot expect any performance guarantees from the network. Such guarantees are imperative for applications such as multimedia and real-time services. A connection oriented network like ATM could supply such guarantees from the lower layers to the higher layer application, if every layer between ATM and the application is connection oriented, and preferably not multiplexed. An implementation with this philosophy will also provide a natural way to reflect the features of the ATM service to the application. This approach and its conventional counterpart, described earlier, are shown in figure 1.
There have been several attempts to integrate ATM technology into current systems. One important classification is based on whether or not they retain the semantics of the socket interface. The ones that change the semantics of the interface direct their modifications towards copyless architectures, by allowing applications to access card buffers directly. Even though it does not solve all problems, there is merit in eliminating copies altogether. However, this approach changes the socket interface enough to force a rewrite of old applications.
The other attempts, e.g., [2] and [8], retain the socket semantics and provide an interface to the ATM layer; [2] also provides mechanisms to tunnel IP over ATM. This enables the use of TCP as the transport layer. In our opinion these are good interim approaches. However, since they entail the loss of several ATM features, we believe there is merit in designing new transport protocols that reflect native ATM capability to the user applications. Such an implementation would be able to provide the QoS guarantees, as it would use a connection oriented stack throughout, with no logical multiplexing of higher layers into a single lower layer from the perspective of the user of the higher layer.
Building such a protocol stack, one that provides transport layer support while still retaining native ATM advantages, was the primary motivation for the IDLInet [9] protocol stack. In addition, for high performance, the intent was to design lean data paths and to separate data flow from control flow, which also happens to be a consequence of an underlying connection oriented network with out of band signalling. IDLInet provides QoS negotiation, error control, flow control, performance guarantees, multicast support, and connection management services. IDLInet runs over DOS as well as over REAL, a network simulator. The basic implementation of IDLInet was taken as a proof of principle, and we worked from there.
The objective was to provide the same protocol stack as IDLInet for the Unix environment. Since Unix is a multi-user environment, any such service has to be provided as a system service. The IDLInet stack cannot be directly ported to the kernel, as the two environments are very different. We also needed to make the stack more efficient. In particular, we changed the buffer handling and scheduling in the appropriate layer (transport) of the IDLInet stack and modified it to fit into the Unix kernel and the Xunet AAL/X interface. The changes in buffer handling were made to eliminate copies wherever they could be avoided. Scheduling in the IDLInet stack was simple round-robin, and it was adapted to fit scheduling in the kernel. In addition to this basic work, we created the environment for testing the implementation on standalone machines.
In this section we describe the Unix kernel environment for the implementation, with some related background for perspective. Unix networking protocol implementations are layered in the kernel, and the user can view them as a single transport / session layer programming interface at the socket programming interface. With minor exceptions, inside the kernel, higher layer protocol implementations use the implementation of the lower layer as a given, reflecting the standard networking paradigm.
The kernel/user divide allows for a clean division of functionality, but it is also used to enforce inter-user protection, and it facilitates improved performance. This is because integrating the protocol implementation with the kernel gives the implementor greater flexibility and control over scheduling, resource management and interface access.
As mentioned in the introduction, access to ATM is at the adaptation layer. In general, different ATM service access points for user applications may provide different classes of service (e.g., AAL1-AAL5). For example, one class provides a connectionless data service, and another provides a connection oriented data service. The Xunet II adaptation layer is called AAL/X, which is similar to AAL5.
The common implementation paradigms are the BSD Networking Implementation [7] and System V Streams. Both paradigms have a common rationale: transparency across machines, efficient implementations, and the ability to retain compatibility with other Unix interprocess communication. Since Xunet II is connected to several university environments where most machines run BSD/Unix variants, BSD networking has been chosen as the vehicle [4]. Within the BSD paradigm, a communication domain is the basic abstraction that defines all support for each type of communication network. Each domain has a set of protocols for different classes of service. The existing AF_XUNET domain implementation provided a datagram interface to the AAL/X adaptation layer.
A brief note on IDLInet is included in the appendix. [9] includes a detailed discussion on the implementation's architecture for the basic stack layers and signalling. In this note we mention the architecture of the stack layers and the transport protocols implemented. The signalling part was skipped, as on Xunet II an entirely different signalling protocol is implemented [10].
The current IDLInet implementation provides three transport layer services: guaranteed performance, best effort, and reliable in-order delivery. The guaranteed performance service provides a certain rate of transmission, with the possibility of lost data. The reliable service is equivalent to TCP, i.e., it guarantees in-order, error free delivery, though without any time bound. Best effort is the regular send and pray feature.
QoS negotiation is done using the Resource Management Transport Protocol, which uses the transport layer. As of now, this is not implemented in the Unix port.
In this section the problem of memory copies is introduced, with a discussion of the possible alternatives. Finally, we describe the implementation for IDLInet, which is a simple variation of the BSD/Unix mbufs.
In contemporary systems, the primary bottleneck dictating network throughput performance is the memory bandwidth. Buffering is essential whenever there is a strictly layered interface between the sender and receiver protocols. Each copy is two memory cycles, a read and a write. Buffers may potentially be copied from layer to layer, and also for some intra layer operations such as checksumming, fragmentation, and encapsulation. Not all copies can be eliminated.
Figure 2: Buffer copies (1-4) in a conventional BSD/Unix network stack. 3 is often avoided.
Some of the copies are inherent to the hardware and the operating system architecture. To illustrate this, figure 2 shows the path of data buffers in a Unix networking subsystem. For reasons of protection, copy 1, from user space to kernel space, is unavoidable. Copies 2 and 3 may be eliminated by smart use of buffers. Finally, copy 4, from system buffers to card buffers, is determined solely by the hardware design. In the case of the network interface card [1] on IRIX machines, the card understands only physical memory addresses, and the kernel mbufs are naturally in virtual memory space. Hence the data has to be transferred to a buffer at a physical address that the card would understand, even though the card and the workstation share memory space.
The BSD/Unix mbufs and the System V Unix blocks are essentially general purpose buffer libraries for protocols to use, providing a uniform interface and eliminating copies by providing primitives to chain buffers or pass them by reference. This does not solve all problems, however. One source of unexpected copies is fragmentation. Making multiple copies of pages may be avoided initially, but fragmentation in this case entails prepending or appending protocol data. Any second level of fragmentation, or copying the fragments to a contiguous space that the network interface will understand, will require copies. One solution has been proposed by Van Jacobson, namely pbufs (unpublished), which are preallocated based on the interface, to avoid fragmentation into small mbufs.
Ideally, one could preallocate buffers with holes at appropriate places where the lower layers would fill in protocol data. If such a buffer can be allocated from the uppermost layer, or even the socket layer, we can have a single copy architecture - at the cost of changing the socket code. VINCE (Vendor Independent Network Control Entity) [5] uses a similar scheme in user space. It didn't seem worth the design effort to impose this ideal solution over the Unix mbuf scheme.
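The idea of a preallocated buffer with a hole in front of the payload can be sketched as follows. This is a minimal user space illustration only, not the VINCE or IDLInet code; the names hbuf, hbuf_init and hbuf_prepend, and the two sizes, are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define HBUF_SIZE     256  /* total storage per buffer (assumed value) */
#define HBUF_HEADROOM  32  /* hole reserved for lower layer headers (assumed) */

/* Hypothetical buffer with reserved space in front of the payload. */
struct hbuf {
    unsigned char store[HBUF_SIZE];
    size_t off;   /* offset of the first valid byte in store[] */
    size_t len;   /* number of valid bytes starting at off */
};

/* Place the payload behind the reserved hole: the only data copy. */
static int hbuf_init(struct hbuf *b, const void *payload, size_t n)
{
    if (n > HBUF_SIZE - HBUF_HEADROOM)
        return -1;
    b->off = HBUF_HEADROOM;
    b->len = n;
    memcpy(b->store + b->off, payload, n);
    return 0;
}

/*
 * A lower layer claims hdrlen bytes directly in front of the data and
 * fills them in place. The payload itself is never moved.
 * Returns a pointer to the new header space, or NULL if the hole is used up.
 */
static unsigned char *hbuf_prepend(struct hbuf *b, size_t hdrlen)
{
    if (hdrlen > b->off)
        return NULL;
    b->off -= hdrlen;
    b->len += hdrlen;
    return b->store + b->off;
}
```

A trailer, such as the one appended by the ATM layer, could be handled the same way by reserving tailroom; that is omitted here for brevity.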
The design decisions were influenced by our interest in porting the IDLInet stack to Unix. In parallel, we wanted to improve memory handling in the IDLInet/REAL (and DOS) code. The software structure of IDLInet code is shown in figure 8, towards the end of this writeup. In the IDLInet code, there are copies when the buffers are passed between the socket layer and the transport layer, the transport and the ATM layer of IDLInet (which appends an ATM like trailer and computes the checksum), and similarly between the other layers, as shown in the figure. In the original code, for computing the checksum, the packet is first copied to a contiguous buffer and then the checksum is computed.
Figure 3: Copy elimination in the idlinet stack.
Since we wanted to move this code towards the Unix kernel, the mbuf routines [7] that are used in the Unix kernel were taken and adapted into a user space version that can run on REAL and on DOS. There are two ways of adapting the mbuf routines to the REAL/DOS version. In the kernel, these routines are layered over the kernel memory allocator. One option would have been to replicate that environment and keep the mbuf code intact. However, the mbuf routines in the IRIX kernel have enhancements, and their design was not completely understood. Secondly, the page allocation and related machinery that comes with the kernel may not be required, or even be effective, for the user space implementation. For the sake of simplicity, we therefore kept the ported mbuf routines compatible with the kernel only at the call interface level. This also allowed us to do some optimization in the code that was not required in the Unix port. In particular, the network layer of IDLInet is not required on Unix. We used contiguous buffers from the higher layer, without breaking them into pages, and for fragmentation at the network layer we used references to the same buffer with offsets. This allows zero copy fragmentation. If we had retained the kernel implementation of mbufs, this optimization would not have been easy. The code for the mbuf interface compatible routines is derived from the standard kernel code. Basically, it involved removing the virtual memory page management code and using simple memory allocation routines.
Other than bringing the code closer to the kernel environment, the checksum routines were rewritten. The original routines required the data to be in a contiguous space, which involved a copy. Now they operate on mbuf chains, eliminating one more read/write. The remaining copies and the ones eliminated are shown in figure 3.
In this section we summarize all the other changes that were required in the IDLInet routines that had to be moved to the Unix kernel.
The other major difference between the IDLInet stack and the kernel environment is the scheduling. IDLInet relies on round-robin scheduling of the layers, whereas the kernel environment is driven by system calls and device interrupts. Both have provisions for timer initiated interrupts. The porting task basically consists of isolating the routines and changing their entry points and interface parameters to suit the kernel environment.
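Fragmentation by reference, as described above, can be sketched in a few lines. This is an illustration under stated assumptions rather than the ported code: the descriptor struct frag and the function fragment_by_reference are hypothetical names, and the real routines keep the mbuf call interface.

```c
#include <stddef.h>

/* Hypothetical descriptor: a fragment is a window onto a shared buffer. */
struct frag {
    const unsigned char *base;  /* shared contiguous packet buffer (not owned) */
    size_t off;                 /* start of this fragment within base */
    size_t len;                 /* length of this fragment in bytes */
};

/*
 * Describe a packet of `total` bytes as fragments of at most `mtu` bytes.
 * Only descriptors are written; no payload byte is read or copied.
 * Returns the number of fragments, or -1 if mtu is zero or out[] is too small.
 */
static int fragment_by_reference(const unsigned char *pkt, size_t total,
                                 size_t mtu, struct frag *out, int max_frags)
{
    int n = 0;
    size_t off = 0;

    if (mtu == 0)
        return -1;
    while (off < total) {
        if (n == max_frags)
            return -1;
        out[n].base = pkt;
        out[n].off = off;
        out[n].len = (total - off < mtu) ? total - off : mtu;
        off += out[n].len;
        n++;
    }
    return n;
}
```

A complete version would also need some form of reference counting, so that the shared buffer is released only after the last fragment has been transmitted; that bookkeeping is left out of this sketch.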
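Computing a checksum directly over a buffer chain amounts to carrying the running state from one buffer to the next. The sketch below shows this with a simplified stand-in for an mbuf (struct seg, a hypothetical name). The note does not say which checksum IDLInet computes, so the common reflected CRC-32 is used here purely as an illustrative choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an mbuf: one data segment in a chain. */
struct seg {
    const unsigned char *data;
    size_t len;
    struct seg *next;
};

/* Feed n bytes into a running CRC-32 state (reflected polynomial 0xEDB88320). */
static uint32_t crc32_update(uint32_t crc, const unsigned char *p, size_t n)
{
    while (n--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/*
 * Checksum a whole chain in place. The state is carried across segment
 * boundaries, so the data never has to be gathered into one buffer.
 */
static uint32_t crc32_chain(const struct seg *s)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (; s != NULL; s = s->next)
        crc = crc32_update(crc, s->data, s->len);
    return crc ^ 0xFFFFFFFFu;
}
```

The result is the same however the bytes are split across segments, which is the property that makes the contiguous copy unnecessary.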
Figure 4: Input Software structure
These changes can be understood in terms of figures 4 and 5 and the IDLInet software structure [9]. Figure 8, reproduced from there, outlines that software structure. Figures 4 and 5 outline the basic structure of the Unix implementation.
When data meant for the ATM stack is received from the network interface, it is placed in the appropriate interface queue and a software interrupt is posted to call the xu_input() routine (or loopback_input() in the case of the AF_LOOPBACK protocol family). This routine calls the g2_receive() routine. The received data is enqueued into receive buffers as in the IDLInet routines, and eventually, when a complete packet has been received, it is placed in the socket buffers using the standard socket routines. Notice that in the original IDLInet, control is given to the g2_receive() routine by the scheduler at regular intervals.
Figure 5: Output Software Structure Required. Please see section on Further work.
The send side routines use the protocol switch entry for transmitting data. The xu_usrreq() routine (or loopback_usrreq(), as above) places the data from the socket layer on the appropriate send list as in IDLInet, after appropriate fragmentation and encapsulation. The g2_transmit() routine is called by xu_fasttimo(), which looks at the send lists and transmits any packets available there. Eventually, these queues are planned to be implemented at multiple priorities.
The leaky bucket algorithm in IDLInet uses floating point operations. For the sake of speed, floating point operations are not allowed in the kernel. The leaky bucket token calculation routine has therefore been replaced by a stub routine that returns a token whenever it is called. This routine needs to be rewritten to use integer operations only.
There is a difference between the IRIX versions in the handling of communication domain and netisr initialization. In the IRIX 4 version, these were statically configured into the kernel using definitions in header files and tables. In IRIX 5, they are configured into the appropriate data structures using function calls. This also allows them to be dynamically deleted. The netisr bit is also set using a function call, at domain initialization time. Separate routines have been written for both versions of the kernel.
This section describes the method for testing the mbuf implementation, testing the native mode port on a standalone Unix machine, and finally the test setup for Xunet.
The mbuf tests randomly picked a set of mbuf library calls, took a group of buffers, and repeated a set of operations on them. The operations were hard coded into the test routines. While the cases were not chosen to form an exhaustive set, they used the usual tricks: an odd number of even sized buffers, an even number of odd sized buffers, staggered data patterns, and a random set of operations that built the buffer chains and then trimmed them on both sides by a random number of bytes.
In the case of IDLInet on REAL, a simple comparison with the output of the standard stack was considered as a test. This turned out to be woefully inadequate, as the IDLInet applications were not completely understood initially.
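One possible shape for an integer-only token calculation is sketched below. It is a suggestion under stated assumptions, not the routine used in the port: the names ibucket, ibucket_init and ibucket_take are hypothetical, time is assumed to be available as a tick counter running at hz ticks per second, and 64-bit integers are used for simplicity. Credit is kept scaled by hz, so that no division and no fractional values are needed on the per-packet path.

```c
#include <stdint.h>

/* Hypothetical integer-only token bucket; one token is worth hz credit units. */
struct ibucket {
    uint32_t rate;    /* tokens added per second */
    uint32_t depth;   /* maximum number of tokens held */
    uint32_t hz;      /* clock ticks per second (assumed time base) */
    uint64_t credit;  /* accumulated tokens, scaled by hz */
    uint64_t last;    /* tick count at the previous update */
};

static void ibucket_init(struct ibucket *b, uint32_t rate, uint32_t depth,
                         uint32_t hz, uint64_t now)
{
    b->rate = rate;
    b->depth = depth;
    b->hz = hz;
    b->credit = (uint64_t)depth * hz;  /* start with a full bucket */
    b->last = now;
}

/* Returns 1 and consumes a token if one is available at tick `now`, else 0. */
static int ibucket_take(struct ibucket *b, uint64_t now)
{
    uint64_t cap = (uint64_t)b->depth * b->hz;
    uint64_t elapsed = now - b->last;
    uint64_t gain;

    /* Each elapsed tick earns `rate` units; clamp before multiplying
       so that a long idle period cannot overflow. */
    if (b->rate != 0 && elapsed > cap / b->rate)
        gain = cap;
    else
        gain = elapsed * b->rate;

    if (gain > cap - b->credit)
        b->credit = cap;
    else
        b->credit += gain;
    b->last = now;

    if (b->credit < b->hz)
        return 0;
    b->credit -= b->hz;
    return 1;
}
```

With rate 10, depth 2 and hz 100, for example, the bucket allows a burst of two tokens and then one further token every 10 ticks.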
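A test of the kind described, building a chain and trimming it on both sides, could look roughly like the following. The structure tbuf and the chain_* helpers are hypothetical simplifications written for this illustration; they are not the mbuf library calls themselves.

```c
#include <stddef.h>

/* Simplified stand-in for an mbuf used only in this test sketch. */
struct tbuf {
    unsigned char *data;
    size_t len;
    struct tbuf *next;
};

/* Total number of valid bytes in the chain. */
static size_t chain_len(const struct tbuf *s)
{
    size_t total = 0;

    for (; s != NULL; s = s->next)
        total += s->len;
    return total;
}

/* Trim n bytes from the front by advancing data pointers; nothing is copied. */
static void chain_trim_head(struct tbuf *s, size_t n)
{
    for (; s != NULL && n > 0; s = s->next) {
        size_t cut = (n < s->len) ? n : s->len;
        s->data += cut;
        s->len -= cut;
        n -= cut;
    }
}

/* Trim n bytes from the tail by shortening segment lengths. */
static void chain_trim_tail(struct tbuf *s, size_t n)
{
    size_t total = chain_len(s);
    size_t keep = (n < total) ? total - n : 0;

    for (; s != NULL; s = s->next) {
        if (s->len > keep)
            s->len = keep;
        keep -= s->len;
    }
}

/* Byte i of the chain's logical contents, or -1 if i is out of range. */
static int chain_byte(const struct tbuf *s, size_t i)
{
    for (; s != NULL; s = s->next) {
        if (i < s->len)
            return s->data[i];
        i -= s->len;
    }
    return -1;
}
```

A test case then fills the buffers with a known pattern, trims by chosen amounts, and compares chain_len() and chain_byte() against the expected remainder.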
For Unix implementation, in addition to the kernel test set-ups described below, test programs were written to exercise the sockets with appropriate protocol options - guaranteed performance, best effort, and reliable service. Test for each option is in two parts, a client program, which generates traffic, and a server program, which receives and displays the data on the screen. These routines are called gserv, gclient for guaranteed service test, bserv, bclient for best effort service test, and rserv, rclient for reliable service test. These are minor modifications of the client-server test routines serv, client for the original AF_XUNET datagram service.
In addition to this, the Unix utility netstat was used to monitor memory usage while the system was left to run for hours, on a single connection or repeatedly for several connections. Any buffer leaks from the routines can be spotted as an increase in the number of buffers allocated to the networking subsystem. Initial testing was done by isolating one Unix machine from the network, as that way the steady allocation of memory was affected only by these test programs.
There are two procedures to be described here. First we tested the implementation on a single machine, and then on actual Xunet machines. Even though the Unix environment is available on Xunet, setting up the test consumed a lot of time.
This setup was created to simplify the test procedures. Setup on the actual network takes time and is sometimes not really necessary. Also, since a lot of development activity was ongoing on the actual network, it was sometimes not easy to tell what the real source of a bug was.
There is an OS version mismatch between the standalone machine used (IRIX 5) and the machines on Xunet (IRIX 4). The loopback drivers and the code were adapted for both versions. This difference is mentioned in the porting issues.
The logical view of the loopback communication domain is shown in figure 6. Basically, it looks the same to the transport layer from both the socket side and the card interface side. The loopback routines mimic the routines in the actual setup exactly, in terms of the call interface and the parameters passed. Note that this setup is OS specific, as different OSs use a slightly different interface at that level.
Figure 6: Loopback domain, for testing the implementation on a standalone Unix machine.
An AAL packet, when sent out on the loopback, is sent back to the transport layer via the netisr software interrupt and a fake interface queue, with the appropriate vci etc. set in the right place. In this case, the loopback routines mimic the orc card drivers. This setup can thus be used on an isolated machine that has no connections to Xunet or any specific hardware installed. The ability to isolate the machine, by turning off the ethernet interface (using the Unix utility ifconfig) and physically disconnecting it from the net, provides a tightly controlled environment for testing. However, please see the caveat mentioned below.
In the actual setup, with everything working perfectly, we should be able to use the signalling software to get the vci's for a connection, and use the test routines as usual. During testing on the actual setup, a problem was noticed with the signalling software, suspected to be due to a version mismatch in the running binaries. (It affected the read mode of the sockets, returning old data if the socket had nothing new to read.) It is easy to bypass the signalling and set up the connections manually if this is suspected.
Figure 7: Setup used for testing the implementation on Xunet II
Figure 7 shows the part of Xunet used for testing the stack. The two workstations, peters and wilson, are connected through the two switches sxu11 and xmh-3. The workstation tempel is connected to the datakit switch which allows one to access the console of each of the stations and the switches.
The method to set up the connections in the switches manually is as follows. Use tempel to connect to the switch consoles, via the datakit switch, and use the switch monitor commands as shown in this example. With the setup as shown in the figure, say we wanted to connect vci 1 on peters to vci 1 on wilson in the direction from peters to wilson, and vci 2 in the reverse direction. Notice the slot numbers on the switches. The command for connecting, "c", uses the notation "<slot number>.<vci>" for its arguments.
From tempel, log into x-mh using the command con -l xmh3. At the switch prompt, type c 7.1 2.1. Similarly, log into sxu11 using the command con -l xgucci3, and at the switch prompt type c 4.1 5.1. In the reverse direction, set up the connection for vci 2. Notice that the order of the arguments is important, as it determines the direction of the (half duplex) connection.
While the method of testing the stack on a standalone machine first and then taking it to the actual network is very useful, there is a kind of problem that might be found only on the actual setup. In our case, we got all three transports to work on the standalone machine, but the reliable service did not work on the actual setup. The reason was that, while porting the code for the reliable option, a mistake was made in using the protocol control blocks, and the receive and send sides wound up writing to a part of each other's protocol block. Such bugs will be discovered only on the actual setup.
The following are the things that need to be worked on in the current implementation.
This note describes the work done towards porting IDLInet to IRIX. It includes a description of the environment and the background related to this work, and describes the actual effort and the environment set up for testing it.
REAL is a simulator for studying the dynamic behavior of flow and congestion control schemes in Packet Switched Data Networks. The input to the simulator is a description of the network topology, protocols, workload and control parameters. The output is statistics for the data sources and intermediate nodes. The input is called a scenario and is described in NetLanguage. The network is modelled as a graph, where vertices represent sources, sinks, or gateways. The edges represent the communication links. The user can specify network wide flow control method, the workload at each of the sources, the scheduling disciplines at the gateways, and for each communication link, the latency, bandwidth, size of trunk board buffers, and the packet size.
Figure 8: IDLInet software structure. From [9].
Each layer in the IDLInet protocol stack, from the applications on the application layer down to the data link layer, is expected to provide three functions as entry points: the layer's initialization function, the layer's operation function, and a timeout function. These entry points are provided to the scheduler, which calls each layer's operation function in a round robin manner. In addition to this, layers can also schedule functions using timer requests. Figure 8 is the software structure overview diagram reproduced from [9] and shows the data and control flow in IDLInet. The solid lines show the data flow, and the dotted lines show the control flows.
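The three entry points and the round robin scheduler can be pictured with the following sketch. It is an assumption-laden illustration, not the IDLInet source: the structure layer, the functions sched_init and sched_run, and the two counter_* demonstration callbacks are all hypothetical names, and the timer request mechanism is left out.

```c
#include <stddef.h>

/* Hypothetical per-layer record holding the three entry points. */
struct layer {
    const char *name;
    void (*init)(void *state);     /* layer initialization function */
    void (*operate)(void *state);  /* layer operation function */
    void (*timeout)(void *state);  /* timeout function (timers not modelled) */
    void *state;                   /* private data of the layer */
};

/* Call every layer's initialization function once, in stack order. */
static void sched_init(struct layer *layers, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (layers[i].init != NULL)
            layers[i].init(layers[i].state);
}

/* Give each layer one turn per round, for the given number of rounds. */
static void sched_run(struct layer *layers, size_t n, unsigned rounds)
{
    for (unsigned r = 0; r < rounds; r++)
        for (size_t i = 0; i < n; i++)
            if (layers[i].operate != NULL)
                layers[i].operate(layers[i].state);
}

/* Demonstration callbacks: a layer that only counts its turns. */
static void counter_init(void *state)
{
    *(int *)state = 0;
}

static void counter_operate(void *state)
{
    ++*(int *)state;
}
```

In the kernel, no such loop exists: the work done by each operation function has to be triggered from system calls, software interrupts, or timer routines instead.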