;; -*-mode: Outline; -*-

TOC
General
Awards
Keynote: The Internet's coming silent spring
Interactive 3D Graphics applications for Tcl
The AGFL Grammar Work Lab
Plan9 and Inferno
Introduction to Air Traffic Management Systems
The Future Is Coming: Where the X Window System Should Go
XCL: An Xlib Compatibility Layer for XCB
Biglook: A Widget Library for the Scheme Programming Language
BSD BoF
An Implementation of Scheduler Activations on the NetBSD Operating System
Authorization and Charging in Public WLANs Using FreeBSD and 802.1x
Design and Performance of the OpenBSD Stateful Packet Filter (pf)
Enhancing NFS: Cross-Administrative Domain Access
Ningaui: A Linux Cluster for Business
CPCMS: A Configuration Management System Based on Cryptographic Names
X Meets Z: Verifying Correctness in the Presence of POSIX Threads
Cyclone: A Safe Dialect of C
Cooperative Tasking Without Manual Stack Management
Improving Wait-Free Algorithms for Interprocess Communication in Embedded Real-Time Systems
Special Closing Session: How Flies Fly?

* General

June 14-16, 2002, Monterey, CA. Attendance: ca. 1100 people, down from
last year's 1500. The slowdown in the economy was noticeable. The
conference is selective: in the general track, 25 of the 105 submitted
papers were accepted. In the FREENIX track, 26 of the 53 submitted
papers were accepted.

* Awards

James Gosling received the USENIX Lifetime Achievement Award (aka "The
Flame"). Accepting the award, James Gosling told a story of his first
USENIX conference. It was in 1978, at Columbia U (?). On the first day
of the conference, in the morning, he took a stroll through a park
near the conference center. He enjoyed the park and the walk. When he
later told that to several people, they stared at him in disbelief,
touched him to make sure, and exclaimed: "You're alive!" That, James
Gosling said, epitomizes his programming career: "I was just too dumb
to know better."

The Apache project received the Software Tools User Group (STUG)
award.

* The Internet's coming silent spring

Lawrence Lessig, Stanford University. The keynote address.

Lawrence Lessig is the director of the Center for Internet and Society
at Stanford Law School. He is a board member of the Electronic
Frontier Foundation (EFF). The style of his presentation was most
unusual: his slides flashed a word or a sentence in white against a
black background. The first slide read: "".

Lawrence Lessig started by telling three stories. The first was the
story of Armstrong's fight against RCA. RCA hired Armstrong to improve
its AM technology. Armstrong invented FM radio instead. When he
started to promote it, RCA fought tooth and nail. One of RCA's
internal memos said, "we cannot _allow_ that damn radio technology to
compete". After twenty years of litigation, Armstrong gave up and
committed suicide. "FM radio was not allowed," Lessig said.

The second story: the Internet vs. AT&T. When Paul Baran first
proposed packet-switching technology, the Department of Defense asked
the opinion of _the_ telephone company. AT&T said it would never
work. Internally, an AT&T manager wrote, "we cannot _allow_ that damn
technology to suck the blood out of our veins". The Internet was not
allowed.

The third story was about an Aibo dog, a web site that taught an Aibo
dog to dance jazz, and Sony's attempt to shut that web site down.

The point of these stories was that societies or environments can
either allow or forbid. End-to-end architectures allow innovation.
Centralized architectures (like the old AT&T network) exercise tight
control over who can access the network and what kind of protocols
they are permitted to run. The Internet is an end-to-end architecture:
a neutral network that lets many different clients connect and
communicate with each other. The key innovations of the Internet (the
TCP/IP protocol, USENET, HTTP, Hotmail, IRC) were made either by
graduate students ("kids") or by non-Americans. By design, the
Internet encourages innovation by outsiders. Another example of an
end-to-end network that fostered innovation (802.11, Bluetooth) is
_unlicensed_ spectrum.

The open, end-to-end architecture of the Internet is threatened. The
physical layer (the owners of the wires and fibers) and the content
layer are corrupting the core. With the help of lawyers, the owners of
the networks and of the content have managed to frame the debate in a
very restrictive way.

First, they say: "It's our property." The cable companies say: the
cable and the fiber are our property; therefore, we can dictate what
kind of services and protocols are allowed on our networks. Running
web servers on client computers is not allowed. Compare that with the
dawn of the Internet and USENET: AT&T owned the networks, and charged
a great deal for their use. However, AT&T was not permitted to
discriminate against users. As long as the users paid, they were free
to talk about anything over the phone, and they could send all kinds
of noises, including data-modulated ones.

The second issue that frames the current debate: "It's just theft."
Napster is the case in point. Curiously, the Napster situation has
occurred many times before. Early in the 20th century, piano rolls
"Napsterized" sheet music. After much acrimony over compensation to
composers and copyright holders, a solution was found: the
U.S. Congress instituted compulsory licensing. Radio broadcasting
permitted distribution of music to wide audiences at little cost,
which rekindled the "napsterization" debate. Again, the solution was
compulsory licensing. Copyright holders were compensated for the uses
of their content; however, they could not dictate how the content
should or should not be played. Cable TV brought up the napsterization
issue again, and again compulsory licensing was the solution.

The issue arose once more in VCR vs. the film industry. Jack Valenti
(the president of the Motion Picture Association of America) said that
the VCR is a "Boston strangler of the film industry". [Nowadays, Jack
Valenti equates DeCSS with terrorism. Some people never change.] After
eight years, the case went to the Supreme Court, which said: as long
as there is a _potential_ significant non-infringing use of a new
technology, the technology cannot be enjoined on the basis of
copyright law. In the case of the VCR, no compulsory licensing was
necessary: it turned out that the "robbed" film industry found ways to
make a lot of money on videotapes.

These examples show that (big) distributors of content have always
resisted new ways of using the content. However, they never managed to
suppress the innovation and the innovators -- until now. The DMCA (the
Digital Millennium Copyright Act) and the courts in the Napster case
let the media conglomerates regulate the technology. The old content
distribution networks are trying to use the law and the courts to
prevent any new ways of distributing content.
The unfortunate fact, Lawrence Lessig said, is that the current debate
has been pushed to extreme positions: absolute control of the content
vs. communism. We need a balance. As the compulsory licensing examples
show, there are many ways people can share content without being
thieves. Consider a book. If we copy a book entirely and sell the
copy, we clearly infringe on the copyright. If we make a copy of a
chapter to mark it up, we exercise our fair-use right. Besides these
two uses, there are many other, _unregulated_ uses. We can sell the
book to a used-book store, we can give the book away to a library, we
can highlight sentences in the book, we can write comments in the
margins, we can burn the book or use it as a doorstop. All these uses
are unregulated; indeed, the majority of the uses are unregulated.
When it comes to digital content, however, most of the uses turn out
to be regulated: everything is either an infringement or fair use. The
debate becomes polarized, and the middle ground is lost. We need to
quash the bad rhetoric, reframe the debate, flee from the extremes and
seek the balance.

The presentation certainly struck a chord. At the end of his
presentation, Lawrence Lessig received a standing ovation.

* Interactive 3D Graphics applications for Tcl

Oliver Kersting and Juergen Doellner, Hasso-Plattner-Institute,
University of Potsdam. Freenix track.

Common tasks in programming a 3D application are configuring a scene
and its properties, and exploration. These tasks are most conveniently
performed in a domain-specific, interpreted, _scripting_ language. The
authors have built a higher-level Virtual Rendering System (VRS) on
top of OpenGL. The main object in the VRS library (which is written in
C++) is a _node_. A node can be a shape, a property of a shape (color,
reflection coefficient, etc.), or a transformation. Nodes can be
assembled into a scene graph.

The authors have built Tcl bindings to the VRS library. Therefore,
scene manipulation -- an often repeated and crucial step -- can be
done in a scripting language. A scene graph can be built incrementally
and iteratively. The graph is rendered by the VRS library in real
time. The presenter gave a demonstration. Indeed, zooming, rotation,
and associating a zooming factor with a slider can all be done
interactively.

The authors have built a tool that semi-automatically maps the C++ API
into Tcl. Wrapper classes are generated automatically. Moreover,
overloaded C++ methods are turned into "generic" Tcl functions. This
is a distinguishing feature of the tool. Interestingly, the tool even
exports an "overloaded" delete operator into the corresponding Tcl
function: scene nodes in Tcl scripts must be explicitly deleted. The
authors have also built a nice Tcl IDE, with a class browser, property
sheets and an editor/shell window with keyword completion and syntax
coloring.

Unlike VTK, VRS is a general-purpose rendering library. VRS supports
advanced shading schemes, RenderMan, FreeType fonts, 2D imaging, etc.
	http://www.vrs3d.org
The library is under the LGPL.

* The AGFL Grammar Work Lab

Cornelis H.A. Koster and Erik Verbruggen, KUN. Freenix track.

The talk described the first _free_ parser generator for natural
languages, released under the GNU licenses (GPL/LGPL). The parser
generator takes a grammar and a lexicon and yields a parser. The
latter reads text and converts it into a Head-Modifier (H-M)
form. Generated parsers can be used commercially, royalty-free.
The H-M form is a set of pairs [head, modifiers], where 'head' and
'modifiers' are (possibly empty) strings of words. For example, the
phrase "new walking shoes" is translated into
	[N:shoes, A:walking] [N:shoes, A:new]
Both the head and the modifiers are typed: N (noun), P (pronoun), A
(adjective), V (verb form). H-M pairs represent the _major_ relations
expressed in the text; "unimportant" words are eliminated. The parser
can also normalize the tense of verbs. Here's a more elaborate
example: "Every PhD student gets a reduced price for the ETAP
conference" is parsed into:
	[N:student, N:PhD] [N:student, V:gets] [V:gets, N:price]
	[P:it, V:reduced] [V:reduced, N:price]
	[N:price, for N:conference] [N:conference, N:ETAP]
The conversion from the passive participle phrase "a reduced price" to
an active verb construction [P:it, V:reduced] [V:reduced, N:price] is
remarkable. The original phrase did not say who authorized the
reduction of the price for the conference. The parser determined that
the actor is missing, and inserted a dummy actor 'it': [P:it,
V:reduced].

H-M pairs produced by the parser can be used for text classification,
information retrieval, and question-answering systems. The parser can
also be used in educational software. In fact, the parser has been
used by the authors in a text (patent application) classification
system called Peking.

The AGFL package comes with a sample grammar and lexicon of
English. AGFL stands for "affix grammar over a finite lattice", which
is a _context-free_ grammar with set-valued attributes. The parser is
non-deterministic, and therefore not fast. It is, however, robust: it
does not crash when it is presented with unknown words or an incorrect
phrase. Again, the package is free. The AGFL package is available at
	http://www.cs.kun.nl/agfl/
The grammar of English is very illuminating:
	http://www.cs.kun.nl/agfl/npx/grammar.html

Given an ambiguous phrase, the parser currently returns only one
parsing alternative (chosen more or less arbitrarily and
heuristically). Version 2.0 will incorporate statistics and
probabilities. The parser is more scalable and faster than DCG in
Prolog. The system is written in the CDL3 language. The CDL system is
also a parser generator, which yields fast _deterministic_ parsers.

When I asked the speaker to compare his approach with the "dominant"
purely statistical approach, he said that extremes are rarely
sufficient. Statistical approaches need a lot of statistics to
represent the basic facts of a language (e.g., the Subject-Verb-Object
order of English). The latter can be more succinctly and conveniently
expressed by rules. It's more productive to use rules for most of the
grammar, and statistics to resolve ambiguities.

Cornelis H.A. Koster conducted a BoF on natural language
processing. The main topic of the BoF was again statistical
vs. rule-based classification and retrieval of documents. He mentioned
an interesting fact: stemming and singular/plural normalization do not
improve classification much. One of the drawbacks of the purely
statistical approach to information retrieval is that the relevance
score is highly non-linear. A relevance score of 99% tells us that the
document in question is most likely highly relevant to a query. OTOH,
a relevance score of 70% often means that the document has nothing to
do with the topic. The precision and the recall of purely statistical
classification and retrieval reach a plateau. You need to add the
rules of the grammar to get over the plateau.
Dr. Koster briefly described his document classification system
Peking. The documents to classify are patent applications. First the
system parses a document into H-M pairs as outlined above. It then
collects the statistics of the H-M pairs and winnows out stiff and
noisy terms. What's left are the significant pairs, which describe
"the essence" of a set of documents.

* Plan9 and Inferno

As in the year before, Vita Nuova had a booth at the USENIX Expo. Two
exhibitors were constantly busy demonstrating the two remarkable
operating systems. Inferno now runs natively on an iPAQ.

Inferno supports only one language: Limbo. Because the language is
safe, Inferno does not need a kernel/user-space boundary. This makes
the system fast. The Limbo virtual machine, DIS, is rather similar to
the JVM. It's possible to map (all?) JVM codes to DIS. Alas, DIS does
not support tail recursion. Roger Peppe, one of the exhibitors, showed
a Tk emulation in Limbo: essentially, he wrote a Limbo/Tk.

He also showed the amazingly transparent networking in
Inferno/Plan9. A serial port in Plan9 is represented by several
files. You can control the port with the regular 'echo' command. He
demonstrated a driver for a digital camera, written in "shell" (in
Limbo). The interface to the driver is a file system again. In
Plan9/Inferno, it's possible to create 'virtual files' which are
associated with a pair of handlers rather than with storage. When you
load the camera driver, you see a set of files in a directory you
designate. Doing "echo zoom 0.5 > kodak/ctl" changes the zoom
factor. You can see the pictures by executing "ls -l kodak/img/" at
the shell prompt. Roger Peppe showed a GUI, again written in shell
(Limbo/Tk to be precise), that talks to the camera by reading and
writing the files in the kodak/ directory. You can mount the kodak
directory on another computer: an iPAQ that was connected to the Plan9
laptop via a wireless link. You can indeed control the camera from the
iPAQ. The transparency of networking in Plan9/Inferno is astounding.
The underlying protocol, Styx, is well thought out and concise: it
takes fewer than 10 pages to describe it completely. OS developers
could greatly enhance their systems if they implemented Styx.

Distributed programming in Limbo is based on channels -- as in
communicating sequential processes. Threads are synchronized through
rendezvous, as in Ada. From the distributed-systems point of view,
Limbo programming looks surprisingly similar to Erlang.

Plan9 and Inferno binaries are freely available, even for commercial
use. The source, however, costs money. The source cannot be freely
distributed, but it can be modified and used in a commercial
application royalty-free. Changes do not have to be reported back to
Vita Nuova. See the FAQs at www.vitanuova.com

* Introduction to Air Traffic Management Systems

Ron Reisman and James Murphy, NASA Ames Research Center; Rob Savoye,
Seneca Software. Invited talk.

Trajectory synthesis is the hottest topic; it involves physics,
scheduling, even expert systems and neural nets. The flight routes are
fixed. Every two hours, the FAA command center in VA reviews the
weather and other conditions and replaces the routes with
alternatives, according to the rule book. All _strategic_ systems run
on UNIX. A tidbit: the PowerPC, which was developed at IBM, was
financed by the FAA as a part of the Advanced Automation System (AAS)
project. The FAA needed a platform to run UNIX reliably enough for air
traffic controllers.
The AAS project was probably the biggest software engineering disaster
ever. It was budgeted at $3.5 billion; by the time AAS was canceled,
$9 _billion_ had been spent.

At one point, Congress and the FAA discussed a free-flight
system. Under free flight, there are no mid-flight traffic
controllers, and pilots have the freedom to choose the best route
during the cruising phase. This turns out not to be a good idea: even
the authors of the free-flight proposal do not believe in it any
more. Ground controllers and fixed routes will remain.

* The Future Is Coming: Where the X Window System Should Go

James Gettys, Compaq Computer Corp. Freenix track.

The talk focused on migration and replication of X window sessions. A
user should be able to disconnect from an X window application and
reconnect to it from another terminal. A user may want to replicate
his "screen" so other users can interact with an application, possibly
in parallel. When a session is moved to a different terminal, however,
the new X server may have a different screen size, depth and
resolution, different fonts and other resources.

There are a few tools that support screen replication and migration to
some extent: xmove, VNC, pseudo-servers, Emacs. VNC shuffles bitmaps
around and doesn't handle screens of different sizes and depths. Emacs
actually supports transparent session replication and migration: see
make-frame-on-display. The feature has been implemented for quite some
time and works well. The only drawback is the remarkable obscurity of
the feature and the lack of advertising.

Server-side fonts: "a 15-year migraine". Any server-based fonts will
fail: there is no guarantee that an application will get its fonts;
serious applications need a lot of details about font metrics and
therefore overload the server with font requests; startup times are
long (while the server rasterizes all glyphs in application-requested
fonts, even if the application will use only a few of them).

A toolkit is the best place to handle migration and replication. Alas,
most toolkits are not built with these issues in mind. Jim Gettys
reported that he is slowly overcoming the problems, by changing a
toolkit (GTK) to be aware that the screen resolution, size and depth
may change during an application's lifetime. He has already managed to
move a simple widget from one screen to another. Still, there is a lot
of work to do.

* XCL: An Xlib Compatibility Layer for XCB

Jamey Sharp and Bart Massey, Portland State University. Freenix track.

XCB is a C binding for the X protocol. Unlike Xlib, XCB deals only
with communication, marshaling and similar issues. XCB offers a far
simpler and more consistent interface to the X protocol than Xlib
does. XCB is very small (27 KB compiled) and is therefore particularly
suitable for embedded systems. XCB can be used in a multi-threaded
environment. The code for XCB is mainly _automatically_ generated from
the specification of the X protocol!

Xlib contains much more functionality than a mere client side of the X
protocol. XCL is an attempt to implement some of that functionality,
using XCB to communicate with the server. XCL+XCB is meant as a
replacement for Xlib (a lighter version of Xlib, to be precise). The
goal is largely achieved: XCL+XCB is 55 KB compiled, whereas Xlib is
665 KB (some of which could be trimmed by replacing pervasive macros
with procedure calls). Some of the XCL code is automatically
generated; most of the rest is copied from the Xlib source code and
adjusted to use XCB. XCL does not implement all of Xlib.
In particular, XCL does not support X color management, i18n and
caches. As the presenter showed, these features, although accounting
for more than half of Xlib, are hardly used at all. Most toolkits do
i18n and color management on their own. The author reported success in
relinking several X window applications (rxvt and gw) to use XCL
rather than Xlib.
	http://xcb.cs.pdx.edu/

* Biglook: A Widget Library for the Scheme Programming Language

Erick Gallesio, University of Nice; and Manuel Serrano, INRIA. Freenix
track.

Like Tcl, Biglook relies on a programming language for building
graphical user interfaces. It uses a CLOS-like object layer [generic
functions rather than class methods] to represent widgets, and Scheme
closures to handle events. Biglook is a (Bigloo) Scheme
library. Biglook code can be compiled by the Bigloo compiler into C or
JVM byte code. Biglook has been used to implement a full-fledged IDE
for Scheme: Bee.

Biglook is a thin wrapper on top of native back-end
libraries. Currently, two native widget libraries are supported: GTK+
and Swing. The wrapping overhead is quite low: about 15% in additional
allocations.

The paper stresses several features that greatly contribute to
Biglook's expressiveness: 'virtual slots' in records, keyword
arguments, and closures as callbacks. Virtual slots are the mechanism
that relays actions on widget classes to the back-end widget
library. Virtual slots also make the API concise and expressive. For
example, three classical functions 'iconify', 'de-iconify' and
'window-iconified' are represented in Biglook by a single virtual slot
'visible' in a window class.

Variable numbers of arguments and keyword arguments are essential to
declarative GUI programming. Without them, the creation of a widget
must be split into two steps: the instantiation of the widget and the
setting of its attributes. The alternative, passing a list of
attributes to a widget constructor, involves a run-time overhead and
disables compile-time checking of attributes and their values. In
Biglook, keyword parameters are resolved at macro-expansion (i.e.,
compile) time. A macro instantiate::widget (which is automatically
generated from the widget's class declaration) does all the work of
resolving keyword attributes. Other automatically generated macros are
getters and setters, and a 'with-access::widget' binding form.

During the talk, Manuel showed several GUIs and their complete code,
each only ten lines long or shorter. Of special note is a comparison
of Biglook, Tcl/Tk, Swing and GTK on the example of a widget with
several buttons. In Biglook, the creation of a button, its packing and
the setting of the event handler are all part of the same
expression. In Tcl/Tk, the code that deals with a button is spread
over three places. In Swing, the same code is split across four
places; in GTK, five. The paper notes that in the experience of the
authors, Biglook source code is about half the size of the same
interface implemented in Tcl/Tk, and about a quarter of the size of
the interface implemented in C with GTK+.
	http://kaolin.unice.fr/Biglook

* BSD BoF

Like the previous two conferences, this year's USENIX had a SuperBSD
BoF. Leaders of the three non-commercial BSD projects and of BSD/OS
and Mac OS X/Darwin came to talk about their projects' achievements
and future plans. It seems that every year the SuperBSD BoF is
attended by more and more people.

** FreeBSD

FreeBSD now includes a dramatically faster driver for the Intel
Gigabit Ethernet card.
The driver was written by Intel and is supported by Intel in the
FreeBSD CVS tree. FreeBSD 5.0 will include many parts of TrustedBSD
(which is being supported by DARPA). FreeBSD 5.0 will include UFS2:
64-bit block pointers, native extended attributes (ACLs, security ids,
etc.), transparent hashed directories. Marshall Kirk McKusick said
that hashed directories (which hash file names into i-node numbers as
directory blocks are read into memory) are more efficient than
B-tree-based directories.

** Darwin

bash as /bin/sh, SAMBA, Python and _Ruby_ are part of the base
system. WebDAV will be used heavily, as a distributed
filesystem. netinfo will be phased out and replaced with
LDAP/OpenDirectory. MacOS X version 10.0 used a mixture of NetBSD and
FreeBSD user-level applications. The upcoming and future versions of
MacOS X and Darwin will closely track FreeBSD.

* An Implementation of Scheduler Activations on the NetBSD Operating System

Nathan J. Williams, Wasabi Systems Inc. Freenix track.

The talk started with an overview of three implementation models of
threads. In one model, threads exist only at the user level. All
threads run within the same kernel process. The implementation is
fast; however, when one thread enters the kernel or blocks, all other
threads within the same process are blocked too. At the other extreme
are kernel threads. Like processes, kernel threads are scheduled and
blocked independently. Unlike processes, different threads may share
the same address space, job identification, user credentials,
etc. Although thread switching is faster than process switching, it is
still relatively expensive. More importantly, each kernel thread takes
up kernel resources, which are limited. Kernel thread synchronization
involves a system call and is therefore expensive.

The third model -- the topic of the talk -- is a two-level
implementation: N kernel threads run M (M>=N) user threads. Thread
synchronization and switching occur at the user level. When a kernel
thread executing a user thread blocks on i/o or another system call,
the kernel "sends a signal" to the application, telling it that a user
thread has blocked. The signal will be handled by the remaining kernel
threads assigned to the application. If there are no other threads to
handle the "signal", the kernel may launch a new thread and give it to
the application. The "signal" handler (called an upcall handler) will
mark the user thread that got itself blocked as non-runnable and will
switch to another runnable user thread, if any.

This approach implies co-operation between the kernel and the user
application. The kernel scheduler notifies the application of
scheduling decisions that affect it: creation and deletion of kernel
threads, and a thread becoming blocked, unblocked, or pre-empted. In
the latter case, the pre-emption event is delivered to the application
_after_ it is resumed (so the application cannot stave off the
pre-emption, but it can still learn that it was off the CPU for some
time).

The NetBSD implementation described in the talk uses so-called
scheduler activations. Scheduling event notifications are called
upcalls. They behave pretty much like POSIX signals. An application
registers its intent to be notified of scheduling events by
registering an upcall handler. The application should also allocate a
set of stack frames on which to run the handler. The implementation
described in the talk looks surprisingly similar to call/cc-based
threads/co-routines.
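
The upcall protocol is easy to picture in code. Below is a minimal
sketch of the user side, in C; the event codes, the toy run queue and
all other names are invented for illustration and are not NetBSD's
actual scheduler-activations API:

    #include <stdio.h>

    enum sa_event { SA_BLOCKED, SA_UNBLOCKED, SA_PREEMPTED, SA_NEWPROC };
    typedef void (*sa_upcall_t)(enum sa_event, int thread_id);

    /* toy run queue: runnable[i] != 0 means user thread i may run */
    static int runnable[16];

    static void switch_to_next_runnable(void)
    {
        for (int i = 0; i < 16; i++)
            if (runnable[i]) { printf("resuming user thread %d\n", i); return; }
        printf("no runnable user threads; idle\n");
    }

    /* The upcall handler is the "signal" handler described above: it
       records the scheduling event and picks another user thread.  A
       real handler would restore that thread's registers and never
       return -- compare call/cc. */
    static void upcall(enum sa_event ev, int tid)
    {
        if (ev == SA_BLOCKED)   runnable[tid] = 0;  /* stuck in the kernel */
        if (ev == SA_UNBLOCKED) runnable[tid] = 1;  /* its i/o completed   */
        switch_to_next_runnable();
    }

    int main(void)
    {
        runnable[0] = runnable[1] = 1;
        upcall(SA_BLOCKED, 0);     /* kernel: thread 0 blocked ...       */
        upcall(SA_UNBLOCKED, 0);   /* ... and later: thread 0 unblocked  */
        return 0;
    }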
When the kernel executes an upcall, it runs special trampoline code in
user space. The trampoline saves the stack pointer and the context of
the previously running code, sets up the stack frame and jumps to the
upcall handler. Upcall handlers are supposed to "switch" to an
appropriate user thread; hence, they do not return. To avoid running
out of stack frames for upcall handlers, an application must determine
which stack frames are no longer "relevant" and manually "garbage
collect" them (i.e., mark them as free). The talk clearly showed
kernel hackers re-discovering call/cc.

Scheduler activations will be in the next release of NetBSD. The
maximum number of concurrent threads that ran without encountering
resource limitations was 7000-8000. After the talk, Clem Cole
commented that they had to support more threads, and had to spend an
enormous amount of time on debugging. Clem Cole said that DEC/Compaq
thought they had got threads right in their Tru64 OS -- until they
tried to run a Netcenter, which launched 10,000-100,000 threads. A lot
of bugs showed up then.

* Authorization and Charging in Public WLANs Using FreeBSD and 802.1x

Pekka Nikander, Ericsson Research NomadicLab. Freenix track.

The topic of the talk is how to do authentication in a public wireless
network. This research started from a business proposal for collecting
micro-payments for access to a wireless LAN and other resources. Alas,
the funding for the author's start-up dried up, so he decided to do
research instead.

The talk gave a good overview of possible threats in wireless LANs: a
laptop may pretend to be a router; a laptop may run DHCP and advertise
itself as a default gateway; a laptop can launch denial-of-service
attacks against cheap access points and thus force clients to use
expensive access points. Incidentally, the latter threat has already
been realized, in cellular networks. The presenter said that Russia
has powerful jamming antennas on the border with Finland. They
suppress base stations in the border regions and force cell phones to
connect to base stations in Russia -- which charge 10 times as much
for access.

The rest of the talk explained an implementation of LAN-level
authentication mechanisms (EAP and 802.1x) as Netgraph nodes. Some of
the NG nodes -- e.g., MAC-address-based filtering, tag-based filtering
at the frame level -- are rather generic and can be used in other
projects. Although EAP and 802.1x solve the problem of bilateral
authentication, these techniques do not help much with
multicast/broadcast. Resource discovery and similar protocols are
inherently multicast.

* Design and Performance of the OpenBSD Stateful Packet Filter (pf)

Daniel Hartmeier, Systor AG. Freenix track.

The pf filter chains filtering rules in a linear linked list. For each
packet, the rules are evaluated one after another. As an optimization,
the filters are grouped into "skip steps" according to the interface
number and the source address. If the first filter within a group
fails (e.g., does not match the source address), then the other
filters in the group are skipped.

The topic of the talk is the implementation of _stateful_ rules. The
state of a TCP connection is an (ip-address, port, sequence-number)
tuple. The filter can easily check if a TCP packet belongs to an
already established session. If it does, the packet is allowed
through, without any need to consult the rules.
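
The stateful shortcut is easy to sketch in code. What follows is a
rough illustration in C, not pf's actual implementation (pf keeps its
states in a red-black tree, as noted below, and tracks full
sequence-number windows); every name here is invented:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    struct state_key {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    struct state {
        struct state_key key;
        uint32_t seq_lo, seq_hi;   /* acceptable sequence-number window */
        struct state *next;        /* list as a stand-in for the tree   */
    };

    static struct state *states;   /* the session table */

    /* If the packet matches a recorded session, pass it without
       evaluating the (expensive) rule list. */
    static bool pass_by_state(const struct state_key *k, uint32_t seq)
    {
        for (struct state *s = states; s; s = s->next)
            if (memcmp(&s->key, k, sizeof *k) == 0 &&
                seq >= s->seq_lo && seq <= s->seq_hi)
                return true;       /* established session */
        return false;              /* fall through to rule evaluation */
    }

    int main(void)
    {
        struct state s1 = { {0x0a000001, 0x0a000002, 1234, 80}, 100, 200, 0 };
        states = &s1;
        printf("%s\n", pass_by_state(&s1.key, 150) ? "pass (state)"
                                                   : "check rules");
        return 0;
    }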
A UDP "session" is defined as a flow of packets within a short (e.g., 30 seconds) interval between the packets. The state of sessions is maintained in a red-black tree. The PF filter reassembles a packet from fragments and applies rules to whole packets. The filter may optionally 'randomize' TCP sequence numbers for outgoing packets. The talk compared the pf filter with IPFilter and with iptables on several benchmarks. IPtables is a stateless filter. It turns out that maintaining sessions' state is cheap but evaluating rules is expensive. With IPTables, 200 rules halve the bandwidth. Therefore, stateful filters (which need fewer rules to express a policy) perform better than stateless filters. * Enhancing NFS. Cross-Administrative Domain Access Joseph Spadavecchia and Erez Zadok, Stony Brook University. Freenix track. The talk presented two attempts to tinker with an NFS server to make NFS a bit more secure. The attempts are: range-mapping for UIDs and file cloaking. The latter prevents certain files within an exported file system from being visible to NFS clients. However, a rogue client (a user with a root access) can easily defeat the guards described in the talk. Peter Honeyman clearly (if somewhat emotionally) pointed that fact out. NFS is just an insecure protocol. * Ningaui: A Linux Cluster for Business Andrew Hume, AT&T Labs-Research; and Scott Daniels, EDS. Freenix track. Ningaui is a tiny Australian mouse-like beast. Ningaui is a Linux cluster -- but it's not a Beowulf. The latter is designed for high-performance scientific computations. In contrast, Ningaui is aimed at business applications and stresses high-availability, fault tolerance, and a very intensive i/o. A node in a Ningaui cluster can be brought down, upgraded in software and hardware, and plugged back into the cluster. At that point the node becomes a part of the cluster without any additional intervention. Although currently Ningaui runs on Linux boxes, the cluster software is largely platform independent. In a response to a question, Andrew Hume said that it's easy to plug FreeBSD nodes into the cluster (in fact, he was about to do that). Currently the cluster is based on Linux (RedHat 6.2) boxes inter-connected by a cLAN. The nodes are loosely coupled. The cluster software consists of a job scheduler, a file replicator, IP address takeover software, and the code to run a leader election protocol. Job scheduling, the central service, is implemented as an auction with nodes bidding for jobs. The winning node gains a lease to the job; the node should compute the result within the lease time, or apply for the lease extension. The winning node also loads all necessary files into its local file system. Files are loaded via a 'ccp' (cluster copy command), which effectively replicates the data. No NFS is used. The bidding/leasing implementation automatically guarantees load-balancing, fail-over and recovery. The job scheduler is run by a leader. An election (or the confirmation) of a leader happens every 5 seconds, by a single UDP broadcast by each voter. The leader is determined in 10 ms after the broadcast. The election protocol was designed at AT&T and is rather efficient. Currently the cluster runs an At&T billing auditing application. AT&T call centers upload (ftp) their billing files to the cluster, which uncompresses the files, checks the records, computes statistics and watches for anomalies. One of the files they processed was 140 GB long! 
The job of being the incoming FTP site is itself a "job" subject to
load balancing and bidding. Nodes in the cluster bid for becoming the
FTP site; the winning node gets a lease, acquires the IP address and
starts accepting files. If the node fails, its lease eventually
expires, and another node takes over. No heartbeat monitoring is
necessary.

The talk briefly enumerated problems the authors encountered when
working in such a data-intensive environment. For example, a pipeline
"gzip -d file.gz | a.out > output-file" fails 2-5% of the time on a
heavily loaded system. TCP storms happen: the TCP stack would
unexpectedly accept connections that it does not have the resources to
process. The lack of direct i/o under Linux is a nuisance: all file
i/o goes through the buffer cache, so a single read of an extremely
large file pollutes the cache and evicts all other file
blocks. ReiserFS turns out not to be very good at dealing with few but
humongous files: every couple of weeks the kernel panics. OTOH, thanks
to the logging of updates, rebooting is quick. The super-daemon inetd
has an undocumented connection throttle set to 40 processes/min; the
author had to examine the source code, reset the throttle to 5000
processes/min and recompile inetd. The benefit of having the source
code available is clear.

All the problems encountered with Ningaui so far are merely
nuisances. The authors run a similar billing-auditing application on a
Sun E10K and have encountered more, and more serious, problems. Thus
the E10K is no more reliable or available than a Linux/i386 cluster --
but vastly more expensive.

* CPCMS: A Configuration Management System Based on Cryptographic Names

Jonathan S. Shapiro, Johns Hopkins University. Freenix track. Awarded
Best Paper!

CPCMS (now called OpenCM) is intended to be a replacement for
CVS. Unlike CVS, which keeps track of files, OpenCM keeps track of
collections of objects, their meta-data, and their history. Objects
are things with attributes rather than merely files. A version number
is assigned to a change rather than to the content.

OpenCM was built as the configuration management system for the EROS
project, an open-source, _secure_ OS (www.eros-os.org). Therefore,
OpenCM was designed to be an assured configuration-management system,
which tracks changes not only to files but to sets of files. The
system should provide assurances that file content hasn't been
tampered with. Consequently, access control should extend to
configurations. CVS does not even track file renames.

OpenCM intentionally looks and feels like CVS. Like CVS, OpenCM
includes a command-line client. OpenCM, however, supports rename
operations, assigns versions to configurations rather than to files,
and supports IDE-centric workspaces and cryptographic
authentication. Most interestingly, OpenCM supports disconnected
operation: a user can commit and roll back while disconnected from the
repository.

Objects in a repository are named by a cryptographic hash of the
object's content. This gives the client assurance that the retrieved
configuration has not been tampered with. OpenCM is scalable thanks to
replication; the uniqueness of object names and the immutability of
content objects greatly facilitate the replication. Access control
relies on X.509 certificates (which have to be distributed in a
trusted way, because the certificates are generally self-signed to
avoid the complexities and cost of a CA). Privileges are assigned to
groups, which are made up of users and other groups; the latter
feature supports delegation.
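
Naming an object by a cryptographic hash of its content is simple to
illustrate. Here is a minimal sketch in C using OpenSSL's SHA-1;
whether OpenCM uses SHA-1 specifically is an assumption here, and the
function names are mine. The principle is what matters: equal name
implies equal content, so a client can verify whatever a replica hands
it.

    #include <stdio.h>
    #include <string.h>
    #include <openssl/sha.h>   /* link with -lcrypto */

    /* Compute an object's name: the hex form of the SHA-1 digest of
       its content. */
    static void object_name(const unsigned char *content, size_t len,
                            char out[2*SHA_DIGEST_LENGTH + 1])
    {
        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA1(content, len, digest);
        for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
            sprintf(out + 2*i, "%02x", digest[i]);
    }

    int main(void)
    {
        const char *blob = "int main(void){return 0;}\n";
        char name[2*SHA_DIGEST_LENGTH + 1];
        object_name((const unsigned char *)blob, strlen(blob), name);
        /* Any tampering with the blob changes its name, so a client
           that asked for this name can check what it received. */
        printf("object %s\n", name);
        return 0;
    }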
	http://www.opencm.org

* X Meets Z: Verifying Correctness in the Presence of POSIX Threads

Bart Massey, Portland State University; and Robert T. Bauer, Rational
Software Corp. Freenix track.

The paper showed how formal methods helped solve a really hard
multi-threaded problem. The problem arose during the development of
XCB, a client library for the X protocol (see the XCL/XCB talk
above). The library is supposed to be _transparently_ usable in a
multi-threaded environment. Making the library thread-safe is hard,
because the X protocol does not, by its design, prevent
deadlocks. Therefore, the client library (XCB in our case) must be
careful to avoid these potential deadlocks. Alas, such a careful
design is hard. The author tried a brute-force approach and failed:
his algorithm would prevent a deadlock in one case and allow it in
another. Bart Massey would correct the algorithm, only to find another
fatal flaw later. After four attempts, he realized that there must be
a better way.

As a professor at Portland State University, Bart Massey teaches
Z. The textbook in his class is "The Way of Z: Practical Programming
with Formal Methods" (interestingly, this book was written by a
radiologist). Bart Massey decided to follow what he teaches. He wanted
to specify POSIX threads in Z, specify the X server behavior in Z,
describe the properties of the desired client algorithm in Z, and then
formally prove that the whole system is deadlock-free. The latter part
-- the formal proof -- is still wanting. Unfortunately, the semantics
of POSIX threads is not well specified. Still, the mere fact of
describing the client behavior in Z gave a crucial insight that led to
the correct algorithm. That algorithm is implemented in the current
version of XCB. Although there is no formal proof of its correctness
(yet), extensive testing and use have revealed no problems.

Thus the lesson of this talk is that formal methods are indeed
useful. Good ideas come up just from writing down a formal
specification. Engineers use math and logic extensively in their work;
software engineers and open-source developers should do that too.

* Cyclone: A Safe Dialect of C

Trevor Jim, AT&T Labs-Research; and Greg Morrisett, Dan Grossman,
Michael Hicks, James Cheney, and Yanling Wang, Cornell. General track.

The talk made a flimsy case for Cyclone. The presenter never justified
his assertions that developers really need to know the precise details
of data representation and to access the bits and bytes of complex
data structures. Incidentally, if these assertions were true, the
developers should not be using Cyclone instead of C.

Cyclone introduces several kinds of pointers, including a fat
pointer. A fat pointer carries the range information needed for
run-time bounds checks. A fat pointer is not merely a pointer, yet its
details are opaque to developers. A dereference operation via a fat
pointer is not atomic (which creates problems for using Cyclone in
multi-threaded programs).

The idea of a never-null pointer, and of static reasoning about such
pointers, is interesting. Consider
	int @f = x;
where '@' marks a never-null pointer, similar to '*' marking a regular
pointer. An assignment or an initialization of a never-null pointer
entails an _implicit_ run-time null check. However, if the value to
assign is known to the compiler to be a never-null pointer, no
run-time check will be inserted.
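
What that implicit check amounts to can be rendered in plain C. A
conceptual sketch only: Cyclone's actual compilation scheme may
differ, and the helper function below is invented:

    #include <stdlib.h>

    static int cell = 42;
    /* stand-in for any expression of type int* (possibly null) */
    static int *x_maybe_null(void) { return rand() & 1 ? &cell : NULL; }

    int main(void)
    {
        int *x = x_maybe_null();
        /* Cyclone source:   int @f = x;
           conceptually compiles to: */
        if (x == NULL) abort();  /* Cyclone throws an exception instead */
        int *f = x;              /* f is henceforth known non-null ...  */
        return *f != 42;         /* ... so dereferences need no check   */
    }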
Note that the Cyclone compiler adds implicit run-time checks -- which
make the language safer but remove the programmer from "the bare
metal." If a programmer doesn't mind an abstraction over the bare
metal, the programmer would be more comfortable with a more expressive
language such as OCaml or Java.

Performance: from zero overhead up to a 3x slowdown in
pointer-intensive applications. After the talk, several people noted
that a 3x slowdown is unacceptable. Cyclone still supports neither
multi-threading nor completely manual memory management (Cyclone
supports arena management and GC?). One person asked a funny and witty
question: with never-null pointers (marked by '@' rather than the
regular '*'), fat pointers (marked by '?') and contemplated
forward-only pointers, won't the designers of Cyclone run out of
characters to mark all these pointers? Have they considered Unicode?
	http://www.cs.cornell.edu/projects/cyclone/

* Cooperative Tasking Without Manual Stack Management

Atul Adya, Jon Howell, Marvin Theimer, Bill Bolosky, and John Douceur,
Microsoft Research. General track.

The talk stirred up a lively debate, which continued after the talk
and all through the lunch. There are two major concurrency models:
multi-threading and event-driven computation. As Ousterhout pointed
out, threads are a bad idea for most purposes. The acrimony between
the two models is caused by conflating several concepts. The talk took
time to explain the different axes of a concurrent application.

** task management

*** serial task management

At one end is serial task management: a task executes on a CPU until
it is finished, and no other task is scheduled until the first task
completes. This strategy is the easiest to reason about; yet it is
inefficient, as the CPU idles while the task waits for i/o completion.

*** preemptive task management

Execution of tasks can interleave on a uniprocessor or overlap on
multiprocessors. This strategy seems efficient, until we consider the
communication between tasks. If tasks are completely separate and do
not share any data, there is no danger in their being interleaved or
overlapped [see the data partitioning axis below]. However, tasks do
communicate. In the threaded model, tasks share the same address space
and communicate by updating shared variables. With pre-emptive
multi-threading, a thread is generally never sure that it is working
with consistent data structures: a thread can be pre-empted at any
moment, and a newly scheduled thread may alter the data structures. To
be sure of the consistency of its data, a thread must lock the data
structures and then verify (and, if needed, restore) the
invariants. However, a thread must not hold locks for a long
time. Furthermore, to avoid deadlocks, the thread must release locks
in certain situations (for example, when it has failed to acquire all
the locks it needs, or before doing a slow i/o). Once the thread has
released the locks and then re-acquired them, it must verify the
invariants all over again.

*** co-operative task management

This is a compromise between serial and pre-emptive management. Unlike
serial task management, a task may surrender the CPU before completion
and let another task (thread) run in its address space. Unlike
pre-emptive management, a thread yields control of its own accord, at
well-defined points in its execution. Cooperative multi-threading does
not eliminate the problem of consistency of data structures.
After a thread yields control and then regains it, the thread has to
verify all its invariants. However, a thread yields at points defined
by its programmer; between these points, the computation can be
assumed serial, and no locks are needed to guarantee the consistency
of the thread's data. Therefore, the cooperative multitasking model is
significantly less error-prone than the pre-emptive one. Cooperative
multitasking often performs better, too, due to the elimination of the
locking overhead. The work on the Flash web server and last year's
USENIX presentation ["High-performance memory-based web servers:
kernel and user-space performance" by Philippe Joubert et al.] showed
that multi-threaded web servers are not competitive against a finite
state machine with efficient event notification.

** stack management

Stack management is making sure a thread can access its local data
after the thread has lost and then regained control. Automatic stack
management is what is performed, for example, by C thread libraries: a
thread library saves the stack and other context of the thread that
has lost control, and restores the context of the thread scheduled to
run next. Manual stack management is when the programmer must pack a
callback and all the data for it into a closure. The programmer thus
manually creates a continuation frame -- to be executed when an event
arrives.

The other axes are: i/o management (synchronous vs. asynchronous),
conflict management (optimistic vs. pessimistic; semaphores
vs. monitors), and data partitioning. The paper explicitly considers
only the task and stack management axes.

In a multi-threaded application, pre-emption of a thread is
transparent to the programmer. The thread library or the OS takes care
of saving and restoring the thread's stack. OTOH, an event-driven
application often must explicitly specify which function to execute
when an event arrives and which data to pass to it. These data must be
manually packed into a closure. Manual stack management rips a C
function apart and leads to code that is difficult to maintain. The
talk gave several good illustrations of the horrors of event-based
programming in a language that does not natively support closures and
continuations (see the sketch below). The paper also mentions that the
transparency of a thread's pre-emption and resumption is actually a
drawback, which makes multi-threaded programs difficult to reason
about.

The paper contends that the immensely convoluted style of typical
event-passing programs has nothing to do with cooperative task
management. The convoluted style is the consequence of the manual
_stack_ management. The presenter specifically mentioned that Scheme
doesn't suffer from that style problem: in Scheme, event-based
programming looks elegant and natural. As the paper says, "some
languages have a built-in facility for transparently constructing
closures; Scheme's call-with-current-continuation is an obvious
example. Such a facility obviates the idea of manual stack management
altogether. The paper focuses on the stack management problem in
conventional languages without elegant closures."

The rest of the paper and of the talk showed techniques to make
cooperative-multitasking programming, specifically in C, more
pleasant. The paper demonstrated adapters that allow mixing of
cooperative-thread and preemptive-thread code (written by disparate
programmers). Note also the USENIX '01 talk "A toolkit for user-level
file systems" by David Mazieres, who used C++ templates to automate
the building of continuations.
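
To make the "ripping apart" concrete, here is a minimal sketch in C
(all names invented): one logical step, "read a request, then reply",
must be split into two functions, with the locals that would have
lived on the stack packed into a closure struct by hand:

    #include <stdio.h>
    #include <stdlib.h>

    struct closure {                 /* the hand-made continuation frame */
        int fd;                      /* which connection we were serving */
        char buf[128];               /* locals that must survive the wait */
        void (*resume)(struct closure *);
    };

    static void step2_reply(struct closure *c)
    {
        printf("replying on fd %d: %s\n", c->fd, c->buf);
        free(c);                     /* the "stack frame" is freed by hand */
    }

    static void step1_request(int fd, void (*k)(struct closure *))
    {
        struct closure *c = malloc(sizeof *c);
        c->fd = fd;
        snprintf(c->buf, sizeof c->buf, "hello");
        c->resume = k;
        /* Real code would register c with the event loop and return;
           when the i/o event arrives, the loop calls c->resume(c). */
        c->resume(c);                /* simulated event delivery */
    }

    int main(void) { step1_request(7, step2_reply); return 0; }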
The solution advocated in the talk is to use co-routines based on
"fibers". I think, though, that subliminally the message of the talk
was: use Scheme.

*** Additional note about the drawbacks of threads

Another USENIX '02 paper,
	"Ninja: A Framework for Network Services", J. Robert von
	Behren, Eric A. Brewer, Nikita Borisov, Michael Chen, Matt
	Welsh, Josh MacDonald, Jeremy Lau, Steve Gribble, and David
	Culler, University of California at Berkeley. General Track
	proceedings, pp. 87-102,
includes the following passage (the top of the first column on p. 91):
    The scalability limits of threads are well-known and have been
    studied in several contexts, including Internet services [PDZ99]
    and parallel computing [RV89]. Generally, as the number of threads
    grows, OS overhead (scheduling and aggregate memory footprint)
    increases, which leads to a decrease in overall
    performance. Direct use of threads presents several other
    correctness and tuning challenges, including race conditions and
    lock contention.

    Principle: Concurrency is implicit in the programming model;
    threads are managed by the runtime system. Since we must avoid
    excess threads to achieve graceful degradation, we simply prevent
    services from creating threads directly. Instead, services only
    define what could be concurrent, via (explicitly) concurrent
    stages. Conceptually, each stage has a dedicated but bounded
    thread pool, but the allocation and scheduling of threads is
    handled by SEDA. Thus the system as a whole is event-driven, but
    stages may block internally (for example, by invoking a library
    routine or blocking I/O call), and use multiple threads for
    concurrency. The size of the stage's thread pool must be balanced
    between obtaining sufficient concurrency and limiting the total
    number of threads; SEDA uses a feedback loop to manage thread
    pools automatically. The particular policies are beyond the scope
    of this paper, as they only affect the node performance. Roughly,
    allocation is based on effective use of threads (non idle) and
    priorities, while scheduling is based on queue size and tries to
    batch tasks for better locality and amortization (as in [Lar00]).
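
The staged model in the quoted passage is easy to caricature in
code. A sketch under invented names; this is not SEDA's actual API:

    #include <stdio.h>

    #define MAX_WORKERS 4

    struct stage {
        const char *name;
        int queued;        /* events waiting in the stage's queue  */
        int workers;       /* threads currently busy, <= bound     */
        int bound;         /* thread-pool bound, tuned by feedback */
    };

    /* The runtime, not the service, decides whether to dispatch a
       worker: concurrency is implicit in the programming model. */
    static void schedule(struct stage *s)
    {
        while (s->queued > 0 && s->workers < s->bound) {
            s->workers++;  /* run a (possibly blocking) handler */
            s->queued--;
            printf("%s: dispatch; %d worker(s) busy\n", s->name, s->workers);
        }
    }

    int main(void)
    {
        struct stage parse = { "parse", 6, 0, MAX_WORKERS };
        schedule(&parse);  /* only 4 of the 6 events get a thread */
        return 0;
    }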
Mentioned references:

[PDZ99] V. S. Pai, P. Druschel, and W. Zwaenepoel. "Flash: An
efficient and portable Web server." Proc. of the 1999 Annual USENIX
Technical Conference, June 1999.

[RV89] E. Roberts and M. Vandevoorde. "Work crews: An abstraction for
controlling parallelism." DEC SRC Technical Report 42, Palo Alto,
California, 1989. Back in 1989 people had started to realize that
threads aren't good for parallel computing!

[Lar00] J. Larus. "Enhancing server performance with StagedServer."
http://www.research.microsoft.com/~larus/Talks/StagedServer.ppt,
October 2000.

* Improving Wait-Free Algorithms for Interprocess Communication in Embedded Real-Time Systems

Hai Huang, Padmanabhan Pillai, and Kang G. Shin, University of
Michigan. General track.

This talk, perhaps unwittingly, confirmed the previous talk's
observation that threads are really a bad idea. The gist of the
threading model is a (conceptually) parallel execution of computations
that communicate via shared data. This model is fault-prone,
deadlock-prone and priority-inversion-prone; the model exhibits poor
scalability due to locking. The model is also hard to reason about:
e.g., it's hard to compute the worst-case execution time of a
multi-threaded task. The latter is a particular drawback in real-time
(RT) systems.

A better model is wait-free inter-process communication: concurrently
running tasks communicate through shared state, but in a way that
requires no locking and still guarantees race-free execution. The
drawback is bigger space requirements.

The talk described a memory-efficient wait-free algorithm for the
single-writer multiple-reader problem. The algorithm is a combination
of two kinds of non-blocking protocols. One is the non-blocking write
(NBW) protocol, which is as fast as a simple write to shared memory
without any synchronization. The other class of lock-free write/read
protocols relies on the presence of certain atomic operations:
compare-and-set, or atomic increment and decrement. Both classes of
lock-free protocols trade space for safety. The interesting result of
the paper is that by combining the two classes, it's possible to
design a fast lock-free protocol that takes less memory than either of
the original algorithms separately. The insight is an appropriate
partitioning of the reader tasks into slow and fast readers. The fast
readers and the writer interact via the NBW protocol; the slow readers
interact via one of the protocols of the second class (e.g., a
double-buffering protocol). The hybrid protocol takes less space than
either NBW or the double-buffering protocol applied to all readers,
and it is faster than the double-buffering algorithm alone.

The paper and the talk showed the results of several benchmarks. The
graphs demonstrated that the best locking solution to the
single-writer multiple-reader problem has the smallest space
requirements but absolutely horrible worst-case performance.

* Special Closing Session: How Flies Fly?

Michael H. Dickinson, Williams Professor, UC Berkeley.

A fascinating talk about flies: how a small fruit fly maneuvers at
high speed, how it avoids obstacles, and how it lands. To understand
the flight of insects, the presenter contended, we need to take a
holistic approach. We need to investigate how the movement of the
wings produces lift; how a fly moves and controls its wings; how the
visual system modulates these controls; and how behavior (e.g.,
sensing a smell) modulates the reflexive control of the visual system.
Professor Dickinson described an experimental flight arena built in
their lab at Berkeley. Using the arena, they have been investigating
aerodynamic forces, the muscle groups exercised during various flight
stages, the firings of the neurons controlling the muscles, and the
effect of visual cues and other sensors. Flies move in complex
patterns made of straight lines and very sharp 90-degree turns. The
turns are reflexive and governed by the visual system. When a fly
approaches an obstacle on the left, it makes a sharp turn to the
right.

How do flies fly? Folklore says that flies defy aerodynamics. In a
sense, it's true. If we take a wing of a fly and place it in an
aerodynamic tube, we can determine the lift and drag coefficients as
functions of the angle of attack. If we take rapid snapshots of the
wing movements of a fly, we can determine the speed of flight and the
angles of attack during one wing cycle. We can use these experimental
data to compute the average lift -- which turns out to be less than
the fly's weight. So flies can't fly.

A closer inspection shows that flies take advantage of unstable flight
conditions. In all insects, the wings move backward and forward rather
than up and down. When a wing sharply begins its forward motion, it
creates a vortex over the wing. The air in the vortex moves at high
speed and therefore has low pressure; this adds to the lift. At the
end of the forward stroke, the wing _turns_. This enlarges the vortex
and again adds to the lift. The wing then moves backwards and crosses
the wake it created during the forward motion. This interaction again
adds to the lift. With all these effects considered, the lift is
enough for flight. It is interesting to note that planes avoid
unstable flight conditions as much as possible. The wings of a plane
are designed to make the air flow over and above the wing as smoothly
as possible, without building vortices. The wing of a plane never
crosses its own wake. Insects, on the other hand, delight in unstable
flight. They have mastered the unstable flight and thus defeated the
conventional, "stable" aerodynamics.

The wings of a fly are controlled by two pairs of strong power
muscles: two muscles go along the thorax (the body) and two muscles go
across it. During the flight, the power muscles work "asynchronously",
_without_ direct stimulation by the nervous system; the muscles
stimulate each other. The contraction of one pair of muscles moves the
wings forward and stretches the other pair of muscles. This stretching
causes the second pair to contract and to move the wings backwards --
and also to stretch the first pair of muscles. The muscles thus
oscillate in a resonance mode. The brain controls the flight via
steering muscles: small muscles that move a "hinge" connecting the
wings to the body and change the angle of attack and the amplitude of
the wing movement. Thus the nervous system finely modulates the wing
movement.

Flies have two pairs of wings. The second (rear) pair of wings is
almost degenerate. Although the rear wings play no aerodynamic role,
they are very important: the small wings are equipped with sensors,
which are directly connected to the nervous cells that control the
muscles modulating the movement of the main wings. This reflexive
feedback system makes the flight stable even in the presence of wind.

If the nervous system does not directly stimulate the power muscles,
how does the flight start? First the nervous system leaks Ca into the
body fluid.
The presence of calcium brings the power muscles to the brink of the
oscillation mode. To start the oscillations, a fly uses a special
starter muscle. This muscle contracts exactly once during the flight:
it moves the wings and thus pushes the power muscles into the
resonance. When a fly is about to land, the steering muscles disengage
the wings. The power muscles get oxygen not from the blood or the body
fluid but from a number of small pipes that deliver oxygen directly to
the cells.

A robotics lab at UC Berkeley is building an artificial insect, based
on what has been learned from the fruit fly.

The talk made it very difficult to swat flies any more.