Always Crashing

Presenting the very antithesis of functional computing, the crash is an area ripe for exploration, exposing as it does the rich seam between writable and runnable, and offering radical insights into the contemporary interface between man and machine. Martin Howse dons a hard hat to mine deep into this rich geology


Crashing is painful, at the very least implying lengthy reboots, loss of data, hair-wrenching debugging sessions, or random hardware faults which may take weeks to track down. At a more extreme end of the spectrum, the crash of hard real-time software responsible for, say, the movements of a space probe, can literally be translated into a less than soft crash.

What can go wrong will go wrong, as Moore's law gives way to Murphy's law and the necessary complexity which eats up the marketplace's fresh cycles invokes a brittle crash at the interface of person and processor. Crash is all about visibility, making such laws and forces apparent, and if we can see beyond the frustration and the practical issues associated with crashing, such as forensics, data recovery and troubleshooting, it does reveal itself as an area ripe for philosophical and social enquiry, both in relation to free software and to notions of entropy and quantum computing. Crashing, largely a realm in which programmed systems fail to expect the unexpected, illuminates the brittle nature of many so-called high-level programming languages. Rigorous type and error checking can only go so far; a catalogue of mishaps and outages related to the buffer overflow, under which data leaks out beyond the programmer's narrowly defined confines, would fill a large volume. The overlap of the physical sciences and information theory, a view of the world primarily as information as pioneered by Konrad Zuse's work in the late 1960s, expands the realm of crashing beyond considerations of code, and further elaborates a notion of interface which is key to crashing.

There can be little doubt that crashing offers a rich menu of food for thought, and it's worth clarifying crashing in relation to the simplest computational matter. Crashing exists always in relation to a necessarily programmable machine. A machine does not crash unless it is a Turing machine (TM). Without rehearsing a full theory, which can readily be found online, a TM, which represents the limits of computability through defining that which can be computed, can readily be simulated. A state machine takes a ride down a tape, looking up symbols it encounters along the way, which, in concert with the current state, determine the behaviour and state of the machine through a transition function. However, if an encountered state and symbol pair does not exist within our table of transitions, or if multiple transitions match, then we have a problem. The Turing machine crashes or halts, and from here we can position crashing within the gripping context of the limits of computability as defined by Alan Turing's famous halting problem: no program can possibly compute, for every program and input, whether that program will crash or halt.

Free computing, free love

The field of crashing, in relation to contemporary computing, is necessarily more complex, and crashing can well be viewed as an essential and dangerous partner to complexity and abstraction. But crash as defined within a Turing model still holds good: crash as a halting of operating system or application functionality; crashing as doing or accessing what is forbidden for a process, what is outside the current system of notation or the rehearsed bounds of an application. For example, the all too familiar segmentation fault occurs when a program tries to access memory locations which haven't been allocated for the program's own use.

Within an impossible realm of perfect abstractions or of supreme black boxing, with totally private data, there can be no crashing, in that crashing can be viewed as leakage. Witness the terms of a buffer overflow. The battle against crashing waged daily by overworked programmers could well be understood in terms of a war against code and data promiscuity. Crashing erases the private, and ironically, closed source systems which provide no public access to code remain bug-ridden and crash-ready without easy remedy. Under such OSes, crashing can only provoke invocations and cursing under the blackest of arts, or a serious wiping clean of the slate.

Crash is also all about uncertainty; the so-called blue screen of death, the ghastly screen colour of a crashed Windows system, truly comes out of the blue. Just as repetition occupies an important position within the field of information science and more specifically error correction, so repeated crashes can well serve to provide much needed information as to hardware and software problems, and thence remedies can be applied. A repeated crash highlights the system's own flaws. And crashing always stands in vigorous relation to both system architecture and the heavily intertwined development or licensing model. Modularity, abstraction, and complexity within the OS and kernel are quite obvious factors here, and the example of the GNU/Hurd operating system (see issue 49) highlights such issues with practical zeal. In contrast to a traditional monolithic kernel, which provides for a larger target and for a heavier fall, the GNU/Hurd's microkernel approach, which reduces the amount of policy within the kernel and leaves most services running in userspace where errors won't impact on the total system, should make for a more robust system. Indeed, the notion of crashing can further expose interesting and somewhat political questions regarding kernel and user space, and how and where kernel, OS and apps can be defined and divided. After all, the separation of protected kernelspace from open userspace is deemed necessary in order to reduce the risk of crashing.

Crashing in the open

Crashing thus implies and exposes hierarchies; a super superuser reigning in kernel space reins in users and their processes within a realm of restricted power. On a practical level witness the multiple ways under which users running as root can crash and destroy their own systems by accident. It would be too dangerous to mention such familiar examples here. Running as root pushes far beyond the less than apt but commonly applied metaphor of riding a motorcycle without a helmet. The practical example of the GNU/Hurd project, though far from finished, does away with such distinctions at the same time as fighting the curse of the crash. Perhaps there are other ways of designing an OS without entering a pact with the devil where power is traded for solidity. And it's also thus easy to see, within a context exposed by crashing, why the GNU/Hurd, with its design which stresses userland versatility, really is the only choice for a GNU generation. Crashing shows us how politics enters computation, and it's not just about functionality or getting the job done as has always been the battle cry of crashed out closed source OS users rejecting the advances of free software.

Crashing and openness exist in strange relation, with one warring partner provoking and yet strengthening the other through the necessity for exposure. At first glance free software with its implied and visible modularity provides for a smaller crash fallout. Though the GUI may crash and freeze, the user knows that the OS can be accessed at a lower level of abstraction. Under proprietary systems, such exposure and access is not possible. The system is solely as it appears, or as can be inferred. An actively and openly developed system also allows for plentiful debugging hooks into the system which can be used to recover and analyse important data pertaining to the crash. An open source crash toolkit provides for valuable insights into how both OS and the contemporary machine function. Crashing as subject matter for further study offers a compelling route into examining an OS and architecture; spotlighting a good range of essential, often ill-explored regions, perhaps centring mostly on resource management, such as the functionality of the virtual memory manager, but also unravelling the tight knit of processor and kernel which dictates a good degree of policy and architecture. Though any OS may choose to abstract a functional model away from that dictated by the likes of Intel or AMD, the double demands of security and speed frequently mean that kernel coders are left at the mercy of the chip.

A school for crashing

It may come as no surprise that the most accomplished students within such a college of crashing reside very much within the frat house of cracking. Those who code exploits need a thorough understanding of the tiny gaps between machine and kernel which can readily be crowbarred open and put to new use. Such territory overlaps well with the terrain of crashing, and indeed crackers may well content themselves with wiping out a machine of no use for further remote exploitations. With little to guide their studies beyond source code and few written works which can enlighten this realm, would-be scholars of the crash will often find themselves referring to on- and offline journals such as Phrack, sieving rare grains of knowledge from both code and commentary. Crashing exists in unique relation to such subcultures which can also be viewed in terms of openness; the exposure of all facts and knowledge, however dangerous, regarding a system. Information wants to be free even if such freedom implies a knowledge which can crash systems, furnishing a direct parallel with the very freedom of data which enacts the crash.

Crash exposes a vast range of social, political and computational issues at the same time as itself acting or signifying exposure. Crash can readily be seen as the exposure of the programmable and machinic in that what perhaps was not necessarily viewed as a machine or as coded is now revealed as such. Crash returns the user to a lower state of abstraction; a crash within a GUI can be resolved within the shell and we can often find some way to hook into a frozen machine either through premeditated use of the serial port, or through supremely low level bit snooping. Crash implies both exposure and a cynical delight in that naive revelation: witness the huge number of hacker sites and magazines devoting pages to images of the blue screen of death or other such crash artifacts in public places such as railway stations, at airports or on ATMs in high streets. The delight in exposure is all the more acute when running an interface which attempts to hide its true nature. A GNU/Linux command line crash would find few fans, yet a crash within a bland graphical display at a high level of abstraction from the machine, such as a train timetable display, is suitably attractive. At the same time these crashes expose tangible information and hackers can thus find out exactly what OS some municipal or public device is running.

Yet such exposure of the programmed and the machinic, has just replaced one taken for granted level or mode of operation with another within a hierarchy. An error message takes on the heavenly status of a communication from the gods and is thus assumed to be causally correct in its weighty, static diagnosis. Its equally programmed nature is often ignored. Such messages surely come from somewhere outside the OS, from some overseeing grand kernel. Indeed a relevant question is how a crashed OS can report and comment on its own fate. Open source software equally dispels such myths. Messages from the gods are surely resistant to grepping. Crashing becomes a far more enlightening activity under openness.

Crash research is all about digging deep into the obvious or the explicit, asking questions of that which is often taken for granted. As we've seen crashing can be used as a tool to ask questions of operating systems, to ask where and why lines have been drawn and where and how security can be assessed. And given a thoroughly modern field of equivalence of hardware and software, through reconfigurable computing, where do the kernel, and thus crashing, stand within this open trinity of machine, kernel and user application? When hardware can abstract and organise itself, the maternal position of the OS in containing, securing, organising and scheduling processes or demands in relation to available resources changes. An OS is all about consistency and standards in relation to the interface for a user to the complex, open world of computation. Crash highlights the necessary and inevitable loss and leakage within such translation or interface, exposing the trade-off with flexibility which contemporary OSes operating under poor models of network security enact.

Don't panic

Digging deep into cause and effect crash-wise is a tricky business and as soon as hardware faults and interrupts enter the picture, particularly in instances of spontaneous reboots, crashing really does reveal itself as a labyrinthine terrain. Stress testing (as covered in issue 37) is probably the best way to highlight such faults, through a rather tortuous process of elimination. Yet before such tests can even be conducted, the kernel must be broached as the next level down within the hierarchy of abstraction. Raised above the processor and acting, short of low level probes, as our only interface, the kernel can provide much needed information on any crash, standing as it does in a very special symptomatic or indeed symbiotic relation to this domain. For it could be argued that, aside from plentiful and well documented processor and hardware bugs such as the famous FDIV and F00F bugs under early Pentiums, the kernel is that which crashes and even causes a crash; terminating promiscuous processes or bailing out before a flailing kernel code base does more damage. At this point it's apparent that causality exists only from a practical perspective. Kernel panic, which many new users will have encountered in the context of booting a poorly configured newly compiled kernel, is a suitably apt phrase. Panic is well associated with confusion, and occurs when the kernel is cornered with no idea of a decent next move.

Two main varieties of kernel panic enable us to further elaborate a system of crash classification (see A taxonomy of the crash). A total lock up signals a hard panic, which, if we're lucky enough to be in console mode, is often accompanied by the phrase "Aieee!" and a full record of the kernel's stack which can well be used to debug the crash. A soft panic, which may not prove to be fatal or may well be followed by a hard panic, logs a rather charming Oops phrase as well as a good deal of information: the contents of the CPU's registers, the code the CPU was executing, and our friend the stack backtrace or call trace, a list of the functions the process was in at the time of the Oops. All of this information provides essential material for forensics, and excellent guides such as the online OOPS! An Introduction to Linux Kernel Debugging walk the user through relevant use of a range of crash tools such as ksymoops and objdump. Such Sherlock-like activity is highly addictive, and forensics provides a good introduction to navigating and understanding the kernel source code. Grep is our friend here when it comes to tracking down code snippets which can teach us much about the world of crashing.

It's not my fault

Panic.c, within the kernel source tree, is the last port of call prior to destination crash. Grepping for the contained panic function throughout the source reveals a vast number of functionalities ready and willing to make this final journey. Init functions are a serious contender, crashing out on a sinking boot ship in the case of a misconfigured kernel. Exit.c, which takes care of process termination, is another hot favourite, with panic called in cases such as attempting to kill the kernel's initialisation process. References to Oops can be found throughout the kernel in connection with both the panic function and the well titled die function, found in traps.c; code which provides a decent run down of processor exception codes. There are 19 such codes under the x86 which signal exceptions such as an illegal instruction or a divide by zero. Panic may well ensue, depending on the exception and whether we are in kernel or userspace or servicing an interrupt. Another major die client is the virtual memory (VM) manager, whose functionality is also called from our traps.c when the processor signals a page fault to the kernel. We're on the track of the common or garden segmentation fault or segfault here, but, given Linux's VM model, a page fault may be perfectly innocent. The VM, which stands in good relation to crashing, responsible as it is for process segregation, abstracts the physical memory for a number of reasons. VM management is well covered elsewhere, particularly by Jon Masters, in his excellent regular series on kernel development. In the case of a page fault, the hardware memory management unit will signal to the kernel when a process accesses a memory address that is not currently mapped to a physical location. Fault.c within the arch/i386/mm kernel source directory resolves the issue, which may simply be due to having swapped the page out to disk.
The handling here is simple, but other more extreme cases, such as trying to access an invalid address or trying to write to executable code, have far more serious results. The kernel hands out a SIGSEGV or segfault signal to the offending process.

Of course, we're talking userspace here. The kernel maps directly to a physical address as it would be too tough to maintain page tables for itself residing in virtual memory. However, we can use the vmalloc function to allocate space in virtual memory. A page fault under vmalloc, used by a good many kernel modules, can be fixed up by updating page tables. However, if we encounter a bad reference reading or writing data from userspace, then it's straight into Oops. Or, in the vibrant words of the relevant kernel code commentary, "We'll have to terminate things with extreme prejudice." Registers and the virtual address are printed and the die function is called, crashing out at the exploit-ridden interface of kernel and userspace. The VM is all about separation with an opposing crash operating as the inevitable revenge of promiscuous leakage.

Crashing resides firmly within the context of entropy, a more scientific expression of Murphy's law which we kicked off with. The notion of entropy has well survived the journey from the beginning of the industrial age, through basic theories of information to contemporary and quantum computing. Indeed the notion of entropy, with particular reference to Maxwell's paradoxical demon of 1871, could well be argued to occupy a key position within ideas and thought experiments leading to the birth of quantum computing. Claude Shannon's work within the parallel fields of entropy, redefining this concept within the context of information, and error correction acts as a powerful lens for the subject of crashing at the overlap of the physical and the informational. The contemporary view of physics as a matter of information further illuminates the field, broadening the terms of interface and openness which crashing addresses. Crash well intertwines, as both ancestor and curse, with contemporary theories, such as those of quantum computing expert Seth Lloyd, which situate the universe as a vast information processor computing its own existence. In a move which neatly ties in the Los Alamos lab, joint hothouse nursery for both computing and the atomic bomb, Lloyd argues that the ultimate laptop could well be considered as a one kilogram piece of matter, or information, converted completely to energy, radiation which can also be considered as information. Such a device exists today in the form of a 20-megaton hydrogen bomb. Crashing could well be considered in such a light; an illumination brighter than a thousand suns.

A taxonomy of the crash

Much crash research has confined its attentions to the purely practical, and in so doing the wider picture has been ignored, and repetition has resulted due to a lack of consensus on a shared system of classification or taxonomy. It may thus prove useful to iterate over the types of crash and perhaps attempt some sort of necessarily incomplete classification, distinguishing in the process possible causes and effects. Crashing is a reasonably generic operation, of course with the proviso that more or less open OSes will extend the visibility implied by crash, thus shattering the dark glass of GUI and shell and exposing the raw innards of the beast. Thus such a taxonomy will concentrate on specifying crashes under a GNU/Linux system, within the implied hierarchy of application, GUI, shell, kernel and x86 architecture with a grand user and kernel space divide severing this schematic. Crashes could be classified according to either visibility or access (such as being able to interpret a crash dump) on the one hand and probable cause on the other, with the symptomatic lying somewhere between. A classification could begin with an application crash, or serious loss of functionality, as signalled by a freeze, or error message such as a segmentation fault. We can of course always drop back to the shell and kill the offending process, or make use of xkill. Such a loss can be extended to the GUI, with a freeze or lock up here proving perhaps more fatal if access to keyboard and mouse is denied. In such instances it may be possible to hook into the system at a lower level, perhaps through SSH or Telnet, or via a serial console. Loss of access to basic resources either through poor management, memory leaks or exploits such as a fork bomb could equally well cause system lockups which may extend as far as barring all comers. With all entry points denied, a classic crash or freeze, followed by necessary reboot, is encountered.
Such a classification corresponds well with the findings of the Ballista project, initiated by Carnegie Mellon University, which also outlined a taxonomy whose imaginative acronym CRASH stands for Crash, Restart, Abort, Silent and Hindering, with the latter two terms referring to application error codes. In this instance, Restart would mean a spontaneous reboot; probably the most hardware-bound component within our classification.

key links

OOPS! An Introduction to Linux Kernel Debugging: http://urbanmyth.org/linux/oops

Collection of Software Bugs: http://www5.in.tum.de/~huckle/bugse.html

Linux Crash HOWTO: http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html

Kernel debugging: http://www.kernelhacking.org/docs/kernelhacking-HOWTO/indexs09.html

Oops: http://lxr.linux.no/source/Documentation/oops-tracing.txt

Debugging: http://en.wikibooks.org/wiki/Linux_kernel#Debugging

Understanding The Linux Virtual Memory Manager: http://www.csn.ul.ie/~mel/projects/vm/guide/html/understand

Ballista: http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/edrc-ballista/

Crash symposium: http://crash.1010.co.uk

Seth Lloyd: http://www-me.mit.edu/people/personal/slloyd.htm

History of Los Alamos: http://www.lanl.gov/history/index.shtml