Last time I talked about the broad architecture of how exceptions work. Today I’m focusing on how unwind information is stored in the .eh_frame section. To explain why exception information is held in the format it is today, we need to dive a little into the history of libunwind. Much of what is here comes from Linux reference documents, found here.

History

Libunwind was originally developed by HP back in the days of the Unix wars for their particular flavor of Unix, HP-UX, to debug stack traces on the Itanium architecture. The library used DWARF debug information to unwind stack traces. At the time this was fine, since libunwind was intended to interpret stack traces and was mostly used as a debugging tool.

Eventually, HP decided to open source the library and backport it for use in C++ exception handling. Because of this, the current exception handling regime (enshrined in the Itanium ABI, named after the architecture HP-UX was originally going to run on) is based on libunwind.

Format

DWARF information for functions is held in two data structures: Common Information Entries (CIEs) and Frame Description Entries (FDEs). Each FDE describes a whole function and references a CIE, which is shared by multiple functions. Inside CIEs and FDEs are instructions that describe the semantic operations of the code within. This includes storing and popping registers, allocating on the stack, and saving and remembering state. Moreover, DWARF instructions also keep track of the location of the executed code, allowing the state of a program to be reconstructed from a stack trace given the instruction pointer.
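To make the CIE/FDE split concrete, here is a minimal Python sketch that walks the records of a raw .eh_frame blob and labels each one. The function name `walk_eh_frame` is my own invention, and the sketch assumes little-endian data and skips the 64-bit extended length form, so treat it as an illustration rather than a complete parser.

```python
import struct

def walk_eh_frame(data: bytes):
    """Yield (offset, kind, body) for each record in a raw .eh_frame blob.

    Assumptions: little-endian data, 32-bit lengths only.
    `body` is everything after the 4-byte ID field.
    """
    pos = 0
    while pos + 4 <= len(data):
        (length,) = struct.unpack_from("<I", data, pos)
        if length == 0:
            break  # a zero length terminates the section
        if length == 0xFFFFFFFF:
            raise NotImplementedError("64-bit extended length not handled here")
        (entry_id,) = struct.unpack_from("<I", data, pos + 4)
        # In .eh_frame an ID of 0 marks a CIE; anything else is an FDE's CIE_pointer.
        kind = "CIE" if entry_id == 0 else "FDE"
        yield pos, kind, data[pos + 8 : pos + 4 + length]
        pos += 4 + length  # `length` does not count its own 4 bytes

# A tiny hand-built blob: one CIE, one FDE pointing back at it, then a terminator.
blob = (
    struct.pack("<II", 8, 0) + b"\x01\x00\x01\x78"
    + struct.pack("<II", 8, 16) + b"\xaa\xbb\xcc\xdd"
    + struct.pack("<I", 0)
)
for offset, kind, body in walk_eh_frame(blob):
    print(offset, kind, body.hex())
```

The only thing distinguishing the two record types at this level is the 4-byte field after the length, which is why the walker never needs to look further into the record.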

Common Information Entry

Common Information Entries contain the following fields:

  • length: an unsigned 4 byte integer
  • CIE_id: a 4 byte field that is always 0 in .eh_frame and indicates that this is a CIE
  • version: an unsigned 1 byte integer
  • augmentation: a null terminated string that contains flags about how the CIE should be interpreted
  • code_alignment_factor: the alignment of code in advance location instructions. This is usually just the size of an instruction. Represented as a uLEB128.
  • data_alignment_factor: This is the alignment of data represented as a sLEB128. This is usually just the word size of the processor times negative one, since stacks usually grow downwards.
  • return_address_register: an unsigned byte constant that notes where the return address is. Despite its name, it’s not required that the return address is found in a register.
  • Augmentation data: This contains, depending on the augmentation string, a byte denoting the pointer encoding of the LSDA, the personality routine used, and the pointer encoding used by the FDEs. The pointer to the LSDA itself is stored in each FDE’s own augmentation data.

After that, the CIE has instructions that are common to all functions that reference it. Usually, this is just the return address being pushed onto the stack.
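The field list above can be sketched as a small Python parser. The names `CIE`, `parse_cie`, `read_uleb128` and `read_sleb128` are hypothetical helpers of mine, and the sample bytes are the CIE body that GCC commonly emits on x86-64 Linux (everything after the length and CIE_id fields). I'm assuming the augmentation string starts with "z" whenever augmentation data is present, and that versions other than 1 store the return address register as a uLEB128.

```python
from dataclasses import dataclass

def read_uleb128(buf: bytes, pos: int):
    """Decode an unsigned LEB128 at `pos`; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos

def read_sleb128(buf: bytes, pos: int):
    """Decode a signed LEB128 at `pos`; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte & 0x40:  # sign bit of the final group
                result -= 1 << shift
            return result, pos

@dataclass
class CIE:
    version: int
    augmentation: str
    code_alignment_factor: int
    data_alignment_factor: int
    return_address_register: int
    augmentation_data: bytes
    initial_instructions: bytes

def parse_cie(body: bytes) -> CIE:
    """Parse a CIE body, i.e. the bytes following length and CIE_id."""
    version = body[0]
    end = body.index(b"\x00", 1)           # augmentation is NUL-terminated
    augmentation = body[1:end].decode("ascii")
    pos = end + 1
    code_align, pos = read_uleb128(body, pos)
    data_align, pos = read_sleb128(body, pos)
    if version == 1:
        ra_reg, pos = body[pos], pos + 1   # single byte in version 1
    else:
        ra_reg, pos = read_uleb128(body, pos)
    aug_data = b""
    if augmentation.startswith("z"):       # 'z' means a length-prefixed data block follows
        aug_len, pos = read_uleb128(body, pos)
        aug_data = body[pos : pos + aug_len]
        pos += aug_len
    return CIE(version, augmentation, code_align, data_align,
               ra_reg, aug_data, body[pos:])

sample = bytes.fromhex("017a5200017810011b0c070890010000")
print(parse_cie(sample))
```

In the sample, the data alignment factor decodes to -8 and the return address register to 16, which matches the "word size times negative one" rule of thumb on a 64-bit machine.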

Frame Description Entry

Frame Description Entries are largely similar to Common Information Entries, but contain a bit more information about the function they represent. Each one contains:

  • length: an unsigned 4 byte integer
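  • CIE_pointer: an unsigned 4 byte offset that, in .eh_frame, is subtracted from the position of this field to find the start of the CIE that the FDE references
  • initial_location: the address of the function being described, in a pointer-size integer unless the CIE’s augmentation data specifies another pointer encoding
  • address_range: the length of the function being described, in a pointer-size integer unless the CIE’s augmentation data specifies another pointer encoding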
  • instructions: same as CIE, but describing the function prologue.
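As a rough illustration of the FDE layout, here is a Python sketch that decodes a single FDE. The name `parse_fde` and its parameters are hypothetical. It assumes little-endian data, a 32-bit length, and a CIE with augmentation "zR" whose pointer encoding is 0x1b (pc-relative, signed 4 bytes), which is what I'd expect to see on x86-64 Linux; other encodings would need different handling. With a "z" augmentation the FDE also carries a length-prefixed augmentation data block before its instructions, which the sketch skips.

```python
import struct

def read_uleb128(buf: bytes, pos: int):
    """Decode an unsigned LEB128 at `pos`; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos

def parse_fde(section: bytes, offset: int, section_addr: int):
    """Decode the FDE starting at `offset` in a raw .eh_frame section.

    `section_addr` is the address the section is loaded at, needed because
    the assumed pointer encoding (0x1b) is relative to the field's own address.
    Returns (cie_offset, initial_location, address_range, instructions).
    """
    length, cie_pointer = struct.unpack_from("<II", section, offset)
    # CIE_pointer counts backwards from the CIE_pointer field itself.
    cie_offset = (offset + 4) - cie_pointer
    pos = offset + 8
    (rel,) = struct.unpack_from("<i", section, pos)
    initial_location = section_addr + pos + rel
    # The range is a plain size: same width, but never pc-relative.
    (address_range,) = struct.unpack_from("<I", section, pos + 4)
    pos += 8
    aug_len, pos = read_uleb128(section, pos)
    pos += aug_len                         # skip FDE augmentation data (e.g. LSDA pointer)
    instructions = section[pos : offset + 4 + length]
    return cie_offset, initial_location, address_range, instructions

# Hand-built section: a 24-byte CIE followed by one FDE for a 0x20-byte
# function at 0x1000, with the section itself assumed to load at 0x2000.
cie = bytes.fromhex("1400000000000000017a5200017810011b0c070890010000")
fde = struct.pack("<IIiI", 16, 28, 0x1000 - (0x2000 + 32), 0x20) + b"\x00" + b"\x41\x0e\x10"
print(parse_fde(cie + fde, len(cie), 0x2000))
```

Note how the CIE_pointer of 28 is measured from the pointer field at offset 28, landing on the CIE at offset 0.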

Little Endian Base 128

LEB128, which I have referred to as either uLEB or sLEB for its unsigned and signed variants respectively, is used extensively within the DWARF format because it represents small quantities efficiently. It’s a variable length integer encoding scheme where the high bit of each byte indicates whether more bytes follow; a byte with that bit clear ends the constant. The remaining 7-bit groups are concatenated, least significant group first, to make the integer.

I do not like LEB128 because its usage means that the instructions are almost never aligned. It’s also slow to decode, and DWARF ends up very large anyway because of redundant information and padding.
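Here is a short Python sketch of the encoding in both directions. The function names are my own; the logic follows the usual description of LEB128, where each byte carries 7 payload bits and the high bit is set on every byte except the last.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)         # last group: high bit clear
            return bytes(out)

def encode_sleb128(value: int) -> bytes:
    """Encode a signed integer as signed LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7                  # arithmetic shift keeps the sign
        # Stop once the rest is pure sign extension of bit 6 of this group.
        done = (value == 0 and not byte & 0x40) or (value == -1 and byte & 0x40)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)

def decode_uleb128(data: bytes, pos: int = 0):
    """Return (value, next_pos) for the unsigned LEB128 at `pos`."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos

def decode_sleb128(data: bytes, pos: int = 0):
    """Return (value, next_pos) for the signed LEB128 at `pos`."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte & 0x40:          # sign-extend from the final group
                result -= 1 << shift
            return result, pos

print(encode_uleb128(624485).hex())   # e58e26
print(encode_sleb128(-8).hex())       # 78
print(decode_sleb128(bytes.fromhex("c0bb78")))  # (-123456, 3)
```

The -8 case is the one that shows up constantly in practice: it is the data alignment factor on a typical 64-bit target, and it fits in a single byte.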

Redundant Information

As you may have noticed, DWARF contains many bits of arguably unnecessary information. Code and data alignment are constant on an architecture, and the return address location is known based on calling convention. In a debug format, where the computer executing the code may not be the one debugging the stack trace, this information is absolutely necessary. In an exception handling context, however, the unwinder always runs on the same machine as the code it is unwinding. The alignment and return address location are always known, and thus never need to be included. That, along with the very verbose form of the instructions, means that DWARF usually ends up very large and very difficult to parse.

Alternatives

Because of the shortcomings of DWARF, almost every vendor I’ve seen has tried to improve on it in some way. ARM defines its own compact unwind instructions in their EABI, which eliminates much of the redundant information and packs the instructions into much smaller sizes. Windows, Apple, and the Linux kernel also have their own unique forms of exception handling information, although I’ve had much less experience with those. And there’s me: I’ve created and implemented my own format for AVR. (Shameless plug)