This post is about finding and fixing a memory leak I discovered in the SNMP daemon, snmpd(8), in OpenBSD. This sort of analysis is foreign territory for me; I'm not a software hacker by day. However, using instructions written by Otto Moerbeek as my Rosetta stone and Google to fill in the blanks when it came to usage of the GNU debugger, gdb(1), I was able to find and fix the memory leak.

I'm documenting the steps I used for my future self and for others.

The Problem

Since I knew roughly which code path must contain the leak, I first examined it manually, but I couldn't see where memory was failing to be freed. I needed to instrument the process as it was running in order to find the leak.

Before Starting

This set of instructions from Otto Moerbeek was my guide. As per his guide, you have to rebuild libc with MALLOC_STATS enabled. This enables statistics collection that is used later on.

Edit /usr/src/lib/libc/stdlib/malloc.c and define MALLOC_STATS (it's actually already in the file, just uncomment it). Now rebuild and install libc.

<edit malloc.c>
# cd /usr/src/lib/libc
# make obj
# make depend
# make
# make install

Other things to know before starting:

  • The binary you're going to debug cannot have its symbols stripped. Many, in fact most, binaries in /usr/bin and /usr/sbin are stripped as they are installed (from "make install"). For that reason I will manually "make" snmpd inside /usr/src/usr.sbin/snmpd/ and run the resulting binary without doing "make install". If you're unsure if your binary is stripped, use the file(1) command.
% file /usr/sbin/snmpd
/usr/sbin/snmpd: ELF 32-bit LSB executable, Intel 80386, version 1, for OpenBSD, dynamically linked
  (uses shared libs), stripped

% file /usr/sbin/dhcpd
/usr/sbin/dhcpd: ELF 32-bit LSB executable, Intel 80386, version 1, for OpenBSD, dynamically linked
  (uses shared libs), not stripped
  • The binary must be compiled with debugging enabled. This is done by passing -g to gcc(1).
  • When you run gdb(1), do it from the directory where your source files are. By default, gdb will look in the current directory (among other places) for the source files it needs to interpret line numbers, etc.

Get To It Already

Start a shell and run the binary. In my case, I run snmpd with -d to keep it in the foreground. This makes it really easy to kill (with Ctrl+C) and relaunch (with up arrow, "enter"). It also causes anything written to stderr to be sent to the terminal which will be useful later on.

In another terminal I run ps(1) and look for the PID of the unprivileged process (the process that executes the code path where I suspect the leak to be).

# ps axo user,pid,command | egrep "PID|snmpd"
joel     19645 egrep PID|snmpd (zsh)
root      9056 snmpd: parent (snmpd)
_snmpd    1184 snmpd: snmp engine (snmpd)

My PID is 1184 in this case.

Now run the debugger, pointing it to the binary on the file system so it knows the makeup of the process you will eventually attach it to.

# cd /usr/src/usr.sbin/snmpd
# gdb ./snmpd

Now there are some decisions to make. You have to tell the debugger which parts of the code you want to inspect, and you do that by setting breakpoints. When program execution hits a breakpoint, the debugger stops the process (it doesn't quit the process, it just freezes it in time) and allows things like variables, memory locations, and registers to be examined as they are at that exact point in time. Since I have an idea of where in the code the leak exists, I'm setting a breakpoint on a particular function so that when it gets called, the debugger will stop.

(gdb) br mib_pftableaddrs
Breakpoint 1 at 0x1c00d0e3: file mib.c, line 2179.

See how gdb knows the filename (mib.c) and line number (2179) of the function? That's why you need a binary that's not stripped.

Note that so far gdb isn't actually doing anything to our running process. We're just getting things lined up. Now that we're ready, we can attach to our running process, PID 1184.

(gdb) attach 1184
Attaching to program: /usr/src/usr.sbin/snmpd/snmpd, process 1184
Reading symbols from /usr/lib/
Loaded symbols for /usr/lib/
Reading symbols from /usr/lib/
Loaded symbols for /usr/lib/
Reading symbols from /usr/lib/
Loaded symbols for /usr/lib/
Reading symbols from /usr/lib/
Loaded symbols for /usr/lib/
Reading symbols from /usr/lib/
Loaded symbols for /usr/lib/
Reading symbols from /usr/libexec/
Loaded symbols for /usr/libexec/
[Switching to thread 1025681]
0x0fd6cd85 in kevent () at <stdin>:2
2       <stdin>: No such file or directory.
        in <stdin>
Current language:  auto; currently asm

As soon as you "attach" to a process, gdb stops the process; it's frozen. Allow the process to continue running by issuing the "continue" command to gdb.

(gdb) continue
<no more output>

At this point gdb appears to have hung. This is expected. gdb will not return to the "(gdb)" prompt until a breakpoint is hit.

Now it's time to trigger the memory leak. In my case, I did some walks of the pfTblAddrTable and watched the SIZE/RES of PID 1184 climb. With the leak triggered, use the MALLOC_STATS capabilities to get more information.

At this point I'm back at the "(gdb)" prompt and the process is frozen because I hit the breakpoint that I set above β€” walking pfTblAddrTable causes the mib_pftableaddrs function to be called.

Use gdb to call the malloc_dump() function in libc. Note the integer argument to the function is the file descriptor where the output should be sent (this is relative to the process, not to gdb!). In my case, since snmpd is running in the foreground and sending stderr to the terminal, I give an argument of "2" (ie, stderr).

(gdb) call malloc_dump(2)

The snmpd terminal immediately shows the malloc stats. The most interesting part is at the bottom, the leak report.

Leak report
           f     sum      #    avg
         0x0    7856     62    126
   0x2020448    1024      1   1024
   0x2021fdf     512      1    512
   0x202201a    2048      1   2048
   0x75311f9   17056      1  17056
  0x1c004aff    2368      1   2368
  0x1c00593d      64      1     64
  0x1c008910   65664      1  65664
  0x1c015786   99840     10   9984

As Otto explains, the f column contains the address of the code that allocated the memory. From this report we can see that the code at 0x1c015786 is responsible for 10 un-freed allocations averaging 9984 bytes each, 99840 bytes in total. That's our culprit. Use gdb to find out what that code is:

(gdb) list *0x1c015786
0x1c015786 is in pfr_buf_grow (pf.c:141).
136             bs = buf_esize[b->pfrb_type];
137             if (!b->pfrb_msize) {
138                     if (minsize < 64)
139                             minsize = 64;
140                     b->pfrb_caddr = calloc(bs, minsize);
141                     if (b->pfrb_caddr == NULL)
142                             return (-1);
143                     b->pfrb_msize = minsize;
144             } else {
145                     if (minsize == 0)

There we go. gdb has identified the pfr_buf_grow() function in pf.c.

OK, that was the easy part. gdb can't tell you where the memory should be freed; it can only tell you where the memory is being allocated. You, as the programmer, have to understand the program flow and where it makes sense to call free(). What this gives you is a place to start working backwards from. What calls pfr_buf_grow()? Are those functions missing calls to free()?

gdb can be useful in the investigation at this point too. When the program flow hits a breakpoint and you get the (gdb) prompt back, you can "step" along through the code (or "next", if you don't want to step into any functions on that particular line of code).

For the leak in snmpd, I was able to step/next my way through the function where I had set the breakpoint and realized that program flow was actually different than how I had calculated it in my head. Once this was clear, I realized where I needed to add a free().