Grokking Oopses

Note: This entry has been restored from old archives.

Introduction

Man it was hard to find information on debugging 2.6 kernel issues, maybe my Google skills need honing. Recently I was parachuted into a customer site on a support-or-die mission. The problem: kernel-space. My problem: not much kernel-space experience. A week of little sleep, much reading, and some learning. In the end the problem was rooted out and the crack team of kernel monkeys back at base blew it out of the water. Hooray!

This is the first in a trilogy of entries I intend to write. This one discusses a specific example of an “Oops” in detail, giving a line-by-line description of everything the oops tells you. Well, almost everything; some details go too deep to cover in something this introductory, and drilling all the way down could take a whole textbook.

The second entry goes into relating the oops to the original code; I might call it “tracing assembly”. The third entry will delve into a session of debugging with ‘kdb’.

Now the first two entries will certainly happen as almost all the material was written during my work trip. (Joyful hours of trains, flights, and just plain waiting at stations and airports.) I won’t make any promises about the ‘kdb’ entry though, since I’ve written nothing for it and would need to spend quite a bit of time generating examples and traces.

Important items of note:

  • An IA32/i386 architecture/kernel is assumed; the oops code is platform dependent.
  • The oops examined here comes from a 2.6.16 kernel.
  • Sometimes file paths are given; these are relative to the root of a Linux 2.6.16 source tree.
  • This entry deals with a specific oops for a specific error case and doesn’t discuss much beyond the specific example.
  • There’s no guarantee, as far as I know, that the oops format won’t change at some time.

Finally, a disclaimer: I am not a kernel developer. In fact I learnt a couple of new things just writing these two entries. No doubt there are important facts I’ve missed or not rated highly enough. It’s even possible that I’ve got something plain wrong (if so please correct me via the comments or email).

Oops!

Let us start where these things usually start, a kernel oops:

kernel: Unable to handle kernel paging request at virtual address 68230a7d
kernel: printing eip:
kernel: c0143742
kernel: *pde = 00000000
kernel: Oops: 0000 [#1]
kernel: SMP 
kernel: Modules linked in: monkey_pkt_eater nfnetlink_queue xt_condition
    iptable_ips edd ide_cd ip_nat_ftp ip_conntrack_ftp ipt_REDIRECT xt_tcpudp
    ipt_addrtype ebtable_nat ebtables ip_nat_mms ip_nat_pptp ip_nat_irc
    iptable_nat ip_conntrack_mms ip_conntrack_pptp ip_conntrack_irc af_packet
    ipt_logmark ipt_CONFIRMED ipt_confirmed ipt_owner ipt_REJECT sr_mod cdrom
    usb_storage evdev microcode parport_pc ppdev parport xt_state xt_NOTRACK
    iptable_raw iptable_filter ip_conntrack_netlink ip_nat ipt_LOG ip_conntrack
    ip_tables x_tables nfnetlink_log nfnetlink monkey e1000 capability
    commoncap loop sg ahci libata sd_mod scsi_mod
kernel: CPU:    0
kernel: EIP:    0060:[<c0143742>]    Tainted: PF     VLI
kernel: EFLAGS: 00210202   (2.6.16.43-54.5-smp #1) 
kernel: EIP is at put_page+0x2/0x40
kernel: eax: 68230a7d   ebx: 00000001   ecx: f1057680   edx: 68230a7d
kernel: esi: f6fe7600   edi: fe089b98   ebp: f64d3c14   esp: f64d3bbc
kernel: ds: 007b   es: 007b   ss: 0068
kernel: Process snort_inline (pid: 6298, threadinfo=f64d2000 task=dff3fa70)
kernel: Stack: <0>c02a8b4a f6fe7600 f1057020 c02a88d8 000004cd f9de0fdd c7af5000 f7802b60 
kernel: 0000c7af fe73e860 f64d3c48 f64d3c48 e7e19834 f64d3c34 f5b54cc0 00000000 
kernel: 00000001 c02c7adf 00200282 d94bd4cc f64d3c48 f4429a40 f64d3c40 f9dd912a 
kernel: Call Trace:
kernel: [<c02a8b4a>] skb_release_data+0x8a/0xa0
kernel: [<c02a88d8>] kfree_skbmem+0x8/0x80
kernel: [<f9de0fdd>] mkyDestroyPacket+0x9d/0x260 [monkey_pkt_eater]
kernel: [<c02c7adf>] ip_local_deliver_finish+0xef/0x270
kernel: [<f9dd912a>] mkyIpqPostHandler+0xda/0x140 [monkey_pkt_eater]
kernel: [<c02c79f0>] ip_local_deliver_finish+0x0/0x270
kernel: [<f8e9e182>] find_dequeue_entry+0x82/0x90 [nfnetlink_queue]
kernel: [<f8e9e1ab>] issue_verdict+0x1b/0x50 [nfnetlink_queue]
kernel: [<f8e9f5af>] nfqnl_recv_verdict+0x1ff/0x330 [nfnetlink_queue]
kernel: [<c0308730>] schedule+0x350/0xdd0
kernel: [<f887940e>] nfnetlink_rcv_msg+0x16e/0x200 [nfnetlink]
kernel: [<f8879542>] nfnetlink_rcv+0xa2/0x169 [nfnetlink]
kernel: [<c02beecd>] netlink_unicast+0x16d/0x250
kernel: [<c02bf462>] netlink_data_ready+0x12/0x50
kernel: [<c02be2d1>] netlink_sendskb+0x21/0x40
kernel: [<c02bfacd>] netlink_sendmsg+0x27d/0x320
kernel: [<c0116c88>] __wake_up+0x38/0x50
kernel: [<c02a2b2e>] sock_sendmsg+0xae/0xe0
kernel: [<c012f4c0>] autoremove_wake_function+0x0/0x50
kernel: [<c012f4c0>] autoremove_wake_function+0x0/0x50
kernel: [<f90caa7f>] monkey_chan_queue_append+0x15f/0x170 [monkey]
kernel: [<c02a9e4a>] verify_iovec+0x2a/0x90
kernel: [<c02a306d>] sys_sendmsg+0x15d/0x270
kernel: [<c02a3f86>] sys_recvfrom+0x116/0x140
kernel: [<f90557c0>] e1000_clean_rx_irq+0x0/0x3d0 [e1000]
kernel: [<f9053b0e>] e1000_clean+0x53e/0x780 [e1000]
kernel: [<f90c7780>] mkyGetState+0x20/0x30 [monkey]
kernel: [<f9ddbc90>] packet_writer+0x2e0/0xd50 [monkey_pkt_eater]
kernel: [<c02a453f>] sys_socketcall+0x24f/0x280
kernel: [<c013a5b0>] __do_IRQ+0xc0/0x110
kernel: [<c0102d3f>] sysenter_past_esp+0x54/0x75
kernel: Code: c0 8b 01 01 c2 83 fa 10 89 11 7f 05 83 fa f0 7d 0d f0 01 15 fc 8f
    40 c0 c7 01 00 00 00 00 c3 8d 76 00 8d bc 27 00 00 00 00 89 c2 <8b> 00 f6
    c4 40 75 1e 8b 42 04 40 74 1f f0 83 42 04 ff 0f 98 c0 

(This oops has had some timestamp details removed and has been wrapped for neatness. Also, some “proprietary symbols” have been renamed, let us just say: beware the monkey.)

Tackling the oops

Let me first say: thanks to the Internet containing cruft from all ages, you’ll find many references to ksymoops when researching kernel oopses. Stop now! Maybe add a “-ksymoops” to your search query, though this may remove several good links. It really is a battle against accumulated information. The ksymoops tool was useful for dealing with Linux 2.4 oopses, but in 2.6 its work is done at oops-time and ksymoops is useless. If you’re dealing with 2.6, forget about ksymoops.

On with the oops. It tells us quite a lot about what was going on when it occurred: we have the stack, the call path, the content of registers, and more. On inspection much of this is pretty straightforward to someone familiar with Linux, C programming, and Intel CPUs and assembly (be aware of GAS). It’s probably very hard going for someone who hasn’t done any work at that sort of level, but difficulty is just a good opportunity to learn! Right? However, you, like me, don’t need to be a hard-core kernel hacker (I’m not even soft-core, I’d barely earn a PG rating) to get an idea of what’s going on.

Line by line

kernel: Unable to handle kernel paging request at virtual address 68230a7d

The core problem, and an oft-seen one; think of this as the “segfault” of the kernel. In actual fact both userspace and kernelspace memory access violations come from the same place (the MMU) and are handled by the same code. Generation of the oops text starting with this line has its origins in do_page_fault in the arch/i386/mm/fault.c code file. Think of that path as “i386 memory manager faults”. (“Faults” covers a bunch of conditions, not just access violations.)

In short: something in the kernel has tried to use an address that doesn’t work, and that address was: 0x68230a7d.
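If you want to see one of these first-hand, the quickest way is to make the kernel dereference a pointer you know is bogus. Below is a minimal (and deliberately broken) module sketch that should produce a very similar “Unable to handle kernel paging request” on a 2.6 kernel; the module name is made up and the bad address is just borrowed from this oops for illustration. Needless to say, don’t try this on a machine you care about.

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>

static int __init oopser_init(void)
{
        /* Not a mapped kernel address (value borrowed from the oops above),
         * so reading through it will fault. */
        int *bad = (int *)0x68230a7d;

        printk(KERN_INFO "oopser: about to read %p\n", bad);
        printk(KERN_INFO "oopser: got %d\n", *bad);     /* the oops happens here */
        return 0;
}

static void __exit oopser_exit(void)
{
}

module_init(oopser_init);
module_exit(oopser_exit);
MODULE_LICENSE("GPL");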

kernel: printing eip:
kernel: c0143742
kernel: *pde = 00000000

Eip! It sounds like a small rodent’s cry of distress. It’s actually the “Extended Instruction Pointer”, which is the CPU register containing the address in memory of the instruction being executed. Thinking back to computer architecture classes you might be more familiar with this as the “Program Counter”. This tells us that at the time of the invalid paging request the instruction at address 0xc0143742 was being executed. A most useful bit of information that we’ll return to several times.

The “PDE” is the “Page Directory Entry”, and I’m not going to pursue that further. Try this article by Robert Collins, published in DDJ back in ’96, for more information: Understanding 4M Page Size Extensions on the Pentium Processor.

There’s a seriously large amount of information behind this that is worth reading up on. For starters, try Googling “Linux Kernel” together with a variety of things like “virtual memory”, “paging”, “mmu”, … From my bookshelf I’d recommend Chapter 2 of Understanding the Linux Kernel. That book is fairly bent towards Intel; I don’t have a book that covers deeper Intel specifics and can only recommend you make use of Google.

kernel: Oops: 0000 [#1]
kernel: SMP 

Now we’ve jumped to output generated by the die function in arch/i386/kernel/traps.c.

The first number is actually the error_code value passed to do_page_fault printed in hex. The interpretation for this is:

 *  bit 0 == 0 means no page found, 1 means protection fault
 *  bit 1 == 0 means read, 1 means write
 *  bit 2 == 0 means kernel, 1 means user-mode

So 0000 means “no page found” on “read” in “kernel” mode. This checks out with what we already know, consistency is good.
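If it helps to see that decoding in code form, here is a trivial userspace sketch of it (nothing kernel-specific, it is just the three bits listed above):

#include <stdio.h>

/* Decode the page fault error_code using the three bits listed above. */
static void decode_error_code(unsigned long error_code)
{
        printf("%s, %s, %s mode\n",
               (error_code & 1) ? "protection fault" : "no page found",
               (error_code & 2) ? "write" : "read",
               (error_code & 4) ? "user" : "kernel");
}

int main(void)
{
        decode_error_code(0x0000);      /* -> "no page found, read, kernel mode" */
        return 0;
}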

The number in the square brackets is the oops count. It is the number of times this die function has been called since boot, counting from 1. Usually you only want to worry about the first oops that you find. Everything after that point could be a side-effect of earlier kernel badness. If you’re working past the first oops then anything I have written here will be old news to you.

Yeah, and it is an SMP kernel. Other strings that may appear here are PREEMPT and DEBUG_PAGEALLOC. After this the somewhat misnamed show_registers function found in arch/i386/kernel/traps.c is called, and the rest of the oops text is generated in or below this call.

kernel: Modules linked in: monkey_pkt_eater nfnetlink_queue xt_condition
    ...

This output is generated by a call to print_modules (kernel/module.c) by show_registers. Usually one of these modules will be the culprit for your problem; module code tends to be less stable than core kernel code, especially proprietary modules!

kernel: CPU:    0

This message is coming from CPU 0. Obvious enough?

kernel: EIP:    0060:[<c0143742>]    Tainted: PF     VLI

Eip! Again. The value 0060 comes from the CS register; let’s ignore it. (Sorry!) After that we have the EIP address 0xc0143742 shown again. Then comes “Tainted”, which sounds nasty, doesn’t it? There are six single-char codes here, and they’re generated by the print_tainted function in kernel/panic.c. The codes represent a bunch of undesirable, yet legal, things that might have been done to the kernel. The first code is either ‘P’ or ‘G’; if the former, a non-GPL module has been loaded (as is the case here). If you have a ‘P’ then don’t expect much support from kernel.org developers, unless the problem is obviously not related to non-kosher kernel modules. You’ll either want to work out the problem yourself, reproduce the problem without the proprietary module loaded, or get help from the vendor responsible for said module.

I’ll take a short-cut now and rip the comment from print_tainted to give the full list of flags:

 *  'P' - Proprietary module has been loaded.
 *  'F' - Module has been forcibly loaded.
 *  'S' - SMP with CPUs not designed for SMP.
 *  'R' - User forced a module unload.
 *  'M' - Machine had a machine check experience.
 *  'B' - System has hit bad_page.

The alternative character to all except ‘P’ is a blank space (i.e. the flag is unset). In our oops we also have an ‘F’ because one of the custom modules loaded is missing symbol version information. The ‘F’ flag can come into the mix for a variety of reasons it seems, the canonical one being that `insmod -f` was used to load the module. None of the other “taints” are flagged in my example so I’m not going to detail them. (See the bottom of Documentation/oops-tracing.txt for information.)

One last note on “taint”: once a kernel is tainted it stays tainted. I.e. if you load a proprietary module it taints the kernel, and the ‘P’ taint flag remains even if you unload it. You could think of it as a “from here on in, all bets are off” state.
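To make the flag-to-character mapping concrete, here is a condensed userspace sketch of how the taint string gets assembled. It is modelled on print_tainted in kernel/panic.c but simplified, and the bit positions are my reading of the 2.6.16 TAINT_* constants, so treat them as illustrative rather than gospel:

#include <stdio.h>

/* Simplified sketch of print_tainted(): one character per taint flag.
 * Bit positions are assumptions based on the 2.6.16 TAINT_* values. */
static const char *sketch_print_tainted(unsigned int tainted)
{
        static char buf[8];

        snprintf(buf, sizeof(buf), "%c%c%c%c%c%c",
                 (tainted & (1 << 0)) ? 'P' : 'G',   /* proprietary module   */
                 (tainted & (1 << 1)) ? 'F' : ' ',   /* forced module load   */
                 (tainted & (1 << 2)) ? 'S' : ' ',   /* unsafe SMP           */
                 (tainted & (1 << 3)) ? 'R' : ' ',   /* forced module unload */
                 (tainted & (1 << 4)) ? 'M' : ' ',   /* machine check        */
                 (tainted & (1 << 5)) ? 'B' : ' ');  /* bad page             */
        return buf;
}

int main(void)
{
        /* P and F set, as in the oops above: prints "Tainted: PF    " */
        printf("Tainted: %s\n", sketch_print_tainted(0x03));
        return 0;
}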

After the taint flags we have “VLI” and I don’t have a clue what this signifies. This is hard-coded in the i386 traps.c file, since it isn’t variable within the architecture I’ll dismiss it as unimportant!

kernel: EFLAGS: 00210202   (2.6.16.43-54.5-smp #1) 

This line first gives the CPU’s EFLAGS register, which stores all the usual flags like “carry” as well as a bunch of less familiar flags (see link for more detail). After the flags, in brackets, is the version string identifying the kernel that generated the oops.

kernel: EIP is at put_page+0x2/0x40

Eip! Yet again, but now with some more useful information: the symbol covering the code that EIP points into, along with EIP’s offset from that symbol (0x02) and the size of the code associated with the symbol (0x40, the distance between the put_page symbol and the next marked symbol).
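The arithmetic behind that put_page+0x2/0x40 notation is simple enough to spell out; the symbol addresses below are the ones implied by this particular oops rather than anything looked up:

#include <stdio.h>

int main(void)
{
        /* Numbers implied by this oops: EIP c0143742, put_page+0x2/0x40. */
        unsigned long eip       = 0xc0143742UL;  /* faulting instruction           */
        unsigned long sym_start = 0xc0143740UL;  /* address of the put_page symbol */
        unsigned long sym_next  = 0xc0143780UL;  /* next marked symbol afterwards  */

        printf("put_page+0x%lx/0x%lx\n", eip - sym_start, sym_next - sym_start);
        return 0;
}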

kernel: eax: 68230a7d   ebx: 00000001   ecx: f1057680   edx: 68230a7d
kernel: esi: f6fe7600   edi: fe089b98   ebp: f64d3c14   esp: f64d3bbc
kernel: ds: 007b   es: 007b   ss: 0068

Now we finally get to the actual named purpose of the show_registers function: a dump of the contents of some CPU registers. It’s worth noting at this point that %edx and %eax both contain our dodgy address.

kernel: Process snort_inline (pid: 6298, threadinfo=f64d2000 task=dff3fa70)

Information about the current “process” being run on the CPU, including pointers to the process’s thread_info and task structures. There will always be some process, but it may or may not be directly related to the error that caused the oops. When you’re debugging don’t jump to conclusions without good reason; keep following the trail of the oops.
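As an aside, that threadinfo value isn’t arbitrary. On i386 the thread_info structure sits at the bottom of the process’s kernel stack, so the kernel can find it by masking the stack pointer. Assuming the usual 8KB kernel stacks, the numbers from this oops line up nicely:

#include <stdio.h>

int main(void)
{
        /* Assumes 8KB (0x2000) kernel stacks, the common i386 configuration. */
        unsigned long thread_size = 0x2000UL;
        unsigned long esp         = 0xf64d3bbcUL;            /* from the oops */
        unsigned long threadinfo  = esp & ~(thread_size - 1);

        printf("threadinfo=%08lx\n", threadinfo);   /* prints f64d2000 */
        return 0;
}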

kernel: Stack: <0>c02a8b4a f6fe7600 f1057020 c02a88d8 000004cd f9de0fdd c7af5000 f7802b60 
kernel: 0000c7af fe73e860 f64d3c48 f64d3c48 e7e19834 f64d3c34 f5b54cc0 00000000 
kernel: 00000001 c02c7adf 00200282 d94bd4cc f64d3c48 f4429a40 f64d3c40 f9dd912a 

The stack: simply each word starting at %esp (the stack pointer) and continuing to the base of the stack, or for kstack_depth_to_print words, a limit configurable at kernel compile time. Some values in the stack can be related to the next part of the oops, the “Call Trace”. Essentially, starting at %esp the call chain is traced:

kernel: Call Trace:
kernel: [<c02a8b4a>] skb_release_data+0x8a/0xa0
kernel: [<c02a88d8>] kfree_skbmem+0x8/0x80
kernel: [<f9de0fdd>] mkyDestroyPacket+0x9d/0x260 [monkey_pkt_eater]
kernel: [<c02c7adf>] ip_local_deliver_finish+0xef/0x270
kernel: [<f9dd912a>] mkyIpqPostHandler+0xda/0x140 [monkey_pkt_eater]
kernel: [<c02c79f0>] ip_local_deliver_finish+0x0/0x270
kernel: [<f8e9e182>] find_dequeue_entry+0x82/0x90 [nfnetlink_queue]
kernel: [<f8e9e1ab>] issue_verdict+0x1b/0x50 [nfnetlink_queue]
kernel: [<f8e9f5af>] nfqnl_recv_verdict+0x1ff/0x330 [nfnetlink_queue]
kernel: [<c0308730>] schedule+0x350/0xdd0
kernel: [<f887940e>] nfnetlink_rcv_msg+0x16e/0x200 [nfnetlink]
kernel: [<f8879542>] nfnetlink_rcv+0xa2/0x169 [nfnetlink]
kernel: [<c02beecd>] netlink_unicast+0x16d/0x250
kernel: [<c02bf462>] netlink_data_ready+0x12/0x50
kernel: [<c02be2d1>] netlink_sendskb+0x21/0x40
kernel: [<c02bfacd>] netlink_sendmsg+0x27d/0x320
kernel: [<c0116c88>] __wake_up+0x38/0x50
kernel: [<c02a2b2e>] sock_sendmsg+0xae/0xe0
kernel: [<c012f4c0>] autoremove_wake_function+0x0/0x50
kernel: [<c012f4c0>] autoremove_wake_function+0x0/0x50
kernel: [<f90caa7f>] monkey_chan_queue_append+0x15f/0x170 [monkey]
kernel: [<c02a9e4a>] verify_iovec+0x2a/0x90
kernel: [<c02a306d>] sys_sendmsg+0x15d/0x270
kernel: [<c02a3f86>] sys_recvfrom+0x116/0x140
kernel: [<f90557c0>] e1000_clean_rx_irq+0x0/0x3d0 [e1000]
kernel: [<f9053b0e>] e1000_clean+0x53e/0x780 [e1000]
kernel: [<f90c7780>] mkyGetState+0x20/0x30 [monkey]
kernel: [<f9ddbc90>] packet_writer+0x2e0/0xd50 [monkey_pkt_eater]
kernel: [<c02a453f>] sys_socketcall+0x24f/0x280
kernel: [<c013a5b0>] __do_IRQ+0xc0/0x110
kernel: [<c0102d3f>] sysenter_past_esp+0x54/0x75

Think of this as analogous to “bt” in gdb; it is generated by show_trace_log_lvl in arch/i386/kernel/traps.c. But be aware that it can be broken: what if your bug involved mucking up data on the stack?
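Part of the reason it can mislead: unless the kernel was built with frame pointers (CONFIG_FRAME_POINTER), the trace is produced by a fairly dumb scan, something like the userspace sketch below. Any word on the stack that happens to point into kernel or module code gets printed, so stale return addresses left over from earlier, already-finished calls show up too, which is probably why entries like schedule and __wake_up appear in the middle of this trace. (The text range in the sketch is made up; the real code asks kernel_text_address and also checks loaded modules.)

#include <stdio.h>
#include <stddef.h>

/* Illustrative only: print any stack word that falls inside a "kernel
 * text" range.  The range here is invented for the example. */
#define KTEXT_START 0xc0100000UL
#define KTEXT_END   0xc0400000UL

static void sketch_show_trace(const unsigned long *stack, size_t words)
{
        for (size_t i = 0; i < words; i++)
                if (stack[i] >= KTEXT_START && stack[i] < KTEXT_END)
                        printf(" [<%08lx>] possible return address\n", stack[i]);
}

int main(void)
{
        /* First few words of the stack dump above. */
        unsigned long stack[] = { 0xc02a8b4a, 0xf6fe7600, 0xf1057020, 0xc02a88d8 };

        sketch_show_trace(stack, sizeof(stack) / sizeof(stack[0]));
        return 0;
}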

What we know at this point is that the put_page where our problem occurred was called by skb_release_data (aha! networking code!), which in turn was called by kfree_skbmem, and that by mkyDestroyPacket. That last item, from the monkey_pkt_eater module, comes from our proprietary code. Our suspicion is that we have, for some reason, given some bad data to kfree_skbmem.

kernel: Code: c0 8b 01 01 c2 83 fa 10 89 11 7f 05 83 fa f0 7d 0d f0 01 15 fc 8f
    40 c0 c7 01 00 00 00 00 c3 8d 76 00 8d bc 27 00 00 00 00 89 c2 <8b> 00 f6
    c4 40 75 1e 8b 42 04 40 74 1f f0 83 42 04 ff 0f 98 c0 

Personally I find this the most exciting part of the oops. A sequence of hex values, joy! But what is the “Code:”? Quite simple really. The kernel takes the value of EIP, subtracts 43, then prints out 64 bytes from that point, marking the byte at EIP with <..>. I.e. it is a hexdump of the section of machine code being run at the time the oops occurred. And this means we can disassemble.

But it might not be as straightforward as we’d like. That “-43” means the hex dump could start anywhere, maybe right in the middle of an instruction. So I suggest either removing everything prior to EIP (remove everything before <8b>) or starting at the symbol address that EIP maps into. The latter option is only going to work if the offset from the symbol is less than 43; in our case we’re lucky because put_page+0x2/0x40 tells us the offset is only 2 bytes, so we can safely cut the code down to this:

    89 c2 <8b> 00 f6 c4 40 75 1e 8b 42 04 40 74 1f f0 83 42 04 ff 0f 98 c0 

Now we can employ trusty xxd to reverse the hex to a binary file:

:; echo '89 c2 8b 00 f6 c4 40 75 1e 8b 42 04 40 74 1f f0 83 42 04 ff 0f 98 c0' | 
    sed 's/[^0-9a-f]//g;s/^/0: /' |  
    xxd -r  -c 256 > code.bin

What do we do with this code.bin file? Our next friendly tool is objdump, which won’t accept data from STDIN or using <(...), hence the temporary code.bin file rather than a pipe. With objdump we do this, and get:

:; objdump -b binary -m i386 -D code.bin 

code.bin:     file format binary

Disassembly of section .data:

0000000000000000 <.data>:
   0:   89 c2                   mov    %eax,%edx
   2:   8b 00                   mov    (%eax),%eax
   4:   f6 c4 40                test   $0x40,%ah
   7:   75 1e                   jne    0x27
   9:   8b 42 04                mov    0x4(%edx),%eax
   c:   40                      inc    %eax
   d:   74 1f                   je     0x2e
   f:   f0 83 42 04 ff          lock addl $0xffffffff,0x4(%edx)
  14:   0f 98 c0                sets   %al

Some really hard-core people familiar with machine code read it like assembly, and can easily work out the instruction sequence prior to put_page too. But if that was you then you wouldn’t be reading this I think.

What we know now is that the oops occurred in the assembly above at offset 2, at “mov (%eax),%eax”, which moves the word at the address stored in %eax into %eax. We can surmise that our invalid address is stored in %eax, and looking further up the oops, at the register dump, we see that the value in %eax is 0x68230a7d: the bad address that the very first line of the oops gives us. We now know the exact problem, but it isn’t really any use to us without some context, is it?

Thinking a little further, we know that offset 0 in the assembly code above is the entry point of the put_page function, so the only code belonging to the function prior to where the error occurs is “mov %eax,%edx”. Logically it seems that the argument to put_page arrived in %eax (i386 kernels commonly pass the first few arguments in registers) and was immediately copied into %edx; the register dump shows both still hold 0x68230a7d, so it is this argument that provides the bad address.
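To make that concrete without getting ahead of the next entry (where we look at the real source), here is a purely illustrative userspace function with roughly the same shape: take a pointer argument and make the very first piece of real work a read through it. Hand it garbage and it blows up on exactly that first dereference. The structure layout is invented for illustration, it is not the real struct page:

#include <stdio.h>

/* Invented stand-in for struct page: a flags word followed by a count,
 * matching the pattern the disassembly pokes at (offset 0, then offset 4).
 * The layout and the 0x4000 mask (the bit "test $0x40,%ah" checks) are
 * used purely for illustration. */
struct fake_page {
        unsigned long flags;   /* offset 0: read by "mov (%eax),%eax"    */
        int           count;   /* offset 4: the 0x4(%edx) manipulations  */
};

static void sketch_put_page(struct fake_page *page)
{
        if (page->flags & 0x4000)       /* first dereference: faults on a bad pointer */
                return;
        page->count--;
        /* ... */
}

int main(void)
{
        struct fake_page good = { 0, 1 };

        sketch_put_page(&good);                             /* fine           */
        /* sketch_put_page((struct fake_page *)0x68230a7d);    would segfault */
        return 0;
}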

At this point we still don’t have any useful context and really need to start looking at a more complete picture of the code involved. That’s the subject of the next entry in this series.

[With some amusement I note that during the week I was dealing with the problem that prompted me to write this stuff the topic of “decoding oopses” became somewhat popular on LKML! Heh, great timing. This renders my own efforts somewhat redundant, oh well.]