Grokking Oopses
Thu 2008-01-17 00:31
Introduction
Man it was hard to find information on debugging 2.6 kernel issues, maybe my Google skills need honing. Recently I was parachuted into a customer site on a support-or-die mission. The problem: kernel-space. My problem: not much kernel-space experience. A week of little sleep, much reading, and some learning. In the end the problem was rooted out and the crack team of kernel monkeys back at base blew it out of the water. Hooray!
This is the first in a trilogy of entries I intend to write. This one discusses a specific example of an "Oops" in detail, giving a line-by-line description of everything the oops tells you. Well, almost everything, some details go too deep to cover in something this introductory… to drill all the way down could take a whole textbook.
The second entry goes into relating the oops to the original code, I might call it "tracing assembly". The third entry will delve into a session of debugging with 'kdb'.
Now the first two entries will certainly happen as almost all the material was written during my work trip. (Joyful hours of trains, flights, and just plain waiting at stations and airports.) I won't make any promises about the 'kdb' entry though, since I've written nothing for it and would need to spend quite a bit of time generating examples and traces.
Important items of note:
- An IA32/i386 architecture/kernel is assumed, the oops code is platform dependant.
- The oops examined here comes from a 2.6.16 kernel.
- Sometimes file paths are given, these are relative to the root of a Linux 2.6.16 source tree.
- This entry deals with a specific oops for a specific error case and doesn't discuss much beyond the specific example.
- There's no guarantee, as far as I know, that the oops format won't change at some time.
Finally, a disclaimer: I am not a kernel developer. In fact I learnt a couple of new things just writing these two entries. No doubt there are important facts I've missed or not rated highly enough. Maybe it is even possible that I've got something plain wrong (if so please correct me via the comments or email).
Oops!
Let us start where these things usually start, a kernel oops:
kernel: Unable to handle kernel paging request at virtual address 68230a7d
kernel: printing eip:
kernel: c0143742
kernel: *pde = 00000000
kernel: Oops: 0000 [#1]
kernel: SMP
kernel: Modules linked in: monkey_pkt_eater nfnetlink_queue xt_condition
iptable_ips edd ide_cd ip_nat_ftp ip_conntrack_ftp ipt_REDIRECT xt_tcpudp
ipt_addrtype ebtable_nat ebtables ip_nat_mms ip_nat_pptp ip_nat_irc
iptable_nat ip_conntrack_mms ip_conntrack_pptp ip_conntrack_irc af_packet
ipt_logmark ipt_CONFIRMED ipt_confirmed ipt_owner ipt_REJECT sr_mod cdrom
usb_storage evdev microcode parport_pc ppdev parport xt_state xt_NOTRACK
iptable_raw iptable_filter ip_conntrack_netlink ip_nat ipt_LOG ip_conntrack
ip_tables x_tables nfnetlink_log nfnetlink monkey e1000 capability
commoncap loop sg ahci libata sd_mod scsi_mod
kernel: CPU: 0
kernel: EIP: 0060:[<c0143742>] Tainted: PF VLI
kernel: EFLAGS: 00210202 (2.6.16.43-54.5-smp #1)
kernel: EIP is at put_page+0x2/0x40
kernel: eax: 68230a7d ebx: 00000001 ecx: f1057680 edx: 68230a7d
kernel: esi: f6fe7600 edi: fe089b98 ebp: f64d3c14 esp: f64d3bbc
kernel: ds: 007b es: 007b ss: 0068
kernel: Process snort_inline (pid: 6298, threadinfo=f64d2000 task=dff3fa70)
kernel: Stack: <0>c02a8b4a f6fe7600 f1057020 c02a88d8 000004cd f9de0fdd c7af5000 f7802b60
kernel: 0000c7af fe73e860 f64d3c48 f64d3c48 e7e19834 f64d3c34 f5b54cc0 00000000
kernel: 00000001 c02c7adf 00200282 d94bd4cc f64d3c48 f4429a40 f64d3c40 f9dd912a
kernel: Call Trace:
kernel: [<c02a8b4a>] skb_release_data+0x8a/0xa0
kernel: [<c02a88d8>] kfree_skbmem+0x8/0x80
kernel: [<f9de0fdd>] mkyDestroyPacket+0x9d/0x260 [monkey_pkt_eater]
kernel: [<c02c7adf>] ip_local_deliver_finish+0xef/0x270
kernel: [<f9dd912a>] mkyIpqPostHandler+0xda/0x140 [monkey_pkt_eater]
kernel: [<c02c79f0>] ip_local_deliver_finish+0x0/0x270
kernel: [<f8e9e182>] find_dequeue_entry+0x82/0x90 [nfnetlink_queue]
kernel: [<f8e9e1ab>] issue_verdict+0x1b/0x50 [nfnetlink_queue]
kernel: [<f8e9f5af>] nfqnl_recv_verdict+0x1ff/0x330 [nfnetlink_queue]
kernel: [<c0308730>] schedule+0x350/0xdd0
kernel: [<f887940e>] nfnetlink_rcv_msg+0x16e/0x200 [nfnetlink]
kernel: [<f8879542>] nfnetlink_rcv+0xa2/0x169 [nfnetlink]
kernel: [<c02beecd>] netlink_unicast+0x16d/0x250
kernel: [<c02bf462>] netlink_data_ready+0x12/0x50
kernel: [<c02be2d1>] netlink_sendskb+0x21/0x40
kernel: [<c02bfacd>] netlink_sendmsg+0x27d/0x320
kernel: [<c0116c88>] __wake_up+0x38/0x50
kernel: [<c02a2b2e>] sock_sendmsg+0xae/0xe0
kernel: [<c012f4c0>] autoremove_wake_function+0x0/0x50
kernel: [<c012f4c0>] autoremove_wake_function+0x0/0x50
kernel: [<f90caa7f>] monkey_chan_queue_append+0x15f/0x170 [monkey]
kernel: [<c02a9e4a>] verify_iovec+0x2a/0x90
kernel: [<c02a306d>] sys_sendmsg+0x15d/0x270
kernel: [<c02a3f86>] sys_recvfrom+0x116/0x140
kernel: [<f90557c0>] e1000_clean_rx_irq+0x0/0x3d0 [e1000]
kernel: [<f9053b0e>] e1000_clean+0x53e/0x780 [e1000]
kernel: [<f90c7780>] mkyGetState+0x20/0x30 [monkey]
kernel: [<f9ddbc90>] packet_writer+0x2e0/0xd50 [monkey_pkt_eater]
kernel: [<c02a453f>] sys_socketcall+0x24f/0x280
kernel: [<c013a5b0>] __do_IRQ+0xc0/0x110
kernel: [<c0102d3f>] sysenter_past_esp+0x54/0x75
kernel: Code: c0 8b 01 01 c2 83 fa 10 89 11 7f 05 83 fa f0 7d 0d f0 01 15 fc 8f
40 c0 c7 01 00 00 00 00 c3 8d 76 00 8d bc 27 00 00 00 00 89 c2 <8b> 00 f6
c4 40 75 1e 8b 42 04 40 74 1f f0 83 42 04 ff 0f 98 c0
(This oops has had some timestamp details removed and has been wrapped for neatness. Also, some "proprietary symbols" have been renamed, let us just say: beware the monkey.)
Tackling the oops
Let me first say: Thanks to the Internet containing cruft from all ages you'll
find many references to ksymoops when researching kernel oopses. Stop now!
Maybe add a "-ksymoops" to your search query, though this may remove several
good links. It really is a battle against accumulated information. The
ksymoops tool was useful for dealing with Linux 2.4 oopses, but in 2.6
it's work is done at oops-time and ksymoops is useless. If you're dealing
with 2.6 forget about ksymoops.
On with the oops. It tells us quite a lot about what was going on when the it occurred, we have the stack, the call path, the content of registers, and more. On inspection much of this is pretty straight forward to someone familiar with Linux, C programming, and Intel CPUs and assembly (be aware of GAS). It's probably very hard going for someone who hasn't done any work at that sort of level, but difficulty is just a good opportunity to learn! Right? However, like me, you don't need to be a hard-core kernel hacker (I'm not even soft-core, I'd barely earn a PG rating) to get an idea of what's going on.
Line by line
kernel: Unable to handle kernel paging request at virtual address 68230a7d
The core problem, and an oft-seen one, think of this as the "segfault" of
the kernel. In actual fact both userspace nd kernelspace memory access
violations come from the same place (MMU) and are handled by the same
code. Generation of the oops text starting with this line has it's origins in
do_page_fault in the arch/i386/mm/fault.c code file. Think of that path
as "i386 memory manager faults". ("faults" covers a bunch of conditions, not
just access violations.)
In short: something in the kernel has tried to use an address that doesn't work, and that address was: 0x68230a7d.
kernel: printing eip: kernel: c0143742 kernel: *pde = 00000000
Eip! It sounds like a small rodent's cry of distress. It's actually the
"Extended Instruction Pointer", which is the CPU register containing the
address in memory of the instruction being executed. Thinking back to computer
architecture classes you might be more familiar with this as the "Program
Counter". This tells us that at the time of the invalid paging request the
instruction at address 0xc0143742 was being executed. A most useful bit of
information that we'll return to several times.
The "PDE" is the "Page Directory Entry", and I'm not going to pursue that further. Try this article by Robert Collins, published in DDJ back in '96, for more information: Understanding 4M Page Size Extensions on the Pentium Processor.
There's a seriously large amount of information behind this that is worth reading up on. For starters try Google mixing "Linux Kernel" with a variety of things like "virtual memory", "paging", "mmu", … From my bookshelf I'd recommend Chapter 2 of Understanding the Linux Kernel. That book is fairly bent towards Intel, I don't have a book that covers deeper Intel specifics and can only recommend you make use of Google.
kernel: Oops: 0000 [#1] kernel: SMP
Now we've jumped to output generated by the die function in
arch/i386/kernel/traps.c.
The first number is actually the error_code value passed to do_page_fault
printed in hex. The interpretation for this is:
* bit 0 == 0 means no page found, 1 means protection fault * bit 1 == 0 means read, 1 means write * bit 2 == 0 means kernel, 1 means user-mode
So 0000 means "no page found" on "read" in "kernel" mode. This checks out with
what we already know, consistency is good.
The number in the square brackets is the oops count. It is the number of times
this die function has been called since boot, counting from 1. Usually you
only want to worry about the first oops that you find. Everything after that
point could be a side-effect of earlier kernel badness. If you're working past
the first oops then anything I have written here will be old news to you.
Yeah, and it is an SMP kernel. Other strings that may appear here are
PREEMPT, and DEBUG_PAGEALLOC. After this the somewhat mis-named
show_registers function found in arch/i386/kernel/traps.c is called and
the rest of the oops text is generated in or below this call.
kernel: Modules linked in: monkey_pkt_eater nfnetlink_queue xt_condition
...
This output is generated by a call to print_modules (kernel/modules.c) by
show_registers. Usually one of these will be the culprit for your
problem, module code tends to be less stable than core kernel code, especially
proprietary modules!
kernel: CPU: 0
This message is coming from CPU 0. Obvious enough?
kernel: EIP: 0060:[<c0143742>] Tainted: PF VLI
Eip! Again. The value 0060 comes from the CS register, let's ignore it.
(Sorry!) After that we have the EIP address 0xc0143742 shown again. Now
"Tainted", sounds nasty doesn't it? There are six single-char codes here,
and they're generated by the print_tainted function in kernel/panic.c.
The codes represent a bunch of undesirable, yet legal, things that might have
been done to the kernel. The first code either 'P' or 'G', if the former
then a non-GPL module has been loaded (as is the case here). If you have a
'P' then don't expect much support from kernel.org developers, unless the
problem is obviously not related to non-kosher kernel modules. You'll either
want to work out the problem yourself, reproduce the problem without the
proprietary module loaded, or get help from the vendor responsible for said
module.
I'll take a short-cut now and rip the comment from print_tainted to give the
full list of flags:
* 'P' - Proprietary module has been loaded. * 'F' - Module has been forcibly loaded. * 'S' - SMP with CPUs not designed for SMP. * 'R' - User forced a module unload. * 'M' - Machine had a machine check experience. * 'B' - System has hit bad_page.
The alternative character to all except 'P' is a blank space (i.e. flag is
unset). In our oops we also have an 'F' because one of the custom modules
loaded is missing symbol version information. The 'F' flag can come into the
mix for a variety of reasons it seems, the canonical one being that
`insmod -f`
was used to load it. None of the other "taints" are flagged in my example so
I'm not going detail them. (See the bottom of
Documentation/oops-tracing.txt for information.)
One last note on "taint": Once a kernel is tainted it stays tainted. I.e. if
you load a propriety module it taints the kernel and the 'P' taint flag remains
even if you unload it. You could think of it as an "from here on in all bets
are off" state.
After the taint flags we have "VLI" and I don't have a clue what this
signifies. This is hard-coded in the i386 traps.c file, since it isn't
variable within the architecture I'll dismiss it as unimportant!
kernel: EFLAGS: 00210202 (2.6.16.43-54.5-smp #1)
This line first gives the CPU's EFLAGS register, this register stores all the usual flags like "carry" as well as a bunch of less familiar flags (see link for more detail). After the flags, in brackets, is the version string identifying the kernel that generated the oops.
kernel: EIP is at put_page+0x2/0x40
Eip! Yet again, but now with some more useful information. The symbol
attached to the code into which EIP is pointing along with EIP's offset from
the symbol (0x02) and the size of the code associated with the symbol
(0x40, distance between the put_page symbol and the next marked symbol).
kernel: eax: 68230a7d ebx: 00000001 ecx: f1057680 edx: 68230a7d kernel: esi: f6fe7600 edi: fe089b98 ebp: f64d3c14 esp: f64d3bbc kernel: ds: 007b es: 007b ss: 0068
Now we finally get the actual named purpose of the print_registers function:
a dump of the content some CPU registers. It's worth noting at this point that
%edx and %eax both contain our dodgey address.
kernel: Process snort_inline (pid: 6298, threadinfo=f64d2000 task=dff3fa70)
Information about the current "process" being run on the CPU including pointers to the process's thread and task structures. There will always be some process, but it may or may not be directly related to the error that caused the oops. When you're debugging don't jump to conclusions without good reason, keep following the trail of the oops.
kernel: Stack: <0>c02a8b4a f6fe7600 f1057020 c02a88d8 000004cd f9de0fdd c7af5000 f7802b60 kernel: 0000c7af fe73e860 f64d3c48 f64d3c48 e7e19834 f64d3c34 f5b54cc0 00000000 kernel: 00000001 c02c7adf 00200282 d94bd4cc f64d3c48 f4429a40 f64d3c40 f9dd912a
The stack. Simply each word starting at %esp (stack pointer) and continuing
to the base of he stack or for kstack_depth_to_print words, a limit
configurable at kernel compile time. Some values in the stack can be related
to the next part of the oops, the "Call Trace". Essentially, starting with the
at %esp the call chain is traced:
kernel: Call Trace: kernel: [<c02a8b4a>] skb_release_data+0x8a/0xa0 kernel: [<c02a88d8>] kfree_skbmem+0x8/0x80 kernel: [<f9de0fdd>] mkyDestroyPacket+0x9d/0x260 [monkey_pkt_eater] kernel: [<c02c7adf>] ip_local_deliver_finish+0xef/0x270 kernel: [<f9dd912a>] mkyIpqPostHandler+0xda/0x140 [monkey_pkt_eater] kernel: [<c02c79f0>] ip_local_deliver_finish+0x0/0x270 kernel: [<f8e9e182>] find_dequeue_entry+0x82/0x90 [nfnetlink_queue] kernel: [<f8e9e1ab>] issue_verdict+0x1b/0x50 [nfnetlink_queue] kernel: [<f8e9f5af>] nfqnl_recv_verdict+0x1ff/0x330 [nfnetlink_queue] kernel: [<c0308730>] schedule+0x350/0xdd0 kernel: [<f887940e>] nfnetlink_rcv_msg+0x16e/0x200 [nfnetlink] kernel: [<f8879542>] nfnetlink_rcv+0xa2/0x169 [nfnetlink] kernel: [<c02beecd>] netlink_unicast+0x16d/0x250 kernel: [<c02bf462>] netlink_data_ready+0x12/0x50 kernel: [<c02be2d1>] netlink_sendskb+0x21/0x40 kernel: [<c02bfacd>] netlink_sendmsg+0x27d/0x320 kernel: [<c0116c88>] __wake_up+0x38/0x50 kernel: [<c02a2b2e>] sock_sendmsg+0xae/0xe0 kernel: [<c012f4c0>] autoremove_wake_function+0x0/0x50 kernel: [<c012f4c0>] autoremove_wake_function+0x0/0x50 kernel: [<f90caa7f>] monkey_chan_queue_append+0x15f/0x170 [monkey] kernel: [<c02a9e4a>] verify_iovec+0x2a/0x90 kernel: [<c02a306d>] sys_sendmsg+0x15d/0x270 kernel: [<c02a3f86>] sys_recvfrom+0x116/0x140 kernel: [<f90557c0>] e1000_clean_rx_irq+0x0/0x3d0 [e1000] kernel: [<f9053b0e>] e1000_clean+0x53e/0x780 [e1000] kernel: [<f90c7780>] mkyGetState+0x20/0x30 [monkey] kernel: [<f9ddbc90>] packet_writer+0x2e0/0xd50 [monkey_pkt_eater] kernel: [<c02a453f>] sys_socketcall+0x24f/0x280 kernel: [<c013a5b0>] __do_IRQ+0xc0/0x110 kernel: [<c0102d3f>] sysenter_past_esp+0x54/0x75
Think of this as analogous to "bt" in gdb, it is generated by
show_trace_log_lvl in arch/i386/kernel/traps.c. But be aware that it
can be broken, what if your bug involved mucking up data on the stack?
What we know at this point is that the put_page where our problem occured
was called by skb_release_data (aha! networking code!), which in turn was
called by kfree_skbmem, and that by mkyDestroyPacket. That last
item, from the monkey_pkt_eater module comes from our propriety code. Our
suspicion is that we have, for some reason, given some bad data to
kfree_skbmem.
kernel: Code: c0 8b 01 01 c2 83 fa 10 89 11 7f 05 83 fa f0 7d 0d f0 01 15 fc 8f
40 c0 c7 01 00 00 00 00 c3 8d 76 00 8d bc 27 00 00 00 00 89 c2 <8b> 00 f6
c4 40 75 1e 8b 42 04 40 74 1f f0 83 42 04 ff 0f 98 c0
Personally I find this the most exciting part of the oops. A sequence of hex
values, joy! But what is the "Code:"? Quite simple really. The kernel
takes the value of EIP, subtracts 43 then prints out 64 bytes from that point,
marking the byte at EIP with <..>. I.e. it is a hexdump of the
section of machine code being run at the time the oops occurred. And this
means we can disassemble.
But it might not be so straight forward as we'd like. That "-43" means the
hex-dump could start anywhere, maybe right in the middle of an instruction. So
I suggest either removing everything prior to EIP (remove everything before
<8b>) or starting at the symbol address that EIP maps into. The
latter option is only going to work if the offset from the symbol is less than
43, in our case we're lucky because in put_page+0x2/0x40 we're given that
the offset is only 2 bytes, so we can safely cut the code down to this:
89 c2 <8b> 00 f6 c4 40 75 1e 8b 42 04 40 74 1f f0 83 42 04 ff 0f 98 c0
Now we can employ trusty xxd to reverse the hex to a binary file:
:; echo '89 c2 8b 00 f6 c4 40 75 1e 8b 42 04 40 74 1f f0 83 42 04 ff 0f 98 c0' | \
sed 's/[^0-9a-f]//g;s/^/0: /' | \
xxd -r -c 256 > code.bin
What do we do with this code.bin file? Our next friendly tool is
objdump, which won't accept data from STDIN or using <(...), thus the
temporary code.bin file rather than a pipe. With objdump we do and get
this:
:; objdump -b binary -m i386 -D code.bin code.bin: file format binary Disassembly of section .data: 0000000000000000 <.data>: 0: 89 c2 mov %eax,%edx 2: 8b 00 mov (%eax),%eax 4: f6 c4 40 test $0x40,%ah 7: 75 1e jne 0x27 9: 8b 42 04 mov 0x4(%edx),%eax c: 40 inc %eax d: 74 1f je 0x2e f: f0 83 42 04 ff lock addl $0xffffffff,0x4(%edx) 14: 0f 98 c0 sets %al
Some really hard-core people familiar with machine code read it like
assembly, and can easily work out the instruction sequence prior to put_page
too. But if that was you then you wouldn't be reading this I think.
What we know now is that the oops occurred in the assembly above at offset 2,
at "mov (%eax),%eax". Which is moving the word at the address stored in
%eax into %eax. We can surmise that our invalid address is stored in
%eax, and looking further up the oops, at the register dump, we see that the
value in %eax is 0x68230a7d and that is the bad address that the very
first line of the oops gives us. We now know the exact problem, but it isn't
really any use to use without some context isn't it?
Thinking a little further, we know that offset 0 in the assembly code above is
the entry point of the put_page function so the only code belonging to the
function prior to where the error occurs is "mov %eax,%edx". Logically it
seems that an argument to the put_page function was provided in %edx
(which the register dump shows also contains 0x68230a7d), and that this
argument provides the bad address.
At this point we still don't have any useful context and really need to start looking at a more complete picture of the code involved. That's the subject of the next entry in this series.
[With some amusement I note that during the week I was dealing with the problem that prompted me to write this stuff the topic of "decoding oopses" became somewhat popular on LKML! Heh, great timing. This renders my own efforts somewhat redundant, oh well.]
Recent Entries
Categories
- Entries - 260
- Beer - 1
- Cycling - 2
- Food - 53
- Cooking - 26
- Hare - 5
- Soup - 1
- Eating - 5
- England - 5
- London - 4
- Rickmansworth - 1
- Produce - 14
- Ristretto - 8
- Health - 3
- Money - 2
- Random - 74
- Technology - 93
- Code - 22
- General - 46
- Security - 23
- Work - 2
- Wanderings - 32
- Australia - 2
- Barcelona - 2
- Belgium - 2
- England - 15
- Cambridge_Easter - 3
- Lakes - 9
- Finland - 4
- France - 1
- Germany - 1
- Ramble - 3
- Wales - 1




No Responses