Troubleshooting Tools
0. The purpose
There are so many tools for tracking down system problems that it is hard to keep them all in mind. Troubleshooting is like a detective's work: first, we must find 'evidence' as quickly as possible. Besides awareness, using the right tool for the problem is vital, so let's record the tools here.
1. Core dump
when
The first 'tool' is not really a tool. It is a snapshot file that records memory, processor and register state and so on at the moment our program crashes at the OS level. Like a javadump file, it is very useful for diagnosing and debugging the problem.
When can we get a core dump file?
Linux uses signals as an asynchronous event handling mechanism. Each signal has a default action, such as Ignore (ignore the signal), Stop (suspend the process), Terminate (terminate the process), or Core (terminate the process and dump core).
So when the `Core` action is triggered, Linux generates a core dump.
Signal | Action | Description |
---|---|---|
SIGQUIT | CORE | Quit from keyboard, e.g. "Ctrl+\" |
SIGILL | CORE | Illegal instruction, e.g. kill -ILL $$ |
SIGABRT | CORE | Abort signal from abort |
SIGSEGV | CORE | Invalid memory reference, e.g. writing through a null pointer or overflowing the stack |
SIGTRAP | CORE | Trace/breakpoint trap |
For the full signal list, please see the signal manual page. The signal is also shown when we open the core dump, which helps us recognize the type of problem.
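To make the link between signal and problem type concrete, here is a minimal Python sketch. It starts a child process that aborts itself and then looks at how the child ended. The helper name `describe_exit` is made up for this example; the sketch relies on `subprocess` reporting death by signal as a negative return code.

```python
import signal
import subprocess
import sys

# Signals from the table above whose default action is "Core".
CORE_SIGNALS = {
    signal.SIGQUIT, signal.SIGILL, signal.SIGABRT,
    signal.SIGSEGV, signal.SIGTRAP,
}

def describe_exit(returncode):
    """Translate a subprocess return code into a short description."""
    if returncode >= 0:
        return "exited with status %d" % returncode
    sig = signal.Signals(-returncode)
    action = "Core" if sig in CORE_SIGNALS else "not Core"
    return "killed by %s (default action: %s)" % (sig.name, action)

# The child aborts itself; os.abort() raises SIGABRT in the child.
child = subprocess.run([sys.executable, "-c", "import os; os.abort()"])
print(describe_exit(child.returncode))
```

Whether a core file is actually written for the child still depends on the configuration described in the next part.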
how
I. enable configuration
Well... we often hear 'I could not find any core dump when it crashed'.
There are some switches that control whether the system generates a core dump.
Run `ulimit -c`. If the result is `0`, core dumps are disabled (this is often the default), and no core dump file will be generated.
We can use `ulimit -c unlimited` to enable core dumps without limiting the core file size.
This command only affects the current terminal session. To make the setting permanent, modify `/etc/security/limits.conf`:
#/etc/security/limits.conf
#Each line describes a limit for a user in the form:
#<domain> <type> <item> <value>
* soft core unlimited
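The same limit can be inspected from inside a program. The sketch below uses Python's standard `resource` module to read the core file size limit of the current process and raise the soft limit as far as the hard limit allows. The function names are made up for this example, and the values are in bytes, so they may not match the units printed by `ulimit -c`.

```python
import resource

def show(value):
    """Render a limit value the way ulimit does for 'no limit'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

def enable_core_dumps():
    """Raise the soft core limit to the hard limit; return the new pair."""
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("before: soft=%s hard=%s" % (show(soft), show(hard)))
soft, hard = enable_core_dumps()
print("after:  soft=%s hard=%s" % (show(soft), show(hard)))
```

If the hard limit itself is 0, the soft limit stays at 0 and only a privileged process or the limits.conf entry above can change that.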
II. dump file path
- By default, the core file is saved in the current working directory of the crashed process, with the file name `core`.
- Run `sysctl -a | grep core` to see the related settings: `kernel.core_pattern` indicates the dump file path, and setting `kernel.core_uses_pid` to `1` makes the core file name contain the process ID.
- To modify these settings, use `sysctl -w` or `sysctl -p` as a sudoer.
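To check quickly where a dump would end up, the two settings can also be read from `/proc/sys`. The following Python sketch does that and classifies the pattern. The helper names are made up for this example, and the classification only covers the three common cases (pipe to a program, absolute path, relative name).

```python
from pathlib import Path

def read_sysctl(name):
    """Read a sysctl value from /proc/sys, or None if it is unavailable."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None

def describe_core_pattern(pattern):
    """Say where the kernel would send a core dump for this pattern."""
    if pattern.startswith("|"):
        parts = pattern[1:].split()
        return "piped to program: " + (parts[0] if parts else "")
    if pattern.startswith("/"):
        return "written to absolute path template: " + pattern
    return "written relative to the working directory as: " + pattern

pattern = read_sysctl("kernel.core_pattern")
if pattern is None:
    print("kernel.core_pattern is not readable on this system")
else:
    print(describe_core_pattern(pattern))
print("kernel.core_uses_pid =", read_sysctl("kernel.core_uses_pid"))
```

A pattern that starts with `|` means the dump is handed to a helper program instead of being written as a plain file, which is a common reason for 'missing' core files.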
III. debug core file
Use the command `gdb [program] [coredump]` to view the core file:
vagrant@vagrant-ubuntu-trusty:~/test$ gdb seg core
GNU gdb (Ubuntu 7.8-1ubuntu4) 7.8.0.20141001-cvs
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from seg...(no debugging symbols found)...done.
[New LWP 12887]
Core was generated by `./seg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000400506 in main ()
(gdb) where
#0 0x0000000000400506 in main ()
(gdb) info frame
Stack level 0, frame at 0x7fff961538c0:
rip = 0x400506 in main; saved rip = 0x7f17f9d0eec5
Arglist at 0x7fff961538b0, args:
Locals at 0x7fff961538b0, Previous frame's sp is 0x7fff961538c0
Saved registers:
rbp at 0x7fff961538b0, rip at 0x7fff961538b8
(gdb) Quit
IV. generate dump with gdb
We can also use `gdb` to generate a core dump of a running process.
First, attach to the process (here a Java process) by PID:
gdb -q --pid=xxx
Then generate the core file:
(gdb) generate-core-file
Finally, don't forget to detach from the process:
(gdb) detach
V. for java process
If the core dump comes from a Java process, we can convert it to a Java heap dump with `jmap`:
jmap -dump:format=b,file=heap.hprof $JAVA_HOME/bin/java core.xxx
Then use a tool like MAT to analyze the memory.
Or we can use `jstack` to see the Java stacks:
jstack -m $JAVA_HOME/bin/java core.xxx
(PS: jstack may hit a bug here.)
2. dmesg/messages
when
When a process crashes, we need to confirm whether it was killed by the oom-killer.
how
We can use `dmesg` and `/var/log/messages*` to see kernel messages. (`dmesg` reads the kernel ring buffer, while the messages files collect all system messages except boot info…)
If the oom-killer was involved, its messages will be in there. We can use
sudo dmesg | grep java | grep -i oom-killer
to confirm whether the program was killed by the oom-killer.
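When the log has been saved to a file, the same check can be scripted. Here is a minimal Python sketch that filters log text for lines that look like oom-killer activity. The two patterns are assumptions based on commonly seen kernel wording, which differs between kernel versions, and the sample text is illustrative rather than captured from a real machine.

```python
import re

# Wording varies between kernel versions; treat these patterns as assumptions.
OOM_PATTERNS = [
    re.compile(r"invoked oom-killer", re.IGNORECASE),
    re.compile(r"Out of memory: Kill(?:ed)? process \d+"),
]

def find_oom_events(log_text, process_name=None):
    """Return log lines that look like oom-killer activity."""
    hits = []
    for line in log_text.splitlines():
        if not any(p.search(line) for p in OOM_PATTERNS):
            continue
        if process_name is None or process_name in line:
            hits.append(line)
    return hits

# Illustrative sample only, not real output.
SAMPLE = """\
[1001.000000] java invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
[1001.000100] Out of memory: Kill process 4242 (java) score 900 or sacrifice child
[1002.000000] eth0: link up
"""

for line in find_oom_events(SAMPLE, process_name="java"):
    print(line)
```

For a live check, the text could come from the output of `dmesg` or from a messages file instead of `SAMPLE`.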
Also see
- A detailed analysis of the Linux kernel OOM mechanism (Linux内核OOM机制的详细分析)
- What’s the difference of dmesg output and /var/log/messages?
- How to Deal With Non-heap or Native Memory Leak
3. strace
when
“strace – trace system calls and signals”
Well... we can use `strace` to see the system calls a program makes and the signals it receives, including the parameters and return values of the system calls.
how
I. follow call and signals
Use the strace command to trace system calls:
strace ./[program]
We will see each system call with its parameters and return value, as well as the signals:
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90c0139000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90c0137000
arch_prctl(ARCH_SET_FS, 0x7f90c0137740) = 0
mprotect(0x7f90bff19000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7f90c0146000, 4096, PROT_READ) = 0
munmap(0x7f90c013a000, 39504) = 0
fstat(0, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90c0143000
read(0, 0x7f90c0143000, 1024) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=13689, si_uid=1000} ---
+++ killed by SIGTERM +++
II. count system call
strace -c ./[program]
We will see a per-syscall summary like this:
vagrant@vagrant-ubuntu-trusty:~/test$ strace -c ./test_strace
^CProcess 13742 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
32.09 0.000060 8 8 mmap
23.53 0.000044 15 3 fstat
14.44 0.000027 7 4 mprotect
9.63 0.000018 18 1 munmap
8.56 0.000016 8 2 open
6.95 0.000013 4 3 3 access
1.60 0.000003 3 1 read
1.07 0.000002 1 2 close
1.07 0.000002 2 1 execve
0.53 0.000001 1 1 brk
0.53 0.000001 1 1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.000187 27 3 total
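If we want to compare such summaries between runs, it helps to turn the table into data. The Python sketch below parses the summary shown above. The function name is made up for this example, and the parser assumes the column layout of this strace version (the errors column may be empty, and the total row is skipped).

```python
def parse_strace_summary(text):
    """Parse `strace -c` summary rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields, or 6 when the errors column is filled.
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            percent, seconds = float(fields[0]), float(fields[1])
            usecs, calls = int(fields[2]), int(fields[3])
            errors = int(fields[4]) if len(fields) == 6 else 0
        except ValueError:
            continue  # header or separator line
        rows.append({"syscall": fields[-1], "percent": percent,
                     "seconds": seconds, "usecs_per_call": usecs,
                     "calls": calls, "errors": errors})
    return rows

# The summary from the example run above.
SUMMARY = """\
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
32.09 0.000060 8 8 mmap
23.53 0.000044 15 3 fstat
14.44 0.000027 7 4 mprotect
9.63 0.000018 18 1 munmap
8.56 0.000016 8 2 open
6.95 0.000013 4 3 3 access
1.60 0.000003 3 1 read
1.07 0.000002 1 2 close
1.07 0.000002 2 1 execve
0.53 0.000001 1 1 brk
0.53 0.000001 1 1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.000187 27 3 total
"""

rows = parse_strace_summary(SUMMARY)
for row in sorted(rows, key=lambda r: r["calls"], reverse=True)[:3]:
    print(row["syscall"], row["calls"], row["errors"])
```

The parsed call and error counts add up to the values in the total row (27 calls, 3 errors), which is a handy sanity check for the parser.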
III. other options
option | effect |
---|---|
-o | write the output to a file |
-T | show the time spent in each system call |
-p | trace a running process, using -p [pid] |
-e | a qualifying expression that modifies which events to trace or how to trace them, e.g. -e trace=signal to trace only signals, or -e write=[fd] to dump the data written to file descriptor [fd] |
4. ulimit
when
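In Linux, every process has resource limits, such as the number of sub-processes, the number of open files, the core dump file size, etc.
If our process exceeds one of these limits, the operation that needs the resource fails, which can lead to a crash.
how
Limits are attached to each process, but for some resources the usage is counted at the user level.
For example, the 'max user processes' limit is checked against the total number of processes owned by the user: the processes of AProc plus those of BProc, if both run as the same user. (The 'open files' limit, by contrast, is counted per process.)
This is one reason why software like 'apache' is run under its own dedicated user.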
To see the limits of a process, use
cat /proc/[pid]/limits
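The output of that file is a fixed-width table, so it is easy to read from a script. Here is a minimal Python sketch that parses it. The function name is made up for this example, the parser assumes columns are separated by at least two spaces, and the sample rows are illustrative values in that layout.

```python
import re

def parse_limits(text):
    """Parse the content of /proc/<pid>/limits into {name: (soft, hard, units)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) < 3:
            continue
        name, soft, hard = parts[:3]
        units = parts[3] if len(parts) > 3 else ""
        limits[name] = (soft, hard, units)
    return limits

# Illustrative sample in the layout of /proc/<pid>/limits.
SAMPLE_LIMITS = (
    "Limit                     Soft Limit           Hard Limit           Units     \n"
    "Max core file size        0                    unlimited            bytes     \n"
    "Max processes             7823                 7823                 processes \n"
    "Max open files            1024                 4096                 files     \n"
    "Max nice priority         0                    0                    \n"
)

print(parse_limits(SAMPLE_LIMITS)["Max open files"])

try:
    with open("/proc/self/limits") as fh:
        own = parse_limits(fh.read())
    print("this process:", own.get("Max open files"))
except OSError:
    print("/proc/self/limits is not available on this system")
```

Replacing `self` with a PID reads the limits of another process, subject to the usual permission checks.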
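Where do the limit values come from?
- init proc: set by the kernel
- system service: set by `setrlimit`
- shell proc: set per login from `/etc/security/limits.conf`, or by an earlier `ulimit -Sx` command
- program executed from the shell: inherited from the shell process
To modify process limits (as an unprivileged user we can only lower limits, or raise a soft limit up to its hard limit):
Use `ulimit` to modify the limits of the shell process and of the processes subsequently forked from that shell.
Use `echo -n "Max processes=xx:yy" > /proc/<pid>/limits` to modify a limit of a running process (this needs root and a kernel that allows writing to this file; where it does not, the `prlimit` command from util-linux can change the limits of a running process).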
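The inheritance rule is easy to observe. The Python sketch below lowers the soft 'open files' limit in a forked child just before it executes a program, and the program then reports the limit it sees. The function name and the value 64 are arbitrary choices for this example, and the sketch assumes the hard limit is at least 64.

```python
import resource
import subprocess
import sys

def run_with_nofile_limit(soft_limit, argv):
    """Run argv with a lowered soft open-files limit, leaving ours untouched."""
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def lower_limit():
        # Runs in the forked child just before exec, so only the child
        # (and anything it starts) sees the new soft limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))

    return subprocess.run(argv, preexec_fn=lower_limit,
                          capture_output=True, text=True)

# The child program simply prints its own soft open-files limit.
CHILD = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"

parent_before = resource.getrlimit(resource.RLIMIT_NOFILE)
result = run_with_nofile_limit(64, [sys.executable, "-c", CHILD])
print("child soft limit:", result.stdout.strip())
print("parent soft limit:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

This mirrors what happens when we run `ulimit` in a shell and then start a program from it: the program starts with whatever limits its parent had at fork time.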