Troubleshooting Tools

0. The purpose


There are many tools for tracking down system problems. It is like a detective’s work: first, we must find ‘evidence’ as quickly as possible. Besides awareness, using the right tool for the problem is vital. Let’s record them here.

1. Core dump


when

The first ‘tool’ is not really a tool; it is a snapshot file that records memory, processor state, registers and so on at the moment the OS terminates our crashed program. Like a javadump file, it is very useful for diagnosing and debugging the problem.

When can we get a core dump file?

Linux uses signals as an asynchronous event-handling mechanism. Each signal has a default action, such as Ignore (ignore the signal), Stop (suspend the process), Terminate (terminate the process) and Core (terminate the process and dump core).

So when a signal whose default action is Core is delivered and not handled, Linux generates a core dump.

Signal    Action  Description
SIGQUIT   Core    Quit from keyboard, e.g. "Ctrl+\"
SIGILL    Core    Illegal instruction, e.g. kill -ILL $$
SIGABRT   Core    Abort signal from abort()
SIGSEGV   Core    Invalid memory reference, e.g. writing through a null pointer or overflowing the stack
SIGTRAP   Core    Trace/breakpoint trap

For the full signal list, see the signal(7) manual page. The signal recorded in the core dump helps us recognize the type of problem.
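To see a Core-action signal from the shell's side, here is a minimal sketch. It assumes a Linux shell such as bash or dash, where a child killed by signal N is reported as exit status 128+N and SIGSEGV is signal 11. The function names are made up for this example.

```shell
# Run a throwaway child shell that sends SIGSEGV to itself and print the
# exit status the parent sees. Core files are disabled in the child so
# the demo leaves nothing behind.
status_after_segv() {
    rc=0
    sh -c 'ulimit -c 0; kill -s SEGV $$' || rc=$?
    echo "$rc"
}

# Translate an exit status back into a signal number (0 means "not
# obviously a signal"). A program can also exit with a plain code above
# 128, so treat the answer as a hint, not proof.
signal_from_status() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo 0
    fi
}

status=$(status_after_segv)
echo "exit status: $status, signal: $(signal_from_status "$status")"
```

The shell may also print a "Segmentation fault" notice on stderr; that is the same event reported in words.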

how

I. enable configuration

Hmm… but we often hear ‘I could not find any core dump when it crashed’.

A few settings control whether the system generates a core dump.

Run ulimit -c. If the result is 0, core dumps are disabled (often the default) and no core file is generated. Use ulimit -c unlimited to enable core dumps without limiting the core file size.

That command only affects the current terminal session. To make the change permanent, edit /etc/security/limits.conf:

#/etc/security/limits.conf
#Each line describes a limit for a user in the form:
#<domain>      <type>  <item>         <value>
      *         soft     core        unlimited
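
Before reproducing a crash it helps to check both the soft and the hard limit, since the soft limit can only be raised as far as the hard one. A small sketch follows; describe_core_limit is a hypothetical helper, and the block size behind the number depends on the shell.

```shell
# Describe a core-size limit as printed by `ulimit -c`
# (a block count, or the word "unlimited").
describe_core_limit() {
    case "$1" in
        0)         echo "disabled" ;;
        unlimited) echo "enabled, no size cap" ;;
        *)         echo "enabled, capped at $1 blocks" ;;
    esac
}

echo "soft: $(describe_core_limit "$(ulimit -Sc)")"
echo "hard: $(describe_core_limit "$(ulimit -Hc)")"
```

A capped limit can still produce a truncated core file, which gdb may not be able to read fully.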

II. dump file path
  • By default the core file is written to the current working directory of the crashed process, and its name is core.
  • Run sysctl -a | grep core: kernel.core_pattern gives the dump file path (or name template), and setting kernel.core_uses_pid to 1 appends the process ID to the core file name.
  • To change these, use sysctl -w, or edit /etc/sysctl.conf and reload it with sysctl -p, as root (e.g. via sudo).
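
When no core file shows up where you expect it, the value of kernel.core_pattern is usually the reason. The sketch below classifies a pattern; classify_core_pattern is an illustrative name, and it relies on the documented rule that a leading "|" makes the kernel pipe the dump to a helper program instead of writing a file.

```shell
# Classify a kernel.core_pattern value.
#   leading "|" : the dump is piped to a helper program
#   leading "/" : absolute path template
#   otherwise   : file name relative to the crashed process's working directory
classify_core_pattern() {
    case "$1" in
        '|'*) echo "piped to helper: ${1#?}" ;;
        /*)   echo "absolute path template: $1" ;;
        *)    echo "relative to process cwd: $1" ;;
    esac
}

# Inspect the live setting when /proc is available.
if [ -r /proc/sys/kernel/core_pattern ]; then
    classify_core_pattern "$(cat /proc/sys/kernel/core_pattern)"
fi
```

If the pattern is piped to a helper, look for the dump wherever that helper stores it, not in the program's directory.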

III. debug core file

Use the command gdb [program] [coredump] to inspect the core file:

vagrant@vagrant-ubuntu-trusty:~/test$ gdb seg core
GNU gdb (Ubuntu 7.8-1ubuntu4) 7.8.0.20141001-cvs
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from seg...(no debugging symbols found)...done.
[New LWP 12887]
Core was generated by `./seg'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000400506 in main ()
(gdb) where
#0  0x0000000000400506 in main ()
(gdb) info frame
Stack level 0, frame at 0x7fff961538c0:
 rip = 0x400506 in main; saved rip = 0x7f17f9d0eec5
 Arglist at 0x7fff961538b0, args:
 Locals at 0x7fff961538b0, Previous frame's sp is 0x7fff961538c0
 Saved registers:
  rbp at 0x7fff961538b0, rip at 0x7fff961538b8
(gdb) Quit

IV. generate dump with gdb

We can easily use gdb to generate a core dump of a running process.

First, attach to the Java process by PID:

gdb -q --pid=xxx

Then:

(gdb) generate-core-file

Finally, don’t forget to detach from the process:

(gdb) detach
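
The same three steps can be run non-interactively. This is a sketch only: gdb_dump_command is a hypothetical helper that prints the command instead of executing it, 12345 is a placeholder PID, and it assumes a gdb build that supports --batch and -ex.

```shell
# Print (not run) a gdb command line that attaches to a PID, writes a
# core file, and detaches. --batch makes gdb exit once the -ex commands
# have finished.
gdb_dump_command() {
    pid=$1
    out=${2:-core.$pid}
    printf 'gdb -q --batch -p %s -ex "generate-core-file %s" -ex detach\n' "$pid" "$out"
}

gdb_dump_command 12345
```

Printing first makes it easy to review the PID and output path before pausing a production process, since the target is stopped while the dump is written.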

V. for java process

If the core dump comes from a Java process, we can convert it to a Java heap dump with jmap:

jmap -dump:format=b,file=heap.hprof $JAVA_HOME/bin/java core.xxx

Then use a tool such as MAT to analyze the memory.

Or we can use jstack to see the Java stacks:

jstack -m $JAVA_HOME/bin/java core.xxx

(PS: jstack may hit a bug here.)

2. dmesg/messages


when

When a process crashes, we need to check whether it was killed by the oom-killer.

how

We can use dmesg and /var/log/messages* (/var/adm/messages.* on some Unix systems) to see kernel messages: dmesg reads the kernel ring buffer, while the messages files hold the full system log.

Besides boot information, oom-killer activity is recorded there. We can use

sudo dmesg | grep java | grep -i oom-killer

to check whether the oom-killer was involved.
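
A line containing both "java" and "oom-killer" typically shows that java triggered the oom-killer, while the victim is usually named on a separate "Killed process" line. The sketch below casts a wider net. The helper names are illustrative, and the patterns are assumptions about kernel message wording, which varies between kernel versions.

```shell
# Keep only log lines that look like OOM-killer activity.
# Reads log text on stdin, e.g.:  sudo dmesg | oom_lines
oom_lines() {
    grep -iE 'oom-killer|out of memory|killed process' || true
}

# From the same input, print "<pid> <command>" for each victim, assuming
# the kernel's "Killed process <pid> (<command>)" wording.
oom_victims() {
    sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

Usage: sudo dmesg | oom_victims lists the processes that were actually killed, which can then be matched against the crashed PID.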

3. strace


when

“strace – trace system calls and signals”

Hmm… in other words, strace shows the system calls a program makes and the signals it receives, including each call’s parameters and return value.

how

I. follow calls and signals

Use the strace command to trace system calls:

strace ./[program]

The output shows each system call’s parameters and return value, as well as signals:

mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90c0139000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90c0137000
arch_prctl(ARCH_SET_FS, 0x7f90c0137740) = 0
mprotect(0x7f90bff19000, 16384, PROT_READ) = 0
mprotect(0x600000, 4096, PROT_READ)     = 0
mprotect(0x7f90c0146000, 4096, PROT_READ) = 0
munmap(0x7f90c013a000, 39504)           = 0
fstat(0, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90c0143000
read(0, 0x7f90c0143000, 1024)           = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=13689, si_uid=1000} ---
+++ killed by SIGTERM +++

II. count system calls

strace -c ./[program]

The summary looks like this:

vagrant@vagrant-ubuntu-trusty:~/test$ strace -c ./test_strace
^CProcess 13742 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 32.09    0.000060           8         8           mmap
 23.53    0.000044          15         3           fstat
 14.44    0.000027           7         4           mprotect
  9.63    0.000018          18         1           munmap
  8.56    0.000016           8         2           open
  6.95    0.000013           4         3         3 access
  1.60    0.000003           3         1           read
  1.07    0.000002           1         2           close
  1.07    0.000002           2         1           execve
  0.53    0.000001           1         1           brk
  0.53    0.000001           1         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000187                    27         3 total        
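
In a long summary the interesting rows are often the ones with a non-empty errors column. A small sketch that picks them out follows; failing_syscalls is a made-up helper name, and it assumes the column layout shown in the sample above.

```shell
# From `strace -c` summary text on stdin, print "<syscall> <errors>/<calls>"
# for every row whose errors column is filled in. Rows without errors have
# only five fields, and the "total" row is skipped explicitly.
failing_syscalls() {
    awk '$1 ~ /^[0-9.]+$/ && NF == 6 && $NF != "total" { print $6, $5 "/" $4 }'
}

# Usage (strace writes the summary to the file given with -o):
#   strace -c -o summary.txt ./test_strace
#   failing_syscalls < summary.txt
```

For the sample above this prints access 3/3. Failed access calls at startup are often just the dynamic loader probing for optional files, so a non-zero count is a lead to follow, not a verdict.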

III. other options

Option  Effect
-o      Write the output to a file
-T      Show the time spent in each system call
-p      Trace a running process, using -p [pid]
-e      A qualifying expression that selects which events to trace or how to trace them, e.g. -e trace=signal to trace only signals, or -e trace=write to trace only write calls

4. ulimit


when

In Linux, every process has resource limits, such as the number of processes, the number of open files, the core dump file size, etc.

If our process exceeds one of these limits, the failing operation returns an error, which often leads to a crash.

how

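Limits are set per process, but for some resources the usage is counted per user.

For example, the ‘max user processes’ limit is checked against the total number of processes owned by the user, so AProc’s and BProc’s processes are added together when they run as the same user. (The open-file limit, by contrast, is counted per process.)

This is one reason why software like ‘apache’ runs under its own dedicated user.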

To see a process’s limits, use

cat /proc/[pid]/limits
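
To pull a single row out of that file in a script, something like the following works. It is a sketch: limit_of is a hypothetical helper, and it assumes the usual layout in which columns are separated by runs of spaces while limit names contain only single spaces.

```shell
# Print "<soft> <hard>" for one named row of a /proc/<pid>/limits style file.
# Splitting on two-or-more spaces keeps names such as "Max open files"
# together in the first field.
limit_of() {
    awk -F'  +' -v name="$2" '$1 == name { print $2, $3 }' "$1"
}

if [ -r /proc/self/limits ]; then
    limit_of /proc/self/limits "Max open files"
fi
```

Pass /proc/[pid]/limits for any other process you are allowed to read.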

Where do the limit values come from?

  • init process: set by the kernel
  • system services: set with setrlimit
  • login shell: set at login from /etc/security/limits.conf, or by a previous ulimit -Sx command
  • programs run from the shell: inherited from the shell process

To modify a process’s limits (an unprivileged process can only raise its soft limit, and only up to the hard limit):

With ulimit we can modify the limits of the shell process and of the processes subsequently forked from the shell.
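
A quick way to convince yourself of the inheritance is to change a limit in a subshell and ask a child what it sees. This is a sketch; limit_seen_by_child is an illustrative name, and 64 is an arbitrary value assumed to be below the hard limit.

```shell
# Set the soft open-files limit inside a subshell, then start a child
# process from that subshell and print the limit the child reports.
# The parentheses keep the change from leaking into the calling shell.
limit_seen_by_child() {
    ( ulimit -Sn "$1" && sh -c 'ulimit -Sn' )
}

echo "child started after ulimit: $(limit_seen_by_child 64)"
echo "calling shell, unchanged:   $(ulimit -Sn)"
```

Processes that were already running before the ulimit call keep their old limits, which is why a service usually has to be restarted from the reconfigured shell.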

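On kernels that allow writing to this file, we can modify a limit of a running process (root required) with

echo -n "Max processes=xx:yy" >/proc/<pid>/limits

Mainline kernels expose /proc/<pid>/limits as read-only; there, the prlimit command from util-linux does the same job.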
