Oracle Delivery in Duty

May 4, 2009

CPU Monitoring

 

Linux Windows AIX Solaris

Linux

‘top’ can give you a overall view of CPU usage by process:

top - 23:50:16 up  3:25,  1 user,  load average: 0.00, 0.00, 0.00
Tasks:  88 total,   1 running,  87 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   2055112k total,   227684k used,  1827428k free,    53556k buffers
Swap:  2096472k total,        0k used,  2096472k free,   100884k cached

  PID USER      PR  NI  VIRT  RES  SHR S &%CPU %MEM    TIME+  COMMAND
    1 root      16   0  4876  596  500 S  0.0  0.0   0:00.78 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    4 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    5 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
    7 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
    9 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/3
   10 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/0
   11 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/1
   12 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/2

The following keys are very useful:

Type '1' - displays all CPUs (if hyber-threaded, you will see them too)
Type 'c' - lists full command lines

‘vmstat’ can give you CPU info along with memory, io, and swap info:

# vmstat -a
procs                      memory      swap          io     system         cpu
 r  b   swpd   free  inact active   si   so    bi    bo   in    cs us sy wa id
12  0      0 129600 3772736 2850908    0    0     3    15   14    39  4  1  1 39
- If runnable threads (r column) divided by the number of CPUs is greater than 1, there may be a CPU
bottleneck
- If us+sy is close to 100%, then there may be a CPU bottleneck.  See action to determine next steps.

Windows

Windows Task Manager can be used to monitor your system. You can start by “Start->Run->(type) taskmgr and hit <Enter>”

cpu_mon_win_taskmgr
This screen shows a 4-CPU system with average CPU usage 38&%.  Individual CPU usage is also shown clearly.

To automate and capture CPU usage into a file using ‘perfmon’ tool that comes with Windows 2003, see “perf automation” section from the left navigation.

AIX

’sar’ tells you overall system CPU usage, if you just want to know if your system is busy or not, then use ’sar’.

# sar -u 10 3

AIX lpar05 2 5 00040B1F4C00 04/07/03
17:54:58 %usr %sys %wio %idle
17:55:08   30   57    1    12
17:55:18   29   57    1    12
17:55:28   26   43    1    29
Average    29   53    1    18

The output shows that the system spent 29% in user mode (your applications most likely), 53% in system
mode (OS-related, e.g., CPU-comsuming libraries), and 1% waiting for IO requests, and
was idle 18% of the time. If $usr + %sys = 100%, there may be a CPU bottleneck.

‘vmstat’ can be used to determine if there maybe a CPU bottleneck:

# vmstat 5 3
kthr      memory               page            faults         cpu
----- ----------- ------------------------ ------------ ---------------
r b 	avm fre 	re pi po fr sr cy 		in sy cs 	   us sy id wa
2 3     298565    163 0 14 58 2047 8594 0 971 217296 1286 43 56 17 34
2 2     298824    124 0 29 20 251 352 0 800 248079 1039   22 28 22 29
1 7     300027    293 0 15 6 206 266 0 1150 91086 479     7  14  9 69
- If runnable threads (r column) divided by the number of CPUs is greater than 1, there may be a CPU
bottleneck
- If us+sy is close to 100&%, then there may be a CPU bottleneck.  See action to determine next steps.

 

Solaris

To monitor for a total of 600 seconds (10 min) the CPU usage, open a terminal, and type the following (5-second interval for 120 intervals):

# sar -A 5 120
SunOS ultra5 5.9 sun4u  05/14/2002

15:44:26  %usr  %sys  %wio  %idle
15:44:31    1    2    0   97
15:44:36    1    2    0   96
15:44:41    9    2    9   80
15:44:46    2    3    0   95
15:44:51    0    1    0   99
15:44:56    3    2    0   96
15:45:01    0    0    0   100
15:45:06   15    0    0   85
15:45:11    2    0    0   97

CPU Analysis – Locating Dominant Processes

Filed under: Performance / Tuning Tips. To the Point — Tags: , , , , , , , , , , — Ryan Stephanus @ 9:50 pm

 

  Linux  Windows  AIX  Solaris

 

Linux

To determine which process is hogging CPU, use ‘top’:

top - 23:50:16 up  3:25,  1 user,  load average: 0.00, 0.00, 0.00
Tasks:  88 total,   1 running,  87 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0&percnt; us,  0.0&percnt; sy,  0.0&percnt; ni, 100.0&percnt; id,  0.0&percnt; wa,  0.0&percnt; hi,  0.0&percnt; si
Mem:   2055112k total,   227684k used,  1827428k free,    53556k buffers
Swap:  2096472k total,        0k used,  2096472k free,   100884k cached

  PID USER      PR  NI  VIRT  RES  SHR S &percnt;CPU &percnt;MEM    TIME+  COMMAND
    1 root      16   0  4876  596  500 S  0.0  0.0   0:00.78 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    4 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    5 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1
    6 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
    7 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
    9 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/3
   10 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/0
   11 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/1
   12 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/2  

‘ps aux –sort pcpu’ can give you CPU info in a simpler format at command prompt:

# ps aux --sort pcpu
....

db2fenc1  5849  0.0  0.1 319016 9140 ?       S    Aug10   0:10 db2fmp  ,1,0,0,0,
0,0,0,1e014,2,0,1,69fe0,0x12ac0000,0x12ac0000,eaa4
db2inst1  7462  0.0  0.4 369432 33408 ?      S    Aug11   4:34 db2agent (idle)
db2inst1 30723  0.0  0.4 370592 37540 ?      S    Sep04  14:04 db2agent (idle)
root      5640  0.0  0.0  1500  428 tty1     S    Sep26   0:00 /sbin/mingetty tty1
root     25189  0.0  0.0  1796  504 ?        S    Oct12   0:00 /usr/sbin/vsftpd /etc/vsftpd/vsftpd.conf
root     15515  0.0  0.0  6872 2188 ?        S    15:36   0:00 sshd: root@pts/1
root     15518  0.0  0.0  4252 1388 pts/1    S    15:36   0:00 -bash
root     15765  0.0  0.0  2724  776 pts/1    R    16:03   0:00 ps aux --sort pcpu
root      1554 43.0  0.0 194024 3744 ?       R    Aug08 73330:27 /usr/local/staf/bin/STAFProc

If the process is normal (error-free), you system may just be handling the peak load (more load will crash it), so it is time to increase CPU capacity.  If you believe the workload is not heavy, then your application may need to be tuned.  See “Code Tuning” for the next step.

 

Windows

Windows Task Manager can be used to monitor your system’s CPU uage.  You can start by “Start->Run->(type) taskmgr and hit <Enter>”.  Look at the ‘Processes’ tab, and click on the ‘CPU’ column to sort by CPU usage.  The highest CPU-using process is on the top.


cpu_mon_win_taskmgr_process

 

To automate and capture CPU usage into a file using ‘perfmon’ tool that comes with Windows 2003, see “perf automation” section from the left navigation.

 

AIX

‘ps aux’ will show cpu usage sorted by most usage first.  Do a ‘more’ to see the top ones.

# ps aux |more

USER         PID &percnt;CPU &percnt;MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
root       53274 12.4  0.0   40   40      - A      Nov 28 8678:15 wait
root       45078 12.4  0.0   40   40      - A      Nov 28 8677:24 wait
root       69666 12.4  0.0   40   40      - A      Nov 28 8677:08 wait
root       61470 12.4  0.0   40   40      - A      Nov 28 8669:01 wait
root       65568 12.4  0.0   40   40      - A      Nov 28 8659:51 wait
root        8196 12.4  0.0   40   40      - A      Nov 28 8656:59 wait
root       49176 12.4  0.0   40   40      - A      Nov 28 8654:17 wait
root       57372 12.4  0.0   40   40      - A      Nov 28 8633:34 wait
root       90280  0.1  0.0  492  500      - A      Nov 28 55:12 /usr/sbin/syncd
root     1716352  0.0  8.0 1154040 1154072      - A      Dec 02  3:47 /usr/WebSphere/
db2inst1 1143036  0.0  0.0 1044  460      - A      Dec 02  2:11 db2disp 0
root           0  0.0  0.0   64   64      - A      Nov 28  4:55 swapper

A better CPU profiling tool is called ‘tprof’, available in the latest AIX versions.  It profiles CPU usage for a given interval, and prints out top CPU-hogging program(s) and their associated libraries. While your application is running under load, do the following:

# tprof -x sleep 60

Sun Dec  4 15:55:01 2005
System: AIX 5.3 Node: pw101 Machine: 002FBF7D4C00
Starting Command sleep 60
stopping trace collection.
Generating sleep.prof

# more sleep.prof

Configuration information
=========================
System: AIX 5.3 Node: pw101 Machine: 002FBF7D4C00
Tprof command was:
    tprof -x sleep 60
Trace command was:
    /usr/bin/trace -ad -L 1000000 -T 500000 -j 000,001,002,003,38F,005,006,134,139,5A2,465,00A,234 -o -
Total Samples = 48008
Total Elapsed Time = 60.01s
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Process                                Freq  Total Kernel   User Shared  Other
=======                                ====  ===== ======   ==== ======  =====
wait                                      8  99.69  99.69   0.00   0.00   0.00
/usr/bin/tprof                            1   0.14   0.00   0.00   0.13   0.00
/home/db2inst1/sqllib/bin/db2fm
                                          6   0.04   0.02   0.00   0.02   0.00
/etc/syncd                                2   0.03   0.03   0.00   0.00   0.00
/home/dasusr1/das/bin/db2fm               6   0.01   0.01   0.00   0.00   0.00
db2hmon                                   1   0.01   0.01   0.00   0.00   0.00
db2agent                                  1   0.01   0.01   0.00   0.00   0.00
/home/db2inst1/sqllib/adm/db2set
                                          6   0.01   0.01   0.00   0.00   0.00
/usr/bin/sh                               6   0.01   0.01   0.00   0.00   0.00
r/WebSphere/AppServer/java/bin//java      3   0.01   0.01   0.00   0.00   0.00
db2disp                                   1   0.00   0.00   0.00   0.00   0.00

If the process is normal (error-free), you system may just be handling the peak load (more load will crash it), so it is time to increase CPU capacity.  You can certainly change the priority the busy application is running at by using ‘nice’ and ‘renice’ commands (not very often used).  Alternatively, if you believe the workload is not heavy, but CPU usage is high, then your application may need to be tuned.  See “Code Tuning” for the next step.

 

Solaris

Information coming soon (if you have useful information regarding this topic, please send it to me ryan.stephanus@live.com, thanks.)

Memory Monitoring

 

  Linux  Windows  AIX  Solaris

 

Linux

To see if your system is short of physical memory, you can use /proc/meminfo and vmstat:

# more /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  8384733184 3916734464 4467998720        0 43380736 1746464768
Swap:        0        0        0
MemTotal:      8188216 kB
MemFree:       4363280 kB
MemShared:           0 kB
Buffers:         42364 kB
Cached:        1705532 kB
SwapCached:          0 kB
Active:        1730232 kB
ActiveAnon:    1359612 kB
ActiveCache:    370620 kB
Inact_dirty:   1076176 kB
Inact_laundry:  173872 kB
Inact_clean:    177952 kB
Inact_target:   631644 kB
HighTotal:     7455648 kB
HighFree:      4335064 kB
LowTotal:       732568 kB
LowFree:         28216 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Committed_AS:  4099888 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

The ‘vmstat’ example below shows 3 snapshots of system memory usage (along with other info) with 1 second apart:

# vmstat 1 3
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy wa id
15  0      0 4363300  42364 1705532    0    0     1     0    1     1  0  0  1  0
15  0      0 4363300  42364 1705532    0    0     0     0  117   126  6 94  0  0
15  0      0 4363300  42364 1705532    0    0     0     0  114   112  6 94  0  0

 

Windows

Windows Task Manager can be used to monitor your system.  You can start by “Start->Run->(type) taskmgr and hit <Enter>”

cpu_mon_win_ram_exhausted
This screen shows a 4-CPU system with current available physical memory = 2.5MB.  If
physical memory runs out, your system is paging (disk IO) and your applications will
be slow.

To automate and capture memory usage into a file using ‘perfmon’ tool that comes with Windows 2003, see “perf automation” section from the left navigation.

 

AIX

‘vmstat -s’ sends a summary report to STDOUT starting from system initialization expressed in absolute counts rather than on a interval basis, as ‘vmstat 1 10′ shows below.

# vmstat -s
            725859553 total address trans. faults
              2158076 page ins
              9763908 page outs
                    0 paging space page ins
                    0 paging space page outs
                    0 total reclaims
            325664108 zero filled pages faults
                20972 executable filled pages faults
                    0 pages examined by clock
                    0 revolutions of the clock hand
                    0 pages freed by the clock
              9798459 backtracks
                    0 free frame waits
                    0 extend XPT waits
               147220 pending I/O waits
             11921921 start I/Os
              1700772 iodones
           2022681020 cpu context switches
             41711252 device interrupts
            302225212 software interrupts
           8098172091 decrementer interrupts
               585603 mpc-sent interrupts
               585601 mpc-receive interrupts
                    0 phantom interrupts
                    0 traps
           6747450643 syscalls

‘vmstat’ can be used to determine if there maybe a memory bottleneck by looking at the ‘avm’, ‘pi’ and ‘po’ columns:

 

 

 

# vmstat 1 10

kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 2  3 298565    163 0  14  58 2047 8594 0  971 217296 1286 23 26 17 34
 2  2 298824    124 0  29  20  251  352 0  800 248079 1039 22 28 22 29
 1  7 300027    293 0  15   6  206  266 0 1150  91086  479  7 14  9 69
 0 13 300233    394 0   1   0  127  180 0  894   6412  276  2  2  0 96
 0 14 300453    543 0   4   0   45   82 0  793   5976  258  1  2  0 97
 0 14 301488    329 0   2   2  116  179 0  803   6806  282  1  3  0 96
 0 14 302207    435 0   5   4  112  159 0  821  12349  402  2  3  0 95
 3  9 301740   2240 0  70   9  289  508 0  963 187874 1089 19 31  6 44
 1  4 271719  30561 0  39   0    0    0 0  827 203604 1217 21 31 19 30
 3  2 269996  30459 0  16   0    0    0 0  764 182351 1387 18 25 34 23
Columns 'pi' and 'po' indicate the number of paging space page-ins/per second and the number
of paging space page-outs/per second.  If the values are constantly non-zero, there is most
likely a memory bottleneck.  Occasional non-zero values don't necessarily indicate a problem
since paging is the main principal of virtual memory.

 

Solaris

‘vmstat’ can be used to determine if there maybe a memory bottleneck by looking at the ‘avm’, ‘pi’ and ‘po’ columns:

# vmstat 1 10
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 1  0 3279604 34633454   0   0   0   0    0   0   7 10513 568  0  0 99  0
 0  0 3279573 34633485   0   0   0   0    0   0   6  209 522  0  0 99  0
 0  0 3279573 34633485   0   0   0   0    0   0   1  209 520  0  0 99  0
 0  0 3279573 34633485   0   0   0   0    0   0   3  222 520  0  0 99  0
 0  0 3279573 34633485   0   0   0   0    0   0   1  222 524  0  0 99  0
 0  0 3279573 34633485   0   0   0   0    0   0   1  210 516  0  0 99  0
 0  0 3279573 34633485   0   0   0   0    0   0   6  209 530  0  0 99  0
Columns 'pi' and 'po' indicate the number of paging space page-ins/per second and the number
of paging space page-outs/per second.  If the values are constantly non-zero, there is most
likely a memory bottleneck.  Occasional non-zero values don't necessarily indicate a problem
since paging is the main principal of virtual memory.

Disk I/O Monitoring

  Linux  Windows  AIX  Solaris

 

Linux

Both ’sar’ and ‘iostat’ are available on Linux to monitor disk activities:

# sar -d 1 5
Linux 2.4.21-27.ELsmp (pw101)         12/04/2005

06:33:54 PM       DEV       tps  rd_sec/s  wr_sec/s
06:33:55 PM    dev8-0      0.00      0.00      0.00
06:33:55 PM   dev8-16      0.00      0.00      0.00
06:33:55 PM   dev8-32      0.00      0.00      0.00
06:33:55 PM   dev8-48      0.00      0.00      0.00
06:33:55 PM   dev8-64      0.00      0.00      0.00
06:33:55 PM   dev8-65      0.00      0.00      0.00
06:33:55 PM   dev8-66      0.00      0.00      0.00

# iostat 1 3
Linux 2.4.21-27.ELsmp (pw101)         12/04/2005

avg-cpu:  %user   %nice    %sys %iowait   %idle
           7.14    0.00    4.53    0.01   88.32

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.23       1066    2377572
sdb               0.01         0.39         0.00    3941054        118
sdc               0.00         0.23         0.00    2364386         16
sdd               0.00         0.00         0.00         24          0
sde               0.35         0.09         6.88     933722   70340208
sde1              0.35         0.09         6.88     933218   70340208
sde2              0.00         0.00         0.00        280          0
If '%idle' goes below 20%, the system maybe queuing up disk I/Os and response time suffers.
'1 5' and '1 3' parameters are intervals and iterations (e.g., 1-second interval 5 iterations).

 

Windows

‘Perfmon’ is a Windows tool that can be configured to monitor (and log to file) system resources.  Physical disk counters are one area that can be added to the overall resources to be monitored.  You can start ‘perfmon’ simply by typing ‘perfmon’ in the “Start->Run…” window:


disk_tune_perfmon-addcounter
Once you created a counter log, for example, named 'perf', you can add counters to it.
Double-click 'perf' to open it, and click on "Add Counters", and select
"PhysicalDisk" in the 'Performance object' dropdown list.  Then select the counters needed.

A couple of indicators must be monitored for hard disks in your system. Watch the Physical Disk (instance)Disk Transfers/sec counter for each physical disk and if it goes above 25 disk I/Os per second then you’ve got poor response time for your disk. A bottleneck from a disk can significantly impact response time for applications running on your system, so you should investigate this further by tracking Physical Disk(instance)% Idle Time, which measures the percent time that your hard disk is idle during the measurement interval, and if you see this counter fall below 20% then you’ve likely got read/write requests queuing up for your disk which is unable to service these requests in a timely fashion. In this case it’s time to upgrade your hardware to use faster disks or scale out your application to better handle the load.
 

disk_tune_perfmon-addcounter_2

Then you will need to start the perf counter log by click on the play button (the icon like the one your VCR).  This starts the monitoring of the counters you added and logging.  You may at a later time, stop it, and play it back as shown below.
 disk_tune_perfmon-playback

AIX

'sar -d' sends a snapshot of disk activities to STDOUT.  You can supply the interval and iteration parameters for the command to repeat:
# sar -d 1 1

AIX cmcs101 3 5 002FBF7D4C00    01/04/06

System configuration: lcpu=8 drives=22

14:25:18     device    %busy    avque    r+w/s    Kbs/s   avwait   avserv

14:25:19     hdisk6      0      0.0        0        0      0.0      0.0
             hdisk7      0      0.0        0        0      0.0      0.0
             hdisk4      0      0.0        0        0      0.0      0.0
             hdisk5      0      0.0        0        0      0.0      0.0
             hdisk8      0      0.0        0        0      0.0      0.0
            hdisk11      0      0.0        0        0      0.0      0.0
             hdisk9      0      0.0        0        0      0.0      0.0
            hdisk10      0      0.0        0        0      0.0      0.0
            hdisk12      0      0.0        0        0      0.0      0.0
            hdisk13      0      0.0        0        0      0.0      0.0
            hdisk14      0      0.0        0        0      0.0      0.0
             hdisk0      0      0.0        0        0      0.0      0.0
            hdisk15      0      0.0        0        0      0.0      0.0
             hdisk2      0      0.0        0        0      0.0      0.0
             hdisk3      0      0.0        0        0      0.0      0.0
             hdisk1      0      0.0        0        0      0.0      0.0
               dac0      0      0.0        0        0      0.0      0.0
           dac0-utm      0      0.0        0        0      0.0      0.0
               dac1      0      0.0        0        0      0.0      0.0
           dac1-utm      0      0.0        0        0      0.0      0.0
            hdisk16      0      0.0        0        0      0.0      0.0
            hdisk17      0      0.0        0        0      0.0      0.0
** this example shows no disk activities.

'iostat aux' outputs similar information on disk I/O:

# iostat aux 1 1

System configuration: lcpu=8 drives=22

18:34:05     device    %busy    avque    r+w/s    Kbs/s   avwait   avserv

18:34:06     hdisk6      0      0.0        0        0      0.0      0.0
             hdisk7      0      0.0        0        0      0.0      0.0
             hdisk4      0      0.0        0        0      0.0      0.0
             hdisk5      0      0.0        0        0      0.0      0.0
             hdisk8      0      0.0        0        0      0.0      0.0
            hdisk11      0      0.0        0        0      0.0      0.0
             hdisk9      0      0.0        0        0      0.0      0.0
            hdisk10      0      0.0        0        0      0.0      0.0
            hdisk12      0      0.0        0        0      0.0      0.0
            hdisk13      0      0.0        0        0      0.0      0.0
            hdisk14      0      0.0        0        0      0.0      0.0
             hdisk0      0      0.0        0        0      0.0      0.0
            hdisk15      0      0.0        0        0      0.0      0.0
             hdisk2      0      0.0        0        0      0.0      0.0
             hdisk3      0      0.0        0        0      0.0      0.0
             hdisk1      0      0.0        0        0      0.0      0.0
               dac0      0      0.0        0        0      0.0      0.0
           dac0-utm      0      0.0        0        0      0.0      0.0
               dac1      0      0.0        0        0      0.0      0.0
           dac1-utm      0      0.0        0        0      0.0      0.0
            hdisk16      0      0.0        0        0      0.0      0.0
            hdisk17      0      0.0        0        0      0.0      0.0
parameters '1 1' indicate interval and iteration for refreshes.

To measure disk Read throughput:

# time dd if=/tmp/f66mb.out of=/dev/null bs=1024k
63+1 records in.
63+1 records out.

real    0m2.04s
user    0m0.00s
sys     0m1.05s
The 'time' command shows the amount of time it took to complete the read.  The read throughput
in this example is about 33MB per second (66MB / 2.04 seconds real time).

To measure disk Write throughput:

# sync; date; dd if=/dev/zero of=/tmp/1000m bs=1024k count=1000; date; sync; date
Thu Jan  5 23:02:41 PST 2006
1000+0 records in.
1000+0 records out.
Thu Jan  5 23:02:59 PST 2006
Thu Jan  5 23:02:59 PST 2006
In this example, dd completed after 18 seconds (23:02:59 - 23:02:41) and wrote with 55MB
per second (1GB / 18 seconds).

To pin-point which file system, logical volume, or physical hdisk is the most active under a heavy workload,

use filemon. See details here.

 

Solaris

The 'iostat �xc' command can be used to view disk activities on a Solaris machine.

The �x argument allows extended statistics to be reported

and the �c option reports the CPU utilization for the interval.


# iostat -xc
                  extended device statistics                        cpu
device       r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b  us sy wt id
fd0          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   8 64 27  0
sd0         51.1    0.2 6545.1    1.6  0.0  1.8   34.7   0 100 
sd1         84.7    0.0 10615.1   0.0  0.0  1.6   19.0   1  98
sd4         27.6    6.8  220.5   51.6  0.0  2.9   83.0   0  98
sd6          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
nfs1         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
Looks like disk 'sd0' is really busy (100% busy!).  Next step is to find out what is using it.
The fields have the following meanings:

          disk    name of the disk
          r/s     reads per second
          w/s     writes per second
          Kr/s    kilobytes read per second
          Kw/s    kilobytes written per second
          wait    average number of transactions waiting for ser-
                  vice (queue length)
          actv    average number of transactions  actively  being
                  serviced  (removed  from  the queue but not yet
                  completed)

          %w      percent of time there are transactions  waiting
                  for service (queue non-empty)
          %b      percent of time the disk is busy  (transactions
                  in progress)

Disk I/O Analysis – Determining I/O Bottlenecks

Filed under: Performance / Tuning Tips. To the Point — Tags: , , , , , — Ryan Stephanus @ 9:20 pm

  Linux  Windows  AIX  Solaris

 

Linux

There are many tools to monitor disk io in Linux, such as ‘iostat’, ’sar’, and ’sadc’.  It is easy to use these tools to find out which device (or hard disk) is having the busiest I/O activities, but it is a bit difficult to find out which program (the exact pid) that is actually using a particular device.  This task is a easier on AIX with ‘filemon’ (see AIX section).  The following example is one way to determine which program may be generating the most disk I/O on Linux:

# sar -d
Linux 2.4.21-27.ELsmp (pw101)         12/04/2005
12:00:00 AM       DEV       tps  rd_sec/s  wr_sec/s
....
Average:     dev8-128      7.16      5.37     75.07
Average:     dev8-129      7.16      5.37     75.07
Average:     dev8-130      0.00      0.00      0.00
....
The above command finds the busiest device.  To determine what that device is, do:
# more /proc/devices
Character devices:
  1 mem
  2 pty
  3 ttyp
  4 ttyS
  5 cua
  7 vcs
...
Block devices:
  1 ramdisk
  2 fd
  3 ide0
  7 loop
  8 sd
...
 71 sd
129 sd
The above example indicates the 'Block device' 8 is 'sd' or scsi disk, and #129 is one of
scsi disks in the system.  So 'sar -d' tells you that one of the scsi disks, or device
'dev8-129' is busy.
Now the following command is a bit busy, but you can always strip away piped commands to see
what output is like for each command.  But the shown example give you which program (along with
its pid) has the most open files on a given busy device.  In our example, the busy device we
are interested is 'dev8-129', but is represented as '8,129' in the 'lsof' output.
# lsof | grep "8,129" | awk '{print $1" "$2}' | uniq -c | sort -n -r
 [count   p   pid]
    307 java 31916
    307 java 31915
    307 java 31914
    307 java 31913
    307 java 31912
    307 java 31911
    307 java 31910
    307 java 31909
    307 java 31908
    307 java 31907
    307 java 31906
    307 java 31905
    307 java 31904
    307 java 31903
    307 java 27645
     55 db2sysc 32011
     51 db2sysc 32012
     46 httpd 32009
     46 httpd 32008
     46 httpd 32006
     46 httpd 31901
     46 httpd 31900
     46 httpd 31899
     46 httpd 31898
     46 httpd 31897
     46 httpd 31874
     45 db2sysc 32019
     40 db2fmp 31996
     31 sshd 26713
     29 db2sysc 32025
     29 db2sysc 32024
...
The above output gives a clue as to which process(es) to look into for disk I/O problems
(i.e., 'java').  The 'java' processes are having the largest number of open files, and
chances are doing the most disk I/O.  Of course, some files maybe memory-cached, so this
method may not always work.  I'd like to hear from you if you have a better solution.  Please
contact me (see footer for contact info).

 

Windows

‘Perfmon’ is a Windows tool that can be configured to monitor (and log to file) system resources.  Physical disk counters are one area that can be added to the overall resources to be monitored.  You can start ‘perfmon’ simply by typing ‘perfmon’ in the “Start->Run…” window:

disk_tunte_perfmon-io

The above example is using ‘perfmon’ to show resource usage using a perf log file.  The ‘Counter’ column shows the disk I/O related counters for disk drives C: and E:.  You can use these counters to see clearly which disk is busy.  To see which program is causing the disk I/O, there is a freeware tool called ‘Filemon’ can help.  Here is a simple description of it – “FileMon monitors and displays file system activity on a system in real-time. Its advanced capabilities make it a powerful tool for exploring the way Windows works, seeing how applications use the files and DLLs, or tracking down problems in system or application file configurations. Filemon’s timestamping feature will show you precisely when every open, read, write or delete, happens, and its status column tells you the outcome.”  More here.

 

AIX

Once you find which device is busy using the ‘iostat’ command, you can use ‘lspv -L’ to find out which volume group is associated with that disk.  This at least gives you a clue as to possibly which groups of programs that may cause the disk I/O (asusme the volume groups are set for a particular puspose, e.g., databases):

# iostat aux 1

System configuration: lcpu=8 drives=22

18:34:05     device    &percnt;busy    avque    r+w/s    Kbs/s   avwait   avserv

18:34:06     hdisk6      0      0.0        0        0      0.0      0.0
             hdisk7      0      0.0        0        0      0.0      0.0
             hdisk4      0      0.0        0        0      0.0      0.0
             hdisk5      0      0.0        0        0      0.0      0.0
             hdisk8      0      0.0        0        0      0.0      0.0
            hdisk11      0      0.0        0        0      0.0      0.0
             hdisk9      0      0.0        0        0      0.0      0.0
            hdisk10      0      0.0        0        0      0.0      0.0
            hdisk12      0      0.0        0        0      0.0      0.0
            hdisk13      0      0.0        0        0      0.0      0.0
            hdisk14      0      0.0        0        0      0.0      0.0
             hdisk0      0      0.0        0        0      0.0      0.0
            hdisk15      0      0.0        0        0      0.0      0.0
             hdisk2      0      0.0        0        0      0.0      0.0
             hdisk3      0      0.0        0        0      0.0      0.0
             hdisk1      0      0.0        0        0      0.0      0.0
               dac0      0      0.0        0        0      0.0      0.0
           dac0-utm      0      0.0        0        0      0.0      0.0
               dac1      0      0.0        0        0      0.0      0.0
           dac1-utm      0      0.0        0        0      0.0      0.0
            hdisk16      0      0.0        0        0      0.0      0.0
            hdisk17      0      0.0        0        0      0.0      0.0

# lspv -L
hdisk0          002fbf7d85488b31                    rootvg          active
hdisk1          002fbf7db8392e38                    pagingvg        active
hdisk2          002fbf7db8392f17                    pagingvg        active
hdisk3          002fbf7db8392ffb                    pagingvg        active
hdisk4          002fbf7db83930ea                    pagingvg        active
hdisk5          002fbf7db83931d4                    pagingvg        active
hdisk6          002fbf7db83932c2                    appservg        active
hdisk7          002fbf7db83933a6                    appservg        active
hdisk8          002fbf7db839348f                    appservg        active
hdisk9          002fbf7db8393579                    appservg        active
hdisk10         002fbf7db839365e                    appservg        active
hdisk11         002fbf7db839374e                    appservg        active
hdisk12         002fbf7db8393835                    appservg        active
hdisk13         002fbf7db8393921                    appservg        active
hdisk14         002fbf7db8393a18                    appservg        active
hdisk15         002fbf7db8393b04                    appservg        active
hdisk16         002fbf7de1362e64                    dbvg            active
hdisk17         002fbf7de1362f78                    dbvg            active

A best way, though, is to use AIX’s ‘filemon’ command.  The command can output busiest fs, lv, and pv, AND associate them with processes.  The following example show a snapshot of system disk I/O for 10 seconds:

# filemon -O all -o fmon.out; sleep 10; trcstop

Enter the "trcstop" command to complete filemon processing

[filemon: Reporting started]

# [filemon: Reporting completed]

[filemon: 10.945 secs in measured interval]

# more fmon.out:

Sun Dec  4 18:40:14 2005
System: AIX pw101 Node: 5 Machine: 002FBF7D4C00

Cpu utilization:   1.0&percnt;
Cpu allocation:  100.0&percnt;

Most Active Files
------------------------------------------------------------------------
  #MBs  #opns   #rds   #wrs  file                     volume:inode
------------------------------------------------------------------------
   1.2      4    296      0  unix                     /dev/hd2:63102
   0.7      2    170      0  services                 /dev/hd4:1173
   0.5    242    124      0  group                    /dev/hd4:600
   0.1      8     16      0  ksh.cat                  /dev/hd2:79522
   0.0      6     12      0  default.env              /dev/hd9var:8199
....
Most Active Segments
------------------------------------------------------------------------
  #MBs  #rpgs  #wpgs  segid  segtype                  volume:inode
------------------------------------------------------------------------
   0.0      0      1   1460347  client

Most Active Logical Volumes
------------------------------------------------------------------------
  util  #rblk  #wblk   KB/s  volume                   description
------------------------------------------------------------------------
  0.00      0     24    1.1  /dev/hd8                 jfs2log
  0.00      0      8    0.4  /dev/hd9var              /var

Most Active Physical Volumes
------------------------------------------------------------------------
  util  #rblk  #wblk   KB/s  volume                   description
------------------------------------------------------------------------
  0.00      0     32    1.5  /dev/hdisk0              N/A
...
Detailed Physical Volume Stats   (512 byte blocks)
------------------------------------------------------------------------

VOLUME: /dev/hdisk0  description: N/A
writes:                 4       (0 errs)
  write sizes (blks):   avg     8.0 min       8 max       8 sdev     0.0
  write times (msec):   avg   5.321 min   2.983 max   7.837 sdev   1.778
  write sequences:      4
  write seq. lengths:   avg     8.0 min       8 max       8 sdev     0.0
seeks:                  4       (100.0&percnt;)
  seek dist (blks):     init 59548744,
                        avg 4780893.3 min       8 max 7171344 sdev 3380596.4
  seek dist (&percnt;tot blks):init 41.53385,
                        avg 3.33456 min 0.00001 max 5.00184 sdev 2.35789
time to next req(msec): avg 1471.627 min   6.890 max 4202.723 sdev 1716.529
throughput:             1.5 KB/sec
utilization:            0.00
Two things shown here:
- 'jfs2log' is causing the most active lv '/dev/hd8'
- number of 'seeks' is the same as number of 'writes' or 'seeks' is 100&percnt;, a possible
indication that file system is highly fragmented.  You may need to run 'smit dejfs' to
degragment the fs.

 

Solaris

The ‘iostat �xc’ command can be used to view disk activities on a Solaris machine. The �x argument allows extended statistics to be reported and the �c option reports the CPU utilization for the interval.

# iostat -xc
                  extended device statistics                        cpu
device       r/s    w/s   kr/s   kw/s wait actv  svc_t  &percnt;w  &percnt;b  us sy wt id
fd0          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   8 64 27  0
sd0         51.1    0.2 6545.1    1.6  0.0  1.8   34.7   0 100 
sd1         84.7    0.0 10615.1   0.0  0.0  1.6   19.0   1  98
sd4         27.6    6.8  220.5   51.6  0.0  2.9   83.0   0  98
sd6          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
nfs1         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
Looks like disk 'sd0' is really busy (100&percnt; busy!).  Next step is to find out what is using it.
The fields have the following meanings:

          disk    name of the disk
          r/s     reads per second
          w/s     writes per second
          Kr/s    kilobytes read per second
          Kw/s    kilobytes written per second
          wait    average number of transactions waiting for ser-
                  vice (queue length)
          actv    average number of transactions  actively  being
                  serviced  (removed  from  the queue but not yet
                  completed)

          &percnt;w      percent of time there are transactions  waiting
                  for service (queue non-empty)
          &percnt;b      percent of time the disk is busy  (transactions
                  in progress)
# lsof
Can give you open files information.  If you have better info on Solaris disk I/O analysis
please contact us below.

Analyzing Network Performance

Filed under: Performance / Tuning Tips. To the Point — Tags: , , , , , , — Ryan Stephanus @ 9:10 pm

  Linux  Windows  AIX  Solaris

 

Linux

Use ‘netstat’ with -p option to display PID/Program name for sockets:

# netstat -p 2
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address               Foreign Address             State     PID/Program name
tcp        1      0 pw101.perfryan.com:56451   pw101.perfryan.com:9081    CLOSE_WAIT  31901/httpd
tcp        1      0 pw101.perfryan.com:56460   pw101.perfryan.com:9081    CLOSE_WAIT  32008/httpd
tcp        1      0 pw101.perfryan.com:56456   pw101.perfryan.com:9081    CLOSE_WAIT  32006/httpd
tcp        1      0 pw101.perfryan.com:56446   pw101.perfryan.com:9081    CLOSE_WAIT  31898/httpd
tcp        1      0 pw101.perfryan.com:56447   pw101.perfryan.com:9081    CLOSE_WAIT  31899/httpd
tcp        1      0 pw101.perfryan.com:56445   pw101.perfryan.com:9081    CLOSE_WAIT  31897/httpd
tcp        1      0 pw101.perfryan.com:35518   pw101.perfryan.com:9081    CLOSE_WAIT  31900/httpd
tcp        0    196 pw101.perfryan.com:ssh     9.65.243.110:2708           ESTABLISHED 28120/0
tcp        0      0 pw101.perfryan.com:56437   legolas.stl.ib:DB2_db2inst1 ESTABLISHED 31903/java
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node PID/Program name    Path
unix  6      [ ]         DGRAM                    1879   1766/syslogd        /dev/log
unix  3      [ ]         STREAM     CONNECTED     575699094 31983/db2wdog
unix  3      [ ]         STREAM     CONNECTED     575699093 31983/db2wdog
unix  3      [ ]         STREAM     CONNECTED     575699092 31983/db2wdog
unix  3      [ ]         STREAM     CONNECTED     575699091 31983/db2wdog
unix  3      [ ]         STREAM     CONNECTED     575699090 31983/db2wdog
unix  3      [ ]         STREAM     CONNECTED     575699089 31983/db2wdog
unix  3      [ ]         STREAM     CONNECTED     575699088 31983/db2wdog
unix  3      [ ]         STREAM     CONNECTED     575699087 31983/db2wdog
unix  3      [ ]         STREAM     CONNECTED     575699086 31983/db2wdog
unix  3      [ ]         STREAM     CONNECTED     575699085 31983/db2wdog
...

Looks like db2wdog is a major program using the network (it has active sockets).

 

Windows

‘Perfmon’ is a Windows tool that can be configured to monitor (and log to file) system resources.  Physical disk counters are one area that can be added to the overall resources to be monitored.  You can start ‘perfmon’ simply by typing ‘perfmon’ in the “Start->Run…” window:


network_tune_tm-network-ana1

 

AIX

A very useful command called ‘netpmon’ is available on AIX.  It uses ‘trace’ to monitor activity and report statistics on network I/O usage and network-related CPU usage.  One can easily determine that if a process is causing high network-related CPU usage, chances are it is causing a lot of network traffice.

# netpmon -o netmon.out; sleep 10; trcstop
# more netmon.out

Sun Dec  4 19:55:40 2005
System: AIX pw101 Node: 5 Machine: 002FBF7D4C00

========================================================================

Process CPU Usage Statistics:
-----------------------------
                                                   Network
Process (top 20)             PID  CPU Time   CPU %   CPU %
----------------------------------------------------------
rcp                       349885    0.7986   0.660   0.522
db2fm                     512122    0.0375   0.073   0.000
rcp                       349882    0.2821   0.234   0.171
netpmon                   249960    0.0225   0.044   0.000
db2fm                    1962022    0.0136   0.026   0.000
db2set                    516208    0.0096   0.019   0.000
java                     1716352    0.0070   0.014   0.000
db2disp                  1143036    0.0052   0.010   0.000
swapper                        0    0.0040   0.008   0.000
trcstop                   512124    0.0036   0.007   0.000
UNKNOWN                   647338    0.0025   0.005   0.000
db2fmcd                   311516    0.0022   0.004   0.000
java                      536650    0.0019   0.004   0.000
getty                     290960    0.0019   0.004   0.000
db2hmon                  1458184    0.0014   0.003   0.000
gil                        77862    0.0014   0.003   0.003
sshd:                     737282    0.0014   0.003   0.000
ksh                      2027638    0.0012   0.002   0.000
java                      331974    0.0008   0.001   0.000
netpmon                  1470576    0.0007   0.001   0.000
sddsrv                    266388    0.0006   0.001   0.000
sched                      12294    0.0004   0.001   0.000
----------------------------------------------------------
Total (all processes)               0.1216   0.236   0.696
The "Network CPU%" column is the percentage of total time that this process spent executing
network-related code.

 

Solaris

Use ‘netstat’ and the following script to determine the PID that is using port(s) which causing the highest network throughput:

#!/bin/ksh
#
# Ryan Stephanus
# 7-30-2003
# find from a port the pid that started the port
#
# Need to be root to check other folk's PID's out
#
line='-------------------------------------------------------------------------'
pids=$(/usr/bin/ps -ef | sed 1d | awk '{print $2}')

# Prompt users or use 1st cmdline argument
if [ $# -eq 0 ]; then
        read ans?"Enter port you like to know pid for:  "
else
        ans=$1
fi

# Check all pids for this port, then list that process
for f in $pids
do
        /usr/proc/bin/pfiles $f 2>/dev/null | /usr/xpg4/bin/grep -q "port: $ans"
        if [ $? -eq 0 ] ; then
                echo $line
                echo "Port: $ans is being used by PID:c"
                /usr/bin/ps -ef -o pid -o args | egrep -v "grep|pfiles" | grep $f
        fi
done
exit 0
** from http://sysunconfig.net/unixtips/solaris.html

Blog at WordPress.com.