OS Thread Startup 22

Recently I encountered a performance problem scenario where a simple sqlplus “/ as sysdba” took about 2minutes to finish, this is critical to the client’s business because they have a local C program that loads Call Detail Reports on the database making use of local authentication for most of its operations and Sql*Loader to load the data, so this “2minutes of waiting” when accumulated greatly consumes significant time on their operations and greatly impacts the business.

When I arrived on the client I first checked the alert logs of both ASM (they have a separate home for ASM) and RDBMS, there were no errors…

Then I checked on the server to see if there were any CPU, IO, memory, swap, and network bottlenecks going on

The CPU run queue was zero and most of the time 90% idle

The disks were also most of the time idle

The memory utilization was low with 430MB free

Real memory Usage
-----------------------------
Total Installed : 6.00000 GB
Total Used : 5.57032 GB
Total Free : 439.97264 MB
REAL MEMORY USAGE : % : 92.800
-----------------------------

The paging on the server is not significant which is 1.1GB out of 16GB

Paging Space Usage
-----------------------------
Total Pg Space : 16640.00000 MB
Pg Space Used : 1187.77340 MB
Percent Used PAGING : 7.100
-----------------------------

Then I did sqlplus “/ as sysdba” to do some SELECTs on some views (…yeah it really took about 2minutes…)

I looked at the free memory of the SGA memory structures… well, I think SGA is not the issue but I still have to find out…

select pool, round(bytes/1024/1024,2) size_mb from v$sgastat where name like '%free memory%';

POOL            SIZE_MB
------------ ----------
shared pool      655.57
large pool        302.6
java pool            16
streams pool      15.99

At that time, there were only 4 users connected (that includes me) to the database and the C program was not running. I suspected that there are some contentions (latch) or concurrency problems that I’ll see on the wait interface… here’s a portion of the AWR report

   Elapsed:               60.11 (mins)
   DB Time:              121.50 (mins)

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
os thread startup                     3,278       3,115    950   42.7 Concurrenc
latch free                            6,298       1,822    289   25.0      Other
flashback buf free by RVWR            1,771       1,721    972   23.6      Other
SQL*Net more data to client          60,087       1,574     26   21.6    Network
CPU time                                          1,088          14.9
          -------------------------------------------------------------

The DB Time was 3940.8 seconds (121.50 * 60) and the top event which is the “os thread startup” consumes 83% (3276/3940.8) of the total DB Time

And just for comparison, below is their normal workload which is generated on the same period (but different day) as the above

   Elapsed:               59.52 (mins)
   DB Time:               72.42 (mins)

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                          2,009          46.2
db file scattered read              151,781         423      3    9.7   User I/O
db file sequential read              66,212         301      5    6.9   User I/O
latch: shared pool                      384          67    175    1.5 Concurrenc
log file sequential read                404          19     46    0.4 System I/O
          -------------------------------------------------------------

Hmmm so what is “os thread startup”? The ADDM report also outputs this event as a top finding

FINDING 1: 79% impact (3113 seconds)
------------------------------------
Wait event "os thread startup" in wait class "Concurrency" was consuming
significant database time.

   RECOMMENDATION 1: Application Analysis, 79% benefit (3113 seconds)
      ACTION: Investigate the cause for high "os thread startup" waits. Refer
         to Oracle's "Database Reference" for the description of this wait
         event. Use given SQL for further investigation.
      RATIONALE: The SQL statement with SQL_ID "4gd6b1r53yt88" was found
         waiting for "os thread startup" wait event.
         RELEVANT OBJECT: SQL statement with SQL_ID 4gd6b1r53yt88

   RECOMMENDATION 2: Application Analysis, 79% benefit (3113 seconds)
      ACTION: Investigate the cause for high "os thread startup" waits in
         Service "SYS$BACKGROUND".

   SYMPTOMS THAT LED TO THE FINDING:
      SYMPTOM: Wait class "Concurrency" was consuming significant database
               time. (80% impact [3156 seconds])

The event is not even documented on 8i, 9i, 10g, 11g official doc… I found some articles when I Google’d but they are not related to my issue… and Metalink does not have any helpful documents on this event for my situation…

There was already a Service Request regarding the issue, and the Support Engineer defined it as

“This wait event might be seen if the database server is executing on a platform that supports multi-threading. We enter this waiting state while a thread is starting up and leave the wait state when the thread has started or the startup request is cancelled.
This indicates some high contention at OS level avoiding even new process startup.
Issue is related to OS please involve system admin to solve same.”

The manager decided that we do a hard shutdown of the LPAR (we’re on P570 AIX 5.3) from the management console, and let’s see what it can do…

The DBA told me that once they are connected, all the operations are okay (DML, DDL, SQL*Loader).

So after the restart we tried to run the program for a while just to observe the behavior and what’s going on… apparently the local authentication is still slow!

For me, the challenge is how to have a response time profile of a session from the time it starts to connect with local authentication up to the time the session is connected so I would know what part of the operation is taking a lot of time.

What I remember, is when I was doing some queries to do some drill down on a session that was connecting using sqlplus “/ as sysdba” I see the background processes being spawned (QMNC, ARCn, etc..) are taking a lot of time in WAITING state with SECONDS_IN_WAIT reaching up to 20 seconds and has a wait event of “os thread startup”, then after a while the user gets connected then he’s now ready to do his thing without problems…Meaning after he’s connected there are no more complains…

SQL> select w.sid, s.sql_hash_value, s.program, w.event, w.wait_time/100 t, w.seconds_in_wait seconds_in_wait, w.state, w.p1, w.p2, w.p3
  2  from v$session s, v$session_wait w
  3  where s.sid = w.sid
  4  and w.state = 'WAITING'
  5  and w.event like '%os thread startup%'
  6  order by 6 asc;

   SID SQL_HASH_VALUE PROGRAM                 EVENT                       T SECONDS_IN_WAIT STATE                     P1         P2         P3
------ -------------- ----------------------- ------------------ ---------- --------------- ----------------- ---------- ---------- ----------
   121              0 oracle@ps570roc01 (QMNC os thread startup)          0              18 WAITING                    0          0          0

To know what’s going on when a user connects and for me to account the operation where it takes significant time; I created a database user TRACE_USER and created a logon trigger that invokes Extended SQL Trace Level 12. This will give me the timing information that I need when the user is connecting to the database.

To create the user:

create user trace_user identified by trace_user account unlock;
grant create session,dba to trace_user;

To create the logon trigger:

create or replace trigger sys.set_trace
after logon on database
when (user = 'trace_user')
declare
lcommand varchar(200);
begin
execute immediate 'alter session set statistics_level=all';
execute immediate 'alter session set max_dump_file_size=unlimited';
execute immediate 'alter session set events ''10046 trace name context forever, level 12''';
end set_trace;
/

After the user and the trigger are created, logon as the TRACE_USER by doing the following command:

sqlplus "trace_user/trace_user" << eof
exit;
eof

This will generate an Extended SQL Trace (10046) trace file under the $ORACLE_BASE/admin/<db_name>/udump directory

Then see if there’s significant time on the database calls (PARSE,BIND,EXEC,FETCH) by looking at the “e=” section which means elapsed duration consumed by the database call

And check if there’s significant time on wait events by looking at the “ela=” section which means elapsed duration consumed by the wait event (probably by doing a system call)

Below is a small portion of the trace file generated, we are interested on the highlighted portions…

PARSING IN CURSOR #5 len=131 dep=1 uid=0 oct=3 lid=0 tim=15687899202263 hv=1389591971 ad='ca9a7948'
select privilege#,nvl(col#,0),max(mod(nvl(option$,0),2))from objauth$ where obj#=:1 and grantee#=:2 group by privilege#,nvl(col#,0)
END OF STMT
EXEC #5:c=0,e=145,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=15687899202260
WAIT #5: nam='db file sequential read' ela= 30005 file#=1 block#=24208 blocks=1 obj#=-1 tim=15687899232346
FETCH #5:c=0,e=30075,p=1,cr=2,cu=0,mis=0,r=0,dep=1,og=4,tim=15687899232396

There were no database calls that consumed significant time (numbers below are microseconds, that’s for 9i and above):

less oracle_ora_1118270.trc | grep -i ",e=" | cut -d , -f2 > karlarao.txt ; sed -n 's/e=/ /p' karlarao.txt | sort -nr | less

 358746
 183162
 61293
 44661
 32580
 30075
 28695
 26950
 25837
 24244
 23519
 20543
 20132
 19449
 19290
 18809
 18250
 17230
 16507
 16349
 13438
 11716
 11690
 11284
 10137

There were no wait events that consumed significant time (numbers below are microseconds, that’s for 9i and above):

less oracle_ora_1118270.trc | grep -i "ela=" | cut -d " " -f8 | sort -nr | less

30005
28624
13253
11592
9650

There was no “os thread startup” event!

less oracle_ora_1118270.trc | grep -i "ela=" | cut -d "=" -f2 | uniq

'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'db file sequential read' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'db file sequential read' ela
'SQL*Net message to client' ela
'db file sequential read' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'db file sequential read' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela

I need to have some instrumentation on the system calls, so I’ll use a UNIX tool,

presenting… TRUSS…

This tool will let you do some application tracing, for us we are interested in tracing SQL*Plus, and display the calls that an application makes to external libraries and the kernel… sounds cool :)

A “man truss” will give you an overview, but if you want more detail there are some good references at the IBM site or Metalink

Execute the following commands (one at a time):

truss -c sqlplus "/ as sysdba" << EOF
exit;
EOF

truss -d sqlplus "/ as sysdba" << EOF
exit;
EOF

The “-c” switch counts traced system calls, faults, and signals

The “-d” switch displays the timestamp in seconds on each line. The first line of the trace output will show the base time from which the individual time stamps are measured

Bwahaha, Now I know the problem!

The DNS service was causing the slow down on the sqlplus “/ as sysdba” operation and also the high “os thread startup” waits

The text below is the portion of “truss” output where the slow down occurs denoted by “(sleeping…)” events when reading on the /etc/resolv.conf and /etc/hosts. Please see the highlighted region below

ENOTTY and ECONNREFUSED are the errors that you’ll see, you’ll find more details on the error messages on this file /usr/include/sys/errno.h
Below are their meanings

#define	ENOTTY	25	/* Inappropriate I/O control operation	*/
#define ECONNREFUSED    79      /* Connection refused */

The portion of “truss -d” where the slow down occurs, you’ll see a leap on the timing information from 2.3379 seconds (reading /etc/resolv.conf) –> 7.3477 seconds –> 17.3489 seconds –> 37.3555 seconds (reading /etc/hosts) –> 70.3863 seconds (already connected)

The timing information below may not exactly match the real “time” that the user is waiting, it’s because the SQL*Plus is being probed to know the part of the underlying system calls where there are significant waits

0.3172:        gethostname(0x0FFFFFFFFFFF9920, 256) = 0
0.3175:        _getpid()                        = 1102056
0.3177:        open("/etc/resolv.conf", O_RDONLY) = 8
0.3180:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
0.3182:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 69
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 0
0.3190:        close(8)                         = 0
0.3194:        socket(2, 2, 0)                  = 8
0.3197:        getsockopt(8, 65535, 4104, 0x0FFFFFFFFFFF70E4, 0x0FFFFFFFFFFF70E0) = 0
0.3199:        connext(8, 0x09001000A0018600, 16) = 0
0.3202:        send(8, 0x0FFFFFFFFFFF7FF0, 41, 0) = 41
0.3205:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3208:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) Err#79 ECONNREFUSED
0.3210:        close(8)                         = 0
0.3212:        socket(2, 2, 0)                  = 8
0.3215:        sendto(8, 0x0FFFFFFFFFFF7FF0, 41, 0, 0x09001000A0018610, 16) = 41
0.3217:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3220:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) = 108
0.3222:        close(8)                         = 0
0.3224:        socket(2, 2, 0)                  = 8
0.3227:        getsockopt(8, 65535, 4104, 0x0FFFFFFFFFFF70E4, 0x0FFFFFFFFFFF70E0) = 0
0.3229:        connext(8, 0x09001000A0018600, 16) = 0
0.3231:        send(8, 0x0FFFFFFFFFFF7FF0, 28, 0) = 28
0.3233:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3236:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) Err#79 ECONNREFUSED
0.3238:        close(8)                         = 0
0.3240:        socket(2, 2, 0)                  = 8
0.3243:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018610, 16) = 28
0.3245:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3248:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) = 28
0.3250:        close(8)                         = 0
0.3252:        socket(2, 2, 0)                  = 8
0.3373:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018600, 16) = 28
2.3379:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) (sleeping...)
2.3379:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 0
5.3468:        close(8)                         = 0
5.3470:        socket(2, 2, 0)                  = 8
5.3472:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018600, 16) = 28
7.3477:        _poll(0x0FFFFFFFFFFF7190, 1, 10000) (sleeping...)
7.3477:        _poll(0x0FFFFFFFFFFF7190, 1, 10000) = 0
15.3479:        close(8)                        = 0
15.3482:        socket(2, 2, 0)                 = 8
15.3484:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018600, 16) = 28
17.3489:        _poll(0x0FFFFFFFFFFF7190, 1, 20000) (sleeping...)
17.3489:        _poll(0x0FFFFFFFFFFF7190, 1, 20000) = 0
35.3491:        close(8)                        = 0
35.3495:        getdomainname(0x0FFFFFFFFFFF9480, 256) = 0
35.3497:        open("/etc/hosts", O_RDONLY)    = 8
35.3500:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
35.3502:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
35.3505:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 3453
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 0
35.3514:        close(8)                        = 0
kwrite(9, "01 �0603 s".., 432)    = 432
37.3555:        kread(10, "16060280".., 2064) (sleeping...)
kread(10, "06 E06\b".., 2064)   = 1605
70.3762:        open("/oracle/app/product/10.2/db_1/rdbms/mesg/oraus.msb", O_RDONLY) = 8
70.3764:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
70.3767:        lseek(8, 0, 0)                  = 0
kread(8, "1513 "011303\t\t".., 256)     = 256
70.3772:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3776:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3781:        lseek(8, 50688, 0)              = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
70.3786:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3790:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3795:        lseek(8, 50688, 0)              = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
70.3799:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3804:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3808:        lseek(8, 50688, 0)              = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
70.3813:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3817:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3822:        lseek(8, 51712, 0)              = 51712
kread(8, "\n0589 D058A".., 512)     = 512
70.3827:        close(8)                        = 0
kwrite(9, " <0611 k".., 60)     = 60
kread(10, " �06\b".., 2064)   = 179
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
70.3854:        lseek(4, 512, 0)                = 512
kread(4, "17 �".., 512)     = 512
70.3858:        lseek(4, 1024, 0)               = 1024
kread(4, "16 * R h819E".., 512)     = 512
70.3863:        lseek(4, 4096, 0)               = 4096
kread(4, "\f82 P86".., 512)     = 512

kwrite(1, "\n", 1)                              = 1
Connected to:
kwrite(1, " C o n n e c t e d   t o".., 14)     = 14
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
kwrite(1, " O r a c l e   D a t a b".., 77)     = 77
With the Partitioning, OLAP, Data Mining and Real Application Testing options
kwrite(1, " W i t h   t h e   P a r".., 78)     = 78

kwrite(1, "\n", 1)                              = 1
70.3882:        kfcntl(1, F_GETFL, 0x0000000000000008) = 2
70.3886:        __libc_sbrk(0x0000000000030020) = 0x000000001023C880
70.3892:        access("login.sql", 0)          Err#2  ENOENT
70.3895:        access("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0) = 0
70.3898:        statfs("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0x0FFFFFFFFFFFDA10) = 0

Apparently there was really a problem with the DNS when it was checked by the System Administrator. He just altered some network related files and instantly we checked on the local authentication… and we could quickly do a local authentication again! Whooho! and everybody was happy :)

Below is the good “truss” output of the local authentication, which takes about 3 seconds from the timing information

3.0713:        gethostname(0x0FFFFFFFFFFF9900, 256) = 0
3.0716:        _getpid()                        = 2539762
3.0719:        open("/etc/resolv.conf", O_RDONLY) = 8
3.0721:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
3.0724:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 70
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 0
3.0731:        close(8)                         = 0
3.0735:        socket(2, 2, 0)                  = 8
3.0740:        getsockopt(8, 65535, 4104, 0x0FFFFFFFFFFF70C4, 0x0FFFFFFFFFFF70C0) = 0
3.0742:        connext(8, 0x09001000A0018600, 16) = 0
3.0746:        send(8, 0x0FFFFFFFFFFF7FD0, 41, 0) = 41
3.0750:        _poll(0x0FFFFFFFFFFF7170, 1, 5000) = 1
3.0753:        nrecvfrom(8, 0x0FFFFFFFFFFF9300, 1024, 0, 0x0FFFFFFFFFFF7930, 0x0FFFFFFFFFFF7158) = 108
3.0756:        send(8, 0x0FFFFFFFFFFF7FD0, 28, 0) = 28
3.0758:        _poll(0x0FFFFFFFFFFF7170, 1, 5000) = 1
3.0761:        nrecvfrom(8, 0x0FFFFFFFFFFF9300, 1024, 0, 0x0FFFFFFFFFFF7930, 0x0FFFFFFFFFFF7158) = 28
3.0764:        close(8)                         = 0
3.0767:        getdomainname(0x0FFFFFFFFFFF9460, 256) = 0
3.0769:        open("/etc/hosts", O_RDONLY)     = 8
3.0771:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
3.0774:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
3.0777:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 3453
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 0
3.0786:        close(8)                         = 0
kwrite(9, "01 �0603 s".., 432)    = 432
kread(10, "06 H06\b".., 2064)   = 1608
3.0973:        open("/oracle/app/product/10.2/db_1/rdbms/mesg/oraus.msb", O_RDONLY) = 8
3.0975:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
3.0978:        lseek(8, 0, 0)                   = 0
kread(8, "1513 "011303\t\t".., 256)     = 256
3.0983:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.0988:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.0993:        lseek(8, 50688, 0)               = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
3.0997:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.1002:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.1006:        lseek(8, 50688, 0)               = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
3.1011:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.1015:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.1021:        lseek(8, 50688, 0)               = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
3.1025:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.1029:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.1034:        lseek(8, 51712, 0)               = 51712
kread(8, "\n0589 D058A".., 512)     = 512
3.1038:        close(8)                         = 0
kwrite(9, " <0611 k".., 60)     = 60
kread(10, " �06\b".., 2064)   = 179
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
3.1064:        lseek(4, 512, 0)                 = 512
kread(4, "17 �".., 512)     = 512
3.1069:        lseek(4, 1024, 0)                = 1024
kread(4, "16 * R h819E".., 512)     = 512
3.1073:        lseek(4, 4096, 0)                = 4096
kread(4, "\f82 P86".., 512)     = 512

kwrite(1, "\n", 1)                              = 1
Connected to:
kwrite(1, " C o n n e c t e d   t o".., 14)     = 14
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
kwrite(1, " O r a c l e   D a t a b".., 77)     = 77
With the Partitioning, OLAP, Data Mining and Real Application Testing options
kwrite(1, " W i t h   t h e   P a r".., 78)     = 78

kwrite(1, "\n", 1)                              = 1
3.1095:        kfcntl(1, F_GETFL, 0x0000000000000008) = 2
3.1099:        __libc_sbrk(0x0000000000030020)  = 0x000000001023C880
3.1105:        access("login.sql", 0)           Err#2  ENOENT
3.1109:        access("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0) = 0
3.1112:        statfs("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0x0FFFFFFFFFFFD9F0) = 0
About these ads

22 comments

  1. Great troubleshooting!
    I just discovered the asmiostat utility through your post on Oracle-L. Thanks for all the info!

  2. Thanks for sharing this info. I am sure that “truss” utility will help me more in troubleshooting issues.

    • That’s right Gidhin, it’s all about response time… and you have to “divide-and-conquer” to see where the problem is.
      And use the right tool for the right problem :)

  3. Thank you Karlarao. We are experiencing a similar situation. Can you share the resolution particulars so that we can further benefit from your experience? I am seeing the same thing in our truss output but do not know what to do to fix it. I am a Oracle DBA but will relay your information to the more knowledgeable System Administrators. Regards.

    • Hi Jack,

      Thanks for reading my post. :)
      When we had the timing info of truss, and identified that it takes much of the time on the remote (/etc/resolv.conf) and local resolvers (/etc/hosts) by looking at the lines with \”sleeping…\”
      we quickly checked on the DNS servers and did some \”dig\” commands to know the status of forward and reverse lookups.

      forward lookup:
      dig fqdn_of_your_machine

      reverse lookup:
      dig -x ip_of_your_machine

      the commands above are not responding and takes too long. Then, also doing \”ping\” on the nameserver entry on /etc/resolv.conf verified that the DNS server was the problem because we could not reach it…
      So we replaced the nameserver entry with a valid DNS entry then everything went okay :)
      Hope this helps :)

  4. Jack and I work together. Our problem is masked by the fact that we’re migrating from HPUX to AIX, but the problem was found on the older HPUX systems as well.

    Our truss shows a hang, not on getdomainname, but during the kread sleeping and waiting for the “NTP0 #######” to return. This happens only on one database, and it’s derivatives (prod, dev, copy, test, metadata) but not on other databases in the environment. Once the oracle process is spawned, it completes very very quickly.

    One connection will clear in 0.3 second; however, if we spawn multiple connections at the same time, each connection chews up CPU. Once we hit 15 for one DB, or 60 for the other copies, the CPU is so bound up that the delay is found.

    A clone of the AIX system with a different database imported can handle 1000 local connections or 150 remote connections with half the CPU without hitting the delay.

    /etc/hosts, /etc/resolv.conf, /etc/nsswitch.conf are all the same between the AIX systems.

    PTR and A records come back fine from both DNS servers from the DB servers in question.

    We can spawn non-oracle processes ad nauseum with very little CPU overhead.

    • Hi Josh,

      That’s a weird scenario :) you might be hitting a bug? Have you involved Oracle Support on this issue? What did they say?

      What OS and Oracle version are you running?

      And, my scenario is different from yours.. are you also getting a lot of waits on OS Thread Startup?
      On your part, what you can do.. you may want to drill down on the 15 slow sessions when the server is CPU bound.
      And submit the following to Oracle Support for faster resolution:

      1) session wait and session statistics of the 15 sessions (you may want to use the sw and snapper tool of Tanel Poder), spool it in a file

      2) procstack samples of the 15 sessions, these are quick stack samples so you would know about the current process execution. The support engineer will have an idea about the kernel functions being used by the sessions which they could map to the Oracle Kernel layer, if there are any known issues on that area..

      $ procstack pid > sample1.txt
      then wait another 30seconds or a minute
      $ procstack pid > sample2.txt
      then wait another 30seconds or a minute
      $ procstack pid > sample3.txt
      ….

      you may also use Doc ID: 459694.1 Procwatcher: Script to Monitor and Examine Oracle and CRS Processes, but I suggest if you are doing this do it dev or test environment because these are debuggers and might hang your process.

      3) topas and pmap of the 15 sessions, so the support engineer would know about the OS stats and memory allocation breakdown of the sessions

      4) and of course the truss output

      5) Statspack, AWR, ADDM reports from the time the sessions are experiencing performance problems

      These info could be enough, or the support engineer will advise you to run some other tools like oradebug,etc. But having these info will greatly help in the analysis.. :)

      If ever you solve this problem, kindly update this post for your resolution so we could help other readers :) Thanks!

  5. Hi Guys,

    It’s very likely that the “hung” kread is working on a socket (the first parameter of this syscall, as you guys may already know, is the file descriptor which it is reading from).

    So, it makes sense to hypothesize that the client process is waiting for a “message” from the process on the server side (which can either be the listener itself or the server process spawned by the listener).

    A couple of questions that may be relevant to ask:

    1. is this particular database configured to use shared server (MTS)?
    2. do you have a separate listener for this database (i.e. the other instances on this server have listener[s] of their own)?

    Enabling SQL Net tracing may give some clue… or, if you can test this in a controlled environment which reproduces the problem (i.e., not prod), you can run truss on the listener with the -f (fork) option.

    Cheers.
    Rhojel

  6. On Linux the equivalent of truss is ‘strace’ and the timing option is -T.
    unhappilly the output is as follow :

    open(“/etc/ld.so.cache”, O_RDONLY) = 3
    fstat(3, {st_mode=S_IFREG|0644, st_size=43327, …}) = 0
    mmap(NULL, 43327, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b457a8e5000
    close(3) = 0
    open(“/lib64/libdl.so.2″, O_RDONLY) = 3

    Whith the time at the end, but a quick sed will correct that:

    /tmp> more /tmp/foutput | sed ‘s/\(.*\)\(\)$/\2 \1/’

    open(“/etc/ld.so.cache”, O_RDONLY) = 3
    fstat(3, {st_mode=S_IFREG|0644, st_size=43327, …}) = 0
    mmap(NULL, 43327, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b457a8e5000
    close(3) = 0
    open(“/lib64/libdl.so.2″, O_RDONLY) = 3

  7. Oops, system wordpress eat any part of the post enclosed between inferior and superior signs.
    I replace all these signs by Brackets :

    In linux we have :

    strace -T sqlplus / as sysdba [[EOF
    exit ;
    EOF

    open(“/etc/ld.so.cache”, O_RDONLY) = 3 [0.000014]
    fstat(3, {st_mode=S_IFREG|0644, st_size=43327, …}) = 0 [0.000010]
    mmap(NULL, 43327, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b457a8e5000 [0.000011]
    close(3) = 0 [0.000010]
    open(“/lib64/libdl.so.2″, O_RDONLY) = 3 [0.000015]

    more /tmp/yoyo | sed ‘s/\(.*\)\([.*]\)$/\2 \1/’

    [0.000014] open(“/etc/ld.so.cache”, O_RDONLY) = 3
    [0.000010] fstat(3, {st_mode=S_IFREG|0644, st_size=43327, …}) = 0
    [0.000011] mmap(NULL, 43327, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b457a8e5000
    [0.000010] close(3) = 0
    [0.000015] open(“/lib64/libdl.so.2″, O_RDONLY) = 3

  8. Hi Karlarao,

    We are facing the same issue.
    sqlplus “/as sysdba” and sqlplus user@dbname are slow.

    AWR report shows “os thread startup” as top event.

    Top 5 Timed Events Avg %Total
    ~~~~~~~~~~~~~~~~~~ wait Call
    Event Waits Time (s) (ms) Time Wait Class
    —————————— ———— ———– —— —— ———-
    os thread startup 237 224 945 200.8 Concurrenc

    —————-

    When I did a truss of sqlplus

    truss -d -v all sqlplus “/as sysdba”<< EOF
    exit;
    EOF

    I noticed that a jump in time happens at below part.
    ————————————-
    0.4892 getpid() = 20069 [20068]
    0.4896 so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, "", SOV_DEFAULT) = 9
    0.4898 ioctl(9, SIOCGIFNUM, 0xFFFFFFFF7FFF96F4) = 0
    read 4 bytes
    0.4902 ioctl(9, _IOWRN('i', 20, 8), 0xFFFFFFFF7FFF96E0) = 0
    write/read 8 bytes
    0.4905 ioctl(9, SIOCGIFFLAGS, 0x1001B7ED0) = 0
    write/read (struct ifreq)
    0.4908 ioctl(9, SIOCGIFFLAGS, 0x1001B7EF0) = 0
    write/read (struct ifreq)
    0.4910 ioctl(9, SIOCGIFADDR, 0x1001B7EF0) = 0
    write/read (struct ifreq)
    0.4913 ioctl(9, SIOCGARP, 0xFFFFFFFF7FFF96BC) = 0
    write/read (struct arpreq)
    0.4918 write(10, "03 w0603 s".., 887) = 887
    read(11, 0x1001CF066, 2064) (sleeping…)
    5.9906 read(11, "03F506\b".., 2064) = 1013
    5.9921 open("/ap01/apps/oracle/product/10.2.0.4/rdbms/mesg/oraus.msb", O_RDONLY) = 12
    5.9926 fcntl(12, F_SETFD, 0x00000001) = 0
    5.9928 lseek(12, 0, SEEK_SET) = 0
    5.9930 read(12, "1513 "011303\t\t".., 256) = 256
    5.9934 lseek(12, 512, SEEK_SET) = 512
    5.9936 read(12, "1A Q 5 C [ V uE085FC9480".., 512) = 512
    5.9939 lseek(12, 1024, SEEK_SET) = 1024
    5.9941 read(12, "18 $ 4 = G S".., 512) = 512
    5.9944 lseek(12, 50688, SEEK_SET) = 50688
    5.9946 read(12, "\f05 ] P05 ^".., 512) = 512
    —————————————

    Can you please give your thoughts on this issue ?

    Thanks,
    Saj

  9. Hi Karlarao,

    Need your help, i have same issue but different error.
    we have 10G R2.4 on RAC with 3 instances, but only on db3 have this problem, the other fine.

    i also run truss command :
    truss -d -v all sqlplus “/as sysdba”<< EOF
    exit;
    EOF

    =====================
    0.0909 so_socket(PF_INET, SOCK_DGRAM, IPPROTO_IP, "", SOV_DEFAULT) = 8
    0.0910 bind(8, 0x1001BA450, 16, SOV_SOCKBSD) = 0
    AF_INET name = 127.0.0.1 port = 0
    0.0911 getsockname(8, 0xFFFFFFFF7FFF7BC4, 0xFFFFFFFF7FFF7C0C, SOV_DEFAULT) = 0
    AF_INET name = 127.0.0.1 port = 55424
    0.0913 getpeername(8, 0xFFFFFFFF7FFF7BC4, 0xFFFFFFFF7FFF7C0C, SOV_DEFAULT) Err#134 ENOTCONN
    0.0914 getsockopt(8, SOL_SOCKET, SO_SNDBUF, 0xFFFFFFFF7FFF7D3C, 0xFFFFFFFF7FFF7D4C, SOV_DEFAULT) = 0
    0.0915 getsockopt(8, SOL_SOCKET, SO_RCVBUF, 0xFFFFFFFF7FFF7D3C, 0xFFFFFFFF7FFF7D4C, SOV_DEFAULT) = 0
    0.0915 fcntl(8, F_SETFD, 0x00000001) = 0
    0.0916 ioctl(8, FIONBIO, 0xFFFFFFFF7FFF7D8C) = 0
    write 4 bytes
    0.0919 brk(0x1001BCDF0) = 0
    0.0919 brk(0x1001C0DF0) = 0
    0.0921 sigaction(SIGCLD, 0x00000000, 0xFFFFFFFF7FFF63F0) = 0
    old: hand = 0x00000000 mask = 0 0 0 0 flags = 0x20000
    0.0922 lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    0.0922 sigaction(SIGCLD, 0xFFFFFFFF7FFF6190, 0xFFFFFFFF7FFF6288) = 0
    new: hand = 0x00000000 mask = 0x9FBEF057 0x0000FFF7 0 0 flags = 0x1000C
    old: hand = 0x00000000 mask = 0 0 0 0 flags = 0x20000
    0.0924 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    0.0924 pipe() = 9 [10]
    0.0925 pipe() = 11 [12]
    0.0936 fork1() = 35906
    0.0937 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    0.0940 lwp_sigmask(SIG_SETMASK, 0x00001000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    0.0941 sigaction(SIGPIPE, 0xFFFFFFFF7FFF6190, 0xFFFFFFFF7FFF6288) = 0
    new: hand = 0xFFFFFFFF7CFC9E4C mask = 0xFFBFFEFF 0x0000FFFF 0 0 flags = 0x000C
    old: hand = 0x00000000 mask = 0 0 0 0 flags = 0x0000
    0.0944 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    0.0944 close(9) = 0
    0.0945 close(12) = 0
    read(11, 0xFFFFFFFF7FFF63D0, 64) (sleeping…)
    3.1268 read(11, " N T P 0 3 5 9 0 6\n", 64) = 11
    3.1271 getpid() = 35905 [35904]
    3.1272 fcntl(11, F_SETFD, 0x00000001) = 0
    3.1275 write(10, " :0101 901 ,".., 58) = 58
    3.1276 write(10, "FE06 ( D".., 254) = 254
    3.1278 read(11, " 0201 9\f01".., 2064) = 32
    3.1279 getpid() = 35905 [35904]
    3.1279 write(10, "A806DEAD".., 168) = 168
    3.1281 read(11, "8B06DEAD".., 2064) = 139
    3.1282 brk(0x1001C0DF0) = 0
    3.1283 brk(0x1001C4DF0) = 0
    3.1289 getcwd("/export/home/oracle/truss", 256) = 0
    3.1292 write(10, " &060106".., 38) = 38
    3.1294 read(11, "B4060106".., 2064) = 180
    3.1296 open("/oracle/product/10.2.0/nls/data/lx2001f.nlb", O_RDONLY) = 9
    3.1298 read(9, "03\n 041905".., 92) = 92
    3.1299 read(9, "1F1F\t".., 6316) = 6313
    3.1300 close(9) = 0
    3.1301 brk(0x1001C4DF0) = 0
    3.1302 brk(0x100210DF0) = 0
    3.1305 open("/oracle/product/10.2.0/nls/data/lx207d0.nlb", O_RDONLY) = 9
    3.1307 read(9, "04\n 0404B5 :".., 92) = 92
    3.1324 read(9, "07D001 !".., 308452) = 308452
    3.1325 close(9) = 0
    3.1326 write(10, " E06021F".., 69) = 69
    3.1341 read(11, "16060280".., 2064) = 22
    3.1359 write(10, " -0603 ;".., 45) = 45
    3.1360 read(11, "CE06\b".., 2064) = 206
    3.1363 sigaction(SIGTSTP, 0xFFFFFFFF7FFFA3B0, 0xFFFFFFFF7FFFA4A8) = 0
    new: hand = 0x00000001 mask = 0x9FBEF057 0x0000FFF7 0 0 flags = 0x000C
    old: hand = 0x00000000 mask = 0 0 0 0 flags = 0x0000
    3.1365 fstat(0, 0xFFFFFFFF7FFFA550) = 0
    d=0x0000016600000002 i=138007066 m=0100600 l=0 u=200 g=100 sz=6
    at = Mar 12 03:23:18 WIT 2010 [ 1268338998 ]
    mt = Mar 12 03:23:18 WIT 2010 [ 1268338998 ]
    ct = Mar 12 03:23:18 WIT 2010 [ 1268338998 ]
    bsz=8192 blks=16 fs=tmpfs
    3.1367 ioctl(0, TCGETA, 0xFFFFFFFF7FFFA48C) Err#25 ENOTTY
    3.1368 getpid() = 35905 [35904]
    3.1368 getuid() = 200 [200]
    3.1369 getuid() = 200 [200]
    3.1370 getuid() = 200 [200]
    3.1371 door_info(7, 0xFFFFFFFF7FFFA178) = 0
    target=50880 proc=0x2D8BC data=0xDEADBEED
    attributes=DOOR_UNREF
    uniquifier=260050
    3.1373 door_call(7, 0xFFFFFFFF7FFFA270) = 0
    data_ptr=0xFFFFFFFF7C400000 data_size=222
    desc_ptr=0x0 desc_num=0
    rbuf=0xFFFFFFFF7C400000 rsize=16384
    3.1377 getpid() = 35905 [35904]
    3.1379 so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, "", SOV_DEFAULT) = 9
    3.1380 ioctl(9, SIOCGIFNUM, 0xFFFFFFFF7FFF9C44) = 0
    read 4 bytes
    3.1381 ioctl(9, _IOWRN('i', 20, 8), 0xFFFFFFFF7FFF9C30) = 0
    write/read 8 bytes
    3.1381 ioctl(9, SIOCGIFFLAGS, 0x1001A97C0) = 0
    write/read (struct ifreq)
    3.1382 ioctl(9, SIOCGIFFLAGS, 0x1001A97E0) = 0
    write/read (struct ifreq)
    3.1383 ioctl(9, SIOCGIFADDR, 0x1001A97E0) = 0
    write/read (struct ifreq)
    3.1384 ioctl(9, SIOCGARP, 0xFFFFFFFF7FFF9C0C) = 0
    write/read (struct arpreq)
    3.1385 write(10, "01A80603 s".., 424) = 424
    3.1693 read(11, "06 h06\b".., 2064) = 1640
    3.1698 open("/oracle/product/10.2.0/rdbms/mesg/oraus.msb", O_RDONLY) = 12
    3.1700 fcntl(12, F_SETFD, 0x00000001) = 0
    3.1701 lseek(12, 0, SEEK_SET) = 0
    3.1701 read(12, "1513 "011303\t\t".., 256) = 256
    3.1702 lseek(12, 512, SEEK_SET) = 512
    3.1703 read(12, "1A Q 5 C [ V uE085FC9480".., 512) = 512
    3.1704 lseek(12, 1024, SEEK_SET) = 1024
    3.1705 read(12, "18 $ 4 = G S".., 512) = 512
    3.1706 lseek(12, 50688, SEEK_SET) = 50688
    3.1706 read(12, "\f05 ] P05 ^".., 512) = 512
    3.1708 lseek(12, 512, SEEK_SET) = 512
    3.1708 read(12, "1A Q 5 C [ V uE085FC9480".., 512) = 512
    3.1709 lseek(12, 1024, SEEK_SET) = 1024
    3.1710 read(12, "18 $ 4 = G S".., 512) = 512
    3.1711 lseek(12, 50688, SEEK_SET) = 50688
    3.1711 read(12, "\f05 ] P05 ^".., 512) = 512
    3.1713 lseek(12, 512, SEEK_SET) = 512
    3.1713 read(12, "1A Q 5 C [ V uE085FC9480".., 512) = 512
    3.1714 lseek(12, 1024, SEEK_SET) = 1024
    3.1715 read(12, "18 $ 4 = G S".., 512) = 512
    3.1716 lseek(12, 50688, SEEK_SET) = 50688
    3.1716 read(12, "\f05 ] P05 ^".., 512) = 512
    3.1718 lseek(12, 512, SEEK_SET) = 512
    3.1718 read(12, "1A Q 5 C [ V uE085FC9480".., 512) = 512
    3.1719 lseek(12, 1024, SEEK_SET) = 1024
    3.1720 read(12, "18 $ 4 = G S".., 512) = 512
    3.1722 lseek(12, 51712, SEEK_SET) = 51712
    3.1723 read(12, "\n0589 D058A".., 512) = 512
    3.1724 close(12) = 0
    3.1727 write(10, " 3.1774 write(1, ” S Q L > “, 5) = 5
    3.1775 ioctl(0, TCGETA, 0xFFFFFFFF7FFFA5CC) Err#25 ENOTTY
    3.1776 fstat(0, 0xFFFFFFFF7FFFA690) = 0
    d=0x0000016600000002 i=138007066 m=0100600 l=0 u=200 g=100 sz=6
    at = Mar 12 03:23:18 WIT 2010 [ 1268338998 ]
    mt = Mar 12 03:23:18 WIT 2010 [ 1268338998 ]
    ct = Mar 12 03:23:18 WIT 2010 [ 1268338998 ]
    bsz=8192 blks=16 fs=tmpfs
    3.1777 fstat(0, 0xFFFFFFFF7FFFA560) = 0
    d=0x0000016600000002 i=138007066 m=0100600 l=0 u=200 g=100 sz=6
    at = Mar 12 03:23:18 WIT 2010 [ 1268338998 ]
    mt = Mar 12 03:23:18 WIT 2010 [ 1268338998 ]
    ct = Mar 12 03:23:18 WIT 2010 [ 1268338998 ]
    bsz=8192 blks=16 fs=tmpfs
    3.1779 read(0, ” e x i t ;\n”, 8192) = 6

    =======================

    Need your help to read what makes the problem?

    Rgrds,
    Riki

  10. Hi Karl,

    Thanks for the info. I’m googling for OS Thread startup” wait event info, and u have provided a good way of debugging the issue.
    thaks, srini.

  11. Man .. Beautiful .. Wonderful .. I will bookmark your site and take the feeds additionallyI am satisfied to find so many helpful information right here within the put up, we want work out more strategies on this regard, thank you for sharing. . . . . .

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s