Karl Arao's Blog

Just another weblog about Oracle,Linux,Troubleshooting,etc..etc..

Archive for April, 2009

Advanced Oracle Troubleshooting by Tanel Poder in Singapore

Posted by karlarao on April 20, 2009

When I first saw that Tanel will conduct his seminar in Singapore, I told myself that I would even spend my own money just to be on that training! I’ve already read performance books like Optimizing Oracle Performance, Oracle 8i Internal Services, Forecasting Oracle Performance… And after that I still want more, and I still have questions that need to be answered. Well, if you’re on a tight budget you just opt to download some more docs/books to do multiple reads coupled with research/test cases and also reading through others blog…
But thanks to my boss for the funding, I was there! :-)

The seminar has 8 chapters,

1) End to end lifecycle of a database request (33 slides)
2) Oracle, OS, and hardware interaction (39 slides)
3) Oracle kernel core services (39 slides)
4) OWI and kernel instrumentation framework (34 slides)
5) Lifecycle of a cursor (27 slides)
6) SQL execution row source operators and data flow (34 slides)
7) Logical and Physical IO internals (37 slides)
8 ) Lifecycle of a transaction (32 slides)

On the first day we were able to tackle the three chapters…

The 1st was about how to “Systematically Troubleshoot” performance problems. That Oracle is just a program and every execution that is happening is not magic, using the right tool for the right problem you’ll be able to pinpoint the root cause. Also he emphasized on how important it is to make the user’s session a starting point on diagnosis and he demo’d on how to make use of v$system_wait, snapper, ASH, v$sesstat, and of course if the built-in instrumentation is not enough… the process stack sampling is very useful that will tell you what kernel function where most of the time is spent…he also taught us how to read the stack tree using that magic Metalink note (175982.1) ;)

2nd, dives into the OS troubleshooting, how you’ll systematically approach memory and CPU starvation on both Oracle and OS drilling down to the processes using Oracle (heapdump) and OS tools (ps, pstack, strace, procmon.exe for Windows – yeah you could also do process stack samples, pmap, top, vmstat)

3rd, was more of the internals where he discussed the core services of Oracle Kernel starting with the KGH module, the shared pool, and some more heaps, heap dumps and state objects, some X$ tables, and of course the most important part of the chapter… the Latching. He demo’d on how to systematically troubleshoot latching problems using latchprof, latchprofx, and snapper.

Then… At the end of the day, I was able to discuss with Tanel the concept of “performance visualization” see links below. I mean, yeah you can do reactive tuning and be very good at internals… but it doesn’t stop there… you have to be proactive, and visualizing/trending your performance numbers will help you on that. I’ve read his paper with Neil Gunther about how Multidimensional Visualization helps on analysing Oracle session level performance data that sometimes the usual x-y charts could mislead. He showed me some cool 3d performance monitoring tools running on JVM which is still on the works, but it’s really cool ;)

And we also discussed about “capacity planning” because Neil Gunther is the author of the “Super Serial Scalability model” which he worked with (on the Visualization CMG paper) and Craig Shallahamer discussed it on the Chapter 10 (Scalability) of his book which is very interesting. Well, Tanel will still have to find out how efficient or realistic the “forecasting models” will be when applied with Oracle performance data. Well me too! But it still trickles down to response time and visualization ;)

The second day

We started with the Oracle Wait Interface, he answered my question on how would you know if it’s a syscall? Usercall? Or kernel function call?… he just used pstack, pmap, and the magic Metalink note. Cool ;) Then he explained why the v$session is the most important view in 10g above. He showed us how to pinpoint if it’s a pl/sql that is executing or just plain SQL; yeah he has scripts for that! … the “sample” script is very useful for finding out which SQLs are problematic. He ended the chapter with identifying the session whether it is client-server or N-tiered application and demo’d on how to make use of dbms_session and dbms_monitor and how to trace these sessions.

5th discussed about lifecycle of cursors. Tanel discussed the internals of Library Cache and how applications live and die by how you do cursor management ;) He also got good scripts here!

6th for me is the most interesting, this is where you start to glue together the info about the Oracle and OS interaction from day1 that Oracle is just a big complex C program… he started the chapter by clearly explaining how to properly get and read the explain plan (root -> branch -> leaves and row source -> filters -> access path) and associating each line with the Oracle kernel functions which you could get by doing a process stack and feeding the output to his “os_explain” tool…cool ;) Then he ended the chapter with SQL plan profiling using some of his formatted explain plan scripts…

Unfortunately we ran out of time… we still have two chapters left! Tanel have to catch a flight to Sydney at 8pm. But he told us he will be blogging more about the later chapters, and at least we have his module… So the last demo was his “perfsheet” ;) a very good visualization tool. Nice job with the pivot tables and 3d charts!

All in all, the popular geek from Estonia showed me how Oracle really works! (well, you owe us two chapters :p)

BTW, Porus Homi Havewala (http://enterprise-manager.blogspot.com) was also on the class…

Below are some of the seminar pictures:
Tanel Poder AOT

Tanel Poder AOT

Tanel Poder AOT

Tanel Poder AOT

Below are some links about Visualization:

Neil Gunther & Tanel Poder – Multidimensional Visualization of Oracle Performance using Barry007

http://arxiv.org/pdf/0809.2532

Tanel Poder’s Visualization Tool

http://blog.tanelpoder.com/2008/12/28/performance-visualization-made-easy-perfsheet-20-beta/

http://blog.tanelpoder.com/2009/02/14/performance-visualization-capacity-planning-and-hotsos-symposium/

Neil Gunther – A.A. Michelson Award Acceptance Speech

http://www.perfdynamics.com/Test/aamspeech.pdf

Neil Gunther’s Blog

http://perfdynamics.blogspot.com/2008/12/gunther-receives-aa-michelson-award.html

Posted in Performance | Tagged: , | 2 Comments »

OS Thread Startup

Posted by karlarao on April 6, 2009

Recently I encountered a performance problem scenario where a simple sqlplus “/ as sysdba” took about 2minutes to finish, this is critical to the client’s business because they have a local C program that loads Call Detail Reports on the database making use of local authentication for most of its operations and Sql*Loader to load the data, so this “2minutes of waiting” when accumulated greatly consumes significant time on their operations and greatly impacts the business.

When I arrived on the client I first checked the alert logs of both ASM (they have a separate home for ASM) and RDBMS, there were no errors…

Then I checked on the server to see if there were any CPU, IO, memory, swap, and network bottlenecks going on

The CPU run queue was zero and most of the time 90% idle

The disks were also most of the time idle

The memory utilization was low with 430MB free

Real memory Usage
-----------------------------
Total Installed : 6.00000 GB
Total Used : 5.57032 GB
Total Free : 439.97264 MB
REAL MEMORY USAGE : % : 92.800
-----------------------------

The paging on the server is not significant which is 1.1GB out of 16GB

Paging Space Usage
-----------------------------
Total Pg Space : 16640.00000 MB
Pg Space Used : 1187.77340 MB
Percent Used PAGING : 7.100
-----------------------------

Then I did sqlplus “/ as sysdba” to do some SELECTs on some views (…yeah it really took about 2minutes…)

I looked at the free memory of the SGA memory structures… well, I think SGA is not the issue but I still have to find out…

select pool, round(bytes/1024/1024,2) size_mb from v$sgastat where name like '%free memory%';

POOL            SIZE_MB
------------ ----------
shared pool      655.57
large pool        302.6
java pool            16
streams pool      15.99

At that time, there were only 4 users connected (that includes me) to the database and the C program was not running. I suspected that there are some contentions (latch) or concurrency problems that I’ll see on the wait interface… here’s a portion of the AWR report

   Elapsed:               60.11 (mins)
   DB Time:              121.50 (mins)

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
os thread startup                     3,278       3,115    950   42.7 Concurrenc
latch free                            6,298       1,822    289   25.0      Other
flashback buf free by RVWR            1,771       1,721    972   23.6      Other
SQL*Net more data to client          60,087       1,574     26   21.6    Network
CPU time                                          1,088          14.9
          -------------------------------------------------------------

The DB Time was 3940.8 seconds (121.50 * 60) and the top event which is the “os thread startup” consumes 83% (3276/3940.8) of the total DB Time

And just for comparison, below is their normal workload which is generated on the same period (but different day) as the above

   Elapsed:               59.52 (mins)
   DB Time:               72.42 (mins)

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                          2,009          46.2
db file scattered read              151,781         423      3    9.7   User I/O
db file sequential read              66,212         301      5    6.9   User I/O
latch: shared pool                      384          67    175    1.5 Concurrenc
log file sequential read                404          19     46    0.4 System I/O
          -------------------------------------------------------------

Hmmm so what is “os thread startup”? The ADDM report also outputs this event as a top finding

FINDING 1: 79% impact (3113 seconds)
------------------------------------
Wait event "os thread startup" in wait class "Concurrency" was consuming
significant database time.

   RECOMMENDATION 1: Application Analysis, 79% benefit (3113 seconds)
      ACTION: Investigate the cause for high "os thread startup" waits. Refer
         to Oracle's "Database Reference" for the description of this wait
         event. Use given SQL for further investigation.
      RATIONALE: The SQL statement with SQL_ID "4gd6b1r53yt88" was found
         waiting for "os thread startup" wait event.
         RELEVANT OBJECT: SQL statement with SQL_ID 4gd6b1r53yt88

   RECOMMENDATION 2: Application Analysis, 79% benefit (3113 seconds)
      ACTION: Investigate the cause for high "os thread startup" waits in
         Service "SYS$BACKGROUND".

   SYMPTOMS THAT LED TO THE FINDING:
      SYMPTOM: Wait class "Concurrency" was consuming significant database
               time. (80% impact [3156 seconds])

The event is not even documented on 8i, 9i, 10g, 11g official doc… I found some articles when I Google’d but they are not related to my issue… and Metalink does not have any helpful documents on this event for my situation…

There was already a Service Request regarding the issue, and the Support Engineer defined it as

“This wait event might be seen if the database server is executing on a platform that supports multi-threading. We enter this waiting state while a thread is starting up and leave the wait state when the thread has started or the startup request is cancelled.
This indicates some high contention at OS level avoiding even new process startup.
Issue is related to OS please involve system admin to solve same.”

The manager decided that we do a hard shutdown of the LPAR (we’re on P570 AIX 5.3) from the management console, and let’s see what it can do…

The DBA told me that once they are connected, all the operations are okay (DML, DDL, SQL*Loader).

So after the restart we tried to run the program for a while just to observe the behavior and what’s going on… apparently the local authentication is still slow!

For me, the challenge is how to have a response time profile of a session from the time it starts to connect with local authentication up to the time the session is connected so I would know what part of the operation is taking a lot of time.

What I remember, is when I was doing some queries to do some drill down on a session that was connecting using sqlplus “/ as sysdba” I see the background processes being spawned (QMNC, ARCn, etc..) are taking a lot of time in WAITING state with SECONDS_IN_WAIT reaching up to 20 seconds and has a wait event of “os thread startup”, then after a while the user gets connected then he’s now ready to do his thing without problems…Meaning after he’s connected there are no more complains…

SQL> select w.sid, s.sql_hash_value, s.program, w.event, w.wait_time/100 t, w.seconds_in_wait seconds_in_wait, w.state, w.p1, w.p2, w.p3
  2  from v$session s, v$session_wait w
  3  where s.sid = w.sid
  4  and w.state = 'WAITING'
  5  and w.event like '%os thread startup%'
  6  order by 6 asc;

   SID SQL_HASH_VALUE PROGRAM                 EVENT                       T SECONDS_IN_WAIT STATE                     P1         P2         P3
------ -------------- ----------------------- ------------------ ---------- --------------- ----------------- ---------- ---------- ----------
   121              0 oracle@ps570roc01 (QMNC os thread startup)          0              18 WAITING                    0          0          0

To know what’s going on when a user connects and for me to account the operation where it takes significant time; I created a database user TRACE_USER and created a logon trigger that invokes Extended SQL Trace Level 12. This will give me the timing information that I need when the user is connecting to the database.

To create the user:

create user trace_user identified by trace_user account unlock;
grant create session,dba to trace_user;

To create the logon trigger:

create or replace trigger sys.set_trace
after logon on database
when (user = 'trace_user')
declare
lcommand varchar(200);
begin
execute immediate 'alter session set statistics_level=all';
execute immediate 'alter session set max_dump_file_size=unlimited';
execute immediate 'alter session set events ''10046 trace name context forever, level 12''';
end set_trace;
/

After the user and the trigger are created, logon as the TRACE_USER by doing the following command:

sqlplus "trace_user/trace_user" << eof
exit;
eof

This will generate an Extended SQL Trace (10046) trace file under the $ORACLE_BASE/admin/<db_name>/udump directory

Then see if there’s significant time on the database calls (PARSE,BIND,EXEC,FETCH) by looking at the “e=” section which means elapsed duration consumed by the database call

And check if there’s significant time on wait events by looking at the “ela=” section which means elapsed duration consumed by the wait event (probably by doing a system call)

Below is a small portion of the trace file generated, we are interested on the highlighted portions…

PARSING IN CURSOR #5 len=131 dep=1 uid=0 oct=3 lid=0 tim=15687899202263 hv=1389591971 ad='ca9a7948'
select privilege#,nvl(col#,0),max(mod(nvl(option$,0),2))from objauth$ where obj#=:1 and grantee#=:2 group by privilege#,nvl(col#,0)
END OF STMT
EXEC #5:c=0,e=145,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=15687899202260
WAIT #5: nam='db file sequential read' ela= 30005 file#=1 block#=24208 blocks=1 obj#=-1 tim=15687899232346
FETCH #5:c=0,e=30075,p=1,cr=2,cu=0,mis=0,r=0,dep=1,og=4,tim=15687899232396

There were no database calls that consumed significant time (numbers below are microseconds, that’s for 9i and above):

less oracle_ora_1118270.trc | grep -i ",e=" | cut -d , -f2 > karlarao.txt ; sed -n 's/e=/ /p' karlarao.txt | sort -nr | less

 358746
 183162
 61293
 44661
 32580
 30075
 28695
 26950
 25837
 24244
 23519
 20543
 20132
 19449
 19290
 18809
 18250
 17230
 16507
 16349
 13438
 11716
 11690
 11284
 10137

There were no wait events that consumed significant time (numbers below are microseconds, that’s for 9i and above):

less oracle_ora_1118270.trc | grep -i "ela=" | cut -d " " -f8 | sort -nr | less

30005
28624
13253
11592
9650

There was no “os thread startup” event!

less oracle_ora_1118270.trc | grep -i "ela=" | cut -d "=" -f2 | uniq

'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'db file sequential read' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'db file sequential read' ela
'SQL*Net message to client' ela
'db file sequential read' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'db file sequential read' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela
'SQL*Net message to client' ela
'SQL*Net message from client' ela

I need to have some instrumentation on the system calls, so I’ll use a UNIX tool,

presenting… TRUSS…

This tool will let you do some application tracing, for us we are interested in tracing SQL*Plus, and display the calls that an application makes to external libraries and the kernel… sounds cool :)

A “man truss” will give you an overview, but if you want more detail there are some good references at the IBM site or Metalink

Execute the following commands (one at a time):

truss -c sqlplus "/ as sysdba" << EOF
exit;
EOF

truss -d sqlplus "/ as sysdba" << EOF
exit;
EOF

The “-c” switch counts traced system calls, faults, and signals

The “-d” switch displays the timestamp in seconds on each line. The first line of the trace output will show the base time from which the individual time stamps are measured

Bwahaha, Now I know the problem!

The DNS service was causing the slow down on the sqlplus “/ as sysdba” operation and also the high “os thread startup” waits

The text below is the portion of “truss” output where the slow down occurs denoted by “(sleeping…)” events when reading on the /etc/resolv.conf and /etc/hosts. Please see the highlighted region below

ENOTTY and ECONNREFUSED are the errors that you’ll see, you’ll find more details on the error messages on this file /usr/include/sys/errno.h
Below are their meanings

#define	ENOTTY	25	/* Inappropriate I/O control operation	*/
#define ECONNREFUSED    79      /* Connection refused */

The portion of “truss -d” where the slow down occurs, you’ll see a leap on the timing information from 2.3379 seconds (reading /etc/resolv.conf) –> 7.3477 seconds –> 17.3489 seconds –> 37.3555 seconds (reading /etc/hosts) –> 70.3863 seconds (already connected)

The timing information below may not exactly match the real “time” that the user is waiting, it’s because the SQL*Plus is being probed to know the part of the underlying system calls where there are significant waits

0.3172:        gethostname(0x0FFFFFFFFFFF9920, 256) = 0
0.3175:        _getpid()                        = 1102056
0.3177:        open("/etc/resolv.conf", O_RDONLY) = 8
0.3180:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
0.3182:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 69
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 0
0.3190:        close(8)                         = 0
0.3194:        socket(2, 2, 0)                  = 8
0.3197:        getsockopt(8, 65535, 4104, 0x0FFFFFFFFFFF70E4, 0x0FFFFFFFFFFF70E0) = 0
0.3199:        connext(8, 0x09001000A0018600, 16) = 0
0.3202:        send(8, 0x0FFFFFFFFFFF7FF0, 41, 0) = 41
0.3205:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3208:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) Err#79 ECONNREFUSED
0.3210:        close(8)                         = 0
0.3212:        socket(2, 2, 0)                  = 8
0.3215:        sendto(8, 0x0FFFFFFFFFFF7FF0, 41, 0, 0x09001000A0018610, 16) = 41
0.3217:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3220:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) = 108
0.3222:        close(8)                         = 0
0.3224:        socket(2, 2, 0)                  = 8
0.3227:        getsockopt(8, 65535, 4104, 0x0FFFFFFFFFFF70E4, 0x0FFFFFFFFFFF70E0) = 0
0.3229:        connext(8, 0x09001000A0018600, 16) = 0
0.3231:        send(8, 0x0FFFFFFFFFFF7FF0, 28, 0) = 28
0.3233:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3236:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) Err#79 ECONNREFUSED
0.3238:        close(8)                         = 0
0.3240:        socket(2, 2, 0)                  = 8
0.3243:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018610, 16) = 28
0.3245:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 1
0.3248:        nrecvfrom(8, 0x0FFFFFFFFFFF9320, 1024, 0, 0x0FFFFFFFFFFF7950, 0x0FFFFFFFFFFF7178) = 28
0.3250:        close(8)                         = 0
0.3252:        socket(2, 2, 0)                  = 8
0.3373:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018600, 16) = 28
2.3379:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) (sleeping...)
2.3379:        _poll(0x0FFFFFFFFFFF7190, 1, 5000) = 0
5.3468:        close(8)                         = 0
5.3470:        socket(2, 2, 0)                  = 8
5.3472:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018600, 16) = 28
7.3477:        _poll(0x0FFFFFFFFFFF7190, 1, 10000) (sleeping...)
7.3477:        _poll(0x0FFFFFFFFFFF7190, 1, 10000) = 0
15.3479:        close(8)                        = 0
15.3482:        socket(2, 2, 0)                 = 8
15.3484:        sendto(8, 0x0FFFFFFFFFFF7FF0, 28, 0, 0x09001000A0018600, 16) = 28
17.3489:        _poll(0x0FFFFFFFFFFF7190, 1, 20000) (sleeping...)
17.3489:        _poll(0x0FFFFFFFFFFF7190, 1, 20000) = 0
35.3491:        close(8)                        = 0
35.3495:        getdomainname(0x0FFFFFFFFFFF9480, 256) = 0
35.3497:        open("/etc/hosts", O_RDONLY)    = 8
35.3500:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
35.3502:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
35.3505:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 3453
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 0
35.3514:        close(8)                        = 0
kwrite(9, "01 �0603 s".., 432)    = 432
37.3555:        kread(10, "16060280".., 2064) (sleeping...)
kread(10, "06 E06\b".., 2064)   = 1605
70.3762:        open("/oracle/app/product/10.2/db_1/rdbms/mesg/oraus.msb", O_RDONLY) = 8
70.3764:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
70.3767:        lseek(8, 0, 0)                  = 0
kread(8, "1513 "011303\t\t".., 256)     = 256
70.3772:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3776:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3781:        lseek(8, 50688, 0)              = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
70.3786:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3790:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3795:        lseek(8, 50688, 0)              = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
70.3799:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3804:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3808:        lseek(8, 50688, 0)              = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
70.3813:        lseek(8, 512, 0)                = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
70.3817:        lseek(8, 1024, 0)               = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
70.3822:        lseek(8, 51712, 0)              = 51712
kread(8, "\n0589 D058A".., 512)     = 512
70.3827:        close(8)                        = 0
kwrite(9, " <0611 k".., 60)     = 60
kread(10, " �06\b".., 2064)   = 179
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
70.3854:        lseek(4, 512, 0)                = 512
kread(4, "17 �".., 512)     = 512
70.3858:        lseek(4, 1024, 0)               = 1024
kread(4, "16 * R h819E".., 512)     = 512
70.3863:        lseek(4, 4096, 0)               = 4096
kread(4, "\f82 P86".., 512)     = 512

kwrite(1, "\n", 1)                              = 1
Connected to:
kwrite(1, " C o n n e c t e d   t o".., 14)     = 14
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
kwrite(1, " O r a c l e   D a t a b".., 77)     = 77
With the Partitioning, OLAP, Data Mining and Real Application Testing options
kwrite(1, " W i t h   t h e   P a r".., 78)     = 78

kwrite(1, "\n", 1)                              = 1
70.3882:        kfcntl(1, F_GETFL, 0x0000000000000008) = 2
70.3886:        __libc_sbrk(0x0000000000030020) = 0x000000001023C880
70.3892:        access("login.sql", 0)          Err#2  ENOENT
70.3895:        access("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0) = 0
70.3898:        statfs("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0x0FFFFFFFFFFFDA10) = 0

Apparently there was really a problem with the DNS when it was checked by the System Administrator. He just altered some network related files and instantly we checked on the local authentication… and we could quickly do a local authentication again! Whooho! and everybody was happy :)

Below is the good “truss” output of the local authentication, which takes about 3 seconds from the timing information

3.0713:        gethostname(0x0FFFFFFFFFFF9900, 256) = 0
3.0716:        _getpid()                        = 2539762
3.0719:        open("/etc/resolv.conf", O_RDONLY) = 8
3.0721:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
3.0724:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 70
kread(8, " d o m a i n     this is the domain name, i removed it".., 4096)    = 0
3.0731:        close(8)                         = 0
3.0735:        socket(2, 2, 0)                  = 8
3.0740:        getsockopt(8, 65535, 4104, 0x0FFFFFFFFFFF70C4, 0x0FFFFFFFFFFF70C0) = 0
3.0742:        connext(8, 0x09001000A0018600, 16) = 0
3.0746:        send(8, 0x0FFFFFFFFFFF7FD0, 41, 0) = 41
3.0750:        _poll(0x0FFFFFFFFFFF7170, 1, 5000) = 1
3.0753:        nrecvfrom(8, 0x0FFFFFFFFFFF9300, 1024, 0, 0x0FFFFFFFFFFF7930, 0x0FFFFFFFFFFF7158) = 108
3.0756:        send(8, 0x0FFFFFFFFFFF7FD0, 28, 0) = 28
3.0758:        _poll(0x0FFFFFFFFFFF7170, 1, 5000) = 1
3.0761:        nrecvfrom(8, 0x0FFFFFFFFFFF9300, 1024, 0, 0x0FFFFFFFFFFF7930, 0x0FFFFFFFFFFF7158) = 28
3.0764:        close(8)                         = 0
3.0767:        getdomainname(0x0FFFFFFFFFFF9460, 256) = 0
3.0769:        open("/etc/hosts", O_RDONLY)     = 8
3.0771:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
3.0774:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
3.0777:        kioctl(8, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 3453
kread(8, " #   @ ( # ) 4 7\t 1 . 1".., 4096)    = 0
3.0786:        close(8)                         = 0
kwrite(9, "01 �0603 s".., 432)    = 432
kread(10, "06 H06\b".., 2064)   = 1608
3.0973:        open("/oracle/app/product/10.2/db_1/rdbms/mesg/oraus.msb", O_RDONLY) = 8
3.0975:        kfcntl(8, F_SETFD, 0x0000000000000001) = 0
3.0978:        lseek(8, 0, 0)                   = 0
kread(8, "1513 "011303\t\t".., 256)     = 256
3.0983:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.0988:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.0993:        lseek(8, 50688, 0)               = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
3.0997:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.1002:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.1006:        lseek(8, 50688, 0)               = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
3.1011:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.1015:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.1021:        lseek(8, 50688, 0)               = 50688
kread(8, "\f05 ] P05 ^".., 512)     = 512
3.1025:        lseek(8, 512, 0)                 = 512
kread(8, "1A Q 5 C [ V u �85 �9480".., 512)     = 512
3.1029:        lseek(8, 1024, 0)                = 1024
kread(8, "18 $ 4 = G S".., 512)     = 512
3.1034:        lseek(8, 51712, 0)               = 51712
kread(8, "\n0589 D058A".., 512)     = 512
3.1038:        close(8)                         = 0
kwrite(9, " <0611 k".., 60)     = 60
kread(10, " �06\b".., 2064)   = 179
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
kwrite(9, " U0603 h".., 85)     = 85
kread(10, "1606\b".., 2064)   = 22
3.1064:        lseek(4, 512, 0)                 = 512
kread(4, "17 �".., 512)     = 512
3.1069:        lseek(4, 1024, 0)                = 1024
kread(4, "16 * R h819E".., 512)     = 512
3.1073:        lseek(4, 4096, 0)                = 4096
kread(4, "\f82 P86".., 512)     = 512

kwrite(1, "\n", 1)                              = 1
Connected to:
kwrite(1, " C o n n e c t e d   t o".., 14)     = 14
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
kwrite(1, " O r a c l e   D a t a b".., 77)     = 77
With the Partitioning, OLAP, Data Mining and Real Application Testing options
kwrite(1, " W i t h   t h e   P a r".., 78)     = 78

kwrite(1, "\n", 1)                              = 1
3.1095:        kfcntl(1, F_GETFL, 0x0000000000000008) = 2
3.1099:        __libc_sbrk(0x0000000000030020)  = 0x000000001023C880
3.1105:        access("login.sql", 0)           Err#2  ENOENT
3.1109:        access("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0) = 0
3.1112:        statfs("/oracle/app/product/10.2/db_1/sqlplus/admin/glogin.sql", 0x0FFFFFFFFFFFD9F0) = 0

Posted in Performance | Tagged: , , , , | 15 Comments »

Single Instance and RAC Kernel/OS upgrade

Posted by karlarao on April 4, 2009

This document will serve as a guide for the Kernel and OS upgrade activities for

  1. Single Instance on ASM using raw devices
  2. RAC with ASM (using ASMlib) and OCFS2

Upgrading the Kernel and OS is easy and will just need some few commands. The critical part is the dependencies once the Kernel gets updated, so if you’re using ASMlib and OCFS2 you’ll notice that after the upgrade they’re not working anymore… you can’t startup the ASM, then if your OCR and Voting Disk are on OCFS2 the CRS stack wont start all because the RPMs of ASMlib and OCFS2 are kernel dependent, also there are similar components/softwares that are kernel dependent so you have to check them out and do a risk analysis before doing the upgrade.

… If that’s the case, why do you upgrade your Kernel or OS? … Normally you don’t do this often on your production environment considering that you have 24/7 operations but if the business permits it’s ideal to make an upgrade policy on your OS/systems. Below are some of the “drivers why you need to upgrade”:

  • You need to upgrade the firmware of your SAN environment that requires you to be in a specific OS version
  • You need to upgrade the firmware of your HBA that needs a specific kernel version
  • You need to upgrade your RAID controller’s firmware that requires you to be in a specific kernel or OS version
  • Because the hardware/software vendor’s certification matrix is stating that you need to be in a specific kernel or OS version to be on a certified configuration
  • And many more…

So after reading this document, you know what to do/check once your boss/client/vendor tell you to upgrade the kernel or OS with an Oracle Database Single Instance or RAC running.

To view the document, click here

Posted in Installation, Linux, RAC, Upgrade | Tagged: , , , | Leave a Comment »