Diagnosing and Resolving “gc block lost” 34

Last week, one of our clients had a sudden slow down on all of their applications which is running on two node RAC environment

Below is the summary of the setup:
- Server and Storage: SunFire X4200 with LUNs on EMC CX300
- OS: RHEL 4.3 ES
- Oracle 10.2.0.3 (database and clusterware)
- Database Files, Flash Recovery Area, OCR, and Voting disk are located on OCFS2 filesystems
- Application: Forms and Reports (6i and also lower)

As per the DBA, the workload on the database was normal and there were no changes on the RAC nodes and on the applications. Hmm, I can’t really tell because I haven’t really looked into their workload so I don’t have past data to compare.

I first setup the OS Watcher on both nodes (with 60sec snapshot interval) to have an end-to-end view of the server performance while I’m doing the diagnosis on the database. It’s a cool tool and automatically gathers a lot of info when you start it (cpu, io, memory, network, process info), also this tool has a graphing facility which makes it faster to detect spikes and any trends.

Then, I looked into the database parameters and compared it with the RDA of the database I had before. This is just to check if there are any changes on the parameters that might cause the sudden slow down. Hmm…there was nothing new

Then, I queried on the V$SESSION

SQL> select count(*) from v$session;		-- server1

  COUNT(*)
----------
       385

SQL> select count(*) from v$session;		-- server2

  COUNT(*)
----------
	49

.
I was surprised! The session count on both servers show that there are more users connected on server1 (almost 89% of the total users). This could be because of
1) the clients having lower versions (< Sql*Plus 8.1 or OCI8, see Note 97926.1) that may not support TAF (FAILOVER_MODE) and Load Balancing (LOAD_BALANCE) or
2) they are using TNS entries explicitly connecting to server1
Since the RAC database has different applications connected to it, knowing which applications are on server1 and how many they are will be enough to determine which users on particular applications could be transferred to server2 to at least balance the number of users on both nodes, but doing this may still not solve the issue as I don’t have enough info that this is the root cause of the problem..

The query below shows that all the users don’t have FAILOVER capabilities which could be attributed by the two points I mentioned above

select distinct s.inst_id instance_id, s.failover_type failover_type, s.failover_method
failover_method, s.failed_over failed_over
from gv$session s;
INSTANCE_ID FAILOVER_TYPE        FAILOVER_METHOD      FAILED_OVER
----------- -------------------- -------------------- --------------------
          1 NONE                 NONE                 NO
          2 NONE                 NONE                 NO

.
Below query shows the distinct applications/modules and their count which are all on server1. Also you’ll notice that most of them are inactive (some of these users should connect to server2 to have balance on both nodes)

SQL> select distinct inst_id, module, program, status, count(*)
from gv$session
group by inst_id, module, program, status
order by 1,2,3,4;

   INST_ID MODULE					      PROGRAM						 STATUS       COUNT(*)
---------- -------------------------------------------------- -------------------------------------------------- ---------- ----------
	 1 F45RUN32.EXE 				      F45RUN32.EXE					 INACTIVE	    23		 

... output snipped ....

	 1 xxxPgm					      xxxPgm						 INACTIVE	    21		

... output snipped ....

	 1 c:\xxxwin95\BIN\F45RUN32.EXE L:\xxx\VS\FORMS\MAI   c:\xxxwin95\BIN\F45RUN32.EXE L:\xxx\VS\FORMS\MAI	 INACTIVE	    44
	 1 c:\xxxwin95\BIN\F45RUN32.EXE L:\MAIN.FMX	      c:\xxxwin95\BIN\F45RUN32.EXE L:\MAIN.FMX		 INACTIVE	    31
	 1 c:\xxxwin95\BIN\F45RUN32.EXE w:\xxxMENU.fmx       c:\xxxwin95\BIN\F45RUN32.EXE w:\xxxMENU.fmx	 INACTIVE	    14
	 1 c:\xxxwin95\BIN\F45RUN32.EXE w:\xxxlogin.fmx       c:\xxxwin95\BIN\F45RUN32.EXE w:\xxxlogin.fmx	 INACTIVE	    96	

... output snipped ....

	 1 c:\xxxwin95\BIN\R25SRV32.exe 		      c:\xxxwin95\BIN\R25SRV32.exe			 INACTIVE	    54		

... output snipped ....

	 1 ifweb90.exe					      ifweb90.exe					 INACTIVE	    11

.
Below info shows some data samples of DBA_HIST_ACTIVE_SESS_HISTORY which is graphed using Tanel Poder’s Perfsheet (could be found here and here) to clearly distinguish the distribution of the modules and number of users. Most of the users on server1 uses the “c:\xxxwin95\BIN\F45RUN32.EXE w:\xxxlogin.fmx”

GcBlocksLost_1_SessionCount

Having the graph above may lead us to conclusion that the significant number of users on server1 attributes to the slow down of the transactions. But as per the DBA, there were no changes made and they were running the same transactions as before on server1 which has acceptable response time. Also the OS Watcher shows that the CPU utilization (peak is 60% and most of the time at 40%) and run queue was low and the disk IO utilization and service time were average (always 10ms below), and there were enough memory on the server (3GB free out of 12GB.. the SGA is 2GB) and no swapping.

Hmm…Drilling down on the wait interface and on per session level that is doing the important business operations will give us a definite conclusion on what is really the bottleneck on the database.

The graph below is another sample from DBA_HIST_ACTIVE_SESS_HISTORY that shows server1 (in blue box) is suffering from “gc cr block lost” and “gc cr multi block request” from 7am to 4pm. The “Metalink Doc ID: 563566.1 gc lost blocks diagnostics” indicates that it is a cluster problem which could be a problem on the network interconnect

GcBlocksLost_2_server1waits

To confirm the bottleneck shown from the sample of DBA_HIST_ACTIVE_SESS_HISTORY (above), the output of ADDM and AWR report from a peak period was analyzed (from 2-3 & 3-4 pm). Below are the output of ADDM and AWR reports:

ADDM output:

          DETAILED ADDM REPORT FOR TASK 'TASK_137854' WITH ID 137854
          ----------------------------------------------------------

              Analysis Period: 28-MAY-2009 from 13:58:59 to 14:59:12
         Database ID/Instance: 3967623528/2
      Database/Instance Names: xxx
                    Host Name: xxx
             Database Version: 10.2.0.3.0
               Snapshot Range: from 15642 to 15643
                Database Time: 3710 seconds
        Average Database Load: 1 active sessions

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

FINDING 1: 52% impact (1917 seconds)
------------------------------------
Cluster multi-block requests were consuming significant database time.

   RECOMMENDATION 1: SQL Tuning, 37% benefit (1389 seconds)
      ACTION: Run SQL Tuning Advisor on the SQL statement with SQL_ID
         "0h0fn2d19adtk". Look for an alternative plan that does not use
         object scans.
         RELEVANT OBJECT: SQL statement with SQL_ID 0h0fn2d19adtk

         ... output snipped ... 

   RECOMMENDATION 2: SQL Tuning, 8.4% benefit (310 seconds)
      ACTION: Run SQL Tuning Advisor on the SQL statement with SQL_ID
         "8jd43xr4rp00u". Look for an alternative plan that does not use
         object scans.
         RELEVANT OBJECT: SQL statement with SQL_ID 8jd43xr4rp00u

         ... output snipped ... 

   RECOMMENDATION 3: SQL Tuning, 4.6% benefit (172 seconds)
      ACTION: Run SQL Tuning Advisor on the SQL statement with SQL_ID
         "6wpgf3s0vzuks". Look for an alternative plan that does not use
         object scans.
         RELEVANT OBJECT: SQL statement with SQL_ID 6wpgf3s0vzuks

         ... output snipped ... 

   SYMPTOMS THAT LED TO THE FINDING:
      SYMPTOM: Inter-instance messaging was consuming significant database
               time on this instance. (94% impact [3488 seconds])
         SYMPTOM: Wait class "Cluster" was consuming significant database
                  time. (94% impact [3488 seconds])

FINDING 2: 51% impact (1874 seconds)
------------------------------------
Global Cache Service Processes (LMSn) in other instances were not processing
requests fast enough.

   RECOMMENDATION 1: DB Configuration, 51% benefit (1874 seconds)
      ACTION: Increase throughput of the Global Cache Service (LMSn)
         processes. Increase the number of Global Cache Service processes by
         increasing the value of the parameter "gcs_server_processes".
         Alternatively, if the host is CPU bound consider increasing the OS
         priority of the Global Cache Service processes.
      RATIONALE: The value of parameter "gcs_server_processes" was "2" during
         the analysis period.

   SYMPTOMS THAT LED TO THE FINDING:
      SYMPTOM: Inter-instance messaging was consuming significant database
               time on this instance. (94% impact [3488 seconds])
         SYMPTOM: Wait class "Cluster" was consuming significant database
                  time. (94% impact [3488 seconds])

FINDING 3: 36% impact (1330 seconds)
------------------------------------
Cluster communications that were retried due to lost blocks consumed
significant database time.

   RECOMMENDATION 1: Host Configuration, 36% benefit (1330 seconds)
      ACTION: Check the configuration of the cluster interconnect. Check OS
         setup like adapter setting, firmware and driver release. Check that
         the OS's socket receive buffers are large enough to store an entire
         multiblock read. The value of parameter
         "db_file_multiblock_read_count" may be decreased as a workaround.
      RATIONALE: The instance was consuming 477 kilo bits per second of
         interconnect bandwidth.

   SYMPTOMS THAT LED TO THE FINDING:
      SYMPTOM: Inter-instance messaging was consuming significant database
               time on this instance. (94% impact [3488 seconds])
         SYMPTOM: Wait class "Cluster" was consuming significant database
                  time. (94% impact [3488 seconds])

#######################################################################

          DETAILED ADDM REPORT FOR TASK 'TASK_137874' WITH ID 137874
          ----------------------------------------------------------

              Analysis Period: 28-MAY-2009 from 14:59:12 to 15:58:44
         Database ID/Instance: 3967623528/2
      Database/Instance Names: xxx
                    Host Name: xxx
             Database Version: 10.2.0.3.0
               Snapshot Range: from 15643 to 15644
                Database Time: 1563 seconds
        Average Database Load: .4 active sessions

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

FINDING 1: 42% impact (654 seconds)
-----------------------------------
Cluster communications that were retried due to lost blocks consumed
significant database time.

   RECOMMENDATION 1: Host Configuration, 42% benefit (654 seconds)
      ACTION: Check the configuration of the cluster interconnect. Check OS
         setup like adapter setting, firmware and driver release. Check that
         the OS's socket receive buffers are large enough to store an entire
         multiblock read. The value of parameter
         "db_file_multiblock_read_count" may be decreased as a workaround.
      RATIONALE: The instance was consuming 134 kilo bits per second of
         interconnect bandwidth.

   SYMPTOMS THAT LED TO THE FINDING:
      SYMPTOM: Inter-instance messaging was consuming significant database
               time on this instance. (88% impact [1371 seconds])
         SYMPTOM: Wait class "Cluster" was consuming significant database
                  time. (88% impact [1372 seconds])

.
AWR output:

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxxx         3967623528 xxxxxx              2 10.2.0.3.0  YES xxx

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     15642 28-May-09 13:58:58        57      27.0
  End Snap:     15643 28-May-09 14:59:12        64      28.7
   Elapsed:               60.23 (mins)
   DB Time:               61.83 (mins)

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
gc cr multi block request            19,644       1,902     97   51.3    Cluster
gc cr block lost                      1,011       1,117   1104   30.1    Cluster
enq: WF - contention                    626         268    429    7.2      Other
gc current block lost                   186         203   1089    5.5    Cluster
cr request retry                        678         162    240    4.4      Other

Per Wait Class
~~~~~~~~~~~~~~~~~~
                                                                  Avg
                                       %Time       Total Wait    wait     Waits
Wait Class                      Waits  -outs         Time (s)    (ms)      /txn
-------------------- ---------------- ------ ---------------- ------- ---------
Cluster                        34,450    7.0            3,385      98       6.8
Other                          91,292   64.7              467       5      18.1
User I/O                       26,510     .0              102       4       5.3
Concurrency                    29,668     .0               54       2       5.9
System I/O                     13,360     .0               11       1       2.6
Commit                          1,313     .0                4       3       0.3
Application                     2,374     .0                2       1       0.5
Network                       129,774     .0                0       0      25.7
Configuration                      20     .0                0       0       0.0

#######################################################################

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Release     RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxxx         3967623528 xxxxxx              2 10.2.0.3.0  YES xxx

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:     15643 28-May-09 14:59:12        64      28.7
  End Snap:     15644 28-May-09 15:58:43        64      29.6
   Elapsed:               59.53 (mins)
   DB Time:               26.05 (mins)

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
gc cr block lost                        389         429   1103   27.5    Cluster
gc cr multi block request             2,360         422    179   27.0    Cluster
enq: WF - contention                    510         212    416   13.6      Other
gc current block lost                   188         204   1084   13.0    Cluster
cr request retry                        326         181    554   11.6      Other

Per Wait Class
~~~~~~~~~~~~~~~~~~
                                                                  Avg
                                       %Time       Total Wait    wait     Waits
Wait Class                      Waits  -outs         Time (s)    (ms)      /txn
-------------------- ---------------- ------ ---------------- ------- ---------
Cluster                        10,239    8.9            1,247     122       2.1
Other                         153,082   71.9              468       3      31.0
Application                     1,784   18.0              161      90       0.4
Concurrency                    33,014     .0               53       2       6.7
User I/O                        4,246     .0               27       6       0.9
System I/O                     13,648     .0               14       1       2.8
Commit                          1,546     .0                4       3       0.3
Network                       138,333     .0                1       0      28.0
Configuration                      16     .0                0       4       0.0

.
The only thing that is left to do is to drill down on the complaining users and check on the network performance and interconnect… unfortunately the users were going home and I can’t see any significant database activity. So I called it a night, and just analyzed my activity log and read on Metalink Doc ID 563566.1. Also I advised the DBA to divide the users across the nodes.

Next day afternoon, and it’s time to do the drill down. You can see on the graph below (a sample from GV$SESSION) that the users were already distributed, but still there is a performance problem. I told you this will not solve the issue ! :) it is just a plus that we manage to distribute the users :)

GcBlocksLost_3_SessionCount

A query on GV$SESSION_LONGOPS will show users with large SQLs. Drilling down on the session statistics of these user’s will give detail on what are they waiting and how much time they spent on what resource. However, it would be a repetitive step to get all the sessions on the GV$SESSION_LONGOPS.

select * from (
			select
			a.inst_id inst, a.START_TIME, b.osuser, b.module,  a.sid, a.serial#,
			a.opname, b.event, a.target, a.sofar, a.totalwork, round(a.sofar/a.totalwork, 4)*100 pct, a.units, a.elapsed_seconds, a.time_remaining time_remaining_sec, round(a.time_remaining/60,2) min
			,a.sql_hash_value
		-- 	,message
			from gv$session_longops a, gv$session b
			where a.sid = b.sid
			and b.serial# = b.serial#
			and sofar < totalwork
			order by totalwork desc)

INST START_TIME        OSUSER     MODULE             SID SERIAL# OPNAME     EVENT           TARGET          SOFAR  TOTALWORK        PCT UNITS      ELAPSED_SECONDS TIME_REMAINING_SEC        MIN SQL_HASH_VALUE
---- ----------------- ---------- --------------- ------ ------- ---------- --------------- ---------- ---------- ---------- ---------- ---------- --------------- ------------------ ---------- --------------
   2 20090529 18:36:38 oraprod    oracle@xxxxx280    454      82 Table Scan gc cr multi blo xxxxxxxx.B       2205      25403       8.68 Blocks                 343               3609      60.15     1857468103
                                  R (TNS V1-V3)                             ck request      ENMAS

   2 20090529 18:30:39 oraprod    oracle@xxxxx280    607     546 Table Scan gc cr multi blo xxxxxxxx.C      11062      15203      72.76 Blocks                 708                265       4.42     1477847824
                                  R (TNS V1-V3)                             ck request      OVMAS

   1 20090529 15:51:58 PC829      c:\xxxwin95\BIN    474     721 Table Scan gc cr request   xxxxxxxx.C      11108      15203      73.06 Blocks               10317               3803      63.38     1636678807
                                  \F45RUN32.EXE w                                           OVMAS
                                  : \xxxlogin.fmx

   1 20090529 17:27:49 PC1280     c:\xxxwin95\BIN    473     171 Table Scan gc cr request   xxxxxx.DTT        336      12918        2.6 Blocks                4449             166599    2776.65     1117075250
                                  \R25SRV32.exe                                             _DVDTL

   1 20090529 17:27:49 oracle     sqlplus@server2    473     171 Table Scan PX Deq: Executi xxxxxx.DTT        336      12918        2.6 Blocks                4449             166599    2776.65     1117075250
                                  .xxxx.xxx (TNS                           on Msg          _DVDTL
                                   V1-V3)

.
I already have an idea that my problem is on the cluster configuration. So I asked the DBA that we go up to the Accounting Department (where there are a lot of users complaining) and check on the desktop terminals of the users. We saw two users which has the application module on total hang. These users are PC1069 (with SID 601) and PC918 (with SID 483). So a query on the GV$SESSION_LONGOPS does not show these users but a query on GV$SESSION_WAIT shows that they are waiting on “gc cr request” event

select s.inst_id, w.sid, s.sql_hash_value, s.osuser, s.program, w.event, w.seconds_in_wait seconds_in_wait, w.state, w.p1, w.p2, w.p3
from gv$session s, gv$session_wait w
where s.sid = w.sid and s.type = 'USER'
and w.state = 'WAITING'
and s.osuser in ('PC1069','PC918')
order by 6 asc;

   INST_ID    SID SQL_HASH_VALUE OSUSER     PROGRAM                 EVENT              SECONDS_IN_WAIT STATE                     P1         P2         P3
---------- ------ -------------- ---------- ----------------------- ------------------ --------------- ----------------- ---------- ---------- ----------
         1    483      449494677 PC918      c:\xxxwin95\BIN\F45RUN3 SQL*Net message fr            3853 WAITING            675562835          1          0
                                            2.EXE w:\xxxlogin.fmx   om client

         1    601     1874068660 PC1069     c:\xxxwin95\BIN\F45RUN3 gc cr request                    6 WAITING                   12    1173863          1
                                            2.EXE w:\xxxlogin.fmx

         1    483      449494677 PC918      c:\xxxwin95\BIN\F45RUN3 gc cr request                    0 WAITING                    7    1649989          1
                                            2.EXE w:\xxxlogin.fmx

.
Focusing on these two sessions (SID 601 and 483) and getting their session statistics will give us an idea if they are doing something, on a total hang, or waiting for something. Below query on V$SESSION_WAIT shows that the SEQ# is being incremented (a number that uniquely identifies this wait) which means the session (SID 601) is still waiting for something, the event “gc cr request”

BTW, I’m using SW script of Tanel Poder (downloadable here), and you’ll also see some usage of Snapper and USQL

SQL> @sw 601

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT         P1         P2         P3 P1TRANSL
------- ------- ---------------------------------------- ---------- ----------- ---------- ---------- ---------- ------------------------------------------
    601 WAITING gc cr request                                 27223           3          7    1647259          1

... output snipped ...

SQL> @sw 601

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT         P1         P2         P3 P1TRANSL
------- ------- ---------------------------------------- ---------- ----------- ---------- ---------- ---------- ------------------------------------------
    601 WAITING gc cr request                                 27225           0          7    1647259          1

... output snipped ...

SQL> @sw 601

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT         P1         P2         P3 P1TRANSL
------- ------- ---------------------------------------- ---------- ----------- ---------- ---------- ---------- ------------------------------------------
    601 WAITING gc cr request                                 27227           0          7    1647259          1

... output snipped ...

SQL> @sw 601

    SID STATE   EVENT                                          SEQ# SEC_IN_WAIT         P1         P2         P3 P1TRANSL
------- ------- ---------------------------------------- ---------- ----------- ---------- ---------- ---------- ------------------------------------------
    601 WAITING gc cr request                                 27229           7          7    1647259          1

.
A query on V$SESSTAT will show the performance/wait counters being incremented (SID 601), the output below was a snap of the session for 5 seconds period done 5 times. You’ll see that “gc current” waits are significant

SQL> @snapper out 5 5 601
-- Session Snapper v1.10 by Tanel Poder ( http://www.tanelpoder.com )
---------------------------------------------------------------------------------------------------------------------------------------------
HEAD,     SID, SNAPSHOT START   ,  SECONDS, TYPE, STATISTIC                               ,         DELTA,  DELTA/SEC,     HDELTA, HDELTA/SEC
---------------------------------------------------------------------------------------------------------------------------------------------
DATA,     601, 20090529 18:48:36,        6, STAT, opened cursors current                  ,            -1,         -0,         -1,       -.17
DATA,     601, 20090529 18:48:36,        6, STAT, recursive calls                         ,             5,          1,          5,        .83
DATA,     601, 20090529 18:48:36,        6, STAT, session logical reads                   ,            24,          4,         24,          4
DATA,     601, 20090529 18:48:36,        6, STAT, cluster wait time                       ,           492,         82,        492,         82
DATA,     601, 20090529 18:48:36,        6, STAT, user I/O wait time                      ,             1,          0,          1,        .17
DATA,     601, 20090529 18:48:36,        6, STAT, session pga memory                      ,        -65536,     -10923,    -65.54k,    -10.92k
DATA,     601, 20090529 18:48:36,        6, STAT, physical read total IO requests         ,             1,          0,          1,        .17
DATA,     601, 20090529 18:48:36,        6, STAT, physical read total bytes               ,          8192,       1365,      8.19k,      1.37k
DATA,     601, 20090529 18:48:36,        6, STAT, gcs messages sent                       ,             3,          1,          3,         .5
DATA,     601, 20090529 18:48:36,        6, STAT, db block gets                           ,             6,          1,          6,          1
DATA,     601, 20090529 18:48:36,        6, STAT, db block gets from cache                ,             6,          1,          6,          1
DATA,     601, 20090529 18:48:36,        6, STAT, consistent gets                         ,            18,          3,         18,          3
DATA,     601, 20090529 18:48:36,        6, STAT, consistent gets from cache              ,            18,          3,         18,          3
DATA,     601, 20090529 18:48:36,        6, STAT, consistent gets - examination           ,             8,          1,          8,       1.33
DATA,     601, 20090529 18:48:36,        6, STAT, physical reads                          ,             1,          0,          1,        .17
DATA,     601, 20090529 18:48:36,        6, STAT, physical reads cache                    ,             1,          0,          1,        .17
DATA,     601, 20090529 18:48:36,        6, STAT, physical read IO requests               ,             1,          0,          1,        .17
DATA,     601, 20090529 18:48:36,        6, STAT, physical read bytes                     ,          8192,       1365,      8.19k,      1.37k
DATA,     601, 20090529 18:48:36,        6, STAT, db block changes                        ,             3,          1,          3,         .5
DATA,     601, 20090529 18:48:36,        6, STAT, free buffer requested                   ,             3,          1,          3,         .5
DATA,     601, 20090529 18:48:36,        6, STAT, calls to get snapshot scn: kcmgss       ,             5,          1,          5,        .83
DATA,     601, 20090529 18:48:36,        6, STAT, redo entries                            ,             2,          0,          2,        .33
DATA,     601, 20090529 18:48:36,        6, STAT, redo size                               ,           536,         89,        536,      89.33
DATA,     601, 20090529 18:48:36,        6, STAT, gc current blocks received              ,             1,          0,          1,        .17
DATA,     601, 20090529 18:48:36,        6, STAT, gc remote grants                        ,             1,          0,          1,        .17
DATA,     601, 20090529 18:48:36,        6, STAT, undo change vector size                 ,            84,         14,         84,         14
DATA,     601, 20090529 18:48:36,        6, STAT, no work - consistent read gets          ,             9,          2,          9,        1.5
DATA,     601, 20090529 18:48:36,        6, STAT, table fetch by rowid                    ,             4,          1,          4,        .67
DATA,     601, 20090529 18:48:36,        6, STAT, index scans kdiixs1                     ,             4,          1,          4,        .67
DATA,     601, 20090529 18:48:36,        6, STAT, buffer is not pinned count              ,             8,          1,          8,       1.33
DATA,     601, 20090529 18:48:36,        6, STAT, execute count                           ,             4,          1,          4,        .67
DATA,     601, 20090529 18:48:36,        6, TIME, PL/SQL execution elapsed time           ,           139,         23,      139us,    23.17us
DATA,     601, 20090529 18:48:36,        6, TIME, DB CPU                                  ,         10000,       1667,       10ms,     1.67ms
DATA,     601, 20090529 18:48:36,        6, TIME, sql execute elapsed time                ,      27552125,    4592021,     27.55s,      4.59s
DATA,     601, 20090529 18:48:36,        6, TIME, DB time                                 ,      27552133,    4592022,     27.55s,      4.59s
DATA,     601, 20090529 18:48:36,        6, WAIT, db file sequential read                 ,          8374,       1396,     8.37ms,      1.4ms
DATA,     601, 20090529 18:48:36,        6, WAIT, gc current request                      ,       9664835,    1610806,      9.66s,      1.61s
--  End of snap 1
DATA,     601, 20090529 18:48:42,        5, STAT, opened cursors cumulative               ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, opened cursors current                  ,            -1,         -0,         -1,        -.2
DATA,     601, 20090529 18:48:42,        5, STAT, recursive calls                         ,             4,          1,          4,         .8
DATA,     601, 20090529 18:48:42,        5, STAT, recursive cpu usage                     ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, session logical reads                   ,            42,          8,         42,        8.4
DATA,     601, 20090529 18:48:42,        5, STAT, cluster wait time                       ,           419,         84,        419,       83.8
DATA,     601, 20090529 18:48:42,        5, STAT, user I/O wait time                      ,             2,          0,          2,         .4
DATA,     601, 20090529 18:48:42,        5, STAT, enqueue requests                        ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, global enqueue gets sync                ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, physical read total IO requests         ,             2,          0,          2,         .4
DATA,     601, 20090529 18:48:42,        5, STAT, physical read total bytes               ,         16384,       3277,     16.38k,      3.28k
DATA,     601, 20090529 18:48:42,        5, STAT, gcs messages sent                       ,             2,          0,          2,         .4
DATA,     601, 20090529 18:48:42,        5, STAT, db block gets                           ,            41,          8,         41,        8.2
DATA,     601, 20090529 18:48:42,        5, STAT, db block gets from cache                ,            41,          8,         41,        8.2
DATA,     601, 20090529 18:48:42,        5, STAT, consistent gets                         ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, consistent gets from cache              ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, physical reads                          ,             2,          0,          2,         .4
DATA,     601, 20090529 18:48:42,        5, STAT, physical reads cache                    ,             2,          0,          2,         .4
DATA,     601, 20090529 18:48:42,        5, STAT, physical read IO requests               ,             2,          0,          2,         .4
DATA,     601, 20090529 18:48:42,        5, STAT, physical read bytes                     ,         16384,       3277,     16.38k,      3.28k
DATA,     601, 20090529 18:48:42,        5, STAT, db block changes                        ,            26,          5,         26,        5.2
DATA,     601, 20090529 18:48:42,        5, STAT, exchange deadlocks                      ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, free buffer requested                   ,             3,          1,          3,         .6
DATA,     601, 20090529 18:48:42,        5, STAT, calls to get snapshot scn: kcmgss       ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, redo entries                            ,            14,          3,         14,        2.8
DATA,     601, 20090529 18:48:42,        5, STAT, redo size                               ,          3504,        701,       3.5k,      700.8
DATA,     601, 20090529 18:48:42,        5, STAT, redo subscn max counts                  ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, gc local grants                         ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, gc remote grants                        ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, undo change vector size                 ,          1140,        228,      1.14k,        228
DATA,     601, 20090529 18:48:42,        5, STAT, deferred (CURRENT) block cleanout applic,             4,          1,          4,         .8
DATA,     601, 20090529 18:48:42,        5, STAT, parse count (total)                     ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, STAT, execute count                           ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:42,        5, TIME, parse time elapsed                      ,            41,          8,       41us,      8.2us
DATA,     601, 20090529 18:48:42,        5, TIME, PL/SQL execution elapsed time           ,           125,         25,      125us,       25us
DATA,     601, 20090529 18:48:42,        5, TIME, sql execute elapsed time                ,       5433501,    1086700,      5.43s,      1.09s
DATA,     601, 20090529 18:48:42,        5, TIME, DB time                                 ,       5433501,    1086700,      5.43s,      1.09s
DATA,     601, 20090529 18:48:42,        5, WAIT, db file sequential read                 ,         17812,       3562,    17.81ms,     3.56ms
DATA,     601, 20090529 18:48:42,        5, WAIT, gc current request                      ,      -4221489,    -844298,     -4.22s,   -844.3ms
DATA,     601, 20090529 18:48:42,        5, WAIT, gc current block lost                   ,       6634963,    1326993,      6.63s,      1.33s
DATA,     601, 20090529 18:48:42,        5, WAIT, gc current grant 2-way                  ,           254,         51,      254us,     50.8us
DATA,     601, 20090529 18:48:42,        5, WAIT, events in waitclass Other               ,           600,        120,      600us,      120us
--  End of snap 2
DATA,     601, 20090529 18:48:47,        5, STAT, session logical reads                   ,             3,          1,          3,         .6
DATA,     601, 20090529 18:48:47,        5, STAT, cluster wait time                       ,           503,        101,        503,      100.6
DATA,     601, 20090529 18:48:47,        5, STAT, gcs messages sent                       ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:47,        5, STAT, db block gets                           ,             3,          1,          3,         .6
DATA,     601, 20090529 18:48:47,        5, STAT, db block gets from cache                ,             3,          1,          3,         .6
DATA,     601, 20090529 18:48:47,        5, STAT, exchange deadlocks                      ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:47,        5, STAT, free buffer requested                   ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:47,        5, WAIT, gc current request                      ,       1221742,     244348,      1.22s,   244.35ms
DATA,     601, 20090529 18:48:47,        5, WAIT, gc current block lost                   ,       3813435,     762687,      3.81s,   762.69ms
DATA,     601, 20090529 18:48:47,        5, WAIT, events in waitclass Other               ,           609,        122,      609us,    121.8us
--  End of snap 3
DATA,     601, 20090529 18:48:52,        5, STAT, session logical reads                   ,             3,          1,          3,         .6
DATA,     601, 20090529 18:48:52,        5, STAT, cluster wait time                       ,           524,        105,        524,      104.8
DATA,     601, 20090529 18:48:52,        5, STAT, gcs messages sent                       ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:52,        5, STAT, db block gets                           ,             3,          1,          3,         .6
DATA,     601, 20090529 18:48:52,        5, STAT, db block gets from cache                ,             3,          1,          3,         .6
DATA,     601, 20090529 18:48:52,        5, STAT, exchange deadlocks                      ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:52,        5, STAT, free buffer requested                   ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:52,        5, TIME, sql execute elapsed time                ,       6613095,    1322619,      6.61s,      1.32s
DATA,     601, 20090529 18:48:52,        5, TIME, DB time                                 ,       6613095,    1322619,      6.61s,      1.32s
DATA,     601, 20090529 18:48:52,        5, WAIT, gc current request                      ,       3000843,     600169,         3s,   600.17ms
DATA,     601, 20090529 18:48:52,        5, WAIT, gc current block lost                   ,       4018577,     803715,      4.02s,   803.72ms
DATA,     601, 20090529 18:48:52,        5, WAIT, events in waitclass Other               ,           570,        114,      570us,      114us
--  End of snap 4
DATA,     601, 20090529 18:48:57,        5, STAT, session logical reads                   ,             3,          1,          3,         .6
DATA,     601, 20090529 18:48:57,        5, STAT, cluster wait time                       ,           523,        105,        523,      104.6
DATA,     601, 20090529 18:48:57,        5, STAT, gcs messages sent                       ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:57,        5, STAT, db block gets                           ,             3,          1,          3,         .6
DATA,     601, 20090529 18:48:57,        5, STAT, db block gets from cache                ,             3,          1,          3,         .6
DATA,     601, 20090529 18:48:57,        5, STAT, exchange deadlocks                      ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:57,        5, STAT, free buffer requested                   ,             1,          0,          1,         .2
DATA,     601, 20090529 18:48:57,        5, TIME, sql execute elapsed time                ,       5230170,    1046034,      5.23s,      1.05s
DATA,     601, 20090529 18:48:57,        5, TIME, DB time                                 ,       5230170,    1046034,      5.23s,      1.05s
DATA,     601, 20090529 18:48:57,        5, WAIT, gc current request                      ,          -473,        -95,     -473us,    -94.6us
DATA,     601, 20090529 18:48:57,        5, WAIT, gc current block lost                   ,       5230137,    1046027,      5.23s,      1.05s
DATA,     601, 20090529 18:48:57,        5, WAIT, events in waitclass Other               ,           390,         78,      390us,       78us
--  End of snap 5

PL/SQL procedure successfully completed.

.
Below is the SQL being executed for SID 601 which is a batch INSERT operation that is just a normal operation as per the DBA

SQL> @usql 601

HASH_VALUE  CH# SQL_TEXT
---------- ---- ------------------------------------------------------------------------------------------------------------------------------------------------------
   6143262    0 INSERT INTO xxx(
                BATxxx,POLxxx,SRCxxx,PAYxxx,DATxxx,DATxxx,AMxxxx,ACCxxx,NEWxxx,POLxxx,FUNxxx,Uxxxxx,Txxxx,Cxxxx,PRxxxx,Exxxx,xxxxID,POxxxx,Nxxxx,xxxAMT,xxxPRM,xxxMAV
                ,COLxxxx,NOPxxx )  VALUES ( :b1,:b2,:b3,:b4,:b5,:b6,:b7,:b8,:b9,:b10,:b11,:b12,:b13,:b14,:b15,:b16,:b17,:b18,:b19,:b20,:b21,:b22,:b23,:b24  )

 CH# PARENT_HANDLE    OBJECT_HANDLE        PARSES   H_PARSES EXECUTIONS    FETCHES ROWS_PROCESSED       LIOS       PIOS      SORTS     CPU_MS     ELA_MS USERS_EXECUTING
---- ---------------- ---------------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- ---------- ---------- ---------- ---------------
   0 00000000D0A5FC68 00000000D0A5FA40         10          1       2714          0           2714     300753      17731          0  12136.359  5899798.3               1

SQL hash value:         6143262    Cursor address:  00000000D0A5FC68    |   Statement first parsed at: 2009-05-29/13:01:42  |  20903 seconds ago

.
Below is the SQL being executed for SID 483 which is a SELECT operation that is just a normal operation as per the DBA

SQL> @usql 483

HASH_VALUE  CH# SQL_TEXT
---------- ---- ------------------------------------------------------------------------------------------------------------------------------------------------------
 449494677    0 SELECT x.xxxPxD,F.xxxNMx,B.xxxPxD,x.xxxNMx,D.PxLMxD,D.PxLFRQ,x.xxVxMT,x.xxVSTx,x.xxVxFF,x.xxVDxx,SxM(G.xxVMxN) MxDPRxM   FRxM RxPMxS x,BxNMxS B,xxxMxS
                x,PxLMxS D,xxVMxS x,xxxMxS F,xxVMxS G  WHxRx x.PxLNxM = NVL(:b1,'X')  xND x.RxLTYP = 'PxY'  xND (x.RxLPRx = 100  xR x.RxLPRx xS NxLL  ) xND x.xxxPxD =
                F.xxxPxD  xND B.PxLNxM (+)   = NVL(:b2,'X')  xND B.RxLTYP (+)   = 'xNS'  xND B.xxVNxM = 1  xND (B.RxLPxR = 100  xR B.RxLPxR xS NxLL  ) xND B.xxxPxD =
                x.xxxPxD (+)    xND D.PxLNxM = NVL(:b3,'X')  xND x.PxLNxM = NVL(:b4,'X')  xND x.xxVNxM = 1  xND G.PxLNxM = NVL(:b5,'X')  GRxxP BY
                x.xxxPxD,F.xxxNMx,B.xxxPxD,x.xxxNMx,D.PxLMxD,D.PxLFRQ,x.xxVxMT,x.xxVSTx,x.xxVxFF,x.xxVDxx

 CH# PARENT_HANDLE    OBJECT_HANDLE        PARSES   H_PARSES EXECUTIONS    FETCHES ROWS_PROCESSED       LIOS       PIOS      SORTS     CPU_MS     ELA_MS USERS_EXECUTING
---- ---------------- ---------------- ---------- ---------- ---------- ---------- -------------- ---------- ---------- ---------- ---------- ---------- ---------------
   0 00000000C2E54DD0 00000000C2E5E700          8          1       1047       2089           1043      58793       3905       2087   2676.616 11108807.7               1

.
A query on V$SESSTAT will show the performance/wait counters being incremented (SID 483), the output below was a snap of the session for 5 seconds period done 10 times. You’ll see that “gc cr” waits are significant

SQL> @snapper out 5 10 483
-- Session Snapper v1.10 by Tanel Poder ( http://www.tanelpoder.com )
---------------------------------------------------------------------------------------------------------------------------------------------
HEAD,     SID, SNAPSHOT START   ,  SECONDS, TYPE, STATISTIC                               ,         DELTA,  DELTA/SEC,     HDELTA, HDELTA/SEC
---------------------------------------------------------------------------------------------------------------------------------------------
DATA,     483, 20090529 19:00:17,        5, STAT, cluster wait time                       ,           527,        105,        527,      105.4
DATA,     483, 20090529 19:00:17,        5, STAT, gcs messages sent                       ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:17,        5, STAT, free buffer requested                   ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:17,        5, WAIT, gc cr request                           ,      -1220551,    -244110,     -1.22s,  -244.11ms
DATA,     483, 20090529 19:00:17,        5, WAIT, gc cr block lost                        ,       6485830,    1297166,      6.49s,       1.3s
DATA,     483, 20090529 19:00:17,        5, WAIT, events in waitclass Other               ,           530,        106,      530us,      106us
--  End of snap 1
DATA,     483, 20090529 19:00:22,        5, STAT, session logical reads                   ,             4,          1,          4,         .8
DATA,     483, 20090529 19:00:22,        5, STAT, cluster wait time                       ,           454,         91,        454,       90.8
DATA,     483, 20090529 19:00:22,        5, STAT, messages sent                           ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:22,        5, STAT, gcs messages sent                       ,             2,          0,          2,         .4
DATA,     483, 20090529 19:00:22,        5, STAT, consistent gets                         ,             4,          1,          4,         .8
DATA,     483, 20090529 19:00:22,        5, STAT, consistent gets from cache              ,             4,          1,          4,         .8
DATA,     483, 20090529 19:00:22,        5, STAT, consistent gets - examination           ,             3,          1,          3,         .6
DATA,     483, 20090529 19:00:22,        5, STAT, free buffer requested                   ,             2,          0,          2,         .4
DATA,     483, 20090529 19:00:22,        5, STAT, shared hash latch upgrades - no wait    ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:22,        5, STAT, gc current blocks received              ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:22,        5, STAT, no work - consistent read gets          ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:22,        5, STAT, table fetch by rowid                    ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:22,        5, STAT, index fetch by key                      ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:22,        5, STAT, buffer is not pinned count              ,             2,          0,          2,         .4
DATA,     483, 20090529 19:00:22,        5, WAIT, gc cr request                           ,      -8443706,   -1688741,     -8.44s,     -1.69s
DATA,     483, 20090529 19:00:22,        5, WAIT, gc cr block lost                        ,       6986370,    1397274,      6.99s,       1.4s
DATA,     483, 20090529 19:00:22,        5, WAIT, gc current block 2-way                  ,           596,        119,      596us,    119.2us
DATA,     483, 20090529 19:00:22,        5, WAIT, events in waitclass Other               ,           266,         53,      266us,     53.2us
--  End of snap 2
DATA,     483, 20090529 19:00:27,        5, STAT, cluster wait time                       ,           506,        101,        506,      101.2
DATA,     483, 20090529 19:00:27,        5, STAT, gcs messages sent                       ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:27,        5, STAT, free buffer requested                   ,             1,          0,          1,         .2
--  End of snap 3
DATA,     483, 20090529 19:00:32,        5, STAT, cluster wait time                       ,           488,         98,        488,       97.6
DATA,     483, 20090529 19:00:32,        5, WAIT, gc cr request                           ,       9888169,    1977634,      9.89s,      1.98s
--  End of snap 4
DATA,     483, 20090529 19:00:37,        5, STAT, cluster wait time                       ,           548,        110,        548,      109.6
DATA,     483, 20090529 19:00:37,        5, STAT, gcs messages sent                       ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:37,        5, STAT, free buffer requested                   ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:37,        5, WAIT, gc cr request                           ,      -4444128,    -888826,     -4.44s,  -888.83ms
DATA,     483, 20090529 19:00:37,        5, WAIT, gc cr block lost                        ,       7920880,    1584176,      7.92s,      1.58s
DATA,     483, 20090529 19:00:37,        5, WAIT, events in waitclass Other               ,           341,         68,      341us,     68.2us
--  End of snap 5
DATA,     483, 20090529 19:00:42,        5, STAT, cluster wait time                       ,           494,         99,        494,       98.8
DATA,     483, 20090529 19:00:42,        5, STAT, gcs messages sent                       ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:42,        5, STAT, free buffer requested                   ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:42,        5, WAIT, gc cr request                           ,        883591,     176718,   883.59ms,   176.72ms
DATA,     483, 20090529 19:00:42,        5, WAIT, gc cr block lost                        ,       4055130,     811026,      4.06s,   811.03ms
DATA,     483, 20090529 19:00:42,        5, WAIT, events in waitclass Other               ,           337,         67,      337us,     67.4us
--  End of snap 6
DATA,     483, 20090529 19:00:47,        6, STAT, cluster wait time                       ,           448,         75,        448,      74.67
DATA,     483, 20090529 19:00:47,        6, STAT, gcs messages sent                       ,             1,          0,          1,        .17
DATA,     483, 20090529 19:00:47,        6, STAT, free buffer requested                   ,             1,          0,          1,        .17
DATA,     483, 20090529 19:00:47,        6, WAIT, gc cr request                           ,      -2105578,    -350930,     -2.11s,  -350.93ms
DATA,     483, 20090529 19:00:47,        6, WAIT, gc cr block lost                        ,       6590281,    1098380,      6.59s,       1.1s
DATA,     483, 20090529 19:00:47,        6, WAIT, events in waitclass Other               ,           496,         83,      496us,    82.67us
--  End of snap 7
DATA,     483, 20090529 19:00:53,        5, STAT, cluster wait time                       ,           547,        109,        547,      109.4
DATA,     483, 20090529 19:00:53,        5, STAT, gcs messages sent                       ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:53,        5, STAT, free buffer requested                   ,             1,          0,          1,         .2
--  End of snap 8
DATA,     483, 20090529 19:00:58,        5, STAT, cluster wait time                       ,           528,        106,        528,      105.6
DATA,     483, 20090529 19:00:58,        5, STAT, gcs messages sent                       ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:58,        5, STAT, free buffer requested                   ,             1,          0,          1,         .2
DATA,     483, 20090529 19:00:58,        5, WAIT, gc cr block lost                        ,       5286330,    1057266,      5.29s,      1.06s
DATA,     483, 20090529 19:00:58,        5, WAIT, events in waitclass Other               ,           636,        127,      636us,    127.2us
--  End of snap 9
DATA,     483, 20090529 19:01:03,        5, STAT, cluster wait time                       ,           489,         98,        489,       97.8
DATA,     483, 20090529 19:01:03,        5, WAIT, gc cr request                           ,      10886698,    2177340,     10.89s,      2.18s
--  End of snap 10

PL/SQL procedure successfully completed.

.
Another graph of DBA_HIST_ACTIVE_SESS_HISTORY will show that the server is suffering from “gc cr block lost” and “gc cr multi block request”. Again, ADDM and AWR reports also support this graph

GcBlocksLost_4_server1waits

Hmm… I already have sufficient info that on the instance and session level, I’m getting waits that point me to diagnose and look into the network interconnect. Analyzing the “netstat -s” output from OS Watcher shows that there are many errors on the TCP and UDP packets

$ cat server1_netstat.dat | egrep -i "udpInOverflows|packet receive errors|fragments dropped|reassembles failed|fragments dropped after timeout"
    34096 fragments dropped after timeout
    306030 packet reassembles failed
    15 packet receive errors
    34096 fragments dropped after timeout
    306268 packet reassembles failed
    15 packet receive errors
    34096 fragments dropped after timeout
    306574 packet reassembles failed
    15 packet receive errors
    34096 fragments dropped after timeout
    306930 packet reassembles failed
    15 packet receive errors
    34096 fragments dropped after timeout

… output snipped … 

    15 packet receive errors
    34096 fragments dropped after timeout
    328087 packet reassembles failed
    15 packet receive errors
    34096 fragments dropped after timeout
    328493 packet reassembles failed
    15 packet receive errors
    34096 fragments dropped after timeout
    328889 packet reassembles failed
    15 packet receive errors

.
We quickly advised the network team to check on the interconnect switch for any issues and if possible replace it with a new unit to see if it solves the performance problem.

Then the DBA scheduled a downtime, and at first we tried restarting just the switch if it solves the problem. Then we make them run the applications again, but still there is a performance problem.

Then we shutdown again and the network team replaced the interconnect switch with a new unit. Then startup, and run the applications again…..

tentenenenen tenen!!!

…… they got fast :D !!!

Another check on the “netstat -s” shows no errors!

karao@karl:~/Desktop$ cat karlarao.dat | egrep -i "udpInOverflows|packet receive errors|fragments dropped|reassembles failed|fragments dropped after timeout"
    0 packet receive errors
    0 packet receive errors
    0 packet receive errors
    0 packet receive errors
    0 packet receive errors
    0 packet receive errors
    0 packet receive errors
    0 packet receive errors
    0 packet receive errors
    0 packet receive errors
    0 packet receive errors

.
Another graph of DBA_HIST_ACTIVE_SESS_HISTORY shows that the “gc block lost” waits are gone :D

3d view:
GcBlocksLost_5_server1waits3d

Stacked graph:
GcBlocksLost_6_server1waits1

Update:
————–
I was also able to share and present this performance scenario on one of our DBA round tables locally (at the Oracle office)..
- Performance Scenario – Sudden slow down on RAC (pdf) – Oracle Philippines DBA round table, 07/21/2009





About these ads

34 comments

  1. This is very nice demonstration of what you learned from Tanel. I think these kind of explanation of how to use TPTools post will increase in 1 year time.
    Thank you for sharing this case

    • Thanks Coskan :) Actually I really liked the output of the Perfsheet that enabled me to look into the waits of both nodes using DBA_HIST_ACTIVE_SESS_HISTORY and grouping the nodes in the pivot chart :) In this way I could better explain to the manager what’s happening on the RAC nodes that he could easily grasp :) But, excel got 32767 line limit so you must consider the amount of data sample :) and the file really grows big. But it really helps in visualization.. :)

  2. I know how to fix the 32k line limit (or is it 64k?), but havent had the time to code it. One might argue that if you have so many data points you probably have too detailed data. It would make sense to rewrite the query to return less granular data. But of course there are always the corner cases…

    • Hi Tanel,
      I’m not sure if it’s 64k, I have not yet checked it :) I knew the 32k limit back in HighSchool :p
      do you have the link for the workaround? (I think the default would do the purpose..)

      Yeah and too detailed data will make the population on the sheet so slowww… so adding some filter like “and b.begin_interval_time between sysdate – 1 and sysdate” would really help..

  3. Regarding filtering:
    That’s why I have the %FROM_DATE% and %TO_DATE% variables which will be replaced with whatever you have in the corresponding from/to date fields on main page

    • Yes I agree with you Doug :)

      If I just relied on the ADDM/AWR and went straight to the “netstat -s” that would quickly solve the issue. It’s just that I need some more facts/numbers/figures to make the big decision of replacing the interconnect :)

      Thanks for your comment, you are on my RSS reader! :)

  4. It’s just that I need some more facts/numbers/figures to make the big decision of replacing the interconnect

    There’s definitely nothing wrong with as much good evidence as you can gather, particularly when talking to other teams.

    you are on my RSS reader!

    Believe me, I’ll be linking here when I get round to it. I’m lucky Tanel pointed out your blog. Good stuff

  5. Pingback: Blogroll Report 06/06/09 – 12/06/09 « Coskan’s Approach to Oracle

    • Riyaj,

      Thanks for reading my blog, and for adding me in your Blogroll :)
      You’re also in my Useful Links page :)

  6. Hi Karlarao:

    May be you can make use of sar too, comes usually by default in any linux/unix distribution and gives a lot of info about what is the system doing. I normally modify the execution interval ( usually located on linux distros under /etc/cron.d/sysstat ) from each 10 minutes to each minute. The info is available in /var/log/sa directory, and I use the ksar ( http://ksar.atomique.net/ )java application to see the data in a graphic way.

    hope this helps and congrats for this blog and the knowledge that you put on in

    Regards

    • Hi David,

      Yeah I’m using kSar :) very useful for postmortem analysis of the server activity..
      and I also use nmon (http://www.ibm.com/developerworks/aix/library/au-analyze_aix) for graphically looking at the IO devices with most activity. It also has CPU graph and could monitor memory, kernel, network, nfs, filesystems

      But, for long monitoring durations, I prefer OS Watcher :) it has a Windows version too that spools data in CSV format which you could easily graph in MS Excel..

  7. Very nice! I just ran across your post after looking at a system last night that has an intermittent problem with lost blocks. You did a great job of documenting your diagnosis process. This is (for me anyway) often harder than solving the problem in the first place. Thanks for taking the time to post this.

    Kerry

    • Hi Kerry,

      I find your posts and scripts useful too, I think you should make an archive page of your scripts for easy access.. ;)

      Thanks for the comment.. :)

  8. I know, I know. Like I said, it’s a lot easier to do stuff than it is to write down what you did (and why). But I’ll see what I can do about a page listing the scripts.

    Keep up the good work.

    Kerry

  9. Hello karlarao,
    I would ask if your network team replaced interconnect switch unit with new in the way, they replaced this one with new one of same version or with new version? I solve very similar problem in company and try mostly all what found on net …

    Thanks

    Lukas

    • Hi Lukas,

      Thanks for reading my post. The new switch is totally new make & model (bigger & faster)… I’m not aware where did they get it, the old switch BTW was found faulty & just stopped working.

      With this post, I’m not recommending to replace switches right away. I believe you could encounter this waits even not relating to the switches.. Do some systematic triage and execute the action plans unless the facts/numbers/figures lead you to do so.. ( & don’t give up investigating :) )

  10. Hello, Karl Arao my name is João Paulo , I’m from Brazil, You did a fantastic job, I’m with the same problem, in my osw log I get this output…. Is this a problem ?

    udpInCksumErrs = 0 udpInOverflows = 0
    udpInCksumErrs = 93 udpInOverflows = 0
    udpInCksumErrs = 0 udpInOverflows = 0
    udpInCksumErrs = 93 udpInOverflows = 0
    udpInCksumErrs = 0 udpInOverflows = 0
    udpInCksumErrs = 93 udpInOverflows = 0

    Best Regards

  11. Pingback: Graphing the AAS with Perfsheet a la Enterprise Manager « Karl Arao's Blog

  12. Pingback: DB Optimizer » ASH Visualizations: R, ggplot2, Gephi, Jit, HighCharts, Excel ,SVG

  13. Hello Karl,

    Great job and documentation.

    I am having the same intermittent poor database performance and applications are complaining. This happens once or twice a week and it is getting to be a concerned situation.

    I read about your gc block lost problem and how your resolved it. I believe it is same problem that we are having. When I query gv$sysstat view, I see a very high value of 26000 for RAC node 1 and 7000 for RAC node 2. This tells me there is problem, but I want to gather all the numbers to make presentation to management, Unix SA and Network team.

    Can you send me a copy of the Tanel Poder’s Perfsheet file. This file does not exist anymore on the website. I want to create a PowerPoint presentation with convincing numbers and graphs.

    I will appreciate if you can assist.

    Thank you
    Grace

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s