Knowing the trend of Deadlock occurrences from the Alert Log

Recently, my client deployed a new application and hit an intermittent “Deadlock Storm”…

A trace file was sent, and I was able to pinpoint the cause of the deadlock and the session that caused it.
The deadlock was a TX enqueue in mode 4 (S – share), which can be verified by looking at the following lines of the Process State dump:

   last wait for 'enq: TX - row lock contention' blocking sess=0x 7000000cb239d60 seq=7849 wait_time=2929705 seconds since wait started=3
            name|mode=54580004, usn<<16 | slot=a0028, sequence=283f2

The “enqueue and lock mode” decodes as:
mode=54580004 (see above)
5458 (hex) = TX (ASCII)
0004 (hex) = mode 4 (S – share)
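The decoding above can be checked with a few lines of shell arithmetic (a sketch; the value 54580004 is the one from this trace, and the octal-escape trick is just a portable way to print the two ASCII bytes):

```shell
# Decode an enqueue name|mode value from a process state dump (sketch).
# High 16 bits = lock name in ASCII, low 16 bits = lock mode.
nm=0x54580004                 # value taken from the trace line above
mode=$(( nm & 0xFFFF ))       # low 16 bits  -> 4
hi=$(( nm >> 16 ))            # high 16 bits -> 0x5458
b1=$(( hi >> 8 ))             # 0x54 = 'T'
b2=$(( hi & 0xFF ))           # 0x58 = 'X'
# emit the two bytes via octal escapes (works with POSIX printf)
lock_type=$(printf "\\$(printf '%03o' "$b1")\\$(printf '%03o' "$b2")")
echo "enqueue $lock_type, mode $mode"
```

Running it prints “enqueue TX, mode 4”, matching the manual decode.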

A likely cause is multiple concurrent sessions inserting the same key value into a table that has one or more unique keys. The table involved in the SQL was identified, and it has a unique key (on 7 columns) and 4 foreign keys (to different tables). But I still have to verify…

And I may have to look into the application logic, starting with the insert statement causing the deadlock.
But it’s also possible that more statements are involved in this “Deadlock Storm”… and I’m just looking at one occurrence…
Hmm… interesting…

Lately, I’ve been using this shell command I got from Kyle Hailey’s site for quickly pulling ORA- errors out of the alert log.

See the code I’m using below:

Go to the bdump directory to run these shell commands

 Date and errors in alert.log

     # for each ORA- line, print the most recent timestamp line followed by the error
     cat alert_orcl.log | \
     awk 'BEGIN{buf=""}
          /[0-9]:[0-9][0-9]:[0-9]/{buf=$0}   # remember the last timestamp line seen
          /ORA-/{print buf,$0}' > ORA-errors_orcl.txt

On the client site, I executed the command on the alert log. Then, to my surprise… there are so many ORA-00060 errors (1,616 lines in total)! Which means there are more trace files I have to look at. Great, now I’m overwhelmed. :)

First… I want to know how often deadlocks occur, per month, day, and hour, so I can see the pattern of occurrence. Then, if there’s a pattern, I can ask the developer what he’s running in that period or which modules are frequently accessed. That way I can correlate what’s happening in the trace files with what they are doing in the application.

The 10g alert views could be useful for that kind of info (DBA_OUTSTANDING_ALERTS, DBA_ALERT_HISTORY, DBA_ALERT_ARGUMENTS, DBMS_ALERT_INFO, V$ALERT_TYPES).

But I want to dig into the alert log using some shell scripting… which I’m not very good at :p

I’m playing with this very simple command; it shows me that on August 7, 11AM there are 55 occurrences of “ORA-” errors. Some or all of them could be “ORA-00060” errors.

karao@karl:~/Desktop$ export Month=Aug
karao@karl:~/Desktop$ export Date=7
karao@karl:~/Desktop$ export Time=11
karao@karl:~/Desktop$ echo $Month $Date $Time "," `cut -d " " -f1,2,3,4,5,6 alert_orcl.log | grep "$Month  $Date $Time" | wc -l`
Aug 7 11 , 55

So it’s possible! I can get an output of “month, day, hour, and number of occurrences”.
I just have to iterate through the months and days… then count the hourly data so I’d know the rate of deadlocks per hour.
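Extending the one-liner above, a simple loop over the hours of one day would already give the hourly counts. This is a rough sketch (the `hourly_counts` name is made up); note the two spaces in the grep pattern, which match the padding before a single-digit day in the alert log timestamp, just like in the one-liner:

```shell
# Count timestamped alert.log entries per hour for one day (sketch).
# usage: hourly_counts <alert_log> <Month> <Date>   e.g. hourly_counts alert_orcl.log Aug 7
hourly_counts() {
  log=$1; Month=$2; Date=$3
  for Time in $(seq -w 0 23); do           # 00 .. 23
    # fields 1-6 of a timestamp line look like: "Fri Aug  7 11:02:33 2009"
    # (two spaces before a single-digit day, hence the double space in the pattern)
    n=$(cut -d " " -f1-6 "$log" | grep -c "$Month  $Date $Time:")
    [ "$n" -gt 0 ] && echo "$Month $Date $Time , $n"   # skip hours with count=0
  done
  return 0
}
```

Called as `hourly_counts alert_orcl.log Aug 7`, it emits one “Month Day Hour , count” line per non-empty hour.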

But I need some help from a shell script expert :p One of the internal Unix/DBA admins helped me and finished it just the way I wanted :) (Thanks a lot, Mel! And for integrating Kyle’s script.)

Here is the script (I’ve used this on AIX 5L with Oracle 10.2.0.4; use at your own risk. If I have more time I’ll test it on Linux and other database versions, or if you have a modified version of this script, let me know :) )

So now I can parse the alert log directly, search for the string I want (most likely an ORA- error code), then output the number of occurrences… :) (hours with count=0 are not shown)

See the sample below… I’m searching for the exact error message “ORA-00060” in the alert log.

karao@karl:~/Desktop$ ./AlertCount.sh alert_orcl.log ORA-00060
Jul 6 19 , 10
Jul 6 20 , 120
Jul 6 21 , 130
Jul 6 22 , 120
Jul 6 23 , 130
Jul 7 00 , 120
Jul 7 01 , 119
Jul 7 02 , 120
Jul 7 03 , 130
Jul 7 04 , 120
Jul 7 05 , 130
Jul 7 06 , 120
Jul 7 07 , 80
Jul 31 11 , 87
Aug 1 08 , 20
Aug 7 10 , 10
Aug 7 11 , 50

Then searching for the word “Deadlock” gives the same output…

karao@karl:~/Desktop$ ./AlertCount.sh alert_orcl.log Deadlock
Jul 6 19 , 10
Jul 6 20 , 120
Jul 6 21 , 130
Jul 6 22 , 120
Jul 6 23 , 130
Jul 7 00 , 120
Jul 7 01 , 119
Jul 7 02 , 120
Jul 7 03 , 130
Jul 7 04 , 120
Jul 7 05 , 130
Jul 7 06 , 120
Jul 7 07 , 80
Jul 31 11 , 87
Aug 1 08 , 20
Aug 7 10 , 10
Aug 7 11 , 50

Looking at the raw data… this hints that on July 6-7 there were a lot of deadlock occurrences. It’s possible that they have fixed some code, which cut the deadlock count in half by August. You’ll also notice that from July 7-31 there were no deadlock errors (I wonder why). And could there be a periodic batch process running on every 7th of the month? Well, we still have to investigate further… but at least we have the numbers to help us… :)

Also, to make the data more meaningful… you could show your boss a flashy graph :) Just spool the output to a .csv file, then open it in MS Excel.

karao@karl:~/Desktop$ ./AlertCount.sh alert_orcl.log ORA-00060 > ORA-00060.csv
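One caveat: the output rows look like “Jul 6 19 , 10”, which Excel won’t split cleanly into columns. A tiny filter (a sketch; the `to_csv` name is made up) turns them into a proper four-column CSV with a header:

```shell
# Turn "Jul 6 19 , 10" lines into "month,day,hour,count" CSV rows (sketch).
to_csv() {
  echo "month,day,hour,count"
  awk '{ print $1 "," $2 "," $3 "," $5 }'   # $4 is the literal "," separator
}
# usage: ./AlertCount.sh alert_orcl.log ORA-00060 | to_csv > ORA-00060.csv
```

With a real header row, Excel picks up the columns directly when building the chart.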

Select the data then graph it :)

[Image: DeadlockGraph1]

Here’s the graph… That’s the “trend” I want to see!

[Image: DeadlockTrend]

But the troubleshooting doesn’t stop here… I have to drill down into the specific hours, then look into the trace files, or make use of other tools or sources of info that could help me identify the root cause…

And I’m starting to read the posts of Doug and Miladin; I’m sure I’ll find something useful there. :)

Doug Burns
http://oracledoug.com/serendipity/index.php?/archives/1413-Good-News-for-ASH-Fans.html
http://oracledoug.com/serendipity/index.php?/archives/1478-Diagnosing-Locking-Problems-using-ASH-Part-2.html
http://oracledoug.com/serendipity/index.php?/archives/1480-Diagnosing-Locking-Problems-using-ASH-Part-3.html
http://oracledoug.com/serendipity/index.php?/archives/1481-Diagnosing-Locking-Problems-using-ASH-Part-4.html
http://oracledoug.com/serendipity/index.php?/archives/1484-Diagnosing-Locking-Problems-using-ASH-Part-5.html
http://oracledoug.com/serendipity/index.php?/archives/1487-Diagnosing-Locking-Problems-using-ASH-Part-6.html
http://oracledoug.com/serendipity/index.php?/archives/1488-Diagnosing-Locking-Problems-using-ASHLogMiner-Part-7.html
http://oracledoug.com/serendipity/index.php?/archives/1491-Diagnosing-Locking-Problems-using-ASHLogMiner-Part-8.html
http://oracledoug.com/serendipity/index.php?/archives/1492-Diagnosing-Locking-Problems-using-ASHLogMiner-Part-9.html
http://oracledoug.com/serendipity/index.php?/archives/1495-Diagnosing-Locking-Problems-using-ASHLogMiner-The-End.html

Miladin Modrakovic
http://oraclue.com/2009/04/20/detecting-deadlock-source/
http://oraclue.com/2009/04/23/detecting-deadlock-source-part-2/
