Karl Arao's Blog

Just another weblog about Oracle,Linux,Troubleshooting,Performance,etc..etc..

Posts Tagged ‘Scripting’

IOsaturationtoolkit-v2 with Exadata IORM and AWESOME text graph

Posted by karlarao on May 14, 2012

I’ve got a new version of IOsaturation toolkit which you can download here http://karlarao.wordpress.com/scripts-resources/ and it has a cool script called “smartscanloop” that shows you the Smart Scan MB/s per database across the Exadata  compute nodes.. it’s a per 2secs sample so that’s a pretty fine grained perf data and near real time text graph. Very useful for doing IORM demos and monitoring what database is currently hogging the IO resources and since it’s presented in a consolidated view you don’t have to go to each Enterprise Manager performance page and have a bunch of browser windows open.

The SECTION 1 is what I usually use to validate the IO numbers on the database side from my Orion (see oriontoolkit here) and Calibrate IO runs. I’ve been using it for quite a while on new RAC/non-RAC installations from client sites.. and I used it heavily on my R&D server while continuously enhancing the toolkit

The SECTION 2 gives you a standard tool to demonstrate the behavior of IORM (http://karlarao.tiddlyspot.com/#IORM).. so let’s say you are playing around with IORM percentage allocations for let’s say 3 databases the “saturate” script works well to generate load for each database and then you can observe the effects of the percentage allocation to the IO bandwidth/latency of each database.

when you run

./saturate 4 dbm1 2 exadb1

it will create 4 sessions on dbm1 and 2 sessions on exadb1 all doing parallel select and it outputs a log file for each database session. Each session log file has details on the start and end time, elapsed, MB/s which is pretty much everything you need to know to quantify the performance from a session level perspective. You’ll appreciate this session level output and be impressed on what IORM can do when you start investigating on IO prioritization as you see sessions from the other database having higher MB/s and lower elapsed times and as you play with different IORM scenarios.
cat *log | grep benchmark
Sample output below:
 benchmark ,instance       ,start            ,end              ,elapsed   ,MBs
 ----------,---------------,-----------------,-----------------,----------,-------
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    537
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:30,        62,    554
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:32,        63,    545
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    537
 benchmark ,exadb1         ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    539
 benchmark ,exadb1         ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    539

So the output of smartscanloop is the high level IO numbers across the cluster and the log files are your session detail numbers. Below is the simple output which just shows the SmartScan MB/s per database

This AWESOME text graph is similar to what you see in the Enterprise Manager performance page IO tab. Note that you’ll be seeing higher numbers of MB/s on the smartscanloop compared to EM because of a more fine grained interval (2secs) which is also the same behavior when you measure the IO latency as I explained here (avg latency issue)

then I’ve modified the script to have the advanced output that shows the Hierarchy of Exadata IO. See the updated README for more details on how to use it. Below is the output

What’s good about this is the numbers are about the same when you do a per 10secs snapshot of AWR.. compare the AAS and latency (avgwt ms) columns of the above image and below

Again, this is pretty useful for monitoring the high level smart scans IO that’s happening across your Exadata cluster, if you are on an environment where there’s separation of duties you can even hand off this script to the sys admins that are monitoring the storage cells with their home grown alerting scripts, kSar, or nagios.. so this will serve as their view on the database side of things.

And if any of your clients haven’t adopted the IORM, this is very useful for DEMOs to customer sites to showcase the IORM capabilities.. and if you don’t want to show the latency and other columns you can opt to just use the simple output which only shows the smart scans MB/s (see get_smartscan.simple on README ). Most of the time.. the simpler the output the easier for them (users) to understand.

Sweet! right?!?

Wait.. Does the toolkit work on non-Exadata DBs?

The SECTION 1 works on Exa and non-Exa (I’ve also mentioned this at the beginning of this post)

The SECTION 2 works well on Exa because the underlying scripts just makes use of dcli commands, some shell and SQL.. but if you are interested to have the smartscanloop output on non-Exa environment I have the script get_smartscan.nonexa on the toolkit that shows the “physical read total bytes” instead of the “cell physical IO bytes eligible for predicate offload”..

then on each node do this if it’s a non-Exa RAC

$ while : ; do ./get_smartscan.nonexa | grep "%" ; echo "--" ; sleep 2 ; done
%,05/12/12 12:57:33,DEMO1     ,      0,
%,05/12/12 12:57:35,dbm1      ,      0,
%,05/12/12 12:57:38,DBFS1     ,      0,
%,05/12/12 12:57:40,exadb1    ,      0,
%,05/12/12 12:57:42,PRIMARY1  ,      0,
--
%,05/12/12 12:57:46,DEMO1     ,      1,
%,05/12/12 12:57:48,dbm1      ,   2078, @@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:57:50,DBFS1     ,      0,
%,05/12/12 12:57:52,exadb1    ,    972, @@@@@@@@@@
%,05/12/12 12:57:55,PRIMARY1  ,      0,
--
%,05/12/12 12:57:59,DEMO1     ,      0,
%,05/12/12 12:58:01,dbm1      ,   1088, @@@@@@@@@@@
%,05/12/12 12:58:03,DBFS1     ,      0,
%,05/12/12 12:58:05,exadb1    ,   2356, @@@@@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:58:07,PRIMARY1  ,      0,
--
%,05/12/12 12:58:11,DEMO1     ,      0,
%,05/12/12 12:58:14,dbm1      ,   1000, @@@@@@@@@@
%,05/12/12 12:58:16,DBFS1     ,      0,
%,05/12/12 12:58:18,exadb1    ,   2336, @@@@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:58:20,PRIMARY1  ,      0,
--
%,05/12/12 12:58:24,DEMO1     ,      0,
%,05/12/12 12:58:26,dbm1      ,   1031, @@@@@@@@@@
%,05/12/12 12:58:29,DBFS1     ,      0,
%,05/12/12 12:58:31,exadb1    ,   2071, @@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:58:33,PRIMARY1  ,      0,
--

But wait! there’s more!

Since the output is in comma separated format.. you can just do the following

./smartscanloop > smartscanloop.txt

and leave it running… So you can do this when you want to characterize the IO of a batch run or reporting of multiple databases and the text file that will be generated is pretty small and it doesn’t hurt in terms of CPU/IO resources because it’s just doing distributed SSH and doing some SELECT on v$ views

then from the text file you can easily see what’s the highest MB/s by executing the command below


$ less smartscanloop.txt | sort -nk5 | tail
pd01db03: %,05/10/12 21:29:28,biprd2    ,   7358, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 08:57:12,biprd2    ,   7363, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 14:55:54,biprd2    ,   7454, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 09:29:57,biprd2    ,   7556, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 18:17:50,biprd2    ,   7785, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 19:12:19,biprd2    ,   7880, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 10:31:24,biprd2    ,   7886, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 19:15:48,biprd2    ,   8112, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 10:31:40,biprd2    ,   8138, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 18:49:46,biprd2    ,   9315, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

and since it’s a CSV file you can then easily graph it in Tableau! (just put a header on it first)


host       , time            , inst     , smartscan,
pd01db01: %,05/10/12 13:08:14,hcmprd1   ,      0,
pd01db01: %,05/10/12 13:08:16,paprd1    ,      0,
pd01db01: %,05/10/12 13:08:18,rmprd1    ,      0,
pd01db01: %,05/10/12 13:08:20,lmprd1    ,      0,
pd01db01: %,05/10/12 13:08:22,DBFSPRD1  ,      0,
pd01db01: %,05/10/12 13:08:24,hcrprd1   ,      0,
pd01db02: %,05/10/12 13:08:14,hcrprd2   ,      0,
pd01db02: %,05/10/12 13:08:16,rmprd2    ,      0,
pd01db02: %,05/10/12 13:08:18,paprd2    ,      0,
pd01db02: %,05/10/12 13:08:20,hcmprd2   ,      0,
pd01db02: %,05/10/12 13:08:22,DBFSPRD2  ,      0,
pd01db02: %,05/10/12 13:08:24,lmprd2    ,      0,
pd01db03: %,05/10/12 13:08:14,biprd2    ,   3412, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 13:08:16,DBFSPRD3  ,      0,
pd01db04: %,05/10/12 13:08:14,DBFSPRD4  ,      0,
pd01db04: %,05/10/12 13:08:16,mvwprd2   ,      0,
pd01db04: %,05/10/12 13:08:18,fsprd2    ,    763, @@@@@@@@
pd01db04: %,05/10/12 13:08:20,mtaprd112 ,      0,
... output snipped ...

and hack the data! remember on the advanced output you can slice and dice it by host, time, inst, aas, latency, returned, ic, smartscan.. :)

Here’s the IO graph per instance

and IO graph per instance and host

Hope I’ve shared you some good stuff ;)










Posted in Oracle, Performance | Tagged: , , , | Leave a Comment »

Knowing the trend of Deadlock occurrences from the Alert Log

Posted by karlarao on August 15, 2009

Recently, my client deployed a new application and had this intermittent “Deadlock Storm” …

A trace file was sent and I was able to pinpoint the cause of the deadlock and the session that caused it.
The deadlock was a TX enqueue with mode of 4 (S – share) which could be verified by looking at the following lines of the Process State dump:

   last wait for 'enq: TX - row lock contention' blocking sess=0x 7000000cb239d60 seq=7849 wait_time=2929705 seconds since wait started=3
            name|mode=54580004, usn<<16 | slot=a0028, sequence=283f2

the “enqueue and lock mode” is explained as:
mode=54580004 (see above)
5458 (hex) = TX (ascii)
0004 (hex) = mode 4 (S – share)

The possible cause would be multiple concurrent sessions insert the same key value into a table that has one or more unique key. The table involved on the SQL was identified and has a unique key (on 7 columns) and 4 foreign keys (on different tables). But still I have to verify…

And I may have to look into the application logic, and may have to start with the insert statement causing the deadlock.
But, it could also be possible that there are more statements involved in this “Deadlock storm”.. and I’m just looking on one occurence..
hmm.. interesting..

Lately, I’ve been using this shell command I got from Kyle Hailey‘s site for quickly getting ORA- errors on the alert log.

See the code I’m using below:

Go to the bdump directory to run these shell commands

 Date and errors in alert.log

     cat alert_orcl.log | \
     awk 'BEGIN{buf=""}
          /[0-9]:[0-9][0-9]:[0-9]/{buf=$0}
          /ORA-/{print buf,$0}' > ORA-errors_orcl.txt

On the client site, I executed the command on the alert log. Then to my surprise.. there are so many ORA-00060 errors (total of 1,616 lines)! which means, there are more trace files that I have to look on. Great, Now I’m overwhelmed. :)

First…I want to know how often Deadlocks occur, I want it per month, day, and hour. So I could see the pattern of the occurrence then, if there’s a pattern I could ask the Developer what he’s running on that period or what are the modules that are frequently accessed. That way I could correlate what’s happening on the trace files to what are they doing on the application.

The alert views of 10g could be useful to get that kind of info (DBA_OUTSTANDING_ALERTS, DBA_ALERT_HISTORY, DBA_ALERT_ARGUMENTS, DBMS_ALERT_INFO, V$ALERT_TYPES).

But I want to dig on the alert log using some shell scripting.. which I’m not very good at :p

I’m playing with this very simple command, this shows me that on August 7 11AM there are 55 occurrence of “ORA-” errors. Some or all of them could be “ORA-00060″ errors.

karao@karl:~/Desktop$ export Month=Aug
karao@karl:~/Desktop$ export Date=7
karao@karl:~/Desktop$ export Time=11
karao@karl:~/Desktop$ echo $Month $Date $Time "," `cut -d " " -f1,2,3,4,5,6 alert_orcl.log | grep "$Month  $Date $Time" | wc -l`
Aug 7 11 , 55

So it’s possible! I could have an output of “month, day, hour, and number of occurrence”
I just have to iterate through the month and day…… then have a count of the hourly data so I would know the rate of deadlocks per hour.

But I need some help from a shell script expert :p One of the internal Unix/DBA admins helped me and finished it just the way I wanted it :) (Thanks a lot Mel! and for integrating Kyle’s script)

Here is the script (I’ve used this on AIX 5L with Oracle 10.2.0.4, use at your own risk. If I have more time I could test this on Linux and other database versions, or if you have modified version of this script, let me know :) )

So now I could directly parse on the alert log, search for the search string that I want (most likely an ORA- error code).. then output the number of occurrence… :) (hours with count=0 is not shown)

See the sample below… I’m searching for the exact error message “ORA-00060″ on the alert log.

karao@karl:~/Desktop$ ./AlertCount.sh alert_orcl.log ORA-00060
Jul 6 19 , 10
Jul 6 20 , 120
Jul 6 21 , 130
Jul 6 22 , 120
Jul 6 23 , 130
Jul 7 00 , 120
Jul 7 01 , 119
Jul 7 02 , 120
Jul 7 03 , 130
Jul 7 04 , 120
Jul 7 05 , 130
Jul 7 06 , 120
Jul 7 07 , 80
Jul 31 11 , 87
Aug 1 08 , 20
Aug 7 10 , 10
Aug 7 11 , 50

Then search for the word “Deadlock” that gives the same output..

karao@karl:~/Desktop$ ./AlertCount.sh alert_orcl.log Deadlock
Jul 6 19 , 10
Jul 6 20 , 120
Jul 6 21 , 130
Jul 6 22 , 120
Jul 6 23 , 130
Jul 7 00 , 120
Jul 7 01 , 119
Jul 7 02 , 120
Jul 7 03 , 130
Jul 7 04 , 120
Jul 7 05 , 130
Jul 7 06 , 120
Jul 7 07 , 80
Jul 31 11 , 87
Aug 1 08 , 20
Aug 7 10 , 10
Aug 7 11 , 50

Looking at the raw data.. this gives me a hint that on July 6-7 there were a lot of Deadlock occurrence. Then it’s possible that they have fixed some code which cut the Deadlock count to half on August. Also you’ll notice that from July 7-31 there were no Deadlock errors (I wonder why). And, could there be a periodic batch process running on every 7th of the month? Well, we still have to further investigate.. but at least we have the numbers to help us.. :)

Also, to have a more meaningful data.. you could show your boss a flashy graph.. :) just spool the output to a .csv file then open it on MS Excel

karao@karl:~/Desktop$ ./AlertCount.sh alert_orcl.log ORA-00060 > ORA-00060.csv

Select the data then graph it :)

DeadlockGraph1

Here’s the graph…. That’s the “trend” that I want to see!

DeadlockTrend

But the troubleshooting does not stop here.. I have to drill down on the specific hours, then look into the trace files. Or make use of some other tools or source of info that could help me identify the root cause..

And I’m starting to read the posts of Doug and Miladin, I’m sure I’ll get something useful on their posts. :)

Doug Burns
http://oracledoug.com/serendipity/index.php?/archives/1413-Good-News-for-ASH-Fans.html
http://oracledoug.com/serendipity/index.php?/archives/1478-Diagnosing-Locking-Problems-using-ASH-Part-2.html
http://oracledoug.com/serendipity/index.php?/archives/1480-Diagnosing-Locking-Problems-using-ASH-Part-3.html
http://oracledoug.com/serendipity/index.php?/archives/1481-Diagnosing-Locking-Problems-using-ASH-Part-4.html
http://oracledoug.com/serendipity/index.php?/archives/1484-Diagnosing-Locking-Problems-using-ASH-Part-5.html
http://oracledoug.com/serendipity/index.php?/archives/1487-Diagnosing-Locking-Problems-using-ASH-Part-6.html
http://oracledoug.com/serendipity/index.php?/archives/1488-Diagnosing-Locking-Problems-using-ASHLogMiner-Part-7.html
http://oracledoug.com/serendipity/index.php?/archives/1491-Diagnosing-Locking-Problems-using-ASHLogMiner-Part-8.html
http://oracledoug.com/serendipity/index.php?/archives/1492-Diagnosing-Locking-Problems-using-ASHLogMiner-Part-9.html
http://oracledoug.com/serendipity/index.php?/archives/1495-Diagnosing-Locking-Problems-using-ASHLogMiner-The-End.html

Miladin Modrakovic
http://oraclue.com/2009/04/20/detecting-deadlock-source/
http://oraclue.com/2009/04/23/detecting-deadlock-source-part-2/
.
.

Posted in Oracle, Troubleshooting | Tagged: , , , , | 9 Comments »

 
Follow

Get every new post delivered to your Inbox.

Join 27 other followers