Karl Arao's Blog

Just another weblog about Oracle,Linux,Troubleshooting,Performance,etc..etc..

Posts Tagged ‘Performance’

IOsaturationtoolkit-v2 with Exadata IORM and AWESOME text graph

Posted by karlarao on May 14, 2012

I’ve got a new version of IOsaturation toolkit which you can download here http://karlarao.wordpress.com/scripts-resources/ and it has a cool script called “smartscanloop” that shows you the Smart Scan MB/s per database across the Exadata  compute nodes.. it’s a per 2secs sample so that’s a pretty fine grained perf data and near real time text graph. Very useful for doing IORM demos and monitoring what database is currently hogging the IO resources and since it’s presented in a consolidated view you don’t have to go to each Enterprise Manager performance page and have a bunch of browser windows open.

The SECTION 1 is what I usually use to validate the IO numbers on the database side from my Orion (see oriontoolkit here) and Calibrate IO runs. I’ve been using it for quite a while on new RAC/non-RAC installations from client sites.. and I used it heavily on my R&D server while continuously enhancing the toolkit

The SECTION 2 gives you a standard tool to demonstrate the behavior of IORM (http://karlarao.tiddlyspot.com/#IORM).. so let’s say you are playing around with IORM percentage allocations for let’s say 3 databases the “saturate” script works well to generate load for each database and then you can observe the effects of the percentage allocation to the IO bandwidth/latency of each database.

when you run

./saturate 4 dbm1 2 exadb1

it will create 4 sessions on dbm1 and 2 sessions on exadb1 all doing parallel select and it outputs a log file for each database session. Each session log file has details on the start and end time, elapsed, MB/s which is pretty much everything you need to know to quantify the performance from a session level perspective. You’ll appreciate this session level output and be impressed on what IORM can do when you start investigating on IO prioritization as you see sessions from the other database having higher MB/s and lower elapsed times and as you play with different IORM scenarios.
cat *log | grep benchmark
Sample output below:
 benchmark ,instance       ,start            ,end              ,elapsed   ,MBs
 ----------,---------------,-----------------,-----------------,----------,-------
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    537
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:30,        62,    554
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:32,        63,    545
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    537
 benchmark ,exadb1         ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    539
 benchmark ,exadb1         ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    539

So the output of smartscanloop is the high level IO numbers across the cluster and the log files are your session detail numbers. Below is the simple output which just shows the SmartScan MB/s per database

This AWESOME text graph is similar to what you see in the Enterprise Manager performance page IO tab. Note that you’ll be seeing higher numbers of MB/s on the smartscanloop compared to EM because of a more fine grained interval (2secs) which is also the same behavior when you measure the IO latency as I explained here (avg latency issue)

then I’ve modified the script to have the advanced output that shows the Hierarchy of Exadata IO. See the updated README for more details on how to use it. Below is the output

What’s good about this is the numbers are about the same when you do a per 10secs snapshot of AWR.. compare the AAS and latency (avgwt ms) columns of the above image and below

Again, this is pretty useful for monitoring the high level smart scans IO that’s happening across your Exadata cluster, if you are on an environment where there’s separation of duties you can even hand off this script to the sys admins that are monitoring the storage cells with their home grown alerting scripts, kSar, or nagios.. so this will serve as their view on the database side of things.

And if any of your clients haven’t adopted the IORM, this is very useful for DEMOs to customer sites to showcase the IORM capabilities.. and if you don’t want to show the latency and other columns you can opt to just use the simple output which only shows the smart scans MB/s (see get_smartscan.simple on README ). Most of the time.. the simpler the output the easier for them (users) to understand.

Sweet! right?!?

Wait.. Does the toolkit work on non-Exadata DBs?

The SECTION 1 works on Exa and non-Exa (I’ve also mentioned this at the beginning of this post)

The SECTION 2 works well on Exa because the underlying scripts just makes use of dcli commands, some shell and SQL.. but if you are interested to have the smartscanloop output on non-Exa environment I have the script get_smartscan.nonexa on the toolkit that shows the “physical read total bytes” instead of the “cell physical IO bytes eligible for predicate offload”..

then on each node do this if it’s a non-Exa RAC

$ while : ; do ./get_smartscan.nonexa | grep "%" ; echo "--" ; sleep 2 ; done
%,05/12/12 12:57:33,DEMO1     ,      0,
%,05/12/12 12:57:35,dbm1      ,      0,
%,05/12/12 12:57:38,DBFS1     ,      0,
%,05/12/12 12:57:40,exadb1    ,      0,
%,05/12/12 12:57:42,PRIMARY1  ,      0,
--
%,05/12/12 12:57:46,DEMO1     ,      1,
%,05/12/12 12:57:48,dbm1      ,   2078, @@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:57:50,DBFS1     ,      0,
%,05/12/12 12:57:52,exadb1    ,    972, @@@@@@@@@@
%,05/12/12 12:57:55,PRIMARY1  ,      0,
--
%,05/12/12 12:57:59,DEMO1     ,      0,
%,05/12/12 12:58:01,dbm1      ,   1088, @@@@@@@@@@@
%,05/12/12 12:58:03,DBFS1     ,      0,
%,05/12/12 12:58:05,exadb1    ,   2356, @@@@@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:58:07,PRIMARY1  ,      0,
--
%,05/12/12 12:58:11,DEMO1     ,      0,
%,05/12/12 12:58:14,dbm1      ,   1000, @@@@@@@@@@
%,05/12/12 12:58:16,DBFS1     ,      0,
%,05/12/12 12:58:18,exadb1    ,   2336, @@@@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:58:20,PRIMARY1  ,      0,
--
%,05/12/12 12:58:24,DEMO1     ,      0,
%,05/12/12 12:58:26,dbm1      ,   1031, @@@@@@@@@@
%,05/12/12 12:58:29,DBFS1     ,      0,
%,05/12/12 12:58:31,exadb1    ,   2071, @@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:58:33,PRIMARY1  ,      0,
--

But wait! there’s more!

Since the output is in comma separated format.. you can just do the following

./smartscanloop > smartscanloop.txt

and leave it running… So you can do this when you want to characterize the IO of a batch run or reporting of multiple databases and the text file that will be generated is pretty small and it doesn’t hurt in terms of CPU/IO resources because it’s just doing distributed SSH and doing some SELECT on v$ views

then from the text file you can easily see what’s the highest MB/s by executing the command below


$ less smartscanloop.txt | sort -nk5 | tail
pd01db03: %,05/10/12 21:29:28,biprd2    ,   7358, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 08:57:12,biprd2    ,   7363, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 14:55:54,biprd2    ,   7454, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 09:29:57,biprd2    ,   7556, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 18:17:50,biprd2    ,   7785, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 19:12:19,biprd2    ,   7880, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 10:31:24,biprd2    ,   7886, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 19:15:48,biprd2    ,   8112, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 10:31:40,biprd2    ,   8138, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 18:49:46,biprd2    ,   9315, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

and since it’s a CSV file you can then easily graph it in Tableau! (just put a header on it first)


host       , time            , inst     , smartscan,
pd01db01: %,05/10/12 13:08:14,hcmprd1   ,      0,
pd01db01: %,05/10/12 13:08:16,paprd1    ,      0,
pd01db01: %,05/10/12 13:08:18,rmprd1    ,      0,
pd01db01: %,05/10/12 13:08:20,lmprd1    ,      0,
pd01db01: %,05/10/12 13:08:22,DBFSPRD1  ,      0,
pd01db01: %,05/10/12 13:08:24,hcrprd1   ,      0,
pd01db02: %,05/10/12 13:08:14,hcrprd2   ,      0,
pd01db02: %,05/10/12 13:08:16,rmprd2    ,      0,
pd01db02: %,05/10/12 13:08:18,paprd2    ,      0,
pd01db02: %,05/10/12 13:08:20,hcmprd2   ,      0,
pd01db02: %,05/10/12 13:08:22,DBFSPRD2  ,      0,
pd01db02: %,05/10/12 13:08:24,lmprd2    ,      0,
pd01db03: %,05/10/12 13:08:14,biprd2    ,   3412, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 13:08:16,DBFSPRD3  ,      0,
pd01db04: %,05/10/12 13:08:14,DBFSPRD4  ,      0,
pd01db04: %,05/10/12 13:08:16,mvwprd2   ,      0,
pd01db04: %,05/10/12 13:08:18,fsprd2    ,    763, @@@@@@@@
pd01db04: %,05/10/12 13:08:20,mtaprd112 ,      0,
... output snipped ...

and hack the data! remember on the advanced output you can slice and dice it by host, time, inst, aas, latency, returned, ic, smartscan.. :)

Here’s the IO graph per instance

and IO graph per instance and host

Hope I’ve shared you some good stuff ;)










Posted in Oracle, Performance | Tagged: , , , | Leave a Comment »

Fast Analytics of AWR Top Events

Posted by karlarao on March 24, 2012

I’ve been working on a lot of good schtuff lately on the area of capacity planning. And I’ve greatly improved my time to generate workload characterization visualization and analysis using my AWR scripts which I enhanced to fit on the analytics tool that I’ve been using.. and that is Tableau.

So I’ve got a couple of performance and capacity planning use case scenarios which I will blog in parts in the next few days or weeks. But before that I need to familiarize you on how I mine this valuable AWR performance data.

Let’s get started with the AWR top events, the same top events that you see in your AWR reports but presented in a time series manner across SNAP_IDs…

So I’ve got this script called awr_topevents_v2.sql http://goo.gl/TufUj which I added a section to compute for “CPU wait” (new metric in 11g) to include the “unaccounted for DB Time” on high run queue or CPU intensive workloads. This is a newer version of the script that I used here http://karlarao.wordpress.com/2010/07/25/graphing-the-aas-with-perfsheet-a-la-enterprise-manager/

On the screenshots below I spooled the output on a text file and transformed it to a CSV format. Then from there you can play around with the data points using Tableau that you can download here http://www.tableausoftware.com/public/download

Is this a free software? Yes. But the catch is Tableau Public only allows you to have 100K rows on a given text or csv file. But for my capacity planning and troubleshooting purposes that’s enough to do the work :) The paid version of Tableau (Desktop) allows you to connect directly to the database but that will cost you $1999. ouch. The Tableau (Desktop Personal) is much like the Public version but with no data limit and it allows you to save your work on a specific file format (.twb and .tde files), Yes guys in Public version you are only allowed to save it to their Public portal… but that’s fine as long as I have the data points I can quickly work on the visualization again. Here’s the comparison if you are interested on the detailed product comparison http://www.tableausoftware.com/products/desktop/specs

Once you get the CSV file ready which you can download HERE

Open the Tableau and follow the steps below…

1) This is a trial version of Tableau Desktop.. the steps should be the same or similar with the Public version. Click “Connect to data”

2) Select the CSV file

3) Select “Text File”

4) Then you can view the data points

5) Data Points are shown below

6) Click OK

7) Import all data

8) It will create a file format that I believe where it stores the dimension data

9) After the data points import

10) Drag the “Events” on the color section, then “Add all members”

11) Click OK

12) Drag the AAS on the Rows

13) SNAP_ID on the columns. That’s your AWR SNAP_IDs

14) Filter the negative values. Those are the periods where you had instance shutdown.

15) Exclude the Nulls

16) And now you can select an area! hover on it and it will give you the event and the AAS number (the higher the AAS the more database work is happening)

17) Play around with the AAS data points

18) Now here it will show you detailed SNAP_ID data points (the dots) and you can hover on it

19) You can also view the underlying data points of those dots

20) Here it shows the event, SNAP_IDs, and the AAS value.. so here you can grab the SNAP_ID and execute the AWRRPT.sql on the database server to generate the larger AWR report

21) Drilling down more on the data points

22) You can also focus just on specific events. Let’s say just the CPU..

23) or the event that’s hogging most of the resources

24) and just select the data points from that bottleneck

25) Copy the summary info

26) Then play around with the data in command line

And clearly the database bottleneck is the PX event… which could just be the surface of the problem or a symptom for another issue..
And then visually you’ll start finding trends on the workload..

let’s say on this workload I found sustained high workload periods that would last for days..and these could be the periods where the batch jobs are executed..
SNAP_ID 34241 – 34306
11/07/06 03:30 – 11/07/07 12:00SNAP_ID 34417 – 34516
11/07/10 13:30 – 11/07/12 15:00SNAP_ID 34565 – 34694
11/07/13 15:30 – 11/07/16 08:00SNAP_ID 35213 – 35322
11/07/27 03:30 – 11/07/29 10:00

SNAP_ID 35363 – 35477
11/07/30 06:30 – 11/08/01 15:30

SNAP_ID 35590 – 35670
11/08/04 00:00 – 11/08/05 16:00

And if you want to have regression analysis, you must put all of these data points on the Y and X axis (I’ll create a script for that one of these days).. that way the r2toolkit will have a very good scope of data.. and then you’ll get your nice R2 trend..
but it doesn’t stop there..
you have to drill down on the data points, use the awrrpt, addmrpt, awrsqrpt, ashrpt, and drill down on sessions with snapper, os utilities, etc. to get to the bottom of the issue
so that’s forecasting+troubleshooting!
Update: 
The screenshots above just makes sense of the SNAP_ID for the time series data. After a little more AWR data hacking I was able to make sense of the date value which is really useful because I no longer have think of SNAP_IDs across instances and getting them aligned on a particular period… and the trick here is Tableau automatically creates a time dimension for the time column of my AWR scripts.  Kewlness :)  and that way I can stitch performance data across hostnames and consolidated instances of an entire Exadata Full Rack! and I’ll show how it is done on the next couple of posts.
The graph below shows an example of having a time dimension. Below is the AWR data (shown by day and hour) coming from a non-RAC database on a SAN storage with a default AWR retention period. The arrows below are pointing to a workload period where they are  running a report which has a SQL that has access paths doing lots of multiblock IOs (db file scattered read) and that throttles the performance of the (already slow) storage causing the IO latency to shoot up. This graph was pretty useful on the storage performance troubleshooting which correlates to the numbers of OS (iostat, vmstat, and atop) and session level (snapper) tools. Now the single block latency (db file sequential read) has always been on the 20ms below range regardless of load. So fixing the SQL to favor the use of indexes (db file sequential read) made it do efficient IO, that made the SQL run from 15mins to 10secs.

 

Hope I’ve shared you some good stuff!

 

 

Posted in Oracle, Performance, Troubleshooting | Tagged: , , , | 1 Comment »

RMOUG 2012 training days

Posted by karlarao on February 13, 2012

In the next few days I’ll be at RMOUG 2012 training days! and I’ll be again presenting the topic so dear to my heart. Mining the AWR ;) I’ve updated the material with a couple of new research and findings, some of them are as follows:

  • CPU Wait (new metric in 11g Performance page)
  • Latency issues on virtualized environments

So if you are attending the RMOUG training days, stop by at my session @ room 401 Thursday 9:45 am-10:45 am

There will also be a RAC Attack at the exhibition area so that must be fun ;)

and don’t forget to follow RMOUG’s twitter @RMOUG_ORG for updates during the conference!

See yah!

 

Update: 

I had a great time at RMOUG, was able to meet old/new friends and had a good number of attendees. The organizers really did a good job :)  Here’s the updated PPT I used during the presentation.
That’s the most recent, the one on the RMOUG is not yet updated. For the scripts http://karlarao.wordpress.com/scripts-resources/ they are all at the “PerformanceAndTroubleshooting” folder

Thanks to all who attended my presentation, I know that’s a bit too much technical stuff for 1hour presentation squeezing all 101 slides and all that info. The best way is to go through my notes on the slides, check out all the links I pointed, and try to poke around on the scripts. And I assure you will learn tons of stuff & greatly appreciate the material  :)

Posted in Community, Oracle, Performance | Tagged: , , | Leave a Comment »

 
Follow

Get every new post delivered to your Inbox.

Join 27 other followers