Karl Arao's Blog

Just another weblog about Oracle,Linux,Troubleshooting,Performance,etc..etc..

Posts Tagged ‘Visualization’

IOsaturationtoolkit-v2 with Exadata IORM and AWESOME text graph

Posted by karlarao on May 14, 2012

I’ve got a new version of IOsaturation toolkit which you can download here http://karlarao.wordpress.com/scripts-resources/ and it has a cool script called “smartscanloop” that shows you the Smart Scan MB/s per database across the Exadata  compute nodes.. it’s a per 2secs sample so that’s a pretty fine grained perf data and near real time text graph. Very useful for doing IORM demos and monitoring what database is currently hogging the IO resources and since it’s presented in a consolidated view you don’t have to go to each Enterprise Manager performance page and have a bunch of browser windows open.

The SECTION 1 is what I usually use to validate the IO numbers on the database side from my Orion (see oriontoolkit here) and Calibrate IO runs. I’ve been using it for quite a while on new RAC/non-RAC installations from client sites.. and I used it heavily on my R&D server while continuously enhancing the toolkit

The SECTION 2 gives you a standard tool to demonstrate the behavior of IORM (http://karlarao.tiddlyspot.com/#IORM).. so let’s say you are playing around with IORM percentage allocations for let’s say 3 databases the “saturate” script works well to generate load for each database and then you can observe the effects of the percentage allocation to the IO bandwidth/latency of each database.

when you run

./saturate 4 dbm1 2 exadb1

it will create 4 sessions on dbm1 and 2 sessions on exadb1 all doing parallel select and it outputs a log file for each database session. Each session log file has details on the start and end time, elapsed, MB/s which is pretty much everything you need to know to quantify the performance from a session level perspective. You’ll appreciate this session level output and be impressed on what IORM can do when you start investigating on IO prioritization as you see sessions from the other database having higher MB/s and lower elapsed times and as you play with different IORM scenarios.
cat *log | grep benchmark
Sample output below:
 benchmark ,instance       ,start            ,end              ,elapsed   ,MBs
 ----------,---------------,-----------------,-----------------,----------,-------
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    537
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:30,        62,    554
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:32,        63,    545
 benchmark ,dbm1           ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    537
 benchmark ,exadb1         ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    539
 benchmark ,exadb1         ,05/13/12 19:18:28,05/13/12 19:19:32,        64,    539

So the output of smartscanloop is the high level IO numbers across the cluster and the log files are your session detail numbers. Below is the simple output which just shows the SmartScan MB/s per database

This AWESOME text graph is similar to what you see in the Enterprise Manager performance page IO tab. Note that you’ll be seeing higher numbers of MB/s on the smartscanloop compared to EM because of a more fine grained interval (2secs) which is also the same behavior when you measure the IO latency as I explained here (avg latency issue)

then I’ve modified the script to have the advanced output that shows the Hierarchy of Exadata IO. See the updated README for more details on how to use it. Below is the output

What’s good about this is the numbers are about the same when you do a per 10secs snapshot of AWR.. compare the AAS and latency (avgwt ms) columns of the above image and below

Again, this is pretty useful for monitoring the high level smart scans IO that’s happening across your Exadata cluster, if you are on an environment where there’s separation of duties you can even hand off this script to the sys admins that are monitoring the storage cells with their home grown alerting scripts, kSar, or nagios.. so this will serve as their view on the database side of things.

And if any of your clients haven’t adopted the IORM, this is very useful for DEMOs to customer sites to showcase the IORM capabilities.. and if you don’t want to show the latency and other columns you can opt to just use the simple output which only shows the smart scans MB/s (see get_smartscan.simple on README ). Most of the time.. the simpler the output the easier for them (users) to understand.

Sweet! right?!?

Wait.. Does the toolkit work on non-Exadata DBs?

The SECTION 1 works on Exa and non-Exa (I’ve also mentioned this at the beginning of this post)

The SECTION 2 works well on Exa because the underlying scripts just makes use of dcli commands, some shell and SQL.. but if you are interested to have the smartscanloop output on non-Exa environment I have the script get_smartscan.nonexa on the toolkit that shows the “physical read total bytes” instead of the “cell physical IO bytes eligible for predicate offload”..

then on each node do this if it’s a non-Exa RAC

$ while : ; do ./get_smartscan.nonexa | grep "%" ; echo "--" ; sleep 2 ; done
%,05/12/12 12:57:33,DEMO1     ,      0,
%,05/12/12 12:57:35,dbm1      ,      0,
%,05/12/12 12:57:38,DBFS1     ,      0,
%,05/12/12 12:57:40,exadb1    ,      0,
%,05/12/12 12:57:42,PRIMARY1  ,      0,
--
%,05/12/12 12:57:46,DEMO1     ,      1,
%,05/12/12 12:57:48,dbm1      ,   2078, @@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:57:50,DBFS1     ,      0,
%,05/12/12 12:57:52,exadb1    ,    972, @@@@@@@@@@
%,05/12/12 12:57:55,PRIMARY1  ,      0,
--
%,05/12/12 12:57:59,DEMO1     ,      0,
%,05/12/12 12:58:01,dbm1      ,   1088, @@@@@@@@@@@
%,05/12/12 12:58:03,DBFS1     ,      0,
%,05/12/12 12:58:05,exadb1    ,   2356, @@@@@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:58:07,PRIMARY1  ,      0,
--
%,05/12/12 12:58:11,DEMO1     ,      0,
%,05/12/12 12:58:14,dbm1      ,   1000, @@@@@@@@@@
%,05/12/12 12:58:16,DBFS1     ,      0,
%,05/12/12 12:58:18,exadb1    ,   2336, @@@@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:58:20,PRIMARY1  ,      0,
--
%,05/12/12 12:58:24,DEMO1     ,      0,
%,05/12/12 12:58:26,dbm1      ,   1031, @@@@@@@@@@
%,05/12/12 12:58:29,DBFS1     ,      0,
%,05/12/12 12:58:31,exadb1    ,   2071, @@@@@@@@@@@@@@@@@@@@@
%,05/12/12 12:58:33,PRIMARY1  ,      0,
--

But wait! there’s more!

Since the output is in comma separated format.. you can just do the following

./smartscanloop > smartscanloop.txt

and leave it running… So you can do this when you want to characterize the IO of a batch run or reporting of multiple databases and the text file that will be generated is pretty small and it doesn’t hurt in terms of CPU/IO resources because it’s just doing distributed SSH and doing some SELECT on v$ views

then from the text file you can easily see what’s the highest MB/s by executing the command below


$ less smartscanloop.txt | sort -nk5 | tail
pd01db03: %,05/10/12 21:29:28,biprd2    ,   7358, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 08:57:12,biprd2    ,   7363, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 14:55:54,biprd2    ,   7454, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 09:29:57,biprd2    ,   7556, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 18:17:50,biprd2    ,   7785, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 19:12:19,biprd2    ,   7880, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 10:31:24,biprd2    ,   7886, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 19:15:48,biprd2    ,   8112, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/11/12 10:31:40,biprd2    ,   8138, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 18:49:46,biprd2    ,   9315, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

and since it’s a CSV file you can then easily graph it in Tableau! (just put a header on it first)


host       , time            , inst     , smartscan,
pd01db01: %,05/10/12 13:08:14,hcmprd1   ,      0,
pd01db01: %,05/10/12 13:08:16,paprd1    ,      0,
pd01db01: %,05/10/12 13:08:18,rmprd1    ,      0,
pd01db01: %,05/10/12 13:08:20,lmprd1    ,      0,
pd01db01: %,05/10/12 13:08:22,DBFSPRD1  ,      0,
pd01db01: %,05/10/12 13:08:24,hcrprd1   ,      0,
pd01db02: %,05/10/12 13:08:14,hcrprd2   ,      0,
pd01db02: %,05/10/12 13:08:16,rmprd2    ,      0,
pd01db02: %,05/10/12 13:08:18,paprd2    ,      0,
pd01db02: %,05/10/12 13:08:20,hcmprd2   ,      0,
pd01db02: %,05/10/12 13:08:22,DBFSPRD2  ,      0,
pd01db02: %,05/10/12 13:08:24,lmprd2    ,      0,
pd01db03: %,05/10/12 13:08:14,biprd2    ,   3412, @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
pd01db03: %,05/10/12 13:08:16,DBFSPRD3  ,      0,
pd01db04: %,05/10/12 13:08:14,DBFSPRD4  ,      0,
pd01db04: %,05/10/12 13:08:16,mvwprd2   ,      0,
pd01db04: %,05/10/12 13:08:18,fsprd2    ,    763, @@@@@@@@
pd01db04: %,05/10/12 13:08:20,mtaprd112 ,      0,
... output snipped ...

and hack the data! remember on the advanced output you can slice and dice it by host, time, inst, aas, latency, returned, ic, smartscan.. :)

Here’s the IO graph per instance

and IO graph per instance and host

Hope I’ve shared you some good stuff ;)










Posted in Oracle, Performance | Tagged: , , , | Leave a Comment »

Fast Analytics of AWR Top Events

Posted by karlarao on March 24, 2012

I’ve been working on a lot of good schtuff lately on the area of capacity planning. And I’ve greatly improved my time to generate workload characterization visualization and analysis using my AWR scripts which I enhanced to fit on the analytics tool that I’ve been using.. and that is Tableau.

So I’ve got a couple of performance and capacity planning use case scenarios which I will blog in parts in the next few days or weeks. But before that I need to familiarize you on how I mine this valuable AWR performance data.

Let’s get started with the AWR top events, the same top events that you see in your AWR reports but presented in a time series manner across SNAP_IDs…

So I’ve got this script called awr_topevents_v2.sql http://goo.gl/TufUj which I added a section to compute for “CPU wait” (new metric in 11g) to include the “unaccounted for DB Time” on high run queue or CPU intensive workloads. This is a newer version of the script that I used here http://karlarao.wordpress.com/2010/07/25/graphing-the-aas-with-perfsheet-a-la-enterprise-manager/

On the screenshots below I spooled the output on a text file and transformed it to a CSV format. Then from there you can play around with the data points using Tableau that you can download here http://www.tableausoftware.com/public/download

Is this a free software? Yes. But the catch is Tableau Public only allows you to have 100K rows on a given text or csv file. But for my capacity planning and troubleshooting purposes that’s enough to do the work :) The paid version of Tableau (Desktop) allows you to connect directly to the database but that will cost you $1999. ouch. The Tableau (Desktop Personal) is much like the Public version but with no data limit and it allows you to save your work on a specific file format (.twb and .tde files), Yes guys in Public version you are only allowed to save it to their Public portal… but that’s fine as long as I have the data points I can quickly work on the visualization again. Here’s the comparison if you are interested on the detailed product comparison http://www.tableausoftware.com/products/desktop/specs

Once you get the CSV file ready which you can download HERE

Open the Tableau and follow the steps below…

1) This is a trial version of Tableau Desktop.. the steps should be the same or similar with the Public version. Click “Connect to data”

2) Select the CSV file

3) Select “Text File”

4) Then you can view the data points

5) Data Points are shown below

6) Click OK

7) Import all data

8) It will create a file format that I believe where it stores the dimension data

9) After the data points import

10) Drag the “Events” on the color section, then “Add all members”

11) Click OK

12) Drag the AAS on the Rows

13) SNAP_ID on the columns. That’s your AWR SNAP_IDs

14) Filter the negative values. Those are the periods where you had instance shutdown.

15) Exclude the Nulls

16) And now you can select an area! hover on it and it will give you the event and the AAS number (the higher the AAS the more database work is happening)

17) Play around with the AAS data points

18) Now here it will show you detailed SNAP_ID data points (the dots) and you can hover on it

19) You can also view the underlying data points of those dots

20) Here it shows the event, SNAP_IDs, and the AAS value.. so here you can grab the SNAP_ID and execute the AWRRPT.sql on the database server to generate the larger AWR report

21) Drilling down more on the data points

22) You can also focus just on specific events. Let’s say just the CPU..

23) or the event that’s hogging most of the resources

24) and just select the data points from that bottleneck

25) Copy the summary info

26) Then play around with the data in command line

And clearly the database bottleneck is the PX event… which could just be the surface of the problem or a symptom for another issue..
And then visually you’ll start finding trends on the workload..

let’s say on this workload I found sustained high workload periods that would last for days..and these could be the periods where the batch jobs are executed..
SNAP_ID 34241 – 34306
11/07/06 03:30 – 11/07/07 12:00SNAP_ID 34417 – 34516
11/07/10 13:30 – 11/07/12 15:00SNAP_ID 34565 – 34694
11/07/13 15:30 – 11/07/16 08:00SNAP_ID 35213 – 35322
11/07/27 03:30 – 11/07/29 10:00

SNAP_ID 35363 – 35477
11/07/30 06:30 – 11/08/01 15:30

SNAP_ID 35590 – 35670
11/08/04 00:00 – 11/08/05 16:00

And if you want to have regression analysis, you must put all of these data points on the Y and X axis (I’ll create a script for that one of these days).. that way the r2toolkit will have a very good scope of data.. and then you’ll get your nice R2 trend..
but it doesn’t stop there..
you have to drill down on the data points, use the awrrpt, addmrpt, awrsqrpt, ashrpt, and drill down on sessions with snapper, os utilities, etc. to get to the bottom of the issue
so that’s forecasting+troubleshooting!
Update: 
The screenshots above just makes sense of the SNAP_ID for the time series data. After a little more AWR data hacking I was able to make sense of the date value which is really useful because I no longer have think of SNAP_IDs across instances and getting them aligned on a particular period… and the trick here is Tableau automatically creates a time dimension for the time column of my AWR scripts.  Kewlness :)  and that way I can stitch performance data across hostnames and consolidated instances of an entire Exadata Full Rack! and I’ll show how it is done on the next couple of posts.
The graph below shows an example of having a time dimension. Below is the AWR data (shown by day and hour) coming from a non-RAC database on a SAN storage with a default AWR retention period. The arrows below are pointing to a workload period where they are  running a report which has a SQL that has access paths doing lots of multiblock IOs (db file scattered read) and that throttles the performance of the (already slow) storage causing the IO latency to shoot up. This graph was pretty useful on the storage performance troubleshooting which correlates to the numbers of OS (iostat, vmstat, and atop) and session level (snapper) tools. Now the single block latency (db file sequential read) has always been on the 20ms below range regardless of load. So fixing the SQL to favor the use of indexes (db file sequential read) made it do efficient IO, that made the SQL run from 15mins to 10secs.

 

Hope I’ve shared you some good stuff!

 

 

Posted in Oracle, Performance, Troubleshooting | Tagged: , , , | 1 Comment »

SQLTXPLAIN quick tips and tricks and DB Optimizer VST

Posted by karlarao on February 11, 2012

Lately I’ve been busy on projects and research so I’m putting more and more stuff on my wiki/braindump… and really I need to catch up on blogging.
I have a longer draft post about SQLTXPLAIN.. but I haven’t finished it yet so I’ll just go with the quick post for now.

SQLTXPLAIN:

I’ve been using SQLTXPLAIN for quite a while, and I can say I’m a really big fan. It’s a cool tool to use for systematic SQL Tuning so I got them all neatly documented here http://karlarao.tiddlyspot.com/#SQLTXPLAIN.

BTW, Carlos Sierra, the author of the tool will present at Hotsos 2012 http://www.hotsos.com/sym12/sym_speakers_sierra.html so I may drop by the conference and say hello :)

The wiki link will probably show you all the things you need to know about SQLTXPLAIN. My personal use case favorites would be the following:

  • Mining 10053 output
    • On the execution plan view of the main HTML report whenever you hover on each step of the plan it will pop up an info about the Query Block (QB) details which you can mine on the raw 10053 text file that is also on the zip file. This is helpful when troubleshooting the specific part of the SQL where it has gone wrong.
  • Instant SQL Monitor output
    • The HTML and text versions of SQL Monitor is readily available for the particular SQL_ID you are troubleshooting ;)
  • Compare good/bad plan of SQLs
    • This is pretty handy because the output report of the SQLT-compare highlights the difference in red text. So instead of chasing some delta numbers, you can just go straight to the red text!
  • Fix bad stats
    • Let’s say you have a prod and dev that have completely different run times. And you found out that the problem is the stats, you can do a SQLT on the good plan and transfer the stats to the other environment. You will be using the sqltimp.sql for this
  • Transfer stored outline/profiles
    • This makes use of exp/imp to transfer and outline + DBMS_SQLTUNE for the SQL Profile
    • Manual SQL Profile is readily available as sqlt_<sqlt_id>_sqlprof.sql or can be created using sqlt/utl/coe_xfr_sql_profile.sql
  • Reproducing an execution plan from one system to another and Creating test cases
    • I usually pull SQLT from the server I’m troubleshooting and load it on my laptop. And from there I can have the same optimizer environment then I can do anything I want with it. (well you can also do this from prod then load the SQLT to dev/QA environment)
    • I mainly use the SQLT-tc (test case builder) and that is totally different from the TCB (test case builder) which uses the DBMS_SQLDIAG.. wait, am I confusing you with the TCB terms? :)
      • Let me give you a little background

Each SQLTXPLAIN run (START sqltxtract.sql SQL_ID) packages everything in one zip file

[oracle@karl sqlt_s54491-good]$ du -sm sqlt_s54491.zip
 47 sqlt_s54491.zip

and inside that zip file includes the test case zip file that has a *_tc.zip extension (sqlt_s54491_tc.zip)
Note that this is different from the *_tcb.zip extension (sqlt_s54491_tcb.zip) which is using the DBMS_SQLDIAG TCB (test case builder) package

[oracle@karl sqlt_s54491-good]$ unzip -l sqlt_s54491.zip
 Archive: sqlt_s54491.zip
 Length Date Time Name
 --------- ---------- ----- ----
 104557695 11-27-2011 02:36 sqlt_s54491_10053_explain.trc
 2587 11-27-2011 09:18 sqlt_s54491_driver.zip
 660035 11-27-2011 02:35 sqlt_s54491_lite.html
 28711 11-27-2011 09:18 sqlt_s54491_log.zip
 34589692 11-27-2011 02:35 sqlt_s54491_main.html
 316568 11-27-2011 09:18 sqlt_s54491_opatch.zip
 40487 11-27-2011 02:35 sqlt_s54491_p73080644_sqlprof.sql
 21526 11-27-2011 02:35 sqlt_s54491_readme.html
 461504 11-27-2011 02:35 sqlt_s54491_sql_detail_active.html
 385239 11-27-2011 02:35 sqlt_s54491_sql_monitor_active.html
 424680 11-27-2011 02:35 sqlt_s54491_sql_monitor.html
 57213 11-27-2011 02:35 sqlt_s54491_sql_monitor.txt
 333117 11-27-2011 02:35 sqlt_s54491_sta_report_mem.txt
 1488 11-27-2011 02:35 sqlt_s54491_sta_script_mem.sql
 616556 11-27-2011 02:36 sqlt_s54491_tcb.zip         <-- ''DBMS_SQLDIAG TCB'' (test case builder) package zip file
 8659 11-27-2011 09:18 sqlt_s54491_tc_script.sql
 7793 11-27-2011 09:18 sqlt_s54491_tc_sql.sql
 4254816 11-27-2011 09:18 sqlt_s54491_tc.zip         <-- ''SQLT-tc test case'' zip file
 33308174 11-27-2011 09:18 sqlt_s54491_trc.zip
 --------- -------
 180076540 19 files
  • And from the wiki, you will see the related wiki topics that starts with “testcase -”. Actually, you can generate test cases using the following methods:

DB Optimizer:

Below you will see that the highlighted table name starts with the schema TC<sqlt_id>, that means it was generated from SQLT-tc and I was also able to reproduce underlying objects of the SQL_ID and the optimizer environment on my laptop.

The cool thing about DB Optimizer is that it automates a lot of tedious tasks for you. It takes away the pain of manual diagramming just like what I’m doing here http://goo.gl/VZFB that turns out just look like this http://goo.gl/ttrh

Bad Plan Patrick Star

I’m a very visual guy, and I can have a better understanding of the problem and solve things quickly (and creatively) by making use of visualization.

The highlighted table on the diagram below shows a missing join condition. This version of the SQL takes FOREVER!… here with DB Optimizer, right after it prints the VST diagram and correlating it with the “merge join cartesian” on the SQLT.. it’s pretty easy to identify the missing join (see highlighted SQL on the left) which is the root cause of the performance problem.

and a quick modification on the SQL made it run for just 10 seconds! that’s the power of SQLTXPLAIN and DB Optimizer VST  ;)

You may also want to check out this ultimate SQL tune off by Kyle Hailey and Jonathan Lewis to know more about DB Optimizer http://dboptimizer.com/2010/06/14/jonathan-lewis-kyle-hailey-webinar-replay/

And my example usage of the tool on a high load SAP OLTP environment – DB Optimizer example – 3mins to 10secs

That’s it for now, hope I’ve shared you some good stuff :)

Update: 

I attended the 2nd day of Hotsos and I was able to meet Carlos, he’s a really nice guy. Randy Johnson, Carlos, and I had a great discussion about the tools we are working on and currently developing… we were like kids on a table with a laptop showing off each others work & that includes Jamey Johnston showing his performance monitoring tool from his server at home… I was able to show Carlos how the VST of DB Optimizer works and he was impressed by the power of easily slicing and dicing on the query visually. Of course we shared a couple of SQLTXPLAIN tips and tricks and war stories where the tool was really useful. And all of these stuff over beer and good food :) that was a fun Hotsos party :)

Posted in Oracle, Performance, Troubleshooting | Tagged: , , , | 1 Comment »

 
Follow

Get every new post delivered to your Inbox.

Join 27 other followers