| Last week I was able to attend a 3 day exadata workshop conducted by performance engineers from Singapore. It was supposed to be a 5 day workshop with hands on labs but everything was squeezed (unfortunately no hands on) and just the critical sections were tackled.. The workshop highlight was the demos and these are as follows: - loading large amount of data from an external table - cell offloading on a typical datawarehouse query - creation of a 400gb tablespace - linear scalability of storage cells Although all of these demos were done on “Exadata Version 1“.. and just 13 of the storage servers were running..I can say that the performance is still stunning!!!
But the real highligth for me is having my scripts run on the database machine BTW, to get the scripts.. check out the Scripts Section for the file name and the most recent version of the script.. IMPORTANT NOTE: Diagnostic Pack License is needed for the scripts |
![]() |
All the time they were doing demos they were just showing some graphed output of V$SYSSTAT statistics related to Exadata cell effectiveness and a customized output of collectl showing the storage servers and the RAC nodes stacked together on a putty session and from there we can see the IO, CPU, network statistics as they were hammering the full rack Exadata..
So I was also really curious about seeing the database performance numbers.. I’ve asked if we can see the Enterprise Manager as they were doing the demos.. Unfortunately it was not available..
With the spooled output of my scripts (executed on the node where the demo was done).. I was able to see the following across SNAP_IDs:
- Load (awr_genwl.sql)
- Top Timed Events (awr_topevents.sql)
- Tablespace and File IO statistics (awr_io_ts.sql, awr_io_file.sql)
- Top SQLs (awr_topsql.sql, awr_topsqlx.sql)
Below are the output:
Load:
I was interested on the period when the demo was run (May 20th afternoon)… the time that I saw the storage servers being hammered and giving 1 GB/s throughput on each. I’d like to see CPU and IO numbers on the database side plus the AAS (Average Active Sessions) and other metrics.. more like characterizing the workload.. you can see on SNAP_ID 1211 that there was a sudden surge on DB Time, IO read MB/s, and CPU utilization..
I was surprised to see that the utilization numbers (rightmost) are very low even for the whole AWR retention period (I did not include here), and take note that this is the output from one node only so I can’t really conclude what’s happening to the entire cluster..
AWR CPU and IO Workload Report
i *** *** ***
n Total Total Total U S
Snap s Snap C CPU A Oracle OS Physical Oracle RMAN OS S Y I
Snap Start t Dur P Time DB DB Bg RMAN A CPU OS CPU Memory IOPs IOPs IOPs IO r IO w Redo Exec CPU CPU CPU R S O
ID Time # (m) U (s) Time CPU CPU CPU S (s) Load (s) (mb) r w redo (mb)/s (mb)/s (mb)/s Sess /s % % % % % %
------ --------------- --- ---------- --- ----------- ---------- --------- --------- -------- ----- ----------- ------- ----------- ---------- --------- --------- --------- --------- --------- --------- ---- --------- ------ ---- ---- ---- ---- ----
1208 10/05/20 08:00 2 60.25 8 28920.00 24.08 8.43 10.02 0.00 0.0 18.44 0.29 334.80 32189.62 0.000 0.082 0.050 0.000 0.001 0.000 40 1.549 0 0 1 1 0 0
1209 10/05/20 09:00 2 59.89 8 28747.20 21.80 8.43 10.29 0.00 0.0 18.72 0.40 320.49 32189.62 0.003 0.078 0.034 0.000 0.002 0.000 40 1.490 0 0 1 1 0 0
1210 10/05/20 10:00 2 59.87 8 28737.60 24.86 9.43 9.78 0.00 0.0 19.21 0.03 318.18 32189.62 0.006 0.080 0.039 0.000 0.002 0.000 40 1.539 0 0 1 1 0 0
1211 10/05/20 11:00 2 60.00 8 28800.00 3211.59 1732.40 19.19 0.00 0.9 1751.59 0.39 2157.32 32189.62 56.584 4.423 0.453 55.847 0.477 0.008 41 3.404 6 0 7 7 0 0
1212 10/05/20 12:00 2 60.02 8 28809.60 353.16 42.58 12.79 0.00 0.1 55.37 0.17 420.83 32189.62 10.901 0.210 0.145 10.734 0.004 0.001 49 2.263 0 0 1 1 0 0
1213 10/05/20 13:00 2 60.25 8 28920.00 1415.78 465.06 18.32 0.00 0.4 483.38 0.37 839.79 32189.62 0.079 8.471 0.478 0.001 1.321 0.004 42 3.358 2 0 3 2 0 0
1214 10/05/20 14:00 2 59.80 8 28704.00 20.56 8.60 10.49 0.00 0.0 19.09 0.03 318.41 32189.62 0.012 0.127 0.052 0.000 0.002 0.001 42 1.780 0 0 1 1 0 0
Top Timed Events:
Drilling down on the Top Events on the SNAP_ID 1211, I’m wondering if I’ll be seeing something different… you can see that half of the DB Time is attributed to “CPU time” but with just an AAS of .5 which means there’s still a lot of CPU available on the RAC node and the database is probably not blocked. You could also see that there’s some external table read going on (hmm that could be the 1st demo), and notice the event “cell smart table scan”..
AWR Top Events Report
i
n
Snap s Snap A
Snap Start t Dur Event Time Avgwt DB Time A
ID Time # (m) Event Rank Waits (s) (ms) % S Wait Class
------ --------------- --- ---------- ---------------------------------------- ----- -------------- -------------- -------- ------- ------ ---------------
1208 10/05/20 08:00 2 60.25 reliable message 1 1674.00 8.80 5.26 37 0.0 Other
1208 10/05/20 08:00 2 60.25 CPU time 2 0.00 8.43 0.00 35 0.0 CPU
1208 10/05/20 08:00 2 60.25 PX Deq: Slave Session Stats 3 3150.00 8.23 2.61 34 0.0 Other
1208 10/05/20 08:00 2 60.25 control file sequential read 4 11400.00 7.12 0.62 30 0.0 System I/O
1208 10/05/20 08:00 2 60.25 enq: PS - contention 5 8342.00 2.66 0.32 11 0.0 Other
1209 10/05/20 09:00 2 59.89 CPU time 1 0.00 8.43 0.00 39 0.0 CPU
1209 10/05/20 09:00 2 59.89 PX Deq: Slave Session Stats 2 3130.00 7.27 2.32 33 0.0 Other
1209 10/05/20 09:00 2 59.89 control file sequential read 3 11330.00 6.68 0.59 31 0.0 System I/O
1209 10/05/20 09:00 2 59.89 reliable message 4 1608.00 4.14 2.58 19 0.0 Other
1209 10/05/20 09:00 2 59.89 enq: PS - contention 5 8347.00 2.38 0.28 11 0.0 Other
1210 10/05/20 10:00 2 59.87 CPU time 1 0.00 9.43 0.00 38 0.0 CPU
1210 10/05/20 10:00 2 59.87 control file sequential read 2 11314.00 7.98 0.71 32 0.0 System I/O
1210 10/05/20 10:00 2 59.87 PX Deq: Slave Session Stats 3 3305.00 7.83 2.37 31 0.0 Other
1210 10/05/20 10:00 2 59.87 reliable message 4 1654.00 5.81 3.51 23 0.0 Other
1210 10/05/20 10:00 2 59.87 enq: PS - contention 5 8520.00 1.98 0.23 8 0.0 Other
1211 10/05/20 11:00 2 60.00 CPU time 1 0.00 1732.40 0.00 54 0.5 CPU
1211 10/05/20 11:00 2 60.00 external table read 2 12567.00 410.76 32.69 13 0.1 User I/O
1211 10/05/20 11:00 2 60.00 cell smart table scan 3 86393.00 238.88 2.77 7 0.1 User I/O
1211 10/05/20 11:00 2 60.00 gc buffer busy acquire 4 4137.00 196.75 47.56 6 0.1 Cluster
1211 10/05/20 11:00 2 60.00 external table misc IO 5 119.00 119.18 1001.48 4 0.0 User I/O
1212 10/05/20 12:00 2 60.02 cell smart table scan 1 33606.00 197.91 5.89 56 0.1 User I/O
1212 10/05/20 12:00 2 60.02 cell smart file creation 2 92753.00 84.68 0.91 24 0.0 User I/O
1212 10/05/20 12:00 2 60.02 ASM file metadata operation 3 830.00 80.80 97.35 23 0.0 Other
1212 10/05/20 12:00 2 60.02 control file sequential read 4 11440.00 56.00 4.90 16 0.0 System I/O
1212 10/05/20 12:00 2 60.02 CPU time 5 0.00 42.58 0.00 12 0.0 CPU
1213 10/05/20 13:00 2 60.25 CPU time 1 0.00 465.06 0.00 33 0.1 CPU
1213 10/05/20 13:00 2 60.25 gc buffer busy acquire 2 15209.00 379.47 24.95 27 0.1 Cluster
1213 10/05/20 13:00 2 60.25 external table read 3 5706.00 211.29 37.03 15 0.1 User I/O
1213 10/05/20 13:00 2 60.25 gc current block busy 4 1091.00 94.23 86.37 7 0.0 Cluster
1213 10/05/20 13:00 2 60.25 external table misc IO 5 48.00 69.21 1441.82 5 0.0 User I/O
1214 10/05/20 14:00 2 59.80 CPU time 1 0.00 8.60 0.00 42 0.0 CPU
1214 10/05/20 14:00 2 59.80 PX Deq: Slave Session Stats 2 3159.00 7.09 2.24 34 0.0 Other
1214 10/05/20 14:00 2 59.80 control file sequential read 3 11363.00 6.35 0.56 31 0.0 System I/O
1214 10/05/20 14:00 2 59.80 reliable message 4 1706.00 4.86 2.85 24 0.0 Other
1214 10/05/20 14:00 2 59.80 enq: PS - contention 5 8312.00 2.23 0.27 11 0.0 Other
Tablespace IO statistics:
Having seen the 1 GB/s throughput on each storage server I’d like to know how it is with the tablespace and datafile latency.. As you can see below there are high reads on tablespace DG14D7 and writes on DG14D1
AWR Tablespace IO Report
i
n
Snap s Snap
Snap Start t Dur IO IOPS IOPS IOPS
ID Time # (m) TS Rank Reads Av Reads/s Av Rd(ms) Av Blks/Rd Writes Av Writes/s Buffer Waits Av Buf Wt(ms) Total IO R+W Total R+W
------ --------------- --- ---------- -------------------- ---- -------- ---------- --------- ---------- -------- ----------- ------------ ------------- ------------ ---------
1208 10/05/20 08:00 2 60.25 SYSAUX 1 0 0 0.0 0.0 203 0 0 0.0 203 0
1208 10/05/20 08:00 2 60.25 UNDOTBS2 2 0 0 0.0 0.0 90 0 0 0.0 90 0
1208 10/05/20 08:00 2 60.25 SYSTEM 3 0 0 0.0 0.0 2 0 0 0.0 2 0
1209 10/05/20 09:00 2 59.89 SYSAUX 1 7 0 1.4 1.0 185 0 0 0.0 192 0
1209 10/05/20 09:00 2 59.89 UNDOTBS2 2 0 0 0.0 0.0 91 0 0 0.0 91 0
1209 10/05/20 09:00 2 59.89 SYSTEM 3 1 0 0.0 1.0 5 0 0 0.0 6 0
1210 10/05/20 10:00 2 59.87 SYSAUX 1 20 0 2.5 1.0 194 0 0 0.0 214 0
1210 10/05/20 10:00 2 59.87 UNDOTBS2 2 0 0 0.0 0.0 88 0 0 0.0 88 0
1210 10/05/20 10:00 2 59.87 SYSTEM 3 0 0 0.0 0.0 4 0 0 0.0 4 0
1211 10/05/20 11:00 2 60.00 DG14D7 1 112984 31 21.0 63.5 0 0 0 0.0 112984 31
1211 10/05/20 11:00 2 60.00 DG14D1 2 382 0 5.8 1.0 15317 4 2257 37.0 15699 4
1211 10/05/20 11:00 2 60.00 UNDOTBS2 3 0 0 0.0 0.0 263 0 44 0.0 263 0
1211 10/05/20 11:00 2 60.00 SYSAUX 4 12 0 1.7 1.0 239 0 0 0.0 251 0
1211 10/05/20 11:00 2 60.00 SYSTEM 5 44 0 2.0 1.0 101 0 1160 121.1 145 0
1212 10/05/20 12:00 2 60.02 SYSAUX 1 20 0 60.0 1.0 298 0 0 0.0 318 0
1212 10/05/20 12:00 2 60.02 UNDOTBS2 2 0 0 0.0 0.0 176 0 0 0.0 176 0
1212 10/05/20 12:00 2 60.02 SYSTEM 3 41 0 3.9 1.0 27 0 0 0.0 68 0
1213 10/05/20 13:00 2 60.25 DG14D1 1 259 0 3.7 1.0 30061 8 1068 29.0 30320 8
1213 10/05/20 13:00 2 60.25 SYSAUX 2 9 0 2.2 1.0 265 0 0 0.0 274 0
1213 10/05/20 13:00 2 60.25 UNDOTBS2 3 0 0 0.0 0.0 249 0 3 10.0 249 0
1213 10/05/20 13:00 2 60.25 SYSTEM 4 17 0 222.9 1.0 45 0 1881 192.4 62 0
1214 10/05/20 14:00 2 59.80 SYSAUX 1 44 0 4.1 1.0 285 0 0 0.0 329 0
1214 10/05/20 14:00 2 59.80 UNDOTBS2 2 0 0 0.0 0.0 160 0 1 0.0 160 0
1214 10/05/20 14:00 2 59.80 SYSTEM 3 1 0 20.0 1.0 10 0 0 0.0 11 0
Datafile IO statistics:
Below are the datafiles with read/write activity across SNAP_IDs… You may notice that tablespaces DG14D7 and DG14D1 on disk group +DISKVOL1 contains only user data while +DISKVOL3 contains the SYSTEM, SYSAUX, UNDO tablespaces
AWR File IO Report
i
n
Snap s Snap
Snap Start t Dur IO IOPS IOPS IOPS
ID Time # (m) TS File# Filename Rank Reads Av Reads/s Av Rd(ms) Av Blks/Rd Writes Av Writes/s Buffer Waits Av Buf Wt(ms) Total IO R+W Total R+W
------ --------------- --- ---------- -------------------- ----- ------------------------------------------------------------ ---- -------- ---------- --------- ---------- -------- ----------- ------------ ------------- ------------ ---------
1208 10/05/20 08:00 2 60.25 SYSAUX 2 +DISKVOL3/rtldb/datafile/sysaux.299.714681829 1 0 0 203 0 0 0.0 203 0
1208 10/05/20 08:00 2 60.25 UNDOTBS2 4 +DISKVOL3/rtldb/datafile/undotbs2.260.714681979 2 0 0 90 0 0 0.0 90 0
1208 10/05/20 08:00 2 60.25 SYSTEM 1 +DISKVOL3/rtldb/datafile/system.300.714681827 3 0 0 2 0 0 0.0 2 0
1209 10/05/20 09:00 2 59.89 SYSAUX 2 +DISKVOL3/rtldb/datafile/sysaux.299.714681829 1 7 0 1.4 1.0 185 0 0 0.0 192 0
1209 10/05/20 09:00 2 59.89 UNDOTBS2 4 +DISKVOL3/rtldb/datafile/undotbs2.260.714681979 2 0 0 91 0 0 0.0 91 0
1209 10/05/20 09:00 2 59.89 SYSTEM 1 +DISKVOL3/rtldb/datafile/system.300.714681827 3 1 0 0.0 1.0 5 0 0 0.0 6 0
1210 10/05/20 10:00 2 59.87 SYSAUX 2 +DISKVOL3/rtldb/datafile/sysaux.299.714681829 1 20 0 2.5 1.0 194 0 0 0.0 214 0
1210 10/05/20 10:00 2 59.87 UNDOTBS2 4 +DISKVOL3/rtldb/datafile/undotbs2.260.714681979 2 0 0 88 0 0 0.0 88 0
1210 10/05/20 10:00 2 59.87 SYSTEM 1 +DISKVOL3/rtldb/datafile/system.300.714681827 3 0 0 4 0 0 0.0 4 0
1211 10/05/20 11:00 2 60.00 DG14D7 15 +DISKVOL1/rtldb/datafile/dg14d7.501.714759895 1 112984 187 21.0 63.5 0 0 0 0.0 112984 31
1211 10/05/20 11:00 2 60.00 DG14D1 13 +DISKVOL1/rtldb/datafile/dg14d1.500.714758119 2 382 1 5.8 1.0 15317 25 2257 37.0 15699 4
1211 10/05/20 11:00 2 60.00 UNDOTBS2 4 +DISKVOL3/rtldb/datafile/undotbs2.260.714681979 3 0 0 263 0 44 0.0 263 0
1211 10/05/20 11:00 2 60.00 SYSAUX 2 +DISKVOL3/rtldb/datafile/sysaux.299.714681829 4 12 0 1.7 1.0 239 0 0 0.0 251 0
1211 10/05/20 11:00 2 60.00 SYSTEM 1 +DISKVOL3/rtldb/datafile/system.300.714681827 5 44 0 2.0 1.0 101 0 1160 121.1 145 0
1212 10/05/20 12:00 2 60.02 SYSAUX 2 +DISKVOL3/rtldb/datafile/sysaux.299.714681829 1 20 0 60.0 1.0 298 0 0 0.0 318 0
1212 10/05/20 12:00 2 60.02 UNDOTBS2 4 +DISKVOL3/rtldb/datafile/undotbs2.260.714681979 2 0 0 176 0 0 0.0 176 0
1212 10/05/20 12:00 2 60.02 SYSTEM 1 +DISKVOL3/rtldb/datafile/system.300.714681827 3 41 0 3.9 1.0 27 0 0 0.0 68 0
1213 10/05/20 13:00 2 60.25 DG14D1 13 +DISKVOL1/rtldb/datafile/dg14d1.500.714758119 1 259 0 3.7 1.0 30061 50 1068 29.0 30320 8
1213 10/05/20 13:00 2 60.25 SYSAUX 2 +DISKVOL3/rtldb/datafile/sysaux.299.714681829 2 9 0 2.2 1.0 265 0 0 0.0 274 0
1213 10/05/20 13:00 2 60.25 UNDOTBS2 4 +DISKVOL3/rtldb/datafile/undotbs2.260.714681979 3 0 0 249 0 3 10.0 249 0
1213 10/05/20 13:00 2 60.25 SYSTEM 1 +DISKVOL3/rtldb/datafile/system.300.714681827 4 17 0 222.9 1.0 45 0 1881 192.4 62 0
1214 10/05/20 14:00 2 59.80 SYSAUX 2 +DISKVOL3/rtldb/datafile/sysaux.299.714681829 1 44 0 4.1 1.0 285 0 0 0.0 329 0
1214 10/05/20 14:00 2 59.80 UNDOTBS2 4 +DISKVOL3/rtldb/datafile/undotbs2.260.714681979 2 0 0 160 0 1 0.0 160 0
1214 10/05/20 14:00 2 59.80 SYSTEM 1 +DISKVOL3/rtldb/datafile/system.300.714681827 3 1 0 20.0 1.0 10 0 0 0.0 11 0
Top SQLs:
I’d like to know the top SQLs running on the server, to my surprise as I was investigating on the output… the number 1 SQL was the massive data load executed on the demo (SNAP_ID 1211)…
This direct-path load (INSERT /*+ APPEND */) was using an external table with flat files (multiple small files in similar size with total size roughly around 50GB) staged on a DBFS (database filesystem) and loading it to a compressed table… This is very similar to the Data Warehouse data loading best practice mentioned here
What’s interesting about this it was able to process the 199million rows in just 27mins! and take note of the low AAS (0.39) doing this operation… and it would be nice to see the SQL Monitor Report for this execution as well as the output of my AWR scripts on other RAC nodes.. and yes the final “select count(*)…” on the target table…
BTW, I’ve briefly explained the output of the AWR Top SQL script on this reply to Oralce-l
AWR Top SQL Report
i Ela
n Time
Snap s Snap Plan Ela per CPU IO App Ccr Cluster PX A
Snap Start t Dur SQL Hash Time exec Time Wait Wait Wait Wait Direct Parse Server A Time SQL
ID Time # (m) ID Value Module (s) (s) (s) (s) (s) (s) (s) LIO PIO Writes Rows Exec Count Exec S Rank Text
------ --------------- --- ------- --------------- ------------ -------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ------------ ------------ ------------ ------------ -------- ---------- -------- ------- ---- ------
1211 10/05/20 11:00 2 60.00 gz90qgc3f72fa 1967845001 SQL*Plus 1411.13 1411.13 565.92 305.14 0.00 64.05 347.28 536400 93 99690 199553620 1 17 16 0.39 1 INSERT
1213 10/05/20 13:00 2 60.25 gz90qgc3f72fa 1967845001 SQL*Plus 1359.58 1359.58 443.98 291.29 0.00 16.64 518.73 466179 17 292337 198131707 1 17 16 0.38 1 INSERT
1211 10/05/20 11:00 2 60.00 4tb00t5k8jc41 3807662380 SQL*Plus 1306.22 326.56 999.78 307.32 0.00 0.00 0.00 13140180 13107338 0 3 4 20 12 0.36 2 select
1050 10/05/13 18:00 2 59.94 g2cd0wdckd7rn 3807662380 SQL*Plus 785.48 575.09 210.45 0.00 0.00 0.00 7226484 7208526 0 0 0 8 8 0.22 1 select
1049 10/05/13 17:00 2 59.93 g2cd0wdckd7rn 3807662380 SQL*Plus 785.42 577.15 208.40 0.00 0.00 0.00 7191030 7173160 0 0 0 8 8 0.22 1 select
1213 10/05/20 13:00 2 60.25 0kkhhb2w93cx0 2170058777 469.05 0.29 3.21 0.00 0.00 11.60 452.32 32076 0 0 1610 1610 1609 0 0.13 2 update
1211 10/05/20 11:00 2 60.00 2nszgcfgcgvts 666733144 SQL*Plus 418.08 418.08 151.28 233.01 0.00 0.03 0.00 1782 17 0 1 1 17 16 0.12 3 select
1211 10/05/20 11:00 2 60.00 0kkhhb2w93cx0 2170058777 215.14 0.17 1.37 0.00 0.00 4.70 208.65 12228 0 0 1263 1263 1263 0 0.06 4 update
1212 10/05/20 12:00 2 60.02 4tb00t5k8jc41 597344198 SQL*Plus 125.51 125.51 1.75 123.71 0.00 0.00 0.00 519885 518076 0 1 1 5 4 0.03 1 select
Well this is really different from what Greg Rahn has done on the Exadata V2 loading 7.8 billion rows in 275 seconds! hmmm because of a whole lot of factors.. bigger bandwidth, faster CPUs, the Sun machines, new cool stuff on the V2, compression, table definition, # of columns, data type, etc. etc. etc. etc.
But this is astounding compared to the data load rate I’m seeing on other Data Warehouse environments…
And this is where the importance of having a balanced configuration kicks in.. On some environments I see storage saturation (worst is seeing the disks 100% busy with high service times and queue) and low storage bandwidth causing low data load rate and longer response times affecting the whole ETL process… as an example you can see from the figures below that even if you have the right amount of CPU (16 powerful CPUs) but if the IO subsystem can’t cope with the application requirements you clearly have a bottleneck.. This is a linear regression I made on IO read and write MB/s on Direct Path Read and Large Writes respectively in relation to the scheduled ETL run times and daily workload patterns.
| IO read | IO write |
|---|---|
![]() |
![]() |
Having seen how Exadata behaves and learning about the architecture, administration, IORM (IO Resource Management), monitoring, and migration paths… although lacking hands-on experience and there are more things I still have to learn! I can say that it’s no longer a black box… it has already been demystified.. ;) I have to say.. hands-down to the Exadata Storage Server Software without it all the storage servers will just behave as a traditional storage server and it’s the cool stuff that makes each of them able to run on its full speed given the bandwidth limits..
Also there are some stuff I am curious about in Exadata V2 or V1…
1) Having most of the work done underneath, I wonder what’s the saturation point of the Storage Servers and when will be the trigger that I have to add another storage server or Exadata rack..
2) And how is it going to be if applied with IORM (IO Resource Management) and production and test databases are sharing Exadata Storage Servers… more like large scale consolidation of databases… hmm..
3) I wonder how this will change the capacity planning space.. because you now have Storage Servers that could work smart and work hard making the difference for the leap in performance.. hmm..



Excellent blog about Exadata Machine.
Asvish,
Thanks for dropping by, I hope you find the scripts useful as well
As the author of collectl I’m always interested in performance and one thing I always find fascinating is the correlation of activities across a system. When I look at some of your output, which only seems to sample at once an hour, I wonder about what’s happening to other system resources at the times you highlighted. Perhaps the real question is when you report a disk load, what is the sample time? is it the snap time of 1 hour?Experience shows you have to get pretty fine grained for the numbers to be meaningful, closer to the 5-15 second range. The other thing is it can also be very useful to see those same resources prior to and after the times you’re showing as single point in time may not show what’s really going on.
Are the disks running at their maximums and might there be an opportunity for increasing throughput by configuration changes? Did the increase in disk rates exactly correlate with queries? What’s happening to memory utilization, particularly slabs, since they can fluctuate quite a lot under heavy load. Or what about memory fragmentation as reported by /proc/buddy? Lots of other things.
A couple of thoughts come to mind that may or may not make sense, but if could plot your data against collectl’s, some interesting things might show up. But the question of HOW to do that can get complicated. You can certainly load collectl’s data into a spreadsheet and plot it there, but it’s kind of hard work. On the other hand if you could append you data to collectl’s you could define some additional plots using colplot.
Not being an oracle person I’m not really sure what’s possible and what isn’t, but if you can get any of these stats in real time, you can relatively easily write a custom module that will allow collectl to read that data directly and display [or even plot] that data along with all of collectl’s other data natively. If you’re a perl programmer and are interested I could probably walk you through it. But as I said I don’t know if it’s worth the effort of not.
-mark
Hi Mark,
Yes the snap interval is 1hour and I can say that the latency numbers could be pretty bad when we drill down on 5sec intervals and it is very likely that the devices are running at their maximum..
Sorry to reply just now.. I’ve been thingking about this for quite a while. I made a few test cases and will blog my observations and what I’ve learned from it..
The initial post is here:
http://karlarao.wordpress.com/2010/07/05/oracle-datafile-io-latency-part-1/
All the best..
And Mark knows exactly how his stuff got so popular within Exadata Engineering :-
Pingback: Oracle datafile IO latency – Part 1 « Karl Arao's Blog
Pingback: Oracle Clinic
Storage can be shared by multiple types of workloads and multiple databases
Sharing lowers administration costs
Sharing leads to more efficient usage of storage
But, workloads may not happily coexist
ETL jobs interfere with DSS query performance
One production data warehouse can interfere with another
Hi Maclean,
I’ve seen the presentation on your blog, thanks for sharing… that was good stuff