Last week I had a lengthy post at oracle-l that tackles Calibrate IO, Short Stroking, Stripe size, UEK kernel, and ASM redundancy effect on IOPS Exadata which you can read here
followed by interesting exchange of tweets with Kevin Closson here (see 06/21-22 tweets) which I was replying in between games at UnderwaterHockey US Nationals 2012 which we won the championship for the B division I have my awesome photo with the medal here
This post will detail on the ASM redundancy/parity effect on IOPS… if… by changing the ASM redundancy (external, normal, and high) will it decrease the workload read/write IOPS or stay as is. I’ll walk you through the step by step on how I did the instrumentation + the test case itself then end with some results and observations.
Let’s get started!
- Intel 2nd gen Sandy Bridge
- 8 x 1TB 7200 RPM SATA3 – shortstroked
- on first 320GB vbox LVM stripe (VirtualBox .vdi files)
- next 320GB ASM (DATA disk group – external redundancy)
- remaining RECO area (for backups)
- 16GB physical memory having lots of VirtualBox VMs (12 VMs can run concurrently with 0.45 load average and no CPU WAIT IO)
Follow the “Installation” and “Run the benchmark” sections on my wiki here http://karlarao.tiddlyspot.com/#%5B%5Bcpu%20-%20SillyLittleBenchmark%20-%20SLOB%5D%5D
Essentially you have to be able to instrument the different layers of things. The sample screenshot below are the terminal sessions I have while doing the test cases and it’s really helpful to have them all in a single view because you’ll be able to quickly notice changes as the workload grow and reach its sustained state. So when you run sh ./runit.sh 0 128 you’ll be able to see from load average of 2 it will gradually go up to 128 and it will stay on that range and that’s the sustained state of the workload. Then from there I’ll start to take screenshots and gather all sorts of data that will serve as my data points from that test case. So with the single view you’ll be able to see how everything behaves from the ramp-up to the sustained state and when you cancel the run or whenever you do any changes to your test case environment.
On the hardware side I’ve got collectl (http://collectl.sourceforge.net) which shows the overall workload of the OS and CPU + IO details in a time series manner (see the first 3 terminal windows above). And I usually execute the following commands for every run and package the spooled output in a test case folder
collectl --all -o T -o D >> collectl-all.txt collectl -sc --verbose -o T -o D >> collectl-cpuverbose.txt collectl -sD --verbose -o T -o D >> collectl-ioverbose.txt
The critical numbers would be the aggregate IOPS from all the 8 disks, and the detailed IOPS per disk together with the IO size, service time, and Pct% Utilization. All these numbers gives you an IO characterization of what SLOB is doing.
The OS numbers has to be correlated to the physical reads/writes (load profile) on the database side and just relying on the AWR snapshots taken at start and end of the test case is not really a good measure especially on a small scale box like mine with just 4cores and 8spindles which essentially I’m making it work for 128 concurrent users doing an update on their own tables.. and ultimately the queueing effect will kick in and the disks will be really busy that you’ll get to a point where the AWR snapshots will not be able to do its thing and may even hang or queue or it will wait for it to be able to grab some IO or until it gets an IO bandwidth to do its own operations then that’s the time it will be able to complete the AWR snapshot/report…whew.
So to get around that, you have to read those performance numbers from memory and it’s only possible by doing a query from a v$ view.. luckily there’s a script created by Timur Akhmadeev a few months ago http://timurakhmadeev.wordpress.com/2012/02/21/load-profile/ that outputs the same Load Profile section of the AWR report but pulls data from the v$sysmetric view. The view outputs the past 15 and 60 seconds of a bunch of metrics which is similar to V$SYSSTAT but some metrics are already transformed into per/sec or per/transaction numbers. The one that Timur used is the 60secs time slice which is equivalent to creating 1min AWR reports which is good enough since the workload is sustained and the numbers will most likely not vary that much plus the Load Profile section has all the stuff that I need to see what’s happening on the physical reads/writes because SLOB is essentially a test case to guage your reads/writes IOPS.
I’ve modified the SQL https://www.dropbox.com/s/3ijoy3zljqps0wx/loadprof.sql to add the “tm” column so I can correlate it with the time series output of collectl. The sample output is shown on the left side of the image above. You can download the script and put it on the SLOB directory and run it as
while : ; do sqlplus "/ as sysdba" @loadprof ; echo "--"; sleep 2 ; done
I also want to see how the workload looks like in ASH.. so I’ve got a script called ash that calls a SQL script by Kyle Hailey which gives you the Average Active Sessions (AAS) for the past 5 seconds time slice.. so if let’s say I’m doing a sustained number of 128 readers I should be able to see 128 AAS doing “db file sequential read” (random IO).. so rather than pulling up an OEM or creating a repo for OEM just to see what’s happening on the SLOB run I can just go with this text graph or text version of what I’d like to see..
The sample output is shown on the right hand side of the image, see the FASL column having the 128 AAS on “db file sequential read”.. also the time column at the far left can be correlated to the collectl and loadprof.sql output. The script is available here
Another important thing is measuring the latency… I could have written a SQL that will mimic the top 5 timed events pulling from v$ views that also shows the latency column but for now I don’t really have the time to do that. So a very good alternative is to do it with the “awesome snapper” then spool it to a file and grep for stuff… and because the SLOB is doing a sustained reads/writes workload the numbers will most likely stay the same or within the range of that sustained number and with that I can just do 5 samples or 5 runs of snapper and that will tell me what my latency is.
Good thing about using the snapper is the latency is being measured from the SLOB users themselves and that’s what the sessions are experiencing for each IO operation that they are doing. Also take note that with the snapper command below I can grep for DBWR, CKPT, LGWR processes which I can’t do in the usual top 5 timed events..which I was able to build a cool comparison of latency graphs across redundancy levels (more on this on the results section). And each snapper sample has a timestamp that you can correlate with the collectl, loadprof, and ash output. Below are the commands I used:
spool snapper.txt @snapper ash=event+wait_class,stats,gather=tsw,tinclude=CPU,sinclude=redo|reads|writes 5 5 "select sid from v$session where username like 'USER%' or program like '%DBW%' or program like '%CKP%' or program like '%LGW%'" spool off cat snapper.txt | grep write | egrep "DBW|CKP|LGW" | grep WAIT | sort -rnk2 cat snapper.txt | grep "db file sequential read" cat snapper.txt | grep "free buffer waits" cat snapper.txt | grep "write complete waits"
Check out the script here http://tech.e2sn.com/oracle-scripts-and-tools/session-snapper
The test case
Before the test case I executed a backup of the SLOB database then I make use of that backup as a starting point of each iteration which makes the underlying data consistent across runs. Then for each ASM redundancy I did the following:
1) Wipe out the diskgroup
sqlplus "/ as sysasm" alter diskgroup data dismount force; drop diskgroup data force including contents;
2) Create the new diskgroup (1st iteration – normal, 2nd iteration – high, 3rd iteration – external)
3) Restore the backup
4) Execute 128 readers and gather perf data
5) Execute 128 writers and gather perf data
6) Go back to step1
>>128 Readers Testcase
>>128 Writers Testcase
- Shown on the left is the IOPS comparison across redundancy
- on the right is the read:write ratio
- Shown on the left is the IOPS comparison across redundancy
- on the right is the read:write ratio
- the graph at the bottom is just the “write iops” part of the graph above.. since SLOB is executing a massive UPDATE SQL with some filter predicates and that causes scan on some rows which translates to “read iops”
>>Read Latency (ms)
- Here are the graph definitions:
- ckpt -> CKPT process – “control file parallel write” latency
- lgwr -> LGWR process – “log file parallel write” latency
- dbwr -> DBWR process – “db file parallel write” latency
- phyreads -> SLOB users – “db file sequential read” latency
- fbw -> SLOB users – “free buffer waits” latency
- wcw -> SLOB users – “write complete waits” latency
>>Write Latency (ms)
- (see graph definitions above)
>>Some insights on the numbers above…
First of all.. I can say I did a really good job on documenting all these stuff ;) the results are in 6 parts
- part 1 >>128 Readers Testcase
- part 2 >>128 Writers Testcase
- part 3 >>Read IOPS
- part 4 >>Write IOPS
- part 5 >>Read Latency (ms)
- part 6 >>Write Latency (ms)
but as you can see at a high level as I change redundancy the read IOPS stayed at the range of 2400+ IOPS while on the writes as I move to “normal” redundancy it went down to half and “high” redundancy it went down to 1/3 and if you are a big SLOB fan, the first two parts of the results will be gold mine for you because that’s more of a geeky output and I essentially pulled the numbers from those raw screenshots to build the graphs.
So let’s move ahead on explaining the charts…
On the Read IOPS and latency I don’t see any huge difference and I can say it’s not being affected.. the IOPS stayed at the range of 2400+ IOPS then the latency at the range of 900ms across redundancy levels. It clearly shows that you don’t have a raid penalty or parity penalty to sacrifice for reads.
On the Write IOPS and latency
- it’s pretty clear that on normal you lose half and on high you only get 1/3 of the hardware IOPS
- And what’s interesting about this is when I was watching the workload of normal and high I can see an increasing number of AAS (average active sessions) spent on “free buffer waits” and “write complete waits”… well I’m seeing them on external redundancy but not as much.
- Actually, as we change redundancy we are lowering the IOPS capacity which we sacrifice for the redundancy/parity and still streaming the same amount of workload and these wait events are just signs of inadequate bandwidth. The DBWR is working harder to make free buffers available for the same workload with lesser IOPS to work on and that translates to higher latencies.
- You will also notice on the raw screenshots that the hardware write IOPS numbers from collectl stayed at around 1000+ at full throttle while on the database load profile you’ll see that it’s just around 300 (high redundancy) which is the effect of the ASM parity.
>>How can this affect your Storage Capacity Planning?
I’ll pull a conversation I had with a really good friend of mine.. his question was ”Quick question on your AWR mining script AWR-gen-wl…, is IOPs calculated before or after ASM mirroring? For example on Exadata, if I see 10,000 write IOPs, did the cells do 10k or did they do 20k (normal redundancy)?”
and here’s my response..
The script awr_iowl.sql and awr_iowlexa.sql have the columns that accounts for RAID1.. that is read penalty of 1 and write penalty of 2.
read on the section “the iops raid penalty” on this link http://www.zdnetasia.com/calculate-iops-in-a-storage-array-62061792.htm and the “real life examples” on this link http://www.yellow-bricks.com/2009/12/23/iops/
so those computations should also apply for Exadata since normal redundancy is essentially RAID1 that’s write penalty of 2, and the high redundancy is penalty of 3.
Now I remember this sizing exercise I had with an EMC engineer on a project bid before https://www.evernote.com/shard/s48/sh/03602b99-3274-4c64-b5d1-bbe7bd961f8d/95be02ccf9aa75cf863bb19115353eb0
and that’s why I created those columns to get the data directly from AWR.. so for every snapshot you’ve got the “hardware iops needed” and “number of disks needed”, what’s good about that is as your workload vary those two numbers will be representative of that workload. So since you have a lot of data samples, I usually make a histogram on those two columns and get the top percentile numbers because most likely those are the peak pereiods and I can investigate on it by drilling down on the snap_ids and looking into the SQLs and validating it to the app owners as to what’s the application is running at that time.
I’ve attached an excel sheet which you can just plug the total workload iops on the yellow box. So in your case, let’s say you have 10K workload IOPS with 50:50 read/write ratio… that’s equivalent to 15K hardware IOPS for normal redundancy and 20K hardware IOPS for high redundancy.
the excel screenshot is actually here —-> https://lh6.googleusercontent.com/-00PkzwfwnOE/T-N0oo2Q-FI/AAAAAAAABqE/EbTOnHBlpmQ/s2048/20120621_IOPS.png
Note that I’m particular with the words “workload IOPS” and “hardware IOPS”
so on this statement
“if I see 10,000 write IOPs, did the cells do 10k or did they do 20k (normal redundancy)?” <– if this 10,000 is what you pulled from the AWR then it’s the database that did the 10K IOPS so that’s the “workload IOPS”.. and that’s essentially your “IO workload requirements”.
Then let’s say you haven’t migrated to Exadata.. you have to take into account the penalty computation shown above.. so you’ll arrive with 15000 “hardware IOPS” needed (normal redundancy).. and say each disk has IOPS of 180 then you need at least 83 disks so that’s 83disks / 12 disks each cell = 6.9 storage cells … and that’s Half Rack Exadata. But looking at the data sheet https://www.dropbox.com/s/ltvr7caysvfmvkr/dbmachine-x2-2-datasheet-175280.pdf it seems like you can fit the 15000 on a quarter rack (because of the flash).. mmm.. well I’m not pretty confident with that because if let’s say 50% of 15000 IOPS are writes (7.5K IOPS) then I would investigate on the IOPS write mix if most of it are DBWR related (v$sysstat.physical write IO requests) or LGWR related (v$sysstat.redo writes) and if most of it are DBWR related then I don’t think you’ll ever benefit from the smart flash log. So I would still go with the Half (12500 disk IOPS) or Full Rack (25000 disk IOPS) for my “hardware IOPS” capacity. And I’ll also take into consideration the MB/s needed for that database but that should be augmented by the flash cache.
In addition to my response above.. Also be aware that Smart Flash Log does not lower the HDD (HardDisk) touch load…both sides of the write go to flash and HDD concurrently. So you should not be too optimistic about the Smart Flash Log when sizing Exadata for “write IOPS” because you can’t predict who will always win on the fastest-write-race…so as much as possible base your write IOPS sizing on HardDisk IOPS. But definitely on the read IOPS even small (OLTP) or large IOPS (DSS) the flash cache will be able to augments those HDD numbers.
So with my test case shown here.. the statement on the reply above is true. The ASM redundancy/parity affects the “workload write IOPS” number (be it small-OLTP or large-DSS IOPS).. but not for the “workload read IOPS”. This behavior is the same even on a regular SAN environment… which you have to be careful/aware when sizing storage. So now you know that 50K IOPS in any data sheet is not equal to 50K write IOPS.. factor in the redundancy and always remember Utilization=Requirements/Capacity no matter what the type workload or platform is as long as as the total workload IOPS/MBs requirements is well within the total hardware IOPS/MBs capacity you are up a notch on not seeing any IO bottlenecks.
Hope I’ve shared you some good stuff