The effect of ASM redundancy/parity on read/write IOPS – SLOB test case! for Exadata and non-Exa environments

Last week I had a lengthy post on oracle-l that tackled Calibrate IO, short stroking, stripe size, the UEK kernel, and the ASM redundancy effect on IOPS on Exadata, which you can read here,
followed by an interesting exchange of tweets with Kevin Closson here (see the 06/21-22 tweets), which I was replying to in between games at the Underwater Hockey US Nationals 2012, where we won the championship for the B division ;) I have my awesome photo with the medal here.

This post details the effect of ASM redundancy/parity on IOPS: if you change the ASM redundancy (external, normal, and high), will the workload read/write IOPS decrease or stay the same? I'll walk you through, step by step, how I did the instrumentation plus the test case itself, then end with some results and observations.

Let’s get started!

The Environment

My R&D server runs OEL 5.7 on the UEK kernel, with:

  • Intel 2nd gen (Sandy Bridge) CPU
  • 8 x 1TB 7200 RPM SATA3 disks – short-stroked
    • the first 320GB: vbox LVM stripe (VirtualBox .vdi files)
    • the next 320GB: ASM (DATA disk group – external redundancy)
    • the remaining space: RECO area (for backups)
  • 16GB physical memory hosting lots of VirtualBox VMs (12 VMs can run concurrently with a 0.45 load average and no CPU IO wait)

SLOB installation

Follow the “Installation” and “Run the benchmark” sections on my wiki here http://karlarao.tiddlyspot.com/#%5B%5Bcpu%20-%20SillyLittleBenchmark%20-%20SLOB%5D%5D
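For reference, once the schemas are in place the data load in the classic SLOB kit boils down to a single call. A minimal sketch, assuming the classic two-argument setup.sh (tablespace name, then number of users); the IOPS tablespace name here is just a placeholder, so follow the wiki for the exact steps:

sh ./setup.sh IOPS 128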

Instrumentation

Essentially you have to be able to instrument the different layers. The screenshot below shows the terminal sessions I keep open while doing the test cases, and it's really helpful to have them all in a single view because you'll quickly notice changes as the workload grows and reaches its sustained state. So when you run sh ./runit.sh 0 128 you'll see the load average gradually climb from around 2 up to 128 and stay in that range, and that's the sustained state of the workload. From there I start taking screenshots and gathering all sorts of data to serve as the data points for that test case. With the single view you can see how everything behaves from the ramp-up to the sustained state, and also what happens when you cancel the run or make any changes to the test case environment.
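For reference, here's how I kick off the runs; a minimal sketch, assuming the classic two-argument runit.sh where the first argument is the number of writer sessions and the second is the number of reader sessions:

sh ./runit.sh 0 128     # 128 readers (the readers test case)
sh ./runit.sh 128 0     # 128 writers (the writers test case)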

1) Hardware:

On the hardware side I've got collectl (http://collectl.sourceforge.net), which shows the overall OS workload plus CPU and IO details in a time-series manner (see the first three terminal windows above). I usually execute the following commands for every run and package the spooled output in a test case folder:

collectl --all -o T -o D >> collectl-all.txt
collectl -sc --verbose -o T -o D >> collectl-cpuverbose.txt
collectl -sD --verbose -o T -o D >> collectl-ioverbose.txt

The critical numbers are the aggregate IOPS from all 8 disks, and the detailed per-disk IOPS together with the IO size, service time, and Pct% Utilization. All these numbers give you an IO characterization of what SLOB is doing.

2) Database:

Load Profile
The OS numbers have to be correlated to the physical reads/writes (load profile) on the database side, and just relying on AWR snapshots taken at the start and end of the test case is not really a good measure, especially on a small-scale box like mine with just 4 cores and 8 spindles that I'm essentially making work for 128 concurrent users, each updating its own table. Ultimately the queueing effect kicks in and the disks get so busy that the AWR snapshot may hang or queue, waiting until it can grab enough IO bandwidth to do its own operations, and only then will it be able to complete the snapshot/report... whew.

So to get around that, you have to read those performance numbers from memory, which is only possible by querying a v$ view. Luckily there's a script created by Timur Akhmadeev a few months ago (http://timurakhmadeev.wordpress.com/2012/02/21/load-profile/) that outputs the same Load Profile section as the AWR report but pulls data from the V$SYSMETRIC view. The view exposes the past 15 and 60 seconds of a bunch of metrics, similar to V$SYSSTAT but with some metrics already transformed into per-second or per-transaction numbers. Timur used the 60-second time slice, which is equivalent to creating 1-minute AWR reports. That's good enough, since the workload is sustained and the numbers will most likely not vary that much, and the Load Profile section has all the stuff I need to see what's happening on the physical reads/writes, because SLOB is essentially a test case to gauge your read/write IOPS.
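Just to show the idea (this is not Timur's full script), here's a minimal sketch of the kind of V$SYSMETRIC query involved; on my 11.2 box the 60-second metrics are GROUP_ID = 2, but check your version:

select begin_time, end_time, metric_name, round(value, 1) value
from v$sysmetric
where group_id = 2   -- the 60-second "long duration" metric group
and metric_name in ('Physical Read Total IO Requests Per Sec',
                    'Physical Write Total IO Requests Per Sec');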

I've modified the SQL (https://www.dropbox.com/s/3ijoy3zljqps0wx/loadprof.sql) to add a "tm" column so I can correlate it with the time-series output of collectl. The sample output is shown on the left side of the image above. You can download the script, put it in the SLOB directory, and run it as:

while : ; do sqlplus "/ as sysdba" @loadprof ; echo "--"; sleep 2 ; done

ASH
I also want to see what the workload looks like in ASH, so I've got a script called ash that calls a SQL script by Kyle Hailey which gives you the Average Active Sessions (AAS) for the past 5-second time slice. If, say, I'm doing a sustained run of 128 readers, I should be able to see 128 AAS on "db file sequential read" (random IO). So rather than pulling up OEM or creating an OEM repository just to see what's happening on the SLOB run, I can go with this text graph of what I'd like to see.

The sample output is shown on the right-hand side of the image; see the FASL column showing the 128 AAS on "db file sequential read". The time column at the far left can also be correlated with the collectl and loadprof.sql output. The script is available here:
https://www.dropbox.com/s/pbcewv374dq5bc0/ash-aveactn300.sql
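Similar to loadprof.sql, I just loop it from the SLOB directory; a minimal sketch, assuming you saved the downloaded script as ash-aveactn300.sql:

while : ; do sqlplus "/ as sysdba" @ash-aveactn300 ; echo "--"; sleep 5 ; done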

3) Sessions:

Another important thing is measuring latency. I could have written a SQL that mimics the Top 5 Timed Events section, pulling from v$ views and also showing the latency column, but for now I don't really have the time to do that. A very good alternative is to do it with the "awesome snapper", spool it to a file, and grep for stuff. And because SLOB is doing a sustained reads/writes workload, the numbers will most likely stay within the range of that sustained state, so I can just take 5 samples (5 runs of snapper) and that will tell me what my latency is.

The good thing about using snapper is that the latency is measured from the SLOB users themselves, which is what the sessions actually experience for each IO operation they do. Also note that with the snapper command below I can grep for the DBWR, CKPT, and LGWR processes, which I can't do in the usual Top 5 Timed Events, and with that I was able to build a cool comparison of latency graphs across redundancy levels (more on this in the results section). Each snapper sample has a timestamp that you can correlate with the collectl, loadprof, and ash output. Below are the commands I used:

spool snapper.txt
@snapper ash=event+wait_class,stats,gather=tsw,tinclude=CPU,sinclude=redo|reads|writes 5 5 "select sid from v$session where username like 'USER%' or program like '%DBW%' or program like '%CKP%' or program like '%LGW%'"
spool off

cat snapper.txt | grep write | egrep "DBW|CKP|LGW" | grep WAIT | sort -rnk2
cat snapper.txt | grep "db file sequential read"
cat snapper.txt | grep "free buffer waits"
cat snapper.txt | grep "write complete waits"

Check out the script here http://tech.e2sn.com/oracle-scripts-and-tools/session-snapper

The test case

Before the test case I took a backup of the SLOB database, and I use that backup as the starting point of each iteration, which keeps the underlying data consistent across runs. Then for each ASM redundancy I did the following:

1) Wipe out the diskgroup

sqlplus "/ as sysasm"
alter diskgroup data dismount force;
drop diskgroup data force including contents;

2) Create the new diskgroup (1st iteration – normal, 2nd iteration – high, 3rd iteration – external); see the sketch after this list
3) Restore the backup
4) Execute 128 readers and gather perf data
5) Execute 128 writers and gather perf data
6) Go back to step1
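For step 2, only the redundancy clause of the create statement changes per iteration; a minimal sketch (the '/dev/sd*1' discovery string is hypothetical, use your own disk paths and attributes):

sqlplus "/ as sysasm"
create diskgroup data normal redundancy disk '/dev/sd*1';    -- 1st iteration
create diskgroup data high redundancy disk '/dev/sd*1';      -- 2nd iteration
create diskgroup data external redundancy disk '/dev/sd*1';  -- 3rd iteration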

Results

>>128 Readers Testcase

  • External Redundancy (raw screenshots)
  • Normal Redundancy (raw screenshots)
  • High Redundancy (raw screenshots)

>>128 Writers Testcase

  • External Redundancy (raw screenshots)
  • Normal Redundancy (raw screenshots)
  • High Redundancy (raw screenshots)

>>Read IOPS

  • Shown on the left is the IOPS comparison across redundancy levels
  • On the right is the read:write ratio

>>Write IOPS

  • Shown on the left is the IOPS comparison across redundancy levels
  • On the right is the read:write ratio
  • The graph at the bottom is just the "write IOPS" part of the graph above, since SLOB is executing a massive UPDATE SQL with some filter predicates, which causes scans on some rows that translate to "read IOPS"

>>Read Latency (ms)

  • Here are the graph definitions:
    • ckpt -> CKPT process – “control file parallel write” latency
    • lgwr -> LGWR process – “log file parallel write” latency
    • dbwr -> DBWR process – “db file parallel write” latency
    • phyreads -> SLOB users – “db file sequential read” latency
    • fbw -> SLOB users – “free buffer waits” latency
    • wcw -> SLOB users – “write complete waits” latency

>>Write Latency (ms)

  • (see graph definitions above)

Observations

>>Some insights on the numbers above…

First of all, I can say I did a really good job documenting all this stuff ;) The results are in six parts:

  • part 1 >>128 Readers Testcase
  • part 2 >>128 Writers Testcase
  • part 3 >>Read IOPS
  • part 4 >>Write IOPS
  • part 5 >>Read Latency (ms)
  • part 6 >>Write Latency (ms)

But as you can see at a high level, as I change the redundancy the read IOPS stayed in the range of 2400+ IOPS, while for the writes, moving to "normal" redundancy cut the IOPS in half and "high" redundancy cut them to 1/3. If you are a big SLOB fan, the first two parts of the results will be a gold mine for you because that's the geekier output; I essentially pulled the numbers from those raw screenshots to build the graphs.

So let's move on to explaining the charts…

On the read IOPS and latency I don't see any huge difference, and I can say they are not affected: the IOPS stayed in the range of 2400+ and the latency in the range of 900ms across redundancy levels. It clearly shows that for reads there is no RAID/parity penalty to pay.

On the Write IOPS and latency

  • It's pretty clear that on normal redundancy you lose half, and on high redundancy you only get 1/3 of the hardware IOPS
  • What's interesting is that while watching the normal and high redundancy workloads I could see an increasing amount of AAS (average active sessions) spent on "free buffer waits" and "write complete waits"… I see them on external redundancy too, but not as much.
  • Actually, as we increase the redundancy we lower the IOPS capacity (that's what we sacrifice for the redundancy/parity) while still streaming the same amount of workload, and these wait events are just signs of inadequate IO bandwidth. The DBWR works harder to make free buffers available for the same workload with fewer IOPS to work with, and that translates to higher latencies.
  • You will also notice in the raw screenshots that the hardware write IOPS from collectl stayed at around 1000+ at full throttle, while the database load profile shows only around 300 (high redundancy), which is the effect of the ASM parity.

>>How can this affect your Storage Capacity Planning?

I'll pull up a conversation I had with a really good friend of mine. His question was: "Quick question on your AWR mining script AWR-gen-wl…, is IOPs calculated before or after ASM mirroring? For example on Exadata, if I see 10,000 write IOPs, did the cells do 10k or did they do 20k (normal redundancy)?"

And here's my response:

The scripts awr_iowl.sql and awr_iowlexa.sql have columns that account for RAID1, that is, a read penalty of 1 and a write penalty of 2.
Read the section "the iops raid penalty" at this link http://www.zdnetasia.com/calculate-iops-in-a-storage-array-62061792.htm and the "real life examples" at this link http://www.yellow-bricks.com/2009/12/23/iops/

So those computations should also apply to Exadata, since normal redundancy is essentially RAID1 (a write penalty of 2), and high redundancy is a write penalty of 3.

Now I remember a sizing exercise I had with an EMC engineer on a project bid before (https://www.evernote.com/shard/s48/sh/03602b99-3274-4c64-b5d1-bbe7bd961f8d/95be02ccf9aa75cf863bb19115353eb0),
and that's why I created those columns to get the data directly from AWR. So for every snapshot you've got the "hardware iops needed" and "number of disks needed"; what's good about that is, as your workload varies, those two numbers remain representative of that workload. And since you have a lot of data samples, I usually make a histogram of those two columns and take the top percentile numbers, because those are most likely the peak periods, and I can investigate by drilling down into the snap_ids, looking at the SQLs, and validating with the app owners what the application was running at that time.

I've attached an Excel sheet where you can just plug the total workload IOPS into the yellow box. So in your case, let's say you have 10K workload IOPS with a 50:50 read/write ratio; that's equivalent to 15K hardware IOPS for normal redundancy and 20K hardware IOPS for high redundancy.
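To make the arithmetic explicit, here's a quick sanity check of the numbers above using the RAID1-style penalties (read penalty 1; write penalty 2 for normal, 3 for high):

# 10,000 workload IOPS at a 50:50 read/write ratio
READS=5000; WRITES=5000
echo "normal redundancy: $(( READS*1 + WRITES*2 )) hardware IOPS"   # 15000
echo "high redundancy:   $(( READS*1 + WRITES*3 )) hardware IOPS"   # 20000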

The Excel screenshot is here: https://lh6.googleusercontent.com/-00PkzwfwnOE/T-N0oo2Q-FI/AAAAAAAABqE/EbTOnHBlpmQ/s2048/20120621_IOPS.png

Note that I'm particular with the words "workload IOPS" and "hardware IOPS".
So on this statement,
"if I see 10,000 write IOPs, did the cells do 10k or did they do 20k (normal redundancy)?" <– if this 10,000 is what you pulled from AWR, then it's the database that did the 10K IOPS, so that's the "workload IOPS", and that's essentially your "IO workload requirements".

Then let's say you haven't migrated to Exadata yet. You have to take into account the penalty computation shown above, so you'll arrive at 15,000 "hardware IOPS" needed (normal redundancy). Say each disk does 180 IOPS; then you need at least 83 disks, and 83 disks / 12 disks per cell = 6.9 storage cells, which is a Half Rack Exadata. But looking at the data sheet (https://www.dropbox.com/s/ltvr7caysvfmvkr/dbmachine-x2-2-datasheet-175280.pdf) it seems you could fit the 15,000 in a quarter rack (because of the flash). Mmm, well, I'm not that confident about that, because if, say, 50% of the 15,000 IOPS are writes (7.5K IOPS), then I would investigate the write mix to see whether most of it is DBWR related (v$sysstat "physical write IO requests") or LGWR related (v$sysstat "redo writes"), and if most of it is DBWR related then I don't think you'll ever benefit from the Smart Flash Log. So I would still go with the Half Rack (12,500 disk IOPS) or Full Rack (25,000 disk IOPS) for my "hardware IOPS" capacity. I'd also take into consideration the MB/s needed for that database, but that should be augmented by the flash cache.
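And the same back-of-the-envelope math for the disk and cell counts (the 180 IOPS per disk and 12 disks per cell figures are the assumptions used above):

echo "scale=1; 15000/180" | bc   # ~83.3 disks
echo "scale=1; 83.3/12" | bc     # ~6.9 storage cells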

In addition to my response above, also be aware that the Smart Flash Log does not lower the write load on the hard disks (HDDs); both sides of the write go to flash and HDD concurrently. So you should not be too optimistic about the Smart Flash Log when sizing Exadata for "write IOPS", because you can't predict who will always win the fastest-write race; as much as possible, base your write IOPS sizing on hard disk IOPS. But on the read side, whether small (OLTP) or large (DSS) IOPS, the flash cache will definitely be able to augment those HDD numbers.

So with my test case shown here, the statement in the reply above holds true. The ASM redundancy/parity affects the "workload write IOPS" number (be it small-OLTP or large-DSS IOPS), but not the "workload read IOPS". The behavior is the same even on a regular SAN environment, which you have to be careful about and aware of when sizing storage. So now you know that 50K IOPS in any data sheet is not equal to 50K write IOPS; factor in the redundancy ;) And always remember Utilization = Requirements / Capacity: no matter the workload type or platform, as long as the total workload IOPS/MB/s requirements are well within the total hardware IOPS/MB/s capacity, you are up a notch on not seeing any IO bottlenecks.

Hope I've shared some good stuff with you :)

13 comments

  1. Pingback: Speaking at E4! « Karl Arao's Blog

  2. hey karl – first and foremost, congrats on your gold medal. cool!

    As for collectl, it's always fun to see people using it in new and different ways, and I guess what I'm wondering is why you're redirecting collectl's output to a text file. Do you realize that means you can only look at what you've decided to record at that point in time? Perhaps you didn't realize that you could run: collectl --all -P -f/dirname and it will record everything to a raw file.

    Then when your tests are done, simply ^C collectl and then you can run any of your above commands against that data as often and in as many different ways as you like. In other words, in your text files you also have disk data in detail format and cpu data in verbose format. But what if you want to look at individual CPUs or networks? You can't, because they're not in your txt files. On the other hand, if you recorded the data in raw files you can play it back in as many different formats as you like.

    Also, how did you get the data in a format suitable for plotting? If you recorded data in raw files you can convert it to space-separated files suitable for plotting directly with gnuplot or better yet, colplot.

    In other words, my advice is to ALWAYS record collectl data in raw files, as you never know how you'll want to drill down into it. Sometimes you want brief mode, sometimes verbose mode, and sometimes detail mode. Sometimes you want to see it on your terminal and sometimes you want to plot it. With raw files you can do it all.

    The other thing I couldn't tell is whether you're running these commands on a single server or multiple ones. If you're running on multiple servers you really should look into colmux, as that will make your life much easier. And speaking of Kevin Closson, I've been after him for months to check out colmux and he hasn't either. ;)

    -mark

    • Yes, I know, I've been used to spooling them to text files ;) but I'll take note of your raw files advice; it may come in handy one of these days.. :) BTW, I've played with colplot once before, pretty cool, and I'm not forgetting about it. For this post it's just one node so there's no need for colmux, but I've made use of it a bunch of times in Exadata/RAC monitoring.

      Thanks for dropping by Mark!

      -Karl

  3. Karl,
    one more question.
    How do you measure IOPS and throughput in Oracle 9i (9.2.0.8)? I have to consolidate a few such systems.
    Probably some clever Statspack hacking, right?
    Regards
    GregG

  4. I am not too sure that you can that easily extrapolate your home-grown system SLOB test results to Exadata. I would be more confident if you ran SLOB on Exadata and proved your conclusions there. There is way too large a difference between an 8-spindle system and Exadata.
    But I like the title anyway :)

    Yury

    • Well, I'm very confident in my results. I've seen a high-intensity insert batch load (a very large stream of inserts, loading as fast as it can) on a chain of three (3) X2-8s, which is a total of 42 storage servers, and it was able to reach 60K "workload write IOPS" on normal redundancy. According to the data sheet, the X2-8 high performance disk rack has 50,000 "disk IOPS", so a cluster of 3 X2-8s has about 150,000 "disk IOPS", and if you are running normal redundancy the max "workload write IOPS" you can get is 75,000, which the 60K "workload write IOPS" is well within. So this blog post proves what I've seen in the real world. Check the bottom graph here http://goo.gl/v60EK

      -Karl

      • @Yury, @Karlarao: Any full rack X2 Exadata can sustain 50,000 random single-block instance writes. Since that is a horribly limited amount of write bandwidth, we shouldn't spend as much time wondering whether it's true as we should criticizing the horrible imbalance between CPU capability and write capacity. Horrible.

        • and by your reply you mean this (from our email thread before):

          “Kevin Closson
          May 17
          to me
          …***hold it*** I messed up… was mixing half and full rack.
          OK, always start with datasheet which is 50,000 WIOPS full rack
          Exadata only supports ASM and ASM partner writes are not accounted for.
          Your application write payload max would therefore be 25,000/full rack normal redund.”

          Regarding the imbalance… mmm, I can argue that Exadata is still a strong platform because you are running only databases on it and nothing else, and that alone is a big difference. Let me give some real-world examples:

          >> First. I diagnosed a performance problem on a database newly migrated from an old to a new SAN environment (TIER1). The reads were apparently a lot slower because the SAN storage processor was already saturated: they've got a mix of MS Exchange Server workload, databases, and other uses for that SAN (NFS, CIFS, etc.). Simply put, the requirements were exceeding the capacity, which affected the storage processor and led to bad IO times. Since this database is mission critical, they migrated it to TIER2 storage (supposedly a slower array), which apparently is way faster than their TIER1, and that solved their database IO issues.

          >> Second. I've seen a case where an M5000 box (64 CMT cores) is backed by an 800+ spindle Symmetrix storage array using the EMC "FAST" feature. But even though it's an M5000 they are not using it only for databases; it's a multi-tenant Solaris environment making intensive use of zones. In one M5000 box they've got 7 zones dedicated to app servers and only 2 zones for databases. The 1st database zone holds the EMGC, the other holds 7 databases. The end result is a mixed workload (file IO and database IO) that doesn't really play well together, and they're not really enforcing the resource management features of Solaris, so it's a big playground for all the zones, and when all of them reach their peak periods, there come the latency issues and high SYS CPU. Simply put, even with a Ferrari SAN, if you are not doing the right thing you'll not get the right results.

          I’ve got some screenshots here http://goo.gl/MYKxK
          see
          >> average latency issue correlation of SAN, datafiles, session IO
          >> IO issue – SAN performance validation – saturated storage processor

          -Karl

  5. Pingback: Putting SLOB (The Silly Little Oracle Benchmark) To Use For Knowledge Sake! « Kevin Closson's Blog: Platforms, Databases and Storage
