When enough is enough? - how to estimate I/O Performance
Hello reader,
Last week we faced performance problems also shown on the Oracle 10g Db-Console as 'SQL found consuming ...'; The users of the application mourned everything is going slow. Hm ...
The strange thing was that the same kind of SQL run without problems before. I started to monitor I/O with iostat on the Sun Solaris Database Server. The Storage was based on SAN. The I/O rates were from 5m/s - 10m/s. The devices were about 95%-100% busy and this with an I/O rate with only about 10M/second? Usually 100m/s and more were possible.
Contacted Unix and SAN-Team. The secret behind the strange behavior was that the SAN-Area was not only used by our application but shared by much much bigger one's compared to our I/O demand. Our problem was that we were in need of more I/O but due to the global effect all I/O Capacity was exhausted.
How to rate I/O Performance?
My next idea was - we do not want to be called by the application users if we would run into I/O problems again. We wanted to monitor it. Soon it was clear that simple the I/O rate is not really a criteria for the quantitation of I/O Performance.
We assume 40m/s I/O on the Database server.
- Does the application really need not more?
40m/s could be exactly the amount needed by the Application to satisfy all user requests
- Does the SAN comes to it's limit due to global contention between other Applications.
40m/s could a bottleneck, 200m/s could be possible
- Is the read operation caused by a long full table scan?
if yes - 40m/s could a bottleneck, 200m/s could be possible
if no - for pure random access caused by an index-range-scan this is a good value
A little piece of known work;
As you see to evaluate an I/O bottleneck we need more then some ratings. What we need is something like a baseline. A piece of work which we know in normal circumstances is achieved in a known amount of time. To achieve this we need a constant workload :
A Probe table which has filled with rows and about 128m of size
A hinted Full Tablescan counting rows
A statistics mechanism sampling data during work (begin snap, end snap)
- physical reads - blocks (PR)
- physical read total bytes - (PRTB)
- session logical reads - blocks (SLR)
- db_block_size (DBS)
- elapsed milliseconds (ELAMS) - start time, end time in a fraction of 1/1000 seconds
- Delta Sessionstats (dss) - Sessionstats(end snap) - Sessionstats(begin snap)
Some calculations
Physical_Read_MB_Per_Second (PRMPS): gives you the average physical read performance of the elapsed time interval.
(PRMPS) = (dss.PRTB / (1024*1024)) / (ELAMS / 1000)
Logical_Read_Total_Bytes (LRTB): There is no statistic value for Logial Read Total Bytes as for physical reads. The metric must be calculated. The value is always the same on the probe table because all blocks are accessed in database cache independent if they have been read from disk or not.
(LRTB) = dss.SLR * DBS
Physical_Read_Percent (PRP) : helps you to determine if you have enough physical reads for calculating I/O performance. If for example you would have 1 million logical reads in 2 seconds but only 1 physical read you need not to go further calculating physical I/O Performance because all of your data was read from cache.
(PRP) = dss.PRTB / ( LTRB / 100 )
Thresholds
Now if the percentage physical reads reaches the break even point (60%) you have to decide when to notify and what your notification level would be. You need thresholds. The Threshold unit is Physical_Read_Mb_Per_Second. There are two levels of Threshold defined.
- Threshold Warning
if your PRMPS is lower then the Warning Threshold you could log this event or even sent an email. - Threshold Error
if PRMPS ls lower then the Error Threshold then the event is logged and a notification email is dropped.
Example notification mail
OASDBD01 : PHYSICAL SCATTERED READ MB PER SECOND (ERROR)
FROM: ORACLE.OASIS.OASISD@COBACOBANA.COM
TO: SAN ADMINS; APP ADMINS
Dear sirs - Oracle Server detected I/O Latency
DMON Event Details
---------------------------------------------------------------------------
Snap number : 2248
Snap Time : 04.07.2008 - 23:46:54
Event level : ERROR
Treshold Warning : 15,00
Treshold Error : 10,00
Exceeded Value : 07,41
Exceeded Value Comment : PHYSICAL SCATTERED READ MB PER SECOND
---------------------------------------------------------------------------
Please check the load of the system during snap time 04.07.2008 - 23:46:54
Please check on Development how the I/O Latency affects the OASIS XRM application
If necessary to contact Unix/SAN Administrators
Oracle wishes you a successfully business day
PS.: next check will be in one hour
As we enabled the I/O checking in development we detected some fixed intervals where I/O performance was constantly low -
which we identified as the backup times slice of the SAN System.
HTH
Karl Reitschuster


Reader Comments