A couple of weeks ago a customer had some trouble with overall performance. They complained that everything was slow and that (of course) nothing had changed in the last few weeks. They also immediately blamed the database on their side as the issue. The usual stuff; I'm sure that sounds all too familiar to you.
Since they had already "identified" the database as the issue, my team requested some AWR reports, and when we got them I noticed a strange but all too common behavior. My teammates took the AWR reports, went to the SQL Statistics, sections "SQL ordered by Elapsed Time" and "SQL ordered by CPU Time", and immediately identified a materialized view rebuild as the cause of the problem. That looked like this:
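(In case you want to pull such a report yourself: an AWR report is generated from SQL*Plus with the awrrpt.sql script that ships with the database, assuming you are licensed for the Diagnostics Pack.)

```sql
-- Run as a privileged user in SQL*Plus; the script prompts for the
-- report type (html/text), the snapshot range and an output file name.
@?/rdbms/admin/awrrpt.sql
```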
SQL ordered by Elapsed Time

- Resources reported for PL/SQL code includes the resources used by all SQL statements called by the code.
- % Total DB Time is the Elapsed Time of the SQL statement divided into the Total Database Time multiplied by 100

Elapsed Time (s) | CPU Time (s) | Executions | Elap per Exec (s) | % Total DB Time | SQL Id | SQL Module | SQL Text
---|---|---|---|---|---|---|---
1,343 | 679 | 1 | 1342.85 | 33.00 | f20ccnxhvbk65 | | DECLARE job BINARY_INTEGER := …
770 | 323 | 1 | 770.24 | 18.93 | 1usnr4gmcq60d | | /* MV_REFRESH (DEL) */ delete …
571 | 355 | 1 | 571.23 | 14.04 | gz04689vd55db | | /* MV_REFRESH (INS) */INSERT /…
311 | 311 | 1 | 310.91 | 7.64 | 0vhmfumrjchnh | SQL*Plus | BEGIN dm_incr_symbols_post_pro…
288 | 9 | 0 | | 7.09 | bb3f2gjndvjss | oracle@crptd1 (TNS V1-V3) | SELECT /*+ OPAQUE_TRANSFORM */…
258 | 258 | 1 | 257.76 | 6.34 | 75vtwb7j4jzdm | SQL*Plus | INSERT INTO SYMB_EXTRACTT SELE…
149 | 121 | 1 | 148.80 | 3.66 | 90wtn50vy6af6 | | DECLARE job BINARY_INTEGER := …
114 | 103 | 1 | 113.75 | 2.80 | 9993mp6h7kqkp | | INSERT /*+ BYPASS_RECURSIVE_CH…
48 | 38 | 2 | 24.00 | 1.18 | 3nkcg1h5ysqss | | DECLARE job BINARY_INTEGER := …
48 | 38 | 2 | 23.98 | 1.18 | fvb5prrr7b0c3 | | MERGE INTO FT_E_UPS1 UPS1 USIN…
SQL ordered by CPU Time

- Resources reported for PL/SQL code includes the resources used by all SQL statements called by the code.
- % Total DB Time is the Elapsed Time of the SQL statement divided into the Total Database Time multiplied by 100

CPU Time (s) | Elapsed Time (s) | Executions | CPU per Exec (s) | % Total DB Time | SQL Id | SQL Module | SQL Text
---|---|---|---|---|---|---|---
679 | 1,343 | 1 | 679.06 | 33.00 | f20ccnxhvbk65 | | DECLARE job BINARY_INTEGER := …
355 | 571 | 1 | 355.50 | 14.04 | gz04689vd55db | | /* MV_REFRESH (INS) */INSERT /…
323 | 770 | 1 | 323.18 | 18.93 | 1usnr4gmcq60d | | /* MV_REFRESH (DEL) */ delete …
311 | 311 | 1 | 310.89 | 7.64 | 0vhmfumrjchnh | SQL*Plus | BEGIN dm_incr_symbols_post_pro…
258 | 258 | 1 | 257.74 | 6.34 | 75vtwb7j4jzdm | SQL*Plus | INSERT INTO SYMB_EXTRACTT SELE…
121 | 149 | 1 | 121.05 | 3.66 | 90wtn50vy6af6 | | DECLARE job BINARY_INTEGER := …
103 | 114 | 1 | 102.58 | 2.80 | 9993mp6h7kqkp | | INSERT /*+ BYPASS_RECURSIVE_CH…
38 | 48 | 2 | 19.20 | 1.18 | 3nkcg1h5ysqss | | DECLARE job BINARY_INTEGER := …
38 | 48 | 2 | 19.20 | 1.18 | fvb5prrr7b0c3 | | MERGE INTO FT_E_UPS1 UPS1 USIN…
29 | 29 | 12 | 2.38 | 0.70 | 68930z34bm3db | SQL*Plus | select 'file[' || substr(trim(…
9 | 288 | 0 | | 7.09 | bb3f2gjndvjss | oracle@dftg1 (TNS V1-V3) | SELECT /*+ OPAQUE_TRANSFORM */…
So here we have a PL/SQL job which does the materialized view refresh (first line) and the statements of the refresh itself as the second and third lines. For completeness, here is the first statement:
DECLARE
  job BINARY_INTEGER := :job;
  next_date DATE := :mydate;
  broken BOOLEAN := FALSE;
BEGIN
  dbms_refresh.refresh('"ABTGHS"."DB_CREAT_STST_PWOR"');
  :mydate := next_date;
  IF broken THEN
    :b := 1;
  ELSE
    :b := 0;
  END IF;
END;
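Note that dbms_refresh.refresh operates on a whole refresh group, not on a single materialized view. To see what such a group actually contains and on which schedule it runs, a lookup along these lines would do. This is just a sketch, assuming SELECT access on the DBA views; owner and group name are taken from the statement above:

```sql
-- Which job runs the refresh group, and when?
SELECT rowner, rname, job, next_date, interval
FROM   dba_refresh
WHERE  rowner = 'ABTGHS'
AND    rname  = 'DB_CREAT_STST_PWOR';

-- Which materialized views are refreshed as part of the group?
SELECT owner, name, type
FROM   dba_refresh_children
WHERE  rowner = 'ABTGHS'
AND    rname  = 'DB_CREAT_STST_PWOR';
```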
So all they did was go there, look at the first SQL with high elapsed and CPU time, and they had nearly reported back that this was the problem and the customer had to solve it. I call this the "lucky shot method". Sometimes, when you are lucky, the reason for the issue on the DB really is a bad SQL, or a bunch of bad SQLs, which max out the DB on CPU, or I/O, or whatever. In such a case you just go to those sections, identify the SQLs, fix them and everything is good again – you were lucky. This works sometimes, and you are a hero because it took you just 5 minutes to fix the issue; but sometimes it doesn't, and you blame it on some weird constellation of OS, network and something else, so that nobody recognizes that you were just guessing and didn't look at the big picture. Thinking about that, I noticed that this is far too common in IT. Tech admins, DBAs, developers – in every discipline you have people like this. They look into the issue for 5 minutes and then tell you that this and that is the problem. You go ahead and fix it, but still no change. So the next round trip, and the next, and the next. All of it could have been prevented if the person had just once looked at the big picture…
So what is the big picture, and what do I mean by that?
I'm a fan of knowing what happens and why. Always analyze all the information you have, request more if you think some vital information is missing, and draw your conclusions from that. Stop the trial-and-error method, the lucky-shot method. The AWR report I got from the customer is an all too good example. There were two simple lines which made me curious:
 | Snap Id | Snap Time | Sessions | Cursors/Session
---|---|---|---|---
Begin Snap: | 17934 | 05-May-10 09:00:59 | 129 | 144.1
End Snap: | 17935 | 05-May-10 10:00:12 | 130 | 143.0
Elapsed: | | 59.21 (mins) | |
DB Time: | | 67.81 (mins) | |
I'm talking about the last two lines. The first one tells me that the two snapshots being compared span a range of 59.21 minutes. The DB time, i.e. the time the database actually spent working, was 67.81 minutes. So something looks strange here. In a single-core environment the DB time could never exceed the 59.21 minutes of elapsed time. In a dual-core environment the database would have been working just 57% of the time (59.21 x 2 cores = 118.42 minutes of available CPU time – DB time accumulates in parallel on every core where work is performed). Next I had a look at the locking going on, because with heavy lock contention the database can look mostly idle while throughput is still low. But it turned out that this was no issue either; two quick ways to check both points are sketched below. The next piece of information, the operating system statistics, then confirmed that the database wasn't the problem at all.
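To make that arithmetic concrete: with live access to the instance, the DB time between the two snapshots can be pulled straight from the AWR history views. A minimal sketch, assuming a Diagnostics Pack license; the snap ids are the ones from the report header above:

```sql
-- DB time accumulated between the two snapshots, in minutes.
-- dba_hist_sys_time_model stores cumulative values in microseconds.
SELECT (e.value - b.value) / 1e6 / 60 AS db_time_min
FROM   dba_hist_sys_time_model b
JOIN   dba_hist_sys_time_model e
       ON  e.dbid            = b.dbid
       AND e.instance_number = b.instance_number
       AND e.stat_name       = b.stat_name
WHERE  b.stat_name = 'DB time'
AND    b.snap_id   = 17934
AND    e.snap_id   = 17935;
```

And for the lock check, a quick live look at who is blocked by whom; again just a sketch, v$session exposes the blocker directly:

```sql
-- Sessions currently stuck behind another session.
SELECT sid, serial#, blocking_session, event, seconds_in_wait
FROM   v$session
WHERE  blocking_session IS NOT NULL;
```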
Operating System Statistics
Statistic | Total |
---|---|
AVG_BUSY_TIME | 33,088 |
AVG_IDLE_TIME | 321,941 |
The average idle time is much higher than the average busy time (the values are in hundredths of a second, so each CPU was busy for roughly 5.5 of the 59 minutes). So I requested the number of cores on that system, and it turned out that there were 14 available. A fully busy database could therefore have accumulated a DB time of at most 828.94 minutes (59.21 x 14 cores). No locking was going on. I followed up with the customer and it turned out as I thought: the database wasn't the problem at all. In fact, it was pretty much idle the whole time. The issue was on the application side: a Java application whose memory had filled up, so that it had to run full garbage collections all the time to continue processing.
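By the way, the core count I had to request is also visible from inside the database; on a live system one could pull it together with the cumulative busy/idle counters straight from v$osstat. A sketch (values are totals since instance startup, in hundredths of a second):

```sql
-- CPU topology and cumulative busy/idle time as seen by the instance.
SELECT stat_name, value
FROM   v$osstat
WHERE  stat_name IN ('NUM_CPUS', 'NUM_CPU_CORES', 'BUSY_TIME', 'IDLE_TIME');
```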
Instead of taking the lucky shot, we saved ourselves and the customer a lot of headache by looking at the big picture.