ROWS versus default RANGE in analytic window clause
I have talked at KScope about the difference between ROWS and RANGE in the analytic window clause, but haven't yet blogged about it. Recently, while working out a quiz for the PL/SQL Challenge, I discovered yet another reason for remembering to primarily use ROWS and only use RANGE when the problem actually requires it.
From my KScope presentation examples, here is a simple case of a rolling sum of salaries:
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal
       ) sum_sal
  from scott.emp
 order by deptno
     , sal
/
    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 MILLER           1300       1300
        10 CLARK            2450       3750
        10 KING             5000       8750
        20 SMITH             800        800
        20 ADAMS            1100       1900
        20 JONES            2975       4875
        20 SCOTT            3000      10875
        20 FORD             3000      10875
        30 JAMES             950        950
        30 MARTIN           1250       3450
        30 WARD             1250       3450
        30 TURNER           1500       4950
        30 ALLEN            1600       6550
        30 BLAKE            2850       9400

14 rows selected.
Scott and Ford get the same sum, and so do Martin and Ward. Why? Because when the window clause is omitted, it defaults to RANGE BETWEEN:
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal
          range between unbounded preceding and current row
       ) sum_sal
  from scott.emp
 order by deptno
     , sal
/
    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 MILLER           1300       1300
        10 CLARK            2450       3750
        10 KING             5000       8750
        20 SMITH             800        800
        20 ADAMS            1100       1900
        20 JONES            2975       4875
        20 SCOTT            3000      10875
        20 FORD             3000      10875
        30 JAMES             950        950
        30 MARTIN           1250       3450
        30 WARD             1250       3450
        30 TURNER           1500       4950
        30 ALLEN            1600       6550
        30 BLAKE            2850       9400

14 rows selected.
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means that for Scott it takes the sum of the salaries of all employees in department 20 that have a salary (the ORDER BY column) less than or equal to Scott's. So it will be the sum of all in deptno 20 with salary <= 3000, which includes Ford! When calculating SUM_SAL for Ford, the result will be the same.
We can get the rolling sum we probably expect by using ROWS BETWEEN instead:
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal
          rows between unbounded preceding and current row
       ) sum_sal
  from scott.emp
 order by deptno
     , sal
/
    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 MILLER           1300       1300
        10 CLARK            2450       3750
        10 KING             5000       8750
        20 SMITH             800        800
        20 ADAMS            1100       1900
        20 JONES            2975       4875
        20 SCOTT            3000       7875
        20 FORD             3000      10875
        30 JAMES             950        950
        30 MARTIN           1250       2200
        30 WARD             1250       3450
        30 TURNER           1500       4950
        30 ALLEN            1600       6550
        30 BLAKE            2850       9400

14 rows selected.
The effect of ROWS BETWEEN rather than the default RANGE BETWEEN is that the sum does not take the value of the column in the ORDER BY clause into account, but simply sums the salaries of Scott's row and the previous rows. Note: since we have a non-unique ordering, it is indeterminate whether Scott or Ford comes first. You might run this another day and see Ford first with SUM_SAL=7875 and Scott right after with SUM_SAL=10875. Generally I recommend adding something unique (for example ORDER BY sal, empno) to make the ordering deterministic, giving the same output on the same data.
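As a minimal sketch of that deterministic variant (my own addition, not from the presentation), the unique EMPNO can serve as a tie-breaker in the analytic ORDER BY:

select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal, empno  -- empno breaks ties deterministically
          rows between unbounded preceding and current row
       ) sum_sal
  from scott.emp
 order by deptno
     , sal
     , empno
/

With the tie-breaker, Scott (empno 7788) always comes before Ford (empno 7902), so the same data gives the same SUM_SAL on every run.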
But what if we order by EMPNO rather than SAL, so that we have something unique and are certain that RANGE BETWEEN never encounters duplicate values? Can't we allow ourselves to save some typing and just write:
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by empno
       ) sum_sal
  from scott.emp
 order by deptno
     , empno
/
    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 CLARK            2450       2450
        10 KING             5000       7450
        10 MILLER           1300       8750
        20 SMITH             800        800
        20 JONES            2975       3775
        20 SCOTT            3000       6775
        20 ADAMS            1100       7875
        20 FORD             3000      10875
        30 ALLEN            1600       1600
        30 WARD             1250       2850
        30 MARTIN           1250       4100
        30 BLAKE            2850       6950
        30 TURNER           1500       8450
        30 JAMES             950       9400

14 rows selected.
Since we have a unique ordering, the default RANGE BETWEEN in this case gives the same result as if we add the ROWS BETWEEN clause:
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by empno
          rows between unbounded preceding and current row
       ) sum_sal
  from scott.emp
 order by deptno
     , empno
/
    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 CLARK            2450       2450
        10 KING             5000       7450
        10 MILLER           1300       8750
        20 SMITH             800        800
        20 JONES            2975       3775
        20 SCOTT            3000       6775
        20 ADAMS            1100       7875
        20 FORD             3000      10875
        30 ALLEN            1600       1600
        30 WARD             1250       2850
        30 MARTIN           1250       4100
        30 BLAKE            2850       6950
        30 TURNER           1500       8450
        30 JAMES             950       9400

14 rows selected.
So why should we bother to type that long ROWS BETWEEN clause when we have something unique?
Well, consider this example of using two analytic functions - one within an inline view. (This example is a bit contrived on the EMP table, but I use similar constructs many times in my daily work.) The inline view creates a rolling sum; the outer select filters on the result and makes a new rolling sum on the leftover rows:
select s.*
     , sum(sal) over (
          partition by deptno
          order by empno
       ) sum_sal_2
  from (
         select deptno
              , empno
              , ename
              , sal
              , sum(sal) over (
                   partition by deptno
                   order by empno
                ) sum_sal
           from scott.emp
       ) s
 where sum_sal > 5000
 order by deptno
     , empno
/
    DEPTNO      EMPNO ENAME             SAL    SUM_SAL  SUM_SAL_2
---------- ---------- ---------- ---------- ---------- ----------
        10       7839 KING             5000       7450       5000
        10       7934 MILLER           1300       8750       6300
        20       7788 SCOTT            3000       6775       3000
        20       7876 ADAMS            1100       7875       4100
        20       7902 FORD             3000      10875       7100
        30       7698 BLAKE            2850       6950       2850
        30       7844 TURNER           1500       8450       4350
        30       7900 JAMES             950       9400       5300

8 rows selected.
The interesting part comes when examining autotrace output for that SQL:
Execution Plan
----------------------------------------------------------
Plan hash value: 3842450409

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |    14 |   826 |     4  (25)| 00:00:01 |
|   1 |  WINDOW BUFFER       |      |    14 |   826 |     4  (25)| 00:00:01 |
|*  2 |   VIEW               |      |    14 |   826 |     4  (25)| 00:00:01 |
|   3 |    WINDOW SORT       |      |    14 |   238 |     4  (25)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| EMP  |    14 |   238 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("SUM_SAL">5000)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        583  bytes sent via SQL*Net to client
        364  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
          8  rows processed
The inner SUM() uses a WINDOW SORT operation and the outer SUM() uses WINDOW BUFFER. Two memory sorts are used.
Let us add the ROWS BETWEEN clause:
select s.*
     , sum(sal) over (
          partition by deptno
          order by empno
          rows between unbounded preceding and current row
       ) sum_sal_2
  from (
         select deptno
              , empno
              , ename
              , sal
              , sum(sal) over (
                   partition by deptno
                   order by empno
                   rows between unbounded preceding and current row
                ) sum_sal
           from scott.emp
       ) s
 where sum_sal > 5000
 order by deptno
     , empno
/
Giving us the exact same output:
    DEPTNO      EMPNO ENAME             SAL    SUM_SAL  SUM_SAL_2
---------- ---------- ---------- ---------- ---------- ----------
        10       7839 KING             5000       7450       5000
        10       7934 MILLER           1300       8750       6300
        20       7788 SCOTT            3000       6775       3000
        20       7876 ADAMS            1100       7875       4100
        20       7902 FORD             3000      10875       7100
        30       7698 BLAKE            2850       6950       2850
        30       7844 TURNER           1500       8450       4350
        30       7900 JAMES             950       9400       5300

8 rows selected.
But there are a couple of interesting differences in the autotrace output:
Execution Plan
----------------------------------------------------------
Plan hash value: 1485958224

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |    14 |   826 |     4  (25)| 00:00:01 |
|   1 |  WINDOW NOSORT       |      |    14 |   826 |     4  (25)| 00:00:01 |
|*  2 |   VIEW               |      |    14 |   826 |     4  (25)| 00:00:01 |
|   3 |    WINDOW SORT       |      |    14 |   238 |     4  (25)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| EMP  |    14 |   238 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("SUM_SAL">5000)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        587  bytes sent via SQL*Net to client
        364  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          8  rows processed
The inner SUM() uses a WINDOW SORT operation like before, but the outer SUM() uses WINDOW NOSORT this time! And just one memory sort is used.
So when we are lazy and save on typing by not writing a window specification, we get the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With RANGE, the optimizer cannot skip the second sorting operation in this example (probably because RANGE needs to "look ahead" in the row source), but when we explicitly use the ROWS BETWEEN clause, the optimizer knows it can reuse the sort it has already performed and save some work.
The same is true if the example had used different analytic functions in the inline view and the outer select (for example COUNT() as the inner analytic function and SUM() as the outer analytic function). What matters is that they use the same PARTITION BY and ORDER BY. In real life I find I often use different functions in constructs like this, but the WINDOW NOSORT still applies, as sketched below.
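As a hedged sketch of such a variant (my own illustration, with a hypothetical filter on the running count - not a query from the presentation), COUNT() inside and SUM() outside share the same PARTITION BY, ORDER BY and ROWS BETWEEN, so the outer step should still be able to use WINDOW NOSORT:

select s.*
     , sum(sal) over (
          partition by deptno
          order by empno
          rows between unbounded preceding and current row
       ) sum_sal_2
  from (
         select deptno
              , empno
              , ename
              , sal
              , count(*) over (  -- different function, same window
                   partition by deptno
                   order by empno
                   rows between unbounded preceding and current row
                ) cnt_emp
           from scott.emp
       ) s
 where cnt_emp > 2               -- hypothetical filter for illustration
 order by deptno
     , empno
/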
So this is yet another argument for my rule of thumb when I use analytic functions:
If I am doing analytics on the entire partition (or the entire result set), then I skip the window clause (that is, I do not write ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):
sum(sal) over ( partition by deptno )
But once I have an ORDER BY, I generally do not rely on the default window clause, but rather explicitly write either ROWS or RANGE BETWEEN:
sum(sal) over (
   partition by deptno
   order by empno
   rows between unbounded preceding and current row
)
Analytic functions that only support the order-by clause and not a windowing clause - like for example ROW_NUMBER() - are of course exempt from this rule of thumb. But if a function supports both an order-by clause and a windowing clause, then I do not write the order-by clause without adding the windowing clause.
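For example, ROW_NUMBER() accepts an order-by clause but no windowing clause, so this is as explicit as it gets:

row_number() over (
   partition by deptno
   order by empno
)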
I use this rule of thumb because in my experience 95% of the time I really want ROWS BETWEEN - only rarely do I have a true case for using RANGE BETWEEN (one such case is sketched below). In 80% of the cases I have a unique ORDER BY clause, which actually makes RANGE BETWEEN give the same result, but, as we saw above, at a cost of potentially doing much more work sorting the data. Also, explicitly writing ROWS or RANGE makes the intention clear to other developers - if the windowing clause is omitted, I do not really know whether RANGE was meant on purpose, or whether a lazy programmer really meant ROWS :-)
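For completeness, here is a sketch (my own example, not from the presentation) of one of the rare cases where RANGE is genuinely wanted - a value-based window, such as summing the salaries of everyone hired within 90 days before each employee's hire date:

select ename
     , hiredate
     , sal
     , sum(sal) over (
          order by hiredate
          range between interval '90' day preceding and current row
       ) sum_sal_90_days  -- window defined by hiredate values, not row counts
  from scott.emp
 order by hiredate
/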
So please be explicit in your windowing clauses. If you need to save typing, then create an auto-replace in your TOAD or whatever IDE you use :-)