Group by Groups
For some time now I have struggled to efficiently "group by groups" on data containing references between our suppliers' item numbers and the original equipment manufacturer (OEM) numbers. I can group those data by supplier and item number and get a "group of OEM numbers." That group I will call a unique item, and I will group the data once more by that group to find which suppliers agree on an item having the same set (group) of OEM numbers.
It took me some time to figure out how to do this efficiently (the data was several million records, and the "unique groups" to be found numbered about 3/4 million). But after several tries, some good help from the OTN forums, and some reading in my Tom Kyte books, I finally got a solution that seems to work well ;-)
Let me show what I tried with an example using good old scott.emp:
SQL> create table orig_data as
2 select distinct job, deptno
3 from scott.emp e
4 /
Table created.
SQL> select job
2 , deptno
3 from orig_data
4 order by
5 job
6 , deptno
7 /
JOB DEPTNO
--------- ----------
ANALYST 20
CLERK 10
CLERK 20
CLERK 30
MANAGER 10
MANAGER 20
MANAGER 30
PRESIDENT 10
SALESMAN 30
9 rows selected.
First I group by job (I use xmlagg here because I am on version 11.1 and therefore have no listagg):
SQL> select od.job
2 , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
3 from orig_data od
4 group by od.job
5 /
JOB DEPTNOS
--------- ------------------------------
ANALYST 20
CLERK 10,20,30
MANAGER 10,20,30
PRESIDENT 10
SALESMAN 30
I notice here that the jobs CLERK and MANAGER both have the same set of deptnos.
So if I group by deptnos I can get this result:
SQL> select s2.deptnos
2 , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs
3 from (
4 select od.job
5 , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
6 from orig_data od
7 group by od.job
8 ) s2
9 group by s2.deptnos
10 /
DEPTNOS JOBS
------------------------------ ------------------------------
10 PRESIDENT
10,20,30 CLERK,MANAGER
20 ANALYST
30 SALESMAN
(If you substitute OEM numbers for deptnos and supplier items for jobs, this represents my real world case - except the real world finds nearly 3/4 million such "group by groups.")
So my requirement is to identify all such unique groups of deptnos in my orig_data table, give each such group a surrogate key in a parent table, and then populate two child tables with the deptnos of each group and the jobs that have that group of deptnos:
SQL> create table groups (
2 groupkey number primary key
3 )
4 /
Table created.
SQL> create table groups_depts (
2 groupkey number references groups (groupkey)
3 , deptno number(2)
4 )
5 /
Table created.
SQL> create table groups_jobs (
2 groupkey number references groups (groupkey)
3 , job varchar2(9)
4 )
5 /
Table created.
For the surrogate groupkey I can just use a rownumber on my group by deptnos query:
SQL> select row_number() over (order by s2.deptnos) groupkey
2 , s2.deptnos
3 , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs
4 from (
5 select od.job
6 , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
7 from orig_data od
8 group by od.job
9 ) s2
10 group by s2.deptnos
11 /
GROUPKEY DEPTNOS JOBS
---------- ------------------------------ ------------------------------
1 10 PRESIDENT
2 10,20,30 CLERK,MANAGER
3 20 ANALYST
4 30 SALESMAN
That query I can use for a (slow) insert into my three tables in this simple manner:
SQL> begin
2 for g in (
3 select row_number() over (order by s2.deptnos) groupkey
4 , s2.deptnos
5 , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs
6 from (
7 select od.job
8 , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos
9 from orig_data od
10 group by od.job
11 ) s2
12 group by s2.deptnos
13 ) loop
14 insert into groups values (g.groupkey);
15
16 insert into groups_depts
17 select g.groupkey
18 , to_number(regexp_substr(str, '[^,]+', 1, level)) deptno
19 from (
20 select rownum id
21 , g.deptnos str
22 from dual
23 )
24 connect by instr(str, ',', 1, level-1) > 0
25 and id = prior id
26 and prior dbms_random.value is not null;
27
28 insert into groups_jobs
29 select g.groupkey
30 , regexp_substr(str, '[^,]+', 1, level) job
31 from (
32 select rownum id
33 , g.jobs str
34 from dual
35 )
36 connect by instr(str, ',', 1, level-1) > 0
37 and id = prior id
38 and prior dbms_random.value is not null;
39
40 end loop;
41 end;
42 /
PL/SQL procedure successfully completed.
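(As an aside, the connect by trick used inside the loop to split a comma-separated string can be tried in isolation. A minimal sketch on dual - regexp_count exists as of version 11.1; the id = prior id and prior dbms_random.value predicates in the loop above are the usual way to avoid false CONNECT BY cycle detection when this pattern is applied to more than one source row at a time:)
-- split a comma-separated literal into one row per element
select regexp_substr('10,20,30', '[^,]+', 1, level) deptno
  from dual
connect by level <= regexp_count('10,20,30', ',') + 1
/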
The tables now contain this data:
SQL> select *
2 from groups
3 order by groupkey
4 /
GROUPKEY
----------
1
2
3
4
SQL> select *
2 from groups_depts
3 order by groupkey, deptno
4 /
GROUPKEY DEPTNO
---------- ----------
1 10
2 10
2 20
2 30
3 20
4 30
6 rows selected.
SQL> select *
2 from groups_jobs
3 order by groupkey, job
4 /
GROUPKEY JOB
---------- ---------
1 PRESIDENT
2 CLERK
2 MANAGER
3 ANALYST
4 SALESMAN
From these data I can now get the same result as before (just to test that I have created the desired data):
SQL> select g.groupkey
2 , d.deptnos
3 , j.jobs
4 from groups g
5 join (
6 select groupkey
7 , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos
8 from groups_depts
9 group by groupkey
10 ) d
11 on d.groupkey = g.groupkey
12 join (
13 select groupkey
14 , rtrim(xmlagg(xmlelement(j,job,',').extract('//text()') order by job),',') jobs
15 from groups_jobs
16 group by groupkey
17 ) j
18 on j.groupkey = g.groupkey
19 /
GROUPKEY DEPTNOS JOBS
---------- ------------------------------ ------------------------------
1 10 PRESIDENT
2 10,20,30 CLERK,MANAGER
3 20 ANALYST
4 30 SALESMAN
So far so good. This all works pretty much as desired - except for a couple of things:
The very simple loop insert code will be slow. OK, it is a one-time conversion job (in theory at least - it will only ever run a very few times), so that could probably be acceptable (except for my professional pride ;-)
But worse is that I have groups where the string aggregation won't work at all - the string would have to be about varchar2(10000) in my real world case (some groups contain something like 1600 OEM numbers), and that won't work in a SQL GROUP BY. :-(
So I tried an approach using collections. First a collection of deptnos:
SQL> create type deptno_tab_type as table of number(2)
2 /
Type created.
SQL> select od.job
2 , cast(collect(od.deptno order by od.deptno) as deptno_tab_type) deptnos
3 from orig_data od
4 group by od.job
5 /
JOB DEPTNOS
--------- ------------------------------
ANALYST DEPTNO_TAB_TYPE(20)
CLERK DEPTNO_TAB_TYPE(10, 20, 30)
MANAGER DEPTNO_TAB_TYPE(10, 20, 30)
PRESIDENT DEPTNO_TAB_TYPE(10)
SALESMAN DEPTNO_TAB_TYPE(30)
All very good - no problems here. But then a collection of jobs:
SQL> create type job_tab_type as table of varchar2(9)
2 /
Type created.
SQL> select s2.deptnos
2 , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
3 from (
4 select od.job
5 , cast(collect(od.deptno order by od.deptno) as deptno_tab_type) deptnos
6 from orig_data od
7 group by od.job
8 ) s2
9 group by s2.deptnos
10 /
group by s2.deptnos
*
ERROR at line 9:
ORA-00932: inconsistent datatypes: expected -, got XAL_SUPERVISOR.DEPTNO_TAB_TYPE
Now it fails with an "inconsistent datatype" error - I cannot group by a collection datatype (without doing something more...)
This is when I asked for help on the OTN forums, and odie_63 kindly reminded me that an object datatype can be grouped by if it has a MAP or an ORDER member method. A MAP function would probably be the most efficient, but it requires being able to return a numerical value that is unique for each distinct collection - not realistic in this case. Odie suggested a simple ORDER method instead:
SQL> create type deptno_container as object (
2 nt deptno_tab_type
3 , order member function match (o deptno_container) return integer
4 );
5 /
Type created.
SQL>
SQL> create or replace type body deptno_container as
2 order member function match (o deptno_container) return integer is
3 begin
4 return case when nt = o.nt then 0 else 1 end;
5 end;
6 end;
7 /
Type body created.
It relies on the fact that when the nested tables are collections of simple datatypes, they can be compared for equality by default.
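(Just to illustrate that default equality - a minimal sketch, assuming set serveroutput on. Note that nested tables compare as multisets, so element order does not matter:)
declare
   a deptno_tab_type := deptno_tab_type(10, 20, 30);
   b deptno_tab_type := deptno_tab_type(30, 20, 10);
begin
   -- nested tables of simple datatypes support the = operator out of the box
   if a = b then
      dbms_output.put_line('equal');     -- this branch is taken
   else
      dbms_output.put_line('not equal');
   end if;
end;
/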
That allows me to "group by groups":
SQL> select row_number() over(order by null) groupkey
2 , s2.deptnos
3 , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
4 from (
5 select od.job
6 , deptno_container(
7 cast(collect(od.deptno order by od.deptno) as deptno_tab_type)
8 ) deptnos
9 from orig_data od
10 group by od.job
11 ) s2
12 group by s2.deptnos
13 /
GROUPKEY DEPTNOS(NT) JOBS
---------- ------------------------ --------------------------------
1 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('SALESMAN')
TAB_TYPE(30))
2 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('PRESIDENT')
TAB_TYPE(10))
3 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('CLERK', 'MANAGER')
TAB_TYPE(10, 20, 30))
4 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('ANALYST')
TAB_TYPE(20))
And Odie then also suggested a method to use that query to insert into the three tables. Unnesting both collections with TABLE() produces a Cartesian product of jobs and deptnos for each group, so the row_number() columns rn0, rn1 and rn2 pick exactly one row per group, per (group, job) and per (group, deptno) to drive the conditional INSERT ALL:
SQL> insert all
2 when rn0 = 1 then
3 into groups (groupkey)
4 values (groupkey)
5 when rn1 = 1 then
6 into groups_jobs (groupkey, job)
7 values (groupkey, job)
8 when rn2 = 1 then
9 into groups_depts (groupkey, deptno)
10 values(groupkey, deptno)
11 with all_groups as (
12 select row_number() over(order by null) groupkey
13 , s2.deptnos
14 , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
15 from (
16 select od.job
17 , deptno_container(
18 cast(collect(od.deptno order by od.deptno) as deptno_tab_type)
19 ) deptnos
20 from orig_data od
21 group by od.job
22 ) s2
23 group by s2.deptnos
24 )
25 select groupkey
26 , value(j) job
27 , value(d) deptno
28 , row_number() over (
29 partition by groupkey
30 order by null
31 ) rn0
32 , row_number() over (
33 partition by groupkey, value(j)
34 order by null
35 ) rn1
36 , row_number() over (
37 partition by groupkey, value(d)
38 order by null
39 ) rn2
40 from all_groups t
41 , table(t.jobs) j
42 , table(t.deptnos.nt) d
43 ;
15 rows created.
Let's just check the result again:
SQL> select g.groupkey
2 , d.deptnos
3 , j.jobs
4 from groups g
5 join (
6 select groupkey
7 , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos
8 from groups_depts
9 group by groupkey
10 ) d
11 on d.groupkey = g.groupkey
12 join (
13 select groupkey
14 , rtrim(xmlagg(xmlelement(j,job,',').extract('//text()') order by job),',') jobs
15 from groups_jobs
16 group by groupkey
17 ) j
18 on j.groupkey = g.groupkey
19 /
GROUPKEY DEPTNOS JOBS
---------- ------------------------ --------------------------------
1 30 SALESMAN
2 10 PRESIDENT
3 10,20,30 CLERK,MANAGER
4 20 ANALYST
Yes! This creates the same groupings as I did with the string aggregation (OK, a different key, but that is just a surrogate key, so no matter ;-). Now I have a method that can "group by groups" using collections rather than string aggregation and thus work on larger groups than can be contained in a VARCHAR2.
But then I used that method on the real large set of data. Bummer :-( It created about 1½ million groups rather than 3/4 million - there were duplicate groups!
So I guessed it might have something to do with the ORDER member method always returning "1" for unequal collections, rather than "-1" for "smaller than" and "1" for "greater than". So I made a new body for the container object type:
SQL> create or replace type body deptno_container as
2 order member function match (o deptno_container) return integer is
3 begin
4 for i in 1..nt.count loop
5 if i > o.nt.count then
6 return 1;
7 elsif nt(i) > o.nt(i) then
8 return 1;
9 elsif nt(i) < o.nt(i) then
10 return -1;
11 end if;
12 end loop;
13 if nt.count < o.nt.count then
14 return -1;
15 else
16 return 0;
17 end if;
18 end;
19 end;
20 /
Type body created.
This one compares the elements of the collections and defines when a collection is "greater than" or "smaller than" another.
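(The ordering semantics can be sanity checked directly - once a type has an ORDER member method, its instances can be compared with the ordinary relational operators. A minimal sketch, again assuming set serveroutput on:)
declare
   a deptno_container := deptno_container(deptno_tab_type(10, 20));
   b deptno_container := deptno_container(deptno_tab_type(10, 30));
begin
   -- the < comparison invokes the match ORDER member method behind the scenes
   if a < b then
      dbms_output.put_line('a < b');   -- expected, since 20 < 30 at position 2
   elsif a > b then
      dbms_output.put_line('a > b');
   else
      dbms_output.put_line('a = b');
   end if;
end;
/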
Let's test the new type body:
SQL> select row_number() over(order by null) groupkey
2 , s2.deptnos
3 , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
4 from (
5 select od.job
6 , deptno_container(
7 cast(collect(od.deptno order by od.deptno) as deptno_tab_type)
8 ) deptnos
9 from orig_data od
10 group by od.job
11 ) s2
12 group by s2.deptnos
13 /
GROUPKEY DEPTNOS(NT) JOBS
---------- ------------------------ --------------------------------
1 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('PRESIDENT')
TAB_TYPE(10))
2 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('CLERK', 'MANAGER')
TAB_TYPE(10, 20, 30))
3 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('ANALYST')
TAB_TYPE(20))
4 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('SALESMAN')
TAB_TYPE(30))
Hooray - we get the same groups! (OK, different group keys - presumably due to the new order method - but that's OK.)
But I was still troubled that it did not really work as it should. If I just ran the above query, it quite correctly created about 3/4 million groups. But when I put the same query into the INSERT ALL construct, somewhat over 800,000 groups were created - duplicates still existed :-(
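(On demo-sized data, where the string aggregation still fits in a VARCHAR2, such duplicates can be spotted by aggregating the child table and looking for deptno sets that occur under more than one groupkey - a minimal sketch using the same xmlagg idiom as before:)
select deptnos
     , count(*) duplicate_groups
  from (
         -- one comma-separated deptno string per groupkey
         select groupkey
              , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos
           from groups_depts
          group by groupkey
       )
 group by deptnos
having count(*) > 1
/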
So I went back to my trusty old copy of Tom Kyte's "Expert One-On-One Oracle" and the chapter on object-relational coding. I decided to try doing this with an O-R view and an instead-of trigger:
First yet another object type:
SQL> create type groups_container as object (
2 groupkey number
3 , deptnos deptno_container
4 , jobs job_tab_type
5 )
6 /
Type created.
Then an object type view:
SQL> create view groups_view
2 of groups_container
3 with object identifier (groupkey)
4 as
5 select g.groupkey
6 , deptno_container(
7 cast(
8 multiset(
9 select gd.deptno
10 from groups_depts gd
11 where gd.groupkey = g.groupkey
12 order by gd.deptno
13 )
14 as deptno_tab_type
15 )
16 ) deptnos
17 , cast(
18 multiset(
19 select gj.job
20 from groups_jobs gj
21 where gj.groupkey = g.groupkey
22 order by gj.job
23 )
24 as job_tab_type
25 ) jobs
26 from groups g
27 /
View created.
And finally the instead-of trigger on the view:
SQL> create trigger groups_view_ins
2 instead of insert
3 on groups_view
4 begin
5 insert into groups (groupkey)
6 values (:new.groupkey);
7
8 forall i in 1..:new.deptnos.nt.count
9 insert into groups_depts (groupkey, deptno)
10 values (:new.groupkey, :new.deptnos.nt(i));
11
12 forall i in 1..:new.jobs.count
13 insert into groups_jobs (groupkey, job)
14 values (:new.groupkey, :new.jobs(i));
15 end;
16 /
Trigger created.
Now I can insert into my three tables by inserting into the O-R view:
SQL> insert into groups_view (
2 groupkey, deptnos, jobs
3 )
4 select rownum groupkey
5 , s3.deptnos
6 , s3.jobs
7 from (
8 select s2.deptnos
9 , cast(collect(s2.job order by s2.job) as job_tab_type) jobs
10 from (
11 select od.job
12 , deptno_container(
13 cast(collect(od.deptno order by od.deptno) as deptno_tab_type)
14 ) deptnos
15 from orig_data od
16 group by od.job
17 ) s2
18 group by s2.deptnos
19 ) s3
20 /
4 rows created.
Note how it says it created 4 "rows" - which is 4 "instances of the object" - but that translates into this much data in the three tables:
SQL> select *
2 from groups
3 order by groupkey
4 /
GROUPKEY
----------
1
2
3
4
SQL> select *
2 from groups_depts
3 order by groupkey, deptno
4 /
GROUPKEY DEPTNO
---------- ----------
1 10
2 10
2 20
2 30
3 20
4 30
6 rows selected.
SQL> select *
2 from groups_jobs
3 order by groupkey, job
4 /
GROUPKEY JOB
---------- ---------
1 PRESIDENT
2 CLERK
2 MANAGER
3 ANALYST
4 SALESMAN
Or if we take the same control statement I've used before:
SQL> select g.groupkey
2 , d.deptnos
3 , j.jobs
4 from groups g
5 join (
6 select groupkey
7 , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos
8 from groups_depts
9 group by groupkey
10 ) d
11 on d.groupkey = g.groupkey
12 join (
13 select groupkey
14 , rtrim(xmlagg(xmlelement(j,job,',').extract('//text()') order by job),',') jobs
15 from groups_jobs
16 group by groupkey
17 ) j
18 on j.groupkey = g.groupkey
19 /
GROUPKEY DEPTNOS JOBS
---------- ------------------------ --------------------------------
1 10 PRESIDENT
2 10,20,30 CLERK,MANAGER
3 20 ANALYST
4 30 SALESMAN
Yes - we got the same groups of groups again. And we now even have a short and sweet method of querying out the same group objects:
SQL> select *
2 from groups_view
3 order by groupkey
4 /
GROUPKEY DEPTNOS(NT) JOBS
---------- ------------------------ --------------------------------
1 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('PRESIDENT')
TAB_TYPE(10))
2 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('CLERK', 'MANAGER')
TAB_TYPE(10, 20, 30))
3 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('ANALYST')
TAB_TYPE(20))
4 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('SALESMAN')
TAB_TYPE(30))
Same result as we've seen before.
And finally I am happy with the result. When I did this with my real world data, I got exactly the 3/4 million unique groups. Yay :-D
I had actually expected the instead-of trigger insertion to slow down the process, as I introduced a kind of "row-by-row" processing here. But in reality it turned out to be slightly faster (about 50 minutes compared to an hour) than the method using TABLE to "break out" the collections followed by INSERT ALL. There may have been differences in the load on the server at the time, so I won't conclude that one method or the other is faster - I'll just note that for this particular use case the instead-of trigger method at least didn't slow things down ;-)
And that's a wrap - now I can start using the generated groups of data :-D