Partitioning external tables in 12.2
One of the new features in Oracle 12.2 is partitioning of external tables - a quite useful feature if you have multiple identical files of data, for example from multiple sources.
But how can you partition external files, you ask? Well, you don't partition the files, but each file can be considered a partition by the database. I'll show you an example.
Like always, to use external tables we need a DIRECTORY and our user needs privileges on the directory. So as a DBA user we do this:
In that directory we have 4 files:
Those CSV files contain sales data per region and country. We're getting those files from 4 offices around the world.
Region AMER (Americas) we get in two files - amen.csv is AMER North, ames.csv is AMER South. File emea.csv is data from region EMEA (Europe, Middle-East and Africa.) And file asoc.csv has data from two regions - ASIA and OCEA (Oceania.)
So, as user HR, let's use these files in an external table to be able to query those sales data.
First we try it without partitioning so we can observe the difference:
We put all the files in the LOCATION clause, that way when we select from the external table, all the files will be read.
So we try to select all the data:
Yup, all the rows from all 4 files are shown, just like we want it.
If we look at the plan used, we see a full table scan of the external table.
Let's query one of the regions - EMEA:
Looking at the plan, again we see full table access, this time with a filter on the region:
So all the files have been read, and then those rows that weren't from EMEA was discarded. A bit of a wasted effort, since we happen to know that the data for region EMEA only exists in one of the 4 files.
So we drop this approach and try it a bit differently.
Again we create the external table, but this time instead of one global LOCATION clause with all 4 files, we use the PARTITION clause to setup LIST partitioning on the external table:
We create three partitions, where we tell the database that region AMER is in two files, regions ASIA and OCEA both are in one file, and region EMEA is in the last file.
So we try to select everything again:
No change in the output, of course.
Looking at the plan, we can see that it is still full table access, but partitioning information is included and we see it scans all three partitions:
Then the interesting thing happens when we select data only for region EMEA:
The plan this time changes to PARTITION LIST SINGLE and in Pstart/Pstop is shown that only the third partition is accessed. This means that only file emea.csv has been read.
Note that there is no filter here, it is not necessary as the database knows, that partition 3 contains all the data for region EMEA and nothing but data for region EMEA. This is about as efficient as it can get.
Our second partition (file asoc.csv) had data from two regions in it, ASIA and OCEA. Let's see what happens if we query region OCEA:
Again a PARTITION LIST SINGLE and Pstart/Pstop shows we scan the second partition.
But since the partition contains two regions, this time a FILTER is needed. Still it is better than without partitioning, because we only read the single file necessary and only discard ASIA rows, rather than reading all files and discarding almost all rows.
But one thing you must beware of is, that for external tables the partitioning is not enforced - it cannot be. The database trusts you and believes you are telling it the truth.
So what happens if you lie to the database? Let's try to edit the emea.csv file:
We add a line in region AMER (that should have been in file ames.csv), as well as a line in region MARS (that doesn't exist in the list partioning definition.)
So let's query region EMEA again after adding those rows:
What? The output shows two rows that doesn't satisfy the WHERE clause? How did that happen?
Well, the plan is just like it was before - full access of the EMEA partition and no FILTER operation:
So the database trusted us when we told it that file emea.csv contained only EMEA region data, therefore it didn't do any superfluous checking. That we lied to the database is our problem, so we're to blame for the "wrong" output.
We can also try to query region AMER:
And the row for Argentina is not shown in the output even though it has region AMER?
Again the reason is the plan accesses partition 1 only, because the database is certain that region AMER data only exists in files amen.csv and ames.csv:
Similarly we cannot query region MARS even though a row with region MARS exists in the emea.csv file:
This time the plan directly tells us that we are querying a region that doesn't exist according to the metadata we have given the database:
So in all, partitioning external tables is very handy to avoid unnecessarily reading files that we know doesn't contain the data we are querying. We just have to be certain the data really is distributed in the files matching what we describe in the list partitioning clauses - if it isn't, we risk wrong output.
UPDATE:
Besides this partitioning and constraints allowed and such documented new 12.2 features on external tables, there's some very cool other new 12.2 features that just happened to be forgotten in the documentation. Read here about runtime parameter overriding:
https://blogs.oracle.com/datawarehousing/the-first-really-hidden-gem-in-oracle-database-12c-release-2:-runtime-modification-of-external-table-parameters
But how can you partition external files, you ask? Well, you don't partition the files, but each file can be considered a partition by the database. I'll show you an example.
Like always, to use external tables we need a DIRECTORY and our user needs privileges on the directory. So as a DBA user we do this:
create directory ext_table_dir as '/u01/extdata/orcl' / grant read, write on directory ext_table_dir to hr /
In that directory we have 4 files:
[oracle@vbgeneric orcl]$ pwd /u01/extdata/orcl [oracle@vbgeneric orcl]$ ls -l *.csv -rw-r--r-- 1 oracle oinstall 59 Aug 18 15:43 amen.csv -rw-r--r-- 1 oracle oinstall 61 Aug 18 15:44 ames.csv -rw-r--r-- 1 oracle oinstall 106 Aug 18 15:49 asoc.csv -rw-r--r-- 1 oracle oinstall 103 Aug 18 15:46 emea.csv
Those CSV files contain sales data per region and country. We're getting those files from 4 offices around the world.
Region AMER (Americas) we get in two files - amen.csv is AMER North, ames.csv is AMER South. File emea.csv is data from region EMEA (Europe, Middle-East and Africa.) And file asoc.csv has data from two regions - ASIA and OCEA (Oceania.)
[oracle@vbgeneric orcl]$ head *.csv ==> amen.csv <== REGION;COUNTRY;SALES AMER;Canada;1000000 AMER;USA;2000000 ==> ames.csv <== REGION;COUNTRY;SALES AMER;Brasil;750000 AMER;Uruguay;250000 ==> asoc.csv <== REGION;COUNTRY;SALES ASIA;China;2750000 ASIA;Korea;1100000 OCEA;Australia;900000 OCEA;New Zealand;700000 ==> emea.csv <== REGION;COUNTRY;SALES EMEA;Denmark;1500000 EMEA;Germany;1750000 EMEA;Turkey;300000 EMEA;Nigeria;100000
So, as user HR, let's use these files in an external table to be able to query those sales data.
First we try it without partitioning so we can observe the difference:
create table sales_ext ( region varchar2(4) , country varchar2(20) , sales number ) organization external ( type oracle_loader default directory ext_table_dir access parameters ( records delimited by newline field names all files fields terminated by ';' missing field values are null reject rows with all null fields ) location ('amen.csv', 'ames.csv', 'asoc.csv', 'emea.csv') ) /
We put all the files in the LOCATION clause, that way when we select from the external table, all the files will be read.
So we try to select all the data:
select /*+ gather_plan_statistics */ * from sales_ext order by region, country /
REGI COUNTRY SALES ---- -------------------- ---------- AMER Brasil 750000 AMER Canada 1000000 AMER USA 2000000 AMER Uruguay 250000 ASIA China 2750000 ASIA Korea 1100000 EMEA Denmark 1500000 EMEA Germany 1750000 EMEA Nigeria 100000 EMEA Turkey 300000 OCEA Australia 900000 OCEA New Zealand 700000 12 rows selected.
Yup, all the rows from all 4 files are shown, just like we want it.
If we look at the plan used, we see a full table scan of the external table.
select * from table(dbms_xplan.display_cursor(format=>'IOSTATS PARTITION LAST')) /
PLAN_TABLE_OUTPUT --------------------------------------------------------------------------------------------------- SQL_ID 0wax5hfmwn32u, child number 0 ------------------------------------- select /*+ gather_plan_statistics */ * from sales_ext order by region, country Plan hash value: 1357455879 --------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | --------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 12 |00:00:00.12 | 486 | | 1 | SORT ORDER BY | | 1 | 408K| 12 |00:00:00.12 | 486 | | 2 | EXTERNAL TABLE ACCESS FULL| SALES_EXT | 1 | 408K| 12 |00:00:00.12 | 486 | ---------------------------------------------------------------------------------------------------
Let's query one of the regions - EMEA:
select /*+ gather_plan_statistics */ * from sales_ext where region = 'EMEA' order by region, country /
REGI COUNTRY SALES ---- -------------------- ---------- EMEA Denmark 1500000 EMEA Germany 1750000 EMEA Nigeria 100000 EMEA Turkey 300000
Looking at the plan, again we see full table access, this time with a filter on the region:
--------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | --------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 4 |00:00:00.11 | 486 | | 1 | SORT ORDER BY | | 1 | 4084 | 4 |00:00:00.11 | 486 | |* 2 | EXTERNAL TABLE ACCESS FULL| SALES_EXT | 1 | 4084 | 4 |00:00:00.11 | 486 | --------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter("REGION"='EMEA')
So all the files have been read, and then those rows that weren't from EMEA was discarded. A bit of a wasted effort, since we happen to know that the data for region EMEA only exists in one of the 4 files.
So we drop this approach and try it a bit differently.
drop table sales_ext /
Again we create the external table, but this time instead of one global LOCATION clause with all 4 files, we use the PARTITION clause to setup LIST partitioning on the external table:
create table sales_ext ( region varchar2(4) , country varchar2(20) , sales number ) organization external ( type oracle_loader default directory ext_table_dir access parameters ( records delimited by newline field names all files fields terminated by ';' missing field values are null reject rows with all null fields ) ) partition by list (region) ( partition sales_ext_part_amer values ('AMER') location ('amen.csv', 'ames.csv') , partition sales_ext_part_asoc values ('ASIA', 'OCEA') location ('asoc.csv') , partition sales_ext_part_emea values ('EMEA') location ('emea.csv') ) /
We create three partitions, where we tell the database that region AMER is in two files, regions ASIA and OCEA both are in one file, and region EMEA is in the last file.
So we try to select everything again:
select /*+ gather_plan_statistics */ * from sales_ext order by region, country /
REGI COUNTRY SALES ---- -------------------- ---------- AMER Brasil 750000 AMER Canada 1000000 AMER USA 2000000 AMER Uruguay 250000 ASIA China 2750000 ASIA Korea 1100000 EMEA Denmark 1500000 EMEA Germany 1750000 EMEA Nigeria 100000 EMEA Turkey 300000 OCEA Australia 900000 OCEA New Zealand 700000 12 rows selected.
No change in the output, of course.
Looking at the plan, we can see that it is still full table access, but partitioning information is included and we see it scans all three partitions:
-------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | Pstart| Pstop | A-Rows | A-Time | Buffers | -------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | | | 12 |00:00:00.14 | 486 | | 1 | SORT ORDER BY | | 1 | 24507 | | | 12 |00:00:00.14 | 486 | | 2 | PARTITION LIST ALL | | 1 | 24507 | 1 | 3 | 12 |00:00:00.14 | 486 | | 3 | EXTERNAL TABLE ACCESS FULL| SALES_EXT | 3 | 24507 | 1 | 3 | 12 |00:00:00.14 | 486 | --------------------------------------------------------------------------------------------------------------------
Then the interesting thing happens when we select data only for region EMEA:
select /*+ gather_plan_statistics */ * from sales_ext where region = 'EMEA' order by region, country /
REGI COUNTRY SALES ---- -------------------- ---------- EMEA Denmark 1500000 EMEA Germany 1750000 EMEA Nigeria 100000 EMEA Turkey 300000
The plan this time changes to PARTITION LIST SINGLE and in Pstart/Pstop is shown that only the third partition is accessed. This means that only file emea.csv has been read.
-------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | Pstart| Pstop | A-Rows | A-Time | Buffers | -------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | | | 4 |00:00:00.32 | 486 | | 1 | PARTITION LIST SINGLE | | 1 | 82 | KEY | KEY | 4 |00:00:00.32 | 486 | | 2 | SORT ORDER BY | | 1 | 82 | | | 4 |00:00:00.32 | 486 | | 3 | EXTERNAL TABLE ACCESS FULL| SALES_EXT | 1 | 82 | 3 | 3 | 4 |00:00:00.32 | 486 | --------------------------------------------------------------------------------------------------------------------
Note that there is no filter here, it is not necessary as the database knows, that partition 3 contains all the data for region EMEA and nothing but data for region EMEA. This is about as efficient as it can get.
Our second partition (file asoc.csv) had data from two regions in it, ASIA and OCEA. Let's see what happens if we query region OCEA:
select /*+ gather_plan_statistics */ * from sales_ext where region = 'OCEA' order by region, country /
REGI COUNTRY SALES ---- -------------------- ---------- OCEA Australia 900000 OCEA New Zealand 700000
Again a PARTITION LIST SINGLE and Pstart/Pstop shows we scan the second partition.
-------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | Pstart| Pstop | A-Rows | A-Time | Buffers | -------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | | | 2 |00:00:00.16 | 486 | | 1 | PARTITION LIST SINGLE | | 1 | 82 | KEY | KEY | 2 |00:00:00.16 | 486 | | 2 | SORT ORDER BY | | 1 | 82 | | | 2 |00:00:00.16 | 486 | |* 3 | EXTERNAL TABLE ACCESS FULL| SALES_EXT | 1 | 82 | 2 | 2 | 2 |00:00:00.16 | 486 | -------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - filter("REGION"='OCEA')
But since the partition contains two regions, this time a FILTER is needed. Still it is better than without partitioning, because we only read the single file necessary and only discard ASIA rows, rather than reading all files and discarding almost all rows.
But one thing you must beware of is, that for external tables the partitioning is not enforced - it cannot be. The database trusts you and believes you are telling it the truth.
So what happens if you lie to the database? Let's try to edit the emea.csv file:
[oracle@vbgeneric orcl]$ vi emea.csv
We add a line in region AMER (that should have been in file ames.csv), as well as a line in region MARS (that doesn't exist in the list partioning definition.)
REGION;COUNTRY;SALES EMEA;Denmark;1500000 EMEA;Germany;1750000 EMEA;Turkey;300000 EMEA;Nigeria;100000 AMER;Argentina;80000 MARS;Red Planet;55555
So let's query region EMEA again after adding those rows:
select /*+ gather_plan_statistics */ * from sales_ext where region = 'EMEA' order by region, country /
REGI COUNTRY SALES ---- -------------------- ---------- AMER Argentina 80000 EMEA Denmark 1500000 EMEA Germany 1750000 EMEA Nigeria 100000 EMEA Turkey 300000 MARS Red Planet 55555
What? The output shows two rows that doesn't satisfy the WHERE clause? How did that happen?
Well, the plan is just like it was before - full access of the EMEA partition and no FILTER operation:
-------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | Pstart| Pstop | A-Rows | A-Time | Buffers | -------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | | | 6 |00:00:00.22 | 506 | | 1 | PARTITION LIST SINGLE | | 1 | 82 | KEY | KEY | 6 |00:00:00.22 | 506 | | 2 | SORT ORDER BY | | 1 | 82 | | | 6 |00:00:00.22 | 506 | | 3 | EXTERNAL TABLE ACCESS FULL| SALES_EXT | 1 | 82 | 3 | 3 | 6 |00:00:00.22 | 506 | --------------------------------------------------------------------------------------------------------------------
So the database trusted us when we told it that file emea.csv contained only EMEA region data, therefore it didn't do any superfluous checking. That we lied to the database is our problem, so we're to blame for the "wrong" output.
We can also try to query region AMER:
select /*+ gather_plan_statistics */ * from sales_ext where region = 'AMER' order by region, country /
REGI COUNTRY SALES ---- -------------------- ---------- AMER Brasil 750000 AMER Canada 1000000 AMER USA 2000000 AMER Uruguay 250000
And the row for Argentina is not shown in the output even though it has region AMER?
Again the reason is the plan accesses partition 1 only, because the database is certain that region AMER data only exists in files amen.csv and ames.csv:
-------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | Pstart| Pstop | A-Rows | A-Time | Buffers | -------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | | | 4 |00:00:00.11 | 486 | | 1 | PARTITION LIST SINGLE | | 1 | 82 | KEY | KEY | 4 |00:00:00.11 | 486 | | 2 | SORT ORDER BY | | 1 | 82 | | | 4 |00:00:00.11 | 486 | | 3 | EXTERNAL TABLE ACCESS FULL| SALES_EXT | 1 | 82 | 1 | 1 | 4 |00:00:00.11 | 486 | --------------------------------------------------------------------------------------------------------------------
Similarly we cannot query region MARS even though a row with region MARS exists in the emea.csv file:
select /*+ gather_plan_statistics */ * from sales_ext where region = 'MARS' order by region, country /
no rows selected
This time the plan directly tells us that we are querying a region that doesn't exist according to the metadata we have given the database:
---------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | Pstart| Pstop | A-Rows | A-Time | ---------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | | | 0 |00:00:00.01 | | 1 | PARTITION LIST EMPTY | | 1 | 82 |INVALID|INVALID| 0 |00:00:00.01 | | 2 | SORT ORDER BY | | 0 | 82 | | | 0 |00:00:00.01 | |* 3 | EXTERNAL TABLE ACCESS FULL| SALES_EXT | 0 | 82 |INVALID|INVALID| 0 |00:00:00.01 | ---------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - filter("REGION"='MARS')
So in all, partitioning external tables is very handy to avoid unnecessarily reading files that we know doesn't contain the data we are querying. We just have to be certain the data really is distributed in the files matching what we describe in the list partitioning clauses - if it isn't, we risk wrong output.
UPDATE:
Besides this partitioning and constraints allowed and such documented new 12.2 features on external tables, there's some very cool other new 12.2 features that just happened to be forgotten in the documentation. Read here about runtime parameter overriding:
https://blogs.oracle.com/datawarehousing/the-first-really-hidden-gem-in-oracle-database-12c-release-2:-runtime-modification-of-external-table-parameters
Comments
Post a Comment