Collation in 12cR2 - when AA equals Å (or not?)
Collation in Oracle 12cR2 gives some more finegrained possibilities for setting up how your data is to sorted and compared linguistically. A lot we could do before with NLS_SORT and NLS_COMP and the likes, but the collation features can both be simpler to use as well as offer more detailed control.
But do some testing for your specific language cases - you might find some small surprises like I did when trying it out with danish.
Here's how I tried danish column level collation...
To explain a bit to the international readers: the danish alphabet has 29 letters, the standard 26 A-Z followed by three special characters Æ, Ø and Å. In the old days, we didn't have the Å and instead the same sound was written AA (double-A). Today words in the language all use Å, but quite a few names (people or places) still use AA where others use Å, and you can't hear the difference when someone pronounces his name whether he spells it Vestergård or Vestergaard.
So for a proper danish sorting of names, AA should be considered identical to the letter Å. Linguistic NLS sorting in Oracle has been capable of this for many versions, collation does not bring any news on this front.
But you can often have a table with some name columns that need to be linguistically sorted and compared, and at the same time some alphabetic key columns (primary or foreign) that need to be binary sorted and compared. You could solve that by putting NLS_SORT in the queries, but with collation you can specify right down to column level how you want the data sorted and compared, so the queries don't need to do anything but just query the data.
For collation to work, I've had to set the database parameter MAX_STRING_SIZE to EXTENDED.
Before the demonstration, I'm first making certain my session is not using Danish NLS sorting (just to be sure not to be accused of cheating ;-) :
For collation, I can use the values that are valid for NLS sorting too. So I can find the possibilities for Danish like this:
A little testing showed I want the DANISH_M value (later I'll show the difference to the DANISH value.)
I create a table specifying the default collation for the table to be BINARY_CI and the specific collation for column CITY to be DANISH_M_CI - this would typify cases where an alphabetic key is used and treated as "generic ascii characters" without any linguistic properties, while the information carrying column is to be considered in a specific language. In both cases the suffix _CI indicates to the database that I want Case Insensitive collation (since data has been entered in a horrible mishmash of upper and lower case.)
Column STORE_ID inherits the table default collation, so when I order by STORE_ID, it's sorted with BINARY_CI collation:
But column CITY has a different collation specified, so when I order by CITY, it's sorted with DANISH_M_CI collation. The sorting is case insensitive, and the double-A's are sorted like they were an Å:
Three cities have been affected by double-A considered identical to Å:
The collation on the column can be overruled with the COLLATE operator (as an alternative to sorting by NLSSORT function.) So we can sort by the STORE_ID column in a DANISH_M_CI collation, so the double-A's are sorted like they were a single Å:
Overruling does not change the collation of the column itself, it only is the collation of the expression that is affected:
An expression with a COLLATE operator can be given column alias and the alias keeps the collation information:
Using the COLLATE operator to overrule a column collation allows us to see, that the difference between DANISH_CI and DANISH_M_CI is, that although both are case insensitive due to the _CI prefix, DANISH sorts double-A's as two single A's (unlike DANISH_M that considers a double-A to be the same as an Å):
The COLLATE operator also allows certain pseudo-collations to be used. Unlike a named collation like DANISH_M_CI, the collation USING_NLS_SORT is not a specific collation, but tells the database to use the collation of the currently active NLS settings, in this case BINARY:
The collations are not just used for sorting, but also for comparison (similar to NLS_COMP.) So when we look for cities containing a U, this comparison is case insensitive:
Not just LIKE but other functions as well use the collation information, for example INSTR:
And some comparison operators as well (the sequence of Å's following the A is because Å is the last letter of the danish alphabet.) This finds cities beginning with an A, and since the column has DANISH_M_CI comparison, double-A's are not consider as two single-A's, so those two cities are not found here:
But using >= and <= comparison to look for cities beginning with an Å finds the two cities with double-A:
The interesting surprise I found, is a bit different behaviour in LIKE operator. If we use LIKE to search for cities beginning with A, we do not get those with double-A's, so that's OK:
Then I would expect them to be part of the result when I use LIKE to search for cities beginning with Å - but no, they are not there either:
In order to find them with LIKE, I need explicitly to look for two A's:
Using >= and <= involves ordering/sorting to identify what is greater than or less than, so that should work fine, and it does.
Using LIKE could conceptually be thought of as the same (in which case it should have given same result) or it could be thought of something like SUBSTR(city,1,1)='a' (in which case the double-A's should have been found when searching for cities beginning with A.) The weird part is that LIKE seems to do neither, but has a logic all of its own? I don't know if that's expected behaviour or perhaps sort of buggy...
Anyway, my general conclusion is that using the new collation features makes some things easier to do and implement than using NLS_SORT / NLS_COMP. I'd suggest looking into collation if you have an application with those needs.
But do some testing for your specific language cases - you might find some small surprises like I did when trying it out with danish.
Here's how I tried danish column level collation...
To explain a bit to the international readers: the danish alphabet has 29 letters, the standard 26 A-Z followed by three special characters Æ, Ø and Å. In the old days, we didn't have the Å and instead the same sound was written AA (double-A). Today words in the language all use Å, but quite a few names (people or places) still use AA where others use Å, and you can't hear the difference when someone pronounces his name whether he spells it Vestergård or Vestergaard.
So for a proper danish sorting of names, AA should be considered identical to the letter Å. Linguistic NLS sorting in Oracle has been capable of this for many versions, collation does not bring any news on this front.
But you can often have a table with some name columns that need to be linguistically sorted and compared, and at the same time some alphabetic key columns (primary or foreign) that need to be binary sorted and compared. You could solve that by putting NLS_SORT in the queries, but with collation you can specify right down to column level how you want the data sorted and compared, so the queries don't need to do anything but just query the data.
For collation to work, I've had to set the database parameter MAX_STRING_SIZE to EXTENDED.
Before the demonstration, I'm first making certain my session is not using Danish NLS sorting (just to be sure not to be accused of cheating ;-) :
alter session set nls_sort = binary /
For collation, I can use the values that are valid for NLS sorting too. So I can find the possibilities for Danish like this:
select value from v$nls_valid_values where parameter = 'SORT' and value like '%DANISH%' order by value /
VALUE ---------------------------------------------------------------- DANISH DANISH_M UCA0610_DANISH UCA0620_DANISH UCA0700_DANISH XDANISH
A little testing showed I want the DANISH_M value (later I'll show the difference to the DANISH value.)
I create a table specifying the default collation for the table to be BINARY_CI and the specific collation for column CITY to be DANISH_M_CI - this would typify cases where an alphabetic key is used and treated as "generic ascii characters" without any linguistic properties, while the information carrying column is to be considered in a specific language. In both cases the suffix _CI indicates to the database that I want Case Insensitive collation (since data has been entered in a horrible mishmash of upper and lower case.)
create table stores ( store_id varchar2(5 char) primary key , city varchar2(20 char) collate danish_m_ci ) default collation binary_ci / insert into stores values ('AA001', 'København') / insert into stores values ('AA002', 'Korsør') / insert into stores values ('AB001', 'Aarhus') / insert into stores values ('AB002', 'Ålestrup') / insert into stores values ('BA001', 'karup') / insert into stores values ('BA002', 'KYBEHUSE') / insert into stores values ('BB001', 'AALBORG') / insert into stores values ('BB002', 'andst') / insert into stores values ('ÅÅ001', 'AUNING') / insert into stores values ('ÅÅ002', 'Kaastrup') / commit /
Column STORE_ID inherits the table default collation, so when I order by STORE_ID, it's sorted with BINARY_CI collation:
select store_id, city from stores order by store_id /
STORE CITY ----- -------------------- AA001 København AA002 Korsør AB001 Aarhus AB002 Ålestrup BA001 karup BA002 KYBEHUSE BB001 AALBORG BB002 andst ÅÅ001 AUNING ÅÅ002 Kaastrup
But column CITY has a different collation specified, so when I order by CITY, it's sorted with DANISH_M_CI collation. The sorting is case insensitive, and the double-A's are sorted like they were an Å:
select store_id, city from stores order by city /
STORE CITY ----- -------------------- BB002 andst ÅÅ001 AUNING BA001 karup AA002 Korsør BA002 KYBEHUSE AA001 København ÅÅ002 Kaastrup BB001 AALBORG AB002 Ålestrup AB001 Aarhus
Three cities have been affected by double-A considered identical to Å:
- Kaastrup sorted as Kåstrup
- AALBORG sorted as ÅLBORG
- Aarhus sorted as ÅRHUS
select store_id, city , collation(store_id) store_coll , collation(city) city_coll from stores order by store_id /
STORE CITY STORE_COL CITY_COLL ----- -------------------- --------- ----------- AA001 København BINARY_CI DANISH_M_CI AA002 Korsør BINARY_CI DANISH_M_CI AB001 Aarhus BINARY_CI DANISH_M_CI AB002 Ålestrup BINARY_CI DANISH_M_CI BA001 karup BINARY_CI DANISH_M_CI BA002 KYBEHUSE BINARY_CI DANISH_M_CI BB001 AALBORG BINARY_CI DANISH_M_CI BB002 andst BINARY_CI DANISH_M_CI ÅÅ001 AUNING BINARY_CI DANISH_M_CI ÅÅ002 Kaastrup BINARY_CI DANISH_M_CI
The collation on the column can be overruled with the COLLATE operator (as an alternative to sorting by NLSSORT function.) So we can sort by the STORE_ID column in a DANISH_M_CI collation, so the double-A's are sorted like they were a single Å:
select store_id, city from stores order by store_id collate danish_m_ci /
STORE CITY ----- -------------------- AB001 Aarhus AB002 Ålestrup BA001 karup BA002 KYBEHUSE BB001 AALBORG BB002 andst AA001 København AA002 Korsør ÅÅ001 AUNING ÅÅ002 Kaastrup
Overruling does not change the collation of the column itself, it only is the collation of the expression that is affected:
select store_id, city , collation(store_id) store_coll , collation(store_id collate danish_m_ci) expr_coll from stores order by store_id collate danish_m_ci /
STORE CITY STORE_COL EXPR_COLL ----- -------------------- --------- ----------- AB001 Aarhus BINARY_CI DANISH_M_CI AB002 Ålestrup BINARY_CI DANISH_M_CI BA001 karup BINARY_CI DANISH_M_CI BA002 KYBEHUSE BINARY_CI DANISH_M_CI BB001 AALBORG BINARY_CI DANISH_M_CI BB002 andst BINARY_CI DANISH_M_CI AA001 København BINARY_CI DANISH_M_CI AA002 Korsør BINARY_CI DANISH_M_CI ÅÅ001 AUNING BINARY_CI DANISH_M_CI ÅÅ002 Kaastrup BINARY_CI DANISH_M_CI
An expression with a COLLATE operator can be given column alias and the alias keeps the collation information:
select store_id, city, store_id_m , collation(store_id) store_coll , collation(store_id_m) store_m_coll from ( select store_id, city , store_id collate danish_m_ci as store_id_m from stores ) order by store_id_m /
STORE CITY STORE STORE_COL STORE_M_COL ----- -------------------- ----- --------- ----------- AB001 Aarhus AB001 BINARY_CI DANISH_M_CI AB002 Ålestrup AB002 BINARY_CI DANISH_M_CI BA001 karup BA001 BINARY_CI DANISH_M_CI BA002 KYBEHUSE BA002 BINARY_CI DANISH_M_CI BB001 AALBORG BB001 BINARY_CI DANISH_M_CI BB002 andst BB002 BINARY_CI DANISH_M_CI AA001 København AA001 BINARY_CI DANISH_M_CI AA002 Korsør AA002 BINARY_CI DANISH_M_CI ÅÅ001 AUNING ÅÅ001 BINARY_CI DANISH_M_CI ÅÅ002 Kaastrup ÅÅ002 BINARY_CI DANISH_M_CI
Using the COLLATE operator to overrule a column collation allows us to see, that the difference between DANISH_CI and DANISH_M_CI is, that although both are case insensitive due to the _CI prefix, DANISH sorts double-A's as two single A's (unlike DANISH_M that considers a double-A to be the same as an Å):
select store_id, city from stores order by city collate danish_ci /
STORE CITY ----- -------------------- BB001 AALBORG AB001 Aarhus BB002 andst ÅÅ001 AUNING ÅÅ002 Kaastrup BA001 karup AA002 Korsør BA002 KYBEHUSE AA001 København AB002 Ålestrup
The COLLATE operator also allows certain pseudo-collations to be used. Unlike a named collation like DANISH_M_CI, the collation USING_NLS_SORT is not a specific collation, but tells the database to use the collation of the currently active NLS settings, in this case BINARY:
select store_id, city from stores order by city collate using_nls_sort /
STORE CITY ----- -------------------- BB001 AALBORG ÅÅ001 AUNING AB001 Aarhus BA002 KYBEHUSE ÅÅ002 Kaastrup AA002 Korsør AA001 København BB002 andst BA001 karup AB002 Ålestrup
The collations are not just used for sorting, but also for comparison (similar to NLS_COMP.) So when we look for cities containing a U, this comparison is case insensitive:
select store_id, city from stores where city like '%U%' order by city /
STORE CITY ----- -------------------- ÅÅ001 AUNING BA001 karup BA002 KYBEHUSE ÅÅ002 Kaastrup AB002 Ålestrup AB001 Aarhus
Not just LIKE but other functions as well use the collation information, for example INSTR:
select store_id, city from stores where instr(city, 'h') > 0 order by city /
STORE CITY ----- -------------------- BA002 KYBEHUSE AA001 København AB001 Aarhus
And some comparison operators as well (the sequence of Å's following the A is because Å is the last letter of the danish alphabet.) This finds cities beginning with an A, and since the column has DANISH_M_CI comparison, double-A's are not consider as two single-A's, so those two cities are not found here:
select store_id, city from stores where city >= 'a' and city <= 'aååååååååååååååååååå' order by city /
STORE CITY ----- -------------------- BB002 andst ÅÅ001 AUNING
But using >= and <= comparison to look for cities beginning with an Å finds the two cities with double-A:
select store_id, city from stores where city >= 'å' and city <= 'åååååååååååååååååååå' order by city /
STORE CITY ----- -------------------- BB001 AALBORG AB002 Ålestrup AB001 Aarhus
The interesting surprise I found, is a bit different behaviour in LIKE operator. If we use LIKE to search for cities beginning with A, we do not get those with double-A's, so that's OK:
select store_id, city from stores where city like 'a%' order by city /
STORE CITY ----- -------------------- BB002 andst ÅÅ001 AUNING
Then I would expect them to be part of the result when I use LIKE to search for cities beginning with Å - but no, they are not there either:
select store_id, city from stores where city like 'å%' order by city /
STORE CITY ----- -------------------- AB002 Ålestrup
In order to find them with LIKE, I need explicitly to look for two A's:
select store_id, city from stores where city like 'aa%' order by city /
STORE CITY ----- -------------------- BB001 AALBORG AB001 Aarhus
Using >= and <= involves ordering/sorting to identify what is greater than or less than, so that should work fine, and it does.
Using LIKE could conceptually be thought of as the same (in which case it should have given same result) or it could be thought of something like SUBSTR(city,1,1)='a' (in which case the double-A's should have been found when searching for cities beginning with A.) The weird part is that LIKE seems to do neither, but has a logic all of its own? I don't know if that's expected behaviour or perhaps sort of buggy...
Anyway, my general conclusion is that using the new collation features makes some things easier to do and implement than using NLS_SORT / NLS_COMP. I'd suggest looking into collation if you have an application with those needs.
Comments
Post a Comment