ODCI dynamic table function parser
Supposing you've got data as a text string with "row" and "column" delimiters and you would like to parse out those columns and rows. The external table syntax would actually be great for this, but that requires writing the text out to a file first, and then you can import it back again as columns and rows.
There are various other alternatives using SQL and/or XML manipulations, but one method that I don't see much used is to use the Data Cartridge functionality to in a sense expand the SQL language to do this. This solution is greatly inspired by an OTN forum post I stumbled upon, where Anton Scheffer demonstrates the use of ODCITable* functions.
The idea is that you can create an object type that implements certain specifically named functions. Then Oracle can call those functions when it needs to describe, prepare, start and fetch from your table function. What that means is, that at hard parse time, you can programmatically generate the metadata for what columns and datatypes your function will return.
Sounds weird? Let's get on with the code and you'll see what I mean.
The goal is to make a table function that enables you to do this:
In that example the column delimiter is colon and the row delimiter is comma. There are three "columns": name, age and gender. The call to delimited_col_row.parser() specifies the text string as well as a string containing the specification of the columns.
How can this be done? By creating an object type that implements ODCITable* functions:
The specification of our object type contains the 5 ODCITable* function specifications needed. They must have those names and contain certain parameters.
The parser() function is the actual function that whomever uses this functionality will call. Typically this is shown in manuals/examples as either a standalone function or a packaged function. Anton showed it can also be a static function within the object itself, which makes a nice simple way to encapsulate all the code needed in a single program unit - the object type. Note the parser() function will not have a body, but the "using delimited_col_row" tells Oracle that the object "delimited_col_row" contains the implementation of this function.
The global variables in the object will contain the object state needed to maintain state between the various calls to the different ODCITable* functions.
The body will then implement the 5 ODCITable* functions. We'll walk through them one at a time:
When Oracle hard parses the call to delimited_col_row.parser() it needs to find out what datatypes that table function will return. So at hard parse time, Oracle will call ODCITableDescribe with the parameters from the delimited_col_row.parser() call. This function then needs to return an ANYTYPE that will contain the datatype the function will return.
So in line 20 we start creating a type (typecode_object) that will contain our columns. Lines 22-28 parses the column specification in p_cols parameter, so the loop runs as many times as there are to be columns, and lines 30-34 emulates Oracle behaviour that unquoted identifiers are uppercased. Lines 36-76 case statement will add an attribute (a column) to our return datatype, depending on whether we have specified a VARCHAR2, NUMBER or DATE column. Line 78 finished the creation of the object type.
Then we start a type of typecode_table in line 80 and line 82-91 specifies that it will be a table of the object type we just created.
With a regular table function, Oracle will also do a describe call at hard parse to find out the actual object type. Normally that is found out be looking up metadata in data dictionary from when the function was compiled, but because we have implemented ODCITableDescribe, it will be our function that will be called at hard parse and we can dynamically create the collection type that describes what our parser() function will return.
The describe call is done at hard parse time. It is a static function and does not preserve state, but Oracle remembers what was returned by ODCITableDescribe and saves it an object type sys.odcitabfuncinfo. This is saved along with a lot of other information in the statement cache, which means that when we soft parse, sys.odcitabfuncinfo value is taken from statement cache, while ODCITableDescribe is only called at hard parse.
So preparing the SQL statement first either does hard parse (calling ODCITableDescribe) or soft parse (retrieving sys.odcitabfuncinfo value from cache) and then it calls ODCITablePrepare giving as input the sys.odcitabfuncinfo value as well as the parameters from the parser() call. Using that information, we need to create an object of type delimited_col_row to pass out as p_context.
The p_tabinfo object contains an element called rettype, which contains the collection type we returned in ODCITableDescribe. Line 127-136 retrieves the attributes of that collection type. Most of the output parameters of the getattreleminfo() call we do not use, but we need to specify them as they are OUT parameters. We will be using l_type.
Line 138-148 we get information about our collection type. What we will use is the value of l_numcols, which enables us to extend l_col_types to the number of columns needed (line 150.)
Then we loop over the number of columns and line 153-162 finds the column type ID (number) of each column and saves it in our l_col_types collection.
And finally line 165-176 creates an instance of our object type initializing the global variables and returning it as the context needed for keeping object state. In the state we save the type and the number of columns and their types and the delimiters.
Next up is starting the statement, which will be passed the object instance p_context along with the variables from the actual call to parser() function. Here we need to reset the part of our package state that can be different between calls, and in this case that is simply setting our buffer to the actual text and resetting the counters we use for the parsing. We don't do anything with p_cols, p_col_delim and p_row_delim in this case, as we can't allow the calling procedure to change this after hard parse has happened.
ODCITableFetch is no longer a static function, it is a member function and so has access to all the state of the object via self. And fetching can (often will) be called multiple times getting the rows back piecewise. ODCITableFetch is not required to adhere to the p_numrows parameter, it is allowed to return more (for example all the rows) and Oracle will simply cache the result itself and feed the rows back to the caller in the required chunks.
In the state we use g_pos1 and g_pos2 to keep track of how far into the string we have parsed so far. The caller will often end by calling fetch one last time to see if there are any more rows even when we have passed the end. So line 208 takes care that when an attempt is made to fetch beyond what is there, we do not try to parse any more and an empty result is then returned, indicating to the caller that there is no more rows.
But if we haven't already reached the end, we start creating an object instance in line 209. That is an object of the type we have saved in our state as g_type, which is the collection type we will be returning. The object instance is created in p_tab, which is the output parameter of ODCITableFetch.
Then we loop over our data, one column value at a time. So we increase g_col_num in line 212 to keep track of where we are. If we are at the first column, line 215 increases the local row count, while 216 increases the global row count. The difference is that g_row_num keeps track of the number of rows in total and it is saved in the object state so it can keep track across multiple fetch calls. But l_row_cnt restarts in every fetch call and is used to be able to only return the desired number of rows for each fetch.
When we are at the beginning of a row, we need to create an instance of the object type within our collection. We do that in line 217 and line 218 tells Oracle we will construct this row one piece (one column) at a time.
If we are not at the last column (line 221) we search for the next column delimiter (line 222), while at the last column we search for a row delimiter (line 224.) That allows us to extract the column data using SUBSTR between g_pos1 and g_pos2. We do that in lines 229, 232 and 235 depending on the datatype, which we look up in our global state in line 227.
If we have reached the end of the text, line 238 exits the loop. If not, we set g_pos1 to g_pos2 to move on. If we are at the last column, we reset the column counter. And as we then have completed a row, we also check if we have reached the number of rows the caller want to fetch. If we have (line 243) we exit the loop.
Having exited the loop (whether at the end of the data or at p_numrows reached), we finish the object instance creation in line 247. So now the output variable p_tab contains a collection of rows with our columns and that is returned.
Multiple fetch calls can be performed until all the data has been returned. Then a well-behaved application can keep the prepared cursor opened and re-use it with new bind variable content, which would call ODCITableStart again with the new text buffer, and that would reset the object state so new fetches can be performed.
Once the application is done with the cursor and closes it, it will call ODCITableClose. In that function we should do any necessary cleanup, but as our only state is global variables inside the object instance, it will go away anyway, so we have no cleanup to perform.
Now for the fun part - actually using this. I can give the function a string with the data, a string defining columns with their datatypes, the column delimiter and the row delimiter. And a SELECT * then gets the column names and datatypes from our code and retrieves the rows with column values from our code as well:
The data string could also be a column within a table and that way we can get the rows as child rows:
We do not need to always use SELECT * because our code is called at parse time, so the parser knows the column names and types and we can use the columns explicitly in our statement as well, everywhere where normal table columns can be used, not just in the select list. It is also allowed to use lowercase identifiers in our column specification, because we have written ODCITableDescribe to emulate the standard Oracle functionality of unquoted identifiers becoming uppercase. Specifying column and row delimiter we can avoid if the data uses the defaults of colon and comma:
If we quote our identifiers, the code in ODCITableDescribe keeps the case of the identifiers and then we are required to continue using the quoted identifiers, just like standard Oracle:
If need be, we can specify different column and row delimiters. Note that the code in ODCITableDescribe will parse the column specification string using the same delimiters, so when the data here use | and ; we also need to use | and ; in the column specification.
We can support DATE beside VARCHAR2 and NUMBER, but note that this implementation works with the NLS_DATE_FORMAT of your session, you cannot specify a date format:
The same issue actually also holds true for NUMBER. If for example your session NLS_DECIMAL_CHARACTERS specify using decimal comma, any non-integer numbers in the data string also must use decimal comma (in which case the delimiter cannot be a comma.)
This has mostly been fun to create. I plan a future post (hopefully fairly soon) comparing performance of this with other methods of splitting out rows and columns from a string. I do not expect this to be the very best method, but probably not the very worst either. We shall see when I get to do some more in-depth comparison.
The primary thing I've learned from this is probably a greater understanding of how Oracle operates on a cursor, when the various phases (describe, prepare, start, fetch close) are called.
The describe being at hard parse time is one catch to this method. The column specification and column/row delimiters you cannot set dynamically in consecutive calls to delimited_col_row.parser(), at least not with bind variables. As long as you use literals for the values given to p_cols, p_col_delim and p_row_delim parameters, everything is good. If you change the literals, you will be forcing a hard parse anyway. If you pass binds or table columns to p_cols, p_col_delim and p_row_delim, you must not change the values of those between calls. If you do, you will have to manually force a hard pass somehow.
And there is a lot on the to-do list if this should be a real generic solution able to handle many situations:
Even if all that to-do was done, then you could keep on adding functionality until you got it to be able to do as much as external table syntax. But then I actually think a better way would be if external table syntax ever allowed location to be specified as a CLOB rather than a file.
Anyway, have fun with this (if you're sufficiently nerdy like me to think this is fun ;-)
Note: Carsten Czarski thought about this way of doing it way earlier than me. He's blogged a very similar thing 8 years ago: http://sql-plsql-de.blogspot.de/2007/11/csv-dateien-einlesen-mit-sql-easy-csv.html - Thanks for letting me know, Carsten.
There are various other alternatives using SQL and/or XML manipulations, but one method that I don't see much used is to use the Data Cartridge functionality to in a sense expand the SQL language to do this. This solution is greatly inspired by an OTN forum post I stumbled upon, where Anton Scheffer demonstrates the use of ODCITable* functions.
The idea is that you can create an object type that implements certain specifically named functions. Then Oracle can call those functions when it needs to describe, prepare, start and fetch from your table function. What that means is, that at hard parse time, you can programmatically generate the metadata for what columns and datatypes your function will return.
Sounds weird? Let's get on with the code and you'll see what I mean.
The goal is to make a table function that enables you to do this:
with children(groupid, data) as ( select 1, 'Adam:9:M,Belinda:7:F,Carl:12:M' from dual union all select 2, 'Dave:11:M,Ellen:12:F' from dual union all select 3, 'Frank:6:M,George:5:M,Helen:13:F,Isabella:4:F' from dual union all select 4, 'Julian:15:M' from dual ) select children.groupid , parsed.age , parsed.gender , parsed.name from children , table( delimited_col_row.parser( children.data , 'name:varchar2(10),age:number,gender:varchar2(1)' ) ) parsed order by children.groupid, parsed.age /
In that example the column delimiter is colon and the row delimiter is comma. There are three "columns": name, age and gender. The call to delimited_col_row.parser() specifies the text string as well as a string containing the specification of the columns.
How can this be done? By creating an object type that implements ODCITable* functions:
create or replace type delimited_col_row as object ( g_buffer varchar2(32000) , g_type anytype , g_numcols integer , g_col_types sys.odcinumberlist , g_col_delim varchar2(1) , g_row_delim varchar2(1) , g_pos1 integer , g_pos2 integer , g_col_num integer , g_row_num integer , static function parser( p_text in varchar2 , p_cols in varchar2 , p_col_delim in varchar2 default ':' , p_row_delim in varchar2 default ',' ) return anydataset pipelined using delimited_col_row , static function odcitabledescribe( p_tabtype out anytype , p_text in varchar2 , p_cols in varchar2 , p_col_delim in varchar2 default ':' , p_row_delim in varchar2 default ',' ) return number , static function odcitableprepare( p_context out delimited_col_row , p_tabinfo in sys.odcitabfuncinfo , p_text in varchar2 , p_cols in varchar2 , p_col_delim in varchar2 default ':' , p_row_delim in varchar2 default ',' ) return number , static function odcitablestart( p_context in out delimited_col_row , p_text in varchar2 , p_cols in varchar2 , p_col_delim in varchar2 default ':' , p_row_delim in varchar2 default ',' ) return number , member function odcitablefetch( self in out delimited_col_row , p_numrows in number , p_tab out anydataset ) return number , member function odcitableclose( self in delimited_col_row ) return number ) /
The specification of our object type contains the 5 ODCITable* function specifications needed. They must have those names and contain certain parameters.
The parser() function is the actual function that whomever uses this functionality will call. Typically this is shown in manuals/examples as either a standalone function or a packaged function. Anton showed it can also be a static function within the object itself, which makes a nice simple way to encapsulate all the code needed in a single program unit - the object type. Note the parser() function will not have a body, but the "using delimited_col_row" tells Oracle that the object "delimited_col_row" contains the implementation of this function.
The global variables in the object will contain the object state needed to maintain state between the various calls to the different ODCITable* functions.
- ODCITableDescribe needs an OUT parameter of type ANYTYPE that will contain the type that the parser() function will return. The rest of the parameters to ODCITableDescribe needs to be the same parameters as our parser() function.
- ODCITablePrepare needs an OUT parameter of our object type - that will be an instance of our object type that will contain the state (global variables.) Then it needs an IN parameter of sys.odcitabfuncinfo type, that will contain the result of the parsing (basically the result of the ODCITableDescribe() call.) The rest of the parameters then again need to be the same as the parser() function.
- ODCITableStart needs an IN OUT parameter of the object type, so it can reset the object state. Again the same parameters are needed at the end.
- ODCITableFetch and ODCITableClose just needs the parameters shown - the "self" object instance and (for ODCITableFetch) a couple extra parameters. They do not need the parameter list of the parser() function.
The body will then implement the 5 ODCITable* functions. We'll walk through them one at a time:
create or replace type body delimited_col_row as static function odcitabledescribe( p_tabtype out anytype , p_text in varchar2 , p_cols in varchar2 , p_col_delim in varchar2 default ':' , p_row_delim in varchar2 default ',' ) return number is l_type anytype; l_numcols integer; l_pos1 pls_integer; l_pos2 pls_integer; l_pos3 pls_integer; l_colname varchar2(1932); l_coltype varchar2(1932); begin anytype.begincreate(dbms_types.typecode_object, l_type); l_pos1 := 0; loop l_pos2 := instr(p_cols, p_col_delim, l_pos1+1); l_pos3 := nvl(nullif(instr(p_cols, p_row_delim, l_pos1+1),0),length(p_cols)+1); l_colname := substr(p_cols, l_pos1+1, l_pos2-l_pos1-1); l_coltype := upper(substr(p_cols, l_pos2+1, l_pos3-l_pos2-1)); if l_colname like '"%"' then l_colname := trim(both '"' from l_colname); else l_colname := upper(l_colname); end if; case when l_coltype like 'VARCHAR2(%)' then l_type.addattr( l_colname , dbms_types.typecode_varchar2 , null , null , to_number(trim(substr(l_coltype,10,length(l_coltype)-10))) , null , null ); when l_coltype = 'NUMBER' then l_type.addattr( l_colname , dbms_types.typecode_number , null , null , null , null , null ); when l_coltype = 'DATE' then l_type.addattr( l_colname , dbms_types.typecode_date , null , null , null , null , null ); else raise dbms_types.invalid_parameters; end case; exit when l_pos3 > length(p_cols); l_pos1 := l_pos3; end loop; l_type.endcreate; anytype.begincreate(dbms_types.typecode_table, p_tabtype); p_tabtype.setinfo( null , null , null , null , null , l_type , dbms_types.typecode_object , 0 ); p_tabtype.endcreate(); return odciconst.success; exception when others then return odciconst.error; end odcitabledescribe;
When Oracle hard parses the call to delimited_col_row.parser() it needs to find out what datatypes that table function will return. So at hard parse time, Oracle will call ODCITableDescribe with the parameters from the delimited_col_row.parser() call. This function then needs to return an ANYTYPE that will contain the datatype the function will return.
So in line 20 we start creating a type (typecode_object) that will contain our columns. Lines 22-28 parses the column specification in p_cols parameter, so the loop runs as many times as there are to be columns, and lines 30-34 emulates Oracle behaviour that unquoted identifiers are uppercased. Lines 36-76 case statement will add an attribute (a column) to our return datatype, depending on whether we have specified a VARCHAR2, NUMBER or DATE column. Line 78 finished the creation of the object type.
Then we start a type of typecode_table in line 80 and line 82-91 specifies that it will be a table of the object type we just created.
With a regular table function, Oracle will also do a describe call at hard parse to find out the actual object type. Normally that is found out be looking up metadata in data dictionary from when the function was compiled, but because we have implemented ODCITableDescribe, it will be our function that will be called at hard parse and we can dynamically create the collection type that describes what our parser() function will return.
static function odcitableprepare( p_context out delimited_col_row , p_tabinfo in sys.odcitabfuncinfo , p_text in varchar2 , p_cols in varchar2 , p_col_delim in varchar2 default ':' , p_row_delim in varchar2 default ',' ) return number is l_type anytype; l_numcols integer; l_col_types sys.odcinumberlist := sys.odcinumberlist(); l_typecode pls_integer; l_prec pls_integer; l_scale pls_integer; l_len pls_integer; l_csid pls_integer; l_csfrm pls_integer; l_aname varchar2(30); l_schemaname varchar2(30); l_typename varchar2(30); l_version varchar2(30); l_dummytype anytype; begin l_typecode := p_tabinfo.rettype.getattreleminfo( 1 , l_prec , l_scale , l_len , l_csid , l_csfrm , l_type , l_aname ); l_typecode := l_type.getinfo( l_prec , l_scale , l_len , l_csid , l_csfrm , l_schemaname , l_typename , l_version , l_numcols ); l_col_types.extend(l_numcols); for idx in 1..l_numcols loop l_col_types(idx) := l_type.getattreleminfo( idx , l_prec , l_scale , l_len , l_csid , l_csfrm , l_dummytype , l_aname ); end loop; p_context := delimited_col_row( p_text , l_type , l_numcols , l_col_types , p_col_delim , p_row_delim , 0 , 0 , 0 , 0 ); return odciconst.success; end odcitableprepare;
The describe call is done at hard parse time. It is a static function and does not preserve state, but Oracle remembers what was returned by ODCITableDescribe and saves it an object type sys.odcitabfuncinfo. This is saved along with a lot of other information in the statement cache, which means that when we soft parse, sys.odcitabfuncinfo value is taken from statement cache, while ODCITableDescribe is only called at hard parse.
So preparing the SQL statement first either does hard parse (calling ODCITableDescribe) or soft parse (retrieving sys.odcitabfuncinfo value from cache) and then it calls ODCITablePrepare giving as input the sys.odcitabfuncinfo value as well as the parameters from the parser() call. Using that information, we need to create an object of type delimited_col_row to pass out as p_context.
The p_tabinfo object contains an element called rettype, which contains the collection type we returned in ODCITableDescribe. Line 127-136 retrieves the attributes of that collection type. Most of the output parameters of the getattreleminfo() call we do not use, but we need to specify them as they are OUT parameters. We will be using l_type.
Line 138-148 we get information about our collection type. What we will use is the value of l_numcols, which enables us to extend l_col_types to the number of columns needed (line 150.)
Then we loop over the number of columns and line 153-162 finds the column type ID (number) of each column and saves it in our l_col_types collection.
And finally line 165-176 creates an instance of our object type initializing the global variables and returning it as the context needed for keeping object state. In the state we save the type and the number of columns and their types and the delimiters.
static function odcitablestart( p_context in out delimited_col_row , p_text in varchar2 , p_cols in varchar2 , p_col_delim in varchar2 default ':' , p_row_delim in varchar2 default ',' ) return number is begin p_context.g_buffer := p_text; p_context.g_pos1 := 0; p_context.g_pos2 := 0; p_context.g_col_num := 0; p_context.g_row_num := 0; return odciconst.success; end odcitablestart;
Next up is starting the statement, which will be passed the object instance p_context along with the variables from the actual call to parser() function. Here we need to reset the part of our package state that can be different between calls, and in this case that is simply setting our buffer to the actual text and resetting the counters we use for the parsing. We don't do anything with p_cols, p_col_delim and p_row_delim in this case, as we can't allow the calling procedure to change this after hard parse has happened.
member function odcitablefetch( self in out delimited_col_row , p_numrows in number , p_tab out anydataset ) return number is l_row_cnt integer := 0; begin if self.g_pos2 < length(self.g_buffer) then anydataset.begincreate(dbms_types.typecode_object, self.g_type, p_tab); loop self.g_col_num := self.g_col_num + 1; if self.g_col_num = 1 then l_row_cnt := l_row_cnt + 1; self.g_row_num := self.g_row_num + 1; p_tab.addinstance; p_tab.piecewise(); end if; if self.g_col_num < self.g_numcols then self.g_pos2 := instr(self.g_buffer, self.g_col_delim, self.g_pos1+1); else self.g_pos2 := nvl(nullif(instr(self.g_buffer, self.g_row_delim, self.g_pos1+1),0),length(self.g_buffer)+1); end if; case self.g_col_types(self.g_col_num) when dbms_types.typecode_varchar2 then p_tab.setvarchar2(substr(self.g_buffer, self.g_pos1+1, self.g_pos2-self.g_pos1-1)); when dbms_types.typecode_number then p_tab.setnumber(to_number(substr(self.g_buffer, self.g_pos1+1, self.g_pos2-self.g_pos1-1))); when dbms_types.typecode_date then p_tab.setdate(to_date(substr(self.g_buffer, self.g_pos1+1, self.g_pos2-self.g_pos1-1))); end case; exit when self.g_pos2 > length(self.g_buffer); self.g_pos1 := self.g_pos2; if self.g_col_num = self.g_numcols then self.g_col_num := 0; exit when l_row_cnt >= p_numrows; end if; end loop; p_tab.endcreate; end if; return odciconst.success; end odcitablefetch;
ODCITableFetch is no longer a static function, it is a member function and so has access to all the state of the object via self. And fetching can (often will) be called multiple times getting the rows back piecewise. ODCITableFetch is not required to adhere to the p_numrows parameter, it is allowed to return more (for example all the rows) and Oracle will simply cache the result itself and feed the rows back to the caller in the required chunks.
In the state we use g_pos1 and g_pos2 to keep track of how far into the string we have parsed so far. The caller will often end by calling fetch one last time to see if there are any more rows even when we have passed the end. So line 208 takes care that when an attempt is made to fetch beyond what is there, we do not try to parse any more and an empty result is then returned, indicating to the caller that there is no more rows.
But if we haven't already reached the end, we start creating an object instance in line 209. That is an object of the type we have saved in our state as g_type, which is the collection type we will be returning. The object instance is created in p_tab, which is the output parameter of ODCITableFetch.
Then we loop over our data, one column value at a time. So we increase g_col_num in line 212 to keep track of where we are. If we are at the first column, line 215 increases the local row count, while 216 increases the global row count. The difference is that g_row_num keeps track of the number of rows in total and it is saved in the object state so it can keep track across multiple fetch calls. But l_row_cnt restarts in every fetch call and is used to be able to only return the desired number of rows for each fetch.
When we are at the beginning of a row, we need to create an instance of the object type within our collection. We do that in line 217 and line 218 tells Oracle we will construct this row one piece (one column) at a time.
If we are not at the last column (line 221) we search for the next column delimiter (line 222), while at the last column we search for a row delimiter (line 224.) That allows us to extract the column data using SUBSTR between g_pos1 and g_pos2. We do that in lines 229, 232 and 235 depending on the datatype, which we look up in our global state in line 227.
If we have reached the end of the text, line 238 exits the loop. If not, we set g_pos1 to g_pos2 to move on. If we are at the last column, we reset the column counter. And as we then have completed a row, we also check if we have reached the number of rows the caller want to fetch. If we have (line 243) we exit the loop.
Having exited the loop (whether at the end of the data or at p_numrows reached), we finish the object instance creation in line 247. So now the output variable p_tab contains a collection of rows with our columns and that is returned.
member function odcitableclose( self in delimited_col_row ) return number is begin return odciconst.success; end odcitableclose; end; /
Multiple fetch calls can be performed until all the data has been returned. Then a well-behaved application can keep the prepared cursor opened and re-use it with new bind variable content, which would call ODCITableStart again with the new text buffer, and that would reset the object state so new fetches can be performed.
Once the application is done with the cursor and closes it, it will call ODCITableClose. In that function we should do any necessary cleanup, but as our only state is global variables inside the object instance, it will go away anyway, so we have no cleanup to perform.
Now for the fun part - actually using this. I can give the function a string with the data, a string defining columns with their datatypes, the column delimiter and the row delimiter. And a SELECT * then gets the column names and datatypes from our code and retrieves the rows with column values from our code as well:
select * from table( delimited_col_row.parser( '1:Adam:M,2:Belinda:F,3:Carl:M' , 'ID:NUMBER,NAME:VARCHAR2(10),GENDER:VARCHAR2(1)' , ':' , ',' ) ) order by gender, name /
ID NAME GENDER ---------- ---------- ------ 2 Belinda F 1 Adam M 3 Carl M
The data string could also be a column within a table and that way we can get the rows as child rows:
with children(groupid, data) as ( select 1, 'Adam:9:M,Belinda:7:F,Carl:12:M' from dual union all select 2, 'Dave:11:M,Ellen:12:F' from dual union all select 3, 'Frank:6:M,George:5:M,Helen:13:F,Isabella:4:F' from dual union all select 4, 'Julian:15:M' from dual ) select children.groupid , parsed.* from children , table( delimited_col_row.parser( children.data , 'NAME:VARCHAR2(10),AGE:NUMBER,GENDER:VARCHAR2(1)' , ':' , ',' ) ) parsed order by children.groupid /
GROUPID NAME AGE GENDER ---------- ---------- ---------- ------ 1 Adam 9 M 1 Belinda 7 F 1 Carl 12 M 2 Dave 11 M 2 Ellen 12 F 3 Frank 6 M 3 George 5 M 3 Helen 13 F 3 Isabella 4 F 4 Julian 15 M
We do not need to always use SELECT * because our code is called at parse time, so the parser knows the column names and types and we can use the columns explicitly in our statement as well, everywhere where normal table columns can be used, not just in the select list. It is also allowed to use lowercase identifiers in our column specification, because we have written ODCITableDescribe to emulate the standard Oracle functionality of unquoted identifiers becoming uppercase. Specifying column and row delimiter we can avoid if the data uses the defaults of colon and comma:
with children(groupid, data) as ( select 1, 'Adam:9:M,Belinda:7:F,Carl:12:M' from dual union all select 2, 'Dave:11:M,Ellen:12:F' from dual union all select 3, 'Frank:6:M,George:5:M,Helen:13:F,Isabella:4:F' from dual union all select 4, 'Julian:15:M' from dual ) select children.groupid , parsed.age , parsed.gender , parsed.name from children , table( delimited_col_row.parser( children.data , 'name:varchar2(10),age:number,gender:varchar2(1)' ) ) parsed order by children.groupid, parsed.age /
GROUPID AGE GENDER NAME ---------- ---------- ------ ---------- 1 7 F Belinda 1 9 M Adam 1 12 M Carl 2 11 M Dave 2 12 F Ellen 3 4 F Isabella 3 5 M George 3 6 M Frank 3 13 F Helen 4 15 M Julian
If we quote our identifiers, the code in ODCITableDescribe keeps the case of the identifiers and then we are required to continue using the quoted identifiers, just like standard Oracle:
with children(groupid, data) as ( select 1, 'Adam:9:M,Belinda:7:F,Carl:12:M' from dual union all select 2, 'Dave:11:M,Ellen:12:F' from dual union all select 3, 'Frank:6:M,George:5:M,Helen:13:F,Isabella:4:F' from dual union all select 4, 'Julian:15:M' from dual ) select children.groupid , parsed."Age" , parsed."Gender" , parsed."Name" from children , table( delimited_col_row.parser( children.data , '"Name":varchar2(10),"Age":number,"Gender":varchar2(1)' ) ) parsed order by children.groupid, parsed."Age" /
GROUPID Age Gender Name ---------- ---------- ------ ---------- 1 7 F Belinda 1 9 M Adam 1 12 M Carl 2 11 M Dave 2 12 F Ellen 3 4 F Isabella 3 5 M George 3 6 M Frank 3 13 F Helen 4 15 M Julian
If need be, we can specify different column and row delimiters. Note that the code in ODCITableDescribe will parse the column specification string using the same delimiters, so when the data here use | and ; we also need to use | and ; in the column specification.
with children(groupid, data) as ( select 1, 'Adam|9|M;Belinda|7|F;Carl|12|M' from dual union all select 2, 'Dave|11|M;Ellen|12|F' from dual union all select 3, 'Frank|6|M;George|5|M;Helen|13|F;Isabella|4|F' from dual union all select 4, 'Julian|15|M' from dual ) select children.groupid , parsed.age , parsed.gender , parsed.name from children , table( delimited_col_row.parser( children.data , 'name|varchar2(10);age|number;gender|varchar2(1)' , '|' , ';' ) ) parsed order by children.groupid, parsed.age /
GROUPID AGE GENDER NAME ---------- ---------- ------ ---------- 1 7 F Belinda 1 9 M Adam 1 12 M Carl 2 11 M Dave 2 12 F Ellen 3 4 F Isabella 3 5 M George 3 6 M Frank 3 13 F Helen 4 15 M Julian
We can support DATE beside VARCHAR2 and NUMBER, but note that this implementation works with the NLS_DATE_FORMAT of your session, you cannot specify a date format:
with children(groupid, data) as ( select 1, 'Adam:01-Jan-06,Belinda:02-Feb-08,Carl:03-Mar-03' from dual union all select 2, 'Dave:04-Apr-04,Ellen:05-May-03' from dual union all select 3, 'Frank:06-Jun-09,George:07-Jul-10,Helen:08-Aug-02,Isabella:09-Sep-11' from dual union all select 4, 'Julian:10-Oct-00' from dual ) select children.groupid , to_char(parsed.birthday,'YYYY-MM-DD') birthdate , parsed.name from children , table( delimited_col_row.parser( children.data , 'name:varchar2(10),birthday:date' , ':' , ',' ) ) parsed order by children.groupid, parsed.birthday desc /
GROUPID BIRTHDATE NAME ---------- ---------- ---------- 1 2008-02-02 Belinda 1 2006-01-01 Adam 1 2003-03-03 Carl 2 2004-04-04 Dave 2 2003-05-05 Ellen 3 2011-09-09 Isabella 3 2010-07-07 George 3 2009-06-06 Frank 3 2002-08-08 Helen 4 2000-10-10 Julian
The same issue actually also holds true for NUMBER. If for example your session NLS_DECIMAL_CHARACTERS specify using decimal comma, any non-integer numbers in the data string also must use decimal comma (in which case the delimiter cannot be a comma.)
This has mostly been fun to create. I plan a future post (hopefully fairly soon) comparing performance of this with other methods of splitting out rows and columns from a string. I do not expect this to be the very best method, but probably not the very worst either. We shall see when I get to do some more in-depth comparison.
The primary thing I've learned from this is probably a greater understanding of how Oracle operates on a cursor, when the various phases (describe, prepare, start, fetch close) are called.
The describe being at hard parse time is one catch to this method. The column specification and column/row delimiters you cannot set dynamically in consecutive calls to delimited_col_row.parser(), at least not with bind variables. As long as you use literals for the values given to p_cols, p_col_delim and p_row_delim parameters, everything is good. If you change the literals, you will be forcing a hard parse anyway. If you pass binds or table columns to p_cols, p_col_delim and p_row_delim, you must not change the values of those between calls. If you do, you will have to manually force a hard pass somehow.
And there is a lot on the to-do list if this should be a real generic solution able to handle many situations:
- Allow data to be in CLOB rather than just VARCHAR2.
- Handle delimiters of greater length than 1 (particularly to be able to handle CR/LF row delimiters.)
- Allow escaping of delimiters so the data may contain escaped delimiters.
- Allow quoting of string values so the strings may contain delimiters.
- Allow specification of NLS_DATE_FORMAT.
- Allow specification of NLS_DECIMAL_CHARACTERS.
- Etc.
Even if all that to-do was done, then you could keep on adding functionality until you got it to be able to do as much as external table syntax. But then I actually think a better way would be if external table syntax ever allowed location to be specified as a CLOB rather than a file.
Anyway, have fun with this (if you're sufficiently nerdy like me to think this is fun ;-)
Note: Carsten Czarski thought about this way of doing it way earlier than me. He's blogged a very similar thing 8 years ago: http://sql-plsql-de.blogspot.de/2007/11/csv-dateien-einlesen-mit-sql-easy-csv.html - Thanks for letting me know, Carsten.
Excellent publication, but I have a question you can use this type of dynamic tables within a function, you can make an example please to know how it can be done. Thanks for everything and greetings.
ReplyDeleteI leave you an example of my code, and I get an error, but if I consult it, I directly execute everything normal.
CREATE OR REPLACE TYPE NUBERH.rec_idnmb as object (
id INTEGER,
nombre VARCHAR2(4000)
);
/
CREATE OR REPLACE TYPE NUBERH.tab_idnmb as TABLE OF rec_idnmb;
/
CREATE OR REPLACE FUNCTION get_test
(
p_marcas_json varchar2
)
RETURN tab_idnmb
AS
v_ret tab_idnmb;
BEGIN
select
rec_idnmb(
ID,
NAME
)
BULK COLLECT INTO v_ret
from table(
delimited_col_row.parser(
'1:Adam:M,2:Belinda:F,3:Carl:M',
'ID:NUMBER,NAME:VARCHAR2(10),GENDER:VARCHAR2(1)',
':',
','
)
)
order by gender, name;
RETURN v_ret;
END;
Hi Daniel
DeleteI don’t know of any restrictions that should cause your example to fail. I can try to reproduce when I get to a computer.
In the meantime, could you please post which error you are getting?
(And a comment: based on your argument name, it looks like you wish to use this to parse JSON. If that is the case, then there are better ways to do that - depending on which database version you are running. )
Cheerio
/Kim
Hi again, Daniel
DeleteI've tried to use my table function in PL/SQL now (just in an anonymous block, but that's sufficient to show the problem), and I get the error message: ORA-22905: cannot access rows from a non-nested table item.
Even though it should not be a problem, it looks like this is one of the very few cases where SQL that is embedded in PL/SQL is not behaving exactly the same as when it is pure SQL running in the SQL engine alone.
I can do a workaround by putting the SQL in an EXECUTE IMMEDIATE to make it dynamic SQL - that way only the SQL engine touches it, not the PL/SQL engine. It shouldn't be necessary and I consider it actually to be a bug in Oracle. However, it is a type of bug that is probably not a high priority to fix (and there might even be arguments that it is not really a bug but has to be this way - who knows ;-)
Anyway, I'll do a addendum post demonstrating the use of dynamic SQL as a workaround.
Cheerio
/Kim
As promised, a post demonstrating the dynamic SQL:
Deletehttps://www.kibeha.dk/2019/01/ora-22905-when-calling-odci-table.html
Hope it helps ;-)
When trying to create function type body delimited_col_row I am getting a error. [Error] PLS-00103 (100: 25): PLS-00103: Encountered the symbol "end-of-file" when expecting one of the following: end not pragma final instantiable order overriding static member constructor map
ReplyDeleteAny idea what the issue is?
Hi, Craig
DeleteWell, probably the issue is that I should have provided a link to a source file ;-) I am guessing (since your error is on column 25 of line 100) that you missed the weird way I published the code in this blog post.
The *type* body delimited_col_row starts with line 1: "create or replace type body delimited_col_row as". It contains multiple functions. I show each function separately, but each snippet needs to be merged together into one single type body before it will compile.
So the 5 snippets that make up the type body needs to be appended to one another. Line 1-100, 102-179, 181-198, 200-250 and 252-261. Then it should compile.
I am sorry I didn't describe this very well in the blog post. But you could see my presentation on "Twisting Data": http://bit.ly/kibeha_datatwist4_pptx. Part 3 is all about ways to split delimited data, several of which are easier to use than this ODCI function. This ODCI function is mostly useful for very specialized cases.
But the presentation has a companion script you can download: http://bit.ly/kibeha_datatwist4_sql. In that script is also the source code of the delimited_col_row type and type body, which you can then grab directly instead of piecing it together from the blog post.
Hope you can get it to run :-)
Cheerio
/Kim