CSV, XML and JSON parsing - a comparison over versions
Via Steven Feuerstein I was asked to try and give advice concerning the fastest way to parse CSV data in PL/SQL. The case was that they had a different database that could be set up to deliver data as a webservice serving either CSV, XML or JSON, and they wished to let their APEX application use PL/SQL to retrieve data from that other database via such webservices. They were not yet on 12.1.0.2.0, so they did not have JSON_TABLE, and they seemed to find that do-it-yourself parsing of CSV data was the fastest. Did we have any idea of a faster way?
I would normally go for built-in functions as generally being the fastest, so XMLTABLE would be my preference. But there might be use cases where the overhead in bytes of XML creates too much network traffic. Or possibly there might be cases where XMLTABLE is slower because it needs to be able to generically parse very complex XML, and so might have to set up complex structures internally that aren't really needed for very simple XML that just represents rows of a table/view/query.
So maybe the manual CSV parsing could be faster. And since this would have to be used for maybe 10-20 calls to form a single APEX page, even shaving off milliseconds might make a difference in the total performance. I looked at the PL/SQL code example they had worked out and spotted at least one thing that could help.
The code would loop through the CSV data in the CLOB with a construct like this:
loop
   l_rec_len := dbms_lob.instr(p_clob, l_eol, 1, l_rownr) - l_offset;
   l_buffer  := dbms_lob.substr(p_clob, l_rec_len, l_offset);
   exit when l_buffer is null;
   -- ... parse the columns of l_buffer ...
   l_offset := l_offset + l_rec_len + 1;
   l_rownr  := l_rownr + 1;
end loop;
That uses the "occurrence" parameter of DBMS_LOB.INSTR (fourth parameter), so the loop first asks for the position of the first EOL, then asks for the second EOL, and so on. When asking DBMS_LOB.INSTR for the second EOL, DBMS_LOB.INSTR will have to first find the first EOL (again) and then get the second EOL. When asking for the third EOL, DBMS_LOB.INSTR finds the first, the second, and then the third. So for N rows, the first EOL will actually be found N times, the second EOL will be found N-1 times, and so on.
That code I could easily improve by changing the construct slightly:
loop
   l_crpos2 := dbms_lob.instr(p_clob, l_eol, l_crpos1+1);
   l_buffer := dbms_lob.substr(p_clob, l_crpos2-l_crpos1-1, l_crpos1+1);
   exit when l_buffer is null;
   -- ... parse the columns of l_buffer ...
   l_crpos1 := l_crpos2;
   l_rownr  := l_rownr + 1;
end loop;
Keep track of the positions so that each call to DBMS_LOB.INSTR asks for the first occurrence of EOL from the position of the last one + 1.
Similarly, each column was parsed using the "occurrence" parameter of INSTR to get the Nth occurrence of a semicolon for the Nth column:
-- ...
l_pos1 := l_pos2;
l_pos2 := instr(l_buffer,';',1,4);
l_row.col4 := substr( l_buffer, l_pos1+1, l_pos2-l_pos1-1 );
l_pos1 := l_pos2;
l_pos2 := instr(l_buffer,';',1,5);
l_row.col5 := substr( l_buffer, l_pos1+1, l_pos2-l_pos1-1 );
-- ...
And that could also quite easily be changed into code that keeps track of positions, so that INSTR never has to pass over the same parts of the string again and again.
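To sketch the idea before the full function: each column fetch remembers where the previous semicolon was found and only searches forward from there. This is essentially the local helper that parse_csv2 below ends up using; the small self-contained demo here uses the sample string 'aa;bb;cc' purely for illustration.

declare
   l_buffer varchar2(32767) := 'aa;bb;cc';
   l_pos1   number := 0;
   l_pos2   number := 0;

   function lf_getcol return varchar2
   is
   begin
      l_pos1 := l_pos2;
      l_pos2 := coalesce(
                   nullif(instr(l_buffer, ';', l_pos1+1), 0)  -- next ; after the previous position
                 , length(l_buffer)+1                         -- no more ; means last column
                );
      return substr(l_buffer, l_pos1+1, l_pos2-l_pos1-1);
   end lf_getcol;
begin
   dbms_output.put_line(lf_getcol);  -- aa
   dbms_output.put_line(lf_getcol);  -- bb
   dbms_output.put_line(lf_getcol);  -- cc
end;
/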
This had a very good effect on the code and improved the performance quite a bit - particularly for many rows. I decided to test it against XMLTABLE and saw that the improved manual CSV parsing seemed to be faster - at least in the 11.2 instance I was using. But then I wanted to test against JSON_TABLE as well and tried it in a 12.1.0.2.0 instance - JSON_TABLE was quite fast, and then there was a surprise when I tried XMLTABLE in 12.1.
So I decided to try it out more rigorously in three different versions I have available to me...
First I create a function that can return some rows of test data in any of the three formats. This will emulate the result of the webservice call.
create or replace function create_testdata(
   p_rows number
 , p_type varchar2
) return clob
is
   l_eol  varchar2(2) := chr(13);
   l_clob clob;
begin
   if p_type = 'CSV' then
      dbms_lob.createtemporary(l_clob,false);
      for r in 1..p_rows loop
         dbms_lob.append(l_clob,
               'Row '||r||' Column 1 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 2 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 3 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 4 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 5 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 6 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 7 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 8 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 9 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 10 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 11 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 12 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 13 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 14 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 15 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 16 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 17 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 18 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 19 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 20 The quick brown fox jumped over the lazy dog'
            ||l_eol
         );
      end loop;
   elsif p_type = 'XML' then
      select xmlelement(
                "Rows"
              , xmlagg(
                   xmlelement(
                      "Row"
                    , xmlforest(
                         'Row '||level||' Column 1 The quick brown fox jumped over the lazy dog' as "Column1"
                       , 'Row '||level||' Column 2 The quick brown fox jumped over the lazy dog' as "Column2"
                       , 'Row '||level||' Column 3 The quick brown fox jumped over the lazy dog' as "Column3"
                       , 'Row '||level||' Column 4 The quick brown fox jumped over the lazy dog' as "Column4"
                       , 'Row '||level||' Column 5 The quick brown fox jumped over the lazy dog' as "Column5"
                       , 'Row '||level||' Column 6 The quick brown fox jumped over the lazy dog' as "Column6"
                       , 'Row '||level||' Column 7 The quick brown fox jumped over the lazy dog' as "Column7"
                       , 'Row '||level||' Column 8 The quick brown fox jumped over the lazy dog' as "Column8"
                       , 'Row '||level||' Column 9 The quick brown fox jumped over the lazy dog' as "Column9"
                       , 'Row '||level||' Column 10 The quick brown fox jumped over the lazy dog' as "Column10"
                       , 'Row '||level||' Column 11 The quick brown fox jumped over the lazy dog' as "Column11"
                       , 'Row '||level||' Column 12 The quick brown fox jumped over the lazy dog' as "Column12"
                       , 'Row '||level||' Column 13 The quick brown fox jumped over the lazy dog' as "Column13"
                       , 'Row '||level||' Column 14 The quick brown fox jumped over the lazy dog' as "Column14"
                       , 'Row '||level||' Column 15 The quick brown fox jumped over the lazy dog' as "Column15"
                       , 'Row '||level||' Column 16 The quick brown fox jumped over the lazy dog' as "Column16"
                       , 'Row '||level||' Column 17 The quick brown fox jumped over the lazy dog' as "Column17"
                       , 'Row '||level||' Column 18 The quick brown fox jumped over the lazy dog' as "Column18"
                       , 'Row '||level||' Column 19 The quick brown fox jumped over the lazy dog' as "Column19"
                       , 'Row '||level||' Column 20 The quick brown fox jumped over the lazy dog' as "Column20"
                      )
                   )
                )
             ).getclobval()
        into l_clob
        from dual
     connect by level <= p_rows;
   elsif p_type = 'JSON' then
      dbms_lob.createtemporary(l_clob,false);
      dbms_lob.append(l_clob,'{ "rows" : [');
      for r in 1..p_rows loop
         dbms_lob.append(l_clob,
            case r when 1 then '' else ',' end
            ||'{ column1 : "Row '||r||' Column 1 The quick brown fox jumped over the lazy dog"'
            ||', column2 : "Row '||r||' Column 2 The quick brown fox jumped over the lazy dog"'
            ||', column3 : "Row '||r||' Column 3 The quick brown fox jumped over the lazy dog"'
            ||', column4 : "Row '||r||' Column 4 The quick brown fox jumped over the lazy dog"'
            ||', column5 : "Row '||r||' Column 5 The quick brown fox jumped over the lazy dog"'
            ||', column6 : "Row '||r||' Column 6 The quick brown fox jumped over the lazy dog"'
            ||', column7 : "Row '||r||' Column 7 The quick brown fox jumped over the lazy dog"'
            ||', column8 : "Row '||r||' Column 8 The quick brown fox jumped over the lazy dog"'
            ||', column9 : "Row '||r||' Column 9 The quick brown fox jumped over the lazy dog"'
            ||', column10 : "Row '||r||' Column 10 The quick brown fox jumped over the lazy dog"'
            ||', column11 : "Row '||r||' Column 11 The quick brown fox jumped over the lazy dog"'
            ||', column12 : "Row '||r||' Column 12 The quick brown fox jumped over the lazy dog"'
            ||', column13 : "Row '||r||' Column 13 The quick brown fox jumped over the lazy dog"'
            ||', column14 : "Row '||r||' Column 14 The quick brown fox jumped over the lazy dog"'
            ||', column15 : "Row '||r||' Column 15 The quick brown fox jumped over the lazy dog"'
            ||', column16 : "Row '||r||' Column 16 The quick brown fox jumped over the lazy dog"'
            ||', column17 : "Row '||r||' Column 17 The quick brown fox jumped over the lazy dog"'
            ||', column18 : "Row '||r||' Column 18 The quick brown fox jumped over the lazy dog"'
            ||', column19 : "Row '||r||' Column 19 The quick brown fox jumped over the lazy dog"'
            ||', column20 : "Row '||r||' Column 20 The quick brown fox jumped over the lazy dog"'
            ||'} '
         );
      end loop;
      dbms_lob.append(l_clob,'] }');
   end if;
   return l_clob;
end create_testdata;
/
The manual parsing of CSV data will be done with a table function, which needs a row type and a table type to return:
create type t_test_row as object (
   rownr integer
 , col1  varchar2(100)
 , col2  varchar2(100)
 , col3  varchar2(100)
 , col4  varchar2(100)
 , col5  varchar2(100)
 , col6  varchar2(100)
 , col7  varchar2(100)
 , col8  varchar2(100)
 , col9  varchar2(100)
 , col10 varchar2(100)
 , col11 varchar2(100)
 , col12 varchar2(100)
 , col13 varchar2(100)
 , col14 varchar2(100)
 , col15 varchar2(100)
 , col16 varchar2(100)
 , col17 varchar2(100)
 , col18 varchar2(100)
 , col19 varchar2(100)
 , col20 varchar2(100)
)
/
create type t_test_tab as table of t_test_row
/
In the first function I use the original method that was sent to me with the question, which utilizes the "occurrence" parameter of both the DBMS_LOB.INSTR and INSTR functions:
create or replace function parse_csv1(
   p_clob clob
) return t_test_tab pipelined
is
   l_row     t_test_row := t_test_row(null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null);
   l_buffer  varchar2(32767);
   l_eol     varchar2(2) := chr(13);
   l_rownr   number := 1;
   l_offset  number := 1;
   l_rec_len number;
   l_pos1    number;
   l_pos2    number;
begin
   loop
      l_rec_len := dbms_lob.instr(p_clob, l_eol, 1, l_rownr) - l_offset;
      l_buffer  := dbms_lob.substr(p_clob, l_rec_len, l_offset);
      exit when l_buffer is null;
      l_row.rownr := l_rownr;
      l_pos1 := instr(l_buffer,';',1); l_row.col1 := substr( l_buffer,1, l_pos1-1 );
      l_pos1 := instr(l_buffer,';',1,1); l_pos2 := instr(l_buffer,';',1,2); l_row.col2 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,3);  l_row.col3  := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,4);  l_row.col4  := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,5);  l_row.col5  := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,6);  l_row.col6  := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,7);  l_row.col7  := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,8);  l_row.col8  := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,9);  l_row.col9  := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,10); l_row.col10 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,11); l_row.col11 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,12); l_row.col12 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,13); l_row.col13 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,14); l_row.col14 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,15); l_row.col15 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,16); l_row.col16 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,17); l_row.col17 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,18); l_row.col18 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := instr(l_buffer,';',1,19); l_row.col19 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      l_pos1 := l_pos2; l_pos2 := length(l_buffer)+1;       l_row.col20 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      pipe row(l_row);
      l_offset := l_offset + l_rec_len + 1;
      l_rownr  := l_rownr + 1;
   end loop;
   return;
exception
   when no_data_needed then
      null;
end parse_csv1;
/
The second function is my improved version that keeps track of the position as it goes along, so it always asks for the first occurrence but starting from the specified position:
create or replace function parse_csv2(
   p_clob clob
) return t_test_tab pipelined
is
   l_row     t_test_row := t_test_row(null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null);
   l_buffer  varchar2(32767);
   l_eol     varchar2(2) := chr(13);
   l_rownr   number := 1;
   l_offset  number := 1;
   l_rec_len number;
   l_pos1    number := 0;
   l_pos2    number := 0;
   l_crpos1  number := 0;
   l_crpos2  number;

   function lf_getcol return varchar2
   is
   begin
      l_pos1 := l_pos2;
      l_pos2 := coalesce(
                   nullif(instr(l_buffer, ';', l_pos1+1), 0)
                 , length(l_buffer)+1
                );
      return(substr(l_buffer, l_pos1+1, l_pos2-l_pos1-1));
   end;
begin
   loop
      l_crpos2 := dbms_lob.instr(p_clob, l_eol, l_crpos1+1);
      l_buffer := dbms_lob.substr(p_clob, l_crpos2-l_crpos1-1, l_crpos1+1);
      exit when l_buffer is null;
      l_row.rownr := l_rownr;
      l_pos1 := 0;
      l_pos2 := 0;
      l_row.col1  := lf_getcol;
      l_row.col2  := lf_getcol;
      l_row.col3  := lf_getcol;
      l_row.col4  := lf_getcol;
      l_row.col5  := lf_getcol;
      l_row.col6  := lf_getcol;
      l_row.col7  := lf_getcol;
      l_row.col8  := lf_getcol;
      l_row.col9  := lf_getcol;
      l_row.col10 := lf_getcol;
      l_row.col11 := lf_getcol;
      l_row.col12 := lf_getcol;
      l_row.col13 := lf_getcol;
      l_row.col14 := lf_getcol;
      l_row.col15 := lf_getcol;
      l_row.col16 := lf_getcol;
      l_row.col17 := lf_getcol;
      l_row.col18 := lf_getcol;
      l_row.col19 := lf_getcol;
      l_row.col20 := lf_getcol;
      pipe row(l_row);
      l_crpos1 := l_crpos2;
      l_rownr  := l_rownr + 1;
   end loop;
   return;
exception
   when no_data_needed then
      null;
end parse_csv2;
/
I create a table to log the timing results of the tests:
create table test_log (
   run        number
 , testtype   varchar2(4)
 , num_rows   number
 , parse_time interval day to second(6)
)
/
And then the block that executes the various tests 3 times for 1, 10, 100, 1000 and 10000 rows:
declare
   l_num_rows  number;
   l_clob      clob;
   l_starttime timestamp(6);
   l_endtime   timestamp(6);
begin
   for l_run in 1..3 loop
      for l_power in 0..4 loop
         l_num_rows := power(10, l_power);
         l_clob := create_testdata(l_num_rows, 'CSV');
         l_starttime := systimestamp;
         for parsed_rows in (
            select * from table(parse_csv1(l_clob))
         ) loop
            null;
         end loop;
         l_endtime := systimestamp;
         insert into test_log values ( l_run, 'CSV1', l_num_rows, l_endtime-l_starttime );
         commit;
         --
         l_starttime := systimestamp;
         for parsed_rows in (
            select * from table(parse_csv2(l_clob))
         ) loop
            null;
         end loop;
         l_endtime := systimestamp;
         insert into test_log values ( l_run, 'CSV2', l_num_rows, l_endtime-l_starttime );
         commit;
         --
         l_clob := create_testdata(l_num_rows, 'XML');
         l_starttime := systimestamp;
         for parsed_rows in (
            select *
              from xmltable(
                      '/Rows/Row'
                      passing xmltype( l_clob )
                      columns
                         rownr for ordinality
                       , col1  varchar2(100) path 'Column1'
                       , col2  varchar2(100) path 'Column2'
                       , col3  varchar2(100) path 'Column3'
                       , col4  varchar2(100) path 'Column4'
                       , col5  varchar2(100) path 'Column5'
                       , col6  varchar2(100) path 'Column6'
                       , col7  varchar2(100) path 'Column7'
                       , col8  varchar2(100) path 'Column8'
                       , col9  varchar2(100) path 'Column9'
                       , col10 varchar2(100) path 'Column10'
                       , col11 varchar2(100) path 'Column11'
                       , col12 varchar2(100) path 'Column12'
                       , col13 varchar2(100) path 'Column13'
                       , col14 varchar2(100) path 'Column14'
                       , col15 varchar2(100) path 'Column15'
                       , col16 varchar2(100) path 'Column16'
                       , col17 varchar2(100) path 'Column17'
                       , col18 varchar2(100) path 'Column18'
                       , col19 varchar2(100) path 'Column19'
                       , col20 varchar2(100) path 'Column20'
                   ) xml_rows
         ) loop
            null;
         end loop;
         l_endtime := systimestamp;
         insert into test_log values ( l_run, 'XML', l_num_rows, l_endtime-l_starttime );
         commit;
         --
         -- Comment out this section if the database version is less than 12.1.0.2.0 -- >>>
         l_clob := create_testdata(l_num_rows ,'JSON');
         l_starttime := systimestamp;
         for parsed_rows in (
            select *
              from json_table(
                      l_clob
                    , '$.rows[*]'
                      columns (
                         rownr for ordinality
                       , col1  varchar2(100) path '$.column1'
                       , col2  varchar2(100) path '$.column2'
                       , col3  varchar2(100) path '$.column3'
                       , col4  varchar2(100) path '$.column4'
                       , col5  varchar2(100) path '$.column5'
                       , col6  varchar2(100) path '$.column6'
                       , col7  varchar2(100) path '$.column7'
                       , col8  varchar2(100) path '$.column8'
                       , col9  varchar2(100) path '$.column9'
                       , col10 varchar2(100) path '$.column10'
                       , col11 varchar2(100) path '$.column11'
                       , col12 varchar2(100) path '$.column12'
                       , col13 varchar2(100) path '$.column13'
                       , col14 varchar2(100) path '$.column14'
                       , col15 varchar2(100) path '$.column15'
                       , col16 varchar2(100) path '$.column16'
                       , col17 varchar2(100) path '$.column17'
                       , col18 varchar2(100) path '$.column18'
                       , col19 varchar2(100) path '$.column19'
                       , col20 varchar2(100) path '$.column20'
                      )
                   ) json_rows
         ) loop
            null;
         end loop;
         l_endtime := systimestamp;
         insert into test_log values ( l_run, 'JSON', l_num_rows, l_endtime-l_starttime );
         commit;
         -- <<<
      end loop;
   end loop;
end;
/
I can compare the results of the tests by querying the log and pivoting by test type:
select *
  from (
         select run
              , testtype
              , num_rows
              , to_char(extract(minute from parse_time),'FM09')||':'||
                to_char(extract(second from parse_time),'FM09D999999') parse_time
              , round(
                   1000 * (extract(minute from parse_time)*60+extract(second from parse_time)) / num_rows
                 , 2
                ) ms_per_row
           from (
                  select * from test_log
                  union all
                  select null run
                       , testtype
                       , num_rows
                       , numtodsinterval(avg(
                            extract(minute from parse_time)*60+extract(second from parse_time)
                         ),'second') parse_time
                    from test_log
                   group by testtype, num_rows
                )
       )
 pivot (
         max(parse_time) parse
       , max(ms_per_row) per_row
         for testtype in (
            'CSV1' as csv1
          , 'CSV2' as csv2
          , 'XML'  as xml
          , 'JSON' as json
         )
       )
 order by run, num_rows
/
My results in 3 different versions/environments are shown below (the rows marked (avg) are the average of the three runs):
- Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
RUN | NUM ROWS | CSV1 TOTAL TIME | CSV1 PER ROW (ms) | CSV2 TOTAL TIME | CSV2 PER ROW (ms) | XML TOTAL TIME | XML PER ROW (ms) |
---|---|---|---|---|---|---|---|
1 | 1 | 00:00.005215 | 5,22 | 00:00.00468 | 4,68 | 00:00.008071 | 8,07 |
1 | 10 | 00:00.007828 | 0,78 | 00:00.006784 | 0,68 | 00:00.044712 | 4,47 |
1 | 100 | 00:00.173038 | 1,73 | 00:00.086039 | 0,86 | 00:00.492416 | 4,92 |
1 | 1000 | 00:11.750444 | 11,75 | 00:00.980171 | 0,98 | 00:11.752243 | 11,75 |
1 | 10000 | 19:48.409493 | 118,84 | 00:10.779098 | 1,08 | 17:00.178783 | 102,02 |
2 | 1 | 00:00.002452 | 2,45 | 00:00.01609 | 16,09 | 00:00.007291 | 7,29 |
2 | 10 | 00:00.012376 | 1,24 | 00:00.013253 | 1,33 | 00:00.047517 | 4,75 |
2 | 100 | 00:00.342004 | 3,42 | 00:00.247234 | 2,47 | 00:00.518944 | 5,19 |
2 | 1000 | 00:16.065279 | 16,07 | 00:00.954751 | 0,95 | 00:12.12573 | 12,13 |
2 | 10000 | 20:54.745519 | 125,47 | 00:21.760463 | 2,18 | 17:20.21969 | 104,02 |
3 | 1 | 00:00.001244 | 1,24 | 00:00.010238 | 10,24 | 00:00.005864 | 5,86 |
3 | 10 | 00:00.007862 | 0,79 | 00:00.0059 | 0,59 | 00:00.044587 | 4,46 |
3 | 100 | 00:00.174661 | 1,75 | 00:00.089375 | 0,89 | 00:00.501092 | 5,01 |
3 | 1000 | 00:11.369612 | 11,37 | 00:00.96306 | 0,96 | 00:12.223152 | 12,22 |
3 | 10000 | 20:07.233167 | 120,72 | 00:09.922741 | 0,99 | 17:35.55329 | 105,56 |
(avg) | 1 | 00:00.00297 | 2,97 | 00:00.010336 | 10,34 | 00:00.007075 | 7,08 |
(avg) | 10 | 00:00.009355 | 0,94 | 00:00.008646 | 0,86 | 00:00.045605 | 4,56 |
(avg) | 100 | 00:00.229901 | 2,3 | 00:00.140883 | 1,41 | 00:00.504151 | 5,04 |
(avg) | 1000 | 00:13.061778 | 13,06 | 00:00.965994 | 0,97 | 00:12.033708 | 12,03 |
(avg) | 10000 | 20:16.79606 | 121,68 | 00:14.154101 | 1,42 | 17:18.650588 | 103,87 |
Here in the first set of tests we see that CSV1 grows slower and slower per row as the number of rows increases. It does not scale, which is because of the described problem with the "occurrence" parameter of INSTR. CSV2 does not have that problem and scales well as the number of rows increases. At 10,000 rows CSV2 is about 85 times faster per row than CSV1.
It is interesting that XMLTABLE seems to suffer a scaling problem almost as bad as the CSV1 method; somehow XMLTABLE in this version also grows progressively slower as the number of rows increases. For the original requirement, do-it-yourself parsing of a CSV webservice seems a much better choice than built-in XMLTABLE on an XML webservice.
- Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
RUN | NUM ROWS | CSV1 TOTAL TIME | CSV1 PER ROW (ms) | CSV2 TOTAL TIME | CSV2 PER ROW (ms) | XML TOTAL TIME | XML PER ROW (ms) |
---|---|---|---|---|---|---|---|
1 | 1 | 00:00.004626 | 4,63 | 00:00.004612 | 4,61 | 00:00.047893 | 47,89 |
1 | 10 | 00:00.002274 | 0,23 | 00:00.002644 | 0,26 | 00:00.002777 | 0,28 |
1 | 100 | 00:00.040193 | 0,4 | 00:00.027838 | 0,28 | 00:00.013524 | 0,14 |
1 | 1000 | 00:02.092844 | 2,09 | 00:00.269647 | 0,27 | 00:00.138578 | 0,14 |
1 | 10000 | 03:27.278202 | 20,73 | 00:02.828744 | 0,28 | 00:01.584584 | 0,16 |
2 | 1 | 00:00.000512 | 0,51 | 00:00.000576 | 0,58 | 00:00.001642 | 1,64 |
2 | 10 | 00:00.002183 | 0,22 | 00:00.002663 | 0,27 | 00:00.003235 | 0,32 |
2 | 100 | 00:00.042229 | 0,42 | 00:00.028084 | 0,28 | 00:00.015719 | 0,16 |
2 | 1000 | 00:02.122371 | 2,12 | 00:00.276936 | 0,28 | 00:00.195269 | 0,2 |
2 | 10000 | 03:28.612211 | 20,86 | 00:02.866212 | 0,29 | 00:01.641526 | 0,16 |
3 | 1 | 00:00.001379 | 1,38 | 00:00.001012 | 1,01 | 00:00.001801 | 1,8 |
3 | 10 | 00:00.002225 | 0,22 | 00:00.002936 | 0,29 | 00:00.002867 | 0,29 |
3 | 100 | 00:00.039881 | 0,4 | 00:00.027729 | 0,28 | 00:00.013694 | 0,14 |
3 | 1000 | 00:02.110584 | 2,11 | 00:00.277426 | 0,28 | 00:00.133827 | 0,13 |
3 | 10000 | 03:27.172959 | 20,72 | 00:02.839838 | 0,28 | 00:01.630668 | 0,16 |
(avg) | 1 | 00:00.002172 | 2,17 | 00:00.002067 | 2,07 | 00:00.017112 | 17,11 |
(avg) | 10 | 00:00.002227 | 0,22 | 00:00.002748 | 0,27 | 00:00.00296 | 0,3 |
(avg) | 100 | 00:00.040768 | 0,41 | 00:00.027884 | 0,28 | 00:00.014312 | 0,14 |
(avg) | 1000 | 00:02.1086 | 2,11 | 00:00.27467 | 0,27 | 00:00.155891 | 0,16 |
(avg) | 10000 | 03:27.687791 | 20,77 | 00:02.844931 | 0,28 | 00:01.618926 | 0,16 |
Again CSV1 suffers scalability problems, which CSV2 does not. At 10,000 rows CSV2 is about 75 times faster per row than CSV1.
What is really interesting is that XMLTABLE appears to have improved massively in this version. Once XMLTABLE is "warmed up" it scales well and is about twice as fast as the otherwise good CSV2 method. Now the built-in beats the do-it-yourself approach, so in 12.1.0.1.0 XMLTABLE on an XML webservice looks like it outperforms manual code on a CSV webservice.
- Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
RUN | NUM ROWS | CSV1 TOTAL TIME | CSV1 PER ROW (ms) | CSV2 TOTAL TIME | CSV2 PER ROW (ms) | XML TOTAL TIME | XML PER ROW (ms) | JSON TOTAL TIME | JSON PER ROW (ms) |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 00:00.004476 | 4,48 | 00:00.004232 | 4,23 | 00:00.002809 | 2,81 | 00:00.000414 | 0,41 |
1 | 10 | 00:00.001474 | 0,15 | 00:00.000824 | 0,08 | 00:00.002234 | 0,22 | 00:00.000723 | 0,07 |
1 | 100 | 00:00.053043 | 0,53 | 00:00.009351 | 0,09 | 00:00.010603 | 0,11 | 00:00.006555 | 0,07 |
1 | 1000 | 00:04.797509 | 4,8 | 00:00.111597 | 0,11 | 00:00.104581 | 0,1 | 00:00.050972 | 0,05 |
1 | 10000 | 08:10.989516 | 49,1 | 00:01.220601 | 0,12 | 00:01.050176 | 0,11 | 00:00.623944 | 0,06 |
2 | 1 | 00:00.000408 | 0,41 | 00:00.000206 | 0,21 | 00:00.001499 | 1,5 | 00:00.000253 | 0,25 |
2 | 10 | 00:00.001412 | 0,14 | 00:00.000997 | 0,1 | 00:00.002317 | 0,23 | 00:00.000746 | 0,07 |
2 | 100 | 00:00.06044 | 0,6 | 00:00.009602 | 0,1 | 00:00.009898 | 0,1 | 00:00.005458 | 0,05 |
2 | 1000 | 00:04.840011 | 4,84 | 00:00.15369 | 0,15 | 00:00.148956 | 0,15 | 00:00.053851 | 0,05 |
2 | 10000 | 08:11.884956 | 49,19 | 00:01.141171 | 0,11 | 00:01.049298 | 0,1 | 00:00.564004 | 0,06 |
3 | 1 | 00:00.000295 | 0,3 | 00:00.00022 | 0,22 | 00:00.001325 | 1,33 | 00:00.000258 | 0,26 |
3 | 10 | 00:00.001406 | 0,14 | 00:00.000855 | 0,09 | 00:00.002049 | 0,2 | 00:00.00068 | 0,07 |
3 | 100 | 00:00.054669 | 0,55 | 00:00.009505 | 0,1 | 00:00.010099 | 0,1 | 00:00.005421 | 0,05 |
3 | 1000 | 00:04.821541 | 4,82 | 00:00.112043 | 0,11 | 00:00.098209 | 0,1 | 00:00.051282 | 0,05 |
3 | 10000 | 08:08.371391 | 48,84 | 00:01.213267 | 0,12 | 00:01.072971 | 0,11 | 00:00.534615 | 0,05 |
(avg) | 1 | 00:00.001726 | 1,73 | 00:00.001553 | 1,55 | 00:00.001878 | 1,88 | 00:00.000308 | 0,31 |
(avg) | 10 | 00:00.001431 | 0,14 | 00:00.000892 | 0,09 | 00:00.0022 | 0,22 | 00:00.000716 | 0,07 |
(avg) | 100 | 00:00.056051 | 0,56 | 00:00.009486 | 0,09 | 00:00.0102 | 0,1 | 00:00.005811 | 0,06 |
(avg) | 1000 | 00:04.819687 | 4,82 | 00:00.125777 | 0,13 | 00:00.117249 | 0,12 | 00:00.052035 | 0,05 |
(avg) | 10000 | 08:10.415288 | 49,04 | 00:01.19168 | 0,12 | 00:01.057482 | 0,11 | 00:00.574188 | 0,06 |
As before, CSV1 does not scale but CSV2 does. At 10,000 rows CSV2 is about 445 times faster per row than CSV1.
XMLTABLE again scales consistently well. It is no longer twice as fast as CSV2 but approximately identical. I'd guess what we are seeing here is not a degradation of XMLTABLE, but rather that the PL/SQL code used in CSV2 is somehow compiled a bit more optimally in 12.1.0.2.0 - witness that CSV2 beats CSV1 by a much larger factor here in 12.1.0.2.0 than in 12.1.0.1.0 and 11.2.0.3.0.
It is really interesting that JSON_TABLE also scales well and is about twice as fast as both XMLTABLE and CSV2. Presumably that can be because JSON path expressions are simpler than XQuery, so JSON_TABLE needs to support less complex scenarios. It may also help that no special datatype is involved, just a plain CLOB. Or JSON_TABLE may simply be better written internally, who knows.
The lesson learned is that built-in functions can improve from version to version. Don't trust what you thought you knew was the fastest way - it may change with an upgrade. Keep up to date on new functionality and think about where it may be even faster.
For the original question, where they could choose between letting the internal webservices of the other database publish CSV, XML or JSON and then using the appropriate parsing method in Oracle, it definitely matters what they pick.
If they could use version 12.1.0.2.0 they would get the fastest parsing method and at the same time gain the advantages of JSON_TABLE parsing rather than manually coding it, while spending a little more network bandwidth than CSV but less than XML (a sketch of that approach follows below).
If they have version 12.1.0.1.0 they can get a fast method with XMLTABLE and gain coding that is simpler than manual parsing, but at the cost of the network bandwidth overhead of XML.
And if they are on version 11.2.0.3.0, the desired performance can probably lead to a decision that the overhead of more code for manual CSV parsing is worth it in terms of performance and especially scalability.
If they are on 11.2.0.4.0 or a different version, they should run the test scripts (available here) themselves and see if the XMLTABLE scalability problem is only an issue in 11.2.0.3.0 ;-)
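To illustrate the 12.1.0.2.0 option, here is a minimal sketch of fetching such a webservice response and parsing it with JSON_TABLE. The URL is a hypothetical placeholder, APEX_WEB_SERVICE is just one possible way of getting the response into a CLOB (UTL_HTTP would work too), and the JSON_TABLE call mirrors the one used in the test block above (only the first columns are shown).

declare
   l_json clob;
begin
   -- Hypothetical endpoint; any HTTP client returning the JSON document as a CLOB will do.
   l_json := apex_web_service.make_rest_request(
                p_url         => 'http://other-db.example.com/rest/rows'
              , p_http_method => 'GET'
             );
   for r in (
      select *
        from json_table(
                l_json
              , '$.rows[*]'
                columns (
                   rownr for ordinality
                 , col1 varchar2(100) path '$.column1'
                 , col2 varchar2(100) path '$.column2'
                   -- ... remaining columns as in the test block above
                )
             ) json_rows
   ) loop
      null; -- process the row here
   end loop;
end;
/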
Hey Kim,
I'm in a situation of needing data from a MySQL database and decided on a REST interface on the MySQL side, with manual CSV or XML parsing on the Oracle side (we are on 12.0.1.0 sadly, so no built-in JSON. It's worth mentioning that I tried PLJSON, but it's even slower than the CSV1 method mentioned here).
So thank you for this blog post, valuable information here! :)
Regards,
Daniel
Hi, Daniel
Glad the post could help you :-)
Yes, although the original datasource of the folks I was helping was a SAP HANA, the same can apply to other sources, like MySQL.
Depending on circumstances, an alternative to consider can also be the Heterogeneous Gateway - either a DB-specific (license cost) version or the free generic ODBC version. If dedicated connections over the network between the source and target DB are viable, that can be a great solution. But in these Cloud days, where the DBs can be scattered all over the globe, using REST is often a preferable alternative to persistent connections.
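Just to illustrate what the gateway alternative looks like from the Oracle side (all names here are hypothetical, and the gateway itself also needs an HS configuration plus a listener/tnsnames entry, which is omitted):

-- Hypothetical names throughout; assumes a generic ODBC gateway is already
-- configured and reachable through the TNS alias MYSQLGW.
create database link mysql_link
   connect to "app_user" identified by "app_password"
   using 'MYSQLGW';

-- Object names in the non-Oracle source usually need to be quoted exactly as defined there.
select * from "customer_table"@mysql_link;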
When you get to 12.2, native JSON functionality is greatly expanded and improved. In 12.2 I'd (after testing the actual use case, of course) probably only use manual CSV parsing in extremely specific situations (can't really think of any offhand, but there's bound to be something truly weird ;-)
Have fun parsing.
Cheerio
/Kim
Hey Kim,
Thanks for the quick reply :)
Actually, I had used the DB link method before, backed by a Heterogeneous Gateway with ODBC, but it was just too slow and unreliable when dealing with large amounts of data at once. Not to mention the connection issues and datatype conversion errors which happened a few times. Don't know who to blame there, because I never had the time to fully debug it, sadly.
Also I think the REST way is a bit more developer-friendly (and maybe easier to maintain in the long run), because you only have to deal with a REST interface on one side, which can be in the language of your choice, and PL/SQL on the other side, which is something you probably already know well :)
Unlike the ODBC way, where you have to deal with much more system administration stuff, and which seems harder to debug because it has more components glued together and uses software you are unfamiliar with (like the MySQL ODBC connector).
Regards,
Daniel