Scraping and querying Oracle Database Developer Choice Awards votes - Part 2
In Part 1 I demonstrated how to "scrape" data from the live webpages showing the votes of the Oracle Database Developer Choice Awards. Unfortunately those webpages are not ideal for scraping, so I promised a workaround. Here I do semi-manual scraping of the data: I use a browser to retrieve the actual HTML, save it to a file, and then let the code parse the data out of the file.
In the live version, the first thing I had to do was grant privileges to let the database perform HTTP calls. That is not necessary when doing it this way, so this method can also be useful in cases where your sysadmin won't give you permission to open the firewall for database callouts.
Instead of HTTP privileges, I need to create a directory object. It should point to a directory on the database server (or at least a network share mounted on a path on the database server). I grant privileges on the directory to the SCOTT user.
create or replace directory odevchoice as '/home/oracle/odevchoice'
/

grant read, write on directory odevchoice to scott
/
In the SCOTT schema I create the types and table just like I did in Part 1. The only difference is that I have added a DATE column, CACHED, to the table. This was requested by Steven Feuerstein, so that we have a history of the voting data.
create type oddca_nominee_type as object (
   category varchar2(10)
 , idea_id  integer
 , name     varchar2(30)
 , score    integer
 , votes    integer
)
/

create type oddca_nominee_table_type as table of oddca_nominee_type
/

create table oddca_nominee_cache (
   cached   date
 , category varchar2(10)
 , idea_id  integer
 , name     varchar2(30)
 , score    integer
 , votes    integer
)
/
The package specification is unchanged from Part 1.
create or replace package oddca
as
   subtype category_type is varchar2(10);

   CATEGORY_SQL      constant category_type := 'sql';
   CATEGORY_PLSQL    constant category_type := 'plsql';
   CATEGORY_ORDS     constant category_type := 'ords';
   CATEGORY_APEX     constant category_type := 'apex';
   CATEGORY_DBDESIGN constant category_type := 'db-design';

   function nominees(
      p_category category_type
   )
      return oddca_nominee_table_type pipelined;

   procedure populate_cache(
      p_category category_type
   );
end oddca;
/
In the body I need a few changes. The major one is in the get_html function, where I now open a BFILE and use DBMS_LOB.LOADCLOBFROMFILE. This retrieves the HTML from a file instead of via an HTTP call, but still returns the HTML as a CLOB.
create or replace package body oddca
as
   function get_html(
      p_category category_type
   )
      return clob
   is
      html         clob;
      src          bfile;
      dest_offset  integer := 1;
      src_offset   integer := 1;
      lang_context integer := dbms_lob.default_lang_ctx;
      warning      integer;
   begin
      if p_category not in (
         CATEGORY_SQL
       , CATEGORY_PLSQL
       , CATEGORY_ORDS
       , CATEGORY_APEX
       , CATEGORY_DBDESIGN
      ) then
         raise_application_error(-20000, 'Invalid award category');
      end if;

      dbms_lob.createtemporary(html, false, dbms_lob.call);
      src := bfilename('ODEVCHOICE', p_category || '-voting.html');
      dbms_lob.open(src, dbms_lob.file_readonly);
      dbms_lob.loadclobfromfile(
         html
       , src
       , dbms_lob.lobmaxsize
       , dest_offset
       , src_offset
       , nls_charset_id('UTF8')
       , lang_context
       , warning
      );
      dbms_lob.close(src);
      return html;
   end get_html;
The second helper function needs no change; the content of the HTML is no different when it is read from a file rather than over HTTP.
   function html_to_xml(
      p_html clob
   )
      return xmltype
   is
      cutout clob;
      pos1   integer;
      pos2   integer;
      data   xmltype;
   begin
      pos1 := dbms_lob.instr(p_html, '<div class="display-idea-list"', 1);
      pos2 := dbms_lob.instr(p_html, '<span id="jive-whats-new-more">', pos1 + 1);

      if pos1 is null or pos2 is null or pos1 = 0 or pos2 = 0 then
         raise_application_error(-20001, 'Not expected HTML content');
      end if;

      dbms_lob.createtemporary(cutout, false, dbms_lob.call);
      dbms_lob.copy(cutout, p_html, pos2 - pos1, 1, pos1);

      data := xmltype(
         '<doc>'
         || replace(replace(replace(replace(
               cutout
             , '&' || 'rsquo;' , ' ')
             , '&' || 'hellip;', ' ')
             , '&' || 'nbsp;'  , ' ')
             , '<br>', '<br/>')
         || '</doc>'
      );

      dbms_lob.freetemporary(cutout);
      return data;
   end html_to_xml;
And the table function is also unchanged, since the XML returned from html_to_xml is unchanged.
   function nominees(
      p_category category_type
   )
      return oddca_nominee_table_type pipelined
   is
      html clob;
      data xmltype;
   begin
      html := get_html(p_category);
      data := html_to_xml(html);

      for nominees in (
         select oddca_nominee_type(
                   p_category
                 , to_number(regexp_replace(ideaid, '[^[:digit:]]')) /* 'ideaListDiv1234' => 1234 */
                 , trim(name)
                 , to_number(regexp_replace(score , '[^[:digit:]]'))
                 , to_number(regexp_replace(votes , '[^[:digit:]]')) /* '123 votes' => 123 */
                ) nominee
           from xmltable(
                   '/doc/*'
                   passing data
                   columns ideaid varchar2(100) path '@id'
                         , name   varchar2(100) path 'div/div[@class="jive-content"]/div[@class="jive-content-header"]/div[@class="jive-content-title"]/h2/a'
                         , score  varchar2(100) path 'div/div[@class="jive-ideas-list-scoreblock"]/div/strong'
                         , votes  varchar2(100) path 'div/div[@class="jive-ideas-list-scoreblock"]/span[@class="jive-score-meta font-color-meta"]/span[@class="idea-vote-count"]'
                )
      ) loop
         pipe row (nominees.nominee);
      end loop;
   end nominees;
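Once the HTML files have been saved, the pipelined function can be queried directly, which is a handy way to verify the parsing before caching anything. A minimal sketch (assuming sql-voting.html has already been saved to the directory):

```sql
-- Query the pipelined table function directly to eyeball the parsed data
select *
  from table(oddca.nominees(oddca.CATEGORY_SQL))
 order by score desc
/
```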
The procedure that caches the data is slightly changed from Part 1. This time I do not delete the cached content; I only insert, and I stamp each insert with a timestamp in the CACHED column. Because the data is used internationally, I decided to store the timestamp in UTC. I truncate it to "hour" precision, which of course only works as long as I don't populate the cache more than once an hour; since I expect to cache the voting history once a day, that is fine. Keeping the hour (rather than just the date) still tells me whether 20 or 28 hours passed since the last run, as I cannot be certain to do this at the same hour every day. Not storing minute or second precision is simply laziness on my part: it makes querying at hour precision easier, and I don't need finer granularity.
   procedure populate_cache(
      p_category category_type
   )
   is
   begin
      insert into oddca_nominee_cache
      select trunc(sys_extract_utc(systimestamp), 'HH24')
           , n.category
           , n.idea_id
           , n.name
           , n.score
           , n.votes
        from table(oddca.nominees(p_category)) n;
   end populate_cache;
end oddca;
/
So now I can open these URLs in a browser:
https://community.oracle.com/community/database/awards/sql-voting
https://community.oracle.com/community/database/awards/plsql-voting
https://community.oracle.com/community/database/awards/ords-voting
https://community.oracle.com/community/database/awards/apex-voting
https://community.oracle.com/community/database/awards/db-design-voting
On each of those 5 pages I right-click, choose Save as, and save as "HTML only" to the directory I created, naming the files sql-voting.html, plsql-voting.html, and so on.
And then I call my cache population procedure:
begin
   oddca.populate_cache(oddca.CATEGORY_SQL);
   oddca.populate_cache(oddca.CATEGORY_PLSQL);
   oddca.populate_cache(oddca.CATEGORY_ORDS);
   oddca.populate_cache(oddca.CATEGORY_APEX);
   oddca.populate_cache(oddca.CATEGORY_DBDESIGN);
   commit;
end;
/
This I can run about once a day, or however often I wish to save the voting history.
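The database-side part of the daily run could even be automated with a DBMS_SCHEDULER job. A sketch of that idea (the job name DAILY_ODDCA_CACHE is just something I made up, and note that the browser "Save as" step still has to be done manually before the job runs):

```sql
begin
   dbms_scheduler.create_job(
      job_name        => 'DAILY_ODDCA_CACHE'  -- hypothetical job name
    , job_type        => 'PLSQL_BLOCK'
    , job_action      => q'[
         begin
            oddca.populate_cache(oddca.CATEGORY_SQL);
            oddca.populate_cache(oddca.CATEGORY_PLSQL);
            oddca.populate_cache(oddca.CATEGORY_ORDS);
            oddca.populate_cache(oddca.CATEGORY_APEX);
            oddca.populate_cache(oddca.CATEGORY_DBDESIGN);
            commit;
         end;
      ]'
    , repeat_interval => 'FREQ=DAILY;BYHOUR=8'  -- once a day at 08:00
    , enabled         => true
   );
end;
/
```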
Having a voting history, I can now study how much each nominee's votes increase over time:
select to_char(cached, 'Mon dd, HH24"h"') hour
     , category
     , name
     , score
     , votes
     , negative
     , positive
     , round(100 * positive / nullif(votes, 0), 1) pos_pct
     , dense_rank() over (partition by cached, category order by score desc) rnk_score
     , dense_rank() over (partition by cached, category order by positive desc) rnk_pos
     , score    - lag(score)    over (partition by category, name order by cached) inc_score
     , positive - lag(positive) over (partition by category, name order by cached) inc_pos
  from (
         select cached
              , category
              , name
              , score
              , votes
              , (votes - score / 10) / 2              as negative
              , score / 10 + (votes - score / 10) / 2 as positive
           from oddca_nominee_cache
       ) s1
 order by cached desc, category, score desc
/
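The NEGATIVE and POSITIVE columns in the inline view rest on the assumption (implied by the formulas) that every vote is worth either +10 or -10 points, so score = 10 * (positive - negative) and votes = positive + negative. Solving those two equations for positive and negative gives the expressions used above, which can be sanity-checked on one known row (Jari Laine: score 680, votes 80):

```sql
-- Checking the derivation: expect negative = 6 and positive = 74
select (80 - 680/10) / 2            as negative  -- (votes - score/10) / 2
     , 680/10 + (80 - 680/10) / 2   as positive  -- score/10 + negative
  from dual
/
```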
Notice how my pal Erik van Roon jumps up the ladder from 8th place to 5th place between October 2nd and 3rd? :-)
HOUR        CATEGORY  NAME               SCORE VOTES NEGATIVE POSITIVE POS_PCT RNK_SCORE RNK_POS INC_SCORE INC_POS
----------- --------- ------------------ ----- ----- -------- -------- ------- --------- ------- --------- -------
Oct 05, 04h apex      Jari Laine           680    80        6       74    92.5         1       1         0       0
Oct 05, 04h apex      Morten Braten        660    76        5       71    93.4         2       2         0       0
Oct 05, 04h apex      Juergen Schuster     510    55        2       53    96.4         3       3         0       0
Oct 05, 04h apex      Kiran Pawar          400    50        5       45    90.0         4       4         0       0
Oct 05, 04h apex      Karen Cannell        260    30        2       28    93.3         5       5        10       1
Oct 05, 04h apex      Paul MacMillan       130    29        8       21    72.4         6       6         0       0
Oct 05, 04h apex      Trent Schafer        110    21        5       16    76.2         7       7         0       0
Oct 05, 04h db-design Heli Helskyaho       850   103        9       94    91.3         1       1         0       0
Oct 05, 04h db-design Michelle Kolbe       580    68        5       63    92.6         2       2         0       0
Oct 05, 04h db-design Rob Lockard          410    51        5       46    90.2         3       3         0       0
Oct 05, 04h db-design Mark Hoxey            90    23        7       16    69.6         4       4        10       1
Oct 05, 04h ords      Dietmar Aust        1170   125        4      121    96.8         1       1         0       0
Oct 05, 04h ords      Dimitri Gielis       790    91        6       85    93.4         2       2         0       0
Oct 05, 04h ords      Morten Braten        420    56        7       49    87.5         3       3         0       0
Oct 05, 04h ords      Kiran Pawar          340    42        4       38    90.5         4       4         0       0
Oct 05, 04h ords      Anton Nielsen        240    32        4       28    87.5         5       5         0       0
Oct 05, 04h ords      Tim St. Hilaire      130    19        3       16    84.2         6       6         0       0
Oct 05, 04h plsql     Adrian Billington    970   103        3      100    97.1         1       2        20       2
Oct 05, 04h plsql     Roger Troller        870   115       14      101    87.8         2       1         0       0
Oct 05, 04h plsql     Sean Stuber          750    87        6       81    93.1         3       3        10       1
Oct 05, 04h plsql     Patrick Barel        650    85       10       75    88.2         4       4         0       0
Oct 05, 04h plsql     Morten Braten        620    74        6       68    91.9         5       5         0       0
Oct 05, 04h plsql     Kim Berg Hansen      410    55        7       48    87.3         6       6         0       0
Oct 05, 04h plsql     Bill Coulam          380    50        6       44    88.0         7       7        10       1
Oct 05, 04h sql       Emrah Mete          2980   366       34      332    90.7         1       1         0       1
Oct 05, 04h sql       Sayan Malakshinov   1750   239       32      207    86.6         2       2        30       3
Oct 05, 04h sql       Sean Stuber          670   103       18       85    82.5         3       3        10       1
Oct 05, 04h sql       Kim Berg Hansen      620    98       18       80    81.6         4       4        10       1
Oct 05, 04h sql       Erik Van Roon        300    60       15       45    75.0         5       5        30       3
Oct 05, 04h sql       Matthias Rogel       280    46        9       37    80.4         6       7        10       1
Oct 05, 04h sql       Justin Cave          270    55       14       41    74.5         7       6         0       0
Oct 05, 04h sql       Stew Ashton          250    49       12       37    75.5         8       7         0       0
Oct 03, 13h apex      Jari Laine           680    80        6       74    92.5         1       1         0       0
Oct 03, 13h apex      Morten Braten        660    76        5       71    93.4         2       2         0       0
Oct 03, 13h apex      Juergen Schuster     510    55        2       53    96.4         3       3         0       0
Oct 03, 13h apex      Kiran Pawar          400    50        5       45    90.0         4       4         0       0
Oct 03, 13h apex      Karen Cannell        250    29        2       27    93.1         5       5        10       1
Oct 03, 13h apex      Paul MacMillan       130    29        8       21    72.4         6       6         0       0
Oct 03, 13h apex      Trent Schafer        110    21        5       16    76.2         7       7         0       0
Oct 03, 13h db-design Heli Helskyaho       850   103        9       94    91.3         1       1         0       0
Oct 03, 13h db-design Michelle Kolbe       580    68        5       63    92.6         2       2         0       0
Oct 03, 13h db-design Rob Lockard          410    51        5       46    90.2         3       3         0       0
Oct 03, 13h db-design Mark Hoxey            80    22        7       15    68.2         4       4        10       1
Oct 03, 13h ords      Dietmar Aust        1170   125        4      121    96.8         1       1        10       1
Oct 03, 13h ords      Dimitri Gielis       790    91        6       85    93.4         2       2        10       1
Oct 03, 13h ords      Morten Braten        420    56        7       49    87.5         3       3         0       0
Oct 03, 13h ords      Kiran Pawar          340    42        4       38    90.5         4       4         0       0
Oct 03, 13h ords      Anton Nielsen        240    32        4       28    87.5         5       5         0       0
Oct 03, 13h ords      Tim St. Hilaire      130    19        3       16    84.2         6       6        20       2
Oct 03, 13h plsql     Adrian Billington    950   101        3       98    97.0         1       2        10       1
Oct 03, 13h plsql     Roger Troller        870   115       14      101    87.8         2       1         0       0
Oct 03, 13h plsql     Sean Stuber          740    86        6       80    93.0         3       3        30       3
Oct 03, 13h plsql     Patrick Barel        650    85       10       75    88.2         4       4         0       0
Oct 03, 13h plsql     Morten Braten        620    74        6       68    91.9         5       5         0       0
Oct 03, 13h plsql     Kim Berg Hansen      410    55        7       48    87.3         6       6         0       0
Oct 03, 13h plsql     Bill Coulam          370    49        6       43    87.8         7       7         0       0
Oct 03, 13h sql       Emrah Mete          2980   364       33      331    90.9         1       1       100      10
Oct 03, 13h sql       Sayan Malakshinov   1720   236       32      204    86.4         2       2        30       3
Oct 03, 13h sql       Sean Stuber          660   102       18       84    82.4         3       3        30       3
Oct 03, 13h sql       Kim Berg Hansen      610    97       18       79    81.4         4       4        10       1
Oct 03, 13h sql       Erik Van Roon        270    57       15       42    73.7         5       5       100      10
Oct 03, 13h sql       Justin Cave          270    55       14       41    74.5         5       6         0       0
Oct 03, 13h sql       Matthias Rogel       270    45        9       36    80.0         5       8         0       0
Oct 03, 13h sql       Stew Ashton          250    49       12       37    75.5         6       7         0       0
Oct 02, 08h apex      Jari Laine           680    80        6       74    92.5         1       1
Oct 02, 08h apex      Morten Braten        660    76        5       71    93.4         2       2
Oct 02, 08h apex      Juergen Schuster     510    55        2       53    96.4         3       3
Oct 02, 08h apex      Kiran Pawar          400    50        5       45    90.0         4       4
Oct 02, 08h apex      Karen Cannell        240    28        2       26    92.9         5       5
Oct 02, 08h apex      Paul MacMillan       130    29        8       21    72.4         6       6
Oct 02, 08h apex      Trent Schafer        110    21        5       16    76.2         7       7
Oct 02, 08h db-design Heli Helskyaho       850   103        9       94    91.3         1       1
Oct 02, 08h db-design Michelle Kolbe       580    68        5       63    92.6         2       2
Oct 02, 08h db-design Rob Lockard          410    51        5       46    90.2         3       3
Oct 02, 08h db-design Mark Hoxey            70    21        7       14    66.7         4       4
Oct 02, 08h ords      Dietmar Aust        1160   124        4      120    96.8         1       1
Oct 02, 08h ords      Dimitri Gielis       780    90        6       84    93.3         2       2
Oct 02, 08h ords      Morten Braten        420    56        7       49    87.5         3       3
Oct 02, 08h ords      Kiran Pawar          340    42        4       38    90.5         4       4
Oct 02, 08h ords      Anton Nielsen        240    32        4       28    87.5         5       5
Oct 02, 08h ords      Tim St. Hilaire      110    17        3       14    82.4         6       6
Oct 02, 08h plsql     Adrian Billington    940   100        3       97    97.0         1       2
Oct 02, 08h plsql     Roger Troller        870   115       14      101    87.8         2       1
Oct 02, 08h plsql     Sean Stuber          710    83        6       77    92.8         3       3
Oct 02, 08h plsql     Patrick Barel        650    85       10       75    88.2         4       4
Oct 02, 08h plsql     Morten Braten        620    74        6       68    91.9         5       5
Oct 02, 08h plsql     Kim Berg Hansen      410    55        7       48    87.3         6       6
Oct 02, 08h plsql     Bill Coulam          370    49        6       43    87.8         7       7
Oct 02, 08h sql       Emrah Mete          2880   354       33      321    90.7         1       1
Oct 02, 08h sql       Sayan Malakshinov   1690   233       32      201    86.3         2       2
Oct 02, 08h sql       Sean Stuber          630    99       18       81    81.8         3       3
Oct 02, 08h sql       Kim Berg Hansen      600    96       18       78    81.3         4       4
Oct 02, 08h sql       Justin Cave          270    55       14       41    74.5         5       5
Oct 02, 08h sql       Matthias Rogel       270    45        9       36    80.0         5       7
Oct 02, 08h sql       Stew Ashton          250    49       12       37    75.5         6       6
Oct 02, 08h sql       Erik Van Roon        170    47       15       32    68.1         7       8

96 rows selected.
So this is the alternative scraping method, for cases where either the database is not allowed to call out, or the webpages to be scraped are so heavily designed for interactive browser use that scraping them via HTTP calls is impossible or prohibitively difficult. For those cases, retrieving the HTML via the browser can work, and the parsing code can then work on the saved HTML files.
If you found this helpful, why don't you consider voting for me in the Oracle Database Developer Choice Awards? :-)