client/verb
	Test_Any_Site(url as text) // eg: www.google.com
		world << "TESTING - [url]"
		var/site = get_website("http://[url]")
		world << "site length [length(site)]"

proc
	get_website(url)
		var/site_text = ""
		var/refresh = 0
		while(!site_text && refresh < 10)
			refresh++
			var/list/response = world.Export(url)
			if(!response) // Site is not loading at all
				world << "ATTEMPT [refresh] FAILED @ <[url]> - NO SITE - NEXT TRY [2*refresh] seconds"
				sleep(refresh * 20) // Pause between retries for a better chance of getting a page (longer each time)
				continue
			site_text = file2text(response["CONTENT"])
			if(length(site_text) < 10) // Arbitrary length to catch bad sites
				world << "ATTEMPT [refresh] FAILED @ <[url]> - BAD SITE - [site_text] - NEXT TRY [2*refresh] seconds"
				site_text = ""
				sleep(refresh * 20) // Pause between retries for a better chance
		if(!site_text) // If, even after all that, no site was loaded, return null
			world << "# ERROR # ATTEMPT [refresh] - NO SITE @ <[url]>"
			return
		return site_text
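For reference, world.Export() on an http:// address returns an associative list whose "CONTENT" entry is a file. A minimal sketch of logging the other entries as well, to help diagnose the bad responses (assuming the server populates a "STATUS" entry; key names other than "CONTENT" may vary by server), might look like:

proc
	debug_export(url)
		var/list/response = world.Export(url)
		if(!response)
			world << "No response at all from <[url]>"
			return
		// Dump every key the server returned, so a bad response
		// can be compared against a good one
		for(var/key in response)
			if(key == "CONTENT")
				world << "CONTENT length: [length(file2text(response[key]))]"
			else
				world << "[key] = [response[key]]"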
Problem description:
While trying to retrieve a webpage in order to parse its contents, I occasionally receive a bad result when I don't believe I should.
I am not sure if this is a code problem or a web page problem. Occasionally, instead of getting the webpage, I get something along the lines of:
‹
and that is it.
It doesn't happen all the time (maybe 10% of the time) and it doesn't happen with all sites. One site I seem to have the most trouble with is: http://www.politifact.com/personalities/
It should be returning the page as a plain text string, shouldn't it?