Headless WebKit with PhantomJS
PhantomJS is just pure awesomeness. I gave a presentation on it at Well.ca and talked about how one can use it to facilitate website testing (with CasperJS). At the time we just encountered a bug with the website, so it was fitting. Having a headless WebKit can do so much for you other than just tesitng too!
Taking Screenshots
For another project I was working on, I wanted to capture webpages and save them as images. Before I discovered PhantomJS, I used CutyCapt and Qt graphics-dojo example. PhantomJS is much much simpler to use though. Check out the official rasterize example for details.
Fetching Actual Website Content
Another thing that’s great about having a headless browser is that you can use it to fetch the actual content of a website. Nowadays so many sites are javascript driven so simply fetching the initial HTML won’t do. Below is how you can use PhantomJS to capture the actual content.
var page = require('webpage').create(),
system = require('system'),
fs = require('fs'),
address, output;
if (system.args.length != 3) {
console.log('Usage: grab.js URL filename');
phantom.exit(1);
} else {
address = system.args[1];
output = system.args[2];
page.open(address, function (status) {
if (status !== 'success') {
console.log('Unable to load the address!');
phantom.exit();
} else {
window.setTimeout(function () {
var results = page.evaluate(function() {
return document.documentElement.innerHTML;
});
try {
var f = fs.open(output, "w");
f.write(results);
f.close();
} catch (e) {
console.log(e);
}
phantom.exit();
}, 200);
}
});
}
Save the code to grab.js and run it via PhantomJS by providing the URL to fetch and the output file to save the content to.