Monday, July 24, 2017

2017-07-24: Replacing Heritrix with Chrome in WAIL, and the release of node-warc, node-cdxj, and Squidwarc


I have written posts detailing how an archive's modifications to the JavaScript of a web page being replayed collided with the JavaScript libraries used by the page, and how JavaScript + CORS is a deadly combination during replay. Today I am here to announce the release of a suite of high fidelity web archiving tools that help to mitigate the problems surrounding web archiving and a dynamic, JavaScript-powered web. To demonstrate this, consider the image above: the left-hand screenshot shows today's cnn.com archived and replayed in WAIL, whereas the right-hand screenshot shows cnn.com in the Internet Archive on 2017-07-24T16:00:02.

In this post, I will be covering WAIL, node-warc, node-cdxj, and Squidwarc.


WAIL


Let me begin by announcing that WAIL has transitioned away from using Heritrix as the primary preservation method. Instead, WAIL now directly uses a full Chrome browser (provided by Electron) as the preservation crawler. WAIL does not use WARCreate, Browsertrix, brozzler, Webrecorder, or a derivation of one of these tools, but my own special sauce. The special sauce powering these crawls has been open sourced and made available through node-warc. Rest assured, WAIL still provides auto-configured Heritrix crawls, but the Page Only, Page + Same Domain Links, and Page + All Links crawls now use a full Chrome browser in combination with automatic scrolling of the page and the Pepper Flash Plugin. You can download this version of WAIL today. Oh, and did I mention that WAIL's browser-based crawls pass Mat Kelly's Archival Acid Test?

But I thought WAIL was already using a full Chrome browser and a modified WARCreate for the Page Only crawl? Yes, that is correct, but the key word here is the "modified" in "modified WARCreate". WARCreate was modified for automation: to use Node.js buffers, to re-request every resource besides the fully rendered page, and to work in Electron rather than as an extension. What was shared was saving both the rendered page and the request/response headers. So how did I do this, and what kind of black magic did I use to achieve it? Enter node-warc.

Before I continue


It is easy to forget which tool did this first and continues to do it extremely well. That tool is WARCreate. None of this would be possible if WARCreate had not done it first and I had not cut my teeth on Mat Kelly's projects. So look for this very same functionality to come to WARCreate in the near future, as Chrome and the Extension APIs have matured beyond what was initially available to Mat at WARCreate's inception. It still amazes me that he was able to get WARCreate to do what it does in the hostile environment that is Chrome Extensions. Thanks Mat! Now get that Ph.D. so that we come closer to not being concerned with WARCs containing cookies and other sensitive information.


node-warc


node-warc started out as a Node.js library for reading WARC files, born of my dislike for how other libraries would crash and burn if they encountered an off-by-one error (the webarchiveplayer, OpenWayback, and Pywb indexers). I also wanted to build one that is more performant and has a nicer API than the only other one on npm, which is three years old with no updates and no gzip support. As I worked on Squidwarc and on making the WAIL-provided crawls awesome, node-warc became a good home for the browser-based preservation side of handling WARCs. node-warc is now a one-stop shop for reading and creating WARC files using Node.js.

On the reading side, node-warc supports both gzipped and non-gzipped WARCs. An example of how to get started reading WARCs using node-warc is shown below, with the API documentation available online.
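To make concrete what reading gzipped and non-gzipped WARCs involves, here is a minimal, library-independent sketch using only Node.js built-ins. It is not node-warc's implementation or API; the function names `isGzipped` and `parseWarc` are made up for this illustration, and it reads the whole file into memory, which a streaming parser would not do.

```javascript
const zlib = require('zlib')

// gzip data starts with the two magic bytes 0x1f 0x8b
function isGzipped (buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b
}

// Hypothetical helper: parse every record of an in-memory WARC file
// into { version, headers, content }.
function parseWarc (input) {
  const buf = isGzipped(input) ? zlib.gunzipSync(input) : input
  const records = []
  let offset = 0
  while (offset < buf.length) {
    // skip the CRLF pairs that separate one record from the next
    if (buf[offset] === 0x0d || buf[offset] === 0x0a) {
      offset++
      continue
    }
    // the WARC header block ends at the first empty line
    const headerEnd = buf.indexOf('\r\n\r\n', offset)
    if (headerEnd === -1) break // truncated record: stop instead of crashing
    const lines = buf.toString('utf8', offset, headerEnd).split('\r\n')
    const version = lines.shift()
    const headers = {}
    for (const line of lines) {
      const sep = line.indexOf(':')
      if (sep > 0) headers[line.slice(0, sep).trim()] = line.slice(sep + 1).trim()
    }
    // Content-Length is in bytes, so slice the Buffer rather than a string
    const length = parseInt(headers['Content-Length'], 10) || 0
    const start = headerEnd + 4
    records.push({ version, headers, content: buf.subarray(start, start + length) })
    offset = start + length
  }
  return records
}

// Usage with a tiny hand-made record, once plain and once gzipped
const sample = Buffer.from(
  'WARC/1.0\r\nWARC-Type: resource\r\n' +
  'WARC-Target-URI: http://example.com/\r\n' +
  'Content-Length: 5\r\n\r\nhello\r\n\r\n'
)
for (const file of [sample, zlib.gzipSync(sample)]) {
  const [record] = parseWarc(file)
  console.log(record.headers['WARC-Type'], record.content.toString())
}
```

Relying on Content-Length rather than searching for the next "WARC/1.0" line is what keeps payloads containing blank lines from confusing the parser.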

How performant is node-warc? Below are performance metrics for parsing both gzipped and un-gzipped WARC files of different sizes.

un-gzipped

size       records    time    max process memory
145.9MB    8,026      2s      22 MiB
268MB      852        2s      77 MiB
2GB        76,980     21s     100 MiB
4.8GB      185,662    1m      144.3 MiB

gzipped

size       records    time     max process memory
7.7MB      1,269      297ms    7.1 MiB
819.1MB    34,253     16s      190.3 MiB
2.3GB      68,020     45s      197.6 MiB
5.3GB      269,464    4m       198.2 MiB

Now to the fun part. node-warc provides the means for archiving the web using Electron's provided Chrome browser, or using Chrome or headless Chrome through chrome-remote-interface, a Node.js wrapper for the DevTools Protocol. If you wish to use this library for preservation with Electron, use ElectronRequestCapturer and ElectronWARCGenerator. The Electron archiving capabilities were developed in WAIL and then put into node-warc so that others can build high fidelity web archiving tools using Electron. If you need an example to help you get started, consult wail-archiver.

For use with Chrome via chrome-remote-interface, use RemoteChromeRequestCapturer and RemoteChromeWARCGenerator. The Chrome-specific portion of node-warc came from developing Squidwarc, a high fidelity archival crawler that uses Chrome or Headless Chrome. Both the Electron and remote Chrome WARCGenerator and RequestCapturer share the same DevTools Protocol, but each has its own way of accessing that API. node-warc takes care of that for you by providing a unified API for both Electron and Chrome. The special sauce here is that node-warc retrieves the response body from Chrome/Electron by asking for it, and Chrome/Electron gives it to us. It is that simple. Documentation for node-warc is available at n0tan3rd.github.io/node-warc, and the library is released on GitHub under the MIT license. node-warc welcomes contributions, and I hope that it will be found useful. Download it today using npm (npm install node-warc or yarn add node-warc).
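The "ask for it" idea can be sketched as follows. This is not node-warc's RequestCapturer; `RequestCollector` and `fakeNetwork` are hypothetical names for illustration only. The sketch assumes an object shaped like the `Network` domain exposed by chrome-remote-interface, where `requestWillBeSent` and `responseReceived` register event listeners and `getResponseBody({ requestId })` resolves to `{ body, base64Encoded }`. A stub stands in for the browser so the example runs on its own.

```javascript
// Hypothetical collector: remembers every request the page makes, then asks
// the browser for each response body over the DevTools Protocol.
class RequestCollector {
  constructor (network) {
    this.network = network // DevTools Protocol "Network" domain (or a stub)
    this.requests = new Map()
    network.requestWillBeSent(({ requestId, request }) => {
      this.requests.set(requestId, { url: request.url, method: request.method })
    })
    network.responseReceived(({ requestId, response }) => {
      const entry = this.requests.get(requestId)
      if (entry) entry.status = response.status
    })
  }

  // Resolve to the captured requests, each with its body as a Buffer (or null)
  async withBodies () {
    const captured = []
    for (const [requestId, entry] of this.requests) {
      try {
        const { body, base64Encoded } = await this.network.getResponseBody({ requestId })
        entry.body = Buffer.from(body, base64Encoded ? 'base64' : 'utf8')
      } catch (err) {
        entry.body = null // the browser may have no body for this request
      }
      captured.push(entry)
    }
    return captured
  }
}

// Stub standing in for a real browser connection (illustration only)
function fakeNetwork () {
  const handlers = {}
  return {
    requestWillBeSent: fn => { handlers.request = fn },
    responseReceived: fn => { handlers.response = fn },
    getResponseBody: async () => ({ body: 'aGVsbG8=', base64Encoded: true }),
    emit: (name, params) => handlers[name](params)
  }
}

async function demo () {
  const network = fakeNetwork()
  const collector = new RequestCollector(network)
  network.emit('request', { requestId: '1', request: { url: 'http://example.com/', method: 'GET' } })
  network.emit('response', { requestId: '1', response: { status: 200 } })
  return collector.withBodies()
}

demo().then(captured => {
  for (const { method, url, status, body } of captured) {
    console.log(method, url, status, body.toString())
  }
})
```

With a real browser, the same collector would presumably be handed `client.Network` from a chrome-remote-interface connection after calling `client.Network.enable()`; turning the captured entries into WARC records is the part node-warc's generators handle.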

node-cdxj


The companion library to node-warc is node-cdxj (cdxj on npm), a Node.js library for parsing the CDXJ files commonly used by Pywb. An example of how to use this library is seen below.
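For readers unfamiliar with the format, here is a minimal, library-independent sketch of what parsing CDXJ involves. It does not use node-cdxj's API; `parseCdxjLine` and `parseCdxj` are hypothetical helpers, and the sample line with its JSON field names is illustrative. The sketch assumes each line is a SURT-form URL key, a 14-digit timestamp, and a JSON object, separated by single spaces.

```javascript
// Hypothetical helper: split one CDXJ line into its three parts
function parseCdxjLine (line) {
  const firstSpace = line.indexOf(' ')
  const secondSpace = line.indexOf(' ', firstSpace + 1)
  if (firstSpace === -1 || secondSpace === -1) {
    throw new Error(`not a CDXJ line: ${line}`)
  }
  return {
    surt: line.slice(0, firstSpace), // e.g. com,cnn)/
    timestamp: line.slice(firstSpace + 1, secondSpace), // YYYYMMDDhhmmss
    fields: JSON.parse(line.slice(secondSpace + 1)) // everything after the timestamp is JSON
  }
}

// Parse a whole CDXJ index held in a string, ignoring blank lines
function parseCdxj (text) {
  return text
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(parseCdxjLine)
}

// Illustrative input; the field names are assumptions for this example
const index = [
  'com,cnn)/ 20170724160002 {"url": "http://cnn.com/", "mime": "text/html", "status": "200"}',
  ''
].join('\n')

for (const entry of parseCdxj(index)) {
  console.log(entry.surt, entry.timestamp, entry.fields.url)
}
```

Only the first two spaces are treated as separators, so spaces inside the JSON block are left alone.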

node-cdxj is distributed via GitHub and npm (npm install cdxj or yarn add cdxj). Full API documentation is available at n0tan3rd.github.io/node-cdxj, and the library is released under the MIT license.

Squidwarc


Now that Vitaly Slobodin has stepped down as the maintainer of PhantomJS (it's dead, Jim) in deference to Headless Chrome, it is with great pleasure that I introduce to you today Squidwarc, a high fidelity archival crawler that uses Chrome or Headless Chrome directly. Squidwarc aims to address the need for a high fidelity crawler akin to Heritrix while still being easy enough for the personal archivist to set up and use. Squidwarc does not seek (at the moment) to dethrone Heritrix as the queen of wide archival crawls; rather, it seeks to address Heritrix's shortcomings, namely:
  • No JavaScript execution
  • Everything is plain text
  • Requiring configuration to know how to preserve the web
  • Setup time and technical knowledge required of its users
Those are some bold (cl)aims. Yes, they are, but in comparison to other web archiving tools, using Chrome as the crawler makes sense. Plus, to quote Vitaly Slobodin:
"Chrome is faster and more stable than PhantomJS. And it doesn't eat memory like crazy."
So why work hard when you can let the Chrome devs do a lot of the hard work for you? They must keep up with the crazy fast-changing world of web development, so why should the web archiving community not use that to its advantage? I think it should, at least, and that is why I created Squidwarc. Which reminds me of the series of articles Kalev Leetaru wrote, entitled "Why Are Libraries Failing At Web Archiving And Are We Losing Our Digital History?", "Are Web Archives Failing The Modern Web: Video, Social Media, Dynamic Pages and The Mobile Web", and "The Internet Archive Turns 20: A Behind The Scenes Look At Archiving The Web". Well, sir, I present to you Squidwarc, an archival crawler that can handle the ever-changing and dynamic web. I have shown you mine; what do your crawlers look like?

Squidwarc is an HTTP/1.1 and HTTP/2, GET, POST, HEAD, OPTIONS, PUT, DELETE preserving, JavaScript executing, page interacting archival crawler (just to name a few). And yes, it can do all that. If you doubt me, see the documentation for what Squidwarc is capable of through chrome-remote-interface and node-warc. Squidwarc is different from brozzler in that it supports both Chrome and Headless Chrome right out of the box, does not require a middle man to capture the requests and create the WARC, makes full use of the DevTools Protocol thanks to being a Node.js-based crawler (Google approved), and is simpler to set up and use.

So what can be done with Squidwarc at its current stage? I created a video, which can be viewed below, demonstrating the technique described by Dr. Justin Brunelle in Adapting the Hypercube Model to Archive Deferred Representations and Their Descendants. The code used in this video is on GitHub, as is Squidwarc itself.



The crawls operate in terms of a Composite Memento. For those who are unfamiliar with this terminology, a composite memento is a root resource, such as an HTML web page, and all of the embedded resources (images, CSS, etc.) required for a complete presentation. An example crawl configuration (JSON files currently) is seen below; note that the annotations (comments) are not valid JSON. A non-annotated configuration file is provided in the Squidwarc GitHub repository.

The definitions of the crawl modes are seen below; remember that Squidwarc crawls operate in terms of a composite memento. The frontier consists of web pages, not web pages plus the resources of each web page (Chrome retrieves those for us automatically).
page-only
Preserve the page so that there is no difference when replaying the page from viewing the page in a web browser at preservation time
page-same-domain
Page Only option plus preserve all links found on the page that are on the same domain as the page
page-all-links
page-same-domain plus all links from other domains
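The three modes can be read as rules for which of a page's links join the frontier. The sketch below is one possible reading, not Squidwarc's code: `nextFrontier` is a hypothetical function, the URLs are placeholders, and "same domain" is assumed to mean an identical hostname.

```javascript
const { URL } = require('url')

// Hypothetical helper: given a crawl mode, the page just preserved, and the
// links found on it, return the pages that should be added to the frontier.
function nextFrontier (mode, pageUrl, links) {
  const page = new URL(pageUrl)
  // resolve relative links against the page and drop duplicates
  const resolved = [...new Set(links.map(href => new URL(href, page).href))]
  switch (mode) {
    case 'page-only':
      return [] // only the seed page (and its embedded resources) is preserved
    case 'page-same-domain':
      // assumption: "same domain" means the same hostname as the page
      return resolved.filter(link => new URL(link).hostname === page.hostname)
    case 'page-all-links':
      return resolved
    default:
      throw new Error(`unknown crawl mode: ${mode}`)
  }
}

// Usage with placeholder URLs
const links = ['page2.html', 'https://example.com/x', 'page2.html']
for (const mode of ['page-only', 'page-same-domain', 'page-all-links']) {
  console.log(mode, nextFrontier(mode, 'https://example.org/a/index.html', links))
}
```

Embedded resources never appear in the returned list, which matches the point that the frontier holds pages only.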
Below is a video demonstrating that the Squidwarc crawls do in fact preserve only the page, page + same domain links, and page + all links using the initial seed n0tan3rd.github.io/wail/sameDomain1.



Squidwarc is an open source project and is available on GitHub. Squidwarc is not yet available via npm, but you can begin using it by cloning the repo. Let us build the next phase of web archiving together. Squidwarc welcomes all who wish to be part of its development, and if you run into any issues, feel free to open one on GitHub.

Both WAIL and Squidwarc use node-warc and a Chrome browser for preservation. If portability and no setup are what you seek, download and start using WAIL. If you just want to use the crawler, clone the Squidwarc repository and begin preserving the web using your Chrome browser today. All the projects in this blog post welcome contributions as well as issues via GitHub. The excuse of "I guess the crawler was not configured to preserve this page" and the term "unarchivable page" are no longer valid in the age of browser-based preservation.

- John Berlin
