2016-09-26: IIPC Building Better Crawlers Hackathon Trip Report

Trip Report for the IIPC Building Better Crawlers Hackathon in London, UK.                           

On September 22-23, 2016, I attended the IIPC Building Better Crawlers Hackathon (#iipchack) at the British Library in London, UK. Having been to London almost exactly two years earlier for the Digital Libraries 2014 conference, I was excited to go back, but I was even more eager to collaborate with some folks I had long been in contact with during my tenure as a PhD student researcher at ODU.

The event was well organized yet loosely scheduled, resembling an "Unconference" more than a Hackathon in that the discussion topics were defined as the event progressed rather than a larger portion of the time being devoted to implementation (see the recent Archives Unleashed 1.0 and 2.0 trip reports). The represented organizations were:

Day 0

As everyone arrived at the event from abroad and locally, the event organizer Olga Holownia invited the attendees to an informal get-together at The Skinners Arms. There the conversation was casual but frequently veered into aspects of web archiving and brain picking, which we were repeatedly encouraged to "Save for Tomorrow".

Day 1

The first day began with Andy Jackson (@anjacks0n) welcoming everyone and thanking them for coming despite the short notice and announcement of the event over the Summer. He and Gil Hoggarth (@grhggrth), both of the British Library, kept detailed notes of the proceedings as they progressed, with Andy maintaining an editable open document that other attendees could collaborate on.

Tom Cramer (@tcramer) of Stanford, who mentioned he had organized hackathons in the past, encouraged everyone in attendance (14 in number) to introduce themselves and give a synopsis of their role and their previous work at their respective institutions. To stimulate conversation, he also asked how we could go about making crawling tools accessible to non-specialists in web archiving.

The responding discussion initiated a theme that ran throughout the hackathon: web archiving from a web browser.

One tool to accomplish this is Brozzler from Internet Archive, which combines warcprox and Chromium to preserve HTTP content sent over the wire into the WARC format. I had previously attempted to get Brozzler (originally forked from Umbra) up and running but was not successful. Other attendees either had previously tried it or had not heard of the software. This transitioned later into Noah Levitt (of Internet Archive) giving an interactive, audience-participation walkthrough of installing, setting up, and using Brozzler.

Prior to the interactive portion of the event, however, Jefferson Bailey (@jefferson_bail) of Internet Archive started a presentation by speaking about WASAPI (Web Archiving Systems API), a specification for defining data transfer of web archives. The specification is a collaboration with the University of North Texas, Rutgers, Stanford via LOCKSS, and other organizations. Jefferson emphasized that the specification is not implementation-specific; it does not get into issues like access control, parameters of a specific path, etc. The rationale was that the spec would serve not just as a preservation data transport tool but also as a means of data transfer for researchers. Their in-development implementation takes WARCs, pulls out data to generate a derivative WARC, then defines a Hadoop job using Pig syntax. Noah Levitt added that the Jobs API requires you to supply an operation like "Build CDX" and the WARCs on which you want to perform the operation.
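To make that job-submission idea concrete, here is a minimal sketch of what such an interaction might look like over HTTP. The base URL, endpoint paths, field names, and credentials below are hypothetical illustrations for this post, not the actual WASAPI specification or the Internet Archive's implementation:

```python
import requests

# Hypothetical base URL, paths, and field names for illustration only.
API = "https://webdata.example.org/wasapi/v1"
AUTH = ("username", "password")

# List web data files for a (made-up) collection.
files = requests.get(API + "/webdata",
                     params={"collection": 1234, "filetype": "warc"},
                     auth=AUTH).json()

# Submit a job: an operation name plus a query selecting the WARCs to operate on.
job = requests.post(API + "/jobs",
                    json={"function": "build-cdx", "query": "collection:1234"},
                    auth=AUTH).json()
print(job.get("state"))
```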

In a typical non-linear unconference fashion (also exhibited in this blog post), Noah then gave details on Brozzler (presentation slides). With a room full of Mac and Linux users, installation proved particularly challenging. One issue I had previously run into was latency in starting RethinkDB. Colin Rosenthal (@colinrosenthal) exhibited the same issue on Linux as I did on Mac. Noah's machine, which he showed in a demo as having the exact same versions of all the dependencies I had installed, did not show this latency, so Your Mileage Might Vary with installation, but in the end both Colin and I (and possibly others) were successful in crawling a few URIs using Brozzler.
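One workaround for that startup latency is simply to wait until RethinkDB is accepting connections before launching the Brozzler components. The sketch below assumes the rethinkdb Python driver is installed; the driver's exact API has changed across releases, so treat it as illustrative:

```python
import time
import rethinkdb as r  # import style of the older (circa 2016) driver

def wait_for_rethinkdb(host="localhost", port=28015, timeout=60):
    """Poll RethinkDB until it accepts connections or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            conn = r.connect(host=host, port=port)
            conn.close()
            return True
        except Exception:
            time.sleep(2)  # not up yet; back off and retry
    return False

if __name__ == "__main__":
    if wait_for_rethinkdb():
        print("RethinkDB is ready; safe to start the Brozzler worker and warcprox")
    else:
        print("RethinkDB never came up; check the service logs")
```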

Andy added to Noah's interactive session by referencing his effort in Dockerizing Brozzler and his other work in componentizing and Dockerizing the other roles and tasks of the web archiving process with his project Wren. While one such component is the Archival Acid Test project I had created for Digital Libraries 2014, Wren's other sub-projects ease the deployment of tools that are otherwise difficult or time-consuming to configure.

One such tool that was lauded throughout the conference was Alex Osborne's (@atosborne) tinycdxserver, of which Andy has also created a Dockerized version. This tool was new to me, but the reported statistics on CDX querying speed and storage have the potential for significant improvement for large web archives. Per Alex's description of the tool, the indexes are stored compressed using Facebook's RocksDB and are about a fifth of the size of a flat CDX file. Further, Wayback instances can simply be pointed at a tinycdxserver instance using the built-in RemoteResourceIndex field in the Wayback configuration file, which makes for easy integration.
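Querying a running tinycdxserver instance is just an HTTP request. The sketch below assumes a local instance with an index named "myindex" and a newline-delimited CDX response; the exact URL layout may differ by version, so check the project's documentation:

```python
import requests

# Assumed local tinycdxserver instance and index name (illustrative).
CDX_ENDPOINT = "http://localhost:8080/myindex"

resp = requests.get(CDX_ENDPOINT, params={"url": "http://example.org/"})
resp.raise_for_status()

# Each returned line should be a CDX record describing one capture.
for line in resp.text.splitlines():
    print(line)
```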

Strands

A wholly unconference discussion then commenced with topics we wanted to cover in the second part of the day. After coming up with and classifying various ideas, Andy defined three groups: the Heritrix Wish List, Brozzler, and Automated QA.

Each attendee could join any of the three for further discussion. I chose "Automated QA", given that archival integrity is closely related to my research topic.

The Heritrix group expressed challenges that the members had encountered in transitioning from Heritrix version 1 to version 3. "The Heritrix 3 console is a step back from Heritrix 1's. Building and running scripts in Heritrix 3 is a pain." was the general sentiment from the group. Another concern was scarce documentation, which might be remedied with funded efforts to improve it, as deep knowledge of the tool's workings is needed to accurately represent its capabilities. Kristinn Sigurðsson (@kristsi), who was involved in the development of H3 (and declined to give a history documenting the non-existence of H2), has since resolved some of these issues. He and others encouraged me to use his fork of Heritrix 3, my own recommendation inadvertently included:

The Brozzler group first identified the behavior of Brozzler versus a conventional crawler in its handling of one page or site at a time (a la WARCreate) instead of adding discovered URIs to a frontier and seeding those URIs for subsequent crawls. Per above, Brozzler's use of RethinkDB as both the crawl frontier and the CDX service makes it especially appealing and more scalable. Brozzler allows multiple workers to pull URIs from a pool and report back to a RethinkDB instance. This worked fairly well in my limited but successful testing at the hackathon.
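As a conceptual illustration of that pool-of-workers model (not Brozzler's actual schema or internals), a worker might claim a pending URI from a shared RethinkDB table, process it, and report back. The table and field names below are made up:

```python
import rethinkdb as r  # older driver API; illustrative only

def claim_one(conn, worker_id):
    """Pull one pending URI from the shared pool and mark it claimed.
    (Not atomic; a real system needs a proper claim protocol.)"""
    pending = list(r.table("frontier")
                    .filter({"status": "pending"})
                    .limit(1)
                    .run(conn))
    if not pending:
        return None
    site = pending[0]
    r.table("frontier").get(site["id"]) \
     .update({"status": "claimed", "worker": worker_id}).run(conn)
    return site

def report_done(conn, site_id):
    """Record that the claimed URI has been browsed and archived."""
    r.table("frontier").get(site_id).update({"status": "done"}).run(conn)
```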

The Automated QA group first spoke about the National Library of Australia's Bamboo project. The tool consumes Heritrix's (and/or warcprox's) crawl output folder and provides in-progress indexes from WARC files prior to a crawl finishing. Other statistics can also be added, as well as the automated generation of screenshots for comparing captures on-the-fly. We also highlighted some particular items that crawlers and browser-based preservation tools have trouble capturing, for example, video formats that vary in support between browsers, URIs defined in the "srcset" attribute, responsive design behaviors, etc. I also referenced my work on Ahmed AlSum's (@aalsum) Thumbnail Summarization using SimHash, as presented at the Web Archiving Collaboration meeting.
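For a rough sense of how SimHash supports that kind of comparison, the sketch below computes fingerprints over token streams (say, text extracted from a live page and from its capture) and measures their Hamming distance; near-duplicate inputs yield nearby fingerprints. This is a minimal illustration, not the implementation used in that work:

```python
import hashlib

def simhash(tokens, bits=64):
    """Minimal SimHash: similar token streams produce fingerprints
    that differ in only a few bits."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Example: compare token streams from a live page and its capture.
live = simhash("the quick brown fox jumps over the lazy dog".split())
capture = simhash("the quick brown fox jumped over the lazy dog".split())
print(hamming(live, capture))  # a small distance suggests a faithful capture
```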

After presentations by the groups, the attendees called it a day and continued discussions at a nearby pub.

Day 2

The second day commenced with a few questions that we had decided upon and agreed to while at the pub as good topics of discussion for the next day:

  1. Given five engineers and two years, what would you build?
  2. What are the barriers in training for the current and future crawling software and tools?

Given Five...

Responses to the first included something like Brozzler's frontier but redesigned to allow for continuous crawling instead of a single URI at a time. With a segue toward Heritrix, Kristinn mused about the relationship between configurability and scalability. "You typically don't install Heritrix on a virtual machine", he said, "usually a machine for this use requires at least 64 gigabytes of RAM." Also discussed was getting the raw data for a crawl versus being able to get the data needed to replicate the experience, and the particular importance of the latter.

Additionally, there was talk of adapting the scheme used by Brozzler for an Electron application meant for browsing, with the ability to toggle archiving through warcprox (related: see the recent post on WAIL). On the flip side, Kristinn mentioned that it surprised him that we can potentially create a browser of this sort that can interact with a proxy but not build another crawler, highlighting the lack of options for other robust, Heritrix-like archival crawlers.

Barriers in Training

For the second question, those involved with institutional archives seemed to agree that if one were going to hire a crawl engineer, Java and Python experience is a prerequisite to exposure to some of the archive-specific concepts. For current institutional training practice, Andy stated that he turns new developers in his organization loose on ACT, which is simply a CRUD application, to introduce them to the web archiving domain. Others said it would be useful to have staff exchanges and internships for collaboration and for getting more employees familiar with web archiving.

Collaboration

Another topic that arose from the previous conversation was future methods of collaboration. For future work on writing documentation, more fundamental "Getting Started" guides as well as test sites for tools would be welcomed. For future communication, the IIPC Slack channel as well as the newly created IIPC GitHub wiki will be the next iteration of the outdated IIPC Tools page and the COPTR initiative.

The whole-group discussion wrapped up with identifying concrete next steps from what was discussed at the event. These included creating setup guides for Brozzler, testing further use cases of Umbra versus Brozzler, future work on access control considerations as currently handled by institutions and the next steps regarding that, and a few other TODOs. A monthly online meeting is also planned to facilitate collaboration between events, as well as more continued interaction via Slack instead of a number of outdated, obsolete, or noisy e-mail channels.

In Conclusion...

Attending the IIPC Building Better Crawlers Hackathon was invaluable for establishing contacts and gaining more exposure to the field and to the efforts of others. Many of the conversations were open-ended, which led to numerous other topics being discussed and opened the doors to potential new collaborations. I gained a lot of insight from discussing my research topic and others' projects and endeavors. I hope to be involved with future Hackathons-turned-Unconferences from IIPC and appreciate the opportunity I had to attend.

—Mat (@machawk1)

Kristinn Sigurðsson has also written a post about his takeaways from the event.

Since the publication of this post, Tom Cramer has also published his report on the Hackathon.

