Monday, December 11, 2017

2017-12-11: Difficulties in timestamping archived web pages

Figure 1: A web page from nasa.gov is archived by Michael's Evil Wayback in July 2017.
Figure 2: When visiting the same archived page in October 2017, we found that the content of the page has been tampered with.
The 2016 Survey of Web Archiving in the United States shows an increasing trend of using public and private web archives in addition to the Internet Archive (IA). Because of this trend, we should consider the question of the validity of archived web pages delivered by these archives.
Let us look at an example where an important web page, https://climate.nasa.gov/vital-signs/carbon-dioxide/, which keeps a record of the carbon dioxide (CO2) level in the Earth's atmosphere, is captured by a private web archive, "Michael's Evil Wayback", on July 17, 2017 at 18:51 GMT. At that time, as Figure 1 shows, the CO2 level was 406.31 ppm.
When revisiting the same archived page in October 2017, we should be presented with the same content. Surprisingly, the CO2 level has changed to 270.31 ppm, as Figure 2 shows. So which one is the "real" archived page?
We can detect that the content of an archived web page has been modified by generating a cryptographic hash value of the returned HTML. For example, the following command downloads the web page https://climate.nasa.gov/vital-signs/carbon-dioxide/ and generates a SHA-256 hash of its HTML content:
$ curl -s https://climate.nasa.gov/vital-signs/carbon-dioxide/ | shasum -a 256
b87320c612905c17d1f05ffb2f9401ef45a6727ed6c80703b00240a209c3e828  -
The next figure illustrates how the simple approach of generating hashes can detect tampering with the content of archived pages. In this example, the "black hat" in the figure (i.e., Michael's Evil Wayback) has changed the CO2 value to a lower number (i.e., in favor of individuals or organizations who deny that CO2 is one of the main causes of global warming).
Another possible solution for validating archived web pages is trusted timestamping. If a trusted timestamp is issued on an archived web page, anyone can verify that a particular representation of the web page existed at a specific time in the past.
As of today, many systems, such as OriginStamp and OpenTimestamps, offer a free-of-charge service for generating trusted timestamps of digital documents on blockchain networks, such as Bitcoin. These tools perform multiple steps to create a timestamp. One of these steps requires computing a hash value that represents the content of the resource (e.g., with the cURL command shown above). Next, this hash value is converted into a Bitcoin address, and a Bitcoin transaction is made in which one of the two sides of the transaction (i.e., the source or the destination) is the newly generated address. Once the transaction is approved by the blockchain, its creation datetime is considered to be a trusted timestamp. Shawn Jones describes in "Trusted Timestamping of Mementos" how to create trusted timestamps of archived web pages using blockchain networks.
In our technical report, "Difficulties of Timestamping Archived Web Pages", we show that trusted timestamping of archived web pages is not an easy task, for several reasons. The main reason is that a hash value calculated on the content of an archived web page (i.e., a memento) should be repeatable; that is, we should obtain the same hash value each time we retrieve the memento. In addition to describing these difficulties, we introduce some requirements that must be fulfilled in order to generate repeatable hash values for mementos.
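To see why repeatability matters, one can hash the same memento twice and compare the results. The following is a minimal sketch of such a check (our illustration, not code from the report; the URI-M shown is a hypothetical example):

# Minimal sketch: fetch a memento twice and compare SHA-256 hashes.
# The URI-M below is a hypothetical example, not a real capture.
import hashlib
import requests

def sha256_of(uri):
    # Hash the raw HTTP response body, as the shasum example above does
    return hashlib.sha256(requests.get(uri).content).hexdigest()

uri_m = "https://web.archive.org/web/20170717185117/https://climate.nasa.gov/vital-signs/carbon-dioxide/"
first = sha256_of(uri_m)
second = sha256_of(uri_m)
# May print False if content added by the archive changes between fetches
print(first == second)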

--Mohamed Aturban


Mohamed Aturban, Michael L. Nelson, Michele C. Weigle, "Difficulties of Timestamping Archived Web Pages." 2017. Technical Report. arXiv:1712.03140.

Monday, December 4, 2017

2017-12-03: Introducing Docker - Application Containerization & Service Orchestration


For the last few years, Docker, the application containerization technology, has been gaining a lot of traction in the DevOps community, and lately it has made its way into academia and the research community as well. I have been following it since its inception in 2013. For the last couple of years, it has been a daily driver for me. At the same time, I have been encouraging my colleagues to use Docker in their research projects. As a result, we are gradually moving away from one virtual machine (VM) per project to a swarm of nodes running containers of various projects and services. If you have accessed MemGator, CarbonDate, Memento Damage, Story Graph, or some other WS-DL services lately, you have been served from our Docker deployment. We even have an on-demand PHP/MySQL application deployment system using Docker for the CS418 - Web Programming course.



Last summer, Docker Inc. selected me as the Docker Campus Ambassador for Old Dominion University. While I had already given Docker talks to some more focused groups, with the campus ambassador hat on, I decided to organize an event from which grads and undergrads of the Computer Science department at large could benefit.


The CS department accepted it as a colloquium, scheduled for Nov 29, 2017. We were anticipating about 50 participants, but many more showed up. The increasing interest of students in containerization technology can be taken as an indicator of its usefulness, and perhaps it should be included as part of some courses offered in the future.


The session lasted for a little over an hour. It started with some slides motivating the topic with a Dockerization story and a set of problems that Docker can potentially solve. The slides then introduced some basics of Docker and illustrated how a simple script can be packaged into an image and distributed using DockerHub. The presentation was followed by a live demo of a step-by-step evolution of a simple script into a multi-container application using a micro-service architecture, demonstrating various aspects of Docker at each step. Finally, the session was opened for questions and answers.


For the purpose of illustration, I prepared an application that scrapes a given web page to extract the links from it. The demo code has folders for the various steps as it progresses from a simple script to a multi-service application stack. Each folder has a README file that explains the changes from the previous step and gives instructions to run the application. The code is available on GitHub. Following is a brief summary of the demo.

Step 0

The Step 0 has a simple linkextractor.py Python script (as shown below) that accepts a URL as an argument and prints out all the hyperlinks on the page.
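The exact script is in the GitHub repo; a minimal sketch of what linkextractor.py might look like (assuming the requests and beautifulsoup4 libraries mentioned below) is:

#!/usr/bin/env python
# Minimal sketch of linkextractor.py (illustrative, not the exact demo code):
# fetch the page given as an argument and print every hyperlink found on it.
import sys
import requests
from bs4 import BeautifulSoup

res = requests.get(sys.argv[1])
soup = BeautifulSoup(res.text, "html.parser")
for link in soup.find_all("a", href=True):
    print(link["href"])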


However, running this rather simple script might raise some of the following issues:

  • Is the script executable? (chmod a+x linkextractor.py)
  • Is Python installed on the machine?
  • Can you install software on the machine?
  • Is "pip" installed?
  • Are "requests" and "beautifulsoup4" Python libraries installed?

Step 1

The Step 1 adds a simple Dockerfile to automate the installation of all the requirements and to build an isolated, self-contained image.


Inclusion of this Dockerfile ensures that the script will run without any hiccups in a Docker container as a one-off command.

Step 2

The Step 2 makes some changes in the Python script: 1) converting extracted paths to full URLs, 2) extracting both links and anchor texts, and 3) moving the main logic into a function that returns an object, so that the script can be used as a module in other scripts.
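A minimal sketch of what the revised extractor might look like after these changes (illustrative; function and module names are assumptions, not the exact demo code):

import requests
from bs4 import BeautifulSoup
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

def extract_links(url):
    # Main logic moved into a function so the file can be imported as a module
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        links.append({
            "href": urljoin(url, a["href"]),  # relative paths become full URLs
            "text": a.get_text(strip=True),   # anchor text is extracted as well
        })
    return links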

This step illustrates that new changes in the code will not affect any running containers and will not impact an image that was built already (unless overridden). Building a new image with a different tag allows co-existence of both the versions that can be run as desired.

Step 3

The Step 3 adds another Python file, main.py, that utilizes the module written in the previous step to expose link extraction as a web service API that returns a JSON response. The required libraries are listed in a requirements.txt file. The Dockerfile is updated to accommodate these changes and to run the server by default rather than the script as a one-off command.
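One way such a main.py could look, sketched here with Flask (the demo's actual framework and route shape may differ; extract_links is the module function assumed in the previous step):

from flask import Flask, jsonify, request
from linkextractor import extract_links  # module assumed from Step 2

app = Flask(__name__)

@app.route("/api/extract")
def api():
    # Return the extracted links of the requested page as a JSON response
    url = request.args.get("url", "")
    return jsonify({"url": url, "links": extract_links(url)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)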

This step demonstrates how host and container ports are mapped to expose the service running inside a container.

Step 4

The Step 4 moves all the code written so far for the JSON API into a separate folder to build an independent image. In addition, it adds a PHP file, index.php, in a separate folder that serves as a front-end application, which internally communicates with the Python API for link extraction. To glue these services together, a docker-compose.yml file is added.


This step demonstrates how multiple services can be orchestrated using Docker Compose. We did not create a custom image for the PHP application; instead, we demonstrated how the code can be mounted inside a container (in this case, a container based on the official php:7-apache image). This allows any modifications of the code to be reflected immediately inside the running container, which can be very handy in development mode.

Step 5

The Step 5 adds another Dockerfile to build a custom image of the front-end PHP application. The Python API server is updated to utilize Redis for caching. Additionally, the docker-compose.yml file is updated to reflect changes in the front-end application (the "web" service block) and to include a Redis service based on its official Docker image.
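A rough sketch of how the API server might use Redis as a cache (illustrative only; the "redis" host name and the extract_links helper are assumptions based on the steps above):

import json
import redis
from linkextractor import extract_links  # module assumed from Step 2

# "redis" resolves to the Redis service defined in docker-compose.yml
cache = redis.StrictRedis(host="redis", port=6379)

def get_links(url):
    cached = cache.get(url)
    if cached:
        return json.loads(cached)          # serve from the cache when possible
    links = extract_links(url)             # otherwise extract live
    cache.set(url, json.dumps(links))      # and store the result for next time
    return links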

This step illustrates how easy it is to progressively add components to compose a multi-container service stack. At this stage, the demo application architecture reflects what is illustrated in the title image of this post (the first figure).

Step 6

The Step 6 completely replaces the Python API service component with an equivalent Ruby implementation. Some slight modifications are made in the docker-compose.yml file to reflect these changes. Additionally, a "logs" directory is mounted in the Ruby API service as a volume for persistent storage.

This step illustrates how easily any component of a micro-service architecture application stack can be swapped out with an equivalent service. Additionally, it demonstrates volumes for persistent storage so that containers can remain stateless.


The video recording of the session is available on YouTube as well as on the colloquium recordings page of the department (the latter has more background noise). Slides and demo code are made available under appropriate permissive licenses to allow modification and reuse.

Resources



--
Sawood Alam

Wednesday, November 22, 2017

2017-11-22: Deploying the Memento-Damage Service





Many web services, such as archive.is, Archive-It, the Internet Archive, and the UK Web Archive, provide archived web pages, or mementos, for us to use. Nowadays, web archivists have shifted their focus from how to make a good archive to measuring how well the archive preserved the page. This raises the question of how to objectively measure the damage of a memento in a way that correctly emulates human perception.

Related to this, Justin Brunelle devised a prototype for measuring the impact of missing embedded resources (the damage) on a web page. Brunelle, in his IJDL paper (and the earlier JCDL version), describes that the quality of a memento depends on the availability of its resources. The straight percentage of missing resources in a memento is not always a good indicator of how "damaged" it is. For example, one page could be missing several small icons whose absence users never even notice, and a second page could be missing a single embedded video (e.g., a Youtube page). Even though the first page is missing more resources, intuitively the second page is more damaged and less useful for users. The damage value ranges from 0 to 1, where a damage of 1 means the web page lost all of its embedded resources. Figure 1 gives an illustration of how this prototype works.

Figure 1. The overview flowchart of Memento Damage
Although this prototype has been proven capable of measuring the damage, it is not ready for end users. Thus, we implemented a web service, called Memento-Damage, based on the prototype.

Analyzing the Challenges

Reducing the Calculation Time

As previously explained, the basic notion of damage calculation is mirroring human perception of a memento. Thus, we analyze the screenshot of the web page as a representation of how the page looks in the user's eyes. This screenshot analysis takes the most time of the entire damage calculation process.

The initial prototype was built on top of the Perl programming language and used PerlMagick to analyze the screenshot. This library dumps the color scheme (RGB) of each pixel in the screenshot into a file, which is then loaded by the prototype for further analysis. Dumping and reading the pixel colors of the screenshot take a significant amount of time, and the process is repeated once for each stylesheet the web page has. Therefore, if a web page has 5 stylesheets, the analysis is repeated 5 times even though it uses the same screenshot as the basis.

Simplifying the Installation and Making It Distributable
Before running the prototype, users are required to install all dependencies manually. The list of dependencies is not provided; users must discover it themselves by identifying the errors that appear during execution. Furthermore, we needed to 'package' and deploy this prototype as a ready-to-use and distributable tool that can be used widely in various communities. How? By providing 4 different ways of using the service: the website, the REST API, the Docker image, and the Python library, each described below.
Solving Other Technical Issues
Several technical issues that needed to be solved included:
  1. Handling redirection (status_code  = 301, 302, or 307).
  2. Providing some insights and information.
    The user will not only get the final damage value but will also be informed about the details of the crawling and calculation process, as well as the components that make up the final damage value. If an error happens, the error info will also be provided.
  3. Dealing with overlapping resources and iframes. 

Measuring the Damage

Crawling the Resources
When a user inputs a URI-M into the Memento-Damage service, the tool will check the content-type of the URI-M and crawl all resources. The properties of the resources, such as size and position of an image, will be written into a log file. Figure 2 summarizes the crawling process conducted in Memento-Damage. Along with this process, a screenshot of the website will also be created.

Figure 2. The crawling process in Memento-Damage

Calculating the Damage
After crawling the resources, Memento-Damage will start calculating the damage by reading the log files that were previously generated (Figure 3). Memento-Damage will first read the network log and examine the status_code of each resource. If a URI is redirected (status_code = 301 or 302), it will chase down the final URI by following the URI in the Location header, as depicted in Figure 4. Each resource will be processed according to its type (image, css, javascript, text, iframe) to obtain its actual and potential damage values. Then, the total damage is computed using the formula:
$$ T_D = \frac{A_D}{P_D} $$
where:
$$ T_D = \text{Total Damage}, \qquad A_D = \text{Actual Damage}, \qquad P_D = \text{Potential Damage} $$

The formula above can be further elaborated into:
$$ T_D = \frac{A_{D_i} + A_{D_c} + A_{D_j} + A_{D_m} + A_{D_t} + A_{D_f}}{P_{D_i} + P_{D_c} + P_{D_j} + P_{D_m} + P_{D_t} + P_{D_f}} $$
where the subscripts denote image (i), css (c), javascript (j), multimedia (m), text (t), and iframe (f), respectively.
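As a small illustration of the formula above (our sketch, with assumed field names), the total damage can be computed from per-type actual and potential values as follows:

def total_damage(actual, potential):
    # actual/potential: dicts keyed by resource type, e.g.
    # {"image": ..., "css": ..., "js": ..., "multimedia": ..., "text": ..., "iframe": ...}
    A_D = sum(actual.values())
    P_D = sum(potential.values())
    return float(A_D) / P_D if P_D else 0.0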
For image analysis, we use Pillow, a Python imaging library that has better and faster performance than PerlMagick. Pillow can read the pixels of an image without dumping them to a file, which speeds up the analysis process. Furthermore, we modified the algorithm so that we only need to run the analysis script once for all stylesheets.
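For example, with Pillow the pixel colors of the screenshot can be read directly in memory (a minimal sketch of this kind of access, not the tool's actual code):

from PIL import Image

img = Image.open("screenshot.png").convert("RGB")
width, height = img.size
pixels = img.load()        # direct pixel access, no intermediate dump file
r, g, b = pixels[0, 0]     # RGB color of the top-left pixel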
Figure 3. The calculation process in Memento-Damage

Figure 4. Chasing down a redirected URI

Dealing with Overlapping Resources

Figure 5. Example of a memento that contains overlapping resources (accessed on March 30th, 2017)
URIs with overlapping resources, such as the one illustrated in Figure 5, need to be treated differently to prevent the damage value from being double counted. To solve this problem, we created a concept of rectangles (Rectangle = xmin, ymin, xmax, ymax). We perceive the overlapping resources as rectangles and calculate the size of the intersection area. The size of one of the overlapping resources is reduced by the intersection size, while the other overlapping resource keeps its full size. Figure 6 and Listing 1 illustrate the rectangle concept.
Figure 6. Intersection concept for overlapping resources in an URI
def rectangle_intersection_area(a, b):
    # Width and height of the overlap between rectangles a and b
    dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if (dx >= 0) and (dy >= 0):
        return dx * dy
    return 0  # the rectangles do not overlap
Listing 1. Measuring image rectangle in Memento-Damage

Dealing with Iframes

Dealing with iframes is quite tricky and requires some customization. First, by default, the crawling process cannot access the content inside an iframe using native JavaScript or jQuery selectors due to cross-domain restrictions. This problem becomes more complicated when the iframe is nested in other iframe(s). Therefore, we need a way to switch from the main frame to the iframe. To handle this problem, we utilize the API provided by PhantomJS that facilitates switching from one frame to another. Second, the location properties of the resources inside an iframe are calculated relative to that particular iframe's position, not to the main frame's position, which could lead to a wrong damage calculation. Thus, for a resource located inside an iframe, its position must be computed in a nested calculation that takes into account the position of its parent frame(s).
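Conceptually, the nested calculation adds up the offsets of every ancestor frame, as in this small sketch (our illustration, not the tool's actual code):

def absolute_position(resource_xy, ancestor_frame_offsets):
    # resource_xy: (x, y) of the resource relative to its own iframe
    # ancestor_frame_offsets: (x, y) offsets of each enclosing frame, outermost first
    x, y = resource_xy
    for fx, fy in ancestor_frame_offsets:
        x += fx
        y += fy
    return x, y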

Using Memento-Damage

The Web Service

a. Website
The Memento-Damage website offers the easiest way to use the Memento-Damage tool. However, since it runs on a resource-limited server provided by ODU, it is not recommended for calculating the damage of a large number of URIs. Figure 7 shows a brief preview of the website.
Figure 7. The calculation result from Memento-Damage

b. REST API
The REST API service is the part of the web service that facilitates damage calculation from any HTTP client (e.g., a web browser, cURL, etc.) and gives output in JSON format. This makes it possible for the user to do further analysis with the resulting output. Using the REST API, a user can write a script and calculate damage for a small number of URIs (e.g., 5).
The default REST API usage for memento damage is:
http://memento-damage.cs.odu.edu/api/damage/[the input URI-M]
Listing 2 and Listing 3 show examples of using Memento-Damage REST API with CURL and Python.
curl http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci
Listing 2. Using Memento-Damage REST API with Curl
import requests
resp = requests.get('http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci')
print resp.json()
Listing 3. Using Memento-Damage REST API as embedded code in Python

Local Service

a. Docker Version
The Memento-Damage Docker image uses Ubuntu-LXDE for the desktop environment. A fixed desktop environment is used to avoid inconsistency in the damage value of the same URI when run on machines with different operating systems. We found that PhantomJS, the headless browser used for generating the screenshot, renders the web page in accordance with the machine's desktop environment. Hence, the same URI could have slightly different screenshots, and thus different damage values, when run on different machines (Figure 8).


Figure 8. Screenshot of https://web.archive.org/web/19990125094845/http://www.dot.state.al.us taken by PhantomJS run on 2 machines with different OS.
To start using the Docker version of Memento-Damage, the user can follow these  steps: 
  1. Pull the docker image:
    docker pull erikaris/memento-damage
    
  2. Run the docker image:
    docker run -it -p <host-port>:80 --name <container-name> <image>
    
    Example:
    docker run -i -t -p 8080:80 --name memdamage erikaris/memento-damage:latest
    
    After this step is completed, we now have the Memento-Damage web service running on
    http://localhost:8080/
    
  3. Run memento-damage as a CLI using the docker exec command:
    docker exec -it <container name> memento-damage <URI>
    Example:
    docker exec -it memdamage memento-damage http://odu.edu/compsci
    
    If the user wants to work from inside the Docker container's terminal, use the following command:
    docker exec -it <container name> bash
    Example:
    docker exec -it memdamage bash 
  4. Start exploring the Memento-Damage using various options (Figure 9) that can be obtained by typing
    docker exec -it memdamage memento-damage --help
    or if the user is already inside the Docker's container, just simply type:
    memento-damage --help
~$ docker exec -it memdamage memento-damage --help
Usage: memento-damage [options] <URI>

Options:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output-dir=OUTPUT_DIR
                        output directory (optional)
  -O, --overwrite       overwrite existing output directory
  -m MODE, --mode=MODE  output mode: "simple" or "json" [default: simple]
  -d DEBUG, --debug=DEBUG
                        debug mode: "simple" or "complete" [default: none]
  -L, --redirect        follow url redirection

Figure 9. CLI options provided by Memento-Damage

Figure 10 depicts an output generated by CLI-version Memento-Damage using complete debug mode (option -d complete).
Figure 10. CLI-version Memento-Damage output using option -d complete
Further details about using Docker to run Memento-Damage are available at http://memento-damage.cs.odu.edu/help/.

b. Library
The library version offers functionality (web service and CLI) that is relatively similar to that of the Docker version. It is aimed at people who already have all the dependencies (PhantomJS 2.x and Python 2.7) installed on their machines and do not want to bother installing Docker. The latest library version can be downloaded from GitHub.
Start using the library by following these steps:
  1. Install the library using the command:
    sudo pip install web-memento-damage-X.x.tar.gz
  2. Run the Memento-Damage as a web service:     memento-damage-server
  3. Run the Memento-Damage via CLI:   memento-damage <URI>
  4. Explore available options by using option --help:
    memento-damage-server --help    (for the web service)
    or
    memento-damage --help    (for the CLI)

Testing on a Large Number of URIs

To test our claim that Memento-Damage can handle a large number of URIs, we conducted a test on 108,511 URI-Ms using a testing script written in Python. The test used the Docker version of Memento-Damage running on a machine with an Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz and 4 GiB of memory. The testing result is shown below.

Summary of Data
=================================================
Total URI-M: 108511
Number of URI-M are successfully processed: 80580
Number of URI-M are failed to process: 27931
Number of URI-M has not processed yet: 0

Of the 108,511 input URI-Ms tested, 80,580 URI-Ms were successfully processed, while the remaining 27,931 URI-Ms failed to process. The failures on those 27,931 URI-Ms happened because of concurrent-access limits at the Internet Archive. On average, processing 1 URI-M took 32.5 seconds. This is 110 times faster than the prototype version, which takes an average of 1 hour to process 1 URI-M. In some cases, the prototype version even took almost 2 hours to process 1 URI-M.

From the successfully processed URI-Ms, we created some visualizations to help us better understand the results, as can be seen below. The first graph (Figure 11) shows the average number of missing embedded resources per memento per year according to the damage value (Dm) and the missing resources (Mm). The most interesting case appeared in 2011, where the Dm value was significantly higher than Mm. This means that, although on average the URI-Ms from 2011 only lost 4% of their resources, these losses actually caused damage four times higher than the number indicated by Mm. On the other hand, in 2008, 2010, 2013, and 2017, the Dm value is lower than Mm, which implies that those missing resources are less important.
Figure 11. The average embedded resources missed per memento per year
Figure 12. Comparison of All Resources vs Missing Resources

The second graph (Figure 12) shows the total number of resources in each URI-M and its missing resources. The x-axis represents each URI-M, sorted in descending order by the number of resources, while the y-axis represents the number of resources owned by each URI-M. This graph gives us the insight that almost every URI-M lost at least one of its embedded resources.

Summary

In this research, we have improved the calculation method for measuring the damage of a memento (URI-M) based on the earlier prototype. The improvements include reducing calculation time, fixing various bugs, and handling redirection and new types of resources. We developed Memento-Damage into a comprehensive tool that can show the details of every resource that contributes to the damage. Furthermore, it provides several ways of using the tool, such as a Python library and the Docker version. The testing results show that Memento-Damage works faster than the prototype version and can handle a larger number of mementos. Table 1 summarizes the improvements that we made in Memento-Damage compared to the initial prototype.

(Subject: Prototype → Memento-Damage)
  1. Programming Language: JavaScript + Perl → JavaScript + Python
  2. Interface: CLI → CLI, website, and REST API
  3. Distribution: source code → source code, Python library, and Docker image
  4. Output: plain text → plain text and JSON
  5. Processing time: very slow → fast
  6. Includes iframes: NA → available
  7. Redirection handling: NA → available
  8. Resolves overlap: NA → available
  9. Blacklisted URIs: only 1 blacklisted URI, added manually → several new blacklisted URIs, identified based on a certain pattern
  10. Batch execution: not supported → supported
  11. DOM selector capability: only supports simple selection queries → supports complex selection queries
  12. Input filtering: NA → only processes input in HTML format
Table 1. Improvement on Memento-Damage compared to the initial prototype

Help Us to Improve

This tool still needs a lot of improvement to increase its functionality and provide a better user experience. We really hope and strongly encourage everyone, especially people who work in the web archiving field, to try this tool and give us feedback. Please read the FAQ and Help pages before starting to use Memento-Damage. Help us improve and tell us what we can do better by posting any bugs, errors, issues, or difficulties that you find in this tool on our GitHub.

- Erika Siregar -

Monday, November 20, 2017

2017-11-20: Dodging the Memory Hole 2017 Trip Report

It was rainy in San Francisco, but that did not deter those of us attending Dodging the Memory Hole 2017 at the Internet Archive. We engaged in discussions about a very important topic: the preservation of online news content.


Keynote: Brewster Kahle, founder and digital librarian for the Internet Archive

Brewster Kahle is well known in digital preservation and especially web archiving circles. He founded the Internet Archive in May 1996. The WS-DL and LANL's Prototyping Team collaborate heavily with those from the Internet Archive, so hearing his talk was quite inspirational.




We are familiar with the Internet Archive's efforts to archive the Web, visible mostly through the Wayback Machine, but the goal of the Internet Archive is "Universal Access to All Knowledge", something that Kahle equates to the original Library of Alexandria or putting humans on the moon. To that end, he highlighted many initiatives by the Internet Archive to meet this goal. He mentioned that the contents of a book take up roughly a megabyte, so with 28 terabytes the works of the Library of Congress can be stored digitally (digitizing them is another matter, but it is completely doable), and by digitizing them we remove restrictions on access due to distance and other factors. Why stop with documents? There are many other types of content. Kahle highlighted the efforts by the Internet Archive to make television content, video games, audio, and more available. They also have a lending program whereby they allow users to borrow books, which are digitized using book scanners. He stressed that, because of its mission to provide content to all, the Internet Archive is indeed a library.



As a library, the Internet Archive also becomes a target for governments seeking information on the activities of their citizens. Kahle highlighted one incident in which the FBI sent a letter demanding information from the Internet Archive. Thanks to help from the Electronic Frontier Foundation, the Internet Archive sued the United States government and won, defending the rights of those using their services.



Kahle emphasized that we can all help with preserving the web by helping the Internet Archive build its holdings of web content. The Internet Archive provides a form with a simple "save page now" button, but they also support other methods of submitting content.



Contributions from Los Alamos National Laboratory (LANL) and Old Dominion University (ODU)


Martin Klein from LANL and Mark Graham from the Internet Archive




Martin Klein presented work on Robust Links. Martin briefly reviewed motivating work he had done with Herbert Van de Sompel at Los Alamos National Laboratory, mentioning the problems of link rot and content drift, the latter of which I have also worked on.
He covered how one can create links that are robust by:
  1. submitting a URI to a web archive
  2. decorating the link HTML so that future users can reach archived versions of the linked content
For the first item, he talked about how one can use tools like the Internet Archive's "Save Page Now" button as well as WS-DL's own ArchiveNow. The second item is covered by the Robust Links specification. Mark Graham, Director of the Wayback Machine at the Internet Archive, further expanded upon Martin's talk by describing how the Wayback Extension also provides the capability to save pages, navigate the archive, and more. It is available for Chrome, Safari, and Firefox, and is shown in the screenshots below.
A screenshot of the Wayback Extension in Chrome.
A screenshot of the Wayback Extension in Safari. Note the availability of the option "Site Map", which is not present in the Chrome version
A screenshot of the Wayback Extension in Firefox. Note how there is less functionality.


Of course, the WS-DL efforts of ArchiveNow and Mink augment these preservation efforts by submitting content to multiple web archives, including the Internet Archive.



I enjoyed one of the most profound revelations from Martin and Mark's talk: URIs are addresses, not the content that was on the page at the moment you read it. I realize that efforts like IPFS are trying to use hashes to address this dichotomy, but the web has not yet migrated to them.

Shawn M. Jones from ODU




I presented a lightning talk highlighting a blog post from earlier this year where I try to answer the question: where can we post stories summarizing web archive collections? I talked about why storytelling works as a visualization method for summarizing collections and then evaluated a number of storytelling and curation tools with the goal of finding those that best support this visualization method.


Selected Presentations


I tried to cover elements of all presentations while live tweeting during the event, and wish I could go into more detail here, but, as usual I will only cover a subset.

Mark Graham highlighted the Internet Archive's relationships with online news content. He highlighted a report by Rachel Maddow where she used the Internet Archive to recover tweets posted by former US National Security Advisor Michael Flynn, thus incriminating him. He talked about other efforts, such as NewsGrabber, Archive-It, and the GDELT project, which all further archive online news or provide analysis of archived content. Most importantly, he covered "News At Risk"—content that has been removed from the web by repressive regimes, further emphasizing the importance of archiving it for future generations. In that vein, he discussed the Environmental Data & Governance Initiative, set up to archive environmental data from government agencies after Donald Trump's election.

Ilya Kreymer and Anna Perricci presented their work on Webrecorder, web preservation software hosted at webrecorder.io. An impressive tool for "high fidelity" web archiving, Webrecorder allows one to record a web browsing session and save it to a WARC. Kreymer demonstrated its use on a CNN news web site with an embedded video, showing how the video was captured as well as the rest of the content on the page. The Webrecorder.io web platform allows one to record using their native browser, or they can choose from a few other browsers and configurations in case the user agent plays a role in the quality of recording or playback. For offline use, they have also developed Webrecorder Player, in which one can playback their WARCs without requiring an Internet connection. Anna Perricci said that it is perfect for browsing a recorded web session on an airplane. Contributors to this blog have written about Webrecorder before.

Katherine Boss, Meredith Broussard, Fernando Chirigati, and Rémi Rampin discussed the problems surrounding the preservation of news apps: interactive content on news sites that allow readers to explore data collected by journalists on a particular issue. Because of their dynamic nature, news apps are difficult to archive. Unlike static documents, they can not be printed or merely copied. They often consist of client and server side code developed without a focus on reproducibility. Preserving news apps often requires the assistance of the organization that created the news app, which is not always available. Rémi Rampin noted that, for those organizations that were willing to help them, their group had had success using the research reproducibility tool reprozip to preserve and play back news apps.

Roger Macdonald and Will Crichton provided an overview of the Internet Archive's efforts to provide information from TV news. They have employed the Esper video search tool as a way to explore their collection. Because it is difficult for machines to derive meaning from pixels within videos, they used captioning as a way to provide for effective searching and analysis of the TV news content at the Internet Archive. Their goal is to allow search engines to connect fact checking to the TV media. To this end, they employed facial recognition on hours of video to find content where certain US politicians were present. From there one can search for a politician and see where they have given interviews on such news channels as CNN, BBC, and Fox News. Alternatively, they are exploring the use of identifying the body position of each person in a frame. Using this, it might be possible to answer queries such as "every video where a man is standing over a woman". The goal is to make video as easy as text to search for meaning.

Maria Praetzellis highlighted a project named Community Webs that uses Archive-It. Community Webs provides libraries the tools necessary to preserve news and other content relevant to their local community. Through Community Webs, local public libraries receive education and training, help with collection development, and archiving services and infrastructure.

Kathryn Stine and Stephen Abrams presented the work done on the Cobweb Project. Cobweb provides an environment where many users can collaborate to produce seeds that can then be captured by web archiving initiatives. If an event is unfolding and news stories are being written, the documents containing these stories may change quickly, thus it is imperative for our cultural memory that these stories be captured as close to publication as possible. Cobweb provides an environment for the community to create a collection of seeds and metadata related to one of these events.
Matthew Weber shared some results from the News Measures Research Project. This project started as an attempt to "create an archive of local news content in order to assess the breadth and depth of local news coverage in the United States". The researchers were surprised to discover that local news in the United States covers a much larger area than expected: 546 miles on average. Most areas are "woefully underserved". Consolidation of corporate news ownership has led to fewer news outlets in many areas and the focus of these outlets is becoming less local and more regional. These changes are of concern because the press is important to the democratic processes within the United States.

Social



As usual, I met quite a few people during our meals and breaks. I appreciate talks over lunch with Sativa Peterson of Arizona State Library and Carolina Hernandez of the University of Oregon. It was nice to discuss the talks and their implications for journalism with Eva Tucker of Centered Media and Barrett Golding of Hearing Voices. I also appreciated feedback and ideas from Ana Krahmer of the University of North Texas, Kenneth Haggerty of the University of Missouri, Matthew Collins of the University of San Francisco Gleeson Library, Kathleen A. Hansen of University of Minnesota, and Nora Paul, retired director of Minnesota Journalism Center. I was especially intrigued by discussions with Mark Graham on using storytelling with web archives, Rob Brackett of Brackett Development, who is interested in content drift, and James Heilman, who works on WikiProject Medicine with Wikipedia.


Summary


Like last year, Dodging the Memory Hole was an inspirational conference highlighting current efforts to save online news. Having it at the Internet Archive further provided expertise and stimulated additional discussion on the techniques and capabilities afforded by web archives. Pictures of the event are available on Facebook. Video coverage is broken up into several YouTube videos: Day 1 before lunch, Day 1 after lunch, Day 2 before lunch, Day 2 after lunch, and lightning talks. DTMH highlights the importance of news in an era of a changing media presence in the United States, further emphasizing that web archiving can help us fact-check statements so we can hold onto a record of not only how we got here, but also guide where we might go next. -- Shawn M. Jones

Thursday, November 16, 2017

2017-11-16: Paper Summary for Routing Memento Requests Using Binary Classifiers

While researching my dissertation topic, I re-encountered the paper, "Routing Memento Requests Using Binary Classifiers" by Bornand, Balakireva, and Van de Sompel from JCDL 2016 (arXiv:1606.09136v1). The high-level gist of this paper is that by using two corpora of URI-Rs consisting of requests to their Memento aggregator (one for training, the other for training evaluation), the authors were able to significantly mitigate wasted requests to archives that contained no mementos for a requested URI-R.

For each of the 17 Web archives included in the experiment, with the exception of the Internet Archive on the assumption that a positive result would always be returned, a classifier was generated. The classifiers informed the decision of, given a URI-R, whether the respective Web archive should be queried.

Optimization of this sort has been performed before. For example, AlSum et al. from TPDL 2013 (trip report, IJDL 2014, and arXiv) created profiles for 12 Web archives based on TLD and showed that it is possible to obtain a complete TimeMap for 84% of the URI-Rs requested using only the top 3 archives. In two separate papers from TPDL 2015 (trip report) then TPDL 2016 (trip report), Alam et al. (2015, 2016) described making routing decisions when you have the archive's CDX information and when you have to use the archive's query interface to expose its holdings (respectively) to optimize queries.

The training data set was based on the LANL Memento Aggregator cache from September 2015, containing over 1.2 million URI-Rs. The authors used Receiver Operating Characteristic (ROC) curves comparing the rate of false positives (the URI-R should not have been included but was) to the rate of true positives (the URI-R was rightfully included in the classification). When requesting a prediction from the trained classifier, a pair of these rates is chosen corresponding to the most acceptable compromise for the application.

A sample ROC curve (from the paper) to visualize memento requests to an archive.

Classification of this sort required feature selection. The authors used the character length of the URI-R and the count of special characters, as well as the Public Suffix List (PSL) domain, as features (cf. AlSum et al.'s use of the TLD as a primary feature). The rationale for choosing the PSL over the TLD was that most archives cover the same popular TLDs. Additional token features were created by parsing the URI-R, removing delimiters to form tokens, and transforming the tokens to lowercase.
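To make the feature set more concrete, here is a rough sketch (our illustration, not the authors' code) of extracting length, special-character count, and lowercased tokens from a URI-R; the PSL domain feature would additionally require a Public Suffix List library, omitted here:

import re

def uri_features(uri_r):
    # Tokens: split on common delimiters and lowercase (a rough approximation)
    tokens = [t.lower() for t in re.split(r"[:/.?&=_+-]+", uri_r) if t]
    return {
        "length": len(uri_r),                                        # character length
        "special_chars": sum(1 for c in uri_r if not c.isalnum()),   # special characters
        "tokens": tokens,
    }

print(uri_features("http://example.com/path/page?id=42"))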

The authors used four different methods to evaluate the ranking of the features being explored for the classifiers: frequency over the training set, the sum of the differences between a URI-R's feature frequencies and the aforementioned frequencies, entropy as defined by Hastie et al. (2009), and the Gini impurity (see Breiman et al. 1984). Each metric was evaluated to determine how it affected the prediction by training a binary classifier using the logistic regression algorithm.
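For reference, the textbook definitions of those last two measures, for class proportions $p_k$ within a set, are:

$$ \text{Entropy} = -\sum_{k} p_k \log p_k \qquad\qquad \text{Gini} = \sum_{k} p_k (1 - p_k) = 1 - \sum_{k} p_k^2 $$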

The paper includes such ROC plots for each of the four feature selection strategies. Following the training, they evaluated the performance of each algorithm for classification using the corresponding sets of selected features, with a preference toward low computational load and memory usage. The algorithms evaluated were logistic regression (as used before), Multinomial Bayes, Random Forest, and SVM. Aside from Random Forest, the algorithms had similar prediction runtimes, so the other three were evaluated further.

A classifier was trained for each combination of the three remaining algorithms and each archive. To determine the true positive threshold, they brought in the second data set, consisting of 100,000 unrelated URI-Rs from the Internet Archive's log files from early 2012. Of the three algorithms, they found that logistic regression performed best for 10 archives and Multinomial Bayes for 6 others (per above, IA was excluded).

The authors then evaluated the trained classifiers using yet another dataset of URI-Rs from 200,000 randomly selected requests (cleaned to just over 187,000) from oldweb.today. Given that this data set was based on inter-archive requests, it was more representative of an aggregator's requests than the IA dataset. They computed recall, computational cost, and response time using a simulated method to avoid the need for thousands of live requests. These results confirmed that the currently used heuristic of querying all archives has the best recall (results are comprehensive), but response time can be drastically reduced using a classifier. With a reduction in recall of 0.153, fewer than 4 requests instead of 17 would reduce the response time from just over 3.7 seconds to about 2.2 seconds. Additional details of the optimization obtained through evaluation of the true positive rate can be found in the paper.

Take Away

I found this paper to be an interesting and informative read on a very niche topic that is hyper-relevant to my dissertation topic. I foresee a potential opportunity to optimize archival querying from other Memento aggregators like MemGator and look forward to further studies in this realm on both optimization and caching.

Mat (@machawk1)

Nicolas J. Bornand, Lyudmila Balakireva, and Herbert Van de Sompel. "Routing Memento Requests Using Binary Classifiers," In Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries (JCDL), pp. 63-72, (Also at arXiv:1606.09136).

Monday, November 6, 2017

2017-11-06: Association for Information Science and Technology (ASIS&T) Annual Meeting 2017

The crowds descended upon Arlington, Virginia for the 80th annual meeting of the Association for Information Science and Technology. I attended this meeting to learn more about ASIS&T, including its special interest groups. Also attending with me was former ODU Computer Science student and current Los Alamos National Laboratory librarian Valentina Neblitt-Jones.
The ASIS&T team had organized a wonderful collection of panels, papers, and other activities for us to engage in.

Plenary Speakers

Richard Marks: Head of the PlayStation Magic Lab at Sony Interactive Entertainment

Richard Marks talked about the importance of play to the human experience. He covered innovations at the PlayStation Magic Lab in an effort to highlight possible futures of human-computer interaction. The goal of the laboratory is "experience engineering", whereby the developers focus on improving the experience of game play rather than on more traditional software development. Play is about interaction, and the Magic Lab focuses on amplifying that interaction.

One of the new frontiers of gaming is virtual reality, whereby users are immersed in a virtual world. Marks talked about how using an avatar in a game initiates a "virtual transfer of identity". Consider the example of pouring water: seeing oneself pour water on a screen while using a controller provides one level of immersion, but seeing the virtual glass of water in your hands makes the action far more natural. He mentioned that VR players confronted with a virtual tightrope suspended above New York City had difficulty stepping onto the tightrope, even though they knew it was just a game.

He talked about thresholds of technology change, detailing the changes in calculating machines throughout the 20th Century and how "when you can get it into your shirt pocket, now everything changes". Though this calculator example seems an obvious direction of technology, it was not entirely obvious when calculating machines were first being developed. The same parallel can be made for user interfaces. Marks also mentioned that games allow their researchers to explore many different techniques without having to worry about the potential for loss of life or other challenges that confront interface researchers in other industries.

William Powers: Laboratory for Social Machines at the MIT Media Lab

William Powers, author of "Hamlet's Blackberry" and reporter at the Washington Post, gave a thoughtful talk on the effects of information overload on society. To him, tech revolutions are all about depth and meaning. Depth is about focus, reflection, and when "the human mind takes its most valuable and most important journeys". Meaning is our ability to develop "theories about what exists is all about".

He talked about the current social changes people are experiencing in the online (and offline) world. He personally found that he was not able to give attention to things he cared about. The more time he spent online, the harder it became to read longer pieces of work, like books. A number of media stories exist about diminishing attention spans correlated to an increase in online use.

While at a fellowship at Harvard's Shorenstein Center, Powers began work on what print on paper had done for civilization. He covered different "Philosophers of Screens" from history. Socrates believed that the alphabet would destroy our minds, fearing that people would not think outside of the words on the page. Socrates felt that people needed distance to truly digest the world around them. Seneca lived in a world of many new technologies, such as postal systems and paved roads, but he feared the "restless energy" that haunted him, developing mental exercises to focus the mind. By inventing the printing press, Gutenberg helped mass produce the written word, leading some of his era to fear the end of civilization as misinformation was being printed. In Shakespeare's time, people complained that the print revolution had given them too much to read and that they would not be able to keep up with it. Benjamin Franklin worked to overcome his own addictions through the use of ritual. Henry David Thoreau bemoaned the distracted nature of his compatriots in the 19th Century, noting that "when our life ceases to be inward and private, conversation degenerates to gossip." Marshall McLuhan also believed that we could rise above information overload by developing our own strategies.

The output of this journey became the paper "Hamlet's Blackberry: Why Paper Is Eternal", which then led to the book "Hamlet's Blackberry". The common thread was that each age has had new technical advances and concerns that people were becoming less focused and more out of touch. Each age also had visionaries who found that they could rise above this information fray by developing their own techniques for focus and introspection. Every technical revolution starts with the idea that the technology will consume everything, but this is hardly the case. Says Powers, "If all goes well with the digital revolution, then tech will allow us to have the depth that paper has given us." Powers even mentioned that he had been discussing with Tim Berners-Lee how to build a "better virtual society in the virtual world" that would in turn improve our real world.


Sample of Papers Presented

As usual, I cannot cover all papers presented, and, due to overlaps, was not present at all sessions. I will discuss a subset of the presentations that I attended.

Top Ranked Papers

Eric Forcier presented something near to one of my topics of interest in "Re(a)d Wedding: A Case Study Exploring Everyday Information Behaviors of the Transmedia Fan". In the paper he talks about the phenomena of transmedia fandom: fans who explore a fictional world through many different media types. The paper specifically focuses on an event in the Game of Thrones media franchise: The Red Wedding. Game of Thrones is an HBO television show based on a series of books named A Song of Ice and Fire. This story event is of interest because book fans were aware of the events of the Red Wedding before television fans experienced them, leading to a variety of different experiences for both. Forcier details the different types of fans and how they interact. Forcier's work has some connection to my work on spoilers and using web archives to avoid them.


In "Before Information Literacy [Or, Who Am I , As a Subject-Of-(Information)-Need?]", Ronald Day of the Department of Information and Library Science at Indiana University discusses the current issue of fake news. In his paper he considers the current solutions of misinformation exposure to be incomplete. Even though we are focusing on developing better algorithms for detecting fake news and also attempting to improve information literacy, there is also the possibility of improving a person's ability to determine what they want out of an information source. Day's paper provides an interesting history of information retrieval from an information science perspective. Over the years, I have heard that "data becomes information, but not all data is information"; Day extends this further by stating that "knowledge may result in information, but information doesn't necessarily have to come from or result in knowledge".

In "Affordances and Constraints in the Online Identity Work of LGBTQ+ Individuals", Vanessa Kitzie discusses the concepts of online identity in the LGBTQ+ community. Using interviews with thirty LGBTQ+ individuals, she asks about the experiences of the LGBTQ+ community in both social media and search engines. She finds that search engines are often used by members of the community to find the language necessary to explore their identity, but this is problematic because labels are dependent on clicks rather than on identity. Some members of the community create false social profiles so that they can "escape the norms confining" their "physical body" and choose the identity they want others to see. Many use social media to connect to other members of the community. The suggestions of further people to follow often introduces the user to more terms that help them with their identity. Her work is an important exploration of the concept of self, both on and offline.

Other Selected Papers
Sarah Bratt presented "Big Data, Big Metadata, and Quantitative Study of Science: A Workflow Model for Big Scientometrics". In this paper, she and her co-authors demonstrate a repeatable workflow used to process bibliometric data for the GenBank project. She maps the workflow that they developed for this project to the standard areas detailed in Jeffrey M. Stanton's Data Science. It is their hope that the framework can be applied to other areas of big data analytics, and they intend to pursue a workflow that will work in these areas. I wondered if their workflow would be applicable to projects like the Clickstream Map of Science. I was also happy to see that her group was trying to tackle disambiguation, something I've blogged about before.


Yu Chi presented "Understanding and Modeling Behavior Patterns in Cross-Device Web Search." She and her co-authors conducted a user study to explore the behaviors surrounding beginning a web search on one device and then continuing it on another compared with just searching on a single device. They make the point that "strategies found on the single device, single-session search might not be applicable to the cross-device search". Users switching devices have a new behavior, re-finding, that might be necessary due to the interruption. They discovered that there are differences in user behavior in the two instances and that Hidden Markov Models could be used to model and uncover some user behavior. This work has implications for search engines and information retrieval.


"Toward A Characterization of Digital Humanities Research Collections: A Contrastive Analysis of Technical Designs" is the work of Katrina Fenlon. She talks about thematic research collections, which are collected by scholars who are trying to "support research on a theme". She focuses on the technical designs of thematic research collections and explores how collections with different uses have different designs. In the paper, she reviews three very different collections and categorizes them based on need: providing advanced access to value-added sources, providing context and interrelationships to sources, and also providing a platform for "new kinds of analysis and interpretation". I was particularly interested in Dr. Felon's research because of my own work on collections.


I was glad to once again see Leslie Johnston from the United States National Archives and Records Administration. She presented her work on "ERA 2.0: The National Archives New Framework for Electronic Records Preservation." This paper discusses the issues of developing the second version of Electronic Records Archives (ERA), the system that receives and processes US government records from many agencies before permanently archiving them for posterity. It is complex because records consist not only of different file formats, but many have different regulations surrounding their handling. ERA 2.0 now uses an Agile software methodology for development as well as cloud computing in order to effectively adapt to changing needs and requirements.


Unique to my experience at the conference was Kolina Koltai's presentation of "Questioning Science with Science: The Evolution of the Vaccine Safety Movement." In this work, the authors interviewed those who sought more research on vaccine safety, often called "anti-vaxxers". Most participants cited concern for children, and not just their own, as one of their values. They often read scientific journals and are concerned about financial conflicts of interest between government agencies and the corporations that they regulate, especially in light of prior issues involving research into the safety of tobacco and sugar. The Deficit Model, the idea that the group just lacks sufficient information, does not hold for this group. They discovered that the Global Belief Model has not been effective in understanding members of this movement. It is the hope of the authors that this work will be helpful in developing campaigns and addressing concerns about vaccine safety. In a larger sense, it supports other work on "how people develop belief systems based upon their values", also providing information for those attempting to study fake news.


Manasa Rath presented "Identifying the Reasons Contributing to Question Deletion in Educational Q&A." They looked at "bad" questions asked on the Q&A site Brainly. I was particularly interested in this work because the authors identified what features of a question caused moderators to delete it and then discovered that a J48-Decision Tree classifier is best at predicting if a given question would be deleted.


"Tweets May Be Archived: Civic Engagement. Digital Preservation, and Obama White House Social Media Data" was presented by Adam Kriesberg. Using data from the Obama White House Social Media Archive stored at the Internet Archive the authors discussed the archiving -- not just web archiving -- of Barack Obama's social media content on Twitter, Vine, and Facebook. Problems exist on some platforms, such as Facebook, where data can be downloaded by users, but is not necessarily structured in a way useful to those outside of Facebook. Facebook data is only browseable by year and photographs included in the data store lack metadata. Obama changed Vine accounts during his presidency, making it difficult for archivists to determine if they have a complete collection from even a single social media platform. An archived Twitter account is temporal, meaning that counts for likes and retweets are only from a snapshot in time. On this note, Kriesberg says that values are likes and retweets are "incorrect", but I object to the terminology of "incorrect". Content drift is something I and others of WS-DL have studied and any observation from the web needs to be studied with the knowledge that it is a snapshot in time. He notes that even though we have Obama's content, we do not have the content of those he engaged with, making some conversations one-sided. He finally mentions that social media platforms provide a moving target for archivists and researchers, as APIs and HTML changes quickly, making tool development difficult. I recommend this work for anyone attempting to archive or work with social media archives.

Social

As with other conferences, ASIS&T provided multiple opportunities to connect with researchers in the community. I appreciated the interesting conversations with Christina Pikas, Hamid Alhoori, and others during breaks. I also liked the lively conversations with Elaine Toms and Timothy Bowman. I want to thank Lorri Mon for inviting me to the Florida State University alumni lunch with Kathleen Burnett, Adam Worrall, Gary Burnett, Lynette Hammond Gerido, and others where we discussed each others' work as well as developments at FSU.

 I apologize to anyone else I have left off.

Summary

ASIS&T is a neat organization focusing on the intersections of information science and technology. As always, I am looking forward to possibly attending future conferences, like Vancouver in 2018.

-- Shawn M. Jones