Thursday, December 24, 2015

2015-12-24: CNI Fall 2015 Membership Meeting Trip Report

The CNI Fall 2015 Membership Meeting was held in Washington, D.C., December 14-15, 2015.  Like all CNI meetings, the Fall 2015 meeting was excellent and contained many high-quality presentations.  Unfortunately, the members' project briefings ran simultaneously, with 7 or 8 different presentations overlapping at any given time.  As a result, I missed a great deal.

Cliff Lynch kicked off the meeting with reflections about public access to federally funded research (e.g., CRS R42983), interoperability (e.g., OAI-ORE, ORCIDs, IIIF), linked data (e.g., Wikipedia notability guidelines for biographies),  privacy & surveillance (e.g., eavesdropping Barbies, Ashley Madison data breach, RFC 7624), and understanding the personalization algorithms that go into presenting (and thus archiving) the view of the web that you experience (e.g., our 2013 D-Lib Magazine article about mobile vs. desktop & GeoIP), and much more.  I'm hesitant to try to further summarize his talk -- watching the video of his talk, as always, is time well spent.

In the next session Herbert and I presented "Achieving Meaningful Interoperability for Web-based Scholarship", which is basically a summary of our recent D-Lib Magazine paper "Reminiscing About 15 Years of Interoperability Efforts". 

2016-01-07 Edit: CNI has now posted the video of our presentation:

See also the excellent summary and commentary from David Rosenthal about the "signposting" proposal.

The next session I split between "Linked Data for Libraries and Archives: LD4L and Europeana" (see the "Linked Data for Libraries" site) and "Is Gold Open Access Sustainable? Update from the UC Pay-It-Forward Project" (slides, video).  The final session of the day included several presentations I would have liked to have seen but didn't.  I understand "Documenting Ferguson: Building A Community Digital Repository" (slides) was good & standing room only.

I missed the opening session on the second day (including the "Update on Funding Opportunities" presentation), but made it to the presentation from David Rosenthal about emulation.  See the transcript of his talk, as well as his 2015 Emulation and Virtualization as Preservation Strategies report for the AMF.

Unfortunately, David's talk collided with that of Martin & his UCLA colleagues.  Fortunately, CNI has posted the video of their talk, his slides are online, and he has a great interactive site to explore the data.

After lunch I attended Rob's talk "The Future of Linked Data in Libraries: Assessing BibFrame Against Best Practices" (slides).  Rob even referenced my "no free kittens" slogan (tirade?) from our time developing OAI-ORE:

The closing plenary was an excellent talk from Julie Brill, a Commissioner of the Federal Trade Commission, entitled "Transparency, Trust, and Consumer Protection in a Complex World".  The transcript is worth reading, but the essence of the talk explores the role the FTC would (should?) play in making sure that consumers can be aware of the data that companies track about them and how that data is used to make decisions about the consumers. (2016-01-07 edit: the video of her presentation is now online.)

A mostly complete list of slides is available via the OSF.  CNI recorded many of the presentations and has begun uploading the videos to the CNI YouTube channel.  The CNI Spring 2016 Membership Meeting will be held in San Antonio, TX, April 4-5, 2016.

Given all the simultaneous sessions, your CNI experience was probably different than mine.  Check out these other CNI Fall 2015 trip reports: Dale Askey, Jaap Geraerts, and Tim Pyatt.


Tuesday, December 8, 2015

2015-12-08: Evaluating the Temporal Coherence of Composite Mementos

When an archived web page is viewed using the Wayback Machine, the archival datetime is easy to determine from the URI and the Wayback Machine's display.  The archival datetime of embedded resources (images, CSS, etc.) is another story.  And what stories their archival datetimes can tell.  These stories are the topic of my recent research and Hypertext 2015 publication.  This post introduces composite mementos and the evaluation of their temporal (in-)coherence, and provides an overview of my research results.


What is a composite memento?


A Memento is an archived copy of a web resource (RFC 7089).  The datetime when the copy was archived is called its Memento-Datetime.  A composite memento is a root resource such as an HTML web page and all of the embedded resources (images, CSS, etc.) required for a complete presentation.  Composite mementos can be thought of as a tree structure.  The root resource embeds other resources, which may themselves embed resources, etc.  The figure below shows this tree structure and a composite memento of the ODU Computer Science home page as archived by the Internet Archive on 2005-05-14 01:36:08 GMT.  Or does it?
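The tree structure can be made concrete with a small sketch.  This is only an illustration, not code from the study: the Memento class, its field names, the utc helper, and the example.org URIs and datetimes are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Iterator, List, Optional


@dataclass
class Memento:
    """One archived resource: a node in the composite memento tree."""
    uri: str                                   # URI of the original resource
    memento_datetime: datetime                 # when this copy was archived
    last_modified: Optional[datetime] = None   # Last-Modified header, if known
    embedded: List["Memento"] = field(default_factory=list)

    def walk(self) -> Iterator["Memento"]:
        """Yield this memento, then everything embedded beneath it (depth first)."""
        yield self
        for child in self.embedded:
            yield from child.walk()


def utc(*args: int) -> datetime:
    """Hypothetical helper: build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


# Made-up composite memento: a root page, a stylesheet that itself embeds
# an image, and a logo.  None of these URIs or datetimes come from the post.
stylesheet = Memento(
    "http://example.org/style.css",
    utc(2005, 5, 14, 1, 36, 10),
    embedded=[Memento("http://example.org/bg.png", utc(2005, 6, 2, 8, 0, 0))],
)
root = Memento(
    "http://example.org/",
    utc(2005, 5, 14, 1, 36, 8),
    embedded=[stylesheet, Memento("http://example.org/logo.gif", utc(2005, 4, 30, 12, 0, 0))],
)

# Every embedded memento carries its own Memento-Datetime, which need not
# match the root's.
embedded_datetimes = [m.memento_datetime for m in root.walk() if m is not root]
```

Walking the tree makes the point of the question above: the root has one Memento-Datetime, but each embedded resource has its own.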


Hints of Temporal Incoherence


Consider the following weather report that was captured on 2004-12-09 19:09:26 GMT.  The Memento-Datetime can be found in the URI and the December 9, 2004 capture date is clearly visible near the upper right.  Look closely at the description of Current Conditions and the radar image.  How can there be no clouds on the radar when the current conditions are light drizzle?  Something is wrong here.  We have encountered temporal incoherence.  This particular incoherence is caused by inherent delays of the capture process used by Heritrix and other crawler-based web archives.  In this case, the radar image was captured much later (9 months!) than the web page itself.  However, there is no indication of this condition.
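Reading the Memento-Datetime out of a Wayback Machine URI can be sketched as follows.  The sketch assumes the common URI layout in which a 14-digit YYYYMMDDhhmmss timestamp follows /web/, optionally with a short modifier such as im_.  The function name and the example.com URI are hypothetical; only the timestamp matches the weather report above.

```python
import re
from datetime import datetime, timezone

# 14-digit timestamp after "/web/", optionally followed by a modifier
# such as "im_" or "id_" (assumed layout of Wayback Machine URIs).
_WAYBACK_TIMESTAMP = re.compile(r"/web/(\d{14})[a-z_]*/")


def memento_datetime_from_uri(uri_m: str) -> datetime:
    """Return the Memento-Datetime encoded in a Wayback-style URI (as UTC)."""
    match = _WAYBACK_TIMESTAMP.search(uri_m)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {uri_m!r}")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


# Hypothetical URI; the timestamp is the capture datetime quoted above.
page_dt = memento_datetime_from_uri(
    "http://web.archive.org/web/20041209190926/http://example.com/weather"
)
```

Applying the same function to the URI of each embedded image or stylesheet is one way to notice a gap like the one described here.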


A Framework for Evaluating Temporal Coherence

In order to study temporal coherence of composite mementos, a framework was needed.  The framework details a series of patterns describing the relationships between root and embedded mementos and four coherence states.  The four states and sample patterns are described below.  The technical report describing the framework is available on arXiv.


Prima Facie Coherent

An embedded memento is prima facie coherent when evidence shows that it existed in its archived state at the time the root was captured.  The figure below illustrates the most common case.  Here the embedded memento was captured after the root but was last modified before the root was captured.  The importance of Last-Modified is discussed in my previous post on the importance of header replay.


Possibly Coherent

An embedded memento is possibly coherent when evidence shows that it might have existed in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured before the root.


Probably Violative

An embedded memento is probably violative when evidence shows that it might not have existed in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured after the root, but its Last-Modified datetime is unknown.

Prima Facie Violative

An embedded memento is prima facie violative when evidence shows that it did not exist in its archived state at the time the root was captured.  The figure below illustrates this case.  Here the embedded memento was captured after the root and was also modified after the root.
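The four sample patterns above can be expressed as a small decision function.  This is a minimal sketch that covers only the patterns described in this post, not the full framework from the technical report.  The names Coherence and classify are hypothetical, and treating equal datetimes as "before" is an assumption.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Coherence(Enum):
    PRIMA_FACIE_COHERENT = "prima facie coherent"
    POSSIBLY_COHERENT = "possibly coherent"
    PROBABLY_VIOLATIVE = "probably violative"
    PRIMA_FACIE_VIOLATIVE = "prima facie violative"


def classify(
    root_dt: datetime,
    embedded_dt: datetime,
    last_modified: Optional[datetime] = None,
) -> Coherence:
    """Classify one embedded memento against its root (sample patterns only)."""
    if embedded_dt <= root_dt:
        # Captured before the root: it might still have been current.
        # (Counting an identical datetime as "before" is an assumption.)
        return Coherence.POSSIBLY_COHERENT
    if last_modified is None:
        # Captured after the root with no Last-Modified evidence.
        return Coherence.PROBABLY_VIOLATIVE
    if last_modified <= root_dt:
        # Captured after the root but unchanged since before the root.
        return Coherence.PRIMA_FACIE_COHERENT
    # Captured after the root and also modified after the root.
    return Coherence.PRIMA_FACIE_VIOLATIVE


# The weather report: page captured 2004-12-09, radar image captured about
# nine months later.  The exact radar datetime here is made up.
page = datetime(2004, 12, 9, 19, 9, 26, tzinfo=timezone.utc)
radar = datetime(2005, 9, 14, 12, 0, 0, tzinfo=timezone.utc)
radar_state = classify(page, radar)
```

With no Last-Modified evidence, the late radar capture lands in the probably violative state; a Last-Modified datetime later than the page capture would move it to prima facie violative.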




Only One in Five Archived Web Pages Existed as Presented

Using the framework, we evaluated the temporal coherence of 82,425 composite mementos. These contained 1,623,127 embedded URIs, of which 1,332,993 were available in a web archive.  Composite mementos were recomposed using single and multiple archives and two heuristics: minimum distance and bracket.
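For a sense of scale, the counts above can be turned into a few ratios.  The variable names are only for illustration; the three input numbers are the ones quoted in this post.

```python
composite_mementos = 82_425   # composite mementos evaluated
embedded_uris = 1_623_127     # embedded URIs they contained
archived_uris = 1_332_993     # embedded URIs available in a web archive

archived_share = archived_uris / embedded_uris     # fraction available
missing_uris = embedded_uris - archived_uris       # not found in any archive
mean_embedded = embedded_uris / composite_mementos # per composite memento

print(f"available: {archived_share:.1%}")
print(f"missing:   {missing_uris:,}")
print(f"embedded URIs per composite memento: {mean_embedded:.1f}")
```

Roughly 82% of the embedded URIs were available in an archive, and each composite memento contained about 20 embedded URIs on average.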

Single and multiple archives: Composite mementos were recomposed from single and multiple archives. For single archives, all embedded mementos were selected from the same archive as the root. For multiple archives, embedded mementos were selected from any of the 15 archives included in the study.

Heuristics:  The minimum distance (or nearest) heuristic selects between multiple captures for the same URI by choosing the memento with the Memento-Datetime nearest to the root's Memento-Datetime, and can be either before or after the root's. The bracket heuristic also takes Last-Modified datetime into account. When a memento's Last-Modified datetime and Memento-Datetime "bracket" the root's Memento-Datetime (as in Prima Facie Coherent above), it is selected even if it is not the closest.
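The two heuristics can be sketched as selection functions over the candidate captures of one embedded URI.  This is an illustration under stated assumptions, not the code used in the study.  The Capture type and function names are hypothetical.  Falling back to minimum distance when nothing brackets the root, and picking the nearest capture when several do, are also assumptions.

```python
from datetime import datetime, timezone
from typing import List, NamedTuple, Optional


class Capture(NamedTuple):
    """One candidate capture of an embedded URI (hypothetical type)."""
    memento_datetime: datetime
    last_modified: Optional[datetime] = None


def minimum_distance(root_dt: datetime, captures: List[Capture]) -> Capture:
    """Pick the capture nearest the root's Memento-Datetime, before or after."""
    # Raises ValueError if captures is empty.
    return min(captures, key=lambda c: abs(c.memento_datetime - root_dt))


def bracket(root_dt: datetime, captures: List[Capture]) -> Capture:
    """Prefer a capture whose Last-Modified and Memento-Datetime bracket the root."""
    bracketing = [
        c for c in captures
        if c.last_modified is not None
        and c.last_modified <= root_dt <= c.memento_datetime
    ]
    if bracketing:
        # Assumption: with several bracketing captures, take the nearest.
        return minimum_distance(root_dt, bracketing)
    # Assumption: with no bracketing capture, fall back to minimum distance.
    return minimum_distance(root_dt, captures)


# Made-up example: one capture two days before the root, and one a month
# after it whose Last-Modified predates the root.
root_dt = datetime(2010, 6, 1, tzinfo=timezone.utc)
near = Capture(datetime(2010, 5, 30, tzinfo=timezone.utc))
bracketed = Capture(
    datetime(2010, 7, 1, tzinfo=timezone.utc),
    last_modified=datetime(2010, 1, 1, tzinfo=timezone.utc),
)
chosen_nearest = minimum_distance(root_dt, [near, bracketed])
chosen_bracket = bracket(root_dt, [near, bracketed])
```

In this example minimum distance picks the capture from two days earlier, while bracket picks the later capture because its Last-Modified datetime shows it was unchanged across the root's capture time.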

We found that only 38.7% of web pages are temporally coherent and that only 17.9% (roughly 1 in 5) of web pages are temporally coherent and can be fully recomposed (i.e., they have no missing resources).

The paper can be downloaded from the ACM Digital Library or from my local copy.  The slides from the Hypertext'15 talk follow.

One last thing: I would like to thank Ted Goranson for presenting the slides at Hypertext 2015 when we could not attend.

-- Scott G. Ainsworth