2016-10-03: Summary of “Finding Pages on the Unarchived Web”

In this paper, the authors detail their approach to recovering the unarchived Web based on the links and anchor text of crawled pages. The data came from the Dutch 2012 Web archive at the National Library of the Netherlands (KB), totaling about 38 million webpages. The library selected the collection based on categories related to Dutch history and social and cultural heritage, and each website is categorized using a UNESCO code. The authors address three research questions: Can we recover a significant fraction of unarchived pages? How rich are the representations for the unarchived pages? Are these representations rich enough to characterize the content?

The link extraction used Hadoop MapReduce and Apache Pig to process all archived webpages, and JSoup to extract links from their content. A second MapReduce job indexed the URLs and checked whether each one was archived. The data was then deduplicated based on the values of year, anchor text, source, target, and hashcode (MD5), and basic cleaning and processing was performed. The resulting dataset contained 11 million webpages. Both external (inter-server) links, which connect different servers, and site-internal (intra-server) links, which occur within a server, were included. An Apache Pig script was used to aggregate the extracted links by elements such as TLD, domain, host, and file type.
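The extraction and deduplication steps can be illustrated on a single machine. The sketch below is not the authors' pipeline: it swaps JSoup and Hadoop for Python's standard-library HTML parser, and the names `AnchorExtractor`, `extract_links`, and `deduplicate` are hypothetical. It assumes the MD5 hashcode is computed over the source page content.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    """Collects (target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []     # finished (target, anchor_text) pairs
        self._href = None   # resolved href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the source page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(source_url, html, year):
    """Return (year, anchor, source, target, hash) tuples for one page."""
    parser = AnchorExtractor(source_url)
    parser.feed(html)
    parser.close()
    # Assumption: the hashcode is the MD5 of the source page content.
    page_hash = hashlib.md5(html.encode("utf-8")).hexdigest()
    return [(year, anchor, source_url, target, page_hash)
            for target, anchor in parser.links]


def deduplicate(records):
    """Drop exact duplicates on all five fields, keeping first-seen order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique
```

Deduplicating on the full tuple means the same link found in two identical captures from the same year collapses to one record, while a changed page (different hash) or changed anchor text is kept.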

Each record in the processed file contains the following fields:
(sourceURL, sourceUnesco, sourceInSeedProperty, targetURL, targetUnesco, targetInSeedProperty, anchorText, crawlDate, targetInArchiveProperty, sourceHash).
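To make the record layout concrete, here is a minimal sketch of how such a record might be represented and parsed. The summary does not state the on-disk format, so the tab-separated layout and the "true"/"false" encoding of the three boolean properties are assumptions, and `LinkRecord` and `parse_record` are hypothetical names.

```python
from typing import NamedTuple


class LinkRecord(NamedTuple):
    """One extracted link, mirroring the ten fields listed above."""
    source_url: str
    source_unesco: str
    source_in_seed: bool
    target_url: str
    target_unesco: str
    target_in_seed: bool
    anchor_text: str
    crawl_date: str
    target_in_archive: bool
    source_hash: str


def parse_record(line):
    """Parse one tab-separated line (assumed format) into a LinkRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return LinkRecord(
        source_url=fields[0],
        source_unesco=fields[1],
        source_in_seed=fields[2] == "true",     # assumed boolean encoding
        target_url=fields[3],
        target_unesco=fields[4],
        target_in_seed=fields[5] == "true",
        anchor_text=fields[6],
        crawl_date=fields[7],
        target_in_archive=fields[8] == "true",
        source_hash=fields[9],
    )
```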

There are four main classifications of URLs in this data set, shown in Figure 1:
1-Intentionally archived URLs in the seed list, which make up 92% of the dataset (10.1M).
2-Unintentionally archived URLs due to crawler configuration, which make up 8% of the dataset (0.8M).
3-Inner Aura: unarchived URLs whose parent domain is included in the seed list (5.5M), (20% depth 4, because 94% are links to the site).
4-Outer Aura: unarchived URLs whose parent domain is not on the seed list (5.2M), (29.7% depth 2).
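The four classes reduce to two yes/no questions: is the URL archived, and is its parent domain on the seed list? The sketch below illustrates that decision under simplifying assumptions: `naive_domain` and `classify` are hypothetical names, seed membership is checked at the domain level for all four classes, and the parent domain is approximated by the last two labels of the host name (a real pipeline would need a public-suffix list to handle hosts such as `example.co.uk`).

```python
from urllib.parse import urlsplit


def naive_domain(url):
    """Approximate the parent domain as the last two host labels."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def classify(url, archived_urls, seed_domains):
    """Assign a URL to one of the four classes described above."""
    in_archive = url in archived_urls
    in_seed = naive_domain(url) in seed_domains
    if in_archive:
        return "intentionally archived" if in_seed else "unintentionally archived"
    return "inner aura" if in_seed else "outer aura"
```

For example, with `seed_domains = {"example.nl"}` and an empty archive, `classify("http://www.example.nl/page", set(), seed_domains)` yields `"inner aura"`, while an unarchived URL on any other domain falls into the outer Aura.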

In this work, the Aura is defined as the Web documents that are not included in the archived collection but are known to have existed because archived pages reference them.

They analyzed the four classifications and counted unique hosts, domains, and TLDs. Unintentionally archived URLs have a higher percentage of unique hosts, domains, and TLDs than intentionally archived URLs, and the outer Aura has a higher percentage of unique hosts, domains, and TLDs than the inner Aura.

When checking the Aura, they found that most of the unarchived Aura points to textual web content. The inner Aura mostly had the .nl top-level domain (95.7%), while the outer Aura had 34.7% .com, 31.1% .nl, and 18% .jp TLDs. The high percentage of the Japanese TLD is explained by pages that were unintentionally archived. They also analyzed the indegree of the Aura: all target representations in the outer Aura have at least one source link, 18% have at least 3 links, and 10% have 5 links or more. In addition, the Aura was compared by the number of intra-server and inter-server links: the inner Aura had 94.4% intra-server links, whereas the outer Aura had 59.7% inter-server links.
The number of unique anchor text words was similar for the inner and outer Aura: 95% had at least one word describing them, 30% had at least three words, and 3% had 10 words or more.
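The indegree and anchor-word statistics above are all "share of targets with at least k items" computations. A minimal sketch follows, assuming indegree means the number of distinct source pages linking to a target and that anchor words are split on whitespace. The function names are hypothetical, and the host comparison is a simplified stand-in for the paper's intra-server/inter-server distinction.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def aggregate_targets(links):
    """Group links by target.

    links: iterable of (source_url, target_url, anchor_text) tuples.
    Returns two dicts keyed by target URL: the set of distinct source
    pages, and the set of unique lowercased anchor words.
    """
    sources = defaultdict(set)
    words = defaultdict(set)
    for source, target, anchor in links:
        sources[target].add(source)
        words[target].update(anchor.lower().split())
    return sources, words


def share_with_at_least(groups, k):
    """Fraction of targets whose set has at least k elements."""
    if not groups:
        return 0.0
    return sum(1 for items in groups.values() if len(items) >= k) / len(groups)


def is_intra_server(source_url, target_url):
    """True when source and target share the same host name."""
    return urlsplit(source_url).hostname == urlsplit(target_url).hostname
```

With these helpers, `share_with_at_least(sources, 3)` corresponds to the "at least 3 links" figure and `share_with_at_least(words, 10)` to the "10 words or more" figure, each computed over whatever set of targets is passed in.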

To test whether missing unarchived Web pages can be found, they took a random sample of 300 websites, of which 150 were homepages and 150 were non-homepages, and made sure that each selected website was either live or archived. Using anchor text, 46.7% of the target pages were found within the top 10 of the SERP. For non-homepages, however, 46% were found using words obtained from the URLs. Combining anchor text and URL word evidence gave higher percentages for both groups: 64% of the homepages and 55.3% of the deeper pages could be retrieved. Another random sample of URLs was selected to examine the anchor text and the words from the link; they found that homepages can be represented with anchor text, whereas non-homepages are better represented with both anchor text and words from the link.
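The idea of combining anchor text with URL words, and then measuring how often the target lands in the top 10, can be sketched as follows. This is an illustration rather than the authors' retrieval setup: the tokenization (ASCII letters and digits only), the small stop list, and the names `url_words`, `representation`, and `success_at_k` are all assumptions, and the ranked result lists are taken as given input from some search system.

```python
import re
from urllib.parse import unquote, urlsplit

# Assumed stop list of URL tokens that carry little meaning.
URL_STOP_TOKENS = {"www", "html", "htm", "php", "index"}


def url_words(url):
    """Split the host and path of a URL into lowercase word tokens."""
    parts = urlsplit(url)
    raw = unquote((parts.hostname or "") + " " + parts.path)
    tokens = re.split(r"[^0-9a-zA-Z]+", raw.lower())
    return [t for t in tokens if t and t not in URL_STOP_TOKENS]


def representation(url, anchor_texts):
    """Bag of words for an unarchived page: anchor words plus URL words."""
    words = []
    for text in anchor_texts:
        words.extend(text.lower().split())
    words.extend(url_words(url))
    return words


def success_at_k(rankings, k=10):
    """Fraction of targets that appear in their own top-k result list.

    rankings: dict mapping each target URL to the ranked list of URLs
    returned when searching with that target's representation.
    """
    if not rankings:
        return 0.0
    hits = sum(1 for target, results in rankings.items() if target in results[:k])
    return hits / len(rankings)
```

A deeper page usually receives few anchors but carries descriptive path segments, which is one plausible reading of why URL words help non-homepages more than homepages.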

They found that the archived pages show evidence of a large number of unarchived pages and websites. They also found that only a few homepages have rich representations. Finally, they found that even with only a few words describing it, a missing webpage can be found at the first rank. Future work includes adding further information, such as surrounding text, and more advanced retrieval models.

-Lulwah M. Alkwai
