Friday, April 13, 2018

2018-04-13: Web Archives are Used for Link Stability, Censorship Avoidance, and Traffic Siphoning

Web archives have been used for purposes other than digital preservation and browsing historical data. These purposes can be divided into three categories:

  1. Uploading content to web archives to ensure continuous availability of the data.
  2. Avoiding governments' censorship or websites' terms of service.
  3. Using URLs from web archives, instead of direct links, for news sites with opposing ideologies to avoid increasing their web traffic and deprive them of ad revenue.

1. Uploading content to web archives to ensure continuous availability of the data


Web archives, by design, are intended to solve the problem of digital data preservation so people can access data when it is no longer available on the live web. In the paper Who and What Links to the Internet Archive (Yasmin AlNoamany, Ahmed AlSum, Michele C. Weigle, and Michael L. Nelson, 2013), the authors show that 65% of the requested archived pages no longer exist on the live web. The paper also determines where Internet Archive's Wayback Machine users come from. The following table, from the paper, contains the top 10 referrers that link to IA’s Wayback Machine; together they account for 51.9% of all referrers. en.wikipedia.org outnumbers all other sites, including search engines and the home page of the Internet Archive (archive.org).
The top 10 referrers that link to IA’s Wayback Machine
Who and What Links to the Internet Archive, (AlNoamany et al. 2013) Table 5

Sometimes the archived data is controversial, and the user wants to make sure that he or she can refer back to it later in case it is removed from the live web. A clear example is the deleted tweets of U.S. President Donald Trump.
Mr. Trump's deleted tweets on politwoops.eu


2. Avoiding governments' censorship or websites' terms of service


Using the Internet Archive to get around the terms of service of file sharing sites was addressed by Justin Littman in a blog post, Islamic State Extremists Are Using the Internet Archive to Deliver Propaganda. He stated that ISIS sympathizers are using the Internet Archive as a web delivery platform for extremist propaganda, posing a threat to the archival mission of the Internet Archive. Mr. Littman did not evaluate the content to determine if it is extremist in nature since much of it is in Arabic. This behavior is not new; it was noted in data uploaded by Al-Qaeda sympathizers long before ISIS was created. Al-Qaeda uploaded the file https://archive.org/details/osamatoobama to the Internet Archive on February 16, 2010 to circumvent file sharing sites' content removal policies. ISIS sympathizers upload clips documenting battles, executions, or even video announcements by ISIS leaders to the Internet Archive because that type of content is automatically removed from video sharing sites like YouTube to prevent extremist propaganda.

On February 4, 2015, ISIS uploaded a video to the Internet Archive featuring the execution by immolation of captured Jordanian pilot Muath Al-Kasasbeh, only one day after the execution! This video violates YouTube's terms of service and is no longer on YouTube.
https://archive.org/details/YouTube_201502
ISIS members immolating captured Jordanian pilot (graphic video)
In fact, YouTube's algorithm is so aggressive that it removed thousands of videos documenting the Syrian revolution. Activists argued that the removed videos had been uploaded to document atrocities during the Syrian government's crackdown, and that YouTube killed any possible hope for future war crimes prosecutions.

Hani Al-Sibai, a lawyer, Islamic scholar, Al-Qaeda sympathizer, and former member of the Egyptian Islamic Jihad Group who lives in London as a political refugee, uploads his content to the Internet Archive. Although he is anti-ISIS, his content more often than not does not encourage violence, and he has had only a few issues with YouTube, he pushes his content to multiple sites on the web, including web archiving sites, to ensure continuous availability of his data.

For example, this is an audio recording from Hani Al-Sibai condemning the immolation of the Jordanian pilot, Muath Al-Kasasbeh. Mr. Al-Sibai uploaded the recording to the Internet Archive a day after the execution.
https://archive.org/details/7arqTayyar
An audio recording by Hani Al-Sibai condemning the execution by burning (uploaded to IA a day after the execution)

These are some examples where the Internet Archive is used as a file sharing service. Clips are simultaneously uploaded to YouTube, Vimeo, and the Internet Archive for the purpose of sharing.
Screenshot from justpaste.it where links to videos uploaded to IA are used for sharing purposes
Both videos shown in the screenshot were removed from YouTube for violating terms of service, but they are not lost because they were also uploaded to the Internet Archive.

https://www.youtube.com/watch?v=Cznm0L5X9LE
Rebuttal from Hani Al-Sibai addressing ISIS spokesman's attack on Al-Qaeda leader Ayman Al-Zawaheri (removed from Youtube)

https://archive.org/details/Fajr3_201407
Rebuttal from Hani Al-Sibai addressing ISIS spokesman's attack on Al-Qaeda leader Ayman Al-Zawaheri (uploaded to IA)

https://www.youtube.com/watch?v=VuSgxhBtoic
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (removed from Youtube)

https://archive.org/details/Ta3liq_Hadi
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (uploaded to IA)
The same video was not removed from Vimeo
https://vimeo.com/111975796
Rebuttal from Hani Al-Sibai addressing ISIS leader's speech on the expansion of ISIS (uploaded to Vimeo)
I am not sure if web archiving sites have content moderation policies, but even sharing sites that do have them apply them inconsistently! YouTube is a perfect example; no one knows what YouTube's rules even are anymore.

A less popular use of the Internet Archive is browsing the live web through Internet Archive links to bypass government censorship. Sometimes governments block sites with opposing ideologies, but their archived versions remain accessible. When these governments realize that their censorship is being evaded, they block the Internet Archive entirely to prevent access to the same content they blocked on the live web. In 2017, the IA’s Wayback Machine was blocked in India, and in 2015, Russia blocked the Internet Archive over a single page!

3. Using URLs from web archives instead of direct links for news sites with opposing ideologies to deprive them of ad revenue

Even when the live web version is not blocked, there are situations where readers want to deny traffic and the resulting ad revenue to web sites with opposing ideologies. In a recent paper, Understanding Web Archiving Services and Their (Mis)Use on Social Media (Savvas Zannettou, Jeremy Blackburn, Emiliano De Cristofaro, Michael Sirivianos, and Gianluca Stringhini, 2018), the authors presented a large-scale analysis of web archiving services and their use on social networks, the archived content, and how it is shared and used. They found that contentious news and social media posts are the most common types of content archived. Also, URLs from web archiving sites are widely posted in “fringe” groups on Reddit and 4chan to preserve controversial data that might disappear; this case also falls under the first category. Furthermore, the authors found evidence of group admins forcing members to use URLs from web archives instead of direct links to sites with opposing ideologies, referring to them without increasing their traffic and depriving them of ad revenue. For instance, The_Donald subreddit systematically targets the ad revenue of news sources with adverse ideologies using moderation bots that block URLs from those sites and prompt users to post archive URLs instead.
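The bot behavior described above can be sketched in a few lines. This is a hypothetical illustration, not code from any actual moderation bot; the domain list, archive prefix, and function names are all made up:

```python
import re

# Hypothetical domains targeted by the bot and the archive prefix it uses.
TARGET_DOMAINS = ["example-news.com", "example-tribune.com"]
ARCHIVE_PREFIX = "https://web.archive.org/web/"

def rewrite_links(text):
    # Match http(s) links to any targeted domain, with or without "www.".
    pattern = re.compile(
        r"https?://(?:www\.)?(?:"
        + "|".join(map(re.escape, TARGET_DOMAINS))
        + r")\S*"
    )
    # Prepending the Wayback Machine prefix without a timestamp resolves to
    # the most recent capture of the URL, so no traffic reaches the source.
    return pattern.sub(lambda m: ARCHIVE_PREFIX + m.group(0), text)

comment = "Read this: https://example-news.com/story/123"
print(rewrite_links(comment))
# Read this: https://web.archive.org/web/https://example-news.com/story/123
```

Links to non-targeted domains pass through untouched, so the rewriting only starves the listed sites of clicks.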

The authors also found that web archives are used to evade censorship policies in some communities: for example, /pol/ users post archive.is URLs to share content from 8chan and Facebook, which are banned on the platform, or to dodge word-filters (e.g., ‘smh’ becomes ‘baka’, so links to smh.com.au point to baka.com.au instead).
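The word-filter side effect can be shown with a toy filter; the one-entry table below illustrates the behavior described, and is not 4chan's actual implementation:

```python
# A one-entry stand-in for the board's word-filter table.
FILTERS = {"smh": "baka"}

def apply_filters(text):
    # Naive global replacement: the filter does not spare URLs, so a link's
    # hostname gets rewritten along with ordinary text.
    for banned, replacement in FILTERS.items():
        text = text.replace(banned, replacement)
    return text

print(apply_filters("source: https://www.smh.com.au/world"))
# source: https://www.baka.com.au/world
```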

According to the authors, Reddit bots are responsible for posting a huge portion of archive URLs on Reddit due to moderators trying to ensure the availability of the data, but this practice affects the amount of traffic that the source sites would have received from Reddit.

I went on 4chan to find a few examples similar to those examined in the paper, and despite not knowing what 4chan was prior to reading the paper, I was able to find a couple of examples of shared archive links on 4chan in just under two minutes. I took screenshots of both examples; the threads have since been deleted because 4chan removes threads after they reach page 10.

Pages are archived on archive.is then shared on 4chan
Sharing links to archive.org in a comment on 4chan

The take away message is that web archives have been used for purposes other than digital preservation and browsing historical data. These purposes include:
  1. Uploading content to web archives to mitigate the risk of data loss.
  2. Avoiding governments' censorship or websites' terms of service.
  3. Using URLs from web archives, instead of direct links, for news sites with opposing ideologies to deprive them of ad revenue.
--
Hussam Hallak

Monday, April 9, 2018

2018-04-09: Trip Report for the National Forum on Ethics and Archiving the Web (EAW)


On March 22-24, 2018 I attended the National Forum on Ethics and Archiving the Web (EAW), hosted at the New Museum and organized by Rhizome and the members of the Documenting the Now project.  The nor'easter "Toby" frustrated the travel plans of many, causing my friend Martin Klein to cancel completely and delaying my arrival at the New Museum until after the start of the second session at 2pm on Thursday.  Fortunately, all the sessions were recorded and I link to them below.

Day 1 -- March 22, 2018


Session 1 (recording) began with a welcome, and then a keynote by Marisa Parham, entitled "The Internet of Affects: Haunting Down Data".  I did have the privilege of seeing her keynote at the last DocNow meeting in December, and looking at the tweets ("#eaw18") she addressed some of the same themes, including the issues of the process of archiving social media (e.g., tweets) and the resulting decontextualization, including "Twitter as dataset vs. Twitter as experience", and "how do we reproduce the feeling of community and enhance our understanding of how to read sources and how people in the past and present are engaged with each other?"  She also made reference to the Twitter heat map for showing interaction with the Ferguson grand jury verdict ("How a nation exploded over grand jury verdict: Twitter heat map shows how 3.5 million #Ferguson tweets were sent as news broke that Darren Wilson would not face trial").



After Marisa's keynote was the panel on "Archiving Trauma", with Michael Connor (moderator), Chido Muchemwa, Nick Ruest (slides), Coral Salomón, Tonia Sutherland, and Lauren Work.  There are too many important topics here and I did not experience the presentations directly, so I will refer you to the recording for further information and a handful of selected tweets below. 


The next session after lunch was "Documenting Hate" (recording), with Aria Dean (moderator), Patrick Davison, Joan Donovan, Renee Saucier, and Caroline Sinders.  I arrived at the New Museum about 10 minutes into this panel.  Caroline spoke about the Pepe the Frog meme, its appropriation by Neo-Nazis, and the attempt by its creator to wrest it back -- "How do you balance the creator’s intentions with how culture has remixed their work?"

Joan spoke about a range of topics, including archiving the Daily Stormer forum, archiving the disinformation regarding the attacks in Charlottesville this summer (including false information originating on 4chan about who drove the car), and an algorithmic image collection technique for visualizing trending images in the collection.


Renee Saucier talked about experiences collecting sites for the "Canadian Political Parties and Political Interest Groups" (Archive-It collection 227), which includes Neo-Nazi and affiliated political parties.


The next panel was "Web Archiving as Civic Duty", with Amelia Acker (co-moderator), Natalie Baur, Adam Kriesberg (co-moderator) (transcript), Muira McCammon, and Hanna E. Morris.  My own notes on this session are sparse (in part because most of the presenters did not use slides), so I'll include a handful of tweets I noted that I feel succinctly capture the essence of the presentations.  I did find a link to Muira's MS thesis "Reimagining and Rewriting the Guantánamo Bay Detainee Library: Translation, Ideology, and Power", but it is currently under embargo.  I did find an interview with her that is available and relevant.  Relevant to Muira's work with deleted US Govt accounts is Justin Littman's recent description of a disinformation attack with re-registering deleted accounts ("Vulnerabilities in the U.S. Digital Registry, Twitter, and the Internet Archive"). 2018-04-17 update: Muira just published two related articles about deleted tweets: "Trouble @JTFGTMO" and "Can They Really Delete That?".


The third session, "Curation and Power" (recording) began with a panel with Jess Ogden (moderator), Morehshin Allahyari, Anisa Hawes, Margaret Hedstrom, and Lozana Rossenova.  Again, I'll borrow heavily from tweets. 


The final session for Thursday was the keynote by Safiya Noble, based on her recent book "Algorithms of Oppression" (recording).  I really enjoyed Safiya's keynote; I had heard of some of the buzz and controversy (see my thread (1, 2, 3) about archiving some of the controversy) around the book but I had not yet given it a careful review (if you're not familiar with it, read this five minute summary Safiya wrote for Time).  I include several insightful tweets from others below, but I'll also summarize some of the points that I took away from her presentation (and they should be read as such and not as a faithful or complete transcription of her talk).

First, as a computer scientist I understand and am sympathetic to the idea that the ranking algorithms Google et al. use should be neutral.  It's an ultimately naive and untenable position, but I'd be lying if I said I did not understand the appeal.  The algorithms that help us differentiate quality pages from spam pages about everyday topics like artists, restaurants, and cat pictures do what they do well.  In one of the examples I use in my lecture (slides 55-58), it's the reason why for the query "DJ Shadow", the wikipedia.org and last.fm links appear on Google's page 1, and djshadow.rpod.ru appears on page 15: in this case the ranking of the sites based on their popularity in terms of links, searches, clicks, and other user-oriented metrics makes sense.  But what happens when the query is, as Safiya provides in her first example, "black girls"?  The result (ca. 2011) is almost entirely porn (cf. the in-conference result for "asian girls"), and the algorithms that served us so well in finding quality DJ Shadow pages in this case produce a socially undesirable result.  Sure, this undesirable result comes from having indexed the global corpus (and our interactions with it) and is thus a mirror of the society that created those pages.  But given the centrality in our lives that Google enjoys, and the fact that people consider it an oracle rather than just a tool that gives undesirable results when indexing undesirable content, it is irresponsible for Google to ignore the feedback loop they provide: they no longer just reflect the bias, they hegemonically reinforce it, and they give attack vectors to those who would defend it.

Furthermore, there is already precedent for adjusting search results to eliminate bias in other dimensions: for example, PageRank by itself is biased against late-arriving pages/sites (e.g., "Impact of Web Search Engines on Page Popularity"), so search engines (SEs) adjust the rankings to accommodate these pages.  Similarly, Google has a history of intervening to remove "Google Bombs" (e.g., "miserable failure"), punish attempts to modify ranking, and even replacing results pages with jokes -- if these modifications are possible, then Google can no longer pretend the algorithm results are inviolable. 

She did not confine her criticism to Google, she also examined query results in digital libraries like ArtStor.  The metadata describing the contents in the DL originate from a point-of-view, and queries with a different POV will not return the expected results.  I use similar examples in my DL lecture on metadata (my favorite is reminding the students that the Vietnamese refer to the Vietnam War as the "American War"), stressing that even actions as seemingly basic as assigning DNS country codes (e.g., ".ps") are fraught with geopolitics, and that neutrality is an illusion even in a discipline like computer science. 

There's a lot more to her talk than I have presented, and I encourage you to take the time to view it.  We can no longer pretend Google is just the "backrub" crawler and google.stanford.edu interface; it is a lens that both shows and shapes who we are.  That's an awesome responsibility and has to be treated as such.


Day 2 -- March 23, 2018


The second day began with the panel "Web as Witness - Archiving & Human Rights" (recording), with Pamela Graham (moderator), Anna Banchik, Jeff Deutch, Natalia Krapiva, and Dalila Mujagic. Anna and Natalia presented the activities of the UC Berkeley Human Rights Investigations Lab, where they do open-source investigations (discovering, verifying, geo-locating, and more) of publicly available data of human rights violations.  Next was Jeff talking about the Syrian Archive, and the challenges they faced with YouTube algorithmically removing what it believed to be "extremist content".  He also had a nice demo about how they used image analysis to identify munitions videos uploaded by Syrians.  Dalila presented the work of WITNESS, an organization promoting the use of video to document human rights violations and how they can be used as evidence.  The final presentation was about airwars.org (a documentation project about civilian casualties in air strikes), but I missed a good part of this presentation as I focused on my upcoming panel.


My session, "Fidelity, Integrity, & Compromise", was Ada Lerner (moderator) (site), Ashley Blewer (slides, transcript), Michael L. Nelson (me) (slides), and Shawn Walker (slides).  I had the luxury of going last, but that meant that I was so focused on reviewing my own material that I could not closely follow their presentations.  My students and I have read Ada's paper and it is definitely worth reviewing.  They review a series of attacks (and fixes) that all center around "abandoned" live web resources (what we called "zombies") that can be (re-)registered and then included in historical pages.  That sounds like a far-fetched attack vector, except when you remember that modern pages include 100s of resources from many different sites via JavaScript, and there is a good chance that any page is likely to include a zombie whose live web domain is available for purchase.  Shawn's presentation dealt with research issues surrounding using social media, and Ashley's talk dealt with the role of using fixity information (e.g., "There's a lot of "oh I should be doing that" or "I do that" but without being integrated holistically into preservation systems in a way that brings value or a clear understanding as to the "why"").  As for my talk, I asserted that Brian Williams first performed "Gin and Juice" in 1992, a full year before Snoop Dogg, and I have a video of a page in the Internet Archive to "prove" it.  The actual URI in which it is indexed in the Internet Archive is obfuscated, but this video is 1) of an actual page in the IA, that 2) pulls live web content into the archive, despite the fixes that Ada provided, and 3) the page rewrites the URL in the address bar to pretend to be at a different URL and time (in this case, dj-jay-requests.surge.sh, and 19920531014618 (May 31, 1992)).
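As a rough sketch of the first step in hunting for such zombies, the snippet below lists the external domains a page embeds; the HTML and domain names are illustrative stand-ins, and checking which of those domains have lapsed and can be re-registered would be the next step:

```python
import re
from urllib.parse import urlparse

# Illustrative page markup; the domains are made up.
html = '''
<script src="https://cdn.example-widgets.com/lib.js"></script>
<img src="https://img.example-defunct.net/banner.png">
'''

def resource_domains(page):
    # Collect the hostnames of all absolute src URLs embedded in the page.
    urls = re.findall(r'src="(https?://[^"]+)"', page)
    return sorted({urlparse(u).netloc for u in urls})

print(resource_domains(html))
# ['cdn.example-widgets.com', 'img.example-defunct.net']
```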






The last panel before lunch was "Archives for Change", with Hannah Mandel (moderator), Lara Baladi, Natalie Cadranel, Lae’l Hughes-Watkins, and Mehdi Yahyanejad.  My notes for this session are sparse, so again I'll just highlight a handful of useful tweets.




After lunch, the next session (recording) was a conversation between Jarrett Drake and Stacie Williams on their experiences developing the People's Archive of Police Violence in Cleveland, which "collects, preserves, and shares the stories, memories, and accounts of police violence as experienced or observed by Cleveland citizens."  This was the only panel with the format of two people having a conversation (effectively interviewing each other) about their personal transformation and lessons learned.


The next session was "Stewardship & Usage", with Jefferson Bailey, Monique Lassere, Justin Littman, Allan Martell, and Anthony Sanchez.  Jefferson's excellent talk was entitled "Lets put our money where our ethics are", and was an eye-opening discussion about the state of funding (or lack thereof) for web archiving. The tweets below capture the essence of the presentation, but this is definitely one you should take the time to watch.  Allan's presentation addressed the issues of building "community archives" and being aware of tensions that exist between different marginalized groups. Justin's presentation was excellent, detailing both GWU's collection activities and the associated ethical challenges (including who and what to collect) and the gap between collecting via APIs and archiving web representations.  I believe Anthony and Monique jointly gave their presentation on how ethical web archiving requires proper representation from marginalized communities.



The next panel "The Right to be Forgotten", was in Session 7 (recording), and featured Joyce Gabiola (moderator), Dorothy Howard, and Katrina Windon.  The right to be forgotten is a significant issue facing search engines in the EU, but has yet to arrive as a legal issue in the US.  Again, my notes on this session are sparse, so I'm relying on tweets. 


The final regular panel was "The Ethics of Digital Folklore", and featured Dragan Espenschied (moderator) (notes), Frances Corry, Ruth Gebreyesus, Ian Milligan (slides), and Ari Spool.  At this point my laptop's battery died so I have absolutely no notes on this session. 


The final session was with Elizabeth Castle, Marcella Gilbert, Madonna Thunder Hawk, with an approximately 10 minute rough cut preview of "Warrior Women", a documentary about Madonna Thunder Hawk, her daughter Marcella Gilbert, Standing Rock, and the DAPL protests.


Day 3 -- March 24, 2018


Unfortunately, I had to leave on Saturday and was unable to attend any of the nine workshop sessions: "Ethical Collecting with Webrecorder", "Distributed Web of Care", "Open Source Forensics", "Ethically Designing Social Media from Scratch", "Monitoring Government Websites with EDGI", "Community-Based Participatory Research", "Data Sharing", "Webrecorder - Sneak Preview", "Artists’ Studio Archives", and unconference slots.   There are three additional recorded sessions corresponding to the workshops that I'll link here (session 8, session 9, session 10) because they'll eventually scroll off the main page.

This was a great event, and the enthusiasm with which it was greeted is an indication of the importance of the topic.  There were so many great presentations that I'm left with the unenviable task of writing a trip report that's simultaneously too long and does not do justice to any of the presentations.  I'd like to thank the other members of my panel (Ada, Shawn, and Ashley), all who live tweeted the event, the organizers at Rhizome (esp. Michael Connor), Documenting the Now (esp. Bergis Jules), the New Museum, and the funders: IMLS and the Knight Foundation.   I hope they will find a way to do this again soon.

--Michael

See also: Ashley Blewer wrote a short summary of EAW, with a focus on the keynotes and three different presentations.  Please let me know if there are other summaries / trip reports to add.

Also, please feel free to contact me with additions / corrections for the information and links above.  





Wednesday, March 21, 2018

2018-03-21: Cookies Are Why Your Archived Twitter Page Is Not in English


Fig. 1 - Barack Obama's Twitter page in Urdu

The ODU WSDL lab has sporadically encountered archived Twitter pages for which the default HTML language setting was expected to be English, but when the archived page is retrieved its template appears in a foreign language. For example, the tweet content of former US President Barack Obama’s archived Twitter page, shown in the image above, is in English, but the page template is in Urdu. You may notice that some of the information, such as "followers", "following", and "log in", is displayed not in English but in Urdu. A similar observation was expressed by Justin Littman in "The vulnerability in the US digital registry, Twitter, and the Internet Archive". According to Justin's post, the Internet Archive is aware of the bug and is in the process of fixing it.  This problem may appear benign to the casual observer, but it has deep implications when looked at from a digital archivist's perspective.

The problem became more evident when Miranda Smith (a member of the WSDL lab) was finalizing the implementation of a Twitter follower-history-count tool.  The application uses mementos extracted from the Internet Archive (IA) to find the number of followers that a particular Twitter account had acquired over time. The tool expects the web page retrieved from the IA to be rendered in English in order to scrape out the number of followers a Twitter account had at a particular time. Since it was now evident that Twitter pages were not archived in English only, we had to decide whether to account for all possible language settings or to discard non-English mementos. We asked ourselves: why are some Twitter pages archived in non-English languages when we generally expected them to be in English? Note that we are referring to the interface/template language, not the language of the tweet content.
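To illustrate why the template language matters for scraping, here is a minimal sketch of the kind of extraction such a tool performs. The HTML snippet and attribute layout are simplified stand-ins for archived Twitter markup, which varies by capture date:

```python
import re

# Simplified stand-in for archived Twitter markup; real pages differ by
# capture date and, crucially, by interface language.
snippet = '<a data-nav="followers" title="104,877,501 Followers">Followers</a>'

def follower_count(html):
    # The English word "Followers" anchors the match, so an Urdu or Kannada
    # template yields no match at all -- the failure mode discussed here.
    m = re.search(r'title="([\d,]+) Followers"', html)
    return int(m.group(1).replace(",", "")) if m else None

print(follower_count(snippet))  # 104877501
```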

We later found that this issue is more prevalent than we initially thought. We selected former US President Barack Obama as our personality to explore how many languages and how often his Twitter page was archived. We downloaded the TimeMap of his page using MemGator and then downloaded all the mementos in it for analysis. We found that his Twitter page was archived in 47 different languages (all the languages that Twitter currently supports, a subset of which is supported in their widgets) across five different web archives: Internet Archive (IA), Archive-It (AIT), Library of Congress (LoC), UK Web Archive (UKWA), and Portuguese Web Archive (PT). Our dataset shows that overall only 53% of his pages (out of over 9,000 properly archived mementos) were archived in English. Of the remaining 47% of mementos, 22% were archived in Kannada and 25% in the 45 other languages combined. We excluded from our dataset mementos that were not "200 OK" or did not have language information.
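A TimeMap is returned in the link format defined by the Memento protocol. The sketch below shows how the memento URIs can be pulled out of one, using a hand-made two-entry TimeMap rather than real data:

```python
# A hand-made two-entry TimeMap in the Memento link format (not real data).
timemap = (
    '<https://twitter.com/BarackObama>; rel="original",\n'
    '<http://web.archive.org/web/20150101000000/https://twitter.com/BarackObama>;'
    ' rel="memento"; datetime="Thu, 01 Jan 2015 00:00:00 GMT",\n'
    '<http://web.archive.org/web/20160101000000/https://twitter.com/BarackObama>;'
    ' rel="memento"; datetime="Fri, 01 Jan 2016 00:00:00 GMT"\n'
)

def mementos(tm):
    # Keep only the entries marked rel="memento" and extract the URI-M
    # between the angle brackets.
    out = []
    for line in tm.splitlines():
        if 'rel="memento"' in line:
            out.append(line[line.find("<") + 1 : line.find(">")])
    return out

print(len(mementos(timemap)))  # 2
```

Each extracted URI-M can then be dereferenced to download the memento itself for language analysis.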

Fig. 2 shows that in the UKWA, English accounts for only 5% of the languages in which Barack Obama's Twitter page was archived. Conversely, in the IA, about half of Barack Obama's archived Twitter pages are in English, as many as all the remaining languages combined. It is worth noting that AIT is a subset of the IA. On the one hand, it is good to have more language diversity in archives (for example, the archival record is more complete for English language web pages than for other languages). On the other hand, it is very disconcerting when a page is captured in a language not anticipated. We also noted that Twitter pages in the Kannada language are archived more often than in all other non-English languages combined, although Kannada ranks 32nd globally by the number of native speakers, who are 0.58% of the global population. We tried to find out why some Twitter pages belonging to accounts that generally tweet in English were archived in non-English languages, and why Kannada is so prevalent among the non-English languages. Our findings follow.

Fig. 2 Barack Obama Twitter Page Language Distribution in Web Archives

We started investigating the reason why web archives sometimes capture pages in non-English languages, and we came up with the following potential reasons:
  • Some JavaScript in the archived page is changing the template text in another language at the replay time
  • A cached page on a shared proxy is serving content in other languages
  • "Save Page Now"-like features are utilizing users' browsers' language preferences to capture pages
  • Geo-location-based language setting
  • Crawler jobs are intentionally or unintentionally configured to send a different "Accept-Language" header
The actual reason turned out to have nothing to do with any of these; instead, it was related to cookies. But describing our thought process and how we arrived at the root of the issue offers some important lessons worth sharing.

Evil JavaScript


Since JavaScript is known to cause issues in web archiving (a previous blog post by John Berlin expands on this problem), both at capture and replay time, we first thought this had to do with some client-side localization where a wrong translation file leaks in at replay time. However, when we looked at the page source in a browser as well as in the terminal using curl (as illustrated below), it was clear that the translated markup is generated on the server side. Hence, this possibility was ruled out.

$ curl --silent https://twitter.com/?lang=ar | grep "<meta name=\"description\""
  <meta name="description" content="من الأخبار العاجله حتى الترفيه إلى الرياضة والسياسة، احصل على القصه كامله مع التعليق المباشر.">
(The Arabic description translates to: "From breaking news to entertainment, sports, and politics, get the full story with live commentary.")

Caching


We thought Twitter might be doing content negotiation using the "Accept-Language" request header, so we changed the language preference in our web browser and opened Twitter in an incognito window, which confirmed our hypothesis: Twitter did indeed consider the language preference sent by the browser and responded with a page in that language. However, when we investigated the HTTP response headers, we found that twitter.com does not return the "Vary" header when it should. This behavior can be dangerous because the content negotiation happens on the "Accept-Language" header, but it is not advertised as a factor of content negotiation. This means a proxy can cache a response to a URI in some language and serve it back to someone else who requests the same URI, even with a different language in the "Accept-Language" setting. We considered this a potential way an undesired response could get archived.
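The hazard of the missing "Vary" header can be shown with a toy cache; this is an illustration of the failure mode, not a model of any particular proxy:

```python
# The origin negotiates language server-side (as twitter.com does), but the
# cache in front of it keys responses on the URL alone because no Vary
# header told it to also consider Accept-Language.
cache = {}

def origin(url, accept_language):
    # Stand-in for the server's language negotiation.
    return f"page in {accept_language}"

def proxy_get(url, accept_language):
    if url not in cache:
        cache[url] = origin(url, accept_language)
    return cache[url]

print(proxy_get("https://twitter.com/", "kn"))  # page in kn
print(proxy_get("https://twitter.com/", "en"))  # page in kn -- wrong language
```

Had the origin sent `Vary: Accept-Language`, a correct cache would have keyed on both the URL and that header and served each visitor the right language.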

On further investigation we found that Twitter tries very hard (sometimes in wrong ways) to make sure their pages are not cached. This can be seen in their response headers illustrated below. The Cache-Control and obsolete Pragma headers explicitly ask proxies and clients not to cache the response itself or anything about the response by setting values to "no-cache" and "no-store". The Date (the date/time at which the response was originated) and Last-Modified headers are set to the same value to ensure that the cache (if stored) becomes invalid immediately. Additionally, the Expires header (the date/time after which the response is considered stale) is set to March 31, 1981, a date far in the past, long before Twitter even existed, to further enforce cache invalidation.


$ curl -I https://twitter.com/
HTTP/1.1 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
pragma: no-cache
date: Sun, 18 Mar 2018 17:43:25 GMT
last-modified: Sun, 18 Mar 2018 17:43:25 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
...

Hence, the possibility of a cache returning pages in different languages due to the missing "Vary" header was not sufficient to explain the number of mementos in non-English languages.

Geo-location


We thought about the possibility that Twitter identifies a potential language for guest visitors based on their IP address (to guess the geo-location). However, the languages seen in mementos do not align with the places where archival crawlers are located. For example, the Kannada language that is dominating in the UK Web Archive is spoken in the State of Karnataka in India, and it is unlikely that the UK Web Archive is crawling from machines located in Karnataka.


On-demand Archiving


The Internet Archive recently introduced the "Save Page Now" feature, which acts as a proxy and forwards the user's request headers to the upstream web server rather than sending its own. This behavior can be observed in a memento that we requested for an HTTP echo service, HTTPBin, from our browser. The service echoes back in the response the data it receives from the client in the request, so by archiving it we expect to see the headers that the service saw from the requesting client. The headers shown below are those of our browser, not of IA's crawler, in particular the "Accept-Language" header (which we customized in our browser) and the "User-Agent" header, which confirms our hypothesis that IA's Save Page Now feature acts as a proxy.

$ curl http://web.archive.org/web/20180227154007/http://httpbin.org/anything
{
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "es",
    "Connection": "close",
    "Host": "httpbin.org",
    "Referer": "https://web.archive.org/",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36"
  },
  "json": null,
  "method": "GET",
  "origin": "207.241.225.235",
  "url": "https://httpbin.org/anything"
}

This behavior made us consider that people from different regions of the world, with different language settings in their browsers, would end up preserving Twitter pages in the language of their preference when using the "Save Page Now" feature (since Twitter does honor the "Accept-Language" header in some cases). However, we were unable to replicate this in our browser. Also, not every archive supports on-demand archiving, and such archives could never replay users' request headers.

We also repeated this experiment in Archive.is, another on-demand web archive. Unlike IA, it does not replay users' headers like a proxy; instead it sends its own custom request headers. Archive.is does not show the original markup, instead modifying the page heavily before serving it, so a curl output would not be very useful. However, the content of our archived HTTP echo service page looks like this:


{
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip",
    "Accept-Language": "tt,en;q=0.5",
    "Connection": "close",
    "Host": "httpbin.org",
    "Referer": "https://www.google.co.uk/",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2704.79 Safari/537.36"
  },
  "json": null,
  "method": "GET",
  "origin": "128.82.7.11, 128.82.7.11",
  "url": "https://httpbin.org/anything"
}

Note that it has its own custom "Accept-Language" and "User-Agent" headers (different from those of the browser from which we requested the capture), as well as a custom "Referer" header. However, unlike IA, it relayed our IP address as the origin. We then captured https://twitter.com/?lang=ar (http://archive.is/cioM5) followed by https://twitter.com/phonedude_mln/ (http://archive.is/IbHgB) to see if the language session sticks across two successive Twitter requests, but that was not the case: the second page was archived in English, not in Arabic. This does not necessarily prove that their crawler does not have this issue, though. It is possible that two different instances of their crawler handled the two requests, or that other Twitter links (with "?lang=en") were archived between our two requests by someone else. We do not have sufficient information to be certain.

Misconfigured Crawler


Some of the early mementos in which we observed this behavior were from Archive-It. So we thought that some collection maintainers might have misconfigured their crawl jobs to send a non-default "Accept-Language" header, resulting in such mementos. Since we did not have access to their crawling configuration, there was very little we could do to test this hypothesis. Many of the leading web archives, including Archive-It, use Heritrix as their crawler, and we happen to have some WARC files from Archive-It, so we started looking into those. We examined the request records of those WARC files for any Twitter links to see what "Accept-Language" header was sent. We were quite surprised to find that Heritrix never sent any "Accept-Language" header to any server, so this could not be the reason at all. However, while looking into those WARC files, we saw "Cookie" headers being sent to Twitter and many other servers in the request records. This led us to uncover the actual cause of the issue.
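The kind of inspection we performed can be sketched as a simplified scan over the textual WARC request records for the headers of interest. This is a rough stand-in (a real tool would parse WARC records properly, e.g., with a library such as warcio, and handle compressed files):

```python
# Simplified scan of WARC request records for the "Accept-Language" and
# "Cookie" headers, to see what a crawler actually sent to each server.
def scan_requests(warc_text):
    findings = []
    for record in warc_text.split("WARC/1.0"):
        if "WARC-Type: request" not in record:
            continue
        uri = cookie = accept_lang = None
        for line in record.splitlines():
            if line.startswith("WARC-Target-URI:"):
                uri = line.split(":", 1)[1].strip()
            elif line.startswith("Cookie:"):
                cookie = line.split(":", 1)[1].strip()
            elif line.startswith("Accept-Language:"):
                accept_lang = line.split(":", 1)[1].strip()
        findings.append({"uri": uri, "cookie": cookie, "accept_language": accept_lang})
    return findings

# A minimal fabricated record in the shape of the WARC excerpts shown below.
sample = """WARC/1.0
WARC-Type: request
WARC-Target-URI: https://twitter.com/?lang=ar

GET /?lang=ar HTTP/1.0
Host: twitter.com
Cookie: lang=ar
"""
print(scan_requests(sample))
```

On the Archive-It WARC files, a scan like this shows "Cookie" headers present while "Accept-Language" is absent in every request record.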

Cookies, the Real Culprit


So far we had been considering Heritrix to be a stateless crawler, but when we looked into the Archive-It WARC files, we observed cookies being sent to servers. This means Heritrix does have cookie management built in (which is often necessary to meaningfully capture some sites). With this discovery, we started investigating Twitter's behavior from a different perspective. The page source of Twitter has a list of alternate links for each language it provides localization for (currently, 47 languages). These links can get added to the frontier queue of the crawler. Although these links have different URIs (i.e., with a query parameter "?lang=<lang-code>"), once any of them is loaded, the session is set to that language until the language is explicitly changed or the session expires or is cleared. In the past, Twitter had options in the interface to manually select a language, which would then be set for the session. It is understandable that general-purpose web sites cannot rely completely on "Accept-Language" for localization-related content negotiation, as browsers have made it difficult to customize language preferences, especially on a per-site basis.

We experimented with Twitter's language related behavior in our web browser by navigating to https://twitter.com/?lang=ar, which yields the page in the Arabic language. Then navigating to any Twitter page such as https://twitter.com/ or https://twitter.com/ibnesayeed (without the explicit "lang" query parameter) continues to serve Arabic pages (if a Twitter account is not logged in). Here is how Twitter's server behaves for language negotiation:

  • If a "lang" query parameter (with a supported language) is present in any Twitter link, that page is served in the corresponding language.
  • If the user is a guest, the value of the "lang" parameter is set for the session (this happens each time an explicit language parameter is passed) and remains sticky until changed or cleared.
  • If the user is logged in (using Twitter's credentials), the default language preference is taken from their profile preferences, so the page will only show in a different language if an explicit "lang" parameter is present in the URI. However, it is worth noting that crawlers generally behave like guests.
  • If the user is a guest and no "lang" parameter is passed, Twitter falls back to the language supplied in the "Accept-Language" header.
  • If the user is a guest, no "lang" parameter is passed, and no "Accept-Language" header is provided, then responses are in English (though, this could be affected by Geo-IP, which we did not test).
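The negotiation rules above can be condensed into a small function. This is a sketch inferred from our observations, not Twitter's actual implementation:

```python
# Sketch of Twitter's observed language negotiation, as inferred from
# our experiments (illustrative only, not Twitter's actual code).
def negotiate_language(uri_lang=None, session_lang=None,
                       accept_language=None, profile_lang=None):
    if profile_lang:                     # logged-in user: profile preference wins,
        return uri_lang or profile_lang  # unless an explicit ?lang= is present
    if uri_lang:                         # explicit ?lang= query parameter
        return uri_lang
    if session_lang:                     # sticky "lang" cookie from a prior request
        return session_lang
    if accept_language:                  # content negotiation fallback
        return accept_language
    return "en"                          # default (Geo-IP effects not modeled)

# A guest crawler visiting /?lang=ar, then a plain profile page:
first = negotiate_language(uri_lang="ar")        # "ar"
second = negotiate_language(session_lang=first)  # "ar" (the cookie sticks)
```

Note how the sticky session (the "lang" cookie) outranks "Accept-Language", which is exactly why a single `?lang=<code>` URI can recolor every subsequent capture.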

In the example below we illustrate some of this behavior using curl. First, we fetch Twitter's home page in Arabic using the explicit "lang" query parameter and show that the response is indeed in Arabic, as it contains the lang="ar" attribute in the <html> element. We also save any cookies the server sets to the file "/tmp/twitter.cookie", and show that the cookie file does indeed have a "lang" cookie with the value "ar" (there are some other cookies in it, but those are not relevant here). Next, we fetch Twitter's home page without any explicit "lang" query parameter and receive a response in the default English language. Then we fetch the home page with the "Accept-Language: ur" header and get the response in Urdu. Finally, we fetch the home page again, but this time supply the saved cookies (which include the "lang=ar" cookie) and receive the response in Arabic again.

$ curl --silent -c /tmp/twitter.cookie https://twitter.com/?lang=ar | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">

$ grep lang /tmp/twitter.cookie
twitter.com FALSE / FALSE 0 lang ar

$ curl --silent https://twitter.com/ | grep "<html"
<html lang="en" data-scribe-reduced-action-queue="true">

$ curl --silent -H "Accept-Language: ur" https://twitter.com/ | grep "<html"
<html lang="ur" data-scribe-reduced-action-queue="true">

$ curl --silent -b /tmp/twitter.cookie https://twitter.com/ | grep "<html"
<html lang="ar" data-scribe-reduced-action-queue="true">


Twitter Cookies and Heritrix


Now that we understood the reason, we wanted to replicate what happens in a real archival crawler. We used Heritrix to simulate the effect that Twitter cookies have when a Twitter page gets archived in the IA. We seeded the following URLs and placed them in the same sequence inside Heritrix's configuration file:

  • https://twitter.com/?lang=ar
  • https://twitter.com/phonedude_mln/

The order of these links was carefully chosen to see whether the first link sets the language session to Arabic and causes the second one to be captured in Arabic as well. We had already shown that the first URI, which includes the language identifier for Arabic (lang=ar), places the language identifier inside the cookie. The question now becomes: what effect will this cookie have on subsequent requests for other Twitter pages? Will the language stay the same as the one already set in the cookie, or revert to a default preference? The naive expectation for our seeded URIs is that the first Twitter page will be archived in Arabic and the second in English, since a request without an explicit language parameter normally defaults to English. However, since we had observed that Twitter's cookies carry the language identifier when this parameter is passed in the URI, it is plausible that the language would be maintained if subsequent requests replay the same cookie.

After running the crawling job in Heritrix for the seeded URIs, we inspected the WARC file generated by Heritrix. The results were as we expected. Heritrix was indeed saving and replaying "Cookie" headers, resulting in the second page being captured in Arabic. Relevant portions of the resulting WARC file are shown below:

WARC/1.0
WARC-Type: request
WARC-Target-URI: https://twitter.com/?lang=ar
WARC-Date: 2018-03-16T21:58:44Z
WARC-Concurrent-To: <urn:uuid:7dbc3a67-5cf8-4375-8343-c0f6b03039f4>
WARC-Record-ID: <urn:uuid:473273f6-48fa-4dd3-a5f0-81caf9786e07>
Content-Type: application/http; msgtype=request
Content-Length: 301

GET /?lang=ar HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://cs.odu.edu/)
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: twitter.com
Cookie: guest_id=v1%3A152123752160566016; personalization_id=v1_uAUfoUV9+DkWI8mETqfuFg==

The portion of the WARC shown above is a request record for the URI https://twitter.com/?lang=ar. The highlighted lines illustrate the GET request made to the host "twitter.com" with the path and query parameter "/?lang=ar". This request yielded a response from Twitter containing a "set-cookie" header with the language identifier from the URI ("lang=ar"), as shown in the portion of the WARC below. The HTML was rendered in Arabic (notice the highlighted <html> element with the lang attribute in the response payload below).

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://twitter.com/?lang=ar
WARC-Date: 2018-03-16T21:58:44Z
WARC-Payload-Digest: sha1:FCOPDBN2U5LXU7FEUUGQ4WXYGR7OP5JI
WARC-IP-Address: 104.244.42.129
WARC-Record-ID: <urn:uuid:7dbc3a67-5cf8-4375-8343-c0f6b03039f4>
Content-Type: application/http; msgtype=response
Content-Length: 151985

HTTP/1.0 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 150665
content-type: text/html;charset=utf-8
date: Fri, 16 Mar 2018 21:58:44 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Fri, 16 Mar 2018 21:58:44 GMT
pragma: no-cache
server: tsa_b
set-cookie: fm=0; Expires=Fri, 16 Mar 2018 21:58:34 UTC; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: lang=ar; Path=/
set-cookie: ct0=10558ec97ee83fe0f2bc6de552ed4b0e; Expires=Sat, 17 Mar 2018 03:58:44 UTC; Path=/; Domain=.twitter.com; Secure
status: 200 OK
strict-transport-security: max-age=631138519
x-connection-hash: 2a2fc89f51b930202ab24be79b305312
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-response-time: 100
x-transaction: 001495f800dc517f
x-twitter-response-tags: BouncerCompliant
x-ua-compatible: IE=edge,chrome=1
x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report

<!DOCTYPE html>
<html lang="ar" data-scribe-reduced-action-queue="true">
...

The subsequent request from the seeded Heritrix configuration file (https://twitter.com/phonedude_mln/) generated an additional request record, shown in the WARC file portion below. The highlighted lines illustrate the GET request made to the host "twitter.com" with the path "/phonedude_mln/". Notice that a "Cookie" header with the value lang=ar, set initially by the first seeded URI, was included in this request.

WARC/1.0
WARC-Type: request
WARC-Target-URI: https://twitter.com/phonedude_mln/
WARC-Date: 2018-03-16T21:58:48Z
WARC-Concurrent-To: <urn:uuid:634dea88-6994-4bd4-af05-5663d24c3727>
WARC-Record-ID: <urn:uuid:eef134ed-f3dc-459b-95e7-624b4d747bc1>
Content-Type: application/http; msgtype=request
Content-Length: 655

GET /phonedude_mln/ HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://cs.odu.edu/)
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: twitter.com
Cookie: lang=ar; _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; ct0=10558ec97ee83fe0f2bc6de552ed4b0e; guest_id=v1%3A152123752160566016; personalization_id=v1_uAUfoUV9+DkWI8mETqfuFg==

The portion of the WARC file shown below demonstrates the effect of Heritrix saving and replaying the "Cookie" headers. The highlighted <html> element shows that the language was set to Arabic for the second seeded URI (https://twitter.com/phonedude_mln/), although the URI did not include the language identifier.

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://twitter.com/phonedude_mln/
WARC-Date: 2018-03-16T21:58:48Z
WARC-Payload-Digest: sha1:5LI3DGWO6NGK4LWSIHFZZHW43H2Z2IWA
WARC-IP-Address: 104.244.42.129
WARC-Record-ID: <urn:uuid:634dea88-6994-4bd4-af05-5663d24c3727>
Content-Type: application/http; msgtype=response
Content-Length: 518086

HTTP/1.0 200 OK
cache-control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
content-length: 516921
content-type: text/html;charset=utf-8
date: Fri, 16 Mar 2018 21:58:48 GMT
expires: Tue, 31 Mar 1981 05:00:00 GMT
last-modified: Fri, 16 Mar 2018 21:58:48 GMT
pragma: no-cache
server: tsa_b
set-cookie: fm=0; Expires=Fri, 16 Mar 2018 21:58:38 UTC; Path=/; Domain=.twitter.com; Secure; HTTPOnly
set-cookie: _twitter_sess=BAh7CSIKZmxhc2hJQzonQWN0aW9uQ29udHJvbGxlcjo6Rmxhc2g6OkZsYXNo%250ASGFzaHsABjoKQHVzZWR7ADoPY3JlYXRlZF9hdGwrCGKB0jBiAToMY3NyZl9p%250AZCIlZmQ1MTY4ZjQ3NmExZWQ1NjUyNDRmMzhhZGNiMmFhZjQ6B2lkIiU0OTQ0%250AZDMxMDY4NjJhYjM4NjBkMzI4MDE0NjYyOGM5ZA%253D%253D--f571656f1526d7ff1b363d527822ebd4495a1fa3; Path=/; Domain=.twitter.com; Secure; HTTPOnly
status: 200 OK
strict-transport-security: max-age=631138519
x-connection-hash: ef102c969c74f3abf92966e5ffddb6ba
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-response-time: 335
x-transaction: 0014986c00687fa3
x-twitter-response-tags: BouncerCompliant
x-ua-compatible: IE=edge,chrome=1
x-xss-protection: 1; mode=block; report=https://twitter.com/i/xss_report

<!DOCTYPE html>
<html lang="ar" data-scribe-reduced-action-queue="true">
...

We used PyWb to replay pages from the captured WARC file. Fig. 3 is the page rendered after retrieving the first seeded URI of our collection (https://twitter.com/?lang=ar). For those not familiar with Arabic, this is indeed Twitter's home page in Arabic.

Fig.3  https://twitter.com/?lang=ar

Fig. 4 is the representation given by PyWb after requesting the second seeded URI (https://twitter.com/phonedude_mln). The page was rendered using Arabic as the default language, although we did not include this setting in the URI, nor did our browser language settings include Arabic.

Fig.4  https://twitter.com/phonedude_mln/ in Arabic

Why is Kannada More Prominent?


As we noted before, Twitter's page source now includes a list of alternate links for 47 supported languages. These links look something like this:

<link rel="alternate" hreflang="fr" href="https://twitter.com/?lang=fr">
<link rel="alternate" hreflang="en" href="https://twitter.com/?lang=en">
<link rel="alternate" hreflang="ar" href="https://twitter.com/?lang=ar">
...
<link rel="alternate" hreflang="kn" href="https://twitter.com/?lang=kn">

The fact that Kannada ("kn") is the last language in the list is why it is so prevalent in web archives: while each language-specific link overwrites the session set by its predecessor, the last one affects many more Twitter links in the frontier queue. Twitter started supporting Kannada, along with three other Indian languages, in July 2015 and placed it at the very end of the language-related alternate links. Since then, it has been captured in various archives more often than any other non-English language. Before these new languages were added, Bengali was the last link in the alternate language list for about a year, and our dataset shows dense archival activity for Bengali between July 2014 and July 2015, after which Kannada took over. This supports our hypothesis that the placement of the last language-related link makes the session stick with that language, affecting all upcoming links from the same domain in the crawler's frontier queue until another language-specific link overwrites the session.
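The dynamic can be demonstrated with a toy simulation of a stateful crawler: the alternate-language links are typically queued together in page order, each one overwriting the session cookie, so only the last one persists and colors every ordinary link crawled afterwards. (Illustrative only; real crawl scheduling is far more complex.)

```python
# Toy stateful crawler: alternate-language links in page order, "kn" last,
# followed by ordinary Twitter links from the frontier queue.
alternate_langs = ["fr", "en", "ar", "kn"]  # "kn" is last in Twitter's markup

session_lang = None   # models the sticky "lang" session cookie
captured = {}         # URI -> language in which it was archived

# Each ?lang= link overwrites the session; only the last one persists.
for lang in alternate_langs:
    session_lang = lang
    captured[f"https://twitter.com/?lang={lang}"] = lang

# Ordinary links crawled afterwards inherit the sticky session.
for uri in ["https://twitter.com/barackobama",
            "https://twitter.com/phonedude_mln/"]:
    captured[uri] = session_lang

print(captured["https://twitter.com/barackobama"])  # -> kn
```

Swapping Bengali into the last slot (as in Twitter's pre-2015 markup) would make every subsequent capture Bengali instead, matching the shift we see in the dataset.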

What Should We Do About It?


Disabling cookies does not seem to be a good option for crawlers, as some sites try hard to set a cookie by repeatedly returning redirect responses until their desired "Cookie" header is included in the request. However, explicitly reducing the cookie expiration duration in crawlers could mitigate the long-lasting impact of such sticky cookies: garbage collecting any cookie that was set more than a few seconds ago would ensure that no cookie is reused for more than a few successive requests. Sandboxing crawl jobs in many isolated sessions is another potential solution to minimize the impact. Alternatively, filtering policies can be set so that URLs that set session cookies are downloaded in separate short-lived sessions, isolating them from the rest of the crawl frontier queue.
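The expiration-based mitigation can be sketched as a cookie jar that discards anything older than a short time-to-live, so a sticky "lang" cookie cannot affect more than a few successive requests. (The `ShortLivedCookieJar` class is a hypothetical helper, not part of Heritrix.)

```python
import time

# Cookie jar that garbage-collects cookies older than a short TTL,
# limiting how long a sticky session cookie can influence a crawl.
class ShortLivedCookieJar:
    def __init__(self, ttl_seconds=5):
        self.ttl = ttl_seconds
        self.cookies = {}  # name -> (value, time the cookie was set)

    def set(self, name, value, now=None):
        self.cookies[name] = (value, now if now is not None else time.time())

    def get_all(self, now=None):
        now = now if now is not None else time.time()
        # Drop any cookie set more than ttl seconds ago.
        self.cookies = {n: (v, t) for n, (v, t) in self.cookies.items()
                        if now - t <= self.ttl}
        return {n: v for n, (v, t) in self.cookies.items()}

jar = ShortLivedCookieJar(ttl_seconds=5)
jar.set("lang", "ar", now=0)
print(jar.get_all(now=3))   # {'lang': 'ar'}  (still fresh, replayed)
print(jar.get_all(now=60))  # {}              (expired, no longer replayed)
```

With such a jar, the "lang=ar" cookie from an early `?lang=ar` seed would expire long before most of the frontier queue is processed, and later Twitter pages would be captured in the default language.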

Conclusions


The problem of Twitter pages unintentionally being archived in non-English languages is quite significant. We found that 47% of the mementos of Barack Obama's Twitter page were in non-English languages, almost half of which were in Kannada alone. While language diversity in web archives is generally a good thing, in this case it is disconcerting and counter-intuitive. We found that the root cause is Twitter's sticky language sessions, maintained using cookies, which the Heritrix crawler honors.

Because Kannada is the last entry in the list of language-specific alternate links on Twitter's pages, its cookie overwrites the language cookies set by the URLs listed above it, causing more Twitter pages in the frontier queue to be archived in Kannada than in any other non-English language. Crawlers are generally considered stateless, but honoring cookies makes them somewhat stateful. This behavior may not be specific to Twitter; other sites that utilize cookies for content negotiation might exhibit similar consequences in web archives. The issue can potentially be mitigated by explicitly reducing the cookie expiration duration in crawlers or by distributing the crawl of URLs from the same domain across many small sandboxed instances.

--
Sawood Alam
and
Plinio Vargas