Why even archive?
Because Corporations are Crappy Archivists and there are A Million Ways to Die on the Web (both by ArchiveTeam). ArchiveTeam lists examples of the different reasons major websites disappear from the Web.
A guide on how to preserve your local music scene with the Internet Archive
Summary of the 4 main points of Archive and Survive, or How To Make Your New Minimal Post-Hardcore Inspired Hyperpop Adjacent Chicago Juke Meets Early Gabber Project Last For A Thousand Years (In Four Easy Steps!):
- Save everything (don’t trust a mega-corp to save it for you)
- Reperform
- Collaborate with other forms (using a piece within a new medium makes it more resilient)
- Make your rich friends buy original instruments
“We only have the Stradivari and Guarneri string instruments still around and played today because of fuccboi nobles who bought violins they weren’t very good at playing.”
The scale of link rot – Pew Research found that 38% of webpages that existed in 2013 were no longer accessible a decade later.
Notable big sites that disappeared before our eyes: MySpace, GeoCities, MTV News and its archive, and the Comedy Central site.
Do you have to download all the things?
No.
Anna’s Archive’s “How to become a pirate archivist” taught me that there is value in downloading just the metadata. It is better to know that Thing X existed at time Y than to have zero information. Some information is better than the “big perfect archiving project that you never undertook.”
Archiving Proxies
WARC proxies – https://github.com/internetarchive/warcprox (note: won’t work on Windows because it relies on the fcntl module) and ArchiveBox Proxy, for archiving full pages while you scrape data. Telerik Fiddler might be another workable option for saving session traffic.
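For reference, if you’re not on Windows, a minimal warcprox invocation looks something like this – the port and output directory here are my guesses, so check warcprox --help:

warcprox -p 8000 -d ./warcs

Point your scraper or browser at localhost:8000 as an HTTP proxy, and the traffic gets written out as WARC files in ./warcs.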
What I ended up using for a project on Windows was mitmproxy, via the mitmdump command-line executable that comes with it. You can use the --set hardump="./myfile.har" argument to save your captured session as a standard HAR file. Here’s an example that captures everything you send through localhost:8080 except assets (CSS/JS/images/fonts, matched by the ~a filter) and requests to a few ad-server domains:
mitmdump --set hardump="./myhardump.har" !~a !~d "doubleclick|forter|google|clarity|bing|criteo|tiktok|facebook|quantummetric|stickyadstv|agkn.com|taboola|bluekai|adsrvr|casalemedia|adnxs|pubmatic|outbrain"
And here is a reference for mitmdump’s command-line options.
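Once the capture is done, a HAR file is just JSON, so it’s easy to inspect afterwards. A minimal Python sketch for listing what got saved (the filename matches the command above):

# har_list.py – sketch: list the requests captured in a mitmdump HAR dump.
import json

with open("myhardump.har", encoding="utf-8") as f:
    har = json.load(f)

# HAR structure: all captured request/response pairs live under log.entries.
for entry in har["log"]["entries"]:
    print(
        entry["response"]["status"],
        entry["request"]["method"],
        entry["request"]["url"],
    )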
By default, mitmproxy validates inbound headers, and a request that violates the HTTP/2 header rules can sometimes crash your session – you can disable this validation with the --set validate_inbound_headers=false command-line parameter.
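For example, combined with the HAR capture from above:

mitmdump --set hardump="./myhardump.har" --set validate_inbound_headers=false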
You can also script mitmproxy and mitmdump to have them behave as mock-servers, and to match-and-replace content on the fly. I’m especially grateful for KevCui’s mitmproxy scripts because I only know enough Python to go hissss I’ma snaek!!
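To give a taste, here’s a minimal addon sketch in that spirit – the hostname and strings are made-up placeholders, not anything from KevCui’s scripts. It answers one fake API host with a canned response (mock server) and rewrites a string in any HTML flowing through (match-and-replace):

# mock_and_replace.py – minimal mitmproxy addon sketch (placeholder values).
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    # Mock server: short-circuit requests to a fake host with a canned
    # response, so the real server is never contacted.
    if flow.request.pretty_host == "api.example.com":
        flow.response = http.Response.make(
            200,
            b'{"mocked": true}',
            {"Content-Type": "application/json"},
        )

def response(flow: http.HTTPFlow) -> None:
    # Match-and-replace: rewrite content on the fly in HTML responses.
    if "text/html" in flow.response.headers.get("content-type", ""):
        flow.response.text = flow.response.text.replace(
            "Example Domain", "Archived Example Domain"
        )

Run it with mitmdump -s mock_and_replace.py. For simple rewrites you may not even need a script – mitmproxy’s built-in --modify-body and --map-local options cover a lot of cases.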
Downloading webpages
https://en.wikipedia.org/wiki/Help:Using_the_Wayback_Machine – documents several URL parameter tricks for the Wayback Machine. You can give just a year instead of an exact date to get the closest matching capture. Appending “if_” to the timestamp returns the page without the Wayback toolbar but with rewritten links (handy for iframes); “id_” returns the literal HTML the Wayback Machine originally captured, without any added code or rewrites.
Ex:
https://web.archive.org/web/2011if_/http://msdn.microsoft.com/en-us/magazine/bb985653.aspx
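And the same capture as the raw, unrewritten HTML:
https://web.archive.org/web/2011id_/http://msdn.microsoft.com/en-us/magazine/bb985653.aspx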
You can also access Wayback Machine data using their CDX API. For example, this query lists up to 1000 HTML captures from the old Yugoslav .co.yu domain:
https://web.archive.org/cdx/search/cdx?url=*.co.yu&collapse=urlkey&filter=mimetype:text/html&limit=1000
Here is a sample project writeup, and here is the official CDX API documentation.
If you run into responses being blocked because of CORS (in Chrome, “ERR_BLOCKED_BY_ORB”), you can launch Chrome with its security checks relaxed.
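One workaround – I haven’t verified this is the exact flag set, so treat it as a sketch – is launching a throwaway Chrome profile with web security disabled (don’t use it for regular browsing):

chrome --disable-web-security --user-data-dir="%TEMP%\chrome-insecure"

If you’d rather hit the CDX API from a script than from a browser (sidestepping CORS entirely), here’s a minimal Python sketch; the target URL and parameters are placeholders mirroring the query above:

# cdx_query.py – sketch: list Wayback Machine captures via the CDX API.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "url": "example.com",            # placeholder – put your target here
    "output": "json",                # JSON instead of the default whitespace format
    "collapse": "urlkey",            # one row per unique URL
    "filter": "mimetype:text/html",
    "limit": 50,
})
with urllib.request.urlopen("https://web.archive.org/cdx/search/cdx?" + params) as resp:
    rows = json.load(resp)

# With output=json, the first row is the header; the rest are captures.
header, captures = rows[0], rows[1:]
for capture in captures:
    record = dict(zip(header, capture))
    print(record["timestamp"], record["statuscode"], record["original"])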