Why even archive?
Because Corporations are Crappy Archivists and there are A Million Ways to Die on the Web (by ArchiveTeam). ArchiveTeam lists examples of different reasons major websites disappear from the Web.
A guide on how to preserve your local music scene with the Internet Archive
Summary of the 4 main points of Archive and Survive, or How To Make Your New Minimal Post-Hardcore Inspired Hyperpop Adjacent Chicago Juke Meets Early Gabber Project Last For A Thousand Years (In Four Easy Steps!):
- Save everything (don’t trust a mega-corp to save it for you)
- Reperform
- Collaborate with other forms (using a piece within a new medium makes it more resilient)
- Make your rich friends buy original instruments
“We only have the Stradivari and Guarneri string instruments still around played today because of fuccboi nobles who bought violins they weren’t very good at playing”
Link rot scale – Pew Research says 38% of webpages that existed in 2013 are no longer accessible a decade later
Notable big sites that disappeared before our eyes: MySpace, Geocities, MTV News and its archive, and the Comedy Central site.
Do you have to download all the things?
No.
Anna’s Archive: How to become a pirate archivist taught me that there is value in downloading metadata. It is better to know that Thing X existed at time Y than to have zero information. Some information is better than the “big perfect archiving project that you never undertook.”
Archiving Proxies
WARC proxies – https://github.com/internetarchive/warcprox (note: won’t work on Windows because it relies on the fcntl module) and ArchiveBox Proxy for archiving full pages while you scrape data. Telerik Fiddler might be another workable option for saving session traffic.
What I ended up using for a project on Windows was mitmproxy with the mitmdump command line executable that comes with it. You can use the --set hardump="./myfile.har" argument to save your captured session as a standard HAR file. Here’s an example script that tracks anything you send through localhost:8080 except requests to a few ad-server domains:
mitmdump --set hardump="./myhardump.har" !~a !~d "doubleclick|forter|google|clarity|bing|criteo|tiktok|facebook|quantummetric|stickyadstv|agkn.com|taboola|bluekai|adsrvr|casalemedia|adnxs|pubmatic|outbrain"
And here is a reference to mitmdump command-line options.
By default, mitmproxy validates inbound headers. Sometimes a response that violates the HTTP/2 header rules can crash your session – you can disable this validation with the --set validate_inbound_headers=false command-line parameter.
You can also script mitmproxy and mitmdump to have them behave as mock-servers, and to match-and-replace content on the fly. I’m especially grateful for KevCui’s mitmproxy scripts because I only know enough Python to go hissss I’ma snaek!!
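For a taste of what that looks like, here is a minimal addon sketch (the file name, strings, and URL path below are placeholders) that rewrites HTML responses on the fly and answers one path itself, like a tiny mock server. Load it with mitmdump -s rewrite_and_mock.py:

# rewrite_and_mock.py
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    # Match-and-replace: swap a string in every HTML response that passes through the proxy
    if "text/html" in flow.response.headers.get("content-type", ""):
        flow.response.text = flow.response.text.replace("old text", "new text")

def request(flow: http.HTTPFlow) -> None:
    # Mock server: answer this path locally instead of forwarding the request upstream
    if flow.request.path == "/api/ping":
        flow.response = http.Response.make(
            200, b'{"ok": true}', {"Content-Type": "application/json"}
        )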
Downloading webpages
https://en.wikipedia.org/wiki/Help:Using_the_Wayback_Machine – several URL parameter settings in the Wayback Machine. You can just set the year to get the closest match, rather than an exact date. You can use “if_” to get the iframe version (URLs rewritten to point at the archive, but no Wayback toolbar), or “id_” to get the literal HTML that the Wayback Machine originally captured – without any added code/rewrites.
Ex:
https://web.archive.org/web/2011if_/http://msdn.microsoft.com/en-us/magazine/bb985653.aspx
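For comparison, the same page as the literal original HTML (no Wayback rewrites):
https://web.archive.org/web/2011id_/http://msdn.microsoft.com/en-us/magazine/bb985653.aspx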
You can also access Wayback Machine data using their CDX API. Here is a sample project writeup, and here is the official CDX API documentation.
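As a rough sketch (the domain and date range are just placeholders), a query like this lists captures as JSON, where the first row of the response is the column header:

import json, urllib.parse, urllib.request

# Ask the CDX server for successful captures of a site (placeholder domain and dates)
params = urllib.parse.urlencode({
    "url": "blog.mailchimp.com/*",
    "output": "json",
    "filter": "statuscode:200",
    "from": "2017",
    "to": "2018",
    "limit": "20",
})
with urllib.request.urlopen("https://web.archive.org/cdx/search/cdx?" + params) as resp:
    rows = json.load(resp)

# rows[0] is the header: urlkey, timestamp, original, mimetype, statuscode, digest, length
for row in rows[1:]:
    print(row[1], row[2])  # timestamp and original URL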
If you run into files being blocked because of CORS (in Chrome, you’ll see “ERR_BLOCKED_BY_ORB”), run Chrome with web security disabled:
chrome.exe --disable-web-security
http://www.httrack.com/ – downloads assets locally and rewrites URLs to target them. Better than wget, because wget waits until the entire download is finished before rewriting the URLs, while HTTrack does it on the fly. It can be set to just fetch a set of URLs rather than crawl an entire site; to fetch only your defined set, the depth has to be “2”. Make sure to set Build -> “Do not purge old files” if you are downloading in several runs.
Wget
Code for grabbing files while preserving a portion of the directory structure (manpage for wget):
wget --no-host-directories --force-directories --cut-dirs=6 --timeout=30 --wait=0.5 --convert-links --page-requisites --input-file=wgetdownload.txt --no-clobber -e robots=off --no-verbose --regex-type=pcre --reject-regex="(\.css|\.js|watermark)"
--cut-dirs <- cut this many directory levels from the dir structure you'll create
For example, fetching the URL https://web.archive.org/web/20170502122051if_/https://blog.mailchimp.com/myfolder/
with --cut-dirs=5 will leave just 1 dir written to your disk: myfolder
(Note that wget sees the slashes in the https:// as two path dividers)
--no-host-directories
Disable generation of host-prefixed directories. By default, invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behaviour.
--force-directories
The opposite of -nd: create a hierarchy of directories, even if one would not have been created otherwise. E.g. wget -x http://fly.srk.fer.hr/robots.txt will save the downloaded file to fly.srk.fer.hr/robots.txt.
--page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
--no-clobber
If a file already exists in the target directory, this option tells Wget not to download it again – it refuses to overwrite or re-fetch existing files, instead of saving extra numbered copies (file.1, file.2, and so on).
--convert-links
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc. (the manpage has more details)
Troubleshooting “Read error at byte…”
I kept getting this message when trying to download images from Getty.edu, and the images would get saved but with roughly half of the graphical data missing. The problem cropped up when fetching them from Firefox too, so it isn’t specific to Wget:
C:\>wget -O MiraCalligraphiaeMonumenta-p184_aa34483d-0718-4316-8102-da351270c52c.jpg https://media.getty.edu/iiif/image/aa34483d-0718-4316-8102-da351270c52c/full/full/0/default.jpg
--2024-08-18 01:05:47-- https://media.getty.edu/iiif/image/aa34483d-0718-4316-8102-da351270c52c/full/full/0/default.jpg
Resolving media.getty.edu (media.getty.edu)... 18.160.200.61, 18.160.200.101, 18.160.200.30, ...
Connecting to media.getty.edu (media.getty.edu)|18.160.200.61|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [image/jpeg]
Saving to: 'MiraCalligraphiaeMonumenta-p184_aa34483d-0718-4316-8102-da351270c52c.jpg'
MiraCalligraphiaeMonumenta-p1 [ <=> ] 8.04M 8.08MB/s in 1.0s
2024-08-18 01:05:51 (8.08 MB/s) - Read error at byte 8429822 (No error).Retrying.
...<retry happens here with same results>...
--2024-08-18 01:05:57-- (try: 3) https://media.getty.edu/iiif/image/aa34483d-0718-4316-8102-da351270c52c/full/full/0/default.jpg
Connecting to media.getty.edu (media.getty.edu)|18.160.200.61|:443... connected.
Unable to establish SSL connection.
What helped was to downgrade the TLS version and make it very clear that wget should keep retrying:
wget --no-check-certificate --secure-protocol=TLSv1_2 --retry-on-host-error --waitretry 3 -t 3 -O MiraCalligraphiaeMonumenta-p112_e96d2d0e-329e-4939-93cc-134908a8c8a0.jpg https://media.getty.edu/iiif/image/e96d2d0e-329e-4939-93cc-134908a8c8a0/full/full/0/default.jpg
FTP
Recursively listing all the files under a certain FTP directory:
- Open WinSCP
- Go to the directory in question
- Click Commands -> Open Terminal
You can run shell commands there, as long as they’re not interactive.
- Use the find command with these parameters:
find . -type f -regextype awk -regex ".+\.(htm|html|exe)$"
This will use the awk Regex engine (which has bracket grouping), and will list any file that ends with htm/html/exe.
Why you shouldn’t get too deep into bypassing bot-blockers
You could end up like the person in So you want to scrape like the Big Boys?
Post-processing
Regex for removing all script tags and their contents:
Replace <script(.+?)</script> with a blank.
When using Notepad++, also enable the option that makes “.” match newlines.
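If you’d rather script it than click through an editor, the same idea as a quick Python sketch (the file names here are placeholders):

import re

# Strip <script>...</script> blocks; re.DOTALL makes "." match newlines, like the Notepad++ toggle
html = open("page.html", encoding="utf-8", errors="replace").read()
cleaned = re.sub(r"<script(.+?)</script>", "", html, flags=re.DOTALL | re.IGNORECASE)
open("page.cleaned.html", "w", encoding="utf-8").write(cleaned)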
Extracting data from Windows help files
Here is how you can Print to PDF from Windows 2000: use “PDF Creator 0.9.3” with PostScript included.
I used this PDF print driver in VirtualBox with Windows 2000 to get old Microsoft Systems Journal articles exported out of the “Help system” used by the MSDN 1996 disc (use the customer key 337-7364994).
Something I haven’t tried yet:
These “Help system” trees can be converted to RTF format in bulk. I’ve gotten as far as producing a big flat RTF file successfully. There is probably a path from there onward to an HTML export: either by converting the RTF to HTML, by converting from winhelp to Htmlhelp and exporting it as an HTML site, or by using Herdsoft’s strange winhelpcgi CGI script with a local webserver to create a local website out of HLP files and scrape it.
Help Decompiler
http://web.archive.org/web/20091012074653/http://freenet-homepage.de/mawin/helpdeco.htm
https://www.herdsoft.com/catalog/ehlp2rtf.html
HTML Help Workshop
https://www.herdsoft.com/ti/german/themen/hilfe/hpj_in_html.html
https://web.archive.org/web/20070210090723/http://msdn.microsoft.com/library/en-us/htmlhelp/html/hwmicrosofthtmlhelpdownloads.asp
(Get a copy of the exe for HTML Help Workshop 1.4 from my site)
Converting winhelp -> Htmlhelp
https://web.archive.org/web/20070221011010/http://www.help-info.de/en/Help_Info_WinHelp/hw_converting.htm
Downloading OpenSeaDragon & Zoomable images
Use Dezoomify – it is also locally hostable and available as a browser add-on if needed.
Discovering hidden APIs
All the data can be yours by Jerome Paulos (archive)
Ontario / Toronto resources
https://search.ourontario.ca – search through multiple Ontario archives
Other
Paperback – paper backup of digital documents
How much data can you store in a QR code?
Qrare – a tool for storing binary data in QR Codes
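For reference, a single QR code in byte mode tops out at 2,953 bytes (version 40 at the lowest error-correction level), so anything bigger has to be split across multiple codes – which is what tools like Qrare and Paperback handle for you. A quick sketch of the single-code case with the Python qrcode package (the file names are placeholders):

import qrcode  # pip install "qrcode[pil]"

data = open("tiny_file.bin", "rb").read()
# One QR code in byte mode holds at most 2,953 bytes (version 40, error-correction level L)
assert len(data) <= 2953, "too big for a single QR code – split it up"
qrcode.make(data).save("backup_qr.png")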
Did any of this help? Send me an email to give me the fuzzy feelies (my first name at this domain)