Site exploration

From Archiveteam
Revision as of 15:22, 9 July 2013 by Lewis Collard (talk | contribs) (update)

This page contains some tips and tricks for exploring soon-to-be-dead websites, to find URLs to feed into the Archive Team crawlers.

Open Directory Project data

The Open Directory Project offers machine-readable downloads of its data. You want the "content.rdf.u8.gz" from there.

wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz

Quick-and-dirty shell parsing for the not-too-fussy:

grep '<link r:resource=.*dyingsite\.com' content.rdf.u8 | sed 's/.*<link r:resource="\([^"]*\)".*/\1/' | sort | uniq > odp-sitelist.txt
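If sed regexes aren't your thing, the same extraction can be sketched in Python, scanning the dump line by line rather than parsing it as one enormous XML document. This is a quick sketch, not an official tool; "dyingsite.com" and the sample lines are placeholders:

```python
import re

# Matches the external-link attribute in ODP RDF dump lines.
LINK_RE = re.compile(r'<link r:resource="([^"]*)"')

def extract_links(lines, domain):
    """Yield unique URLs containing `domain`, in first-seen order."""
    seen = set()
    for line in lines:
        for url in LINK_RE.findall(line):
            if domain in url and url not in seen:
                seen.add(url)
                yield url

# Tiny inline sample standing in for content.rdf.u8:
sample = [
    '  <link r:resource="http://www.dyingsite.com/a/"/>',
    '  <link r:resource="http://othersite.org/"/>',
    '  <link r:resource="http://www.dyingsite.com/a/"/>',
]
for url in extract_links(sample, "dyingsite.com"):
    print(url)
```

For the real dump, swap sample for open("content.rdf.u8", encoding="utf-8", errors="replace").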

MediaWiki wikis

MediaWiki wikis, especially the very large ones operated by the Wikimedia Foundation, often link out to a large number of important pages hosted with a given service, so scraping their external-link tables can yield a useful URL list.

mwlinkscrape.py is a tool by an Archive Team patriot which extracts a machine-readable list of matching external links from a number of wikis (it actually uses the text of this page to get a list of wikis to scrape).

./mwlinkscrape.py "*.dyingsite.com" > mw-sitelist.txt

Bing API

Microsoft, bless their Redmondish hearts, have an API for fetching Bing search engine results, which has a free tier of 5000 queries per month (this will cover you for about 250 sets of 1000 results). However, it only returns the first 1000 results for any query, so you can't just search "site:dyingsite.com" and get all the things on a site. You'll need to get a bit creative with the search terms.

Grab this Python script (look for "BING_API_KEY" and replace it with your "Primary Account Key"), and then:

python bingscrape.py "site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "about me site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "gallery site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "in memoriam site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "diary site:dyingsite.com" >> bing-sitelist.txt
python bingscrape.py "bob site:dyingsite.com" >> bing-sitelist.txt

And so on.
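The per-query output files overlap heavily, so deduplicate bing-sitelist.txt before feeding it to a crawler. The sort | uniq trick from the ODP section works fine; if you'd rather keep first-seen order, a minimal Python sketch (the sample stands in for the file produced above):

```python
def dedupe(lines):
    """Drop blank lines and duplicate URLs, keeping first-seen order."""
    seen = set()
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            yield url

# Inline sample standing in for bing-sitelist.txt:
sample = [
    "http://dyingsite.com/a\n",
    "http://dyingsite.com/b\n",
    "http://dyingsite.com/a\n",
]
print(list(dedupe(sample)))
# → ['http://dyingsite.com/a', 'http://dyingsite.com/b']
```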

Common Crawl Index

The Common Crawl index is a very big (21 gigabytes compressed) list of URLs in the Common Crawl corpus. Grepping this list may well reveal plenty of URLs to archive. The list is in an odd format, along the lines of com.deadsite.www/subdirectory/subsubdirectory:http (reversed hostname, then path, then scheme), so you'll need to do some filtering of the results, which can sometimes be ambiguous.

grep '^com\.dyingsite[/\.]' zfqwbPRW.txt > commoncrawl-sitelist.txt

Our Ivan has a Python script which will take a list of these entries on standard input and print a list of normally-formed URLs on standard output.
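The conversion itself is mechanical: split off the reversed hostname at the first slash, un-reverse its dot-separated labels, and move the scheme from the end to the front. A minimal sketch of that idea, assuming every entry follows the host/path:scheme shape shown above (real index lines don't always cooperate):

```python
def to_url(entry):
    """Turn a reversed-host index entry into a normally-formed URL."""
    entry = entry.strip()
    hostpart, _, rest = entry.partition("/")   # reversed host vs. path:scheme
    path, _, scheme = rest.rpartition(":")     # scheme sits after the last colon
    host = ".".join(reversed(hostpart.split(".")))
    return "%s://%s/%s" % (scheme, host, path)

print(to_url("com.dyingsite.www/subdirectory/subsubdirectory:http"))
# → http://www.dyingsite.com/subdirectory/subsubdirectory
```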