- Yahoo starts returning error 999 after about 30 minutes of fetching from a single IP address. We used two approaches to get around this:
- Tor (slow as molasses, but worked) - collected using httrack
- multiple IPs (fast, but needs a large pool of IP addresses) - collected using wget
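The multiple-IP approach can be sketched roughly as follows. This is an illustration only, not the script used for the actual grab (which ran wget): the function name `fetch_with_rotation`, the `fetch` callback, and the example addresses are all assumptions made for the sketch. The idea is simply to keep using one source IP until Yahoo answers with 999, then move on to the next one.

```python
from typing import Callable, Dict, Iterable, Tuple

# Status code the notes above report Yahoo returning once an IP is throttled.
RATE_LIMIT_STATUS = 999

# Hypothetical callback type: given (url, source_ip) it returns (status, body).
Fetcher = Callable[[str, str], Tuple[int, bytes]]


def fetch_with_rotation(
    urls: Iterable[str],
    source_ips: Iterable[str],
    fetch: Fetcher,
) -> Dict[str, Tuple[int, bytes]]:
    """Fetch every URL, switching to the next source IP whenever a 999 is seen."""
    ips = list(source_ips)
    if not ips:
        raise ValueError("at least one source IP is required")

    current = 0  # index of the IP currently in use
    results: Dict[str, Tuple[int, bytes]] = {}

    for url in urls:
        # Try each IP at most once per URL, starting with the current one.
        for _ in range(len(ips)):
            status, body = fetch(url, ips[current])
            if status != RATE_LIMIT_STATUS:
                results[url] = (status, body)
                break
            # This IP is blocked for now; rotate to the next one and retry.
            current = (current + 1) % len(ips)
        else:
            raise RuntimeError("every source IP is currently blocked")

    return results
```

A real `fetch` would have to bind the outgoing connection to the chosen address; in Python that could be done with the `source_address` argument of `http.client.HTTPConnection`, and with wget the equivalent is the `--bind-address` option. The sketch does not wait for blocked IPs to recover, so a long-running job would probably want a pause before giving up.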
The tarballs in the archive reflect both archiving methods:
-rw-r--r-- 1 root root 228855239 Dec 15 13:35 starwars.yahoo.com-goekesmi-raw.tar.bz2
-rw-r--r-- 1 root root  36529217 Dec 20 15:53 starwars.yahoo.com-tor.tar.bz2