Talk:Angelfire

From Archiveteam
Revision as of 19:10, 8 May 2015 by Schbirid (talk | contribs)

Some brainstorming from procrastination:

First grab all the sitemap indexes: curl http://www.angelfire.com/robots.txt | grep -Eo 'http.*gz' > sitemap-index-urls

http://www.angelfire.com/sitemap-index-00.xml.gz
http://www.angelfire.com/sitemap-index-01.xml.gz
http://www.angelfire.com/sitemap-index-02.xml.gz
...


Use that to grab all the sitemaps: wget -i sitemap-index-urls

<sitemap><loc>http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/vevayaqo/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
<sitemap><loc>http://www.angelfire.com/planet/dumbass123/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
...


Extract the urls: zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz > sitemap-urls

http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml
http://www.angelfire.com/vevayaqo/sitemap.xml
http://www.angelfire.com/planet/dumbass123/sitemap.xml
...
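
The zgrep pattern above is greedy, which is fine as long as every <sitemap> entry sits on its own line as in the sample. A minimal sketch of a stricter variant, assuming some index files might pack several entries onto one line (not verified against the real files). The function name extract_sitemap_urls and the sort -u deduplication are additions for illustration, not part of the original notes:

```shell
# Print every per-user sitemap URL found in the given gzipped index files.
# The bracket expression stops at '<', '"' or whitespace, so each <loc>
# value is matched on its own even when several entries share one line.
# sort -u is an addition here: it drops duplicates and changes the order.
extract_sitemap_urls() {
    gzip -dc -- "$@" |
        grep -Eo 'http://[^<"[:space:]]*\.xml' |
        sort -u
}
```

Usage would be along the lines of: extract_sitemap_urls sitemap-index-*.xml.gz > sitemap-urls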


And grab them all: wget --force-directories -i sitemap-urls


TODO: Find a smart way to grab everything from that.
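
One possible starting point for the TODO, as a rough sketch only. It assumes the per-user sitemaps follow the standard sitemap format with one <loc> element per page (their contents are not shown here), and that wget --force-directories saved them under a www.angelfire.com/ directory. The function name list_page_urls is made up for illustration:

```shell
# Collect the page URLs listed in all downloaded per-user sitemaps.
# Assumption: each page is listed as <loc>URL</loc>, as in the standard
# sitemap format; anything else in the files is ignored.
list_page_urls() {
    find "${1:-www.angelfire.com}" -type f -name 'sitemap.xml' \
        -exec grep -Eho '<loc>[^<]+</loc>' {} + |
        sed -e 's#^<loc>##' -e 's#</loc>$##' -e 's/&amp;/\&/g' |
        sort -u
}
```

The resulting list (for example list_page_urls > page-urls, where page-urls is a hypothetical file name) could then be fed to wget -i or split into chunks for several workers.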

You will want to run wget with --no-cookies and reject URLs matching http://www.angelfire.lycos.com/doc/images/track/ot_noscript.gif.* Some images are hosted on http://www.angelfire.lycos.com instead of www.angelfire.com, so grabbing them will require some smart hackery.
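
A hedged sketch of what such a wget call might look like. The --no-cookies and --reject-regex options are real wget options (--reject-regex needs wget 1.14 or newer). Using --page-requisites with --span-hosts and --domains to pick up the images on angelfire.lycos.com is only one guess at the "smart hackery" and is untested against the live site. The function name fetch_pages and the URL list file passed to it are hypothetical:

```shell
# Fetch every URL listed in the file "$1" (one URL per line).
# --no-cookies    : do not store or send cookies
# --reject-regex  : skip the ot_noscript.gif URLs under /doc/images/track/
# --span-hosts with --domains : allow page requisites (images etc.) from
#                   angelfire.lycos.com as well (assumption, see above)
fetch_pages() {
    wget --no-cookies \
         --reject-regex 'angelfire\.lycos\.com/doc/images/track/ot_noscript\.gif' \
         --page-requisites \
         --span-hosts --domains=angelfire.com,angelfire.lycos.com \
         --force-directories \
         -i "$1"
}
```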



You can also extract the "realm" and username combinations from the sitemap indexes as tab-separated pairs: zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz | sed 's#http://www.angelfire.com/##' | sed 's#/sitemap.xml##' | sed 's#/#\t#'


Warning: There are usernames without a "realm" prefix! Some random examples: jeshare, seacrozzer and hjones669.
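
With a plain sed pipeline, a username without a realm ends up alone in the first column, where it looks like a realm. A small sketch that keeps the columns aligned by leaving the realm column empty in that case. The function name split_realm_user is made up, and it assumes the input is one sitemap URL per line on www.angelfire.com:

```shell
# Read sitemap URLs on stdin, print "realm<TAB>username" per line.
# Users without a realm get an empty first column.
# Anything with more than two path parts is reported on stderr.
split_realm_user() {
    sed -e 's#^http://www\.angelfire\.com/##' -e 's#/sitemap\.xml$##' |
        awk -F/ -v OFS='\t' '
            NF == 0 { next }                 # skip empty lines
            NF == 1 { print "", $1; next }   # no realm: empty column 1
            NF == 2 { print $1, $2; next }   # realm and username
            { print "unexpected: " $0 > "/dev/stderr" }
        '
}
```

For example, split_realm_user < sitemap-urls would give one tab-separated pair per user, with realm-less users recognisable by the leading tab.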


Guestbooks were killed in 2012, e.g. http://htmlgear.lycos.com/guest/control.guest?u=gosanson&i=2&a=view


Some users have blogs, which show up in the sitemap like this: http://filesha.angelfire.com/blog/index.blog