Difference between revisions of "Talk:Angelfire"

From Archiveteam
Some brainstorming from procrastination:
 
 
First grab all the sitemap indexes:
 
curl http://www.angelfire.com/robots.txt | grep -Eo 'http.*gz' > sitemap-index-urls
 
<pre>
 
http://www.angelfire.com/sitemap-index-00.xml.gz
 
http://www.angelfire.com/sitemap-index-01.xml.gz
 
http://www.angelfire.com/sitemap-index-02.xml.gz
 
...
 
</pre>
 
 
 
Use that to grab all the sitemaps:
 
wget -i sitemap-index-urls
 
<pre>
 
<sitemap><loc>http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
 
<sitemap><loc>http://www.angelfire.com/vevayaqo/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
 
<sitemap><loc>http://www.angelfire.com/planet/dumbass123/sitemap.xml</loc><lastmod>2012-04-10</lastmod></sitemap>
 
...
 
</pre>
 
 
 
Extract the urls:
 
zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz > sitemap-urls
 
<pre>
 
http://www.angelfire.com/punk4/jori_loves_jackass/sitemap.xml
 
http://www.angelfire.com/vevayaqo/sitemap.xml
 
http://www.angelfire.com/planet/dumbass123/sitemap.xml
 
...
 
</pre>
 
 
 
And grab them all:
 
wget --force-directories -i sitemap-urls
 
 
 
TODO: Find a smart way to grab everything from that.
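One possible starting point for the TODO, as a rough sketch only: pull every page URL out of the downloaded per-user sitemaps into one flat list. This assumes the per-user sitemap.xml files are uncompressed and follow the usual sitemap layout with <loc> elements, and that wget --force-directories saved them under ./www.angelfire.com/. Neither has been checked here. The function name extract_page_urls and the output file page-urls are made up for illustration.

```shell
# Hypothetical helper (not from the original notes): print every <loc> URL
# found in the sitemap.xml files below the given directory, one per line,
# sorted and de-duplicated.
# Assumption: the per-user sitemaps are plain XML containing <loc> elements.
extract_page_urls() {
    find "$1" -type f -name 'sitemap.xml' -exec cat {} + |
        grep -Eo '<loc>[^<]+</loc>' |
        sed -e 's#<loc>##' -e 's#</loc>##' |
        sort -u
}

# Usage, assuming the sitemaps were saved under ./www.angelfire.com/:
#   extract_page_urls www.angelfire.com > page-urls
```

The resulting list could then be fed to wget -i like the earlier lists.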
 
 
You will want to run wget with --no-cookies and reject URLs matching http://www.angelfire.lycos.com/doc/images/track/ot_noscript.gif.*
 
Some images are hosted on http://www.angelfire.lycos.com and will require some smart hackery.
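A sketch of how the two notes above might translate into a wget call, untested against the live site. It assumes a wget new enough to have --reject-regex (1.14 or later) and a hypothetical input file with one page URL per line. The wrapper name grab_pages is made up. Using --span-hosts together with --domains is only one guess at the "smart hackery" for the images on www.angelfire.lycos.com.

```shell
# Hypothetical wrapper (illustration only): fetch each URL listed in the file
# given as $1, plus the images and other requisites the pages reference.
grab_pages() {
    # --no-cookies            : do not store or send cookies
    # --page-requisites       : also fetch images, CSS etc. used by each page
    # --span-hosts/--domains  : allow requisites from other hosts, but only
    #                           ones ending in the listed domains (assumption:
    #                           this covers www.angelfire.lycos.com)
    # --reject-regex          : skip the ot_noscript.gif URLs mentioned above
    wget --no-cookies \
         --page-requisites \
         --span-hosts \
         --domains=angelfire.com,angelfire.lycos.com \
         --reject-regex='angelfire\.lycos\.com/doc/images/track/ot_noscript\.gif' \
         --input-file="$1"
}

# Usage, with a hypothetical URL list:
#   grab_pages page-urls
```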
 
 
-----
Guestbooks were killed in 2012, e.g. http://htmlgear.lycos.com/guest/control.guest?u=gosanson&i=2&a=view
 
-----
 
Some users have blogs, like this one in the sitemap: http://filesha.angelfire.com/blog/index.blog
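To get a feel for how many such subdomain sites there are, one could list the hosts other than www.angelfire.com that show up in a URL list. A minimal sketch, where the function name other_hosts and the input file are hypothetical:

```shell
# Hypothetical helper: read URLs on stdin and print each distinct host other
# than www.angelfire.com, one per line.
other_hosts() {
    # Splitting on "/" turns "http://host/path" into $1="http:", $2="", $3="host".
    awk -F/ '$1 ~ /^https?:$/ && $3 != "www.angelfire.com" { print $3 }' | sort -u
}

# Usage, with a hypothetical URL list:
#   other_hosts < page-urls
```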
 

Latest revision as of 07:38, 9 May 2015


You can also extract the "realm" and username combinations from the sitemap indexes:

zgrep -hEo 'http:.*xml' ori/sitemap-index-*.xml.gz | sed 's#http://www.angelfire.com/##' | sed 's#/sitemap.xml##' | sed 's#/#\t#'


Warning: there are usernames without a "realm" prefix, such as jeshare, seacrozzer or hjones669.
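A variant that keeps the columns aligned for realm-less usernames, as a sketch: it always prints two tab-separated columns and leaves the first one empty when there is no realm. The function name split_realm_user is made up, and the input is assumed to be one sitemap URL per line in the form shown earlier.

```shell
# Hypothetical helper: read sitemap URLs on stdin and print "realm<TAB>user",
# with an empty realm column for usernames that have no realm prefix.
split_realm_user() {
    sed -e 's#^http://www\.angelfire\.com/##' -e 's#/sitemap\.xml$##' |
        awk -F/ '
            NF == 1 { printf "\t%s\n", $1; next }   # no realm, empty first column
                    { printf "%s\t%s\n", $1, $2 }   # realm and username
        '
}

# Usage:
#   zgrep -hEo 'http:.*xml' sitemap-index-*.xml.gz | split_realm_user
```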