Lulu Poetry

From Archiveteam

Revision as of 05:09, 2 May 2011

Lulu Poetry, also known as Poetry.com, announced on May 1, 2011 that it would close on May 4, just days later, deleting all 14 million poems. Archive Team members quickly gathered to work out how to help and aimed their LOICs at it. (By the way, I actually mean their crawlers, not DDoS cannons.)


Site Structure

The URLs appear to be flexible and sequential: the numeric ID seems to select the page, and the text in front of it can apparently be anything:

(12:13:09 AM) closure: http://www.poetry.com/poems/archiveteam-bitches/3535201/ , heh, look at that, you can just put in any number you like I think

(12:15:16 AM) closure: http://www.poetry.com/user/allofthem/7936443/ same for the users

Strategies

Currently people are using wget in various ways.

Because the URLs are sequential, you can call wget with a nano-bash script:
for x in $(seq 1 100000); do wget "http://www.poetry.com/poems/archiveteam/$x/" -O "poem$x.html"; done
Or you can just make a text file containing a list of incrementing URLs and feed that to wget with -i as its source of URLs (this may be faster than the loop above, since one wget process handles the whole list).
A quick command to build a list of URLs:
perl -e 'print "http://www.poetry.com/poems/archiveteam/$_/\n" for 100000..14500000' > biglist
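For anyone without perl handy, the list-then-fetch approach can be sketched in plain shell. This is only an illustration: the function names make_url_list and fetch_url_list are made up here, the "archiveteam" slug is carried over from the examples above, and the ID range is whatever chunk you decide to take. Because every poem URL ends in a slash, wget would save each page as index.html; the -x option (force directories) is used so each page lands under its own ID directory instead.

```shell
# Sketch only: function names and the "archiveteam" slug are assumptions
# taken from the examples above, not anything the site requires.

# Print one poem URL per line for every ID from $1 to $2 inclusive.
make_url_list() {
  id="$1"
  while [ "$id" -le "$2" ]; do
    printf 'http://www.poetry.com/poems/archiveteam/%s/\n' "$id"
    id=$((id + 1))
  done
}

# Fetch every URL listed in file $1 with a single wget process.
# -i reads the URLs from the file; -x recreates the URL's directory
# layout, so pages are saved as .../<id>/index.html and do not collide.
fetch_url_list() {
  wget -x -i "$1"
}

# Example (not run here):
#   make_url_list 1 100000 > smalllist
#   fetch_url_list smalllist
```

Splitting the work into separate list files this way also makes it easy for several people to take non-overlapping ID ranges.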