HTTrack options

From Archiveteam
Revision as of 22:21, 23 December 2009 by Scumola (talk | contribs)

Good options for mirroring a large-ish site with httrack (requires 2GB of RAM). Works well on my Dell 2850 with 4GB of RAM:

httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo --disable-security-limits -n -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -#L500000000 ''

  • ignores robots.txt (-s0)
  • allows for a queue of 500M unfetched URLs (-#L500000000)
  • custom user agent (-F)
  • pretty fast: uses several connections at once (--sockets=80, --connection-per-second=50, --keep-alive)
  • rewrites links so they work offline
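
The same command can be split up so that each group of options gets its own comment. This is only a sketch: the mirror_site function and the HTTRACK override variable are made up for illustration and are not part of httrack, and the comments on -n and -i are my reading of the options rather than something verified here. The option values are copied unchanged from the command above.

```shell
#!/bin/sh
# User agent string, copied from the one-line command.
UA='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5'

# mirror_site URL
# Hypothetical wrapper: builds the option list step by step, then runs httrack.
# Set HTTRACK=echo to print the command instead of starting a crawl.
mirror_site() {
    url=$1

    # Speed: 50 new connections per second, 80 sockets, keep-alive.
    set -- --connection-per-second=50 --sockets=80 --keep-alive
    # Progress output on the terminal.
    set -- "$@" --display --verbose --advanced-progressinfo
    # Lift httrack's built-in bandwidth/connection caps.
    set -- "$@" --disable-security-limits
    # -n: also fetch files "near" a page (e.g. images on other hosts).
    # -i: continue an interrupted mirror (assumption: uses the cache).
    set -- "$@" -n -i
    # -s0: never follow robots.txt.
    set -- "$@" -s0
    # -m: kept exactly as in the original command.
    set -- "$@" -m
    # -F: custom user agent.
    set -- "$@" -F "$UA"
    # -#L: allow up to 500M links in the queue.
    set -- "$@" '-#L500000000'

    "${HTTRACK:-httrack}" "$@" "$url"
}
```

To preview the command without crawling anything, run HTTRACK=echo mirror_site 'https://example.org/' (example.org is just a placeholder URL). Drop the HTTRACK=echo prefix to start the real mirror.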

NOTE: remove the "-n" if you only want to mirror the site in question. Leave it in to also grab the files that its pages pull in from neighbouring sites, so the pages still render completely if the internet goes away.

NOTE: httrack appears to be limited to 2GB of RAM. It is written in C rather than Java, so this is probably the address-space limit of a 32-bit build (I believe). Not sure if a 64-bit version of it will allow for a larger crawl queue.