HTTrack options

From Archiveteam
Revision as of 22:08, 23 December 2009 by Scumola

Good options for mirroring a large-ish site with httrack. The crawl needs about 2GB of RAM; it works well on my Dell 2850 with 4GB of RAM:

httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo --disable-security-limits -n -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -#L500000000 'http://www.facebook.com/FacebookPages?v=app_2347471856#/FacebookPages?v=photos'

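  • ignores robots.txt (-s0)
  • allows for a queue of up to 500M unfetched URLs (-#L500000000)
  • sends a custom user agent (-F)
  • pretty fast: up to 80 connections at once (--sockets=80) and up to 50 new connections per second (--connection-per-second=50), with keep-alive; --disable-security-limits bypasses httrack's built-in caps on bandwidth and simultaneous connections
  • rewrites links so they work offline (httrack does this by default)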
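
The same options can be assembled one group at a time in a small shell function, which keeps each flag next to the bullet it implements and makes sure the user agent is passed as a single argument. This is only a sketch: mirror_cmd, UA and URL are names made up for this example and are not part of httrack, and the flags are exactly the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper names (mirror_cmd, UA, URL); the flags come from the command above.
UA='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5'
URL='http://www.facebook.com/FacebookPages?v=app_2347471856#/FacebookPages?v=photos'

# mirror_cmd CMD [ARGS...]: append the httrack options to CMD, then run it.
mirror_cmd() {
    set -- "$@" --connection-per-second=50 --sockets=80 --keep-alive  # several connections at once
    set -- "$@" --disable-security-limits                             # bypass built-in limits
    set -- "$@" --display --verbose --advanced-progressinfo           # progress output
    set -- "$@" -s0                                                   # ignore robots.txt
    set -- "$@" '-#L500000000'                                        # queue of up to 500M URLs
    set -- "$@" -F "$UA"                                              # user agent as ONE argument
    set -- "$@" -n -i -m                                              # carried over unchanged
    set -- "$@" "$URL"
    "$@"
}

# Dry run: print one argument per line instead of starting a crawl.
mirror_cmd printf '%s\n'
```

Once the printed arguments look right, calling mirror_cmd httrack starts the real crawl with the same list.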

NOTE: remove the "-n" if you only want to mirror the site in question. Leave it in to also grab the non-HTML files (images and so on) that pages pull in from neighbouring sites, so the pages still render completely if the internet goes away.

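NOTE: httrack seems to be limited to about 2GB of RAM. It is written in C rather than Java, so the ceiling is more likely the address-space limit of a 32-bit build than anything Java-related. Not sure if a 64-bit version of it will allow for a larger crawl queue.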
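
To find out whether the installed httrack is a 32-bit or a 64-bit build, the ELF header of the binary can be inspected. The sketch below assumes a Linux-style ELF binary; elf_bits is a hypothetical helper written for this page, not an httrack feature, and it prints "unknown" for anything that is not ELF.

```shell
#!/bin/sh
# elf_bits PATH: print 32 or 64 for an ELF binary, "unknown" otherwise.
elf_bits() {
    # An ELF file starts with 0x7f 'E' 'L' 'F', followed by a class byte
    # (1 = 32-bit, 2 = 64-bit). od prints those five bytes as decimals.
    set -- $(od -An -tu1 -N5 "$1" 2>/dev/null)
    if [ "$#" -eq 5 ] && [ "$1 $2 $3 $4" = "127 69 76 70" ]; then
        case $5 in
            1) echo 32 ;;
            2) echo 64 ;;
            *) echo unknown ;;
        esac
    else
        echo unknown
    fi
}

# Report on the httrack found in PATH, if there is one.
if bin=$(command -v httrack); then
    echo "httrack build: $(elf_bits "$bin")-bit"
else
    echo "httrack not found in PATH"
fi
```

A result of 64 only says the binary can address more memory; whether the crawl queue actually grows past 2GB would still need to be tested.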