HTTrack options

From Archiveteam
Revision as of 22:08, 23 December 2009 by Scumola

Good options for mirroring a large-ish site with httrack. The crawl needs about 2GB of RAM; it works well on my Dell 2850 with 4GB of RAM:

httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo --disable-security-limits -n -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -#L500000000 'http://www.facebook.com/FacebookPages?v=app_2347471856#/FacebookPages?v=photos'

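  • ignores robots.txt (-s0)
  • allows for a queue of 500M unfetched URLs (-#L500000000)
  • sets a custom user agent (-F)
  • pretty fast: uses up to 80 connections at once (--sockets=80), opening at most 50 new connections per second (--connection-per-second=50)
  • rewrites links so they work offline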
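
As a rough sketch, the options above can be wrapped in a small shell function so the start URL is the only thing that changes between runs. The names mirror_site, HTTRACK, UA and MAX_LINKS are made up for this example and are not part of httrack; the flags themselves are copied from the command above. Setting HTTRACK=echo gives a dry run that only prints the arguments.

```shell
# Illustrative wrapper around the command on this page.
# mirror_site, HTTRACK, UA and MAX_LINKS are hypothetical names, not httrack features.

# Custom user agent, passed to -F. Quoted because it contains spaces and parentheses.
UA='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5'

# Size of the link queue, passed to -#L (500M unfetched URLs).
MAX_LINKS=500000000

mirror_site() {
    # $1 = start URL.
    # -s0 ignores robots.txt; -n also grabs files from neighbouring sites.
    # The remaining flags are carried over unchanged from the original command.
    "${HTTRACK:-httrack}" \
        --connection-per-second=50 --sockets=80 --keep-alive \
        --display --verbose --advanced-progressinfo \
        --disable-security-limits \
        -n -i -s0 -m \
        -F "$UA" \
        "-#L${MAX_LINKS}" \
        "$1"
}
```

For example, HTTRACK=echo mirror_site 'http://www.example.com/' prints the full argument list without starting a crawl; dropping HTTRACK=echo runs the real mirror.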

NOTE: remove the "-n" if you only want to mirror the site in question. Leave it in to also grab the files that pages pull in from neighbouring sites, so the pages still render completely if the internet goes away.

NOTE: httrack seems to be limited to 2GB of RAM. It is a native C program rather than Java, so the cap is more likely a 32-bit address-space limit. Not sure if a 64-bit build of it will allow for a larger crawl queue.