HTTrack options

From Archiveteam
Revision as of 22:21, 23 December 2009

Good options to use for httrack to mirror a large-ish site (requires 2GB of RAM). Works well on my Dell 2850 w/ 4GB of RAM:

httrack --connection-per-second=50 --sockets=80 --keep-alive --display --verbose --advanced-progressinfo --disable-security-limits -n -i -s0 -m -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' -#L500000000 'http://www.facebook.com/FacebookPages?v=app_2347471856#/FacebookPages?v=photos'

  • ignores robots.txt (-s0)
  • allows for a queue of 500M unfetched URLs (-#L500000000)
  • custom user agent (-F)
  • pretty fast (--sockets=80 opens several connections at once)
  • will re-write links so they work offline
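
The bullets above map to specific flags. For readability, the same invocation can be assembled from a bash array, one option per line with comments (a sketch: `echo` prints the assembled command instead of running it, since httrack may not be installed; the URL is the example from above):

```shell
#!/usr/bin/env bash
# Sketch: build the httrack option list one flag per line.
# Drop the `echo` to actually run httrack.
opts=(
  --connection-per-second=50    # allow 50 new connections per second
  --sockets=80                  # up to 80 simultaneous sockets
  --keep-alive                  # reuse connections where the server allows
  --display                     # show fetched filenames
  --verbose                     # verbose log output
  --advanced-progressinfo       # extra progress information
  --disable-security-limits     # lift httrack's built-in rate caps
  -n                            # also grab "near" files off neighbouring sites
  -i                            # continue an interrupted mirror
  -s0                           # never follow robots.txt
  -m                            # mirror mode
  -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5'  # custom user agent
  '-#L500000000'                # queue of up to 500M unfetched URLs
)
echo httrack "${opts[@]}" 'http://www.facebook.com/FacebookPages?v=app_2347471856#/FacebookPages?v=photos'
```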

NOTE: remove the "-n" if you only want to mirror the site in question. Leave it in to also grab everything off neighbouring sites, so the page still renders completely if the internet goes away.
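
Following that note, a site-only variant simply drops -n; a minimal sketch of the trimmed command (again printed with `echo` rather than executed):

```shell
#!/usr/bin/env bash
# Sketch: the same invocation with -n removed, so only the target site is mirrored.
cmd=(httrack --connection-per-second=50 --sockets=80 --keep-alive
     --display --verbose --advanced-progressinfo --disable-security-limits
     -i -s0 -m
     -F 'Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5'
     '-#L500000000'
     'http://www.facebook.com/FacebookPages?v=app_2347471856#/FacebookPages?v=photos')
echo "${cmd[@]}"
```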

NOTE: httrack is written in C, not Java, but as a 32-bit process it is limited to 2GB of RAM by the address space. A 64-bit build may allow for a larger crawl queue.