Wget with WARC output

From Archiveteam
Jump to: navigation, search

From the discussion about Working with ARCHIVE.ORG, we learn that it is important to save not just files but also HTTP headers. With Wget, that's difficult. With a few tricks you can keep the response headers, but there is no option to save the request headers. You also lose the response headers that don't produce an HTML page: Wget doesn't save redirects and 404 responses.

The development version of Wget can write its results to a WARC (Web ARChive file format) file, just like Heritrix and other archiving tools. With the WARC format, it's possible to save both the request and the response headers. It also provides a clean way to store redirects and 404 responses.

There is an additional advantage: if Wget writes these headers to a WARC file, it is no longer necessary to use the --save-headers to save them at the top of each downloaded file. There is need to remove these headers afterwards to produce a clean copy: the mirror produced by Wget are useable without post-processing.

Note that this work has been accepted into the Wget codebase, and as of version 1.14 wget supports WARC output out of the box.

Contents

Compiling

bzr branch bzr://bzr.savannah.gnu.org/wget/trunk
cd trunk
./bootstrap
./configure && make

Usage

To download a file and save the request and response data to a WARC file, run this:

src/wget "http://www.archiveteam.org/" --warc-file="at"

This will download the file to index.html, but it will also create a file at-00000.warc.gz. This is a gzipped WARC file that contains the request and response headers (of the initial redirect and of the Wiki homepage) and the html data.

If you want to have a non-compressed WARC file, use the --no-warc-compression option:

src/wget "http://www.archiveteam.org/" --warc-file="at" --no-warc-compression

Saving one file is nice, but the warc-file option becomes even more powerful if you combine it with Wget's mirror option: (You may want to try this with a smaller site than the AT wiki.)

src/wget "http://www.archiveteam.org/" --mirror --warc-file="at"

If you uncompress at-00000.warc.gz and look at it, you'll see that it contains WARC records for every request and response: it is a complete copy of the mirrored site, while at the same time Wget also created the normal mirror of the site.

Options

--warc-file=FILENAME enables the WARC export. WARC files will be based on FILENAME: FILENAME-00000.warc.gz, FILENAME-00001.warc.gz et cetera.

--warc-max-size=NUMBER defines the maximum size of the WARC files. The default is an infinite limit ("inf"). If you download a large site, the recommended limit is 1GB, set the option to 1G to enable this limit. Note that this is a soft limit: files can get slightly larger than this, depending on the files you download.

--warc-header=STRING adds STRING as a custom header to the warcinfo record, e.g. "operator: Archive Team". This option can be used multiple times.

--warc-cdx=FILENAME writes a CDX index file to FILENAME.cdx. The CDX file will contain a list of the records and their locations in the WARC files.

--warc-dedup=FILENAME can be used to reduce the size of WARC files generated by a recrawl. FILENAME should point to a CDX file, generated with --warc-cdx in a previous run. For each file it downloads, Wget will check the CDX file to see if the response is listed there. If the exact file already exists, a "revisit" record with a reference to the previous record will be added to the WARC file, instead of a duplicate "response" record. Duplicate records are detected by comparing the SHA-1 digest of the payload of the response.

--no-warc-compression will write uncompressed WARC files. Compression is enabled by default. It is better to use the built-in compression than to compress the WARC files afterwards. The built-in compression will compress each record as an individual GZIP block, which allows other utilities to extract single records from the file.

--no-warc-digests disables the SHA-1 digests. By default, SHA-1 digests will be calculated for the whole response block and the response payload. If you really need to, you can disable that.

--no-warc-keep-log can be set if you don't want the Wget log in the WARC file. By default, Wget will add the log file as a separate record to the WARC file.

--warc-tempdir=DIRECTORY sets the temporary directory used by the WARC writer. The system tempdir will be used by default.

WARC file format

The WARC file format is an ISO standard. The official specification of ISO 28500:2009 is not available for free. However, the final draft is free, and is supposed to be technically equivalent to the official standard.

The WARC usage task force has published WARC implementation guidelines with additional recommendations.


[view]  [edit]                   Archive Team                  
Current events Alive... OR ARE THEY · Deathwatch · Projects
Archiveteam.jpg
Archiving projects Archive.is · BetaArchive · Gmane · Internet Archive · It Died · OldApps.com · OldVersion.com · OSBetaArchive · TEXTFILES.COM · The Dead, the Dying & The Damned · The Mail Archive · UK Web Archive · WebCite
Blogging Blog.pl · Blogger · Blogster · Blogter.hu · Freeblog.hu · Fuelmyblog · Jux · LiveJournal · My Opera · Open Diary · ownlog.com · Posterous · Powerblogs · Proust · Roon · Splinder · Tumblr · Vox · Weblog.nl · Windows Live Spaces · Wordpress.com · Xanga · Yahoo! Blog · Zapd
Cloud hosting/file sharing AnyHub · Box · Dropbox · Google Drive · Google Groups Files · iCloud · Fileplanet · LayerVault · MediaCrush · MediaFire · Mega · MegaUpload · MobileMe · OneDrive · Pomf.se · RapidShare · Ubuntu One · Yahoo! Briefcase
Corporations Apple · IBM · Google · Lycos Europe · Microsoft · Yahoo!
Events Arab Spring · Occupy movement · Spanish Revolution
Font Repos Google Web Fonts · GNU FreeFont · Fontspace
Forums 4chan · College Confidential · ESPN Forums · forums.starwars.com · HeavenGames · Yahoo! Messages · Yahoo! Neighbors
Gaming City of Heroes · Club Nintendo · Desura · Emulation Zone · GameMaker Sandbox · Halo · Infinite Crisis · Minecraft.net · Player.me · Playfire · Steam · Warhammer · Xfire
Image hosting AOL Pictures · Blipfoto · Blingee · Canv.as · Camera+ · Cameroid · DailyBooth · Degree Confluence Project · deviantART · Demotivalo.net · Flickr · Fotoalbum.hu · Fotopedia · Geograph Britain and Ireland · GTF Képhost · ImageShack · Imgur · Inkblazers · Instagr.am · Kepfeltoltes.hu · Kephost.com · Kephost.hu · Kepkezelo.com · Keptarad.hu · Madden GIFERATOR · MLKSHK · Microsoft Clip Art · Nokia Memories · noob.hu · Odysee · Panoramio · Photobucket · Picasa · Picplz · PSharing · Ptch · puu.sh · Rawporter · Relay.im · ScreenshotsDatabase.com · Snapjoy · Streetfiles · Tabblo · Trovebox · TwitPic · Wallbase · Wallhaven · Webshots · Wikimedia Commons
Knowledge/Wikis arXiv · Citizendium · Clipboard.com · Deletionpedia · EditThis · Encyclopedia Dramatica · Etherpad · Everything2 · infoAnarchy · GeoNames · GNUPedia · Google Books (Google Books Ngram) · Insurgency Wiki · Knol · Lost Media Wiki · Neoseeker.com · Nupedia · OpenCourseWare · OpenStreetMap · Orain · Pastebin · Patch.com · Project Gutenberg · Puella Magi · Referata · Resedagboken · SongMeanings · ShoutWiki · The Internet Movie Database · TropicalWikis · Uncyclopedia · Urban Dictionary · Webmonkey · Wikia · Wikidot · WikiHow · Wikkii · WikiLeaks · Wikipedia (Simple English Wikipedia) · Wikispaces · Wikispot · Wik.is · Wiki-Site · WikiTravel · Word Count Journal
Magazines/Blogs/News Cyberpunkreview.com · Game Developer Magazine · Gigaom · Helium · JPG Magazine · San Fransisco Bay Guardian · Scoop · Regretsy · Yahoo! Voices
Microblogging Heello · Identi.ca · Jaiku · Mommo.hu · Plurk · Sina Weibo · Twitter · TwitLonger
Music/Audio AOL Music · Audimated.com · Cinch · digCCmixter · Dogmazic.net · Earbits · exfm · Free Music Archive · Gogoyoko · Indaba Music · Instacast · Jamendo · Last.fm · Music Unlimited · MOG · PureVolume · Reverbnation · ShareTheMusic · SoundCloud · Soundpedia · TuneWiki · Twaud.io · WinAmp
People Aaron Swartz · Michael S. Hart · Steve Jobs · Mark Pilgrim · Dennis Ritchie · Len Sassaman Project
Protocols/Infrastructure FTP · Gopher · IRC · Usenet · World Wide Web
Q&A Askville · Answerbag · Answers.com · Ask.com · Askalo · Baidu Knows · Blurtit · ChaCha · Experts Exchange · Formspring · GirlsAskGuys · Google Answers · Google Baraza · JustAnswer · MetaFilter · Quora · Retrospring · StackExchange · The AnswerBank · The Internet Oracle · Uclue · WikiAnswers · Yahoo! Answers
Recipes/Food Allrecipes · Epicurious · Food.com · Foodily · Food Network · Punchfork · ZipList
Social bookmarking Addinto · Backflip · Balatarin · BibSonomy · Bkmrx · Blinklist · BlogMarks · BookmarkSync · CiteULike · Connotea · Delicious · Designer News · Digg · Diigo · Dir.eccion.es · Evernote · Excite Bookmark · Faves · Favilous · folkd · Freelish · Getboo · GiveALink.org · Gnolia · Google Bookmarks · Hacker News · HeyStaks · IndianPad · Kippt · Knowledge Plaza · Licorize · Linkwad · Menéame · Microsoft Developer Network · myVIP · Mister Wong · My Web · Mylink Vault · Newsvine · Oneview · Pearltrees · Pinboard · Pocket · Propeller.com · Reddit · sabros.us · Scloog · Scuttle · Simpy · SiteBar · Slashdot · Squidoo · StumbleUpon · Twine · Vizited · Yummymarks · Xmarks · Yahoo! Buzz · Zootool · Zotero
Social networks Bebo · BlackPlanet · Classmates.com · Cyworld · Dogster · Dopplr · douban · Ello · Facebook · Flixster · FriendFeed · Friendster · Gaia Online · Google+ · Habbo · hi5 · Hyves · iWiW · LinkedIn · Miiverse · mixi · MyHeritage · MyLife · Myspace · Netlog · Odnoklassniki · Orkut · Plaxo · Qzone · Renren · Skyrock · Sonico.com · Storylane · Tagged · tvtag · Upcoming · Viadeo · Vkontakte · WeeWorld · Weibo · Wretch · Yahoo! Groups · Yahoo! Stars India · Yahoo! Upcoming · more sites...
Shopping/Retail Alibaba · AliExpress · Amazon · Apple Store · eBay · Printfection · RadioShack · Sears · Target · The Book Depository · ThinkGeek · Walmart
Software/code hosting Android Development · Alioth · Assembla · BerliOS · Betavine · Bitbucket · BountySource · Codecademy · CodePlex · Freepository · Free Software Foundation · GNU Savannah · GitHost · GitHub · GitHub Downloads · Gitorious · Gna! · Google Code · ibiblio · java.net · JavaForge · KnowledgeForge · Launchpad · LuaForge · Maemo · mozdev · OSOR.eu · OW2 Consortium · Openmoko · OpenSolaris · Ourproject.org · Ovi Store · Project Kenai · RubyForge · SEUL.org · SourceForge · TestFlight · tigris.org · Transifex · TuxFamily · Yahoo! Downloads
Torrenting/Piracy ExtraTorrent · EZTV · isoHunt · KickassTorrents · The Pirate Bay · Torrentz
Video hosting Academic Earth · Blip.tv · Epic · Google Video · Justin.tv · Nokia Trailers · Qwiki · Stickam · TED Talks · Twitch.tv · Ustream · Viddler · Viddy · Vimeo · Vstreamers · Yahoo! Video · YouTube · Famous Internet videos (Me at the zoo)
Web hosting Angelfire · Brace.io · BT Internet · CableAmerica Personal Web Space · Comcast Personal Web Pages · Extra.hu · FortuneCity · Free ProHosting · GeoCities (patch) · Google Business Sitebuilder · Google Sites · Internet Centrum · MBinternet · MSN TV · Nwnyet · Parodius Networking · Prodigy.net · Saunalahti Iso G · Swipnet · Tripod · University of Michigan personal webpages · Verizon Mysite · Verizon Personal Web Space · Webzdarma · Virgin Media
Web applications Mailman · MediaWiki · phpBB · Simple Machines Forum · vBulletin
Other AOL · Akoha · Ancestry.com · April Fools' Day · Amplicate · AutoAdmit · Bre.ad · Circavie · Cobook · Co.mments · Countdown · Distill · Dmoz · Easel · Electronic Frontier Foundation · FanFiction.Net · Feedly · Ficlets · FunnyExam.com · FurAffinity · Google Helpouts · Google Moderator · Google Reader · ICQmail · IFTTT · Jajah · JuniorNet · Lulu Poetry · Mochi Media · Mozilla Firefox · MyBlogLog · NBII · Neopets · Quantcast · Quizilla · Salon Table Talk · Slidecast · SOPA blackout pages · starwars.yahoo.com · TechNet · Toshiba Support · Volán · Widgetbox · Windows Technical Preview · Wunderlist · Zoocasa
Information A Million Ways to Die on the Web · Backup Tips · Cheap storage · Collecting items randomly · Data compression algorithms and tools · Dev · Discovery Data · DOS Floppies · Fortress of Solitude · Keywords · Naughty List · Nightmare Projects · Backup Tips · Rescuing floppy disks · Rescuing optical media · Site exploration · The WARC Ecosystem · Working with ARCHIVE.ORG
Projects Audit2014 · Faceoff · FlickrFckr · Froogle · INTERNETARCHIVE.BAK (Internet Archive Census) · IRC Quotes · ISP Hosting · JSMESS · JSVLC · Just Solve the Problem · Project Newsletter · University Web Hosting · Valhalla · Woohoo
Tools ArchiveBot · ArchiveTeam Warrior (Tracker) · Google Takeout · HTTrack · Video downloaders · Wget (Lua · WARC)
Teams Bibliotheca Anonoma · LibreTeam · URLTeam · Yahoo Video Warroom · WikiTeam
About Archive Team Introduction · Philosophy · Who We Are · Our stance on robots.txt · Why Back Up? · Software · Formats · Storage Media · Recommended Reading · Films and documentaries about archiving · Talks · In The Media · FAQ
Personal tools