Wget with WARC output

From Archiveteam
Revision as of 10:12, 15 April 2012 by Alard (Talk | contribs)

From the discussion about Working with ARCHIVE.ORG, we learn that it is important to save not just files but also HTTP headers. With Wget, that's difficult. With a few tricks you can keep the response headers, but there is no option to save the request headers. You also lose the response headers that don't produce an HTML page: Wget doesn't save redirects and 404 responses.

The development version of Wget can write its results to a WARC (Web ARChive file format) file, just like Heritrix and other archiving tools. With the WARC format, it's possible to save both the request and the response headers. It also provides a clean way to store redirects and 404 responses.

There is an additional advantage: if Wget writes these headers to a WARC file, it is no longer necessary to use the --save-headers option to save them at the top of each downloaded file. There is no need to remove these headers afterwards to produce a clean copy: the mirror produced by Wget is usable without post-processing.


Compiling

bzr branch bzr://bzr.savannah.gnu.org/wget/trunk
cd trunk
./bootstrap
./configure && make

Usage

To download a file and save the request and response data to a WARC file, run this:

src/wget "http://www.archiveteam.org/" --warc-file="at"

This will download the file to index.html, but it will also create a file at-00000.warc.gz. This is a gzipped WARC file that contains the request and response headers (of the initial redirect and of the wiki homepage) and the HTML data.

If you want to have a non-compressed WARC file, use the --no-warc-compression option:

src/wget "http://www.archiveteam.org/" --warc-file="at" --no-warc-compression

Saving one file is nice, but the --warc-file option becomes even more powerful if you combine it with Wget's --mirror option. (You may want to try this with a smaller site than the AT wiki.)

src/wget "http://www.archiveteam.org/" --mirror --warc-file="at"

If you uncompress at-00000.warc.gz and look at it, you'll see that it contains WARC records for every request and response: it is a complete copy of the mirrored site, while at the same time Wget also created the normal mirror of the site.
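
A quick way to get an overview of such a file is to count the records by type. The following is only a sketch: the function name warc_types is made up for this page, and it assumes that every record header carries a WARC-Type line, as the WARC format requires. A payload line that happens to start with "WARC-Type: " would be counted too.

```shell
# warc_types FILE.warc.gz
# Print each WARC record type found in FILE together with the number of
# records of that type. (Hypothetical helper, not part of Wget.)
warc_types() {
    gzip -dc "$1" |              # decompress all records to stdout
        tr -d '\r' |             # WARC header lines end in CRLF; drop the CR
        grep -a '^WARC-Type: ' | # keep only the record type lines
        awk '{ n[$2]++ } END { for (t in n) print t, n[t] }' |
        sort
}

# Example, after running the mirror command above:
# warc_types at-00000.warc.gz
```

For a mirror you would expect to see roughly as many request records as response records, plus a warcinfo record at the start.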

Options

--warc-file=FILENAME enables the WARC export. WARC files will be based on FILENAME: FILENAME-00000.warc.gz, FILENAME-00001.warc.gz et cetera.

--warc-max-size=NUMBER defines the maximum size of the WARC files. The default is no limit ("inf"). If you download a large site, the recommended limit is 1 GB; set the option to 1G to enable this limit. Note that this is a soft limit: files can get slightly larger than this, depending on the files you download.

--warc-header=STRING adds STRING as a custom header to the warcinfo record, e.g. "operator: Archive Team". This option can be used multiple times.

--warc-cdx writes a CDX index file to FILENAME.cdx. The CDX file will contain a list of the records and their locations in the WARC files.

--warc-dedup=FILENAME can be used to reduce the size of WARC files generated by a recrawl. FILENAME should point to a CDX file, generated with --warc-cdx in a previous run. For each file it downloads, Wget will check the CDX file to see if the response is listed there. If the exact file already exists, a "revisit" record with a reference to the previous record will be added to the WARC file, instead of a duplicate "response" record. Duplicate records are detected by comparing the SHA-1 digest of the payload of the response.

--no-warc-compression will write uncompressed WARC files. Compression is enabled by default. It is better to use the built-in compression than to compress the WARC files afterwards. The built-in compression will compress each record as an individual GZIP block, which allows other utilities to extract single records from the file.
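
To see why one GZIP block per record is useful, here is a small sketch that uses plain gzip instead of Wget. The file names and the "record" text are made up for the illustration; it assumes the usual gzip, cat, wc and tail utilities. It shows that separately compressed blocks can simply be concatenated, and that a single block can be decompressed on its own once you know the byte offset where it starts.

```shell
# Compress two "records" separately and concatenate the results,
# which mimics a file written with one gzip block per record.
demo=$(mktemp -d)
printf 'record one\n' | gzip -c > "$demo/first.gz"
printf 'record two\n' | gzip -c > "$demo/second.gz"
cat "$demo/first.gz" "$demo/second.gz" > "$demo/both.gz"

# The concatenation is still valid gzip data: this yields both records.
all=$(gzip -dc "$demo/both.gz")

# The second block starts right after the first one, so its byte offset
# equals the size of the first block. tail -c +N starts at byte N (1-based).
offset=$(wc -c < "$demo/first.gz")
second=$(tail -c +"$((offset + 1))" "$demo/both.gz" | gzip -dc)

echo "$all"
echo "$second"
```

If the whole WARC file is compressed afterwards as a single block, this kind of random access is not possible: a reader has to decompress everything before the record it wants.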

--no-warc-digests disables the SHA-1 digests. By default, SHA-1 digests will be calculated for the whole response block and the response payload. If you really need to, you can disable that.

--no-warc-keep-log can be set if you don't want the Wget log in the WARC file. By default, Wget will add the log file as a separate record to the WARC file.

--warc-tempdir=DIRECTORY sets the temporary directory used by the WARC writer. The system tempdir will be used by default.
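
The options above can be combined. The following sketch wraps a first crawl and a later recrawl in two small shell functions. Only the Wget options are taken from this page; the function names, the WGET variable and the example file names are made up, and the commands that actually start a download are left commented out.

```shell
# Path to the development build of Wget; set WGET to override it.
WGET=${WGET:-src/wget}

# crawl_first URL NAME
# First crawl: WARC files of about 1 GB each, a custom warcinfo header,
# and a CDX index written to NAME.cdx.
crawl_first() {
    "$WGET" "$1" --mirror \
        --warc-file="$2" \
        --warc-max-size=1G \
        --warc-header="operator: Archive Team" \
        --warc-cdx
}

# crawl_again URL NEW_NAME OLD_NAME
# Recrawl: use the CDX index of the earlier crawl (OLD_NAME.cdx) so that
# unchanged responses are stored as "revisit" records.
crawl_again() {
    "$WGET" "$1" --mirror \
        --warc-file="$2" \
        --warc-cdx \
        --warc-dedup="$3.cdx"
}

# Example (needs network access):
# crawl_first "http://www.archiveteam.org/" "at-first"
# crawl_again "http://www.archiveteam.org/" "at-second" "at-first"
```

The recrawl also writes its own CDX file, so it can serve as the reference for the next run.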

WARC file format

The WARC file format is an ISO standard. The official specification, ISO 28500:2009, is not available for free. However, the final draft is free, and is supposed to be technically equivalent to the official standard.

The WARC usage task force has published WARC implementation guidelines with additional recommendations.
