Difference between revisions of "INTERNETARCHIVE.BAK/git-annex implementation"

Revision as of 23:40, 4 March 2015

This page addresses a git-annex implementation of INTERNETARCHIVE.BAK.

For more information, see http://git-annex.branchable.com/design/iabackup/.

Some quick info on Internet Archive

Data model

IA's data is organized into collections and items. One collection contains many items.

Here's an example collection and item in that collection: https://archive.org/details/archiveteam-fire, https://archive.org/details/proust-panic-download-warc.

Browsing the Internet Archive

In addition to the web interface, you can use the Internet Archive command-line tool. The tool currently requires a Python 2.x installation. If you use pip, run

pip install internetarchive

From there, you can run `ia search 'collection:*'` to get information on all collections as a JSON array. (It's a big list.) See https://pypi.python.org/pypi/internetarchive#command-line-usage for more information.
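For scripted browsing without the command-line tool, a minimal sketch using only the Python standard library is shown here. It assumes archive.org's `advancedsearch.php` endpoint with its `q`, `fl[]`, `rows`, `page` and `output` parameters; the helper names `build_search_url` and `search`, and the `item_size` field in the usage line, are illustrative choices and not part of the `ia` tool.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public search endpoint on archive.org.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(query, fields=("identifier",), rows=50, page=1):
    """Build a search URL for a query such as 'collection:archiveteam-fire'."""
    params = [("q", query)]
    # Each requested field is passed as a repeated fl[] parameter.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("page", page), ("output", "json")]
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)


def search(query, **kwargs):
    """Fetch one page of results (needs network access)."""
    with urllib.request.urlopen(build_search_url(query, **kwargs)) as resp:
        return json.load(resp)["response"]["docs"]
```

For example, `search("collection:archiveteam-fire", fields=("identifier", "item_size"))` would list items in the example collection above, one page at a time.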

First tasks

<closure> SketchCow: I have to work on git-annex development all day (what a fate), not this, and I'm doing 7drl 24x7 all next week. Some first steps others could do:
<closure> - pick a set of around 10 thousand items whose size sums to around 8 TB
<closure> - build map from Item to shard. Needs to scale well to 24+ million. sql?
<closure> - write ingestion script that takes an item and generates a tarball of its non-derived files. Needs to be able to reproduce the same checksum each time run on an (unmodified) item. I know how to make tar and gz reproducible, BTW
<closure> - write client registration backend, which generates the client's ssh private key, git-annex UUID, and sends them to the client (somehow tied to IA library cards?)
<closure> - client runtime environment (docker image maybe?) with warrior-like interface
<closure> (all that needs to do is configure things and get git-annex running)
<closure> could someone wiki that? ta
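
As a starting point for the ingestion script task above, here is one possible sketch of a reproducible tarball in Python. It is not closure's method, only a common approach: add files in sorted order, zero out timestamps and ownership, and write a gzip header with no filename and no timestamp. The function name `reproducible_tarball` and the `skip` predicate (a stand-in for "is this a derived file?") are hypothetical, and the sketch assumes the item's files have already been downloaded into a local directory.

```python
import gzip
import io
import os
import tarfile


def _normalize(info):
    """Strip metadata that varies between runs or machines."""
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    info.mtime = 0
    info.mode = 0o644
    return info


def reproducible_tarball(src_dir, skip=lambda relpath: False):
    """Return gzip-compressed tar bytes for the regular files in src_dir.

    skip(relpath) is a hypothetical hook for excluding derived files.
    The archive is built in memory, so this suits small items only.
    """
    paths = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, src_dir)
            if not skip(rel):
                paths.append((rel, full))

    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Sorted order makes the member sequence independent of the filesystem.
        for rel, full in sorted(paths):
            tar.add(full, arcname=rel, recursive=False, filter=_normalize)

    out = io.BytesIO()
    # mtime=0 and an empty filename keep the gzip header constant.
    with gzip.GzipFile(fileobj=out, mode="wb", mtime=0, filename="") as gz:
        gz.write(raw.getvalue())
    return out.getvalue()
```

Hashing the returned bytes (for instance with `hashlib.sha256`) should give the same checksum on repeated runs over an unmodified item, provided the same Python and zlib versions are used; the compressed stream is not guaranteed to be identical across different zlib builds.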