Difference between revisions of "Talk:INTERNETARCHIVE.BAK"

From Archiveteam

Revision as of 07:27, 2 March 2015

A note on the end-user drives

I feel it is really critical that the drives or directories sitting in the end-user's location be absolutely readable as a file directory containing the files, even if that directory is inside a .tar or .zip or .gz file. Making it into an encrypted item should not happen, unless we make a VERY SPECIFIC and redundant channel for such a thing. --Jscott 00:01, 2 March 2015 (EST)

Potential solutions to the storage problem

  • Tahoe-LAFS - decentralized (mostly), client-side encrypted file storage grid
    • Requires central introducer and possibly gateway nodes
    • Any storage node could perform a Sybil attack until a feature for client-side storage node choice is added to Tahoe.
  • git-annex - allows tracking copies of files in git without them being stored in a repository
    • Also provides a way to know what sources exist for a given item. git-annex is not (AFAIK) locked to any specific storage medium. -- yipdw

Other anticipated problems

  • Users tampering with data - how do we know data a user stored has not been modified since it was pulled from IA?
    • Proposed solution: have multiple people make their own collection of checksums of IA files. --Mhazinsk 00:10, 2 March 2015 (EST)
    • All IA items already include checksums in their _files.xml, so there could be an effort to back up these XML files in more locations than the data itself (this should be feasible, since each one is quite small).
  • "Dark" items (e.g. the "Internet Records" collection)
    • There are classifications of items within the Archive that should be considered for later waves, and not this initial effort. That includes dark items, television, and others.
      • It seems like this would include a lot of what we would want to back up the most though, e.g. a substantial percentage of the books scanned are post-1923 and not public
  • Data which may be illegal in certain countries/jurisdictions and expose volunteers to legal risk (terrorist propaganda, pornography, etc.)
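
As a rough illustration of the checksum idea under "Users tampering with data", here is a minimal Python sketch of how a volunteer's copy of an item could be checked against its _files.xml. It assumes the manifest is a list of `file` elements, each with a `name` attribute and an `md5` child element; that layout, and the function names `load_expected_md5`, `md5_of`, and `verify_item`, are assumptions for the example and not an agreed design. It also only helps if the manifest itself comes from a trusted copy, which is the point of storing the XML files in more places than the data.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def load_expected_md5(files_xml_text):
    """Map file name -> expected MD5, read from the text of a _files.xml.

    Assumed layout: <files><file name="..."><md5>...</md5></file>...</files>.
    Entries without an <md5> child are skipped.
    """
    root = ET.fromstring(files_xml_text)
    expected = {}
    for entry in root.iter("file"):
        name = entry.get("name")
        md5 = entry.findtext("md5")
        if name and md5:
            expected[name] = md5.strip().lower()
    return expected


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_item(item_dir, files_xml_text):
    """Return {name: status}; status is 'ok', 'modified', or 'missing'."""
    item_dir = Path(item_dir)
    report = {}
    for name, expected in load_expected_md5(files_xml_text).items():
        path = item_dir / name
        if not path.is_file():
            report[name] = "missing"
        elif md5_of(path) == expected:
            report[name] = "ok"
        else:
            report[name] = "modified"
    return report
```

A volunteer (or anyone auditing them) would call `verify_item` with the item's directory and the text of a manifest fetched from an independent location, then look for anything not marked 'ok'. MD5 is used here only because the comment above refers to the checksums already in the manifest; a scheme where several people keep their own checksum collections could use a stronger hash in the same way.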