Audit2014

We've uploaded a bunch of stuff:

subject:archiveteam = 13,785 items
collection:archiveteam = 60,172 items
subject:archiveteam AND NOT collection:archiveteam = 2,028 items

(The 3rd one should eventually be close to empty.)
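
For anyone who wants to refresh these counts, here is a rough sketch in Python. It assumes the archive.org advanced search endpoint (advancedsearch.php) accepts the q, rows and output=json parameters and reports the total under response.numFound; the helper names are made up for this page, so treat it as a starting point rather than a tested tool.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the advanced search endpoint and its parameter names.
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"

# The three queries listed above.
QUERIES = [
    "subject:archiveteam",
    "collection:archiveteam",
    "subject:archiveteam AND NOT collection:archiveteam",
]


def build_search_url(query):
    """Build a search URL that asks for the hit count only (rows=0)."""
    params = {"q": query, "rows": 0, "output": "json"}
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_num_found(payload):
    """Pull the total hit count out of a decoded JSON response."""
    return int(payload["response"]["numFound"])


def count_items(query, timeout=30):
    """Fetch the number of items matching a query (needs network access)."""
    with urllib.request.urlopen(build_search_url(query), timeout=timeout) as resp:
        return parse_num_found(json.load(resp))


def report(queries, counter=count_items):
    """Format one 'query = N items' line per query, like the list above."""
    return [f"{q} = {counter(q):,} items" for q in queries]
```

Calling print("\n".join(report(QUERIES))) should produce a list in the same shape as the one above, with whatever the counts are today.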

Let's go through the list and make sure it's categorized, has decent metadata, etc.

Many of our uploads are quite large, and have been broken into many items on Archive.org. We'll group them together here and verify each set all at once.

Things to check

Collection: Are all the related items grouped into a collection?
Description: Can a visitor figure out what each item represents? Items in a collection don't need to repeat the description of the collection, but it'd be nice if they had a sentence or two, and information about how the item differs from the other items in the collection ("MP3s from earbits.com, files starting with c." from the Earbits items is a good example.)
Inclusion: Are all the related items included in the same collection?
Categorization: Can a visitor find the item by browsing the collections?
Cross-references: Can a visitor find other items in a set, starting at any item in the set? Can a visitor find the index of a large set starting from any part of it?
Indexing: If the item is a collection of sub-items, is one of these sub-items an index of the others? (This is a complicated thing to check for and to create when it doesn't exist, so we can come back to this after we've checked the rest.)
Your suggestion here: this is just off the top of my head.
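
Some of these checks can be roughed out in code. The sketch below covers only Inclusion, Description and Categorization, and it assumes an item's metadata is available as a plain dict with collection, description and subject fields. The function names, the five-word threshold and the example collection name are all hypothetical; cross-references and indexing still need a human.

```python
def as_list(value):
    """Normalize a metadata field that may be missing, a string, or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def audit_item(metadata, expected_collection, min_description_words=5):
    """Return a list of checklist problems for one item's metadata dict."""
    problems = []

    # Inclusion: is the item in the collection it belongs to?
    if expected_collection not in as_list(metadata.get("collection")):
        problems.append("inclusion: not in " + expected_collection)

    # Description: is there at least a sentence saying what the item is?
    description = " ".join(as_list(metadata.get("description"))).strip()
    if len(description.split()) < min_description_words:
        problems.append("description: missing or too short")

    # Categorization: are there any subject keywords to browse by?
    if not as_list(metadata.get("subject")):
        problems.append("categorization: no subject keywords")

    return problems
```

An empty list means the item passed these three checks; it says nothing about whether the description is actually any good.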

High-level Collections

https://archive.org/details/web
- https://archive.org/details/archiveteam
  - https://archive.org/details/archiveteam-fire
  - https://archive.org/details/archivebot
- https://archive.org/details/wikiteam

Current Sub-Collections at Archive Team

Collection	Status	Auditor	Item Count	Has an Index	Description of Audit
No Category (earbits)	Unaudited		98	Yes	The items are not in a collection. Most items are WARCs; the rest need additional work if anyone is going to be able to find the exact MP3 they want.
archiveteam_ptch	Audited	db48x	50	No	Collection has great description, but no categories. Items in collection are WARCs. One item is not included in the collection: deathy-s3-test-ptch
archiveteam_flowerpot	Audited	db48x	406	No	The description of the collection is anemic, but each item is well-identified.
github_files	Audited	db48x	1	No	Pretty bad shape. Only one item in the collection, and that's only half the data. Was the rest never uploaded? Has no description, keywords or other metadata. Other Github items could be included, such as this repository index, and these other file downloads
justintv	Audited	db48x	189	No Partial (Src)	Decent description, but no other metadata. There are 51 other 'justintv' items, but none of them look to be from us.
archiveteam_mochimedia	Audited	db48x	9	No	Collection includes Mochi's notice about the shutdown, but no other context. The items are all WARCs, and all have CDXs and JSON indexes, but there's no overall index. Index can be easily generated from this 26MB JSON file--chfoo
archivebot	Unaudited		1070	Sort of: Viewer	ArchiveBot; The viewer doesn't seem to index into crawls; there's no link from the collection or the items to the viewer (or anywhere else)
archiveteam_yahooblogs and archiveteam_yahooblog	Audited	db48x	49	No	Collection description is just the shutdown notice (and apparently quite a brief one at that) with no other context. Items are all WARCs, and all have CDXs and JSON indexes, but there's no overall index. One item is orphaned in a collection of its own; apparently caused by a typo in the collection name.
archiveteam-splinder	Unaudited		53		See Splinder
archiveteam-picplz	Audited	db48x	141	Yes	The collection description is just the shutdown message, with no other context. Items are tarballs containing WARCs. There is an index, but it's not a part of the collection ([1]). There's also a search page for the index, which is great.
archiveteam_puush	Audited	db48x	1781		The collection description is just the shutdown notice, but it's better than average; it includes some context. The items are all WARCs with CDXs, but there's no central index.
archiveteam_upcoming	Audited	dashcloud1	142	No	The collection description only describes the site, not the items themselves. Individual items have no description of any kind.
archiveteam_randomfandom	Audited	dashcloud1	42	Yes	Short collection description, but has an index, and every collection item is well described. Index is located right on collection page.
archiveteam_antecedents	Audited	db48x	46	N/A	This collection represents multiple sites, rather than multiple parts of a single large site. The collection description is quite brief, but each item appears to have a paragraph describing what the site is/was, as well as some basic metadata such as keywords. All the items appear to be WARCs with CDXs
archiveteam_jazzhands	Audited	db48x	443	No	This one is a collection of items from multiple sites, but those sites are also broken up into multiple items based on when they were scanned. The items have brief descriptions and some keywords, and are WARCs with CDXs. A good way to improve this would be to make collections for each site as subcollections.
archiveteam-mobileme-hero	Unaudited		4007	Yes (source)
archiveteam_myopera	Audited	dashcloud1	155	No	Collection page has a nice description of the site and the items. The items all appear to be WARCs, and have no descriptions or keywords of any kind on them.
archiveteam_bebo	Unaudited	JesseW	2867		They appear to all be WARCs, most uploaded on the same day; it's not clear if all of them are in the Wayback Machine or not. Each item has no description or context.
archiveteam_dogster	Audited	jscott	55	???	Collection well described. Wayback Machine-Ready WARCs, all integrated.
hyves	Unaudited		517		Hyves
archiveteam_wretch	Unaudited		2163		Wretch; WARCs
archiveteam_xanga	Unaudited		454		Xanga; WARCs
twitterstream	Unaudited		41		Twitter; according to reviews, at least one file is empty.
pastebinpastes	Unaudited		223		These are tarballs (usually less than 100 MB), containing each paste in a separate file. Most recently updated on July 1, 2014
archiveteam_zapd	Unaudited		19		Zapd; WARCs
archiveteam_patch	Unaudited		38		Patch; WARCs
archiveteam_posterous	Unaudited		444		Posterous; WARCs
archiveteam_greader	Unaudited		368		Google Reader; 3 categories of WARCs: Directory, Stats & general. It would probably be good to also put them in separate collections. There is also a combined stats item.
archiveteam_ignsites	Unaudited		81		IGN (needs link to archive); Each item contains a particular subdomain. Descriptive names. (primeblog.ign.com item needs to be added to archiveteam and web collections)
archiveteam_g4tv_forums	Unaudited		74		ARCs from wikipedia:G4 (TV channel), mainly from the forum
archiveteam-yahoovideo	Unaudited		156		Yahoo! Video; various inconsistencies in naming and categories; some items contain zip files, while others contain tar files.
archive-team-friendster	Unaudited		137	Maybe -> archiveteam-friendster-index item	Friendster; early (2011) project, variety of formats
archiveteam_formspring	Unaudited		1477		Formspring; WARCs; some duplication in collection description
archiveteam_yahoo_messages	Unaudited		17		Yahoo! Messages; WARCs; Minimal description on collection, none on items
archiveteam_punchfork	Unaudited		47	Yes	Punchfork; Needs link to index from collection description (and item descriptions); three different types of items, unclear differences
yahoo_korea_blogs	Unaudited		10		WARCs; no item descriptions
archiveteam-cinch	Unaudited		20	No	Cinch.fm; 10 items, in both WARC and tar formats
archiveteam_dailybooth	Unaudited		203	Yes	DailyBooth; link to index on collection page needs adjusting; images seem to be downloadable; individual items lack descriptions
archiveteam_weblognl	Unaudited		26	No	Weblog.nl; no English-language description
stage6	Unaudited		790		Videos from wikipedia:Stage6; many seem to be unavailable from IA, due to "issues with the item's content."
googlegroups-part2	Unaudited		27	No	Google Groups; each item contains a single tar file (ranging in size from 300 MB to over 40 GB); the tar files contain separate zip files for each group; the zip files contain the actual files. This should probably be grouped with the other grabs of Google Groups.
archiveteam-btinternet	Unaudited		8	No	WARCs
archiveteam-qaudio-archive	Unaudited		7	No	Many small WARCs in each item; lengthy explanation in collection description, none in each item
webshots-freeze-frame	Unaudited		2459	No	Webshots; WARCs
tabblo-archive	Unaudited		1806	Maybe: groups item	Tabblo; 9 MegaWARCs, the rest of the items are groups of individual accounts as zip files
archiveteam-fortunecity	Unaudited		55	Yes	FortuneCity; 26 "Set" items (containing a single large tar in each one); also 26 WARC items, and one leftovers item
2012-04-30-wikimedia-images-snapshot	Unaudited	Nemo	148	Not really	Should become a subcollection of "wikicollections", so that it's next to "wikimediacommons". The "remote" tarballs partially overlap with xowa items nowadays. If a complete mirror of the Your.Org tarballs is desired, we should list it at [2] with some maintenance information. It's not clear whether investing N TB at IA is a priority here, nor whether IA expects WikiTeam to do the uploads instead (in that case, ask Hydriz or Arkiver). Also, the Your.Org dumps are currently blocked on the lack of an rsync server on Wikimedia servers.
archiveteam-anyhub	Unaudited		39		AnyHub; 18 each WARC & tar items, and one called the "Blue Collection"
archiveteam-fileplanet	Unaudited		675		FilePlanet
archiveteam-umich-save	Unaudited		52
archiveteam-geocities	Unaudited		12		Geocities
archiveteam-fire	Unaudited		7135		A vast and misc. collection; needs quite a bit of TLC (www.asiatorrents.me-subtitle-1-to-38406-20141205 item needs to be added to the archiveteam and web collections)
archiveteam-mypodcast	Unaudited		383		Each item is a separate podcast, containing individual sound files, playable through the IA interface; there is also a misc item
archiveteam-googlegroups	Unaudited	JesseW	1,348	Partial (each item has a list of groups, but there's no overall list)	Google Groups; This is divided into items by the initial two letters (or digits or underscore). The item for "th" has an inconsistent title and category.
isohunt dumps 1 2 3	Audited	vitzli	3	Partial	These are not yet in a dedicated collection, and have never been post-processed. Some of the .torrent files may actually be error pages. This needs work, and proper full auditing. Visit summary page or IsoHunt for more details
No Category (streetfiles)	Unaudited
archiveteam_yahoovoices	Unaudited		30	No	Yahoo! Voices; WARCs
archiveteam_twitchtv	Unaudited		2213	Yes (source)	Twitch.tv
archiveteam_fotopedia	Unaudited		40		Fotopedia; WARCs
archiveteam_canvas	Unaudited		47		Canv.as; WARCs
archiveteam_ancestry	Unaudited		82		Ancestry.com; WARCs
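
To see how far along the audit is, the table can be tallied mechanically. This is a loose sketch that assumes the table has been copied out as tab-separated text with the columns in the order shown in the header row; summarize is a made-up helper, and the sample rows below carry only the first few columns of three real rows. Rows with a missing or unparseable item count are counted by status but add nothing to the item totals.

```python
from collections import Counter


def summarize(table_text):
    """Count collections and items per Status in a tab-separated table."""
    collections_by_status = Counter()
    items_by_status = Counter()
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.split("\t")]
        # Skip blank lines, rows with no Status cell, and the header row.
        if len(cells) < 2 or not cells[1] or cells[1] == "Status":
            continue
        status = cells[1]
        collections_by_status[status] += 1
        # Item Count is the fourth column; it may be absent or use commas.
        if len(cells) > 3:
            digits = cells[3].replace(",", "")
            if digits.isdigit():
                items_by_status[status] += int(digits)
    return collections_by_status, items_by_status


# Three rows from the table above, with the description column left out.
SAMPLE = "\n".join([
    "Collection\tStatus\tAuditor\tItem Count\tHas an Index",
    "archiveteam_ptch\tAudited\tdb48x\t50\tNo",
    "archiveteam-googlegroups\tUnaudited\tJesseW\t1,348\tPartial",
    "No Category (streetfiles)\tUnaudited",
])
```

Running summarize over the full table gives a quick picture of how many collections (and roughly how many items) are still waiting for an auditor.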

In progress???

But what happened after? Where are the archives?

BerliOS
Deletionpedia
Delicious
ExtraTorrent
Free ProHosting
Google Video
Ispygames
Len Sassaman Project
Lulu Poetry
Prodigy.net
Resedagboken
ScreenshotsDatabase.com
Spanish Revolution: Is this finished?
University of Michigan personal webpages
Wallbase
Wallhaven
Webmonkey
Widgetbox
Windows Live Spaces

Oddities, Mislocations, and To Do

https://archive.org/search.php?query=earbits Earbits gathering is in the wrong place and needs additional versions.

To be moved to better collection

https://archive.org/details/archiveteam-fileplanet is a well-done collection with a description that goes into detail about the site... if only it had any of the items. Instead, they are dumped in Community Texts. Nothing in the item names even ties them to archiveteam, despite them clearly being from us. https://archive.org/search.php?query=FileplanetFiles seems to bring them up.

Collections

(The items within them also need to be added to the archiveteam and web collections.)

WARC

FTP

Misc

Missing

Yahoo!_Blog: What happened to the Vietnam archives? Does anyone have a copy or at least a blurry screenshot of the Korean shutdown notice?
