Google Reader/War room

This page is an archive of Archive Team's Google Reader backup project, kept here for the historical record.

Backing up historical feed data

Google Reader acts as a cache for RSS/Atom feed content, keeping deleted posts and deleted blogs readable (if you can recreate the RSS/Atom feed URL). After the Reader shutdown on July 1, 2013, only a small portion (100 posts per blog) will be available via the Feeds API, so it is imperative that we grab everything through the /reader/ API before then.

How you can help

Upload your feed URLs

We need to discover as many feed URLs as possible. Not all of them can be discovered through crawling, so please upload your OPML files. (If you have any private or password-protected feeds, please strip them out first.)

Upload OPML files and lists of URLs to:

http://allyourfeed.ludios.org:8080/
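
Stripping private feeds before uploading can be partly automated. The following is a minimal sketch, not part of the project's tooling: the function name public_feed_urls and the sample OPML are made up for illustration, and the only "private" feeds it detects are those with a username or password embedded in the URL. Feeds protected by a secret token in the path or query string still need to be removed by hand.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def public_feed_urls(opml_text):
    """Return feed URLs from an OPML document, skipping obviously private ones.

    Hypothetical helper: it only catches credentials embedded in the URL.
    """
    root = ET.fromstring(opml_text)
    urls = []
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if not url:
            continue  # folder outlines carry no feed URL
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # ignore anything that is not a plain web feed
        if parts.username or parts.password:
            continue  # user:password@host style feeds are treated as private
        urls.append(url)
    return urls


# Made-up sample input: one public feed and one feed with embedded credentials.
SAMPLE_OPML = """<opml version="1.0"><body>
<outline text="Blogs">
  <outline text="A" type="rss" xmlUrl="http://example.com/feed"/>
  <outline text="B" type="rss" xmlUrl="http://user:secret@example.org/private.xml"/>
</outline>
</body></opml>"""

for feed in public_feed_urls(SAMPLE_OPML):
    print(feed)
```

The printed list can be saved as a plain text file of URLs, which the collector above also accepts.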

Run the grab on your Linux machine

This project is not in the Warrior yet, so follow the install steps on these projects:

https://github.com/ArchiveTeam/greader-grab (grabs the actual text content of feeds)

https://github.com/ArchiveTeam/greader-directory-grab (searches for feeds using Reader's Feed Directory)

https://github.com/ArchiveTeam/greader-stats-grab (grabs subscriber counts and other data)

(Up to ~5GB of your disk space will be used; items are immediately uploaded elsewhere.)

Crawl websites to discover blogs and usernames

We need to discover millions of blog/username URLs on popular blogging platforms (which we'll turn into feed URLs).
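
As a rough illustration of what "turning into feed URLs" means, the sketch below maps a blog host name to a conventional feed location. The three platform paths are the commonly used defaults for those services; the table, the function name feed_url_for_host, and the choice of platforms are assumptions for this example, and the project's own rules may differ.

```python
# Assumed default feed paths for a few hosted blog platforms (illustrative only).
FEED_PATHS = {
    "blogspot.com": "/feeds/posts/default",
    "wordpress.com": "/feed/",
    "tumblr.com": "/rss",
}


def feed_url_for_host(host):
    """Return a likely feed URL for a blog host name, or None if unrecognized."""
    host = host.lower().rstrip(".")
    for platform, path in FEED_PATHS.items():
        suffix = "." + platform
        # Require a real subdomain; skip the platform's own www front page.
        if host.endswith(suffix) and host != "www" + suffix:
            return "http://" + host + path
    return None


print(feed_url_for_host("example.blogspot.com"))
print(feed_url_for_host("example.org"))
```

Hosts that do not match a known platform return None and would need a feed URL discovered some other way.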

Join #donereading and #archiveteam on EFnet if you'd like to help with this.

The counts listed below are underestimates; please ask on IRC for updated counts.

See https://github.com/ludios/greader-item-maker/blob/master/url_filter.py for additional sites not listed here.

Tools for URL discovery

git clone https://github.com/trivio/common_crawl_index
cd common_crawl_index
pip install --user boto
PYTHONPATH=. python bin/index_lookup_remote 'com.blogspot'

You can copy and edit bin/index_lookup_remote to print just the necessary information:

# In each snippet, url is an index key of the form
# 'com.blogspot.example/path:http' (reversed host, path, then ':' + scheme).

# Print entire URL:
	rest, schema = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + path

# Print just the subdomain:
	print '.'.join(url.split('/', 1)[0].split('.')[::-1])

# Print just the first two URL /path segments:
	rest, schema = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + '/'.join(path.split('/', 2)[0:2])

# Print just the first URL /path segment:
	rest, schema = url.rsplit(":", 1)
	domain, path = rest.split('/', 1)
	print schema + '://' + '.'.join(domain.split('.')[::-1]) + '/' + path.split('/', 1)[0]
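
The snippets above only work inside the lookup script's loop. For experimenting outside it, here is a standalone Python 3 sketch of the same transformations. It assumes the key layout implied by the snippets (reversed host name, then the path, then ':' and the scheme); the function names and the sample key are made up for illustration.

```python
def parse_index_key(key):
    """Split an index key into (scheme, host, path), un-reversing the host."""
    rest, scheme = key.rsplit(":", 1)
    reversed_domain, _, path = rest.partition("/")
    host = ".".join(reversed_domain.split(".")[::-1])
    return scheme, host, path


def full_url(key):
    """Rebuild the entire URL from an index key."""
    scheme, host, path = parse_index_key(key)
    return scheme + "://" + host + "/" + path


def first_segments(key, n):
    """Rebuild the URL, keeping only the first n /path segments."""
    scheme, host, path = parse_index_key(key)
    return scheme + "://" + host + "/" + "/".join(path.split("/")[:n])


key = "com.blogspot.example/2013/05/post.html:http"  # made-up sample key
print(parse_index_key(key)[1])  # just the host name
print(full_url(key))
print(first_segments(key, 2))
print(first_segments(key, 1))
```

Using partition instead of split means a key with no path at all yields an empty path rather than raising an error.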

Pipe the output through | uniq | bzip2 > sitename-list.bz2, check the result with bzless, and upload it to our OPML collector (http://allyourfeed.ludios.org:8080/).
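
If you are post-processing the list in Python rather than in the shell, the same uniq and bzip2 step can be sketched as below. This is an illustration only: the helper name write_unique_bz2 and the sample host names are made up. Like uniq, it removes only adjacent duplicates, so it relies on the input already being sorted or grouped.

```python
import bz2
import os
import tempfile
from itertools import groupby


def write_unique_bz2(lines, out_path):
    """Write lines to a bzip2 file, collapsing runs of adjacent duplicates.

    Returns the number of lines written.
    """
    count = 0
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        for line, _group in groupby(lines):
            out.write(line + "\n")
            count += 1
    return count


# Made-up sample data written to a temporary directory.
sample = ["a.blogspot.com", "a.blogspot.com", "b.blogspot.com"]
out_path = os.path.join(tempfile.mkdtemp(), "sitename-list.bz2")
written = write_unique_bz2(sample, out_path)

# Read the file back to check it, much as bzless would show it.
with bz2.open(out_path, "rt", encoding="utf-8") as f:
    result = f.read().splitlines()
print(written, result)
```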

Add to the above list of blog platforms

See: