| url shortening was a fucking awful idea|
|Archiving status||In progress...|
TinyURL, bit.ly and other similar services allow long URLs to be converted to smaller ones on their specific service; the small URL is visited by a consumer and their web browser is redirected to the long URL.
Such services are a ticking timebomb. If they go away, get hacked, or sell out, millions of links will be lost (see Wikipedia: Link Rot). Archive.org/301Works is acting as an escrow for URL shortener databases, but it relies on URL shorteners to actually hand over their databases. Even 301Works founding member bit.ly does not actually share its database, and most other big shorteners don't share theirs either.
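The redirect step described above can be checked by hand. The following is a minimal sketch, assuming the shortener answers with a plain HTTP redirect and a `Location` header; the function names `redirect_target` and `resolve` are made up for illustration and are not part of any tool mentioned on this page. Shorteners that use an HTML redirect instead would return `None` here.

```python
import http.client
from urllib.parse import urlsplit

# HTTP status codes that signal a redirect.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(status, headers):
    """Return the Location value if the response is a redirect, else None.

    `headers` is a list of (name, value) pairs, as returned by
    http.client.HTTPResponse.getheaders().
    """
    if status not in REDIRECT_STATUSES:
        return None
    # Header names are case-insensitive, so normalise before looking up.
    lowered = {name.lower(): value for name, value in headers}
    return lowered.get("location")


def resolve(short_url, timeout=10):
    """Ask the shortener where short_url points, without following the redirect."""
    parts = urlsplit(short_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    # Keep the query string: some shorteners put the code there.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("HEAD", target)
        response = conn.getresponse()
        return redirect_target(response.status, response.getheaders())
    finally:
        conn.close()
```

Calling `resolve()` on a short URL returns the long URL if the service is still alive and redirecting, which is exactly the mapping that is lost when a shortener disappears.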
Who did this?
- User:Scumola started this wiki page
- User:Chronomex started the Urlteam scraping effort
- User:Soult helps with scraping
- User:Jeroenz0r helps with scraping (and stalking Soult)
The fine folks at archive.org have provided us with upload permissions to the 301Works archive: http://www.archive.org/details/301utm. They unfortunately do not want to make the uploaded files downloadable, but the same data is in our torrents too, just in a different format (we use tab-delimited, xz-compressed files while 301Works uses comma-delimited uncompressed files).
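Going from one format to the other is mechanical. Here is a rough sketch, assuming each line of the torrent dump holds a shortcode and a long URL separated by a single tab; that column layout, the file names, and the function name `tsv_xz_to_csv` are assumptions for illustration, not a description of the actual dump files.

```python
import csv
import lzma


def tsv_xz_to_csv(src_path, dst_path):
    """Convert a tab-delimited, xz-compressed dump to an uncompressed CSV file.

    Assumes two columns per line: shortcode <TAB> long URL.
    Returns the number of rows written.
    """
    rows = 0
    with lzma.open(src_path, "rt", encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # csv.writer quotes fields that contain commas, which long URLs often do.
        writer = csv.writer(dst)
        for line in src:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so tabs inside the URL survive.
            shortcode, _, long_url = line.partition("\t")
            writer.writerow([shortcode, long_url])
            rows += 1
    return rows
```

Streaming line by line keeps memory use flat even for dumps with many millions of rows.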
- TinyBack (written in ruby by User:Soult)
- User:Chronomex wrote his own Perl-based scraper
- The Monkeyshines algorithmic scraper has a tool to efficiently scrape URL shortening services -- see the examples/shorturls directory. It scales efficiently to tens of millions of saved URLs. It uses a read-through cache to avoid re-requesting URLs, and it allows multiple scrapers to run on different machines while sharing the same lookup cache. You can either feed it a list of bare URLs, or have it randomly try either base-36 or base-62 URLs. With it, User:Mrflip has gathered about 6M valid URLs pulled from Twitter messages so far.
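To make the two ideas in the Monkeyshines description concrete, here is a small sketch of random base-36/base-62 code generation plus a read-through cache. This is not Monkeyshines code: the class `ReadThroughCache`, the function `random_code`, and the ordering of the alphabets are all assumptions for illustration, and a plain dict stands in for the lookup cache that real scrapers would share between machines.

```python
import random
import string

# Assumed alphabets; real shorteners may order or restrict characters differently.
BASE36 = string.digits + string.ascii_lowercase
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase


def random_code(length, alphabet=BASE62, rng=random):
    """Return a random shortcode of the given length drawn from alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(length))


class ReadThroughCache:
    """Answer from the cache when possible; otherwise fetch once and remember."""

    def __init__(self, fetch, store=None):
        self._fetch = fetch  # callable: code -> long URL, or None if unused
        # Any dict-like object works; a shared store would go here.
        self._store = {} if store is None else store
        self.fetches = 0  # number of real lookups performed

    def get(self, code):
        if code not in self._store:
            # Misses (None) are stored too, so dead codes are not re-requested.
            self._store[code] = self._fetch(code)
            self.fetches += 1
        return self._store[code]
```

Caching misses as well as hits matters when a service penalises repeated requests for non-existent codes.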
Or just ask!
Here's a template that worked for me at least once. Well, data is pending but the site owner is gung-ho.
Try sending an email to the website owner:
Hello!
I'm working with Jason Scott of textfiles.com and other members of the Archive Team. Since the recent scare involving http://tr.im/'s announced (and then retracted) imminent demise, we've been working to archive all the links from URL shorteners around the Internet.
If I'm not mistaken, you operate urlx.org. Would you be so kind as to share with us a copy of your URL database? We'll do our best to preserve this data forever in a useful way.
We are already very far along in scraping links from tr.im, but it's faster (and friendlier!) to contact site owners asking for a copy of their data than it is to scrape. We've got a domain registered, urlte.am, and all links will be available for redirect in the format: http://urlx.org.urlte.am/av3
If you could help us, that would be excellent!
Thank you,
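As an aside to the template, the redirect format it mentions (the shortener's hostname with `.urlte.am` appended, path unchanged) can be sketched as a small helper. The function name `mirror_url` is hypothetical, and the rule is inferred only from the single example `http://urlx.org.urlte.am/av3` in the email; the real service may behave differently.

```python
from urllib.parse import urlsplit, urlunsplit

# Suffix taken from the example in the email template.
MIRROR_SUFFIX = ".urlte.am"


def mirror_url(short_url):
    """Map a short URL to its assumed urlte.am mirror address."""
    parts = urlsplit(short_url)
    if not parts.hostname:
        raise ValueError("not an absolute URL: %r" % short_url)
    # Use the bare hostname so a port number never ends up inside the mirror host.
    host = parts.hostname + MIRROR_SUFFIX
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

For the example in the email, `mirror_url("http://urlx.org/av3")` gives `http://urlx.org.urlte.am/av3`.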
The new table includes shorteners we have already started to scrape.
|Name||Est. number of shorturls||Scraping done by||Status||Comments|
|TinyURL||1000000000||User:Soult||5-letter codes done, on halt due to being banned (2010-12-20)||non-sequential, bans IP for requesting too many non-existing shorturls|
|bit.ly||4000000000||User:Soult||lots and lots of scraping needed (2011-03-25)||non-sequential|
|goo.gl||??||User:Scumola||started (2011-03-04)||goo.gl throttles pulls|
|is.gd||534183259||User:Chronomex/User:Soult||probably got about 95% before switch to non-sequential||now non-sequential, new software version added crappy rate limiting|
|ff.im||?||User:Chronomex||||only used by FriendFeed, no interface to shorten new URLs|
|4url.cc||1279 (2009-08-14)||User:Chronomex||done (2009-08-14)||dead (2011-02-15)|
|xs.md||3084 (2009-08-15)||User:Chronomex||done||dead (2010-11-18)|
|url.0daymeme.com||14867 (2009-08-14)||User:Chronomex||done||dead (2010-11-18)|
|tr.im||1990425||User:Soult||got what we could||dead (2011-12-31)|
|adjix.com||?||User:Jeroenz0r||Already done: 00-zz, 000-zzz, 0000-izzz.||case-insensitive, incremental|
|rod.gs||?||User:Jeroenz0r||Done: 00-ZZ, 000-2Qc||case-sensitive, incremental, server can't keep up with all the requests.|
|biglnk.com||?||User:Jeroenz0r||Done: 0-Z, 00-ZZ, 000-ZZZ||case-sensitive, incremental|
|go.to||60000||User:Asiekierka||Done: ~45000 (go.to network links only: goto_dump.zip)||no codes, only names, google-fu only gives the first 1000 results for each, thankfully most domains have less|
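For the incremental shorteners in the table, ranges such as 00-zz or 000-zzz can be walked by treating each code as a fixed-width number in base 36 or base 62. The sketch below shows one way to do that; the alphabet ordering (digits, then lowercase, then uppercase) is an assumption and may not match what a given service actually uses, and the helper names are made up for illustration.

```python
import string

# Assumed orderings; check a few real codes from the service before relying on them.
BASE36 = string.digits + string.ascii_lowercase  # case-insensitive services
BASE62 = BASE36 + string.ascii_uppercase         # case-sensitive services


def decode(code, alphabet):
    """Interpret code as a number written in the given alphabet."""
    value = 0
    for ch in code:
        value = value * len(alphabet) + alphabet.index(ch)
    return value


def encode(value, alphabet, width):
    """Write value in the given alphabet, left-padded to width characters."""
    digits = []
    for _ in range(width):
        value, rem = divmod(value, len(alphabet))
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def code_range(start, end, alphabet):
    """Yield every code from start to end inclusive, keeping the same width."""
    if len(start) != len(end):
        raise ValueError("start and end must have the same length")
    width = len(start)
    for value in range(decode(start, alphabet), decode(end, alphabet) + 1):
        yield encode(value, alphabet, width)
```

Under that assumed ordering, `code_range("00", "zz", BASE36)` covers 36² = 1,296 codes, which gives a quick estimate of how many requests a range will take before starting a scrape.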
List last updated 2009-08-14.
- 6url.com - HTML redirect
- ad.vu - mirror of adjix.com
- budurl.com - Appears non-incremental
- cli.gs - Appears non-incremental
- decenturl.com - Not at all easy to scrape.
- doiop.com - Appears non-incremental
- easyurl.net - Appears non-incremental: http://easyurl.net/afd2f
- ilix.in - HTML redirect
- imfy.us - requires a recaptcha to get to the linked site, and avast goes nuts.
- jdem.cz - Incremental with random (?) last digit: http://jdem.cz/bw388
- metamark.net / xrl.us - ? http://xrl.us/bfabog
- myurl.in - http://myurl.in/xtP5H / http://urlgator.com/xtP5H / http://ug4.me/xtP5H / http://link-ed.in/xtP5H - HTML redirect
- minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh
- notlong.com - Appears to be alpha-only: http://yeitoo.notlong.com/
- nutshellurl.com - Appears incremental. 301s to a redirector script, which then 301s you to the destination.
- ow.ly - I can't get it to work.
- pnt.me - Doesn't appear guessable, too big a space to bruteforce: http://pnt.me/FzAblc
- redirx.com - Lowercase alpha only, appears sequential or guessable: http://redirx.com/?wyok
- s3nt.com - Probably sequential. http://s3nt.com/aa goes somewhere different from /ab
- shortlinks.co.uk - Working again.
- short.to - Domain is parked - Probably sequential/loweralpha: http://short.to/msmp
- shorturl.com - Probably sequential/loweralpha: http://alturl.com/wqok
- shrinklink.co.uk - Doesn't appear sequential: http://www.shrinklink.co.uk/45bmx , www.shrinklink.co.uk/npk6xp
- shrinkurl.us - Always reports the URL as malformed
- shrt.st - Appears incremental: http://shrt.st/vpz
- simurl.com - Doesn't appear guessable: http://simurl.com/panpes
- shorl.com - Doesn't appear guessable: http://shorl.com/tisikestibahu
- smarturl.eu / joturl.com / zip.sm - Doesn't appear guessable, HTML redirect.
- snipr.com - Appears incremental: http://snipr.com/27nvst http://snipr.com/27nvtt
- snipurl.com - See snipr.com above
- snurl.com - See snipr.com above
- surl.co.uk - Many shortening options.
- tighturl.com - Appears incremental: http://tighturl.com/30xu http://tighturl.com/30xv
- tiny.cc - Appears non-incremental
- tweetburner.com / twurl.nl - Appears incremental
- ur1.ca - Database is downloadable from website directly.
- url9.com - Sequential, alphanumeric. Leading 0s are significant.
- urlx.org - Owner has agreed to share his database
- xrl.us - see metamark.net
- goo.gl - Google
- fb.me - Facebook
- y.ahoo.it - Yahoo
- youtu.be - YouTube
- t.co? - Twitter
- post.ly - Posterous
- wp.me - Wordpress.com
- flic.kr - Flickr
- lnkd.in - LinkedIn
- su.pr - StumbleUpon
- go.usa.gov - USA Government (and since they control the Internets, it doesn't get much more official than this)
- amzn.to - Amazon
- binged.it - Bing (bonus points for being longer than bing.com)
- 1.usa.gov - USA Government
- tcrn.ch - Techcrunch
Dead or Broken Shorteners
- chod.sk - Appears non-incremental, not resolving
- gonext.org - not resolving
- ix.it - Not resolving
- jijr.com - Doesn't appear to be a shortener, now parked
- kissa.be - "Kissa.be url shortener service is shutdown"
- kurl.us - Parked.
- miklos.dk - Doesn't appear guessable: http://miklos.dk/!z7bA6a - "Vi arbejder på sagen..."
- minurl.org - Presently in ERROR 404
- muhlink.com - Not resolving
- myurl.us - cpanel frontend
- 1link.in - Website dead
- canurl.com - Website dead
- dwarfurl.com - Website dead. Numeric, appears incremental: http://dwarfurl.com/08041
- easyuri.com - Website dead. Appears hex incremental with last digit random/checksum: http://easyuri.com/1339f , http://easyuri.com/133a3
- go2cut.com - Website dead
- lnkurl.com - Website dead
- minilien.com - Doesn't appear guessable: http://minilien.com/?9nyvwnA0gh - Website dead
- memurl.com - Pronounceable. Broken.
- nyturl.com - NY Times (bonus points for being longer than nyt.com, which they own). Taken by squatters
- digg.com - discontinued
- u.nu - "The shortest URLs. period." Website dead since at least 1 October 2010 (http://web.archive.org/web/20100104023208/http://u.nu/)