From Archiveteam

I (i336_) am adding this here so I can provide updated information in a stream-of-consciousness format without needing to think too much about how the text is presented. (This saves me the extra time I'd otherwise spend formatting it with a refined writing style.)

Anybody is welcome to transcribe this data onto the page itself. Once there's nothing new here this can be emptied.

If I have anything to share it will be put here; I'll endeavor to use this as a shared scratchpad. I also tend to treat IRC as micro-twitter when I'm nervous, so you _probably_ won't need to ask me what's new. :P

NOTE - I have a 110k-line-long userlist if that's helpful.

Crawling supersedes all forms of information that have yet to be crawled.

Pagination types

Object pages and view_comments pages use a different pagination system. The HTML is different. Thankfully knowing which one to check for is a case of simply knowing the current job URL, or going off the job type (eg "object" could map to /{i} and "view_comments" could map to /view_comments/{i}).

Object pages

Object example: (32 pages, 6473 replies)

Always specify per=200 when fetching normal objects (any URL like /[0-9]+ is a request for an object).
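As a quick sketch, a page-URL builder for object pages might look like this (BASE is a placeholder, not the real site root; the parameter semantics follow the notes above):

```python
# Hedged sketch: build object-page URLs. BASE is a hypothetical placeholder
# for the site root. p selects the (zero-indexed) page, per sets items per
# page (200 is the max).
BASE = "https://example.invalid"

def object_page_url(object_id: int, page: int = 0) -> str:
    # Anything fetched as /[0-9]+ is an object request, so always add per=200.
    return f"{BASE}/{object_id}?p={page}&per=200"
```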

view_comments pages

This doesn't seem to respond to the per parameter. You always get a fixed number of comments per page.

Working example: (11 pages, 700 comments)

Last-page detection

It took me a bit to find a reasonably simple way to do this. My first two attempts are buried in for posterity, and may be worth looking at. I've moved them out of the way because they're a tad noisy textually and I want to keep this short.

If you're on an object page, the next page will be pointed to by <link rel="next" id="browse_next" href="/205?p=32">; you can either follow that URL directly, or extract the p=([0-9]+) bit and follow it. The entire <link /> tag will disappear on the last page.

If you're on a view_comments page, scan the HTML for /view_comments/93576596?p=(curpage+1). If you don't find a URL matching this string, you're on the last page. My main concern here is whether the URL will contain other parameters besides p. FWIW, the last page will also contain a chunk of HTML matching <span class=r_browse_selected>[0-9]+</span></td></tr></table>, but this chunk won't exist on objects that only have one page.
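Both last-page checks can be sketched like this; it assumes the HTML matches the shapes described above, and the handling of possible extra query parameters is a guess:

```python
import re

def object_next_url(html: str):
    # Object pages point at the next page with
    # <link rel="next" id="browse_next" href="/205?p=32">; the whole tag
    # disappears on the last page, so no match means "last page".
    m = re.search(r'<link rel="next" id="browse_next" href="([^"]+)"', html)
    return m.group(1) if m else None

def comments_on_last_page(html: str, object_id: int, cur_page: int) -> bool:
    # view_comments pages are on their last page when no link to the next
    # page number exists. The (?:[^"&\s]*&)* part hedges against other
    # query parameters appearing before p.
    pat = rf'/view_comments/{object_id}\?(?:[^"&\s]*&)*p={cur_page + 1}\b'
    return re.search(pat, html) is None
```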

Basic workflow suggestion

This is taken from the standpoint of each discrete URL run.

ID runs

  1. Check the cookie jar. If you don't have a ukey, you need to get one. POST something like login=abc&password=def&flag_not_ip_assign=1 to / and you should get a Set-Cookie: ukey=.... back.
  2. Request a known-working URL. If you get a 302 Moved instead of a 200 OK, your ukey has expired. Wipe the current one and go back to #1.
  3. If you start bouncing between #1 and #2, send an error message back to the tracker; someone will need to try logging in via a browser. (The worst-case scenario is that this won't work; the next-worst is that we'll need to cycle through account manufacturing. I'm tentatively confident it won't come to that, though.)
  4. Given an ID of 123456, request each of the three URLs. The first will use object pagination, the second will use view_comments pagination, and the third may use view_comments pagination as well - I have no test cases to prove this, but the page structures look similar enough that I strongly suspect it.
  5. Check if there's a page after this one using the techniques described above (depending on the current URL). If there is, add the next page to the fetch queue.

Considering the number of images that have already been crawled, I don't think crawling the images found there is a good idea. It would have been really cool to do this right from the start, but it'd just produce duplicates now.

Username runs

No usernames have been crawled yet. This is great - it means we can do it properly.

vs

While sniffing around some Ubuntu packages I found references to an "". When I googled it, I got a search result from that had an oddly similar URL structure to.

Visiting showed me an all-but-blank page, but I could see I was using's backend - the UI is identical.

So, I tried logging in with my account, and it... worked. It accepted my credentials, and sent me to

Next, for the fun of it I tried accessing an object like AND IT WORKED! DOES NOT REDIRECT!!!

The other big difference is that doesn't have user uploads hidden on user pages! So is much, much bigger than :) and this will be very useful to save.

Another big difference is that with we can archive all of the conversations on the site. I thought the aspects of were going to be lost; if we can crawl the site successfully, they won't be.

Note: something to consider - also says "Search is not possible", just like does. I think a reasonable interpretation of this is that EX is not going to keep going past the 31st, and that the whole thing is going to go away. So the show isn't over yet. :P

An aside:

I found it amusing that unlike ( ->, ( -> isn't using DDoS protection. Something to keep in mind.

Also, now we know what ISP EX are/were using. Volia must've gotten enough complaints to fill a book... something very very interesting to keep in mind, especially considering their cheapest offering ("Intel Atom D510/1GB/250GB + UNLIM 100 Mbit/s") is $15.31/mo USD...!

Logging in

With a username of abc and a password of def,

 $ curl '' --data 'login=abc&password=def&flag_not_ip_assign=1' -vvvv

The flag_not_ip_assign parameter corresponds to the "without recognition of IP" checkbox on the homepage. This probably doesn't apply to the DDoS protection systems, but hey... why not ;) although if you think doing this would be suspicious, then maybe not.

You get a ukey cookie back. If you implement a standard keep-everything cookie jar then that's fine, but ukey is the only parameter you need to supply to authenticate. I don't think it changes.
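As a sketch, the request body and ukey extraction might look like this; only the parameter and cookie names come from the notes above:

```python
from http.cookies import SimpleCookie
from urllib.parse import urlencode

def login_body(login: str, password: str) -> str:
    # Matches the curl --data payload above; flag_not_ip_assign=1 is the
    # "without recognition of IP" checkbox.
    return urlencode({"login": login, "password": password,
                      "flag_not_ip_assign": 1})

def ukey_from_set_cookie(header: str):
    # ukey is the only cookie needed to authenticate later requests.
    jar = SimpleCookie()
    jar.load(header)
    return jar["ukey"].value if "ukey" in jar else None
```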

Once you've made an account in a browser:

  • change your language at the top-right, for convenience
  • the second-last option under settings (the link immediately to the left of the language dropdown) says "disallow requests to files with limited access facilities" - maybe disabling this will let us scrape at least references to things even if we can't download them. I have it turned on, and I see "no access to object ... !" messages amongst the replies. Maybe seeing this is not useful (we can't derive anything from the ID). Or maybe it is? I don't know.

User accounts

This is going to be the biggest issue.

"...Comrade, exua_archiver is logged in from 3,287 different IP addresses, and is currently using 648Mbps of bandwidth."

Hopefully this does not happen.

Hopefully we do not need to ask the question of "okay, what happens if we try to batch create accounts?"


I'll leave it to you to make an account to use for the warrior, since it only takes 20 seconds. You can borrow one of my accounts if you want though.

Strategy for

  • Add<userid> to the user-discovery system
  • Add{i} to the bruteforcing system
  • Add{i} to the bruteforcing system - this uses the p and per semantics I described below
  • Consider adding{i} - I think this shows the references to an object, and may turn up otherwise hidden content.
  • Write a scraper that looks for /user/[A-Za-z0-9_-] (<-- I'm 99.9% confident that covers all usernames) in all returned data, and add the extra usernames to the base list.
  • Scrape avatars off either just the user pages or everywhere (user avatars are linked to in almost all pages)
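The username-scraper bullet could be sketched like so (the + quantifier is my addition to the character class above):

```python
import re

# Character class from the bullet above; the + quantifier is my addition,
# since usernames are presumably more than one character long.
USER_RE = re.compile(r'/user/([A-Za-z0-9_-]+)')

def new_usernames(page_text: str, known: set) -> set:
    # Scan any returned data for /user/... links and report names we
    # haven't seen yet, to be appended to the base userlist.
    return set(USER_RE.findall(page_text)) - known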

Object access with vs

Compare with (warning: this will download an 8GB MKV - be ready to hit cancel). allows access to things has blocked for some reason. (I'm logged into, so having a login isn't an issue.)

Request minimization

I say just request everything off both, but that means something like 7 requests per ID... at least 700 million requests in total. I fear the database's load alarms will go off...

I think the XSPFs can probably go, although I'll definitely hear arguments for why they should stay.

I think the resized (?[0-9]+) avatar requests can go.

The hard decision is whether to replace with entirely. It looks like it'd work 100%, but I'm hesitant. I don't know what to do here.

URL hierarchy

The site uses an ID system for all "objects", including conversations, threads, collections, and folders. By bruteforcing all of the IDs, we'll get everything.

However, this will not get the *structure* of interconnected threads. If you look at, you'll see it has "Beginning of discussion:" and "Last respond:" links that point to other IDs. Woohoo! We can now... only say that ID A relates to ID B, "somehow". We can't infer the semantic structure from the page HTML and what it links to.

To preserve the thread structure we need to fetch the view_comments links, and then chase through the pagination on those pages, in order to archive the threads' context. We can recover the ordering with the timestamps, but that doesn't preserve the thread hierarchy.
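A minimal sketch of the timestamp-ordering fallback, assuming each scraped comment carries an ID and a timestamp (the Comment record and its field names are hypothetical, not anything the site defines):

```python
from dataclasses import dataclass

@dataclass
class Comment:
    # Hypothetical record for a scraped comment; the fields are mine.
    object_id: int
    timestamp: int  # epoch seconds, assumed recoverable from the page
    body: str

def recover_ordering(comments):
    # We can't reconstruct the thread hierarchy, but sorting by timestamp
    # at least recovers the conversation order.
    return sorted(comments, key=lambda c: c.timestamp)
```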

Also - if you check URLs like, you'll see its parent URL is "/en/about". So in some rare cases you'll get odd URLs like that. It looks like you can just treat them like collections.

The p/per system

Knowing when you're on the last page is important.

Have a look at the navigation area at the top of these 3 pages:

Here are a few things that change when you're on the last page:

  • It looks like the tooltip for the go-to-the-end right-facing arrow contains a Russian string that ends with the number of the last item in the page. (As in, if there are 152 items on a page, the number will be 152.) This will match the yyy in the xxx..yyy number in bold text in the middle of the left and right arrows.
  • The go-right and go-right-to-end arrows have no text in the middle of them
  • Both the go-right and -to-end arrows are no longer wrapped in an <a href....

"The more techniques you use, the less chance for failure when working at scale"...?


My first archive target was the{i}.xspf URL pattern, because it's what the site advertised when it was functional, it's what I learned, and it's what I added to my little downloader shell script.

After paging through the full set of results crawled by the Wayback Machine to see what I could find, I discovered, started poking through its JavaScript code, and found r_view.

The XSPF approach should be discarded; r_view is superior in every way. Unlike XSPF, it includes the author of the post (critical!!), the md5sum and size of every file (verifiable downloads! yay!), and the text description associated with the post, which would otherwise have been lost.

Use this from now on.{i}



I remembered RSS was a thing while browsing some HTML. Example:

Unfortunately it only returns a handful of results.

But I discovered this!

100 is the max.

This is important, as it is the only way I have found to archive lists of *collections*. (Using r_view or .xspf shows you an empty object with a picture in it. They clearly weren't built to handle this datatype.)

Files, folders and "collections"

Files are /get/... URLs.

Folders are lists of files.

Collections are groups of folders. I'm using the term "collections" because I don't know what else to call them.

RSS vs r_view

As I said before, the r_view/XSPF method doesn't seem to handle the collection datatype, while RSS does.

An example:


XSPF: (sorry, this link downloads itself)


Only RSS can list collections. However it only seems to be able to return the first 100; I have not found any pagination options.

NOTE: Notice how in the r_view URL the file_upload_id tag's preview parameter had the same filename as the picture tag's URL parameter? Please don't quote me on this, but this might be a consistent way to detect collections.


Found this first (also hiding in the JS). It's cute but not useful.


I lobbed _hint off the above URL experimentally, and raised the roof in IRC when it worked.

Using this we can search for anything we want! It returns a text list of matching IDs.

This is not especially useful for automated archiving, but critical for manual prioritization.

After trying a bunch of URLs I found count worked with RSS, so I tried it with this URL as well. count can go up to 500 here :D

EDIT: count can go up to 1000. Yeah, crazy. The sad thing is, this is nigh unusable :(


One of the few HTML links that still work. Examples are and

You can specify p to select the page number, zero-indexed. (So p=1 is page two.)

You can specify per to set the number of items per page. 200 is the max. (Use a consistent per with all searches as the site doesn't use an after=...-type system.) The website (ie in a browser) seems to remember the per setting you use, FWIW.
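To make the zero-indexed p concrete, here's the page arithmetic with the per=200 maximum:

```python
def page_and_offset(item_index: int, per: int = 200):
    # p is zero-indexed: item 0 is on p=0, item 200 is the first item on
    # p=1. Keep per consistent across a run, since the site has no
    # cursor-style (after=...) pagination to fall back on.
    return item_index // per, item_index % per
```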


This is similar to user_say except it seems to be for things a user likes. It appears to be possible to switch this off, and some users have turned it off. I would consider this lower priority to crawl, if at all.

Recommended strategy

  • Use r_view with bruteforced IDs to fetch contents of folders. Consider also fetching the image that gets returned (nice-to-have, but not important).
  • Use userlist(s) with user_say to fetch archives of what users have written about. I understand this will include their comments and the files they have publicly released.
  • Check for an RSS feed on every folder ID that seems to be a collection (going off the suggested heuristics above, along with your own ideas) - or alternatively, do RSS scans on every ID. I consider the RSS feeds a nice-to-have "extra" rather than part of the crawling operation: collections simply group folders together, you're getting the folder list anyway, and we don't yet know how to return more than the 100 most recent RSS items.


I would definitely appreciate extra eyes on the site itself, looking for interesting interconnections between content. I'm having that "I'm sure there's something I've not thought of here..." thing happening, but that may just be nerves.

If you (yes, you, this applies to everyone) feel like wasting a bit of time, just make an account and click around, keeping in mind the need to preserve the semantic structure of the site. If you think of anything, say it in IRC. It may be useful!