Collecting items randomly
An ArchiveTeam member comes across many websites that need saving, each structured differently. Understanding how a site's items can be reached is crucial for building the item list, which can then easily be scraped with well-developed tools.
Imagine the following situation: a website assigns each item a long, unique identifier that is impossible to discover by brute force. The site doesn't provide an index or sitemap, of course. You don't even know the number of items. (There are sites that work this way.)
An obvious approach is discovery through Google, Bing, Common Crawl or similar. But wait! The site lets you request a random item. You keep clicking the button, and you get a different item each time. Say you can make this request automatically, since you have the link.
Simple! Request that URL x times, where x is a large number.
But how long should we try? Obviously, if the items are served truly at random, some items soon appear twice; you don't need to go far to see that. The longer you run the discovery script, the more often already-seen items come up. How long is it worth trying to get a new random item?
(TL;DR: skip to the recipe near the end of this page.)
In combinatorics, the model representing this situation is sampling with replacement. The most often discussed question about it is "what is the probability of picking...". Our question is different: after how many picks does the number of distinct items picked reach the total number of items, n? Or, after k picks, what percentage of all the items have we seen?
Not knowing n, we can't really answer this question. We can only watch the rate at which new, not-yet-seen items turn up, and observe how it decreases.
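Watching the rate of new items can be automated. The following is a minimal sketch, not a finished crawler: `fetch_random_id` is a hypothetical callable standing in for whatever request returns one random item identifier, and the `window` and `min_new_rate` thresholds are arbitrary assumptions.

```python
import random
from collections import deque

def discover(fetch_random_id, max_tries, window=1000, min_new_rate=0.01):
    """Call fetch_random_id() repeatedly and collect the distinct ids.

    Stops after max_tries, or earlier once the share of new ids among
    the last `window` tries drops below min_new_rate.
    """
    seen = set()
    recent = deque(maxlen=window)  # 1 for a new id, 0 for a repeat
    tries = 0
    while tries < max_tries:
        item_id = fetch_random_id()
        tries += 1
        recent.append(0 if item_id in seen else 1)
        seen.add(item_id)
        if len(recent) == window and sum(recent) / window < min_new_rate:
            break
    return seen, tries

# Stand-in for the real request: a made-up site with 500 hidden items.
rng = random.Random(0)
seen, tries = discover(lambda: rng.randrange(500), max_tries=100_000,
                       window=200, min_new_rate=0.02)
print(tries, len(seen))
```

In real use the lambda would be replaced by a function that requests the random-item URL and extracts the identifier from the response.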
Or, we can run simulations, where we do know what the number n is and what percentage of the items we have seen, and we try to find some constant between the only two numbers we know: the number of picks (or tries) and the number of distinct (found) items.
Let's do this for, say, 100 items. We keep picking, and after each try we note how many items we have found. For brevity, the following table shows the state after every tenth try only.
It is easy to see that if we pick 100 times, we get 60 distinct items, and even after 200 picks we only have 88% of the items. After 300 picks we are missing only 4%, after 400 picks we have almost everything, and after 500 picks even more. (As this is only one experiment, these are just approximate numbers.)
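An experiment like this can be rerun with a few lines of Python. This is a sketch assuming every item is equally likely on every pick; the function name `simulate` and its parameters are made up for illustration, and a different seed gives slightly different numbers.

```python
import random

def simulate(total_items, max_tries, report_every, seed=None):
    """Pick uniformly at random with replacement; yield a row per checkpoint.

    Each row is (tries, found, tries/found, tries/total, found %).
    """
    rng = random.Random(seed)
    seen = set()
    for tries in range(1, max_tries + 1):
        seen.add(rng.randrange(total_items))
        if tries % report_every == 0:
            found = len(seen)
            yield (tries, found, tries / found,
                   tries / total_items, 100 * found / total_items)

rows = list(simulate(total_items=100, max_tries=500, report_every=10, seed=1))
for tries, found, ratio, multiple, pct in rows:
    print(f"{tries:5d} {found:5d} {ratio:6.2f} {multiple:6.2f} {pct:5.0f}%")
```

Calling `simulate` with a different `total_items` and a proportionally larger `max_tries` repeats the experiment at another scale.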
So, if we knew how many items there are, four times that many requests of a random item would present us almost all the items. But what if we don't know the total number of items?
Look at the third column. When we reach 100%, the ratio of tries to found items is ~4.2. This ratio is something we can always calculate, even without knowing the total, and it turns out to be largely independent of the total number of items.
Don't ask me to prove this mathematically. Let me present another simulation instead, with a number that is neither so small nor so round: say, we have 3811 items.
| Tries | Items found | Tries/found | Tries/total | Found % |
Let's first check the percentage (in the case above, that was simply the number of found items, because the total was 100). Now it's the fifth column. We have 3811 items. After 3811 tries, you can see that ~63% of the items have been found. Twice as many tries gives us 86%, three times ~95%, four times 98%, five times ~100%. The percentages are similar to those in the first simulation.
Now, move on to the tries/found ratio (third column), which is the most interesting. When we reach 100%, it is ~4.7. In the first case it was 4.2. But let's look at some other milestones: when we reach 68%, this ratio is 1.62 and 1.69 in the first and second case, respectively; at 84%, 2.14 and 2.25. You see, this is quite constant. Of course, there are larger differences in the last few percent, and even 100% (a rounded figure) doesn't mean you've got each and every item. But after x tries, you can count how many distinct items you have, and make a guess at how much of the total corpus you've discovered.
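For readers who do want the mathematics, here is a short sketch of why the ratio behaves this way. It assumes every pick is uniform and independent; n is the total number of items and k the number of picks as above, while D_k (distinct items found after k picks), t = k/n and f (the found fraction) are symbols introduced here.

```latex
\begin{align*}
P(\text{a given item is still unseen after } k \text{ picks})
  &= \left(1 - \tfrac{1}{n}\right)^{k} \approx e^{-k/n} \\
E[D_k] &= n\left(1 - \left(1 - \tfrac{1}{n}\right)^{k}\right)
  \approx n\left(1 - e^{-t}\right), \qquad t = \frac{k}{n} \\
f &= \frac{E[D_k]}{n} \approx 1 - e^{-t} \\
\frac{k}{E[D_k]} &\approx \frac{t}{1 - e^{-t}} = \frac{-\ln(1 - f)}{f}
\end{align*}
```

The last expression contains f but not n, which is why the two simulations agree at the same found percentage: it gives about 1.68 at 68% and about 2.18 at 84%. As f approaches 1 the expression grows without bound, so the ratio read off at a rounded 100% is the least stable point on the curve.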
So, if you need to do such a discovery, make n tries, then count the distinct items you got, call that k, and calculate n/k. (Note that in this recipe n is the number of tries and k the number of distinct items found, unlike earlier on this page.) Then find the ratio nearest to it in one of the preceding tables, and you get an approximation of what percentage of all items you've discovered.
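A lookup table of this kind can also be generated from the standard expectation for uniform sampling with replacement instead of from a single simulation run. The sketch below assumes uniform, independent picks and a reasonably large item count; `ratio_at` is a name made up for illustration.

```python
import math

def ratio_at(found_fraction):
    """Expected tries/found ratio once `found_fraction` of the items is seen."""
    return -math.log(1.0 - found_fraction) / found_fraction

table = [(pct, ratio_at(pct / 100))
         for pct in (50, 60, 70, 80, 86, 90, 95, 98, 99)]
for pct, ratio in table:
    print(f"found {pct:2d}%  ->  tries/found = {ratio:.2f}")
```

Because these are expected values, a real run will scatter around them, most visibly in the last few percent.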
We can also conclude that to discover almost all the items, we need to push that random button at least three times as often as there are items, but more than five times as often is not worth it. Even twice the number of items still gives a fair result.
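These rules of thumb can be checked against the expected coverage for uniform sampling with replacement, which for a large item count is approximately 1 - e^(-m) after m times as many picks as there are items. A small sketch, with `expected_coverage` as an illustrative name:

```python
import math

def expected_coverage(multiple):
    """Expected share of items seen after `multiple` * (item count) picks."""
    return 1.0 - math.exp(-multiple)

for m in range(1, 6):
    print(f"{m}x the item count -> about {100 * expected_coverage(m):.1f}% found")
```

The expected shares come out at roughly 63%, 86%, 95%, 98% and 99% for one to five times the item count, in line with the simulated percentages.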
An example: a run of 94,836 successful queries on kepkezelo.com gave 40,418 distinct items. The n/k ratio is 2.35. According to the second table, that means ~86% of the items have been discovered. Thus, the total number of items is around 47,000, and we can expect that after another run of the same size we'd have ~98% of them.
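The same estimate can be made without a table by solving the expected-ratio relation numerically. This is a sketch under the assumption that the site's random button is uniform and independent between requests; `estimate_found_fraction` is a hypothetical helper, not an existing tool.

```python
import math

def estimate_found_fraction(tries, distinct):
    """Solve t / (1 - exp(-t)) = tries / distinct for t by bisection,
    then return the expected found fraction 1 - exp(-t)."""
    if distinct <= 0 or tries <= distinct:
        raise ValueError("need tries > distinct > 0 (no repeats, no estimate)")
    ratio = tries / distinct
    lo, hi = 0.0, ratio  # the solution t always lies between 0 and the ratio
    for _ in range(100):
        t = (lo + hi) / 2
        # The left-hand side grows with t, so narrow toward the target ratio.
        if t / (-math.expm1(-t)) < ratio:
            lo = t
        else:
            hi = t
    return -math.expm1(-(lo + hi) / 2)

tries, distinct = 94_836, 40_418
f = estimate_found_fraction(tries, distinct)
total = distinct / f
after_second_run = 1 - (1 - f) ** 2  # coverage after doubling the tries
print(f"ratio {tries / distinct:.2f}, found ~{100 * f:.0f}%, "
      f"total ~{total:,.0f}, after a second run ~{100 * after_second_run:.0f}%")
```

The result lands close to the table-based figures above; the small difference comes from reading the nearest row of a table rather than solving for the exact ratio.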