Difference between revisions of "GeoCities Japan"

From Archiveteam
(Added information from first crawl.)
Line 17: Line 17:
 
[[Category:GeoCities]]
 
[[Category:Web hosting]]
 
 +
 +
== Crawl Summaries ==
 +
 +
(Please add your crawls here)
 +
 +
* Nov 9 2018: crawl done using seeds compiled from IA’s existing CDX data (see below).
 +
** Total size: 3.7TB (uncompressed: 3.9TB)
 +
** Total URLs crawled: 96M
 +
** [https://transfer.sh/oVBA6/crawl-report.txt Crawl report], [https://transfer.sh/Xh3SO/hosts-report.txt Hostname list], [https://transfer.sh/10xYRN/mimetype-report.txt MIME type report]
  
 
== Discovery Info ==
 
* DNS CNAMEs for geocities (JSON format): <s>[https://transfer.sh/QYWEG/geocities-dns-data]</s> (dead link), [https://web.archive.org/web/20181004152609/https://transfer.sh/QYWEG/geocities-dns-data]
 
* Several records available [https://anonfile.com/z1z62ak8ba/records_zip here] (alternative link: [https://transfer.sh/5c5y1/records.zip])
+
* Records compiled from IA’s CDX data, available [https://anonfile.com/z1z62ak8ba/records_zip here] (alternative link: [https://transfer.sh/5c5y1/records.zip])
 
** geocities_jp_first.txt: First level subdirectory list under geocities.jp, compiled from IA CDX data. 566,690 records in total.
 
** geocities_co_jp_first.txt: Same as above, for geocities.co.jp. 12,470 records in total.  
 
Line 32: Line 41:
 
* geocities.co.jp and missed geocities.jp URLs grabbed from the above targets, crawled as {{Job|31ges4c4c96k140sp6zah5vcc}}: <s>[https://transfer.sh/CLtZc/geocities-patch.txt]</s> (dead link), [https://archive.org/download/archiveteam_archivebot_go_20181007210002/urls-transfer.sh-geocities-patch.txt-inf-20181007-195532-31ges-urls.txt]
 
* geocities.co.jp and geocities.jp crawl from [http://web.archive.org/web/20140403184117/http://award.surpara.com/misssp/ Miss Surfersparadise], crawled as {{Job|e8ynrp5a7p4vwjkyxw9eph9p0}}: <s>[https://transfer.sh/cka7b/geocities-misssp.txt]</s> (dead link), [https://archive.org/download/archiveteam_archivebot_go_20181021150002/urls-transfer.sh-geocities-misssp.txt-inf-20181007-102152-3ntkw-urls.txt]
 
 +
 +
== Crawler Traps ==
 +
 +
* A common calendar CGI script, usually named “i-calendar.cgi”, seems to be able to trap Heritrix with timestamped infinite loops despite having TooManyHopsDecideRule on. ([http://cgi.geocities.jp/otanibc/websb2s/i-calendar.cgi?nen=2018&tuki=10&ikzhp=&mid=&mpass=&time=1541701596 Example])
  
 
== Issues ==
 

Revision as of 10:13, 9 November 2018

GeoCities Japan
URL: http://www.geocities.jp/, http://www.geocities.co.jp/
Project status: Closing
Archiving status: In progress...
Project source: Unknown
Project tracker: Unknown
IRC channel: #notagain (on EFnet)
Project lead: Unknown

GeoCities Japan is the Japanese version of GeoCities. It survived the 2009 shutdown of the global platform.

Shutdown

On 2018-10-01, Yahoo! Japan announced that they would be closing GeoCities at the end of March 2019. (New accounts can still be created until 2019-01-10.)

Crawl Summaries

(Please add your crawls here)

Discovery Info

  • DNS CNAMEs for geocities (JSON format): [1] (dead link), [2]
  • Records compiled from IA’s CDX data, available here (alternative link: [3])
    • geocities_jp_first.txt: First level subdirectory list under geocities.jp, compiled from IA CDX data. 566,690 records in total.
    • geocities_co_jp_first.txt: Same as above, for geocities.co.jp. 12,470 records in total.
      • NOTE: The majority of sites under geocities.co.jp are not first-level sites but second-level "neighborhood" sites (in theory there could be 1.79M of them; how many actually exist is unknown). See the explanation below.
    • blogs_yahoo_co_jp_first.txt: Same as above, for blogs.yahoo.co.jp. 646,901 records in total.
    • geocities_co_jp_fields.txt: List of neighborhood names under geocities.co.jp.
      • Individual websites are listed in the following format: http://www.geocities.co.jp/[NeighborhoodName]/[AAAA] where AAAA ranges from 1000 to 9999.
    • include-surts.txt: List of subdomains that should be allowed by your crawler.
  • geocities.jp grab from E-Shuushuu Wiki, crawled as job:cu6azkjwy45qmo1wwdxsdfusj: Pastebin
  • geocities.jp grab from Danbooru, crawled as job:5x0pf7wloqgeqc2r9rddino2l: Gist
  • geocities.co.jp and missed geocities.jp URLs grabbed from the above targets, crawled as job:31ges4c4c96k140sp6zah5vcc: [4] (dead link), [5]
  • geocities.co.jp and geocities.jp crawl from Miss Surfersparadise, crawled as job:e8ynrp5a7p4vwjkyxw9eph9p0: [6] (dead link), [7]
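
Because neighborhood sites follow the fixed pattern http://www.geocities.co.jp/[NeighborhoodName]/[AAAA] described above, candidate seed URLs can be generated rather than discovered. The following is a minimal Python sketch of that enumeration. The function name and the neighborhood name "ExampleField" are hypothetical placeholders; real names would come from geocities_co_jp_fields.txt, and many generated URLs will probably not exist.

```python
from typing import Iterable, Iterator


def neighborhood_urls(
    neighborhoods: Iterable[str], start: int = 1000, end: int = 9999
) -> Iterator[str]:
    """Yield one candidate URL per (neighborhood, number) pair.

    The number range 1000..9999 is taken from the format described above.
    """
    for name in neighborhoods:
        name = name.strip()
        if not name:
            continue  # skip blank lines from the input list
        for number in range(start, end + 1):
            yield f"http://www.geocities.co.jp/{name}/{number}"


if __name__ == "__main__":
    # "ExampleField" is a made-up placeholder, not a real neighborhood name.
    for url in list(neighborhood_urls(["ExampleField"]))[:3]:
        print(url)
```

In practice the names would be read line by line from geocities_co_jp_fields.txt and the output written to a seed file for the crawler.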

Crawler Traps

  • A common calendar CGI script, usually named “i-calendar.cgi”, can apparently trap Heritrix in an infinite loop of timestamped URLs, even with TooManyHopsDecideRule enabled. (Example)
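
One way to guard against this trap is to filter such URLs before they are queued. The Python sketch below shows two options: detecting the script by name, and stripping the query parameter that changes on every request. Treating "time" as the volatile parameter is an assumption based only on the example URL; the function names are hypothetical and this is not Heritrix configuration. Since the calendar may also link endlessly from month to month, rejecting the script outright is probably the safer choice.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Matches paths ending in the script name mentioned above.
TRAP_SCRIPT = re.compile(r"/i-calendar\.cgi$", re.IGNORECASE)


def is_calendar_trap(url: str) -> bool:
    """Return True if the URL path points at an i-calendar.cgi script."""
    return bool(TRAP_SCRIPT.search(urlsplit(url).path))


def strip_volatile_params(url: str, volatile=("time",)) -> str:
    """Drop query parameters assumed to change on every request.

    "time" is a guess taken from the example URL, not a documented fact.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in volatile
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

A crawler could skip any URL for which is_calendar_trap returns True, or deduplicate on the output of strip_volatile_params so that each calendar month is fetched at most once.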

Issues

  • Hidden-entry sites (Importance: Low): There are a few sites that do not use index.htm/index.html as their entry points; as a result, first level directory access will fail to reach them.
    • However, as long as there are other geocities sites linked to them, they should be discoverable by the crawler.
    • So the only problem is pages whose inlinks are all dead, and there should be very few of those. To be absolutely sure, we can diff IA's current CDX against the CDX from the crawl.
    • Notice that this is not a problem with the neighborhood sites as we can enumerate the URLs.
  • Deduplication (Importance: Low): If we are going to release a torrent as we did with GeoCities, then it may be worth deduplicating first. Most likely it won't make a major difference.
  • Final Snapshot (Importance: Moderate): Page contents may still change between now and March 31 2019, so we need to do another crawl shortly before the shutdown.
    • Note that a lot of users will be setting up 301/302 redirects before the server shuts down. According to Yahoo, we'll have until Sep 30 2019 to record those redirects.
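
The CDX comparison suggested under Hidden-entry sites could look roughly like the Python sketch below. It assumes both inputs are space-separated CDX files whose first field is the canonicalized URL key, and that any header line starts with "CDX"; actual field layouts vary, so check the header of your files first. The function names and sample records are made up for illustration.

```python
from typing import Iterable, List, Set


def url_keys(cdx_lines: Iterable[str]) -> Set[str]:
    """Collect the first field (assumed to be the URL key) of each CDX line."""
    keys = set()
    for line in cdx_lines:
        line = line.strip()
        if not line or line.startswith("CDX"):
            continue  # skip blank lines and the format header
        keys.add(line.split()[0])
    return keys


def missing_from_crawl(ia_cdx: Iterable[str], crawl_cdx: Iterable[str]) -> List[str]:
    """Return URL keys present in IA's CDX but absent from the crawl's CDX."""
    return sorted(url_keys(ia_cdx) - url_keys(crawl_cdx))


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    ia = [
        "jp,geocities)/foo/ 20180101000000 http://www.geocities.jp/foo/",
        "jp,geocities)/bar/secret.html 20170101000000 http://www.geocities.jp/bar/secret.html",
    ]
    crawl = [
        " CDX N b a m s k r M S V g",
        "jp,geocities)/foo/ 20181109000000 http://www.geocities.jp/foo/",
    ]
    print(missing_from_crawl(ia, crawl))
```

Any keys it reports are candidates for pages the crawl never reached and could be fed back in as extra seeds.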