GeoCities Japan

From Archiveteam
 
{{Infobox project
| title = GeoCities Japan
| image = Geocities japan 2k.png
| URL = {{URL|http://www.geocities.jp/}}<br />{{URL|http://www.geocities.co.jp/}}
| project_status = {{offline}}
| archiving_status = {{partiallysaved}}
| irc = notagain
| lead = [[User:Hiroi]], [[User:DoomTay]]
}}
  
'''GeoCities Japan''' was the Japanese version of [[GeoCities]]. It survived the 2009 shutdown of the global platform and shut down at the end of March 2019.
  
 
== Shutdown ==
On 2018-10-01, Yahoo! Japan [http://info-geocities.yahoo.co.jp/p/close/ announced] that they would be closing GeoCities at the end of March 2019. (New accounts could still be created until 2019-01-10.) It shut down on 2019-04-01, shortly after midnight JST.

== Crawl Summaries ==
(Please add your crawls here)

* Nov 9 2018: crawl done using seeds compiled from IA’s existing CDX data (see below). Available on [https://archive.org/details/archiveteam_geocitiesjp IA] (currently being uploaded).
** Total size: 3.7TB (uncompressed: 3.9TB)
** Total URLs crawled: 96M
** {{URL|https://transfer.sh/NxpeF/crawl-report.txt|Crawl report}}, {{URL|https://transfer.sh/14ubvG/hosts-report.txt|Hostname list}}, {{URL|https://transfer.sh/12I5ES/mimetype-report.txt|MIME type report}}

== Deduplication ==

We'll roughly follow the deduplication scheme outlined [http://beta.taricorp.net/2016/web-history-warc/ here], but with a shared MySQL-compliant database. (The database will be online soon; in the meantime, you can begin to prepare the metadata following the description below.)

The deduplication workflow goes as follows:

# During or after individual crawls, each person generates the metadata (using warcsum or other tools) corresponding to their crawled WARC files, following the schema below.
# The metadata is then inserted into the database. It is crucial that this table does not get corrupted, so please contact me (hiroi on the IRC channel) for access if you want to add your data.
#* If time and resources permit, the uploader may fill in the deduplication info at the time of insertion, but this is not required.
#* That's because (provided that all WARC files are available for download) the metadata in the database is enough for standalone deduplication.
# A dedicated worker machine will run through this table continuously, filling in the deduplication info (ref_id, ref_uri, ref_date).
#* As of now, such a script hasn't been written yet. '''If you're willing to write it, please let [[User:Hiroi]] know via IRC.'''
# At release time, we'll use this database to deduplicate all WARC archives at once (by replacing duplicate entries with [https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#example-of-revisit-record revisit] records) and combine them for release.
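The worker's fill-in pass (step 3) could look roughly like the following Python sketch. This is a hypothetical illustration, not the actual worker script (which hasn't been written yet): it keeps the first record seen for each payload digest and points later copies back at it via the ref_* fields of the uri_records schema.

```python
# Hypothetical sketch of the deduplication fill-in pass. Each record
# mirrors a row of uri_records; originals keep ref_* as None.
def fill_dedup_info(records):
    """Point ref_id/ref_uri/ref_date of every duplicate at the first
    record that carried the same payload digest."""
    first_seen = {}  # digest -> first record with that digest
    for rec in records:
        original = first_seen.get(rec["digest"])
        if original is None:
            first_seen[rec["digest"]] = rec  # first copy: keep as-is
        else:
            rec["ref_id"] = original["id"]
            rec["ref_uri"] = original["uri"]
            rec["ref_date"] = original["datetime"]
    return records

# Two records with identical payloads (example data only).
records = [
    {"id": 1, "uri": "http://www.geocities.jp/a/",
     "datetime": "2018-11-09T00:00:00Z", "digest": "sha1:AAA",
     "ref_id": None, "ref_uri": None, "ref_date": None},
    {"id": 2, "uri": "http://www.geocities.jp/b/",
     "datetime": "2018-11-09T00:01:00Z", "digest": "sha1:AAA",
     "ref_id": None, "ref_uri": None, "ref_date": None},
]
fill_dedup_info(records)
# The second record now refers back to the first, so it can be
# rewritten as a WARC revisit record at release time.
```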

The database schema is given below. For details on warc_offset and warc_len, please see the [https://bitbucket.org/tari/optiwarc/overview source code of warcsum and other tools].

<pre>
Table warc_records
+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| id            | int(11)      | NO   | PRI | NULL    | auto_increment |
| name          | varchar(1024)| NO   |     | NULL    |                | (WARC file name)
| size          | bigint(20)   | NO   |     | NULL    |                | (size of the file)
| location      | varchar(2083)| YES  |     | NULL    |                | (current available location, i.e. download link)
| digest        | varchar(1024)| YES  |     | NULL    |                | (hash of the entire file)
+---------------+--------------+------+-----+---------+----------------+

Table uri_records
+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| id            | int(11)      | NO   | PRI | NULL    | auto_increment |
| warc_id       | int(11)      | NO   |     | NULL    |                | (warc_records.id)
| warc_offset   | bigint(20)   | NO   |     | NULL    |                | (offset of the individual record in the WARC file)
| warc_len      | bigint(20)   | NO   |     | NULL    |                | (length of the (compressed) individual record)
| uri           | varchar(2083)| NO   |     | NULL    |                | (URI of the record)
| datetime      | varchar(256) | NO   |     | NULL    |                | (access time, taken from the WARC file directly)
| digest        | varchar(1024)| NO   |     | NULL    |                | (default value is "sha1:xxxxxx")
| ref_id        | int(11)      | YES  |     | NULL    |                | (original copy's id, if the record is a duplicate)
| ref_uri       | varchar(2083)| YES  |     | NULL    |                | (original copy's URI; can be filled in to reduce queries)
| ref_date      | varchar(256) | YES  |     | NULL    |                | (original copy's date)
+---------------+--------------+------+-----+---------+----------------+
</pre>
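For local experimentation before the shared database is online, the two tables can be mirrored in SQLite. This is a stand-in with simplified column types (the shared database is MySQL-compliant), and the file name and row values below are made-up examples:

```python
import sqlite3

# SQLite stand-in for the shared MySQL database; types are simplified
# (TEXT instead of varchar(n), INTEGER instead of int(11)/bigint(20)).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE warc_records (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    name     TEXT NOT NULL,    -- WARC file name
    size     INTEGER NOT NULL, -- size of the file
    location TEXT,             -- current available location (download link)
    digest   TEXT              -- hash of the entire file
);
CREATE TABLE uri_records (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    warc_id     INTEGER NOT NULL, -- warc_records.id
    warc_offset INTEGER NOT NULL, -- offset of the record in the WARC file
    warc_len    INTEGER NOT NULL, -- length of the (compressed) record
    uri         TEXT NOT NULL,
    datetime    TEXT NOT NULL,    -- taken from the WARC file directly
    digest      TEXT NOT NULL,    -- "sha1:xxxxxx"
    ref_id      INTEGER,          -- original copy's id, if duplicate
    ref_uri     TEXT,
    ref_date    TEXT
);
""")
# Example rows only; real metadata comes from warcsum or similar tools.
con.execute("INSERT INTO warc_records (name, size) VALUES (?, ?)",
            ("geocities-jp-example.warc.gz", 1234567890))
con.execute(
    "INSERT INTO uri_records (warc_id, warc_offset, warc_len, uri, "
    "datetime, digest) VALUES (1, 0, 512, "
    "'http://www.geocities.jp/example/', "
    "'2018-11-09T00:00:00Z', 'sha1:EXAMPLE')")
# Nothing has been marked as a duplicate yet.
dupes = con.execute(
    "SELECT COUNT(*) FROM uri_records WHERE ref_id IS NOT NULL"
).fetchone()[0]
```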

== Discovery Info ==

* DNS CNAMEs for geocities (JSON format): <s>[https://transfer.sh/QYWEG/geocities-dns-data]</s> (dead link), [https://web.archive.org/web/20181004152609/https://transfer.sh/QYWEG/geocities-dns-data]
* Records compiled from IA’s CDX data, available [https://anonfile.com/z1z62ak8ba/records_zip here] (alternative link: [https://transfer.sh/5c5y1/records.zip])
** geocities_jp_first.txt: First-level subdirectory list under geocities.jp, compiled from IA CDX data. 566,690 records in total.
** geocities_co_jp_first.txt: Same as above, for geocities.co.jp. 12,470 records in total.
*** NOTE: The majority of sites under geocities.co.jp are not first-level sites but second-level "neighborhood" sites (in theory there could be 1.79M of them; how many actually exist is unknown); see the explanation below.
** blogs_yahoo_co_jp_first.txt: Same as above, for blogs.yahoo.co.jp. 646,901 records in total.
** geocities_co_jp_fields.txt: List of neighborhood names under geocities.co.jp.
*** Individual websites are listed in the following format: <code><nowiki>http://www.geocities.co.jp/[NeighborhoodName]/[AAAA]</nowiki></code> where <code>AAAA</code> ranges from 1000 to 9999.
** include-surts.txt: List of subdomains that should be allowed by your crawler.
* geocities.jp grab from [https://e-shuushuu.net/wiki/index.php/Main_Page E-Shuushuu Wiki], crawled as {{Job|cu6azkjwy45qmo1wwdxsdfusj}}: {{URL|https://pastebin.com/raw/17hLpsN5|Pastebin}}
* geocities.jp grab from Danbooru, crawled as {{Job|5x0pf7wloqgeqc2r9rddino2l}}: {{URL|https://gist.githubusercontent.com/DoomTay/12a146e35fcee745b764ba3ae3c7545f/raw/863a021e43e0c93cb6f8943725a2ef5d1a699477/geocities-danbooru.txt|Gist}}
* geocities.co.jp and missed geocities.jp URLs grabbed from the above targets, crawled as {{Job|31ges4c4c96k140sp6zah5vcc}}: <s>[https://transfer.sh/CLtZc/geocities-patch.txt]</s> (dead link), [https://archive.org/download/archiveteam_archivebot_go_20181007210002/urls-transfer.sh-geocities-patch.txt-inf-20181007-195532-31ges-urls.txt]
* geocities.co.jp and geocities.jp crawl from [http://web.archive.org/web/20140403184117/http://award.surpara.com/misssp/ Miss Surfersparadise], crawled as {{Job|e8ynrp5a7p4vwjkyxw9eph9p0}}: <s>[https://transfer.sh/cka7b/geocities-misssp.txt]</s> (dead link), [https://archive.org/download/archiveteam_archivebot_go_20181021150002/urls-transfer.sh-geocities-misssp.txt-inf-20181007-102152-3ntkw-urls.txt]
* Crawls from links (and links within those) from [https://www.businessinsider.com/how-to-visit-last-remnants-geocities-before-destroyed-2018-10 this Business Insider article] {{Job|ayildv5yxmeo6s7egxni9dlnd}} {{URL|https://transfer.sh/uPLU4/biscrapes.txt}}
* Sites collated by [[User:Sanqui]] {{Job|cp5r3a9fifipnbxo8hsy4tmhx}} {{URL|https://etc.sanqui.net/archiveteam/geocities.jp_various.txt}}
* Scrapes from [https://web.archive.org/web/20031129155959/http://www.ragsearch.com/ Ragsearch] {{Job|adh7m0i9ka25buvdlabm0p9ii}} [https://archive.org/download/archiveteam_archivebot_go_20190102070002/urls-transfer.sh-ragsearch.txt-inf-20181217-154114-adh7m-urls.txt] {{Job|dmde087vgmmjluo9qjodob1ai}} [https://archive.org/download/archiveteam_archivebot_go_20190329030002/urls-transfer.sh-ragsearch2.txt-inf-20190329-012917-dmde0-urls.txt] {{Job|54l4xfl49rqpfttrkbzv968zm}} [https://archive.org/download/archiveteam_archivebot_go_20190331030001/urls-transfer.sh-ragsearch3.txt-inf-20190329-021433-54l4x-urls.txt]
* Scrapes from [https://web.archive.org/web/20031225145427/http://www.puni.to/ PuniTo] {{Job|2752dep7k79puge1a9mdo93x1}} [https://archive.org/download/archiveteam_archivebot_go_20190131020002/urls-transfer.sh-too.puni.to.txt-inf-20190102-084827-2752d-urls.txt]
* {{Job|eoy17cb66jg4f9vmgi0v9fexo}} [https://archive.org/download/archiveteam_archivebot_go_20190218160003/urls-transfer.sh-combined_files.txt-inf-20190201-045902-eoy17-urls.txt]
* Scrapes from [http://www.amaterasu.jp/ Amaterasu] (NSFW) {{Job|2vbwnt5l8nipjddqo17ex2r3j}} [https://archive.org/download/archiveteam_archivebot_go_20190221080002/urls-transfer.sh-amaterasu.txt-inf-20190221-044822-2vbwn-urls.txt]
* Scrapes from [https://web.archive.org/web/20031202162803/http://www.surpara.com/ Surfers Paradise] {{Job|chr2z6wrw4srlmxo489wksqef}} [https://archive.org/download/archiveteam_archivebot_go_20190328210002/urls-gist.githubusercontent.com-surpara.txt-inf-20190313-060006-chr2z-urls.txt]
* Scrapes from [https://web.archive.org/web/20031203000642/http://www.meguri.net/ Meguri-net] and [https://web.archive.org/web/20031207051228/http://www.oisan.jp/search/ Oisearch] {{Job|5m5qct4quwkn3blzgitqtd3uq}} {{URL|https://transfer.sh/2qlfJ/meguri+oisan.txt}} [https://archive.org/download/archiveteam_archivebot_go_20190329030002/urls-transfer.sh-meguri%2Boisan.txt-inf-20190329-005321-5m5qc-urls.txt]
* Scrapes from [https://web.archive.org/web/20031201043230/http://www.gamemichi.com/ Game-Michi] {{Job|5p4pvzxl74gxrj8dtky87kpfo}} [https://archive.org/download/archiveteam_archivebot_go_20190401050001/urls-transfer.sh-gamemichi.txt-inf-20190329-005508-5p4pv-urls.txt]
* Scrapes from [https://web.archive.org/web/20031202162803/http://www.interq.or.jp:80/red/pocky/cg/ Bishoujo NAVI] {{Job|1923nftkucm16x888vyvcvuvb}} [https://archive.org/download/archiveteam_archivebot_go_20190329030002/urls-transfer.sh-pocky.txt-inf-20190329-010753-1923n-urls.txt]
* Scrapes from [https://web.archive.org/web/20031216025741/http://www.lovehina.to/~hina/ Love Hina Search] {{Job|bcxxlfuso9uveek93abd6ua2y}} [https://archive.org/download/archiveteam_archivebot_go_20190329030002/urls-transfer.sh-hina.txt-inf-20190329-011926-bcxxl-urls.txt]
* Scrapes from [http://www.multiez.com/ MultiLink] {{Job|be6w30ni9v31t0rg5edq694k0}} {{URL|https://transfer.sh/z2lhW/multiez.txt}} [https://archive.org/download/archiveteam_archivebot_go_20190329030002/urls-transfer.sh-multiez.txt-inf-20190329-015320-be6w3-urls.txt]
* Scrapes from [https://www.gameha.com/ Gameha] {{Job|5ezyb53ch6ip4uklwgal4nsak}} [https://archive.org/download/archiveteam_archivebot_go_20190401050001/urls-transfer.sh-gameha.txt-inf-20190329-020010-5ezyb-urls.txt]
* Scrapes from [https://web.archive.org/web/20020827141733/http://www.ragnal.ccsj.com/cgibin/link/ralink_nn.html an earlier domain for Ragsearch] {{Job|cfv3zp5uj886dsp01gj4m1mt4}} [https://archive.org/download/archiveteam_archivebot_go_20190329080002/urls-transfer.sh-ragnal.txt-inf-20190329-042620-cfv3z-urls.txt]
* {{Job|aa63sfmum7cb3m58vvumtuosl}} filtered from {{URL|https://geo.98nx.jp/list.txt}}, from user nakomikan on IRC
* {{Job|f4bz9nodrgq4m620auucpjpoe}} list of SDF doujinshi manga circles filtered from {{URL|https://pastebin.com/6egiap0k}}, from nakomikan on IRC
* {{Job|5phcgljf5fxowviasvwpb0flh}} filtered from {{URL|https://pastebin.com/f4y0Mrah}}, from user nakomikan on IRC
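Because the neighborhood pattern described above is fully regular, the candidate seed URLs can be enumerated directly. A sketch (the two neighborhood names are illustrative examples; the authoritative list is geocities_co_jp_fields.txt):

```python
# Enumerate candidate neighborhood-site URLs: AAAA runs from 1000 to
# 9999, giving 9,000 candidates per neighborhood name.
def neighborhood_urls(neighborhoods):
    for name in neighborhoods:
        for n in range(1000, 10000):
            yield f"http://www.geocities.co.jp/{name}/{n}"

# Example names only; take the real list from geocities_co_jp_fields.txt.
seeds = list(neighborhood_urls(["SiliconValley", "HeartLand"]))
# ~199 neighborhoods x 9,000 slots each is where the theoretical 1.79M
# figure comes from; how many sites actually exist is unknown.
```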

== Crawler Traps ==

* A common calendar CGI script, usually named “i-calendar.cgi”, seems to be able to trap Heritrix with timestamped infinite loops despite having TooManyHopsDecideRule on. ([http://cgi.geocities.jp/otanibc/websb2s/i-calendar.cgi?nen=2018&tuki=10&ikzhp=&mid=&mpass=&time=1541701596 Example])
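One crawl-side mitigation is to reject the calendar script outright before fetching. A minimal, hypothetical URL filter (the regex targets the script name seen in the wild; in Heritrix this would translate to a reject rule such as a MatchesRegexDecideRule with the same pattern):

```python
import re

# Reject the calendar CGI regardless of query string; its timestamped
# next/previous-month links otherwise form an infinite loop.
CALENDAR_TRAP = re.compile(r"/i-calendar\.cgi(\?|$)")

def allow_url(url):
    """Return True if the URL should be crawled."""
    return CALENDAR_TRAP.search(url) is None

allow_url("http://cgi.geocities.jp/otanibc/websb2s/i-calendar.cgi?nen=2018&tuki=10")  # → False
allow_url("http://www.geocities.jp/otanibc/index.html")                               # → True
```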

== Issues ==

* Hidden-entry sites ''(Importance: '''Low''')'': A few sites do not use index.htm/index.html as their entry points; as a result, first-level directory access will fail to reach them.
** However, as long as other GeoCities sites link to them, they should be discoverable by the crawler.
** So the only problem is pages whose inlinks are all dead. There should be very few of those. If we want to be absolutely sure, we can run a diff between IA's current CDX and that from the crawl.
** Note that this is not a problem for the neighborhood sites, as we can enumerate their URLs.
* Deduplication ''(Importance: '''Low''')'': If we are going to release a torrent as we did with GeoCities, it may be worth deduplicating. Most likely it won't make a major difference.
* Final Snapshot ''(Importance: '''Moderate''')'': Page contents may still change between now and March 31, 2019, so we need to do another crawl when the time is near.
** Note that a lot of users will be setting up 301/302 redirects before the server shuts down. According to Yahoo, we'll have until Sep 30, 2019 to record those 301/302s.
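The CDX comparison suggested above amounts to a set difference over the URL-key column. A hypothetical helper (real CDX lines carry more fields; only the first column is used here, and the sample lines are made up):

```python
# Diff IA's CDX against our crawl's CDX to find pages we missed
# (e.g. hidden-entry pages whose inlinks are all dead).
def missed_urls(ia_cdx_lines, crawl_cdx_lines):
    ia = {line.split()[0] for line in ia_cdx_lines if line.strip()}
    ours = {line.split()[0] for line in crawl_cdx_lines if line.strip()}
    return sorted(ia - ours)

# Illustrative lines only (url-key and timestamp columns of a CDX file).
ia_cdx = ["jp,geocities)/foo/ 20180101000000",
          "jp,geocities)/foo/hidden.html 20170601000000"]
our_cdx = ["jp,geocities)/foo/ 20181109000000"]
missed_urls(ia_cdx, our_cdx)  # → ["jp,geocities)/foo/hidden.html"]
```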
  
 
[[Category:GeoCities]]
[[Category:Web hosting]]
 

Latest revision as of 03:14, 4 April 2019
