Splinder

URL http://www.splinder.com/
http://www.us.splinder.com/
http://archive.org/details/archiveteam-splinder

Project status Offline
Archiving status Saved! (in part)
Project source Unknown
Project tracker Unknown
IRC channel #archiveteam

Splinder.com was for a while the main blog hosting company in Italy (see Wikipedia:it:Splinder). It was founded in 2001 and hosted about half a million blogs and over 55 million pages. Since 8 November 2011, a warning on the home page has said that no new PRO accounts have been created since 1 June. The company confirmed that the website would close on the 24th.[1] Later, it issued an official statement saying that the closure would happen on January 31, 2012.[2] According to our tracker, we have downloaded or assigned all users, but there were some errors and the dataset still has to be checked; in the meantime it has been uploaded to archive.org.

Archiving status

http://archive.org/details/archiveteam-splinder contains a set of items, each holding a tar chunk of about 50 GB. Together the chunks make up a single directory structure with all the Splinder data that team members downloaded and uploaded to batcave before the closure. Each item also contains a list of all the files in the directories that were meant to go into that item's tar.

Unfortunately, this unlucky archiving project suffered one more incident when the data was assembled and uploaded to archive.org: most of the data for the 17th item was not put in the tar and the files were deleted, so about 48 GB of data is lost. Moreover, item 8 contains no tar, for unknown reasons.

The original directories contained 1164439 users, out of 1337433 downloaded by team members as reported below (data extracted from the tracker, may contain some duplicate downloads), that is 87 %.

What can no longer be done, so the data is lost forever:

  • download users which had problems during download (for instance, domain names not following domain rules, with weird characters in them, or with #splinder_noconn.html errors) or during upload (for instance, users with special characters in usernames),
  • redownload users claimed on the tracker but not downloaded,
  • redownload other users downloaded but not uploaded to our server because the downloader disappeared.

What still must be done:

  • if you still have some missing Splinder data you didn't upload to the ArchiveTeam server, please tar and upload it directly to archive.org and ask someone to put it in the Splinder collection;
  • if you have uploaded your data but still have it locally, check it for files which got lost in the above-mentioned 17th chunk disaster:
    • run the following commands in your splinder working directory (which has data/ subdir):
      wget http://archive.org/download/archiveteam-splinder-00000017/00000017.txt -O splinder-missing.txt
      cut -d/ -f2-6 splinder-missing.txt | uniq | sed -e 's,^,data/,' > splinder-missing-paths.txt
      tar cvf splinder-for-sketchcow.tar -T splinder-missing-paths.txt
    • upload splinder-for-sketchcow.tar somewhere,
    • notify SketchCow on the #archiveteam IRC channel on EFnet so that he can add it to the item,
    • if he's not around to get your file, add the URL below for the record, or upload it to a temporary archive.org item so that it doesn't get lost.
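To see what the cut/uniq/sed pipeline above does, here is a minimal sketch run on a made-up listing. The sample lines and the `missing_paths` name are hypothetical (the real layout of 00000017.txt is not reproduced here); only the pipeline itself comes from the instructions.

```shell
#!/bin/sh
# Turn listing lines into local data/ paths: keep path components 2-6,
# collapse adjacent duplicates, and prefix each result with "data/".
missing_paths() {
    cut -d/ -f2-6 | uniq | sed -e 's,^,data/,'
}

# Hypothetical sample lines; the real listing format may differ.
printf '%s\n' \
    'chunk/it/a/al/ali/alice/index.html' \
    'chunk/it/a/al/ali/alice/wget.log' \
    'chunk/us/b/bo/bob/bob/index.html' | missing_paths
```

On this sample the function prints data/it/a/al/ali/alice and data/us/b/bo/bob/bob. Note that uniq only merges adjacent duplicates, so the pipeline relies on the listing keeping each user's files together.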

Ended grab info

Upload status

For the time being: please ignore any errors caused by special characters in usernames (| ^ etc.), we'll get those profiles later.

Uploaded to batcave?
Phase 1
Downloader Count Status
closure 254869 Uploaded
kenneth 206696 Uploaded
ndurner 177665 Uploaded
Nemo 111340 Uploaded with errors, some incomplete
donbex 71562 Uploaded; all special char profiles fixed, some incomplete
dnova 68740 Uploaded; all special char profiles fixed
underscor 58774
Wyatt 54525 Mostly Uploaded; need to get special character profiles up; redoing a large batch that failed checks.
crawl336 45785
Angra 35752
cameron_d 26357 Uploaded, I believe
db48x 23120 Uploaded, three profiles not uploaded
yipdw 18789 Uploaded
crawl338 17783
crawl337 16784
crawl334 15897
Coderjoe 13749 Uploaded from both machines all profiles which did not have .incomplete (fixed some backslashes in profile directory names)
bsmith093 13194 Uploaded, no special char profiles checked/fixed
DoubleJ 10301 Uploaded from all machines w/ no errors
crawl339 9026
anonymous 8653
kennethreitz 8287
alard 7299 Uploaded, one error
dashcloud 6803 Uploading
crawl333 6292
spirit 6282 Uploaded
crawl335 6106
Paradoks 5890 Uploaded
koon 5029
chronomex 4913 Partially Uploaded, moved house and has yet to get computers running
VMB 4620
shoop 4461
marceloantonio1 2927 Uploaded
undercave 2508
DFJustin 2456 Uploaded, may have errors
proub 1178
Hydriz 842 Uploaded
canUbeatclosure 669
tef 440 Uploaded
arima 347
NotGLaDOS 259 Uploaded
sarpedon 105
pberry 89
Wyattq 84 See: Wyatt
soultcer 74 Redid incomplete profiles and Uploaded
Konklone 56
PepsiMax 12
mareloantonio1 10 Uploaded
hrbrmstr 9
sente 7
rebiolca 6
2 5
Wyatt-B 3 See: Wyatt
Wyatt-A 2 See: Wyatt
asdf 2

How to help archiving

There is a distributed download script that gets usernames from a tracker and downloads the data.

Make sure you are on Linux and have curl, git, and a recent version of Bash. Your system must also be able to compile wget.

  1. Get the code: git clone https://github.com/ArchiveTeam/splinder-grab
  2. Get and compile the latest version of wget-warc: ./get-wget-warc.sh
  3. Think of a nickname for yourself (preferably use your IRC name).
  4. Run the download script:
    • To run a single downloader, run ./dld-client.sh "<YOURNICK>".
    • To run multiple downloaders (and thus use your bandwidth more efficiently), do either:
      • simply run as many copies of dld-client.sh as you like
      • run ./dld-streamer.sh <YOURNICK> <N>, where <N> is the number of concurrent downloads you want.
  5. To stop the script gracefully, run touch STOP in the script's working directory. It will finish the current task and stop.

Notes

  • Compiling wget-warc will require dev packages for the various libraries that it needs. Most questions have been about gnutls; install the gnutls-devel or gnutls-dev package with your favorite package manager.
  • Downloading one user's data can take between 10 seconds and several days.
  • The size of one user's data varies just as widely, from a few kB to several GB.
  • The downloaded data will be saved in the ./data/ subdirectory.
  • Download speeds from splinder.com are not that high (servers may be particularly overloaded during the European daytime because of additional traffic from people exporting their blogs). You can run multiple clients to speed things up.

Errors

  • There are some problems with subdomains containing dashes[3]: if they fail on your machine (reported: wget compiled with +nls), for now stop and restart the script, someone else will do those users (although they seem to fail in part anyway).
    Some such users: macrisa, -Maryanne-, it:SalixArdens, it:MCris, it:7lilla, it:thepinkpenguin, it:bimbambolina, it:lazzaretta, it:Hedwige, it:N4m3L3Ss, it:Barbabietole_Azzurre, it:celebrolesa2212, it:buongiono.mattina, it:DarkExtra, it:-slash-, it:marlene1, it:Ohina, us:XyKy, us:Naluf, it:elisablu, it:*JuLs*, it:RikuSan, it:Nasutina
  • There are also some problems with upload-finished.sh because of some inconsistencies in escaping special characters, e.g. [4]; remember not to delete those directories without fixing/uploading them.
  • The script looks for errors in English, so it's better if you set wget-warc to use English. Otherwise, errors like these won't be detected and the script will mark users which failed as done. Please run fix-dld.sh to fix those users, after changing if grep -q "ERROR 50" to match your localised output.
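The last point depends on wget's messages being in English. The following is a minimal sketch of that kind of check, not the actual code of fix-dld.sh. `log_has_50x` is a hypothetical helper, and the commented usage line assumes that the client script passes its environment on to wget-warc unchanged; setting `LC_ALL=C` is the usual way to get untranslated messages from GNU tools.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a wget log contains an English
# "ERROR 50" line, the pattern mentioned above for fix-dld.sh.
log_has_50x() {
    grep -q "ERROR 50" "$1"
}

# Assumed usage: force untranslated wget messages before starting a client.
# LC_ALL=C ./dld-client.sh "<YOURNICK>"
```

A log written in another language would not contain the English keyword, which is why such failures go unnoticed unless the pattern is adapted.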

splinder_noconn.html errors

Please check your wget logs for the presence of a file named splinder_noconn.html. This is a transient maintenance page that has appeared in some downloads, but wget cannot detect it as an error, because the page isn't returned with a status code indicating that an error occurred.

Accounts whose logs mention this page may have to be re-fetched.
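A minimal sketch of such a log check follows. `find_noconn_logs` is a hypothetical helper, and the assumption that the logs are files named `*.log` under the data directory is ours, so adjust the pattern to your own layout.

```shell
#!/bin/sh
# Hypothetical helper: list wget logs under a directory that mention the
# splinder_noconn.html maintenance page. The "*.log" name is an assumption.
find_noconn_logs() {
    find "$1" -type f -name '*.log' -exec grep -lF 'splinder_noconn.html' {} +
}

# Assumed usage from the script's working directory:
# find_noconn_logs data
```

Each path printed belongs to a download that fetched the maintenance page at least once.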

Uploading your data

  • To upload the data you've downloaded, first contact SketchCow on IRC for an rsync slot. Once you have that you can run the ./upload-finished.sh script to upload your data. For example, run this in your script directory: ./upload-finished.sh batcave.textfiles.com::YOURNICK/splinder/
  • The script will upload only completed users. To check how much space the incomplete users are taking, without killing your disk, you can use ionice -c 3 find -name .incomplete -printf "%h\0" | ionice -c 3 du -mcs --files0-from=- in your splinder-grab directory.

Status

There is a real-time dashboard where you can check the progress.

Site structure

The users are identified by their usernames. Fortunately, the site provides a list of all users. Usernames are not case-sensitive, but there is a case preference.

Example URLs

User profile: http://www.splinder.com/profile/<<username>>

Example profile:
http://www.splinder.com/profile/difficilifoglie

View count on profile page:
http://www.splinder.com/ajax.php?type=counter&op=profile&profile=Romanticdreamer

Example of friends list paging: (160 per page, starting at 0)
http://www.splinder.com/profile/difficilifoglie/friends
http://www.splinder.com/profile/difficilifoglie/friends/160

Inverse friends (probably also paged):
http://www.splinder.com/profile/difficilifoglie/friendof

Link to blog: (note: not always the same as the username)
http://difficilifoglie.splinder.com/
http://learnonline.splinder.com/

Photo:
http://www.splinder.com/profile/difficilifoglie/photo
http://www.splinder.com/mediablog/wondermum/media/24544805

Video:
http://www.splinder.com/profile/wondermum/video
http://www.splinder.com/mediablog/wondermum/media/25737390

Audio:
Not a separate user feed, but only accessible via mediablog
http://www.splinder.com/mediablog/learnonline/media/25727030

Mediablog: combination of the audio + video + photo lists
http://www.splinder.com/mediablog/learnonline
(16 per page, starting at 0)
http://www.splinder.com/mediablog/learnonline/16
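The offset-based paging of friends lists (160 per page) and mediablogs (16 per page) can be sketched as follows. `paged_urls` is a hypothetical helper, and the total number of items is assumed to be known in advance, which the pages above do not guarantee.

```shell
#!/bin/sh
# Hypothetical helper: print the URL of every page of a listing whose
# pages are addressed by an item offset appended to the base URL.
# $1 = base URL, $2 = items per page, $3 = total number of items.
paged_urls() {
    base=$1
    per_page=$2
    total=$3
    echo "$base"                      # first page: offset 0, no suffix
    offset=$per_page
    while [ "$offset" -lt "$total" ]; do
        echo "$base/$offset"
        offset=$((offset + per_page))
    done
}

# Friends list with an assumed total of 400 friends (160 per page).
paged_urls "http://www.splinder.com/profile/difficilifoglie/friends" 160 400
```

With these numbers the function prints the base URL followed by the /160 and /320 pages. Comment lists use a ?from= query parameter instead of a path suffix, so they would need a different separator.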

Mediablog has PowerPoint, Word files:
http://www.splinder.com/mediablog/learnonline/media/25641346
http://www.splinder.com/mediablog/learnonline/media/25546305
http://www.splinder.com/mediablog/learnonline/media/21901634
http://www.splinder.com/mediablog/learnonline/media/24875290

User avatar: grab url from profile page

Photo file: grab url from photo page and remove _medium to get original picture
http://files.splinder.com/d5e492233631af39212268593afca02d_square.jpg
http://files.splinder.com/d5e492233631af39212268593afca02d_medium.jpg
http://files.splinder.com/d5e492233631af39212268593afca02d.jpg
older photos do not have this structure, different ids for each size:
http://www.splinder.com/mediablog/babboramo/media/17359043
http://files.splinder.com/13b615ccbd75354ee4e0d973da66c2b2.jpeg
http://files.splinder.com/770d7b9ecac27083d9204af327ebe743.jpeg
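For photos that follow the newer naming scheme, deriving the original file's URL is a simple string rewrite. The sketch below uses a hypothetical helper, `original_photo_url`; older photos with a separate id per size do not match the pattern and are printed unchanged, so their originals cannot be derived this way.

```shell
#!/bin/sh
# Hypothetical helper: strip a _square or _medium size suffix from a
# files.splinder.com URL to get the original file's URL.
# URLs without such a suffix are printed unchanged.
original_photo_url() {
    printf '%s\n' "$1" | sed -e 's/_square\(\.[^./]*\)$/\1/' \
                             -e 's/_medium\(\.[^./]*\)$/\1/'
}

original_photo_url "http://files.splinder.com/d5e492233631af39212268593afca02d_medium.jpg"
```

The call above prints http://files.splinder.com/d5e492233631af39212268593afca02d.jpg.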

PowerPoint, Word files: grab url from media page
http://files.splinder.com/46dbf3d5a0b12e490f81ddb8444b4fad.ppt
http://files.splinder.com/ab3ce16c850ac530351d9df0937152c7.pdf

Video items: grab url from media page
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_square.jpg
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_thumbnail.jpg
http://files.splinder.com/8f5caff20685648bacd4ce1acf90e645_small.flv
note: square, thumbnail, small are not always available, check flashvars for vidpath, imgpath
http://www.splinder.com/mediablog/babboramo/media/13131052
http://files.splinder.com/e067653e1532e55ee208605fcb84361a.flv
http://files.splinder.com/f56060b7fef139f03b72e06ca9fcba55.jpeg

Audio items: grab url from media page, flashvars
sometimes there is a _thumbnail, remove that to get a better quality
http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef_thumbnail.mp3
http://files.splinder.com/a5043c34a12ee66f5ad995ffd14493ef.mp3

Comments on blog posts:
http://www.splinder.com/myblog/comment/list/25742358
on some, but not on all blogs, those comments are also included in the blog page
http://dal15al25.splinder.com/post/25740180
http://soluzioni.splinder.com/post/2802227/blog-pager-su-piu-righe
http://soluzioni.splinder.com/post/25737683/avviso-per-gli-utenti-ce-da-preoccuparsi/
http://civati.splinder.com/post/25742977
pagination: see media comments

Comments on media items:
http://www.splinder.com/media/comment/list/21254470
http://www.splinder.com/media/comment/list/21254470?from=50
(50 per page, starting at 0)
number of comments is on the media page
http://www.splinder.com/mediablog/danspo/media/21254470


Blog urls:
the blogs have content from their own subdomain, but also from
files.splinder.com
www.splinder.com/misc/ (topbar css, gif)
www.splinder.com/includes/ (js)
www.splinder.com/modules/service_links/ (images)
syndication.splinder.com

links to www.splinder.com that should NOT be followed:
 /myblog/
 /users/
 /media/
 /node/
 /profile/
 /mediablog/
 /community/
 /user/
 /night/
 /home/
 /mysearch/
 /online/
 /trackback/

wget-warc --mirror --page-requisites --span-hosts --domains=learnonline.splinder.com,files.splinder.com,www.splinder.com,syndication.splinder.com --exclude-directories="/users,/media,/node,/profile,/mediablog,/community,/user,/night,/home,/mysearch,/online,/trackback,/myblog/post,/myblog/posts,/myblog/tags,/myblog/tag,/myblog/view,/myblog/latest,/myblog/subscribe" -nv -o wget.log "http://learnonline.splinder.com/"
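The command above is written for one blog. A minimal sketch of how it could be parameterised for any blog subdomain is shown below. `blog_mirror_command` and `EXCLUDES` are hypothetical names, and the function only prints the command so that it can be reviewed before running it.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the wget-warc command above for a
# given blog subdomain, so that it can be reviewed before use.
EXCLUDES="/users,/media,/node,/profile,/mediablog,/community,/user,/night,/home,/mysearch,/online,/trackback,/myblog/post,/myblog/posts,/myblog/tags,/myblog/tag,/myblog/view,/myblog/latest,/myblog/subscribe"

blog_mirror_command() {
    blog=$1    # blog subdomain label, e.g. "learnonline"
    printf 'wget-warc --mirror --page-requisites --span-hosts'
    printf ' --domains=%s.splinder.com,files.splinder.com,www.splinder.com,syndication.splinder.com' "$blog"
    printf ' --exclude-directories="%s"' "$EXCLUDES"
    printf ' -nv -o wget.log "http://%s.splinder.com/"\n' "$blog"
}

blog_mirror_command learnonline
```

Remember that the blog subdomain is not always the same as the username, so the argument has to come from the blog link on the profile page.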

