Talk:Yahoo! Groups

From Archiveteam
Jump to: navigation, search

LeighRoberts' notes and first impressions

Individual Warrior runner advice

  • Find out what the maximum item is. Your available disk must be at least twice that big + 10%.
  • Find out what the median item is. Add 20%. Multiply by 2, because of the WARC file! Divide your available disk space by that.
  • Workers will get stuck on the largest jobs, so even that sizing might not be conservative enough.
  • The Docker Warrior needed about 50 MB/container of memory. I did not get figures on yahoo-group-archiver, but I imagine it would be similar.
  • Scale slowly - don’t fire off all your containers at once, because they all need to load the same requirements.
  • Start about 1/4 of your planned containers and wait 15-30 minutes to check CPU and memory use before starting the next 1/4.
  • Check your disk space regularly
  • Grafana setup advice for this?

Specific advice for this project that would have been useful

  • Recommend to people to have two categories of group-joiner accounts: one for public groups, one for private groups we hope to preserve (project specific)
  • Maximum object size available to all (Grafana?)

Technical improvements

  • queue management is super important
  • way to add new tracker capacity as needed
  • Make reports.pl part of the Dockerfile and run automatically
  • Instructions don’t *exactly* work for Python
    • apt install pip3
    • location of the run-pipeline3
    • might need to pip3 install setuptools first to get requirements.txt to run all the way

(probably most important) Organizational improvements

  • have clear information available to all participants about who is in charge of what, who can perform operations (changing limits on trackers/targets, seeing and resetting jobs), what times they’re generally available, contact information (if not publicly-distributed, at least to two or three others, with them as alternate PoC’s)
  • determine the maximum number of clients the tracker(s) can feasibly serve (it’s counterproductive and demoralizing for too many clients to be “competing” for jobs)