Dev/Staging

The staging servers accept WARC files, package them up, and upload them to the Internet Archive. This guide is useful for those who are setting up Rsync targets.

Note that a Dockerized version is also available. You might find that easier than setting all this up. If so, install it and skip to the "Testing the target" section.

Installation will cover:

  • Environment: Ubuntu/Debian
  • Tools:
    • Screen
    • Rsync
    • Git

Set up the Rsync target

The Rsync target consists of disk space, Rsync, and WARC packing scripts in a dedicated user account.

Create a system user account dedicated to the Rsync target:

sudo adduser --system --group --shell /bin/bash archiveteam

Log in as archiveteam:

sudo -u archiveteam -i

Create a place to store the uploads:

mkdir -p PROJECT_NAME/incoming-uploads/

You may log out of archiveteam at this point.

Rsync

You will need to install Rsync:

sudo apt-get install rsync

Once Rsync is installed, you will need to edit its configuration file, /etc/rsyncd.conf. If no rsyncd.conf exists in /etc, copy the example from /usr/share/doc/rsync/examples/rsyncd.conf to /etc/rsyncd.conf.

Rsync uses a concept of "modules", which can be thought of as namespaces. If you have copied the example file, you can modify the example ftp module to fit your new project. You may want to name the module after the project.

You will also need to include:

  • path = /home/archiveteam/PROJECT_NAME/incoming-uploads/
  • read only = no
  • write only = yes
  • uid = archiveteam
  • gid = archiveteam
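Put together, the module section might look like the fragment below. This is only a sketch: `exampleproject` is a hypothetical stand-in for PROJECT_NAME, and the script writes the stanza to a temporary file for review instead of touching /etc/rsyncd.conf directly.

```shell
# Hypothetical project name; replace with your own PROJECT_NAME.
PROJECT_NAME="exampleproject"

# Write the module stanza to a temporary file so it can be reviewed first.
fragment="$(mktemp)"
cat > "$fragment" <<EOF
[${PROJECT_NAME}]
    path = /home/archiveteam/${PROJECT_NAME}/incoming-uploads/
    read only = no
    write only = yes
    uid = archiveteam
    gid = archiveteam
EOF

# Show the generated stanza.
cat "$fragment"
```

After checking the output, the stanza can be appended to the real configuration, for example with `sudo tee -a /etc/rsyncd.conf < "$fragment"`.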

Make Rsync start as a daemon at boot by editing /etc/default/rsync. Ensure it reads:

RSYNC_ENABLE=true
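A quick way to confirm the setting is to grep for it. The helper below is a small sketch; the function name `rsync_daemon_enabled` is made up for this example and is not part of Rsync.

```shell
# Succeeds if the given defaults file contains the line RSYNC_ENABLE=true.
rsync_daemon_enabled() {
  grep -Eq '^[[:space:]]*RSYNC_ENABLE=true[[:space:]]*$' "$1"
}

DEFAULTS_FILE="/etc/default/rsync"
if [ -r "$DEFAULTS_FILE" ] && rsync_daemon_enabled "$DEFAULTS_FILE"; then
  echo "rsync daemon is enabled in $DEFAULTS_FILE"
else
  echo "RSYNC_ENABLE=true not found in $DEFAULTS_FILE"
fi
```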

Start up the Rsync daemon:

sudo invoke-rc.d rsync start

The Megawarc Factory

The Megawarc Factory is a set of scripts that package and bundle up all the uploaded WARC files that are received.

If Git, Curl, or Screen is not yet installed, install them now:

sudo apt-get install git curl screen

Log in as archiveteam and download the scripts needed:

git clone https://github.com/ArchiveTeam/archiveteam-megawarc-factory.git
cd archiveteam-megawarc-factory/
git clone https://github.com/alard/megawarc.git
cd

Let's begin to populate the configuration file:

cp archiveteam-megawarc-factory/config.example.sh PROJECT_NAME/config.sh
nano PROJECT_NAME/config.sh

Going through the config.sh:

  • MEGABYTES_PER_CHUNK sets how big the megawarc files are, in megabytes. Typically it should be set to the equivalent of 50GB, but if you really don't have the space, you can use smaller files such as 10GB.
  • IA_AUTH is your Internet Archive S3-like API authentication keys.
  • IA_COLLECTION, IA_ITEM_TITLE, IA_ITEM_PREFIX, and FILE_PREFIX should all have the todos replaced with the project name.
  • FS1_BASE_DIR should be set to /home/archiveteam/PROJECT_NAME/
  • FS2_BASE_DIR should be set to the same as above or to another location.
  • COMPLETED_DIR should be left empty (i.e., "") if the uploaded file is to be deleted.
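As a worked example of the chunk size, the sketch below expresses 50GB in megabytes with shell arithmetic and fills in the directory settings described above. It assumes 1GB = 1024MB and uses a hypothetical project name, `exampleproject`; the remaining settings (IA_AUTH and the IA_* names) are left out because they depend on your account and project.

```shell
# Hypothetical project name; replace with your own PROJECT_NAME.
PROJECT_NAME="exampleproject"

# 50 GB expressed in megabytes (assuming 1 GB = 1024 MB): 51200.
MEGABYTES_PER_CHUNK=$((1024 * 50))

# Both base directories point at the project directory in this sketch.
FS1_BASE_DIR="/home/archiveteam/${PROJECT_NAME}/"
FS2_BASE_DIR="$FS1_BASE_DIR"

# Empty means uploaded files are deleted afterwards, as described above.
COMPLETED_DIR=""

echo "chunk size: ${MEGABYTES_PER_CHUNK} MB"
```

For the smaller 10GB option, the same arithmetic gives `$((1024 * 10))`, or 10240.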

Ask someone politely about getting permission to upload your files to the collection archiveteam_PROJECT_NAME. You can ask on #archiveteam on hackint.

Let's run the Megawarc Factory. First, create a sentinel file:

cd PROJECT_NAME
touch RUN

You can run the Megawarc Factory in Screen. The three scripts will run in separate command shells within one Screen session:

screen
../archiveteam-megawarc-factory/chunk-multiple
CTRL+A c
ionice -c 2 -n 6 nice -n 19 ../archiveteam-megawarc-factory/pack-multiple
CTRL+A c
../archiveteam-megawarc-factory/upload-multiple
CTRL+A d

Here are a few Screen pointers:

  • screen -r will resume an existing screen session
  • CTRL+A c creates a new command window
  • CTRL+A SPACE switches to the next window
  • CTRL+A " shows you a list of windows
  • CTRL+A d leaves, or detaches, the screen session

To stop the Megawarc Factory, remove the sentinel file:

rm RUN
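The sentinel-file pattern can be illustrated with a small loop that keeps working only while the file exists. This is a sketch of the general idea, not the Megawarc Factory's actual code: `run_worker` is a hypothetical function, and the demo removes the sentinel itself after three passes so that it terminates.

```shell
# Hypothetical worker: loops while the sentinel file exists, then reports
# how many passes it completed.
run_worker() {
  sentinel="$1"
  passes=0
  while [ -f "$sentinel" ]; do
    passes=$((passes + 1))
    # Stand-in for real work. For the demo, remove the sentinel after
    # three passes, which is what "rm RUN" would do from another shell.
    if [ "$passes" -ge 3 ]; then
      rm -f "$sentinel"
    fi
  done
  echo "$passes"
}

workdir="$(mktemp -d)"
touch "$workdir/RUN"
completed="$(run_worker "$workdir/RUN")"
echo "worker stopped after $completed passes"
```

Because a loop like this only checks for the file between passes, removing the sentinel lets work in progress finish instead of interrupting it.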

You can log out of the archiveteam account now.

Testing the target

To make sure the target is working, try a command like this, substituting your own module name for ateam-airsync:

rsync -rltvv --progress <file here> rsync://localhost/ateam-airsync/<username>

Explanation of all the options:

  • -r: Recurse through directories.
  • -l: Copy symlinks as symlinks.
  • -t: Preserve modification times.
  • -v: Verbose mode. The more -v flags, the more verbose.
  • --progress: Show progress during the transfer.
  • <file here>: The filename to send.
  • rsync://localhost/ateam-airsync/<username>: The destination. Rsync defaults to port 873, ateam-airsync is the module name, and <username> is the username to use.

Make sure the file or files were copied successfully.
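One way to check a copy is to compare checksums of the sent and received files. The sketch below assumes `sha256sum` from coreutils is available (it is on standard Ubuntu/Debian installs); `same_checksum` is a hypothetical helper, and the demo uses temporary files in place of a real upload.

```shell
# Succeeds if both files have the same SHA-256 checksum.
same_checksum() {
  src_sum="$(sha256sum "$1" | cut -d ' ' -f 1)"
  dst_sum="$(sha256sum "$2" | cut -d ' ' -f 1)"
  [ "$src_sum" = "$dst_sum" ]
}

# Demo with temporary files standing in for the sent and received copies.
demo_dir="$(mktemp -d)"
printf 'test payload\n' > "$demo_dir/sent.warc.gz"
cp "$demo_dir/sent.warc.gz" "$demo_dir/received.warc.gz"

if same_checksum "$demo_dir/sent.warc.gz" "$demo_dir/received.warc.gz"; then
  echo "checksums match"
else
  echo "checksums differ"
fi
```

On a real target, the second argument would be the copy that arrived under the project's incoming-uploads directory.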

