Difference between revisions of "UC Berkeley Course Captures"
Revision as of 19:04, 6 March 2017
|UC Berkeley Course Captures|
|Archiving status||In progress...|
|IRC channel||(on EFnet)|
The University of California, Berkeley is planning to remove their public lecture recordings ("course captures", audio and video) and put them behind authentication. The planned date for the change is 2017-03-15.
The removal will affect at least these public channels:
- http://webcast.berkeley.edu/series (index of links to YouTube and iTunes)
The #Shutdown notice makes it sound as if YouTube videos will remain online at youtube.com, but will no longer be publicly listed. The new hosting behind authentication will lose playlist information (which links individual lecture videos together for one course). Therefore the pressing thing to do before 2017-03-15 (as regards the YouTube content) is to download indexes of videos and playlists—see #Indexes of files.
On the other hand, "iTunesU Course Capture content will be removed." It's not clear if iTunes content will continue to exist, even behind authentication.
Proposed archiving format:
- Sample: https://archive.org/details/TEST2_UCB_CS195_SP2015
- One item per YouTube playlist
- Identifier includes the course number and semester (there's a list of course subject abbreviations at http://guide.berkeley.edu/courses/)
- youtube-dl --dump-json output as youtube-dl.json
- Videos in the preview are YouTube's highest-quality muxed format (format 22?)
- Video file naming convention is %(playlist_index)s-%(title)s.%(ext)s (in youtube-dl's output template format)
- All other formats stored in tar files, one file per format (maybe overkill, as these are derived anyway?)
- Include stderr output of youtube-dl, in order to have a record of videos that aren't accessible (e.g.,
ERROR: Zrzh3Fz8DhQ: YouTube said: This video contains content from BBC Worldwide, who has blocked it on copyright grounds.)
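To illustrate the naming convention above, here is a minimal Python sketch. youtube-dl output templates are based on Python's %-style string formatting, so the template can be expanded against a dict of fields. The field values below are made up for illustration. Real youtube-dl also sanitizes titles and zero-pads playlist_index, which this sketch does not reproduce.

```python
# Expand the proposed file naming template for one video.
# The field values are hypothetical; youtube-dl fills them in from video metadata.
TEMPLATE = "%(playlist_index)s-%(title)s.%(ext)s"


def expand(template, fields):
    """Apply Python %-formatting, which youtube-dl output templates are based on."""
    return template % fields


example = {"playlist_index": "03", "title": "Lecture 3", "ext": "mp4"}
print(expand(TEMPLATE, example))
```

Putting the playlist index first means that files sort in lecture order within each item.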
There's an existing https://archive.org/details/ucberkeleylectures collection to which the newly archived files could perhaps be added.
- October 2016: https://np.reddit.com/r/DataHoarder/comments/5804np/youtube_archiver_and_uc_berkeley/
And lastly I finished downloading all of the UC Berkeley. Videos, any transcriptions/captions and all other video info. I made a torrent as they are the most efficient at sharing. All 3.1TB of it, it's not hosted on the fastest server, but with a few seeds it should go quick enough. If you want to keep this great learning resource alive, feel free to seed or partial seed, I will seed it for as long as I can. For video listings please look at this list.
- March 2017: https://www.reddit.com/r/YouTubeBackups/comments/5x4kv8/ucberkeley_to_remove_10k_hours_of_lectures_posted/
Currently pulling down to a few locations in parallel at 720p.
- According to #berklost IRC, "Waybackmachine is already grabbing these." Additionally, webcast.berkeley.edu has been crawled by archivebot: http://archive.fart.website/archivebot/viewer/domain/webcast.berkeley.edu
Scripts for extracting YouTube metadata in an Internet Archive–compatible CSV format (the repo also includes #Indexes of files):
git clone https://repo.eecs.berkeley.edu/git-anon/users/fifield/archive-ucberkeley-webcast.git
How to download a playlist
This is how to download all the videos of a playlist in all available formats.
Get a list of playlist titles, IDs, and last line of video description (often lists the license):
gzip -dc indexes/youtube.com-user-UCBerkeley-playlists-20170301.json.gz | jq --compact-output '[.playlist_title,.playlist_id,.description|match(".*\\Z").string]' | uniq -c
Choose a playlist to download. Let's say it's
Make a directory for the download:
mkdir -p "$OUTDIR"
Extract just the JSON objects corresponding to this playlist:
gzip -dc indexes/youtube.com-user-UCBerkeley-playlists-20170301.json.gz | jq --compact-output "select(.playlist_id==\"$PLAYLIST\")" > "$OUTDIR/youtube-dl.json"
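If jq isn't available, the same selection can be sketched in Python. This assumes the index is gzipped JSON with one object per line (which is what youtube-dl --dump-json and jq --compact-output produce) and that each object has a playlist_id field, as the jq filter above implies. The function names are made up for this example.

```python
import gzip
import json


def select_playlist(index_path, playlist_id):
    """Yield the objects in a gzipped JSON-lines index that belong to one playlist."""
    with gzip.open(index_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if entry.get("playlist_id") == playlist_id:
                yield entry


def write_selection(index_path, playlist_id, out_path):
    """Write the matching objects one per line and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for entry in select_playlist(index_path, playlist_id):
            out.write(json.dumps(entry, separators=(",", ":"), ensure_ascii=False) + "\n")
            count += 1
    return count
```

A count of zero is a hint that the playlist ID was mistyped, or that the playlist is one of those missing from the 20170301 dump.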
Now download all the files. It may fail partway through; you can keep running it again and again until it finishes.
youtube-dl --ignore-errors --no-progress --fixup warn --all-formats --output "$OUTDIR"/'%(format_id)s/%(playlist_index)s-%(title)s.%(ext)s' "https://www.youtube.com/playlist?list=$PLAYLIST" 2>&1 | tee -a "$OUTDIR/youtube-dl.log"
If you only want to download the highest-quality file format, use --format=best in place of --all-formats in the youtube-dl command. By default (without any --format option), youtube-dl will use --format=bestvideo+bestaudio, which could locally mux together two separate video and audio streams, resulting in a file that never actually existed on YouTube.
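The log written by the download command above can be scanned for videos that failed. This is a rough sketch in Python. It assumes error lines look like the "ERROR: Zrzh3Fz8DhQ: YouTube said: ..." example quoted earlier on this page, so error lines without an 11-character video ID are ignored. The function name is made up.

```python
import re

# Matches lines like "ERROR: Zrzh3Fz8DhQ: YouTube said: ..." (format taken from the
# example on this page; other error lines may not carry a video ID and are skipped).
ERROR_RE = re.compile(r"^ERROR: ([0-9A-Za-z_-]{11}): (.*)$")


def failed_videos(log_lines):
    """Return a dict mapping video ID to the last error message seen for it."""
    failures = {}
    for line in log_lines:
        m = ERROR_RE.match(line.rstrip("\n"))
        if m:
            failures[m.group(1)] = m.group(2)
    return failures
```

Because the log is appended to on every run, a video that failed once and succeeded on a later run will still show up here, so treat the result as a list of videos to double-check, not a final list of losses.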
How to extract metadata
The metadata.py script converts the metadata in the JSON file into CSV format. It's currently hardcoded to always set collection=test_collection, so any uploads will not yet be permanent. You have to edit the script if you want to change that.
Think of an identifier for the item. A list of course subject abbreviations is at http://guide.berkeley.edu/courses/. Then run the metadata.py script.
./metadata.py "$IDENTIFIER" "$OUTDIR/youtube-dl.json" > "$PLAYLIST.metadata.csv"
How to upload files and set metadata
Note: you should probably hold off on uploading until there's a plan for naming conventions, etc.
First you have to upload a file (any file) to create the item, before you can set metadata. Important: you need to set the mediatype and collection metadata at this point, because they can't be changed later.
ia upload "$IDENTIFIER" "$OUTDIR"/youtube-dl.* --metadata "mediatype:movies" --metadata "collection:test_collection"
Now you can set the metadata. You'll be able to change this later if necessary.
ia metadata --spreadsheet "$PLAYLIST.metadata.csv"
Then upload video files of a certain format; e.g. for format 22, do:
ia upload "$IDENTIFIER" "$OUTDIR"/22/*
To get an idea of what format to upload, check which directories are the largest:
du -sh "$OUTDIR"/*
You can see short explanations of the available formats with:
jq '.formats[].format' "$OUTDIR/youtube-dl.json"
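The size comparison can also be done in Python, as a rough sketch. It assumes the download directory contains one subdirectory per format ID, as produced by the --output template in the download command above. Unlike du, it sums file sizes and not disk blocks, so the numbers will differ slightly. The function name is made up.

```python
import os


def format_sizes(playlist_dir):
    """Return {format_id: total_bytes} for each subdirectory of playlist_dir."""
    sizes = {}
    for entry in os.scandir(playlist_dir):
        if not entry.is_dir():
            continue  # skip youtube-dl.json, youtube-dl.log, etc.
        total = 0
        for root, _dirs, files in os.walk(entry.path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        sizes[entry.name] = total
    return sizes


def largest_first(sizes):
    """Sort (format_id, bytes) pairs from largest to smallest."""
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
```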
iTunes downloader script
This script isn't tested but might be a starting point.
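#!/usr/bin/env python3
# Untested sketch, ported to Python 3 (the original used Python 2's urllib and HTMLParser).
import json
import urllib.request
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Collect the video-preview-url attribute of each <tr> element.
        if tag == 'tr':
            url = dict(attrs).get('video-preview-url')
            if url is not None:
                self.urls.append(url)

def download(url, kwargs=None):
    # If kwargs is given, it is sent as a JSON request body; otherwise a plain GET is made.
    data = json.dumps(kwargs).encode('utf-8') if kwargs is not None else None
    u = urllib.request.urlopen(url, data)
    try:
        if u.getcode() != 200:
            raise IOError(u.getcode())
        return u.read()
    finally:
        u.close()

def main(videoId=None, audioId=None):
    # videoId and audioId are iTunes podcast IDs; where they come from is not shown here.
    urls = []
    for podcastId in (videoId, audioId):
        if podcastId:
            parser = MyHTMLParser()
            parser.feed(download("https://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=" + podcastId).decode('utf-8'))
            urls.extend(parser.urls)
    return urls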
Indexes of files
UCBerkeley channel ID: UCwbsWIWfcOL2FiUZ2hKNJHQ
UCBerkeley "uploads" playlist: UUwbsWIWfcOL2FiUZ2hKNJHQ
- JSON list of UCBerkeley channel playlists, scraped from the YouTube API: https://developers.google.com/apis-explorer/#p/youtube/v3/youtube.playlists.list?part=snippet&channelId=UCwbsWIWfcOL2FiUZ2hKNJHQ&maxResults=50. This is actually a concatenation of 9 separate API responses (max 50 playlists per response).
- List of YouTube videos, from a Reddit thread.
- youtube-dl JSON dump of https://www.youtube.com/user/UCBerkeley/playlists, representing 234 playlists and 6,632 videos. Beware: for whatever reason, youtube-dl didn't find all the playlists. Use playlists-20170303.json for the full list. It was produced like this:
youtube-dl --ignore-errors --dump-json https://www.youtube.com/user/UCBerkeley/playlists 2>youtube.com-user-UCBerkeley-playlists-20170301.stderr | gzip -9v >youtube.com-user-UCBerkeley-playlists-20170301.json.orig.gz
gzip -dc youtube.com-user-UCBerkeley-playlists-20170301.json.orig.gz | jq --compact-output 'del(.url,((.formats[]?,.requested_formats[]?)|(.url,.manifest_url,.fragments)))' | gzip -9v > youtube.com-user-UCBerkeley-playlists-20170301.json.gz
- youtube-dl stderr output for the preceding.
- youtube-dl JSON dump of https://www.youtube.com/user/UCBerkeley/videos, representing 9,886 videos, but without playlist information. This is slightly less than the 9,897 videos reported at https://www.youtube.com/playlist?list=UUwbsWIWfcOL2FiUZ2hKNJHQ. It was produced like this:
youtube-dl --ignore-errors --dump-json https://www.youtube.com/user/UCBerkeley/videos 2>youtube.com-user-UCBerkeley-videos-20170301.stderr | gzip -9v >youtube.com-user-UCBerkeley-videos-20170301.json.orig.gz
gzip -dc youtube.com-user-UCBerkeley-videos-20170301.json.orig.gz | jq --compact-output 'del(.url,((.formats[]?,.requested_formats[]?)|(.url,.manifest_url,.fragments)))' | gzip -9v > youtube.com-user-UCBerkeley-videos-20170301.json.gz
- youtube-dl stderr output for the preceding.
id354813951.tar.xz (missing a few videos), id354813951_2.tar.xz
- Index of iTunes files. To download the video/audio files for a lecture, first take the URLs containing downloadTrack from course.json and fetch them. Each response is some XML containing a second URL (and some metadata) which points to the actual download location. All these requests need to use the iTunes user agent string.
- List of 729 iTunes downloads that don't seem to be among the YouTube playlists (by comparison of course titles). It was produced like this:
jq -j '.items[]|(.id,"\t",.snippet.title,"\n")' indexes/playlists-20170303.json | sort | uniq > youtube.txt
tar -O -xf indexes/id354813951_2.tar.xz --wildcards -- '*/course.json' | jq -j '.storePlatformData."product-dv-product".results[]|(.id,"\t",.name,"\n")' | sort | uniq > itunes.txt
./dedup-youtube-itunes.py youtube.txt itunes.txt
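For the first step of the iTunes procedure in the list above (finding the URLs containing downloadTrack in course.json), here is a hedged Python sketch. The layout of course.json isn't described on this page, so the sketch simply walks the whole parsed document and collects every string that contains the marker. It does not fetch anything. The function name and the example.invalid URL are made up.

```python
import json


def find_urls(node, needle="downloadTrack"):
    """Recursively collect every string in a parsed JSON document that contains needle."""
    found = []
    if isinstance(node, str):
        if needle in node:
            found.append(node)
    elif isinstance(node, dict):
        for value in node.values():
            found.extend(find_urls(value, needle))
    elif isinstance(node, list):
        for value in node:
            found.extend(find_urls(value, needle))
    return found


# Tiny made-up document, only to show the call; real course.json files are much larger.
sample = json.loads('{"a": {"b": ["https://example.invalid/downloadTrack?id=1", "x"]}}')
print(find_urls(sample))
```

Each URL found this way would still have to be fetched with the iTunes user agent string, and the XML response parsed for the real download location.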
Sample commands for working with JSON indexes (using jq):
gzip -dc indexes/youtube.com-user-UCBerkeley-playlists-20170301.json.gz | jq -r .playlist_title | uniq
- Extract all playlist titles
gzip -dc indexes/youtube.com-user-UCBerkeley-playlists-20170301.json.gz | jq -r .playlist_id | uniq
- Extract all playlist IDs. Convert an ID into a URL as: https://www.youtube.com/playlist?list=id.
These extra playlists look like they contain more than one course and may merit special treatment:
YouTube videos without playlists
Nothing yet. Have to find out what videos are in videos.json but not in playlists.json, and deal with them separately.
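One possible way to do that comparison, sketched in Python. It assumes both dumps are gzipped JSON with one youtube-dl object per line and that each object has an id field holding the video ID. The function names are made up.

```python
import gzip
import json


def video_ids(index_path):
    """Return the set of video IDs in a gzipped JSON-lines youtube-dl dump."""
    ids = set()
    with gzip.open(index_path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                ids.add(json.loads(line)["id"])
    return ids


def videos_without_playlist(videos_path, playlists_path):
    """Return the sorted IDs present in the videos dump but absent from the playlists dump."""
    return sorted(video_ids(videos_path) - video_ids(playlists_path))
```

Keep in mind that the 20170301 playlists dump is known to be missing some playlists, so a difference computed against it will overstate the number of videos that really have no playlist.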
See itunes-minus-youtube-20170304.txt under #Indexes of files for a list of iTunes downloads that are not among the YouTube playlists.
tobbez is currently downloading the items listed in itunes-minus-youtube-20170304.txt.
• • •
Cathy Koshland, UC Berkeley vice chancellor for undergraduate education, sent this message to the campus community today:
Dear Campus Community,
I wanted to share with you the decision to restrict access to our legacy Course Capture (classroom lecture) videos and podcasts, currently searchable at webcast.berkeley.edu and found on YouTube and UC Berkeley iTunesU, to members of the campus community.
As part of the campus’s ongoing effort to improve the accessibility of online content, we have determined that instead of focusing on legacy content that is 3-10 years old, much of which sees very limited use, we will work to create new public content that includes accessible features. Our public legacy libraries on YouTube and iTunesU include over 20,000 publications. This move will also partially address recent findings by the Department of Justice which suggests that the YouTube and iTunesU content meet higher accessibility standards as a condition of remaining publicly available. Finally, moving our content behind authentication allows us to better protect instructor intellectual property from “pirates” who have reused content for personal profit without consent.
Since fall 2015 we have piloted publishing all of our Course Capture content behind CAS/CalNet authentication. This strategy has enhanced our ability to accommodate students and UC Berkeley community members who have demonstrated an accessibility need, and we have concluded that authentication is an intervention that is appropriately responsive to the Berkeley community.
We will continue to evaluate the role of online Course Capture and distribution in tandem with advances in technology befitting the No. 1 public institution in the country. Berkeley will maintain its commitment to sharing content to the public through our partnership with EdX (edx.org). This free and accessible content includes a wide range of educational opportunities and topics from across higher ed.
Beginning March 15, 2017, access to iTunesU course content will be suspended. On the same day we will begin the process of moving the publicly offered YouTube content made from the current legacy channel [youtube.com/ucberkeley] to a new authentication login required channel. The entire process is expected to take three to five months. During this time the ETS team will migrate the videos into the new channel behind CalNet/CAS authentication. Berkeley users seeking to view this older content will be able to access it by logging into YouTube with their bConnected/Google-supported identity.
To help manage the instructional impact, instructors with legacy content have been contacted. Instructors utilizing the ETS Course Capture service since fall 2015 will experience no changes in viewing or accessing content.
Enrolled Berkeley students requiring accommodations will continue to receive support through the Disabled Students Program.
Finally, as we continue to strive for inclusion and effective teaching and learning for all members of the campus community, we encourage you to reference a new campus website designed to help instructors identify best practices and techniques in creating accessible course content for all users: accesscontent.berkeley.edu.
For additional information, please review this FAQ document.
2017-02-24
• • •
Here is additional information to assist the campus community and the public with upcoming changes to UC Berkeley’s library of legacy public Course Capture (classroom lecture) content from webcast.berkeley.edu, located on YouTube and UC Berkeley iTunesU.
- Who uses this content? How much of the content is used/watched?
- Course recordings are a study-tool for current students. Results from a recent review of our legacy (2006-2015) public course recordings on YouTube show that the average video is watched for less than eight minutes.
- Who are the “pirates” mentioned in the CalMessage?
- Pirates is a term used to describe websites that embed YouTube content without the permission of the original copyright holder for profit. UC Berkeley legacy Course Capture content has been discovered on for-profit websites, which use either a subscription fee or on-page advertising.
- Why now? Is this related to the DOJ letter?
- UC Berkeley stopped posting course lecture videos publicly through webcast.berkeley.edu in 2015 as a way to reduce costs and increase adoption. However, we left legacy content from 2006-2015 in place. The Department of Justice letter indicates that they believe our legacy Course Capture content from webcast.berkeley.edu and located on YouTube and iTunesU is in violation of the Americans with Disabilities Act. We are removing the legacy webcast.berkeley.edu content from public access to focus on making future public content more accessible. Instructors are encouraged to reference accesscontent.berkeley.edu for best practices and resources for making course content accessible.
- If we don’t add captions and descriptions, what happens?
- Failure to meet the expectations of the Department of Justice could mean potential legal and financial ramifications.
- What about current students who need captioning?
- ETS and the Disabled Students Program (DSP) have been partnering over the last several years to identify courses requiring captioning based on student need. The partnership and support of students working with DSP will continue.
- What will happen to the recordings?
- Beginning March 15, 2017, iTunesU Course Capture content will be removed. You may continue to use/download course capture content until that date. Other content in this location such as events, KALX and Public Affairs content will remain available after March 15. On the same day ETS will begin moving the publicly offered YouTube course capture content from the current legacy channel [youtube.com/ucberkeley] to a new authentication login-required channel. The entire process is expected to take three to five months. Berkeley users seeking to view this older content will be able to access it by logging into YouTube with their bConnected/Google supported identity. Instructors with course recordings on YouTube recorded fall 2015 or later will experience no change. Individual video URLs (links) will remain unchanged. Instructors currently using impacted recordings are encouraged to contact the Course Capture team to identify ways to mitigate any effect on their courses: email@example.com
- How long will videos be interrupted?
- The entire process to migrate the public YouTube videos from their current location to a new YouTube channel that will be accessible with campus member’s bConnected/Google supported identity will take 8-10 weeks and begin on March 15, 2017. Each video will be unavailable on bCourses for 2-3 business days. If you are a current instructor using impacted legacy recordings please contact the Course Capture team to review your needs: firstname.lastname@example.org
- If I have other videos that I want to get captioned or audio described, how would I do that?
- While speech-to-text tools continue to improve, effective captioning remains a very manual process. The UC System has recently introduced contracts with several vendors to provide captioning services. The vendor transcribes a recording and adds the text to the appropriate YouTube video, or a transcriber may be hired to caption an event live. At UC Berkeley, content created/captured by Berkeley Video and Berkeley AV is now being captioned. Information on audio description best practices are available at: https://webaccess.berkeley.edu/resources/tips/audio-description and https://webaccess.berkeley.edu/ask-pecan/descriptive-audio
- I’m using the impacted recordings (iTunesU or spring 2015 or earlier YouTube content) in my course now. What should I do?
- ETS is working hard to mitigate impacts to current instruction. If you already have a list of your video links, you have no additional steps to take. Video URLs will remain unchanged. If you need assistance or have additional concerns, please contact the Course Capture team to review your needs: email@example.com
- I am an instructor who is using impacted recordings (iTunesU or spring 2015 or earlier YouTube content) for something outside of UC Berkeley. What should I do?
- If you are an instructor using legacy recordings currently available to the public as an extension of your research or teaching, please contact the Course Capture team: firstname.lastname@example.org
- Why was the public not notified before webcast.berkeley.edu content disappeared so that we had a chance to download iTunes legacy content?
- We added notifications to our sites and provided a warning before content began to be removed. The legacy content on webcast.berkeley.edu located on YouTube and UC Berkeley’s iTunes U is three to ten years old.
- I am a Berkeley instructor who wants to use old content in my class, where can I find the URL to share with my students?
- Before videos are migrated: Instructors can copy/paste their YouTube links for future reference. Link URLs will remain unchanged. Educational Technology Services (ETS) is working to modify webcast.berkeley.edu so that videos are accessible to UC Berkeley CalNet users starting in April. Instructors with immediate questions can contact the Course Capture team: email@example.com
- Can I get a copy of my old lectures from YouTube to use personally?
- Currently, ETS doesn’t have a service that provides copies of recordings to individuals.
- I am a Berkeley CalNet user, so why can’t I search for videos and playlists that I used to be able to see on webcast.berkeley.edu?
- The process that allows us to place the videos behind authentication removes playlists and content search options. ETS is working to provide campus users a new website that will function as a directory of recordings that should launch sometime in April on the existing webcast.berkeley.edu site.
- Can I still find previous events and other non-Course Capture recordings on YouTube?
- The public UC Berkeley Events Channel (youtube.com/ucberkeleyevents) will continue to be available. Many recordings at this location are already captioned and plans are in place to caption future content.