News+C is a project brought to life by user:bzc6p, and is concerned with archiving news websites.
NewsGrabber vs. News+C
Wait, we already have NewsGrabber! How is this one different?
- News+C focuses on websites that have user comments, especially those that use third party comment plugins (Facebook, Disqus etc.)
- While NewsGrabber archives all articles of thousands of websites, News+C focuses on only some (popular) websites and archives only those, but more thoroughly.
- While anyone can join the NewsGrabber project, as it only requires starting a script, a News+C project needs more knowledge, time and attention.
News+C is in no way a replacement for, or a competitor to, NewsGrabber. It is a small-scale project with a different approach that focuses on comments rather than on the news itself and, to some extent, prefers quality over quantity.
So there are two approaches:
- If you are not that expert/intelligent/patient/whatever, and don't want to deal with all that, there is a slower, but simpler, more universal and cosier approach: automating a web browser, using computer vision.
Solving the script puzzle
If you know how to efficiently archive Facebook or Disqus comment threads with a script, do not hesitate to share. The founder of this project, however, doesn't, so he is developing the other method.
Using computer vision
But the question is, as always: how do you automate this process?
This is where computer vision comes in. You can – surprisingly easily –
- simulate keypresses
- simulate mouse movement and clicks
- find the location of an excerpt image on the screen
with a little programming.
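As a rough illustration of the first two items, here is a minimal sketch of simulated keypresses and mouse clicks. It is not the project's actual script. The helper names open_url and click_at are hypothetical, and Ctrl+L focusing the address bar is an assumption about the browser in use. The gui argument is expected to be the pyautogui module itself; it is passed in rather than imported because PyAutoGUI needs a running graphical session.

```python
def open_url(gui, url):
    """Type `url` into the browser's address bar and press Enter.

    `gui` is expected to be the pyautogui module (hypothetical helper).
    """
    gui.hotkey("ctrl", "l")            # assumption: Ctrl+L focuses the address bar
    gui.typewrite(url, interval=0.01)  # simulate typing the URL key by key
    gui.press("enter")                 # simulate a single keypress


def click_at(gui, x, y):
    """Move the mouse pointer to screen position (x, y) and left-click."""
    gui.moveTo(x, y, duration=0.2)     # simulated mouse movement
    gui.click()                        # click at the current pointer position
```

In a graphical session, this could be used as `import pyautogui` followed by `open_url(pyautogui, url)` for each URL in the input list.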
As far as user:bzc6p knows, this needs a graphical interface and can't run in the background, but at least you can save a few hundred or even a few thousand articles overnight, while you sleep.
Different scripts are necessary for different websites, but the approach is the same, and the scripts are also similar. user:bzc6p has named the modifiable Python 2 script he uses the Archiving SharpShooter (ASS).
The actual code may be published later; if you are interested, you can ask user:bzc6p. The project is still quite beta, however, so only the algorithm is explained here.
- Input is a list of news URLs.
- Key Python 2 libraries used are PyAutoGUI and OpenCV. The former is our hands, the latter our eyes.
- pyautogui.press() and pyautogui.click() do the typing, scrolling and clicking. cv2.matchTemplate() finds the location of the "Read comments", "More comments" etc. buttons or links, and we click them. matchTemplate needs a template to search for (we cut these out of screenshots) and an up-to-date screenshot (we invoke scrot from Python and load the resulting image). With matchTemplate we can also check whether the page has loaded or whether we have reached the bottom of the page. (The threshold for matchTemplate must be chosen carefully for each template, so that it neither misses the template nor reports a false positive.)
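The "eyes" part might look roughly like the sketch below. It is an illustration under assumptions, not the ASS code. The function names, the choice of the cv2.TM_CCOEFF_NORMED matching method and the greyscale loading are all assumptions. Only scrot, cv2.matchTemplate() and the per-template threshold come from the description above.

```python
import os
import subprocess
import tempfile


def take_screenshot():
    """Capture the screen with scrot and return the path of the PNG file."""
    path = os.path.join(tempfile.mkdtemp(), "screen.png")
    subprocess.check_call(["scrot", path])
    return path


def best_match(screenshot_path, template_path):
    """Return (score, top_left, size) for the best match of the template."""
    import cv2  # imported here so match_center() is usable without OpenCV

    screen = cv2.imread(screenshot_path, cv2.IMREAD_GRAYSCALE)
    template = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
    if screen is None or template is None:
        raise IOError("could not read screenshot or template image")
    # TM_CCOEFF_NORMED scores lie in [-1, 1]; 1 means a perfect match.
    result = cv2.matchTemplate(screen, template, cv2.TM_CCOEFF_NORMED)
    _, score, _, top_left = cv2.minMaxLoc(result)
    height, width = template.shape[:2]
    return score, top_left, (width, height)


def match_center(score, top_left, size, threshold):
    """Return the centre (x, y) of the match, or None if score < threshold."""
    if score < threshold:
        return None  # treat as "template not on screen"
    x, y = top_left
    width, height = size
    return (x + width // 2, y + height // 2)
```

A caller would pass the result of best_match(take_screenshot(), template_path) to match_center() with the threshold tuned for that template, and click the returned point if it is not None.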
What the program basically does:
- types URL in the address bar
- waits till page is loaded
- scrolls till it finds "Read comments" or equivalent sign
- clicks on that
- waits for comments to be loaded
- scrolls till "More comments" or equivalent is reached, and clicks on that
- waits for more comments to be loaded
- repeats this until bottom of page is reached (no more comments)
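The steps above can be sketched as a small control loop. This is a hedged outline, not the published script. The browser object and its methods (open_url, wait_loaded, find, click, scroll_down, at_bottom) are hypothetical wrappers around the keyboard, mouse and template-matching operations. The template names "read_comments" and "more_comments" and the limits max_scrolls and max_rounds are likewise assumptions.

```python
def scroll_to(browser, template_name, max_scrolls):
    """Scroll down until the template is visible; return its position or None."""
    for _ in range(max_scrolls):
        position = browser.find(template_name)
        if position is not None:
            return position
        if browser.at_bottom():
            return None  # bottom of the page reached without a match
        browser.scroll_down()
    return None  # give up rather than scroll forever


def archive_article(browser, url, max_scrolls=100, max_rounds=1000):
    """Load one article and expand its comments.

    Returns the number of "More comments" clicks performed.
    """
    browser.open_url(url)
    browser.wait_loaded()

    position = scroll_to(browser, "read_comments", max_scrolls)
    if position is None:
        return 0  # no comment section found
    browser.click(position)
    browser.wait_loaded()

    clicks = 0
    for _ in range(max_rounds):
        position = scroll_to(browser, "more_comments", max_scrolls)
        if position is None:
            break  # no more comments to load
        browser.click(position)
        browser.wait_loaded()
        clicks += 1
    return clicks
```

Keeping the loop separate from the browser wrapper means the per-site differences (templates, thresholds, waiting rules) stay in the wrapper, while the overall procedure remains the same.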
During this, warcprox runs in the background, and every request is immediately saved to a WARC file. (Warcprox provides a proxy, which is set in the browser.)
Websites being archived
For an easier overview, let this page have subpages for countries/languages.