Archiving Articles
I looked at three options:
- SingleFile (there’s a CLI)
- ArchiveBox
- readability-cli, which wraps Readability
Requirement: I don’t need any JavaScript in my archive. Don’t really care about the images either. Just the text.
SingleFile
Not bad at all. Everything but the JavaScript, all scrunched into a… single file.
# Use jsdom instead of Chrome/Puppeteer to avoid JavaScript
npm i -g jsdom
npm i -g "gildas-lormeau/SingleFile#master"
# Now just
single-file \
--back-end jsdom \
https://blog.bitgate.cz/static-site-analytics-with-nginx-goaccess-no-js \
output.html
# Can use Chrome (on macOS) like so
single-file \
--browser-executable-path="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
https://blog.bitgate.cz/static-site-analytics-with-nginx-goaccess-no-js \
output.html
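If there's a whole list of articles to save, the jsdom invocation above can be wrapped in a small loop. This is only a rough sketch: `slug_for_url` and `archive_urls` are hypothetical helper names (not part of SingleFile), and it assumes a plain text file with one URL per line.

```shell
#!/usr/bin/env bash

# Hypothetical helper: turn a URL into a flat output filename.
slug_for_url() {
  local url="$1"
  url="${url#*://}"      # drop the scheme (https://)
  url="${url%%[?#]*}"    # drop any query string or fragment
  url="${url%/}"         # drop a trailing slash
  # Replace everything except letters, digits, dots and dashes with a dash.
  printf '%s.html\n' "$(printf '%s' "$url" | tr -c 'A-Za-z0-9.-' '-')"
}

# Hypothetical helper: archive every URL listed in a file (one per line),
# using the same single-file flags as the example above.
archive_urls() {
  local list="$1" url
  while IFS= read -r url; do
    [ -z "$url" ] && continue   # skip blank lines
    single-file --back-end jsdom "$url" "$(slug_for_url "$url")"
  done < "$list"
}
```

Usage would be something like `archive_urls urls.txt`, which writes one `.html` file per URL into the current directory.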
ArchiveBox
Saves everything kinda like archive.is. Images, CSS, JS, fonts, everything.
# On macOS
brew install archivebox/archivebox/archivebox
# Get the Readability driver
npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"
archivebox init
# Quote the URL so the shell doesn't treat the ? as a glob character
archivebox add "https://www.washingtonpost.com/politics/2021/01/15/pillow-salesman-apparently-has-some-ideas-about-declaring-martial-law/?utm_source=reddit.com"
archivebox server
readability-cli
Saves just the DOM of the article. No styling.
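For comparison with the other two tools, a minimal invocation might look like the sketch below. It assumes readability-cli installs a `readable` command that takes a URL and an `--output` flag (worth confirming with `readable --help` on your version); `save_text` is a hypothetical wrapper name used only for illustration.

```shell
#!/usr/bin/env bash

# Install first with: npm i -g readability-cli

# Hypothetical wrapper: save the readable version of a page to a file.
# Assumes the `readable` command and its --output flag are available.
save_text() {
  local url="$1" out="$2"
  readable "$url" --output "$out"
}
```

Usage would be something like `save_text "https://blog.bitgate.cz/static-site-analytics-with-nginx-goaccess-no-js" output.html`.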
Result
ArchiveBox is awesome, but I ended up using SingleFile for a good balance. Plus, readability-cli had some encoding issues.