Archiving Articles

Revision as of Friday, 27 December 2024 at 23:30 UTC

I looked at three options: SingleFile, ArchiveBox, and readability-cli.

Requirement: I don’t need any JavaScript in my archive. Don’t really care about the images either. Just the text.

SingleFile

Not bad at all. Everything but the JavaScript, all scrunched into a… single file.

# Use jsdom instead of Chrome/Puppeteer to avoid JavaScript
npm i -g jsdom
npm i -g "gildas-lormeau/SingleFile#master"

# Now just
single-file \
  --back-end jsdom \
  https://blog.bitgate.cz/static-site-analytics-with-nginx-goaccess-no-js \
  output.html

# Can use Chrome (on macOS) like so
single-file \
  --browser-executable-path="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
  https://blog.bitgate.cz/static-site-analytics-with-nginx-goaccess-no-js \
  output.html
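
For more than one article, the jsdom invocation above can be wrapped in a loop. This is a minimal sketch, not anything SingleFile ships: `url_to_filename` and `archive_list` are hypothetical helper names, and the file-naming scheme is just one I made up for illustration.

```shell
# Hypothetical helper: derive a safe output file name from a URL.
# (The naming scheme is an assumption, not part of SingleFile.)
url_to_filename() {
  # Drop the scheme, turn anything outside [A-Za-z0-9._-] into "-",
  # trim trailing dashes, then append ".html".
  printf '%s\n' "$1" | sed \
    -e 's|^[a-zA-Z]*://||' \
    -e 's|[^A-Za-z0-9._-]|-|g' \
    -e 's|-*$||' \
    -e 's|$|.html|'
}

# Hypothetical helper: archive every URL listed (one per line) in file $1.
archive_list() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip blank lines
    # </dev/null keeps the command from consuming the URL list on stdin
    single-file --back-end jsdom "$url" "$(url_to_filename "$url")" </dev/null
  done < "$1"
}
```

Usage would be something like `archive_list urls.txt`, which writes one `.html` file per URL into the current directory.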

ArchiveBox

Saves everything kinda like archive.is. Images, CSS, JS, fonts, everything.

# On macOS
brew install archivebox/archivebox/archivebox

# Get the Readability driver
npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"

archivebox init
# Quote the URL so the shell (zsh on macOS) doesn't treat "?" as a glob
archivebox add "https://www.washingtonpost.com/politics/2021/01/15/pillow-salesman-apparently-has-some-ideas-about-declaring-martial-law/?utm_source=reddit.com"
archivebox server
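
The URL above still carries a `utm_source` tracking parameter, which ends up in the archive index. A small sketch for stripping those before adding, assuming only `utm_*` parameters should go; `strip_tracking` is a hypothetical helper, not an ArchiveBox feature.

```shell
# Hypothetical helper: remove utm_* query parameters from a URL.
strip_tracking() {
  # Loop (:a ... ta) so consecutive utm_* parameters are all removed,
  # then tidy up a dangling "?" or "&" at the end or before a #fragment.
  printf '%s\n' "$1" | sed -E \
    -e ':a' \
    -e 's/([?&])utm_[^&#]*&?/\1/' \
    -e 'ta' \
    -e 's/[?&]+$//' \
    -e 's/[?&]+#/#/'
}
```

Then something like `archivebox add "$(strip_tracking "$url")"` archives the clean address.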

readability-cli

Saves just the article content extracted from the DOM. No styling.
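
For completeness, a rough sketch of how an invocation might look. I'm assuming here that the `readability-cli` npm package installs a `readable` command and that `-o` names the output file; check `readable --help` before relying on it. `save_readable` is just a hypothetical wrapper name.

```shell
# Install first (assumed package name): npm i -g readability-cli

# Hypothetical wrapper: save the readable version of URL $1 to file $2.
# Assumes the `readable` command accepts "-o FILE" for its output.
save_readable() {
  readable "$1" -o "$2"
}
```

For example, `save_readable "https://blog.bitgate.cz/static-site-analytics-with-nginx-goaccess-no-js" article.html`.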

Result

ArchiveBox is awesome, but I ended up using SingleFile as a good middle ground between saving everything and saving bare text. Plus, readability-cli had some encoding issues.