Going paperless

Publish date: Dec 30, 2022

Tags:

paperless-ngx

Recently I stumbled over this post on Hacker News, which linked to a tool called paperless-ngx.

Paperless-ngx is a web application (based on Python and Django) that let’s you easily organize your scanned documents in a self hosted digital archive.

After collecting all sorts of correspondence for decades, I thought using a tool like paperless-ngx would be a welcome occasion to get rid of the tons of outdated documents no longer needed physically, but still having those documents in a digital archive.

Under the hood, paperless-ngx uses OCRmyPDF to perform OCR on the scanned documents. The PDFs are then enriched with an layer of the recognized text, resulting in searchable scanned documents.

In addition every document has assigned metadata like a document-type (e.g. invoice) and correspondent, as well as tags. Each of this attributes can be configured to be applied automatically, based on simple rules. These attributes can be used to search and filter the documents.

When keeping documents also physically, paperless-ngx can automatically assign an auto-incrementing ASN (Archive Serial Number) to documents. This ASN needs to be written on the document before filing the original away. When all documents are stored sorted in a binder, the document can later be easily located, when the physical original is needed for some reason. The process is described here in more detail.

user management is included, this is the login form

paperless-ngx main window with document thumbnail overview

Feeding documents into paperless

I use a Brother ADS-1700W document scanner to automatically scan (small) batches of documents. The scanner scans pages up to format DIN A4, double sided. The scanner is connected through WLAN to my network. First I configured to use SMB protocol to transfer files from the scanner to the paperless-ngx box. But for some reason, SMB transfers were very, very slow (don’t know why - but I suspect the SMB implementation on the scanner has bad performance). Therefore I switched to SFTP using SSH key-based authentication, which resulted in good performance. Paperless-ngx supports watching a directory, which allows file uploads using e.g., SMB or SFTP. Besides that, it also supports email-import and upload of PDF documents using the web interface.

Deployment

Paperless-ngx relies on some additional services to get it’s job done:

Redis - here used as a message broker
Apache Tika - Content Analysis Toolkit to extract metadata from common (office) file formats (optional)
Gotenberg - API for PDF files (optional)

The actual Deployment is very simple, since the project provides various docker-compose files. Since I have already dockerized my home lab, it was easy to integrate paperless-ngx in my infrastructure. I opted for the sqlite version.

Summary

Summary after ~~6 month~~ 2.5 years:

paperless-ngx is an excellent tool that works absolutely flawlessly for me
I successfully archived ~~1200+~~ ca. 1600 documents with more than ~~3500~~ 5400 pages
searching in OCR’d documents works very well
my physical archive went down from 8 big binders to just 1
don’t forget to set up a reliable backup solution for your digital archive
and as always: don’t forget to protect/encrypt your sensitive data
Still highly recommended!

References

Paperless-ngx documentation