Going paperless
paperless-ngx
Recently I stumbled over this post on Hacker News, which linked to a tool called paperless-ngx.
Paperless-ngx is a web application (based on Python and Django) that let’s you easily organize your scanned documents in a self hosted digital archive.
After collecting all sorts of correspondence for decades, I thought using a tool like paperless-ngx would be a welcome occasion to get rid of the tons of outdated documents no longer needed physically, but still having those documents in a digital archive.
Under the hood, paperless-ngx uses OCRmyPDF to perform OCR on the scanned documents. The PDFs are then enriched with an layer of the recognized text, resulting in searchable scanned documents.
In addition every document has assigned metadata like a document-type (e.g. invoice) and correspondent, as well as tags. Each of this attributes can be configured to be applied automatically, based on simple rules. These attributes can be used to search and filter the documents.
When keeping documents also physically, paperless-ngx can automatically assign an auto-incrementing ASN (Archive Serial Number) to documents. This ASN needs to be written on the document before filing the original away. When all documents are stored sorted in a binder, the document can later be easily located, when the physical original is needed for some reason. The process is described here in more detail.
Feeding documents into paperless
I use a Brother ADS-1700W document scanner to automatically scan (small) batches of documents. The scanner scans pages up to format DIN A4, double sided. The scanner is connected through WLAN to my network. First I configured to use SMB protocol to transfer files from the scanner to the paperless-ngx box. But for some reason, SMB transfers were very, very slow (don’t know why - but I suspect the SMB implementation on the scanner has bad performance). Therefore I switched to SFTP using SSH key-based authentication, which resulted in good performance. Besides SMB and SFTP paperless-ngx also supports Email-import and upload of PDF documents using the web interface.
Deployment
paperless-ngx relies on some additional services to do it’s job:
- Redis - here used as a message broker
- Apache Tika - Content Analysis Toolkit to extract metadata from common (office) file formats (optional)
- Gotenberg - API for PDF files (optional)
The actual Deployment is very simple, since the project provides various docker-compose files. Since I have already dockerized my home lab, it was easy to integrate paperless-ngx in my infrastructure. I opted for the sqlite version.
Summary
Summary after 6 month:
- paperless-ngx is an excellent tool that works absolutely flawlessly for me
- I successfully archived 1200+ documents with more than 3500 pages
- searching in OCR’d documents works very well
- my physical archive went down from 8 big binders to just 1
- don’t forget to set up a reliable backup solution for your digital archive
- and as always: don’t forget to protect/encrypt your sensitive data
- Highly recommended!