Document content matching

Paperless-ngx does a great job of matching documents with the correct correspondent, storage path, etc. However, for some documents the automatic matching doesn't work, or a single regular expression match isn't sufficient. In such cases, the document's content needs to be examined further after consumption.

Update document details via organize and the Paperless-ngx CLI

organize is an open-source command-line tool for automating file management. It lets you execute actions based on custom filters. Both filters and actions are easily defined in YAML.

Probably the most helpful filter in this context is the filecontent filter. It matches the document's content against regular expressions, and named groups in those expressions let you dynamically reuse (parts of) the matched content in subsequent actions.
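To see what a pattern would capture before wiring it into a rule, the extraction can be approximated on the command line. This is only a rough sketch: organize evaluates patterns with Python's regular expression engine, while the example uses sed with POSIX extended regular expressions, so the named group becomes a numbered group and `\d` is written as `[0-9]`. The function name `extract_amount` and the file name in the usage note are made up for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads text from stdin and prints the first amount
# that follows "Amount due" on the same line.
# sed works line by line, so this is an approximation of the filter,
# not a reimplementation of it.
extract_amount() {
    local amount
    amount=$(sed -nE 's/.*Amount due.*([0-9]{2}\.[0-9]{2}).*/\1/p' | head -n 1)
    # Fail (non-zero exit status) when nothing was captured
    [ -n "${amount}" ] && printf '%s\n' "${amount}"
}
```

With poppler-utils installed, `pdftotext invoice.pdf - | extract_amount` runs the check against a real file. Note that the pattern captures exactly two digits before the decimal point, so amounts of 100.00 or more would need a wider expression.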

The following script does two things. First, it ensures that a newly consumed document is assigned a proper title based on its content, which helps you stick to a consistent naming pattern for documents you receive regularly, e.g. invoices. Second, it extracts a value from the document content and stores it in a given custom field.

The Paperless-ngx CLI can be used to update other fields as well. Check the CLI's help or GitHub repository for more information.

Prerequisites

For this solution to work, you will need to install the following packages: organize-tool, poppler-utils (required by organize's filecontent filter), and pypaperless-cli.

Because organize updates the document through the Paperless-ngx CLI, which talks to the REST API, the API prerequisites apply as well.
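A small pre-flight check can catch a missing token or a stray trailing slash before any document is processed. The sketch below assumes the prerequisites boil down to the two variables from this solution's .env file, `PNGX_TOKEN` and `PNGX_HOST`. The helper name `check_pngx_env` is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical pre-flight check for the variables defined in .env.
# Returns 0 if both look usable, 1 otherwise (messages go to stderr).
check_pngx_env() {
    if [ -z "${PNGX_TOKEN:-}" ]; then
        echo "PNGX_TOKEN is not set" >&2
        return 1
    fi
    if [ -z "${PNGX_HOST:-}" ]; then
        echo "PNGX_HOST is not set" >&2
        return 1
    fi
    # The URL is expected without a trailing slash
    case "${PNGX_HOST}" in
        */)
            echo "PNGX_HOST must not end with a slash" >&2
            return 1
            ;;
    esac
    return 0
}
```

Calling `check_pngx_env || exit 1` near the top of a post-consumption script would stop early with a readable message instead of failing later inside a rule.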

Structure

Sticking to the general idea of our scripts folder layout, we will end up with the following structure for this solution.

paperless-ngx/
├─ my-post-consumption-scripts/
│  ├─ organize/
│  │  └─ organize.config.yml.tpl
│  └─ post-consumption-wrapper.sh
│  # Obviously the below file only exists
│  # if you're running Paperless-ngx via Docker Compose
├─ my-custom-container-init/
│  └─ 10-install-additional-packages.sh
└─ docker-compose.yml
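If you are starting from scratch, the layout above can be created with a few commands. This is a minimal sketch: `scaffold_layout` is a hypothetical helper name, the files are created empty so the scripts can be pasted in afterwards, and docker-compose.yml is deliberately left out.

```shell
#!/usr/bin/env bash

# Hypothetical helper: creates the folder layout under the given root
# directory (defaults to ./paperless-ngx). Files that already exist keep
# their content; only their modification time is updated.
scaffold_layout() {
    local root="${1:-paperless-ngx}"
    mkdir -p "${root}/my-post-consumption-scripts/organize" \
             "${root}/my-custom-container-init" || return 1
    touch "${root}/my-post-consumption-scripts/organize/organize.config.yml.tpl" \
          "${root}/my-post-consumption-scripts/post-consumption-wrapper.sh" \
          "${root}/my-custom-container-init/10-install-additional-packages.sh" || return 1
    # The wrapper is run as a script, so it needs the executable bit
    chmod +x "${root}/my-post-consumption-scripts/post-consumption-wrapper.sh"
}
```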

Scripts

The files for this solution are listed below in the following order: .env, organize.config.yml.tpl, post-consumption-wrapper.sh, and 10-install-additional-packages.sh.

# Token to access the REST API
PNGX_TOKEN=
# Your Paperless-ngx URL, without trailing slash
# If running your post-consumption script within Docker, it's likely to be http://localhost:8000
PNGX_HOST=

# organize configuration file
# https://organize.readthedocs.io

# YAML ANCHORS
# This filters the exact document that has been consumed
.locations: &current_document
  - path: "{env.DOCUMENT_ARCHIVE_DIR}"
    filter:
      # Needs to be replaced with e.g. `envsubst`
      # as organize doesn't replace environment placeholders in filter
      - "$DOCUMENT_ARCHIVE_FILENAME"

# RULES
rules:
  - name: "Nabu Casa invoice"
    locations: *current_document
    filters:
      - filecontent: "Nabu Casa"
      - filecontent: "(?P<title>Home Assistant Cloud)"
      - filecontent: 'Amount due.*(?P<amount>\d{2}\.\d{2})'
    actions:
      - echo: "Home Assistant hooray"
      - shell: "pngx edit {env.DOCUMENT_ID} --title '{filecontent.title}' --custom-fields 1={filecontent.amount}"
      - echo: "{shell.output}"

#!/usr/bin/env bash

# paperless-ngx post-consumption script
#
# https://docs.paperless-ngx.com/advanced_usage/#post-consume-script
#

SCRIPT_PATH=$(readlink -f "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")

# Add additional information to document
# Make sure organize-tool and poppler-utils have been installed
# on your system (resp. container, via custom-cont-init.d)

# In certain cases, like encrypted PDFs, no archived version is created by paperless.
# In this case, the archive path is "None". However, organize can still use the file.
# Therefore, use the source path instead.
if [[ "${DOCUMENT_ARCHIVE_PATH}" != "None" ]]; then
    DOCUMENT="${DOCUMENT_ARCHIVE_PATH}"
else
    DOCUMENT="${DOCUMENT_SOURCE_PATH}"
fi
# organize-tool doesn't accept full file path as argument
# but expects directory and filename pattern without extension instead
export DOCUMENT_ARCHIVE_FILENAME=$(basename "${DOCUMENT}")
export DOCUMENT_ARCHIVE_DIR=$(dirname "${DOCUMENT}")

# While organize supports environment variables as placeholders in its configuration,
# they are not yet supported everywhere in the configuration (e.g. filters),
# thus leveraging envsubst to replace environment placeholders
ORGANIZE_CONFIG_PATH=$(mktemp --suffix=.yml "${TMPDIR:-/tmp}/organize.config.XXXXXX")
envsubst < "${SCRIPT_DIR}/organize/organize.config.yml.tpl" > "${ORGANIZE_CONFIG_PATH}"

# Execute configured actions
# Add `--format errorsonly` to suppress most of organize's output in logs
organize run "${ORGANIZE_CONFIG_PATH}" --working-dir "${SCRIPT_DIR}/organize"

# Clean up
rm -f "${ORGANIZE_CONFIG_PATH}"

#!/usr/bin/env bash

# Install additional packages

# Add additional information to consumed documents
# based on hypercomplex ;) rules
# https://github.com/tfeldmann/organize/
# https://github.com/marcelbrueckner/paperless-ngx-cli
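# Refresh the package lists first and use -y, since nobody can answer prompts here
apt-get update
apt-get install -y poppler-utils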
pip install --root-user-action=ignore organize-tool
pip install --root-user-action=ignore pypaperless-cli

Notes

Script files can also be found on GitHub.
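The wrapper's choice between the archived and the original file can be tried out in isolation, without consuming a document. The sketch below mirrors that fallback in a hypothetical function, `select_document`. Treating an empty or unset archive path the same as "None" is an added assumption, not something the wrapper does.

```shell
#!/usr/bin/env bash

# Hypothetical helper mirroring the wrapper's fallback: prefer the archived
# version, fall back to the original when the archive path is "None".
# Assumption: an empty or missing first argument is treated like "None".
select_document() {
    local archive_path="${1:-None}"
    local source_path="$2"
    if [ "${archive_path}" != "None" ]; then
        printf '%s\n' "${archive_path}"
    else
        printf '%s\n' "${source_path}"
    fi
}
```

Inside a post-consumption script this could be used as `DOCUMENT=$(select_document "${DOCUMENT_ARCHIVE_PATH}" "${DOCUMENT_SOURCE_PATH}")`.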

Poppler is required for organize's filecontent filter to work, see https://github.com/tfeldmann/organize/issues/322.