Document content matching

Paperless-ngx does a great job of matching documents with the correct correspondent, storage path, and so on. However, for some documents the automatic matching doesn't work, or a single regular expression match isn't sufficient. In such cases, the document's content needs to be examined further after consumption.

Update document details via organize

organize is an open-source command-line tool for automating file management. It executes actions on files selected by custom filters. Both filters and actions are defined in YAML.

Probably the most helpful filter in this context is the filecontent filter. It matches the document's content against regular expressions, and named groups make (parts of) the matched content available for reuse in subsequent actions.
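
To illustrate how named groups carry matched text forward, here is a minimal sketch using Python's standard `re` module. The sample text and the `extract` helper are hypothetical, and organize may apply different matching flags internally, so treat this as an approximation of the idea rather than of the filter's implementation.

```python
import re

# Hypothetical sample of extracted document text (not taken from a real invoice)
SAMPLE_TEXT = """Nabu Casa, Inc.
Invoice for Home Assistant Cloud
Amount due $65.00
"""

# Patterns with named groups, in the style of a filecontent filter
PATTERNS = [
    r"(?P<title>Home Assistant Cloud)",
    r"Amount due.*(?P<amount>\d{2}\.\d{2})",
]


def extract(text, patterns):
    """Return the named groups of all patterns, or None if any pattern fails to match."""
    values = {}
    for pattern in patterns:
        match = re.search(pattern, text)
        if match is None:
            return None
        values.update(match.groupdict())
    return values


fields = extract(SAMPLE_TEXT, PATTERNS)
print(fields)
```

Every pattern has to match for the rule to apply, which mirrors how several filters on one rule are combined by default.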

The following script:

  1. ensures that a newly consumed document is assigned a proper title based on its content. This helps maintain a consistent naming pattern for documents that you receive regularly, e.g. invoices.
  2. extracts a value from the document content and stores it in a given custom field.

Prerequisites

For this solution to work, you will need to install the organize-tool and poppler-utils packages (see note 1).

Because the script called by organize updates the document via the REST API, the API prerequisites apply as well.

Structure

Sticking to the general idea of our scripts folder layout, we will end up with the following structure for this solution.

paperless-ngx/
├─ my-post-consumption-scripts/
│  ├─ organize/
│  │  ├─ organize.config.yml.tpl
│  │  └─ pngx-update-document.py
│  └─ post-consumption-wrapper.sh
│  # Obviously the below file only exists
│  # if you're running Paperless-ngx via Docker Compose
├─ my-custom-container-init/
│  └─ 10-install-additional-packages.sh
└─ docker-compose.yml

Scripts

# Token to access the REST API
PAPERLESS_TOKEN=
# Your Paperless-ngx URL, without trailing slash
PAPERLESS_URL=

organize.config.yml.tpl

# organize configuration file
# https://organize.readthedocs.io

# YAML ANCHORS
# This filters the exact document that has been consumed
.locations: &current_document
  - path: "{env.DOCUMENT_ARCHIVE_DIR}"
    filter:
      # Needs to be replaced with e.g. `envsubst`
      # as organize doesn't replace environment placeholders in filter
      - "$DOCUMENT_ARCHIVE_FILENAME"

# RULES
rules:
  - name: "Nabu Casa invoice"
    locations: *current_document
    filters:
      - filecontent: "Nabu Casa"
      - filecontent: "(?P<title>Home Assistant Cloud)"
      - filecontent: 'Amount due.*(?P<amount>\d{2}\.\d{2})'
    actions:
      - echo: "Home Assistant hooray"
      - shell: "./pngx-update-document.py --url http://localhost:8000 --document-id {env.DOCUMENT_ID} --title '{filecontent.title}' --custom-field-id 1 --custom-field-value {filecontent.amount}"
      - echo: "{shell.output}"
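
The `shell` action above wraps the title in single quotes, which works as long as the captured text contains no single quote itself. The following sketch, plain Python and independent of organize, shows how a POSIX shell would split such a command line and how `shlex.quote` produces a safe argument. The `build_command` helper and the sample values are hypothetical.

```python
import shlex


def build_command(document_id, title, field_id, amount):
    """Assemble the update command with every dynamic value shell-quoted."""
    args = [
        "./pngx-update-document.py",
        "--document-id", str(document_id),
        "--title", title,
        "--custom-field-id", str(field_id),
        "--custom-field-value", amount,
    ]
    return " ".join(shlex.quote(arg) for arg in args)


command = build_command(42, "Home Assistant Cloud", 1, "65.00")
print(command)

# Splitting the command the way a POSIX shell would keeps the title as one argument
print(shlex.split(command))

# A title containing a single quote survives the round trip as well
tricky = build_command(42, "Tom's invoice", 1, "65.00")
print(shlex.split(tricky)[4])
```

If the regular expressions can capture arbitrary text, it may be worth keeping the capture groups narrow so that the quoting in the `shell` action stays predictable.
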

pngx-update-document.py

#!/usr/bin/env python

# Work in progress
# Only allows updating the title and a single custom field at the moment

import argparse
import os
import sys

import httpx

parser = argparse.ArgumentParser(description='Update a single document via Paperless-ngx REST API')
parser.add_argument('--url',
    dest='url',
    action='store',
    help='Your Paperless-ngx URL',
    default=os.environ.get('PAPERLESS_URL')
)
parser.add_argument('--auth-token',
    dest='token',
    action='store',
    help='Your Paperless-ngx REST API authentication token',
    default=os.environ.get('PAPERLESS_TOKEN')
)
parser.add_argument('--document-id',
    dest='id',
    type=int,
    action='store',
    help='ID of the document that should be updated',
    required=True
)
parser.add_argument('--title',
    dest='title',
    action='store',
    help='Set the document title'
)
parser.add_argument('--custom-field-id',
    dest='custom_field_id',
    type=int,
    action='store',
    help='ID of the custom field that should be updated'
)
parser.add_argument('--custom-field-value',
    dest='custom_field_value',
    action='store',
    help='Value of the custom field that should be stored'
)
args = parser.parse_args()

headers = {'Authorization': f'Token {args.token}'}
data = {}

# Update title
if args.title is not None:
    data['title'] = args.title

# Update custom field
# Only if both --custom-field-id and --custom-field-value have been specified
if all(param is not None for param in [args.custom_field_id, args.custom_field_value]):
    new_field = {
        "field": args.custom_field_id,
        "value": args.custom_field_value
    }

    # Even when patching a single custom field, we need to include all of the document's existing custom fields
    # Otherwise, other custom fields will be removed from the document
    response = httpx.get(f"{args.url}/api/documents/{args.id}/", headers=headers)

    if response.is_error:
        msg = "HTTP error {} while trying to obtain document details via REST API at {}."
        sys.exit(msg.format(response.status_code, args.url))

    data['custom_fields'] = response.json()['custom_fields']

    # Update custom field value "in-place" if already attached to document (to keep custom field order)
    if any(custom_field['field'] == args.custom_field_id for custom_field in data['custom_fields']):
        data['custom_fields'] = [(new_field if custom_field['field'] == args.custom_field_id else custom_field) for custom_field in data['custom_fields']]
    # Otherwise, simply append to the list
    # (list.append() modifies the list in place and returns None, so its result must not be assigned)
    else:
        data['custom_fields'].append(new_field)

if data:
    response = httpx.patch(f"{args.url}/api/documents/{args.id}/", headers=headers, json=data)

    if response.is_error:
        msg = "HTTP error {} while trying to update document via REST API at {}."
        sys.exit(msg.format(response.status_code, args.url))

    print(f"Document with ID {args.id} successfully updated")
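
The custom field handling in the script can be isolated into a small pure function, which makes the replace-or-append behaviour easy to check without a running Paperless-ngx instance. This is a sketch only: `merge_custom_field` is a hypothetical helper name, and the sample field IDs and values are made up.

```python
def merge_custom_field(existing, field_id, value):
    """Return a new list in which the given field is replaced in place or appended."""
    new_field = {"field": field_id, "value": value}
    if any(item["field"] == field_id for item in existing):
        # Keep the original order and swap only the matching entry
        return [new_field if item["field"] == field_id else item for item in existing]
    # Build a new list rather than relying on list.append(), which returns None
    return existing + [new_field]


current = [{"field": 2, "value": "2024-01-31"}, {"field": 1, "value": "10.00"}]

# Field 1 already exists: its value changes, its position does not
print(merge_custom_field(current, 1, "65.00"))

# Field 3 is new: it is appended after the existing fields
print(merge_custom_field(current, 3, "paid"))
```

Because the function never mutates its input, the list fetched from the API stays available for comparison or logging.
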

post-consumption-wrapper.sh

#!/usr/bin/env bash

# paperless-ngx post-consumption script
#
# https://docs.paperless-ngx.com/advanced_usage/#post-consume-script
#

SCRIPT_PATH=$(readlink -f "$0")
SCRIPT_DIR=$(dirname "$SCRIPT_PATH")

# Add additional information to document
# Make sure organize-tool and poppler-utils have been installed
# on your system (or in your container, via custom-cont-init.d)

# organize-tool doesn't accept full file path as argument
# but expects directory and filename pattern without extension instead
export DOCUMENT_ARCHIVE_FILENAME=$(basename "${DOCUMENT_ARCHIVE_PATH}")
export DOCUMENT_ARCHIVE_DIR=$(dirname "${DOCUMENT_ARCHIVE_PATH}")

# While organize supports environment variables as placeholders in its configuration,
# they are not yet supported everywhere in the configuration (e.g. filters),
# thus leveraging envsubst to replace environment placeholders
ORGANIZE_CONFIG_PATH=$(mktemp --suffix=.yml "${TMPDIR:-/tmp}/organize.config.XXXXXX")
envsubst < "${SCRIPT_DIR}/organize/organize.config.yml.tpl" > "${ORGANIZE_CONFIG_PATH}"

# Execute configured actions
# Add `--format errorsonly` to suppress most of organize's output in logs
organize run "${ORGANIZE_CONFIG_PATH}" --working-dir "${SCRIPT_DIR}/organize"

# Clean up
rm -f "${ORGANIZE_CONFIG_PATH}"
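
The path splitting and placeholder substitution done by the wrapper can be approximated in Python to see what the rendered configuration looks like. This sketch uses `string.Template` as a stand-in for `envsubst`; the archive path is a hypothetical example, and the template is a shortened excerpt. One difference to keep in mind: `envsubst` replaces undefined variables with an empty string, whereas `safe_substitute` leaves them untouched.

```python
from pathlib import PurePosixPath
from string import Template

# Hypothetical archive path, standing in for DOCUMENT_ARCHIVE_PATH
archive_path = PurePosixPath("/usr/src/paperless/media/documents/archive/0000042.pdf")

# Equivalent of the dirname/basename calls in the wrapper script
env = {
    "DOCUMENT_ARCHIVE_DIR": str(archive_path.parent),
    "DOCUMENT_ARCHIVE_FILENAME": archive_path.name,
}

# Shortened excerpt of the location block from the configuration template
TEMPLATE = """\
.locations: &current_document
  - path: "{env.DOCUMENT_ARCHIVE_DIR}"
    filter:
      - "$DOCUMENT_ARCHIVE_FILENAME"
"""

# Only $-style placeholders are substituted; {env.…} is left for organize to resolve
rendered = Template(TEMPLATE).safe_substitute(env)
print(rendered)
```

The `{env.DOCUMENT_ARCHIVE_DIR}` placeholder passes through unchanged because it contains no `$`, so it is still resolved by organize at run time.
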

10-install-additional-packages.sh

#!/usr/bin/env bash

# Install additional packages

# Add additional information to consumed documents
# based on hypercomplex ;) rules
# https://github.com/tfeldmann/organize/
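# Refresh package lists and install without prompting (init scripts run non-interactively)
apt-get update
apt-get install -y poppler-utils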
pip install organize-tool

Notes

Script files can also be found on GitHub.

  1. Poppler is required for organize's filecontent filter to work; see https://github.com/tfeldmann/organize/issues/322