Polyglot Box-Passing Processing Pipeline Architecture

Overview

One function of the SeqWeb system is to convert OEIS sequence data (.seq files) into semantic web knowledge graphs (.ttl files). Such conversions lie in the domain of SeqWeb’s Webwright subsystem (see Systems Design for context).

The Webwright subsystem constructs the knowledge graph by orchestrating an ensemble of Fabricators. Each Fabricator is responsible for generating a specific portion of the knowledge graph, and implements a polyglot processing pipeline composed of individually-implemented software Modules designed to interface readily with related Modules. Each Module may be implemented in an arbitrary language (e.g., Python, Java, Lisp, Bash — see Rationale for Polyglot Implementation below).

For example, a Fabricator might extract entities from selected OEIS entry text, and then build RDF triples linking the subject sequence to those entities via specific relationships. Upstream Modules in this Fabricator would handle entity extraction; downstream Modules would generate the RDF; others may support intermediate transformation, filtering, or enrichment.

[Figure: Fabricator polyglot pipeline architecture]

Modules communicate via a shared abstract structure called a box. Functionally, a box is a language-agnostic key-value map that flows through the pipeline.
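As a minimal sketch of what "language-agnostic key-value map" can mean in practice, the snippet below treats a box as a plain Python dict and round-trips it through JSON, the encoding this document later names for program interfaces. The keys `sequence_id` and `entity_count` are hypothetical and chosen only for illustration; they are not taken from SeqWeb.

```python
import json

# Hypothetical box contents; these keys are illustrative assumptions.
inbox = {"sequence_id": "A000045", "noisy": False}

# A module's output is typically the inbox plus some updates.
# Building a new dict leaves the inbox itself unchanged.
outbox = {**inbox, "entity_count": 3}

# Serializing to JSON and back preserves the box, so it can cross a
# process or language boundary without loss.
wire = json.dumps(outbox)
restored = json.loads(wire)
```

Restricting box values to JSON-representable types (strings, numbers, booleans, null, lists, maps) is one way to keep a box portable across languages; whether SeqWeb enforces that restriction is not stated here.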

[Figure: a box]

Each module is designed to function as a plug-and-play unit in one or more Fabricator pipelines, in which modules may be composed, replaced, tested, or reused independently, regardless of implementation language.

Modules may be composed through two mechanisms: native orchestration, when the modules are implemented in the same language, or isolated execution through IO wrappers that expose a standardized shell or CLI interface. This hybrid model allows clean decoupling, testability, and flexibility across execution boundaries.


Key Terminology

| Term | Meaning |
|---|---|
| box | A dictionary-like structure of key-value pairs shared between modules. It serves both as the input/output of each module and as a shared common carrier of state across pipeline stages. |
| inbox | The input box to a module. |
| outbox | The output box from a module, typically the inbox plus zero or more updates. |
| module | A processing unit that takes an inbox and returns an outbox. |
| module group | A group of related module definitions implemented in a single language. |
| program | A shell-invocable wrapper around a module exposing the inbox/outbox interface over CLI+JSON. |
| IO wrapper | A thin entry point that translates CLI input to a native-language box, invokes the inner function, and emits JSON output. |
| inner function | A native-language, pure function that maps a box (dict/map) to another box. |
| destructuring interface | A function signature that binds named keys from a box to function arguments (e.g., *, prompt, noisy=False, **_rest). |
| box-then-kwargs pattern | Pattern where a function accepts the full box and also unpacks it for destructuring. |
| native composition | Bypassing the IO wrapper by directly calling inner functions within the same runtime environment. |
| shell execution | Cross-language or system-level execution where modules run as subprocesses using wrapper interfaces. |

Implementation Patterns

Destructuring Inner Function (Python)

def normalize_prompt(box: dict, *, prompt, noisy=False, **_rest) -> dict:
    result = prompt.strip().lower()
    if noisy:
        print(f"\n\t🧹 Normalized Prompt:\n{result}")
    return {**box, "normalized_prompt": result}
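To illustrate how the destructuring signature behaves when called with the box-then-kwargs pattern, here is a small sketch. It repeats the function (without the optional printing) so that it runs on its own; the sample box contents are made up for the example.

```python
def normalize_prompt(box: dict, *, prompt, noisy=False, **_rest) -> dict:
    """Same signature as above, repeated so this sketch is self-contained."""
    result = prompt.strip().lower()
    return {**box, "normalized_prompt": result}

# Keys the function does not name (here "config") are absorbed by **_rest,
# yet they still reach the outbox because the full box is passed through.
box = {"prompt": "  Hello World ", "config": None}
outbox = normalize_prompt(box, **box)

# A box lacking a required key fails at call time with a TypeError,
# because "prompt" is a keyword-only argument with no default.
try:
    normalize_prompt({}, **{})
    missing_key_error = None
except TypeError as exc:
    missing_key_error = exc
```

One consequence of this pattern is that the signature itself documents which keys a module needs, and a missing key surfaces at the call site rather than deep inside the function body.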

CLI Wrapper (Python)

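import argparse
import json
import sys

def main():
    parser = argparse.ArgumentParser(description="Run a KA pipeline stage.")
    parser.add_argument("prompt")
    parser.add_argument("--noisy", action="store_true")
    parser.add_argument("--config")

    args = parser.parse_args()
    # Build the inbox from CLI arguments, then call the inner function defined above.
    inbox = {"prompt": args.prompt, "noisy": args.noisy, "config": args.config}
    outbox = normalize_prompt(inbox, **inbox)
    json.dump(outbox, sys.stdout)  # emit the outbox as JSON on stdout

if __name__ == "__main__":
    main()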

Generic Fabricator Example: Native Python

from typing import Callable, Dict, List, Optional

def run_pipeline(
    stages: Optional[List[Callable]] = None, initial_box: Optional[Dict] = None
) -> Dict:
    """Run a sequence of stages on a box, using the box-then-kwargs pattern."""
    if stages is None:
        stages = []
    if initial_box is None:
        initial_box = {}

    box = initial_box
    for stage in stages:
        box = stage(box, **box)
    return box
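A possible usage of this runner, as a sketch: two hypothetical stages, `strip_text` and `count_words`, each follow the destructuring interface and are chained natively. The runner is repeated in condensed form so the example runs on its own; the stage names and box keys are assumptions made for illustration.

```python
from typing import Callable, Dict, List

def run_pipeline(stages: List[Callable], initial_box: Dict) -> Dict:
    """Condensed copy of the runner above, repeated for self-containment."""
    box = initial_box
    for stage in stages:
        box = stage(box, **box)
    return box

def strip_text(box, *, text, **_rest):
    """Hypothetical stage: trim surrounding whitespace from the text."""
    return {**box, "text": text.strip()}

def count_words(box, *, text, **_rest):
    """Hypothetical stage: add a word count computed from the trimmed text."""
    return {**box, "word_count": len(text.split())}

result = run_pipeline(
    [strip_text, count_words],
    {"text": "  number of partitions  "},
)
```

Because each stage returns a new box rather than mutating its inbox, stages can be reordered or tested in isolation as long as the keys they require are present.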

This example Fabricator:


Cross-Language Fabricator Execution Model

When modules are implemented in different languages and invoked as standalone programs:
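One way such an execution model could look is sketched below. It assumes a wire protocol in which each program reads a JSON inbox on stdin and writes a JSON outbox on stdout; that protocol, the helper names `run_program` and `run_shell_pipeline`, and the two stand-in programs are all assumptions for illustration. The stand-ins are Python one-liners only so the sketch is runnable anywhere; in a real pipeline they would be programs written in other languages.

```python
import json
import subprocess
import sys

# Stand-ins for programs in other languages. Each reads a JSON inbox on
# stdin and writes a JSON outbox on stdout (an assumed protocol).
UPPERCASE_PROGRAM = [
    sys.executable, "-c",
    "import json,sys; b=json.load(sys.stdin); "
    "b['title']=b['title'].upper(); json.dump(b, sys.stdout)",
]
COUNT_PROGRAM = [
    sys.executable, "-c",
    "import json,sys; b=json.load(sys.stdin); "
    "b['length']=len(b['title']); json.dump(b, sys.stdout)",
]

def run_program(command, inbox):
    """Run one program as a subprocess, passing the box as JSON."""
    completed = subprocess.run(
        command,
        input=json.dumps(inbox),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(completed.stdout)

def run_shell_pipeline(commands, initial_box):
    """Feed each program's outbox to the next program as its inbox."""
    box = initial_box
    for command in commands:
        box = run_program(command, box)
    return box

result = run_shell_pipeline(
    [UPPERCASE_PROGRAM, COUNT_PROGRAM], {"title": "primes"}
)
```

The orchestrator only needs to know each program's command line; it never imports module code, which is what keeps the stages independent of implementation language.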

This design allows for:

Program Reusability

By structuring Module programs to build their inbox from command-line arguments (optionally merging with a JSON seed), each Module becomes naturally reusable in multiple contexts:

This versatility makes it easy to scale from interactive experimentation to production pipelines without changing the module’s core logic.
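The inbox-building step described above might be sketched as follows. The function name `build_inbox`, the flags `--prompt` and `--noisy`, and the rule that explicit command-line values override the seed are assumptions for illustration; the source does not specify the merge precedence.

```python
import argparse
import json

def build_inbox(argv, seed_json=None):
    """Merge an optional JSON seed with command-line arguments.

    Assumed precedence: values given on the command line override the seed.
    """
    parser = argparse.ArgumentParser(description="Hypothetical module program.")
    parser.add_argument("--prompt")
    # default=None lets us tell "flag absent" apart from an explicit value.
    parser.add_argument("--noisy", action="store_true", default=None)
    args = parser.parse_args(argv)

    seed = json.loads(seed_json) if seed_json else {}
    # Keep only the arguments the caller actually supplied.
    overrides = {k: v for k, v in vars(args).items() if v is not None}
    return {**seed, **overrides}

# Interactive use: everything comes from the command line.
interactive = build_inbox(["--prompt", "Catalan numbers", "--noisy"])

# Scripted use: a JSON seed supplies defaults, and one flag overrides it.
scripted = build_inbox(
    ["--prompt", "override"],
    '{"prompt": "seed", "config": "dev.json"}',
)
```

Keeping this merge logic in the wrapper means the inner function sees the same kind of box whether the program was started by hand or by an orchestrator.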


Architectural Principles


Validation, error handling, testing, debugging & logging

TODO: Add comprehensive section covering validation patterns, error handling strategies, testing approaches, debugging techniques, and logging standards for polyglot pipeline modules.


Rationale for Polyglot Implementation

While entire Fabricator pipelines could, in theory, be implemented in a single language, our design favors a polyglot approach. This allows each module to leverage the language best suited to its particular role, improving expressiveness, maintainability, and integration with existing tools. For example:

Java Strengths

Python Strengths

Lisp (Common Lisp) Strengths

Bash Strengths

The modular, mixed-language Webwright design pattern allows us to prototype rapidly, optimize when needed, and maintain clarity between different kinds of logic: transformation, coordination, parsing, and enrichment.