
Experimentation

AUMP research should be evaluated with deterministic experiments first. The core question is not whether a model gives a persuasive explanation. The core question is whether an implementation enforces the same mandate boundary every time for the same mandate, proposed action, context, and downstream binding.
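
One way to make that question concrete is to treat the evaluator as a pure function of exactly those four inputs and check that repeated evaluation never changes the outcome. The sketch below is illustrative only; the `EvaluationInputs` shape and the `evaluate` callable are assumptions, not AUMP SDK types.

```python
# Illustrative only: EvaluationInputs and the evaluate callable are assumed
# shapes for this sketch, not AUMP SDK types.
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationInputs:
    mandate: dict    # the mandate document
    action: dict     # the proposed action
    context: dict    # runtime context (counterparty, channel, ...)
    binding: str     # downstream binding, e.g. "mcp", "a2a", "ucp"


def assert_deterministic(evaluate, inputs: EvaluationInputs, runs: int = 3) -> None:
    """Evaluate identical inputs several times and require identical decisions."""
    decisions = [evaluate(inputs) for _ in range(runs)]
    assert all(d == decisions[0] for d in decisions), f"nondeterministic: {decisions}"
```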

Experiment Matrix

| Experiment | What it proves | Primary artifact |
| --- | --- | --- |
| Schema validation | Mandates, action evaluations, profiles, and evidence events are structurally valid. | JSON Schemas in the spec repository |
| Policy evaluation | Hard constraints, disclosure rules, compliance rules, and escalation rules produce deterministic decisions. | Conformance fixtures |
| Bridge metadata | MCP, A2A, UCP, AP2, REST, and platform-native calls carry only safe mandate references. | Bridge fixtures |
| Evidence semantics | Material decisions produce evidence without unnecessary private data retention. | Evidence-event fixtures |
| Example replay | End-to-end flows remain runnable without an LLM API key. | Marketplace and home-buying examples |

```mermaid
flowchart LR
  spec[Spec schemas and prose] --> fixtures[Conformance fixtures]
  fixtures --> py[Python SDK evaluator]
  fixtures --> ts[TypeScript SDK evaluator]
  fixtures --> examples[Runnable examples]
  py --> report[Parity report]
  ts --> report
  examples --> report
  report --> docs[Docs and source map]
```
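
For the schema-validation row, one plausible harness shape is to validate every artifact against the published JSON Schemas and collect structural errors. The schema path and artifact layout below are assumptions for illustration, not the spec repository's actual file structure.

```python
# Schema-validation sketch; the schema path and artifact directory layout are
# assumptions for illustration, not the spec repository's real structure.
import json
from pathlib import Path

from jsonschema import Draft202012Validator  # third-party: jsonschema


def validate_artifacts(schema_path: Path, artifact_dir: Path) -> list[str]:
    """Return 'file: error' strings for every structurally invalid artifact."""
    validator = Draft202012Validator(json.loads(schema_path.read_text()))
    failures = []
    for artifact in sorted(artifact_dir.glob("*.json")):
        for error in validator.iter_errors(json.loads(artifact.read_text())):
            failures.append(f"{artifact.name}: {error.message}")
    return failures
```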

Determinism Before Intelligence

AUMP deliberately separates the planner from the evaluator. The planner may be an LLM, rules engine, workflow engine, human operator, or hybrid system. The evaluator must be deterministic enough to test. That separation keeps conformance focused on the safety boundary:

```mermaid
sequenceDiagram
  participant M as Model or planner
  participant R as AUMP Runtime
  participant F as Fixture
  participant D as Decision

  F->>M: Provide task context
  M->>R: Proposed action JSON
  R->>R: Validate and evaluate mandate
  R-->>D: allowed / requires_escalation / denied
  D-->>F: Compare with expected fixture decision
```
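
The same loop can be written as a small replay harness: load a fixture, pass the proposed action to the evaluator under test, and compare the result with the fixture's expectation. The `evaluate_action` callable is a placeholder, not a published AUMP API.

```python
# Replay sketch mirroring the diagram above. evaluate_action is a placeholder
# for whichever SDK evaluator is under test, not a published AUMP API.
import json
from pathlib import Path


def replay_fixture(fixture_path: Path, evaluate_action) -> bool:
    fixture = json.loads(fixture_path.read_text())
    mandate = json.loads(Path(fixture["mandate"]).read_text())

    result = evaluate_action(mandate=mandate, action=fixture["action"])

    expected = fixture["expected"]
    return (result["decision"] == expected["decision"]
            and sorted(result.get("reason_codes", []))
            == sorted(expected.get("reason_codes", [])))
```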

Suggested Research Questions

  1. Cross-language parity: Do Python and TypeScript evaluators return the same decision, reason codes, and evidence semantics for every fixture? (See the sketch after this list.)
  2. Binding minimality: Can counterparties verify mandate identity with only ID, canonical hash, version, and optionally an access-controlled URL?
  3. Disclosure robustness: Can an LLM-produced message be transformed or rejected so protected mandate fields never leave the runtime?
  4. Escalation usefulness: Does the trusted review summary include enough context for the principal to approve or deny without exposing irrelevant private data?
  5. Project Deal replay: Can a delegated negotiation like Anthropic's Project Deal be replayed across MCP, A2A, and UCP-shaped actions while preserving the same AUMP decision boundary?1
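
For research question 1, a parity run can be as simple as invoking each SDK's evaluator on every fixture and diffing the structured results. The command names below are invented placeholders; substitute whichever entry points the Python and TypeScript SDKs actually expose.

```python
# Parity-run sketch: the evaluator commands are placeholders, not real AUMP
# SDK entry points. Each is assumed to print a JSON decision to stdout.
import json
import subprocess
from pathlib import Path


def run_evaluator(cmd: list[str], fixture: Path) -> dict:
    out = subprocess.run(cmd + [str(fixture)], capture_output=True, text=True, check=True)
    return json.loads(out.stdout)


def parity_divergences(fixture_dir: Path) -> list[str]:
    divergent = []
    for fixture in sorted(fixture_dir.glob("*.json")):
        py = run_evaluator(["python", "-m", "aump_eval"], fixture)  # placeholder module
        ts = run_evaluator(["npx", "aump-eval"], fixture)           # placeholder CLI
        same = (py["decision"] == ts["decision"]
                and sorted(py.get("reason_codes", [])) == sorted(ts.get("reason_codes", [])))
        if not same:
            divergent.append(fixture.name)
    return divergent
```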

Example Fixture Shape

```json
{
  "id": "project_deal_disclosure_denied",
  "mandate": "examples/project-deal.json",
  "action": {
    "type": "send_message",
    "counterparty": "seller_agent",
    "attributes": {
      "message": "My buyer can go up to the full private max budget."
    },
    "disclosure_fields": ["negotiation.reservation_value"]
  },
  "expected": {
    "decision": "denied",
    "reason_codes": ["disclosure_denied"]
  }
}
```

This is an illustrative research fixture shape, not a committed conformance fixture. It should be source-verified against the conformance manifest before being presented as shipped coverage.
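
The denial in this fixture hinges on a disclosure rule: if a proposed action would reveal a protected mandate field, the evaluator must deny rather than let the message leave the runtime. A minimal version of that check might look like the sketch below, where the mandate key `disclosure.protected_fields` is an assumption made for illustration.

```python
# Disclosure-rule sketch. The mandate key "disclosure.protected_fields" is an
# assumption for illustration, not a field defined by the AUMP spec.
def check_disclosure(mandate: dict, action: dict) -> dict:
    protected = set(mandate.get("disclosure", {}).get("protected_fields", []))
    requested = set(action.get("disclosure_fields", []))
    if protected & requested:
        return {"decision": "denied", "reason_codes": ["disclosure_denied"]}
    return {"decision": "allowed", "reason_codes": []}
```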

Measurements

| Metric | Why it matters |
| --- | --- |
| Fixture pass count | Basic parity proof across spec and implementation. |
| Decision divergence count | Reveals nondeterministic or inconsistent evaluator behavior. |
| Private-field leakage count | Measures whether disclosure policy is actually enforced. |
| Escalation false-negative count | Measures whether high-risk commitments bypass trusted review. |
| Evidence completeness count | Measures whether material decisions can be audited later. |
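
Most of these counts can fall out of the same per-fixture result records. The aggregation below is a rough sketch; the record fields (`passed`, `py_decision`, `leaked_fields`, and so on) are assumptions, not a defined AUMP report format.

```python
# Metric-aggregation sketch; the per-fixture record fields are assumptions,
# not a defined AUMP report format.
from collections import Counter


def summarize(records: list[dict]) -> Counter:
    metrics = Counter()
    for r in records:
        metrics["fixture_pass"] += bool(r.get("passed", False))
        metrics["decision_divergence"] += r.get("py_decision") != r.get("ts_decision")
        metrics["private_field_leakage"] += len(r.get("leaked_fields", []))
        metrics["escalation_false_negative"] += bool(r.get("missed_escalation", False))
        metrics["evidence_completeness"] += bool(r.get("evidence_complete", False))
    return metrics
```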

Publication Rule

Research pages should report local validation precisely. Use "docs parity passed" only after `scripts/check_docs_parity.py` succeeds. Use "site build passed" only after `mkdocs build --strict` succeeds. Do not describe AUMP as adoption-ready merely because local docs or conformance checks passed.
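
A small wrapper can keep those claims honest by printing each phrase only when the corresponding command exits zero. The commands are the two named above; the wrapper itself is only a sketch.

```python
# Sketch: print each claim only when its check succeeds. The two commands are
# the ones named in the publication rule; the wrapper script is illustrative.
import subprocess

CHECKS = {
    "docs parity passed": ["python", "scripts/check_docs_parity.py"],
    "site build passed": ["mkdocs", "build", "--strict"],
}

for claim, cmd in CHECKS.items():
    result = subprocess.run(cmd)
    print(claim if result.returncode == 0 else f"NOT verified: {claim}")
```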

Claims Needing Source Verification

  • Any benchmark-style claim about evaluator latency or leakage reduction needs an attached dataset, exact implementation commit, and reproducible runner.
  • Any comparison to other agent-safety approaches needs primary-source citations and a stated threat model.