# Experimentation
AUMP research should be evaluated with deterministic experiments first. The core question is not whether a model gives a persuasive explanation. The core question is whether an implementation enforces the same mandate boundary every time for the same mandate, proposed action, context, and downstream binding.
## Experiment Matrix
| Experiment | What it proves | Primary artifact |
|---|---|---|
| Schema validation | Mandates, action evaluations, profiles, and evidence events are structurally valid. | JSON Schemas in the spec repository |
| Policy evaluation | Hard constraints, disclosure rules, compliance rules, and escalation rules produce deterministic decisions. | Conformance fixtures |
| Bridge metadata | MCP, A2A, UCP, AP2, REST, and platform-native calls carry only safe mandate references. | Bridge fixtures |
| Evidence semantics | Material decisions produce evidence without unnecessary private data retention. | Evidence-event fixtures |
| Example replay | End-to-end flows remain runnable without an LLM API key. | Marketplace and home-buying examples |
```mermaid
flowchart LR
    spec[Spec schemas and prose] --> fixtures[Conformance fixtures]
    fixtures --> py[Python SDK evaluator]
    fixtures --> ts[TypeScript SDK evaluator]
    fixtures --> examples[Runnable examples]
    py --> report[Parity report]
    ts --> report
    examples --> report
    report --> docs[Docs and source map]
```
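The parity report at the end of that pipeline can be produced by a small harness that replays every fixture through both SDK evaluators and records any decision divergence. The following is a minimal sketch, not a committed tool: the Python evaluator is passed in as a plain callable so no package name is assumed, and `aump-ts evaluate` is a hypothetical CLI wrapper for the TypeScript SDK.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable

def parity_report(fixture_dir: str, evaluate_py: Callable[[dict, dict], dict]) -> list[dict]:
    """Replay every fixture through both SDK evaluators and record decision divergences."""
    divergences = []
    for path in sorted(Path(fixture_dir).glob("*.json")):
        fixture = json.loads(path.read_text())
        # In this sketch the mandate field is a path relative to the fixture file.
        mandate = json.loads((path.parent / fixture["mandate"]).read_text())

        # Python SDK decision, supplied as a callable so no import path is assumed.
        py = evaluate_py(mandate, fixture["action"])

        # TypeScript SDK decision via a hypothetical CLI wrapper ("aump-ts" is not a confirmed name).
        ts = json.loads(
            subprocess.run(
                ["aump-ts", "evaluate", str(path)],
                capture_output=True, text=True, check=True,
            ).stdout
        )

        if (py["decision"], py["reason_codes"]) != (ts["decision"], ts["reason_codes"]):
            divergences.append({"fixture": path.name, "python": py, "typescript": ts})
    return divergences
```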
## Determinism Before Intelligence
AUMP deliberately separates the planner from the evaluator. The planner may be an LLM, rules engine, workflow engine, human operator, or hybrid system. The evaluator must be deterministic enough to test. That separation keeps conformance focused on the safety boundary:
```mermaid
sequenceDiagram
    participant M as Model or planner
    participant R as AUMP Runtime
    participant F as Fixture
    participant D as Decision
    F->>M: Provide task context
    M->>R: Proposed action JSON
    R->>R: Validate and evaluate mandate
    R-->>D: allowed / requires_escalation / denied
    D-->>F: Compare with expected fixture decision
```
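In code, that separation amounts to keeping the evaluator a pure function of the mandate and the proposed action, so fixtures can pin its output no matter which planner produced the action. The sketch below uses hypothetical field names (`protected_fields`, `escalation_action_types`) and shows only a disclosure check and an escalation check; the real rule set is defined by the mandate schemas, not by this code.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Decision:
    decision: str                      # "allowed" | "requires_escalation" | "denied"
    reason_codes: tuple[str, ...]

def evaluate(mandate: dict[str, Any], action: dict[str, Any]) -> Decision:
    """Deterministic evaluator sketch: identical inputs always yield the same Decision."""
    protected = set(mandate.get("protected_fields", []))
    requested = set(action.get("disclosure_fields", []))
    if requested & protected:
        # A protected mandate field would leave the runtime: deny outright.
        return Decision("denied", ("disclosure_denied",))
    if action.get("type") in set(mandate.get("escalation_action_types", [])):
        # High-risk action types are routed to trusted review instead of auto-approval.
        return Decision("requires_escalation", ("trusted_review_required",))
    return Decision("allowed", ())
```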
## Suggested Research Questions
- Cross-language parity: Do Python and TypeScript evaluators return the same decision, reason codes, and evidence semantics for every fixture?
- Binding minimality: Can counterparties verify mandate identity with only ID, canonical hash, version, and optionally an access-controlled URL? (See the hashing sketch after this list.)
- Disclosure robustness: Can an LLM-produced message be transformed or rejected so protected mandate fields never leave the runtime?
- Escalation usefulness: Does the trusted review summary include enough context for the principal to approve or deny without exposing irrelevant private data?
- Project Deal replay: Can a delegated negotiation like Anthropic's Project Deal be replayed across MCP, A2A, and UCP-shaped actions while preserving the same AUMP decision boundary?1
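For the binding minimality question, the canonical hash is the only derived value a counterparty needs; everything else in the binding is an opaque reference. The sketch below assumes one plausible canonicalization, UTF-8 JSON with sorted keys and no whitespace; the spec's actual canonical form may differ.

```python
import hashlib
import json

def mandate_binding(mandate: dict, mandate_id: str, version: str, url: str | None = None) -> dict:
    """Build the minimal reference a counterparty needs to verify mandate identity."""
    # Assumed canonical form: UTF-8 JSON, sorted keys, no insignificant whitespace.
    canonical = json.dumps(mandate, sort_keys=True, separators=(",", ":")).encode("utf-8")
    binding = {
        "id": mandate_id,
        "version": version,
        "canonical_hash": "sha256:" + hashlib.sha256(canonical).hexdigest(),
    }
    if url is not None:
        binding["url"] = url  # optional, access-controlled retrieval location
    return binding
```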
## Example Fixture Shape
```json
{
  "id": "project_deal_disclosure_denied",
  "mandate": "examples/project-deal.json",
  "action": {
    "type": "send_message",
    "counterparty": "seller_agent",
    "attributes": {
      "message": "My buyer can go up to the full private max budget."
    },
    "disclosure_fields": ["negotiation.reservation_value"]
  },
  "expected": {
    "decision": "denied",
    "reason_codes": ["disclosure_denied"]
  }
}
```
This is an illustrative research fixture shape, not a committed conformance fixture. It should be source-verified against the conformance manifest before being presented as shipped coverage.
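A fixture in that shape can be replayed as an ordinary test. The sketch below is illustrative only: it accepts any deterministic evaluator that returns the `Decision` shape from the earlier sketch, and it assumes the fixture's `mandate` field is a path relative to the fixture file.

```python
import json
from pathlib import Path
from typing import Callable

def replay_fixture(fixture_path: str, evaluate: Callable) -> None:
    """Replay one fixture and assert the expected decision and reason codes."""
    fixture = json.loads(Path(fixture_path).read_text())
    mandate = json.loads((Path(fixture_path).parent / fixture["mandate"]).read_text())

    result = evaluate(mandate, fixture["action"])

    assert result.decision == fixture["expected"]["decision"]
    assert list(result.reason_codes) == fixture["expected"]["reason_codes"]
```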
## Measurements
| Metric | Why it matters |
|---|---|
| Fixture pass count | Basic parity proof across spec and implementation. |
| Decision divergence count | Reveals nondeterministic or inconsistent evaluator behavior. |
| Private-field leakage count | Measures whether disclosure policy is actually enforced. |
| Escalation false-negative count | Measures whether high-risk commitments bypass trusted review. |
| Evidence completeness count | Measures whether material decisions can be audited later. |
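Most of these metrics are plain counts over fixture results, which keeps the measurement layer as deterministic as the evaluator itself. The sketch below assumes a hypothetical per-fixture result record with the fields named in the comments; the real result shape is not defined here.

```python
from collections import Counter

def summarize(results: list[dict]) -> Counter:
    """Aggregate per-fixture results into the metrics listed in the table above.

    Assumed (hypothetical) result fields: passed, decisions_diverged,
    leaked_private_fields, missed_escalation, evidence_complete.
    """
    metrics = Counter()
    for r in results:
        metrics["fixture_pass_count"] += r["passed"]
        metrics["decision_divergence_count"] += r["decisions_diverged"]
        metrics["private_field_leakage_count"] += len(r["leaked_private_fields"])
        metrics["escalation_false_negative_count"] += r["missed_escalation"]
        metrics["evidence_completeness_count"] += r["evidence_complete"]
    return metrics
```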
## Publication Rule
Research pages should report local validation precisely. Use "docs parity passed" only after `scripts/check_docs_parity.py` succeeds. Use "site build passed" only after `mkdocs build --strict` succeeds. Do not describe AUMP as adoption-ready merely because local docs or conformance checks passed.
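The rule can be mechanized so a research page's wording is gated on the checks actually passing. A minimal sketch using only the two commands named above:

```python
import subprocess

def local_validation_claims() -> list[str]:
    """Return only the validation claims this environment has actually earned."""
    claims = []
    if subprocess.run(["python", "scripts/check_docs_parity.py"]).returncode == 0:
        claims.append("docs parity passed")
    if subprocess.run(["mkdocs", "build", "--strict"]).returncode == 0:
        claims.append("site build passed")
    # Neither claim, alone or together, implies AUMP is adoption-ready.
    return claims
```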
## Claims Needing Source Verification
- Any benchmark-style claim about evaluator latency or leakage reduction needs an attached dataset, exact implementation commit, and reproducible runner.
- Any comparison to other agent-safety approaches needs primary-source citations and a stated threat model.