# Evaluating packages

> How to use eval-driven iteration to test whether your Agent Application package improves reliability.

An Agent Application is only useful if it improves outcomes in practice. The right eval loop tests both the package contract and the real CLI behavior.

## What to evaluate

A good evaluation set covers more than final prose output. Use the categories below to build a comprehensive set of assertions.

<AccordionGroup>
  <Accordion title="Discovery quality">
    Test whether the runtime surfaces your package and skills at the right times:

    * Does the package appear when a relevant task is requested?
    * Are unrelated packages correctly excluded?
    * Does the runtime distinguish between the base application contract and local operating guidance?
  </Accordion>

  <Accordion title="APP.md contract clarity">
    Test whether the contract in `APP.md` is sufficient to operate the application without guessing:

    * Is the `entry.command` unambiguous?
    * Are all callable commands listed in `commands`?
    * Is the state model described clearly enough to reason about mutations?
    * Are confirmation rules documented for destructive commands?
  </Accordion>

  <Accordion title="Local skill activation behavior">
    Test whether local skills load when they should and stay quiet when they should not:

    * Does the skill activate on relevant operating tasks?
    * Does the skill avoid activating on unrelated prompts?
    * Does the skill complement `APP.md` rather than contradict it?
  </Accordion>

  <Accordion title="CLI execution quality">
    Test whether the documented commands actually work:

    * Is `entry.command` callable?
    * Do all listed commands in `APP.md` execute without error?
    * Do commands behave consistently across repeated runs?
  </Accordion>

  <Accordion title="JSON output stability">
    Test whether the CLI returns machine-readable output reliably:

    * Does every successful command return parseable JSON on stdout?
    * Do failure cases return structured JSON errors with non-zero exit codes?
    * Do expected fields appear in both success and failure responses?
    * Does the output shape remain stable across versions?
  </Accordion>

  <Accordion title="State correctness after mutations">
    Test whether the application manages its owned state correctly:

    * Does state persist correctly after `add`, `update`, and `complete` operations?
    * Do IDs remain stable across runs?
    * Does the application own its state independently of prompt memory or chat history?
  </Accordion>

  <Accordion title="Confirmation behavior for destructive commands">
    Test whether destructive commands enforce their safety rules:

    * Does a destructive command fail without the required confirmation flag?
    * Does it return a structured error explaining the requirement?
    * Does it succeed when the confirmation flag is provided?
  </Accordion>

  <Accordion title="Time and token cost">
    Track the overhead the package introduces per run:

    * How many tokens does loading the full package cost?
    * How does that cost change with and without local skills loaded?
    * Is the extra context cost justified by the improvement in outcomes?
  </Accordion>
</AccordionGroup>
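One way to check the "output shape remains stable" question from the JSON output stability category is to reduce each response to its structure and compare structures across versions. The sketch below is a minimal illustration, not part of any Agent Application tooling; the `shape` helper and the sample payloads are hypothetical.

```python
import json


def shape(value):
    """Reduce a JSON value to its structure: keys and type names, no data."""
    if isinstance(value, dict):
        return {key: shape(item) for key, item in value.items()}
    if isinstance(value, list):
        # Describe a list by the shape of its first element, if any.
        return [shape(value[0])] if value else []
    return type(value).__name__


# Hypothetical payloads captured from two versions of the same command.
old = json.loads('{"id": "t1", "title": "buy milk", "done": false}')
new = json.loads('{"id": "t2", "title": "walk dog", "done": true}')
changed = json.loads('{"id": 7, "title": "walk dog"}')

assert shape(old) == shape(new)      # same shape, different data
assert shape(old) != shape(changed)  # id type changed and a field was dropped
```

Comparing shapes rather than raw output keeps the check insensitive to data that legitimately differs between runs, such as IDs or timestamps.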

## Start with realistic test cases

Build your eval set from cases that reflect how the package will actually be used. Each case should include:

* a realistic user or operator prompt
* the package path being evaluated
* expected commands or behaviors
* expected JSON fields or side effects
* optional input files or seed state

Good starting cases:

* Inspect an Agent Application and summarize its callable commands.
* Add and complete an item in the example to-do application.
* Attempt a destructive command without confirmation and verify that it fails safely.
* Compare a package run with and without its local operating skill.

## Compare against a baseline

Run each case in at least two configurations:

* the candidate: the current package with its local skills
* the baseline: the previous package revision, or the current package without the local skill

<Tip>
  Comparing against a baseline tells you whether the package is improving outcomes rather than only adding more context. A package that costs more tokens without improving pass rate is not an improvement.
</Tip>
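
A baseline comparison can be as simple as computing the change in pass rate and token cost between the two configurations. The sketch below assumes you already have per-case results for each configuration; `RunResult`, `summarize`, `compare`, and the sample numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    case_id: str
    passed: bool
    total_tokens: int


def summarize(results):
    """Return (pass_rate, mean_tokens) for one configuration's results."""
    if not results:
        raise ValueError("no results to summarize")
    pass_rate = sum(r.passed for r in results) / len(results)
    mean_tokens = sum(r.total_tokens for r in results) / len(results)
    return pass_rate, mean_tokens


def compare(candidate, baseline):
    """Deltas of candidate minus baseline; a positive pass_rate_delta is better."""
    cand_rate, cand_tokens = summarize(candidate)
    base_rate, base_tokens = summarize(baseline)
    return {
        "pass_rate_delta": cand_rate - base_rate,
        "mean_token_delta": cand_tokens - base_tokens,
    }


# Made-up numbers, for illustration only.
baseline = [RunResult("inspect", True, 1200), RunResult("add", False, 1500)]
candidate = [RunResult("inspect", True, 1800), RunResult("add", True, 2100)]
delta = compare(candidate, baseline)
```

In this made-up example the candidate passes one more case at a higher mean token cost, which is the trade-off the comparison is meant to surface.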

## Write objective assertions first

Start with checks you can grade without human review:

* `APP.md`, `app/`, and `skills/` exist
* the documented `entry.command` is callable
* JSON output parses successfully
* expected fields are present in success and failure cases
* destructive commands require explicit confirmation
* state changes match the documented contract
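
Several of these checks need nothing more than the filesystem and a JSON parser. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and it assumes the CLI returns a JSON object at the top level, which your package may or may not do.

```python
import json
from pathlib import Path


def check_layout(package_dir):
    """Return the required entries that are missing from the package."""
    root = Path(package_dir)
    required = ("APP.md", "app", "skills")
    return [name for name in required if not (root / name).exists()]


def check_json_output(stdout, exit_code, required_fields, expect_success=True):
    """Return a list of failed assertions; an empty list means the run passed."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return ["stdout is not parseable JSON"]
    if not isinstance(payload, dict):
        # Assumption: responses are JSON objects, not arrays or scalars.
        return ["top-level JSON value is not an object"]
    failures = []
    if expect_success and exit_code != 0:
        failures.append(f"expected exit code 0, got {exit_code}")
    if not expect_success and exit_code == 0:
        failures.append("expected a non-zero exit code for a failure case")
    for name in required_fields:
        if name not in payload:
            failures.append(f"missing field: {name}")
    return failures
```

Returning a list of failure messages instead of a single boolean makes it easier to assign a failure category later.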

<Tip>
  Add human review after objective assertions pass. Human review is most useful for broader questions: Is the skill helping the agent choose the right command? Is the output clear enough to act on?
</Tip>

## Track cost and drift

Collect per-run data to spot regressions early:

| Metric                     | Why it matters                                                      |
| -------------------------- | ------------------------------------------------------------------- |
| Pass rate                  | Measures overall reliability                                        |
| Failure category           | Identifies whether failures are in design, instructions, or tooling |
| Duration                   | Tracks execution time per run                                       |
| Total tokens               | Measures context cost                                               |
| CLI vs. `APP.md` alignment | Detects when docs drift away from implementation                    |

<Warning>
  Package docs can drift away from the implementation over time. Track whether live CLI behavior still matches `APP.md` as part of your regular eval cycle.
</Warning>
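
Once you have the command names documented in `APP.md` and the command names the live CLI actually accepts, detecting drift reduces to a set comparison. How you obtain the two lists depends on your package and is out of scope here; the `command_drift` helper and the `delete` and `archive` command names below are hypothetical.

```python
def command_drift(documented, observed):
    """Compare documented command names against those the CLI exposes."""
    documented, observed = set(documented), set(observed)
    return {
        "documented_but_missing": sorted(documented - observed),
        "undocumented": sorted(observed - documented),
    }


# Hypothetical command lists, for illustration only.
drift = command_drift(
    documented=["add", "complete", "delete"],
    observed=["add", "complete", "archive"],
)
```

An empty result in both lists means the docs and the CLI agree on the command surface. It says nothing about whether the commands behave as documented, so keep the behavioral assertions as well.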

## Use failures to refine the package

Read failures at three levels:

* **Package design**: wrong boundary between `APP.md`, `app/`, and local skills
* **Instructions**: unclear command semantics or missing safety defaults
* **Tooling**: weak discovery, weak JSON validation, or poor confirmation UX

<Tip>
  If every run reinvents the same logic, something is missing from `APP.md` or a local skill. Improve the package contract or bundle better local guidance.
</Tip>
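
Labeling each failure with one of the three levels makes it easy to see where most problems originate. The sketch below is illustrative: the `FailureCategory` enum, the `tally` helper, and the sample failures are assumptions, and assigning the labels still requires a human or a separate grader.

```python
from collections import Counter
from enum import Enum


class FailureCategory(Enum):
    PACKAGE_DESIGN = "package design"
    INSTRUCTIONS = "instructions"
    TOOLING = "tooling"


def tally(labeled_failures):
    """Count failures per category from (case_name, category) pairs."""
    return Counter(category for _, category in labeled_failures)


# Made-up labels assigned while reviewing one eval run.
failures = [
    ("delete without flag", FailureCategory.INSTRUCTIONS),
    ("complete unknown id", FailureCategory.INSTRUCTIONS),
    ("package not surfaced", FailureCategory.TOOLING),
]
counts = tally(failures)
top_category, top_count = counts.most_common(1)[0]
```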

## The evaluation loop

<Steps>
  <Step title="Run the eval set against the package">
    Execute all test cases against your current package version. Collect outputs, exit codes, and token counts.
  </Step>

  <Step title="Grade objective assertions">
    Check each case against your defined assertions: JSON parses, fields are present, confirmation rules hold, state matches.
  </Step>

  <Step title="Review outputs and execution traces">
    Read failures and near-misses. Identify whether the problem is in package design, instructions, or tooling.
  </Step>

  <Step title="Tighten APP.md, descriptions, or local skills">
    Make targeted changes based on what you found. Tighten the contract if commands are ambiguous. Improve descriptions if discovery misses relevant cases. Extend local skills if operating guidance is missing.
  </Step>

  <Step title="Rerun and compare the delta">
    Run the full eval set again. Compare pass rate, failure categories, and token cost against the previous version. Stop when the package improves outcomes consistently and the extra context cost is justified.
  </Step>
</Steps>
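
The first two steps of the loop can be sketched as a small harness that runs each command, captures its output, and grades it. This is a minimal sketch, not a reference implementation: the function names and case format are hypothetical, a Python one-liner stands in for a real package CLI, and token counts are omitted because they come from the agent runtime rather than from the CLI.

```python
import json
import subprocess
import sys
import time


def run_command(argv, timeout=30):
    """Run one CLI command and capture what the graders need."""
    started = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "duration_s": time.perf_counter() - started,
    }


def grade(run, required_fields):
    """Objective check: zero exit code, parseable JSON object, fields present."""
    if run["exit_code"] != 0:
        return False
    try:
        payload = json.loads(run["stdout"])
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(f in payload for f in required_fields)


def run_eval(cases):
    """Run every case and return the pass rate plus per-case results."""
    if not cases:
        raise ValueError("eval set is empty")
    results = [
        {"name": c["name"], "passed": grade(run_command(c["argv"]), c["fields"])}
        for c in cases
    ]
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results


# Stand-in for a real package CLI: a Python one-liner that prints JSON.
fake_cli = [sys.executable, "-c", "import json; print(json.dumps({'id': 't1'}))"]
cases = [
    {"name": "has id", "argv": fake_cli, "fields": ["id"]},
    {"name": "has status", "argv": fake_cli, "fields": ["status"]},
]
pass_rate, results = run_eval(cases)
```

Store the pass rate and per-case results for each package revision so that the final step can compare the new run against the previous one.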
