Agent Applications rely on lightweight discovery metadata before full activation. That makes the description field important in both APP.md and local SKILL.md files. An under-specified description means the right package or skill is missed; an over-broad one causes false activations and wastes context.

How discovery works

A compatible runtime loads lightweight metadata before deciding whether to activate the full package contract or one of its local skills. That metadata typically includes:
  • package name
  • slug
  • description
  • version
  • key command signals
The description is often the most signal-dense field in that metadata. Runtimes use it to decide whether a package is relevant to the current task.
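The metadata fields above can be sketched as a small structure. This is a minimal illustration, not a real schema: `PackageMetadata` and `looks_relevant` are hypothetical names, and an actual runtime judges relevance with a model rather than keyword overlap.

```python
from dataclasses import dataclass

@dataclass
class PackageMetadata:
    name: str
    slug: str
    description: str     # often the most signal-dense field
    version: str
    commands: list[str]  # key command signals

def looks_relevant(meta: PackageMetadata, task: str) -> bool:
    # Naive word overlap as a stand-in for the runtime's real
    # relevance judgment (which is model-based, not string matching).
    task_words = set(task.lower().split())
    desc_words = set(meta.description.lower().split())
    return len(task_words & desc_words) >= 2

todo = PackageMetadata(
    name="todo-app",
    slug="todo-app",
    description="Persistent to-do list operated through a JSON-first CLI "
                "with explicit confirmation for destructive actions.",
    version="1.0.0",
    commands=["add", "list", "remove"],
)
print(looks_relevant(todo, "safely remove an item from the to-do list"))
```

Even this toy check shows why the description carries the decision: the other fields rarely share vocabulary with a task.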

Write descriptions around intent

Good descriptions explain when the package matters, not just what it is.
Write your description as if a runtime is scanning a catalog of packages and needs to know in one sentence whether yours applies to the current task.
Prefer:
description: Persistent to-do list operated through a JSON-first CLI with explicit confirmation for destructive actions.
Over:
description: To-do app.
The stronger description communicates:
  • the work context (persistent to-do list)
  • the command style (JSON-first CLI)
  • a relevant safety constraint (explicit confirmation for destructive actions)
Useful patterns for stronger descriptions:
  • describe the work context
  • mention the kind of tasks the package supports
  • include adjacent signals such as safety constraints, state model, or command style
  • keep it concise enough to stay readable in a catalog
Descriptions that name the application but omit what it does cause the most misses. A name alone is rarely enough for a runtime to activate the right package in context.
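The patterns above can be turned into rough automated checks. This is a hypothetical linter sketch, not a real tool; the thresholds and messages are illustrative assumptions.

```python
def description_issues(description: str) -> list[str]:
    # Flag descriptions that are likely to cause discovery misses.
    issues = []
    if len(description.split()) < 5:
        # A bare name ("To-do app.") carries almost no task signal.
        issues.append("too short to carry context (likely just a name)")
    if len(description) > 200:
        # Long descriptions stop being readable in a catalog scan.
        issues.append("too long to stay readable in a catalog")
    return issues

print(description_issues("To-do app."))
print(description_issues(
    "Persistent to-do list operated through a JSON-first CLI "
    "with explicit confirmation for destructive actions."
))
```

A check like this catches the name-only failure mode mechanically, but it cannot judge whether the description names the right work context; that still needs trigger evals.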

Design trigger evals

Test your descriptions with realistic prompts and planning situations. For each test case, label it should_trigger or should_not_trigger. Examples:
  • "Inspect a local app package and explain which CLI commands mutate state" → should_trigger for the application package
  • "Summarize this static Markdown file" → should_not_trigger for a whole application package
  • "Safely remove an item from the to-do app after user confirmation" → should_trigger for both the package and its local usage skill
The most valuable negative tests are near-misses: prompts that share vocabulary with the package but do not actually need it activated.
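A trigger-eval set like the one above can be expressed as labeled cases plus a small harness. `run_discovery` is an assumed stand-in for however your runtime exposes its activation decision; the near-miss case is an invented example.

```python
CASES = [
    ("Inspect a local app package and explain which CLI commands mutate state",
     "should_trigger"),
    ("Summarize this static Markdown file",
     "should_not_trigger"),
    # Near-miss negative: shares vocabulary with the package but
    # does not actually need it activated.
    ("Alphabetize this list of grocery items in plain text",
     "should_not_trigger"),
]

def evaluate(run_discovery):
    # Return every case whose actual label disagrees with the expected one.
    failures = []
    for prompt, expected in CASES:
        triggered = run_discovery(prompt)  # True if the package activated
        actual = "should_trigger" if triggered else "should_not_trigger"
        if actual != expected:
            failures.append((prompt, expected, actual))
    return failures

# Trivial stand-in discovery function, just to make the harness runnable.
fake = lambda p: "package" in p.lower() or "to-do" in p.lower()
print(len(evaluate(fake)))  # prints 0
```

Keeping expected labels next to prompts makes it cheap to grow the negative set as new near-misses appear.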

Measure false positives and misses

For each test case, check whether the runtime:
  • surfaced the right package or skill
  • avoided loading unrelated packages
  • respected the difference between the base application contract and local operating guidance
Run each case multiple times if the underlying model behavior is nondeterministic. A single passing run does not confirm consistent behavior.
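Repeated runs can be folded into the pass criterion. A sketch, again assuming a hypothetical `run_discovery` hook:

```python
def trigger_rate(run_discovery, prompt: str, runs: int = 5) -> float:
    # Fraction of runs in which the package activated for this prompt.
    hits = sum(1 for _ in range(runs) if run_discovery(prompt))
    return hits / runs

def consistent(run_discovery, prompt: str, expected: str, runs: int = 5) -> bool:
    # A case passes only if every run agrees with the expected label,
    # not just a single lucky run.
    rate = trigger_rate(run_discovery, prompt, runs)
    return rate == 1.0 if expected == "should_trigger" else rate == 0.0

# Deterministic stand-in for the demo; real model behavior may vary per run.
print(consistent(lambda p: "package" in p, "Inspect a local app package",
                 "should_trigger"))
```

Tracking the rate rather than a boolean also surfaces borderline descriptions that trigger, say, three runs out of five.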

Iterate without overfitting

Use a train and validation split when improving descriptions:
  1. Revise descriptions based on train-set failures.
  2. Keep the validation set untouched.
  3. Choose the version that generalizes best across both sets.
Avoid stuffing specific keywords from failed prompts into the description. That tends to overfit to the failure rather than fixing the underlying gap in how the description communicates intent.
Fix the broader concept instead. If a prompt about confirmation-required commands keeps missing, add a clear signal about your safety model rather than copying words from the failing prompt.
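The train/validation procedure reduces to a selection rule. This sketch assumes a hypothetical `score(description, cases)` returning a pass rate; the toy word-overlap scorer below is illustrative only, where a real scorer would run the trigger evals.

```python
def pick_best(candidates, score, train_cases, val_cases):
    # Revise against train failures, but select on BOTH sets so the
    # winner generalizes instead of overfitting the train set.
    return max(candidates,
               key=lambda d: score(d, train_cases) + score(d, val_cases))

def score(description, cases):
    # Toy scorer: fraction of cases sharing any word with the description.
    words = set(description.lower().split())
    return sum(1 for c in cases if set(c.lower().split()) & words) / len(cases)

train = ["remove a to-do item", "add a task to the list"]
val = ["confirm before deleting a to-do"]
best = pick_best(
    ["To-do app.",
     "Persistent to-do list with confirmation for destructive actions"],
    score, train, val,
)
print(best)
```

Because the validation cases never drive revisions, a candidate that wins on both sets is evidence of a fixed concept, not a memorized keyword.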

Common failure modes

These patterns cause the most discovery problems in practice:
  • descriptions that name the application but not the tasks it supports
  • descriptions that omit the JSON-first or confirmation-sensitive parts of the contract
  • descriptions that blur the boundary between APP.md and local skills
  • descriptions that are too generic to distinguish one package from another
When both packages and local skills are discoverable, precision matters more than keyword density. A focused description outperforms a longer one that covers everything loosely.