Lightweight Agent Evaluation
Scenario design, expected behavior, failure modes, and readiness scorecards.
Agent evaluation, reliability, and operational scorecards
A public workspace of small, repeatable evaluation workflows for tool-using AI agents, connecting papers, datasets, demos, package utilities, and scorecard templates.
- Mixed-check regression testing for LLM and agent workflows.
- Small-rule checks for prompt injection and vector poisoning risks.
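A minimal sketch of how a mixed-check regression run might look. Everything here is illustrative, not an API from this workspace: the check names, the fixture shape, and the rules are hypothetical.

```python
def exact_check(expected: str, actual: str) -> bool:
    """Strict regression check: the output must match the fixture exactly."""
    return expected == actual

def rule_check(rules: list, actual: str) -> bool:
    """Small-rule check: every required marker must appear in the output.
    (Hypothetical; an injection screen would invert this, flagging outputs
    where a forbidden marker DOES appear.)"""
    return all(rule in actual for rule in rules)

def run_fixtures(fixtures: list) -> list:
    """Apply the matching check to each recorded fixture, collecting
    pass/fail results for a regression report."""
    results = []
    for fixture in fixtures:
        if fixture["kind"] == "exact":
            results.append(exact_check(fixture["expected"], fixture["actual"]))
        else:
            results.append(rule_check(fixture["rules"], fixture["actual"]))
    return results

# Recorded agent outputs serve as fixtures, so the run is repeatable.
fixtures = [
    {"kind": "exact", "expected": "DONE", "actual": "DONE"},
    {"kind": "rules", "rules": ["tool_call", "stop"],
     "actual": "tool_call search -> result -> stop"},
]
```

Mixing exact and rule-based checks lets stable steps stay strict while free-form agent output is judged by smaller, tolerant rules.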
| Area | Question | Signal |
|---|---|---|
| Task fit | Does the agent know the boundary of the task? | Clear goal, inputs, and stop condition |
| Tool use | Are tool calls traceable and justified? | Observable calls and recoverable failures |
| Reliability | Can the workflow be repeated? | Fixtures, expected behavior, and regression checks |
| Readiness | Is it safe to widen access? | Known risks, review notes, and rollout decision |
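To make the table concrete, the four areas could be recorded in a small scorecard object and folded into a single rollout decision. `Scorecard` and `Check` are assumed names for illustration, not utilities shipped by this workspace.

```python
from dataclasses import dataclass, field

@dataclass
class Check:
    """One scorecard row: the area, the question asked, and whether
    the expected signal was observed."""
    area: str
    question: str
    passed: bool
    note: str = ""

@dataclass
class Scorecard:
    """Aggregate checks into a readiness decision."""
    checks: list = field(default_factory=list)

    def add(self, area: str, question: str, passed: bool, note: str = "") -> None:
        self.checks.append(Check(area, question, passed, note))

    def ready(self) -> bool:
        # Widen access only when at least one check exists and all pass.
        return bool(self.checks) and all(c.passed for c in self.checks)

card = Scorecard()
card.add("Task fit", "Does the agent know the boundary of the task?", True)
card.add("Tool use", "Are tool calls traceable and justified?", True)
card.add("Reliability", "Can the workflow be repeated?", False, "no fixtures yet")
card.add("Readiness", "Is it safe to widen access?", False, "risks unreviewed")
```

The `note` field carries the review context the table's Readiness row calls for, so a failed check documents why access stays narrow.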