feat: add experimental BinEval evaluation support #42100

Draft
Copilot wants to merge 3 commits into main from copilot/add-eval-support-gh-aw

Conversation

Copilot AI commented Jun 28, 2026

Contributor

Adds native BinEval-style evaluations to gh-aw — small, binary questions declared in workflow frontmatter, executed post-run via an LLM harness, with results aggregated and reported as CI artifacts.

Schema (evals frontmatter)

  • New optional evals array with id + question fields; validated for unique IDs and non-empty questions
  • Emits experimental warning at compile time
evals:
  - id: builds
    question: Does the generated code compile?
  - id: focused
    question: Is the implementation limited to the requested change?
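The two validation rules above (unique IDs, non-empty questions) can be sketched roughly as follows. This is an illustrative JavaScript sketch only: in the PR the validation is done by the Go compiler, and the function name `validateEvals`, the error wording, and the rule that `id` itself must be non-empty are assumptions, not taken from the change.

```javascript
// Hypothetical validator mirroring the rules described above.
// Not the gh-aw compiler code; names and messages are assumptions.
function validateEvals(evals) {
  const errors = [];
  if (evals === undefined) return errors; // the evals field is optional
  if (!Array.isArray(evals)) return ["evals must be an array"];

  const seen = new Set();
  evals.forEach((entry, index) => {
    const id = typeof entry?.id === "string" ? entry.id.trim() : "";
    const question =
      typeof entry?.question === "string" ? entry.question.trim() : "";

    if (id === "") {
      // Assumption: an empty id is rejected as well as a duplicate one.
      errors.push(`evals[${index}]: id is required`);
    } else if (seen.has(id)) {
      errors.push(`evals[${index}]: duplicate id "${id}"`);
    } else {
      seen.add(id);
    }

    if (question === "") {
      errors.push(`evals[${index}]: question must be non-empty`);
    }
  });
  return errors;
}
```

Collecting every error instead of stopping at the first lets a workflow author fix all problems in one pass; whether the real compiler does this is not stated in the PR.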

Evaluation model

  • EvalDefinition, EvalResult, EvalSummary types in frontmatter_types.go
  • WorkflowData.Evals []EvalDefinition for downstream consumers

Eval job

  • New eval job injected after agent + detection jobs in the compiled workflow
  • JS harness (eval_harness.cjs) calls GitHub Models API (gpt-4o-mini) per question independently — no MCPs, no checkout
  • Prompt generation produces per-question binary prompts with rationale; no holistic scoring
  • Results aggregated (total/passed/failed/pass-rate) and uploaded as an eval artifact with a markdown step summary
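As a rough illustration of the per-question prompt and the aggregation step described above, here is a minimal sketch. The function names (`buildBinaryPrompt`, `aggregateVerdicts`, `renderSummaryMarkdown`), the prompt wording, and the result shape `{ id, pass, rationale }` are all hypothetical; the PR does not show the harness's actual signatures or output format.

```javascript
// Hypothetical prompt builder: one binary question per prompt,
// asking for a yes/no verdict plus a short rationale.
function buildBinaryPrompt(evalDef, runContext) {
  return [
    "Answer the following question about the agent run with a strict yes or no.",
    `Question (${evalDef.id}): ${evalDef.question}`,
    "Context:",
    runContext,
    'Reply with JSON only: {"id": "<id>", "pass": true|false, "rationale": "<one sentence>"}',
  ].join("\n");
}

// Aggregate independent verdicts into total/passed/failed/pass-rate.
// Anything other than pass === true counts as a failure (an assumption).
function aggregateVerdicts(results) {
  const total = results.length;
  const passed = results.filter((r) => r.pass === true).length;
  const failed = total - passed;
  const passRate = total === 0 ? 0 : passed / total;
  return { total, passed, failed, passRate };
}

// Render a small markdown table suitable for a step summary.
function renderSummaryMarkdown(summary, results) {
  const percent = (summary.passRate * 100).toFixed(1);
  const lines = [
    "## Eval results",
    "",
    `${summary.passed}/${summary.total} passed (${percent}%)`,
    "",
    "| ID | Result | Rationale |",
    "| --- | --- | --- |",
  ];
  for (const r of results) {
    lines.push(`| ${r.id} | ${r.pass ? "pass" : "fail"} | ${r.rationale ?? ""} |`);
  }
  return lines.join("\n");
}
```

Because each question is scored on its own, the pass rate is just a count ratio, with no weighting or holistic score involved.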

Not included

Phase 8 (persisting results to a git branch, à la experiments) is deferred.

Co-authored-by: Copilot <[email protected]>

Co-authored-by: pelikhan <[email protected]>
Comment thread actions/setup/js/eval_harness.cjs Outdated
@pelikhan

Collaborator

@copilot run inference in AWF and use /reflect to find a suitable inference endpoint.

@github-actions

Contributor

Hey @Copilot 👋 — great work on the BinEval evaluation support! The end-to-end shape is clean: the evals frontmatter schema, the Go compiler module, the JS harness, and the wiring into the orchestrator all land as a cohesive unit.

A few things that stand out positively:

  • Well-scoped: every changed file is in service of the same feature — nothing unrelated was mixed in.
  • Tests at every layer: compiler_evals_test.go (Go unit tests), eval_harness.test.cjs (JS unit tests), and evals_experimental_warning_test.go (integration guard) give solid coverage across the stack.
  • Clear description: the PR body explains the schema, the evaluation model, the harness design, and explicitly calls out what was deferred (Phase 8 persistence) — exactly the context reviewers need.
  • Experimental gating: emitting a compile-time warning via emitExperimentalFeatureWarnings is the right pattern for a feature that isn't production-ready yet.

This looks ready for review. 🚀

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • patchdiff.githubusercontent.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "patchdiff.githubusercontent.com"

See Network Configuration for more information.

Generated by ✅ Contribution Check · 293.1 AIC · ⌖ 20.4 AIC · ⊞ 6K ·

- Remove direct GitHub Models API calls from eval_harness.cjs; keep only shared utility functions (readEvalSpec, buildEvalPrompt, aggregateResults, renderMarkdownSummary, sanitizeEvalError)
- Add actions/setup/md/eval.md: eval prompt template instructing the engine to output EVAL_RESULT:{...json...}
- Add actions/setup/js/setup_eval.cjs: prompt setup script (mirrors setup_threat_detection.cjs)
- Add actions/setup/js/parse_eval_results.cjs: result parser extracting EVAL_RESULT from engine log
- Update compiler_evals.go: eval job now follows the detection job pattern — pulls AWF containers, clears MCP config, installs the agentic engine, runs it inside AWF, parses results
- Add EvalLogPath, EvalDir, DefaultEvalMaxAICredits constants to pkg/constants

Co-authored-by: Copilot <[email protected]>

Co-authored-by: pelikhan <[email protected]>
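The commit above describes a parser that extracts `EVAL_RESULT:{...json...}` from the engine log. A minimal sketch of that idea follows, assuming each marker and its JSON object sit on a single log line, possibly after a log prefix. The function name `parseEvalResultLines` and the returned `{ results, errors }` shape are hypothetical and may differ from `parse_eval_results.cjs`.

```javascript
// Marker taken from the commit message; everything else is an assumption.
const EVAL_RESULT_PREFIX = "EVAL_RESULT:";

// Scan a log for marker lines and parse the JSON object after each marker.
// Malformed payloads are collected as errors rather than aborting the scan.
function parseEvalResultLines(logText) {
  const results = [];
  const errors = [];
  for (const rawLine of logText.split(/\r?\n/)) {
    const at = rawLine.indexOf(EVAL_RESULT_PREFIX);
    if (at === -1) continue; // ordinary log line

    const payload = rawLine.slice(at + EVAL_RESULT_PREFIX.length).trim();
    try {
      const parsed = JSON.parse(payload);
      if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
        errors.push("EVAL_RESULT payload is not a JSON object");
        continue;
      }
      results.push(parsed);
    } catch (err) {
      errors.push(`unparseable EVAL_RESULT payload: ${err.message}`);
    }
  }
  return { results, errors };
}
```

A line with trailing text after the JSON object would be reported as an error in this sketch; a sturdier parser would need to locate the end of the object itself.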
Copilot stopped work on behalf of pelikhan due to an error June 28, 2026 17:49
@pelikhan

Collaborator

@copilot keep working
