feat: add experimental BinEval evaluation support by Copilot · Pull Request #42100 · github/gh-aw

Copilot · 2026-06-28T14:47:18Z

Adds native BinEval-style evaluations to gh-aw — small, binary questions declared in workflow frontmatter, executed post-run via an LLM harness, with results aggregated and reported as CI artifacts.

Schema (`evals` frontmatter)

New optional evals array with id + question fields; validated for unique IDs and non-empty questions
Emits an experimental warning at compile time

evals:
  - id: builds
    question: Does the generated code compile?
  - id: focused
    question: Is the implementation limited to the requested change?
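
The two validation rules stated above (unique IDs, non-empty questions) can be sketched as follows. This is only an illustration in JavaScript: `validateEvals` is a hypothetical name, the error messages are invented, and treating a missing `id` as an error is an assumption. The PR itself performs this validation in the Go compiler.

```javascript
// Hypothetical validator mirroring the stated rules; not the gh-aw implementation.
function validateEvals(evals) {
  const errors = [];
  if (evals === undefined) return errors; // the evals field is optional
  if (!Array.isArray(evals)) return ["evals must be an array"];

  const seen = new Set();
  evals.forEach((entry, index) => {
    const id = typeof entry?.id === "string" ? entry.id.trim() : "";
    const question = typeof entry?.question === "string" ? entry.question.trim() : "";

    if (id === "") {
      errors.push(`evals[${index}]: id is required`); // assumption: ids must be non-empty
    } else if (seen.has(id)) {
      errors.push(`evals[${index}]: duplicate id "${id}"`);
    } else {
      seen.add(id);
    }

    if (question === "") {
      errors.push(`evals[${index}]: question must not be empty`);
    }
  });
  return errors;
}
```

Returning a list of messages rather than throwing on the first problem lets a compiler report every invalid entry in one pass.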

Evaluation model

EvalDefinition, EvalResult, EvalSummary types in frontmatter_types.go
WorkflowData.Evals []EvalDefinition for downstream consumers

Eval job

New eval job injected after agent + detection jobs in the compiled workflow
JS harness (eval_harness.cjs) calls GitHub Models API (gpt-4o-mini) per question independently — no MCPs, no checkout
Prompt generation produces per-question binary prompts with rationale; no holistic scoring
Results aggregated (total/passed/failed/pass-rate) and uploaded as an eval artifact with a markdown step summary
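
As a rough sketch of the per-question prompt and the aggregation described above: the function names, the prompt wording, and the result shape (`id`, `passed`, `rationale`) below are all assumptions made for illustration and are not taken from `eval_harness.cjs`.

```javascript
// Hypothetical helpers; signatures and result shape are assumptions.

// Build one binary prompt per question; no holistic score is requested.
function buildBinaryPrompt(evalDef, transcript) {
  return [
    "Answer the question below about the agent run with a strict yes or no.",
    `Question (${evalDef.id}): ${evalDef.question}`,
    "Give a one-sentence rationale for your answer.",
    "--- Agent output ---",
    transcript,
  ].join("\n");
}

// Reduce per-question results to total/passed/failed/pass-rate.
function aggregate(results) {
  const total = results.length;
  const passed = results.filter((r) => r.passed === true).length;
  const failed = total - passed;
  const passRate = total === 0 ? 0 : passed / total; // avoid dividing by zero
  return { total, passed, failed, passRate };
}

// Render a markdown table plus a one-line summary for a step summary.
function renderSummary(results) {
  const s = aggregate(results);
  return [
    "| Eval | Result | Rationale |",
    "| --- | --- | --- |",
    ...results.map((r) => `| ${r.id} | ${r.passed ? "pass" : "fail"} | ${r.rationale} |`),
    "",
    `${s.passed}/${s.total} passed (${(s.passRate * 100).toFixed(1)}%)`,
  ].join("\n");
}
```

Because each question is judged independently, one failing or errored question does not affect how the others are counted.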

Not included

Phase 8 (persisting results to a git branch, à la experiments) is deferred.

Co-authored-by: Copilot <[email protected]>
Co-authored-by: pelikhan <[email protected]>

pelikhan · 2026-06-28T16:56:38Z

@copilot run inference in AWF and use /reflect to find a suitable inference endpoint.

github-actions · 2026-06-28T17:28:05Z

Hey @Copilot 👋 — great work on the BinEval evaluation support! The end-to-end shape is clean: the evals frontmatter schema, the Go compiler module, the JS harness, and the wiring into the orchestrator all land as a cohesive unit.

A few things that stand out positively:

Well-scoped: every changed file is in service of the same feature — nothing unrelated was mixed in.
Tests at every layer: compiler_evals_test.go (Go unit tests), eval_harness.test.cjs (JS unit tests), and evals_experimental_warning_test.go (integration guard) give solid coverage across the stack.
Clear description: the PR body explains the schema, the evaluation model, the harness design, and explicitly calls out what was deferred (Phase 8 persistence) — exactly the context reviewers need.
Experimental gating: emitting a compile-time warning via emitExperimentalFeatureWarnings is the right pattern for a feature that isn't production-ready yet.

This looks ready for review. 🚀

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

patchdiff.githubusercontent.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "patchdiff.githubusercontent.com"

See Network Configuration for more information.

Generated by ✅ Contribution Check · 293.1 AIC · ⌖ 20.4 AIC · ⊞ 6K · ◷

- Remove direct GitHub Models API calls from eval_harness.cjs; keep only shared utility functions (readEvalSpec, buildEvalPrompt, aggregateResults, renderMarkdownSummary, sanitizeEvalError)
- Add actions/setup/md/eval.md: eval prompt template instructing the engine to output EVAL_RESULT:{...json...}
- Add actions/setup/js/setup_eval.cjs: prompt setup script (mirrors setup_threat_detection.cjs)
- Add actions/setup/js/parse_eval_results.cjs: result parser extracting EVAL_RESULT from engine log
- Update compiler_evals.go: eval job now follows the detection job pattern (pulls AWF containers, clears MCP config, installs the agentic engine, runs it inside AWF, parses results)
- Add EvalLogPath, EvalDir, DefaultEvalMaxAICredits constants to pkg/constants

Co-authored-by: Copilot <[email protected]>
Co-authored-by: pelikhan <[email protected]>
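
Extracting `EVAL_RESULT` payloads from an engine log might look roughly like the sketch below. It assumes each result is a single-line JSON object following the `EVAL_RESULT:` marker and that malformed payloads are skipped. The function name and the payload fields used in the usage example are hypothetical, and the real `parse_eval_results.cjs` may work differently.

```javascript
// Hypothetical log parser; assumes one JSON object per line after the marker.
const MARKER = "EVAL_RESULT:";

function parseEvalResults(logText) {
  const results = [];
  for (const line of logText.split(/\r?\n/)) {
    const at = line.indexOf(MARKER);
    if (at === -1) continue; // not a result line

    const payload = line.slice(at + MARKER.length).trim();
    try {
      results.push(JSON.parse(payload));
    } catch {
      // Assumption: a malformed payload is ignored rather than failing the job.
    }
  }
  return results;
}

// Usage with an invented log excerpt:
const sampleLog = [
  "[engine] starting",
  'EVAL_RESULT:{"id":"builds","passed":true}',
  "EVAL_RESULT:{broken",
].join("\n");
console.log(parseEvalResults(sampleLog).length); // 1
```

Scanning line by line keeps the parser tolerant of arbitrary log prefixes before the marker.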

pelikhan · 2026-06-28T17:51:20Z

@copilot keep working

Add experimental BinEval evaluation support

867e943

Co-authored-by: Copilot <[email protected]>
Co-authored-by: pelikhan <[email protected]>

Copilot AI assigned Copilot and pelikhan Jun 28, 2026

Copilot created this pull request from a session on behalf of pelikhan June 28, 2026 14:47

pelikhan reviewed Jun 28, 2026

Comment thread actions/setup/js/eval_harness.cjs Outdated

Copilot started work on behalf of pelikhan June 28, 2026 14:50

Copilot finished work on behalf of pelikhan June 28, 2026 15:19

Copilot AI requested a review from pelikhan June 28, 2026 15:19

Copilot started work on behalf of pelikhan June 28, 2026 16:57

Copilot stopped work on behalf of pelikhan due to an error June 28, 2026 17:49: Copilot integration ID is required

Merge branch 'main' into copilot/add-eval-support-gh-aw

6eada22

Copilot started work on behalf of pelikhan June 28, 2026 17:58

Copilot wants to merge 3 commits into main from copilot/add-eval-support-gh-aw
