feat: add experimental BinEval evaluation support #42100
Draft
Copilot wants to merge 3 commits into
Co-authored-by: Copilot <[email protected]>
Co-authored-by: pelikhan <[email protected]>
Copilot created this pull request from a session on behalf of
pelikhan
June 28, 2026 14:47
pelikhan
reviewed
Jun 28, 2026
Collaborator
@copilot run inference in AWF and use /reflect to find a suitable inference endpoint.
Contributor
Hey A few things that stand out positively:
This looks ready for review. 🚀

Warning: Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

network:
  allowed:
    - defaults
    - "patchdiff.githubusercontent.com"

See Network Configuration for more information.
- Remove direct GitHub Models API calls from eval_harness.cjs; keep only shared utility functions (readEvalSpec, buildEvalPrompt, aggregateResults, renderMarkdownSummary, sanitizeEvalError)
- Add actions/setup/md/eval.md: eval prompt template instructing the engine to output EVAL_RESULT:{...json...}
- Add actions/setup/js/setup_eval.cjs: prompt setup script (mirrors setup_threat_detection.cjs)
- Add actions/setup/js/parse_eval_results.cjs: result parser extracting EVAL_RESULT from engine log
- Update compiler_evals.go: eval job now follows the detection job pattern — pulls AWF containers, clears MCP config, installs the agentic engine, runs it inside AWF, parses results
- Add EvalLogPath, EvalDir, DefaultEvalMaxAICredits constants to pkg/constants
Co-authored-by: Copilot <[email protected]>
Co-authored-by: pelikhan <[email protected]>
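As an illustration of the result-parsing step described in the commit message above, here is a minimal sketch of pulling EVAL_RESULT payloads out of an engine log. It is not the code in parse_eval_results.cjs: the function name parseEvalResults, the one-payload-per-line assumption, and the id/pass fields in the sample data are all hypothetical. Only the EVAL_RESULT:{...json...} marker comes from the commit message.

```javascript
// Hypothetical sketch, not the actual parse_eval_results.cjs.
// Assumption: each result is written on a single log line as
// EVAL_RESULT:{...json...}, possibly preceded by a log prefix.
const MARKER = "EVAL_RESULT:";

function parseEvalResults(logText) {
  const results = [];
  for (const line of logText.split(/\r?\n/)) {
    const at = line.indexOf(MARKER);
    if (at === -1) continue; // not a result line

    // Everything after the marker is expected to be one JSON object.
    const payload = line.slice(at + MARKER.length).trim();
    try {
      const parsed = JSON.parse(payload);
      if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
        results.push(parsed);
      }
    } catch {
      // Malformed payloads are skipped so that one bad line
      // does not discard the other results.
    }
  }
  return results;
}

// Usage with made-up log lines (the id and pass fields are assumptions).
const sampleLog = [
  "engine: starting",
  'EVAL_RESULT:{"id":"q1","pass":true}',
  "EVAL_RESULT:{not json",
  '[agent] EVAL_RESULT: {"id":"q2","pass":false}',
].join("\n");

console.log(parseEvalResults(sampleLog));
```

Skipping malformed lines rather than throwing is a design choice of this sketch; a stricter parser might instead record them as errors so a missing result is visible in the summary.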
Copilot stopped work on behalf of
pelikhan due to an error
June 28, 2026 17:49
Collaborator
@copilot keep working
Adds native BinEval-style evaluations to gh-aw — small, binary questions declared in workflow frontmatter, executed post-run via an LLM harness, with results aggregated and reported as CI artifacts.
Schema (evals frontmatter)
- evals array with id + question fields; validated for unique IDs and non-empty questions

Evaluation model
- EvalDefinition, EvalResult, EvalSummary types in frontmatter_types.go
- WorkflowData.Evals []EvalDefinition for downstream consumers

Eval job
- eval job injected after agent + detection jobs in the compiled workflow
- eval_harness.cjs calls GitHub Models API (gpt-4o-mini) per question independently: no MCPs, no checkout
- eval artifact with a markdown step summary

Not included
Phase 8 (persisting results to a git branch, à la experiments) is deferred.
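
The schema rules listed above (unique IDs, non-empty questions) can be sketched as a small check. This is only an illustration in JavaScript; it is not the validation code from this PR. The function name validateEvals, the error message wording, and the treatment of a missing id as an error are assumptions.

```javascript
// Hypothetical sketch of the evals frontmatter rules: every entry has an id
// and a question, ids are unique, and questions are non-empty.
// Assumption: a missing or blank id is also reported as an error.
function validateEvals(evals) {
  if (!Array.isArray(evals)) {
    return ["evals must be an array"];
  }

  const errors = [];
  const seen = new Set();

  evals.forEach((entry, i) => {
    const id = typeof entry?.id === "string" ? entry.id.trim() : "";
    const question = typeof entry?.question === "string" ? entry.question.trim() : "";

    if (!id) {
      errors.push(`evals[${i}]: id is required`);
    } else if (seen.has(id)) {
      errors.push(`evals[${i}]: duplicate id "${id}"`);
    } else {
      seen.add(id);
    }

    if (!question) {
      errors.push(`evals[${i}]: question must be non-empty`);
    }
  });

  return errors;
}

// Usage with made-up entries: the second reuses an id, the third has a blank question.
const problems = validateEvals([
  { id: "cites-sources", question: "Does the output cite at least one source?" },
  { id: "cites-sources", question: "Is the summary under 200 words?" },
  { id: "tone", question: "   " },
]);
console.log(problems);
```

Returning a list of messages instead of throwing on the first problem lets every invalid entry be reported in one pass.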