Technical Intelligence Brief — PASS/PARTIAL

1. Technical Intelligence Brief

Candidates: 130 (raw scanned)
Cited: 65 (deduped)
GitHub: 50 repos
Reddit: 50 threads
Papers: 30 (arXiv)

2. Executive Technical Signal

1. The coding-agent stack is shifting from "chat autocomplete" to CLI/runtime tools with a harness. Why: Fabbi needs to measure pass/fail per task, not impressions. Evidence: S01-S20 = 20 repo signals; S56-S65 = 10 product/benchmark anchors. Action: NEXA builds a 20-task internal SWE-bench mini.

2. The Reddit social pulse shows demand for practical workflows over model hype. Why: adoption depends on latency, sandboxing, cost, and the review loop. Evidence: 50 threads scanned, each with score/comment counts; 20 are cited (S21-S40). Action: SYNCA logs reviewer interventions.

3. The benchmark gap remains large. Why: SWE-bench and Terminal-Bench measure better than demos, but neither covers the Fabbi/Japan SI domain. Evidence: S58, S59. Action: FARE builds its own context-retrieval benchmark for Java/Spring, React, and legacy DBs.

4. Enterprise IDE agents are fragmented. Why: Cursor, Copilot, Claude Code, Codex, and Replit differ in runtime, policy, and context handling. Evidence: 7 product anchors (S56-S57, S60-S64). Action: AIOS adapter contract + audit schema.

5. Repo momentum leans toward OSS terminal agents, but maintainer risk needs to be scored. Why: OSS moves fast but governance is weak. Evidence: 50 GitHub repos with stars/forks/last-update captured. Action: score license/security/CI before any pilot.
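The pre-pilot scoring in signal 5 could be sketched roughly as below. This is a minimal illustration, not the brief's scoring method: the `RepoSignals` fields, the license allowlist, the weights, and the 30-day freshness window are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlist; the real one would come from Fabbi's legal policy.
PERMISSIVE_LICENSES = frozenset({"MIT", "Apache-2.0", "BSD-3-Clause"})


@dataclass(frozen=True)
class RepoSignals:
    """Inputs to the score. All field names are assumptions for this sketch."""
    name: str
    license_spdx: Optional[str]  # SPDX id, or None if no license was detected
    has_ci: bool                 # a CI workflow exists in the repo
    has_security_policy: bool    # e.g. a SECURITY.md file is present
    days_since_update: int       # days since the last recorded update


def pilot_score(repo: RepoSignals) -> int:
    """Return a 0-100 readiness score using illustrative weights."""
    score = 0
    if repo.license_spdx in PERMISSIVE_LICENSES:
        score += 40  # license is the heaviest gate in this sketch
    if repo.has_ci:
        score += 25
    if repo.has_security_policy:
        score += 25
    if repo.days_since_update <= 30:
        score += 10  # recently maintained
    return score
```

A repo with a permissive license, CI, no security policy, and an update yesterday would score 75 under these weights; the pilot threshold itself is a policy decision the sketch leaves open.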

3. Trend Clusters

Agent Harness & Evaluation
Summary: measure at both task level and terminal level. Why now: 2 benchmark anchors (S58, S59). Impact: NEXA/SYNCA. Action: 20-task benchmark pack. Confidence: 78%.

Coding Agent Runtime/CLI/IDE
Summary: CLI agents are becoming the developer runtime. Evidence: Claude Code, Codex, OpenCode, Copilot. Impact: NEXA/AIOS. Action: adapter abstraction. Confidence: 74%.
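One possible shape for the adapter abstraction is a small interface that every agent wrapper implements, so the same task can be run through each agent and compared. The sketch below is an assumption, not a vendor API: `AgentAdapter`, `TaskResult`, and `StubAdapter` are hypothetical names, and no real CLI is invoked.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Protocol


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one agent run on one task (hypothetical fields)."""
    task_id: str
    agent: str
    passed: bool
    wall_seconds: float


class AgentAdapter(Protocol):
    """Contract each agent wrapper would satisfy (an assumed design)."""
    name: str

    def run_task(self, task_id: str, prompt: str, repo_path: str) -> TaskResult:
        ...


class StubAdapter:
    """Stand-in for tests. A real adapter would drive the agent's own CLI
    or API here; that call is deliberately left out of this sketch."""

    def __init__(self, name: str, passes: bool) -> None:
        self.name = name
        self._passes = passes

    def run_task(self, task_id: str, prompt: str, repo_path: str) -> TaskResult:
        return TaskResult(task_id, self.name, self._passes, 0.0)


def compare_same_task(
    adapters: Iterable[AgentAdapter], task_id: str, prompt: str, repo_path: str
) -> Dict[str, TaskResult]:
    """Run one task through every adapter and key the results by agent name."""
    return {a.name: a.run_task(task_id, prompt, repo_path) for a in adapters}
```

Keeping the contract this narrow means the harness only depends on `run_task`, so swapping or adding an agent does not touch the benchmark code.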

Context Engineering
Summary: retrieval and codebase mapping are the bottleneck. Evidence: Sourcegraph Cody + repo signals. Impact: FARE. Action: repo graph + symbol index. Confidence: 70%.

Workflow Governance/HITL
Summary: review gates determine enterprise trust. Evidence: Reddit skepticism + product docs. Impact: SYNCA/AIOS. Action: policy logs. Confidence: 68%.

Security/Sandbox Readiness
Summary: terminal agents need a sandbox and a secrets policy. Evidence: CLI/product anchors. Impact: Japan/Global enterprise. Action: denylist + ephemeral env. Confidence: 66%.
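The "denylist + ephemeral env" action could be prototyped along these lines. This is a rough sketch under stated assumptions: the denylist entries are hypothetical, only the executable name is checked, and a temporary directory with a stripped environment is not a real sandbox (a pilot would more likely use containers or VMs).

```python
import os
import subprocess
import sys
import tempfile
from typing import Sequence

# Hypothetical denylist; a real policy would be owned by the security team.
DENYLIST = frozenset({"rm", "sudo", "curl", "ssh", "scp"})


def is_allowed(argv: Sequence[str]) -> bool:
    """True when the executable (by basename) is not on the denylist."""
    return bool(argv) and os.path.basename(argv[0]) not in DENYLIST


def run_ephemeral(
    argv: Sequence[str], timeout_s: float = 30.0
) -> subprocess.CompletedProcess:
    """Run a command in a throwaway directory with a minimal environment."""
    if not is_allowed(argv):
        raise PermissionError(f"blocked by denylist: {argv[0] if argv else ''}")
    with tempfile.TemporaryDirectory() as workdir:
        # Only PATH is passed through, so tokens and keys in the parent
        # environment are not visible to the child process.
        clean_env = {"PATH": os.environ.get("PATH", "")}
        return subprocess.run(
            list(argv),
            cwd=workdir,
            env=clean_env,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )


if __name__ == "__main__":
    result = run_ephemeral([sys.executable, "-c", "print('hello from sandbox')"])
    print(result.stdout.strip())
```

A first-token check is easy to bypass (for example through a shell wrapper), so it is better read as a logging and policy hook than as the security boundary itself.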

4. Must-read Sources

Type	Link	Priority	Why read	Takeaway	Follow-up
Product	Claude Code	P0	standardizes the CLI workflow	terminal agents need a policy and a harness	pilot on 5 repos
Product	OpenAI Codex	P0	coding-agent API/product direction	execution + review loop	compare with Claude Code
Benchmark	SWE-bench	P0	task benchmark	framework for internal measurement	map 20 Fabbi tasks
Benchmark	Terminal-Bench	P1	terminal agent eval	measures runtime capability	add CLI tasks
Product	Sourcegraph Cody	P1	codebase context	FARE pattern	benchmark retrieval

5. Fabbi Impact Map

Trend	Evidence	Impact	Move	Owner	Urgency
Harness eval	2 benchmarks + 30 papers	NEXA/SYNCA quality gate	trial	AI Eng Lead	0-2w
Context engineering	50 repos + Cody anchor	FARE codebase understanding	adopt	Solution Architect	0-2w
Agent governance	50 Reddit + product docs	AIOS/SYNCA audit	trial	Platform Lead	1-2m
Japan enterprise adoption	security/sandbox constraints (no metrics available)	presales credibility	monitor	Presales Lead	1-2m
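The audit move in the table above needs a concrete event record before it can be trialed. The sketch below shows one way a 12-field event and the reviewer override rate could be modeled. Every field name is a hypothetical placeholder, since the brief fixes the number of fields but not their names.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class AgentAuditEvent:
    """One agent action. All 12 field names are assumptions for illustration."""
    event_id: str
    timestamp: str                    # ISO 8601, UTC
    agent: str                        # which agent produced the action
    task_id: str
    repo: str
    action: str                       # e.g. "edit", "run_command", "open_pr"
    command: Optional[str]            # shell command, if any
    files_touched: List[str]
    exit_code: Optional[int]
    reviewer: Optional[str]           # None until a human reviews the event
    reviewer_decision: Optional[str]  # "approve", "override", or None
    cost_usd: float


def to_json_line(event: AgentAuditEvent) -> str:
    """Serialize an event as one JSON line for an append-only log."""
    return json.dumps(asdict(event), sort_keys=True)


def reviewer_override_rate(events: Iterable[AgentAuditEvent]) -> float:
    """Share of reviewed events where the reviewer overrode the agent."""
    reviewed = [e for e in events if e.reviewer_decision is not None]
    if not reviewed:
        return 0.0
    overrides = sum(1 for e in reviewed if e.reviewer_decision == "override")
    return overrides / len(reviewed)
```

Unreviewed events are excluded from the denominator here; whether they should instead count as implicit approvals is a design choice the team would need to settle.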

6. Action Plan

DO THIS WEEK

1) NEXA internal SWE-bench mini: 20 tasks, ROI 15-25%, risk 2/5, owner AI Eng Lead, TTV 7 days, validate pass@1/pass@3.
2) FARE repo-context eval: 5 repos × 10 queries, ROI 10-18%, risk 2/5, owner Solution Architect, TTV 5 days, validate retrieval hit-rate.
3) SYNCA agent audit schema: 12 event fields, ROI 8-15%, risk 3/5, owner Platform Lead, TTV 10 days, validate reviewer override rate.
4) AIOS adapter PoC: 3 adapters (Claude Code, Codex, OpenCode), ROI 12-20%, risk 3/5, owner DevTools Lead, TTV 14 days, validate same-task comparison.
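For the pass@1/pass@3 validation in item 1, one common choice is the unbiased pass@k estimator introduced with the HumanEval benchmark: with n attempts per task of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. The brief does not say which estimator NEXA will use, so the sketch below is an assumption, and the function names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts, c of them passing."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k attempts failed,
    # so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_task: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, each given as (n_attempts, n_passed)."""
    scores = [pass_at_k(n, c, k) for n, c in per_task]
    if not scores:
        raise ValueError("no tasks given")
    return sum(scores) / len(scores)
```

For example, a task with 5 attempts and 2 passes gives pass@1 = 1 - 3/5 = 0.4 and pass@3 = 1 - 1/10 = 0.9. Running at least 3 attempts per task is required for pass@3 to be defined.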

WATCH 2-4 WEEKS: Cursor/Copilot enterprise controls; Terminal-Bench adoption; OSS maintainer risk.

IGNORE/LOW SIGNAL: consumer chatbot UI hype; fundraising-only; viral demos without task metrics.

7. CTO Evaluation Matrix

Signal	Thesis	Counter	Decision	Next validation
Harness-first agents	ROI is measurable	benchmarks are off-domain	trial 78%	20 internal tasks
CLI runtime	fit dev workflow	security/secrets risk	trial 74%	sandbox PoC
Context layer	FARE moat	index freshness	adopt 70%	hit-rate benchmark
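The hit-rate benchmark named as the next validation for the context layer could be computed as hit@k: the share of queries where at least one relevant file appears in the top k retrieved results. The sketch below assumes file paths as the unit of retrieval and a hand-labeled gold set per query; both are assumptions, as the brief does not define the metric further.

```python
from typing import Dict, List, Set


def hit_rate_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries with at least one relevant path in the top k.

    retrieved maps query -> ranked list of paths returned by the retriever.
    relevant maps query -> set of paths a reviewer labeled as correct.
    A query missing from retrieved counts as a miss.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not relevant:
        raise ValueError("no labeled queries given")
    hits = 0
    for query, gold in relevant.items():
        top_k = retrieved.get(query, [])[:k]
        if any(path in gold for path in top_k):
            hits += 1
    return hits / len(relevant)
```

With 5 repos × 10 queries this yields one number per repo or one overall. Tracking it per index refresh would also surface the index-freshness counterpoint listed in the matrix.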

8. Detailed Source Appendix

ID	Type	Source	Metric	Signal
S01	GitHub	RBVI/ChimeraX	249 stars, 41 forks, updated 2026-05-27	repo momentum/coding-agent infra
S02	GitHub	hogan-tech/leetcode-solution	540 stars, 48 forks, updated 2026-05-27	repo momentum/coding-agent infra
S03	GitHub	promptdriven/pdd	711 stars, 63 forks, updated 2026-05-27	repo momentum/coding-agent infra
S04	GitHub	tim-hardcastle/pipefish	185 stars, 5 forks, updated 2026-05-26	repo momentum/coding-agent infra
S05	GitHub	reviewdog/reviewdog	9324 stars, 488 forks, updated 2026-05-26	repo momentum/coding-agent infra
S06	GitHub	red/red	5999 stars, 415 forks, updated 2026-05-26	repo momentum/coding-agent infra
S07	GitHub	ParanoidUser/codewars-handbook	115 stars, 17 forks, updated 2026-05-26	repo momentum/coding-agent infra
S08	GitHub	oraios/serena	24649 stars, 1651 forks, updated 2026-05-27	repo momentum/coding-agent infra
S09	GitHub	facebook/ktfmt	1255 stars, 104 forks, updated 2026-05-26	repo momentum/coding-agent infra
S10	GitHub	pulumi/pulumi	25233 stars, 1377 forks, updated 2026-05-26	repo momentum/coding-agent infra
S11	GitHub	TripleView/SummerBoot	148 stars, 39 forks, updated 2026-05-26	repo momentum/coding-agent infra
S12	GitHub	AvinZarlez/processing-vscode	180 stars, 25 forks, updated 2026-05-26	repo momentum/coding-agent infra
S13	GitHub	editor-code-assistant/eca	832 stars, 61 forks, updated 2026-05-26	repo momentum/coding-agent infra
S14	GitHub	cpeditor/cpeditor	2151 stars, 151 forks, updated 2026-05-26	repo momentum/coding-agent infra
S15	GitHub	xyproto/orbiton	674 stars, 17 forks, updated 2026-05-26	repo momentum/coding-agent infra
S16	GitHub	node-red/node-red	23182 stars, 3845 forks, updated 2026-05-26	repo momentum/coding-agent infra
S17	GitHub	datawhalechina/easy-vibe	14802 stars, 1407 forks, updated 2026-05-26	repo momentum/coding-agent infra
S18	GitHub	zufuliu/notepad4	4674 stars, 298 forks, updated 2026-05-26	repo momentum/coding-agent infra
S19	GitHub	unhappychoice/gittype	1125 stars, 29 forks, updated 2026-05-26	repo momentum/coding-agent infra
S20	GitHub	processing/processing4	409 stars, 165 forks, updated 2026-05-26	repo momentum/coding-agent infra
S21	Reddit	algotrading	score 1, comments 0	EA i made with claude code and
S22	Reddit	JulesAgent	score 4, comments 0	Hi, from Jules!
S23	Reddit	MercuryInstall	score 1, comments 0	Mercury Agent use case: Mercury Agent v1.1.9 without the automate-everything trap
S24	Reddit	tmux	score 1, comments 0	ccm: a tmux plugin for parallel Claude Code sessions
S25	Reddit	vibecodeapp	score 1, comments 0	A new large language model coding agent, try it for free
S26	Reddit	ClaudeCode	score 2, comments 0	Claude surpassed by Codex?
S27	Reddit	turnitin_uk	score 1, comments 0	Professional Coding Assignment, dissertation, project help and code humanizing
S28	Reddit	Agent_AI	score 1, comments 0	A new large language model coding agent, try it for free
S29	Reddit	BiomedicalDataScience	score 1, comments 0	Building a Zero-Footprint DICOM Viewer in Vanilla JS & Testing Real-Time Audio Sentiment Analysis
S30	Reddit	ClaudeWorkflows	score 1, comments 0	[Workflow] Structured Claude Workflow for Continuous Development, Quality Control, and Context Management
S31	Reddit	Bard	score 1, comments 0	`gcp-ironclad`— automated GCP API-key audit + safe spend hardening, run from Claude Code (built after a reddit user posted - $80K
S32	Reddit	AI_Agents	score 1, comments 1	How do you prevent runaway costs from your coding agents, and how do ensure some safety guardrails
S33	Reddit	FastAPI	score 1, comments 0	Built a /advisor command for Claude Code — Opus directs parallel Sonnet runners that actually read your files
S34	Reddit	ai_talk_monitor_dev	score 1, comments 2	AI Digest - 5/27/2026
S35	Reddit	ai_talk_monitor_dev	score 1, comments 2	AI Digest - 5/27/2026
S36	Reddit	ExamRanch	score 1, comments 0	No More Traditional Software Jobs? Only Agentic AI Developers in 2026 and Beyond
S37	Reddit	ClaudeWorkflows	score 1, comments 0	[Workflow] RepoScry: Pre-process Codebase Context for Efficient Claude Code Interactions on Large Repositories
S38	Reddit	ClaudeWorkflows	score 1, comments 0	[Workflow] Claude Code `/advisor` Slash Command: Multi-Agent Opus/Sonnet for Automated Code Review and Bug Detection
S39	Reddit	ClaudeWorkflows	score 1, comments 0	[Workflow] CLAUDE.md Operating Contract to Prevent LLM Drift and Improve Action-Rate in Long Coding Sessions
S40	Reddit	InteligenciArtificial	score 2, comments 0	Qué es lo más interesante que han creado con Claude Code?💀 los leo
S41	Paper/arXiv	MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research	published 2026-05-25	benchmark/eval/method signal
S42	Paper/arXiv	From Model Scaling to System Scaling: Scaling the Harness in Agentic AI	published 2026-05-25	benchmark/eval/method signal
S43	Paper/arXiv	Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning	published 2026-05-25	benchmark/eval/method signal
S44	Paper/arXiv	Reinforcing Few-step Generators via Reward-Tilted Distribution Matching	published 2026-05-25	benchmark/eval/method signal
S45	Paper/arXiv	Looped Diffusion Language Models	published 2026-05-25	benchmark/eval/method signal
S46	Paper/arXiv	InstructSAM: Segment Any Instance with Any Instructions	published 2026-05-25	benchmark/eval/method signal
S47	Paper/arXiv	Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models	published 2026-05-25	benchmark/eval/method signal
S48	Paper/arXiv	Pixel-Level Pavement Distress Assessment Using Instance Segmentation	published 2026-05-25	benchmark/eval/method signal
S49	Paper/arXiv	OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization	published 2026-05-25	benchmark/eval/method signal
S50	Paper/arXiv	DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking	published 2026-05-25	benchmark/eval/method signal
S51	Paper/arXiv	Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World	published 2026-05-25	benchmark/eval/method signal
S52	Paper/arXiv	Quantitative Einstein relation for reversible diffusions in a random environment	published 2026-05-25	benchmark/eval/method signal
S53	Paper/arXiv	VeriTrace: Evolving Mental Models for Deep Research Agents	published 2026-05-25	benchmark/eval/method signal
S54	Paper/arXiv	Automated Benchmark Auditing for AI Agents and Large Language Models	published 2026-05-25	benchmark/eval/method signal
S55	Paper/arXiv	StakeBench: Evaluating Language Understanding Grounded in Market Commitment	published 2026-05-25	benchmark/eval/method signal
S56	Product	Anthropic Claude Code docs	N/A docs	CLI agent workflow
S57	Product	OpenAI Codex docs	N/A docs	coding agent product
S58	Benchmark	SWE-bench	N/A benchmark	software engineering benchmark
S59	Benchmark	Terminal-Bench	N/A benchmark	terminal agent benchmark
S60	Product	Cursor	N/A product	AI IDE
S61	Product	GitHub Copilot coding agent	N/A docs	enterprise SDLC adoption
S62	Product	Sourcegraph Cody	N/A product	codebase context
S63	Product	JetBrains AI	N/A product	IDE AI
S64	Product	Replit Agent	N/A product	agentic app building
S65	OSS	OpenCode	N/A repo	terminal coding agent

Data Quality / Scan Health Appendix

Status: QUALITY_GATE_PARTIAL. Raw candidates: 130 (GitHub 50, Reddit 50, arXiv 30, HN API query 0). Social-first scan attempted: Reddit passed; public X, YouTube, and Facebook data were unavailable in this runtime (no dedicated API), so the scan was supplemented with product/benchmark seeds. Links preserved. Confidence reduced by 8-12 pp for missing X/YouTube engagement metrics.