Agent Integrity and Trap Defense — Product Spec
Synced from github.com/CoWork-OS/CoWork-OS/docs
This document turns the "AI Agent Traps" threat model into a concrete CoWork OS product and engineering spec. It is intended to guide future development across ingestion, memory, permissions, delegation, and operator UX.
1. Problem Statement
CoWork OS is increasingly capable in the exact areas the report highlights as high risk:
- external web ingestion via web_fetch, browser tools, and scraping
- imported documents, email, and connector content
- persistent memory and knowledge synthesis
- auto-promotion of repeated patterns into playbooks and skill proposals
- multi-agent delegation and remote orchestration
- human approval flows for high-impact actions
Today, CoWork OS already has meaningful defenses:
- prompt-injection hardening and non-blocking detection in docs/security/security-model.md and src/electron/agent/security/input-sanitizer.ts
- output monitoring in src/electron/agent/security/output-filter.ts
- memory sanitization in src/electron/memory/MemoryService.ts and src/electron/memory/MemorySynthesizer.ts
- layered permissions in docs/permission-system.md
- per-app computer-use risk tiers in docs/computer-use.md and src/electron/computer-use/app-risk-profile.ts
The core gap is that the current model is still mostly:
- transparent
- regex and heuristic driven
- non-blocking by default
- localized to single inputs or outputs
The report’s threat model is broader. It includes hidden content, semantic biasing, poisoned memory, malicious sub-agent spawning, correlated multi-agent failures, and approval fatigue. CoWork OS needs a productized "agent integrity" layer that persists trust state across the full runtime, not only at a single parsing step.
2. Goals
- Prevent untrusted external content from silently driving high-impact actions.
- Distinguish between observed content, verified facts, and promotable durable knowledge.
- Propagate trust and suspicion signals across memory, delegation, approvals, and automation.
- Give operators clear provenance and risk explanations without drowning them in security noise.
- Create a repeatable eval and red-team harness for agent-trap scenarios.
3. Non-Goals
- Solve all jailbreak and prompt-injection classes at the model level.
- Guarantee perfect classification of all hostile content.
- Replace existing permission, guardrail, or sandbox systems.
- Block all automation by default.
- Introduce remote trust services as a hard dependency for local-first usage.
4. Scope
This spec applies to:
- web fetch and browser content
- scraping and persistent scrape sessions
- email and connector-ingested text
- imported documents and extracted OCR text
- workspace memory, summaries, and knowledge graph inputs
- playbook reinforcement and skill proposal generation
- child-task delegation, agent teams, and ACP/A2A remote delegation
- approval UX for destructive, external, financial, or sensitive actions
This spec does not initially cover:
- provider-side model fine-tuning defenses
- code execution sandbox internals beyond policy integration
- malware scanning for arbitrary binaries
5. Threat Model Mapped to CoWork OS
| Trap class from report | CoWork OS exposure | Primary failure mode |
|---|---|---|
| Content injection | web_fetch, browser_get_content, scraping, documents, OCR, email | Hidden or machine-only instructions enter context |
| Semantic manipulation | research, summarization, drafting, ranking, triage | Agent adopts attacker framing or false confidence |
| Cognitive state poisoning | memory capture, daily summaries, KG, playbooks, skills | Poisoned facts persist and later drive actions |
| Behavioral control | tool calls, shell, browser, computer use, spend | External content causes unauthorized side effects |
| Systemic traps | agent teams, collaborative mode, remote delegation, automations | Correlated agent failure, cascades, quota exhaustion |
| Human-in-the-loop traps | approvals, summaries, review dialogs | Operator approves risky action because risk is obscured |
6. Product Principles
- Trust is stateful. Trust must survive beyond the current turn. It should attach to content, claims, tasks, approvals, memory entries, and delegated runs.
- Provenance before autonomy. The system should prefer "show why this is believable" over "assume this is fine."
- Suspicion must degrade capabilities. Suspicious content should not only be annotated. It should narrow which downstream actions are allowed.
- Verification is tiered. Some workflows need one corroborating source. Others need two independent sources or explicit operator approval.
- Operator trust requires good UX. Risk signals need to be visible in task details and approval prompts, not buried in logs.
7. Proposed Product: Agent Integrity Layer
Introduce a new cross-cutting subsystem: Agent Integrity.
It adds:
- trust classification for external content
- provenance chains for claims and actions
- taint propagation into memory, delegation, and approvals
- verification gates for knowledge promotion
- dedicated operator visibility and review surfaces
- repeatable eval coverage
7.1 Core Concepts
Content Integrity Record
Represents a fetched or imported content item.
Minimum fields:
- id
- workspaceId
- sourceType (web_fetch, browser, scrape, email, document, connector, ocr)
- sourceLocator (URL, file path, message id, connector object id)
- domain or origin
- fetchMode (default, browser, scrape_default, scrape_stealth, scrape_playwright, local_file, etc.)
- capturedAt
- contentHash
- integrityVerdict (trusted, caution, suspicious, blocked)
- riskSignals[]
- humanVisibleExcerpt
- machineParsedExcerpt
- renderMismatchScore
- hiddenInstructionScore
- verificationStatus
Claim Provenance
Represents a claim later used in memory, summaries, or actions.
Minimum fields:
- claimId
- normalizedClaim
- originContentIds[]
- supportingContentIds[]
- contradictingContentIds[]
- verificationLevel (unverified, single_source, multi_source, operator_confirmed)
- promotionStatus (ephemeral, candidate, durable)
Task Integrity State
Represents the accumulated trust posture of a task.
Minimum fields:
- taskId
- highestObservedRisk
- suspiciousContentCount
- blockedActionCount
- taintedTools[]
- verificationRequirements
- delegationRestrictions
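The three records above could be expressed as TypeScript shapes along the following lines. This is a sketch, not the project's actual schema: the field names come from the lists above, but every field type, the severity ordering, and the observeContent helper are assumptions introduced for illustration.

```typescript
// Hypothetical shapes for the three core records.
// Field names follow the spec; all field types are assumptions.
type IntegrityVerdict = "trusted" | "caution" | "suspicious" | "blocked";
type SourceType =
  | "web_fetch" | "browser" | "scrape" | "email"
  | "document" | "connector" | "ocr";
type VerificationLevel =
  | "unverified" | "single_source" | "multi_source" | "operator_confirmed";
type PromotionStatus = "ephemeral" | "candidate" | "durable";

interface ContentIntegrityRecord {
  id: string;
  workspaceId: string;
  sourceType: SourceType;
  sourceLocator: string; // URL, file path, message id, or connector object id
  domain?: string; // domain or origin, when one exists
  fetchMode: string; // e.g. "default", "scrape_stealth", "local_file"
  capturedAt: string; // assumed ISO 8601 timestamp
  contentHash: string;
  integrityVerdict: IntegrityVerdict;
  riskSignals: string[];
  humanVisibleExcerpt: string;
  machineParsedExcerpt: string;
  renderMismatchScore: number; // assumed range 0..1
  hiddenInstructionScore: number; // assumed range 0..1
  verificationStatus: string; // value set is not defined by the spec
}

interface ClaimProvenance {
  claimId: string;
  normalizedClaim: string;
  originContentIds: string[];
  supportingContentIds: string[];
  contradictingContentIds: string[];
  verificationLevel: VerificationLevel;
  promotionStatus: PromotionStatus;
}

interface TaskIntegrityState {
  taskId: string;
  highestObservedRisk: IntegrityVerdict;
  suspiciousContentCount: number;
  blockedActionCount: number;
  taintedTools: string[];
  verificationRequirements: string[]; // assumed: list of policy names
  delegationRestrictions: string[]; // assumed: list of policy names
}

// Assumed severity ordering, used to keep the worst verdict a task has seen.
const VERDICT_SEVERITY: Record<IntegrityVerdict, number> = {
  trusted: 0, caution: 1, suspicious: 2, blocked: 3,
};

// Fold one newly ingested record into the task's accumulated posture.
function observeContent(
  state: TaskIntegrityState,
  record: Pick<ContentIntegrityRecord, "integrityVerdict">,
): TaskIntegrityState {
  const verdict = record.integrityVerdict;
  const isWorse =
    VERDICT_SEVERITY[verdict] > VERDICT_SEVERITY[state.highestObservedRisk];
  const isSuspicious = verdict === "suspicious" || verdict === "blocked";
  return {
    ...state,
    highestObservedRisk: isWorse ? verdict : state.highestObservedRisk,
    suspiciousContentCount: state.suspiciousContentCount + (isSuspicious ? 1 : 0),
  };
}
```

Because the state only ever moves toward a worse verdict, a later trusted fetch does not clear an earlier suspicious one, which matches the "trust is stateful" principle.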
8. Functional Requirements
8.1 Ingestion Risk Classification
Every external content ingestion path must produce an integrity verdict before its text is injected into the main planning loop.
Requirements
- Add a unified integrity classifier for:
  - web_fetch
  - browser_get_content
  - scrape tools
  - imported docs/PDF/OCR text
  - email thread content
  - connector-returned free text
- Detect and score at minimum:
  - hidden HTML comments and metadata instructions
  - CSS-hidden or off-screen content when available
  - large machine-visible / human-visible divergence
  - suspicious instruction markers
  - encoded prompt-like payloads
  - suspicious authority framing and action-oriented override language
  - content from stealth/bypass fetch modes
- Persist results as ContentIntegrityRecord, not only inline annotations.
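A minimal heuristic sketch of a few of the detections listed above (hidden HTML comments, CSS-hidden content, instruction markers, stealth fetch mode). The regular expressions, the signal names, and the rule that maps signals to a verdict are all assumptions chosen for illustration. A production classifier would need far broader coverage, and this sketch never emits blocked.

```typescript
type IntegrityVerdict = "trusted" | "caution" | "suspicious" | "blocked";

interface Classification {
  verdict: IntegrityVerdict;
  riskSignals: string[];
}

// Illustrative patterns only (assumptions); real coverage must be much broader.
const HTML_COMMENT = /<!--[\s\S]*?-->/g;
const CSS_HIDDEN =
  /style\s*=\s*["'][^"']*(display\s*:\s*none|visibility\s*:\s*hidden)/i;
const INSTRUCTION_MARKER =
  /ignore (all )?(previous|prior) instructions|disregard the above/i;

function classifyContent(rawHtml: string, fetchMode: string): Classification {
  const riskSignals: string[] = [];

  // Instructions inside HTML comments are machine-visible but hidden from a human reader.
  const comments: string[] = rawHtml.match(HTML_COMMENT) || [];
  if (comments.some((c) => INSTRUCTION_MARKER.test(c))) {
    riskSignals.push("hidden_comment_instruction");
  }

  // Inline styles that hide an element from rendering.
  if (CSS_HIDDEN.test(rawHtml)) {
    riskSignals.push("css_hidden_content");
  }

  // Instruction-like phrasing anywhere outside HTML comments.
  if (INSTRUCTION_MARKER.test(rawHtml.replace(HTML_COMMENT, ""))) {
    riskSignals.push("instruction_marker");
  }

  // Content fetched through a stealth mode is treated as a signal on its own.
  if (fetchMode === "scrape_stealth") {
    riskSignals.push("stealth_fetch_mode");
  }

  // Assumed mapping: any instruction signal is suspicious, any other signal is caution.
  const hasInstructionSignal = riskSignals.some(
    (s) => s === "hidden_comment_instruction" || s === "instruction_marker",
  );
  const verdict: IntegrityVerdict = hasInstructionSignal
    ? "suspicious"
    : riskSignals.length > 0
      ? "caution"
      : "trusted";
  return { verdict, riskSignals };
}
```

The result carries the raw signals alongside the verdict so that it can be persisted as a record rather than reduced to an inline annotation.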
Product behavior
- trusted: normal use
- caution: allow read/summarize, but raise provenance requirements for action
- suspicious: allow limited inspection, but no direct action planning from this content without corroboration
- blocked: do not inject into planning context; show as quarantined
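The four behaviors above can be read as a lookup table from verdict to capabilities. The sketch below is one possible encoding: the VerdictBehavior shape, the Evidence flags, and the interpretation of "raise provenance requirements" as "provenance must be present" are assumptions, not spec text.

```typescript
type IntegrityVerdict = "trusted" | "caution" | "suspicious" | "blocked";

// Hypothetical capability shape derived from the product behavior list.
interface VerdictBehavior {
  injectIntoPlanning: boolean;
  inspection: "full" | "limited" | "quarantined";
  actionRequirement: "none" | "provenance" | "corroboration" | "never";
}

const BEHAVIOR_BY_VERDICT: Record<IntegrityVerdict, VerdictBehavior> = {
  trusted: { injectIntoPlanning: true, inspection: "full", actionRequirement: "none" },
  caution: { injectIntoPlanning: true, inspection: "full", actionRequirement: "provenance" },
  suspicious: { injectIntoPlanning: true, inspection: "limited", actionRequirement: "corroboration" },
  blocked: { injectIntoPlanning: false, inspection: "quarantined", actionRequirement: "never" },
};

// Assumed evidence flags supplied by whatever tracks provenance for the task.
interface Evidence {
  hasProvenance: boolean;
  corroborated: boolean;
}

// Decide whether content with this verdict may directly drive action planning.
function canPlanActionFrom(verdict: IntegrityVerdict, evidence: Evidence): boolean {
  switch (BEHAVIOR_BY_VERDICT[verdict].actionRequirement) {
    case "none":
      return true;
    case "provenance":
      return evidence.hasProvenance;
    case "corroboration":
      return evidence.corroborated;
    case "never":
      return false;
  }
}
```

Keeping the mapping in a single table makes the degradation of capabilities auditable in one place instead of scattering verdict checks across tools.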
Implementation hooks
- extend src/electron/agent/security/input-sanitizer.ts
- extend src/electron/agent/security/output-filter.ts
- integrate at src/electron/agent/tools/web-fetch-tools.ts
- integrate at src/electron/agent/tools/browser-tools.ts
- integrate at src/electron/agent/tools/scraping-tools.ts
8.2 Trusted vs Untrusted Memory Lanes
Memory must stop behaving as a flat durable store.
Requirements
- Split memory and memory-like artifacts into three lanes:
  - observed: raw captured facts from external or uncertain sources
  - verified: corroborated or operator-confirmed facts
  - derived: internal lessons, decisions, and patterns created by the agent
- observed entries may be searchable, but must not be injected into prompt recall as durable truth unless policy explicitly allows it.
- verified entries can participate fully in MemorySynthesizer.
- derived entries may be promotable, but only if their source claims are not tainted.
- Daily summaries and auto-digests must record source integrity levels.
Product behavior
- Users can still inspect untrusted memories.
- The runtime should prefer verified memories when constructing context.
- A suspicious source cannot silently become an evergreen memory entry.
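One way to picture the lane rules at recall time is a filter-and-rank step. The sketch below assumes a simplified MemoryEntry shape, a RecallPolicy flag, and a lane ranking; none of these names are taken from MemoryService or MemorySynthesizer.

```typescript
type TrustLane = "observed" | "verified" | "derived";

// Simplified, hypothetical memory entry; the real record has more fields.
interface MemoryEntry {
  id: string;
  lane: TrustLane;
  text: string;
  sourceTainted: boolean; // true if any source claim came from tainted content
}

interface RecallPolicy {
  allowObservedInRecall: boolean; // assumed policy switch
}

// Assumed preference order when building prompt context.
const LANE_RANK: Record<TrustLane, number> = { verified: 0, derived: 1, observed: 2 };

// Select entries for prompt recall: observed entries are excluded unless policy
// allows them, and verified entries are placed first.
function selectForRecall(entries: MemoryEntry[], policy: RecallPolicy): MemoryEntry[] {
  const eligible = entries.filter(
    (e) => e.lane !== "observed" || policy.allowObservedInRecall,
  );
  return eligible.slice().sort((a, b) => LANE_RANK[a.lane] - LANE_RANK[b.lane]);
}

// Derived entries are promotable only when none of their source claims are tainted.
function isPromotable(entry: MemoryEntry): boolean {
  return entry.lane === "derived" && !entry.sourceTainted;
}
```

Observed entries stay in the store and remain searchable. They are only kept out of the recall path, which is what lets users keep inspecting untrusted memories.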
Implementation hooks
- src/electron/memory/MemoryService.ts
- src/electron/memory/MemorySynthesizer.ts
- src/electron/memory/LayeredMemoryIndexService.ts
- src/electron/memory/DailyLogSummarizer.ts
8.3 Knowledge Promotion Gates
Promotion into the KG, playbooks, and skill proposals needs stronger gating than "this worked a few times."
Requirements
- Do not promote claims from suspicious or blocked content into:
  - knowledge graph
  - playbooks
  - skill proposals
  - adaptive persona/profile learning
- Playbook reinforcement must carry source-integrity summaries.
- Skill proposals must surface risk provenance in the proposal UI.
- If the repeated pattern came from tainted or low-trust inputs, the proposal should remain blocked or require manual review.
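The gating rules above could reduce to a single decision function over the integrity verdicts of a candidate's sources. This is a sketch under assumptions: the PromotionTarget names, the three-way outcome, and the choice to send candidates with no recorded provenance to manual review are illustrative, not specified.

```typescript
type IntegrityVerdict = "trusted" | "caution" | "suspicious" | "blocked";

// Hypothetical names for the four promotion destinations listed above.
type PromotionTarget =
  | "knowledge_graph" | "playbook" | "skill_proposal" | "persona_learning";

interface PromotionDecision {
  outcome: "allow" | "manual_review" | "deny";
  reason: string; // surfaced in the proposal UI as risk provenance
}

function decidePromotion(
  target: PromotionTarget,
  sourceVerdicts: IntegrityVerdict[],
): PromotionDecision {
  // Claims backed by suspicious or blocked content are never promoted.
  const tainted = sourceVerdicts.filter((v) => v === "suspicious" || v === "blocked");
  if (tainted.length > 0) {
    return {
      outcome: "deny",
      reason: `${tainted.length} tainted source(s) behind ${target} promotion`,
    };
  }
  // Assumption: a candidate with no recorded provenance is not auto-promoted.
  if (sourceVerdicts.length === 0) {
    return { outcome: "manual_review", reason: "no recorded provenance" };
  }
  // Low-trust inputs keep the proposal in manual review.
  if (sourceVerdicts.some((v) => v === "caution")) {
    return { outcome: "manual_review", reason: "low-trust source present" };
  }
  return { outcome: "allow", reason: "all sources trusted" };
}
```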
Implementation hooks
- src/electron/memory/PlaybookSkillPromoter.ts
- related proposal services in src/electron/agent/skills/
- KG ingest services
8.4 Action Provenance and Verification Gates
High-impact actions should require provenance-aware policy decisions.
Requirements
- The permission engine must receive:
  - the task integrity state
  - whether the triggering evidence is trusted
  - whether action justification depends on single-source suspicious content
- Add policy primitives such as:
  - require_multi_source_for_external_action
  - require_operator_confirmation_for_tainted_spend
  - deny_remote_delegation_from_suspicious_context
  - deny_memory_promotion_from_tainted_content
- Approval dialogs must explain:
  - what content triggered this action
  - where it came from
  - its integrity verdict
  - whether corroboration exists
Product behavior
- A shell command triggered by a dubious scraped page should not look identical to a shell command triggered by a local repo task.
- Sensitive actions should be harder to approve when trust is low.
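To make the four policy primitives concrete, the sketch below evaluates them against a small action context and keeps the strictest result. The primitive names come from the list above. The ActionContext shape, the ActionKind values, the "two independent sources" threshold, and the definition of tainted as suspicious-or-blocked are assumptions.

```typescript
type IntegrityVerdict = "trusted" | "caution" | "suspicious" | "blocked";
type PolicyPrimitive =
  | "require_multi_source_for_external_action"
  | "require_operator_confirmation_for_tainted_spend"
  | "deny_remote_delegation_from_suspicious_context"
  | "deny_memory_promotion_from_tainted_content";
type ActionKind = "external_action" | "spend" | "remote_delegation" | "memory_promotion";
type Decision = "allow" | "require_approval" | "deny";

// Hypothetical view of what the permission engine receives for one action.
interface ActionContext {
  kind: ActionKind;
  highestObservedRisk: IntegrityVerdict;
  independentSourceCount: number;
  operatorConfirmed: boolean;
}

function isTainted(risk: IntegrityVerdict): boolean {
  return risk === "suspicious" || risk === "blocked";
}

// Evaluate a single primitive; primitives that do not apply return "allow".
function applyPrimitive(primitive: PolicyPrimitive, ctx: ActionContext): Decision {
  const tainted = isTainted(ctx.highestObservedRisk);
  switch (primitive) {
    case "require_multi_source_for_external_action":
      return ctx.kind === "external_action" && ctx.independentSourceCount < 2
        ? "require_approval" : "allow";
    case "require_operator_confirmation_for_tainted_spend":
      return ctx.kind === "spend" && tainted && !ctx.operatorConfirmed
        ? "require_approval" : "allow";
    case "deny_remote_delegation_from_suspicious_context":
      return ctx.kind === "remote_delegation" && tainted ? "deny" : "allow";
    case "deny_memory_promotion_from_tainted_content":
      return ctx.kind === "memory_promotion" && tainted ? "deny" : "allow";
  }
}

const STRICTNESS: Record<Decision, number> = { allow: 0, require_approval: 1, deny: 2 };

// The strictest decision across all enabled primitives wins.
function evaluatePolicies(policies: PolicyPrimitive[], ctx: ActionContext): Decision {
  return policies
    .map((p) => applyPrimitive(p, ctx))
    .reduce<Decision>((worst, d) => (STRICTNESS[d] > STRICTNESS[worst] ? d : worst), "allow");
}
```

Under this scheme the same shell or spend request yields different decisions depending on the task's integrity state, which is the distinction the product behavior above asks for.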
Implementation hooks
- docs/permission-system.md
- permission runtime and related security managers
- existing approval UI in renderer
8.5 Delegation and Multi-Agent Taint Propagation
Delegation is one of the highest-leverage risks from the report.
Requirements
- Child tasks inherit integrity posture from parent tasks.
- If a parent task is tainted:
  - remote ACP/A2A delegation is denied by default
  - only a reduced tool set is allowed for child tasks
  - synthesis cannot treat tainted child output as independently trustworthy unless the child re-verifies against trusted sources
- Team runs must support:
  - independent retrieval by multiple workers
  - source-diversity checks at synthesis
  - quorum rules for high-impact outputs
  - caps to prevent cascade loops or congestion storms
Product behavior
- Suspicious content can still be analyzed.
- It cannot fan out into a larger autonomous system without explicit policy and operator intent.
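Taint inheritance and a source-diversity check might look like the sketch below. The state shape is a trimmed, hypothetical version of the task integrity state. The reduced tool set is passed in by the caller, and the rule that tainted worker output counts toward diversity only after re-verification is one reading of the requirement above.

```typescript
type IntegrityVerdict = "trusted" | "caution" | "suspicious" | "blocked";

// Trimmed, hypothetical view of task integrity state for delegation decisions.
interface DelegationState {
  taskId: string;
  highestObservedRisk: IntegrityVerdict;
  allowedTools: string[];
  remoteDelegationAllowed: boolean;
}

function isTaintedState(state: DelegationState): boolean {
  return state.highestObservedRisk === "suspicious" || state.highestObservedRisk === "blocked";
}

// A child inherits the parent's risk. If the parent is tainted, the child's tools
// are narrowed to the reduced set and remote delegation is switched off.
function deriveChildState(
  parent: DelegationState,
  childTaskId: string,
  reducedTools: string[],
): DelegationState {
  const tainted = isTaintedState(parent);
  return {
    taskId: childTaskId,
    highestObservedRisk: parent.highestObservedRisk,
    allowedTools: tainted
      ? parent.allowedTools.filter((tool) => reducedTools.indexOf(tool) !== -1)
      : parent.allowedTools.slice(),
    remoteDelegationAllowed: parent.remoteDelegationAllowed && !tainted,
  };
}

// Hypothetical summary of one worker's contribution to a team synthesis.
interface WorkerOutput {
  workerId: string;
  sourceOrigins: string[];
  tainted: boolean;
  reverified: boolean; // child re-verified against trusted sources
}

// Count distinct origins, ignoring tainted outputs that were not re-verified.
function countIndependentOrigins(outputs: WorkerOutput[]): number {
  const origins = new Set<string>();
  for (const output of outputs) {
    if (output.tainted && !output.reverified) continue;
    for (const origin of output.sourceOrigins) origins.add(origin);
  }
  return origins.size;
}

function meetsSourceQuorum(outputs: WorkerOutput[], minDistinctOrigins: number): boolean {
  return countIndependentOrigins(outputs) >= minDistinctOrigins;
}
```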
Implementation hooks
- src/electron/agents/AgentTeamOrchestrator.ts
- orchestration graph runtime
- remote delegation / ACP / A2A handlers
8.6 Human-in-the-Loop Hardening
The approval surface itself is an attack target.
Requirements
- Approval UIs must show a concise integrity summary:
  - Source risk: trusted/caution/suspicious
  - Evidence: 1 source / 2 independent sources / operator confirmed
  - Why elevated: action derived from hidden-content or unverified external source
- Add approval friction controls:
  - rate limit repetitive approvals from the same suspicious task
  - aggregate low-signal prompts into one review step where safe
  - highlight unusually technical or low-explainability summaries
- Add a "show evidence" drawer with source excerpts and provenance chain.
Product behavior
- Reduce approval fatigue.
- Make it obvious when the system is asking for approval on the basis of dubious input.
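The integrity summary and the per-task rate limit could be sketched as follows. The three summary labels follow the requirement above. The input shape, the sliding-window limiter, and its parameters are assumptions, and the limiter only decides whether to show another prompt, leaving the handling of a suppressed prompt to policy.

```typescript
// Hypothetical input for rendering the integrity summary in an approval prompt.
interface ApprovalIntegrityInput {
  sourceRisk: "trusted" | "caution" | "suspicious";
  independentSourceCount: number;
  operatorConfirmed: boolean;
  whyElevated: string;
}

function evidenceLabel(input: ApprovalIntegrityInput): string {
  if (input.operatorConfirmed) return "operator confirmed";
  if (input.independentSourceCount === 0) return "no sources";
  if (input.independentSourceCount === 1) return "1 source";
  return `${input.independentSourceCount} independent sources`;
}

// Produce the three summary lines shown at the top of an approval dialog.
function formatIntegritySummary(input: ApprovalIntegrityInput): string[] {
  return [
    `Source risk: ${input.sourceRisk}`,
    `Evidence: ${evidenceLabel(input)}`,
    `Why elevated: ${input.whyElevated}`,
  ];
}

// Sliding-window limiter: at most maxPrompts approval prompts per task per window.
// Returns a function that records a prompt and reports whether it may be shown.
function createApprovalRateLimiter(maxPrompts: number, windowMs: number) {
  const history = new Map<string, number[]>();
  return function shouldPrompt(taskId: string, nowMs: number): boolean {
    const recent = (history.get(taskId) || []).filter((t) => nowMs - t < windowMs);
    if (recent.length >= maxPrompts) {
      history.set(taskId, recent);
      return false;
    }
    recent.push(nowMs);
    history.set(taskId, recent);
    return true;
  };
}
```

A suppressed prompt should not default to approval. A reasonable choice is to queue it for the aggregated review step, so that rate limiting reduces fatigue without creating a bypass.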
Implementation hooks
- approval dialogs in renderer
- task detail and Mission Control surfaces
8.7 Integrity Dashboard
Add a dedicated user-facing surface under Security or Mission Control.
Requirements
- Show recent suspicious ingestions
- Show quarantined content
- Show blocked memory promotions
- Show tainted delegated runs
- Show domains/origins with repeated suspicious hits
- Support per-workspace tuning and allowlists
Initial UX sections
- Recent Detections
- Quarantine Queue
- Memory Promotions Waiting Review
- Approval Escalations
- High-Risk Domains and Origins
8.8 Benchmarking and Red Teaming
CoWork OS should treat this as an eval problem, not only a runtime problem.
Requirements
- Add an agent_traps eval suite that covers:
  - hidden HTML comments
  - CSS-invisible text
  - aria-label / metadata payloads
  - Markdown and LaTeX masking
  - image/OCR prompt contamination
  - biased-framing manipulation
  - poisoned memory recall
  - tainted skill-promotion attempts
  - malicious sub-agent spawning attempts
  - approval-fatigue scenarios
- Measure both:
  - detection quality
  - action prevention quality
Success condition
The product should not only detect the trap. It should also prevent unsafe downstream autonomy.
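Scoring detection and action prevention separately might look like the sketch below. The result shape and the three rates are assumptions about how the agent_traps suite could report outcomes, not an existing harness API.

```typescript
// Hypothetical per-scenario outcome recorded by an eval run.
interface TrapScenarioResult {
  scenario: string;
  isTrap: boolean; // false for benign control cases
  detected: boolean; // classifier flagged the content
  unsafeActionExecuted: boolean; // an unsafe downstream action actually ran
}

interface EvalScore {
  detectionRate: number | null; // flagged traps / all traps
  falsePositiveRate: number | null; // flagged benign cases / all benign cases
  actionPreventionRate: number | null; // traps with no unsafe action / all traps
}

// Returns null when the denominator is zero, so an empty class is not scored as 0 or 1.
function ratio(numerator: number, denominator: number): number | null {
  return denominator === 0 ? null : numerator / denominator;
}

function scoreEval(results: TrapScenarioResult[]): EvalScore {
  const traps = results.filter((r) => r.isTrap);
  const benign = results.filter((r) => !r.isTrap);
  return {
    detectionRate: ratio(traps.filter((r) => r.detected).length, traps.length),
    falsePositiveRate: ratio(benign.filter((r) => r.detected).length, benign.length),
    actionPreventionRate: ratio(
      traps.filter((r) => !r.unsafeActionExecuted).length,
      traps.length,
    ),
  };
}
```

Keeping actionPreventionRate independent of detectionRate matters here: a trap that goes undetected but is still stopped by a permission gate counts as prevented, and a detected trap whose action ran anyway counts as a failure.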
9. UX Requirements
9.1 Task Timeline
Add timeline events for:
- content classified as caution, suspicious, or blocked
- memory promotion denied due to provenance
- delegation denied due to task taint
- approval escalated due to integrity risk
9.2 Approval Dialog
Add fields:
- Source
- Integrity verdict
- Verification level
- Corroborating evidence
- Reason this action is restricted
9.3 Memory and Recall Surfaces
Show trust badges on:
- memories
- daily summaries
- KG-derived facts
- playbook entries
Support filtering by:
- verified only
- all
- quarantined / review needed
9.4 Settings
Add an "Agent Integrity" section under security settings with:
- strictness profile
- external content default posture
- memory promotion policy
- delegation restrictions for tainted tasks
- domain allowlists and deny overrides
10. Data Model
The exact schema can evolve, but Phase 1 should add durable storage for:
content_integrity_records
- id
- workspace_id
- source_type
- source_locator
- origin_domain
- fetch_mode
- content_hash
- integrity_verdict
- risk_score
- risk_signals_json
- verification_status
- human_excerpt
- machine_excerpt
- created_at
claim_provenance
- id
- workspace_id
- normalized_claim
- verification_level
- promotion_status
- supporting_sources_json
- contradicting_sources_json
- created_at
- updated_at
task_integrity_state
- task_id
- workspace_id
- highest_risk
- suspicious_content_count
- delegation_restricted
- action_escalation_required
- state_json
Memory records should gain:
- trust_lane
- source_claim_ids_json
- source_integrity_max
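Since several columns store JSON or enumerated strings, the read path needs a defensive mapping step. The sketch below maps a content_integrity_records row to an in-memory shape. The column types, the trimmed output shape, and the fail-closed fallback to suspicious for an unrecognized verdict are assumptions.

```typescript
type IntegrityVerdict = "trusted" | "caution" | "suspicious" | "blocked";
const VERDICTS: IntegrityVerdict[] = ["trusted", "caution", "suspicious", "blocked"];

// Assumed column types for a content_integrity_records row.
interface ContentIntegrityRow {
  id: string;
  workspace_id: string;
  source_type: string;
  source_locator: string;
  origin_domain: string | null;
  fetch_mode: string;
  content_hash: string;
  integrity_verdict: string;
  risk_score: number;
  risk_signals_json: string;
  verification_status: string;
  human_excerpt: string;
  machine_excerpt: string;
  created_at: string;
}

// Trimmed in-memory view used by callers that only need verdict and signals.
interface ContentIntegritySummary {
  id: string;
  workspaceId: string;
  originDomain: string | null;
  integrityVerdict: IntegrityVerdict;
  riskScore: number;
  riskSignals: string[];
}

// Parse a JSON column that should hold a string array; malformed data yields [].
function parseStringArray(json: string): string[] {
  try {
    const parsed: unknown = JSON.parse(json);
    return Array.isArray(parsed)
      ? parsed.filter((item): item is string => typeof item === "string")
      : [];
  } catch {
    return [];
  }
}

// Fail closed (assumption): an unrecognized stored verdict is treated as suspicious.
function parseVerdict(value: string): IntegrityVerdict {
  const index = VERDICTS.indexOf(value as IntegrityVerdict);
  return index === -1 ? "suspicious" : VERDICTS[index];
}

function rowToSummary(row: ContentIntegrityRow): ContentIntegritySummary {
  return {
    id: row.id,
    workspaceId: row.workspace_id,
    originDomain: row.origin_domain,
    integrityVerdict: parseVerdict(row.integrity_verdict),
    riskScore: row.risk_score,
    riskSignals: parseStringArray(row.risk_signals_json),
  };
}
```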
11. Architecture Changes
11.1 New Services
- ContentIntegrityService
- ClaimProvenanceService
- TaskIntegrityService
- IntegrityPolicyAdapter
- IntegrityEvalRunner
11.2 Integration Points
Ingestion
- web tools
- browser content extraction
- scraping tools
- connector text fetches
- document/OCR import
Runtime
- task executor
- permission engine
- approval generation
- team orchestration
Knowledge systems
- MemoryService
- MemorySynthesizer
- knowledge graph ingest
- playbook reinforcement
- skill proposals
12. Rollout Plan
Phase 1 — Foundations
Ship the minimum durable integrity layer.
- add ContentIntegrityRecord
- classify web/browser/scrape/email/doc inputs
- persist verdicts
- show task-level risk badges
- pass task integrity into approval prompts
Primary outcome: suspicious content is visible and durable.
Phase 2 — Memory and Knowledge Gating
- split memory into trust lanes
- block tainted promotion into durable recall
- gate KG / playbook / skill proposal promotion
- add trust badges in memory surfaces
Primary outcome: poisoned inputs do not silently become durable knowledge.
Phase 3 — Provenance-Aware Actions
- extend permission engine with provenance-aware decisions
- require corroboration for sensitive actions
- add stricter spend / external / destructive gates
- improve approval explainability
Primary outcome: suspicious evidence can no longer directly trigger meaningful side effects.
Phase 4 — Delegation and Systemic Risk Controls
- propagate taint to child tasks
- restrict remote delegation
- add quorum and source-diversity checks in team synthesis
- add resource caps for suspicious multi-agent runs
Primary outcome: one poisoned artifact cannot easily fan out through the whole autonomous runtime.
Phase 5 — Integrity Dashboard and Evals
- ship Integrity dashboard
- ship agent_traps eval suite
- add release gating for severe regressions
Primary outcome: defenses are measurable and operable.
13. Metrics
Security Metrics
- suspicious-ingestion detection rate
- blocked high-risk action rate
- false positive rate on benign content
- poisoned-memory promotion prevention rate
- remote-delegation denial rate for tainted tasks
Product Metrics
- percentage of approvals with provenance shown
- operator evidence-view open rate
- reduction in low-context approvals
- number of quarantined items reviewed
- number of integrity-based policy overrides by workspace
Eval Metrics
- pass rate on hidden-content tests
- pass rate on tainted-memory tests
- pass rate on malicious delegation tests
- pass rate on approval-fatigue simulations
14. Open Questions
- Should integrity verdicts be fully local heuristics in Phase 1, or allow optional provider-assisted scoring?
- What is the right default posture for stealth scraping content: caution or suspicious?
- Should users be allowed to manually promote suspicious content into verified memory?
- How should domain allowlists interact with content-based suspicious signals?
- Should we store full machine-visible render snapshots for forensic replay, or only hashes and excerpts?
15. Recommended First Slice
If only one slice is funded next, build this:
- ContentIntegrityService for web_fetch, browser content, scraping, email, and imported docs.
- task-level integrity state persisted with verdicts and reasons.
- approval dialog upgrade with provenance and risk summary.
- memory trust lanes with promotion blocking from suspicious content.
This is the smallest slice that materially changes runtime safety instead of only improving observability.
16. Code References
- src/electron/agent/security/input-sanitizer.ts
- src/electron/agent/security/output-filter.ts
- src/electron/agent/tools/web-fetch-tools.ts
- src/electron/agent/tools/browser-tools.ts
- src/electron/agent/tools/scraping-tools.ts
- src/electron/memory/MemoryService.ts
- src/electron/memory/MemorySynthesizer.ts
- src/electron/memory/PlaybookSkillPromoter.ts
- src/electron/agents/AgentTeamOrchestrator.ts
- docs/security/security-model.md
- docs/permission-system.md
- docs/computer-use.md