Problem
I was about to ship v1.0 of repo-context-hooks — a Claude Code plugin that keeps interrupted work, next-step context, and handoff notes alive across sessions — in the same quarter SentinelOne published its dependency-hijack research and Trail of Bits shipped its harness-side sandbox. Both pieces are excellent. Both are written from the operator's side: how a Claude Code user defends against a hostile plugin. Plugin authors face the question from the other end: how do you ship a Claude Code plugin to PyPI in 2026 without becoming next quarter's dependency-hijack paper?
The published threat models are mostly post-xz (CVE-2024-3094): typo-squatting, malicious maintainer takeover, attestation-less wheels installed by humans who never run gh attestation verify. The plugin-author playbook for defending against those was scattered across PyPI documentation, sigstore READMEs, GitHub Actions docs, and a few independent blog posts. Nobody had stitched it into a single, copy-pasteable pipeline for a small Python package whose distribution channel is a coding agent.
So I shipped one. This is the post-mortem of what worked, what didn't, and what a five-Claude-agent code review caught that I'd missed — the defender-side Claude Code plugin security playbook I wish had existed when I started.
Constraints
Four constraints shaped every decision below.
- Solo maintainer, side-project hours. No security team to review the publish pipeline. No internal SBOM service. No ability to negotiate a custom GitHub Enterprise feature.
- Zero runtime dependencies. The package's `pyproject.toml` declares `dependencies = []`. Hooks call into the standard library and read checked-in workspace files. This was a constraint before threat modeling, but it doubled as a supply-chain reduction: there is no transitive graph to compromise, only the package itself.
- End users are agents, not humans. A coding agent (Claude Code, Codex, Cursor) installs the package via `pip` on behalf of a developer who may never run a verification command. The hardening has to be machine-checkable, ideally automatic, with the human verification path as a fallback rather than the default.
- The package is small enough that one motivated reviewer should find every defect. About 6,700 lines of Python (verified with `wc -l` on the v1.0 tag, excluding tests), four GitHub Actions workflows, one CLI, and a bundle of skill files. If a reviewer can't find every defect, the review process is broken — not the package size.
Design
Five layers. An attacker has to defeat all of them — independently — to ship a poisoned wheel. A single compromised credential or merged bad commit isn't enough.
Layer 1: OIDC Trusted Publisher (no API tokens)
Every PyPI maintainer who lost a package to a phished API token in 2024-2025 was running the same flow: a long-lived PYPI_API_TOKEN secret stored in GitHub Actions, used by twine upload. PyPI Trusted Publishers retire that pattern. The publish job is authenticated by a short-lived OIDC token issued by GitHub Actions, scoped to a specific repository and a specific environment. There is no secret to steal.
The publish workflow declares the OIDC permissions, a TestPyPI gate that smoke-tests the wheel before promoting to real PyPI, and an explicit environment name that PyPI matches against the configured Trusted Publisher.
```yaml
permissions:
  id-token: write       # OIDC for Trusted Publisher + Sigstore
  attestations: write   # PEP 740 provenance

jobs:
  publish:
    runs-on: ubuntu-latest
    needs: test-on-testpypi
    environment: pypi
    steps:
      - uses: pypa/gh-action-pypi-publish@release/v1
        with:
          attestations: true
```
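The `test-on-testpypi` gate that the publish job `needs:` can be sketched as a job that pushes the built distributions to TestPyPI and smoke-tests an install from there before real PyPI ever sees the wheel. The step details below are illustrative, not the repo's exact workflow — the artifact name and the bare-import smoke test are assumptions:

```yaml
test-on-testpypi:
  runs-on: ubuntu-latest
  environment: testpypi
  steps:
    - uses: actions/download-artifact@v4
      with:
        name: dist          # assumed artifact name from the build job
        path: dist/
    - uses: pypa/gh-action-pypi-publish@release/v1
      with:
        repository-url: https://test.pypi.org/legacy/
    - name: Smoke-test the wheel from TestPyPI
      run: |
        pip install --index-url https://test.pypi.org/simple/ repo-context-hooks
        python -c "import repo_context_hooks"
```

The point of the gate is that a wheel which fails to even import never reaches the `pypi` environment.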
Layer 2: Sigstore signing on every release
Trusted Publisher prevents token theft. Sigstore prevents wheel tampering anywhere downstream — a malicious mirror, a man-in-the-middle inside a corporate proxy, a poisoned cache. The publish workflow signs the built distributions with gh-action-sigstore-python and uploads PEP 740 attestations alongside the wheel. End users (or their agents) verify with one command:
```shell
gh attestation verify repo_context_hooks-0.6.0-py3-none-any.whl \
  --repo narendranathe/repo-context-hooks
```
It's still optional today. It will not be optional in 2027. Putting it in v1.0 means anyone scripting verification today gets a reproducible answer; the README documents the command so the agent layer can learn it.
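Because the end users are agents, the verification path is worth wrapping so it is machine-checkable rather than a command a human has to remember. A minimal sketch — `verify_wheel` is a hypothetical helper, not part of the package — that shells out to the GitHub CLI and treats an unavailable `gh` as a soft failure:

```python
import shutil
import subprocess


def verify_wheel(wheel_path: str, repo: str) -> bool:
    """Return True only if the wheel's PEP 740 attestation verifies.

    Returns False (rather than raising) when the GitHub CLI is missing,
    so an agent can treat verification as a signal, not a crash.
    """
    if shutil.which("gh") is None:
        return False
    result = subprocess.run(
        ["gh", "attestation", "verify", wheel_path, "--repo", repo],
        capture_output=True,
    )
    return result.returncode == 0
```

An agent can call this after `pip download` and refuse to install anything that returns `False`.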
Layer 3: CodeQL + Dependabot in CI
CodeQL runs on every PR and every Sunday at 04:00 UTC. The schedule matters: a vulnerability disclosed Friday afternoon shouldn't wait for the next merge to surface. Dependabot watches both pip and github-actions ecosystems, because pinning actions/checkout@v4 is half-protection if you don't upgrade when v4 itself ships a fix.
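Watching both ecosystems fits in one small config file. A minimal sketch of the shape (intervals and directories are assumptions, not the repo's exact settings):

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```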
Layer 4: Property-tested telemetry hot paths
The package's only network surface is an opt-in telemetry endpoint that emits hook execution events. Two functions there had quietly been the source of bugs: is_sampled(), which gated whether an event was recorded, and deduplicate_hooks(), which collapsed duplicate hook firings. Both were unit-tested. Both still shipped a regression in v0.5: a NaN sample rate would silently disable telemetry instead of falling back to the default.
Hypothesis-based property tests catch this class of bug because the input space they explore is much wider than the cases I'd have hand-written. The fix was small — add math.isnan(rate) to the validator — but the property test is what surfaced it. Coverage went to 80% on the gate (the spec asked for 85%; see “Tradeoffs” below).
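The shape of the fix can be sketched as a stdlib-only validator — `normalize_sample_rate` and `DEFAULT_RATE` are hypothetical names, not the package's actual API, and the real function was `is_sampled()`:

```python
import math

DEFAULT_RATE = 0.1  # assumed default; the real value lives in the package


def normalize_sample_rate(rate: float) -> float:
    """Clamp a configured sample rate to [0, 1].

    NaN compares False against everything, so a naive range check lets
    it slip through and silently disable telemetry — the v0.5 bug. The
    explicit math.isnan() check falls back to the default instead.
    """
    if not isinstance(rate, (int, float)) or math.isnan(rate):
        return DEFAULT_RATE
    return min(max(float(rate), 0.0), 1.0)


# The Hypothesis property that would have caught the regression
# (shown as a comment to keep this sketch stdlib-only):
#
# from hypothesis import given, strategies as st
#
# @given(st.floats(allow_nan=True, allow_infinity=True))
# def test_rate_is_always_usable(rate):
#     assert 0.0 <= normalize_sample_rate(rate) <= 1.0
```

Hand-written unit tests enumerate the cases the author thought of; `st.floats(allow_nan=True)` enumerates the ones they didn't.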
Layer 5: Five-critic parallel agent code review
The hardening layers above are mechanical. They don't catch design flaws, missing edge cases, or the kind of subtle issue a senior reviewer surfaces by reading the diff with intent. For that, I dispatched five parallel Claude agents against the v1.0 PR, each with a single review lens:
- Critic 1 (test depth): coverage by line is a lower bound on real coverage. Where are the branch and edge gaps?
- Critic 2 (CI correctness): are the CI gates actually gating, or is one of them an advisory check that nobody reads?
- Critic 3 (security posture): is the publish pipeline meaningfully different from the threat models in the SentinelOne and Trail of Bits write-ups?
- Critic 4 (maintainability): can a future maintainer change one workflow without breaking another? Are config files forward-incompatible?
- Critic 5 (spec compliance): does the implementation match the issue's acceptance criteria, and where it doesn't, is the deviation documented?
Each agent ran in isolation against the same diff. They returned 51 raw findings; after deduplication, 41 distinct issues remained. About a third would have surfaced under careful solo review. The other two-thirds wouldn't have, because they required holding two parts of the system in mind at once — for example, a CI environment variable that the publish workflow reads but no other workflow sets.
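The 51-to-41 dedup pass was manual, but its core move — collapse matching findings and count how many critics converged, since convergence is signal — can be sketched mechanically. This is a hypothetical illustration with naive exact-text matching, not the actual process:

```python
from collections import Counter


def dedup_findings(findings: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Collapse (critic, finding_text) pairs into distinct issues.

    Returns (normalized_text, critic_count) sorted so that findings
    multiple critics converged on — the high-confidence ones — come
    first. Real dedup needs fuzzier matching than exact text.
    """
    counts = Counter(text.strip().lower() for _, text in findings)
    return sorted(counts.items(), key=lambda kv: -kv[1])
```

In practice the single-critic tail of this list is exactly the set of findings that needed manual re-evaluation.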
Tradeoffs
Four things either didn't work or didn't work as designed. Cataloguing them honestly is the only reason this post is worth reading.
The 85% coverage gate became 80%
Issue #71's acceptance criteria asked for fail_under=85. I shipped 80. Two of the bundle scripts are thin shells that would have required test scaffolding worth more than the bug-finding it bought. The honest read is that the spec was written before the implementation surfaced where the marginal coverage hour goes. I documented the deviation in the PR description and CHANGELOG, and added a follow-up issue to revisit at v1.1. A reviewer who reads “80% gate” without that context is right to push back; the deviation must travel with the package.
The NaN sample rate bug shipped in v0.5
The Hypothesis property test caught it on the v1.0 hardening branch — not before v0.5 was already in users' environments. math.isnan is a one-liner; the bug is small. The lesson is sharper: I had unit tests for is_sampled() that all passed, and they all passed because I'd written them after thinking about which cases mattered. Property-based tests don't share that bias. Every hot-path validator should have one before it ships, not after.
Don’t gate on a third-party SaaS upload
Critic 2 flagged that the coverage upload step was configured as a hard failure on the publish gate. If the upstream service is down for any reason, the publish pipeline blocks. The fix was to gate on local pytest --cov-fail-under and treat the upload as a soft step — the gating signal stays inside CI even when an external service is degraded. Lesson: third-party SaaS in the publish path should never be a hard dependency on a job whose only purpose is to ship.
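The fix looks like this in workflow terms — a sketch, with a hypothetical upload script standing in for the actual SaaS step:

```yaml
- name: Coverage gate (hard)
  run: pytest --cov --cov-fail-under=80

- name: Upload coverage report (soft)
  continue-on-error: true   # an upstream outage must not block the release
  run: ./scripts/upload-coverage.sh   # hypothetical upload step
```

`continue-on-error: true` keeps the job green when the upload fails, while the local `--cov-fail-under` check remains the real gate.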
The five critics overlapped more than expected
I'd designed the lenses to be orthogonal. They weren't. About 20% of the raw findings were duplicates — two critics, sometimes three, surfacing the same issue from different framings. That's not pure waste; convergence on a finding is itself signal. But the dedup pass took ~40 minutes that I hadn't budgeted, and the weakest findings were the ones only one critic raised, which I had to re-evaluate manually. Next time I would either tighten the lens prompts to be more explicitly disjoint, or accept the overlap and pre-budget the dedup time.
Outcome
v1.0 shipped on schedule. All 41 distinct findings closed. The release is signed, OIDC-published, attested under PEP 740, and verifiable from any client with the GitHub CLI. The CodeQL run is green. The Dependabot graph is clean. Coverage is 80% with a documented path to 85%.
The defender-side playbook this post documents is the artifact I wish had existed when I started. As a checklist:
- Configure PyPI Trusted Publisher; delete every `PYPI_API_TOKEN` secret.
- Add `sigstore/gh-action-sigstore-python` to the publish job; flip `attestations: true` on `pypa/gh-action-pypi-publish`.
- Wire CodeQL with both PR triggers and a weekly cron; wire Dependabot for `pip` and `github-actions`.
- Add Hypothesis property tests to every validator on a network or sampling hot path before it ships, not after a regression.
- Run a multi-critic agent review on the release PR. Budget time for dedup. Treat convergent findings as high-confidence.
None of these are novel on their own. The contribution is the order, the dependencies, and the cost of getting any one of them wrong. If you're shipping a Claude Code plugin (or any small PyPI package) in 2026, you can clone these patterns directly — the workflows, configs, and the agent-review prompts are all visible in the repo-context-hooks repository under MIT.
What I'd do differently for v2.0: bring the coverage gate back to 85%, add SLSA build provenance on top of Sigstore (SLSA L3 vs Sigstore alone is the natural next comparison), and tighten the critic prompts so the dedup pass is shorter. The threat model for plugin authors is going to keep evolving. The pipeline above buys a year of runway.
The discipline of reviewing it every release is what buys the rest.
Narendranath Edara is a Senior AI Platform Engineer in Dallas, TX, building production AI systems with platform-engineering discipline. Portfolio · LinkedIn · GitHub.