June 21, 2026 · 10 min read

Cleaning Up 173 Claude Code Skills: From 9 Good to 182 Excellent

After a year of accumulating Claude Code skills, only 9 of 173 reliably activated on natural language. Here's the four-part pattern I lifted from anthropics/launch-your-agent, how I applied it to every skill in one session, and the audit tooling that keeps the catalog clean.

Claude Code · Developer Tools · AI Engineering · Solo Dev

I have 21 active Android apps, a few iOS ports, some web stuff, and a stack of CLI tools. Across all of them, Claude Code skills are the way I encode "how I want to work" so I don't repeat myself.

After a year, I had 173 skills. About 100 of them only fired when I typed the slash command. Most of the others needed me to remember exactly which phrase I'd used in the trigger list. That's not how skills are supposed to work: Claude should match your natural-language intent against the catalog and pick the right one. So I sat down, audited everything, and lifted a pattern from a public Anthropic reference repo that fixed it.

This is the writeup. Real metrics, real before/after, the four-part pattern, and the audit tooling I now run weekly.

The starting state

code
Total skills:    173
Excellent (4/4): 9
Good (3/4):      73
Needs work:      49
Poor:            28
Missing NOT FOR: 76
Trigger collisions: 15 phrases claimed by 11 skill pairs

Activation was unreliable. "Fix this crash" might fire `debug-sudarshan`, or `bug-hunter`, or `systematic-debugging`, or `silent-failure-hunter`, or just plain "I'll help you debug" with no skill at all. Five skills competed for the same prompt with no `NOT FOR` boundaries telling Claude when not to pick each one.

The pattern

I'd been looking for a fix when anthropics/launch-your-agent appeared on GitHub. It's a reference implementation of a Claude Code skill that walks a founder through launching a Claude Managed Agent. The skill's own descriptions are written to a very specific shape, and the shape works. So I extracted the pattern:

yaml
description: "[Concrete domain sentence — what it does, with specifics like
file paths, library versions, canonical IDs.] Use when [explicit activation
condition]. Triggers on \"phrase the user actually types\", \"another phrase\",
\"/slash-command\". NOT FOR [neighbor case] (use `neighbor-skill`)."

Four parts:

  1. One concrete sentence with real specifics. Not "Helps with X." but "Generate a privacy policy HTML page and publish to `https://sudarshanchaudhari.github.io/[appname]-privacy-policy/` covering data collected, third-party SDKs, retention, deletion, contact, and GDPR alignment."
  2. `Use when …`: an explicit activation condition, using the literal phrase. Variants like `Use after`, `Use before`, and `Use immediately when` also work.
  3. `Triggers on "…"`: comma-separated phrases the user actually types. Not abstract topics. Real surface forms.
  4. ``NOT FOR … (use `neighbor-skill`)``: every neighbor that might collide, with a redirect.

That fourth part is the magic. It teaches Claude both "fire on these phrases" and "don't fire when the neighbor is the right call." Most existing skill descriptions in the wild only do part 3. They miss the negative half of the activation rule.
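To make the shape concrete, here is a minimal sketch that assembles a description from the four parts. The helper name `build_description`, the activation condition, the trigger phrases, and the "compliance audits" boundary are all hypothetical; only the skill names `privacy-policy-gen` and `data-privacy-compliance` come from this post.

```python
def build_description(summary, use_when, triggers, not_for):
    """Assemble a four-part skill description (hypothetical helper).

    summary  -- one concrete sentence (part 1)
    use_when -- activation condition, without the leading "Use when" (part 2)
    triggers -- phrases the user actually types (part 3)
    not_for  -- list of (neighbor_case, neighbor_skill) pairs (part 4)
    """
    quoted = ", ".join(f'"{t}"' for t in triggers)
    boundaries = " or ".join(f"{case} (use `{skill}`)" for case, skill in not_for)
    return (
        f"{summary.rstrip('.')}. "
        f"Use when {use_when.rstrip('.')}. "
        f"Triggers on {quoted}. "
        f"NOT FOR {boundaries}."
    )


# Example values are illustrative, not copied from a real SKILL.md.
desc = build_description(
    summary="Generate a privacy policy HTML page covering data collected and third-party SDKs",
    use_when="an app needs a privacy policy before a store submission",
    triggers=["privacy policy", "/privacy-policy-gen"],
    not_for=[("compliance audits", "data-privacy-compliance")],
)
print(desc)
```

Writing the boundary as data (a list of pairs) rather than free text makes it harder to forget the redirect half of part 4.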

Look at the gold-tier skills in the launch-your-agent repo (`RedTeam`, `FirstPrinciples`, `SystemsThinking`, `BitterPillEngineering`): they all hit this shape. They are also among the most reliably activating skills I've seen in a Claude Code setup.

Applying it to 173 skills in one session

I'm not going to write 173 descriptions by hand. Here's the actual process:

Step 1 — Audit the existing state

I wrote a scorer that extracts every skill description, parses YAML properly (not regex — quoted multi-line YAML scalars trip naive parsers), and checks each of the four parts.

python
import re


def score_desc(desc):
    """Score a skill description 0-4: one point per part of the pattern."""
    score = 0
    present = []
    dl = desc.lower()
    # Part 1: concrete sentence (long enough, with a sentence break early on)
    if len(desc) >= 100 and ('.' in desc[:400] or len(desc) >= 200):
        score += 1
        present.append('concrete-sentence')
    # Part 2: Use when (or variant)
    if re.search(r'\buse\s+(when|after|before|on|for|at|as|immediately|whenever)\b', dl):
        score += 1
        present.append('use-when')
    # Part 3: quoted trigger phrases, or a long "Use when" clause as a fallback
    if (re.search(r'triggers?\s+on[:\s].*"[^"]+"', desc, re.IGNORECASE)
            or re.search(r'use when\s+[a-z][^.]{20,}', desc, re.IGNORECASE)):
        score += 1
        present.append('triggers')
    # Part 4: NOT FOR with explicit neighbor redirect
    if 'not for' in dl and ('use `' in dl or 'use /' in dl):
        score += 1
        present.append('not-for-boundary')
    return score, present

Run it. Get a per-skill 0-4 score plus a list of which parts are missing. Group by tier.
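The grouping step might look like the sketch below. The post only defines Excellent (4/4) and Good (3/4), so the cut-offs for "Needs work" and "Poor" are assumptions, and the sample scores are hand-written rather than produced by `score_desc`.

```python
from collections import Counter

# Assumed cut-offs: 2 -> "Needs work" and 0-1 -> "Poor" are guesses for
# illustration; only the 4/4 and 3/4 tiers are defined in the post.
TIERS = {4: "Excellent", 3: "Good", 2: "Needs work", 1: "Poor", 0: "Poor"}
ALL_PARTS = ("concrete-sentence", "use-when", "triggers", "not-for-boundary")


def summarize(results):
    """results maps skill name -> (score, present_parts), as score_desc returns."""
    tiers = Counter(TIERS[score] for score, _ in results.values())
    missing = {
        name: [part for part in ALL_PARTS if part not in present]
        for name, (score, present) in results.items()
        if score < 4
    }
    return tiers, missing


# Hypothetical scores for three skills named in this post.
results = {
    "adr": (3, ["concrete-sentence", "use-when", "triggers"]),
    "bug-hunter": (1, ["use-when"]),
    "debug-sudarshan": (4, list(ALL_PARTS)),
}
tiers, missing = summarize(results)
# Worst offenders first, so the report doubles as a to-do list.
for name, parts in sorted(missing.items(), key=lambda kv: -len(kv[1])):
    print(f"{name}: missing {', '.join(parts)}")
```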

Step 2 — Batch-rewrite the worst tier

The 28 "Poor" tier descriptions all looked alike: "Skill X for SudarshanTechLabs." Four words. No triggers. No neighbors.

I rewrote each one by hand, but quickly — read the body, extract the actual capability, write a description that hits the four-part pattern. Each one took 30-60 seconds.

Step 3 — Append NOT FOR lines via script

The 76 skills missing only the `NOT FOR` boundary line were a perfect script target. For each one, I knew its cluster (debug, planning, audit, etc.) and which neighbors it would collide with. I wrote a Python script that took a `{skill: not_for_line}` map and appended each one to its description in one pass.

python
NOT_FOR = {
    'adr': 'NOT FOR runtime decision-making (just decide and proceed) or capturing one-off learnings (use `capture-learning`).',
    'agent-workflow': 'NOT FOR running existing agents (use `dispatching-parallel-agents`) or writing prompts (use `prompt-engineer`).',
    # ... 74 more
}

76 skills updated in 4 seconds. The map itself took 20 minutes to write — I had to make a judgment call per skill about which neighbors mattered.
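The append itself is not shown above, so the following is a guess at its core rather than the actual script: a function that works on the already-parsed description string and is safe to run twice. The name `append_not_for` and the sample description are hypothetical; the boundary line is the `adr` entry from the map.

```python
def append_not_for(description, not_for_line):
    """Return the description with the boundary line appended exactly once."""
    text = description.strip()
    # Crude idempotency check: any existing "not for" wording counts as a
    # boundary, so prose like "is not for beginners" would also be skipped.
    if "not for" in text.lower():
        return text
    if not text.endswith((".", "!", "?")):
        text += "."
    return f"{text} {not_for_line}"


NOT_FOR_ADR = (
    "NOT FOR runtime decision-making (just decide and proceed) "
    "or capturing one-off learnings (use `capture-learning`)."
)
once = append_not_for("Record an architecture decision", NOT_FOR_ADR)
twice = append_not_for(once, NOT_FOR_ADR)
print(once)
```

Keeping the function idempotent means a re-run after a partial failure cannot stack two boundary lines on the same skill.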

Step 4 — Archive the obvious redundancies

Eight skills were superseded but never deleted. Things like `privacy-policy-mega` (kept around after `privacy-policy-gen` + `data-privacy-compliance` replaced it), `seo-blog-writer` (overlaps `new-blog`), and `senior-code-reviewer-mega` (covered by `review-feature` + language-specific reviewers).

I moved them to `~/.claude/skills/_archived/` rather than deleting outright: a 30-day recoverable window in case I missed something. I also wrote a `_archived/README.md` documenting why each was retired and what to use instead.
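A minimal sketch of the archive step, assuming each skill is a directory under the skills root. The function name `archive_skill` and the README line format are hypothetical, and the demo runs in a temporary directory rather than `~/.claude/skills/`.

```python
import shutil
import tempfile
from pathlib import Path


def archive_skill(skills_dir, name, reason, replacement):
    """Move a skill into _archived/ and log why it was retired."""
    skills_dir = Path(skills_dir)
    archive = skills_dir / "_archived"
    archive.mkdir(exist_ok=True)
    shutil.move(str(skills_dir / name), str(archive / name))
    # Append, so earlier retirement notes are preserved.
    with open(archive / "README.md", "a", encoding="utf-8") as readme:
        readme.write(f"- `{name}`: {reason}. Use `{replacement}` instead.\n")
    return archive / name


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "privacy-policy-mega").mkdir()
    moved = archive_skill(tmp, "privacy-policy-mega", "superseded", "privacy-policy-gen")
    moved_exists = moved.is_dir()
    original_gone = not (Path(tmp) / "privacy-policy-mega").exists()
    readme_text = (Path(tmp) / "_archived" / "README.md").read_text(encoding="utf-8")
```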

Step 5 — Resolve trigger collisions

The audit script also detects when two or more skills claim the same trigger phrase. 15 collisions surfaced:

  • `"karpathy check"` claimed by both `karpathy-check` (audit) and `karpathy-coder` (write-time enforcement)
  • `"check all apps"` claimed by both `find-anomalies` and `cross-app-parity`
  • `"swiftui"` claimed by both `ios-macos-sudarshan` (parent) and `swiftui-patterns` (sub-skill)
  • ... 12 more

For each, I picked the skill that should primarily own the phrase, removed it from the loser, and made sure both skills had the right `NOT FOR` boundary pointing at each other. Another 50-line script.
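Collision detection can be sketched as an inverted index from trigger phrase to the skills that claim it. This is an illustration under assumptions, not the audit script itself: it assumes triggers are double-quoted after "Triggers on", and the two descriptions below are invented for the demo.

```python
import re
from collections import defaultdict

# Everything after "Triggers on" up to the NOT FOR boundary (or end of text).
TRIGGER_CLAUSE = re.compile(
    r"triggers?\s+on[:\s](.*?)(?:NOT FOR|$)", re.IGNORECASE | re.DOTALL
)
QUOTED = re.compile(r'"([^"]+)"')


def find_collisions(descriptions):
    """Map each trigger phrase claimed by 2+ skills to the sorted claimants."""
    owners = defaultdict(set)
    for skill, desc in descriptions.items():
        clause = TRIGGER_CLAUSE.search(desc)
        if clause is None:
            continue
        for phrase in QUOTED.findall(clause.group(1)):
            owners[phrase.lower()].add(skill)
    return {phrase: sorted(skills) for phrase, skills in owners.items() if len(skills) > 1}


# Hypothetical descriptions; only the skill names come from the post.
descriptions = {
    "karpathy-check": 'Audit code for needless complexity. Triggers on "karpathy check", '
                      '"audit simplicity". NOT FOR writing new code (use `karpathy-coder`).',
    "karpathy-coder": 'Enforce simplicity while writing. Triggers on "karpathy check", '
                      '"write it simply".',
}
collisions = find_collisions(descriptions)
print(collisions)
```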

Step 6 — Handle the YAML edge case

This is the part that bit me. Many older skills used YAML block-scalar form:

yaml
description: |
  This is a long description
  that spans multiple lines
  and looks tidy.

My first injection script naively inserted `Use when X.` on the same line, right after the `|`, producing:

yaml
description: |. Use when X.
  Original first line
  ...

Which is invalid YAML: after `|`, the header line can carry only an indentation or chomping indicator and a comment, never content. It broke 22 files before I caught it. Lesson: when munging YAML, parse with a real library first, manipulate the parsed value, then re-emit. Don't regex.

Fortunately my `~/.claude/` is git-tracked. I ran `git checkout HEAD -- skills/<broken>/SKILL.md` for each of the 22, then wrote a smarter v2 script that handled all three description forms (single-line, quoted multi-line, block-scalar).
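The parse-then-re-emit approach could look like the sketch below. It is not the v2 script: it assumes PyYAML is installed, that the frontmatter sits between the first two `---` lines, and that collapsing a block scalar into one paragraph is acceptable. The name `add_use_when` and the sample file are hypothetical.

```python
try:
    import yaml  # PyYAML: third-party, assumed installed (pip install pyyaml)
except ImportError:
    yaml = None


def add_use_when(skill_md, use_when):
    """Return SKILL.md text with `use_when` appended to the description."""
    if yaml is None:
        raise RuntimeError("PyYAML is required for this sketch")
    if not skill_md.startswith("---\n"):
        raise ValueError("no YAML frontmatter found")
    # Frontmatter is the text between the first two '---' lines.
    _, front, body = skill_md.split("---\n", 2)
    meta = yaml.safe_load(front)
    # The parsed value is a plain string whichever of the three forms the
    # file used. Block scalars keep newlines, so collapse whitespace first.
    description = " ".join(str(meta["description"]).split())
    meta["description"] = f"{description} {use_when}"
    front = yaml.safe_dump(meta, sort_keys=False, allow_unicode=True, width=1000)
    return f"---\n{front}---\n{body}"


SAMPLE = (
    "---\n"
    "name: demo-skill\n"
    "description: |\n"
    "  This is a long description\n"
    "  that spans multiple lines\n"
    "  and looks tidy.\n"
    "---\n"
    "Body text.\n"
)

if yaml is not None:
    print(add_use_when(SAMPLE, "Use when X."))
```

The trade-off is that re-emitting normalizes the file: the block scalar comes back as a single-line value, which is a larger diff than an in-place edit but cannot produce a malformed header.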

Step 7 — Build the skills I didn't have

Audit also surfaced gaps: capabilities I'd reach for but had no skill for. I scaffolded 17 new ones using the same four-part pattern:

  • `incident-postmortem`: blameless RCA + GitHub issue + CHANGELOG entry
  • `keystore-rotate`: Android Play App Signing upload-key rotation
  • `api-changelog`: Keep-a-Changelog diff between git refs
  • `screenshot-set`: Play Store + App Store screenshot capture
  • `cma-launch`: port of `launch-your-agent`'s flow to my stack
  • `lane-resume`: ADE lane pickup protocol
  • `cross-skill-test`: simulate which skill fires for a prompt
  • `skill-promote`: promote drafts from `auto-skill-reviewer.py`
  • `play-listing-screenshot-compare`: store listing drift detector
  • `secrets-scan-deep`: portfolio-wide TruffleHog sweep
  • `cma-eval-suite`: eval regression check for CMA agents
  • `agent-handoff`: clean handoff to another session
  • `repo-decommission`: end-of-life wrapper
  • `firebase-rotate`: Firebase credential rotation per type
  • `store-rejection-fixer`: triage + resubmission checklist
  • `ai-coding-rule-update`: bump canonical versions in rules + propagate
  • `voice-to-spec`: voice-note → idea → spec pipeline

The final state

code
Total skills:    182  (173 original - 8 archived + 17 new)
Excellent (4/4): 182  ↑ from 9
Good (3/4):      0    ↓ from 73
Needs work:      0    ↓ from 49
Poor:            0    ↓ from 28
Trigger collisions: 0  ↓ from 15
YAML parse errors:  0

100% of skills at gold-tier. Every collision resolved. Every redundancy archived.

Keeping it clean

A snapshot in time means nothing if it rots in a week. Three things now keep the catalog clean:

1. A PreToolUse hook that scores new SKILL.md edits. Registered in `settings.json` under `Write|Edit|MultiEdit`. Warns if a SKILL.md edit drops the description below 3/4. Blocks completely if the description is under 30 chars (un-activatable). The hook lives at `~/.claude/hooks/skill-quality-guard.py`.
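The decision logic of such a hook can be sketched as follows. This is not the contents of `skill-quality-guard.py`: it assumes the hook receives the tool call as JSON on stdin with `tool_input.file_path` and `tool_input.content`, and that exit code 2 blocks the call, so check the current hooks documentation before relying on either. The `extract_description` callable is a hypothetical stand-in for real frontmatter parsing.

```python
import json
import sys

MIN_CHARS = 30  # from the post: shorter descriptions are un-activatable
MIN_SCORE = 3   # from the post: warn below 3/4


def decide(description, score):
    """Return (exit_code, message) for a proposed SKILL.md description."""
    if len(description.strip()) < MIN_CHARS:
        return 2, f"blocked: description is under {MIN_CHARS} chars"
    if score < MIN_SCORE:
        return 0, f"warning: description scores {score}/4"
    return 0, ""


def main(extract_description, score_desc):
    """Hook entry point; both callables are supplied by the caller."""
    payload = json.load(sys.stdin)
    tool_input = payload.get("tool_input", {})
    if not str(tool_input.get("file_path", "")).endswith("SKILL.md"):
        return 0  # not a skill file, nothing to check
    # Assumption: Write carries the whole file in "content". An Edit carries
    # only a fragment, so a real hook would rebuild the post-edit file first.
    description = extract_description(tool_input.get("content", ""))
    code, message = decide(description, score_desc(description)[0])
    if message:
        print(message, file=sys.stderr)
    return code


# Wiring, left commented out because it reads stdin:
# sys.exit(main(my_extract_description, score_desc))
```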

2. A weekly cron that re-runs the audit. Sundays at 9am, output goes to `~/.claude/logs/skill-audit-v2.log`. If anything regresses, I know within a week.

3. A `cross-skill-test` tool I can invoke manually. Given a user prompt, it scores every skill's trigger overlap and shows me the top matches with a collision warning if the top two scores are within 20%. Useful before merging any new skill.

bash
$ python3 ~/.claude/skills/cross-skill-test/test.py "fix this crash"
PROMPT: 'fix this crash'

TOP 5 MATCHES:
  1. debug-sudarshan          score=15  matched: 'crash', 'fix'
  2. bug-hunter                score=10  matched: 'crash'
  3. systematic-debugging      score= 8  matched: 'crash'

✓ Clear winner — gap of 5 pts to runner-up
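
The ranking and the 20% rule can be sketched like this. The real `test.py` evidently uses weighted scores (15, 10, 8 above); this sketch swaps in a plain count of matched trigger words, and the trigger lists are invented for the demo.

```python
import re

WORD = re.compile(r"[a-z0-9]+")


def rank(prompt, triggers_by_skill):
    """Score each skill by how many of its trigger words appear in the prompt."""
    prompt_words = set(WORD.findall(prompt.lower()))
    ranked = []
    for skill, triggers in triggers_by_skill.items():
        trigger_words = {w for t in triggers for w in WORD.findall(t.lower())}
        matched = sorted(prompt_words & trigger_words)
        if matched:
            ranked.append((len(matched), skill, matched))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda r: (-r[0], r[1]))


def is_collision(ranked, margin=0.20):
    """True when the runner-up scores within `margin` of the winner."""
    if len(ranked) < 2:
        return False
    top, second = ranked[0][0], ranked[1][0]
    return (top - second) / top <= margin


# Hypothetical trigger lists for skills named in the post.
triggers_by_skill = {
    "debug-sudarshan": ["fix crash", "app crashes"],
    "bug-hunter": ["crash", "find bugs"],
    "new-blog": ["write a post"],
}
ranked = rank("fix this crash", triggers_by_skill)
for score, skill, matched in ranked:
    print(f"{skill:<20} score={score} matched: {matched}")
print("collision" if is_collision(ranked) else "clear winner")
```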

Plus a visual cluster map (`~/.claude/skills/skill-cluster-map.html`): a D3 force-directed graph where every NOT-FOR edge becomes a graph link. Reveals the cluster structure of the catalog at a glance: the debug cluster, the planning pipeline, the release flow, the ADE cluster. 287 edges across 182 nodes. Useful before adding a new skill, since you can spot if your idea overlaps an existing cluster.

Why this works (the meta-lesson)

Skill descriptions are prompts. Claude reads them at activation time and picks the best match. Like any prompt, specificity wins:

  • Concrete sentences beat abstract topics. "Generate a privacy policy HTML page and publish to GitHub Pages at the standard URL" beats "Helps with privacy policy."
  • Triggers users actually type beat synonyms. "ANR" beats "application not responding." "Compose recomposition" beats "UI performance issues."
  • Boundaries beat hope. A `NOT FOR` line redirecting to a neighbor teaches Claude both "fire on me" and "fire on them when …". The neighbor list itself becomes documentation for future-you.

Looking back at the 9 skills already at 4/4 before this cleanup, they were all the ones I'd written most recently — after I'd started internalizing what made skills reliable. The 164 others were just my prior shapes accumulating. There was no malicious intent, just drift.

This is why agents/skills need maintenance. Drift compounds. The weekly audit flags regressions, and every quarter I'll do a full pass to bring whatever's slipped back into shape.

The cluster map

If you want to see the result visually, my generated cluster map looks like this when filtered to the `release` cluster:

code
release-sudarshan ──→ ship-check (NOT FOR pre-release verification only)
release-sudarshan ──→ document-release (NOT FOR doc sync)
release-sudarshan ──→ playstore-sudarshan
release-sudarshan ──→ store-listing (NOT FOR copy generation)
release-sudarshan ──→ changelog-gen (NOT FOR release notes)
ship-check        ──→ release-sudarshan
document-release  ──→ readme-gen
document-release  ──→ changelog-gen
changelog-gen     ──→ new-blog
changelog-gen     ──→ document-release
playstore-sudarshan ──→ store-listing
store-listing     ──→ changelog-gen
store-listing     ──→ new-blog
screenshot-set    ──→ playstore-sudarshan
screenshot-set    ──→ store-listing
store-rejection-fixer ──→ ship-check
store-rejection-fixer ──→ playstore-sudarshan
incident-postmortem ──→ release-sudarshan
keystore-rotate   ──→ release-sudarshan

Each edge is "if a user prompt could fire either of us, this one wins." Cluster boundaries become visible as the graph layout settles.
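Deriving those edges from the descriptions is a small extraction job. The sketch below assumes every redirect is written as ``(use `skill`)`` after the `NOT FOR` marker, and emits the nodes-and-links shape that D3 force layouts commonly consume. The function names are hypothetical; the two boundary lines are the `adr` and `agent-workflow` entries from Step 3.

```python
import json
import re

REDIRECT = re.compile(r"use `([^`]+)`")


def not_for_edges(descriptions):
    """Return (skill, neighbor) pairs for every redirect after NOT FOR."""
    edges = []
    for skill, desc in descriptions.items():
        _, found, boundary = desc.partition("NOT FOR")
        if not found:
            continue  # no boundary line, so no outgoing edges
        for neighbor in REDIRECT.findall(boundary):
            edges.append((skill, neighbor))
    return edges


def to_graph(edges):
    """Build a nodes/links dict suitable for dumping to JSON."""
    names = sorted({name for edge in edges for name in edge})
    return {
        "nodes": [{"id": name} for name in names],
        "links": [{"source": s, "target": t} for s, t in edges],
    }


descriptions = {
    "adr": "NOT FOR runtime decision-making (just decide and proceed) "
           "or capturing one-off learnings (use `capture-learning`).",
    "agent-workflow": "NOT FOR running existing agents (use `dispatching-parallel-agents`) "
                      "or writing prompts (use `prompt-engineer`).",
}
edges = not_for_edges(descriptions)
graph = to_graph(edges)
print(json.dumps(graph, indent=2))
```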

What you can steal

If your own Claude Code setup has accumulated skills, the cheapest thing to do is run an audit against the four-part pattern. The audit script I wrote is sitting in `~/.claude/skills/skill-audit/audit.py`: about 150 lines of Python, runs in under a second across 182 skills, prints a prioritized report.

I might pull it out into a standalone repo if there's interest. The pattern itself you can lift right out of anthropics/launch-your-agent — read their SKILL.md files (RedTeam, FirstPrinciples, SystemsThinking, BitterPillEngineering, IterativeDepth, ExtractWisdom) and you'll see the shape.

A clean skill catalog isn't a one-shot project. It's a hygiene practice. Audit. Score. Archive. Test for collisions. Add boundaries. Then do it again next quarter.


The audit pattern was extracted on 2026-06-21 during a six-hour session that took my Claude Code catalog from 9 → 182 Excellent. The pattern itself is in anthropics/launch-your-agent; `RedTeam`, `FirstPrinciples`, `SystemsThinking`, `BitterPillEngineering`, `IterativeDepth`, and `ExtractWisdom` are the reference skills to study. I may open-source the audit toolkit separately; reach out if useful.


Sudarshan Chaudhari

AI Systems Builder / Product Engineer

Bangkok, Thailand

Solo Android developer with 13+ years in QA, building Android apps, AI automation systems, and developer tools at SudarshanTechLabs.
