KB Auto-Tagging System: Organizing Knowledge at Scale¶

Context: As the Knowledge Base grows, manually tagging articles becomes a bottleneck. We built an intelligent auto-tagging system using OpenRouter's Claude API that analyzes full article content, auto-applies known tags, and prompts users individually for novel suggestions.

The Problem¶

When KB articles lack tags:

No discoverability through tag filters
Manual categorization is slow, inconsistent, and error-prone
Tag inconsistency across articles (python vs py vs Python)
New contributors don't know what tags exist or mean
Valuable insights buried without proper metadata

The Solution: Intelligent Auto-Tagging¶

Our system:

Reads full article content — analyzes complete markdown (not just title/description)
Maintains a tag catalog with descriptions (tag-descriptions.yaml)
Detects untagged articles via frontmatter YAML scanning
Queries OpenRouter API (Claude Haiku) with efficient batching
Auto-applies known tags from existing KB (high confidence, immediate)
Prompts per new tag individually (user controls additions)
Updates tag catalog when new tags are accepted

Using the System¶

List All Tags with Usage¶

ew local.kb-tags

Shows tag name, frequency in KB, and descriptions. Discover available tags before publishing.

Auto-Tag Untagged Articles¶

ew local.kb-autotag

Interactive by default (safe, user-controlled)

Example session:

📄 Docker Production Migration: From Development...
   Evaluating tags batch 1/1...
   ✅ Suggested (12):
      • devops: Covers deployment strategies and infrastructure automation
      • docker: Central focus on containerization and docker-compose
      • guide: Step-by-step migration instructions
      • advanced, architecture, best-practices, howto, intermediate, python, security, self-hosting, tutorial

   ❌ Not selected (6):
      • ai: No AI/ML content
      • beginner: Assumes Docker experience
      • database, index, linux, reference: Not applicable

   ◆ Auto-applying 4 known tags (devops, docker, guide, python)
   ✨ Accept 'advanced'? (production-level patterns): y
   ✨ Accept 'architecture'? (system design): n
   ✨ Accept 'security'? (protection concerns): y

   ✓ Applied 6 tags (4 known + 2 new)
      ✓ Added 'security' to tag catalog

How It Works: Technical Deep Dive¶

The Auto-Tagging Pipeline¶

Collect existing KB tags — tags already used in articles (the "known" set)
Load tag catalog — 18+ tags with descriptions from tag-descriptions.yaml
Find untagged articles — scan KB for empty tag arrays
Read full article content — extract markdown body after YAML frontmatter
Batch evaluate tags — send 20 tags per OpenRouter API call (efficient)
Analyze with reasoning — AI provides YES/NO decision for each tag with reasoning
Separate results: Known tags (auto-apply) vs New tags (prompt user)
Validate user input — accepts y/yes or n/no only (retries on invalid)
Update catalog — when user accepts new tag, add to tag-descriptions.yaml
Apply tags — merge approved tags into article frontmatter

Why Full Content Analysis Matters¶

Title + Description only: - May miss implementation details - Loses context about trade-offs discussed - Misses specific technologies mentioned in code examples Full article analysis: - Catches specific tools and approaches used - Understands architecture decisions explained in text - Identifies real-world lessons and best practices - Provides ~75% accuracy matching original manual tags - Discovers additional useful tags from detailed content

Known vs New Tag Separation¶

Known tags (auto-apply immediately): - Already used in KB — proven relevant - High confidence in categorization - Saves user time on familiar tags New tags (prompt user individually): - Novel suggestions from content analysis - User controls whether to accept - Option to reject without catalog pollution - When accepted, added to tag catalog for future use

Tag Categories¶

Topic Tags¶

What the article covers:

python, docker, linux, devops, self-hosting
ai, security, database, architecture

Content Type Tags¶

How the content is structured:

guide — step-by-step instructional
tutorial — hands-on with examples
reference — API/documentation
howto — practical instructions for tasks
best-practices — recommended patterns

Level Tags¶

Skill level targeting:

beginner — new to the topic
intermediate — some experience
advanced — experienced users

Real-World Accuracy¶

Test case: Docker Production Migration article

Original manual tags (8): devops, docker, linux, python, self-hosting, production, guide, architecture

Auto-tagger results (6 matched + 5 new): ✓ Matched: devops, docker, python, self-hosting, guide, architecture (75% accuracy) ✗ Missing: linux, production + New discoveries: advanced, best-practices, security, intermediate, howto, tutorial

Key insight: Full article content revealed security practices (directory traversal protection, SSL/TLS via reverse proxy) that weren't obvious from the description alone.

Adding New Tags¶

1. System Auto-Updates Catalog¶

When you accept a new tag during auto-tagging:

✨ Accept 'kubernetes'? (container orchestration): y
   ✓ Added 'kubernetes' to tag catalog

The tag is automatically added to tag-descriptions.yaml with auto-generated title and description.

2. Manual Tag Addition¶

Edit /tag-descriptions.yaml:

tags:
  kubernetes:
    title: "Kubernetes"
    description: "Kubernetes orchestration, container management, and deployment"

Manually tag an article, then run auto-tagger to make it available for future suggestions.

Best Practices¶

✅ Do:

Use 2-4 specific topic tags
Include exactly 1 content type tag
Use hyphens for multi-word tags (self-hosting, best-practices)
Keep tags lowercase
Accept new tags when they accurately describe content

❌ Avoid:

Over-tagging (5+ topic tags = noise)
Vague tags (misc, other, stuff)
Inconsistent naming (python vs py)
Uppercase or spaces (Python, best practices)

Input Validation¶

The system validates all user input:

✓ y or yes → Accept tag
✓ n or no → Reject tag
✗ yep, nope, sure → ⚠ Invalid input. Please enter 'y' or 'n' → Retries

This prevents misclicks and ensures deliberate decisions.

Troubleshooting¶

Q: "No tags suggested"

Check if KB has any existing tags to build from
Verify article has description in frontmatter
Try with --interactive=false to see debug output Q: Suggestions seem generic
Full article content is being analyzed (this is correct)
Generic tags mean content is foundational/broad
Accept them if accurate Q: Same tags keep getting suggested
This is normal — consistent content receives consistent suggestions
Override manually if needed in article frontmatter Q: New tag not appearing in catalog
Article must be saved with the new tag first
Run ew local.kb-tags to verify it was added
Check file permissions on tag-descriptions.yaml

Implementation Details¶

Location: /knowledge_base_indexer.py Key functions:

query_openrouter() — Direct API calls to Claude Haiku
evaluate_tag_batch() — Evaluates batch of tags with full content
suggest_tags_with_gptme() — Main suggestion engine
get_yes_no_input() — Validated input with retry
add_tag_to_catalog() — Updates YAML catalog
autotag_articles() — Orchestrates full workflow
collect_all_tags() — Gathers existing KB tags
find_untagged_articles() — Identifies articles needing tags
generate_recently_added() — Auto-updates KB homepage Integration with kb-generate:
ew local.kb-generate calls auto-tag internally
Recently Added section updates with file modification time order
Category indices auto-populate from tagged articles

Lessons Learned¶

Full content is crucial — Title/description alone miss details
Batching improves efficiency — 20 tags per request vs multiple small calls
User control matters — Prompting per-tag prevents unwanted tags
Input validation prevents errors — Strict y/n prevents accidental acceptance
Known vs new separation saves time — Users only decide on novel suggestions
Catalog updates enable growth — System learns from user decisions
Interactive by default is safer — Auto-apply was risky for novel suggestions