KB Auto-Tagging System: Organizing Knowledge at Scale¶
Context: As the Knowledge Base grows, manually tagging articles becomes a bottleneck. We built an intelligent auto-tagging system using OpenRouter's Claude API that analyzes full article content, auto-applies known tags, and prompts users individually for novel suggestions.
The Problem¶
When KB articles lack tags:
- No discoverability through tag filters
- Manual categorization is slow, inconsistent, and error-prone
- Tag inconsistency across articles (
pythonvspyvsPython) - New contributors don't know what tags exist or mean
- Valuable insights buried without proper metadata
The Solution: Intelligent Auto-Tagging¶
Our system:
- Reads full article content — analyzes complete markdown (not just title/description)
- Maintains a tag catalog with descriptions (
tag-descriptions.yaml) - Detects untagged articles via frontmatter YAML scanning
- Queries OpenRouter API (Claude Haiku) with efficient batching
- Auto-applies known tags from existing KB (high confidence, immediate)
- Prompts per new tag individually (user controls additions)
- Updates tag catalog when new tags are accepted
Using the System¶
List All Tags with Usage¶
ew local.kb-tags
Shows tag name, frequency in KB, and descriptions. Discover available tags before publishing.
Auto-Tag Untagged Articles¶
ew local.kb-autotag
Interactive by default (safe, user-controlled)
Example session:
📄 Docker Production Migration: From Development...
Evaluating tags batch 1/1...
✅ Suggested (12):
• devops: Covers deployment strategies and infrastructure automation
• docker: Central focus on containerization and docker-compose
• guide: Step-by-step migration instructions
• advanced, architecture, best-practices, howto, intermediate, python, security, self-hosting, tutorial
❌ Not selected (6):
• ai: No AI/ML content
• beginner: Assumes Docker experience
• database, index, linux, reference: Not applicable
◆ Auto-applying 4 known tags (devops, docker, guide, python)
✨ Accept 'advanced'? (production-level patterns): y
✨ Accept 'architecture'? (system design): n
✨ Accept 'security'? (protection concerns): y
✓ Applied 6 tags (4 known + 2 new)
✓ Added 'security' to tag catalog
How It Works: Technical Deep Dive¶
The Auto-Tagging Pipeline¶
- Collect existing KB tags — tags already used in articles (the "known" set)
- Load tag catalog — 18+ tags with descriptions from
tag-descriptions.yaml - Find untagged articles — scan KB for empty tag arrays
- Read full article content — extract markdown body after YAML frontmatter
- Batch evaluate tags — send 20 tags per OpenRouter API call (efficient)
- Analyze with reasoning — AI provides YES/NO decision for each tag with reasoning
- Separate results: Known tags (auto-apply) vs New tags (prompt user)
- Validate user input — accepts y/yes or n/no only (retries on invalid)
- Update catalog — when user accepts new tag, add to
tag-descriptions.yaml - Apply tags — merge approved tags into article frontmatter
Why Full Content Analysis Matters¶
Title + Description only: - May miss implementation details - Loses context about trade-offs discussed - Misses specific technologies mentioned in code examples Full article analysis: - Catches specific tools and approaches used - Understands architecture decisions explained in text - Identifies real-world lessons and best practices - Provides ~75% accuracy matching original manual tags - Discovers additional useful tags from detailed content
Known vs New Tag Separation¶
Known tags (auto-apply immediately): - Already used in KB — proven relevant - High confidence in categorization - Saves user time on familiar tags New tags (prompt user individually): - Novel suggestions from content analysis - User controls whether to accept - Option to reject without catalog pollution - When accepted, added to tag catalog for future use
Tag Categories¶
Topic Tags¶
What the article covers:
python,docker,linux,devops,self-hostingai,security,database,architecture
Content Type Tags¶
How the content is structured:
guide— step-by-step instructionaltutorial— hands-on with examplesreference— API/documentationhowto— practical instructions for tasksbest-practices— recommended patterns
Level Tags¶
Skill level targeting:
beginner— new to the topicintermediate— some experienceadvanced— experienced users
Real-World Accuracy¶
Test case: Docker Production Migration article
Original manual tags (8): devops, docker, linux, python, self-hosting, production, guide, architecture
Auto-tagger results (6 matched + 5 new): ✓ Matched: devops, docker, python, self-hosting, guide, architecture (75% accuracy) ✗ Missing: linux, production + New discoveries: advanced, best-practices, security, intermediate, howto, tutorial
Key insight: Full article content revealed security practices (directory traversal protection, SSL/TLS via reverse proxy) that weren't obvious from the description alone.
Adding New Tags¶
1. System Auto-Updates Catalog¶
When you accept a new tag during auto-tagging:
✨ Accept 'kubernetes'? (container orchestration): y
✓ Added 'kubernetes' to tag catalog
The tag is automatically added to tag-descriptions.yaml with auto-generated title and description.
2. Manual Tag Addition¶
Edit /tag-descriptions.yaml:
tags:
kubernetes:
title: "Kubernetes"
description: "Kubernetes orchestration, container management, and deployment"
Manually tag an article, then run auto-tagger to make it available for future suggestions.
Best Practices¶
✅ Do:
- Use 2-4 specific topic tags
- Include exactly 1 content type tag
- Use hyphens for multi-word tags (
self-hosting,best-practices) - Keep tags lowercase
- Accept new tags when they accurately describe content
❌ Avoid:
- Over-tagging (5+ topic tags = noise)
- Vague tags (
misc,other,stuff) - Inconsistent naming (
pythonvspy) - Uppercase or spaces (
Python,best practices)
Input Validation¶
The system validates all user input:
- ✓
yoryes→ Accept tag - ✓
norno→ Reject tag - ✗
yep,nope,sure→ ⚠ Invalid input. Please enter 'y' or 'n' → Retries
This prevents misclicks and ensures deliberate decisions.
Troubleshooting¶
Q: "No tags suggested"
- Check if KB has any existing tags to build from
- Verify article has description in frontmatter
-
Try with
--interactive=falseto see debug output Q: Suggestions seem generic -
Full article content is being analyzed (this is correct)
- Generic tags mean content is foundational/broad
- Accept them if accurate Q: Same tags keep getting suggested
- This is normal — consistent content receives consistent suggestions
- Override manually if needed in article frontmatter Q: New tag not appearing in catalog
- Article must be saved with the new tag first
- Run
ew local.kb-tagsto verify it was added - Check file permissions on
tag-descriptions.yaml
Implementation Details¶
Location: /knowledge_base_indexer.py
Key functions:
query_openrouter()— Direct API calls to Claude Haikuevaluate_tag_batch()— Evaluates batch of tags with full contentsuggest_tags_with_gptme()— Main suggestion engineget_yes_no_input()— Validated input with retryadd_tag_to_catalog()— Updates YAML catalogautotag_articles()— Orchestrates full workflowcollect_all_tags()— Gathers existing KB tagsfind_untagged_articles()— Identifies articles needing tagsgenerate_recently_added()— Auto-updates KB homepage Integration with kb-generate:ew local.kb-generatecalls auto-tag internally- Recently Added section updates with file modification time order
- Category indices auto-populate from tagged articles
Lessons Learned¶
- Full content is crucial — Title/description alone miss details
-
Batching improves efficiency — 20 tags per request vs multiple small calls
-
User control matters — Prompting per-tag prevents unwanted tags
-
Input validation prevents errors — Strict y/n prevents accidental acceptance
-
Known vs new separation saves time — Users only decide on novel suggestions
-
Catalog updates enable growth — System learns from user decisions
-
Interactive by default is safer — Auto-apply was risky for novel suggestions