Ga naar inhoud

KB Auto-Tagging System: Organizing Knowledge at Scale

Context: As the Knowledge Base grows, manually tagging articles becomes a bottleneck. We built an intelligent auto-tagging system using OpenRouter's Claude API that analyzes full article content, auto-applies known tags, and prompts users individually for novel suggestions.

The Problem

When KB articles lack tags:

  • No discoverability through tag filters
  • Manual categorization is slow, inconsistent, and error-prone
  • Tag inconsistency across articles (python vs py vs Python)
  • New contributors don't know what tags exist or mean
  • Valuable insights buried without proper metadata

The Solution: Intelligent Auto-Tagging

Our system:

  1. Reads full article content — analyzes complete markdown (not just title/description)
  2. Maintains a tag catalog with descriptions (tag-descriptions.yaml)
  3. Detects untagged articles via frontmatter YAML scanning
  4. Queries OpenRouter API (Claude Haiku) with efficient batching
  5. Auto-applies known tags from existing KB (high confidence, immediate)
  6. Prompts per new tag individually (user controls additions)
  7. Updates tag catalog when new tags are accepted

Using the System

List All Tags with Usage

ew local.kb-tags

Shows tag name, frequency in KB, and descriptions. Discover available tags before publishing.

Auto-Tag Untagged Articles

ew local.kb-autotag

Interactive by default (safe, user-controlled)

Example session:

📄 Docker Production Migration: From Development...
   Evaluating tags batch 1/1...
   ✅ Suggested (12):
      • devops: Covers deployment strategies and infrastructure automation
      • docker: Central focus on containerization and docker-compose
      • guide: Step-by-step migration instructions
      • advanced, architecture, best-practices, howto, intermediate, python, security, self-hosting, tutorial

   ❌ Not selected (6):
      • ai: No AI/ML content
      • beginner: Assumes Docker experience
      • database, index, linux, reference: Not applicable

   ◆ Auto-applying 4 known tags (devops, docker, guide, python)
   ✨ Accept 'advanced'? (production-level patterns): y
   ✨ Accept 'architecture'? (system design): n
   ✨ Accept 'security'? (protection concerns): y

   ✓ Applied 6 tags (4 known + 2 new)
      ✓ Added 'security' to tag catalog

How It Works: Technical Deep Dive

The Auto-Tagging Pipeline

  1. Collect existing KB tags — tags already used in articles (the "known" set)
  2. Load tag catalog — 18+ tags with descriptions from tag-descriptions.yaml
  3. Find untagged articles — scan KB for empty tag arrays
  4. Read full article content — extract markdown body after YAML frontmatter
  5. Batch evaluate tags — send 20 tags per OpenRouter API call (efficient)
  6. Analyze with reasoning — AI provides YES/NO decision for each tag with reasoning
  7. Separate results: Known tags (auto-apply) vs New tags (prompt user)
  8. Validate user input — accepts y/yes or n/no only (retries on invalid)
  9. Update catalog — when user accepts new tag, add to tag-descriptions.yaml
  10. Apply tags — merge approved tags into article frontmatter

Why Full Content Analysis Matters

Title + Description only: - May miss implementation details - Loses context about trade-offs discussed - Misses specific technologies mentioned in code examples Full article analysis: - Catches specific tools and approaches used - Understands architecture decisions explained in text - Identifies real-world lessons and best practices - Provides ~75% accuracy matching original manual tags - Discovers additional useful tags from detailed content

Known vs New Tag Separation

Known tags (auto-apply immediately): - Already used in KB — proven relevant - High confidence in categorization - Saves user time on familiar tags New tags (prompt user individually): - Novel suggestions from content analysis - User controls whether to accept - Option to reject without catalog pollution - When accepted, added to tag catalog for future use

Tag Categories

Topic Tags

What the article covers:

  • python, docker, linux, devops, self-hosting
  • ai, security, database, architecture

Content Type Tags

How the content is structured:

  • guide — step-by-step instructional
  • tutorial — hands-on with examples
  • reference — API/documentation
  • howto — practical instructions for tasks
  • best-practices — recommended patterns

Level Tags

Skill level targeting:

  • beginner — new to the topic
  • intermediate — some experience
  • advanced — experienced users

Real-World Accuracy

Test case: Docker Production Migration article

Original manual tags (8): devops, docker, linux, python, self-hosting, production, guide, architecture

Auto-tagger results (6 matched + 5 new): ✓ Matched: devops, docker, python, self-hosting, guide, architecture (75% accuracy) ✗ Missing: linux, production + New discoveries: advanced, best-practices, security, intermediate, howto, tutorial

Key insight: Full article content revealed security practices (directory traversal protection, SSL/TLS via reverse proxy) that weren't obvious from the description alone.

Adding New Tags

1. System Auto-Updates Catalog

When you accept a new tag during auto-tagging:

 Accept 'kubernetes'? (container orchestration): y
    Added 'kubernetes' to tag catalog

The tag is automatically added to tag-descriptions.yaml with auto-generated title and description.

2. Manual Tag Addition

Edit /tag-descriptions.yaml:

tags:
  kubernetes:
    title: "Kubernetes"
    description: "Kubernetes orchestration, container management, and deployment"

Manually tag an article, then run auto-tagger to make it available for future suggestions.

Best Practices

Do:

  • Use 2-4 specific topic tags
  • Include exactly 1 content type tag
  • Use hyphens for multi-word tags (self-hosting, best-practices)
  • Keep tags lowercase
  • Accept new tags when they accurately describe content

Avoid:

  • Over-tagging (5+ topic tags = noise)
  • Vague tags (misc, other, stuff)
  • Inconsistent naming (python vs py)
  • Uppercase or spaces (Python, best practices)

Input Validation

The system validates all user input:

  • y or yes → Accept tag
  • n or no → Reject tag
  • yep, nope, sure⚠ Invalid input. Please enter 'y' or 'n' → Retries

This prevents misclicks and ensures deliberate decisions.

Troubleshooting

Q: "No tags suggested"

  • Check if KB has any existing tags to build from
  • Verify article has description in frontmatter
  • Try with --interactive=false to see debug output Q: Suggestions seem generic

  • Full article content is being analyzed (this is correct)

  • Generic tags mean content is foundational/broad
  • Accept them if accurate Q: Same tags keep getting suggested
  • This is normal — consistent content receives consistent suggestions
  • Override manually if needed in article frontmatter Q: New tag not appearing in catalog
  • Article must be saved with the new tag first
  • Run ew local.kb-tags to verify it was added
  • Check file permissions on tag-descriptions.yaml

Implementation Details

Location: /knowledge_base_indexer.py Key functions:

  • query_openrouter() — Direct API calls to Claude Haiku
  • evaluate_tag_batch() — Evaluates batch of tags with full content
  • suggest_tags_with_gptme() — Main suggestion engine
  • get_yes_no_input() — Validated input with retry
  • add_tag_to_catalog() — Updates YAML catalog
  • autotag_articles() — Orchestrates full workflow
  • collect_all_tags() — Gathers existing KB tags
  • find_untagged_articles() — Identifies articles needing tags
  • generate_recently_added() — Auto-updates KB homepage Integration with kb-generate:
  • ew local.kb-generate calls auto-tag internally
  • Recently Added section updates with file modification time order
  • Category indices auto-populate from tagged articles

Lessons Learned

  1. Full content is crucial — Title/description alone miss details
  2. Batching improves efficiency — 20 tags per request vs multiple small calls

  3. User control matters — Prompting per-tag prevents unwanted tags

  4. Input validation prevents errors — Strict y/n prevents accidental acceptance

  5. Known vs new separation saves time — Users only decide on novel suggestions

  6. Catalog updates enable growth — System learns from user decisions

  7. Interactive by default is safer — Auto-apply was risky for novel suggestions