🔍 Understanding How Language Models Think – One Circuit at a Time

"Circuit Tracing: Revealing Computational Graphs in Language Models" by Anthropic
↳Introduces a method for uncovering how LLMs process and generate responses by constructing graph-based descriptions of their computations on specific prompts.

✓Key Idea
↳Instead of analyzing raw neurons or broad model components like MLPs and attention heads, the authors use sparse coding models, specifically cross-layer transcoders (CLTs), to break model activations down into interpretable features and trace how those features interact (circuits).

✓How They Do It
↳Transcoders: Build an interpretable replacement model in which direct feature-to-feature interactions can be analyzed.
↳Cross-Layer Transcoders (CLTs): Let features read from one layer and write to later layers while still approximating the original model's outputs.
↳Attribution Graphs: Build computational maps showing the chain of influence that leads to a token prediction.
↳Linear Attribution: Make feature interactions linear by freezing attention patterns and normalization terms.
↳Graph Pruning: Remove low-influence nodes and edges for better interpretability.
↳Interactive Interface: Explore these attribution graphs dynamically.
↳Validation: Use perturbation experiments to confirm the identified mechanisms.

✓Real-World Case Studies
↳Factual Recall: Understanding how the model knows that Michael Jordan plays basketball.
↳Addition in LLMs: Analyzing how "36 + 59 =" is computed at the feature level.

✓Challenges and Open Questions
↳Missing explanations of attention circuits (QK interactions).
↳Reconstruction errors that show up as "dark matter" nodes.
↳Difficulty understanding global circuits that hold across multiple prompts.
↳Graph structures that remain complex even after pruning.

✓Why This Matters
↳Mechanistic interpretability is key to trustworthy AI: it lets us move from black-box models to systems we can explain, debug, and align with human values. This paper from Anthropic is a step toward making LLMs more transparent and understandable at the circuit level.
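
To make the Attribution Graphs and Graph Pruning steps above a bit more concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm: the node names, edge weights, the path-product influence score, and the `keep_fraction` cutoff are all made-up assumptions, chosen only to show the general idea of scoring nodes by their influence on an output and dropping the weakest ones.

```python
from functools import lru_cache

# Toy attribution graph: EDGES[src][dst] = direct linear effect of src on dst.
# Every node name and weight here is hypothetical, invented for illustration.
EDGES = {
    "tok:36": {"feat:add_6_9": 0.8, "feat:tens_3": 0.6},
    "tok:59": {"feat:add_6_9": 0.7, "feat:noise": 0.05},
    "feat:add_6_9": {"logit:95": 0.9},
    "feat:tens_3": {"logit:95": 0.5},
    "feat:noise": {"logit:95": 0.1},
    "logit:95": {},
}


def total_influence(edges, target):
    """Score each node by summing |weight| products over all paths to target.

    Assumes the graph is acyclic, as a feed-forward attribution graph would be.
    """
    @lru_cache(maxsize=None)
    def influence(node):
        if node == target:
            return 1.0
        return sum(abs(w) * influence(dst) for dst, w in edges[node].items())

    return {node: influence(node) for node in edges}


def prune(edges, target, keep_fraction=0.9):
    """Keep the top-scoring nodes that together cover keep_fraction of the
    summed influence, and drop every edge touching a removed node."""
    scores = total_influence(edges, target)
    scores.pop(target)  # the output node is always kept
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    budget = keep_fraction * sum(scores.values())

    kept, running = {target}, 0.0
    for node, score in ranked:
        if running >= budget:
            break
        kept.add(node)
        running += score

    return {
        src: {dst: w for dst, w in outs.items() if dst in kept}
        for src, outs in edges.items()
        if src in kept
    }


pruned = prune(EDGES, "logit:95")
for src, outs in pruned.items():
    print(src, "->", outs)
```

In this toy graph the weakly connected `feat:noise` node falls outside the influence budget and disappears, while the stronger paths into `logit:95` remain. A real attribution graph also carries error nodes and far more features, which is part of why graphs stay complex even after pruning.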
link https://transformer-circuits.pub/2025/attribution-graphs/methods.html