Understanding How Language Models Think: One Circuit at a Time

"Circuit Tracing: Revealing Computational Graphs in Language Models" by Anthropic
↳ Introduces a method to uncover how LLMs process prompts and generate responses by constructing graph-based descriptions of their computations on specific prompts.

Key Idea
↳ Instead of analyzing raw neurons or broad model components like MLPs and attention heads, the authors use sparse coding models, specifically cross-layer transcoders (CLTs), to break model activations down into interpretable features and trace how those features interact (circuits).

How They Do It
↳ Transcoders: Create an interpretable replacement model to analyze direct feature interactions.
↳ Cross-Layer Transcoders (CLTs): Map features across layers while maintaining accuracy.
↳ Attribution Graphs: Build computational maps showing the chain of influence leading to token predictions.
↳ Linear Attribution: Make feature-to-feature interactions linear by freezing attention patterns and normalization denominators.
↳ Graph Pruning: Remove unnecessary connections for better interpretability.
↳ Interactive Interface: Explore these attribution graphs dynamically.
↳ Validation: Use perturbation experiments to confirm identified mechanisms.

Real-World Case Studies
↳ Factual Recall: Understanding how the model knows that Michael Jordan plays basketball.
↳ Addition in LLMs: Analyzing how "36 + 59 =" is computed at the feature level.

Challenges and Open Questions
↳ Missing explanations of attention circuits (QK interactions).
↳ Reconstruction errors that show up as "dark matter" nodes.
↳ Difficulty understanding global circuits across multiple prompts.
↳ Complex graph structures, even after pruning.

Why This Matters
Mechanistic interpretability is key to trustworthy AI, enabling us to move from black-box models to systems we can explain, debug, and align with human values. This paper from Anthropic represents a step forward in making LLMs more transparent and understandable at the circuit level.
Link: https://transformer-circuits.pub/2025/attribution-graphs/methods.html
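
To make the attribution-graph idea from the post more concrete, here is a minimal Python sketch of linear attribution, pruning, and a perturbation check. It is a hand-built toy and not the paper's method: there is no trained transcoder, no attention, and no error nodes. The feature names (`feat_person`, `feat_sport`), the weights, and the pruning threshold are all made up for illustration. The only idea it tries to show is that once interactions are linear, an edge weight can be read off as "source activation times weight", and that ablating a feature should change the output by the amount the graph predicts.

```python
# Toy linear "replacement model": 3 input nodes -> 2 hidden features -> 1 logit.
# Every name and number below is hypothetical and chosen only for illustration.

inputs = {"tok_Michael": 1.0, "tok_Jordan": 1.0, "tok_plays": 0.5}

# w_in[source][feature]: fixed linear weights from input nodes to hidden features.
w_in = {
    "tok_Michael": {"feat_person": 0.8, "feat_sport": 0.1},
    "tok_Jordan": {"feat_person": 0.6, "feat_sport": 0.9},
    "tok_plays": {"feat_person": 0.0, "feat_sport": 0.4},
}
# w_out[feature]: fixed linear weights from hidden features to the output logit.
w_out = {"feat_person": 0.05, "feat_sport": 1.5}

OUTPUT = "logit_basketball"


def forward(inputs, ablate=None):
    """Run the toy model; optionally zero out one hidden feature."""
    hidden = {}
    for feat in w_out:
        pre = sum(act * w_in[src][feat] for src, act in inputs.items())
        hidden[feat] = 0.0 if feat == ablate else max(pre, 0.0)  # ReLU
    logit = sum(hidden[f] * w_out[f] for f in hidden)
    return hidden, logit


def attribution_edges(inputs, hidden):
    """Edge weight = source activation * linear weight to the target."""
    edges = {}
    for src, act in inputs.items():
        for feat, w in w_in[src].items():
            edges[(src, feat)] = act * w
    for feat, act in hidden.items():
        edges[(feat, OUTPUT)] = act * w_out[feat]
    return edges


def prune(edges, threshold):
    """Keep only edges whose absolute weight reaches the threshold."""
    return {edge: w for edge, w in edges.items() if abs(w) >= threshold}


hidden, logit = forward(inputs)
edges = attribution_edges(inputs, hidden)
kept = prune(edges, threshold=0.15)

# Perturbation check: ablating a feature should lower the logit by
# exactly the weight of that feature's edge into the output.
_, ablated_logit = forward(inputs, ablate="feat_sport")
observed_drop = logit - ablated_logit
predicted_drop = edges[("feat_sport", OUTPUT)]

for edge, w in sorted(kept.items(), key=lambda kv: -abs(kv[1])):
    print(edge, round(w, 3))
print("observed drop:", round(observed_drop, 3))
print("predicted drop:", round(predicted_drop, 3))
```

In this toy the observed and predicted drops agree because nothing nonlinear sits between the ablated feature and the logit. In a real model they can diverge, which is why the post lists perturbation experiments as a separate validation step.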