LLM Post-Training: A Deep Dive into Reasoning LLMs

This survey paper provides an in-depth examination of post-training methodologies for Large Language Models (LLMs), focusing on improving reasoning capabilities. While LLMs achieve strong performance from pretraining on massive datasets, post-training methods such as fine-tuning, reinforcement learning (RL), and test-time scaling are essential for aligning LLMs with human intent, enhancing reasoning, and ensuring safe, context-aware interactions.

Key Highlights

1. Post-Training Taxonomy
The paper introduces a structured taxonomy of post-training strategies:
- Fine-tuning: task/domain-specific adaptation
- Reinforcement Learning: optimization using human or AI feedback
- Test-time Scaling: inference-time improvements in reasoning and efficiency

2. Fine-Tuning
- Enhances domain-specific capabilities but risks overfitting.
- Parameter-efficient techniques like LoRA and adapters reduce computational overhead (sketched in the appendix below).
- Struggles with generalization if overly specialized.

3. Reinforcement Learning (RL)
- RLHF, RLAIF, and DPO refine model outputs based on preference signals (a DPO sketch follows at the end of this post).
- RL in LLMs must handle high-dimensional, sparse, and subjective feedback.
- Chain-of-thought (CoT) reasoning and stepwise reward modeling help improve logical consistency.

4. Test-Time Scaling
- Involves techniques like Tree-of-Thoughts and Self-Consistency (see the self-consistency sketch below).
- Dynamic computation during inference improves multi-step reasoning.
- Includes search-based methods and retrieval-augmented generation (RAG).

5. Advanced Optimization Techniques
- PPO, GRPO, TRPO, OREO, and ORPO are discussed and compared (a GRPO sketch follows below).
- These methods balance stability, efficiency, and alignment with human values.

6. Reward Modeling
- Both explicit (human-annotated) and implicit (interaction-based) reward types are covered.
- Process-oriented rewards, which score intermediate reasoning steps, are emphasized for complex reasoning (sketched below).

7. Practical Benchmarks and Models
- An extensive table covers 40+ state-of-the-art LLMs (e.g., GPT-4, Claude, DeepSeek, LLaMA 3) with their RL methods and architecture types.
- Introduces DeepSeek-R1 and DeepSeek-R1-Zero, showcasing pure RL-based LLM training.

Keep learning and keep growing!!
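Code Sketches (Appendix)

The snippets below are minimal, illustrative Python/PyTorch sketches of the general techniques the survey covers, not implementations from the paper; any helper name that does not appear in the survey is hypothetical.

First, the low-rank-update idea behind LoRA (highlight 2): the pretrained weight stays frozen and only a small rank-r correction is trained, so the effective weight becomes W + (alpha / r) * B @ A.

```python
# Minimal LoRA sketch: freeze a pretrained linear layer, train a low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # pretrained weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: A projects down to rank r, B projects back up.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))  # only A and B receive gradients
```

Because only A and B are trainable, gradient and optimizer-state memory shrink dramatically compared to full fine-tuning, which is the computational-overhead reduction the survey refers to.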
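Next, the DPO objective (highlight 3): it optimizes directly on preference pairs using log-probabilities under the policy and a frozen reference model, with no separately trained reward model. This sketch assumes per-sequence log-probs have already been summed for each example.

```python
# DPO loss on precomputed sequence log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """All inputs are per-example summed log-probs of shape (batch,)."""
    # Implicit rewards are the log-ratios between policy and reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the chosen-vs-rejected margin; logsigmoid is the stable form of log(sigmoid(.)).
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```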
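For test-time scaling (highlight 4), self-consistency samples several chain-of-thought completions and majority-votes on their final answers. Here `sample_cot` and `extract_answer` are hypothetical stand-ins for your model's sampling and answer-parsing code.

```python
# Self-consistency: vote across sampled reasoning chains.
from collections import Counter

def self_consistency(prompt, sample_cot, extract_answer, n=16):
    # Sample diverse reasoning chains at a nonzero temperature,
    # keeping only each chain's final answer.
    answers = [extract_answer(sample_cot(prompt, temperature=0.8)) for _ in range(n)]
    # The most common final answer across chains wins.
    return Counter(answers).most_common(1)[0][0]
```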
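Among the optimization methods (highlight 5), GRPO's distinctive move is computing advantages relative to a group of completions sampled for the same prompt, which removes the need for a learned critic. A sketch of that group-relative normalization:

```python
# GRPO-style advantages: z-score each completion's reward within its prompt group.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scores for sampled completions."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # per-group normalized advantages

# Completions scoring above their group's mean get positive advantage.
adv = grpo_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]]))
```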
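Finally, for process-oriented reward modeling (highlight 6): a process reward model scores each intermediate reasoning step rather than only the final answer. One common aggregation, sketched here with a hypothetical `score_step` model, takes the weakest step as the trajectory score, so a single bad step sinks the whole chain.

```python
# Process-oriented reward: aggregate per-step scores from a process reward model.
def process_reward(steps, score_step):
    """steps: list of reasoning-step strings; score_step: hypothetical PRM returning [0, 1]."""
    step_scores = [score_step(s) for s in steps]
    return min(step_scores)  # min-aggregation; mean or product are common alternatives
```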