qwen/qvq-72b-preview
Context length: 128,000
text+image->text
Qwen
Updated 2024-12-25
QVQ-72B-Preview is an experimental research model developed by the Qwen team that focuses on enhancing visual reasoning capabilities.

Performance

Benchmark         QVQ-72B-Preview  o1-2024-12-17  gpt-4o-2024-05-13  Claude3.5 Sonnet-20241022  Qwen2VL-72B
MMMU(val)         70.3             77.3           69.1               70.4                       64.5
MathVista(mini)   71.4             71.0           63.8               65.3                       70.5
MathVision(full)  35.9             –              30.4               35.6                       25.9
OlympiadBench     20.4             –              25.9               –                          11.2

(– = no score reported)

Limitations

Language Mixing and Code-Switching: The model may occasionally mix languages or switch between them unexpectedly, which can reduce the clarity of its responses.

Recursive Reasoning Loops: The model can get caught in recursive reasoning loops, producing lengthy responses that may never reach a final answer.

Safety and Ethical Considerations: Robust safety measures are needed to ensure reliable and safe performance. Users should exercise caution when deploying this model.

Performance and Benchmark Limitations: Despite its improvements in visual reasoning, QVQ does not fully replace the capabilities of Qwen2-VL-72B. During multi-step visual reasoning, the model may gradually lose focus on the image content, leading to hallucinations. Moreover, QVQ shows no significant improvement over Qwen2-VL-72B on basic recognition tasks such as identifying people, animals, or plants.

Note: The model currently supports only single-round dialogues with image inputs and text outputs. It does not support video inputs.