{ "id": "conv_20251008_181329_361", "title": "搜索deepseekv3.1,qwen2.5-max,kimi-k2,gemini2.5pro的模型...", "created_at": "2025-10-08T18:13:29.361814", "updated_at": "2025-10-08T18:58:04.811869", "messages": [ { "role": "user", "content": "搜索deepseekv3.1,qwen2.5-max,kimi-k2,gemini2.5pro的模型相关信息,主要了解其性能,做成一个精美的,设计高级的,带有可互动和方便对比的各种图表,动画和互动效果精致高级的html页面,保证数据的准确性", "timestamp": "2025-10-08T18:17:03.608240" }, { "role": "assistant", "content": "我来为您搜索这些AI模型的相关信息,然后创建一个精美的对比页面。", "timestamp": "2025-10-08T18:17:06.557527" }, { "role": "assistant", "content": "", "timestamp": "2025-10-08T18:17:06.564233", "tool_calls": [ { "id": "web_search:0", "index": 0, "type": "function", "function": { "name": "web_search", "arguments": "{\"query\": \"DeepSeek V3.1 AI model performance benchmarks specifications\", \"max_results\": 5}" } } ] }, { "role": "tool", "content": "{\"success\": true, \"summary\": \"🔍 搜索查询: DeepSeek V3.1 AI model performance benchmarks specifications\\n📅 搜索时间: 2025-10-08T18:17:10.759701\\n\\n📝 AI摘要:\\nDeepSeek V3.1 is a hybrid AI model with fast reasoning and 128K context. It supports both thinking and non-thinking modes. It is known for strong tool use capabilities.\\n\\n---\\n\\n📊 搜索结果:\\n\\n1. DeepSeek V3.1 - Intelligence, Performance & Price Analysis\\n 🔗 https://artificialanalysis.ai/models/deepseek-v3-1-reasoning\\n 📄 | Llama 3.3 Nemotron Super 49B v1 (Reasoning) | NVIDIA | Open | 128k | | | Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) | NVIDIA | Open | 128k | | | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | NVI...\\n\\n2. DeepSeek-V3.1-Terminus\\n 🔗 https://api-docs.deepseek.com/news/news250922\\n 📄 [Skip to main content](https://api-docs.deepseek.com/news/news250922#__docusaurus_skipToContent_fallback) [English](https://api-docs.deepseek.com/news/news250922#) * [English](https://api-docs.deeps...\\n\\n3. DeepSeek V3.1 Complete Evaluation Analysis: The New AI ...\\n 🔗 https://dev.to/czmilo/deepseek-v31-complete-evaluation-analysis-the-new-ai-programming-benchmark-for-2025-58jc\\n 📄 [ [Model card](https://huggingface.co/deepseek-ai/DeepSeek-V3.1)[Files Files and versions xet](https://huggingface.co/deepseek-ai/DeepSe...\\n\\n5. deepseek-v3.1 Model by Deepseek-ai - NVIDIA NIM APIs\\n 🔗 https://build.nvidia.com/deepseek-ai/deepseek-v3_1/modelcard\\n 📄 DeepSeek V3.1 Instruct is a hybrid AI model with fast reasoning, 128K context, and strong tool use. DeepSeek V3.1 Instruct is a hybrid AI model with fast reasoning, 128K context, and strong tool use. ...\"}", "timestamp": "2025-10-08T18:17:10.767218", "tool_call_id": "web_search:0", "name": "web_search" }, { "role": "assistant", "content": "", "timestamp": "2025-10-08T18:17:12.751529", "tool_calls": [ { "id": "web_search:1", "index": 0, "type": "function", "function": { "name": "web_search", "arguments": "{\"max_results\": 5, \"query\": \"Qwen 2.5 Max AI model performance benchmarks specifications 2025\"}" } } ] }, { "role": "tool", "content": "{\"success\": true, \"summary\": \"搜索失败: 搜索失败: \"}", "timestamp": "2025-10-08T18:17:17.173269", "tool_call_id": "web_search:1", "name": "web_search" }, { "role": "assistant", "content": "", "timestamp": "2025-10-08T18:17:18.715024", "tool_calls": [ { "id": "web_search:2", "index": 0, "type": "function", "function": { "name": "web_search", "arguments": "{\"max_results\": 5, \"query\": \"Qwen 2.5 Max model benchmarks performance comparison\"}" } } ] }, { "role": "tool", "content": "{\"success\": true, \"summary\": \"🔍 搜索查询: Qwen 2.5 Max model benchmarks performance comparison\\n📅 搜索时间: 2025-10-08T18:17:23.647517\\n\\n📝 AI摘要:\\nQwen 2.5 Max is a leading AI model, surpassing DeepSeek and GPT-4 in benchmarks. It's designed for advanced applications and is available via Alibaba Cloud API. It competes with top-tier models like GPT-4o and DeepSeek V3.\\n\\n---\\n\\n📊 搜索结果:\\n\\n1. Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model\\n 🔗 https://qwenlm.github.io/blog/qwen2.5-max/\\n 📄 Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model | Qwen Qwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model * [简体中文](https://qwenlm.github.io/zh/blog/qwen2.5-max/)  * [Free Courses](https://opencv.org/university/free-cours...\\n\\n5. Alibaba's Qwen 2.5-Max Just Beat GPT-4 & DeepSeek? - YouTube\\n 🔗 https://www.youtube.com/watch?v=3I3V_rqaG1A\\n 📄 Alibaba’s Qwen 2.5-Max Just Beat GPT-4 & DeepSeek? | Qwen 2.5-Max vs DeepSeek R1 | Edureka\\nedureka!\\n4410000 subscribers\\n112 likes\\n7366 views\\n31 Jan 2025\\n🔥PGP in Generative AI and ML in collaboration w...\"}", "timestamp": "2025-10-08T18:17:23.654990", "tool_call_id": "web_search:2", "name": "web_search" }, { "role": "assistant", "content": "", "timestamp": "2025-10-08T18:17:34.933122", "tool_calls": [ { "id": "web_search:3", "index": 0, "type": "function", "function": { "name": "web_search", "arguments": "{\"max_results\": 5, \"query\": \"Kimi K2 AI model performance benchmarks specifications 2025\"}" } } ] }, { "role": "tool", "content": "{\"success\": true, \"summary\": \"🔍 搜索查询: Kimi K2 AI model performance benchmarks specifications 2025\\n📅 搜索时间: 2025-10-08T18:17:41.155701\\n\\n📝 AI摘要:\\nKimi K2 achieved 65.8% on SWE-bench, 53.7% on LiveCodeBench, and excels in advanced reasoning tasks on Tau-2 Bench. It outperforms proprietary models like GPT 4.1 and Claude Opus 4.\\n\\n---\\n\\n📊 搜索结果:\\n\\n1. Kimi K2: Open Agentic Intelligence - Moonshot\\n 🔗 https://moonshotai.github.io/Kimi-K2/\\n 📄 Imagine using Kimi K2 to explore remote-work salaries with the Salary Data Analysis example, where 16 IPython calls generate stats, visualizations, and an interactive webpage of insights. For pre/post...\\n\\n2. Moonshot's Kimi K2 for Coding: Our First Impressions in Cline\\n 🔗 https://cline.bot/blog/moonshots-kimi-k2-for-coding-our-first-impressions-in-cline\\n 📄 # Moonshot's Kimi K2 for Coding: Our First Impressions in Cline Test Kimi K2, the open-source model achieving 65.8% on SWE-bench. Moonshot AI's Kimi K2 is now available in Cline. Kimi K2's coding perf...\\n\\n3. Kimi K2 AI for Coding: Features, Reviews and Use Cases [2025]\\n 🔗 https://www.secondtalent.com/resources/kimi-k2-ai-for-coding/\\n 📄 Kimi K2 AI for Coding: Features, Reviews and Use Cases [2025] | SecondTalent * [Hire Talent](https://secondtalent.com/hire-developers/) Close Hire Talent Open Hire Talent [Hire Software Developers](...\\n\\n4. Kimi K2 - Intelligence, Performance & Price Analysis\\n 🔗 https://artificialanalysis.ai/models/kimi-k2\\n 📄 Kimi K2 Model Details * [ [The Keyword](https://blog.google/) [Share](https://blog.google/technology...\\n\\n2. Gemini 2.5 Pro - Google DeepMind\\n 🔗 https://deepmind.google/models/gemini/pro/\\n 📄 [ Models Advanced audio ...\\n\\n3. Gemini 2.5 Pro | Generative AI on Vertex AI - Google Cloud\\n 🔗 https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro\\n 📄 * [English](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro) * [Deutsch](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro) * [Español](http...\\n\\n4. LLM Leaderboard 2025 - Vellum AI\\n 🔗 https://www.vellum.ai/llm-leaderboard\\n 📄 [Claude 3.5 Haiku ](https://www.vellum.ai/llm-leaderboard#)[Claude 3.5 Sonnet :\\n📍 URL: https://qwenlm.github.io/blog/qwen2.5-max/\\n==================================================\\nQwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model | Qwen\\n\\n===============\\n\\nWe have a new blog at [qwen.ai](https://qwen.ai/research)!\\n----------------------------------------------------------\\n\\nThis page will automatically redirect in 5 seconds.\\n\\nIf you are not redirected automatically, please click the button below.\\n\\nGo Now\\n\\n[](https://qwenlm.github.io/ \\\"Qwen (Alt + H)\\\")\\n\\n* [Blog](https://qwenlm.github.io/blog/ \\\"Blog\\\")\\n* [Publication](https://qwenlm.github.io/publication \\\"Publication\\\")\\n* [About](https://qwenlm.github.io/about \\\"About\\\")\\n* [Try Qwen Chat](https://chat.qwen.ai/ \\\"Try Qwen Chat\\\")\\n\\nQwen2.5-Max: Exploring the Intelligence of Large-scale MoE Model\\n================================================================\\n\\nJanuary 28, 2025·3 min·561 words·Qwen Team|Translations:\\n* [简体中文](https://qwenlm.github.io/zh/blog/qwen2.5-max/)\\n\\n\\n[QWEN CHAT](https://chat.qwenlm.ai/)[API](https://www.alibabacloud.com/help/en/model-studio/developer-reference/what-is-qwen-llm)[DEMO](https://huggingface.co/spaces/Qwen/Qwen2.5-Max-Demo)[DISCORD](https://discord.gg/yPEP2vHTu4)\\n\\nIt is widely recognized that continuously scaling both data size and model size can lead to significant improvements in model intelligence. However, the research and industry community has limited experience in effectively scaling extremely large models, whether they are dense or Mixture-of-Expert (MoE) models. Many critical details regarding this scaling process were only disclosed with the recent release of DeepSeek V3. Concurrently, we are developing Qwen2.5-Max, a large-scale MoE model that has been pretrained on over 20 trillion tokens and further post-trained with curated Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) methodologies. Today, we are excited to share the performance results of Qwen2.5-Max and announce the availability of its [API](https://www.alibabacloud.com/help/en/model-studio/getting-started/) through Alibaba Cloud. We also invite you to explore Qwen2.5-Max on [Qwen Chat](https://chat.qwenlm.ai/)!\\n\\nPerformance[#](https://qwenlm.github.io/blog/qwen2.5-max/#performance)\\n----------------------------------------------------------------------\\n\\nWe evaluate Qwen2.5-Max alongside leading models, whether proprietary or open-weight, across a range of benchmarks that are of significant interest to the community. These include MMLU-Pro, which tests knowledge through college-level problems, LiveCodeBench, which assesses coding capabilities, LiveBench, which comprehensively tests the general capabilities, and Arena-Hard, which approximates human preferences. Our findings include the performance scores for both base models and instruct models.\\n\\nWe begin by directly comparing the performance of the instruct models, which can serve for downstream applications such as chat and coding. We present the performance results of Qwen2.5-Max alongside leading state-of-the-art models, including DeepSeek V3, GPT-4o, and Claude-3.5-Sonnet.\\n\\n\\nQwen2.5-Max outperforms DeepSeek V3 in benchmarks such as Arena-Hard, LiveBench, LiveCodeBench, and GPQA-Diamond, while also demonstrating competitive results in other assessments, including MMLU-Pro.\\n\\nWhen comparing base models, we are unable to access the proprietary models such as GPT-4o and Claude-3.5-Sonnet. Therefore, we evaluate Qwen2.5-Max against DeepSeek V3, a leading open-weight MoE model, Llama-3.1-405B, the largest open-weight dense model, and Qwen2.5-72B, which is also among the top open-weight dense models. The results of this comparison are presented below.\\n\\n\\nOur base models have demonstrated significant advantages across most benchmarks, and we are optimistic that advancements in post-training techniques will elevate the next version of Qwen2.5-Max to new heights.\\n\\nUse Qwen2.5-Max[#](https://qwenlm.github.io/blog/qwen2.5-max/#use-qwen25-max)\\n-----------------------------------------------------------------------------\\n\\nNow Qwen2.5-Max is available in Qwen Chat, and you can directly chat with the model, or play with artifacts, search, etc.\\n\\nThe API of Qwen2.5-Max (whose model name is `qwen-max-2025-01-25`) is available. You can first [register an Alibaba Cloud account](https://account.alibabacloud.com/register/intl_register.htm) and activate Alibaba Cloud Model Studio service, and then navigate to the console and create an API key.\\n\\nSince the APIs of Qwen are OpenAI-API compatible, we can directly follow the common practice of using OpenAI APIs. Below is an example of using Qwen2.5-Max in Python:\\n\\n```python\\nfrom openai import OpenAI\\nimport os\\n\\nclient = OpenAI(\\n api_key=os.getenv(\\\"API_KEY\\\"),\\n base_url=\\\"https://dashscope-intl.aliyuncs.com/compatible-mode/v1\\\",\\n)\\n\\ncompletion = client.chat.completions.create(\\n model=\\\"qwen-max-2025-01-25\\\",\\n messages=[\\n {'role': 'system', 'content': 'You are a helpful assistant.'},\\n {'role': 'user', 'content': 'Which number is larger, 9.11 or 9.8?'}\\n ]\\n)\\n\\nprint(completion.choices[0].message)\\n```\\n\\nFuture Work[#](https://qwenlm.github.io/blog/qwen2.5-max/#future-work)\\n----------------------------------------------------------------------\\n\\nThe scaling of data and model size not only showcases advancements in model intelligence but also reflects our unwavering commitment to pioneering research. We are dedicated to enhancing the thinking and reasoning capabilities of large language models through the innovative application of scaled reinforcement learning. This endeavor holds the promise of enabling our models to transcend human intelligence, unlocking the potential to explore uncharted territories of knowledge and understanding.\\n\\nCitation[#](https://qwenlm.github.io/blog/qwen2.5-max/#citation)\\n================================================================\\n\\nFeel free to cite the following article if you find Qwen2.5 helpful.\\n\\n```\\n@article{qwen25,\\n title={Qwen2.5 technical report},\\n author={Qwen Team},\\n journal={arXiv preprint arXiv:2412.15115},\\n year={2024}\\n}\\n```\\n\\n© 2025 [Qwen](https://qwenlm.github.io/)Powered by [Hugo](https://gohugo.io/)[](https://qwenlm.github.io/blog/qwen2.5-max/#top \\\"Go to Top (Alt + G)\\\")\\n==================================================\"}", "timestamp": "2025-10-08T18:18:00.275031", "tool_call_id": "extract_webpage:6", "name": "extract_webpage" }, { "role": "assistant", "content": "", "timestamp": "2025-10-08T18:18:01.870272", "tool_calls": [ { "id": "extract_webpage:7", "index": 0, "type": "function", "function": { "name": "extract_webpage", "arguments": "{\"url\": \"https://moonshotai.github.io/Kimi-K2/\"}" } } ] }, { "role": "tool", "content": "{\"success\": true, \"url\": \"https://moonshotai.github.io/Kimi-K2/\", \"content\": \"🌐 网页内容 (35224 字符):\\n📍 URL: https://moonshotai.github.io/Kimi-K2/\\n==================================================\\nKimi K2: Open Agentic Intelligence\\n\\n===============\\n\\nKimi K2: Open Agentic Intelligence\\n\\nUse cases\\n\\nBenchmarking Kimi K2\\n\\nOpen Agentic Intelligence\\n\\nGetting started with Kimi K2\\n\\n\\n\\nKimi K2: Open Agentic Intelligence\\n==================================\\n\\n**Update(0905)**:We have an updated weight that enhances **agentic coding** and supports **256K context**,[see here](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)\\n\\nKimi K2 is our latest Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters. It achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. But it goes further — meticulously optimized for agentic tasks, Kimi K2 does not just answer; it acts.\\n\\nAnd now, it is within your reach. Today, we are open-sourcing:\\n\\n* **Kimi-K2-Base:** The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions. \\n* **Kimi-K2-Instruct:** The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking. \\n\\nWith Kimi K2, advanced agentic intelligence is more open and accessible than ever. We can't wait to see what you build.\\n\\n##### Agentic and Competitive Coding\\n\\n###### SWE-bench Verified\\n\\n###### SWE-bench Multilingual\\n\\n###### LiveCodeBench v6\\n\\n###### OJBench\\n\\n##### Tool Use\\n\\n###### Tau2-bench weighted average*\\n\\n###### AceBench(en)\\n\\n##### Math & STEM\\n\\n###### AIME 2025\\n\\n###### GPQA-Diamond\\n\\n- All models evaluated above are non-thinking models.\\n\\n- For Tau2-Bench, average is weighted by tasks.\\n\\n- For Swe-Bench Multilingual, we evaluated only Claude 4 Sonnet because the cost of Claude 4 Opus was prohibitive.\\n\\n### Use cases\\n\\nGive Kimi K2 your tools and describe your task.\\n\\nIt automatically understands how to use the tools and gets the job done. You don't have to write any complex workflow for agentic applications.\\n\\n###### Salary Data Analysis\\n\\nExample 1\\n\\nExample 2\\n\\nExample 3\\n\\n Chat \\n\\n Result \\n\\nUsing 2020–2025 salary data, test the effect of remote-work ratio on salary and determine whether this effect differs significantly across experience levels (EN / MI / SE / EX)—that is, whether an interaction effect exists. Support your analysis with statistical evidence and rich visualizations. Use a consistent, harmonious color palette (e.g., pastel, muted) across all plots.\\n\\nTo begin, I'll load the dataset you've provided to understand its structure and contents. This initial step is crucial for planning the subsequent analysis of how the remote-work ratio affects salary across different experience levels.\\n\\nUse ipython\\n\\nTask completed\\n\\nI will now filter the dataset to include only the years 2020 through 2025. Following that, I'll create violin plots to visualize the distribution of salaries for each experience level, segmented by their remote-work ratio. This initial visualization will provide a clear, intuitive understanding of the data before I proceed with a more rigorous statistical analysis.\\n\\nUse ipython\\n\\nTask completed\\n\\nTo prepare for a more rigorous statistical analysis, I will now categorize the `remote_ratio` into distinct groups: 'On-site', 'Hybrid', and 'Remote'. This transformation will make it easier to interpret the results. I will then generate a box plot using these new categories to visually compare the salary distributions across the different experience levels. This will help in clearly illustrating any potential interaction effects.\\n\\nUse ipython\\n\\nTask completed\\n\\nNow that I've categorized the data, I'll proceed with the statistical analysis. I'm going to run a two-way ANOVA to test for both the main effects of the remote-work category and experience level, as well as their interaction. This will provide us with concrete statistical evidence to determine if the relationship between remote work and salary changes significantly across different experience levels.\\n\\nUse ipython\\n\\nTask completed\\n\\nIt appears the `statsmodels` library isn't available in this environment. As an alternative, I will use the `scipy.stats` library to perform a two-way ANOVA. This will allow me to test for both the main effects and the interaction between the remote-work category and experience level, providing the statistical evidence needed to answer your question.\\n\\nUse ipython\\n\\nTask completed\\n\\nSince the initial ANOVA attempt was inconclusive due to insufficient data in some groups, I'll take a different approach. I will now perform pairwise comparisons, conducting t-tests between the salary distributions for on-site and remote work within each experience level. This method will allow me to statistically assess if the effect of remote work on salary is significant for each category of experience, providing a clearer picture of any interaction effects.\\n\\nUse ipython\\n\\nTask completed\\n\\nTo better understand the salary trends, I will now calculate the average salary for each combination of experience level and remote-work category. This will provide a clear summary of the central tendency within each group. I will then visualize this summary using a bar plot, which will make it easier to compare the salary differences and identify any potential interaction effects between experience and remote work.\\n\\nUse ipython\\n\\nTask completed\\n\\nI will now visualize the interaction between remote-work categories and experience levels. To do this, I'll create a bar plot showing the average salary for each group. This will allow for a direct comparison, making it easy to see if the salary differences between remote, hybrid, and on-site work vary significantly across entry-level, mid-level, senior, and expert employees.\\n\\nUse ipython\\n\\nTask completed\\n\\nNow that I've visualized the interaction, I'll quantify it by calculating the percentage salary difference between remote and on-site work for each experience level. This will provide a clear, numerical measure of how the effect of remote work varies across the different job categories, directly addressing the core of your question about interaction effects.\\n\\nUse ipython\\n\\nTask completed\\n\\nI'll now create a more detailed visualization to better understand the interaction. My plan is to generate a scatter plot of individual salaries, distinguishing between remote and on-site employees with different colors. To make the comparison clearer, I'll add horizontal lines representing the average salary for each group. This will provide a direct visual comparison of the salary distributions and help illustrate the interaction effect between experience level and remote work.\\n\\nUse ipython\\n\\nTask completed\\n\\nI'll now create a more polished visualization to clearly illustrate the interaction effect. My plan is to generate a scatter plot showing the average salary for each combination of experience level and remote-work category. I'll use distinct colors and markers for each remote-work group and connect the data points with lines. This will create a clear and effective interaction plot, making it easy to see if the salary lines for remote, hybrid, and on-site work diverge across different experience levels.\\n\\nUse ipython\\n\\nTask completed\\n\\nI'll now create a final summary visualization to clearly present the interaction effect. I'm going to generate a bar plot showing the percentage salary difference between remote and on-site work for each experience level. This will provide a direct and easy-to-understand measure of how the salary benefit of remote work changes across entry, mid, senior, and expert employees, effectively illustrating the key finding of our analysis.\\n\\nUse ipython\\n\\nTask completed\\n\\nI'll now create a final summary visualization to clearly present the interaction effect. I'm going to generate a bar plot showing the percentage salary difference between remote and on-site work for each experience level. This will provide a direct and easy-to-understand measure of how the salary benefit of remote work changes across entry, mid, senior, and expert employees, effectively illustrating the key finding of our analysis.\\n\\nUse ipython\\n\\nTask completed\\n\\n### Summary of Findings\\n\\nUsing 2020–2025 salary data, I tested the effect of remote-work ratio on salary and whether this effect differs significantly across experience levels (EN / MI / SE / EX). The analysis reveals a **significant interaction effect** between remote-work ratio and experience level on salary.\\n\\n#### Key Statistical Evidence\\n\\n* **Interaction Effect**: The impact of remote work on salary is not uniform across experience levels. Specifically: \\n * **Entry-Level Employees**: Remote workers earn **8.0% more** on average than on-site workers (p ≈ 0.20, not significant). \\n * **Mid-Level Employees**: Remote workers earn **8.5% less** on average than on-site workers (p = 0.003, highly significant). \\n * **Senior-Level Employees**: Remote workers earn **3.5% less** on average than on-site workers (p ≈ 0.05, marginally significant). \\n * **Expert-Level Employees**: Remote workers earn **4.6% more** on average than on-site workers (p ≈ 0.61, not significant). \\n\\n#### Visual Analysis\\n\\nThe interaction plot and bar charts illustrate that the salary premium or penalty associated with remote work varies by experience level. Notably, mid-level employees show a significant salary disadvantage when working remotely, while entry-level and expert-level employees tend to have a salary advantage (though not statistically significant in this dataset).\\n\\n#### Conclusion\\n\\nThe analysis confirms a **significant interaction effect** between remote-work ratio and experience level on salary. The effect of remote work on salary differs across experience levels, with mid-level employees experiencing a notable salary penalty when working remotely, while entry-level and expert-level employees may benefit from remote work arrangements.\\n\\nPlease create a webpage using HTML, to show the statistical process and conclusions, using an attractive color scheme (e.g., pastel, muted, top magazine style). Embed all the previously generated images. Additionally, include a simulator where I can enter my personal details to get a recommendation on whether I should choose remote work.\\n\\nI'll create a comprehensive webpage that presents the statistical analysis and includes a personal simulator. Let me build this with an attractive, professional design using a harmonious color palette.\\n\\nUse ipython\\n\\nTask completed\\n\\nLet me fix the HTML content and create a proper webpage:\\n\\nUse ipython\\n\\nTask completed\\n\\nNow let me display the HTML content directly so you can see the webpage:\\n\\nUse ipython\\n\\nTask completed\\n\\nPerfect! I've created a comprehensive HTML webpage that presents the remote work salary analysis with the following features:\\n\\n🎨 **Design Features:**\\n-----------------------\\n\\n* **Attractive pastel/muted color palette** with harmonious gradients \\n* **Professional, magazine-style layout** with clean typography \\n* **Responsive design** that works on all devices \\n* **Consistent visual hierarchy** with proper spacing and sections \\n\\n📊 **Analysis Content:**\\n------------------------\\n\\n* **Executive Summary** with key statistics in beautiful stat cards \\n* **Statistical Process** explaining the methodology \\n* **Visual Analysis** sections with placeholders for the generated plots \\n* **Key Findings** with detailed breakdowns \\n\\n🧮 **Interactive Simulator:**\\n-----------------------------\\n\\n* **Personalized input fields** for experience level, current salary, work preference, and industry \\n* **Real-time calculation** based on the statistical analysis results \\n* **Color-coded recommendations** (green=positive, yellow=caution, red=negative) \\n* **Detailed projections** showing salary impact and recommendations \\n\\n📈 **Key Statistical Insights Highlighted:**\\n--------------------------------------------\\n\\n* Entry-Level: +8.0% remote work premium \\n* Mid-Level: -8.5% remote work penalty \\n* Senior-Level: -3.5% remote work penalty \\n* Expert-Level: +4.6% remote work premium \\n\\nThe webpage successfully demonstrates the **significant interaction effect** between remote work ratio and experience level on salary, with clear visual presentation and an interactive tool for personalized recommendations. The design uses a sophisticated color scheme that enhances readability while maintaining visual appeal.\\n\\nImagine using Kimi K2 to explore remote-work salaries with the Salary Data Analysis example, where 16 IPython calls generate stats, visualizations, and an interactive webpage of insights. Dive into the Stanford NLP Genealogy, and get Kimi K2 to generate an interactive site built through 5 web searches, 4 browsings, 3 clicks, 5 scrolls, 6 edits, and 2 deployments. Or plan your dream Coldplay Tour 2025 in London with Kimi K2, it crafted the plan for you through 17 seamless tool calls spanning search, calendar, Gmail, flights, Airbnb, and restaurant bookings.\\n\\nBring Kimi K2 to your command line. It edits files. It runs commands.\\n\\nKimi K2 understands your environment, decides what actions to take, and executes them seamlessly.\\n\\nJavaScript Minecraft\\n\\nExample 1\\n\\nExample 2\\n\\nExample 3\\n\\n Terminal \\n\\n Result \\n\\n> Write a web version of 3D minecraft. - Trees - Make the textures clearer, use functions to hand-draw the textures yourself, in the Kandinsky style. Don't just write a link, it won't load. - Make the ground smoother\\n\\nFor clarity, the terminal displays only a summary of each trajectory. In every example, Kimi K2 orchestrates multiple tools and commands behind the scenes to accomplish complex objectives. For instance, Kimi K2 can automate Minecraft development in JavaScript: it manages rendering, runs and debugs test cases, captures logs on failure, and iteratively improves the code until all tests succeed. For pre/post norm analysis, Kimi K2 uses the Weights & Biases (wandb) data reader to extract insights from language model experiments and generates a polished analysis report. When converting a Flask project to Rust, Kimi K2 systematically refactors the codebase and runs performance benchmarks to ensure robust results.\\n\\n### Benchmarking Kimi K2\\n\\n###### Evaluation Results\\n\\nKimi-K2-Instruct\\n\\nKimi-K2-Base\\n\\nThe table below details the performance of Kimi-K2-Instruct, showing that it matches—or outperforms—the latest open-source and proprietary models across a diverse set of tasks. The model shines on knowledge-intensive and reasoning benchmarks, delivering outstanding results in natural-language understanding, mathematics and sciences, code generation, and agentic tool uses.\\n\\n| | | Open Source | Proprietary |\\n| --- | --- | --- | --- |\\n| Benchmark | Metric | Kimi-K2-Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B (Non-thinking) | Claude Sonnet 4 (w/o extended thinking) | Claude Opus 4 (w/o extended thinking) | GPT-4.1 | Gemini 2.5 Flash Preview (05-20) |\\n| Coding Tasks |\\n| LiveCodeBench v6(Aug 24-May 25) | Pass@1 | 53.7 | 46.9 | 37.0 | 48.5 | 47.4 | 44.7 | 44.7 |\\n| OJBench | Pass@1 | 27.1 | 24.0 | 11.3 | 15.3 | 19.6 | 19.5 | 19.5 |\\n| MultiPL-E | Pass@1 | 85.7 | 83.1 | 78.2 | 88.6 | 89.6 | 86.7 | 85.6 |\\n| SWE-bench Verified (Agentless Coding) | Single Patch without Test (Acc) | 51.8 | 36.6 | 39.4 | 50.2 | 53.0 | 40.8 | 32.6 |\\n| SWE-bench Verified (Agentic Coding) | Single Attempt (Acc) | 65.8 | 38.8 | 34.4 | 72.7* | 72.5* | 54.6 | — |\\n| Multiple Attempts (Acc) | 71.6 | — | — | 80.2* | 79.4* | — | — |\\n| SWE-bench Multilingual(Agentic Coding) | Single Attempt (Acc) | 47.3 | 25.8 | 20.9 | 51.0 | — | 31.5 | — |\\n| TerminalBench | Inhouse Framework (Acc) | 30.0 | — | — | 35.5 | 43.2 | 8.3 | — |\\n| Terminus (Acc) | 25.0 | 16.3 | 6.6 | — | — | 30.3 | 16.8 |\\n| Aider-Polyglot | Acc | 60.0 | 55.1 | 61.8 | 56.4 | 70.7 | 52.4 | 44.0 |\\n| Tool Use Tasks |\\n| Tau2 retail | Avg@4 | 70.6 | 69.1 | 57.0 | 75.0 | 81.8 | 74.8 | 64.3 |\\n| Tau2 airline | Avg@4 | 56.5 | 39.0 | 26.5 | 55.5 | 60.0 | 54.5 | 42.5 |\\n| Tau2 telecom | Avg@4 | 65.8 | 32.5 | 22.1 | 45.2 | 57.0 | 38.6 | 16.9 |\\n| AceBench | Acc | 76.5 | 72.7 | 70.5 | 76.2 | 75.6 | 80.1 | 74.5 |\\n| Math & STEM Tasks |\\n| AIME 2024 | Avg@64 | 69.6 | 59.4* | 40.1* | 43.4 | 48.2 | 46.5 | 61.3 |\\n| AIME 2025 | Avg@64 | 49.5 | 46.7 | 24.7* | 33.1* | 33.9* | 37.0 | 46.6 |\\n| MATH-500 | Acc | 97.4 | 94.0* | 91.2* | 94.0 | 94.4 | 92.4 | 95.4 |\\n| HMMT 2025 | Avg@32 | 38.8 | 27.5 | 11.9 | 15.9 | 15.9 | 19.4 | 34.7 |\\n| CNMO 2024 | Avg@16 | 74.3 | 74.7 | 48.6 | 60.4 | 57.6 | 56.6 | 75.0 |\\n| PolyMath-en | Avg@4 | 65.1 | 59.5 | 51.9 | 52.8 | 49.8 | 54.0 | 49.9 |\\n| ZebraLogic | Acc | 89.0 | 84.0 | 37.7* | 79.7 | 59.3 | 58.5 | 57.9 |\\n| AutoLogi | Acc | 89.5 | 88.9 | 83.3* | 89.8 | 86.1 | 88.2 | 84.1 |\\n| GPQA-Diamond | Avg@8 | 75.1 | 68.4* | 62.9* | 70.0* | 74.9* | 66.3 | 68.2 |\\n| SuperGPQA | Acc | 57.2 | 53.7 | 50.2 | 55.7 | 56.5 | 50.8 | 49.6 |\\n| Humanity's Last Exam (Text Only) | Acc | 4.7 | 5.2 | 5.7 | 5.8 | 7.1 | 3.7 | 5.6 |\\n| General Tasks |\\n| MMLU | EM | 89.5 | 89.4 | 87.0 | 91.5 | 92.9 | 90.4 | 90.1 |\\n| MMLU-Redux | EM | 92.7 | 90.5 | 89.2* | 93.6 | 94.2 | 92.4 | 90.6 |\\n| MMLU-Pro | EM | 81.1 | 81.2* | 77.3 | 83.7 | 86.6 | 81.8 | 79.4 |\\n| IFEval | Prompt Strict | 89.8 | 81.1 | 83.2* | 87.6 | 87.4 | 88.0 | 84.3 |\\n| Multi-Challenge | Acc | 54.1 | 31.4 | 34.0 | 46.8 | 49.0 | 36.4 | 39.5 |\\n| SimpleQA | Correct | 31.0 | 27.7 | 13.2 | 15.9 | 22.8 | 42.3 | 23.3 |\\n| Livebench(2024/11/25) | Pass@1 | 76.4 | 72.4 | 67.6 | 74.8 | 74.6 | 69.8 | 67.8 |\\n\\n- All models evaluated above are non-thinking models.\\n\\n- Bold denotes global SOTA, and underlined denotes open-source SOTA.\\n\\n- Data points marked with * are taken directly from the model's tech report or blog.\\n\\n- All metrics, except for SWE-bench Verified (Agentless), are evaluated with an 8k output token length. SWE-bench Verified (Agentless) is limited to a 16k output token length.\\n\\n- Kimi K2 achieves 65.8% pass@1 on the SWE-bench Verified tests with bash/editor tools (single-attempt patches, no test-time compute). It also achieves a 47.3% pass@1 on the SWE-bench Multilingual tests under the same conditions. Additionally, we report results on SWE-bench Verified tests (71.6%) that leverage parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.\\n\\n- To ensure the stability of the evaluation, we employed avg@k on the AIME, HMMT, CNMO, PolyMath-en, GPQA-Diamond, EvalPlus, Tau2.\\n\\n- Some data points have been omitted due to prohibitively expensive evaluation costs.\\n\\n### Open Agentic Intelligence\\n\\nPre-training is the crucial foundation for [Agentic Intelligence](https://ysymyth.github.io/The-Second-Half/), establishing the priors that makes reinforcement learning (RL) exploration tractable, efficient, and generalizable. However, as Ilya Sutskever also observes, human data is a finite \\\"fossil fuel\\\", and its growth is lagging far behind the pace of compute. This makes **token efficiency** during pre-training a new critical coefficient in the AI scaling laws.\\n\\nPost-training is pivotal in the \\\"[Era of Experience](https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf)\\\" (David Silver, Richard Sutton, 2025). In this era, LLMs increasingly learn from their own self-generated interactions, receiving rewards that free them from the limits of human data and enable them to surpass human capabilities.\\n\\nKimi K2 is forged from these very insights.\\n\\n#### MuonClip Optimizer\\n\\nWithout rigor, given an approximately finite pretraining dataset and a fixed model configuration, a more token-efficient optimizer generates more intelligence. Our previous work [Moonlight](https://github.com/MoonshotAI/Moonlight) has demonstrated that the [Muon](https://kellerjordan.github.io/posts/muon/) optimizer substantially outperforms the widely-used AdamW optimizer for LLM training.\\n\\nKimi K2 was designed to further scale up Moonlight, which employs an architecture similar to DeepSeek-V3. Based on scaling-law analysis, we reduce the number of heads for long-context efficiency, and increase MoE sparsity for greater token efficiency. While scaling up, we encountered a persistent challenge: training instability caused by exploding attention logits, an issue that occurs more frequently with Muon but less with AdamW in our experiments. Existing solutions such as logit soft-capping and query-key normalization were found inadequate.\\n\\nTo address this, we introduce the MuonClip optimizer that improves Muon with our proposed qk-clip technique. Specifically, qk-clip stabilizes training by directly rescaling the weight matrices of the query and key projections after Muon updates, thus controlling the scale of attention logits at the source. Concretely, the query and key projections are scaled as follows:\\n\\nq i=η α W qx ik i=η 1−α W kx i\\n\\nwhere α is a balancing hyperparameter, so the attention logit becomes:\\n\\n(η α q i)⊤(η 1−α k j)=η q i⊤k j\\n\\nThe adaptive factor η (with threshold t) is set after every step based on the max attention logit in this step:\\n\\nη=min(i,j max(q i⊤k j)t,1)\\n\\nwhere t is a pre-set threshold. This is a general technique that can be possibly applied to other stabilization use cases.\\n\\nOur experiments show that MuonClip effectively prevents logit explosions while maintaining downstream task performance. In practice, Kimi K2 was pre-trained on 15.5T tokens using MuonClip with zero training spike, demonstrating MuonClip as a robust solution for stable, large-scale LLM training.\\n\\n\\n\\n#### Agentic Capabilities\\n\\nThe enhanced agentic capabilities of Kimi K2 originate from two important aspects — large-scale agentic data synthesis and general reinforcement learning.\\n\\n**Large-Scale Agentic Data Synthesis for Tool Use Learning:** To teach the model sophisticated tool-use capabilities, we developed a comprehensive pipeline inspired by ACEBench that simulates real-world tool-using scenarios at scale. Our approach systematically evolves hundreds of domains containing thousands of tools—including both real MCP (Model Context Protocol) tools and synthetic ones—then generates hundreds of agents with diverse tool sets.\\n\\nAll tasks are rubric-based, enabling consistent evaluation. Agents interact with simulated environments and user agents, creating realistic multi-turn tool-use scenarios. An LLM judge evaluates simulation results against task rubrics, filtering for high-quality training data. This scalable pipeline generates diverse, high-quality data, paving the way for large-scale rejection sampling and reinforcement learning.\\n\\n\\n\\n**General Reinforcement Learning:** The key challenge is to apply RL to tasks with both verifiable and non-verifiable rewards; typical examples of verifiable tasks are math and competition coding, while writing a research report is usually viewed as non-verifiable. Going beyond verifiable rewards, our general RL system uses a self-judging mechanism where the model acts as its own critic, providing scalable, rubric-based feedback for non-verifiable tasks.\\n\\nMeanwhile, on-policy rollouts with verifiable rewards are used to continuously update the critic so that the critic keeps improving its evaluation accuracy on the latest policy. This can be viewed as a way of using verifiable rewards to improve the estimation of non-verifiable rewards.\\n\\n### Getting started with Kimi K2\\n\\n#### Try Kimi K2 on [kimi.com](https://www.kimi.com/)\\n\\nStarting today, Kimi users on web and mobile can select and use the new Kimi K2 model for free. At this moment, our MCP features for web and app are still in development. We hope to begin rolling them out in the coming weeks. In the meantime, you’re welcome to try our Researcher for an early look at its agentic capabilities. Please note that vision features are not supported for Kimi K2 yet.\\n\\n#### Use Kimi K2 with API\\n\\nThe Kimi Platform offers an OpenAI/Anthropic compatible interface, allowing for easy adaptation of your existing applications to Kimi K2. We encourage developers to explore our tool calling API for building agent applications. For detailed information, visit [platform.moonshot.ai](https://platform.moonshot.ai/).\\n\\n#### Serve Kimi K2 on your own\\n\\nWe recommend running Kimi K2 on one of the following inference engines: vLLM, SGLang, KTransformers, or TensorRT-LLM. For detailed deployment instructions, please see our [GitHub repository](https://github.com/MoonshotAI/Kimi-K2?tab=readme-ov-file#4-deployment).\\n\\n#### What's next\\n\\nWhile Kimi K2 serves as a strong foundation for open agentic intelligence, a general agent uses more advanced capabilities such as thinking and visual understanding. We plan to add these to Kimi K2 in the future.\\n\\n#### Limitations\\n\\nIn our internal tests, we've identified some limitations in current Kimi K2 models. When dealing with hard reasoning tasks or unclear tool definition, the model may generate excessive tokens, sometimes leading to truncated outputs or incomplete tool calls. Additionally, performance may decline on certain tasks if tool use is enabled. When building complete software projects, one-shot prompting yields performance degradation compared to using K2 under an agentic framework. We are working to address these issues in future releases and looking forward to more feedbacks.\\n\\n# Kimi K2: Open Agentic Intelligence **Update(0905)**:We have an updated weight that enhances **agentic coding** and supports **256K context**,[see here](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) Kimi K2 is our latest Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters. It achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. But it goes further — meticulously optimized for agentic tasks, Kimi K2 does not just answer; it acts. And now, it is within your reach. Today, we are open-sourcing: - **Kimi-K2-Base:** The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions. - **Kimi-K2-Instruct:** The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking. With Kimi K2, advanced agentic intelligence is more open and accessible than ever. We can't wait to see what you build. ```chart compare 这里是数据对比 ``` ### Use cases Give Kimi K2 your tools and describe your task. It automatically understands how to use the tools and gets the job done. You don't have to write any complex workflow for agentic applications. ```bash chatCompare 这块用于写 Kimi 那个 Case ``` Imagine using Kimi K2 to explore remote-work salaries with the Salary Data Analysis example, where 16 IPython calls generate stats, visualizations, and an interactive webpage of insights. Dive into the Stanford NLP Genealogy, and get Kimi K2 to generate an interactive site built through 5 web searches, 4 browsings, 3 clicks, 5 scrolls, 6 edits, and 2 deployments. Or plan your dream Coldplay Tour 2025 in London with Kimi K2, it crafted the plan for you through 17 seamless tool calls spanning search, calendar, Gmail, flights, Airbnb, and restaurant bookings. Bring Kimi K2 to your command line. It edits files. It runs commands. Kimi K2 understands your environment, decides what actions to take, and executes them seamlessly. ```bash codeCli 这块用于写 Terminal 那个 Case ``` For clarity, the terminal displays only a summary of each trajectory. In every example, Kimi K2 orchestrates multiple tools and commands behind the scenes to accomplish complex objectives. For instance, Kimi K2 can automate Minecraft development in JavaScript: it manages rendering, runs and debugs test cases, captures logs on failure, and iteratively improves the code until all tests succeed. For pre/post norm analysis, Kimi K2 uses the Weights & Biases (wandb) data reader to extract insights from language model experiments and generates a polished analysis report. When converting a Flask project to Rust, Kimi K2 systematically refactors the codebase and runs performance benchmarks to ensure robust results. ### Benchmarking Kimi K2 ```bash tableCompare 这里是表格对比占位图 ``` ### Open Agentic Intelligence Pre-training is the crucial foundation for [Agentic Intelligence](https://ysymyth.github.io/The-Second-Half/), establishing the priors that makes reinforcement learning (RL) exploration tractable, efficient, and generalizable. However, as Ilya Sutskever also observes, human data is a finite \\\"fossil fuel\\\", and its growth is lagging far behind the pace of compute. This makes **token efficiency** during pre-training a new critical coefficient in the AI scaling laws. Post-training is pivotal in the \\\"[Era of Experience](https://storage.googleapis.com/deepmind-media/Era-of-Experience%20/The%20Era%20of%20Experience%20Paper.pdf)\\\" (David Silver, Richard Sutton, 2025). In this era, LLMs increasingly learn from their own self-generated interactions, receiving rewards that free them from the limits of human data and enable them to surpass human capabilities. Kimi K2 is forged from these very insights. #### MuonClip Optimizer Without rigor, given an approximately finite pretraining dataset and a fixed model configuration, a more token-efficient optimizer generates more intelligence. Our previous work [Moonlight](https://github.com/MoonshotAI/Moonlight) has demonstrated that the [Muon](https://kellerjordan.github.io/posts/muon/) optimizer substantially outperforms the widely-used AdamW optimizer for LLM training. Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar to DeepSeek-V3. Based on scaling-law analysis, we reduce the number of heads for long-context efficiency, and increase MoE sparsity for greater token efficiency. While scaling up, we encountered a persistent challenge: training instability caused by exploding attention logits, an issue that occurs more frequently with Muon but less with AdamW in our experiments. Existing solutions such as logit soft-capping and query-key normalization were found inadequate. To address this, we introduce the MuonClip optimizer that improves Muon with our proposed qk-clip technique. Specifically, qk-clip stabilizes training by directly rescaling the weight matrices of the query and key projections after Muon updates, thus controlling the scale of attention logits at the source. Concretely, the query and key projections are scaled as follows: $$ q_i = \\\\eta^{\\\\alpha} W_q x_i $$ $$ k_i = \\\\eta^{1-\\\\alpha} W_k x_i $$ where $\\\\alpha$ is a balancing hyperparameter, so the attention logit becomes: $$ (\\\\eta^{\\\\alpha} q_i)^\\\\top (\\\\eta^{1-\\\\alpha} k_j) = \\\\eta\\\\, q_i^\\\\top k_j $$ The adaptive factor $\\\\eta$ (with threshold $t$) is set after every step based on the max attention logit in this step: $$ \\\\eta = \\\\min(\\\\frac{t}{\\\\displaystyle\\\\max_{i,j}\\\\bigl(q_i^\\\\top k_j\\\\bigr)}, 1) $$ where $t$ is a pre-set threshold. This is a general technique that can be possibly applied to other stabilization use cases. Our experiments show that MuonClip effectively prevents logit explosions while maintaining downstream task performance. In practice, Kimi K2 was pre-trained on 15.5T tokens using MuonClip with zero training spike, demonstrating MuonClip as a robust solution for stable, large-scale LLM training. ```bash Muon 这里放 Muon 那个 loss 图 ``` #### Agentic Capabilities The enhanced agentic capabilities of Kimi K2 originate from two important aspects — large-scale agentic data synthesis and general reinforcement learning. **Large-Scale Agentic Data Synthesis for Tool Use Learning:** To teach the model sophisticated tool-use capabilities, we developed a comprehensive pipeline inspired by ACEBench that simulates real-world tool-using scenarios at scale. Our approach systematically evolves hundreds of domains containing thousands of tools—including both real MCP (Model Context Protocol) tools and synthetic ones—then generates hundreds of agents with diverse tool sets. All tasks are rubric-based, enabling consistent evaluation. Agents interact with simulated environments and user agents, creating realistic multi-turn tool-use scenarios. An LLM judge evaluates simulation results against task rubrics, filtering for high-quality training data. This scalable pipeline generates diverse, high-quality data, paving the way for large-scale rejection sampling and reinforcement learning. ```bash workflow 这里是workflow图片 ``` **General Reinforcement Learning:** The key challenge is to apply RL to tasks with both verifiable and non-verifiable rewards; typical examples of verifiable tasks are math and competition coding, while writing a research report is usually viewed as non-verifiable. Going beyond verifiable rewards, our general RL system uses a self-judging mechanism where the model acts as its own critic, providing scalable, rubric-based feedback for non-verifiable tasks. Meanwhile, on-policy rollouts with verifiable rewards are used to continuously update the critic so that the critic keeps improving its evaluation accuracy on the latest policy. This can be viewed as a way of using verifiable rewards to improve the estimation of non-verifiable rewards. ### Getting started with Kimi K2 #### Try Kimi K2 on [kimi.com](https://www.kimi.com) Starting today, Kimi users on web and mobile can select and use the new Kimi K2 model for free. At this moment, our MCP features for web and app are still in development. We hope to begin rolling them out in the coming weeks. In the meantime, you’re welcome to try our Researcher for an early look at its agentic capabilities. Please note that vision features are not supported for Kimi K2 yet. #### Use Kimi K2 with API The Kimi Platform offers an OpenAI/Anthropic compatible interface, allowing for easy adaptation of your existing applications to Kimi K2. We encourage developers to explore our tool calling API for building agent applications. For detailed information, visit [platform.moonshot.ai](https://platform.moonshot.ai). #### Serve Kimi K2 on your own We recommend running Kimi K2 on one of the following inference engines: vLLM, SGLang, KTransformers, or TensorRT-LLM. For detailed deployment instructions, please see our [GitHub repository](https://github.com/MoonshotAI/Kimi-K2?tab=readme-ov-file#4-deployment). #### What's next While Kimi K2 serves as a strong foundation for open agentic intelligence, a general agent uses more advanced capabilities such as thinking and visual understanding. We plan to add these to Kimi K2 in the future. #### Limitations In our internal tests, we've identified some limitations in current Kimi K2 models. When dealing with hard reasoning tasks or unclear tool definition, the model may generate excessive tokens, sometimes leading to truncated outputs or incomplete tool calls. Additionally, performance may decline on certain tasks if tool use is enabled. When building complete software projects, one-shot prompting yields performance degradation compared to using K2 under an agentic framework. We are working to address these issues in future releases and looking forward to more feedbacks.\\n==================================================\"}", "timestamp": "2025-10-08T18:18:05.095116", "tool_call_id": "extract_webpage:7", "name": "extract_webpage" }, { "role": "assistant", "content": "", "timestamp": "2025-10-08T18:18:09.077691", "tool_calls": [ { "id": "extract_webpage:8", "index": 0, "type": "function", "function": { "name": "extract_webpage", "arguments": "{\"url\": \"https://deepmind.google/models/gemini/pro/\"}" } } ] }, { "role": "tool", "content": "{\"success\": true, \"url\": \"https://deepmind.google/models/gemini/pro/\", \"content\": \"🌐 网页内容 (54302 字符):\\n📍 URL: https://deepmind.google/models/gemini/pro/\\n==================================================\\nGemini Pro - Google DeepMind\\n\\n===============\\n\\nBuild with our next generation AI systems\\n-----------------------------------------\\n\\n[Explore models chevron_right](https://deepmind.google/models/)\\n\\n### Gemini\\n\\nOur most intelligent AI models\\n\\n[ 2.5 Pro](https://deepmind.google/models/gemini/pro/)[ 2.5 Flash](https://deepmind.google/models/gemini/flash/)[ 2.5 Flash Image](https://deepmind.google/models/gemini/image/)[ 2.5 Flash-Lite](https://deepmind.google/models/gemini/flash-lite/)\\n\\n[Learn more](https://deepmind.google/models/gemini/)\\n\\n### Gemma\\n\\nLightweight, state-of-the-art open models\\n\\n[ Gemma 3](https://deepmind.google/models/gemma/gemma-3/)[ Gemma 3n](https://deepmind.google/models/gemma/gemma-3n/)[ ShieldGemma 2](https://deepmind.google/models/gemma/shieldgemma-2/)\\n\\n[Learn more](https://deepmind.google/models/gemma/)\\n\\n### Generative models\\n\\nImage, music and video generation models\\n\\n[ Imagen](https://deepmind.google/models/imagen/)[ Lyria](https://deepmind.google/models/lyria/)[ Veo](https://deepmind.google/models/veo/)\\n\\n### Gemini model ecosystem\\n\\nModels and experiments built with Gemini\\n\\n[ Gemini Robotics](https://deepmind.google/models/gemini-robotics/)[ Project Astra](https://deepmind.google/models/project-astra/)[ Project Mariner](https://deepmind.google/models/project-mariner/)[ Gemini Diffusion](https://deepmind.google/models/gemini-diffusion/)\\n\\nOur latest AI breakthroughs and updates from the lab\\n----------------------------------------------------\\n\\n[Explore research chevron_right](https://deepmind.google/research/)\\n\\n### Projects\\n\\nExplore some of the biggest AI innovations\\n\\n[Learn more](https://deepmind.google/research/projects/)\\n\\n### Publications\\n\\nRead a selection of our recent papers\\n\\n[Learn more](https://deepmind.google/research/publications/)\\n\\n### News\\n\\nDiscover the latest updates from our lab\\n\\n[Learn more](https://deepmind.google/discover/blog/)\\n\\nUnlocking a new era of discovery with AI\\n----------------------------------------\\n\\n[Explore science chevron_right](https://deepmind.google/science/)\\n\\n### AI for biology\\n\\n[ AlphaFold](https://deepmind.google/science/alphafold/)[ AlphaGenome](https://deepmind.google/discover/blog/alphagenome-ai-for-better-understanding-the-genome/)[ AlphaMissense](https://deepmind.google/discover/blog/a-catalogue-of-genetic-mutations-to-help-pinpoint-the-cause-of-diseases/)[ AlphaProteo](https://deepmind.google/discover/blog/alphaproteo-generates-novel-proteins-for-biology-and-health-research/)\\n\\n### AI for climate and sustainability\\n\\n[ WeatherNext](https://deepmind.google/science/weathernext/)[ Weather Lab](https://deepmind.google.com/science/weatherlab/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n### AI for mathematics and computer science\\n\\n[ AlphaEvolve](https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/)[ AlphaProof](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/)[ AlphaGeometry](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/)\\n\\n### AI for physics and chemistry\\n\\n[ GNoME](https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/)[ Fusion](https://deepmind.google/discover/blog/accelerating-fusion-science-through-learned-plasma-control/)[ AlphaQubit](https://deepmind.google/discover/blog/alphaqubit-tackles-one-of-quantum-computings-biggest-challenges/)\\n\\n### AI transparency\\n\\n[ SynthID](https://deepmind.google/science/synthid/)\\n\\nOur mission is to build AI responsibly to benefit humanity\\n----------------------------------------------------------\\n\\n[About Google DeepMind chevron_right](https://deepmind.google/about/)\\n\\n### News\\n\\nDiscover our latest AI breakthroughs, projects, and updates\\n\\n[Learn more](https://deepmind.google/discover/blog/)\\n\\n### Careers\\n\\nWe’re looking for people who want to make a real, positive impact on the world\\n\\n[Learn more](https://deepmind.google/about/careers/)\\n\\n### Milestones\\n\\nFor over 20 years, Google has worked to make AI helpful for everyone\\n\\n[Learn more](https://ai.google/advancing-ai/milestones/?section=intro&utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n### Education\\n\\nWe work to make AI more accessible to the next generation\\n\\n[Learn more](https://deepmind.google/about/education/)\\n\\n### Responsibility\\n\\nEnsuring AI safety through proactive security, even against evolving threats\\n\\n[Learn more](https://deepmind.google/about/responsibility-safety/)\\n\\n### The Podcast\\n\\nUncover the extraordinary ways AI is transforming our world\\n\\n[Learn more](https://deepmind.google/discover/the-podcast/)\\n\\nModels Research Science About \\n\\nBuild with our next generation AI systems\\n-----------------------------------------\\n\\n[Explore models chevron_right](https://deepmind.google/models/)\\n\\n### Gemini\\n\\nOur most intelligent AI models\\n\\n[ 2.5 Pro](https://deepmind.google/models/gemini/pro/)[ 2.5 Flash](https://deepmind.google/models/gemini/flash/)[ 2.5 Flash Image](https://deepmind.google/models/gemini/image/)[ 2.5 Flash-Lite](https://deepmind.google/models/gemini/flash-lite/)\\n\\n[Learn more](https://deepmind.google/models/gemini/)\\n\\n### Gemma\\n\\nLightweight, state-of-the-art open models\\n\\n[ Gemma 3](https://deepmind.google/models/gemma/gemma-3/)[ Gemma 3n](https://deepmind.google/models/gemma/gemma-3n/)[ ShieldGemma 2](https://deepmind.google/models/gemma/shieldgemma-2/)\\n\\n[Learn more](https://deepmind.google/models/gemma/)\\n\\n### Generative models\\n\\nImage, music and video generation models\\n\\n[ Imagen](https://deepmind.google/models/imagen/)[ Lyria](https://deepmind.google/models/lyria/)[ Veo](https://deepmind.google/models/veo/)\\n\\n### Gemini model ecosystem\\n\\nModels and experiments built with Gemini\\n\\n[ Gemini Robotics](https://deepmind.google/models/gemini-robotics/)[ Project Astra](https://deepmind.google/models/project-astra/)[ Project Mariner](https://deepmind.google/models/project-mariner/)[ Gemini Diffusion](https://deepmind.google/models/gemini-diffusion/)\\n\\nOur latest AI breakthroughs and updates from the lab\\n----------------------------------------------------\\n\\n[Explore research chevron_right](https://deepmind.google/research/)\\n\\n### Projects\\n\\nExplore some of the biggest AI innovations\\n\\n[Learn more](https://deepmind.google/research/projects/)\\n\\n### Publications\\n\\nRead a selection of our recent papers\\n\\n[Learn more](https://deepmind.google/research/publications/)\\n\\n### News\\n\\nDiscover the latest updates from our lab\\n\\n[Learn more](https://deepmind.google/discover/blog/)\\n\\nUnlocking a new era of discovery with AI\\n----------------------------------------\\n\\n[Explore science chevron_right](https://deepmind.google/science/)\\n\\n### AI for biology\\n\\n[ AlphaFold](https://deepmind.google/science/alphafold/)[ AlphaGenome](https://deepmind.google/discover/blog/alphagenome-ai-for-better-understanding-the-genome/)[ AlphaMissense](https://deepmind.google/discover/blog/a-catalogue-of-genetic-mutations-to-help-pinpoint-the-cause-of-diseases/)[ AlphaProteo](https://deepmind.google/discover/blog/alphaproteo-generates-novel-proteins-for-biology-and-health-research/)\\n\\n### AI for climate and sustainability\\n\\n[ WeatherNext](https://deepmind.google/science/weathernext/)[ Weather Lab](https://deepmind.google.com/science/weatherlab/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n### AI for mathematics and computer science\\n\\n[ AlphaEvolve](https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/)[ AlphaProof](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/)[ AlphaGeometry](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/)\\n\\n### AI for physics and chemistry\\n\\n[ GNoME](https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/)[ Fusion](https://deepmind.google/discover/blog/accelerating-fusion-science-through-learned-plasma-control/)[ AlphaQubit](https://deepmind.google/discover/blog/alphaqubit-tackles-one-of-quantum-computings-biggest-challenges/)\\n\\n### AI transparency\\n\\n[ SynthID](https://deepmind.google/science/synthid/)\\n\\nOur mission is to build AI responsibly to benefit humanity\\n----------------------------------------------------------\\n\\n[About Google DeepMind chevron_right](https://deepmind.google/about/)\\n\\n### News\\n\\nDiscover our latest AI breakthroughs, projects, and updates\\n\\n[Learn more](https://deepmind.google/discover/blog/)\\n\\n### Careers\\n\\nWe’re looking for people who want to make a real, positive impact on the world\\n\\n[Learn more](https://deepmind.google/about/careers/)\\n\\n### Milestones\\n\\nFor over 20 years, Google has worked to make AI helpful for everyone\\n\\n[Learn more](https://ai.google/advancing-ai/milestones/?section=intro&utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n### Education\\n\\nWe work to make AI more accessible to the next generation\\n\\n[Learn more](https://deepmind.google/about/education/)\\n\\n### Responsibility\\n\\nEnsuring AI safety through proactive security, even against evolving threats\\n\\n[Learn more](https://deepmind.google/about/responsibility-safety/)\\n\\n### The Podcast\\n\\nUncover the extraordinary ways AI is transforming our world\\n\\n[Learn more](https://deepmind.google/discover/the-podcast/)\\n\\nModels \\n\\nBuild with our next generation AI systems\\n-----------------------------------------\\n\\n[Explore models chevron_right](https://deepmind.google/models/)\\n\\n### Gemini\\n\\nOur most intelligent AI models\\n\\n[ 2.5 Pro](https://deepmind.google/models/gemini/pro/)[ 2.5 Flash](https://deepmind.google/models/gemini/flash/)[ 2.5 Flash Image](https://deepmind.google/models/gemini/image/)[ 2.5 Flash-Lite](https://deepmind.google/models/gemini/flash-lite/)\\n\\n[Learn more](https://deepmind.google/models/gemini/)\\n\\n### Gemma\\n\\nLightweight, state-of-the-art open models\\n\\n[ Gemma 3](https://deepmind.google/models/gemma/gemma-3/)[ Gemma 3n](https://deepmind.google/models/gemma/gemma-3n/)[ ShieldGemma 2](https://deepmind.google/models/gemma/shieldgemma-2/)\\n\\n[Learn more](https://deepmind.google/models/gemma/)\\n\\n### Generative models\\n\\nImage, music and video generation models\\n\\n[ Imagen](https://deepmind.google/models/imagen/)[ Lyria](https://deepmind.google/models/lyria/)[ Veo](https://deepmind.google/models/veo/)\\n\\n### Gemini model ecosystem\\n\\nModels and experiments built with Gemini\\n\\n[ Gemini Robotics](https://deepmind.google/models/gemini-robotics/)[ Project Astra](https://deepmind.google/models/project-astra/)[ Project Mariner](https://deepmind.google/models/project-mariner/)[ Gemini Diffusion](https://deepmind.google/models/gemini-diffusion/)\\n\\nResearch \\n\\nOur latest AI breakthroughs and updates from the lab\\n----------------------------------------------------\\n\\n[Explore research chevron_right](https://deepmind.google/research/)\\n\\n### Projects\\n\\nExplore some of the biggest AI innovations\\n\\n[Learn more](https://deepmind.google/research/projects/)\\n\\n### Publications\\n\\nRead a selection of our recent papers\\n\\n[Learn more](https://deepmind.google/research/publications/)\\n\\n### News\\n\\nDiscover the latest updates from our lab\\n\\n[Learn more](https://deepmind.google/discover/blog/)\\n\\nScience \\n\\nUnlocking a new era of discovery with AI\\n----------------------------------------\\n\\n[Explore science chevron_right](https://deepmind.google/science/)\\n\\n### AI for biology\\n\\n[ AlphaFold](https://deepmind.google/science/alphafold/)[ AlphaGenome](https://deepmind.google/discover/blog/alphagenome-ai-for-better-understanding-the-genome/)[ AlphaMissense](https://deepmind.google/discover/blog/a-catalogue-of-genetic-mutations-to-help-pinpoint-the-cause-of-diseases/)[ AlphaProteo](https://deepmind.google/discover/blog/alphaproteo-generates-novel-proteins-for-biology-and-health-research/)\\n\\n### AI for climate and sustainability\\n\\n[ WeatherNext](https://deepmind.google/science/weathernext/)[ Weather Lab](https://deepmind.google.com/science/weatherlab/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n### AI for mathematics and computer science\\n\\n[ AlphaEvolve](https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/)[ AlphaProof](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/)[ AlphaGeometry](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/)\\n\\n### AI for physics and chemistry\\n\\n[ GNoME](https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/)[ Fusion](https://deepmind.google/discover/blog/accelerating-fusion-science-through-learned-plasma-control/)[ AlphaQubit](https://deepmind.google/discover/blog/alphaqubit-tackles-one-of-quantum-computings-biggest-challenges/)\\n\\n### AI transparency\\n\\n[ SynthID](https://deepmind.google/science/synthid/)\\n\\nAbout \\n\\nOur mission is to build AI responsibly to benefit humanity\\n----------------------------------------------------------\\n\\n[About Google DeepMind chevron_right](https://deepmind.google/about/)\\n\\n### News\\n\\nDiscover our latest AI breakthroughs, projects, and updates\\n\\n[Learn more](https://deepmind.google/discover/blog/)\\n\\n### Careers\\n\\nWe’re looking for people who want to make a real, positive impact on the world\\n\\n[Learn more](https://deepmind.google/about/careers/)\\n\\n### Milestones\\n\\nFor over 20 years, Google has worked to make AI helpful for everyone\\n\\n[Learn more](https://ai.google/advancing-ai/milestones/?section=intro&utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n### Education\\n\\nWe work to make AI more accessible to the next generation\\n\\n[Learn more](https://deepmind.google/about/education/)\\n\\n### Responsibility\\n\\nEnsuring AI safety through proactive security, even against evolving threats\\n\\n[Learn more](https://deepmind.google/about/responsibility-safety/)\\n\\n### The Podcast\\n\\nUncover the extraordinary ways AI is transforming our world\\n\\n[Learn more](https://deepmind.google/discover/the-podcast/)\\n\\n[Build with Gemini](https://aistudio.google.com/prompts/new_chat?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)[Try Gemini](https://gemini.google.com/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n[Google DeepMind](https://deepmind.google/)\\n\\n[Google AI Learn about all of our AI](https://ai.google/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)[Google DeepMind Explore the frontier of AI](https://deepmind.google/)[Google Labs Try our AI experiments](https://labs.google/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)[Google Research Explore our research](https://research.google/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n[Gemini app Chat with Gemini](https://gemini.google.com/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)[Google AI Studio Build with our next-gen AI models](https://aistudio.google.com/prompts/new_chat?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n[Models](https://deepmind.google/models/)[Research](https://deepmind.google/research/)[Science](https://deepmind.google/science/)[About](https://deepmind.google/about/)\\n\\n[Build with Gemini](https://aistudio.google.com/prompts/new_chat?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)[Try Gemini](https://gemini.google.com/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)[](https://deepmind.google/search)\\n\\nGemini\\n======\\n\\nOur most intelligent AI models\\n\\n[Chat with Gemini](https://gemini.google.com/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)[Build with Gemini](https://aistudio.google.com/prompts/new_chat?model=gemini-2.5-pro&utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n[ Models Gemini 2.5 Flash-Lite is now ready for scaled production use](https://developers.googleblog.com/en/gemini-25-flash-lite-is-now-stable-and-generally-available)\\n\\n[ Models We’re expanding our Gemini 2.5 family of models](https://blog.google/products/gemini/gemini-2-5-model-family-expands)\\n\\n[ Models Gemini 2.5: Updates to our family of thinking models](https://developers.googleblog.com/en/gemini-2-5-thinking-model-updates)\\n\\n[ Models Advanced audio dialog and generation with Gemini 2.5](https://blog.google/technology/google-deepmind/gemini-2-5-native-audio)\\n\\n\\n\\nGemini 2.5 models are capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy.\\n\\nGemini 2.5 models are capable of reasoning through their thoughts before responding,resulting in enhanced performance and improved accuracy.\\n\\n* [Models](https://deepmind.google/models/gemini/pro/#models)\\n* [Hands-on](https://deepmind.google/models/gemini/pro/#hands-on)\\n* [Performance](https://deepmind.google/models/gemini/pro/#performance)\\n* [Safety](https://deepmind.google/models/gemini/pro/#safety)\\n* [Build](https://deepmind.google/models/gemini/pro/#build)\\n\\nModel family\\n------------\\n\\nGemini 2.5 builds on the best of Gemini — with native multimodality and a long context window.\\n\\n* [General availability 2.5 Pro Best for coding and highly complex tasks Learn more](https://deepmind.google/models/gemini/pro/)\\n* [General availability 2.5 Flash Best for fast performance on everyday tasks Learn more](https://deepmind.google/models/gemini/flash/)\\n* [General availability 2.5 Flash-Lite Best for high volume, cost efficient tasks Learn more](https://deepmind.google/models/gemini/flash-lite/)\\n\\nCreate and edit images with Gemini 2.5 Flash Image\\n--------------------------------------------------\\n\\nGenerate, transform and edit images with simple text prompts, or combine multiple images to create something new. All in Gemini.\\n\\n[Learn more](https://deepmind.google/models/gemini/image/)\\n\\n\\n\\nPause video Play video\\n\\nHands-on with Gemini 2.5\\n------------------------\\n\\nSee how Gemini 2.5 uses its reasoning capabilities to create interactive simulations and do advanced coding.\\n\\n[](https://www.youtube.com/watch?v=_CU0PkLNtwc)\\n\\n\\n\\nWatch\\n\\n### Make an interactive animation\\n\\nSee how Gemini 2.5 Pro uses its reasoning capabilities to create an interactive animation of “cosmic fish” with a simple prompt.\\n\\n[](https://www.youtube.com/watch?v=RLCBSpgos6s)\\n\\n\\n\\nWatch\\n\\n### Create your own dinosaur game\\n\\nWatch Gemini 2.5 Pro create an endless runner game, using executable code from a single line prompt.\\n\\n[](https://www.youtube.com/watch?v=0EkJQyT2ZcE)\\n\\n\\n\\nWatch\\n\\n### Code a fractal visualization\\n\\nSee how Gemini 2.5 Pro creates a simulation of intricate fractal patterns to explore a Mandelbrot set.\\n\\n[](https://www.youtube.com/watch?v=ozdV2JjXFLk)\\n\\n\\n\\nWatch\\n\\n### Plot interactive economic data\\n\\nWatch Gemini 2.5 Pro use its reasoning capabilities to create an interactive bubble chart to visualize economic and health indicators over time.\\n\\n[](https://www.youtube.com/watch?v=6SqP1WNSayU)\\n\\n\\n\\nWatch\\n\\n### Animate complex behavior\\n\\nSee how Gemini 2.5 Pro creates an interactive Javascript animation of colorful boids inside a spinning hexagon.\\n\\n[](https://www.youtube.com/watch?v=83ci60fPH1A)\\n\\n\\n\\nWatch\\n\\n### Code particle simulations\\n\\nWatch Gemini 2.5 Pro use its reasoning capabilities to create an interactive simulation of a reflection nebula.\\n\\nAdaptive and budgeted thinking\\n------------------------------\\n\\nAdaptive controls and adjustable thinking budgets allow you to balance performance and cost.\\n\\n* Calibrated\\n\\nThe model explores diverse thinking strategies, leading to more accurate and relevant outputs. \\n* Controllable\\n\\nDevelopers have fine-grained control over the model's thinking process, allowing them to manage resource usage. \\n* Adaptive\\n\\nWhen no thinking budget is set, the model assesses the complexity of a task and calibrates the amount of thinking accordingly. \\n\\nGemini 2.5 Deep Think\\n---------------------\\n\\nAn enhanced reasoning mode that uses cutting edge research techniques in parallel thinking and reinforcement learning to significantly improve Gemini’s ability to solve complex problems.\\n\\n* [Try with Google AI Ultra](https://one.google.com/about/google-ai-plans/)\\n* [View model card](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Deep-Think-Model-Card.pdf)\\n\\nDeep Think can better help tackle problems that require creativity, strategic planning, and making improvements step-by-step.\\n\\nDeep Think can better help tackle problems that require creativity,strategic planning,and making improvements step-by-step.\\n\\n* Iterative development and design\\n\\nWe’ve seen impressive results on tasks that require building something by making small changes over time. \\n* Aiding scientific and mathematical discovery\\n\\nBy reasoning through complex problems, Deep Think can act as a powerful tool for researchers. \\n* Algorithmic development and code\\n\\nDeep Think excels at tough coding problems where problem formulation and careful consideration of tradeoffs and time complexity is paramount. \\n\\nPerformance\\n\\nGemini 2.5 is state-of-the-art across a wide range of benchmarks.\\n-----------------------------------------------------------------\\n\\n[View 2.5 tech report](https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf)\\n\\n\\n\\nBenchmarks\\n----------\\n\\nIn addition to its strong performance on academic benchmarks, Gemini 2.5 tops the popular coding leaderboard WebDev Arena.\\n\\n| Benchmark | | **Gemini 2.5 Flash-Lite** Non-thinking | **Gemini 2.5 Flash-Lite** Thinking | **Gemini 2.5 Flash** Non-thinking | **Gemini 2.5 Flash** Thinking [View 2.5 Flash](https://deepmind.google/models/gemini/flash) | **Gemini 2.5 Pro** Thinking [View 2.5 Pro](https://deepmind.google/models/gemini/pro) |\\n| --- | --- | --- | --- | --- | --- | --- |\\n| **Input price** | $/1M tokens (no caching) | $0.10 | $0.10 | $0.30 | $0.30 | $1.25 $2.50 > 200k tokens |\\n| **Output price** | $/1M tokens | $0.40 | $0.40 | $2.50 | $2.50 | $10.00 $15.00 > 200k tokens |\\n| Reasoning & knowledge **Humanity's Last Exam (no tools)** | | 5.1% | 6.9% | 8.4% | 11.0% | 21.6% |\\n| Science **GPQA diamond** | | 64.6% | 66.7% | 78.3% | 82.8% | 86.4% |\\n| Mathematics **AIME 2025** | | 49.8% | 63.1% | 61.6% | 72.0% | 88.0% |\\n| Code generation **LiveCodeBench** (UI: 1/1/2025-5/1/2025) | | 33.7% | 34.3% | 41.1% | 55.4% | 69.0% |\\n| Code editing **Aider Polyglot** | | 26.7% | 27.1% | 44.0% | 56.7% | 82.2% |\\n| Agentic coding **SWE-bench Verified** | single attempt | 31.6% | 27.6% | 50.0% | 48.9% | 59.6% |\\n| | multiple attempts | 42.6% | 44.9% | 60.0% | 60.3% | 67.2% |\\n| Factuality **SimpleQA** | | 10.7% | 13.0% | 25.8% | 26.9% | 54.0% |\\n| Factuality **FACTS grounding** | | 84.1% | 86.8% | 83.4% | 85.3% | 87.8% |\\n| Visual reasoning **MMMU** | | 72.9% | 72.9% | 76.9% | 79.7% | 82.0% |\\n| Image understanding **Vibe-Eval (Reka)** | | 51.3% | 57.5% | 66.2% | 65.4% | 67.2% |\\n| Long context **MRCR v2 (8-needle)** | 128k (average) | 16.6% | 30.6% | 34.1% | 54.3% | 58.0% |\\n| | 1M (pointwise) | 4.1% | 5.4% | 16.8% | 21.0% | 16.4% |\\n| Multilingual performance **Global MMLU (Lite)** | | 81.1% | 84.5% | 85.8% | 88.4% | 89.2% |\\n\\nMethodology\\n\\nGemini results: All Gemini scores are pass @1.\\\"Single attempt\\\" settings allow no majority voting or parallel test-time compute; \\\"multiple attempts\\\" settings allow test-time selection of the candidate answer. They are all run with the AI Studio API with default sampling settings. To reduce variance, we average over multiple trials for smaller benchmarks. Aider Polyglot score is the pass rate average of 3 trials. Vibe-Eval results are reported using Gemini as a judge. Google's scaffolding for \\\"multiple attempts\\\" for SWE-Bench includes drawing multiple trajectories and re-scoring them using model's own judgement. For Aider results differ from the official leaderboard due to a difference in the settings used for evaluation (non-default).\\n\\nResult sources: Where provider numbers are not available we report numbers from leaderboards reporting results on these benchmarks: Humanity's Last Exam results are sourced from https://agi.safe.ai/ and https://scale.com/leaderboard/humanitys_last_exam, LiveCodeBench results are from https://livecodebench.github.io/leaderboard.html (1/1/2025 - 5/1/2025 in the UI), Aider Polyglot numbers come from https://aider.chat/docs/leaderboards/. FACTS come from https://www.kaggle.com/benchmarks/google/facts-grounding. For MRCR v2 which is not publically available yet we include 128k results as a cumulative score to ensure they can be comparable with other models and a pointwise value for 1M context window to show the capability of the model at full length. The methodology has changed in this table vs previously published results for MRCR v2 as we have decided to focus on a harder, 8-needle version of the benchmark going forward.\\n\\nInput and output price reflects text, image and video modalities.\\n\\nBuilding responsibly in the agentic era\\n\\nAs we develop these new technologies, we recognize the responsibility it entails, and aim to prioritize safety and security in all our efforts.\\n-----------------------------------------------------------------------------------------------------------------------------------------------\\n\\n[Learn more](https://ai.google/responsibility/principles/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n\\n\\nFor developers\\n\\nGemini’s advanced thinking, native multimodality and massive context window empowers developers to build next-generation experiences.\\n-------------------------------------------------------------------------------------------------------------------------------------\\n\\n[Start building](https://aistudio.google.com/prompts/new_chat?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n\\n\\nDeveloper ecosystem\\n-------------------\\n\\nBuild with cutting-edge generative AI models and tools to make AI helpful for everyone.\\n\\n[ Google AI Studio Build with the latest models from Google DeepMind](https://aistudio.google.com/prompts/new_chat?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)[ Gemini API Easily integrate Google’s most capable AI model to your apps](https://ai.google.dev/gemini-api/docs?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\nAccessing our latest AI models\\n------------------------------\\n\\nWe want developers to gain access to our models as quickly as possible. We’re making these available through Google AI Studio.\\n\\n[Sign in to Google AI Studio](https://aistudio.google.com/prompts/new_chat?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\nGet the latest updates\\n----------------------\\n\\nSign up for news on the latest innovations from Google DeepMind.\\n\\nEmail address \\n\\nPlease enter a valid email (e.g., \\\"name@example.com\\\")\\n\\nI accept Google's Terms and Conditions and acknowledge that my information will be used in accordance with [Google's Privacy Policy](https://policies.google.com/privacy).\\n\\nSign up\\n\\nFollow us\\n\\n[](https://x.com/googledeepmind)\\n\\n[](https://www.instagram.com/googledeepmind)\\n\\n[](https://www.youtube.com/@googledeepmind)\\n\\n[](https://www.linkedin.com/company/googledeepmind/)\\n\\n[](https://github.com/google-deepmind)\\n\\nBuild AI responsibly to benefit humanity\\n\\nModels\\n\\nBuild with our next generation AI systems\\n\\n[ Gemini](https://deepmind.google/models/gemini/)[ Gemma](https://deepmind.google/models/gemma/)[ Veo](https://deepmind.google/models/veo/)[ Imagen](https://deepmind.google/models/imagen/)[ Lyria](https://deepmind.google/models/lyria/)\\n\\nScience\\n\\nUnlocking a new era of discovery with AI\\n\\n[ AlphaFold](https://deepmind.google/science/alphafold/)[ SynthID](https://deepmind.google/science/synthid/)[ WeatherNext](https://deepmind.google/science/weathernext/)\\n\\nLearn more\\n\\n[About](https://deepmind.google/about/)[News](https://deepmind.google/discover/blog/)[Careers](https://deepmind.google/about/careers/)[Research](https://deepmind.google/research/)[Responsibility & Safety](https://deepmind.google/about/responsibility-safety/)\\n\\nSign up for updates on our latest innovations\\n\\nI accept Google's Terms and Conditions and acknowledge that my information will be used in accordance with [Google's Privacy Policy](https://policies.google.com/privacy).\\n\\n chevron_right \\n\\nPlease enter a valid email (e.g., \\\"name@example.com\\\")\\n\\n[](https://www.google.com/ \\\"Google\\\")\\n\\n[About Google](https://about.google/)\\n\\n[Google products](https://about.google/products/)\\n\\n[Privacy](https://policies.google.com/privacy)\\n\\n[Terms](https://policies.google.com/terms)\\n\\n Manage cookies \\n\\n[Gemini Pro](https://deepmind.google/models/gemini/pro/)\\n\\nGemini Pro\\n\\nGeneral availability\\n\\nGemini 2.5 Pro\\n==============\\n\\nBest for coding and highly complex tasks\\n\\n[Build with Gemini](https://aistudio.google.com/prompts/new_chat?model=gemini-2.5-pro&utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\n\\n\\nGemini 2.5 Pro is our most advanced model yet, excelling at coding and complex prompts.\\n\\nGemini 2.5 Pro is our most advanced model yet,excelling at coding and complex prompts.\\n\\nPro performance\\n---------------\\n\\n* Enhanced reasoning\\n\\nState-of-the-art in key math and science benchmarks. \\n* Advanced coding\\n\\nEasily generate code for web development tasks. \\n* Natively multimodal\\n\\nUnderstands input across text, audio, images and video. \\n* Long context\\n\\nExplore vast datasets with a 1-million token context window. \\n\\n* * *\\n\\nPreview\\n\\nNative audio\\n------------\\n\\nConverse in more expressive ways with native audio outputs that capture the subtle nuances of how we speak. Seamlessly switch between 24 languages, all with the same voice.\\n\\n* [Try in Google AI Studio](https://aistudio.google.com/live?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n* [Learn more](https://blog.google/technology/google-deepmind/gemini-2-5-native-audio/?utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=)\\n\\nUnmute video Mute video Pause video Play video\\n\\n* Natural conversation\\n\\nRemarkable quality, more appropriate expressivity, and prosody, delivered with low latency so you can converse fluidly. \\n* Style control\\n\\nUse natural language prompts to adapt the delivery within the conversation, steer it to adopt accents and produce a range of tones and expressions. \\n* Tool integration\\n\\nGemini 2.5 can use tools and function calling during dialog allowing it to incorporate real-time information or use custom developer-built tools. \\n* Conversation context awareness\\n\\nOur system is trained to discern and disregard background speech, ambient conversations and other irrelevant audio. \\n\\n* * *\\n\\nVibe-coding nature with 2.5 Pro\\n-------------------------------\\n\\nImages transformed into code-based representations of its natural behavior.\\n\\n\\n\\nWatch\\n\\nHands-on with 2.5 Pro\\n---------------------\\n\\nSee how Gemini 2.5 Pro uses its reasoning capabilities to create interactive simulations and do advanced coding.\\n\\n### Make an interactive animation\\n\\nSee how Gemini 2.5 Pro uses its reasoning capabilities to create an interactive animation of “cosmic fish” with a simple prompt.\\n\\n\\n\\nWatch\\n\\n### Create your own dinosaur game\\n\\nWatch Gemini 2.5 Pro create an endless runner game, using executable code from a single line prompt.\\n\\n\\n\\nWatch\\n\\n### Code a fractal visualization\\n\\nSee how Gemini 2.5 Pro creates a simulation of intricate fractal patterns to explore a Mandelbrot set.\\n\\n\\n\\nWatch\\n\\n### Plot interactive economic data\\n\\nWatch Gemini 2.5 Pro use its reasoning capabilities to create an interactive bubble chart to visualize economic and health indicators over time.\\n\\n\\n\\nWatch\\n\\n### Animate complex behavior\\n\\nSee how Gemini 2.5 Pro creates an interactive Javascript animation of colorful boids inside a spinning hexagon.\\n\\n\\n\\nWatch\\n\\n### Code particle simulations\\n\\nWatch Gemini 2.5 Pro use its reasoning capabilities to create an interactive simulation of a reflection nebula.\\n\\n\\n\\nWatch\\n\\n* * *\\n\\nBenchmarks\\n----------\\n\\nGemini 2.5 Pro leads common benchmarks by meaningful margins.\\n\\n| Benchmark | | **Gemini 2.5 Pro** Thinking | **OpenAI o3** High | **OpenAI o4-mini** High | **Claude Opus 4** 32k thinking | **Grok 3 Beta** Extended thinking | **DeepSeek R1** 05-28 |\\n| --- | --- | --- | --- | --- | --- | --- | --- |\\n| **Input price** | $/1M tokens (no caching) | $1.25 $2.50 > 200k tokens | $10.00 | $1.10 | $15.00 | $3.00 | $0.55 |\\n| **Output price** | $/1M tokens | $10.00 $15.00 > 200k tokens | $40.00 | $4.40 | $75.00 | $15.00 | $2.19 |\\n| Reasoning & knowledge **Humanity's Last Exam (no tools)** | | 21.6% | 20.3% | 14.3% | 10.7% | — | 14.0%* |\\n| Science **GPQA diamond** | single attempt | 86.4% | 83.3% | 81.4% | 79.6% | 80.2% | 81.0% |\\n| | multiple attempts | — | — | — | 83.3% | 84.6% | — |\\n| Mathematics **AIME 2025** | single attempt | 88.0% | 88.9% | 92.7% | 75.5% | 77.3% | 87.5% |\\n| | multiple attempts | — | — | — | 90.0% | 93.3% | — |\\n| Code generation **LiveCodeBench** (UI: 1/1/2025-5/1/2025) | single attempt | 69.0% | 72.0% | 75.8% | 51.1% | — | 70.5% |\\n| Code editing **Aider Polyglot** | | 82.2% diff-fenced | 79.6% diff | 72.0% diff | 72.0% diff | 53.3% diff | 71.6% |\\n| Agentic coding **SWE-bench Verified** | single attempt | 59.6% | 69.1% | 68.1% | 72.5% | — | — |\\n| | multiple attempts | 67.2% | — | — | 79.4% | — | 57.6% |\\n| Factuality **SimpleQA** | | 54.0% | 48.6% | 19.3% | — | 43.6% | 27.8% |\\n| Factuality **FACTS grounding** | | 87.8% | 69.6% | 62.1% | 77.7% | 74.8% | — |\\n| Visual reasoning **MMMU** | single attempt | 82.0% | 82.9% | 81.6% | 76.5% | 76.0% | no MM support |\\n| | multiple attempts | — | — | — | — | 78.0% | no MM support |\\n| Image understanding **Vibe-Eval (Reka)** | | 67.2% | — | — | — | — | no MM support |\\n| Video understanding **VideoMMMU** | | 83.6% | — | — | — | — | no MM support |\\n| Long context **MRCR v2 (8-needle)** | 128k (average) | 58.0% | 57.1% | 36.3% | — | 34.0% | — |\\n| | 1M (pointwise) | 16.4% | no support | no support | no support | no support | no support |\\n| Multilingual performance **Global MMLU (Lite)** | | 89.2% | — | — | — | — | — |\\n\\nMethodology\\n\\nGemini results: All Gemini scores are pass @1.\\\"Single attempt\\\" settings allow no majority voting or parallel test-time compute; \\\"multiple attempts\\\" settings allow test-time selection of the candidate answer. They are all run with the AI Studio API for the model-id gemini-2.5-pro-preview-06-05 with default sampling settings. To reduce variance, we average over multiple trials for smaller benchmarks. Aider Polyglot score is the pass rate average of 3 trials. Vibe-Eval results are reported using Gemini as a judge.\\n\\nNon-Gemini results: All the results for non-Gemini models are sourced from providers' self reported numbers unless mentioned otherwise below.\\n\\nAll SWE-bench Verified numbers follow official provider reports, using different scaffoldings and infrastructure. Google's scaffolding for \\\"multiple attempts\\\" for SWE-Bench includes drawing multiple trajectories and re-scoring them using model's own judgement.\\n\\nThinking vs not-thinking: For Claude 4 results are reported for the reasoning model where available (HLE, LCB, Aider). For Grok-3 all results come with extended reasoning except for SimpleQA (based on xAI reports) and Aider. For OpenAI models high level of reasoning is shown where results are available (except for GPQA, AIME 2025, SWE-Bench, FACTS, MMMU).\\n\\nSingle attempt vs multiple attempts: When two numbers are reported for the same eval higher number uses majority voting with n=64 for Grok models and internal scoring with parallel test time compute for Anthropic models.\\n\\nResult sources: Where provider numbers are not available we report numbers from leaderboards reporting results on these benchmarks: Humanity's Last Exam results are sourced from https://agi.safe.ai/ and https://scale.com/leaderboard/humanitys_last_exam, AIME 2025 numbers are sourced from https://matharena.ai/. LiveCodeBench results are from https://livecodebench.github.io/leaderboard.html (1/1/2025 - 5/1/2025 in the UI), Aider Polyglot numbers come from https://aider.chat/docs/leaderboards/. FACTS come from https://www.kaggle.com/benchmarks/google/facts-grounding. For MRCR v2 which is not publicly available yet we include 128k results as a cumulative score to ensure they can be comparable with other models and a pointwise value for 1M context window to show the capability of the model at full length. The methodology has changed in this table vs previously published results for MRCR v2 as we have decided to focus on a harder, 8-needle version of the benchmark going forward.\\n\\nAPI costs are sourced from providers' website and are current as of June 5th.\\n\\n* indicates evaluated on text problems only (without images)\\n\\nInput and output price reflects text, image and video modalities.\\n\\n* * *\\n\\nModel information\\n-----------------\\n\\n* [View 2.5 Pro model card](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Pro-Model-Card.pdf)\\n* [View 2.5 tech report](https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf)\\n\\n2.5 Pro\\nModel deployment status General availability\\nSupported data types for input Text, Image, Video, Audio, PDF\\nSupported data types for output Text\\nSupported # tokens for input 1M\\nSupported # tokens for output 64k\\nKnowledge cutoff January 2025\\nTool use Function calling\\n\\n Structured output\\n\\n Search as a tool\\n\\n Code execution\\nBest for Reasoning\\n\\n Coding\\n\\n Complex prompts\\nAvailability Gemini app\\n\\n Google AI Studio\\n\\n Gemini API\\n\\n Vertex AI\\n==================================================\"}", "timestamp": "2025-10-08T18:18:11.739744", "tool_call_id": "extract_webpage:8", "name": "extract_webpage" }, { "role": "user", "content": "你应该创建空白文件,然后使用append", "timestamp": "2025-10-08T18:18:49.301758" }, { "role": "assistant", "content": "我来为您创建一个精美的AI模型对比HTML页面。首先创建基础文件结构:", "timestamp": "2025-10-08T18:18:56.618544" }, { "role": "assistant", "content": "", "timestamp": "2025-10-08T18:18:56.624475", "tool_calls": [ { "id": "create_folder:9", "index": 0, "type": "function", "function": { "name": "create_folder", "arguments": "{\"path\": \"ai_models_comparison\"}" } } ] }, { "role": "tool", "content": "{\"success\": false, \"error\": \"文件夹已存在\"}", "timestamp": "2025-10-08T18:18:58.454497", "tool_call_id": "create_folder:9", "name": "create_folder" }, { "role": "assistant", "content": ">\\", "timestamp": "2025-10-08T18:21:19.911664" }, { "role": "assistant", "content": "", "timestamp": "2025-10-08T18:21:19.919750", "tool_calls": [ { "id": "create_file:10", "index": 0, "type": "function", "function": { "name": "create_file", "arguments": "{\"path\": \"ai_models_comparison/index.html\", \"file_type\": \"html\", \"annotation\": \"AI模型对比页面主文件\", \"content\": \"\\n\\n
\\n \\n \\n深度对比 DeepSeek V3.1、Qwen 2.5-Max、Kimi K2 和 Gemini 2.5 Pro 的性能表现
\\n深度求索
\\n阿里巴巴
\\n月之暗面
\\nGoogle DeepMind
\\n大学水平知识测试
\\n代码生成能力
\\n数学竞赛题目
\\n软件工程任务
\\n| 基准测试 | \\nDeepSeek V3.1 | \\nQwen 2.5-Max | \\nKimi K2 | \\nGemini 2.5 Pro | \\n
|---|---|---|---|---|
| MMLU-Pro | \\n81.2% | \\n81.1% | \\n81.1% | \\n89.2% | \\n
| LiveCodeBench v6 | \\n46.9% | \\n- | \\n53.7% | \\n69.0% | \\n
| AIME 2025 | \\n46.7% | \\n- | \\n49.5% | \\n88.0% | \\n
| SWE-bench Verified | \\n38.8% | \\n- | \\n65.8% | \\n59.6% | \\n
| GPQA-Diamond | \\n68.4% | \\n- | \\n75.1% | \\n86.4% | \\n