June 5, 2025

DeepSeek AI Model Training: Alarming Claims Surface Regarding Gemini Data Use


In the fast-paced world of artificial intelligence, innovation often sparks intense scrutiny, especially over how powerful AI models are built. Attention has recently turned to the Chinese lab DeepSeek following the release of an updated version of its R1 reasoning AI model. The new model performs strongly on various benchmarks, particularly in math and coding. DeepSeek, however, did not disclose the source of the model's training data, prompting speculation among AI researchers. Some believe that at least a portion of the data may have originated from Google's Gemini family of AI models.

Evidence Suggests Potential Gemini Influence

Several developers have presented observations that fuel this speculation. Sam Paech, a Melbourne-based developer, shared what he believes is evidence that DeepSeek's latest model, R1-0528, was trained on Gemini outputs. In an X post, Paech noted that the DeepSeek model seems to prefer words and expressions similar to those favored by Google's Gemini 2.5 Pro, and speculated that DeepSeek may have switched from synthetic data generated by OpenAI models to synthetic data from Gemini (a generic sketch of how such lexical comparisons are typically made appears below). Another developer, the pseudonymous creator of the SpeechMap AI evaluation tool, observed that the internal 'traces', or 'thoughts', the DeepSeek model generates while solving problems 'read like Gemini traces'. These observations are not definitive proof, but they add to the growing suspicion.

Past Accusations and AI Distillation

This is not the first time DeepSeek has faced accusations of using data from rival AI models. In December, developers noticed that DeepSeek's V3 model occasionally identified itself as ChatGPT, OpenAI's chatbot, suggesting it may have been trained on ChatGPT conversation logs. Earlier this year, OpenAI reportedly told the Financial Times it had found evidence linking DeepSeek to the use of AI distillation. Distillation is a technique in which a smaller AI model is trained to replicate the behavior of a larger, more capable model by using the larger model's outputs as training data (a minimal illustration follows below). While distillation is a known practice, OpenAI's terms of service explicitly prohibit customers from using its model outputs to build competing AI services. According to Bloomberg, Microsoft, a major OpenAI partner and investor, detected significant amounts of data being extracted through OpenAI developer accounts in late 2024; OpenAI believes these accounts are affiliated with DeepSeek. These events further fuel concerns about potential intellectual property issues in the competitive AI landscape.

Why Would DeepSeek Use Synthetic Data from Gemini?

Nathan Lambert, a researcher at the nonprofit AI research institute AI2, believes it is plausible that DeepSeek trained on data from Google's Gemini. Lambert suggested in an X post that if he were in DeepSeek's position, he would 'definitely create a ton of synthetic data from the best API model out there'. DeepSeek, he explained, is 'short on GPUs and flush with cash'. Synthetic data generated by powerful external models like Gemini effectively provides more compute for training without the need for extensive, costly hardware infrastructure, which makes the practice strategically appealing despite the ethical and legal questions it raises around terms of service.
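To make Lambert's point concrete, here is a minimal sketch of what a synthetic-data generation loop generally looks like. This is an illustration under stated assumptions, not DeepSeek's actual pipeline: the seed prompt is invented, and generate_fn is a placeholder standing in for a real API client.

```python
import json

def build_synthetic_dataset(seed_prompts, generate_fn, out_path="synthetic.jsonl"):
    """Query a stronger 'teacher' model and store (prompt, response) pairs.

    generate_fn is a placeholder for a real API client; it is stubbed
    below so this sketch runs without any network access.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in seed_prompts:
            response = generate_fn(prompt)  # one teacher call per seed prompt
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")

# Stub teacher: a real pipeline would call an external frontier model here.
stub_teacher = lambda p: f"[teacher answer to: {p}]"
build_synthetic_dataset(["Prove that the square root of 2 is irrational."], stub_teacher)
```

The economics are visible even in the stub: each teacher call converts API spend into training data, which is why a lab 'short on GPUs and flush with cash' might find the trade attractive.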
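As for the distillation technique described earlier: when only a rival model's generated text is available, distillation in practice usually amounts to supervised fine-tuning on those outputs (sometimes called sequence-level or black-box distillation). Below is a minimal sketch using Hugging Face Transformers, with GPT-2 standing in for the student model and an invented prompt/response pair standing in for teacher data.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small stand-in for a student model
student = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# Invented pair; in real black-box distillation these would be a stronger
# model's generated answers, collected at scale.
pairs = [("Q: What is 7 * 8?\nA:", " 7 * 8 = 56.")]

student.train()
for prompt, teacher_output in pairs:
    enc = tokenizer(prompt + teacher_output, return_tensors="pt")
    # Standard causal-LM loss: the student learns to reproduce the teacher's
    # text. A production pipeline would usually mask prompt tokens from the loss.
    loss = student(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```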
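Finally, the article does not say how Paech compared word preferences, but lexical-similarity claims like his are commonly backed by comparing word-frequency profiles of two models' outputs. The sketch below is a generic illustration with invented sample text, not his methodology.

```python
import math
from collections import Counter

def freq_profile(texts):
    """Relative word frequencies across a list of model outputs."""
    words = [w for text in texts for w in text.lower().split()]
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Invented outputs; a real comparison would sample thousands of responses.
model_a = ["Certainly! Let us delve into the problem step by step."]
model_b = ["Certainly! Let us delve into this question step by step."]
print(cosine_similarity(freq_profile(model_a), freq_profile(model_b)))
```

A score near 1.0 indicates heavily overlapping vocabulary; a real analysis would use far larger samples and control for vocabulary the models share for unrelated reasons.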
Challenges in AI Model Training Data

It is worth noting how difficult it is to identify training data sources definitively. Many AI models converge on similar language patterns, and even misidentify themselves, because the open web, a primary source of training data, is increasingly populated with AI-generated content. This 'contamination' makes it challenging to filter AI outputs out of training datasets. Still, according to the researchers who made the claims, the specific observations about preferred word choices and the structure of reasoning traces point to a more direct influence than general web contamination.

Industry Reactions and Countermeasures

In response to concerns about AI distillation and data scraping, AI companies are tightening security. In April, OpenAI introduced a mandatory ID verification process for access to certain advanced models, requiring a government-issued ID from a supported country (China is not currently on the list). Google has also taken steps, recently beginning to 'summarize' the detailed traces generated by models available through its AI Studio developer platform, which makes it harder for others to train rival models on Gemini's step-by-step reasoning. Anthropic announced a similar move in May, citing the need to protect its 'competitive advantages'.

Conclusion

The speculation surrounding DeepSeek's latest model training and the potential use of Gemini data highlights significant challenges in the AI industry. As models become more capable, the methods used to train them and the sources of their data face increasing scrutiny. Definitive proof remains elusive, but the developers' observations and past incidents raise important questions about data ethics, intellectual property, and the future of competitive AI development. The industry's move toward stricter access controls and data obfuscation reflects the growing tension, and the high stakes, involved in building the next generation of AI.
