On Tuesday, Anthropic announced a new initiative to fund the development of benchmarks for testing the capabilities of advanced artificial intelligence (AI) models. The AI company will fund the project and has invited interested organisations to apply. Anthropic said existing benchmarks are not sufficient to fully test the capabilities and impact of newer large language models (LLMs), so a new set of evaluations is needed, focused on AI safety, advanced capabilities, and societal impact.
Anthropic to fund new benchmarks for AI models
In a blog post, Anthropic highlighted the need for a comprehensive third-party evaluation ecosystem to overcome the limited scope of current benchmarks. Through the initiative, the AI company said it will fund third-party organisations that want to develop new evaluations for AI models that meet high standards of quality and safety.
For Anthropic, high-priority areas include tasks and questions that can measure a model's AI Safety Levels (ASL), its advanced capabilities in generating ideas and responses, and the societal impact of those capabilities.
Under the ASL category, the company highlighted several parameters, including the ability of AI models to assist in or autonomously execute cyber-attacks, the potential of models to help create or enhance knowledge about chemical, biological, radiological, and nuclear (CBRN) threats, national security risk assessment, and more.
In terms of advanced capabilities, Anthropic noted that benchmarks should be able to assess AI's potential to transform scientific research, its ability to engage with and refuse harmful requests, and its multilingual capabilities. Furthermore, the AI company said the potential of AI models to influence society needs to be understood. For this, evaluations should target concepts such as harmful biases, discrimination, over-reliance, dependence, attachment, psychological influence, economic impacts, homogenisation, and other broad societal effects.
In addition, the AI company outlined some principles for good evaluations. It said evaluations should not appear in the training data used by AI models, as they otherwise become a memorisation test. It also encouraged keeping evaluations between 1,000 and 10,000 tasks or questions, and asked organisations to use subject matter experts to create tasks that test performance in a specific domain.