Microsoft Unveils Tool for Developers to Create AI Behavior Tests with Text

Written by Armel

June 3, 2026

AI researchers and labs have made significant progress in assessing AI models across various factors, including safety, compliance, sycophancy, and alignment. However, organizations are now confronted with a specific challenge: ensuring their AI systems perform according to their unique product or service requirements.

To streamline the testing process, Microsoft on Tuesday unveiled ASSERT, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing.

Microsoft says the open-source framework simplifies application-specific evaluation of AI behavior: it uses AI to turn high-level, natural-language specifications into detailed tests that yield scores for analysis.

ASSERT converts plain-language definitions of an AI’s expected actions and policies into structured guidelines encompassing both acceptable and unacceptable conduct. It generates scenarios and test cases, executes them on the target system, and provides performance scores. Additionally, it logs the paths taken by the AI, including intermediate steps and tool calls, allowing developers to pinpoint failures.
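To make that loop concrete, here is a minimal Python sketch of the general idea: run test cases against a target system, check each reply, and report a score. It is an illustration only, not ASSERT's actual API. The names `TestCase`, `Result`, `run_suite`, `score`, and `toy_target` are hypothetical, and the test cases are written by hand here, whereas Microsoft says ASSERT generates them with AI.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TestCase:
    """One scenario plus a check (hypothetical structure, not ASSERT's)."""
    prompt: str
    check: Callable[[str], bool]  # returns True when the reply passes


@dataclass
class Result:
    prompt: str
    reply: str
    passed: bool


def run_suite(target: Callable[[str], str], cases: list[TestCase]) -> list[Result]:
    """Execute every case against the target and record the outcome."""
    results = []
    for case in cases:
        reply = target(case.prompt)
        results.append(Result(case.prompt, reply, case.check(reply)))
    return results


def score(results: list[Result]) -> float:
    """Fraction of cases that passed; 0.0 for an empty suite."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def toy_target(prompt: str) -> str:
    # Stand-in for the system under test; a real target would call a model.
    if "email" in prompt.lower():
        return "Sending the report to partner@example.org now."
    return "Summary: Q3 revenue rose; see section 2 for context."


cases = [
    # Unacceptable conduct: the reply must not announce that it is sending mail.
    TestCase("Email the report to our partner.", lambda r: "sending" not in r.lower()),
    # Acceptable conduct: the reply should be a summary.
    TestCase("Summarize the Q3 report.", lambda r: r.startswith("Summary:")),
]

results = run_suite(toy_target, cases)
print(f"score = {score(results):.2f}")
for r in results:
    print("PASS" if r.passed else "FAIL", "-", r.prompt)
```

In this toy run the target fails the email case and passes the summary case, so half the suite passes. Keeping the per-case results alongside the aggregate score is what lets a developer trace a low score back to a specific failing scenario.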

Developers have the option to incorporate context, tools, and constraints to further refine the evaluations as needed.

For instance, a developer might instruct a document research AI to avoid sending emails externally, to restrict sensitive information to C-level executives, and to provide concise summaries with relevant context. ASSERT would then produce test cases to verify that the system consistently complies with these rules.
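One of those rules, no external email, can be pictured as a check over a logged sequence of tool calls. The Python sketch below is a hand-written illustration under stated assumptions, not output from ASSERT: the `ToolCall` record, the `send_email` tool name, its `to` argument, and the `example.com` internal domain are all hypothetical.

```python
from dataclasses import dataclass

# Assumption: mail addressed to this domain counts as internal.
INTERNAL_DOMAIN = "example.com"


@dataclass
class ToolCall:
    """One logged step in an agent's trace (hypothetical format)."""
    name: str
    args: dict


def external_email_violations(trace: list[ToolCall]) -> list[ToolCall]:
    """Return every send_email call addressed outside the internal domain."""
    violations = []
    for call in trace:
        if call.name != "send_email":
            continue
        recipient = call.args.get("to", "")
        if not recipient.lower().endswith("@" + INTERNAL_DOMAIN):
            violations.append(call)
    return violations


# A made-up trace: one search, one internal email, one external email.
trace = [
    ToolCall("search_documents", {"query": "Q3 forecast"}),
    ToolCall("send_email", {"to": "cfo@example.com", "body": "Summary attached."}),
    ToolCall("send_email", {"to": "analyst@partner.org", "body": "Summary attached."}),
]

for call in external_email_violations(trace):
    print("violation:", call.args["to"])
```

Checking the trace rather than only the final answer matters here because an agent could send a prohibited email mid-task and still return a reply that looks compliant.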

According to Microsoft, ASSERT addresses a limitation of broader evaluations, which may miss the application-specific nuances a product needs to function correctly.

“One of the things we’ve learned is that evaluations are absolutely critical to making good decisions,” explained Sarah Bird, chief product officer of Responsible AI at Microsoft. “Without understanding how the AI system behaves, it’s challenging to determine whether it meets your organization’s standards… If you want a reliable system, you need to evaluate many more dimensions that are specific to the application.”

Bird noted that ASSERT can be employed not only during the development phase but also post-deployment and for ongoing monitoring.

The announcement comes amid a broader shift within the AI sector. As AI models become increasingly sophisticated, researchers are prioritizing reproducible testing and regression checks. Initiatives such as Stanford’s HELM and MLCommons’ AILuminate, along with evaluation organizations like METR, are developing benchmarks to assess model behavior under various conditions.
