IBM unleashes CUGA, an open-source AI agent that actually completes more than half its tasks

IBM researchers have released an open-source AI agent called CUGA that aims to automate complex enterprise workflows and gets them right about half the time, depending on the task.

CUGA stands for Configurable Generalist Agent. Per its listing on AI platform HuggingFace, the software offers “Intelligent task automation through multi-agent orchestration, API integration, and code generation on enterprise demo applications.”

"Our vision for IBM CUGA is to develop a generalist agent that can be adapted and configured by knowledge workers to perform routine or complex aspects of their work in a safe and trustworthy manner," wrote IBM authors Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov, Ido Levy, Offer Akrabi, Aviad Sela, Asaf Adi, and Nir Mashkif in a paper [PDF] released back in July.

Not everyone is convinced that agents are safe or trustworthy. IT consultancy Gartner recently advised blocking all agentic browsers, after warning a few months ago that about 40 percent of agentic enterprise projects will be cancelled by 2027 for lack of business value.

However, the lure of automation remains strong and IBM is keen to help. Big Blue's researchers cite CUGA's performance on the WebArena and AppWorld benchmarks – a 61.7 percent success rate on web tasks and a 48.2 percent scenario completion rate on API tasks, respectively – and note that the agent's scores, though poor enough to get a human worker fired, presently represent top-tier marks for agents.

Curiously, IBM does not appear to have used its own enterprise-focused WebAgentBench benchmark to evaluate CUGA. A paper by company researchers on that homegrown test suite describes the evaluation of three agents – AgentWorkflowMemory (AWM), WorkArena-Legacy, and WebVoyager – in terms of how well they completed prompted tasks.

Those agents managed an average raw completion rate of 24.4 percent and just 15 percent for policy-compliant completions. When five or more policies were in place, the average completion rate under policy was just 7.1 percent. And enterprises commonly have more than just five policies that apply to business workflows.
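To make the gap between the two figures concrete, here is a minimal sketch of how a raw completion rate and a policy-compliant completion rate could diverge. It assumes a task counts as policy-compliant only if it finished with zero policy violations; the benchmark's exact scoring rules are not given here, and the `Episode` type and the sample data are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One benchmark run (hypothetical structure, not the benchmark's own)."""
    completed: bool          # did the agent finish the task?
    policy_violations: int   # how many policies it broke along the way


def raw_completion_rate(episodes: List[Episode]) -> float:
    """Share of episodes the agent completed, ignoring policies."""
    return sum(e.completed for e in episodes) / len(episodes)


def policy_compliant_rate(episodes: List[Episode]) -> float:
    """Share of episodes completed with no policy violations (assumed rule)."""
    ok = sum(e.completed and e.policy_violations == 0 for e in episodes)
    return ok / len(episodes)


# Invented sample: two completions, only one of them without a violation.
sample = [
    Episode(completed=True, policy_violations=0),
    Episode(completed=True, policy_violations=1),
    Episode(completed=False, policy_violations=0),
    Episode(completed=False, policy_violations=2),
]

print(raw_completion_rate(sample))    # 0.5
print(policy_compliant_rate(sample))  # 0.25
```

Under this assumed rule the policy-compliant rate can never exceed the raw rate, which is consistent with the drop from 24.4 percent to 15 percent reported above.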

"Enterprise workflows often layer dozens of concurrent policies, suggesting that the real-world shortfall will be even more pronounced and that policy-robust optimization, not just raw completion, must become the focal objective," the benchmark paper [PDF] says.

On the WebArena benchmark where CUGA scored a success rate of 61.7 percent, AWM scored just 35.5 percent.

IBM scientists earlier this year pointed out the deficiencies of various AI benchmark tests, but at least CUGA's scores suggest agents are improving.

Offered under an Apache 2.0 license, CUGA starts with a chat layer designed to discern the user's intent from a prompt. This might be "get top account by revenue from digital sales, then add it to current page," or any of the other sample prompts included with the HuggingFace demo, which simulates a small CRM system that comes with 20 preconfigured tools for making sales-related queries and API calls.

A task planning and control component analyzes prompts entered into CUGA, and breaks the goal down into a set of structured subtasks tracked in a task ledger, the authors explain. The ledger is dynamic and can re-plan when things don't go right the first time.
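A rough sketch of what a ledger with re-planning might look like follows. This is not CUGA's code: `Subtask`, `TaskLedger`, `run_ledger`, and the `execute` and `replan` callbacks are hypothetical names chosen for illustration, and the re-plan budget is an assumption added to keep the loop from running forever.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(eq=False)  # compare subtasks by identity, not by field values
class Subtask:
    description: str
    status: str = "pending"  # "pending", "done", or "failed"


@dataclass
class TaskLedger:
    goal: str
    subtasks: List[Subtask] = field(default_factory=list)

    def pending(self) -> List[Subtask]:
        return [s for s in self.subtasks if s.status == "pending"]


def run_ledger(
    ledger: TaskLedger,
    execute: Callable[[Subtask], bool],
    replan: Callable[[Subtask], List[Subtask]],
    max_replans: int = 3,
) -> TaskLedger:
    """Run pending subtasks in order; on failure, splice in replacements."""
    replans = 0
    while ledger.pending():
        task = ledger.pending()[0]
        if execute(task):
            task.status = "done"
            continue
        task.status = "failed"
        if replans >= max_replans:
            break  # give up rather than loop indefinitely
        replans += 1
        # Insert the replacement subtasks right after the failed one.
        idx = ledger.subtasks.index(task)
        ledger.subtasks[idx + 1:idx + 1] = replan(task)
    return ledger


# Toy run: the first attempt at the second step fails, the fallback succeeds.
ledger = TaskLedger(
    goal="get top account by revenue, then add it to current page",
    subtasks=[
        Subtask("look up top account by revenue"),
        Subtask("add account to page via flaky route"),
    ],
)
run_ledger(
    ledger,
    execute=lambda t: "flaky" not in t.description,
    replan=lambda t: [Subtask("add account to page via fallback route")],
)
print([(s.description, s.status) for s in ledger.subtasks])
```

In the toy run the failed step stays in the ledger as a record, and the fallback step is tracked alongside it, which is one way a ledger can stay "dynamic" without losing history.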

"Subtasks are delegated to specialized agents, such as the API agent, which uses an inner reasoning loop to generate pseudo-code instructions before invoking code in a secure sandbox," the researchers explain in a blog post. "The system leverages a tool registry that goes beyond MCP protocols to parse and understand tool capabilities, enabling precise orchestration."

Finally, the system returns what is hopefully a policy-compliant response to the user.

IBM’s devs designed CUGA to work with Langflow, a low-code platform for AI agent design, and to support various open models, such as gpt-oss-120b and Llama-4-Maverick-17B-128E-Instruct-fp8. Coincidentally, Meta, maker of Llama, is reportedly working on a follow-up model called Avocado that may not be open-source.

CUGA appears to still have a few rough spots. A recently reported bug, for example, suggests that the agent occasionally may have trouble exiting its run loop. But if you're deploying AI agent software and you expect to automate multi-step business tasks without a hitch, you might want to lower your expectations. ®

Source: The Register
