CLTR
- Mar 27

Why we recommend risk assessments over evaluations for AI-enabled biological tools (BTs)

By James Smith, Sophie Rose, Richard Moulange and Cassidy Nelson, CLTR

As part of our work to identify the three most beneficial next steps that the UK Government can take to reduce the biological risk posed by BTs, our team reflected on where the approach to narrow, specialised tools will need to differ from existing approaches to mitigating the risks from frontier AI.

In this post, we outline why comprehensive risk assessments, which draw on literature and stakeholder engagement to assess these tools’ capabilities, are an effective and feasible alternative to conducting evaluations.

For more detail, see our March 2024 policy paper, How the UK Government should address the misuse risk from AI-enabled biological tools.

Model evaluations are direct tests of a model’s performance that can be done in a number of ways. In the context of biological risks of frontier models, two evaluation approaches have received particular attention:

(i) automated evaluations: quantified tests that assess model performance without the need for humans to interact with models; and

(ii) red teaming: individuals or teams with different levels of expertise probe models directly to attempt to elicit harmful information.

These evaluations are an important component of the ongoing, extensive efforts by both the UK AI Safety Institute (AISI) and leading AI companies to assess risks from frontier models. They also serve as a valuable mechanism through which to implement other mitigation measures: leading AI companies have agreed, through voluntary commitments, to allow AISI to evaluate their models before they are deployed and to address identified issues.

What are the challenges associated with developing evaluations for BTs?

Despite the central role of UK Government-led model evaluations for frontier models, it will be challenging in the near term to establish analogous evaluations for BTs.

Challenge 1: It is likely impractical to design and build evaluations suitable for the range of BTs available.

BTs encompass a broad range of highly specialised tools that perform many specific functions – with different inputs, architectures and outputs – and require significant technical skill to use (see our previous work on Understanding AI-facilitated Biological Weapon Development). These differences mean that the design of BT evaluations could be very different from the design of current frontier model evaluations. For example, given a protein design tool, one might develop evaluations to test if the tool could design novel toxins, but these evaluations would not be applicable to genome assembly tools, which do not design proteins. Even for frontier models with chatbot-style interfaces, differences between models can make implementing evaluations challenging. For BTs, this will likely be even more difficult due to the advanced technical skills required to use a given model effectively, and the differences in the specific expertise needed to use different models.

Challenge 2: Even if the Government were to focus on developing evaluations only for the riskiest BTs, identifying them is also challenging:

(A) Factors that are being used to select which frontier models to evaluate – developer characteristics or compute – will not be suitable. State-of-the-art BTs are developed by a broad range of stakeholders across academia and industry, in contrast to frontier models, where state-of-the-art development is concentrated among several leading AI companies. This makes it less clear which of the many developers will create BTs that require evaluation pre-deployment. Some state-of-the-art BTs also require fairly limited training compute, so training compute is less helpful for identifying leading BTs than it is for identifying frontier models. AlphaFold-2, for example, used approximately 3 × 10^21 FLOP of training compute.

(B) We do not yet understand the relevant threat models well enough to identify models to evaluate based on the risk they present. Different BTs enable different parts of the bioweapon development risk chain. Protein design tools could enable actors to design more dangerous biological agents, whereas experimental simulation tools could reduce the amount of agent testing required. It is unclear which steps in the bioweapons development chain are the greatest bottlenecks, and therefore which would be most concerning for BTs to enable. The risk may also differ across actors and threat models; for example, lower-resourced actors could be better enabled by tools that reduce the resources needed to build known pathogens, whereas well-resourced actors might be better enabled by tools that improve novel agent design.
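The compute point in (A) can be made concrete with a little arithmetic. The sketch below is illustrative only: the threshold value and the names `HYPOTHETICAL_THRESHOLD_FLOP` and `exceeds_threshold` are assumptions introduced here, not figures or criteria from this post. Only the AlphaFold-2 estimate comes from the text above.

```python
# Approximate AlphaFold-2 training compute, as quoted in the post.
ALPHAFOLD2_TRAINING_FLOP = 3e21

# Hypothetical compute threshold for selecting models to evaluate.
# This value is an assumption for illustration, not taken from the post.
HYPOTHETICAL_THRESHOLD_FLOP = 1e26


def exceeds_threshold(training_flop: float,
                      threshold_flop: float = HYPOTHETICAL_THRESHOLD_FLOP) -> bool:
    """Return True if a model's training compute meets or exceeds the threshold."""
    return training_flop >= threshold_flop


# How many times smaller the model's compute is than the assumed threshold.
shortfall = HYPOTHETICAL_THRESHOLD_FLOP / ALPHAFOLD2_TRAINING_FLOP

print(f"Selected by compute filter: {exceeds_threshold(ALPHAFOLD2_TRAINING_FLOP)}")
print(f"Below the assumed threshold by a factor of about {shortfall:,.0f}")
```

Under this assumed threshold, a state-of-the-art BT falls tens of thousands of times short of being selected, which is why a compute-based filter alone would likely miss leading BTs.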

How do risk assessments help overcome these challenges?

Risk assessments can be done on a broader range of tools because they require fewer resources, so it is not necessary to identify a small subset of tools to evaluate.
Completing the risk assessment will improve our understanding of BT risks, helping pinpoint factors that can:
- Be used to prioritise which, if any, future models should be subject to evaluations, addressing Challenge 2(A); or
- Result in direct identification of risky sub-categories of models that should be evaluated, addressing Challenge 2(B) (see Figure 1 below).

Figure 1: Example of how risk assessment based on literature and expert engagement could inform development of evaluations.

Risk assessment based on literature and expert engagement should be done across a broad range of BT sub-categories, and may result in identification of high-risk sub-categories. If suitable information for decision making cannot be gathered from the literature and expert engagement, identified high-risk sub-categories may warrant further evaluation. Models could then be red-teamed: individuals or teams could probe models to attempt to elicit harmful information. Red-teaming results could inform the design of repeatable, automated tests (automated evaluations) for future models. For example, if red-teaming results find that a model can provide harmful information, an automated evaluation could be built to measure the ability of the model to provide that harmful information in the future. Automated evaluations may in turn identify model capabilities that warrant closer scrutiny through red-teaming. For example, if an automated evaluation shows that a model can provide harmful information in one domain, red-teamers might probe the model for similar information in another important domain.
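The first decision in the flow described above can be sketched as a small triage function. This is a minimal sketch under stated assumptions, not a procedure defined in this post: the class, field, and enum names are hypothetical, and treating sub-categories that are not flagged as high-risk as needing no evaluation is our simplification.

```python
from dataclasses import dataclass
from enum import Enum


class NextStep(Enum):
    NO_EVALUATION = "no evaluation indicated"            # assumed default
    DECIDE_FROM_ASSESSMENT = "decide from risk assessment"
    RED_TEAM = "red-team the sub-category"


@dataclass
class SubCategoryAssessment:
    """Outcome of a literature- and expert-based risk assessment (hypothetical fields)."""
    name: str
    high_risk: bool            # did the assessment flag this sub-category as high-risk?
    evidence_sufficient: bool  # is there enough information for decision making?


def next_step(assessment: SubCategoryAssessment) -> NextStep:
    """Pick the follow-up action for one BT sub-category."""
    if not assessment.high_risk:
        return NextStep.NO_EVALUATION
    if assessment.evidence_sufficient:
        # Literature and expert engagement already support a decision.
        return NextStep.DECIDE_FROM_ASSESSMENT
    # High-risk, but the assessment alone cannot support a decision.
    return NextStep.RED_TEAM


# Placeholder inputs; these are not findings about any real tool category.
examples = [
    SubCategoryAssessment("sub-category A", high_risk=False, evidence_sufficient=True),
    SubCategoryAssessment("sub-category B", high_risk=True, evidence_sufficient=True),
    SubCategoryAssessment("sub-category C", high_risk=True, evidence_sufficient=False),
]
for item in examples:
    print(f"{item.name}: {next_step(item).value}")
```

Only the last case proceeds to red-teaming, whose results could then inform automated evaluations as described above.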

Things to keep in mind when considering whether to pursue evaluation development

Although risk assessments based on scientific literature and expert engagement could help to inform future evaluations, it is unclear whether doing evaluations will be valuable or advisable.

Evaluations themselves can present biosecurity risks. For example, an evaluation may require using a tool to complete a task that is itself a step in developing a bioweapon, so creating that evaluation creates a proliferation risk by supplying information on how to carry out the task.
Evaluations will also require considerable resources and technical skills to build and implement across BTs. Where sufficient information can be gathered for decision making from the scientific literature and expert engagement, the potential to increase biosecurity risk and the additional costs are unlikely to be justified.

As such, we recommend that the need for evaluations be determined as risk assessments based on literature and expert engagement are developed and conducted.
