The Allens AI Australian Law Benchmark

Introduction

The Allens AI Australian Law Benchmark explores the intriguing question: can we trust AI tools to provide accurate and reliable advice on matters of Australian law?

Our comprehensive tests, conducted in February 2024, examined market-leading large language models (LLMs). We aimed to mimic a real-world scenario where individuals turn to AI for legal guidance instead of consulting human lawyers. The benchmark, developed in consultation with Linklaters, is an extension of the 'LinksAI English Law Benchmark', which we adapted for Australian law.

As we stand on the brink of a new era in legal consultancy, this project seeks both to aid the responsible integration of AI into the legal sphere and to contribute to a vital and ongoing discourse on the evolution of trust in technology. AI, including its legal applications, is advancing at an exhilarating pace. The results presented here reflect a specific moment on this rapid trajectory of progress. We anticipate that future iterations and advancements will continue to push the boundaries and redefine our understanding of what these technologies can achieve.

Key takeaways

The models we tested should not be used for Australian law legal advice without expert human supervision.

The strongest overall performer was GPT-4, followed by Perplexity. LLaMa 2, Claude 2 and Gemini 1 achieved broadly similar results.

In 2024, even the best-performing LLMs we tested were not consistently reliable when asked to answer legal questions.

For tasks that involve critical reasoning, none of the tools we tested can be relied on to produce correct legal advice without expert human supervision.

Poor citation remains a major problem for many of the models.

'Infection' by legal analysis from larger jurisdictions with different laws is a significant problem for smaller jurisdictions like Australia.

Legal teams within any business considering the use of generative AI technologies should ensure they have safeguards in place that govern how the output can be used.

AI will undoubtedly take its place not as a replacement for lawyers but as an indispensable tool to augment their capabilities.

About the research

What questions did we use?

The benchmark comprises 30 questions across 10 different practice areas. Answering them would ordinarily require advice from a competent mid-level lawyer specialised in the relevant practice area. The intention was to test whether the AI models could reasonably replicate certain tasks carried out by a human lawyer.

While our question set has some questions in common with the LinksAI English Law Benchmark, others are designed to test issues unique to the Australian law context.

Which LLMs were tested?

We tested the question set against five different models, being GPT-4, Gemini 1, Claude 2, Perplexity and LLaMa 2. We used general-purpose implementations of these LLMs, which are not specially trained or fine-tuned to provide legal advice. Our methodology therefore approximates how a lay user might attempt to carry out tasks using AI instead of a human lawyer.

How many times was each LLM tested?

In a development of the October 2023 LinksAI methodology, we put each of the 30 questions to each AI three times, starting a new session each time. LLMs use probabilistic algorithms to assemble their written output, so answers can vary from run to run (as we saw in instances where the same model's answers differed markedly each time a question was asked). Repeating each question controls for this variability, and the AI tools were then compared on their average scores across the three attempts.

How were the answers marked?

The answers were marked by senior lawyers from each practice area. Each answer was given a mark out of 10, comprising 5 marks for substance (is the answer right?), 3 for citations (is the answer supported by relevant statute, case law or regulations?) and 2 for clarity.
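
By way of illustration only, the short sketch below uses invented marks (not benchmark data) to show how the three rubric components combine into a mark out of 10, and how a model's three attempts at a question are then averaged.

```python
# Hypothetical illustration only: the marks below are invented, not taken from the benchmark.

def answer_mark(substance: float, citations: float, clarity: float) -> float:
    """Combine the rubric components into a mark out of 10:
    5 for substance, 3 for citations, 2 for clarity."""
    assert 0 <= substance <= 5 and 0 <= citations <= 3 and 0 <= clarity <= 2
    return substance + citations + clarity

# Each question was put to each model three times; models are compared on the average.
attempts = [
    answer_mark(substance=4, citations=2, clarity=2),  # attempt 1
    answer_mark(substance=3, citations=1, clarity=2),  # attempt 2
    answer_mark(substance=4, citations=0, clarity=2),  # attempt 3
]
average_mark = sum(attempts) / len(attempts)
print(f"Average mark for this question: {average_mark:.1f} / 10")
```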

How did each model perform?

The strongest overall performer was GPT-4, followed by Perplexity. LLaMa 2, Claude 2 and Gemini 1 achieved broadly similar results.

LLMs performing at the level of GPT-4 could have a practical role in assisting legal practitioners to summarise relatively well-understood areas of law. GPT-4 appears capable of, for example, preparing a sensible first draft of such advice in some cases. However, inconsistencies in the performance of even the best-performing models mean that the draft still needs careful review by someone able to verify that it is accurate and correct, and that it does not contain irrelevant or fictitious citations.

For tasks that involve critical reasoning, even the best-performing LLMs performed poorly. The models we tested should not be used for Australian law legal advice without expert human supervision. There are real risks to using them if you don’t already know the answer.

Important note: Many of the answers to these questions are wrong and lack context or nuance. They do not constitute legal advice and should not be relied on, even where they have received a positive mark. The providers of the LLMs discussed in this report do not recommend that their products be used for legal advice, and the output of those LLMs is provided on an 'as is' basis.