The Allens AI Australian law benchmark

Methodological observations

Questioning in an interactive fashion

It is possible to get better results from these LLMs by questioning them in an interactive fashion or using better prompts.

'One-shot'

The questions were all 'one-shot', in that the question would be asked and the first answer taken. While we asked each question three times, each session was started afresh, and there was no attempt to question the LLMs we tested to clarify their answers or to correct their mistakes. There are a number of situations in which this might have helped, such as:

  • The overly high-level answer to the question about the legal threshold for 'substantial lessening of competition' under Australian merger control rules. We could have asked for more details.
  • The jurisdictional error of correctly summarising judicial consideration of a relevant UK example, without acknowledging that this was in a UK case. We could have asked whether there are any relevant differences between UK and Australian copyright law.
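
For illustration only, the sketch below shows how such a 'one-shot' protocol could be automated. It is a minimal sketch: the query_llm helper and the question list are hypothetical placeholders, not the tooling actually used for this benchmark.

  # Minimal sketch of the 'one-shot' protocol: each question is asked three times,
  # each time in a fresh session, with no follow-up questioning or correction.

  QUESTIONS = [
      "Can a pile of bricks be protected by copyright?",
      # ... further benchmark questions
  ]

  REPETITIONS = 3

  def run_benchmark(query_llm):
      # query_llm(question) is assumed to start a fresh session and return the first answer.
      results = {}
      for question in QUESTIONS:
          # The first answer is taken as-is each time; there is no clarification or coaching.
          results[question] = [query_llm(question) for _ in range(REPETITIONS)]
      return results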

Short prompts and questions

Similarly, we used a brief rubric and some of the questions were deliberately short, eg 'Can a pile of bricks be protected by copyright?'. The questions were also hard.

It is likely that the use of longer, more detailed prompts, including additional contextual information, would have generated better and more accurate results. For example, some questions could have pointed the LLMs to a specific source to help them answer.
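
By way of illustration, the fragment below contrasts one of the short questions with a hypothetical, more detailed prompt of the kind described above. The wording is ours and was not used in the benchmark.

  # Hypothetical example only: a short benchmark question versus a more detailed,
  # context-rich prompt directing the model to Australian sources.

  short_prompt = "Can a pile of bricks be protected by copyright?"

  detailed_prompt = (
      "Answer as an Australian copyright lawyer. Under the Copyright Act 1968 (Cth), "
      "and citing Australian authority only, explain whether a pile of bricks could be "
      "protected by copyright (for example, as an artistic work or a work of artistic "
      "craftsmanship), and note any relevant differences from UK case law."
  )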

Benchmarking constraints

It may be useful to explain our approach in more detail. First, we followed the principle that the methodology should keep the process as standardised and repeatable as possible.

Once the LLMs are interrogated interactively, there is a risk that the interaction starts to distort the answer, as its extent could vary dramatically from case to case. There is also a risk that the interaction simply amounts to 'coaching' the LLM towards the right outcome.

Real-world usage

Second, we think that the approach used in the benchmarking process reflects the way these LLMs are likely to be used in practice.

Prompt engineering is a relatively new skill. (The Wikipedia entry for it was only created in October 2021.) While creating very detailed, sophisticated prompts might deliver better results, not many lawyers or lay people currently have these skills (though they may well develop them over time).

Similarly, we think that in many real-world cases, users will not want to (or simply will not) provide detailed context for their prompts. Creating longer prompts with more context increases the work involved in using the LLMs and, in some cases, it is only possible to provide that extra context if you already know the answer.

Finally, for some questions – such as question 1 on the role of subjective intention when interpreting a contract – it is not clear that additional context is needed.

Machine against machine (not humans)

We also did not get a 'real' mid-level associate to provide a comparative answer, as it would be difficult to conduct that exercise fairly.

The matter of geography

The questions are all in English and relate only to Australian law. The October 2023 LinksAI report hypothesised that the performance of these LLMs when asked questions in other languages, or about other legal systems, is likely to be different, as LLMs are likely to perform worse with less common languages or legal systems. This seems to have been borne out, as we discussed on the previous page. Despite being asked to answer from an Australian law perspective, many of the responses were swayed by UK, US and EU law.

Why no 'human in the loop'?

We did not ask an actual 5-PAE lawyer to attempt these questions, even though that is the acid test for the match between human and machine. This is because of the difficulty of ensuring a fair comparison: we would need to make sure that the practice-specific lawyers selected for the exercise were not aware it is a public benchmarking exercise (which would skew the results), were of similar ability, were given the same amount of time to answer the questions and, importantly, did not consult firm colleagues to get the answers.

If I ask again, do I get a different answer?

We found that repeating the same question to the same LLM would frequently produce a different answer. For example, Claude 2’s summary of the principles a court would apply when interpreting the language of the patent claim (in answer to question 5) oscillated between referring to incorrect considerations and being generally correct. Usually, however, marks for substance were relatively consistent across each LLM's answers to the same question.

Disclaimer

No reliance should be placed on these answers, even in cases where they have been positively marked. They are for general information purposes only, and do not claim to be comprehensive or provide legal or other advice.

Similarly, we understand that the providers of the LLMs discussed in this report do not recommend that their products be used for Australian law advice, and that the output of those LLMs is provided on an 'as is' basis.

Conclusion

Like any benchmarking process, the methodology does not exactly replicate the likely real-world use of these LLMs. However, it provides a broad indication of their likely absolute performance, as well as a relative assessment of different models and their progression over time. It also helps to identify the types of questions LLMs are best placed to answer.

Benchmarks

GPT-4 is reportedly capable of passing the US Uniform Bar Examination. That is, at least on its face, at odds with GPT-4's performance in our benchmark. However, there may be more content on the internet about US law and how to pass the US Bar.

GPT-4 passes the Bar Exam

A study from 2023 concluded that GPT-4 could pass the US Uniform Bar Examination.* The US Bar Exam involves multiple-choice questions, short-form essay questions and long-form essay questions requiring the practical application of the law to a particular set of facts. In that study, GPT-4 answered nearly 75% of the Bar Exam’s multiple-choice questions correctly, outperforming the average human test taker by more than 7%.

Comment

While GPT-4 scored more than 50% in our benchmarking, it did not demonstrate the competency expected of a mid-level associate. In one sense, it did 'worse' than it did with the US Bar Exam. There are a number of potential explanations. For example, each year around 60,000 people take the Bar Exam. To prepare, they have access to a wide range of study materials, past papers and model answers. It is possible that GPT-4 has 'learned' how to answer these questions from that wealth of material. In other words, even if it has not seen the exact question before, it has seen enough similar examples to be able to predict the answers to these questions. In contrast, the answers to our benchmark questions are less prevalent on the internet (though they can mostly be found somewhere), and relate to a much smaller jurisdiction.

Alternatively, it may simply be the result of the different methodology we adopted for question design and marking. Some sections of the US Bar Exam include a significant proportion of multiple-choice questions, other sections provide lengthy fact patterns and accompanying materials, and it does not require externally sourced citations. All of these features are sharply different from the 'one-shot' questions in our benchmark. The Bar Exam also does not mark answers against the results that a mid-level associate could achieve with time and access to research materials. In that sense, we apply a higher standard, but that is necessary in order to investigate the performance of generative AI in real-life situations with real-life consequences.

On the other hand, we also included easier questions, such as whether a party’s subjective intention is used in contractual interpretation. Some of the LLMs answered them very badly as well.

Legal hallucination study

A Stanford University study published in January 2024 found a very high rate of 'hallucination' in three popular LLMs when asked to answer law-related questions. This study focused on asking questions about the procedural history of cases, not requesting legal advice, but the results are consistent with our findings, especially in relation to citation.**

Other Australian law benchmarks

We are not aware of any other benchmarks or studies into the ability of LLMs to answer Australian law questions.

A recent trial undertaken by global firm Ashurst mainly focused on other potential use cases for lawyers (which are out of the scope of this benchmark). One of the tasks tested, however, was the creation of case summaries. Ashurst's blind study showed that assessors could not consistently identify whether output had been produced by a human or by the AI tool, but case summaries produced entirely by humans were judged to be of higher quality than AI-assisted case summaries. These conclusions are consistent with our findings that LLMs perform best when summarising the law and related guidance, but do not (yet) perform to the standard of human lawyers.***


* Katz, Daniel Martin; Bommarito, Michael James; Gao, Shang; and Arredondo, Pablo, GPT-4 Passes the Bar Exam (15 March 2023), available at SSRN: https://ssrn.com/abstract=4389233.

** M Dahl et al, Hallucinating Law: Legal Mistakes with Large Language Models are Pervasive, 11 January 2024.

*** What this law firm learnt from experimenting with AI, Australian Financial Review, https://www.afr.com/work-and-careers/workplace/what-this-law-firm-learnt-from-experimenting-with-ai-20240408-p5fi2k.