The Allens AI Australian law benchmark tests the capabilities of LLMs to deliver Australian law legal advice.
LLMs continue to develop at a rapid rate and could have profound implications for the future provision of legal services. This publication seeks to identify the key risks of obtaining Australian legal advice from an LLM.
A law benchmark
While the answers produced by LLMs can be superficially convincing, they are not always correct and often lack nuance and context.
Based on the methodology of the LinksAI English Law Benchmark, we have used a detailed set of benchmark questions to test the capabilities of LLMs to deliver Australian law legal advice. The benchmark is intended to deliver both an absolute assessment of performance and a relative assessment of different LLMs, both at a point in time (as a snapshot) and longitudinally (as a time-series data set built up as the exercise is repeated).
Supercharging human lawyers
Importantly, the benchmark only addresses the provision of Australian law legal advice.
There are many other potential use cases for lawyers. For example:
- Summarising longer documents, for example into a bullet-point summary.
- Extracting specific provisions from contracts and other agreements.
- Research and enhanced search to find relevant cases or laws.
- Stylistic amendment, to make a document more concise, less formal, etc.
- Ideation, to help generate concepts and ideas.
These use cases are not considered in this report.
Future iterations of the benchmark
A deeper purpose of the benchmarking process is to help assess the extent to which some of the tasks performed by human lawyers could be performed by LLMs instead.
For the reasons set out in this report, we think that the best performing LLM (GPT-4) could have a role helping qualified lawyers with some types of legal questions, such as summarising the law. However, its performance is inconsistent, it requires expert supervision, and it performs poorly on questions involving critical reasoning.
Our benchmarking has tested only general-purpose LLM tools, not specialised LLM-based legal tools. We intend to rerun the exercise in the coming months as new LLMs and other AI tools are released onto the market, including models specifically focused on the legal domain.
Questions
The benchmark questions are hard.
They are intended to be the sort of questions that might reasonably be asked of a competent mid-level associate (a lawyer with five years' post-admission experience (PAE)) who, while unlikely to know the answer immediately, would be able to produce a competent response after some research.
Which practice areas?
There were 30 questions in total, spanning 10 different practice areas: contract law, intellectual property, data privacy, employment, real estate, dispute resolution, corporate, competition, tax and banking.
A full set of questions and scores is set out in the Annexure.
Who created the questions?
For our benchmark exercise, we used a mix of questions drawn from the LinksAI English Law Benchmark (localised where necessary to Australian law) and new questions created by our bench of experts. In each area of law, we sought to include a mix of questions engaging different aspects of legal knowledge and analysis:
(a) simple research questions, such as:
'Is obesity a health condition that is capable of satisfying the definition of a disability under section 4 of the Disability Discrimination Act 1992 (Cth)?' (Q13)
(b) questions that apply the law to a set of facts, such as:
'I maintain a price index, calculated as a weighted average of the price of 20 consumer products chosen by me. I re-calculate the index value every day and publish it on my website.
My website is freely available to the public. It has come to my attention that one of my competitors is copying my index and publishing it on its own website. Is my competitor infringing my intellectual property rights?' (Q6)
(c) questions that analyse a clause, such as:
'The constitution of a small private company states that all shareholders have pre-emptive rights on the sale of existing shares. The shareholders agreement of the company contains drag-along rights, allowing shareholders who hold a majority of the shares in the company to facilitate the sale of all of the company to a buyer. Shareholders holding 60% of the shares are willing to sell their shares to a buyer but the buyer wants to buy all of the shares in the company. How can they facilitate this?' (Q17)
The rubric
In each case, the question was preceded by the following standardised rubric: 'You are an experienced Australian lawyer. Provide a concise answer to the question below, applying Australian law. Cite any relevant statutes, regulations, guidance or case law.'
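For readers who wish to reproduce this step, the sketch below shows one way each question could be combined with the rubric before being submitted to a model. It is illustrative only: the benchmark's actual tooling is not published, and the function and variable names here are our own.

```python
# Illustrative sketch only; the benchmark's actual tooling is not published.
RUBRIC = (
    "You are an experienced Australian lawyer. Provide a concise answer "
    "to the question below, applying Australian law. Cite any relevant "
    "statutes, regulations, guidance or case law."
)

def build_prompt(question: str) -> str:
    """Prepend the standardised rubric to a benchmark question."""
    return f"{RUBRIC}\n\n{question}"

# Example: wrapping benchmark question Q13.
q13 = (
    "Is obesity a health condition that is capable of satisfying the "
    "definition of a disability under section 4 of the Disability "
    "Discrimination Act 1992 (Cth)?"
)
print(build_prompt(q13))
```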
Marking scheme
We marked the answers based on substance, citations and clarity. Each answer was marked individually by our expert bench of senior lawyers.
Substance (5 marks)
We awarded a maximum of 5 marks for the substance of the answer – ie whether the answer was technically correct.
Citations (3 marks)
We awarded a maximum of 3 marks for correct references to cases, laws or guidance. A single fictitious citation automatically results in 0 marks for this category.
Clarity (2 marks)
We awarded a maximum of 2 marks for the clarity of the answer.
| Marks | Substance |
| --- | --- |
| 0 | The response is entirely wrong. |
| 1 | The response is generally wrong but contains some correct analysis. |
| 2 | The response is generally accurate but contains a number of errors. |
| 3 | The response is generally accurate but contains a small number of errors or fails to answer parts of the question. |
| 4 | The response is generally accurate and covers most issues. |
| 5 | The response is accurate and covers all material issues. |

| Marks | Citations |
| --- | --- |
| 0 | The citations are fictional. |
| 1 | The citations are incorrect. |
| 2 | The citations are generally accurate but there are important omissions. |
| 3 | Adequate and accurate citations are used. |

| Marks | Clarity |
| --- | --- |
| 0 | The response is very difficult to read. |
| 1 | The response is clear but not easy to read. |
| 2 | The response is clear and easy to read. |
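To make the arithmetic of the marking scheme concrete, the sketch below combines the three component marks into a total out of 10 and applies the fictitious-citation rule described above. It is a minimal illustration under our own assumptions; the function name and interface are hypothetical, not part of the benchmark's methodology.

```python
def total_score(substance: int, citations: int, clarity: int,
                has_fictitious_citation: bool = False) -> int:
    """Combine the three component marks into a total out of 10.

    Marks are capped per the scheme: substance 0-5, citations 0-3,
    clarity 0-2. A single fictitious citation zeroes the citations mark.
    """
    assert 0 <= substance <= 5 and 0 <= citations <= 3 and 0 <= clarity <= 2
    if has_fictitious_citation:
        citations = 0
    return substance + citations + clarity

# Example: a generally accurate answer (4/5) with adequate citations (3/3)
# that is clear and easy to read (2/2) scores 9 out of 10.
print(total_score(4, 3, 2))  # 9
```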