The Allens AI Australian law benchmark

Answers

Summary

The LLMs we tested performed below the level of a qualified lawyer.

  • GPT-4 and LLaMa 2 provided answers that were convincing but frequently wrong, and contained incorrect citations.
  • GPT-4 performed better on some types of questions (such as those involving summarising the law) but less well on others.

We tested the following LLMs, in each case via publicly available, web-based chatbots:

  • Microsoft Prometheus, a fine-tuned version of OpenAI's GPT-4 model, tested via the Microsoft Bing Copilot chatbot with the 'Search web' option and the 'More Precise' conversation style selected.
  • Google's Gemini 1.0 Pro model (formerly Bard), tested via Gemini web chatbot.
  • Anthropic's Claude 2.1 model, tested via Claude chatbot.
  • Meta's LLaMa 2 70b model, tested via the Perplexity Labs chatbot.
  • Perplexity AI's Perplexity model, based on GPT-3.5 combined with PPLX 7b Online (built on Mistral 7b) and PPLX 70b Online (built on LLaMa 2 70b), tested via the Perplexity Ask chatbot.

Overall performance of GPT-4, Gemini 1, Claude 2, LLaMa 2 and Perplexity

[Chart: overall scores out of 10. GPT-4: 5.27; Perplexity: 4.27; Claude 2: 3.99; LLaMa 2: 3.74; Gemini 1: 3.64.]

Comparison of LLMs

The benchmarking exercise provides an opportunity to test the relative performance of different AI models. The overall performance of Claude 2, LLaMa 2 and Gemini 1 was similar; GPT-4 and LLaMa 2 did the best on substance.

Hybrid model Perplexity performed relatively well

The Perplexity model was released in December 2022 by Perplexity AI. The model is based on OpenAI's GPT-3.5 model, combined with Perplexity AI's own PPLX 7b Online and PPLX 70b Online LLMs, which are built on top of the open-source Mistral 7b and LLaMa 2 70b models. Perplexity is described as having the ability to provide up-to-date information and avoid 'generating inaccurate facts'.

We found that Perplexity achieved the second-highest overall score (4.27 out of 10) and outperformed most of the other models on citations (1.3 out of 3).

LLaMa 2, Claude 2 and Gemini 1 suffered on citation

We ran the questions through the publicly available version of Google's Gemini 1 (Gemini 1.0 Pro). The chatbot is powered by the Gemini family of LLMs and was originally released under the name 'Bard' on 21 March 2023.

Released by Anthropic on 14 March 2023, Claude is a generative AI chatbot powered by the LLM of the same name. For our benchmarking, Claude 2.1 was used.

We also tested the LLaMa 2 70b chat model (via Perplexity Labs), which is part of the family of autoregressive LLMs that Meta released in July 2023.

Gemini 1 scored the lowest overall (3.64 out of 10), LLaMa 2 slightly better (3.74 out of 10) and Claude 2 slightly better still (3.99 out of 10). LLaMa 2 suffered particularly low citation scores (0.61 out of 3), whereas on substance LLaMa 2 (2.03 out of 5) scored better than Gemini 1 (1.86 out of 5), Claude 2 (1.73 out of 5) and Perplexity (1.62 out of 5).

GPT-4 did the best

We ran the questions through OpenAI's GPT-4 (via its implementation as Microsoft Bing Copilot, Microsoft's model built on OpenAI's GPT-4, originally released to the public as 'Bing Chat' on 7 February 2023). For our benchmark, we used the Copilot chatbot built into Microsoft Bing, with the 'Search web' function and the 'More Precise' conversation style selected.

GPT-4 scored the highest overall (5.27 out of 10). It almost achieved a pass mark on substance (2.38 out of 5), and was less prone to incorrect or fictitious citations (1.37 out of 3).

GPT-4’s top ranking is relative. It was still frequently wrong on substance and incorrect on citation, and so should only be used for Australian law advice with expert human supervision.
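The component and overall scores above follow a simple additive rubric: substance out of 5, citations out of 3 and clarity out of 2, summing to 10. The sketch below illustrates that aggregation. It is hypothetical illustrative code, not the benchmark's actual marking tooling, and GPT-4's average clarity mark (about 1.52 out of 2) is our own back-calculation, assuming the overall score is the simple sum of the three components.

```python
from dataclasses import dataclass

@dataclass
class AnswerScore:
    """One marked answer, using the rubric reported in this section:
    substance out of 5, citations out of 3, clarity out of 2."""
    substance: float  # 0-5: is the answer legally correct and complete?
    citations: float  # 0-3: are the cited authorities real and relevant?
    clarity: float    # 0-2: is the answer clearly expressed?

    def overall(self) -> float:
        # Assumes the overall mark is the simple sum of the components.
        return self.substance + self.citations + self.clarity

# GPT-4's average substance and citation scores as reported above; the
# clarity figure is back-calculated from its overall score of 5.27.
gpt4_average = AnswerScore(substance=2.38, citations=1.37, clarity=1.52)
print(round(gpt4_average.overall(), 2))  # 5.27
```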

Substance

The substance of the answers was generally inaccurate: the output of all the LLMs frequently contained errors and requires expert human review.

Is it technically correct?

GPT-4 achieved the most instances of full marks on substance (ie answers scoring 5 out of 5). However, while it scored 0 for substance only twice, 52% of its answers scored a 1 or 2, indicating answers that are mostly wrong or contain several errors.

There were a range of reasons for this, including:

  • Just getting the law wrong. In answer to a question about the extent to which a former shareholder can bring a claim against a third party who has breached a contract, GPT-4 repeatedly failed to identify the correct prerequisites for a derivative claim under the relevant sections of the Corporations Act 2001 (Cth). 
  • Missing the point of the question. In answer to a question about whether a supplier of cloud computing services can terminate its agreement with a company that has just gone into administration, based on an ipso facto termination clause, GPT-4 delved into irrelevant principles regarding contractual and common law rights to terminate.
  • Being in the wrong place. When answering a question about IP rights in a price index, GPT-4 referred to database rights under EU law that do not exist in Australia.

Too confident or too eager

One apparent issue is the 'sycophancy problem'*: the bias towards confirming the user's input. We also found that the models' 'eagerness' to provide a clear answer leads to an overstated level of confidence in the advice. For example, when asked to describe the Australian competition law risks of sharing certain information disclosed by a reseller to a manufacturer, LLaMa 2 recommended confidentiality arrangements and firewalls. In fact, these are not appropriate as a general response; the correct solution needs to be assessed on a case-by-case basis.

A more extreme example comes from the question, mentioned above, about former shareholder claims against third parties. LLaMa 2 said a shareholder has a right to bring a claim against a third party for breach of contract as a 'reflective loss' claim. In fact, such claims are restricted by the well-known 'no reflective loss' rule. Perplexity manufactured an answer that was completely inapposite, perhaps due to a bias towards giving a 'yes' answer, and expressed it with falsely inflated confidence.


* Sharma, M, et al, Towards Understanding Sycophancy in Language Models, 27 October 2023.

Citations

We need to talk about citations. The LLMs we tested produced a variety of incorrect, irrelevant and fictitious citations. This is a major source of concern.

Citations for the wrong jurisdiction

Some of the citations that appeared in answers were real but for cases in another jurisdiction. Despite being asked to answer from an Australian law perspective, many of the responses were swayed by UK and EU law.

Additionally, the answers suggest that general-purpose language models cannot recognise that words substitutable in one context may not be substitutable in another. In legal citations, words are never substitutable: a UK case cannot be 'translated' to Australia simply by substituting the High Court for the House of Lords. For example, LLaMa 2 referred to 'Kirin-Amgen Inc v Hoechst Marion Roussel Ltd [2004] FCAFC 171'. The real case citation is Kirin-Amgen Inc v Hoechst Marion Roussel Ltd [2004] UKHL 46; LLaMa 2 attributed a fictitious Australian citation to a real UK case.

Similarly, Gemini 1 gave a citation of 'Navitaire Inc v Jetstar Airways Pty Ltd [2022] FCAFC 84', a fictitious case. A real case relevant to this question, but under English law, is Navitaire Inc v EasyJet Airline Co. [2004] EWHC 1725 (Ch) – it seems that Gemini 1 constructed a 'localised' citation by simply replacing the name of a British budget airline with an Australian one and changing the court issuing the decision!
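This failure mode is at least partly machine-checkable: the court identifier in a medium-neutral citation (eg UKHL, EWHC, FCAFC) reveals its jurisdiction. Below is a minimal sketch of such a check. The helper function and its court-code table are our own illustration, not something the benchmark used, and a fuller tool would also verify that each citation actually exists in a case-law database.

```python
import re

# Illustrative subset of court identifiers used in medium-neutral citations.
COURT_JURISDICTIONS = {
    "HCA": "AU", "FCA": "AU", "FCAFC": "AU", "NSWLEC": "AU",
    "UKHL": "UK", "UKSC": "UK", "EWHC": "UK", "EWCA": "UK",
}

# Matches eg '[2004] UKHL 46': year, court code, judgment number.
CITATION_RE = re.compile(r"\[(\d{4})\]\s+([A-Z][A-Za-z]*)\s+\d+")

def flag_foreign_citations(answer: str, expected: str = "AU") -> list[str]:
    """Return citations whose court code belongs to another jurisdiction."""
    return [
        m.group(0)
        for m in CITATION_RE.finditer(answer)
        if COURT_JURISDICTIONS.get(m.group(2), expected) != expected
    ]

# The real UKHL citation is flagged as foreign; the fabricated FCAFC one is
# not, because FCAFC is a genuine Australian court. Catching the latter needs
# an existence check, not just a jurisdiction check.
print(flag_foreign_citations("... [2004] UKHL 46 ..."))    # ['[2004] UKHL 46']
print(flag_foreign_citations("... [2004] FCAFC 171 ..."))  # []
```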

Fictional extracts

In some cases, the source of law is correctly identified but is then supplemented by a fictional extract from that source. For example, Claude 2 responded to the question regarding a registered mortgage by citing section 55K(4) of the Real Property Act 1900 (NSW). That Act is relevant, but section 55K does not exist (and therefore neither does its subsection (4)).

Flights of fancy

Most worrying of all, despite the request to provide citations, in 32% of answers the underlying case law or legislation was either completely absent or simply made up.

An extreme example is LLaMa 2's answer that cites five cases, attributing different principles to each of them, all of which are fictional.

GPT-4 was the best performer, providing fictitious citations in only 4.4% of answers. However, it still had many instances of low citation scores, because it tended to cite non-authoritative sources (such as blog posts or law firm websites) without the underlying case authority or legislation, or to cite the correct legislation without section references.

A very good answer

GPT-4 achieved 10 out of 10 on this question. Its answer is correct and covers all material issues; it was given 5/5 for substance, 3/3 for citations and 2/2 for clarity.

Question: Can an Australian company give financial assistance to support the acquisition of shares in its Australian holding company?

Answer: Under Australian law, specifically the Corporations Act 2001 (Cth), a company is generally prohibited from providing financial assistance to a person to acquire shares in the company or its holding company¹²³⁴⁵. This is outlined in Section 260A of the Act¹³⁴⁵.

However, there are exceptions to this rule. A company may provide financial assistance if:

  1. The assistance does not materially prejudice the interests of the company or its shareholders, or the company's ability to pay its creditors¹²³⁴⁵.
  2. The assistance is approved by shareholders under Section 260B¹²⁵.
  3. The assistance is exempted under Section 260C².

The most preferable approach is to obtain shareholder approval under Section 260B of the Act, which is colloquially known as a "whitewash procedure"¹. This involves passing a special resolution with no votes being cast by the individual or company acquiring the shares¹.

Please note that this is a general summary and the specifics of each case can vary. It's always recommended to seek legal advice for individual circumstances.


Source: Conversation with Bing, 15/02/2024

1 A Helping Hand – Receiving Financial Assistance from a Company (Section .... https://piperalderman.com.au/insight/a-helping-hand-receiving-financial-assistance-from-a-company-section-260a-corporations-act/.

2 Financial Assistance By A Company In Respect Of The Acquisition Of Its .... https://www.mondaq.com/australia/financial-services/52496/financial-assistance-by-a-company-in-respect-of-the-acquisition-of-its-shares-an-unnecessary-burden-on-mortgagors/.

3 Identifying prohibited financial assistance in Australia - Lexology. https://www.lexology.com/library/detail.aspx?g=4fb1a006-3c5b-4c32-af29-dab2011d22ae/.

4 Financial Assistance for Company Shareholders | MST Lawyers. https://www.mst.com.au/can-a-company-provide-financial-assistance-to-a-person-to-acquire-shares-in-the-company/.

5 Case note: financial assistance in relation to acquisition of shares .... https://www.brightlaw.com.au/case-note-financial-assistance-in-relation-to-acquisition-of-shares-prohibited-by-the-corporations-act/.

Why is GPT-4 able to answer so well?

This is a good example of the ability of LLMs to summarise the law, and related guidance, in legal areas that are likely to be well known and discussed extensively on the internet (LLMs operate, in effect, as an 'internet in your pocket').

The real-world answer would, of course, apply these principles to the specific characteristics of and risks for the client’s business, but this is impossible for GPT-4 to do based on the short prompt. It might also go into further detail around the procedural steps required to complete a financial assistance whitewash or specify the exemptions to the financial assistance prohibition.

(One of) the worst answers

Gemini 1 got some of the basics horribly wrong. The answer achieved 1/5 for substance, 0/3 for citations and 1/2 for clarity.

Question: How do the provisions relating to an "existing use" under the Environmental Planning and Assessment Act 1979 (NSW) operate?

Answer: Existing Use Rights under the Environmental Planning and Assessment Act 1979 (NSW)

Existing use rights in NSW operate under Part 4, Division 4.11 of the Environmental Planning and Assessment Act 1979 (NSW) (EP&A Act). Here's a concise breakdown:

What is an "existing use"?

Defined in section 4.65 of the EP&A Act as:

A use of land lawfully carried out before the relevant date (usually the commencement of a new planning instrument).

Use must have been carried out continuously for 1 year after the relevant date (section 4.65(2)).

Key provisions:

Continuance of existing use: Existing uses can continue without development consent even if prohibited by new planning instruments (section 4.66(1)).

Limitations:

No significant intensification: Use cannot be significantly increased in scale or intensity (section 4.66(2)).

Limited alterations: Alterations and extensions require development consent, and must be on the same land (section 4.66(3)).

Abandonment: Use ceases after 12 months, existing use rights are lost (section 4.66(3)).

Regulations: Specific regulations may further define and limit existing use rights (section 4.67).

Relevant case law:

Agostino v Penrith City Council [2022] NSWLEC 1258: Clarifies the scope of existing use rights on specific land parcels.

Strata Plan No. 10487 v Willoughby City Council [2011] NSWLEC 1457: Highlights the limitations on alterations and extensions.

Additional notes:

Existing use rights can be complex and fact-specific. Seeking legal advice is recommended for specific situations.

This is not an exhaustive answer and does not constitute legal advice.

It's important to remember that this is a simplified overview and may not address all the nuances of existing use rights. It is always best to seek professional legal advice for specific situations.

Why is this a poor answer?

The answer correctly notes some aspects of the 'existing use' regime, but there are significant omissions and inaccuracies. In particular, the statutory definition of 'existing use' actually refers to 'building, work or land', and has two paragraphs: (a) applicable where there is no need to obtain development consent, and (b) applicable where development consent was originally required for that use. This answer conflates the two.

More importantly, though, the answer gives the impression that all existing uses will cease 12 months after the Environmental Planning Instrument (EPI) comes into force. In fact, there is a presumption of abandonment where an existing use is not carried out for a period of 12 months, which results in the extinguishment of the existing use right. It also omits several important principles in case law that assist in characterising existing uses.

Citations

The answer correctly cites statutory provisions, but one of the cases ('Strata Plan No. 10487 v Willoughby City Council [2011] NSWLEC 1457') is fictional.

The correct answer

A correct answer would state that an 'existing use' is a use of land that is commenced, or granted consent to, prior to the commencement of an EPI that prohibits that use, which may continue to be carried out despite the prohibition. The answer would then detail principles relevant to the characterisation of existing use (eg existing uses should be characterised as liberally as the statutory language allows) and note the statutory limitations on existing uses, and the exceptions to these limitations in the regulations. A correct answer would also note that where an existing use is not carried out for a period of 12 months, the existing use right is presumed to be extinguished.

A comparison across practice areas

There is some variance across practice areas that might reflect the mix of questions. GPT-4 achieved the highest marks for banking (8.44 out of 10) and the lowest for intellectual property (3.78 out of 10).

Reasons for the variations

The difference in the marks between the various practice areas is interesting and might be partly based on the mix of questions. While all practice areas contained a mixture of straight legal questions and questions requiring an application of the law to facts, the banking questions were mostly 'summarisation' questions, which the LLMs generally find easier to answer.

In contrast, the intellectual property questions required the application of the law to specific facts and touched on issues that are likely to be less 'commonplace', ie less widely discussed on the internet. As the LLMs (at least partly) learn from internet content, they might be able to more easily parrot back the answers to summarisation questions.

Little Australia

Another issue appears to be peculiar to Australia (and other 'smaller' jurisdictions). The October 2023 LinksAI report postulated that the performance of LLMs might be worse for 'less common' legal systems. Our results bear this out, at least in less-harmonised areas of law such as intellectual property, as answers frequently incorporated EU and UK law analysis that is not correct for Australian law. This effect is less severe in areas with greater harmonisation, such as banking.

It is curious, however, that the responses show less 'infection' by US law than by EU and UK law. It may be that LLMs have 'learned' that Australian lawyers, judges and governments generally treat UK sources as more appropriate references in an Australian context than US law. These UK sources include pre-Brexit jurisprudence that was based on EU law.

No central moderation

Equally, it is possible that the higher marks for some practice areas are because the questions were easier and the marking more generous. There is a degree of subjectivity in both designing the questions and the marking process. However, because the subject matters are technical, we did not attempt to centrally moderate the difficulty of the questions, or the marks applied in relation to substance.