Allens publishes Australia's first AI legal benchmark

Allens has launched a pioneering Australian law benchmark for generative AI, based on the LinksAI English law benchmark published by Linklaters in October 2023. The first-of-its-kind initiative tests the ability of large language models (LLMs) to answer legal questions under Australian law, providing a systematic framework to test, compare and track developments in generative AI's capabilities over time.

The benchmark suggests that while LLMs can summarise well-understood areas of law effectively, they should not be used for dispensing Australian law advice without expert human supervision due to inconsistencies and inaccuracies.

Among those evaluated, GPT-4 emerged as the best-performing LLM, followed by Perplexity; however, none displayed consistent reliability when presented with complex legal queries.

Allens' Intellectual Property practice group leader, Miriam Stiel, noted: 'While we're seeing some impressive developments in AI technology applied to law, our findings underline that there is still considerable progress needed before these tools can be relied upon fully without human oversight.

'Even as we anticipate further improvements in AI capabilities and accuracy in answering questions on Australian law, it's crucial to remember that providing accurate legal advice is just one facet of a lawyer's role, which also involves the exercise of judgment and risk analysis, to assist clients in their strategic and commercial decision making.'

Citation issues persist among the models tested, with frequent inaccuracies, hallucinations, and a general inability to discern authoritative sources.

The benchmark also found:

While GPT-4 scored more than 50% in our benchmarking, it did not demonstrate the competency expected of a mid-level associate. LLMs performing at the level of GPT-4 could have a practical role in assisting legal practitioners to summarise relatively well understood areas of law.
52% of GPT-4's answers scored a 1 or 2 for substance, indicating an answer that is mostly wrong or contains several errors.
Despite requests for citations, in 32% of answers the underlying case law or legislation is either completely absent or made up.

Read the full report: The Allens AI Australian law benchmark