21 Evaluating and Refining Outputs
Responsible AI
To set the scene, first watch this video on the responsible use of AI. Every generative AI conversation should be a responsible AI conversation.
LLM Risks and Mitigations
When integrating generative AI into legal practice, practitioners must be aware of the technology's limitations and associated risks. One of the most significant concerns is hallucination, where LLMs generate false but plausible-sounding information. In legal contexts, this can lead to misinterpretations of case law, citations to non-existent cases, or inaccurate legal analyses. Cases of practitioners submitting court documents containing fictitious citations are already emerging, with potential disciplinary consequences.[1]
A related concern is model ‘sycophancy’, where a model agrees with the user or provides affirming responses even when they are inaccurate, in an effort to satisfy the user’s expectations.[2]
Compounding this is the black-box nature of LLMs; because their reasoning process is opaque, it can be difficult to ascertain why a model produced a particular response, making it challenging to validate its reliability. Bias is another critical issue, as AI models trained on vast datasets may inadvertently reflect societal, historical, or institutional biases, leading to unfair or skewed outputs. Additionally, practitioners must consider confidentiality and privacy risks, as sharing sensitive client information with any AI system raises concerns about data security and compliance with professional and regulatory obligations.
Further limitations include temporal knowledge cutoffs (LLMs lack awareness of developments after their training data ends) and jurisdictional blindness (an inability to distinguish between Australian state, territory and Commonwealth laws without explicit instruction), which compound the sycophancy discussed above.[3] Overreliance on these tools can also lead to skill atrophy, where practitioners’ core legal reasoning abilities diminish through disuse.[4]
Given these risks, legal professionals should approach LLMs with appropriate caution: implement validation processes, maintain human oversight at critical decision points, and ensure compliance with legal and ethical obligations. Ultimately, the practitioner, not the technology, bears full responsibility for the work product.
Watch
Watch the following video, in which Martin Keen explains the different types of hallucinations, why they happen, and the steps users can take to minimise their occurrence.
Techniques for assessing AI-generated legal content
When integrating generative AI into legal practice, it is crucial to critically evaluate AI-generated content to uphold professional standards and fulfil ethical obligations. While AI can offer valuable insights, summarise case law, and draft legal documents, it is not infallible.
This section outlines practical techniques for verifying, assessing, and enhancing AI-generated legal content, helping practitioners navigate the intersection of AI and legal practice with confidence and due diligence. Consider how each technique might apply in different situations; the assessment approaches can be tailored to the specific context and risk level of each application. The Law Society of New South Wales has also created a flowchart to help practitioners determine whether generative AI is suitable for a specific legal task, as part of A Solicitor’s Guide to Responsible Use of Artificial Intelligence.[5]
For instance, the techniques used in common use cases may be relevant as follows:
- Legal Research and Analysis — Focuses on verifying citations, ensuring the currency of the law, and checking the completeness of the analysis.
- Contract Drafting and Review — Emphasises consistency across terms, completeness of provisions, and jurisdictional compliance.
- Due Diligence Document Review — Concentrates on completeness, consistency across documents, and identification of critical provisions.
- Client Communication and Legal Advice — Highlights ethical compliance, relevance, and appropriate explanation of complex concepts.
- Regulatory Compliance Analysis — Focuses on the currency of regulations, jurisdictional accuracy, and comprehensive coverage of requirements.
- Legal Education and Knowledge Management — Emphasises accuracy while simplifying concepts, audience appropriateness, and balanced presentation.
To assist in evaluating AI-generated content, the following section highlights the essential actions and processes to consider when reviewing content.
1. Cross-Reference with Authoritative Sources
Key Action: Verify all legal information, case citations, and statutory references against established legal databases.
Process:
- Identify all citations, legal principles, and statutory interpretations in the AI output.
- Look up each reference in Westlaw, LexisNexis, or official court/legislative websites.
- Verify that the AI’s interpretation aligns with the actual text and context of the cited source.
- Check that the quoted material is presented in the proper context.
Critical Questions:
- Do all cited cases and statutes exist?
- Are the holdings and principles accurately represented?
- Has the model fabricated or misattributed any legal authority?
- Does the interpretation align with reputable secondary sources?
Practice Tip: Create a verification checklist for each document type, listing the standard references that must be verified. A first pass can even be scripted, as sketched below.
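Some of that first pass can be automated. The following Python sketch is a minimal illustration only (the regular expression, sample text and field names are assumptions, and the pattern will miss many citation formats): it extracts Australian medium-neutral citations of the form [year] COURT number from an AI output and pairs each with an unchecked status for manual verification.

import re

# Medium-neutral citations take the form [year] COURT number, e.g. [2021] NSWCA 101.
# This simplified pattern is illustrative and will not catch every citation format.
CITATION_PATTERN = re.compile(r"\[(19|20)\d{2}\]\s+[A-Z][A-Za-z]*\s+\d+")

def build_verification_checklist(ai_output: str) -> list[dict]:
    """Extract candidate citations, each marked unverified until checked manually."""
    citations = sorted({m.group(0) for m in CITATION_PATTERN.finditer(ai_output)})
    return [{"citation": c, "verified": False, "checked_against": None} for c in citations]

# Hypothetical AI output, for illustration only.
draft = "The principle was affirmed in Smith v Jones [2021] NSWCA 101."
for item in build_verification_checklist(draft):
    print(item["citation"], "- look up in an authoritative database")

A script like this only flags what to check; the verification itself must still be carried out against Westlaw, LexisNexis or official sources.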
2. Consistency Check
Key Action: Ensure that the output maintains logical and legal consistency throughout.
Process:
- Review the document for contradictory statements.
- Ensure defined terms are used consistently.
- Verify that conclusions logically follow from the presented facts and legal principles.
Critical Questions:
- Does the model maintain a consistent interpretation of legal concepts throughout?
- Are there logical contradictions in how tests are applied?
- Did the model appropriately distinguish between majority judgment and dissent?
- Does the analysis acknowledge jurisdictional differences where relevant?
Practice Tip: Read the document in reverse order (conclusion first) to spot contradictions more easily.
3. Completeness Analysis
Key Action: Assess whether the model has addressed all necessary aspects of the task.
Process:
- Compare the output against a standard template or checklist for the specific task.
- Verify all necessary provisions are included.
- Identify any missing elements.
- Verify that relevant counterarguments have been considered.
Critical Questions:
- Has the model addressed all aspects of the task?
- Are there considerations that have been overlooked?
- Does the analysis account for counterarguments?
- Has the model failed to recognise where a case is distinguishable from the present facts?
Practice Tip: Develop issue-specific checklists for common tasks in your practice area.
4. Relevance and Currency Check
Key Action: Confirm that the information is up-to-date and reflects current law.
Process:
- Confirm that all cited laws are applicable in the relevant jurisdiction.
- Check the dates of cited cases and statutes.
- Research whether any cases have been overruled or statutes amended.
- Verify that recent significant developments in the area of law are incorporated.
Critical Questions:
- Has the model mistakenly applied law from the wrong jurisdiction?
- Has the model cited superseded or overruled precedent?
- Does the analysis reflect the most current statutory language?
- Has the model accounted for any relevant legal changes after its knowledge cutoff date?
- Could current information affect the analysis?
Practice Tip: Include a timestamp on all AI-generated material to track currency, as in the sketch below.
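A minimal Python sketch of that tip (the model name and cutoff date are placeholders, not real values):

from datetime import date

def stamp(output: str, model_name: str, knowledge_cutoff: str) -> str:
    """Prepend provenance details so currency can be assessed at a glance later."""
    header = (
        f"Generated {date.today().isoformat()} using {model_name} "
        f"(training data cutoff: {knowledge_cutoff}). Verify currency before reliance.\n\n"
    )
    return header + output

print(stamp("Draft analysis ...", "example-model", "2024-01"))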
5. Bias and Balance Assessment
Key Action: Evaluate whether the model has presented a balanced analysis.
Process:
- Identify whether the model has given appropriate weight to opposing arguments.
- Assess whether the model has considered diverse perspectives.
Critical Questions:
- Does the analysis present both sides of contested issues?
- Does the analysis acknowledge limitations in its conclusions?
- Has the model omitted relevant dissenting opinions or views?
Practice Tip: Always ask the model to generate arguments for both sides of a legal issue, even when only one perspective is needed for your final work product.
6. Legal Ethics Compliance
Key Action: Ensure the AI-generated content complies with professional responsibility obligations.
Process:
- Review the AI application's terms of service.
- Review for potential confidentiality issues.
- Verify that representations about the law are accurate and not misleading.
Critical Questions:
- Do the system and the content maintain client confidentiality?
- Are there any misleading statements?
- Does the work product reflect adequate supervision of AI tools?
Practice Tip: Create an ethics compliance checklist specifically for AI-generated work products.
7. Remediation Process
When issues are identified in AI-generated content, follow this structured approach:
- Document the Error: Record the specific issue, including what makes it incorrect or problematic
- Research the Correct Information: Consult authoritative sources to establish the correct position
- Refine Your Prompt: Modify your initial instructions to help the model avoid similar errors
- Test and Iterate: Generate a new version and verify if the issue has been resolved
- Create a Knowledge Base: Maintain a record of common errors and effective remediation strategies (a logging sketch follows this list)
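Where errors recur, even a very simple log builds that knowledge base over time. A minimal Python sketch (the file name, record fields and example values are assumptions, not a prescribed format):

import json
from datetime import date

KNOWLEDGE_BASE = "ai_error_log.jsonl"  # hypothetical file name

def log_error(task: str, error: str, correction: str, prompt_fix: str) -> None:
    """Append one remediation record per line so the log stays easy to search."""
    record = {
        "date": date.today().isoformat(),
        "task": task,
        "error": error,
        "correction": correction,
        "prompt_fix": prompt_fix,
    }
    with open(KNOWLEDGE_BASE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_error(
    task="Lease review",
    error="Cited a repealed provision",
    correction="Confirmed the current provision on the official legislation website",
    prompt_fix="Instructed the model to flag any provision it cannot date-verify",
)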
Practical Implementation
Consider implementing these techniques through a standardised workflow:
- Initial Generation: Create AI-generated content using clear, detailed prompts.
- Primary Review: Conduct a first-pass review focusing on apparent errors.
- Structured Assessment: Apply the techniques above.
- Targeted Refinement: Regenerate specific sections as needed.
- Final Verification: Perform a comprehensive review of the final product.
- Documentation: Record your verification process for professional responsibility purposes; a minimal record structure is sketched below.
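For the documentation step, a simple structured record can evidence that each stage of the workflow was completed. A minimal Python sketch (the class and field names are assumptions, not a prescribed form):

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class VerificationRecord:
    """One record per AI-assisted work product, retained for professional responsibility purposes."""
    matter: str
    tool_used: str
    reviewer: str
    stages_completed: list[str] = field(default_factory=list)

    def sign_off(self, stage: str) -> None:
        self.stages_completed.append(f"{stage} ({datetime.now().isoformat(timespec='seconds')})")

record = VerificationRecord(matter="Example advice", tool_used="LLM drafting assistant", reviewer="A. Practitioner")
for stage in ["Primary Review", "Structured Assessment", "Targeted Refinement", "Final Verification"]:
    record.sign_off(stage)
print(record)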
Strategies for improving outputs through prompt refinement
Effective use of generative AI requires evaluating outputs and refining prompts. As we have already discussed, the quality of AI-generated legal content depends on how the prompts are constructed.
Following the assessment techniques, this section explores practical techniques for refining prompts to enhance the quality of generative AI outputs.
When optimising for results, approach prompt engineering systematically (the cycle is sketched in code after this list):
- Diagnose the output: Consider the critical questions posed above.
- Make targeted refinements: Apply the strategies below to address the identified weaknesses.
- Test refined prompts to evaluate improvement: Start a new chat with the model, or continue the same chat for multi-step refinement.
- Review: Assess whether the specific issues have been resolved.
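The diagnose-refine-test cycle can also be made explicit in code. The sketch below is a hypothetical illustration only: generate() is a stub standing in for whichever model API you use, and the review criteria and statute named are placeholders for your own checks.

def generate(prompt: str) -> str:
    """Stub standing in for a call to your chosen model's API."""
    return "Stubbed model output, for illustration only."

def passes_review(output: str, required_terms: list[str]) -> bool:
    """Crude automated check: does the output mention each required authority?"""
    return all(term in output for term in required_terms)

prompt = "Analyse the enforceability of this restraint clause under NSW law."
required = ["Restraints of Trade Act 1976", "NSW"]

for _ in range(3):  # bounded iteration: escalate to manual review if unresolved
    output = generate(prompt)
    if passes_review(output, required):
        break
    # Targeted refinement: restate whatever the output omitted.
    missing = [t for t in required if t not in output]
    prompt += "\nYour previous answer omitted: " + ", ".join(missing)

Automated checks of this kind supplement, and never replace, the human review described above.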
1. Specify Legal Authority and Jurisdiction
If you find that outputs are generic or defaulting to American law, explicitly define the relevant legal framework to ensure outputs conform to the applicable jurisdiction.
For example:
I need an analysis of [legal issue] under [Australian jurisdiction] law.
Tips:
- Use constraints, for example, by specifying the jurisdiction (state/territory or Commonwealth).
- Ground the model in authoritative material, either through a retrieval-augmented generation (RAG) process or by including the specific legislative provisions, regulations and cases where applicable, as in the sketch below.
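Short of a full RAG pipeline, you can assemble verified source material into the prompt yourself. A minimal Python sketch (the provision extract and matter details are placeholders):

# Paste verified extracts from authoritative sources; do not rely on the model's memory of them.
provisions = {
    "Copyright Act 1968 (Cth) s 40": "<verified text of the provision goes here>",
}

def build_grounded_prompt(issue: str, jurisdiction: str, sources: dict[str, str]) -> str:
    source_block = "\n\n".join(f"{cite}:\n{text}" for cite, text in sources.items())
    return (
        f"Analyse {issue} under {jurisdiction} law.\n"
        "Rely only on the source material below; if it is insufficient, say so rather than guessing.\n\n"
        f"SOURCE MATERIAL:\n{source_block}"
    )

print(build_grounded_prompt("a fair dealing claim", "Australian Commonwealth", provisions))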
2. Request Counterarguments
Direct the model to consider multiple perspectives to ensure a balanced analysis.
For example:
After providing your initial analysis of this fair dealing claim, please:
- Present the strongest argument against your primary conclusion
- Identify how the fair dealing purposes in the Copyright Act 1968 (Cth) might be interpreted differently
- Discuss relevant Law Reform Commission recommendations or academic commentary from secondary sources.
Or, as a follow-up prompt:
Critique your previous analysis. Assume an appellate court finds your reasoning flawed. What arguments might they make?
Tips:
- Request specific counterarguments to the main conclusions.
- Ask for alternative interpretations.
- Direct the model to identify different approaches.
- Request consideration of influential academic commentary from Australian sources.
3. Implement Constraint Prompting
Set specific boundaries to guide the model away from potential errors or irrelevant areas.
For example:
In your analysis of this patent issue:
- Do not discuss international patent law unless directly relevant to the interpretation of the Patents Act 1990 (Cth)
- Limit your analysis to standard patents (do not address innovation patents)
- Avoid discussion of competition law implications unless directly relevant
Tips:
- Use negative constraints to prevent unwanted content.
- Define scope limitations clearly (what is in and out of bounds).
- Identify areas where speculation should be avoided.
4. Apply Role-Based Perspectives
Direct the model to adopt specific legal roles to gain varied analytical perspectives.
For example:
Analyse this administrative law issue from the following perspectives:
- As counsel for a government agency defending the decision
- As a barrister representing an applicant seeking judicial review in the Federal Court
- As a Federal Court judge
- As a member of the Administrative Review Tribunal, considering merits review
Tips:
- Include both advocacy and adjudicative roles.
- Request identification of different stakeholder interests.
- Use roles to explore different perspectives.
5. Use Iterative Refinement
Engage in multi-step refinement to progressively improve outputs.
Example process:
- Generate an initial response.
- Ask the model to identify potential weaknesses in its analysis.
- Request a revised version addressing these weaknesses.
- Pose follow-up questions on any remaining areas of concern.
Tips:
- Start with a broad analysis before narrowing the focus.
- Request explicit identification of assumptions or weaknesses.
- Ask for the analysis to be applied to specific hypothetical scenarios.
- Build toward practical solutions or recommendations.
6. Implement Document and Context Structuring
Provide clear guidance on how information should be organised and presented. See Starting to Prompt.
Tips:
- Specify document format and structure requirements.
- Set length parameters for different sections.
- Direct the model to use specific formatting (headings, bullet points).
- Request executive summaries or conclusions.
- Specify Australian spelling and terminology conventions.
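For example, combining these tips into a single instruction:
Draft a letter of advice on [legal issue] with the following structure:
- An executive summary of no more than 200 words
- Numbered headings for each issue, with no more than two paragraphs of analysis per issue
- A 'Next Steps' section in bullet points
- Australian English spelling and Australian legal terminology throughout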
Remember, while these strategies can significantly enhance AI outputs, they do not replace human legal expertise. Always apply your professional judgment and knowledge to verify and refine AI-generated content before using it in your work.
1. See Handa & Mallick [2024] FedCFamC2F 957.
2. Mrinank Sharma et al, ‘Towards Understanding Sycophancy in Language Models’ (2023) arXiv:2310.13548 (preprint).
3. See Lars Malmqvist, ‘Sycophancy in Large Language Models: Causes and Mitigations’ (2024) arXiv:2411.15287 (preprint).
4. Daniel Dillu, ‘How Over-Reliance on AI Could Lead to Cognitive Atrophy’, Medium (Web Page, 9 October 2023) <https://medium.com/neuranest/how-over-reliance-on-ai-could-lead-to-cognitive-atrophy-d04d214c7e75>.
5. Law Society of New South Wales, ‘A Solicitor’s Guide to Responsible Use of Artificial Intelligence’ (Report, October 2024) <https://www.lawsociety.com.au/publications-and-resources/ai-legal-professionals>.