A seminar by Associate Professor Fei Huang from UNSW
Title: Gender Bias in LLM-Evaluated Insurance Claims
Abstract: Large language models (LLMs) are increasingly deployed in AI-assisted insurance claim processing, a high-volume service operation in which small per-claim disparities can accumulate into substantial aggregate injustice. We examine whether, and under what operational conditions, six frontier multimodal LLMs generate differential financial outcomes based on claimant gender.
Existing audit methodologies typically rely on binary Male/Female comparisons and single-prompt configurations. We show that these approaches can produce systematically misleading conclusions about model fairness. Using a 4 × 2 × 2 factorial counterfactual design, we vary four gender conditions (Male, Female, Non-binary, and Not Specified), the presence or absence of a claimant name, and the inclusion of an AI-generated claim narrative across six LLMs and 1,388 vehicle insurance claims, yielding 133,248 evaluations across three financial outcomes.
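As a rough illustration of how the factorial design produces the stated evaluation count, the sketch below enumerates the 16 experimental conditions and multiplies them by the number of models and claims. The field names (`gender`, `name_included`, `narrative_included`) and the helper `build_conditions` are hypothetical and chosen for illustration only; they are not taken from the authors' code.

```python
from itertools import product

# Factor levels as described in the abstract.
GENDERS = ["Male", "Female", "Non-binary", "Not Specified"]
NAME_INCLUDED = [True, False]        # claimant name present or absent
NARRATIVE_INCLUDED = [True, False]   # AI-generated claim narrative present or absent

N_MODELS = 6      # six LLMs
N_CLAIMS = 1388   # vehicle insurance claims


def build_conditions():
    """Enumerate every cell of the 4 x 2 x 2 design (hypothetical field names)."""
    return [
        {"gender": g, "name_included": n, "narrative_included": r}
        for g, n, r in product(GENDERS, NAME_INCLUDED, NARRATIVE_INCLUDED)
    ]


conditions = build_conditions()
total_evaluations = len(conditions) * N_MODELS * N_CLAIMS

print(len(conditions))     # 16
print(total_evaluations)   # 133248
```

Each claim is evaluated once per model under every condition, so 16 × 6 × 1,388 gives the 133,248 evaluations reported above.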
Our primary finding is that bias is concentrated among Non-binary and Not Specified claimants. Standard binary audits would have classified three of the six models as unbiased. Bias direction is model-specific and inconsistent across providers, while claimant names unpredictably moderate bias profiles. Notably, removing claimant names shifts Claude 4.5 Sonnet from no significant bias to its strongest observed disparities across all three outcomes simultaneously. Richer claim narratives partially reduce payout and eligibility bias, although severity bias persists. Among the evaluated models, Claude 4 Sonnet exhibits broad fairness across all 16 experimental conditions, whereas its successor introduces complex bias patterns absent in the earlier version.
These findings demonstrate that organisations cannot rely on binary, single-configuration audits to establish fairness in AI-assisted service operations. Effective audits must include Non-binary and Not Specified gender categories, evaluate multiple prompt configurations, and treat model updates as requiring formal re-audit rather than assuming continuity in fairness properties.
(Joint work with Md Mushahidul Islam Shamim (UNSW), Warut Khern-am-nuai (McGill), and Maxime C. Cohen (McGill).)
For further information, please contact RSFAS Seminars.

