Research
I wrote my first AI research paper in 2025! I modeled the performance curve of a question-answering model across increasing levels of lexical and contextual drift and compared fine-tuning methods for mitigating dataset artifacts.
The Narrative Cliff: Hierarchical Contrast Sets Reveal Non-Monotonic Robustness in QA Models
Author: Garrett Moran
Abstract
Pre-trained Question Answering (QA) models frequently fail to transfer in-distribution competence to out-of-distribution settings. We investigate the degradation curve of model performance across a curated, three-tiered contrast set that introduces increasing levels of qualitative shift in questions and contexts: syntactic variation (L1), elaborate rephrasing (L2), and narrative domain shift (L3). Using Jaccard distance to quantify drift, our evaluation of ELECTRA-small reveals a non-monotonic robustness curve: performance drops on L1 (0.41 mean drift), peaks on L2 (0.67 mean drift), and falls sharply on L3 (0.73 mean drift). Crucially, a minimal increase in lexical drift (Δ 0.06) between L2 and L3 coincides with a steep decline in performance, indicating that narrative framing affects robustness far more than lexical overlap alone would suggest. Regarding mitigation, we find that full-parameter fine-tuning induces catastrophic forgetting. Freezing the encoder, by contrast, preserves performance on L1 and L2 but collapses on L3, falling significantly below the baseline. This exposes a sharp "narrative cliff": the frozen model adapts to elaborate paraphrasing yet fails under narrative shift, demonstrating that lexical adaptability and domain generalization are distinct capabilities, and that optimizing for one can inadvertently degrade the other.
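To make the drift numbers above concrete, here is a minimal sketch of Jaccard distance over token sets, the metric the abstract uses to quantify how far a rewritten example has moved from the original. The whitespace tokenization, lowercasing, and the example strings are illustrative assumptions, not the paper's exact preprocessing.

```python
def jaccard_distance(original: str, rewritten: str) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| over lowercased token sets."""
    a = set(original.lower().split())
    b = set(rewritten.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# A light paraphrase shares most tokens (low drift), while a narrative
# rewrite shares few (high drift), even if both preserve the answer.
print(jaccard_distance("Who wrote the novel?",
                       "Who authored the novel?"))
print(jaccard_distance("Who wrote the novel?",
                       "In the village, people only whispered about the book's creator."))
```

The takeaway from the paper is that this number alone is a weak predictor: L2 and L3 sit only 0.06 apart on this scale, yet behave very differently.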
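The frozen-encoder mitigation can be sketched in a few lines as well. This is a minimal illustration assuming the Hugging Face Transformers ELECTRA-small checkpoint; the paper's actual training code and hyperparameters are not shown here.

```python
from transformers import ElectraForQuestionAnswering

# Assumed checkpoint for illustration; any ELECTRA-small QA variant works the same way.
model = ElectraForQuestionAnswering.from_pretrained(
    "google/electra-small-discriminator")

# Freeze the encoder: only the span-prediction head stays trainable, so
# fine-tuning adapts the output layer without overwriting the pre-trained
# representations (the failure mode behind catastrophic forgetting).
for param in model.electra.parameters():
    param.requires_grad = False

# Sanity check: only the qa_outputs head should remain trainable.
print([name for name, p in model.named_parameters() if p.requires_grad])
```

As the abstract notes, this trade-off is not free: the frozen model holds up under paraphrasing (L1, L2) but falls off the "narrative cliff" at L3.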