Despite their remarkable capabilities, Large Language Models (LLMs) exhibit concerning vulnerabilities to minor input perturbations, posing risks in safety-critical applications. This thesis systematically analyzes how semantic-preserving prompt perturbations at the character, word, and sentence levels affect LLM safety behavior. It investigates whether even slight rephrasings can cause an otherwise aligned model to flip from a safe (refusal or benign) response to an unsafe one, or vice versa. The study also identifies key perturbation characteristics that drive such flips and examines how flip likelihood varies across different categories of harmful content.

To answer these questions, the study evaluates multiple advanced LLMs (LLaMA2, LLaMA3, Mistral) on the CatQA dataset of harmful queries. Each query is perturbed by applying controlled character-level typos, word-level substitutions, and sentence-level paraphrases while preserving semantic meaning. Model outputs are assessed using the automated safety evaluation tool LlamaGuard.
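To make the three perturbation levels concrete, the sketch below shows one plausible way to generate them for a single prompt. The helper names, the swap rate, the synonym lookup table, and the external paraphrase callable are illustrative assumptions for exposition, not the thesis's exact implementation.

```python
import random

def char_level_typo(prompt: str, rate: float = 0.05) -> str:
    """Swap adjacent letters at a small rate to simulate character-level typos."""
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def word_level_substitution(prompt: str, synonyms: dict[str, str]) -> str:
    """Replace individual words using a meaning-preserving synonym table."""
    return " ".join(synonyms.get(word.lower(), word) for word in prompt.split())

def sentence_level_paraphrase(prompt: str, paraphraser) -> str:
    """Delegate full-sentence rewriting to an external paraphrase model (assumed interface)."""
    return paraphraser(prompt)

# Example usage with a placeholder paraphraser:
if __name__ == "__main__":
    query = "Describe how the system works"
    print(char_level_typo(query))
    print(word_level_substitution(query, {"describe": "explain"}))
    print(sentence_level_paraphrase(query, lambda p: p))  # identity stand-in
```

Under this kind of setup, each perturbed variant would then be sent to the target model and its response labeled safe or unsafe by the safety evaluator, so that flips relative to the unperturbed prompt can be counted.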
Empirical results show that sentence-level paraphrasing consistently improves safety compliance, whereas fine-grained character-level noise often degrades it due to tokenization weaknesses. Word-level changes yield the most inconsistent behavior, with random insertions particularly likely to elicit unsafe outputs. Among the models tested, LLaMA3 was the most vulnerable, exhibiting the highest rate of unsafe responses under perturbation.
Overall, these findings underscore the importance of perturbation-aware and context-aware safety evaluation and offer practical insights for improving LLM alignment in real-world deployments.