Detection of Data Exposure in Software Services Using Large Language Models

Description

Data exposure is a critical concern in software services engineering: leaks of confidential or sensitive information are frequent, yet developers often struggle to prevent them because the causes are diverse, including hardcoded secrets, storage misconfigurations, insecure logging, improper data transmission, and unsafe deserialization. Modern software development practices can exacerbate these risks. Existing detection methods, which often rely on regular expressions or static pattern analysis, frequently generate many false positives or lack the contextual understanding needed for reliable detection across varied programming languages. To address these limitations, this thesis presents an approach to data exposure detection in software services based on a Large Language Model, adapted with low-rank adaptation (LoRA) for efficient specialization and with few-shot learning to resolve ambiguous cases. Unlike detection systems that rely solely on static patterns or simple heuristics, the presented framework performs contextual code analysis, enabling highly accurate identification of exposures. In the evaluation, the framework significantly surpasses existing tools and alternative techniques in both precision and recall. Key capabilities include efficient model adaptation through LoRA, ambiguity resolution using few-shot learning based on optimized thresholds, and precise line-level localization of identified exposures, which yields more reliable results and supports faster remediation. A feedback loop for continuous learning further distinguishes the framework as an accurate, scalable solution for detecting data exposures in complex software service environments.
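
As a rough illustration of the regular-expression baseline that the abstract contrasts against, the following Python sketch scans source text line by line and reports the line number of each match. It is not the thesis's method or any specific tool's rule set: the pattern names, the patterns themselves, the `scan_source` helper, and the sample input are all assumptions made for this example.

```python
import re

# Hypothetical patterns, chosen only for illustration (not from the thesis).
PATTERNS = {
    # An identifier containing a sensitive-sounding word, assigned a string literal.
    "hardcoded_secret": re.compile(
        r"""(?i)(password|passwd|secret|api_key|token)\s*=\s*["'][^"']+["']"""
    ),
    # A PEM private key header pasted into source code.
    "private_key_header": re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
}


def scan_source(source: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


# Assumed sample input: line 1 is a real exposure, line 2 is a harmless
# placeholder, and line 3 reads the value from the environment.
SAMPLE = '''\
db_password = "hunter2"
token = "PLACEHOLDER"
password = os.environ["DB_PASSWORD"]
'''

if __name__ == "__main__":
    for line_number, name in scan_source(SAMPLE):
        print(f"line {line_number}: {name}")
```

In this sketch, lines 1 and 2 of the sample are both flagged even though line 2 holds only a placeholder. A pattern cannot tell the two apart without context, which is the kind of false positive the abstract attributes to pattern-based detection.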

Downloads

621.65 KB

Details

Contributors
Date Created
2025
Language
  • en
Note
  • Partial requirement for: M.S., Arizona State University, 2025
  • Field of study: Computer Science
Additional Information
English
Extent
  • 66 pages
Open Access
Peer-reviewed