Fine-Tuning Large Language Models: Assessing Memorization and Redaction of Personally Identifiable Information
Master's thesis
Date
2024
Abstract
Memorization and redaction of Personally Identifiable Information (PII) in fine-tuned transformer-based Large Language Models (LLMs) is a new research area with limited literature overall and no existing studies on Norwegian data. In this thesis, we first generate synthetic English and Norwegian datasets containing PII, fine-tune LLMs on these datasets, and measure the extent of memorization.
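Memorization of this kind is commonly quantified as a verbatim-extraction rate: prompt the fine-tuned model with a prefix seen during training and check whether its continuation reproduces the associated PII string exactly. The sketch below illustrates that generic proxy metric, not necessarily the thesis's exact measurement protocol; the `generate` stub stands in for a real model's greedy decoding and is purely hypothetical.

```python
def extraction_rate(generate, records):
    """Fraction of (prefix, pii) pairs for which the model's continuation
    contains the PII string verbatim -- a common memorization proxy."""
    hits = sum(pii in generate(prefix) for prefix, pii in records)
    return hits / len(records)

def generate(prefix):
    # Toy stand-in for a fine-tuned model's continuation (hypothetical data).
    return "123-45-678." if prefix == "My SSN is " else "unknown."

records = [("My SSN is ", "123-45-678"),
           ("My email is ", "kari@example.no")]
print(extraction_rate(generate, records))  # -> 0.5 (one of two secrets extracted)
```

In a real evaluation the stub would be replaced by sampling from the fine-tuned LLM, and the rate would be compared against a base model that never saw the synthetic PII.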
Second, we develop and release PII redaction models for English and Norwegian by fine-tuning an open-source LLM. Our findings reveal significant PII memorization during fine-tuning and demonstrate that fine-tuned redaction models can effectively remove PII from text while maintaining data integrity. We conclude that businesses should implement robust PII redaction processes when fine-tuning LLMs to ensure data privacy and compliance with regulations.
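The redaction task has a simple input/output contract: text in, text out, with PII spans replaced by placeholder tags. A minimal rule-based baseline (regexes for emails and phone numbers) illustrates that contract; the thesis instead fine-tunes an LLM for the task, and the patterns and tag names below are illustrative assumptions, not the thesis's implementation.

```python
import re

# Illustrative rule-based PII redactor (hypothetical patterns and tags);
# the thesis's approach fine-tunes an LLM rather than using regexes.
PII_PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[PHONE]": re.compile(r"\+?\d[\d \-]{7,}\d"),
}

def redact(text):
    """Replace matched PII spans with placeholder tags, leaving other text intact."""
    for tag, pattern in PII_PATTERNS.items():
        text = pattern.sub(tag, text)
    return text

print(redact("Contact Kari at kari@example.no or +47 912 34 567."))
# -> Contact Kari at [EMAIL] or [PHONE].
```

A learned redaction model is evaluated on the same contract, which is what allows the thesis to compare redacted output against the original text for data integrity.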
Description
Master's thesis (MSc) in Master of Science in Business, Data Science for Business - Handelshøyskolen BI, 2024