Geo-tagged tweets collected at the building level exhibit patterns that aid building function classification. However, this data source suffers from substantial noise, limiting its effectiveness. A systematic noise analysis requires a noise-free environment, which is difficult to obtain from real-world data. In this study, we propose an approach using an LLM-generated synthetic oracle dataset that contains only correctly assigned tweets aligned with their respective buildings. To make the dataset reflect real-world distributions, we use a data generation pipeline that integrates real-world data attributes into LLM prompts. To evaluate the utility of the synthetic dataset for noise analysis, we compare the performance of Naïve Bayes (NB) and mBERT classifiers on it against real-world noisy data. Additionally, we assess the dataset’s diversity by comparing its Self-BLEU and perplexity scores against those of real-world datasets. Our findings reveal that while noise significantly disrupts mBERT’s contextual learning, its removal in the synthetic dataset enables mBERT to substantially outperform NB. This highlights that noise reduction is more effective than increasing model complexity for context-dependent text classification tasks. Moreover, despite reduced noise and sentence-structure variation, the synthetic dataset preserves realistic linguistic characteristics. These results confirm that a synthetic oracle dataset provides an effective noise-free experimental environment for studying the impact of noise in text classification.
inproceedings
BibTeXKey: BKZ25
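The abstract measures corpus diversity with Self-BLEU: each sentence is scored by BLEU against all other sentences in the corpus as references, and the scores are averaged, so a lower value indicates a more diverse corpus. A minimal self-contained sketch of this metric (not the authors' implementation; it uses a simplified smoothed BLEU up to bigrams and omits the brevity penalty):

```python
# Hypothetical sketch of Self-BLEU for diversity measurement.
# Not the paper's implementation: simplified BLEU (clipped n-gram
# precision with add-one smoothing, geometric mean, no brevity penalty).
from collections import Counter
import math


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(references, hypothesis, max_n=2):
    """Smoothed BLEU of one hypothesis against multiple references."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        if not hyp_counts:
            return 0.0
        # Clip each n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        precisions.append((clipped + 1) / (sum(hyp_counts.values()) + 1))
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


def self_bleu(corpus, max_n=2):
    """Average BLEU of each sentence against the rest of the corpus."""
    scores = [
        bleu([s for j, s in enumerate(corpus) if j != i], hyp, max_n)
        for i, hyp in enumerate(corpus)
    ]
    return sum(scores) / len(scores)
```

A highly repetitive corpus (e.g. the same tweet three times) scores near 1.0, while a corpus of unrelated sentences scores much lower, which is how the paper uses the metric to compare synthetic against real-world data.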