Home  | Publications | BKZ25

Generating Synthetic Oracle Datasets to Analyze Noise Impact: A Study on Building Function Classification Using Tweets

MCML Authors

Abstract

Geo-tagged tweets collected at the building level has patterns that aid in building function classification. However, this data source suffers from substantial noise, limiting its effectiveness. Conducting a systematic noise analysis requires a noise-free environment, which is difficult to obtain from real-world data. In this study, we propose an approach using an LLM-generated synthetic oracle dataset that contains only correctly assigned tweets aligned with their respective buildings. To make the dataset reflects real-world distributions, we use a data generation pipeline that integrates data attributes from real world into LLM prompts. To evaluate the utility of the synthetic dataset for noise analysis, we compare the performance of Naïve Bayes (NB) and mBERT classifiers on it against real-world noisy data. Additionally, we assess the dataset’s diversity by comparing Self-BLEU and perplexity scores against those of real-world datasets. Our findings reveal that while noise significantly disrupts mBERT’s contextual learning, its removal in the synthetic dataset enables mBERT to substantially outperform NB. This highlights that noise reduction is more effective than increasing model complexity for context-dependent text classification tasks. Moreover, despite reduced noise and sentence structure variations, the synthetic dataset preserves realistic linguistic characteristics. These results confirm that a synthetic oracle dataset provides an effective noise-free experimental environment for studying noise impact in text classification.

inproceedings


ECIR 2025

47th European Conference on Information Retrieval. Lucca, Italy, Apr 06-10, 2025.
Conference logo
A Conference

Authors

S. Bai • A. Kruspe • X. Zhu

Links

PDF

Research Area

 C3 | Physics and Geo Sciences

BibTeXKey: BKZ25

Back to Top