This work gathers and analyzes data for text-based fraud detection in financial disclosures, specifically the Management’s Discussion and Analysis (MDA) sections of 10-K reports submitted to the US Securities and Exchange Commission. We provide a comprehensive overview of the process for creating the data set and introduce the resulting data set as an open-source resource for future research in financial natural language processing. We then train a range of machine learning and deep learning classifiers on the MDA text, both to provide reasonable baselines for future researchers and to offer insight into the nature of fraudulent disclosures and how such data can be used effectively to uncover fraud.
inproceedings AA25a
BibTeXKey: AA25a