Do Companies Reveal Their Own Fraud? - A Novel Data Set for Fraud Detection Based on 10-K Reports

Abstract

This work aims to gather and analyze data for text-based fraud detection using data from financial disclosures – specifically, the Management’s Discussion and Analysis (MDA) sections of 10-K reports submitted to the US Securities and Exchange Commission. We provide a comprehensive overview of the process for creating the data set and introduce the resulting data set as an open-source resource for future research in the financial natural language processing domain. We subsequently train a range of machine learning and deep learning classifiers on the MDA text, intending to provide reasonable baselines for future researchers and to offer insight into the nature of fraudulent disclosures and how such data can be effectively used for uncovering fraud.

inproceedings AA25a


FinNLP @EMNLP 2025

10th Workshop on Financial Technology and Natural Language Processing at the Conference on Empirical Methods in Natural Language Processing. Suzhou, China, Nov 04-09, 2025.

Authors

M. Amin • M. Aßenmacher

Links

DOI

Research Area

 A1 | Statistical Foundations & Explainability

BibTeXKey: AA25a
