Can Large Language Models Advance Occupational Coding? Evidence and Methodological Insights

MCML Authors

Olga Kononykhina

→ Group Frauke Kreuter
Social Data Science and AI

Dr. Malte Schierholz

→ Group Frauke Kreuter
Social Data Science and AI

Abstract

Occupational coding is a critical funnel between open-ended job descriptions and the statistical frameworks that shape employment research and policies. Automatic coding tools, whether rule-based or based on machine learning (ML), have streamlined the process and demonstrated promising results. Yet ML approaches typically require extensive, high-quality training data that exceed what a typical national survey can provide and that fall under data protection constraints. This study asks whether mainstream large language models (LLMs) can serve as a viable alternative, largely bypassing the need for exhaustive training data and requiring only some coding skills and API access. We created embeddings for the standardized German (KldB) job descriptions, then embedded respondents’ own words (e.g., “doctor”) from a representative German survey. Cosine similarity was used to find the five most likely occupational codes for each response. To assess performance, we compared the LLM-based suggestions with those from a German ML occupational coding tool (OccuCoDe), using professional manual coding as our benchmark. Results show that in 55% of the cases, both the LLM and OccuCoDe included the correct code among their top five suggestions. However, there was limited overlap: in 60% of the cases, the two tools shared at most two of their five recommended codes. While OccuCoDe more frequently placed the correct code as the first suggestion, LLM embeddings suggested the correct occupation in 45% of the cases where OccuCoDe did not provide any result. Additionally, LLM performance was sensitive to minor changes in job descriptions (e.g., capitalisation or gendered job titles) and sometimes showed “embedding drift,” raising reproducibility concerns.
Our findings highlight LLMs’ promise as a complement to, or substitute for, other tools for occupational coding in contexts with limited training data, while underscoring critical limitations that must be addressed before fully entrusting them with classifying the work we do.
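The retrieval step described in the abstract (rank occupational codes by cosine similarity to a response embedding, keep the top five, then count how many codes two tools share) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the embedding model, so the vectors below are hand-made toy values, the code labels (`code_A` etc.) are placeholders rather than real KldB codes, and the function names `cosine_similarity`, `top_k_codes`, and `shared_codes` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k_codes(
    response_vec: Sequence[float],
    code_vecs: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k codes whose embeddings are most similar to the response."""
    scored = [
        (code, cosine_similarity(response_vec, vec))
        for code, vec in code_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def shared_codes(suggestions_a: Sequence[str], suggestions_b: Sequence[str]) -> int:
    """Number of codes that appear in both suggestion lists."""
    return len(set(suggestions_a) & set(suggestions_b))


# Toy data (assumption): in practice these vectors would come from an
# embedding model applied to the classification descriptions and to the
# respondent's answer.
code_vecs = {
    "code_A": [1.0, 0.0, 0.0],
    "code_B": [0.0, 1.0, 0.0],
    "code_C": [0.7, 0.7, 0.0],
}
response_vec = [0.9, 0.1, 0.0]

ranked = top_k_codes(response_vec, code_vecs, k=2)
ranked_codes = [code for code, _ in ranked]
overlap = shared_codes(ranked_codes, ["code_C", "code_B"])
```

With real embeddings, `k=5` would reproduce the five-suggestion setup, and `shared_codes` applied to the two tools' lists gives the per-response overlap that the abstract summarises.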

inproceedings KS25

ESRA 2025

11th Conference of the European Survey Research Association. Utrecht, The Netherlands, Jul 14-18, 2025.

Authors

O. Kononykhina • M. Schierholz

Research Area

C4 | Computational Social Sciences

BibTeXKey: KS25
