On the Impact of Cross-Domain Data on German Language Models

Fast facts

Internal authorship
Prof. Dr.-Ing. Christoph M. Friedrich
Further authors
A. Dada, A. Chen, C. Peng, K. E. Smith, A. Idrissi Yaghir, C. M. Seibold, J. Li, D. Truhn, J. Egger, J. Bian, J. Kleesiek, Y. Wu
Publication
- 2023
- Volume Findings of the Association for Computational Linguistics: EMNLP 2023
Title of the conference proceedings
Findings of the Association for Computational Linguistics: EMNLP 2023
Organizational unit
Computer science
Subjects
- Applied computer science
Research fields
- Medical Informatics (MI)
Publication format
Conference paper

Citation

A. Dada, A. Chen, C. Peng, K. E. Smith, A. Idrissi Yaghir, C. M. Seibold, J. Li, C. M. Friedrich, D. Truhn, J. Egger, J. Bian, J. Kleesiek, and Y. Wu, "On the Impact of Cross-Domain Data on German Language Models," in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 13801-13813 [Online]. Available: https://aclanthology.org/2023.findings-emnlp.922/

Content

Traditionally, large language models have been trained either on general web crawls or on domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. By training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements of up to 4.45% over the previous state-of-the-art.

About the publication

Linked publications and references

https://aclanthology.org/2023.findings-emnlp.0

Fachhochschule Dortmund
