Exploiting Language Characteristics for Legal Domain-Specific Language Model Pretraining

Inderjeet Nair, Natwar Modani


Abstract
Pretraining large language models has resulted in tremendous performance improvements for many natural language processing (NLP) tasks. While such models can be used directly for non-domain-specific tasks, a common strategy to achieve better performance on a specific domain is to further pretrain these language models over domain-specific data using objectives like Masked Language Modelling (MLM), Autoregressive Language Modelling, etc. While such pretraining addresses the change in vocabulary and style of language for the domain, it is otherwise a domain-agnostic approach. In this work, we investigate the effect of incorporating pretraining objectives that explicitly try to exploit domain-specific language characteristics in addition to such MLM-based pretraining. In particular, we examine two distinct characteristics associated with the legal domain and propose pretraining objectives modelling these characteristics. The proposed objectives target improved token-level feature representations and aim to incorporate sentence-level semantics. We demonstrate that models pretrained using our objectives outperform those trained using domain-agnostic objectives on several legal downstream tasks.
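For context, the domain-agnostic baseline the abstract contrasts against is continued MLM pretraining over a domain corpus. Below is a minimal sketch of that baseline, assuming the HuggingFace transformers and datasets libraries; the base model, corpus file (legal_corpus.txt), and hyperparameters are illustrative placeholders, and the paper's proposed legal-specific objectives are not shown here.

# Minimal sketch of domain-adaptive MLM pretraining (illustrative only;
# corpus path and hyperparameters are hypothetical, not from the paper).
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical legal-domain corpus: one document per line.
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

# Standard MLM objective: randomly mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-mlm",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()

The paper's contribution is to add objectives on top of this setup that model legal-domain language characteristics at the token and sentence level.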
Anthology ID:
2023.findings-eacl.190
Volume:
Findings of the Association for Computational Linguistics: EACL 2023
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2516–2526
URL:
https://aclanthology.org/2023.findings-eacl.190/
DOI:
10.18653/v1/2023.findings-eacl.190
Cite (ACL):
Inderjeet Nair and Natwar Modani. 2023. Exploiting Language Characteristics for Legal Domain-Specific Language Model Pretraining. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2516–2526, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Exploiting Language Characteristics for Legal Domain-Specific Language Model Pretraining (Nair & Modani, Findings 2023)
PDF:
https://aclanthology.org/2023.findings-eacl.190.pdf
Video:
https://aclanthology.org/2023.findings-eacl.190.mp4
