Constructing Korean Patent Retrieval Datasets to Improve Deep Learning-Based Patent Retrieval Performance: An Automated Methodology

Donguk Lee; Woochul Sim; Jinwoo Park; Bonggun Lee

Research Article

Dong-Uk Lee¹, Woo-Chul Sim², Jin-Woo Park³, Bong-Gun Lee⁴

¹Associate of Intelligent Information Strategy Department, Korea Institute of Patent Information, Republic of Korea
²Assistant Manager of Intelligent Information Strategy Department, Korea Institute of Patent Information, Republic of Korea
³Manager of Intelligent Information Strategy Department, Korea Institute of Patent Information, Republic of Korea
⁴Head of Intelligent Information Strategy Department, Korea Institute of Patent Information, Republic of Korea

Correspondence to Bonggun Lee, E-mail: bglee@kipi.or.kr

Volume 21, Number 1, Pages 151-180, March 2026.
Journal of Intellectual Property 2026;21(1):151-180. https://doi.org/10.34122/jip.2026.21.1.151
Received on December 23, 2025, Revised on January 15, 2026, Accepted on March 06, 2026, Published on March 30, 2026.
Copyright © 2026 Korea Institute of Intellectual Property.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives (https://creativecommons.org/licenses/by-nc-nd/4.0/) which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.

Abstract

Owing to the difficulty of constructing large-scale datasets and the scarcity of Korean-language resources, recent deep learning-based patent retrieval research faces limitations in improving model performance. To address these challenges, this study proposes a methodology for automatically building a large-scale patent retrieval dataset from Korean patent documents. The method uses the claim comparison tables in office action notices to automatically extract semantically related pairs of technical components between patent applications and cited prior art. In addition, the sentences most similar to each technical component are extracted from both the patent application and the cited prior art documents. Korean patent XML parsing techniques are combined with a KorPatBERT-based CPC classification model, and a hybrid similarity measure integrating sentence embedding-based semantic similarity with lexical similarity is employed. Using this approach, a large-scale, high-quality dataset approximately 19 times larger than a manually constructed expert dataset was built and validated through large-scale experiments simulating real-world retrieval environments. Experimental results indicate that models trained on the automatically constructed dataset achieved Top-70 accuracy comparable to or better than that of models trained on expert-built datasets. Accordingly, this study presents a practical and cost-effective approach for constructing high-quality Korean patent retrieval datasets and demonstrates improved performance and real-world applicability.
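As an illustration of the hybrid similarity idea described above, the following minimal sketch combines a cosine similarity over sentence embeddings with a token-overlap (Jaccard) score and picks the sentence that best matches a technical component. It is not the authors' implementation: the weighting parameter `alpha`, the choice of Jaccard overlap as the lexical measure, the whitespace tokenization, the function names, and the toy two-dimensional vectors standing in for real sentence embeddings are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(a: str, b: str) -> float:
    """Lexical overlap on whitespace tokens (an assumed, simplified tokenizer)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def hybrid_score(comp_text: str, comp_vec: Sequence[float],
                 sent_text: str, sent_vec: Sequence[float],
                 alpha: float = 0.5) -> float:
    """Weighted mix of semantic and lexical similarity; alpha is hypothetical."""
    semantic = cosine(comp_vec, sent_vec)
    lexical = jaccard(comp_text, sent_text)
    return alpha * semantic + (1.0 - alpha) * lexical


def most_similar_sentence(comp_text: str, comp_vec: Sequence[float],
                          sentences: List[Tuple[str, Sequence[float]]],
                          alpha: float = 0.5) -> int:
    """Return the index of the sentence with the highest hybrid score."""
    return max(
        range(len(sentences)),
        key=lambda i: hybrid_score(comp_text, comp_vec,
                                   sentences[i][0], sentences[i][1], alpha),
    )


# Toy data: the vectors are placeholders for real sentence embeddings.
component = ("rotating shaft coupled to motor", [1.0, 0.0])
candidates = [
    ("the housing is made of plastic", [0.0, 1.0]),
    ("a motor drives the rotating shaft", [0.9, 0.1]),
]
best = most_similar_sentence(component[0], component[1], candidates)
print(candidates[best][0])
```

In practice the embeddings would come from a sentence encoder and the lexical score from a tokenizer suited to Korean; both are left abstract here because the abstract does not specify them.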

Keywords

Automatic Dataset Construction, CPC Classification, KorPatBERT, Patent Retrieval, Patent Similar Technical Component Dataset, Semantic Similarity

Notes

Conflicts of Interest

No potential conflict of interest relevant to this article was reported.

Funding

The authors received manuscript fees for this article from the Korea Institute of Intellectual Property.

Journal of Intellectual Property (J Intellect Property; JIP)

KCI Indexed
OPEN ACCESS, PEER REVIEWED

pISSN 1975-5945

eISSN 2733-8487
