The cluster approach to acquiring patent datasets and assessing the quality of “prior art search”
https://doi.org/10.33186/1027-3689-2025-5-58-80
Abstract
As the global patent collection grows, so does the complexity of searching patent documents to assess the novelty of an invention, that is, to identify the relevant prior art in public patent data. Searching this vast and complex body of information is challenging. Research findings show that natural language processing (NLP) is increasingly used to make patent search more accurate and comprehensive. Despite many advances, no automated patent search system with adequate accuracy and completeness has yet been introduced. The author argues that the development of new, effective approaches to designing such systems is significantly limited by the lack of datasets ready for training and testing. Automated acquisition of datasets of arbitrary configuration (taking into account selection criteria such as documents from one or more patent offices; all documents published within a limited period; document types; and patent classification classes) would remove these limitations and make it possible to build datasets that meet the needs and goals of system designers. The author proposes new approaches to acquiring datasets, testing automated prior art search systems, and assessing these systems.
About the Author
A. V. Gorbunov
Russian Federation
Alexander V. Gorbunov – Head, National Center for Artificial Intelligence
Moscow
For citations:
Gorbunov A.V. The cluster approach to acquiring patent datasets and assessing the quality of “prior art search”. Scientific and Technical Libraries. 2025;(5):58-80. (In Russ.) https://doi.org/10.33186/1027-3689-2025-5-58-80