Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Automated patent classification for crop protection via domain adaptation

Automated patent classification for crop protection via domain adaptation INTRODUCTIONPatents corpora1,2 are a valuable resource that shows how technology evolves over time. This is documented by the volume of granted patents every year, for example, over 300K patents per year in the last 6 years for the sole United States Patent and Trademark Office (USPTO).3 Monitoring this throughput is critical for capturing trends and developing domain‐specific knowledge bases that can be used to organize information and accelerate the discovery process. This calls for the development of automated systems for information retrieval and patent literature search that can be easily adapted to fit the needs of different fields. The classification of patents is a critical step in the design of such systems. To this end, the World Intellectual Property Organization (WIPO) introduced the International Patent Classification (IPC) system, a hierarchical system of codes (language independent) that classify patents based on the different technological areas that they cover.4In this paper, we address the problem of classifying patents relying solely on the content of the inventions and using language models (LMs) and domain adaptation strategies. Domain adaptation in transformers is the preferred strategy for pushing the performance boundaries of transformer‐based models in domains that are highly differentiated from the pretraining corpora.5 We investigate the application of various domain adaptation strategies using several LMs based on Transformer models from the Bidirectional Encoder Representations from Transformers (BERT) family,6 including SciBERT,7 BERT‐like models adapted in the patent domain using adaptive pretraining, BERT‐like models fine‐tuned using adapters,8 as well as combinations of the above. We evaluate the proposed approaches in terms of precision, recall, and F1‐score under two different scenarios. First, we rely on an existing baseline dataset including patents from USPTO. Then, we focus on a specific use case originated from the crop protection industry domain. Our analysis allowed us to identify fine‐tuning recipes that ensure robust performance.Our approach for patent classification outperforms the state‐of‐the‐art. We depart from the standard solely IPC‐based model evaluation by introducing an evaluation based on actual use‐cases and labels that do not conform to the IPC hierarchy. Furthermore, we examine and evaluate the use of our domain adapted methodologies in a multilingual setup of patent classification. On top of that, we assess the effectiveness of such methods on classifying patents made available in 2021. Finally, we establish two new patent based BERT‐like models, namely domain‐adaptive patent BERT or dapBERT and domain‐adaptive patent SciBERT or dapSciBERT, that can be leveraged for any Natural Language Processing (NLP) task related to the patent domain. We have made the code, the models and the dataset of our work available at https://github.com/GT4SD/domain-adaptive-patent-classifier.RELATED WORKNumerous methods for patent classification have been introduced, as well as many baseline datasets.9 Early attempts proposed k‐Nearest Neighbor,10 support vector machine,10,11 Naive Bayes10,11 or neural networks.12 CNNs (Convolutional Neural Networks) and various word embeddings13–15 have also been successfully combined for the task. Recent trends indicate an increased emphasis on fine‐tuned pretrained LMs, with ULMFit16 and BERT17 based methods being the state of the art. PatentBERT17 is the work that comes closest to our approach. However, while PatentBERT relies on an existing pretrained BERT model and performs patent classification using standard recipes for fine‐tuning, we begin by adapting a BERT‐like model to the patent domain before fine‐tuning the classification task. This way, we ensure that the LM being used is domain‐aware.The content of an invention is critical when it is used as input for the patent classification method. The methods described previously act on different parts of the document. DeepPatent13 utilizes titles and abstracts while PatentBERT17 has been developed using claims and titles or abstracts. The full patent text, including title, abstract, description, and claims, has being evaluated in other attempts.14 In general, the title and abstract sections are more informative than the full‐text representation of the patent document.11 Additionally, focusing exclusively on titles and abstracts has the advantage of being the two most easily accessible sections of a patent. This paper follows the same strategy and does not use full texts. Only the title, abstract, or both are used during processing.PATENT CLASSIFICATION USING BERT‐LIKE MODELSA pretrained BERT model fine‐tuned on labeled data for patent classification represents state‐of‐the‐art performance in patent classification. While pretrained LMs have been shown to be more robust to out‐of‐distribution generalization than previous models,18 they are still ill‐equipped to deal with data that differ significantly from what was observed during pretraining. Patent corpora are a clear example: the unique syntax and vocabulary of patent applications may differ significantly from those used for pretraining. To address the distribution mismatch, we avoid performing an expensive and resource‐intensive pretraining of a BERT‐like model from scratch and instead examine four alternative approaches (i) adoption of a pretrained BERT‐like model trained on corpora with similar vocabulary (vocabulary adaptation), (ii) domain‐adaptive pretraining on the domain of interest, (iii) adapters8 based fine‐tuning, and (iv) combination of the above options. Figure 1 depicts the four different approaches under investigation.1FIGUREPatent classification based on the four different approaches. Case (i) depicts the standard task‐specific fine‐tuning approach that can be used leveraging any BERT‐like model. Method of case (ii) performs a domain adaption of a BERT‐like model prior to the task‐specific fine‐tuning. Cases (iii) and (iv) utilize adapters without or with domain‐adaptive pretraining respectively.Vocabulary adaptationA domain‐specific vocabulary is crucial to generate meaningful word embeddings, and it is thus essential in the majority of NLP applications. There is a strong correlation between poor NLP models performance on unfamiliar domains and the effects of out‐of‐vocabulary words.18 However, including a significant number of domain‐specific tokens in a BERT‐like model requires the inclusion of the same number of uninitialized embedding vectors in the model that must be learned during training. Incorporating such a large number of embedding vectors is impractical in the standard fine‐tuning scenario, and pre‐training from scratch would be required to learn appropriate new token representations. As an alternative, we use the existing BERT variant that is closest to the patent domain as base for standard fine‐tuning. In the following, we consider SciBERT.7 SciBERT was trained on scientific documents, and while their structure and syntax do not resemble those of a patent, their vocabulary is more relevant to the domain of interest than a standard pretrained BERT. To demonstrate this, we analyzed the vocabulary overlap of a patent corpus and corpora similar to those used in the trainings of BERT and SciBERT models. Specifically, we used 150K patent abstracts as the patent corpus, 150K texts from the BookCorpus19 and English Wikipedia as the BERT corpus, and 150K texts from Semantic Scolar20 as the SciBERT corpus. All texts are in English. For each of them, we found the 10K most common words excluding stop‐words, and then we examined the overlap of the respective sets. Figure 2a depicts the analysis of the vocabulary overlap resulting in a greater similarity between the patent corpus and SciBERT than BERT.2FIGUREOverlap between different corpora vocabularies and patent labels using Venn diagrams.Domain‐adaptive pretrainingDomain‐adaptive pretraining5 involves fine‐tuning the model on additional unlabelled data, originated from the domain of interest, using a pretraining objective prior to task‐specific fine‐tuning. This operation aims to shift a pretrained model towards the domain of interest and is achieved by performing just a few extra epochs of training on the domain‐specific data in order to avoid degradation of the model's general language capabilities. Here, the training dataset consists of 10.000.000 patent abstracts downloaded from WIPO and USPTO.AdaptersAdapters8 represent an alternative fine‐tuning strategy that relies on optimizing a small set of additional newly initialized weights at every layer of the transformer. These weights are trained during fine‐tuning, while the pretrained parameters of the transformer model are frozen. This strategy has two significant advantages. To begin, by freezing the entire base model, we can significantly reduce the amount of computation required during training. Additionally, training many task‐specific adapters using the same base model enables an efficient parameter sharing between the different tasks. In this case, the memory footprint of the application can be reduced because we only need to store the extra weights for each task, not the entire fine‐tuned model. In our work, we utilize the adapters configuration presented by Pfeiffer et al.21 which places an adapter layer only after the feed‐forward block of each Transformer layer.Combining methodsAll of the methods outlined above can also be combined in a two‐phase approach to improve performance. The first phase focuses on domain adaptation. The selected model is adapted to the patent domain by performing domain‐adaptive pretraining. This phase produces a LM that is specialized for a patent corpus. This model is not tailored for patent classification and can be used for any downstream NLP task. In the second phase, we perform task‐specific fine‐tuning using a cross‐entropy loss. For the second phase, we have two alternatives to perform the classification: utilizing adapters or following the standard approach of attaching a classification head as output in the existing architecture.RESULTSWe used a newly established patent dataset focusing on the crop industry domain and a standard USPTO based dataset for the evaluation of the different models. The former data set reveals the performance of the models in actual use cases belonging to the agrochemical domain. The latter dataset serves as a standard baseline in the IPC classification task, the most documented case in the literature.As baselines, we used a standard fine‐tuned BERT model similar to Lee et al.17 and CNN based methods that have been presented in Roudsari et al.15 relying on GloVe22 and FastText embeddings.23 We compared them with SciBERT, as a vocabulary adaptation approach, dapBERT, as a domain adapted pretraining method and BERT fine‐tuned using adapters. In addition, we included dapSciBERT, SciBERT + adapters, dapBERT + adapters, and dapSciBERT + adapters methods to highlight possible combinations of methods that can take place. All approaches that are based on standard task‐specific fine‐tuning have a classification head consisting of a dense layer, with ReLU activation function and dropout, plus the output layer. The fine‐tuning was performed for five epochs. When it comes to the methods that include adapters, the classification head consisted of only the output layer and the adapters‐based fine‐tuning had been performed for 30 epochs. The full details regarding the training hyperparameters can be found in Appendix C (Table C1).DapBERT and dapSciBERT are two BERT‐like models trained based on the domain adaptive pretraining method described above for the patent domain. Even though Gururangan et al.5 indicates that only a single pass on the domain dataset is required, we investigated whether additional training is beneficial for domain adaptation. We evaluated the versions of the adapted models that were trained for three epochs, as these versions yielded the best results. Additional training epochs did not improve the performance (see Appendix A: Table A7). To train these models, we relied on the Generative Toolkit for Scientific Discovery (GT4SD) library24 and its LM trainer.Crop protection industry datasetThe discovery of fungicides, insecticides, and herbicides is the primary focus for crop protection research.25 To keep up to date with emerging trends, the recent literature has to be followed. The manual identification of relevant articles is time‐consuming and therefore, an automated identification and correct categorization mechanism would deliver a key element towards a more efficient process.In patents, the IPC hierarchy provides a general classification of the patents based on their topics, yet this classification is not always aligned with important domain specific categories. Figure 2b highlights this aspect focusing on insecticides, herbicides, and fungicides as the three main crop protection categories. These three categories, have a high degree of overlap in terms of IPC codes, indicating that we cannot distinguish these labels relying solely on the IPC hierarchy.There is a large amount of patent data available in the public domain. Patents starting from 2012 to 2020 were analyzed and classified manually into three categories—insecticide, fungicide, and herbicide—plus an extra no‐class label with irrelevant patents, leading to a data set with 9976 entries. Specifically, the dataset contained 3393 Insecticide related patents, 1518 Fungicide related patents, 1512 Herbicide related patents and 3553 irrelevant (no class) patents. This data set was used to build and evaluate various four class models based on the methodologies described above. The evaluation included a 10‐fold cross‐validation.Figure 3 presents the results of the evaluation focusing on the macro F1‐score. We examined three distinct input cases: title, abstract, and title + abstract, and included only SciBERT based adapted models in the results due to their superior performance. Tables with the full evaluation, including all the investigated models and all the metrics, can be found in Appendix A: Table A6. The results show that any kind of domain adaptation can be beneficial for the task and all of them outperform the baselines. The best performance is observed by dapSciBERT + adapters method using as input both the title and the abstract of a patent. An approach that has the extra advantage of fewer storage requirements in case of multiple classification cases should be accommodated by the same system. Furthermore, the results dictate that the abstract is a way more informative part of a patent than the title for the classification task. To verify the significance of the difference between the baselines and our proposed alternatives we performed statistical tests to compare the mean value of the best baseline method which is the finetuned BERT model and our best variant in terms of both performance and methodological advantages which is the dapSciBERT + adapters. A Wilcoxon Signed‐Ranks Test indicated that dapSciBERT + adapters method has greater mean value than BERT methods using as input title (statistic = 50.0, p‐value = 0.01), abstract (statistic = 55.0, p‐value = 0.01) and title + abstract (statistic = 52.0, p‐value = 0.01). In general, the differences in terms of mean F1‐score between the methods are not large, yet even a 1% or 2% increase in such scenarios is highly important as this improvement is translated into tens of thousands of further correctly classified incoming patents in automated streams that should handle every year millions of patents.3FIGUREMacro F1‐score of the presented models in the English agrochemical related case based on different input text. The error bars represent the standard deviation measured for each model.Multilingual patent classificationWe also extended the model evaluations to determine the applicability of the previously described methodologies in a multilingual scenario. In fact, a reliable multilingual classifier could be a game changer for the field, as it could classify patents submitted to a large number of different speaking patent offices in a massively, quickly, and inexpensive way, without the need for additional tools such as translators. We relied on BERT's multilingual version and re‐evaluated all the different adaptation techniques described previously (adapters, domain adapted models, and their combination), training also an adapted BERT multilingual model for the patent domain called dapBERT‐multi. For the multilingual case, there are no available LMs trained on scientific or other relevant domains, thus we confined our experiment to the multilingual BERT variant. To train the dapBERT‐multi model, we followed the domain adaptive pretraining method described above for the patent domain. We used a multilingual corpus of 17,476,660 patent abstracts for the adapted pretraining. The patents collected covered 14 languages, including English, Chinese, French, Korean, Japanese, and German. We trained the model, fine‐tuning all weights, for three epochs.The methods were evaluated using the same case as in the English version. Nonetheless, a sizable portion of the patents was chosen in their alternate non‐English language version. In total the multilingual version of the dataset had 9989 patents with 47% not written in English. The multilingual dataset contained 3408 Insecticide related patents, 1519 Fungicide related patents, 1520 Herbicide related patents and 3542 irrelevant (no class) patents. Figure 4 presents the performance of our proposed methods. We chose the multilingual BERT as a baseline, as well as a CNN approach based on the multilingual BERT's embeddings. We used the same evaluation strategy as described previously for monolingual models. In general, the adapted methods outperform the baselines, with the performance resembling closely the same pattern as the English version. The combination of an adapted model and adapters (dapBERT‐multi + adapters) presents the best results which is more than 2% better than the baselines. Wilcoxon Signed‐Ranks Test also verifies our findings, specifically it proves that dapBERT‐multi + adapters method has greater mean value than multiBERT method using as input title (statistic = 50.0, p‐value = 0.01), abstract (statistic = 48.0, p‐value = 0.02) and title + abstract (statistic = 53.0, p‐value = 0.03). We also observe that the multilingual models only have slightly worse performance than the monolingual English models even if the amount of data that has been used is the same and the dominant language in the multilingual dataset is still English. This observation underlines the power of language generalization that these LMs hold and their potential in such cases and domains.4FIGUREMacro F1‐score of the presented models in the multilingual versions of the agrochemical case based on different input text. The error bars represent the standard deviation measured for each model.Evaluation under real‐life conditionsIn a real case scenario, a model is trained on past data and is applied to predict incoming novel data. To this end, the classification model trained on data up to 2020 was applied to patents made available in 2021. This exercise focused on patents related to human necessities or chemistry, which are defined by their assigned IPC codes, leading to a corpus of 78,712 patents. For the patents in the list that were not written in English, we relied on the respective English version that was retrieved through Google patents. The best performing model (dapSciBERT + adapters) was subsequently applied to label the patents into the three classes (fungicide, herbicide, insecticide) or to provide the no‐class label. The results of the predictions were compared to the classification obtained from subject matter experts. An interesting observation is that using our model we could identify errors that occurred in five cases during the manual annotation. In addition, 108 patents that the model classified as insecticide, herbicide, or fungicide contained relevant keywords but were not classified accordingly by the subject matter experts. Of course, patents may contain relevant keywords without being relevant, but this could indicate that some relevant patents were missed during the manual assessment. Undoubtedly, the volume of false positive examples indicates that a patent classifier cannot yet be used as a standalone method to cherry‐pick patents of interest, given an incoming patent stream. Nevertheless, such a model can facilitate the process and significantly reduce the volume of patents that require manual inspection. The model was also compared to a baseline model. As a baseline, we relied again on a fine‐tuned BERT model for patent classification. Furthermore, to investigate multilingual extensions, we repeated the same experiment using a multilingual version of the same corpus in which the original language of each patent is used as input text, corresponding to 58% of the patents. Table 1 presents the overall results. The confusion matrices of classification results for each model can be found in Appendix A: Table A6. For both only‐English and multilingual cases, our model outperforms the baseline in terms of the F1 score. The fine‐tuned BERT baselines retrieve a few more correct instances in both cases, yet our proposed models manage to provide significantly fewer false positive instances. Thus, the percentage of actual patents of interest predicted with our models is much higher than the baselines. IPC‐code filtering or ensemble classifiers could improve even more the performance and reduce or eliminate the need for manual intervention. The comparison between English and multilingual classifiers' performance seconds the previous investigations' findings. It indicates that even if both variants have remarkable performance, using the English‐only classifier leads to slightly better performance. This performance can be attributed to the fact that we relied on a standard multilingual BERT as there is no available multilingual BERT‐like model trained solely on scientific domains.1TABLEEvaluation of the performance of different monolingual or multilingual classifiers in a stream of patents made available in 2021.ModelF1‐scoreCorrect foundTotal found (correct + missclassified)False positivesBERT0.947548124786dapSciBERT + adapters0.987427861870BERT‐multi0.947538114521dapBERT‐multi + adapters0.967397853352Total patents of interest in 2021839Note: The groundtruth is originated from manual annotations made by experts. The utilized metrics are the F1‐score, the number of correct classified patents (correctly classified patents of interest), the total number of found patents (found patents of interest including cases with wrong label assignment) and the number of false positive patents (generally irrelevant patents that be assigned to one of the categories).USPTO datasetWe further evaluated our methods based on an USPTO based dataset used in Roudsari et al.15 It contained 235,858 patents submitted in 2014 as training set and 42,321 patents submitted in 2015 as testing set. The datasets contained both the title and the abstract of a patent to identify the associated IPC subclass labels of it. The target labels were 89 IPC subclasses. More information about the dataset generation process can be found in Roudsari et al.15Table 2 summarizes the comparison of the different models based on micro averaging precision, recall, and F1‐score as well as coverage error. Coverage error is a metric that depicts how far we need to go down a ranked list of categories on average to account for all the true positive categories. The results for the embedding‐based (CBOW, GloVe and Skip‐gram) and GPT‐2‐based methods have been extracted from Roudsari et al.152TABLECoverage error and micro averaging precision, recall and F1‐score of all the evaluated methods, considering the USPTO dataset benchmark proposed in Roudsari et al.15ModelMacroCoverage errorPrecisionRecallF1CBOW0.640.460.524.50GloVe0.680.420.514.36Skip‐gram0.740.420.523.92FastText0.760.510.603.87GPT‐20.760.490.593.90BERT0.760.520.603.52SciBERT0.760.530.623.46BERT + adapters0.770.520.613.40SciBERT + adapters0.780.530.623.29dapBERT0.770.540.633.31dapSciBERT0.770.550.633.28dapBERT + adapters0.790.530.623.19dapSciBERT + adapters0.780.540.633.15Note: In bold the top‐3 values for each metric.Micro averaging of the above metrics, as well as evaluation at top‐1 and top‐5 predictions, have been also examined following the exact same evaluation process of Roudsari et al.15 and the results reveal a similar performance with the macro averaging results, are available in the Appendix A: Table A7.The USPTO evaluation suggests that BERT‐like approaches outperform the competition and that all of the presented domain adaptation based methods have the potential to improve the performance even further. The leveraging of SciBERT or adapters for vocabulary adaptation offers marginal benefits in the overall task performance in comparison to standard BERT finetuning. The improvement became more significant when we utilized the domain‐adapted models. Specifically, dapSciBERT had more than 2% improvement comparing to BERT in the majority of the utilized metrics. The combination of Patent based models and adapters can improve even more the performance of the models, especially in terms of coverage error. Overall, based on the experiments the best approach was the dapSciBERT + adapters which achieved a significant improvement in comparison to the rest CNN + embedding methods and the BERT baseline.CONCLUSIONPatent classification is a fundamental step in many patent analysis or patent generation pipelines. Improving the performance of classification methods is a critical task, and domain adaptation of transformers appears to be a promising direction. In this paper, we propose and investigate different methods for patent classification relying mainly on domain adaptation. The domain adaptive pretraining demonstrated the best results in the test cases and its performance can be furthermore improved if we select a pretrained base model with vocabulary closer to our domain, such as SciBERT. Additionally, the domain‐adapted LM generated in the first phase can be fine‐tuned and used for any downstream NLP task. When combined with already domain‐adapted models, the use of adapters results in the same or even better performance. This finding, combined with their lightweight characteristics, such as requiring fewer training resources and less storage space, makes them an appealing option, particularly when it is required to develop multiple classification schemes for a single domain. We further utilized and examined the use of domain adaptation techniques for multilingual patent classification. The multilingual performance follows a similar pattern to the English‐written patents. The same level of performance in both multilingual and English cases highlights the strength of the proposed methods for patent classification and the great application potential that they can serve. To push further performance boundaries, future steps may include the exploration of additional domains,5 vocabulary26 adaptation methods or the use of patents' metadata.DATA AVAILABILITY STATEMENTThe data that support the findings of this study are openly available and can be found at https://github.com/GT4SD/domain-adaptive-patent-classifier.REFERENCESWIPO. 2023. Accessed January 10, 2023. https://www.wipo.int/portal/en/index.htmlUSPTO. 2023. Accessed January 10, 2023. https://www.uspto.govU.S. Patent Statistics Chart Calendar Years 1963–2020; 2023. Accessed January 10, 2023. https://www.uspto.gov/web/offices/ac/ido/oeip/taf/us_stat.htmWIPO. Guide to the International Patent Classification; 2022. https://www.wipo.int/publications/en/details.jsp?id=4593&plang=ENGururangan S, Marasović A, Swayamdipta S, et al. Don't stop pretraining: adapt language models to domains and tasks. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020:8342‐8360.Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre‐training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics; 2019:4171‐4186.Beltagy I, Lo K, Cohan A. SciBERT: a pretrained language model for scientific text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP‐IJCNLP). Association for Computational Linguistics; 2019:3615‐3620.Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter‐efficient transfer learning for NLP. Proceedings of the 36th International Conference on Machine Learning. PMLR; 2019:2790‐2799.Krestel R, Chikkamath R, Hewel C, Risch J. A survey on deep learning for patent analysis. World Patent Inform. 2021;65:102035.Fall CJ, Törcsvári A, Benzineb K, Karetka G. Automated categorization in the international patent classification. SIGIR Forum. 2003;37:10‐25.D'hondt E, Verberne S, Koster C, Boves L. Text representations for patent classification. Comput Linguist. 2013;39:755‐775.Trappey AJC, Hsu FC, Trappey CV, Lin CI. Development of a patent document classification and search platform using a back‐propagation network. Expert Syst Appl. 2006;31:755‐765.Li S, Hu J, Cui Y, Hu J. DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics. 2018;117:721‐744.Abdelgawad L, Kluegl P, Genc E, Falkner S, Hutter F. Optimizing neural networks for patent classification. Machine Learning and Knowledge Discovery in Databases. Springer International Publishing; 2020:688‐703.Roudsari AH, Afshar J, Lee S, Lee W. Comparison and analysis of embedding methods for patent documents. In: 2021 IEEE International Conference on Big Data and Smart Computing (BigComp); 2021:152‐155.Hepburn J. Universal language model fine‐tuning for patent classification. In: Proceedings of the Australasian Language Technology Association Workshop; 2018:93‐96.Lee JS, Hsiang J. Patent classification by fine‐tuning BERT language model. World Patent Inform. 2020;61:101965.Hendrycks D, Liu X, Wallace E, Dziedzic A, Krishnan R, Song D. Pretrained transformers improve out‐of‐distribution robustness. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020:2744‐2751.Zhu Y, Kiros R, Zemel RS, et al. Aligning books and movies: Towards story‐like visual explanations by watching movies and reading books. ICCV. IEEE Computer Society; 2015:19‐27.Ammar W, Groeneveld D, Bhagavatula C, et al. Construction of the literature graph in semantic scholar. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). Association for Computational Linguistics; 2018:84‐91.Pfeiffer J, Vulić I, Gurevych I, Ruder S. MAD‐X: an adapter‐based framework for multi‐task cross‐lingual transfer. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2020:7654‐7673.Pennington J, Socher R, Manning C. GloVe: global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2014:1532‐1543.Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Trans Assoc Comput Ling. 2017;5:135‐146.Manica M, Cadow J, Christofidellis D, et al. GT4SD: generative toolkit for scientific discovery. arXiv preprint arXiv:2207.03928; 2022.Umetsu N, Shirai Y. Development of novel pesticides in the 21st century. J Pestic Sci. 2020;45:54‐74.Tai W, Kung HT, Dong X, Comiter M, Kuo CF. exBERT: extending pre‐trained models with domain‐specific vocabulary under constrained training resources. Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics; 2020:1433‐1439.AAPPENDIXCrop protection industry datasetTable A1 presents the full evaluation of the models in the agrochemical related case. Furthermore, Tables A2 and A3 present the full evaluation of the models in the multilingual version of the dataset (with or without English). Lastly, Tables A4–A7 present the confusion matrices of the classification results obtained by our models or baselines in the evaluation using the patents published in 2021 and as groundtruth the annotations made by experts of the field. All tables have in bold the top value for each metric. In general, all the different adaptation techniques that have been utilized improve the performance both in terms of precision and recall. The combination of domain adaptive pretraining + adapters offers the highest boost in the performance in general.A1TABLEPerformance of the evaluated models in the English agrochemical dataset, including the standard deviation measured for each model.ModelTitlesAbstractsTitles + abstractsPrecisionRecallF1PrecisionRecallF1PrecisionRecallF1CNN + GloVe0.86 ± 0.010.84 ± 0.010.85 ± 0.010.91 ± 0.010.89 ± 0.010.90 ± 0.010.92 ± 0.010.90 ± 0.010.91 ± 0.01CNN + FastText0.87 ± 0.010.86 ± 0.010.86 ± 0.010.92 ± 0.010.90 ± 0.010.91 ± 0.010.92 ± 0.010.91 ± 0.010.92 ± 0.01BERT0.89 ± 0.020.88 ± 0.010.88 ± 0.010.92 ± 0.010.91 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.01SciBERT0.89 ± 0.020.88 ± 0.010.89 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.01BERT + adapters0.89 ± 0.020.89 ± 0.010.89 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.01SciBERT + adapters0.89 ± 0.020.88 ± 0.010.88 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.01dapBERT0.89 ± 0.010.89 ± 0.010.89 ± 0.010.91 ± 0.010.92 ± 0.010.91 ± 0.010.93 ± 0.010.93 ± 0.010.92 ± 0.01dapSciBERT0.90 ± 0.010.89 ± 0.010.89 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.01dapBERT + adapters0.89 ± 0.010.89 ± 0.010.89 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.01dapSciBERT + adapters0.89 ± 0.010.89 ± 0.010.89 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.010.94 ± 0.010.94 ± 0.01A2TABLEPerformance of the evaluated models in the multilingual agrochemical dataset, including the standard deviation measured for each model.ModelTitlesAbstractsTitles + abstractsPrecisionRecallF1PrecisionRecallF1PrecisionRecallF1CNN + BERT0.88 ± 0.010.87 ± 0.010.87 ± 0.010.91 ± 0.010.91 ± 0.010.91 ± 0.010.92 ± 0.010.91 ± 0.010.92 ± 0.01BERT‐multi0.89 ± 0.010.88 ± 0.010.88 ± 0.010.92 ± 0.010.91 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.01BERT‐multi + adapters0.89 ± 0.010.88 ± 0.010.88 ± 0.010.92 ± 0.010.91 ± 0.010.91 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.01dapBERT‐multi0.89 ± 0.010.88 ± 0.010.89 ± 0.010.92 ± 0.010.91 ± 0.010.92 ± 0.010.93 ± 0.010.92 ± 0.010.93 ± 0.01dapBERT‐multi + adapters0.89 ± 0.010.89 ± 0.010.89 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.01A3TABLEPerformance of the evaluated models in the multilingual datasets focusing on patents not written in English and using abstract and title as input.ModelPrecisionRecallF1CNN + BERT0.91 ± 0.010.88 ± 0.010.89 ± 0.01BERT‐multi0.91 ± 0.010.89 ± 0.010.90 ± 0.01BERT‐multi + adapters0.90 ± 0.010.90 ± 0.010.90 ± 0.01dapBERT‐multi0.91 ± 0.010.89 ± 0.010.90 ± 0.01dapBERT‐multi + adapters0.91 ± 0.010.90 ± 0.010.91 ± 0.01Note: The measured standard deviation for each model has also been included in the table.A4TABLEConfusion matrix of results obtained by the fined‐tuned BERT in the 2021 patent stream.ActualFungicideHerbicideInsecticideNo classPredictedFungicide17502763Herbicide616613918Insecticide3074133105No class731273,092A5TABLEConfusion matrix of results obtained by our dapSciBERT + adapters approach in the 2021 patent stream.ActualFungicideHerbicideInsecticideNo classPredictedFungicide207614697Herbicide01601206Insecticide176375967No class1343076,009A6TABLEConfusion matrix of results obtained by the fined‐tuned multilingual BERT in the 2021 patent stream.ActualFungicideHerbicideInsecticideNo classPredictedFungicide186114672Herbicide41615794Insecticide24104063055No class1051373,361A7TABLEConfusion matrix of results obtained by our multilingual dapBERT‐multi + adapters approach in the 2021 patent stream.ActualFungicideHerbicideInsecticideNo classPredictedFungicide195412554Herbicide01572293Insecticide2173872505No class1882874,530BAPPENDIXUSPTO datasetTable B1 presents the results of the different models at top 1 and top 5 predictions. Specifically, we first predict 1 and 5 labels for each patent, and then we calculate the precision, recall, and F1‐score. The results for the embedding‐based methods have been extracted from Roudsari et al.15 In addition, Table B2 depicts the comparison of the different models based on micro averaging precision, recall, and F1‐score as well as coverage error. Coverage error is a metric that depicts how far we need to go down a ranked list of categories on average to account for all the true positive categories. Various dapBERT and dapSciBERT checkpoints have been added to the table. Specifically, we compare the versions taken after 1, 3, and 10 training epochs. As it can be extracted, the checkpoints taken after 3 and 10 raining epochs performs equally, which indicates that three training epochs of adapted pretraining is enough and no further performance can be seen for the patent classification task with further training. Both tables have in bold the top value for each metric.B1TABLEPerformance of all the evaluated methods at top‐1 and top‐5 predictions considering the USPTO dataset benchmark proposed in Roudsari et al.15Model@1 (%)@5 (%)PrecisionRecallF1PrecisionRecallF1CBOW75.8056.1561.9027.6288.2640.33GloVe76.5156.7162.5127.9089.1440.73Skip‐gram78.8058.4664.4228.4990.6841.54FastText78.8758.4964.4628.4890.7041.53GPT‐280.5259.9765.9928.5190.5741.55BERT82.2550.6862.7129.1089.6543.94SciBERT83.2051.2663.4429.2890.2244.21BERT + adapters82.8551.0563.1829.2290.0244.12SciBERT + adapters83.6351.5263.7729.4690.7644.48dapBERT83.9551.7364.0129.4590.7444.47dapSciBERT84.2851.9364.2629.5691.0844.64dapBERT + adapters84.3551.9764.3229.6091.2044.70dapSciBERT + adapters84.5352.0964.4629.6891.4544.82B2TABLEPerformance of different domain adapted checkpoints in the classification task.ModelMicroCoverage errorPrecisionRecallF1CBOW0.710.550.624.50GloVe0.750.510.614.36Skip‐gram0.800.510.623.92FastText0.800.510.623.87GPT‐20.800.560.663.90BERT0.800.590.683.52SciBERT0.800.610.693.46dapBERT10.810.590.683.51dapSciBERT10.800.620.703.33dapBERT30.800.610.703.31dapSciBERT30.810.620.713.28dapBERT100.810.610.703.31dapSciBERT100.810.610.713.31CAPPENDIXTraining hyperparametersC1TABLETraining hyperparameters for the two different fine‐tuning approaches.ParameterStandard fine‐tuningAdapters based fine‐tuningLearning rate0.000020.0005Training epochs530Batch size3232Maximum input length512512OptimizerAdamAdam http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Applied AI Letters Wiley

Automated patent classification for crop protection via domain adaptation

Loading next page...
 
/lp/wiley/automated-patent-classification-for-crop-protection-via-domain-oumwOUVGjD
Publisher
Wiley
Copyright
© 2023 The Authors. Applied AI Letters published by John Wiley & Sons Ltd
eISSN
2689-5595
DOI
10.1002/ail2.80
Publisher site
See Article on Publisher Site

Abstract

INTRODUCTIONPatents corpora1,2 are a valuable resource that shows how technology evolves over time. This is documented by the volume of granted patents every year, for example, over 300K patents per year in the last 6 years for the sole United States Patent and Trademark Office (USPTO).3 Monitoring this throughput is critical for capturing trends and developing domain‐specific knowledge bases that can be used to organize information and accelerate the discovery process. This calls for the development of automated systems for information retrieval and patent literature search that can be easily adapted to fit the needs of different fields. The classification of patents is a critical step in the design of such systems. To this end, the World Intellectual Property Organization (WIPO) introduced the International Patent Classification (IPC) system, a hierarchical system of codes (language independent) that classify patents based on the different technological areas that they cover.4In this paper, we address the problem of classifying patents relying solely on the content of the inventions and using language models (LMs) and domain adaptation strategies. Domain adaptation in transformers is the preferred strategy for pushing the performance boundaries of transformer‐based models in domains that are highly differentiated from the pretraining corpora.5 We investigate the application of various domain adaptation strategies using several LMs based on Transformer models from the Bidirectional Encoder Representations from Transformers (BERT) family,6 including SciBERT,7 BERT‐like models adapted in the patent domain using adaptive pretraining, BERT‐like models fine‐tuned using adapters,8 as well as combinations of the above. We evaluate the proposed approaches in terms of precision, recall, and F1‐score under two different scenarios. First, we rely on an existing baseline dataset including patents from USPTO. Then, we focus on a specific use case originated from the crop protection industry domain. Our analysis allowed us to identify fine‐tuning recipes that ensure robust performance.Our approach for patent classification outperforms the state‐of‐the‐art. We depart from the standard solely IPC‐based model evaluation by introducing an evaluation based on actual use‐cases and labels that do not conform to the IPC hierarchy. Furthermore, we examine and evaluate the use of our domain adapted methodologies in a multilingual setup of patent classification. On top of that, we assess the effectiveness of such methods on classifying patents made available in 2021. Finally, we establish two new patent based BERT‐like models, namely domain‐adaptive patent BERT or dapBERT and domain‐adaptive patent SciBERT or dapSciBERT, that can be leveraged for any Natural Language Processing (NLP) task related to the patent domain. We have made the code, the models and the dataset of our work available at https://github.com/GT4SD/domain-adaptive-patent-classifier.RELATED WORKNumerous methods for patent classification have been introduced, as well as many baseline datasets.9 Early attempts proposed k‐Nearest Neighbor,10 support vector machine,10,11 Naive Bayes10,11 or neural networks.12 CNNs (Convolutional Neural Networks) and various word embeddings13–15 have also been successfully combined for the task. Recent trends indicate an increased emphasis on fine‐tuned pretrained LMs, with ULMFit16 and BERT17 based methods being the state of the art. PatentBERT17 is the work that comes closest to our approach. However, while PatentBERT relies on an existing pretrained BERT model and performs patent classification using standard recipes for fine‐tuning, we begin by adapting a BERT‐like model to the patent domain before fine‐tuning the classification task. This way, we ensure that the LM being used is domain‐aware.The content of an invention is critical when it is used as input for the patent classification method. The methods described previously act on different parts of the document. DeepPatent13 utilizes titles and abstracts while PatentBERT17 has been developed using claims and titles or abstracts. The full patent text, including title, abstract, description, and claims, has being evaluated in other attempts.14 In general, the title and abstract sections are more informative than the full‐text representation of the patent document.11 Additionally, focusing exclusively on titles and abstracts has the advantage of being the two most easily accessible sections of a patent. This paper follows the same strategy and does not use full texts. Only the title, abstract, or both are used during processing.PATENT CLASSIFICATION USING BERT‐LIKE MODELSA pretrained BERT model fine‐tuned on labeled data for patent classification represents state‐of‐the‐art performance in patent classification. While pretrained LMs have been shown to be more robust to out‐of‐distribution generalization than previous models,18 they are still ill‐equipped to deal with data that differ significantly from what was observed during pretraining. Patent corpora are a clear example: the unique syntax and vocabulary of patent applications may differ significantly from those used for pretraining. To address the distribution mismatch, we avoid performing an expensive and resource‐intensive pretraining of a BERT‐like model from scratch and instead examine four alternative approaches (i) adoption of a pretrained BERT‐like model trained on corpora with similar vocabulary (vocabulary adaptation), (ii) domain‐adaptive pretraining on the domain of interest, (iii) adapters8 based fine‐tuning, and (iv) combination of the above options. Figure 1 depicts the four different approaches under investigation.1FIGUREPatent classification based on the four different approaches. Case (i) depicts the standard task‐specific fine‐tuning approach that can be used leveraging any BERT‐like model. Method of case (ii) performs a domain adaption of a BERT‐like model prior to the task‐specific fine‐tuning. Cases (iii) and (iv) utilize adapters without or with domain‐adaptive pretraining respectively.Vocabulary adaptationA domain‐specific vocabulary is crucial to generate meaningful word embeddings, and it is thus essential in the majority of NLP applications. There is a strong correlation between poor NLP models performance on unfamiliar domains and the effects of out‐of‐vocabulary words.18 However, including a significant number of domain‐specific tokens in a BERT‐like model requires the inclusion of the same number of uninitialized embedding vectors in the model that must be learned during training. Incorporating such a large number of embedding vectors is impractical in the standard fine‐tuning scenario, and pre‐training from scratch would be required to learn appropriate new token representations. As an alternative, we use the existing BERT variant that is closest to the patent domain as base for standard fine‐tuning. In the following, we consider SciBERT.7 SciBERT was trained on scientific documents, and while their structure and syntax do not resemble those of a patent, their vocabulary is more relevant to the domain of interest than a standard pretrained BERT. To demonstrate this, we analyzed the vocabulary overlap of a patent corpus and corpora similar to those used in the trainings of BERT and SciBERT models. Specifically, we used 150K patent abstracts as the patent corpus, 150K texts from the BookCorpus19 and English Wikipedia as the BERT corpus, and 150K texts from Semantic Scolar20 as the SciBERT corpus. All texts are in English. For each of them, we found the 10K most common words excluding stop‐words, and then we examined the overlap of the respective sets. Figure 2a depicts the analysis of the vocabulary overlap resulting in a greater similarity between the patent corpus and SciBERT than BERT.2FIGUREOverlap between different corpora vocabularies and patent labels using Venn diagrams.Domain‐adaptive pretrainingDomain‐adaptive pretraining5 involves fine‐tuning the model on additional unlabelled data, originated from the domain of interest, using a pretraining objective prior to task‐specific fine‐tuning. This operation aims to shift a pretrained model towards the domain of interest and is achieved by performing just a few extra epochs of training on the domain‐specific data in order to avoid degradation of the model's general language capabilities. Here, the training dataset consists of 10.000.000 patent abstracts downloaded from WIPO and USPTO.AdaptersAdapters8 represent an alternative fine‐tuning strategy that relies on optimizing a small set of additional newly initialized weights at every layer of the transformer. These weights are trained during fine‐tuning, while the pretrained parameters of the transformer model are frozen. This strategy has two significant advantages. To begin, by freezing the entire base model, we can significantly reduce the amount of computation required during training. Additionally, training many task‐specific adapters using the same base model enables an efficient parameter sharing between the different tasks. In this case, the memory footprint of the application can be reduced because we only need to store the extra weights for each task, not the entire fine‐tuned model. In our work, we utilize the adapters configuration presented by Pfeiffer et al.21 which places an adapter layer only after the feed‐forward block of each Transformer layer.Combining methodsAll of the methods outlined above can also be combined in a two‐phase approach to improve performance. The first phase focuses on domain adaptation. The selected model is adapted to the patent domain by performing domain‐adaptive pretraining. This phase produces a LM that is specialized for a patent corpus. This model is not tailored for patent classification and can be used for any downstream NLP task. In the second phase, we perform task‐specific fine‐tuning using a cross‐entropy loss. For the second phase, we have two alternatives to perform the classification: utilizing adapters or following the standard approach of attaching a classification head as output in the existing architecture.RESULTSWe used a newly established patent dataset focusing on the crop industry domain and a standard USPTO based dataset for the evaluation of the different models. The former data set reveals the performance of the models in actual use cases belonging to the agrochemical domain. The latter dataset serves as a standard baseline in the IPC classification task, the most documented case in the literature.As baselines, we used a standard fine‐tuned BERT model similar to Lee et al.17 and CNN based methods that have been presented in Roudsari et al.15 relying on GloVe22 and FastText embeddings.23 We compared them with SciBERT, as a vocabulary adaptation approach, dapBERT, as a domain adapted pretraining method and BERT fine‐tuned using adapters. In addition, we included dapSciBERT, SciBERT + adapters, dapBERT + adapters, and dapSciBERT + adapters methods to highlight possible combinations of methods that can take place. All approaches that are based on standard task‐specific fine‐tuning have a classification head consisting of a dense layer, with ReLU activation function and dropout, plus the output layer. The fine‐tuning was performed for five epochs. When it comes to the methods that include adapters, the classification head consisted of only the output layer and the adapters‐based fine‐tuning had been performed for 30 epochs. The full details regarding the training hyperparameters can be found in Appendix C (Table C1).DapBERT and dapSciBERT are two BERT‐like models trained based on the domain adaptive pretraining method described above for the patent domain. Even though Gururangan et al.5 indicates that only a single pass on the domain dataset is required, we investigated whether additional training is beneficial for domain adaptation. We evaluated the versions of the adapted models that were trained for three epochs, as these versions yielded the best results. Additional training epochs did not improve the performance (see Appendix A: Table A7). To train these models, we relied on the Generative Toolkit for Scientific Discovery (GT4SD) library24 and its LM trainer.Crop protection industry datasetThe discovery of fungicides, insecticides, and herbicides is the primary focus for crop protection research.25 To keep up to date with emerging trends, the recent literature has to be followed. The manual identification of relevant articles is time‐consuming and therefore, an automated identification and correct categorization mechanism would deliver a key element towards a more efficient process.In patents, the IPC hierarchy provides a general classification of the patents based on their topics, yet this classification is not always aligned with important domain specific categories. Figure 2b highlights this aspect focusing on insecticides, herbicides, and fungicides as the three main crop protection categories. These three categories, have a high degree of overlap in terms of IPC codes, indicating that we cannot distinguish these labels relying solely on the IPC hierarchy.There is a large amount of patent data available in the public domain. Patents starting from 2012 to 2020 were analyzed and classified manually into three categories—insecticide, fungicide, and herbicide—plus an extra no‐class label with irrelevant patents, leading to a data set with 9976 entries. Specifically, the dataset contained 3393 Insecticide related patents, 1518 Fungicide related patents, 1512 Herbicide related patents and 3553 irrelevant (no class) patents. This data set was used to build and evaluate various four class models based on the methodologies described above. The evaluation included a 10‐fold cross‐validation.Figure 3 presents the results of the evaluation focusing on the macro F1‐score. We examined three distinct input cases: title, abstract, and title + abstract, and included only SciBERT based adapted models in the results due to their superior performance. Tables with the full evaluation, including all the investigated models and all the metrics, can be found in Appendix A: Table A6. The results show that any kind of domain adaptation can be beneficial for the task and all of them outperform the baselines. The best performance is observed by dapSciBERT + adapters method using as input both the title and the abstract of a patent. An approach that has the extra advantage of fewer storage requirements in case of multiple classification cases should be accommodated by the same system. Furthermore, the results dictate that the abstract is a way more informative part of a patent than the title for the classification task. To verify the significance of the difference between the baselines and our proposed alternatives we performed statistical tests to compare the mean value of the best baseline method which is the finetuned BERT model and our best variant in terms of both performance and methodological advantages which is the dapSciBERT + adapters. A Wilcoxon Signed‐Ranks Test indicated that dapSciBERT + adapters method has greater mean value than BERT methods using as input title (statistic = 50.0, p‐value = 0.01), abstract (statistic = 55.0, p‐value = 0.01) and title + abstract (statistic = 52.0, p‐value = 0.01). In general, the differences in terms of mean F1‐score between the methods are not large, yet even a 1% or 2% increase in such scenarios is highly important as this improvement is translated into tens of thousands of further correctly classified incoming patents in automated streams that should handle every year millions of patents.3FIGUREMacro F1‐score of the presented models in the English agrochemical related case based on different input text. The error bars represent the standard deviation measured for each model.Multilingual patent classificationWe also extended the model evaluations to determine the applicability of the previously described methodologies in a multilingual scenario. In fact, a reliable multilingual classifier could be a game changer for the field, as it could classify patents submitted to a large number of different speaking patent offices in a massively, quickly, and inexpensive way, without the need for additional tools such as translators. We relied on BERT's multilingual version and re‐evaluated all the different adaptation techniques described previously (adapters, domain adapted models, and their combination), training also an adapted BERT multilingual model for the patent domain called dapBERT‐multi. For the multilingual case, there are no available LMs trained on scientific or other relevant domains, thus we confined our experiment to the multilingual BERT variant. To train the dapBERT‐multi model, we followed the domain adaptive pretraining method described above for the patent domain. We used a multilingual corpus of 17,476,660 patent abstracts for the adapted pretraining. The patents collected covered 14 languages, including English, Chinese, French, Korean, Japanese, and German. We trained the model, fine‐tuning all weights, for three epochs.The methods were evaluated using the same case as in the English version. Nonetheless, a sizable portion of the patents was chosen in their alternate non‐English language version. In total the multilingual version of the dataset had 9989 patents with 47% not written in English. The multilingual dataset contained 3408 Insecticide related patents, 1519 Fungicide related patents, 1520 Herbicide related patents and 3542 irrelevant (no class) patents. Figure 4 presents the performance of our proposed methods. We chose the multilingual BERT as a baseline, as well as a CNN approach based on the multilingual BERT's embeddings. We used the same evaluation strategy as described previously for monolingual models. In general, the adapted methods outperform the baselines, with the performance resembling closely the same pattern as the English version. The combination of an adapted model and adapters (dapBERT‐multi + adapters) presents the best results which is more than 2% better than the baselines. Wilcoxon Signed‐Ranks Test also verifies our findings, specifically it proves that dapBERT‐multi + adapters method has greater mean value than multiBERT method using as input title (statistic = 50.0, p‐value = 0.01), abstract (statistic = 48.0, p‐value = 0.02) and title + abstract (statistic = 53.0, p‐value = 0.03). We also observe that the multilingual models only have slightly worse performance than the monolingual English models even if the amount of data that has been used is the same and the dominant language in the multilingual dataset is still English. This observation underlines the power of language generalization that these LMs hold and their potential in such cases and domains.4FIGUREMacro F1‐score of the presented models in the multilingual versions of the agrochemical case based on different input text. The error bars represent the standard deviation measured for each model.Evaluation under real‐life conditionsIn a real case scenario, a model is trained on past data and is applied to predict incoming novel data. To this end, the classification model trained on data up to 2020 was applied to patents made available in 2021. This exercise focused on patents related to human necessities or chemistry, which are defined by their assigned IPC codes, leading to a corpus of 78,712 patents. For the patents in the list that were not written in English, we relied on the respective English version that was retrieved through Google patents. The best performing model (dapSciBERT + adapters) was subsequently applied to label the patents into the three classes (fungicide, herbicide, insecticide) or to provide the no‐class label. The results of the predictions were compared to the classification obtained from subject matter experts. An interesting observation is that using our model we could identify errors that occurred in five cases during the manual annotation. In addition, 108 patents that the model classified as insecticide, herbicide, or fungicide contained relevant keywords but were not classified accordingly by the subject matter experts. Of course, patents may contain relevant keywords without being relevant, but this could indicate that some relevant patents were missed during the manual assessment. Undoubtedly, the volume of false positive examples indicates that a patent classifier cannot yet be used as a standalone method to cherry‐pick patents of interest, given an incoming patent stream. Nevertheless, such a model can facilitate the process and significantly reduce the volume of patents that require manual inspection. The model was also compared to a baseline model. As a baseline, we relied again on a fine‐tuned BERT model for patent classification. Furthermore, to investigate multilingual extensions, we repeated the same experiment using a multilingual version of the same corpus in which the original language of each patent is used as input text, corresponding to 58% of the patents. Table 1 presents the overall results. The confusion matrices of classification results for each model can be found in Appendix A: Table A6. For both only‐English and multilingual cases, our model outperforms the baseline in terms of the F1 score. The fine‐tuned BERT baselines retrieve a few more correct instances in both cases, yet our proposed models manage to provide significantly fewer false positive instances. Thus, the percentage of actual patents of interest predicted with our models is much higher than the baselines. IPC‐code filtering or ensemble classifiers could improve even more the performance and reduce or eliminate the need for manual intervention. The comparison between English and multilingual classifiers' performance seconds the previous investigations' findings. It indicates that even if both variants have remarkable performance, using the English‐only classifier leads to slightly better performance. This performance can be attributed to the fact that we relied on a standard multilingual BERT as there is no available multilingual BERT‐like model trained solely on scientific domains.1TABLEEvaluation of the performance of different monolingual or multilingual classifiers in a stream of patents made available in 2021.ModelF1‐scoreCorrect foundTotal found (correct + missclassified)False positivesBERT0.947548124786dapSciBERT + adapters0.987427861870BERT‐multi0.947538114521dapBERT‐multi + adapters0.967397853352Total patents of interest in 2021839Note: The groundtruth is originated from manual annotations made by experts. The utilized metrics are the F1‐score, the number of correct classified patents (correctly classified patents of interest), the total number of found patents (found patents of interest including cases with wrong label assignment) and the number of false positive patents (generally irrelevant patents that be assigned to one of the categories).USPTO datasetWe further evaluated our methods based on an USPTO based dataset used in Roudsari et al.15 It contained 235,858 patents submitted in 2014 as training set and 42,321 patents submitted in 2015 as testing set. The datasets contained both the title and the abstract of a patent to identify the associated IPC subclass labels of it. The target labels were 89 IPC subclasses. More information about the dataset generation process can be found in Roudsari et al.15Table 2 summarizes the comparison of the different models based on micro averaging precision, recall, and F1‐score as well as coverage error. Coverage error is a metric that depicts how far we need to go down a ranked list of categories on average to account for all the true positive categories. The results for the embedding‐based (CBOW, GloVe and Skip‐gram) and GPT‐2‐based methods have been extracted from Roudsari et al.152TABLECoverage error and micro averaging precision, recall and F1‐score of all the evaluated methods, considering the USPTO dataset benchmark proposed in Roudsari et al.15ModelMacroCoverage errorPrecisionRecallF1CBOW0.640.460.524.50GloVe0.680.420.514.36Skip‐gram0.740.420.523.92FastText0.760.510.603.87GPT‐20.760.490.593.90BERT0.760.520.603.52SciBERT0.760.530.623.46BERT + adapters0.770.520.613.40SciBERT + adapters0.780.530.623.29dapBERT0.770.540.633.31dapSciBERT0.770.550.633.28dapBERT + adapters0.790.530.623.19dapSciBERT + adapters0.780.540.633.15Note: In bold the top‐3 values for each metric.Micro averaging of the above metrics, as well as evaluation at top‐1 and top‐5 predictions, have been also examined following the exact same evaluation process of Roudsari et al.15 and the results reveal a similar performance with the macro averaging results, are available in the Appendix A: Table A7.The USPTO evaluation suggests that BERT‐like approaches outperform the competition and that all of the presented domain adaptation based methods have the potential to improve the performance even further. The leveraging of SciBERT or adapters for vocabulary adaptation offers marginal benefits in the overall task performance in comparison to standard BERT finetuning. The improvement became more significant when we utilized the domain‐adapted models. Specifically, dapSciBERT had more than 2% improvement comparing to BERT in the majority of the utilized metrics. The combination of Patent based models and adapters can improve even more the performance of the models, especially in terms of coverage error. Overall, based on the experiments the best approach was the dapSciBERT + adapters which achieved a significant improvement in comparison to the rest CNN + embedding methods and the BERT baseline.CONCLUSIONPatent classification is a fundamental step in many patent analysis or patent generation pipelines. Improving the performance of classification methods is a critical task, and domain adaptation of transformers appears to be a promising direction. In this paper, we propose and investigate different methods for patent classification relying mainly on domain adaptation. The domain adaptive pretraining demonstrated the best results in the test cases and its performance can be furthermore improved if we select a pretrained base model with vocabulary closer to our domain, such as SciBERT. Additionally, the domain‐adapted LM generated in the first phase can be fine‐tuned and used for any downstream NLP task. When combined with already domain‐adapted models, the use of adapters results in the same or even better performance. This finding, combined with their lightweight characteristics, such as requiring fewer training resources and less storage space, makes them an appealing option, particularly when it is required to develop multiple classification schemes for a single domain. We further utilized and examined the use of domain adaptation techniques for multilingual patent classification. The multilingual performance follows a similar pattern to the English‐written patents. The same level of performance in both multilingual and English cases highlights the strength of the proposed methods for patent classification and the great application potential that they can serve. To push further performance boundaries, future steps may include the exploration of additional domains,5 vocabulary26 adaptation methods or the use of patents' metadata.DATA AVAILABILITY STATEMENTThe data that support the findings of this study are openly available and can be found at https://github.com/GT4SD/domain-adaptive-patent-classifier.REFERENCESWIPO. 2023. Accessed January 10, 2023. https://www.wipo.int/portal/en/index.htmlUSPTO. 2023. Accessed January 10, 2023. https://www.uspto.govU.S. Patent Statistics Chart Calendar Years 1963–2020; 2023. Accessed January 10, 2023. https://www.uspto.gov/web/offices/ac/ido/oeip/taf/us_stat.htmWIPO. Guide to the International Patent Classification; 2022. https://www.wipo.int/publications/en/details.jsp?id=4593&plang=ENGururangan S, Marasović A, Swayamdipta S, et al. Don't stop pretraining: adapt language models to domains and tasks. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020:8342‐8360.Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre‐training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics; 2019:4171‐4186.Beltagy I, Lo K, Cohan A. SciBERT: a pretrained language model for scientific text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP‐IJCNLP). Association for Computational Linguistics; 2019:3615‐3620.Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter‐efficient transfer learning for NLP. Proceedings of the 36th International Conference on Machine Learning. PMLR; 2019:2790‐2799.Krestel R, Chikkamath R, Hewel C, Risch J. A survey on deep learning for patent analysis. World Patent Inform. 2021;65:102035.Fall CJ, Törcsvári A, Benzineb K, Karetka G. Automated categorization in the international patent classification. SIGIR Forum. 2003;37:10‐25.D'hondt E, Verberne S, Koster C, Boves L. Text representations for patent classification. Comput Linguist. 2013;39:755‐775.Trappey AJC, Hsu FC, Trappey CV, Lin CI. Development of a patent document classification and search platform using a back‐propagation network. Expert Syst Appl. 2006;31:755‐765.Li S, Hu J, Cui Y, Hu J. DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics. 2018;117:721‐744.Abdelgawad L, Kluegl P, Genc E, Falkner S, Hutter F. Optimizing neural networks for patent classification. Machine Learning and Knowledge Discovery in Databases. Springer International Publishing; 2020:688‐703.Roudsari AH, Afshar J, Lee S, Lee W. Comparison and analysis of embedding methods for patent documents. In: 2021 IEEE International Conference on Big Data and Smart Computing (BigComp); 2021:152‐155.Hepburn J. Universal language model fine‐tuning for patent classification. In: Proceedings of the Australasian Language Technology Association Workshop; 2018:93‐96.Lee JS, Hsiang J. Patent classification by fine‐tuning BERT language model. World Patent Inform. 2020;61:101965.Hendrycks D, Liu X, Wallace E, Dziedzic A, Krishnan R, Song D. Pretrained transformers improve out‐of‐distribution robustness. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020:2744‐2751.Zhu Y, Kiros R, Zemel RS, et al. Aligning books and movies: Towards story‐like visual explanations by watching movies and reading books. ICCV. IEEE Computer Society; 2015:19‐27.Ammar W, Groeneveld D, Bhagavatula C, et al. Construction of the literature graph in semantic scholar. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers). Association for Computational Linguistics; 2018:84‐91.Pfeiffer J, Vulić I, Gurevych I, Ruder S. MAD‐X: an adapter‐based framework for multi‐task cross‐lingual transfer. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2020:7654‐7673.Pennington J, Socher R, Manning C. GloVe: global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics; 2014:1532‐1543.Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Trans Assoc Comput Ling. 2017;5:135‐146.Manica M, Cadow J, Christofidellis D, et al. GT4SD: generative toolkit for scientific discovery. arXiv preprint arXiv:2207.03928; 2022.Umetsu N, Shirai Y. Development of novel pesticides in the 21st century. J Pestic Sci. 2020;45:54‐74.Tai W, Kung HT, Dong X, Comiter M, Kuo CF. exBERT: extending pre‐trained models with domain‐specific vocabulary under constrained training resources. Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics; 2020:1433‐1439.AAPPENDIXCrop protection industry datasetTable A1 presents the full evaluation of the models in the agrochemical related case. Furthermore, Tables A2 and A3 present the full evaluation of the models in the multilingual version of the dataset (with or without English). Lastly, Tables A4–A7 present the confusion matrices of the classification results obtained by our models or baselines in the evaluation using the patents published in 2021 and as groundtruth the annotations made by experts of the field. All tables have in bold the top value for each metric. In general, all the different adaptation techniques that have been utilized improve the performance both in terms of precision and recall. The combination of domain adaptive pretraining + adapters offers the highest boost in the performance in general.A1TABLEPerformance of the evaluated models in the English agrochemical dataset, including the standard deviation measured for each model.ModelTitlesAbstractsTitles + abstractsPrecisionRecallF1PrecisionRecallF1PrecisionRecallF1CNN + GloVe0.86 ± 0.010.84 ± 0.010.85 ± 0.010.91 ± 0.010.89 ± 0.010.90 ± 0.010.92 ± 0.010.90 ± 0.010.91 ± 0.01CNN + FastText0.87 ± 0.010.86 ± 0.010.86 ± 0.010.92 ± 0.010.90 ± 0.010.91 ± 0.010.92 ± 0.010.91 ± 0.010.92 ± 0.01BERT0.89 ± 0.020.88 ± 0.010.88 ± 0.010.92 ± 0.010.91 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.01SciBERT0.89 ± 0.020.88 ± 0.010.89 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.01BERT + adapters0.89 ± 0.020.89 ± 0.010.89 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.01SciBERT + adapters0.89 ± 0.020.88 ± 0.010.88 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.01dapBERT0.89 ± 0.010.89 ± 0.010.89 ± 0.010.91 ± 0.010.92 ± 0.010.91 ± 0.010.93 ± 0.010.93 ± 0.010.92 ± 0.01dapSciBERT0.90 ± 0.010.89 ± 0.010.89 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.01dapBERT + adapters0.89 ± 0.010.89 ± 0.010.89 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.01dapSciBERT + adapters0.89 ± 0.010.89 ± 0.010.89 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.010.94 ± 0.010.94 ± 0.01A2TABLEPerformance of the evaluated models in the multilingual agrochemical dataset, including the standard deviation measured for each model.ModelTitlesAbstractsTitles + abstractsPrecisionRecallF1PrecisionRecallF1PrecisionRecallF1CNN + BERT0.88 ± 0.010.87 ± 0.010.87 ± 0.010.91 ± 0.010.91 ± 0.010.91 ± 0.010.92 ± 0.010.91 ± 0.010.92 ± 0.01BERT‐multi0.89 ± 0.010.88 ± 0.010.88 ± 0.010.92 ± 0.010.91 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.01BERT‐multi + adapters0.89 ± 0.010.88 ± 0.010.88 ± 0.010.92 ± 0.010.91 ± 0.010.91 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.01dapBERT‐multi0.89 ± 0.010.88 ± 0.010.89 ± 0.010.92 ± 0.010.91 ± 0.010.92 ± 0.010.93 ± 0.010.92 ± 0.010.93 ± 0.01dapBERT‐multi + adapters0.89 ± 0.010.89 ± 0.010.89 ± 0.010.92 ± 0.010.92 ± 0.010.92 ± 0.010.93 ± 0.010.93 ± 0.010.93 ± 0.01A3TABLEPerformance of the evaluated models in the multilingual datasets focusing on patents not written in English and using abstract and title as input.ModelPrecisionRecallF1CNN + BERT0.91 ± 0.010.88 ± 0.010.89 ± 0.01BERT‐multi0.91 ± 0.010.89 ± 0.010.90 ± 0.01BERT‐multi + adapters0.90 ± 0.010.90 ± 0.010.90 ± 0.01dapBERT‐multi0.91 ± 0.010.89 ± 0.010.90 ± 0.01dapBERT‐multi + adapters0.91 ± 0.010.90 ± 0.010.91 ± 0.01Note: The measured standard deviation for each model has also been included in the table.A4TABLEConfusion matrix of results obtained by the fined‐tuned BERT in the 2021 patent stream.ActualFungicideHerbicideInsecticideNo classPredictedFungicide17502763Herbicide616613918Insecticide3074133105No class731273,092A5TABLEConfusion matrix of results obtained by our dapSciBERT + adapters approach in the 2021 patent stream.ActualFungicideHerbicideInsecticideNo classPredictedFungicide207614697Herbicide01601206Insecticide176375967No class1343076,009A6TABLEConfusion matrix of results obtained by the fined‐tuned multilingual BERT in the 2021 patent stream.ActualFungicideHerbicideInsecticideNo classPredictedFungicide186114672Herbicide41615794Insecticide24104063055No class1051373,361A7TABLEConfusion matrix of results obtained by our multilingual dapBERT‐multi + adapters approach in the 2021 patent stream.ActualFungicideHerbicideInsecticideNo classPredictedFungicide195412554Herbicide01572293Insecticide2173872505No class1882874,530BAPPENDIXUSPTO datasetTable B1 presents the results of the different models at top 1 and top 5 predictions. Specifically, we first predict 1 and 5 labels for each patent, and then we calculate the precision, recall, and F1‐score. The results for the embedding‐based methods have been extracted from Roudsari et al.15 In addition, Table B2 depicts the comparison of the different models based on micro averaging precision, recall, and F1‐score as well as coverage error. Coverage error is a metric that depicts how far we need to go down a ranked list of categories on average to account for all the true positive categories. Various dapBERT and dapSciBERT checkpoints have been added to the table. Specifically, we compare the versions taken after 1, 3, and 10 training epochs. As it can be extracted, the checkpoints taken after 3 and 10 raining epochs performs equally, which indicates that three training epochs of adapted pretraining is enough and no further performance can be seen for the patent classification task with further training. Both tables have in bold the top value for each metric.B1TABLEPerformance of all the evaluated methods at top‐1 and top‐5 predictions considering the USPTO dataset benchmark proposed in Roudsari et al.15Model@1 (%)@5 (%)PrecisionRecallF1PrecisionRecallF1CBOW75.8056.1561.9027.6288.2640.33GloVe76.5156.7162.5127.9089.1440.73Skip‐gram78.8058.4664.4228.4990.6841.54FastText78.8758.4964.4628.4890.7041.53GPT‐280.5259.9765.9928.5190.5741.55BERT82.2550.6862.7129.1089.6543.94SciBERT83.2051.2663.4429.2890.2244.21BERT + adapters82.8551.0563.1829.2290.0244.12SciBERT + adapters83.6351.5263.7729.4690.7644.48dapBERT83.9551.7364.0129.4590.7444.47dapSciBERT84.2851.9364.2629.5691.0844.64dapBERT + adapters84.3551.9764.3229.6091.2044.70dapSciBERT + adapters84.5352.0964.4629.6891.4544.82B2TABLEPerformance of different domain adapted checkpoints in the classification task.ModelMicroCoverage errorPrecisionRecallF1CBOW0.710.550.624.50GloVe0.750.510.614.36Skip‐gram0.800.510.623.92FastText0.800.510.623.87GPT‐20.800.560.663.90BERT0.800.590.683.52SciBERT0.800.610.693.46dapBERT10.810.590.683.51dapSciBERT10.800.620.703.33dapBERT30.800.610.703.31dapSciBERT30.810.620.713.28dapBERT100.810.610.703.31dapSciBERT100.810.610.713.31CAPPENDIXTraining hyperparametersC1TABLETraining hyperparameters for the two different fine‐tuning approaches.ParameterStandard fine‐tuningAdapters based fine‐tuningLearning rate0.000020.0005Training epochs530Batch size3232Maximum input length512512OptimizerAdamAdam

Journal

Applied AI LettersWiley

Published: Feb 1, 2023

Keywords: BERT; domain‐adaption; NLP; patent analysis; patent classification; transformers

References