Using Machine Learning Approach To Identify Synonyms For Document Mining
Introduction
Companies need to conduct patent searches regularly before their products are launched (i.e., freedom-to-operate, FTO, analyses) or before carrying out any R&D activities. The search helps identify which prior-art IPs overlap with the specific technologies being developed, in order to prevent infringement of existing IPRs [1]. However, current patent laws do not explicitly restrict patent applications to standard terminologies for particular domains. Rather, IP offices allow patent applicants to write their applications freely using synonyms, which may cause incomplete or inaccurate patent sets to be found with traditional search queries [2].
Traditional patent searches use exact-match "key term" Boolean queries, which often miss documents that use synonymous words. This study aims to construct a smart patent search that considers synonyms. First, word and phrase synonym dictionaries are constructed based on a traditional OPTED dictionary and a web encyclopedia corpus. Although the terminologies have many synonymous terms, these terms can be found by referring to the newly built word and phrase synonym dictionaries. The synonym-enhanced search approach is smarter than the original search/query method and helps companies conduct comprehensive patent searches and analyses to avoid possible threats of IP disputes. This research also conducts a search test in a specific domain of solar power related patents and compares the differences between the intelligent search method proposed in this research and the traditional method.
1 Corresponding Author, Email: trappey@ie.nthu.edu.tw.
1. Literature review
The relevant literature is divided into four parts. The first part, word embedding and vectorization, explains the different methods and concepts for projecting words into a vector space. The second part is synonym extraction, which describes various methods and techniques for extracting synonyms from text. The third part discusses the properties of the corpus used in this study. The final part is the ontology schema describing the methods and algorithms for conducting synonym extraction.
$tf_{i,j} = \dfrac{n_{i,j}}{\sum_{k} n_{k,j}}$  (1)

$idf_{i} = \log \dfrac{|D|}{|\{j : t_{i} \in d_{j}\}|}$  (2)

$tfidf_{i,j} = tf_{i,j} \times idf_{i}$  (3)

where $tf_{i,j}$ represents the term frequency, $n_{i,j}$ is the number of occurrences of term $i$ in document $d_{j}$, $\sum_{k} n_{k,j}$ is the total number of occurrences of all words in document $d_{j}$, $|D|$ is the number of documents in the corpus, and $|\{j : t_{i} \in d_{j}\}|$ is the number of documents that contain $t_{i}$.
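As an illustration only (not part of the original study), the following Python sketch computes these weights for a hypothetical toy corpus; the tokenized documents and function names are assumptions made for the example.

import math

# Toy corpus of tokenized "documents" (hypothetical example data).
docs = [
    ["solar", "panel", "converts", "sunlight", "into", "electricity"],
    ["the", "photovoltaic", "cell", "converts", "light"],
    ["patent", "search", "uses", "boolean", "queries"],
]

def tf(term, doc):
    # Equation (1): occurrences of the term in the document over all occurrences in it.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Equation (2): log of corpus size over the number of documents containing the term.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    # Equation (3): tf times idf.
    return tf(term, doc) * idf(term, docs)

print(tfidf("converts", docs[0], docs))  # a term in two of the three documents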
Word2vec was proposed in 2013. Considering the order of the context, words are projected into the vector space. There are two types of models in Word2vec. The first, the Continuous Bag-of-Words Model (CBOW), predicts the center word from the words around it, i.e., predicting $W_{t}$ from $W_{t-2}$, $W_{t-1}$, $W_{t+1}$, $W_{t+2}$, and so on. The other is the Continuous Skip-gram Model. In contrast to CBOW, it uses the center word to predict the surrounding words, i.e., predicting $W_{t-2}$, $W_{t-1}$, $W_{t+1}$, $W_{t+2}$ from $W_{t}$ [5]. CBOW and Skip-gram perform better than other Neural Network Language Models (NNLMs) in semantic and syntactic accuracy. Word2vec is also an important foundation for the later development of Doc2vec [6].
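For readers unfamiliar with the two models, a minimal gensim-based sketch is shown below; the tiny corpus and hyperparameters are illustrative assumptions, not the configuration used in this study.

from gensim.models import Word2Vec

# Tiny illustrative corpus of tokenized sentences (hypothetical).
sentences = [
    ["solar", "cell", "converts", "sunlight", "to", "electricity"],
    ["photovoltaic", "cell", "converts", "light", "to", "power"],
    ["battery", "stores", "electricity", "for", "later", "use"],
]

# sg=0 trains CBOW (predict the center word from its context);
# sg=1 trains Skip-gram (predict the context from the center word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Words that share contexts end up close together in the vector space.
print(skipgram.wv.most_similar("cell", topn=3))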
Similarly, under the condition that event A occurs, the probability of the occurrence of event B is

$P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$  (5)

In equation (7), P(A) is also called the prior probability, which represents the probability distribution of A before the observation of evidence B and does not consider any factors related to B; $P(A \mid B)$, called the posterior probability, is the conditional probability obtained after observing evidence B.
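For completeness, the prior and posterior described here are related by Bayes' theorem, which is presumably the equation (7) referred to above (the equation itself is not reproduced in this excerpt):

$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$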
Here, $class_{i}$ denotes the i-th category and $\vec{tf}$ is the vector of keyword frequencies. Since $P(\vec{tf})$ is the same for every class, the classifier trains $P(\vec{tf} \mid class_{i})$ to make the classification. In the application of document classification, the simple Bayesian classifier often uses Bernoulli Naive Bayes or Multinomial Naive Bayes. The former only considers whether the keywords appear or not, denoted as 1 or 0, while the latter considers the term frequency. For example, assume that we want to classify articles into sports news or financial news. An article that contains many words related to basketball will be classified as sports news rather than financial news because $P(\vec{tf} \mid sports\ news)$ is much larger than $P(\vec{tf} \mid financial\ news)$. Bayesian classification has good classification ability with few samples; even if the training set is small, the classification accuracy is still good [8].
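As a hedged illustration of this classification setting (not the authors' implementation), the sketch below trains Bernoulli and multinomial naive Bayes classifiers on term-count vectors with scikit-learn; the sample news snippets are invented.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Invented training snippets for the sports-vs-finance example.
texts = [
    "basketball player scores in the final game",
    "the team wins the basketball championship",
    "stock market falls as interest rates rise",
    "the bank reports quarterly financial earnings",
]
labels = ["sports", "sports", "finance", "finance"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)           # term-frequency vectors

multinomial = MultinomialNB().fit(X, labels)  # uses term frequencies
bernoulli = BernoulliNB().fit(X, labels)      # uses presence/absence (1 or 0)

test = vectorizer.transform(["basketball game tonight"])
print(multinomial.predict(test))              # expected: ['sports']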
Nowadays, there are many patent search platforms. Some of them are free, such as Google Patents, the United States Patent and Trademark Office (USPTO) database, and Free Patents Online (FPO). There are also paid platforms such as WIPS, PatBase, and Derwent Innovation (DI). USPTO provides quick search and advanced search, and users can set conditions to find patents; however, USPTO only provides exact matching and the number of results is limited. FPO has a basic word stemming function, and Google Patents shows the results with some basic statistics such as top assignees and inventors. The major defects of these free patent search platforms are their limited search flexibility and the difficulty of exporting and analyzing the results.
WIPS is the first online worldwide patent information service provider in South Korea, and PatBase was created by Minesoft and RWS. In WIPS and PatBase, users can combine the results they obtained from different queries to get new results, and there are more specific conditions to choose from while searching. The Derwent Innovation (DI) database consolidates patents from more than 90 country/regional patent offices. In DI, smart search finds keywords from a paragraph of text provided by the user, and the patents containing these keywords are then identified. Furthermore, wildcard and proximity operators improve the search results in DI and WIPS. Although the paid patent platforms seem smarter than the free platforms, the synonyms they can identify to enhance search versatility are still limited. This research intends to significantly enhance the comprehensiveness of patent search results.
Figure 1 shows the ontology schema of the hierarchically presented key parts for synonym extraction. Synonym extraction methods (at both the word and phrase levels) and the word and phrase corpuses used as synonym training bases are the critical parts of the ontology. There are several sub-techniques for word level synonym extraction, e.g., Inverted Index Extraction (IIE), Pattern-based Extraction (PbE), and Maximum Entropy (MaxEnt). For phrase level synonym extraction, algorithms such as Entity Link Frequency (ELF), Average Cosine Similarity (ACS), Pseudo Relevance Feedback (PRF), Average Ranking (AveR), Ranking Score Combination (RSC), and self-supervised synonym extraction are highlighted [16]. The main training corpuses are The Online Plain Text English Dictionary (OPTED), Dictionary.com, and Wikipedia [17].
2. Methodology
This study proposes a method of detecting synonyms of key terms (both words and phrases) for smart patent search. The research process can be divided into several parts. First, a synonym dictionary must be constructed for synonymous words. This step is completed by feature extraction: by iteratively observing grammatical features, more features can be found to capture synonyms. We use OPTED as our corpus.
Figure 1. The ontology map outlines the key parts of synonym extraction methods and corpuses.
2 Synonym Dataset: TOEFL Synonym Questions, http://lsa.colorado.edu/
For example, the interpretation of bag is "a container or receptacle of leather, plastic, cloth, paper, etc., capable of being closed at the mouth; pouch." It can be seen that pouch is a synonym of bag, so this feature can be used as an initial feature to extract more synonyms. Then the reverse searching step starts. For example, the interpretation of stain is "a cause of reproach; stigma," so stigma is a synonym of stain. The interpretation of tarnish is "to diminish or destroy the purity of; stain; sully." Now, stain appears as a synonym of tarnish, so tarnish is also added to the synonyms of stain.
Bootstrapping is implemented after the previous step. The algorithm searches for new features in the dictionary based on old patterns, and the features continue to increase iteratively. For example, assume that tarnish is our target word. From its interpretation, "to diminish or destroy the purity of; stain; sully," sully is added to the synonyms of tarnish. In the interpretation of sully, "to soil, stain, or tarnish," tarnish appears in another pattern, "to A, B or C," so this becomes a new pattern. Finally, in the transitive closure phase, in order to ensure that the synonyms are indeed highly similar, the algorithm tests whether the synonyms can form a transitive closure. If the synonyms cannot form a loop, they are dropped.
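The loop test can be pictured as a reachability check on a directed graph of candidate synonym pairs. The sketch below is our own simplified reading of that step, not the authors' exact algorithm, and the candidate pairs are hypothetical.

from collections import defaultdict

# Directed candidate edges extracted from definitions (hypothetical sample).
candidates = [("stain", "stigma"), ("tarnish", "stain"),
              ("stain", "tarnish"), ("tarnish", "sully"), ("sully", "tarnish")]

graph = defaultdict(set)
for a, b in candidates:
    graph[a].add(b)

def reachable(start, goal):
    # Depth-first search for a path start -> ... -> goal.
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

# Keep only pairs that close a loop (a -> b and b -> ... -> a).
confirmed = [(a, b) for a, b in candidates if reachable(b, a)]
print(confirmed)  # ("stain", "stigma") is dropped because no path leads back

In this reading, the pair (stain, stigma) is rejected because stigma's definition never leads back to stain, whereas the stain/tarnish/sully pairs all lie on cycles and are kept.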
The word level synonym dictionary (original version) generated by this method is not as good as expected. Therefore, this study improves it in three directions. The first is the expansion of parts of speech; the second is to enhance the inference ability; and the third is to add more undiscovered important features. The number of features increased from 6 to 12. These improvements effectively increase the efficiency of synonym extraction. Table 1 shows the patterns of the word level dictionary; patterns 7 to 12 are the differences between the two versions.
Table 1. Example patterns of word level dictionary.
Pattern no. | Pattern | Example sentence | Synonym found
1 | "^.*; (\w+).$" | Abrogate (a.) Abrogated; abolished. | abolished
2 | "^.*; (\w+);" | Abbreviate (a.) Abbreviated; abridged; shortened. | abridged
3 | "^.*; the (\w+).$" | Acme (n.) The top or highest point; the culmination. | culmination
4 | "^.*; to (\w+).$" | Abalienate (v. t.) To transfer the title of from one to another; to alienate. | alienate
5 | "^.*; a (\w+).$" | Abider (n.) One who dwells; a resident. | resident
Notes: ^ = beginning of line; $ = end of line; . = matches any single character; * = matches the preceding element zero or more times.
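To make the patterns concrete, the following sketch applies pattern no. 2 from Table 1 to the OPTED-style definition quoted in that row; the surrounding script is only an illustration of how such patterns can be run over dictionary entries.

import re

# Pattern no. 2 from Table 1: a candidate synonym enclosed by semicolons.
pattern = re.compile(r"^.*; (\w+);")

definition = "Abbreviate (a.) Abbreviated; abridged; shortened."
match = pattern.match(definition)
if match:
    print(match.group(1))  # -> abridged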
This study refers to and modifies the self-supervised learning algorithm mentioned above, improving the details of collecting negative training samples and the classifier, to construct a phrase level synonym dictionary. There are four main steps: data pre-processing, collecting positive training samples, collecting negative training samples, and training and classification. In this study, the corpus used to construct the phrase level synonym dictionary is Wikipedia. The data we used is the enwiki dump provided by the Wikimedia Foundation (WMF). The file is a structured XML file with many tags, such as <page>, <revision>, <title>, <id>, <redirect>, <text>, <ref>, etc. We use Python to remove labels, symbols, and noise to produce csv files for all Wikipedia articles. In addition, this XML file contains a lot of structured information, such as the paired list of redirect pages.
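A minimal sketch of how such redirect pairs can be pulled from the enwiki dump with Python's standard library is shown below; the tag names follow the MediaWiki export schema, but the file path, namespace version, and memory handling are simplified assumptions rather than the authors' pipeline.

import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml"           # hypothetical local path
NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # schema version may differ

redirect_pairs = []
# iterparse streams the dump so the full XML never sits in memory.
for _, page in ET.iterparse(DUMP):
    if page.tag != NS + "page":
        continue
    title = page.findtext(NS + "title")
    redirect = page.find(NS + "redirect")
    if redirect is not None:
        # <redirect title="..."/> names the page this title redirects to.
        redirect_pairs.append((title, redirect.get("title")))
    page.clear()  # free memory for processed pages

print(redirect_pairs[:5])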
Before training the classifier, we must have enough positive and negative training samples. A positive training sample refers to the sentence fragment between two synonyms, and its label is set to 0. A negative training sample is the sentence fragment between two non-synonyms and is labeled as 1. In Wikipedia, redirect pages are a type of special page: users searching for many similar keywords are often redirected to the same page. These keywords and the pages they redirect to usually express the same or very similar entities. We use this structural feature of Wikipedia to collect positive training samples. In the previous phase, all pairs of redirect pages were extracted. In this step, for each "redirected from" and "redirected to" page pair, if the two page names appear in a sentence within a certain distance, the text between them is regarded as a positive training sample. Table 2 shows some of the positive training samples. The first column contains the pages redirected from, the second column the pages redirected to, and the third column the sentence fragment between them, not including the page names. For example, in the first row, the original sentence is "Anarchist movement as known by the Anarchism…", and the page is redirected from Anarchist movement to Anarchism in Wikipedia.
Table 2. Sample texts between positive terms using Wikipedia's "redirect pages" (e.g., term 1 and term 2 synonym pairs) as training base.
Term 1 (redirected from) | Term 2 (redirected to) | Identified words/phrases/symbols between two terms
Anarchist movement | Anarchism | as known by the
Oscars | Academy Awards | also known as the
American National Standards Institute | ANSI | (
Induced abortion | Abortion | is often used to mean only
Australian Football | Australian rules football | officially known as
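The extraction of the in-between text can be sketched as follows; the distance threshold, the helper function, and the single example sentence are assumptions made for illustration, not the authors' exact code.

def between_text(sentence, term1, term2, max_gap=60):
    # Return the text between two terms if both occur and are close enough.
    low = sentence.lower()
    i, j = low.find(term1.lower()), low.find(term2.lower())
    if i == -1 or j == -1:
        return None
    start = min(i + len(term1), j + len(term2))
    end = max(i, j)
    gap = sentence[start:end]
    return gap.strip() if len(gap) <= max_gap else None

sentence = "Anarchist movement as known by the Anarchism is a political philosophy."
sample = between_text(sentence, "Anarchist movement", "Anarchism")
print(sample)  # -> "as known by the"  (label 0: positive training sample)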
3. Validation
First, we test the word synonym dictionary using TOEFL synonym questions to measure the proposed method's accuracy in identifying synonyms. There are 80 questions, each a multiple-choice question (choose one from four) in which only one option is synonymous with or very similar to the word in the question. The final version of our synonym dictionary answers 56 questions correctly, reaching 70% accuracy. The validation results show that the final version of the word level synonym dictionary can identify a high proportion of synonyms. Recall is the proportion of real positive cases that are correctly predicted positive. This research further tests the recall capability of the proposed methods by comparing our method with an existing dictionary, ALWC. We collect 2,280 patents in the solar power domain as the testing corpus. For example, 365 and 124 synonyms at the word and phrase levels respectively are found using our dictionaries, while only 88 synonyms are found using the ALWC dictionary in patent #1. On average, 190 word and 67 phrase synonyms are found per patent using our dictionaries, while only 57 synonyms are found per patent using the ALWC dictionary. Overall, our word and phrase level dictionaries outperform the existing ALWC dictionary. Table 5 shows some samples of synonyms found in a sample patent (#167).
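For reference, the recall described above can be written in its standard form (this formula is not reproduced from the original paper):

$\text{Recall} = \dfrac{TP}{TP + FN}$

where TP is the number of true synonyms retrieved and FN is the number of true synonyms missed.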
4. Conclusion
In this paper, to support a complete freedom-to-operate analysis before a product is put on the market, we constructed both a word level dictionary and a phrase level dictionary using a pattern-based method and a machine learning approach. The proposed method demonstrates strong performance for synonym mining of technical documents. The test results show that the phrase level dictionary is able to detect technical terms in many synonymous forms and that the word level dictionary outperforms the existing dictionary. Patent searches can be made more complete through the dictionaries we constructed, the freedom-to-operate analysis can be improved, and potential intellectual property litigations and disputes can be avoided.
Acknowledgement
This research is partially supported by the Ministry of Science and Technology research
funding (MOST-107-2221-E-007-071 and MOST-107-2410-H-009-023) in Taiwan.
References