Semantic Textual Similarity in Replika

Denis Fedorenko
Research Engineer, Luka Inc.
Plan
• Task definition

• Baseline model

• Model improvements

• Conclusion and future work


Semantic Textual Similarity
• The task is to measure the meaning similarity of two texts

• Find a model

M: (text1, text2) → ℝ
Toy STS model
• How many common words do the two texts share? (Here J is the Jaccard index over word sets: shared words divided by total distinct words)

• Example:

J("I have a funny dog", "I have a cat") = 3/6 = 0.5

Toy STS model
• More examples:

J("I have a dog", "I have a cat") = 3/5 = 0.6
J("I have a dog", "I have a puppy") = 3/5 = 0.6
J("I have a funny dog", "My puppy is very nice") = 0

• This model is very sensitive to synonyms and paraphrases

• How can we overcome this issue? (A minimal sketch of this word-overlap model follows below)
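A minimal sketch of the word-overlap toy model, assuming lowercasing and whitespace tokenization; the helper name jaccard_similarity is my own:

```python
def jaccard_similarity(text1: str, text2: str) -> float:
    """Toy STS model: shared words divided by total distinct words."""
    words1, words2 = set(text1.lower().split()), set(text2.lower().split())
    if not (words1 | words2):
        return 0.0
    return len(words1 & words2) / len(words1 | words2)

print(jaccard_similarity("I have a funny dog", "I have a cat"))          # 0.5
print(jaccard_similarity("I have a funny dog", "My puppy is very nice"))  # 0.0
```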


STS framework
• Find a model (text-to-vector):

E: text → ℝⁿ

• Such that:

M: (E(text1), E(text2)) → ℝ

where M is a similarity function (e.g. cosine)
or some trainable model (e.g. logistic regression, neural network)
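A minimal sketch of this framework, assuming numpy; the encode function below is a hypothetical bag-of-words stand-in for any text-to-vector model E, and cosine plays the role of M:

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Similarity function M over two text vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def encode(text: str, vocab: dict) -> np.ndarray:
    """Hypothetical bag-of-words encoder E: text -> R^n (n = vocabulary size)."""
    v = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            v[vocab[word]] += 1.0
    return v

vocab = {w: i for i, w in enumerate("i have a funny dog cat puppy".split())}
v1, v2 = encode("I have a dog", vocab), encode("I have a cat", vocab)
print(cosine_similarity(v1, v2))  # 0.75
```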
STS in Replika
• The task is to determine whether two utterances are semantically equivalent

• Find a model

M: (utterance1, utterance2) → {0, 1}

• A particular case of STS

What is "equivalence"?
• Paraphrases

• Utterances that have the same set of possible


answers

• Ultimately, equivalence should be determined by


product requirements
Example: scripts
(figure omitted: user phrase constraint, user phrase template, result: matched phrase)

Example: Replika-QA
(figure omitted: user phrase constraint, user phrase templates, result: matched phrase)
STS evaluation
• On holdout testsets:

• Classification metrics (precision, recall, AUC)

• Information retrieval metrics (average precision, recall@N)

• In the wild:

• User feedback (upvotes and downvotes) in the scripts and Replika-QA
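A small sketch of these metrics, assuming scikit-learn; the labels and scores below are toy values, not real evaluation data:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0])                # toy ground-truth labels
y_score = np.array([0.9, 0.4, 0.7, 0.2, 0.6, 0.1])   # toy model scores
y_pred = (y_score > 0.5).astype(int)

print(precision_score(y_true, y_pred))   # classification metrics
print(recall_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))

def recall_at_n(relevant: set, ranked: list, n: int) -> float:
    """Information-retrieval recall@N for a single query."""
    return len(relevant & set(ranked[:n])) / max(len(relevant), 1)

print(recall_at_n({"a", "b"}, ["b", "c", "a", "d"], n=2))  # 0.5
```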
Metrics
(figure omitted)
Plan
• Task definition

• Baseline model

• Model improvements

• Conclusion and future work


Baseline STS model
• Two-class logistic regression classifier over text vectors produced by the context encoder of the retrieval-based dialog model (DM)

f(x) = sigmoid(W · |v1 - v2|) ∈ (0, 1)
where v1 = DM.Encoder(utter1), v2 = DM.Encoder(utter2)
Decision rule: 1 if f(x) > 0.5 else 0
(a sketch of this classifier follows below)

• Trainset: 3900 text pairs obtained by different high-recall heuristics and marked by assessors

• Testset: 400 text pairs
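A minimal sketch of the baseline classifier, assuming scikit-learn and numpy; dm_encode is a hypothetical stand-in for DM.Encoder, and the two training pairs are toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dm_encode(utterance: str) -> np.ndarray:
    """Hypothetical stand-in for DM.Encoder (the dialog model's context encoder)."""
    rng = np.random.default_rng(abs(hash(utterance)) % (2 ** 32))
    return rng.standard_normal(64)

def pair_features(u1: str, u2: str) -> np.ndarray:
    v1, v2 = dm_encode(u1), dm_encode(u2)
    return np.abs(v1 - v2)  # |v1 - v2|

pairs = [("how are you", "how are you doing"), ("i have a dog", "what time is it")]
labels = [1, 0]  # 1 = semantically equivalent, 0 = not

X = np.stack([pair_features(u1, u2) for u1, u2 in pairs])
clf = LogisticRegression().fit(X, labels)     # f(x) = sigmoid(W . |v1 - v2|)
print(clf.predict_proba(X)[:, 1] > 0.5)       # decision rule: f(x) > 0.5
```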


Retrieval-based dialog model
Basic QA-LSTM: Tan et al. (2015)

Dialog text encoder
• During training, similar contexts often have similar or even coinciding answers

• As a result, similar texts are encoded into similar vectors

• Hence the encoders can be successfully used for further text analysis (classification, clustering)
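A minimal sketch of a QA-LSTM-style dual encoder in the spirit of Tan et al. (2015), assuming PyTorch; the sizes, names and max-pooling choice are illustrative, not the exact Replika model:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bidirectional LSTM encoder with max-pooling over time."""
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                 # (batch, seq_len)
        out, _ = self.lstm(self.emb(token_ids))   # (batch, seq_len, 2 * hidden)
        return out.max(dim=1).values              # (batch, 2 * hidden) text vector

context_encoder, response_encoder = TextEncoder(), TextEncoder()
contexts = torch.randint(0, 10000, (4, 12))       # dummy token-id batches
responses = torch.randint(0, 10000, (4, 12))
scores = nn.functional.cosine_similarity(context_encoder(contexts),
                                          response_encoder(responses))
print(scores.shape)                               # one matching score per pair
```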
Plan
• Task definition

• Baseline model

• Model improvements

• Conclusion and future work


Possible improvements
• Enlarge the datasets

• Search for a better classification model


Dataset extraction pipeline

User logs → Extract matches → Utterance pairs → Preprocess → Preprocessed utterance pairs → Mark (Amazon Mechanical Turk crowdsourcing) & Split → Trainset / Testset
Matches extraction
• Extract matches of the baseline model from the logs. The false positives obtained will help improve precision

• Use a different algorithm (e.g. skip-thought, Kiros et al. (2015)) to extract novel text pairs from the logs. The false negatives obtained (according to the baseline model) will help improve recall
Matches preprocessing
• Remove text pair duplicates

• Remove too short/long text pairs (outliers)

• Remove pairs with coinciding texts (trivial samples)

• Remove too noisy text pairs, e.g. with many out-of-vocabulary words (non-informative samples and noise)

• Remove pairs with highly dissimilar texts (to fight the curse of dimensionality), i.e. those with (see the sketch below)

cosine(DM.Encoder(text1), DM.Encoder(text2)) < threshold
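A sketch of these filters, reusing the hypothetical dm_encode stand-in from above; the length, out-of-vocabulary and similarity thresholds are illustrative, not the values used in practice:

```python
import numpy as np

def cosine(v1: np.ndarray, v2: np.ndarray) -> float:
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9))

def keep_pair(t1: str, t2: str, dm_encode, vocab,
              min_len=2, max_len=30, max_oov=0.5, sim_threshold=0.3) -> bool:
    """Return True if the (t1, t2) pair survives the preprocessing filters."""
    if t1 == t2:                                     # coinciding texts: trivial sample
        return False
    for text in (t1, t2):
        words = text.lower().split()
        if not (min_len <= len(words) <= max_len):   # too short/long: outlier
            return False
        oov = sum(w not in vocab for w in words) / len(words)
        if oov > max_oov:                            # too many OOV words: noise
            return False
    # highly dissimilar texts are dropped to fight the curse of dimensionality
    return cosine(dm_encode(t1), dm_encode(t2)) >= sim_threshold

# Duplicate pairs can be removed upstream, e.g. by keeping a set of
# frozenset({t1, t2}) keys before applying keep_pair.
```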


Dataset extraction results
• Trainset: 17556 text pairs

• Testsets:

• Scripts testset: 1035 text pairs (measures quality on scripts)

• Common testset: 1162 text pairs (measures average quality)

• Errors (false positives) testset: 555 text pairs (measures the model's specificity)
Scripts testset, Common testset, Errors (false positives) testset
(example figures omitted)

The errors testset covers 7 different error types: we can investigate what kinds of errors the model makes


Possible improvements
• Enlarge the datasets

• Search for a better classification model


Classification pipeline

Text pair → Vectorize → Vector pair → Extract features → Feature vector → Classify → Result: 0|1
(the classifier is trained on the trainset)

We can vary these components!
Pipeline components
• Vectorizers:
  • Dialog context encoder
  • Dialog response encoder
• Features:
  • |v1 - v2|
  • v1 * v2
  • [|v1 - v2|, v1 * v2]
• Classifiers:
  • Logistic regression
  • SVM
  • Random forest
  • ...
• Trainsets:
  • Marked user logs
  • External:
    • Quora (~400k)
    • SemEval/SICK (~20k)
  • Combination of all above
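A sketch of sweeping these components, assuming scikit-learn; the encoders, data and the train/validation split are hypothetical placeholders:

```python
import numpy as np
from itertools import product
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

FEATURES = {
    "abs_diff": lambda v1, v2: np.abs(v1 - v2),
    "product": lambda v1, v2: v1 * v2,
    "both": lambda v1, v2: np.concatenate([np.abs(v1 - v2), v1 * v2]),
}
CLASSIFIERS = {
    "logreg": lambda: LogisticRegression(max_iter=1000),
    "svm": lambda: LinearSVC(),
    "random_forest": lambda: RandomForestClassifier(n_estimators=100),
}

def auc_on_validation(encoder, feature_fn, make_clf,
                      train_pairs, train_labels, val_pairs, val_labels):
    """Fit one (vectorizer, features, classifier) combination and score it by AUC."""
    featurize = lambda pairs: np.stack([feature_fn(encoder(a), encoder(b)) for a, b in pairs])
    clf = make_clf().fit(featurize(train_pairs), train_labels)
    X_val = featurize(val_pairs)
    scores = (clf.decision_function(X_val) if hasattr(clf, "decision_function")
              else clf.predict_proba(X_val)[:, 1])
    return roc_auc_score(val_labels, scores)

# for feat_name, clf_name in product(FEATURES, CLASSIFIERS):
#     auc = auc_on_validation(dm_context_encoder, FEATURES[feat_name], CLASSIFIERS[clf_name],
#                             train_pairs, train_labels, val_pairs, val_labels)
```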
Model selection
(figure omitted: candidate models compared by FPR, less is better, and by AUC, more is better)

Select top candidate models by AUC, tune them on the validation set and select the best model by FPR
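A small sketch of this two-stage selection; the candidate list and its metric values are purely hypothetical:

```python
# Hypothetical candidates with precomputed validation metrics.
candidates = [
    {"name": "logreg_abs_diff", "auc": 0.91, "fpr": 0.12},
    {"name": "linear_svm_both", "auc": 0.93, "fpr": 0.08},
    {"name": "random_forest_product", "auc": 0.88, "fpr": 0.15},
]

top_by_auc = sorted(candidates, key=lambda c: c["auc"], reverse=True)[:2]  # shortlist by AUC
best = min(top_by_auc, key=lambda c: c["fpr"])                             # pick lowest FPR
print(best["name"])
```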
Model selection results

• Best configuration:

• Dialog context encoder

• Marked user logs dataset only

• [|v1 - v2|, v1 * v2] feature vector

• Linear SVM
Model selection discussion
• Quality gain is not as high as it could be

• Classification model quality is limited by the quality of the underlying vectorizer (dialog model)

• We can try to fine-tune the already-trained dialog model on STS data to solve the target task directly
Transfer learning
(diagram: the already trained retrieval-based dialog model encodes a context and a response with a context encoder and a response encoder into a context vector and a response vector, scores them with cosine similarity and is trained with a retrieval loss)
Transfer learning
(diagram: the trained context encoder weights are copied into a context encoder shared between Text1 and Text2; the cosine similarity of the two text vectors goes through a sigmoid into a binary loss, and the shared weights are updated on the STS data)
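A minimal fine-tuning sketch, assuming PyTorch and that context_encoder is the already-trained dialog model's context encoder (an nn.Module shared between the two texts); the sigmoid scaling factor and optimizer settings are illustrative:

```python
import torch
import torch.nn as nn

def sts_fine_tune_step(context_encoder, optimizer, ids1, ids2, labels):
    """One training step: shared encoder -> cosine -> sigmoid -> binary loss."""
    v1 = context_encoder(ids1)                      # Text1 vector
    v2 = context_encoder(ids2)                      # Text2 vector (same weights)
    cos = nn.functional.cosine_similarity(v1, v2)   # in [-1, 1]
    logits = 5.0 * cos                              # scale before the sigmoid (illustrative)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                # update the shared encoder weights
    return loss.item()

# Usage with the TextEncoder sketch above as a stand-in encoder:
# encoder = TextEncoder()
# optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
# loss = sts_fine_tune_step(encoder, optimizer,
#                           torch.randint(0, 10000, (8, 12)),
#                           torch.randint(0, 10000, (8, 12)),
#                           torch.randint(0, 2, (8,)))
```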
Transfer learning results
(results figure omitted)

Trainset: user logs + SemEval/SICK
Transfer learning discussion
• It is not a trivial approach in itself

• Need to carefully tune the optimizer, its parameters and the model itself (e.g. by adding dropout, batch normalization etc.)

• Need more data (much more than 20000 samples)

Conclusion
• Semantic textual similarity is an open problem in natural language processing (Cer et al. (2017))

• The definition of similarity is very important and should be determined by the target product requirements

• A correct evaluation methodology is also very important and should be designed according to the target application

• Text representation (text-to-vector) is a crucial step

Future work
• Datasets:

• Enlarge the user logs trainset to 100000 samples and more

• Incorporate high-quality external datasets (like the novel ParaNMT-50M, Wieting et al. (2017))

• Model:

• Incorporate more features: linguistic, pairwise word similarities etc. (Maharjan et al. (2017))

• Incorporate "hard" negative training samples (Wieting et al. (2017))

• Mostly focus on end-to-end training and transfer learning

References
• Kiros et al. (2015). Skip-Thought Vectors

• Cer et al. (2017). SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Cross-lingual Focused Evaluation

• Wieting et al. (2017). Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations

• Maharjan et al. (2017). DT Team at SemEval-2017 Task 1: Semantic Similarity Using Alignments, Sentence-Level Embeddings and Gaussian Mixture Model Output

• Tan et al. (2015). LSTM-based Deep Learning Models for Non-factoid Answer Selection
