
Text Similarities: Estimate the degree of similarity between two texts

  • Search engines need to model the relevance of a document to a query, beyond the overlap in words between the two. For instance, question-and-answer sites such as Quora or Stack Overflow need to determine whether a question has already been asked.
  • In legal matters, the text similarity task helps mitigate risk on a new contract: if a new contract is similar to an existing one that has proved resilient, the risk of the new contract causing financial loss is minimised. This is the principle of case law. Automatic linking of related documents ensures that identical situations are treated similarly in every case, so text similarity fosters fairness and equality. Precedence retrieval of legal documents is an information-retrieval task that retrieves prior case documents related to a given case document.
  • In customer service, an AI system should be able to understand semantically similar queries from users and provide a uniform response. The emphasis on semantic similarity aims to create a system that recognises language and word patterns in order to craft responses the way a human conversation would. For example, whether the user asks “What has happened to my delivery?” or “What is wrong with my shipping?”, they will expect the same response.

What is text similarity?

  • On the surface, if you consider only word-level similarity, the two phrases appear very similar, as three of the four unique words are an exact overlap. Word-level similarity typically does not take into account the actual meaning behind the words or the phrase in context.
  • Instead of doing a word-for-word comparison, we also need to pay attention to context in order to capture more of the semantics. To consider semantic similarity we need to work at the phrase/paragraph level (or lexical-chain level), where a piece of text is broken into relevant groups of related words prior to computing similarity. Although their words overlap significantly, the two phrases actually have different meanings.

What is our winning strategy?

  • Jaccard Similarity ☹☹☹
  • Different embeddings + K-means ☹☹
  • Different embeddings + Cosine Similarity
  • Word2Vec + Smooth Inverse Frequency + Cosine Similarity 😊
  • Different embeddings + LSI + Cosine Similarity
  • Different embeddings + LDA + Jensen-Shannon distance 😊
  • Different embeddings + Word Mover's Distance 😊😊
  • Different embeddings + Variational Auto Encoder (VAE) 😊😊
  • Different embeddings + Universal Sentence Encoder 😊😊
  • Different embeddings + Siamese Manhattan LSTM 😊😊😊
  • BERT embeddings + Cosine Similarity ❤
  • Knowledge-based Measures

What do we mean by different embeddings?

- Bag of Words (BoW)
- Term Frequency - Inverse Document Frequency (TF-IDF)
- Continuous BoW (CBOW) and Skip-Gram model embeddings
- Pre-trained word embedding models :
-> Word2Vec (by Google)
-> GloVe (by Stanford)
-> fastText (by Facebook)
- Poincaré embedding
- Node2Vec embedding based on Random Walk and Graph

A very sexy approach [Knowledge-based Measures (WordNet)] [Bonus]

0. Jaccard Similarity ☹☹☹:

Jaccard Similarity Principle
Jaccard Similarity Function
Why is Jaccard similarity not efficient?
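To make the weakness concrete, here is a minimal sketch of Jaccard similarity over lowercase token sets, applied to the two example sentences used throughout this post (the function name and naive whitespace tokenisation are my own, purely illustrative):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity = |A ∩ B| / |A ∪ B| over lowercase token sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

s1 = "Obama speaks to the media in Illinois"
s2 = "The president greets the press in Chicago"
print(jaccard_similarity(s1, s2))  # ≈ 0.18: only "the" and "in" overlap
```

Despite nearly identical meaning, the score is low because Jaccard only sees surface tokens, not semantics.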

1. K-means and Hierarchical Clustering Dendrogram ☹:

  • Bag of words with TF (term frequency), also called the CountVectorizer method
  • TF-IDF (term frequency – inverse document frequency)
  • Word embeddings, either from pre-trained models such as fastText, GloVe or Word2Vec, or from a custom model trained with Continuous Bag of Words (CBoW) or Skip-Gram
  • BoW or TF-IDF produces one number per word, while word embeddings typically produce one vector per word.
  • BoW or TF-IDF is good for classifying documents as a whole, but word embeddings are good for identifying contextual content.
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!',
          'President greets the press in Chicago',
          'Obama speaks in Illinois'
]
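As a minimal illustration of the bag-of-words idea, here is a hand-rolled stand-in for scikit-learn's CountVectorizer (my own naive tokeniser, for illustration only): each document becomes one raw count per vocabulary word, and these sparse vectors (or their TF-IDF reweighting) are what K-means or hierarchical clustering would consume.

```python
from collections import Counter

def count_vectorize(corpus):
    """Toy bag-of-words: one raw count per vocabulary word per document."""
    docs = [[w.strip(".,!?'\"").lower() for w in text.split()] for text in corpus]
    vocab = sorted({w for doc in docs for w in doc})
    vectors = [[Counter(doc).get(w, 0) for w in vocab] for doc in docs]
    return vocab, vectors

vocab, vectors = count_vectorize(["The sky is blue and beautiful.",
                                  "Love this blue and beautiful sky!"])
print(vocab)
print(vectors)
```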

2. Cosine Similarity ☹:

CountVectorizer Method + Cosine Similarity ☹

Pre-trained Method (such as GloVe) + Cosine Similarity 😊
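The cosine measure itself is simple; a minimal pure-Python sketch is below (the vectors could be raw counts, TF-IDF weights, or averaged pre-trained embeddings):

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = u·v / (||u|| ||v||): ≈1 means same direction, 0 means orthogonal."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # ≈ 1.0: parallel vectors
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0: no shared direction
```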

Smooth Inverse Frequency

  1. Weighting: SIF takes the weighted average of the word embeddings in the sentence. Every word embedding is weighted by a/(a + p(w)), where a is a parameter that is typically set to 0.001 and p(w) is the estimated frequency of the word in a reference corpus.
  2. Common component removal: SIF computes the principal component of the resulting embeddings for a set of sentences. It then subtracts from these sentence embeddings their projections on their first principal component. This should remove variation related to frequency and syntax that is less relevant semantically.
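Step 1 can be sketched in a few lines. The embeddings and frequencies below are toy values of my own invention, and step 2 (common-component removal) is omitted here because it requires an SVD:

```python
def sif_sentence_vector(words, embeddings, freq, a=1e-3):
    """SIF step 1: average the word vectors, each weighted by a / (a + p(w))."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for w in words:
        weight = a / (a + freq.get(w, 0.0))  # frequent words get small weights
        for i, x in enumerate(embeddings[w]):
            vec[i] += weight * x
    return [x / len(words) for x in vec]

emb = {"the": [1.0, 0.0], "gold": [0.0, 1.0]}  # toy 2-D vectors
freq = {"the": 0.05, "gold": 0.0001}           # assumed corpus frequencies
print(sif_sentence_vector(["the", "gold"], emb, freq))
```

Note how the rare word ("gold") dominates the sentence vector while the frequent word ("the") is down-weighted almost to zero.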

3. Latent Semantic Indexing (LSI)

  • d1 = “Shipment of gold damaged in a fire”
  • d2 = “Delivery of silver arrived in a silver truck”
  • d3 = “Shipment of gold arrived in a truck”
  • Stop words were not ignored
  • Text was tokenized and lowercased
  • No stemming was used
  • Terms were sorted alphabetically
  • d1 = (-0.4945, 0.6492)
  • d2 = (-0.6458, -0.7194)
  • d3 = (-0.5817, 0.2469)
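With the three documents reduced to the 2-D LSI vectors above, similarity is just the cosine between those vectors. A small check, reusing the coordinates given:

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v)))

d1 = (-0.4945, 0.6492)   # "Shipment of gold damaged in a fire"
d2 = (-0.6458, -0.7194)  # "Delivery of silver arrived in a silver truck"
d3 = (-0.5817, 0.2469)   # "Shipment of gold arrived in a truck"

# d1 and d3 (both about a shipment of gold) come out most similar.
print(cosine(d1, d3), cosine(d2, d3), cosine(d1, d2))
```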

4. Word Mover’s Distance

I have 2 sentences:

  • Obama speaks to the media in Illinois
  • The president greets the press in Chicago

Removing stop words:

  • Obama speaks media Illinois
  • president greets press Chicago
  • Gini’s measure of discrepancy
  • Transportation problem [Hitchcock]
  • Monge–Kantorovich problem
  • Ornstein distance
  • Lipschitz norm …
  • x has m = 2 masses and y has n = 3 masses.
  • The total mass of x is w_S = sum of all w_i = 0.74 + 0.26
  • The total mass of y is u_S = sum of all u_j = 0.51 + 0.23 + 0.26

Equal-Weight Distributions

A minimum work flow

Flow

  1. Obama speaks to the media in Illinois –> Obama speaks media Illinois –> 4 words
  2. The president greets the press in Chicago –> president greets press Chicago –> 4 words
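Exact WMD solves a small transportation problem; a common, easy-to-code lower bound (the "relaxed" WMD) instead lets every word ship all of its mass to its nearest neighbour on the other side. Below is a sketch with made-up 2-D vectors chosen so that the related word pairs sit close together — the numbers are illustrative, not real embeddings:

```python
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(x_words, y_words, emb):
    """Relaxed WMD lower bound with uniform word masses: each word moves all
    its mass to the closest word on the other side; take the max of the two
    directions."""
    xy = sum(min(euclid(emb[w], emb[v]) for v in y_words) for w in x_words) / len(x_words)
    yx = sum(min(euclid(emb[w], emb[v]) for v in x_words) for w in y_words) / len(y_words)
    return max(xy, yx)

emb = {  # toy 2-D "embeddings", purely illustrative
    "obama": (1.0, 1.0), "president": (1.1, 0.9),
    "speaks": (3.0, 0.0), "greets": (3.1, 0.2),
    "media": (5.0, 2.0), "press": (5.1, 2.1),
    "illinois": (7.0, 4.0), "chicago": (7.2, 4.1),
}
x = ["obama", "speaks", "media", "illinois"]
y = ["president", "greets", "press", "chicago"]
print(relaxed_wmd(x, y, emb))  # small: every word has a close counterpart
```

With real embeddings, gensim's `wmdistance` computes the exact (non-relaxed) distance.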

5. LDA with Jensen-Shannon distance

  • Understanding the different varieties of topics in a corpus (obviously),
  • Getting better insight into the types of documents in a corpus (whether they are news, Wikipedia articles or business documents),
  • Quantifying the most used / most important words in a corpus,
  • … and even document similarity and recommendation (this is where we focus all our attention).
  • A distribution over topics for each document (1)
  • A distribution over words for each topic (2)
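Given the per-document topic distributions from (1), the Jensen–Shannon distance between two documents can be written in a few lines (base-2 logs, so it lies in [0, 1]; the three topic vectors below are invented for illustration):

```python
import math

def kl(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen–Shannon distance: square root of the JS divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

doc_a = [0.7, 0.2, 0.1]   # toy topic distributions over 3 topics
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]
print(js_distance(doc_a, doc_b))  # similar topic mix: small distance
print(js_distance(doc_a, doc_c))  # different dominant topic: larger distance
```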
Amazing ❤ https://www.kaggle.com/ktattan/lda-and-document-similarity

Beyond Cosine: A Statistical Test.

  • Using a symmetric formula when the problem does not require symmetry. If we denote sim(A, B) as the similarity formula used, the formula is symmetric when sim(A, B) = sim(B, A). Ranking documents based on a query, for example, does not work well with a symmetric formula. The best-performing methods, such as BM25 and DFR, are not similarities, despite the term used in the Lucene documentation, because they are not symmetric with respect to the document and the query. As a consequence, similarity is not optimal for ranking documents against queries.
  • Using a similarity formula without understanding its origin and statistical properties. For example, the cosine similarity is closely related to the normal distribution, but the data to which it is applied is not drawn from a normal distribution. In particular, the squared length normalization is suspicious.

6. Variational Auto Encoder

https://www.kaggle.com/shivamb/how-autoencoders-work-intro-and-usecases
  • Encoding architecture : the encoder comprises a series of layers with a decreasing number of nodes, ultimately reducing the input to a latent-view representation.
  • Latent-view representation : the latent view represents the lowest-dimensional space to which the inputs are reduced while information is preserved.
  • Decoding architecture : the decoder is the mirror image of the encoder, with the number of nodes increasing in every layer, ultimately outputting an (almost) identical reconstruction of the input.
http://blog.qure.ai/notes/using-variational-autoencoders

7. Pre-trained sentence encoders

  • InferSent (Facebook Research) : BiLSTM with max pooling, trained on the SNLI dataset, 570k English sentence pairs labelled with one of three categories: entailment, contradiction or neutral.
  • Google Universal Sentence Encoder : a simpler Deep Averaging Network (DAN) where input embeddings for words and bigrams are averaged together and passed through a feed-forward deep neural network.

8. Siamese Manhattan LSTM (MaLSTM)

http://www.erogol.com/duplicate-question-detection-deep-learning/

9. Bidirectional Encoder Representations from Transformers (BERT) with cosine distance

  • record the play
  • play the record
  • play the game

10. A word about Knowledge-based Measures

  • WordNet, which is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.
  • The Wu and Palmer metric measures the semantic similarity of two concepts from the depth of their least common subsumer in the WordNet graph.
  • The Resnik metric estimates the similarity from the probability of encountering the least common subsumer in a large corpus. This probability is known as the Information Content (IC). Note that a concept, here, is a synset, i.e. a word sense, and each word has several synsets. These measures are implemented in the Python NLTK module.

