Demystifying Text Analytics Part 3 — Finding Similar Documents with Cosine Similarity Algorithm

This is Part 3 of the Demystifying Text Analytics series.

In the last two posts, we imported 100 text documents from companies in California. These documents describe how the companies comply with the ‘California Transparency in Supply Chains Act’.

We transformed and prepared the text data, then scored each term by calculating its TF-IDF.

Now, remember the following questions we originally asked in Part 1?

“Can we find a set of documents similar to a document from company A?” Or, “Can we find similar documents so that we might be able to detect that some companies are using the same template, or simply copying and pasting a ‘golden’ document regardless of what they actually do every day?”

Now that we have the TF-IDF scores, we can answer these questions. In this post, I’m going to calculate the similarities among the documents based on those scores by using the ‘Cosine Similarity’ algorithm.

What is Cosine Similarity?

Cosine similarity measures the cosine of the angle between two vectors. In this case, each document can be represented as a vector whose direction is determined by its set of TF-IDF values.

To keep the example simple, let’s assume the text data has only two terms (or words), “supply” and “transparency”, and that we have a TF-IDF value for each term in Document A and Document B, as shown below.

If we plot these values in a two-dimensional space, it looks something like the below.

The arrows pointing towards Document A and Document B are the ‘vectors’.

If the two vectors point in a similar direction, the angle between them is very narrow, and this means that the two documents they represent are similar.

So, to measure the similarity, we calculate the cosine of the angle between the two vectors. The narrower the angle, the closer the cosine is to 1.
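As a minimal sketch of that calculation in Python (the post itself uses Exploratory, so this is only an illustration), the cosine is the dot product of the two vectors divided by the product of their lengths. The TF-IDF numbers here are made up for illustration and are not the values from the documents in this series.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an all-zero vector has no direction
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF values in the order ("supply", "transparency")
doc_a = [0.30, 0.10]
doc_b = [0.25, 0.12]

print(cosine_similarity(doc_a, doc_b))
```

A value close to 1 means the two vectors point in almost the same direction. With non-negative TF-IDF values, the result stays between 0 and 1.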

Of course, the documents we are working with have hundreds of terms rather than just two. This means that we have a ‘high’ dimensional space rather than a two-dimensional one. But the concept is still the same.

Each document’s position in the high-dimensional space is determined by all the dimensions (the terms, in this case) and their corresponding values (the TF-IDF values, in this case).

This is also why a typical ‘distance’ algorithm like ‘Euclidean’ won’t work well for calculating the similarity here: in such a high-dimensional space, distances tend to become less meaningful (the ‘curse of dimensionality’). Instead, we want to use the cosine similarity algorithm, which compares the directions of the vectors.
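One concrete difference between the two measures can be sketched in Python. This is a side illustration rather than the dimensionality argument itself: Euclidean distance reacts to the length of a vector, while cosine similarity only looks at its direction. The vectors are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

# Hypothetical term weights: the second vector has the same mix of terms,
# only with three times the weight on each one.
short_doc = [0.1, 0.2, 0.0, 0.4]
long_doc = [3 * v for v in short_doc]

print(euclidean_distance(short_doc, long_doc))  # clearly greater than 0
print(cosine_similarity(short_doc, long_doc))   # approximately 1.0
```

The two vectors point in exactly the same direction, so cosine similarity treats them as the same, while Euclidean distance reports them as far apart.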

Calculate Cosine Similarity with Exploratory

Let’s take a look at how we can calculate the cosine similarity in Exploratory.

Open the data frame we used in the previous post in Exploratory Desktop. In the previous exercise, we filtered the data to the top 10 terms for each document.

But we want to apply the cosine similarity algorithm on top of the full data, not the filtered data.

Remove the last two steps of Group By and Top N.

We should have the data like this.

Once the data is ready, make sure the last step, ‘9. Filter’, is selected in the list of data wrangling steps on the right-hand side.

Then click the ‘Add’ button and select

Run Analytics -> Calculate Cosine Similarity

In the dialog, select the grouping column (e.g. Company Name) you want to calculate the cosine similarity for, then select a dimension column (e.g. terms) and a measure column (e.g. TF-IDF).

This will return the cosine similarity value for every single combination of the documents.
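For readers who want to see what such a step might do under the hood, here is a rough Python sketch. It is not Exploratory’s implementation. It only assumes long-format rows of (document, term, TF-IDF) and returns one row per ordered pair of distinct documents. The function name `pairwise_cosine`, the sample company names, and the numbers are all hypothetical. The output keys mirror the ‘name.x’, ‘name.y’, and ‘value’ columns used in the chart section.

```python
import math
from collections import defaultdict
from itertools import permutations

def pairwise_cosine(rows):
    """rows: iterable of (document, term, tfidf) tuples in long format."""
    # Build one sparse vector (term -> tfidf) per document
    vectors = defaultdict(dict)
    for doc, term, tfidf in rows:
        vectors[doc][term] = tfidf

    # Pre-compute each document vector's length
    norms = {doc: math.sqrt(sum(v * v for v in terms.values()))
             for doc, terms in vectors.items()}

    results = []
    for x, y in permutations(vectors, 2):
        # Only terms present in both documents contribute to the dot product
        shared = vectors[x].keys() & vectors[y].keys()
        dot = sum(vectors[x][t] * vectors[y][t] for t in shared)
        denom = norms[x] * norms[y]
        results.append({"name.x": x, "name.y": y,
                        "value": dot / denom if denom else 0.0})
    return results

# Hypothetical sample data
rows = [
    ("Company A", "supply", 0.30), ("Company A", "transparency", 0.10),
    ("Company B", "supply", 0.30), ("Company B", "transparency", 0.10),
    ("Company C", "audit", 0.50),
]
for pair in pairwise_cosine(rows):
    print(pair)
```

In this made-up data, Company A and Company B have identical term weights and so score approximately 1, while Company C shares no terms with either and scores 0. Skipping the pairing of a document with itself is a choice made for this sketch.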

The higher the similarity value, the more similar the two documents are.

Visualize the similarities with Heatmap

We can quickly visualize these similarity scores by using ‘Heatmap’ under the Chart view.

Assign ‘name.x’ to X-Axis, ‘name.y’ to Y-Axis, and ‘value’ to Color.

We can zoom into an area with a high concentration of red by dragging the mouse pointer over that area.

We can see that ‘Georgia-Pacific Consumer Operations LLC’ and ‘Georgia-Pacific Panel Products LLC’ have a similarity value of 1, which means they are basically the same, if not exactly the same.

Filter to keep the document pairs with high Cosine Similarity Scores

Let’s zero in on only the highly similar document pairs. We can filter the data to keep only the document pairs with a cosine similarity score greater than 0.5. Select

Filter -> Greater than

from the column header menu of the ‘value’ column.

Type 0.5 as the value.
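The same filter can be sketched in Python, assuming the similarity results are held as a list of dictionaries with ‘name.x’, ‘name.y’, and ‘value’ keys. The rows and the helper name `keep_similar_pairs` are hypothetical and only show the shape of the operation.

```python
def keep_similar_pairs(pairs, threshold=0.5):
    """Keep only pairs whose similarity is strictly greater than threshold."""
    return [p for p in pairs if p["value"] > threshold]

# Hypothetical similarity rows
pairs = [
    {"name.x": "Company A", "name.y": "Company B", "value": 0.92},
    {"name.x": "Company A", "name.y": "Company C", "value": 0.50},
    {"name.x": "Company B", "name.y": "Company C", "value": 0.13},
]
print(keep_similar_pairs(pairs))
```

Because the comparison is ‘greater than’, a pair scoring exactly 0.5 is dropped.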

Just by looking at these document names (company names), we can tell that some of these companies are actually related to each other (e.g. “Bissell Homecare Inc” and “Bissell Inc”), hence they have most likely shared the same document or template.

What is interesting here, though, is that some companies like ‘Moen Incorporated’, ‘MasterBrand’, and ‘Therma-Tru’, whose names don’t suggest any close relationship, seem to be using the same document or template!

As you have seen, calculating the cosine similarity based on TF-IDF helps to find the similarity between two documents.

Now, what if we want to understand the overall relationship among the documents rather than the relationship between each pair of the documents?

In the next episode, I’m going to discuss how we can use ‘clustering’ and ‘dimensionality reduction’ methods to understand such relationships among the documents.