Through a step-by-step process, calculate TFIDF for the given corpus and mention the word(s) having highest value.

Document 1: We are going to Mumbai

Document 2: Mumbai is a famous place.

Document 3: We are going to a famous place.

Document 4: I am famous in Mumbai.

Term Frequency: Term frequency is the number of times a word occurs in one document. It can be read directly from the document vector table, since that table records the frequency of each vocabulary word in each document.

            we  are  going  to  Mumbai  is  a  famous  place  I  am  in
Document 1   1    1      1   1       1   0  0       0      0  0   0   0
Document 2   0    0      0   0       1   1  1       1      1  0   0   0
Document 3   1    1      1   1       0   0  1       1      1  0   0   0
Document 4   0    0      0   0       1   0  0       1      0  1   1   1
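The document vector table can be reproduced with a few lines of Python. This is a minimal sketch, not a prescribed method: the tokenizer (lowercasing and dropping punctuation) and the names `tokenize`, `vocab`, and `tf` are assumptions made for illustration.

```python
import re

docs = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]

def tokenize(text):
    # Assumed tokenizer: lowercase, keep alphabetic runs only,
    # so "place." and "place" count as the same word.
    return re.findall(r"[a-z]+", text.lower())

tokens = [tokenize(d) for d in docs]

# Vocabulary in order of first appearance across the corpus.
vocab = list(dict.fromkeys(w for doc in tokens for w in doc))

# One row per document, one column per vocabulary word.
tf = [[doc.count(w) for w in vocab] for doc in tokens]

print(vocab)
for row in tf:
    print(row)
```

Because of the lowercasing, "Mumbai" and "I" appear as `mumbai` and `i` in the printed vocabulary.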

Inverse Document Frequency: The other half of TFIDF is inverse document frequency. To understand it, let us first look at what document frequency means. Document frequency is the number of documents in which a word occurs, irrespective of how many times it occurs in those documents.

The document frequency for the exemplar vocabulary would be:

we  are  going  to  Mumbai  is  a  famous  place  I  am  in
 2    2      2   2       3   1  2       3      2  1   1   1
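As a check on these counts, document frequency can be computed by turning each document into a set of words, so that repeated occurrences inside one document are counted only once. The sketch below assumes the same simple lowercase tokenizer as an illustration; `token_sets` and `df` are hypothetical names.

```python
import re

docs = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]

# A set per document: a word counts once no matter how often it repeats.
token_sets = [set(re.findall(r"[a-z]+", d.lower())) for d in docs]

vocab = ["we", "are", "going", "to", "mumbai", "is",
         "a", "famous", "place", "i", "am", "in"]

# Document frequency: number of documents whose word set contains the word.
df = {w: sum(w in s for s in token_sets) for w in vocab}

for word, count in df.items():
    print(f"{word:>7}: {count}")
```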

To get the inverse document frequency, we put the total number of documents in the numerator and the document frequency in the denominator.

Here, the total number of documents is 4, hence the inverse document frequency becomes:

we   are  going  to   Mumbai  is   a    famous  place  I    am   in
4/2  4/2  4/2    4/2  4/3     4/1  4/2  4/3     4/2    4/1  4/1  4/1
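The ratios can also be evaluated numerically. This short sketch takes the document frequencies from the table as given and assumes the four documents listed in the question; `n_docs`, `df`, and `idf` are illustrative names.

```python
from fractions import Fraction

n_docs = 4  # Documents 1 to 4 in the corpus

# Document frequencies copied from the table above (lowercased words).
df = {"we": 2, "are": 2, "going": 2, "to": 2, "mumbai": 3, "is": 1,
      "a": 2, "famous": 3, "place": 2, "i": 1, "am": 1, "in": 1}

# Inverse document frequency as an exact ratio: total documents / DF.
idf = {w: Fraction(n_docs, count) for w, count in df.items()}

for word, ratio in idf.items():
    print(f"{word:>7}: {n_docs}/{df[word]} = {float(ratio):.3f}")
```

The rarer a word is across the corpus, the larger its ratio.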

The formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log (IDF(W))
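The formula can be applied to every word in every document with a short script. This is a sketch under stated assumptions: the text does not specify the base of the logarithm, so base 10 is assumed here (the ranking of words does not depend on the base), and the tokenizer and variable names are illustrative.

```python
import math
import re

docs = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]

# Assumed tokenizer: lowercase, alphabetic runs only.
tokens = [re.findall(r"[a-z]+", d.lower()) for d in docs]
vocab = list(dict.fromkeys(w for doc in tokens for w in doc))

n_docs = len(docs)
df = {w: sum(w in doc for doc in tokens) for w in vocab}

# TFIDF(W) = TF(W) * log(N / DF(W)); base-10 log is an assumption.
tfidf = [
    {w: doc.count(w) * math.log10(n_docs / df[w]) for w in vocab}
    for doc in tokens
]

# Largest value anywhere in the table, and the words that reach it.
best = max(score for row in tfidf for score in row.values())
top_words = [w for w in vocab
             if any(math.isclose(row[w], best) for row in tfidf)]

for i, row in enumerate(tfidf, start=1):
    print(f"Document {i}:", {w: round(s, 3) for w, s in row.items() if s > 0})
print("highest value:", round(best, 3), "words:", top_words)
```

Printing only the non-zero entries keeps each row short, since a word absent from a document has TF = 0 and therefore TFIDF = 0.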

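The words having the highest value are "is" (in Document 2) and "I", "am" and "in" (in Document 4). Each occurs once, in only one document, so its TFIDF is 1 * log(4/1) = log 4, about 0.602 with a base-10 logarithm. "Mumbai" and "famous" occur in three of the four documents, so they have the lowest non-zero value, log(4/3), about 0.125.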