Machines, unlike humans, cannot understand raw text; they can only work with numbers. Therefore, we need to convert text into a numeric form. There are several approaches to this conversion, and the Bag of Words and Word Embedding models are two of the most commonly used.
The Bag of Words model extracts features from text: it converts the text into a matrix recording which words appear in each document (and how often), which can then be fed to an algorithm. It is called a "bag" of words because all information about the order or structure of the words in the document is discarded.
import numpy as np    # numeric arrays (scikit-learn returns NumPy arrays below)
import nltk           # general NLP toolkit; not strictly needed for the scikit-learn examples below
It is easiest to follow the logic of this method on a simple example. Take a short text consisting of 3 sentences:
tekst = ['Tom likes blue.', 'Adam likes yellow.', 'Ann likes red and blue']   # a toy corpus of 3 "documents"
We have a total of 11 words and 8 unique words. The next step is to build the document-term matrix and count the occurrences of each word:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()         # default settings: lowercase text, strip punctuation, single words
vectorizer.fit(tekst)                  # learn the vocabulary of the corpus
print(vectorizer.vocabulary_)          # mapping: word -> column index

vector = vectorizer.transform(tekst)   # encode the corpus as a document-term matrix
print(vector.shape)                    # (3, 8): 3 sentences, 8 unique words
print(vector.toarray())                # word counts per sentence
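With the default settings, this should print output close to the following (the exact print order of the vocabulary dictionary may vary, but the column indices are always assigned alphabetically):

# {'adam': 0, 'and': 1, 'ann': 2, 'blue': 3, 'likes': 4, 'red': 5, 'tom': 6, 'yellow': 7}
# (3, 8)
# [[0 0 0 1 1 0 1 0]    <- 'Tom likes blue.'
#  [1 0 0 0 1 0 0 1]    <- 'Adam likes yellow.'
#  [0 1 1 1 1 1 0 0]]   <- 'Ann likes red and blue'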
Our matrix is 3×8: 3 sentences and 8 unique words. The matrix above uses single words, but a feature can also be a combination of two or more consecutive words, called a bigram or trigram; the general approach is known as the n-gram model. For example, the bigrams of our first two sentences would be: 'tom likes', 'likes blue', 'adam likes', 'likes yellow' (after lowercasing and removing punctuation). Note that n-grams are extracted within each document, so a pair like 'blue adam' never appears, as the sketch below shows.
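Here is a minimal sketch of the bigram variant, reusing the same corpus; ngram_range is a standard CountVectorizer parameter, and get_feature_names_out() requires scikit-learn >= 1.0 (older versions use get_feature_names()):

# Bigram bag of words: each feature is a pair of consecutive words.
# N-grams are extracted within each document, so no bigram spans two sentences.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vector = bigram_vectorizer.fit_transform(tekst)
print(bigram_vectorizer.get_feature_names_out())
# e.g. ['adam likes' 'and blue' 'ann likes' 'likes blue' 'likes red'
#       'likes yellow' 'red and' 'tom likes']
print(bigram_vector.toarray())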
The bag-of-words approach works well for converting text to numbers. However, it has one drawback: it scores a word based only on its occurrences within a single document, without taking into account that the word may also have a high frequency in other documents. TF-IDF solves this problem by multiplying the term frequency by the inverse document frequency. TF stands for "Term Frequency" and IDF for "Inverse Document Frequency".
TF-IDF is a technique that measures how important a word is in a given document relative to the whole collection of documents.

TF (Term Frequency) measures how often a word occurs in a document:

TF(t, d) = (number of occurrences of term t in document d) / (total number of words in document d)

IDF (Inverse Document Frequency) measures how informative a word is across the collection. Stop words such as "a", "to", and "and" occur in nearly every document, so they carry little meaning despite their high frequency:

IDF(t) = log(total number of documents / number of documents containing term t)

TF-IDF is the product of the two:

TF-IDF(t, d) = TF(t, d) * IDF(t)
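As a sanity check, we can compute one value by hand with the textbook formulas above: 'blue' occurs once among the 3 words of the first sentence and appears in 2 of the 3 documents. (Note that scikit-learn, used below, applies a smoothed variant of IDF plus row normalization, so its numbers will differ slightly.)

import math

# TF-IDF of 'blue' in 'Tom likes blue.' using the plain formulas:
tf = 1 / 3                 # 'blue' occurs once among 3 words
idf = math.log(3 / 2)      # 3 documents, 2 of them contain 'blue'
print(tf * idf)            # ~0.135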
Having a bag of words, we can transform it into TF-IDF and obtain a matrix of weights computed relative to the entire collection of documents:
from sklearn.feature_extraction.text import TfidfTransformer

tfidfconverter = TfidfTransformer()    # converts a count matrix into TF-IDF weights
X = tfidfconverter.fit_transform(vector).toarray()
print(X)
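Note that scikit-learn's TfidfTransformer does not apply the plain textbook formula: by default it smooths the IDF (smooth_idf=True), adds 1 to it, and L2-normalizes each row. A minimal sketch of turning those defaults off, using the transformer's standard parameters, for comparison:

# Closer to the raw tf * idf product (scikit-learn still adds 1 to the IDF internally):
raw_converter = TfidfTransformer(norm=None, smooth_idf=False)
print(raw_converter.fit_transform(vector).toarray())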
We can also skip the bag-of-words stage and compute the matrix directly from the text; note the parameters of TfidfVectorizer(), which can help you better shape the resulting matrix:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfconverter = TfidfVectorizer()     # CountVectorizer + TfidfTransformer in a single step
X = tfidfconverter.fit_transform(tekst).toarray()
print(X)
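For example, a minimal sketch combining a few of TfidfVectorizer's standard parameters: filtering English stop words, adding bigrams, and capping the vocabulary size:

# Drop English stop words, use unigrams and bigrams, keep at most 20 features.
tuned = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=20)
X_tuned = tuned.fit_transform(tekst).toarray()
print(tuned.get_feature_names_out())   # scikit-learn >= 1.0
print(X_tuned)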
Having a numeric representation of the text, we can use it for machine learning, but that is a topic for another post.