The BM25 algorithm

the First time this algorithm was met on Wikipedia and didn't pay him much attention. Later, studying the scientific works of employees of Yandex, I noticed that they refer to it, for example, article Segalovich about the algorithms determining near-duplicate, so I decided to figure out what is the meaning of its use. Try simple examples to explain it. So, what this algorithm is?

First. Introduced a dependency of relevance on the occurrence or non-occurrence of words in queries with more than one word.
Let there are several queries consisting of multiple words, for example (the example is purely illustrative):
the
    the
  • by Samsung
  • the
  • to buy a Samsung Galaxy smartphone

Let compares two documents (again illustrative) and the first document does not contain word Galaxy. According to calculations the estimate of this sum of relevantnoise each of the words.

image

The relevance of each of the words is equal to its IDF * on the second factor in the above expression. The only relevance of a search query equal to the sum of relevantnoise all words. Thus, the absence of the word or in other words image (its frequency) is equal to 0, yields a relevance of 0. Therefore, if the first two words of the score will be equally more relevant is the document that contains the word Galaxy.

Second. Advantage when searching for queries with more than 2 words, one of which is less common (more narrow) will be given to documents which contain this is a highly specialized word. For example, there is a request to purchase Samsung Galaxy Note 2 (purely illusory example). Let the Note 2 is a more rare word (fewer times in the collection than Samsung Galaxy). Let is a 2-and document each relevant request and each of them contains in addition to Samsung Galaxy Note 2 also. In the first document note 2 used only once, whereas in the second – 3 times (assuming that the document contains more information about Note 2). But first consider the result of calculating the relevancy algorithm, if the frequencies of all the words in the documents are the same. That's what you get for BM25 in Excel.

image

Also note that due to the fact that the number of documents containing the word Note 2 is less than equal to 50 times of the containing the word galaxy (500), we retrieve IDF equal 3,279634 which is much higher than the IDF for the word galaxy.

Yet we had the same values of frequency for the word note 2 (for other words also). Now let's in Excel will increase the frequency of the word note 2 to док2, will do instead of 0.02 to 0.05 (5 instances).

image

Please note that the IDF value does not change but the value of the formula (the second multiplier in the image at the top) is now equal to 0,061856 and that is the value involved in the computation score, which is now for док2 already 0,290559

Now the most important thing. Increase the frequency of occurrences of the word galaxy to 5 to dock 1

image

As we can see the total frequency of each word in док1 and док2 the same. But the value of score (relevance) higher in док2, because the word note2 is more rare therefore its resulting effect is more than the word galaxy.

In practice, the presence of polysyllabic words in the queries is very important. Of course, the relevance of current search systems is determined not only based on the frequencies as it was shown on the example formula BM25, but some correlation is. Basically it concerns the fact that if in the document there is no word from the search query then this document is much more difficult to rise to the TOP of the query compared to those who have this word contains. Let's look at an example for search engine Yandex.
Enter inquiry Samsung galaxy. I results concerned the Samsung galaxy as a whole (site 2, as usual Wikipedia) the rest of models, pictures, etc.

Enter inquiry samsung galaxy note 2. The results completely changed, now presented with a page that contain information not just about the Samsung galaxy, and Samsung galaxy note 2.

Enter inquiry samsung galaxy note 2 priceAgain the issue is changing now in the results pages that contain the word price, not just the Samsung galaxy.

Enter inquiry samsung galaxy note 2 price Kharkov. The results changes dramatically all the pages in the top 10 contain the word Kharkiv.

Can we say that the word Kharkov is more specialised, as cited in the BM25 algorithm above? IDF words Kharkov knows only search engine, but in the context of a search query Samsung galaxy note 2, it no doubt narrows down the search area. Can be example with Yandex is kind of unfortunate due to the fact that in the above case, a major role to play given the regionality of the request, but I think anybody SEOs that word from the search query must be in the text, I just tried to show the algorithm BM25 and reveal 2 important aspects.

Link to xls document — книга11.xls
Article based on information from habrahabr.ru

Комментарии

Популярные сообщения из этого блога

March Habrameeting in Kiev

PostgreSQL load testing using JMeter, Yandex.Tank and Overload

Monitoring PostgreSQL with Zabbix