The BM25 algorithm
I first came across this algorithm on Wikipedia and did not pay much attention to it. Later, while studying the scientific papers of Yandex employees, I noticed that they refer to it (for example, Segalovich's article on algorithms for detecting near-duplicates), so I decided to figure out what the point of using it is. I will try to explain it with simple examples.

So, what does this algorithm do?

First, it makes relevance depend on whether each word of a multi-word query occurs or does not occur in the document. Suppose there are several queries consisting of multiple words, for example (the example is purely illustrative):

buy Samsung
buy a Samsung Galaxy smartphone

Suppose we compare two documents (again, illustrative), and the first document does not contain the word Galaxy. According to the formula, a document's score is the sum of the relevances of each of the query words. The relevance of each word is equal to its IDF multiplied by the second factor in the above expression...
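To make the "sum over query words" idea concrete, here is a minimal Python sketch. It assumes the commonly cited Okapi BM25 form, in which each word contributes its IDF times a term-frequency factor. The function name `bm25_score`, the parameter values `k1=1.5` and `b=0.75`, the `+ 1` smoothing inside the logarithm, and the three toy documents are my own assumptions for illustration and are not taken from any particular search engine.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Sum of per-word relevances: IDF(word) * term-frequency factor."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            # A query word missing from the document adds nothing.
            continue
        df = sum(1 for d in corpus if term in d)
        # Assumed IDF variant: the "+ 1" keeps the value non-negative.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Second factor: saturating term frequency with length normalisation.
        length_norm = 1 - b + b * len(doc_terms) / avg_len
        tf_factor = tf * (k1 + 1) / (tf + k1 * length_norm)
        score += idf * tf_factor
    return score

# Illustrative documents, already lower-cased and tokenised.
doc_without_galaxy = ["buy", "samsung", "smartphone"]
doc_with_galaxy = ["buy", "samsung", "galaxy", "smartphone"]
doc_unrelated = ["phone", "reviews", "and", "prices"]
corpus = [doc_without_galaxy, doc_with_galaxy, doc_unrelated]

query = ["buy", "samsung", "galaxy", "smartphone"]
score_without = bm25_score(query, doc_without_galaxy, corpus)
score_with = bm25_score(query, doc_with_galaxy, corpus)
print(score_without, score_with)
```

In this toy corpus the document that contains Galaxy scores higher for the four-word query, because the first document gets no contribution at all from the missing word.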