From Brezhnev to search engines
(Thanks to Nghị for the link.) This paper by Andrei Broder at CPM 2000 opens as follows:
A Communist era joke in Russia goes like this: Leonid Brezhnev (the Party leader) wanted to get rid of the Premier, Aleksey Kosygin. (In fact he did, in 1980.) So Brezhnev went to Kosygin and said: “My dear friend and war comrade Aleksey, I had very disturbing news: I just found out that you are Jewish: I have no choice, I must ask you to resign.” Kosygin, in total shock says: “But Leonid, as you know very well I am not Jewish!”; then Brezhnev says: “Well, Aleksey, then think about it…”
What this has to do with near-duplicate documents? In mid 1995, the AltaVista web search engine was built at the Digital research labs in Palo Alto (see [10]). Soon after the first internal prototype was deployed, a colleague, Chuck Thacker, came to me and said: “I really like AltaVista, but it is very annoying that often half the first page of answers is just the same document in many variants.” “I know,” said I. “Well,” said Chuck, “you did a lot of work on fingerprinting documents; can you make up a fingerprinting scheme such that two documents that are near-duplicate get the same fingerprint?” I was of course indignant: “No way!! You miss the idea of fingerprints completely: fingerprints are such that with high probability two distinct documents will have different fingerprints, no matter how little they differ! Similar documents getting the same fingerprint is entirely against their purpose.” So, of course, Chuck said: “Well, then think about it…” … and as usual, Chuck was right.
Eventually I found a solution to this problem, …

So fingerprinting works like a document ID that tells documents apart. How did he end up solving the duplicate-results problem, then? If it is not too involved, Hưng, could you explain a little?
Hi Sơn,
The main idea is to find a “distance function” between documents, for example by treating each document as a vector of keywords with their frequencies. Statistical sampling can be used to estimate the “distance” in a time-efficient way. According to Andrei, the algorithm he presents in the paper was used successfully in AltaVista. (Andrei moved to Yahoo! in 2005, I believe.)
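To make the idea a bit more concrete, here is a minimal Python sketch. It is not the AltaVista code, only an illustration under a few assumptions of mine: documents are compared as sets of overlapping word sequences ("shingles") instead of keyword-frequency vectors, similarity is the Jaccard resemblance of those sets (so distance is 1 minus resemblance), and the sampling step is done with min-hashing. The function names, the shingle length k, and the number of hash functions are all hypothetical choices made for the example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) of a text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Exact Jaccard resemblance |A ∩ B| / |A ∪ B| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed selects one of several hash functions."""
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(shingle_set, num_hashes=64):
    """Min-hash sketch: for each hash function keep only the smallest value."""
    if not shingle_set:
        raise ValueError("cannot sketch an empty document")
    return [min(_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimated_resemblance(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree.

    For each hash function, the minima coincide with probability equal to
    the Jaccard resemblance, so this fraction is an estimate of it.
    """
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
    doc2 = "the quick brown fox jumps over the lazy cat near the river bank"
    s1, s2 = shingles(doc1), shingles(doc2)
    print("exact resemblance:    ", resemblance(s1, s2))
    print("estimated resemblance:", estimated_resemblance(sketch(s1), sketch(s2)))
```

The point of the sketch is that each document is reduced once to a short, fixed-size list of numbers, and two documents can then be compared by looking only at those lists. Near-duplicates would be flagged when the estimate exceeds some threshold, which is again a tuning choice and not something stated in the post.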