A brief foray into vectorial semantics

An Article by James Somers

One of the best (and easiest) ways to start making sense of a document is to highlight its “important” words: the words that appear within that document more often than chance would predict. That’s the idea behind Amazon.com’s “Statistically Improbable Phrases”:

Amazon.com’s Statistically Improbable Phrases, or “SIPs”, are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
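To make that description concrete, here is a minimal sketch of the idea in Python. Amazon does not say how it actually scores phrases, so everything specific here is an assumption made for illustration: a "phrase" is taken to be a pair of adjacent words, the score is the phrase's count in the book times the log of how over-represented it is relative to the whole collection, and the names `bigrams` and `improbable_phrases` are made up.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Lowercase the text and return its adjacent word pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def improbable_phrases(book, corpus, top_n=3):
    """Rank the bigrams of `book` by how over-represented they are
    relative to `corpus`, a list of texts that must include `book`."""
    book_counts = Counter(bigrams(book))
    corpus_counts = Counter()
    for text in corpus:
        corpus_counts.update(bigrams(text))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        book_rate = count / book_total
        corpus_rate = corpus_counts[phrase] / corpus_total
        # Assumed scoring rule (not Amazon's): frequent AND distinctive wins.
        scores[phrase] = count * math.log(book_rate / corpus_rate)

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [" ".join(p) for p in ranked[:top_n]]


# A toy "library" of three very short books.
books = [
    "the white whale swam and the white whale dived under the ship",
    "the ship sailed and the ship docked at the port",
    "the port was busy and the port was loud",
]

print(improbable_phrases(books[0], books, top_n=2))
# → ['the white', 'white whale']
```

Common pairs like "and the" show up in every book, so their ratio falls below one and their score goes negative. Pairs that are both repeated and unique to a single book float to the top.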