A brief foray into vectorial semantics

An Article by James Somers

One of the best (and easiest) ways to start making sense of a document is to highlight its “important” words: the words that appear in that document more often than chance would predict. That’s the idea behind Amazon.com’s “Statistically Improbable Phrases”:

Amazon.com’s Statistically Improbable Phrases, or “SIPs”, are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
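
To make the idea concrete, here is a minimal Python sketch of that kind of scoring. It is not Amazon's actual method, which isn't published in any detail. It works on single words rather than phrases, and the function name `improbable_words`, the add-one smoothing, and the toy texts are all assumptions made for illustration. Each word is scored by how often it occurs in the document divided by how often it occurs in the whole collection.

```python
from collections import Counter


def improbable_words(doc_tokens, corpus_tokens, top_n=3, smoothing=1.0):
    """Rank a document's words by how much more often they occur in the
    document than the corpus as a whole would predict.

    Hypothetical helper for illustration only; real SIPs are phrases,
    and this sketch scores single words.
    """
    doc_counts = Counter(doc_tokens)
    corpus_counts = Counter(corpus_tokens)
    doc_total = len(doc_tokens)
    corpus_total = len(corpus_tokens)
    vocab_size = len(set(doc_tokens) | set(corpus_tokens))

    scores = {}
    for word, count in doc_counts.items():
        # Relative frequency of the word inside this document.
        doc_rate = count / doc_total
        # Smoothed relative frequency in the corpus, so that a word
        # missing from the corpus does not cause a division by zero.
        corpus_rate = (corpus_counts[word] + smoothing) / (
            corpus_total + smoothing * vocab_size
        )
        scores[word] = doc_rate / corpus_rate

    # Highest ratio first: these are the most "improbable" words.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy data: one short "book" and a corpus that contains it.
doc = "the whale the sea the whale ahab".split()
corpus = doc + "the cat sat on the mat and the dog saw the cat".split()

ranked = improbable_words(doc, corpus, top_n=4)
```

In this toy example "the" is the most frequent word in the document, but it is just as common everywhere else, so it ends up at the bottom of the ranking while "whale" rises to the top.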