
Lucene Indexing Fundamentals
  - An index contains a collection of documents
  - A document is a collection of fields
  - A field is a named collection of terms
  - A term is a pair of strings: the field name and the text within that field
  - An inverted index gives efficient term-based search

Indexing Strategies
  - Primarily requires a sorted list of term postings
  - Possible data structures
      - B-tree based: O(log_b N). Requires random access, hence frequent disk seeks
      - Merge based: O(log_k N). Sorts in memory, then merges the sorted runs/segments. Creates new segments for newly added documents and merges existing segments
      - Trie based: O(term length). Requires random access, hence frequent disk seeks
  - Simultaneous search/indexing

Index Compression
  - Speeds up the overall process because it results in less I/O

Searching
  - Needs an inverted index for efficient search
  - Merges the search results for the different query terms

Requirements of a Search Index
  - Minimize disk access at the expense of CPU
  - Indexing
      - Support for dynamic indexing
      - Relatively fast indexing without compromising search time

What is a Search Index
  - A search index is a variant of a database
  - Similarities with a traditional RDBMS
      - Needs fast lookup for keys
      - Bulk of the data resides on secondary storage
      - Minimize secondary storage access
  - Differences (additions) compared to an RDBMS
      - No definitive score; returns only the top k ranked hits
      - Relaxed transactional requirements
      - Two-level processing required (details later)

I understand the concepts of VSM, TF-IDF and cosine similarity, but after reading the Lucene website I am still confused about how Lucene builds the VSM and calculates similarity for each query.

As I understand it, the VSM is a matrix filled with the TF-IDF value of each term. When I tried building a VSM from a set of documents, it took a long time with this tool.

This is not really a coding question. Intuitively, building a VSM matrix over a large data set is time consuming, but that does not seem to be the case for Lucene.

In addition, even with a prebuilt VSM, finding the most similar document, which is basically a similarity calculation between two documents or between a query and a document, is often time consuming (assume millions of documents, because one has to compute similarity to everyone else), but Lucene seems to do it really fast. I guess that is also related to how it builds the VSM internally. If possible, can someone explain this as well?

So please help me understand two points:

1. How does Lucene build the VSM so fast that it can be used for calculating similarity?
2. How come Lucene's similarity calculation among millions of documents is so fast?

I'd appreciate it if a real example is given.

"As I understood, VSM is a matrix where the values of TFIDF of each term are filled."

This is more properly called a term-document matrix. The VSM is more of a conceptual framework from which this matrix and the notion of cosine similarity arise.

Lucene stores term frequencies and document frequencies that can be used to get tf-idf weights for document and query terms. It uses those to compute a variant of cosine similarity outlined here. So the rows of the term-document matrix are represented in the index, which is conceptually a hash table mapping terms to (document, tf) pairs, plus a separate table mapping terms to their df value.

"One has to compute similarity to everyone else"

That premise does not hold. If you review the textbook definition of cosine similarity, you'll find that it's the sum of products of corresponding term weights in a query and a document, normalized. Terms that occur in the document but not the query, or vice versa, contribute nothing to that sum. It follows that, to compute cosine similarity, you only need to consider those documents that have some term in common with the query.

That's how Lucene gets its speed: it does what amounts to a hash table lookup for the query terms and computes similarities only for the documents that have a non-zero intersection with the query's bag of words.
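The index layout discussed here, terms mapped to (document, tf) pairs with a separate term-to-df table, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_index` and `tfidf` are hypothetical names, tokenization is a plain whitespace split, and the weight is the textbook tf * log(N / df) rather than Lucene's actual scoring formula or on-disk format.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Return (postings, df) for a list of document strings.

    postings maps term -> [(doc_id, tf), ...]; df maps term -> document frequency.
    Hypothetical helper, not a Lucene API.
    """
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # One pass per document: count terms, then append to each term's postings list.
        for term, tf in Counter(text.lower().split()).items():
            postings[term].append((doc_id, tf))
    # df is simply the length of each postings list.
    df = {term: len(plist) for term, plist in postings.items()}
    return dict(postings), df


def tfidf(tf, df_t, n_docs):
    """Textbook tf-idf weight (an assumption; Lucene uses its own variant)."""
    return tf * math.log(n_docs / df_t)


docs = ["the quick brown fox", "the lazy dog", "the quick dog jumps"]
postings, df = build_index(docs)
```

Only the non-zero cells of the term-document matrix are ever stored, which is why building the index takes a single pass over the text instead of filling a dense matrix.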

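The claim that only documents sharing a term with the query need to be scored can be illustrated with a small term-at-a-time sketch. Everything here is an assumption made for illustration: `build_index` and `search` are hypothetical names, the idf is a smoothed log(1 + N / df), and document norms are precomputed at index time. It is not how Lucene implements scoring.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build postings, idf values and per-document vector norms (hypothetical helper)."""
    postings = defaultdict(list)  # term -> [(doc_id, tf), ...]
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(text.lower().split()).items():
            postings[term].append((doc_id, tf))
    n = len(docs)
    idf = {t: math.log(1 + n / len(p)) for t, p in postings.items()}
    squares = defaultdict(float)
    for t, plist in postings.items():
        for doc_id, tf in plist:
            squares[doc_id] += (tf * idf[t]) ** 2
    norms = {d: math.sqrt(s) for d, s in squares.items()}
    return dict(postings), idf, norms


def search(query, index, top_k=3):
    """Return [(doc_id, cosine)] for documents sharing at least one term with the query."""
    postings, idf, norms = index
    q_weights = {t: tf * idf[t]
                 for t, tf in Counter(query.lower().split()).items()
                 if t in postings}
    scores = defaultdict(float)
    for t, wq in q_weights.items():
        # Only the postings lists of query terms are visited.
        for doc_id, tf in postings[t]:
            scores[doc_id] += wq * tf * idf[t]
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    ranked = sorted(((s / (q_norm * norms[d]), d) for d, s in scores.items()),
                    reverse=True)
    return [(d, round(s, 3)) for s, d in ranked[:top_k]]


docs = ["the quick brown fox", "the lazy dog",
        "the quick dog jumps", "completely unrelated text"]
index = build_index(docs)
hits = search("quick fox", index)
```

With this toy collection, the query "quick fox" touches only the postings for "quick" and "fox", so the second and fourth documents are never scored at all. The work is proportional to the lengths of those postings lists, not to the size of the collection.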