Saturday, February 2, 2013

Lucene Search API : Introduction to Lucene - Part 1

In Part 1 of the Lucene Search API series, let's look into the following:

  1. What is Lucene
  2. What is Indexing
  3. How to do Indexing

If you are interested in Lucene searching, please look into my other post on Lucene Searching.


What is Lucene :

Lucene is an open source, highly scalable text search engine Java library available from the Apache Software Foundation. You can use Lucene in both commercial and open source applications. Lucene's powerful APIs focus mainly on text indexing and searching.

It can be used to build search capabilities for applications such as e-mail clients, mailing lists, Web searches, database search, etc.

Lucene Core provides Java-based indexing and search technology, as well as spellchecking, hit highlighting, and advanced analysis/tokenization capabilities.
Web sites like Wikipedia, TheServerSide, jGuru, the Eclipse help system, and LinkedIn have been powered by Lucene.
Basically, developing search applications with Lucene Core involves three steps:
  • Preparing the index file of the database
  • Searching on the index file
  • Displaying results

Preparing the Index file for the database

What is Indexing :
Indexing is the process of converting text data into a format that facilitates rapid searching, at the cost of slower writes and increased storage space. Lucene stores the input data in a data structure called an inverted index, which is kept on the file system or in memory as a set of index files.
Indexing and analysis are the two steps involved in creating an index file.

Indexing : Lucene adds the analyzed text data to the inverted index described above. The index lets users perform fast keyword look-ups and find the documents that match a given query. Before the text data is added to the index, it is processed by an analyzer.
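To make the inverted index concrete, here is a toy sketch in plain Java. This is not Lucene's actual implementation, just the idea: each term maps to the set of documents containing it, so a keyword look-up becomes a single map access.

import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyInvertedIndex {
    public static void main(String[] args) {
        String[] docs = {
                "Lucene is a search library",      // doc 0
                "Lucene builds an inverted index"  // doc 1
        };

        // term -> ids of the documents that contain the term
        Map<String, Set<Integer>> index = new TreeMap<String, Set<Integer>>();
        for (int id = 0; id < docs.length; id++) {
            for (String token : docs[id].toLowerCase().split("\\s+")) {
                if (!index.containsKey(token)) {
                    index.put(token, new TreeSet<Integer>());
                }
                index.get(token).add(id);
            }
        }

        // Look-up cost no longer depends on scanning every document.
        System.out.println(index.get("lucene")); // prints [0, 1]
    }
}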

Analysis : Analysis converts the text data into the fundamental unit of searching, which is called a term. During analysis, the text data goes through multiple operations: extracting the words, removing common words, ignoring punctuation, reducing words to their root form, changing words to lowercase, etc. Analysis happens just before indexing and query parsing. Analysis converts text data into tokens, and these tokens are added as terms in the Lucene index.
Note : As analysis removes words before indexing, it decreases index size, but it can have a negative effect on precision during query processing. You can have more control over the analysis process by creating custom analyzers using the basic building blocks provided by Lucene.

The following are the fundamental Lucene classes for indexing text: Directory, IndexWriter, Analyzer, Document, and Field.
Directory : An abstract class that represents the location where index files are stored.
FSDirectory : Base class for Directory implementations that store index files in the file system.
RAMDirectory : A memory-resident Directory implementation. This class is not intended to work with huge indexes.
IndexWriter :  The IndexWriter class creates a new index and adds documents to an existing index. It provides methods to add, delete, or update documents in the index.
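Putting these classes together, the following is a minimal indexing sketch. It assumes Lucene 3.6 (a current release at the time of this post); the index path and the field names (id, title, content) are illustrative choices, not fixed API names.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        // Where the index files will live (a RAMDirectory could be
        // used instead for a small, in-memory index).
        Directory indexDir = FSDirectory.open(new File("/tmp/lucene-index"));

        // The analyzer tokenizes field values before they are indexed.
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(indexDir, config);

        // One Document per record; each Field is a named value.
        Document doc = new Document();
        doc.add(new Field("id", "1",
                Field.Store.YES, Field.Index.NOT_ANALYZED)); // unique key, indexed verbatim
        doc.add(new Field("title", "Introduction to Lucene",
                Field.Store.YES, Field.Index.ANALYZED));     // stored and searchable
        doc.add(new Field("content", "Lucene is a text search engine library.",
                Field.Store.NO, Field.Index.ANALYZED));      // searchable but not stored
        writer.addDocument(doc);

        writer.close(); // commits the changes and releases the write lock
        indexDir.close();
    }
}

Running this once produces a set of index files under /tmp/lucene-index that the searcher in the Lucene Searching post can open.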

Before text is indexed, it is passed through an Analyzer. Analyzer classes are in charge of extracting indexable tokens out of the text and eliminating the rest. Lucene comes with a few different Analyzer implementations. Some of them skip stop words (frequently used words that don't help distinguish one document from another, such as a, an, the, in, and on), while others convert all tokens to lowercase letters so that searches are not case sensitive.

Analyzer : As discussed, analyzers are responsible for preprocessing the text data and converting it into the tokens stored in the index. IndexWriter accepts an analyzer, which is used to tokenize data before it is indexed. To index text properly, you should use an analyzer that's appropriate for the language of the text that needs to be indexed.
Ex: WhitespaceAnalyzer, SimpleAnalyzer, StandardAnalyzer.
The default analyzers work well for the English language. There are several other analyzers in the Lucene sandbox, including those for Chinese, Japanese, and Korean.
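The differences are easiest to see by running the same sentence through each analyzer. Here is a small sketch (again assuming Lucene 3.6; the field name "content" is arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerComparison {
    public static void main(String[] args) throws Exception {
        String text = "The Quick-Brown Fox jumped over the LAZY dog!";
        printTokens(new WhitespaceAnalyzer(Version.LUCENE_36), text);
        printTokens(new SimpleAnalyzer(Version.LUCENE_36), text);
        printTokens(new StandardAnalyzer(Version.LUCENE_36), text);
    }

    static void printTokens(Analyzer analyzer, String text) throws Exception {
        TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        StringBuilder line = new StringBuilder(analyzer.getClass().getSimpleName() + ": ");
        while (stream.incrementToken()) {
            line.append('[').append(term).append("] ");
        }
        stream.end();
        stream.close();
        System.out.println(line);
    }
}

WhitespaceAnalyzer keeps each whitespace-separated chunk as-is (case and punctuation included), SimpleAnalyzer lowercases and splits on non-letters, and StandardAnalyzer additionally drops English stop words such as "the".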


Other Important Terms of Lucene

An index consists of a set of Documents, and each Document consists of one or more Fields.
Each Field has a name and a value. You can think of a Document as a row in an RDBMS, and of Fields as the columns in that row.

Field : A field is a section of a Document. Each field has two parts, a name and a value. Fields are optionally stored in the index, so that they may be returned with hits on the document.

Document : Documents are the unit of indexing and search. A Document is a set of fields. Each document should typically contain one or more stored fields which uniquely identify it.
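That unique stored field is what makes it possible to update or delete one specific document later. A brief sketch (same Lucene 3.6 assumption, and the "id" field name is the hypothetical unique key from the indexing example above):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class IndexMaintenance {
    // Replace whatever document currently carries the given id.
    // Internally this is a delete-by-term followed by an add.
    static void replaceById(IndexWriter writer, String id, Document newDoc)
            throws Exception {
        writer.updateDocument(new Term("id", id), newDoc);
    }

    // Remove every document whose "id" field holds the given value.
    static void deleteById(IndexWriter writer, String id) throws Exception {
        writer.deleteDocuments(new Term("id", id));
        writer.commit(); // make the change visible to new searchers
    }
}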

Searching on Index files is explained here
