Wednesday, September 5, 2012

Getting Started with Apache Lucene 4 with Maven: Hello World (Part 1: Analyzers)

Apache Lucene is a powerful search-engine library and an excellent tool when you need to search unstructured text inside your application. The Lucene website has a nice beginner tutorial here; however, it does not go into detail, and beginners may get lost in terminology that the tutorial uses in its code without explaining it.

So shall we start right off the bat?

In this article, we will develop a simple Lucene 4 application with Apache Maven 2.

In all search projects, the challenge is the text itself, because text is considered unstructured data. To tackle this problem, Analyzers give some structure to the text that has to be searched. They achieve this by tokenizing (dividing the text into units such as words, character sequences, phrases, email addresses, ...), stemming (reducing each word token to its root) and stopping (dropping stop words, i.e. words that occur so often in the text that they carry little value) on the token stream.
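To make the three steps concrete, here is a minimal plain-Java sketch of such a pipeline. It does not use the Lucene API at all: the class name AnalysisSketch, the stop-word list and the toy stemmer (which only strips a trailing "s") are assumptions made up for illustration, and real analyzers are considerably more careful.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalysisSketch {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "on", "of", "in"));

    // Toy stemmer: strips a single trailing "s". This is NOT the Porter algorithm.
    static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        // Tokenizing: split on anything that is not a letter or a digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);
            // Stopping: drop very frequent words.
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            // Stemming: reduce the token to a (very rough) root form.
            tokens.add(stem(token));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The birds on a wire")); // prints [bird, wire]
    }
}
```

Only the two content words survive, and "birds" is reduced to "bird", so a query for either form would hit the same token.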

Lucene comes with a set of different Analyzers to be used in different situations. Here, we explain a set of them given the following text:


First: Theory
1. WhitespaceAnalyzer
The simplest Analyzer in the package. It splits the text at whitespace only, so each token starts after one whitespace character and ends before the next; no stemming or stopping is performed:
Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
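The effect of whitespace-only tokenizing can be imitated in a few lines of plain Java. This is only a sketch of the behaviour described above, not Lucene code; the class name WhitespaceSketch is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSketch {
    // Splits at runs of whitespace and leaves every token untouched.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Bird on a wire, OK?")) {
            System.out.println(token);
        }
    }
}
```

Note that case and punctuation are kept: "wire," and "OK?" come out as tokens exactly as typed, which is why this Analyzer is rarely enough on its own.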

2. SimpleAnalyzer

This Analyzer splits the text at every non-letter character (using the isLetter method of the java.lang.Character class) and applies a lower-case filter to the resulting tokens. It will have problems with far-eastern languages such as Chinese.
Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
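The letter-based rule can be sketched in plain Java as follows. Again, this only mimics the description above with Character.isLetter and is not the Lucene implementation; LetterSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSketch {
    // A token is a maximal run of letters; everything else is a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c)); // lower-case filter
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lucene-4 is FUN!")); // prints [lucene, is, fun]
    }
}
```

Digits and punctuation disappear entirely, so "Lucene-4" is indexed as just "lucene". In a script written without spaces between words, a whole run of letters would come out as one token, which hints at why such languages are a problem for this approach.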

3. StandardAnalyzer

The most popular Analyzer for English text. Its tokenizer (StandardTokenizer) creates tokens with the following rules, which suit most European languages:
  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token. 
It also includes a set of basic stop words, which can be extended for different cases and contexts.

4. PorterAnalyzer

This Analyzer is based on Martin Porter's stemming algorithm. It applies all three steps to the text: stemming, lower-casing and stop-word filtering. The approach is highly useful, but note that it can create meaningless tokens such as "tire" for the word "tired". This is not a problem, because Analyzers are only used to fetch documents and do not affect the text that the user reads afterwards.

5. StandardBgramAnalyzer

The term bgram refers to a sequence of 2 adjacent units taken from a string; as the example below shows, the units here are words. This Analyzer produces a larger index than the other Analyzers. It filters the text to lower case and, in order to keep stop words in the index, joins the first and second part of each token with an underscore character "_". The first and last word of a string are indexed twice in this approach. The following are the tokens for "Bird on a wire":

bird
bird_on
on_a
a_wire
wire


It should be noted that the idea of forming bgrams is based on finite-state machines. Lastly, the big advantage of this approach is that it can match queries that include stop words more precisely than the other Analyzers.
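The token list for "Bird on a wire" can be reproduced with a short plain-Java sketch. The rule used here (first word alone, every adjacent pair joined by "_", last word alone) is inferred from the example above and is an assumption about how the Analyzer behaves; BgramSketch is a made-up name, not a Lucene class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BgramSketch {
    static List<String> bgrams(String text) {
        List<String> tokens = new ArrayList<String>();
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (words[0].isEmpty()) {
            return tokens; // empty or blank input
        }
        tokens.add(words[0]); // first word on its own
        for (int i = 0; i + 1 < words.length; i++) {
            tokens.add(words[i] + "_" + words[i + 1]); // adjacent pair
        }
        if (words.length > 1) {
            tokens.add(words[words.length - 1]); // last word on its own
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints bird, bird_on, on_a, a_wire, wire (one per line).
        for (String token : bgrams("Bird on a wire")) {
            System.out.println(token);
        }
    }
}
```

Because "on" and "a" are kept inside the joined tokens, a query such as "on a wire" still has something specific to match, which is where the extra precision comes from.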


Application

Since we are only getting started, we will just set up pom.xml and build the application's dependencies; we continue in the next blog post.

First, create the project from a Maven archetype:
mvn archetype:generate -DgroupId=se.findwise -DartifactId=my-lucene-app -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false


Second, edit pom.xml:

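<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>se.findwise</groupId>
  <artifactId>my-lucene-app</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>my-lucene-app</name>
  <url>http://maven.apache.org</url>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.0.2</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>exec-maven-plugin</artifactId>
        <configuration>
          <executable>java</executable>
          <arguments>
            <argument>-Xms512m</argument>
            <argument>-Xmx512m</argument>
            <argument>-XX:NewRatio=3</argument>
            <argument>-XX:+PrintGCTimeStamps</argument>
            <argument>-XX:+PrintGCDetails</argument>
            <argument>-Xloggc:gc.log</argument>
            <argument>-classpath</argument>
            <classpath/>
            <argument>se.findwise.App</argument>
          </arguments>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>4.0.0-BETA</version>
    </dependency>
  </dependencies>
</project>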

Have you noticed that the exec-maven-plugin section of the pom.xml contains some additional arguments (lines 22-33)? Yes, check out this awesome article. We will use this in the next posts to make running our application much easier than before.
Now just run the following and we are finished for today! ;)

mvn install
or
mvn package