Search Configuration

Search Configuration

1 XML Configuration

JCR index configuration. You can find this file here: .../portal/WEB-INF/conf/jcr/repository-configuration.xml

<repository-service default-repository="db1">
  <repositories>
    <repository name="db1" system-workspace="ws" default-workspace="ws">
       ....
      <workspaces>
        <workspace name="ws">
       ....
          <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
            <properties>
              <property name="index-dir" value="${java.io.tmpdir}/temp/index/db1/ws" />
              <property name="synonymprovider-class" value="org.exoplatform.services.jcr.impl.core.query.lucene.PropertiesSynonymProvider" />
              <property name="synonymprovider-config-path" value="/synonyms.properties" />
              <property name="indexing-config-path" value="/indexing-configuration.xml" />
              <property name="query-class" value="org.exoplatform.services.jcr.impl.core.query.QueryImpl" />
            </properties>
          </query-handler>
        ... 
        </workspace>
     </workspaces>
    </repository>        
  </repositories>
</repository-service>

2 Configuration parameters

ParameterDefaultDescriptionSince
index-dirnoneThe location of the index directory. This parameter is mandatory. Up to 1.9 this parameter called "indexDir"1.0
use-compoundfiletrueAdvises lucene to use compound files for the index files.1.9
min-merge-docs100Minimum number of nodes in an index until segments are merged.1.9
volatile-idle-time3Idle time in seconds until the volatile index part is moved to a persistent index even though minMergeDocs is not reached.1.9
max-merge-docsInteger.MAX_VALUEMaximum number of nodes in segments that will be merged. The default value changed in JCR 1.9 to Integer.MAX_VALUE.1.9
merge-factor10Determines how often segment indices are merged.1.9
max-field-length10000The number of words that are fulltext indexed at most per property.1.9
cache-size1000Size of the document number cache. This cache maps uuids to lucene document numbers1.9
force-consistencycheckfalseRuns a consistency check on every startup. If false, a consistency check is only performed when the search index detects a prior forced shutdown.1.9
auto-repairtrueErrors detected by a consistency check are automatically repaired. If false, errors are only written to the log.1.9
query-classQueryImplClass name that implements the javax.jcr.query.Query interface.This class must also extend from the class: org.exoplatform.services.jcr.impl.core.query.AbstractQueryImpl.1.9
document-ordertrueIf true and the query does not contain an 'order by' clause, result nodes will be in document order. For better performance when queries return a lot of nodes set to 'false'.1.9
result-fetch-sizeInteger.MAX_VALUEThe number of results when a query is executed. Default value: Integer.MAX_VALUE (-> all).1.9
excerptprovider-classDefaultXMLExcerptThe name of the class that implements org.exoplatform.services.jcr.impl.core.query.lucene.ExcerptProvider and should be used for the rep:excerpt() function in a query.1.9
support-highlightingfalseIf set to true additional information is stored in the index to support highlighting using the rep:excerpt() function.1.9
synonymprovider-classnoneThe name of a class that implements org.exoplatform.services.jcr.impl.core.query.lucene.SynonymProvider. The default value is null (-> not set).1.9
synonymprovider-config-pathnoneThe path to the synonym provider configuration file. This path interpreted relative to the path parameter. If there is a path element inside the SearchIndex element, then this path is interpreted relative to the root path of the path. Whether this parameter is mandatory depends on the synonym provider implementation. The default value is null (-> not set).1.9
indexing-configuration-pathnoneThe path to the indexing configuration file.1.9
indexing-configuration-classIndexingConfigurationImplThe name of the class that implements org.exoplatform.services.jcr.impl.core.query.lucene.IndexingConfiguration.1.9
force-consistencycheckfalseIf set to true a consistency check is performed depending on the parameter forceConsistencyCheck. If set to false no consistency check is performed on startup, even if a redo log had been applied.1.9
spellchecker-classnoneThe name of a class that implements org.exoplatform.services.jcr.impl.core.query.lucene.SpellChecker.1.9
spellchecker-more-populartrueIf set true - spellchecker return only the suggest words that are as frequent or more frequent than the checked word. If set false, spellchecker return null (if checked word exit in dictionary), or spellchecker will return most close suggest word.1.10
spellchecker-min-distance0.55fMinimal distance between checked word and proposed suggest word.1.10
errorlog-size50(Kb)The default size of error log file in Kb.1.9
max-volatile-timeis (-1) that means no maximal timeout setMax age of volatile. It is guaranteed that in any case volatile index will be flush to FS not later then in max-volatile-time seconds.1.12
analyzerorg.apache.lucene.analysis.standard.StandardAnalyzerClass name of a lucene analyzer to use for fulltext indexing of text.1.12

3 Global Search Index

3.1 Global Search Index Configuration

The global search index is configured in the above-mentioned configuration file (portal/WEB-INF/conf/jcr/repository-configuration.xml) in the tag "query-handler".

<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">

In fact when using Lucene you always should use the same analyzer for indexing and for querying - otherwise the results are unpredictable. You don't have to worry about this, eXo JCR does this for you automatically. If you don't like the StandardAnalyzer configured by default just replace it by your own.

If you don't have a handy QueryHandler you will learn how create a customized Handler in 5 minutes.

3.2 Customized Search Indexes and Analyzers

By default Exo JCR uses the Lucene standard Analyzer to index contents. This analyzer uses some standard filters in the method that analyzes the content:

public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(reader, replaceInvalidAcronym);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    return result;
  }

  • The first one (StandardFilter) removes 's (as 's in "Peter's") from the end of words and removes dots from acronyms.
  • The second one (LowerCaseFilter) normalizes token text to lower case.
  • The last one (StopFilter) removes stop words from a token stream. The stop set is defined in the analyzer.
For specific cases, you may wish to use additional filters like ISOLatin1AccentFilter, which replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalents.

In order to use a different filter, you have to create a new analyzer, and a new search index to use the analyzer. You put it in a jar, which is deployed with your application.

3.2.1 Create the filter

The ISOLatin1AccentFilter is not present in the current Lucene version used by Exo. You can use the attached file. You can also create your own filter, the relevant method is

public final Token next(final Token reusableToken) throws java.io.IOException
which defines how chars are read and used by the filter.

3.2.2 Create the analyzer

The analyzer have to extends org.apache.lucene.analysis.standard.StandardAnalyzer, and overload the method

public TokenStream tokenStream(String fieldName, Reader reader)
to put your own filters. You can have a glance at the example analyzer attached to this article.

3.2.3 Create the search index

Now, we have the analyzer, we have to write the SearchIndex, which will use the analyzer. Your have to extends org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex. You have to write the constructor, to set the right analyzer, and the method

public Analyzer getAnalyzer() {
    return MyAnalyzer;
  }
to return your analyzer. You can see the attached SearchIndex.

Since 1.12 version we can set Analyzer directly in configuration. So, creation new SearchIndex only for new Analyzer is redundant.

3.2.4 Configure your application to use your SearchIndex

In portal/WEB-INF/conf/jcr/repository-configuration.xml, you have to replace each

<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
by your own class
<query-handler class="mypackage.indexation.MySearchIndex">

3.2.5 Configure your application to use your Analyzer

In portal/WEB-INF/conf/jcr/repository-configuration.xml, you have to add parameter "analyzer" to each query-handler config:

<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
   <properties>
      ...
      <property name="analyzer" value="org.exoplatform.services.jcr.impl.core.MyAnalyzer"/>
      ...
   </properties>
</query-handler>

When you start exo, your SearchIndex will start to index contents with the specified filters.

4 Index Adjustments

4.1 IndexingConfiguration

Starting with version 1.9, the default search index implementation in JCR allows you to control which properties of a node are indexed. You also can define different analyzers for different nodes.

The configuration parameter is called indexingConfiguration and per default is not set. This means all properties of a node are indexed.

If you wish to configure the indexing behavior you need to add a parameter to the query-handler element in your configuration file.

<param name="indexing-configuration-path" value="/indexing_configuration.xml"/>

4.2 Index rules

4.2.1 Node Scope Limit

To optimize the index size you can limit the node scope so that only certain properties of a node type are indexed.

With the below configuration only properties named Text are indexed for nodes of type nt:unstructured. This configuration also applies to all nodes whose type extends from nt:unstructured.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>
Please note that you have to declare the namespace prefixes in the configuration element that you are using throughout the XML file!

4.2.2 Index Boost Value

It is also possible to configure a boost value for the nodes that match the index rule. The default boost value is 1.0. Higher boost values (a reasonable range is 1.0 - 5.0) will yield a higher score value and appear as more relevant.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0">
    <property>Text</property>
  </index-rule>
</configuration>

If you do not wish to boost the complete node but only certain properties you can also provide a boost value for the listed properties:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property boost="3.0">Title</property>
    <property boost="1.5">Text</property>
  </index-rule>
</configuration>

4.2.3 Conditional Index Rules

You may also add a condition to the index rule and have multiple rules with the same nodeType. The first index rule that matches will apply and all remaining ones are ignored:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>

In the above example the first rule only applies if the nt:unstructured node has a priority property with a value 'high'. The condition syntax supports only the equals operator and a string literal.

You may also reference properties in the condition that are not on the current node:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="ancestor::*/@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured"
              boost="0.5"
              condition="parent::foo/@priority = 'low'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured"
              boost="1.5"
              condition="bar/@priority = 'medium'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>

The indexing configuration also allows you to specify the type of a node in the condition. Please note however that the type match must be exact. It does not consider sub types of the specified node type.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="element(*, nt:unstructured)/@priority = 'high'">
    <property>Text</property>
  </index-rule>
</configuration>

4.2.4 Exclusion from the Node Scope Index

Per default all configured properties are fulltext indexed if they are of type STRING and included in the node scope index. A node scope search finds normally all nodes of an index. That is, the select jcr:contains(., 'foo') returns all nodes that have a string property containing the word 'foo'. You can exclude explicitly a property from the node scope index:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property nodeScopeIndex="false">Text</property>
  </index-rule>
</configuration>

4.3 Index Aggregates

Sometimes it is useful to include the contents of descendant nodes into a single node to easier search on content that is scattered across multiple nodes.

JCR allows you to define index aggregates based on relative path patterns and primary node types.

The following example creates an index aggregate on nt:file that includes the content of the jcr:content node:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>jcr:content</include>
  </aggregate>
</configuration>

You can also restrict the included nodes to a certain type:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include primaryType="nt:resource">jcr:content</include>
  </aggregate>
</configuration>

You may also use the * to match all child nodes:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">http://wiki.exoplatform.com/xwiki/bin/edit/JCR/Search+Configuration
    <include primaryType="nt:resource">*</include>
  </aggregate>
</configuration>

If you wish to include nodes up to a certain depth below the current node you can add multiple include elements. E.g. the nt:file node may contain a complete XML document under jcr:content:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>*</include>
    <include>*/*</include>
    <include>*/*/*</include>
  </aggregate>
</configuration>

4.4 Property-Level Analyzers

4.4.1 Example

In this configuration section you define how a property has to be analyzed. If there is an analyzer configuration for a property, this analyzer is used for indexing and searching of this property. For example:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <analyzers> 
        <analyzer class="org.apache.lucene.analysis.KeywordAnalyzer">
            <property>mytext</property>
        </analyzer>
        <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer">
            <property>mytext2</property>
        </analyzer>
  </analyzers> 
</configuration>

The configuration above means that the property "mytext" for the entire workspace is indexed (and searched) with the Lucene KeywordAnalyzer, and property "mytext2" with the WhitespaceAnalyzer. Using different analyzers for different languages is particularly useful.

The WhitespaceAnalyzer tokenizes a property, the KeywordAnalyzer takes the property as a whole.

4.4.2 Characteristics of Node Scope Searches

When using analyzers, you may encounter an unexpected behavior when searching within a property compared to searching within a node scope. The reason is that the node scope always uses the global analyzer.

Let's suppose that the property "mytext" contains the text : "testing my analyzers" and that you haven't configured any analyzers for the property "mytext" (and not changed the default analyzer in SearchIndex).

If your query is for example:

xpath = "//*[jcr:contains(mytext,'analyzer')]"
This xpath does not return a hit in the node with the property above and default analyzers.

Also a search on the node scope

xpath = "//*[jcr:contains(.,'analyzer')]"
won't give a hit. Realize, that you can only set specific analyzers on a node property, and that the node scope indexing/analyzing is always done with the globally defined analyzer in the SearchIndex element.

Now, if you change the analyzer used to index the "mytext" property above to

<analyzer class="org.apache.lucene.analysis.Analyzer.GermanAnalyzer">
     <property>mytext</property>
</analyzer>

and you do the same search again, then for

xpath = "//*[jcr:contains(mytext,'analyzer')]"

you would get a hit because of the word stemming (analyzers - analyzer).

The other search,

xpath = "//*[jcr:contains(.,'analyzer')]"
still would not give a result, since the node scope is indexed with the global analyzer, which in this case does not take into account any word stemming.

In conclusion, be aware that when using analyzers for specific properties, you might find a hit in a property for some search text, and you do not find a hit with the same search text in the node scope of the property!

Important note: Both index rules and index aggregates influence how content is indexed in JCR. If you change the configuration the existing content is not automatically re-indexed according to the new rules. You therefore have to manually re-index the content when you change the configuration!

5 Advanced features

Exo JCR supports some advanced features, which are not specified in JSR 170:

  • Get a text excerpt with highlighted words that matches the query: ExcerptProvider.
  • Search for a term and its synonyms: SynonymSearch
  • Search for similar nodes: SimilaritySearch
  • Check spelling of a fulltext query statement: SpellChecker
  • Define index aggregates and rules: IndexingConfiguration (see this article)
Created by Sergey Kabashnyuk on 05/22/2008
Last modified by Sergey Karpenko on 06/15/2010

Products

generated on Thu Sep 02 15:40:35 UTC 2010

eXo Optional Modules

eXo Core Foundations


Copyright (c) 2000-2010. All Rights Reserved - eXo platform SAS
2.4.30451