Search Configuration
1 XML Configuration
JCR index configuration. You can find this file here: .../portal/WEB-INF/conf/jcr/repository-configuration.xml

    <repository-service default-repository="db1">
      <repositories>
        <repository name="db1" system-workspace="ws" default-workspace="ws">
          ....
          <workspaces>
            <workspace name="ws">
              ....
              <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
                <properties>
                  <property name="index-dir" value="${java.io.tmpdir}/temp/index/db1/ws" />
                  <property name="synonymprovider-class" value="org.exoplatform.services.jcr.impl.core.query.lucene.PropertiesSynonymProvider" />
                  <property name="synonymprovider-config-path" value="/synonyms.properties" />
                  <property name="indexing-config-path" value="/indexing-configuration.xml" />
                  <property name="query-class" value="org.exoplatform.services.jcr.impl.core.query.QueryImpl" />
                </properties>
              </query-handler>
              ...
            </workspace>
          </workspaces>
        </repository>
      </repositories>
    </repository-service>

2 Configuration parameters
| Parameter | Default | Description | Since |
|---|---|---|---|
| index-dir | none | The location of the index directory. This parameter is mandatory. Up to version 1.9 this parameter was called "indexDir". | 1.0 |
| use-compoundfile | true | Advises Lucene to use compound files for the index files. | 1.9 |
| min-merge-docs | 100 | Minimum number of nodes in an index until segments are merged. | 1.9 |
| volatile-idle-time | 3 | Idle time in seconds until the volatile index part is moved to a persistent index even though min-merge-docs is not reached. | 1.9 |
| max-merge-docs | Integer.MAX_VALUE | Maximum number of nodes in segments that will be merged. The default value changed in JCR 1.9 to Integer.MAX_VALUE. | 1.9 |
| merge-factor | 10 | Determines how often segment indices are merged. | 1.9 |
| max-field-length | 10000 | The number of words that are fulltext indexed at most per property. | 1.9 |
| cache-size | 1000 | Size of the document number cache. This cache maps UUIDs to Lucene document numbers. | 1.9 |
| force-consistencycheck | false | Runs a consistency check on every startup. If false, a consistency check is only performed when the search index detects a prior forced shutdown. | 1.9 |
| auto-repair | true | Errors detected by a consistency check are automatically repaired. If false, errors are only written to the log. | 1.9 |
| query-class | QueryImpl | Name of the class that implements the javax.jcr.query.Query interface. This class must also extend org.exoplatform.services.jcr.impl.core.query.AbstractQueryImpl. | 1.9 |
| document-order | true | If true and the query does not contain an 'order by' clause, result nodes are returned in document order. For better performance with queries that return many nodes, set this to 'false'. | 1.9 |
| result-fetch-size | Integer.MAX_VALUE | The number of results that are initially fetched when a query is executed. Default value: Integer.MAX_VALUE (-> all). | 1.9 |
| excerptprovider-class | DefaultXMLExcerpt | The name of the class that implements org.exoplatform.services.jcr.impl.core.query.lucene.ExcerptProvider and should be used for the rep:excerpt() function in a query. | 1.9 |
| support-highlighting | false | If set to true additional information is stored in the index to support highlighting using the rep:excerpt() function. | 1.9 |
| synonymprovider-class | none | The name of a class that implements org.exoplatform.services.jcr.impl.core.query.lucene.SynonymProvider. The default value is null (-> not set). | 1.9 |
| synonymprovider-config-path | none | The path to the synonym provider configuration file. This path is interpreted relative to the path parameter. If there is a path element inside the SearchIndex element, then this path is interpreted relative to the root path of that element. Whether this parameter is mandatory depends on the synonym provider implementation. The default value is null (-> not set). | 1.9 |
| indexing-configuration-path | none | The path to the indexing configuration file. | 1.9 |
| indexing-configuration-class | IndexingConfigurationImpl | The name of the class that implements org.exoplatform.services.jcr.impl.core.query.lucene.IndexingConfiguration. | 1.9 |
| enable-consistencycheck | false | If set to true, a consistency check is performed depending on the parameter force-consistencycheck. If set to false, no consistency check is performed on startup, even if a redo log has been applied. | 1.9 |
| spellchecker-class | none | The name of a class that implements org.exoplatform.services.jcr.impl.core.query.lucene.SpellChecker. | 1.9 |
| errorlog-size | 50 (KB) | The default size of the error log file in KB. | 1.9 |
| upgrade-index | false | Allows JCR to convert an existing index into the new format. Indexes created before JCR 1.12 do not work with JCR 1.12, so an automatic migration is required: start JCR once with this property set to true. The property can also be set as a system property, for example -Dupgrade-index=true. The old index is then converted to the new format, and the option is no longer needed on the next start. The conversion replaces the old index and cannot be reverted, so back up the index first. (Only for migrations from JCR 1.9 and later.) | 1.12 |
| analyzer | org.apache.lucene.analysis.standard.StandardAnalyzer | Class name of the Lucene analyzer to use for fulltext indexing of text. | 1.12 |
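
As an illustration of how these parameters are set, the fragment below sketches a query-handler with a few of them overridden. The parameter names come from the table above and the index-dir value from the sample configuration in section 1; the other values are arbitrary examples chosen for this sketch, not recommendations.

```xml
<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
  <properties>
    <!-- Mandatory: location of the index directory -->
    <property name="index-dir" value="${java.io.tmpdir}/temp/index/db1/ws" />
    <!-- Illustrative value: index up to 20000 words per property (default is 10000) -->
    <property name="max-field-length" value="20000" />
    <!-- Do not sort results in document order when the query has no 'order by' clause -->
    <property name="document-order" value="false" />
    <!-- Run a consistency check on every startup, but only log the errors it finds -->
    <property name="force-consistencycheck" value="true" />
    <property name="auto-repair" value="false" />
  </properties>
</query-handler>
```

Parameters that are not listed keep the defaults given in the table.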
3 Global Search Index
3.1 Global Search Index Configuration
The global search index is configured in the above-mentioned configuration file (portal/WEB-INF/conf/jcr/repository-configuration.xml) in the tag "query-handler".

    <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">

3.2 Customized Search Indexes and Analyzers
By default eXo JCR uses the Lucene StandardAnalyzer to index content. This analyzer applies some standard filters in the method that analyzes the content:

    public TokenStream tokenStream(String fieldName, Reader reader) {
      StandardTokenizer tokenStream = new StandardTokenizer(reader, replaceInvalidAcronym);
      tokenStream.setMaxTokenLength(maxTokenLength);
      TokenStream result = new StandardFilter(tokenStream);
      result = new LowerCaseFilter(result);
      result = new StopFilter(result, stopSet);
      return result;
    }

- The first one (StandardFilter) removes 's (as 's in "Peter's") from the end of words and removes dots from acronyms.
- The second one (LowerCaseFilter) normalizes token text to lower case.
- The last one (StopFilter) removes stop words from a token stream. The stop set is defined in the analyzer.
3.2.1 Create the filter
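Suppose you want to add another filter to this chain, for example the ISOLatin1AccentFilter, which replaces accented characters of the ISO Latin 1 character set with their unaccented equivalents. The ISOLatin1AccentFilter is not present in the current Lucene version used by eXo. You can use the attached file. You can also create your own filter; the relevant method is

    public final Token next(final Token reusableToken) throws java.io.IOException
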
3.2.2 Create the analyzer
The analyzer must extend org.apache.lucene.analysis.standard.StandardAnalyzer and override the method

    public TokenStream tokenStream(String fieldName, Reader reader)

3.2.3 Create the search index
Now that we have the analyzer, we have to write the SearchIndex that will use it. Your class must extend org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex. Write a constructor that sets the right analyzer, and override the method

    public Analyzer getAnalyzer() {
      // myAnalyzer is the MyAnalyzer instance created in the constructor
      return myAnalyzer;
    }

3.2.4 Configure your application to use your SearchIndex
In portal/WEB-INF/conf/jcr/repository-configuration.xml, replace each

    <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">

with

    <query-handler class="mypackage.indexation.MySearchIndex">

3.2.5 Configure your application to use your Analyzer
In portal/WEB-INF/conf/jcr/repository-configuration.xml, add the parameter "analyzer" to each query-handler configuration:

    <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
      <properties>
        ...
        <property name="analyzer" value="org.exoplatform.services.jcr.impl.core.MyAnalyzer"/>
        ...
      </properties>
    </query-handler>

4 Index Adjustments
4.1 IndexingConfiguration
Starting with version 1.9, the default search index implementation in JCR allows you to control which properties of a node are indexed. You can also define different analyzers for different nodes. The configuration parameter is called indexing-configuration-path and is not set by default, which means that all properties of a node are indexed. To configure the indexing behavior, add this parameter to the properties of the query-handler element in your configuration file:

    <property name="indexing-configuration-path" value="/indexing_configuration.xml"/>

4.2 Index rules
4.2.1 Node Scope Limit
To optimize the index size you can limit the node scope so that only certain properties of a node type are indexed. With the below configuration only properties named Text are indexed for nodes of type nt:unstructured. This configuration also applies to all nodes whose type extends from nt:unstructured.

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <index-rule nodeType="nt:unstructured">
        <property>Text</property>
      </index-rule>
    </configuration>

4.2.2 Index Boost Value
It is also possible to configure a boost value for the nodes that match the index rule. The default boost value is 1.0. Nodes with a higher boost value (a reasonable range is 1.0 - 5.0) get a higher score and appear as more relevant.

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <index-rule nodeType="nt:unstructured" boost="2.0">
        <property>Text</property>
      </index-rule>
    </configuration>

A boost value can also be set on individual properties instead of on the whole node:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <index-rule nodeType="nt:unstructured">
        <property boost="3.0">Title</property>
        <property boost="1.5">Text</property>
      </index-rule>
    </configuration>

4.2.3 Conditional Index Rules
You may also add a condition to the index rule and have multiple rules with the same nodeType. The first index rule that matches will apply and all remaining ones are ignored:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <index-rule nodeType="nt:unstructured" boost="2.0" condition="@priority = 'high'">
        <property>Text</property>
      </index-rule>
      <index-rule nodeType="nt:unstructured">
        <property>Text</property>
      </index-rule>
    </configuration>

A condition may also refer to a property of another node, such as an ancestor, the parent, or a child node:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <index-rule nodeType="nt:unstructured" boost="2.0" condition="ancestor::*/@priority = 'high'">
        <property>Text</property>
      </index-rule>
      <index-rule nodeType="nt:unstructured" boost="0.5" condition="parent::foo/@priority = 'low'">
        <property>Text</property>
      </index-rule>
      <index-rule nodeType="nt:unstructured" boost="1.5" condition="bar/@priority = 'medium'">
        <property>Text</property>
      </index-rule>
      <index-rule nodeType="nt:unstructured">
        <property>Text</property>
      </index-rule>
    </configuration>

The condition can also specify the type of the node it refers to:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <index-rule nodeType="nt:unstructured" boost="2.0" condition="element(*, nt:unstructured)/@priority = 'high'">
        <property>Text</property>
      </index-rule>
    </configuration>

4.2.4 Exclusion from the Node Scope Index
By default, all configured properties are fulltext indexed if they are of type STRING, and they are included in the node scope index. A node scope search normally covers all of these properties; that is, the predicate jcr:contains(., 'foo') returns all nodes that have a string property containing the word 'foo'. You can explicitly exclude a property from the node scope index:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <index-rule nodeType="nt:unstructured">
        <property nodeScopeIndex="false">Text</property>
      </index-rule>
    </configuration>

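
The property attributes described in this section can be combined within one rule. The following is a minimal sketch, not a recommended setup; the property names Title, Text and InternalNote are hypothetical and only serve to show the attributes side by side.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <!-- Boosted property, included in the node scope index (the default) -->
    <property boost="3.0">Title</property>
    <!-- Plain property with the default boost of 1.0 -->
    <property>Text</property>
    <!-- Hypothetical property that is indexed but kept out of the node scope index -->
    <property nodeScopeIndex="false">InternalNote</property>
  </index-rule>
</configuration>
```

With such a rule, a node scope search like jcr:contains(., 'foo') would be expected to ignore InternalNote, while a search on the property itself, jcr:contains(InternalNote, 'foo'), should still consider it.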
4.3 Index Aggregates
Sometimes it is useful to include the contents of descendant nodes in a single node, which makes it easier to search content that is scattered across multiple nodes. JCR allows you to define index aggregates based on relative path patterns and primary node types. The following example creates an index aggregate on nt:file that includes the content of the jcr:content node:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <aggregate primaryType="nt:file">
        <include>jcr:content</include>
      </aggregate>
    </configuration>

The included nodes can be restricted to a certain primary type:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <aggregate primaryType="nt:file">
        <include primaryType="nt:resource">jcr:content</include>
      </aggregate>
    </configuration>

The * wildcard matches all child nodes:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <aggregate primaryType="nt:file">
        <include primaryType="nt:resource">*</include>
      </aggregate>
    </configuration>

To include nodes up to a certain depth below the current node, add multiple include elements:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <aggregate primaryType="nt:file">
        <include>*</include>
        <include>*/*</include>
        <include>*/*/*</include>
      </aggregate>
    </configuration>

4.4 Property-Level Analyzers
4.4.1 Example
In this configuration section you define how a property has to be analyzed. If there is an analyzer configuration for a property, this analyzer is used for indexing and searching of this property. For example:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
    <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <analyzers>
        <analyzer class="org.apache.lucene.analysis.KeywordAnalyzer">
          <property>mytext</property>
        </analyzer>
        <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer">
          <property>mytext2</property>
        </analyzer>
      </analyzers>
    </configuration>

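
The examples in section 4 each show one feature in isolation. Since they all live in the same indexing configuration file, a single file might combine an aggregate, index rules and property-level analyzers. The sketch below only reuses elements from the earlier examples. The order of the top-level elements (aggregate, then index-rule, then analyzers) is an assumption here, so check it against the DTD if the file is rejected.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <!-- Aggregate: make the content of jcr:content searchable on the nt:file node -->
  <aggregate primaryType="nt:file">
    <include primaryType="nt:resource">jcr:content</include>
  </aggregate>
  <!-- Conditional rule first: the first matching rule wins -->
  <index-rule nodeType="nt:unstructured" boost="2.0" condition="@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <!-- Fallback rule for all other nt:unstructured nodes -->
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
  <!-- Property-level analyzer: treat the whole value of mytext as one token -->
  <analyzers>
    <analyzer class="org.apache.lucene.analysis.KeywordAnalyzer">
      <property>mytext</property>
    </analyzer>
  </analyzers>
</configuration>
```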
4.4.2 Characteristics of Node Scope Searches
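When using analyzers, you may encounter unexpected behavior when searching within a property compared to searching within the node scope. The reason is that the node scope always uses the global analyzer.

Suppose that the property "mytext" contains the text "testing my analyzers" and that you have not configured any analyzer for the property "mytext" (and have not changed the default analyzer in SearchIndex). The following query on the property does not return a hit, because the default analyzer does not apply stemming, so the indexed term is "analyzers" and not "analyzer":

    xpath = "//*[jcr:contains(mytext,'analyzer')]"

A search on the node scope does not return a hit either:

    xpath = "//*[jcr:contains(.,'analyzer')]"

Now suppose you change the analyzer used for the property "mytext" to a stemming analyzer:

    <analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
      <property>mytext</property>
    </analyzer>

If you run the same searches again, the property query

    xpath = "//*[jcr:contains(mytext,'analyzer')]"

returns a hit because of word stemming ("analyzers" is reduced to "analyzer"), whereas the node scope query

    xpath = "//*[jcr:contains(.,'analyzer')]"

still returns no result, because the node scope is indexed with the global analyzer, which in this case does not apply any stemming. When you use analyzers for specific properties, a search text may therefore match in a property but not in the node scope of that property.

5 Advanced features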
eXo JCR supports some advanced features, which are not specified in JSR 170:
- Get a text excerpt with highlighted words that match the query: ExcerptProvider
- Search for a term and its synonyms: SynonymSearch
- Search for similar nodes: SimilaritySearch
- Check spelling of a fulltext query statement: SpellChecker
- Define index aggregates and rules: IndexingConfiguration (see section 4, Index Adjustments)
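
Several of these features are switched on through the query-handler parameters listed in section 2. The fragment below is a sketch under stated assumptions: the parameter names and the PropertiesSynonymProvider class come from this page, while mypackage.MySpellChecker is a hypothetical placeholder for a class implementing the SpellChecker interface.

```xml
<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
  <properties>
    <property name="index-dir" value="${java.io.tmpdir}/temp/index/db1/ws" />
    <!-- Store the additional index information needed by rep:excerpt() -->
    <property name="support-highlighting" value="true" />
    <!-- Synonym search: provider class and its configuration file -->
    <property name="synonymprovider-class" value="org.exoplatform.services.jcr.impl.core.query.lucene.PropertiesSynonymProvider" />
    <property name="synonymprovider-config-path" value="/synonyms.properties" />
    <!-- Spell checking: hypothetical class name, replace it with a real SpellChecker implementation -->
    <property name="spellchecker-class" value="mypackage.MySpellChecker" />
  </properties>
</query-handler>
```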
on 15/10/2009 at 13:01