Indri Parameter Files

The Indri applications IndriBuildIndex, IndriDaemon, and IndriRunQuery accept parameters from either the command line or a file. The parameter file uses an XML format, while the command line uses dotted path notation. The top-level element in the parameter file is named parameters.
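
As a rough illustration of how the two notations relate, the sketch below shows a parameter file in which nesting under the top-level parameters element mirrors the dotted command line paths. It uses only the corpus parameters described in the next section, and the paths are placeholders.

```xml
<!-- Each nested element corresponds to one dotted command line option. -->
<parameters>
  <corpus>
    <!-- same as -corpus.path=/path/to/file_or_directory -->
    <path>/path/to/file_or_directory</path>
    <!-- same as -corpus.class=trecweb -->
    <class>trecweb</class>
  </corpus>
</parameters>
```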

Repository construction parameters

memory
an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor, given as a suffix. Valid suffixes (case insensitive) are K = 1000, M = 1000000, and G = 1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
corpus

a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are

path
The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
class
The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
  • html -- web page data.
  • xml -- xml marked up data.
  • trecweb -- TREC web format, eg terabyte track.
  • trectext -- TREC format, eg TREC-3 onward.
  • trecalt -- TREC format, eg TREC-3 onward, with only the TEXT field included.
  • warc -- WARC (Web ARChive) format, such as can be output by the Heritrix web crawler.
  • warcchar -- WARC (Web ARChive) format, such as can be output by the Heritrix web crawler. Tokenizes individual characters, enabling indexing of unsegmented text.
  • doc -- Microsoft Word format (Windows platform only).
  • ppt -- Microsoft PowerPoint format (Windows platform only).
  • pdf -- Adobe PDF format.
  • txt -- Plain text format.
annotations
The pathname of the file containing offset annotations for the documents specified in path. Specified as <corpus><annotations>/path/to/file</annotations></corpus> in the parameter file and as -corpus.annotations=/path/to/file on the command line.
metadata

The pathname of the file or directory containing offset metadata for the documents specified in path. Specified as <corpus><metadata>/path/to/file</metadata></corpus> in the parameter file and as -corpus.metadata=/path/to/file on the command line.

Combining the first two of these elements, the parameter file would contain:
<corpus>
  <path>/path/to/file_or_directory</path>
  <class>trecweb</class>
</corpus>
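
Because corpus can be specified multiple times, a parameter file that indexes two collections might look like the following sketch. The second path, the choice of trectext for it, and the annotations file are placeholders chosen only for illustration.

```xml
<parameters>
  <corpus>
    <path>/path/to/file_or_directory</path>
    <class>trecweb</class>
  </corpus>
  <!-- a second, hypothetical corpus with its own class and annotations -->
  <corpus>
    <path>/path/to/other_directory</path>
    <class>trectext</class>
    <annotations>/path/to/file</annotations>
  </corpus>
</parameters>
```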

metadata

a complex element containing one or more entries specifying the metadata fields to index, eg title, headline. There are three options:

  1. field -- Make the named field available for retrieval as metadata. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.

  2. forward -- Make the named field available for retrieval as metadata and build a lookup table to make retrieving the value more efficient. Specified as <metadata><forward>fieldname</forward></metadata> in the parameter file and as -metadata.forward=fieldname on the command line. The external document id field "docno" is automatically added as a forward metadata field.

  3. backward -- Make the named field available for retrieval as metadata and build a lookup table for inverse lookup of documents based on the value of the field. Specified as <metadata><backward>fieldname</backward></metadata> in the parameter file and as -metadata.backward=fieldname on the command line. The external document id field "docno" is automatically added as a backward metadata field.
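
The three options can be combined in one metadata element. The following is a minimal sketch, using the example field names title and headline from above; listing title as both forward and backward is an assumption, modeled on how docno is handled automatically.

```xml
<parameters>
  <metadata>
    <!-- retrievable as metadata, no lookup table -->
    <field>headline</field>
    <!-- retrievable, with a table for efficient value lookup -->
    <forward>title</forward>
    <!-- retrievable, with a table for finding documents by value -->
    <backward>title</backward>
  </metadata>
</parameters>
```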

field

a complex element specifying the fields to index as data, eg TITLE. This parameter can appear multiple times in a parameter file. If provided on the command line, only the first field specified will be indexed. The subelements are:

name
the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
numeric
the symbol true if the field contains numeric data, otherwise the symbol false, specified as <field><numeric>true</numeric></field> in the parameter file and as -field.numeric=true on the command line. This is an optional parameter, defaulting to false. Note that 0 can be used for false and 1 can be used for true.
parserName
the name of the parser to use to convert a numeric field to an unsigned integer value. The default is NumericFieldAnnotator. If numeric field data is provided via offset annotations, you should use the value OffsetAnnotationAnnotator. If the field contains a formatted date (see Date Fields) you should use the value DateFieldAnnotator.
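
As a sketch of how these subelements fit together, the parameter file below declares one text field and one numeric field holding a formatted date. The field names title and date are hypothetical.

```xml
<parameters>
  <!-- a text field indexed as data; numeric defaults to false -->
  <field>
    <name>title</name>
  </field>
  <!-- a numeric field containing a formatted date -->
  <field>
    <name>date</name>
    <numeric>true</numeric>
    <parserName>DateFieldAnnotator</parserName>
  </field>
</parameters>
```
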
stemmer

a complex element specifying the stemming algorithm to use in the subelement name. Valid options are:

  • porter -- Porter stemmer
  • krovetz -- Krovetz stemmer
  • arabic_stop -- Larkey stemmer, remove stopwords
  • arabic_norm2 -- Larkey stemmer, table normalization
  • arabic_norm2_stop -- Larkey stemmer, table normalization with stopping
  • arabic_light10 -- Larkey stemmer, light9 plus ll prefix
  • arabic_light10_stop -- Larkey stemmer, light10 and remove stop words

Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.

normalize
true to perform case normalization when indexing, false to index with mixed case. The default is true.
stopper
a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
offsetannotationhint
An optional parameter that gives the indexer a hint to speed up indexing of offset annotations when using offset annotation files as specified in the <corpus> parameter. Valid values are "unordered" and "ordered". An "unordered" hint (the default) informs the indexer that the document IDs of the annotations are not necessarily in the same order as the documents in the corpus. The indexer then adjusts its internal memory allocations to pre-allocate enough memory before reading in the annotations file. If you are absolutely certain that the annotations in the offset annotation file are in exactly the same order as the documents, you can use the "ordered" hint. This tells the indexer not to read the entire file at once, but to read the offset annotations file as needed, loading only the annotations for the document currently being indexed.
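
The text processing parameters above might be combined as in the following sketch. The stopwords are arbitrary examples, and placing offsetannotationhint directly under parameters is an assumption based on its position in the list above. The fragment is not a complete IndriBuildIndex parameter file.

```xml
<parameters>
  <stemmer>
    <name>krovetz</name>
  </stemmer>
  <normalize>true</normalize>
  <!-- one word element per stopword -->
  <stopper>
    <word>a</word>
    <word>the</word>
    <word>of</word>
  </stopper>
  <!-- assumed top-level; only safe when annotation order matches document order -->
  <offsetannotationhint>ordered</offsetannotationhint>
</parameters>
```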

QueryEnvironment Parameters

Retrieval Parameters

index
path to an Indri Repository. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line. This element can be specified multiple times to combine Repositories.
server
hostname of a host running an Indri server (IndriDaemon). Specified as <server>hostname</server> in the parameter file and as -server=hostname on the command line. The hostname can include an optional port number to connect to, using the form hostname:portnum. This element can be specified multiple times to combine servers.
count
an integer value specifying the maximum number of results to return for a given query. Specified as <count>number</count> in the parameter file and as -count=number on the command line.
query

An Indri query language query to run. This element can be specified multiple times. The query element may take numerous optional parameters. With none of the optional parameters, the query text can be the body of the element, eg:

<query>#combine(query terms)</query>

The optional parameters are:

type
one of indri, to use the indri query language, or nexi to use the nexi query language. The default is indri. This element may appear 0 or 1 times.
number
The query number or identifier. This may be a non-numeric symbol. The default is to number the queries in the parameters in order, starting with 0. This element may appear 0 or 1 times.
text
The query text, eg, "#combine(query terms)". This element may appear 0 or 1 times and must be used if any of the other parameters are supplied.
workingSetDocno
The external document id of a document to add to the working set for the query. This element may appear 0 or more times. When specified, query evaluation is restricted to the document ids specified.
feedbackDocno
The external document id of a document to add to the relevance feedback set for the query. This element may appear 0 or more times. When specified, query expansion is performed using only the document ids specified. It is still necessary to specify a non-zero value for the fbDocs parameter when specifying feedbackDocno elements.
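
A query that uses the optional parameters might look like the sketch below. The query identifier q1 and the document ids are hypothetical, and the fbDocs value is an arbitrary non-zero choice.

```xml
<parameters>
  <index>/path/to/repository</index>
  <count>10</count>
  <!-- must be non-zero when feedbackDocno elements are given -->
  <fbDocs>2</fbDocs>
  <query>
    <type>indri</type>
    <!-- the identifier may be a non-numeric symbol -->
    <number>q1</number>
    <!-- text is required once any other query parameter is supplied -->
    <text>#combine(query terms)</text>
    <!-- evaluation is restricted to these documents -->
    <workingSetDocno>DOC-0001</workingSetDocno>
    <workingSetDocno>DOC-0002</workingSetDocno>
    <workingSetDocno>DOC-0003</workingSetDocno>
    <!-- expansion uses only these documents -->
    <feedbackDocno>DOC-0001</feedbackDocno>
    <feedbackDocno>DOC-0002</feedbackDocno>
  </query>
</parameters>
```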

rule

specifies the smoothing rule (TermScoreFunction) to apply. Format of the rule is:

( key ":" value ) [ "," key ":" value ]*

Here's an example rule in command line format:

-rule=method:linear,collectionLambda:0.2,field:title

and in parameter file format:
<rule>method:linear,collectionLambda:0.2,field:title</rule>

This corresponds to Jelinek-Mercer smoothing with background lambda equal to 0.2, only for items in a title field.

If nothing is listed for a key, all values are assumed. So, a rule that does not specify a field matches all fields. This makes -rule=method:linear,collectionLambda:0.2 a valid rule.

Valid keys:

method
smoothing method (text)
field
field to apply this rule to
operator
type of item in query to apply to { term, window }

Valid methods:

dirichlet
(also 'd', 'dir') (default mu=2500)
jelinek-mercer
(also 'jm', 'linear') (default collectionLambda=0.4, documentLambda=0.0). collectionLambda can also be written as just "lambda"; either will work.
twostage
(also 'two-stage', 'two') (default mu=2500, lambda=0.4)

If the rule doesn't parse correctly, the default is Dirichlet, mu=2500.
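
The sketch below writes out one rule per method in parameter file format. It assumes that mu and lambda can be given as keys in the same way as collectionLambda in the example above. The rules are listed together only for comparison; the text above does not say how several rule elements interact, so treat each one as a separate alternative.

```xml
<parameters>
  <!-- Dirichlet with the documented default, applied to all fields -->
  <rule>method:dirichlet,mu:2500</rule>
  <!-- Jelinek-Mercer restricted to the title field (the example above) -->
  <rule>method:linear,collectionLambda:0.2,field:title</rule>
  <!-- two-stage with documented defaults, applied only to terms -->
  <rule>method:twostage,mu:2500,lambda:0.4,operator:term</rule>
</parameters>
```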

stopper
a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
maxWildcardTerms
(optional) An integer specifying the maximum number of wildcard terms that can be generated for a synonym list for this query or set of queries. If this limit is reached for a wildcard term, an exception will be thrown. If this parameter is not specified, a default of 100 will be used.

Baseline (non-LM) retrieval

baseline

Specifies the baseline (non-language modeling) retrieval method to apply. This enables running baseline experiments on collections too large for the Lemur RetMethod API. When running a baseline experiment, the queries may not contain any Indri query language operators; they must contain only terms.

Format of the parameter value:

(tfidf|okapi) [ "," key ":" value ]*

Here's an example in command line format:

-baseline=tfidf,k1:1.0,b:0.3

and in parameter file format:
<baseline>tfidf,k1:1.0,b:0.3</baseline>

Methods:

tfidf

Performs retrieval via tf.idf scoring as implemented in lemur::retrieval::TFIDFRetMethod using BM25TF term weighting. Pseudo-relevance feedback may be performed via the parameters below.

Parameters (optional):

k1
k1 parameter for term weight (default 1.2)
b
b parameter for term weight (default 0.75)

okapi

Performs retrieval via Okapi scoring as implemented in lemur::retrieval::OkapiRetMethod. Pseudo-relevance feedback may not be performed with this baseline method.

Parameters (optional):

k1
k1 parameter for term weight (default 1.2)
b
b parameter for term weight (default 0.75)
k3
k3 parameter for query term weight (default 7)
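
For comparison with the tfidf example, an Okapi baseline run could be configured as in this sketch, which simply spells out the documented default values. The query contains only terms, as required for baseline runs.

```xml
<parameters>
  <index>/path/to/repository</index>
  <!-- okapi with its documented defaults written out explicitly -->
  <baseline>okapi,k1:1.2,b:0.75,k3:7</baseline>
  <!-- plain terms only, no query language operators -->
  <query>query terms</query>
</parameters>
```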

Formatting Parameters

queryOffset
an integer value specifying one less than the starting query number, eg 150 for TREC formatted output. Specified as <queryOffset>number</queryOffset> in the parameter file and as -queryOffset=number on the command line.
runID
a string specifying the id for a query run, used in TREC scorable output. Specified as <runID>someID</runID> in the parameter file and as -runID=someID on the command line.
trecFormat
the symbol true to produce TREC scorable output, otherwise the symbol false. Specified as <trecFormat>true</trecFormat> in the parameter file and as -trecFormat=true on the command line. Note that 0 can be used for false, and 1 can be used for true.
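
Taken together, the formatting parameters for a TREC-style run might be set as in the following sketch. The values are the placeholder examples used above.

```xml
<parameters>
  <trecFormat>true</trecFormat>
  <runID>someID</runID>
  <!-- one less than the starting query number, so the first query is 151 -->
  <queryOffset>150</queryOffset>
</parameters>
```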

Pseudo-Relevance Feedback Parameters

fbDocs
an integer specifying the number of documents to use for feedback. Specified as <fbDocs>number</fbDocs> in the parameter file and as -fbDocs=number on the command line.
fbTerms
an integer specifying the number of terms to use for feedback. Specified as <fbTerms>number</fbTerms> in the parameter file and as -fbTerms=number on the command line.
fbMu
a floating point value specifying the value of mu to use for feedback. Specified as <fbMu>number</fbMu> in the parameter file and as -fbMu=number on the command line.
fbOrigWeight
a floating point value in the range [0.0..1.0] specifying the weight for the original query in the expanded query. Specified as <fbOrigWeight>number</fbOrigWeight> in the parameter file and as -fbOrigWeight=number on the command line.
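
A parameter file fragment enabling pseudo-relevance feedback could look like the sketch below. All four values are arbitrary illustrations, not recommended settings.

```xml
<parameters>
  <!-- number of documents used for feedback -->
  <fbDocs>10</fbDocs>
  <!-- number of terms used for feedback -->
  <fbTerms>20</fbTerms>
  <fbMu>0.0</fbMu>
  <!-- weight of the original query, within [0.0..1.0] -->
  <fbOrigWeight>0.5</fbOrigWeight>
</parameters>
```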

IndriDaemon Parameters

index
path to the Indri Repository to act as server for. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line.
port
an integer value specifying the port number to use. Specified as <port>number</port> in the parameter file and as -port=number on the command line.
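
A parameter file for IndriDaemon therefore needs only these two elements, as in the sketch below. The port number 15000 is a hypothetical choice.

```xml
<!-- parameter file for IndriDaemon -->
<parameters>
  <index>/path/to/repository</index>
  <!-- hypothetical port number -->
  <port>15000</port>
</parameters>
```

A client such as IndriRunQuery would then reach this daemon through the server parameter described earlier, using the hostname:portnum form, for example <server>hostname:15000</server>.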

Generated on 4 Jun 2014 for Indri by  doxygen 1.6.1