Indri Repository Builder

This application builds an Indri Repository for a collection of documents. Parameter formats for all Indri applications are also described in IndriParameters.html

Repository construction parameters

memory
an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor by adding a suffix. Valid values are (case insensitive) K = 1000, M = 1000000, G = 1000000000. So 100M would be equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
corpus

a complex element containing parameters related to a corpus. This element can be specified multiple times. The parameters are

path
The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
class
The FileClassEnviroment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. The known classes are:
  • html -- web page data.
  • xml -- xml marked up data.
  • trecweb -- TREC web format, eg terabyte track.
  • trectext -- TREC format, eg TREC-3 onward.
  • trecalt -- TREC format, eg TREC-3 onward, with only the TEXT field included.
  • warc -- WARC (Web ARChive) format, such as can be output by the heritrix webcrawler.
  • warcchar -- WARC (Web ARChive) format, such as can be output by the heritrix webcrawler. Tokenizes individual characters, enabling indexing of unsgemented text.
  • doc -- Microsoft Word format (windows platform only).
  • ppt -- Microsoft Powerpoint format (windows platform only).
  • pdf -- Adobe PDF format.
  • txt -- Plain text format.
annotations
The pathname of the file containing offset annotations for the documents specified in path. Specified as <corpus><annotations>/path/to/file</annotations></corpus> in the parameter file and as -corpus.annotations=/path/to/file on the command line.
metadata

The pathname of the file or directory containing offset metadata for the documents specified in path. Specified as <corpus><metadata>/path/to/file</metadata></corpus> in the parameter file and as -corpus.metadata=/path/to/file on the command line.

Combining the first two of these elements, the parameter file would contain:
<corpus>
  <path>/path/to/file_or_directory</path>
  <class>trecweb</class>
</corpus>

metadata

a complex element containing one or more entries specifying the metadata fields to index, eg title, headline. There are three options

  1. field -- Make the named field available for retrieval as metadata. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as metadata.field=fieldname on the command line.

  2. forward -- Make the named field available for retrieval as metadata and build a lookup table to make retrieving the value more efficient. Specified as <metadata><forward>fieldname</forward></metadata> in the parameter file and as metadata.forward=fieldname on the command line.

  3. backward -- Make the named field available for retrieval as metadata and build a lookup table for inverse lookup of documents based on the value of the field. Specified as <metadata><backward>fieldname</backward></metadata> in the parameter file and as metadata.backward=fieldname on the command line.

field

a complex element specifying the fields to index as data, eg TITLE. This parameter can appear multiple times in a parameter file. If provided on the command line, only the first field specified will be indexed. The subelements are:

name
the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
numeric
the symbol true if the field contains numeric data, otherwise the symbol false, specified as <field><numeric>true</numeric></field> in the parameter file and as -field.numeric=true on the command line. This is an optional parameter, defaulting to false. Note that 0 can be used for false and 1 can be used for true.
parserName
the name of the parser to use to convert a numeric field to an unsigned integer value. The default is NumericFieldAnnotator. If numeric field data is provided via offset annotations, you should use the value OffsetAnnotationAnnotator. If the field contains a formatted date (see Date Fields) you should use the value DateFieldAnnotator.
stemmer

a complex element specifying the stemming algorithm to use in the subelement name. Valid options are:

  • porter -- Porter stemmer
  • krovetz -- Krovetz stemmer
  • arabic_stop -- Larkey stemmer, remove stopwords
  • arabic_norm2 -- Larkey stemmer, table normalization
  • arabic_norm2_stop -- Larkey stemmer, table normalization with stopping
  • arabic_light10 -- Larkey stemmer, light9 plus ll prefix
  • arabic_light10_stop -- Larkey stemmer, light10 and remove stop words

Specified as <stemmer><name>stemmername</name></stemmer> and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.

normalize
true to perform case normalization when indexing, false to index with mixed case. Default true
stopper
a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
offsetannotationhint
An optional parameter to provide a hint to the indexer to speed up indexing of offset annotations when using offset annotation files as specified in the <corpus> parameter. Valid values here are "unordered" and "ordered". An "unordered" hint (the default) will inform the indexer that the document IDs of the annotations are not necessarily in the same order as the documents in the corpus. The indexer will adjust its internal memory allocations appropriately to pre-allocate enough memory before reading in the annotations file. If you are absolutely certain that the annotations in the offset annotation file are in the exact same order as the documents, then you can use the "ordered" hint. This will tell the indexer to not read in the entire file at once, but rather read in the offset annotations file as needed for only the annotations that are specified for the currently indexing document ID.

Generated on 4 Jun 2014 for Indri by  doxygen 1.6.1