Comment: Added Solr-specific indexing and searching rules.

...

This functionality allows documents to be searched for by giving keywords or phrases. Searching can be based on category, type, format and search fields.

2.1  Quick Search

            Select Doc Category : Work

...

   System shows records with any field matching one or more keywords.

2.2  Advanced Search

            Select Doc Category : Work

...

   Search is performed based on the conditions entered by the user.

2.3 Solr-specific search rules

Solr allows us to specify how the input data is indexed and searched for.

Data of type String is indexed and stored verbatim.

Data of type Text can be analyzed at indexing time and at searching time as follows:

2.3.1 Tokenization

It is the process of splitting the input text into tokens that are indexed and searched for.

The Whitespace Tokenizer is used. It is a simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation is kept as part of the tokens.

Input: "To be, or what?"

Output: "To", "be,", "or", "what?"
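
The behaviour described above can be illustrated outside Solr with a minimal Python sketch. The function name whitespace_tokenize is hypothetical and is not part of Solr; it only mimics splitting on whitespace.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace; punctuation stays attached to tokens."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return text.split()


tokens = whitespace_tokenize("To be, or what?")
print(tokens)  # ['To', 'be,', 'or', 'what?']
```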

2.3.2 Synonym Filtering

It is the process of synonym mapping. Each token is looked up in the list of synonyms and, if a match is found, the synonym is emitted in place of the token. The position values of the new tokens are set so that they all occur at the same position as the original token.

It is applied only to the text of the search parameters.

Synonyms are specified in a text file named 'synonyms.txt'.

The following are currently defined in this file:

GB,gib,gigabyte,gigabytes

MB,mib,megabyte,megabytes

Television, Televisions, TV, TVs

# Synonym mappings can be used for spelling correction too

pixima => pixma
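
As a rough illustration of how such a file could be interpreted, the Python sketch below parses the entries listed above and emits (term, position) pairs. It is not Solr code. It assumes that comma-separated groups expand to every term in the group and that matching ignores case; neither setting is stated on this page. The names parse_synonyms and apply_synonyms are hypothetical.

```python
SYNONYMS_TXT = """\
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
# Synonym mappings can be used for spelling correction too
pixima => pixma
"""


def parse_synonyms(text):
    """Build a lookup from lowercased term to its list of replacement terms."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: left-hand terms are replaced by right-hand terms.
            left, right = line.split("=>", 1)
            sources = [s.strip() for s in left.split(",") if s.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalence group: assumed to expand to every term in the group.
            targets = [t.strip() for t in line.split(",") if t.strip()]
            sources = targets
        for source in sources:
            table[source.lower()] = targets  # assumption: case-insensitive lookup
    return table


def apply_synonyms(tokens, table):
    """Return (term, position) pairs; synonyms share the original token's position."""
    out = []
    for position, token in enumerate(tokens):
        for term in table.get(token.lower(), [token]):
            out.append((term, position))
    return out


table = parse_synonyms(SYNONYMS_TXT)
print(apply_synonyms(["pixima", "TV"], table))
# [('pixma', 0), ('Television', 1), ('Televisions', 1), ('TV', 1), ('TVs', 1)]
```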

2.3.3 Stop word filtering

It is the process of discarding tokens that are on the given stop words list.

The file named stopwords.txt specifies such words. Currently they are:


an and are as at

be but by

for

if in into is it

no not

of on or

s such

t that the their then there these they this to

was will with

2.3.4 Word delimiter (splitting)

It is the process of splitting tokens at word delimiters. The rules for determining delimiters are as follows:

  • A change in case within a word: "CamelCase" -> "Camel", "Case"
  • A transition from alpha to numeric characters or vice versa: "Gonzo5000" -> "Gonzo", "5000"; "4500XL" -> "4500", "XL"
  • Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
  • A trailing "'s" is removed: "O'Reilly's" -> "O", "Reilly"
  • Any leading or trailing delimiters are discarded: "-hot-spot" -> "hot", "spot"
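
The rules above can be approximated with regular expressions. The following Python sketch is not the Solr filter itself. It handles ASCII letters and digits only, and the name split_on_delimiters is hypothetical.

```python
import re


def split_on_delimiters(token):
    """Split one token into sub-words following the rules listed above."""
    # Rule: a trailing "'s" is removed.
    token = re.sub(r"'[sS]$", "", token)
    parts = []
    # Rule: non-alphanumeric characters are delimiters and are discarded.
    # Leading or trailing delimiters yield empty chunks, which produce no parts.
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # Rules: split on case changes and on letter/digit transitions.
        # A part is a digit run, an uppercase run, or an optionally capitalised
        # lowercase run.
        parts.extend(re.findall(r"[0-9]+|[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return parts


for example in ["CamelCase", "Gonzo5000", "4500XL", "hot-spot", "O'Reilly's", "-hot-spot"]:
    print(example, "->", split_on_delimiters(example))
```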

2.3.5 Lower case conversion

Any uppercase letters in a token are converted to the equivalent lowercase letters. All other characters are left unchanged.

2.3.6 Keyword protection

It is the process of protecting words from being modified by stemmers.

The protected word list may be specified in a file named protwords.txt.

No such words are specified at this time.

2.3.7 Stemming

It is the process of reducing the different forms of a word, such as "walks", "walking" and "walked", to their elemental root, e.g. "walk".

The Porter Stemming Algorithm is used. It is only appropriate for English-language text.
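
The interaction between keyword protection and stemming can be sketched in Python as follows. The suffix stripping here is a deliberately naive stand-in and is not the Porter algorithm; it only reproduces the "walk" example. The names naive_stem and stem_tokens are hypothetical, and the empty protected set mirrors the statement that no protected words are currently defined.

```python
PROTECTED_WORDS = frozenset()  # no protected words are specified at this time


def naive_stem(token):
    """Strip a few common English suffixes (toy stand-in, not Porter)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so that short words are left intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_tokens(tokens, protected=PROTECTED_WORDS):
    """Stem every token except those on the protected word list."""
    return [t if t in protected else naive_stem(t) for t in tokens]


print(stem_tokens(["walks", "walking", "walked"]))      # ['walk', 'walk', 'walk']
print(stem_tokens(["walking"], protected={"walking"}))  # ['walking']
```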

2.3.8 Remove duplicates

It is the process of removing duplicate tokens from the stream. Tokens are considered to be duplicates if they have the same text and position values.
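
A minimal Python sketch of this rule, representing each token as a (text, position) pair, is given below. The pair representation and the name remove_duplicates are assumptions made for illustration. Duplicates of this kind can arise, for example, when stemming maps two synonyms at the same position to the same root.

```python
def remove_duplicates(tokens):
    """Keep the first occurrence of each (text, position) pair, preserving order."""
    seen = set()
    out = []
    for text, position in tokens:
        key = (text, position)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out


stream = [("walk", 0), ("walk", 0), ("walk", 1)]
print(remove_duplicates(stream))  # [('walk', 0), ('walk', 1)]
```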

3. Display of Search results

...

Records with empty or null values will appear at the top of the search results.

3.4 Solr-specific sorting features

...

  • If the data contains consecutive spaces, they are treated as a single space.
  • Data beginning with a numeral is arranged ahead of any data beginning with a letter.
  • Data consisting of a single word precedes any data that begins with the same word followed by other words.
  • Data beginning with an article (a, an or the) is displayed in ascending order.
  • Numbers at the beginning of or within the data are arranged in arithmetical order and sorted in ascending order.
  • Punctuation in numbers, as in other text, has no arrangement value (and the data is sorted in ascending order).
  • Decimal fractions are arranged according to their arithmetical value (and sorted in ascending order).
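
One possible reading of the rules listed above can be expressed as a sort key. The Python sketch below is an illustration only and does not show how the Solr field is actually configured. It assumes case-insensitive comparison, treats a comma between digits as punctuation with no value, and does not model any special handling of articles. The name sort_key is hypothetical.

```python
import re


def sort_key(value):
    """Build a comparison key: numbers compare by value and sort before words."""
    text = re.sub(r"\s+", " ", (value or "").strip()).lower()  # collapse spaces
    text = re.sub(r"(?<=\d),(?=\d)", "", text)  # "1,000" is read as 1000
    key = []
    # Each piece is either a (decimal) number or a run of letters;
    # all other punctuation is skipped and so has no arrangement value.
    for number, word in re.findall(r"(\d+(?:\.\d+)?)|([a-z]+)", text):
        if number:
            key.append((0, float(number), ""))
        else:
            key.append((1, 0.0, word))
    return key


titles = ["Report 10", "Report 9", "report", "2 files", "Report   9.5"]
print(sorted(titles, key=sort_key))
# ['2 files', 'report', 'Report 9', 'Report   9.5', 'Report 10']
```

With this key an empty or null value produces an empty list, which sorts ahead of every other value.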

...