Info: https://jira.kuali.org/browse/OLE-1144 (OLE Search Executive; see also linked tasks and sub-tasks)
DocStore Search
1. Indexed Data
1.1 Searchable fields for all document categories, types and formats
Field Name | Work-Bib-MARC | Work-Bib-DublinQ | Work-Bib-DublinUnQ | Work-Instance-OLEML | Work-Holdings-OLEML | Work-Item-OLEML | Work-License-ONIXPL | Work-License-PDF |
---|---|---|---|---|---|---|---|---|
Title | Yes | Yes | Yes | No | No | No | No | No |
Author | Yes | Yes | Yes | No | No | No | No | No |
Subject | Yes | Yes | Yes | No | No | No | No | No |
Description | Yes | Yes | Yes | No | No | No | No | No |
Date of Publication | Yes | Yes | Yes | No | No | No | No | No |
Format | Yes | Yes | Yes | No | No | No | No | No |
Language | Yes | Yes | Yes | No | No | No | No | No |
Publisher | Yes | Yes | Yes | No | No | No | No | No |
ISSN/ISBN/other (last for dc identifier) | Yes | Yes | Yes | No | No | No | No | No |
Genre (marc genre/dc type) | Yes | Yes | Yes | No | No | No | No | No |
Edition | Yes | No | No | No | No | No | No | No |
Barcode | No | No | No | No | No | Yes | No | No |
Location | Yes | No | No | No | No | Yes | No | No |
Source | No | No | No | Yes | No | No | No | No |
Record Type | No | No | No | No | Yes | No | No | No |
Encoding Level | No | No | No | No | Yes | No | No | No |
Receipt Status | No | No | No | No | Yes | No | No | No |
Acquisition Method | No | No | No | No | Yes | No | No | No |
Policy Type | No | No | No | No | Yes | No | No | No |
Copies Reported | No | No | No | No | Yes | No | No | No |
Item Type | No | No | No | No | No | Yes | No | No |
Location Status | No | No | No | No | No | Yes | No | No |
Shelving Scheme | No | No | No | No | No | Yes | No | No |
Shelving Order | No | No | No | No | No | Yes | No | No |
Address | No | No | No | No | No | Yes | No | No |
Copy Number | No | No | No | No | No | Yes | No | No |
Volume Number | No | No | No | No | No | Yes | No | No |
Contract Number | No | No | No | No | No | No | Yes | No |
Licensee | No | No | No | No | No | No | Yes | No |
Licensor | No | No | No | No | No | No | Yes | No |
Status | No | No | No | No | No | No | Yes | No |
Method | No | No | No | No | No | No | Yes | No |
Type | No | No | No | No | No | No | Yes | No |
Name | No | No | No | No | No | No | No | Yes |
File Name | No | No | No | No | No | No | No | Yes |
Date Uploaded | No | No | No | No | No | No | No | Yes |
Owner | No | No | No | No | No | No | No | Yes |
Notes | No | No | No | No | No | No | No | Yes |
1.2 Facet fields for all document categories, types and formats
Facet Field | Work-Bib-MARC | Work-Bib-DublinQ | Work-Bib-DublinUnQ | Work-Instance-OLEML | Work-Holdings-OLEML | Work-Item-OLEML | Work-License-ONIXPL | Work-License-PDF |
---|---|---|---|---|---|---|---|---|
Subject | Yes | Yes | Yes | No | No | No | No | No |
Author | Yes | Yes | Yes | No | No | No | No | No |
Format | Yes | Yes | Yes | No | No | No | No | No |
Language | Yes | Yes | Yes | No | No | No | No | No |
Publication Date | Yes | Yes | Yes | No | No | No | No | No |
Genre | Yes | Yes | Yes | No | No | No | No | No |
1.3 Field definitions for Work-Bib-MARC documents
Field | Data fields for search (MV indicates multi-valued) | Data fields for short display | Data fields for detailed display | Data fields for Facet |
---|---|---|---|---|
ISSN | 022 - a,z (MV) | first value | all values | same as search field |
ISBN | 020 - a,z (MV) | first value | all values | same as search field |
Author/Creator | For each 100, 110: every subf except $6 (gives us 2 values for every tag). Also every subf except $t for: 111, 700, 710, 711, 800, 810, 811, 400, 410, 411. Ref: http://www.loc.gov/marc/bibliographic/ (100 - Main Entry - Personal Name (NR); 110 - Main Entry - Corporate Name (NR); 111 - Main Entry - Meeting Name (NR); 700 - Added Entry - Personal Name (R); 710 - Added Entry - Corporate Name (R); 711 - Added Entry - Meeting Name (R); 800 - Series Added Entry - Personal Name (R); 811 - Series Added Entry - Meeting Name (R); 400 - Series Statement/Added Entry - Personal Name (R); 410 - Series Statement/Added Entry - Corporate Name (R); 411 - Series Statement/Added Entry - Meeting Name (R)) | first non-empty value of 100$a or 110$a. Show blank if both are missing. | All non-empty indexed values | All non-empty indexed values |
Title | 245 - all subf exc. c and 6. Also 130, 240, 246, 247, 440, 490, 730, 740, 773, 774, 780, 785, 830, 840 (MV) | 245$a and 245$b | all values | |
Place of Publication | 260 - a (MV) | first value | all values | same as search field |
Description | 505 - a (MV). KG/LR: UPenn just included the MARC 505 in its Description index (which is distinct from its Format/Description index). Include just 505 $a. The SMEs may want additional 5xx fields in the Description index, but 505 should be fine for November. | first value | all values | same as search field |
Subject | 600, 610, 611, 630, 650, 651, 653, 69X: every subf exc. $6, $2, $= across these tags. No hyphens for X00, X10, and X11 fields (600, 610, 611, 700, 710, 711, etc.), but hyphens for other fields. | first non-empty value of 600$a, $d etc 610$a etc | | |
Date of Publication | <marc:controlfield tag="008">[Date 1 in the 7-10 positions]</marc:controlfield> LR: Can also include 260 $c. (260 $c is the same as the value in the control field. Use this if the control field does not have a pub date value.) (MV) | first value | all values | |
Edition | 250 - a,b (MV) | first value | all values | same as search field |
Form/Genre | 655 - a, v (MV) | first value | all values | same as search field |
Language | <marc:controlfield tag="008">[language code in the 35-37 positions]</marc:controlfield> LR: Add 546 $a (MV). Language Codes (iso-639-3) | all values | all values | same as search field |
Format | | first value | all values | same as search field |
Test Template for Work-Bib-Marc document fields
1.4 Format field definitions for Work-Bib-MARC documents
Label | Marc Fields | Comments |
---|---|---|
Manuscript | Has any holdings with "manuscripts" in location_name (gets only this value) | LR: MARC XML does not have location_name so this is irrelevant to the IU data that OLE has for November. Manuscript could be determined by the Leader 06/07. 06 values a, f, t equal manuscripts on their own. 07 values c and d seem to imply manuscript/archival collections/series. We should check with the SMEs on this one. |
Microformat | Has 245 $h containing "micro" OR has any holdings with "micro" in location_name OR call_number starts "micro" (gets only this value) | LR: the 245 $h "micro" will work for the IU OLE MARCXML we have, but the remaining text is specific to UPenn. |
Archive | Has any holdings with "archive" in location_name (gets only this value) | LR: This is specific to UPenn. We may need to talk to IU about whether they include Archive descriptions in their MARC records and how they designate them as such. |
Thesis/Dissertation | bib_format is 'tm' AND has a 502 field | LR: UPenn's bib_format seems to be a combination of the data values found in the 06/07 Leader fields. For example, t in the 06 is Manuscript and m in the 07 is Monograph/Item and together they equal a Thesis/Dissertation. |
Conference/Event | Has a 111 or 711 field [LR: Include 611 or 811] | |
Book | bib_format is 'aa', 'am' or 'ac' or 'tm'; exclude $h [micro*] and $k [kit] | LR: the 2 characters are from the Leader 06/07; the $h and $k exclusions are 245 subfields |
Sound recording | bib_format is 'im' or 'jm' or 'jc' or 'jd' or 'js' | LR: the 2 characters are from the Leader 06/07 |
Musical score | bib_format is cm, dm, ca, cb, cd or cs | LR: the 2 characters are from the Leader 06/07 |
Map/Atlas | bib_format is 'e*' or 'fm' | LR: the 2 characters are from the Leader 06/07 |
Video | bib_format is 'gm' AND 007/0 = v | LR: the 2 characters are from the Leader 06/07 |
Projected graphic | bib_format is 'gm' AND 007/0 = g | LR: 007 is a controlled field that indicates the format/physical description at general level and then associated subfields are more specific. |
Journal/Periodical | bib_format is 'as' or 'gs' | LR: 007 is a controlled field that indicates the format/physical description at general level and then associated subfields are more specific. |
Image | bib_format is 'km' | LR: the 2 characters are from the Leader 06/07 |
Datafile | bib_format is 'mm' | LR: the 2 characters are from the Leader 06/07 |
Newspaper | bib_format is 'as' AND (008/21 = 'n' OR 008/22 = 'e' ) | LR: the 2 characters are from the Leader 06/07. The 008 controlled field in those 2 positions provides the "form". |
3D object | bib_format is 'r*' | LR: the single character maps to the 06 position in the leader. |
Database/Website | bib_format is '*i' | LR: the single character maps to the 06 position in the leader. |
Government document | bib_format is NOT c*, d*, i*, j* AND ( (008/28 = f, i, o and 260$b not 'press') ) | LR: the single character maps to the 06 position in the leader. 008 is a fixed length controlled field and 260 $b is a type of publication. |
Other | any bib_format not caught above | LR: Presumably relates to other 06/07 Leader data values not represented. |
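The rules above can be read as an ordered series of checks on the two-character bib_format (Leader positions 06 and 07). The following Python sketch is illustrative only and is not the DocStore implementation: the function names are hypothetical, the record is reduced to a leader string, a set of MARC tags and the first character of the 007 field, and it assumes that the first matching rule wins. The Manuscript, Microformat, Archive, Newspaper and Government document rules are left out because they need holdings, 245, 008 or 260 data.

```python
def bib_format(leader: str) -> str:
    """Return Leader/06 (type of record) followed by Leader/07 (bibliographic level)."""
    return leader[6:8]


def classify_format(leader: str, tags: set, f007_0: str = "") -> str:
    """Apply a simplified subset of the format rules, in table order."""
    fmt = bib_format(leader)
    if fmt == "tm" and "502" in tags:
        return "Thesis/Dissertation"
    if tags & {"111", "711"}:
        return "Conference/Event"
    # The 245 $h [micro*] and $k [kit] exclusions are not modelled here.
    if fmt in {"aa", "am", "ac", "tm"}:
        return "Book"
    if fmt in {"im", "jm", "jc", "jd", "js"}:
        return "Sound recording"
    if fmt in {"cm", "dm", "ca", "cb", "cd", "cs"}:
        return "Musical score"
    if fmt.startswith("e") or fmt == "fm":
        return "Map/Atlas"
    if fmt == "gm" and f007_0 == "v":
        return "Video"
    if fmt == "gm" and f007_0 == "g":
        return "Projected graphic"
    if fmt in {"as", "gs"}:
        return "Journal/Periodical"
    if fmt == "km":
        return "Image"
    if fmt == "mm":
        return "Datafile"
    if fmt.startswith("r"):
        return "3D object"
    if fmt.endswith("i"):
        return "Database/Website"
    return "Other"
```

For example, a leader of `00000nam a2200000 a 4500` has 'a' at position 06 and 'm' at position 07, so its bib_format is 'am' and the sketch labels it Book.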
1.5 Field definitions for Work-Bib-DublinCore documents
Field | DC-UnQ fields for Search | DC-Q fields for Search | Data fields for short display | Data fields for detailed display | Data fields for Facet |
---|---|---|---|---|---|
Author | <dc:creator> | <dcvalue element="contributor" qualifier="author"> | first value | All non-empty indexed values | All non-empty indexed values |
Description | <dc:description> (MV). Per Bob P.: Show only <dc:description>. | Per Bob P.: Do not show Abstract description. [show blank] | first value | all values | same as search field |
Language | <dc:language> (MV). Language Codes (iso-639-3) | <dcvalue element="language" qualifier="iso">en_US</dcvalue> Language Codes (iso-639-1-cc) | first value | all values | same as search field |
Subject | <dc:subject> (R) | <dcvalue element="subject" qualifier="none"> | first value | all values | same as search field |
Title | <dc:title> | <dcvalue element="title" qualifier="none"> | first value | all values | same as search field |
Type | <dc:type> (MV) | <dcvalue element="type" qualifier="none"> | first value | all values | same as search field |
Date of Publication | <dc:date> | <dcvalue element="date" qualifier="issued"> | first value | all values | same as search field |
Format | <dc:format> (MV) | | first value | all values | same as search field |
Publisher | <dc:publisher> (MV) | | first value | all values | same as search field |
ISBN/ISSN/other | <dc:identifier>(ISSN)0198-9669</dc:identifier> (MV); <dc:identifier>(ISBN)0306710382</dc:identifier> (MV) | <dcvalue element="identifier" qualifier="isbn">0-918006-48-1</dcvalue> | first value | all values | same as search field |
Test Template for Work-Bib-DublinCore document fields
2. Search
This functionality allows documents to be searched for by giving keywords or phrases. Searching can be based on category, type, format and search fields.
2.1 Quick Search
Select Doc Category : Work
Doc Type : Bibliographic
Doc Format : ALL
Searching with the default condition (clicking the search button without specifying any conditions) returns all the records in the search results page.
Select Doc Category : Work
Doc Type : Bibliographic
Doc Format : MARC
Type one or more keywords in the text box.
The system shows records with any field matching one or more of the keywords.
2.2 Advanced Search
Select Doc Category : Work
Doc Type : Bibliographic
Doc Format : MARC
The drop down for search fields will be populated based on the category selected above.
User specifies a search condition:
Selects a field.
Enters one or more keywords.
Specifies whether the keywords should be searched for as "All of these", "Any of these" or "As a phrase".
"All of these" - Any record with the selected field having all the entered keywords is included in the search results.
"Any of these" - Any record with the selected field having at least one of the entered keywords is included in the search results.
"As a phrase" - Any record with the selected field having all the entered keywords in the same order is included in the search results.
User adds another condition:
Chooses how this condition combines with the previous one: in addition to it ("AND"), as an alternative to it ("OR"), or as an exclusion ("NOT"):
"AND" - the conditions before and after this operator should be satisfied.
"OR" - one of the conditions before and after this operator should be satisfied.
"NOT" - the condition after this operator should not be satisfied.
User repeats previous step as many times as needed using the ADD and DELETE links.
[+]ADD : click on this link to add fields for a new search condition.
[-]Delete : click on this link to delete the last search condition.
Search is performed based on the conditions entered by the user.
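One way to picture how these conditions could translate into a query is the Python sketch below. It is only an illustration under stated assumptions: the helper names are hypothetical, the output uses generic Lucene/Solr query syntax, and the field ids are the "_search" names used in DocumentConfig.xml. It does not claim to reproduce the query that DocStore actually builds.

```python
def condition(field: str, keywords: str, mode: str) -> str:
    """Build one clause for a field; mode is 'all', 'any' or 'phrase'."""
    terms = keywords.split()
    if mode == "phrase":
        phrase = " ".join(terms)
        return f'{field}:"{phrase}"'
    if mode == "all":
        return f"{field}:({' AND '.join(terms)})"
    if mode == "any":
        return f"{field}:({' OR '.join(terms)})"
    raise ValueError(f"unknown mode: {mode}")


def combine(conditions):
    """Join (operator, clause) pairs left to right; the first operator is ignored."""
    query = conditions[0][1]
    for operator, clause in conditions[1:]:
        query = f"({query}) {operator} ({clause})"
    return query


title = condition("Title_search", "open library", "all")
author = condition("Author_search", "smith jones", "any")
query = combine([("", title), ("NOT", author)])
```

Here `query` combines a Title condition that requires both keywords with a NOT condition on Author.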
2.2.1 Using the wildcard char * in searching
An asterisk (*) can be entered as part of a search term (value to be searched for). But it should not be the first character.
For example, rec* will match record, recurring etc. j*n will match John, join etc.
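The intended matching rule can be sketched in Python as follows. This is an illustration of the rule as described here (an asterisk stands for any run of characters and may not lead the term), with hypothetical function names and case-insensitive matching assumed; it is not how Solr evaluates wildcard queries internally.

```python
import re


def wildcard_to_regex(term: str) -> "re.Pattern":
    """Translate a search term containing '*' into a compiled regular expression."""
    if term.startswith("*"):
        raise ValueError("a search term must not start with '*'")
    # Escape the literal pieces and let each '*' match any run of characters.
    pieces = [re.escape(piece) for piece in term.split("*")]
    return re.compile(".*".join(pieces), re.IGNORECASE)


def matches(term: str, word: str) -> bool:
    """Return True when the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None
```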
2.3 Solr-specific search rules
Solr allows us to specify how the input data is indexed and searched for.
Data of type String is indexed and stored verbatim.
Data of type Text can be analyzed during indexing time and searching time as follows:
2.3.1 Tokenization
It is the process of splitting the input text into tokens that are indexed and searched for.
White Space Tokenizer is used. It is a simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation will be included in the tokenization.
Input: "To be, or what?"
Output: "To", "be,", "or", "what?"
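The same behaviour can be reproduced with a one-line Python sketch (the function name is hypothetical); note how the punctuation stays attached to the tokens.

```python
def whitespace_tokenize(text: str) -> list:
    """Split on runs of whitespace and keep everything else, punctuation included."""
    return text.split()


tokens = whitespace_tokenize("To be, or what?")
```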
2.3.2 Synonym Filtering
It is the process of synonym mapping. Each token is looked up in the list of synonyms and, if a match is found, the synonym is emitted in place of the token. The position values of the new tokens are set so that they all occur at the same position as the original token.
It is applied only to the search query text.
Synonyms are specified in a text file named 'synonyms.txt'
The following are currently defined in this file.
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
# Synonym mappings can be used for spelling correction too
pixima => pixma
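As a rough illustration of how such a file can be interpreted, the Python sketch below parses the two line formats shown above (comma-separated equivalents and "=>" mappings) and expands a token list. The function names are hypothetical, and lower-casing plus expansion to every equivalent term are assumptions; the actual behaviour depends on how the synonym filter is configured in Solr.

```python
def parse_synonyms(lines):
    """Build a lookup table from synonyms.txt-style lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>", 1)
            targets = [t.strip().lower() for t in right.split(",")]
            for source in left.split(","):
                table[source.strip().lower()] = targets
        else:
            group = [t.strip().lower() for t in line.split(",")]
            for term in group:
                table[term] = group
    return table


def expand(tokens, table):
    """Emit (text, position) pairs; synonyms share the position of the original token."""
    out = []
    for position, token in enumerate(tokens):
        for synonym in table.get(token.lower(), [token]):
            out.append((synonym, position))
    return out


table = parse_synonyms([
    "Television, Televisions, TV, TVs",
    "# Synonym mappings can be used for spelling correction too",
    "pixima => pixma",
])
```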
2.3.3 Stop word filtering
It is the process of discarding tokens that are on the given stop words list.
The file named stopwords.txt specifies such words. Currently they are:
an and are as at
be but by
for
if in into is it
no not
of on or
s such
t that the their then there these they this to
was will with
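A minimal Python sketch of this filter, using the word list above, is shown below. The names are hypothetical and case-insensitive comparison is an assumption.

```python
STOP_WORDS = set(
    "an and are as at be but by for if in into is it no not of on or "
    "s such t that the their then there these they this to was will with".split()
)


def remove_stop_words(tokens):
    """Drop every token that appears in the stop word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]
```

For example, the tokens "the", "history", "of", "libraries" are reduced to "history", "libraries".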
2.3.4 Word delimiter (splitting)
It is the process of splitting tokens at word delimiters. The rules for determining delimiters are as follows:
- A change in case within a word: "CamelCase" -> "Camel", "Case"
- A transition from alpha to numeric characters or vice versa:"Gonzo5000" -> "Gonzo", "5000" ; "4500XL" -> "4500", "XL"
- Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
- A trailing "'s" is removed: "O'Reilly's" -> "O", "Reilly"
- Any leading or trailing delimiters are discarded: "-hot-spot" -> "hot", "spot"
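The rules above can be approximated with two regular expressions, as in the following Python sketch. It is a simplification for illustration (hypothetical function name, ASCII letters and digits only) and not the Solr word delimiter filter itself.

```python
import re


def split_on_delimiters(token: str) -> list:
    """Split a token on case changes, letter/digit transitions and non-alphanumerics."""
    token = re.sub(r"'s$", "", token)  # remove a trailing possessive
    parts = []
    # Non-alphanumeric characters act as delimiters and are discarded.
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # Within a chunk, break at case changes and at letter/digit transitions.
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", chunk))
    return parts
```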
2.3.5 Lower case conversion
Any uppercase letters in a token are converted to the equivalent lowercase token. All other characters are left unchanged.
2.3.6 Keyword protection
Protecting words from being modified by stemmers.
Protected word list may be specified in a file named protwords.txt
No such words are specified at this time.
2.3.7 Stemming
It is the process of reducing any of the forms of a word such as "walks, walking, walked", to its elemental root e.g., "walk".
Porter Stemming Algorithm is used. It is only appropriate for English language text.
2.3.8 Remove duplicates
Removing duplicate tokens in the stream. Tokens are considered to be duplicates if they have the same text and position values.
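As a small Python sketch (hypothetical function name, tokens modelled as (text, position) pairs), the rule amounts to keeping the first occurrence of each pair:

```python
def remove_duplicates(tokens):
    """Keep the first occurrence of each (text, position) pair, preserving order."""
    seen = set()
    unique = []
    for text, position in tokens:
        if (text, position) not in seen:
            seen.add((text, position))
            unique.append((text, position))
    return unique
```

A token with the same text at a different position is kept, because only the combination of text and position identifies a duplicate.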
2.3.9 Known Behaviors (to be fixed within Search)
- The "All of these" search selection searches for the words in the order they are entered.
- The "As a phrase" search selection does not search for the words in the order they appear.
- Special characters such as '&' and ':' are searchable only if part of a longer phrase and the entire phrase is wrapped in quotation marks. (If the quotation marks are left off, no hits will be returned.)
- Wildcards (*, ?, #, !, []) do not function.
3. Display of Search results
The search results are displayed in pages of 25 records.
The page size can be changed to 50 or 100.
User can navigate to different pages of search results by using a pagination control.
For each record, short display fields (specified in section 1) will be displayed.
For each record 'View XML' button shows the xml of the document and 'Edit XML' opens the Editor.
Links are shown if applicable
DocType | links |
---|---|
Bibliographic | Instance |
Instance | Bib,Holdings,Item |
Holdings | Bib,Instance,Item |
Item | Bib,Instance,Holdings |
3.1 Short display
Each record in the search results shows a subset of the record's fields.
3.2 Detailed display
A link is (not yet) provided for each record in the search results.
When this link is clicked, a popup, or a new tab of the browser is opened, with all the fields of the record.
3.3 Highlighting
In each search result, words matching the key words entered by the user are highlighted.
3.4 Facets
Facets are groupings of search results that help the user to analyze the results and to filter or narrow them down further.
These are helpful when the user cannot guess what exact keywords to search for.
Facet fields:
Author, Subject, Format, Language, Publication Date, Genre.
The top 5 facet values, in decreasing order of occurrences, are shown for each facet field.
The remaining facet values are shown in a popup window when the "more" link is clicked.
The facet values in the popup are shown in alphabetical order.
Click on one or more facet values to filter search results and view only those records containing these facet values.
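The ordering of facet values described above can be sketched in Python as follows. The function name is hypothetical, and two points are assumptions: ties in the top list are broken alphabetically, and the popup lists only the values that are not already in the top list.

```python
from collections import Counter


def facet_values(values, top=5):
    """Return (shown, more): the top values by count, and the rest alphabetically."""
    counts = Counter(values)
    # Highest count first; equal counts are ordered alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    shown = ranked[:top]
    more = sorted(ranked[top:], key=lambda item: item[0])
    return shown, more


languages = ["English"] * 3 + ["French"] * 2 + ["German", "Arabic"]
shown, more = facet_values(languages, top=2)
```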
3.5 Sorting
By default, the search results are sorted by Title (A-Z)
The sorting options are:
Title (A-Z): sorts the results in ascending order (starting with 'a') of the title field value. This is the default sorting order.
Title (Z-A): sorts the results in descending order (starting with 'z') of the title field value.
Author (A-Z): sorts the results in ascending order (starting with 'a') of the author field value.
Author (Z-A): sorts the results in descending order (starting with 'z') of the author field value.
Pub date(new-old): sorts the results in descending order of the publication date field value.
Pub date(old-new): sorts the results in ascending order of the publication date field value.
Relevance: the default sort criterion provided by Solr.
Records with empty or null values will appear at the top of the search results.
As per JIRA OLE-2194,
The 2nd indicator in MARC gives the number of non-filing characters. These need to be taken into account when applying sort/display standards. Example: in 245 1 3 $aAn April Shower, the 2nd indicator is "3", so the first 3 characters ("An" plus a space) are ignored when applying the sort rule.
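A short Python sketch of a sort key that honours the non-filing indicator is given below. The function name and the sample titles other than "An April Shower" are made up for illustration, and lower-casing for comparison is an assumption.

```python
def filing_key(title: str, second_indicator: str) -> str:
    """Drop the non-filing characters counted by the 2nd indicator, then lower-case."""
    skip = int(second_indicator) if second_indicator.isdigit() else 0
    return title[skip:].lower()


records = [("The Zebra", "4"), ("An April Shower", "3"), ("Birds", "0")]
ordered = [title for title, _ in sorted(records, key=lambda r: filing_key(*r))]
```

With these keys "An April Shower" files under "april shower" and therefore sorts before "Birds".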
3.5.1 NISO Standard for Sort
...
NISO Search Results on docstore Search- in progress: https://jira.kuali.org/browse/OLE-2194
4. Configurability of searchable fields
As seen above, there are different fields in the documents of different category/type/formats, that are to be indexed, searched for and displayed. How a field value should be extracted from a source document is also important.
This information about different document categories, types and formats and their corresponding fields is used in DocStore as well as OLE. It is preferable to store this information in a well encapsulated, reusable and maintainable manner.
This is done in an external XML configuration file (DocumentConfig.xml), which is loaded into a corresponding POJO named DocumentConfig. The configuration file is loaded when the application starts up and is retained in memory until the application is shut down, so any changes to this file become effective only after restarting the application. There are limitations on what can be configured in this file. Changes to this file require the existing documents to be re-indexed.
It is available in %ole.docstore.home%/properties folder. (e.g. /opt/docstore/properties)
Field name convention:
Names of fields used for indexing/searching are suffixed with "_search".
Names of fields used for display are suffixed with "_display".
Names of fields used for facets are suffixed with "_facet".
Names of fields used for sorting are suffixed with "_sort".
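The convention makes it possible to derive a field's purpose from its id. The Python sketch below shows one way to do that; the function name is hypothetical and the check is only as strict as the four suffixes listed above.

```python
KNOWN_SUFFIXES = {"search", "display", "facet", "sort"}


def field_role(field_id: str):
    """Split a field id such as 'ISBN_search' into its base name and its role."""
    base, _, suffix = field_id.rpartition("_")
    if not base or suffix not in KNOWN_SUFFIXES:
        raise ValueError(f"field id does not follow the naming convention: {field_id}")
    return base, suffix
```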
Field Info/Attributes
Field Attribute | Purpose | Example |
---|---|---|
Id | Unique identifier of a field with a given [category, type, format] | |
Name | Name of the field suitable for display | |
Type | Indicates the type of value of the field (informative purpose only) | |
Mapping info can be defined for each field which specifies how the value(s) for the field should be extracted from the input file for the corresponding document. Mapping can be specified as XPATH value or a custom value.
Mapping Info/Attributes
Mapping Attribute | Purpose | Example |
---|---|---|
Type | Indicates how the mapping info is to be interpreted | |
Include | Values to be included | |
Exclude | Values to be excluded | |
Modifying configuration info:
- Open the DocumentConfig.xml file.
- Add/modify/delete one or more fields of any [document category/type/format].
- Save the file.
- Reload the DocStore application. (Restart the Tomcat server.)
- Re-index the data related to the document category/type/format modified.
Adding a field:
Copy and paste an existing field definition and modify the attributes suitably.
Modifying a field:
Name and mapping info can be modified for any existing field.
Deleting a field:
A field definition can be commented or deleted.
4.1 Document Configurations
All configuration related to the documents that are indexed and searchable through Solr is defined in one common place, the XML file 'DocumentConfig.xml'. For each document category/type/format, this file defines the fields that can be searched, displayed and faceted, and conveys them to DocStore. A sample of this file is given below:
4.1.1 Document Configurations File
/opt/docstore/properties/DocumentConfig.xml
<documentConfig>
    <documentCategory id="work" name="Work">
        <documentType id="bibliographic" name="Bibliographic">
            <documentFormat id="all" name="ALL">
                <field id="Title_search" name="Title" type="text" />
                <field id="Author_search" name="Author" type="text" />
                .......
            </documentFormat>
            <documentFormat id="marc" name="MARC">
                <field id="ISBN_display" name="ISBN" type="text">
                    <mapping type="custom">
                        <include>020-a;z</include>
                        <exclude/>
                    </mapping>
                </field>
                <field id="ISBN_search" name="ISBN" type="text">
                    <mapping type="custom">
                        <include>020-a;z</include>
                        <exclude/>
                    </mapping>
                </field>
                <field id="ISSN_display" name="ISSN" type="text">
                    <mapping type="custom">
                        <include>022-a;z</include>
                        <exclude/>
                    </mapping>
                </field>
                .......
            </documentFormat>
            ......
        </documentType>
        ........
    </documentCategory>
    .........
</documentConfig>
This configuration file is used by DocStore to index and display the fields in several areas of the DocStore search and web app modules. It is loaded only once, at startup, by DocStore as well as by the Solr doc builders and other applications, and is used thereafter at the time of indexing and display.
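To make the structure of the file concrete, the Python sketch below reads a cut-down configuration and lists the field ids per [category, type, format]. It is an illustration only: DocStore itself loads the file into a Java POJO, the function name is hypothetical, and the embedded sample is a shortened, well-formed variant of the file shown above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<documentConfig>
  <documentCategory id="work" name="Work">
    <documentType id="bibliographic" name="Bibliographic">
      <documentFormat id="marc" name="MARC">
        <field id="ISBN_display" name="ISBN" type="text">
          <mapping type="custom"><include>020-a;z</include><exclude/></mapping>
        </field>
        <field id="ISBN_search" name="ISBN" type="text">
          <mapping type="custom"><include>020-a;z</include><exclude/></mapping>
        </field>
      </documentFormat>
    </documentType>
  </documentCategory>
</documentConfig>
"""


def load_fields(xml_text: str) -> dict:
    """Map (category id, type id, format id) to the list of field ids defined for it."""
    root = ET.fromstring(xml_text)
    fields = {}
    for category in root.findall("documentCategory"):
        for doc_type in category.findall("documentType"):
            for doc_format in doc_type.findall("documentFormat"):
                key = (category.get("id"), doc_type.get("id"), doc_format.get("id"))
                fields[key] = [f.get("id") for f in doc_format.findall("field")]
    return fields


fields = load_fields(SAMPLE)
```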
The currently supported document categories, types and formats are as below:
<documentConfig>
    <documentCategory id="work" name="Work">
        <documentType id="bibliographic" name="Bibliographic">
            <documentFormat id="all" name="ALL" ...>
            <documentFormat id="marc" name="MARC" ...>
            <documentFormat id="dublin" name="Dublin Core" ...>
            <documentFormat id="dublinunq" name="Dublin Unqualified" ...>
        </documentType>
        <documentType id="license" name="License">
            <documentFormat id="all" name="ALL" ...>
            <documentFormat id="onixpl" name="ONIXPL" ...>
            <documentFormat id="pdf" name="PDF" ...>
            <documentFormat id="doc" name="DOC" ...>
            <documentFormat id="xslt" name="XSLT" ...>
        </documentType>
        <documentType id="instance" name="Instance">
            <documentFormat id="oleml" name="OLEML" ...>
        </documentType>
        <documentType id="holdings" name="Instance Holding">
            <documentFormat id="oleml" name="OLEML" ...>
        </documentType>
        <documentType id="item" name="Instance Item">
            <documentFormat id="oleml" name="OLEML" ...>
        </documentType>
    </documentCategory>
    <documentCategory id="security" name="Security">
    </documentCategory>
</documentConfig>
4.1.2 Field Definitions
A field in this configuration file is defined with three attributes: id is the name by which the field is represented and indexed in Solr, name is the displayable name of the field, and type is the type of its value.
<field id="ISBN_search" name="ISBN" type="text">
<mapping type="custom">
<include>020-a;z</include>
<exclude/>
</mapping>
</field>
A mapping is of type 'custom' or 'xpath'. It lists the fields or values to include or exclude when deriving the actual field values from the document XML. With 'custom', the notation is specific to, and understood only by, the doc builder for that type of document. With 'xpath', the include and exclude entries are XPath expressions that identify the elements (tags or fields) to be included or excluded.
<field id="ContractNumber_search" name="Contract Number" type="text">
<mapping type="xpath">
<include>/publicationsLicenseExpression/licenseDetail/licenseIdentifier/IDValue/value</include>
</mapping>
</field>
The above is an example of a field with mapping type xpath.
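The two mapping types could be interpreted along the lines of the Python sketch below. Both functions are hypothetical. The reading of the custom notation ("tag-subfield;subfield") is an assumption based on the 020-a;z example, since the actual notation is defined by the doc builder. The XPath helper handles only simple absolute paths without namespaces, and the contract number in the sample XML is made up.

```python
import xml.etree.ElementTree as ET


def parse_custom_include(include: str):
    """Split a custom include such as '020-a;z' into a MARC tag and its subfield codes."""
    tag, _, subfields = include.partition("-")
    return tag.strip(), [s.strip() for s in subfields.split(";") if s.strip()]


def xpath_values(xml_text: str, path: str) -> list:
    """Return the text of every element matched by a simple absolute path."""
    root = ET.fromstring(xml_text)
    steps = path.strip("/").split("/")
    if steps[0] != root.tag:
        return []
    if len(steps) == 1:
        return [root.text or ""]
    # ElementTree paths are relative to the root element, so drop the first step.
    return [element.text or "" for element in root.findall("/".join(steps[1:]))]


LICENSE = (
    "<publicationsLicenseExpression><licenseDetail><licenseIdentifier>"
    "<IDValue><value>C-001</value></IDValue>"
    "</licenseIdentifier></licenseDetail></publicationsLicenseExpression>"
)
PATH = "/publicationsLicenseExpression/licenseDetail/licenseIdentifier/IDValue/value"
contract_numbers = xpath_values(LICENSE, PATH)
```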
4.1.3 Add/ Delete/ Update a Field Definition
Steps to add or update a field for indexing and display:
- Open the document DocumentConfig.xml from /opt/docstore/properties/
- To add a new field, take an existing field definition of that type as a reference and derive the new field from it. To update a field, modify its definition as required.
- Take care that the id does not duplicate the id of any existing field definition.
- Save the file.
- Restart DocStore.
- Re-index the documents of the concerned category/type/format so that the changes are reflected in DocStore.
The above steps must be followed in the given order for the required changes to be reflected in DocStore.
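The duplicate id check in the steps above can be automated before restarting. The Python sketch below is one possible check, assuming that ids need to be unique within a single [category, type, format]; the function name is hypothetical and the sample configuration is made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_field_ids(xml_text: str) -> dict:
    """Return {(category, type, format): [ids defined more than once]}."""
    root = ET.fromstring(xml_text)
    duplicates = {}
    for category in root.findall("documentCategory"):
        for doc_type in category.findall("documentType"):
            for doc_format in doc_type.findall("documentFormat"):
                counts = Counter(f.get("id") for f in doc_format.findall("field"))
                repeated = sorted(i for i, n in counts.items() if n > 1)
                if repeated:
                    key = (category.get("id"), doc_type.get("id"), doc_format.get("id"))
                    duplicates[key] = repeated
    return duplicates


CONFIG = """
<documentConfig>
  <documentCategory id="work" name="Work">
    <documentType id="bibliographic" name="Bibliographic">
      <documentFormat id="marc" name="MARC">
        <field id="ISBN_search" name="ISBN" type="text"/>
        <field id="ISSN_display" name="ISSN" type="text"/>
        <field id="ISBN_search" name="ISBN" type="text"/>
      </documentFormat>
    </documentType>
  </documentCategory>
</documentConfig>
"""
problems = duplicate_field_ids(CONFIG)
```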
In case of deletion of a field, or if a specific field is not required to be indexed: all the above steps are followed except 2,
...
DocStore Search
1. Indexed Data
1.1 Searchable fields for all document categories, types and formats
1.1.1 Searchable fields for Bibliographic
Field Name | Work-Bib-MARC | Work-Bib-DublinQ | Work-Bib-DublinUnQ |
---|---|---|---|
Title | Yes | Yes | Yes |
Author | Yes | Yes | Yes |
Subject | Yes | Yes | Yes |
Description | Yes | Yes | Yes |
Date of Publication | Yes | Yes | Yes |
Format | Yes | Yes | Yes |
Language | Yes | Yes | Yes |
Publisher | Yes | Yes | Yes |
Publication Date | Yes | Yes | Yes |
ISSN/ISBN/other (last for dc identifier) | Yes | Yes | Yes |
Genre (marc genre/dc type) | Yes | Yes | Yes |
Edition | Yes | Yes | Yes |
Bib Identifier | Yes | Yes | Yes |
Holdings Identifier | Yes | Yes | Yes |
Local Id | Yes | Yes | Yes |
Doc Category | Yes | Yes | Yes |
Doc Format | Yes | Yes | Yes |
Doc Type | Yes | Yes | Yes |
Status | Yes | Yes | Yes |
Status Updated On | Yes | Yes | Yes |
Staff Only Flag | Yes | Yes | Yes |
Created By | Yes | Yes | Yes |
Updated By | Yes | Yes | Yes |
Date Entered | Yes | Yes | Yes |
Date Updated | Yes | Yes | Yes |
1.1.2 Searchable fields for Holdings and Item
Field Name | Work-Holdings-OLEML | Work-Item-OLEML |
---|---|---|
Bib Identifier | Yes | Yes |
Holdings Identifier | Yes | Yes |
Item Identifier | Yes | Yes |
Receipt Status | Yes | No |
Copy Number | Yes | Yes |
Call Number | Yes | Yes |
Call Number Type | Yes | Yes |
Item Part | Yes | No |
Call Number Prefix | Yes | Yes |
Classification Part | Yes | Yes |
Shelving Scheme Code | Yes | Yes |
Shelving Scheme Value | Yes | Yes |
Shelving Order | Yes | Yes |
Location Level | Yes | Yes |
Location Level Name | Yes | Yes |
Access Status | Yes | No |
Access Location | Yes | No |
Statistical Search Code Value | Yes | Yes |
Publisher | Yes | No |
Imprint | Yes | No |
Platform | Yes | No |
Public Note | Yes | No |
Holding Note | Yes | No |
Local Id | Yes | Yes |
Doc Category | Yes | Yes |
Doc Format | Yes | Yes |
Doc Type | Yes | Yes |
Status | Yes | Yes |
Status Updated On | Yes | Yes |
Staff Only Flag | Yes | Yes |
Claims Return Flag | No | Yes |
Claims Return Flag Create Date | No | Yes |
Claims Return Note | No | Yes |
Current Borrower | No | Yes |
Proxy Borrower | No | Yes |
Due Date Time | No | Yes |
Barcode ARSL | No | Yes |
Volume Number | No | Yes |
Enumeration | No | Yes |
Chronology | No | Yes |
Item Barcode | No | Yes |
Item Status | No | Yes |
Item Type Full Value | No | Yes |
Item Type Code Value | No | Yes |
Created By | Yes | Yes |
Updated By | Yes | Yes |
Date Entered | Yes | Yes |
Date Updated | Yes | Yes |
1.1.4 Searchable fields for License
Field Name | Work-License-ONIXPL | Work-License-PDF |
---|---|---|
Contract Number | Yes | No |
Licensee | Yes | No |
Licensor | Yes | No |
Status | Yes | No |
Method | Yes | No |
Type | Yes | No |
Name | No | Yes |
File Name | No | Yes |
Date Uploaded | No | Yes |
Owner | No | Yes |
Notes | No | Yes |
1.2 Facet fields for all document categories, types and formats
Facet Field | Work-Bib-MARC | Work-Bib-DublinQ | Work-Bib-DublinUnQ | Work-Instance-OLEML | Work-Holdings-OLEML | Work-Item-OLEML | Work-License-ONIXPL | Work-License-PDF |
---|---|---|---|---|---|---|---|---|
Subject | Yes | Yes | Yes | No | No | No | No | No |
Author | Yes | Yes | Yes | No | No | No | No | No |
Format | Yes | Yes | Yes | No | No | No | No | No |
Language | Yes | Yes | Yes | No | No | No | No | No |
Publication Date | Yes | Yes | Yes | No | No | No | No | No |
Genre | Yes | Yes | Yes | No | No | No | No | No |
1.3 Field definitions for Work-Bib-MARC documents
Field | Data fields for search (MV indicates multi-valued) | Data fields for short display | Data fields for detailed display | Data fields for Facet |
---|---|---|---|---|
ISSN | 022 - a,z (MV) | first value | all values | same as search field |
ISBN | 020 - a,z (MV) | first value | all values | same as search field |
Author/Creator | For each 100, 110: every subf except $6 (gives us 2 values for every tag). Also every subf except $t and $6 for: 111, 700, 710, 711, 800, 810, 811, 400, 410, 411 | first non-empty value of 100$a or 110$a. Show blank if both are missing. | All non-empty indexed values | All non-empty indexed values |
Title | 245 - all subf exc. c and 6. Also 130, 240, 246, 247, 440, 490, 730, 740, 773, 774, 780, 785, 830, 840 (MV) | 245$a and 245$b | all values | |
Place of Publication | 260 - a (MV) | first value | all values | same as search field |
Description | 505 - a (MV) | first value | all values | same as search field |
Subject | 600, 610, 611, 630, 650, 651, 653, 69X: every subf exc. $6, $2, $=, $? across these tags | first non-empty value of 600, 610, 611, 650, 651, 653, 69X. | All non-empty indexed values | All non-empty indexed values |
Date of Publication | <marc:controlfield tag="008">[Date 1 in the 7-10 positions]</marc:controlfield> LR: Can also include 260 $c. (260 $c is the same as the value in the control field. Use this if the control field does not have a pub date value.) (MV) | first value | all values | same as search field |
Edition | 250 - a,b (MV) | first value | all values | same as search field |
Form/Genre | 655 - a, v (MV) | first value | all values | same as search field |
Language | <marc:controlfield tag="008">[language code in the 35-37 positions]</marc:controlfield> LR: Add 546 $a (MV) | all values | all values | same as search field |
Format | 856 - q | first value | all values | same as search field |
Test Template for Work-Bib-Marc document fields
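The control-field rows above (Date of Publication and Language) can be illustrated with a minimal Python sketch. It assumes the 008 value is available as a plain string and that positions are zero-based, as in the table; the function names and the sample value are hypothetical and are not part of DocStore.

```python
import re


def publication_date(field_008, field_260c=None):
    """Return Date 1 (008 positions 7-10), falling back to a year found in 260 $c."""
    date1 = field_008[7:11] if len(field_008) >= 11 else ""
    if date1.isdigit():
        return date1
    if field_260c:
        # 260 $c often carries extra characters, e.g. "c1985."
        match = re.search(r"\d{4}", field_260c)
        if match:
            return match.group(0)
    return None


def language_code(field_008):
    """Return the language code held in 008 positions 35-37."""
    code = field_008[35:38].strip() if len(field_008) >= 38 else ""
    return code or None


# Hypothetical 40-character 008 value, assembled piece by piece for readability.
SAMPLE_008 = "850423" + "s" + "1985" + "    " + "nyu" + " " * 17 + "eng" + " " + "d"
```

With the sample value, `publication_date(SAMPLE_008)` yields the Date 1 characters and `language_code(SAMPLE_008)` yields the three-letter language code. When Date 1 is blank, the optional 260 $c value is consulted instead.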
1.4 Format field definitions for Work-Bib-MARC documents
Label | Marc Fields | Comments |
---|---|---|
Manuscript | Has any holdings with "manuscripts" in location_name (gets only this value) | LR: MARC XML does not have location_name so this is irrelevant to the IU data that OLE has for November. Manuscript could be determined by the Leader 06/07. 06 values a, f, t equal manuscripts on their own. 07 values c and d seem to imply manuscript/archival collections/series. We should check with the SMEs on this one. |
Microformat | Has 245 $h containing "micro" OR has any holdings with "micro" in location_name OR call_number starts "micro" (gets only this value) | LR: the 245 $h "micro" will work for the IU OLE MARCXML we have, but the remaining text is specific to UPenn. |
Archive | Has any holdings with "archive" in location_name (gets only this value) | LR: This is specific to UPenn. We may need to talk to IU about if they include Archive descriptions in their MARC records and how they designate them as such. |
Thesis/Dissertation | bib_format is 'tm' AND has a 502 field | LR: UPenn's bib_format seems to be a combination of the data values found in the 06/07 Leader fields. For example, t in the 06 is Manuscript and m in the 07 is Monograph/Item and together they equal a Thesis/Dissertation. |
Conference/Event | Has a 111 or 711 field [LR: Include 611 or 811] | |
Book | bib_format is 'aa', 'am' or 'ac' or 'tm'; exclude $h [micro*] and $k [kit] | LR: the 2 characters are from the Leader 06/07 the inclusions are 245 subfields |
Sound recording | bib_format is 'im' or 'jm' or 'jc' or 'jd' or 'js' | LR: the 2 characters are from the Leader 06/07 |
Musical score | bib_format is cm, dm, ca, cb, cd or cs | LR: the 2 characters are from the Leader 06/07 |
Map/Atlas | bib_format is 'e*' or 'fm' | LR: the 2 characters are from the Leader 06/07 |
Video | bib_format is 'gm' AND 007/0 = v | LR: the 2 characters are from the Leader 06/07 |
Projected graphic | bib_format is 'gm' AND 007/0 = g | LR: 007 is a controlled field that indicates the format/physical description at general level and then associated subfields are more specific. |
Journal/Periodical | bib_format is 'as' or 'gs' | LR: 007 is a controlled field that indicates the format/physical description at general level and then associated subfields are more specific. |
Image | bib_format is 'km' | LR: the 2 characters are from the Leader 06/07 |
Datafile | bib_format is 'mm' | LR: the 2 characters are from the Leader 06/07 |
Newspaper | bib_format is 'as' AND (008/21 = 'n' OR 008/22 = 'e' ) | LR: the 2 characters are from the Leader 06/07. The 008 controlled field in those 2 positions provides the "form". |
3D object | bib_format is 'r*' | LR: the single character maps to the 06 position in the leader. |
Database/Website | bib_format is '*i' | LR: the single character maps to the 06 position in the leader. |
Government document | bib_format is NOT c*, d*, i*, j* AND ( (008/28 = f, i, o and 260$b not 'press') ) | LR: the single character maps to the 06 position in the leader. 008 is a fixed length controlled field and 260 $b is a type of publication. |
Other | any bib_format not caught above | LR: Presumably relates to other 06/07 Leader data values not represented. |
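As a rough illustration of how bib_format could be derived from Leader positions 06/07 and mapped to a label, here is a minimal Python sketch. It covers only the rules in the table that depend on the Leader alone (plus the 502 check for theses). Rules that also need holdings, 007, 008, 245 or 260 data are left out, so the function returns None when no Leader-only rule applies. The function name and the sample leaders are hypothetical.

```python
# Leader-only rules from the table above; the key is Leader 06 + Leader 07.
EXACT_FORMATS = {
    "Book": {"aa", "am", "ac", "tm"},
    "Sound recording": {"im", "jm", "jc", "jd", "js"},
    "Musical score": {"cm", "dm", "ca", "cb", "cd", "cs"},
    "Image": {"km"},
    "Datafile": {"mm"},
}


def classify_format(leader, has_502=False):
    """Return a format label for a 24-character MARC leader, or None if not covered."""
    bib_format = leader[6:8]
    # A 'tm' record with a 502 field is a thesis rather than a book.
    if bib_format == "tm" and has_502:
        return "Thesis/Dissertation"
    for label, codes in EXACT_FORMATS.items():
        if bib_format in codes:
            return label
    # Wildcard rules: 'e*' or 'fm' is a map, 'r*' is a 3D object.
    if bib_format.startswith("e") or bib_format == "fm":
        return "Map/Atlas"
    if bib_format.startswith("r"):
        return "3D object"
    return None  # needs a rule that this sketch does not cover
```

The Book rule in the table also excludes records with 245 $h micro* and 245 $k kit; that check is omitted here and would need the 245 subfields as extra input.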
1.5 Field definitions for Work-Bib-DublinCore documents
Field | DC-UnQ fields for Search | DC-Q fields for Search | Data fields for short display | Data fields for detailed display | Data fields for Facet |
---|---|---|---|---|---|
Author | <dc:creator> | <dcvalue element="contributor" qualifier="author"> | first value | All non-empty | All non-empty |
Description | <dc:description> (MV) | Per Bob P.: Do not show Abstract description. | first value | all values | same as search field |
Language | <dc:language> (MV) | <dcvalue element="language" qualifier="iso">en_US</dcvalue> | first value | all values | same as search field |
Subject | <dc:subject> (R) | <dcvalue element="subject" qualifier="none"> | first value | all values | same as search field |
Title | <dc:title> | <dcvalue element="title" qualifier="none"> | first value | all values | same as search field |
Type | <dc:type> (MV) | <dcvalue element="type" qualifier="none"> | first value | all values | same as search field |
Date of Publication | <dc:date> | <dcvalue element="date" qualifier="issued"> | first value | all values | same as search field |
Format | <dc:format> (MV) | <dcvalue element="type" (This is covered in a separate field. So do not include it in Format) | first value | all values | same as search field |
Publisher | <dc:publisher> (MV) | <dcvalue element="publisher" | first value | all values | same as search field |
ISBN/ISSN/other | <dc:identifier>(ISSN)0198-9669</dc:identifier> (MV) | <dcvalue element="identifier" qualifier="isbn">0-918006-48-1</dcvalue> | first value | all values | same as search field |
Test Template for Work-Bib-DublinCore document fields
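For the unqualified Dublin Core column, the extraction of search and short-display values might look like the following Python sketch. It assumes the standard Dublin Core element namespace; the wrapping `record` element, the sample content and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

# Hypothetical unqualified Dublin Core record used only for illustration.
SAMPLE_DC = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample title</dc:title>
  <dc:creator>First Author</dc:creator>
  <dc:creator>Second Author</dc:creator>
  <dc:identifier>(ISSN)0198-9669</dc:identifier>
</record>"""


def dc_values(xml_text, element):
    """Return all non-empty values of the given dc element (search / detailed display)."""
    root = ET.fromstring(xml_text)
    qualified_name = "{%s}%s" % (DC_NAMESPACE, element)
    return [node.text.strip() for node in root.iter(qualified_name)
            if node.text and node.text.strip()]


def short_display(xml_text, element):
    """Return the first value only, as used for the short display."""
    values = dc_values(xml_text, element)
    return values[0] if values else None
```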
2. Search
This functionality allows documents to be searched for by giving keywords or phrases. Searching can be based on category, type, format and search fields.
2.1 Quick Search
Select Doc Category : Work
Doc Type : Bibliographic
Doc Format : ALL
Searching with the default condition (clicking the search button without specifying any conditions) returns all the records in the search result page.
Select Doc Category : Work
Doc Type : Bibliographic
Doc Format : MARC
Type one or more keywords in a text box.
System shows records with any field matching one or more keywords.
2.2 Advanced Search
Select Doc Category : Work
Doc Type : Bibliographic
Doc Format : MARC
The drop down for search fields will be populated based on the category selected above.
User specifies a search condition:
Selects a field.
Enters one or more keywords.
Specifies whether the keywords should be searched for as "All of these", "Any of these" or "As a phrase".
"All of these" - Any record with the selected field having all the entered keywords is included in the search results.
"Any of these" - Any record with the selected field having at least one of the entered keywords is included in the search results.
"As a phrase" - Any record with the selected field having all the entered keywords in same order is included in the search results.
User adds another condition:
Chooses how this condition combines with the previous one: in addition to it ("AND"), as an alternative to it ("OR"), or as an exclusion ("NOT").
"AND" - the conditions before and after this operator should be satisfied.
"OR" - one of the conditions before and after this operator should be satisfied.
"NOT" - the condition after this operator should not be satisfied.
User repeats previous step as many times as needed using the ADD and DELETE links.
[+]ADD : click on this link to add fields for a new search condition.
[-]Delete : click on this link to delete the last search condition.
Search is performed based on the conditions entered by the user.
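The conditions described above could be turned into a query string roughly as follows. This is a minimal sketch, assuming the query is expressed in Solr's standard query syntax and that field names follow the "_search" suffix convention; the function names and the sample field names are hypothetical, and escaping of special characters is deliberately left out.

```python
def condition_to_clause(field, keywords, mode):
    """Build one field clause for 'all', 'any' or 'phrase' matching."""
    terms = keywords.split()
    if not terms:
        raise ValueError("at least one keyword is required")
    if mode == "all":        # "All of these"
        body = "(" + " AND ".join(terms) + ")"
    elif mode == "any":      # "Any of these"
        body = "(" + " OR ".join(terms) + ")"
    elif mode == "phrase":   # "As a phrase"
        body = '"' + " ".join(terms) + '"'
    else:
        raise ValueError("unknown mode: " + mode)
    return field + ":" + body


def build_query(conditions):
    """Join conditions given as (operator, field, keywords, mode) tuples.

    The operator of the first condition is ignored; later ones are AND, OR or NOT.
    """
    query = ""
    for index, (operator, field, keywords, mode) in enumerate(conditions):
        clause = condition_to_clause(field, keywords, mode)
        if index == 0:
            query = clause
        elif operator in ("AND", "OR", "NOT"):
            query = query + " " + operator + " " + clause
        else:
            raise ValueError("unknown operator: " + str(operator))
    return query
```

For example, a title condition on "solar energy" with "All of these", followed by a NOT condition on an author, produces `Title_search:(solar AND energy) NOT Author_search:(smith)`.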
2.2.1 Using the wildcard char * in searching
An asterisk (*) can be entered as part of a search term (value to be searched for). But it should not be the first character.
For example, rec* will match record, recurring etc. j*n will match John, join etc.
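The wildcard rule can be sketched in Python as a translation of the search term into a regular expression. This is only an illustration of the intended matching behavior, assuming case-insensitive matching (so that j*n matches John); it is not how Solr evaluates wildcards internally, and the function names are hypothetical.

```python
import re


def wildcard_to_regex(term):
    """Translate a term containing '*' into a compiled, case-insensitive regex."""
    if term.startswith("*"):
        raise ValueError("'*' must not be the first character of a search term")
    # Escape every literal part and let each '*' match any run of characters.
    pattern = ".*".join(re.escape(part) for part in term.split("*"))
    return re.compile(pattern, re.IGNORECASE)


def matches(term, word):
    """Return True if the whole word matches the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None
```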
2.3 Solr-specific search rules
Solr allows us to specify how the input data is indexed and searched for.
Data of type String is indexed and stored verbatim.
Data of type Text can be analyzed during indexing time and searching time as follows:
2.3.1 Tokenization
It is the process of splitting the input text into tokens that are indexed and searched for.
White Space Tokenizer is used. It is a simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation will be included in the tokenization.
Input: "To be, or what?"
Output: "To", "be,", "or", "what?"
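The behavior of a whitespace tokenizer can be reproduced with a one-line Python sketch; the function name is hypothetical and the code only mimics the splitting rule described above.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to the tokens."""
    return text.split()
```

Calling `whitespace_tokenize("To be, or what?")` gives the four tokens listed above, with the comma and question mark still attached.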
2.3.2 Synonym Filtering
It is the process of synonym mapping. Each token is looked up in the list of synonyms and, if a match is found, the synonym is emitted in place of the token. The position values of the new tokens are set so that they all occur at the same position as the original token.
It is applied only to the text of the search parameters (the query), not to the indexed data.
Synonyms are specified in a text file named ‘synonyms.txt’
The following are currently defined in this file.
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
# Synonym mappings can be used for spelling correction too
pixima => pixma
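To make the two kinds of entries concrete, here is a minimal Python sketch of how such a file could be interpreted. It assumes that a comma-separated line is an equivalence group that expands to every member, that "=>" lines are one-way replacements, and that matching ignores case; these are assumptions about the filter configuration, and the function names are hypothetical.

```python
def parse_synonyms(lines):
    """Parse synonyms.txt-style lines into a dict of term -> replacement terms."""
    table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        if "=>" in line:
            left, right = line.split("=>", 1)
            targets = [t.strip().lower() for t in right.split(",") if t.strip()]
            for source in left.split(","):
                source = source.strip().lower()
                if source:
                    table[source] = targets
        else:
            group = [t.strip().lower() for t in line.split(",") if t.strip()]
            for term in group:
                table[term] = group
    return table


def apply_synonyms(tokens, table):
    """Return (position, token) pairs; synonyms share the original token's position."""
    result = []
    for position, token in enumerate(tokens):
        for replacement in table.get(token.lower(), [token]):
            result.append((position, replacement))
    return result
```

With the entries listed above, the query token "pixima" is replaced by "pixma", while "GB" expands to all four members of its group at the same position.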
2.3.3 Stop word filtering
It is the process of discarding tokens that are on the given stop words list.
The file named stopwords.txt specifies such words. Currently they are:
No Format |
---|
an and are as at
be but by
for
if in into is it
no not
of on or
s such
t that the their then there these they this to
was will with
|
2.3.4 Word delimiter (splitting)
It is the process of splitting tokens at word delimiters. The rules for determining delimiters are as follows:
- A change in case within a word: "CamelCase" -> "Camel", "Case"
- A transition from alpha to numeric characters or vice versa:"Gonzo5000" -> "Gonzo", "5000" ; "4500XL" -> "4500", "XL"
- Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
- A trailing "'s" is removed: "O'Reilly's" -> "O", "Reilly"
- Any leading or trailing delimiters are discarded: "-hot-spot" -> "hot", "spot"
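The splitting rules above can be approximated with regular expressions. The following Python sketch handles ASCII letters and digits only and is meant to illustrate the rules, not to reproduce every option of Solr's word delimiter filter; the function name is hypothetical.

```python
import re


def split_on_delimiters(token):
    """Split one token into sub-words following the rules listed above."""
    # A trailing "'s" is removed first.
    token = re.sub(r"'s$", "", token)
    parts = []
    # Non-alphanumeric characters act as delimiters and are discarded.
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        if not chunk:
            continue  # leading or trailing delimiters leave empty chunks
        # Inside a chunk, split at case changes and letter/digit transitions.
        parts.extend(re.findall(r"[0-9]+|[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return parts
```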
2.3.5 Lower case conversion
Any uppercase letters in a token are converted to the equivalent lowercase token. All other characters are left unchanged.
2.3.6 Keyword protection
It is the process of protecting specified words from being modified by stemmers.
Protected word list may be specified in a file named protwords.txt
No such words are specified at this time.
2.3.7 Stemming
It is the process of reducing any of the forms of a word such as "walks, walking, walked", to its elemental root e.g., "walk".
Porter Stemming Algorithm is used. It is only appropriate for English language text.
2.3.8 Remove duplicates
It is the process of removing duplicate tokens from the stream. Tokens are considered to be duplicates if they have the same text and position values.
2.3.9 Known Behaviors (to be fixed within Search)
- The "All of these" search selection searches for the words in the order they are entered.
- The "As a phrase" search selection does not search for the words in the order they appear.
- Special characters such as ‘&’ and ‘:’ are searchable only if part of a longer phrase and the entire phrase is wrapped in quotation marks. (If the quotation marks are left off, no hits will be returned.)
- Wildcards (*, ?, #, !, []) do not function.
3. Display of Search results
The search results are displayed in pages of 25 records.
The page size can be changed to 50 or 100.
User can navigate to different pages of search results by using a pagination control.
For each record, short display fields (specified in section 1) will be displayed.
For each record 'View XML' button shows the xml of the document and 'Edit XML' opens the Editor.
Links are shown if applicable
DocType | links |
---|---|
Bibliographic | Instance |
Instance | Bib,Holdings,Item |
Holdings | Bib,Instance,Item |
Item | Bib,Instance,Holdings |
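The paging described at the start of this section maps naturally onto Solr's `start` and `rows` request parameters. A minimal Python sketch, assuming 1-based page numbers and the three page sizes mentioned above; the function name is hypothetical.

```python
ALLOWED_PAGE_SIZES = (25, 50, 100)


def page_params(page, page_size=25):
    """Return the Solr 'start' and 'rows' values for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page_size not in ALLOWED_PAGE_SIZES:
        raise ValueError("page size must be 25, 50 or 100")
    return {"start": (page - 1) * page_size, "rows": page_size}
```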
3.1 Short display
Each record in the search results shows a subset of the record's fields.
3.2 Detailed display
A link for each record in the search results is planned but not yet provided.
When this link is clicked, a popup, or a new tab of the browser is opened, with all the fields of the record.
3.3 Highlighting
In each search result, words matching the key words entered by the user are highlighted.
3.4 Facets
Facets are groupings of search results that help to analyze the results and to filter or narrow them down further.
These are helpful when the user cannot guess what exact keywords to search for.
Facet fields:
Author,Subject,Format,Language,Publication Date,Genre.
Top 5 facet values in the decreasing order of occurrences are shown for each facet field.
The remaining facet values are shown in a popup window when the “more” link is clicked.
The facet values in the popup are shown in alphabetical order.
Click on one or more facet values to filter search results and view only those records containing these facet values.
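The split between the top values and the values behind the "more" link can be sketched in Python as follows. It assumes the popup lists only the values that are not already shown in the top list; the function name and the sample values are hypothetical.

```python
from collections import Counter


def facet_view(values, top_n=5):
    """Return (top values by count, remaining values in alphabetical order)."""
    counts = Counter(value for value in values if value)
    top = counts.most_common(top_n)          # decreasing order of occurrences
    shown = {value for value, _ in top}
    remaining = sorted((value, count) for value, count in counts.items()
                       if value not in shown)
    return top, remaining
```

For a Language facet with three English, two French and one German record and `top_n=2`, English and French form the top list and German appears in the popup list.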
3.5 Sorting
By default, the search results are sorted by Title (A-Z)
The sorting options are:
Title (A-Z) : sorts the results in the ascending order (starts with ‘a’) of the title field value. This is the default sorting order.
Title (Z-A) : sorts the results in the descending order (starts with ‘z’) of the title field value.
Author (A-Z) : sorts the results in the ascending order (starts with ‘a’) of the author field value.
Author (Z-A) : sorts the results in the descending order (starts with ‘z’) of the author field value.
Pub date(new-old) : sorts the results in the descending order of the publication date field value.
Pub date(old-new) : sorts the results in the ascending order of the publication date field value.
Relevance : the default sort criterion provided by Solr.
Records with empty or null values will appear at the top of the search results.
As per JIRA OLE-2194,
The 2nd indicator in MARC (e.g. in field 245) gives the number of non-filing characters, and this has to be taken into account when applying sort/display standards. Ex. 245 1 3 $aAn April Shower: the 2nd indicator is "3", so the first 3 characters, i.e. "An(space)", are ignored when applying the sort rule.
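A sort key that skips the non-filing characters could look like the following Python sketch. It assumes the title and the numeric value of the 2nd indicator are already available; the function name and the sample records are hypothetical.

```python
def title_sort_key(title, nonfiling=0):
    """Build a sort key that ignores the leading non-filing characters."""
    return (title or "")[nonfiling:].strip().lower()


# Hypothetical (title, 2nd indicator) pairs used only for illustration.
records = [("Zebra", 0), ("An April Shower", 3), ("The Birds", 4), ("", 0)]
sorted_titles = [title for title, _ in
                 sorted(records, key=lambda record: title_sort_key(*record))]
```

In ascending order the empty title comes first, which matches the rule that empty values appear at the top, and "An April Shower" is filed under "april shower".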
3.5.1 NISO Standard for Sort
The sort rules as recommended in the NISO standard are to be implemented. Please refer to NISO Standard for Sort for details.
NISO Search Results on docstore Search- in progress: https://jira.kuali.org/browse/OLE-2194
Most of the rules for sorting are done in Solr by specifying filters like LowerCaseFilterFactory in schema.xml
4. Configurability of searchable fields
As seen above, there are different fields in the documents of different category/type/formats, that are to be indexed, searched for and displayed. How a field value should be extracted from a source document is also important.
This information about different document categories, types and formats and their corresponding fields is used in DocStore as well as OLE. It is preferable to store this information in a well encapsulated, reusable and maintainable manner.
This is done in an external XML configuration file (DocumentConfig.xml) which is loaded into a corresponding POJO named DocumentConfig. The configuration file is loaded when the application starts and is retained in memory until the application is shut down, so any changes to this file become effective only after the application is restarted. There are limitations on what can be configured in this file. Changes to this file require the existing documents to be re-indexed.
4.1 DocumentConfig Info
It is available in %ole.docstore.home%/properties folder. (e.g. /opt/docstore/properties)
4.2 Field Definition
Field name convention:
Names of fields used for indexing/searching are suffixed with “_search”.
Names of fields used for display are suffixed with “_display”.
Names of fields used for facets are suffixed with “_facet”.
Names of fields used for sorting are suffixed with “_sort”.
Field Info/Attributes
Field Attribute | Purpose | Example |
---|---|---|
Id | Unique identifier of a field with a given [category, type, format] | id="ISBN_search" |
Name | Name of the field suitable for display | name="ISBN" |
Type | Indicates the type of value of the field (informative purpose only) | type="text" |
Field Definition Example:
No Format |
---|
<field id="ISBN_search" name="ISBN" type="text">
<mapping type="custom">
<include>020-a;z</include>
<exclude/>
</mapping>
</field>
|
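To show how such a field definition might be read, here is a minimal Python sketch. It assumes that a custom include value such as "020-a;z" means MARC tag 020 with subfields a and z, which is consistent with the ISBN row in section 1.3 but is an interpretation; the function name is hypothetical and this is not the DocStore loader.

```python
import xml.etree.ElementTree as ET

FIELD_XML = """<field id="ISBN_search" name="ISBN" type="text">
  <mapping type="custom">
    <include>020-a;z</include>
    <exclude/>
  </mapping>
</field>"""


def parse_field(xml_text):
    """Read one field definition and split its custom include into tag and subfields."""
    field = ET.fromstring(xml_text)
    mapping = field.find("mapping")
    include = (mapping.findtext("include") or "").strip()
    # Assumed layout of a custom include: <tag>-<subfield>;<subfield>...
    tag, _, subfields = include.partition("-")
    return {
        "id": field.get("id"),
        "name": field.get("name"),
        "mapping_type": mapping.get("type"),
        "tag": tag,
        "subfields": [code for code in subfields.split(";") if code],
    }
```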
4.3 Mapping Definition
Mapping info can be defined for each field which specifies how the value(s) for the field should be extracted from the input file for the corresponding document. Mapping can be specified as XPATH value or a custom value.
Mapping Info/Attributes
Mapping Attribute | Purpose | Example |
---|---|---|
Type | Indicates how the mapping info is to be interpreted | type="custom" |
Include | Values to be included | <include>020-a;z</include> |
Exclude | Values to be excluded | <exclude/> |
xpath Mapping Example:
No Format |
---|
<field id="ContractNumber_search" name="Contract Number" type="text">
<mapping type="xpath">
<include>/publicationsLicenseExpression/licenseDetail/licenseIdentifier/IDValue/value</include>
</mapping>
</field>
|
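An xpath mapping like the one above could be evaluated as in the following Python sketch. It uses the standard library's ElementTree, which supports only a subset of XPath, so the absolute path is converted to a path relative to the root element. The sample document, its lack of namespaces and the function name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

INCLUDE_PATH = ("/publicationsLicenseExpression/licenseDetail/"
                "licenseIdentifier/IDValue/value")

# Hypothetical license document without namespaces, for illustration only.
SAMPLE_LICENSE = """<publicationsLicenseExpression>
  <licenseDetail>
    <licenseIdentifier>
      <IDValue><value>ABC-123</value></IDValue>
    </licenseIdentifier>
  </licenseDetail>
</publicationsLicenseExpression>"""


def extract_values(xml_text, path):
    """Return the non-empty text values found at an absolute, slash-separated path."""
    root = ET.fromstring(xml_text)
    steps = path.strip("/").split("/")
    if steps[0] != root.tag:
        return []  # the path does not start at this document's root element
    relative = "/".join(steps[1:])
    nodes = root.findall(relative) if relative else [root]
    return [node.text.strip() for node in nodes if node.text and node.text.strip()]
```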
4.4 Modifying configuration info
- Open the DocumentConfig.xml file.
- Add/modify/delete one or more fields of any [document category/type/format].
- Save the file.
- Reload the DocStore application. (Restart the Tomcat server.)
- Re-index the data related to the document category/type/format modified.
Adding a field:
Copy and paste an existing field definition and modify the attributes suitably.
Modifying a field:
Name and mapping info can be modified for any existing field.
Deleting a field:
A field definition can be commented out or deleted.
Test Template for Document Config
Transactional Search
OLE coding to date for Acquisitions functions has utilized KNS Lookups, DocSearch (Detailed Search, Superuser Search), and named or session-based searches......
...