Jira #

https://jira.kuali.org/browse/OLE-1144 (OLE Search Executive- see also linked tasks and sub-tasks)


Overview of OLE Search:

As OLE grows to include more document types (patrons, bibliographic formats, licenses, etc.), it will provide integrated search capabilities in which the search functions interoperate, replacing some of the localized wildcards, limits on bibliographic fields, and other search-specific criteria described below. Library users will be included in future testing and development of OLE Search, building on our emerging search types:

  • Document Store Search – full Bibliographic searching 
    • Indexes, Facets, Sort, Boolean
  • Acquisitions Search 
  • Order Holding Queue
  • Receiving Queue
  • Fund/ Balance Inquiries
  • Load Reports

KFS-inherited transactional searches (no bibliographic fields, only transactional ones: vendor, dates, etc.)

  • Payment Request
  • Purchase Orders
  • Requisition
  • Receiving

Other search functions available in OLE

  • Lookups
  • Lookup icon 
  • Custom Document Searches
  • Doc Search button
  • Saving Custom (Session) Searches 

DocStore Search

1. Indexed Data

1.1 Searchable fields for all document categories, types and formats

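| Field Name | Work-Bib-MARC | Work-Bib-DublinQ | Work-Bib-DublinUnQ | Work-Instance-OLEML | Work-Holdings-OLEML | Work-Item-OLEML | Work-License-ONIXPL | Work-License-PDF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Title | Yes | Yes | Yes | No | No | No | No | No |
| Author | Yes | Yes | Yes | No | No | No | No | No |
| Subject | Yes | Yes | Yes | No | No | No | No | No |
| Description | Yes | Yes | Yes | No | No | No | No | No |
| Date of Publication | Yes | Yes | Yes | No | No | No | No | No |
| Format | Yes | Yes | Yes | No | No | No | No | No |
| Language | Yes | Yes | Yes | No | No | No | No | No |
| Publisher | Yes | Yes | Yes | No | No | No | No | No |
| ISSN/ISBN/other (last for dc identifier) | Yes | Yes | Yes | No | No | No | No | No |
| Genre (marc genre/dc type) | Yes | Yes | Yes | No | No | No | No | No |
| Edition | Yes | No | No | No | No | No | No | No |
| Barcode | Yes | No | No | No | No | Yes | No | No |
| Location | Yes | No | No | No | No | No | No | No |
| Source | No | No | No | Yes | No | No | No | No |
| Record Type | No | No | No | No | Yes | No | No | No |
| Encoding Level | No | No | No | No | Yes | No | No | No |
| Receipt Status | No | No | No | No | Yes | No | No | No |
| Acquisition Method | No | No | No | No | Yes | No | No | No |
| Policy Type | No | No | No | No | Yes | No | No | No |
| Copies Reported | No | No | No | No | Yes | No | No | No |
| Item Type | No | No | No | No | No | Yes | No | No |
| Location Status | No | No | No | No | No | Yes | No | No |
| Shelving Scheme | No | No | No | No | No | Yes | No | No |
| Shelving Order | No | No | No | No | No | Yes | No | No |
| Address | No | No | No | No | No | Yes | No | No |
| Copy Number | No | No | No | No | No | Yes | No | No |
| Volume Number | No | No | No | No | No | Yes | No | No |
| Contract Number | No | No | No | No | No | No | Yes | No |
| Licensee | No | No | No | No | No | No | Yes | No |
| Licensor | No | No | No | No | No | No | Yes | No |
| Status | No | No | No | No | No | No | Yes | No |
| Method | No | No | No | No | No | No | Yes | No |
| Type | No | No | No | No | No | No | Yes | No |
| Name | No | No | No | No | No | No | No | Yes |
| File Name | No | No | No | No | No | No | No | Yes |
| Date Uploaded | No | No | No | No | No | No | No | Yes |
| Owner | No | No | No | No | No | No | No | Yes |
| Notes | No | No | No | No | No | No | No | Yes |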

1.2 Facet fields for all document categories, types and formats

...


DocStore Search

1. Indexed Data

1.1 Searchable fields for all document categories, types and formats

1.1.1 Searchable fields for Bibliographic

Field Name

Work-Bib-MARC

Work-Bib-DublinQ

Work-Bib-DublinUnQ

Work-Instance-OLEML

Work-Holdings-OLEML

Work-Item-OLEML

Work-License-ONIXPL

Work-License-PDF

Subject Title

Yes

Yes

Yes

Author

Yes

Yes

Yes

Subject

Yes

No Yes No

Yes

No

No

No

Author

Description

Yes

Yes

Yes No

Date of Publication

No Yes No

Yes

No Yes

No

Format

Yes

Yes

Yes

No Language No

Yes

No Yes No

Yes No

Subject

Language

Yes

Yes

Yes

No Publisher No

Yes

No Yes

No

No Yes

Publication DateYes

Yes

Yes

No

No

No

No

No

Genre ISSN/ISBN/other (last for dc identifier)

Yes

Yes

Yes

Genre (marc genre/dc type)

Yes

Yes

Yes

Edition

No Yes No

Yes

No Yes No

Bib Identifier

No

1.3 Field definitions for Work-Bib-MARC documents

| Field | Data fields for search (MV indicates multi-valued) | Data fields for short display | Data fields for detailed display | Data fields for Facet |
| --- | --- | --- | --- | --- |
| ISSN | 022 - a,z (MV) | first value | all values | same as search field |
| ISBN | 020 - a,z (MV) | first value | all values | same as search field |
| Author/Creator | For each 100, 110: every subf except $6 (gives us 2 values for every tag). Also every subf except $t for: 111, 700, 710, 711, 800, 810, 811, 400, 410, 411 (MV) | first non-empty value of 100$a or 110$a etc | all values | same as short display value |
| Title | 245 - all subf exc. c and 6. Also 130, 240, 246, 247, 440, 490, 730, 740, 773, 774, 780, 785, 830, 840 (MV) | 245$a and 245$b | all values | |
| Place of Publication | 260 - a (MV) | first value | all values | same as search field |
| Description | 505 - a (MV). KG/LR: UPenn just included the MARC 505 in its Description index (which is distinct from its Format/Description index). Include just 505 $a. The SMEs may want additional 5xx fields in the Description index, but 505 should be fine for November. | first value | all values | same as search field |
| Subject | 600, 610, 611, 630, 650, 651, 653, 69X: every subf exc. $6 across these tags (MV). No hyphens for X00, X10, and X11 fields (600, 610, 611, 700, 710, 711, etc), but hyphens for other fields. | first non-empty value of 600$a, 610$a etc | all values | same as short display value |
| Date of Publication | <marc:controlfield tag="008">[Date 1 in the 7-10 positions] LR: Can also include 260 $c. (260-c is same as the value in control field. Use this if control field does not have pub date value.) (MV) | first value | all values | same as search field |
| Edition | 250 - a,b (MV) | first value | all values | same as search field |
| Form/Genre | 655 - a, v (MV) | first value | all values | same as search field |
| Language | <marc:controlfield tag="008">[language code in the 35-37 positions]</marc:controlfield> LR: Add 546 $a (MV) | all values | all values | same as search field |
| Format | | first value | all values | same as search field |

1.4 Format field definitions for Work-Bib-MARC documents

Label

Marc Fields

Comments

Manuscript

Has any holdings  with "manuscripts" in location_name (gets only this value)

LR: MARC XML does not have location_name, so this is irrelevant to the IU data that OLE has for November.  Manuscript could be determined by the Leader 06/07.  06 values a, f, t equal manuscripts on their own.  07 values c and d seem to imply manuscript/archival collections/series.  We should check with the SMEs on this one.

Microformat

Has 245 $h containing "micro" OR has any holdings  with "micro" in location_name OR call_number starts "micro" (gets only this value)

LR: the 245 $h "micro" will work for the IU OLE MARCXML we have, but the remaining text is specific to UPenn.

Archive

Has any holdings  with "archive" in location_name (gets only this value)

LR: This is specific to UPenn.  We may need to talk to IU about if they include Archive descriptions in their MARC records and how they designate them as such.

Thesis/Dissertation

bib_format is 'tm' AND has a 502 field

LR: UPenn's bib_format seems to be a combination of the data values found in the 06/07 Leader fields.  For example, t in the 06 is Manuscript and m in the 07 is Monograph/Item, and together they equal a Thesis/Dissertation.

Conference/Event

Has a 111 or 711 field [LR: Include 611 or 811]

 

Book

bib_format is 'aa', 'am' or 'ac' or 'tm'; exclude $h [micro*] and $k [kit]

LR: the 2 characters are from the Leader 06/07 the inclusions are 245 subfields

Sound recording

bib_format is 'im' or 'jm' or 'jc' or 'jd' or 'js'

LR: the 2 characters are from the Leader 06/07

Musical score

bib_format is cm, dm, ca, cb, cd or cs

LR: the 2 characters are from the Leader 06/07

Map/Atlas

bib_format is 'e*' or 'fm'

LR: the 2 characters are from the Leader 06/07

Video

bib_format is 'gm' AND 007/0 = v

LR: the 2 characters are from the Leader 06/07

Projected graphic

bib_format is 'gm' AND 007/0 = g

LR: 007 is a controlled field that indicates the format/physical description at general level and then associated subfields are more specific.

Journal/Periodical

bib_format is 'as' or 'gs'

LR: 007 is a controlled field that indicates the format/physical description at general level and then associated subfields are more specific.

Image

bib_format is 'km'

LR: the 2 characters are from the Leader 06/07

Datafile

bib_format is 'mm'

LR: the 2 characters are from the Leader 06/07

Newspaper

bib_format is 'as' AND (008/21 = 'n' OR 008/22 = 'e' )

LR: the 2 characters are from the Leader 06/07.  The 008 controlled field in those 2 positions provides the "form".

3D object

bib_format is 'r*'

LR: the single character maps to the 06 position in the leader.

Database/Website

bib_format is '*i'

LR: the single character maps to the 06 position in the leader.

Government document

bib_format is NOT c*, d*, i*, j* AND ( (008/28 = f, i, o and 260$b not 'press') )

LR: the single character maps to the 06 position in the leader. 008 is a fixed length controlled field and 260 $b is a type of publication.

Other

any bib_format not caught above

LR: Presumably relates to other 06/07 Leader data values not represented.
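
As a rough illustration of how the bib_format rules above could be evaluated, the sketch below derives bib_format from Leader positions 06/07 and applies a subset of the table's rules in table order. The function names, the "first match wins" ordering, and the omission of the holdings-, 245-, 502-, 007- and 008-based rules are assumptions made for brevity; this is not the OLE or UPenn implementation.

```python
from fnmatch import fnmatchcase

# Subset of the rules in the table above, in table order.
# Each entry: (label, bib_format patterns); '*' is a wildcard as in the table.
# Rules that need other fields (502, 245 $h, 007, 008, holdings) are left out.
FORMAT_RULES = [
    ("Book", ["aa", "am", "ac", "tm"]),
    ("Sound recording", ["im", "jm", "jc", "jd", "js"]),
    ("Musical score", ["cm", "dm", "ca", "cb", "cd", "cs"]),
    ("Map/Atlas", ["e*", "fm"]),
    ("Journal/Periodical", ["as", "gs"]),
    ("Image", ["km"]),
    ("Datafile", ["mm"]),
    ("3D object", ["r*"]),
    ("Database/Website", ["*i"]),
]


def bib_format(leader):
    """Return Leader/06 (type of record) followed by Leader/07 (bibliographic level)."""
    return leader[6:8]


def format_label(leader):
    """Return the first matching label (assumed ordering), or 'Other'."""
    code = bib_format(leader)
    for label, patterns in FORMAT_RULES:
        if any(fnmatchcase(code, pattern) for pattern in patterns):
            return label
    return "Other"


print(format_label("00000cam a2200000 a 4500"))
```

A full version would first check the rules that the table marks as "gets only this value" and would need the record's 007, 008, 245 and 502 fields in addition to the Leader.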

1.5 Field definitions for Work-Bib-DublinCore documents

| Field | DC-UnQ fields for Search | DC-Q fields for Search | Data fields for short display | Data fields for detailed display | Data fields for Facet |
| --- | --- | --- | --- | --- | --- |
| Author | <dc:creator> (MV) | <dcvalue element="contributor" qualifier="author"> | first value | all values | same as search field |
| Description | <dc:description> (MV). Per Bob P.: Show only <dc:description>. | Per Bob P.: Do not show Abstract description. [show blank] | first value | all values | same as search field |
| Language | <dc:language> (MV) | <dcvalue element="language" qualifier="iso">en_US</dcvalue> | first value | all values | same as search field |
| Subject | <dc:subject> (MV) | <dcvalue element="subject" qualifier="none"> | first value | all values | same as search field |
| Title | <dc:title> | <dcvalue element="title" qualifier="none"> | first value | all values | same as search field |
| Type | <dc:type> (MV) | <dcvalue element="type" qualifier="none"> | first value | all values | same as search field |
| Date of Publication | <dc:date> | <dcvalue element="date" qualifier="issued"> | first value | all values | same as search field |
| Format | <dc:format> (MV) | | first value | all values | same as search field |
| Publisher | <dc:publisher> (MV) | | first value | all values | same as search field |
| ISBN/ISSN/other | <dc:identifier>(ISSN)0198-9669</dc:identifier> (MV); <dc:identifier>(ISBN)0306710382</dc:identifier> (MV) | <dcvalue element="identifier" qualifier="isbn">0-918006-48-1</dcvalue> | first value | all values | same as search field |

2. Search

This functionality lets users search for documents by entering keywords or phrases. Searches can be restricted by document category, type, format, and search field.

2.1  Quick Search

            Select Doc Category : Work

                      Doc Type : Bibliographic

                      Doc Format : ALL

   Searching with the default condition (clicking the Search button without specifying any conditions) returns all records in the search results page.

            Select Doc Category : Work

                      Doc Type : Bibliographic

                      Doc Format : MARC

   Type one or more keywords in the text box.

   The system shows the records in which any field matches one or more of the keywords.

2.2  Advanced Search

            Select Doc Category : Work

                      Doc Type : Bibliographic

                      Doc Format : MARC

   The drop down for search fields will be populated based on the category selected above.

   User specifies a search condition:

             Selects a field.

             Enters one or more keywords.

             Specifies whether the keywords should be searched for as "All of these", "Any of these" or "As a phrase".

                            "All of these" - Any record whose selected field contains all of the entered keywords is included in the search results.

                            "Any of these" - Any record whose selected field contains at least one of the entered keywords is included in the search results.

                            "As a phrase" - Any record whose selected field contains all of the entered keywords in the same order is included in the search results.

   User adds another condition:

             Chooses whether to apply this condition in addition to the previous one ("AND"), as an alternative to the previous one ("OR"), or as an exclusion ("NOT"):

             "AND"  - the conditions before and after this operator must both be satisfied.

             "OR"    - at least one of the conditions before and after this operator must be satisfied.

             "NOT"  - the condition after this operator must not be satisfied.

    User repeats previous step as many times as needed using the ADD and DELETE links.

          [+]ADD : click on this link to add fields for a new search condition.

           [-]Delete : click on this link to delete the last search condition.

   Search is performed based on the conditions entered by the user.
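
The matching modes and operators described above can be sketched in a few lines. This is only an illustration under stated assumptions: matching is case-insensitive on whitespace-separated words, conditions are combined strictly left to right with no operator precedence, and "NOT" is read as "and not". The function names and the record layout are hypothetical and do not reflect how OLE or Solr evaluates the query.

```python
def matches(field_value, keywords, mode):
    """Check one search condition against one field value (case-insensitive)."""
    tokens = field_value.lower().split()
    words = keywords.lower().split()
    if mode == "All of these":
        return all(word in tokens for word in words)
    if mode == "Any of these":
        return any(word in tokens for word in words)
    if mode == "As a phrase":
        n = len(words)
        # The keywords must appear as one contiguous run, in the same order.
        return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))
    raise ValueError(f"unknown mode: {mode}")


def evaluate(record, conditions):
    """conditions: list of (operator, field, keywords, mode) tuples.

    The operator of the first condition is ignored. An empty list of
    conditions matches every record, like the default search.
    """
    result = None
    for operator, field, keywords, mode in conditions:
        hit = matches(record.get(field, ""), keywords, mode)
        if result is None:
            result = hit
        elif operator == "AND":
            result = result and hit
        elif operator == "OR":
            result = result or hit
        elif operator == "NOT":
            result = result and not hit
        else:
            raise ValueError(f"unknown operator: {operator}")
    return True if result is None else result


record = {"title": "Gardens of Beijing", "author": "Smith"}
print(evaluate(record, [(None, "title", "beijing gardens", "All of these"),
                        ("NOT", "author", "jones", "Any of these")]))
```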

2.3 Solr-specific search rules

Solr allows us to specify how the input data is indexed and searched for.

Data of type String is indexed and stored verbatim.

Data of type Text can be analyzed during indexing time and searching time as follows:

2.3.1 Tokenization

It is the process of splitting the input text into tokens that are indexed and searched for.

The White Space Tokenizer is used. It is a simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation is kept as part of the tokens.

Input: "To be, or what?"

Output: "To", "be,", "or", "what?"
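
The same behaviour can be reproduced with a one-line sketch in Python (the function name is hypothetical; Solr's tokenizer itself is a Java component):

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to the tokens."""
    return text.split()


print(whitespace_tokenize("To be, or what?"))  # ['To', 'be,', 'or', 'what?']
```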

2.3.2 Synonym Filtering

It is the process of synonym mapping. Each token is looked up in the list of synonyms and, if a match is found, the synonym is emitted in place of the token. The position values of the new tokens are set so that they all occur at the same position as the original token.

It is applied only to the search-parameter text.

Synonyms are specified in a text file named 'synonyms.txt'

The following are currently defined in this file.

GB,gib,gigabyte,gigabytes

MB,mib,megabyte,megabytes

Television, Televisions, TV, TVs

# Synonym mappings can be used for spelling correction too

pixima => pixma
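
A minimal sketch of how a file in this format could be read and applied to query tokens. It assumes that a comma-separated line makes every term in the group equivalent to all the others, that "=>" replaces the left-hand terms with the right-hand ones, and that lookup is case-insensitive; these are assumptions about the configuration, and the function names are hypothetical.

```python
def load_synonyms(lines):
    """Build a lookup table from lines in the synonyms.txt format shown above."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>", 1)
            targets = [t.strip().lower() for t in right.split(",") if t.strip()]
            for source in left.split(","):
                if source.strip():
                    mapping[source.strip().lower()] = targets
        else:
            group = [t.strip().lower() for t in line.split(",") if t.strip()]
            for term in group:
                mapping[term] = group
    return mapping


def expand(tokens, mapping):
    """Return (token, position) pairs; synonyms share the original token's position."""
    expanded = []
    for position, token in enumerate(tokens):
        for term in mapping.get(token.lower(), [token]):
            expanded.append((term, position))
    return expanded


synonyms = load_synonyms(["GB,gib,gigabyte,gigabytes", "pixima => pixma"])
print(expand(["4", "GB"], synonyms))
```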

2.3.3 Stop word filtering

It is the process of discarding tokens that are on the given stop words list.

The file named stopwords.txt specifies such words. Currently they are:


an and are as at

be but by

for

if in into is it

no not

of on or

s such

t that the their then there these they this to

was will with
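
A short sketch of the filtering step, using the word list above. Case-insensitive comparison is an assumption here, and the function name is hypothetical.

```python
# The stop words listed above, as a set for fast lookup.
STOP_WORDS = set("""
an and are as at be but by for if in into is it no not of on or
s such t that the their then there these they this to was will with
""".split())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


print(remove_stop_words(["the", "history", "of", "gardens"]))  # ['history', 'gardens']
```
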
2.3.4 Word delimiter (splitting)

It is the process of splitting tokens at word delimiters. The rules for determining delimiters are as follows:

  •     A change in case within a word: "CamelCase" -> "Camel", "Case"
  •     A transition from alpha to numeric characters or vice versa:"Gonzo5000" -> "Gonzo", "5000"   ;  "4500XL" -> "4500", "XL"
  •     Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
  •     A trailing "'s" is removed: "O'Reilly's" -> "O", "Reilly"
  •     Any leading or trailing delimiters are discarded: "-hot-spot" -> "hot", "spot"
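
The rules above can be approximated with two regular expressions. This sketch handles ASCII letters and digits only and is an illustration of the listed rules, not Solr's word delimiter filter; the function name is hypothetical.

```python
import re


def split_on_delimiters(token):
    """Split one token according to the word-delimiter rules listed above."""
    # A trailing "'s" is removed.
    if token.lower().endswith("'s"):
        token = token[:-2]
    parts = []
    # Non-alphanumeric characters act as delimiters and are discarded.
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # Within a chunk, split on case changes and letter/digit transitions.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk))
    return parts


for sample in ["CamelCase", "Gonzo5000", "4500XL", "hot-spot", "O'Reilly's", "-hot-spot"]:
    print(sample, "->", split_on_delimiters(sample))
```
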
2.3.5 Lower case conversion

Any uppercase letters in a token are converted to the equivalent lowercase letters. All other characters are left unchanged.

2.3.6 Keyword protection

Keyword protection prevents listed words from being modified by stemmers.

A protected word list may be specified in a file named protwords.txt.

No such words are specified at this time.

2.3.7 Stemming

It is the process of reducing any of the forms of a word such as "walks, walking, walked", to its elemental root e.g., "walk".

The Porter stemming algorithm is used. It is only appropriate for English-language text.

2.3.8 Remove duplicates

Removing duplicate tokens in the stream. Tokens are considered to be duplicates if they have the same text and position values.
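
A small sketch of this step, assuming tokens are represented as (text, position) pairs; that representation and the function name are illustrative only. Such duplicates can arise, for example, when stemming reduces two different tokens at the same position to the same root.

```python
def remove_duplicates(tokens):
    """Keep the first occurrence of each (text, position) pair, in order."""
    seen = set()
    unique = []
    for text, position in tokens:
        if (text, position) not in seen:
            seen.add((text, position))
            unique.append((text, position))
    return unique


# "walk" appears twice at position 0 and once at position 1.
print(remove_duplicates([("walk", 0), ("walk", 0), ("walk", 1)]))
```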

2.3.9 Known Behaviors (to be fixed within Search)
  • The "All of these" search selection searches for the words in the order they are entered.
  • The "As a phrase" search selection does not search for the words in the order they appear.
  • Special characters such as '&' and ':' are searchable only if part of a longer phrase and the entire phrase is wrapped in quotation marks.  (If the quotation marks are left off, no hits will be returned.)
  • Wildcards (*, ?, #, !, []) do not function. 

3. Display of Search results

    The search results are displayed in pages of 25 records.

    The page size can be changed to 50 or 100.

    User can navigate to different pages of search results by using a pagination control.

    For each record, short display fields (specified in section 1) will be displayed.

    For each record, the 'View XML' button shows the XML of the document and 'Edit XML' opens the editor.

   Links to related documents are shown where applicable:

| DocType | Links |
| --- | --- |
| Bibliographic | Instance |
| Instance | Bib, Holdings, Item |
| Holdings | Bib, Instance, Item |
| Item | Bib, Instance, Holdings |

3.1 Highlighting

In each search result, words matching the key words entered by the user are highlighted.

3.2 Facets

Facets are groupings of search results that help the user analyze the results and further filter or narrow them down.

These are helpful when the user cannot guess what exact keywords to search for.

Facet fields:

Author, Subject, Format, Language, Publication Date, Genre.

The top 5 facet values, in decreasing order of occurrence, are shown for each facet field.

The remaining facet values can be seen in a popup window by clicking the "more" link.

The facet values in the popup are shown in alphabetical order.

Click on one or more facet values to filter search results and view only those records containing these facet values.
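
The facet behaviour described above can be sketched as follows. The record layout (each facet field holding a list of values), the tie-breaking by value name, and the function names are assumptions for illustration; in OLE the counts would come from Solr rather than from Python code.

```python
from collections import Counter


def facet_values(records, field, top_n=5):
    """Return (top, more): the top_n values by count, then the rest alphabetically."""
    counts = Counter()
    for record in records:
        counts.update(set(record.get(field, [])))  # count each record once per value
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    top = ranked[:top_n]
    more = sorted(ranked[top_n:], key=lambda item: item[0])
    return top, more


def filter_by_facets(records, field, selected):
    """Keep only the records that contain every selected facet value."""
    return [r for r in records if set(selected) <= set(r.get(field, []))]


records = [
    {"language": ["eng"]},
    {"language": ["eng", "lat"]},
    {"language": ["fre"]},
]
print(facet_values(records, "language", top_n=2))
print(filter_by_facets(records, "language", ["eng", "lat"]))
```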

3.3 Sorting

By default, the search results are sorted by Title (A-Z)

The sorting options are:

    Title (A-Z): sorts the results in ascending order (starting with 'a') of the title field value. This is the default sorting order.
    Title (Z-A): sorts the results in descending order (starting with 'z') of the title field value.
    Author (A-Z): sorts the results in ascending order (starting with 'a') of the author field value.
    Author (Z-A): sorts the results in descending order (starting with 'z') of the author field value.
    Pub date (new-old): sorts the results in descending order of the publication date field value.
    Pub date (old-new): sorts the results in ascending order of the publication date field value.
    Relevance: sorts the results by relevance, the default sort criterion provided by Solr.

Records with empty or null values will appear at the top of the search results.
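
A minimal sketch of this ordering, assuming records are dictionaries, that comparison is case-insensitive, and that empty values stay at the top for both ascending and descending sorts (the text above does not say which directions this applies to). The function name is hypothetical.

```python
def sort_records(records, field="title", descending=False):
    """Sort by a field; records with an empty or missing value come first."""
    empty = [r for r in records if not r.get(field)]
    filled = [r for r in records if r.get(field)]
    filled.sort(key=lambda r: r[field].lower(), reverse=descending)
    return empty + filled


records = [{"title": "Zebra"}, {"title": ""}, {"title": "apple"}, {"title": None}]
print(sort_records(records))                   # Title (A-Z)
print(sort_records(records, descending=True))  # Title (Z-A)
```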

3.4 Solr-specific sorting features

  • If the data contains more than one space then they are treated as a single space.
  • All data beginning with a numeral are arranged ahead of any data beginning with a letter.
  • Data consisting of a single word precedes any data beginning with the same word and followed by other words.
  • Data beginning with Articles (a, an and the) are displayed in ascending order.
  • Numbers at beginning or within the data are arranged in arithmetical order and sorted in ascending order.
  • Punctuation in numbers, as in other text, has no arrangement value (and sorted in ascending order).
  • Decimal fractions are arranged according to their arithmetical value (and sorted in ascending order).

4. NISO Standard for Sort

...

YesYesYes
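| Holdings Identifier | Yes | Yes | Yes |
| Local Id | Yes | Yes | Yes |
| Doc Category | Yes | Yes | Yes |
| Doc Format | Yes | Yes | Yes |
| Doc Type | Yes | Yes | Yes |
| Status | Yes | Yes | Yes |
| Status Updated On | Yes | Yes | Yes |
| Staff Only Flag | Yes | Yes | Yes |
| Created By | Yes | Yes | Yes |
| Updated By | Yes | Yes | Yes |
| Date Entered | Yes | Yes | Yes |
| Date Updated | Yes | Yes | Yes |
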
1.1.2 Searchable fields for Holdings

| Field Name | Work-Holdings-OLEML | Work-Item-OLEML |
| --- | --- | --- |
| Bib Identifier | Yes | Yes |
| Holdings Identifier | Yes | Yes |
| Item Identifier | Yes | Yes |
| Receipt Status | Yes | No |
| Copy Number | Yes | Yes |
| Call Number | Yes | Yes |
| Call Number Type | Yes | Yes |
| Item Part | Yes | No |
| Call Number Prefix | Yes | Yes |
| Classification Part | Yes | Yes |
| Shelving Scheme Code | Yes | Yes |
| Shelving Scheme Value | Yes | Yes |
| Shelving Order | Yes | Yes |
| Location Level | Yes | Yes |
| Location Level Name | Yes | Yes |
| Access Status | Yes | No |
| Access Location | Yes | No |
| Statistical Search Code Value | Yes | Yes |
| Publisher | Yes | No |
| Imprint | Yes | No |
| Platform | Yes | No |
| Public Note | Yes | No |
| Holding Note | Yes | No |
| Local Id | Yes | Yes |
| Doc Category | Yes | Yes |
| Doc Format | Yes | Yes |
| Doc Type | Yes | Yes |
| Status | Yes | Yes |
| Status Updated On | Yes | Yes |
| Staff Only Flag | Yes | Yes |
| Claims Return Flag | No | Yes |
| Claims Return Flag Create Date | No | Yes |
| Claims Return Note | No | Yes |
| Current Borrower | No | Yes |
| Proxy Borrower | No | Yes |
| Due Date Time | No | Yes |
| Barcode ARSL | No | Yes |
| Volume Number | No | Yes |
| Enumeration | No | Yes |
| Chronology | No | Yes |
| Item Barcode | No | Yes |
| Item Status | No | Yes |
| Item Type Full Value | No | Yes |
| Item Type Code Value | No | Yes |
| Created By | Yes | Yes |
| Updated By | Yes | Yes |
| Date Entered | Yes | Yes |
| Date Updated | Yes | Yes |

1.1.4 Searchable fields for License

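| Field Name | Work-License-ONIXPL | Work-License-PDF |
| --- | --- | --- |
| Contract Number | Yes | No |
| Licensee | Yes | No |
| Licensor | Yes | No |
| Status | Yes | No |
| Method | Yes | No |
| Type | Yes | No |
| Name | No | Yes |
| File Name | No | Yes |
| Date Uploaded | No | Yes |
| Owner | No | Yes |
| Notes | No | Yes |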

1.2 Facet fields for all document categories, types and formats

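| Facet Field | Work-Bib-MARC | Work-Bib-DublinQ | Work-Bib-DublinUnQ | Work-Instance-OLEML | Work-Holdings-OLEML | Work-Item-OLEML | Work-License-ONIXPL | Work-License-PDF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Subject | Yes | Yes | Yes | No | No | No | No | No |
| Author | Yes | Yes | Yes | No | No | No | No | No |
| Format | Yes | Yes | Yes | No | No | No | No | No |
| Language | Yes | Yes | Yes | No | No | No | No | No |
| Publication Date | Yes | Yes | Yes | No | No | No | No | No |
| Genre | Yes | Yes | Yes | No | No | No | No | No |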

1.3 Field definitions for Work-Bib-MARC documents

Field

Data fields for search (MV- indicates multi-valued)

Data fields for short display

Data fields for detailed display

Data fields for Facet

ISSN

022 - a,z (MV)

first value

all values

same as search field

ISBN

020 - a,z (MV)

first value

all values

same as search field

Author/Creator

For each 100, 110: every subf except $6 (gives us 2 values for every tag). Also every subf except $t and $6 for: 111, 700, 710, 711, 800, 810, 811, 400, 410, 411.
Ref: http://www.loc.gov/marc/bibliographic/
100 - Main Entry - Personal Name (NR)
110 - Main Entry - Corporate Name (NR)
111 - Main Entry - Meeting Name (NR)
700 - Added Entry - Personal Name (R)
710 - Added Entry - Corporate Name (R)
711 - Added Entry - Meeting Name (R)
800 - Series Added Entry - Personal Name (R)
810 - Series Added Entry - Corporate Name (R)
811 - Series Added Entry -Meeting Name (R)
400 - Series Statement/Added Entry - Personal Name(R)
410 - Series Statement/Added Entry - Corporate Name(R)
411 - Series Statement/Added Entry - Meeting Name(R)

first non-empty value of
100$a or 110$a.
Show blank if both are
missing.

All non-empty
indexed values

All non-empty
indexed values

Title

245 - all subf exc. c and 6. Also 130, 240, 246, 247, 440, 490, 730, 740, 773, 774, 780, 785, 830, 840 (MV)

245$a and 245$b

all values

 

Place of Publication

260 - a (MV)

first value

all values

same as search field

Description

505 - a (MV)
KG/LR: UPenn just included the MARC 505 in its Description index (which is distinct from its Format/Description index). Include just 505 $a.  The SMEs may want additional 5xx fields in the Description index, but 505 should be fine for November.

first value

all values

same as search field

Subject

600, 610, 611, 630, 650, 651, 653, 69X: every subf exc. $6, $2, $=, $? across these tags

600 - Subject Added Entry - Personal Name (R)
610 - Subject Added Entry - Corporate Name (R)
611 - Subject Added Entry - Meeting Name (R)
630 - Subject Added Entry - Uniform Title (R)
650 - Subject - Added Entry -  Topical Term (R)
651 - Subject Added Entry - Geographic  Name (R)
Ref : http://www.loc.gov/marc/bibliographic/bd6xx.html

first non-empty value of 600, 610, 611, 650, 651, 653, 69X .

No hyphens for X00, X10, and X11 fields (600, 610, 611, etc.), but hyphens for other fields.

Punctuation:
Follow the punctuation and order given in the subfields.

600, 610, 611: space between subfields (subfield subfield).

Sample data:
600 (subject - author name):
Kangxi,|cEmperor of China,|d1654-1722.
Facet/display:
Kangxi, Emperor of China, 1654-1722.

630: use a double hyphen for subdivisions. Follow the punctuation in the field, mainly periods between subfield values; $v, $x, $y and $z are "subdivisions" and are joined to the preceding subfield with a double hyphen.

Sample data:
630 (standard title subject):
Bible.|pN.T.|pLuke|xCommentaries.
Facet/display:
Bible. N.T. Luke -- Commentaries.

650, 651: use a double hyphen between all subfields (subfield -- subfield, period on end).

Sample data:
650 (subject - general):
Gardens|xSocial aspects|zChina|zBeijing|xHistory|y18th century.
Facet/display:
Gardens -- Social aspects -- China -- Beijing -- History -- 18th century.

All non-empty indexed
values.

Ordering of  multiple values:
Ordering of subject
headings in display
of individual bib
record should be in order of xml.

All non-empty  indexed values.

Hyphening pattern is same as display fields.

650 (subject - general):
Gardens|xSocial aspects|zChina|zBeijing|xHistory|y18th century.
Facet/display:
Gardens -- Social aspects -- China -- Beijing -- History -- 18th century.

Additional facets to be indexed:
1. Gardens - -Social aspects
2. Gardens - -Social aspects - -China
3. Gardens - -Social aspects - -China - -Beijing
4. Gardens - -Social aspects - -China - -Beijing - -History
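
The additional facets listed above are the cumulative prefixes of the heading. A small sketch, assuming the sample's "|x" style of subfield marking (a pipe followed by a one-letter subfield code) and " -- " as the separator; both function names are hypothetical.

```python
def subject_parts(heading):
    """Split 'Gardens|xSocial aspects|...' into subfield values, dropping the codes."""
    first, *rest = heading.rstrip(".").split("|")
    return [first] + [part[1:] for part in rest]


def hierarchical_facets(parts, separator=" -- "):
    """Cumulative prefixes, from two parts up to (but excluding) the full heading."""
    return [separator.join(parts[:n]) for n in range(2, len(parts))]


parts = subject_parts("Gardens|xSocial aspects|zChina|zBeijing|xHistory|y18th century.")
for facet in hierarchical_facets(parts):
    print(facet)
```

The full heading itself is indexed as the regular facet value, so the sketch stops one level short of it, mirroring the four entries in the list.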

Date of Publication

<marc:controlfield tag="008">[Date 1 in the 7-10 positions]</marc:controlfield> LR: Can also include 260 $c. (260 $c is the same as the value in the control field. Use it if the control field does not have a pub date value.) (MV)


Some records have data in 260 $c that is not a date or is not a well-formed date, so for now 260 $c is ignored for publication date.

first value

all values

same as search field.

Keep decades, but add centuries and "current year".
<controlfield tag="008">010108q10001999dcu b f000 0 eng d</controlfield>
In the above example, 1000-1999 means the record should appear under every century facet (1000s, 1100s, etc.) and every decade in 1000-1999.
Eliminate "odd" facets: we don't want "s", "0000s", or any years beyond the current year (manual removal of 2020s and 9990s).
Only want single decades and single centuries.
<controlfield tag="008">001214k08000999xx zzz a n lat d</controlfield>

Eg:
Two records with publication dates as 1905 and 1913.
The facets will be computed as:
20th century (2)-- 1900 to 1999.
1900s (1)-- 1900 to 1909.
1910s (1) – 1910 to 1919.
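
The example above can be reproduced with a short sketch. It assumes a single four-digit publication year per record and follows the convention used above that the 20th century covers 1900 to 1999; the function names and the handling of out-of-range years are illustrative assumptions.

```python
from collections import Counter
from datetime import date


def ordinal(n):
    """Return 1 -> '1st', 2 -> '2nd', 11 -> '11th', 20 -> '20th', ..."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def date_facets(year, current_year=None):
    """Return the century and decade facet labels for one publication year."""
    if current_year is None:
        current_year = date.today().year
    if year <= 0 or year > current_year:
        return []  # skip "odd" values such as 0000 or years in the future
    decade = year // 10 * 10
    century = year // 100 + 1
    return [f"{ordinal(century)} century", f"{decade}s"]


counts = Counter(label for year in (1905, 1913) for label in date_facets(year))
print(counts)
```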

Edition

250 - a,b (MV)

first value

all values

same as search field

Form/Genre

655 - a, v (MV)

first value

all values

same as search field

Language

<marc:controlfield tag="008">[language code in the 35-37 positions]</marc:controlfield> LR: Add 546 $a (MV)
Language Codes (iso-639-3)

all values

all values

same as search field

Format

856 - q
245 - h  

LR: Format is very tricky b/c many MARC fields/subfields can be used to determine format.  

I think we could also consider adding the following:

Leader 06/07 – these are Type of Record and Bibliographic Level – each is a single-letter character and when combined, they seem to map to the UPenn bib_format field.  (See the worksheet now named Format – marc)
007 Physical Description Fixed Field-General Information – the character positions indicate physical format information.  See http://www.loc.gov/marc/bibliographic/bd007.html
655 $a, $v (Genre/Form)
300 $e, $3 (Extent: Accompanying Material, Materials specified)
337 $a (Media Type)
338 $a (Carrier Type)
340 $a, $e, $m, $3 (Physical Medium: Material base, support, book format, materials specified)
Possibly other 3XX fields/subfields.

In the end, we may want to just touch base with some SMEs (Gwyneth? Bob? Stuart?) to determine specifically what we could/should include.
(MV)

first value

all values

same as search field
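
The 008 positions cited in the Date of Publication and Language rows above can be read with plain string slicing. The sketch below uses 0-based positions as in the MARC documentation and a constructed 40-character 008 value (hypothetical, not taken from OLE data); the function names are also hypothetical.

```python
def date1_from_008(field_008):
    """Return Date 1 (positions 07-10) as an int, or None if it is not all digits."""
    value = field_008[7:11]
    return int(value) if value.isdigit() else None


def language_from_008(field_008):
    """Return the language code (positions 35-37), or None if it is blank."""
    return field_008[35:38].strip() or None


# Constructed sample: date entered, type of date 's', Date 1 = 1999,
# blank Date 2, place 'dcu', 17 blanks, language 'eng', then two more positions.
sample = "010108s1999    dcu" + " " * 17 + "eng d"

print(len(sample), date1_from_008(sample), language_from_008(sample))
```

Values such as "19uu" or blanks in Date 1 return None here; how those should be indexed is left open.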

Test Template for Work-Bib-Marc document fields

1.4 Format field definitions for Work-Bib-MARC documents

Label

Marc Fields

Comments

Manuscript

Has any holdings  with "manuscripts" in location_name (gets only this value)

LR: MARC XML does not have location_name so this is irrelevant to the IU data that  OLE has for November.  Manuscript could be determined by the Leader 06/07.  06 values a, f, t equal manuscripts on their own.  07 values c and d seem to imply mauscript/archival collections/series.  We should check with the SMEs on this one.

Microformat

Has 245 $h containing "micro" OR has any holdings  with "micro" in location_name OR call_number starts "micro" (gets only this value)

LR: the 245 $h "micro" will work for the IU OLE MARCXML we have, but the reamaing text is specific to UPenn.

Archive

Has any holdings  with "archive" in location_name (gets only this value)

LR: This is specific to UPenn.  We may need to talk to IU about if they include Archive descriptions in their MARC records and how they designate them as such.

Thesis/Dissertation

bib_format is 'tm' AND has a 502 field

LR UPenn's bib_format seems to be a combination of the data values found in the 06/07 Leader fields.  For example, t in the 06 is Manuscript and m in the 07 is Monograph/Item and together they equal a Thesis/Dissertation.

Conference/Event

Has a 111 or 711 field [LR: Include 611 or 811]

 

Book

bib_format is 'aa', 'am' or 'ac' or 'tm'; exclude $h [micro*] and $k [kit]

LR: the 2 characters are from the Leader 06/07 the inclusions are 245 subfields

Sound recording

bib_format is 'im' or 'jm' or 'jc' or 'jd' or 'js'

LR: the 2 characters are from the Leader 06/07

Musical score

bib_format is cm, dm, ca, cb, cd or cs

LR: the 2 characters are from the Leader 06/07

Map/Atlas

bib_format is 'e*' or 'fm'

LR: the 2 characters are from the Leader 06/07

Video

bib_format is 'gm' AND 007/0 = v

LR: the 2 characters are from the Leader 06/07

Projected graphic

bib_format is 'gm' AND 007/0 = g

LR: 007 is a controlled field that indicates the format/physical description at general level and then associated subfields are more specific.

Journal/Periodical

bib_format is 'as' or 'gs'

LR: 007 is a controlled field that indicates the format/physical description at general level and then associated subfields are more specific.

Image

bib_format is 'km'

LR: the 2 characters are from the Leader 06/07

Datafile

bib_format is 'mm'

LR: the 2 characters are from the Leader 06/07

Newspaper

bib_format is 'as' AND (008/21 = 'n' OR 008/22 = 'e' )

LR: the 2 characters are from the Leader 06/07. The 008 controlled field in those 2 positions provides the "form".

3D object

bib_format is 'r*'

LR: the single character maps to the 06 position in the leader.

Database/Website

bib_format is '*i'

LR: the single character maps to the 07 position in the leader.

Government document

bib_format is NOT c*, d*, i*, j* AND ( (008/28 = f, i, o and 260$b not 'press') )

LR: the single character maps to the 06 position in the leader. 008 is a fixed length controlled field and 260 $b is a type of publication.

Other

any bib_format not caught above

LR: Presumably relates to other 06/07 Leader data values not represented.
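
The bib_format rules in the table above can be sketched in code. The following Python sketch assumes that bib_format is simply Leader position 06 followed by Leader position 07 (as LR's comments suggest). It covers only some of the rows and leaves out the 245 $h/$k exclusions for Book. The function name, its parameters and the order in which the rules are tested are illustrative assumptions, not the DocStore implementation.

```python
def format_label(leader, has_502=False, first_007=""):
    """Return a Format label for a MARC record (partial, illustrative rules)."""
    # Leader/06 (type of record) followed by Leader/07 (bibliographic level)
    bib_format = leader[6:8]
    if bib_format == "tm" and has_502:
        return "Thesis/Dissertation"
    if bib_format in ("aa", "am", "ac", "tm"):
        return "Book"
    if bib_format in ("im", "jm", "jc", "jd", "js"):
        return "Sound recording"
    if bib_format in ("cm", "dm", "ca", "cb", "cd", "cs"):
        return "Musical score"
    if bib_format.startswith("e") or bib_format == "fm":
        return "Map/Atlas"
    if bib_format == "gm" and first_007 == "v":
        return "Video"
    if bib_format == "gm" and first_007 == "g":
        return "Projected graphic"
    if bib_format in ("as", "gs"):
        return "Journal/Periodical"
    return "Other"  # any bib_format not caught above


# A 24-character sample leader with 'a' in position 06 and 'm' in position 07
print(format_label("00000nam a2200000 a 4500"))  # Book
```

Testing Thesis/Dissertation before Book matters in this sketch, because 'tm' appears in both rules.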

1.5 Field definitions for Work-Bib-DublinCore documents

Field

DC-UnQ fields for Search

DC-Q fields for Search

Data fields for short display

Data fields for detailed display

Data fields for Facet

Author

<dc:creator> 

<dcvalue element="contributor" qualifier="author">

first value

All non-empty
indexed values

All non-empty
indexed values

Description

<dc:description> (MV)
Per Bob P.: Show only <dc:description>.

Per Bob P.:  Do not show Abstract description.
[show blank]

first value

all values

same as search field

Language

<dc:language>  (MV)
Language Codes (iso-639-3)

<dcvalue element="language" qualifier="iso">en_US</dcvalue>
Language Codes (iso-639-1-cc)

first value

all values

same as search field

Subject

<dc:subject> (R)

<dcvalue element="subject" qualifier="none">

first value

all values

same as search field

Title

<dc:title>

<dcvalue element="title" qualifier="none">

first value

all values

same as search field

Type

<dc:type> (MV)

<dcvalue element="type" qualifier="none">

first value

all values

same as search field

Date of Publication

<dc:date>

<dcvalue element="date" qualifier="issued">

first value

all values

same as search field

Format

<dc:format> (MV)

<dcvalue element="type" (This is covered in a separate field. So do not include it in Format)
<dcvalue element="format" qualifier="mimetype">

??? (LR: In looking back at the MARC to Qualified DC mapping, it is not entirely clear, but it should be both the format and type elements.)

first value

all values

same as search field

Publisher

<dc:publisher> (MV)

<dcvalue element="publisher"

??? (KG/LR: publisher.  It doesn't appear in the crosswalk, but that could be that the UMD dataset did not include that tag)

first value

all values

same as search field

ISBN/ISSN/other

<dc:identifier>(ISSN)0198-9669</dc:identifier>  (MV)
 <dc:identifier>(ISBN)0306710382</dc:identifier> (MV)

<dcvalue element="identifier" qualifier="isbn">0-918006-48-1</dcvalue>

first value

all values

same as search field

Test Template for Work-Bib-DublinCore document fields

2. Search

This functionality allows documents to be searched for by giving keywords or phrases. Searches can be restricted by category, type, format and search field.

2.1  Quick Search

            Select Doc Category : Work

                      Doc Type : Bibliographic

                      Doc Format : ALL

   Searching with the default conditions (clicking the Search button without specifying any conditions) returns all the records on the search results page.

            Select Doc Category : Work

                      Doc Type : Bibliographic

                      Doc Format : MARC

   Type one or more keywords in a text box.

   System shows records with any field matching one or more keywords.

2.2  Advanced Search

            Select Doc Category : Work

                      Doc Type : Bibliographic

                      Doc Format : MARC

   The drop down for search fields will be populated based on the category selected above.

   User specifies a search condition:

             Selects a field.

             Enters one or more keywords.

             Specifies whether the keywords should be searched for as "All of these", "Any of these" or "As a phrase".

                            "All of these"   - Any record with the selected field having all the entered keywords is included in the search results.

                           "Any of these" - Any record with the selected field having at least one of the entered keywords is included in the search results.

                           "As a phrase"  - Any record with the selected field having all the entered keywords in same order is included in the search results.

   User adds another condition:

             Chooses whether to apply this condition in addition to the previous one ("AND") or to apply this condition as an alternative to the previous one ("OR")  ("NOT"???),

             "AND"  - the conditions before and after this operator should be satisfied.

             "OR"    - one of the conditions before and after this operator should be satisfied.

             "NOT"  - the condition after this operator should not be satisfied.

    User repeats previous step as many times as needed using the ADD and DELETE links.

          [+]ADD : click on this link to add fields for a new search condition.

           [-]Delete : click on this link to delete the last search condition.

   Search is performed based on the conditions entered by the user.
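
The way the conditions above combine can be sketched as a small query builder. This is a minimal Python sketch, assuming the conditions are rendered in standard Lucene/Solr query syntax. The field ids (Title_search, Author_search) are hypothetical examples that follow the "_search" naming convention, the function names are illustrative, and no escaping of special characters is attempted.

```python
def condition_to_query(field, keywords, match):
    """Render one search condition; match is 'all', 'any' or 'phrase'."""
    terms = keywords.split()
    if match == "phrase":
        # "As a phrase": all keywords, in the same order
        return '%s:"%s"' % (field, " ".join(terms))
    # "All of these" joins with AND, "Any of these" joins with OR
    joiner = " AND " if match == "all" else " OR "
    return "%s:(%s)" % (field, joiner.join(terms))


def build_query(conditions):
    """conditions: list of (operator, field, keywords, match) tuples.

    The operator of the first condition is ignored.
    """
    parts = []
    for index, (operator, field, keywords, match) in enumerate(conditions):
        clause = condition_to_query(field, keywords, match)
        parts.append(clause if index == 0 else "%s %s" % (operator, clause))
    return " ".join(parts)


query = build_query([
    (None, "Title_search", "april shower", "all"),
    ("NOT", "Author_search", "smith", "any"),
])
print(query)  # Title_search:(april AND shower) NOT Author_search:(smith)
```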

2.2.1 Using the wildcard char * in searching

    An asterisk (*) can be entered as part of a search term (value to be searched for). But it should not be the first character.

    For example, rec* will match record, recurring etc. j*n will match John, join etc.    
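
The matching rule above can be illustrated in memory with Python's fnmatch module. This is only a sketch of the intended behavior, not how Solr evaluates wildcards against the index. The function name is hypothetical, case-insensitive matching is an assumption, and fnmatch also gives special meaning to ? and [ ], which the rule above does not mention.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, value):
    """Match value against a search term containing '*'."""
    if pattern.startswith("*"):
        raise ValueError("'*' must not be the first character of a search term")
    # Compare in lower case so that j*n matches John (assumed behavior)
    return fnmatchcase(value.lower(), pattern.lower())


print(wildcard_match("rec*", "recurring"))  # True
print(wildcard_match("j*n", "John"))        # True
```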

2.3 Solr-specific search rules

Solr allows us to specify how the input data is indexed and searched for.

Data of type String is indexed and stored verbatim.

Data of type Text can be analyzed during indexing time and searching time as follows:

2.3.1 Tokenization

It is the process of splitting the input text into tokens that are indexed and searched for.

White Space Tokenizer is used. It is a simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation will be included in the tokenization.

Input: "To be, or what?"

Output: "To", "be,", "or", "what?"
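
The same splitting can be reproduced with a one-line Python sketch (the function name is illustrative). It only mimics the behavior described above and is not the Solr tokenizer itself.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace; punctuation stays attached to the tokens."""
    return text.split()


print(whitespace_tokenize("To be, or what?"))  # ['To', 'be,', 'or', 'what?']
```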

2.3.2 Synonym Filtering

It is the process of synonym mapping. Each token is looked up in the list of synonyms and, if a match is found, the synonym is emitted in place of the token. The position values of the new tokens are set so that they all occur at the same position as the original token.

It is applied only to the search query text.

Synonyms are specified in a text file named ‘synonyms.txt’

The following are currently defined in this file.

GB,gib,gigabyte,gigabytes

MB,mib,megabyte,megabytes

Television, Televisions, TV, TVs

# Synonym mappings can be used for spelling correction too

pixima => pixma
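
How entries in this format might be applied to query tokens can be sketched as follows. This is a simplified Python sketch under stated assumptions: comma-separated entries are treated as equivalent (each term expands to the whole group), "=>" entries replace the left side with the right side, matching is case-insensitive, and token positions are not tracked. The function names are hypothetical.

```python
def parse_synonyms(lines):
    """Build a term -> replacement-list map from synonyms.txt-style lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>")
            targets = [term.strip().lower() for term in right.split(",")]
            for source in left.split(","):
                mapping[source.strip().lower()] = targets
        else:
            group = [term.strip().lower() for term in line.split(",")]
            for term in group:
                mapping[term] = group  # every member expands to the whole group
    return mapping


def expand(tokens, mapping):
    """Replace each token by its synonyms; unknown tokens pass through."""
    expanded = []
    for token in tokens:
        expanded.extend(mapping.get(token.lower(), [token]))
    return expanded


synonyms = parse_synonyms([
    "Television, Televisions, TV, TVs",
    "# Synonym mappings can be used for spelling correction too",
    "pixima => pixma",
])
print(expand(["pixima", "printer"], synonyms))  # ['pixma', 'printer']
```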

2.3.3 Stop word filtering

It is the process of discarding tokens that are on the given stop words list.

The file named stopwords.txt specifies such words. Currently they are:

No Format
an and are as at

be but by

for

if in into is it

no not

of on or

s such

t that the their then there these they this to

was will with
2.3.4 Word delimiter (splitting)

It is the process of splitting tokens at word delimiters. The rules for determining delimiters are as follows:

  •     A change in case within a word: "CamelCase" -> "Camel", "Case"
  •     A transition from alpha to numeric characters or vice versa:"Gonzo5000" -> "Gonzo", "5000"   ;  "4500XL" -> "4500", "XL"
  •     Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
  •     A trailing "'s" is removed: "O'Reilly's" -> "O", "Reilly"
  •     Any leading or trailing delimiters are discarded: "-hot-spot" -> "hot", "spot"
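
The rules above can be approximated with regular expressions. The following Python sketch reproduces the listed examples. It is a simplified illustration with a hypothetical function name, and it is not guaranteed to match the Solr filter on every other input.

```python
import re


def split_on_delimiters(token):
    """Split a token at case changes, letter/digit transitions and punctuation."""
    # A trailing 's is removed first
    token = re.sub(r"'[sS]$", "", token)
    parts = []
    # Non-alphanumeric characters act as delimiters and are discarded
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        if chunk:
            # Runs of capitals, capitalized or lowercase words, and digit runs
            parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk))
    return parts


for sample in ["CamelCase", "Gonzo5000", "4500XL", "hot-spot", "O'Reilly's", "-hot-spot"]:
    print(sample, "->", split_on_delimiters(sample))
```
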
2.3.5 Lower case conversion

Any uppercase letters in a token are converted to the equivalent lowercase letters. All other characters are left unchanged.

2.3.6 Keyword protection

It is the process of protecting words from being modified by stemmers.

A protected word list may be specified in a file named protwords.txt

No such words are specified at this time.

2.3.7 Stemming

It is the process of reducing any of the forms of a word such as "walks, walking, walked", to its elemental root e.g., "walk".

Porter Stemming Algorithm is used. It is only appropriate for English language text.

2.3.8 Remove duplicates

It is the process of removing duplicate tokens from the stream. Tokens are considered to be duplicates if they have the same text and position values.
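
As a small illustration, tokens can be modeled as (text, position) pairs. This representation and the function name are assumptions made for this Python sketch only.

```python
def remove_duplicates(tokens):
    """Keep the first occurrence of each (text, position) pair."""
    seen = set()
    unique = []
    for text, position in tokens:
        if (text, position) not in seen:
            seen.add((text, position))
            unique.append((text, position))
    return unique


# 'walk' at position 1 appears twice (for example after stemming) and is kept once
print(remove_duplicates([("walk", 1), ("walk", 1), ("walk", 2)]))  # [('walk', 1), ('walk', 2)]
```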

2.3.9 Known Behaviors (to be fixed within Search)
  • The "All of these" search selection searches for the words in the order they are entered.
  • The "As a phrase" search selection does not search for the words in the order they appear.
  • Special characters such as ‘&’ and ‘:’ are searchable only if they are part of a longer phrase and the entire phrase is wrapped in quotation marks. (If the quotation marks are left off, no hits will be returned.)
  • Wildcards (*, ?, #, !, []) do not function.

3. Display of Search results

    The search results are displayed in pages of 25 records.

    The page size can be changed to 50 or 100.

    User can navigate to different pages of search results by using a pagination control.

    For each record, short display fields (specified in section 1) will be displayed.

    For each record, the 'View XML' button shows the XML of the document and 'Edit XML' opens the Editor.

   Links are shown if applicable

DocType | Links
Bibliographic | Instance
Instance | Bib, Holdings, Item
Holdings | Bib, Instance, Item
Item | Bib, Instance, Holdings

3.1 Short display

    Each record in the search results shows a subset of the record's fields.

3.2 Detailed display

   A link for each record in the search results is planned but not yet provided.

   When this link is clicked, a popup, or a new tab of the browser is opened, with all the fields of the record.

3.3 Highlighting

In each search result, words matching the key words entered by the user are highlighted.

3.4 Facets

Facets are groupings of search results that help the user analyze the results and further filter or narrow them down.

These are helpful when the user cannot guess what exact keywords to search for.

Facet fields:

Author, Subject, Format, Language, Publication Date, Genre.

The top 5 facet values, in decreasing order of occurrence, are shown for each facet field.

The remaining facet values can be seen in a popup window by clicking the “more” link.

The facet values in the popup are shown in alphabetical order.

Click on one or more facet values to filter search results and view only those records containing these facet values.
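
The facet display rules above can be sketched as follows. This is an in-memory Python illustration with hypothetical names. In practice the counts would come from Solr, and breaking ties alphabetically is an assumption.

```python
from collections import Counter


def facet_values(records, field, top=5):
    """Return (top values by count, remaining values in alphabetical order)."""
    counts = Counter(value for record in records for value in record.get(field, []))
    # Decreasing number of occurrences; ties are broken alphabetically (assumed)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    top_values = ranked[:top]
    more_values = sorted(ranked[top:])  # alphabetical order for the popup
    return top_values, more_values


records = [
    {"Language": ["eng"]},
    {"Language": ["eng", "fre"]},
    {"Language": ["ger"]},
]
top_values, more_values = facet_values(records, "Language", top=2)
print(top_values)   # [('eng', 2), ('fre', 1)]
print(more_values)  # [('ger', 1)]
```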

3.5 Sorting

By default, the search results are sorted by Title (A-Z)

The sorting options are:

    Title (A-Z) : sorts the results in ascending order (starts with ‘a’) of the title field value. This is the default sorting order.
    Title (Z-A) : sorts the results in descending order (starts with ‘z’) of the title field value.
    Author (A-Z) : sorts the results in ascending order (starts with ‘a’) of the author field value.
    Author (Z-A) : sorts the results in descending order (starts with ‘z’) of the author field value.
    Pub date (new-old) : sorts the results in descending order of the publication date field value.
    Pub date (old-new) : sorts the results in ascending order of the publication date field value.
    Relevance : the default sort criterion provided by Solr.

Records with empty or null values will appear at the top of the search results.

As per JIRA OLE-2194,

The 2nd indicator of the MARC 245 field gives the number of non-filing characters, and the sort/display rules need to take it into account. Example: in 245 1 3 $aAn April Shower, the 2nd indicator is "3", so the first 3 characters ("An" plus a space) are ignored when applying the sort rule.
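
The non-filing rule can be sketched as a sort key. This is a minimal Python sketch with a hypothetical function name. It assumes the non-filing count from the 2nd indicator is already available as an integer and that sorting is case-insensitive.

```python
def title_sort_key(title, nonfiling_chars=0):
    """Drop the non-filing characters and lower-case the rest for sorting."""
    return title[nonfiling_chars:].lower()


# (title, 2nd indicator of the 245 field)
titles = [("The Zebra", 4), ("An April Shower", 3), ("Moby Dick", 0)]
ordered = sorted(titles, key=lambda entry: title_sort_key(entry[0], entry[1]))
print([title for title, _ in ordered])  # ['An April Shower', 'Moby Dick', 'The Zebra']
```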

Test Template for sort

3.5.1 NISO Standard for Sort

The sort rules as recommended in the NISO standard are to be implemented. Please refer to NISO Standard for Sort for details.

NISO Search Results on docstore Search- in progress: https://jira.kuali.org/browse/OLE-2194  

Most of the rules for sorting are done in Solr by specifying filters like LowerCaseFilterFactory in schema.xml

4. Configurability of searchable fields

As seen above, there are different fields in the documents of different category/type/formats, that are to be indexed, searched for and displayed. How a field value should be extracted from a source document is also important.

This information about different document categories, types and formats and their corresponding fields is used in DocStore as well as OLE. It is preferable to store this information in a well encapsulated, reusable and maintainable manner.

This is done in an external XML configuration file (DocumentConfig.xml), which is loaded into a corresponding POJO named DocumentConfig. The configuration file is loaded when the application starts up and is retained in memory until the application is shut down, so any changes to this file become effective only after the application is restarted. There are limitations on what can be configured in this file. Changes to this file require the existing documents to be re-indexed.

4.1 DocumentConfig Info

DocumentConfig.xml

It is available in %ole.docstore.home%/properties folder. (e.g. /opt/docstore/properties)

4.2 Field Definition

Field name convention:

Names of fields used for indexing/searching are suffixed with “_search”.

Names of fields used for display are suffixed with “_display”.

Names of fields used for facets are suffixed with “_facet”.

Names of fields used for sorting are suffixed with “_sort”.

Field Info/Attributes

Field Attribute | Purpose | Example
Id | Unique identifier of a field with a given [category, type, format] | id="ISBN_search"
Name | Name of the field suitable for display | name="ISBN"
Type | Indicates the type of value of the field (informative purpose only) | type="text"

Field Definition Example:

No Format
       <field id="ISBN_search" name="ISBN" type="text">
          <mapping type="custom">
              <include>020-a;z</include>
              <exclude/>                       
          </mapping>
       </field>
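
To illustrate how such a definition might be read, the following Python sketch parses the example above with the standard library. The interpretation of the include value "020-a;z" as MARC tag 020 with subfields a and z is an assumption, and the helper name is hypothetical. The actual loading into the DocumentConfig POJO is done in Java and is not shown here.

```python
import xml.etree.ElementTree as ET

FIELD_XML = """
<field id="ISBN_search" name="ISBN" type="text">
  <mapping type="custom">
    <include>020-a;z</include>
    <exclude/>
  </mapping>
</field>
"""


def parse_custom_include(text):
    """Split '020-a;z' into ('020', ['a', 'z']); the syntax meaning is assumed."""
    tag, _, subfields = text.partition("-")
    return tag, (subfields.split(";") if subfields else [])


field = ET.fromstring(FIELD_XML)
mapping = field.find("mapping")
tag, subfields = parse_custom_include(mapping.findtext("include"))
print(field.get("id"), mapping.get("type"), tag, subfields)
# ISBN_search custom 020 ['a', 'z']
```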

4.3 Mapping Definition

Mapping info can be defined for each field. It specifies how the value(s) for the field should be extracted from the input file for the corresponding document. A mapping can be specified as an XPath value or a custom value.

Mapping Info/Attributes

Mapping Attribute | Purpose | Example
Type | Indicates how the mapping info is to be interpreted | type="custom"
Include | Values to be included | <include>020-a;z</include>
Exclude | Values to be excluded | <exclude/>

xpath Mapping Example:

No Format
       <field id="ContractNumber_search" name="Contract Number" type="text">
          <mapping type="xpath">
              <include>/publicationsLicenseExpression/licenseDetail/licenseIdentifier/IDValue/value</include>                       
          </mapping>
       </field>

4.4 Modifying configuration info

  1. Open the DocumentConfig.xml file.
  2. Add/modify/delete one or more fields of any [document category/type/format].
  3. Save the file.
  4. Reload the DocStore application. (Restart the Tomcat server.)
  5. Re-index the data related to the document category/type/format modified.

Adding a field:

Copy and paste an existing field definition and modify the attributes suitably.

Modifying a field:

Name and mapping info can be modified for any existing field.

Deleting a field:

A field definition can be commented out or deleted.

Test Template for Document Config

Transactional Search

OLE coding to date for Acquisitions functions has utilized KNS Lookups, DocSearch (Detailed Search, Superuser Search), and named or session-based searches......

<insert more info on framework>

  1. Doc Search (Button)- takes user to short search, but can jump to Superuser or Detailed Search and specify edocument (PO, etc).
  2. OLE Acquisitions Search
    1. OLE created OLE Acquisitions Search to assist Acquisitions and Selectors. We still need to add more transactional search options (such as PO's paid, PO's received on PO searches, etc), but created the following Search fields that combine purchasing and Bib information:
      1. Document Type:
      2. Document Number:
      3. Purchase Order #:
      4. Vendor Name:
      5. Date Created From:
      6. Date Created To:
      7. Initiator:
      8. Requestor:
      9. Account Number:
      10. Organization Code:
      11. Chart Code:
      12. Title:
      13. Author:
      14. Publisher:
      15. ISXN:
      16. Search Type
  3. KFS (inherited Custom Doc Searches)
    1. inherited Custom Doc Searches allow users to save Named Searches, or select from search session history.
    2. OLE will extend or replace existing searches to include docstore or bib data:
    3. Requisition search
    4. Purchase Order search
    5. Receiving search
    6. Payment Request search
  4. OLE may also need to extend KFS Account Searches
  5. OLE has implemented the following combined search and group-action queues, allowing users to search, sort, filter and apply global actions across multiple documents in current workflow:
    1. OLE Order Holding Queue
      1. Selector:
      2. Document Number:
      3. Requisition Status:
      4. Vendor Name:
      5. Requestor:
      6. Format:
      7. Chart Code:
      8. Account Number:
      9. Object Code:
      10. Workflow Status Change Date From:
      11. Workflow Status Change Date To:
      12. Title:
      13. Author:
      14. Publisher:
      15. (will add ISxN in future)
  6. OLE Receiving Queue
    1. Purchase Order Number
    2. ISxN
    3. Title
    4. Journal
    5. Filters (not yet functional, but on UI):
      1. Serials
      2. Standing Orders
      3. Monographs
      4. Status
      5. Vendor
      6. Purchase Order Date
  7. Known Issues with KNS searching:
    1. If you know the exact words or phrase, enter the text wrapped in quotes
    2. If you want to return hits for any of the words entered, leave off the quotes
    3. Use "*" only at the end of a word and only when you have entered a single word in the field
    4. Specify dates in the format mm/dd/yyyy.
    5. Boolean operators used within search screens are not currently providing consistent results.

...

The following and other features may be added and reviewed with SMEs for future releases.

  1. Authority records: linkages, search, NACO standards
  2. Call Number Browse (coming in OLE 0.8)
  3. Linked PO or Circ record from Item, and Order/Circ status (coming in OLE 0.8)
  4. Linked License Agreement (electronic journals etc)- in progress
  5. Search filters: Location, Format, TBA
  6. External Linked Data: Authority, or other stores
  7. Saved DocStore Searches (or user preferences)
  8. Wildcard behaviors
  9. Positional Operators
  10. Truncation
  11. Nested Search (more than one operator in same expression)
  12. Field/Marc tagging search
  13. Checkin, Checkout from Search
  14. Rice/KNS upgrades (future): search facets and other enhancements for transactional search
  15. Non-Roman Characters (ie, Chinese, Russian, etc)