Jira #

https://jira.kuali.org/browse/OLE-1144 (OLE Search Executive- see also linked tasks and sub-tasks)

...

1.1.1 Searchable fields for Bibliographic

Field Name

Work-Bib-MARC

Work-Bib-DublinQ

Work-Bib-DublinUnQ

Title

Yes

Yes

Yes

Author

Yes

Yes

Yes

Subject

Yes

Yes

Yes

Description

Yes

Yes

Yes

Date of Publication

Yes

Yes

Yes

Format

Yes

Yes

Yes

Language

Yes

Yes

Yes

Publisher

Yes

Yes

Yes

ISSN/ISBN/other (last for dc identifier)

Yes

Yes

Yes

Genre (marc genre/dc type)

Yes

Yes

Yes

Edition

Yes

No

No

Bib Identifier: Yes
Holdings Identifier: Yes
Local Id: Yes
Doc Category: Yes
Doc Format: Yes
Doc Type: Yes
Status: Yes
Status Updated On: Yes
Staff Only Flag: Yes
Created By: Yes
Updated By: Yes
Date Entered: Yes
Date Updated: Yes
1.1.2 Searchable fields for Instance

Field Name

Work-Instance-OLEML

Work-Holdings-OLEML

Work-Item-OLEML

Barcode

No

No

Yes

Location

No

No

Yes

Source

Yes

No

No

Record Type

No

Yes

No

Encoding Level

No

Yes

No

Receipt Status

No

Yes

No

Acquisition Method

No

Yes

No

Policy Type

No

Yes

No

Copies Reported

No

Yes

No

Item Type

No

No

Yes

Location Status

No

No

Yes

Shelving Scheme

No

No

Yes

Shelving Order

No

No

Yes

Address

No

No

Yes

Copy Number

No

No

Yes

Volume Number

No

No

Yes

    
1.1.

...

4 Searchable fields for License

Field Name

Work-License-ONIXPL

Work-License-PDF

Contract Number

Yes

No

Licensee

Yes

No

Licensor

Yes

No

Status

Yes

No

Method

Yes

No

Type

Yes

No

Name

No

Yes

File Name

No

Yes

Date Uploaded

No

Yes

Owner

No

Yes

Notes

No

Yes

1.2 Facet fields for all document categories, types and formats

Facet Field

Work-Bib-MARC

Work-Bib-DublinQ

Work-Bib-DublinUnQ

Work-Instance-OLEML

Work-Holdings-OLEML

Work-Item-OLEML

Work-License-ONIXPL

Work-License-PDF

Subject

Yes

Yes

Yes

No

No

No

No

No

Author

Yes

Yes

Yes

No

No

No

No

No

Format

Yes

Yes

Yes

No

No

No

No

No

Language

Yes

Yes

Yes

No

No

No

No

No

Publication Date

Yes

Yes

Yes

No

No

No

No

No

Genre

Yes

Yes

Yes

No

No

No

No

No

1.3 Field definitions for Work-Bib-MARC documents

Field

Data fields for search (MV indicates multi-valued)

Data fields for short display

Data fields for detailed display

Data fields for Facet

ISSN

022 - a,z (MV)

first value

all values

same as search field

ISBN

020 - a,z (MV)

first value

all values

same as search field

Author/Creator

For each 100, 110: every subfield except $6 (gives us 2 values for every tag). Also every subfield except $t and $6 for: 111, 700, 710, 711, 800, 810, 811, 400, 410, 411.
Ref: http://www.loc.gov/marc/bibliographic/
100 - Main Entry - Personal Name (NR)
110 - Main Entry - Corporate Name (NR)
111 - Main Entry - Meeting Name (NR)
700 - Added Entry - Personal Name (R)
710 - Added Entry - Corporate Name (R)
711 - Added Entry - Meeting Name (R)
800 - Series Added Entry - Personal Name (R)
810 - Series Added Entry - Corporate Name (R)
811 - Series Added Entry - Meeting Name (R)
400 - Series Statement/Added Entry - Personal Name (R)
410 - Series Statement/Added Entry - Corporate Name (R)
411 - Series Statement/Added Entry - Meeting Name (R)

first non-empty value of
100$a or 110$a.
Show blank if both are
missing.

All non-empty
indexed values

All non-empty
indexed values

Title

245 - all subfields except $c and $6. Also 130, 240, 246, 247, 440, 490, 730, 740, 773, 774, 780, 785, 830, 840. (MV)

245$a and 245$b

all values

 

Place of Publication

260 - a (MV)

first value

all values

same as search field

Description

505 - a (MV)
KG/LR: UPenn just included the MARC 505 in its Description index (which is distinct from its Format/Description index). Include just 505 $a.  The SMEs may want additional 5xx fields in the Description index, but 505 should be fine for November.

first value

all values

same as search field

Subject

600, 610, 611, 630, 650, 651, 653, 69X: every subfield except $6, $2, $=, $? across these tags

600 - Subject Added Entry - Personal Name (R)
610 - Subject Added Entry - Corporate Name (R)
611 - Subject Added Entry - Meeting Name (R)
630 - Subject Added Entry - Uniform Title (R)
650 - Subject Added Entry - Topical Term (R)
651 - Subject Added Entry - Geographic Name (R)
Ref: http://www.loc.gov/marc/bibliographic/bd6xx.html

first non-empty value of 600, 610, 611, 650, 651, 653, 69X.

No hyphens for X00, X10, and X11 fields (600, 610, 611, etc.), but hyphens for other fields.

Punctuation:
Follow punctuation and order given in
subfields:
600, 610,611
(space between subfields) subfield subfield

Sample data:
600 (subject - AuthorName):
Kangxi,|cEmperor of China,|d1654-1722.
Facet/display:
Kangxi, Emperor of China, 1654-1722.

Use double hyphen for subdivisions:
630
Follow the punctuation in the fields, mainly periods between subfield values, except that $v, $x, $y, $z are "subdivisions" and are joined to the preceding subfield with a double hyphen.

Sample data:
630(Standard title subject) :
Bible.|pN.T.|pLuke|xCommentaries.
Facet/display:
Bible.N.T.Luke -- Commentaries.

Use double hyphen for all subfields:
650, 651
(double hyphens between subfields):
subfield -- subfield (period on end).

Sample data:
650 (subject - general):
Gardens|xSocial aspects|zChina|zBeijing|xHistory|y18th century.
Facet/display:
Gardens -- Social aspects -- China -- Beijing -- History -- 18th century.

All non-empty indexed values.

Ordering of multiple values:
Subject headings in the display of an individual bib record should follow their order in the XML.

All non-empty indexed values.

The hyphenation pattern is the same as for the display fields.

650 (subject - general):
Gardens|xSocial aspects|zChina|zBeijing|xHistory|y18th century.
Facet/display:
Gardens -- Social aspects -- China -- Beijing -- History -- 18th century.

Additional facets to be indexed:
1. Gardens -- Social aspects
2. Gardens -- Social aspects -- China
3. Gardens -- Social aspects -- China -- Beijing
4. Gardens -- Social aspects -- China -- Beijing -- History
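
The punctuation rules for 600/610/611 and 650/651 and the additional hierarchical facets can be sketched in Python as follows. This is only a minimal sketch, not the OLE implementation: the function names are hypothetical, the input uses the `|x` sample notation from this page (the first, uncoded subfield is assumed to be $a), and 630 and the other subject tags are deliberately left out.

```python
NAME_TAGS = {"600", "610", "611"}   # subfields separated by a single space
HYPHEN_TAGS = {"650", "651"}        # subfields separated by " -- "


def subfield_values(raw):
    """Split 'Gardens|xSocial aspects' into ['Gardens', 'Social aspects'].

    Assumes the sample notation used on this page: subfields are separated
    by '|' and every subfield after the first starts with its one-letter code.
    """
    first, *rest = raw.split("|")
    return [first] + [part[1:] for part in rest if part]


def display_value(tag, raw):
    """Build the facet/display string for one subject field."""
    values = subfield_values(raw)
    if tag in NAME_TAGS:
        return " ".join(values)
    if tag in HYPHEN_TAGS:
        return " -- ".join(values)
    raise ValueError(f"tag {tag} is not covered by this sketch")


def additional_facets(raw):
    """Return the intermediate prefixes (2 subfields up to all but the last)."""
    values = subfield_values(raw)
    return [" -- ".join(values[:n]) for n in range(2, len(values))]


sample_650 = "Gardens|xSocial aspects|zChina|zBeijing|xHistory|y18th century."
print(display_value("650", sample_650))
for facet in additional_facets(sample_650):
    print(facet)
```

For the 650 sample, `additional_facets` yields the four prefixes listed above; the full heading itself is already indexed as the regular facet value.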

Date of Publication

<marc:controlfield tag="008">[Date 1 in the 7-10 positions]</marc:controlfield> LR: Can also include 260 $c (260 $c holds the same value as the control field; use it if the control field does not have a publication date value). (MV)

Some records have data other than a date, or an improperly formed date, in 260 $c. So for now 260 $c is ignored for publication date.

first value

all values

same as search field.

Keep decades, but add centuries and "current year".
<controlfield tag="008">010108q10001999dcu b f000 0 eng d</controlfield>
In the above example, 1000-1999 means the record should appear under every century facet (1000s, 1100s, etc.) and every decade in 1000-1999.
Eliminate "odd" facets: we don't want "s", "0000s", or any years beyond the current year (manual removal of 2020s and 9990s).
Only want single decades and single centuries.
<controlfield tag="008">001214k08000999xx zzz a n lat d</controlfield>

E.g.:
Two records with publication dates 1905 and 1913.
The facets will be computed as:
20th century (2) -- 1900 to 1999.
1900s (1) -- 1900 to 1909.
1910s (1) -- 1910 to 1919.
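
The decade and century computation can be sketched as below. It is an illustration under assumptions, not the actual indexer code: the function names are hypothetical, the labels follow the worked example ("20th century", "1900s"), and the "odd facet" rule is reduced to dropping non-positive years and years beyond a caller-supplied current year.

```python
from collections import Counter


def ordinal(n):
    """Return 1 -> '1st', 2 -> '2nd', 11 -> '11th', 20 -> '20th', ..."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def date_facets(year, current_year):
    """Century and decade facet labels for a single publication year."""
    if year <= 0 or year > current_year:
        return []  # assumption: such values produce the unwanted "odd" facets
    century = f"{ordinal(year // 100 + 1)} century"
    decade = f"{year // 10 * 10}s"
    return [century, decade]


def range_facets(start, end, current_year):
    """Facet labels for a date range such as 1000-1999 (every century and decade)."""
    labels = []
    for decade_start in range(start // 10 * 10, end + 1, 10):
        for label in date_facets(max(decade_start, start), current_year):
            if label not in labels:
                labels.append(label)
    return labels


def facet_counts(years, current_year):
    """Count how many records fall under each facet label."""
    counts = Counter()
    for year in years:
        counts.update(date_facets(year, current_year))
    return counts


print(facet_counts([1905, 1913], current_year=2013))
```

With the two records from the example, the counts are 2 for "20th century" and 1 each for "1900s" and "1910s".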

Edition

250 - a,b (MV)

first value

all values

same as search field

Form/Genre

655 - a, v (MV)

first value

all values

same as search field

Language

<marc:controlfield tag="008">[language code in the 35-37 positions]</marc:controlfield> LR: Add 546 $a (MV)
Language Codes (iso-639-3)

all values

all values

same as search field

Format

856 - q
245 - h  

LR: Format is very tricky b/c many MARC fields/subfields can be used to determine format.  

I think we could also consider adding the following:

Leader 06/07 - these are Type of Record and Bibliographic Level; each is a single-letter code, and when combined they seem to map to the UPenn bib_format field. (See the worksheet now named Format - marc)
007 Physical Description Fixed Field-General Information - the character positions indicate physical format information. See http://www.loc.gov/marc/bibliographic/bd007.html
655 $a, $v (Genre/Form)
300 $e, $3 (Extent: Accompanying Material, Materials specified)
337 $a (Media Type)
338 $a (Carrier Type)
340 $a, $e, $m, $3 (Physical Medium: Material base, support, book format, materials specified)
Possibly other 3XX fields/subfields.

In the end, we may want to just touch base with some SMEs (Gwyneth? Bob? Stuart?) to determine specifically what we could/should include.
(MV)

first value

all values

same as search field
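
Both Date of Publication and Language are read from fixed character positions of the 008 control field (positions counted from 0). A minimal sketch of that extraction follows; the function name is hypothetical, and the sample value is the first 008 from this page re-spaced to the full 40 characters, which is an assumption because the whitespace in the page samples appears collapsed.

```python
def parse_008(field_008):
    """Return (Date 1 as int or None, language code) from a 40-character 008."""
    if len(field_008) != 40:
        raise ValueError("an 008 control field is expected to be 40 characters long")
    date1 = field_008[7:11]       # positions 07-10
    language = field_008[35:38]   # positions 35-37
    year = int(date1) if date1.isdigit() else None
    return year, language


# Hypothetical full-width version of the sample 008 shown above.
sample_008 = "010108" + "q" + "1000" + "1999" + "dcu" + " " * 17 + "eng" + " " + "d"
print(parse_008(sample_008))
```

A non-numeric Date 1 (for example one containing fill characters) comes back as `None`, which matches the decision above to leave 260 $c out of the publication date for now.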

...

1.4 Format field definitions for Work-Bib-MARC documents

Label

Marc Fields

Comments

Manuscript

Has any holdings with "manuscripts" in location_name (gets only this value)

LR: MARC XML does not have location_name, so this is irrelevant to the IU data that OLE has for November. Manuscript could be determined by the Leader 06/07. 06 values a, f, t equal manuscripts on their own. 07 values c and d seem to imply manuscript/archival collections/series. We should check with the SMEs on this one.

Microformat

Has 245 $h containing "micro" OR has any holdings with "micro" in location_name OR call_number starts with "micro" (gets only this value)

LR: the 245 $h "micro" will work for the IU OLE MARCXML we have, but the remaining text is specific to UPenn.

Archive

Has any holdings with "archive" in location_name (gets only this value)

LR: This is specific to UPenn. We may need to ask IU whether they include Archive descriptions in their MARC records and how they designate them as such.

Thesis/Dissertation

bib_format is 'tm' AND has a 502 field

LR: UPenn's bib_format seems to be a combination of the data values found in the 06/07 Leader fields. For example, t in the 06 is Manuscript and m in the 07 is Monograph/Item, and together they equal a Thesis/Dissertation.

Conference/Event

Has a 111 or 711 field [LR: Include 611 or 811]

 

Book

bib_format is 'aa', 'am' or 'ac' or 'tm'; exclude $h [micro*] and $k [kit]

LR: the 2 characters are from the Leader 06/07; the $h and $k exclusions are 245 subfields.

Sound recording

bib_format is 'im' or 'jm' or 'jc' or 'jd' or 'js'

LR: the 2 characters are from the Leader 06/07

Musical score

bib_format is cm, dm, ca, cb, cd or cs

LR: the 2 characters are from the Leader 06/07

Map/Atlas

bib_format is 'e*' or 'fm'

LR: the 2 characters are from the Leader 06/07

Video

bib_format is 'gm' AND 007/0 = v

LR: the 2 characters are from the Leader 06/07

Projected graphic

bib_format is 'gm' AND 007/0 = g

LR: 007 is a controlled field that indicates the format/physical description at general level and then associated subfields are more specific.

Journal/Periodical

bib_format is 'as' or 'gs'

LR: 007 is a controlled field that indicates the format/physical description at general level and then associated subfields are more specific.

Image

bib_format is 'km'

LR: the 2 characters are from the Leader 06/07

Datafile

bib_format is 'mm'

LR: the 2 characters are from the Leader 06/07

Newspaper

bib_format is 'as' AND (008/21 = 'n' OR 008/22 = 'e' )

LR: the 2 characters are from the Leader 06/07. The 008 controlled field in those 2 positions provides the "form".

3D object

bib_format is 'r*'

LR: the single character maps to the 06 position in the leader.

Database/Website

bib_format is '*i'

LR: the single character maps to the 07 position in the leader.

Government document

bib_format is NOT c*, d*, i*, j* AND (008/28 = f, i, o AND 260 $b not 'press')

LR: the single character maps to the 06 position in the leader. 008 is a fixed length controlled field and 260 $b is a type of publication.

Other

any bib_format not caught above

LR: Presumably relates to other 06/07 Leader data values not represented.
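
A rough sketch of how the Leader-based rows of this table could be evaluated is shown below. It rests on several assumptions that the table does not state: rules are tried in table order and the first match wins, `bib_format` is simply Leader positions 06 and 07 concatenated, and the rules that need holdings, 245, 502, 008 or 260 data (Manuscript, Microformat, Archive, Thesis/Dissertation, Conference/Event, Newspaper, Government document, and the 245 exclusions for Book) are omitted. All names are hypothetical.

```python
from fnmatch import fnmatchcase

# (label, bib_format patterns) in the order they appear in the table above.
LEADER_RULES = [
    ("Book", ["aa", "am", "ac", "tm"]),
    ("Sound recording", ["im", "jm", "jc", "jd", "js"]),
    ("Musical score", ["cm", "dm", "ca", "cb", "cd", "cs"]),
    ("Map/Atlas", ["e*", "fm"]),
    ("Journal/Periodical", ["as", "gs"]),
    ("Image", ["km"]),
    ("Datafile", ["mm"]),
    ("3D object", ["r*"]),
    ("Database/Website", ["*i"]),
]


def bib_format(leader):
    """Leader/06 (type of record) + Leader/07 (bibliographic level)."""
    return leader[6:8]


def classify(leader, field_007=""):
    """Return a format label using only the Leader and, for 'gm', 007/0."""
    code = bib_format(leader)
    if code == "gm" and field_007[:1] == "v":
        return "Video"
    if code == "gm" and field_007[:1] == "g":
        return "Projected graphic"
    for label, patterns in LEADER_RULES:
        if any(fnmatchcase(code, pattern) for pattern in patterns):
            return label
    return "Other"


# Hypothetical 24-character leader with 'a' in position 06 and 'm' in 07.
print(classify("00000nam a2200000 a 4500"))
```

Keeping the rules in a list makes the evaluation order explicit, which matters here because several patterns overlap (for example 'tm' and 'as' appear in more than one row).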

...

Field

DC-UnQ fields for Search

DC-Q fields for Search

Data fields for short display

Data fields for detailed display

Data fields for Facet

Author

<dc:creator> 

<dcvalue element="contributor" qualifier="author">

first value

All non-empty
indexed values

All non-empty
indexed values

Description

<dc:description> (MV)
Per Bob P.: Show only <dc:description>.

Per Bob P.:  Do not show Abstract description.
[show blank]

first value

all values

same as search field

Language

<dc:language>  (MV)
Language Codes (iso-639-3)

<dcvalue element="language" qualifier="iso">en_US</dcvalue>
Language Codes (iso-639-1-cc)

first value

all values

same as search field

Subject

<dc:subject> (R)

<dcvalue element="subject" qualifier="none">

first value

all values

same as search field

Title

<dc:title>

<dcvalue element="title" qualifier="none">

first value

all values

same as search field

Type

<dc:type> (MV)

<dcvalue element="type" qualifier="none">

first value

all values

same as search field

Date of Publication

<dc:date>

<dcvalue element="date" qualifier="issued">

first value

all values

same as search field

Format

<dc:format> (MV)

<dcvalue element="type"> (This is covered in a separate field, so do not include it in Format.)
<dcvalue element="format" qualifier="mimetype">

??? (LR: In looking back at the MARC to Qualified DC mapping it is not entirely clear, but it should be both the format and type elements.)

first value

all values

same as search field

Publisher

<dc:publisher> (MV)

<dcvalue element="publisher">

??? (KG/LR: publisher. It doesn't appear in the crosswalk, but that could be because the UMD dataset did not include that tag.)

first value

all values

same as search field

ISBN/ISSN/other

<dc:identifier>(ISSN)0198-9669</dc:identifier>  (MV)
 <dc:identifier>(ISBN)0306710382</dc:identifier> (MV)

<dcvalue element="identifier" qualifier="isbn">0-918006-48-1</dcvalue>

first value

all values

same as search field
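
In the unqualified Dublin Core samples above, the identifier scheme appears as a parenthesised prefix, e.g. `(ISSN)0198-9669`. The sketch below splits such a value into scheme and number. The prefix convention is inferred only from those two samples, and the function name is hypothetical.

```python
import re

# "(SCHEME)value", as in the <dc:identifier> samples on this page.
IDENTIFIER_PATTERN = re.compile(r"^\((?P<scheme>[A-Za-z]+)\)(?P<value>.+)$")


def parse_dc_identifier(text):
    """Return (scheme, value); scheme is None when there is no '(XXX)' prefix."""
    text = text.strip()
    match = IDENTIFIER_PATTERN.match(text)
    if match is None:
        return None, text
    return match.group("scheme").upper(), match.group("value").strip()


print(parse_dc_identifier("(ISSN)0198-9669"))
print(parse_dc_identifier("(ISBN)0306710382"))
```

Values without a recognised prefix are returned with a scheme of `None`, which would correspond to the "other" case of the ISBN/ISSN/other field.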

...

Output: "To", "be,", "or", "what?"

2.3.2 Synonym Filtering

Synonym filtering is the process of synonym mapping. Each token is looked up in the list of synonyms, and if a match is found, the synonym is emitted in place of the token. The position values of the new tokens are set so that they all occur at the same position as the original token.
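
The position handling can be illustrated with a small Python sketch. It is not the search engine's filter, only a model of the behaviour described here; the synonym map and function name are made up for the example.

```python
def synonym_filter(tokens, synonyms):
    """Replace tokens by their synonyms, keeping the original token position.

    tokens   -- list of token strings, in order
    synonyms -- dict mapping a token to the list of synonyms emitted in its place
    Returns a list of (position, token) pairs.
    """
    output = []
    for position, token in enumerate(tokens):
        # All replacements share the position of the token they replace.
        for replacement in synonyms.get(token, [token]):
            output.append((position, replacement))
    return output


# Hypothetical synonym list.
SYNONYMS = {"tv": ["television", "telly"]}
print(synonym_filter(["tv", "show"], SYNONYMS))
```

Because both replacements sit at position 0, a phrase query such as "television show" would still see the two words as adjacent.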

...

The file named stopwords.txt specifies such words. Currently they are:


an and are as at

be but by

for

if in into is it

no not

of on or

s such

t that the their then there these they this to

was will with
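
As an illustration of what stop word removal does with this list, here is a minimal Python sketch. The word list is copied from the block above; the case-insensitive comparison and the function name are assumptions, not something this page specifies.

```python
STOPWORDS = frozenset("""
an and are as at
be but by
for
if in into is it
no not
of on or
s such
t that the their then there these they this to
was will with
""".split())


def remove_stopwords(tokens, ignore_case=True):
    """Drop every token that appears in the stop word list."""
    kept = []
    for token in tokens:
        key = token.lower() if ignore_case else token
        if key not in STOPWORDS:
            kept.append(token)
    return kept


print(remove_stopwords(["To", "be", "or", "what"]))
```

With case-insensitive matching only "what" survives; with `ignore_case=False` the capitalised "To" would be kept as well.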

...

Field Info/Attributes

Field Attribute

Purpose

Example

Id

Unique identifier of a field with a given [category, type, format]

id="ISBN_search"

Name

Name of the field suitable for display

name="ISBN"

Type

Indicates the value type of the field (for informational purposes only)

type="text"

Field Definition Example:


       <field id="ISBN_search" name="ISBN" type="text">
          <mapping type="custom">
              <include>020-a;z</include>
              <exclude/>                       
          </mapping>
       </field>

...

Mapping Info/Attributes

Mapping Attribute

Purpose

Example

Type

Indicates how the mapping info is to be interpreted

type="custom"

Include

Values to be included

<include>020-a;z</include>

Exclude

Values to be excluded

<exclude/>
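
To show how the field and mapping attributes fit together, the sketch below reads the ISBN field definition with Python's standard `xml.etree.ElementTree` module. The reading of `020-a;z` as "tag 020, subfields a and z" is inferred from the ISBN row in section 1.3; how several tags would be written in one `<include>` is not shown on this page, so only the single-tag form is handled. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

FIELD_XML = """
<field id="ISBN_search" name="ISBN" type="text">
   <mapping type="custom">
       <include>020-a;z</include>
       <exclude/>
   </mapping>
</field>
"""


def parse_include(expression):
    """Split '020-a;z' into ('020', ['a', 'z'])."""
    tag, _, subfields = expression.partition("-")
    codes = [code.strip() for code in subfields.split(";") if code.strip()]
    return tag.strip(), codes


def load_field(xml_text):
    """Collect the field attributes and its mapping into a plain dict."""
    field = ET.fromstring(xml_text)
    mapping = field.find("mapping")
    return {
        "id": field.get("id"),
        "name": field.get("name"),
        "type": field.get("type"),
        "mapping_type": mapping.get("type"),
        "include": parse_include(mapping.findtext("include", default="")),
    }


print(load_field(FIELD_XML))
```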

xpath Mapping Example:


       <field id="ContractNumber_search" name="Contract Number" type="text">
          <mapping type="xpath">
              <include>/publicationsLicenseExpression/licenseDetail/licenseIdentifier/IDValue/value</include>                       
          </mapping>
       </field>
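
For an xpath mapping, the `<include>` value is a path into the document itself. The sketch below evaluates the Contract Number path against a tiny stand-in document using only the Python standard library. The sample XML and its value are invented for illustration, namespaces (which a real ONIX-PL file may use) are not handled, and only plain `/a/b/c` paths are supported.

```python
import xml.etree.ElementTree as ET

CONTRACT_NUMBER_PATH = (
    "/publicationsLicenseExpression/licenseDetail/licenseIdentifier/IDValue/value"
)

# Hypothetical, heavily reduced license document.
SAMPLE_LICENSE = """
<publicationsLicenseExpression>
  <licenseDetail>
    <licenseIdentifier>
      <IDValue><value>CN-001</value></IDValue>
    </licenseIdentifier>
  </licenseDetail>
</publicationsLicenseExpression>
"""


def evaluate_simple_path(xml_text, path):
    """Return the text of every element matched by an absolute /a/b/c path."""
    steps = [step for step in path.split("/") if step]
    root = ET.fromstring(xml_text)
    if not steps or root.tag != steps[0]:
        return []
    if len(steps) == 1:
        return [root.text or ""]
    # ElementTree paths are relative to the root element, so drop the first step.
    return [element.text or "" for element in root.findall("/".join(steps[1:]))]


print(evaluate_simple_path(SAMPLE_LICENSE, CONTRACT_NUMBER_PATH))
```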

...