Comment: Migrated to Confluence 5.3
Jira Link

https://jira.kuali.org/browse/OLE-2194

From the requirements, these are the decision points:
1. In MARC, the 2nd indicator (e.g. of field 245) gives the number of non-filing characters. Apply it before the sort/display standards. Ex. 245 1 3 $aAn April Shower: the 2nd indicator is "3", so ignore the first 3 characters, i.e. "An(space)", when applying sort rules.
2. NISO standards, Section 3: follow the sort order of characters very closely for Search Results display and for Browse/More display of Facets (the main results view of facets is still ordered by number of hits, high to low).
3. NISO standards, Section 4, Headings: the current choice is "word by word" (4.1.2.1). The following are in word-by-word sort order: cream, cream cheese, cream corn.
4. NISO standards, Section 7, Symbols: the current choice is 7.1, ASCII sequence.
5. We are NOT yet addressing any non-roman/unicode characters, i.e. treatment of Chinese, Russian etc. We will still index or sort on their "romanized" values.

The implementation still needs to address NISO Section 5 (Abbreviations) and Section 6 (Numbers).

 

NISO Rule/Recommendation

Meaning

SME Decisions

Example

Implementation Status

Comments

 

 

 

 

 

 

 

 

In MARC, the 2nd indicator gives the number of non-filing characters. Apply it before the sort/display standards.

Ex. 245 1 3 $aAn April Shower: the 2nd indicator is "3", so ignore the first 3 characters, i.e. "An(space)", when applying sort rules.

Apply for MARC indexing, especially on titles and possibly publishers, subjects, and corporate authors, before applying the sort orders below.

 

Implemented in 0.8f
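
The non-filing rule above can be sketched as follows. This is a minimal illustration, not the OLE implementation: the function name `filing_form` and the idea of passing the indicator as a one-character string are assumptions made for the example.

```python
def filing_form(title: str, nonfiling_indicator: str) -> str:
    """Drop the leading non-filing characters named by the MARC 2nd indicator.

    Hypothetical helper: a blank or non-digit indicator means "skip nothing".
    """
    skip = int(nonfiling_indicator) if nonfiling_indicator.isdigit() else 0
    return title[skip:]


# 245 1 3 $aAn April Shower: indicator "3" skips "An" plus the space.
print(filing_form("An April Shower", "3"))  # April Shower
```

The stripped value would then be passed on to the sort rules in the rows below; the displayed title stays unchanged.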


3

Order of Spaces

 

 

 Characters

The basic order of characters should be in the following sequence:
spaces
symbols other than numerals, letters, and punctuation marks
numerals (0 through 9)
letters (A through Z)

Follow the sort order of characters very closely for Search Results display and Browse/More display of Facets (the main results view of facets is still ordered by number of hits, high to low).

$$$ and sense
1, 2, buckle my shoe
A-1 steak sauce

Implemented

 

3.1

Spaces

Two or more consecutive spaces in the data are treated as a single space.

 

 

Y Implemented

 

3.2

Punctuation Marks Treated as Spaces (-, ---, /)

The hyphen, dash (of any length), or slash is to be treated as a space.

 

N

Need to replace hyphen with a space in _sort field  

Implemented in 0.8f


3.3

Punctuation Marks Ignored (other than  -,---,/)

The following punctuation marks should be disregarded for arrangement purposes: period (full stop), comma, semi-colon, colon, parentheses, square brackets, angle brackets, braces (curved brackets), apostrophe, quotation marks (single or double), exclamation mark, question mark. They are not to be treated as spaces.

 

Ambassador hotel
...and so to bed

Implemented in 0.8f
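
Rules 3.1 to 3.3 could be combined into one normalization step for a sort field, roughly as sketched below. The name `normalize_for_sort` and the exact character sets are assumptions drawn from the rule text; lower-casing is included because rule 3.6 gives upper and lower case equal value.

```python
import re

# Assumed character sets, taken from the lists in rules 3.2 and 3.3.
SPACE_LIKE = "-/\u2010\u2011\u2012\u2013\u2014\u2015"   # hyphen, slash, dashes
IGNORED = ".,;:()[]<>{}'\"!?\u2018\u2019\u201c\u201d"   # dropped, not spaced

_TABLE = {ord(c): " " for c in SPACE_LIKE}
_TABLE.update({ord(c): None for c in IGNORED})


def normalize_for_sort(value: str) -> str:
    """Apply 3.2 (punctuation as space), 3.3 (ignored punctuation), 3.1 (one space)."""
    value = value.translate(_TABLE)
    value = re.sub(r"\s+", " ", value).strip()
    return value.lower()


titles = ["...and so to bed", "Ambassador hotel"]
print(sorted(titles, key=normalize_for_sort))
# ['Ambassador hotel', '...and so to bed']
```

The leading periods are ignored, so "...and so to bed" files under "and" and lands after "Ambassador hotel", as in the example for rule 3.3.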


3.4

Symbols Other Than Numerals, Letters and Punctuation Marks

Such symbols are arranged after a space but before a numeral.

 

N

Need to remove these chars in _sort field.

3.4

Two or more contiguous symbols should be treated as a single character.

 

¥ £ $ exchange
$$$ and sense
% of gain
$10 a day
20 funny stories

Implemented
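
One way to express the character order of Section 3 together with rule 3.4 is a sort key that ranks each character by class. The sketch below assumes punctuation has already been handled and, to reproduce the example order above, gives every symbol the same rank; both points are assumptions, and `niso_key` is a hypothetical name.

```python
def niso_key(value: str) -> list:
    """Rank characters as space < symbol < digit < letter (NISO 3 and 3.4)."""
    key = []
    for ch in value.lower():
        if ch == " ":
            token = (0, "")
        elif ch.isdigit():
            token = (2, ch)
        elif ch.isalpha():
            token = (3, ch)
        else:
            token = (1, "")  # assumption: all symbols share one rank
        # A run of contiguous symbols counts as a single character.
        if token[0] == 1 and key and key[-1][0] == 1:
            continue
        key.append(token)
    return key


titles = ["20 funny stories", "$10 a day", "% of gain",
          "$$$ and sense", "¥ £ $ exchange"]
for title in sorted(titles, key=niso_key):
    print(title)
```

With these inputs the loop prints the five titles in the same order as the example list for rule 3.4.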

 

3.5

Numerals (0 through 9)

All data beginning with a numeral should be arranged ahead of any data beginning with a letter.

 

007 James Bond
James

Implemented

 

3.6

Letters (A through Z)

Records are arranged in the order of the English alphabet (upper case and lower case letters have equal arrangement value).

 

Abalone
abdomen
Ambassador hotel

Implemented in 0.8f


3.6.1

Modified Letters

Letters modified by diacritical marks and ligatures of two letters should be arranged like their nearest basic equivalent letters in the English alphabet

Bob: This is OK for now -- may have to be refined later, in that some European languages alphabetize letters with diacritics separately from their base letter.

This is important and needs to be researched and effort estimated in sprint 0.8f, but should not hold up other items.
Development will be scheduled in a future sprint. Impact on performance to be kept in mind.

á, à, â, å, ä are arranged as a
ñ is arranged as n
ø is arranged as o
æ is arranged as ae
œ is arranged as oe

Implemented in 0.8f
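
A possible folding step for rule 3.6.1 is sketched below using Python's standard `unicodedata` module. Unicode decomposition handles accented letters such as á and ñ, but not ø, æ or œ, so those go through a small explicit table; the table and the name `fold_modified_letters` are assumptions for illustration.

```python
import unicodedata

# Letters and ligatures that Unicode does not decompose (short, assumed table).
SPECIAL = {"ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE"}


def fold_modified_letters(value: str) -> str:
    """Map modified letters to their nearest basic English letters."""
    value = "".join(SPECIAL.get(ch, ch) for ch in value)
    decomposed = unicodedata.normalize("NFD", value)
    # Drop the combining marks that NFD split off from their base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_modified_letters("áàâåä ñ ø æ œ"))  # aaaaa n o ae oe
```

As Bob notes above, some languages file letters with diacritics separately, so a language-aware collation may eventually be needed in place of a flat table like this.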


3.7

Superscript and Subscript Characters

Superscript and subscript characters are arranged as “on-the-line” characters. Basic characters followed by both sub- and superscript characters are arranged in the sequence: basic character - subscript - superscript.

Should be implemented. This can happen in Roman and non-Roman chars. Non-Roman chars will be taken up in future sprint. Need sample data also for research.

H2
H24
H34 

When a character (H) has both a subscript ₂ and a superscript ⁴, it should be coded as H24. The ordering will then be as specified by NISO.

Implemented in 0.8f
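
The coding of sub- and superscripts as on-the-line characters might look like the sketch below. It relies on the compatibility decompositions in Python's `unicodedata` module (for example, U+2082 decomposes to `<sub> 0032`); the function names are hypothetical, and only single-character decompositions are handled.

```python
import unicodedata


def _script(ch: str):
    """Return ('sub' | 'super', base_char) for a sub/superscript, else None."""
    parts = unicodedata.decomposition(ch).split()
    if len(parts) == 2 and parts[0] in ("<sub>", "<super>"):
        return parts[0].strip("<>"), chr(int(parts[1], 16))
    return None


def fold_scripts(value: str) -> str:
    """Put sub/superscripts on the line: basic character, subscripts, superscripts."""
    out, subs, supers = [], [], []
    for ch in value:
        script = _script(ch)
        if script is None:
            out.extend(subs + supers)  # flush the pending run, subscripts first
            subs, supers = [], []
            out.append(ch)
        elif script[0] == "sub":
            subs.append(script[1])
        else:
            supers.append(script[1])
    out.extend(subs + supers)
    return "".join(out)


print(fold_scripts("H\u2082\u2074"))  # H24
print(fold_scripts("H\u2074\u2082"))  # H24 (subscript still placed first)
```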

 

 

 

 

 

 

 

 

4.

Headings

 

The current choice is "word by word" (4.1.2.1).

The following are in word-by-word sort order: cream, cream cheese, cream corn.

 

 

4.1

Arrangement of Headings

Headings shall be arranged exactly as written, printed or otherwise displayed. The
arrangement of a heading among other headings should be based solely on the sequence
of numbers in arithmetical order and on the sequence of the 26 letters of the
English alphabet.

 

 


 

4.1.1

Single-Word Headings

Data consisting of a single word precedes any data beginning with the same word and followed by other words.

 

New
New Zealand

Implemented

 

4.1.2

Multi-word Headings (Word-by-Word)

This method is preferred, because it keeps together data beginning with the same word (or words).

Use 4.1.2.1 Word-by-Word application of Headings arrangement (do not apply 4.1.2.2 letter-by-letter)

networks
New, Agnes
New, Thomas
New Zealand
news agencies
Newton, Isaac

Implemented in 0.8f
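
The difference between word-by-word (4.1.2.1) and the rejected letter-by-letter (4.1.2.2) arrangement can be seen with two small sort keys. This is a sketch under the assumption that punctuation is simply dropped before comparison; the function names are hypothetical.

```python
import re


def word_by_word_key(heading: str) -> tuple:
    """Compare headings one whole word at a time (NISO 4.1.2.1)."""
    # Assumption: punctuation is dropped; a fuller pipeline would apply
    # the Section 3 normalization first.
    cleaned = re.sub(r"[^\w\s]", "", heading.lower())
    return tuple(cleaned.split())


def letter_by_letter_key(heading: str) -> str:
    """The alternative not chosen (4.1.2.2): spaces between words are ignored."""
    return "".join(word_by_word_key(heading))


headings = ["Newton, Isaac", "news agencies", "New Zealand",
            "New, Thomas", "New, Agnes", "networks"]
print(sorted(headings, key=word_by_word_key))
# ['networks', 'New, Agnes', 'New, Thomas', 'New Zealand', 'news agencies', 'Newton, Isaac']
print(sorted(headings, key=letter_by_letter_key))
# ['networks', 'New, Agnes', 'news agencies', 'New, Thomas', 'Newton, Isaac', 'New Zealand']
```

Word-by-word keeps all the "New" headings together, which is the reason given above for preferring it. Because a shorter tuple sorts before a longer one with the same prefix, "New" also files before "New Zealand" (rule 4.1.1).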


4.2

Headings with Qualifiers

Qualifying or explanatory terms are integral parts of a heading and should be arranged
as any other words in the heading. Punctuation marks enclosing or preceding such
terms are ignored.

 

bill (bank note)
Bill Clinton; a life
bill (ornithology)
bill (request for payment)
bill (weapon)

Implemented in 0.8f


4.3

Headings with Identical Initial Words

Data beginning with identical initial words should be arranged in the following sequence:
a) Single-word headings
b) Multi-word headings, including headings with qualifiers

 

New
New York
New (Zealand)

Implemented in 0.8f


4.4

Headings with Cross-References

Cross-references are not part of a heading, and therefore do not affect the arrangement of a heading.

No need to do anything for MARC and other formats. In case it is required for non-MARC formats, Bob will let us know.

fathers see parents
Father’s Day see also Mother’s Day

Implementation not needed  (as the sortable data does not have cross-references)

Difficult to identify cross references.

4.5

Subheadings

Subheadings are normally arranged in alphanumeric sequence. Subheadings are subject to the same arrangement rules as the headings they modify.

Nothing to do here as there are no subheadings for sortable fields.

memory
    Alzheimer’s disease
    and psychoses
    long-term
    loss
    of childhood events
    short-term

Implementation not needed (as the sortable data does not have sub-headings)

No subheadings seen in the data

4.6

Headings Beginning with Articles

Data beginning with articles (a, an, and the) are displayed in ascending order.

See MARC indicators. Question for Bob: for Dublin Core or other formats, should we use generic rules if "A, An, The" appear at the beginning of a heading, i.e. ignore them and start with the next full word? Bob: yes, but we'll need a longer list of initial words to ignore, including the most common foreign ones (El, Le, La, Il, etc.). See the chart at http://en.wikipedia.org/wiki/Article_%28grammar%29 for example.

Articles should be considered for sorting. But if there is a 2nd indicator (in the case of MARC records), it should be enforced.

A man
Man
Man, A see A man
Man, The see The man
The man

Implemented in 0.8f


 

 

 

 

 

 

 

5

Abbreviations

Abbreviations should be alphabetized exactly as written, not as spelled out.

Ignore punctuation characters.

Order is:
A B C
Aarhus
abacus
A.B.C.
abdomen
Cmdr. Smith
CO2 lasers
M. Flip ignorait sa mort
M’Bow, Ahmadu
Mlle. Henriette
Mme. Pompadour
Monsieur Verdoux
Mr. Adams
Mrs. Miniver
No. 10, Downing Street
No and yes

Implemented in 0.8f


 

 

 

 

 

 

 

6

Numbers

 

 

 

 

 

6.1

Headings Containing Numbers

Numbers at the beginning of or within the data should be arranged in arithmetical order and sorted in ascending order. Headings beginning with numbers written in Arabic numerals should be sorted in ascending arithmetical order before headings beginning with a letter sequence.

Can the index treat numbers as whole entities, rather than digit by digit?  The former is preferable -- if the latter, then "apt.11a" will come before "apt.7a".  But it won't come up that often, so if it has to go digit-by-digit, we can live with that.

Digit-by-digit is ok

007 James Bond
2 kinetic sculptors
2-phase flow in turbines
1984, Nineteen Eighty-four
The 14th Amendment
Zero-sum

Not implemented as per NISO. But the current implementation (digit-by-digit) is acceptable.

The ordering is digit by digit. (Difficult to order by value.)
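
To make the trade-off in this row concrete, the sketch below contrasts the accepted digit-by-digit ordering with an ordering by numeric value, using the "apt.7a" and "apt.11a" case raised above. Both key functions are hypothetical illustrations, not the current index behaviour.

```python
import re


def digit_by_digit_key(value: str) -> str:
    """Accepted behaviour: digits are compared one character at a time."""
    return value.lower()


def by_value_key(value: str) -> list:
    """NISO 6.1 behaviour: each run of ASCII digits is compared as a whole number."""
    parts = [p for p in re.split(r"([0-9]+)", value.lower()) if p]
    # Numbers rank before text; the tuple shape keeps ints and strs apart.
    return [(0, int(p), "") if p[0] in "0123456789" else (1, 0, p) for p in parts]


apartments = ["apt.7a", "apt.11a"]
print(sorted(apartments, key=digit_by_digit_key))  # ['apt.11a', 'apt.7a']
print(sorted(apartments, key=by_value_key))        # ['apt.7a', 'apt.11a']
```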

6.2

Punctuation in Numbers

Punctuation in numbers, as in other text, has no arrangement value (and sorted in ascending order).

 

$5000 reward
5,000- and 10,000-year
5000 años de historia

Implemented in 0.8f


6.3

Decimal Fractions

Decimal fractions should be arranged according to their arithmetical value (and sorted in ascending order).

Digit-by-digit is ok

0.25 mm
.30 Vickers machine gun
.303-inch machine guns

Not implemented as per NISO. But the current implementation (digit-by-digit) is acceptable.


6.4

Roman Numbers

Roman numbers should be arranged by their arithmetical value. To achieve this, the sequence
of letters must first be tagged as a number by human intervention, and it may then be
sorted as a Roman numeral, either manually or by an algorithm.

See text and also notes/jira on non-roman characters
BP: may be practically impossible to identify them in library metadata -- probably not worth too much effort

Word-by-word sorting is ok.

17 days to better living
XX century encyclopedia
20 short stories
John II

Implemented in 0.8f

Cannot identify Roman numbers. 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

7

Arrangement of Symbols Other than Numerals and Letters

Symbols, whether single or forming a contiguous sequence, are arranged after a space but before any numerals or letters.

 

see image- for special character handling

 


7.1

Arrangement in Standardized sequence

Symbols that form part of a standardized sequence, for example ASCII (ANSI X3.4, American National Standard Code for Information Interchange), are arranged in that sequence.

Choice for current is #7.1 for ASCII.

ASCII chars will be ordered in ASCII sequence. Ordering of Non-ASCII chars is unspecified.

#
+
&
%
$
*

Implemented
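
Ordering symbols in ASCII sequence amounts to comparing their code points. In the sketch below, placing non-ASCII symbols after all ASCII ones is an assumption, since the decision above leaves their order unspecified; `ascii_sequence_key` is a hypothetical name.

```python
def ascii_sequence_key(symbol: str) -> tuple:
    """Order ASCII symbols by code point; non-ASCII symbols go last (assumption)."""
    code = ord(symbol)
    return (0, code) if code < 128 else (1, code)


symbols = ["#", "+", "&", "%", "$", "*"]
print(sorted(symbols, key=ascii_sequence_key))
# ['#', '$', '%', '&', '*', '+']
```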

 

7.2

Arrangement in Order of Appearance

Not recommended as per Jira: OLE-2194

Do not use

 

Not in scope

 

7.3

Arrangement by Verbal Equivalent

Not recommended as per Jira: OLE-2194

Do not use

 

Not in scope

 

 

Non-Roman (OLE 0.8)

 

We are NOT yet addressing any non-roman/unicode characters for full features or indexing, i.e. treatment of Chinese, Russian etc., but will still index or sort on their "romanized" values, and need display and editing of diacritics/non-roman characters to be available: https://jira.kuali.org/browse/OLE-2934