From the requirements, the decision points are:
1. The 2nd indicator in MARC gives the number of non-filing characters. These must be skipped before sort/display standards are applied. Ex. 245 1 3 $aAn April Shower: the 2nd indicator is "3", so ignore the first 3 characters, i.e. "An(space)", when applying sort rules.
2. NISO standards, Section 3: follow the sort order of characters very closely for Search Results display and for Browse/More display of Facets (the main results view of facets is still ordered by number of hits, high to low).
3. NISO standards, Section 4, Headings: the current choice is "word by word" (4.1.2.1). The following are in word-by-word sort order: cream, cream cheese, cream corn.
4. NISO standards, Section 7, Symbols: the current choice is #7.1 for ASCII.
5. We are NOT yet addressing any non-Roman/Unicode characters, i.e. the treatment of Chinese, Russian, etc. We will still index or sort on their "romanized" values.
The implementation still needs to address NISO standards #5 Abbreviations and #6 Numbering.
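The non-filing rule in decision point 1 and the word-by-word choice in decision point 3 can be sketched as follows. This is a minimal illustration in Python, not the OLE/Solr implementation: the function names `strip_nonfiling` and `word_by_word_key` are hypothetical, and treating a blank or non-numeric indicator as 0 is an assumption.

```python
def strip_nonfiling(title: str, second_indicator: str) -> str:
    """Drop the leading characters counted by the MARC 245 second indicator."""
    # Assumption: a blank or non-digit indicator means "skip nothing".
    is_digit = second_indicator.isascii() and second_indicator.isdigit()
    skip = int(second_indicator) if is_digit else 0
    return title[skip:]


def word_by_word_key(heading: str) -> list[str]:
    """Compare headings one whole word at a time (NISO 4.1.2.1)."""
    # A list of words sorts a shorter heading ahead of any longer heading
    # that starts with the same words, e.g. "cream" before "cream cheese".
    return heading.lower().split()


# (title, second indicator) pairs; the values are illustrative only.
records = [("An April Shower", "3"), ("April Fools", "0"), ("The Apple Tree", "4")]
ordered = sorted(records, key=lambda r: word_by_word_key(strip_nonfiling(*r)))

headings = sorted(["cream corn", "cream", "cream cheese"], key=word_by_word_key)
```

With these inputs, "An April Shower" is filed under "April Shower", and the three "cream" headings come out in the word-by-word order given in decision point 3.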
| NISO Rule/Recommendation | Meaning | Example | Implemented? (Y/N) | SME Decisions | Example | Implementation Status | Comments | |||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The 2nd indicator in MARC gives the number of non-filing characters. These must be skipped before sort/display standards are applied. | Ex. 245 1 3 $aAn April Shower: the 2nd indicator is "3", so ignore the first 3 characters, i.e. "An(space)", when applying sort rules. | Apply for MARC indexing, especially on titles, and possibly publishers, subjects, and corporate authors, prior to applying the sort orders below |
| Implemented in 0.8f | |||||||||
3 | Order of Spaces |
|
| of Characters | The basic order of characters should be in the following sequence: | follow Sort order of characters very closely for Search Results display, and Browse/More display of Facets (main results view of facets is still by # of hits, hi to lo) | $$$ and sense | Implemented |
| |||||
3.1 | Spaces | If the data contains more than one space then it should be treated as a single space |
| Y | Implemented |
| ||||||||
3.2 | Punctuation Marks Treated as Spaces( -,---,/) | The hyphen, dash (of any length), or slash is to be treated as a space. |
| N | Need to replace hyphen with a space in _sort field | Implemented in 0.8f | ||||||||
3.3 | Punctuation Marks Ignored (other than -,---,/) | The following punctuation marks should be disregarded for arrangement purposes: period (full stop), comma, semi-colon, colon, parentheses, square brackets, angle brackets, braces (curved brackets), apostrophe, quotation marks (single or double), exclamation mark, question mark. They are not to be treated as spaces. |
| N | Need to remove these chars in _sort field. Ambassador hotel | Implemented in 0.8f | ||||||||
3.4 | Symbols Other Than Numerals, Letters and Punctuation Marks | Such symbols are arranged after a space but before a numeral. | These symbols should be in the given order: ¥ , $$ , %, $10 | N | Need to investigate solr. | ¥ £ $ exchange | Implemented |
| ||||||
3.5 | Numerals (0 through 9) | All data beginning with a numeral should be arranged ahead of any data beginning with a letter. | Y | 007 James Bond | Implemented |
| ||||||||
3.6 | Letters (A through Z) | The records should be arranged in the order of the English alphabet (upper case and lower case have equal arrangement value) |
| N | convert _sort field to lower case. | Abalone | Implemented in 0.8f | |||||||
3.6.1 | Modified Letters | Letters modified by diacritical marks and ligatures of two letters should be arranged like their nearest basic equivalent letters in the English alphabet |
| N | Need to investigate solr. Bob: This is OK for now -- may have to be refined later, in that some European languages alphabetize letters with diacritics separately from their base letter. | á, à, â, å, ä are arranged as a | Implemented in 0.8f | |||||||
3.7 | Superscript and Subscript Characters | Superscript and subscript characters are arranged as "on-the-line" characters. Basic characters followed by both sub- and superscript characters are arranged in the sequence: basic character - subscript - superscript. |
| N | Need to investigate solr. | Should be implemented. This can happen in Roman and non-Roman chars. Non-Roman chars will be taken up in future sprint. Need sample data also for research. | H2 | Implemented in 0.8f |
| |||||
4. | Headings |
| Choice for current is "word by word" (4.1.2.1). | The following are in word-by-word sort order: cream, cream cheese, cream corn. |
|
| ||||||||
4.1 | Arrangement of Headings | Headings shall be arranged exactly as written, printed or otherwise displayed. The |
|
|
| |||||||||
4.1.1 | Single-Word Headings | Data consisting of a single word precedes any data beginning with the same word and followed by other words. |
| New | Y Implemented |
| ||||||||
4.1.2 | Multi-word Headings (Word-by-Word) | This method is preferred, because it keeps together data beginning with the same word (or words). | Order is: N. E. Zenith Co.; networks; new moon; Newton, Isaac; Newton's rings | N | Can be done by modifying _sort field. |
|
|
|
|
| Use 4.1.2.1 Word-by-Word application of Headings arrangement (do not apply 4.1.2.2 letter-by-letter) | networks | Implemented in 0.8f | |
4.2 | Headings with Qualifiers | The parentheses and square brackets are ignored when the data is like: bill (Bank note), Bill Clinton, bill (weapon) |
| N | Can be done by modifying _sort field. Qualifying or explanatory terms are integral parts of a heading and should be arranged |
| bill (bank note) | Implemented in 0.8f | ||||||
4.3 | Headings with Identical Initial Words | Data beginning with identical initial words should be arranged in the following sequence. |
| N | Can be done by modifying _sort field. | New | Implemented in 0.8f | |||||||
4.4 | Headings with Cross-References | Cross-references are not part of a heading, and therefore do not affect the arrangement of a heading. |
| N | Cannot identify cross references. No need to do anything for MARC and other formats. In case it is required for non-MARC formats, Bob will let us know. | fathers see parents | Implementation not needed (as the sortable data does not have cross-references) | Difficult to identify cross references. | ||||||
4.5 | Subheadings | Subheadings are normally arranged in alphanumeric sequence. Subheadings are subject to the same arrangement rules as the headings they modify. | Nothing to do here as there are no subheadings for sortable fields. | memory | Implementation not needed (as the sortable data does not have subheadings) | No subheadings seen in the data | ||||||||
4.6 | Headings Beginning with Articles | Data beginning with Articles (a,an and the) are displayed in ascending order. | See Marc indicators. Bob- if Dublin Core or other format, should we use generic rules if "A, An, The" used at beginning of heading? Ignore and start with next full word? Bob: yes, but we'll need a longer list of initial words to ignore, including the most common foreign ones (El, Le, La, Il, etc.) See chart at http://en.wikipedia.org/wiki/Article_%28grammar%29 for example | A man | Implemented in 0.8f | |||||||||
5 | Abbreviations | Abbreviations should be alphabetized exactly as written, not as spelled out. | Ignore punctuation chars. | Order is : | Implemented in 0.8f | |||||||||
6 | Numbers |
|
|
|
|
| ||||||||
6.1 | Headings Containing Numbers | Numbers at the beginning of or within the data should be arranged in arithmetical order and sorted in ascending order. Headings beginning with numbers written in Arabic numerals should be sorted in ascending arithmetical order before headings beginning with a letter | Can the index treat numbers as whole entities, rather than digit by digit? The former is preferable -- if the latter, then "apt.11a" will come before "apt.7a". But it won't come up that often, so if it has to go digit-by-digit, we can live with that. | 007 James Bond | Not implemented as per NISO. But the current implementation (digit-by-digit) is acceptable. | The ordering is digit by digit. (Difficult to order by value) | ||||||||
6.2 | Punctuation in Numbers | Punctuation in numbers, as in other text, has no arrangement value (and sorted in ascending order). |
| $5000 reward | Implemented in 0.8f | |||||||||
6.3 | Decimal Fractions | Decimal fractions should be arranged according to their arithmetical value (and sorted in ascending order). | Digit-by-digit is ok | 0.25 mm | Not implemented as per NISO. But the current implementation (digit-by-digit) is acceptable. | |||||||||
6.4 | Roman Numbers | Roman numbers should be arranged by their arithmetical value. To achieve this, the sequence | See text and also notes/jira on non-roman characters | 17 days to better living | Implemented in 0.8f | Cannot identify Roman numbers. | ||||||||
7 | Arrangement of Symbols Other than Numerals and Letters | symbols, whether single or forming a contiguous sequence, are arranged after a space but before any numerals or letters |
| see image- for special character handling |
| |||||||||
7.1 | Arrangement in Standardized sequence | Symbols that form part of a standardized sequence. for example, ASCII (ANSI X3.4, American National Standard Code for Information Interchange) | Choice for current is #7.1 for ASCII. | # | Implemented |
| ||||||||
7.2 | Arrangement in Order of Appearance | Not recommended as per Jira: OLE-2194 | Do not use |
| Not in scope |
| ||||||||
7.3 | Arrangement by Verbal Equivalent | Not recommended as per Jira: OLE-2194 | Do not use |
| Not in scope |
| ||||||||
| Non-Roman (OLE 0.8) |
| We are NOT yet addressing any non-Roman/Unicode characters for full features or indexing, i.e. the treatment of Chinese, Russian, etc., but we will still index or sort on their "romanized" values, and we need display and editing of diacritics/non-Roman characters to be available: https://jira.kuali.org/browse/OLE-2934 |
|
|
|
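As a rough illustration of how the `_sort` field changes noted for rules 3.1 through 3.7 and 6.2 fit together, the sketch below builds a normalized sort key in Python. It is not the actual Solr analyzer chain used in OLE: the function name `sort_key` is hypothetical, and the exact sets of dash and quotation characters are assumptions.

```python
import re
import unicodedata

# Rule 3.2: hyphen, dash (of any length) and slash are treated as a space.
AS_SPACE = re.compile(r"[-\u2010-\u2015/]")
# Rule 3.3: these punctuation marks are dropped, not turned into spaces.
IGNORED = re.compile(r"[.,;:()\[\]<>{}'\"!?\u2018\u2019\u201c\u201d]")


def sort_key(text: str) -> str:
    """Build a normalized value suitable for a _sort field (illustrative)."""
    # Rules 3.6.1 and 3.7: decompose, then drop combining marks so that
    # modified letters file as their base letter and sub/superscript
    # digits file as on-the-line digits.
    decomposed = unicodedata.normalize("NFKD", text)
    key = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    key = AS_SPACE.sub(" ", key)              # rule 3.2
    key = IGNORED.sub("", key)                # rules 3.3 and 6.2
    key = re.sub(r"\s+", " ", key).strip()    # rule 3.1: collapse spaces
    return key.lower()                        # rule 3.6: case-insensitive


# Numbers are still compared digit by digit, as accepted under rule 6.1.
apartments = sorted(["apt.7a", "apt.11a"], key=sort_key)
```

Because the key is a plain string, "apt.11a" still files before "apt.7a", which matches the digit-by-digit behaviour recorded as acceptable for rules 6.1 and 6.3.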