Encoding for Lexile Text Analyzer API Submissions
Overview
The Lexile® Text Analyzer returns a Lexile measure for the text that is submitted. For the Lexile Text Analyzer to return accurate results, the text must be properly edited and encoded.
Only properly edited prose can receive a Lexile measure. Examples of this include most works of fiction and non-fiction, as well as articles from newspapers and magazines. Texts that should not receive a Lexile measure include poems, plays, song lyrics, unconventionally formatted texts, and texts lacking sentence-style punctuation.
The Lexile Text Analyzer considers an entire text when determining a Lexile measure, so the whole, unabridged text should be used for analysis. Although a measure can be determined from any length of text, a Lexile measure obtained from a passage, section, or chapter of a book or article is not an accurate substitute for the measure obtained from its complete source text.
Text Encoding
The Lexile Text Analyzer can only process texts that can be entirely encoded in ASCII (see http://en.wikipedia.org/wiki/ASCII). Texts that include extended characters such as smart or curly quotes or accented characters need to be mapped to the ASCII character set before being processed.
Submitted texts that contain such extended characters will be automatically translated to ASCII based on the recommended mappings.
Character Encodings and Mapping Recommendations
Where applicable, the following steps may be taken to map a Unicode string to ASCII-space.
Unicode Code | Unicode Character Name | Mapped Characters | Mapped Code(s) |
---|---|---|---|
U+2002 | EN SPACE | “ ” | U+0020 |
U+2003 | EM SPACE | “ ” | U+0020 |
U+2009 | THIN SPACE | “ ” | U+0020 |
U+00A0 | NO-BREAK SPACE | “ ” | U+0020 |
U+202F | NARROW NO-BREAK SPACE | “ ” | U+0020 |
U+201D | RIGHT DOUBLE QUOTATION MARK | “"” | U+0022 |
U+201C | LEFT DOUBLE QUOTATION MARK | “"” | U+0022 |
U+2019 | RIGHT SINGLE QUOTATION MARK | “'” | U+0027 |
U+00B4 | ACUTE ACCENT | “'” | U+0027 |
U+2039 | SINGLE LEFT-POINTING ANGLE QUOTATION MARK | “'” | U+0027 |
U+203A | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK | “'” | U+0027 |
U+2010 | HYPHEN | “-” | U+002D |
U+2011 | NON-BREAKING HYPHEN | “-” | U+002D |
U+2212 | MINUS SIGN | “-” | U+002D |
U+2013 | EN DASH | “-” | U+002D |
U+2027 | HYPHENATION POINT | “-” | U+002D |
U+00AD | SOFT HYPHEN | “-” | U+002D |
U+2014 | EM DASH | “--” | U+002D, U+002D |
U+2026 | HORIZONTAL ELLIPSIS | “...” | U+002E, U+002E, U+002E |
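
The punctuation and whitespace mappings in the table above can be applied client-side before submission. The following is a minimal Python sketch, not the analyzer's own implementation; the names `PUNCTUATION_MAP` and `map_punctuation` are hypothetical and chosen only for illustration.

```python
# Hypothetical helper: applies the whitespace and punctuation mappings
# listed in the table. Keys are Unicode code points, values are the
# ASCII replacement strings.
PUNCTUATION_MAP = {
    0x2002: " ",    # EN SPACE
    0x2003: " ",    # EM SPACE
    0x2009: " ",    # THIN SPACE
    0x00A0: " ",    # NO-BREAK SPACE
    0x202F: " ",    # NARROW NO-BREAK SPACE
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x00B4: "'",    # ACUTE ACCENT
    0x2039: "'",    # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    0x203A: "'",    # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    0x2010: "-",    # HYPHEN
    0x2011: "-",    # NON-BREAKING HYPHEN
    0x2212: "-",    # MINUS SIGN
    0x2013: "-",    # EN DASH
    0x2027: "-",    # HYPHENATION POINT
    0x00AD: "-",    # SOFT HYPHEN
    0x2014: "--",   # EM DASH
    0x2026: "...",  # HORIZONTAL ELLIPSIS
}


def map_punctuation(text: str) -> str:
    """Replace each mapped character with its ASCII equivalent."""
    # str.translate accepts a dict of code point -> replacement string,
    # and the replacement may be longer than one character.
    return text.translate(PUNCTUATION_MAP)
```

For example, `map_punctuation("\u201cHi\u201d\u2014it\u2019s\u2026")` yields `"Hi"--it's...`. Characters that are not keys in the dictionary pass through unchanged.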
Letter Normalization and Mappings
NFKD Unicode character normalization should be performed on all characters. In the case of composite characters with Latin bases, the Latin base letter is kept while the modifiers are stripped. For instance, U+00D1 (“LATIN CAPITAL LETTER N WITH TILDE”, Ñ) is decomposed to U+004E U+0303. The combining tilde (U+0303) is stripped, leaving U+004E, “LATIN CAPITAL LETTER N”.
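The decomposition step can be sketched with Python's standard `unicodedata` module. This is a simplified illustration under one stated assumption: it removes every combining mark left by NFKD, regardless of whether the base letter is Latin. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Assumption of this sketch: any character in a Unicode "Mark"
    # category (Mn, Mc, Me) is treated as a modifier and removed.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
```

With this sketch, `strip_diacritics("\u00d1")` returns `N`: the input decomposes to U+004E U+0303 and the combining tilde is dropped.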
Some Basic Latin mappings suggested by the Unicode Consortium can be found in the following document: http://www.unicode.org/charts/PDF/U0000.pdf.
The following table includes a few additional character mappings.
Unicode Code | Unicode Character Name | Mapped Characters | Mapped Code(s) |
---|---|---|---|
U+00C6 | LATIN CAPITAL LETTER AE | “AE” | U+0041, U+0045 |
U+01FC | LATIN CAPITAL LETTER AE WITH ACUTE | “AE” | U+0041, U+0045 |
U+0152 | LATIN CAPITAL LIGATURE OE | “OE” | U+004F, U+0045 |
U+00E6 | LATIN SMALL LETTER AE | “ae” | U+0061, U+0065 |
U+01FD | LATIN SMALL LETTER AE WITH ACUTE | “ae” | U+0061, U+0065 |
U+0153 | LATIN SMALL LIGATURE OE | “oe” | U+006F, U+0065 |
U+0149 | LATIN SMALL LETTER N PRECEDED BY APOSTROPHE | “'n” | U+0027, U+006E |
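
The additional letter mappings above can be expressed the same way as a lookup table. The sketch below is illustrative only, and `LETTER_MAP` and `map_letters` are hypothetical names. Applying these mappings before NFKD normalization is an assumption of this sketch, since the ordering is not specified here. The motivation is that Æ and Œ have no Unicode decomposition, so NFKD leaves them unchanged, and NFKD decomposes U+0149 to U+02BC U+006E, where U+02BC is not an ASCII apostrophe.

```python
# Hypothetical helper: additional letter mappings from the table above.
LETTER_MAP = {
    0x00C6: "AE",  # LATIN CAPITAL LETTER AE
    0x01FC: "AE",  # LATIN CAPITAL LETTER AE WITH ACUTE
    0x0152: "OE",  # LATIN CAPITAL LIGATURE OE
    0x00E6: "ae",  # LATIN SMALL LETTER AE
    0x01FD: "ae",  # LATIN SMALL LETTER AE WITH ACUTE
    0x0153: "oe",  # LATIN SMALL LIGATURE OE
    0x0149: "'n",  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
}


def map_letters(text: str) -> str:
    """Expand ligatures and digraph letters to ASCII letter sequences."""
    return text.translate(LETTER_MAP)
```

For example, `map_letters("\u00c6sop")` returns `AEsop`.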
Other Characters
Any characters remaining in the text that have not been mapped to an ASCII-encodable representation will be stripped before the text is analyzed by the Lexile Text Analyzer.
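To preview which characters would be lost, the stripping step can be approximated locally. This is a minimal sketch that assumes stripping means dropping every non-ASCII character outright; the function name `strip_non_ascii` is hypothetical, and the analyzer's actual behavior may differ in detail.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards unencodable characters.
    return text.encode("ascii", errors="ignore").decode("ascii")


def has_unmapped_characters(text: str) -> bool:
    """Return True if the text still contains non-ASCII characters."""
    return not text.isascii()
```

Running `has_unmapped_characters` after applying the recommended mappings gives a quick check of whether any content would be removed before analysis.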