Encoding for Lexile Text Analyzer API Submissions

Overview

The Lexile® Text Analyzer returns a Lexile measure for the text that is submitted. For the Lexile Text Analyzer to return accurate results, the text must be properly edited and encoded.

Only properly edited prose can receive a Lexile measure. Examples of this include most works of fiction and non-fiction, as well as articles from newspapers and magazines. Texts that should not receive a Lexile measure include poems, plays, song lyrics, unconventionally formatted texts, and texts lacking sentence-style punctuation.

The Lexile Text Analyzer considers an entire text when determining a Lexile measure, so the whole, unabridged text should be used for analysis. Although a measure can be determined from any length of text, a Lexile measure obtained from a passage, section, or chapter of a book or article is not an accurate substitute for the measure obtained from its complete source text.

Text Encoding

The Lexile Text Analyzer can only process texts that can be entirely encoded in ASCII (see http://en.wikipedia.org/wiki/ASCII). Texts that include extended characters such as smart or curly quotes or accented characters need to be mapped to the ASCII character set before being processed.

Note: The analyzer will accept UTF-8 (see http://en.wikipedia.org/wiki/UTF-8) encoding as long as the characters represented exist within the cp1252 or latin1 character space.

Texts submitted within these character spaces will be automatically translated to ASCII-space based on recommended mappings.
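As a pre-submission check, the sketch below lists characters in a string that fall outside both cp1252 and latin1. It is a minimal illustration only: the function names are hypothetical, and the per-character interpretation of "character space" is an assumption rather than documented analyzer behavior.

```python
def _encodable(ch: str, codec: str) -> bool:
    """Return True if a single character can be encoded with the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def unsupported_characters(text: str) -> list:
    """Hypothetical helper: characters found in neither cp1252 nor latin-1.

    Assumption: a character is acceptable if either codec can represent it.
    """
    return sorted(
        {
            ch
            for ch in text
            if not (_encodable(ch, "cp1252") or _encodable(ch, "latin-1"))
        }
    )
```

An empty result suggests the text stays within the accepted character spaces; any characters returned are candidates for manual mapping before submission.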

Character Encodings and Mapping Recommendations

Where applicable, the following steps may be taken to map a Unicode string to ASCII-space.

Unicode Code  Unicode Character Name                      Mapped Characters  Mapped Code(s)
U+2002        EN SPACE                                    “ ”                U+0020
U+2003        EM SPACE                                    “ ”                U+0020
U+2009        THIN SPACE                                  “ ”                U+0020
U+00A0        NO-BREAK SPACE                              “ ”                U+0020
U+202F        NARROW NO-BREAK SPACE                       “ ”                U+0020
U+201D        RIGHT DOUBLE QUOTATION MARK                 “"”                U+0022
U+201C        LEFT DOUBLE QUOTATION MARK                  “"”                U+0022
U+2019        RIGHT SINGLE QUOTATION MARK                 “'”                U+0027
U+00B4        ACUTE ACCENT                                “'”                U+0027
U+2039        SINGLE LEFT-POINTING ANGLE QUOTATION MARK   “'”                U+0027
U+203A        SINGLE RIGHT-POINTING ANGLE QUOTATION MARK  “'”                U+0027
U+2010        HYPHEN                                      “-”                U+002D
U+2011        NON-BREAKING HYPHEN                         “-”                U+002D
U+2212        MINUS SIGN                                  “-”                U+002D
U+2013        EN DASH                                     “-”                U+002D
U+2027        HYPHENATION POINT                           “-”                U+002D
U+00AD        SOFT HYPHEN                                 “-”                U+002D
U+2014        EM DASH                                     “--”               U+002D, U+002D
U+2026        HORIZONTAL ELLIPSIS                         “...”              U+002E, U+002E, U+002E
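The space and punctuation mappings above can be applied in one pass with Python's built-in str.translate. This is a sketch under stated assumptions, not the analyzer's own code: the names PUNCTUATION_MAP and map_punctuation are hypothetical, and the table is assumed to be complete for the texts being prepared.

```python
# Hypothetical translation table built from the mappings listed above.
# Keys are Unicode code points; values are the ASCII replacements.
PUNCTUATION_MAP = {
    0x2002: " ",    # EN SPACE
    0x2003: " ",    # EM SPACE
    0x2009: " ",    # THIN SPACE
    0x00A0: " ",    # NO-BREAK SPACE
    0x202F: " ",    # NARROW NO-BREAK SPACE
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x00B4: "'",    # ACUTE ACCENT
    0x2039: "'",    # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    0x203A: "'",    # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    0x2010: "-",    # HYPHEN
    0x2011: "-",    # NON-BREAKING HYPHEN (U+2011)
    0x2212: "-",    # MINUS SIGN
    0x2013: "-",    # EN DASH
    0x2027: "-",    # HYPHENATION POINT
    0x00AD: "-",    # SOFT HYPHEN
    0x2014: "--",   # EM DASH
    0x2026: "...",  # HORIZONTAL ELLIPSIS
}


def map_punctuation(text: str) -> str:
    """Replace the listed spaces, quotes, dashes, and ellipses with ASCII."""
    return text.translate(PUNCTUATION_MAP)
```

For example, map_punctuation("\u201cHi\u201d\u2014it\u2019s\u2026") yields the ASCII string "Hi"--it's... with straight quotes throughout.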

Letter Normalization and Mappings

NFKD Unicode character normalization should be performed on all characters. In the case of composite characters with Latin bases, the Latin letter base is kept while the modifiers are stripped. For instance, U+00D1 (“LATIN CAPITAL LETTER N WITH TILDE”, Ñ) is decomposed to U+004E U+0303. The combining tilde (U+0303) is stripped, leaving U+004E, “LATIN CAPITAL LETTER N”.
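The decomposition described above can be sketched with Python's standard unicodedata module. This is a minimal illustration rather than the analyzer's implementation: the function name strip_diacritics is hypothetical, and the sketch removes every combining mark after NFKD decomposition instead of first checking that the base letter is Latin.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose with NFKD, then drop combining marks (the modifiers)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    # such as U+0303 COMBINING TILDE, and 0 for base characters.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
```

Applied to "\u00d1" (Ñ), the function returns "N", matching the worked example in the paragraph above.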

Some Basic Latin mappings suggested by the Unicode Consortium can be found in the following document: http://www.unicode.org/charts/PDF/U0000.pdf.

The following table includes a few additional character mappings.

Unicode Code  Unicode Character Name                       Mapped Characters  Mapped Code(s)
U+00C6        LATIN CAPITAL LETTER AE                      “AE”               U+0041, U+0045
U+01FC        LATIN CAPITAL LETTER AE WITH ACUTE           “AE”               U+0041, U+0045
U+0152        LATIN CAPITAL LIGATURE OE                    “OE”               U+004F, U+0045
U+00E6        LATIN SMALL LETTER AE                        “ae”               U+0061, U+0065
U+01FD        LATIN SMALL LETTER AE WITH ACUTE             “ae”               U+0061, U+0065
U+0153        LATIN SMALL LIGATURE OE                      “oe”               U+006F, U+0065
U+0149        LATIN SMALL LETTER N PRECEDED BY APOSTROPHE  “'n”               U+0027, U+006E
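The additional letter mappings can be handled the same way as punctuation, with a translation table. The sketch below is illustrative only: LETTER_MAP and map_letters are hypothetical names, and running this step before NFKD normalization is an assumption made so that each character is replaced exactly as the table specifies.

```python
# Hypothetical table of the additional letter mappings listed above.
LETTER_MAP = {
    0x00C6: "AE",  # LATIN CAPITAL LETTER AE
    0x01FC: "AE",  # LATIN CAPITAL LETTER AE WITH ACUTE
    0x0152: "OE",  # LATIN CAPITAL LIGATURE OE
    0x00E6: "ae",  # LATIN SMALL LETTER AE
    0x01FD: "ae",  # LATIN SMALL LETTER AE WITH ACUTE
    0x0153: "oe",  # LATIN SMALL LIGATURE OE
    0x0149: "'n",  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
}


def map_letters(text: str) -> str:
    """Expand the listed ligatures and digraph letters to ASCII sequences."""
    return text.translate(LETTER_MAP)
```

For instance, map_letters("\u00c6sop \u0153uvre") returns "AEsop oeuvre".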

Other Characters

Characters remaining in the text that have not been mapped to an ASCII-encodable representation will be stripped before the text is analyzed by the Lexile Text Analyzer.
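To preview what the final stripping step might leave behind, remaining non-ASCII characters can be dropped locally. This is a sketch of the general technique, not the analyzer's code; strip_non_ascii is a hypothetical name, and it assumes any earlier mapping steps have already been applied.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently discards characters outside the ASCII range.
    return text.encode("ascii", errors="ignore").decode("ascii")
```

Comparing the input with the output of strip_non_ascii shows which characters would be lost, which can help catch unmapped symbols before submission.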