Encoding for Lexile Text Analyzer API Submissions

Overview

The Lexile® Text Analyzer returns a Lexile measure for the text that is submitted. For the Lexile Text Analyzer to return accurate results, the text must be properly edited and encoded.

Only properly edited prose can receive a Lexile measure. Examples of this include most works of fiction and non-fiction, as well as articles from newspapers and magazines. Texts that should not receive a Lexile measure include poems, plays, song lyrics, unconventionally formatted texts, and texts lacking sentence-style punctuation.

The Lexile Text Analyzer considers an entire text when determining a Lexile measure, so the whole, unabridged text should be used for analysis. Although a measure can be determined from any length of text, a Lexile measure obtained from a passage, section, or chapter of a book or article is not an accurate substitute for the measure obtained from its complete source text.

Text Encoding

The Lexile Text Analyzer can only process texts that can be entirely encoded in ASCII (see http://en.wikipedia.org/wiki/ASCII). Texts that include extended characters such as smart or curly quotes or accented characters need to be mapped to the ASCII character set before being processed.

Note: The analyzer will accept UTF-8 (see http://en.wikipedia.org/wiki/UTF-8) encoding as long as the characters represented exist within the cp1252 or latin1 character space.

Texts submitted within these character spaces will be automatically translated to ASCII-space based on recommended mappings.
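
Before submitting, it can help to check whether a text stays within these character spaces. The following is a minimal sketch, not part of the Lexile API: the function name unsupported_characters is hypothetical, and it assumes that a character is acceptable if it can be encoded in either cp1252 or latin-1.

```python
def unsupported_characters(text):
    """Return the sorted characters that fit in neither cp1252 nor latin-1.

    Assumption: a character is acceptable if at least one of the two
    codecs can encode it.
    """
    bad = set()
    for ch in text:
        for codec in ("cp1252", "latin-1"):
            try:
                ch.encode(codec)
                break  # encodable in this codec, so the character is fine
            except UnicodeEncodeError:
                continue
        else:
            # Neither codec could encode the character.
            bad.add(ch)
    return sorted(bad)
```

An empty result suggests the text can be submitted as UTF-8; any characters returned would need to be mapped or removed first.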

Character Encodings and Mapping Recommendations

Where applicable, the following mappings may be applied to convert a Unicode string to ASCII-space.


Unicode Code	Unicode Character Name	Mapped Characters	Mapped Code(s)
U+2002	EN SPACE	“ ”	U+0020
U+2003	EM SPACE	“ ”	U+0020
U+2009	THIN SPACE	“ ”	U+0020
U+00A0	NO-BREAK SPACE	“ ”	U+0020
U+202F	NARROW NO-BREAK SPACE	“ ”	U+0020
U+201D	RIGHT DOUBLE QUOTATION MARK	“"”	U+0022
U+201C	LEFT DOUBLE QUOTATION MARK	“"”	U+0022
U+2019	RIGHT SINGLE QUOTATION MARK	“'”	U+0027
U+00B4	ACUTE ACCENT	“'”	U+0027
U+2039	SINGLE LEFT-POINTING ANGLE QUOTATION MARK	“'”	U+0027
U+203A	SINGLE RIGHT-POINTING ANGLE QUOTATION MARK	“'”	U+0027
U+2010	HYPHEN	“-”	U+002D
U+2011	NON-BREAKING HYPHEN	“-”	U+002D
U+2212	MINUS SIGN	“-”	U+002D
U+2013	EN DASH	“-”	U+002D
U+2027	HYPHENATION POINT	“-”	U+002D
U+00AD	SOFT HYPHEN	“-”	U+002D
U+2014	EM DASH	“--”	U+002D, U+002D
U+2026	HORIZONTAL ELLIPSIS	“...”	U+002E, U+002E, U+002E
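
The table above can be applied in a single pass with Python's str.translate. This is a sketch only: the names PUNCTUATION_MAP and map_punctuation are hypothetical and not part of the Lexile API. NON-BREAKING HYPHEN is entered at its Unicode code point, U+2011.

```python
# Code point -> ASCII replacement, following the table above.
PUNCTUATION_MAP = {
    0x2002: " ",    # EN SPACE
    0x2003: " ",    # EM SPACE
    0x2009: " ",    # THIN SPACE
    0x00A0: " ",    # NO-BREAK SPACE
    0x202F: " ",    # NARROW NO-BREAK SPACE
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x00B4: "'",    # ACUTE ACCENT
    0x2039: "'",    # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    0x203A: "'",    # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    0x2010: "-",    # HYPHEN
    0x2011: "-",    # NON-BREAKING HYPHEN
    0x2212: "-",    # MINUS SIGN
    0x2013: "-",    # EN DASH
    0x2027: "-",    # HYPHENATION POINT
    0x00AD: "-",    # SOFT HYPHEN
    0x2014: "--",   # EM DASH
    0x2026: "...",  # HORIZONTAL ELLIPSIS
}


def map_punctuation(text):
    """Replace the spaces, quotes, dashes and ellipsis listed in the table."""
    return text.translate(PUNCTUATION_MAP)
```

str.translate accepts a dictionary keyed by code point whose values may be strings of any length, which is what allows EM DASH and HORIZONTAL ELLIPSIS to expand to several ASCII characters.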

Letter Normalization And Mappings

NFKD Unicode character normalization should be performed on all characters. In the case of composite characters with Latin bases, the Latin base letter is kept while the modifiers are stripped. For instance, U+00D1 (“LATIN CAPITAL LETTER N WITH TILDE”, Ñ) is decomposed to U+004E U+0303. The U+0303 combining tilde is stripped, leaving U+004E, “LATIN CAPITAL LETTER N”.
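
The normalization step can be sketched with Python's standard unicodedata module. The function name strip_diacritics is hypothetical, and as a simplification this sketch drops every combining mark after decomposition rather than checking that the base letter is Latin.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose with NFKD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for marks such as U+0303 (tilde).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, strip_diacritics("\u00d1") decomposes Ñ to U+004E U+0303 and returns "N".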

Some Basic Latin mappings suggested by the Unicode Consortium can be found in the following document: http://www.unicode.org/charts/PDF/U0000.pdf.

The following table includes a few additional character mappings.


Unicode Code	Unicode Character Name	Mapped Characters	Mapped Code(s)
U+00C6	LATIN CAPITAL LETTER AE	“AE”	U+0041, U+0045
U+01FC	LATIN CAPITAL LETTER AE WITH ACUTE	“AE”	U+0041, U+0045
U+0152	LATIN CAPITAL LIGATURE OE	“OE”	U+004F, U+0045
U+00E6	LATIN SMALL LETTER AE	“ae”	U+0061, U+0065
U+01FD	LATIN SMALL LETTER AE WITH ACUTE	“ae”	U+0061, U+0065
U+0153	LATIN SMALL LIGATURE OE	“oe”	U+006F, U+0065
U+0149	LATIN SMALL LETTER N PRECEDED BY APOSTROPHE	“'n”	U+0027, U+006E
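
These letter mappings can be applied the same way as a lookup table. The sketch below uses hypothetical names (LETTER_MAP, map_letters). It assumes the mappings are applied before NFKD normalization, since letters such as Æ and Œ have no Unicode decomposition and would otherwise pass through normalization unchanged.

```python
# Code point -> ASCII replacement, following the table above.
LETTER_MAP = {
    0x00C6: "AE",  # LATIN CAPITAL LETTER AE
    0x01FC: "AE",  # LATIN CAPITAL LETTER AE WITH ACUTE
    0x0152: "OE",  # LATIN CAPITAL LIGATURE OE
    0x00E6: "ae",  # LATIN SMALL LETTER AE
    0x01FD: "ae",  # LATIN SMALL LETTER AE WITH ACUTE
    0x0153: "oe",  # LATIN SMALL LIGATURE OE
    0x0149: "'n",  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
}


def map_letters(text):
    """Replace the ligatures and digraph letters listed in the table."""
    return text.translate(LETTER_MAP)
```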

Other Characters

Characters remaining in the text that have not been mapped to an ASCII-encodable representation will be stripped before the text is analyzed by the Lexile Text Analyzer.
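
To preview what the text will look like after this stripping, the effect can be approximated locally. This is a sketch under the assumption that stripping means silently dropping every non-ASCII character; the function name strip_non_ascii is hypothetical, and the analyzer's actual behavior may differ in detail.

```python
def strip_non_ascii(text):
    """Drop every character that has no ASCII encoding."""
    # errors="ignore" discards unencodable characters instead of raising.
    return text.encode("ascii", errors="ignore").decode("ascii")
```

Running this after the mappings and normalization above shows which characters would be lost, which may reveal words that need manual editing before submission.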