Class UtfTextUtils
When the character set is unknown, methods in this class assume UTF-encoded text and try to detect the UTF variant (8/16/32 bits, big/little endian), using the BOM (if present) or an educated guess that assumes the first character is in the range U+0000-U+00FF. This heuristic works for all Latin-based text formats, including Avro IDL, JSON, and XML. If the heuristic fails, UTF-8 is assumed.
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
static String | asString |
static Charset | detectUtfCharset(byte[] firstFewBytes) | Assuming UTF encoded bytes, detect the UTF variant (8/16/32 bits, big/little endian).
static String | readAllBytes(InputStream input, Charset charset) | Reads the specified input stream as text.
static String | readAllChars(Reader input) |
-
Constructor Details
-
UtfTextUtils
public UtfTextUtils()
-
-
Method Details
-
asString
-
readAllBytes
Reads the specified input stream as text. If charset is null, the method will assume UTF encoded text and attempt to detect the appropriate charset.
- Parameters:
input - the input to read
charset - the character set of the input, if known
- Returns:
- all bytes, read into a string
- Throws:
IOException - when reading the input fails for some reason
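The contract described above (drain the stream, then decode with the supplied charset) can be sketched roughly as follows. This is an illustrative stand-in, not the library's implementation: the class name `ReadAllSketch` and method name `readAll` are hypothetical, and the null-charset branch is simplified to fall back directly to UTF-8 instead of running the UTF-variant detection that the documented method performs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names are not part of the documented API.
final class ReadAllSketch {
    static String readAll(InputStream input, Charset charset) throws IOException {
        // Drain the stream into memory in fixed-size chunks.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = input.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        // Simplifying assumption: a null charset means UTF-8 here.
        // The documented method instead tries to detect the UTF variant.
        Charset effective = (charset != null) ? charset : StandardCharsets.UTF_8;
        return new String(buffer.toByteArray(), effective);
    }
}
```

Buffering the whole input before decoding keeps the sketch short and makes the first bytes available for charset sniffing, at the cost of holding the entire input in memory.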
-
readAllChars
- Throws:
IOException
-
detectUtfCharset
Assuming UTF encoded bytes, detect the UTF variant (8/16/32 bits, big/little endian). To ensure the most accurate detection, the algorithm requires at least 4 bytes. Provide fewer than 4 bytes of data only if that is all there is.
Detection is certain when a byte order mark (BOM) is used. Otherwise a heuristic is used, which works when the first character is one of the first 256 characters of the BMP (U+0000-U+00FF). This works for all Latin-based textual formats, like Avro IDL, JSON, YAML, XML, etc.
- Parameters:
firstFewBytes - the first few bytes of the text to detect the character set of
- Returns:
- the character set to use
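To make the BOM-then-heuristic approach concrete, here is a minimal sketch of how such detection could work. It is an independent illustration, not the library's source: the class name `UtfDetectSketch` and method name `detect` are hypothetical, and the zero-byte rules are one plausible reading of "the first character is in U+0000-U+00FF". It also assumes the running JDK provides the UTF-32BE and UTF-32LE charsets, which mainstream OpenJDK builds do but the Java specification does not require.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names are not part of the documented API.
final class UtfDetectSketch {
    // Assumption: these charsets exist in the running JDK.
    static final Charset UTF_32BE = Charset.forName("UTF-32BE");
    static final Charset UTF_32LE = Charset.forName("UTF-32LE");

    // Unsigned byte at index i, or -1 when the array is too short.
    private static int at(byte[] b, int i) {
        return i < b.length ? (b[i] & 0xFF) : -1;
    }

    static Charset detect(byte[] firstFewBytes) {
        int b0 = at(firstFewBytes, 0), b1 = at(firstFewBytes, 1);
        int b2 = at(firstFewBytes, 2), b3 = at(firstFewBytes, 3);

        // Step 1: byte order marks. The UTF-32LE BOM (FF FE 00 00) starts
        // with the UTF-16LE BOM (FF FE), so the 32-bit checks come first.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return UTF_32BE;
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return UTF_32LE;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return StandardCharsets.UTF_8;
        if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
        if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;

        // Step 2: no BOM. A first character in U+0000-U+00FF leaves the
        // unused high-order bytes of a 16- or 32-bit code unit as zero.
        if (b0 == 0 && b1 == 0 && b2 == 0 && b3 >= 0) return UTF_32BE;
        if (b0 >= 0 && b1 == 0 && b2 == 0 && b3 == 0) return UTF_32LE;
        if (b0 == 0 && b1 >= 0) return StandardCharsets.UTF_16BE;
        if (b0 >= 0 && b1 == 0) return StandardCharsets.UTF_16LE;

        // Step 3: nothing matched, so assume UTF-8.
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] sample = "{\"type\":\"string\"}".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(detect(sample));
    }
}
```

The sketch also shows why at least 4 bytes are requested: with only two bytes, `7B 00` looks like UTF-16LE, while the full sequence `7B 00 00 00` would indicate UTF-32LE.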
-