org.apache.avro.util.UtfTextUtils

public class UtfTextUtils extends Object

Text utilities especially suited for UTF-encoded bytes.

When the character set is unknown, methods in this class assume UTF-encoded text and try to detect the UTF variant (8/16/32 bits, big/little endian). Detection uses the BOM if present; otherwise it makes an educated guess that assumes the first character is in the range U+0000-U+00FF. This heuristic works for all Latin-based text formats, including Avro IDL, JSON, and XML. If the heuristic fails, UTF-8 is assumed.
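
The BOM-less guess described above can be illustrated with a minimal sketch. It is not the Avro implementation: the class name UtfGuessSketch and the method guessWithoutBom are hypothetical, and the exact byte patterns checked are an assumption about how such a heuristic can be built. The idea is that a first character in U+0000-U+00FF occupies a single non-zero byte, so the position of the surrounding zero bytes reveals the code unit width and byte order.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; the names here are hypothetical, not part of Avro.
class UtfGuessSketch {

    /** Guesses the UTF variant of BOM-less text from where the zero bytes fall. */
    static Charset guessWithoutBom(byte[] b) {
        if (b.length >= 4) {
            // In UTF-32, a character in U+0001-U+00FF is one non-zero byte plus three zero bytes.
            if (b[0] == 0 && b[1] == 0 && b[2] == 0 && b[3] != 0) {
                return Charset.forName("UTF-32BE");
            }
            if (b[0] != 0 && b[1] == 0 && b[2] == 0 && b[3] == 0) {
                return Charset.forName("UTF-32LE");
            }
        }
        if (b.length >= 2) {
            // In UTF-16, the same character is one non-zero byte plus one zero byte.
            if (b[0] == 0 && b[1] != 0) {
                return StandardCharsets.UTF_16BE;
            }
            if (b[0] != 0 && b[1] == 0) {
                return StandardCharsets.UTF_16LE;
            }
        }
        // No recognizable pattern: fall back to UTF-8, as the class description states.
        return StandardCharsets.UTF_8;
    }
}
```

The UTF-32 patterns are checked first because a UTF-32LE character such as 7B 00 00 00 also matches the two-byte UTF-16LE pattern.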

Constructor Summary

- UtfTextUtils()

Method Summary

- static String asString(byte[] bytes, Charset charset)
- static Charset detectUtfCharset(byte[] firstFewBytes)
  
  Assuming UTF encoded bytes, detect the UTF variant (8/16/32 bits, big/little endian).
- static String readAllBytes(InputStream input, Charset charset)
  
  Reads the specified input stream as text.
- static String readAllChars(Reader input)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- UtfTextUtils
  
  public UtfTextUtils()
Method Details
- asString
  
  public static String asString(byte[] bytes, Charset charset)
- readAllBytes
  
  public static String readAllBytes(InputStream input, Charset charset) throws IOException
  
  Reads the specified input stream as text. If charset is null, the method assumes UTF-encoded text and attempts to detect the appropriate charset.
  
  Parameters:
  
  input - the input to read
  
  charset - the character set of the input, if known
  
  Returns:
  
  all bytes of the input, decoded into a string
  
  Throws:
  
  IOException - if reading the input fails
- readAllChars
  
  public static String readAllChars(Reader input) throws IOException
  
  Throws:
  
  IOException
- detectUtfCharset
  
  public static Charset detectUtfCharset(byte[] firstFewBytes)
  
  Assuming UTF-encoded bytes, detect the UTF variant (8/16/32 bits, big/little endian).
  For the most accurate detection, the algorithm requires at least 4 bytes. Provide fewer than 4 bytes only if that is all the data there is.
  
  Detection is certain when a byte order mark (BOM) is used. Otherwise a heuristic is used, which works when the first character is one of the first 256 characters of the BMP (U+0000-U+00FF). This holds for all Latin-based textual formats, such as Avro IDL, JSON, YAML, and XML.
  
  Parameters:
  
  firstFewBytes - the first few bytes of the text to detect the character set of
  
  Returns:
  
  the character set to use
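
The BOM-based part of the detection, combined with the null-charset behaviour documented for readAllBytes, can be sketched with the JDK alone. This is a minimal illustration, not the Avro source: the class BomSketch and the methods detectByBom and readText are hypothetical names, InputStream.readAllBytes requires Java 9 or later, and stripping the decoded BOM character is a choice made by this sketch rather than documented behaviour of UtfTextUtils.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; the names here are hypothetical, not part of Avro.
class BomSketch {

    /** Returns the charset indicated by a leading BOM, or null if there is none. */
    static Charset detectByBom(byte[] b) {
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        // Check UTF-32LE before UTF-16LE: FF FE 00 00 begins with the UTF-16LE BOM FF FE.
        // (FF FE 00 00 could also be UTF-16LE text starting with U+0000; this sketch ignores that case.)
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) {
            return Charset.forName("UTF-32LE");
        }
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) {
            return Charset.forName("UTF-32BE");
        }
        if (startsWith(b, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        if (startsWith(b, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the whole stream as text; a null charset triggers BOM detection with a UTF-8 fallback. */
    static String readText(InputStream input, Charset charset) throws IOException {
        byte[] bytes = input.readAllBytes(); // Java 9+
        if (charset != null) {
            return new String(bytes, charset);
        }
        Charset detected = detectByBom(bytes);
        String text = new String(bytes, detected != null ? detected : StandardCharsets.UTF_8);
        // Assumption of this sketch: drop the decoded BOM character (U+FEFF) if present.
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }
}
```

A caller that already knows the encoding passes it explicitly and skips detection; passing null hands the decision to the BOM check.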
