Class UtfTextUtils

java.lang.Object
org.apache.avro.util.UtfTextUtils

public class UtfTextUtils extends Object
Text utilities especially suited for UTF-encoded bytes.

When the character set is unknown, the methods in this class assume UTF-encoded text and try to detect the UTF variant (8, 16, or 32 bits; big or little endian). Detection uses the BOM if one is present, and otherwise an educated guess that assumes the first character lies in the range U+0000-U+00FF. This heuristic works for all Latin-based text formats, including Avro IDL, JSON, and XML. If the heuristic fails, UTF-8 is assumed.
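
To illustrate why the first-character assumption is enough to tell the variants apart, the sketch below encodes a single `{` (U+007B) in each UTF variant without a BOM and prints the resulting bytes. It uses only the JDK; the class name `FirstCharPatterns` and the helper `hex` are hypothetical names chosen for this example and are not part of Avro.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not part of Avro): shows the byte pattern that a
// first character in U+0000-U+00FF produces in each UTF variant.
final class FirstCharPatterns {
    private FirstCharPatterns() {}

    // Encodes the text with the given charset and renders the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset[] variants = {
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE,
            StandardCharsets.UTF_16LE,
            Charset.forName("UTF-32BE"),
            Charset.forName("UTF-32LE")
        };
        for (Charset cs : variants) {
            System.out.println(cs.name() + ": " + hex("{", cs));
        }
        // UTF-8: 7B
        // UTF-16BE: 00 7B
        // UTF-16LE: 7B 00
        // UTF-32BE: 00 00 00 7B
        // UTF-32LE: 7B 00 00 00
    }
}
```

The position of the zero bytes around the single non-zero byte differs for every variant, which is what a BOM-less guess can rely on.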

  • Constructor Details

    • UtfTextUtils

      public UtfTextUtils()
  • Method Details

    • asString

      public static String asString(byte[] bytes, Charset charset)
    • readAllBytes

      public static String readAllBytes(InputStream input, Charset charset) throws IOException
      Reads the specified input stream as text. If charset is null, the method assumes UTF-encoded text and attempts to detect the appropriate charset.
      Parameters:
      input - the input to read
      charset - the character set of the input, if known
      Returns:
      all bytes, read into a string
      Throws:
      IOException - when reading the input fails for some reason
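
The contract described for readAllBytes (drain the stream, then decode with the given charset) can be sketched with the JDK alone. This is a simplified illustration, not Avro's implementation: the class `StreamText` and method `readAll` are hypothetical names, and a null charset simply falls back to UTF-8 here instead of running any detection.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Avro): reads a whole stream into a String.
final class StreamText {
    private StreamText() {}

    static String readAll(InputStream input, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        // Copy the stream chunk by chunk until end of input is reached.
        while ((read = input.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        // Simplification for this sketch: no detection, null means UTF-8.
        Charset effective = (charset != null) ? charset : StandardCharsets.UTF_8;
        return new String(buffer.toByteArray(), effective);
    }
}
```

Decoding only after all bytes are buffered avoids splitting a multi-byte character across chunk boundaries.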
    • readAllChars

      public static String readAllChars(Reader input) throws IOException
      Throws:
      IOException
    • detectUtfCharset

      public static Charset detectUtfCharset(byte[] firstFewBytes)
      Assuming UTF-encoded bytes, detects the UTF variant (8, 16, or 32 bits; big or little endian).

      For the most accurate detection, the algorithm requires at least 4 bytes. Provide fewer than 4 bytes only if that is all the data there is.

      Detection is certain when a byte order mark (BOM) is present. Otherwise a heuristic is used, which works when the first character is one of the first 256 characters of the BMP (U+0000-U+00FF). This covers all Latin-based textual formats, such as Avro IDL, JSON, YAML, and XML.

      Parameters:
      firstFewBytes - the first few bytes of the text to detect the character set of
      Returns:
      the character set to use
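
The detection described for detectUtfCharset might look roughly like the sketch below: check for a BOM first, then fall back to the zero-byte pattern of a first character in U+0000-U+00FF, and finally to UTF-8. This is an illustration under those stated assumptions, not Avro's actual code; the class `UtfSniffer` and method `sniff` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch (not Avro's implementation) of BOM-then-heuristic detection.
final class UtfSniffer {
    private static final Charset UTF_32BE = Charset.forName("UTF-32BE");
    private static final Charset UTF_32LE = Charset.forName("UTF-32LE");

    private UtfSniffer() {}

    static Charset sniff(byte[] head) {
        // Unsigned value of each of the first four bytes, or -1 if absent.
        int b0 = head.length > 0 ? head[0] & 0xFF : -1;
        int b1 = head.length > 1 ? head[1] & 0xFF : -1;
        int b2 = head.length > 2 ? head[2] & 0xFF : -1;
        int b3 = head.length > 3 ? head[3] & 0xFF : -1;

        // 1. BOM checks. UTF-32 comes first, because the UTF-32LE BOM
        //    (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return UTF_32BE;
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return UTF_32LE;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return StandardCharsets.UTF_8;
        if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
        if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;

        // 2. No BOM: assume the first character is a non-NUL character in
        //    U+0001-U+00FF and look at where the zero bytes fall.
        if (b0 == 0 && b1 == 0 && b2 == 0 && b3 > 0) return UTF_32BE;
        if (b0 > 0 && b1 == 0 && b2 == 0 && b3 == 0) return UTF_32LE;
        if (b0 == 0 && b1 > 0) return StandardCharsets.UTF_16BE;
        if (b0 > 0 && b1 == 0) return StandardCharsets.UTF_16LE;

        // 3. Heuristic failed or input too short: assume UTF-8.
        return StandardCharsets.UTF_8;
    }
}
```

With fewer than 4 bytes the UTF-32 patterns cannot match, which is one reason to pass at least 4 bytes whenever they are available.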