Info: (grep) Character Encoding

 
 3.8 Character Encoding
 ======================
 
 The ‘LC_CTYPE’ locale specifies the encoding of characters in patterns
 and data, that is, whether text is encoded in UTF-8, ASCII, or some
 other encoding.  See ‘Environment Variables’.
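 
    As a rough illustration of how the locale changes what counts as a
 character, the sketch below counts matches of ‘.’ against the UTF-8
 bytes of ‘café’.  It assumes GNU ‘grep’ and sets ‘LC_ALL’, which
 overrides ‘LC_CTYPE’.  The locale name ‘C.UTF-8’ and the variable names
 are assumptions made for this example; the locale may not be installed
 on every system (‘locale -a’ lists the available ones).

```shell
# Sketch only: assumes GNU grep.  The UTF-8 locale name is an assumption;
# run 'locale -a' to see which locales are actually installed.
utf8_locale=C.UTF-8

# 'café' encoded in UTF-8 is 5 bytes: c, a, f, then a 2-byte sequence for é.
# In the C locale every byte is a character, so '.' matches 5 times.
c_count=$(printf 'caf\303\251\n' | LC_ALL=C grep -o '.' | wc -l | tr -d ' ')
echo "matches in C locale: $c_count"

# In a UTF-8 locale the 2-byte sequence is a single character, so '.'
# is expected to match 4 times.  If the locale is missing, grep falls
# back to the C locale and the count stays at 5.
utf8_count=$(printf 'caf\303\251\n' | LC_ALL=$utf8_locale grep -o '.' | wc -l | tr -d ' ')
echo "matches in $utf8_locale: $utf8_count"
```

    Setting the variable on the command line, as here, affects only that
 one invocation of ‘grep’ and leaves the rest of the session unchanged.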
 
    In the ‘C’ or ‘POSIX’ locale, every character is encoded as a single
 byte and every byte is a valid character.  In more-complex encodings
 such as UTF-8, a sequence of multiple bytes may be needed to represent a
 character, and some bytes may be encoding errors that do not contribute
 to the representation of any character.  POSIX does not specify the
 behavior of ‘grep’ when patterns or input data contain encoding errors
 or null characters, so portable scripts should avoid such usage.  As an
 extension to POSIX, GNU ‘grep’ treats null characters like any other
 character.  However, unless the ‘-a’ (‘--binary-files=text’) option is
 used, the presence of null characters in input or of encoding errors in
 output causes GNU ‘grep’ to treat the file as binary and suppress
 details about matches.  See ‘File and Directory Selection’.
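 
    The effect of a null character on output can be sketched as follows.
 The example assumes GNU ‘grep’; the exact wording of the binary-file
 message, and whether it is written to standard output or standard
 error, differs between versions, so it is not reproduced here.  The
 variable name is an assumption made for this example.

```shell
# Sketch only: assumes GNU grep.
# The input is one line with a null character between 'foo' and 'bar'.

# Without -a, the null character makes grep treat the input as binary:
# it reports only that the input matches and does not print the line.
printf 'foo\000bar\n' | grep 'foo'

# With -a (--binary-files=text), the matching line is printed in full,
# null character included: 3 + 1 + 3 bytes plus the newline.
text_bytes=$(printf 'foo\000bar\n' | grep -a 'foo' | wc -c | tr -d ' ')
echo "bytes printed with -a: $text_bytes"
```

    The byte count is taken with ‘wc -c’ because a shell variable cannot
 hold a null character, so the output is measured rather than captured.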
 
    Regardless of locale, the 103 characters in the POSIX Portable
 Character Set (a subset of ASCII) are always encoded as a single byte,
 and the 128 ASCII characters have their usual single-byte encodings on
 all but oddball platforms.
