#
cp.text
This module provides support for loading, manipulating, and comparing Unicode text data.
It works by storing characters with their Unicode 'codepoint' value. In practice, this means that every character is a 64-bit integer, so a `text` value will use substantially more memory than the equivalent encoded `string` value.
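As a rough illustration of that trade-off (not the module's actual implementation), the sketch below uses only the standard `utf8` library from Lua 5.3+ to unpack a UTF-8 string into a table of integer codepoints. The helper name `toCodepoints` is hypothetical.

```lua
-- Hypothetical helper: decode a UTF-8 string into a table of codepoints.
local function toCodepoints(s)
    local cps = {}
    for _, cp in utf8.codes(s) do
        cps[#cps + 1] = cp
    end
    return cps
end

-- "a丽𐐷" written with escapes so the source file's encoding does not matter.
local encoded = "a\u{4E3D}\u{10437}"
local cps = toCodepoints(encoded)

print(#encoded)   --> 8   bytes in the UTF-8 string
print(#cps)       --> 3   codepoints
print(#cps * 8)   --> 24  bytes at 64 bits per codepoint, ignoring table overhead
```

The encoded string needs 8 bytes, while three 64-bit integers need 24 before any container overhead is counted.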
The advantages of `text` over `string` representations for Unicode are:
- comparisons, equality checks, etc. actually work for Unicode text and are not encoding-dependent.
- direct access to codepoint values.
The advantages of `string` representations for Unicode are:
- compactness.
- reading/writing to files via the standard `io` library.
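To see why comparisons on encoded strings are encoding-dependent, consider the same three characters stored as UTF-8 and as UTF-16LE bytes. The following is a minimal sketch in plain Lua 5.3+ that assumes well-formed input. The helpers `fromUtf8`, `fromUtf16le` and `sameCodepoints` are hypothetical and are not part of `cp.text`.

```lua
-- The same three characters ("a丽𐐷") as raw bytes in two encodings.
local asUtf8    = "a\u{4E3D}\u{10437}"
local asUtf16le = "\x61\x00\x3D\x4E\x01\xD8\x37\xDC"

-- Hypothetical helper: decode UTF-8 bytes into a table of codepoints.
local function fromUtf8(s)
    return { utf8.codepoint(s, 1, -1) }
end

-- Hypothetical helper: decode UTF-16LE bytes, combining surrogate pairs.
-- Assumes well-formed input; no validation is attempted.
local function fromUtf16le(s)
    local cps, i = {}, 1
    while i <= #s do
        local unit = string.unpack("<I2", s, i)
        i = i + 2
        if unit >= 0xD800 and unit <= 0xDBFF then
            local low = string.unpack("<I2", s, i)
            i = i + 2
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        end
        cps[#cps + 1] = unit
    end
    return cps
end

-- Compare two codepoint tables element by element.
local function sameCodepoints(a, b)
    if #a ~= #b then return false end
    for i = 1, #a do
        if a[i] ~= b[i] then return false end
    end
    return true
end

print(asUtf8 == asUtf16le)                                        --> false
print(sameCodepoints(fromUtf8(asUtf8), fromUtf16le(asUtf16le)))   --> true
```

The byte strings differ even though they hold the same text. Once both are decoded to codepoints, the comparison no longer depends on how the text was encoded.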
#
Strings and Unicode
Lua has limited built-in support for Unicode text. `string` values are "8-bit clean", which means a string is simply an array of 8-bit characters. This is also how binary data from files is usually loaded, as 8-bit 'bytes'. Unicode characters can take up to 32 bits, so there are several standard ways to represent a Unicode character as a sequence of 8-bit characters. Without going into detail, the most common encodings are called 'UTF-8' and 'UTF-16'. There are two variations of 'UTF-16', known as 'big-endian' and 'little-endian', which differ in the byte order of each 16-bit unit (a choice historically tied to the hardware architecture).
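The difference between these encodings is easiest to see at the byte level. The sketch below is plain Lua 5.3+, not the module's own encoder. It encodes single codepoints with the standard `utf8.char` and with a hypothetical `encodeUtf16` helper built on `string.pack`. The `hex` helper is also hypothetical and exists only for display.

```lua
-- Hypothetical helper: encode one codepoint as UTF-16 bytes.
-- `bigEndian` selects the byte order of each 16-bit unit.
-- No validation is done for surrogates or values above U+10FFFF.
local function encodeUtf16(cp, bigEndian)
    local fmt = bigEndian and ">I2" or "<I2"
    if cp < 0x10000 then
        return string.pack(fmt, cp)
    end
    -- Codepoints above U+FFFF are split into a surrogate pair.
    local v = cp - 0x10000
    local high = 0xD800 + (v >> 10)
    local low  = 0xDC00 + (v & 0x3FF)
    return string.pack(fmt, high) .. string.pack(fmt, low)
end

-- Hypothetical helper: render a byte string as space-separated hex.
local function hex(s)
    local out = {}
    for i = 1, #s do
        out[i] = string.format("%02X", s:byte(i))
    end
    return table.concat(out, " ")
end

print(hex(utf8.char(0x4E3D)))            --> E4 B8 BD     (UTF-8)
print(hex(encodeUtf16(0x4E3D, true)))    --> 4E 3D        (UTF-16 big-endian)
print(hex(encodeUtf16(0x4E3D, false)))   --> 3D 4E        (UTF-16 little-endian)
print(hex(encodeUtf16(0x10437, true)))   --> D8 01 DC 37  (surrogate pair)
print(hex(encodeUtf16(0x10437, false)))  --> 01 D8 37 DC
```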
The built-in functions for `string`, such as `match`, `gsub` and even `len`, will not work as expected when a string contains Unicode text. As such, this library fills some of the gaps for common operations when working with Unicode text.
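A short demonstration of the problem, using only standard Lua 5.3+ functions: the byte-oriented `string` functions count and slice bytes, not characters. The `usub` helper is a hypothetical character-aware slice written for this example. It assumes valid UTF-8 and in-range indices, and it is not the module's `sub` method.

```lua
local s = "a\u{4E3D}\u{10437}"   -- "a丽𐐷" as UTF-8

print(#s)            --> 8  (bytes, not characters)
print(utf8.len(s))   --> 3  (characters)

-- Byte-based `sub` cuts through the middle of a multi-byte character,
-- leaving an invalid UTF-8 string behind.
local broken = s:sub(1, 2)
print(utf8.len(broken))   -- fails: nil plus the position of the bad byte (2)

-- Hypothetical helper: slice by character index instead of byte index.
-- Assumes valid UTF-8 and 1 <= i <= j <= utf8.len(str).
local function usub(str, i, j)
    local startByte = utf8.offset(str, i)
    local endByte = utf8.offset(str, j + 1) - 1
    return str:sub(startByte, endByte)
end

print(usub(s, 2, 2) == "\u{4E3D}")   --> true
```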
#
Examples
You can convert between `string` and `text` values like so:
local text = require("cp.text")
local simpleString = "foobar"
local simpleText = text(simpleString)
local utf8String = "a丽𐐷" -- contains non-ASCII characters, defaults to UTF-8.
local unicodeText = text "a丽𐐷" -- contains non-ASCII characters, converts from a UTF-8 string.
local roundTripString = tostring(unicodeText) -- `tostring` will default to UTF-8 encoding
local utf16leString = unicodeText:encode(text.encoding.utf16le) -- or you can be more specific
Note that `text` values are not in any specific encoding, since they are stored as 64-bit integer codepoints rather than 8-bit characters.
#
API Overview
- Constants - Useful values which cannot be changed
  - `encoding`
- Functions - API calls offered directly by the extension
  - `is`
- Constructors - API calls which return an object, typically one that offers API methods
  - `char`
  - `fromCodepoints`
  - `fromFile`
  - `fromString`
- Methods - API calls which can only be made on an object returned by a constructor
  - `encode`
  - `find`
  - `len`
  - `match`
  - `sub`