# cp.text

This module provides support for loading, manipulating, and comparing unicode text data. It works by storing characters with their Unicode 'codepointvalue. In practice, this means that every character is a 64-bit integer, so atextvalue will use substantially more memory than the equivalent encodedstring` value.

The advantages of text over string representations for Unicode are:

comparisons, equality checks, etc. actually work for Unicode text and are not encoding-dependent.
direct access to codepoint values.

The advantages of string representations for Unicode are:

compactness.
reading/writing to files via the standard io library.

# Strings and Unicode

LUA has limited built-in support for Unicode text. string values are "8-bit clean", which means it is an array of 8-bit characters. This is also how binary data from files is usually loaded, as 8-bit 'bytes'. Unicode characters can be up to 32-bits, so there are several standard ways to represent Unicode characters using 8-bit characters. Without going into detail, the most common encodings are called 'UTF-8' and 'UTF-16'. There are two variations of 'UTF-16', depending on the hardware architecture, known as 'big-endian' and 'little-endian'.

The built-in functions for string, such as match, gsub and even len will not work as expected when a string contains Unicode text. As such, this library fills some of the gaps for common operations when working with Unicode text.

# Examples

You can convert to and from string and text values like so:

local text = require("cp.text")

local simpleString		= "foobar"
local simpleText		= text(stringValue)
local utf8String		= "a丽𐐷"				-- contains non-ascii characters, defaults to UTF-8.
local unicodeText		= text "a丽𐐷"			-- contains non-ascii characters, converts from a UTF-8 string.
local utf8String		= tostring(unicodeText) -- `tostring` will default to UTF-8 encoding
local utf16leString		= unicodeText:encode(text.encoding.utf16le) -- or you can be more specific

Note that text values are not in any specific encoding, since they are stored as 64-bit integer code-points rather than 8-bit characers.

# Submodules

cp.text.matcher

# API Overview

Constants - Useful values which cannot be changed

encoding

Functions - API calls offered directly by the extension

Constructors - API calls which return an object, typically one that offers API methods

char
fromCodepoints
fromFile
fromString

Methods - API calls which can only be made on an object returned by a constructor

encode
find
len
match
sub

# API Documentation

# Constants

# encoding


Signature	`cp.text.encoding`
Type	Constant
Description	The list of supported encoding formats.
Notes	The list of supported encoding formats: `utf8` - UTF-8. The most common format on the web, backwards compatible with ANSI/ASCII. `utf16le` - UTF-16 (little-endian). Commonly used in Windows and Mac text files. ** `utf16be` - UTF-16 (big-endian). Alternate 16-bit format, common on Linux and PowerPC-based architectures.
Source	src/extensions/cp/text/init.lua line 65

# Functions

# is


Signature	`cp.text.is(value) -> boolean`
Type	Function
Description	Checks if the provided value is a `text` instance.
Parameters	`value` - The value to check
Returns	`true` if the value is a `text` instance.
Notes	None
Examples	None
Source	src/extensions/cp/text/init.lua line 261

# Constructors

# char


Signature	`cp.text.char(...) -> text`
Type	Constructor
Description	Returns the list of one or more codepoint items into a text value, concatenating the results.
Parameters	`...` - The list of codepoint integers.
Returns	The `cp.text` value for the list of codepoint values.
Notes	None
Examples	None
Source	src/extensions/cp/text/init.lua line 248

# fromCodepoints


Signature	`cp.text.fromCodepoints(codepoints[, i[, j]]) -> text`
Type	Constructor
Description	Returns a new `text` instance representing the specified array of codepoints. Since `i` and `j` default to the first and last indexes of the array, simply passing in the array will convert all codepoints in that array.
Parameters	`codepoints` - The array of codepoint integers. `i` - The starting index to read from codepoints. Defaults to `1`. `j` - The ending index to read from codepoints. Default to `-1`.
Returns	A new `text` instance.
Notes	You can use a negative value for `i` and `j`. If so, it will count back from then end of the `codepoints` array. If the codepoint array begins with a Byte-Order Marker (BOM), the BOM is skipped in the resulting text.
Examples	None
Source	src/extensions/cp/text/init.lua line 167

# fromFile


Signature	`cp.text.fromFile(path[, encoding]) -> text`
Type	Constructor
Description	Returns a new `text` instance representing the text loaded from the specified path. If no encoding is specified, it will attempt to determine the encoding from a leading Byte-Order Marker (BOM). If none is present, it defaults to UTF-8.
Parameters	`value` - The value to turn into a unicode text instance. `encoding` - One of the falues from `text.encoding`: `utf8`, `utf16le`, or `utf16be`. Defaults to `utf8`.
Returns	A new `text` instance.
Notes	None
Examples	None
Source	src/extensions/cp/text/init.lua line 227

# fromString


Signature	`cp.text.fromString(value[, encoding]) -> text`
Type	Constructor
Description	Returns a new `text` instance representing the string value of the specified value. If no encoding is specified, it will attempt to determine the encoding from a leading Byte-Order Marker (BOM). If none is present, it defaults to UTF-8.
Parameters	`value` - The value to turn into a unicode text instance. `encoding` - One of the falues from `text.encoding`: `utf8`, `utf16le`, or `utf16be`. Defaults to `utf8`.
Returns	A new `text` instance.
Notes	Calling `text(value)` is the same as calling `text.fromString(value, text.encoding.utf8)`, so simple text can be initialized via `local x = text "foo"` when the `.lua` file's encoding is UTF-8.
Examples	None
Source	src/extensions/cp/text/init.lua line 132

# Methods

# encode


Signature	`cp.text:encode([encoding]) -> string`
Type	Method
Description	Returns the text as an encoded `string` value.
Parameters	`encoding` - The encoding to use when converting. Defaults to `cp.text.encoding.utf8`.
Returns
Notes	None
Examples	None
Source	src/extensions/cp/text/init.lua line 397

# find


Signature	`cp.text:find(pattern [, init [, plain]])`
Type	Method
Description	Looks for the first match of pattern in the string `value`. If it finds a match, then find returns the indices of `value` where this occurrence starts and ends; otherwise, it returns `nil`. A third, optional numerical argument `init` specifies where to start the search; its default value is `1` and can be negative. A value of `true` as a fourth, optional argument plain turns off the pattern matching facilities, so the function does a plain "find substring" operation, with no characters in pattern being considered "magic". Note that if plain is given, then `init` must be given as well.
Parameters	`pattern` - The pattern to find. `init` - The index to start matching from. Defaults to `1`. `plain` - If `true`, the pattern is treated as plain text.
Returns	the start index, the end index, followed by any captures
Notes	If the pattern has captures, then in a successful match the captured values are also returned, after the two indices.
Examples	None
Source	src/extensions/cp/text/init.lua line 293

# len


Signature	`cp.text:len() -> number`
Type	Method
Description	Returns the number of codepoints in the text.
Parameters	None
Returns	The number of codepoints.
Notes	None
Examples	None
Source	src/extensions/cp/text/init.lua line 381

# match


Signature	`cp.text:match(pattern[, start]) -> ...`
Type	Method
Description	Looks for the first match of the `pattern` in the text value. If it finds one, then match returns the captures from the pattern; otherwise it returns `nil`. If pattern specifies no captures, then the whole match is returned. A third, optional numerical argument `init` specifies where to start the search; its default value is `1` and can be negative.
Parameters	`pattern` - The text pattern to process. `start` - If specified, indicates the starting position to process from. Defaults to `1`.
Returns	The capture results, the whole match, or `nil`.
Notes	None
Examples	None
Source	src/extensions/cp/text/init.lua line 311

# sub


Signature	`cp.text:sub(i [, j]) -> cp.text`
Type	Method
Description	Returns the substring of this text that starts at `i` and continues until `j`; `i` and `j` can be negative.
Parameters	i - See above j - See above
Returns	None
Notes	If `j` is absent, then it is assumed to be equal to `-1` (which is the same as the string length). In particular, the call `cp.text:sub(1,j)` returns a prefix of `s` with length `j`, and `cp.text:sub(-i)` (for a positive `i`) returns a suffix of s with length i.
Examples	None
Source	src/extensions/cp/text/init.lua line 274