GAP (GAPDoc) - Chapter 6: String and Text Utilities

6.1 Text Utilities

This section describes some utility functions for handling texts within GAP. They are used by the functions in the GAPDoc package but may be useful for other purposes as well. We start with some variables containing useful strings and go on with functions for parsing and reformatting text.

6.1-1 WHITESPACE

These variables contain sets of characters which are useful for text processing. They are defined as follows.

6.1-2 TextAttr

The record TextAttr contains strings which can be printed to change the terminal attributes for the following characters. This only works with terminals which understand basic ANSI escape sequences. Try the following example to see if this is the case for the terminal you are using. It shows the effect of the foreground and background color attributes and of the attributes .bold, .blink, .normal, .reverse and .underscore, which can partly be combined.

extra := ["CSI", "reset", "delline", "home"];;
for t in Difference(RecNames(TextAttr), extra) do
  Print(TextAttr.(t), "TextAttr.", t, TextAttr.reset,"\n");
od;

The suggested defaults for colors 0..7 are black, red, green, brown, blue, magenta, cyan, white. But this may be different for your terminal configuration.

The escape sequence .delline deletes the content of the current line and .home moves the cursor to the beginning of the current line.

for i in [1..5] do 
  Print(TextAttr.home, TextAttr.delline, String(i,-6), "\c"); 
  Sleep(1); 
od;

Whenever you use this in some printing routines you should make it optional. Use these attributes only when UserPreference("UseColorsInTerminal"); returns true.
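
A minimal sketch of how a printing routine might respect this preference. The helper name MaybeColored is hypothetical and not part of GAPDoc; it only combines UserPreference with WrapTextAttribute (6.1-3).

# MaybeColored is a hypothetical helper, not a GAPDoc function.
MaybeColored := function(str, attr)
  # apply the attribute only if the user allows colors in the terminal
  if UserPreference("UseColorsInTerminal") = true then
    return WrapTextAttribute(str, attr);
  fi;
  return str;
end;;
Print(MaybeColored("warning", TextAttr.1), "\n");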

6.1-3 WrapTextAttribute

‣ WrapTextAttribute( str, attr ) ( function )

The argument str must be a text given as a GAP string, possibly with markup by escape sequences as in TextAttr (6.1-2). This function returns a string which is wrapped by the escape sequences attr and TextAttr.reset. It takes care of markup in the given string by also appending attr after each TextAttr.reset that occurs in str.

gap> str := Concatenation("XXX",TextAttr.2, "BLUB", TextAttr.reset,"YYY");
"XXX\033[32mBLUB\033[0mYYY"
gap> str2 := WrapTextAttribute(str, TextAttr.1);
"\033[31mXXX\033[32mBLUB\033[0m\033[31m\027YYY\033[0m"
gap> str3 := WrapTextAttribute(str, TextAttr.underscore);
"\033[4mXXX\033[32mBLUB\033[0m\033[4m\027YYY\033[0m"
gap> # use Print(str); and so on to see what it looks like.

6.1-4 FormatParagraph

‣ FormatParagraph( str[, len][, flush][, attr][, widthfun] ) ( function )

This function formats a text given in the string str as a paragraph. The optional arguments have the following meaning:

This function tries to handle markup with the escape sequences explained in TextAttr (6.1-2) correctly.

gap> str := "One two three four five six seven eight nine ten eleven.";;
gap> Print(FormatParagraph(str, 25, "left", ["/* ", " */"]));           
/* One two three four five */
/* six seven eight nine ten */
/* eleven. */

6.1-5 SubstitutionSublist

‣ SubstitutionSublist( list, sublist, new[, flag] ) ( function )

This function looks for (non-overlapping) occurrences of a sublist sublist in a list list (compare PositionSublist (Reference: PositionSublist)) and returns a list where these are substituted with the list new.

The optional argument flag can either be "all" (this is the default if not given) or "one". In the second case only the first occurrence of sublist is substituted.

If sublist does not occur in list then list itself is returned (and not a ShallowCopy(list)).

gap> SubstitutionSublist("xababx", "ab", "a");
"xaax"
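
For comparison, the following sketch uses the flag "one" on the same input and checks the remark about the returned object. The outputs are derived from the description above and not copied from a session.

gap> SubstitutionSublist("xababx", "ab", "a", "one");
"xaabx"
gap> l := "xababx";;
gap> IsIdenticalObj(SubstitutionSublist(l, "zz", "a"), l);
true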

6.1-6 StripBeginEnd

‣ StripBeginEnd( list, strip ) ( function )

Here list and strip must be lists. This function returns the sublist of list which does not contain the leading and trailing entries which are entries of strip. If the result is equal to list then list itself is returned.

gap> StripBeginEnd(" ,a, b,c,   ", ", ");
"a, b,c"
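
A common use is trimming whitespace. This sketch assumes that WHITESPACE (6.1-1) contains at least the space, tab and newline characters.

gap> StripBeginEnd("  \t hello world \n", WHITESPACE);
"hello world"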

6.1-7 StripEscapeSequences

‣ StripEscapeSequences( str ) ( function )

This function returns the string one gets from the string str by removing all escape sequences which are explained in TextAttr (6.1-2). If str does not contain such a sequence then str itself is returned.
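
As an illustration, the marked up string from the WrapTextAttribute (6.1-3) example can be reduced to its plain text. The output shown is what the description above implies.

gap> str := Concatenation("XXX", TextAttr.2, "BLUB", TextAttr.reset, "YYY");;
gap> StripEscapeSequences(str);
"XXXBLUBYYY"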

6.1-8 RepeatedString

‣ RepeatedString( c, len ) ( function )

‣ RepeatedUTF8String( c, len ) ( function )

Here c must be either a character or a string and len is a non-negative number. Then RepeatedString returns a string of length len consisting of copies of c.

In the variant RepeatedUTF8String the argument c is considered as a string in UTF-8 encoding, and it can also be specified as a unicode string or character, see Unicode (6.2-1). The result is a string in UTF-8 encoding which has visible width len as explained in WidthUTF8String (6.2-3).

gap> RepeatedString('=',51);
"==================================================="
gap> RepeatedString("*=",51);
"*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*"
gap> s := "bäh";;
gap> enc := GAPInfo.TermEncoding;;
gap> if enc <> "UTF-8" then s := Encode(Unicode(s, enc), "UTF-8"); fi;
gap> l := RepeatedUTF8String(s, 8);;
gap> u := Unicode(l, "UTF-8");;
gap> Print(Encode(u, enc), "\n");
bähbähbä
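
One possible application, shown here only as a sketch, is underlining a heading with a rule of matching length (the variable name title is chosen for illustration):

gap> title := "Text Utilities";;
gap> Print(title, "\n", RepeatedString('-', Length(title)), "\n");
Text Utilities
--------------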

6.1-9 NumberDigits

‣ NumberDigits( str, base ) ( function )

‣ DigitsNumber( n, base ) ( function )

The argument str of NumberDigits must be a string consisting only of an optional leading '-' and characters in 0123456789abcdefABCDEF, describing an integer in base base with \(2 \leq \textit{base} \leq 16\). This function returns the corresponding integer.

The function DigitsNumber does the reverse: it returns a string describing the integer n in base base.

gap> NumberDigits("1A3F",16);
6719
gap> DigitsNumber(6719, 16);
"1A3F"
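
The following sketch shows a negative binary number and a round trip through another base. The outputs follow from the description above.

gap> NumberDigits("-101", 2);
-5
gap> DigitsNumber(255, 2);
"11111111"
gap> NumberDigits(DigitsNumber(6719, 7), 7);
6719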

6.1-10 LabelInt

‣ LabelInt( n, type, pre, post ) ( function )

The argument n must be an integer in the range from 1 to 5000, while pre and post must be strings.

The argument type can be one of "Decimal", "Roman", "roman", "Alpha", "alpha".

The function returns a string that starts with pre, followed by n written as a decimal number, as a roman numeral, or as an alphabetic label (in capital or small letters, according to type), followed by post.

gap> List([1,2,3,4,5,691], i-> LabelInt(i,"Decimal","","."));
[ "1.", "2.", "3.", "4.", "5.", "691." ]
gap> List([1,2,3,4,5,691], i-> LabelInt(i,"alpha","(",")"));
[ "(a)", "(b)", "(c)", "(d)", "(e)", "(zo)" ]
gap> List([1,2,3,4,5,691], i-> LabelInt(i,"Alpha","",".)"));
[ "A.)", "B.)", "C.)", "D.)", "E.)", "ZO.)" ]
gap> List([1,2,3,4,5,691], i-> LabelInt(i,"roman","","."));
[ "i.", "ii.", "iii.", "iv.", "v.", "dcxci." ]
gap> List([1,2,3,4,5,691], i-> LabelInt(i,"Roman","",""));
[ "I", "II", "III", "IV", "V", "DCXCI" ]

6.1-11 PositionMatchingDelimiter

‣ PositionMatchingDelimiter( str, delim, pos ) ( function )

Here str must be a string and delim a string with two different characters. This function searches for the smallest position r of the character delim[2] in str such that the number of occurrences of delim[2] in str between positions pos+1 and r is one greater than the corresponding number of occurrences of delim[1]. If there is no such position, fail is returned.

gap> PositionMatchingDelimiter("{}x{ab{c}d}", "{}", 0);
fail
gap> PositionMatchingDelimiter("{}x{ab{c}d}", "{}", 1);
2
gap> PositionMatchingDelimiter("{}x{ab{c}d}", "{}", 6);
11
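
A typical use is extracting the content of a nested group. In this sketch p is the position of an opening brace and q the position of its matching closing brace; the variable names are chosen for illustration only.

gap> str := "{}x{ab{c}d}";;
gap> p := 4;;   # position of the opening brace of interest
gap> q := PositionMatchingDelimiter(str, "{}", p);
11
gap> str{[p+1..q-1]};
"ab{c}d"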

6.1-12 WordsString

‣ WordsString( str ) ( function )

This returns the list of words of a text stored in the string str. All non-letters are considered as word boundaries and are removed.

gap> WordsString("one_two \n    three!?");
[ "one", "two", "three" ]

6.1-13 Base64String

The function Base64String translates arbitrary binary data given as a GAP string into a base 64 encoded string. This encoded string contains only printable ASCII characters and is used in various data transfer protocols (MIME encoded emails, weak password encryption, ...). We use the specification in RFC 2045.

The function StringBase64 has the reverse functionality. Here we also accept the characters -_ instead of +/ as the last two characters of the encoding alphabet. Whitespace is ignored.

gap> b := Base64String("This is a secret!");
"VGhpcyBpcyBhIHNlY3JldCE="
gap> StringBase64(b);                       
"This is a secret!"
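
The two remarks about decoding can be sketched as follows. The inputs are constructed by hand: the first contains whitespace, the second pair differs only in the use of -_ versus +/. The outputs follow from the description above.

gap> StringBase64("VGhp cyBp\ncyBh");
"This is a"
gap> StringBase64("-_8=") = StringBase64("+/8=");
true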

6.2 Unicode Strings

The GAPDoc package provides some tools to deal with unicode characters and strings. These can be used for recoding text strings between various encodings.

6.2-1 Unicode Strings and Characters

‣ Unicode( list[, encoding] ) ( operation )

‣ IntListUnicodeString( ustr ) ( function )

Unicode characters are described by their codepoint, an integer in the range from \(0\) to \(2^{21}-1\). For details about unicode, see https://www.unicode.org.

The function UChar wraps an integer num into a GAP object lying in the filter IsUnicodeCharacter. Use Int to get the codepoint back. The argument num can also be a GAP character which is then translated to an integer via IntChar (Reference: IntChar).

Unicode produces a GAP object in the filter IsUnicodeString. This is a wrapped list of integers for the unicode characters in the string. The function IntListUnicodeString gives access to this list of integers. Basic list functionality is available for IsUnicodeString elements. The entries are in IsUnicodeCharacter. The argument list for Unicode is either a list of integers or a GAP string. In the latter case an encoding can be specified as string, its default is "UTF-8".

Currently supported encodings can be found in UNICODE_RECODE.NormalizedEncodings (ASCII, ISO-8859-X, UTF-8 and aliases). The encoding "XML" means an ASCII encoding in which non-ASCII characters are specified by XML character entities. The encoding "URL" is for URL-encoded (also called percent-encoded) strings, as specified in RFC 3986. The listed encodings "LaTeX" and aliases cannot be used with Unicode. See the operation Encode (6.2-2) for mapping a unicode string to a GAP string.

gap> ustr := Unicode("a and \366", "latin1");
Unicode("a and ö")
gap> ustr = Unicode("a and &#246;", "XML");  
true
gap> IntListUnicodeString(ustr);
[ 97, 32, 97, 110, 100, 32, 246 ]
gap> ustr[7];
'ö'
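
The wrapping of single codepoints described above can be sketched as follows; the outputs follow from the statement that Int returns the codepoint.

gap> Int(UChar(246));
246
gap> Int(UChar('a'));
97
gap> IntListUnicodeString(Unicode([97, 246]));
[ 97, 246 ]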

6.2-2 Encode

‣ Encode( ustr[, encoding] ) ( operation )

‣ SimplifiedUnicodeString( ustr[, encoding][, "single"] ) ( function )

‣ LowercaseUnicodeString( ustr ) ( function )

‣ UppercaseUnicodeString( ustr ) ( function )

‣ SimplifiedUnicodeTable ( global variable )

‣ LowercaseUnicodeTable ( global variable )

The operation Encode translates a unicode string ustr into a GAP string in some specified encoding. The default encoding is "UTF-8".

Supported encodings can be found in UNICODE_RECODE.NormalizedEncodings. Except for some cases mentioned below characters which are not available in the target encoding are substituted by '?' characters.

If the encoding is "URL" (see Unicode (6.2-1)) then an optional argument encreserved can be given, it must be a list of reserved characters which should be percent encoded; the default is to encode only the % character.

The encoding "LaTeX" substitutes non-ASCII characters and LaTeX special characters by LaTeX code as given in an ordered list LaTeXUnicodeTable of pairs [codepoint, string]. If you have a unicode character for which no substitution is contained in that list, you will get a warning and the translation is Unicode(nr). In this case find a substitution and add a corresponding [codepoint, string] pair to LaTeXUnicodeTable using AddSet (Reference: AddSet). Please also tell the GAPDoc authors about your addition, so that we can extend the list LaTeXUnicodeTable. (Most of the initial entries were generated from lists in the TeX projects encTeX and ucs.) There are some variants of this encoding:

"LaTeXleavemarkup" does the same translations for non-ASCII characters but leaves the LaTeX special characters (e.g., any LaTeX commands) as they are.

"LaTeXUTF8" does not give a warning about unicode characters without explicit translation; instead it translates the character to its UTF-8 encoding. Make sure to set up your LaTeX document such that all these characters are understood.

Note that the "LaTeX" encoding can only be used with Encode but not for the opposite translation with Unicode (6.2-1) (which would need far too complicated heuristics).

The function SimplifiedUnicodeString can be used to substitute many non-ASCII characters by related ASCII characters or strings (e.g., by a corresponding character without accents). The argument ustr and the result are unicode strings. If encoding is "ASCII" then all non-ASCII characters are translated, otherwise only the non-latin1 characters. If the string "single" is given as an argument then only substitutions are considered which don't make the result string longer. The translations are stored in a sorted list SimplifiedUnicodeTable. Its entries are of the form [codepoint, trans1, trans2, ...]. Here trans1 and so on is either an integer for the codepoint of a substitution character or a list of codepoint integers. If you are missing characters in this list and know a sensible ASCII approximation, then add an entry (with AddSet (Reference: AddSet)) and tell the GAPDoc authors about it. (The initial content of SimplifiedUnicodeTable was mainly generated from the transtab tables by Markus Kuhn.)

The function LowercaseUnicodeString gets and returns a unicode string and translates each uppercase character to its corresponding lowercase version. This function uses a list LowercaseUnicodeTable of pairs of codepoint integers. This list was generated using the file UnicodeData.txt from the unicode definition (field 14 in each row).

The function UppercaseUnicodeString performs the analogous translation to uppercase characters.

gap> ustr := Unicode("a and &#246;", "XML");
Unicode("a and ö")
gap> SimplifiedUnicodeString(ustr, "ASCII");
Unicode("a and oe")
gap> SimplifiedUnicodeString(ustr, "ASCII", "single");
Unicode("a and o")
gap> ustr2 := UppercaseUnicodeString(ustr);;
gap> Print(Encode(ustr2, GAPInfo.TermEncoding), "\n");
A AND Ö
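
The effect of the target encoding on the resulting GAP string can be sketched as follows. The outputs follow from the description above: a character not available in ASCII becomes '?', and the umlaut needs two bytes in UTF-8 but one byte in ISO-8859-1.

gap> ustr := Unicode("a and &#246;", "XML");;
gap> Encode(ustr, "ASCII");
"a and ?"
gap> Length(Encode(ustr, "UTF-8"));
8
gap> Length(Encode(ustr, "ISO-8859-1"));
7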

6.2-3 Lengths of UTF-8 strings

Let str be a GAP string with text in UTF-8 encoding. There are three lengths of such a string which must be distinguished. The operation Length (Reference: Length) returns the number of bytes and so the memory occupied by str. The function NrCharsUTF8String returns the number of unicode characters in str, that is the length of Unicode(str).

In many applications the function WidthUTF8String is more interesting, it returns the number of columns needed by the string if printed to a terminal. This takes into account that some unicode characters are combining characters and that there are wide characters which need two columns (e.g., for Chinese or Japanese). (To be precise: This implementation assumes that there are no control characters in str and uses the character width returned by the wcwidth function in the GNU C-library called with UTF-8 locale.)

gap> # A, German umlaut u, B, zero width space, C, newline
gap> str := Encode( Unicode( "A&#xFC;B&#x200B;C\n", "XML" ) );;
gap> Print(str);
AüBC
gap> # umlaut u needs two bytes and the zero width space three
gap> Length(str);
9
gap> NrCharsUTF8String(str);
6
gap> # zero width space and newline don't contribute to width
gap> WidthUTF8String(str);
4

6.2-4 InitialSubstringUTF8String

‣ InitialSubstringUTF8String( str, maxwidth[, suf] ) ( function )

The arguments str and suf each must be a GAP string with text in UTF-8 encoding or a unicode string. The argument suf is optional and its default value is the empty string. If the visible width of str is at most maxwidth then str is returned as UTF-8 encoded GAP string. Otherwise, suf is appended to the maximal initial substring of str such that the total visible width of the result is at most maxwidth.

gap> # A, German umlaut u, B, zero width space, C, newline
gap> str := Encode( Unicode( "A&#xFC;B&#x200B;C\n", "XML" ) );;
gap> ini := InitialSubstringUTF8String(str, 3);;
gap> WidthUTF8String(ini);
3
gap> IntListUnicodeString(Unicode(ini));
[ 65, 252, 66, 8203 ]
gap> l := Unicode([ 23380, 22827, 23376 ] );; # three chars of width 2
gap> s := InitialSubstringUTF8String(l, 4, "*");;
gap> WidthUTF8String(s);
3
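
A simple use is shortening text for a column of fixed width. The following sketch uses plain ASCII input and a hand-chosen suffix; the outputs are derived from the description above.

gap> InitialSubstringUTF8String("abcdefghij", 6, "...");
"abc..."
gap> InitialSubstringUTF8String("abc", 6, "...");
"abc"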

6.3 Print Utilities

The following printing utilities turned out to be useful for interactive work with texts in GAP. But they are more general and so we document them here.

6.3-1 PrintTo1

‣ PrintTo1( filename, fun ) ( function )

‣ AppendTo1( filename, fun ) ( function )

The argument fun must be a function without arguments. Everything which is printed by a call fun() is printed into the file filename. As with PrintTo (Reference: PrintTo) and AppendTo (Reference: AppendTo) this overwrites or appends to, respectively, a previous content of filename.

These functions can be particularly efficient when many small pieces of text are to be written to a file, because the file does not need to be reopened for each piece.

gap> f := function() local i; 
>   for i in [1..100000] do Print(i, "\n"); od; end;; 
gap> PrintTo1("nonsense", f); # now check the local file `nonsense'
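
The difference between overwriting and appending can be sketched with a temporary file. The file name demo.txt is chosen for illustration; StringFile (6.3-5) is used to read the result back.

gap> fn := Filename(DirectoryTemporary(), "demo.txt");;
gap> PrintTo1(fn, function() Print("first\n"); end);
gap> AppendTo1(fn, function() Print("second\n"); end);
gap> StringFile(fn);
"first\nsecond\n"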

6.3-2 StringPrint

‣ StringPrint( obj1[, obj2[, ...]] ) ( function )

The functions StringPrint, StringView and StringDisplay return a string containing the output of a Print, ViewObj or Display call, respectively, with the same arguments.

This should be considered as a (temporary?) hack. It would be better to have String (Reference: String) methods for all GAP objects and to have a generic Print (Reference: Print)-function which just interprets these strings.
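
A short sketch of StringPrint with several arguments, which are printed one after the other as Print would do:

gap> s := StringPrint(2^10, " bytes");
"1024 bytes"
gap> Length(s);
10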

6.3-3 PrintFormattedString

‣ PrintFormattedString( str ) ( function )

This function prints a string str. The difference to Print(str); is that no additional line breaks are introduced by GAP's standard printing mechanism. This can be used to print lines which are longer than the current screen width. In particular one can print text which contains escape sequences like those explained in TextAttr (6.1-2), where lines may have more characters than visible characters.

6.3-4 Page

The functions Page and PageDisplay are similar to Print (Reference: Print) and Display (Reference: Display), respectively. The difference is that the output is not sent directly to the screen, but is piped into the current pager; see Pager (Reference: Pager).

gap> Page([1..1421]+0);
gap> PageDisplay(CharacterTable("Symmetric", 14));

6.3-5 StringFile

‣ StringFile( filename ) ( function )

‣ FileString( filename, str[, append] ) ( function )

The function StringFile returns the content of file filename as a string. This works efficiently with arbitrary (binary or text) files. If something went wrong, this function returns fail.

Conversely the function FileString writes the content of a string str into the file filename. If the optional third argument append is given and equals true then the content of str is appended to the file. Otherwise previous content of the file is deleted. This function returns the number of bytes written or fail if something went wrong.
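
A round trip through a temporary file can be sketched as follows. The file names are chosen for illustration, and the outputs follow from the description above.

gap> fn := Filename(DirectoryTemporary(), "data.bin");;
gap> FileString(fn, "abc");
3
gap> FileString(fn, "def", true);
3
gap> StringFile(fn);
"abcdef"
gap> StringFile(Filename(DirectoryTemporary(), "missing"));
fail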
