String operations – Wikipedia

Posted on May 14, 2015 by lordneo

In computer science, in the area of formal language theory, frequent use is made of a variety of string functions; however, the notation used is different from that used for computer programming, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.

Strings and languages

A string is a finite sequence of characters.
The empty string is denoted by $\varepsilon$.
The concatenation of two strings $s$ and $t$ is denoted by $s \cdot t$, or shorter by $st$.
Concatenating with the empty string makes no difference: $s \cdot \varepsilon = s = \varepsilon \cdot s$.
Concatenation of strings is associative: $s \cdot (t \cdot u) = (s \cdot t) \cdot u$.

For example, $(\langle b \rangle \cdot \langle l \rangle) \cdot (\varepsilon \cdot \langle ah \rangle) = \langle bl \rangle \cdot \langle ah \rangle = \langle blah \rangle$.

A language is a finite or infinite set of strings.
Besides the usual set operations like union, intersection etc., concatenation can be applied to languages:
if both $S$ and $T$ are languages, their concatenation $S \cdot T$ is defined as the set of concatenations of any string from $S$ and any string from $T$, formally $S \cdot T = \{ s \cdot t \mid s \in S \land t \in T \}$.
Again, the concatenation dot $\cdot$ is often omitted for brevity.

The language $\{\varepsilon\}$ consisting of just the empty string is to be distinguished from the empty language $\{\}$.
Concatenating any language with the former does not change it: $S \cdot \{\varepsilon\} = S = \{\varepsilon\} \cdot S$,
while concatenating with the latter always yields the empty language: $S \cdot \{\} = \{\} = \{\} \cdot S$.
Concatenation of languages is associative: $S \cdot (T \cdot U) = (S \cdot T) \cdot U$.

For example, abbreviating $D = \{ \langle 0 \rangle, \langle 1 \rangle, \langle 2 \rangle, \langle 3 \rangle, \langle 4 \rangle, \langle 5 \rangle, \langle 6 \rangle, \langle 7 \rangle, \langle 8 \rangle, \langle 9 \rangle \}$, the set of all three-digit decimal numbers is obtained as $D \cdot D \cdot D$. The set of all decimal numbers of arbitrary length is an example of an infinite language.
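
The concatenation laws above can be checked on finite languages. The following is a minimal sketch, assuming a finite language is modelled as a Python set of `str` values; the name `concat` is a hypothetical helper, not notation from the article.

```python
def concat(S, T):
    """Concatenation S·T of two finite languages given as sets of strings."""
    return {s + t for s in S for t in T}

EMPTY_STRING_LANGUAGE = {""}   # the language {ε}
EMPTY_LANGUAGE = set()         # the language {}

D = set("0123456789")          # the ten one-character digit strings
three_digit = concat(concat(D, D), D)

print(len(three_digit))                        # 1000
print(concat(D, EMPTY_STRING_LANGUAGE) == D)   # True
print(concat(D, EMPTY_LANGUAGE))               # set()
```

Infinite languages, such as the set of all decimal numbers, cannot be enumerated this way; the sketch only covers the finite case.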

Alphabet of a string

The alphabet of a string is the set of all of the characters that occur in a particular string. If s is a string, its alphabet is denoted by $\operatorname{Alph}(s)$.

The alphabet of a language $S$ is the set of all characters that occur in any string of $S$, formally: $\operatorname{Alph}(S) = \bigcup_{s \in S} \operatorname{Alph}(s)$.

For example, the set $\{ \langle a \rangle, \langle c \rangle, \langle o \rangle \}$ is the alphabet of the string $\langle cacao \rangle$, and the above $D$ is the alphabet of the above language $D \cdot D \cdot D$ as well as of the language of all decimal numbers.
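
As an illustration, both alphabet definitions translate almost directly into set operations. This is a small sketch for finite languages only; `alph_string` and `alph_language` are hypothetical names chosen here.

```python
def alph_string(s):
    """Alph(s): the set of characters occurring in the string s."""
    return set(s)

def alph_language(S):
    """Alph(S): the union of Alph(s) over all strings s in the finite language S."""
    result = set()
    for s in S:
        result |= alph_string(s)
    return result

print(sorted(alph_string("cacao")))            # ['a', 'c', 'o']
print(sorted(alph_language({"12", "305"})))    # ['0', '1', '2', '3', '5']
```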

String substitution

Let L be a language, and let Σ be its alphabet. A string substitution or simply a substitution is a mapping f that maps characters in Σ to languages (possibly in a different alphabet). Thus, for example, given a character a ∈ Σ, one has f(a)=L_a where L_a ⊆ Δ^* is some language whose alphabet is Δ. This mapping may be extended to strings as

f(ε)=ε

for the empty string ε, and

f(sa)=f(s)f(a)

for string s ∈ L and character a ∈ Σ. String substitutions may be extended to entire languages as ^[1]

$f(L) = \bigcup_{s \in L} f(s)$.

Regular languages are closed under string substitution. That is, if each character in the alphabet of a regular language is substituted by another regular language, the result is still a regular language.^[2]
Similarly, context-free languages are closed under string substitution.^[3]^{[note 1]}

A simple example is the conversion f_uc(.) to uppercase, which may be defined e.g. as follows:

character x	mapped to language f_uc(x)	remark
‹a›	{ ‹A› }	map lowercase char to corresponding uppercase char
‹A›	{ ‹A› }	map uppercase char to itself
‹ß›	{ ‹SS› }	no uppercase char available, map to two-char string
‹0›	{ ε }	map digit to empty string
‹!›	{ }	forbid punctuation, map to empty language
…		similar for other chars

For the extension of f_uc to strings, we have e.g.

f_uc(‹Straße›) = {‹S›} ⋅ {‹T›} ⋅ {‹R›} ⋅ {‹A›} ⋅ {‹SS›} ⋅ {‹E›} = {‹STRASSE›},
f_uc(‹u2›) = {‹U›} ⋅ {ε} = {‹U›}, and
f_uc(‹Go!›) = {‹G›} ⋅ {‹O›} ⋅ {} = {}.

For the extension of f_uc to languages, we have e.g.

f_uc({ ‹Straße›, ‹u2›, ‹Go!› }) = { ‹STRASSE› } ∪ { ‹U› } ∪ { } = { ‹STRASSE›, ‹U› }.
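
The f_uc example can be reproduced with finite sets. The sketch below is one possible encoding, not a definitive implementation: the rule "every character that is neither a letter nor a digit maps to the empty language" is an assumption filling in the table's "similar for other chars" row, and all function names are hypothetical.

```python
def concat(S, T):
    """Concatenation of two finite languages (sets of strings)."""
    return {s + t for s in S for t in T}

def f_uc_char(c):
    """Language assigned to one character (illustrative rules, see lead-in)."""
    if c == "ß":
        return {"SS"}          # no uppercase char, map to two-char string
    if c.isdigit():
        return {""}            # digits map to {ε}
    if c.isalpha():
        return {c.upper()}     # letters map to their uppercase form
    return set()               # assumption: everything else is forbidden

def substitute_string(f, s):
    """Extend f to strings: concatenate the languages of the characters."""
    result = {""}              # the empty string maps to {ε}
    for c in s:
        result = concat(result, f(c))
    return result

def substitute_language(f, L):
    """Extend f to a finite language: union over all its strings."""
    result = set()
    for s in L:
        result |= substitute_string(f, s)
    return result

print(substitute_string(f_uc_char, "Straße"))   # {'STRASSE'}
print(substitute_string(f_uc_char, "Go!"))      # set()
```

A single forbidden character empties the whole result, because concatenating with the empty language always yields the empty language.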

String homomorphism

A string homomorphism (often referred to simply as a homomorphism in formal language theory) is a string substitution such that each character is replaced by a single string. That is, $f(a) = s$, where $s$ is a string, for each character $a$.^{[note 2]}^[4]

String homomorphisms are monoid morphisms on the free monoid, preserving the empty string and the binary operation of string concatenation. Given a language $L$, the set $f(L)$ is called the homomorphic image of $L$. The inverse homomorphic image of a string $s$ is defined as

$f^{-1}(s) = \{ w \mid f(w) = s \}$

while the inverse homomorphic image of a language $L$ is defined as

$f^{-1}(L) = \{ s \mid f(s) \in L \}$

In general, $f(f^{-1}(L)) \neq L$, while one does have

$f(f^{-1}(L)) \subseteq L$

and

$L \subseteq f^{-1}(f(L))$

for any language $L$.

The class of regular languages is closed under homomorphisms and inverse homomorphisms.^[5]
Similarly, the context-free languages are closed under homomorphisms^{[note 3]} and inverse homomorphisms.^[6]

A string homomorphism is said to be ε-free (or e-free) if $f(a) \neq \varepsilon$ for all a in the alphabet $\Sigma$. Simple single-letter substitution ciphers are examples of (ε-free) string homomorphisms.

An example string homomorphism g_uc can be defined analogously to the above substitution: g_uc(‹a›) = ‹A›, …, g_uc(‹0›) = ε, but with g_uc left undefined on punctuation chars.
Examples of inverse homomorphic images are

g_uc⁻¹({ ‹SSS› }) = { ‹sss›, ‹sß›, ‹ßs› }, since g_uc(‹sss›) = g_uc(‹sß›) = g_uc(‹ßs›) = ‹SSS›, and
g_uc⁻¹({ ‹A›, ‹bb› }) = { ‹a› }, since g_uc(‹a›) = ‹A›, while ‹bb› cannot be reached by g_uc.

For the latter language, g_uc(g_uc⁻¹({ ‹A›, ‹bb› })) = g_uc({ ‹a› }) = { ‹A› } ≠ { ‹A›, ‹bb› }.
The homomorphism g_uc is not ε-free, since it maps e.g. ‹0› to ε.

A very simple string homomorphism example that maps each character to just a character is the conversion of an EBCDIC-encoded string to ASCII.
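
Inverse homomorphic images can be computed by exhaustive search when the homomorphism is ε-free, since every preimage is then at most as long as the target string. The sketch below assumes a hypothetical four-letter fragment `g` of the uppercase mapping (lowercase letters only, no digits), so it is not the full g_uc from the text.

```python
def apply_hom(h, w):
    """Extend a character map h (dict: char -> string) to the string w."""
    return "".join(h[c] for c in w)

def inverse_image(h, s):
    """All strings w over the keys of h with apply_hom(h, w) == s.

    Assumes h is ε-free, so that the set of preimages is finite.
    """
    if any(image == "" for image in h.values()):
        raise ValueError("h must be ε-free for this search to terminate")
    if s == "":
        return {""}
    result = set()
    for c, image in h.items():
        if s.startswith(image):
            rest = inverse_image(h, s[len(image):])
            result |= {c + w for w in rest}
    return result

def inverse_image_language(h, L):
    """Inverse image of a finite language: union of the preimages."""
    result = set()
    for s in L:
        result |= inverse_image(h, s)
    return result

# Hypothetical fragment of the uppercase homomorphism.
g = {"a": "A", "b": "B", "s": "S", "ß": "SS"}

pre = inverse_image_language(g, {"A", "bb"})
print(pre)                                  # {'a'}
print({apply_hom(g, w) for w in pre})       # {'A'}
```

The last line shows a case where applying g to the inverse image gives a proper subset of the original language, because ‹bb› has no preimage.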

String projection

If s is a string, and $\Sigma$ is an alphabet, the string projection of s is the string that results from removing all characters that are not in $\Sigma$. It is written as $\pi_{\Sigma}(s)$. It is formally defined by removal of characters from the right-hand side:

$\pi_{\Sigma}(s) = \begin{cases} \varepsilon & \text{if } s = \varepsilon \text{ the empty string} \\ \pi_{\Sigma}(t) & \text{if } s = ta \text{ and } a \notin \Sigma \\ \pi_{\Sigma}(t)a & \text{if } s = ta \text{ and } a \in \Sigma \end{cases}$

Here $\varepsilon$ denotes the empty string. The projection of a string is essentially the same as a projection in relational algebra.

String projection may be promoted to the projection of a language. Given a formal language L, its projection is given by

$\pi_{\Sigma}(L) = \{ \pi_{\Sigma}(s) \mid s \in L \}$
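
The recursive definition of projection can be followed case by case. This is a minimal sketch with hypothetical names `project` and `project_language`; it recurses on the last character exactly as the three cases do, so very long strings would hit Python's recursion limit.

```python
def project(s, sigma):
    """String projection: drop every character of s that is not in sigma."""
    if s == "":
        return ""                          # case s = ε
    t, a = s[:-1], s[-1]                   # split s = ta
    if a in sigma:
        return project(t, sigma) + a       # case a ∈ Σ: keep a
    return project(t, sigma)               # case a ∉ Σ: drop a

def project_language(L, sigma):
    """Projection of a finite language: project each of its strings."""
    return {project(s, sigma) for s in L}

print(project("cacao", {"a", "o"}))                   # aao
print(project_language({"cacao", "coco"}, {"c"}))     # {'cc'}
```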

Right quotient

The right quotient of a character a from a string s is the truncation of the character a in the string s, from the right-hand side. It is denoted as $s/a$. If the string does not have a on the right-hand side, the result is the empty string. Thus:

$(sa)/b = \begin{cases} s & \text{if } a = b \\ \varepsilon & \text{if } a \neq b \end{cases}$

The quotient of the empty string may be taken:

$\varepsilon / a = \varepsilon$
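
The two equations above amount to a short function. The following sketch uses a hypothetical name `right_quotient` and represents ε as the empty Python string.

```python
def right_quotient(s, b):
    """s/b: strip a trailing b from s; any other case gives the empty string."""
    if s == "":
        return ""                  # ε/a = ε
    if s[-1] == b:
        return s[:-1]              # (sa)/b = s   when a = b
    return ""                      # (sa)/b = ε   when a ≠ b

print(repr(right_quotient("blah", "h")))   # 'bla'
print(repr(right_quotient("blah", "a")))   # ''
```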

Similarly, given a subset $S \subset M$ of a monoid $M$, one may define the quotient subset as

$S/a = \{ s \in M \mid sa \in S \}$

Left quotients may be defined similarly, with operations taking place on the left of a string.^{[citation needed]}

Hopcroft and Ullman (1979) define the quotient L₁/L₂ of the languages L₁ and L₂ over the same alphabet as L₁/L₂ = { s | ∃t∈L₂. st∈L₁ }.^[7]
This is not a generalization of the above definition, since, for a string s and distinct characters a, b, Hopcroft’s and Ullman’s definition implies {sa} / {b} yielding {}, rather than { ε }.
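
The difference between the two definitions can be seen on finite languages. The sketch below implements only the Hopcroft and Ullman quotient, for finite sets of strings; `language_quotient` is a hypothetical name.

```python
def language_quotient(L1, L2):
    """L1/L2 = { s | there is a t in L2 with s + t in L1 }, for finite sets."""
    return {
        w[:len(w) - len(t)]        # the part of w in front of the suffix t
        for w in L1
        for t in L2
        if w.endswith(t)
    }

print(language_quotient({"xa"}, {"a"}))    # {'x'}
print(language_quotient({"xa"}, {"b"}))    # set()
```

For the mismatching character the result is the empty language, whereas the string-level definition given earlier yields the empty string, which would correspond to the language {ε}.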

The left quotient (when defined similarly to Hopcroft and Ullman 1979) of a singleton language L₁ and an arbitrary language L₂ is known as the Brzozowski derivative; if L₂ is represented by a regular expression, so can the left quotient.^[8]

Syntactic relation

The right quotient of a subset $S \subset M$ of a monoid $M$ defines an equivalence relation, called the right syntactic relation of S. It is given by

$\sim_{S} \;=\; \{ (s,t) \in M \times M \mid S/s = S/t \}$

The relation is clearly of finite index (has a finite number of equivalence classes) if and only if the family of right quotients is finite; that is, if

$\{ S/m \mid m \in M \}$

is finite. In the case that M is the monoid of words over some alphabet, S is then a regular language, that is, a language that can be recognized by a finite state automaton. This is discussed in greater detail in the article on syntactic monoids.^{[citation needed]}
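
For a finite language over words, the equivalence classes can be listed by brute force. The sketch below is only an illustration under stated assumptions: it inspects words up to a chosen length `max_len`, which suffices for a finite S because any word longer than every member of S has the empty quotient. The function names are hypothetical.

```python
from itertools import product

def quotient_by_word(S, m):
    """S/m = { s | s + m in S } for a finite language S and a word m."""
    return frozenset(w[:len(w) - len(m)] for w in S if w.endswith(m))

def right_quotient_classes(S, alphabet, max_len):
    """Group all words of length <= max_len by their quotient S/m."""
    classes = {}
    for n in range(max_len + 1):
        for letters in product(sorted(alphabet), repeat=n):
            m = "".join(letters)
            classes.setdefault(quotient_by_word(S, m), []).append(m)
    return classes

classes = right_quotient_classes({"ab", "b"}, {"a", "b"}, 2)
for quotient, words in classes.items():
    print(sorted(quotient), words)
print(len(classes))    # 4
```

Two words land in the same list exactly when they have the same right quotient, so each dictionary entry corresponds to one class of the relation restricted to the inspected words.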

Right cancellation

The right cancellation of a character a from a string s is the removal of the first occurrence of the character a in the string s, starting from the right-hand side. It is denoted as $s \div a$ and is recursively defined as

$(sa) \div b = \begin{cases} s & \text{if } a = b \\ (s \div b)a & \text{if } a \neq b \end{cases}$

The empty string is always cancellable:

$\varepsilon \div a = \varepsilon$

Clearly, right cancellation and projection commute:

$\pi_{\Sigma}(s) \div a = \pi_{\Sigma}(s \div a)$
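
The recursion and the commutation property can be tried out on concrete strings. This is a small sketch with hypothetical names; `project` is written as a plain filter here, which gives the same result as the recursive definition of projection.

```python
def right_cancel(s, b):
    """s ÷ b: delete the rightmost occurrence of the character b, if any."""
    if s == "":
        return ""                          # ε ÷ a = ε
    t, a = s[:-1], s[-1]                   # split s = ta
    if a == b:
        return t                           # (ta) ÷ b = t          when a = b
    return right_cancel(t, b) + a          # (ta) ÷ b = (t ÷ b)a   when a ≠ b

def project(s, sigma):
    """String projection: keep only the characters of s that are in sigma."""
    return "".join(c for c in s if c in sigma)

s, sigma = "banana", {"a", "b"}
print(right_cancel(s, "n"))                                          # banaa
print(right_cancel(project(s, sigma), "a"))                          # baa
print(project(right_cancel(s, "a"), sigma))                          # baa
```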

Prefixes

The prefixes of a string are the set of all prefixes of that string, with respect to a given language:

$\operatorname{Pref}_{L}(s) = \{ t \mid s = tu \text{ for } t, u \in \operatorname{Alph}(L)^{*} \}$

where $s \in L$.

The prefix closure of a language is

$\operatorname{Pref}(L) = \bigcup_{s \in L} \operatorname{Pref}_{L}(s) = \left\{ t \mid s = tu;\ s \in L;\ t, u \in \operatorname{Alph}(L)^{*} \right\}$

Example:

$L = \left\{ abc \right\} \text{ then } \operatorname{Pref}(L) = \left\{ \varepsilon, a, ab, abc \right\}$

A language is called prefix closed if $\operatorname{Pref}(L) = L$.

The prefix closure operator is idempotent:

$\operatorname{Pref}(\operatorname{Pref}(L)) = \operatorname{Pref}(L)$

The prefix relation is a binary relation $\sqsubseteq$ such that $s \sqsubseteq t$ if and only if $s \in \operatorname{Pref}_{L}(t)$. This relation is a particular example of a prefix order.^{[citation needed]}
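
The prefix operations of this section are easy to experiment with on finite languages. The sketch below uses hypothetical function names and relies on the fact that, for a string s in L, every prefix and suffix of s is automatically a string over the alphabet of L.

```python
def prefixes(s):
    """All prefixes of s, including the empty string and s itself."""
    return {s[:i] for i in range(len(s) + 1)}

def prefix_closure(L):
    """Pref(L): the union of the prefix sets of all strings in a finite L."""
    result = set()
    for s in L:
        result |= prefixes(s)
    return result

def is_prefix_closed(L):
    """True if L already contains every prefix of each of its strings."""
    return prefix_closure(L) == set(L)

def is_prefix(s, t):
    """The prefix relation: s is a prefix of t."""
    return s in prefixes(t)

closure = prefix_closure({"abc"})
print(sorted(closure))                       # ['', 'a', 'ab', 'abc']
print(prefix_closure(closure) == closure)    # True
print(is_prefix_closed({"abc"}))             # False
```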

References

^ Hopcroft, Ullman (1979), Sect.3.2, p.60
^ Hopcroft, Ullman (1979), Sect.3.2, Theorem 3.4, p.60
^ Hopcroft, Ullman (1979), Sect.6.2, Theorem 6.2, p.131
^ Hopcroft, Ullman (1979), Sect.3.2, p.60-61
^ Hopcroft, Ullman (1979), Sect.3.2, Theorem 3.5, p.61
^ Hopcroft, Ullman (1979), Sect.6.2, Theorem 6.3, p.132
^ Hopcroft, Ullman (1979), Sect.3.2, p.62
^ Janusz A. Brzozowski (1964). “Derivatives of Regular Expressions”. J ACM. 11 (4): 481–494. doi:10.1145/321239.321249. S2CID 14126942.
