Ionflux::Tools::Utf8Tokenizer Class Reference
[String tokenizer]

Tokenizer with UTF-8 support.

#include <Utf8Tokenizer.hpp>

Public Member Functions
	Utf8Tokenizer ()
	Constructor.
	Utf8Tokenizer (const std::string &initInput)
	Constructor.
	Utf8Tokenizer (const std::vector< Utf8TokenType > &initTokenTypes, const std::string &initInput="")
	Constructor.
virtual	~Utf8Tokenizer ()
	Destructor.
virtual void	reset ()
	Reset.
virtual void	clearTokenTypes ()
	Clear token types.
virtual void	useDefaultTokenTypes ()
	Use default token types.
virtual void	addDefaultTokenType ()
	Add default token type.
virtual void	setTokenTypes (const std::vector< Utf8TokenType > &newTokenTypes)
	Set token types.
virtual void	addTokenTypes (const std::vector< Utf8TokenType > &newTokenTypes)
	Add token types.
virtual void	addTokenType (const Utf8TokenType &newTokenType)
	Add token type.
virtual void	setInput (const std::string &newInput)
	Set input.
virtual void	setInput (const std::vector< unsigned int > &newInput)
	Set input.
virtual Utf8Token	getNextToken (Utf8TokenTypeMap *otherTypeMap=0)
	Get next token.
virtual Utf8Token	getCurrentToken ()
	Get current token.
virtual int	getCurrentTokenType ()
	Get current token type.
virtual unsigned int	getCurrentPos ()
	Get current position.
virtual unsigned int	getCurrentTokenPos ()
	Get current token position.
virtual unsigned int	getQuoteChar ()
	Get quote character.
virtual void	setExtractQuoted (bool newExtractQuoted)
	Set extract quoted strings flag.
virtual bool	getExtractQuoted () const
	Get extract quoted strings flag.
virtual void	setExtractEscaped (bool newExtractEscaped)
	Set extract escaped characters flag.
virtual bool	getExtractEscaped () const
	Get extract escaped characters flag.
Static Public Member Functions
static bool	isValid (const Utf8Token &checkToken)
	Validate token.
Static Public Attributes
static const Utf8TokenType	TT_INVALID
	Token type: invalid (special).
static const Utf8TokenType	TT_NONE
	Token type: none (special).
static const Utf8TokenType	TT_DEFAULT = {1, "", 0}
	Token type: default (special).
static const Utf8TokenType	TT_QUOTED = {2, "", 0}
	Token type: quoted (special).
static const Utf8TokenType	TT_ESCAPED = {3, "", 0}
	Token type: escaped (special).
static const Utf8TokenType	TT_LINEAR_WHITESPACE = {4, " \t", 0}
	Token type: linear whitespace.
static const Utf8TokenType	TT_LINETERM = {5, "\n\r", 1}
	Token type: line terminator.
static const Utf8TokenType	TT_IDENTIFIER
	Token type: identifier.
static const Utf8TokenType	TT_NUMBER = {7, "0123456789", 0}
	Token type: number.
static const Utf8TokenType	TT_ALPHA
	Token type: latin alphabet.
static const Utf8TokenType	TT_DEFAULT_SEP = {9, "_-.", 0}
	Token type: default separators.
static const Utf8TokenType	TT_LATIN
	Token type: lots of latin characters.
static const Utf8Token	TOK_INVALID
	Token: invalid (special).
static const Utf8Token	TOK_NONE
	Token: none (special).
static const std::string	QUOTE_CHARS = "'\""
	Quote characters.
static const unsigned int	ESCAPE_CHAR = '\\'
	Escape character.
static const Utf8TokenizerClassInfo	utf8TokenizerClassInfo
	Class information instance.
static const Ionflux::Tools::ClassInfo *	CLASS_INFO
	Class information.
Protected Attributes
std::vector< unsigned int >	theInput
	Input characters to be tokenized.
std::vector< unsigned int >	quoteChars
	Quote characters.
unsigned int	currentPos
	Current position in the input character string.
unsigned int	currentTokenPos
	Position of the current token in the input character string.
unsigned int	currentQuoteChar
	The current quote character.
Utf8TokenTypeMap *	typeMap
	Token type map.
Utf8Token	currentToken
	Current token.
bool	extractQuoted
	Extract quoted strings flag.
bool	extractEscaped
	Extract escaped characters flag.

Detailed Description

Tokenizer with UTF-8 support.

A generic tokenizer for parsing UTF-8 strings. To set up a tokenizer, first create a Utf8Tokenizer object. This will be set up using the default token types Utf8Tokenizer::TT_LINEAR_WHITESPACE, Utf8Tokenizer::TT_LINETERM and Utf8Tokenizer::TT_IDENTIFIER. You may then add your own custom token types and optionally call Utf8Tokenizer::addDefaultTokenType() to add the Utf8Tokenizer::TT_DEFAULT token type, which is returned for input that is not recognized as any other token type. To enable extraction of quoted strings, call Utf8Tokenizer::setExtractQuoted() with true as an argument. To enable extraction of escaped characters, call Utf8Tokenizer::setExtractEscaped() with true as an argument.
To get a token from the token stream, call Utf8Tokenizer::getNextToken(). Make sure your code handles the Utf8Tokenizer::TT_NONE and Utf8Tokenizer::TT_INVALID special token types (which cannot be disabled). Utf8Tokenizer::getNextToken() returns a token of type Utf8Tokenizer::TT_NONE at the end of the token stream and a token of type Utf8Tokenizer::TT_INVALID if an invalid token is encountered.
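
The setup and token loop described above can be sketched with a small standalone model. It does not use the library and is not its implementation: MiniTokenizer, SketchTokenType, SketchToken, ID_INVALID, ID_NONE and tokenizeAll are hypothetical names, and the numeric values of the two special IDs are placeholders. The model handles single-byte characters only. Treating the third field of a token type as a maximum token length (0 meaning unlimited) is an assumption suggested by the TT_LINETERM initializer {5, "\n\r", 1}.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-ins for the library types (hypothetical, ASCII only).
struct SketchTokenType {
    int typeID;              // numeric ID of the token type
    std::string validChars;  // characters that belong to this type
    unsigned int maxChars;   // assumed: maximum token length, 0 = unlimited
};

struct SketchToken {
    int typeID;
    std::string value;
};

const int ID_INVALID = -1;  // placeholder value, stands in for TT_INVALID
const int ID_NONE = 0;      // placeholder value, stands in for TT_NONE

class MiniTokenizer {
public:
    MiniTokenizer(const std::vector<SketchTokenType>& types,
                  const std::string& input)
        : types_(types), input_(input), pos_(0) {}

    // Return the next token: ID_NONE at the end of the input, ID_INVALID if
    // the current character matches no registered token type.
    SketchToken getNextToken() {
        if (pos_ >= input_.size())
            return SketchToken{ID_NONE, ""};
        for (const SketchTokenType& t : types_) {
            if (t.validChars.find(input_[pos_]) == std::string::npos)
                continue;
            // Extend the token while characters stay in the same class
            // and the length limit (if any) is not reached.
            std::string value;
            while (pos_ < input_.size()
                   && t.validChars.find(input_[pos_]) != std::string::npos
                   && (t.maxChars == 0 || value.size() < t.maxChars))
                value += input_[pos_++];
            return SketchToken{t.typeID, value};
        }
        return SketchToken{ID_INVALID, ""};
    }

private:
    std::vector<SketchTokenType> types_;
    std::string input_;
    std::size_t pos_;
};

// Collect all tokens up to (not including) the first NONE or INVALID token.
std::vector<SketchToken> tokenizeAll(MiniTokenizer& tok) {
    std::vector<SketchToken> result;
    for (SketchToken t = tok.getNextToken();
         t.typeID != ID_NONE && t.typeID != ID_INVALID;
         t = tok.getNextToken())
        result.push_back(t);
    return result;
}
```

The loop in tokenizeAll stops on both special token types, which is the handling the description asks client code to provide.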

Constructor & Destructor Documentation

Ionflux::Tools::Utf8Tokenizer::Utf8Tokenizer ( )

Constructor.
Construct new Utf8Tokenizer object.

Ionflux::Tools::Utf8Tokenizer::Utf8Tokenizer ( const std::string & initInput )

Constructor.
Construct new Utf8Tokenizer object.

Parameters:

initInput UTF-8 input string.

Ionflux::Tools::Utf8Tokenizer::Utf8Tokenizer ( const std::vector< Utf8TokenType > & initTokenTypes,

const std::string & initInput = ""

)

Constructor.
Construct new Utf8Tokenizer object.

Parameters:

initTokenTypes Token types.

initInput UTF-8 input string.

Ionflux::Tools::Utf8Tokenizer::~Utf8Tokenizer ( ) [virtual]

Destructor.
Destruct Utf8Tokenizer object.

Member Function Documentation

void Ionflux::Tools::Utf8Tokenizer::addDefaultTokenType ( ) [virtual]

Add default token type.
Add a special token type TT_DEFAULT which will be returned if a token is not recognized.

void Ionflux::Tools::Utf8Tokenizer::addTokenType ( const Utf8TokenType & newTokenType ) [virtual]

Add token type.
Add the specified token type.

Parameters:

newTokenType Token type.

void Ionflux::Tools::Utf8Tokenizer::addTokenTypes ( const std::vector< Utf8TokenType > & newTokenTypes ) [virtual]

Add token types.
Add the specified token types.

Parameters:

newTokenTypes Token types to be added.

void Ionflux::Tools::Utf8Tokenizer::clearTokenTypes ( ) [virtual]

Clear token types.
Remove all token types.

unsigned int Ionflux::Tools::Utf8Tokenizer::getCurrentPos ( ) [virtual]

Get current position.
Get the current position in the input string.

Returns:
Current position.

Utf8Token Ionflux::Tools::Utf8Tokenizer::getCurrentToken ( ) [virtual]

Get current token.
Get the current token from the input string.

Returns:
Current token.

unsigned int Ionflux::Tools::Utf8Tokenizer::getCurrentTokenPos ( ) [virtual]

Get current token position.
Get the position of the current token in the input string.

Returns:
Current token position.

int Ionflux::Tools::Utf8Tokenizer::getCurrentTokenType ( ) [virtual]

Get current token type.
Get the type ID of the current token.

Returns:
Type ID of current token.

bool Ionflux::Tools::Utf8Tokenizer::getExtractEscaped ( ) const [virtual]

Get extract escaped characters flag.

Returns:
Current value of extract escaped characters flag.

bool Ionflux::Tools::Utf8Tokenizer::getExtractQuoted ( ) const [virtual]

Get extract quoted strings flag.

Returns:
Current value of extract quoted strings flag.

Utf8Token Ionflux::Tools::Utf8Tokenizer::getNextToken ( Utf8TokenTypeMap * otherTypeMap = 0 ) [virtual]

Get next token.
Get the next token from the input string. If the optional otherTypeMap is set, the specified token type map will be used instead of the default token type map.

Parameters:

otherTypeMap Token type map.

Returns:
Next token.

unsigned int Ionflux::Tools::Utf8Tokenizer::getQuoteChar ( ) [virtual]

Get quote character.
Get the quote character for the current token.

Returns:
Current quote character.

bool Ionflux::Tools::Utf8Tokenizer::isValid ( const Utf8Token & checkToken ) [static]

Validate token.
Check whether the specified token is valid (i.e. it is not invalid or empty).

Parameters:

checkToken Token to be checked.

Returns:
true if the specified token is valid, false otherwise.
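
As a rough illustration of the check described above, the following standalone sketch treats a token as valid when its type ID is neither the invalid nor the empty ID. CheckedToken, SKETCH_INVALID_ID, SKETCH_EMPTY_ID and isValidToken are hypothetical names, and the two numeric values are placeholders. The real values are defined by Utf8TokenType::INVALID_ID and Utf8TokenType::EMPTY_ID and are not given in this reference.

```cpp
#include <string>

// Placeholder IDs; the library defines the real values in Utf8TokenType.
const int SKETCH_INVALID_ID = -1;  // assumed value
const int SKETCH_EMPTY_ID = 0;     // assumed value

// Hypothetical stand-in for Utf8Token.
struct CheckedToken {
    int typeID;
    std::string value;
};

// A token counts as valid if it is neither the invalid nor the empty token.
bool isValidToken(const CheckedToken& t) {
    return t.typeID != SKETCH_INVALID_ID && t.typeID != SKETCH_EMPTY_ID;
}
```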

void Ionflux::Tools::Utf8Tokenizer::reset ( ) [virtual]

Reset.
Reset the tokenizer.

void Ionflux::Tools::Utf8Tokenizer::setExtractEscaped ( bool newExtractEscaped ) [virtual]

Set extract escaped characters flag.
Set new value of extract escaped characters flag.

Parameters:

newExtractEscaped New value of extract escaped characters flag.

void Ionflux::Tools::Utf8Tokenizer::setExtractQuoted ( bool newExtractQuoted ) [virtual]

Set extract quoted strings flag.
Set new value of extract quoted strings flag.

Parameters:

newExtractQuoted New value of extract quoted strings flag.
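
What extracting a quoted string might involve can be sketched as follows. This is a standalone illustration under assumptions, not the library's code: extractQuotedAt is a hypothetical helper, the constants mirror QUOTE_CHARS and ESCAPE_CHAR from this reference, and the choice to drop the escape character from the extracted value is an assumption that this reference does not confirm.

```cpp
#include <cstddef>
#include <string>

const std::string SKETCH_QUOTE_CHARS = "'\"";  // mirrors QUOTE_CHARS
const char SKETCH_ESCAPE_CHAR = '\\';          // mirrors ESCAPE_CHAR

// Hypothetical helper: if input[pos] is a quote character, copy the text up
// to the matching unescaped quote into value, store the quote character in
// quoteChar, advance pos past the closing quote and return true.
// Returns false (leaving all arguments unchanged) if no quoted string starts
// at pos or the closing quote is missing.
bool extractQuotedAt(const std::string& input, std::size_t& pos,
                     std::string& value, char& quoteChar) {
    if (pos >= input.size()
        || SKETCH_QUOTE_CHARS.find(input[pos]) == std::string::npos)
        return false;
    const char q = input[pos];
    std::string buf;
    std::size_t i = pos + 1;
    while (i < input.size() && input[i] != q) {
        // An escape character makes the following character literal.
        if (input[i] == SKETCH_ESCAPE_CHAR && i + 1 < input.size())
            ++i;
        buf += input[i++];
    }
    if (i >= input.size())
        return false;  // unterminated quoted string
    value = buf;
    quoteChar = q;
    pos = i + 1;
    return true;
}
```

The quoteChar output plays the role that getQuoteChar() has in the class: it reports which of the quote characters delimited the extracted string.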

void Ionflux::Tools::Utf8Tokenizer::setInput ( const std::vector< unsigned int > & newInput ) [virtual]

Set input.
Set the Unicode input characters.

Parameters:

newInput Unicode characters.

void Ionflux::Tools::Utf8Tokenizer::setInput ( const std::string & newInput ) [virtual]

Set input.
Set the UTF-8 encoded input string.

Parameters:

newInput UTF-8 input string.
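
The two setInput() overloads differ only in representation: one takes UTF-8 bytes, the other takes Unicode characters as unsigned int values. The following standalone sketch shows one way such a conversion can be done. decodeUtf8 is a hypothetical function, not part of the library, and it is simplified: overlong encodings and surrogate code points are not rejected.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into Unicode code points.
// Returns false on malformed input (invalid lead byte, truncated sequence
// or bad continuation byte); out then holds the code points decoded so far.
bool decodeUtf8(const std::string& bytes, std::vector<unsigned int>& out) {
    out.clear();
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        unsigned int cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return false;  // stray continuation byte or invalid lead byte
        if (i + extra >= bytes.size())
            return false;   // sequence is cut off at the end of the input
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3Fu);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

A vector filled this way has the same element type as the argument of the setInput() overload that takes Unicode characters.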

void Ionflux::Tools::Utf8Tokenizer::setTokenTypes ( const std::vector< Utf8TokenType > & newTokenTypes ) [virtual]

Set token types.
Set the token types for the tokenizer.

Parameters:

newTokenTypes New token types.

void Ionflux::Tools::Utf8Tokenizer::useDefaultTokenTypes ( ) [virtual]

Use default token types.
Use default token types (TT_LINEAR_WHITESPACE, TT_IDENTIFIER, TT_LINETERM).

Member Data Documentation

const ClassInfo * Ionflux::Tools::Utf8Tokenizer::CLASS_INFO [static]

Initial value:
&Utf8Tokenizer::utf8TokenizerClassInfo
Class information.

Reimplemented from Ionflux::Tools::ManagedObject.

unsigned int Ionflux::Tools::Utf8Tokenizer::currentPos [protected]

Current position in the input character string.

unsigned int Ionflux::Tools::Utf8Tokenizer::currentQuoteChar [protected]

The current quote character.

Utf8Token Ionflux::Tools::Utf8Tokenizer::currentToken [protected]

Current token.

unsigned int Ionflux::Tools::Utf8Tokenizer::currentTokenPos [protected]

Position of the current token in the input character string.

const unsigned int Ionflux::Tools::Utf8Tokenizer::ESCAPE_CHAR = '\\' [static]

Escape character.

bool Ionflux::Tools::Utf8Tokenizer::extractEscaped [protected]

Extract escaped characters flag.

bool Ionflux::Tools::Utf8Tokenizer::extractQuoted [protected]

Extract quoted strings flag.

const std::string Ionflux::Tools::Utf8Tokenizer::QUOTE_CHARS = "'\"" [static]

Quote characters.

std::vector<unsigned int> Ionflux::Tools::Utf8Tokenizer::quoteChars [protected]

Quote characters.

std::vector<unsigned int> Ionflux::Tools::Utf8Tokenizer::theInput [protected]

Input characters to be tokenized.

const Utf8Token Ionflux::Tools::Utf8Tokenizer::TOK_INVALID [static]

Initial value:
{ Utf8TokenType::INVALID_ID, ""}
Token: invalid (special).

const Utf8Token Ionflux::Tools::Utf8Tokenizer::TOK_NONE [static]

Initial value:
{ Utf8TokenType::EMPTY_ID, ""}
Token: none (special).

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_ALPHA [static]

Initial value:
{8, "abcdefghijklmnopqrstuvwxyz" "ABCDEFGHIJKLMNOPQRSTUVWXYZ", 0}
Token type: latin alphabet.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_DEFAULT = {1, "", 0} [static]

Token type: default (special).

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_DEFAULT_SEP = {9, "_-.", 0} [static]

Token type: default separators.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_ESCAPED = {3, "", 0} [static]

Token type: escaped (special).

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_IDENTIFIER [static]

Initial value:
{6, "abcdefghijklmnopqrstuvwxyz" "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_", 0}
Token type: identifier.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_INVALID [static]

Initial value:
{ Utf8TokenType::INVALID_ID, "", 0}
Token type: invalid (special).

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_LATIN [static]

Initial value:
{10, "abcdefghijklmnopqrstuvwxyz" "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíî" "ïðñòóôõöøùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜ" "ĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌ" "ōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ", 0}
Token type: lots of latin characters.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_LINEAR_WHITESPACE = {4, " \t", 0} [static]

Token type: linear whitespace.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_LINETERM = {5, "\n\r", 1} [static]

Token type: line terminator.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_NONE [static]

Initial value:
{ Utf8TokenType::EMPTY_ID, "", 0}
Token type: none (special).

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_NUMBER = {7, "0123456789", 0} [static]

Token type: number.

const Utf8TokenType Ionflux::Tools::Utf8Tokenizer::TT_QUOTED = {2, "", 0} [static]

Token type: quoted (special).

Utf8TokenTypeMap* Ionflux::Tools::Utf8Tokenizer::typeMap [protected]

Token type map.

const Utf8TokenizerClassInfo Ionflux::Tools::Utf8Tokenizer::utf8TokenizerClassInfo [static]

Class information instance.

Generated on Tue Mar 14 21:11:52 2006 for Ionflux Tools Class Library (iftools) by Doxygen 1.4.6
