Lexicon
Lexicon is a bidirectional mapping between strings and a numeric / enumeration type. It is intended
to support parsing and diagnostics for enumerations. It has some significant advantages over a simple
array of strings.
The integer can be looked up by string. This makes parsing much easier and more robust.
The integers do not have to be contiguous or zero based.
Multiple names can map to the same integer.
Defaults can be specified for missing names or integers.
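The advantages listed above can be illustrated with a minimal sketch built only on the standard library. `TinyLexicon`, its methods `value_of` and `name_of`, and the `Color` enumeration are hypothetical names invented for this illustration. This is not how swoc::Lexicon is implemented and it does not reproduce its API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration only; this is not swoc::Lexicon.
template <typename E> class TinyLexicon {
public:
  TinyLexicon(E default_value, std::string default_name)
    : _default_value(default_value), _default_name(std::move(default_name)) {}

  // Associate a value with one or more names. The first name is the primary name.
  void define(E value, std::vector<std::string> const &names) {
    assert(!names.empty());               // a value needs at least one name
    _names.emplace(value, names.front()); // primary name only
    for (auto const &name : names) {
      _values.emplace(name, value);       // every name maps back to the value
    }
  }

  // Name to value, falling back to the default value for unknown names.
  E value_of(std::string_view name) const {
    auto spot = _values.find(std::string(name));
    return spot != _values.end() ? spot->second : _default_value;
  }

  // Value to primary name, falling back to the default name for unknown values.
  std::string_view name_of(E value) const {
    auto spot = _names.find(value);
    return spot != _names.end() ? std::string_view(spot->second) : std::string_view(_default_name);
  }

private:
  E _default_value;
  std::string _default_name;
  std::map<std::string, E> _values; // all names, primary and alias
  std::map<E, std::string> _names;  // primary names
};

// Values are neither contiguous nor zero based.
enum class Color { RED = 10, GREEN = 20, INVALID = -1 };
```

With `Color::RED` defined as `{"red", "crimson"}`, both names resolve to `Color::RED`, an unknown name resolves to the default value, and `Color::RED` converts back to its primary name "red".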
Definition
template<typename E>
class Lexicon
Usage
Lexicons can be used in a dynamic or static fashion. The basic use is as a static translation object that converts between an enumeration and names. The constructors allow setting up the entire Lexicon.
The primary things to set up for a Lexicon are
The equivalence of names and values.
The default (if any) for a name.
The default (if any) for a value.
Values and names can be associated either using pairs of a value and a name, or pairs of a value and a list of names, the first of which is the primary name. The form must be consistent across all of the defined values, so if one value has multiple names, every value must use the value and name list form.
Defaults
In addition, defaults can be specified. Because all possible defaults have distinct signatures, there is no need to order them; the constructor can deduce what is meant. Defaults are very handy when using a Lexicon for parsing: the default value can be an invalid value, in which case checking whether an input token is a valid name is very simple:
extern swoc::Lexicon<Types> lex; // Initialized elsewhere.
auto value = lex[token];
if (value != INVALID) {
  // handle successful parse
}
Lexicon can also be used dynamically, where the contents are built up over time or from run time inputs. One example is using Lexicon to support enumeration or flag set columns for IPSpace. A configuration file can list the allowed / supported keys for the columns, which are then loaded into a Lexicon and used to parse the data file. The key methods are
Lexicon::define() which adds a value, name definition.
Lexicon::set_default() which sets a default.
Each Lexicon has its own internal storage where copies of all of the strings are kept. This makes dynamic use much easier and more robust, as there are no lifetime concerns with the strings.
Lexicons can be used for “normalizing” pointers to strings. Double indexing will convert the arbitrary pointer to the string to a consistent pointer, which can then be numerically compared for equivalence. This is only a benefit if the pointer is to be stored and compared multiple times.
token = lex[lex[token]]; // Normalize string pointer.
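The idea behind pointer normalization can be sketched with the standard library alone. `NamePool` and `normalize` are hypothetical names for this illustration, and the sketch assumes nothing about how Lexicon stores its strings. It only shows why a canonical pointer lets later comparisons be done on addresses instead of on characters.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <unordered_set>

// Hypothetical helper, not part of libswoc: keeps one canonical copy of each string.
class NamePool {
public:
  // Return a pointer to the pool's own copy of `text`.
  // Equal text always yields the same pointer.
  char const *normalize(std::string_view text) {
    return _names.emplace(text).first->c_str();
  }

private:
  // Node based container, so stored strings do not move when the set grows.
  std::unordered_set<std::string> _names;
};

inline void name_pool_demo() {
  NamePool pool;
  std::string a{"edge"};
  std::string b{"edge"}; // equal text, different storage
  assert(a.c_str() != b.c_str());
  assert(pool.normalize(a) == pool.normalize(b)); // same canonical pointer
}
```

As the text notes, the extra lookup only pays off when the normalized pointer is stored and compared many times.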
Iteration
For iteration, the lexicon is treated as a list of pairs of values and names. Standard iteration is over the values and the primary names for those values. The value type of the iterator is a tuple of the value and name.
extern swoc::Lexicon<Type> lex; // Initialized elsewhere.
for ( auto const & pair : lex ) {
  std::cout << std::get<swoc::Lexicon<Type>::VALUE_IDX>(pair) << " has the name "
            << std::get<swoc::Lexicon<Type>::NAME_IDX>(pair) << std::endl;
}
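Because the iterator's value type is described as a tuple, the elements can be read either with std::get and a named index or with structured bindings. The sketch below uses a plain std::vector of std::tuple as a stand-in for a Lexicon. `Entry`, `Level`, `join_names`, and the index constants defined here are hypothetical, and the values 0 and 1 for the indices are assumptions that merely mirror the names used above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <tuple>
#include <vector>

enum class Level { LOW = 1, HIGH = 5 };

// Stand-in for the iterator value type: a tuple of value and name.
using Entry = std::tuple<Level, std::string_view>;
constexpr std::size_t VALUE_IDX = 0; // assumed positions, for illustration
constexpr std::size_t NAME_IDX  = 1;

// Collect the names into a comma separated string using structured bindings.
inline std::string join_names(std::vector<Entry> const &entries) {
  std::string out;
  for (auto const &[value, name] : entries) {
    (void)value; // only the name is needed here
    if (!out.empty()) {
      out += ',';
    }
    out += name;
  }
  return out;
}

inline void tuple_demo() {
  std::vector<Entry> entries{Entry{Level::LOW, "low"}, Entry{Level::HIGH, "high"}};
  // Index based access reads the same elements as the structured binding.
  assert(std::get<VALUE_IDX>(entries[0]) == Level::LOW);
  assert(std::get<NAME_IDX>(entries[0]) == "low");
}
```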
It is possible to iterate over the names
as well using the Lexicon::begin_names() and Lexicon::end_names() methods. For
convenience, the method Lexicon::by_names() returns a temporary object which has begin
and end methods that return name iterators. This makes container iteration easier.
extern swoc::Lexicon<Type> lex; // Initialized elsewhere.
for ( auto const & pair : lex.by_names() ) {
  // code for each pair.
}
Constructing
To make the class more flexible it can be constructed in a variety of ways. For a static instance the entire
class can be initialized in the constructor. For dynamic use any subset can be initialized. In
the NetTypeNames example (see Examples), the instance is initialized with all of the defined values
and a default value for missing names. Because this fully constructs the Lexicon, it can be marked const to prevent
accidental changes. It could also have been constructed with a default name:
swoc::Lexicon<NetType> const Example1{
  {{NetType::EXTERNAL, "external"}, {NetType::PROD, "prod"}, {NetType::SECURE, "secure"}, {NetType::EDGE, "edge"}},
  "*invalid*",     // default name for undefined values
  NetType::INVALID // default value for undefined name
};
Note the default name was put before the default value. Because they are distinct types, the defaults can be added in either order, but must always follow the field definitions. The defaults can also be omitted entirely, which is common if the Lexicon is used for output and not parsing, where the enumeration is always valid because all enumeration values are in the Lexicon.
swoc::Lexicon<NetType> const Example2{
  {{NetType::EXTERNAL, "external"}, {NetType::PROD, "prod"}, {NetType::SECURE, "secure"}, {NetType::EDGE, "edge"}},
};
For dynamic use, it is common to have just the defaults in the constructor, and not any of the fields, although of course if some “built in” names and values are needed those can be added as in the previous examples.
swoc::Lexicon<NetType> Example3{
  "*invalid*",     // default name for undefined values
  NetType::INVALID // default value for undefined name
};
As before, both, either, or neither of the defaults can be provided.
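The reason the two defaults can appear in either order is that their types differ, so ordinary overload resolution tells them apart. A minimal sketch of that principle follows. The `Defaults` struct and its `set` overloads are hypothetical and are not the Lexicon constructor or Lexicon::set_default(). `NetType` is redeclared here only to keep the sketch self-contained.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

enum class NetType { EXTERNAL, PROD, SECURE, EDGE, INVALID };

// Hypothetical holder for the two kinds of default.
template <typename E> struct Defaults {
  std::optional<E> value;          // used when a name is not found
  std::optional<std::string> name; // used when a value is not found

  // The argument type alone selects which default is set.
  Defaults &set(E v) {
    value = v;
    return *this;
  }
  Defaults &set(std::string_view n) {
    name = std::string(n);
    return *this;
  }
};

inline void defaults_demo() {
  // Name first or value first: the result is the same.
  auto a = Defaults<NetType>{}.set("*invalid*").set(NetType::INVALID);
  auto b = Defaults<NetType>{}.set(NetType::INVALID).set("*invalid*");
  assert(a.value == b.value && a.name == b.name);
}
```

This works cleanly because a string literal cannot convert to a scoped enumeration and the enumeration cannot convert to a string view, so neither call is ambiguous.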
Finally, here is an example of using Lexicon to translate a boolean value, allowing for various alternative forms for the true and false names.
enum BoolTag {
  INVALID = -1,
  False = 0,
  True = 1,
};
swoc::Lexicon<BoolTag> const BoolNames{
  {{BoolTag::True, {"true", "1", "on", "enable", "Y", "yes"}},
   {BoolTag::False, {"false", "0", "off", "disable", "N", "no"}}},
  BoolTag::INVALID
};
The set of value names is easily changed. The BoolTag type is used so that it is possible to indicate
when a name doesn’t match anything in the Lexicon. Each field is a value followed by a list of names,
instead of a pair of a value and a name as in the previous examples. If a BoolTag is passed to
the Lexicon, it returns “true” or “false”, or throws an exception for BoolTag::INVALID because
that value is not defined and there is no default name. Those strings are returned because they
are the first elements in their lists of names, which makes them the primary names. This is fine for
debugging or diagnostic messages because only the true and false values would be stored; INVALID
indicates a parsing error. The enumeration values were chosen so that casting from bool to BoolTag
yields the appropriate string.
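The bool to BoolTag cast mentioned above can be sketched without the library. `primary_name` and `to_tag` are hypothetical helpers for this illustration. `primary_name` stands in for the value to name lookup and, unlike the Lexicon described above, returns a placeholder for INVALID instead of throwing.

```cpp
#include <cassert>
#include <string_view>

enum BoolTag {
  INVALID = -1,
  False   = 0,
  True    = 1,
};

// Hypothetical stand-in for the value to name lookup; returns the primary name.
// Assumption: a placeholder is returned for INVALID, where the text above
// describes an exception when no default name is set.
constexpr std::string_view primary_name(BoolTag tag) {
  switch (tag) {
  case True:
    return "true";
  case False:
    return "false";
  default:
    return "*invalid*";
  }
}

// A bool converts to 0 or 1, which are exactly the values of False and True.
constexpr BoolTag to_tag(bool flag) {
  return static_cast<BoolTag>(flag);
}

inline void bool_tag_demo() {
  bool enabled = (3 > 2);
  assert(primary_name(to_tag(enabled)) == "true");
}
```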
C++20 Notes
Due to changes in the language some initializations that compile in C++17 become ambiguous in C++20,
although I think this is due to a compiler bug in g++ (this problem has not occurred in Clang).
To provide a workaround, the type definitions with and with_multi are exported
from Lexicon to force the field initialization list to be a specific type, avoiding the
ambiguity.
[[maybe_unused]] ExampleNames Static_Names_Multi{
  ExampleNames::with_multi{
    {Example::Value_0, {"zero", "0"}},
    {Example::Value_1, {"one", "1"}},
    {Example::Value_2, {"two", "2"}},
    {Example::Value_3, {"three", "3"}},
    {Example::INVALID, {"INVALID"}}}
};
This issue arises only when none of the name lists has more than two elements. For instance,
this example doesn’t require with_multi because the first list of names has three elements.
[[maybe_unused]] ExampleNames Static_Names_Multi{
  {{Example::Value_0, {"zero", "0", "none"}},
   {Example::Value_1, {"one", "1"}},
   {Example::Value_2, {"two", "2"}},
   {Example::Value_3, {"three", "3"}},
   {Example::INVALID, {"INVALID"}}}
};
Note
Techno-babble
The base issue is the std::string_view constructor, new in C++20, that takes two iterators
and constructs the view in the standard STL half open way. This makes the following ambiguous for
the argument types std::string_view and std::initializer_list<std::string_view>
{ "alpha", "bravo" }
This can be read as
std::string_view{char const*, char const*}
which satisfies the two iterator constructor. In C++17 such a list would never satisfy a
std::string_view constructor and so was unambiguously a list of names and not a single
name. The internal fix was to use TextView, which has that constructor, and mark that
constructor explicit. This doesn’t fully work for g++, which still thinks the list is
ambiguous even though explicitly using the single name structure Lexicon::Pair doesn’t
compile. That is, it only compiles in the g++ compiler’s imagination, not in actual code.
As for the idea of using variadic templates to pick off the field definitions one by one, that doesn’t work because the compiler needs to decide the type of all the arguments before picking the constructor, but it can’t do that until after it’s already picked the variadic constructor.
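The two iterator constructor at the root of this note can be probed directly. The sketch below is only an illustration. `HAS_TWO_POINTER_CTOR` and `half_open` are hypothetical names, and the value of the constant depends on the language standard and standard library used to compile it: it is expected to be false for C++17 and true where the C++20 constructor is available.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <type_traits>

// True when std::string_view can be constructed from two character pointers,
// that is, when the two iterator constructor is available to the compiler.
constexpr bool HAS_TWO_POINTER_CTOR =
  std::is_constructible_v<std::string_view, char const *, char const *>;

// The same half open range [first, last), written with the pointer and length
// constructor so that it also compiles as C++17.
inline std::string_view half_open(char const *first, char const *last) {
  assert(first <= last); // both pointers must refer to the same array
  return std::string_view(first, static_cast<std::size_t>(last - first));
}
```

When the constant is true, a braced pair of string literals has a second possible meaning as a single view over a pointer range, which is the source of the ambiguity described in this note.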
Examples
For illustrative purposes, consider using IPSpace where each address has a set of flags
representing the type of address, such as production, edge, secure, etc. This is stored in memory
as a std::bitset. To load up the data a comma separated value file is provided which has the
first column as the IP address range and the subsequent values are flag names.
The starting point is an enumeration with the address types:
enum class NetType {
  EXTERNAL = 0, // 0x1
  PROD,         // 0x2
  SECURE,       // 0x4
  EDGE,         // 0x8
  INVALID
};
To do conversions a Lexicon is created:
swoc::Lexicon<NetType> const NetTypeNames{
  {{NetType::EXTERNAL, "external"}, {NetType::PROD, "prod"}, {NetType::SECURE, "secure"}, {NetType::EDGE, "edge"}},
  NetType::INVALID // default value for undefined name
};
The file loading and parsing is then:
// Process all the lines in the file.
while (text) {
  auto line       = text.take_prefix_at('\n').trim_if(&isspace);
  auto addr_token = line.take_prefix_at(','); // first token is the range.
  swoc::IPRange r{addr_token};
  if (!r.empty()) { // empty means failed parse.
    Flags flags;
    while (line) { // parse out the rest of the comma separated elements
      auto token = line.take_prefix_at(',');
      auto idx   = NetTypeNames[token];
      if (idx != NetType::INVALID) {      // one of the valid strings
        flags.set(static_cast<int>(idx)); // set the bit
      }
    }
    space.mark(r, flags); // store the flags in the space.
  }
}
with the simulated file contents
swoc::TextView text{R"(
10.0.0.2-10.0.0.254,edge
10.12.0.0/25,prod
10.15.37.10-10.15.37.99,prod,secure
172.19.0.0/22,external,secure
192.168.18.0/23,external,prod
)"};
This uses the Lexicon to convert the strings in the file to the enumeration values, which are the
bitset indices. The default is set to INVALID so that any string that doesn’t match a string
in the Lexicon is mapped to INVALID.
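The flag parsing step can be tried in isolation with the standard library alone. In this sketch, `lookup` is a hypothetical stand-in for the NetTypeNames lookup, `parse_flags` and `has` are helper names invented for the illustration, and the definition of `Flags` as a four bit std::bitset is an assumption, since the original example does not show it.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <map>
#include <string_view>

enum class NetType { EXTERNAL = 0, PROD, SECURE, EDGE, INVALID };

// Assumed definition: one bit per valid address type.
using Flags = std::bitset<static_cast<std::size_t>(NetType::INVALID)>;

// Stand-in for the name lookup; unknown names map to the default, INVALID.
inline NetType lookup(std::string_view name) {
  static std::map<std::string_view, NetType> const NAMES{
    {"external", NetType::EXTERNAL}, {"prod", NetType::PROD},
    {"secure", NetType::SECURE},     {"edge", NetType::EDGE}};
  auto spot = NAMES.find(name);
  return spot != NAMES.end() ? spot->second : NetType::INVALID;
}

// Convert text such as "prod,secure" into a flag set, skipping unknown names.
inline Flags parse_flags(std::string_view text) {
  Flags flags;
  while (!text.empty()) {
    auto comma = text.find(',');
    auto token = text.substr(0, comma); // up to the comma, or everything left
    text.remove_prefix(comma == std::string_view::npos ? text.size() : comma + 1);
    auto idx = lookup(token);
    if (idx != NetType::INVALID) {
      flags.set(static_cast<std::size_t>(idx)); // enumeration value is the bit index
    }
  }
  return flags;
}

// Test one flag; a scoped enumeration needs an explicit cast to index a bitset.
inline bool has(Flags const &flags, NetType type) {
  assert(type != NetType::INVALID);
  return flags.test(static_cast<std::size_t>(type));
}
```

Using the invalid value as the lookup default keeps the loop body to a single comparison, which is the benefit the paragraph above describes.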
Once the IP Space is loaded, lookup is simple, given an address:
auto [range, flags] = *space.find(addr);
At this point flags has the set of flags stored for that address from the original data. Data
can be accessed like
if (flags[static_cast<int>(NetType::PROD)]) { ... }
The example ex_host_file.cc processes a standard host file into a lexicon that enables forward and reverse lookups. A name can be used to find an address and an address can be used to find the first name with that address.
Design Notes
Lexicon was designed to solve a common problem I had with converting between enumerations and
strings. Simple arrays, as noted in the introduction, were not adequate, particularly for
parsing. There was also some influence from internationalization efforts where the Lexicon could be
loaded with other languages. Secondary names have proven useful for parsing, allowing easy aliases
for the enumeration (e.g., for a boolean, the names for true can be a list like “yes”, “1”,
“enable”, etc.).