.. Licensed to the Apache Software Foundation (ASF) under one or more contributor license
   agreements.  See the NOTICE file distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file to you under the Apache License,
   Version 2.0 (the "License"); you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software distributed under the
   License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
   either express or implied.  See the License for the specific language governing permissions
   and limitations under the License.

.. include:: ../common-defs.rst
.. default-domain:: cpp
.. highlight:: cpp
.. |TV| replace:: :code:`TextView`
.. |SV| replace:: :code:`std::string_view`
.. _string-view: https://en.cppreference.com/w/cpp/string/basic_string_view

********
TextView
********

Synopsis
********

:code:`#include "swoc/TextView.h"`

.. class:: TextView

   :libswoc:`Reference documentation `.

This class acts as a view of memory allocated / owned elsewhere and treated as a sequence of 8-bit
characters. It is in effect a pointer and should be treated as such (e.g. care must be taken to
avoid dangling references by knowing where the memory really is). The purpose is to provide string
manipulation that is safer than raw pointers and much faster than duplicating strings.

Usage
*****

|TV| is a subclass of `std::string_view `_ and inherits all of its methods. The additional
functionality of |TV| is for easy string manipulation, with an emphasis on fast parsing of string
data. As noted, an instance of |TV| is a pointer and needs to be handled as such. It does not own
the memory and therefore, like a pointer, care must be taken that the memory is not deallocated
while the |TV| still references it.
The advantage of this is that creating new views and modifying existing ones is very cheap.

Any place that passes a :code:`char *` and a size is an excellent candidate for using a |TV|. Code
that uses functions such as :code:`strtok` or tracks pointers and offsets internally is an
excellent candidate for using |TV| instead.

Because |TV| is a subclass of :code:`std::string_view` it can be unclear which is the better
choice. In many cases it doesn't matter: because of this relationship, converting between the
types is at most as expensive as a copy of the same type and, for a constant reference, can be
free. In general if the string is treated as a block of data, :code:`std::string_view` is a better
choice. If the contents of the string are to be examined / parsed then |TV| is better. For example,
if the string is used simply as a key or a hash source, use :code:`std::string_view`. Contrariwise
if the string may contain substrings of interest such as key / value pairs, then use a |TV|. That
said, I do sometimes use |TV| simply because |SV| lacks support for instance reuse, e.g. it has no
:code:`assign` or :code:`clear` methods.

When passing |TV| as an argument, it is very debatable whether passing by value or passing by
reference is more efficient. The appropriate conclusion is that it's not likely to matter in
production code. My personal heuristic is whether the function will modify the value. If so, pass
by value, because that saves making a copy in a local variable. If the function simply passes the
|TV| on to other functions, then pass by constant reference. This distinction is irrelevant to the
caller; the same code at the call site will work in either case.

As noted, |TV| is designed as a pointer style class. Therefore it has an increment operator which
is equivalent to :code:`std::string_view::remove_prefix`. |TV| also has a dereference operator,
which acts the same way as on a pointer.
The difference is that the view knows where the end of the view is. This provides a comfortably
familiar way of iterating through a view, the main difference being checking the view itself
rather than a dereference of it (like a C-style string) or a range limit. E.g. the code to write a
simple hash function [#]_ could be

.. code-block:: cpp

   size_t hasher(TextView v) {
     size_t hash = 0;
     while (v) { // true while the view is not empty.
       hash = hash * 13 + *v++;
     }
     return hash;
   }

Alternatively, this can be done in a non-modifying way.

.. code-block:: cpp

   size_t hasher(TextView v) {
     size_t hash = 0;
     for (auto c : v) {
       hash = hash * 13 + c;
     }
     return hash;
   }

Because |TV| inherits from :code:`std::string_view` it can also be used as a container for range
:code:`for` loops.

.. code-block:: cpp

   size_t hasher(TextView const& v) {
     size_t hash = 0;
     for (char c : v) hash = hash * 13 + c;
     return hash;
   }

The first approach enables dropping out of the loop on some condition with the view updated to no
longer contain processed characters, making restart or other processing simple.

The standard functions :code:`strcmp`, :code:`memcmp`, :code:`memcpy`, and :code:`strcasecmp` are
overloaded for |TV| so that a |TV| can be used as if it were a C-style string. The size is taken
from the |TV| and doesn't need to be passed in explicitly.

.. class:: CharSet

   :libswoc:`Reference documentation `.

This is a simple class that contains a set of characters. This is intended primarily to make
parsing faster and simpler. Rather than checking a list of delimiters the character can be checked
with a single :code:`std::bitset` lookup.

Basic Operations
================

|TV| is essentially a collection of operations which have been found to be common and useful in
manipulating contiguous blocks of text.

Construction
------------

Constructing a view means creating a view from another object which owns the memory (for creating
views from other views see `Extraction`_).
This can be a :code:`char const*` pointer and size, two pointers, a literal string, a
:code:`std::string` or a :code:`std::string_view`, although in the last case there is presumably
yet another object that actually owns the memory. All of these constructors require only the
equivalent of two assignment statements. The one thing to be careful of is that if a literal string
or C-string is used, the resulting |TV| will drop the terminating nul character from the view. This
is almost always the correct behavior, but if it isn't, an explicit size can be used.

A |TV| can be constructed from a null :code:`char const*` pointer or a straight :code:`nullptr`.
This will construct an empty |TV| identical to one default constructed.

|TV| supports a generic constructor that will accept any class that provides :code:`data` and
:code:`size` methods that return values convertible to :code:`char const *` and :code:`size_t`.
This enables greater interoperability with other libraries, as any well written C++ library with
its own string class will have these methods implemented sensibly.

Searching
---------

Because |TV| is a subclass of :code:`std::string_view` all of its search methods work on a |TV|.
The only search methods provided beyond those in :code:`std::string_view` are
:libswoc:`TextView::find_if` and :libswoc:`TextView::rfind_if` which search the view by a
predicate. The predicate takes a single :code:`char` argument and returns a :code:`bool`. The
search terminates on the first character for which the predicate returns :code:`true`.

Extraction
----------

Extraction is creating a new view from an existing view. Because views cannot in general be
expanded, new views will be sub-sequences of existing views. This is the primary utility of a |TV|.
As noted in the `general description `_ |TV| supports copying or removing prefixes and suffixes of
the view. All of this is possible using the underlying :code:`std::string_view::substr` but this is
frequently much clumsier.
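
As a rough illustration of that clumsiness, here is a minimal sketch that pulls separator-delimited
tokens out of a view using nothing but :code:`std::string_view`. The helper name
:code:`split_with_substr` is hypothetical and is not part of libswoc. Note how the search, the
not-found case, and skipping the separator each need explicit handling.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of libswoc): tokenize using only std::string_view.
std::vector<std::string_view> split_with_substr(std::string_view src, char sep) {
  std::vector<std::string_view> tokens;
  while (!src.empty()) {
    std::size_t idx = src.find(sep);
    if (idx == std::string_view::npos) {
      // No separator left: the remainder is the final token.
      tokens.push_back(src);
      src = std::string_view{};
    } else {
      // Copy out the text before the separator, then drop both it and the separator.
      tokens.push_back(src.substr(0, idx));
      src.remove_prefix(idx + 1);
    }
  }
  return tokens;
}
```

For example, :code:`split_with_substr("a,b,,c", ',')` yields the tokens ``a``, ``b``, an empty
token, and ``c``. The "take..." methods described below fold the search, the not-found case, and
the separator removal into a single call.
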
The development of |TV| was driven to a large extent by the desire to make such code much more
compact and expressive, while being at least as safe. In particular extraction methods on |TV| do
useful and well defined things when given out of bounds arguments. This is quite handy when
extracting tokens based on separator characters.

The primary distinction is how a character in the view is selected.

* By index, an offset into the view. These have plain names, such as :libswoc:`TextView::prefix`.

* By character comparison, either a single character or set of characters which is matched against
  a single character in the view. These are suffixed with "at" such as
  :libswoc:`TextView::prefix_at`.

* By predicate, a function that takes a single character argument and returns a bool to indicate a
  match. These are suffixed with "if", such as :libswoc:`TextView::prefix_if`.

A secondary distinction is what is done to the view by the methods.

* The base methods make a new view without modifying the existing view.

* The "split..." methods remove the corresponding part of the view and return it. The selected
  character is discarded and left in neither the returned view nor the source view. If the selected
  character is not in the view, an empty view is returned and the source view is not modified.

* The "take..." methods remove the corresponding part of the view and return it. The selected
  character is discarded and left in neither the returned view nor the source view. If the selected
  character is not in the view, the entire view is returned and the source view is cleared.

* The "clip..." methods remove the corresponding part of the view and return it. Only those
  characters are removed, in contrast to "split..." and "take..." which drop a (presumed)
  separator. If the first character doesn't match, the view is not modified and an empty view is
  returned. These are very similar to the "trim..."
  methods described below, the difference being which part of the original view is returned.

.. _`std::string_view::remove_prefix`: https://en.cppreference.com/w/cpp/string/basic_string_view/remove_prefix
.. _`std::string_view::remove_suffix`: https://en.cppreference.com/w/cpp/string/basic_string_view/remove_suffix

This is a table of the affix oriented methods, grouped by the properties of the methods. "Bounded"
indicates whether the operation requires the target character, however specified, to be within the
bounds of the view. A bounded method does nothing if the target character is not in the view. On
this note, :code:`remove_prefix` and :code:`remove_suffix` are implemented differently in |TV|
compared to :code:`std::string_view`. Rather than the behavior being undefined, the methods will
clear the view if the size specified is larger than the contents of the view.

+-----------------+--------+---------+------------------------------------------+
| Operation       | Affix  | Bounded | Method                                   |
+=================+========+=========+==========================================+
| Copy            | Prefix | No      | :libswoc:`TextView::prefix`              |
|                 +        +---------+------------------------------------------+
|                 |        | Yes     | :libswoc:`TextView::prefix_at`           |
|                 +        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::prefix_if`           |
|                 +--------+---------+------------------------------------------+
|                 | Suffix | No      | :libswoc:`TextView::suffix`              |
|                 +        +---------+------------------------------------------+
|                 |        | Yes     | :libswoc:`TextView::suffix_at`           |
|                 +        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::suffix_if`           |
+-----------------+--------+---------+------------------------------------------+
| Modify          | Prefix | No      | `std::string_view::remove_prefix`_       |
|                 |        +---------+------------------------------------------+
|                 |        | Yes     | :libswoc:`TextView::remove_prefix_at`    |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::remove_prefix_if`    |
|                 +--------+---------+------------------------------------------+
|                 | Suffix | No      | `std::string_view::remove_suffix`_       |
|                 |        +---------+------------------------------------------+
|                 |        | Yes     | :libswoc:`TextView::remove_suffix_at`    |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::remove_suffix_if`    |
+-----------------+--------+---------+------------------------------------------+
| Modify and Copy | Prefix | Yes     | :libswoc:`TextView::split_prefix`        |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::split_prefix_at`     |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::split_prefix_if`     |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::clip_prefix_of`      |
|                 |        +---------+------------------------------------------+
|                 |        | No      | :libswoc:`TextView::take_prefix`         |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::take_prefix_at`      |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::take_prefix_if`      |
|                 +--------+---------+------------------------------------------+
|                 | Suffix | Yes     | :libswoc:`TextView::split_suffix`        |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::split_suffix_at`     |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::split_suffix_if`     |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::clip_suffix_of`      |
|                 |        +---------+------------------------------------------+
|                 |        | No      | :libswoc:`TextView::take_suffix`         |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::take_suffix_at`      |
|                 |        +         +------------------------------------------+
|                 |        |         | :libswoc:`TextView::take_suffix_if`      |
+-----------------+--------+---------+------------------------------------------+

Other
-----

The comparison operators for |TV| are inherited from :code:`std::string_view` and therefore use the content of the
view to determine the relationship.

|TV| provides a collection of "trim" methods which remove leading or trailing characters. These
have similar suffixes with the same meaning as the affix methods. This can be done for a single
character, one of a set of characters, or a predicate. The most common use is with the predicate
:code:`isspace` which removes leading and/or trailing whitespace as needed. While the plethora of
view methods can seem a bit much, all of these are useful in different situations and exist because
of such use cases.

Numeric conversions are provided, in signed (:libswoc:`svtoi`), unsigned (:libswoc:`svtou`), and
floating point (:libswoc:`svtod`) flavors. The integer functions are designed to be "complete" in
the sense that any other string to integer conversion can be mapped to one of these functions. The
floating point conversion is sufficiently accurate: it will return a floating point value that is
within one epsilon of the exact value, but not always the closest. This is fine for general use
such as in configurations, but possibly not quite enough for high precision work.

The standard functions :code:`strcmp`, :code:`strcasecmp`, and :code:`memcmp` are overloaded when
at least one of the parameters is a |TV|. The length is taken from the view, rather than being an
explicit parameter as with :code:`strncasecmp`.

When no other useful result can be returned, |TV| methods return a reference to the instance. This
makes chaining methods easy. For example, suppose a list consists of colon separated elements, each
of the form "A.B.old", and just the "A.B" part is needed, sans leading white space:

.. literalinclude:: ../../unit_tests/ex_TextView.cc
   :lines: 223-227

Parsing with TextView
=====================

Time for some examples demonstrating string parsing using |TV|. There are two major reasons for
developing |TV| parsing.

The first was to minimize the need to allocate memory to hold intermediate results.
For this reason, the normal style of use is a streaming / incremental one, where tokens are
extracted from a source one by one and placed in |TV| instances, with the original source |TV|
being reduced by each extraction until it is empty.

The second was to minimize cut and paste coding. Typical C or C++ parsing logic consists mostly of
very generic code to handle pointer and size updates. The point of |TV| is to automate all of that,
yielding code focused entirely on the parsing logic, not boilerplate string or view manipulation.
It is common to get such hand-written code not exactly correct, leading to hard to track bugs. Use
of |TV| eliminates those problems.

The minimization of exceptions on sizes beyond the view boundaries was done primarily to help
parsing. It noticeably simplifies the logic if excessive removal or advancement yields an empty
view rather than an exception.

CSV Example
-----------

For example, assume :arg:`value` contains a null terminated string which is expected to be tokens
separated by commas. To handle this generically a function could be written which takes a token
handler and calls it for each token.

.. literalinclude:: ../../unit_tests/ex_TextView.cc
   :start-after: doc csv start
   :end-before: doc csv end

If :arg:`value` was :literal:`"bob ,dave, sam"` then :arg:`token` would be successively
:literal:`bob`, :literal:`dave`, :literal:`sam`. Each loop iteration is guaranteed to remove text
from :arg:`src`, making the loop eventually terminate when all text has been removed, because an
empty :code:`TextView` is :code:`false`. This is a recommended style because :code:`TextView`
instances are very cheap to copy. This is essentially the same as having a current pointer and an
end pointer and checking for :code:`current >= end`, except :code:`TextView` does all the work,
leading to simpler and less buggy code.

White space is dropped because of the calls to :code:`ltrim_if` and :code:`rtrim_if`.
By calling :code:`ltrim_if` in the loop condition, the loop exits if the remaining text is only
whitespace and no token is processed. Alternatively :code:`trim_if` could be used after extraction.
The performance will be *slightly* better because although :code:`trim_if` calls :code:`ltrim_if`
and :code:`rtrim_if`, a final token extraction on trailing whitespace will be avoided. In practice
it won't make a difference; do what's convenient.

It could be tempting to squeeze the code a bit to be

.. literalinclude:: ../../unit_tests/ex_TextView.cc
   :start-after: doc csv non-empty start
   :end-before: doc csv non-empty end

However this causes a significant behavior difference: the loop terminates on an empty token
because that token will be :code:`false`. That is, this will work only if there is a guarantee of
no empty tokens (such as those produced by adjacent separators).

Key / Value Example
-------------------

A similar case is parsing a list of key / value pairs in a comma separated list. Each pair is
"key=value" where white space is ignored. In this case it is also permitted to have just a keyword
for values that are boolean.

.. literalinclude:: ../../unit_tests/ex_TextView.cc
   :start-after: doc kv start
   :end-before: doc kv end

.. sidebar:: Verification

   `Test code for example `__.

The basic list processing is the same as the previous example, extracting each comma separated
element. The resulting element is treated as a "list" with ``=`` as the separator. Note if there is
no ``=`` character then all of the list element is moved to :arg:`key` leaving :arg:`value` empty,
which is the desired result. A bit of extra white space trimming is done in case there was space
next to the ``=``.

Line Processing
---------------

|TV| works well when parsing lines from a file. For this example, :libswoc:`load` will be used.
This method, given a path, loads the entire content of the file into a :code:`std::string`. This
will serve as the owner of the string memory.
If it is kept around with the configuration, all of the parsed strings can be instances of |TV|
that reference memory in that :code:`std::string`. If the density of useful text is sufficiently
high, this is a convenient way to handle parsing with minimal memory allocations.

This example counts the number of code lines in the documentation's ``conf.py`` file.

.. literalinclude:: ../../unit_tests/ex_TextView.cc
   :lines: 203-217

The |TV| :arg:`src` is constructed from the :code:`std::string` :arg:`content` which contains the
file contents. While that view is not empty, a line is taken each loop and leading and trailing
whitespace is trimmed. If this results in an empty view or one where the first character is the
Python comment character ``#`` it is not counted. The newlines are discarded by the prefix
extraction. The use of :libswoc:`TextView::take_prefix_at` forces the extraction of text even if
there is no final newline. If this were a file of key value pairs, then :arg:`line` would be
subjected to one of the other examples to extract the values. For all of this, there is only one
memory allocation, that needed for :arg:`content` to load the file contents.

Entity Tag Lists Example
------------------------

An example from actual production code is this one, which parses a quoted, comma separated list of
values ("CSV"). This is used for parsing `entity tags `__ as used for HTTP fields such as
"If-Match" (`14.24 `__). This will be a CSV where each value is quoted. To make it interesting
these quoted strings may contain commas, which do not count as separators. Therefore the simple
approach in previous examples will not work in all cases. This example also does not use the
callback style of the previous examples. Instead the tokens are pulled off in a streaming style
with the source :code:`TextView` being passed by reference in order to be updated by the tokenizer.
Further, some callers want the quotes, and some do not, so a flag to strip quotes from the
resulting elements is needed. The final result looks like

.. literalinclude:: ../../unit_tests/ex_TextView.cc
   :start-after: "TextView Tokens"
   :lines: 2-26

.. sidebar:: Verification

   `Test code for example `__.

This takes a :code:`TextView&` which is the source view which will be updated as tokens are removed
(therefore the caller must do the empty view check). The other arguments are the separator
character and the "strip quotes" flag. The algorithm is to find the next "interesting" character,
which is either a separator or a quote. Quotes flip the "in quote" flag back and forth, and
separators terminate the loop if the "in quote" flag is not set. This skips quoted separators. If
neither is found then all of the view is returned as the result. Whitespace is always trimmed and
then quotes are trimmed if requested, before the view is returned. In this case keeping an offset
of the amount of the source view processed is the most convenient mechanism for tracking progress.
The result is a fairly compact piece of code that does non-trivial parsing and conversion on a
source string, without a lot of complex parsing state, and no memory allocation.

History
*******

The first attempt at this functionality was in the TSConfig library in the :code:`ts::Buffer` and
:code:`ts::ConstBuffer` classes. Originally intended just as raw memory views,
:code:`ts::ConstBuffer` in particular was repeatedly enhanced to provide better support for
strings. The header was eventually moved from :literal:`lib/tsconfig` to :literal:`lib/ts` and was
used in various parts of the Traffic Server core.

There was then a proposal to make these classes available to plugin writers as they proved handy in
the core. A suggested alternative was `Boost.StringRef `_ which provides similar functionality
using :code:`std::string` as the base of the pre-allocated memory.
A version of the header was ported to Traffic Server (by stripping all the Boost support and cross
includes) but in use proved to provide little of the functionality available in
:code:`ts::ConstBuffer`. If extensive reworking was required in any case, it seemed better to start
from scratch and build just what was useful in the Traffic Server context.

The next step was the :code:`TextView` class which turned out reasonably well. About this time
:code:`std::string_view` was officially adopted for C++17, which was a bit of a problem because
:code:`TextView` was extremely similar in functionality but quite different in interface. Further,
it had a number of quite useful methods that were not in :code:`std::string_view`. To simplify the
use of :code:`TextView` (which was actually called "StringView" then) it was made a subclass of
:code:`std::string_view` with user defined conversions so that the two classes could be used almost
interchangeably in an efficient way. Passing a :code:`TextView` to a
:code:`std::string_view const&` has zero marginal cost because of inheritance, and passing by value
is also no more expensive than for a plain :code:`std::string_view`.

.. rubric:: Footnotes

.. [#] This is a horrible hash function, do not actually use it.