WPILibC++ 2024.1.1-beta-4
ConvertUTF.h
/*===--- ConvertUTF.h - Universal Character Names conversions ---------------===
 *
 * Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 * See https://llvm.org/LICENSE.txt for license information.
 * SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 *
 *==------------------------------------------------------------------------==*/
/*
 * Copyright © 1991-2015 Unicode, Inc. All rights reserved.
 * Distributed under the Terms of Use in
 * http://www.unicode.org/copyright.html.
 *
 * Permission is hereby granted, free of charge, to any person obtaining
 * a copy of the Unicode data files and any associated documentation
 * (the "Data Files") or Unicode software and any associated documentation
 * (the "Software") to deal in the Data Files or Software
 * without restriction, including without limitation the rights to use,
 * copy, modify, merge, publish, distribute, and/or sell copies of
 * the Data Files or Software, and to permit persons to whom the Data Files
 * or Software are furnished to do so, provided that
 * (a) this copyright and permission notice appear with all copies
 * of the Data Files or Software,
 * (b) this copyright and permission notice appear in associated
 * documentation, and
 * (c) there is clear notice in each modified Data File or in the Software
 * as well as in the documentation associated with the Data File(s) or
 * Software that the data or software has been modified.
 *
 * THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF
 * ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
 * WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT OF THIRD PARTY RIGHTS.
 * IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS
 * NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL
 * DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE,
 * DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
 * TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
 * PERFORMANCE OF THE DATA FILES OR SOFTWARE.
 *
 * Except as contained in this notice, the name of a copyright holder
 * shall not be used in advertising or otherwise to promote the sale,
 * use or other dealings in these Data Files or Software without prior
 * written authorization of the copyright holder.
 */

/* ---------------------------------------------------------------------

    Conversions between UTF-32, UTF-16, and UTF-8. Header file.

    Several functions are included here, forming a complete set of
    conversions between the three formats. UTF-7 is not included
    here, but is handled in a separate source file.

    Each of these routines takes pointers to input buffers and output
    buffers. The input buffers are const.

    Each routine converts the text between *sourceStart and sourceEnd,
    putting the result into the buffer between *targetStart and
    targetEnd. Note: the end pointers are *after* the last item: e.g.
    *(sourceEnd - 1) is the last item.

    The return result indicates whether the conversion was successful,
    and if not, whether the problem was in the source or target buffers.
    (Only the first encountered problem is indicated.)

    After the conversion, *sourceStart and *targetStart are both
    updated to point to the end of the last text successfully converted
    in the respective buffers.

    Input parameters:
        sourceStart - pointer to a pointer to the source buffer.
                The contents of this are modified on return so that
                it points at the next thing to be converted.
        targetStart - similarly, pointer to pointer to the target buffer.
        sourceEnd, targetEnd - respectively pointers to the ends of the
                two buffers, for overflow checking only.

    These conversion functions take a ConversionFlags argument. When this
    flag is set to strict, both irregular sequences and isolated surrogates
    will cause an error. When the flag is set to lenient, both irregular
    sequences and isolated surrogates are converted.

    Whether the flag is strict or lenient, all illegal sequences will cause
    an error return. This includes sequences such as: <F4 90 80 80>, <C0 80>,
    or <A0> in UTF-8, and values above 0x10FFFF in UTF-32. Conformant code
    must check for illegal sequences.

    When the flag is set to lenient, characters over 0x10FFFF are converted
    to the replacement character; otherwise (when the flag is set to strict)
    they constitute an error.

    Output parameters:
        The value "sourceIllegal" is returned from some routines if the input
        sequence is malformed. When "sourceIllegal" is returned, the source
        value will point to the illegal value that caused the problem. E.g.,
        in UTF-8 when a sequence is malformed, it points to the start of the
        malformed sequence.

    Author: Mark E. Davis, 1994.
    Rev History: Rick McGowan, fixes & updates May 2001.
         Fixes & updates, Sept 2001.

------------------------------------------------------------------------ */

#ifndef WPIUTIL_WPI_CONVERTUTF_H
#define WPIUTIL_WPI_CONVERTUTF_H

#include <cstddef>
#include <string>
#include <span>
#include <string_view>
#include <system_error>

// Wrap everything in namespace wpi so that programs can link with llvm and
// their own version of the unicode libraries.

namespace wpi {

/* ---------------------------------------------------------------------
    The following 4 definitions are compiler-specific.
    The C standard does not guarantee that wchar_t has at least
    16 bits, so wchar_t is no less portable than unsigned short!
    All should be unsigned values to avoid sign extension during
    bit mask & shift operations.
------------------------------------------------------------------------ */

typedef unsigned int    UTF32;   /* at least 32 bits */
typedef unsigned short  UTF16;   /* at least 16 bits */
typedef unsigned char   UTF8;    /* typically 8 bits */
typedef bool            Boolean; /* 0 or 1 */

/* Some fundamental constants */
#define UNI_REPLACEMENT_CHAR (UTF32)0x0000FFFD
#define UNI_MAX_BMP (UTF32)0x0000FFFF
#define UNI_MAX_UTF16 (UTF32)0x0010FFFF
#define UNI_MAX_UTF32 (UTF32)0x7FFFFFFF
#define UNI_MAX_LEGAL_UTF32 (UTF32)0x0010FFFF

#define UNI_MAX_UTF8_BYTES_PER_CODE_POINT 4

#define UNI_UTF16_BYTE_ORDER_MARK_NATIVE  0xFEFF
#define UNI_UTF16_BYTE_ORDER_MARK_SWAPPED 0xFFFE

#define UNI_UTF32_BYTE_ORDER_MARK_NATIVE  0x0000FEFF
#define UNI_UTF32_BYTE_ORDER_MARK_SWAPPED 0xFFFE0000

typedef enum {
  conversionOK,           /* conversion successful */
  sourceExhausted,        /* partial character in source, but hit end */
  targetExhausted,        /* insuff. room in target for conversion */
  sourceIllegal           /* source sequence is illegal/malformed */
} ConversionResult;

typedef enum {
  strictConversion = 0,
  lenientConversion
} ConversionFlags;

ConversionResult ConvertUTF8toUTF16(
  const UTF8** sourceStart, const UTF8* sourceEnd,
  UTF16** targetStart, UTF16* targetEnd, ConversionFlags flags);

/**
 * Convert a partial UTF8 sequence to UTF32. If the sequence ends in an
 * incomplete code unit sequence, returns \c sourceExhausted.
 */
ConversionResult ConvertUTF8toUTF32Partial(
  const UTF8** sourceStart, const UTF8* sourceEnd,
  UTF32** targetStart, UTF32* targetEnd, ConversionFlags flags);

/**
 * Convert a partial UTF8 sequence to UTF32. If the sequence ends in an
 * incomplete code unit sequence, returns \c sourceIllegal.
 */
ConversionResult ConvertUTF8toUTF32(
  const UTF8** sourceStart, const UTF8* sourceEnd,
  UTF32** targetStart, UTF32* targetEnd, ConversionFlags flags);

ConversionResult ConvertUTF16toUTF8(
  const UTF16** sourceStart, const UTF16* sourceEnd,
  UTF8** targetStart, UTF8* targetEnd, ConversionFlags flags);

ConversionResult ConvertUTF32toUTF8(
  const UTF32** sourceStart, const UTF32* sourceEnd,
  UTF8** targetStart, UTF8* targetEnd, ConversionFlags flags);

ConversionResult ConvertUTF16toUTF32(
  const UTF16** sourceStart, const UTF16* sourceEnd,
  UTF32** targetStart, UTF32* targetEnd, ConversionFlags flags);

ConversionResult ConvertUTF32toUTF16(
  const UTF32** sourceStart, const UTF32* sourceEnd,
  UTF16** targetStart, UTF16* targetEnd, ConversionFlags flags);

Boolean isLegalUTF8Sequence(const UTF8 *source, const UTF8 *sourceEnd);

Boolean isLegalUTF8String(const UTF8 **source, const UTF8 *sourceEnd);

unsigned getUTF8SequenceSize(const UTF8 *source, const UTF8 *sourceEnd);

unsigned getNumBytesForUTF8(UTF8 firstByte);

/*************************************************************************/
/* Below are LLVM-specific wrappers of the functions above. */

template <typename T> class SmallVectorImpl;

/**
 * Convert a UTF8 string_view to UTF8, UTF16, or UTF32 depending on
 * WideCharWidth. The converted data is written to ResultPtr, which needs to
 * point to at least WideCharWidth * (Source.size() + 1) bytes. On success,
 * ResultPtr will point one after the end of the copied string. On failure,
 * ResultPtr will not be changed, and ErrorPtr will be set to the location of
 * the first character which could not be converted.
 * \return true on success.
 */
bool ConvertUTF8toWide(unsigned WideCharWidth, std::string_view Source,
                       char *&ResultPtr, const UTF8 *&ErrorPtr);

/**
 * Converts a UTF-8 string_view to a std::wstring.
 * \return true on success.
 */
bool ConvertUTF8toWide(std::string_view Source, std::wstring &Result);

/**
 * Converts a UTF-8 C-string to a std::wstring.
 * \return true on success.
 */
bool ConvertUTF8toWide(const char *Source, std::wstring &Result);

/**
 * Converts a std::wstring to UTF-8 encoded text, stored in \p Result.
 * \return true on success.
 */
bool convertWideToUTF8(const std::wstring &Source, SmallVectorImpl<char> &Result);


/**
 * Convert a Unicode code point to a UTF8 sequence.
 *
 * \param Source a Unicode code point.
 * \param [in,out] ResultPtr pointer to the output buffer, needs to be at least
 * \c UNI_MAX_UTF8_BYTES_PER_CODE_POINT bytes. On success \c ResultPtr is
 * updated to point one past the end of the converted sequence.
 *
 * \returns true on success.
 */
bool ConvertCodePointToUTF8(unsigned Source, char *&ResultPtr);

/**
 * Convert the first UTF8 sequence in the given source buffer to a UTF32
 * code point.
 *
 * \param [in,out] source A pointer to the source buffer. If the conversion
 * succeeds, this pointer will be updated to point to the byte just past the
 * end of the converted sequence.
 * \param sourceEnd A pointer just past the end of the source buffer.
 * \param [out] target The converted code point.
 * \param flags Whether the conversion is strict or lenient.
 *
 * \returns conversionOK on success
 *
 * \sa ConvertUTF8toUTF32
 */
inline ConversionResult convertUTF8Sequence(const UTF8 **source,
                                            const UTF8 *sourceEnd,
                                            UTF32 *target,
                                            ConversionFlags flags) {
  if (*source == sourceEnd)
    return sourceExhausted;
  unsigned size = getNumBytesForUTF8(**source);
  if ((ptrdiff_t)size > sourceEnd - *source)
    return sourceExhausted;
  return ConvertUTF8toUTF32(source, *source + size, &target, target + 1, flags);
}

/**
 * Returns true if a blob of text starts with a UTF-16 big or little endian byte
 * order mark.
 */
bool hasUTF16ByteOrderMark(std::span<const char> SrcBytes);

/**
 * Converts a stream of raw bytes assumed to be UTF16 into UTF8 text.
 *
 * \param [in] SrcBytes A buffer of what is assumed to be UTF-16 encoded text.
 * \param [out] Out Converted UTF-8 is stored here on success.
 * \returns true on success
 */
bool convertUTF16ToUTF8String(std::span<const char> SrcBytes, SmallVectorImpl<char> &Out);

/**
 * Converts a UTF16 string into UTF8 text.
 *
 * \param [in] Src A buffer of UTF-16 encoded text.
 * \param [out] Out Converted UTF-8 is stored here on success.
 * \returns true on success
 */
bool convertUTF16ToUTF8String(std::span<const UTF16> Src, SmallVectorImpl<char> &Out);

/**
 * Converts a stream of raw bytes assumed to be UTF32 into a UTF8 std::string.
 *
 * \param [in] SrcBytes A buffer of what is assumed to be UTF-32 encoded text.
 * \param [out] Out Converted UTF-8 is stored here on success.
 * \returns true on success
 */
bool convertUTF32ToUTF8String(std::span<const char> SrcBytes, std::string &Out);

/**
 * Converts a UTF32 string into a UTF8 std::string.
 *
 * \param [in] Src A buffer of UTF-32 encoded text.
 * \param [out] Out Converted UTF-8 is stored here on success.
 * \returns true on success
 */
bool convertUTF32ToUTF8String(std::span<const UTF32> Src, std::string &Out);

/**
 * Converts a UTF-8 string into a UTF-16 string with native endianness.
 *
 * \returns true on success
 */
bool convertUTF8ToUTF16String(std::string_view SrcUTF8,
                              SmallVectorImpl<UTF16> &DstUTF16);

#if defined(_WIN32)
namespace sys {
namespace windows {
std::error_code UTF8ToUTF16(std::string_view utf8, SmallVectorImpl<wchar_t> &utf16);
/// Convert to UTF16 from the current code page used in the system
std::error_code CurCPToUTF16(std::string_view utf8, SmallVectorImpl<wchar_t> &utf16);
std::error_code UTF16ToUTF8(const wchar_t *utf16, size_t utf16_len,
                            SmallVectorImpl<char> &utf8);
/// Convert from UTF16 to the current code page used in the system
std::error_code UTF16ToCurCP(const wchar_t *utf16, size_t utf16_len,
                             SmallVectorImpl<char> &utf8);
} // namespace windows
} // namespace sys
#endif

} /* end namespace wpi */

#endif
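
The in/out pointer contract documented for ConvertCodePointToUTF8 (a buffer of at least UNI_MAX_UTF8_BYTES_PER_CODE_POINT bytes, with the pointer advanced one past the written sequence on success) can be illustrated with a small standalone sketch. This is not the WPILib implementation: `encodeCodePointSketch` and `encodeToString` are hypothetical names introduced here, and rejecting surrogates and values above 0x10FFFF is an assumption of the sketch rather than documented behavior of the library function.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in, NOT the WPILib function: encodes one code point as
// UTF-8 into `out` and advances `out` one past the last byte written.
// Assumption of this sketch: surrogates and values above 0x10FFFF are
// rejected, and `out` is left unchanged on failure.
bool encodeCodePointSketch(std::uint32_t cp, char *&out) {
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return false;
  if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
    *out++ = static_cast<char>(cp);
  } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
    *out++ = static_cast<char>(0xC0 | (cp >> 6));
    *out++ = static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    *out++ = static_cast<char>(0xE0 | (cp >> 12));
    *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    *out++ = static_cast<char>(0x80 | (cp & 0x3F));
  } else {                               // 4 bytes: 11110xxx + 3 continuation
    *out++ = static_cast<char>(0xF0 | (cp >> 18));
    *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    *out++ = static_cast<char>(0x80 | (cp & 0x3F));
  }
  return true;
}

// Hypothetical helper: the caller supplies a 4-byte buffer, and the advanced
// pointer marks the end of the half-open range [buf, p) that was written.
std::string encodeToString(std::uint32_t cp) {
  char buf[4];
  char *p = buf;
  if (!encodeCodePointSketch(cp, p))
    return {};
  return std::string(buf, p);
}
```

As with the routines declared in the header, the end pointer is one past the last item, so the length of the encoded sequence is simply `p - buf`.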