Diffstat (limited to 'meta/recipes-devtools/gcc/gcc/0002-CVE-2021-42574.patch')
-rw-r--r-- meta/recipes-devtools/gcc/gcc/0002-CVE-2021-42574.patch | 2270
1 files changed, 2270 insertions, 0 deletions
diff --git a/meta/recipes-devtools/gcc/gcc/0002-CVE-2021-42574.patch b/meta/recipes-devtools/gcc/gcc/0002-CVE-2021-42574.patch
new file mode 100644
index 0000000000..5b1896ed69
--- /dev/null
+++ b/meta/recipes-devtools/gcc/gcc/0002-CVE-2021-42574.patch
@@ -0,0 +1,2270 @@
1From bd5e882cf6e0def3dd1bc106075d59a303fe0d1e Mon Sep 17 00:00:00 2001
2From: David Malcolm <dmalcolm@redhat.com>
3Date: Mon, 18 Oct 2021 18:55:31 -0400
4Subject: [PATCH] diagnostics: escape non-ASCII source bytes for certain
5 diagnostics
6MIME-Version: 1.0
7Content-Type: text/plain; charset=utf8
8Content-Transfer-Encoding: 8bit
9
10This patch adds support to GCC's diagnostic subsystem for escaping certain
11bytes and Unicode characters when quoting source code.
12
13Specifically, this patch adds a new flag rich_location::m_escape_on_output
14which is a hint from a diagnostic that non-ASCII bytes in the pertinent
15lines of the user's source code should be escaped when printed.
16
17The patch sets this for the following diagnostics:
18- when complaining about stray bytes in the program (when these
19are non-printable)
20- when complaining about "null character(s) ignored"
21- for -Wnormalized= (and generate source ranges for such warnings)
22
23The escaping is controlled by a new option:
24 -fdiagnostics-escape-format=[unicode|bytes]
25
26For example, consider a diagnostic involving a source line containing the
27string "before" followed by the Unicode character U+03C0 ("GREEK SMALL
28LETTER PI", with UTF-8 encoding 0xCF 0x80) followed by the byte 0xBF
29(a stray UTF-8 trailing byte), followed by the string "after", where the
30diagnostic highlights the U+03C0 character.
31
32By default, this line will be printed verbatim to the user when
33reporting a diagnostic at it, as:
34
35 beforeπXafter
36 ^
37
38(using X for the stray byte to avoid putting invalid UTF-8 in this
39commit message)
40
41If the diagnostic sets the "escape" flag, it will be printed as:
42
43 before<U+03C0><BF>after
44 ^~~~~~~~
45
46with -fdiagnostics-escape-format=unicode (the default), or as:
47
48 before<CF><80><BF>after
49 ^~~~~~~~
50
51if the user supplies -fdiagnostics-escape-format=bytes.
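
The two escaped forms above can be reproduced with a small standalone C sketch. This is not the code from this patch: the names escape_line, decode_utf8 and enum escape_format are hypothetical, the UTF-8 decoder is simplified (it does not reject overlong encodings or surrogates), and the hex digits are printed in upper case to mirror the example in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names throughout; an illustration, not GCC's implementation. */
enum escape_format { ESCAPE_UNICODE, ESCAPE_BYTES };

/* Decode one UTF-8 sequence starting at S, with N bytes available.
   Returns the sequence length and stores the code point in *CP, or
   returns 0 if S does not start a well-formed sequence.  Simplified:
   overlong encodings and surrogates are not rejected.  */
static size_t
decode_utf8 (const unsigned char *s, size_t n, unsigned long *cp)
{
  size_t len;
  unsigned long c;

  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
  else return 0; /* Stray trailing byte or invalid lead byte.  */

  if (len > n)
    return 0;
  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (s[i] & 0x3F);
    }
  *cp = c;
  return len;
}

/* Write an escaped copy of the LEN bytes at SRC into OUT (OUT_SZ bytes).
   Printable ASCII is copied as-is; everything else is escaped according
   to FMT.  Bytes that do not decode are always shown as <XX>.  Output
   is truncated if OUT is too small.  */
static void
escape_line (const char *src, size_t len, enum escape_format fmt,
             char *out, size_t out_sz)
{
  const unsigned char *s = (const unsigned char *) src;
  size_t used = 0;

  assert (out_sz > 0);
  out[0] = '\0';

  for (size_t i = 0; i < len; )
    {
      unsigned long cp = 0;
      size_t n = decode_utf8 (s + i, len - i, &cp);
      char piece[32];
      size_t plen = 0;

      if (n == 1 && cp >= 0x20 && cp < 0x7F)
        plen = (size_t) snprintf (piece, sizeof piece, "%c", (int) cp);
      else if (n > 0 && fmt == ESCAPE_UNICODE)
        plen = (size_t) snprintf (piece, sizeof piece, "<U+%04lX>", cp);
      else
        {
          /* Escape each underlying byte; an undecodable byte counts as one.  */
          if (n == 0)
            n = 1;
          for (size_t k = 0; k < n; k++)
            plen += (size_t) snprintf (piece + plen, sizeof piece - plen,
                                       "<%02X>", (unsigned) s[i + k]);
        }

      if (used + plen + 1 > out_sz)
        break;
      memcpy (out + used, piece, plen + 1);
      used += plen;
      i += n;
    }
}
```

Calling escape_line on the 14 bytes "before" 0xCF 0x80 0xBF "after" with ESCAPE_UNICODE yields the first form shown above, and with ESCAPE_BYTES the second. When writing the input as a C string literal, split it after the last hex escape ("before\xCF\x80\xBF" "after"), since otherwise the leading "af" of "after" would be parsed as part of the hex escape.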
52
53This only affects how the source is printed; it does not affect
54how column numbers are printed (as per -fdiagnostics-column-unit=
55and -fdiagnostics-column-origin=).
56
57gcc/c-family/ChangeLog:
58 * c-lex.c (c_lex_with_flags): When complaining about non-printable
59 CPP_OTHER tokens, set the "escape on output" flag.
60
61gcc/ChangeLog:
62 * common.opt (fdiagnostics-escape-format=): New.
63 (diagnostics_escape_format): New enum.
64 (DIAGNOSTICS_ESCAPE_FORMAT_UNICODE): New enum value.
65 (DIAGNOSTICS_ESCAPE_FORMAT_BYTES): Likewise.
66 * diagnostic-format-json.cc (json_end_diagnostic): Add
67 "escape-source" attribute.
68 * diagnostic-show-locus.c
69 (exploc_with_display_col::exploc_with_display_col): Replace
70 "tabstop" param with a cpp_char_column_policy and add an "aspect"
71 param. Use these to compute m_display_col accordingly.
72 (struct char_display_policy): New struct.
73 (layout::m_policy): New field.
74 (layout::m_escape_on_output): New field.
75 (def_policy): New function.
76 (make_range): Update for changes to exploc_with_display_col ctor.
77 (default_print_decoded_ch): New.
78 (width_per_escaped_byte): New.
79 (escape_as_bytes_width): New.
80 (escape_as_bytes_print): New.
81 (escape_as_unicode_width): New.
82 (escape_as_unicode_print): New.
83 (make_policy): New.
84 (layout::layout): Initialize new fields. Update m_exploc ctor
85 call for above change to ctor.
86 (layout::maybe_add_location_range): Update for changes to
87 exploc_with_display_col ctor.
88 (layout::calculate_x_offset_display): Update for change to
89 cpp_display_width.
90 (layout::print_source_line): Pass policy
91 to cpp_display_width_computation. Capture cpp_decoded_char when
92 calling process_next_codepoint. Move printing of source code to
93 m_policy.m_print_cb.
94 (line_label::line_label): Pass in policy rather than context.
95 (layout::print_any_labels): Update for change to line_label ctor.
96 (get_affected_range): Pass in policy rather than context, updating
97 calls to location_compute_display_column accordingly.
98 (get_printed_columns): Likewise, also for cpp_display_width.
99 (correction::correction): Pass in policy rather than tabstop.
100 (correction::compute_display_cols): Pass m_policy rather than
101 m_tabstop to cpp_display_width.
102 (correction::m_tabstop): Replace with...
103 (correction::m_policy): ...this.
104 (line_corrections::line_corrections): Pass in policy rather than
105 context.
106 (line_corrections::m_context): Replace with...
107 (line_corrections::m_policy): ...this.
108 (line_corrections::add_hint): Update to use m_policy rather than
109 m_context.
110 (line_corrections::add_hint): Likewise.
111 (layout::print_trailing_fixits): Likewise.
112 (selftest::test_display_widths): New.
113 (selftest::test_layout_x_offset_display_utf8): Update to use
114 policy rather than tabstop.
115 (selftest::test_one_liner_labels_utf8): Add test of escaping
116 source lines.
117 (selftest::test_diagnostic_show_locus_one_liner_utf8): Update to
118 use policy rather than tabstop.
119 (selftest::test_overlapped_fixit_printing): Likewise.
120 (selftest::test_overlapped_fixit_printing_utf8): Likewise.
121 (selftest::test_overlapped_fixit_printing_2): Likewise.
122 (selftest::test_tab_expansion): Likewise.
123 (selftest::test_escaping_bytes_1): New.
124 (selftest::test_escaping_bytes_2): New.
125 (selftest::diagnostic_show_locus_c_tests): Call the new tests.
126 * diagnostic.c (diagnostic_initialize): Initialize
127 context->escape_format.
128 (convert_column_unit): Update to use default character width policy.
129 (selftest::test_diagnostic_get_location_text): Likewise.
130 * diagnostic.h (enum diagnostics_escape_format): New enum.
131 (diagnostic_context::escape_format): New field.
132 * doc/invoke.texi (-fdiagnostics-escape-format=): New option.
133 (-fdiagnostics-format=): Add "escape-source" attribute to examples
134 of JSON output, and document it.
135 * input.c (location_compute_display_column): Pass in "policy"
136 rather than "tabstop", passing to
137 cpp_byte_column_to_display_column.
138 (selftest::test_cpp_utf8): Update to use cpp_char_column_policy.
139 * input.h (class cpp_char_column_policy): New forward decl.
140 (location_compute_display_column): Pass in "policy" rather than
141 "tabstop".
142 * opts.c (common_handle_option): Handle
143 OPT_fdiagnostics_escape_format_.
144 * selftest.c (temp_source_file::temp_source_file): New ctor
145 overload taking a size_t.
146 * selftest.h (temp_source_file::temp_source_file): Likewise.
147
148gcc/testsuite/ChangeLog:
149 * c-c++-common/diagnostic-format-json-1.c: Add regexp to consume
150 "escape-source" attribute.
151 * c-c++-common/diagnostic-format-json-2.c: Likewise.
152 * c-c++-common/diagnostic-format-json-3.c: Likewise.
153 * c-c++-common/diagnostic-format-json-4.c: Likewise, twice.
154 * c-c++-common/diagnostic-format-json-5.c: Likewise.
155 * gcc.dg/cpp/warn-normalized-4-bytes.c: New test.
156 * gcc.dg/cpp/warn-normalized-4-unicode.c: New test.
157 * gcc.dg/encoding-issues-bytes.c: New test.
158 * gcc.dg/encoding-issues-unicode.c: New test.
159 * gfortran.dg/diagnostic-format-json-1.F90: Add regexp to consume
160 "escape-source" attribute.
161 * gfortran.dg/diagnostic-format-json-2.F90: Likewise.
162 * gfortran.dg/diagnostic-format-json-3.F90: Likewise.
163
164libcpp/ChangeLog:
165 * charset.c (convert_escape): Use encoding_rich_location when
166 complaining about nonprintable unknown escape sequences.
167 (cpp_display_width_computation::cpp_display_width_computation):
168 Pass in policy rather than tabstop.
169 (cpp_display_width_computation::process_next_codepoint): Add "out"
170 param and populate *out if non-NULL.
171 (cpp_display_width_computation::advance_display_cols): Pass NULL
172 to process_next_codepoint.
173 (cpp_byte_column_to_display_column): Pass in policy rather than
174 tabstop. Pass NULL to process_next_codepoint.
175 (cpp_display_column_to_byte_column): Pass in policy rather than
176 tabstop.
177 * errors.c (cpp_diagnostic_get_current_location): New function,
178 splitting out the logic from...
179 (cpp_diagnostic): ...here.
180 (cpp_warning_at): New function.
181 (cpp_pedwarning_at): New function.
182 * include/cpplib.h (cpp_warning_at): New decl for rich_location.
183 (cpp_pedwarning_at): Likewise.
184 (struct cpp_decoded_char): New.
185 (struct cpp_char_column_policy): New.
186 (cpp_display_width_computation::cpp_display_width_computation):
187 Replace "tabstop" param with "policy".
188 (cpp_display_width_computation::process_next_codepoint): Add "out"
189 param.
190 (cpp_display_width_computation::m_tabstop): Replace with...
191 (cpp_display_width_computation::m_policy): ...this.
192 (cpp_byte_column_to_display_column): Replace "tabstop" param with
193 "policy".
194 (cpp_display_width): Likewise.
195 (cpp_display_column_to_byte_column): Likewise.
196 * include/line-map.h (rich_location::escape_on_output_p): New.
197 (rich_location::set_escape_on_output): New.
198 (rich_location::m_escape_on_output): New.
199 * internal.h (cpp_diagnostic_get_current_location): New decl.
200 (class encoding_rich_location): New.
201 * lex.c (skip_whitespace): Use encoding_rich_location when
202 complaining about null characters.
203 (warn_about_normalization): Generate a source range when
204 complaining about improperly normalized tokens, rather than just a
205 point, and use encoding_rich_location so that the source code
206 is escaped on printing.
207 * line-map.c (rich_location::rich_location): Initialize
208 m_escape_on_output.
209
210Signed-off-by: David Malcolm <dmalcolm@redhat.com>
211
212CVE: CVE-2021-42574
213Upstream-Status: Backport [https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=bd5e882cf6e0def3dd1bc106075d59a303fe0d1e]
214Signed-off-by: Pgowda <pgowda.cve@gmail.com>
215
216---
217 gcc/c-family/c-lex.c | 6 +-
218 gcc/common.opt | 13 +
219 gcc/diagnostic-format-json.cc | 3 +
220 gcc/diagnostic-show-locus.c | 580 +++++++++++++++---
221 gcc/diagnostic.c | 10 +-
222 gcc/diagnostic.h | 18 +
223 gcc/doc/invoke.texi | 43 +-
224 gcc/input.c | 62 +-
225 gcc/input.h | 7 +-
226 gcc/opts.c | 4 +
227 gcc/selftest.c | 15 +
228 gcc/selftest.h | 2 +
229 .../c-c++-common/diagnostic-format-json-1.c | 1 +
230 .../c-c++-common/diagnostic-format-json-2.c | 1 +
231 .../c-c++-common/diagnostic-format-json-3.c | 1 +
232 .../c-c++-common/diagnostic-format-json-4.c | 2 +
233 .../c-c++-common/diagnostic-format-json-5.c | 1 +
234 .../gcc.dg/cpp/warn-normalized-4-bytes.c | 21 +
235 .../gcc.dg/cpp/warn-normalized-4-unicode.c | 19 +
236 gcc/testsuite/gcc.dg/encoding-issues-bytes.c | Bin 0 -> 595 bytes
237 .../gcc.dg/encoding-issues-unicode.c | Bin 0 -> 613 bytes
238 .../gfortran.dg/diagnostic-format-json-1.F90 | 1 +
239 .../gfortran.dg/diagnostic-format-json-2.F90 | 1 +
240 .../gfortran.dg/diagnostic-format-json-3.F90 | 1 +
241 libcpp/charset.c | 63 +-
242 libcpp/errors.c | 82 ++-
243 libcpp/include/cpplib.h | 76 ++-
244 libcpp/include/line-map.h | 13 +
245 libcpp/internal.h | 23 +
246 libcpp/lex.c | 38 +-
247 libcpp/line-map.c | 3 +-
248 31 files changed, 942 insertions(+), 168 deletions(-)
249 create mode 100644 gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c
250 create mode 100644 gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c
251 create mode 100644 gcc/testsuite/gcc.dg/encoding-issues-bytes.c
252 create mode 100644 gcc/testsuite/gcc.dg/encoding-issues-unicode.c
253
254diff --git a/gcc/c-family/c-lex.c b/gcc/c-family/c-lex.c
255--- a/gcc/c-family/c-lex.c 2020-07-22 23:35:17.296384022 -0700
256+++ b/gcc/c-family/c-lex.c 2021-12-25 01:30:50.669689023 -0800
257@@ -587,7 +587,11 @@ c_lex_with_flags (tree *value, location_
258 else if (ISGRAPH (c))
259 error_at (*loc, "stray %qc in program", (int) c);
260 else
261- error_at (*loc, "stray %<\\%o%> in program", (int) c);
262+ {
263+ rich_location rich_loc (line_table, *loc);
264+ rich_loc.set_escape_on_output (true);
265+ error_at (&rich_loc, "stray %<\\%o%> in program", (int) c);
266+ }
267 }
268 goto retry;
269
270diff --git a/gcc/common.opt b/gcc/common.opt
271--- a/gcc/common.opt 2021-12-25 01:29:12.915317374 -0800
272+++ b/gcc/common.opt 2021-12-25 01:30:50.669689023 -0800
273@@ -1337,6 +1337,10 @@ fdiagnostics-format=
274 Common Joined RejectNegative Enum(diagnostics_output_format)
275 -fdiagnostics-format=[text|json] Select output format.
276
277+fdiagnostics-escape-format=
278+Common Joined RejectNegative Enum(diagnostics_escape_format)
279+-fdiagnostics-escape-format=[unicode|bytes] Select how to escape non-printable-ASCII bytes in the source for diagnostics that suggest it.
280+
281 ; Required for these enum values.
282 SourceInclude
283 diagnostic.h
284@@ -1351,6 +1355,15 @@ EnumValue
285 Enum(diagnostics_column_unit) String(byte) Value(DIAGNOSTICS_COLUMN_UNIT_BYTE)
286
287 Enum
288+Name(diagnostics_escape_format) Type(int)
289+
290+EnumValue
291+Enum(diagnostics_escape_format) String(unicode) Value(DIAGNOSTICS_ESCAPE_FORMAT_UNICODE)
292+
293+EnumValue
294+Enum(diagnostics_escape_format) String(bytes) Value(DIAGNOSTICS_ESCAPE_FORMAT_BYTES)
295+
296+Enum
297 Name(diagnostics_output_format) Type(int)
298
299 EnumValue
300diff --git a/gcc/diagnostic.c b/gcc/diagnostic.c
301--- a/gcc/diagnostic.c 2021-12-25 01:29:12.915317374 -0800
302+++ b/gcc/diagnostic.c 2021-12-25 01:30:50.669689023 -0800
303@@ -223,6 +223,7 @@ diagnostic_initialize (diagnostic_contex
304 context->column_unit = DIAGNOSTICS_COLUMN_UNIT_DISPLAY;
305 context->column_origin = 1;
306 context->tabstop = 8;
307+ context->escape_format = DIAGNOSTICS_ESCAPE_FORMAT_UNICODE;
308 context->edit_context_ptr = NULL;
309 context->diagnostic_group_nesting_depth = 0;
310 context->diagnostic_group_emission_count = 0;
311@@ -2152,8 +2153,8 @@ test_diagnostic_get_location_text ()
312 const char *const content = "smile \xf0\x9f\x98\x82\n";
313 const int line_bytes = strlen (content) - 1;
314 const int def_tabstop = 8;
315- const int display_width = cpp_display_width (content, line_bytes,
316- def_tabstop);
317+ const cpp_char_column_policy policy (def_tabstop, cpp_wcwidth);
318+ const int display_width = cpp_display_width (content, line_bytes, policy);
319 ASSERT_EQ (line_bytes - 2, display_width);
320 temp_source_file tmp (SELFTEST_LOCATION, ".c", content);
321 const char *const fname = tmp.get_filename ();
322diff --git a/gcc/diagnostic-format-json.cc b/gcc/diagnostic-format-json.cc
323--- a/gcc/diagnostic-format-json.cc 2021-12-25 01:29:12.915317374 -0800
324+++ b/gcc/diagnostic-format-json.cc 2021-12-25 01:30:50.669689023 -0800
325@@ -264,6 +264,9 @@ json_end_diagnostic (diagnostic_context
326 json::value *path_value = context->make_json_for_path (context, path);
327 diag_obj->set ("path", path_value);
328 }
329+
330+ diag_obj->set ("escape-source",
331+ new json::literal (richloc->escape_on_output_p ()));
332 }
333
334 /* No-op implementation of "begin_group_cb" for JSON output. */
335diff --git a/gcc/diagnostic.h b/gcc/diagnostic.h
336--- a/gcc/diagnostic.h 2021-12-25 01:29:12.919317307 -0800
337+++ b/gcc/diagnostic.h 2021-12-25 01:30:50.669689023 -0800
338@@ -38,6 +38,20 @@ enum diagnostics_column_unit
339 DIAGNOSTICS_COLUMN_UNIT_BYTE
340 };
341
342+/* An enum for controlling how to print non-ASCII characters/bytes when
343+ a diagnostic suggests escaping the source code on output. */
344+
345+enum diagnostics_escape_format
346+{
347+ /* Escape non-ASCII Unicode characters in the form <U+XXXX> and
348+ non-UTF-8 bytes in the form <XX>. */
349+ DIAGNOSTICS_ESCAPE_FORMAT_UNICODE,
350+
351+ /* Escape non-ASCII bytes in the form <XX> (thus showing the underlying
352+ encoding of non-ASCII Unicode characters). */
353+ DIAGNOSTICS_ESCAPE_FORMAT_BYTES
354+};
355+
356 /* Enum for overriding the standard output format. */
357
358 enum diagnostics_output_format
359@@ -303,6 +317,10 @@ struct diagnostic_context
360 /* The size of the tabstop for tab expansion. */
361 int tabstop;
362
363+ /* How should non-ASCII/non-printable bytes be escaped when
364+ a diagnostic suggests escaping the source code on output. */
365+ enum diagnostics_escape_format escape_format;
366+
367 /* If non-NULL, an edit_context to which fix-it hints should be
368 applied, for generating patches. */
369 edit_context *edit_context_ptr;
370diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c
371--- a/gcc/diagnostic-show-locus.c 2021-12-25 01:29:12.919317307 -0800
372+++ b/gcc/diagnostic-show-locus.c 2021-12-25 01:30:50.673688956 -0800
373@@ -175,10 +175,26 @@ enum column_unit {
374 class exploc_with_display_col : public expanded_location
375 {
376 public:
377- exploc_with_display_col (const expanded_location &exploc, int tabstop)
378- : expanded_location (exploc),
379- m_display_col (location_compute_display_column (exploc, tabstop))
380- {}
381+ exploc_with_display_col (const expanded_location &exploc,
382+ const cpp_char_column_policy &policy,
383+ enum location_aspect aspect)
384+ : expanded_location (exploc),
385+ m_display_col (location_compute_display_column (exploc, policy))
386+ {
387+ if (exploc.column > 0)
388+ {
389+ /* m_display_col is now the final column of the byte.
390+ If escaping has happened, we may want the first column instead. */
391+ if (aspect != LOCATION_ASPECT_FINISH)
392+ {
393+ expanded_location prev_exploc (exploc);
394+ prev_exploc.column--;
395+ int prev_display_col
396+ = (location_compute_display_column (prev_exploc, policy));
397+ m_display_col = prev_display_col + 1;
398+ }
399+ }
400+ }
401
402 int m_display_col;
403 };
404@@ -313,6 +329,31 @@ test_line_span ()
405
406 #endif /* #if CHECKING_P */
407
408+/* A bundle of information containing how to print unicode
409+ characters and bytes when quoting source code.
410+
411+ Provides a unified place to support escaping some subset
412+ of characters to some format.
413+
414+ Extends char_column_policy; printing is split out to avoid
415+ libcpp having to know about pretty_printer. */
416+
417+struct char_display_policy : public cpp_char_column_policy
418+{
419+ public:
420+ char_display_policy (int tabstop,
421+ int (*width_cb) (cppchar_t c),
422+ void (*print_cb) (pretty_printer *pp,
423+ const cpp_decoded_char &cp))
424+ : cpp_char_column_policy (tabstop, width_cb),
425+ m_print_cb (print_cb)
426+ {
427+ }
428+
429+ void (*m_print_cb) (pretty_printer *pp,
430+ const cpp_decoded_char &cp);
431+};
432+
433 /* A class to control the overall layout when printing a diagnostic.
434
435 The layout is determined within the constructor.
436@@ -345,6 +386,8 @@ class layout
437
438 void print_line (linenum_type row);
439
440+ void on_bad_codepoint (const char *ptr, cppchar_t ch, size_t ch_sz);
441+
442 private:
443 bool will_show_line_p (linenum_type row) const;
444 void print_leading_fixits (linenum_type row);
445@@ -386,6 +429,7 @@ class layout
446 private:
447 diagnostic_context *m_context;
448 pretty_printer *m_pp;
449+ char_display_policy m_policy;
450 location_t m_primary_loc;
451 exploc_with_display_col m_exploc;
452 colorizer m_colorizer;
453@@ -398,6 +442,7 @@ class layout
454 auto_vec <line_span> m_line_spans;
455 int m_linenum_width;
456 int m_x_offset_display;
457+ bool m_escape_on_output;
458 };
459
460 /* Implementation of "class colorizer". */
461@@ -646,6 +691,11 @@ layout_range::intersects_line_p (linenum
462 /* Default for when we don't care what the tab expansion is set to. */
463 static const int def_tabstop = 8;
464
465+static cpp_char_column_policy def_policy ()
466+{
467+ return cpp_char_column_policy (8, cpp_wcwidth);
468+}
469+
470 /* Create some expanded locations for testing layout_range. The filename
471 member of the explocs is set to the empty string. This member will only be
472 inspected by the calls to location_compute_display_column() made from the
473@@ -662,10 +712,13 @@ make_range (int start_line, int start_co
474 = {"", start_line, start_col, NULL, false};
475 const expanded_location finish_exploc
476 = {"", end_line, end_col, NULL, false};
477- return layout_range (exploc_with_display_col (start_exploc, def_tabstop),
478- exploc_with_display_col (finish_exploc, def_tabstop),
479+ return layout_range (exploc_with_display_col (start_exploc, def_policy (),
480+ LOCATION_ASPECT_START),
481+ exploc_with_display_col (finish_exploc, def_policy (),
482+ LOCATION_ASPECT_FINISH),
483 SHOW_RANGE_WITHOUT_CARET,
484- exploc_with_display_col (start_exploc, def_tabstop),
485+ exploc_with_display_col (start_exploc, def_policy (),
486+ LOCATION_ASPECT_CARET),
487 0, NULL);
488 }
489
490@@ -950,6 +1003,164 @@ fixit_cmp (const void *p_a, const void *
491 return hint_a->get_start_loc () - hint_b->get_start_loc ();
492 }
493
494+/* Callbacks for use when not escaping the source. */
495+
496+/* The default callback for char_column_policy::m_width_cb is cpp_wcwidth. */
497+
498+/* Callback for char_display_policy::m_print_cb for printing source chars
499+ when not escaping the source. */
500+
501+static void
502+default_print_decoded_ch (pretty_printer *pp,
503+ const cpp_decoded_char &decoded_ch)
504+{
505+ for (const char *ptr = decoded_ch.m_start_byte;
506+ ptr != decoded_ch.m_next_byte; ptr++)
507+ {
508+ if (*ptr == '\0' || *ptr == '\r')
509+ {
510+ pp_space (pp);
511+ continue;
512+ }
513+
514+ pp_character (pp, *ptr);
515+ }
516+}
517+
518+/* Callbacks for use with DIAGNOSTICS_ESCAPE_FORMAT_BYTES. */
519+
520+static const int width_per_escaped_byte = 4;
521+
522+/* Callback for char_column_policy::m_width_cb for determining the
523+ display width when escaping with DIAGNOSTICS_ESCAPE_FORMAT_BYTES. */
524+
525+static int
526+escape_as_bytes_width (cppchar_t ch)
527+{
528+ if (ch < 0x80 && ISPRINT (ch))
529+ return cpp_wcwidth (ch);
530+ else
531+ {
532+ if (ch <= 0x7F) return 1 * width_per_escaped_byte;
533+ if (ch <= 0x7FF) return 2 * width_per_escaped_byte;
534+ if (ch <= 0xFFFF) return 3 * width_per_escaped_byte;
535+ return 4 * width_per_escaped_byte;
536+ }
537+}
538+
539+/* Callback for char_display_policy::m_print_cb for printing source chars
540+ when escaping with DIAGNOSTICS_ESCAPE_FORMAT_BYTES. */
541+
542+static void
543+escape_as_bytes_print (pretty_printer *pp,
544+ const cpp_decoded_char &decoded_ch)
545+{
546+ if (!decoded_ch.m_valid_ch)
547+ {
548+ for (const char *iter = decoded_ch.m_start_byte;
549+ iter != decoded_ch.m_next_byte; ++iter)
550+ {
551+ char buf[16];
552+ sprintf (buf, "<%02x>", (unsigned char)*iter);
553+ pp_string (pp, buf);
554+ }
555+ return;
556+ }
557+
558+ cppchar_t ch = decoded_ch.m_ch;
559+ if (ch < 0x80 && ISPRINT (ch))
560+ pp_character (pp, ch);
561+ else
562+ {
563+ for (const char *iter = decoded_ch.m_start_byte;
564+ iter < decoded_ch.m_next_byte; ++iter)
565+ {
566+ char buf[16];
567+ sprintf (buf, "<%02x>", (unsigned char)*iter);
568+ pp_string (pp, buf);
569+ }
570+ }
571+}
572+
573+/* Callbacks for use with DIAGNOSTICS_ESCAPE_FORMAT_UNICODE. */
574+
575+/* Callback for char_column_policy::m_width_cb for determining the
576+ display width when escaping with DIAGNOSTICS_ESCAPE_FORMAT_UNICODE. */
577+
578+static int
579+escape_as_unicode_width (cppchar_t ch)
580+{
581+ if (ch < 0x80 && ISPRINT (ch))
582+ return cpp_wcwidth (ch);
583+ else
584+ {
585+ // Width of "<U+%04x>"
586+ if (ch > 0xfffff)
587+ return 10;
588+ else if (ch > 0xffff)
589+ return 9;
590+ else
591+ return 8;
592+ }
593+}
594+
595+/* Callback for char_display_policy::m_print_cb for printing source chars
596+ when escaping with DIAGNOSTICS_ESCAPE_FORMAT_UNICODE. */
597+
598+static void
599+escape_as_unicode_print (pretty_printer *pp,
600+ const cpp_decoded_char &decoded_ch)
601+{
602+ if (!decoded_ch.m_valid_ch)
603+ {
604+ escape_as_bytes_print (pp, decoded_ch);
605+ return;
606+ }
607+
608+ cppchar_t ch = decoded_ch.m_ch;
609+ if (ch < 0x80 && ISPRINT (ch))
610+ pp_character (pp, ch);
611+ else
612+ {
613+ char buf[16];
614+ sprintf (buf, "<U+%04X>", ch);
615+ pp_string (pp, buf);
616+ }
617+}
618+
619+/* Populate a char_display_policy based on DC and RICHLOC. */
620+
621+static char_display_policy
622+make_policy (const diagnostic_context &dc,
623+ const rich_location &richloc)
624+{
625+ /* The default is to not escape non-ASCII bytes. */
626+ char_display_policy result
627+ (dc.tabstop, cpp_wcwidth, default_print_decoded_ch);
628+
629+ /* If the diagnostic suggests escaping non-ASCII bytes, then
630+ use policy from user-supplied options. */
631+ if (richloc.escape_on_output_p ())
632+ {
633+ result.m_undecoded_byte_width = width_per_escaped_byte;
634+ switch (dc.escape_format)
635+ {
636+ default:
637+ gcc_unreachable ();
638+ case DIAGNOSTICS_ESCAPE_FORMAT_UNICODE:
639+ result.m_width_cb = escape_as_unicode_width;
640+ result.m_print_cb = escape_as_unicode_print;
641+ break;
642+ case DIAGNOSTICS_ESCAPE_FORMAT_BYTES:
643+ result.m_width_cb = escape_as_bytes_width;
644+ result.m_print_cb = escape_as_bytes_print;
645+ break;
646+ }
647+ }
648+
649+ return result;
650+}
651+
652 /* Implementation of class layout. */
653
654 /* Constructor for class layout.
655@@ -966,8 +1177,10 @@ layout::layout (diagnostic_context * con
656 diagnostic_t diagnostic_kind)
657 : m_context (context),
658 m_pp (context->printer),
659+ m_policy (make_policy (*context, *richloc)),
660 m_primary_loc (richloc->get_range (0)->m_loc),
661- m_exploc (richloc->get_expanded_location (0), context->tabstop),
662+ m_exploc (richloc->get_expanded_location (0), m_policy,
663+ LOCATION_ASPECT_CARET),
664 m_colorizer (context, diagnostic_kind),
665 m_colorize_source_p (context->colorize_source_p),
666 m_show_labels_p (context->show_labels_p),
667@@ -977,7 +1190,8 @@ layout::layout (diagnostic_context * con
668 m_fixit_hints (richloc->get_num_fixit_hints ()),
669 m_line_spans (1 + richloc->get_num_locations ()),
670 m_linenum_width (0),
671- m_x_offset_display (0)
672+ m_x_offset_display (0),
673+ m_escape_on_output (richloc->escape_on_output_p ())
674 {
675 for (unsigned int idx = 0; idx < richloc->get_num_locations (); idx++)
676 {
677@@ -1063,10 +1277,13 @@ layout::maybe_add_location_range (const
678
679 /* Everything is now known to be in the correct source file,
680 but it may require further sanitization. */
681- layout_range ri (exploc_with_display_col (start, m_context->tabstop),
682- exploc_with_display_col (finish, m_context->tabstop),
683+ layout_range ri (exploc_with_display_col (start, m_policy,
684+ LOCATION_ASPECT_START),
685+ exploc_with_display_col (finish, m_policy,
686+ LOCATION_ASPECT_FINISH),
687 loc_range->m_range_display_kind,
688- exploc_with_display_col (caret, m_context->tabstop),
689+ exploc_with_display_col (caret, m_policy,
690+ LOCATION_ASPECT_CARET),
691 original_idx, loc_range->m_label);
692
693 /* If we have a range that finishes before it starts (perhaps
694@@ -1400,7 +1617,7 @@ layout::calculate_x_offset_display ()
695 = get_line_bytes_without_trailing_whitespace (line.get_buffer (),
696 line.length ());
697 int eol_display_column
698- = cpp_display_width (line.get_buffer (), line_bytes, m_context->tabstop);
699+ = cpp_display_width (line.get_buffer (), line_bytes, m_policy);
700 if (caret_display_column > eol_display_column
701 || !caret_display_column)
702 {
703@@ -1479,7 +1696,7 @@ layout::print_source_line (linenum_type
704 /* This object helps to keep track of which display column we are at, which is
705 necessary for computing the line bounds in display units, for doing
706 tab expansion, and for implementing m_x_offset_display. */
707- cpp_display_width_computation dw (line, line_bytes, m_context->tabstop);
708+ cpp_display_width_computation dw (line, line_bytes, m_policy);
709
710 /* Skip the first m_x_offset_display display columns. In case the leading
711 portion that will be skipped ends with a character with wcwidth > 1, then
712@@ -1527,7 +1744,8 @@ layout::print_source_line (linenum_type
713 tabs and replacing some control bytes with spaces as necessary. */
714 const char *c = dw.next_byte ();
715 const int start_disp_col = dw.display_cols_processed () + 1;
716- const int this_display_width = dw.process_next_codepoint ();
717+ cpp_decoded_char cp;
718+ const int this_display_width = dw.process_next_codepoint (&cp);
719 if (*c == '\t')
720 {
721 /* The returned display width is the number of spaces into which the
722@@ -1536,15 +1754,6 @@ layout::print_source_line (linenum_type
723 pp_space (m_pp);
724 continue;
725 }
726- if (*c == '\0' || *c == '\r')
727- {
728- /* cpp_wcwidth() promises to return 1 for all control bytes, and we
729- want to output these as a single space too, so this case is
730- actually the same as the '\t' case. */
731- gcc_assert (this_display_width == 1);
732- pp_space (m_pp);
733- continue;
734- }
735
736 /* We have a (possibly multibyte) character to output; update the line
737 bounds if it is not whitespace. */
738@@ -1556,7 +1765,8 @@ layout::print_source_line (linenum_type
739 }
740
741 /* Output the character. */
742- while (c != dw.next_byte ()) pp_character (m_pp, *c++);
743+ m_policy.m_print_cb (m_pp, cp);
744+ c = dw.next_byte ();
745 }
746 print_newline ();
747 return lbounds;
748@@ -1655,14 +1865,14 @@ layout::print_annotation_line (linenum_t
749 class line_label
750 {
751 public:
752- line_label (diagnostic_context *context, int state_idx, int column,
753+ line_label (const cpp_char_column_policy &policy,
754+ int state_idx, int column,
755 label_text text)
756 : m_state_idx (state_idx), m_column (column),
757 m_text (text), m_label_line (0), m_has_vbar (true)
758 {
759 const int bytes = strlen (text.m_buffer);
760- m_display_width
761- = cpp_display_width (text.m_buffer, bytes, context->tabstop);
762+ m_display_width = cpp_display_width (text.m_buffer, bytes, policy);
763 }
764
765 /* Sorting is primarily by column, then by state index. */
766@@ -1722,7 +1932,7 @@ layout::print_any_labels (linenum_type r
767 if (text.m_buffer == NULL)
768 continue;
769
770- labels.safe_push (line_label (m_context, i, disp_col, text));
771+ labels.safe_push (line_label (m_policy, i, disp_col, text));
772 }
773 }
774
775@@ -2002,7 +2212,7 @@ public:
776
777 /* Get the range of bytes or display columns that HINT would affect. */
778 static column_range
779-get_affected_range (diagnostic_context *context,
780+get_affected_range (const cpp_char_column_policy &policy,
781 const fixit_hint *hint, enum column_unit col_unit)
782 {
783 expanded_location exploc_start = expand_location (hint->get_start_loc ());
784@@ -2013,13 +2223,11 @@ get_affected_range (diagnostic_context *
785 int finish_column;
786 if (col_unit == CU_DISPLAY_COLS)
787 {
788- start_column
789- = location_compute_display_column (exploc_start, context->tabstop);
790+ start_column = location_compute_display_column (exploc_start, policy);
791 if (hint->insertion_p ())
792 finish_column = start_column - 1;
793 else
794- finish_column
795- = location_compute_display_column (exploc_finish, context->tabstop);
796+ finish_column = location_compute_display_column (exploc_finish, policy);
797 }
798 else
799 {
800@@ -2032,12 +2240,13 @@ get_affected_range (diagnostic_context *
801 /* Get the range of display columns that would be printed for HINT. */
802
803 static column_range
804-get_printed_columns (diagnostic_context *context, const fixit_hint *hint)
805+get_printed_columns (const cpp_char_column_policy &policy,
806+ const fixit_hint *hint)
807 {
808 expanded_location exploc = expand_location (hint->get_start_loc ());
809- int start_column = location_compute_display_column (exploc, context->tabstop);
810+ int start_column = location_compute_display_column (exploc, policy);
811 int hint_width = cpp_display_width (hint->get_string (), hint->get_length (),
812- context->tabstop);
813+ policy);
814 int final_hint_column = start_column + hint_width - 1;
815 if (hint->insertion_p ())
816 {
817@@ -2047,8 +2256,7 @@ get_printed_columns (diagnostic_context
818 {
819 exploc = expand_location (hint->get_next_loc ());
820 --exploc.column;
821- int finish_column
822- = location_compute_display_column (exploc, context->tabstop);
823+ int finish_column = location_compute_display_column (exploc, policy);
824 return column_range (start_column,
825 MAX (finish_column, final_hint_column));
826 }
827@@ -2066,13 +2274,13 @@ public:
828 column_range affected_columns,
829 column_range printed_columns,
830 const char *new_text, size_t new_text_len,
831- int tabstop)
832+ const cpp_char_column_policy &policy)
833 : m_affected_bytes (affected_bytes),
834 m_affected_columns (affected_columns),
835 m_printed_columns (printed_columns),
836 m_text (xstrdup (new_text)),
837 m_byte_length (new_text_len),
838- m_tabstop (tabstop),
839+ m_policy (policy),
840 m_alloc_sz (new_text_len + 1)
841 {
842 compute_display_cols ();
843@@ -2090,7 +2298,7 @@ public:
844
845 void compute_display_cols ()
846 {
847- m_display_cols = cpp_display_width (m_text, m_byte_length, m_tabstop);
848+ m_display_cols = cpp_display_width (m_text, m_byte_length, m_policy);
849 }
850
851 void overwrite (int dst_offset, const char_span &src_span)
852@@ -2118,7 +2326,7 @@ public:
853 char *m_text;
854 size_t m_byte_length; /* Not including null-terminator. */
855 int m_display_cols;
856- int m_tabstop;
857+ const cpp_char_column_policy &m_policy;
858 size_t m_alloc_sz;
859 };
860
861@@ -2154,15 +2362,16 @@ correction::ensure_terminated ()
862 class line_corrections
863 {
864 public:
865- line_corrections (diagnostic_context *context, const char *filename,
866+ line_corrections (const char_display_policy &policy,
867+ const char *filename,
868 linenum_type row)
869- : m_context (context), m_filename (filename), m_row (row)
870+ : m_policy (policy), m_filename (filename), m_row (row)
871 {}
872 ~line_corrections ();
873
874 void add_hint (const fixit_hint *hint);
875
876- diagnostic_context *m_context;
877+ const char_display_policy &m_policy;
878 const char *m_filename;
879 linenum_type m_row;
880 auto_vec <correction *> m_corrections;
881@@ -2208,10 +2417,10 @@ source_line::source_line (const char *fi
882 void
883 line_corrections::add_hint (const fixit_hint *hint)
884 {
885- column_range affected_bytes = get_affected_range (m_context, hint, CU_BYTES);
886- column_range affected_columns = get_affected_range (m_context, hint,
887+ column_range affected_bytes = get_affected_range (m_policy, hint, CU_BYTES);
888+ column_range affected_columns = get_affected_range (m_policy, hint,
889 CU_DISPLAY_COLS);
890- column_range printed_columns = get_printed_columns (m_context, hint);
891+ column_range printed_columns = get_printed_columns (m_policy, hint);
892
893 /* Potentially consolidate. */
894 if (!m_corrections.is_empty ())
895@@ -2280,7 +2489,7 @@ line_corrections::add_hint (const fixit_
896 printed_columns,
897 hint->get_string (),
898 hint->get_length (),
899- m_context->tabstop));
900+ m_policy));
901 }
902
903 /* If there are any fixit hints on source line ROW, print them.
904@@ -2294,7 +2503,7 @@ layout::print_trailing_fixits (linenum_t
905 {
906 /* Build a list of correction instances for the line,
907 potentially consolidating hints (for the sake of readability). */
908- line_corrections corrections (m_context, m_exploc.file, row);
909+ line_corrections corrections (m_policy, m_exploc.file, row);
910 for (unsigned int i = 0; i < m_fixit_hints.length (); i++)
911 {
912 const fixit_hint *hint = m_fixit_hints[i];
913@@ -2635,6 +2844,59 @@ namespace selftest {
914
915 /* Selftests for diagnostic_show_locus. */
916
917+/* Verify that cpp_display_width correctly handles escaping. */
918+
919+static void
920+test_display_widths ()
921+{
922+ gcc_rich_location richloc (UNKNOWN_LOCATION);
923+
924+ /* U+03C0 "GREEK SMALL LETTER PI". */
925+ const char *pi = "\xCF\x80";
926+ /* U+1F642 "SLIGHTLY SMILING FACE". */
927+ const char *emoji = "\xF0\x9F\x99\x82";
928+ /* Stray trailing byte of a UTF-8 character. */
929+ const char *stray = "\xBF";
930+ /* U+10FFFF. */
931+ const char *max_codepoint = "\xF4\x8F\xBF\xBF";
932+
933+ /* No escaping. */
934+ {
935+ test_diagnostic_context dc;
936+ char_display_policy policy (make_policy (dc, richloc));
937+ ASSERT_EQ (cpp_display_width (pi, strlen (pi), policy), 1);
938+ ASSERT_EQ (cpp_display_width (emoji, strlen (emoji), policy), 2);
939+ ASSERT_EQ (cpp_display_width (stray, strlen (stray), policy), 1);
940+ /* Don't check width of U+10FFFF; it's in a private use plane. */
941+ }
942+
943+ richloc.set_escape_on_output (true);
944+
945+ {
946+ test_diagnostic_context dc;
947+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_UNICODE;
948+ char_display_policy policy (make_policy (dc, richloc));
949+ ASSERT_EQ (cpp_display_width (pi, strlen (pi), policy), 8);
950+ ASSERT_EQ (cpp_display_width (emoji, strlen (emoji), policy), 9);
951+ ASSERT_EQ (cpp_display_width (stray, strlen (stray), policy), 4);
952+ ASSERT_EQ (cpp_display_width (max_codepoint, strlen (max_codepoint),
953+ policy),
954+ strlen ("<U+10FFFF>"));
955+ }
956+
957+ {
958+ test_diagnostic_context dc;
959+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_BYTES;
960+ char_display_policy policy (make_policy (dc, richloc));
961+ ASSERT_EQ (cpp_display_width (pi, strlen (pi), policy), 8);
962+ ASSERT_EQ (cpp_display_width (emoji, strlen (emoji), policy), 16);
963+ ASSERT_EQ (cpp_display_width (stray, strlen (stray), policy), 4);
964+ ASSERT_EQ (cpp_display_width (max_codepoint, strlen (max_codepoint),
965+ policy),
966+ 16);
967+ }
968+}
969+
970 /* For precise tests of the layout, make clear where the source line will
971 start. test_left_margin sets the total byte count from the left side of the
972 screen to the start of source lines, after the line number and the separator,
973@@ -2704,10 +2966,10 @@ test_layout_x_offset_display_utf8 (const
974 char_span lspan = location_get_source_line (tmp.get_filename (), 1);
975 ASSERT_EQ (line_display_cols,
976 cpp_display_width (lspan.get_buffer (), lspan.length (),
977- def_tabstop));
978+ def_policy ()));
979 ASSERT_EQ (line_display_cols,
980 location_compute_display_column (expand_location (line_end),
981- def_tabstop));
982+ def_policy ()));
983 ASSERT_EQ (0, memcmp (lspan.get_buffer () + (emoji_col - 1),
984 "\xf0\x9f\x98\x82\xf0\x9f\x98\x82", 8));
985
986@@ -2855,12 +3117,13 @@ test_layout_x_offset_display_tab (const
987 ASSERT_EQ ('\t', *(lspan.get_buffer () + (tab_col - 1)));
988 for (int tabstop = 1; tabstop != num_tabstops; ++tabstop)
989 {
990+ cpp_char_column_policy policy (tabstop, cpp_wcwidth);
991 ASSERT_EQ (line_bytes + extra_width[tabstop],
992 cpp_display_width (lspan.get_buffer (), lspan.length (),
993- tabstop));
994+ policy));
995 ASSERT_EQ (line_bytes + extra_width[tabstop],
996 location_compute_display_column (expand_location (line_end),
997- tabstop));
998+ policy));
999 }
1000
1001 /* Check that the tab is expanded to the expected number of spaces. */
1002@@ -3992,6 +4255,43 @@ test_one_liner_labels_utf8 ()
1003 " bb\xf0\x9f\x98\x82\xf0\x9f\x98\x82\n",
1004 pp_formatted_text (dc.printer));
1005 }
1006+
1007+ /* Example of escaping the source lines. */
1008+ {
1009+ text_range_label label0 ("label 0\xf0\x9f\x98\x82");
1010+ text_range_label label1 ("label 1\xcf\x80");
1011+ text_range_label label2 ("label 2\xcf\x80");
1012+ gcc_rich_location richloc (foo, &label0);
1013+ richloc.add_range (bar, SHOW_RANGE_WITHOUT_CARET, &label1);
1014+ richloc.add_range (field, SHOW_RANGE_WITHOUT_CARET, &label2);
1015+ richloc.set_escape_on_output (true);
1016+
1017+ {
1018+ test_diagnostic_context dc;
1019+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_UNICODE;
1020+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1021+ ASSERT_STREQ (" <U+1F602>_foo = <U+03C0>_bar.<U+1F602>_field<U+03C0>;\n"
1022+ " ^~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~\n"
1023+ " | | |\n"
1024+ " | | label 2\xcf\x80\n"
1025+ " | label 1\xcf\x80\n"
1026+ " label 0\xf0\x9f\x98\x82\n",
1027+ pp_formatted_text (dc.printer));
1028+ }
1029+ {
1030+ test_diagnostic_context dc;
1031+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_BYTES;
1032+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1033+ ASSERT_STREQ
1034+ (" <f0><9f><98><82>_foo = <cf><80>_bar.<f0><9f><98><82>_field<cf><80>;\n"
1035+ " ^~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
1036+ " | | |\n"
1037+ " | | label 2\xcf\x80\n"
1038+ " | label 1\xcf\x80\n"
1039+ " label 0\xf0\x9f\x98\x82\n",
1040+ pp_formatted_text (dc.printer));
1041+ }
1042+ }
1043 }
1044
1045 /* Make sure that colorization codes don't interrupt a multibyte
1046@@ -4046,9 +4346,9 @@ test_diagnostic_show_locus_one_liner_utf
1047
1048 char_span lspan = location_get_source_line (tmp.get_filename (), 1);
1049 ASSERT_EQ (25, cpp_display_width (lspan.get_buffer (), lspan.length (),
1050- def_tabstop));
1051+ def_policy ()));
1052 ASSERT_EQ (25, location_compute_display_column (expand_location (line_end),
1053- def_tabstop));
1054+ def_policy ()));
1055
1056 test_one_liner_simple_caret_utf8 ();
1057 test_one_liner_caret_and_range_utf8 ();
1058@@ -4434,30 +4734,31 @@ test_overlapped_fixit_printing (const li
1059 pp_formatted_text (dc.printer));
1060
1061 /* Unit-test the line_corrections machinery. */
1062+ char_display_policy policy (make_policy (dc, richloc));
1063 ASSERT_EQ (3, richloc.get_num_fixit_hints ());
1064 const fixit_hint *hint_0 = richloc.get_fixit_hint (0);
1065 ASSERT_EQ (column_range (12, 12),
1066- get_affected_range (&dc, hint_0, CU_BYTES));
1067+ get_affected_range (policy, hint_0, CU_BYTES));
1068 ASSERT_EQ (column_range (12, 12),
1069- get_affected_range (&dc, hint_0, CU_DISPLAY_COLS));
1070- ASSERT_EQ (column_range (12, 22), get_printed_columns (&dc, hint_0));
1071+ get_affected_range (policy, hint_0, CU_DISPLAY_COLS));
1072+ ASSERT_EQ (column_range (12, 22), get_printed_columns (policy, hint_0));
1073 const fixit_hint *hint_1 = richloc.get_fixit_hint (1);
1074 ASSERT_EQ (column_range (18, 18),
1075- get_affected_range (&dc, hint_1, CU_BYTES));
1076+ get_affected_range (policy, hint_1, CU_BYTES));
1077 ASSERT_EQ (column_range (18, 18),
1078- get_affected_range (&dc, hint_1, CU_DISPLAY_COLS));
1079- ASSERT_EQ (column_range (18, 20), get_printed_columns (&dc, hint_1));
1080+ get_affected_range (policy, hint_1, CU_DISPLAY_COLS));
1081+ ASSERT_EQ (column_range (18, 20), get_printed_columns (policy, hint_1));
1082 const fixit_hint *hint_2 = richloc.get_fixit_hint (2);
1083 ASSERT_EQ (column_range (29, 28),
1084- get_affected_range (&dc, hint_2, CU_BYTES));
1085+ get_affected_range (policy, hint_2, CU_BYTES));
1086 ASSERT_EQ (column_range (29, 28),
1087- get_affected_range (&dc, hint_2, CU_DISPLAY_COLS));
1088- ASSERT_EQ (column_range (29, 29), get_printed_columns (&dc, hint_2));
1089+ get_affected_range (policy, hint_2, CU_DISPLAY_COLS));
1090+ ASSERT_EQ (column_range (29, 29), get_printed_columns (policy, hint_2));
1091
1092 /* Add each hint in turn to a line_corrections instance,
1093 and verify that they are consolidated into one correction instance
1094 as expected. */
1095- line_corrections lc (&dc, tmp.get_filename (), 1);
1096+ line_corrections lc (policy, tmp.get_filename (), 1);
1097
1098 /* The first replace hint by itself. */
1099 lc.add_hint (hint_0);
1100@@ -4649,30 +4950,31 @@ test_overlapped_fixit_printing_utf8 (con
1101 pp_formatted_text (dc.printer));
1102
1103 /* Unit-test the line_corrections machinery. */
1104+ char_display_policy policy (make_policy (dc, richloc));
1105 ASSERT_EQ (3, richloc.get_num_fixit_hints ());
1106 const fixit_hint *hint_0 = richloc.get_fixit_hint (0);
1107 ASSERT_EQ (column_range (14, 14),
1108- get_affected_range (&dc, hint_0, CU_BYTES));
1109+ get_affected_range (policy, hint_0, CU_BYTES));
1110 ASSERT_EQ (column_range (12, 12),
1111- get_affected_range (&dc, hint_0, CU_DISPLAY_COLS));
1112- ASSERT_EQ (column_range (12, 22), get_printed_columns (&dc, hint_0));
1113+ get_affected_range (policy, hint_0, CU_DISPLAY_COLS));
1114+ ASSERT_EQ (column_range (12, 22), get_printed_columns (policy, hint_0));
1115 const fixit_hint *hint_1 = richloc.get_fixit_hint (1);
1116 ASSERT_EQ (column_range (22, 22),
1117- get_affected_range (&dc, hint_1, CU_BYTES));
1118+ get_affected_range (policy, hint_1, CU_BYTES));
1119 ASSERT_EQ (column_range (18, 18),
1120- get_affected_range (&dc, hint_1, CU_DISPLAY_COLS));
1121- ASSERT_EQ (column_range (18, 20), get_printed_columns (&dc, hint_1));
1122+ get_affected_range (policy, hint_1, CU_DISPLAY_COLS));
1123+ ASSERT_EQ (column_range (18, 20), get_printed_columns (policy, hint_1));
1124 const fixit_hint *hint_2 = richloc.get_fixit_hint (2);
1125 ASSERT_EQ (column_range (35, 34),
1126- get_affected_range (&dc, hint_2, CU_BYTES));
1127+ get_affected_range (policy, hint_2, CU_BYTES));
1128 ASSERT_EQ (column_range (30, 29),
1129- get_affected_range (&dc, hint_2, CU_DISPLAY_COLS));
1130- ASSERT_EQ (column_range (30, 30), get_printed_columns (&dc, hint_2));
1131+ get_affected_range (policy, hint_2, CU_DISPLAY_COLS));
1132+ ASSERT_EQ (column_range (30, 30), get_printed_columns (policy, hint_2));
1133
1134 /* Add each hint in turn to a line_corrections instance,
1135 and verify that they are consolidated into one correction instance
1136 as expected. */
1137- line_corrections lc (&dc, tmp.get_filename (), 1);
1138+ line_corrections lc (policy, tmp.get_filename (), 1);
1139
1140 /* The first replace hint by itself. */
1141 lc.add_hint (hint_0);
1142@@ -4866,15 +5168,16 @@ test_overlapped_fixit_printing_2 (const
1143 richloc.add_fixit_insert_before (col_21, "}");
1144
1145 /* These fixits should be accepted; they can't be consolidated. */
1146+ char_display_policy policy (make_policy (dc, richloc));
1147 ASSERT_EQ (2, richloc.get_num_fixit_hints ());
1148 const fixit_hint *hint_0 = richloc.get_fixit_hint (0);
1149 ASSERT_EQ (column_range (23, 22),
1150- get_affected_range (&dc, hint_0, CU_BYTES));
1151- ASSERT_EQ (column_range (23, 23), get_printed_columns (&dc, hint_0));
1152+ get_affected_range (policy, hint_0, CU_BYTES));
1153+ ASSERT_EQ (column_range (23, 23), get_printed_columns (policy, hint_0));
1154 const fixit_hint *hint_1 = richloc.get_fixit_hint (1);
1155 ASSERT_EQ (column_range (21, 20),
1156- get_affected_range (&dc, hint_1, CU_BYTES));
1157- ASSERT_EQ (column_range (21, 21), get_printed_columns (&dc, hint_1));
1158+ get_affected_range (policy, hint_1, CU_BYTES));
1159+ ASSERT_EQ (column_range (21, 21), get_printed_columns (policy, hint_1));
1160
1161 /* Verify that they're printed correctly. */
1162 diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1163@@ -5141,10 +5444,11 @@ test_tab_expansion (const line_table_cas
1164 ....................123 45678901234 56789012345 columns */
1165
1166 const int tabstop = 8;
1167+ cpp_char_column_policy policy (tabstop, cpp_wcwidth);
1168 const int first_non_ws_byte_col = 7;
1169 const int right_quote_byte_col = 15;
1170 const int last_byte_col = 25;
1171- ASSERT_EQ (35, cpp_display_width (content, last_byte_col, tabstop));
1172+ ASSERT_EQ (35, cpp_display_width (content, last_byte_col, policy));
1173
1174 temp_source_file tmp (SELFTEST_LOCATION, ".c", content);
1175 line_table_test ltt (case_);
1176@@ -5187,6 +5491,114 @@ test_tab_expansion (const line_table_cas
1177 }
1178 }
1179
1180+/* Verify that the escaping machinery can cope with a variety of different
1181+ invalid bytes. */
1182+
1183+static void
1184+test_escaping_bytes_1 (const line_table_case &case_)
1185+{
1186+ const char content[] = "before\0\1\2\3\r\x80\xff""after\n";
1187+ const size_t sz = sizeof (content);
1188+ temp_source_file tmp (SELFTEST_LOCATION, ".c", content, sz);
1189+ line_table_test ltt (case_);
1190+ const line_map_ordinary *ord_map = linemap_check_ordinary
1191+ (linemap_add (line_table, LC_ENTER, false, tmp.get_filename (), 0));
1192+ linemap_line_start (line_table, 1, 100);
1193+
1194+ location_t finish
1195+ = linemap_position_for_line_and_column (line_table, ord_map, 1,
1196+ strlen (content));
1197+
1198+ if (finish > LINE_MAP_MAX_LOCATION_WITH_COLS)
1199+ return;
1200+
1201+ /* Locations of the NUL and \r bytes. */
1202+ location_t nul_loc
1203+ = linemap_position_for_line_and_column (line_table, ord_map, 1, 7);
1204+ location_t r_loc
1205+ = linemap_position_for_line_and_column (line_table, ord_map, 1, 11);
1206+ gcc_rich_location richloc (nul_loc);
1207+ richloc.add_range (r_loc);
1208+
1209+ {
1210+ test_diagnostic_context dc;
1211+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1212+ ASSERT_STREQ (" before \1\2\3 \x80\xff""after\n"
1213+ " ^ ~\n",
1214+ pp_formatted_text (dc.printer));
1215+ }
1216+ richloc.set_escape_on_output (true);
1217+ {
1218+ test_diagnostic_context dc;
1219+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_UNICODE;
1220+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1221+ ASSERT_STREQ
1222+ (" before<U+0000><U+0001><U+0002><U+0003><U+000D><80><ff>after\n"
1223+ " ^~~~~~~~ ~~~~~~~~\n",
1224+ pp_formatted_text (dc.printer));
1225+ }
1226+ {
1227+ test_diagnostic_context dc;
1228+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_BYTES;
1229+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1230+ ASSERT_STREQ (" before<00><01><02><03><0d><80><ff>after\n"
1231+ " ^~~~ ~~~~\n",
1232+ pp_formatted_text (dc.printer));
1233+ }
1234+}
1235+
1236+/* As above, but verify that we handle the initial byte of a line
1237+ correctly. */
1238+
1239+static void
1240+test_escaping_bytes_2 (const line_table_case &case_)
1241+{
1242+ const char content[] = "\0after\n";
1243+ const size_t sz = sizeof (content);
1244+ temp_source_file tmp (SELFTEST_LOCATION, ".c", content, sz);
1245+ line_table_test ltt (case_);
1246+ const line_map_ordinary *ord_map = linemap_check_ordinary
1247+ (linemap_add (line_table, LC_ENTER, false, tmp.get_filename (), 0));
1248+ linemap_line_start (line_table, 1, 100);
1249+
1250+ location_t finish
1251+ = linemap_position_for_line_and_column (line_table, ord_map, 1,
1252+ strlen (content));
1253+
1254+ if (finish > LINE_MAP_MAX_LOCATION_WITH_COLS)
1255+ return;
1256+
1257+ /* Location of the NUL byte. */
1258+ location_t nul_loc
1259+ = linemap_position_for_line_and_column (line_table, ord_map, 1, 1);
1260+ gcc_rich_location richloc (nul_loc);
1261+
1262+ {
1263+ test_diagnostic_context dc;
1264+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1265+ ASSERT_STREQ (" after\n"
1266+ " ^\n",
1267+ pp_formatted_text (dc.printer));
1268+ }
1269+ richloc.set_escape_on_output (true);
1270+ {
1271+ test_diagnostic_context dc;
1272+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_UNICODE;
1273+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1274+ ASSERT_STREQ (" <U+0000>after\n"
1275+ " ^~~~~~~~\n",
1276+ pp_formatted_text (dc.printer));
1277+ }
1278+ {
1279+ test_diagnostic_context dc;
1280+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_BYTES;
1281+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1282+ ASSERT_STREQ (" <00>after\n"
1283+ " ^~~~\n",
1284+ pp_formatted_text (dc.printer));
1285+ }
1286+}
1287+
1288 /* Verify that line numbers are correctly printed for the case of
1289 a multiline range in which the width of the line numbers changes
1290 (e.g. from "9" to "10"). */
1291@@ -5243,6 +5655,8 @@ diagnostic_show_locus_c_tests ()
1292 test_layout_range_for_single_line ();
1293 test_layout_range_for_multiple_lines ();
1294
1295+ test_display_widths ();
1296+
1297 for_each_line_table_case (test_layout_x_offset_display_utf8);
1298 for_each_line_table_case (test_layout_x_offset_display_tab);
1299
1300@@ -5263,6 +5677,8 @@ diagnostic_show_locus_c_tests ()
1301 for_each_line_table_case (test_fixit_replace_containing_newline);
1302 for_each_line_table_case (test_fixit_deletion_affecting_newline);
1303 for_each_line_table_case (test_tab_expansion);
1304+ for_each_line_table_case (test_escaping_bytes_1);
1305+ for_each_line_table_case (test_escaping_bytes_2);
1306
1307 test_line_numbers_multiline_range ();
1308 }
1309diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
1310--- a/gcc/doc/invoke.texi 2021-12-25 01:29:12.927317174 -0800
1311+++ b/gcc/doc/invoke.texi 2021-12-25 01:30:50.681688823 -0800
1312@@ -295,7 +295,8 @@ Objective-C and Objective-C++ Dialects}.
1313 -fdiagnostics-show-path-depths @gol
1314 -fno-show-column @gol
1315 -fdiagnostics-column-unit=@r{[}display@r{|}byte@r{]} @gol
1316--fdiagnostics-column-origin=@var{origin}}
1317+-fdiagnostics-column-origin=@var{origin} @gol
1318+-fdiagnostics-escape-format=@r{[}unicode@r{|}bytes@r{]}}
1319
1320 @item Warning Options
1321 @xref{Warning Options,,Options to Request or Suppress Warnings}.
1322@@ -4451,6 +4452,38 @@ first column. The default value of 1 co
1323 behavior and to the GNU style guide. Some utilities may perform better with an
1324 origin of 0; any non-negative value may be specified.
1325
1326+@item -fdiagnostics-escape-format=@var{FORMAT}
1327+@opindex fdiagnostics-escape-format
1328+When GCC prints pertinent source lines for a diagnostic it normally attempts
1329+to print the source bytes directly. However, some diagnostics relate to encoding
1330+issues in the source file, such as malformed UTF-8, or issues with Unicode
1331+normalization. These diagnostics are flagged so that GCC will escape bytes
1332+that are not printable ASCII when printing their pertinent source lines.
1333+
1334+This option controls how such bytes should be escaped.
1335+
1336+The default @var{FORMAT}, @samp{unicode}, displays Unicode characters that
1337+are not printable ASCII in the form @samp{<U+XXXX>}; bytes that do not
1338+correspond to a validly UTF-8-encoded Unicode character are displayed as
1339+hexadecimal in the form @samp{<XX>}.
1340+
1341+For example, a source line containing the string @samp{before} followed by the
1342+Unicode character U+03C0 (``GREEK SMALL LETTER PI'', with UTF-8 encoding
1343+0xCF 0x80) followed by the byte 0xBF (a stray UTF-8 trailing byte), followed by
1344+the string @samp{after} will be printed for such a diagnostic as:
1345+
1346+@smallexample
1347+ before<U+03C0><BF>after
1348+@end smallexample
1349+
1350+Setting @var{FORMAT} to @samp{bytes} will display all non-printable-ASCII bytes
1351+in the form @samp{<XX>}, thus showing the underlying encoding of non-ASCII
1352+Unicode characters. For the example above, the following will be printed:
1353+
1354+@smallexample
1355+ before<CF><80><BF>after
1356+@end smallexample
1357+
1358 @item -fdiagnostics-format=@var{FORMAT}
1359 @opindex fdiagnostics-format
1360 Select a different format for printing diagnostics.
1361@@ -4518,9 +4551,11 @@ might be printed in JSON form (after for
1362 @}
1363 @}
1364 ],
1365+ "escape-source": false,
1366 "message": "...this statement, but the latter is @dots{}"
1367 @}
1368 ]
1369+ "escape-source": false,
1370 "column-origin": 1,
1371 @},
1372 @dots{}
1373@@ -4607,6 +4642,7 @@ of the expression, which have labels. I
1374 "label": "T @{aka struct t@}"
1375 @}
1376 ],
1377+ "escape-source": false,
1378 "message": "invalid operands to binary + @dots{}"
1379 @}
1380 @end smallexample
1381@@ -4660,6 +4696,7 @@ might be printed in JSON form as:
1382 @}
1383 @}
1384 ],
1385+ "escape-source": false,
1386 "message": "\u2018struct s\u2019 has no member named @dots{}"
1387 @}
1388 @end smallexample
1389@@ -4717,6 +4754,10 @@ For example, the intraprocedural example
1390 ]
1391 @end smallexample
1392
1393+Diagnostics have a boolean attribute @code{escape-source}, hinting whether
1394+non-ASCII bytes should be escaped when printing the pertinent lines of
1395+source code (@code{true} for diagnostics involving source encoding issues).
1396+
1397 @end table
1398
1399 @node Warning Options
1400diff --git a/gcc/input.c b/gcc/input.c
1401--- a/gcc/input.c 2021-12-25 01:29:12.927317174 -0800
1402+++ b/gcc/input.c 2021-12-25 01:30:50.681688823 -0800
1403@@ -913,7 +913,8 @@ make_location (location_t caret, source_
1404 source line in order to calculate the display width. If that cannot be done
1405 for any reason, then returns the byte column as a fallback. */
1406 int
1407-location_compute_display_column (expanded_location exploc, int tabstop)
1408+location_compute_display_column (expanded_location exploc,
1409+ const cpp_char_column_policy &policy)
1410 {
1411 if (!(exploc.file && *exploc.file && exploc.line && exploc.column))
1412 return exploc.column;
1413@@ -921,7 +922,7 @@ location_compute_display_column (expande
1414 /* If line is NULL, this function returns exploc.column which is the
1415 desired fallback. */
1416 return cpp_byte_column_to_display_column (line.get_buffer (), line.length (),
1417- exploc.column, tabstop);
1418+ exploc.column, policy);
1419 }
1420
1421 /* Dump statistics to stderr about the memory usage of the line_table
1422@@ -3609,43 +3610,50 @@ test_line_offset_overflow ()
1423 void test_cpp_utf8 ()
1424 {
1425 const int def_tabstop = 8;
1426+ cpp_char_column_policy policy (def_tabstop, cpp_wcwidth);
1427+
1428 /* Verify that wcwidth of invalid UTF-8 or control bytes is 1. */
1429 {
1430- int w_bad = cpp_display_width ("\xf0!\x9f!\x98!\x82!", 8, def_tabstop);
1431+ int w_bad = cpp_display_width ("\xf0!\x9f!\x98!\x82!", 8, policy);
1432 ASSERT_EQ (8, w_bad);
1433- int w_ctrl = cpp_display_width ("\r\n\v\0\1", 5, def_tabstop);
1434+ int w_ctrl = cpp_display_width ("\r\n\v\0\1", 5, policy);
1435 ASSERT_EQ (5, w_ctrl);
1436 }
1437
1438 /* Verify that wcwidth of valid UTF-8 is as expected. */
1439 {
1440- const int w_pi = cpp_display_width ("\xcf\x80", 2, def_tabstop);
1441+ const int w_pi = cpp_display_width ("\xcf\x80", 2, policy);
1442 ASSERT_EQ (1, w_pi);
1443- const int w_emoji = cpp_display_width ("\xf0\x9f\x98\x82", 4, def_tabstop);
1444+ const int w_emoji = cpp_display_width ("\xf0\x9f\x98\x82", 4, policy);
1445 ASSERT_EQ (2, w_emoji);
1446 const int w_umlaut_precomposed = cpp_display_width ("\xc3\xbf", 2,
1447- def_tabstop);
1448+ policy);
1449 ASSERT_EQ (1, w_umlaut_precomposed);
1450 const int w_umlaut_combining = cpp_display_width ("y\xcc\x88", 3,
1451- def_tabstop);
1452+ policy);
1453 ASSERT_EQ (1, w_umlaut_combining);
1454- const int w_han = cpp_display_width ("\xe4\xb8\xba", 3, def_tabstop);
1455+ const int w_han = cpp_display_width ("\xe4\xb8\xba", 3, policy);
1456 ASSERT_EQ (2, w_han);
1457- const int w_ascii = cpp_display_width ("GCC", 3, def_tabstop);
1458+ const int w_ascii = cpp_display_width ("GCC", 3, policy);
1459 ASSERT_EQ (3, w_ascii);
1460 const int w_mixed = cpp_display_width ("\xcf\x80 = 3.14 \xf0\x9f\x98\x82"
1461 "\x9f! \xe4\xb8\xba y\xcc\x88",
1462- 24, def_tabstop);
1463+ 24, policy);
1464 ASSERT_EQ (18, w_mixed);
1465 }
1466
1467 /* Verify that display width properly expands tabs. */
1468 {
1469 const char *tstr = "\tabc\td";
1470- ASSERT_EQ (6, cpp_display_width (tstr, 6, 1));
1471- ASSERT_EQ (10, cpp_display_width (tstr, 6, 3));
1472- ASSERT_EQ (17, cpp_display_width (tstr, 6, 8));
1473- ASSERT_EQ (1, cpp_display_column_to_byte_column (tstr, 6, 7, 8));
1474+ ASSERT_EQ (6, cpp_display_width (tstr, 6,
1475+ cpp_char_column_policy (1, cpp_wcwidth)));
1476+ ASSERT_EQ (10, cpp_display_width (tstr, 6,
1477+ cpp_char_column_policy (3, cpp_wcwidth)));
1478+ ASSERT_EQ (17, cpp_display_width (tstr, 6,
1479+ cpp_char_column_policy (8, cpp_wcwidth)));
1480+ ASSERT_EQ (1,
1481+ cpp_display_column_to_byte_column
1482+ (tstr, 6, 7, cpp_char_column_policy (8, cpp_wcwidth)));
1483 }
1484
1485 /* Verify that cpp_byte_column_to_display_column can go past the end,
1486@@ -3658,13 +3666,13 @@ void test_cpp_utf8 ()
1487 /* 111122223456
1488 Byte columns. */
1489
1490- ASSERT_EQ (5, cpp_display_width (str, 6, def_tabstop));
1491+ ASSERT_EQ (5, cpp_display_width (str, 6, policy));
1492 ASSERT_EQ (105,
1493- cpp_byte_column_to_display_column (str, 6, 106, def_tabstop));
1494+ cpp_byte_column_to_display_column (str, 6, 106, policy));
1495 ASSERT_EQ (10000,
1496- cpp_byte_column_to_display_column (NULL, 0, 10000, def_tabstop));
1497+ cpp_byte_column_to_display_column (NULL, 0, 10000, policy));
1498 ASSERT_EQ (0,
1499- cpp_byte_column_to_display_column (NULL, 10000, 0, def_tabstop));
1500+ cpp_byte_column_to_display_column (NULL, 10000, 0, policy));
1501 }
1502
1503 /* Verify that cpp_display_column_to_byte_column can go past the end,
1504@@ -3678,25 +3686,25 @@ void test_cpp_utf8 ()
1505 /* 000000000000000000000000000000000111111
1506 111122223333444456666777788889999012345
1507 Byte columns. */
1508- ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 2, def_tabstop));
1509+ ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 2, policy));
1510 ASSERT_EQ (15,
1511- cpp_display_column_to_byte_column (str, 15, 11, def_tabstop));
1512+ cpp_display_column_to_byte_column (str, 15, 11, policy));
1513 ASSERT_EQ (115,
1514- cpp_display_column_to_byte_column (str, 15, 111, def_tabstop));
1515+ cpp_display_column_to_byte_column (str, 15, 111, policy));
1516 ASSERT_EQ (10000,
1517- cpp_display_column_to_byte_column (NULL, 0, 10000, def_tabstop));
1518+ cpp_display_column_to_byte_column (NULL, 0, 10000, policy));
1519 ASSERT_EQ (0,
1520- cpp_display_column_to_byte_column (NULL, 10000, 0, def_tabstop));
1521+ cpp_display_column_to_byte_column (NULL, 10000, 0, policy));
1522
1523 /* Verify that we do not interrupt a UTF-8 sequence. */
1524- ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 1, def_tabstop));
1525+ ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 1, policy));
1526
1527 for (int byte_col = 1; byte_col <= 15; ++byte_col)
1528 {
1529 const int disp_col
1530- = cpp_byte_column_to_display_column (str, 15, byte_col, def_tabstop);
1531+ = cpp_byte_column_to_display_column (str, 15, byte_col, policy);
1532 const int byte_col2
1533- = cpp_display_column_to_byte_column (str, 15, disp_col, def_tabstop);
1534+ = cpp_display_column_to_byte_column (str, 15, disp_col, policy);
1535
1536 /* If we ask for the display column in the middle of a UTF-8
1537 sequence, it will return the length of the partial sequence,
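Reviewer note on the tab-expansion assertions in test_cpp_utf8 above (6, 10 and 17 columns for "\tabc\td" at tabstops 1, 3 and 8): the arithmetic can be reproduced with a small ASCII-only sketch. The helper name ascii_display_width is hypothetical. It assumes every non-tab byte is one column wide, which is a simplification of what cpp_display_width does with a cpp_char_column_policy.

```cpp
#include <cstddef>

// Hypothetical helper, ASCII only: display width of the first LEN bytes of
// TEXT when each tab advances to the next multiple of TABSTOP columns and
// every other byte occupies exactly one column.
static int
ascii_display_width (const char *text, std::size_t len, int tabstop)
{
  int width = 0;
  for (std::size_t i = 0; i < len; ++i)
    {
      if (text[i] == '\t')
        width += tabstop - width % tabstop;  // Pad up to the next tab stop.
      else
        ++width;
    }
  return width;
}
```

With a tabstop of 8 the first tab fills columns 1 to 8, "abc" ends at column 11, the second tab pads to column 16 and "d" lands on column 17. This matches the value asserted in the selftest.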
1538diff --git a/gcc/input.h b/gcc/input.h
1539--- a/gcc/input.h 2021-12-25 01:29:12.927317174 -0800
1540+++ b/gcc/input.h 2021-12-25 01:30:50.681688823 -0800
1541@@ -39,8 +39,11 @@ STATIC_ASSERT (BUILTINS_LOCATION < RESER
1542 extern bool is_location_from_builtin_token (location_t);
1543 extern expanded_location expand_location (location_t);
1544
1545-extern int location_compute_display_column (expanded_location exploc,
1546- int tabstop);
1547+class cpp_char_column_policy;
1548+
1549+extern int
1550+location_compute_display_column (expanded_location exploc,
1551+ const cpp_char_column_policy &policy);
1552
1553 /* A class capturing the bounds of a buffer, to allow for run-time
1554 bounds-checking in a checked build. */
1555diff --git a/gcc/opts.c b/gcc/opts.c
1556--- a/gcc/opts.c 2021-12-25 01:29:12.927317174 -0800
1557+++ b/gcc/opts.c 2021-12-25 01:30:50.681688823 -0800
1558@@ -2447,6 +2447,10 @@ common_handle_option (struct gcc_options
1559 dc->column_origin = value;
1560 break;
1561
1562+ case OPT_fdiagnostics_escape_format_:
1563+ dc->escape_format = (enum diagnostics_escape_format)value;
1564+ break;
1565+
1566 case OPT_fdiagnostics_show_cwe:
1567 dc->show_cwe = value;
1568 break;
1569diff --git a/gcc/selftest.c b/gcc/selftest.c
1570--- a/gcc/selftest.c 2020-07-22 23:35:17.820389797 -0700
1571+++ b/gcc/selftest.c 2021-12-25 01:30:50.681688823 -0800
1572@@ -193,6 +193,21 @@ temp_source_file::temp_source_file (cons
1573 fclose (out);
1574 }
1575
1576+/* As above, but with a size, to allow for NUL bytes in CONTENT. */
1577+
1578+temp_source_file::temp_source_file (const location &loc,
1579+ const char *suffix,
1580+ const char *content,
1581+ size_t sz)
1582+: named_temp_file (suffix)
1583+{
1584+ FILE *out = fopen (get_filename (), "w");
1585+ if (!out)
1586+ fail_formatted (loc, "unable to open tempfile: %s", get_filename ());
1587+ fwrite (content, sz, 1, out);
1588+ fclose (out);
1589+}
1590+
1591 /* Avoid introducing locale-specific differences in the results
1592 by hardcoding open_quote and close_quote. */
1593
1594diff --git a/gcc/selftest.h b/gcc/selftest.h
1595--- a/gcc/selftest.h 2020-07-22 23:35:17.820389797 -0700
1596+++ b/gcc/selftest.h 2021-12-25 01:30:50.681688823 -0800
1597@@ -112,6 +112,8 @@ class temp_source_file : public named_te
1598 public:
1599 temp_source_file (const location &loc, const char *suffix,
1600 const char *content);
1601+ temp_source_file (const location &loc, const char *suffix,
1602+ const char *content, size_t sz);
1603 };
1604
1605 /* RAII-style class for avoiding introducing locale-specific differences
1606diff --git a/gcc/testsuite/c-c++-common/diagnostic-format-json-1.c b/gcc/testsuite/c-c++-common/diagnostic-format-json-1.c
1607--- a/gcc/testsuite/c-c++-common/diagnostic-format-json-1.c 2021-12-25 01:29:12.927317174 -0800
1608+++ b/gcc/testsuite/c-c++-common/diagnostic-format-json-1.c 2021-12-25 01:30:50.681688823 -0800
1609@@ -9,6 +9,7 @@
1610
1611 /* { dg-regexp "\"kind\": \"error\"" } */
1612 /* { dg-regexp "\"column-origin\": 1" } */
1613+/* { dg-regexp "\"escape-source\": false" } */
1614 /* { dg-regexp "\"message\": \"#error message\"" } */
1615
1616 /* { dg-regexp "\"caret\": \{" } */
1617diff --git a/gcc/testsuite/c-c++-common/diagnostic-format-json-2.c b/gcc/testsuite/c-c++-common/diagnostic-format-json-2.c
1618--- a/gcc/testsuite/c-c++-common/diagnostic-format-json-2.c 2021-12-25 01:29:12.927317174 -0800
1619+++ b/gcc/testsuite/c-c++-common/diagnostic-format-json-2.c 2021-12-25 01:30:50.681688823 -0800
1620@@ -9,6 +9,7 @@
1621
1622 /* { dg-regexp "\"kind\": \"warning\"" } */
1623 /* { dg-regexp "\"column-origin\": 1" } */
1624+/* { dg-regexp "\"escape-source\": false" } */
1625 /* { dg-regexp "\"message\": \"#warning message\"" } */
1626 /* { dg-regexp "\"option\": \"-Wcpp\"" } */
1627 /* { dg-regexp "\"option_url\": \"https:\[^\n\r\"\]*#index-Wcpp\"" } */
1628diff --git a/gcc/testsuite/c-c++-common/diagnostic-format-json-3.c b/gcc/testsuite/c-c++-common/diagnostic-format-json-3.c
1629--- a/gcc/testsuite/c-c++-common/diagnostic-format-json-3.c 2021-12-25 01:29:12.927317174 -0800
1630+++ b/gcc/testsuite/c-c++-common/diagnostic-format-json-3.c 2021-12-25 01:30:50.681688823 -0800
1631@@ -9,6 +9,7 @@
1632
1633 /* { dg-regexp "\"kind\": \"error\"" } */
1634 /* { dg-regexp "\"column-origin\": 1" } */
1635+/* { dg-regexp "\"escape-source\": false" } */
1636 /* { dg-regexp "\"message\": \"#warning message\"" } */
1637 /* { dg-regexp "\"option\": \"-Werror=cpp\"" } */
1638 /* { dg-regexp "\"option_url\": \"https:\[^\n\r\"\]*#index-Wcpp\"" } */
1639diff --git a/gcc/testsuite/c-c++-common/diagnostic-format-json-4.c b/gcc/testsuite/c-c++-common/diagnostic-format-json-4.c
1640--- a/gcc/testsuite/c-c++-common/diagnostic-format-json-4.c 2021-12-25 01:29:12.927317174 -0800
1641+++ b/gcc/testsuite/c-c++-common/diagnostic-format-json-4.c 2021-12-25 01:30:50.681688823 -0800
1642@@ -19,6 +19,7 @@ int test (void)
1643
1644 /* { dg-regexp "\"kind\": \"note\"" } */
1645 /* { dg-regexp "\"message\": \"...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'\"" } */
1646+/* { dg-regexp "\"escape-source\": false" } */
1647
1648 /* { dg-regexp "\"caret\": \{" } */
1649 /* { dg-regexp "\"file\": \"\[^\n\r\"\]*diagnostic-format-json-4.c\"" } */
1650@@ -39,6 +40,7 @@ int test (void)
1651 /* { dg-regexp "\"kind\": \"warning\"" } */
1652 /* { dg-regexp "\"column-origin\": 1" } */
1653 /* { dg-regexp "\"message\": \"this 'if' clause does not guard...\"" } */
1654+/* { dg-regexp "\"escape-source\": false" } */
1655 /* { dg-regexp "\"option\": \"-Wmisleading-indentation\"" } */
1656 /* { dg-regexp "\"option_url\": \"https:\[^\n\r\"\]*#index-Wmisleading-indentation\"" } */
1657
1658diff --git a/gcc/testsuite/c-c++-common/diagnostic-format-json-5.c b/gcc/testsuite/c-c++-common/diagnostic-format-json-5.c
1659--- a/gcc/testsuite/c-c++-common/diagnostic-format-json-5.c 2021-12-25 01:29:12.927317174 -0800
1660+++ b/gcc/testsuite/c-c++-common/diagnostic-format-json-5.c 2021-12-25 01:30:50.681688823 -0800
1661@@ -14,6 +14,7 @@ int test (struct s *ptr)
1662
1663 /* { dg-regexp "\"kind\": \"error\"" } */
1664 /* { dg-regexp "\"column-origin\": 1" } */
1665+/* { dg-regexp "\"escape-source\": false" } */
1666 /* { dg-regexp "\"message\": \".*\"" } */
1667
1668 /* Verify fix-it hints. */
1669diff --git a/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c b/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c
1670--- a/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c 1969-12-31 16:00:00.000000000 -0800
1671+++ b/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c 2021-12-25 01:30:50.681688823 -0800
1672@@ -0,0 +1,21 @@
1673+// { dg-do preprocess }
1674+// { dg-options "-std=gnu99 -Werror=normalized=nfc -fdiagnostics-show-caret -fdiagnostics-escape-format=bytes" }
1675+/* { dg-message "some warnings being treated as errors" "" {target "*-*-*"} 0 } */
1676+
1677+/* གྷ = U+0F43 TIBETAN LETTER GHA, which has decomposition "0F42 0FB7" i.e.
1678+ U+0F42 TIBETAN LETTER GA: ག
1679+ U+0FB7 TIBETAN SUBJOINED LETTER HA: ྷ
1680+
1681+ The UTF-8 encoding of U+0F43 TIBETAN LETTER GHA is: E0 BD 83. */
1682+
1683+foo before_\u0F43_after bar // { dg-error "`before_.U00000f43_after' is not in NFC .-Werror=normalized=." }
1684+/* { dg-begin-multiline-output "" }
1685+ foo before_\u0F43_after bar
1686+ ^~~~~~~~~~~~~~~~~~~
1687+ { dg-end-multiline-output "" } */
1688+
1689+foo before_གྷ_after bar // { dg-error "`before_.U00000f43_after' is not in NFC .-Werror=normalized=." }
1690+/* { dg-begin-multiline-output "" }
1691+ foo before_<e0><bd><83>_after bar
1692+ ^~~~~~~~~~~~~~~~~~~~~~~~~
1693+ { dg-end-multiline-output "" } */
1694diff --git a/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c b/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c
1695--- a/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c 1969-12-31 16:00:00.000000000 -0800
1696+++ b/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c 2021-12-25 01:30:50.681688823 -0800
1697@@ -0,0 +1,19 @@
1698+// { dg-do preprocess }
1699+// { dg-options "-std=gnu99 -Werror=normalized=nfc -fdiagnostics-show-caret -fdiagnostics-escape-format=unicode" }
1700+/* { dg-message "some warnings being treated as errors" "" {target "*-*-*"} 0 } */
1701+
1702+/* གྷ = U+0F43 TIBETAN LETTER GHA, which has decomposition "0F42 0FB7" i.e.
1703+ U+0F42 TIBETAN LETTER GA: ག
1704+ U+0FB7 TIBETAN SUBJOINED LETTER HA: ྷ */
1705+
1706+foo before_\u0F43_after bar // { dg-error "`before_.U00000f43_after' is not in NFC .-Werror=normalized=." }
1707+/* { dg-begin-multiline-output "" }
1708+ foo before_\u0F43_after bar
1709+ ^~~~~~~~~~~~~~~~~~~
1710+ { dg-end-multiline-output "" } */
1711+
1712+foo before_གྷ_after bar // { dg-error "`before_.U00000f43_after' is not in NFC .-Werror=normalized=." }
1713+/* { dg-begin-multiline-output "" }
1714+ foo before_<U+0F43>_after bar
1715+ ^~~~~~~~~~~~~~~~~~~~~
1716+ { dg-end-multiline-output "" } */
1717diff --git a/gcc/testsuite/gfortran.dg/diagnostic-format-json-1.F90 b/gcc/testsuite/gfortran.dg/diagnostic-format-json-1.F90
1718--- a/gcc/testsuite/gfortran.dg/diagnostic-format-json-1.F90 2021-12-25 01:29:12.931317107 -0800
1719+++ b/gcc/testsuite/gfortran.dg/diagnostic-format-json-1.F90 2021-12-25 01:30:50.681688823 -0800
1720@@ -9,6 +9,7 @@
1721
1722 ! { dg-regexp "\"kind\": \"error\"" }
1723 ! { dg-regexp "\"column-origin\": 1" }
1724+! { dg-regexp "\"escape-source\": false" }
1725 ! { dg-regexp "\"message\": \"#error message\"" }
1726
1727 ! { dg-regexp "\"caret\": \{" }
1728diff --git a/gcc/testsuite/gfortran.dg/diagnostic-format-json-2.F90 b/gcc/testsuite/gfortran.dg/diagnostic-format-json-2.F90
1729--- a/gcc/testsuite/gfortran.dg/diagnostic-format-json-2.F90 2021-12-25 01:29:12.931317107 -0800
1730+++ b/gcc/testsuite/gfortran.dg/diagnostic-format-json-2.F90 2021-12-25 01:30:50.681688823 -0800
1731@@ -9,6 +9,7 @@
1732
1733 ! { dg-regexp "\"kind\": \"warning\"" }
1734 ! { dg-regexp "\"column-origin\": 1" }
1735+! { dg-regexp "\"escape-source\": false" }
1736 ! { dg-regexp "\"message\": \"#warning message\"" }
1737 ! { dg-regexp "\"option\": \"-Wcpp\"" }
1738 ! { dg-regexp "\"option_url\": \"\[^\n\r\"\]*#index-Wcpp\"" }
1739diff --git a/gcc/testsuite/gfortran.dg/diagnostic-format-json-3.F90 b/gcc/testsuite/gfortran.dg/diagnostic-format-json-3.F90
1740--- a/gcc/testsuite/gfortran.dg/diagnostic-format-json-3.F90 2021-12-25 01:29:12.931317107 -0800
1741+++ b/gcc/testsuite/gfortran.dg/diagnostic-format-json-3.F90 2021-12-25 01:30:50.681688823 -0800
1742@@ -9,6 +9,7 @@
1743
1744 ! { dg-regexp "\"kind\": \"error\"" }
1745 ! { dg-regexp "\"column-origin\": 1" }
1746+! { dg-regexp "\"escape-source\": false" }
1747 ! { dg-regexp "\"message\": \"#warning message\"" }
1748 ! { dg-regexp "\"option\": \"-Werror=cpp\"" }
1749 ! { dg-regexp "\"option_url\": \"\[^\n\r\"\]*#index-Wcpp\"" }
1750diff --git a/libcpp/charset.c b/libcpp/charset.c
1751--- a/libcpp/charset.c 2021-12-25 01:29:12.931317107 -0800
1752+++ b/libcpp/charset.c 2021-12-25 01:30:50.681688823 -0800
1753@@ -1549,12 +1549,14 @@ convert_escape (cpp_reader *pfile, const
1754 "unknown escape sequence: '\\%c'", (int) c);
1755 else
1756 {
1757+ encoding_rich_location rich_loc (pfile);
1758+
1759 /* diagnostic.c does not support "%03o". When it does, this
1760 code can use %03o directly in the diagnostic again. */
1761 char buf[32];
1762 sprintf(buf, "%03o", (int) c);
1763- cpp_error (pfile, CPP_DL_PEDWARN,
1764- "unknown escape sequence: '\\%s'", buf);
1765+ cpp_error_at (pfile, CPP_DL_PEDWARN, &rich_loc,
1766+ "unknown escape sequence: '\\%s'", buf);
1767 }
1768 }
1769
1770@@ -2277,14 +2279,16 @@ cpp_string_location_reader::get_next ()
1771 }
1772
1773 cpp_display_width_computation::
1774-cpp_display_width_computation (const char *data, int data_length, int tabstop) :
1775+cpp_display_width_computation (const char *data, int data_length,
1776+ const cpp_char_column_policy &policy) :
1777 m_begin (data),
1778 m_next (m_begin),
1779 m_bytes_left (data_length),
1780- m_tabstop (tabstop),
1781+ m_policy (policy),
1782 m_display_cols (0)
1783 {
1784- gcc_assert (m_tabstop > 0);
1785+ gcc_assert (policy.m_tabstop > 0);
1786+ gcc_assert (policy.m_width_cb);
1787 }
1788
1789
1790@@ -2296,19 +2300,28 @@ cpp_display_width_computation (const cha
1791 point to a valid UTF-8-encoded sequence, then it will be treated as a single
1792 byte with display width 1. m_cur_display_col is the current display column,
1793 relative to which tab stops should be expanded. Returns the display width of
1794- the codepoint just processed. */
1795+ the codepoint just processed.
1796+ If OUT is non-NULL, it is populated. */
1797
1798 int
1799-cpp_display_width_computation::process_next_codepoint ()
1800+cpp_display_width_computation::process_next_codepoint (cpp_decoded_char *out)
1801 {
1802 cppchar_t c;
1803 int next_width;
1804
1805+ if (out)
1806+ out->m_start_byte = m_next;
1807+
1808 if (*m_next == '\t')
1809 {
1810 ++m_next;
1811 --m_bytes_left;
1812- next_width = m_tabstop - (m_display_cols % m_tabstop);
1813+ next_width = m_policy.m_tabstop - (m_display_cols % m_policy.m_tabstop);
1814+ if (out)
1815+ {
1816+ out->m_ch = '\t';
1817+ out->m_valid_ch = true;
1818+ }
1819 }
1820 else if (one_utf8_to_cppchar ((const uchar **) &m_next, &m_bytes_left, &c)
1821 != 0)
1822@@ -2318,14 +2331,24 @@ cpp_display_width_computation::process_n
1823 of one. */
1824 ++m_next;
1825 --m_bytes_left;
1826- next_width = 1;
1827+ next_width = m_policy.m_undecoded_byte_width;
1828+ if (out)
1829+ out->m_valid_ch = false;
1830 }
1831 else
1832 {
1833 /* one_utf8_to_cppchar() has updated m_next and m_bytes_left for us. */
1834- next_width = cpp_wcwidth (c);
1835+ next_width = m_policy.m_width_cb (c);
1836+ if (out)
1837+ {
1838+ out->m_ch = c;
1839+ out->m_valid_ch = true;
1840+ }
1841 }
1842
1843+ if (out)
1844+ out->m_next_byte = m_next;
1845+
1846 m_display_cols += next_width;
1847 return next_width;
1848 }
1849@@ -2341,7 +2364,7 @@ cpp_display_width_computation::advance_d
1850 const int start = m_display_cols;
1851 const int target = start + n;
1852 while (m_display_cols < target && !done ())
1853- process_next_codepoint ();
1854+ process_next_codepoint (NULL);
1855 return m_display_cols - start;
1856 }
1857
1858@@ -2349,29 +2372,33 @@ cpp_display_width_computation::advance_d
1859 how many display columns are occupied by the first COLUMN bytes. COLUMN
1860 may exceed DATA_LENGTH, in which case the phantom bytes at the end are
1861 treated as if they have display width 1. Tabs are expanded to the next tab
1862- stop, relative to the start of DATA. */
1863+ stop, relative to the start of DATA, and non-printable-ASCII characters
1864+ will be escaped as per POLICY. */
1865
1866 int
1867 cpp_byte_column_to_display_column (const char *data, int data_length,
1868- int column, int tabstop)
1869+ int column,
1870+ const cpp_char_column_policy &policy)
1871 {
1872 const int offset = MAX (0, column - data_length);
1873- cpp_display_width_computation dw (data, column - offset, tabstop);
1874+ cpp_display_width_computation dw (data, column - offset, policy);
1875 while (!dw.done ())
1876- dw.process_next_codepoint ();
1877+ dw.process_next_codepoint (NULL);
1878 return dw.display_cols_processed () + offset;
1879 }
1880
1881 /* For the string of length DATA_LENGTH bytes that begins at DATA, compute
1882 the least number of bytes that will result in at least DISPLAY_COL display
1883 columns. The return value may exceed DATA_LENGTH if the entire string does
1884- not occupy enough display columns. */
1885+ not occupy enough display columns. Non-printable-ASCII characters
1886+ will be escaped as per POLICY. */
1887
1888 int
1889 cpp_display_column_to_byte_column (const char *data, int data_length,
1890- int display_col, int tabstop)
1891+ int display_col,
1892+ const cpp_char_column_policy &policy)
1893 {
1894- cpp_display_width_computation dw (data, data_length, tabstop);
1895+ cpp_display_width_computation dw (data, data_length, policy);
1896 const int avail_display = dw.advance_display_cols (display_col);
1897 return dw.bytes_processed () + MAX (0, display_col - avail_display);
1898 }
1899diff --git a/libcpp/errors.c b/libcpp/errors.c
1900--- a/libcpp/errors.c 2020-07-22 23:35:18.712399623 -0700
1901+++ b/libcpp/errors.c 2021-12-25 01:30:50.681688823 -0800
1902@@ -27,6 +27,31 @@ along with this program; see the file CO
1903 #include "cpplib.h"
1904 #include "internal.h"
1905
1906+/* Get a location_t for the current location in PFILE,
1907+ generally that of the previously lexed token. */
1908+
1909+location_t
1910+cpp_diagnostic_get_current_location (cpp_reader *pfile)
1911+{
1912+ if (CPP_OPTION (pfile, traditional))
1913+ {
1914+ if (pfile->state.in_directive)
1915+ return pfile->directive_line;
1916+ else
1917+ return pfile->line_table->highest_line;
1918+ }
1919+ /* We don't want to refer to a token before the beginning of the
1920+ current run -- that is invalid. */
1921+ else if (pfile->cur_token == pfile->cur_run->base)
1922+ {
1923+ return 0;
1924+ }
1925+ else
1926+ {
1927+ return pfile->cur_token[-1].src_loc;
1928+ }
1929+}
1930+
1931 /* Print a diagnostic at the given location. */
1932
1933 ATTRIBUTE_FPTR_PRINTF(5,0)
1934@@ -52,25 +77,7 @@ cpp_diagnostic (cpp_reader * pfile, enum
1935 enum cpp_warning_reason reason,
1936 const char *msgid, va_list *ap)
1937 {
1938- location_t src_loc;
1939-
1940- if (CPP_OPTION (pfile, traditional))
1941- {
1942- if (pfile->state.in_directive)
1943- src_loc = pfile->directive_line;
1944- else
1945- src_loc = pfile->line_table->highest_line;
1946- }
1947- /* We don't want to refer to a token before the beginning of the
1948- current run -- that is invalid. */
1949- else if (pfile->cur_token == pfile->cur_run->base)
1950- {
1951- src_loc = 0;
1952- }
1953- else
1954- {
1955- src_loc = pfile->cur_token[-1].src_loc;
1956- }
1957+ location_t src_loc = cpp_diagnostic_get_current_location (pfile);
1958 rich_location richloc (pfile->line_table, src_loc);
1959 return cpp_diagnostic_at (pfile, level, reason, &richloc, msgid, ap);
1960 }
1961@@ -142,6 +149,43 @@ cpp_warning_syshdr (cpp_reader * pfile,
1962
1963 va_end (ap);
1964 return ret;
1965+}
1966+
1967+/* As cpp_warning above, but use RICHLOC as the location of the diagnostic. */
1968+
1969+bool cpp_warning_at (cpp_reader *pfile, enum cpp_warning_reason reason,
1970+ rich_location *richloc, const char *msgid, ...)
1971+{
1972+ va_list ap;
1973+ bool ret;
1974+
1975+ va_start (ap, msgid);
1976+
1977+ ret = cpp_diagnostic_at (pfile, CPP_DL_WARNING, reason, richloc,
1978+ msgid, &ap);
1979+
1980+ va_end (ap);
1981+ return ret;
1982+
1983+}
1984+
1985+/* As cpp_pedwarning above, but use RICHLOC as the location of the
1986+ diagnostic. */
1987+
1988+bool
1989+cpp_pedwarning_at (cpp_reader * pfile, enum cpp_warning_reason reason,
1990+ rich_location *richloc, const char *msgid, ...)
1991+{
1992+ va_list ap;
1993+ bool ret;
1994+
1995+ va_start (ap, msgid);
1996+
1997+ ret = cpp_diagnostic_at (pfile, CPP_DL_PEDWARN, reason, richloc,
1998+ msgid, &ap);
1999+
2000+ va_end (ap);
2001+ return ret;
2002 }
2003
2004 /* Print a diagnostic at a specific location. */
2005diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
2006--- a/libcpp/include/cpplib.h 2021-12-25 01:29:12.931317107 -0800
2007+++ b/libcpp/include/cpplib.h 2021-12-25 01:30:50.685688757 -0800
2008@@ -1176,6 +1176,14 @@ extern bool cpp_warning_syshdr (cpp_read
2009 const char *msgid, ...)
2010 ATTRIBUTE_PRINTF_3;
2011
2012+/* As their counterparts above, but use RICHLOC. */
2013+extern bool cpp_warning_at (cpp_reader *, enum cpp_warning_reason,
2014+ rich_location *richloc, const char *msgid, ...)
2015+ ATTRIBUTE_PRINTF_4;
2016+extern bool cpp_pedwarning_at (cpp_reader *, enum cpp_warning_reason,
2017+ rich_location *richloc, const char *msgid, ...)
2018+ ATTRIBUTE_PRINTF_4;
2019+
2020 /* Output a diagnostic with "MSGID: " preceding the
2021 error string of errno. No location is printed. */
2022 extern bool cpp_errno (cpp_reader *, enum cpp_diagnostic_level,
2023@@ -1320,42 +1328,95 @@ extern const char * cpp_get_userdef_suff
2024
2025 /* In charset.c */
2026
2027+/* The result of attempting to decode a run of UTF-8 bytes. */
2028+
2029+struct cpp_decoded_char
2030+{
2031+ const char *m_start_byte;
2032+ const char *m_next_byte;
2033+
2034+ bool m_valid_ch;
2035+ cppchar_t m_ch;
2036+};
2037+
2038+/* Information for mapping between code points and display columns.
2039+
2040+ This is a tabstop value, along with a callback for getting the
2041+ widths of characters. Normally this callback is cpp_wcwidth, but we
2042+ support other schemes for escaping non-ASCII unicode as a series of
2043+ ASCII chars when printing the user's source code in diagnostic-show-locus.c
2044+
2045+ For example, consider:
2046+ - the Unicode character U+03C0 "GREEK SMALL LETTER PI" (UTF-8: 0xCF 0x80)
2047+ - the Unicode character U+1F642 "SLIGHTLY SMILING FACE"
2048+ (UTF-8: 0xF0 0x9F 0x99 0x82)
2049+ - the byte 0xBF (a stray trailing byte of a UTF-8 character)
2050+ Normally U+03C0 would occupy one display column, U+1F642
2051+ would occupy two display columns, and the stray byte would be
2052+ printed verbatim as one display column.
2053+
2054+ However when escaping them as unicode code points as "<U+03C0>"
2055+ and "<U+1F642>" they occupy 8 and 9 display columns respectively,
2056+ and when escaping them as bytes as "<CF><80>" and "<F0><9F><99><82>"
2057+ they occupy 8 and 16 display columns respectively. In both cases
2058+ the stray byte is escaped to <BF> as 4 display columns. */
2059+
2060+struct cpp_char_column_policy
2061+{
2062+ cpp_char_column_policy (int tabstop,
2063+ int (*width_cb) (cppchar_t c))
2064+ : m_tabstop (tabstop),
2065+ m_undecoded_byte_width (1),
2066+ m_width_cb (width_cb)
2067+ {}
2068+
2069+ int m_tabstop;
2070+ /* Width in display columns of a stray byte that isn't decodable
2071+ as UTF-8. */
2072+ int m_undecoded_byte_width;
2073+ int (*m_width_cb) (cppchar_t c);
2074+};
2075+
2076 /* A class to manage the state while converting a UTF-8 sequence to cppchar_t
2077 and computing the display width one character at a time. */
2078 class cpp_display_width_computation {
2079 public:
2080 cpp_display_width_computation (const char *data, int data_length,
2081- int tabstop);
2082+ const cpp_char_column_policy &policy);
2083 const char *next_byte () const { return m_next; }
2084 int bytes_processed () const { return m_next - m_begin; }
2085 int bytes_left () const { return m_bytes_left; }
2086 bool done () const { return !bytes_left (); }
2087 int display_cols_processed () const { return m_display_cols; }
2088
2089- int process_next_codepoint ();
2090+ int process_next_codepoint (cpp_decoded_char *out);
2091 int advance_display_cols (int n);
2092
2093 private:
2094 const char *const m_begin;
2095 const char *m_next;
2096 size_t m_bytes_left;
2097- const int m_tabstop;
2098+ const cpp_char_column_policy &m_policy;
2099 int m_display_cols;
2100 };
2101
2102 /* Convenience functions that are simple use cases for class
2103 cpp_display_width_computation. Tab characters will be expanded to spaces
2104- as determined by TABSTOP. */
2105+ as determined by POLICY.m_tabstop, and non-printable-ASCII characters
2106+ will be escaped as per POLICY. */
2107+
2108 int cpp_byte_column_to_display_column (const char *data, int data_length,
2109- int column, int tabstop);
2110+ int column,
2111+ const cpp_char_column_policy &policy);
2112 inline int cpp_display_width (const char *data, int data_length,
2113- int tabstop)
2114+ const cpp_char_column_policy &policy)
2115 {
2116 return cpp_byte_column_to_display_column (data, data_length, data_length,
2117- tabstop);
2118+ policy);
2119 }
2120 int cpp_display_column_to_byte_column (const char *data, int data_length,
2121- int display_col, int tabstop);
2122+ int display_col,
2123+ const cpp_char_column_policy &policy);
2124 int cpp_wcwidth (cppchar_t c);
2125
2126 #endif /* ! LIBCPP_CPPLIB_H */
2127diff --git a/libcpp/include/line-map.h b/libcpp/include/line-map.h
2128--- a/libcpp/include/line-map.h 2020-07-22 23:35:18.712399623 -0700
2129+++ b/libcpp/include/line-map.h 2021-12-25 01:30:50.685688757 -0800
2130@@ -1732,6 +1732,18 @@ class rich_location
2131 const diagnostic_path *get_path () const { return m_path; }
2132 void set_path (const diagnostic_path *path) { m_path = path; }
2133
2134+ /* A flag for hinting that the diagnostic involves character encoding
2135+ issues, and thus that it will be helpful to the user if we show some
2136+ representation of how the characters in the pertinent source lines
2137+ are encoded.
2138+ The default is false (i.e. do not escape).
2139+ When set to true, non-ASCII bytes in the pertinent source lines will
2140+ be escaped in a manner controlled by the user-supplied option
2141+ -fdiagnostics-escape-format=, so that the user can better understand
2142+ what's going on with the encoding in their source file. */
2143+ bool escape_on_output_p () const { return m_escape_on_output; }
2144+ void set_escape_on_output (bool flag) { m_escape_on_output = flag; }
2145+
2146 private:
2147 bool reject_impossible_fixit (location_t where);
2148 void stop_supporting_fixits ();
2149@@ -1758,6 +1770,7 @@ protected:
2150 bool m_fixits_cannot_be_auto_applied;
2151
2152 const diagnostic_path *m_path;
2153+ bool m_escape_on_output;
2154 };
2155
2156 /* A struct for the result of range_label::get_text: a NUL-terminated buffer
2157diff --git a/libcpp/internal.h b/libcpp/internal.h
2158--- a/libcpp/internal.h 2020-07-22 23:35:18.712399623 -0700
2159+++ b/libcpp/internal.h 2021-12-25 01:30:50.685688757 -0800
2160@@ -758,6 +758,9 @@ struct _cpp_dir_only_callbacks
2161 extern void _cpp_preprocess_dir_only (cpp_reader *,
2162 const struct _cpp_dir_only_callbacks *);
2163
2164+/* In errors.c */
2165+extern location_t cpp_diagnostic_get_current_location (cpp_reader *);
2166+
2167 /* In traditional.c. */
2168 extern bool _cpp_scan_out_logical_line (cpp_reader *, cpp_macro *, bool);
2169 extern bool _cpp_read_logical_line_trad (cpp_reader *);
2170@@ -946,6 +949,26 @@ int linemap_get_expansion_line (class li
2171 const char* linemap_get_expansion_filename (class line_maps *,
2172 location_t);
2173
2174+/* A subclass of rich_location for emitting a diagnostic
2175+ at the current location of the reader, but flagging
2176+ it with set_escape_on_output (true). */
2177+class encoding_rich_location : public rich_location
2178+{
2179+ public:
2180+ encoding_rich_location (cpp_reader *pfile)
2181+ : rich_location (pfile->line_table,
2182+ cpp_diagnostic_get_current_location (pfile))
2183+ {
2184+ set_escape_on_output (true);
2185+ }
2186+
2187+ encoding_rich_location (cpp_reader *pfile, location_t loc)
2188+ : rich_location (pfile->line_table, loc)
2189+ {
2190+ set_escape_on_output (true);
2191+ }
2192+};
2193+
2194 #ifdef __cplusplus
2195 }
2196 #endif
2197diff --git a/libcpp/lex.c b/libcpp/lex.c
2198--- a/libcpp/lex.c 2021-12-24 20:23:45.568762024 -0800
2199+++ b/libcpp/lex.c 2021-12-25 01:30:50.685688757 -0800
2200@@ -1268,7 +1268,11 @@ skip_whitespace (cpp_reader *pfile, cppc
2201 while (is_nvspace (c));
2202
2203 if (saw_NUL)
2204- cpp_error (pfile, CPP_DL_WARNING, "null character(s) ignored");
2205+ {
2206+ encoding_rich_location rich_loc (pfile);
2207+ cpp_error_at (pfile, CPP_DL_WARNING, &rich_loc,
2208+ "null character(s) ignored");
2209+ }
2210
2211 buffer->cur--;
2212 }
2213@@ -1297,6 +1301,28 @@ warn_about_normalization (cpp_reader *pf
2214 if (CPP_OPTION (pfile, warn_normalize) < NORMALIZE_STATE_RESULT (s)
2215 && !pfile->state.skipping)
2216 {
2217+ location_t loc = token->src_loc;
2218+
2219+ /* If possible, create a location range for the token. */
2220+ if (loc >= RESERVED_LOCATION_COUNT
2221+ && token->type != CPP_EOF
2222+ /* There must be no line notes to process. */
2223+ && (!(pfile->buffer->cur
2224+ >= pfile->buffer->notes[pfile->buffer->cur_note].pos
2225+ && !pfile->overlaid_buffer)))
2226+ {
2227+ source_range tok_range;
2228+ tok_range.m_start = loc;
2229+ tok_range.m_finish
2230+ = linemap_position_for_column (pfile->line_table,
2231+ CPP_BUF_COLUMN (pfile->buffer,
2232+ pfile->buffer->cur));
2233+ loc = COMBINE_LOCATION_DATA (pfile->line_table,
2234+ loc, tok_range, NULL);
2235+ }
2236+
2237+ encoding_rich_location rich_loc (pfile, loc);
2238+
2239 /* Make sure that the token is printed using UCNs, even
2240 if we'd otherwise happily print UTF-8. */
2241 unsigned char *buf = XNEWVEC (unsigned char, cpp_token_len (token));
2242@@ -1304,11 +1330,11 @@ warn_about_normalization (cpp_reader *pf
2243
2244 sz = cpp_spell_token (pfile, token, buf, false) - buf;
2245 if (NORMALIZE_STATE_RESULT (s) == normalized_C)
2246- cpp_warning_with_line (pfile, CPP_W_NORMALIZE, token->src_loc, 0,
2247- "`%.*s' is not in NFKC", (int) sz, buf);
2248+ cpp_warning_at (pfile, CPP_W_NORMALIZE, &rich_loc,
2249+ "`%.*s' is not in NFKC", (int) sz, buf);
2250 else
2251- cpp_warning_with_line (pfile, CPP_W_NORMALIZE, token->src_loc, 0,
2252- "`%.*s' is not in NFC", (int) sz, buf);
2253+ cpp_warning_at (pfile, CPP_W_NORMALIZE, &rich_loc,
2254+ "`%.*s' is not in NFC", (int) sz, buf);
2255 free (buf);
2256 }
2257 }
2258diff --git a/libcpp/line-map.c b/libcpp/line-map.c
2259--- a/libcpp/line-map.c 2020-07-22 23:35:18.712399623 -0700
2260+++ b/libcpp/line-map.c 2021-12-25 01:30:50.685688757 -0800
2261@@ -2007,7 +2007,8 @@ rich_location::rich_location (line_maps
2262 m_fixit_hints (),
2263 m_seen_impossible_fixit (false),
2264 m_fixits_cannot_be_auto_applied (false),
2265- m_path (NULL)
2266+ m_path (NULL),
2267+ m_escape_on_output (false)
2268 {
2269 add_range (loc, SHOW_RANGE_WITH_CARET, label);
2270 }