<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>begriffs.com</title>
    <link href="https://begriffs.com/atom.xml" rel="self" />
    <link href="https://begriffs.com" />
    <id>https://begriffs.com/atom.xml</id>
    <author>
        <name>Joe Nelson</name>
        <email>joe@begriffs.com</email>
    </author>
    <updated>2023-10-10T00:00:00Z</updated>
    <entry>
    <title>Build dependable bare-metal ARM firmware with UNIX tools</title>
    <link href="https://begriffs.com/posts/2023-10-10-bare-metal-firmware.html" />
    <id>https://begriffs.com/posts/2023-10-10-bare-metal-firmware.html</id>
    <published>2023-10-10T00:00:00Z</published>
    <updated>2023-10-10T00:00:00Z</updated>
    <summary type="html"><![CDATA[<p><a href="https://gum.co/begriffs-firmware" rel="nofollow"
   class="gumroad-button">Download the eBook ($12)</a></p>
<p>For software developers, the world of hardware and firmware can be an exciting change. Firmware catapults your logic into the physical world. Rather than moving text between forms and a database, you can move motors. Rather than listening for an API call, you can listen for SONAR or GPS signals.</p>
<p><strong>This is the guide I wish I had when first starting embedded development.</strong> It cultivates professional embedded programming habits from the start. We’ll skip the beginner ecosystem like Arduino, and get the most out of hardware with bare metal programming.</p>
<p>The low-level approach allows you to:</p>
<ul>
<li>Choose from a variety of chips to match project requirements.</li>
<li>Use a real-time OS, if desired, instead of a “superloop,” for more natural multitasking.</li>
<li>Make the most of hardware resources. No Arduino, and no embedded Linux. Projects have nearly instant boot times.</li>
<li>Avoid bugs in intermediate libraries by using a smaller software stack.</li>
<li>Do full remote debugging with the ability to breakpoint and inspect variables and registers.</li>
<li>Achieve MISRA conformance if necessary, for safety critical systems.</li>
</ul>
<p>In particular, we target the ARM architecture, due to its popularity. While the examples use STMicroelectronics hardware, we avoid their vendor IDE and hardware abstraction layer (HAL). The principles in this guide work with chips from any ARM vendor. Rather than proprietary IDEs and libraries, we’ll use entirely open source tools in a Unix environment (like BSD, Linux, or macOS). Here’s why:</p>
<ul>
<li>Your project won’t “bit rot.” Once it builds, it will continue to build for years to come.</li>
<li>Leverage a mature toolset, like POSIX Make, C99, GCC/LLVM, and GDB/LLDB. They’re either already on your system, or easy to install with the OS package manager.</li>
<li>Use the ubiquitous CMSIS hardware interface. ARM contractually obligates its hardware vendors to supply CMSIS implementations for their products.</li>
<li>Let official manuals rather than 3rd party libraries be the source of truth. The register names in CMSIS match terminology in the hardware reference manuals.</li>
</ul>
<p>Using a strong foundation of toolchain and libraries, we’ll build the same simple “blinky” project in four different ways. We’ll see the boot-up sequence of CMSIS vs the standard library crt0 system. We’ll try writing the program with and without an RTOS, and try dynamic vs static memory allocation. We’ll also see an example of a fault handler, and how to do remote debugging.</p>
<p>By the end of the guide, you can venture confidently into building, flashing, and debugging more complex projects. The guide constructs its examples from product datasheets and first principles; it’s not a copy of existing demos or code snippets.</p>
<p>Download the guide below. For the cost of a sandwich you’ll be up and running.</p>
<a href="https://gum.co/begriffs-firmware" rel="nofollow"
   class="gumroad-button">Download the eBook ($12)</a>
<script type="text/javascript" src="https://gumroad.com/js/gumroad.js"></script>]]></summary>
</entry>
<entry>
    <title>Pleasant debugging with GDB and DDD</title>
    <link href="https://begriffs.com/posts/2022-07-17-debugging-gdb-ddd.html" />
    <id>https://begriffs.com/posts/2022-07-17-debugging-gdb-ddd.html</id>
    <published>2022-07-17T00:00:00Z</published>
    <updated>2022-07-17T00:00:00Z</updated>
    <summary type="html"><![CDATA[<p>GDB is an old and ubiquitous debugger for Linux and BSD systems that has extensive language, processor, and binary format support. Its interface is a little cryptic, but learning GDB pays off.</p>
<p>This article is a set of miscellaneous configuration and scripting tricks that illustrate reusable principles. It assumes you’re familiar with the basics of debugging, like breakpoints, stepping, inspecting variables, etc.</p>
<p><strong>Table of contents</strong></p>
<ul>
<li><a href="#gdb-front-ends">GDB front ends</a>
<ul>
<li><a href="#fixing-ddd-freeze-on-startup">Fixing DDD freeze on startup</a></li>
<li><a href="#honoring-gdbinit-changes">Honoring gdbinit changes</a></li>
<li><a href="#dark-mode">Dark mode</a></li>
<li><a href="#utf-8-rendering">UTF-8 rendering</a></li>
<li><a href="#remote-gdb-configuration">Remote GDB configuration</a></li>
</ul></li>
<li><a href="#gdb-tricks">GDB tricks</a>
<ul>
<li><a href="#useful-execution-commands">Useful execution commands</a></li>
<li><a href="#batch-mode">Batch mode</a></li>
<li><a href="#user-defined-commands">User-defined commands</a></li>
<li><a href="#hooks">Hooks</a></li>
</ul></li>
<li><a href="#python-api">Python API</a>
<ul>
<li><a href="#simple-helper-functions">Simple helper functions</a></li>
<li><a href="#pretty-printing">Pretty printing</a></li>
</ul></li>
<li><a href="#ddd-features">DDD features</a>
<ul>
<li><a href="#historical-values">Historical values</a></li>
<li><a href="#interesting-shortcuts">Interesting shortcuts</a></li>
</ul></li>
<li><a href="#further-reading">Further reading</a></li>
</ul>
<h3 id="gdb-front-ends">GDB front ends</h3>
<p>By default, GDB provides a terse line-based terminal. You need to explicitly ask to print the source code being debugged, the values of variables, or the current list of breakpoints. There are four ways to customize this interface. Ordered from basic to complicated, they are:</p>
<ol type="1">
<li>Get used to the <strong>default</strong> behavior. Then you’ll be comfortable on any system with GDB installed. However, this approach does forgo some real conveniences.</li>
<li>Enable the built-in GDB <strong>TUI mode</strong> with the <a href="https://sourceware.org/gdb/current/onlinedocs/gdb/TUI.html">-tui command line flag</a> (available since GDB version 7.5). The TUI creates Curses windows for source, registers, commands, etc. It’s easier to trace execution through the code and spot breakpoints than in the default interface.</li>
<li>Customize the UI using <strong>scripting</strong>, sourced from your <code>.gdbinit</code>. Some good examples are projects like <a href="https://github.com/cyrus-and/gdb-dashboard">gdb-dashboard</a> and <a href="https://github.com/hugsy/gef">gef</a>.</li>
<li>Use a <strong>graphical front-end</strong> that communicates with an “inferior” GDB instance. Front ends either use the GDB machine interface (MI) to communicate, or they screen scrape sessions directly.</li>
</ol>
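<p>As a rough sketch of option two, the TUI can also be toggled from inside a running session rather than with the <code>-tui</code> flag. The commands below are standard GDB; which layout is most comfortable is a matter of taste:</p>
<pre class="gdb"><code># switch the current session into TUI mode
tui enable

# show the source window above the command window
layout src

# or show a register window together with the code
layout regs

# give keyboard focus to the command window, so the arrow
# keys navigate command history instead of scrolling source
focus cmd

# return to the plain line-based interface
tui disable</code></pre>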
<p>In my experiments, the TUI mode (option two) seemed promising, but it has some limitations:</p>
<ul>
<li>no persistent window to display variables or the call stack</li>
<li>no ability to set or clear breakpoints by mouse</li>
<li>no value inspection with mouse hover</li>
<li>mouse scroll wheel didn’t work for me on OpenBSD+xterm</li>
<li>no interactive structure/pointer exploration</li>
<li>no historical value tracking for variables (aside from GDB’s Linux-only <a href="https://sourceware.org/gdb/onlinedocs/gdb/Process-Record-and-Replay.html">process record and replay</a>)</li>
</ul>
<p>Ultimately I chose option four, with the <a href="https://www.gnu.org/software/ddd/">Data Display Debugger</a> (DDD). It’s fairly ancient, and requires configuration changes to work at all with recent versions of GDB. However, it has a lot of features delivered in a 3MB binary, with no library dependencies other than a Motif-compatible UI toolkit. DDD can also control GDB sessions remotely over SSH.</p>
<figure>
<img src="../images/ddd.png" alt="DDD screenshot" /><figcaption aria-hidden="true">DDD screenshot</figcaption>
</figure>
<h4 id="fixing-ddd-freeze-on-startup">Fixing DDD freeze on startup</h4>
<p>As a front-end, DDD translates user actions to text commands that it sends to GDB. Newer front-ends use GDB’s unambiguous machine interface (MI), but DDD never got updated for that. It parses the standard text interface, essentially screen scraping GDB’s regular output. This causes some problems, but there are workarounds.</p>
<p>Upon starting DDD, the first serious error you’ll run into is the program locking up with this message:</p>
<pre><code>Waiting until GDB gets ready...</code></pre>
<p>The freeze happens because DDD is looking for the prompt <code>(gdb)</code>. However, DDD never sees that prompt, because DDD itself incorrectly changed the prompt at startup.</p>
<p>To fix this error, you must explicitly set the prompt and <em>unset</em> the extended-prompt. In <code>~/.ddd/init</code> include this code:</p>
<pre><code>Ddd*gdbSettings: \
unset extended-prompt\n\
set prompt (gdb) \n</code></pre>
<p>The root of the problem is that during DDD’s first run, it probes all GDB settings and saves them into its <code>~/.ddd/init</code> file for consistency in future runs. It probes by running <code>show settingname</code> for each setting. However, it misinterprets the results for these settings:</p>
<ul>
<li>exec-direction</li>
<li>extended-prompt</li>
<li>filename-display</li>
<li>interactive-mode</li>
<li>max-value-size</li>
<li>mem inaccessible-by-default</li>
<li>mpx bound</li>
<li>record btrace bts</li>
<li>record btrace pt</li>
<li>remote interrupt-sequence</li>
<li>remote system-call-allowed</li>
<li>tdesc</li>
</ul>
<p>The incorrect detection is especially bad for <code>extended-prompt</code>. GDB reports the value as <code>not set</code>, which DDD interprets – not as the lack of a value – but as text to set for the extended prompt. That text overrides the regular prompt, causing GDB to output <code>not set</code> as its actual prompt.</p>
<h4 id="honoring-gdbinit-changes">Honoring gdbinit changes</h4>
<p>As mentioned, DDD probes and saves all GDB settings during first launch. While specifying all settings in <code>~/.ddd/init</code> might make for deterministic behavior on local and remote debugging sessions, it’s inflexible. I want <code>~/.gdbinit</code> to be the source of truth.</p>
<p>Thus you should:</p>
<ul>
<li>Delete all <code>Ddd*gdbSettings</code> other than the prompt ones above, and</li>
<li>Set <code>Ddd*saveOptionsOnExit: off</code> to prevent DDD from putting the values back.</li>
</ul>
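<p>Taken together with the prompt fix from the previous section, the relevant part of <code>~/.ddd/init</code> would then look roughly like this. It is only a sketch; any unrelated resources you rely on stay as they are:</p>
<pre><code>Ddd*gdbSettings: \
unset extended-prompt\n\
set prompt (gdb) \n
Ddd*saveOptionsOnExit: off</code></pre>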
<h4 id="dark-mode">Dark mode</h4>
<p>DDD’s default color scheme is a bit glaring. For dark mode in the code window, console, and data display panel, set these resources:</p>
<pre><code>Ddd*XmText.background:             black
Ddd*XmText.foreground:             white
Ddd*XmTextField.background:        black
Ddd*XmTextField.foreground:        white
Ddd*XmList.background:             black
Ddd*XmList.foreground:             white
Ddd*graph_edit.background:         #333333
Ddd*graph_edit.edgeColor:          red
Ddd*graph_edit.nodeColor:          white
Ddd*graph_edit.gridColor:          white</code></pre>
<h4 id="utf-8-rendering">UTF-8 rendering</h4>
<p>By default, DDD uses X core fonts. All its resources, like <code>Ddd*defaultFont</code>, can pick from only those legacy fonts, which don’t properly render UTF-8. For proper rendering, we have to change the Motif <a href="http://www.ist.co.uk/motif/books/vol6A/ch-24.fm.html">rendering table</a> to use the newer FreeType (XFT) fonts. Pick an XFT font you have on your system; I chose Inconsolata:</p>
<pre><code>Ddd*renderTable: rt
Ddd*rt*fontType: FONT_IS_XFT
Ddd*rt*fontName: Inconsolata
Ddd*rt*fontSize: 8</code></pre>
<p>The change applies to all UI areas of the program <em>except</em> the data display window. That window comes from an earlier codebase bolted on to DDD, and I don’t know how to change its rendering. AFAICT, you can choose only legacy fonts there, with <code>Ddd*dataFont</code> and <code>Ddd*dataFontSize</code>.</p>
<p>Although international graphemes are garbled in the data display window, you can inspect UTF-8 variables by printing them in the GDB console, or by hovering the mouse over variable names for a tooltip display.</p>
<h4 id="remote-gdb-configuration">Remote GDB configuration</h4>
<p>DDD interacts with GDB through the terminal like a user would, so it can drive debugging sessions over SSH just as easily as local sessions. It also knows how to fetch remote source files, and find remote program PIDs to which GDB can attach. DDD’s default program for running commands on a remote inferior is <code>remsh</code> or <code>rsh</code>, but it can be customized to use SSH:</p>
<pre><code>Ddd*rshCommand: ssh -t</code></pre>
<p>In my experience, the <code>-t</code> is needed, or else GDB warnings and errors can appear out of order with the <code>(gdb)</code> prompt, making DDD hang.</p>
<p>To debug a remote GDB over SSH, pass the <code>--host</code> option to DDD. I usually include these command-line options:</p>
<pre><code>ddd --debugger gdb --host admin@example.com --no-exec-window</code></pre>
<p>(I specify the remote debugger command as <code>gdb</code> when it differs from my local inferior debugger command of <code>egdb</code> from the OpenBSD <a href="https://openports.pl/path/devel/gdb">devel/gdb</a> port.)</p>
<h3 id="gdb-tricks">GDB tricks</h3>
<h4 id="useful-execution-commands">Useful execution commands</h4>
<p>Beyond the basics of <code>run</code>, <code>continue</code> and <code>next</code>, don’t forget some other handy commands.</p>
<ul>
<li><code>finish</code> - execute until the current function returns, and break in caller. Useful if you accidentally go too deep, or if the rest of a function is of no interest.</li>
<li><code>until</code> - execute until reaching a later line. You can use this on the last line of a loop to run through the rest of the iterations, break out, and stop.</li>
<li><code>start</code> - create a temporary breakpoint on the first line of <code>main()</code> and then <code>run</code>. Starts the program and breaks right away.</li>
<li><code>step</code> vs <code>next</code> - how to remember the difference? Think of a flight of “steps” going downward, “stepping down” into subroutines. By contrast, “next” moves to the next contiguous source line.</li>
</ul>
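<p>To see how these commands fit together, here is a sketch of the order in which you might issue them in a short session. GDB’s output is omitted, and the comments describe the general behavior rather than any particular program:</p>
<pre class="gdb"><code># run to the first line of main() and stop there
start

# execute the current line, stepping over any calls it makes
next

# execute the current line, descending into a function it calls
step

# went too deep? run until this function returns to its caller
finish

# on the last line of a loop body: run the remaining iterations
# and stop at the first line past the loop
until</code></pre>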
<h4 id="batch-mode">Batch mode</h4>
<p>GDB can be used non-interactively, with predefined scripts, to create little utility programs. For example, the <a href="https://poormansprofiler.org">poor man’s profiler</a> is a technique of calling GDB repeatedly to sample the call stack of a running program. It sends the results to awk to tally where most wall clock time (as opposed to just CPU time) is being spent.</p>
<p>A related idea is using GDB to print information about a core dump without leaving the UNIX command line. We can issue a single GDB command to list the backtraces for all threads, plus all stack frame variables and function arguments. Notice the <a href="https://sourceware.org/gdb/current/onlinedocs/gdb/Print-Settings.html">print settings</a> customized for clean, verbose output.</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># show why program.core died</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="fu">gdb</span> <span class="at">--batch</span> <span class="dt">\</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>  <span class="at">-ex</span> <span class="st">&quot;set print frame-arguments all&quot;</span> <span class="dt">\</span></span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a>  <span class="at">-ex</span> <span class="st">&quot;set print pretty on&quot;</span> <span class="dt">\</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a>  <span class="at">-ex</span> <span class="st">&quot;set print addr off&quot;</span> <span class="dt">\</span></span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a>  <span class="at">-ex</span> <span class="st">&quot;thread apply all bt full&quot;</span> <span class="dt">\</span></span>
<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a>  /path/to/program program.core</span></code></pre></div>
<p>You can put this incantation (minus the final program and core file paths) into a shell alias (like <code>bt</code>) so you can run it more easily. To test, you can generate a core by running a program and sending it SIGQUIT with <code>Ctrl-\</code>. Adjusting <code>ulimit -c</code> may also be necessary to save cores, depending on your OS.</p>
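<p>An alternative to a shell alias is to keep the settings in a GDB command file and pass it with <code>-x</code>. The following is a minimal sketch; the file name <code>~/.gdb/bt.gdb</code> is only an assumption for illustration:</p>
<pre class="gdb"><code># ~/.gdb/bt.gdb (hypothetical location)
# the same commands as the -ex options above, one per line

set print frame-arguments all
set print pretty on
set print addr off
thread apply all bt full</code></pre>
<p>It would be invoked as <code>gdb --batch -x ~/.gdb/bt.gdb /path/to/program program.core</code>. Keeping the commands in a file avoids shell quoting and makes them easier to extend.</p>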
<h4 id="user-defined-commands">User-defined commands</h4>
<p>GDB allows you to define custom commands that can do arbitrarily complex things. Commands can set breakpoints, display values, and even call to the shell.</p>
<p>Here’s an example that does a few of these things. It traces the system calls made by a single function of interest. The real work happens by shelling out to OpenBSD’s <a href="https://man.openbsd.org/ktrace">ktrace(1)</a>. (An equivalent tracing utility should exist for your operating system.)</p>
<pre class="gdb"><code>define ktrace
    # if a user presses enter on a blank line, GDB will by default
    # repeat the command, but we don&#39;t want that for ktrace

    dont-repeat

    # set a breakpoint for the specified function, and run commands
    # when the breakpoint is hit

    break $arg0
    commands
        # don&#39;t echo the commands to the user
        silent

        # set a convenience variable with the result of a C function
        set $tracepid = (int)getpid()

        # eval (GDB 7.2+) interpolates values into a command, and runs it
        eval &quot;set $ktraceout=\&quot;/tmp/ktrace.%d.out\&quot;&quot;, $tracepid
        printf &quot;ktrace started: %s\n&quot;, $ktraceout
        eval &quot;shell ktrace -a -f %s -p %d&quot;, $ktraceout, $tracepid

        printf &quot;\nrun \&quot;ktrace_stop\&quot; to stop tracing\n\n&quot;

        # &quot;finish&quot; continues execution for the duration of the current
        # function, and then breaks
        finish

        # After commands that continue execution, like finish does,
        # we lose control in the GDB breakpoint. We cannot issue
        # more commands here
    end

    # GDB automatically sets $bpnum to the identifier of the created breakpoint
    set $tracebp = $bpnum
end

define ktrace_stop
    dont-repeat

    # consult $ktraceout and $tracebp set by ktrace earlier

    eval &quot;shell ktrace -c -f %s&quot;, $ktraceout
    del $tracebp
    printf &quot;ktrace stopped for %s\n&quot;, $ktraceout
end</code></pre>
<p>Here’s a demonstration with a simple program. It has two functions that involve different kinds of system calls:</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="pp">#define _POSIX_C_SOURCE 200112L</span></span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unistd.h&gt;</span></span>
<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> delay<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb9-7"><a href="#cb9-7" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb9-8"><a href="#cb9-8" aria-hidden="true" tabindex="-1"></a>	sleep<span class="op">(</span><span class="dv">1</span><span class="op">);</span></span>
<span id="cb9-9"><a href="#cb9-9" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb9-10"><a href="#cb9-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-11"><a href="#cb9-11" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> alert<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb9-12"><a href="#cb9-12" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb9-13"><a href="#cb9-13" aria-hidden="true" tabindex="-1"></a>	puts<span class="op">(</span><span class="st">&quot;Hello&quot;</span><span class="op">);</span></span>
<span id="cb9-14"><a href="#cb9-14" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb9-15"><a href="#cb9-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-16"><a href="#cb9-16" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb9-17"><a href="#cb9-17" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb9-18"><a href="#cb9-18" aria-hidden="true" tabindex="-1"></a>	alert<span class="op">();</span></span>
<span id="cb9-19"><a href="#cb9-19" aria-hidden="true" tabindex="-1"></a>	delay<span class="op">();</span></span>
<span id="cb9-20"><a href="#cb9-20" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>After loading the program into GDB, here’s how to see which syscalls the <code>delay()</code> function makes. Tracing is limited to just that function, and doesn’t include the system calls made by any other functions, like <code>alert()</code>.</p>
<pre class="gdb"><code>(gdb) ktrace delay
Breakpoint 1 at 0x1a10: file sleep.c, line 7.
(gdb) run
Starting program: sleep
ktrace started: /tmp/ktrace.5432.out

run &quot;ktrace_stop&quot; to stop tracing

main () at sleep.c:20
(gdb) ktrace_stop
ktrace stopped for /tmp/ktrace.5432.out</code></pre>
<p>The trace output is a binary file, and we can use kdump(1) to view it, like this:</p>
<pre><code>$ kdump -f /tmp/ktrace.5432.out
  5432 sleep    CALL  kbind(0x7f7ffffda6a8,24,0xa0ef4d749fb64797)
  5432 sleep    RET   kbind 0
  5432 sleep    CALL  nanosleep(0x7f7ffffda748,0x7f7ffffda738)
  5432 sleep    STRU  struct timespec { 1 }
  5432 sleep    STRU  struct timespec { 0 }
  5432 sleep    RET   nanosleep 0</code></pre>
<p>This shows that, on OpenBSD, sleep(3) calls nanosleep(2).</p>
<p>On a related note, another way to get insight into syscalls is by setting <a href="https://sourceware.org/gdb/onlinedocs/gdb/Set-Catchpoints.html">catchpoints</a> to break on a call of interest. This is a Linux-only feature.</p>
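<p>For instance, the following sketch stops each time the program enters or returns from the <code>write</code> syscall, prints a backtrace, and resumes. The choice of <code>write</code> is just an example; which syscalls a given libc function ends up making differs between platforms:</p>
<pre class="gdb"><code># stop whenever the write syscall is entered or returns
catch syscall write
commands
    # show who made the call, then keep running
    bt
    continue
end</code></pre>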
<h4 id="hooks">Hooks</h4>
<p>GDB gives special treatment to user-defined commands whose names begin with <code>hook-</code> or <code>hookpost-</code>. It runs <code>hook-foo</code> automatically before, and <code>hookpost-foo</code> automatically after, a user runs the command <code>foo</code>. In addition, a pseudo-command “stop” exists for whenever execution stops, such as at a breakpoint.</p>
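<p>To make the before/after ordering concrete, here is a tiny sketch that wraps the built-in <code>echo</code> command with both kinds of hook:</p>
<pre class="gdb"><code># runs just before every &quot;echo&quot;
define hook-echo
    echo &lt;&lt;&lt;---
end

# runs just after every &quot;echo&quot;
define hookpost-echo
    echo ---&gt;&gt;&gt;\n
end</code></pre>
<p>With these defined, running <code>echo Hello World</code> should print the text between the two markers.</p>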
<p>As an example, consider <a href="https://sourceware.org/gdb/onlinedocs/gdb/Auto-Display.html">automatic variable displays</a>. GDB can automatically print the value of expressions every time the program stops with, e.g. <code>display varname</code>. However, what if we want to display all local variables this way?</p>
<p>There’s no direct expression to do it with <code>display</code>, but we can create a hook:</p>
<pre class="gdb"><code>define hook-stop
    # do it conditionally
    if $display_locals_flag
        # dump the values of all local vars
        info locals
    end
end

# commands to (de)activate the display

define display_locals
    set $display_locals_flag = 1
end

define undisplay_locals
    set $display_locals_flag = 0
end</code></pre>
<p>To be fair, the <a href="https://sourceware.org/gdb/onlinedocs/gdb/TUI-Single-Key-Mode.html#TUI-Single-Key-Mode">TUI single key mode</a> binds <code>info locals</code> to the <code>v</code> key, so our hook is less useful in TUI mode than it first appears.</p>
<h3 id="python-api">Python API</h3>
<h4 id="simple-helper-functions">Simple helper functions</h4>
<p>GDB exposes a <a href="https://sourceware.org/gdb/onlinedocs/gdb/Python-API.html">Python API</a> for finer control over the debugger. GDB scripts can include Python directly in designated blocks. For instance, right in <code>.gdbinit</code> we can access the Python API to get call stack frame information.</p>
<p>In this example, we’ll trace function calls matching a regex. If no regex is specified, we’ll match all functions visible to GDB, except low level functions (which start with underscore).</p>
<pre class="gdb"><code># drop into python to access frame information

python
    # this module contains the GDB API

    import gdb

    # define a helper function we can use later in a user command
    #
    # it prints the name of the function in the specified frame,
    # with indentation depth matching the stack depth

    def frame_indented_name(frame):
        # frame.level() is not always available,
        # so we traverse the list and count depth

        f = frame
        depth = 0
        while (f):
            depth = depth + 1
            f = f.older()
        return &quot;%s%s&quot; % (&quot;  &quot; * depth, frame.name())
end

# trace calls of functions matching a regex

define ftrace
    dont-repeat

    # we&#39;ll set possibly many breakpoints, so record the
    # starting number of the group

    set $first_new = 1 + ($bpnum ? $bpnum : 0)

    if $argc &lt; 1
        # by default, trace all functions except those that start with
        # underscore, which are low-level system things
        #
        # rbreak sets multiple breakpoints via a regex

        rbreak ^[a-zA-Z]
    else
        # or match based on ftrace argument, if passed

        rbreak $arg0
    end
    commands
        silent
        
        # drop into python again to use our helper function to
        # print the name of the newest frame

        python print(frame_indented_name(gdb.newest_frame()))

        # then immediately keep going
        cont
    end

    printf &quot;\nTracing enabled. To disable, run:\n\tdel %d-%d\n&quot;, $first_new, $bpnum
end</code></pre>
<p>To use ftrace, put breakpoints at either end of an area of interest. When you arrive at the first breakpoint, run ftrace with an optional regex argument. Then, continue the debugger and watch the output.</p>
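<p>A session following those steps might look like the sketch below. The second breakpoint location (<code>treemap.c:120</code>) is an invented file and line, and the regex is only one possible choice; <code>tm_insert</code> and the <code>internal_tm_</code> prefix are taken from the treemap example:</p>
<pre class="gdb"><code># bracket the area of interest with two breakpoints
break tm_insert
break treemap.c:120
run

# stopped at the first breakpoint: start tracing, here
# restricted to functions whose names match a regex
ftrace ^internal_tm_

# resume; each traced call prints its indented name until
# the second breakpoint is reached
continue</code></pre>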
<p>Here’s sample trace output from inserting a key-value into a treemap (<code>tm_insert()</code>) in my <a href="https://github.com/begriffs/libderp">libderp</a> library. You can see the “split” and “skew” operations happening in the underlying balanced <a href="https://user.it.uu.se/~arnea/ps/simp.pdf">AA-tree</a>.</p>
<pre><code>tm_insert
  malloc
    omalloc
  malloc
    omalloc
          map
          insert
  internal_tm_insert
    derp_strcmp
    internal_tm_insert
      derp_strcmp
      internal_tm_insert
        derp_strcmp
        internal_tm_insert
        internal_tm_skew
        internal_tm_split
      internal_tm_skew
      internal_tm_split
    internal_tm_skew
    internal_tm_split</code></pre>
<h4 id="pretty-printing">Pretty printing</h4>
<p>GDB allows you to customize the way it displays values. For instance, you may want to inspect Unicode strings when working with the ICU library. ICU’s internal encoding for <a href="https://unicode-org.github.io/icu/userguide/strings/#icu-16-bit-unicode-strings">UChar</a> is UTF-16. GDB has no way to know that an array ostensibly containing numbers is actually a string of UTF-16 code units. However, using the Python API, we can convert the string to a form GDB understands.</p>
<p>While a bit esoteric, this example provides the template you would use to create pretty printers for any type.</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> gdb.printing, re</span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a><span class="co"># a pretty printer </span></span>
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> UCharPrinter:</span>
<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a>    <span class="co">&#39;Print ICU UChar string&#39;</span></span>
<span id="cb15-7"><a href="#cb15-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-8"><a href="#cb15-8" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>, val):</span>
<span id="cb15-9"><a href="#cb15-9" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>.val <span class="op">=</span> val</span>
<span id="cb15-10"><a href="#cb15-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-11"><a href="#cb15-11" aria-hidden="true" tabindex="-1"></a>    <span class="co"># tell gdb to print the value in quotes, like a string</span></span>
<span id="cb15-12"><a href="#cb15-12" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> display_hint(<span class="va">self</span>):</span>
<span id="cb15-13"><a href="#cb15-13" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span> <span class="st">&#39;string&#39;</span></span>
<span id="cb15-14"><a href="#cb15-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-15"><a href="#cb15-15" aria-hidden="true" tabindex="-1"></a>    <span class="co"># the actual work...</span></span>
<span id="cb15-16"><a href="#cb15-16" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> to_string(<span class="va">self</span>):</span>
<span id="cb15-17"><a href="#cb15-17" aria-hidden="true" tabindex="-1"></a>        p_c16 <span class="op">=</span> gdb.lookup_type(<span class="st">&#39;char16_t&#39;</span>).pointer()</span>
<span id="cb15-18"><a href="#cb15-18" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span> <span class="va">self</span>.val.cast(p_c16).string(<span class="st">&#39;UTF-16&#39;</span>)</span>
<span id="cb15-19"><a href="#cb15-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-20"><a href="#cb15-20" aria-hidden="true" tabindex="-1"></a><span class="co"># bookkeeping that associates the UCharPrinter with the types</span></span>
<span id="cb15-21"><a href="#cb15-21" aria-hidden="true" tabindex="-1"></a><span class="co"># it can handle, and adds an entry to &quot;info pretty-printer&quot;</span></span>
<span id="cb15-22"><a href="#cb15-22" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-23"><a href="#cb15-23" aria-hidden="true" tabindex="-1"></a><span class="kw">class</span> UCharPrinterInfo(gdb.printing.PrettyPrinter):</span>
<span id="cb15-24"><a href="#cb15-24" aria-hidden="true" tabindex="-1"></a>    <span class="co"># friendly name for printer</span></span>
<span id="cb15-25"><a href="#cb15-25" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> <span class="fu">__init__</span>(<span class="va">self</span>):</span>
<span id="cb15-26"><a href="#cb15-26" aria-hidden="true" tabindex="-1"></a>        <span class="bu">super</span>().<span class="fu">__init__</span>(<span class="st">&#39;UChar string printer&#39;</span>)</span>
<span id="cb15-27"><a href="#cb15-27" aria-hidden="true" tabindex="-1"></a>        <span class="va">self</span>._re <span class="op">=</span> re.<span class="bu">compile</span>(<span class="st">r&#39;^UChar [\[*]&#39;</span>)</span>
<span id="cb15-28"><a href="#cb15-28" aria-hidden="true" tabindex="-1"></a>  </span>
<span id="cb15-29"><a href="#cb15-29" aria-hidden="true" tabindex="-1"></a>    <span class="co"># is UCharPrinter appropriate for val?</span></span>
<span id="cb15-30"><a href="#cb15-30" aria-hidden="true" tabindex="-1"></a>    <span class="kw">def</span> <span class="fu">__call__</span>(<span class="va">self</span>, val):</span>
<span id="cb15-31"><a href="#cb15-31" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> <span class="va">self</span>._re.match(<span class="bu">str</span>(val.<span class="bu">type</span>)):</span>
<span id="cb15-32"><a href="#cb15-32" aria-hidden="true" tabindex="-1"></a>            <span class="cf">return</span> UCharPrinter(val)</span></code></pre></div>
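<p>To see which type names the matcher in <code>UCharPrinterInfo</code> accepts, here is a small stand-alone sketch that runs the same regular expression outside GDB. The helper name <code>is_uchar_string</code> and the sample type strings are assumptions about what <code>str(val.type)</code> might yield, not output captured from a real session.</p>

```python
import re

# Same pattern as UCharPrinterInfo: "UChar " followed by "[" or "*".
UCHAR_RE = re.compile(r'^UChar [\[*]')

def is_uchar_string(type_name):
    """Return True when type_name looks like a UChar array or pointer."""
    return UCHAR_RE.match(type_name) is not None

# Hypothetical type names, for illustration only.
for name in ['UChar *', 'UChar [12]', 'UChar', 'const UChar *', 'char *']:
    print(f'{name!r:18} {is_uchar_string(name)}')
```

<p>In this sketch a plain <code>UChar</code> and a const-qualified pointer both fail the test, because the pattern is anchored at the start and requires a space followed by <code>[</code> or <code>*</code>.</p>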
<p>While it’s nice to create code such as the pretty printer above, the code won’t do anything until we tell GDB how and when to load it. You can certainly dump Python code blocks into your <code>~/.gdbinit</code>, but that’s not very modular, and can load things unnecessarily.</p>
<p>I prefer to organize the code in dedicated directories like this:</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="fu">mkdir</span> <span class="at">-p</span> ~/.gdb/<span class="dt">{py-modules</span><span class="op">,</span><span class="dt">auto-load}</span></span></code></pre></div>
<p>The <code>~/.gdb/py-modules</code> directory is for user modules (like the ICU pretty printer), and <code>~/.gdb/auto-load</code> is for scripts that GDB automatically loads at certain times.</p>
<p>Having created those directories, tell GDB to consult them. Add this to your <code>~/.gdbinit</code>:</p>
<pre class="gdb"><code>add-auto-load-safe-path /home/foo/.gdb
add-auto-load-scripts-directory /home/foo/.gdb/auto-load</code></pre>
<p>Now, when GDB loads a library like <code>/usr/lib/baz.so.x.y</code> on behalf of your program, it will also search for <code>~/.gdb/auto-load/usr/lib/baz.so.x.y-gdb.py</code> and load it if it exists. To see which libraries GDB loads for an application, enable verbose mode, and then start execution.</p>
<pre><code>(gdb) set verbose
(gdb) start

...
Reading symbols from /usr/libexec/ld.so...
Reading symbols from /usr/lib/libpthread.so.26.1...
Reading symbols from ...</code></pre>
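<p>The lookup rule for auto-load scripts amounts to path concatenation. The sketch below mimics it in plain Python. The function name <code>autoload_script_path</code> is made up for illustration, and it ignores details of GDB's real search, such as resolving symlinks or consulting other directories.</p>

```python
import posixpath

def autoload_script_path(scripts_dir, objfile):
    """Where a scripts directory would hold the Python script for objfile.

    Illustrative only: joins the scripts directory, the library's
    absolute path, and the "-gdb.py" suffix.
    """
    relative = objfile.lstrip('/')
    return posixpath.join(scripts_dir, relative) + '-gdb.py'

print(autoload_script_path('/home/foo/.gdb/auto-load',
                           '/usr/lib/baz.so.x.y'))
```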
<p>On my machine for an application using ICU, GDB loaded <code>/usr/local/lib/libicuuc.so.20.1</code>. To enable the ICU pretty printer, I create an auto-load file:</p>
<div class="sourceCode" id="cb19"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="co"># ~/.gdb/auto-load/usr/local/lib/libicuuc.so.20.1-gdb.py</span></span>
<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> gdb.printing</span>
<span id="cb19-4"><a href="#cb19-4" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> printers.libicuuc</span>
<span id="cb19-5"><a href="#cb19-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-6"><a href="#cb19-6" aria-hidden="true" tabindex="-1"></a>gdb.printing.register_pretty_printer(</span>
<span id="cb19-7"><a href="#cb19-7" aria-hidden="true" tabindex="-1"></a>    gdb.current_objfile(),</span>
<span id="cb19-8"><a href="#cb19-8" aria-hidden="true" tabindex="-1"></a>    printers.libicuuc.UCharPrinterInfo())</span></code></pre></div>
<p>The final question is how the auto-loader resolves the <code>printers.libicuuc</code> module. We need to add <code>~/.gdb/py-modules</code> to the Python module search path. I use a little trick: a file in that directory that detects its own location and adds it to <code>sys.path</code>:</p>
<div class="sourceCode" id="cb20"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="co"># ~/.gdb/py-modules/add-syspath.py</span></span>
<span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> sys, os</span>
<span id="cb20-4"><a href="#cb20-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-5"><a href="#cb20-5" aria-hidden="true" tabindex="-1"></a>sys.path.append(os.path.dirname(os.path.realpath(<span class="va">__file__</span>)))</span></code></pre></div>
<p>Then just source the file from <code>~/.gdbinit</code>:</p>
<pre class="gdb"><code>source /home/foo/.gdb/py-modules/add-syspath.py</code></pre>
<p>After doing that, save the ICU pretty printing code as <code>~/.gdb/py-modules/printers/libicuuc.py</code>, and the <code>import printers.libicuuc</code> statement will find it.</p>
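<p>The effect of extending <code>sys.path</code> can be tried outside GDB. This sketch builds a throwaway directory that stands in for <code>~/.gdb/py-modules</code>, drops a placeholder <code>printers/libicuuc.py</code> into it, and shows that the import then resolves. The temporary directory and the <code>MARKER</code> variable are assumptions for the demo.</p>

```python
import importlib
import os
import sys
import tempfile

# Stand-in for ~/.gdb/py-modules (a temporary directory, for the demo).
modules_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(modules_dir, 'printers'))

# Placeholder module; the real file would hold the pretty printer code.
with open(os.path.join(modules_dir, 'printers', 'libicuuc.py'), 'w') as f:
    f.write('MARKER = "loaded from py-modules"\n')

# This is what add-syspath.py does with its own directory.
sys.path.append(modules_dir)
importlib.invalidate_caches()

import printers.libicuuc
print(printers.libicuuc.MARKER)
```

<p>On Python 3 the <code>printers</code> directory needs no <code>__init__.py</code>, since a bare directory on the search path is treated as a namespace package.</p>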
<h3 id="ddd-features">DDD features</h3>
<p>In addition to providing a graphical user interface, DDD has a few features of its own.</p>
<h4 id="historical-values">Historical values</h4>
<p>Each time the program stops at a breakpoint, DDD records the values of all displayed variables. You can place breakpoints strategically to sample the historical values of a variable, and then view or plot them on a graph.</p>
<p>For instance, compile this program with debugging information enabled, and load it in DDD:</p>
<div class="sourceCode" id="cb22"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb22-3"><a href="#cb22-3" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> x <span class="op">=</span> <span class="dv">381</span><span class="op">;</span></span>
<span id="cb22-4"><a href="#cb22-4" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>x <span class="op">!=</span> <span class="dv">1</span><span class="op">)</span></span>
<span id="cb22-5"><a href="#cb22-5" aria-hidden="true" tabindex="-1"></a>		x <span class="op">=</span> <span class="op">(</span>x <span class="op">%</span> <span class="dv">2</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span> <span class="op">?</span> x<span class="op">/</span><span class="dv">2</span> <span class="op">:</span> <span class="dv">3</span><span class="op">*</span>x <span class="op">+</span> <span class="dv">1</span><span class="op">;</span></span>
<span id="cb22-6"><a href="#cb22-6" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb22-7"><a href="#cb22-7" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<ol type="1">
<li><p>Double click to the left of the <code>x = ...</code> line to set a breakpoint. Right click the stop sign icon that appears, and select <strong>Properties…</strong>. In the dialog box, click <strong>Edit &gt;&gt;</strong> and enter <code>continue</code> into the text box. Apply your change and close the dialog. This breakpoint will stop, record the value of <code>x</code>, then immediately continue running.</p></li>
<li><p>Set a breakpoint on the <code>return 0</code> line.</p></li>
<li><p>Select <strong>GDB console</strong> from the <strong>View</strong> menu (or press Alt-1).</p></li>
<li><p>Run <code>start</code> in the GDB console to run the program and break at the first line.</p></li>
<li><p>Double click the “x” variable to add it to the graphical display. (If you don’t put it in the display window, DDD won’t track its values over time.)</p></li>
<li><p>Select <strong>Continue</strong> from the <strong>Program</strong> menu (or press F9). You’ll see the displayed value of <code>x</code> updating rapidly.</p></li>
<li><p>When execution stops at the last breakpoint, run <code>graph history x</code> in the GDB console. It will output an array of all previous values:</p>
<pre><code>(gdb) graph history x
history x = {0, 381, 1144, 572, 286, 143, 430, 215, 646, 323, 970, 485,
1456, 728, 364, 182, 91, 274, 137, 412, 206, 103, 310, 155, 466, 233, 700, 350,
175, 526, 263, 790, 395, 1186, 593, 1780, 890, 445, 1336, 668, 334, 167, 502,
251, 754, 377, 1132, 566, 283, 850, 425, 1276, 638, 319, 958, 479, 1438, 719,
2158, 1079, 3238, 1619, 4858, 2429, 7288, 3644, 1822, 911, 2734, 1367, 4102,
2051, 6154, 3077, 9232, 4616, 2308, 1154, 577, 1732, 866, 433, 1300, 650, 325,
976, 488, 244, 122, 61, 184, 92, 46, 23, 70, 35, 106, 53, 160, 80, 40, 20, 10,
5, 16, 8, 4, 2, 1}</code></pre></li>
</ol>
<p><img src="../images/ddd-graph.png" alt="graph of values"  class="right" /></p>
<p>To see the values plotted graphically, run</p>
<pre><code>graph plot `graph history x`</code></pre>
<p>DDD sends the data to gnuplot to render the graph. (Be sure to set <code>Ddd*plotTermType: x11</code> in <code>~/.ddd/init</code>, or else DDD will hang with a dialog saying “Starting Gnuplot…”.)</p>
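<p>As a sanity check on the recorded history, the same sequence can be computed outside the debugger. This sketch repeats the loop from the C program in Python. The function name <code>collatz_history</code> is only illustrative. The leading <code>0</code> in DDD's output does not appear here, since it is presumably the value <code>x</code> held before its initializer ran.</p>

```python
def collatz_history(x):
    """Values x takes, starting value included, until it reaches 1."""
    values = [x]
    while x != 1:
        x = x // 2 if x % 2 == 0 else 3 * x + 1
        values.append(x)
    return values

history = collatz_history(381)
print(history[:6])
print(max(history), history[-1])
```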
<h4 id="interesting-shortcuts">Interesting shortcuts</h4>
<p>DDD has some shortcuts that aren’t obvious from the interface, but which I found interesting in the documentation.</p>
<ul>
<li>Control-doubleclick on the left of a line to set a temporary breakpoint, or on an existing breakpoint to delete it. Control-doubleclick in the data window dereferences in place, rather than creating a new display.</li>
<li>Click and drag a breakpoint to a new line, and it moves while preserving all its properties.</li>
<li>Click and hold buttons to reveal special functions. For instance, hold the watch button to choose between a watchpoint on change and one on read.</li>
<li>Pressing Esc (or the interrupt button) acts like an impromptu breakpoint.</li>
<li>By default, typing into the source window redirects keystrokes to the GDB console, so you don’t have to focus the console to issue commands.</li>
<li>Control-Up/Down changes the stack frame quickly.</li>
<li>You can display more than individual local variables in the data window. Go to Data -&gt; Status Displays for checkboxes that add other common displays, like the backtrace, or all local variables at once.</li>
<li>Pressing F1 shows help specific to whatever control is under the mouse cursor.</li>
<li>By default, GDB asks for confirmation before killing or detaching from the program when you quit. Use <code>set confirm off</code> to disable the prompt.</li>
</ul>
<h3 id="further-reading">Further reading</h3>
<ul>
<li><a href="https://sourceware.org/gdb/documentation/">Debugging with GDB: the GNU Source-Level Debugger</a>. This page links to a printed book, a PDF, and online HTML.</li>
<li><a href="https://www.gnu.org/software/ddd/manual/">DDD Manual</a></li>
<li><a href="https://www.goodreads.com/book/show/3938178-debugging">Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems</a>. Timeless debugging techniques, not specific to any particular tooling, or even computers per se.</li>
</ul>]]></summary>
</entry>
<entry>
    <title>Practical parsing with Flex and Bison</title>
    <link href="https://begriffs.com/posts/2021-11-28-practical-parsing.html" />
    <id>https://begriffs.com/posts/2021-11-28-practical-parsing.html</id>
    <published>2021-11-28T00:00:00Z</published>
    <updated>2021-11-28T00:00:00Z</updated>
    <summary type="html"><![CDATA[<p>Although parsing is often described from the perspective of writing a compiler, there are many common smaller tasks where it’s useful. Reading file formats, talking over the network, creating shells, and analyzing source code are all easier using a robust parser.</p>
<p>By taking time to learn general-purpose parsing tools, you can go beyond fragile homemade solutions, and inflexible third-party libraries. We’ll cover <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/lex.html">Lex</a> and <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/yacc.html">Yacc</a> in this guide because they are mature and portable. We’ll also cover their later incarnations as Flex and Bison.</p>
<p><img src="../images/parse/calc.png" class="right" /> Above all, this guide is practical. We’ll see how to properly integrate parser generators into your build system, how to create thread-safe parsing modules, and how to parse real data formats. I’ll motivate each feature of the parser generator with a concrete problem it can solve. And, I promise, none of the typical calculator examples.</p>
<p><strong>Table of contents</strong></p>
<ul>
<li><a href="#lexical-scanning">Lexical scanning</a>
<ul>
<li><a href="#more-realistic-scanner">More realistic scanner</a></li>
<li><a href="#using-a-scanner-as-a-library">Using a scanner as a library</a></li>
</ul></li>
<li><a href="#parsing">Parsing</a>
<ul>
<li><a href="#mental-model-of-lr-parsing">Mental model of LR parsing</a></li>
<li><a href="#ambiguous-grammars">Ambiguous grammars</a></li>
<li><a href="#constructing-semantic-values">Constructing semantic values</a></li>
<li><a href="#using-a-parser-as-a-library">Using a parser as a library</a></li>
<li><a href="#designing-against-an-rfc">Designing against an RFC</a></li>
<li><a href="#parsing-a-more-complicated-rfc">Parsing a more complicated RFC</a></li>
</ul></li>
<li><a href="#further-resources">Further resources</a></li>
</ul>
<h3 id="lexical-scanning">Lexical scanning</h3>
<p>People usually use two stages to process structured text. The first stage, lexing (aka scanning), breaks the input into meaningful chunks of characters. The second, parsing, groups the scanned chunks following potentially recursive rules. However, a nice lexing tool like Lex can be useful on its own, even when not paired with a parser.</p>
<p>The simplest way to describe Lex is that it runs user-supplied C code blocks for regular expression matches. It reads a list of regexes and constructs a giant state machine which attempts to match them all “simultaneously.”</p>
<p>A lex input file is composed of three possible sections: definitions, rules, and helper functions. The sections are delimited by <code>%%</code>. Lex transforms its input file into a plain C file that can be built using an ordinary C compiler.</p>
<p>Here’s an example. We’ll match the strings <code>cot</code>, <code>cat</code>, and <code>cats</code>. Our actions will print a replacement for each.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* catcot.l */</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%{</span></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a><span class="bn">%}</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a><span class="st">cot</span> { printf<span class="op">(</span><span class="st">&quot;portable bed&quot;</span><span class="op">);</span> }</span>
<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a><span class="st">cat</span> { printf<span class="op">(</span><span class="st">&quot;thankless pet&quot;</span><span class="op">);</span> }</span>
<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a><span class="st">cats</span> { printf<span class="op">(</span><span class="st">&quot;anti-herd&quot;</span><span class="op">);</span> }</span></code></pre></div>
<p>To build it:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co"># turn the input into an intermediate C file</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="fu">lex</span> <span class="at">-t</span> catcot.l <span class="op">&gt;</span> catcot.c</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="co"># compile it</span></span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="fu">cc</span> <span class="at">-o</span> catcot catcot.c <span class="at">-ll</span></span></code></pre></div>
<p>(Alternately, build it in one step with <code>make catcot</code>. Even in the absence of a Makefile, POSIX make has <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/make.html#tag_20_76_13_09">suffix rules</a> that handle <code>.l</code> files.)</p>
<p>The program outputs simple substitutions:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="bu">echo</span> <span class="st">&quot;the cat on the cot joined the cats&quot;</span> <span class="kw">|</span> <span class="ex">./catcot</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="ex">the</span> thankless pet on the portable bed joined the anti-herd</span></code></pre></div>
<p>The reason it prints non-matching words (such as “the”) is that there’s an implicit rule matching any character (<code>.</code>) and echoing it. In most real parsers we’ll want to override that.</p>
<p>Here’s what’s happening inside the scanner. Lex reads the regexes and generates a state machine to consume input. Below is a visualization of the states, with transitions labeled by input character. The circles with a double outline indicate states that trigger actions.</p>
<figure>
<img src="../images/parse/cat.png" alt="cat state machine" /><figcaption aria-hidden="true">cat state machine</figcaption>
</figure>
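<p>To make the state machine idea concrete, here is a rough sketch that builds a tiny character-by-character automaton for the three literal words and steps through it by hand. It is not how Lex is implemented, since Lex compiles general regexes into table-driven C code. The dictionary-based trie and the names <code>build_machine</code> and <code>longest_match</code> are assumptions for illustration.</p>

```python
def build_machine(words):
    """Build a trie: each state maps a character to the next state."""
    start = {'next': {}, 'accepts': None}
    for word in words:
        state = start
        for ch in word:
            state = state['next'].setdefault(
                ch, {'next': {}, 'accepts': None})
        state['accepts'] = word  # a "double circle" state
    return start

def longest_match(machine, text):
    """Follow transitions, remembering the last accepting state seen."""
    state, best = machine, None
    for ch in text:
        state = state['next'].get(ch)
        if state is None:
            break
        if state['accepts'] is not None:
            best = state['accepts']
    return best

machine = build_machine(['cot', 'cat', 'cats'])
for sample in ['cats!', 'catch', 'cot', 'cow']:
    print(sample, longest_match(machine, sample))
```

<p>Remembering the last accepting state, rather than stopping at the first one, is what lets the machine prefer <code>cats</code> over <code>cat</code> when both are possible.</p>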
<p>Note there’s no notion of word boundaries in our lexer; it’s operating on characters alone. For instance:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="bu">echo</span> <span class="st">&quot;catch!&quot;</span> <span class="kw">|</span> <span class="ex">./catcot</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="ex">thankless</span> petch!</span></code></pre></div>
<p>That sounds rather like an insult.</p>
<p>An important subtlety is how Lex handles multiple eligible matches. It picks the longest possible match available, and in the case of a tie, picks the matching pattern defined earliest.</p>
<p>To illustrate, suppose we add a looser regex, <code>c.t</code>, first.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="st">c.t</span> { printf<span class="op">(</span><span class="st">&quot;mumble mumble&quot;</span><span class="op">);</span> } </span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="st">cot</span> { printf<span class="op">(</span><span class="st">&quot;portable bed&quot;</span><span class="op">);</span> }</span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="st">cat</span> { printf<span class="op">(</span><span class="st">&quot;thankless pet&quot;</span><span class="op">);</span> }</span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="st">cats</span> { printf<span class="op">(</span><span class="st">&quot;anti-herd&quot;</span><span class="op">);</span> }</span></code></pre></div>
<p>Lex detects that the new rule masks <code>cot</code> and <code>cat</code>, and outputs a warning for each:</p>
<pre><code>catcot.l:10: warning, rule cannot be matched
catcot.l:11: warning, rule cannot be matched</code></pre>
<p>It still compiles though, and behaves like this:</p>
<pre><code>echo &quot;the cat on the cot joined the cats&quot; | ./catcot
the mumble mumble on the mumble mumble joined the anti-herd</code></pre>
<p>Notice that it still matched <code>cats</code>, because <code>cats</code> is longer than <code>c.t</code>.</p>
<p>Compare what happens if we move the loose regex to the end of our rules. It can then pick up whatever strings get past the others.</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="st">cot</span> { printf<span class="op">(</span><span class="st">&quot;portable bed&quot;</span><span class="op">);</span> }</span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="st">cat</span> { printf<span class="op">(</span><span class="st">&quot;thankless pet&quot;</span><span class="op">);</span> }</span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a><span class="st">cats</span> { printf<span class="op">(</span><span class="st">&quot;anti-herd&quot;</span><span class="op">);</span> }</span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a><span class="st">c.t</span> { printf<span class="op">(</span><span class="st">&quot;mumble mumble&quot;</span><span class="op">);</span> } </span></code></pre></div>
<p>It acts like this:</p>
<pre><code>echo &quot;cut the cot&quot; | ./catcot
mumble mumble the portable bed</code></pre>
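<p>The matching policy (longest match wins, ties go to the earliest rule, unmatched characters are echoed) can be mimicked in a few lines. This is a sketch of the policy only, using Python's <code>re</code> module rather than a generated state machine. The function name <code>scan</code> and the two rule lists are illustrative.</p>

```python
import re

def scan(rules, text):
    """Apply (pattern, replacement) rules with lex-style match selection."""
    compiled = [(re.compile(p), out) for p, out in rules]
    result, pos = [], 0
    while pos != len(text):
        best_len, best_out = 0, None
        for regex, out in compiled:
            m = regex.match(text, pos)
            # Only a strictly longer match wins, so earlier rules keep ties.
            if m and len(m.group()) > best_len:
                best_len, best_out = len(m.group()), out
        if best_out is None:
            result.append(text[pos])  # default rule: echo one character
            pos += 1
        else:
            result.append(best_out)
            pos += best_len
    return ''.join(result)

loose_first = [('c.t', 'mumble mumble'), ('cot', 'portable bed'),
               ('cat', 'thankless pet'), ('cats', 'anti-herd')]
loose_last = loose_first[1:] + loose_first[:1]

print(scan(loose_first, 'the cat on the cot joined the cats'))
print(scan(loose_last, 'cut the cot'))
```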
<p>Now’s a good time to take a detour and observe how our user-defined code acts in the generated C file. Lex creates a function called <code>yylex()</code>, and inserts the code blocks verbatim into a switch statement. When using lex with a parser, the parser will call <code>yylex()</code> to retrieve tokens, named by integers. For now, our user-defined code isn’t returning tokens to a parser, but doing simple print statements.</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* catcot.c (generated by lex) */</span></span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> yylex <span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* ... */</span></span>
<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a>	<span class="cf">switch</span> <span class="op">(</span> yy_act <span class="op">)</span></span>
<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* ... */</span></span>
<span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> <span class="dv">1</span><span class="op">:</span></span>
<span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a>		YY_RULE_SETUP</span>
<span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a>		<span class="pp">#line 9 &quot;catcot.l&quot;</span></span>
<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span> printf<span class="op">(</span><span class="st">&quot;portable bed&quot;</span><span class="op">);</span> <span class="op">}</span></span>
<span id="cb10-14"><a href="#cb10-14" aria-hidden="true" tabindex="-1"></a>			YY_BREAK</span>
<span id="cb10-15"><a href="#cb10-15" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> <span class="dv">2</span><span class="op">:</span></span>
<span id="cb10-16"><a href="#cb10-16" aria-hidden="true" tabindex="-1"></a>		YY_RULE_SETUP</span>
<span id="cb10-17"><a href="#cb10-17" aria-hidden="true" tabindex="-1"></a>		<span class="pp">#line 10 &quot;catcot.l&quot;</span></span>
<span id="cb10-18"><a href="#cb10-18" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span> printf<span class="op">(</span><span class="st">&quot;thankless pet&quot;</span><span class="op">);</span> <span class="op">}</span></span>
<span id="cb10-19"><a href="#cb10-19" aria-hidden="true" tabindex="-1"></a>			YY_BREAK</span>
<span id="cb10-20"><a href="#cb10-20" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> <span class="dv">3</span><span class="op">:</span></span>
<span id="cb10-21"><a href="#cb10-21" aria-hidden="true" tabindex="-1"></a>		YY_RULE_SETUP</span>
<span id="cb10-22"><a href="#cb10-22" aria-hidden="true" tabindex="-1"></a>		<span class="pp">#line 11 &quot;catcot.l&quot;</span></span>
<span id="cb10-23"><a href="#cb10-23" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span> printf<span class="op">(</span><span class="st">&quot;anti-herd&quot;</span><span class="op">);</span> <span class="op">}</span></span>
<span id="cb10-24"><a href="#cb10-24" aria-hidden="true" tabindex="-1"></a>			YY_BREAK</span>
<span id="cb10-25"><a href="#cb10-25" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-26"><a href="#cb10-26" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* ... */</span></span>
<span id="cb10-27"><a href="#cb10-27" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb10-28"><a href="#cb10-28" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* ... */</span></span>
<span id="cb10-29"><a href="#cb10-29" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>As mentioned, a lex file is composed of three sections:</p>
<pre><code>DEFINITIONS

%%

RULES

%%

HELPER FUNCTIONS</code></pre>
<p>The definitions section is where you can embed C code to include headers and declare functions used in rules. The definitions section can also define friendly names for regexes that can be reused in the rules.</p>
<p>The rules section, as we saw, contains a list of regexes and associated user code.</p>
<p>The final section is where to put the full definitions of helper functions. This is also where you’d put the <code>main()</code> function. If you omit <code>main()</code>, the Lex library provides one that simply calls <code>yylex()</code>. This default <code>main()</code> implementation (and implementations for a few other functions) is available by linking your lex-generated C code with the <code>-ll</code> linker flag.</p>
<p>Let’s see a short, fun example: converting Roman numerals to decimal. Thanks to lex’s behavior of matching longer strings first, it can read the single-letter numerals, but look ahead for longer subtractive forms like “IV” or “XC.”</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* roman-lex.l */</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="co">/* the %{ ... %} enclose C blocks that are copied</span></span>
<span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a><span class="co">   into the generated code */</span></span>
<span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a><span class="bn">%{</span></span>
<span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb12-9"><a href="#cb12-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-10"><a href="#cb12-10" aria-hidden="true" tabindex="-1"></a><span class="co">/* globals are visible to user actions and main() */</span></span>
<span id="cb12-11"><a href="#cb12-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-12"><a href="#cb12-12" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> total<span class="op">;</span></span>
<span id="cb12-13"><a href="#cb12-13" aria-hidden="true" tabindex="-1"></a><span class="bn">%}</span></span>
<span id="cb12-14"><a href="#cb12-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-15"><a href="#cb12-15" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb12-16"><a href="#cb12-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-17"><a href="#cb12-17" aria-hidden="true" tabindex="-1"></a> <span class="co">/*&lt;- notice the whitespace before this comment, which</span></span>
<span id="cb12-18"><a href="#cb12-18" aria-hidden="true" tabindex="-1"></a><span class="co">      is necessary for comments in the rules section */</span></span>
<span id="cb12-19"><a href="#cb12-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-20"><a href="#cb12-20" aria-hidden="true" tabindex="-1"></a> <span class="co">/* the basics */</span></span>
<span id="cb12-21"><a href="#cb12-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-22"><a href="#cb12-22" aria-hidden="true" tabindex="-1"></a><span class="st">I</span>  { total <span class="op">+=</span>    <span class="dv">1</span><span class="op">;</span> }</span>
<span id="cb12-23"><a href="#cb12-23" aria-hidden="true" tabindex="-1"></a><span class="st">V</span>  { total <span class="op">+=</span>    <span class="dv">5</span><span class="op">;</span> }</span>
<span id="cb12-24"><a href="#cb12-24" aria-hidden="true" tabindex="-1"></a><span class="st">X</span>  { total <span class="op">+=</span>   <span class="dv">10</span><span class="op">;</span> }</span>
<span id="cb12-25"><a href="#cb12-25" aria-hidden="true" tabindex="-1"></a><span class="st">L</span>  { total <span class="op">+=</span>   <span class="dv">50</span><span class="op">;</span> }</span>
<span id="cb12-26"><a href="#cb12-26" aria-hidden="true" tabindex="-1"></a><span class="st">C</span>  { total <span class="op">+=</span>  <span class="dv">100</span><span class="op">;</span> }</span>
<span id="cb12-27"><a href="#cb12-27" aria-hidden="true" tabindex="-1"></a><span class="st">D</span>  { total <span class="op">+=</span>  <span class="dv">500</span><span class="op">;</span> }</span>
<span id="cb12-28"><a href="#cb12-28" aria-hidden="true" tabindex="-1"></a><span class="st">M</span>  { total <span class="op">+=</span> <span class="dv">1000</span><span class="op">;</span> }</span>
<span id="cb12-29"><a href="#cb12-29" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-30"><a href="#cb12-30" aria-hidden="true" tabindex="-1"></a> <span class="co">/* special cases match with preference</span></span>
<span id="cb12-31"><a href="#cb12-31" aria-hidden="true" tabindex="-1"></a><span class="co">    because they are longer strings */</span></span>
<span id="cb12-32"><a href="#cb12-32" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-33"><a href="#cb12-33" aria-hidden="true" tabindex="-1"></a><span class="st">IV</span> { total <span class="op">+=</span>    <span class="dv">4</span><span class="op">;</span> }</span>
<span id="cb12-34"><a href="#cb12-34" aria-hidden="true" tabindex="-1"></a><span class="st">IX</span> { total <span class="op">+=</span>    <span class="dv">9</span><span class="op">;</span> }</span>
<span id="cb12-35"><a href="#cb12-35" aria-hidden="true" tabindex="-1"></a><span class="st">XL</span> { total <span class="op">+=</span>   <span class="dv">40</span><span class="op">;</span> }</span>
<span id="cb12-36"><a href="#cb12-36" aria-hidden="true" tabindex="-1"></a><span class="st">XC</span> { total <span class="op">+=</span>   <span class="dv">90</span><span class="op">;</span> }</span>
<span id="cb12-37"><a href="#cb12-37" aria-hidden="true" tabindex="-1"></a><span class="st">CD</span> { total <span class="op">+=</span>  <span class="dv">400</span><span class="op">;</span> }</span>
<span id="cb12-38"><a href="#cb12-38" aria-hidden="true" tabindex="-1"></a><span class="st">CM</span> { total <span class="op">+=</span>  <span class="dv">900</span><span class="op">;</span> }</span>
<span id="cb12-39"><a href="#cb12-39" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-40"><a href="#cb12-40" aria-hidden="true" tabindex="-1"></a> <span class="co">/* ignore final newline */</span></span>
<span id="cb12-41"><a href="#cb12-41" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-42"><a href="#cb12-42" aria-hidden="true" tabindex="-1"></a><span class="st">\n</span> <span class="op">;</span></span>
<span id="cb12-43"><a href="#cb12-43" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-44"><a href="#cb12-44" aria-hidden="true" tabindex="-1"></a> <span class="co">/* but die on anything else */</span></span>
<span id="cb12-45"><a href="#cb12-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-46"><a href="#cb12-46" aria-hidden="true" tabindex="-1"></a><span class="st">.</span>  {</span>
<span id="cb12-47"><a href="#cb12-47" aria-hidden="true" tabindex="-1"></a>	fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;unexpected: </span><span class="sc">%s\n</span><span class="st">&quot;</span><span class="op">,</span> yytext<span class="op">);</span></span>
<span id="cb12-48"><a href="#cb12-48" aria-hidden="true" tabindex="-1"></a>	exit<span class="op">(</span>EXIT_FAILURE<span class="op">);</span></span>
<span id="cb12-49"><a href="#cb12-49" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb12-50"><a href="#cb12-50" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-51"><a href="#cb12-51" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb12-52"><a href="#cb12-52" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-53"><a href="#cb12-53" aria-hidden="true" tabindex="-1"></a><span class="co">/* provide our own main() rather than the implementation</span></span>
<span id="cb12-54"><a href="#cb12-54" aria-hidden="true" tabindex="-1"></a><span class="co">   from lex&#39;s library linked with -ll */</span></span>
<span id="cb12-55"><a href="#cb12-55" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-56"><a href="#cb12-56" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb12-57"><a href="#cb12-57" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb12-58"><a href="#cb12-58" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* only have to call yylex() once, since our</span></span>
<span id="cb12-59"><a href="#cb12-59" aria-hidden="true" tabindex="-1"></a><span class="co">	   actions don&#39;t return */</span></span>
<span id="cb12-60"><a href="#cb12-60" aria-hidden="true" tabindex="-1"></a>	yylex<span class="op">();</span></span>
<span id="cb12-61"><a href="#cb12-61" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-62"><a href="#cb12-62" aria-hidden="true" tabindex="-1"></a>	fprintf<span class="op">(</span>yyout<span class="op">,</span> <span class="st">&quot;</span><span class="sc">%d\n</span><span class="st">&quot;</span><span class="op">,</span> total<span class="op">);</span></span>
<span id="cb12-63"><a href="#cb12-63" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb12-64"><a href="#cb12-64" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
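<p>The preference for longer matches can be made concrete with a hand-written sketch in plain C. This is not what lex generates (lex compiles the patterns into a state machine). The <code>numerals</code> table and the <code>roman_total()</code> function are hypothetical names used only for illustration, and listing the two-letter patterns first is an assumption that stands in for lex’s longest-match rule.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table mirroring the lex rules above. The two-letter
   patterns come first so that they are tried before the one-letter
   patterns, imitating lex's preference for the longest match. */
struct numeral { const char *text; int value; };

static const struct numeral numerals[] = {
	{"CM", 900}, {"CD", 400}, {"XC", 90}, {"XL", 40},
	{"IX", 9},   {"IV", 4},
	{"M", 1000}, {"D", 500},  {"C", 100}, {"L", 50},
	{"X", 10},   {"V", 5},    {"I", 1}
};

/* Sum the numerals in s. Returns the total, or -1 when a character
   matches no pattern (the lex version exits instead). */
int roman_total(const char *s)
{
	const size_t n = sizeof numerals / sizeof numerals[0];
	int total = 0;

	assert(s != NULL);

	while (*s != '\0')
	{
		size_t i;

		/* like the "\n ;" rule: newlines are skipped */
		if (*s == '\n')
		{
			s++;
			continue;
		}

		for (i = 0; i < n; i++)
		{
			size_t len = strlen(numerals[i].text);

			if (strncmp(s, numerals[i].text, len) == 0)
			{
				total += numerals[i].value;
				s += len;  /* consume the matched characters */
				break;
			}
		}

		if (i == n)
			return -1;  /* nothing matched at this position */
	}
	return total;
}
```

<p>With this sketch, <code>roman_total("MCMXCIV")</code> consumes M, CM, XC and IV in turn and returns 1994. Like the lex version, it only adds up tokens and does not check that the numeral is well formed, so <code>"IIII"</code> yields 4.</p>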
<h4 id="more-realistic-scanner">More realistic scanner</h4>
<p>Now that we’ve seen lex’s basic operation, let’s consider a more useful example: syntax highlighting. Detecting keywords in source code is a problem that lex can handle by itself, without help from yacc.</p>
<p>Because lex and yacc are so old (<a href="https://www.computerworld.com/article/2534771/yacc--unix--and-advice-from--bell-labs-alumni-stephen-johnson.html">predating C</a>) and used in so many projects, you can find grammars already written for most languages. For instance, we’ll take <a href="http://www.quut.com/c/ANSI-C-grammar-l.html">quut’s lex specification for C</a> and modify it to do syntax highlighting.</p>
<p>This relatively short program accurately handles the full lexical complexity of the language. It’s easiest to understand by reading it in full. See the inline comments for new and subtle details.</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* c.l syntax highlighter */</span></span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%{</span></span>
<span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a><span class="co">/* POSIX for isatty, fileno */</span></span>
<span id="cb13-5"><a href="#cb13-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#define _POSIX_C_SOURCE </span><span class="dv">200112</span><span class="bu">L</span></span>
<span id="cb13-6"><a href="#cb13-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-7"><a href="#cb13-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb13-8"><a href="#cb13-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb13-9"><a href="#cb13-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unistd.h&gt;</span></span>
<span id="cb13-10"><a href="#cb13-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-11"><a href="#cb13-11" aria-hidden="true" tabindex="-1"></a><span class="co">/* declarations are visible to user actions */</span></span>
<span id="cb13-12"><a href="#cb13-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-13"><a href="#cb13-13" aria-hidden="true" tabindex="-1"></a><span class="kw">enum</span> FG</span>
<span id="cb13-14"><a href="#cb13-14" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb13-15"><a href="#cb13-15" aria-hidden="true" tabindex="-1"></a>	fgRED      <span class="op">=</span> <span class="dv">31</span><span class="op">,</span>   fgGREEN    <span class="op">=</span> <span class="dv">32</span><span class="op">,</span></span>
<span id="cb13-16"><a href="#cb13-16" aria-hidden="true" tabindex="-1"></a>	fgORANGE   <span class="op">=</span> <span class="dv">33</span><span class="op">,</span>   fgCYAN     <span class="op">=</span> <span class="dv">36</span><span class="op">,</span></span>
<span id="cb13-17"><a href="#cb13-17" aria-hidden="true" tabindex="-1"></a>	fgDARKGREY <span class="op">=</span> <span class="dv">90</span><span class="op">,</span>   fgYELLOW   <span class="op">=</span> <span class="dv">93</span></span>
<span id="cb13-18"><a href="#cb13-18" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span>
<span id="cb13-19"><a href="#cb13-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-20"><a href="#cb13-20" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> set_color<span class="op">(</span><span class="kw">enum</span> FG<span class="op">);</span></span>
<span id="cb13-21"><a href="#cb13-21" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> reset_color<span class="op">(</span><span class="dt">void</span><span class="op">);</span></span>
<span id="cb13-22"><a href="#cb13-22" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> color_print<span class="op">(</span><span class="kw">enum</span> FG<span class="op">,</span> <span class="at">const</span> <span class="dt">char</span> <span class="op">*);</span></span>
<span id="cb13-23"><a href="#cb13-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-24"><a href="#cb13-24" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> consume_comment<span class="op">(</span><span class="dt">void</span><span class="op">);</span></span>
<span id="cb13-25"><a href="#cb13-25" aria-hidden="true" tabindex="-1"></a><span class="bn">%}</span></span>
<span id="cb13-26"><a href="#cb13-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-27"><a href="#cb13-27" aria-hidden="true" tabindex="-1"></a><span class="co">/* named regexes we can use in rules */</span></span>
<span id="cb13-28"><a href="#cb13-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-29"><a href="#cb13-29" aria-hidden="true" tabindex="-1"></a><span class="dt">O   </span><span class="st">[0-7]</span></span>
<span id="cb13-30"><a href="#cb13-30" aria-hidden="true" tabindex="-1"></a><span class="dt">D   </span><span class="st">[0-9]</span></span>
<span id="cb13-31"><a href="#cb13-31" aria-hidden="true" tabindex="-1"></a><span class="dt">NZ  </span><span class="st">[1-9]</span></span>
<span id="cb13-32"><a href="#cb13-32" aria-hidden="true" tabindex="-1"></a><span class="dt">L   </span><span class="st">[a-zA-Z_]</span></span>
<span id="cb13-33"><a href="#cb13-33" aria-hidden="true" tabindex="-1"></a><span class="dt">A   </span><span class="st">[a-zA-Z_0-9]</span></span>
<span id="cb13-34"><a href="#cb13-34" aria-hidden="true" tabindex="-1"></a><span class="dt">H   </span><span class="st">[a-fA-F0-9]</span></span>
<span id="cb13-35"><a href="#cb13-35" aria-hidden="true" tabindex="-1"></a><span class="dt">HP  </span><span class="st">(0[xX])</span></span>
<span id="cb13-36"><a href="#cb13-36" aria-hidden="true" tabindex="-1"></a><span class="dt">E   </span><span class="st">([Ee][+-]?{D}+)</span></span>
<span id="cb13-37"><a href="#cb13-37" aria-hidden="true" tabindex="-1"></a><span class="dt">P   </span><span class="st">([Pp][+-]?{D}+)</span></span>
<span id="cb13-38"><a href="#cb13-38" aria-hidden="true" tabindex="-1"></a><span class="dt">FS  </span><span class="st">(f|F|l|L)</span></span>
<span id="cb13-39"><a href="#cb13-39" aria-hidden="true" tabindex="-1"></a><span class="dt">IS  </span><span class="st">(((u|U)(l|L|ll|LL)?)|((l|L|ll|LL)(u|U)?))</span></span>
<span id="cb13-40"><a href="#cb13-40" aria-hidden="true" tabindex="-1"></a><span class="dt">CP  </span><span class="st">(u|U|L)</span></span>
<span id="cb13-41"><a href="#cb13-41" aria-hidden="true" tabindex="-1"></a><span class="dt">SP  </span><span class="st">(u8|u|U|L)</span></span>
<span id="cb13-42"><a href="#cb13-42" aria-hidden="true" tabindex="-1"></a><span class="dt">ES  </span><span class="st">(\\([&#39;&quot;\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))</span></span>
<span id="cb13-43"><a href="#cb13-43" aria-hidden="true" tabindex="-1"></a><span class="dt">WS  </span><span class="st">[ \t\v\n\f]</span></span>
<span id="cb13-44"><a href="#cb13-44" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-45"><a href="#cb13-45" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb13-46"><a href="#cb13-46" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-47"><a href="#cb13-47" aria-hidden="true" tabindex="-1"></a> <span class="co">/* attempting to match and capture an entire multi-line</span></span>
<span id="cb13-48"><a href="#cb13-48" aria-hidden="true" tabindex="-1"></a><span class="co">    comment could strain lex&#39;s buffers, so we match the</span></span>
<span id="cb13-49"><a href="#cb13-49" aria-hidden="true" tabindex="-1"></a><span class="co">    beginning, and call consume_comment() to deal with</span></span>
<span id="cb13-50"><a href="#cb13-50" aria-hidden="true" tabindex="-1"></a><span class="co">    the ensuing characters, in our own less resource-</span></span>
<span id="cb13-51"><a href="#cb13-51" aria-hidden="true" tabindex="-1"></a><span class="co">    intensive way */</span></span>
<span id="cb13-52"><a href="#cb13-52" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-53"><a href="#cb13-53" aria-hidden="true" tabindex="-1"></a><span class="st">&quot;/*&quot;</span>      {</span>
<span id="cb13-54"><a href="#cb13-54" aria-hidden="true" tabindex="-1"></a>	set_color<span class="op">(</span>fgDARKGREY<span class="op">);</span></span>
<span id="cb13-55"><a href="#cb13-55" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-56"><a href="#cb13-56" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* For greater flexibility, we&#39;ll output to lex&#39;s stream, yyout.</span></span>
<span id="cb13-57"><a href="#cb13-57" aria-hidden="true" tabindex="-1"></a><span class="co">	   It defaults to stdout. */</span></span>
<span id="cb13-58"><a href="#cb13-58" aria-hidden="true" tabindex="-1"></a>	fputs<span class="op">(</span>yytext<span class="op">,</span> yyout<span class="op">);</span></span>
<span id="cb13-59"><a href="#cb13-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-60"><a href="#cb13-60" aria-hidden="true" tabindex="-1"></a>	consume_comment<span class="op">();</span></span>
<span id="cb13-61"><a href="#cb13-61" aria-hidden="true" tabindex="-1"></a>	reset_color<span class="op">();</span></span>
<span id="cb13-62"><a href="#cb13-62" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb13-63"><a href="#cb13-63" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-64"><a href="#cb13-64" aria-hidden="true" tabindex="-1"></a> <span class="co">/* single-line comments can be handled the default way.</span></span>
<span id="cb13-65"><a href="#cb13-65" aria-hidden="true" tabindex="-1"></a><span class="co">    The yytext variable is provided by lex and points</span></span>
<span id="cb13-66"><a href="#cb13-66" aria-hidden="true" tabindex="-1"></a><span class="co">    to the characters that match the regex */</span></span>
<span id="cb13-67"><a href="#cb13-67" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-68"><a href="#cb13-68" aria-hidden="true" tabindex="-1"></a><span class="st">&quot;//&quot;.*</span>    {</span>
<span id="cb13-69"><a href="#cb13-69" aria-hidden="true" tabindex="-1"></a>	color_print<span class="op">(</span>fgDARKGREY<span class="op">,</span> yytext<span class="op">);</span></span>
<span id="cb13-70"><a href="#cb13-70" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb13-71"><a href="#cb13-71" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-72"><a href="#cb13-72" aria-hidden="true" tabindex="-1"></a><span class="st">^[ \t]*#.*</span>      {</span>
<span id="cb13-73"><a href="#cb13-73" aria-hidden="true" tabindex="-1"></a>	color_print<span class="op">(</span>fgRED<span class="op">,</span> yytext<span class="op">);</span></span>
<span id="cb13-74"><a href="#cb13-74" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb13-75"><a href="#cb13-75" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-76"><a href="#cb13-76" aria-hidden="true" tabindex="-1"></a> <span class="co">/* you can use the same code block for multiple regexes */</span></span>
<span id="cb13-77"><a href="#cb13-77" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-78"><a href="#cb13-78" aria-hidden="true" tabindex="-1"></a><span class="st">auto</span>     <span class="kw">|</span></span>
<span id="cb13-79"><a href="#cb13-79" aria-hidden="true" tabindex="-1"></a><span class="st">bool</span>     <span class="kw">|</span></span>
<span id="cb13-80"><a href="#cb13-80" aria-hidden="true" tabindex="-1"></a><span class="st">char</span>     <span class="kw">|</span></span>
<span id="cb13-81"><a href="#cb13-81" aria-hidden="true" tabindex="-1"></a><span class="st">const</span>    <span class="kw">|</span></span>
<span id="cb13-82"><a href="#cb13-82" aria-hidden="true" tabindex="-1"></a><span class="st">double</span>   <span class="kw">|</span></span>
<span id="cb13-83"><a href="#cb13-83" aria-hidden="true" tabindex="-1"></a><span class="st">enum</span>     <span class="kw">|</span></span>
<span id="cb13-84"><a href="#cb13-84" aria-hidden="true" tabindex="-1"></a><span class="st">extern</span>   <span class="kw">|</span></span>
<span id="cb13-85"><a href="#cb13-85" aria-hidden="true" tabindex="-1"></a><span class="st">float</span>    <span class="kw">|</span></span>
<span id="cb13-86"><a href="#cb13-86" aria-hidden="true" tabindex="-1"></a><span class="st">inline</span>   <span class="kw">|</span></span>
<span id="cb13-87"><a href="#cb13-87" aria-hidden="true" tabindex="-1"></a><span class="st">int</span>      <span class="kw">|</span></span>
<span id="cb13-88"><a href="#cb13-88" aria-hidden="true" tabindex="-1"></a><span class="st">long</span>     <span class="kw">|</span></span>
<span id="cb13-89"><a href="#cb13-89" aria-hidden="true" tabindex="-1"></a><span class="st">register</span> <span class="kw">|</span></span>
<span id="cb13-90"><a href="#cb13-90" aria-hidden="true" tabindex="-1"></a><span class="st">restrict</span> <span class="kw">|</span></span>
<span id="cb13-91"><a href="#cb13-91" aria-hidden="true" tabindex="-1"></a><span class="st">short</span>    <span class="kw">|</span></span>
<span id="cb13-92"><a href="#cb13-92" aria-hidden="true" tabindex="-1"></a><span class="st">size_t</span>   <span class="kw">|</span></span>
<span id="cb13-93"><a href="#cb13-93" aria-hidden="true" tabindex="-1"></a><span class="st">signed</span>   <span class="kw">|</span></span>
<span id="cb13-94"><a href="#cb13-94" aria-hidden="true" tabindex="-1"></a><span class="st">static</span>   <span class="kw">|</span></span>
<span id="cb13-95"><a href="#cb13-95" aria-hidden="true" tabindex="-1"></a><span class="st">struct</span>   <span class="kw">|</span></span>
<span id="cb13-96"><a href="#cb13-96" aria-hidden="true" tabindex="-1"></a><span class="st">typedef</span>  <span class="kw">|</span></span>
<span id="cb13-97"><a href="#cb13-97" aria-hidden="true" tabindex="-1"></a><span class="st">union</span>    <span class="kw">|</span></span>
<span id="cb13-98"><a href="#cb13-98" aria-hidden="true" tabindex="-1"></a><span class="st">unsigned</span> <span class="kw">|</span></span>
<span id="cb13-99"><a href="#cb13-99" aria-hidden="true" tabindex="-1"></a><span class="st">void</span>     <span class="kw">|</span></span>
<span id="cb13-100"><a href="#cb13-100" aria-hidden="true" tabindex="-1"></a><span class="st">volatile</span> <span class="kw">|</span></span>
<span id="cb13-101"><a href="#cb13-101" aria-hidden="true" tabindex="-1"></a><span class="st">_Bool</span>    <span class="kw">|</span></span>
<span id="cb13-102"><a href="#cb13-102" aria-hidden="true" tabindex="-1"></a><span class="st">_Complex</span> {</span>
<span id="cb13-103"><a href="#cb13-103" aria-hidden="true" tabindex="-1"></a>	color_print<span class="op">(</span>fgGREEN<span class="op">,</span> yytext<span class="op">);</span></span>
<span id="cb13-104"><a href="#cb13-104" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb13-105"><a href="#cb13-105" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-106"><a href="#cb13-106" aria-hidden="true" tabindex="-1"></a><span class="st">break</span>    <span class="kw">|</span></span>
<span id="cb13-107"><a href="#cb13-107" aria-hidden="true" tabindex="-1"></a><span class="st">case</span>     <span class="kw">|</span></span>
<span id="cb13-108"><a href="#cb13-108" aria-hidden="true" tabindex="-1"></a><span class="st">continue</span> <span class="kw">|</span></span>
<span id="cb13-109"><a href="#cb13-109" aria-hidden="true" tabindex="-1"></a><span class="st">default</span>  <span class="kw">|</span></span>
<span id="cb13-110"><a href="#cb13-110" aria-hidden="true" tabindex="-1"></a><span class="st">do</span>       <span class="kw">|</span></span>
<span id="cb13-111"><a href="#cb13-111" aria-hidden="true" tabindex="-1"></a><span class="st">else</span>     <span class="kw">|</span></span>
<span id="cb13-112"><a href="#cb13-112" aria-hidden="true" tabindex="-1"></a><span class="st">for</span>      <span class="kw">|</span></span>
<span id="cb13-113"><a href="#cb13-113" aria-hidden="true" tabindex="-1"></a><span class="st">goto</span>     <span class="kw">|</span></span>
<span id="cb13-114"><a href="#cb13-114" aria-hidden="true" tabindex="-1"></a><span class="st">if</span>       <span class="kw">|</span></span>
<span id="cb13-115"><a href="#cb13-115" aria-hidden="true" tabindex="-1"></a><span class="st">return</span>   <span class="kw">|</span></span>
<span id="cb13-116"><a href="#cb13-116" aria-hidden="true" tabindex="-1"></a><span class="st">sizeof</span>   <span class="kw">|</span></span>
<span id="cb13-117"><a href="#cb13-117" aria-hidden="true" tabindex="-1"></a><span class="st">switch</span>   <span class="kw">|</span></span>
<span id="cb13-118"><a href="#cb13-118" aria-hidden="true" tabindex="-1"></a><span class="st">while</span>    {</span>
<span id="cb13-119"><a href="#cb13-119" aria-hidden="true" tabindex="-1"></a>	color_print<span class="op">(</span>fgYELLOW<span class="op">,</span> yytext<span class="op">);</span></span>
<span id="cb13-120"><a href="#cb13-120" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb13-121"><a href="#cb13-121" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-122"><a href="#cb13-122" aria-hidden="true" tabindex="-1"></a> <span class="co">/* we use the named regexes heavily below; putting</span></span>
<span id="cb13-123"><a href="#cb13-123" aria-hidden="true" tabindex="-1"></a><span class="co">    them in curly brackets expands them */</span></span>
<span id="cb13-124"><a href="#cb13-124" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-125"><a href="#cb13-125" aria-hidden="true" tabindex="-1"></a><span class="st">{L}{A}*</span>  {</span>
<span id="cb13-126"><a href="#cb13-126" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-127"><a href="#cb13-127" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* without this rule, keywords within larger words</span></span>
<span id="cb13-128"><a href="#cb13-128" aria-hidden="true" tabindex="-1"></a><span class="co">	   would be highlighted, like the &quot;if&quot; in &quot;life&quot; --</span></span>
<span id="cb13-129"><a href="#cb13-129" aria-hidden="true" tabindex="-1"></a><span class="co">	   this rule prevents that because it&#39;s a longer match */</span></span>
<span id="cb13-130"><a href="#cb13-130" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-131"><a href="#cb13-131" aria-hidden="true" tabindex="-1"></a>	fputs<span class="op">(</span>yytext<span class="op">,</span> yyout<span class="op">);</span></span>
<span id="cb13-132"><a href="#cb13-132" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb13-133"><a href="#cb13-133" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-134"><a href="#cb13-134" aria-hidden="true" tabindex="-1"></a><span class="st">{HP}{H}+{IS}?</span>               <span class="kw">|</span></span>
<span id="cb13-135"><a href="#cb13-135" aria-hidden="true" tabindex="-1"></a><span class="st">{NZ}{D}*{IS}?</span>               <span class="kw">|</span></span>
<span id="cb13-136"><a href="#cb13-136" aria-hidden="true" tabindex="-1"></a><span class="st">&quot;0&quot;{O}*{IS}?</span>                <span class="kw">|</span></span>
<span id="cb13-137"><a href="#cb13-137" aria-hidden="true" tabindex="-1"></a><span class="st">{CP}?&quot;&#39;&quot;([^&#39;\\\n]|{ES})+&quot;&#39;&quot;</span> <span class="kw">|</span></span>
<span id="cb13-138"><a href="#cb13-138" aria-hidden="true" tabindex="-1"></a><span class="st">{D}+{E}{FS}?</span>                <span class="kw">|</span></span>
<span id="cb13-139"><a href="#cb13-139" aria-hidden="true" tabindex="-1"></a><span class="st">{D}*&quot;.&quot;{D}+{E}?{FS}?</span>        <span class="kw">|</span></span>
<span id="cb13-140"><a href="#cb13-140" aria-hidden="true" tabindex="-1"></a><span class="st">{D}+&quot;.&quot;{E}?{FS}?</span>            <span class="kw">|</span></span>
<span id="cb13-141"><a href="#cb13-141" aria-hidden="true" tabindex="-1"></a><span class="st">{HP}{H}+{P}{FS}?</span>            <span class="kw">|</span></span>
<span id="cb13-142"><a href="#cb13-142" aria-hidden="true" tabindex="-1"></a><span class="st">{HP}{H}*&quot;.&quot;{H}+{P}{FS}?</span>     <span class="kw">|</span></span>
<span id="cb13-143"><a href="#cb13-143" aria-hidden="true" tabindex="-1"></a><span class="st">{HP}{H}+&quot;.&quot;{P}{FS}?</span>         {</span>
<span id="cb13-144"><a href="#cb13-144" aria-hidden="true" tabindex="-1"></a>	color_print<span class="op">(</span>fgCYAN<span class="op">,</span> yytext<span class="op">);</span></span>
<span id="cb13-145"><a href="#cb13-145" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb13-146"><a href="#cb13-146" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-147"><a href="#cb13-147" aria-hidden="true" tabindex="-1"></a><span class="st">({SP}?\&quot;([^&quot;\\\n]|{ES})*\&quot;{WS}*)+</span> {</span>
<span id="cb13-148"><a href="#cb13-148" aria-hidden="true" tabindex="-1"></a>	color_print<span class="op">(</span>fgORANGE<span class="op">,</span> yytext<span class="op">);</span></span>
<span id="cb13-149"><a href="#cb13-149" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb13-150"><a href="#cb13-150" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-151"><a href="#cb13-151" aria-hidden="true" tabindex="-1"></a> <span class="co">/* explicitly mention the default rule */</span></span>
<span id="cb13-152"><a href="#cb13-152" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-153"><a href="#cb13-153" aria-hidden="true" tabindex="-1"></a><span class="st">.</span> ECHO<span class="op">;</span></span>
<span id="cb13-154"><a href="#cb13-154" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-155"><a href="#cb13-155" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb13-156"><a href="#cb13-156" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-157"><a href="#cb13-157" aria-hidden="true" tabindex="-1"></a><span class="co">/* definitions of the functions we declared earlier */</span></span>
<span id="cb13-158"><a href="#cb13-158" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-159"><a href="#cb13-159" aria-hidden="true" tabindex="-1"></a><span class="co">/* the color functions use ANSI escape codes, and may</span></span>
<span id="cb13-160"><a href="#cb13-160" aria-hidden="true" tabindex="-1"></a><span class="co">   not be portable across all terminal emulators. */</span></span>
<span id="cb13-161"><a href="#cb13-161" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-162"><a href="#cb13-162" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> set_color<span class="op">(</span><span class="kw">enum</span> FG c<span class="op">)</span></span>
<span id="cb13-163"><a href="#cb13-163" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb13-164"><a href="#cb13-164" aria-hidden="true" tabindex="-1"></a>	fprintf<span class="op">(</span>yyout<span class="op">,</span> <span class="st">&quot;</span><span class="sc">\033</span><span class="st">[</span><span class="sc">%d</span><span class="st">;1m&quot;</span><span class="op">,</span> c<span class="op">);</span></span>
<span id="cb13-165"><a href="#cb13-165" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb13-166"><a href="#cb13-166" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-167"><a href="#cb13-167" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> reset_color<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb13-168"><a href="#cb13-168" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb13-169"><a href="#cb13-169" aria-hidden="true" tabindex="-1"></a>	fputs<span class="op">(</span><span class="st">&quot;</span><span class="sc">\033</span><span class="st">[0m&quot;</span><span class="op">,</span> yyout<span class="op">);</span></span>
<span id="cb13-170"><a href="#cb13-170" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb13-171"><a href="#cb13-171" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-172"><a href="#cb13-172" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> color_print<span class="op">(</span><span class="kw">enum</span> FG c<span class="op">,</span> <span class="at">const</span> <span class="dt">char</span> <span class="op">*</span>s<span class="op">)</span></span>
<span id="cb13-173"><a href="#cb13-173" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb13-174"><a href="#cb13-174" aria-hidden="true" tabindex="-1"></a>	set_color<span class="op">(</span>c<span class="op">);</span></span>
<span id="cb13-175"><a href="#cb13-175" aria-hidden="true" tabindex="-1"></a>	fputs<span class="op">(</span>s<span class="op">,</span> yyout<span class="op">);</span></span>
<span id="cb13-176"><a href="#cb13-176" aria-hidden="true" tabindex="-1"></a>	reset_color<span class="op">();</span></span>
<span id="cb13-177"><a href="#cb13-177" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb13-178"><a href="#cb13-178" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-179"><a href="#cb13-179" aria-hidden="true" tabindex="-1"></a><span class="co">/* this function directly consumes characters in lex</span></span>
<span id="cb13-180"><a href="#cb13-180" aria-hidden="true" tabindex="-1"></a><span class="co">   using the input() function. It pulls characters</span></span>
<span id="cb13-181"><a href="#cb13-181" aria-hidden="true" tabindex="-1"></a><span class="co">   from the same stream that the regex state machine</span></span>
<span id="cb13-182"><a href="#cb13-182" aria-hidden="true" tabindex="-1"></a><span class="co">   reads. */</span></span>
<span id="cb13-183"><a href="#cb13-183" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> consume_comment<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb13-184"><a href="#cb13-184" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb13-185"><a href="#cb13-185" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> c<span class="op">;</span></span>
<span id="cb13-186"><a href="#cb13-186" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-187"><a href="#cb13-187" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* EOF in lex is 0, which is different from</span></span>
<span id="cb13-188"><a href="#cb13-188" aria-hidden="true" tabindex="-1"></a><span class="co">	   the EOF macro in the C standard library */</span></span>
<span id="cb13-189"><a href="#cb13-189" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">((</span>c <span class="op">=</span> input<span class="op">())</span> <span class="op">!=</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb13-190"><a href="#cb13-190" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb13-191"><a href="#cb13-191" aria-hidden="true" tabindex="-1"></a>		putchar<span class="op">(</span>c<span class="op">);</span></span>
<span id="cb13-192"><a href="#cb13-192" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>c <span class="op">==</span> <span class="ch">&#39;*&#39;</span><span class="op">)</span></span>
<span id="cb13-193"><a href="#cb13-193" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb13-194"><a href="#cb13-194" aria-hidden="true" tabindex="-1"></a>			<span class="cf">while</span> <span class="op">((</span>c <span class="op">=</span> input<span class="op">())</span> <span class="op">==</span> <span class="ch">&#39;*&#39;</span><span class="op">)</span></span>
<span id="cb13-195"><a href="#cb13-195" aria-hidden="true" tabindex="-1"></a>				putchar<span class="op">(</span>c<span class="op">);</span></span>
<span id="cb13-196"><a href="#cb13-196" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span>c <span class="op">==</span> <span class="dv">0</span><span class="op">)</span> <span class="cf">break</span><span class="op">;</span></span>
<span id="cb13-197"><a href="#cb13-197" aria-hidden="true" tabindex="-1"></a>			putchar<span class="op">(</span>c<span class="op">);</span></span>
<span id="cb13-198"><a href="#cb13-198" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span>c <span class="op">==</span> <span class="ch">&#39;/&#39;</span><span class="op">)</span> <span class="cf">return</span><span class="op">;</span></span>
<span id="cb13-199"><a href="#cb13-199" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb13-200"><a href="#cb13-200" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb13-201"><a href="#cb13-201" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb13-202"><a href="#cb13-202" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-203"><a href="#cb13-203" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb13-204"><a href="#cb13-204" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb13-205"><a href="#cb13-205" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>isatty<span class="op">(</span>fileno<span class="op">(</span>stdout<span class="op">)))</span></span>
<span id="cb13-206"><a href="#cb13-206" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb13-207"><a href="#cb13-207" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* a more flexible option would be to make the</span></span>
<span id="cb13-208"><a href="#cb13-208" aria-hidden="true" tabindex="-1"></a><span class="co">		   color changing functions do nothing, but that&#39;s</span></span>
<span id="cb13-209"><a href="#cb13-209" aria-hidden="true" tabindex="-1"></a><span class="co">		   too much fuss for an example program */</span></span>
<span id="cb13-210"><a href="#cb13-210" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-211"><a href="#cb13-211" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Stdout is not a terminal</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb13-212"><a href="#cb13-212" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb13-213"><a href="#cb13-213" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb13-214"><a href="#cb13-214" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* since we&#39;ll be changing terminal color, be sure to</span></span>
<span id="cb13-215"><a href="#cb13-215" aria-hidden="true" tabindex="-1"></a><span class="co">	   reset it for any program termination event */</span></span>
<span id="cb13-216"><a href="#cb13-216" aria-hidden="true" tabindex="-1"></a>	atexit<span class="op">(</span>reset_color<span class="op">);</span></span>
<span id="cb13-217"><a href="#cb13-217" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb13-218"><a href="#cb13-218" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* let our lex rules do the rest */</span></span>
<span id="cb13-219"><a href="#cb13-219" aria-hidden="true" tabindex="-1"></a>	yylex<span class="op">();</span></span>
<span id="cb13-220"><a href="#cb13-220" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb13-221"><a href="#cb13-221" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
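<p>To experiment with the comment-skipping loop outside of lex, here is a sketch of the same logic reading from a string instead of calling <code>input()</code>. The function name <code>skip_comment</code> is made up for illustration, and the string’s terminating NUL stands in for lex’s zero end-of-input value.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the lex program: given a pointer just
   past an opening slash-star, return a pointer just past the closing
   star-slash, or NULL if the string ends first. */
const char *skip_comment(const char *s)
{
	assert(s != NULL);

	while (*s != '\0')
	{
		if (*s++ == '*')
		{
			/* a run of stars may precede the closing slash */
			while (*s == '*')
				s++;
			if (*s == '\0')
				break;
			if (*s++ == '/')
				return s;
		}
	}
	return NULL; /* unterminated comment */
}
```

<p>As in the lex version, a star followed by anything other than a slash sends control back to the outer loop.</p>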
<h4 id="using-a-scanner-as-a-library">Using a scanner as a library</h4>
<p>One of the biggest areas of improvement between classic lex/yacc and flex/bison is the ability of the latter to generate code that’s easier to embed into a larger application. Lex and yacc are designed to create standalone programs, with user-defined code blocks stuck inside. When classic lex and yacc work together, they use a bunch of global variables.</p>
<p>Flex and Bison, on the other hand, can generate thread-safe functions with uniquely prefixed names that can be safely linked into larger programs. To demonstrate, we’ll do another scanner (with Flex this time).</p>
<p>The following Rube Goldberg contraption uses Flex to split words on whitespace and call a user-supplied callback for each word. There’s certainly an easier non-Flex way to do this task, but this example illustrates how to encapsulate Flex code into a reusable library.</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* words.l */</span></span>
<span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a><span class="co">/* don&#39;t generate functions we don&#39;t need */</span></span>
<span id="cb14-4"><a href="#cb14-4" aria-hidden="true" tabindex="-1"></a><span class="kw">%option nounput noinput noyywrap</span></span>
<span id="cb14-5"><a href="#cb14-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-6"><a href="#cb14-6" aria-hidden="true" tabindex="-1"></a><span class="co">/* generate a scanner that&#39;s thread safe */</span></span>
<span id="cb14-7"><a href="#cb14-7" aria-hidden="true" tabindex="-1"></a><span class="kw">%option reentrant</span></span>
<span id="cb14-8"><a href="#cb14-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-9"><a href="#cb14-9" aria-hidden="true" tabindex="-1"></a><span class="co">/* Generate &quot;words&quot; rather than &quot;yy&quot; as a prefix, e.g.</span></span>
<span id="cb14-10"><a href="#cb14-10" aria-hidden="true" tabindex="-1"></a><span class="co">   wordslex() rather than yylex(). This allows multiple</span></span>
<span id="cb14-11"><a href="#cb14-11" aria-hidden="true" tabindex="-1"></a><span class="co">   Flex scanners to be linked with the same application */</span></span>
<span id="cb14-12"><a href="#cb14-12" aria-hidden="true" tabindex="-1"></a><span class="kw">%option prefix=&quot;words&quot;</span></span>
<span id="cb14-13"><a href="#cb14-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-14"><a href="#cb14-14" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb14-15"><a href="#cb14-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-16"><a href="#cb14-16" aria-hidden="true" tabindex="-1"></a><span class="st">[^ \t\n]+</span> {</span>
<span id="cb14-17"><a href="#cb14-17" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* the return statement causes yylex to stop and return */</span></span>
<span id="cb14-18"><a href="#cb14-18" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">1</span><span class="op">;</span> <span class="co">/* our code for a word token */</span></span>
<span id="cb14-19"><a href="#cb14-19" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb14-20"><a href="#cb14-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-21"><a href="#cb14-21" aria-hidden="true" tabindex="-1"></a>  <span class="co">/* do nothing for any other characters, don&#39;t</span></span>
<span id="cb14-22"><a href="#cb14-22" aria-hidden="true" tabindex="-1"></a><span class="co">     output them as would be the default behavior */</span></span>
<span id="cb14-23"><a href="#cb14-23" aria-hidden="true" tabindex="-1"></a><span class="st">.|\n</span>	  <span class="op">;</span></span>
<span id="cb14-24"><a href="#cb14-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-25"><a href="#cb14-25" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb14-26"><a href="#cb14-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-27"><a href="#cb14-27" aria-hidden="true" tabindex="-1"></a><span class="co">/* Callers interact with this function, which neatly hides</span></span>
<span id="cb14-28"><a href="#cb14-28" aria-hidden="true" tabindex="-1"></a><span class="co">   the Flex inside.</span></span>
<span id="cb14-29"><a href="#cb14-29" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-30"><a href="#cb14-30" aria-hidden="true" tabindex="-1"></a><span class="co">   Also, we&#39;ll call &quot;yy&quot; functions like &quot;yylex()&quot; inside,</span></span>
<span id="cb14-31"><a href="#cb14-31" aria-hidden="true" tabindex="-1"></a><span class="co">   and Flex will rename them in the resulting C file to</span></span>
<span id="cb14-32"><a href="#cb14-32" aria-hidden="true" tabindex="-1"></a><span class="co">   calls with the &quot;words&quot; prefix, like &quot;wordslex()&quot;</span></span>
<span id="cb14-33"><a href="#cb14-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-34"><a href="#cb14-34" aria-hidden="true" tabindex="-1"></a><span class="co">   Zero return means success, nonzero is a Flex error</span></span>
<span id="cb14-35"><a href="#cb14-35" aria-hidden="true" tabindex="-1"></a><span class="co">   code. */</span></span>
<span id="cb14-36"><a href="#cb14-36" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-37"><a href="#cb14-37" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> words_callback<span class="op">(</span><span class="dt">char</span> <span class="op">*</span>s<span class="op">,</span> <span class="dt">void</span> <span class="op">(*</span>f<span class="op">)(</span><span class="at">const</span> <span class="dt">char</span> <span class="op">*))</span></span>
<span id="cb14-38"><a href="#cb14-38" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb14-39"><a href="#cb14-39" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* in the reentrant mode, we maintain our</span></span>
<span id="cb14-40"><a href="#cb14-40" aria-hidden="true" tabindex="-1"></a><span class="co">	   own scanner and its associated state */</span></span>
<span id="cb14-41"><a href="#cb14-41" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> i<span class="op">;</span></span>
<span id="cb14-42"><a href="#cb14-42" aria-hidden="true" tabindex="-1"></a>	<span class="dt">yyscan_t</span> scanner<span class="op">;</span></span>
<span id="cb14-43"><a href="#cb14-43" aria-hidden="true" tabindex="-1"></a>	YY_BUFFER_STATE buf<span class="op">;</span></span>
<span id="cb14-44"><a href="#cb14-44" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-45"><a href="#cb14-45" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">((</span>i <span class="op">=</span> yylex_init<span class="op">(&amp;</span>scanner<span class="op">))</span> <span class="op">!=</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb14-46"><a href="#cb14-46" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> i<span class="op">;</span></span>
<span id="cb14-47"><a href="#cb14-47" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-48"><a href="#cb14-48" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* read from a string rather than a stream */</span></span>
<span id="cb14-49"><a href="#cb14-49" aria-hidden="true" tabindex="-1"></a>	buf <span class="op">=</span> yy_scan_string<span class="op">(</span>s<span class="op">,</span> scanner<span class="op">);</span></span>
<span id="cb14-50"><a href="#cb14-50" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-51"><a href="#cb14-51" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* Each time yylex finds a word, it returns nonzero.</span></span>
<span id="cb14-52"><a href="#cb14-52" aria-hidden="true" tabindex="-1"></a><span class="co">	   It resumes where it left off when we call it again */</span></span>
<span id="cb14-53"><a href="#cb14-53" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">((</span>i <span class="op">=</span> yylex<span class="op">(</span>scanner<span class="op">))</span> <span class="op">&gt;</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb14-54"><a href="#cb14-54" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb14-55"><a href="#cb14-55" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* call the user supplied function f with</span></span>
<span id="cb14-56"><a href="#cb14-56" aria-hidden="true" tabindex="-1"></a><span class="co">		   yytext of the match */</span></span>
<span id="cb14-57"><a href="#cb14-57" aria-hidden="true" tabindex="-1"></a>		f<span class="op">(</span>yyget_text<span class="op">(</span>scanner<span class="op">));</span></span>
<span id="cb14-58"><a href="#cb14-58" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb14-59"><a href="#cb14-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-60"><a href="#cb14-60" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* clean up */</span></span>
<span id="cb14-61"><a href="#cb14-61" aria-hidden="true" tabindex="-1"></a>	yy_delete_buffer<span class="op">(</span>buf<span class="op">,</span> scanner<span class="op">);</span></span>
<span id="cb14-62"><a href="#cb14-62" aria-hidden="true" tabindex="-1"></a>	yylex_destroy<span class="op">(</span>scanner<span class="op">);</span></span>
<span id="cb14-63"><a href="#cb14-63" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb14-64"><a href="#cb14-64" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Build it like this:</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="co"># generate scanner, build object file</span></span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a><span class="fu">flex</span> <span class="at">-t</span> words.l <span class="op">&gt;</span> words.c</span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a><span class="fu">cc</span> <span class="at">-c</span> words.c</span>
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a><span class="co"># verify that all public text symbols are prefixed by &quot;words&quot;</span></span>
<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a><span class="fu">nm</span> <span class="at">-g</span> words.o <span class="kw">|</span> <span class="fu">grep</span> <span class="st">&quot; T &quot;</span></span></code></pre></div>
<div class="alert alert-info" role="alert">
<h4>
Fixing compiler warnings
</h4>
<p>
If you compile with more warnings enabled, the compiler will complain about “unused parameter yyscanner” in several functions. Flex’s reentrant mode adds this parameter to the functions, and the default implementation doesn’t use it.
</p>
<p>
To fix the warnings, we can provide our own definitions. First, disable some of Flex’s auto-generated functions. Add these options to your lex input file:
</p>
<pre class="sourceCode flex"><code class="sourceCode flex">
%option noyyalloc noyyfree noyyrealloc
</code></pre>
<p>
Provide the implementations yourself down by words_callback, and define the YY_EXIT_FAILURE macro in a code block up by the %options.
</p>
<pre class="sourceCode c"><code class="sourceCode c">
/* add in a code block by the %options */
#define YY_EXIT_FAILURE ((void)yyscanner, EXIT_FAILURE)

/* add definitions down by words_callback */

void *wordsalloc(size_t size, void *yyscanner)
{
    (void) yyscanner;
    return malloc(size);
}

void *wordsrealloc(void * ptr, size_t size, void *yyscanner)
{
    (void) yyscanner;
    return realloc(ptr, size);
}

void wordsfree(void *ptr, void *yyscanner)
{
    (void) yyscanner;
    free(ptr);
}
</code></pre>
</div>
<p>A calling program can use our library without seeing any Flex internals.</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* test_words.c */</span></span>
<span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb16-4"><a href="#cb16-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-5"><a href="#cb16-5" aria-hidden="true" tabindex="-1"></a><span class="co">/* words_callback defined in the object file -- you could put</span></span>
<span id="cb16-6"><a href="#cb16-6" aria-hidden="true" tabindex="-1"></a><span class="co">   this declaration in a header file words.h */</span></span>
<span id="cb16-7"><a href="#cb16-7" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> words_callback<span class="op">(</span><span class="dt">char</span> <span class="op">*,</span> <span class="dt">void</span> <span class="op">(*)(</span><span class="dt">const</span> <span class="dt">char</span> <span class="op">*));</span></span>
<span id="cb16-8"><a href="#cb16-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-9"><a href="#cb16-9" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> print_word<span class="op">(</span><span class="dt">const</span> <span class="dt">char</span> <span class="op">*</span>w<span class="op">)</span></span>
<span id="cb16-10"><a href="#cb16-10" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb16-11"><a href="#cb16-11" aria-hidden="true" tabindex="-1"></a>	puts<span class="op">(</span>w<span class="op">);</span></span>
<span id="cb16-12"><a href="#cb16-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-13"><a href="#cb16-13" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* if you want to use the parameter w in the future, you</span></span>
<span id="cb16-14"><a href="#cb16-14" aria-hidden="true" tabindex="-1"></a><span class="co">	   need to duplicate it in memory whose lifetime you control */</span></span>
<span id="cb16-15"><a href="#cb16-15" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb16-16"><a href="#cb16-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-17"><a href="#cb16-17" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb16-18"><a href="#cb16-18" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb16-19"><a href="#cb16-19" aria-hidden="true" tabindex="-1"></a>	words_callback<span class="op">(</span></span>
<span id="cb16-20"><a href="#cb16-20" aria-hidden="true" tabindex="-1"></a>		<span class="st">&quot;The quick brown fox</span><span class="sc">\n</span><span class="st">&quot;</span></span>
<span id="cb16-21"><a href="#cb16-21" aria-hidden="true" tabindex="-1"></a>		<span class="st">&quot;jumped over the lazy dog</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb16-22"><a href="#cb16-22" aria-hidden="true" tabindex="-1"></a>		<span class="op">&amp;</span>print_word</span>
<span id="cb16-23"><a href="#cb16-23" aria-hidden="true" tabindex="-1"></a>	<span class="op">);</span></span>
<span id="cb16-24"><a href="#cb16-24" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb16-25"><a href="#cb16-25" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>To build the program, just link it with <code>words.o</code>.</p>
<div class="sourceCode" id="cb17"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="fu">cc</span> <span class="at">-o</span> test_words test_words.c words.o</span></code></pre></div>
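<p>For comparison, here is a sketch of the easier non-Flex way mentioned earlier, along with a callback that keeps the words it receives. Everything in it is an assumption made for illustration: <code>words_callback_plain</code>, <code>save_word</code>, and the fixed-size <code>saved</code> array are hypothetical names, and the splitter treats space, tab, and newline as separators to match the Flex pattern.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical non-Flex counterpart of words_callback: call f for each
   run of characters that are not space, tab, or newline.
   Returns zero on success, nonzero if memory runs out. */
int words_callback_plain(const char *s, void (*f)(const char *))
{
	assert(s != NULL && f != NULL);

	while (*s != '\0')
	{
		size_t len;
		char *word;

		s += strspn(s, " \t\n");   /* skip separators */
		len = strcspn(s, " \t\n"); /* measure the next word */
		if (len == 0)
			break;

		/* hand the callback a NUL-terminated temporary copy */
		if ((word = malloc(len + 1)) == NULL)
			return 1;
		memcpy(word, s, len);
		word[len] = '\0';
		f(word);
		free(word);

		s += len;
	}
	return 0;
}

/* A callback that keeps the words. The pointer it receives is only
   valid during the call, so it copies each word into storage it owns. */
#define MAX_SAVED 16
static char *saved[MAX_SAVED];
static size_t nsaved;

void save_word(const char *w)
{
	char *copy;

	if (nsaved == MAX_SAVED)
		return; /* sketch: silently drop extra words */
	if ((copy = malloc(strlen(w) + 1)) == NULL)
		return;
	strcpy(copy, w);
	saved[nsaved++] = copy;
}
```

<p>Because <code>save_word</code> has the same signature as <code>print_word</code>, it could be passed to the Flex-based <code>words_callback</code> as well. The callback type has no user-data argument, which is why this sketch falls back on file-scope storage.</p>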
<h3 id="parsing">Parsing</h3>
<p>Now that we’ve seen how to identify tokens with a scanner, let’s learn how a parser can act on the tokens using recursive rules. Yacc, byacc, and bison generate LALR(1) parsers (look-ahead LR, where LR means the input is read left to right and the parse corresponds to a rightmost derivation built in reverse), and Bison supports more powerful modes if desired.</p>
<h4 id="mental-model-of-lr-parsing">Mental model of LR parsing</h4>
<p>LR parsers build bottom-up toward a goal, shifting tokens onto a stack and combining (“reducing”) them according to rules. It’s helpful to get a mental model for this process, so let’s jump into a simple example and simulate what yacc does.</p>
<p>Here’s a yacc grammar with a single rule to build a result called foo. We specify that foo is composed of the lex tokens A, B, and C, in that order.</p>
<div class="sourceCode" id="cb18"><pre class="sourceCode yacc"><code class="sourceCode yacc"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> A B C</span>
<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb18-4"><a href="#cb18-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-5"><a href="#cb18-5" aria-hidden="true" tabindex="-1"></a><span class="st">foo</span>: A B C</span></code></pre></div>
<p>Yacc transforms the grammar into a state machine which looks like this:</p>
<figure>
<img src="../images/parse/abc.png" alt="foo: A B C" /><figcaption aria-hidden="true">foo: A B C</figcaption>
</figure>
<p>The first rule in the file (and the <em>only</em> rule in our case) becomes yacc’s goal. Yacc begins in state 0, with the implicit rule 0: <code>$accept: • foo $end</code>. The parse will be accepted if we can produce a <code>foo</code> followed immediately by the end of input. The bullet point indicates our progress reading the input. In state 0 it’s at the beginning, meaning we haven’t read anything yet.</p>
<p>Initially there’s no lookahead token, so yacc calls <code>yylex()</code> to get one. If lex produces an A, we follow the state transition to state 1. Because the arrow is a solid line, not dashed, yacc “shifts” the token to its token stack. It also pushes state 1 onto a state stack, which now holds states 0 and 1.</p>
<p>State 1 is trying to satisfy the rule which it calls rule 1, namely <code>1 foo: A • B C</code>. The bullet point after the A indicates we’ve seen the A already. Don’t confuse the state numbers and rule numbers – yacc numbers them independently.</p>
<p>Yacc continues processing input, shifting tokens and moving to states 3 and 5 if lex produces the expected tokens. If, at any point, lex produces a token not matching any transitions in the current state, then yacc reports a syntax error and terminates. (There’s a way to do error recovery, but that’s another topic.)</p>
<p>State 5 has seen all necessary tokens for rule 1: <code>1 foo: A B C •</code>. Yacc continues to the diamond marked “R1,” which is a reduction action. Yacc “reduces” rule 1, popping the A, B, C terminal tokens off the stack and pushing a single non-terminal <code>foo</code> token. When it pops the three tokens, it pops the same number of states (states 5, 3, and 1). Popping three states lands us back in state 0.</p>
<p>State 0 has a dashed line going to state 2 that matches the foo token that was just reduced. The dashed line means “goto” rather than “shift,” because rule 0 doesn’t have to shift anything onto the token stack. The previous reduction already took care of that.</p>
<p>Finally, state 2 asks lex for another token, and if lex reports EOF, that matches <code>$end</code> and sends us to state 4, which ties a ribbon on it with the Acc(ept) action.</p>
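<p>The walk-through above can be simulated by hand in a few lines of C. This is only a sketch of the mental model, not what yacc generates: real parsers are table-driven, and the token codes and the function name <code>parse_abc</code> are assumptions for illustration. The state numbers are the ones from the diagram.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Token codes are hypothetical; END plays the role of $end. */
enum token { END, A, B, C };

/* Simulate the state machine for "foo: A B C" by hand. The input must
   be terminated by END. Returns 1 if accepted, 0 on a syntax error. */
int parse_abc(const enum token *input)
{
	int states[8]; /* state stack; this grammar never needs more than 4 */
	size_t top = 0;

	assert(input != NULL);
	states[top] = 0; /* begin in state 0 */

	for (;;)
	{
		enum token t = *input; /* the lookahead token */

		switch (states[top])
		{
		case 0: /* $accept: . foo $end */
			if (t != A) return 0;
			states[++top] = 1; input++; /* shift A */
			break;
		case 1: /* foo: A . B C */
			if (t != B) return 0;
			states[++top] = 3; input++; /* shift B */
			break;
		case 3: /* foo: A B . C */
			if (t != C) return 0;
			states[++top] = 5; input++; /* shift C */
			break;
		case 5: /* foo: A B C . */
			top -= 3;          /* reduce rule 1: pop states 5, 3, 1 */
			states[++top] = 2; /* goto from state 0 on foo */
			break;
		case 2: /* $accept: foo . $end */
			if (t != END) return 0;
			states[++top] = 4; /* shift $end */
			break;
		case 4:
			return 1; /* accept */
		}
	}
}
```

<p>Feeding it the tokens A, B, C, END visits states 0, 1, 3, and 5, drops back to state 0 on the reduction, and then moves through states 2 and 4. Any other sequence returns 0, the equivalent of a syntax error.</p>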
<p>From what we’ve seen so far, each state may seem to be merely tracking progress through a single rule. However, states actually track all legal ways forward from tokens previously consumed. A single state can track multiple candidate rules. For instance:</p>
<div class="sourceCode" id="cb19"><pre class="sourceCode yacc"><code class="sourceCode yacc"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> A B C</span>
<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb19-4"><a href="#cb19-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-5"><a href="#cb19-5" aria-hidden="true" tabindex="-1"></a><span class="st"> </span><span class="co">/* foo is either x or y */</span></span>
<span id="cb19-6"><a href="#cb19-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-7"><a href="#cb19-7" aria-hidden="true" tabindex="-1"></a><span class="st">foo</span>: x | y;</span>
<span id="cb19-8"><a href="#cb19-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-9"><a href="#cb19-9" aria-hidden="true" tabindex="-1"></a> <span class="co">/* x and y both start with an A */</span></span>
<span id="cb19-10"><a href="#cb19-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-11"><a href="#cb19-11" aria-hidden="true" tabindex="-1"></a><span class="st">x</span>: A B;</span>
<span id="cb19-12"><a href="#cb19-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-13"><a href="#cb19-13" aria-hidden="true" tabindex="-1"></a><span class="st">y</span>: A C;</span></code></pre></div>
<p>For this grammar, yacc produces the following state machine:</p>
<figure>
<img src="../images/parse/axy.png" alt="foo : x | y" /><figcaption aria-hidden="true">foo : x | y</figcaption>
</figure>
<p>In state 1 we’ve seen token A, and so rules 3 and 4 are both in the running to reduce an x or y. On a B or C token, the possibilities narrow to a single rule (in state 5 or 6).</p>
<p>Also notice that our rule <code>foo : x | y</code> doesn’t occur verbatim in any states. Yacc separates it into <code>1 foo: x</code> and <code>2 foo: y</code>. Thus, the numbered rules don’t always match the rules in the grammar one-to-one.</p>
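<p>One way to picture a state that tracks several candidates is as a set of “items,” each pairing a numbered rule with the bullet’s position inside it. The sketch below encodes the numbered rules for this grammar and narrows an item set when a token arrives. It is a simplification under stated assumptions: the symbol codes, struct names, and <code>advance</code> function are hypothetical, and the code leaves out the step where yacc expands an item set with the rules for any non-terminal that follows a bullet.</p>

```c
#include <assert.h>
#include <stddef.h>

enum symbol { END, A, B, C, FOO, X, Y, ACCEPT };

struct rule { enum symbol lhs; size_t len; enum symbol rhs[2]; };

/* the numbered rules for the grammar above */
static const struct rule rules[] = {
	{ ACCEPT, 2, { FOO, END } }, /* 0 $accept: foo $end */
	{ FOO,    1, { X } },        /* 1 foo: x */
	{ FOO,    1, { Y } },        /* 2 foo: y */
	{ X,      2, { A, B } },     /* 3 x: A B */
	{ Y,      2, { A, C } },     /* 4 y: A C */
};

/* an item is a rule plus the bullet's position within it */
struct item { size_t rule; size_t dot; };

/* Keep only the items whose next symbol is sym, moving the bullet past
   it. Filters in place and returns the new item count. */
size_t advance(struct item *items, size_t n, enum symbol sym)
{
	size_t i, kept = 0;

	assert(items != NULL);
	for (i = 0; i < n; i++)
	{
		const struct rule *r = &rules[items[i].rule];

		if (items[i].dot < r->len && r->rhs[items[i].dot] == sym)
		{
			items[kept].rule = items[i].rule;
			items[kept].dot = items[i].dot + 1;
			kept++;
		}
	}
	return kept;
}
```

<p>Starting from rules 3 and 4 with the bullet at the front, advancing on A keeps both items, and advancing again on B leaves only rule 3 with its bullet at the end, ready to reduce.</p>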
<p>Yacc can also peek ahead by one token to choose which rule to reduce, without shifting the “lookahead” token. In the following grammar, rules x and y match the same tokens. However, the foo rule can say to choose x when followed by a B, or y when followed by a C:</p>
<div class="sourceCode" id="cb20"><pre class="sourceCode yacc"><code class="sourceCode yacc"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> A B C</span>
<span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb20-4"><a href="#cb20-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-5"><a href="#cb20-5" aria-hidden="true" tabindex="-1"></a><span class="st">foo </span>: x B | y C;</span>
<span id="cb20-6"><a href="#cb20-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-7"><a href="#cb20-7" aria-hidden="true" tabindex="-1"></a><span class="st">x </span>: A;</span>
<span id="cb20-8"><a href="#cb20-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-9"><a href="#cb20-9" aria-hidden="true" tabindex="-1"></a><span class="st">y </span>: A;</span></code></pre></div>
<p>Note multiple reductions coming out of state 1 in the generated state machine:</p>
<figure>
<img src="../images/parse/la2.png" alt="lookahead for the first state" /><figcaption aria-hidden="true">lookahead for the first state</figcaption>
</figure>
<p>The presence of a bracketed token (<code>[C]</code>) exiting state 1 indicates that the state uses lookahead. If the state sees token C, it reduces rule 4. Otherwise it reduces rule 3. Lookahead tokens remain to be read when following a dashed-line (goto) action, such as from state 0 to state 4.</p>
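<p>In code, the decision made in state 1 amounts to inspecting the next token without consuming it. The following sketch assumes the input is an array of hypothetical token codes and that <code>pos</code> indexes the next unread token; the function name <code>reduce_after_A</code> is made up, and the rule numbers are the ones quoted above.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Token codes are hypothetical; END plays the role of $end. */
enum token { END, A, B, C };

/* State 1 has already shifted an A. Return the number of the rule to
   reduce: rule 4 (y: A) when the lookahead is C, otherwise rule 3
   (x: A). */
int reduce_after_A(const enum token *input, const size_t *pos)
{
	assert(input != NULL && pos != NULL);

	/* peek only: *pos is not advanced, so the token is still there
	   for whichever state shifts it later */
	return input[*pos] == C ? 4 : 3;
}
```

<p>Declaring <code>pos</code> as a pointer to <code>const</code> makes the point explicit: choosing the reduction never moves the read position.</p>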
<h4 id="ambiguous-grammars">Ambiguous grammars</h4>
<p>While yacc is a powerful tool to transform a grammar into a state machine, it may not operate the way you intend on ambiguous grammars. These are grammars with a state that could proceed in more than one way with the same input.</p>
<p>As grammars get complicated, it’s quite possible to create ambiguities. Let’s look at small examples that make it easier to see the mechanics of the conflict. That way, when it happens in a real grammar, we’ll have a better feeling for it.</p>
<p>In the following example, the input <code>A B</code> matches both <code>x</code> and <code>y B</code>. There’s no reason for yacc to choose one construction over the other when reducing to <code>foo</code>. So why does this matter, you ask? Don’t we get to <code>foo</code> either way? Yes, but real parsers will have different user code assigned to run per rule, and it matters which code block gets executed.</p>
<div class="sourceCode" id="cb21"><pre class="sourceCode yacc"><code class="sourceCode yacc"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> A B</span>
<span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb21-4"><a href="#cb21-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb21-5"><a href="#cb21-5" aria-hidden="true" tabindex="-1"></a><span class="st">foo </span>: x | y B ;</span>
<span id="cb21-6"><a href="#cb21-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb21-7"><a href="#cb21-7" aria-hidden="true" tabindex="-1"></a><span class="st">x </span>: A B ;</span>
<span id="cb21-8"><a href="#cb21-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb21-9"><a href="#cb21-9" aria-hidden="true" tabindex="-1"></a><span class="st">y </span>: A ;</span></code></pre></div>
<p>The state machine shows ambiguity at state 1:</p>
<figure>
<img src="../images/parse/sr.png" alt="shift/reduce conflict" /><figcaption aria-hidden="true">shift/reduce conflict</figcaption>
</figure>
<p>At state 1, when the next token is B, the parser <em>could</em> shift the token and enter state 5 (working toward a reduction to <code>x</code>). It could <em>also</em> reduce <code>y</code> and leave B as lookahead. This is called a shift/reduce conflict. Yacc’s policy in such a conflict is to favor a shift over a reduce.</p>
<p>Alternatively, we can construct a grammar with a state that has more than one eligible reduction for the same input. The purest toy example would be <code>foo : A | A</code>, generating:</p>
<figure>
<img src="../images/parse/aa.png" alt="reduce/reduce conflict" /><figcaption aria-hidden="true">reduce/reduce conflict</figcaption>
</figure>
<p>In a reduce/reduce conflict, yacc chooses to reduce the conflicting rule presented earlier in the grammar.</p>
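<p>To make the two default policies concrete, here is a minimal C sketch that models how a conflicted table cell gets resolved. It is not code from yacc itself: the <code>struct action</code> type and the <code>default_choice</code> function are hypothetical names invented for illustration, and real implementations settle conflicts while building the tables rather than at parse time.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stddef.h&gt;

/* Hypothetical model of one candidate action in a
   conflicted parse table cell. */
enum action_kind { ACT_SHIFT, ACT_REDUCE };

struct action
{
	enum action_kind kind;
	int rule; /* for reductions: the rule&#39;s position in the
	             grammar file, 1 = first; unused for shifts */
};

/* Return the candidate that the default policy keeps:
   a shift beats every reduce; otherwise the reduce whose
   rule appears earliest in the grammar wins.
   Returns NULL when there are no candidates. */
const struct action *
default_choice(const struct action *cand, size_t n)
{
	const struct action *best = NULL;
	size_t i;

	for (i = 0; i &lt; n; i++)
	{
		if (cand[i].kind == ACT_SHIFT)
			return &amp;cand[i];
		if (best == NULL || cand[i].rule &lt; best-&gt;rule)
			best = &amp;cand[i];
	}
	return best;
}</code></pre></div>
<p>Given one shift and one reduce, the sketch returns the shift. Given two reductions, it returns the one with the smaller rule number. The losing action simply disappears from the table, which is why the user code attached to it never runs.</p>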
<h4 id="constructing-semantic-values">Constructing semantic values</h4>
<p>While matching tokens, parsers typically build a user-defined value in memory to represent features of the input. Once the parse reaches the goal state and succeeds, then the user code will act on the memory value (or pass it along to a calling program).</p>
<p>Yacc stores the semantic values from parsed tokens in variables (<code>$1</code>, <code>$2</code>, …) accessible to code blocks, and it provides a variable (<code>$$</code>) for assigning the semantic result of the current code block.</p>
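<p>One way to picture the <code>$</code>-variables is as slots on a stack of values that grows with each shift and shrinks with each reduction. The C sketch below is a simplified model under that assumption, not the code yacc generates. The names <code>struct value_stack</code>, <code>shift_value</code>, and <code>reduce_sum_num</code> are hypothetical, as is the rule <code>sum : sum NUM { $$ = $1 + $2; }</code> that the reduction imitates.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;assert.h&gt;

#define STACK_MAX 64

/* Hypothetical stand-in for the parser&#39;s value stack. */
struct value_stack
{
	int vals[STACK_MAX];
	int len;
};

/* Shifting a token pushes its semantic value, the number
   the lexer left in yylval. */
void
shift_value(struct value_stack *s, int lval)
{
	assert(s-&gt;len &lt; STACK_MAX);
	s-&gt;vals[s-&gt;len++] = lval;
}

/* Reducing by "sum : sum NUM { $$ = $1 + $2; }".
   $1 and $2 are the top two entries. Both are popped and
   the value assigned to $$ is pushed in their place. */
void
reduce_sum_num(struct value_stack *s)
{
	int d1, d2, result;

	assert(s-&gt;len &gt;= 2);
	d1 = s-&gt;vals[s-&gt;len - 2];  /* $1 */
	d2 = s-&gt;vals[s-&gt;len - 1];  /* $2 */
	result = d1 + d2;           /* $$ */
	s-&gt;len -= 2;
	s-&gt;vals[s-&gt;len++] = result;
}</code></pre></div>
<p>Pushing 10 and then 5 and calling <code>reduce_sum_num</code> leaves a single entry holding 15, which becomes <code>$1</code> for the next reduction. A rule with the default action <code>$$ = $1</code> would leave the stack unchanged in this model.</p>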
<p>Let’s see it in action. We won’t do a hackneyed calculator, but let’s still make a parser that operates on integers. Integer values allow us to avoid thinking about memory management.</p>
<p>We’ll revisit the roman numeral example, and this time let lex match the digits while yacc combines them into a final result. It’s actually more cumbersome than our earlier way, but illustrates how to work with semantic parse values.</p>
<p>There are some comments in the example below about portability between yacc variants. The three most prominent variants, in order of increasing features, are: the <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/yacc.html">POSIX interface</a>, which roughly matches the functionality of AT&amp;T yacc, <a href="https://invisible-island.net/byacc/byacc.html">byacc</a> (Berkeley Yacc), and <a href="https://www.gnu.org/software/bison/">GNU Bison</a>.</p>
<div class="sourceCode" id="cb22"><pre class="sourceCode yacc"><code class="sourceCode yacc"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* roman.y  (plain yacc) */</span></span>
<span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-3"><a href="#cb22-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%{</span></span>
<span id="cb22-4"><a href="#cb22-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb22-5"><a href="#cb22-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-6"><a href="#cb22-6" aria-hidden="true" tabindex="-1"></a><span class="co">/* declarations to fix warnings from sloppy</span></span>
<span id="cb22-7"><a href="#cb22-7" aria-hidden="true" tabindex="-1"></a><span class="co">   yacc/byacc/bison code generation. For instance,</span></span>
<span id="cb22-8"><a href="#cb22-8" aria-hidden="true" tabindex="-1"></a><span class="co">   the code should have a declaration of yylex. */</span></span>
<span id="cb22-9"><a href="#cb22-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-10"><a href="#cb22-10" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> yylex<span class="op">(</span><span class="dt">void</span><span class="op">);</span></span>
<span id="cb22-11"><a href="#cb22-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-12"><a href="#cb22-12" aria-hidden="true" tabindex="-1"></a><span class="co">/* The POSIX specification says yyerror should return</span></span>
<span id="cb22-13"><a href="#cb22-13" aria-hidden="true" tabindex="-1"></a><span class="co">   int, although bison documentation says the value is</span></span>
<span id="cb22-14"><a href="#cb22-14" aria-hidden="true" tabindex="-1"></a><span class="co">   ignored. We match POSIX just in case. */</span></span>
<span id="cb22-15"><a href="#cb22-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-16"><a href="#cb22-16" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> yyerror<span class="op">(</span><span class="at">const</span> <span class="dt">char</span> <span class="op">*</span>s<span class="op">);</span></span>
<span id="cb22-17"><a href="#cb22-17" aria-hidden="true" tabindex="-1"></a><span class="bn">%}</span></span>
<span id="cb22-18"><a href="#cb22-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-19"><a href="#cb22-19" aria-hidden="true" tabindex="-1"></a><span class="co">/* tokens our lexer will produce */</span></span>
<span id="cb22-20"><a href="#cb22-20" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> NUM</span>
<span id="cb22-21"><a href="#cb22-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-22"><a href="#cb22-22" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb22-23"><a href="#cb22-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-24"><a href="#cb22-24" aria-hidden="true" tabindex="-1"></a><span class="co">/* The first rule is the final goal. Yacc will work</span></span>
<span id="cb22-25"><a href="#cb22-25" aria-hidden="true" tabindex="-1"></a><span class="co">   backward trying to arrive here. This &quot;results&quot; rule</span></span>
<span id="cb22-26"><a href="#cb22-26" aria-hidden="true" tabindex="-1"></a><span class="co">   is a stub we use to print the value from &quot;number.&quot; */</span></span>
<span id="cb22-27"><a href="#cb22-27" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-28"><a href="#cb22-28" aria-hidden="true" tabindex="-1"></a><span class="st">results </span>:</span>
<span id="cb22-29"><a href="#cb22-29" aria-hidden="true" tabindex="-1"></a>  number { fprintf<span class="op">(</span>yyout<span class="op">,</span> <span class="st">&quot;</span><span class="sc">%d\n</span><span class="st">&quot;</span><span class="op">,</span> <span class="kw">$1</span><span class="op">);</span> }</span>
<span id="cb22-30"><a href="#cb22-30" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb22-31"><a href="#cb22-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-32"><a href="#cb22-32" aria-hidden="true" tabindex="-1"></a><span class="co">/* as the lexer produces more NUMs, keep adding them */</span></span>
<span id="cb22-33"><a href="#cb22-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-34"><a href="#cb22-34" aria-hidden="true" tabindex="-1"></a><span class="st">number </span>:</span>
<span id="cb22-35"><a href="#cb22-35" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-36"><a href="#cb22-36" aria-hidden="true" tabindex="-1"></a>  <span class="co">/* this is a common pattern for saying number is one or</span></span>
<span id="cb22-37"><a href="#cb22-37" aria-hidden="true" tabindex="-1"></a><span class="co">     more NUMs.  Notice we specify &quot;number NUM&quot; and not</span></span>
<span id="cb22-38"><a href="#cb22-38" aria-hidden="true" tabindex="-1"></a><span class="co">     &quot;NUM number&quot;. In yacc recursion, think &quot;right is wrong</span></span>
<span id="cb22-39"><a href="#cb22-39" aria-hidden="true" tabindex="-1"></a><span class="co">     and left is right.&quot; */</span></span>
<span id="cb22-40"><a href="#cb22-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-41"><a href="#cb22-41" aria-hidden="true" tabindex="-1"></a>  number NUM { <span class="kw">$$</span> <span class="op">=</span> <span class="kw">$1</span> <span class="op">+</span> <span class="kw">$2</span><span class="op">;</span> }</span>
<span id="cb22-42"><a href="#cb22-42" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-43"><a href="#cb22-43" aria-hidden="true" tabindex="-1"></a>  <span class="co">/* base case, using default rule of $$ = $1 */</span></span>
<span id="cb22-44"><a href="#cb22-44" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb22-45"><a href="#cb22-45" aria-hidden="true" tabindex="-1"></a>| NUM</span>
<span id="cb22-46"><a href="#cb22-46" aria-hidden="true" tabindex="-1"></a>;</span></code></pre></div>
<p>The corresponding lexer matches individual numerals, and returns them with their semantic values.</p>
<div class="sourceCode" id="cb23"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* roman.l */</span></span>
<span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%{</span></span>
<span id="cb23-4"><a href="#cb23-4" aria-hidden="true" tabindex="-1"></a><span class="co">/* The .tab.h file is generated by yacc, and we&#39;ll explain</span></span>
<span id="cb23-5"><a href="#cb23-5" aria-hidden="true" tabindex="-1"></a><span class="co">   it later */</span></span>
<span id="cb23-6"><a href="#cb23-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-7"><a href="#cb23-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&quot;roman.tab.h&quot;</span></span>
<span id="cb23-8"><a href="#cb23-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-9"><a href="#cb23-9" aria-hidden="true" tabindex="-1"></a><span class="co">/* lex communicates semantic token values to yacc through</span></span>
<span id="cb23-10"><a href="#cb23-10" aria-hidden="true" tabindex="-1"></a><span class="co">   a shared global variable */</span></span>
<span id="cb23-11"><a href="#cb23-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-12"><a href="#cb23-12" aria-hidden="true" tabindex="-1"></a><span class="at">extern</span> <span class="dt">int</span> yylval<span class="op">;</span></span>
<span id="cb23-13"><a href="#cb23-13" aria-hidden="true" tabindex="-1"></a><span class="bn">%}</span></span>
<span id="cb23-14"><a href="#cb23-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-15"><a href="#cb23-15" aria-hidden="true" tabindex="-1"></a><span class="co">/* when using flex (rather than vanilla lex) fix</span></span>
<span id="cb23-16"><a href="#cb23-16" aria-hidden="true" tabindex="-1"></a><span class="co">   unused function warnings by adding:</span></span>
<span id="cb23-17"><a href="#cb23-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-18"><a href="#cb23-18" aria-hidden="true" tabindex="-1"></a><span class="co">%option noinput nounput</span></span>
<span id="cb23-19"><a href="#cb23-19" aria-hidden="true" tabindex="-1"></a><span class="co">*/</span></span>
<span id="cb23-20"><a href="#cb23-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-21"><a href="#cb23-21" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb23-22"><a href="#cb23-22" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-23"><a href="#cb23-23" aria-hidden="true" tabindex="-1"></a> <span class="co">/* The constant for NUM comes from roman.tab.h,</span></span>
<span id="cb23-24"><a href="#cb23-24" aria-hidden="true" tabindex="-1"></a><span class="co">    and was generated because we declared</span></span>
<span id="cb23-25"><a href="#cb23-25" aria-hidden="true" tabindex="-1"></a><span class="co">    &quot;%token NUM&quot; in roman.y */</span></span>
<span id="cb23-26"><a href="#cb23-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-27"><a href="#cb23-27" aria-hidden="true" tabindex="-1"></a><span class="st">I</span>  { yylval <span class="op">=</span>    <span class="dv">1</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-28"><a href="#cb23-28" aria-hidden="true" tabindex="-1"></a><span class="st">V</span>  { yylval <span class="op">=</span>    <span class="dv">5</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-29"><a href="#cb23-29" aria-hidden="true" tabindex="-1"></a><span class="st">X</span>  { yylval <span class="op">=</span>   <span class="dv">10</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-30"><a href="#cb23-30" aria-hidden="true" tabindex="-1"></a><span class="st">L</span>  { yylval <span class="op">=</span>   <span class="dv">50</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-31"><a href="#cb23-31" aria-hidden="true" tabindex="-1"></a><span class="st">C</span>  { yylval <span class="op">=</span>  <span class="dv">100</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-32"><a href="#cb23-32" aria-hidden="true" tabindex="-1"></a><span class="st">D</span>  { yylval <span class="op">=</span>  <span class="dv">500</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-33"><a href="#cb23-33" aria-hidden="true" tabindex="-1"></a><span class="st">M</span>  { yylval <span class="op">=</span> <span class="dv">1000</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-34"><a href="#cb23-34" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-35"><a href="#cb23-35" aria-hidden="true" tabindex="-1"></a><span class="st">IV</span> { yylval <span class="op">=</span>    <span class="dv">4</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-36"><a href="#cb23-36" aria-hidden="true" tabindex="-1"></a><span class="st">IX</span> { yylval <span class="op">=</span>    <span class="dv">9</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-37"><a href="#cb23-37" aria-hidden="true" tabindex="-1"></a><span class="st">XL</span> { yylval <span class="op">=</span>   <span class="dv">40</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-38"><a href="#cb23-38" aria-hidden="true" tabindex="-1"></a><span class="st">XC</span> { yylval <span class="op">=</span>   <span class="dv">90</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-39"><a href="#cb23-39" aria-hidden="true" tabindex="-1"></a><span class="st">CD</span> { yylval <span class="op">=</span>  <span class="dv">400</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-40"><a href="#cb23-40" aria-hidden="true" tabindex="-1"></a><span class="st">CM</span> { yylval <span class="op">=</span>  <span class="dv">900</span><span class="op">;</span> <span class="cf">return</span> NUM<span class="op">;</span> }</span>
<span id="cb23-41"><a href="#cb23-41" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-42"><a href="#cb23-42" aria-hidden="true" tabindex="-1"></a> <span class="co">/* ignore final newline */</span></span>
<span id="cb23-43"><a href="#cb23-43" aria-hidden="true" tabindex="-1"></a><span class="st">\n</span> <span class="op">;</span></span>
<span id="cb23-44"><a href="#cb23-44" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-45"><a href="#cb23-45" aria-hidden="true" tabindex="-1"></a> <span class="co">/* As a default action, return the ascii value of</span></span>
<span id="cb23-46"><a href="#cb23-46" aria-hidden="true" tabindex="-1"></a><span class="co">    the character as if it were a token identifier.</span></span>
<span id="cb23-47"><a href="#cb23-47" aria-hidden="true" tabindex="-1"></a><span class="co">    The values from roman.tab.h are offset above 256 to</span></span>
<span id="cb23-48"><a href="#cb23-48" aria-hidden="true" tabindex="-1"></a><span class="co">    be above any ascii value, so there&#39;s no ambiguity.</span></span>
<span id="cb23-49"><a href="#cb23-49" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-50"><a href="#cb23-50" aria-hidden="true" tabindex="-1"></a><span class="co">    Our parser won&#39;t be expecting these values, so</span></span>
<span id="cb23-51"><a href="#cb23-51" aria-hidden="true" tabindex="-1"></a><span class="co">    they will lead to a syntax error */</span></span>
<span id="cb23-52"><a href="#cb23-52" aria-hidden="true" tabindex="-1"></a><span class="st">.</span>  { <span class="cf">return</span> <span class="op">*</span>yytext<span class="op">;</span> }</span></code></pre></div>
<p>To review: lex generates a <code>yylex()</code> function, and yacc generates <code>yyparse()</code>, which calls <code>yylex()</code> repeatedly to get new token identifiers. Lex actions copy semantic values to <code>yylval</code>, which yacc copies into <code>$</code>-variables accessible in parser rule actions.</p>
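<p>The handoff between the two generated functions can be imitated by hand. The following C sketch is an illustration only, not the output of lex or yacc. It reads from a string instead of standard input, and every name in it (<code>toy_lex</code>, <code>toy_parse</code>, <code>struct toy_lexer</code>, and the token value <code>TOY_NUM</code>) is hypothetical. The <code>lval</code> field plays the role of <code>yylval</code>.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/* Token identifiers. TOY_NUM sits above the range of
   character codes, like the constants in a .tab.h file. */
enum { TOY_EOF = 0, TOY_NUM = 257 };

struct toy_lexer
{
	const char *pos; /* next unread character */
	int lval;        /* plays the role of yylval */
};

/* Two-letter numerals come first so that the longest
   match wins, imitating lex&#39;s matching rule. */
static const struct { const char *text; int value; } numerals[] = {
	{&quot;CM&quot;, 900}, {&quot;CD&quot;, 400}, {&quot;XC&quot;, 90}, {&quot;XL&quot;, 40},
	{&quot;IX&quot;, 9}, {&quot;IV&quot;, 4}, {&quot;M&quot;, 1000}, {&quot;D&quot;, 500},
	{&quot;C&quot;, 100}, {&quot;L&quot;, 50}, {&quot;X&quot;, 10}, {&quot;V&quot;, 5}, {&quot;I&quot;, 1}
};

/* Return the next token identifier, storing the semantic
   value of a numeral in lx-&gt;lval. */
int
toy_lex(struct toy_lexer *lx)
{
	size_t i;

	while (*lx-&gt;pos == &#39;\n&#39;) /* ignore newlines */
		lx-&gt;pos++;
	if (*lx-&gt;pos == &#39;\0&#39;)
		return TOY_EOF;
	for (i = 0; i &lt; sizeof numerals / sizeof numerals[0]; i++)
	{
		size_t n = strlen(numerals[i].text);
		if (strncmp(lx-&gt;pos, numerals[i].text, n) == 0)
		{
			lx-&gt;pos += n;
			lx-&gt;lval = numerals[i].value;
			return TOY_NUM;
		}
	}
	/* anything else: return the character code itself */
	return (unsigned char)*lx-&gt;pos++;
}

/* Keep asking the lexer for tokens and add up their values.
   Returns 0 and stores the total on success, or 1 on a
   syntax error (an unexpected token, or no numerals). */
int
toy_parse(const char *input, int *result)
{
	struct toy_lexer lx;
	int tok, total = 0, seen = 0;

	lx.pos = input;
	lx.lval = 0;
	while ((tok = toy_lex(&amp;lx)) != TOY_EOF)
	{
		if (tok != TOY_NUM)
			return 1;
		total += lx.lval;
		seen = 1;
	}
	if (!seen)
		return 1;
	*result = total;
	return 0;
}</code></pre></div>
<p>Calling <code>toy_parse("MMMCMXCIX\n", &amp;r)</code> returns 0 and stores 3999 in <code>r</code>. The loop hard-codes the one grammar rule we care about, which is exactly the work yacc automates for larger grammars.</p>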
<p>Building an executable <code>roman</code> from the input files <code>roman.y</code> and <code>roman.l</code> requires explanation. With appropriate command line flags, yacc will create the files <code>roman.tab.c</code> and <code>roman.tab.h</code> from <code>roman.y</code>. Lex will create <code>roman.lex.c</code> from <code>roman.l</code>, using token identifiers in <code>roman.tab.h</code>.</p>
<p>In short, here are the build dependencies for each file:</p>
<figure>
<img src="../images/parse/build.png" alt="build dependency graph" /><figcaption aria-hidden="true">build dependency graph</figcaption>
</figure>
<p>And here’s how to express it all in a Makefile.</p>
<div class="sourceCode" id="cb24"><pre class="sourceCode makefile"><code class="sourceCode makefile"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true" tabindex="-1"></a><span class="co"># put together object files from lexer and parser, and</span></span>
<span id="cb24-2"><a href="#cb24-2" aria-hidden="true" tabindex="-1"></a><span class="co"># link the yacc and lex libraries (in that order, to pick</span></span>
<span id="cb24-3"><a href="#cb24-3" aria-hidden="true" tabindex="-1"></a><span class="co"># main() from yacc&#39;s library rather than lex&#39;s)</span></span>
<span id="cb24-4"><a href="#cb24-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-5"><a href="#cb24-5" aria-hidden="true" tabindex="-1"></a><span class="dv">roman :</span><span class="dt"> roman.tab.o roman.lex.o</span></span>
<span id="cb24-6"><a href="#cb24-6" aria-hidden="true" tabindex="-1"></a>	<span class="ch">$(</span><span class="dt">CC</span><span class="ch">)</span> -o <span class="ch">$@</span> roman.tab.o roman.lex.o -ly -ll</span>
<span id="cb24-7"><a href="#cb24-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-8"><a href="#cb24-8" aria-hidden="true" tabindex="-1"></a><span class="co"># tell make which files yacc will generate</span></span>
<span id="cb24-9"><a href="#cb24-9" aria-hidden="true" tabindex="-1"></a><span class="co">#</span></span>
<span id="cb24-10"><a href="#cb24-10" aria-hidden="true" tabindex="-1"></a><span class="co"># an explanation of the arguments:</span></span>
<span id="cb24-11"><a href="#cb24-11" aria-hidden="true" tabindex="-1"></a><span class="co"># -b roman  -  name the files roman.tab.*</span></span>
<span id="cb24-12"><a href="#cb24-12" aria-hidden="true" tabindex="-1"></a><span class="co"># -d        -  generate a .tab.h file too</span></span>
<span id="cb24-13"><a href="#cb24-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-14"><a href="#cb24-14" aria-hidden="true" tabindex="-1"></a><span class="dv">roman.tab.h roman.tab.c :</span><span class="dt"> roman.y</span></span>
<span id="cb24-15"><a href="#cb24-15" aria-hidden="true" tabindex="-1"></a>	<span class="ch">$(</span><span class="dt">YACC</span><span class="ch">)</span> -d -b roman <span class="ch">$?</span></span>
<span id="cb24-16"><a href="#cb24-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-17"><a href="#cb24-17" aria-hidden="true" tabindex="-1"></a><span class="co"># the object file relies on the generated lexer, and</span></span>
<span id="cb24-18"><a href="#cb24-18" aria-hidden="true" tabindex="-1"></a><span class="co"># on the token constants </span></span>
<span id="cb24-19"><a href="#cb24-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-20"><a href="#cb24-20" aria-hidden="true" tabindex="-1"></a><span class="dv">roman.lex.o :</span><span class="dt"> roman.tab.h roman.lex.c</span></span>
<span id="cb24-21"><a href="#cb24-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-22"><a href="#cb24-22" aria-hidden="true" tabindex="-1"></a><span class="co"># can&#39;t use the default suffix rule because we&#39;re</span></span>
<span id="cb24-23"><a href="#cb24-23" aria-hidden="true" tabindex="-1"></a><span class="co"># changing the name of the output to .lex.c</span></span>
<span id="cb24-24"><a href="#cb24-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb24-25"><a href="#cb24-25" aria-hidden="true" tabindex="-1"></a><span class="dv">roman.lex.c :</span><span class="dt"> roman.l</span></span>
<span id="cb24-26"><a href="#cb24-26" aria-hidden="true" tabindex="-1"></a>	<span class="ch">$(</span><span class="dt">LEX</span><span class="ch">)</span> -t <span class="ch">$?</span> &gt; <span class="ch">$@</span></span></code></pre></div>
<p>And now, the moment of truth:</p>
<div class="sourceCode" id="cb25"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb25-1"><a href="#cb25-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> make</span>
<span id="cb25-2"><a href="#cb25-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-3"><a href="#cb25-3" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo MMMCMXCIX <span class="kw">|</span> <span class="ex">./roman</span></span>
<span id="cb25-4"><a href="#cb25-4" aria-hidden="true" tabindex="-1"></a><span class="ex">3999</span></span></code></pre></div>
<h4 id="using-a-parser-as-a-library">Using a parser as a library</h4>
<p>In this example we’ll parse LISP <a href="https://wikipedia.org/wiki/S-expression">S-expressions</a>, limited to string and integer atoms. There’s more going on in this one, such as memory management, different semantic types per token, and packaging the lexer and parser together into a single thread-safe library. This example requires Bison.</p>
<div class="sourceCode" id="cb26"><pre class="sourceCode yacc"><code class="sourceCode yacc"><span id="cb26-1"><a href="#cb26-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* lisp.y  (requires Bison) */</span></span>
<span id="cb26-2"><a href="#cb26-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-3"><a href="#cb26-3" aria-hidden="true" tabindex="-1"></a><span class="co">/* a &quot;pure&quot; api means communication variables like yylval</span></span>
<span id="cb26-4"><a href="#cb26-4" aria-hidden="true" tabindex="-1"></a><span class="co">   won&#39;t be global variables, and yylex is assumed to</span></span>
<span id="cb26-5"><a href="#cb26-5" aria-hidden="true" tabindex="-1"></a><span class="co">   have a different signature */</span></span>
<span id="cb26-6"><a href="#cb26-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-7"><a href="#cb26-7" aria-hidden="true" tabindex="-1"></a><span class="kw">%define</span> api.pure true</span>
<span id="cb26-8"><a href="#cb26-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-9"><a href="#cb26-9" aria-hidden="true" tabindex="-1"></a><span class="co">/* change prefix of symbols from yy to &quot;lisp&quot; to avoid</span></span>
<span id="cb26-10"><a href="#cb26-10" aria-hidden="true" tabindex="-1"></a><span class="co">   clashes with any other parsers we may want to link */</span></span>
<span id="cb26-11"><a href="#cb26-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-12"><a href="#cb26-12" aria-hidden="true" tabindex="-1"></a><span class="kw">%define</span> api.prefix {lisp}</span>
<span id="cb26-13"><a href="#cb26-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-14"><a href="#cb26-14" aria-hidden="true" tabindex="-1"></a><span class="co">/* generate much more meaningful errors rather than the</span></span>
<span id="cb26-15"><a href="#cb26-15" aria-hidden="true" tabindex="-1"></a><span class="co">   uninformative string &quot;syntax error&quot; */</span></span>
<span id="cb26-16"><a href="#cb26-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-17"><a href="#cb26-17" aria-hidden="true" tabindex="-1"></a><span class="kw">%define</span> parse.error verbose</span>
<span id="cb26-18"><a href="#cb26-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-19"><a href="#cb26-19" aria-hidden="true" tabindex="-1"></a><span class="co">/* Bison offers different %code insertion locations in</span></span>
<span id="cb26-20"><a href="#cb26-20" aria-hidden="true" tabindex="-1"></a><span class="co">   addition to yacc&#39;s %{ %} construct.</span></span>
<span id="cb26-21"><a href="#cb26-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-22"><a href="#cb26-22" aria-hidden="true" tabindex="-1"></a><span class="co">   The &quot;top&quot; location is good for headers and feature</span></span>
<span id="cb26-23"><a href="#cb26-23" aria-hidden="true" tabindex="-1"></a><span class="co">   flags like the _XOPEN_SOURCE we use here */</span></span>
<span id="cb26-24"><a href="#cb26-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-25"><a href="#cb26-25" aria-hidden="true" tabindex="-1"></a><span class="kw">%code</span> top {</span>
<span id="cb26-26"><a href="#cb26-26" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* XOPEN for strdup */</span></span>
<span id="cb26-27"><a href="#cb26-27" aria-hidden="true" tabindex="-1"></a>	#define _XOPEN_SOURCE 600</span>
<span id="cb26-28"><a href="#cb26-28" aria-hidden="true" tabindex="-1"></a>	#include <span class="dt">&lt;stdio.h&gt;</span></span>
<span id="cb26-29"><a href="#cb26-29" aria-hidden="true" tabindex="-1"></a>	#include <span class="dt">&lt;stdlib.h&gt;</span></span>
<span id="cb26-30"><a href="#cb26-30" aria-hidden="true" tabindex="-1"></a>	#include <span class="dt">&lt;string.h&gt;</span></span>
<span id="cb26-31"><a href="#cb26-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-32"><a href="#cb26-32" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* Bison versions 3.7.5 and above provide the YYNOMEM</span></span>
<span id="cb26-33"><a href="#cb26-33" aria-hidden="true" tabindex="-1"></a><span class="co">	   macro to allow our actions to signal the unlikely</span></span>
<span id="cb26-34"><a href="#cb26-34" aria-hidden="true" tabindex="-1"></a><span class="co">	   event that they couldn&#39;t allocate memory. Thanks</span></span>
<span id="cb26-35"><a href="#cb26-35" aria-hidden="true" tabindex="-1"></a><span class="co">	   to the Bison team for adding this feature at my</span></span>
<span id="cb26-36"><a href="#cb26-36" aria-hidden="true" tabindex="-1"></a><span class="co">	   request. :) YYNOMEM causes yyparse() to return 2.</span></span>
<span id="cb26-37"><a href="#cb26-37" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-38"><a href="#cb26-38" aria-hidden="true" tabindex="-1"></a><span class="co">	   The following conditional define allows us to use</span></span>
<span id="cb26-39"><a href="#cb26-39" aria-hidden="true" tabindex="-1"></a><span class="co">	   the functionality in earlier versions too. */</span></span>
<span id="cb26-40"><a href="#cb26-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-41"><a href="#cb26-41" aria-hidden="true" tabindex="-1"></a>	#ifndef YYNOMEM</span>
<span id="cb26-42"><a href="#cb26-42" aria-hidden="true" tabindex="-1"></a>	#define YYNOMEM goto yyexhaustedlab</span>
<span id="cb26-43"><a href="#cb26-43" aria-hidden="true" tabindex="-1"></a>	#endif</span>
<span id="cb26-44"><a href="#cb26-44" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb26-45"><a href="#cb26-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-46"><a href="#cb26-46" aria-hidden="true" tabindex="-1"></a><span class="co">/* The &quot;requires&quot; code location is designed for defining</span></span>
<span id="cb26-47"><a href="#cb26-47" aria-hidden="true" tabindex="-1"></a><span class="co">   data types that we can use as yylval&#39;s for tokens. Code</span></span>
<span id="cb26-48"><a href="#cb26-48" aria-hidden="true" tabindex="-1"></a><span class="co">   in this section is also added to the .tab.h file for</span></span>
<span id="cb26-49"><a href="#cb26-49" aria-hidden="true" tabindex="-1"></a><span class="co">   inclusion by calling code */</span></span>
<span id="cb26-50"><a href="#cb26-50" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-51"><a href="#cb26-51" aria-hidden="true" tabindex="-1"></a><span class="kw">%code</span> requires {</span>
<span id="cb26-52"><a href="#cb26-52" aria-hidden="true" tabindex="-1"></a>	enum sexpr_type {</span>
<span id="cb26-53"><a href="#cb26-53" aria-hidden="true" tabindex="-1"></a>		SEXPR_ID, SEXPR_NUM, SEXPR_PAIR, SEXPR_NIL</span>
<span id="cb26-54"><a href="#cb26-54" aria-hidden="true" tabindex="-1"></a>	};</span>
<span id="cb26-55"><a href="#cb26-55" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-56"><a href="#cb26-56" aria-hidden="true" tabindex="-1"></a>	struct sexpr</span>
<span id="cb26-57"><a href="#cb26-57" aria-hidden="true" tabindex="-1"></a>	{</span>
<span id="cb26-58"><a href="#cb26-58" aria-hidden="true" tabindex="-1"></a>		enum sexpr_type type;</span>
<span id="cb26-59"><a href="#cb26-59" aria-hidden="true" tabindex="-1"></a>		union</span>
<span id="cb26-60"><a href="#cb26-60" aria-hidden="true" tabindex="-1"></a>		{</span>
<span id="cb26-61"><a href="#cb26-61" aria-hidden="true" tabindex="-1"></a>			int   num;</span>
<span id="cb26-62"><a href="#cb26-62" aria-hidden="true" tabindex="-1"></a>			char *id;</span>
<span id="cb26-63"><a href="#cb26-63" aria-hidden="true" tabindex="-1"></a>		} value;</span>
<span id="cb26-64"><a href="#cb26-64" aria-hidden="true" tabindex="-1"></a>		struct sexpr *left, *right;</span>
<span id="cb26-65"><a href="#cb26-65" aria-hidden="true" tabindex="-1"></a>	};</span>
<span id="cb26-66"><a href="#cb26-66" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb26-67"><a href="#cb26-67" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-68"><a href="#cb26-68" aria-hidden="true" tabindex="-1"></a><span class="co">/* These are the semantic types available for tokens,</span></span>
<span id="cb26-69"><a href="#cb26-69" aria-hidden="true" tabindex="-1"></a><span class="co">   which we name num, str, and node.</span></span>
<span id="cb26-70"><a href="#cb26-70" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-71"><a href="#cb26-71" aria-hidden="true" tabindex="-1"></a><span class="co">   The %union construction is classic yacc as well. It</span></span>
<span id="cb26-72"><a href="#cb26-72" aria-hidden="true" tabindex="-1"></a><span class="co">   generates a C union and sets it as the YYSTYPE, which</span></span>
<span id="cb26-73"><a href="#cb26-73" aria-hidden="true" tabindex="-1"></a><span class="co">   will be the type of yylval */</span></span>
<span id="cb26-74"><a href="#cb26-74" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-75"><a href="#cb26-75" aria-hidden="true" tabindex="-1"></a><span class="kw">%union</span></span>
<span id="cb26-76"><a href="#cb26-76" aria-hidden="true" tabindex="-1"></a>{</span>
<span id="cb26-77"><a href="#cb26-77" aria-hidden="true" tabindex="-1"></a>	int num;</span>
<span id="cb26-78"><a href="#cb26-78" aria-hidden="true" tabindex="-1"></a>	char *str;</span>
<span id="cb26-79"><a href="#cb26-79" aria-hidden="true" tabindex="-1"></a>	struct sexpr *node;</span>
<span id="cb26-80"><a href="#cb26-80" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb26-81"><a href="#cb26-81" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-82"><a href="#cb26-82" aria-hidden="true" tabindex="-1"></a><span class="co">/* Add another argument to yyparse() so that we</span></span>
<span id="cb26-83"><a href="#cb26-83" aria-hidden="true" tabindex="-1"></a><span class="co">   can communicate the parsed result to the caller.</span></span>
<span id="cb26-84"><a href="#cb26-84" aria-hidden="true" tabindex="-1"></a><span class="co">   We can&#39;t return the result directly, since the</span></span>
<span id="cb26-85"><a href="#cb26-85" aria-hidden="true" tabindex="-1"></a><span class="co">   return value is already reserved as an int, with</span></span>
<span id="cb26-86"><a href="#cb26-86" aria-hidden="true" tabindex="-1"></a><span class="co">   0=success, 1=error, 2=nomem</span></span>
<span id="cb26-87"><a href="#cb26-87" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-88"><a href="#cb26-88" aria-hidden="true" tabindex="-1"></a><span class="co">   </span><span class="al">NOTE</span></span>
<span id="cb26-89"><a href="#cb26-89" aria-hidden="true" tabindex="-1"></a><span class="co">   In our case, the param is a data pointer. However,</span></span>
<span id="cb26-90"><a href="#cb26-90" aria-hidden="true" tabindex="-1"></a><span class="co">   if it were a function pointer (such as a callback),</span></span>
<span id="cb26-91"><a href="#cb26-91" aria-hidden="true" tabindex="-1"></a><span class="co">   then its type would have to be put behind a typedef,</span></span>
<span id="cb26-92"><a href="#cb26-92" aria-hidden="true" tabindex="-1"></a><span class="co">   or else parse-param will mangle the declaration. */</span></span>
<span id="cb26-93"><a href="#cb26-93" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-94"><a href="#cb26-94" aria-hidden="true" tabindex="-1"></a><span class="kw">%parse-param</span> {struct sexpr **result}</span>
<span id="cb26-95"><a href="#cb26-95" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-96"><a href="#cb26-96" aria-hidden="true" tabindex="-1"></a><span class="co">/* %param adds an extra param to yyparse (like %parse-param)</span></span>
<span id="cb26-97"><a href="#cb26-97" aria-hidden="true" tabindex="-1"></a><span class="co">   but also causes yyparse to send the value to yylex.</span></span>
<span id="cb26-98"><a href="#cb26-98" aria-hidden="true" tabindex="-1"></a><span class="co">   In our case the caller will initialize their own scanner</span></span>
<span id="cb26-99"><a href="#cb26-99" aria-hidden="true" tabindex="-1"></a><span class="co">   instance and pass it through */</span></span>
<span id="cb26-100"><a href="#cb26-100" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-101"><a href="#cb26-101" aria-hidden="true" tabindex="-1"></a><span class="kw">%param</span> {void *scanner}</span>
<span id="cb26-102"><a href="#cb26-102" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-103"><a href="#cb26-103" aria-hidden="true" tabindex="-1"></a><span class="co">/* the &quot;provides&quot; location adds the code to our generated</span></span>
<span id="cb26-104"><a href="#cb26-104" aria-hidden="true" tabindex="-1"></a><span class="co">   parser, but also to the .tab.h file for use by callers */</span></span>
<span id="cb26-105"><a href="#cb26-105" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-106"><a href="#cb26-106" aria-hidden="true" tabindex="-1"></a><span class="kw">%code</span> provides {</span>
<span id="cb26-107"><a href="#cb26-107" aria-hidden="true" tabindex="-1"></a>	void sexpr_free(struct sexpr *s);</span>
<span id="cb26-108"><a href="#cb26-108" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb26-109"><a href="#cb26-109" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-110"><a href="#cb26-110" aria-hidden="true" tabindex="-1"></a><span class="co">/* unqualified %code is for internal use, things that</span></span>
<span id="cb26-111"><a href="#cb26-111" aria-hidden="true" tabindex="-1"></a><span class="co">   our actions can see. These declarations prevent</span></span>
<span id="cb26-112"><a href="#cb26-112" aria-hidden="true" tabindex="-1"></a><span class="co">   warnings.  Notice the final param in each that came</span></span>
<span id="cb26-113"><a href="#cb26-113" aria-hidden="true" tabindex="-1"></a><span class="co">   from the %param directive above */</span></span>
<span id="cb26-114"><a href="#cb26-114" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-115"><a href="#cb26-115" aria-hidden="true" tabindex="-1"></a><span class="kw">%code</span> {</span>
<span id="cb26-116"><a href="#cb26-116" aria-hidden="true" tabindex="-1"></a>	int lisperror(void *foo, char const *msg, const void *s);</span>
<span id="cb26-117"><a href="#cb26-117" aria-hidden="true" tabindex="-1"></a>	int lisplex(void *lval, const void *s);</span>
<span id="cb26-118"><a href="#cb26-118" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb26-119"><a href="#cb26-119" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-120"><a href="#cb26-120" aria-hidden="true" tabindex="-1"></a><span class="co">/* Now when we declare tokens, we add their type</span></span>
<span id="cb26-121"><a href="#cb26-121" aria-hidden="true" tabindex="-1"></a><span class="co">   in brackets. The type names come from our %union */</span></span>
<span id="cb26-122"><a href="#cb26-122" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-123"><a href="#cb26-123" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> <span class="dt">&lt;str&gt;</span> ID</span>
<span id="cb26-124"><a href="#cb26-124" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> <span class="dt">&lt;num&gt;</span> NUM</span>
<span id="cb26-125"><a href="#cb26-125" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-126"><a href="#cb26-126" aria-hidden="true" tabindex="-1"></a><span class="co">/* whereas tokens come from the lexer, these</span></span>
<span id="cb26-127"><a href="#cb26-127" aria-hidden="true" tabindex="-1"></a><span class="co">   non-terminals are defined in the parser, and</span></span>
<span id="cb26-128"><a href="#cb26-128" aria-hidden="true" tabindex="-1"></a><span class="co">   we set their types with %type */</span></span>
<span id="cb26-129"><a href="#cb26-129" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-130"><a href="#cb26-130" aria-hidden="true" tabindex="-1"></a><span class="kw">%type</span> <span class="dt">&lt;node&gt;</span> start sexpr pair list members atom</span>
<span id="cb26-131"><a href="#cb26-131" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-132"><a href="#cb26-132" aria-hidden="true" tabindex="-1"></a><span class="co">/* if there&#39;s an error partway through parsing, the</span></span>
<span id="cb26-133"><a href="#cb26-133" aria-hidden="true" tabindex="-1"></a><span class="co">   caller wouldn&#39;t get a chance to free memory for</span></span>
<span id="cb26-134"><a href="#cb26-134" aria-hidden="true" tabindex="-1"></a><span class="co">   the work in progress. Bison will clean up the memory</span></span>
<span id="cb26-135"><a href="#cb26-135" aria-hidden="true" tabindex="-1"></a><span class="co">   if we provide destructors, though. */</span></span>
<span id="cb26-136"><a href="#cb26-136" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-137"><a href="#cb26-137" aria-hidden="true" tabindex="-1"></a><span class="kw">%destructor</span> { free($$); } &lt;str&gt;</span>
<span id="cb26-138"><a href="#cb26-138" aria-hidden="true" tabindex="-1"></a><span class="kw">%destructor</span> { sexpr_free($$); } &lt;node&gt;</span>
<span id="cb26-139"><a href="#cb26-139" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-140"><a href="#cb26-140" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb26-141"><a href="#cb26-141" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-142"><a href="#cb26-142" aria-hidden="true" tabindex="-1"></a><span class="st"> </span><span class="co">/* once again we use a dummy non-terminal to perform</span></span>
<span id="cb26-143"><a href="#cb26-143" aria-hidden="true" tabindex="-1"></a><span class="co">    a side-effect, in this case setting *result */</span></span>
<span id="cb26-144"><a href="#cb26-144" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-145"><a href="#cb26-145" aria-hidden="true" tabindex="-1"></a><span class="st">start </span>:</span>
<span id="cb26-146"><a href="#cb26-146" aria-hidden="true" tabindex="-1"></a>  sexpr   { <span class="op">*</span>result <span class="op">=</span> <span class="kw">$$</span> <span class="op">=</span> <span class="kw">$1</span><span class="op">;</span> <span class="cf">return</span> <span class="dv">0</span><span class="op">;</span> }</span>
<span id="cb26-147"><a href="#cb26-147" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb26-148"><a href="#cb26-148" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-149"><a href="#cb26-149" aria-hidden="true" tabindex="-1"></a><span class="st">sexpr </span>:</span>
<span id="cb26-150"><a href="#cb26-150" aria-hidden="true" tabindex="-1"></a>  atom</span>
<span id="cb26-151"><a href="#cb26-151" aria-hidden="true" tabindex="-1"></a>| list</span>
<span id="cb26-152"><a href="#cb26-152" aria-hidden="true" tabindex="-1"></a>| pair</span>
<span id="cb26-153"><a href="#cb26-153" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb26-154"><a href="#cb26-154" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-155"><a href="#cb26-155" aria-hidden="true" tabindex="-1"></a><span class="st">list </span>:</span>
<span id="cb26-156"><a href="#cb26-156" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-157"><a href="#cb26-157" aria-hidden="true" tabindex="-1"></a>  <span class="co">/* This is a shortcut: we use the ASCII value for</span></span>
<span id="cb26-158"><a href="#cb26-158" aria-hidden="true" tabindex="-1"></a><span class="co">     parens &#39;(&#39;=40, &#39;)&#39;=41 as their token codes.</span></span>
<span id="cb26-159"><a href="#cb26-159" aria-hidden="true" tabindex="-1"></a><span class="co">     Thus we don&#39;t have to define a bunch of crap</span></span>
<span id="cb26-160"><a href="#cb26-160" aria-hidden="true" tabindex="-1"></a><span class="co">     manually like LPAREN, RPAREN */</span></span>
<span id="cb26-161"><a href="#cb26-161" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-162"><a href="#cb26-162" aria-hidden="true" tabindex="-1"></a>  <span class="ch">&#39;(&#39;</span> members <span class="ch">&#39;)&#39;</span> { <span class="kw">$$</span> <span class="op">=</span> <span class="kw">$2</span><span class="op">;</span> }</span>
<span id="cb26-163"><a href="#cb26-163" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-164"><a href="#cb26-164" aria-hidden="true" tabindex="-1"></a>| <span class="ch">&#39;(&#39;&#39;)&#39;</span> {</span>
<span id="cb26-165"><a href="#cb26-165" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> sexpr <span class="op">*</span>nil <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span> <span class="op">*</span>nil<span class="op">);</span></span>
<span id="cb26-166"><a href="#cb26-166" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>nil<span class="op">)</span> YYNOMEM<span class="op">;</span></span>
<span id="cb26-167"><a href="#cb26-167" aria-hidden="true" tabindex="-1"></a>	<span class="op">*</span>nil <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> sexpr<span class="op">)</span>{<span class="op">.</span>type <span class="op">=</span> SEXPR_NIL}<span class="op">;</span></span>
<span id="cb26-168"><a href="#cb26-168" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> nil<span class="op">;</span></span>
<span id="cb26-169"><a href="#cb26-169" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb26-170"><a href="#cb26-170" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb26-171"><a href="#cb26-171" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-172"><a href="#cb26-172" aria-hidden="true" tabindex="-1"></a><span class="st">members </span>:</span>
<span id="cb26-173"><a href="#cb26-173" aria-hidden="true" tabindex="-1"></a>  sexpr {</span>
<span id="cb26-174"><a href="#cb26-174" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> sexpr <span class="op">*</span>s <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span> <span class="op">*</span>s<span class="op">),</span></span>
<span id="cb26-175"><a href="#cb26-175" aria-hidden="true" tabindex="-1"></a>				 <span class="op">*</span>nil <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span> <span class="op">*</span>nil<span class="op">);</span></span>
<span id="cb26-176"><a href="#cb26-176" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>s <span class="op">||</span> <span class="op">!</span>nil<span class="op">)</span> {</span>
<span id="cb26-177"><a href="#cb26-177" aria-hidden="true" tabindex="-1"></a>		free<span class="op">(</span>s<span class="op">),</span> free<span class="op">(</span>nil<span class="op">);</span></span>
<span id="cb26-178"><a href="#cb26-178" aria-hidden="true" tabindex="-1"></a>		YYNOMEM<span class="op">;</span></span>
<span id="cb26-179"><a href="#cb26-179" aria-hidden="true" tabindex="-1"></a>	}</span>
<span id="cb26-180"><a href="#cb26-180" aria-hidden="true" tabindex="-1"></a>	<span class="op">*</span>nil <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> sexpr<span class="op">)</span>{<span class="op">.</span>type <span class="op">=</span> SEXPR_NIL}<span class="op">;</span></span>
<span id="cb26-181"><a href="#cb26-181" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-182"><a href="#cb26-182" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* convention: we assume that a previous parser</span></span>
<span id="cb26-183"><a href="#cb26-183" aria-hidden="true" tabindex="-1"></a><span class="co">	   value like $1 is non-NULL, else it would have</span></span>
<span id="cb26-184"><a href="#cb26-184" aria-hidden="true" tabindex="-1"></a><span class="co">	   died already with YYNOMEM. We&#39;re responsible</span></span>
<span id="cb26-185"><a href="#cb26-185" aria-hidden="true" tabindex="-1"></a><span class="co">	   for checking only our own allocations */</span></span>
<span id="cb26-186"><a href="#cb26-186" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-187"><a href="#cb26-187" aria-hidden="true" tabindex="-1"></a>	<span class="op">*</span>s <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> sexpr<span class="op">)</span>{</span>
<span id="cb26-188"><a href="#cb26-188" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>type <span class="op">=</span> SEXPR_PAIR<span class="op">,</span></span>
<span id="cb26-189"><a href="#cb26-189" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>left <span class="op">=</span> <span class="kw">$1</span><span class="op">,</span></span>
<span id="cb26-190"><a href="#cb26-190" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>right <span class="op">=</span> nil</span>
<span id="cb26-191"><a href="#cb26-191" aria-hidden="true" tabindex="-1"></a>	}<span class="op">;</span></span>
<span id="cb26-192"><a href="#cb26-192" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> s<span class="op">;</span></span>
<span id="cb26-193"><a href="#cb26-193" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb26-194"><a href="#cb26-194" aria-hidden="true" tabindex="-1"></a>| sexpr members {</span>
<span id="cb26-195"><a href="#cb26-195" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> sexpr <span class="op">*</span>s <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span> <span class="op">*</span>s<span class="op">);</span></span>
<span id="cb26-196"><a href="#cb26-196" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-197"><a href="#cb26-197" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* Another important memory convention: we</span></span>
<span id="cb26-198"><a href="#cb26-198" aria-hidden="true" tabindex="-1"></a><span class="co">	   can&#39;t trust that our lexer successfully</span></span>
<span id="cb26-199"><a href="#cb26-199" aria-hidden="true" tabindex="-1"></a><span class="co">	   allocated its yylval, because the signature</span></span>
<span id="cb26-200"><a href="#cb26-200" aria-hidden="true" tabindex="-1"></a><span class="co">	   of yylex doesn&#39;t communicate failure. We</span></span>
<span id="cb26-201"><a href="#cb26-201" aria-hidden="true" tabindex="-1"></a><span class="co">	   assume NULL in $1 means alloc failure and</span></span>
<span id="cb26-202"><a href="#cb26-202" aria-hidden="true" tabindex="-1"></a><span class="co">	   we report that. The only other way to signal</span></span>
<span id="cb26-203"><a href="#cb26-203" aria-hidden="true" tabindex="-1"></a><span class="co">	   from yylex would be to make a fake token to</span></span>
<span id="cb26-204"><a href="#cb26-204" aria-hidden="true" tabindex="-1"></a><span class="co">	   represent out-of-memory, but that&#39;s harder */</span></span>
<span id="cb26-205"><a href="#cb26-205" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-206"><a href="#cb26-206" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>s<span class="op">)</span> YYNOMEM<span class="op">;</span></span>
<span id="cb26-207"><a href="#cb26-207" aria-hidden="true" tabindex="-1"></a>	<span class="op">*</span>s <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> sexpr<span class="op">)</span>{</span>
<span id="cb26-208"><a href="#cb26-208" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>type <span class="op">=</span> SEXPR_PAIR<span class="op">,</span></span>
<span id="cb26-209"><a href="#cb26-209" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>left <span class="op">=</span> <span class="kw">$1</span><span class="op">,</span></span>
<span id="cb26-210"><a href="#cb26-210" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>right <span class="op">=</span> <span class="kw">$2</span></span>
<span id="cb26-211"><a href="#cb26-211" aria-hidden="true" tabindex="-1"></a>	}<span class="op">;</span></span>
<span id="cb26-212"><a href="#cb26-212" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> s<span class="op">;</span></span>
<span id="cb26-213"><a href="#cb26-213" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb26-214"><a href="#cb26-214" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb26-215"><a href="#cb26-215" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-216"><a href="#cb26-216" aria-hidden="true" tabindex="-1"></a><span class="st">pair </span>:</span>
<span id="cb26-217"><a href="#cb26-217" aria-hidden="true" tabindex="-1"></a>  <span class="ch">&#39;(&#39;</span> sexpr <span class="ch">&#39;.&#39;</span> sexpr <span class="ch">&#39;)&#39;</span> {</span>
<span id="cb26-218"><a href="#cb26-218" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> sexpr <span class="op">*</span>s <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span> <span class="op">*</span>s<span class="op">);</span></span>
<span id="cb26-219"><a href="#cb26-219" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>s<span class="op">)</span> YYNOMEM<span class="op">;</span></span>
<span id="cb26-220"><a href="#cb26-220" aria-hidden="true" tabindex="-1"></a>	<span class="op">*</span>s <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> sexpr<span class="op">)</span>{</span>
<span id="cb26-221"><a href="#cb26-221" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>type <span class="op">=</span> SEXPR_PAIR<span class="op">,</span></span>
<span id="cb26-222"><a href="#cb26-222" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>left <span class="op">=</span> <span class="kw">$2</span><span class="op">,</span></span>
<span id="cb26-223"><a href="#cb26-223" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>right <span class="op">=</span> <span class="kw">$4</span></span>
<span id="cb26-224"><a href="#cb26-224" aria-hidden="true" tabindex="-1"></a>	}<span class="op">;</span></span>
<span id="cb26-225"><a href="#cb26-225" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> s<span class="op">;</span></span>
<span id="cb26-226"><a href="#cb26-226" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb26-227"><a href="#cb26-227" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb26-228"><a href="#cb26-228" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-229"><a href="#cb26-229" aria-hidden="true" tabindex="-1"></a><span class="st">atom </span>:</span>
<span id="cb26-230"><a href="#cb26-230" aria-hidden="true" tabindex="-1"></a>  ID {</span>
<span id="cb26-231"><a href="#cb26-231" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span><span class="kw">$1</span><span class="op">)</span> YYNOMEM<span class="op">;</span></span>
<span id="cb26-232"><a href="#cb26-232" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-233"><a href="#cb26-233" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> sexpr <span class="op">*</span>s <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span> <span class="op">*</span>s<span class="op">);</span></span>
<span id="cb26-234"><a href="#cb26-234" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>s<span class="op">)</span> YYNOMEM<span class="op">;</span></span>
<span id="cb26-235"><a href="#cb26-235" aria-hidden="true" tabindex="-1"></a>	<span class="op">*</span>s <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> sexpr<span class="op">)</span>{</span>
<span id="cb26-236"><a href="#cb26-236" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>type <span class="op">=</span> SEXPR_ID<span class="op">,</span></span>
<span id="cb26-237"><a href="#cb26-237" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>value<span class="op">.</span>id <span class="op">=</span> strdup<span class="op">(</span><span class="kw">$1</span><span class="op">)</span></span>
<span id="cb26-238"><a href="#cb26-238" aria-hidden="true" tabindex="-1"></a>	}<span class="op">;</span></span>
<span id="cb26-239"><a href="#cb26-239" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>s<span class="op">-&gt;</span>value<span class="op">.</span>id<span class="op">)</span></span>
<span id="cb26-240"><a href="#cb26-240" aria-hidden="true" tabindex="-1"></a>	{</span>
<span id="cb26-241"><a href="#cb26-241" aria-hidden="true" tabindex="-1"></a>		free<span class="op">(</span>s<span class="op">);</span></span>
<span id="cb26-242"><a href="#cb26-242" aria-hidden="true" tabindex="-1"></a>		YYNOMEM<span class="op">;</span></span>
<span id="cb26-243"><a href="#cb26-243" aria-hidden="true" tabindex="-1"></a>	}</span>
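<span id="cb26-244"><a href="#cb26-244" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span><span class="kw">$1</span><span class="op">);</span> <span class="kw">$$</span> <span class="op">=</span> s<span class="op">;</span> <span class="co">/* we kept a copy, so release the lexer&#39;s string */</span></span>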
<span id="cb26-245"><a href="#cb26-245" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb26-246"><a href="#cb26-246" aria-hidden="true" tabindex="-1"></a>| NUM {</span>
<span id="cb26-247"><a href="#cb26-247" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> sexpr <span class="op">*</span>s <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span> <span class="op">*</span>s<span class="op">);</span></span>
<span id="cb26-248"><a href="#cb26-248" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>s<span class="op">)</span> YYNOMEM<span class="op">;</span></span>
<span id="cb26-249"><a href="#cb26-249" aria-hidden="true" tabindex="-1"></a>	<span class="op">*</span>s <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> sexpr<span class="op">)</span>{</span>
<span id="cb26-250"><a href="#cb26-250" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>type <span class="op">=</span> SEXPR_NUM<span class="op">,</span></span>
<span id="cb26-251"><a href="#cb26-251" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>value<span class="op">.</span>num <span class="op">=</span> <span class="kw">$1</span></span>
<span id="cb26-252"><a href="#cb26-252" aria-hidden="true" tabindex="-1"></a>	}<span class="op">;</span></span>
<span id="cb26-253"><a href="#cb26-253" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> s<span class="op">;</span></span>
<span id="cb26-254"><a href="#cb26-254" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb26-255"><a href="#cb26-255" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb26-256"><a href="#cb26-256" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-257"><a href="#cb26-257" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb26-258"><a href="#cb26-258" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-259"><a href="#cb26-259" aria-hidden="true" tabindex="-1"></a><span class="co">/* notice the extra parameters required</span></span>
<span id="cb26-260"><a href="#cb26-260" aria-hidden="true" tabindex="-1"></a><span class="co">   by %param and %parse-param */</span></span>
<span id="cb26-261"><a href="#cb26-261" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-262"><a href="#cb26-262" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> lisperror<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>yylval<span class="op">,</span> <span class="dt">char</span> <span class="at">const</span> <span class="op">*</span>msg<span class="op">,</span> <span class="at">const</span> <span class="dt">void</span> <span class="op">*</span>s<span class="op">)</span></span>
<span id="cb26-263"><a href="#cb26-263" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb26-264"><a href="#cb26-264" aria-hidden="true" tabindex="-1"></a>	<span class="op">(</span><span class="dt">void</span><span class="op">)</span>yylval<span class="op">;</span></span>
<span id="cb26-265"><a href="#cb26-265" aria-hidden="true" tabindex="-1"></a>	<span class="op">(</span><span class="dt">void</span><span class="op">)</span>s<span class="op">;</span></span>
<span id="cb26-266"><a href="#cb26-266" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;</span><span class="sc">%s\n</span><span class="st">&quot;</span><span class="op">,</span> msg<span class="op">);</span></span>
<span id="cb26-267"><a href="#cb26-267" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb26-268"><a href="#cb26-268" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-269"><a href="#cb26-269" aria-hidden="true" tabindex="-1"></a><span class="co">/* useful internally by us, and externally by callers */</span></span>
<span id="cb26-270"><a href="#cb26-270" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb26-271"><a href="#cb26-271" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> sexpr_free<span class="op">(</span><span class="kw">struct</span> sexpr <span class="op">*</span>s<span class="op">)</span></span>
<span id="cb26-272"><a href="#cb26-272" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb26-273"><a href="#cb26-273" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>s<span class="op">)</span></span>
<span id="cb26-274"><a href="#cb26-274" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span><span class="op">;</span></span>
<span id="cb26-275"><a href="#cb26-275" aria-hidden="true" tabindex="-1"></a>	</span>
<span id="cb26-276"><a href="#cb26-276" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>s<span class="op">-&gt;</span>type <span class="op">==</span> SEXPR_ID<span class="op">)</span></span>
<span id="cb26-277"><a href="#cb26-277" aria-hidden="true" tabindex="-1"></a>		free<span class="op">(</span>s<span class="op">-&gt;</span>value<span class="op">.</span>id<span class="op">);</span></span>
<span id="cb26-278"><a href="#cb26-278" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>s<span class="op">-&gt;</span>type <span class="op">==</span> SEXPR_PAIR<span class="op">)</span></span>
<span id="cb26-279"><a href="#cb26-279" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb26-280"><a href="#cb26-280" aria-hidden="true" tabindex="-1"></a>		sexpr_free<span class="op">(</span>s<span class="op">-&gt;</span>left<span class="op">);</span></span>
<span id="cb26-281"><a href="#cb26-281" aria-hidden="true" tabindex="-1"></a>		sexpr_free<span class="op">(</span>s<span class="op">-&gt;</span>right<span class="op">);</span></span>
<span id="cb26-282"><a href="#cb26-282" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb26-283"><a href="#cb26-283" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>s<span class="op">);</span></span>
<span id="cb26-284"><a href="#cb26-284" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
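<p>To make the resulting tree shape concrete, here is a minimal sketch (not part of the generated parser) of walking the cons cells that the <code>members</code> rules build. The struct repeats the layout from <code>lisp.y</code>; the order of the enum constants and the helper <code>sexpr_list_len</code> are assumptions made for illustration. A list such as <code>(1 2)</code> becomes a pair whose right side is another pair ending in NIL, while a dotted pair such as <code>(1 . 2)</code> never reaches NIL.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stddef.h&gt;

/* Mirrors the shape of struct sexpr in lisp.y. The order
   of the enum constants is an assumption for this sketch. */
enum sexpr_type { SEXPR_ID, SEXPR_NUM, SEXPR_PAIR, SEXPR_NIL };

struct sexpr
{
	enum sexpr_type type;
	union
	{
		int   num;
		char *id;
	} value;
	struct sexpr *left, *right;
};

/* Hypothetical helper: count the elements of a proper list.
   Returns -1 when the chain of pairs does not end in NIL,
   which is the case for a dotted pair like (1 . 2), for a
   bare atom, and for a NULL pointer. */
int sexpr_list_len(const struct sexpr *s)
{
	int n = 0;

	while (s &amp;&amp; s-&gt;type == SEXPR_PAIR)
	{
		n++;
		s = s-&gt;right;
	}
	return (s &amp;&amp; s-&gt;type == SEXPR_NIL) ? n : -1;
}</code></pre></div>
<p>Given the tree the parser stores in <code>*result</code>, a caller could use a helper like this to tell a proper list from a dotted pair before iterating over it, then release the whole tree with <code>sexpr_free</code>.</p>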
<p>The parser does the bulk of the work. We just need to pair it with a scanner that reads atoms and parens.</p>
<div class="sourceCode" id="cb27"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* lisp.l */</span></span>
<span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-3"><a href="#cb27-3" aria-hidden="true" tabindex="-1"></a><span class="co">/* disable unused functions so we don&#39;t</span></span>
<span id="cb27-4"><a href="#cb27-4" aria-hidden="true" tabindex="-1"></a><span class="co">   get compiler warnings about them */</span></span>
<span id="cb27-5"><a href="#cb27-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-6"><a href="#cb27-6" aria-hidden="true" tabindex="-1"></a><span class="kw">%option noyywrap nounput noinput</span></span>
<span id="cb27-7"><a href="#cb27-7" aria-hidden="true" tabindex="-1"></a><span class="kw">%option noyyalloc noyyrealloc noyyfree</span></span>
<span id="cb27-8"><a href="#cb27-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-9"><a href="#cb27-9" aria-hidden="true" tabindex="-1"></a><span class="co">/* change our prefix from yy to lisp */</span></span>
<span id="cb27-10"><a href="#cb27-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-11"><a href="#cb27-11" aria-hidden="true" tabindex="-1"></a><span class="kw">%option prefix=&quot;lisp&quot;</span></span>
<span id="cb27-12"><a href="#cb27-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-13"><a href="#cb27-13" aria-hidden="true" tabindex="-1"></a><span class="co">/* use the pure parser calling convention */</span></span>
<span id="cb27-14"><a href="#cb27-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-15"><a href="#cb27-15" aria-hidden="true" tabindex="-1"></a><span class="kw">%option reentrant bison-bridge</span></span>
<span id="cb27-16"><a href="#cb27-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-17"><a href="#cb27-17" aria-hidden="true" tabindex="-1"></a><span class="bn">%{</span></span>
<span id="cb27-18"><a href="#cb27-18" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&quot;lisp.tab.h&quot;</span></span>
<span id="cb27-19"><a href="#cb27-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-20"><a href="#cb27-20" aria-hidden="true" tabindex="-1"></a><span class="pp">#define YY_EXIT_FAILURE </span><span class="op">((</span><span class="dt">void</span><span class="op">)</span>yyscanner<span class="op">,</span><span class="pp"> </span>EXIT_FAILURE<span class="op">)</span></span>
<span id="cb27-21"><a href="#cb27-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-22"><a href="#cb27-22" aria-hidden="true" tabindex="-1"></a><span class="co">/* XOPEN for strdup */</span></span>
<span id="cb27-23"><a href="#cb27-23" aria-hidden="true" tabindex="-1"></a><span class="pp">#define _XOPEN_SOURCE </span><span class="dv">600</span></span>
<span id="cb27-24"><a href="#cb27-24" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;limits.h&gt;</span></span>
<span id="cb27-25"><a href="#cb27-25" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb27-26"><a href="#cb27-26" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb27-27"><a href="#cb27-27" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-28"><a href="#cb27-28" aria-hidden="true" tabindex="-1"></a><span class="co">/* seems like a bug that I have to do this, since flex</span></span>
<span id="cb27-29"><a href="#cb27-29" aria-hidden="true" tabindex="-1"></a><span class="co">   should know prefix=lisp and match bison&#39;s LISPSTYPE */</span></span>
<span id="cb27-30"><a href="#cb27-30" aria-hidden="true" tabindex="-1"></a><span class="pp">#define YYSTYPE </span>LISPSTYPE</span>
<span id="cb27-31"><a href="#cb27-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-32"><a href="#cb27-32" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> lisperror<span class="op">(</span><span class="at">const</span> <span class="dt">char</span> <span class="op">*</span>msg<span class="op">);</span></span>
<span id="cb27-33"><a href="#cb27-33" aria-hidden="true" tabindex="-1"></a><span class="bn">%}</span></span>
<span id="cb27-34"><a href="#cb27-34" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-35"><a href="#cb27-35" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb27-36"><a href="#cb27-36" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-37"><a href="#cb27-37" aria-hidden="true" tabindex="-1"></a><span class="st">[[:alpha:]][[:alnum:]]*</span> {</span>
<span id="cb27-38"><a href="#cb27-38" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* The memory that yytext points to gets overwritten</span></span>
<span id="cb27-39"><a href="#cb27-39" aria-hidden="true" tabindex="-1"></a><span class="co">	   each time a pattern matches. We need to give the caller</span></span>
<span id="cb27-40"><a href="#cb27-40" aria-hidden="true" tabindex="-1"></a><span class="co">	   a copy. Also, if strdup fails and returns NULL, it&#39;s up</span></span>
<span id="cb27-41"><a href="#cb27-41" aria-hidden="true" tabindex="-1"></a><span class="co">	   to the caller (the parser) to detect that.</span></span>
<span id="cb27-42"><a href="#cb27-42" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-43"><a href="#cb27-43" aria-hidden="true" tabindex="-1"></a><span class="co">	   Notice yylval is a pointer to union now.  It&#39;s passed</span></span>
<span id="cb27-44"><a href="#cb27-44" aria-hidden="true" tabindex="-1"></a><span class="co">	   as an arg to yylex in pure parsing mode */</span></span>
<span id="cb27-45"><a href="#cb27-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-46"><a href="#cb27-46" aria-hidden="true" tabindex="-1"></a>	yylval<span class="op">-&gt;</span>str <span class="op">=</span> strdup<span class="op">(</span>yytext<span class="op">);</span></span>
<span id="cb27-47"><a href="#cb27-47" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> ID<span class="op">;</span></span>
<span id="cb27-48"><a href="#cb27-48" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb27-49"><a href="#cb27-49" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-50"><a href="#cb27-50" aria-hidden="true" tabindex="-1"></a><span class="st">[-+]?[[:digit:]]+</span> {</span>
<span id="cb27-51"><a href="#cb27-51" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> n <span class="op">=</span> strtol<span class="op">(</span>yytext<span class="op">,</span> NULL<span class="op">,</span> <span class="dv">10</span><span class="op">);</span></span>
<span id="cb27-52"><a href="#cb27-52" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-53"><a href="#cb27-53" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>n <span class="op">&lt;</span> INT_MIN <span class="op">||</span> n <span class="op">&gt;</span> INT_MAX<span class="op">)</span></span>
<span id="cb27-54"><a href="#cb27-54" aria-hidden="true" tabindex="-1"></a>		lisperror<span class="op">(</span><span class="st">&quot;Number out of range&quot;</span><span class="op">);</span></span>
<span id="cb27-55"><a href="#cb27-55" aria-hidden="true" tabindex="-1"></a>	yylval<span class="op">-&gt;</span>num <span class="op">=</span> <span class="op">(</span><span class="dt">int</span><span class="op">)</span>n<span class="op">;</span></span>
<span id="cb27-56"><a href="#cb27-56" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> NUM<span class="op">;</span></span>
<span id="cb27-57"><a href="#cb27-57" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb27-58"><a href="#cb27-58" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-59"><a href="#cb27-59" aria-hidden="true" tabindex="-1"></a><span class="st">[[:space:]]</span>  <span class="op">;</span> <span class="co">/* ignore */</span></span>
<span id="cb27-60"><a href="#cb27-60" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-61"><a href="#cb27-61" aria-hidden="true" tabindex="-1"></a> <span class="co">/* this is a handy rule to return the ASCII value</span></span>
<span id="cb27-62"><a href="#cb27-62" aria-hidden="true" tabindex="-1"></a><span class="co">    of any other character. Importantly, parens */</span></span>
<span id="cb27-63"><a href="#cb27-63" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb27-64"><a href="#cb27-64" aria-hidden="true" tabindex="-1"></a><span class="st">.</span> { <span class="cf">return</span> <span class="op">*</span>yytext<span class="op">;</span> }</span></code></pre></div>
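<p>The range check in the number rule only fires when <code>long</code> is wider than <code>int</code>. Where the two types have the same width, <code>strtol</code> clamps an oversized value to <code>LONG_MAX</code> or <code>LONG_MIN</code> and signals the problem only through <code>errno</code>. As a minimal sketch, a helper along these lines checks both conditions. The name <code>parse_int</code> is hypothetical and not part of the scanner above.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;errno.h&gt;
#include &lt;limits.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;stdlib.h&gt;

/* Hypothetical helper: convert a decimal string to int.
   Returns true and stores the value in *out on success;
   returns false if the text is not a number that fits in int. */
bool parse_int(const char *text, int *out)
{
	char *end;
	long n;

	errno = 0;
	n = strtol(text, &amp;end, 10);

	/* ERANGE: the value overflowed long itself */
	if (errno == ERANGE)
		return false;
	/* long may be wider than int, so check that range too */
	if (n &lt; INT_MIN || n &gt; INT_MAX)
		return false;
	/* reject empty input and trailing junk */
	if (end == text || *end != &#39;\0&#39;)
		return false;

	*out = (int)n;
	return true;
}</code></pre>
<p>In the scanner, the action could then call something like <code>parse_int(yytext, &amp;yylval-&gt;num)</code> and report an error when it returns false, rather than casting a value that is known to be out of range.</p>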
<p>Finally, here’s how to call the parser from a regular program.</p>
<div class="sourceCode" id="cb28"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb28-1"><a href="#cb28-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* driver_lisp.c */</span></span>
<span id="cb28-2"><a href="#cb28-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-3"><a href="#cb28-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb28-4"><a href="#cb28-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb28-5"><a href="#cb28-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-6"><a href="#cb28-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#define YYSTYPE LISPSTYPE</span></span>
<span id="cb28-7"><a href="#cb28-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&quot;lisp.tab.h&quot;</span></span>
<span id="cb28-8"><a href="#cb28-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&quot;lisp.lex.h&quot;</span></span>
<span id="cb28-9"><a href="#cb28-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-10"><a href="#cb28-10" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> sexpr_print<span class="op">(</span><span class="kw">struct</span> sexpr<span class="op">*</span> s<span class="op">,</span> <span class="dt">unsigned</span> depth<span class="op">)</span></span>
<span id="cb28-11"><a href="#cb28-11" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb28-12"><a href="#cb28-12" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span><span class="dt">unsigned</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> depth<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb28-13"><a href="#cb28-13" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;  &quot;</span><span class="op">);</span></span>
<span id="cb28-14"><a href="#cb28-14" aria-hidden="true" tabindex="-1"></a>	<span class="cf">switch</span> <span class="op">(</span>s<span class="op">-&gt;</span>type<span class="op">)</span></span>
<span id="cb28-15"><a href="#cb28-15" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb28-16"><a href="#cb28-16" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> SEXPR_ID<span class="op">:</span></span>
<span id="cb28-17"><a href="#cb28-17" aria-hidden="true" tabindex="-1"></a>			puts<span class="op">(</span>s<span class="op">-&gt;</span>value<span class="op">.</span>id<span class="op">);</span></span>
<span id="cb28-18"><a href="#cb28-18" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb28-19"><a href="#cb28-19" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> SEXPR_NUM<span class="op">:</span></span>
<span id="cb28-20"><a href="#cb28-20" aria-hidden="true" tabindex="-1"></a>			printf<span class="op">(</span><span class="st">&quot;%d</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> s<span class="op">-&gt;</span>value<span class="op">.</span>num<span class="op">);</span></span>
<span id="cb28-21"><a href="#cb28-21" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb28-22"><a href="#cb28-22" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> SEXPR_PAIR<span class="op">:</span></span>
<span id="cb28-23"><a href="#cb28-23" aria-hidden="true" tabindex="-1"></a>			puts<span class="op">(</span><span class="st">&quot;.&quot;</span><span class="op">);</span></span>
<span id="cb28-24"><a href="#cb28-24" aria-hidden="true" tabindex="-1"></a>			sexpr_print<span class="op">(</span>s<span class="op">-&gt;</span>left<span class="op">,</span> depth<span class="op">+</span><span class="dv">1</span><span class="op">);</span></span>
<span id="cb28-25"><a href="#cb28-25" aria-hidden="true" tabindex="-1"></a>			sexpr_print<span class="op">(</span>s<span class="op">-&gt;</span>right<span class="op">,</span> depth<span class="op">+</span><span class="dv">1</span><span class="op">);</span></span>
<span id="cb28-26"><a href="#cb28-26" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb28-27"><a href="#cb28-27" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> SEXPR_NIL<span class="op">:</span></span>
<span id="cb28-28"><a href="#cb28-28" aria-hidden="true" tabindex="-1"></a>			puts<span class="op">(</span><span class="st">&quot;()&quot;</span><span class="op">);</span></span>
<span id="cb28-29"><a href="#cb28-29" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb28-30"><a href="#cb28-30" aria-hidden="true" tabindex="-1"></a>		<span class="cf">default</span><span class="op">:</span></span>
<span id="cb28-31"><a href="#cb28-31" aria-hidden="true" tabindex="-1"></a>			abort<span class="op">();</span></span>
<span id="cb28-32"><a href="#cb28-32" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb28-33"><a href="#cb28-33" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb28-34"><a href="#cb28-34" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-35"><a href="#cb28-35" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb28-36"><a href="#cb28-36" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb28-37"><a href="#cb28-37" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> i<span class="op">;</span></span>
<span id="cb28-38"><a href="#cb28-38" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> sexpr <span class="op">*</span>expr<span class="op">;</span></span>
<span id="cb28-39"><a href="#cb28-39" aria-hidden="true" tabindex="-1"></a>	yyscan_t scanner<span class="op">;</span></span>
<span id="cb28-40"><a href="#cb28-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-41"><a href="#cb28-41" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">((</span>i <span class="op">=</span> lisplex_init<span class="op">(&amp;</span>scanner<span class="op">))</span> <span class="op">!=</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb28-42"><a href="#cb28-42" aria-hidden="true" tabindex="-1"></a>		exit<span class="op">(</span>i<span class="op">);</span></span>
<span id="cb28-43"><a href="#cb28-43" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-44"><a href="#cb28-44" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> e <span class="op">=</span> lispparse<span class="op">(&amp;</span>expr<span class="op">,</span> scanner<span class="op">);</span></span>
<span id="cb28-45"><a href="#cb28-45" aria-hidden="true" tabindex="-1"></a>	printf<span class="op">(</span><span class="st">&quot;Code = %d</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> e<span class="op">);</span></span>
<span id="cb28-46"><a href="#cb28-46" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>e <span class="op">==</span> <span class="dv">0</span> <span class="co">/* success */</span><span class="op">)</span></span>
<span id="cb28-47"><a href="#cb28-47" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb28-48"><a href="#cb28-48" aria-hidden="true" tabindex="-1"></a>		sexpr_print<span class="op">(</span>expr<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb28-49"><a href="#cb28-49" aria-hidden="true" tabindex="-1"></a>		sexpr_free<span class="op">(</span>expr<span class="op">);</span></span>
<span id="cb28-50"><a href="#cb28-50" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb28-51"><a href="#cb28-51" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb28-52"><a href="#cb28-52" aria-hidden="true" tabindex="-1"></a>	lisplex_destroy<span class="op">(</span>scanner<span class="op">);</span></span>
<span id="cb28-53"><a href="#cb28-53" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb28-54"><a href="#cb28-54" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>To build it, use the Makefile pattern from <code>roman</code> to create analogous <code>lisp.lex.o</code> and <code>lisp.tab.o</code>. This example requires Flex and Bison, so set <code>LEX=flex</code> and <code>YACC=bison</code> at the top of the Makefile to override whatever system defaults are used for these programs. Finally, compile <code>driver_lisp.c</code> and link with those object files.</p>
<p>Here’s the program in action:</p>
<div class="sourceCode" id="cb29"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;(1 () (2 . 3) (4))&quot;</span> <span class="kw">|</span> <span class="ex">./driver_lisp</span></span>
<span id="cb29-2"><a href="#cb29-2" aria-hidden="true" tabindex="-1"></a><span class="ex">Code</span> = 0</span>
<span id="cb29-3"><a href="#cb29-3" aria-hidden="true" tabindex="-1"></a><span class="bu">.</span></span>
<span id="cb29-4"><a href="#cb29-4" aria-hidden="true" tabindex="-1"></a>  <span class="ex">1</span></span>
<span id="cb29-5"><a href="#cb29-5" aria-hidden="true" tabindex="-1"></a>  <span class="bu">.</span></span>
<span id="cb29-6"><a href="#cb29-6" aria-hidden="true" tabindex="-1"></a>    <span class="kw">()</span></span>
<span id="cb29-7"><a href="#cb29-7" aria-hidden="true" tabindex="-1"></a>    <span class="bu">.</span></span>
<span id="cb29-8"><a href="#cb29-8" aria-hidden="true" tabindex="-1"></a>      <span class="bu">.</span></span>
<span id="cb29-9"><a href="#cb29-9" aria-hidden="true" tabindex="-1"></a>        <span class="ex">2</span></span>
<span id="cb29-10"><a href="#cb29-10" aria-hidden="true" tabindex="-1"></a>        <span class="ex">3</span></span>
<span id="cb29-11"><a href="#cb29-11" aria-hidden="true" tabindex="-1"></a>      <span class="bu">.</span></span>
<span id="cb29-12"><a href="#cb29-12" aria-hidden="true" tabindex="-1"></a>        <span class="bu">.</span></span>
<span id="cb29-13"><a href="#cb29-13" aria-hidden="true" tabindex="-1"></a>          <span class="ex">4</span></span>
<span id="cb29-14"><a href="#cb29-14" aria-hidden="true" tabindex="-1"></a>          <span class="kw">()</span></span>
<span id="cb29-15"><a href="#cb29-15" aria-hidden="true" tabindex="-1"></a>        <span class="kw">()</span></span></code></pre></div>
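<p>The nesting in this output follows from how lists are built out of pairs: <code>(4)</code> is shorthand for <code>(4 . ())</code>, and <code>(1 2)</code> for <code>(1 . (2 . ()))</code>. The sketch below makes the hidden pairs visible by rendering a value in fully dotted notation. It is self-contained and uses a simplified, hypothetical <code>struct cell</code> as a stand-in for the article’s <code>struct sexpr</code>. The function name <code>dotted</code> is likewise an assumption.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

/* Hypothetical, simplified stand-in for struct sexpr */
enum kind { NIL, NUM, PAIR };

struct cell {
	enum kind kind;
	int num;                          /* valid when kind == NUM */
	const struct cell *left, *right;  /* valid when kind == PAIR */
};

/* Write c in fully dotted notation into buf (capacity cap),
   starting at offset pos. Returns the offset just past the
   text written; stops early if the buffer is full. */
size_t dotted(const struct cell *c, char *buf, size_t cap, size_t pos)
{
	int n = 0;

	if (pos &gt;= cap)
		return pos;
	switch (c-&gt;kind) {
	case NIL:
		n = snprintf(buf + pos, cap - pos, &quot;()&quot;);
		break;
	case NUM:
		n = snprintf(buf + pos, cap - pos, &quot;%d&quot;, c-&gt;num);
		break;
	case PAIR:
		n = snprintf(buf + pos, cap - pos, &quot;(&quot;);
		pos = dotted(c-&gt;left, buf, cap, pos + (size_t)n);
		if (pos &gt;= cap)
			return pos;
		n = snprintf(buf + pos, cap - pos, &quot; . &quot;);
		pos = dotted(c-&gt;right, buf, cap, pos + (size_t)n);
		if (pos &gt;= cap)
			return pos;
		n = snprintf(buf + pos, cap - pos, &quot;)&quot;);
		break;
	}
	return pos + (size_t)n;
}</code></pre>
<p>For a pair whose left side is the number 4 and whose right side is nil, this writes <code>(4 . ())</code>, which is the same shape as the innermost branch of the tree printed above.</p>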
<h4 id="designing-against-an-rfc">Designing against an RFC</h4>
<p>Internet Request for Comments (RFC) documents describe the syntax of many protocols and data formats. They often include complete <a href="https://en.wikipedia.org/wiki/Augmented_Backus%E2%80%93Naur_form">Augmented Backus-Naur Form</a> (ABNF) grammars, which we can convert into robust yacc parsers.</p>
<p>Let’s examine <a href="https://datatracker.ietf.org/doc/html/rfc4180">RFC 4180</a>, which describes the comma-separated value (CSV) format. It’s pretty simple, but has problematic edge cases: commas in quoted values, quoted quotes, raw newlines in quoted values, and blank-as-a-value.</p>
<p>Here’s the full grammar from the RFC. Notice how alternatives are specified with “/” rather than “|”, and how ABNF has the constructions <code>*(zero-or-more-things)</code> and <code>[optional-thing]</code>:</p>
<pre><code>file = [header CRLF] record *(CRLF record) [CRLF]

header = name *(COMMA name)

record = field *(COMMA field)

name = field

field = (escaped / non-escaped)

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

non-escaped = *TEXTDATA

COMMA = %x2C

CR = %x0D

DQUOTE =  %x22

LF = %x0A

CRLF = CR LF

TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E</code></pre>
<p>The grammar makes no distinction between lexing and parsing, although the uppercase identifiers hint at lexer tokens. While it may be tempting to translate to yacc top-down, starting at the <code>file</code> level, I’ve found the most productive way is to start with lexing.</p>
<p>We can combine most of the grammar into two lex rules to match fields:</p>
<div class="sourceCode" id="cb31"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb31-1"><a href="#cb31-1" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb31-2"><a href="#cb31-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb31-3"><a href="#cb31-3" aria-hidden="true" tabindex="-1"></a><span class="st">\&quot;([^&quot;]|\&quot;\&quot;)*\&quot;</span> {</span>
<span id="cb31-4"><a href="#cb31-4" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* this is what the ABNF calls &quot;escaped&quot; */</span></span>
<span id="cb31-5"><a href="#cb31-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb31-6"><a href="#cb31-6" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* </span><span class="al">TODO</span><span class="co">: copy un-escaped internals to yylval */</span></span>
<span id="cb31-7"><a href="#cb31-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb31-8"><a href="#cb31-8" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> FIELD<span class="op">;</span></span>
<span id="cb31-9"><a href="#cb31-9" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb31-10"><a href="#cb31-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb31-11"><a href="#cb31-11" aria-hidden="true" tabindex="-1"></a><span class="st">[^&quot;,\r\n]+</span> {</span>
<span id="cb31-12"><a href="#cb31-12" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* This is *almost* what the ABNF calls &quot;non-escaped,&quot;</span></span>
<span id="cb31-13"><a href="#cb31-13" aria-hidden="true" tabindex="-1"></a><span class="co">	   except it won&#39;t match an empty field, like</span></span>
<span id="cb31-14"><a href="#cb31-14" aria-hidden="true" tabindex="-1"></a><span class="co">	   a,,b</span></span>
<span id="cb31-15"><a href="#cb31-15" aria-hidden="true" tabindex="-1"></a><span class="co">	    ^---- this</span></span>
<span id="cb31-16"><a href="#cb31-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb31-17"><a href="#cb31-17" aria-hidden="true" tabindex="-1"></a><span class="co">	   Actually, even if we tried matching an empty string,</span></span>
<span id="cb31-18"><a href="#cb31-18" aria-hidden="true" tabindex="-1"></a><span class="co">	   the comma or crlf would prove a longer match and</span></span>
<span id="cb31-19"><a href="#cb31-19" aria-hidden="true" tabindex="-1"></a><span class="co">	   trump this one.</span></span>
<span id="cb31-20"><a href="#cb31-20" aria-hidden="true" tabindex="-1"></a><span class="co">	*/</span></span>
<span id="cb31-21"><a href="#cb31-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb31-22"><a href="#cb31-22" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* </span><span class="al">TODO</span><span class="co">: capture the value to yylval */</span></span>
<span id="cb31-23"><a href="#cb31-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb31-24"><a href="#cb31-24" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* no need to bother yacc with two token types, we</span></span>
<span id="cb31-25"><a href="#cb31-25" aria-hidden="true" tabindex="-1"></a><span class="co">	   call them both FIELD. */</span></span>
<span id="cb31-26"><a href="#cb31-26" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> FIELD<span class="op">;</span></span>
<span id="cb31-27"><a href="#cb31-27" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb31-28"><a href="#cb31-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb31-29"><a href="#cb31-29" aria-hidden="true" tabindex="-1"></a> <span class="co">/* handle both UNIX and DOS style; the ABNF only lists CR LF */</span></span>
<span id="cb31-30"><a href="#cb31-30" aria-hidden="true" tabindex="-1"></a><span class="st">\n|\r\n</span>    { <span class="cf">return</span> CRLF<span class="op">;</span> }</span>
<span id="cb31-31"><a href="#cb31-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb31-32"><a href="#cb31-32" aria-hidden="true" tabindex="-1"></a> <span class="co">/* catch the comma, and any other unexpected thing */</span></span>
<span id="cb31-33"><a href="#cb31-33" aria-hidden="true" tabindex="-1"></a><span class="st">.</span>          { <span class="cf">return</span> <span class="op">*</span>yytext<span class="op">;</span> }</span></code></pre></div>
<p>With FIELD out of the way, here’s what’s left to translate:</p>
<pre><code>file = [header CRLF] record *(CRLF record) [CRLF]

header = name *(COMMA name)

record = field *(COMMA field)

name = field</code></pre>
<p>Let’s also drop the designation of the first row as the “header.” The application can choose to treat the first ordinary row as a header if desired. This simplifies the grammar to:</p>
<pre><code>file = record *(CRLF record) [CRLF]

record = field *(COMMA field)</code></pre>
<p>At this point it’s easy to convert to yacc.</p>
<div class="sourceCode" id="cb34"><pre class="sourceCode yacc"><code class="sourceCode yacc"><span id="cb34-1"><a href="#cb34-1" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> CRLF FIELD</span>
<span id="cb34-2"><a href="#cb34-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb34-3"><a href="#cb34-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb34-4"><a href="#cb34-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb34-5"><a href="#cb34-5" aria-hidden="true" tabindex="-1"></a><span class="st">file </span>:</span>
<span id="cb34-6"><a href="#cb34-6" aria-hidden="true" tabindex="-1"></a>  record</span>
<span id="cb34-7"><a href="#cb34-7" aria-hidden="true" tabindex="-1"></a>| file CRLF record</span>
<span id="cb34-8"><a href="#cb34-8" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb34-9"><a href="#cb34-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb34-10"><a href="#cb34-10" aria-hidden="true" tabindex="-1"></a><span class="st">record </span>:</span>
<span id="cb34-11"><a href="#cb34-11" aria-hidden="true" tabindex="-1"></a>  field.opt</span>
<span id="cb34-12"><a href="#cb34-12" aria-hidden="true" tabindex="-1"></a>| record <span class="ch">&#39;,&#39;</span> field.opt</span>
<span id="cb34-13"><a href="#cb34-13" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb34-14"><a href="#cb34-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb34-15"><a href="#cb34-15" aria-hidden="true" tabindex="-1"></a> <span class="co">/* Here is where we handle the potentially blank</span></span>
<span id="cb34-16"><a href="#cb34-16" aria-hidden="true" tabindex="-1"></a><span class="co">    non-escaped FIELD. The &quot;.opt&quot; suffix doesn&#39;t mean</span></span>
<span id="cb34-17"><a href="#cb34-17" aria-hidden="true" tabindex="-1"></a><span class="co">    anything to yacc, it&#39;s just a reminder for us that</span></span>
<span id="cb34-18"><a href="#cb34-18" aria-hidden="true" tabindex="-1"></a><span class="co">    this *may* match a FIELD, or nothing at all */</span></span>
<span id="cb34-19"><a href="#cb34-19" aria-hidden="true" tabindex="-1"></a><span class="st">field.opt </span>:</span>
<span id="cb34-20"><a href="#cb34-20" aria-hidden="true" tabindex="-1"></a>  <span class="co">/* empty */</span></span>
<span id="cb34-21"><a href="#cb34-21" aria-hidden="true" tabindex="-1"></a>| FIELD</span>
<span id="cb34-22"><a href="#cb34-22" aria-hidden="true" tabindex="-1"></a>;</span></code></pre></div>
<p>Matching blank fields is tricky. There are three fields in <code>a,,b</code>, no way around it. That means we have to identify some value (either a non-terminal symbol, or a terminal token) out of thin air <em>between</em> characters of input. As a corollary, given that we have to honor blank fields as existing, we’re forced to interpret e.g. a 0-byte file as one record with a single blank field.</p>
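<p>One way to convince yourself of the field count is to note that a record always has one more field than it has separating commas. The following sketch counts the commas that fall outside quoted sections. The function <code>count_fields</code> is hypothetical, and it assumes its argument is a single well-formed record with the line terminator already removed.</p>
<pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;

/* Hypothetical helper: count the fields in one CSV record.
   Commas inside double quotes do not separate fields. A doubled
   quote toggles the state twice, so it needs no special case. */
size_t count_fields(const char *record)
{
	bool quoted = false;
	size_t fields = 1; /* even an empty record has one blank field */

	for (; *record; record++) {
		if (*record == &#39;&quot;&#39;)
			quoted = !quoted;
		else if (*record == &#39;,&#39; &amp;&amp; !quoted)
			fields++;
	}
	return fields;
}</code></pre>
<p>Under these assumptions <code>count_fields(&quot;a,,b&quot;)</code> yields 3 and <code>count_fields(&quot;&quot;)</code> yields 1, matching the interpretation of a 0-byte file as one record with a single blank field.</p>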
<p>We handled the situation with an empty yacc rule in <code>field.opt</code>. An empty rule lets the parser reduce to <code>field.opt</code> without consuming any input when the lookahead token is something other than a FIELD, such as a comma or CRLF. Perhaps it’s also possible to match empty non-escaped fields with fancy tricks in the lexer (like trailing context and start conditions). However, I think an empty parser rule is more elegant.</p>
<p>Three notes about empty rules:</p>
<ol type="1">
<li>We wrote the empty rule in a way that plain yacc can understand. If you want to use a Bison extension, you can write empty rules as <code>%empty</code>, which distinguishes them from accidentally missing rules.</li>
<li>Bison’s <code>--graph</code> visualization doesn’t render empty rules properly. Use the <code>-v</code> option and examine the textual <code>.output</code> file to see the rule.</li>
<li>Adding multiple empty rules can be a common source of reduce/reduce conflicts. I ran into this with early experiments in parsing CSV, and the Bison manual <a href="https://www.gnu.org/software/bison/manual/html_node/Reduce_002fReduce.html">section 5.6</a> provides a great example.</li>
</ol>
<p>Now that we’ve seen the structure of the grammar, let’s fill in the skeleton to process the CSV content. From now on, examples in this article will use my <a href="https://github.com/begriffs/libderp">libderp</a> library for basic data structures like maps and vectors.</p>
<div class="sourceCode" id="cb35"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb35-1"><a href="#cb35-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* csv.l */</span></span>
<span id="cb35-2"><a href="#cb35-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb35-3"><a href="#cb35-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%{</span></span>
<span id="cb35-4"><a href="#cb35-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#define _XOPEN_SOURCE </span><span class="dv">600</span></span>
<span id="cb35-5"><a href="#cb35-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb35-6"><a href="#cb35-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb35-7"><a href="#cb35-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb35-8"><a href="#cb35-8" aria-hidden="true" tabindex="-1"></a><span class="co">/* the union in csv.tab.h requires the vector type, and</span></span>
<span id="cb35-9"><a href="#cb35-9" aria-hidden="true" tabindex="-1"></a><span class="co">   plain yacc doesn&#39;t have &quot;%code requires&quot; to provide</span></span>
<span id="cb35-10"><a href="#cb35-10" aria-hidden="true" tabindex="-1"></a><span class="co">   the include like Bison, so we include derp/vector.h */</span></span>
<span id="cb35-11"><a href="#cb35-11" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;derp/vector.h&gt;</span></span>
<span id="cb35-12"><a href="#cb35-12" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&quot;csv.tab.h&quot;</span></span>
<span id="cb35-13"><a href="#cb35-13" aria-hidden="true" tabindex="-1"></a><span class="bn">%}</span></span>
<span id="cb35-14"><a href="#cb35-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb35-15"><a href="#cb35-15" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb35-16"><a href="#cb35-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb35-17"><a href="#cb35-17" aria-hidden="true" tabindex="-1"></a><span class="st">\&quot;([^&quot;]|\&quot;\&quot;)*\&quot;</span> {</span>
<span id="cb35-18"><a href="#cb35-18" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* yyleng is precomputed strlen(yytext) */</span></span>
<span id="cb35-19"><a href="#cb35-19" aria-hidden="true" tabindex="-1"></a>    <span class="dt">size_t</span> i<span class="op">,</span> n <span class="op">=</span> yyleng<span class="op">;</span></span>
<span id="cb35-20"><a href="#cb35-20" aria-hidden="true" tabindex="-1"></a>    <span class="dt">char</span> <span class="op">*</span>s<span class="op">;</span></span>
<span id="cb35-21"><a href="#cb35-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb35-22"><a href="#cb35-22" aria-hidden="true" tabindex="-1"></a>    s <span class="op">=</span> yylval<span class="op">.</span>str <span class="op">=</span> calloc<span class="op">(</span>n<span class="op">,</span> <span class="dv">1</span><span class="op">);</span></span>
<span id="cb35-23"><a href="#cb35-23" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> <span class="op">(!</span>s<span class="op">)</span></span>
<span id="cb35-24"><a href="#cb35-24" aria-hidden="true" tabindex="-1"></a>        <span class="cf">return</span> FIELD<span class="op">;</span></span>
<span id="cb35-25"><a href="#cb35-25" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb35-26"><a href="#cb35-26" aria-hidden="true" tabindex="-1"></a>    <span class="co">/* copy yytext, changing &quot;&quot; to &quot; */</span></span>
<span id="cb35-27"><a href="#cb35-27" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">1</span> <span class="co">/*skip 0=&quot;*/</span><span class="op">;</span> i <span class="op">&lt;</span> n<span class="op">-</span><span class="dv">1</span><span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb35-28"><a href="#cb35-28" aria-hidden="true" tabindex="-1"></a>    {</span>
<span id="cb35-29"><a href="#cb35-29" aria-hidden="true" tabindex="-1"></a>        <span class="op">*</span>s<span class="op">++</span> <span class="op">=</span> yytext<span class="op">[</span>i<span class="op">];</span></span>
<span id="cb35-30"><a href="#cb35-30" aria-hidden="true" tabindex="-1"></a>        <span class="cf">if</span> <span class="op">(</span>yytext<span class="op">[</span>i<span class="op">]</span> <span class="op">==</span> <span class="ch">&#39;&quot;&#39;</span><span class="op">)</span></span>
<span id="cb35-31"><a href="#cb35-31" aria-hidden="true" tabindex="-1"></a>            i<span class="op">++;</span> <span class="co">/* skip second one */</span></span>
<span id="cb35-32"><a href="#cb35-32" aria-hidden="true" tabindex="-1"></a>    }</span>
<span id="cb35-33"><a href="#cb35-33" aria-hidden="true" tabindex="-1"></a>    <span class="cf">return</span> FIELD<span class="op">;</span></span>
<span id="cb35-34"><a href="#cb35-34" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb35-35"><a href="#cb35-35" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb35-36"><a href="#cb35-36" aria-hidden="true" tabindex="-1"></a><span class="st">[^&quot;,\r\n]+</span> { yylval<span class="op">.</span>str <span class="op">=</span> strdup<span class="op">(</span>yytext<span class="op">);</span> <span class="cf">return</span> FIELD<span class="op">;</span> }</span>
<span id="cb35-37"><a href="#cb35-37" aria-hidden="true" tabindex="-1"></a><span class="st">\n|\r\n</span>    { <span class="cf">return</span> CRLF<span class="op">;</span> }</span>
<span id="cb35-38"><a href="#cb35-38" aria-hidden="true" tabindex="-1"></a><span class="st">.</span>          { <span class="cf">return</span> <span class="op">*</span>yytext<span class="op">;</span> }</span></code></pre></div>
<p>The complete parser below combines values from the lexer into full records, using the vector type. It then prints each record and frees it.</p>
<div class="sourceCode" id="cb36"><pre class="sourceCode yacc"><code class="sourceCode yacc"><span id="cb36-1"><a href="#cb36-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* csv.y  (plain yacc) */</span></span>
<span id="cb36-2"><a href="#cb36-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-3"><a href="#cb36-3" aria-hidden="true" tabindex="-1"></a><span class="bn">%{</span></span>
<span id="cb36-4"><a href="#cb36-4" aria-hidden="true" tabindex="-1"></a>	<span class="pp">#include </span><span class="im">&lt;stdbool.h&gt;</span></span>
<span id="cb36-5"><a href="#cb36-5" aria-hidden="true" tabindex="-1"></a>	<span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb36-6"><a href="#cb36-6" aria-hidden="true" tabindex="-1"></a>	<span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb36-7"><a href="#cb36-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-8"><a href="#cb36-8" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* for the vector datatype and v_ functions */</span></span>
<span id="cb36-9"><a href="#cb36-9" aria-hidden="true" tabindex="-1"></a>	<span class="pp">#include </span><span class="im">&lt;derp/vector.h&gt;</span></span>
<span id="cb36-10"><a href="#cb36-10" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* for helper function derp_free */</span></span>
<span id="cb36-11"><a href="#cb36-11" aria-hidden="true" tabindex="-1"></a>	<span class="pp">#include </span><span class="im">&lt;derp/common.h&gt;</span></span>
<span id="cb36-12"><a href="#cb36-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-13"><a href="#cb36-13" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> yylex<span class="op">(</span><span class="dt">void</span><span class="op">);</span></span>
<span id="cb36-14"><a href="#cb36-14" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> yyerror<span class="op">(</span><span class="at">const</span> <span class="dt">char</span> <span class="op">*</span>s<span class="op">);</span></span>
<span id="cb36-15"><a href="#cb36-15" aria-hidden="true" tabindex="-1"></a>	<span class="dt">bool</span> one_empty_field<span class="op">(</span>vector <span class="op">*);</span></span>
<span id="cb36-16"><a href="#cb36-16" aria-hidden="true" tabindex="-1"></a><span class="bn">%}</span></span>
<span id="cb36-17"><a href="#cb36-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-18"><a href="#cb36-18" aria-hidden="true" tabindex="-1"></a><span class="kw">%union</span></span>
<span id="cb36-19"><a href="#cb36-19" aria-hidden="true" tabindex="-1"></a>{</span>
<span id="cb36-20"><a href="#cb36-20" aria-hidden="true" tabindex="-1"></a>	char *str;</span>
<span id="cb36-21"><a href="#cb36-21" aria-hidden="true" tabindex="-1"></a>	vector *record;</span>
<span id="cb36-22"><a href="#cb36-22" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb36-23"><a href="#cb36-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-24"><a href="#cb36-24" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> CRLF</span>
<span id="cb36-25"><a href="#cb36-25" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> <span class="dt">&lt;str&gt;</span> FIELD</span>
<span id="cb36-26"><a href="#cb36-26" aria-hidden="true" tabindex="-1"></a><span class="kw">%type</span> <span class="dt">&lt;str&gt;</span> field.opt</span>
<span id="cb36-27"><a href="#cb36-27" aria-hidden="true" tabindex="-1"></a><span class="kw">%type</span> <span class="dt">&lt;record&gt;</span> record</span>
<span id="cb36-28"><a href="#cb36-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-29"><a href="#cb36-29" aria-hidden="true" tabindex="-1"></a><span class="co">/* in bison, add this:</span></span>
<span id="cb36-30"><a href="#cb36-30" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-31"><a href="#cb36-31" aria-hidden="true" tabindex="-1"></a><span class="co">%destructor { free($$); } &lt;str&gt;</span></span>
<span id="cb36-32"><a href="#cb36-32" aria-hidden="true" tabindex="-1"></a><span class="co">%destructor { v_free($$); } &lt;record&gt;</span></span>
<span id="cb36-33"><a href="#cb36-33" aria-hidden="true" tabindex="-1"></a><span class="co">*/</span></span>
<span id="cb36-34"><a href="#cb36-34" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-35"><a href="#cb36-35" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb36-36"><a href="#cb36-36" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-37"><a href="#cb36-37" aria-hidden="true" tabindex="-1"></a><span class="st">file </span>:</span>
<span id="cb36-38"><a href="#cb36-38" aria-hidden="true" tabindex="-1"></a>  consumed_record</span>
<span id="cb36-39"><a href="#cb36-39" aria-hidden="true" tabindex="-1"></a>| file CRLF consumed_record</span>
<span id="cb36-40"><a href="#cb36-40" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb36-41"><a href="#cb36-41" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-42"><a href="#cb36-42" aria-hidden="true" tabindex="-1"></a><span class="co">/* A record can be constructed in two ways, but we want to</span></span>
<span id="cb36-43"><a href="#cb36-43" aria-hidden="true" tabindex="-1"></a><span class="co">   run the same side effect for either case. We add an</span></span>
<span id="cb36-44"><a href="#cb36-44" aria-hidden="true" tabindex="-1"></a><span class="co">   intermediate non-terminal symbol &quot;consumed_record&quot; just</span></span>
<span id="cb36-45"><a href="#cb36-45" aria-hidden="true" tabindex="-1"></a><span class="co">   to perform the action. In library code, this would be a</span></span>
<span id="cb36-46"><a href="#cb36-46" aria-hidden="true" tabindex="-1"></a><span class="co">   good place to send the record to a callback function. */</span></span>
<span id="cb36-47"><a href="#cb36-47" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-48"><a href="#cb36-48" aria-hidden="true" tabindex="-1"></a><span class="st">consumed_record </span>:</span>
<span id="cb36-49"><a href="#cb36-49" aria-hidden="true" tabindex="-1"></a>  record {</span>
<span id="cb36-50"><a href="#cb36-50" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* a record consisting of exactly one blank field is a</span></span>
<span id="cb36-51"><a href="#cb36-51" aria-hidden="true" tabindex="-1"></a><span class="co">	   blank record, which we can skip */</span></span>
<span id="cb36-52"><a href="#cb36-52" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>one_empty_field<span class="op">(</span><span class="kw">$1</span><span class="op">))</span></span>
<span id="cb36-53"><a href="#cb36-53" aria-hidden="true" tabindex="-1"></a>	{</span>
<span id="cb36-54"><a href="#cb36-54" aria-hidden="true" tabindex="-1"></a>		<span class="dt">size_t</span> n <span class="op">=</span> v_length<span class="op">(</span><span class="kw">$1</span><span class="op">);</span></span>
<span id="cb36-55"><a href="#cb36-55" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;#fields = </span><span class="sc">%zu\n</span><span class="st">&quot;</span><span class="op">,</span> n<span class="op">);</span></span>
<span id="cb36-56"><a href="#cb36-56" aria-hidden="true" tabindex="-1"></a>		<span class="cf">for</span> <span class="op">(</span><span class="dt">size_t</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> n<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb36-57"><a href="#cb36-57" aria-hidden="true" tabindex="-1"></a>			printf<span class="op">(</span><span class="st">&quot;</span><span class="sc">\t%s\n</span><span class="st">&quot;</span><span class="op">,</span> <span class="op">(</span><span class="dt">char</span><span class="op">*)</span>v_at<span class="op">(</span><span class="kw">$1</span><span class="op">,</span> i<span class="op">));</span></span>
<span id="cb36-58"><a href="#cb36-58" aria-hidden="true" tabindex="-1"></a>	}</span>
<span id="cb36-59"><a href="#cb36-59" aria-hidden="true" tabindex="-1"></a>	v_free<span class="op">(</span><span class="kw">$1</span><span class="op">);</span></span>
<span id="cb36-60"><a href="#cb36-60" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb36-61"><a href="#cb36-61" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb36-62"><a href="#cb36-62" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-63"><a href="#cb36-63" aria-hidden="true" tabindex="-1"></a><span class="st">record </span>:</span>
<span id="cb36-64"><a href="#cb36-64" aria-hidden="true" tabindex="-1"></a>  field.opt {</span>
<span id="cb36-65"><a href="#cb36-65" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* In our earlier example, lisp.y, we showed how to check</span></span>
<span id="cb36-66"><a href="#cb36-66" aria-hidden="true" tabindex="-1"></a><span class="co">	   for memory allocation failure. We skip that here for</span></span>
<span id="cb36-67"><a href="#cb36-67" aria-hidden="true" tabindex="-1"></a><span class="co">	   brevity. */</span></span>
<span id="cb36-68"><a href="#cb36-68" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-69"><a href="#cb36-69" aria-hidden="true" tabindex="-1"></a>	vector <span class="op">*</span>r <span class="op">=</span> v_new<span class="op">();</span></span>
<span id="cb36-70"><a href="#cb36-70" aria-hidden="true" tabindex="-1"></a>	v_dtor<span class="op">(</span>r<span class="op">,</span> derp_free<span class="op">,</span> NULL<span class="op">);</span></span>
<span id="cb36-71"><a href="#cb36-71" aria-hidden="true" tabindex="-1"></a>	v_append<span class="op">(</span>r<span class="op">,</span> <span class="kw">$1</span><span class="op">);</span></span>
<span id="cb36-72"><a href="#cb36-72" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> r<span class="op">;</span></span>
<span id="cb36-73"><a href="#cb36-73" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb36-74"><a href="#cb36-74" aria-hidden="true" tabindex="-1"></a>| record <span class="ch">&#39;,&#39;</span> field.opt {</span>
<span id="cb36-75"><a href="#cb36-75" aria-hidden="true" tabindex="-1"></a>	v_append<span class="op">(</span><span class="kw">$1</span><span class="op">,</span> <span class="kw">$3</span><span class="op">);</span></span>
<span id="cb36-76"><a href="#cb36-76" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> <span class="kw">$1</span><span class="op">;</span></span>
<span id="cb36-77"><a href="#cb36-77" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb36-78"><a href="#cb36-78" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb36-79"><a href="#cb36-79" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-80"><a href="#cb36-80" aria-hidden="true" tabindex="-1"></a><span class="st">field.opt </span>:</span>
<span id="cb36-81"><a href="#cb36-81" aria-hidden="true" tabindex="-1"></a>  <span class="co">/* empty */</span> { <span class="kw">$$</span> <span class="op">=</span> calloc<span class="op">(</span><span class="dv">1</span><span class="op">,</span><span class="dv">1</span><span class="op">);</span> }</span>
<span id="cb36-82"><a href="#cb36-82" aria-hidden="true" tabindex="-1"></a>| FIELD</span>
<span id="cb36-83"><a href="#cb36-83" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb36-84"><a href="#cb36-84" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-85"><a href="#cb36-85" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb36-86"><a href="#cb36-86" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-87"><a href="#cb36-87" aria-hidden="true" tabindex="-1"></a><span class="dt">bool</span> one_empty_field<span class="op">(</span>vector <span class="op">*</span>r<span class="op">)</span></span>
<span id="cb36-88"><a href="#cb36-88" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb36-89"><a href="#cb36-89" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> v_length<span class="op">(</span>r<span class="op">)</span> <span class="op">==</span> <span class="dv">1</span> <span class="op">&amp;&amp;</span> <span class="op">*((</span><span class="dt">char</span><span class="op">*)</span>v_first<span class="op">(</span>r<span class="op">))</span> <span class="op">==</span> <span class="ch">&#39;</span><span class="sc">\0</span><span class="ch">&#39;</span><span class="op">;</span></span>
<span id="cb36-90"><a href="#cb36-90" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb36-91"><a href="#cb36-91" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb36-92"><a href="#cb36-92" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> yyerror<span class="op">(</span><span class="at">const</span> <span class="dt">char</span> <span class="op">*</span>s<span class="op">)</span></span>
<span id="cb36-93"><a href="#cb36-93" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb36-94"><a href="#cb36-94" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;</span><span class="sc">%s\n</span><span class="st">&quot;</span><span class="op">,</span> s<span class="op">);</span></span>
<span id="cb36-95"><a href="#cb36-95" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Build it (using the steps shown for earlier examples). You’ll also need to link with <a href="https://github.com/begriffs/libderp">libderp</a> version 0.1.0; the project readme shows how.</p>
<p>Next, verify with test cases:</p>
<div class="sourceCode" id="cb37"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb37-1"><a href="#cb37-1" aria-hidden="true" tabindex="-1"></a><span class="co"># https://en.wikipedia.org/wiki/Comma-separated_values#Example</span></span>
<span id="cb37-2"><a href="#cb37-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb37-3"><a href="#cb37-3" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./csv <span class="op">&lt;&lt; EOF</span></span>
<span id="cb37-4"><a href="#cb37-4" aria-hidden="true" tabindex="-1"></a><span class="st">Year,Make,Model,Description,Price</span></span>
<span id="cb37-5"><a href="#cb37-5" aria-hidden="true" tabindex="-1"></a><span class="st">1997,Ford,E350,&quot;ac, abs, moon&quot;,3000.00</span></span>
<span id="cb37-6"><a href="#cb37-6" aria-hidden="true" tabindex="-1"></a><span class="st">1999,Chevy,&quot;Venture &quot;&quot;Extended Edition&quot;&quot;&quot;,&quot;&quot;,4900.00</span></span>
<span id="cb37-7"><a href="#cb37-7" aria-hidden="true" tabindex="-1"></a><span class="st">1999,Chevy,&quot;Venture &quot;&quot;Extended Edition, Very Large&quot;&quot;&quot;,,5000.00</span></span>
<span id="cb37-8"><a href="#cb37-8" aria-hidden="true" tabindex="-1"></a><span class="st">1996,Jeep,Grand Cherokee,&quot;MUST SELL!</span></span>
<span id="cb37-9"><a href="#cb37-9" aria-hidden="true" tabindex="-1"></a><span class="st">air, moon roof, loaded&quot;,4799.00</span></span>
<span id="cb37-10"><a href="#cb37-10" aria-hidden="true" tabindex="-1"></a><span class="op">EOF</span></span></code></pre></div>
<pre><code>#fields = 5
        Year
        Make
        Model
        Description
        Price
#fields = 5
        1997
        Ford
        E350
        ac, abs, moon
        3000.00
#fields = 5
        1999
        Chevy
        Venture &quot;Extended Edition&quot;
        
        4900.00
#fields = 5
        1999
        Chevy
        Venture &quot;Extended Edition, Very Large&quot;
        
        5000.00
#fields = 5
        1996
        Jeep
        Grand Cherokee
        MUST SELL!
air, moon roof, loaded
        4799.00</code></pre>
<div class="sourceCode" id="cb39"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb39-1"><a href="#cb39-1" aria-hidden="true" tabindex="-1"></a><span class="co"># extra testing for empty fields before crlf and eof</span></span>
<span id="cb39-2"><a href="#cb39-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-3"><a href="#cb39-3" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> printf <span class="st">&quot;,\n,&quot;</span> <span class="kw">|</span> <span class="ex">./csv</span></span></code></pre></div>
<pre><code>#fields = 2
        
        
#fields = 2
        
        </code></pre>
<h4 id="parsing-a-more-complicated-rfc">Parsing a more complicated RFC</h4>
<p><a href="https://ircv3.net/irc/">IRCv3</a> extends the Internet Relay Chat (IRC) protocol with useful features. Its core syntactical change to support new features is <a href="https://ircv3.net/specs/extensions/message-tags">message tagging</a>. We’ll write a parser to extract information from <a href="https://datatracker.ietf.org/doc/html/rfc1459#section-2.3.1">RFC 1459</a> messages, including IRCv3 tags.</p>
<p>The BNF from this standard is written in a slightly different dialect than that of the CSV RFC.</p>
<pre><code>&lt;message&gt;       ::= [&#39;@&#39; &lt;tags&gt; &lt;SPACE&gt;] [&#39;:&#39; &lt;prefix&gt; &lt;SPACE&gt; ] &lt;command&gt; [params] &lt;crlf&gt;

&lt;tags&gt;          ::= &lt;tag&gt; [&#39;;&#39; &lt;tag&gt;]*
&lt;tag&gt;           ::= &lt;key&gt; [&#39;=&#39; &lt;escaped_value&gt;]
&lt;key&gt;           ::= [ &lt;client_prefix&gt; ] [ &lt;vendor&gt; &#39;/&#39; ] &lt;key_name&gt;
&lt;client_prefix&gt; ::= &#39;+&#39;
&lt;key_name&gt;      ::= &lt;non-empty sequence of ascii letters, digits, hyphens (&#39;-&#39;)&gt;
&lt;escaped_value&gt; ::= &lt;sequence of zero or more utf8 characters except NUL, CR, LF, semicolon (`;`) and SPACE&gt;
&lt;vendor&gt;        ::= &lt;host&gt;
&lt;host&gt;          ::= see RFC 952 [DNS:4] for details on allowed hostnames

&lt;prefix&gt;        ::= &lt;servername&gt; | &lt;nick&gt; [ &#39;!&#39; &lt;user&gt; ] [ &#39;@&#39; &lt;host&gt; ]
&lt;nick&gt;          ::= &lt;letter&gt; { &lt;letter&gt; | &lt;number&gt; | &lt;special&gt; }
&lt;command&gt;       ::= &lt;letter&gt; { &lt;letter&gt; } | &lt;number&gt; &lt;number&gt; &lt;number&gt;
&lt;SPACE&gt;         ::= &#39; &#39; { &#39; &#39; }
&lt;params&gt;        ::= &lt;SPACE&gt; [ &#39;:&#39; &lt;trailing&gt; | &lt;middle&gt; &lt;params&gt; ]
&lt;middle&gt;        ::= &lt;Any *non-empty* sequence of octets not including SPACE
                    or NUL or CR or LF, the first of which may not be &#39;:&#39;&gt;
&lt;trailing&gt;      ::= &lt;Any, possibly *empty*, sequence of octets not including
                      NUL or CR or LF&gt;

&lt;user&gt;          ::= &lt;nonwhite&gt; { &lt;nonwhite&gt; }
&lt;letter&gt;        ::= &#39;a&#39; ... &#39;z&#39; | &#39;A&#39; ... &#39;Z&#39;
&lt;number&gt;        ::= &#39;0&#39; ... &#39;9&#39;
&lt;crlf&gt;          ::= CR LF</code></pre>
<p>As before, it’s helpful to start from the bottom up, applying the power of lex regexes. However, we run into the problem that most of the tokens match almost anything. The same string could conceivably be a host, nick, user, key_name, and command all at once. When several rules match the same longest stretch of input, lex picks whichever rule comes first in the lexer file, whether or not that is the token the parser needs.</p>
<p>Yacc can’t easily pass lex any clues about what tokens it expects, given what tokens have come before. Lex is on its own. For this reason, the designers of lex gave it a way to keep a memory. Rules can be tagged with a <em>start condition</em>, saying they are eligible only in certain states. Rule actions can then enter new states prior to returning.</p>
<div class="sourceCode" id="cb42"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb42-1"><a href="#cb42-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* Incomplete irc.l, showing start conditions and patterns.</span></span>
<span id="cb42-2"><a href="#cb42-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-3"><a href="#cb42-3" aria-hidden="true" tabindex="-1"></a><span class="co">   This lexer produces the following tokens:</span></span>
<span id="cb42-4"><a href="#cb42-4" aria-hidden="true" tabindex="-1"></a><span class="co">   SPACE COMMAND MIDDLE TRAILING TAG PREFIX &#39;:&#39; &#39;@&#39;</span></span>
<span id="cb42-5"><a href="#cb42-5" aria-hidden="true" tabindex="-1"></a><span class="co">*/</span></span>
<span id="cb42-6"><a href="#cb42-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-7"><a href="#cb42-7" aria-hidden="true" tabindex="-1"></a><span class="co">/* It&#39;s nice to prefix the regex names with &quot;re_&quot;</span></span>
<span id="cb42-8"><a href="#cb42-8" aria-hidden="true" tabindex="-1"></a><span class="co">   to see them better in the rules */</span></span>
<span id="cb42-9"><a href="#cb42-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-10"><a href="#cb42-10" aria-hidden="true" tabindex="-1"></a><span class="dt">re_space    </span><span class="st">[ ]+</span></span>
<span id="cb42-11"><a href="#cb42-11" aria-hidden="true" tabindex="-1"></a><span class="dt">re_host     </span><span class="st">[[:alnum:]][[:alnum:]\.\-]*</span></span>
<span id="cb42-12"><a href="#cb42-12" aria-hidden="true" tabindex="-1"></a><span class="dt">re_nick     </span><span class="st">[[:alpha:]][[:alnum:]\-\[\]\\`^{}_]*</span></span>
<span id="cb42-13"><a href="#cb42-13" aria-hidden="true" tabindex="-1"></a><span class="dt">re_user     </span><span class="st">[~[:alpha:]][[:alnum:]]*</span></span>
<span id="cb42-14"><a href="#cb42-14" aria-hidden="true" tabindex="-1"></a><span class="dt">re_keyname  </span><span class="st">[[:alnum:]\-]+</span></span>
<span id="cb42-15"><a href="#cb42-15" aria-hidden="true" tabindex="-1"></a><span class="dt">re_keyval   </span><span class="st">[^ ;\r\n]*</span></span>
<span id="cb42-16"><a href="#cb42-16" aria-hidden="true" tabindex="-1"></a><span class="dt">re_command  </span><span class="st">[[:alpha:]]+|[[:digit:]]{3}</span></span>
<span id="cb42-17"><a href="#cb42-17" aria-hidden="true" tabindex="-1"></a><span class="dt">re_middle   </span><span class="st">[^: \r\n][^ \r\n]*</span></span>
<span id="cb42-18"><a href="#cb42-18" aria-hidden="true" tabindex="-1"></a><span class="dt">re_trailing </span><span class="st">[^\r\n]*</span></span>
<span id="cb42-19"><a href="#cb42-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-20"><a href="#cb42-20" aria-hidden="true" tabindex="-1"></a><span class="co">/* Declare start conditions. The &quot;%x&quot; means</span></span>
<span id="cb42-21"><a href="#cb42-21" aria-hidden="true" tabindex="-1"></a><span class="co">   they are exclusive, vs &quot;%s&quot; for inclusive. */</span></span>
<span id="cb42-22"><a href="#cb42-22" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-23"><a href="#cb42-23" aria-hidden="true" tabindex="-1"></a><span class="kw">%x IN_TAGS IN_PREFIX IN_PARAMS</span></span>
<span id="cb42-24"><a href="#cb42-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-25"><a href="#cb42-25" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb42-26"><a href="#cb42-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-27"><a href="#cb42-27" aria-hidden="true" tabindex="-1"></a> <span class="co">/* these patterns are not tagged with a start</span></span>
<span id="cb42-28"><a href="#cb42-28" aria-hidden="true" tabindex="-1"></a><span class="co">    condition, and are active in the default state</span></span>
<span id="cb42-29"><a href="#cb42-29" aria-hidden="true" tabindex="-1"></a><span class="co">    of INITIAL. They will match only when none of</span></span>
<span id="cb42-30"><a href="#cb42-30" aria-hidden="true" tabindex="-1"></a><span class="co">    the exclusive conditions are active. They</span></span>
<span id="cb42-31"><a href="#cb42-31" aria-hidden="true" tabindex="-1"></a><span class="co">    *would* match on inclusive states (but we have</span></span>
<span id="cb42-32"><a href="#cb42-32" aria-hidden="true" tabindex="-1"></a><span class="co">    none).</span></span>
<span id="cb42-33"><a href="#cb42-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-34"><a href="#cb42-34" aria-hidden="true" tabindex="-1"></a><span class="co">    The </span><span class="re">BEGIN</span><span class="co"> command changes state. */</span></span>
<span id="cb42-35"><a href="#cb42-35" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-36"><a href="#cb42-36" aria-hidden="true" tabindex="-1"></a><span class="st">@</span> { BEGIN IN_TAGS<span class="op">;</span> <span class="cf">return</span> <span class="op">*</span>yytext<span class="op">;</span> }</span>
<span id="cb42-37"><a href="#cb42-37" aria-hidden="true" tabindex="-1"></a><span class="st">:</span> { BEGIN IN_PREFIX<span class="op">;</span> <span class="cf">return</span> <span class="op">*</span>yytext<span class="op">;</span> }</span>
<span id="cb42-38"><a href="#cb42-38" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-39"><a href="#cb42-39" aria-hidden="true" tabindex="-1"></a><span class="st">{re_space}</span> { <span class="cf">return</span> SPACE<span class="op">;</span> }</span>
<span id="cb42-40"><a href="#cb42-40" aria-hidden="true" tabindex="-1"></a><span class="st">{re_command}</span> {</span>
<span id="cb42-41"><a href="#cb42-41" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* </span><span class="al">TODO</span><span class="co">: construct yylval */</span></span>
<span id="cb42-42"><a href="#cb42-42" aria-hidden="true" tabindex="-1"></a>	BEGIN IN_PARAMS<span class="op">;</span></span>
<span id="cb42-43"><a href="#cb42-43" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> COMMAND<span class="op">;</span></span>
<span id="cb42-44"><a href="#cb42-44" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb42-45"><a href="#cb42-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-46"><a href="#cb42-46" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-47"><a href="#cb42-47" aria-hidden="true" tabindex="-1"></a> <span class="co">/* these patterns will only match IN_TAGS, which</span></span>
<span id="cb42-48"><a href="#cb42-48" aria-hidden="true" tabindex="-1"></a><span class="co">    as we saw earlier, gets activated from the</span></span>
<span id="cb42-49"><a href="#cb42-49" aria-hidden="true" tabindex="-1"></a><span class="co">    INITIAL state when &quot;@&quot; is encountered */</span></span>
<span id="cb42-50"><a href="#cb42-50" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-51"><a href="#cb42-51" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_TAGS&gt;\+?({re_host}\/)?{re_keyname}(={re_keyval})?</span>  {</span>
<span id="cb42-52"><a href="#cb42-52" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* </span><span class="al">TODO</span><span class="co">: construct yylval */</span></span>
<span id="cb42-53"><a href="#cb42-53" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> TAG<span class="op">;</span></span>
<span id="cb42-54"><a href="#cb42-54" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb42-55"><a href="#cb42-55" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_TAGS&gt;{re_space}</span> {</span>
<span id="cb42-56"><a href="#cb42-56" aria-hidden="true" tabindex="-1"></a>	BEGIN INITIAL<span class="op">;</span></span>
<span id="cb42-57"><a href="#cb42-57" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> SPACE<span class="op">;</span></span>
<span id="cb42-58"><a href="#cb42-58" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb42-59"><a href="#cb42-59" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_TAGS&gt;;</span> { <span class="cf">return</span> <span class="ch">&#39;;&#39;</span><span class="op">;</span> }</span>
<span id="cb42-60"><a href="#cb42-60" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-61"><a href="#cb42-61" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-62"><a href="#cb42-62" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_PREFIX&gt;({re_host})|({re_nick})(!{re_user})?(@{re_host})?</span> {</span>
<span id="cb42-63"><a href="#cb42-63" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* </span><span class="al">TODO</span><span class="co">: construct yylval */</span></span>
<span id="cb42-64"><a href="#cb42-64" aria-hidden="true" tabindex="-1"></a>	BEGIN INITIAL<span class="op">;</span></span>
<span id="cb42-65"><a href="#cb42-65" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> PREFIX<span class="op">;</span></span>
<span id="cb42-66"><a href="#cb42-66" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb42-67"><a href="#cb42-67" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-68"><a href="#cb42-68" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-69"><a href="#cb42-69" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_PARAMS&gt;{re_space}</span> { <span class="cf">return</span> SPACE<span class="op">;</span> }</span>
<span id="cb42-70"><a href="#cb42-70" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_PARAMS&gt;{re_middle}</span> {</span>
<span id="cb42-71"><a href="#cb42-71" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* </span><span class="al">TODO</span><span class="co">: construct yylval */</span></span>
<span id="cb42-72"><a href="#cb42-72" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> MIDDLE<span class="op">;</span></span>
<span id="cb42-73"><a href="#cb42-73" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb42-74"><a href="#cb42-74" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_PARAMS&gt;:{re_trailing}</span> {</span>
<span id="cb42-75"><a href="#cb42-75" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* </span><span class="al">TODO</span><span class="co">: construct yylval */</span></span>
<span id="cb42-76"><a href="#cb42-76" aria-hidden="true" tabindex="-1"></a>	BEGIN INITIAL<span class="op">;</span></span>
<span id="cb42-77"><a href="#cb42-77" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> TRAILING<span class="op">;</span></span>
<span id="cb42-78"><a href="#cb42-78" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb42-79"><a href="#cb42-79" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-80"><a href="#cb42-80" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-81"><a href="#cb42-81" aria-hidden="true" tabindex="-1"></a> <span class="co">/* a rule prefixed with &lt;*&gt; is active in every state,</span></span>
<span id="cb42-82"><a href="#cb42-82" aria-hidden="true" tabindex="-1"></a><span class="co">    including INITIAL and the exclusive ones */</span></span>
<span id="cb42-83"><a href="#cb42-83" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb42-84"><a href="#cb42-84" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;*&gt;\n|\r\n</span>  <span class="op">;</span> <span class="co">/* ignore */</span></span></code></pre></div>
<p>We’ll revisit the lexer to fill in details for assigning yylval. First, let’s see the parser and its data types.</p>
<div class="sourceCode" id="cb43"><pre class="sourceCode yacc"><code class="sourceCode yacc"><span id="cb43-1"><a href="#cb43-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* irc.y  (Bison only)</span></span>
<span id="cb43-2"><a href="#cb43-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-3"><a href="#cb43-3" aria-hidden="true" tabindex="-1"></a><span class="co">   Using Bison mostly for the %code positions, making </span></span>
<span id="cb43-4"><a href="#cb43-4" aria-hidden="true" tabindex="-1"></a><span class="co">   it easier to use libderp between flex and bison.</span></span>
<span id="cb43-5"><a href="#cb43-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-6"><a href="#cb43-6" aria-hidden="true" tabindex="-1"></a><span class="co">   - </span><span class="al">WARNING</span><span class="co"> -</span></span>
<span id="cb43-7"><a href="#cb43-7" aria-hidden="true" tabindex="-1"></a><span class="co">   There is absolutely no memory hygiene in this example.</span></span>
<span id="cb43-8"><a href="#cb43-8" aria-hidden="true" tabindex="-1"></a><span class="co">   We don&#39;t check for allocation failure, and we don&#39;t free</span></span>
<span id="cb43-9"><a href="#cb43-9" aria-hidden="true" tabindex="-1"></a><span class="co">   things when done. See the earlier lisp.y/.l examples</span></span>
<span id="cb43-10"><a href="#cb43-10" aria-hidden="true" tabindex="-1"></a><span class="co">   for guidance about that.</span></span>
<span id="cb43-11"><a href="#cb43-11" aria-hidden="true" tabindex="-1"></a><span class="co">*/</span></span>
<span id="cb43-12"><a href="#cb43-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-13"><a href="#cb43-13" aria-hidden="true" tabindex="-1"></a><span class="co">/* output more descriptive messages than &quot;syntax error&quot; */</span></span>
<span id="cb43-14"><a href="#cb43-14" aria-hidden="true" tabindex="-1"></a><span class="kw">%define</span> parse.error verbose</span>
<span id="cb43-15"><a href="#cb43-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-16"><a href="#cb43-16" aria-hidden="true" tabindex="-1"></a><span class="kw">%code</span> top {</span>
<span id="cb43-17"><a href="#cb43-17" aria-hidden="true" tabindex="-1"></a>	#define _XOPEN_SOURCE 600</span>
<span id="cb43-18"><a href="#cb43-18" aria-hidden="true" tabindex="-1"></a>	#include <span class="dt">&lt;stdio.h&gt;</span></span>
<span id="cb43-19"><a href="#cb43-19" aria-hidden="true" tabindex="-1"></a>	#include <span class="dt">&lt;stdlib.h&gt;</span></span>
<span id="cb43-20"><a href="#cb43-20" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb43-21"><a href="#cb43-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-22"><a href="#cb43-22" aria-hidden="true" tabindex="-1"></a><span class="kw">%code</span> requires {</span>
<span id="cb43-23"><a href="#cb43-23" aria-hidden="true" tabindex="-1"></a>	#include <span class="dt">&lt;derp/list.h&gt;</span></span>
<span id="cb43-24"><a href="#cb43-24" aria-hidden="true" tabindex="-1"></a>	#include <span class="dt">&lt;derp/treemap.h&gt;</span></span>
<span id="cb43-25"><a href="#cb43-25" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-26"><a href="#cb43-26" aria-hidden="true" tabindex="-1"></a>	struct prefix</span>
<span id="cb43-27"><a href="#cb43-27" aria-hidden="true" tabindex="-1"></a>	{</span>
<span id="cb43-28"><a href="#cb43-28" aria-hidden="true" tabindex="-1"></a>		char *host;</span>
<span id="cb43-29"><a href="#cb43-29" aria-hidden="true" tabindex="-1"></a>		char *nick;</span>
<span id="cb43-30"><a href="#cb43-30" aria-hidden="true" tabindex="-1"></a>		char *user;</span>
<span id="cb43-31"><a href="#cb43-31" aria-hidden="true" tabindex="-1"></a>	};</span>
<span id="cb43-32"><a href="#cb43-32" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-33"><a href="#cb43-33" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* building an irc_message is the overall</span></span>
<span id="cb43-34"><a href="#cb43-34" aria-hidden="true" tabindex="-1"></a><span class="co">	   goal for this parser */</span></span>
<span id="cb43-35"><a href="#cb43-35" aria-hidden="true" tabindex="-1"></a>	struct irc_message</span>
<span id="cb43-36"><a href="#cb43-36" aria-hidden="true" tabindex="-1"></a>	{</span>
<span id="cb43-37"><a href="#cb43-37" aria-hidden="true" tabindex="-1"></a>		treemap *tags;</span>
<span id="cb43-38"><a href="#cb43-38" aria-hidden="true" tabindex="-1"></a>		struct prefix *prefix;</span>
<span id="cb43-39"><a href="#cb43-39" aria-hidden="true" tabindex="-1"></a>		char *command;</span>
<span id="cb43-40"><a href="#cb43-40" aria-hidden="true" tabindex="-1"></a>		list *params;</span>
<span id="cb43-41"><a href="#cb43-41" aria-hidden="true" tabindex="-1"></a>	};</span>
<span id="cb43-42"><a href="#cb43-42" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb43-43"><a href="#cb43-43" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-44"><a href="#cb43-44" aria-hidden="true" tabindex="-1"></a><span class="kw">%code</span> provides {</span>
<span id="cb43-45"><a href="#cb43-45" aria-hidden="true" tabindex="-1"></a>	int yyerror(char const *msg);</span>
<span id="cb43-46"><a href="#cb43-46" aria-hidden="true" tabindex="-1"></a>	int yylex(void);</span>
<span id="cb43-47"><a href="#cb43-47" aria-hidden="true" tabindex="-1"></a>	void message_print(struct irc_message *m);</span>
<span id="cb43-48"><a href="#cb43-48" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb43-49"><a href="#cb43-49" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-50"><a href="#cb43-50" aria-hidden="true" tabindex="-1"></a><span class="kw">%union</span></span>
<span id="cb43-51"><a href="#cb43-51" aria-hidden="true" tabindex="-1"></a>{</span>
<span id="cb43-52"><a href="#cb43-52" aria-hidden="true" tabindex="-1"></a>	char *str;</span>
<span id="cb43-53"><a href="#cb43-53" aria-hidden="true" tabindex="-1"></a>	struct prefix *prefix;</span>
<span id="cb43-54"><a href="#cb43-54" aria-hidden="true" tabindex="-1"></a>	treemap *map;</span>
<span id="cb43-55"><a href="#cb43-55" aria-hidden="true" tabindex="-1"></a>	struct map_pair *pair;</span>
<span id="cb43-56"><a href="#cb43-56" aria-hidden="true" tabindex="-1"></a>	list *list;</span>
<span id="cb43-57"><a href="#cb43-57" aria-hidden="true" tabindex="-1"></a>	struct irc_message *msg;</span>
<span id="cb43-58"><a href="#cb43-58" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb43-59"><a href="#cb43-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-60"><a href="#cb43-60" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span>          SPACE</span>
<span id="cb43-61"><a href="#cb43-61" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> <span class="dt">&lt;str&gt;</span>    COMMAND MIDDLE TRAILING</span>
<span id="cb43-62"><a href="#cb43-62" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> <span class="dt">&lt;pair&gt;</span>   TAG</span>
<span id="cb43-63"><a href="#cb43-63" aria-hidden="true" tabindex="-1"></a><span class="kw">%token</span> <span class="dt">&lt;prefix&gt;</span> PREFIX</span>
<span id="cb43-64"><a href="#cb43-64" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-65"><a href="#cb43-65" aria-hidden="true" tabindex="-1"></a><span class="kw">%type</span> <span class="dt">&lt;msg&gt;</span> message tagged_message prefixed_message</span>
<span id="cb43-66"><a href="#cb43-66" aria-hidden="true" tabindex="-1"></a><span class="kw">%type</span> <span class="dt">&lt;map&gt;</span> tags</span>
<span id="cb43-67"><a href="#cb43-67" aria-hidden="true" tabindex="-1"></a><span class="kw">%type</span> <span class="dt">&lt;list&gt;</span> params</span>
<span id="cb43-68"><a href="#cb43-68" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-69"><a href="#cb43-69" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb43-70"><a href="#cb43-70" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-71"><a href="#cb43-71" aria-hidden="true" tabindex="-1"></a><span class="st"> </span><span class="co">/* Like in the CSV example, we start with a dummy</span></span>
<span id="cb43-72"><a href="#cb43-72" aria-hidden="true" tabindex="-1"></a><span class="co">    rule just to add side-effects */</span></span>
<span id="cb43-73"><a href="#cb43-73" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-74"><a href="#cb43-74" aria-hidden="true" tabindex="-1"></a><span class="st">final </span>:</span>
<span id="cb43-75"><a href="#cb43-75" aria-hidden="true" tabindex="-1"></a>  tagged_message { message_print<span class="op">(</span><span class="kw">$1</span><span class="op">);</span> }</span>
<span id="cb43-76"><a href="#cb43-76" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb43-77"><a href="#cb43-77" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-78"><a href="#cb43-78" aria-hidden="true" tabindex="-1"></a> <span class="co">/* Messages begin with two optional components,</span></span>
<span id="cb43-79"><a href="#cb43-79" aria-hidden="true" tabindex="-1"></a><span class="co">    a set of tags and a prefix.</span></span>
<span id="cb43-80"><a href="#cb43-80" aria-hidden="true" tabindex="-1"></a><span class="co"> </span></span>
<span id="cb43-81"><a href="#cb43-81" aria-hidden="true" tabindex="-1"></a><span class="co">    &lt;message&gt; ::= [&#39;@&#39; &lt;tags&gt; &lt;SPACE&gt;] [&#39;:&#39; &lt;prefix&gt; &lt;SPACE&gt;] &lt;command&gt; [&lt;params&gt;]</span></span>
<span id="cb43-82"><a href="#cb43-82" aria-hidden="true" tabindex="-1"></a><span class="co"> </span></span>
<span id="cb43-83"><a href="#cb43-83" aria-hidden="true" tabindex="-1"></a><span class="co">    Rather than making a single message rule with</span></span>
<span id="cb43-84"><a href="#cb43-84" aria-hidden="true" tabindex="-1"></a><span class="co">    tons of variations (and duplicated code), I chose</span></span>
<span id="cb43-85"><a href="#cb43-85" aria-hidden="true" tabindex="-1"></a><span class="co">    to build the message in stages.</span></span>
<span id="cb43-86"><a href="#cb43-86" aria-hidden="true" tabindex="-1"></a><span class="co"> </span></span>
<span id="cb43-87"><a href="#cb43-87" aria-hidden="true" tabindex="-1"></a><span class="co">    tagged_message &lt;- prefixed_message &lt;- message</span></span>
<span id="cb43-88"><a href="#cb43-88" aria-hidden="true" tabindex="-1"></a><span class="co"> </span></span>
<span id="cb43-89"><a href="#cb43-89" aria-hidden="true" tabindex="-1"></a><span class="co">    A prefixed_message adds prefix information, or</span></span>
<span id="cb43-90"><a href="#cb43-90" aria-hidden="true" tabindex="-1"></a><span class="co">    passes the message along verbatim if there is none.</span></span>
<span id="cb43-91"><a href="#cb43-91" aria-hidden="true" tabindex="-1"></a><span class="co">    Similarly for tagged_message. */</span></span>
<span id="cb43-92"><a href="#cb43-92" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-93"><a href="#cb43-93" aria-hidden="true" tabindex="-1"></a><span class="st">tagged_message </span>:</span>
<span id="cb43-94"><a href="#cb43-94" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-95"><a href="#cb43-95" aria-hidden="true" tabindex="-1"></a>  <span class="co">/* When a rule matches more than one symbol,</span></span>
<span id="cb43-96"><a href="#cb43-96" aria-hidden="true" tabindex="-1"></a><span class="co">     it&#39;s helpful to add Bison &quot;named references&quot;</span></span>
<span id="cb43-97"><a href="#cb43-97" aria-hidden="true" tabindex="-1"></a><span class="co">     in brackets. Thus, below, the rule can refer to</span></span>
<span id="cb43-98"><a href="#cb43-98" aria-hidden="true" tabindex="-1"></a><span class="co">     $ts rather than $2, or $msg rather than $4.</span></span>
<span id="cb43-99"><a href="#cb43-99" aria-hidden="true" tabindex="-1"></a><span class="co">     Makes it way easier to rearrange tokens while</span></span>
<span id="cb43-100"><a href="#cb43-100" aria-hidden="true" tabindex="-1"></a><span class="co">     you&#39;re experimenting. */</span></span>
<span id="cb43-101"><a href="#cb43-101" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-102"><a href="#cb43-102" aria-hidden="true" tabindex="-1"></a>  <span class="ch">&#39;@&#39;</span> tags[ts] SPACE prefixed_message[msg] {</span>
<span id="cb43-103"><a href="#cb43-103" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$msg</span><span class="op">-&gt;</span>tags <span class="op">=</span> <span class="kw">$ts</span><span class="op">;</span></span>
<span id="cb43-104"><a href="#cb43-104" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> <span class="kw">$msg</span><span class="op">;</span></span>
<span id="cb43-105"><a href="#cb43-105" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb43-106"><a href="#cb43-106" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-107"><a href="#cb43-107" aria-hidden="true" tabindex="-1"></a>  <span class="co">/* here&#39;s the pass-through case when there are</span></span>
<span id="cb43-108"><a href="#cb43-108" aria-hidden="true" tabindex="-1"></a><span class="co">     no tags on the message */</span></span>
<span id="cb43-109"><a href="#cb43-109" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-110"><a href="#cb43-110" aria-hidden="true" tabindex="-1"></a>| prefixed_message</span>
<span id="cb43-111"><a href="#cb43-111" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb43-112"><a href="#cb43-112" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-113"><a href="#cb43-113" aria-hidden="true" tabindex="-1"></a><span class="st">prefixed_message </span>:</span>
<span id="cb43-114"><a href="#cb43-114" aria-hidden="true" tabindex="-1"></a>  <span class="ch">&#39;:&#39;</span> PREFIX[pfx] SPACE message[msg] {</span>
<span id="cb43-115"><a href="#cb43-115" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$msg</span><span class="op">-&gt;</span>prefix <span class="op">=</span> <span class="kw">$pfx</span><span class="op">;</span></span>
<span id="cb43-116"><a href="#cb43-116" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> <span class="kw">$msg</span><span class="op">;</span></span>
<span id="cb43-117"><a href="#cb43-117" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb43-118"><a href="#cb43-118" aria-hidden="true" tabindex="-1"></a>| message</span>
<span id="cb43-119"><a href="#cb43-119" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb43-120"><a href="#cb43-120" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-121"><a href="#cb43-121" aria-hidden="true" tabindex="-1"></a><span class="st">message </span>:</span>
<span id="cb43-122"><a href="#cb43-122" aria-hidden="true" tabindex="-1"></a>  COMMAND[cmd] params[ps] {</span>
<span id="cb43-123"><a href="#cb43-123" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> irc_message <span class="op">*</span>m <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span> <span class="op">*</span>m<span class="op">);</span></span>
<span id="cb43-124"><a href="#cb43-124" aria-hidden="true" tabindex="-1"></a>	<span class="op">*</span>m <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> irc_message<span class="op">)</span> {</span>
<span id="cb43-125"><a href="#cb43-125" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>command<span class="op">=</span><span class="kw">$cmd</span><span class="op">,</span> <span class="op">.</span>params<span class="op">=</span><span class="kw">$ps</span></span>
<span id="cb43-126"><a href="#cb43-126" aria-hidden="true" tabindex="-1"></a>	}<span class="op">;</span></span>
<span id="cb43-127"><a href="#cb43-127" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> m<span class="op">;</span></span>
<span id="cb43-128"><a href="#cb43-128" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb43-129"><a href="#cb43-129" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb43-130"><a href="#cb43-130" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-131"><a href="#cb43-131" aria-hidden="true" tabindex="-1"></a><span class="st">tags </span>:</span>
<span id="cb43-132"><a href="#cb43-132" aria-hidden="true" tabindex="-1"></a>  TAG {</span>
<span id="cb43-133"><a href="#cb43-133" aria-hidden="true" tabindex="-1"></a>	treemap <span class="op">*</span>t <span class="op">=</span> tm_new<span class="op">(</span>derp_strcmp<span class="op">,</span> NULL<span class="op">);</span></span>
<span id="cb43-134"><a href="#cb43-134" aria-hidden="true" tabindex="-1"></a>	tm_insert<span class="op">(</span>t<span class="op">,</span> <span class="kw">$1</span><span class="op">-&gt;</span>k<span class="op">,</span> <span class="kw">$1</span><span class="op">-&gt;</span>v<span class="op">);</span></span>
<span id="cb43-135"><a href="#cb43-135" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> t<span class="op">;</span></span>
<span id="cb43-136"><a href="#cb43-136" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb43-137"><a href="#cb43-137" aria-hidden="true" tabindex="-1"></a>| tags[ts] <span class="ch">&#39;;&#39;</span> TAG[t] {</span>
<span id="cb43-138"><a href="#cb43-138" aria-hidden="true" tabindex="-1"></a>	tm_insert<span class="op">(</span><span class="kw">$ts</span><span class="op">,</span> <span class="kw">$t</span><span class="op">-&gt;</span>k<span class="op">,</span> <span class="kw">$t</span><span class="op">-&gt;</span>v<span class="op">);</span></span>
<span id="cb43-139"><a href="#cb43-139" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> <span class="kw">$ts</span><span class="op">;</span></span>
<span id="cb43-140"><a href="#cb43-140" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb43-141"><a href="#cb43-141" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb43-142"><a href="#cb43-142" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-143"><a href="#cb43-143" aria-hidden="true" tabindex="-1"></a><span class="st">params </span>:</span>
<span id="cb43-144"><a href="#cb43-144" aria-hidden="true" tabindex="-1"></a>  SPACE TRAILING {</span>
<span id="cb43-145"><a href="#cb43-145" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> l_new<span class="op">();</span></span>
<span id="cb43-146"><a href="#cb43-146" aria-hidden="true" tabindex="-1"></a>	l_prepend<span class="op">(</span><span class="kw">$$</span><span class="op">,</span> <span class="kw">$2</span><span class="op">);</span></span>
<span id="cb43-147"><a href="#cb43-147" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb43-148"><a href="#cb43-148" aria-hidden="true" tabindex="-1"></a>| SPACE MIDDLE[mid] params[ps] {</span>
<span id="cb43-149"><a href="#cb43-149" aria-hidden="true" tabindex="-1"></a>	l_prepend<span class="op">(</span><span class="kw">$ps</span><span class="op">,</span> <span class="kw">$mid</span><span class="op">);</span></span>
<span id="cb43-150"><a href="#cb43-150" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> <span class="kw">$ps</span><span class="op">;</span></span>
<span id="cb43-151"><a href="#cb43-151" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb43-152"><a href="#cb43-152" aria-hidden="true" tabindex="-1"></a>| %empty {</span>
<span id="cb43-153"><a href="#cb43-153" aria-hidden="true" tabindex="-1"></a>	<span class="kw">$$</span> <span class="op">=</span> l_new<span class="op">();</span></span>
<span id="cb43-154"><a href="#cb43-154" aria-hidden="true" tabindex="-1"></a>  }</span>
<span id="cb43-155"><a href="#cb43-155" aria-hidden="true" tabindex="-1"></a>;</span>
<span id="cb43-156"><a href="#cb43-156" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-157"><a href="#cb43-157" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb43-158"><a href="#cb43-158" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-159"><a href="#cb43-159" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> yyerror<span class="op">(</span><span class="dt">char</span> <span class="at">const</span> <span class="op">*</span>msg<span class="op">)</span></span>
<span id="cb43-160"><a href="#cb43-160" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb43-161"><a href="#cb43-161" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;</span><span class="sc">%s\n</span><span class="st">&quot;</span><span class="op">,</span> msg<span class="op">);</span></span>
<span id="cb43-162"><a href="#cb43-162" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb43-163"><a href="#cb43-163" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-164"><a href="#cb43-164" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> message_print<span class="op">(</span><span class="kw">struct</span> irc_message <span class="op">*</span>m<span class="op">)</span></span>
<span id="cb43-165"><a href="#cb43-165" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb43-166"><a href="#cb43-166" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>m<span class="op">-&gt;</span>tags<span class="op">)</span></span>
<span id="cb43-167"><a href="#cb43-167" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb43-168"><a href="#cb43-168" aria-hidden="true" tabindex="-1"></a>		<span class="kw">struct</span> tm_iter  <span class="op">*</span>it <span class="op">=</span> tm_iter_begin<span class="op">(</span>m<span class="op">-&gt;</span>tags<span class="op">);</span></span>
<span id="cb43-169"><a href="#cb43-169" aria-hidden="true" tabindex="-1"></a>		<span class="kw">struct</span> map_pair <span class="op">*</span>p<span class="op">;</span></span>
<span id="cb43-170"><a href="#cb43-170" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb43-171"><a href="#cb43-171" aria-hidden="true" tabindex="-1"></a>		puts<span class="op">(</span><span class="st">&quot;Tags:&quot;</span><span class="op">);</span></span>
<span id="cb43-172"><a href="#cb43-172" aria-hidden="true" tabindex="-1"></a>		<span class="cf">while</span> <span class="op">((</span>p <span class="op">=</span> tm_iter_next<span class="op">(</span>it<span class="op">))</span> <span class="op">!=</span> NULL<span class="op">)</span></span>
<span id="cb43-173"><a href="#cb43-173" aria-hidden="true" tabindex="-1"></a>			printf<span class="op">(</span><span class="st">&quot;</span><span class="sc">\t</span><span class="st">&#39;</span><span class="sc">%s</span><span class="st">&#39;=&#39;</span><span class="sc">%s</span><span class="st">&#39;</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> <span class="op">(</span><span class="dt">char</span><span class="op">*)</span>p<span class="op">-&gt;</span>k<span class="op">,</span> <span class="op">(</span><span class="dt">char</span><span class="op">*)</span>p<span class="op">-&gt;</span>v<span class="op">);</span></span>
<span id="cb43-174"><a href="#cb43-174" aria-hidden="true" tabindex="-1"></a>		tm_iter_free<span class="op">(</span>it<span class="op">);</span></span>
<span id="cb43-175"><a href="#cb43-175" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb43-176"><a href="#cb43-176" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>m<span class="op">-&gt;</span>prefix<span class="op">)</span></span>
<span id="cb43-177"><a href="#cb43-177" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;Prefix: Nick </span><span class="sc">%s</span><span class="st">, User </span><span class="sc">%s</span><span class="st">, Host </span><span class="sc">%s\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb43-178"><a href="#cb43-178" aria-hidden="true" tabindex="-1"></a>		       m<span class="op">-&gt;</span>prefix<span class="op">-&gt;</span>nick<span class="op">,</span> m<span class="op">-&gt;</span>prefix<span class="op">-&gt;</span>user<span class="op">,</span></span>
<span id="cb43-179"><a href="#cb43-179" aria-hidden="true" tabindex="-1"></a>		       m<span class="op">-&gt;</span>prefix<span class="op">-&gt;</span>host<span class="op">);</span></span>
<span id="cb43-180"><a href="#cb43-180" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>m<span class="op">-&gt;</span>command<span class="op">)</span></span>
<span id="cb43-181"><a href="#cb43-181" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;Command: </span><span class="sc">%s\n</span><span class="st">&quot;</span><span class="op">,</span> m<span class="op">-&gt;</span>command<span class="op">);</span></span>
<span id="cb43-182"><a href="#cb43-182" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>l_is_empty<span class="op">(</span>m<span class="op">-&gt;</span>params<span class="op">))</span></span>
<span id="cb43-183"><a href="#cb43-183" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb43-184"><a href="#cb43-184" aria-hidden="true" tabindex="-1"></a>		puts<span class="op">(</span><span class="st">&quot;Params:&quot;</span><span class="op">);</span></span>
<span id="cb43-185"><a href="#cb43-185" aria-hidden="true" tabindex="-1"></a>		<span class="cf">for</span> <span class="op">(</span>list_item <span class="op">*</span>li <span class="op">=</span> l_first<span class="op">(</span>m<span class="op">-&gt;</span>params<span class="op">);</span> li<span class="op">;</span> li <span class="op">=</span> li<span class="op">-&gt;</span>next<span class="op">)</span></span>
<span id="cb43-186"><a href="#cb43-186" aria-hidden="true" tabindex="-1"></a>			printf<span class="op">(</span><span class="st">&quot;</span><span class="sc">\t%s\n</span><span class="st">&quot;</span><span class="op">,</span> <span class="op">(</span><span class="dt">char</span><span class="op">*)</span>li<span class="op">-&gt;</span>data<span class="op">);</span></span>
<span id="cb43-187"><a href="#cb43-187" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb43-188"><a href="#cb43-188" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Returning to the lexer, here is the code with all the details filled in to construct yylval for the tokens.</p>
<div class="sourceCode" id="cb44"><pre class="sourceCode lex"><code class="sourceCode lex"><span id="cb44-1"><a href="#cb44-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* irc.l  - complete file */</span></span>
<span id="cb44-2"><a href="#cb44-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-3"><a href="#cb44-3" aria-hidden="true" tabindex="-1"></a><span class="kw">%option noyywrap nounput noinput</span></span>
<span id="cb44-4"><a href="#cb44-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-5"><a href="#cb44-5" aria-hidden="true" tabindex="-1"></a><span class="bn">%{</span></span>
<span id="cb44-6"><a href="#cb44-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#define _XOPEN_SOURCE </span><span class="dv">600</span></span>
<span id="cb44-7"><a href="#cb44-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-8"><a href="#cb44-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&quot;irc.tab.h&quot;</span></span>
<span id="cb44-9"><a href="#cb44-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-10"><a href="#cb44-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;limits.h&gt;</span></span>
<span id="cb44-11"><a href="#cb44-11" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb44-12"><a href="#cb44-12" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb44-13"><a href="#cb44-13" aria-hidden="true" tabindex="-1"></a><span class="bn">%}</span></span>
<span id="cb44-14"><a href="#cb44-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-15"><a href="#cb44-15" aria-hidden="true" tabindex="-1"></a><span class="dt">re_space    </span><span class="st">[ ]+</span></span>
<span id="cb44-16"><a href="#cb44-16" aria-hidden="true" tabindex="-1"></a><span class="dt">re_host     </span><span class="st">[[:alnum:]][[:alnum:]\.\-]*</span></span>
<span id="cb44-17"><a href="#cb44-17" aria-hidden="true" tabindex="-1"></a><span class="dt">re_nick     </span><span class="st">[[:alpha:]][[:alnum:]\-\[\]\\`^{}_]*</span></span>
<span id="cb44-18"><a href="#cb44-18" aria-hidden="true" tabindex="-1"></a><span class="dt">re_user     </span><span class="st">[~[:alpha:]][[:alnum:]]*</span></span>
<span id="cb44-19"><a href="#cb44-19" aria-hidden="true" tabindex="-1"></a><span class="dt">re_keyname  </span><span class="st">[[:alnum:]\-]+</span></span>
<span id="cb44-20"><a href="#cb44-20" aria-hidden="true" tabindex="-1"></a><span class="dt">re_keyval   </span><span class="st">[^ ;\r\n]*</span></span>
<span id="cb44-21"><a href="#cb44-21" aria-hidden="true" tabindex="-1"></a><span class="dt">re_command  </span><span class="st">[[:alpha:]]+|[[:digit:]]{3}</span></span>
<span id="cb44-22"><a href="#cb44-22" aria-hidden="true" tabindex="-1"></a><span class="dt">re_middle   </span><span class="st">[^: \r\n][^ \r\n]*</span></span>
<span id="cb44-23"><a href="#cb44-23" aria-hidden="true" tabindex="-1"></a><span class="dt">re_trailing </span><span class="st">[^\r\n]*</span></span>
<span id="cb44-24"><a href="#cb44-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-25"><a href="#cb44-25" aria-hidden="true" tabindex="-1"></a><span class="kw">%x IN_TAGS IN_PREFIX IN_PARAMS</span></span>
<span id="cb44-26"><a href="#cb44-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-27"><a href="#cb44-27" aria-hidden="true" tabindex="-1"></a><span class="bn">%%</span></span>
<span id="cb44-28"><a href="#cb44-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-29"><a href="#cb44-29" aria-hidden="true" tabindex="-1"></a><span class="st">@</span> { BEGIN IN_TAGS<span class="op">;</span> <span class="cf">return</span> <span class="op">*</span>yytext<span class="op">;</span> }</span>
<span id="cb44-30"><a href="#cb44-30" aria-hidden="true" tabindex="-1"></a><span class="st">:</span> { BEGIN IN_PREFIX<span class="op">;</span> <span class="cf">return</span> <span class="op">*</span>yytext<span class="op">;</span> }</span>
<span id="cb44-31"><a href="#cb44-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-32"><a href="#cb44-32" aria-hidden="true" tabindex="-1"></a><span class="st">{re_space}</span> { <span class="cf">return</span> SPACE<span class="op">;</span> }</span>
<span id="cb44-33"><a href="#cb44-33" aria-hidden="true" tabindex="-1"></a><span class="st">{re_command}</span> {</span>
<span id="cb44-34"><a href="#cb44-34" aria-hidden="true" tabindex="-1"></a>	yylval<span class="op">.</span>str <span class="op">=</span> strdup<span class="op">(</span>yytext<span class="op">);</span></span>
<span id="cb44-35"><a href="#cb44-35" aria-hidden="true" tabindex="-1"></a>	BEGIN IN_PARAMS<span class="op">;</span></span>
<span id="cb44-36"><a href="#cb44-36" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> COMMAND<span class="op">;</span></span>
<span id="cb44-37"><a href="#cb44-37" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb44-38"><a href="#cb44-38" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-39"><a href="#cb44-39" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-40"><a href="#cb44-40" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_TAGS&gt;\+?({re_host}\/)?{re_keyname}(={re_keyval})?</span>  {</span>
<span id="cb44-41"><a href="#cb44-41" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> map_pair <span class="op">*</span>p <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span> <span class="op">*</span>p<span class="op">);</span></span>
<span id="cb44-42"><a href="#cb44-42" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> <span class="op">*</span>split <span class="op">=</span> strchr<span class="op">(</span>yytext<span class="op">,</span> <span class="ch">&#39;=&#39;</span><span class="op">);</span></span>
<span id="cb44-43"><a href="#cb44-43" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>split<span class="op">)</span></span>
<span id="cb44-44"><a href="#cb44-44" aria-hidden="true" tabindex="-1"></a>		<span class="op">*</span>split <span class="op">=</span> <span class="ch">&#39;</span><span class="sc">\0</span><span class="ch">&#39;</span><span class="op">;</span></span>
<span id="cb44-45"><a href="#cb44-45" aria-hidden="true" tabindex="-1"></a>	<span class="op">*</span>p <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> map_pair<span class="op">)</span>{</span>
<span id="cb44-46"><a href="#cb44-46" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>k <span class="op">=</span> strdup<span class="op">(</span>yytext<span class="op">),</span></span>
<span id="cb44-47"><a href="#cb44-47" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>v <span class="op">=</span> split <span class="op">?</span> strdup<span class="op">(</span>split<span class="op">+</span><span class="dv">1</span><span class="op">)</span> <span class="op">:</span> calloc<span class="op">(</span><span class="dv">1</span><span class="op">,</span><span class="dv">1</span><span class="op">)</span></span>
<span id="cb44-48"><a href="#cb44-48" aria-hidden="true" tabindex="-1"></a>	}<span class="op">;</span></span>
<span id="cb44-49"><a href="#cb44-49" aria-hidden="true" tabindex="-1"></a>	yylval<span class="op">.</span>pair <span class="op">=</span> p<span class="op">;</span></span>
<span id="cb44-50"><a href="#cb44-50" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> TAG<span class="op">;</span></span>
<span id="cb44-51"><a href="#cb44-51" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb44-52"><a href="#cb44-52" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_TAGS&gt;{re_space}</span> {</span>
<span id="cb44-53"><a href="#cb44-53" aria-hidden="true" tabindex="-1"></a>	BEGIN INITIAL<span class="op">;</span></span>
<span id="cb44-54"><a href="#cb44-54" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> SPACE<span class="op">;</span></span>
<span id="cb44-55"><a href="#cb44-55" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb44-56"><a href="#cb44-56" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_TAGS&gt;;</span> { <span class="cf">return</span> <span class="ch">&#39;;&#39;</span><span class="op">;</span> }</span>
<span id="cb44-57"><a href="#cb44-57" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-58"><a href="#cb44-58" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-59"><a href="#cb44-59" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_PREFIX&gt;({re_host})|({re_nick})(!{re_user})?(@{re_host})?</span> {</span>
<span id="cb44-60"><a href="#cb44-60" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> prefix <span class="op">*</span>p <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span> <span class="op">*</span>p<span class="op">);</span></span>
<span id="cb44-61"><a href="#cb44-61" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>p<span class="op">)</span></span>
<span id="cb44-62"><a href="#cb44-62" aria-hidden="true" tabindex="-1"></a>		<span class="cf">goto</span> done<span class="op">;</span></span>
<span id="cb44-63"><a href="#cb44-63" aria-hidden="true" tabindex="-1"></a>	<span class="op">*</span>p <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> prefix<span class="op">)</span>{<span class="dv">0</span>}<span class="op">;</span></span>
<span id="cb44-64"><a href="#cb44-64" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> <span class="op">*</span>bang <span class="op">=</span> strchr<span class="op">(</span>yytext<span class="op">,</span> <span class="ch">&#39;!&#39;</span><span class="op">),</span></span>
<span id="cb44-65"><a href="#cb44-65" aria-hidden="true" tabindex="-1"></a>	     <span class="op">*</span>at   <span class="op">=</span> strchr<span class="op">(</span>yytext<span class="op">,</span> <span class="ch">&#39;@&#39;</span><span class="op">);</span></span>
<span id="cb44-66"><a href="#cb44-66" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>bang <span class="op">&amp;&amp;</span> <span class="op">!</span>at<span class="op">)</span></span>
<span id="cb44-67"><a href="#cb44-67" aria-hidden="true" tabindex="-1"></a>	{</span>
<span id="cb44-68"><a href="#cb44-68" aria-hidden="true" tabindex="-1"></a>		p<span class="op">-&gt;</span>host <span class="op">=</span> strdup<span class="op">(</span>yytext<span class="op">);</span></span>
<span id="cb44-69"><a href="#cb44-69" aria-hidden="true" tabindex="-1"></a>		<span class="cf">goto</span> done<span class="op">;</span></span>
<span id="cb44-70"><a href="#cb44-70" aria-hidden="true" tabindex="-1"></a>	}</span>
<span id="cb44-71"><a href="#cb44-71" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>bang<span class="op">)</span> <span class="op">*</span>bang <span class="op">=</span> <span class="ch">&#39;</span><span class="sc">\0</span><span class="ch">&#39;</span><span class="op">;</span></span>
<span id="cb44-72"><a href="#cb44-72" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>at<span class="op">)</span> <span class="op">*</span>at <span class="op">=</span> <span class="ch">&#39;</span><span class="sc">\0</span><span class="ch">&#39;</span><span class="op">;</span></span>
<span id="cb44-73"><a href="#cb44-73" aria-hidden="true" tabindex="-1"></a>	p<span class="op">-&gt;</span>nick <span class="op">=</span> strdup<span class="op">(</span>yytext<span class="op">);</span></span>
<span id="cb44-74"><a href="#cb44-74" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>bang<span class="op">)</span></span>
<span id="cb44-75"><a href="#cb44-75" aria-hidden="true" tabindex="-1"></a>		p<span class="op">-&gt;</span>user <span class="op">=</span> strdup<span class="op">(</span>bang<span class="op">+</span><span class="dv">1</span><span class="op">);</span></span>
<span id="cb44-76"><a href="#cb44-76" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>at<span class="op">)</span></span>
<span id="cb44-77"><a href="#cb44-77" aria-hidden="true" tabindex="-1"></a>		p<span class="op">-&gt;</span>host <span class="op">=</span> strdup<span class="op">(</span>at<span class="op">+</span><span class="dv">1</span><span class="op">);</span></span>
<span id="cb44-78"><a href="#cb44-78" aria-hidden="true" tabindex="-1"></a>done<span class="op">:</span></span>
<span id="cb44-79"><a href="#cb44-79" aria-hidden="true" tabindex="-1"></a>	yylval<span class="op">.</span>prefix <span class="op">=</span> p<span class="op">;</span></span>
<span id="cb44-80"><a href="#cb44-80" aria-hidden="true" tabindex="-1"></a>	BEGIN INITIAL<span class="op">;</span></span>
<span id="cb44-81"><a href="#cb44-81" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> PREFIX<span class="op">;</span></span>
<span id="cb44-82"><a href="#cb44-82" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb44-83"><a href="#cb44-83" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-84"><a href="#cb44-84" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-85"><a href="#cb44-85" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_PARAMS&gt;{re_space}</span> { <span class="cf">return</span> SPACE<span class="op">;</span> }</span>
<span id="cb44-86"><a href="#cb44-86" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_PARAMS&gt;{re_middle}</span> {</span>
<span id="cb44-87"><a href="#cb44-87" aria-hidden="true" tabindex="-1"></a>	yylval<span class="op">.</span>str <span class="op">=</span> strdup<span class="op">(</span>yytext<span class="op">);</span></span>
<span id="cb44-88"><a href="#cb44-88" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> MIDDLE<span class="op">;</span></span>
<span id="cb44-89"><a href="#cb44-89" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb44-90"><a href="#cb44-90" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;IN_PARAMS&gt;:{re_trailing}</span> {</span>
<span id="cb44-91"><a href="#cb44-91" aria-hidden="true" tabindex="-1"></a>	yylval<span class="op">.</span>str <span class="op">=</span> strdup<span class="op">(</span>yytext<span class="op">+</span><span class="dv">1</span><span class="op">);</span> <span class="co">/* trim : */</span></span>
<span id="cb44-92"><a href="#cb44-92" aria-hidden="true" tabindex="-1"></a>	BEGIN INITIAL<span class="op">;</span></span>
<span id="cb44-93"><a href="#cb44-93" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> TRAILING<span class="op">;</span></span>
<span id="cb44-94"><a href="#cb44-94" aria-hidden="true" tabindex="-1"></a>}</span>
<span id="cb44-95"><a href="#cb44-95" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb44-96"><a href="#cb44-96" aria-hidden="true" tabindex="-1"></a><span class="st">&lt;*&gt;\n|\r\n</span>  <span class="op">;</span> <span class="co">/* ignore */</span></span></code></pre></div>
<p>Build irc.y and irc.l according to our typical pattern (and link with libderp). Here’s an example of the IRCv3 parser in action:</p>
<div class="sourceCode" id="cb45"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb45-1"><a href="#cb45-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Try an example from</span></span>
<span id="cb45-2"><a href="#cb45-2" aria-hidden="true" tabindex="-1"></a><span class="co"># https://ircv3.net/specs/extensions/message-tags#examples</span></span>
<span id="cb45-3"><a href="#cb45-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb45-4"><a href="#cb45-4" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./irc <span class="op">&lt;&lt;EOF</span></span>
<span id="cb45-5"><a href="#cb45-5" aria-hidden="true" tabindex="-1"></a><span class="st">@aaa=bbb;ccc;example.com/ddd=eee :nick!ident@host.com PRIVMSG me :Hello</span></span>
<span id="cb45-6"><a href="#cb45-6" aria-hidden="true" tabindex="-1"></a><span class="op">EOF</span></span></code></pre></div>
<pre><code>Tags:
        &#39;aaa&#39;=&#39;bbb&#39;
        &#39;ccc&#39;=&#39;&#39;
        &#39;example.com/ddd&#39;=&#39;eee&#39;
Prefix: Nick nick, User ident, Host host.com
Command: PRIVMSG
Params:
        me
        Hello</code></pre>
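<p>The <code>IN_TAGS</code> action above assumes that <code>malloc</code> and <code>strdup</code> always succeed. As a rough sketch of the same key/value split with allocation failures handled, here is a standalone C version that leaves its input untouched. The names <code>struct tag_pair</code>, <code>tag_pair_parse</code>, and <code>tag_pair_free</code> are hypothetical; they are not part of the parser or of libderp.</p>

```c
#define _XOPEN_SOURCE 600 /* for strdup; must come before any #include */

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the key/value pair built by the lexer. */
struct tag_pair {
	char *k;
	char *v;
};

/* Release a pair and both of its strings. Safe to call with NULL. */
void tag_pair_free(struct tag_pair *p)
{
	if (!p)
		return;
	free(p->k);
	free(p->v);
	free(p);
}

/* Split "key=value" (or a bare "key") into freshly allocated strings.
 * A missing value becomes the empty string, as in the lexer action.
 * Returns NULL if any allocation fails. */
struct tag_pair *tag_pair_parse(const char *text)
{
	struct tag_pair *p;
	const char *split;
	size_t klen;

	assert(text != NULL);
	p = malloc(sizeof *p);
	if (!p)
		return NULL;

	split = strchr(text, '=');
	klen = split ? (size_t)(split - text) : strlen(text);

	/* Copy the key without writing a terminator into the input. */
	p->k = malloc(klen + 1);
	if (p->k) {
		memcpy(p->k, text, klen);
		p->k[klen] = '\0';
	}
	p->v = split ? strdup(split + 1) : calloc(1, 1);

	if (!p->k || !p->v) {
		tag_pair_free(p);
		return NULL;
	}
	return p;
}
```

<p>A lexer action could call <code>tag_pair_parse(yytext)</code> and decide what to return when the result is <code>NULL</code>; how the grammar should react to an out-of-memory condition is left open here.</p>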
<h3 id="further-resources">Further resources</h3>
<ul>
<li>POSIX (issue 7) specifications for <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/lex.html">Lex</a> and <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/yacc.html">Yacc</a>. (To view POSIX docs locally, try <a href="https://github.com/begriffs/posix-man">begriffs/posix-man</a>.)</li>
<li><em>Lex &amp; Yacc, 2nd ed</em> by John R. Levine, Tony Mason, and Doug Brown. Levine subsequently wrote an updated book called <em>flex &amp; bison: Text Processing Tools</em>. However, I chose the older book to get a better feel for history and portability.</li>
<li>To bridge the gap between core knowledge and the latest features, consult the <a href="https://www.gnu.org/software/bison/manual/">GNU Bison manual</a> and the Flex manual. (You can build the Flex manual <a href="https://github.com/westes/flex/tree/master/doc">from source</a>, or download <a href="../pdf/flex.pdf">version 2.6.4</a> that I’ve pre-built for you as PDF.)</li>
<li><em>Effective Flex &amp; Bison</em> by Chris L. verBurg is a collection of tips for “correctness, efficiency, robustness, complexity, maintainability and usability.” It’s clear Chris has plenty of experience writing real-world parsers.</li>
<li>Vim has classic yacc highlighting built in, but you can add support for Bison extensions with <a href="https://github.com/justinmk/vim-syntax-extra">justinmk/vim-syntax-extra</a>.</li>
</ul>]]></summary>
</entry>
<entry>
    <title>Dynamic linking best practices</title>
    <link href="https://begriffs.com/posts/2021-07-04-shared-libraries.html" />
    <id>https://begriffs.com/posts/2021-07-04-shared-libraries.html</id>
    <published>2021-07-04T00:00:00Z</published>
    <updated>2021-07-04T00:00:00Z</updated>
    <summary type="html"><![CDATA[<p>In this article we’ll learn how to build shared libraries and install them properly on several platforms. For guidance, we’ll examine the goals and history of dynamic linking on UNIX-based operating systems.</p>
<p>Content for the article comes from researching how to create a shared library, wading through sloppy conventions that people recommend online, and testing on multiple Unix-like systems. Hopefully it can set the record straight and help improve the quality of open source libraries.</p>
<ul>
<li><a href="#the-common-unix-pattern">The common UNIX pattern</a></li>
<li><a href="#versioning">Versioning</a>
<ul>
<li><a href="#version-identifiers">Version identifiers</a></li>
<li><a href="#api-vs-abi">API vs ABI</a></li>
</ul></li>
<li><a href="#variance-of-linker-and-loader-by-system">Variance of linker and loader by system</a>
<ul>
<li><a href="#linkers-ld-lld">Linkers (ld, lld)</a></li>
<li><a href="#loaders-ld.so-dyld">Loaders (ld.so, dyld)</a></li>
</ul></li>
<li><a href="#portable-best-practices">Portable best practices</a>
<ul>
<li><a href="#linking">Linking</a></li>
<li><a href="#loading">Loading</a></li>
</ul></li>
<li><a href="#example-code">Example code</a></li>
</ul>
<h2 id="the-common-unix-pattern">The common UNIX pattern</h2>
<p>The design typically used nowadays for dynamic linking (in BSD, MacOS, and Linux) came from SunOS in 1988. The paper <a href="https://www.cs.cornell.edu/courses/cs414/2001FA/sharedlib.pdf">Shared Libraries in SunOS</a> neatly explains the goals, design, and implementation.</p>
<p>The authors’ main motivations were saving disk and memory space, and upgrading libraries (or the OS) without needing to relink programs. The resource usage motivation is probably less important on today’s powerful personal computers than it was in 1988. However, the flexibility to upgrade libraries is as useful as ever, as well as the ability to easily inspect which library versions each application uses.</p>
<p>Dynamic linking is not without its critics, and isn’t appropriate in all situations. It runs a little slower because of position-independent code (PIC) and late loading. (The SunOS paper called it a “classic space/time trade-off.”) The complexity of the loader on some systems offers <a href="http://www.nth-dimension.org.uk/pub/BTL.pdf">increased attack surface</a>. Finally, upgraded libraries may affect some programs differently than others, for instance breaking those that rely on undocumented behavior.</p>
<h3 id="the-link-editor-and-loader">The link editor and loader</h3>
<p>At link time, the link editor resolves symbols against the specified libraries, and makes a note in the resulting binary to load those libraries. At runtime, the application runs loader code that maps the shared libraries into memory and binds their symbols to the correct addresses.</p>
<p>SunOS and subsequent UNIX-like systems added compile-time flags to the linker (ld) to generate – or link against – dynamically linked libraries. The designers also added a special system library (ld.so) with code to find and load other libraries for an application. The pre-<code>main()</code> initialization routine of a program loads ld.so and runs it from within the program to find and load the rest of the required libraries.</p>
<h2 id="versioning">Versioning</h2>
<p>As mentioned, applications can take advantage of updated libraries without needing recompilation. Library updates can be classified in three categories:</p>
<ol type="1">
<li>Implementation improvements for the current interface. Bug fixes, performance. (Patch release)</li>
<li>New features, additions to the interface. (Minor release)</li>
<li>Backward-incompatible change to the interface or its operation. (Major release)</li>
</ol>
<p>An application linked against a library at a given major release will continue to work properly when loading any newer minor or patch release. Applications may not work properly when loading a different major release, or an earlier minor release than that used at link time.</p>
<p>Multiple applications can exist on a machine at once, and each may require different releases of a single library. The system should provide a way to store multiple library releases and load the right one for each app. Different systems have different ways to do it, as we’ll see later.</p>
<h3 id="version-identifiers">Version identifiers</h3>
<p>Each library release can be marked with a version identifier (or “version”) which seeks to capture information about the library’s release history. There are multiple ways to map release history to a version identifier.</p>
<p>The two most common mapping systems are <em>semantic versioning</em> and <em>libtool versioning.</em> Semantic versioning counts the number of releases of various kinds that have happened, and writes them in lexicographic order. Libtool versioning counts distinct library interfaces.</p>
<p>Semantic versioning is written as <code>major.minor.patch</code> and libtool as <code>current:revision:age</code>. The intuition is that <code>current</code> counts interface changes. Any time the interface changes, whether in a minor or major way, <code>current</code> increases. Here’s how each system would record the same history of release events:</p>
<table class="table">
<thead>
<tr>
<th>
Event
</th>
<th>
Semver
</th>
<th>
Libtool
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
Initial
</td>
<td>
1.0.0
</td>
<td>
1:0:0
</td>
</tr>
<tr>
<td>
Minor
</td>
<td>
1.1.0
</td>
<td>
2:0:1
</td>
</tr>
<tr>
<td>
Minor
</td>
<td>
1.2.0
</td>
<td>
3:0:2
</td>
</tr>
<tr>
<td>
Patch
</td>
<td>
1.2.1
</td>
<td>
3:1:2
</td>
</tr>
<tr>
<td>
Major
</td>
<td>
2.0.0
</td>
<td>
4:0:0
</td>
</tr>
<tr>
<td>
Patch
</td>
<td>
2.0.1
</td>
<td>
4:1:0
</td>
</tr>
<tr>
<td>
Patch
</td>
<td>
2.0.2
</td>
<td>
4:2:0
</td>
</tr>
<tr>
<td>
Minor
</td>
<td>
2.1.0
</td>
<td>
5:0:1
</td>
</tr>
</tbody>
</table>
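<p>The rules behind the table can be written down as two small update functions. This is only a sketch of the bookkeeping, assuming the conventional libtool update rules that the table follows; the type and function names (<code>semver_bump</code>, <code>libtool_bump</code>, <code>replay</code>) are made up for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>

enum release { PATCH, MINOR, MAJOR };

struct semver  { unsigned major, minor, patch; };
struct libtool { unsigned current, revision, age; };

/* Semantic versioning: bump one counter and reset the ones to its right. */
void semver_bump(struct semver *v, enum release r)
{
	assert(v != NULL);
	switch (r) {
	case PATCH: v->patch++; break;
	case MINOR: v->minor++; v->patch = 0; break;
	case MAJOR: v->major++; v->minor = 0; v->patch = 0; break;
	}
}

/* Libtool versioning: any interface change creates a new `current`.
 * `age` counts how many earlier interfaces are still supported, so it
 * grows on a compatible addition and resets on a breaking change. */
void libtool_bump(struct libtool *v, enum release r)
{
	assert(v != NULL);
	switch (r) {
	case PATCH: v->revision++; break;
	case MINOR: v->current++; v->revision = 0; v->age++; break;
	case MAJOR: v->current++; v->revision = 0; v->age = 0; break;
	}
}

/* Replay a release history on both identifiers, starting from the
 * "Initial" row of the table (1.0.0 and 1:0:0). */
void replay(const enum release *events, size_t n,
            struct semver *s, struct libtool *l)
{
	*s = (struct semver){1, 0, 0};
	*l = (struct libtool){1, 0, 0};
	for (size_t i = 0; i < n; i++) {
		semver_bump(s, events[i]);
		libtool_bump(l, events[i]);
	}
}
```

<p>Replaying the seven events after “Initial” in the table (minor, minor, patch, major, patch, patch, minor) ends at 2.1.0 and 5:0:1, matching the last row.</p>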
<p>Here’s how applications answer the question, <strong>“Can I load a given library?”</strong></p>
<dl>
<dt>
Semver
</dt>
<dd>
Does the library to be loaded have the same major version as the library I linked with, and a minor version at least as big?
</dd>
<dt>
Libtool
</dt>
<dd>
Is the <code>current</code> interface number of the library I linked with between <code>current - age</code> and <code>current</code> of the library to be loaded?
</dd>
</dl>
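<p>Both questions reduce to a short predicate. The following is a minimal sketch of the two rules as stated above, not code taken from any real loader; <code>semver_can_load</code> and <code>libtool_can_load</code> are hypothetical names.</p>

```c
#include <assert.h>
#include <stdbool.h>

struct semver  { unsigned major, minor, patch; };
struct libtool { unsigned current, revision, age; };

/* Semver rule: same major version, and a minor version at least as
 * big as the one the application was linked with. */
bool semver_can_load(struct semver linked, struct semver candidate)
{
	return candidate.major == linked.major
	    && candidate.minor >= linked.minor;
}

/* Libtool rule: the interface number the application was linked with
 * must fall within [current - age, current] of the candidate. */
bool libtool_can_load(unsigned linked_current, struct libtool candidate)
{
	/* A well-formed identifier never supports more old interfaces
	 * than it has had; this also guards the unsigned subtraction. */
	assert(candidate.age <= candidate.current);
	return linked_current >= candidate.current - candidate.age
	    && linked_current <= candidate.current;
}
```

<p>Using rows from the table: an application linked against 1.1.0 (libtool interface 2) can load 1.2.1 (3:1:2), but not 2.0.0 (4:0:0), and not the older 1.0.0 (1:0:0).</p>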
<p>We’ll be using semantic versioning in this guide, because libtool versioning is only relevant to libtool, a tool to abstract library creation across platforms. I believe we can make portable libraries without libtool. I mention both systems only to show that there’s more than one way to build version identifiers.</p>
<p>One final note: version identifiers say that things <em>have</em> changed, but omit <em>what</em> changed. More complicated systems exist to track library compatibility. Solaris, for instance, developed a system called symbol versioning. Symbol versioning chases space savings at the expense of operational complexity, and we’ll consider it later.</p>
<h3 id="api-vs-abi">API vs ABI</h3>
<p>One subtlety of versioning is that changes can happen in either a library’s <em>programming</em> interface (API) or <em>binary</em> interface (ABI). A C library’s programming interface is defined through its header files. A backward-incompatible API change means a program written for the previous version would not compile when including headers from the new version.</p>
<p>By contrast, a binary interface is a runtime concept. It concerns the calling conventions for functions, or the memory layout (and meaning) of data shared between program and library. The ABI ensures compatibility at load time and runtime, while the API ensures compatibility at compile and link time.</p>
<p>The two interfaces usually change hand-in-hand, and people sometimes confuse them. It’s possible for one to change without the other, though.</p>
<p><strong>Examples of breaking ABI, but API stability:</strong></p>
<p>In these library changes, application code doesn’t need to change, but does need to be recompiled with the new library headers in order to work at runtime.</p>
<ul>
<li>A changed numerical value behind a #define constant. A program compiled before the change would pass the wrong value to the library.</li>
<li>Reordered elements in a struct. The program and library would read different offsets in memory thinking they’re referring to the same element, which is definitely an ABI break. Even adding an element <em>after</em> the others changes the structure’s size, and hence its layout within an array. Whether a field added at the end breaks a particular library’s ABI depends on who allocates the structure: if applications only ever handle pointers to structs that the library allocates, the extra field goes unnoticed.</li>
<li>Widening function arguments. For instance changing a short int argument to a long int on an architecture/compiler where their size differs. Recompilation would be necessary to handle e.g. sign extension, or the offset of the next argument.</li>
<li>Other languages, including C++, have more opportunities for surprise ABI breakage.</li>
</ul>
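<p>To make the struct reordering case concrete, here is a small simulation in a single file. It assumes a hypothetical options struct whose two fields swap places between library versions; the “library” function reads the caller’s bytes using the new layout, as it would across a real library boundary.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Layout the application was compiled against (hypothetical v1 header). */
struct opts_v1 { int width; int height; };

/* Layout inside the upgraded library: same fields, reordered. */
struct opts_v2 { int height; int width; };

/* Stand-in for a function in the new library. It receives raw bytes
 * and interprets them with the v2 layout. */
int lib_v2_get_width(const void *opts)
{
	struct opts_v2 o;

	assert(opts != NULL);
	memcpy(&o, opts, sizeof o);
	return o.width;
}

/* Stand-in for an application built before the upgrade. Its source
 * would still compile against the new header, but this old binary
 * writes `width` at the offset where the library now expects `height`. */
int app_built_against_v1(void)
{
	struct opts_v1 o = { .width = 640, .height = 480 };

	return lib_v2_get_width(&o);
}
```

<p>Here <code>app_built_against_v1()</code> returns 480 rather than 640. Nothing crashes and nothing fails to compile, which is what makes this kind of break unpleasant to track down.</p>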
<p><strong>Examples of ABI stability, but breaking API:</strong></p>
<p>In these library changes, application code would need to be modified to compile successfully against the new library, even though code compiled before the change could load and call the library without issue.</p>
<ul>
<li>Changing an argument from <code>const foo *</code> to <code>foo *</code>. A pointer to a const object cannot be implicitly converted to a pointer to a non-const object. The ABI doesn’t care though, and moves the same bytes. (If the library does in fact modify the dereferenced value, it may be an unpleasant surprise to the application of course.)</li>
<li>Changing the name of a struct element, while keeping its meaning and leaving it in the same position relative to the other elements.</li>
</ul>
<p>It’s usually easy to tell when you’ve added functionality vs broken backward compatibility, but there are tools to check for sure. For instance, the <a href="https://lvc.github.io/abi-compliance-checker/">ABI Compliance Checker</a> can detect breakages in C and C++ libraries.</p>
<p>In light of the versioning discussion earlier, which changes should the version identifier describe? At the very least, the ABI. When the loader is searching for a library, the ABI determines whether a library would be compatible at runtime. However, I think a more conservative versioning scheme is wise, where you bump the version when <em>either</em> the API or the ABI changes. You’ll end up with potentially more library versions installed, but each shared API/ABI version will provide guarantees at both compilation and runtime.</p>
<h2 id="variance-of-linker-and-loader-by-system">Variance of linker and loader by system</h2>
<h3 id="linkers-ld-lld">Linkers (ld, lld)</h3>
<p>After compiling object files, the compiler front-end (gcc, clang, cc, c99) will invoke the linker (ld, lld) to find unresolved symbols and match them across object files or in shared libraries. The linker searches only the shared libraries requested by the front-end, in the order specified on the command line. If an unresolved symbol is found in a listed library, the linker marks a dependency on that library in the generated executable.</p>
<p>The <code>-l</code> option adds a library to the list of candidates for symbol search. To add <code>libfoo.so</code> (or <code>libfoo.dylib</code> on Mac), specify <code>-lfoo</code>. The linker looks for the library files in its search path. To add directories to the default search path(s), use <code>-L</code>, for instance <code>-L/usr/local/lib</code>.</p>
<p>What happens if multiple versions of a library exist in the same directory? For instance two major versions, <code>libfoo.so.1</code> and <code>libfoo.so.2</code>? OpenBSD knows about version numbers, and would pick the highest version automatically for <code>-lfoo</code>. Linux and Mac would match neither, because they’re looking for an exact match of <code>libfoo.so</code> (or <code>libfoo.dylib</code>). Similarly, what if both a static and dynamic library exist in the same directory, <code>libfoo.a</code> and <code>libfoo.so</code>? All systems will choose the dynamic one.</p>
<p>Greater control is necessary. The GNU linker accepts a colon form to solve the problem, for instance <code>-l:libfoo.so.1</code>. However, it is a linker extension that not every toolchain provides, so a truly portable build shouldn’t rely on it. Some systems solve the problem by creating a symlink from <code>libfoo.so</code> to the specific library desired. However, when done in a system location like <code>/usr/local/lib</code>, it nominates a single inflexible link-time version for the whole system. I’ll suggest a different solution later that involves storing link-time files in a separate place from load-time libraries.</p>
<h3 id="loaders-ld.so-dyld">Loaders (ld.so, dyld)</h3>
<p>At launch time, programs with dynamic library dependencies load and run ld.so (or dyld on Mac) to find and load the rest of their dependencies. The loader inspects DT_NEEDED ELF tags (or LC_LOAD_DYLIB load commands in Mach-O on Mac) to determine which library filenames to find on the system. Interestingly, these values are not specified by the program developer, but by the library developer. They are extracted from the libraries themselves at link time.</p>
<p>Dynamic libraries contain an internal “runtime name” called SONAME in ELF, or install_name in Mach-O. An application may link against a file named <code>libfoo.so</code>, but the library SONAME can say, “search for me under the filename libfoo.so.1.2 at load time.” The loader cares only about filenames; it never consults SONAMEs. Conversely, the linker’s output cares only about SONAMEs, not input library filenames.</p>
<p>Loaders in different operating systems go about finding dependent libraries slightly differently. OpenBSD’s ld.so is very true to the SunOS model, and <a href="https://www.openbsd.org/faq/ports/specialtopics.html#SharedLibs">understands semantic versions</a>. For instance, if asked to load libfoo.so.1.2, it will attempt to find libfoo.so.1.x with the largest x ≥ 2. FreeBSD also <a href="https://docs.freebsd.org/en/books/developers-handbook/policies/#policies-shlib">claims</a> to have this behavior, but I didn’t observe it in my <a href="https://github.com/begriffs/test-ld.so">tests</a>.</p>
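<p>The minor version rendezvous can be modeled in a few lines. This is a simplified sketch of the selection rule described above, not OpenBSD’s actual ld.so code: it picks from a list of filenames instead of scanning directories, and the function names are hypothetical.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct libname {
	size_t stem_len;        /* length of the part before ".so." */
	unsigned major, minor;
};

/* Parse an unsigned decimal number that starts at s. Returns a pointer
 * just past the digits, or NULL if s does not start with a digit. */
static const char *parse_uint(const char *s, unsigned *out)
{
	char *end;

	if (!isdigit((unsigned char)*s))
		return NULL;
	*out = (unsigned)strtoul(s, &end, 10);
	return end;
}

/* Accept names of the exact form "<stem>.so.<major>.<minor>". */
static bool parse_libname(const char *name, struct libname *out)
{
	const char *so = strstr(name, ".so.");
	const char *p;

	if (!so)
		return false;
	out->stem_len = (size_t)(so - name);
	p = parse_uint(so + 4, &out->major);
	if (!p || *p != '.')
		return false;
	p = parse_uint(p + 1, &out->minor);
	return p && *p == '\0';
}

/* Return the index of the file with the same stem and major version as
 * `wanted` and the largest minor version >= the wanted one, or -1 if
 * no file qualifies. */
int pick_library(const char *wanted, const char *const files[], size_t n)
{
	struct libname want, have;
	unsigned best_minor = 0;
	int best = -1;

	assert(wanted != NULL);
	if (!parse_libname(wanted, &want))
		return -1;
	for (size_t i = 0; i < n; i++) {
		if (!parse_libname(files[i], &have))
			continue;
		if (have.stem_len != want.stem_len
		    || strncmp(files[i], wanted, want.stem_len) != 0)
			continue;
		if (have.major != want.major || have.minor < want.minor)
			continue;
		if (best < 0 || have.minor > best_minor) {
			best = (int)i;
			best_minor = have.minor;
		}
	}
	return best;
}
```

<p>Given libfoo.so.1.1, libfoo.so.1.2, libfoo.so.1.5 and libfoo.so.2.0, a request for libfoo.so.1.2 selects libfoo.so.1.5, while a request for libfoo.so.1.6 finds nothing. A loader that insists on an exact filename match would only ever accept libfoo.so.1.2 in the first case.</p>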
<p>In 1995, Solaris 2.5 created a way to track semantic versioning at the symbol level, rather than for the entire library. With symbol versioning there would be a single e.g. libfoo.so file that simply grows over time. Every function inside is marked with a version number. The same function name can even exist under multiple versions with different implementations.</p>
<p>The advantage of symbol versioning is that it can save space. In the alternative, where versioning is per-library rather than per-symbol, a large percentage of object code is often copied unchanged from one library version to the next. The disadvantages of symbol versioning are:</p>
<ol type="1">
<li>It’s harder to see exactly which versions are installed on a system. Versions are hidden within libraries, rather than visible in filenames.</li>
<li>Library developers have to maintain a separate symbol mapfile for the linker.</li>
</ol>
<p>Symbol versioning quickly found its way into Linux, and became a staple of Glibc. Because of Linux’s symbol versioning preference, its ld.so doesn’t make any effort to rendezvous with the latest minor library version (à la SunOS or OpenBSD). Ld.so searches for an exact match between SONAME and filename.</p>
<p>However, even on Linux, most libraries don’t use symbol versioning. Also, their SONAMEs typically record only a major version (like libfoo.so.2). Within that major version, you just have to hope the hidden minor version is new enough for all applications compiled or installed on the system. If an app relies on functions added in a later minor library version, it’ll crash when it attempts to call them. (Setting the environment variable <code>LD_BIND_NOW=1</code> will attempt to resolve all symbols at program start instead, to detect the failure up front.)</p>
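<p>To see which names an application will ask the loader for, one option is to read its dynamic section with <code>readelf -d</code> from GNU binutils (assumed to be installed). The helper below, a hypothetical <code>needed_names</code>, only filters that output; the paths in the comments are placeholders.</p>

```shell
#!/bin/sh
# needed_names: read `readelf -d` output on stdin and print the name in
# each NEEDED entry. These are the filenames the loader will look for.
needed_names() {
	sed -n 's/.*(NEEDED).*\[\(.*\)].*/\1/p'
}

# Typical use (placeholder paths):
#   readelf -d ./myapp | needed_names
#   readelf -d /usr/local/lib/libfoo.so.2 | grep SONAME
#   LD_BIND_NOW=1 ./myapp    # resolve every symbol at startup
```

<p>Each name printed must exist as a file (or symlink) somewhere on the loader’s search path, which is why the SONAME chosen at build time matters so much.</p>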
<p>MacOS uses an entirely different object format (Mach-O rather than ELF), and a differently named loader library (dyld rather than ld.so). Mac’s dynamically linked libraries are named <code>.dylib</code>, and their version numbers precede the extension.</p>
<p>Native Mac applications are usually installed into their own dedicated directories, with libraries bundled inside. Thus the loader has special provisions for finding libraries, like the keywords <code>@executable_path</code>, <code>@loader_path</code> and <code>@rpath</code> in the <code>install_name</code>. MacOS supports system libraries too, with dyld consulting the <code>DYLD_FALLBACK_LIBRARY_PATH</code>, by default <code>$(HOME)/lib:/usr/local/lib:/lib:/usr/lib</code>.</p>
<p>Like Linux, Mac does an exact name match – no minor version rendezvous. Unlike Linux, Mac libraries can record internally both their full semantic version and a “compatibility” version. The compatibility version gets copied into an application at link time, and says the application requires at least that version at runtime.</p>
<p>For example, <code>libfoo.1.dylib</code> with full version 1.2.3 should have a compatibility version of 1.2.0 according to the rules of semantic versioning. An application linked against it would refuse to load a libfoo with a lesser minor version, like 1.1.5. At load time, the user would see a clear error:</p>
<pre><code>dyld: Library not loaded: libfoo.1.dylib
  Referenced from: myapp
  Reason: Incompatible library version: myapp requires version 1.2.0 or later,
          but libfoo.1.dylib provides version 1.1.5</code></pre>
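<p>The comparison behind that error can be pictured as an ordinary dotted-version test. The following is a rough sketch of the idea, not dyld’s implementation; <code>version_ge</code> is a hypothetical helper that treats missing components as zero.</p>

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is at least version B.
# Up to three numeric components are compared; missing ones count as 0.
version_ge() {
	awk -v a="$1" -v b="$2" 'BEGIN {
		split(a, x, ".")
		split(b, y, ".")
		for (i = 1; i <= 3; i++) {
			if (x[i] + 0 > y[i] + 0) exit 0   # A is newer
			if (y[i] + 0 > x[i] + 0) exit 1   # A is older
		}
		exit 0                                # versions are equal
	}'
}
```

<p>With this helper, <code>version_ge 1.1.5 1.2.0</code> fails, mirroring the refusal shown above, while <code>version_ge 1.2.3 1.2.0</code> succeeds.</p>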
<h2 id="portable-best-practices">Portable best practices</h2>
<h3 id="linking">Linking</h3>
<p>Standard practice is to create symlinks libfoo.so -&gt; libfoo.so.x -&gt; libfoo.so.x.y.z in a shared system directory. The first link (without the version number) is for linking at build time. Problem is, it’s pinned to one version. There’s no portable way to select which version to link against when there are multiple versions installed.</p>
<p>Also, standard practice gives even less care to versioning header files. Sometimes whichever version was most recently installed overwrites them in /usr/local/include. Sometimes the headers are maintained only at the major version level, in /usr/local/include/libfoo-n.</p>
<p>To solve these problems, I suggest bundling all development (linking) library files together into a different directory structure per version. Since I advocated earlier that the “total” library version should be bumped whenever the API <em>or</em> ABI changes, the same version safely applies to headers and binaries.</p>
<p>First choose an installation PREFIX. If the system has an /opt directory, pick that, otherwise /usr/local. In this directory, add dynamic and/or static libraries, headers, man pages, and pkg-config files as desired:</p>
<pre><code>$PREFIX/libfoo-dev.x.y.z
├── libfoo.pc
├── libfoo-static.pc
├── include
│   └── foo
│       ├── ...
│       └── ...
├── lib
│   ├── libfoo.so (or dylib or dll)
│   └── static
│       └── libfoo.a
└── man
    ├── ...
    └── ...</code></pre>
<p>Linking against libfoo.x.y.z is easy. In a Makefile, set your flags like this:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode makefile"><code class="sourceCode makefile"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="dt">CFLAGS  </span><span class="ch">+=</span><span class="st"> -I/opt/libfoo-dev.x.y.z/include </span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="dt">LDFLAGS </span><span class="ch">+=</span><span class="st"> -L/opt/libfoo-dev.x.y.z/lib</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="dt">LDLIBS  </span><span class="ch">+=</span><span class="st"> -lfoo</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a><span class="co"># an example suffix rule using the flags</span></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="ot">.c:</span></span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a>	<span class="ch">$(</span><span class="dt">CC</span><span class="ch">)</span> <span class="ch">$(</span><span class="dt">CFLAGS</span><span class="ch">)</span> <span class="ch">$(</span><span class="dt">LDFLAGS</span><span class="ch">)</span> -o <span class="ch">$@</span> <span class="ch">$&lt;</span> <span class="ch">$(</span><span class="dt">LDLIBS</span><span class="ch">)</span></span></code></pre></div>
<h3 id="version-flexibility-with-pkg-config">Version flexibility with pkg-config</h3>
<p><a href="https://www.freedesktop.org/wiki/Software/pkg-config/">Pkg-config</a> can allow an application to express a range of acceptable library versions, rather than hardcoding a specific one. In a configure script, we’ll test for the library’s presence and version, and output the flags to <code>config.mk</code>:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co"># supposing we require libfoo 1.x for x &gt;= 1</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="ex">pkg-config</span> <span class="at">--print-errors</span> <span class="st">&#39;libfoo &gt;= 1.1, libfoo &lt; 2.0&#39;</span></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="co"># save flags to config.mk</span></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a><span class="fu">cat</span> <span class="op">&gt;</span> config.mk <span class="op">&lt;&lt;-EOF</span></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="st">	CFLAGS += </span><span class="va">$(</span><span class="ex">pkg-config</span> <span class="at">--cflags</span> libfoo<span class="va">)</span></span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a><span class="st">	LDFLAGS += </span><span class="va">$(</span><span class="ex">pkg-config</span> <span class="at">--libs-only-L</span> libfoo<span class="va">)</span></span>
<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a><span class="st">	LDLIBS += </span><span class="va">$(</span><span class="ex">pkg-config</span> <span class="at">--libs-only-l</span> libfoo<span class="va">)</span></span>
<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a><span class="op">EOF</span></span></code></pre></div>
<p>Then our Makefile becomes:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode makefile"><code class="sourceCode makefile"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="kw">include</span> config.mk</span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="ot">.c:</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a>	<span class="ch">$(</span><span class="dt">CC</span><span class="ch">)</span> <span class="ch">$(</span><span class="dt">CFLAGS</span><span class="ch">)</span> <span class="ch">$(</span><span class="dt">LDFLAGS</span><span class="ch">)</span> -o <span class="ch">$@</span> <span class="ch">$&lt;</span> <span class="ch">$(</span><span class="dt">LDLIBS</span><span class="ch">)</span></span></code></pre></div>
<p>To choose a specific version of libfoo, we can add it to the pkg-config search path and run the configure script:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co"># make desired libfoo version visible to pkg-config</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="bu">export</span> <span class="va">PKG_CONFIG_PATH</span><span class="op">=</span><span class="st">&quot;/opt/libfoo-dev.x.y.z:</span><span class="va">$PKG_CONFIG_PATH</span><span class="st">&quot;</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a><span class="ex">./configure</span></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a><span class="fu">make</span></span></code></pre></div>
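<p>One gap in the configure snippet is that it carries on even when the version test fails. Here is a minimal sketch of a guard, assuming the script should stop before writing <code>config.mk</code>. The name <code>require_pkg</code> is hypothetical; the <code>--exists</code> and <code>--print-errors</code> options are standard pkg-config flags.</p>

```shell
#!/bin/sh
# require_pkg SPEC: succeed only if pkg-config can satisfy SPEC,
# for example 'libfoo >= 1.1'. Diagnostics go to stderr.
require_pkg() {
	if pkg-config --exists --print-errors "$1"; then
		return 0
	fi
	echo "configure: unmet dependency: $1" >&2
	return 1
}
```

<p>In a configure script this would be called as <code>require_pkg 'libfoo >= 1.1' || exit 1</code> ahead of the lines that generate <code>config.mk</code>.</p>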
<p>To create pkg-config <code>.pc</code> files for a library, see <a href="https://people.freedesktop.org/~dbn/pkg-config-guide.html">Dan Nicholson’s guide</a>. In order to offer both a static and dynamic library, the best way I could imagine was to release separate files, <code>libfoo.pc</code> and <code>libfoo-static.pc</code> that differ in their <code>-L</code> flag. One uses <code>lib</code> and another <code>lib/static</code>. (Pkg-config’s <code>--static</code> flag is a bit of a misnomer, and just passes items in <code>Libs.private</code> <em>in addition to</em> <code>Libs</code> in the build process.)</p>
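<p>As an illustration of the two-file approach, the sketch below prints a small <code>.pc</code> file whose only variable part is the library subdirectory. The function name <code>write_pc</code>, the description text and the example prefix are assumptions made for the example; the field names follow the pkg-config file format.</p>

```shell
#!/bin/sh
# write_pc PREFIX VERSION LIBSUBDIR: print a pkg-config file on stdout.
# LIBSUBDIR is "lib" for the dynamic library or "lib/static" for the
# static one. Single quotes keep ${prefix} literal for pkg-config.
write_pc() {
	printf 'prefix=%s\n' "$1"
	printf '%s\n' 'includedir=${prefix}/include'
	printf 'libdir=${prefix}/%s\n\n' "$3"
	printf 'Name: libfoo\n'
	printf 'Description: example library (placeholder text)\n'
	printf 'Version: %s\n' "$2"
	printf '%s\n' 'Cflags: -I${includedir}'
	printf '%s\n' 'Libs: -L${libdir} -lfoo'
}
```

<p>Running <code>write_pc /opt/libfoo-dev.1.2.3 1.2.3 lib > libfoo.pc</code> and then the same command with <code>lib/static</code> redirected to <code>libfoo-static.pc</code> yields two files that differ only in their <code>libdir</code> line.</p>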
<h3 id="loading">Loading</h3>
<p>This section talks about installing dynamic libraries for system-wide loading. Libraries installed for this purpose are meant for loading at runtime, not for linking at build time.</p>
<h4 id="elf-installation-bsdlinux">ELF installation (BSD/Linux)</h4>
<p>ELF objects don’t have much version metadata. SONAME is about it. That, combined with the lackluster behavior of loaders on some systems, means the traditional installation technique doesn’t work too well.</p>
<p>Let’s review the traditional way to install ELF libraries, and then a safer method I designed.</p>
<p><strong>Traditional installation method</strong></p>
<ol type="1">
<li>For version x.y.z, compile libfoo.so with SONAME libfoo.so.x</li>
<li>Copy libfoo.so to /usr/local/lib/libfoo.so.x.y.z</li>
<li>Create symlink libfoo.so.x -&gt; libfoo.so.x.y.z</li>
</ol>
<p>This way allows a sysadmin to see exactly which versions are installed, and to have multiple major versions installed at once. It doesn’t allow multiple minor versions per major (although usually only the latest minor is needed), and more importantly doesn’t offer protection against loading too old a minor version.</p>
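<p>For reference, the traditional steps might look like this in a portable install script. It is a sketch under stated assumptions: <code>install_traditional</code> is a hypothetical name, the library is called libfoo, and the compile command in the comment uses common gcc/clang flags.</p>

```shell
#!/bin/sh
# install_traditional SRC DEST VERSION
# Copy an already-built libfoo.so to DEST under its full version and
# point the major-version name at it.
# The library itself would have been built with something like:
#   cc -shared -fPIC -Wl,-soname,libfoo.so.$major -o libfoo.so foo.c
install_traditional() {
	src=$1 dest=$2 ver=$3
	major=${ver%%.*}                 # "1.2.3" becomes "1"
	mkdir -p "$dest" || return 1
	cp "$src" "$dest/libfoo.so.$ver" || return 1
	ln -fs "libfoo.so.$ver" "$dest/libfoo.so.$major"
}
```

<p>Installing version 1.3.0 afterwards repoints libfoo.so.1 at the newer file, which is how only one minor version per major stays reachable.</p>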
<p><strong>Safer installation method</strong></p>
<ol type="1">
<li><p>For version x.y.z, compile libfoo.so with SONAME libfoo.so.x.y</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co"># use compilation flags</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="ex">-shared</span> <span class="at">-Wl,-soname,libfoo.so.</span><span class="va">${MAJOR}</span><span class="at">.</span><span class="va">${MINOR}</span></span></code></pre></div></li>
<li><p>Copy libfoo.so to /usr/local/lib/libfoo.so.x.y.z</p></li>
<li><p>Backfill minor version symlinks in DEST:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="va">i</span><span class="op">=</span>0</span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="cf">while</span> <span class="bu">[</span> <span class="va">$i</span> <span class="ot">-le</span> <span class="st">&quot;</span><span class="va">$MINOR</span><span class="st">&quot;</span> <span class="bu">]</span><span class="kw">;</span> <span class="cf">do</span></span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>	<span class="fu">ln</span> <span class="at">-fs</span> <span class="st">&quot;libfoo.so.</span><span class="va">$VER</span><span class="st">&quot;</span> <span class="st">&quot;</span><span class="va">$DEST</span><span class="st">/libfoo.so.</span><span class="va">$MAJOR</span><span class="st">.</span><span class="va">$i</span><span class="st">&quot;</span></span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a>	<span class="va">i</span><span class="op">=</span><span class="va">$((i</span><span class="op">+</span><span class="dv">1</span><span class="va">))</span></span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a><span class="cf">done</span></span></code></pre></div></li>
</ol>
<p>At the cost of potentially a lot of minor version symlinks, this technique emulates the SunOS and OpenBSD behavior of minor version rendezvous. Also, because the SONAME has major.minor granularity, it will protect against loading too old a minor version.</p>
<p>(As an alternative to the symlinks, FreeBSD has <a href="https://nixdoc.net/man-pages/FreeBSD/man5/libmap.conf.5.html">libmap.conf</a>.)</p>
<h4 id="mach-o-installation-macos">Mach-O installation (MacOS)</h4>
<p>Mach-O has more version metadata inside than ELF, so a traditional install works fine here.</p>
<ol type="1">
<li><p>For version x.y.z, compile libfoo.dylib with</p>
<ul>
<li>install_name libfoo.x.dylib</li>
<li>current version x.y.z</li>
<li>compatibility version x.y.0</li>
</ul>
<div class="sourceCode" id="cb9"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co"># use compilation flags</span></span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="ex">-dynamiclib</span> <span class="at">-install_name</span> <span class="st">&quot;libfoo.</span><span class="va">${MAJOR}</span><span class="st">.dylib&quot;</span> <span class="dt">\</span></span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>            <span class="at">-current_version</span> <span class="va">${VER}</span> <span class="dt">\</span></span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a>            <span class="at">-compatibility_version</span> <span class="va">${MAJOR}</span>.<span class="va">${MINOR}</span>.0</span></code></pre></div></li>
<li><p>Copy libfoo.dylib to /usr/local/lib/libfoo.x.dylib</p></li>
</ol>
<p>It’s important to set the compatibility version correctly so that Mac’s dyld will prevent loading too old a minor version. To upgrade the library, overwrite libfoo.x.dylib with one of a later internal minor release.</p>
<h2 id="example-code">Example code</h2>
<p>For an example of how to build a library portably, and install it conveniently for the linker and loader, see <a href="https://github.com/begriffs/libderp">begriffs/libderp</a>. It’s my first shared library, where I tested the ideas for this article.</p>]]></summary>
</entry>
<entry>
    <title>Tips for stable and portable software</title>
    <link href="https://begriffs.com/posts/2020-08-31-portable-stable-software.html" />
    <id>https://begriffs.com/posts/2020-08-31-portable-stable-software.html</id>
    <published>2020-08-31T00:00:00Z</published>
    <updated>2020-08-31T00:00:00Z</updated>
    <summary type="html"><![CDATA[<p>After several years’ involvement with quickly evolving programming languages, I’ve come to appreciate stability. I’d like to make my programs easy to build on a wide variety of systems with minimal adjustment. I’d like them to keep working long into the future as environments change.</p>
<p>To think about stability more clearly, let’s divide a functioning program into its layers. Then we can examine development choices one layer at a time.</p>
<figure>
<img src="/images/portability.png" alt="concentric circles of program resources" /><figcaption aria-hidden="true">concentric circles of program resources</figcaption>
</figure>
<p>The more features a program needs, the further out it must reach through the layers.</p>
<div class="alert alert-info" role="alert">
<h4>
Correction
</h4>
<p>The operating system should be listed as the outermost layer, instead of 3rd-party libraries. Libraries are often designed to be portable across operating systems.</p>
</div>
<h3 id="layer-0-programming-language">Layer 0: Programming language</h3>
<h4 id="choose-a-language-with-multiple-implementations-and-a-standard">Choose a language with multiple implementations and a standard</h4>
<p>Every language has to start somewhere, often as an implementation by a single person or small group. At this stage the language evolves rapidly, and to be fair it’s this stage that advances the state of the art.</p>
<p>However, using a language in its single-implementation stage means you’re committing a percentage of your energy to the “research project” of the language itself. You’ll deal with breaking changes (including tools), and experimental dead-ends.</p>
<p>If you love the idea behind a new language, or believe it’s a winner and that your early familiarity will pay off, then go for it! Otherwise use a language that has advanced beyond a single implementation. That way you can focus on your domain of expertise rather than keeping up with a language research agenda.</p>
<p>Languages get to the next stage when groups of people fork them for new situations and architectures. Some people add features, other people discover difficulties in their environments. Stakeholders then debate and reach consensus through a standardization process. The end result is that the standard, rather than a particular software artifact, defines the language and has the final say.</p>
<p>Naturally the whole thing takes a while. Standardized languages are going to be fairly old. They’ll miss out on recent ideas, but will be well understood. Here are some mature languages with standards:</p>
<ul>
<li>Ada</li>
<li>C</li>
<li>Common Lisp</li>
<li>ECMAScript</li>
<li>Pascal</li>
<li>SQL</li>
</ul>
<p>I’ve been using C lately because of its portability, simple (yet expressive) abstract machine model, and deep compatibility with POSIX and foundational libraries.</p>
<h4 id="avoid-or-wrap-compiler-language-extensions">Avoid – or wrap – compiler language extensions</h4>
<p>If you’re using a language with a standard, take advantage of it. First, choose a specific version of the standard. Older versions are generally more widely supported, but have fewer features. In the C world I usually pick C99 because it has some conveniences over C89, and is still supported pretty much everywhere (although only partially on Windows).</p>
<p>Consult your compiler documentation to see if the compiler can catch accidental uses of non-standard behavior. In clang or gcc, add the following flags to your Makefile:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode makefile"><code class="sourceCode makefile"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="co"># enforce a specific version of the standard</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="dt">CFLAGS </span><span class="ch">+=</span><span class="st"> -std=c99 -pedantic</span></span></code></pre></div>
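<p>Substitute another version for “c99” as desired. The <code>-pedantic</code> flag issues the diagnostics that strict ISO C demands for forbidden extensions, and for some other code that does not follow the standard. In gcc and clang these are warnings by default; <code>-pedantic-errors</code> turns them into errors.</p>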
<p>If you do want to use compiler extensions (such as <a href="https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html">those in gcc</a> or <a href="http://clang.llvm.org/docs/LanguageExtensions.html">clang</a>), wrap them behind your own macros so that the code stays portable. The PostgreSQL project does this kind of thing in <a href="https://github.com/postgres/postgres/blob/master/src/include/c.h">c.h</a>. Here’s an example at random:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="co"> * Use &quot;pg_attribute_always_inline&quot; in place of &quot;inline&quot; for functions that</span></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="co"> * we wish to force inlining of, even when the compiler&#39;s heuristics would</span></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="co"> * choose not to.  But, if possible, don&#39;t force inlining in unoptimized</span></span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="co"> * debug builds.</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="co"> */</span></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#if (defined(__GNUC__) &amp;&amp; __GNUC__ &gt; 3 &amp;&amp; defined(__OPTIMIZE__)) || defined(__SUNPRO_C) || defined(__IBMC__)</span></span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="co">/* GCC &gt; 3, Sunpro and XLC support always_inline via __attribute__ */</span></span>
<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#define pg_attribute_always_inline __attribute__((always_inline)) inline</span></span>
<span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#elif defined(_MSC_VER)</span></span>
<span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a><span class="co">/* MSVC has a special keyword for this */</span></span>
<span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a><span class="pp">#define pg_attribute_always_inline __forceinline</span></span>
<span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a><span class="pp">#else</span></span>
<span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a><span class="co">/* Otherwise, the best we can do is to say &quot;inline&quot; */</span></span>
<span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a><span class="pp">#define pg_attribute_always_inline inline</span></span>
<span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a><span class="pp">#endif</span></span></code></pre></div>
<p>Notice how they adapt to various compilers and provide a final fallback. Of course, avoiding extensions in the first place is the simplest option, when possible.</p>
<h3 id="layer-1-standard-library">Layer 1: Standard library</h3>
<h4 id="learn-it-and-consult-the-standard">Learn it, and consult the standard</h4>
<p>Take time to learn your language’s standard library. It’s a freebie: you get it wherever your program goes. Read about the library functions in the language standard, since they will be covered there.</p>
<p>Gaining knowledge of the standard library can help reduce reliance on unnecessary third-party libraries. The ECMAScript world, for instance, is rife with micro-libraries that attempt to supplement the <a href="https://ecma-international.org/ecma-262/11.0/">ECMA standard</a>’s real or perceived shortcomings.</p>
<p>The size of a single-implementation language’s library is a trade-off between ease of implementation and ease of use. A giant library like that in the <a href="https://golang.org/pkg/#stdlib">Go</a> language makes it harder for creators of would-be rival implementations, and thus slows the progress to a robust standard.</p>
<p>To learn more about the C standard library, see <a href="https://begriffs.com/posts/2019-01-19-inside-c-standard-lib.html">my article</a>.</p>
<h4 id="learn-the-rationale-and-gotchas">Learn the rationale and gotchas</h4>
<p>Because standards bodies avoid breaking existing codebases, and because stable languages are slow to change, there will be weird or dangerous functions in the standard library. However, the dangers are <em>well known</em> and documented in supporting literature, unlike the dangers in new, relatively untested systems.</p>
<p>Here are some great books for C:</p>
<ul>
<li>“The CERT C Coding Standard” by Robert C. Seacord (ISBN 978-0321984043). Illustrates potential insecurity with, among other things, the standard library. Lists real code that caused vulnerabilities.</li>
<li>“The Standard C Library” by P. J. Plauger (ISBN 978-0131315099). Thorough details about the C89 stdlib.</li>
<li>“C Traps and Pitfalls” by Andrew Koenig (ISBN 978-0201179286).</li>
<li>“C Programming FAQs” by Steve Summit (ISBN 978-0201845198). I can see why these were historically the most frequently asked questions. I asked many of them myself.</li>
</ul>
<p>Also the C99 standard has an accompanying <a href="http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf">rationale document</a>. It talks about alternate designs considered and rejected.</p>
<h3 id="layer-2-posix">Layer 2: POSIX</h3>
<p>Similarly to how competing C implementations led to the C standard, the <a href="https://www.livinginternet.com/i/iw_unix_war.htm">Unix wars</a> led to POSIX. POSIX specifies a “lowest common denominator” interface that <a href="https://en.wikipedia.org/wiki/POSIX#POSIX-oriented_operating_systems">many operating systems</a> honor to a greater or lesser degree.</p>
<h4 id="read-the-spec-compare-with-man-pages">Read the spec, compare with man pages</h4>
<p>Whenever you use system calls outside the C standard library, check whether they’re part of POSIX, and whether their official description differs from your local man pages. The Open Group offers a free searchable <a href="https://pubs.opengroup.org/onlinepubs/9699919799/">HTML version of POSIX.1</a>. As of this writing it’s POSIX.1-2017 (which is POSIX.1-2008 plus two technical corrigenda).</p>
<p>There’s one more complication: POSIX.1-2008 (aka “Issue 7”) isn’t fully supported everywhere. (For instance I found that macOS doesn’t support pthread barriers, semaphores, or asynchronous thread cancellation.) I think the root cause is that 2008 requires thread and real-time functionality that was previously in optional extensions. If you stick to functionality in POSIX.1-2001 (aka <a href="https://pubs.opengroup.org/onlinepubs/007904975/">Issue 6</a>) you should be safe on all reasonably recent platforms.</p>
<h4 id="activate-a-version">Activate a version</h4>
<p>To call POSIX functions you must define the <code>_POSIX_C_SOURCE</code> “feature test” macro before including header files. Select a specific POSIX version by using one of these values:</p>
<table class="table">
<thead>
<tr>
<th>
Edition
</th>
<th>
Release year
</th>
<th>
Macro value
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
1
</td>
<td>
1988
</td>
<td>
<em>(N/A)</em>
</td>
</tr>
<tr>
<td>
2
</td>
<td>
1990
</td>
<td>
1
</td>
</tr>
<tr>
<td>
3
</td>
<td>
1992
</td>
<td>
2
</td>
</tr>
<tr>
<td>
4
</td>
<td>
1993
</td>
<td>
199309L
</td>
</tr>
<tr>
<td>
5
</td>
<td>
1995
</td>
<td>
199506L
</td>
</tr>
<tr>
<td>
6
</td>
<td>
2001
</td>
<td>
200112L
</td>
</tr>
<tr>
<td>
7
</td>
<td>
2008
</td>
<td>
200809L
</td>
</tr>
</tbody>
</table>
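<p>To find out which edition a given machine claims to support, the POSIX <code>getconf</code> utility can report <code>_POSIX_VERSION</code>. The sketch below maps the date-style values from the table to issue numbers; <code>posix_issue</code> is a hypothetical helper, and any other value is reported as unknown.</p>

```shell
#!/bin/sh
# posix_issue VALUE: translate a _POSIX_VERSION value such as 200809
# (an optional trailing L is accepted) into the issue number from the
# table above.
posix_issue() {
	case $1 in
		200809*) echo 7 ;;
		200112*) echo 6 ;;
		199506*) echo 5 ;;
		199309*) echo 4 ;;
		*)       echo unknown ;;
	esac
}
```

<p>Calling <code>posix_issue "$(getconf _POSIX_VERSION)"</code> prints the issue the system advertises, which is a claim by the platform rather than a guarantee that every function is implemented.</p>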
<p>Header files hide or reveal functions based on the feature test macro. For example, the <a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/getline.html">getline()</a> function from Issue 7 allocates memory and reads a line.</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* line.c */</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;sys/types.h&gt;</span><span class="pp"> </span><span class="co">/* ssize_t */</span></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> <span class="op">*</span>line <span class="op">=</span> NULL<span class="op">;</span></span>
<span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> len <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a>	<span class="dt">ssize_t</span> read<span class="op">;</span></span>
<span id="cb3-11"><a href="#cb3-11" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">((</span>read <span class="op">=</span> getline<span class="op">(&amp;</span>line<span class="op">,</span> <span class="op">&amp;</span>len<span class="op">,</span> stdin<span class="op">))</span> <span class="op">!=</span> <span class="op">-</span><span class="dv">1</span><span class="op">)</span></span>
<span id="cb3-12"><a href="#cb3-12" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;Length %zd: %s&quot;</span><span class="op">,</span> read<span class="op">,</span> line<span class="op">);</span></span>
<span id="cb3-13"><a href="#cb3-13" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>line<span class="op">);</span></span>
<span id="cb3-14"><a href="#cb3-14" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb3-15"><a href="#cb3-15" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Trying to use <code>getline()</code> on Issue 6 (POSIX.1-2001) fails:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> cc <span class="at">-std</span><span class="op">=</span>c99 <span class="at">-pedantic</span> <span class="at">-Werror</span> <span class="at">-D_POSIX_C_SOURCE</span><span class="op">=</span>200112L line.c <span class="at">-o</span> line</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="ex">line.c:10:17:</span> error: implicit declaration of function <span class="st">&#39;getline&#39;</span> is invalid in C99 [-Werror,-Wimplicit-function-declaration]</span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a>        <span class="cf">while</span> <span class="kw">((</span><span class="va">read</span> <span class="op">=</span> <span class="va">getline</span>(<span class="op">&amp;</span><span class="va">line</span><span class="kw">,</span> <span class="op">&amp;</span><span class="va">len</span><span class="kw">,</span> <span class="va">stdin</span>)) <span class="op">!=</span> <span class="op">-</span><span class="dv">1</span>)</span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a>                       <span class="op">^</span></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="dv">1</span> <span class="va">error</span> <span class="va">generated</span>.</span></code></pre></div>
<p>Selecting Issue 7 with <code>-D_POSIX_C_SOURCE=200809L</code> fixes it.</p>
<p><strong>Important note:</strong> setting <code>_POSIX_C_SOURCE</code> will hide non-POSIX operating system extras in the standard headers. The best practice is to separate your source files into those that are POSIX conformant, and those (hopefully few) that aren’t. Compile the latter without the feature macro and link them all together at the end.</p>
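<p>One way to keep that split visible is to have each conformant file declare its own POSIX level in the source, rather than relying only on a command-line flag. Here is a minimal sketch, assuming the macro is defined before any header is included. <code>count_lines()</code> is a hypothetical helper made up for illustration:</p>

```c
/* must come before any #include so that the headers see it */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: count the lines in a stream using POSIX getline().
 * Returns -1 if the stream reports a read error. */
long count_lines(FILE *fp)
{
	char *line = NULL;
	size_t cap = 0;
	long n = 0;

	assert(fp != NULL);
	/* getline() grows the buffer as needed and returns -1 at EOF or error */
	while (getline(&line, &cap, fp) != -1)
		n++;
	free(line);
	return ferror(fp) ? -1 : n;
}
```

<p>Files that need OS extras would simply omit the <code>#define</code>, so it is easy to see which translation units are restricted to POSIX.</p>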
<h4 id="use-posix-in-the-build-process-too">Use POSIX in the build process too</h4>
<p>POSIX defines the interface for not just the library functions discussed earlier, but for the shell and common tools too. If you use those tools for your builds then you don’t need to install any extra software on destination machines to compile your project.</p>
<p>Probably the most common sources of accidental lock-in are bashisms and GNU extensions to Make. For scripts, use <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sh.html">sh</a>, and use (POSIX) <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/make.html">make</a> for Makefiles. Too many projects use GNU features needlessly. In fact, learning the portable subset of Make features leads to cleaner, more reliable builds.</p>
<p>This is a topic for an entire article of its own. Chris Wellons wrote a <a href="https://nullprogram.com/blog/2017/08/20/">nice tutorial</a> about it. Also “Managing Projects with make” by Andrew Oram (ISBN 0-937175-90-0) is a little book that’s packed with good advice.</p>
<h3 id="layer-3-operating-system-extras">Layer 3: Operating system extras</h3>
<p>Operating systems offer useful functionality beyond POSIX. Examples include extensions to pthreads (setting reader-writer preference or thread processor affinity), access to specialized hardware (like audio or graphics), alternate I/O interfaces and semantics, and functions for safety like strlcpy or pledge.</p>
<p>Three ways to use these features portably are to:</p>
<ol type="1">
<li>wrap them in your own interface and conditionally compile the implementation, or</li>
<li>build a static shim library (“libcompat”) as part of your project to use when functionality is missing, or</li>
<li>link to a third party library that abstracts the details.</li>
</ol>
<p>We’ll talk about third-party libraries later. Let’s look at option one now.</p>
<h4 id="detecting-os-functions">Detecting OS functions</h4>
<p>Consider the example of generating random data. It requires help from the OS since POSIX offers only <em>pseudo-</em>random numbers.</p>
<p>We’ll split our Makefile into two parts:</p>
<ol type="1">
<li><code>Makefile</code> – specifies targets, dependencies, and rules that hold on all systems</li>
<li><code>config.mk</code> – sets macros and build flags specific to the local system</li>
</ol>
<p>The Makefile will include the specifics of <code>config.mk</code> like this:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode makefile"><code class="sourceCode makefile"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co"># inside the Makefile...</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="co"># set up common options and then...</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="kw">include</span> config.mk</span></code></pre></div>
<p>We’ll generate <code>config.mk</code> with a <code>configure</code> script. A developer will run the script before their first build to detect the environment options. The most primitive way for <code>configure</code> to work would be to parse the output of <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uname.html">uname</a> and make decisions based on what OS or distro it sees. A more accurate way is to directly probe for the OS C functions we need.</p>
<p>To see if a C function exists, we can just try compiling test snippets of code and see if they succeed. You might think this is awkward or that it requires cluttering your project with test code, but it’s actually pretty elegant.</p>
<p>First make this shell script helper function:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="fu">compiles ()</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="kw">{</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>	<span class="va">stage</span><span class="op">=</span><span class="st">&quot;</span><span class="va">$(</span><span class="fu">mktemp</span> <span class="at">-d</span><span class="va">)</span><span class="st">&quot;</span></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>	<span class="bu">echo</span> <span class="st">&quot;</span><span class="va">$2</span><span class="st">&quot;</span> <span class="op">&gt;</span> <span class="st">&quot;</span><span class="va">$stage</span><span class="st">/test.c&quot;</span></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a>	<span class="kw">(</span><span class="fu">cc</span> <span class="at">-Werror</span> <span class="st">&quot;</span><span class="va">$1</span><span class="st">&quot;</span> <span class="at">-o</span> <span class="st">&quot;</span><span class="va">$stage</span><span class="st">/test&quot;</span> <span class="st">&quot;</span><span class="va">$stage</span><span class="st">/test.c&quot;</span> <span class="op">&gt;</span>/dev/null <span class="dv">2</span><span class="op">&gt;&amp;</span>1<span class="kw">)</span></span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a>	<span class="va">cc_success</span><span class="op">=</span><span class="va">$?</span></span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a>	<span class="fu">rm</span> <span class="at">-rf</span> <span class="st">&quot;</span><span class="va">$stage</span><span class="st">&quot;</span></span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="va">$cc_success</span></span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a><span class="kw">}</span></span></code></pre></div>
<p>The <code>compiles()</code> function takes two arguments: an optional compiler flag, and the source code to attempt to compile.</p>
<div class="alert alert-info" role="alert">
<h4>
Portability
</h4>
<p>Note that <code>mktemp</code> and <code>cc</code> are not specified by POSIX. You can write your own <code>mktemp</code> function using POSIX primitives, but I wanted to conserve space in this example. For <code>cc</code>, the spec offers <code>c99</code> (or <code>c89</code> in 4th edition POSIX). However, the <code>c99</code> utility doesn’t allow controlling warning levels, and I wanted to specify that warnings be treated as errors. The <code>cc</code> alias is a common de-facto standard.</p>
</div>
<p>Let’s use the helper to check for OS random number generators. The BSD world offers <a href="https://man.openbsd.org/arc4random_buf.3">arc4random_buf</a> to get random bytes, and Linux offers <a href="https://man7.org/linux/man-pages/man2/getrandom.2.html">getrandom</a>. The <code>configure</code> script can check for each feature like this:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="ex">compiles</span> <span class="st">&quot;&quot;</span> <span class="st">&quot;</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="st">	#include &lt;stdint.h&gt;</span></span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="st">	#include &lt;stdlib.h&gt;</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a><span class="st">	int main(void)</span></span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a><span class="st">	{</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="st">		void (*p)(void *, size_t) = arc4random_buf;</span></span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a><span class="st">		return (intptr_t)p;</span></span>
<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a><span class="st">	}&quot;</span></span>
<span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a><span class="cf">then</span></span>
<span id="cb7-10"><a href="#cb7-10" aria-hidden="true" tabindex="-1"></a>	<span class="bu">echo</span> <span class="st">&quot;CFLAGS += -DHAVE_ARC4RANDOM&quot;</span> <span class="op">&gt;&gt;</span> config.mk</span>
<span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a><span class="cf">fi</span></span>
<span id="cb7-12"><a href="#cb7-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-13"><a href="#cb7-13" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="ex">compiles</span> <span class="st">&quot;-D_POSIX_C_SOURCE=200112L&quot;</span> <span class="st">&quot;</span></span>
<span id="cb7-14"><a href="#cb7-14" aria-hidden="true" tabindex="-1"></a><span class="st">	#include &lt;stdint.h&gt;</span></span>
<span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a><span class="st">	#include &lt;sys/types.h&gt;</span></span>
<span id="cb7-16"><a href="#cb7-16" aria-hidden="true" tabindex="-1"></a><span class="st">	#include &lt;sys/random.h&gt;</span></span>
<span id="cb7-17"><a href="#cb7-17" aria-hidden="true" tabindex="-1"></a><span class="st">	int main(void)</span></span>
<span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a><span class="st">	{</span></span>
<span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a><span class="st">		ssize_t (*p)(void *, size_t, unsigned int) = getrandom;</span></span>
<span id="cb7-20"><a href="#cb7-20" aria-hidden="true" tabindex="-1"></a><span class="st">		return (intptr_t)p;</span></span>
<span id="cb7-21"><a href="#cb7-21" aria-hidden="true" tabindex="-1"></a><span class="st">	}&quot;</span></span>
<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a><span class="cf">then</span></span>
<span id="cb7-23"><a href="#cb7-23" aria-hidden="true" tabindex="-1"></a>	<span class="bu">echo</span> <span class="st">&quot;CFLAGS += -DHAVE_GETRANDOM&quot;</span> <span class="op">&gt;&gt;</span> config.mk</span>
<span id="cb7-24"><a href="#cb7-24" aria-hidden="true" tabindex="-1"></a><span class="cf">fi</span></span></code></pre></div>
<p>See? Not too bad. These code snippets test not only whether the functions exist, but also check their type signatures. Notice how the second example is compiled with POSIX for the <code>ssize_t</code> type, while the first example is intentionally <em>not</em> marked POSIX conformant because doing so would hide the extra function <code>arc4random_buf</code> that BSD puts in <code>stdlib.h</code>.</p>
<h4 id="wrap-os-functions-behind-your-own">Wrap OS functions behind your own</h4>
<p>It’s helpful to isolate the use of non-portable functions in a distinct translation unit, and export your own interface on top. That way it’s more straightforward to set up conditional compilation in one place, or to refactor in the future.</p>
<p>Let’s continue the example from the previous section of generating random bytes. With the hard work of OS feature detection behind us, we can wrap the differing OS interfaces behind our own function:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdint.h&gt;</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#ifdef HAVE_GETRANDOM</span></span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;sys/random.h&gt;</span></span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#endif</span></span>
<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> get_random_bytes<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>buf<span class="op">,</span> <span class="dt">size_t</span> n<span class="op">)</span></span>
<span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#if defined HAVE_ARC4RANDOM  </span><span class="co">/* BSD */</span></span>
<span id="cb8-10"><a href="#cb8-10" aria-hidden="true" tabindex="-1"></a>	arc4random_buf<span class="op">(</span>buf<span class="op">,</span> n<span class="op">);</span></span>
<span id="cb8-11"><a href="#cb8-11" aria-hidden="true" tabindex="-1"></a><span class="pp">#elif defined HAVE_GETRANDOM </span><span class="co">/* Linux */</span></span>
<span id="cb8-12"><a href="#cb8-12" aria-hidden="true" tabindex="-1"></a>	getrandom<span class="op">(</span>buf<span class="op">,</span> n<span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb8-13"><a href="#cb8-13" aria-hidden="true" tabindex="-1"></a><span class="pp">#else</span></span>
<span id="cb8-14"><a href="#cb8-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#error OS does not provide recognized function to get entropy</span></span>
<span id="cb8-15"><a href="#cb8-15" aria-hidden="true" tabindex="-1"></a><span class="pp">#endif</span></span>
<span id="cb8-16"><a href="#cb8-16" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>The Makefile defines <code>HAVE_ARC4RANDOM</code> or <code>HAVE_GETRANDOM</code> using CFLAGS when the corresponding functions exist. The code can just use ifdefs. Notice the <code>#error</code> in the <code>#else</code> case to fail compilation with a clear message on unsupported platforms.</p>
<p>The degree of portability we strive for involves trade-offs. For example, we could add a fallback to reading <code>/dev/random</code>. The configure script from the previous section could check whether the device exists:</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="bu">test</span> <span class="at">-c</span> /dev/random<span class="kw">;</span> <span class="cf">then</span></span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a>	<span class="bu">echo</span> <span class="st">&quot;CFLAGS += -DHAVE_DEVRANDOM&quot;</span> <span class="op">&gt;&gt;</span> config.mk</span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="cf">fi</span></span></code></pre></div>
<p>Using that information, we could add another <code>#elif</code> in <code>get_random_bytes()</code> so that it can potentially work on more systems. However, in this case, the increased portability would require a change in interface. Since <code>fopen()</code> or <code>fread()</code> on <code>/dev/random</code> could fail, our function would need to return bool. The current version gets away with a void return because <code>arc4random_buf()</code> can’t fail and we ignore the return value of <code>getrandom()</code>. Strictly speaking, <code>getrandom()</code> can report an error or a short read, so a more robust wrapper would check it.</p>
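<p>To make the interface change concrete, here is a rough sketch of what a device fallback might call. The helper name <code>read_random_device()</code> and its path parameter are hypothetical, chosen only for illustration. The point is that every step can fail, so the result has to be reported to the caller:</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: fill buf with n bytes read from a random device
 * such as /dev/random. Returns false if the device can't be opened or
 * yields fewer than n bytes. */
bool read_random_device(const char *path, void *buf, size_t n)
{
	FILE *f;
	size_t got;

	assert(path != NULL);
	f = fopen(path, "rb");
	if (f == NULL)
		return false;
	/* unbuffered, so stdio doesn't read ahead beyond what we asked for */
	setvbuf(f, NULL, _IONBF, 0);
	got = fread(buf, 1, n, f);
	fclose(f);
	return got == n;
}
```

<p>With something like this in place, <code>get_random_bytes()</code> would return <code>bool</code> as well: <code>true</code> in the <code>arc4random_buf</code> branch, and the helper’s result in a <code>HAVE_DEVRANDOM</code> branch.</p>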
<h4 id="test-on-multiple-oses-and-hardware">Test on multiple OSes and hardware</h4>
<p>The true test of portability is, of course, building and running on multiple operating systems, compilers, and hardware architectures. It can be surprising to see what assumptions this can uncover. Testing portability early and often makes it easier to keep a program shipshape.</p>
<p>The PostgreSQL project, for instance, maintains a bunch of disparate machines known as the “buildfarm.” <a href="https://buildfarm.postgresql.org/cgi-bin/show_members.pl">Buildfarm members</a> each have their own OS, compiler, and architecture. The team compiles every new feature on these machines and runs the test suite there.</p>
<p>Focusing on the architectures alone, we can see an impressive variety in the buildfarm:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/ARM_architecture">ARM</a>: v6, v7, ARM64</li>
<li><a href="https://en.wikipedia.org/wiki/IA-64">IA-64</a></li>
<li><a href="https://en.wikipedia.org/wiki/Linux_on_IBM_Z">IBM Z</a></li>
<li><a href="https://en.wikipedia.org/wiki/MIPS_architecture">MIPS</a></li>
<li><a href="https://en.wikipedia.org/wiki/PA-RISC">PA-RISC</a></li>
<li><a href="https://en.wikipedia.org/wiki/Ppc64">PowerPC</a>, both big- and little-endian</li>
<li><a href="https://en.wikipedia.org/wiki/SPARC">SPARC</a></li>
<li><a href="https://en.wikipedia.org/wiki/X86">x86</a>, including <a href="https://en.wikipedia.org/wiki/P6_(microarchitecture)">i686</a> and <a href="https://en.wikipedia.org/wiki/Intel_80386">i386</a></li>
<li><a href="https://en.wikipedia.org/wiki/X86-64">x86-64</a></li>
</ul>
<p>Even if you have no intention to run on these architectures, testing there will lead to better code. (See my article <a href="https://begriffs.com/posts/2018-11-15-c-portability.html">C Portability Lessons from Weird Machines</a>.)</p>
<div class="alert alert-info" role="alert">
<h4>
Begriffs Buildfarm?
</h4>
<p>I’ve been thinking of assembling a buildfarm and offering a paid continuous integration service. If this interests you, please send me an email. I think the project is a good cause, and with enough subscriptions I could cover the electricity and hardware costs.</p>
</div>
<h3 id="layer-4-third-party-libraries">Layer 4: third-party libraries</h3>
<p>Many languages have their own <a href="https://en.wikipedia.org/wiki/List_of_software_package_management_systems#Application-level_package_managers">application-level package managers</a>, but C has no exclusive package manager. The language has too much history and spans too many environments to have settled on a single one. Instead people build dependencies from source, or use the OS package manager.</p>
<h4 id="build-with-pkg-config">Build with pkg-config</h4>
<p>Linking to libraries requires knowing their path, name, and compiler settings. Additionally we want to know which version is installed and whether it falls within the range our code supports. Since there’s no application-level package manager for C, we need to use another tool to discover installed libraries.</p>
<p>The most cross-platform way to find – and build against – dependency libraries is <a href="https://www.freedesktop.org/wiki/Software/pkg-config/">pkg-config</a>. The tool allows you to query system packages, regardless of how they were installed. To be compatible with pkg-config, each library <code>foo</code> provides a <code>libfoo.pc</code> file containing keys and values like this:</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode ini"><code class="sourceCode ini"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="dt">prefix</span><span class="ot">=</span><span class="st">/usr/local</span></span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="dt">exec_prefix</span><span class="ot">=</span><span class="st">${prefix}</span></span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="dt">includedir</span><span class="ot">=</span><span class="st">${prefix}/include</span></span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a><span class="dt">libdir</span><span class="ot">=</span><span class="st">${exec_prefix}/lib</span></span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a><span class="dt">Name: libfoo</span></span>
<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a><span class="dt">Description: The foo library</span></span>
<span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a><span class="dt">Version: 1.2.3</span></span>
<span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a><span class="dt">Cflags: -I${includedir}/foo</span></span>
<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a><span class="dt">Libs: -L${libdir} -lfoo</span></span></code></pre></div>
<p>The <code>pkg-config</code> executable can query the metadata and provide flags for your Makefile. Call it from your configure script like this:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># check that a sufficient version is installed</span></span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="ex">pkg-config</span> <span class="at">--print-errors</span> <span class="st">&#39;libfoo &gt;= 1.0&#39;</span></span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a><span class="co"># save flags to config.mk</span></span>
<span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a><span class="fu">cat</span> <span class="op">&gt;&gt;</span> config.mk <span class="op">&lt;&lt;-EOF</span></span>
<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a><span class="st">	CFLAGS += </span><span class="va">$(</span><span class="ex">pkg-config</span> <span class="at">--cflags</span> libfoo<span class="va">)</span></span>
<span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a><span class="st">	LDFLAGS += </span><span class="va">$(</span><span class="ex">pkg-config</span> <span class="at">--libs-only-L</span> libfoo<span class="va">)</span></span>
<span id="cb11-8"><a href="#cb11-8" aria-hidden="true" tabindex="-1"></a><span class="st">	LDLIBS += </span><span class="va">$(</span><span class="ex">pkg-config</span> <span class="at">--libs-only-l</span> libfoo<span class="va">)</span></span>
<span id="cb11-9"><a href="#cb11-9" aria-hidden="true" tabindex="-1"></a><span class="op">EOF</span></span></code></pre></div>
<p>Notice the LDLIBS vs LDFLAGS distinction. LDLIBS are options that need to go at the very end of the build line. The default POSIX make <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/make.html#tag_20_76_13_09">suffix rules</a> don’t mention LDLIBS, but here’s a rule you can use instead:</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode makefile"><code class="sourceCode makefile"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="ot">.c:</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>	<span class="ch">$(</span><span class="dt">CC</span><span class="ch">)</span> <span class="ch">$(</span><span class="dt">CFLAGS</span><span class="ch">)</span> <span class="ch">$(</span><span class="dt">LDFLAGS</span><span class="ch">)</span> -o <span class="ch">$@</span> <span class="ch">$&lt;</span> <span class="ch">$(</span><span class="dt">LDLIBS</span><span class="ch">)</span></span></code></pre></div>
<p>Sometimes an operating system will include extra functionality and package it up as a portable library you can use on other operating systems. In this case you can use pkg-config conditionally.</p>
<p>For instance, OpenBSD spun off the <a href="https://www.libressl.org/">LibreSSL</a> project (a more usable OpenSSL). OpenBSD includes the functionality internally. In the configure script just do an operating system check:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="co"># LibreSSL</span></span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="cf">case</span> <span class="st">&quot;</span><span class="va">$(</span><span class="fu">uname</span> <span class="at">-s</span><span class="va">)</span><span class="st">&quot;</span> <span class="kw">in</span></span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>	<span class="ss">OpenBSD</span><span class="kw">)</span></span>
<span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a>		<span class="co"># included with OS</span></span>
<span id="cb13-5"><a href="#cb13-5" aria-hidden="true" tabindex="-1"></a>		<span class="bu">echo</span> <span class="st">&#39;LDLIBS += -ltls&#39;</span> <span class="op">&gt;&gt;</span> config.mk</span>
<span id="cb13-6"><a href="#cb13-6" aria-hidden="true" tabindex="-1"></a>		<span class="cf">;;</span></span>
<span id="cb13-7"><a href="#cb13-7" aria-hidden="true" tabindex="-1"></a>	<span class="pp">*</span><span class="kw">)</span></span>
<span id="cb13-8"><a href="#cb13-8" aria-hidden="true" tabindex="-1"></a>		<span class="co"># requires a package</span></span>
<span id="cb13-9"><a href="#cb13-9" aria-hidden="true" tabindex="-1"></a>		<span class="ex">pkg-config</span> <span class="at">--print-errors</span> <span class="st">&#39;libtls &gt;= 2.5.0&#39;</span></span>
<span id="cb13-10"><a href="#cb13-10" aria-hidden="true" tabindex="-1"></a>		<span class="fu">cat</span> <span class="op">&gt;&gt;</span> config.mk <span class="op">&lt;&lt;-EOF</span></span>
<span id="cb13-11"><a href="#cb13-11" aria-hidden="true" tabindex="-1"></a><span class="st">			CFLAGS += </span><span class="va">$(</span><span class="ex">pkg-config</span> <span class="at">--cflags</span> libtls<span class="va">)</span></span>
<span id="cb13-12"><a href="#cb13-12" aria-hidden="true" tabindex="-1"></a><span class="st">			LDFLAGS += </span><span class="va">$(</span><span class="ex">pkg-config</span> <span class="at">--libs-only-L</span> libtls<span class="va">)</span></span>
<span id="cb13-13"><a href="#cb13-13" aria-hidden="true" tabindex="-1"></a><span class="st">			LDLIBS += </span><span class="va">$(</span><span class="ex">pkg-config</span> <span class="at">--libs-only-l</span> libtls<span class="va">)</span></span>
<span id="cb13-14"><a href="#cb13-14" aria-hidden="true" tabindex="-1"></a><span class="op">		EOF</span></span>
<span id="cb13-15"><a href="#cb13-15" aria-hidden="true" tabindex="-1"></a><span class="cf">esac</span></span></code></pre></div>
<p>For more information about pkg-config, see Dan Nicholson’s <a href="https://people.freedesktop.org/~dbn/pkg-config-guide.html">guide</a>.</p>
<h4 id="compensating-for-the-standard-library">Compensating for the standard library</h4>
<p>The C standard library has no generic collections. You have to write your own linked lists, trees, and hash tables. Real Programmers™ might like this, but I don’t.</p>
<p>POSIX offers limited help with their interface in <a href="https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/search.h.html">search.h</a>:</p>
<ul>
<li><a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/tdelete.html">Binary search tree</a>. This interface has worked for me, although <code>twalk()</code> doesn’t take an argument for passing auxiliary data to the callback. The callback needs to consult a global or thread-local variable for that. The quality of implementation may vary as well, particularly in whether and how the tree is balanced.</li>
<li><a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/insque.html">Queue</a>. Very basic functions to insert or delete from a doubly linked (possibly circular) list. It takes void*, but expects a structure whose first two members are pointers to the same structure type (forward and backward pointers).</li>
<li><a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/hcreate.html">Hash table</a>. Unnecessarily constrained interface. It creates a <em>single</em> hash table in hidden memory. You can destroy the table and later make another, but can never have more than one active at a time anywhere in the callstack. Obviously not thread safe, but that seems to be the least of its problems.</li>
</ul>
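<p>The <code>twalk()</code> limitation is easier to see in code. The following is a small sketch, assuming the Issue 7 callback signature. <code>total_unique_length()</code> and the other names are hypothetical. The running total has to live in a file-scope variable because the callback has no parameter to carry it:</p>

```c
#include <assert.h>
#include <search.h>
#include <stddef.h>
#include <string.h>

/* twalk() can't pass user data to its callback, so the running
 * total lives here. This makes the code non-reentrant. */
static size_t total_len;

static int compare_strings(const void *a, const void *b)
{
	return strcmp(a, b);
}

/* Callback for twalk(): nodep points at a node whose first member
 * is the key pointer that was given to tsearch() */
static void add_length(const void *nodep, VISIT which, int depth)
{
	(void)depth;
	/* internal nodes are visited three times and leaves once,
	 * so pick one visit per node */
	if (which == postorder || which == leaf)
		total_len += strlen(*(const char *const *)nodep);
}

/* Hypothetical helper: sum of the lengths of the distinct strings in
 * words[0..n-1], or (size_t)-1 if the tree can't be built */
size_t total_unique_length(const char *words[], size_t n)
{
	void *root = NULL;
	size_t i, result = (size_t)-1;

	assert(words != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (tsearch(words[i], &root, compare_strings) == NULL)
			goto cleanup; /* out of memory */
	total_len = 0;
	if (root != NULL)
		twalk(root, add_length);
	result = total_len;
cleanup:
	/* portable teardown: keep deleting whatever key sits at the root */
	while (root != NULL)
		tdelete(*(void **)root, &root, compare_strings);
	return result;
}
```

<p>Duplicates are counted once because <code>tsearch()</code> returns the existing node instead of inserting a second copy.</p>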
<p>To go beyond that, you’ll have to use third-party libraries. Many well-known libraries seem pretty bloated (GLib, tbox, Apache Portable Runtime). I found a smaller, cleaner library called simply <a href="https://fragglet.github.io/c-algorithms/">C Algorithms</a>. Haven’t used it in a project yet, but it looks stable and <a href="https://fragglet.github.io/c-algorithms/testing.html">well tested</a>. I also built the library locally with added pedantic C99 flags and got no warnings.</p>
<p>Two other stable libraries (code snippets?) which have received a lot of use over the years are <a href="http://troydhanson.github.io/uthash/index.html">Uthash</a> and BSD’s queue(3) (<a href="http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/sys/queue.h?rev=1.45&amp;content-type=text/x-cvsweb-markup">browse queue.h from OpenBSD</a>, or the <a href="https://github.com/freebsd/freebsd/blob/master/sys/sys/queue.h">FreeBSD variant</a>).</p>
<p>Uthash describes itself this way:</p>
<blockquote>
<p>Any C structure can be stored in a hash table using uthash. Just add a <em>UT_hash_handle</em> to the structure and choose one or more fields in your structure to act as the key. Then use these macros to store, retrieve or delete items from the hash table.</p>
</blockquote>
<p>The BSD queue code has been in use, and steadily improved, since the 1990s. It provides macros to create and manipulate singly-linked lists, simple queues, lists, and tail queues. The <a href="http://man.openbsd.org/queue">man page</a> is quite good.</p>
<p>The functionality differs in the codebase of OpenBSD and FreeBSD. I use the OpenBSD version, but it has a little less functionality. In particular, FreeBSD adds the STAILQ (singly-linked tail queue), and a list swap operation. There was once a CIRCLEQ for circular queues, but it used dodgy coding practices and was <a href="http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/sys/queue.h.diff?r1=1.41&amp;r2=1.42&amp;f=h">removed</a>.</p>
<p>Both Uthash and Queue are header files with macros that you vendor into your project and include rather than linking against. In general I consider “header-only libraries” to be undesirable because they abuse the notion of a translation unit, bloat object files, and make debugging harder. However I’ve used these libraries and they do work well.</p>
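<p>For a feel of the queue macros, here is a short sketch of a FIFO built on a tail queue. It assumes the header is reachable as <code>&lt;sys/queue.h&gt;</code> (a vendored copy would be included by its local path instead), and the <code>job</code> type and function names are hypothetical:</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/queue.h>

struct job {
	int id;
	TAILQ_ENTRY(job) link;   /* embeds the next/prev pointers */
};

TAILQ_HEAD(job_queue, job);      /* declares struct job_queue */

/* Append a job to the tail; returns false if allocation fails */
bool job_push(struct job_queue *q, int id)
{
	struct job *j;

	assert(q != NULL);
	j = malloc(sizeof *j);
	if (j == NULL)
		return false;
	j->id = id;
	TAILQ_INSERT_TAIL(q, j, link);
	return true;
}

/* Remove the head job and return its id, or -1 if the queue is empty */
int job_pop(struct job_queue *q)
{
	struct job *j;
	int id;

	assert(q != NULL);
	j = TAILQ_FIRST(q);
	if (j == NULL)
		return -1;
	id = j->id;
	TAILQ_REMOVE(q, j, link);
	free(j);
	return id;
}
```

<p>A queue head must be initialized before use, either statically with <code>TAILQ_HEAD_INITIALIZER</code> or at run time with <code>TAILQ_INIT</code>.</p>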
<h4 id="user-interface">User interface</h4>
<p>The fewer UI features a program requires, the more portable it will be and the fewer opportunities there will be for it to mess up. (Does your command line app really need to output an emoji rocket ship or animated-in-place text spinner?)</p>
<p>The lowest common denominator is the standard I/O library in C, or its equivalent in other languages. Reading and writing text, pretending to be a teletype.</p>
<p>The next level of sophistication is static output but an input line you can modify (like the fancier teletypes that could edit a line before sending). Different terminals support intraline editing differently, and you should use a library to handle it. The classic is <a href="https://tiswww.case.edu/php/chet/readline/rltop.html">GNU readline</a>. Readline provides this functionality:</p>
<ul>
<li>Moving the text cursor (vi and emacs modes)</li>
<li>Searching the command history</li>
<li>Controlling a kill ring</li>
<li>Using tab completion</li>
</ul>
<p>Its license is straight up GPL though, not even LGPL. There are more permissive knockoffs like <a href="http://thrysoee.dk/editline/">libedit</a> (requires ncurses), or <a href="https://github.com/antirez/linenoise">linenoise</a> (which is restricted to VT100 terminals/emulators).</p>
<p>Going up yet another level is the text user interface (TUI), where the whole screen is your canvas, but you draw on it with text. Historically terminal control codes diverged wildly, so a standard programming interface was born, <a href="https://publications.opengroup.org/c094">X/Open Curses</a>. The most popular implementation is <a href="https://invisible-island.net/ncurses/">ncurses</a>, which adds some nonstandard extensions as well.</p>
<p>Curses handles these tasks:</p>
<ul>
<li>Terminal capability detection</li>
<li>“Raw” mode keyboard input</li>
<li>Cursor motion</li>
<li>Line drawing</li>
<li>Highlighting, underlining</li>
<li>Inserting and deleting lines and characters</li>
<li>Status line</li>
<li>Area clear</li>
<li>Windows</li>
<li>Color</li>
</ul>
<p>To stop pretending the computer is an archaic device from the 70s, you can use the cross-platform <a href="https://www.libsdl.org/">SDL2</a> library. It gives low level access to audio, keyboard, mouse, joystick, and graphics hardware. The <a href="https://wiki.libsdl.org/Installation#Supported_platforms">platform support</a> really is impressive. Everything from Unix, Mac, and Windows to mobile and web rendering.</p>
<p>Finally, for a classic native desktop application with widgets, the most stable and portable choice is probably <a href="https://motif.ics.com/motif">Motif</a>. The interface is stark, but it runs everywhere, and won’t change or break on you.</p>
<figure>
<img src="../images/motif.png" alt="Sample of Motif widgets" /><figcaption aria-hidden="true">Sample of Motif widgets</figcaption>
</figure>
<p>The Motif Programming Manual (<a href="http://www.ist.co.uk/motif/motif_prog.html">free download</a>) says this by way of introduction:</p>
<blockquote>
<p>So why motif? Because it remains what it has long been: the common native windowing toolkit for all the UNIX platforms, fully supported by all the major operating system vendors. It is still the only truly industrial strength toolkit capable of supporting large scale and long term projects. Everything else is tainted: it isn’t ready or fully functionally complete, or the functional specification changes in a non-backwards-compatible manner per release, or there are performance issues. Perhaps it doesn’t truly port across UNIX systems, or it isn’t fully ICCCM compliant with software written in any other toolkit on the desktop, or there are political battles as various groups try to control the specification for their own purposes. […] With motif, you know where you are: it’s stable, it’s robust, it’s professionally supported, and it all works.</p>
</blockquote>
<p>A <a href="http://www.ist.co.uk/motif/motif_refs.html">reference manual</a> is also available for download.</p>
<p>I was a little skeptical that it would be supported on macOS, but I tried the hello world example and, sure enough, it worked fine on XQuartz. I think there’s value in using Motif rather than a monster like GTK.</p>]]></summary>
</entry>
<entry>
    <title>Create impeccable MIME email from markdown</title>
    <link href="https://begriffs.com/posts/2020-07-16-generating-mime-email.html" />
    <id>https://begriffs.com/posts/2020-07-16-generating-mime-email.html</id>
    <published>2020-07-16T00:00:00Z</published>
    <updated>2020-07-16T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h3 id="the-goal">The goal</h3>
<p>I want to create emails that look their best in all mail clients, whether graphical or text based. Ideally I’d write a message in a simple format like Markdown, and generate the final email from the input file. Additionally, I’d like to be able to include fenced code snippets in the message, and make them available as attachments.</p>
<h3 id="demo">Demo</h3>
<p>I created a utility called <a href="https://github.com/begriffs/mimedown">mimedown</a> that reads markdown through stdin and prints multipart MIME to stdout.</p>
<p>Let’s see it in action. Here’s an example message:</p>
<pre><code>## This is a demo email with code

Hey, does this code look fishy to you?

```crash.c
#include &lt;stdio.h&gt;

int main(void)
{
	char a[] = &quot;string literal&quot;;
	char *p  = &quot;string literal&quot;;

	/* capitalize first letter */
	p[0] = a[0] = &#39;S&#39;;
	printf(&quot;a: %s\np: %s\n&quot;, a, p);
	return 0;
}
```

It blows up when I compile it and run it:

```compile.txt
$ cc -std=c99 -pedantic -Wall -Wextra crash.c -o crash
$ ./crash
Bus error: 10
```

Turns out we&#39;re invoking undefined behavior.

* The C99 spec, appendix J.2 Undefined Behavior mentions this case:
  &gt; The program attempts to modify a string literal (6.4.5).
* Steve Summit&#39;s C FAQ [question 1.32](http://c-faq.com/decl/strlitinit.html)
  covers the difference between an array initialized with string literal vs a
  pointer to a string literal constant.
* The SEI CERT C Coding standard
  [STR30-C](https://wiki.sei.cmu.edu/confluence/display/c/STR30-C.+Do+not+attempt+to+modify+string+literals)
  demonstrates the problem with non-compliant code, and compares with compliant
  fixes.</code></pre>
<p>After running it through the generator and emailing it to myself, here’s how the result looks in the Fastmail web interface:</p>
<figure>
<img src="/images/mimedown-graphical.png" alt="rendered in fastmail" /><figcaption aria-hidden="true">rendered in fastmail</figcaption>
</figure>
<p>Notice how the code blocks are displayed inline <em>and</em> are available as attachments with the correct MIME type.</p>
<p>I intentionally haven’t configured Mutt to render HTML, so it falls back to the text alternative in the message, which also looks good. Notice how the message body is interleaved with <code>Content-Disposition: inline</code> attachments for each code snippet.</p>
<figure>
<img src="/images/mimedown-mutt-code.png" alt="code and text in Mutt" /><figcaption aria-hidden="true">code and text in Mutt</figcaption>
</figure>
<p>The email generator also creates references for external urls. It substitutes the urls in the original body text with references, and consolidates the links into a bibliography of type <code>text/uri-list</code> at the end of the message. Here’s another Mutt screenshot of the end of the message, with red circles added.</p>
<figure>
<img src="/images/mimedown-url-references.png" alt="links as references" /><figcaption aria-hidden="true">links as references</figcaption>
</figure>
<p>The generated MIME structure of our sample message looks like this:</p>
<pre><code>  I     1 &lt;no description&gt;          [multipa/alternativ, 7bit, 3.1K]
  I     2 ├─&gt;&lt;no description&gt;            [multipa/mixed, 7bit, 1.7K]
  I     3 │ ├─&gt;&lt;no description&gt;      [text/plain, 7bit, utf-8, 0.1K]
  I     4 │ ├─&gt;crash.c                 [text/x-c, 7bit, utf-8, 0.2K]
  I     5 │ ├─&gt;&lt;no description&gt;      [text/plain, 7bit, utf-8, 0.1K]
  I     6 │ ├─&gt;compile.txt           [text/plain, 7bit, utf-8, 0.1K]
  I     7 │ ├─&gt;&lt;no description&gt;      [text/plain, 7bit, utf-8, 0.5K]
  I     8 │ └─&gt;references.uri     [text/uri-list, 7bit, utf-8, 0.2K]
  I     9 └─&gt;&lt;no description&gt;         [text/html, 7bit, utf-8, 1.3K]</code></pre>
<p>At the outermost level, the message is split into two alternatives: HTML and multipart/mixed. Within the multipart/mixed part is a succession of message text and code snippets, all with inline disposition. The final mixed item is the list of referenced urls (if necessary).</p>
<h3 id="other-niceties">Other niceties</h3>
<p>Lines of the message body are re-flowed to at most 72 characters, to conform to historical length constraints. Additionally, to accommodate narrow terminal windows, mimedown uses a technique called <a href="https://joeclark.org/ffaq.html">format=flowed</a>. This is a clever standard (<a href="https://www.ietf.org/rfc/rfc3676.html">RFC 3676</a>) which adds trailing spaces to any lines that we would like the client reader to re-flow, such as those in paragraphs.</p>
<p>Neither hard wrapping nor format=flowed is applied to fenced code blocks in the original markdown. Code snippets are turned into verbatim attachments and won’t be mangled.</p>
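<p>The trailing-space convention is easy to sketch. The function below is not mimedown’s actual code, just an illustration under simplifying assumptions: it wraps a single paragraph, collapses runs of whitespace, and marks every soft break with a space before the CRLF. The name <code>flow_paragraph</code> and its parameters are hypothetical. It does not perform the space-stuffing that RFC 3676 requires for lines beginning with a space, <code>&gt;</code>, or <code>From </code>.</p>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: wrap one paragraph as format=flowed text.
 *
 * Words from "in" are packed into lines of at most "width" characters.
 * A line that continues the paragraph ends in " \r\n" (a soft break);
 * the last line ends in a plain "\r\n" (a hard break).
 *
 * Returns the number of bytes written, not counting the terminating NUL,
 * or 0 if "out" is too small. A word longer than "width" is not split;
 * it lands on an over-long line of its own.
 */
size_t flow_paragraph(const char *in, size_t width, char *out, size_t outsz)
{
	static const char ws[] = " \t\r\n";
	size_t n = 0;   /* bytes written so far */
	size_t col = 0; /* length of the current output line */

	if (outsz == 0)
		return 0;

	for (;;)
	{
		size_t len;

		in += strspn(in, ws);   /* skip whitespace between words */
		len = strcspn(in, ws);  /* length of the next word */
		if (len == 0)
			break;

		/* worst case: soft break (3) + word + final CRLF (2) + NUL (1) */
		if (n + len + 6 > outsz)
			return 0;

		if (col > 0)
		{
			/* leave room for the separator and a trailing space */
			if (col + 1 + len + 1 > width)
			{
				memcpy(out + n, " \r\n", 3);
				n += 3;
				col = 0;
			}
			else
			{
				out[n++] = ' ';
				col++;
			}
		}
		memcpy(out + n, in, len);
		n += len;
		col += len;
		in += len;
	}

	if (col > 0)
	{
		memcpy(out + n, "\r\n", 2);
		n += 2;
	}
	out[n] = '\0';
	return n;
}
```

<p>A client that understands format=flowed joins each line ending in a space with the line after it, then re-wraps the paragraph to its own window width. A client that does not understand it simply shows the hard-wrapped lines, and the trailing spaces are invisible.</p>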
<p>Finally, the HTML version of the message is tasteful and conservative. It should display properly on any HTML client, since it validates as ISO HTML (ISO/IEC 15445:2000, based on HTML 4.01 Strict).</p>
<h3 id="try-it-yourself">Try it yourself</h3>
<p>Clone it here: <a href="https://github.com/begriffs/mimedown">github.com/begriffs/mimedown</a>. It’s written in portable C99. The only build dependency is the <a href="https://github.com/commonmark/cmark">cmark</a> library for parsing markdown.</p>]]></summary>
</entry>
<entry>
    <title>Logging TLS session keys in LibreSSL</title>
    <link href="https://begriffs.com/posts/2020-05-25-libressl-keylogging.html" />
    <id>https://begriffs.com/posts/2020-05-25-libressl-keylogging.html</id>
    <published>2020-05-25T00:00:00Z</published>
    <updated>2020-05-25T00:00:00Z</updated>
    <summary type="html"><![CDATA[<p><a href="https://www.libressl.org">LibreSSL</a> is a fork of <a href="https://www.openssl.org">OpenSSL</a> that improves code quality and security. It was originally developed for OpenBSD, but has since been ported to several platforms (Linux, *BSD, HP-UX, Solaris, macOS, AIX, Windows) and is now the default TLS provider for some of them.</p>
<p>When debugging a program that uses LibreSSL, it can be useful to see decrypted network traffic. Wireshark can <a href="https://wiki.wireshark.org/TLS">decrypt TLS</a> if you provide the secret session key; however, the session key is difficult to obtain. It is different from the private key passed to functions like <code>tls_config_set_keypair_file()</code>, which merely secures the initial TLS handshake with asymmetric cryptography. The handshake establishes the session key between client and server using a method such as Diffie-Hellman (DH). The session key is then used for efficient symmetric cryptography for the remainder of the communication.</p>
<p>Web browsers, from their Netscape provenance, will log session keys to a file specified by the environment variable <code>SSLKEYLOGFILE</code> when present. Netscape packaged this behavior in its <a href="https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS">Network Security Services</a> library.</p>
<p>OpenSSL and LibreSSL don’t implement that NSS behavior, although OpenSSL allows code to register a <a href="https://www.openssl.org/docs/manmaster/man3/SSL_CTX_set_keylog_callback.html">callback</a> for when TLS key material is generated or received. The callback receives a string in the <a href="https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/Key_Log_Format">NSS Key Log Format</a>.</p>
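<p>For TLS 1.2 and earlier, the usual line in that format is the label <code>CLIENT_RANDOM</code>, then the hex-encoded client random, then the hex-encoded master secret, separated by single spaces. As a rough sketch of what such a line looks like, here is a hypothetical formatter. The names <code>keylog_line</code> and <code>hex_encode</code> are made up for illustration and are not part of OpenSSL, LibreSSL, or NSS.</p>

```c
#include <stddef.h>
#include <string.h>

/* Write 2*len lowercase hex digits to out (no NUL). Returns digits written. */
static size_t hex_encode(char *out, const unsigned char *in, size_t len)
{
	static const char digits[] = "0123456789abcdef";
	size_t i;

	for (i = 0; i < len; i++)
	{
		out[2 * i]     = digits[in[i] >> 4];
		out[2 * i + 1] = digits[in[i] & 0x0f];
	}
	return 2 * len;
}

/* Hypothetical helper: format one key log line into out:
 *
 *   CLIENT_RANDOM <hex client random> <hex master secret>\n
 *
 * In a real TLS 1.2 session the random is 32 bytes and the master
 * secret is 48 bytes. Returns 0 on success, -1 if out is too small.
 */
int keylog_line(char *out, size_t outsz,
                const unsigned char *client_random, size_t random_len,
                const unsigned char *master, size_t master_len)
{
	static const char label[] = "CLIENT_RANDOM ";
	size_t label_len = sizeof label - 1;
	/* label + hex + space + hex + newline + NUL */
	size_t need = label_len + 2 * random_len + 1 + 2 * master_len + 2;
	char *p = out;

	if (outsz < need)
		return -1;

	memcpy(p, label, label_len);
	p += label_len;
	p += hex_encode(p, client_random, random_len);
	*p++ = ' ';
	p += hex_encode(p, master, master_len);
	*p++ = '\n';
	*p = '\0';
	return 0;
}
```

<p>Wireshark matches the client random in each line against the one it sees in the captured ClientHello, which is how it picks the right secret for a given connection.</p>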
<p>In addition to refactoring OpenSSL code, LibreSSL offers a simplified TLS interface called <a href="https://man.openbsd.org/tls_init.3">libtls</a>. The simplicity makes it more likely that applications will use it safely. However, I couldn’t find an easy way to log session keys for my libtls connection.</p>
<p>I found a somewhat hacky way to do it, and <a href="https://marc.info/?l=libressl&amp;m=158908819814107&amp;w=2">asked</a> their development list whether there’s a better way. From the lack of response, I assume there isn’t yet. Posting the solution here in case it’s helpful for anyone else.</p>
<p>This module provides a <code>tls_dump_keylog()</code> function that appends to the file specified in <code>SSLKEYLOGFILE</code>.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdbool.h&gt;</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;tls.h&gt;</span></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;openssl/ssl.h&gt;</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a><span class="co">/* A copy of the tls structure from libtls/tls_internal.h</span></span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a><span class="co"> *</span></span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a><span class="co"> * This is a fragile hack! When the structure changes in libtls</span></span>
<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a><span class="co"> * then it will be Undefined Behavior to alias it with this.</span></span>
<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a><span class="co"> * See C99 section 6.5 (Expressions), paragraph 7</span></span>
<span id="cb1-12"><a href="#cb1-12" aria-hidden="true" tabindex="-1"></a><span class="co"> */</span></span>
<span id="cb1-13"><a href="#cb1-13" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> tls_internal <span class="op">{</span></span>
<span id="cb1-14"><a href="#cb1-14" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> tls_config <span class="op">*</span>config<span class="op">;</span></span>
<span id="cb1-15"><a href="#cb1-15" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> tls_keypair <span class="op">*</span>keypair<span class="op">;</span></span>
<span id="cb1-16"><a href="#cb1-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-17"><a href="#cb1-17" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> <span class="op">{</span></span>
<span id="cb1-18"><a href="#cb1-18" aria-hidden="true" tabindex="-1"></a>		<span class="dt">char</span> <span class="op">*</span>msg<span class="op">;</span></span>
<span id="cb1-19"><a href="#cb1-19" aria-hidden="true" tabindex="-1"></a>		<span class="dt">int</span> num<span class="op">;</span></span>
<span id="cb1-20"><a href="#cb1-20" aria-hidden="true" tabindex="-1"></a>		<span class="dt">int</span> tls<span class="op">;</span></span>
<span id="cb1-21"><a href="#cb1-21" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span> error<span class="op">;</span></span>
<span id="cb1-22"><a href="#cb1-22" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-23"><a href="#cb1-23" aria-hidden="true" tabindex="-1"></a>	<span class="dt">uint32_t</span> flags<span class="op">;</span></span>
<span id="cb1-24"><a href="#cb1-24" aria-hidden="true" tabindex="-1"></a>	<span class="dt">uint32_t</span> state<span class="op">;</span></span>
<span id="cb1-25"><a href="#cb1-25" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-26"><a href="#cb1-26" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> <span class="op">*</span>servername<span class="op">;</span></span>
<span id="cb1-27"><a href="#cb1-27" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> socket<span class="op">;</span></span>
<span id="cb1-28"><a href="#cb1-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-29"><a href="#cb1-29" aria-hidden="true" tabindex="-1"></a>	SSL <span class="op">*</span>ssl_conn<span class="op">;</span></span>
<span id="cb1-30"><a href="#cb1-30" aria-hidden="true" tabindex="-1"></a>	SSL_CTX <span class="op">*</span>ssl_ctx<span class="op">;</span></span>
<span id="cb1-31"><a href="#cb1-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-32"><a href="#cb1-32" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> tls_sni_ctx <span class="op">*</span>sni_ctx<span class="op">;</span></span>
<span id="cb1-33"><a href="#cb1-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-34"><a href="#cb1-34" aria-hidden="true" tabindex="-1"></a>	X509 <span class="op">*</span>ssl_peer_cert<span class="op">;</span></span>
<span id="cb1-35"><a href="#cb1-35" aria-hidden="true" tabindex="-1"></a>	STACK_OF<span class="op">(</span>X509<span class="op">)</span> <span class="op">*</span>ssl_peer_chain<span class="op">;</span></span>
<span id="cb1-36"><a href="#cb1-36" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-37"><a href="#cb1-37" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> tls_conninfo <span class="op">*</span>conninfo<span class="op">;</span></span>
<span id="cb1-38"><a href="#cb1-38" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-39"><a href="#cb1-39" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> tls_ocsp <span class="op">*</span>ocsp<span class="op">;</span></span>
<span id="cb1-40"><a href="#cb1-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-41"><a href="#cb1-41" aria-hidden="true" tabindex="-1"></a>	tls_read_cb read_cb<span class="op">;</span></span>
<span id="cb1-42"><a href="#cb1-42" aria-hidden="true" tabindex="-1"></a>	tls_write_cb write_cb<span class="op">;</span></span>
<span id="cb1-43"><a href="#cb1-43" aria-hidden="true" tabindex="-1"></a>	<span class="dt">void</span> <span class="op">*</span>cb_arg<span class="op">;</span></span>
<span id="cb1-44"><a href="#cb1-44" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span>
<span id="cb1-45"><a href="#cb1-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-46"><a href="#cb1-46" aria-hidden="true" tabindex="-1"></a><span class="dt">static</span> <span class="dt">void</span> printhex<span class="op">(</span><span class="dt">FILE</span> <span class="op">*</span>fp<span class="op">,</span> <span class="dt">const</span> <span class="dt">unsigned</span> <span class="dt">char</span><span class="op">*</span> s<span class="op">,</span> <span class="dt">size_t</span> len<span class="op">)</span></span>
<span id="cb1-47"><a href="#cb1-47" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb1-48"><a href="#cb1-48" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>len<span class="op">--</span> <span class="op">&gt;</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb1-49"><a href="#cb1-49" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>fp<span class="op">,</span> <span class="st">&quot;%02x&quot;</span><span class="op">,</span> <span class="op">*</span>s<span class="op">++);</span></span>
<span id="cb1-50"><a href="#cb1-50" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb1-51"><a href="#cb1-51" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-52"><a href="#cb1-52" aria-hidden="true" tabindex="-1"></a><span class="dt">bool</span> tls_dump_keylog<span class="op">(</span><span class="kw">struct</span> tls <span class="op">*</span>tls<span class="op">)</span></span>
<span id="cb1-53"><a href="#cb1-53" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb1-54"><a href="#cb1-54" aria-hidden="true" tabindex="-1"></a>	<span class="dt">FILE</span> <span class="op">*</span>fp<span class="op">;</span></span>
<span id="cb1-55"><a href="#cb1-55" aria-hidden="true" tabindex="-1"></a>	SSL_SESSION <span class="op">*</span>sess<span class="op">;</span></span>
<span id="cb1-56"><a href="#cb1-56" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> <span class="dt">int</span> len_key<span class="op">,</span> len_id<span class="op">;</span></span>
<span id="cb1-57"><a href="#cb1-57" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> <span class="dt">char</span> key<span class="op">[</span><span class="dv">256</span><span class="op">];</span></span>
<span id="cb1-58"><a href="#cb1-58" aria-hidden="true" tabindex="-1"></a>	<span class="dt">const</span> <span class="dt">unsigned</span> <span class="dt">char</span> <span class="op">*</span>id<span class="op">;</span></span>
<span id="cb1-59"><a href="#cb1-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-60"><a href="#cb1-60" aria-hidden="true" tabindex="-1"></a>	<span class="dt">const</span> <span class="dt">char</span> <span class="op">*</span>path <span class="op">=</span> getenv<span class="op">(</span><span class="st">&quot;SSLKEYLOGFILE&quot;</span><span class="op">);</span></span>
<span id="cb1-61"><a href="#cb1-61" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>path<span class="op">)</span></span>
<span id="cb1-62"><a href="#cb1-62" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> false<span class="op">;</span></span>
<span id="cb1-63"><a href="#cb1-63" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-64"><a href="#cb1-64" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* potentially nonstrict aliasing */</span></span>
<span id="cb1-65"><a href="#cb1-65" aria-hidden="true" tabindex="-1"></a>	sess <span class="op">=</span> SSL_get_session<span class="op">(((</span><span class="kw">struct</span> tls_internal<span class="op">*)</span>tls<span class="op">)-&gt;</span>ssl_conn<span class="op">);</span></span>
<span id="cb1-66"><a href="#cb1-66" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>sess<span class="op">)</span></span>
<span id="cb1-67"><a href="#cb1-67" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb1-68"><a href="#cb1-68" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;Failed to get session for TLS</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">);</span></span>
<span id="cb1-69"><a href="#cb1-69" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> false<span class="op">;</span></span>
<span id="cb1-70"><a href="#cb1-70" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb1-71"><a href="#cb1-71" aria-hidden="true" tabindex="-1"></a>	len_key <span class="op">=</span> SSL_SESSION_get_master_key<span class="op">(</span>sess<span class="op">,</span> key<span class="op">,</span> <span class="kw">sizeof</span> key<span class="op">);</span></span>
<span id="cb1-72"><a href="#cb1-72" aria-hidden="true" tabindex="-1"></a>	id      <span class="op">=</span> SSL_SESSION_get_id<span class="op">(</span>sess<span class="op">,</span> <span class="op">&amp;</span>len_id<span class="op">);</span></span>
<span id="cb1-73"><a href="#cb1-73" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-74"><a href="#cb1-74" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">((</span>fp <span class="op">=</span> fopen<span class="op">(</span>path<span class="op">,</span> <span class="st">&quot;a&quot;</span><span class="op">))</span> <span class="op">==</span> NULL<span class="op">)</span></span>
<span id="cb1-75"><a href="#cb1-75" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb1-76"><a href="#cb1-76" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;Unable to write keylog to &#39;%s&#39;</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> path<span class="op">);</span></span>
<span id="cb1-77"><a href="#cb1-77" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> false<span class="op">;</span></span>
<span id="cb1-78"><a href="#cb1-78" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb1-79"><a href="#cb1-79" aria-hidden="true" tabindex="-1"></a>	fputs<span class="op">(</span><span class="st">&quot;RSA Session-ID:&quot;</span><span class="op">,</span> fp<span class="op">);</span></span>
<span id="cb1-80"><a href="#cb1-80" aria-hidden="true" tabindex="-1"></a>	printhex<span class="op">(</span>fp<span class="op">,</span> id<span class="op">,</span> len_id<span class="op">);</span></span>
<span id="cb1-81"><a href="#cb1-81" aria-hidden="true" tabindex="-1"></a>	fputs<span class="op">(</span><span class="st">&quot; Master-Key:&quot;</span><span class="op">,</span> fp<span class="op">);</span></span>
<span id="cb1-82"><a href="#cb1-82" aria-hidden="true" tabindex="-1"></a>	printhex<span class="op">(</span>fp<span class="op">,</span> key<span class="op">,</span> len_key<span class="op">);</span></span>
<span id="cb1-83"><a href="#cb1-83" aria-hidden="true" tabindex="-1"></a>	fputs<span class="op">(</span><span class="st">&quot;</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> fp<span class="op">);</span></span>
<span id="cb1-84"><a href="#cb1-84" aria-hidden="true" tabindex="-1"></a>	fclose<span class="op">(</span>fp<span class="op">);</span></span>
<span id="cb1-85"><a href="#cb1-85" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> true<span class="op">;</span></span>
<span id="cb1-86"><a href="#cb1-86" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>To use the logfile in Wireshark, right click on a TLS packet, and select <strong>Protocol Preferences</strong> → <strong>(Pre)-Master-Secret log filename</strong>.</p>
<figure>
<img src="../images/wireshark-session-key.png" alt="(Pre)-Master-Secret log filename menu item" /><figcaption aria-hidden="true">(Pre)-Master-Secret log filename menu item</figcaption>
</figure>
<p>In the resulting dialog, enter the path to the logfile. Then you can view the decrypted traffic with <strong>Follow</strong> → <strong>TLS Stream</strong>.</p>
<figure>
<img src="../images/wireshark-decrypt.png" alt="Follow TLS stream menu item" /><figcaption aria-hidden="true">Follow TLS stream menu item</figcaption>
</figure>]]></summary>
</entry>
<entry>
    <title>Concurrent programming, with examples</title>
    <link href="https://begriffs.com/posts/2020-03-23-concurrent-programming.html" />
    <id>https://begriffs.com/posts/2020-03-23-concurrent-programming.html</id>
    <published>2020-03-23T00:00:00Z</published>
    <updated>2020-03-23T00:00:00Z</updated>
    <summary type="html"><![CDATA[<p>Mention concurrency and you’re bound to get two kinds of unsolicited advice: first that it’s a nightmarish problem which will melt your brain, and second that there’s a magical programming language or niche paradigm which will make all your problems disappear.</p>
<p>We won’t run to either extreme here. Instead we’ll cover the production workhorses for concurrent software – threading and locking – and learn about them through a series of interesting programs. By the end of this article you’ll know the terminology and patterns used by POSIX threads (pthreads).</p>
<p>This is an introduction rather than a reference. Plenty of reference material exists for pthreads – whole books in fact. I won’t dwell on all the options of the API, but will briskly give you the big picture. None of the examples contain error handling because it would merely clutter them.</p>
<h3 id="table-of-contents">Table of contents</h3>
<ul>
<li><a href="#concurrency-vs-parallelism">Concurrency vs parallelism</a></li>
<li><a href="#our-first-concurrent-program">Our first concurrent program</a></li>
<li><a href="#data-races">Data races</a></li>
<li><a href="#locks-and-deadlock">Locks and deadlock</a></li>
<li><a href="#condition-variables">Condition variables</a></li>
<li><a href="#other-synchronization-primitives">Other synchronization primitives</a>
<ul>
<li><a href="#barriers">Barriers</a></li>
<li><a href="#spinlocks">Spinlocks</a></li>
<li><a href="#reader-writer-locks">Reader-writer locks</a></li>
<li><a href="#semaphores">Semaphores</a></li>
</ul></li>
<li><a href="#cancellation">Cancellation</a></li>
<li><a href="#development-tools">Development tools</a>
<ul>
<li><a href="#valgrind-drd-and-helgrind">Valgrind DRD and helgrind</a></li>
<li><a href="#clang-threadsanitizer-tsan">Clang ThreadSanitizer (TSan)</a></li>
<li><a href="#mutrace">Mutrace</a></li>
<li><a href="#off-cpu-profiling">Off-CPU profiling</a></li>
<li><a href="#macos-instruments">macOS Instruments</a></li>
<li><a href="#perf-c2c">perf c2c</a></li>
<li><a href="#intel-vtune-profiler">Intel VTune Profiler</a></li>
</ul></li>
<li><a href="#further-reading">Further reading</a></li>
</ul>
<h3 id="concurrency-vs-parallelism">Concurrency vs parallelism</h3>
<p>First it’s important to distinguish concurrency from parallelism. <strong>Concurrency</strong> is the ability of parts of a program to work correctly when executed out of order. For instance, imagine tasks A and B. One way to execute them is sequentially, meaning doing all steps for A, then all for B:</p>
<svg width="250" height="50" style="display: block; margin: 0 auto;">
<rect width="120" height="50" style="fill:rgb(0,0,127);stroke-width:1;stroke:rgb(0,0,0)" /> <text x="55" y="30" fill="white">A</text> <rect x="130" width="120" height="50" style="fill:rgb(127,0,0);stroke-width:1;stroke:rgb(0,0,0)" /> <text x="185" y="30" fill="white">B</text>
</svg>
<p>Concurrent execution, on the other hand, alternates doing a little of each task until both are complete:</p>
<svg width="250" height="50" style="display: block; margin: 0 auto;">
<rect         width="40" height="50" style="fill:rgb(0,0,127);stroke-width:1;stroke:rgb(0,0,0)" /> <rect x="50"  width="10" height="50" style="fill:rgb(127,0,0);stroke-width:1;stroke:rgb(0,0,0)" /> <rect x="70"  width="10" height="50" style="fill:rgb(0,0,127);stroke-width:1;stroke:rgb(0,0,0)" /> <rect x="90"  width="20" height="50" style="fill:rgb(127,0,0);stroke-width:1;stroke:rgb(0,0,0)" /> <rect x="120" width="10" height="50" style="fill:rgb(0,0,127);stroke-width:1;stroke:rgb(0,0,0)" /> <rect x="140" width="30" height="50" style="fill:rgb(127,0,0);stroke-width:1;stroke:rgb(0,0,0)" /> <rect x="180" width="20" height="50" style="fill:rgb(0,0,127);stroke-width:1;stroke:rgb(0,0,0)" /> <rect x="210" width="10" height="50" style="fill:rgb(127,0,0);stroke-width:1;stroke:rgb(0,0,0)" /> <rect x="230" width="20" height="50" style="fill:rgb(0,0,127);stroke-width:1;stroke:rgb(0,0,0)" />
</svg>
<p>Concurrency allows a program to make progress even when certain parts are blocked. For instance, when one task is waiting for user input, the system can switch to another task and do calculations.</p>
<p>When tasks don’t just interleave, but run at the same time, that’s called <strong>parallelism</strong>. Multiple CPU cores can run instructions simultaneously:</p>
<svg width="120" height="110" style="display: block; margin: 0 auto;">
<rect width="120" height="50" style="fill:rgb(0,0,127);stroke-width:1;stroke:rgb(0,0,0)" /> <text x="55" y="30" fill="white">A</text> <rect y="55" width="120" height="50" style="fill:rgb(127,0,0);stroke-width:1;stroke:rgb(0,0,0)" /> <text x="55" y="85" fill="white">B</text>
</svg>
<p>When a program – even without hardware parallelism – switches rapidly enough from one task to another, it can feel to the user that tasks are executing at the same time. You could say it provides the “illusion of parallelism.” However, true parallelism has the potential for greater processor throughput for problems that can be broken into independent subtasks. Some ways of dealing with concurrency, such as multi-threaded programming, can exploit hardware parallelism automatically when available.</p>
<p>Some languages (or more accurately, some language implementations) are unable to achieve true multi-threaded parallelism. Ruby MRI and CPython for instance use a global interpreter lock (GIL) to simplify their implementation. The GIL prevents more than one thread from running at once. Programs in these interpreters can benefit from I/O concurrency, but not extra computational power.</p>
<h3 id="our-first-concurrent-program">Our first concurrent program</h3>
<p>Languages and libraries offer different ways to add concurrency to a program. UNIX for instance has a bunch of disjointed mechanisms like signals, asynchronous I/O (AIO), select, poll, and setjmp/longjmp. Using these mechanisms can complicate program structure and make programs harder to read than sequential code.</p>
<p>Threads offer a cleaner and more consistent way to address these motivations. For I/O they’re usually clearer than polling or callbacks, and for processing they are more efficient than Unix processes.</p>
<h4 id="crazy-bankers">Crazy bankers</h4>
<p>Let’s get started by adding concurrency to a program to simulate a bunch of crazy bankers sending random amounts of money from one bank account to another. The bankers don’t communicate with one another, so this is a demonstration of concurrency without synchronization.</p>
<p>Adding concurrency is the easy part. The real work is in making threads wait for one another to ensure a correct result. We’ll see a number of mechanisms and patterns for synchronization later, but for now let’s see what goes wrong without synchronization.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* banker.c */</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;pthread.h&gt;</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;time.h&gt;</span></span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_ACCOUNTS 10</span></span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_THREADS  20</span></span>
<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_ROUNDS   10000</span></span>
<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-12"><a href="#cb1-12" aria-hidden="true" tabindex="-1"></a><span class="co">/* 10 accounts with $100 apiece means there&#39;s $1,000</span></span>
<span id="cb1-13"><a href="#cb1-13" aria-hidden="true" tabindex="-1"></a><span class="co">   in the system. Let&#39;s hope it stays that way...  */</span></span>
<span id="cb1-14"><a href="#cb1-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#define INIT_BALANCE 100</span></span>
<span id="cb1-15"><a href="#cb1-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-16"><a href="#cb1-16" aria-hidden="true" tabindex="-1"></a><span class="co">/* making a struct here for the benefit of future</span></span>
<span id="cb1-17"><a href="#cb1-17" aria-hidden="true" tabindex="-1"></a><span class="co">   versions of this program */</span></span>
<span id="cb1-18"><a href="#cb1-18" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> account</span>
<span id="cb1-19"><a href="#cb1-19" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb1-20"><a href="#cb1-20" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> balance<span class="op">;</span></span>
<span id="cb1-21"><a href="#cb1-21" aria-hidden="true" tabindex="-1"></a><span class="op">}</span> accts<span class="op">[</span>N_ACCOUNTS<span class="op">];</span></span>
<span id="cb1-22"><a href="#cb1-22" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-23"><a href="#cb1-23" aria-hidden="true" tabindex="-1"></a><span class="co">/* Helper for bankers to choose an account and amount at</span></span>
<span id="cb1-24"><a href="#cb1-24" aria-hidden="true" tabindex="-1"></a><span class="co">   random. It came from Steve Summit&#39;s excellent C FAQ</span></span>
<span id="cb1-25"><a href="#cb1-25" aria-hidden="true" tabindex="-1"></a><span class="co">   http://c-faq.com/lib/randrange.html */</span></span>
<span id="cb1-26"><a href="#cb1-26" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> rand_range<span class="op">(</span><span class="dt">int</span> N<span class="op">)</span></span>
<span id="cb1-27"><a href="#cb1-27" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb1-28"><a href="#cb1-28" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="op">(</span><span class="dt">int</span><span class="op">)((</span><span class="dt">double</span><span class="op">)</span>rand<span class="op">()</span> <span class="op">/</span> <span class="op">((</span><span class="dt">double</span><span class="op">)</span>RAND_MAX <span class="op">+</span> <span class="dv">1</span><span class="op">)</span> <span class="op">*</span> N<span class="op">);</span></span>
<span id="cb1-29"><a href="#cb1-29" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb1-30"><a href="#cb1-30" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-31"><a href="#cb1-31" aria-hidden="true" tabindex="-1"></a><span class="co">/* each banker will run this function concurrently. The</span></span>
<span id="cb1-32"><a href="#cb1-32" aria-hidden="true" tabindex="-1"></a><span class="co">   weird signature is required for a thread function */</span></span>
<span id="cb1-33"><a href="#cb1-33" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> <span class="op">*</span>disburse<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>arg<span class="op">)</span></span>
<span id="cb1-34"><a href="#cb1-34" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb1-35"><a href="#cb1-35" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> i<span class="op">,</span> from<span class="op">,</span> to<span class="op">;</span></span>
<span id="cb1-36"><a href="#cb1-36" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> payment<span class="op">;</span></span>
<span id="cb1-37"><a href="#cb1-37" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-38"><a href="#cb1-38" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* idiom to tell compiler arg is unused */</span></span>
<span id="cb1-39"><a href="#cb1-39" aria-hidden="true" tabindex="-1"></a>	<span class="op">(</span><span class="dt">void</span><span class="op">)</span>arg<span class="op">;</span></span>
<span id="cb1-40"><a href="#cb1-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-41"><a href="#cb1-41" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_ROUNDS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb1-42"><a href="#cb1-42" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb1-43"><a href="#cb1-43" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* pick distinct &#39;from&#39; and &#39;to&#39; accounts */</span></span>
<span id="cb1-44"><a href="#cb1-44" aria-hidden="true" tabindex="-1"></a>		from <span class="op">=</span> rand_range<span class="op">(</span>N_ACCOUNTS<span class="op">);</span></span>
<span id="cb1-45"><a href="#cb1-45" aria-hidden="true" tabindex="-1"></a>		<span class="cf">do</span> <span class="op">{</span></span>
<span id="cb1-46"><a href="#cb1-46" aria-hidden="true" tabindex="-1"></a>			to <span class="op">=</span> rand_range<span class="op">(</span>N_ACCOUNTS<span class="op">);</span></span>
<span id="cb1-47"><a href="#cb1-47" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span> <span class="cf">while</span> <span class="op">(</span>to <span class="op">==</span> from<span class="op">);</span></span>
<span id="cb1-48"><a href="#cb1-48" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-49"><a href="#cb1-49" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* go nuts sending money, try not to overdraft */</span></span>
<span id="cb1-50"><a href="#cb1-50" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>accts<span class="op">[</span>from<span class="op">].</span>balance <span class="op">&gt;</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb1-51"><a href="#cb1-51" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb1-52"><a href="#cb1-52" aria-hidden="true" tabindex="-1"></a>			payment <span class="op">=</span> <span class="dv">1</span> <span class="op">+</span> rand_range<span class="op">(</span>accts<span class="op">[</span>from<span class="op">].</span>balance<span class="op">);</span></span>
<span id="cb1-53"><a href="#cb1-53" aria-hidden="true" tabindex="-1"></a>			accts<span class="op">[</span>from<span class="op">].</span>balance <span class="op">-=</span> payment<span class="op">;</span></span>
<span id="cb1-54"><a href="#cb1-54" aria-hidden="true" tabindex="-1"></a>			accts<span class="op">[</span>to<span class="op">].</span>balance   <span class="op">+=</span> payment<span class="op">;</span></span>
<span id="cb1-55"><a href="#cb1-55" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb1-56"><a href="#cb1-56" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb1-57"><a href="#cb1-57" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb1-58"><a href="#cb1-58" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb1-59"><a href="#cb1-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-60"><a href="#cb1-60" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb1-61"><a href="#cb1-61" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb1-62"><a href="#cb1-62" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> i<span class="op">;</span></span>
<span id="cb1-63"><a href="#cb1-63" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> total<span class="op">;</span></span>
<span id="cb1-64"><a href="#cb1-64" aria-hidden="true" tabindex="-1"></a>	pthread_t ts<span class="op">[</span>N_THREADS<span class="op">];</span></span>
<span id="cb1-65"><a href="#cb1-65" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-66"><a href="#cb1-66" aria-hidden="true" tabindex="-1"></a>	srand<span class="op">(</span>time<span class="op">(</span>NULL<span class="op">));</span></span>
<span id="cb1-67"><a href="#cb1-67" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-68"><a href="#cb1-68" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_ACCOUNTS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb1-69"><a href="#cb1-69" aria-hidden="true" tabindex="-1"></a>		accts<span class="op">[</span>i<span class="op">].</span>balance <span class="op">=</span> INIT_BALANCE<span class="op">;</span></span>
<span id="cb1-70"><a href="#cb1-70" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-71"><a href="#cb1-71" aria-hidden="true" tabindex="-1"></a>	printf<span class="op">(</span><span class="st">&quot;Initial money in system: %d</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb1-72"><a href="#cb1-72" aria-hidden="true" tabindex="-1"></a>		N_ACCOUNTS <span class="op">*</span> INIT_BALANCE<span class="op">);</span></span>
<span id="cb1-73"><a href="#cb1-73" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-74"><a href="#cb1-74" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* start the threads, using whatever parallelism the</span></span>
<span id="cb1-75"><a href="#cb1-75" aria-hidden="true" tabindex="-1"></a><span class="co">	   system happens to offer. Note that pthread_create</span></span>
<span id="cb1-76"><a href="#cb1-76" aria-hidden="true" tabindex="-1"></a><span class="co">	   is the *only* function that creates concurrency */</span></span>
<span id="cb1-77"><a href="#cb1-77" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_THREADS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb1-78"><a href="#cb1-78" aria-hidden="true" tabindex="-1"></a>		pthread_create<span class="op">(&amp;</span>ts<span class="op">[</span>i<span class="op">],</span> NULL<span class="op">,</span> disburse<span class="op">,</span> NULL<span class="op">);</span></span>
<span id="cb1-79"><a href="#cb1-79" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-80"><a href="#cb1-80" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* wait for the threads to all finish, using the</span></span>
<span id="cb1-81"><a href="#cb1-81" aria-hidden="true" tabindex="-1"></a><span class="co">	   pthread_t handles pthread_create gave us */</span></span>
<span id="cb1-82"><a href="#cb1-82" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_THREADS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb1-83"><a href="#cb1-83" aria-hidden="true" tabindex="-1"></a>		pthread_join<span class="op">(</span>ts<span class="op">[</span>i<span class="op">],</span> NULL<span class="op">);</span></span>
<span id="cb1-84"><a href="#cb1-84" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-85"><a href="#cb1-85" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>total <span class="op">=</span> <span class="dv">0</span><span class="op">,</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_ACCOUNTS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb1-86"><a href="#cb1-86" aria-hidden="true" tabindex="-1"></a>		total <span class="op">+=</span> accts<span class="op">[</span>i<span class="op">].</span>balance<span class="op">;</span></span>
<span id="cb1-87"><a href="#cb1-87" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-88"><a href="#cb1-88" aria-hidden="true" tabindex="-1"></a>	printf<span class="op">(</span><span class="st">&quot;Final money in system: %ld</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> total<span class="op">);</span></span>
<span id="cb1-89"><a href="#cb1-89" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>The following simple Makefile can be used to compile all the programs in this article:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode makefile"><code class="sourceCode makefile"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="ot">.POSIX:</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="dt">CFLAGS </span><span class="ch">=</span><span class="st"> -std=c99 -pedantic -D_POSIX_C_SOURCE=200809L -Wall -Wextra</span></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="dt">LDLIBS </span><span class="ch">=</span><span class="st"> -lpthread</span></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="ot">.c:</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a>		<span class="ch">$(</span><span class="dt">CC</span><span class="ch">)</span> <span class="ch">$(</span><span class="dt">CFLAGS</span><span class="ch">)</span> <span class="ch">$(</span><span class="dt">LDFLAGS</span><span class="ch">)</span> -o <span class="ch">$@</span> <span class="ch">$&lt;</span> <span class="ch">$(</span><span class="dt">LDLIBS</span><span class="ch">)</span></span></code></pre></div>
<p>We’re overriding make’s default <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/make.html#tag_20_76_13_09">suffix rule</a> for .c so that <code>-lpthread</code> comes after the source input file. This Makefile will work with any of our programs. If you have <code>foo.c</code> you can simply run <code>make foo</code> and it knows what to do without your needing to add any specific rule for <code>foo</code> to the Makefile.</p>
<h3 id="data-races">Data races</h3>
<p>Try compiling and running <code>banker.c</code>. Notice anything strange?</p>
<p>Threads share memory directly. Each thread can read and write variables in shared memory without any overhead. However, when two or more threads concurrently access the same data without synchronization, and at least one of them writes to it, it’s called a <strong>data race</strong> and generally causes problems.</p>
<p>In particular, threads in <code>banker.c</code> have data races when they read and write account balances. The bankers program moves money between accounts, but the total amount of money in the system does not remain constant. The books don’t balance. Exactly how the program behaves depends on thread scheduling policies of the operating system. On OpenBSD the total money seldom stays at $1,000. Sometimes money gets duplicated, sometimes it vanishes. On macOS the result is generally that all the money disappears, or even becomes negative!</p>
<p>The property that money is neither created nor destroyed in a bank is an example of a <strong>program invariant</strong>, and it gets violated by data races. Note that parallelism is not required for a race, only concurrency.</p>
<p>Here’s the problematic code in the <code>disburse()</code> function:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>payment <span class="op">=</span> <span class="dv">1</span> <span class="op">+</span> rand_range<span class="op">(</span>accts<span class="op">[</span>from<span class="op">].</span>balance<span class="op">);</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>accts<span class="op">[</span>from<span class="op">].</span>balance <span class="op">-=</span> payment<span class="op">;</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>accts<span class="op">[</span>to<span class="op">].</span>balance   <span class="op">+=</span> payment<span class="op">;</span></span></code></pre></div>
<p>The threads running this code can be paused or interleaved at any time. Not just between any of the statements, but partway through arithmetic operations which may not execute atomically on the hardware. Never rely on “thread inertia,” which is the mistaken feeling that the thread will finish a group of statements without interference.</p>
<p>Let’s examine exactly how statements can interleave between banker threads, and the resulting problems. The columns of the table below are threads, and the rows are moments in time.</p>
<p>Here’s a timeline where two threads read the same account balance when planning how much money to transfer. It can cause an overdraft.</p>
<table class="table" style="border: 1px solid #ccc; background: #eee;">
<caption>
Overdrafting
</caption>
<thead>
<tr>
<th>
Thread A
</th>
<th>
Thread B
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
payment = 1 + rand_range(accts[from].balance);
</td>
<td />
</tr>
<tr>
<td />
<td>
payment = 1 + rand_range(accts[from].balance);
</td>
</tr>
<tr>
<td colspan="2">
At this point, thread B’s payment-to-be may be in excess of the true balance because thread A has already earmarked some of the money unbeknownst to B.
</td>
</tr>
<tr>
<td>
accts[from].balance -= payment;
</td>
<td />
</tr>
<tr>
<td />
<td>
accts[from].balance -= payment;
</td>
</tr>
<tr>
<td colspan="2">
Some of the same dollars could be transferred twice and the originating account could even go negative if the overlap of the payments is big enough.
</td>
</tr>
</tbody>
</table>
<p>Here’s a timeline where the debit made by one thread can be undone by that made by another.</p>
<table class="table" style="border: 1px solid #ccc; background: #eee;">
<caption>
Lost debit
</caption>
<thead>
<tr>
<th>
Thread A
</th>
<th>
Thread B
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
accts[from].balance -= payment;
</td>
<td>
accts[from].balance -= payment;
</td>
</tr>
<tr>
<td colspan="2">
If <code>-=</code> is not atomic, the threads might switch execution after reading the balance and after doing arithmetic, but before assignment. Thus one assignment would be overwritten by the other. The “lost update” creates extra money in the system.
</td>
</tr>
</tbody>
</table>
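<p>The “lost debit” timeline can be replayed deterministically, without any threads, by spelling out the separate load, subtract, and store steps that <code>-=</code> may compile to. This is only an illustrative sketch: the function name <code>replay_lost_debit</code> and the local variables standing in for CPU registers are hypothetical, and real compilers and hardware may split the operation differently.</p>

```c
#include <assert.h>

/* Replays the "lost debit" interleaving in a single thread.
   reg_a and reg_b are hypothetical stand-ins for the registers
   each banker thread would use; no real threads are involved. */
long replay_lost_debit(long balance, long pay_a, long pay_b)
{
	long reg_a, reg_b;

	assert(pay_a > 0 && pay_b > 0);

	reg_a = balance;        /* thread A loads the balance     */
	reg_b = balance;        /* thread B loads the same value  */

	reg_a = reg_a - pay_a;  /* A subtracts its payment        */
	reg_b = reg_b - pay_b;  /* B subtracts its payment        */

	balance = reg_a;        /* A stores its result            */
	balance = reg_b;        /* B stores, overwriting A's debit */

	return balance;
}
```

<p>Starting from a balance of 100 with payments of 30 and 20, the function returns 80 rather than the correct 50. Thread A’s debit of 30 has vanished from the source account, even though both payments would still be credited to their destination accounts.</p>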
<p>Similar problems can occur when bankers have a data race in destination accounts. Races in the destination account would tend to decrease total money supply. (To learn more about concurrency problems, see my article <a href="/posts/2017-08-01-practical-guide-sql-isolation.html">Practical Guide to SQL Transaction Isolation</a>).</p>
<h3 id="locks-and-deadlock">Locks and deadlock</h3>
<p>In the example above, we found that a certain section of code was vulnerable to data races. Such tricky parts of a program are called <strong>critical sections.</strong> We must ensure each thread gets all the way through the section before another thread is allowed to enter it.</p>
<p>To give threads mutually exclusive access to a critical section, pthreads provides the mutually exclusive lock (<strong>mutex</strong> for short). The pattern is:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>pthread_mutex_lock<span class="op">(&amp;</span>some_mutex<span class="op">);</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co">/* ... do things in the critical section ... */</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a>pthread_mutex_unlock<span class="op">(&amp;</span>some_mutex<span class="op">);</span></span></code></pre></div>
<p>Any thread calling <code>pthread_mutex_lock</code> on a previously locked mutex will go to sleep and not be scheduled until the mutex is unlocked. If several threads are waiting on the same mutex, the scheduling policy decides which of them acquires it next, so don’t count on first-come, first-served order.</p>
<p>Another way to look at mutexes is that their job is to preserve program invariants. The critical section between locking and unlocking is a place where a certain invariant may be temporarily broken, as long as it is restored by the end. Some people recommend adding an <code>assert()</code> statement before unlocking, to help document the invariant. If an invariant is difficult to specify in an assertion, a comment can be useful instead.</p>
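<p>As a rough sketch of the assertion idea, here is a transfer function guarded by a single coarse mutex, which checks the “money is conserved” invariant just before unlocking. The names <code>bank_mtx</code>, <code>balances</code>, <code>total_money</code>, <code>init_balances</code>, and <code>transfer</code> are made up for this illustration and are not part of the banker programs in this article.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

#define N_ACCOUNTS   10
#define INIT_BALANCE 100

/* hypothetical single lock guarding every balance */
pthread_mutex_t bank_mtx = PTHREAD_MUTEX_INITIALIZER;
long balances[N_ACCOUNTS];

/* sum of all balances; caller must hold bank_mtx */
long total_money(void)
{
	long total = 0;
	size_t i;

	for (i = 0; i < N_ACCOUNTS; i++)
		total += balances[i];
	return total;
}

void init_balances(void)
{
	size_t i;

	pthread_mutex_lock(&bank_mtx);
	for (i = 0; i < N_ACCOUNTS; i++)
		balances[i] = INIT_BALANCE;
	pthread_mutex_unlock(&bank_mtx);
}

void transfer(size_t from, size_t to, long amount)
{
	pthread_mutex_lock(&bank_mtx);

	balances[from] -= amount;
	/* invariant is broken here: money has left one
	   account but not yet arrived in the other */
	balances[to] += amount;

	/* invariant restored; document it before unlocking */
	assert(total_money() == N_ACCOUNTS * INIT_BALANCE);

	pthread_mutex_unlock(&bank_mtx);
}
```

<p>Summing every account while holding the lock is slow, so a check like this is typically compiled out of release builds by defining <code>NDEBUG</code>.</p>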
<p>A function is called <strong>thread-safe</strong> if multiple invocations can safely run concurrently. A cheap, but inefficient, way to make any function thread-safe is to give it its own mutex and lock it right away:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* inefficient but effective way to protect a function */</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>pthread_mutex_t foo_mtx <span class="op">=</span> PTHREAD_MUTEX_INITIALIZER<span class="op">;</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> foo<span class="op">(</span><span class="co">/* some arguments */</span><span class="op">)</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_lock<span class="op">(&amp;</span>foo_mtx<span class="op">);</span></span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* we&#39;re safe in here, but it&#39;s a bottleneck */</span></span>
<span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_unlock<span class="op">(&amp;</span>foo_mtx<span class="op">);</span></span>
<span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>To see why this is inefficient, imagine if <code>foo()</code> were designed to output characters to a file specified in its arguments. Because the function takes a global lock, no two threads could run it at once, even if they wanted to write to different files. Writing to different files should be independent activities, and what we really want to protect against is two threads concurrently writing to the <em>same</em> file.</p>
<p>The amount of data that a mutex protects is called its <strong>granularity,</strong> and smaller granularity can often be more efficient. In our <code>foo()</code> example, we could store a mutex for every file we write, and have the function choose and lock the appropriate mutex. Multi-threaded programs typically add a mutex as a member variable to data structures, to associate the lock with its data.</p>
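<p>A finer-grained version of the <code>foo()</code> idea might look like the following sketch, where each output file carries its own mutex so that writes to different files never contend. The <code>struct logfile</code> type and the <code>logfile_init</code> and <code>logfile_line</code> helpers are hypothetical names chosen for illustration.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* hypothetical wrapper pairing a stream with its own lock */
struct logfile
{
	FILE *fp;
	pthread_mutex_t mtx;
};

/* returns 0 on success, or an error number from pthread_mutex_init */
int logfile_init(struct logfile *lf, FILE *fp)
{
	assert(lf != NULL && fp != NULL);
	lf->fp = fp;
	return pthread_mutex_init(&lf->mtx, NULL);
}

/* Writes "tag: msg\n" as one unit. Only threads writing to the
   *same* logfile wait for each other. Returns 0 on success and
   -1 on a write error. */
int logfile_line(struct logfile *lf, const char *tag, const char *msg)
{
	int ok;

	assert(lf != NULL && lf->fp != NULL);

	pthread_mutex_lock(&lf->mtx);
	/* POSIX stdio locks each individual call on its own; the
	   mutex keeps this group of calls from being interleaved
	   with another writer's output to the same file */
	ok = fputs(tag, lf->fp) != EOF
	  && fputs(": ", lf->fp) != EOF
	  && fputs(msg, lf->fp) != EOF
	  && fputc('\n', lf->fp) != EOF;
	pthread_mutex_unlock(&lf->mtx);

	return ok ? 0 : -1;
}
```

<p>Two threads calling <code>logfile_line</code> on different <code>struct logfile</code> values lock different mutexes and proceed independently, while two threads sharing one value take turns.</p>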
<p>Let’s update the banker program to keep a mutex in each account and prevent data races.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* banker_lock.c */</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;pthread.h&gt;</span></span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;time.h&gt;</span></span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_ACCOUNTS 10</span></span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_THREADS  100</span></span>
<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_ROUNDS   10000</span></span>
<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> account</span>
<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> balance<span class="op">;</span></span>
<span id="cb6-15"><a href="#cb6-15" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* add a mutex to prevent races on balance */</span></span>
<span id="cb6-16"><a href="#cb6-16" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_t mtx<span class="op">;</span></span>
<span id="cb6-17"><a href="#cb6-17" aria-hidden="true" tabindex="-1"></a><span class="op">}</span> accts<span class="op">[</span>N_ACCOUNTS<span class="op">];</span></span>
<span id="cb6-18"><a href="#cb6-18" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-19"><a href="#cb6-19" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> rand_range<span class="op">(</span><span class="dt">int</span> N<span class="op">)</span></span>
<span id="cb6-20"><a href="#cb6-20" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb6-21"><a href="#cb6-21" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="op">(</span><span class="dt">int</span><span class="op">)((</span><span class="dt">double</span><span class="op">)</span>rand<span class="op">()</span> <span class="op">/</span> <span class="op">((</span><span class="dt">double</span><span class="op">)</span>RAND_MAX <span class="op">+</span> <span class="dv">1</span><span class="op">)</span> <span class="op">*</span> N<span class="op">);</span></span>
<span id="cb6-22"><a href="#cb6-22" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb6-23"><a href="#cb6-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-24"><a href="#cb6-24" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> <span class="op">*</span>disburse<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>arg<span class="op">)</span></span>
<span id="cb6-25"><a href="#cb6-25" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb6-26"><a href="#cb6-26" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> i<span class="op">,</span> from<span class="op">,</span> to<span class="op">;</span></span>
<span id="cb6-27"><a href="#cb6-27" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> payment<span class="op">;</span></span>
<span id="cb6-28"><a href="#cb6-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-29"><a href="#cb6-29" aria-hidden="true" tabindex="-1"></a>	<span class="op">(</span><span class="dt">void</span><span class="op">)</span>arg<span class="op">;</span></span>
<span id="cb6-30"><a href="#cb6-30" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-31"><a href="#cb6-31" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_ROUNDS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb6-32"><a href="#cb6-32" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb6-33"><a href="#cb6-33" aria-hidden="true" tabindex="-1"></a>		from <span class="op">=</span> rand_range<span class="op">(</span>N_ACCOUNTS<span class="op">);</span></span>
<span id="cb6-34"><a href="#cb6-34" aria-hidden="true" tabindex="-1"></a>		<span class="cf">do</span> <span class="op">{</span></span>
<span id="cb6-35"><a href="#cb6-35" aria-hidden="true" tabindex="-1"></a>			to <span class="op">=</span> rand_range<span class="op">(</span>N_ACCOUNTS<span class="op">);</span></span>
<span id="cb6-36"><a href="#cb6-36" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span> <span class="cf">while</span> <span class="op">(</span>to <span class="op">==</span> from<span class="op">);</span></span>
<span id="cb6-37"><a href="#cb6-37" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-38"><a href="#cb6-38" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* get an exclusive lock on both balances before</span></span>
<span id="cb6-39"><a href="#cb6-39" aria-hidden="true" tabindex="-1"></a><span class="co">		   updating (there&#39;s a problem with this, see below) */</span></span>
<span id="cb6-40"><a href="#cb6-40" aria-hidden="true" tabindex="-1"></a>		pthread_mutex_lock<span class="op">(&amp;</span>accts<span class="op">[</span>from<span class="op">].</span>mtx<span class="op">);</span></span>
<span id="cb6-41"><a href="#cb6-41" aria-hidden="true" tabindex="-1"></a>		pthread_mutex_lock<span class="op">(&amp;</span>accts<span class="op">[</span>to<span class="op">].</span>mtx<span class="op">);</span></span>
<span id="cb6-42"><a href="#cb6-42" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>accts<span class="op">[</span>from<span class="op">].</span>balance <span class="op">&gt;</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb6-43"><a href="#cb6-43" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb6-44"><a href="#cb6-44" aria-hidden="true" tabindex="-1"></a>			payment <span class="op">=</span> <span class="dv">1</span> <span class="op">+</span> rand_range<span class="op">(</span>accts<span class="op">[</span>from<span class="op">].</span>balance<span class="op">);</span></span>
<span id="cb6-45"><a href="#cb6-45" aria-hidden="true" tabindex="-1"></a>			accts<span class="op">[</span>from<span class="op">].</span>balance <span class="op">-=</span> payment<span class="op">;</span></span>
<span id="cb6-46"><a href="#cb6-46" aria-hidden="true" tabindex="-1"></a>			accts<span class="op">[</span>to<span class="op">].</span>balance   <span class="op">+=</span> payment<span class="op">;</span></span>
<span id="cb6-47"><a href="#cb6-47" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb6-48"><a href="#cb6-48" aria-hidden="true" tabindex="-1"></a>		pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>to<span class="op">].</span>mtx<span class="op">);</span></span>
<span id="cb6-49"><a href="#cb6-49" aria-hidden="true" tabindex="-1"></a>		pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>from<span class="op">].</span>mtx<span class="op">);</span></span>
<span id="cb6-50"><a href="#cb6-50" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb6-51"><a href="#cb6-51" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb6-52"><a href="#cb6-52" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb6-53"><a href="#cb6-53" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-54"><a href="#cb6-54" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb6-55"><a href="#cb6-55" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb6-56"><a href="#cb6-56" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> i<span class="op">;</span></span>
<span id="cb6-57"><a href="#cb6-57" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> total<span class="op">;</span></span>
<span id="cb6-58"><a href="#cb6-58" aria-hidden="true" tabindex="-1"></a>	pthread_t ts<span class="op">[</span>N_THREADS<span class="op">];</span></span>
<span id="cb6-59"><a href="#cb6-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-60"><a href="#cb6-60" aria-hidden="true" tabindex="-1"></a>	srand<span class="op">(</span>time<span class="op">(</span>NULL<span class="op">));</span></span>
<span id="cb6-61"><a href="#cb6-61" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-62"><a href="#cb6-62" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* set the initial balance, but also create a</span></span>
<span id="cb6-63"><a href="#cb6-63" aria-hidden="true" tabindex="-1"></a><span class="co">	   new mutex for each account */</span></span>
<span id="cb6-64"><a href="#cb6-64" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_ACCOUNTS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb6-65"><a href="#cb6-65" aria-hidden="true" tabindex="-1"></a>		accts<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> account<span class="op">)</span></span>
<span id="cb6-66"><a href="#cb6-66" aria-hidden="true" tabindex="-1"></a>			<span class="op">{</span><span class="dv">100</span><span class="op">,</span> PTHREAD_MUTEX_INITIALIZER<span class="op">};</span></span>
<span id="cb6-67"><a href="#cb6-67" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-68"><a href="#cb6-68" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_THREADS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb6-69"><a href="#cb6-69" aria-hidden="true" tabindex="-1"></a>		pthread_create<span class="op">(&amp;</span>ts<span class="op">[</span>i<span class="op">],</span> NULL<span class="op">,</span> disburse<span class="op">,</span> NULL<span class="op">);</span></span>
<span id="cb6-70"><a href="#cb6-70" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-71"><a href="#cb6-71" aria-hidden="true" tabindex="-1"></a>	puts<span class="op">(</span><span class="st">&quot;(This program will probably deadlock, &quot;</span></span>
<span id="cb6-72"><a href="#cb6-72" aria-hidden="true" tabindex="-1"></a>	     <span class="st">&quot;and need to be manually terminated...)&quot;</span><span class="op">);</span></span>
<span id="cb6-73"><a href="#cb6-73" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-74"><a href="#cb6-74" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_THREADS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb6-75"><a href="#cb6-75" aria-hidden="true" tabindex="-1"></a>		pthread_join<span class="op">(</span>ts<span class="op">[</span>i<span class="op">],</span> NULL<span class="op">);</span></span>
<span id="cb6-76"><a href="#cb6-76" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-77"><a href="#cb6-77" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>total <span class="op">=</span> <span class="dv">0</span><span class="op">,</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_ACCOUNTS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb6-78"><a href="#cb6-78" aria-hidden="true" tabindex="-1"></a>		total <span class="op">+=</span> accts<span class="op">[</span>i<span class="op">].</span>balance<span class="op">;</span></span>
<span id="cb6-79"><a href="#cb6-79" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-80"><a href="#cb6-80" aria-hidden="true" tabindex="-1"></a>	printf<span class="op">(</span><span class="st">&quot;Total money in system: %ld</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> total<span class="op">);</span></span>
<span id="cb6-81"><a href="#cb6-81" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Now everything should be safe. No money being created or destroyed, just perfect exchanges between the accounts. The invariant is that the total balance of the source and destination accounts is the same before we transfer the money as after. It’s broken only inside the critical section.</p>
<p>As a side note, at this point you might think it would be more efficient to take a single lock at a time, like this:</p>
<ul>
<li>lock the source account</li>
<li>withdraw money into a thread local variable</li>
<li>unlock the source account</li>
<li>(danger zone!)</li>
<li>lock the destination account</li>
<li>deposit the money</li>
<li>unlock the destination account</li>
</ul>
<p>This would not be safe. During the time between unlocking the source account and locking the destination, the invariant does not hold, yet another thread could observe this state. For instance, a report running in another thread at just that moment could read the balances of both accounts and observe money missing from the system.</p>
<p>We do need to lock both accounts during the transfer. However the way we’re doing it causes a different problem. Try to run the program. It gets stuck forever and never prints the final balance! Its threads are <strong>deadlocked.</strong></p>
<p>Deadlock is the second villain of concurrent programming. It happens when threads wait on each other’s locks, but no thread unlocks for any other. The case of the bankers is a classic simple form called the <strong>deadly embrace.</strong> Here’s how it plays out:</p>
<table class="table" style="border: 1px solid #ccc; background: #eee;">
<caption>
Deadly embrace
</caption>
<thead>
<tr>
<th>
Thread A
</th>
<th>
Thread B
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
lock account 1
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
lock account 2
</td>
</tr>
<tr>
<td>
lock account 2
</td>
<td>
</td>
</tr>
<tr>
<td colspan="2">
At this point thread A is blocked because thread B already holds a lock on account 2.
</td>
</tr>
<tr>
<td>
</td>
<td>
lock account 1
</td>
</tr>
<tr>
<td colspan="2">
Now thread B is blocked because thread A holds a lock on account 1. However thread A will never unlock account 1 because thread A is blocked!
</td>
</tr>
</tbody>
</table>
<p>The problem happens because threads lock resources in different orders, and because they refuse to give locks up. We can solve the problem by addressing either of these causes.</p>
<p>The first approach to preventing deadlock is to enforce a <strong>locking hierarchy.</strong> This means the programmer comes up with an arbitrary order for locks, and always takes “earlier” locks before “later” ones. The terminology comes from locks in hierarchical data structures like trees, but it really amounts to using any kind of consistent locking order.</p>
<p>In our case of the banker program we store all the accounts in an array, so we can use the array index as the lock order. Let’s compare.</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* the original way to lock mutexes, which caused deadlock */</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>pthread_mutex_lock<span class="op">(&amp;</span>accts<span class="op">[</span>from<span class="op">].</span>mtx<span class="op">);</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>pthread_mutex_lock<span class="op">(&amp;</span>accts<span class="op">[</span>to<span class="op">].</span>mtx<span class="op">);</span></span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a><span class="co">/* move money */</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a>pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>to<span class="op">].</span>mtx<span class="op">);</span></span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a>pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>from<span class="op">].</span>mtx<span class="op">);</span></span></code></pre></div>
<p>Here’s a safe way, enforcing a locking hierarchy:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* lock mutexes in earlier accounts first */</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#define MIN(a,b) ((a) &lt; (b) ? (a) : (b))</span></span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#define MAX(a,b) ((a) &lt; (b) ? (b) : (a))</span></span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a>pthread_mutex_lock<span class="op">(&amp;</span>accts<span class="op">[</span>MIN<span class="op">(</span>from<span class="op">,</span> to<span class="op">)].</span>mtx<span class="op">);</span></span>
<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a>pthread_mutex_lock<span class="op">(&amp;</span>accts<span class="op">[</span>MAX<span class="op">(</span>from<span class="op">,</span> to<span class="op">)].</span>mtx<span class="op">);</span></span>
<span id="cb8-8"><a href="#cb8-8" aria-hidden="true" tabindex="-1"></a><span class="co">/* move money */</span></span>
<span id="cb8-9"><a href="#cb8-9" aria-hidden="true" tabindex="-1"></a>pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>MAX<span class="op">(</span>from<span class="op">,</span> to<span class="op">)].</span>mtx<span class="op">);</span></span>
<span id="cb8-10"><a href="#cb8-10" aria-hidden="true" tabindex="-1"></a>pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>MIN<span class="op">(</span>from<span class="op">,</span> to<span class="op">)].</span>mtx<span class="op">);</span></span>
<span id="cb8-11"><a href="#cb8-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb8-12"><a href="#cb8-12" aria-hidden="true" tabindex="-1"></a><span class="co">/* notice we unlock in opposite order */</span></span></code></pre></div>
<p>A locking hierarchy is the most efficient way to prevent deadlock, but it isn’t always easy to contrive. It also creates a potentially undocumented coupling between different parts of a program, which all need to collaborate in the convention.</p>
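<p>When the locks don’t live in an array, one common convention is to order them by memory address instead. Here is a minimal sketch of that idea. The helpers <code>lock_pair()</code> and <code>unlock_pair()</code> are hypothetical names, not part of pthreads, and the sketch assumes the two mutexes are distinct and that the platform provides <code>uintptr_t</code>.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;pthread.h&gt;
#include &lt;stdint.h&gt;

/* hypothetical helper: lock two distinct mutexes in a
   consistent order, using their addresses as the hierarchy */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if ((uintptr_t)a &gt; (uintptr_t)b)
	{
		pthread_mutex_t *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(a); /* lower address first */
	pthread_mutex_lock(b);
}

/* hypothetical helper: release both mutexes (the unlock
   order does not affect deadlock safety) */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	pthread_mutex_unlock(b);
}</code></pre></div>
<p>In the banker program, <code>lock_pair(&amp;accts[from].mtx, &amp;accts[to].mtx)</code> would take the locks in the same order as the <code>MIN</code>/<code>MAX</code> version, since array elements sit at increasing addresses. The coupling problem remains: every part of the program has to go through the same helper.</p>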
<p><strong>Backoff</strong> is a different way to prevent deadlock which works for locks taken in any order. It takes a lock, but then checks whether the next is obtainable. If not, it unlocks the first to allow another thread to make progress, and tries again.</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* using pthread_mutex_trylock to dodge deadlock */</span></span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="cf">while</span> <span class="op">(</span><span class="dv">1</span><span class="op">)</span></span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_lock<span class="op">(&amp;</span>accts<span class="op">[</span>from<span class="op">].</span>mtx<span class="op">);</span></span>
<span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a>	</span>
<span id="cb9-7"><a href="#cb9-7" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>pthread_mutex_trylock<span class="op">(&amp;</span>accts<span class="op">[</span>to<span class="op">].</span>mtx<span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb9-8"><a href="#cb9-8" aria-hidden="true" tabindex="-1"></a>		<span class="cf">break</span><span class="op">;</span> <span class="co">/* got both locks */</span></span>
<span id="cb9-9"><a href="#cb9-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-10"><a href="#cb9-10" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* didn&#39;t get the second one, so unlock the first */</span></span>
<span id="cb9-11"><a href="#cb9-11" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>from<span class="op">].</span>mtx<span class="op">);</span></span>
<span id="cb9-12"><a href="#cb9-12" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* yield the processor so another thread can try --</span></span>
<span id="cb9-13"><a href="#cb9-13" aria-hidden="true" tabindex="-1"></a><span class="co">	   include &lt;sched.h&gt; for this function */</span></span>
<span id="cb9-14"><a href="#cb9-14" aria-hidden="true" tabindex="-1"></a>	sched_yield<span class="op">();</span></span>
<span id="cb9-15"><a href="#cb9-15" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb9-16"><a href="#cb9-16" aria-hidden="true" tabindex="-1"></a><span class="co">/* move money */</span></span>
<span id="cb9-17"><a href="#cb9-17" aria-hidden="true" tabindex="-1"></a>pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>to<span class="op">].</span>mtx<span class="op">);</span></span>
<span id="cb9-18"><a href="#cb9-18" aria-hidden="true" tabindex="-1"></a>pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>from<span class="op">].</span>mtx<span class="op">);</span></span></code></pre></div>
<p>One tricky part is the call to <code>sched_yield()</code>. Without it the loop will immediately try to grab the lock again, competing as hard as it can with other threads that could make more productive use of the lock. This can lead to <strong>livelock</strong>, where threads stay busy fighting over the locks without getting useful work done. Calling <code>sched_yield()</code> makes the calling thread give up the processor and moves it to the back of the scheduler’s run queue.</p>
<p>Despite its flexibility, backoff is definitely less efficient than a locking hierarchy because it can make wasted calls to lock and unlock mutexes. Try modifying the banker program with these approaches and measure how fast they run.</p>
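<p>For the measurement, wall-clock time from a monotonic clock is probably good enough. The sketch below assumes <code>clock_gettime()</code> with <code>CLOCK_MONOTONIC</code> is available on your system. The names <code>elapsed_sec()</code> and <code>time_run()</code> are hypothetical helpers, not part of the banker program.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include &lt;time.h&gt;

/* hypothetical helper: seconds between two timestamps */
double elapsed_sec(const struct timespec *start,
                   const struct timespec *end)
{
	return (double)(end-&gt;tv_sec - start-&gt;tv_sec)
	     + (double)(end-&gt;tv_nsec - start-&gt;tv_nsec) / 1e9;
}

/* hypothetical helper: time one call to run(), which is
   assumed to create the banker threads and join them all */
double time_run(void (*run)(void))
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &amp;t0);
	run();
	clock_gettime(CLOCK_MONOTONIC, &amp;t1);
	return elapsed_sec(&amp;t0, &amp;t1);
}</code></pre></div>
<p>Thread scheduling varies from run to run, so timing each variant several times and comparing the averages gives a fairer picture than a single run.</p>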
<h3 id="condition-variables">Condition variables</h3>
<p>After safely getting access to a shared variable with a mutex, a thread may discover that the value of the variable is not yet suitable for the thread to act upon. For instance, the thread may be looking for an item to process in a shared queue, only to find the queue empty. The thread could poll the value, but this is inefficient. Pthreads provides <strong>condition variables</strong> to allow threads to wait for events of interest or notify other threads when these events happen.</p>
<p>Condition variables are not themselves locks, nor do they hold any value of their own. They are merely events with a programmer-assigned meaning. For example, a structure representing a queue could have a mutex for safely accessing the data, plus some condition variables: one to represent the event of the queue becoming empty, and another to announce when a new item is added.</p>
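<p>To make that concrete, here is a minimal sketch of such a queue. Everything in it is hypothetical (the fixed capacity, the field names, the <code>queue_*</code> functions), and error checking is omitted for brevity.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;pthread.h&gt;
#include &lt;stddef.h&gt;

#define QUEUE_CAP 16

/* hypothetical fixed-size queue of ints */
struct queue
{
	int items[QUEUE_CAP];
	size_t head, len;
	pthread_mutex_t mtx;          /* guards items, head, len */
	pthread_cond_t  item_added;   /* event: an item was pushed */
	pthread_cond_t  became_empty; /* event: len dropped to zero */
};

void queue_init(struct queue *q)
{
	q-&gt;head = q-&gt;len = 0;
	pthread_mutex_init(&amp;q-&gt;mtx, NULL);
	pthread_cond_init(&amp;q-&gt;item_added, NULL);
	pthread_cond_init(&amp;q-&gt;became_empty, NULL);
}

/* returns 0 on success, -1 if the queue is full */
int queue_push(struct queue *q, int x)
{
	int ret = -1;

	pthread_mutex_lock(&amp;q-&gt;mtx);
	if (q-&gt;len &lt; QUEUE_CAP)
	{
		q-&gt;items[(q-&gt;head + q-&gt;len) % QUEUE_CAP] = x;
		q-&gt;len++;
		ret = 0;
		/* wake one thread waiting for work */
		pthread_cond_signal(&amp;q-&gt;item_added);
	}
	pthread_mutex_unlock(&amp;q-&gt;mtx);
	return ret;
}

/* blocks until an item is available, then removes it */
int queue_pop(struct queue *q)
{
	int x;

	pthread_mutex_lock(&amp;q-&gt;mtx);
	/* re-check the condition after every wakeup */
	while (q-&gt;len == 0)
		pthread_cond_wait(&amp;q-&gt;item_added, &amp;q-&gt;mtx);
	x = q-&gt;items[q-&gt;head];
	q-&gt;head = (q-&gt;head + 1) % QUEUE_CAP;
	if (--q-&gt;len == 0)
		pthread_cond_broadcast(&amp;q-&gt;became_empty);
	pthread_mutex_unlock(&amp;q-&gt;mtx);
	return x;
}</code></pre></div>
<p>Note that the condition variables carry no data. The queue’s state lives in <code>len</code>, protected by the mutex, and the condition variables only announce that the state may be worth looking at again.</p>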
<p>Before getting deeper into how condition variables work, let’s see one in action with our banker program. We’ll measure contention between the bankers. First we’ll increase the number of threads and accounts, and keep statistics about how many bankers manage to get inside the <code>disburse()</code> critical section at once. Any time the max score is broken, we’ll signal a condition variable. A dedicated thread will wait on it and update a scoreboard.</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* banker_stats.c */</span></span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;pthread.h&gt;</span></span>
<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;time.h&gt;</span></span>
<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a><span class="co">/* increase the accounts and threads, but make sure there are</span></span>
<span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a><span class="co"> * &quot;too many&quot; threads so they tend to block each other */</span></span>
<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_ACCOUNTS 50</span></span>
<span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_THREADS  100</span></span>
<span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_ROUNDS   10000</span></span>
<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-14"><a href="#cb10-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#define MIN(a,b) ((a) &lt; (b) ? (a) : (b))</span></span>
<span id="cb10-15"><a href="#cb10-15" aria-hidden="true" tabindex="-1"></a><span class="pp">#define MAX(a,b) ((a) &lt; (b) ? (b) : (a))</span></span>
<span id="cb10-16"><a href="#cb10-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-17"><a href="#cb10-17" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> account</span>
<span id="cb10-18"><a href="#cb10-18" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-19"><a href="#cb10-19" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> balance<span class="op">;</span></span>
<span id="cb10-20"><a href="#cb10-20" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_t mtx<span class="op">;</span></span>
<span id="cb10-21"><a href="#cb10-21" aria-hidden="true" tabindex="-1"></a><span class="op">}</span> accts<span class="op">[</span>N_ACCOUNTS<span class="op">];</span></span>
<span id="cb10-22"><a href="#cb10-22" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-23"><a href="#cb10-23" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> rand_range<span class="op">(</span><span class="dt">int</span> N<span class="op">)</span></span>
<span id="cb10-24"><a href="#cb10-24" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-25"><a href="#cb10-25" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="op">(</span><span class="dt">int</span><span class="op">)((</span><span class="dt">double</span><span class="op">)</span>rand<span class="op">()</span> <span class="op">/</span> <span class="op">((</span><span class="dt">double</span><span class="op">)</span>RAND_MAX <span class="op">+</span> <span class="dv">1</span><span class="op">)</span> <span class="op">*</span> N<span class="op">);</span></span>
<span id="cb10-26"><a href="#cb10-26" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb10-27"><a href="#cb10-27" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-28"><a href="#cb10-28" aria-hidden="true" tabindex="-1"></a><span class="co">/* keep a special mutex and condition variable</span></span>
<span id="cb10-29"><a href="#cb10-29" aria-hidden="true" tabindex="-1"></a><span class="co"> * reserved for just the stats */</span></span>
<span id="cb10-30"><a href="#cb10-30" aria-hidden="true" tabindex="-1"></a>pthread_mutex_t stats_mtx <span class="op">=</span> PTHREAD_MUTEX_INITIALIZER<span class="op">;</span></span>
<span id="cb10-31"><a href="#cb10-31" aria-hidden="true" tabindex="-1"></a>pthread_cond_t  stats_cnd <span class="op">=</span> PTHREAD_COND_INITIALIZER<span class="op">;</span></span>
<span id="cb10-32"><a href="#cb10-32" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> stats_curr <span class="op">=</span> <span class="dv">0</span><span class="op">,</span> stats_best <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb10-33"><a href="#cb10-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-34"><a href="#cb10-34" aria-hidden="true" tabindex="-1"></a><span class="co">/* use this interface to modify the stats */</span></span>
<span id="cb10-35"><a href="#cb10-35" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> stats_change<span class="op">(</span><span class="dt">int</span> delta<span class="op">)</span></span>
<span id="cb10-36"><a href="#cb10-36" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-37"><a href="#cb10-37" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_lock<span class="op">(&amp;</span>stats_mtx<span class="op">);</span></span>
<span id="cb10-38"><a href="#cb10-38" aria-hidden="true" tabindex="-1"></a>	stats_curr <span class="op">+=</span> delta<span class="op">;</span></span>
<span id="cb10-39"><a href="#cb10-39" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>stats_curr <span class="op">&gt;</span> stats_best<span class="op">)</span></span>
<span id="cb10-40"><a href="#cb10-40" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb10-41"><a href="#cb10-41" aria-hidden="true" tabindex="-1"></a>		stats_best <span class="op">=</span> stats_curr<span class="op">;</span></span>
<span id="cb10-42"><a href="#cb10-42" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* signal new high score */</span></span>
<span id="cb10-43"><a href="#cb10-43" aria-hidden="true" tabindex="-1"></a>		pthread_cond_broadcast<span class="op">(&amp;</span>stats_cnd<span class="op">);</span></span>
<span id="cb10-44"><a href="#cb10-44" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb10-45"><a href="#cb10-45" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_unlock<span class="op">(&amp;</span>stats_mtx<span class="op">);</span></span>
<span id="cb10-46"><a href="#cb10-46" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb10-47"><a href="#cb10-47" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-48"><a href="#cb10-48" aria-hidden="true" tabindex="-1"></a><span class="co">/* a dedicated thread to update the scoreboard UI */</span></span>
<span id="cb10-49"><a href="#cb10-49" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> <span class="op">*</span>stats_print<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>arg<span class="op">)</span></span>
<span id="cb10-50"><a href="#cb10-50" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-51"><a href="#cb10-51" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> prev_best<span class="op">;</span></span>
<span id="cb10-52"><a href="#cb10-52" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-53"><a href="#cb10-53" aria-hidden="true" tabindex="-1"></a>	<span class="op">(</span><span class="dt">void</span><span class="op">)</span>arg<span class="op">;</span></span>
<span id="cb10-54"><a href="#cb10-54" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-55"><a href="#cb10-55" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* we never return, nobody needs to</span></span>
<span id="cb10-56"><a href="#cb10-56" aria-hidden="true" tabindex="-1"></a><span class="co">	 * pthread_join() with us */</span></span>
<span id="cb10-57"><a href="#cb10-57" aria-hidden="true" tabindex="-1"></a>	pthread_detach<span class="op">(</span>pthread_self<span class="op">());</span></span>
<span id="cb10-58"><a href="#cb10-58" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-59"><a href="#cb10-59" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span><span class="dv">1</span><span class="op">)</span></span>
<span id="cb10-60"><a href="#cb10-60" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb10-61"><a href="#cb10-61" aria-hidden="true" tabindex="-1"></a>		pthread_mutex_lock<span class="op">(&amp;</span>stats_mtx<span class="op">);</span></span>
<span id="cb10-62"><a href="#cb10-62" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-63"><a href="#cb10-63" aria-hidden="true" tabindex="-1"></a>		prev_best <span class="op">=</span> stats_best<span class="op">;</span></span>
<span id="cb10-64"><a href="#cb10-64" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* go to sleep until stats change, and always</span></span>
<span id="cb10-65"><a href="#cb10-65" aria-hidden="true" tabindex="-1"></a><span class="co">		 * check that they actually have changed */</span></span>
<span id="cb10-66"><a href="#cb10-66" aria-hidden="true" tabindex="-1"></a>		<span class="cf">while</span> <span class="op">(</span>prev_best <span class="op">==</span> stats_best<span class="op">)</span></span>
<span id="cb10-67"><a href="#cb10-67" aria-hidden="true" tabindex="-1"></a>			pthread_cond_wait<span class="op">(</span></span>
<span id="cb10-68"><a href="#cb10-68" aria-hidden="true" tabindex="-1"></a>				<span class="op">&amp;</span>stats_cnd<span class="op">,</span> <span class="op">&amp;</span>stats_mtx<span class="op">);</span></span>
<span id="cb10-69"><a href="#cb10-69" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-70"><a href="#cb10-70" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* overwrite current line with new score */</span></span>
<span id="cb10-71"><a href="#cb10-71" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;</span><span class="sc">\r</span><span class="st">%2d&quot;</span><span class="op">,</span> stats_best<span class="op">);</span></span>
<span id="cb10-72"><a href="#cb10-72" aria-hidden="true" tabindex="-1"></a>		pthread_mutex_unlock<span class="op">(&amp;</span>stats_mtx<span class="op">);</span></span>
<span id="cb10-73"><a href="#cb10-73" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-74"><a href="#cb10-74" aria-hidden="true" tabindex="-1"></a>		fflush<span class="op">(</span>stdout<span class="op">);</span></span>
<span id="cb10-75"><a href="#cb10-75" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb10-76"><a href="#cb10-76" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb10-77"><a href="#cb10-77" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-78"><a href="#cb10-78" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> <span class="op">*</span>disburse<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>arg<span class="op">)</span></span>
<span id="cb10-79"><a href="#cb10-79" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-80"><a href="#cb10-80" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> i<span class="op">,</span> from<span class="op">,</span> to<span class="op">;</span></span>
<span id="cb10-81"><a href="#cb10-81" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> payment<span class="op">;</span></span>
<span id="cb10-82"><a href="#cb10-82" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-83"><a href="#cb10-83" aria-hidden="true" tabindex="-1"></a>	<span class="op">(</span><span class="dt">void</span><span class="op">)</span>arg<span class="op">;</span></span>
<span id="cb10-84"><a href="#cb10-84" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-85"><a href="#cb10-85" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_ROUNDS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb10-86"><a href="#cb10-86" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb10-87"><a href="#cb10-87" aria-hidden="true" tabindex="-1"></a>		from <span class="op">=</span> rand_range<span class="op">(</span>N_ACCOUNTS<span class="op">);</span></span>
<span id="cb10-88"><a href="#cb10-88" aria-hidden="true" tabindex="-1"></a>		<span class="cf">do</span> <span class="op">{</span></span>
<span id="cb10-89"><a href="#cb10-89" aria-hidden="true" tabindex="-1"></a>			to <span class="op">=</span> rand_range<span class="op">(</span>N_ACCOUNTS<span class="op">);</span></span>
<span id="cb10-90"><a href="#cb10-90" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span> <span class="cf">while</span> <span class="op">(</span>to <span class="op">==</span> from<span class="op">);</span></span>
<span id="cb10-91"><a href="#cb10-91" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-92"><a href="#cb10-92" aria-hidden="true" tabindex="-1"></a>		pthread_mutex_lock<span class="op">(&amp;</span>accts<span class="op">[</span>MIN<span class="op">(</span>from<span class="op">,</span> to<span class="op">)].</span>mtx<span class="op">);</span></span>
<span id="cb10-93"><a href="#cb10-93" aria-hidden="true" tabindex="-1"></a>		pthread_mutex_lock<span class="op">(&amp;</span>accts<span class="op">[</span>MAX<span class="op">(</span>from<span class="op">,</span> to<span class="op">)].</span>mtx<span class="op">);</span></span>
<span id="cb10-94"><a href="#cb10-94" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-95"><a href="#cb10-95" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* notice we still have a lock hierarchy, because</span></span>
<span id="cb10-96"><a href="#cb10-96" aria-hidden="true" tabindex="-1"></a><span class="co">		 * we call stats_change() after locking all account</span></span>
<span id="cb10-97"><a href="#cb10-97" aria-hidden="true" tabindex="-1"></a><span class="co">		 * mutexes (stats_mtx comes last) */</span></span>
<span id="cb10-98"><a href="#cb10-98" aria-hidden="true" tabindex="-1"></a>		stats_change<span class="op">(</span><span class="dv">1</span><span class="op">);</span> <span class="co">/* another banker in crit sec */</span></span>
<span id="cb10-99"><a href="#cb10-99" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>accts<span class="op">[</span>from<span class="op">].</span>balance <span class="op">&gt;</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb10-100"><a href="#cb10-100" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb10-101"><a href="#cb10-101" aria-hidden="true" tabindex="-1"></a>			payment <span class="op">=</span> <span class="dv">1</span> <span class="op">+</span> rand_range<span class="op">(</span>accts<span class="op">[</span>from<span class="op">].</span>balance<span class="op">);</span></span>
<span id="cb10-102"><a href="#cb10-102" aria-hidden="true" tabindex="-1"></a>			accts<span class="op">[</span>from<span class="op">].</span>balance <span class="op">-=</span> payment<span class="op">;</span></span>
<span id="cb10-103"><a href="#cb10-103" aria-hidden="true" tabindex="-1"></a>			accts<span class="op">[</span>to<span class="op">].</span>balance   <span class="op">+=</span> payment<span class="op">;</span></span>
<span id="cb10-104"><a href="#cb10-104" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb10-105"><a href="#cb10-105" aria-hidden="true" tabindex="-1"></a>		stats_change<span class="op">(-</span><span class="dv">1</span><span class="op">);</span> <span class="co">/* leaving crit sec */</span></span>
<span id="cb10-106"><a href="#cb10-106" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-107"><a href="#cb10-107" aria-hidden="true" tabindex="-1"></a>		pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>MAX<span class="op">(</span>from<span class="op">,</span> to<span class="op">)].</span>mtx<span class="op">);</span></span>
<span id="cb10-108"><a href="#cb10-108" aria-hidden="true" tabindex="-1"></a>		pthread_mutex_unlock<span class="op">(&amp;</span>accts<span class="op">[</span>MIN<span class="op">(</span>from<span class="op">,</span> to<span class="op">)].</span>mtx<span class="op">);</span></span>
<span id="cb10-109"><a href="#cb10-109" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb10-110"><a href="#cb10-110" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb10-111"><a href="#cb10-111" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb10-112"><a href="#cb10-112" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-113"><a href="#cb10-113" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb10-114"><a href="#cb10-114" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-115"><a href="#cb10-115" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> i<span class="op">;</span></span>
<span id="cb10-116"><a href="#cb10-116" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> total<span class="op">;</span></span>
<span id="cb10-117"><a href="#cb10-117" aria-hidden="true" tabindex="-1"></a>	pthread_t ts<span class="op">[</span>N_THREADS<span class="op">],</span> stats<span class="op">;</span></span>
<span id="cb10-118"><a href="#cb10-118" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-119"><a href="#cb10-119" aria-hidden="true" tabindex="-1"></a>	srand<span class="op">(</span>time<span class="op">(</span>NULL<span class="op">));</span></span>
<span id="cb10-120"><a href="#cb10-120" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-121"><a href="#cb10-121" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_ACCOUNTS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb10-122"><a href="#cb10-122" aria-hidden="true" tabindex="-1"></a>		accts<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> <span class="op">(</span><span class="kw">struct</span> account<span class="op">)</span></span>
<span id="cb10-123"><a href="#cb10-123" aria-hidden="true" tabindex="-1"></a>			<span class="op">{</span><span class="dv">100</span><span class="op">,</span> PTHREAD_MUTEX_INITIALIZER<span class="op">};</span></span>
<span id="cb10-124"><a href="#cb10-124" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-125"><a href="#cb10-125" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_THREADS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb10-126"><a href="#cb10-126" aria-hidden="true" tabindex="-1"></a>		pthread_create<span class="op">(&amp;</span>ts<span class="op">[</span>i<span class="op">],</span> NULL<span class="op">,</span> disburse<span class="op">,</span> NULL<span class="op">);</span></span>
<span id="cb10-127"><a href="#cb10-127" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-128"><a href="#cb10-128" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* start thread to update the user on how many bankers</span></span>
<span id="cb10-129"><a href="#cb10-129" aria-hidden="true" tabindex="-1"></a><span class="co">	 * are in the disburse() critical section at once */</span></span>
<span id="cb10-130"><a href="#cb10-130" aria-hidden="true" tabindex="-1"></a>	pthread_create<span class="op">(&amp;</span>stats<span class="op">,</span> NULL<span class="op">,</span> stats_print<span class="op">,</span> NULL<span class="op">);</span></span>
<span id="cb10-131"><a href="#cb10-131" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-132"><a href="#cb10-132" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_THREADS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb10-133"><a href="#cb10-133" aria-hidden="true" tabindex="-1"></a>		pthread_join<span class="op">(</span>ts<span class="op">[</span>i<span class="op">],</span> NULL<span class="op">);</span></span>
<span id="cb10-134"><a href="#cb10-134" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-135"><a href="#cb10-135" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* not joining with the thread running stats_print,</span></span>
<span id="cb10-136"><a href="#cb10-136" aria-hidden="true" tabindex="-1"></a><span class="co">	 * we&#39;ll let it disappear when main exits */</span></span>
<span id="cb10-137"><a href="#cb10-137" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-138"><a href="#cb10-138" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>total <span class="op">=</span> <span class="dv">0</span><span class="op">,</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> N_ACCOUNTS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb10-139"><a href="#cb10-139" aria-hidden="true" tabindex="-1"></a>		total <span class="op">+=</span> accts<span class="op">[</span>i<span class="op">].</span>balance<span class="op">;</span></span>
<span id="cb10-140"><a href="#cb10-140" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-141"><a href="#cb10-141" aria-hidden="true" tabindex="-1"></a>	printf<span class="op">(</span><span class="st">&quot;</span><span class="sc">\n</span><span class="st">Total money in system: %ld</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> total<span class="op">);</span></span>
<span id="cb10-142"><a href="#cb10-142" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>With fifty accounts and a hundred threads, not all threads can be in the critical section of <code>disburse()</code> at once, and the peak varies between runs. Run the program and see how well it does on your machine. (One complication is that making all threads synchronize on <code>stats_mtx</code> may throw off the measurement, because threads that could have executed independently must now interact.)</p>
<p>Let’s look at how to properly use condition variables. We notified threads of a new event with <code>pthread_cond_broadcast(&amp;stats_cnd)</code>. This function marks all threads waiting on <code>stats_cnd</code> as ready to run.</p>
<p>Sometimes multiple threads are waiting on a single cond var. A broadcast will wake them all, but sometimes the event source knows that only one thread will be able to do any work, for instance when only one item is added to a shared queue. In that case the <code>pthread_cond_signal</code> function is better than <code>pthread_cond_broadcast</code>, because unnecessarily waking multiple threads causes overhead. In our case we know that only one thread is waiting on the cond var, so it really makes no difference.</p>
<p>Remember that it’s never <em>wrong</em> to use a broadcast, whereas in some cases it might be wrong to use a signal. Signal is just an optimized broadcast.</p>
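<p>As a minimal sketch of the single-item case, assume a hypothetical one-slot mailbox. The names <code>box_mtx</code>, <code>box_cnd</code>, <code>deposit()</code>, <code>collect()</code>, and <code>mailbox_demo()</code> are invented for illustration and are not part of the banker program. The producer deposits exactly one item, so waking a single waiter with <code>pthread_cond_signal</code> is enough:</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical one-slot mailbox; all names here are illustrative. */
static pthread_mutex_t box_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  box_cnd = PTHREAD_COND_INITIALIZER;
static int box_value;
static int box_full = 0;

/* producer: deposits exactly one item, so one wakeup suffices */
static void *deposit(void *arg)
{
	pthread_mutex_lock(&box_mtx);
	box_value = *(int *)arg;
	box_full = 1;
	pthread_cond_signal(&box_cnd);
	pthread_mutex_unlock(&box_mtx);
	return NULL;
}

/* consumer: blocks until an item is present, then takes it */
static int collect(void)
{
	int v;

	pthread_mutex_lock(&box_mtx);
	while (!box_full)
		pthread_cond_wait(&box_cnd, &box_mtx);
	v = box_value;
	box_full = 0;
	pthread_mutex_unlock(&box_mtx);
	return v;
}

/* start one producer thread and collect what it deposits */
int mailbox_demo(int item)
{
	pthread_t t;
	int got, rc;

	rc = pthread_create(&t, NULL, deposit, &item);
	assert(rc == 0);
	got = collect();
	pthread_join(t, NULL);
	return got;
}
```

<p>Calling <code>mailbox_demo(42)</code> from <code>main()</code> returns 42. Swapping the signal for a broadcast would behave identically here, because only one thread ever waits on <code>box_cnd</code>.</p>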
<p>The waiting side of a cond var ought always to have this pattern:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>pthread_mutex_lock<span class="op">(&amp;</span>mutex<span class="op">);</span></span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="cf">while</span> <span class="op">(!</span>PREDICATE<span class="op">)</span></span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>	pthread_cond_wait<span class="op">(&amp;</span>cond_var<span class="op">,</span> <span class="op">&amp;</span>mutex<span class="op">);</span></span>
<span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a>pthread_mutex_unlock<span class="op">(&amp;</span>mutex<span class="op">);</span></span></code></pre></div>
<p>Condition variables are always associated with a predicate, and the association is implicit in the programmer’s head. You shouldn’t reuse a condition variable for multiple predicates. The intention is that code will signal the cond var when the predicate becomes true.</p>
<p>Before testing the predicate we lock a mutex that covers the data being tested. That way no other thread can change the data immediately after we test it (also <code>pthread_cond_wait()</code> requires a locked mutex). If the predicate is already true we needn’t wait on the cond var, so the loop falls through; otherwise the thread begins to wait.</p>
<p>Condition variables allow you to make this series of events atomic: unlock a mutex, register our interest in the event, and block. Without that atomicity another thread might awaken to take our lock and broadcast before we’ve registered ourselves as interested. Without the atomicity we could be blocked forever.</p>
<p>When <code>pthread_cond_wait()</code> returns, the calling thread awakens and atomically gets its mutex back. It’s all set to check the predicate again in the loop. But why check the predicate? Wasn’t the cond var signaled because the predicate was true, and isn’t the relevant data protected by a mutex? There are three reasons to check:</p>
<ol type="1">
<li>If the condition variable had been broadcast, other threads might have been listening, and another might have been scheduled first and might have done our job. The loop tests for that interception.</li>
<li>On some multiprocessor systems, making condition variable wakeup completely predictable might substantially slow down all cond var operations. Such systems allow <strong>spurious wakeups</strong>, and threads need to be prepared to check if they were woken appropriately.</li>
<li>It can be convenient to signal on a loose predicate. Threads can signal the variables when the event seems <em>likely</em>, or even mistakenly signal, and the program will still work. For instance, we signal when <code>stats_best</code> gets a new high score, but we could have chosen to signal at every invocation of <code>stats_change()</code>.</li>
</ol>
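<p>The first two reasons can be illustrated with a small sketch. It assumes a hypothetical pool of tokens handed out one at a time; <code>tok_mtx</code>, <code>tok_cnd</code>, <code>claim_one()</code>, and <code>claim_demo()</code> are made-up names. Every new token is announced with a broadcast, so several workers may wake for a single token, and the <code>while</code> loop is what stops a late riser from taking a token that is already gone:</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_WORKERS 3  /* hypothetical worker count for this sketch */

static pthread_mutex_t tok_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tok_cnd = PTHREAD_COND_INITIALIZER;
static int tokens = 0;   /* predicate data: tokens > 0 means work exists */
static int claimed = 0;  /* how many tokens the workers have taken */

static void *claim_one(void *arg)
{
	(void)arg;

	pthread_mutex_lock(&tok_mtx);
	/* a wakeup is only a hint: another worker may have taken the
	 * token first, or the wakeup may be spurious, so test again */
	while (tokens == 0)
		pthread_cond_wait(&tok_cnd, &tok_mtx);
	tokens--;
	claimed++;
	pthread_mutex_unlock(&tok_mtx);
	return NULL;
}

/* returns the number of tokens claimed, or -1 if the count is off */
int claim_demo(void)
{
	pthread_t ts[N_WORKERS];
	size_t i;
	int rc;

	for (i = 0; i < N_WORKERS; i++)
	{
		rc = pthread_create(&ts[i], NULL, claim_one, NULL);
		assert(rc == 0);
	}

	/* hand out one token at a time, waking every waiter each time */
	for (i = 0; i < N_WORKERS; i++)
	{
		pthread_mutex_lock(&tok_mtx);
		tokens++;
		pthread_cond_broadcast(&tok_cnd);
		pthread_mutex_unlock(&tok_mtx);
	}

	for (i = 0; i < N_WORKERS; i++)
		pthread_join(ts[i], NULL);

	return (tokens == 0) ? claimed : -1;
}
```

<p>If the <code>while</code> were replaced by an <code>if</code>, two workers woken by the same broadcast could both decrement <code>tokens</code>, driving it negative.</p>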
<p>Given that we have to pass a locked mutex to <code>pthread_cond_wait()</code>, which we had to create, why don’t cond vars come with their own built-in mutex? The reason is flexibility. Although you should use only one mutex with a cond var, there can be multiple cond vars for the same mutex. Think of a mutex protecting a queue, and the different events that can happen in the queue, such as it becoming non-empty or no longer being full.</p>
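<p>One way the queue idea might look in practice is a bounded queue with a single mutex and two cond vars, one per predicate. This is only a sketch under assumed names (<code>q_mtx</code>, <code>q_not_empty</code>, <code>q_not_full</code>, <code>q_push()</code>, <code>q_pop()</code>, <code>queue_demo()</code>) and an arbitrary capacity, not code from the programs above:</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CAP 4       /* hypothetical queue capacity */
#define N_ITEMS 20  /* hypothetical number of items to pass through */

static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;
/* two events, each with its own predicate, sharing one mutex */
static pthread_cond_t q_not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t q_not_full  = PTHREAD_COND_INITIALIZER;
static int q_buf[CAP];
static size_t q_head = 0, q_len = 0;

static void q_push(int v)
{
	pthread_mutex_lock(&q_mtx);
	while (q_len == CAP)
		pthread_cond_wait(&q_not_full, &q_mtx);
	q_buf[(q_head + q_len) % CAP] = v;
	q_len++;
	pthread_cond_signal(&q_not_empty);
	pthread_mutex_unlock(&q_mtx);
}

static int q_pop(void)
{
	int v;

	pthread_mutex_lock(&q_mtx);
	while (q_len == 0)
		pthread_cond_wait(&q_not_empty, &q_mtx);
	v = q_buf[q_head];
	q_head = (q_head + 1) % CAP;
	q_len--;
	pthread_cond_signal(&q_not_full);
	pthread_mutex_unlock(&q_mtx);
	return v;
}

static void *produce(void *arg)
{
	int i;

	(void)arg;
	for (i = 1; i <= N_ITEMS; i++)
		q_push(i);
	return NULL;
}

/* push 1..N_ITEMS from one thread; pop and sum them in the caller */
long queue_demo(void)
{
	pthread_t t;
	long sum = 0;
	int i, rc;

	rc = pthread_create(&t, NULL, produce, NULL);
	assert(rc == 0);
	for (i = 0; i < N_ITEMS; i++)
		sum += q_pop();
	pthread_join(t, NULL);
	return sum;
}
```

<p>Because the producer and the consumer wait on different cond vars, a push never wakes a thread that is waiting for free space, and a pop never wakes one that is waiting for data. With a mutex built into each cond var, the two predicates could not be guarded by the same lock.</p>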
<h3 id="other-synchronization-primitives">Other synchronization primitives</h3>
<h4 id="barriers">Barriers</h4>
<p>It’s time to bid farewell to the banker programs, and turn to something more lively: Conway’s Game of Life! The game has a set of rules operating on a grid of cells that determines which cells live or die based on how many living neighbors each has.</p>
<p>The game can take advantage of multiple processors, using each processor to operate on a different part of the grid in parallel. It’s a so-called <strong>embarrassingly parallel</strong> problem because each section of the grid can be processed in isolation, without needing results from other sections.</p>
<p>Barriers ensure that all threads have reached a particular stage in a parallel computation before allowing any to proceed to the next stage. Each thread calls <code>pthread_barrier_wait()</code> to rendezvous with the others. One of the threads, chosen arbitrarily (POSIX leaves unspecified which one), will see the <code>PTHREAD_BARRIER_SERIAL_THREAD</code> return value, which nominates that thread to do any cleanup or preparation between stages.</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* life.c */</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;assert.h&gt;</span></span>
<span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;pthread.h&gt;</span></span>
<span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdbool.h&gt;</span></span>
<span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb12-9"><a href="#cb12-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;time.h&gt;</span></span>
<span id="cb12-10"><a href="#cb12-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-11"><a href="#cb12-11" aria-hidden="true" tabindex="-1"></a><span class="co">/* mandatory in POSIX.1-2008, but check laggards like macOS */</span></span>
<span id="cb12-12"><a href="#cb12-12" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unistd.h&gt;</span></span>
<span id="cb12-13"><a href="#cb12-13" aria-hidden="true" tabindex="-1"></a><span class="pp">#if !defined(_POSIX_BARRIERS) || _POSIX_BARRIERS &lt; 0</span></span>
<span id="cb12-14"><a href="#cb12-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#error your OS lacks POSIX barrier support</span></span>
<span id="cb12-15"><a href="#cb12-15" aria-hidden="true" tabindex="-1"></a><span class="pp">#endif</span></span>
<span id="cb12-16"><a href="#cb12-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-17"><a href="#cb12-17" aria-hidden="true" tabindex="-1"></a><span class="co">/* dimensions of board */</span></span>
<span id="cb12-18"><a href="#cb12-18" aria-hidden="true" tabindex="-1"></a><span class="pp">#define ROWS 32</span></span>
<span id="cb12-19"><a href="#cb12-19" aria-hidden="true" tabindex="-1"></a><span class="pp">#define COLS 78</span></span>
<span id="cb12-20"><a href="#cb12-20" aria-hidden="true" tabindex="-1"></a><span class="co">/* how long to pause between rounds */</span></span>
<span id="cb12-21"><a href="#cb12-21" aria-hidden="true" tabindex="-1"></a><span class="pp">#define FRAME_MS 100</span></span>
<span id="cb12-22"><a href="#cb12-22" aria-hidden="true" tabindex="-1"></a><span class="pp">#define THREADS 4</span></span>
<span id="cb12-23"><a href="#cb12-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-24"><a href="#cb12-24" aria-hidden="true" tabindex="-1"></a><span class="co">/* proper modulus (in C, &#39;%&#39; is merely remainder) */</span></span>
<span id="cb12-25"><a href="#cb12-25" aria-hidden="true" tabindex="-1"></a><span class="pp">#define MOD(x,N) (((x) &lt; 0) ? ((x) % (N) + (N)) : ((x) % (N)))</span></span>
<span id="cb12-26"><a href="#cb12-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-27"><a href="#cb12-27" aria-hidden="true" tabindex="-1"></a><span class="dt">bool</span> alive<span class="op">[</span>ROWS<span class="op">][</span>COLS<span class="op">],</span> alive_next<span class="op">[</span>ROWS<span class="op">][</span>COLS<span class="op">];</span></span>
<span id="cb12-28"><a href="#cb12-28" aria-hidden="true" tabindex="-1"></a>pthread_barrier_t tick<span class="op">;</span></span>
<span id="cb12-29"><a href="#cb12-29" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-30"><a href="#cb12-30" aria-hidden="true" tabindex="-1"></a><span class="co">/* Should a cell live or die? Using ssize_t because we have</span></span>
<span id="cb12-31"><a href="#cb12-31" aria-hidden="true" tabindex="-1"></a><span class="co">   to deal with signed arithmetic like row-1 when row=0 */</span></span>
<span id="cb12-32"><a href="#cb12-32" aria-hidden="true" tabindex="-1"></a><span class="dt">bool</span> fate<span class="op">(</span><span class="dt">ssize_t</span> row<span class="op">,</span> <span class="dt">ssize_t</span> col<span class="op">)</span></span>
<span id="cb12-33"><a href="#cb12-33" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb12-34"><a href="#cb12-34" aria-hidden="true" tabindex="-1"></a>	<span class="dt">ssize_t</span> i<span class="op">,</span> j<span class="op">;</span></span>
<span id="cb12-35"><a href="#cb12-35" aria-hidden="true" tabindex="-1"></a>	<span class="dt">short</span> neighbors <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb12-36"><a href="#cb12-36" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-37"><a href="#cb12-37" aria-hidden="true" tabindex="-1"></a>	assert<span class="op">(</span><span class="dv">0</span> <span class="op">&lt;=</span> row <span class="op">&amp;&amp;</span> row <span class="op">&lt;</span> ROWS<span class="op">);</span></span>
<span id="cb12-38"><a href="#cb12-38" aria-hidden="true" tabindex="-1"></a>	assert<span class="op">(</span><span class="dv">0</span> <span class="op">&lt;=</span> col <span class="op">&amp;&amp;</span> col <span class="op">&lt;</span> COLS<span class="op">);</span></span>
<span id="cb12-39"><a href="#cb12-39" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-40"><a href="#cb12-40" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* joined edges form a torus */</span></span>
<span id="cb12-41"><a href="#cb12-41" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> row<span class="op">-</span><span class="dv">1</span><span class="op">;</span> i <span class="op">&lt;=</span> row<span class="op">+</span><span class="dv">1</span><span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb12-42"><a href="#cb12-42" aria-hidden="true" tabindex="-1"></a>		<span class="cf">for</span> <span class="op">(</span>j <span class="op">=</span> col<span class="op">-</span><span class="dv">1</span><span class="op">;</span> j <span class="op">&lt;=</span> col<span class="op">+</span><span class="dv">1</span><span class="op">;</span> j<span class="op">++)</span></span>
<span id="cb12-43"><a href="#cb12-43" aria-hidden="true" tabindex="-1"></a>			neighbors <span class="op">+=</span> alive<span class="op">[</span>MOD<span class="op">(</span>i<span class="op">,</span> ROWS<span class="op">)][</span>MOD<span class="op">(</span>j<span class="op">,</span> COLS<span class="op">)];</span></span>
<span id="cb12-44"><a href="#cb12-44" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* don&#39;t count self as a neighbor */</span></span>
<span id="cb12-45"><a href="#cb12-45" aria-hidden="true" tabindex="-1"></a>	neighbors <span class="op">-=</span> alive<span class="op">[</span>row<span class="op">][</span>col<span class="op">];</span></span>
<span id="cb12-46"><a href="#cb12-46" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-47"><a href="#cb12-47" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> neighbors <span class="op">==</span> <span class="dv">3</span> <span class="op">||</span></span>
<span id="cb12-48"><a href="#cb12-48" aria-hidden="true" tabindex="-1"></a>		<span class="op">(</span>neighbors <span class="op">==</span> <span class="dv">2</span> <span class="op">&amp;&amp;</span> alive<span class="op">[</span>row<span class="op">][</span>col<span class="op">]);</span></span>
<span id="cb12-49"><a href="#cb12-49" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb12-50"><a href="#cb12-50" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-51"><a href="#cb12-51" aria-hidden="true" tabindex="-1"></a><span class="co">/* overwrite the board on screen */</span></span>
<span id="cb12-52"><a href="#cb12-52" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> draw<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb12-53"><a href="#cb12-53" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb12-54"><a href="#cb12-54" aria-hidden="true" tabindex="-1"></a>	<span class="dt">ssize_t</span> i<span class="op">,</span> j<span class="op">;</span></span>
<span id="cb12-55"><a href="#cb12-55" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-56"><a href="#cb12-56" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* clear screen (non portable, requires ANSI terminal) */</span></span>
<span id="cb12-57"><a href="#cb12-57" aria-hidden="true" tabindex="-1"></a>	fputs<span class="op">(</span><span class="st">&quot;</span><span class="sc">\033</span><span class="st">[2J</span><span class="sc">\033</span><span class="st">[1;1H&quot;</span><span class="op">,</span> stdout<span class="op">);</span></span>
<span id="cb12-58"><a href="#cb12-58" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-59"><a href="#cb12-59" aria-hidden="true" tabindex="-1"></a>	flockfile<span class="op">(</span>stdout<span class="op">);</span></span>
<span id="cb12-60"><a href="#cb12-60" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> ROWS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb12-61"><a href="#cb12-61" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb12-62"><a href="#cb12-62" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* putchar_unlocked is thread safe when stdout is locked,</span></span>
<span id="cb12-63"><a href="#cb12-63" aria-hidden="true" tabindex="-1"></a><span class="co">		   and it&#39;s as fast as single-threaded putchar */</span></span>
<span id="cb12-64"><a href="#cb12-64" aria-hidden="true" tabindex="-1"></a>		<span class="cf">for</span> <span class="op">(</span>j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> COLS<span class="op">;</span> j<span class="op">++)</span></span>
<span id="cb12-65"><a href="#cb12-65" aria-hidden="true" tabindex="-1"></a>			putchar_unlocked<span class="op">(</span>alive<span class="op">[</span>i<span class="op">][</span>j<span class="op">]</span> <span class="op">?</span> <span class="ch">&#39;X&#39;</span> <span class="op">:</span> <span class="ch">&#39; &#39;</span><span class="op">);</span></span>
<span id="cb12-66"><a href="#cb12-66" aria-hidden="true" tabindex="-1"></a>		putchar_unlocked<span class="op">(</span><span class="ch">&#39;\n&#39;</span><span class="op">);</span></span>
<span id="cb12-67"><a href="#cb12-67" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb12-68"><a href="#cb12-68" aria-hidden="true" tabindex="-1"></a>	funlockfile<span class="op">(</span>stdout<span class="op">);</span></span>
<span id="cb12-69"><a href="#cb12-69" aria-hidden="true" tabindex="-1"></a>	fflush<span class="op">(</span>stdout<span class="op">);</span></span>
<span id="cb12-70"><a href="#cb12-70" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb12-71"><a href="#cb12-71" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-72"><a href="#cb12-72" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> <span class="op">*</span>update_strip<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>arg<span class="op">)</span></span>
<span id="cb12-73"><a href="#cb12-73" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb12-74"><a href="#cb12-74" aria-hidden="true" tabindex="-1"></a>	<span class="dt">ssize_t</span> offset <span class="op">=</span> <span class="op">*(</span><span class="dt">ssize_t</span><span class="op">*)</span>arg<span class="op">,</span> i<span class="op">,</span> j<span class="op">;</span></span>
<span id="cb12-75"><a href="#cb12-75" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> timespec t<span class="op">;</span></span>
<span id="cb12-76"><a href="#cb12-76" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-77"><a href="#cb12-77" aria-hidden="true" tabindex="-1"></a>	t<span class="op">.</span>tv_sec <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb12-78"><a href="#cb12-78" aria-hidden="true" tabindex="-1"></a>	t<span class="op">.</span>tv_nsec <span class="op">=</span> FRAME_MS <span class="op">*</span> <span class="dv">1000000</span><span class="op">;</span></span>
<span id="cb12-79"><a href="#cb12-79" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-80"><a href="#cb12-80" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span><span class="dv">1</span><span class="op">)</span></span>
<span id="cb12-81"><a href="#cb12-81" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb12-82"><a href="#cb12-82" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>pthread_barrier_wait<span class="op">(&amp;</span>tick<span class="op">)</span> <span class="op">==</span></span>
<span id="cb12-83"><a href="#cb12-83" aria-hidden="true" tabindex="-1"></a>			PTHREAD_BARRIER_SERIAL_THREAD<span class="op">)</span></span>
<span id="cb12-84"><a href="#cb12-84" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb12-85"><a href="#cb12-85" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* we drew the short straw, so we&#39;re on graphics duty */</span></span>
<span id="cb12-86"><a href="#cb12-86" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-87"><a href="#cb12-87" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* could have used pointers to multidimensional</span></span>
<span id="cb12-88"><a href="#cb12-88" aria-hidden="true" tabindex="-1"></a><span class="co">			 * arrays and swapped them rather than memcpy&#39;ing</span></span>
<span id="cb12-89"><a href="#cb12-89" aria-hidden="true" tabindex="-1"></a><span class="co">			 * the array contents, but it makes the code a</span></span>
<span id="cb12-90"><a href="#cb12-90" aria-hidden="true" tabindex="-1"></a><span class="co">			 * little more complicated with dereferences */</span></span>
<span id="cb12-91"><a href="#cb12-91" aria-hidden="true" tabindex="-1"></a>			memcpy<span class="op">(</span>alive<span class="op">,</span> alive_next<span class="op">,</span> <span class="kw">sizeof</span> alive<span class="op">);</span></span>
<span id="cb12-92"><a href="#cb12-92" aria-hidden="true" tabindex="-1"></a>			draw<span class="op">();</span></span>
<span id="cb12-93"><a href="#cb12-93" aria-hidden="true" tabindex="-1"></a>			nanosleep<span class="op">(&amp;</span>t<span class="op">,</span> NULL<span class="op">);</span></span>
<span id="cb12-94"><a href="#cb12-94" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb12-95"><a href="#cb12-95" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-96"><a href="#cb12-96" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* rejoin at another barrier to avoid data race on</span></span>
<span id="cb12-97"><a href="#cb12-97" aria-hidden="true" tabindex="-1"></a><span class="co">		   the game board while it&#39;s copied and drawn */</span></span>
<span id="cb12-98"><a href="#cb12-98" aria-hidden="true" tabindex="-1"></a>		pthread_barrier_wait<span class="op">(&amp;</span>tick<span class="op">);</span></span>
<span id="cb12-99"><a href="#cb12-99" aria-hidden="true" tabindex="-1"></a>		<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> offset<span class="op">;</span> i <span class="op">&lt;</span> offset <span class="op">+</span> <span class="op">(</span>ROWS <span class="op">/</span> THREADS<span class="op">);</span> i<span class="op">++)</span></span>
<span id="cb12-100"><a href="#cb12-100" aria-hidden="true" tabindex="-1"></a>			<span class="cf">for</span> <span class="op">(</span>j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> COLS<span class="op">;</span> j<span class="op">++)</span></span>
<span id="cb12-101"><a href="#cb12-101" aria-hidden="true" tabindex="-1"></a>				alive_next<span class="op">[</span>i<span class="op">][</span>j<span class="op">]</span> <span class="op">=</span> fate<span class="op">(</span>i<span class="op">,</span> j<span class="op">);</span></span>
<span id="cb12-102"><a href="#cb12-102" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb12-103"><a href="#cb12-103" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-104"><a href="#cb12-104" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb12-105"><a href="#cb12-105" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb12-106"><a href="#cb12-106" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-107"><a href="#cb12-107" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb12-108"><a href="#cb12-108" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb12-109"><a href="#cb12-109" aria-hidden="true" tabindex="-1"></a>	pthread_t <span class="op">*</span>workers<span class="op">;</span></span>
<span id="cb12-110"><a href="#cb12-110" aria-hidden="true" tabindex="-1"></a>	<span class="dt">ssize_t</span> <span class="op">*</span>offsets<span class="op">;</span></span>
<span id="cb12-111"><a href="#cb12-111" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> i<span class="op">,</span> j<span class="op">;</span></span>
<span id="cb12-112"><a href="#cb12-112" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-113"><a href="#cb12-113" aria-hidden="true" tabindex="-1"></a>	assert<span class="op">(</span>ROWS <span class="op">%</span> THREADS <span class="op">==</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb12-114"><a href="#cb12-114" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* main counts as a thread, so need only THREADS-1 more */</span></span>
<span id="cb12-115"><a href="#cb12-115" aria-hidden="true" tabindex="-1"></a>	workers <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span><span class="op">(*</span>workers<span class="op">)</span> <span class="op">*</span> <span class="op">(</span>THREADS<span class="op">-</span><span class="dv">1</span><span class="op">));</span></span>
<span id="cb12-116"><a href="#cb12-116" aria-hidden="true" tabindex="-1"></a>	offsets <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span><span class="op">(*</span>offsets<span class="op">)</span> <span class="op">*</span> THREADS<span class="op">);</span></span>
<span id="cb12-117"><a href="#cb12-117" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-118"><a href="#cb12-118" aria-hidden="true" tabindex="-1"></a>	srand<span class="op">(</span>time<span class="op">(</span>NULL<span class="op">));</span></span>
<span id="cb12-119"><a href="#cb12-119" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> ROWS<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb12-120"><a href="#cb12-120" aria-hidden="true" tabindex="-1"></a>		<span class="cf">for</span> <span class="op">(</span>j <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> j <span class="op">&lt;</span> COLS<span class="op">;</span> j<span class="op">++)</span></span>
<span id="cb12-121"><a href="#cb12-121" aria-hidden="true" tabindex="-1"></a>			alive_next<span class="op">[</span>i<span class="op">][</span>j<span class="op">]</span> <span class="op">=</span> rand<span class="op">()</span> <span class="op">&lt;</span> <span class="op">(</span><span class="dt">int</span><span class="op">)((</span>RAND_MAX<span class="op">+</span><span class="dv">1</span><span class="bu">u</span><span class="op">)</span> <span class="op">/</span> <span class="dv">3</span><span class="op">);</span></span>
<span id="cb12-122"><a href="#cb12-122" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-123"><a href="#cb12-123" aria-hidden="true" tabindex="-1"></a>	pthread_barrier_init<span class="op">(&amp;</span>tick<span class="op">,</span> NULL<span class="op">,</span> THREADS<span class="op">);</span></span>
<span id="cb12-124"><a href="#cb12-124" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> THREADS<span class="op">-</span><span class="dv">1</span><span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb12-125"><a href="#cb12-125" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb12-126"><a href="#cb12-126" aria-hidden="true" tabindex="-1"></a>		offsets<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> i <span class="op">*</span> ROWS <span class="op">/</span> THREADS<span class="op">;</span></span>
<span id="cb12-127"><a href="#cb12-127" aria-hidden="true" tabindex="-1"></a>		pthread_create<span class="op">(&amp;</span>workers<span class="op">[</span>i<span class="op">],</span> NULL<span class="op">,</span> update_strip<span class="op">,</span> <span class="op">&amp;</span>offsets<span class="op">[</span>i<span class="op">]);</span></span>
<span id="cb12-128"><a href="#cb12-128" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb12-129"><a href="#cb12-129" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-130"><a href="#cb12-130" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* use current thread as a worker too */</span></span>
<span id="cb12-131"><a href="#cb12-131" aria-hidden="true" tabindex="-1"></a>	offsets<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> i <span class="op">*</span> ROWS <span class="op">/</span> THREADS<span class="op">;</span></span>
<span id="cb12-132"><a href="#cb12-132" aria-hidden="true" tabindex="-1"></a>	update_strip<span class="op">(&amp;</span>offsets<span class="op">[</span>i<span class="op">]);</span></span>
<span id="cb12-133"><a href="#cb12-133" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-134"><a href="#cb12-134" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* shouldn&#39;t ever get here */</span></span>
<span id="cb12-135"><a href="#cb12-135" aria-hidden="true" tabindex="-1"></a>	pthread_barrier_destroy<span class="op">(&amp;</span>tick<span class="op">);</span></span>
<span id="cb12-136"><a href="#cb12-136" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>offsets<span class="op">);</span></span>
<span id="cb12-137"><a href="#cb12-137" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>workers<span class="op">);</span></span>
<span id="cb12-138"><a href="#cb12-138" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb12-139"><a href="#cb12-139" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>It’s a fun example although slightly contrived. We’re adding a sleep between rounds to slow down the animation, so it’s unnecessary to chase parallelism. Also there’s a memoized algorithm called hashlife we should be using if pure speed is the goal. However our code illustrates a natural use for barriers.</p>
<p>Notice how we wait at the barrier twice in rapid succession. After emerging from the first barrier, one of the threads (POSIX leaves unspecified which one) copies the new state to the board and draws it. The other threads run ahead to the next barrier and wait there so they don’t cause a data race writing to the board. Once the drawing thread arrives at the barrier with them, all can proceed to calculate cells’ fate for the next round.</p>
<p>Barriers are guaranteed to be present in POSIX.1-2008, but are optional in earlier versions of the standard. Notably macOS is stuck at an old version of POSIX. Presumably they’re too busy “innovating” with their keyboard touchbar to invest in operating system fundamentals.</p>
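<p>Code that must also build on such platforms can test for the feature before relying on it. The following is a minimal sketch, assuming the standard <code>_POSIX_BARRIERS</code> option macro and the <code>_SC_BARRIERS</code> name for <code>sysconf()</code> from <code>&lt;unistd.h&gt;</code>; the helper name <code>barriers_supported</code> is made up for illustration.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;unistd.h&gt;

/* Hypothetical helper: returns 1 when barriers are usable, 0 when not.
 * _POSIX_BARRIERS and _SC_BARRIERS both come from &lt;unistd.h&gt;. */
int barriers_supported(void)
{
#if defined(_POSIX_BARRIERS) &amp;&amp; _POSIX_BARRIERS &gt; 0
	/* the implementation promises support at compile time */
	return 1;
#elif defined(_POSIX_BARRIERS) &amp;&amp; _POSIX_BARRIERS == 0 &amp;&amp; defined(_SC_BARRIERS)
	/* undecided at compile time, so ask at run time */
	return sysconf(_SC_BARRIERS) &gt; 0;
#else
	/* macro undefined or -1: not supported */
	return 0;
#endif
}</code></pre></div>
<p>The same preprocessor test can guard the barrier code itself, with a mutex and condition variable as the fallback.</p>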
<h4 id="spinlocks">Spinlocks</h4>
<p>Spinlocks are implementations of mutexes optimized for fine-grained locking. Often used in low level code like drivers or operating systems, spinlocks are designed to be the most primitive and fastest sync mechanism available. They’re generally not appropriate for application programming. They are only truly necessary for situations like interrupt handlers when a thread is not allowed to go to sleep for any reason.</p>
<p>Aside from that scenario, it’s better to just use a mutex, since mutexes are pretty efficient these days. Modern mutexes often try a short-lived internal spinlock and fall back to heavier techniques only as needed. Mutexes also sometimes use a wait queue called a <strong>futex</strong>, which can take a lock in user-space whenever there is no contention from another thread.</p>
<p>When attempting to lock a spinlock, a thread runs a tight loop repeatedly checking a value in shared memory for a sign it’s safe to proceed. Spinlock implementations use special atomic assembly language instructions to test that the value is unlocked and lock it. The particular instructions vary per architecture, and can be performed in user space to avoid the overhead of a system call.</p>
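<p>The loop described above can be sketched with C11 atomics instead of hand-written assembly. This is only an illustration of the idea, not how any particular pthreads library implements its spinlocks. <code>atomic_flag_test_and_set_explicit()</code> is the standard atomic test-and-set from <code>&lt;stdatomic.h&gt;</code>, and all the <code>demo_</code> names are hypothetical.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;pthread.h&gt;
#include &lt;stdatomic.h&gt;

/* hypothetical names for illustration, not part of any library */
static atomic_flag demo_lock = ATOMIC_FLAG_INIT;
static long demo_counter = 0;

static void demo_spin_lock(void)
{
	/* atomically set the flag and learn its previous value;
	 * keep looping while somebody else already held it */
	while (atomic_flag_test_and_set_explicit(&amp;demo_lock,
	                                         memory_order_acquire))
		; /* busy-wait, burning CPU */
}

static void demo_spin_unlock(void)
{
	atomic_flag_clear_explicit(&amp;demo_lock, memory_order_release);
}

static void *demo_worker(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i &lt; 100000; i++)
	{
		demo_spin_lock();
		demo_counter++;
		demo_spin_unlock();
	}
	return NULL;
}

/* run four workers and return the final count */
long demo_run(void)
{
	pthread_t th[4];
	int i;

	demo_counter = 0;
	for (i = 0; i &lt; 4; i++)
		pthread_create(&amp;th[i], NULL, demo_worker, NULL);
	for (i = 0; i &lt; 4; i++)
		pthread_join(th[i], NULL);
	return demo_counter;
}</code></pre></div>
<p>Because every increment happens while the flag is held, <code>demo_run()</code> should return 400000. Without the lock, concurrent increments could be lost.</p>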
<p>While waiting for a lock, the loop doesn’t block the thread, but instead continues running and burns CPU energy. The technique works only on true multi-processor systems or on a uniprocessor system with preemption enabled. On a uniprocessor system with cooperative threading the loop could never be interrupted, and would livelock.</p>
<p>In POSIX.1-2008 spinlock support is mandatory. In previous versions the presence of this feature was indicated by the <code>_POSIX_SPIN_LOCKS</code> macro. Spinlock functions start with <code>pthread_spin_</code>.</p>
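<p>As a rough sketch of the API, here four threads bump a shared tally under a <code>pthread_spinlock_t</code>. The <code>tally_</code> names are invented for the example, and a mutex would be the better choice in real application code, as noted earlier.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#define _POSIX_C_SOURCE 200809L
#include &lt;pthread.h&gt;

/* hypothetical demo names; only the pthread_spin_* calls are POSIX */
static pthread_spinlock_t tally_lock;
static long tally = 0;

static void *tally_worker(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i &lt; 100000; i++)
	{
		pthread_spin_lock(&amp;tally_lock);
		tally++;
		pthread_spin_unlock(&amp;tally_lock);
	}
	return NULL;
}

/* returns the final tally, or -1 if the lock could not be created */
long tally_run(void)
{
	pthread_t th[4];
	int i;

	tally = 0;
	/* PTHREAD_PROCESS_PRIVATE: used only by threads of this process */
	if (pthread_spin_init(&amp;tally_lock, PTHREAD_PROCESS_PRIVATE) != 0)
		return -1;

	for (i = 0; i &lt; 4; i++)
		pthread_create(&amp;th[i], NULL, tally_worker, NULL);
	for (i = 0; i &lt; 4; i++)
		pthread_join(th[i], NULL);

	pthread_spin_destroy(&amp;tally_lock);
	return tally;
}</code></pre></div>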
<h4 id="reader-writer-locks">Reader-writer locks</h4>
<p>Whereas a mutex enforces mutual exclusion, a <strong>reader-writer lock</strong> allows concurrent read access. Multiple threads can read in parallel, but all block when a thread takes the lock for writing. The increased concurrency can improve application performance. However, blindly replacing mutexes with reader-writer locks “for performance” doesn’t work. Our earlier banker program, for instance, could suffer from duplicate withdrawals if it allowed multiple readers in an account at once.</p>
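<p>One way to avoid that kind of bug is to take the read lock only for pure queries, and to do any check-then-modify sequence entirely under the write lock. Here is a minimal sketch with a single hypothetical account (the <code>acct_</code> names are assumptions for illustration, not the banker program’s actual code):</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#define _POSIX_C_SOURCE 200809L
#include &lt;pthread.h&gt;
#include &lt;stdbool.h&gt;

/* hypothetical single account, guarded by a reader-writer lock */
static pthread_rwlock_t acct_lock = PTHREAD_RWLOCK_INITIALIZER;
static long acct_balance = 100;

/* a pure query: many threads may hold the read lock at once */
long acct_peek(void)
{
	long b;

	pthread_rwlock_rdlock(&amp;acct_lock);
	b = acct_balance;
	pthread_rwlock_unlock(&amp;acct_lock);
	return b;
}

/* the check and the debit happen under one write lock, so no
 * other thread can act on the same balance in between */
bool acct_withdraw(long amount)
{
	bool ok = false;

	pthread_rwlock_wrlock(&amp;acct_lock);
	if (acct_balance &gt;= amount)
	{
		acct_balance -= amount;
		ok = true;
	}
	pthread_rwlock_unlock(&amp;acct_lock);
	return ok;
}</code></pre></div>
<p>If <code>acct_withdraw()</code> instead checked the balance under a read lock and debited it in a separate step, two threads could both pass the check before either one subtracted.</p>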
<p>Below is an rwlock example. It’s a password cracker I call 5dm (md5 backwards). It aims for maximum parallelism searching for a preimage of an MD5 hash. Worker threads periodically poll whether one among them has found an answer, and they use a reader-writer lock to avoid blocking on each other when doing so.</p>
<p>The example is slightly contrived, in that the difficulty of brute forcing passwords increases exponentially with their length. Using multiple threads reduces the time by only a constant factor – but 4x faster is still 4x faster on a four core computer!</p>
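<p>To put numbers on that growth, here is a small sketch that counts the candidate strings up to a given length. It assumes an alphabet of 95 symbols (the printable ASCII characters from space through tilde) and counts the empty string as a candidate; the function name <code>search_space</code> is made up.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">#include &lt;stdint.h&gt;

/* Number of strings of length 0..max_len over an alphabet of
 * n_alpha symbols: 1 + n + n^2 + ... + n^max_len.
 * With n_alpha = 95 the result fits in 64 bits up to max_len = 9. */
uint64_t search_space(unsigned n_alpha, unsigned max_len)
{
	uint64_t total = 1; /* the empty string */
	uint64_t power = 1;
	unsigned k;

	for (k = 1; k &lt;= max_len; k++)
	{
		power *= n_alpha;
		total += power;
	}
	return total;
}</code></pre></div>
<p>For a maximum length of 5, <code>search_space(95, 5)</code> gives 7,820,126,496 candidates. Each additional character multiplies the work by roughly 95, while adding threads only divides it by the number of cores.</p>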
<p>The example below uses <code>MD5()</code> from OpenSSL. To build it, include this in our previous Makefile:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode makefile"><code class="sourceCode makefile"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="dt">CFLAGS  </span><span class="ch">+=</span><span class="st"> `pkg-config --cflags libcrypto`</span></span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="dt">LDFLAGS </span><span class="ch">+=</span><span class="st"> `pkg-config --libs-only-L libcrypto`</span></span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="dt">LDLIBS  </span><span class="ch">+=</span><span class="st"> `pkg-config --libs-only-l libcrypto`</span></span></code></pre></div>
<p>To run it, pass in an MD5 hash and max preimage search length. Note the <code>-n</code> in echo to suppress the newline, since newline is not in our search alphabet:</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> time ./5dm <span class="va">$(</span><span class="bu">echo</span> <span class="at">-n</span> <span class="st">&#39;fun&#39;</span> <span class="kw">|</span> <span class="ex">md5</span><span class="va">)</span> 5</span>
<span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="ex">fun</span></span>
<span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb14-4"><a href="#cb14-4" aria-hidden="true" tabindex="-1"></a><span class="ex">real</span>  0m0.067s</span>
<span id="cb14-5"><a href="#cb14-5" aria-hidden="true" tabindex="-1"></a><span class="ex">user</span>  0m0.205s</span>
<span id="cb14-6"><a href="#cb14-6" aria-hidden="true" tabindex="-1"></a><span class="ex">sys</span>	  0m0.007s</span></code></pre></div>
<p>Notice how 0.2 seconds of CPU time elapsed in parallel, but the user got their answer in 0.067 seconds.</p>
<p>On to the code:</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* 5dm.c */</span></span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdbool.h&gt;</span></span>
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb15-7"><a href="#cb15-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-8"><a href="#cb15-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;openssl/md5.h&gt;</span></span>
<span id="cb15-9"><a href="#cb15-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;pthread.h&gt;</span></span>
<span id="cb15-10"><a href="#cb15-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-11"><a href="#cb15-11" aria-hidden="true" tabindex="-1"></a><span class="co">/* build arbitrary words from the ascii between &#39; &#39; and &#39;~&#39; */</span></span>
<span id="cb15-12"><a href="#cb15-12" aria-hidden="true" tabindex="-1"></a><span class="pp">#define ASCII_FIRST &#39; &#39;</span></span>
<span id="cb15-13"><a href="#cb15-13" aria-hidden="true" tabindex="-1"></a><span class="pp">#define ASCII_LAST  &#39;~&#39;</span></span>
<span id="cb15-14"><a href="#cb15-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_ALPHA (1 + ASCII_LAST - ASCII_FIRST)</span></span>
<span id="cb15-15"><a href="#cb15-15" aria-hidden="true" tabindex="-1"></a><span class="co">/* refuse to search beyond this astronomical length */</span></span>
<span id="cb15-16"><a href="#cb15-16" aria-hidden="true" tabindex="-1"></a><span class="pp">#define LONGEST_PREIMAGE 128</span></span>
<span id="cb15-17"><a href="#cb15-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-18"><a href="#cb15-18" aria-hidden="true" tabindex="-1"></a><span class="pp">#define MAX(x,y) ((x)&lt;(y) ? (y) : (x))</span></span>
<span id="cb15-19"><a href="#cb15-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-20"><a href="#cb15-20" aria-hidden="true" tabindex="-1"></a><span class="co">/* a fast way to enumerate words, operating on an array in-place */</span></span>
<span id="cb15-21"><a href="#cb15-21" aria-hidden="true" tabindex="-1"></a><span class="dt">unsigned</span> word_advance<span class="op">(</span><span class="dt">char</span> <span class="op">*</span>word<span class="op">,</span> <span class="dt">unsigned</span> delta<span class="op">)</span></span>
<span id="cb15-22"><a href="#cb15-22" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb15-23"><a href="#cb15-23" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>delta <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb15-24"><a href="#cb15-24" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb15-25"><a href="#cb15-25" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(*</span>word <span class="op">==</span> <span class="ch">&#39;\0&#39;</span><span class="op">)</span></span>
<span id="cb15-26"><a href="#cb15-26" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-27"><a href="#cb15-27" aria-hidden="true" tabindex="-1"></a>		<span class="op">*</span>word<span class="op">++</span> <span class="op">=</span> ASCII_FIRST <span class="op">+</span> delta <span class="op">-</span> <span class="dv">1</span><span class="op">;</span></span>
<span id="cb15-28"><a href="#cb15-28" aria-hidden="true" tabindex="-1"></a>		<span class="op">*</span>word <span class="op">=</span> <span class="ch">&#39;\0&#39;</span><span class="op">;</span></span>
<span id="cb15-29"><a href="#cb15-29" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-30"><a href="#cb15-30" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span></span>
<span id="cb15-31"><a href="#cb15-31" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-32"><a href="#cb15-32" aria-hidden="true" tabindex="-1"></a>		<span class="dt">char</span> c <span class="op">=</span> <span class="op">*</span>word <span class="op">-</span> ASCII_FIRST<span class="op">;</span></span>
<span id="cb15-33"><a href="#cb15-33" aria-hidden="true" tabindex="-1"></a>		<span class="op">*</span>word <span class="op">=</span> ASCII_FIRST <span class="op">+</span> <span class="op">((</span>c <span class="op">+</span> delta<span class="op">)</span> <span class="op">%</span> N_ALPHA<span class="op">);</span></span>
<span id="cb15-34"><a href="#cb15-34" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>c <span class="op">+</span> delta <span class="op">&gt;=</span> N_ALPHA<span class="op">)</span></span>
<span id="cb15-35"><a href="#cb15-35" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> <span class="dv">1</span> <span class="op">+</span> word_advance<span class="op">(</span>word<span class="op">+</span><span class="dv">1</span><span class="op">,</span> <span class="dv">1</span> <span class="co">/* not delta */</span><span class="op">);</span></span>
<span id="cb15-36"><a href="#cb15-36" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-37"><a href="#cb15-37" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">1</span><span class="op">;</span></span>
<span id="cb15-38"><a href="#cb15-38" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb15-39"><a href="#cb15-39" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-40"><a href="#cb15-40" aria-hidden="true" tabindex="-1"></a><span class="co">/* pack each pair of ASCII hex digits into single bytes */</span></span>
<span id="cb15-41"><a href="#cb15-41" aria-hidden="true" tabindex="-1"></a><span class="dt">bool</span> hex2md5<span class="op">(</span><span class="dt">const</span> <span class="dt">char</span> <span class="op">*</span>hex<span class="op">,</span> <span class="dt">unsigned</span> <span class="dt">char</span> <span class="op">*</span>b<span class="op">)</span></span>
<span id="cb15-42"><a href="#cb15-42" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb15-43"><a href="#cb15-43" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> offset <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb15-44"><a href="#cb15-44" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span><span class="op">(</span>strlen<span class="op">(</span>hex<span class="op">)</span> <span class="op">!=</span> MD5_DIGEST_LENGTH<span class="op">*</span><span class="dv">2</span><span class="op">)</span></span>
<span id="cb15-45"><a href="#cb15-45" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> false<span class="op">;</span></span>
<span id="cb15-46"><a href="#cb15-46" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>offset <span class="op">&lt;</span> MD5_DIGEST_LENGTH<span class="op">*</span><span class="dv">2</span><span class="op">)</span></span>
<span id="cb15-47"><a href="#cb15-47" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-48"><a href="#cb15-48" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>sscanf<span class="op">(</span>hex<span class="op">+</span>offset<span class="op">,</span> <span class="st">&quot;%2hhx&quot;</span><span class="op">,</span> b<span class="op">++)</span> <span class="op">==</span> <span class="dv">1</span><span class="op">)</span></span>
<span id="cb15-49"><a href="#cb15-49" aria-hidden="true" tabindex="-1"></a>			offset <span class="op">+=</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb15-50"><a href="#cb15-50" aria-hidden="true" tabindex="-1"></a>		<span class="cf">else</span></span>
<span id="cb15-51"><a href="#cb15-51" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> false<span class="op">;</span></span>
<span id="cb15-52"><a href="#cb15-52" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-53"><a href="#cb15-53" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> true<span class="op">;</span></span>
<span id="cb15-54"><a href="#cb15-54" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb15-55"><a href="#cb15-55" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-56"><a href="#cb15-56" aria-hidden="true" tabindex="-1"></a><span class="co">/* random things a worker will need, since thread</span></span>
<span id="cb15-57"><a href="#cb15-57" aria-hidden="true" tabindex="-1"></a><span class="co"> * functions receive only one argument */</span></span>
<span id="cb15-58"><a href="#cb15-58" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> goal</span>
<span id="cb15-59"><a href="#cb15-59" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb15-60"><a href="#cb15-60" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* input */</span></span>
<span id="cb15-61"><a href="#cb15-61" aria-hidden="true" tabindex="-1"></a>	pthread_t <span class="op">*</span>workers<span class="op">;</span></span>
<span id="cb15-62"><a href="#cb15-62" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> n_workers<span class="op">;</span></span>
<span id="cb15-63"><a href="#cb15-63" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> max_len<span class="op">;</span></span>
<span id="cb15-64"><a href="#cb15-64" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> <span class="dt">char</span> hash<span class="op">[</span>MD5_DIGEST_LENGTH<span class="op">];</span></span>
<span id="cb15-65"><a href="#cb15-65" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-66"><a href="#cb15-66" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* output */</span></span>
<span id="cb15-67"><a href="#cb15-67" aria-hidden="true" tabindex="-1"></a>	pthread_rwlock_t lock<span class="op">;</span></span>
<span id="cb15-68"><a href="#cb15-68" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> preimage<span class="op">[</span>LONGEST_PREIMAGE<span class="op">];</span></span>
<span id="cb15-69"><a href="#cb15-69" aria-hidden="true" tabindex="-1"></a>	<span class="dt">bool</span> success<span class="op">;</span></span>
<span id="cb15-70"><a href="#cb15-70" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span>
<span id="cb15-71"><a href="#cb15-71" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-72"><a href="#cb15-72" aria-hidden="true" tabindex="-1"></a><span class="co">/* custom starting word for each worker, but shared goal */</span></span>
<span id="cb15-73"><a href="#cb15-73" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> task</span>
<span id="cb15-74"><a href="#cb15-74" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb15-75"><a href="#cb15-75" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> goal <span class="op">*</span>goal<span class="op">;</span></span>
<span id="cb15-76"><a href="#cb15-76" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> initial_preimage<span class="op">[</span>LONGEST_PREIMAGE<span class="op">];</span></span>
<span id="cb15-77"><a href="#cb15-77" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span>
<span id="cb15-78"><a href="#cb15-78" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-79"><a href="#cb15-79" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> <span class="op">*</span>crack_thread<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>arg<span class="op">)</span></span>
<span id="cb15-80"><a href="#cb15-80" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb15-81"><a href="#cb15-81" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> task <span class="op">*</span>t <span class="op">=</span> arg<span class="op">;</span></span>
<span id="cb15-82"><a href="#cb15-82" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> len<span class="op">,</span> changed<span class="op">;</span></span>
<span id="cb15-83"><a href="#cb15-83" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> <span class="dt">char</span> hashed<span class="op">[</span>MD5_DIGEST_LENGTH<span class="op">];</span></span>
<span id="cb15-84"><a href="#cb15-84" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> preimage<span class="op">[</span>LONGEST_PREIMAGE<span class="op">];</span></span>
<span id="cb15-85"><a href="#cb15-85" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> iterations <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb15-86"><a href="#cb15-86" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-87"><a href="#cb15-87" aria-hidden="true" tabindex="-1"></a>	strcpy<span class="op">(</span>preimage<span class="op">,</span> t<span class="op">-&gt;</span>initial_preimage<span class="op">);</span></span>
<span id="cb15-88"><a href="#cb15-88" aria-hidden="true" tabindex="-1"></a>	len <span class="op">=</span> strlen<span class="op">(</span>preimage<span class="op">);</span></span>
<span id="cb15-89"><a href="#cb15-89" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-90"><a href="#cb15-90" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>len <span class="op">&lt;=</span> t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>max_len<span class="op">)</span></span>
<span id="cb15-91"><a href="#cb15-91" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-92"><a href="#cb15-92" aria-hidden="true" tabindex="-1"></a>		MD5<span class="op">((</span><span class="dt">const</span> <span class="dt">unsigned</span> <span class="dt">char</span><span class="op">*)</span>preimage<span class="op">,</span> len<span class="op">,</span> hashed<span class="op">);</span></span>
<span id="cb15-93"><a href="#cb15-93" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>memcmp<span class="op">(</span>hashed<span class="op">,</span> t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>hash<span class="op">,</span> MD5_DIGEST_LENGTH<span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb15-94"><a href="#cb15-94" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb15-95"><a href="#cb15-95" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* success -- tell others to call it off */</span></span>
<span id="cb15-96"><a href="#cb15-96" aria-hidden="true" tabindex="-1"></a>			pthread_rwlock_wrlock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb15-97"><a href="#cb15-97" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-98"><a href="#cb15-98" aria-hidden="true" tabindex="-1"></a>			t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>success <span class="op">=</span> true<span class="op">;</span></span>
<span id="cb15-99"><a href="#cb15-99" aria-hidden="true" tabindex="-1"></a>			strcpy<span class="op">(</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>preimage<span class="op">,</span> preimage<span class="op">);</span></span>
<span id="cb15-100"><a href="#cb15-100" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-101"><a href="#cb15-101" aria-hidden="true" tabindex="-1"></a>			pthread_rwlock_unlock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb15-102"><a href="#cb15-102" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb15-103"><a href="#cb15-103" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb15-104"><a href="#cb15-104" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* each worker jumps ahead n_workers words, and all workers</span></span>
<span id="cb15-105"><a href="#cb15-105" aria-hidden="true" tabindex="-1"></a><span class="co">		   started at an offset, so all words are covered */</span></span>
<span id="cb15-106"><a href="#cb15-106" aria-hidden="true" tabindex="-1"></a>		changed <span class="op">=</span> word_advance<span class="op">(</span>preimage<span class="op">,</span> t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>n_workers<span class="op">);</span></span>
<span id="cb15-107"><a href="#cb15-107" aria-hidden="true" tabindex="-1"></a>		len <span class="op">=</span> MAX<span class="op">(</span>len<span class="op">,</span> changed<span class="op">);</span></span>
<span id="cb15-108"><a href="#cb15-108" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-109"><a href="#cb15-109" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* check if another worker has succeeded, but only every</span></span>
<span id="cb15-110"><a href="#cb15-110" aria-hidden="true" tabindex="-1"></a><span class="co">		   thousandth iteration, since taking the lock adds overhead */</span></span>
<span id="cb15-111"><a href="#cb15-111" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>iterations<span class="op">++</span> <span class="op">%</span> <span class="dv">1000</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb15-112"><a href="#cb15-112" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb15-113"><a href="#cb15-113" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* in the overwhelming majority of cases workers only read,</span></span>
<span id="cb15-114"><a href="#cb15-114" aria-hidden="true" tabindex="-1"></a><span class="co">			   so an rwlock allows them to continue in parallel */</span></span>
<span id="cb15-115"><a href="#cb15-115" aria-hidden="true" tabindex="-1"></a>			pthread_rwlock_rdlock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb15-116"><a href="#cb15-116" aria-hidden="true" tabindex="-1"></a>			<span class="dt">int</span> success <span class="op">=</span> t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>success<span class="op">;</span></span>
<span id="cb15-117"><a href="#cb15-117" aria-hidden="true" tabindex="-1"></a>			pthread_rwlock_unlock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb15-118"><a href="#cb15-118" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span>success<span class="op">)</span></span>
<span id="cb15-119"><a href="#cb15-119" aria-hidden="true" tabindex="-1"></a>				<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb15-120"><a href="#cb15-120" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb15-121"><a href="#cb15-121" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-122"><a href="#cb15-122" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb15-123"><a href="#cb15-123" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb15-124"><a href="#cb15-124" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-125"><a href="#cb15-125" aria-hidden="true" tabindex="-1"></a><span class="co">/* launch a parallel search for an md5 preimage */</span></span>
<span id="cb15-126"><a href="#cb15-126" aria-hidden="true" tabindex="-1"></a><span class="dt">bool</span> crack<span class="op">(</span><span class="dt">const</span> <span class="dt">unsigned</span> <span class="dt">char</span> <span class="op">*</span>md5<span class="op">,</span> <span class="dt">size_t</span> max_len<span class="op">,</span></span>
<span id="cb15-127"><a href="#cb15-127" aria-hidden="true" tabindex="-1"></a>           <span class="dt">unsigned</span> threads<span class="op">,</span> <span class="dt">char</span> <span class="op">*</span>result<span class="op">)</span></span>
<span id="cb15-128"><a href="#cb15-128" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb15-129"><a href="#cb15-129" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> goal g <span class="op">=</span></span>
<span id="cb15-130"><a href="#cb15-130" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-131"><a href="#cb15-131" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>workers   <span class="op">=</span> malloc<span class="op">(</span>threads <span class="op">*</span> <span class="kw">sizeof</span><span class="op">(</span>pthread_t<span class="op">)),</span></span>
<span id="cb15-132"><a href="#cb15-132" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>n_workers <span class="op">=</span> threads<span class="op">,</span></span>
<span id="cb15-133"><a href="#cb15-133" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>max_len   <span class="op">=</span> max_len<span class="op">,</span></span>
<span id="cb15-134"><a href="#cb15-134" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>success   <span class="op">=</span> false<span class="op">,</span></span>
<span id="cb15-135"><a href="#cb15-135" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>lock      <span class="op">=</span> PTHREAD_RWLOCK_INITIALIZER</span>
<span id="cb15-136"><a href="#cb15-136" aria-hidden="true" tabindex="-1"></a>	<span class="op">};</span></span>
<span id="cb15-137"><a href="#cb15-137" aria-hidden="true" tabindex="-1"></a>	memcpy<span class="op">(</span>g<span class="op">.</span>hash<span class="op">,</span> md5<span class="op">,</span> MD5_DIGEST_LENGTH<span class="op">);</span></span>
<span id="cb15-138"><a href="#cb15-138" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-139"><a href="#cb15-139" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> task <span class="op">*</span>tasks <span class="op">=</span> malloc<span class="op">(</span>threads <span class="op">*</span> <span class="kw">sizeof</span><span class="op">(</span><span class="kw">struct</span> task<span class="op">));</span></span>
<span id="cb15-140"><a href="#cb15-140" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-141"><a href="#cb15-141" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span><span class="dt">size_t</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> threads<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb15-142"><a href="#cb15-142" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-143"><a href="#cb15-143" aria-hidden="true" tabindex="-1"></a>		tasks<span class="op">[</span>i<span class="op">].</span>goal <span class="op">=</span> <span class="op">&amp;</span>g<span class="op">;</span></span>
<span id="cb15-144"><a href="#cb15-144" aria-hidden="true" tabindex="-1"></a>		tasks<span class="op">[</span>i<span class="op">].</span>initial_preimage<span class="op">[</span><span class="dv">0</span><span class="op">]</span> <span class="op">=</span> <span class="ch">&#39;\0&#39;</span><span class="op">;</span></span>
<span id="cb15-145"><a href="#cb15-145" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* offset the starting word for each worker by i */</span></span>
<span id="cb15-146"><a href="#cb15-146" aria-hidden="true" tabindex="-1"></a>		word_advance<span class="op">(</span>tasks<span class="op">[</span>i<span class="op">].</span>initial_preimage<span class="op">,</span> i<span class="op">);</span></span>
<span id="cb15-147"><a href="#cb15-147" aria-hidden="true" tabindex="-1"></a>		pthread_create<span class="op">(</span>g<span class="op">.</span>workers<span class="op">+</span>i<span class="op">,</span> NULL<span class="op">,</span> crack_thread<span class="op">,</span> tasks<span class="op">+</span>i<span class="op">);</span></span>
<span id="cb15-148"><a href="#cb15-148" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-149"><a href="#cb15-149" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-150"><a href="#cb15-150" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* if one worker finds the answer, others will abort */</span></span>
<span id="cb15-151"><a href="#cb15-151" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span><span class="dt">size_t</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> threads<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb15-152"><a href="#cb15-152" aria-hidden="true" tabindex="-1"></a>		pthread_join<span class="op">(</span>g<span class="op">.</span>workers<span class="op">[</span>i<span class="op">],</span> NULL<span class="op">);</span></span>
<span id="cb15-153"><a href="#cb15-153" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-154"><a href="#cb15-154" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>g<span class="op">.</span>success<span class="op">)</span></span>
<span id="cb15-155"><a href="#cb15-155" aria-hidden="true" tabindex="-1"></a>		strcpy<span class="op">(</span>result<span class="op">,</span> g<span class="op">.</span>preimage<span class="op">);</span></span>
<span id="cb15-156"><a href="#cb15-156" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-157"><a href="#cb15-157" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>tasks<span class="op">);</span></span>
<span id="cb15-158"><a href="#cb15-158" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>g<span class="op">.</span>workers<span class="op">);</span></span>
<span id="cb15-159"><a href="#cb15-159" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> g<span class="op">.</span>success<span class="op">;</span></span>
<span id="cb15-160"><a href="#cb15-160" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb15-161"><a href="#cb15-161" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-162"><a href="#cb15-162" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb15-163"><a href="#cb15-163" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb15-164"><a href="#cb15-164" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> preimage<span class="op">[</span>LONGEST_PREIMAGE<span class="op">];</span></span>
<span id="cb15-165"><a href="#cb15-165" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> max_len <span class="op">=</span> <span class="dv">4</span><span class="op">;</span></span>
<span id="cb15-166"><a href="#cb15-166" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> <span class="dt">char</span> md5<span class="op">[</span>MD5_DIGEST_LENGTH<span class="op">];</span></span>
<span id="cb15-167"><a href="#cb15-167" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-168"><a href="#cb15-168" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">!=</span> <span class="dv">2</span> <span class="op">&amp;&amp;</span> argc <span class="op">!=</span> <span class="dv">3</span><span class="op">)</span></span>
<span id="cb15-169"><a href="#cb15-169" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-170"><a href="#cb15-170" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb15-171"><a href="#cb15-171" aria-hidden="true" tabindex="-1"></a>		        <span class="st">&quot;Usage: %s md5-string [search-depth]</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb15-172"><a href="#cb15-172" aria-hidden="true" tabindex="-1"></a>		        argv<span class="op">[</span><span class="dv">0</span><span class="op">]);</span></span>
<span id="cb15-173"><a href="#cb15-173" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb15-174"><a href="#cb15-174" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-175"><a href="#cb15-175" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-176"><a href="#cb15-176" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>hex2md5<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> md5<span class="op">))</span></span>
<span id="cb15-177"><a href="#cb15-177" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-178"><a href="#cb15-178" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb15-179"><a href="#cb15-179" aria-hidden="true" tabindex="-1"></a>		       <span class="st">&quot;Could not parse as md5: %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> argv<span class="op">[</span><span class="dv">1</span><span class="op">]);</span></span>
<span id="cb15-180"><a href="#cb15-180" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb15-181"><a href="#cb15-181" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-182"><a href="#cb15-182" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-183"><a href="#cb15-183" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">&gt;</span> <span class="dv">2</span> <span class="op">&amp;&amp;</span> strtol<span class="op">(</span>argv<span class="op">[</span><span class="dv">2</span><span class="op">],</span> NULL<span class="op">,</span> <span class="dv">10</span><span class="op">))</span></span>
<span id="cb15-184"><a href="#cb15-184" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">((</span>max_len <span class="op">=</span> strtol<span class="op">(</span>argv<span class="op">[</span><span class="dv">2</span><span class="op">],</span> NULL<span class="op">,</span> <span class="dv">10</span><span class="op">))</span> <span class="op">&gt;</span> LONGEST_PREIMAGE<span class="op">)</span></span>
<span id="cb15-185"><a href="#cb15-185" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb15-186"><a href="#cb15-186" aria-hidden="true" tabindex="-1"></a>			fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb15-187"><a href="#cb15-187" aria-hidden="true" tabindex="-1"></a>					<span class="st">&quot;Preimages limited to %d characters</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb15-188"><a href="#cb15-188" aria-hidden="true" tabindex="-1"></a>					LONGEST_PREIMAGE<span class="op">);</span></span>
<span id="cb15-189"><a href="#cb15-189" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb15-190"><a href="#cb15-190" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb15-191"><a href="#cb15-191" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-192"><a href="#cb15-192" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>crack<span class="op">(</span>md5<span class="op">,</span> max_len<span class="op">,</span> <span class="dv">4</span><span class="op">,</span> preimage<span class="op">))</span></span>
<span id="cb15-193"><a href="#cb15-193" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-194"><a href="#cb15-194" aria-hidden="true" tabindex="-1"></a>		puts<span class="op">(</span>preimage<span class="op">);</span></span>
<span id="cb15-195"><a href="#cb15-195" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb15-196"><a href="#cb15-196" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-197"><a href="#cb15-197" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span></span>
<span id="cb15-198"><a href="#cb15-198" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-199"><a href="#cb15-199" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb15-200"><a href="#cb15-200" aria-hidden="true" tabindex="-1"></a>				<span class="st">&quot;Could not find result in strings up to length %d</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb15-201"><a href="#cb15-201" aria-hidden="true" tabindex="-1"></a>		        max_len<span class="op">);</span></span>
<span id="cb15-202"><a href="#cb15-202" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb15-203"><a href="#cb15-203" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-204"><a href="#cb15-204" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Although read-write locks can be implemented in terms of mutexes and condition variables, such implementations are significantly less efficient than is possible. Therefore, this synchronization primitive is included in POSIX.1-2008 for the purpose of allowing more efficient implementations in multi-processor systems.</p>
<p>The final thing to be aware of is that an rwlock implementation can choose either reader-preference or writer-preference. When readers and writers are contending for a lock, the preference determines who gets to skip the queue and go first. When there is a lot of reader activity with a reader-preference, then a writer will continually get moved to the end of the line and experience <strong>starvation</strong>, where it never gets to write. I noticed writer starvation on Linux (glibc) when running four threads on a little 1-core virtual machine. Glibc provides the nonportable <code>pthread_rwlockattr_setkind_np()</code> function to specify a preference.</p>
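<p>As a minimal sketch of requesting writer preference on glibc, the helper below initializes an rwlock through an attribute object. The name <code>rwlock_init_prefer_writer()</code> is made up for illustration. The glibc manual page notes that <code>PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP</code> is the kind that actually favors writers, and this code will not build on systems without the extension.</p>

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for the nonportable _np rwlock extension */
#endif
#include <pthread.h>

/* Hypothetical helper: initialize *lock so that waiting writers are
   favored over newly arriving readers. glibc only. Returns 0 on
   success or an error number. */
int rwlock_init_prefer_writer(pthread_rwlock_t *lock)
{
	pthread_rwlockattr_t attr;
	int err = pthread_rwlockattr_init(&attr);

	if (err != 0)
		return err;

	err = pthread_rwlockattr_setkind_np(&attr,
	        PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
	if (err == 0)
		err = pthread_rwlock_init(lock, &attr);

	/* the attribute object is no longer needed once the lock exists */
	pthread_rwlockattr_destroy(&attr);
	return err;
}
```

<p>A lock set up this way would be initialized at run time rather than with <code>PTHREAD_RWLOCK_INITIALIZER</code>, and should be released with <code>pthread_rwlock_destroy()</code> when finished.</p>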
<p>You may have noticed that workers in our password cracker use polling to see whether the solution has been found, and whether they should give up. We’ll examine a more explicit method of cancellation in a later section.</p>
<h4 id="semaphores">Semaphores</h4>
<p>A semaphore keeps count of an abstract number of resource “units” that are available. Threads can safely add or remove a unit without causing a data race. When a thread requests a unit but none are available, the thread blocks until another thread adds one.</p>
<p>A semaphore is like a mix between a lock and a condition variable. Unlike mutexes, semaphores have no concept of an owner. Any thread may release threads blocked on a semaphore, whereas with a mutex the lock holder must unlock it. Unlike a condition variable, a semaphore operates independently of a predicate.</p>
<p>An example of a problem well suited to semaphores is ensuring that at most two threads work on a task at once. You would initialize the semaphore to the value two, and have a bunch of threads wait on the semaphore. After two get past, the rest will block. When each thread is done, it posts one unit back to the semaphore, which allows another thread to take its place.</p>
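<p>The two-at-a-time scenario might be sketched as follows. All the names (<code>slots</code>, <code>limited_worker()</code>, <code>run_limited()</code>) and the thread count are assumptions for illustration. A mutex-protected counter records the peak number of threads inside the guarded section, only so that the limit can be observed.</p>

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

#define N_THREADS  8
#define MAX_ACTIVE 2

static sem_t slots;
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
static int active = 0, peak = 0;

static void *limited_worker(void *arg)
{
	(void) arg;

	/* take a unit; blocks while MAX_ACTIVE threads already hold one */
	while (sem_wait(&slots) == -1 && errno == EINTR)
		; /* retry if interrupted by a signal */

	pthread_mutex_lock(&stats_lock);
	if (++active > peak)
		peak = active;
	pthread_mutex_unlock(&stats_lock);

	/* ... the limited work would go here ... */

	pthread_mutex_lock(&stats_lock);
	active--;
	pthread_mutex_unlock(&stats_lock);

	sem_post(&slots); /* give the unit back so another thread may enter */
	return NULL;
}

/* run N_THREADS workers and return the greatest number that were
   ever inside the guarded section at the same time */
int run_limited(void)
{
	pthread_t threads[N_THREADS];

	active = peak = 0;
	sem_init(&slots, 0, MAX_ACTIVE);
	for (int i = 0; i < N_THREADS; i++)
		pthread_create(&threads[i], NULL, limited_worker, NULL);
	for (int i = 0; i < N_THREADS; i++)
		pthread_join(threads[i], NULL);
	sem_destroy(&slots);
	return peak;
}
```

<p>However the threads happen to be scheduled, <code>run_limited()</code> should never report a peak greater than <code>MAX_ACTIVE</code>.</p>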
<p>In reality, if you’ve got pthreads, you only <em>need</em> semaphores for asynchronous signal handlers. You <em>can</em> use them in other situations, but this is the only place they are needed. Mutexes aren’t async-signal-safe, and making them so would slow down ordinary mutex operations.</p>
<p>Here’s an example of posting a semaphore from a signal handler:</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* sem_tickler.c */</span></span>
<span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;semaphore.h&gt;</span></span>
<span id="cb16-4"><a href="#cb16-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;signal.h&gt;</span></span>
<span id="cb16-5"><a href="#cb16-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb16-6"><a href="#cb16-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-7"><a href="#cb16-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unistd.h&gt;</span></span>
<span id="cb16-8"><a href="#cb16-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#if !defined(_POSIX_SEMAPHORES) || _POSIX_SEMAPHORES &lt; 0</span></span>
<span id="cb16-9"><a href="#cb16-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#error your OS lacks POSIX semaphore support</span></span>
<span id="cb16-10"><a href="#cb16-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#endif</span></span>
<span id="cb16-11"><a href="#cb16-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-12"><a href="#cb16-12" aria-hidden="true" tabindex="-1"></a>sem_t tickler<span class="op">;</span></span>
<span id="cb16-13"><a href="#cb16-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-14"><a href="#cb16-14" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> int_catch<span class="op">(</span><span class="dt">int</span> sig<span class="op">)</span></span>
<span id="cb16-15"><a href="#cb16-15" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb16-16"><a href="#cb16-16" aria-hidden="true" tabindex="-1"></a>	<span class="op">(</span><span class="dt">void</span><span class="op">)</span> sig<span class="op">;</span></span>
<span id="cb16-17"><a href="#cb16-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-18"><a href="#cb16-18" aria-hidden="true" tabindex="-1"></a>	signal<span class="op">(</span>SIGINT<span class="op">,</span> <span class="op">&amp;</span>int_catch<span class="op">);</span></span>
<span id="cb16-19"><a href="#cb16-19" aria-hidden="true" tabindex="-1"></a>	sem_post<span class="op">(&amp;</span>tickler<span class="op">);</span> <span class="co">/* sem_post() is async-signal-safe */</span></span>
<span id="cb16-20"><a href="#cb16-20" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb16-21"><a href="#cb16-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-22"><a href="#cb16-22" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb16-23"><a href="#cb16-23" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb16-24"><a href="#cb16-24" aria-hidden="true" tabindex="-1"></a>	sem_init<span class="op">(&amp;</span>tickler<span class="op">,</span> <span class="dv">0</span><span class="op">,</span> <span class="dv">0</span><span class="op">);</span></span>
<span id="cb16-25"><a href="#cb16-25" aria-hidden="true" tabindex="-1"></a>	signal<span class="op">(</span>SIGINT<span class="op">,</span> <span class="op">&amp;</span>int_catch<span class="op">);</span></span>
<span id="cb16-26"><a href="#cb16-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-27"><a href="#cb16-27" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> <span class="dv">3</span><span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb16-28"><a href="#cb16-28" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb16-29"><a href="#cb16-29" aria-hidden="true" tabindex="-1"></a>		sem_wait<span class="op">(&amp;</span>tickler<span class="op">);</span></span>
<span id="cb16-30"><a href="#cb16-30" aria-hidden="true" tabindex="-1"></a>		puts<span class="op">(</span><span class="st">&quot;That tickles!&quot;</span><span class="op">);</span></span>
<span id="cb16-31"><a href="#cb16-31" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb16-32"><a href="#cb16-32" aria-hidden="true" tabindex="-1"></a>	puts<span class="op">(</span><span class="st">&quot;(Died from overtickling)&quot;</span><span class="op">);</span></span>
<span id="cb16-33"><a href="#cb16-33" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb16-34"><a href="#cb16-34" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Semaphores aren’t even necessary for proper signal handling. It’s easier to have a thread simply <code>sigwait()</code> than it is to set up an asynchronous handler. In the example below, the main thread waits, but you can spawn a dedicated thread for this in a real application.</p>
<div class="sourceCode" id="cb17"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* sigwait_tickler.c */</span></span>
<span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;signal.h&gt;</span></span>
<span id="cb17-4"><a href="#cb17-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb17-5"><a href="#cb17-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-6"><a href="#cb17-6" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb17-7"><a href="#cb17-7" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb17-8"><a href="#cb17-8" aria-hidden="true" tabindex="-1"></a>	sigset_t set<span class="op">;</span></span>
<span id="cb17-9"><a href="#cb17-9" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> which<span class="op">;</span></span>
<span id="cb17-10"><a href="#cb17-10" aria-hidden="true" tabindex="-1"></a>	sigemptyset<span class="op">(&amp;</span>set<span class="op">);</span></span>
<span id="cb17-11"><a href="#cb17-11" aria-hidden="true" tabindex="-1"></a>	sigaddset<span class="op">(&amp;</span>set<span class="op">,</span> SIGINT<span class="op">);</span></span>
<span id="cb17-12"><a href="#cb17-12" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* POSIX requires the signals to be blocked before sigwait() */</span></span>
<span id="cb17-13"><a href="#cb17-13" aria-hidden="true" tabindex="-1"></a>	sigprocmask<span class="op">(</span>SIG_BLOCK<span class="op">,</span> <span class="op">&amp;</span>set<span class="op">,</span> NULL<span class="op">);</span></span>
<span id="cb17-14"><a href="#cb17-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-15"><a href="#cb17-15" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span><span class="dt">int</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> <span class="dv">3</span><span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb17-16"><a href="#cb17-16" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb17-17"><a href="#cb17-17" aria-hidden="true" tabindex="-1"></a>		sigwait<span class="op">(&amp;</span>set<span class="op">,</span> <span class="op">&amp;</span>which<span class="op">);</span></span>
<span id="cb17-18"><a href="#cb17-18" aria-hidden="true" tabindex="-1"></a>		puts<span class="op">(</span><span class="st">&quot;That tickles!&quot;</span><span class="op">);</span></span>
<span id="cb17-19"><a href="#cb17-19" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb17-20"><a href="#cb17-20" aria-hidden="true" tabindex="-1"></a>	puts<span class="op">(</span><span class="st">&quot;(Died from overtickling)&quot;</span><span class="op">);</span></span>
<span id="cb17-21"><a href="#cb17-21" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb17-22"><a href="#cb17-22" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>So don’t feel dependent on semaphores. In fact, your system may not have them. The POSIX semaphore API works with pthreads and is present in POSIX.1-2008, but was an optional part of POSIX.1b in earlier versions. Apple, for one, <a href="https://lists.apple.com/archives/darwin-kernel/2009/Apr/msg00010.html">decided</a> to punt, so unnamed semaphore functions like <code>sem_init()</code> are stubbed on macOS to return error codes.</p>
<h3 id="cancellation">Cancellation</h3>
<p>Thread cancellation is generally used when you have threads doing long-running tasks and there’s a way for a user to abort through the UI or console. Another common scenario is when multiple threads set off to explore a search space and one finds the answer first.</p>
<p>Our previous reader-writer lock example was the second scenario, where the threads explored a search space. It was an example of do-it-yourself cancellation through polling. However, sometimes threads aren’t able to poll, such as when they are blocked on I/O or a lock. Pthreads offers an API to cancel threads even in those situations.</p>
<p>By default a cancelled thread isn’t immediately blown away, because it may have a mutex locked, be holding resources, or have a potentially broken invariant. The canceller wouldn’t know how to repair that invariant without some complicated logic. The thread to be cancelled needs to be written to do cleanup and unlock mutexes.</p>
<p>For each thread, cancellation can be enabled or disabled, and if enabled, may be in deferred or asynchronous mode. The default is enabled and deferred, which allows a cancelled thread to survive until it reaches its next <strong>cancellation point</strong>, such as waiting on a condition variable or blocking on I/O (see <a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_05_02">full list</a>). In a purely computational section of code you can add your own cancellation points with <code>pthread_testcancel()</code>.</p>
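<p>Before the full program, here is a small sketch of deferred cancellation in isolation. A worker spins in pure computation, adds its own cancellation point, and registers a cleanup handler to release what it holds. The names <code>spin()</code>, <code>release_buffer()</code> and <code>cancel_demo()</code> are invented for this illustration, as is the choice to test for cancellation every thousand iterations.</p>

```c
#include <pthread.h>
#include <stdlib.h>

static int cleanups_run = 0; /* inspected only after joining the worker */

/* cleanup handler: runs when the thread is cancelled */
static void release_buffer(void *buf)
{
	free(buf);
	cleanups_run++;
}

static void *spin(void *arg)
{
	(void) arg;
	char *buf = malloc(1024);
	volatile unsigned long n = 0;

	pthread_cleanup_push(release_buffer, buf);
	for (;;)
	{
		n++; /* pure computation contains no cancellation points... */
		if (n % 1000 == 0)
			pthread_testcancel(); /* ...so we add one ourselves */
	}
	/* never reached, but push and pop must pair up lexically */
	pthread_cleanup_pop(1);
	return NULL;
}

/* Start a worker, cancel it, and wait for it to finish. Returns 1 if
   the thread ended by cancellation and its cleanup handler ran exactly
   once, and 0 otherwise. */
int cancel_demo(void)
{
	pthread_t t;
	void *res;

	cleanups_run = 0;
	if (pthread_create(&t, NULL, spin, NULL) != 0)
		return 0;
	pthread_cancel(t);     /* request only; takes effect at testcancel */
	pthread_join(t, &res);
	return res == PTHREAD_CANCELED && cleanups_run == 1;
}
```

<p>Because the default mode is deferred, the request made by <code>pthread_cancel()</code> stays pending until the worker reaches <code>pthread_testcancel()</code>, by which time the cleanup handler has been pushed.</p>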
<p>Let’s see how to modify our previous MD5 cracking example using standard pthread cancellation. Three of the functions are the same as before: <code>word_advance()</code>, <code>hex2md5()</code>, and <code>main()</code>. But we now use a condition variable to alert <code>crack()</code> whenever a <code>crack_thread()</code> returns.</p>
<div class="sourceCode" id="cb18"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* 5dm-testcancel.c */</span></span>
<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdbool.h&gt;</span></span>
<span id="cb18-4"><a href="#cb18-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb18-5"><a href="#cb18-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb18-6"><a href="#cb18-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb18-7"><a href="#cb18-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-8"><a href="#cb18-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;openssl/md5.h&gt;</span></span>
<span id="cb18-9"><a href="#cb18-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;pthread.h&gt;</span></span>
<span id="cb18-10"><a href="#cb18-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-11"><a href="#cb18-11" aria-hidden="true" tabindex="-1"></a><span class="pp">#define ASCII_FIRST &#39; &#39;</span></span>
<span id="cb18-12"><a href="#cb18-12" aria-hidden="true" tabindex="-1"></a><span class="pp">#define ASCII_LAST  &#39;~&#39;</span></span>
<span id="cb18-13"><a href="#cb18-13" aria-hidden="true" tabindex="-1"></a><span class="pp">#define N_ALPHA (1 + ASCII_LAST - ASCII_FIRST)</span></span>
<span id="cb18-14"><a href="#cb18-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#define LONGEST_PREIMAGE 128</span></span>
<span id="cb18-15"><a href="#cb18-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-16"><a href="#cb18-16" aria-hidden="true" tabindex="-1"></a><span class="pp">#define MAX(x,y) ((x)&lt;(y) ? (y) : (x))</span></span>
<span id="cb18-17"><a href="#cb18-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-18"><a href="#cb18-18" aria-hidden="true" tabindex="-1"></a><span class="dt">unsigned</span> word_advance<span class="op">(</span><span class="dt">char</span> <span class="op">*</span>word<span class="op">,</span> <span class="dt">unsigned</span> delta<span class="op">)</span></span>
<span id="cb18-19"><a href="#cb18-19" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb18-20"><a href="#cb18-20" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>delta <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb18-21"><a href="#cb18-21" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb18-22"><a href="#cb18-22" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(*</span>word <span class="op">==</span> <span class="ch">&#39;\0&#39;</span><span class="op">)</span></span>
<span id="cb18-23"><a href="#cb18-23" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-24"><a href="#cb18-24" aria-hidden="true" tabindex="-1"></a>		<span class="op">*</span>word<span class="op">++</span> <span class="op">=</span> ASCII_FIRST <span class="op">+</span> delta <span class="op">-</span> <span class="dv">1</span><span class="op">;</span></span>
<span id="cb18-25"><a href="#cb18-25" aria-hidden="true" tabindex="-1"></a>		<span class="op">*</span>word <span class="op">=</span> <span class="ch">&#39;\0&#39;</span><span class="op">;</span></span>
<span id="cb18-26"><a href="#cb18-26" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-27"><a href="#cb18-27" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span></span>
<span id="cb18-28"><a href="#cb18-28" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-29"><a href="#cb18-29" aria-hidden="true" tabindex="-1"></a>		<span class="dt">char</span> c <span class="op">=</span> <span class="op">*</span>word <span class="op">-</span> ASCII_FIRST<span class="op">;</span></span>
<span id="cb18-30"><a href="#cb18-30" aria-hidden="true" tabindex="-1"></a>		<span class="op">*</span>word <span class="op">=</span> ASCII_FIRST <span class="op">+</span> <span class="op">((</span>c <span class="op">+</span> delta<span class="op">)</span> <span class="op">%</span> N_ALPHA<span class="op">);</span></span>
<span id="cb18-31"><a href="#cb18-31" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>c <span class="op">+</span> delta <span class="op">&gt;=</span> N_ALPHA<span class="op">)</span></span>
<span id="cb18-32"><a href="#cb18-32" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> <span class="dv">1</span> <span class="op">+</span> word_advance<span class="op">(</span>word<span class="op">+</span><span class="dv">1</span><span class="op">,</span> <span class="dv">1</span> <span class="co">/* not delta */</span><span class="op">);</span></span>
<span id="cb18-33"><a href="#cb18-33" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-34"><a href="#cb18-34" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">1</span><span class="op">;</span></span>
<span id="cb18-35"><a href="#cb18-35" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb18-36"><a href="#cb18-36" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-37"><a href="#cb18-37" aria-hidden="true" tabindex="-1"></a><span class="dt">bool</span> hex2md5<span class="op">(</span><span class="dt">const</span> <span class="dt">char</span> <span class="op">*</span>hex<span class="op">,</span> <span class="dt">unsigned</span> <span class="dt">char</span> <span class="op">*</span>b<span class="op">)</span></span>
<span id="cb18-38"><a href="#cb18-38" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb18-39"><a href="#cb18-39" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> offset <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb18-40"><a href="#cb18-40" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span><span class="op">(</span>strlen<span class="op">(</span>hex<span class="op">)</span> <span class="op">!=</span> MD5_DIGEST_LENGTH<span class="op">*</span><span class="dv">2</span><span class="op">)</span></span>
<span id="cb18-41"><a href="#cb18-41" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> false<span class="op">;</span></span>
<span id="cb18-42"><a href="#cb18-42" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>offset <span class="op">&lt;</span> MD5_DIGEST_LENGTH<span class="op">*</span><span class="dv">2</span><span class="op">)</span></span>
<span id="cb18-43"><a href="#cb18-43" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-44"><a href="#cb18-44" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>sscanf<span class="op">(</span>hex<span class="op">+</span>offset<span class="op">,</span> <span class="st">&quot;%2hhx&quot;</span><span class="op">,</span> b<span class="op">++)</span> <span class="op">==</span> <span class="dv">1</span><span class="op">)</span></span>
<span id="cb18-45"><a href="#cb18-45" aria-hidden="true" tabindex="-1"></a>			offset <span class="op">+=</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb18-46"><a href="#cb18-46" aria-hidden="true" tabindex="-1"></a>		<span class="cf">else</span></span>
<span id="cb18-47"><a href="#cb18-47" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> false<span class="op">;</span></span>
<span id="cb18-48"><a href="#cb18-48" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-49"><a href="#cb18-49" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> true<span class="op">;</span></span>
<span id="cb18-50"><a href="#cb18-50" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb18-51"><a href="#cb18-51" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-52"><a href="#cb18-52" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> goal</span>
<span id="cb18-53"><a href="#cb18-53" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb18-54"><a href="#cb18-54" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* input */</span></span>
<span id="cb18-55"><a href="#cb18-55" aria-hidden="true" tabindex="-1"></a>	pthread_t <span class="op">*</span>workers<span class="op">;</span></span>
<span id="cb18-56"><a href="#cb18-56" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> n_workers<span class="op">;</span></span>
<span id="cb18-57"><a href="#cb18-57" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> max_len<span class="op">;</span></span>
<span id="cb18-58"><a href="#cb18-58" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> <span class="dt">char</span> hash<span class="op">[</span>MD5_DIGEST_LENGTH<span class="op">];</span></span>
<span id="cb18-59"><a href="#cb18-59" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-60"><a href="#cb18-60" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* output */</span></span>
<span id="cb18-61"><a href="#cb18-61" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_t lock<span class="op">;</span></span>
<span id="cb18-62"><a href="#cb18-62" aria-hidden="true" tabindex="-1"></a>	pthread_cond_t returning<span class="op">;</span></span>
<span id="cb18-63"><a href="#cb18-63" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> n_done<span class="op">;</span></span>
<span id="cb18-64"><a href="#cb18-64" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> preimage<span class="op">[</span>LONGEST_PREIMAGE<span class="op">];</span></span>
<span id="cb18-65"><a href="#cb18-65" aria-hidden="true" tabindex="-1"></a>	<span class="dt">bool</span> success<span class="op">;</span></span>
<span id="cb18-66"><a href="#cb18-66" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span>
<span id="cb18-67"><a href="#cb18-67" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-68"><a href="#cb18-68" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> task</span>
<span id="cb18-69"><a href="#cb18-69" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb18-70"><a href="#cb18-70" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> goal <span class="op">*</span>goal<span class="op">;</span></span>
<span id="cb18-71"><a href="#cb18-71" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> initial_preimage<span class="op">[</span>LONGEST_PREIMAGE<span class="op">];</span></span>
<span id="cb18-72"><a href="#cb18-72" aria-hidden="true" tabindex="-1"></a><span class="op">};</span></span>
<span id="cb18-73"><a href="#cb18-73" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-74"><a href="#cb18-74" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> <span class="op">*</span>crack_thread<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>arg<span class="op">)</span></span>
<span id="cb18-75"><a href="#cb18-75" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb18-76"><a href="#cb18-76" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> task <span class="op">*</span>t <span class="op">=</span> arg<span class="op">;</span></span>
<span id="cb18-77"><a href="#cb18-77" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> len<span class="op">,</span> changed<span class="op">;</span></span>
<span id="cb18-78"><a href="#cb18-78" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> <span class="dt">char</span> hashed<span class="op">[</span>MD5_DIGEST_LENGTH<span class="op">];</span></span>
<span id="cb18-79"><a href="#cb18-79" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> preimage<span class="op">[</span>LONGEST_PREIMAGE<span class="op">];</span></span>
<span id="cb18-80"><a href="#cb18-80" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> iterations <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb18-81"><a href="#cb18-81" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-82"><a href="#cb18-82" aria-hidden="true" tabindex="-1"></a>	strcpy<span class="op">(</span>preimage<span class="op">,</span> t<span class="op">-&gt;</span>initial_preimage<span class="op">);</span></span>
<span id="cb18-83"><a href="#cb18-83" aria-hidden="true" tabindex="-1"></a>	len <span class="op">=</span> strlen<span class="op">(</span>preimage<span class="op">);</span></span>
<span id="cb18-84"><a href="#cb18-84" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-85"><a href="#cb18-85" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>len <span class="op">&lt;=</span> t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>max_len<span class="op">)</span></span>
<span id="cb18-86"><a href="#cb18-86" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-87"><a href="#cb18-87" aria-hidden="true" tabindex="-1"></a>		MD5<span class="op">((</span><span class="dt">const</span> <span class="dt">unsigned</span> <span class="dt">char</span><span class="op">*)</span>preimage<span class="op">,</span> len<span class="op">,</span> hashed<span class="op">);</span></span>
<span id="cb18-88"><a href="#cb18-88" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>memcmp<span class="op">(</span>hashed<span class="op">,</span> t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>hash<span class="op">,</span> MD5_DIGEST_LENGTH<span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb18-89"><a href="#cb18-89" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb18-90"><a href="#cb18-90" aria-hidden="true" tabindex="-1"></a>			pthread_mutex_lock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb18-91"><a href="#cb18-91" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-92"><a href="#cb18-92" aria-hidden="true" tabindex="-1"></a>			t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>success <span class="op">=</span> true<span class="op">;</span></span>
<span id="cb18-93"><a href="#cb18-93" aria-hidden="true" tabindex="-1"></a>			strcpy<span class="op">(</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>preimage<span class="op">,</span> preimage<span class="op">);</span></span>
<span id="cb18-94"><a href="#cb18-94" aria-hidden="true" tabindex="-1"></a>			t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>n_done<span class="op">++;</span></span>
<span id="cb18-95"><a href="#cb18-95" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-96"><a href="#cb18-96" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* alert the boss that another worker is done */</span></span>
<span id="cb18-97"><a href="#cb18-97" aria-hidden="true" tabindex="-1"></a>			pthread_cond_signal<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>returning<span class="op">);</span></span>
<span id="cb18-98"><a href="#cb18-98" aria-hidden="true" tabindex="-1"></a>			pthread_mutex_unlock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb18-99"><a href="#cb18-99" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb18-100"><a href="#cb18-100" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb18-101"><a href="#cb18-101" aria-hidden="true" tabindex="-1"></a>		changed <span class="op">=</span> word_advance<span class="op">(</span>preimage<span class="op">,</span> t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>n_workers<span class="op">);</span></span>
<span id="cb18-102"><a href="#cb18-102" aria-hidden="true" tabindex="-1"></a>		len <span class="op">=</span> MAX<span class="op">(</span>len<span class="op">,</span> changed<span class="op">);</span></span>
<span id="cb18-103"><a href="#cb18-103" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-104"><a href="#cb18-104" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>iterations<span class="op">++</span> <span class="op">%</span> <span class="dv">1000</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb18-105"><a href="#cb18-105" aria-hidden="true" tabindex="-1"></a>			pthread_testcancel<span class="op">();</span> <span class="co">/* add a cancellation point */</span></span>
<span id="cb18-106"><a href="#cb18-106" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-107"><a href="#cb18-107" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-108"><a href="#cb18-108" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_lock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb18-109"><a href="#cb18-109" aria-hidden="true" tabindex="-1"></a>	t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>n_done<span class="op">++;</span></span>
<span id="cb18-110"><a href="#cb18-110" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* alert the boss that another worker is done */</span></span>
<span id="cb18-111"><a href="#cb18-111" aria-hidden="true" tabindex="-1"></a>	pthread_cond_signal<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>returning<span class="op">);</span></span>
<span id="cb18-112"><a href="#cb18-112" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_unlock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb18-113"><a href="#cb18-113" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb18-114"><a href="#cb18-114" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb18-115"><a href="#cb18-115" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-116"><a href="#cb18-116" aria-hidden="true" tabindex="-1"></a><span class="co">/* cancellation cleanup function that we also call</span></span>
<span id="cb18-117"><a href="#cb18-117" aria-hidden="true" tabindex="-1"></a><span class="co"> * during regular exit from the crack() function */</span></span>
<span id="cb18-118"><a href="#cb18-118" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> crack_cleanup<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>arg<span class="op">)</span></span>
<span id="cb18-119"><a href="#cb18-119" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb18-120"><a href="#cb18-120" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> task <span class="op">*</span>tasks <span class="op">=</span> arg<span class="op">;</span></span>
<span id="cb18-121"><a href="#cb18-121" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> goal <span class="op">*</span>g <span class="op">=</span> tasks<span class="op">[</span><span class="dv">0</span><span class="op">].</span>goal<span class="op">;</span></span>
<span id="cb18-122"><a href="#cb18-122" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-123"><a href="#cb18-123" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* this mutex unlock pairs with the lock in the crack() function */</span></span>
<span id="cb18-124"><a href="#cb18-124" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_unlock<span class="op">(&amp;</span>g<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb18-125"><a href="#cb18-125" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span><span class="dt">size_t</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> g<span class="op">-&gt;</span>n_workers<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb18-126"><a href="#cb18-126" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-127"><a href="#cb18-127" aria-hidden="true" tabindex="-1"></a>		pthread_cancel<span class="op">(</span>g<span class="op">-&gt;</span>workers<span class="op">[</span>i<span class="op">]);</span></span>
<span id="cb18-128"><a href="#cb18-128" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* must wait for each to terminate, so that freeing</span></span>
<span id="cb18-129"><a href="#cb18-129" aria-hidden="true" tabindex="-1"></a><span class="co">		 * their shared memory is safe */</span></span>
<span id="cb18-130"><a href="#cb18-130" aria-hidden="true" tabindex="-1"></a>		pthread_join<span class="op">(</span>g<span class="op">-&gt;</span>workers<span class="op">[</span>i<span class="op">],</span> NULL<span class="op">);</span></span>
<span id="cb18-131"><a href="#cb18-131" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-132"><a href="#cb18-132" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* now it&#39;s safe to free memory */</span></span>
<span id="cb18-133"><a href="#cb18-133" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>g<span class="op">-&gt;</span>workers<span class="op">);</span></span>
<span id="cb18-134"><a href="#cb18-134" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>tasks<span class="op">);</span></span>
<span id="cb18-135"><a href="#cb18-135" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb18-136"><a href="#cb18-136" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-137"><a href="#cb18-137" aria-hidden="true" tabindex="-1"></a><span class="dt">bool</span> crack<span class="op">(</span><span class="dt">const</span> <span class="dt">unsigned</span> <span class="dt">char</span> <span class="op">*</span>md5<span class="op">,</span> <span class="dt">size_t</span> max_len<span class="op">,</span></span>
<span id="cb18-138"><a href="#cb18-138" aria-hidden="true" tabindex="-1"></a>           <span class="dt">unsigned</span> threads<span class="op">,</span> <span class="dt">char</span> <span class="op">*</span>result<span class="op">)</span></span>
<span id="cb18-139"><a href="#cb18-139" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb18-140"><a href="#cb18-140" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> goal g <span class="op">=</span></span>
<span id="cb18-141"><a href="#cb18-141" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-142"><a href="#cb18-142" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>workers   <span class="op">=</span> malloc<span class="op">(</span>threads <span class="op">*</span> <span class="kw">sizeof</span><span class="op">(</span>pthread_t<span class="op">)),</span></span>
<span id="cb18-143"><a href="#cb18-143" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>n_workers <span class="op">=</span> threads<span class="op">,</span></span>
<span id="cb18-144"><a href="#cb18-144" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>max_len   <span class="op">=</span> max_len<span class="op">,</span></span>
<span id="cb18-145"><a href="#cb18-145" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>success   <span class="op">=</span> false<span class="op">,</span></span>
<span id="cb18-146"><a href="#cb18-146" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>n_done    <span class="op">=</span> <span class="dv">0</span><span class="op">,</span></span>
<span id="cb18-147"><a href="#cb18-147" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>lock      <span class="op">=</span> PTHREAD_MUTEX_INITIALIZER<span class="op">,</span></span>
<span id="cb18-148"><a href="#cb18-148" aria-hidden="true" tabindex="-1"></a>		<span class="op">.</span>returning <span class="op">=</span> PTHREAD_COND_INITIALIZER</span>
<span id="cb18-149"><a href="#cb18-149" aria-hidden="true" tabindex="-1"></a>	<span class="op">};</span></span>
<span id="cb18-150"><a href="#cb18-150" aria-hidden="true" tabindex="-1"></a>	memcpy<span class="op">(</span>g<span class="op">.</span>hash<span class="op">,</span> md5<span class="op">,</span> MD5_DIGEST_LENGTH<span class="op">);</span></span>
<span id="cb18-151"><a href="#cb18-151" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-152"><a href="#cb18-152" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> task <span class="op">*</span>tasks <span class="op">=</span> malloc<span class="op">(</span>threads <span class="op">*</span> <span class="kw">sizeof</span><span class="op">(</span><span class="kw">struct</span> task<span class="op">));</span></span>
<span id="cb18-153"><a href="#cb18-153" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-154"><a href="#cb18-154" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span><span class="dt">size_t</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> threads<span class="op">;</span> i<span class="op">++)</span></span>
<span id="cb18-155"><a href="#cb18-155" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-156"><a href="#cb18-156" aria-hidden="true" tabindex="-1"></a>		tasks<span class="op">[</span>i<span class="op">].</span>goal <span class="op">=</span> <span class="op">&amp;</span>g<span class="op">;</span></span>
<span id="cb18-157"><a href="#cb18-157" aria-hidden="true" tabindex="-1"></a>		tasks<span class="op">[</span>i<span class="op">].</span>initial_preimage<span class="op">[</span><span class="dv">0</span><span class="op">]</span> <span class="op">=</span> <span class="ch">&#39;\0&#39;</span><span class="op">;</span></span>
<span id="cb18-158"><a href="#cb18-158" aria-hidden="true" tabindex="-1"></a>		word_advance<span class="op">(</span>tasks<span class="op">[</span>i<span class="op">].</span>initial_preimage<span class="op">,</span> i<span class="op">);</span></span>
<span id="cb18-159"><a href="#cb18-159" aria-hidden="true" tabindex="-1"></a>		pthread_create<span class="op">(</span>g<span class="op">.</span>workers<span class="op">+</span>i<span class="op">,</span> NULL<span class="op">,</span> crack_thread<span class="op">,</span> tasks<span class="op">+</span>i<span class="op">);</span></span>
<span id="cb18-160"><a href="#cb18-160" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-161"><a href="#cb18-161" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-162"><a href="#cb18-162" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* coming up to cancellation points, so establish</span></span>
<span id="cb18-163"><a href="#cb18-163" aria-hidden="true" tabindex="-1"></a><span class="co">	 * a cleanup handler */</span></span>
<span id="cb18-164"><a href="#cb18-164" aria-hidden="true" tabindex="-1"></a>	pthread_cleanup_push<span class="op">(</span>crack_cleanup<span class="op">,</span> tasks<span class="op">);</span></span>
<span id="cb18-165"><a href="#cb18-165" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-166"><a href="#cb18-166" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_lock<span class="op">(&amp;</span>g<span class="op">.</span>lock<span class="op">);</span></span>
<span id="cb18-167"><a href="#cb18-167" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* We can&#39;t join() on all the workers now because it&#39;s up to</span></span>
<span id="cb18-168"><a href="#cb18-168" aria-hidden="true" tabindex="-1"></a><span class="co">	 * us to cancel them after one finds the answer. We have to</span></span>
<span id="cb18-169"><a href="#cb18-169" aria-hidden="true" tabindex="-1"></a><span class="co">	 * remain responsive and not block on any particular worker */</span></span>
<span id="cb18-170"><a href="#cb18-170" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(!</span>g<span class="op">.</span>success <span class="op">&amp;&amp;</span> g<span class="op">.</span>n_done <span class="op">&lt;</span> threads<span class="op">)</span></span>
<span id="cb18-171"><a href="#cb18-171" aria-hidden="true" tabindex="-1"></a>		pthread_cond_wait<span class="op">(&amp;</span>g<span class="op">.</span>returning<span class="op">,</span> <span class="op">&amp;</span>g<span class="op">.</span>lock<span class="op">);</span></span>
<span id="cb18-172"><a href="#cb18-172" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* at this point either a thread succeeded or all have given up */</span></span>
<span id="cb18-173"><a href="#cb18-173" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>g<span class="op">.</span>success<span class="op">)</span></span>
<span id="cb18-174"><a href="#cb18-174" aria-hidden="true" tabindex="-1"></a>		strcpy<span class="op">(</span>result<span class="op">,</span> g<span class="op">.</span>preimage<span class="op">);</span></span>
<span id="cb18-175"><a href="#cb18-175" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* mutex unlocked in the cleanup handler */</span></span>
<span id="cb18-176"><a href="#cb18-176" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-177"><a href="#cb18-177" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* Use the same cleanup handler for normal exit too. The &quot;1&quot;</span></span>
<span id="cb18-178"><a href="#cb18-178" aria-hidden="true" tabindex="-1"></a><span class="co">	 * argument says to execute the function we had previously pushed */</span></span>
<span id="cb18-179"><a href="#cb18-179" aria-hidden="true" tabindex="-1"></a>	pthread_cleanup_pop<span class="op">(</span><span class="dv">1</span><span class="op">);</span></span>
<span id="cb18-180"><a href="#cb18-180" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> g<span class="op">.</span>success<span class="op">;</span></span>
<span id="cb18-181"><a href="#cb18-181" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb18-182"><a href="#cb18-182" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-183"><a href="#cb18-183" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb18-184"><a href="#cb18-184" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb18-185"><a href="#cb18-185" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> preimage<span class="op">[</span>LONGEST_PREIMAGE<span class="op">];</span></span>
<span id="cb18-186"><a href="#cb18-186" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> max_len <span class="op">=</span> <span class="dv">4</span><span class="op">;</span></span>
<span id="cb18-187"><a href="#cb18-187" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> <span class="dt">char</span> md5<span class="op">[</span>MD5_DIGEST_LENGTH<span class="op">];</span></span>
<span id="cb18-188"><a href="#cb18-188" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-189"><a href="#cb18-189" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">!=</span> <span class="dv">2</span> <span class="op">&amp;&amp;</span> argc <span class="op">!=</span> <span class="dv">3</span><span class="op">)</span></span>
<span id="cb18-190"><a href="#cb18-190" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-191"><a href="#cb18-191" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb18-192"><a href="#cb18-192" aria-hidden="true" tabindex="-1"></a>		        <span class="st">&quot;Usage: %s md5-string [search-depth]</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb18-193"><a href="#cb18-193" aria-hidden="true" tabindex="-1"></a>		        argv<span class="op">[</span><span class="dv">0</span><span class="op">]);</span></span>
<span id="cb18-194"><a href="#cb18-194" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb18-195"><a href="#cb18-195" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-196"><a href="#cb18-196" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-197"><a href="#cb18-197" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>hex2md5<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> md5<span class="op">))</span></span>
<span id="cb18-198"><a href="#cb18-198" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-199"><a href="#cb18-199" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb18-200"><a href="#cb18-200" aria-hidden="true" tabindex="-1"></a>		       <span class="st">&quot;Could not parse as md5: %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> argv<span class="op">[</span><span class="dv">1</span><span class="op">]);</span></span>
<span id="cb18-201"><a href="#cb18-201" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb18-202"><a href="#cb18-202" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-203"><a href="#cb18-203" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-204"><a href="#cb18-204" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">&gt;</span> <span class="dv">2</span> <span class="op">&amp;&amp;</span> strtol<span class="op">(</span>argv<span class="op">[</span><span class="dv">2</span><span class="op">],</span> NULL<span class="op">,</span> <span class="dv">10</span><span class="op">))</span></span>
<span id="cb18-205"><a href="#cb18-205" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">((</span>max_len <span class="op">=</span> strtol<span class="op">(</span>argv<span class="op">[</span><span class="dv">2</span><span class="op">],</span> NULL<span class="op">,</span> <span class="dv">10</span><span class="op">))</span> <span class="op">&gt;</span> LONGEST_PREIMAGE<span class="op">)</span></span>
<span id="cb18-206"><a href="#cb18-206" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb18-207"><a href="#cb18-207" aria-hidden="true" tabindex="-1"></a>			fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb18-208"><a href="#cb18-208" aria-hidden="true" tabindex="-1"></a>					<span class="st">&quot;Preimages limited to %d characters</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb18-209"><a href="#cb18-209" aria-hidden="true" tabindex="-1"></a>					LONGEST_PREIMAGE<span class="op">);</span></span>
<span id="cb18-210"><a href="#cb18-210" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb18-211"><a href="#cb18-211" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb18-212"><a href="#cb18-212" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-213"><a href="#cb18-213" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>crack<span class="op">(</span>md5<span class="op">,</span> max_len<span class="op">,</span> <span class="dv">4</span><span class="op">,</span> preimage<span class="op">))</span></span>
<span id="cb18-214"><a href="#cb18-214" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-215"><a href="#cb18-215" aria-hidden="true" tabindex="-1"></a>		puts<span class="op">(</span>preimage<span class="op">);</span></span>
<span id="cb18-216"><a href="#cb18-216" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb18-217"><a href="#cb18-217" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-218"><a href="#cb18-218" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span></span>
<span id="cb18-219"><a href="#cb18-219" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-220"><a href="#cb18-220" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb18-221"><a href="#cb18-221" aria-hidden="true" tabindex="-1"></a>				<span class="st">&quot;Could not find result in strings up to length %d</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb18-222"><a href="#cb18-222" aria-hidden="true" tabindex="-1"></a>		        max_len<span class="op">);</span></span>
<span id="cb18-223"><a href="#cb18-223" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb18-224"><a href="#cb18-224" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-225"><a href="#cb18-225" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Using cancellation is actually a little more flexible than our rwlock implementation in 5dm. If the <code>crack()</code> function is running in its own thread, the whole thing can now be cancelled. The cancellation handler will “pass along” the cancellation to each of the worker threads.</p>
<p>Writing general purpose library code that works with threads requires some care. It should handle deferred cancellation gracefully, including disabling cancellation when appropriate and always using cleanup handlers.</p>
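<p>As a minimal sketch of “disabling cancellation when appropriate,” a library routine can switch cancellation off around a region that must run to completion, then restore whatever state the caller had. The <code>struct log</code> type and <code>log_append()</code> function are hypothetical names invented for this illustration, not part of the cracking program.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* hypothetical library object: a fixed-size log guarded by a mutex */
struct log
{
	pthread_mutex_t lock;
	char buf[256];
	size_t used;
};

/* Append msg to the log. Returns 0 on success, -1 if it won't fit.
 * Cancellation is disabled while the lock is held, so the thread
 * cannot die with the mutex locked even if the caller has enabled
 * asynchronous cancellation. */
int log_append(struct log *l, const char *msg)
{
	int old_state, ret = -1;
	size_t n;

	assert(l != NULL && msg != NULL);
	n = strlen(msg);

	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old_state);
	pthread_mutex_lock(&l->lock);

	/* room for the message plus its terminating NUL? */
	if (n < sizeof l->buf - l->used)
	{
		memcpy(l->buf + l->used, msg, n + 1);
		l->used += n;
		ret = 0;
	}

	pthread_mutex_unlock(&l->lock);
	/* restore rather than unconditionally enable: the caller may
	 * have had cancellation disabled already */
	pthread_setcancelstate(old_state, &old_state);
	return ret;
}
```

<p>Restoring the saved state, instead of always re-enabling, lets the routine nest safely inside a caller that has its own reasons to keep cancellation off.</p>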
<p>For cleanup handlers, notice the pattern of how we <code>pthread_cleanup_push()</code> the cancellation handler, and later <code>pthread_cleanup_pop()</code> it for regular (non-cancel) cleanup too. Using the same cleanup procedure in all situations makes the code more reliable.</p>
<p>Also notice how the boss thread now cancels workers, rather than the winning worker cancelling the others. You can join a canceled thread, but you can’t cancel an already joined (or detached) thread. If you want to both cancel and join a thread it ought to be done in one place.</p>
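<p>Keeping the cancel and the join together might look like the sketch below. The helper names <code>stop_thread()</code>, <code>spin()</code>, and <code>demo()</code> are made up for this example. A thread that ends because of cancellation reports <code>PTHREAD_CANCELED</code> as its exit status, which the joiner can check.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* hypothetical worker: idles until it is cancelled */
static void *spin(void *arg)
{
	(void)arg;
	for (;;)
		sleep(1);   /* sleep() is a cancellation point */
}

/* Cancel and join a thread in one place. Returns true if the
 * thread ended because of the cancellation. */
bool stop_thread(pthread_t th)
{
	void *status;

	pthread_cancel(th);
	if (pthread_join(th, &status) != 0)
		return false;
	return status == PTHREAD_CANCELED;
}

/* start a worker and stop it right away */
bool demo(void)
{
	pthread_t th;
	int rc = pthread_create(&th, NULL, spin, NULL);

	assert(rc == 0);
	return stop_thread(th);
}
```

<p>Because only <code>stop_thread()</code> ever cancels or joins the worker, there is no window where one part of the program cancels a thread that another part has already joined.</p>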
<p>Let’s turn our attention to the new worker threads. They are still polling for cancellation, like they polled with the reader-writer locks, but in this case they do it with a new function:</p>
<div class="sourceCode" id="cb19"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="op">(</span>iterations<span class="op">++</span> <span class="op">%</span> <span class="dv">1000</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a>	pthread_testcancel<span class="op">();</span></span></code></pre></div>
<p>Admittedly it adds a little overhead to poll every thousandth loop, both with the rwlock, and with the testcancel. It also adds latency to the time between the cancellation request and the thread quitting, since the loop could run up to 999 times in between. A more efficient but dangerous method is to enable <strong>asynchronous cancellation</strong>, meaning the thread immediately dies when cancelled.</p>
<p>Async cancellation is dangerous because code is seldom async-cancel-safe. Anything that uses locks or works with shared state even slightly can break badly. Async-cancel-safe code can call very few functions, since those functions may not be safe. This includes calling libraries that use something as innocent as <code>malloc()</code>, since stopping malloc part way through could corrupt the heap.</p>
<p>Our <code>crack_thread()</code> function should be async-cancel-safe while it is calculating, though not while it is taking or holding a lock. The <code>MD5()</code> function from OpenSSL also appears to be safe. Here’s how we can rewrite our function (notice how we disable cancellation before taking a lock):</p>
<div class="sourceCode" id="cb20"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* rewritten to use async cancellation */</span></span>
<span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> <span class="op">*</span>crack_thread<span class="op">(</span><span class="dt">void</span> <span class="op">*</span>arg<span class="op">)</span></span>
<span id="cb20-4"><a href="#cb20-4" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb20-5"><a href="#cb20-5" aria-hidden="true" tabindex="-1"></a>	<span class="kw">struct</span> task <span class="op">*</span>t <span class="op">=</span> arg<span class="op">;</span></span>
<span id="cb20-6"><a href="#cb20-6" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> len<span class="op">,</span> changed<span class="op">;</span></span>
<span id="cb20-7"><a href="#cb20-7" aria-hidden="true" tabindex="-1"></a>	<span class="dt">unsigned</span> <span class="dt">char</span> hashed<span class="op">[</span>MD5_DIGEST_LENGTH<span class="op">];</span></span>
<span id="cb20-8"><a href="#cb20-8" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> preimage<span class="op">[</span>LONGEST_PREIMAGE<span class="op">];</span></span>
<span id="cb20-9"><a href="#cb20-9" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> cancel_type<span class="op">,</span> cancel_state<span class="op">;</span></span>
<span id="cb20-10"><a href="#cb20-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-11"><a href="#cb20-11" aria-hidden="true" tabindex="-1"></a>	strcpy<span class="op">(</span>preimage<span class="op">,</span> t<span class="op">-&gt;</span>initial_preimage<span class="op">);</span></span>
<span id="cb20-12"><a href="#cb20-12" aria-hidden="true" tabindex="-1"></a>	len <span class="op">=</span> strlen<span class="op">(</span>preimage<span class="op">);</span></span>
<span id="cb20-13"><a href="#cb20-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-14"><a href="#cb20-14" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* async so we don&#39;t have to pthread_testcancel() */</span></span>
<span id="cb20-15"><a href="#cb20-15" aria-hidden="true" tabindex="-1"></a>	pthread_setcanceltype<span class="op">(</span></span>
<span id="cb20-16"><a href="#cb20-16" aria-hidden="true" tabindex="-1"></a>			PTHREAD_CANCEL_ASYNCHRONOUS<span class="op">,</span> <span class="op">&amp;</span>cancel_type<span class="op">);</span></span>
<span id="cb20-17"><a href="#cb20-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-18"><a href="#cb20-18" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>len <span class="op">&lt;=</span> t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>max_len<span class="op">)</span></span>
<span id="cb20-19"><a href="#cb20-19" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb20-20"><a href="#cb20-20" aria-hidden="true" tabindex="-1"></a>		MD5<span class="op">((</span><span class="dt">const</span> <span class="dt">unsigned</span> <span class="dt">char</span><span class="op">*)</span>preimage<span class="op">,</span> len<span class="op">,</span> hashed<span class="op">);</span></span>
<span id="cb20-21"><a href="#cb20-21" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>memcmp<span class="op">(</span>hashed<span class="op">,</span> t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>hash<span class="op">,</span> MD5_DIGEST_LENGTH<span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb20-22"><a href="#cb20-22" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb20-23"><a href="#cb20-23" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* protect the mutex against async cancellation */</span></span>
<span id="cb20-24"><a href="#cb20-24" aria-hidden="true" tabindex="-1"></a>			pthread_setcancelstate<span class="op">(</span></span>
<span id="cb20-25"><a href="#cb20-25" aria-hidden="true" tabindex="-1"></a>					PTHREAD_CANCEL_DISABLE<span class="op">,</span> <span class="op">&amp;</span>cancel_state<span class="op">);</span></span>
<span id="cb20-26"><a href="#cb20-26" aria-hidden="true" tabindex="-1"></a>			pthread_mutex_lock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb20-27"><a href="#cb20-27" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-28"><a href="#cb20-28" aria-hidden="true" tabindex="-1"></a>			t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>success <span class="op">=</span> true<span class="op">;</span></span>
<span id="cb20-29"><a href="#cb20-29" aria-hidden="true" tabindex="-1"></a>			strcpy<span class="op">(</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>preimage<span class="op">,</span> preimage<span class="op">);</span></span>
<span id="cb20-30"><a href="#cb20-30" aria-hidden="true" tabindex="-1"></a>			t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>n_done<span class="op">++;</span></span>
<span id="cb20-31"><a href="#cb20-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-32"><a href="#cb20-32" aria-hidden="true" tabindex="-1"></a>			pthread_cond_signal<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>returning<span class="op">);</span></span>
<span id="cb20-33"><a href="#cb20-33" aria-hidden="true" tabindex="-1"></a>			pthread_mutex_unlock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb20-34"><a href="#cb20-34" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb20-35"><a href="#cb20-35" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb20-36"><a href="#cb20-36" aria-hidden="true" tabindex="-1"></a>		changed <span class="op">=</span> word_advance<span class="op">(</span>preimage<span class="op">,</span> t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>n_workers<span class="op">);</span></span>
<span id="cb20-37"><a href="#cb20-37" aria-hidden="true" tabindex="-1"></a>		len <span class="op">=</span> MAX<span class="op">(</span>len<span class="op">,</span> changed<span class="op">);</span></span>
<span id="cb20-38"><a href="#cb20-38" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb20-39"><a href="#cb20-39" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-40"><a href="#cb20-40" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* restore original cancellation type */</span></span>
<span id="cb20-41"><a href="#cb20-41" aria-hidden="true" tabindex="-1"></a>	pthread_setcanceltype<span class="op">(</span>cancel_type<span class="op">,</span> <span class="op">&amp;</span>cancel_type<span class="op">);</span></span>
<span id="cb20-42"><a href="#cb20-42" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-43"><a href="#cb20-43" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_lock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb20-44"><a href="#cb20-44" aria-hidden="true" tabindex="-1"></a>	t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>n_done<span class="op">++;</span></span>
<span id="cb20-45"><a href="#cb20-45" aria-hidden="true" tabindex="-1"></a>	pthread_cond_signal<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>returning<span class="op">);</span></span>
<span id="cb20-46"><a href="#cb20-46" aria-hidden="true" tabindex="-1"></a>	pthread_mutex_unlock<span class="op">(&amp;</span>t<span class="op">-&gt;</span>goal<span class="op">-&gt;</span>lock<span class="op">);</span></span>
<span id="cb20-47"><a href="#cb20-47" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb20-48"><a href="#cb20-48" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Asynchronous cancellation does not appear to work on macOS, but as we’ve seen that’s par for the course on that operating system.</p>
<h3 id="development-tools">Development tools</h3>
<h4 id="valgrind-drd-and-helgrind">Valgrind DRD and Helgrind</h4>
<p><a href="https://valgrind.org/docs/manual/drd-manual.html">DRD</a> and <a href="https://valgrind.org/docs/manual/hg-manual.html">Helgrind</a> are Valgrind tools for detecting errors in multithreaded C and C++ programs. The tools work for any program that uses the POSIX threading primitives or that uses threading concepts built on top of the POSIX threading primitives.</p>
<p>The tools have overlapping abilities like detecting data races and improper use of the pthreads API. Additionally, Helgrind can detect locking hierarchy violations, and DRD can alert when there is lock contention.</p>
<p>Both tools pinpoint the lines of code where problems arise. For example, we can run DRD on our first crazy bankers program:</p>
<div class="sourceCode" id="cb21"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="fu">valgrind</span> <span class="at">--tool</span><span class="op">=</span>drd ./banker</span></code></pre></div>
<p>Here is a characteristic example of an error it emits:</p>
<pre><code>==8524== Thread 3:
==8524== Conflicting load by thread 3 at 0x003090b0 size 8
==8524==    at 0x1088BD: disburse (banker.c:48)
==8524==    by 0x4C324F3: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==8524==    by 0x4E514A3: start_thread (pthread_create.c:456)
==8524== Allocation context: BSS section of /home/admin/banker
==8524== Other segment start (thread 2)
==8524==    at 0x514FD01: clone (clone.S:80)
==8524== Other segment end (thread 2)
==8524==    at 0x509D820: rand (rand.c:26)
==8524==    by 0x108857: rand_range (banker.c:26)
==8524==    by 0x1088A0: disburse (banker.c:42)
==8524==    by 0x4C324F3: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==8524==    by 0x4E514A3: start_thread (pthread_create.c:456)</code></pre>
<p>It finds conflicting loads and stores from lines 48, 51, and 52.</p>
<div class="sourceCode" id="cb23"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="dv">48</span><span class="op">:</span> <span class="cf">if</span> <span class="op">(</span>accts<span class="op">[</span>from<span class="op">].</span>balance <span class="op">&gt;</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a><span class="dv">49</span><span class="op">:</span> <span class="op">{</span></span>
<span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a><span class="dv">50</span><span class="op">:</span>		payment <span class="op">=</span> <span class="dv">1</span> <span class="op">+</span> rand_range<span class="op">(</span>accts<span class="op">[</span>from<span class="op">].</span>balance<span class="op">);</span></span>
<span id="cb23-4"><a href="#cb23-4" aria-hidden="true" tabindex="-1"></a><span class="dv">51</span><span class="op">:</span>		accts<span class="op">[</span>from<span class="op">].</span>balance <span class="op">-=</span> payment<span class="op">;</span></span>
<span id="cb23-5"><a href="#cb23-5" aria-hidden="true" tabindex="-1"></a><span class="dv">52</span><span class="op">:</span>		accts<span class="op">[</span>to<span class="op">].</span>balance   <span class="op">+=</span> payment<span class="op">;</span></span>
<span id="cb23-6"><a href="#cb23-6" aria-hidden="true" tabindex="-1"></a><span class="dv">53</span><span class="op">:</span> <span class="op">}</span></span></code></pre></div>
<p>Helgrind can identify the lock hierarchy violation in our example of deadlocking bankers:</p>
<div class="sourceCode" id="cb24"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true" tabindex="-1"></a><span class="fu">valgrind</span> <span class="at">--tool</span><span class="op">=</span>helgrind ./banker_lock</span></code></pre></div>
<pre><code>==8989== Thread #4: lock order &quot;0x3091F8 before 0x3090D8&quot; violated
==8989==
==8989== Observed (incorrect) order is: acquisition of lock at 0x3090D8
==8989==    at 0x4C3010C: mutex_lock_WRK (hg_intercepts.c:904)
==8989==    by 0x1089B9: disburse (banker_lock.c:38)
==8989==    by 0x4C32D06: mythread_wrapper (hg_intercepts.c:389)
==8989==    by 0x4E454A3: start_thread (pthread_create.c:456)
==8989==
==8989==  followed by a later acquisition of lock at 0x3091F8
==8989==    at 0x4C3010C: mutex_lock_WRK (hg_intercepts.c:904)
==8989==    by 0x1089D1: disburse (banker_lock.c:39)
==8989==    by 0x4C32D06: mythread_wrapper (hg_intercepts.c:389)
==8989==    by 0x4E454A3: start_thread (pthread_create.c:456)</code></pre>
<p>To identify when there is too much contention for a lock, we can ask DRD to alert us when a thread holds a mutex for more than <em>n</em> milliseconds:</p>
<div class="sourceCode" id="cb26"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb26-1"><a href="#cb26-1" aria-hidden="true" tabindex="-1"></a><span class="fu">valgrind</span> <span class="at">--tool</span><span class="op">=</span>drd <span class="at">--exclusive-threshold</span><span class="op">=</span>2 ./banker_lock_hierarchy</span></code></pre></div>
<p>Since we throw too many threads at a small number of accounts, we see locks held past the threshold, like this one that was held for seven ms:</p>
<pre><code>==7565== Acquired at:
==7565==    at 0x483F428: pthread_mutex_lock_intercept (drd_pthread_intercepts.c:888)
==7565==    by 0x483F428: pthread_mutex_lock (drd_pthread_intercepts.c:898)
==7565==    by 0x109280: disburse (banker_lock_hierarchy.c:40)
==7565==    by 0x483C114: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==7565==    by 0x4863FA2: start_thread (pthread_create.c:486)
==7565==    by 0x49764CE: clone (clone.S:95)
==7565== Lock on mutex 0x10c258 was held during 7 ms (threshold: 2 ms).
==7565==    at 0x4840478: pthread_mutex_unlock_intercept (drd_pthread_intercepts.c:978)
==7565==    by 0x4840478: pthread_mutex_unlock (drd_pthread_intercepts.c:991)
==7565==    by 0x109395: disburse (banker_lock_hierarchy.c:47)
==7565==    by 0x483C114: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==7565==    by 0x4863FA2: start_thread (pthread_create.c:486)
==7565==    by 0x49764CE: clone (clone.S:95)
==7565== mutex 0x10c258 was first observed at:
==7565==    at 0x483F368: pthread_mutex_lock_intercept (drd_pthread_intercepts.c:885)
==7565==    by 0x483F368: pthread_mutex_lock (drd_pthread_intercepts.c:898)
==7565==    by 0x109280: disburse (banker_lock_hierarchy.c:40)
==7565==    by 0x483C114: vgDrd_thread_wrapper (drd_pthread_intercepts.c:444)
==7565==    by 0x4863FA2: start_thread (pthread_create.c:486)
==7565==    by 0x49764CE: clone (clone.S:95)</code></pre>
<h4 id="clang-threadsanitizer-tsan">Clang ThreadSanitizer (TSan)</h4>
<p>ThreadSanitizer is a clang instrumentation module. To use it, choose <code>CC = clang</code> and add <code>-fsanitize=thread</code> to CFLAGS (and to LDFLAGS if you link in a separate step). Then when you build programs, they will be instrumented to detect data races and report them on stderr.</p>
<p>Here’s a portion of the output when running the bankers program:</p>
<pre><code>WARNING: ThreadSanitizer: data race (pid=11312)
  Read of size 8 at 0x0000014aeeb0 by thread T2:
    #0 disburse /home/admin/banker.c:48 (banker+0x0000004a4372)

  Previous write of size 8 at 0x0000014aeeb0 by thread T1:
    #0 disburse /home/admin/banker.c:52 (banker+0x0000004a43ba)</code></pre>
<p>TSan can also detect lock hierarchy violations, such as in banker_lock:</p>
<pre><code>WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=10095)
  Cycle in lock order graph: M1 (0x0000014aef78) =&gt; M2 (0x0000014aeeb8) =&gt; M1

  Mutex M2 acquired here while holding mutex M1 in thread T1:
    #0 pthread_mutex_lock &lt;null&gt; (banker_lock+0x000000439a10)
    #1 disburse /home/admin/banker_lock.c:39 (banker_lock+0x0000004a4398)

    Hint: use TSAN_OPTIONS=second_deadlock_stack=1 to get more informative warning message

  Mutex M1 acquired here while holding mutex M2 in thread T9:
    #0 pthread_mutex_lock &lt;null&gt; (banker_lock+0x000000439a10)
    #1 disburse /home/admin/banker_lock.c:39 (banker_lock+0x0000004a4398)</code></pre>
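<p>To try TSan on something smaller than the bankers programs, here is a sketch of a shared counter that can be run with or without its mutex. The names <code>count_with_threads()</code> and <code>bump()</code> are invented for this example. Built with <code>-fsanitize=thread</code>, a call to <code>count_with_threads(false)</code> should draw a data race warning on the increment, while <code>count_with_threads(true)</code> should run without one.</p>

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define N_INCR 100000

static long counter;
static pthread_mutex_t counter_mtx = PTHREAD_MUTEX_INITIALIZER;
static bool use_lock;   /* set before the threads start */

static void *bump(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < N_INCR; i++)
	{
		if (use_lock)
			pthread_mutex_lock(&counter_mtx);
		counter++;   /* data race when use_lock is false */
		if (use_lock)
			pthread_mutex_unlock(&counter_mtx);
	}
	return NULL;
}

/* Run two threads that increment the shared counter, and return
 * the final total. Only the locked variant is well defined. */
long count_with_threads(bool locked)
{
	pthread_t a, b;
	int rc;

	counter = 0;
	use_lock = locked;

	rc = pthread_create(&a, NULL, bump, NULL);
	assert(rc == 0);
	rc = pthread_create(&b, NULL, bump, NULL);
	assert(rc == 0);

	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return counter;
}
```

<p>With the lock the total is always <code>2 * N_INCR</code>. Without it the total may come up short, and the point of TSan is that it flags the race even on runs where the answer happens to be right.</p>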
<h4 id="mutrace">Mutrace</h4>
<p>While Valgrind DRD can identify highly contended locks, it virtualizes the execution of the program under test, and skews the numbers. Other utilities can use software probes to get this information from a test running at full speed. In BSD land there is the <a href="http://dtrace.org/guide/chp-plockstat.html">plockstat</a> provider for DTrace, and on Linux there is the specially-written <a href="http://0pointer.de/blog/projects/mutrace.html">mutrace</a>. I had a lot of trouble trying to get plockstat to work on FreeBSD, so here’s an example of using mutrace to analyze our banker program.</p>
<div class="sourceCode" id="cb30"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb30-1"><a href="#cb30-1" aria-hidden="true" tabindex="-1"></a><span class="ex">mutrace</span> ./banker_lock_hierarchy</span></code></pre></div>
<pre><code>mutrace: Showing 10 most contended mutexes:

 Mutex #   Locked  Changed    Cont. tot.Time[ms] avg.Time[ms] max.Time[ms]  Flags
       0   200211   153664    95985      991.349        0.005        0.267 M-.--.
       1   200552   142173    61902      641.963        0.003        0.170 M-.--.
       2   199657   140837    47723      476.737        0.002        0.125 M-.--.
       3   199566   140863    39268      371.451        0.002        0.108 M-.--.
       4   199936   141381    33243      295.909        0.001        0.090 M-.--.
       5   199548   141297    28193      232.647        0.001        0.084 M-.--.
       6   200329   142027    24230      183.301        0.001        0.066 M-.--.
       7   199951   142338    21018      142.494        0.001        0.057 M-.--.
       8   200145   142990    18201      107.692        0.001        0.052 M-.--.
       9   200105   143794    15713       76.231        0.000        0.028 M-.--.
                                                                           ||||||
                                                                           /|||||
          Object:                                     M = Mutex, W = RWLock /||||
           State:                                 x = dead, ! = inconsistent /|||
             Use:                                 R = used in realtime thread /||
      Mutex Type:                 r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
  Mutex Protocol:                                      i = INHERIT, p = PROTECT /
     RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC

mutrace: Note that the flags column R is only valid in --track-rt mode!

mutrace: Total runtime is 1896.903 ms.

mutrace: Results for SMP with 4 processors.</code></pre>
<h4 id="off-cpu-profiling">Off-CPU profiling</h4>
<p>Typical profilers measure the amount of CPU time spent in each function. However, when a thread is blocked by I/O, a lock, or a condition variable, it isn’t using CPU time. To determine where functions spend the most “wall clock time,” we need to sample the call stack for all threads at intervals, and count how frequently we see each entry. When a thread is off-CPU its call stack stays unchanged.</p>
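<p>The gap between CPU time and wall clock time is easy to see in a small sketch. The helper names <code>seconds()</code> and <code>time_blocked()</code> are hypothetical, and <code>nanosleep()</code> stands in for a wait on a lock or on I/O. The thread’s CPU clock barely advances while it is blocked, even though real time passes.</p>

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* read a POSIX clock as fractional seconds */
static double seconds(clockid_t id)
{
	struct timespec ts;
	int rc = clock_gettime(id, &ts);

	assert(rc == 0);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

struct elapsed
{
	double wall;   /* real time that passed */
	double cpu;    /* CPU time charged to this thread */
};

/* Block for ms milliseconds and report both kinds of time */
struct elapsed time_blocked(long ms)
{
	struct timespec delay = { ms / 1000, (ms % 1000) * 1000000L };
	struct elapsed e;
	double wall0 = seconds(CLOCK_MONOTONIC);
	double cpu0  = seconds(CLOCK_THREAD_CPUTIME_ID);

	nanosleep(&delay, NULL);   /* stands in for a lock or I/O wait */

	e.wall = seconds(CLOCK_MONOTONIC) - wall0;
	e.cpu  = seconds(CLOCK_THREAD_CPUTIME_ID) - cpu0;
	return e;
}
```

<p>A CPU profiler sees only the <code>cpu</code> figure, so a function that spends nearly all of its time blocked looks cheap. Sampling stacks at wall clock intervals catches the blocked time as well.</p>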
<p>The <code>pstack</code> program is traditionally the way to get a snapshot of a running program’s stack. It exists on old Unices, and used to be on Linux until Linux made a breaking change. The most portable way to get stack snapshots is using gdb with an awk wrapper, as documented in the <a href="http://poormansprofiler.org">Poor Man’s Profiler</a>.</p>
<p>Remember our early condition variable example that measured how many threads entered the critical section in <code>disburse()</code> at once? We asked whether synchronization on <code>stats_mtx</code> threw off the measurement. With off-CPU profiling we can look for clues.</p>
<p>Here’s a script based on the Poor Man’s Profiler:</p>
<div class="sourceCode" id="cb32"><pre class="sourceCode sh"><code class="sourceCode bash"><span id="cb32-1"><a href="#cb32-1" aria-hidden="true" tabindex="-1"></a><span class="ex">./banker_stats</span> <span class="kw">&amp;</span></span>
<span id="cb32-2"><a href="#cb32-2" aria-hidden="true" tabindex="-1"></a><span class="va">pid</span><span class="op">=</span><span class="va">$!</span></span>
<span id="cb32-3"><a href="#cb32-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb32-4"><a href="#cb32-4" aria-hidden="true" tabindex="-1"></a><span class="cf">while</span> <span class="bu">kill</span> <span class="at">-0</span> <span class="va">$pid</span></span>
<span id="cb32-5"><a href="#cb32-5" aria-hidden="true" tabindex="-1"></a>  <span class="cf">do</span></span>
<span id="cb32-6"><a href="#cb32-6" aria-hidden="true" tabindex="-1"></a>    <span class="fu">gdb</span> <span class="at">-ex</span> <span class="st">&quot;set pagination 0&quot;</span> <span class="at">-ex</span> <span class="st">&quot;thread apply all bt&quot;</span> <span class="at">-batch</span> <span class="at">-p</span> <span class="va">$pid</span></span>
<span id="cb32-7"><a href="#cb32-7" aria-hidden="true" tabindex="-1"></a>  <span class="cf">done</span> <span class="kw">|</span> <span class="dt">\</span></span>
<span id="cb32-8"><a href="#cb32-8" aria-hidden="true" tabindex="-1"></a><span class="fu">awk</span> <span class="st">&#39;</span></span>
<span id="cb32-9"><a href="#cb32-9" aria-hidden="true" tabindex="-1"></a><span class="st">  BEGIN { s = &quot;&quot;; }</span></span>
<span id="cb32-10"><a href="#cb32-10" aria-hidden="true" tabindex="-1"></a><span class="st">  /^Thread/ { print s; s = &quot;&quot;; }</span></span>
<span id="cb32-11"><a href="#cb32-11" aria-hidden="true" tabindex="-1"></a><span class="st">  /^\#/ { if (s != &quot;&quot; ) { s = s &quot;,&quot; $4} else { s = $4 } }</span></span>
<span id="cb32-12"><a href="#cb32-12" aria-hidden="true" tabindex="-1"></a><span class="st">  END { print s }&#39;</span> <span class="kw">|</span> <span class="dt">\</span></span>
<span id="cb32-13"><a href="#cb32-13" aria-hidden="true" tabindex="-1"></a><span class="fu">sort</span> <span class="kw">|</span> <span class="fu">uniq</span> <span class="at">-c</span> <span class="kw">|</span> <span class="fu">sort</span> <span class="at">-r</span> <span class="at">-n</span> <span class="at">-k</span> 1,1</span></code></pre></div>
<p>It outputs limited information, but we can see that waiting for locks in <code>disburse()</code> takes the majority of program time, being present in 872 of our samples. By contrast, waiting for the <code>stats_mtx</code> lock barely registers: it shows up in just one sample, under <code>stats_change</code>. It must have had very little effect on our parallelism.</p>
<pre><code>    872 at,__GI___pthread_mutex_lock,disburse,start_thread,clone
     11 at,__random,rand,rand_range,disburse,start_thread,clone
      9 expected=0,,mutex=0x562533c3f0c0,&lt;stats_cnd&gt;,,stats_print,start_thread,clone
      9 __GI___pthread_timedjoin_ex,main
      5 at,__pthread_mutex_unlock_usercnt,disburse,start_thread,clone
      1 at,__pthread_mutex_unlock_usercnt,stats_change,disburse,start_thread,clone
      1 at,__GI___pthread_mutex_lock,stats_change,disburse,start_thread,clone
      1 __random,rand,rand_range,disburse,start_thread,clone</code></pre>
<h4 id="macos-instruments">macOS Instruments</h4>
<p>Although Mac’s POSIX thread support is pretty weak, its Xcode tooling does include a nice profiler. From the Instruments application, choose the profiling template called “System Trace.” It adds a GUI on top of DTrace to display thread states (among other things). I modified our banker program to use only five threads and recorded its run. The Instruments app visualizes every event that happens, including threads blocking and being interrupted:</p>
<figure>
<img src="/images/thread-states.png" alt="thread states" /><figcaption aria-hidden="true">thread states</figcaption>
</figure>
<p>Within the program you can zoom into the history and hover over events for info.</p>
<h4 id="perf-c2c">perf c2c</h4>
<p>Perf is a Linux tool to measure hardware performance counters during the execution of a program. Joe Mario created a Perf feature called <a href="https://joemario.github.io/blog/2016/09/01/c2c-blog/">c2c</a> which detects <strong>false sharing</strong> of variables between CPUs.</p>
<p>In a NUMA multi-core computer, each CPU has its own set of caches, and all CPUs share main memory. Memory is divided into fixed size blocks (often 64 bytes) called <strong>cache lines</strong>. Any time a CPU reads or writes memory, it must fetch or store the entire cache line surrounding the desired address. If one CPU has already cached a line, and another CPU writes to that area in memory, the system has to perform an expensive operation to make the caches coherent.</p>
<p>When two unrelated variables in a program are stored close enough together in memory to be in the same cache line, it can cause a performance problem in multi-threaded programs. If threads running on separate CPUs access the unrelated variables, the CPUs end up in a tug of war over the shared cache line, which is called false sharing.</p>
<p>For instance, our Game of Life simulator could potentially have false sharing at the edges of each section of board accessed by each thread. To verify this, I attempted to run perf c2c on an Amazon EC2 instance (since I lack a physical computer running Linux), but got an error that memory events are not supported on the virtual machine. I was running kernel 4.19.0 on Intel Xeon Platinum 8124M CPUs, so I assume this was a security restriction from Amazon.</p>
<p>If you are able to run c2c, and detect false sharing in a multi-threaded program, the solution is to align the variables more aggressively. POSIX provides the <a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_memalign.html">posix_memalign()</a> function to allocate bytes aligned on a desired boundary. In our Life example, we could have used an array of pointers to dynamically allocated rows rather than a contiguous two-dimensional array.</p>
<h4 id="intel-vtune-profiler">Intel VTune Profiler</h4>
<p>The VTune Profiler is available for free (with registration) on Linux, macOS, and Windows. It works only on x86 hardware, of course. I haven’t used it, but their <a href="https://software.intel.com/en-us/vtune/features/multithreaded">marketing page</a> shows some nice pictures. The tool can visually identify the granularity of locks, present a prioritized list of synchronization objects that hurt performance, and visualize lock contention.</p>
<h3 id="further-reading">Further reading</h3>
<ul>
<li><a href="https://www.goodreads.com/book/show/987956.Programming_with_Posix_Threads">Programming with Posix Threads</a> by David R. Butenhof</li>
<li><a href="https://www.goodreads.com/book/show/828272.Pthreads_Programming">Pthreads Programming</a> by Bradford Nichols, Dick Buttlar, Jacqueline Farrell</li>
<li><a href="https://www.goodreads.com/book/show/15710583-is-parallel-programming-hard-and-if-so-what-can-you-do-about-it">Is Parallel Programming Hard, And, If So, What Can You Do About It?</a> by Paul McKenney</li>
</ul>]]></summary>
</entry>
<entry>
    <title>History and effective use of Vim</title>
    <link href="https://begriffs.com/posts/2019-07-19-history-use-vim.html" />
    <id>https://begriffs.com/posts/2019-07-19-history-use-vim.html</id>
    <published>2019-07-19T00:00:00Z</published>
    <updated>2019-07-19T00:00:00Z</updated>
    <summary type="html"><![CDATA[<p>This article is based on historical research and on simply reading the Vim user manual cover to cover. Hopefully these notes will help you (re?)discover core functionality of the editor, so you can abandon pre-packaged vimrc files and use plugins more thoughtfully.</p>
<figure>
<img src="/images/vim-books.png" alt="physical books" /><figcaption aria-hidden="true">physical books</figcaption>
</figure>
<p>To go beyond the topics in this blog post, I’d recommend getting a paper copy of the manual and a good pocket reference. I couldn’t find any hard copy of the official Vim manual, and ended up printing <a href="/pdf/vim-user-manual.pdf">this PDF</a> using <a href="https://www.printme1.com">printme1.com</a>. The PDF is a printer-friendly version of the files <code>$VIMRUNTIME/doc/usr_??.txt</code> distributed with the editor. For a convenient list of commands, I’d recommend the <a href="https://www.goodreads.com/book/show/9787030-vi-and-vim-editors-pocket-reference">vi and Vim Editors Pocket Reference</a>.</p>
<h3 id="table-of-contents">Table of Contents</h3>
<ul>
<li><a href="#history">History</a></li>
<li><a href="#configuration-hierarchy">Configuration hierarchy</a></li>
<li><a href="#third-party-plugins">Third-party plugins</a></li>
<li><a href="#backups-and-undo">Backups and undo</a></li>
<li><a href="#include-and-path">Include and path</a></li>
<li><a href="#edit-compile-cycle">Edit ⇄ compile cycle</a></li>
<li><a href="#diffs-and-patches">Diffs and patches</a></li>
<li><a href="#buffer-io">Buffer I/O</a></li>
<li><a href="#filetypes">Filetypes</a></li>
<li><a href="#dont-forget-the-mouse">Don’t forget the mouse</a></li>
<li><a href="#misc-editing">Misc editing</a></li>
</ul>
<h3 id="history">History</h3>
<h4 id="birth-of-vi">Birth of vi</h4>
<p>Vi commands and features go back more than fifty years, starting with the QED editor. Here is the lineage:</p>
<ul>
<li>1966 : QED (“Quick EDitor”) in Berkeley Timesharing System</li>
<li>1969 Jul: moon landing (just for reference)</li>
<li>1969 Aug: QED -&gt; ed at AT&amp;T</li>
<li>1976 Feb: ed -&gt; em (“Editor for Mortals”) at Queen Mary College</li>
<li>1976 : em -&gt; ex (“EXtended”) at UC Berkeley</li>
<li>1977 Oct: ex gets visual mode, vi</li>
</ul>
<p><img src="/images/tty33asr.jpg" class="right" alt="hard copy terminal" /></p>
<p>You can trace the similarities all the way from QED to ex by reading the <a href="/pdf/qed-editor.pdf">QED manual</a> and <a href="/pdf/ex-manual.pdf">ex manual</a>. Both editors use a similar grammar to specify and operate on line ranges.</p>
<p>Editors like QED, ed, and em were designed for hard-copy terminals, which are basically electric typewriters with a modem attached. Hard-copy terminals print system output on paper. Output could not be changed once printed, obviously, so the editing process consisted of user commands to update and manually print ranges of text.</p>
<p><img src="/images/adm3a.jpg" class="left" alt="video terminal" /></p>
<p>By 1976 video terminals such as the ADM-3A started to be available. The Ex editor added an “open mode” which allowed intraline editing on video terminals, and a visual mode for screen oriented editing on cursor-addressable terminals. The visual mode (activated with the command “vi”) kept an up-to-date view of part of the file on screen, while preserving an ex command line at the bottom of the screen. (Fun fact: the h,j,k,l keys on the ADM-3A had arrows drawn on them, so that choice of motion keys in vi was simply to match the keyboard.)</p>
<p>Learn more about the journey from ed to ex/vi in this <a href="/pdf/unix-review-bill-joy.pdf">interview</a> with Bill Joy. He talks about how he made ex/vi, and some things that disappointed him about it.</p>
<p>Classic vi is truly just an alter-ego of ex – they are the same binary, which decides to start in ex mode or vi mode based on the name of the executable invoked. The legacy of all this history is that ex/vi is refined by use, requires scant system resources, and can operate under limited bandwidth communication. It is also available on most systems and <a href="http://pubs.opengroup.org/onlinepubs/9699919799/utilities/vi.html">fully specified</a> in POSIX.</p>
<h4 id="from-vi-to-vim">From vi to vim</h4>
<p>Being a derivative of ed, the ex/vi editor was intellectual property of AT&amp;T. To use vi on platforms other than Unix, people had to write clones that did not share in the original codebase.</p>
<p>Some of the clones:</p>
<ul>
<li>nvi - 1994 for 4.4BSD</li>
<li>calvin - 1987 for DOS</li>
<li>vile - 1990 for DOS</li>
<li>stevie - 1987 for Atari ST</li>
<li>elvis - 1990 for Minix and 386BSD</li>
<li>vim - 1991 for Amiga</li>
<li>viper - 1995 for Emacs</li>
<li>elwin - 1995 for Windows</li>
<li>lemmy - 2002 for Windows</li>
</ul>
<p>We’ll be focusing on that little one in the middle: vim. Bram Moolenaar wanted to use vi on the Amiga. He began porting Stevie from the Atari and evolving it. He called his port “Vi IMitation.” For a full first-hand account, see Bram’s <a href="/pdf/vim-interview.pdf">interview</a> with Free Software Magazine.</p>
<p>By version 1.22 Vim was rechristened “Vi IMproved,” matching and surpassing features of the original. Here is the timeline of the next major versions, with some of their big features:</p>
<table class="table">
<tbody>
<tr>
<td>
1991 Nov 2
</td>
<td>
Vim 1.14: First release (on Fred Fish disk #591).
</td>
</tr>
<tr>
<td>
1992
</td>
<td>
Vim 1.22: Port to Unix. Vim now competes with Vi.
</td>
</tr>
<tr>
<td>
1994 Aug 12
</td>
<td>
Vim 3.0: Support for multiple buffers and windows.
</td>
</tr>
<tr>
<td>
1996 May 29
</td>
<td>
Vim 4.0: Graphical User Interface (largely by Robert Webb).
</td>
</tr>
<tr>
<td>
1998 Feb 19
</td>
<td>
Vim 5.0: Syntax coloring/highlighting.
</td>
</tr>
<tr>
<td>
2001 Sep 26
</td>
<td>
Vim 6.0: Folding, plugins, vertical split.
</td>
</tr>
<tr>
<td>
2006 May 8
</td>
<td>
Vim 7.0: Spell check, omni completion, undo branches, tabs.
</td>
</tr>
<tr>
<td>
2016 Sep 12
</td>
<td>
Vim 8.0: Jobs, async I/O, native packages.
</td>
</tr>
</tbody>
</table>
<p>For more info about each version, see e.g. <code>:help vim8</code>. To see plans for the future, including known bugs, see <code>:help todo.txt</code>.</p>
<p>Version 8 included some async job support due to peer pressure from NeoVim, whose developers <a href="https://groups.google.com/forum/#!searchin/vim_dev/neovim/vim_dev/x0BF9Y0Uby8/Xse9Bvyza0AJ">wanted</a> to run debuggers and REPLs for their web scripting languages inside the editor.</p>
<p>Vim is super portable. By adapting over time to work on a wide variety of platforms, the editor was forced to keep portable coding habits. It runs on OS/390, Amiga, BeOS and BeBox, Macintosh classic, Atari MiNT, MS-DOS, OS/2, QNX, RISC-OS, BSD, Linux, OS X, VMS, and MS-Windows. You can rely on Vim being there no matter what computer you’re using.</p>
<p>In a final twist in the vi saga, the original ex/vi source code was finally released in 2002 under a BSD free software license. It is available at <a href="http://ex-vi.sourceforge.net">ex-vi.sourceforge.net</a>.</p>
<p>Let’s get down to business. Before getting to odds, ends, and intermediate tricks, it helps to understand how Vim organizes and reads its configuration files.</p>
<h3 id="configuration-hierarchy">Configuration hierarchy</h3>
<p>I used to think, incorrectly, that Vim reads all its settings and scripts from the ~/.vimrc file alone. Browsing random “dotfiles” repositories can reinforce this notion. Quite often people publish monstrous single .vimrc files that try to control every aspect of the editor. These big configs are sometimes called “vim distros.”</p>
<p>In reality Vim has a tidy structure, where .vimrc is just one of several inputs. In fact you can ask Vim exactly which scripts it has loaded. Try this: edit a source file from a random programming project on your computer. Once loaded, run</p>
<pre class="viml"><code>:scriptnames</code></pre>
<p>Take time to read the list. Try to guess what the scripts might do, and note the directories where they live.</p>
<p>Was the list longer than you expected? If you have installed loads of plugins the editor has a lot to do. To check what slows down the editor most at startup, run the following and look at the <code>start.log</code> it creates:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="ex">vim</span> <span class="at">--startuptime</span> start.log name-of-your-file</span></code></pre></div>
<p>Just for comparison, see how quickly Vim starts without your existing configuration:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="ex">vim</span> <span class="at">--clean</span> <span class="at">--startuptime</span> clean.log name-of-your-file</span></code></pre></div>
<p>To determine which scripts to run at startup or buffer load time, Vim traverses a “runtime path.” The path is a comma-separated list of directories that each contain a common structure. Vim inspects the structure of each directory to find scripts to run. Directories are processed in the order they appear in the list.</p>
<p>Check the runtimepath on your system by running:</p>
<pre class="viml"><code>:set runtimepath</code></pre>
<p>My system contains the following directories in the default value for <code>runtimepath</code>. Not all of them even exist in the filesystem, but they would be consulted if they did.</p>
<dl>
<dt>
~/.vim
</dt>
<dd>
The home directory, for personal preferences.
</dd>
<dt>
/usr/local/share/vim/vimfiles
</dt>
<dd>
A system-wide Vim directory, for preferences from the system administrator.
</dd>
<dt>
/usr/local/share/vim/vim81
</dt>
<dd>
Aka $VIMRUNTIME, for files distributed with Vim.
</dd>
<dt>
/usr/local/share/vim/vimfiles/after
</dt>
<dd>
The “after” directory in the system-wide Vim directory. This is for the system administrator to overrule or add to the distributed defaults.
</dd>
<dt>
~/.vim/after
</dt>
<dd>
The “after” directory in the home directory. This is for personal preferences to overrule or add to the distributed defaults or system-wide settings.
</dd>
</dl>
<p>Because directories are processed by their order in line, the only thing that is special about the “after” directories is that they are at the end of the list. There is nothing magical about the word “after.”</p>
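<p>To see that ordering at a glance, a one-liner like the following prints each runtimepath entry on its own line, in the order Vim consults them. It is only a sketch: it splits on every comma, so a directory name containing an escaped comma would be shown in two pieces.</p>
<pre class="viml"><code>&quot; print runtimepath entries one per line, in search order
:echo join(split(&amp;runtimepath, &#39;,&#39;), &quot;\n&quot;)</code></pre>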
<p>When processing each directory, Vim looks for subfolders with specific names. To learn more about them, see <code>:help runtimepath</code>. Here is a selection of those we will be covering, with brief descriptions.</p>
<dl>
<dt>
plugin/
</dt>
<dd>
Vim script files that are loaded automatically when editing any kind of file. Called “global plugins.”
</dd>
<dt>
autoload/
</dt>
<dd>
(Not to be confused with “plugin.”) Scripts in autoload contain functions that are loaded only when requested by other scripts.
</dd>
<dt>
ftdetect/
</dt>
<dd>
Scripts to detect filetypes. They can base their decision on filename extension, location, or internal file contents.
</dd>
<dt>
ftplugin/
</dt>
<dd>
Scripts that are executed when editing files with known type.
</dd>
<dt>
compiler/
</dt>
<dd>
Definitions of how to run various compilers or linters, and of how to parse their output. Can be shared between multiple ftplugins. These are not applied automatically either; they must be invoked with <code>:compiler</code>.
</dd>
<dt>
pack/
</dt>
<dd>
Container for Vim 8 native packages, the successor to “Pathogen” style package management. The native packaging system does not require any third-party code.
</dd>
</dl>
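<p>The difference between plugin/ and autoload/ is easier to see with a tiny example. This is a sketch with made-up names (the <code>greet.vim</code> files, the <code>greet#Hello</code> function, and the <code>:Greet</code> command are all hypothetical). The plugin script is sourced at startup but stays small, while the autoload script is read only when its function is first called.</p>
<pre class="viml"><code>&quot; ~/.vim/plugin/greet.vim (hypothetical file)
&quot; sourced at startup; defines a command but defers the real work
command! Greet call greet#Hello()

&quot; ~/.vim/autoload/greet.vim (hypothetical file)
&quot; sourced only when a greet#... function is first called
function! greet#Hello() abort
  echo &#39;hello from autoload&#39;
endfunction</code></pre>
<p>The part of the function name before the <code>#</code> must match the autoload file name, which is how Vim knows where to look.</p>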
<p>Finally, <code>~/.vimrc</code> is the catchall for general editor settings. Use it for setting defaults that can be overridden for particular file types. For a comprehensive overview of settings you can choose in .vimrc, run <code>:options</code>.</p>
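<p>As a rough illustration of a default being overridden for one file type, here is a sketch that assumes filetype plugins are enabled and uses C as an arbitrary example. The particular option values are placeholders, not recommendations.</p>
<pre class="viml"><code>&quot; ~/.vimrc: general defaults for every buffer
filetype plugin on
set shiftwidth=4 expandtab

&quot; ~/.vim/after/ftplugin/c.vim (example override)
&quot; setlocal changes the options only for C buffers
setlocal shiftwidth=8 noexpandtab</code></pre>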
<h3 id="third-party-plugins">Third-party plugins</h3>
<p>Plugins are simply Vim scripts that must be put into the correct places in the runtimepath in order to execute. Installing them is conceptually easy: download the file(s) into place. The challenge is that it’s hard to remove or update some plugins because they litter subdirectories in the runtimepath with their scripts, and it can be hard to tell which plugin is responsible for which files.</p>
<p>“Plugin managers” evolved to address this need. Vim.org has had a <a href="https://www.vim.org/scripts/script_search_results.php">plugin registry</a> going back at least as far as 2003 (as identified by the Internet Archive). However it wasn’t until about 2008 that the notion of a plugin manager really came into vogue.</p>
<p>These tools add plugins’ separate directories to Vim’s runtimepath, and compile help tags for plugin documentation. Most plugin managers also install and update plugin code from the internet, sometimes in parallel or with colorful progress bars.</p>
<p>In chronological order, here is the parade of plugin managers. I based the date ranges on earliest and latest releases of each, or when no official releases are identified, on the earliest and latest commit dates.</p>
<ul>
<li>Mar 2006 - Jul 2014 : <a href="https://www.vim.org/scripts/script.php?script_id=1502">Vimball</a> (A distribution format and associated Vim commands)</li>
<li>Oct 2008 - Dec 2015 : <a href="https://github.com/tpope/vim-pathogen">Pathogen</a> (Deprecated in favor of native vim packages)</li>
<li>Aug 2009 - Dec 2009 : <a href="https://github.com/c9s/Vimana">Vimana</a></li>
<li>Dec 2009 - Dec 2014 : <a href="https://github.com/MarcWeber/vim-addon-manager">VAM</a></li>
<li>Aug 2010 - Nov 2010 : <a href="https://github.com/vimjolts/jolt">Jolt</a></li>
<li>Oct 2010 - Nov 2012 : <a href="https://github.com/tomtom/tplugin_vim">tplugin</a></li>
<li>Oct 2010 - Feb 2014 : <a href="https://github.com/VundleVim/Vundle.vim">Vundle</a> (Discontinued after NeoBundle ripped off code)</li>
<li>Mar 2012 - Mar 2018 : <a href="https://github.com/kana/vim-flavor">vim-flavor</a></li>
<li>Apr 2012 - Mar 2016 : <a href="https://github.com/Shougo/neobundle.vim">NeoBundle</a> (Deprecated in favor of dein)</li>
<li>Jan 2013 - Aug 2017 : <a href="https://github.com/csexton/infect">infect</a></li>
<li>Feb 2013 - Aug 2016 : <a href="https://github.com/rkulla/vimogen">vimogen</a></li>
<li>Oct 2013 - Jan 2015 : <a href="https://github.com/sunaku/vim-unbundle">vim-unbundle</a></li>
<li>Dec 2013 - Jul 2015 : <a href="https://github.com/ardagnir/vizardry">Vizardry</a></li>
<li>Feb 2014 - Oct 2018 : <a href="https://github.com/junegunn/vim-plug">vim-plug</a></li>
<li>Jan 2015 - Oct 2015 : <a href="https://github.com/tomtom/enabler_vim">enabler</a></li>
<li>Aug 2015 - Apr 2016 : <a href="https://github.com/dbeniamine/vizardry">Vizardry 2</a></li>
<li>Jan 2016 - Jun 2018 : <a href="https://github.com/Shougo/dein.vim">dein.vim</a></li>
<li>Sep 2016 - Present : native in Vim 8</li>
<li>Feb 2017 - Sep 2018 : <a href="https://github.com/k-takata/minpac">minpac</a></li>
<li>Mar 2018 - Mar 2018 : <a href="https://github.com/meldavis/autopac">autopac</a></li>
<li>Feb 2017 - Jun 2018 : <a href="https://github.com/maralla/pack">pack</a></li>
<li>Mar 2017 - Sep 2017 : <a href="https://github.com/nicodebo/vim-pck">vim-pck</a></li>
<li>Sep 2017 - Sep 2017 : <a href="https://github.com/mkarpoff/vim8-pack">vim8-pack</a></li>
<li>Sep 2017 - May 2019 : <a href="https://github.com/vim-volt/volt">volt</a></li>
<li>Sep 2018 - Feb 2019 : <a href="https://github.com/kristijanhusak/vim-packager">vim-packager</a></li>
<li>Feb 2019 - Feb 2019 : <a href="https://github.com/bennyyip/plugpac.vim">plugpac.vim</a></li>
</ul>
<p>The first thing to note is the overwhelming variety of these tools, and the second is that each is typically active for about four years before presumably going out of fashion.</p>
<p>The most stable way to manage plugins is to simply use Vim 8’s built-in functionality, which requires no third-party code. Let’s walk through how to do it.</p>
<p>First create two directories, opt and start, within a pack directory in your runtimepath.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="fu">mkdir</span> <span class="at">-p</span> ~/.vim/pack/foobar/<span class="dt">{opt</span><span class="op">,</span><span class="dt">start}</span></span></code></pre></div>
<p>Note the placeholder “foobar.” This name is entirely up to you. It classifies the packages that will go inside. Most people throw all their plugins into a single nondescript category, which is fine. Pick whatever name you like; I’ll continue to use foobar here. You could theoretically create multiple categories too, like ~/.vim/pack/navigation and ~/.vim/pack/linting. Note that Vim does not detect duplication between categories and will double-load duplicates if they exist.</p>
<p>Packages in “start” get loaded automatically, whereas those in “opt” won’t load until specifically requested in Vim with the <code>:packadd</code> command. Opt is good for lesser-used packages, and keeps Vim fast by not running scripts unnecessarily. Note that there isn’t a counterpart to <code>:packadd</code> to unload a package.</p>
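<p>Optional packages can also be requested from your vimrc. As a small sketch, this loads matchit, an optional package that ships with Vim 8 in <code>$VIMRUNTIME/pack/dist/opt</code>:</p>
<pre class="viml"><code>&quot; in ~/.vimrc: the bang only adds the package to runtimepath now,
&quot; and lets Vim source its plugin scripts during normal startup
packadd! matchit</code></pre>
<p>Typed interactively, <code>:packadd</code> without the bang sources the package’s scripts right away.</p>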
<p>For this example we’ll add the “ctrlp” fuzzy find plugin to opt. Download and extract the latest release into place:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="ex">curl</span> <span class="at">-L</span> https://github.com/kien/ctrlp.vim/archive/1.79.tar.gz <span class="dt">\</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>	<span class="kw">|</span> <span class="fu">tar</span> zx <span class="at">-C</span> ~/.vim/pack/foobar/opt</span></code></pre></div>
<p>That command creates a ~/.vim/pack/foobar/opt/ctrlp.vim-1.79 folder, and the package is ready to use. Back in vim, create a helptags index for the new package:</p>
<pre class="viml"><code>:helptags ~/.vim/pack/foobar/opt/ctrlp.vim-1.79/doc</code></pre>
<p>That creates a file called “tags” in the package’s doc folder, which makes the topics available for browsing in Vim’s internal help system. (Alternately you can run <code>:helptags ALL</code> once the package has been loaded, which takes care of all docs in the runtimepath.)</p>
<p>When you want to use the package, load it (and know that tab completion works for plugin names, so you don’t have to type the whole name):</p>
<pre class="viml"><code>:packadd ctrlp.vim-1.79</code></pre>
<p>Packadd includes the package’s base directory in the runtimepath, and sources its plugin and ftdetect scripts. After loading ctrlp, you can press CTRL-P to pop up a fuzzy find file matcher.</p>
<p>Some people keep their ~/.vim directory under version control and use git submodules for each package. For my part, I simply extract packages from tarballs and track them in my own repository. If you use mature packages you don’t need to upgrade them often, plus the scripts are generally small and don’t clutter git history much.</p>
<h3 id="backups-and-undo">Backups and undo</h3>
<p>Depending on user settings, Vim can protect against four types of loss:</p>
<ol type="1">
<li>A crash during editing (between saves). Vim can protect against this one by periodically saving unwritten changes to a swap file.</li>
<li>Editing the same file with two instances of Vim, overwriting changes from one or both instances. Swap files protect against this too.</li>
<li>A crash during the save process itself, after the destination file is truncated but before the new contents have been fully written. Vim can protect against this with a “writebackup.” To do this, it writes to a new file and swaps it with the original on success, in a way that depends on the “backupcopy” setting.</li>
<li>Saving new file contents but wanting the original back. Vim can protect against this by persisting the backup copy of the file after writing changes.</li>
</ol>
<p>Before examining sensible settings, how about some comic relief? Here is just a sampling of comments from vimrc files on GitHub:</p>
<ul>
<li>“Do not create swap file. Manage this in version control”</li>
<li>“Backups are for pussies. Use version control”</li>
<li>“use version control FFS!”</li>
<li>“We live in a world with version control, so get rid of swaps and backups”</li>
<li>“don’t write backup files, version control is enough backup”</li>
<li>“I’ve never actually used the VIM backup files… Use version control”</li>
<li>“Since most stuff is on version control anyway”</li>
<li>“Disable backup files, you are using a version control system anyway :)”</li>
<li>“version control has arrived, git will save us”</li>
<li>“disable swap and backup files (Always use version control! ALWAYS!)”</li>
<li>“Turn backup off, since I version control everything”</li>
</ul>
<p>The comments reflect awareness of only the fourth case above (and the third by accident), whereas the authors generally go on to disable the swap file too, leaving one and two unprotected.</p>
<p>Here is the configuration I recommend to keep your edits safe:</p>
<pre class="viml"><code>&quot; Protect changes between writes. Default values of
&quot; updatecount (200 keystrokes) and updatetime
&quot; (4 seconds) are fine
set swapfile
set directory^=~/.vim/swap//

&quot; protect against crash-during-write
set writebackup
&quot; but do not persist backup after successful write
set nobackup
&quot; use rename-and-write-new method whenever safe
set backupcopy=auto
&quot; patch required to honor double slash at end
if has(&quot;patch-8.1.0251&quot;)
	&quot; consolidate the writebackups -- not a big
	&quot; deal either way, since they usually get deleted
	set backupdir^=~/.vim/backup//
endif

&quot; persist the undo tree for each file
set undofile
set undodir^=~/.vim/undo//</code></pre>
<p>These settings enable backups for writes-in-progress, but do not persist them after successful write because version control etc etc. Note that you’ll need to <code>mkdir ~/.vim/{swap,undo,backup}</code> or else Vim will fall back to the next available folder in the preference list. You should also probably chmod the folders to keep the contents private, because the swap files and undo history might contain sensitive information.</p>
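<p>If you would rather not remember to create the directories by hand, the vimrc can do it. This is a minimal sketch, assuming the directory names match the settings above; it relies on Vim’s built-in <code>mkdir()</code>, and the 0700 mode restricts access to your own user.</p>
<pre class="viml"><code>&quot; create the private directories if they are missing
for s:dir in [&#39;swap&#39;, &#39;backup&#39;, &#39;undo&#39;]
  if !isdirectory(expand(&#39;~/.vim/&#39; . s:dir))
    call mkdir(expand(&#39;~/.vim/&#39; . s:dir), &#39;p&#39;, 0700)
  endif
endfor
unlet s:dir</code></pre>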
<p>One thing to note about the paths in our config is that they end in a double slash. That ending enables a feature to disambiguate swaps and backups for files with the same name that live in different directories. For instance the swap file for <code>/foo/bar</code> will be saved in <code>~/.vim/swap/%foo%bar.swp</code> (slashes escaped as percent signs). Vim had a bug until a fairly recent patch where the double slash was not honored for backupdir, and we guard against that above.</p>
<p>We also have Vim persist the history of undos for each file, so that you can apply them even after quitting and editing the file again. While it may sound redundant with the swap file, the undo history is complementary because it is written only when the file is written. (If it were written more frequently it might not match the state of the file on disk after a crash, so Vim doesn’t do that.)</p>
<p>Speaking of undo, Vim maintains a full tree of edit history. This means you can make a change, undo it, then redo it differently and all three states are recoverable. You can see the times and magnitude of changes with the <code>:undolist</code> command, but it’s hard to visualize the tree structure from it. You can navigate to specific changes in that list, or move in time with <code>:earlier</code> and <code>:later</code> which take a time argument like 5m, or the count of file saves, like 3f. However navigating the undo tree is an instance when I think a plugin – like <a href="https://github.com/mbbill/undotree">undotree</a> – <em>is</em> warranted.</p>
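<p>For reference, the built-in commands mentioned above look like this in practice. The numbers are arbitrary examples.</p>
<pre class="viml"><code>&quot; list the leaves of the undo tree with change numbers and times
:undolist
&quot; return to the text as it was five minutes ago
:earlier 5m
&quot; move forward by three file writes
:later 3f
&quot; jump to the state just after change number 12
:undo 12</code></pre>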
<p>Enabling these disaster recovery settings can bring you peace of mind. I used to save compulsively after most edits or when stepping away from the computer, but now I’ve made an effort to leave documents unsaved for hours at a time. I know how the swap file works now.</p>
<p>Some final notes: keep an eye on all these disaster recovery files, because they can pile up in your .vim folder and use space over time. Also, setting nowritebackup might be necessary when saving a huge file with low disk space, because Vim must otherwise make an entire copy of the file temporarily. By default the “backupskip” setting disables backups for anything in the system temp directory.</p>
<p>Vim’s “patchmode” is related to backups. You can use it in directories that aren’t under version control, for instance if you want to download a source tarball, make an edit, and send a patch over a mailing list without bringing git into the picture. Run <code>:set patchmode=.orig</code> and any file ‘foo’ Vim is about to write will be backed up to ‘foo.orig’. You can then create a patch on the command line between the .orig files and the new ones.</p>
<h3 id="include-and-path">Include and path</h3>
<p>Most programming languages allow you to include one module or file from another. Vim knows how to track program identifiers in included files using the configuration settings <code>path</code>, <code>include</code>, <code>suffixesadd</code>, and <code>includeexpr</code>. The identifier search (see <code>:help include-search</code>) is an alternative to maintaining a tags file with ctags for system headers.</p>
<p>The settings for C programs work out of the box. Other languages are supported too, but require tweaking. That’s outside the scope of this article, see <code>:help include</code>.</p>
<p>If everything is configured right, you can press <code>[i</code> on an identifier to display its definition, or <code>[d</code> for a macro constant. Also when you press <code>gf</code> with the cursor on a filename, Vim searches the path to find it and jump there. Because the path also affects the <code>:find</code> command, some people have the tendency to add ‘**/*’ or commonly accessed directories to the path in order to use <code>:find</code> like a poor man’s fuzzy finder. Doing this slows down the identifier search with directories which aren’t relevant to that task.</p>
<p>A way to get the same level of crappy find capability, without polluting the path, is to just make another mapping. You can then press &lt;Leader&gt;&lt;space&gt; (which is typically backslash space) then start typing a filename and use tab or CTRL-D completion to find the file.</p>
<pre class="viml"><code>&quot; fuzzy-find lite
nmap &lt;Leader&gt;&lt;space&gt; :e ./**/</code></pre>
<p>Just to reiterate: the path parameter was designed for header files. If you want more proof, there is even a <code>:checkpath</code> command to see whether the path is functioning. Load a C file and run <code>:checkpath</code>. It will display filenames it was unable to find that are included transitively by the current file. Also <code>:checkpath!</code> with a bang dumps the whole hierarchy of files included from the current file.</p>
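<p>The bracket commands also have Ex equivalents, which can be handy when the identifier is not under the cursor. A small sketch, where <code>size_t</code> and <code>BUFSIZ</code> are just sample names to look up:</p>
<pre class="viml"><code>&quot; report included files that cannot be found via &#39;path&#39;
checkpath
&quot; list lines, in this file and its includes, containing the word size_t
ilist size_t
&quot; jump to the first such line
ijump size_t
&quot; same idea for macro definitions
dlist BUFSIZ
djump BUFSIZ</code></pre>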
<p>By default path has the value “.,/usr/include,,” meaning the directory of the file in the active buffer (the “.”), /usr/include, and the current working directory (the empty entry between the final commas). The directory specifiers and globs are pretty powerful, see <code>:help file-searching</code> for the details.</p>
<p>In my C ftplugin (more on that later), I also have the path search for include files within the current project, like ./src/include or ./include .</p>
<pre class="viml"><code>setlocal path=.,,*/include/**3,./*/include/**3
setlocal path+=/usr/include</code></pre>
<p>The ** with a number like **3 bounds the depth of the search in subdirectories. It’s wise to add depth bounds where you can to avoid identifier searches that lock up.</p>
<p>Here are other patterns you might consider adding to your path if <code>:checkpath</code> identifies that files can’t be found in your project. It depends on your system of course.</p>
<ul>
<li>More system includes: <code>/usr/include/**4,/usr/local/include/**3</code></li>
<li>Homebrew library headers: <code>/usr/local/Cellar/**2/include/**2</code></li>
<li>Macports library headers: <code>/opt/local/include/**</code></li>
<li>OpenBSD library headers: <code>/usr/local/lib/*/include,/usr/X11R6/include/**3</code></li>
</ul>
<p>See also: <code>:he [</code>, <code>:he gf</code>, <code>:he :find</code>.</p>
<h3 id="edit-compile-cycle">Edit ⇄ compile cycle</h3>
<p>The <code>:make</code> command runs a program of the user’s choice to build a project, and collects the output in the quickfix buffer. Each quickfix item records the filename, line, column, type (warning/error) and message of one line of output. A fairly idiomatic mapping uses bracket commands to move through quickfix items:</p>
<pre class="viml"><code>&quot; quickfix shortcuts
nmap ]q :cnext&lt;cr&gt;
nmap ]Q :clast&lt;cr&gt;
nmap [q :cprev&lt;cr&gt;
nmap [Q :cfirst&lt;cr&gt;</code></pre>
<p>If, after updating the program and rebuilding, you are curious what the error messages said last time, use <code>:colder</code> (and <code>:cnewer</code> to return). To see more information about the currently selected error use <code>:cc</code>, and use <code>:copen</code> to see the full quickfix buffer. You can populate the quickfix yourself without running <code>:make</code> with <code>:cfile</code>, <code>:caddfile</code>, or <code>:cexpr</code>.</p>
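<p>For example, here is a minimal sketch of filling the quickfix list from saved output rather than from <code>:make</code>. The file names <code>build.log</code> and <code>lint.log</code> are hypothetical, and their contents need to match the current errorformat to be parsed.</p>
<pre class="viml"><code>&quot; replace the quickfix list with the contents of a saved build log
cfile build.log
&quot; append entries from a second log to the same list
caddfile lint.log
&quot; open the quickfix window to see everything
copen
&quot; display (and jump to) the third entry
cc 3</code></pre>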
<p>Vim parses output from the build process according to the errorformat string, which contains scanf-like escape sequences. It’s typical to set this in a “compiler file.” For instance, Vim ships with one for gcc in $VIMRUNTIME/compiler/gcc.vim, but has no compiler file for clang. I created the following definition for ~/.vim/compiler/clang.vim:</p>
<pre class="viml"><code>&quot; formatting variations documented at
&quot; https://clang.llvm.org/docs/UsersManual.html#formatting-of-diagnostics
&quot;
&quot; It should be possible to make this work for the combination of
&quot; -fno-show-column and -fcaret-diagnostics as well with multiline
&quot; and %p, but I was too lazy to figure it out.
&quot;
&quot; The %D and %X patterns are not clang per se. They capture the
&quot; directory change messages from (GNU) &#39;make -w&#39;. I needed this
&quot; for building a project which used recursive Makefiles.

CompilerSet errorformat=
	\%f:%l:%c:{%*[^}]}{%*[^}]}:\ %trror:\ %m,
	\%f:%l:%c:{%*[^}]}{%*[^}]}:\ %tarning:\ %m,
	\%f:%l:%c:\ %trror:\ %m,
	\%f:%l:%c:\ %tarning:\ %m,
	\%f(%l,%c)\ :\ %trror:\ %m,
	\%f(%l,%c)\ :\ %tarning:\ %m,
	\%f\ +%l:%c:\ %trror:\ %m,
	\%f\ +%l:%c:\ %tarning:\ %m,
	\%f:%l:\ %trror:\ %m,
	\%f:%l:\ %tarning:\ %m,
	\%D%*\\a[%*\\d]:\ Entering\ directory\ %*[`&#39;]%f&#39;,
	\%D%*\\a:\ Entering\ directory\ %*[`&#39;]%f&#39;,
	\%X%*\\a[%*\\d]:\ Leaving\ directory\ %*[`&#39;]%f&#39;,
	\%X%*\\a:\ Leaving\ directory\ %*[`&#39;]%f&#39;,
	\%DMaking\ %*\\a\ in\ %f

CompilerSet makeprg=make</code></pre>
<p>To activate this compiler profile, run <code>:compiler clang</code>. This is typically done in an ftplugin file.</p>
<p>Another example is running <a href="https://www.gnu.org/software/diction/">GNU Diction</a> on a text document to identify wordy and commonly misused phrases in sentences. Create a “compiler” called diction.vim:</p>
<pre class="viml"><code>CompilerSet errorformat=%f:%l:\ %m
CompilerSet makeprg=diction\ -s\ %</code></pre>
<p>After you run <code>:compiler diction</code> you can use the normal <code>:make</code> command to run it and populate the quickfix. The final mild convenience in my .vimrc is a mapping to run make:</p>
<pre class="viml"><code>&quot; real make
map &lt;silent&gt; &lt;F5&gt; :make&lt;cr&gt;&lt;cr&gt;&lt;cr&gt;
&quot; GNUism, for building recursively
map &lt;silent&gt; &lt;s-F5&gt; :make -w&lt;cr&gt;&lt;cr&gt;&lt;cr&gt;</code></pre>
<h3 id="diffs-and-patches">Diffs and patches</h3>
<p>Vim’s internal diffing is powerful, but it can be daunting, especially the three-way merge view. In reality it’s not so bad once you take time to study it. The main idea is that every window is either in or out of “diff mode.” All windows put in diffmode (with <code>:difft[his]</code>) get compared with all other windows already in diff mode.</p>
<p>For example, let’s start simple. Create two files:</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="bu">echo</span> <span class="st">&quot;hello, world&quot;</span> <span class="op">&gt;</span> h1</span>
<span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a><span class="bu">echo</span> <span class="st">&quot;goodbye, world&quot;</span> <span class="op">&gt;</span> h2</span>
<span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-4"><a href="#cb16-4" aria-hidden="true" tabindex="-1"></a><span class="ex">vim</span> h1 h2</span></code></pre></div>
<p>In vim, split the arguments into their own windows with <code>:all</code>. In the top window, for h1, run <code>:difft</code>. You’ll see a gutter appear, but no difference detected. Move to the other window with CTRL-W CTRL-W and run <code>:difft</code> again. Now hello and goodbye are identified as different in the current chunk. Continuing in the bottom window, you can run <code>:diffg[et]</code> to get “hello” from the top window, or <code>:diffpu[t]</code> to send “goodbye” into the top window. Pressing <code>]c</code> or <code>[c</code> would move between chunks if there were more than one.</p>
<p>A shortcut would be running <code>vim -d h1 h2</code> instead (or its alias, <code>vimdiff h1 h2</code>) which applies <code>:difft</code> to all windows. Alternatively, load just h1 with <code>vim h1</code> and then <code>:diffsplit h2</code>. Remember that fundamentally these commands just load files into windows and set the diff mode.</p>
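<p>To make the “windows plus a mode” idea concrete, here is a short sketch of commands you might type after starting with <code>vim h1</code>. The first two are alternatives; pick one.</p>
<pre class="viml"><code>&quot; open h2 in a new window and diff it against the current buffer
diffsplit h2
&quot; the same, but with the windows side by side
vertical diffsplit h2
&quot; refresh the diff highlighting after manual edits
diffupdate
&quot; leave diff mode in every window of the current tab page
diffoff!</code></pre>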
<p>With these basics in mind, let’s learn to use Vim as a three-way mergetool for git. First configure git:</p>
<div class="sourceCode" id="cb17"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> config merge.tool vimdiff</span>
<span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> config merge.conflictstyle diff3</span>
<span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a><span class="fu">git</span> config mergetool.prompt false</span></code></pre></div>
<p>Now, when you hit a merge conflict, run <code>git mergetool</code>. It will bring Vim up with four windows. This part looks scary, and is where I used to flail around and often quit in frustration.</p>
<pre><code>+-----------+------------+------------+
|           |            |            |
|           |            |            |
|   LOCAL   |    BASE    |   REMOTE   |
+-----------+------------+------------+
|                                     |
|                                     |
|             (edit me)               |
+-------------------------------------+</code></pre>
<p>Here’s the trick: do all the editing in the bottom window. The top three windows simply provide context about how the file differs on either side of the merge (local / remote), and how it looked prior to either side doing any work (base).</p>
<p>Move within the bottom window with <code>]c</code>, and for each chunk choose whether to replace it with text from local, base, or remote – or whether to write in your own change which might combine parts from several.</p>
<p>To make it easier to pull changes from the top windows, I set some mappings in my vimrc:</p>
<pre class="viml"><code>&quot; shortcuts for 3-way merge
map &lt;Leader&gt;1 :diffget LOCAL&lt;CR&gt;
map &lt;Leader&gt;2 :diffget BASE&lt;CR&gt;
map &lt;Leader&gt;3 :diffget REMOTE&lt;CR&gt;</code></pre>
<p>We’ve already seen <code>:diffget</code>, and here our bindings pass an argument of the buffer name that identifies which window to pull from.</p>
<p>Once done with the merge, run <code>:wqa</code> to save all the windows and quit. If you want to abandon the merge instead, run <code>:cq</code> to abort all changes and return an error code to the shell. This will signal to git that it should ignore your changes.</p>
<p>Diffget can also accept a range. If you want to pull in <em>all</em> changes from one of the top windows rather than working chunk by chunk, just run <code>:1,$+1diffget {LOCAL,BASE,REMOTE}</code>. The “+1” is required because there can be deleted lines “below” the last line of a buffer.</p>
<p>The three-way merge is fairly easy after all. There’s no need for plugins like Fugitive, at least for presenting a simplified view for resolving merge conflicts.</p>
<p>Finally, as of patch 8.1.0360, Vim is bundled with the xdiff library and can create diffs internally. This can be more efficient than shelling out to an external program, and allows for a choice of diff algorithms. The “<a href="https://bramcohen.livejournal.com/73318.html">patience</a>” algorithm often produces more human-readable output than the default, “myers.” Set it in your .vimrc like so:</p>
<pre class="viml"><code>if has(&quot;patch-8.1.0360&quot;)
	set diffopt+=internal,algorithm:patience
endif</code></pre>
<h3 id="buffer-io">Buffer I/O</h3>
<p>See if this sounds familiar: you’re editing a buffer and want to save it as a new file, so you <code>:w newname</code>. After editing some more, you <code>:w</code>, but it writes over the original file. What you want for this scenario is <code>:saveas newname</code>, which does the write but also changes the filename of the buffer for future writes. Alternately, the <code>:file newname</code> command will change the filename without doing a write.</p>
<p>It also pays off to learn more about the read and write commands. Because r and w are Ex commands, they work with ranges. Here are some variations you might not know about:</p>
<table class="table">
<tbody>
<tr>
<td>
:w &gt;&gt;foo
</td>
<td>
append the whole buffer to a file
</td>
</tr>
<tr>
<td>
:.w &gt;&gt;foo
</td>
<td>
append current line to a file
</td>
</tr>
<tr>
<td>
:$r foo
</td>
<td>
read foo into the end of the buffer
</td>
</tr>
<tr>
<td>
:0r foo
</td>
<td>
read foo into the start, moving existing lines down
</td>
</tr>
<tr>
<td>
:.,$w foo
</td>
<td>
write current line and below to a file
</td>
</tr>
<tr>
<td>
:r !ls
</td>
<td>
read ls output in below the current line
</td>
</tr>
<tr>
<td>
:w !wc
</td>
<td>
send buffer to wc and display output
</td>
</tr>
<tr>
<td>
:.!tr &#39;A-Za-z&#39; &#39;N-ZA-Mn-za-m&#39;
</td>
<td>
apply ROT-13 to current line
</td>
</tr>
<tr>
<td>
:w|so %
</td>
<td>
chain commands: write and then source buffer
</td>
</tr>
<tr>
<td>
:e!
</td>
<td>
throw away unsaved changes, reload buffer
</td>
</tr>
<tr>
<td>
:hide edit foo
</td>
<td>
edit foo, hide current buffer if dirty
</td>
</tr>
</tbody>
</table>
<p>Useless fun fact: we piped a line to <code>tr</code> in an example above to apply a ROT-13 cypher, but Vim has that functionality built in with the <code>g?</code> command. Apply it to a motion, like <code>g?$</code>.</p>
<h3 id="filetypes">Filetypes</h3>
<p>Filetypes are a way to change settings based on the type of file detected in a buffer. They don’t need to be automatically detected though, we can manually enable them to interesting effect. An example is doing hex editing. Any file can be viewed as raw hexadecimal values. GitHub user the9ball <a href="https://github.com/the9ball/.vim/blob/7138beef974b3510f0dc92b7629ad236ddd39ec9/ftplugin/xxd.vim">created</a> a clever ftplugin script that filters a buffer back and forth through the xxd utility for hex editing.</p>
<p>The xxd utility was bundled as part of Vim 5 for convenience. The Vim todo.txt file mentions they want to make it more seamless to edit binary files, but xxd can take us pretty far.</p>
<p>Here is code you can put in <code>~/.vim/ftplugin/xxd.vim</code>. Its presence in ftplugin means Vim will execute the script when filetype (aka “ft”) becomes xxd. I added some basic comments to the script.</p>
<pre class="viml"><code>&quot; without the xxd command this is all pointless
if !executable(&#39;xxd&#39;)
	finish
endif

&quot; don&#39;t insert a newline in the final line if it
&quot; doesn&#39;t already exist, and don&#39;t insert linebreaks
setlocal binary noendofline
silent %!xxd -g 1
%s/\r$//e

&quot; put the autocmds into a group for easy removal later
augroup ftplugin-xxd
	&quot; erase any existing autocmds on buffer
	autocmd! * &lt;buffer&gt;

	&quot; before writing, translate back to binary
	autocmd BufWritePre &lt;buffer&gt; let b:xxd_cursor = getpos(&#39;.&#39;)
	autocmd BufWritePre &lt;buffer&gt; silent %!xxd -r

	&quot; after writing, restore hex view and mark unmodified
	autocmd BufWritePost &lt;buffer&gt; silent %!xxd -g 1
	autocmd BufWritePost &lt;buffer&gt; %s/\r$//e
	autocmd BufWritePost &lt;buffer&gt; setlocal nomodified
	autocmd BufWritePost &lt;buffer&gt; call setpos(&#39;.&#39;, b:xxd_cursor) | unlet b:xxd_cursor

	&quot; update text column after changing hex values
	autocmd TextChanged,InsertLeave &lt;buffer&gt; let b:xxd_cursor = getpos(&#39;.&#39;)
	autocmd TextChanged,InsertLeave &lt;buffer&gt; silent %!xxd -r
	autocmd TextChanged,InsertLeave &lt;buffer&gt; silent %!xxd -g 1
	autocmd TextChanged,InsertLeave &lt;buffer&gt; call setpos(&#39;.&#39;, b:xxd_cursor) | unlet b:xxd_cursor
augroup END

&quot; when filetype is set to no longer be &quot;xxd,&quot; put the binary
&quot; and endofline settings back to what they were before, remove
&quot; the autocmds, and replace buffer with its binary value
let b:undo_ftplugin = &#39;setl bin&lt; eol&lt; | execute &quot;au! ftplugin-xxd * &lt;buffer&gt;&quot; | execute &quot;silent %!xxd -r&quot;&#39;</code></pre>
<p>Try opening a file, then running <code>:set ft</code>. Note what type it is. Then <code>:set ft=xxd</code>. Vim will turn into a hex editor. To restore your view, <code>:set ft=foo</code> where foo was the original type. Note that in hex view you even get syntax highlighting because <code>$VIMRUNTIME/syntax/xxd.vim</code> ships with Vim by default.</p>
<p>Notice the nice use of “b:undo_ftplugin” which is an opportunity for filetypes to clean up after themselves when the user or ftdetect mechanism switches away from them to another filetype. (The example above could use a little work because if you <code>:set ft=xxd</code> then set it back, the buffer is marked as modified even if you never changed anything.)</p>
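<p>The same cleanup hook works for much simpler ftplugins. Here is a minimal sketch for a made-up filetype called “notes” (the filetype name, file path, and option choices are only assumptions for illustration):</p>
<pre class="viml"><code>&quot; ~/.vim/ftplugin/notes.vim (hypothetical filetype)
setlocal textwidth=72 spell

&quot; when the filetype changes, copy the global values back
&quot; into the buffer-local options set above
let b:undo_ftplugin = &#39;setlocal textwidth&lt; spell&lt;&#39;</code></pre>
<p>The <code>option&lt;</code> form resets a local option to its global value, which is usually all an ftplugin needs to undo.</p>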
<p>Ftplugins also allow you to refine an existing filetype. For instance, Vim already has some good defaults for C programming in <code>$VIMRUNTIME/ftplugin/c.vim</code>. I put these extra options in <code>~/.vim/after/ftplugin/c.vim</code> to add my own settings on top:</p>
<pre class="viml"><code>&quot; the smartest indent engine for C
setlocal cindent
&quot; my preferred &quot;Allman&quot; style indentation
setlocal cino=&quot;Ls,:0,l1,t0,(s,U1,W4&quot;

&quot; for quickfix errorformat
compiler clang
&quot; shows long build messages better
setlocal ch=2

&quot; auto-create folds per grammar
setlocal foldmethod=syntax
setlocal foldlevel=10

&quot; local project headers
setlocal path=.,,*/include/**3,./*/include/**3
&quot; basic system headers
setlocal path+=/usr/include

setlocal tags=./tags,tags;~
&quot;                      ^ in working dir, or parents
&quot;                ^ sibling of open file

&quot; the default is menu,preview but the preview window is annoying
setlocal completeopt=menu

iabbrev #i #include
iabbrev #d #define
iabbrev main() int main(int argc, char **argv)

&quot; add #include guard
iabbrev #g _&lt;c-r&gt;=expand(&quot;%:t:r&quot;)&lt;cr&gt;&lt;esc&gt;VgUV:s/[^A-Z]/_/g&lt;cr&gt;A_H&lt;esc&gt;yypki#ifndef &lt;esc&gt;j0i#define &lt;esc&gt;o&lt;cr&gt;&lt;cr&gt;#endif&lt;esc&gt;2ki</code></pre>
<p>Notice how the script uses “setlocal” rather than “set.” This applies the changes to just the current buffer rather than the whole Vim instance.</p>
<p>This script also enables some light abbreviations. Like I can type <code>#g</code> and press enter and it adds an include guard with the current filename:</p>
<div class="sourceCode" id="cb23"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="pp">#ifndef _FILENAME_H</span></span>
<span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#define _FILENAME_H</span></span>
<span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-4"><a href="#cb23-4" aria-hidden="true" tabindex="-1"></a><span class="co">/* &lt;-- cursor here */</span></span>
<span id="cb23-5"><a href="#cb23-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-6"><a href="#cb23-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#endif</span></span></code></pre></div>
<p>You can also mix filetypes by using a dot (“.”). Here is one application. Different projects have different coding conventions, so you can combine your default C settings with those for a particular project. The OpenBSD source code follows the <a href="https://man.openbsd.org/style.9">style(9)</a> format, so let’s make a special openbsd filetype. Combine the two filetypes with <code>:set ft=c.openbsd</code> on relevant files.</p>
<p>To detect the openbsd filetype we can look at the <em>contents</em> of buffers rather than just their extensions or locations on disk. The telltale sign is that C files in the <a href="https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/">OpenBSD source</a> contain <code>/* $OpenBSD:</code> in the first line.</p>
<p>To detect them, create <code>~/.vim/after/ftdetect/openbsd.vim</code>:</p>
<pre class="viml"><code>augroup filetypedetect
        au BufRead,BufNewFile *.[ch]
                \  if getline(1) =~ &#39;OpenBSD:&#39;
                \|   setl ft=c.openbsd
                \| endif
augroup END</code></pre>
<p>The <a href="https://cvsweb.openbsd.org/cgi-bin/cvsweb/ports/editors/vim/#dirlist">Vim port</a> for OpenBSD already includes a special syntax file for this filetype: <code>/usr/local/share/vim/vimfiles/syntax/openbsd.vim</code>. If you recall, the <code>/usr/local/share/vim/vimfiles</code> directory is in the runtimepath and is set aside for files from the system administrator. The provided openbsd.vim script includes a function:</p>
<pre class="viml"><code>function! OpenBSD_Style()
	setlocal cindent
	setlocal cinoptions=(4200,u4200,+0.5s,*500,:0,t0,U4200
	setlocal indentexpr=IgnoreParenIndent()
	setlocal indentkeys=0{,0},0),:,0#,!^F,o,O,e
	setlocal noexpandtab
	setlocal shiftwidth=8
	setlocal tabstop=8
	setlocal textwidth=80
endfun</code></pre>
<p>We simply need to call the function at the appropriate time. Create <code>~/.vim/after/ftplugin/openbsd.vim</code>:</p>
<pre class="viml"><code>call OpenBSD_Style()</code></pre>
<p>Now opening any C or header file with the characteristic comment at the top will be recognized as type c.openbsd and will use indenting options that conform with the style(9) man page.</p>
<h3 id="dont-forget-the-mouse">Don’t forget the mouse</h3>
<p>This is a friendly reminder that despite our command-line machismo, the mouse is in fact supported in Vim, and can do some things more easily than the keyboard. Mouse events work even over SSH thanks to xterm turning mouse events into stdin <a href="https://invisible-island.net/xterm/ctlseqs/ctlseqs.html#h2-Mouse-Tracking">escape codes</a>.</p>
<p>To enable mouse support, set <code>mouse=n</code>. Many people use <code>mouse=a</code> to make it work in all modes, but I prefer to enable it only in normal mode. This avoids creating visual selections when I click links with a keyboard modifier to open them in my browser.</p>
<p>Here are things the mouse can do:</p>
<ul>
<li>Open or close folds (when <code>foldcolumn</code> &gt; 0).</li>
<li>Select tabs (beats gt gt gt…)</li>
<li>Click to complete a motion, like d&lt;click!&gt;. Similar to the easymotion plugin but without any plugin.</li>
<li>Jump to help topics with double click.</li>
<li>Drag the status line at the bottom to change cmdheight.</li>
<li>Drag edge of window to resize.</li>
<li>Scroll wheel.</li>
</ul>
<h3 id="misc-editing">Misc editing</h3>
<p>This section could be enormous, but I’ll stick to a few tricks I learned. The first one that blew me away was <code>:set virtualedit=all</code>. It allows you to move the cursor anywhere in the window. If you enter characters or insert a visual block, Vim will add whatever spaces are required to the left of the inserted characters to keep them in place. Virtual edit mode makes it simple to edit tabular data. Turn it off with <code>:set virtualedit=</code>.</p>
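<p>If the “all” setting feels too loose for everyday use, two possible variations are sketched below. The <code>&lt;Leader&gt;v</code> key is an arbitrary choice, not something Vim defines.</p>
<pre class="viml"><code>&quot; allow the cursor past the end of the line only in Visual block mode
set virtualedit=block

&quot; or toggle full virtual editing on and off with a key
nnoremap &lt;Leader&gt;v :let &amp;virtualedit = (&amp;virtualedit == &#39;&#39; ? &#39;all&#39; : &#39;&#39;)&lt;CR&gt;</code></pre>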
<p>Next are some movement commands. I used to rely a lot on <code>}</code> to jump by paragraphs, and just muscle my way down the page. However the <code>]</code> character makes more precise motions: by function <code>]]</code>, scope <code>]}</code>, paren <code>])</code>, comment <code>]/</code>, diff block <code>]c</code>. This series is why the quickfix mapping <code>]q</code> mentioned earlier fits the pattern so well.</p>
<p>For big jumps I used to try things like <code>1000j</code>, but in normal mode you can actually just type a percentage and Vim will go there, like <code>50%</code>. Speaking of scroll percentage, you can see it at any time with CTRL-G. Thus I now do <code>:set noruler</code> and ask to see the info as needed. It’s less cluttered. Kind of the opposite of the trend of colorful patched font powerlines.</p>
<p>After jumping around between tags, files, or within a file, there are some commands to get your bearings. Try <code>:ls</code>, <code>:tags</code>, <code>:jumps</code>, and <code>:marks</code>. Jumping through tags actually creates a stack, and you can press CTRL-T to pop one back. I used to always press CTRL-O to back out of jumps, but it is not as direct as popping the tag stack.</p>
<p>In a project directory that has been indexed with ctags, you can open the editor directly to a tag with <code>-t</code>, like <code>vim -t main</code>. To find tags files more flexibly, set the <code>tags</code> configuration variable. Note the semicolon in the example below that allows Vim to search the current directory <em>upward</em> to the home directory. This way you could have a more general system tags file outside the project folder.</p>
<pre class="viml"><code>set tags=./tags,**5/tags,tags;~
&quot;                          ^ in working dir, or parents
&quot;                   ^ in any subfolder of working dir
&quot;           ^ sibling of open file</code></pre>
<p>There are some buffer tricks too. Switching to a buffer with <code>:bu</code> can take a fragment of the buffer name, not just a number. Sometimes it’s harder to memorize those numbers than remember the name of a source file. You can navigate buffers with marks too. If you use a capital letter as the name of a mark, you can jump to it across buffers. You could set a mark H in a header, C in a source file, and M in a Makefile to go from one buffer to another.</p>
<p>Do you ever get mad after yanking a word, deleting a word somewhere else, trying to paste the first word in, and then discovering your original yank is overwritten? The Vim registers are underappreciated for this. Inspect their contents with <code>:reg</code>. Register <code>"0</code> always holds the most recent yank, while deleted text of a line or more is rotated through <code>"1</code> - <code>"9</code> and smaller deletions land in <code>"-</code>. So <code>"0p</code> pastes the last yank even after an intervening deletion. The special registers <code>"+</code> and <code>"*</code> can copy/paste from/to the system clipboard. They usually mean the same thing, except on X11, where <code>"*</code> is the primary selection and <code>"+</code> is the clipboard.</p>
<p>Another handy hidden feature is the command line window. It’s a buffer that contains your previous commands and searches. Bring it up with <code>q:</code> or <code>q/</code>. Once inside you can move to any line and press enter to run it. However you can also edit any of the lines before pressing enter. Your changes won’t affect the original history entry (the new command will merely be added to the bottom of the list).</p>
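<p>One way to lean on this, sketched with an arbitrary key choice, is a mapping that pastes from register 0 instead of the unnamed register, so a yank survives an intervening delete:</p>
<pre class="viml"><code>&quot; paste the most recent yank, even after deleting something else
nnoremap &lt;Leader&gt;p &quot;0p

&quot; show only the register that holds the last yank
registers 0</code></pre>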
<p>This article could go on and on, so I’m going to call it here. For more great topics, see these help sections: views-sessions, viminfo, TOhtml, ins-completion, cmdline-completion, multi-repeat, scroll-cursor, text-objects, grep, netrw-contents.</p>
<figure>
<img src="/images/vim-logo.gif" alt="vim logo" /><figcaption aria-hidden="true">vim logo</figcaption>
</figure>]]></summary>
</entry>
<entry>
    <title>Unicode programming, with examples</title>
    <link href="https://begriffs.com/posts/2019-05-23-unicode-icu.html" />
    <id>https://begriffs.com/posts/2019-05-23-unicode-icu.html</id>
    <published>2019-05-23T00:00:00Z</published>
    <updated>2019-05-23T00:00:00Z</updated>
    <summary type="html"><![CDATA[<p>Most programming languages evolved awkwardly during the transition from ASCII to 16-bit UCS-2 to full Unicode. They contain internationalization features that often aren’t portable or don’t suffice.</p>
<p>Unicode is more than a numbering scheme for the characters of every language – although that in itself is a useful accomplishment. Unicode also includes characters’ case, directionality, and alphabetic properties. The Unicode standard and specifications describe the proper way to divide words and break lines, sort text, format numbers, display text in different directions, split/combine/reorder vowels in South Asian languages, and determine when characters may look visually confusable.</p>
<p>Human languages are highly varied and internally inconsistent, and any application which treats strings as more than an opaque byte stream must embrace the complexity. Realistically this means using a mature third-party library.</p>
<p>This article illustrates text processing ideas with example programs. We’ll use the <a href="http://site.icu-project.org">International Components for Unicode</a> (ICU) library, which is mature, portable, and powers the international text processing behind many products and operating systems.</p>
<p>IBM (the maintainers of ICU) officially support a C, C++ and Java API. We’ll use the C API here for a better view into the internals. Many languages have bindings to the library, so these concepts should be applicable to your language of choice.</p>
<p><strong>Table of Contents:</strong></p>
<ul>
<li><a href="#concepts">Concepts</a>
<ul>
<li><a href="#what-is-a-character">What is a “character?”</a></li>
<li><a href="#glyphs-vs-graphemes">Glyphs vs graphemes</a></li>
<li><a href="#how-are-codepoints-encoded">How are codepoints encoded?</a></li>
<li><a href="#which-encoding-should-you-choose">Which encoding should you choose?</a></li>
</ul></li>
<li><a href="#icu-example-programs">ICU example programs</a>
<ul>
<li><a href="#generating-random-codepoints">Generating random codepoints</a></li>
<li><a href="#manipulating-codepoints">Manipulating codepoints</a></li>
<li><a href="#examining-utf-8-code-units">Examining UTF-8 code units</a></li>
<li><a href="#reading-lines-into-internal-utf-16-representation">Reading lines as UTF-16</a></li>
<li><a href="#extracting-iterating-codepoints-in-utf-16-string">Extracting, iterating codepoints</a></li>
<li><a href="#transformation">Transformation</a></li>
<li><a href="#punycode">Punycode</a></li>
<li><a href="#changing-case">Changing case</a></li>
<li><a href="#counting-words-and-graphemes">Counting words and graphemes</a></li>
<li><a href="#string-search">String search</a></li>
<li><a href="#comparing-strings-modulo-normalization">Strings modulo normalization</a></li>
<li><a href="#confusable-strings">Confusable strings</a></li>
</ul></li>
<li><a href="#further-reading">Further reading</a></li>
</ul>
<h2 id="concepts">Concepts</h2>
<p>Before getting into the example code, it’s important to learn the terminology. Let’s start at the most basic question.</p>
<h3 id="what-is-a-character">What is a “character?”</h3>
<p>“Character” is an overloaded term. What a native speaker of a language identifies as a letter or symbol is often stored as multiple values in the internal Unicode representation. The representation is further obscured by an additional encoding in memory, on disk, or during network transmission.</p>
<p>Let’s start at the abstraction closest to the user: the grapheme cluster. A “grapheme” is a graphical unit that a reader recognizes as a single element of the writing system. It’s the character as a user would understand it. For example, 山, ä and క్క are graphemes. Pieces of a single grapheme always stay together in print; breaking them apart is either nonsense or changes the meaning of the symbol. They are rendered as “glyphs,” i.e. markings on paper or screen which vary by font, style, or position in a word.</p>
<p>You might imagine that Unicode assigns each grapheme a unique number, but that is not true. It would be wasteful because there is a combinatorial explosion between letters and diacritical marks. For instance (o, ô, ọ, ộ) and (a, â, ạ, ậ) follow a pattern. Rather than assigning a distinct number to each, it’s more efficient to assign a number to o and a, and then to each of the combining marks. The graphemes can be built from letters and combining marks e.g. ậ = a + ◌̂ + ◌̣.</p>
<p>In reality Unicode takes both approaches. It assigns numbers to basic letters and combining marks, but also to some of their more common combinations. Many graphemes can thus be created in more than one way. For instance ộ can be specified in five ways:</p>
<ul>
<li>A: U+006f (o) + U+0302 (◌̂) + U+0323 (◌̣)</li>
<li>B: U+006f (o) + U+0323 (◌̣) + U+0302 (◌̂)</li>
<li>C: U+00f4 (ô) + U+0323 (◌̣)</li>
<li>D: U+1ecd (ọ) + U+0302 (◌̂)</li>
<li>E: U+1ed9 (ộ)</li>
</ul>
<p>The numbers (written U+xxxx) for each abstract character and each combining symbol are called “codepoints.” Every Unicode string is expressed as a list of codepoints. As illustrated above, multiple strings of codepoints may render into the same sequence of graphemes.</p>
<p>To meaningfully compare strings codepoint by codepoint for equality, both strings should be represented in a consistent way. A standardized choice of codepoint decomposition for graphemes is called a “normal form.”</p>
<p>One choice is to decompose a string into as many codepoints as possible, like examples A and B, and then put the combining marks into a standard order determined by each mark’s canonical combining class (which makes B, not A, the fully ordered form). That is called Normalization Form Canonical Decomposition (NFD). Another choice is to do the opposite and use the fewest codepoints possible like example E. This is called Normalization Form Canonical Composition (NFC).</p>
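<p>To make the ordering step concrete, here is a minimal sketch in plain C (no ICU) that reorders combining marks so that examples A and B become identical. The names <code>toy_ccc</code>, <code>toy_canonical_reorder</code> and <code>toy_equal</code> are hypothetical, and the lookup table is an assumption that covers only the two marks used above (combining class 230 for U+0302 and 220 for U+0323). A real program would leave this to a Unicode library.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Toy lookup of the canonical combining class (ccc) for the two marks
 * used in the examples; every other codepoint is treated as a starter
 * (ccc 0). This table is an illustration, not the full Unicode data. */
static int toy_ccc(unsigned long cp)
{
	switch (cp)
	{
		case 0x0323: return 220; /* combining dot below */
		case 0x0302: return 230; /* combining circumflex accent */
		default:     return 0;
	}
}

/* Stable insertion sort of each run of combining marks by ccc.
 * Starters (ccc 0) never move, and marks never cross them. */
void toy_canonical_reorder(unsigned long *s, size_t n)
{
	size_t i, j;

	assert(s != NULL || n == 0);
	for (i = 1; i < n; i++)
	{
		unsigned long cp = s[i];
		int cc = toy_ccc(cp);

		if (cc == 0)
			continue;
		for (j = i; j > 0 && toy_ccc(s[j-1]) > cc; j--)
			s[j] = s[j-1];
		s[j] = cp;
	}
}

/* Codepoint-by-codepoint equality: 1 if equal, 0 otherwise */
int toy_equal(const unsigned long *a, size_t na,
              const unsigned long *b, size_t nb)
{
	size_t i;

	if (na != nb)
		return 0;
	for (i = 0; i < na; i++)
		if (a[i] != b[i])
			return 0;
	return 1;
}
```

<p>Before reordering, sequences A and B compare unequal even though they render the same grapheme. After reordering, both hold U+006f U+0323 U+0302 and compare equal. The sketch does no decomposition or composition, so it cannot reconcile C, D or E with the others.</p>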
<p>A core concept to remember is that, although codepoints are the building blocks of text, they don’t match up 1-1 with user-perceived characters (graphemes). Operations such as taking the length of an array of codepoints, or accessing arbitrary array positions are typically not useful for Unicode programs. Programs must also be mindful of the combining characters, like diacritical marks, when inserting or deleting codepoints. Inserting U+0061 into the asterisk position U+006f U+0302 (*) U+0323 changes the string “ộ” into “ôạ” rather than “ộa”.</p>
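<p>The insertion hazard can be sketched in plain C. The helper names below are hypothetical, and the mark test is a deliberate simplification: it assumes that only the Combining Diacritical Marks block (U+0300 through U+036F) contains combining marks. Real grapheme boundaries are more involved and are best found with a library.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Simplification: only U+0300..U+036F counts as a combining mark here */
static int toy_is_mark(unsigned long cp)
{
	return 0x0300 <= cp && cp <= 0x036F;
}

/* Move pos forward until it no longer sits in front of a combining
 * mark, so an insertion there cannot split a base letter from its
 * marks. */
size_t toy_safe_insert_pos(const unsigned long *s, size_t n, size_t pos)
{
	assert(pos <= n);
	while (pos < n && toy_is_mark(s[pos]))
		pos++;
	return pos;
}

/* Insert cp at index pos. The caller guarantees room for n+1 elements.
 * Returns the new length. */
size_t toy_insert(unsigned long *s, size_t n, size_t pos, unsigned long cp)
{
	size_t i;

	assert(pos <= n);
	for (i = n; i > pos; i--)
		s[i] = s[i-1];
	s[pos] = cp;
	return n + 1;
}
```

<p>Inserting U+0061 at index 2 of U+006f U+0302 U+0323 leaves the dot below attached to the new letter. Passing the index through <code>toy_safe_insert_pos</code> first moves it to 3, which appends the letter after the complete grapheme.</p>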
<h3 id="glyphs-vs-graphemes">Glyphs vs graphemes</h3>
<p>It’s not just fonts that cause graphemes to be rendered into varying glyphs. The rules of some languages cause glyphs to change through contextual shaping. For instance the Arabic letter “heh” has four forms, depending on which sides are flanked by letters. When isolated it appears as ﻩ and in the final/initial/medial position in a word it appears as ﻪ/ﻫ/ﻬ respectively. Similarly, Greek displays lower-case sigma differently at the end of the word (final form) than elsewhere. Some glyphs change based on visual order. In a right-to-left language the starting parenthesis “(” mirrors to display as “)”.</p>
<p>Not only do individual graphemes’ glyphs vary, graphemes can combine to form single glyphs. One way is through ligatures. The Latin letters “fi” often join the dot of the i with the curve of the f (presentation form U+FB01 ﬁ). Another way is language irregularity. The Arabic ا and ل, when contiguous, <em>must</em> form ﻻ.</p>
<p>Conversely, a single grapheme can split into multiple glyphs. For instance in some Indic languages, vowels can split and surround preceding consonants. In Bengali, U+09CC ৌ surrounds U+09AE ম to become মৌ . Try placing a cursor at the end of this text box and pressing backspace: <input type="text" size="3" value="মৌ" /></p>
<h3 id="how-are-codepoints-encoded">How are codepoints encoded?</h3>
<p>In 1990, Unicode codepoints were 16 bits wide. That choice turned out to be too small for the symbols and languages people wanted to represent, so the committee extended the standard to 21 bits. That’s fine in the abstract, but how the 21 bits are stored in memory or communicated between computers depends on practical factors.</p>
<p>It’s an unusual memory size. Computer hardware doesn’t typically access memory in 21-bit chunks. Networking protocols, too, are better geared toward transmitting eight bits at a time. Thus, codepoints are broken into sequences of more conventionally sized blocks called <em>code units</em> for persistence on disk, transmission over networks, and manipulation in memory.</p>
<p>The Unicode Transformation Formats (UTF) describe different ways to map between codepoints and code units. The transformation formats are named after the bit width of their code units (7, 8, 16, or 32), as well as the endianness (BE or LE). For instance: UTF-8, or UTF-16BE. In addition to the UTFs, there’s another – more complex – encoding called Punycode. It is designed to conform with the limited ASCII character subset used for Internet host names.</p>
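<p>As an illustration of the codepoint to code unit mapping, here is a rough sketch of UTF-16 in plain C. Codepoints below U+10000 fit in one 16-bit unit, and higher ones are split across a pair of units called surrogates. The function names are hypothetical, and in practice a Unicode library would do this work.</p>

```c
#include <assert.h>

/* Encode one codepoint as UTF-16 code units. Writes one or two units
 * to out and returns how many were written. */
int toy_utf16_encode(unsigned long cp, unsigned short out[2])
{
	/* only valid values: at most U+10FFFF and not itself a surrogate */
	assert(cp <= 0x10FFFFUL);
	assert(cp < 0xD800UL || cp > 0xDFFFUL);

	if (cp < 0x10000UL)
	{
		out[0] = (unsigned short)cp;
		return 1;
	}
	/* 20 bits remain after subtracting; split them 10 and 10 */
	cp -= 0x10000UL;
	out[0] = (unsigned short)(0xD800UL + (cp >> 10));
	out[1] = (unsigned short)(0xDC00UL + (cp & 0x3FFUL));
	return 2;
}

/* Serialize one code unit as two bytes, big- or little-endian */
void toy_utf16_bytes(unsigned short unit, int big_endian,
                     unsigned char out[2])
{
	unsigned char hi = (unsigned char)(unit >> 8);
	unsigned char lo = (unsigned char)(unit & 0xFF);

	out[0] = big_endian ? hi : lo;
	out[1] = big_endian ? lo : hi;
}
```

<p>For example, U+00E4 (ä) becomes the single unit 0x00E4, stored as bytes 00 E4 in UTF-16BE and E4 00 in UTF-16LE, while U+1F600 becomes the two units 0xD83D 0xDE00.</p>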
<p>A final bit of terminology. A “plane” is a contiguous group of 65,536 code points. There are 17 planes, identified by the numbers 0 to 16. Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly-used characters. The higher planes (1 through 16) are called “supplementary planes.”</p>
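<p>Since each plane holds 65,536 (2<sup>16</sup>) code points, the plane number is just the codepoint shifted right by 16 bits. A tiny sketch, with hypothetical function names:</p>

```c
#include <assert.h>

/* Plane number (0 through 16) of a codepoint */
int toy_plane(unsigned long cp)
{
	assert(cp <= 0x10FFFFUL);
	return (int)(cp >> 16);
}

/* 1 if the codepoint lies outside the BMP, 0 otherwise */
int toy_is_supplementary(unsigned long cp)
{
	return toy_plane(cp) > 0;
}
```

<p>U+5C71 (山) is in plane 0, U+1F600 is in plane 1, and the highest codepoint, U+10FFFF, is in plane 16.</p>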
<h3 id="which-encoding-should-you-choose">Which encoding should you choose?</h3>
<p>For transmission and storage, use UTF-8. Programs which move ASCII data can handle it without modification. Machine endianness does not affect UTF-8, and the byte-sized units work well in networks and filesystems.</p>
<p>Some sites, like <a href="https://utf8everywhere.org">UTF-8 Everywhere</a>, go even further and recommend using UTF-8 for internal manipulation of text in program memory. However, I would suggest you use whatever encoding your Unicode library favors for this. You’ll be performing operations through the library API, not directly on code units. As we’re seeing, there is too much complexity between glyphs, graphemes, codepoints and code units to be manipulating the units directly. Use the encoding preferred by your library and convert to/from UTF-8 at the edges of the program.</p>
<p>It’s unwise to use UTF-32 to store strings in memory. In this encoding it’s true that every code unit can hold a full codepoint. However, the relationship between codepoints and glyphs isn’t straightforward, so there isn’t a programmatic advantage to storing the string this way.</p>
<p>UTF-32 also wastes at minimum 11 (32 - 21) bits per codepoint, and typically more. For instance, UTF-16 requires only one 16-bit code unit to encode points in the Basic Multilingual Plane (the most commonly encountered points). Thus UTF-32 typically takes double the space of UTF-16 for text in the BMP.</p>
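<p>The space trade-off is easy to quantify. The sketch below (plain C, hypothetical function names) returns how many bytes a single codepoint occupies in each transformation format, using the standard range boundaries of UTF-8 and UTF-16.</p>

```c
#include <assert.h>

/* Bytes needed to encode cp in UTF-8 */
int toy_utf8_bytes(unsigned long cp)
{
	assert(cp <= 0x10FFFFUL);
	if (cp < 0x80UL)    return 1; /* ASCII */
	if (cp < 0x800UL)   return 2;
	if (cp < 0x10000UL) return 3; /* rest of the BMP */
	return 4;                     /* supplementary planes */
}

/* Bytes needed to encode cp in UTF-16: one or two 16-bit units */
int toy_utf16_bytes_needed(unsigned long cp)
{
	assert(cp <= 0x10FFFFUL);
	return cp < 0x10000UL ? 2 : 4;
}

/* Bytes needed to encode cp in UTF-32: always one 32-bit unit */
int toy_utf32_bytes(unsigned long cp)
{
	assert(cp <= 0x10FFFFUL);
	return 4;
}
```

<p>The letter A (U+0041) takes 1, 2 and 4 bytes in UTF-8, UTF-16 and UTF-32 respectively. For 山 (U+5C71) the sizes are 3, 2 and 4. Only in the supplementary planes do all three formats need the same 4 bytes.</p>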
<p>There <em>are</em> times to manipulate UTF-32, such as when examining a single codepoint. We’ll see examples below.</p>
<h2 id="icu-example-programs">ICU example programs</h2>
<p>The programs in this article are ready to compile and run. They require the ICU C library called ICU4C, which is available on most platforms through the operating system package manager.</p>
<p>ICU provides five libraries for linking (we need the first two):</p>
<table class="table">
<thead>
<tr>
<th>
Package
</th>
<th>
Contents
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
icu-uc
</td>
<td>
Common (uc) and Data (dt/data) libraries
</td>
</tr>
<tr>
<td>
icu-io
</td>
<td>
Ustdio/iostream library (icuio)
</td>
</tr>
<tr>
<td>
icu-i18n
</td>
<td>
Internationalization (in/i18n) library
</td>
</tr>
<tr>
<td>
icu-le
</td>
<td>
Layout Engine
</td>
</tr>
<tr>
<td>
icu-lx
</td>
<td>
Paragraph Layout
</td>
</tr>
</tbody>
</table>
<p>To use ICU4C, set the compiler and linker flags with <code>pkg-config</code> in your Makefile. (Pkg-config may also need to be installed on your computer.)</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode makefile"><code class="sourceCode makefile"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="dt">CFLAGS  </span><span class="ch">=</span><span class="st"> -std=c99 -pedantic -Wall -Wextra </span><span class="ch">\</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="st">          `pkg-config --cflags icu-uc icu-io`</span></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="dt">LDFLAGS </span><span class="ch">=</span><span class="st"> `pkg-config --libs icu-uc icu-io`</span></span></code></pre></div>
<p>The examples in this article conform to the C89 standard, but we specify C99 in the Makefile because the ICU header files use C99-style (<code>//</code>) comments.</p>
<h3 id="generating-random-codepoints">Generating random codepoints</h3>
<p>To start getting a feel for ICU’s I/O and codepoint manipulation, let’s make a program to output completely random (but valid) codepoints. You could use this program as a basic fuzz tester, to see whether its output confuses other programs. A real fuzz tester ought to have the ability to take an explicit seed for repeatable output, but we will omit that functionality from our simple demo.</p>
<p>This program has limited portability because it gets entropy from <code>/dev/urandom</code>, a Unix device. To generate good random numbers using only the C standard library, see my other <a href="https://begriffs.com/posts/2019-01-19-inside-c-standard-lib.html#stdlib.h-random-numbers">article</a>. Also <a href="http://pubs.opengroup.org/onlinepubs/9699919799/functions/random.html">POSIX</a> provides pseudo-random number functions.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* for constants like EXIT_FAILURE */</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="co">/* we&#39;ll be using standard C I/O to read random bytes */</span></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="co">/* to determine codepoint categories */</span></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/uchar.h&gt;</span></span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="co">/* to output UTF-32 codepoints in proper encoding for terminal */</span></span>
<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> i <span class="op">=</span> <span class="dv">0</span><span class="op">,</span> linelen<span class="op">;</span></span>
<span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* somewhat non-portable: /dev/urandom is unix specific */</span></span>
<span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a>	<span class="dt">FILE</span> <span class="op">*</span>f <span class="op">=</span> fopen<span class="op">(</span><span class="st">&quot;/dev/urandom&quot;</span><span class="op">,</span> <span class="st">&quot;rb&quot;</span><span class="op">);</span></span>
<span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>out<span class="op">;</span></span>
<span id="cb2-17"><a href="#cb2-17" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* UTF-32 code unit can hold an entire codepoint */</span></span>
<span id="cb2-18"><a href="#cb2-18" aria-hidden="true" tabindex="-1"></a>	UChar32 c<span class="op">;</span></span>
<span id="cb2-19"><a href="#cb2-19" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* to learn about c */</span></span>
<span id="cb2-20"><a href="#cb2-20" aria-hidden="true" tabindex="-1"></a>	UCharCategory cat<span class="op">;</span></span>
<span id="cb2-21"><a href="#cb2-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-22"><a href="#cb2-22" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>f<span class="op">)</span></span>
<span id="cb2-23"><a href="#cb2-23" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb2-24"><a href="#cb2-24" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Unable to open /dev/urandom</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb2-25"><a href="#cb2-25" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb2-26"><a href="#cb2-26" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb2-27"><a href="#cb2-27" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-28"><a href="#cb2-28" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* optional length to insert line breaks */</span></span>
<span id="cb2-29"><a href="#cb2-29" aria-hidden="true" tabindex="-1"></a>	linelen <span class="op">=</span> argc <span class="op">&gt;</span> <span class="dv">1</span> <span class="op">?</span> strtol<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> NULL<span class="op">,</span> <span class="dv">10</span><span class="op">)</span> <span class="op">:</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb2-30"><a href="#cb2-30" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-31"><a href="#cb2-31" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* have to obtain a Unicode-aware file handle. This function</span></span>
<span id="cb2-32"><a href="#cb2-32" aria-hidden="true" tabindex="-1"></a><span class="co">	 * has no failure return code, it always works. */</span></span>
<span id="cb2-33"><a href="#cb2-33" aria-hidden="true" tabindex="-1"></a>	out <span class="op">=</span> u_get_stdout<span class="op">();</span></span>
<span id="cb2-34"><a href="#cb2-34" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-35"><a href="#cb2-35" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* read a random 32 bits, presumably forever */</span></span>
<span id="cb2-36"><a href="#cb2-36" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>fread<span class="op">(&amp;</span>c<span class="op">,</span> <span class="kw">sizeof</span> c<span class="op">,</span> <span class="dv">1</span><span class="op">,</span> f<span class="op">))</span></span>
<span id="cb2-37"><a href="#cb2-37" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb2-38"><a href="#cb2-38" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* Scale 32-bit value to a number within code planes</span></span>
<span id="cb2-39"><a href="#cb2-39" aria-hidden="true" tabindex="-1"></a><span class="co">		 * zero through fourteen. (Planes 15-16 are private-use)</span></span>
<span id="cb2-40"><a href="#cb2-40" aria-hidden="true" tabindex="-1"></a><span class="co">		 *</span></span>
<span id="cb2-41"><a href="#cb2-41" aria-hidden="true" tabindex="-1"></a><span class="co">		 * The modulo bias is insignificant. Read as unsigned,</span></span>
<span id="cb2-42"><a href="#cb2-42" aria-hidden="true" tabindex="-1"></a><span class="co">		 * the first 65536 residues are minutely favored, being</span></span>
<span id="cb2-43"><a href="#cb2-43" aria-hidden="true" tabindex="-1"></a><span class="co">		 * generated by 4370 different 32-bit numbers each. The</span></span>
<span id="cb2-44"><a href="#cb2-44" aria-hidden="true" tabindex="-1"></a><span class="co">		 * remaining 917504 are generated by 4369 numbers each.</span></span>
<span id="cb2-45"><a href="#cb2-45" aria-hidden="true" tabindex="-1"></a><span class="co">		 */</span></span>
<span id="cb2-46"><a href="#cb2-46" aria-hidden="true" tabindex="-1"></a>		c <span class="op">=</span> <span class="op">(</span>UChar32<span class="op">)((</span><span class="dt">uint32_t</span><span class="op">)</span>c <span class="op">%</span> <span class="bn">0xF0000</span><span class="op">);</span></span>
<span id="cb2-47"><a href="#cb2-47" aria-hidden="true" tabindex="-1"></a>		cat <span class="op">=</span> u_charType<span class="op">(</span>c<span class="op">);</span></span>
<span id="cb2-48"><a href="#cb2-48" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-49"><a href="#cb2-49" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* U_UNASSIGNED covers reserved codepoints and</span></span>
<span id="cb2-50"><a href="#cb2-50" aria-hidden="true" tabindex="-1"></a><span class="co">		 * &quot;non-characters,&quot; none with an assigned meaning.</span></span>
<span id="cb2-51"><a href="#cb2-51" aria-hidden="true" tabindex="-1"></a><span class="co">		 * U_PRIVATE_USE_CHAR are reserved for use within</span></span>
<span id="cb2-52"><a href="#cb2-52" aria-hidden="true" tabindex="-1"></a><span class="co">		 * organizations, and U_SURROGATE are designed for</span></span>
<span id="cb2-53"><a href="#cb2-53" aria-hidden="true" tabindex="-1"></a><span class="co">		 * UTF-16 code units. Don&#39;t print any of those. */</span></span>
<span id="cb2-54"><a href="#cb2-54" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>cat <span class="op">!=</span> U_UNASSIGNED <span class="op">&amp;&amp;</span> cat <span class="op">!=</span> U_PRIVATE_USE_CHAR <span class="op">&amp;&amp;</span></span>
<span id="cb2-55"><a href="#cb2-55" aria-hidden="true" tabindex="-1"></a>		    cat <span class="op">!=</span> U_SURROGATE<span class="op">)</span></span>
<span id="cb2-56"><a href="#cb2-56" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb2-57"><a href="#cb2-57" aria-hidden="true" tabindex="-1"></a>			u_fputc<span class="op">(</span>c<span class="op">,</span> out<span class="op">);</span></span>
<span id="cb2-58"><a href="#cb2-58" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span>linelen <span class="op">&amp;&amp;</span> <span class="op">++</span>i <span class="op">&gt;=</span> linelen<span class="op">)</span></span>
<span id="cb2-59"><a href="#cb2-59" aria-hidden="true" tabindex="-1"></a>			<span class="op">{</span></span>
<span id="cb2-60"><a href="#cb2-60" aria-hidden="true" tabindex="-1"></a>				i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb2-61"><a href="#cb2-61" aria-hidden="true" tabindex="-1"></a>				<span class="co">/* there are a number of Unicode</span></span>
<span id="cb2-62"><a href="#cb2-62" aria-hidden="true" tabindex="-1"></a><span class="co">				 * linebreaks, but the standard ASCII</span></span>
<span id="cb2-63"><a href="#cb2-63" aria-hidden="true" tabindex="-1"></a><span class="co">				 * \n is valid, and will interact well</span></span>
<span id="cb2-64"><a href="#cb2-64" aria-hidden="true" tabindex="-1"></a><span class="co">				 * with a shell */</span></span>
<span id="cb2-65"><a href="#cb2-65" aria-hidden="true" tabindex="-1"></a>				u_fputc<span class="op">(</span><span class="ch">&#39;\n&#39;</span><span class="op">,</span> out<span class="op">);</span></span>
<span id="cb2-66"><a href="#cb2-66" aria-hidden="true" tabindex="-1"></a>			<span class="op">}</span></span>
<span id="cb2-67"><a href="#cb2-67" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb2-68"><a href="#cb2-68" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb2-69"><a href="#cb2-69" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-70"><a href="#cb2-70" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* should never get here */</span></span>
<span id="cb2-71"><a href="#cb2-71" aria-hidden="true" tabindex="-1"></a>	fclose<span class="op">(</span>f<span class="op">);</span></span>
<span id="cb2-72"><a href="#cb2-72" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb2-73"><a href="#cb2-73" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
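<p>The size of the modulo bias mentioned in the program’s comments can be worked out directly. The sketch below assumes the 32 random bits are read as an unsigned value (<code>UChar32</code> itself is a signed type), and the function name is hypothetical. It avoids needing a type wider than 32 bits by dividing 2<sup>32</sup> - 1 rather than 2<sup>32</sup>.</p>

```c
#include <assert.h>

/* For x uniform over 0 .. 2^32-1, describe the distribution of x % m:
 * residues below *favored occur (*per_residue + 1) times each, and
 * all other residues occur *per_residue times each. */
void toy_modulo_bias(unsigned long m, unsigned long *per_residue,
                     unsigned long *favored)
{
	const unsigned long max32 = 0xFFFFFFFFUL; /* 2^32 - 1 */

	assert(m > 0 && m <= max32);
	*per_residue = max32 / m;
	*favored = max32 % m + 1;
	if (*favored == m)
	{
		/* m divides 2^32 evenly, so there is no bias at all */
		*per_residue += 1;
		*favored = 0;
	}
}
```

<p>For a modulus of 0xF0000 this reports 4369 values per residue with 65536 favored residues. In other words, the first 65,536 codepoints are each produced by 4370 inputs and the remaining 917,504 by 4369, a difference of about 0.02%.</p>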
<p>A note about the mysterious U_UNASSIGNED category. Besides codepoints that simply have no character assigned to them yet, it includes the “non-characters.” These are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data. The Unicode Standard sets aside 66 non-character code points. The last two code points of each plane are noncharacters (U+FFFE and U+FFFF on the BMP). In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0…U+FDEF.</p>
<p>Applications are free to use any of these noncharacter code points internally. They have no standard interpretation when exchanged outside the context of internal use. They are not illegal in interchange, nor does their presence cause Unicode text to be ill-formed.</p>
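<p>The rules in the note above (the last two code points of each plane, plus U+FDD0 through U+FDEF) translate into a short predicate. This is a plain C sketch with hypothetical names, handy for confirming that the rules really do add up to 66 code points.</p>

```c
#include <assert.h>

/* 1 if cp is one of the noncharacter code points, 0 otherwise */
int toy_is_noncharacter(unsigned long cp)
{
	assert(cp <= 0x10FFFFUL);
	/* the contiguous range in the BMP */
	if (0xFDD0UL <= cp && cp <= 0xFDEFUL)
		return 1;
	/* the last two code points of every plane end in FFFE or FFFF */
	return (cp & 0xFFFEUL) == 0xFFFEUL;
}

/* Count noncharacters across all 17 planes by brute force */
unsigned long toy_count_noncharacters(void)
{
	unsigned long cp, count = 0;

	for (cp = 0; cp <= 0x10FFFFUL; cp++)
		if (toy_is_noncharacter(cp))
			count++;
	return count;
}
```

<p>The count comes to 2 × 17 + 32 = 66, matching the figure in the standard.</p>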
<h3 id="manipulating-codepoints">Manipulating codepoints</h3>
<p>We discussed non-characters in the previous section, but there are also Private Use codepoints. Unlike non-characters, those for private use <em>are</em> designated for interchange between systems. However, the precise meaning and glyphs for these characters are specific to the organization using them. The same codepoints can be used for different things by different people.</p>
<p>Unicode provides a large area for private use: a small code block in the BMP, as well as two entire planes, 15 and 16. Because no browser or text editor will render PUA codepoints beyond (typically) empty boxes, we can exploit plane 15 to make a visually confusing code. Ultimately it’s a cheesy substitution cipher, but it’s kind of fun.</p>
<p>Below is a program to shift characters in the BMP to/from plane 15, the Private Use Area A. Example output of an encoded string: 󰁂󰁥󰀠󰁳󰁵󰁲󰁥󰀠󰁴󰁯󰀠󰁤󰁲󰁩󰁮󰁫󰀠󰁹󰁯󰁵󰁲󰀠󰁏󰁶󰁡󰁬󰁴󰁩󰁮󰁥󰀡󰀊</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="co">/* for strcmp in argument parsing */</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> usage<span class="op">(</span><span class="dt">const</span> <span class="dt">char</span> <span class="op">*</span>prog<span class="op">)</span></span>
<span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a>	puts<span class="op">(</span><span class="st">&quot;Shift basic multilingual plane to/from PUA-A</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">);</span></span>
<span id="cb3-11"><a href="#cb3-11" aria-hidden="true" tabindex="-1"></a>	printf<span class="op">(</span><span class="st">&quot;Usage: %s [-d]</span><span class="sc">\n\n</span><span class="st">&quot;</span><span class="op">,</span> prog<span class="op">);</span></span>
<span id="cb3-12"><a href="#cb3-12" aria-hidden="true" tabindex="-1"></a>	puts<span class="op">(</span><span class="st">&quot;Encodes stdin (or decode with -d)&quot;</span><span class="op">);</span></span>
<span id="cb3-13"><a href="#cb3-13" aria-hidden="true" tabindex="-1"></a>	exit<span class="op">(</span>EXIT_SUCCESS<span class="op">);</span></span>
<span id="cb3-14"><a href="#cb3-14" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb3-15"><a href="#cb3-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-16"><a href="#cb3-16" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb3-17"><a href="#cb3-17" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb3-18"><a href="#cb3-18" aria-hidden="true" tabindex="-1"></a>	UChar32 c<span class="op">;</span></span>
<span id="cb3-19"><a href="#cb3-19" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>in<span class="op">,</span> <span class="op">*</span>out<span class="op">;</span></span>
<span id="cb3-20"><a href="#cb3-20" aria-hidden="true" tabindex="-1"></a>	<span class="kw">enum</span> <span class="op">{</span> MODE_ENCODE<span class="op">,</span> MODE_DECODE <span class="op">}</span> mode <span class="op">=</span> MODE_ENCODE<span class="op">;</span></span>
<span id="cb3-21"><a href="#cb3-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-22"><a href="#cb3-22" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">&gt;</span> <span class="dv">2</span><span class="op">)</span></span>
<span id="cb3-23"><a href="#cb3-23" aria-hidden="true" tabindex="-1"></a>		usage<span class="op">(</span>argv<span class="op">[</span><span class="dv">0</span><span class="op">]);</span></span>
<span id="cb3-24"><a href="#cb3-24" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span> <span class="cf">if</span><span class="op">(</span>argc <span class="op">&gt;</span> <span class="dv">1</span><span class="op">)</span></span>
<span id="cb3-25"><a href="#cb3-25" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb3-26"><a href="#cb3-26" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>strcmp<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="st">&quot;-d&quot;</span><span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb3-27"><a href="#cb3-27" aria-hidden="true" tabindex="-1"></a>			mode <span class="op">=</span> MODE_DECODE<span class="op">;</span></span>
<span id="cb3-28"><a href="#cb3-28" aria-hidden="true" tabindex="-1"></a>		<span class="cf">else</span></span>
<span id="cb3-29"><a href="#cb3-29" aria-hidden="true" tabindex="-1"></a>			usage<span class="op">(</span>argv<span class="op">[</span><span class="dv">0</span><span class="op">]);</span></span>
<span id="cb3-30"><a href="#cb3-30" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb3-31"><a href="#cb3-31" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-32"><a href="#cb3-32" aria-hidden="true" tabindex="-1"></a>	out <span class="op">=</span> u_get_stdout<span class="op">();</span></span>
<span id="cb3-33"><a href="#cb3-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-34"><a href="#cb3-34" aria-hidden="true" tabindex="-1"></a>	in <span class="op">=</span> u_finit<span class="op">(</span>stdin<span class="op">,</span> NULL<span class="op">,</span> NULL<span class="op">);</span></span>
<span id="cb3-35"><a href="#cb3-35" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>in<span class="op">)</span></span>
<span id="cb3-36"><a href="#cb3-36" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb3-37"><a href="#cb3-37" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Error opening stdin as UFILE</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb3-38"><a href="#cb3-38" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb3-39"><a href="#cb3-39" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb3-40"><a href="#cb3-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-41"><a href="#cb3-41" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* u_fgetcx returns UTF-32. U_EOF happens to be 0xFFFF,</span></span>
<span id="cb3-42"><a href="#cb3-42" aria-hidden="true" tabindex="-1"></a><span class="co">	 * not -1 like EOF typically is in stdio.h */</span></span>
<span id="cb3-43"><a href="#cb3-43" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">((</span>c <span class="op">=</span> u_fgetcx<span class="op">(</span>in<span class="op">))</span> <span class="op">!=</span> U_EOF<span class="op">)</span></span>
<span id="cb3-44"><a href="#cb3-44" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb3-45"><a href="#cb3-45" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* U_SENTINEL (-1) signifies an invalid character */</span></span>
<span id="cb3-46"><a href="#cb3-46" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>c <span class="op">==</span> U_SENTINEL<span class="op">)</span></span>
<span id="cb3-47"><a href="#cb3-47" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb3-48"><a href="#cb3-48" aria-hidden="true" tabindex="-1"></a>			fputs<span class="op">(</span><span class="st">&quot;Invalid character.</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb3-49"><a href="#cb3-49" aria-hidden="true" tabindex="-1"></a>			<span class="cf">continue</span><span class="op">;</span></span>
<span id="cb3-50"><a href="#cb3-50" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb3-51"><a href="#cb3-51" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>mode <span class="op">==</span> MODE_ENCODE<span class="op">)</span></span>
<span id="cb3-52"><a href="#cb3-52" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb3-53"><a href="#cb3-53" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* Move the BMP into the Supplementary</span></span>
<span id="cb3-54"><a href="#cb3-54" aria-hidden="true" tabindex="-1"></a><span class="co">			 * Private Use Area-A, which begins</span></span>
<span id="cb3-55"><a href="#cb3-55" aria-hidden="true" tabindex="-1"></a><span class="co">			 * at codepoint 0xf0000 */</span></span>
<span id="cb3-56"><a href="#cb3-56" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span><span class="dv">0</span> <span class="op">&lt;</span> c <span class="op">&amp;&amp;</span> c <span class="op">&lt;</span> <span class="bn">0xe000</span><span class="op">)</span></span>
<span id="cb3-57"><a href="#cb3-57" aria-hidden="true" tabindex="-1"></a>				c <span class="op">+=</span> <span class="bn">0xf0000</span><span class="op">;</span></span>
<span id="cb3-58"><a href="#cb3-58" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb3-59"><a href="#cb3-59" aria-hidden="true" tabindex="-1"></a>		<span class="cf">else</span></span>
<span id="cb3-60"><a href="#cb3-60" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb3-61"><a href="#cb3-61" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* Move the Supplementary Private Use</span></span>
<span id="cb3-62"><a href="#cb3-62" aria-hidden="true" tabindex="-1"></a><span class="co">			 * Plane down into the BMP */</span></span>
<span id="cb3-63"><a href="#cb3-63" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span><span class="bn">0xf0000</span> <span class="op">&lt;</span> c <span class="op">&amp;&amp;</span> c <span class="op">&lt;</span> <span class="bn">0xfe000</span><span class="op">)</span></span>
<span id="cb3-64"><a href="#cb3-64" aria-hidden="true" tabindex="-1"></a>				c <span class="op">-=</span> <span class="bn">0xf0000</span><span class="op">;</span></span>
<span id="cb3-65"><a href="#cb3-65" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb3-66"><a href="#cb3-66" aria-hidden="true" tabindex="-1"></a>		u_fputc<span class="op">(</span>c<span class="op">,</span> out<span class="op">);</span></span>
<span id="cb3-67"><a href="#cb3-67" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb3-68"><a href="#cb3-68" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-69"><a href="#cb3-69" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* if you u_finit it, then u_fclose it */</span></span>
<span id="cb3-70"><a href="#cb3-70" aria-hidden="true" tabindex="-1"></a>	u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb3-71"><a href="#cb3-71" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-72"><a href="#cb3-72" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb3-73"><a href="#cb3-73" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
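<p>The shift at the heart of that loop can be isolated as a pair of pure functions, which makes the round trip easier to reason about. This is only a sketch: the names <code>shift_encode</code>, <code>shift_decode</code> and <code>PUA_A_BASE</code> are invented for illustration, and plain <code>int32_t</code> stands in for ICU’s <code>UChar32</code> so that the fragment compiles without ICU.</p>

```c
#include <stdint.h>

/* Hypothetical names; the constants mirror the loop above. */
#define PUA_A_BASE 0xf0000

/* Shift a codepoint below U+E000 (other than U+0000) up into
 * Supplementary Private Use Area-A; leave everything else alone. */
int32_t shift_encode(int32_t c)
{
	if (0 < c && c < 0xe000)
		c += PUA_A_BASE;
	return c;
}

/* Inverse: bring a shifted codepoint back down into the BMP. */
int32_t shift_decode(int32_t c)
{
	if (PUA_A_BASE < c && c < PUA_A_BASE + 0xe000)
		c -= PUA_A_BASE;
	return c;
}
```

<p>Under these assumptions <code>shift_decode(shift_encode(c))</code> gives back <code>c</code> for every codepoint in the shifted range, while codepoints outside it pass through both functions unchanged.</p>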
<h3 id="examining-utf-8-code-units">Examining UTF-8 code units</h3>
<p>So far we’ve been working entirely with complete codepoints. This next example gets into their representation as code units in a transformation format, namely UTF-8. We will read a codepoint as a hexadecimal program argument, convert it to between one and four bytes of UTF-8, and print the hex values of those bytes.</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*** utf8.c ***/</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/utf8.h&gt;</span></span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a>	UChar32 c<span class="op">;</span></span>
<span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* ICU defines its own bool type to be used</span></span>
<span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a><span class="co">	 * with their macro */</span></span>
<span id="cb4-13"><a href="#cb4-13" aria-hidden="true" tabindex="-1"></a>	UBool err <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb4-14"><a href="#cb4-14" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* ICU uses C99 types like uint8_t */</span></span>
<span id="cb4-15"><a href="#cb4-15" aria-hidden="true" tabindex="-1"></a>	<span class="dt">uint8_t</span> bytes<span class="op">[</span><span class="dv">4</span><span class="op">]</span> <span class="op">=</span> <span class="op">{</span><span class="dv">0</span><span class="op">};</span></span>
<span id="cb4-16"><a href="#cb4-16" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* probably should be size_t not int32_t, but</span></span>
<span id="cb4-17"><a href="#cb4-17" aria-hidden="true" tabindex="-1"></a><span class="co">	 * just matching what their macro expects */</span></span>
<span id="cb4-18"><a href="#cb4-18" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int32_t</span> written <span class="op">=</span> <span class="dv">0</span><span class="op">,</span> i<span class="op">;</span></span>
<span id="cb4-19"><a href="#cb4-19" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> <span class="op">*</span>parsed<span class="op">;</span></span>
<span id="cb4-20"><a href="#cb4-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-21"><a href="#cb4-21" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">!=</span> <span class="dv">2</span><span class="op">)</span></span>
<span id="cb4-22"><a href="#cb4-22" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb4-23"><a href="#cb4-23" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;Usage: %s codepoint</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> <span class="op">*</span>argv<span class="op">);</span></span>
<span id="cb4-24"><a href="#cb4-24" aria-hidden="true" tabindex="-1"></a>		exit<span class="op">(</span>EXIT_FAILURE<span class="op">);</span></span>
<span id="cb4-25"><a href="#cb4-25" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb4-26"><a href="#cb4-26" aria-hidden="true" tabindex="-1"></a>	c <span class="op">=</span> strtol<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="op">&amp;</span>parsed<span class="op">,</span> <span class="dv">16</span><span class="op">);</span></span>
<span id="cb4-27"><a href="#cb4-27" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!*</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">]</span> <span class="op">||</span> <span class="op">*</span>parsed<span class="op">)</span></span>
<span id="cb4-28"><a href="#cb4-28" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb4-29"><a href="#cb4-29" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb4-30"><a href="#cb4-30" aria-hidden="true" tabindex="-1"></a>			<span class="st">&quot;Cannot parse codepoint: U+%s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> argv<span class="op">[</span><span class="dv">1</span><span class="op">]);</span></span>
<span id="cb4-31"><a href="#cb4-31" aria-hidden="true" tabindex="-1"></a>		exit<span class="op">(</span>EXIT_FAILURE<span class="op">);</span></span>
<span id="cb4-32"><a href="#cb4-32" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb4-33"><a href="#cb4-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-34"><a href="#cb4-34" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* this is a macro, and updates the variables</span></span>
<span id="cb4-35"><a href="#cb4-35" aria-hidden="true" tabindex="-1"></a><span class="co">	 * directly. No need to pass addresses.</span></span>
<span id="cb4-36"><a href="#cb4-36" aria-hidden="true" tabindex="-1"></a><span class="co">	 * We&#39;re saying: write to &quot;bytes&quot;, tell us how</span></span>
<span id="cb4-37"><a href="#cb4-37" aria-hidden="true" tabindex="-1"></a><span class="co">	 * many were &quot;written&quot;, limit it to four */</span></span>
<span id="cb4-38"><a href="#cb4-38" aria-hidden="true" tabindex="-1"></a>	U8_APPEND<span class="op">(</span>bytes<span class="op">,</span> written<span class="op">,</span> <span class="dv">4</span><span class="op">,</span> c<span class="op">,</span> err<span class="op">);</span></span>
<span id="cb4-39"><a href="#cb4-39" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>err<span class="op">)</span></span>
<span id="cb4-40"><a href="#cb4-40" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb4-41"><a href="#cb4-41" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;Invalid codepoint: U+%s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> argv<span class="op">[</span><span class="dv">1</span><span class="op">]);</span></span>
<span id="cb4-42"><a href="#cb4-42" aria-hidden="true" tabindex="-1"></a>		exit<span class="op">(</span>EXIT_FAILURE<span class="op">);</span></span>
<span id="cb4-43"><a href="#cb4-43" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb4-44"><a href="#cb4-44" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-45"><a href="#cb4-45" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* print in format &#39;xxd -r&#39; can read */</span></span>
<span id="cb4-46"><a href="#cb4-46" aria-hidden="true" tabindex="-1"></a>	printf<span class="op">(</span><span class="st">&quot;0: &quot;</span><span class="op">);</span></span>
<span id="cb4-47"><a href="#cb4-47" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> written<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span></span>
<span id="cb4-48"><a href="#cb4-48" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;%02x&quot;</span><span class="op">,</span> bytes<span class="op">[</span>i<span class="op">]);</span></span>
<span id="cb4-49"><a href="#cb4-49" aria-hidden="true" tabindex="-1"></a>	puts<span class="op">(</span><span class="st">&quot;&quot;</span><span class="op">);</span></span>
<span id="cb4-50"><a href="#cb4-50" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb4-51"><a href="#cb4-51" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Suppose you compile this to a program named <code>utf8</code>. Here are some examples:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="co"># ascii characters are unchanged</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./utf8 61</span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="ex">0:</span> 61</span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="co"># other codepoints require more bytes</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./utf8 1F41A</span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a><span class="ex">0:</span> f09f909a</span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a><span class="co"># format is compatible with &quot;xxd&quot;</span></span>
<span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./utf8 1F41A <span class="kw">|</span> <span class="ex">xxd</span> <span class="at">-r</span></span>
<span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a><span class="ex">🐚</span></span>
<span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a><span class="co"># surrogates (reserved for UTF-16) cannot be encoded as UTF-8</span></span>
<span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./utf8 DC00</span>
<span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a><span class="ex">Invalid</span> codepoint: U+DC00</span></code></pre></div>
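<p>To see where those bytes come from, here is a hand-written encoder that lays out the bit patterns explicitly. It is a minimal sketch in plain C rather than ICU’s implementation of <code>U8_APPEND</code>, and the function name <code>utf8_encode</code> is made up for this illustration.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Encode codepoint c into out[0..3]. Returns the number of bytes
 * written, or 0 if c is a surrogate or lies beyond U+10FFFF.
 * Illustrative only: not ICU's implementation of U8_APPEND. */
size_t utf8_encode(uint32_t c, uint8_t out[4])
{
	if (c < 0x80) {
		out[0] = (uint8_t)c;                    /* 0xxxxxxx */
		return 1;
	}
	if (c < 0x800) {
		out[0] = (uint8_t)(0xc0 | (c >> 6));    /* 110xxxxx */
		out[1] = (uint8_t)(0x80 | (c & 0x3f));  /* 10xxxxxx */
		return 2;
	}
	if (c >= 0xd800 && c <= 0xdfff)
		return 0;  /* surrogates are reserved for UTF-16 */
	if (c < 0x10000) {
		out[0] = (uint8_t)(0xe0 | (c >> 12));           /* 1110xxxx */
		out[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
		out[2] = (uint8_t)(0x80 | (c & 0x3f));
		return 3;
	}
	if (c <= 0x10ffff) {
		out[0] = (uint8_t)(0xf0 | (c >> 18));           /* 11110xxx */
		out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3f));
		out[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3f));
		out[3] = (uint8_t)(0x80 | (c & 0x3f));
		return 4;
	}
	return 0;  /* beyond the Unicode codespace */
}
```

<p>The first byte announces the length of the sequence through its leading bits, and every continuation byte carries six payload bits behind a <code>10</code> prefix. For U+1F41A this produces the same four bytes, <code>f0 9f 90 9a</code>, that the program printed.</p>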
<h3 id="reading-lines-into-internal-utf-16-representation">Reading lines into internal UTF-16 representation</h3>
<h4 id="unlimited-line-length">Unlimited line length</h4>
<p>Here’s a useful helper function named <code>u_wholeline()</code> which reads a line of any length into a dynamically allocated buffer. It returns a <code>UChar*</code>, a NUL-terminated array of UTF-16 code units, which is ICU’s standard string representation.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="co">/* to properly test realloc */</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;errno.h&gt;</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a><span class="co">/* line feed, vertical tab, form feed, carriage return,</span></span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a><span class="co"> * next line, line separator, paragraph separator */</span></span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#define NEWLINE(c) ( \</span></span>
<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a><span class="pp">	((c) &gt;= 0xa &amp;&amp; (c) &lt;= 0xd) || \</span></span>
<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a><span class="pp">	(c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )</span></span>
<span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a><span class="co">/* allocates buffer, caller must free */</span></span>
<span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a>UChar <span class="op">*</span>u_wholeline<span class="op">(</span>UFILE <span class="op">*</span>f<span class="op">)</span></span>
<span id="cb6-15"><a href="#cb6-15" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb6-16"><a href="#cb6-16" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* assume most lines are shorter</span></span>
<span id="cb6-17"><a href="#cb6-17" aria-hidden="true" tabindex="-1"></a><span class="co">	 * than 128 UTF-16 code units */</span></span>
<span id="cb6-18"><a href="#cb6-18" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> i<span class="op">,</span> sz <span class="op">=</span> <span class="dv">128</span><span class="op">;</span></span>
<span id="cb6-19"><a href="#cb6-19" aria-hidden="true" tabindex="-1"></a>	UChar c<span class="op">,</span> <span class="op">*</span>s <span class="op">=</span> malloc<span class="op">(</span>sz <span class="op">*</span> <span class="kw">sizeof</span><span class="op">(*</span>s<span class="op">)),</span> <span class="op">*</span>s_new<span class="op">;</span></span>
<span id="cb6-20"><a href="#cb6-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-21"><a href="#cb6-21" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>s<span class="op">)</span></span>
<span id="cb6-22"><a href="#cb6-22" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb6-23"><a href="#cb6-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-24"><a href="#cb6-24" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* u_fgetc returns UTF-16, unlike u_fgetcx */</span></span>
<span id="cb6-25"><a href="#cb6-25" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> <span class="op">(</span>s<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> u_fgetc<span class="op">(</span>f<span class="op">))</span> <span class="op">!=</span> U_EOF <span class="op">&amp;&amp;</span> <span class="op">!</span>NEWLINE<span class="op">(</span>s<span class="op">[</span>i<span class="op">]);</span> <span class="op">++</span>i<span class="op">)</span></span>
<span id="cb6-26"><a href="#cb6-26" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>i <span class="op">+</span> <span class="dv">1</span> <span class="op">&gt;=</span> sz<span class="op">)</span></span>
<span id="cb6-27"><a href="#cb6-27" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb6-28"><a href="#cb6-28" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* double the buffer before the next write overflows it */</span></span>
<span id="cb6-29"><a href="#cb6-29" aria-hidden="true" tabindex="-1"></a>			sz <span class="op">*=</span> <span class="dv">2</span><span class="op">;</span></span>
<span id="cb6-30"><a href="#cb6-30" aria-hidden="true" tabindex="-1"></a>			errno <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb6-31"><a href="#cb6-31" aria-hidden="true" tabindex="-1"></a>			s_new <span class="op">=</span> realloc<span class="op">(</span>s<span class="op">,</span> sz <span class="op">*</span> <span class="kw">sizeof</span><span class="op">(*</span>s<span class="op">));</span></span>
<span id="cb6-32"><a href="#cb6-32" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span>errno <span class="op">==</span> ENOMEM<span class="op">)</span></span>
<span id="cb6-33"><a href="#cb6-33" aria-hidden="true" tabindex="-1"></a>				free<span class="op">(</span>s<span class="op">);</span></span>
<span id="cb6-34"><a href="#cb6-34" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">((</span>s <span class="op">=</span> s_new<span class="op">)</span> <span class="op">==</span> NULL<span class="op">)</span></span>
<span id="cb6-35"><a href="#cb6-35" aria-hidden="true" tabindex="-1"></a>				<span class="cf">return</span> NULL<span class="op">;</span></span>
<span id="cb6-36"><a href="#cb6-36" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb6-37"><a href="#cb6-37" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-38"><a href="#cb6-38" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* if terminated by CR, eat LF */</span></span>
<span id="cb6-39"><a href="#cb6-39" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>s<span class="op">[</span>i<span class="op">]</span> <span class="op">==</span> <span class="bn">0xd</span> <span class="op">&amp;&amp;</span> <span class="op">(</span>c <span class="op">=</span> u_fgetc<span class="op">(</span>f<span class="op">))</span> <span class="op">!=</span> <span class="bn">0xa</span><span class="op">)</span></span>
<span id="cb6-40"><a href="#cb6-40" aria-hidden="true" tabindex="-1"></a>		u_fungetc<span class="op">(</span>c<span class="op">,</span> f<span class="op">);</span></span>
<span id="cb6-41"><a href="#cb6-41" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* s[i] will either be U_EOF or a newline; wipe it */</span></span>
<span id="cb6-42"><a href="#cb6-42" aria-hidden="true" tabindex="-1"></a>	s<span class="op">[</span>i<span class="op">]</span> <span class="op">=</span> <span class="ch">&#39;\0&#39;</span><span class="op">;</span></span>
<span id="cb6-43"><a href="#cb6-43" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb6-44"><a href="#cb6-44" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> s<span class="op">;</span></span>
<span id="cb6-45"><a href="#cb6-45" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
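<p>The growth rule in <code>u_wholeline()</code> comes down to one invariant: before another code unit is stored, the buffer must have room for it and for the terminator that will eventually overwrite the last slot. The sketch below shows that invariant without ICU by copying from an in-memory array instead of a <code>UFILE</code>. The name <code>copy_line</code> is hypothetical, only LF ends a line here, the initial size is deliberately tiny so that growth is exercised, and a plain <code>NULL</code> check replaces the <code>errno</code> test.</p>

```c
#include <stdint.h>
#include <stdlib.h>

/* Copy code units from src (n of them) up to the first LF into a
 * freshly allocated, NUL-terminated buffer. Caller must free it.
 * Hypothetical stand-in for reading from a UFILE. */
uint16_t *copy_line(const uint16_t *src, size_t n)
{
	size_t i, sz = 2;
	uint16_t *s = malloc(sz * sizeof(*s)), *s_new;

	if (!s)
		return NULL;

	for (i = 0; i < n && src[i] != 0xa; ++i)
	{
		/* slot i exists; make sure slot i+1 (the next
		 * unit or the terminator) will exist too */
		if (i + 1 >= sz)
		{
			sz *= 2;
			s_new = realloc(s, sz * sizeof(*s));
			if (!s_new)
			{
				free(s);
				return NULL;
			}
			s = s_new;
		}
		s[i] = src[i];
	}
	/* i is always a valid index here */
	s[i] = 0;
	return s;
}
```

<p>Checking <code>i + 1</code> rather than <code>i</code> against the size is what keeps the final terminator write in bounds, even when the line length lands exactly on a power of two.</p>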
<h4 id="limited-line-length">Limited line length</h4>
<p>The previous example reads an entire line. However, reading a limited number of code units from UTF-16 lines is trickier. Truncating a Unicode string is always a little dangerous, because the cut can split a word or break contextual shaping.</p>
<p>UTF-16 also has surrogate pairs, which are how that transformation format expresses codepoints outside the BMP. Ending a UTF-16 string early can split a surrogate pair unless you take precautions.</p>
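<p>As a rough illustration of what a surrogate pair is, the arithmetic can be written out in plain C. The function names below are hypothetical. In real code ICU’s <code>unicode/utf16.h</code> macros, such as <code>U16_IS_LEAD</code>, do this work.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Split a supplementary-plane codepoint (U+10000..U+10FFFF)
 * into its UTF-16 lead and trail surrogates. */
void to_surrogates(uint32_t c, uint16_t *lead, uint16_t *trail)
{
	c -= 0x10000;                              /* 20 bits remain */
	*lead  = (uint16_t)(0xd800 | (c >> 10));   /* high 10 bits */
	*trail = (uint16_t)(0xdc00 | (c & 0x3ff)); /* low 10 bits */
}

/* Recombine a lead/trail pair into the original codepoint. */
uint32_t from_surrogates(uint16_t lead, uint16_t trail)
{
	return 0x10000
		+ (((uint32_t)lead - 0xd800) << 10)
		+ ((uint32_t)trail - 0xdc00);
}

/* A lead surrogate in the last position has lost its partner. */
int ends_with_lead(const uint16_t *s, size_t len)
{
	return len > 0 && s[len - 1] >= 0xd800 && s[len - 1] <= 0xdbff;
}
```

<p>U+1F41A, for example, becomes the pair <code>d83d dc1a</code>. A buffer that ends right after <code>d83d</code> holds only half of a character, which is exactly the condition <code>ends_with_lead</code> detects.</p>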
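<p>The following example reads lines in chunks of at most three UTF-16 code units at a time, because <code>u_fgets()</code> reserves one slot of <code>BUFSZ</code> for the terminator. If it reads two consecutive codepoints from supplementary planes, the chunk ends in the middle of the second surrogate pair and the conversion fails. The program accepts a “fix” argument that makes it push a trailing unpaired lead surrogate back onto the stream for a future read.</p>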
<div class="sourceCode" id="cb7"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*** codeunit.c ***/</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustring.h&gt;</span></span>
<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/utf16.h&gt;</span></span>
<span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-10"><a href="#cb7-10" aria-hidden="true" tabindex="-1"></a><span class="co">/* BUFSZ set to be very small so that lines must be read in</span></span>
<span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a><span class="co"> * many chunks. Helps illustrate split surrogate pairs */</span></span>
<span id="cb7-12"><a href="#cb7-12" aria-hidden="true" tabindex="-1"></a><span class="pp">#define BUFSZ 4</span></span>
<span id="cb7-13"><a href="#cb7-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-14"><a href="#cb7-14" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> printHex<span class="op">(</span><span class="dt">const</span> UChar <span class="op">*</span>s<span class="op">)</span></span>
<span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb7-16"><a href="#cb7-16" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(*</span>s<span class="op">)</span></span>
<span id="cb7-17"><a href="#cb7-17" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;%x &quot;</span><span class="op">,</span> <span class="op">*</span>s<span class="op">++);</span></span>
<span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a>	putchar<span class="op">(</span><span class="ch">&#39;\n&#39;</span><span class="op">);</span></span>
<span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb7-20"><a href="#cb7-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-21"><a href="#cb7-21" aria-hidden="true" tabindex="-1"></a><span class="co">/* yeah, slightly annoying duplication */</span></span>
<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> printHex32<span class="op">(</span><span class="dt">const</span> UChar32 <span class="op">*</span>s<span class="op">)</span></span>
<span id="cb7-23"><a href="#cb7-23" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb7-24"><a href="#cb7-24" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(*</span>s<span class="op">)</span></span>
<span id="cb7-25"><a href="#cb7-25" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;%x &quot;</span><span class="op">,</span> <span class="op">*</span>s<span class="op">++);</span></span>
<span id="cb7-26"><a href="#cb7-26" aria-hidden="true" tabindex="-1"></a>	putchar<span class="op">(</span><span class="ch">&#39;\n&#39;</span><span class="op">);</span></span>
<span id="cb7-27"><a href="#cb7-27" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb7-28"><a href="#cb7-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-29"><a href="#cb7-29" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb7-30"><a href="#cb7-30" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb7-31"><a href="#cb7-31" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>in<span class="op">;</span></span>
<span id="cb7-32"><a href="#cb7-32" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* read line into ICU&#39;s default UTF-16 representation */</span></span>
<span id="cb7-33"><a href="#cb7-33" aria-hidden="true" tabindex="-1"></a>	UChar line<span class="op">[</span>BUFSZ<span class="op">];</span></span>
<span id="cb7-34"><a href="#cb7-34" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* A buffer to hold codepoints of &quot;line&quot; as UTF-32 code</span></span>
<span id="cb7-35"><a href="#cb7-35" aria-hidden="true" tabindex="-1"></a><span class="co">	 * units.  The length is sufficient because it requires</span></span>
<span id="cb7-36"><a href="#cb7-36" aria-hidden="true" tabindex="-1"></a><span class="co">	 * fewer (or at least no greater) code units in UTF-32 to</span></span>
<span id="cb7-37"><a href="#cb7-37" aria-hidden="true" tabindex="-1"></a><span class="co">	 * encode the string */</span></span>
<span id="cb7-38"><a href="#cb7-38" aria-hidden="true" tabindex="-1"></a>	UChar32 codepoints<span class="op">[</span>BUFSZ<span class="op">];</span></span>
<span id="cb7-39"><a href="#cb7-39" aria-hidden="true" tabindex="-1"></a>	UChar <span class="op">*</span>final<span class="op">;</span></span>
<span id="cb7-40"><a href="#cb7-40" aria-hidden="true" tabindex="-1"></a>	UErrorCode err <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb7-41"><a href="#cb7-41" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-42"><a href="#cb7-42" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>in <span class="op">=</span> u_finit<span class="op">(</span>stdin<span class="op">,</span> NULL<span class="op">,</span> NULL<span class="op">)))</span></span>
<span id="cb7-43"><a href="#cb7-43" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb7-44"><a href="#cb7-44" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Error opening stdin as UFILE</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb7-45"><a href="#cb7-45" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb7-46"><a href="#cb7-46" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb7-47"><a href="#cb7-47" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-48"><a href="#cb7-48" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* read lines one small BUFSZ chunk at a time */</span></span>
<span id="cb7-49"><a href="#cb7-49" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>u_fgets<span class="op">(</span>line<span class="op">,</span> BUFSZ<span class="op">,</span> in<span class="op">))</span></span>
<span id="cb7-50"><a href="#cb7-50" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb7-51"><a href="#cb7-51" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* correct for split surrogate pairs only</span></span>
<span id="cb7-52"><a href="#cb7-52" aria-hidden="true" tabindex="-1"></a><span class="co">		 * if the &quot;fix&quot; argument is present */</span></span>
<span id="cb7-53"><a href="#cb7-53" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>argc <span class="op">&gt;</span> <span class="dv">1</span> <span class="op">&amp;&amp;</span> strcmp<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="st">&quot;fix&quot;</span><span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb7-54"><a href="#cb7-54" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb7-55"><a href="#cb7-55" aria-hidden="true" tabindex="-1"></a>			final <span class="op">=</span> line <span class="op">+</span> u_strlen<span class="op">(</span>line<span class="op">);</span></span>
<span id="cb7-56"><a href="#cb7-56" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* want to consider the character before \0</span></span>
<span id="cb7-57"><a href="#cb7-57" aria-hidden="true" tabindex="-1"></a><span class="co">			 * if such exists */</span></span>
<span id="cb7-58"><a href="#cb7-58" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span>final <span class="op">&gt;</span> line<span class="op">)</span></span>
<span id="cb7-59"><a href="#cb7-59" aria-hidden="true" tabindex="-1"></a>				final<span class="op">--;</span></span>
<span id="cb7-60"><a href="#cb7-60" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* if it is the lead unit of a surrogate pair */</span></span>
<span id="cb7-61"><a href="#cb7-61" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span>U16_IS_LEAD<span class="op">(*</span>final<span class="op">))</span></span>
<span id="cb7-62"><a href="#cb7-62" aria-hidden="true" tabindex="-1"></a>			<span class="op">{</span></span>
<span id="cb7-63"><a href="#cb7-63" aria-hidden="true" tabindex="-1"></a>				<span class="co">/* push it back for a future read, and</span></span>
<span id="cb7-64"><a href="#cb7-64" aria-hidden="true" tabindex="-1"></a><span class="co">				 * truncate the string */</span></span>
<span id="cb7-65"><a href="#cb7-65" aria-hidden="true" tabindex="-1"></a>				u_fungetc<span class="op">(*</span>final<span class="op">,</span> in<span class="op">);</span></span>
<span id="cb7-66"><a href="#cb7-66" aria-hidden="true" tabindex="-1"></a>				<span class="op">*</span>final <span class="op">=</span> <span class="ch">&#39;\0&#39;</span><span class="op">;</span></span>
<span id="cb7-67"><a href="#cb7-67" aria-hidden="true" tabindex="-1"></a>			<span class="op">}</span></span>
<span id="cb7-68"><a href="#cb7-68" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb7-69"><a href="#cb7-69" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-70"><a href="#cb7-70" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;UTF-16    : &quot;</span><span class="op">);</span></span>
<span id="cb7-71"><a href="#cb7-71" aria-hidden="true" tabindex="-1"></a>		printHex<span class="op">(</span>line<span class="op">);</span></span>
<span id="cb7-72"><a href="#cb7-72" aria-hidden="true" tabindex="-1"></a>		u_strToUTF32<span class="op">(</span></span>
<span id="cb7-73"><a href="#cb7-73" aria-hidden="true" tabindex="-1"></a>			codepoints<span class="op">,</span> BUFSZ<span class="op">,</span> NULL<span class="op">,</span></span>
<span id="cb7-74"><a href="#cb7-74" aria-hidden="true" tabindex="-1"></a>			line<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> <span class="op">&amp;</span>err<span class="op">);</span></span>
<span id="cb7-75"><a href="#cb7-75" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;Error?    : %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> u_errorName<span class="op">(</span>err<span class="op">));</span></span>
<span id="cb7-76"><a href="#cb7-76" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;Codepoints: &quot;</span><span class="op">);</span></span>
<span id="cb7-77"><a href="#cb7-77" aria-hidden="true" tabindex="-1"></a>		printHex32<span class="op">(</span>codepoints<span class="op">);</span></span>
<span id="cb7-78"><a href="#cb7-78" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-79"><a href="#cb7-79" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* reset potential errors and go for another chunk */</span></span>
<span id="cb7-80"><a href="#cb7-80" aria-hidden="true" tabindex="-1"></a>		err <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb7-81"><a href="#cb7-81" aria-hidden="true" tabindex="-1"></a>		<span class="op">*</span>codepoints <span class="op">=</span> <span class="ch">&#39;\0&#39;</span><span class="op">;</span></span>
<span id="cb7-82"><a href="#cb7-82" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb7-83"><a href="#cb7-83" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-84"><a href="#cb7-84" aria-hidden="true" tabindex="-1"></a>	u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb7-85"><a href="#cb7-85" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb7-86"><a href="#cb7-86" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>If the program reads two weird numerals 𝟘𝟙 (different from 01), neither of which is in the BMP, it finds one codepoint but chokes on the broken pair, because the surrogate pair for the second numeral is split across two reads:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="at">-n</span> 𝟘𝟙 <span class="kw">|</span> <span class="ex">./codeunit</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a><span class="ex">UTF-16</span>    : d835 dfd8 d835</span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a><span class="ex">Error?</span>    : U_INVALID_CHAR_FOUND</span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a><span class="ex">Codepoints:</span> 1d7d8</span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a><span class="ex">UTF-16</span>    : dfd9</span>
<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a><span class="ex">Error?</span>    : U_INVALID_CHAR_FOUND</span>
<span id="cb8-7"><a href="#cb8-7" aria-hidden="true" tabindex="-1"></a><span class="ex">Codepoints:</span></span></code></pre></div>
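<p>To see why the first three code units yield exactly one codepoint, it helps to do the surrogate arithmetic by hand. The sketch below is plain C rather than ICU; the helper names <code>is_lead</code>, <code>is_trail</code> and <code>decode_pair</code> are made up for illustration, and ICU’s own macros in <code>utf16.h</code> are what you would use in practice.</p>

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; ICU ships its own macros. */

/* Lead (high) surrogates occupy 0xD800..0xDBFF. */
static int is_lead(uint16_t u)  { return (u & 0xFC00) == 0xD800; }

/* Trail (low) surrogates occupy 0xDC00..0xDFFF. */
static int is_trail(uint16_t u) { return (u & 0xFC00) == 0xDC00; }

/* Combine a lead/trail pair into a codepoint. Returns -1 when the two
 * units do not form a valid pair, for instance a lead surrogate that is
 * followed by the terminating zero of a truncated buffer. */
static long decode_pair(uint16_t lead, uint16_t trail)
{
	if (!is_lead(lead) || !is_trail(trail))
		return -1;
	return 0x10000L
	     + ((long)(lead - 0xD800) << 10)
	     + (long)(trail - 0xDC00);
}
```

<p>Feeding it <code>d835 dfd8</code> gives <code>1d7d8</code>, matching the output above, while the leftover <code>d835</code> has no trail unit to pair with and cannot be decoded on its own.</p>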
<p>However, if we pass the “fix” argument, the program reads two complete codepoints:</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="at">-n</span> 𝟘𝟙 <span class="kw">|</span> <span class="ex">./codeunit</span> fix</span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="ex">UTF-16</span>    : d835 dfd8</span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a><span class="ex">Error?</span>    : U_ZERO_ERROR</span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a><span class="ex">Codepoints:</span> 1d7d8</span>
<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a><span class="ex">UTF-16</span>    : d835 dfd9</span>
<span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a><span class="ex">Error?</span>    : U_ZERO_ERROR</span>
<span id="cb9-7"><a href="#cb9-7" aria-hidden="true" tabindex="-1"></a><span class="ex">Codepoints:</span> 1d7d9</span></code></pre></div>
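<p>The “fix” boils down to one check on the last code unit of each chunk. Here is a minimal sketch of that idea in plain C, without ICU’s stream functions. The name <code>trim_dangling_lead</code> is hypothetical, and where the real program pushes the unit back with <code>u_fungetc</code>, this version simply hands it back to the caller.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: given a zero-terminated buffer of UTF-16 code
 * units, cut off a final lead surrogate that has no trail unit after it.
 * Returns the unit that was removed, or 0 if nothing was removed. The
 * caller is expected to prepend a returned unit to the next chunk. */
static uint16_t trim_dangling_lead(uint16_t *buf)
{
	size_t len = 0;
	uint16_t last;

	while (buf[len] != 0)
		len++;
	if (len == 0)
		return 0;

	last = buf[len - 1];
	if ((last & 0xFC00) != 0xD800)
		return 0;          /* final unit is not a lead surrogate */

	buf[len - 1] = 0;          /* truncate the string */
	return last;
}
```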
<p>Perhaps a better way to read a line with limited length is to use a “break iterator” to stop on a word boundary. We’ll see more about that later.</p>
<h3 id="extracting-iterating-codepoints-in-utf-16-string">Extracting and iterating codepoints in a UTF-16 string</h3>
<p>Our next example will rather laboriously remove diacritical marks from a string. There’s an easier way to do this called “transformation,” but doing it manually provides an opportunity to decompose characters and iterate over them with the <code>U16_NEXT</code> macro.</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*** nomarks.c ***/</span></span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/uchar.h&gt;</span></span>
<span id="cb10-6"><a href="#cb10-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/unorm2.h&gt;</span></span>
<span id="cb10-7"><a href="#cb10-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb10-8"><a href="#cb10-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/utf16.h&gt;</span></span>
<span id="cb10-9"><a href="#cb10-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-10"><a href="#cb10-10" aria-hidden="true" tabindex="-1"></a><span class="co">/* Limit to how many decomposed UTF-16 units a single</span></span>
<span id="cb10-11"><a href="#cb10-11" aria-hidden="true" tabindex="-1"></a><span class="co"> * codepoint will become in NFD. I don&#39;t know the</span></span>
<span id="cb10-12"><a href="#cb10-12" aria-hidden="true" tabindex="-1"></a><span class="co"> * correct value here so I chose a value that seems</span></span>
<span id="cb10-13"><a href="#cb10-13" aria-hidden="true" tabindex="-1"></a><span class="co"> * to be overkill */</span></span>
<span id="cb10-14"><a href="#cb10-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#define MAX_DECOMP_LEN 16</span></span>
<span id="cb10-15"><a href="#cb10-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-16"><a href="#cb10-16" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb10-17"><a href="#cb10-17" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb10-18"><a href="#cb10-18" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> i<span class="op">,</span> n<span class="op">;</span></span>
<span id="cb10-19"><a href="#cb10-19" aria-hidden="true" tabindex="-1"></a>	UChar32 c<span class="op">;</span></span>
<span id="cb10-20"><a href="#cb10-20" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>in<span class="op">,</span> <span class="op">*</span>out<span class="op">;</span></span>
<span id="cb10-21"><a href="#cb10-21" aria-hidden="true" tabindex="-1"></a>	UChar decomp<span class="op">[</span>MAX_DECOMP_LEN<span class="op">];</span></span>
<span id="cb10-22"><a href="#cb10-22" aria-hidden="true" tabindex="-1"></a>	UErrorCode status <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb10-23"><a href="#cb10-23" aria-hidden="true" tabindex="-1"></a>	UNormalizer2 <span class="op">*</span>norm<span class="op">;</span></span>
<span id="cb10-24"><a href="#cb10-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-25"><a href="#cb10-25" aria-hidden="true" tabindex="-1"></a>	out <span class="op">=</span> u_get_stdout<span class="op">();</span></span>
<span id="cb10-26"><a href="#cb10-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-27"><a href="#cb10-27" aria-hidden="true" tabindex="-1"></a>	in <span class="op">=</span> u_finit<span class="op">(</span>stdin<span class="op">,</span> NULL<span class="op">,</span> NULL<span class="op">);</span></span>
<span id="cb10-28"><a href="#cb10-28" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>in<span class="op">)</span></span>
<span id="cb10-29"><a href="#cb10-29" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb10-30"><a href="#cb10-30" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* using stdio functions with stderr and ustdio</span></span>
<span id="cb10-31"><a href="#cb10-31" aria-hidden="true" tabindex="-1"></a><span class="co">		 * with stdout. Mixing the two on a single file</span></span>
<span id="cb10-32"><a href="#cb10-32" aria-hidden="true" tabindex="-1"></a><span class="co">		 * handle would probably be bad. */</span></span>
<span id="cb10-33"><a href="#cb10-33" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Error opening stdin as UFILE</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb10-34"><a href="#cb10-34" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb10-35"><a href="#cb10-35" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb10-36"><a href="#cb10-36" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-37"><a href="#cb10-37" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* create a normalizer, in this case one going to NFD */</span></span>
<span id="cb10-38"><a href="#cb10-38" aria-hidden="true" tabindex="-1"></a>	norm <span class="op">=</span> <span class="op">(</span>UNormalizer2 <span class="op">*)</span>unorm2_getNFDInstance<span class="op">(&amp;</span>status<span class="op">);</span></span>
<span id="cb10-39"><a href="#cb10-39" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>U_FAILURE<span class="op">(</span>status<span class="op">))</span> <span class="op">{</span></span>
<span id="cb10-40"><a href="#cb10-40" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb10-41"><a href="#cb10-41" aria-hidden="true" tabindex="-1"></a>			<span class="st">&quot;unorm2_getNFDInstance(): %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb10-42"><a href="#cb10-42" aria-hidden="true" tabindex="-1"></a>			u_errorName<span class="op">(</span>status<span class="op">));</span></span>
<span id="cb10-43"><a href="#cb10-43" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb10-44"><a href="#cb10-44" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb10-45"><a href="#cb10-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-46"><a href="#cb10-46" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* consume input as UTF-32 units one by one */</span></span>
<span id="cb10-47"><a href="#cb10-47" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">((</span>c <span class="op">=</span> u_fgetcx<span class="op">(</span>in<span class="op">))</span> <span class="op">!=</span> U_EOF<span class="op">)</span></span>
<span id="cb10-48"><a href="#cb10-48" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb10-49"><a href="#cb10-49" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* Decompose c to isolate its n combining character</span></span>
<span id="cb10-50"><a href="#cb10-50" aria-hidden="true" tabindex="-1"></a><span class="co">		 * codepoints. Saves them as UTF-16 code units.  FYI,</span></span>
<span id="cb10-51"><a href="#cb10-51" aria-hidden="true" tabindex="-1"></a><span class="co">		 * this function ignores the mode of &quot;norm&quot; and always</span></span>
<span id="cb10-52"><a href="#cb10-52" aria-hidden="true" tabindex="-1"></a><span class="co">		 * decomposes */</span></span>
<span id="cb10-53"><a href="#cb10-53" aria-hidden="true" tabindex="-1"></a>		n <span class="op">=</span> unorm2_getDecomposition<span class="op">(</span></span>
<span id="cb10-54"><a href="#cb10-54" aria-hidden="true" tabindex="-1"></a>			norm<span class="op">,</span> c<span class="op">,</span> decomp<span class="op">,</span> MAX_DECOMP_LEN<span class="op">,</span> <span class="op">&amp;</span>status</span>
<span id="cb10-55"><a href="#cb10-55" aria-hidden="true" tabindex="-1"></a>		<span class="op">);</span></span>
<span id="cb10-56"><a href="#cb10-56" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-57"><a href="#cb10-57" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>U_FAILURE<span class="op">(</span>status<span class="op">))</span> <span class="op">{</span></span>
<span id="cb10-58"><a href="#cb10-58" aria-hidden="true" tabindex="-1"></a>			fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb10-59"><a href="#cb10-59" aria-hidden="true" tabindex="-1"></a>				<span class="st">&quot;unorm2_getDecomposition(): %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb10-60"><a href="#cb10-60" aria-hidden="true" tabindex="-1"></a>				u_errorName<span class="op">(</span>status<span class="op">));</span></span>
<span id="cb10-61"><a href="#cb10-61" aria-hidden="true" tabindex="-1"></a>			u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb10-62"><a href="#cb10-62" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb10-63"><a href="#cb10-63" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb10-64"><a href="#cb10-64" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-65"><a href="#cb10-65" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* if c does not decompose and is not itself</span></span>
<span id="cb10-66"><a href="#cb10-66" aria-hidden="true" tabindex="-1"></a><span class="co">		 * a diacritical mark */</span></span>
<span id="cb10-67"><a href="#cb10-67" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>n <span class="op">&lt;</span> <span class="dv">0</span> <span class="op">&amp;&amp;</span> ublock_getCode<span class="op">(</span>c<span class="op">)</span> <span class="op">!=</span></span>
<span id="cb10-68"><a href="#cb10-68" aria-hidden="true" tabindex="-1"></a>		    UBLOCK_COMBINING_DIACRITICAL_MARKS<span class="op">)</span></span>
<span id="cb10-69"><a href="#cb10-69" aria-hidden="true" tabindex="-1"></a>			u_fputc<span class="op">(</span>c<span class="op">,</span> out<span class="op">);</span></span>
<span id="cb10-70"><a href="#cb10-70" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-71"><a href="#cb10-71" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* walk canonical decomposition, reuse c variable */</span></span>
<span id="cb10-72"><a href="#cb10-72" aria-hidden="true" tabindex="-1"></a>		<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> n<span class="op">;</span> <span class="op">)</span></span>
<span id="cb10-73"><a href="#cb10-73" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb10-74"><a href="#cb10-74" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* the U16_NEXT macro iterates over UChar (aka</span></span>
<span id="cb10-75"><a href="#cb10-75" aria-hidden="true" tabindex="-1"></a><span class="co">			 * UTF-16), advancing by one or two elements as</span></span>
<span id="cb10-76"><a href="#cb10-76" aria-hidden="true" tabindex="-1"></a><span class="co">			 * needed to get a codepoint. It saves the result</span></span>
<span id="cb10-77"><a href="#cb10-77" aria-hidden="true" tabindex="-1"></a><span class="co">			 * in UTF-32. The macro updates i and c. */</span></span>
<span id="cb10-78"><a href="#cb10-78" aria-hidden="true" tabindex="-1"></a>			U16_NEXT<span class="op">(</span>decomp<span class="op">,</span> i<span class="op">,</span> n<span class="op">,</span> c<span class="op">);</span></span>
<span id="cb10-79"><a href="#cb10-79" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* output only if not combining diacritical */</span></span>
<span id="cb10-80"><a href="#cb10-80" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span>ublock_getCode<span class="op">(</span>c<span class="op">)</span> <span class="op">!=</span></span>
<span id="cb10-81"><a href="#cb10-81" aria-hidden="true" tabindex="-1"></a>			    UBLOCK_COMBINING_DIACRITICAL_MARKS<span class="op">)</span></span>
<span id="cb10-82"><a href="#cb10-82" aria-hidden="true" tabindex="-1"></a>				u_fputc<span class="op">(</span>c<span class="op">,</span> out<span class="op">);</span></span>
<span id="cb10-83"><a href="#cb10-83" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb10-84"><a href="#cb10-84" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb10-85"><a href="#cb10-85" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb10-86"><a href="#cb10-86" aria-hidden="true" tabindex="-1"></a>	u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb10-87"><a href="#cb10-87" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* u_get_stdout() doesn&#39;t need to be u_fclose&#39;d */</span></span>
<span id="cb10-88"><a href="#cb10-88" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb10-89"><a href="#cb10-89" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Here’s an example of running the program:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;résumé façade&quot;</span> <span class="kw">|</span> <span class="ex">./nomarks</span></span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="ex">resume</span> facade</span></code></pre></div>
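<p>If the macro and the block lookup feel opaque, the following plain C sketch approximates what the inner loop does on an already decomposed buffer. It is not ICU code: <code>next_codepoint</code> is a hypothetical stand-in for <code>U16_NEXT</code>, and <code>is_combining_diacritical</code> hard-codes the U+0300..U+036F range of the Combining Diacritical Marks block instead of calling <code>ublock_getCode</code>.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for U16_NEXT: read one codepoint from s[*i],
 * advancing *i by one unit, or by two for a well-formed surrogate pair.
 * An unpaired surrogate is returned as-is. */
static uint32_t next_codepoint(const uint16_t *s, size_t *i, size_t n)
{
	uint32_t c = s[(*i)++];

	if ((c & 0xFC00) == 0xD800 && *i < n && (s[*i] & 0xFC00) == 0xDC00) {
		c = 0x10000 + ((c - 0xD800) << 10) + (s[*i] - 0xDC00);
		(*i)++;
	}
	return c;
}

/* The Combining Diacritical Marks block spans U+0300..U+036F. */
static int is_combining_diacritical(uint32_t c)
{
	return c >= 0x0300 && c <= 0x036F;
}

/* Count the codepoints in s[0..n) that survive mark removal. */
static size_t count_unmarked(const uint16_t *s, size_t n)
{
	size_t i = 0, kept = 0;

	while (i < n)
		if (!is_combining_diacritical(next_codepoint(s, &i, n)))
			kept++;
	return kept;
}
```

<p>For the decomposed form of “é” (U+0065 followed by U+0301), only the base letter survives, which is the effect seen in the output above.</p>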
<h3 id="transformation">Transformation</h3>
<p>ICU provides a rich domain-specific language for <a href="http://userguide.icu-project.org/transforms/general">transforming strings</a>. For example, our entire program in the previous section can be replaced by the transformation <code>NFD; [:Nonspacing Mark:] Remove; NFC</code>. This tells ICU to perform a canonical decomposition, remove nonspacing marks, and then canonically compose again. (In fact our program above didn’t re-compose.)</p>
<p>The program below echoes stdin to stdout, but passes the output through a transformation.</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*** trans-stream.c ***/</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustring.h&gt;</span></span>
<span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/utrans.h&gt;</span></span>
<span id="cb12-9"><a href="#cb12-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-10"><a href="#cb12-10" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb12-11"><a href="#cb12-11" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb12-12"><a href="#cb12-12" aria-hidden="true" tabindex="-1"></a>	UChar32 c<span class="op">;</span></span>
<span id="cb12-13"><a href="#cb12-13" aria-hidden="true" tabindex="-1"></a>	UParseError pe<span class="op">;</span></span>
<span id="cb12-14"><a href="#cb12-14" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>in<span class="op">,</span> <span class="op">*</span>out<span class="op">;</span></span>
<span id="cb12-15"><a href="#cb12-15" aria-hidden="true" tabindex="-1"></a>	UTransliterator <span class="op">*</span>t<span class="op">;</span></span>
<span id="cb12-16"><a href="#cb12-16" aria-hidden="true" tabindex="-1"></a>	UErrorCode status <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb12-17"><a href="#cb12-17" aria-hidden="true" tabindex="-1"></a>	UChar <span class="op">*</span>xform_id<span class="op">;</span></span>
<span id="cb12-18"><a href="#cb12-18" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> n<span class="op">;</span></span>
<span id="cb12-19"><a href="#cb12-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-20"><a href="#cb12-20" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">!=</span> <span class="dv">2</span><span class="op">)</span></span>
<span id="cb12-21"><a href="#cb12-21" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb12-22"><a href="#cb12-22" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb12-23"><a href="#cb12-23" aria-hidden="true" tabindex="-1"></a>			<span class="st">&quot;Usage: %s </span><span class="sc">\&quot;</span><span class="st">transliterator ID</span><span class="sc">\&quot;\n</span><span class="st">&quot;</span><span class="op">,</span> argv<span class="op">[</span><span class="dv">0</span><span class="op">]);</span></span>
<span id="cb12-24"><a href="#cb12-24" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb12-25"><a href="#cb12-25" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb12-26"><a href="#cb12-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-27"><a href="#cb12-27" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* the UTF-16 string should never be longer than the UTF-8</span></span>
<span id="cb12-28"><a href="#cb12-28" aria-hidden="true" tabindex="-1"></a><span class="co">	 * argv[1], so this should be safe */</span></span>
<span id="cb12-29"><a href="#cb12-29" aria-hidden="true" tabindex="-1"></a>	n <span class="op">=</span> strlen<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">])</span> <span class="op">+</span> <span class="dv">1</span><span class="op">;</span></span>
<span id="cb12-30"><a href="#cb12-30" aria-hidden="true" tabindex="-1"></a>	xform_id <span class="op">=</span> malloc<span class="op">(</span>n <span class="op">*</span> <span class="kw">sizeof</span><span class="op">(</span>UChar<span class="op">));</span></span>
<span id="cb12-31"><a href="#cb12-31" aria-hidden="true" tabindex="-1"></a>	u_strFromUTF8<span class="op">(</span>xform_id<span class="op">,</span> n<span class="op">,</span> NULL<span class="op">,</span> argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb12-32"><a href="#cb12-32" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-33"><a href="#cb12-33" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* create transliterator by identifier */</span></span>
<span id="cb12-34"><a href="#cb12-34" aria-hidden="true" tabindex="-1"></a>	t <span class="op">=</span> utrans_openU<span class="op">(</span>xform_id<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> UTRANS_FORWARD<span class="op">,</span></span>
<span id="cb12-35"><a href="#cb12-35" aria-hidden="true" tabindex="-1"></a>	                 NULL<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> <span class="op">&amp;</span>pe<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb12-36"><a href="#cb12-36" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* don&#39;t need the identifier any more */</span></span>
<span id="cb12-37"><a href="#cb12-37" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>xform_id<span class="op">);</span></span>
<span id="cb12-38"><a href="#cb12-38" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>U_FAILURE<span class="op">(</span>status<span class="op">))</span> <span class="op">{</span></span>
<span id="cb12-39"><a href="#cb12-39" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;utrans_openU(%s): %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb12-40"><a href="#cb12-40" aria-hidden="true" tabindex="-1"></a>		        argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> u_errorName<span class="op">(</span>status<span class="op">));</span></span>
<span id="cb12-41"><a href="#cb12-41" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb12-42"><a href="#cb12-42" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb12-43"><a href="#cb12-43" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-44"><a href="#cb12-44" aria-hidden="true" tabindex="-1"></a>	out <span class="op">=</span> u_get_stdout<span class="op">();</span></span>
<span id="cb12-45"><a href="#cb12-45" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>in <span class="op">=</span> u_finit<span class="op">(</span>stdin<span class="op">,</span> NULL<span class="op">,</span> NULL<span class="op">)))</span></span>
<span id="cb12-46"><a href="#cb12-46" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb12-47"><a href="#cb12-47" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Error opening stdin as UFILE</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb12-48"><a href="#cb12-48" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb12-49"><a href="#cb12-49" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb12-50"><a href="#cb12-50" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-51"><a href="#cb12-51" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* transparently transliterate stdout */</span></span>
<span id="cb12-52"><a href="#cb12-52" aria-hidden="true" tabindex="-1"></a>	u_fsettransliterator<span class="op">(</span>out<span class="op">,</span> U_WRITE<span class="op">,</span> t<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb12-53"><a href="#cb12-53" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>U_FAILURE<span class="op">(</span>status<span class="op">))</span> <span class="op">{</span></span>
<span id="cb12-54"><a href="#cb12-54" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb12-55"><a href="#cb12-55" aria-hidden="true" tabindex="-1"></a>		        <span class="st">&quot;Failed to set transliterator on stdout: %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb12-56"><a href="#cb12-56" aria-hidden="true" tabindex="-1"></a>		        u_errorName<span class="op">(</span>status<span class="op">));</span></span>
<span id="cb12-57"><a href="#cb12-57" aria-hidden="true" tabindex="-1"></a>		u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb12-58"><a href="#cb12-58" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb12-59"><a href="#cb12-59" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb12-60"><a href="#cb12-60" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-61"><a href="#cb12-61" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* what looks like a simple echo loop actually</span></span>
<span id="cb12-62"><a href="#cb12-62" aria-hidden="true" tabindex="-1"></a><span class="co">	 * transliterates characters */</span></span>
<span id="cb12-63"><a href="#cb12-63" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">((</span>c <span class="op">=</span> u_fgetcx<span class="op">(</span>in<span class="op">))</span> <span class="op">!=</span> U_EOF<span class="op">)</span></span>
<span id="cb12-64"><a href="#cb12-64" aria-hidden="true" tabindex="-1"></a>		u_fputc<span class="op">(</span>c<span class="op">,</span> out<span class="op">);</span></span>
<span id="cb12-65"><a href="#cb12-65" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-66"><a href="#cb12-66" aria-hidden="true" tabindex="-1"></a>	utrans_close<span class="op">(</span>t<span class="op">);</span></span>
<span id="cb12-67"><a href="#cb12-67" aria-hidden="true" tabindex="-1"></a>	u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb12-68"><a href="#cb12-68" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>As mentioned, it can emulate our earlier “nomarks” program:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;résumé façade&quot;</span> <span class="kw">|</span> <span class="ex">./trans</span> <span class="st">&quot;NFD; [:Nonspacing Mark:] Remove; NFC&quot;</span></span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="ex">resume</span> facade</span></code></pre></div>
<p>It can also transliterate between scripts like this:</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;miirekkaḍiki veḷutunnaaru?&quot;</span> <span class="kw">|</span> <span class="ex">./trans</span> <span class="st">&quot;Telugu&quot;</span></span>
<span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="ex">మీరెక్కడికి</span> వెళుతున్నఅరు<span class="pp">?</span></span></code></pre></div>
<p>Applying the transformation to a stream with <code>u_fsettransliterator</code> is a simple way to do things. However, I did discover and file an ICU <a href="https://unicode-org.atlassian.net/browse/ICU-20486">bug</a>, which will be fixed in version 65.1.</p>
<p>A more robust way to apply transformations is by manipulating UChar strings directly. The technique is also probably more applicable in real applications.</p>
<p>Here’s a rewrite of trans-stream that operates on strings directly:</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*** trans-string.c ***/</span></span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb15-7"><a href="#cb15-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustring.h&gt;</span></span>
<span id="cb15-8"><a href="#cb15-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/utrans.h&gt;</span></span>
<span id="cb15-9"><a href="#cb15-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-10"><a href="#cb15-10" aria-hidden="true" tabindex="-1"></a><span class="co">/* max number of UTF-16 code units to accumulate while looking</span></span>
<span id="cb15-11"><a href="#cb15-11" aria-hidden="true" tabindex="-1"></a><span class="co"> * for an unambiguous transliteration. Has to be fairly long to</span></span>
<span id="cb15-12"><a href="#cb15-12" aria-hidden="true" tabindex="-1"></a><span class="co"> * handle names in Name-Any transliteration like</span></span>
<span id="cb15-13"><a href="#cb15-13" aria-hidden="true" tabindex="-1"></a><span class="co"> * \N{LATIN CAPITAL LETTER O WITH OGONEK AND MACRON} */</span></span>
<span id="cb15-14"><a href="#cb15-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#define CONTEXT 100</span></span>
<span id="cb15-15"><a href="#cb15-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-16"><a href="#cb15-16" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb15-17"><a href="#cb15-17" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb15-18"><a href="#cb15-18" aria-hidden="true" tabindex="-1"></a>	UErrorCode status <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb15-19"><a href="#cb15-19" aria-hidden="true" tabindex="-1"></a>	UChar c<span class="op">,</span> <span class="op">*</span>end<span class="op">;</span></span>
<span id="cb15-20"><a href="#cb15-20" aria-hidden="true" tabindex="-1"></a>	UChar input<span class="op">[</span>CONTEXT<span class="op">]</span> <span class="op">=</span> <span class="op">{</span><span class="dv">0</span><span class="op">},</span> <span class="op">*</span>buf<span class="op">,</span> <span class="op">*</span>enlarged<span class="op">;</span></span>
<span id="cb15-21"><a href="#cb15-21" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>in<span class="op">,</span> <span class="op">*</span>out<span class="op">;</span> </span>
<span id="cb15-22"><a href="#cb15-22" aria-hidden="true" tabindex="-1"></a>	UTransPosition pos<span class="op">;</span></span>
<span id="cb15-23"><a href="#cb15-23" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int32_t</span> width<span class="op">,</span> sizeNeeded<span class="op">,</span> bufLen<span class="op">;</span></span>
<span id="cb15-24"><a href="#cb15-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-25"><a href="#cb15-25" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> n<span class="op">;</span></span>
<span id="cb15-26"><a href="#cb15-26" aria-hidden="true" tabindex="-1"></a>	UChar <span class="op">*</span>xform_id<span class="op">;</span></span>
<span id="cb15-27"><a href="#cb15-27" aria-hidden="true" tabindex="-1"></a>	UTransliterator <span class="op">*</span>t<span class="op">;</span></span>
<span id="cb15-28"><a href="#cb15-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-29"><a href="#cb15-29" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* bufLen must be able to hold at least CONTEXT, and</span></span>
<span id="cb15-30"><a href="#cb15-30" aria-hidden="true" tabindex="-1"></a><span class="co">	 * will be increased as needed for transliteration */</span></span>
<span id="cb15-31"><a href="#cb15-31" aria-hidden="true" tabindex="-1"></a>	bufLen <span class="op">=</span> CONTEXT<span class="op">;</span></span>
<span id="cb15-32"><a href="#cb15-32" aria-hidden="true" tabindex="-1"></a>	buf <span class="op">=</span> malloc<span class="op">(</span><span class="kw">sizeof</span><span class="op">(</span>UChar<span class="op">)</span> <span class="op">*</span> bufLen<span class="op">);</span></span>
<span id="cb15-33"><a href="#cb15-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-34"><a href="#cb15-34" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">!=</span> <span class="dv">2</span><span class="op">)</span></span>
<span id="cb15-35"><a href="#cb15-35" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-36"><a href="#cb15-36" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb15-37"><a href="#cb15-37" aria-hidden="true" tabindex="-1"></a>			<span class="st">&quot;Usage: %s </span><span class="sc">\&quot;</span><span class="st">translation rules</span><span class="sc">\&quot;\n</span><span class="st">&quot;</span><span class="op">,</span> argv<span class="op">[</span><span class="dv">0</span><span class="op">]);</span></span>
<span id="cb15-38"><a href="#cb15-38" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb15-39"><a href="#cb15-39" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-40"><a href="#cb15-40" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-41"><a href="#cb15-41" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* allocate and read identifier, like earlier example */</span></span>
<span id="cb15-42"><a href="#cb15-42" aria-hidden="true" tabindex="-1"></a>	n <span class="op">=</span> strlen<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">])</span> <span class="op">+</span> <span class="dv">1</span><span class="op">;</span></span>
<span id="cb15-43"><a href="#cb15-43" aria-hidden="true" tabindex="-1"></a>	xform_id <span class="op">=</span> malloc<span class="op">(</span>n <span class="op">*</span> <span class="kw">sizeof</span><span class="op">(</span>UChar<span class="op">));</span></span>
<span id="cb15-44"><a href="#cb15-44" aria-hidden="true" tabindex="-1"></a>	u_strFromUTF8<span class="op">(</span>xform_id<span class="op">,</span> n<span class="op">,</span> NULL<span class="op">,</span> argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb15-45"><a href="#cb15-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-46"><a href="#cb15-46" aria-hidden="true" tabindex="-1"></a>	t <span class="op">=</span> utrans_openU<span class="op">(</span>xform_id<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> UTRANS_FORWARD<span class="op">,</span></span>
<span id="cb15-47"><a href="#cb15-47" aria-hidden="true" tabindex="-1"></a>	                 NULL<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> NULL<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb15-48"><a href="#cb15-48" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>xform_id<span class="op">);</span></span>
<span id="cb15-49"><a href="#cb15-49" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>U_FAILURE<span class="op">(</span>status<span class="op">))</span> <span class="op">{</span></span>
<span id="cb15-50"><a href="#cb15-50" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;utrans_openU(%s): %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb15-51"><a href="#cb15-51" aria-hidden="true" tabindex="-1"></a>		        argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> u_errorName<span class="op">(</span>status<span class="op">));</span></span>
<span id="cb15-52"><a href="#cb15-52" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb15-53"><a href="#cb15-53" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-54"><a href="#cb15-54" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-55"><a href="#cb15-55" aria-hidden="true" tabindex="-1"></a>	out <span class="op">=</span> u_get_stdout<span class="op">();</span></span>
<span id="cb15-56"><a href="#cb15-56" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>in <span class="op">=</span> u_finit<span class="op">(</span>stdin<span class="op">,</span> NULL<span class="op">,</span> NULL<span class="op">)))</span></span>
<span id="cb15-57"><a href="#cb15-57" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-58"><a href="#cb15-58" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Error opening stdin as UFILE</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb15-59"><a href="#cb15-59" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb15-60"><a href="#cb15-60" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-61"><a href="#cb15-61" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-62"><a href="#cb15-62" aria-hidden="true" tabindex="-1"></a>	end <span class="op">=</span> input<span class="op">;</span></span>
<span id="cb15-63"><a href="#cb15-63" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* append UTF-16 code units one at a time for incremental</span></span>
<span id="cb15-64"><a href="#cb15-64" aria-hidden="true" tabindex="-1"></a><span class="co">	 * transliteration */</span></span>
<span id="cb15-65"><a href="#cb15-65" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">((</span>c <span class="op">=</span> u_fgetc<span class="op">(</span>in<span class="op">))</span> <span class="op">!=</span> U_EOF<span class="op">)</span></span>
<span id="cb15-66"><a href="#cb15-66" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-67"><a href="#cb15-67" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* we consider at most CONTEXT consecutive code units</span></span>
<span id="cb15-68"><a href="#cb15-68" aria-hidden="true" tabindex="-1"></a><span class="co">		 * for transliteration (minus one for \0) */</span></span>
<span id="cb15-69"><a href="#cb15-69" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>end <span class="op">-</span> input <span class="op">&gt;=</span> CONTEXT<span class="op">-</span><span class="dv">1</span><span class="op">)</span></span>
<span id="cb15-70"><a href="#cb15-70" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb15-71"><a href="#cb15-71" aria-hidden="true" tabindex="-1"></a>			fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb15-72"><a href="#cb15-72" aria-hidden="true" tabindex="-1"></a>				<span class="st">&quot;Exceeded max (%i) code units &quot;</span></span>
<span id="cb15-73"><a href="#cb15-73" aria-hidden="true" tabindex="-1"></a>				<span class="st">&quot;for context.</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb15-74"><a href="#cb15-74" aria-hidden="true" tabindex="-1"></a>				CONTEXT<span class="op">);</span></span>
<span id="cb15-75"><a href="#cb15-75" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb15-76"><a href="#cb15-76" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb15-77"><a href="#cb15-77" aria-hidden="true" tabindex="-1"></a>		<span class="op">*</span>end<span class="op">++</span> <span class="op">=</span> c<span class="op">;</span></span>
<span id="cb15-78"><a href="#cb15-78" aria-hidden="true" tabindex="-1"></a>		<span class="op">*</span>end <span class="op">=</span> <span class="ch">&#39;\0&#39;</span><span class="op">;</span></span>
<span id="cb15-79"><a href="#cb15-79" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-80"><a href="#cb15-80" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* copy string so far to buf to operate on */</span></span>
<span id="cb15-81"><a href="#cb15-81" aria-hidden="true" tabindex="-1"></a>		u_strcpy<span class="op">(</span>buf<span class="op">,</span> input<span class="op">);</span></span>
<span id="cb15-82"><a href="#cb15-82" aria-hidden="true" tabindex="-1"></a>		pos<span class="op">.</span>start <span class="op">=</span> pos<span class="op">.</span>contextStart <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb15-83"><a href="#cb15-83" aria-hidden="true" tabindex="-1"></a>		pos<span class="op">.</span>limit <span class="op">=</span> pos<span class="op">.</span>contextLimit <span class="op">=</span> end <span class="op">-</span> input<span class="op">;</span></span>
<span id="cb15-84"><a href="#cb15-84" aria-hidden="true" tabindex="-1"></a>		sizeNeeded <span class="op">=</span> <span class="op">-</span><span class="dv">1</span><span class="op">;</span></span>
<span id="cb15-85"><a href="#cb15-85" aria-hidden="true" tabindex="-1"></a>		utrans_transIncrementalUChars<span class="op">(</span></span>
<span id="cb15-86"><a href="#cb15-86" aria-hidden="true" tabindex="-1"></a>			t<span class="op">,</span> buf<span class="op">,</span> <span class="op">&amp;</span>sizeNeeded<span class="op">,</span> bufLen<span class="op">,</span> <span class="op">&amp;</span>pos<span class="op">,</span> <span class="op">&amp;</span>status</span>
<span id="cb15-87"><a href="#cb15-87" aria-hidden="true" tabindex="-1"></a>		<span class="op">);</span></span>
<span id="cb15-88"><a href="#cb15-88" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* if buf not big enough for transliterated result */</span></span>
<span id="cb15-89"><a href="#cb15-89" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>status <span class="op">==</span> U_BUFFER_OVERFLOW_ERROR<span class="op">)</span></span>
<span id="cb15-90"><a href="#cb15-90" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb15-91"><a href="#cb15-91" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* utrans_transIncrementalUChars sets sizeNeeded,</span></span>
<span id="cb15-92"><a href="#cb15-92" aria-hidden="true" tabindex="-1"></a><span class="co">			 * so resize the buffer */</span></span>
<span id="cb15-93"><a href="#cb15-93" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">((</span>enlarged <span class="op">=</span></span>
<span id="cb15-94"><a href="#cb15-94" aria-hidden="true" tabindex="-1"></a>			     realloc<span class="op">(</span>buf<span class="op">,</span> <span class="kw">sizeof</span><span class="op">(</span>UChar<span class="op">)*</span>sizeNeeded<span class="op">))</span></span>
<span id="cb15-95"><a href="#cb15-95" aria-hidden="true" tabindex="-1"></a>			    <span class="op">==</span> NULL<span class="op">)</span></span>
<span id="cb15-96"><a href="#cb15-96" aria-hidden="true" tabindex="-1"></a>			<span class="op">{</span></span>
<span id="cb15-97"><a href="#cb15-97" aria-hidden="true" tabindex="-1"></a>				fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb15-98"><a href="#cb15-98" aria-hidden="true" tabindex="-1"></a>					<span class="st">&quot;Unable to grow buffer.</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">);</span></span>
<span id="cb15-99"><a href="#cb15-99" aria-hidden="true" tabindex="-1"></a>				<span class="co">/* fail gracefully and display</span></span>
<span id="cb15-100"><a href="#cb15-100" aria-hidden="true" tabindex="-1"></a><span class="co">				 * what we can */</span></span>
<span id="cb15-101"><a href="#cb15-101" aria-hidden="true" tabindex="-1"></a>				<span class="cf">break</span><span class="op">;</span></span>
<span id="cb15-102"><a href="#cb15-102" aria-hidden="true" tabindex="-1"></a>			<span class="op">}</span></span>
<span id="cb15-103"><a href="#cb15-103" aria-hidden="true" tabindex="-1"></a>			buf <span class="op">=</span> enlarged<span class="op">;</span></span>
<span id="cb15-104"><a href="#cb15-104" aria-hidden="true" tabindex="-1"></a>			bufLen <span class="op">=</span> sizeNeeded<span class="op">;</span></span>
<span id="cb15-105"><a href="#cb15-105" aria-hidden="true" tabindex="-1"></a>			u_strcpy<span class="op">(</span>buf<span class="op">,</span> input<span class="op">);</span></span>
<span id="cb15-106"><a href="#cb15-106" aria-hidden="true" tabindex="-1"></a>			pos<span class="op">.</span>start <span class="op">=</span> pos<span class="op">.</span>contextStart <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb15-107"><a href="#cb15-107" aria-hidden="true" tabindex="-1"></a>			pos<span class="op">.</span>limit <span class="op">=</span> pos<span class="op">.</span>contextLimit <span class="op">=</span> end <span class="op">-</span> input<span class="op">;</span></span>
<span id="cb15-108"><a href="#cb15-108" aria-hidden="true" tabindex="-1"></a>			sizeNeeded <span class="op">=</span> <span class="op">-</span><span class="dv">1</span><span class="op">;</span></span>
<span id="cb15-109"><a href="#cb15-109" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-110"><a href="#cb15-110" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* one more time, but with sufficient space */</span></span>
<span id="cb15-111"><a href="#cb15-111" aria-hidden="true" tabindex="-1"></a>			status <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb15-112"><a href="#cb15-112" aria-hidden="true" tabindex="-1"></a>			utrans_transIncrementalUChars<span class="op">(</span></span>
<span id="cb15-113"><a href="#cb15-113" aria-hidden="true" tabindex="-1"></a>				t<span class="op">,</span> buf<span class="op">,</span> <span class="op">&amp;</span>sizeNeeded<span class="op">,</span> bufLen<span class="op">,</span></span>
<span id="cb15-114"><a href="#cb15-114" aria-hidden="true" tabindex="-1"></a>				<span class="op">&amp;</span>pos<span class="op">,</span> <span class="op">&amp;</span>status</span>
<span id="cb15-115"><a href="#cb15-115" aria-hidden="true" tabindex="-1"></a>			<span class="op">);</span></span>
<span id="cb15-116"><a href="#cb15-116" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb15-117"><a href="#cb15-117" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* handle errors other than U_BUFFER_OVERFLOW_ERROR */</span></span>
<span id="cb15-118"><a href="#cb15-118" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>U_FAILURE<span class="op">(</span>status<span class="op">))</span> <span class="op">{</span></span>
<span id="cb15-119"><a href="#cb15-119" aria-hidden="true" tabindex="-1"></a>			fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb15-120"><a href="#cb15-120" aria-hidden="true" tabindex="-1"></a>				<span class="st">&quot;utrans_transIncrementalUChars(): %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb15-121"><a href="#cb15-121" aria-hidden="true" tabindex="-1"></a>				u_errorName<span class="op">(</span>status<span class="op">));</span></span>
<span id="cb15-122"><a href="#cb15-122" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb15-123"><a href="#cb15-123" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb15-124"><a href="#cb15-124" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-125"><a href="#cb15-125" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* print buf[0 .. pos.start - 1] */</span></span>
<span id="cb15-126"><a href="#cb15-126" aria-hidden="true" tabindex="-1"></a>		u_printf<span class="op">(</span><span class="st">&quot;%.*S&quot;</span><span class="op">,</span> pos<span class="op">.</span>start<span class="op">,</span> buf<span class="op">);</span></span>
<span id="cb15-127"><a href="#cb15-127" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-128"><a href="#cb15-128" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* Remove the code units which were processed,</span></span>
<span id="cb15-129"><a href="#cb15-129" aria-hidden="true" tabindex="-1"></a><span class="co">		 * shifting back the remaining ones which could</span></span>
<span id="cb15-130"><a href="#cb15-130" aria-hidden="true" tabindex="-1"></a><span class="co">		 * not be unambiguously transliterated. Then hit</span></span>
<span id="cb15-131"><a href="#cb15-131" aria-hidden="true" tabindex="-1"></a><span class="co">		 * the loop to get another code unit and try again. */</span></span>
<span id="cb15-132"><a href="#cb15-132" aria-hidden="true" tabindex="-1"></a>		u_strcpy<span class="op">(</span>input<span class="op">,</span> buf<span class="op">+</span>pos<span class="op">.</span>start<span class="op">);</span></span>
<span id="cb15-133"><a href="#cb15-133" aria-hidden="true" tabindex="-1"></a>		end <span class="op">=</span> input <span class="op">+</span> <span class="op">(</span>pos<span class="op">.</span>limit <span class="op">-</span> pos<span class="op">.</span>start<span class="op">);</span></span>
<span id="cb15-134"><a href="#cb15-134" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-135"><a href="#cb15-135" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-136"><a href="#cb15-136" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* if any leftovers from incremental transliteration */</span></span>
<span id="cb15-137"><a href="#cb15-137" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>end <span class="op">&gt;</span> input<span class="op">)</span></span>
<span id="cb15-138"><a href="#cb15-138" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb15-139"><a href="#cb15-139" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* transliterate input array in place, do our best */</span></span>
<span id="cb15-140"><a href="#cb15-140" aria-hidden="true" tabindex="-1"></a>		width <span class="op">=</span> end <span class="op">-</span> input<span class="op">;</span></span>
<span id="cb15-141"><a href="#cb15-141" aria-hidden="true" tabindex="-1"></a>		utrans_transUChars<span class="op">(</span></span>
<span id="cb15-142"><a href="#cb15-142" aria-hidden="true" tabindex="-1"></a>			t<span class="op">,</span> input<span class="op">,</span> NULL<span class="op">,</span> CONTEXT<span class="op">,</span> <span class="dv">0</span><span class="op">,</span> <span class="op">&amp;</span>width<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb15-143"><a href="#cb15-143" aria-hidden="true" tabindex="-1"></a>		u_printf<span class="op">(</span><span class="st">&quot;%S&quot;</span><span class="op">,</span> input<span class="op">);</span></span>
<span id="cb15-144"><a href="#cb15-144" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb15-145"><a href="#cb15-145" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb15-146"><a href="#cb15-146" aria-hidden="true" tabindex="-1"></a>	utrans_close<span class="op">(</span>t<span class="op">);</span></span>
<span id="cb15-147"><a href="#cb15-147" aria-hidden="true" tabindex="-1"></a>	u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb15-148"><a href="#cb15-148" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>buf<span class="op">);</span></span>
<span id="cb15-149"><a href="#cb15-149" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> U_SUCCESS<span class="op">(</span>status<span class="op">)</span> <span class="op">?</span> EXIT_SUCCESS <span class="op">:</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb15-150"><a href="#cb15-150" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
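<p>When the whole string is already in memory, the incremental bookkeeping is unnecessary. As a minimal sketch, assuming fixed-size buffers are acceptable, the hypothetical helper <code>translit_utf8</code> below (an illustration only, not an ICU function) converts UTF-8 to UChars, makes a single <code>utrans_transUChars</code> call, and converts the result back. The capacity <code>CAP</code> is an arbitrary choice for the sketch.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">/*** trans-oneshot.c (illustrative sketch) ***/

#include &lt;unicode/ustring.h&gt;
#include &lt;unicode/utrans.h&gt;

#define CAP 256 /* arbitrary capacity chosen for this sketch */

/* Hypothetical helper, not an ICU function: apply transform id
 * to UTF-8 src, writing NUL-terminated UTF-8 to dst.
 * Returns 0 on success, -1 on any ICU error or overflow. */
int translit_utf8(const char *id, const char *src,
                  char *dst, int32_t dstCap)
{
	UErrorCode status = U_ZERO_ERROR;
	UChar uid[CAP], text[CAP];
	int32_t idLen, len, limit;
	UTransliterator *t;

	/* explicit lengths, so the buffers need not be terminated */
	u_strFromUTF8(uid, CAP, &amp;idLen, id, -1, &amp;status);
	u_strFromUTF8(text, CAP, &amp;len, src, -1, &amp;status);
	t = utrans_openU(uid, idLen, UTRANS_FORWARD,
	                 NULL, -1, NULL, &amp;status);
	if (U_FAILURE(status))
		return -1;

	/* transliterate text[0 .. limit-1] in place; len and
	 * limit are updated to reflect the new length */
	limit = len;
	utrans_transUChars(t, text, &amp;len, CAP, 0, &amp;limit, &amp;status);
	utrans_close(t);
	if (U_FAILURE(status))
		return -1;

	status = U_ZERO_ERROR; /* clear any warnings */
	u_strToUTF8(dst, dstCap, NULL, text, len, &amp;status);
	/* anything but U_ZERO_ERROR means truncation or an
	 * unterminated result */
	return status == U_ZERO_ERROR ? 0 : -1;
}</code></pre></div>
<p>Calling <code>translit_utf8("NFD; [:Nonspacing Mark:] Remove; NFC", "résumé", out, sizeof out)</code> should leave “resume” in <code>out</code>, mirroring the command line example earlier. A real application would likely open the transliterator once and reuse it across calls rather than reopening it each time.</p>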
<h3 id="punycode">Punycode</h3>
<p>Punycode is a representation of Unicode within the limited ASCII character subset used for internet host names. If you enter a URL with a non-ASCII host name into a web browser navigation bar, the browser translates the host name to Punycode before making the actual DNS lookup.</p>
<p>The encoding is part of the more general process of Internationalizing Domain Names in Applications (IDNA), which also normalizes the string.</p>
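<p>To make the encoding step itself more concrete, here is a minimal sketch of the Punycode encoder described in RFC 3492, applied to a single label that has already been split into codepoints. The helper name <code>puny_encode</code> and its signature are hypothetical and not part of ICU. The sketch omits the overflow checks of the RFC's reference code, performs none of the IDNA normalization or validation, and leaves off the <code>xn--</code> prefix that marks an encoded label in DNS.</p>
<div class="sourceCode"><pre class="sourceCode c"><code class="sourceCode c">/*** puny-sketch.c (illustrative, not ICU) ***/

#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Bootstring parameters for Punycode, RFC 3492 section 5 */
enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700,
       INITIAL_BIAS = 72, INITIAL_N = 128 };

/* append one char, always leaving room for the final NUL */
#define PUT(c) do { if (o + 1 &gt;= cap) return -1; out[o++] = (c); } while (0)

/* map a digit 0..35 to 'a'..'z', '0'..'9' */
static char digit(uint32_t d)
{
	return (char)(d &lt; 26 ? 'a' + d : '0' + (d - 26));
}

/* bias adaptation, RFC 3492 section 6.1 */
static uint32_t adapt(uint32_t delta, uint32_t npoints, int first)
{
	uint32_t k = 0;

	delta = first ? delta / DAMP : delta / 2;
	delta += delta / npoints;
	for (; delta &gt; ((BASE - TMIN) * TMAX) / 2; k += BASE)
		delta /= BASE - TMIN;
	return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Hypothetical helper: encode one label given as codepoints.
 * Returns 0 on success, -1 if out is too small. */
int puny_encode(const uint32_t *in, size_t len, char *out, size_t cap)
{
	uint32_t n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
	uint32_t m, q, k, t;
	size_t i, h, b, o = 0;

	if (cap == 0)
		return -1;
	/* copy the basic (ASCII) codepoints first */
	for (i = 0; i &lt; len; i++)
		if (in[i] &lt; 0x80)
			PUT((char)in[i]);
	h = b = o;
	if (b &gt; 0)
		PUT('-');

	while (h &lt; len) {
		/* smallest codepoint not yet handled */
		m = UINT32_MAX;
		for (i = 0; i &lt; len; i++)
			if (in[i] &gt;= n &amp;&amp; in[i] &lt; m)
				m = in[i];
		delta += (m - n) * (uint32_t)(h + 1);
		n = m;
		for (i = 0; i &lt; len; i++) {
			if (in[i] &lt; n)
				delta++;
			if (in[i] != n)
				continue;
			/* emit delta as a variable-length integer */
			for (q = delta, k = BASE; ; k += BASE) {
				t = k &lt;= bias ? TMIN :
				    k &gt;= bias + TMAX ? TMAX : k - bias;
				if (q &lt; t)
					break;
				PUT(digit(t + (q - t) % (BASE - t)));
				q = (q - t) / (BASE - t);
			}
			PUT(digit(q));
			bias = adapt(delta, (uint32_t)(h + 1), h == b);
			delta = 0;
			h++;
		}
		delta++;
		n++;
	}
	out[o] = '\0';
	return 0;
}</code></pre></div>
<p>Under these assumptions the six codepoints of “bücher” encode to <code>bcher-kva</code>, which a browser would look up as <code>xn--bcher-kva</code>. The ASCII letters are copied through unchanged, and the trailing digits record where the non-ASCII codepoint is inserted and what its value is.</p>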
<p>Note that not all Unicode strings can be successfully encoded. For instance, codepoints like “⒈” include a period in the glyph and are used for numbered lists. Carrying that dot over into the ASCII hostname would inadvertently specify a subdomain. ICU turns the offending character into U+FFFD (the “replacement character”) in the output and returns an error.</p>
<p>The following program uses <code>uidna_nameToASCII</code> or <code>uidna_nameToUnicode</code> as needed to translate between Unicode and Punycode.</p>
<div class="sourceCode" id="cb16"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*** puny.c ***/</span></span>
<span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb16-4"><a href="#cb16-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb16-5"><a href="#cb16-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb16-6"><a href="#cb16-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-7"><a href="#cb16-7" aria-hidden="true" tabindex="-1"></a><span class="co">/* uidna stands for Internationalizing Domain Names in</span></span>
<span id="cb16-8"><a href="#cb16-8" aria-hidden="true" tabindex="-1"></a><span class="co"> * Applications and contains punycode routines */</span></span>
<span id="cb16-9"><a href="#cb16-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/uidna.h&gt;</span></span>
<span id="cb16-10"><a href="#cb16-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb16-11"><a href="#cb16-11" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustring.h&gt;</span></span>
<span id="cb16-12"><a href="#cb16-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-13"><a href="#cb16-13" aria-hidden="true" tabindex="-1"></a><span class="dt">void</span> chomp<span class="op">(</span>UChar <span class="op">*</span>s<span class="op">)</span></span>
<span id="cb16-14"><a href="#cb16-14" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb16-15"><a href="#cb16-15" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* unicode characters that split lines */</span></span>
<span id="cb16-16"><a href="#cb16-16" aria-hidden="true" tabindex="-1"></a>	UChar splits<span class="op">[]</span> <span class="op">=</span></span>
<span id="cb16-17"><a href="#cb16-17" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span><span class="bn">0xa</span><span class="op">,</span> <span class="bn">0xb</span><span class="op">,</span> <span class="bn">0xc</span><span class="op">,</span> <span class="bn">0xd</span><span class="op">,</span> <span class="bn">0x85</span><span class="op">,</span> <span class="bn">0x2028</span><span class="op">,</span> <span class="bn">0x2029</span><span class="op">,</span> <span class="ch">&#39;\0&#39;</span><span class="op">};</span></span>
<span id="cb16-18"><a href="#cb16-18" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>s<span class="op">)</span></span>
<span id="cb16-19"><a href="#cb16-19" aria-hidden="true" tabindex="-1"></a>		s<span class="op">[</span>u_strcspn<span class="op">(</span>s<span class="op">,</span> splits<span class="op">)]</span> <span class="op">=</span> <span class="ch">&#39;\0&#39;</span><span class="op">;</span></span>
<span id="cb16-20"><a href="#cb16-20" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb16-21"><a href="#cb16-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-22"><a href="#cb16-22" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb16-23"><a href="#cb16-23" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb16-24"><a href="#cb16-24" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>in<span class="op">;</span></span>
<span id="cb16-25"><a href="#cb16-25" aria-hidden="true" tabindex="-1"></a>	UChar input<span class="op">[</span><span class="dv">1024</span><span class="op">],</span> output<span class="op">[</span><span class="dv">1024</span><span class="op">];</span></span>
<span id="cb16-26"><a href="#cb16-26" aria-hidden="true" tabindex="-1"></a>	UIDNAInfo info <span class="op">=</span> UIDNA_INFO_INITIALIZER<span class="op">;</span></span>
<span id="cb16-27"><a href="#cb16-27" aria-hidden="true" tabindex="-1"></a>	UErrorCode status <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb16-28"><a href="#cb16-28" aria-hidden="true" tabindex="-1"></a>	UIDNA <span class="op">*</span>idna <span class="op">=</span> uidna_openUTS46<span class="op">(</span>UIDNA_DEFAULT<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb16-29"><a href="#cb16-29" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-30"><a href="#cb16-30" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* default action is encoding to punycode */</span></span>
<span id="cb16-31"><a href="#cb16-31" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int32_t</span> <span class="op">(*</span>action<span class="op">)(</span></span>
<span id="cb16-32"><a href="#cb16-32" aria-hidden="true" tabindex="-1"></a>			<span class="dt">const</span> UIDNA<span class="op">*,</span> <span class="dt">const</span> UChar<span class="op">*,</span> <span class="dt">int32_t</span><span class="op">,</span> UChar<span class="op">*,</span> </span>
<span id="cb16-33"><a href="#cb16-33" aria-hidden="true" tabindex="-1"></a>			<span class="dt">int32_t</span><span class="op">,</span> UIDNAInfo<span class="op">*,</span> UErrorCode<span class="op">*</span></span>
<span id="cb16-34"><a href="#cb16-34" aria-hidden="true" tabindex="-1"></a>		<span class="op">)</span> <span class="op">=</span> uidna_nameToASCII<span class="op">;</span></span>
<span id="cb16-35"><a href="#cb16-35" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-36"><a href="#cb16-36" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>in <span class="op">=</span> u_finit<span class="op">(</span>stdin<span class="op">,</span> NULL<span class="op">,</span> NULL<span class="op">)))</span></span>
<span id="cb16-37"><a href="#cb16-37" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb16-38"><a href="#cb16-38" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Error opening stdin as UFILE</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb16-39"><a href="#cb16-39" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb16-40"><a href="#cb16-40" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb16-41"><a href="#cb16-41" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-42"><a href="#cb16-42" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* the &quot;decode&quot; option reverses our action */</span></span>
<span id="cb16-43"><a href="#cb16-43" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">&gt;</span> <span class="dv">1</span> <span class="op">&amp;&amp;</span> strcmp<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="st">&quot;decode&quot;</span><span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb16-44"><a href="#cb16-44" aria-hidden="true" tabindex="-1"></a>		action <span class="op">=</span> uidna_nameToUnicode<span class="op">;</span></span>
<span id="cb16-45"><a href="#cb16-45" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-46"><a href="#cb16-46" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* u_fgets includes the newline, so we chomp it */</span></span>
<span id="cb16-47"><a href="#cb16-47" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!</span>u_fgets<span class="op">(</span>input<span class="op">,</span> <span class="kw">sizeof</span><span class="op">(</span>input<span class="op">)/</span><span class="kw">sizeof</span><span class="op">(*</span>input<span class="op">),</span> in<span class="op">))</span></span>
<span id="cb16-48"><a href="#cb16-48" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span> <span class="co">/* no input to convert */</span></span>
<span id="cb16-49"><a href="#cb16-49" aria-hidden="true" tabindex="-1"></a>	chomp<span class="op">(</span>input<span class="op">);</span></span>
<span id="cb16-50"><a href="#cb16-50" aria-hidden="true" tabindex="-1"></a>	action<span class="op">(</span>idna<span class="op">,</span> input<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> output<span class="op">,</span></span>
<span id="cb16-51"><a href="#cb16-51" aria-hidden="true" tabindex="-1"></a>		<span class="kw">sizeof</span><span class="op">(</span>output<span class="op">)/</span><span class="kw">sizeof</span><span class="op">(*</span>output<span class="op">),</span></span>
<span id="cb16-52"><a href="#cb16-52" aria-hidden="true" tabindex="-1"></a>		<span class="op">&amp;</span>info<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb16-53"><a href="#cb16-53" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-54"><a href="#cb16-54" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>U_SUCCESS<span class="op">(</span>status<span class="op">)</span> <span class="op">&amp;&amp;</span> info<span class="op">.</span>errors<span class="op">!=</span><span class="dv">0</span><span class="op">)</span></span>
<span id="cb16-55"><a href="#cb16-55" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Bad input.</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb16-56"><a href="#cb16-56" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-57"><a href="#cb16-57" aria-hidden="true" tabindex="-1"></a>	u_printf<span class="op">(</span><span class="st">&quot;%S</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> output<span class="op">);</span></span>
<span id="cb16-58"><a href="#cb16-58" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb16-59"><a href="#cb16-59" aria-hidden="true" tabindex="-1"></a>	uidna_close<span class="op">(</span>idna<span class="op">);</span></span>
<span id="cb16-60"><a href="#cb16-60" aria-hidden="true" tabindex="-1"></a>	u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb16-61"><a href="#cb16-61" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb16-62"><a href="#cb16-62" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Example of using the program:</p>
<div class="sourceCode" id="cb17"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;façade.com&quot;</span> <span class="kw">|</span> <span class="ex">./puny</span></span>
<span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a><span class="ex">xn--faade-zra.com</span></span>
<span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-4"><a href="#cb17-4" aria-hidden="true" tabindex="-1"></a><span class="co"># not every string is allowed</span></span>
<span id="cb17-5"><a href="#cb17-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb17-6"><a href="#cb17-6" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;a⒈.com&quot;</span> <span class="kw">|</span> <span class="ex">./puny</span></span>
<span id="cb17-7"><a href="#cb17-7" aria-hidden="true" tabindex="-1"></a><span class="ex">Bad</span> input.</span>
<span id="cb17-8"><a href="#cb17-8" aria-hidden="true" tabindex="-1"></a><span class="ex">a�.com</span></span></code></pre></div>
<h3 id="changing-case">Changing case</h3>
<p>The C standard library has functions like <code>toupper</code> which operate on a single character at a time. ICU has equivalents like <code>u_toupper</code>, but working on single codepoints isn’t sufficient for proper casing. Let’s examine a program that maps one codepoint at a time and see why.</p>
<div class="sourceCode" id="cb18"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb18-1"><a href="#cb18-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*** pointcase.c ***/</span></span>
<span id="cb18-2"><a href="#cb18-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-3"><a href="#cb18-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb18-4"><a href="#cb18-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb18-5"><a href="#cb18-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-6"><a href="#cb18-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/uchar.h&gt;</span></span>
<span id="cb18-7"><a href="#cb18-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb18-8"><a href="#cb18-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-9"><a href="#cb18-9" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb18-10"><a href="#cb18-10" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb18-11"><a href="#cb18-11" aria-hidden="true" tabindex="-1"></a>	UChar32 c<span class="op">;</span></span>
<span id="cb18-12"><a href="#cb18-12" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>in<span class="op">,</span> <span class="op">*</span>out<span class="op">;</span></span>
<span id="cb18-13"><a href="#cb18-13" aria-hidden="true" tabindex="-1"></a>	UChar32 <span class="op">(*</span>op<span class="op">)(</span>UChar32<span class="op">)</span> <span class="op">=</span> NULL<span class="op">;</span></span>
<span id="cb18-14"><a href="#cb18-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-15"><a href="#cb18-15" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* set op to one of the casing operations</span></span>
<span id="cb18-16"><a href="#cb18-16" aria-hidden="true" tabindex="-1"></a><span class="co">	 * in uchar.h */</span></span>
<span id="cb18-17"><a href="#cb18-17" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">&lt;</span> <span class="dv">2</span> <span class="op">||</span> strcmp<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="st">&quot;upper&quot;</span><span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb18-18"><a href="#cb18-18" aria-hidden="true" tabindex="-1"></a>		op <span class="op">=</span> u_toupper<span class="op">;</span></span>
<span id="cb18-19"><a href="#cb18-19" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>strcmp<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="st">&quot;lower&quot;</span><span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb18-20"><a href="#cb18-20" aria-hidden="true" tabindex="-1"></a>		op <span class="op">=</span> u_tolower<span class="op">;</span></span>
<span id="cb18-21"><a href="#cb18-21" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>strcmp<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="st">&quot;title&quot;</span><span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb18-22"><a href="#cb18-22" aria-hidden="true" tabindex="-1"></a>		op <span class="op">=</span> u_totitle<span class="op">;</span></span>
<span id="cb18-23"><a href="#cb18-23" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span></span>
<span id="cb18-24"><a href="#cb18-24" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-25"><a href="#cb18-25" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;Unrecognized case: %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> argv<span class="op">[</span><span class="dv">1</span><span class="op">]);</span></span>
<span id="cb18-26"><a href="#cb18-26" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb18-27"><a href="#cb18-27" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-28"><a href="#cb18-28" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-29"><a href="#cb18-29" aria-hidden="true" tabindex="-1"></a>	out <span class="op">=</span> u_get_stdout<span class="op">();</span></span>
<span id="cb18-30"><a href="#cb18-30" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>in <span class="op">=</span> u_finit<span class="op">(</span>stdin<span class="op">,</span> NULL<span class="op">,</span> NULL<span class="op">)))</span></span>
<span id="cb18-31"><a href="#cb18-31" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb18-32"><a href="#cb18-32" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Error opening stdin as UFILE</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb18-33"><a href="#cb18-33" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb18-34"><a href="#cb18-34" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb18-35"><a href="#cb18-35" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-36"><a href="#cb18-36" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* operates on UTF-32 */</span></span>
<span id="cb18-37"><a href="#cb18-37" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">((</span>c <span class="op">=</span> u_fgetcx<span class="op">(</span>in<span class="op">))</span> <span class="op">!=</span> U_EOF<span class="op">)</span></span>
<span id="cb18-38"><a href="#cb18-38" aria-hidden="true" tabindex="-1"></a>		u_fputc<span class="op">(</span>op<span class="op">(</span>c<span class="op">),</span> out<span class="op">);</span></span>
<span id="cb18-39"><a href="#cb18-39" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb18-40"><a href="#cb18-40" aria-hidden="true" tabindex="-1"></a>	u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb18-41"><a href="#cb18-41" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb18-42"><a href="#cb18-42" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<div class="sourceCode" id="cb19"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="co"># not quite right, ß should become SS:</span></span>
<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;Die große Stille&quot;</span> <span class="kw">|</span> <span class="ex">./pointcase</span> upper</span>
<span id="cb19-4"><a href="#cb19-4" aria-hidden="true" tabindex="-1"></a><span class="ex">DIE</span> GROßE STILLE</span>
<span id="cb19-5"><a href="#cb19-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-6"><a href="#cb19-6" aria-hidden="true" tabindex="-1"></a><span class="co"># also wrong, final sigma should be ς:</span></span>
<span id="cb19-7"><a href="#cb19-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb19-8"><a href="#cb19-8" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;ΣΊΣΥΦΟΣ&quot;</span> <span class="kw">|</span> <span class="ex">./pointcase</span> lower</span>
<span id="cb19-9"><a href="#cb19-9" aria-hidden="true" tabindex="-1"></a><span class="ex">σίσυφοσ</span></span></code></pre></div>
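<p>Both failures come from the shape of the per-codepoint functions: one codepoint in, one codepoint out, with no view of the neighbors. The following toy sketch in plain C makes that concrete. It does not use ICU; the names <code>toy_upper</code> and <code>toy_lower_sigma</code> and the two hard-coded rules are illustrative assumptions, a drastic simplification of the real Unicode casing data.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rough test for "Greek letter", good enough for this sketch only. */
static int toy_is_greek(uint32_t c)
{
	return c >= 0x0386 && c <= 0x03CE;
}

/* Uppercase ASCII letters and expand U+00DF (ß) to "SS".
 * dst must have room for 2*n codepoints. Returns the new
 * length, which can exceed n. */
size_t toy_upper(const uint32_t *src, size_t n, uint32_t *dst)
{
	size_t i, len = 0;

	assert(src && dst);
	for (i = 0; i < n; i++)
	{
		if (src[i] == 0x00DF)
		{
			/* one codepoint becomes two */
			dst[len++] = 'S';
			dst[len++] = 'S';
		}
		else if (src[i] >= 'a' && src[i] <= 'z')
			dst[len++] = src[i] - ('a' - 'A');
		else
			dst[len++] = src[i];
	}
	return len;
}

/* Lowercase only U+03A3 (Σ): use the final form ς (U+03C2) when
 * it follows a Greek letter and is not followed by one, otherwise
 * σ (U+03C3). Other codepoints are copied unchanged.
 * dst must have room for n codepoints. */
void toy_lower_sigma(const uint32_t *src, size_t n, uint32_t *dst)
{
	size_t i;

	assert(src && dst);
	for (i = 0; i < n; i++)
	{
		if (src[i] == 0x03A3)
		{
			/* the answer depends on the neighbors */
			int after = i > 0 && toy_is_greek(src[i-1]);
			int before = i + 1 < n && toy_is_greek(src[i+1]);
			dst[i] = (after && !before) ? 0x03C2 : 0x03C3;
		}
		else
			dst[i] = src[i];
	}
}
```

<p>Neither function fits the <code>UChar32 (*)(UChar32)</code> signature used in <code>pointcase.c</code>: the first returns a string of a different length, and the second has to look at neighboring codepoints.</p>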
<p>As you can see, some characters need to “expand” into more than one when their case changes (ß becomes SS), and others are position-sensitive (σ is written ς at the end of a word). To do this properly, we have to operate on entire strings rather than individual characters. Here is a program to do it right:</p>
<div class="sourceCode" id="cb20"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb20-1"><a href="#cb20-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*** strcase.c ***/</span></span>
<span id="cb20-2"><a href="#cb20-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-3"><a href="#cb20-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;locale.h&gt;</span></span>
<span id="cb20-4"><a href="#cb20-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb20-5"><a href="#cb20-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb20-6"><a href="#cb20-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-7"><a href="#cb20-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb20-8"><a href="#cb20-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustring.h&gt;</span></span>
<span id="cb20-9"><a href="#cb20-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-10"><a href="#cb20-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#define BUFSZ 1024</span></span>
<span id="cb20-11"><a href="#cb20-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-12"><a href="#cb20-12" aria-hidden="true" tabindex="-1"></a><span class="co">/* wrapper function for u_strToTitle with signature</span></span>
<span id="cb20-13"><a href="#cb20-13" aria-hidden="true" tabindex="-1"></a><span class="co"> * matching the other casing functions */</span></span>
<span id="cb20-14"><a href="#cb20-14" aria-hidden="true" tabindex="-1"></a><span class="dt">int32_t</span> title<span class="op">(</span>UChar <span class="op">*</span>dest<span class="op">,</span> <span class="dt">int32_t</span> destCapacity<span class="op">,</span></span>
<span id="cb20-15"><a href="#cb20-15" aria-hidden="true" tabindex="-1"></a>		<span class="dt">const</span> UChar <span class="op">*</span>src<span class="op">,</span> <span class="dt">int32_t</span> srcLength<span class="op">,</span></span>
<span id="cb20-16"><a href="#cb20-16" aria-hidden="true" tabindex="-1"></a>		<span class="dt">const</span> <span class="dt">char</span> <span class="op">*</span>locale<span class="op">,</span> UErrorCode <span class="op">*</span>pErrorCode<span class="op">)</span></span>
<span id="cb20-17"><a href="#cb20-17" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb20-18"><a href="#cb20-18" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> u_strToTitle<span class="op">(</span>dest<span class="op">,</span> destCapacity<span class="op">,</span> src<span class="op">,</span></span>
<span id="cb20-19"><a href="#cb20-19" aria-hidden="true" tabindex="-1"></a>			srcLength<span class="op">,</span> NULL<span class="op">,</span> locale<span class="op">,</span> pErrorCode<span class="op">);</span></span>
<span id="cb20-20"><a href="#cb20-20" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span>
<span id="cb20-21"><a href="#cb20-21" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-22"><a href="#cb20-22" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb20-23"><a href="#cb20-23" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb20-24"><a href="#cb20-24" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>in<span class="op">;</span></span>
<span id="cb20-25"><a href="#cb20-25" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> <span class="op">*</span>locale<span class="op">;</span></span>
<span id="cb20-26"><a href="#cb20-26" aria-hidden="true" tabindex="-1"></a>	UChar line<span class="op">[</span>BUFSZ<span class="op">],</span> cased<span class="op">[</span>BUFSZ<span class="op">];</span></span>
<span id="cb20-27"><a href="#cb20-27" aria-hidden="true" tabindex="-1"></a>	UErrorCode status <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb20-28"><a href="#cb20-28" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int32_t</span> <span class="op">(*</span>op<span class="op">)(</span></span>
<span id="cb20-29"><a href="#cb20-29" aria-hidden="true" tabindex="-1"></a>			UChar<span class="op">*,</span> <span class="dt">int32_t</span><span class="op">,</span> <span class="dt">const</span> UChar<span class="op">*,</span> <span class="dt">int32_t</span><span class="op">,</span></span>
<span id="cb20-30"><a href="#cb20-30" aria-hidden="true" tabindex="-1"></a>			<span class="dt">const</span> <span class="dt">char</span><span class="op">*,</span> UErrorCode<span class="op">*</span></span>
<span id="cb20-31"><a href="#cb20-31" aria-hidden="true" tabindex="-1"></a>		<span class="op">)</span> <span class="op">=</span> NULL<span class="op">;</span></span>
<span id="cb20-32"><a href="#cb20-32" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-33"><a href="#cb20-33" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* casing is locale-dependent */</span></span>
<span id="cb20-34"><a href="#cb20-34" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>locale <span class="op">=</span> setlocale<span class="op">(</span>LC_CTYPE<span class="op">,</span> <span class="st">&quot;&quot;</span><span class="op">)))</span></span>
<span id="cb20-35"><a href="#cb20-35" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb20-36"><a href="#cb20-36" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Cannot determine system locale</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb20-37"><a href="#cb20-37" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb20-38"><a href="#cb20-38" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb20-39"><a href="#cb20-39" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-40"><a href="#cb20-40" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">&lt;</span> <span class="dv">2</span> <span class="op">||</span> strcmp<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="st">&quot;upper&quot;</span><span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb20-41"><a href="#cb20-41" aria-hidden="true" tabindex="-1"></a>		op <span class="op">=</span> u_strToUpper<span class="op">;</span></span>
<span id="cb20-42"><a href="#cb20-42" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>strcmp<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="st">&quot;lower&quot;</span><span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb20-43"><a href="#cb20-43" aria-hidden="true" tabindex="-1"></a>		op <span class="op">=</span> u_strToLower<span class="op">;</span></span>
<span id="cb20-44"><a href="#cb20-44" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>strcmp<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">],</span> <span class="st">&quot;title&quot;</span><span class="op">)</span> <span class="op">==</span> <span class="dv">0</span><span class="op">)</span></span>
<span id="cb20-45"><a href="#cb20-45" aria-hidden="true" tabindex="-1"></a>		op <span class="op">=</span> title<span class="op">;</span></span>
<span id="cb20-46"><a href="#cb20-46" aria-hidden="true" tabindex="-1"></a>	<span class="cf">else</span></span>
<span id="cb20-47"><a href="#cb20-47" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb20-48"><a href="#cb20-48" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span> <span class="st">&quot;Unrecognized case: %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> argv<span class="op">[</span><span class="dv">1</span><span class="op">]);</span></span>
<span id="cb20-49"><a href="#cb20-49" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb20-50"><a href="#cb20-50" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb20-51"><a href="#cb20-51" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-52"><a href="#cb20-52" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>in <span class="op">=</span> u_finit<span class="op">(</span>stdin<span class="op">,</span> NULL<span class="op">,</span> NULL<span class="op">)))</span></span>
<span id="cb20-53"><a href="#cb20-53" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb20-54"><a href="#cb20-54" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Error opening stdin as UFILE</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb20-55"><a href="#cb20-55" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb20-56"><a href="#cb20-56" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb20-57"><a href="#cb20-57" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-58"><a href="#cb20-58" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* Ideally we should change case up to the last word</span></span>
<span id="cb20-59"><a href="#cb20-59" aria-hidden="true" tabindex="-1"></a><span class="co">	 * break and push the remaining characters back for</span></span>
<span id="cb20-60"><a href="#cb20-60" aria-hidden="true" tabindex="-1"></a><span class="co">	 * a future read if the line was longer than BUFSZ.</span></span>
<span id="cb20-61"><a href="#cb20-61" aria-hidden="true" tabindex="-1"></a><span class="co">	 * Currently, if the string is truncated, the final</span></span>
<span id="cb20-62"><a href="#cb20-62" aria-hidden="true" tabindex="-1"></a><span class="co">	 * character would incorrectly be considered</span></span>
<span id="cb20-63"><a href="#cb20-63" aria-hidden="true" tabindex="-1"></a><span class="co">	 * terminal, which affects casing rules in Greek. */</span></span>
<span id="cb20-64"><a href="#cb20-64" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>u_fgets<span class="op">(</span>line<span class="op">,</span> BUFSZ<span class="op">,</span> in<span class="op">))</span></span>
<span id="cb20-65"><a href="#cb20-65" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb20-66"><a href="#cb20-66" aria-hidden="true" tabindex="-1"></a>		op<span class="op">(</span>cased<span class="op">,</span> BUFSZ<span class="op">,</span> line<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> locale<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb20-67"><a href="#cb20-67" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* casing can lengthen a string (German ß -&gt; SS),</span></span>
<span id="cb20-68"><a href="#cb20-68" aria-hidden="true" tabindex="-1"></a><span class="co">		 * so the result may not fit in the buffer */</span></span>
<span id="cb20-69"><a href="#cb20-69" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>status <span class="op">==</span> U_BUFFER_OVERFLOW_ERROR<span class="op">)</span></span>
<span id="cb20-70"><a href="#cb20-70" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb20-71"><a href="#cb20-71" aria-hidden="true" tabindex="-1"></a>			<span class="co">/* Just issue a warning and read another line.</span></span>
<span id="cb20-72"><a href="#cb20-72" aria-hidden="true" tabindex="-1"></a><span class="co">			 * Don&#39;t treat it as severely as other errors. */</span></span>
<span id="cb20-73"><a href="#cb20-73" aria-hidden="true" tabindex="-1"></a>			fputs<span class="op">(</span><span class="st">&quot;Line too long</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb20-74"><a href="#cb20-74" aria-hidden="true" tabindex="-1"></a>			status <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb20-75"><a href="#cb20-75" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb20-76"><a href="#cb20-76" aria-hidden="true" tabindex="-1"></a>		<span class="cf">else</span> <span class="cf">if</span> <span class="op">(</span>U_FAILURE<span class="op">(</span>status<span class="op">))</span></span>
<span id="cb20-77"><a href="#cb20-77" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb20-78"><a href="#cb20-78" aria-hidden="true" tabindex="-1"></a>			fputs<span class="op">(</span>u_errorName<span class="op">(</span>status<span class="op">),</span> stderr<span class="op">);</span></span>
<span id="cb20-79"><a href="#cb20-79" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb20-80"><a href="#cb20-80" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb20-81"><a href="#cb20-81" aria-hidden="true" tabindex="-1"></a>		<span class="cf">else</span></span>
<span id="cb20-82"><a href="#cb20-82" aria-hidden="true" tabindex="-1"></a>			u_printf<span class="op">(</span><span class="st">&quot;%S&quot;</span><span class="op">,</span> cased<span class="op">);</span></span>
<span id="cb20-83"><a href="#cb20-83" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb20-84"><a href="#cb20-84" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb20-85"><a href="#cb20-85" aria-hidden="true" tabindex="-1"></a>	u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb20-86"><a href="#cb20-86" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> U_SUCCESS<span class="op">(</span>status<span class="op">)</span></span>
<span id="cb20-87"><a href="#cb20-87" aria-hidden="true" tabindex="-1"></a>		<span class="op">?</span> EXIT_SUCCESS <span class="op">:</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb20-88"><a href="#cb20-88" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>This works better.</p>
<div class="sourceCode" id="cb21"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;Die große Stille&quot;</span> <span class="kw">|</span> <span class="ex">./strcase</span> upper</span>
<span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a><span class="ex">DIE</span> GROSSE STILLE</span>
<span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb21-4"><a href="#cb21-4" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;ΣΊΣΥΦΟΣ&quot;</span> <span class="kw">|</span> <span class="ex">./strcase</span> lower</span>
<span id="cb21-5"><a href="#cb21-5" aria-hidden="true" tabindex="-1"></a><span class="ex">σίσυφος</span></span></code></pre></div>
<h3 id="counting-words-and-graphemes">Counting words and graphemes</h3>
<p>Let’s make a version of <code>wc</code> (the Unix word count program) that knows more about Unicode. Our version will properly count grapheme clusters and word boundaries.</p>
<p>For example, regular wc gets confused by the ancient Ogham script. This was a series of notches scratched into fence posts, and has a space character which is nonblank.</p>
<div class="sourceCode" id="cb22"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb22-1"><a href="#cb22-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ &quot;</span> <span class="kw">|</span> <span class="fu">wc</span></span>
<span id="cb22-2"><a href="#cb22-2" aria-hidden="true" tabindex="-1"></a>       <span class="ex">1</span>       1      37</span></code></pre></div>
<p>One word, you say? Puh-leaze, if your program can’t handle Medieval Irish carvings then I want nothing to do with it. Here’s one that can:</p>
<div class="sourceCode" id="cb23"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*** uwc.c ***/</span></span>
<span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;locale.h&gt;</span></span>
<span id="cb23-4"><a href="#cb23-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb23-5"><a href="#cb23-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-6"><a href="#cb23-6" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ubrk.h&gt;</span></span>
<span id="cb23-7"><a href="#cb23-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb23-8"><a href="#cb23-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustring.h&gt;</span></span>
<span id="cb23-9"><a href="#cb23-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-10"><a href="#cb23-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#define BUFSZ 512</span></span>
<span id="cb23-11"><a href="#cb23-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-12"><a href="#cb23-12" aria-hidden="true" tabindex="-1"></a><span class="co">/* line feed, vertical tab, form feed, carriage return,</span></span>
<span id="cb23-13"><a href="#cb23-13" aria-hidden="true" tabindex="-1"></a><span class="co"> * next line, line separator, paragraph separator */</span></span>
<span id="cb23-14"><a href="#cb23-14" aria-hidden="true" tabindex="-1"></a><span class="pp">#define NEWLINE(c) ( \</span></span>
<span id="cb23-15"><a href="#cb23-15" aria-hidden="true" tabindex="-1"></a><span class="pp">	((c) &gt;= 0xa &amp;&amp; (c) &lt;= 0xd) || \</span></span>
<span id="cb23-16"><a href="#cb23-16" aria-hidden="true" tabindex="-1"></a><span class="pp">	(c) == 0x85 || (c) == 0x2028 || (c) == 0x2029 )</span></span>
<span id="cb23-17"><a href="#cb23-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-18"><a href="#cb23-18" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb23-19"><a href="#cb23-19" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb23-20"><a href="#cb23-20" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>in<span class="op">;</span></span>
<span id="cb23-21"><a href="#cb23-21" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> <span class="op">*</span>locale<span class="op">;</span></span>
<span id="cb23-22"><a href="#cb23-22" aria-hidden="true" tabindex="-1"></a>	UChar line<span class="op">[</span>BUFSZ<span class="op">];</span></span>
<span id="cb23-23"><a href="#cb23-23" aria-hidden="true" tabindex="-1"></a>	UBreakIterator <span class="op">*</span>brk_g<span class="op">,</span> <span class="op">*</span>brk_w<span class="op">;</span></span>
<span id="cb23-24"><a href="#cb23-24" aria-hidden="true" tabindex="-1"></a>	UErrorCode status <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb23-25"><a href="#cb23-25" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> ngraph <span class="op">=</span> <span class="dv">0</span><span class="op">,</span> nword <span class="op">=</span> <span class="dv">0</span><span class="op">,</span> nline <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb23-26"><a href="#cb23-26" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> len<span class="op">;</span></span>
<span id="cb23-27"><a href="#cb23-27" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-28"><a href="#cb23-28" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* word breaks are locale-specific, so we&#39;ll obtain</span></span>
<span id="cb23-29"><a href="#cb23-29" aria-hidden="true" tabindex="-1"></a><span class="co">	 * LC_CTYPE from the environment */</span></span>
<span id="cb23-30"><a href="#cb23-30" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>locale <span class="op">=</span> setlocale<span class="op">(</span>LC_CTYPE<span class="op">,</span> <span class="st">&quot;&quot;</span><span class="op">)))</span></span>
<span id="cb23-31"><a href="#cb23-31" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb23-32"><a href="#cb23-32" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Cannot determine system locale</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb23-33"><a href="#cb23-33" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb23-34"><a href="#cb23-34" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb23-35"><a href="#cb23-35" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-36"><a href="#cb23-36" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>in <span class="op">=</span> u_finit<span class="op">(</span>stdin<span class="op">,</span> NULL<span class="op">,</span> NULL<span class="op">)))</span></span>
<span id="cb23-37"><a href="#cb23-37" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb23-38"><a href="#cb23-38" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Error opening stdin as UFILE</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb23-39"><a href="#cb23-39" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb23-40"><a href="#cb23-40" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb23-41"><a href="#cb23-41" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-42"><a href="#cb23-42" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* create an iterator for graphemes */</span></span>
<span id="cb23-43"><a href="#cb23-43" aria-hidden="true" tabindex="-1"></a>	brk_g <span class="op">=</span> ubrk_open<span class="op">(</span></span>
<span id="cb23-44"><a href="#cb23-44" aria-hidden="true" tabindex="-1"></a>		UBRK_CHARACTER<span class="op">,</span> locale<span class="op">,</span> NULL<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb23-45"><a href="#cb23-45" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* and another for the edges of words */</span></span>
<span id="cb23-46"><a href="#cb23-46" aria-hidden="true" tabindex="-1"></a>	brk_w <span class="op">=</span> ubrk_open<span class="op">(</span></span>
<span id="cb23-47"><a href="#cb23-47" aria-hidden="true" tabindex="-1"></a>		UBRK_WORD<span class="op">,</span> locale<span class="op">,</span> NULL<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb23-48"><a href="#cb23-48" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-49"><a href="#cb23-49" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* yes, this is sensitive to splitting end of line</span></span>
<span id="cb23-50"><a href="#cb23-50" aria-hidden="true" tabindex="-1"></a><span class="co">	 * surrogate pairs and can be improved by our previous</span></span>
<span id="cb23-51"><a href="#cb23-51" aria-hidden="true" tabindex="-1"></a><span class="co">	 * function for reading bounded lines */</span></span>
<span id="cb23-52"><a href="#cb23-52" aria-hidden="true" tabindex="-1"></a>	<span class="cf">while</span> <span class="op">(</span>u_fgets<span class="op">(</span>line<span class="op">,</span> BUFSZ<span class="op">,</span> in<span class="op">))</span></span>
<span id="cb23-53"><a href="#cb23-53" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb23-54"><a href="#cb23-54" aria-hidden="true" tabindex="-1"></a>		len <span class="op">=</span> u_strlen<span class="op">(</span>line<span class="op">);</span></span>
<span id="cb23-55"><a href="#cb23-55" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-56"><a href="#cb23-56" aria-hidden="true" tabindex="-1"></a>		ubrk_setText<span class="op">(</span>brk_g<span class="op">,</span> line<span class="op">,</span> len<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb23-57"><a href="#cb23-57" aria-hidden="true" tabindex="-1"></a>		ubrk_setText<span class="op">(</span>brk_w<span class="op">,</span> line<span class="op">,</span> len<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb23-58"><a href="#cb23-58" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-59"><a href="#cb23-59" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* Start at beginning of string, count breaks.</span></span>
<span id="cb23-60"><a href="#cb23-60" aria-hidden="true" tabindex="-1"></a><span class="co">		 * Could have been a for loop, but this looks</span></span>
<span id="cb23-61"><a href="#cb23-61" aria-hidden="true" tabindex="-1"></a><span class="co">		 * simpler to me. */</span></span>
<span id="cb23-62"><a href="#cb23-62" aria-hidden="true" tabindex="-1"></a>		ubrk_first<span class="op">(</span>brk_g<span class="op">);</span></span>
<span id="cb23-63"><a href="#cb23-63" aria-hidden="true" tabindex="-1"></a>		<span class="cf">while</span> <span class="op">(</span>ubrk_next<span class="op">(</span>brk_g<span class="op">)</span> <span class="op">!=</span> UBRK_DONE<span class="op">)</span></span>
<span id="cb23-64"><a href="#cb23-64" aria-hidden="true" tabindex="-1"></a>			ngraph<span class="op">++;</span></span>
<span id="cb23-65"><a href="#cb23-65" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-66"><a href="#cb23-66" aria-hidden="true" tabindex="-1"></a>		ubrk_first<span class="op">(</span>brk_w<span class="op">);</span></span>
<span id="cb23-67"><a href="#cb23-67" aria-hidden="true" tabindex="-1"></a>		<span class="cf">while</span> <span class="op">(</span>ubrk_next<span class="op">(</span>brk_w<span class="op">)</span> <span class="op">!=</span> UBRK_DONE<span class="op">)</span></span>
<span id="cb23-68"><a href="#cb23-68" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span>ubrk_getRuleStatus<span class="op">(</span>brk_w<span class="op">)</span> <span class="op">==</span></span>
<span id="cb23-69"><a href="#cb23-69" aria-hidden="true" tabindex="-1"></a>			    UBRK_WORD_LETTER<span class="op">)</span></span>
<span id="cb23-70"><a href="#cb23-70" aria-hidden="true" tabindex="-1"></a>				nword<span class="op">++;</span></span>
<span id="cb23-71"><a href="#cb23-71" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-72"><a href="#cb23-72" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* count the newline if it exists */</span></span>
<span id="cb23-73"><a href="#cb23-73" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>len <span class="op">&gt;</span> <span class="dv">0</span> <span class="op">&amp;&amp;</span> NEWLINE<span class="op">(</span>line<span class="op">[</span>len<span class="op">-</span><span class="dv">1</span><span class="op">]))</span></span>
<span id="cb23-74"><a href="#cb23-74" aria-hidden="true" tabindex="-1"></a>			nline<span class="op">++;</span></span>
<span id="cb23-75"><a href="#cb23-75" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb23-76"><a href="#cb23-76" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-77"><a href="#cb23-77" aria-hidden="true" tabindex="-1"></a>	printf<span class="op">(</span><span class="st">&quot;locale  : %s</span><span class="sc">\n</span><span class="st">&quot;</span></span>
<span id="cb23-78"><a href="#cb23-78" aria-hidden="true" tabindex="-1"></a>	       <span class="st">&quot;Grapheme: %ld</span><span class="sc">\n</span><span class="st">&quot;</span></span>
<span id="cb23-79"><a href="#cb23-79" aria-hidden="true" tabindex="-1"></a>	       <span class="st">&quot;Word    : %ld</span><span class="sc">\n</span><span class="st">&quot;</span></span>
<span id="cb23-80"><a href="#cb23-80" aria-hidden="true" tabindex="-1"></a>	       <span class="st">&quot;Line    : %ld</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb23-81"><a href="#cb23-81" aria-hidden="true" tabindex="-1"></a>	       locale<span class="op">,</span> ngraph<span class="op">,</span> nword<span class="op">,</span> nline<span class="op">);</span></span>
<span id="cb23-82"><a href="#cb23-82" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-83"><a href="#cb23-83" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* clean up iterators after use */</span></span>
<span id="cb23-84"><a href="#cb23-84" aria-hidden="true" tabindex="-1"></a>	ubrk_close<span class="op">(</span>brk_g<span class="op">);</span></span>
<span id="cb23-85"><a href="#cb23-85" aria-hidden="true" tabindex="-1"></a>	ubrk_close<span class="op">(</span>brk_w<span class="op">);</span></span>
<span id="cb23-86"><a href="#cb23-86" aria-hidden="true" tabindex="-1"></a>	u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb23-87"><a href="#cb23-87" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Much better:</p>
<div class="sourceCode" id="cb24"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb24-1"><a href="#cb24-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> echo <span class="st">&quot;ᚈᚐ ᚋᚓ ᚔ ᚍᚏᚐ &quot;</span> <span class="kw">|</span> <span class="ex">./uwc</span></span>
<span id="cb24-2"><a href="#cb24-2" aria-hidden="true" tabindex="-1"></a><span class="ex">locale</span>  : en_US.UTF-8</span>
<span id="cb24-3"><a href="#cb24-3" aria-hidden="true" tabindex="-1"></a><span class="ex">Grapheme:</span> 14</span>
<span id="cb24-4"><a href="#cb24-4" aria-hidden="true" tabindex="-1"></a><span class="ex">Word</span>    : 4</span>
<span id="cb24-5"><a href="#cb24-5" aria-hidden="true" tabindex="-1"></a><span class="ex">Line</span>    : 1</span></code></pre></div>
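<p>For contrast, here is a rough sketch of how a byte-oriented counter sees that same line. It is not how any particular <code>wc</code> is implemented; <code>naive_count</code> and <code>struct counts</code> are names made up for this illustration. It assumes UTF-8 input and treats only ASCII space, tab and newline as word separators, so the Ogham space mark (U+1680, three bytes in UTF-8) never ends a word.</p>

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical result type, mirroring the three columns of wc */
struct counts { size_t lines, words, bytes; };

/* Count the way a byte-oriented tool might: a "word" is a run of
 * bytes delimited by ASCII space, tab, or newline. Multi-byte
 * UTF-8 sequences are never recognized as separators. */
struct counts naive_count(const char *s)
{
	struct counts n = { 0, 0, 0 };
	int in_word = 0;

	assert(s != NULL);
	for (; *s; s++)
	{
		n.bytes++;
		if (*s == '\n')
			n.lines++;

		if (*s == ' ' || *s == '\t' || *s == '\n')
			in_word = 0;
		else if (!in_word)
		{
			/* first byte of a new run of non-blanks */
			in_word = 1;
			n.words++;
		}
	}
	return n;
}
```

<p>Fed the Ogham line above with its trailing newline, this reports one line, one word and 37 bytes, the same numbers plain <code>wc</code> printed: each of the twelve Ogham characters takes three bytes, and none of them is an ASCII blank.</p>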
<h3 id="string-search">String search</h3>
<p>When comparing strings, we can be more or less strict. A familiar example is case sensitivity, but Unicode provides other options. Testing strings for equality is a degenerate case of sorting, which must not only decide whether strings are equal but also put them in order. Sorting is called “collation” and the <a href="http://www.unicode.org/reports/tr10/">Unicode collation algorithm</a> supports multiple levels of increasing strictness.</p>
<table class="table">
<thead>
<tr>
<th>
Level
</th>
<th>
Description
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
Primary
</td>
<td>
base characters
</td>
</tr>
<tr>
<td>
Secondary
</td>
<td>
accents
</td>
</tr>
<tr>
<td>
Tertiary
</td>
<td>
case/variant
</td>
</tr>
<tr>
<td>
Quaternary
</td>
<td>
punctuation
</td>
</tr>
</tbody>
</table>
<p>Each level acts as a tie-breaker when strings match in previous levels. When searching we can choose how deep to check before declaring strings equal. To illustrate, consider a text file called words.txt containing these words:</p>
<pre><code>Cooperate
coöperate
COÖPERATE
co-operate
final
ﬁdes</code></pre>
<p>We will write a program called <code>ugrep</code>, where we can specify a comparison level and search string. If we search for “cooperate” and allow comparisons up to the tertiary level it matches nothing:</p>
<div class="sourceCode" id="cb26"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb26-1"><a href="#cb26-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./ugrep 3 cooperate <span class="op">&lt;</span> words.txt</span>
<span id="cb26-2"><a href="#cb26-2" aria-hidden="true" tabindex="-1"></a><span class="co"># requires an exact match, so no results</span></span></code></pre></div>
<p>It is possible to shift certain “ignorable” characters (like ‘-’) down to the quaternary level while conducting the original level 3 search:</p>
<div class="sourceCode" id="cb27"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./ugrep 3i cooperate <span class="op">&lt;</span> words.txt</span>
<span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a><span class="ex">4:</span> co-operate</span></code></pre></div>
<p>Doing the same search at the secondary level disregards case, but is still sensitive to accents.</p>
<div class="sourceCode" id="cb28"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb28-1"><a href="#cb28-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./ugrep 2 cooperate <span class="op">&lt;</span> words.txt</span>
<span id="cb28-2"><a href="#cb28-2" aria-hidden="true" tabindex="-1"></a><span class="ex">1:</span> Cooperate</span></code></pre></div>
<p>Once again, we can allow ignorables at this level.</p>
<div class="sourceCode" id="cb29"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./ugrep 2i cooperate <span class="op">&lt;</span> words.txt</span>
<span id="cb29-2"><a href="#cb29-2" aria-hidden="true" tabindex="-1"></a><span class="ex">1:</span> Cooperate</span>
<span id="cb29-3"><a href="#cb29-3" aria-hidden="true" tabindex="-1"></a><span class="ex">4:</span> co-operate</span></code></pre></div>
<p>Finally, going only to the primary level, we match words with the same base letters, modulo case and accents.</p>
<div class="sourceCode" id="cb30"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb30-1"><a href="#cb30-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./ugrep 1 cooperate <span class="op">&lt;</span> words.txt</span>
<span id="cb30-2"><a href="#cb30-2" aria-hidden="true" tabindex="-1"></a><span class="ex">1:</span> Cooperate</span>
<span id="cb30-3"><a href="#cb30-3" aria-hidden="true" tabindex="-1"></a><span class="ex">2:</span> coöperate</span>
<span id="cb30-4"><a href="#cb30-4" aria-hidden="true" tabindex="-1"></a><span class="ex">3:</span> COÖPERATE</span></code></pre></div>
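<p>To make the tie-breaker idea concrete, here is a toy model of level-by-level comparison. It is only a sketch and not how ICU works internally: the weights in <code>toy_weights</code> are invented for illustration (real ones come from the Unicode collation tables and locale tailoring), it knows only ASCII letters, ö/Ö and the hyphen, and all the names in it are hypothetical. The <code>shifted</code> flag imitates the “i” option above by skipping the hyphen at levels one through three.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Toy weights for one code point: level[0] is the base letter,
 * level[1] the accent, level[2] the case. Invented values, not
 * taken from any real collation table. */
struct weights { uint32_t level[3]; };

static struct weights toy_weights(uint32_t c)
{
	struct weights w = { { c, 0, 0 } };

	if (c == 0xF6 || c == 0xD6)     /* ö or Ö: base o plus accent */
	{
		w.level[0] = 'o';
		w.level[1] = 1;
		w.level[2] = (c == 0xD6);
	}
	else if (c >= 'A' && c <= 'Z')  /* uppercase ASCII */
	{
		w.level[0] = c - 'A' + 'a';
		w.level[2] = 1;
	}
	return w;
}

/* When punctuation is shifted out of levels 1-3, step over it.
 * Only '-' counts as punctuation in this toy. */
static const uint32_t *skip_shifted(const uint32_t *s, int shifted)
{
	while (shifted && *s == '-')
		s++;
	return s;
}

/* Compare zero-terminated code point strings one level at a time,
 * stopping at the requested strength (1, 2 or 3). A later level is
 * consulted only if every earlier level tied. Returns 0 when the
 * strings are equal at that strength. */
int toy_compare(const uint32_t *a, const uint32_t *b,
                int strength, int shifted)
{
	int lvl;

	assert(strength >= 1 && strength <= 3);
	for (lvl = 0; lvl < strength; lvl++)
	{
		const uint32_t *p = skip_shifted(a, shifted);
		const uint32_t *q = skip_shifted(b, shifted);

		while (*p && *q)
		{
			uint32_t x = toy_weights(*p).level[lvl];
			uint32_t y = toy_weights(*q).level[lvl];

			if (x != y)
				return x < y ? -1 : 1;
			p = skip_shifted(p + 1, shifted);
			q = skip_shifted(q + 1, shifted);
		}
		if (*p || *q)               /* one string is longer */
			return *p ? 1 : -1;
	}
	return 0;
}
```

<p>With these made-up weights, <code>toy_compare</code> treats “cooperate” and “Cooperate” as equal at strengths 1 and 2 but not 3, “cooperate” and “coöperate” as equal only at strength 1, and “cooperate” and “co-operate” as equal only when <code>shifted</code> is set, mirroring the <code>ugrep</code> results above.</p>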
<p>Note that the idea of a “base character” is dependent on locale. In Swedish, the letters o and ö are quite distinct, and not minor variants as in English. Setting the locale prior to search restricts the results even at the primary level.</p>
<div class="sourceCode" id="cb31"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb31-1"><a href="#cb31-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> LC_COLLATE=sv_SE ./ugrep 1 cooperate <span class="op">&lt;</span> fun.txt</span>
<span id="cb31-2"><a href="#cb31-2" aria-hidden="true" tabindex="-1"></a><span class="ex">1:</span> Cooperate</span></code></pre></div>
<p>One note about the tertiary level. It distinguishes not just case, but ligature presentation forms.</p>
<div class="sourceCode" id="cb32"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb32-1"><a href="#cb32-1" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./ugrep 3 ﬁ <span class="op">&lt;</span> words.txt</span>
<span id="cb32-2"><a href="#cb32-2" aria-hidden="true" tabindex="-1"></a><span class="ex">6:</span> ﬁdes</span>
<span id="cb32-3"><a href="#cb32-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb32-4"><a href="#cb32-4" aria-hidden="true" tabindex="-1"></a><span class="co"># vs</span></span>
<span id="cb32-5"><a href="#cb32-5" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb32-6"><a href="#cb32-6" aria-hidden="true" tabindex="-1"></a><span class="ex">$</span> ./ugrep 2 ﬁ <span class="op">&lt;</span> words.txt</span>
<span id="cb32-7"><a href="#cb32-7" aria-hidden="true" tabindex="-1"></a><span class="ex">5:</span> final</span>
<span id="cb32-8"><a href="#cb32-8" aria-hidden="true" tabindex="-1"></a><span class="ex">6:</span> ﬁdes</span></code></pre></div>
<p>Pretty flexible, right? Let’s see the code.</p>
<div class="sourceCode" id="cb33"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb33-1"><a href="#cb33-1" aria-hidden="true" tabindex="-1"></a><span class="co">/*** ugrep.c ***/</span></span>
<span id="cb33-2"><a href="#cb33-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-3"><a href="#cb33-3" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;locale.h&gt;</span></span>
<span id="cb33-4"><a href="#cb33-4" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdlib.h&gt;</span></span>
<span id="cb33-5"><a href="#cb33-5" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;string.h&gt;</span></span>
<span id="cb33-6"><a href="#cb33-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-7"><a href="#cb33-7" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ucol.h&gt;</span></span>
<span id="cb33-8"><a href="#cb33-8" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/usearch.h&gt;</span></span>
<span id="cb33-9"><a href="#cb33-9" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustdio.h&gt;</span></span>
<span id="cb33-10"><a href="#cb33-10" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/ustring.h&gt;</span></span>
<span id="cb33-11"><a href="#cb33-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-12"><a href="#cb33-12" aria-hidden="true" tabindex="-1"></a><span class="pp">#define BUFSZ 1024</span></span>
<span id="cb33-13"><a href="#cb33-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-14"><a href="#cb33-14" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">int</span> argc<span class="op">,</span> <span class="dt">char</span> <span class="op">**</span>argv<span class="op">)</span></span>
<span id="cb33-15"><a href="#cb33-15" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb33-16"><a href="#cb33-16" aria-hidden="true" tabindex="-1"></a>	<span class="dt">char</span> <span class="op">*</span>locale<span class="op">;</span></span>
<span id="cb33-17"><a href="#cb33-17" aria-hidden="true" tabindex="-1"></a>	UFILE <span class="op">*</span>in<span class="op">;</span></span>
<span id="cb33-18"><a href="#cb33-18" aria-hidden="true" tabindex="-1"></a>	UCollator <span class="op">*</span>col<span class="op">;</span></span>
<span id="cb33-19"><a href="#cb33-19" aria-hidden="true" tabindex="-1"></a>	UStringSearch <span class="op">*</span>srch <span class="op">=</span> NULL<span class="op">;</span></span>
<span id="cb33-20"><a href="#cb33-20" aria-hidden="true" tabindex="-1"></a>	UErrorCode status <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb33-21"><a href="#cb33-21" aria-hidden="true" tabindex="-1"></a>	UChar <span class="op">*</span>needle<span class="op">,</span> line<span class="op">[</span>BUFSZ<span class="op">];</span></span>
<span id="cb33-22"><a href="#cb33-22" aria-hidden="true" tabindex="-1"></a>	UColAttributeValue strength<span class="op">;</span></span>
<span id="cb33-23"><a href="#cb33-23" aria-hidden="true" tabindex="-1"></a>	<span class="dt">int</span> ignoreInsignificant <span class="op">=</span> <span class="dv">0</span><span class="op">,</span> asymmetric <span class="op">=</span> <span class="dv">0</span><span class="op">;</span></span>
<span id="cb33-24"><a href="#cb33-24" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> n<span class="op">;</span></span>
<span id="cb33-25"><a href="#cb33-25" aria-hidden="true" tabindex="-1"></a>	<span class="dt">long</span> i<span class="op">;</span></span>
<span id="cb33-26"><a href="#cb33-26" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-27"><a href="#cb33-27" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>argc <span class="op">!=</span> <span class="dv">3</span><span class="op">)</span></span>
<span id="cb33-28"><a href="#cb33-28" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb33-29"><a href="#cb33-29" aria-hidden="true" tabindex="-1"></a>		fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb33-30"><a href="#cb33-30" aria-hidden="true" tabindex="-1"></a>			<span class="st">&quot;Usage: %s {1,2,@,3}[i] pattern</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> argv<span class="op">[</span><span class="dv">0</span><span class="op">]);</span></span>
<span id="cb33-31"><a href="#cb33-31" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb33-32"><a href="#cb33-32" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb33-33"><a href="#cb33-33" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-34"><a href="#cb33-34" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* cryptic parsing for our cryptic options */</span></span>
<span id="cb33-35"><a href="#cb33-35" aria-hidden="true" tabindex="-1"></a>	<span class="cf">switch</span> <span class="op">(*</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">])</span></span>
<span id="cb33-36"><a href="#cb33-36" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb33-37"><a href="#cb33-37" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> <span class="ch">&#39;1&#39;</span><span class="op">:</span></span>
<span id="cb33-38"><a href="#cb33-38" aria-hidden="true" tabindex="-1"></a>			strength <span class="op">=</span> UCOL_PRIMARY<span class="op">;</span></span>
<span id="cb33-39"><a href="#cb33-39" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb33-40"><a href="#cb33-40" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> <span class="ch">&#39;2&#39;</span><span class="op">:</span></span>
<span id="cb33-41"><a href="#cb33-41" aria-hidden="true" tabindex="-1"></a>			strength <span class="op">=</span> UCOL_SECONDARY<span class="op">;</span></span>
<span id="cb33-42"><a href="#cb33-42" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb33-43"><a href="#cb33-43" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> <span class="ch">&#39;@&#39;</span><span class="op">:</span></span>
<span id="cb33-44"><a href="#cb33-44" aria-hidden="true" tabindex="-1"></a>			strength <span class="op">=</span> UCOL_SECONDARY<span class="op">,</span> asymmetric <span class="op">=</span> <span class="dv">1</span><span class="op">;</span></span>
<span id="cb33-45"><a href="#cb33-45" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb33-46"><a href="#cb33-46" aria-hidden="true" tabindex="-1"></a>		<span class="cf">case</span> <span class="ch">&#39;3&#39;</span><span class="op">:</span></span>
<span id="cb33-47"><a href="#cb33-47" aria-hidden="true" tabindex="-1"></a>			strength <span class="op">=</span> UCOL_TERTIARY<span class="op">;</span></span>
<span id="cb33-48"><a href="#cb33-48" aria-hidden="true" tabindex="-1"></a>			<span class="cf">break</span><span class="op">;</span></span>
<span id="cb33-49"><a href="#cb33-49" aria-hidden="true" tabindex="-1"></a>		<span class="cf">default</span><span class="op">:</span></span>
<span id="cb33-50"><a href="#cb33-50" aria-hidden="true" tabindex="-1"></a>			fprintf<span class="op">(</span>stderr<span class="op">,</span></span>
<span id="cb33-51"><a href="#cb33-51" aria-hidden="true" tabindex="-1"></a>				<span class="st">&quot;Unknown strength: %s</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> argv<span class="op">[</span><span class="dv">1</span><span class="op">]);</span></span>
<span id="cb33-52"><a href="#cb33-52" aria-hidden="true" tabindex="-1"></a>			<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb33-53"><a href="#cb33-53" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb33-54"><a href="#cb33-54" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* length of argv[1] is &gt;0 or we would have died */</span></span>
<span id="cb33-55"><a href="#cb33-55" aria-hidden="true" tabindex="-1"></a>	ignoreInsignificant <span class="op">=</span> argv<span class="op">[</span><span class="dv">1</span><span class="op">][</span>strlen<span class="op">(</span>argv<span class="op">[</span><span class="dv">1</span><span class="op">])-</span><span class="dv">1</span><span class="op">]</span> <span class="op">==</span> <span class="ch">&#39;i&#39;</span><span class="op">;</span></span>
<span id="cb33-56"><a href="#cb33-56" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-57"><a href="#cb33-57" aria-hidden="true" tabindex="-1"></a>	n <span class="op">=</span> strlen<span class="op">(</span>argv<span class="op">[</span><span class="dv">2</span><span class="op">])</span> <span class="op">+</span> <span class="dv">1</span><span class="op">;</span></span>
<span id="cb33-58"><a href="#cb33-58" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* if UTF-8 could encode it in n, then UTF-16</span></span>
<span id="cb33-59"><a href="#cb33-59" aria-hidden="true" tabindex="-1"></a><span class="co">	 * should be able to as well */</span></span>
<span id="cb33-60"><a href="#cb33-60" aria-hidden="true" tabindex="-1"></a>	needle <span class="op">=</span> malloc<span class="op">(</span>n <span class="op">*</span> <span class="kw">sizeof</span><span class="op">(*</span>needle<span class="op">));</span></span>
<span id="cb33-61"><a href="#cb33-61" aria-hidden="true" tabindex="-1"></a>	u_strFromUTF8<span class="op">(</span>needle<span class="op">,</span> n<span class="op">,</span> NULL<span class="op">,</span> argv<span class="op">[</span><span class="dv">2</span><span class="op">],</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb33-62"><a href="#cb33-62" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-63"><a href="#cb33-63" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* searching is a degenerate case of collation,</span></span>
<span id="cb33-64"><a href="#cb33-64" aria-hidden="true" tabindex="-1"></a><span class="co">	 * so we read the LC_COLLATE locale */</span></span>
<span id="cb33-65"><a href="#cb33-65" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>locale <span class="op">=</span> setlocale<span class="op">(</span>LC_COLLATE<span class="op">,</span> <span class="st">&quot;&quot;</span><span class="op">)))</span></span>
<span id="cb33-66"><a href="#cb33-66" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb33-67"><a href="#cb33-67" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Cannot determine system collation locale</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span></span>
<span id="cb33-68"><a href="#cb33-68" aria-hidden="true" tabindex="-1"></a>		      stderr<span class="op">);</span></span>
<span id="cb33-69"><a href="#cb33-69" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb33-70"><a href="#cb33-70" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb33-71"><a href="#cb33-71" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-72"><a href="#cb33-72" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(!(</span>in <span class="op">=</span> u_finit<span class="op">(</span>stdin<span class="op">,</span> NULL<span class="op">,</span> NULL<span class="op">)))</span></span>
<span id="cb33-73"><a href="#cb33-73" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb33-74"><a href="#cb33-74" aria-hidden="true" tabindex="-1"></a>		fputs<span class="op">(</span><span class="st">&quot;Error opening stdin as UFILE</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> stderr<span class="op">);</span></span>
<span id="cb33-75"><a href="#cb33-75" aria-hidden="true" tabindex="-1"></a>		<span class="cf">return</span> EXIT_FAILURE<span class="op">;</span></span>
<span id="cb33-76"><a href="#cb33-76" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb33-77"><a href="#cb33-77" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-78"><a href="#cb33-78" aria-hidden="true" tabindex="-1"></a>	col <span class="op">=</span> ucol_open<span class="op">(</span>locale<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb33-79"><a href="#cb33-79" aria-hidden="true" tabindex="-1"></a>	ucol_setStrength<span class="op">(</span>col<span class="op">,</span> strength<span class="op">);</span></span>
<span id="cb33-80"><a href="#cb33-80" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-81"><a href="#cb33-81" aria-hidden="true" tabindex="-1"></a>	<span class="cf">if</span> <span class="op">(</span>ignoreInsignificant<span class="op">)</span></span>
<span id="cb33-82"><a href="#cb33-82" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* shift ignorable characters down to</span></span>
<span id="cb33-83"><a href="#cb33-83" aria-hidden="true" tabindex="-1"></a><span class="co">		 * quaternary level */</span></span>
<span id="cb33-84"><a href="#cb33-84" aria-hidden="true" tabindex="-1"></a>		ucol_setAttribute<span class="op">(</span>col<span class="op">,</span> UCOL_ALTERNATE_HANDLING<span class="op">,</span></span>
<span id="cb33-85"><a href="#cb33-85" aria-hidden="true" tabindex="-1"></a>		                  UCOL_SHIFTED<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb33-86"><a href="#cb33-86" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-87"><a href="#cb33-87" aria-hidden="true" tabindex="-1"></a>	<span class="co">/* Assumes all lines fit in BUFSZ. Real code should</span></span>
<span id="cb33-88"><a href="#cb33-88" aria-hidden="true" tabindex="-1"></a><span class="co">	 * handle longer lines and not increment i for a partial read */</span></span>
<span id="cb33-89"><a href="#cb33-89" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">1</span><span class="op">;</span> u_fgets<span class="op">(</span>line<span class="op">,</span> BUFSZ<span class="op">,</span> in<span class="op">);</span> <span class="op">++</span>i<span class="op">)</span></span>
<span id="cb33-90"><a href="#cb33-90" aria-hidden="true" tabindex="-1"></a>	<span class="op">{</span></span>
<span id="cb33-91"><a href="#cb33-91" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* first time through, set up all options */</span></span>
<span id="cb33-92"><a href="#cb33-92" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(!</span>srch<span class="op">)</span></span>
<span id="cb33-93"><a href="#cb33-93" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span></span>
<span id="cb33-94"><a href="#cb33-94" aria-hidden="true" tabindex="-1"></a>			srch <span class="op">=</span> usearch_openFromCollator<span class="op">(</span></span>
<span id="cb33-95"><a href="#cb33-95" aria-hidden="true" tabindex="-1"></a>				needle<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> line<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span></span>
<span id="cb33-96"><a href="#cb33-96" aria-hidden="true" tabindex="-1"></a>			    col<span class="op">,</span> NULL<span class="op">,</span> <span class="op">&amp;</span>status</span>
<span id="cb33-97"><a href="#cb33-97" aria-hidden="true" tabindex="-1"></a>			<span class="op">);</span></span>
<span id="cb33-98"><a href="#cb33-98" aria-hidden="true" tabindex="-1"></a>			<span class="cf">if</span> <span class="op">(</span>asymmetric<span class="op">)</span></span>
<span id="cb33-99"><a href="#cb33-99" aria-hidden="true" tabindex="-1"></a>				usearch_setAttribute<span class="op">(</span></span>
<span id="cb33-100"><a href="#cb33-100" aria-hidden="true" tabindex="-1"></a>					srch<span class="op">,</span> USEARCH_ELEMENT_COMPARISON<span class="op">,</span></span>
<span id="cb33-101"><a href="#cb33-101" aria-hidden="true" tabindex="-1"></a>					USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD<span class="op">,</span></span>
<span id="cb33-102"><a href="#cb33-102" aria-hidden="true" tabindex="-1"></a>					<span class="op">&amp;</span>status</span>
<span id="cb33-103"><a href="#cb33-103" aria-hidden="true" tabindex="-1"></a>				<span class="op">);</span></span>
<span id="cb33-104"><a href="#cb33-104" aria-hidden="true" tabindex="-1"></a>		<span class="op">}</span></span>
<span id="cb33-105"><a href="#cb33-105" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* afterward just switch text */</span></span>
<span id="cb33-106"><a href="#cb33-106" aria-hidden="true" tabindex="-1"></a>		<span class="cf">else</span></span>
<span id="cb33-107"><a href="#cb33-107" aria-hidden="true" tabindex="-1"></a>			usearch_setText<span class="op">(</span>srch<span class="op">,</span> line<span class="op">,</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> <span class="op">&amp;</span>status<span class="op">);</span></span>
<span id="cb33-108"><a href="#cb33-108" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-109"><a href="#cb33-109" aria-hidden="true" tabindex="-1"></a>		<span class="co">/* check if keyword appears in line */</span></span>
<span id="cb33-110"><a href="#cb33-110" aria-hidden="true" tabindex="-1"></a>		<span class="cf">if</span> <span class="op">(</span>usearch_first<span class="op">(</span>srch<span class="op">,</span> <span class="op">&amp;</span>status<span class="op">)</span> <span class="op">!=</span> USEARCH_DONE<span class="op">)</span></span>
<span id="cb33-111"><a href="#cb33-111" aria-hidden="true" tabindex="-1"></a>			u_printf<span class="op">(</span><span class="st">&quot;%ld: %S&quot;</span><span class="op">,</span> i<span class="op">,</span> line<span class="op">);</span></span>
<span id="cb33-112"><a href="#cb33-112" aria-hidden="true" tabindex="-1"></a>	<span class="op">}</span></span>
<span id="cb33-113"><a href="#cb33-113" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-114"><a href="#cb33-114" aria-hidden="true" tabindex="-1"></a>	usearch_close<span class="op">(</span>srch<span class="op">);</span></span>
<span id="cb33-115"><a href="#cb33-115" aria-hidden="true" tabindex="-1"></a>	ucol_close<span class="op">(</span>col<span class="op">);</span></span>
<span id="cb33-116"><a href="#cb33-116" aria-hidden="true" tabindex="-1"></a>	u_fclose<span class="op">(</span>in<span class="op">);</span></span>
<span id="cb33-117"><a href="#cb33-117" aria-hidden="true" tabindex="-1"></a>	free<span class="op">(</span>needle<span class="op">);</span></span>
<span id="cb33-118"><a href="#cb33-118" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb33-119"><a href="#cb33-119" aria-hidden="true" tabindex="-1"></a>	<span class="cf">return</span> EXIT_SUCCESS<span class="op">;</span></span>
<span id="cb33-120"><a href="#cb33-120" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<h3 id="comparing-strings-modulo-normalization">Comparing strings modulo normalization</h3>
<p>In the concepts section, we saw a single grapheme can be constructed with different combinations of codepoints. In many cases when comparing strings for equality, we’re most interested in the strings being perceived by the user in the same way rather than a simple byte-for-byte match.</p>
<p>The ICU library provides a <a href="http://icu-project.org/apiref/icu4c/unorm2_8h.html#a991e0fe6f0d062dd6e8e924517f3f437">unorm_compare</a> function which returns a value similar to that of strcmp and compares in a normalization-independent way. It normalizes both strings incrementally while comparing them, so it can stop early if it finds a difference.</p>
<p>Here is code to check that the five ways of representing ộ are equivalent:</p>
<div class="sourceCode" id="cb34"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb34-1"><a href="#cb34-1" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;stdio.h&gt;</span></span>
<span id="cb34-2"><a href="#cb34-2" aria-hidden="true" tabindex="-1"></a><span class="pp">#include </span><span class="im">&lt;unicode/unorm2.h&gt;</span></span>
<span id="cb34-3"><a href="#cb34-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb34-4"><a href="#cb34-4" aria-hidden="true" tabindex="-1"></a><span class="dt">int</span> main<span class="op">(</span><span class="dt">void</span><span class="op">)</span></span>
<span id="cb34-5"><a href="#cb34-5" aria-hidden="true" tabindex="-1"></a><span class="op">{</span></span>
<span id="cb34-6"><a href="#cb34-6" aria-hidden="true" tabindex="-1"></a>	UErrorCode status <span class="op">=</span> U_ZERO_ERROR<span class="op">;</span></span>
<span id="cb34-7"><a href="#cb34-7" aria-hidden="true" tabindex="-1"></a>	UChar s<span class="op">[][</span><span class="dv">4</span><span class="op">]</span> <span class="op">=</span> <span class="op">{</span></span>
<span id="cb34-8"><a href="#cb34-8" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span><span class="bn">0x006f</span><span class="op">,</span><span class="bn">0x0302</span><span class="op">,</span><span class="bn">0x0323</span><span class="op">,</span><span class="dv">0</span><span class="op">},</span></span>
<span id="cb34-9"><a href="#cb34-9" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span><span class="bn">0x006f</span><span class="op">,</span><span class="bn">0x0323</span><span class="op">,</span><span class="bn">0x0302</span><span class="op">,</span><span class="dv">0</span><span class="op">},</span></span>
<span id="cb34-10"><a href="#cb34-10" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span><span class="bn">0x00f4</span><span class="op">,</span><span class="bn">0x0323</span><span class="op">,</span><span class="dv">0</span><span class="op">,</span><span class="dv">0</span><span class="op">},</span></span>
<span id="cb34-11"><a href="#cb34-11" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span><span class="bn">0x1ecd</span><span class="op">,</span><span class="bn">0x0302</span><span class="op">,</span><span class="dv">0</span><span class="op">,</span><span class="dv">0</span><span class="op">},</span></span>
<span id="cb34-12"><a href="#cb34-12" aria-hidden="true" tabindex="-1"></a>		<span class="op">{</span><span class="bn">0x1ed9</span><span class="op">,</span><span class="dv">0</span><span class="op">,</span><span class="dv">0</span><span class="op">,</span><span class="dv">0</span><span class="op">}</span></span>
<span id="cb34-13"><a href="#cb34-13" aria-hidden="true" tabindex="-1"></a>	<span class="op">};</span></span>
<span id="cb34-14"><a href="#cb34-14" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb34-15"><a href="#cb34-15" aria-hidden="true" tabindex="-1"></a>	<span class="dt">const</span> <span class="dt">size_t</span> n <span class="op">=</span> <span class="kw">sizeof</span><span class="op">(</span>s<span class="op">)/</span><span class="kw">sizeof</span><span class="op">(</span>s<span class="op">[</span><span class="dv">0</span><span class="op">]);</span></span>
<span id="cb34-16"><a href="#cb34-16" aria-hidden="true" tabindex="-1"></a>	<span class="dt">size_t</span> i<span class="op">;</span></span>
<span id="cb34-17"><a href="#cb34-17" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb34-18"><a href="#cb34-18" aria-hidden="true" tabindex="-1"></a>	<span class="cf">for</span> <span class="op">(</span>i <span class="op">=</span> <span class="dv">0</span><span class="op">;</span> i <span class="op">&lt;</span> n<span class="op">;</span> <span class="op">++</span>i<span class="op">)</span></span>
<span id="cb34-19"><a href="#cb34-19" aria-hidden="true" tabindex="-1"></a>		printf<span class="op">(</span><span class="st">&quot;%zu == %zu: %d</span><span class="sc">\n</span><span class="st">&quot;</span><span class="op">,</span> i<span class="op">,</span> <span class="op">(</span>i<span class="op">+</span><span class="dv">1</span><span class="op">)%</span>n<span class="op">,</span></span>
<span id="cb34-20"><a href="#cb34-20" aria-hidden="true" tabindex="-1"></a>			unorm_compare<span class="op">(</span></span>
<span id="cb34-21"><a href="#cb34-21" aria-hidden="true" tabindex="-1"></a>				s<span class="op">[</span>i<span class="op">],</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> s<span class="op">[(</span>i<span class="op">+</span><span class="dv">1</span><span class="op">)%</span>n<span class="op">],</span> <span class="op">-</span><span class="dv">1</span><span class="op">,</span> <span class="dv">0</span><span class="op">,</span> <span class="op">&amp;</span>status<span class="op">));</span></span>
<span id="cb34-22"><a href="#cb34-22" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>Output:</p>
<pre><code>0 == 1: 0
1 == 2: 0
2 == 3: 0
3 == 4: 0
4 == 0: 0</code></pre>
<p>As with strcmp, a return value of 0 means the strings are equal; here all five representations compare equal to one another.</p>
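<p>For contrast, here is a minimal sketch of the naive comparison that ignores normalization. It is only an illustration: same_code_units is a hypothetical helper, not an ICU function, and uint16_t stands in for ICU’s UChar so the snippet builds without ICU. Unlike strcmp, it returns 1 when the strings are identical and 0 otherwise.</p>

```c
#include <stdint.h>

/* Naive equality: walk two 0-terminated UTF-16 strings code unit by
 * code unit, with no normalization at all. uint16_t is an assumption
 * standing in for ICU's UChar. Returns 1 if identical, 0 otherwise. */
int same_code_units(const uint16_t *a, const uint16_t *b)
{
	/* advance while both strings agree and a has not ended */
	while (*a && *a == *b)
		++a, ++b;
	/* identical only if both stopped on the terminator */
	return *a == *b;
}
```

<p>Called on {0x006f,0x0302,0x0323,0} and {0x1ed9,0}, two of the spellings of ộ used earlier, it returns 0, because the code units differ even though the user sees the same grapheme.</p>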
<h3 id="confusable-strings">Confusable strings</h3>
<p>Because Unicode introduces so many graphemes, there are more possibilities for scammers to confuse people using lookalike glyphs. For instance, domains like adoḅe.com or pаypal.com (with Cyrillic а) can direct unwary visitors to phishing sites. ICU contains an entire module for detecting “confusables,” those strings which are known to look too similar when rendered in common fonts. Each string is assigned a “skeleton” such that confusable strings get the same skeleton.</p>
<p>For an example, see my utility <a href="https://github.com/begriffs/utofu">utofu</a>. It has a little extra complexity with sqlite access code, so I am not reproducing it here. It’s designed to check Unicode strings to detect changes over time that might be spoofing.</p>
<p>The method of operation is this:</p>
<ol type="1">
<li>Read line as UTF-8</li>
<li>Convert to Normalization Form C for consistency</li>
<li>Calculate skeleton string</li>
<li>Insert UTF-8 version of normalized input and its skeleton into a database if the skeleton doesn’t already exist</li>
<li>Compare the normalized input string to the string in the database that has the corresponding skeleton. If it is not an exact match, die with an error.</li>
</ol>
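<p>The steps above can be sketched in plain C. This is not how utofu is written, only an illustration under loud assumptions: an in-memory array stands in for the sqlite database, the hypothetical toy_skeleton function and its three-entry lookalike table stand in for ICU’s confusables data, and the input is assumed to be in Normalization Form C already, so step 2 is skipped.</p>

```c
#include <stddef.h>
#include <string.h>

#define MAXLEN  128 /* longest line we accept, including the NUL */
#define MAXSEEN 64  /* capacity of the in-memory "database" */

/* Toy lookalike table (an assumption for illustration only);
 * the real confusables data is far larger. */
static const struct { const char *from, *to; } lookalikes[] = {
	{ "\xd0\xb0", "a" }, /* U+0430 CYRILLIC SMALL LETTER A */
	{ "\xd0\xb5", "e" }, /* U+0435 CYRILLIC SMALL LETTER IE */
	{ "\xd0\xbe", "o" }, /* U+043E CYRILLIC SMALL LETTER O */
};

/* Hypothetical stand-in for a real skeleton function. Writes the
 * skeleton of UTF-8 string in to out; returns -1 if it won't fit. */
int toy_skeleton(const char *in, char *out, size_t cap)
{
	size_t used = 0;

	if (cap == 0)
		return -1;
	while (*in)
	{
		const char *piece = in; /* default: copy one byte as-is */
		size_t take = 1, put = 1, k;

		for (k = 0; k < sizeof lookalikes / sizeof *lookalikes; ++k)
		{
			size_t len = strlen(lookalikes[k].from);
			if (strncmp(in, lookalikes[k].from, len) == 0)
			{
				piece = lookalikes[k].to;
				take = len;
				put = strlen(piece);
				break;
			}
		}
		if (used + put >= cap)
			return -1;
		memcpy(out + used, piece, put);
		used += put;
		in += take;
	}
	out[used] = '\0';
	return 0;
}

struct seen  { char skel[MAXLEN], text[MAXLEN]; };
struct store { struct seen rows[MAXSEEN]; size_t n; };

enum verdict { LINE_ADDED, LINE_MATCH, LINE_SPOOF, LINE_FAILED };

/* Steps 3-5 for one line of (assumed NFC) UTF-8 text. */
enum verdict check_line(struct store *db, const char *text)
{
	char skel[MAXLEN];
	size_t i;

	if (strlen(text) >= MAXLEN ||
	    toy_skeleton(text, skel, sizeof skel) != 0)
		return LINE_FAILED;

	/* skeleton already known: only the identical string is acceptable */
	for (i = 0; i < db->n; ++i)
		if (strcmp(db->rows[i].skel, skel) == 0)
			return strcmp(db->rows[i].text, text) == 0
				? LINE_MATCH : LINE_SPOOF;

	/* new skeleton: remember the string and its skeleton */
	if (db->n == MAXSEEN)
		return LINE_FAILED;
	strcpy(db->rows[db->n].skel, skel);
	strcpy(db->rows[db->n].text, text);
	db->n++;
	return LINE_ADDED;
}
```

<p>With a zero-initialized struct store, the first call to check_line for paypal.com returns LINE_ADDED, a repeat returns LINE_MATCH, and the spelling with Cyrillic а returns LINE_SPOOF because it shares a skeleton with the stored string but differs byte for byte. Keying the lookup on the skeleton rather than the text is what lets the first string seen claim the whole family of lookalikes.</p>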
<h3 id="further-reading">Further reading</h3>
<p>Unicode and internationalization are huge topics, and I could only scratch the surface in this article. I read and enjoyed sections from these books and reference materials, and would recommend them:</p>
<ul>
<li><a href="https://www.goodreads.com/book/show/1827814.Unicode_Demystified">Unicode Demystified</a> by Richard Gillam</li>
<li><a href="http://unicode.org/versions/Unicode12.1.0/">The Unicode Standard</a></li>
<li><a href="http://userguide.icu-project.org">ICU User Guide</a></li>
<li><a href="http://icu-project.org/apiref/icu4c/index.html">ICU4C API Reference</a></li>
<li><a href="http://www.unicode.org/reports/">Unicode Technical Reports</a></li>
</ul>]]></summary>
</entry>

</feed>
