gwic, a simple KWIC program written in Go

Thu, Dec 24, 2015 | tags: Go programming

A KWIC (“KeyWord In Context”) program shows the syntactic word that has been searched within a text (the “query term”) in the context of the words surrounding it in the input text. For example, if we search for “champion” in the text of this random Wikipedia page the output a KWIC program gives you will look similar to this. This particular output has been produced by the “gwic” program I’ve written in Go (note that it does not search for an exact match but considers substring matches as well).

The simplest approach to generate a KWIC output is the following.

  1. Read text from input file.
  2. Split text into words (tokenization).
  3. Try to match each input word with the query term.
  4. If a word does not match, put it in a queue which has the size of the context you want to show.
  5. If a word matches, read as many further words as input as needed by the desired context size.
  6. After all the context words have been read, format and print the queue to the output (for printing the aligned output Golang’s text/tabwriter comes in handy).

There is one issue with this approach. While you read the other words needed to fill the context (to complete one output line) in step 5 it can happen that you read another word that matches the query term. With this basic approach you will end up ignoring the new match because it’s part of the context of the previous one.

I solved the issue by using a work struct containing a channel from which the missing context words are read. Each time a match has been detected we create one of those structs, copy the amount of prior context needed up to and including the matching word into a slice passed to one of its methods and add a pointer to the struct to a *work slice. Every word we read from the input from that point on is being fed to all of the work structs in this slice through their channels. If we encounter another match while reading the context words for a prior match we just create a new work struct to which we will then feed the incoming words in the same way.

The following graphic illustrates this approach.

gwic workers

The first row indicates the word input starting from word one (“w1”). A red border indicates that the word matched the query term. The arrows from the word items downwards indicate that the action described to the left of them has taken place. In this KWIC algorithm illustration a context size of 3 words to the left and right has been chosen.

Instead of using a channel to send the incoming words to the work structs, one could also just append to a member slice of the struct.

This “gwic” program is very basic. There is no way to specify the size of the desired context for example. The program currently also matches the query term as a substring as mentioned above. These issues can easily be fixed though and I may do so in the future.

Don’t hesitate to send comments and/or questions to my public-inbox (view).