gwic, a simple KWIC program written in Go
A KWIC (“KeyWord In Context”) program shows the syntactic word that has been searched within a text (the “query term”) in the context of the words surrounding it in the input text. For example, if we search for “champion” in the text of this random Wikipedia page the output a KWIC program gives you will look similar to this. This particular output has been produced by the “gwic” program I’ve written in Go (note that it does not search for an exact match but considers substring matches as well).
The simplest approach to generate a KWIC output is the following.
- Read text from input file.
- Split text into words (tokenization).
- Try to match each input word with the query term.
- If a word does not match, put it in a queue which has the size of the context you want to show.
- If a word matches, read as many further words as input as needed by the desired context size.
- After all the context words have been read, format and print the queue to the output (for printing the aligned output Golang’s text/tabwriter comes in handy).
There is one issue with this approach. While you read the other words needed to fill the context (to complete one output line) in step 5 it can happen that you read another word that matches the query term. With this basic approach you will end up ignoring the new match because it’s part of the context of the previous one.
I solved the issue by using a work
struct containing a channel
from which the missing context words are read. Each time a match has
been detected we create one of those structs, copy the amount of prior
context needed up to and including the matching word into a slice passed
to one of its methods and add a pointer to the struct to a *work
slice. Every word we read from the input from that point on is being fed
to all of the work
structs in this slice through their channels. If
we encounter another match while reading the context words for a prior
match we just create a new work
struct to which we will then feed
the incoming words in the same way.
The following graphic illustrates this approach.
The first row indicates the word input starting from word one (“w1”). A red border indicates that the word matched the query term. The arrows from the word items downwards indicate that the action described to the left of them has taken place. In this KWIC algorithm illustration a context size of 3 words to the left and right has been chosen.
Instead of using a channel to send the incoming words to the work
structs, one could also just append to a member slice of the struct.
This “gwic” program is very basic. There is no way to specify the size of the desired context for example. The program currently also matches the query term as a substring as mentioned above. These issues can easily be fixed though and I may do so in the future.