Postgresql Regexp Stopwords

While working on a NLP project with OCR-derived text, I discovered a need that wasn't immediately met by the built in parsers and dictionaries which ship with Postgresql. Fortunately text search in Postgresql is fairly extensible.

This post looks at the process of implementing a custom dictionary which can be used to create stop words which are really stop regular expressions.

Background on Parsing, Dictionaries, and Stop Words

The first step in using most text search features is converting unstructured text into text search vectors. A text search vector is a set of (lexeme, position) pairs.

=# select to_tsvector('Time is an illusion.  Lunchtime doubly so.');
                to_tsvector
--------------------------------------------
 'doubli':6 'illus':4 'lunchtim':5 'time':1
(1 row)

As you might expect, the internal representation of a ts_vector is simply information about where words are located inside a big string.

typedef struct
{
  int32   vl_len_;    /* varlena header (do not touch directly!) */
  int32   size; 
  WordEntry entries[1];   /* variable length */
  /* lexemes follow the entries[] array */
} TSVectorData;

The transformation from text to ts_vector involves parsing text into tokens, then filtering the tokens through a dictionary which may change or eliminate words.

To tsvector

A text search configuration can be used to map token categories to dictionaries. The official documentation contains additional details.

The Problem Statement

The text corpus is derived from an OCR process which results in some unhelpful tokens.

a-433
-43 44 40
Lunchtime is 46 #@$dfsdf an illusion.Q!~ Lunchtime ksdfkjsadjfksdjf so.
431,-

So how to get pleasant ts_vectors?

One solution would be adding a token type *ocr gibberish* to the parser and removing mappings from ocr_gibberish to any dictionary. While you can write your own parser, the default is pretty good, and it does not appear to be easily extended.

Another approach would be to use a dictionary as a white list. Unfortunately this corpus has creative authors who use words -- and regularly invent words -- not available in a dictionary.

Due to these limitations, the chosen solution was to create a dictionary which removes tokens based on regular expressions.

Writing a Custom Dictionary

This turned out to be fairly painless thanks to examples in pg_contrib.

Implementing a dictionary required implementing just two functions. dict_exclude_init to compile regular expressions from a rule file and dict_exclude_lexize to apply the regular expressions in the lexize process.

Once these functions are implemented and installed, the dictionary can be added.

=# CREATE TEXT SEARCH TEMPLATE dict_exclude_template (
        LEXIZE = dict_exclude_lexize,
        INIT   = dict_exclude_init
);

 =# CREATE TEXT SEARCH DICTIONARY dict_exclude (
  TEMPLATE = dict_exclude_template
);

A text search configuration can pass tokens to dict_exclude first, which will prevent other dictionaries from resolving lexemes.

create text search configuration ocr_gibberish ( COPY = pg_catalog.english );
alter text search configuration ocr_gibberish 
  alter mapping for asciihword, asciiword
    with dict_exclude, english_stem;

by adding a regular expression like ([^aeiouAEIOU]{5}) to an exclude.rules file, some gibberish can be removed.

=# select to_tsvector('ocr_gibberish', 
                    'Time is an illusion. sdfsdfsdf  Lunchtime doubly so.');

                to_tsvector                 
--------------------------------------------
 'doubli':7 'illus':4 'lunchtim':6 'time':1

Additional details on using dict_exclude are available in the project's readme.

If you have solved this problem another way, please let me know. A solution that did not require an external library would be preferable.