Text Processors

This module contains classes and functions for text processing. It is imported as follows:

import pynlpl.textprocessors

Tokenisation

A very crude tokeniser is available in the form of the function pynlpl.textprocessors.crude_tokeniser(string). This splits punctuation characters from words and returns a list of tokens. It has no regard for abbreviations or end-of-sentence detection, however, which is functionality a more sophisticated tokeniser can provide:

tokens = pynlpl.textprocessors.crude_tokeniser("to be, or not to be.")

This will result in:

tokens == ['to', 'be', ',', 'or', 'not', 'to', 'be', '.']

N-gram extraction

The extraction of n-grams is an elemental operation in Natural Language Processing. PyNLPl offers the Windower class to accomplish this task:

tokens = pynlpl.textprocessors.crude_tokeniser("to be or not to be")
for trigram in pynlpl.textprocessors.Windower(tokens, 3):
    print(trigram)

The input to the Windower should be a list of words and a value for n. In addition, the Windower can output extra marker symbols at the beginning and at the end of the input sequence. By default, this behaviour is enabled; the begin marker is <begin> and the end marker is <end>. If this behaviour is unwanted, you can suppress it by instantiating the Windower as follows:

Windower(tokens,3, None, None)

The Windower is implemented as a Python generator and at each iteration yields a tuple of length n.
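For example, to collect all bigrams (with the markers suppressed) as space-joined strings, a small sketch:

tokens = pynlpl.textprocessors.crude_tokeniser("to be or not to be")
bigrams = [" ".join(ngram) for ngram in pynlpl.textprocessors.Windower(tokens, 2, None, None)]
# bigrams == ['to be', 'be or', 'or not', 'not to', 'to be']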

class pynlpl.textprocessors.MultiWindower(tokens, min_n=1, max_n=9, beginmarker=None, endmarker=None)

Extracts n-grams of various sizes (from min_n up to max_n) from a sequence.
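A brief sketch of possible usage, assuming the MultiWindower yields tuples just like the Windower does, but for every n between min_n and max_n:

from pynlpl.textprocessors import MultiWindower

tokens = ["to", "be", "or", "not"]
for ngram in MultiWindower(tokens, min_n=1, max_n=2):
    print(" ".join(ngram))  # prints each unigram and bigram on its own line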

class pynlpl.textprocessors.ReflowText(stream, filternontext=True)

Attempts to re-flow a text that has arbitrary line endings in it. Also undoes hyphenisation.
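A minimal sketch of possible usage, assuming ReflowText wraps a file or stream-like object and yields the re-flowed lines when iterated over:

import io
from pynlpl.textprocessors import ReflowText

brokentext = io.StringIO("This sentence was bro-\nken over several\nlines.\n")
for line in ReflowText(brokentext):
    print(line)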

class pynlpl.textprocessors.Tokenizer(stream, splitsentences=True, onesentenceperline=False, regexps=TOKENIZERRULES)

A tokenizer and sentence splitter that acts on a file/stream-like object. When iterating over the object, it yields a list of tokens per sentence (if the sentence splitter is active, which is the default), or a single token at a time (if the sentence splitter is deactivated).
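For instance, a sketch of iterating over the Tokenizer with the sentence splitter active (assuming an io.StringIO object is an acceptable stream):

import io
from pynlpl.textprocessors import Tokenizer

stream = io.StringIO("This is a first sentence. This is a second one.")
for sentence in Tokenizer(stream, splitsentences=True):
    print(" ".join(sentence))  # each sentence is a list of tokens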

class pynlpl.textprocessors.Windower(tokens, n=1, beginmarker='<begin>', endmarker='<end>')

Moves a sliding window over a list of tokens; upon iteration it yields all n-grams of the specified size, each as a tuple.

Example without markers:

>>> for ngram in Windower("This is a test .",3, None, None):
...     print(" ".join(ngram))
This is a
is a test
a test .

Example with default markers:

>>> for ngram in Windower("This is a test .",3):
...     print(" ".join(ngram))
<begin> <begin> This
<begin> This is
This is a
is a test
a test .
test . <end>
. <end> <end>
pynlpl.textprocessors.calculate_overlap(haystack, needle, allowpartial=True)

Calculate the overlap between two sequences. Yields (overlap, placement) tuples (multiple, because there may be multiple overlaps!). The former is the part of the sequence that overlaps; the latter is -1 if the overlap is on the left side, 0 if it is a subset, 1 if it overlaps on the right side, and 2 if it is an identical match.
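A usage sketch; the example merely iterates over the yielded (overlap, placement) tuples without assuming particular output values:

from pynlpl.textprocessors import calculate_overlap

haystack = ["the", "quick", "brown", "fox"]
needle = ["brown", "fox", "jumps"]
for overlap, placement in calculate_overlap(haystack, needle):
    print(overlap, placement)  # the overlapping part and where it was found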

pynlpl.textprocessors.crude_tokenizer(text)

Deprecated alias; replaced by tokenize().

pynlpl.textprocessors.find_keyword_in_context(tokens, keyword, contextsize=1)

Find a keyword in a particular sequence of tokens and return the local context. contextsize is the number of words to the left and right. The keyword may consist of multiple words, in which case it should be passed as a tuple or list.
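A sketch of a possible call, assuming the function yields one local context per occurrence of the keyword:

from pynlpl.textprocessors import find_keyword_in_context

tokens = ["to", "be", "or", "not", "to", "be"]
for context in find_keyword_in_context(tokens, "be", contextsize=1):
    print(context)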

pynlpl.textprocessors.is_end_of_sentence(tokens, i)
pynlpl.textprocessors.split_sentences(tokens)

Split sentences (based on tokenised data). Returns the sentences as a list of lists of tokens, where each sentence is a list of tokens.
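For example, combining tokenize() and split_sentences() (a small sketch):

from pynlpl.textprocessors import tokenize, split_sentences

tokens = tokenize("This is one sentence. This is another.")
for sentence in split_sentences(tokens):
    print(" ".join(sentence))  # one sentence per line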

pynlpl.textprocessors.strip_accents(s, encoding='utf-8')

Strip diacritics from characters and return a plain ASCII representation of the string.
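For example (a small sketch; the exact ASCII form depends on the normalisation applied):

from pynlpl.textprocessors import strip_accents

print(strip_accents("café naïve"))  # expected to print something like: cafe naive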

pynlpl.textprocessors.swap(tokens, maxdist=2)

Perform a swap operation on a sequence of tokens, exhaustively swapping all tokens up to the maximum specified distance. This is a subset of all permutations.
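A sketch of possible usage, assuming swap() yields each swapped variant of the token sequence:

from pynlpl.textprocessors import swap

for variant in swap(["a", "b", "c", "d"], maxdist=2):
    print(variant)  # each variant has two tokens (at most maxdist apart) exchanged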

pynlpl.textprocessors.tokenise(text, regexps=TOKENIZERRULES)

British-English spelling alias for tokenize().

pynlpl.textprocessors.tokenize(text, regexps=TOKENIZERRULES)

Tokenizes a string and returns a list of tokens

Parameters:
  • text (string) – The text to tokenise
  • regexps (tuple/list of regular expressions) – Regular expressions to use as tokeniser rules (default: pynlpl.textprocessors.TOKENIZERRULES)
Returns:

A list of tokens

Examples:

>>> for token in tokenize("This is a test."):
...    print(token)
This
is
a
test
.