Statistics and Information Theory

This module contains classes and functions for statistics and information theory. It is imported as follows:

import pynlpl.statistics

Generic functions

Amongst others, the following generic statistical functions are available:

* ``mean(list)`` - Computes the mean of a given list of numbers
* ``median(list)`` - Computes the median of a given list of numbers
* ``stddev(list)`` - Computes the standard deviation of a given list of numbers
* ``normalize(list)`` - Normalises a list of numbers so that they sum to 1.0
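As an illustration of what these helpers compute, here is a pure-Python sketch (these are illustrative reimplementations, not the pynlpl code itself; note that the library's stddev may use the sample rather than the population variant):

```python
import math

def mean(values):
    """Arithmetic average of a list of numbers."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted list; average of the middle two if even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def stddev(values):
    """Population standard deviation."""
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / len(values))

def normalize(values):
    """Scale the numbers so they sum to 1.0."""
    total = sum(values)
    return [x / total for x in values]
```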

Frequency Lists and Distributions

One of the most basic and widespread tasks in NLP is the creation of a frequency list. Counting is done by simply appending lists of tokens to the frequency list:

freqlist = pynlpl.statistics.FrequencyList()

Take care not to append strings rather than lists, unless you mean to create a frequency list over the characters rather than the words. You may want to use pynlpl.textprocessors.crude_tokeniser first:

freqlist.append(pynlpl.textprocessors.crude_tokeniser("to be or not to be"))

The count can also be incremented explicitly for a single item, using the count() method:

freqlist.count('be')
The FrequencyList offers dictionary-like access. For example, the following statement will be true for the frequency list just created:

freqlist['be'] == 2
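The counting semantics described above can be mimicked with collections.Counter, purely as an illustration (this is not pynlpl's implementation):

```python
from collections import Counter

# Illustrative sketch of FrequencyList-style counting using a Counter.
tokens = "to be or not to be".split()   # crude stand-in for crude_tokeniser
freqlist = Counter(tokens)

# Dictionary-like access: 'be' occurs twice in the sentence.
count_be = freqlist['be']

# Relative frequency, analogous to what p() returns.
p_be = freqlist['be'] / sum(freqlist.values())
```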

Normalised counts (pseudo-probabilities) can be obtained using the p() method:

freqlist.p('be')
Normalised counts can also be obtained by instantiating a Distribution instance using the frequency list:

dist = pynlpl.statistics.Distribution(freqlist)

This too offers a dictionary-like interface, where values are by definition normalised. The advantage of a Distribution class is that it offers information-theoretic methods such as entropy(), maxentropy(), perplexity() and poslog().
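These information-theoretic quantities can be computed directly from the normalised counts, as the following sketch shows (base-2 logarithms are assumed here, matching the base=2 default; this is an illustration, not the Distribution implementation):

```python
import math

# Normalised counts for "to be or not to be".
probs = {'to': 2/6, 'be': 2/6, 'or': 1/6, 'not': 1/6}

# Shannon entropy: -sum p(x) * log2 p(x)
entropy = -sum(p * math.log(p, 2) for p in probs.values())

# Maximum entropy: entropy of the uniform distribution over N types.
maxentropy = math.log(len(probs), 2)

# Perplexity: the effective number of equally likely choices.
perplexity = 2 ** entropy
```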

A frequency list can be saved to file using the save(filename) method, and loaded back from file using the load(filename) method. The output() method is a generator yielding strings for each line of output, in ranked order.

API Reference

This is a Python library containing classes for statistical and information-theoretic computations. It also contains some code from Peter Norvig (AI: A Modern Approach):

class pynlpl.statistics.Distribution(data, base=2)

A distribution can be created over a FrequencyList or a plain dictionary with numeric values. It will be normalised automatically. This implementation uses dictionaries (hashing).


entropy()

Compute the entropy of the distribution


information(type)

Computes the information content of the specified type: -log_e(p(X))


items()

Returns an unranked list of (type, prob) pairs. Use this only if you are not interested in the order.


maxentropy()

Compute the maximum entropy of the distribution: log_e(N)


mode()

Returns the type that occurs most frequently in the probability distribution

output(delimiter='\t', freqlist=None)

Generator yielding formatted strings expressing the type and probability for each item in the distribution


poslog(type)

Alias for information content

class pynlpl.statistics.FrequencyList(tokens=None, casesensitive=True, dovalidation=True)

A frequency list (implemented using dictionaries)


append(tokens)

Add a list of tokens to the frequency list. This method will count them for you.

count(type, amount=1)

Count a certain type. The counter will increase by the amount specified (defaults to one)


items()

Returns an unranked list of (type, count) pairs. Use this only if you are not interested in the order.


load(filename)

Load a frequency list from file (in the format produced by the save method)


mode()

Returns the type that occurs most frequently in the frequency list

output(delimiter='\t', addnormalised=False)

Print a representation of the frequency list


p(type)

Returns the probability (relative frequency) of the token

save(filename, addnormalised=False)

Save a frequency list to file, can be loaded later using the load method


sum()

Returns the total number of tokens


tokens()

Returns the total number of tokens


typetokenratio()

Computes the type/token ratio

class pynlpl.statistics.HiddenMarkovModel(startstate, endstate=None)
setemission(state, distribution)
viterbi(observations, doprint=False)
class pynlpl.statistics.MarkovChain(startstate, endstate=None)
accessible(fromstate, tostate)

Is state tostate directly accessible (in one step) from state fromstate (i.e., is there an edge between the states)? If so, return the probability; otherwise zero

communicates(fromstate, tostate, maxlength=999999)

Check whether one state communicates (directly or indirectly) with another. Returns the probability of the shortest path (which is likely, but not necessarily, the path of highest probability)

p(sequence, subsequence=True)

Returns the probability of the given sequence or subsequence (if subsequence=True, default).

settransitions(state, distribution)
pynlpl.statistics.dotproduct(X, Y)

Return the sum of the element-wise product of vectors X and Y.

>>> dotproduct([1, 2, 3], [1000, 100, 10])
1230

pynlpl.statistics.histogram(values, mode=0, bin_function=None)

Return a list of (value, count) pairs, summarizing the input values. Sorted by increasing value, or if mode=1, by decreasing count. If bin_function is given, map it over values first.
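A minimal sketch of this behaviour (an illustrative reimplementation, not the library code):

```python
from collections import Counter

def histogram(values, mode=0, bin_function=None):
    """Return (value, count) pairs, sorted by increasing value,
    or by decreasing count when mode=1. If bin_function is given,
    map it over the values first."""
    if bin_function is not None:
        values = [bin_function(v) for v in values]
    counts = Counter(values)
    if mode == 1:
        return sorted(counts.items(), key=lambda vc: -vc[1])
    return sorted(counts.items())
```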

pynlpl.statistics.levenshtein(s1, s2, maxdistance=9999)

Computes the Levenshtein distance between two strings. Adapted from:
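For reference, a standard dynamic-programming sketch of the Levenshtein distance (an illustration without the maxdistance cut-off, not pynlpl's implementation):

```python
def levenshtein(s1, s2):
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) turning s1 into s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (c1 != c2)))    # substitution
        previous = current
    return previous[-1]
```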


pynlpl.statistics.log2(x)

Base 2 logarithm.

>>> log2(1024)
10.0


pynlpl.statistics.mean(values)

Return the arithmetic average of the values.


pynlpl.statistics.median(values)

Return the middle value, when the values are sorted. If there are an even number of elements, try to average the middle two. If they can't be averaged (e.g. they are strings), choose one at random.

>>> median([10, 100, 11])
11
>>> median([1, 2, 3, 4])
2.5


pynlpl.statistics.mode(values)

Return the most common value in the list of values.

>>> mode([1, 2, 3, 2])
2

pynlpl.statistics.normalize(numbers, total=1.0)

Multiply each number by a constant such that the sum is 1.0 (or total).

>>> normalize([1, 2, 1])
[0.25, 0.5, 0.25]


pynlpl.statistics.product(seq)

Return the product of a sequence of numerical values.

>>> product([1, 2, 6])
12

pynlpl.statistics.stddev(values, meanval=None)

The standard deviation of a set of values. Pass in the mean if you already know it.

pynlpl.statistics.vector_add(a, b)

Component-wise addition of two vectors.

>>> vector_add((0, 1), (8, 9))
(8, 10)