Formats

Corpus Gesproken Nederlands

exception pynlpl.formats.cgn.InvalidFeatureException
exception pynlpl.formats.cgn.InvalidTagException
pynlpl.formats.cgn.parse_cgn_postag(rawtag, raisefeatureexceptions=False)

FoLiA

See the FoLiA documentation: folia.html

GIZA++

class pynlpl.formats.giza.GizaModel(filename, encoding='utf-8')
class pynlpl.formats.giza.GizaSentenceAlignment(sourceline, targetline, index)
getalignedtarget(index)

Returns target range only if source index aligns to a single consecutive range of target tokens.
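
The "single consecutive range" condition can be illustrated with a small standalone sketch. It does not use pynlpl; the function name, the list-of-pairs alignment representation, and the (start, end) return shape are assumptions made for illustration only.

```python
def aligned_target_range(alignment, source_index):
    """Return (first, last) target indices aligned to source_index,
    or None unless those indices form exactly one consecutive run.

    `alignment` is a hypothetical list of (source_index, target_index) pairs.
    """
    targets = sorted({t for s, t in alignment if s == source_index})
    if not targets:
        return None  # the source token is unaligned
    # A run without gaps spans exactly as many positions as it has members.
    if targets[-1] - targets[0] + 1 != len(targets):
        return None
    return targets[0], targets[-1]


pairs = [(0, 0), (1, 1), (1, 2), (2, 4), (2, 6)]
print(aligned_target_range(pairs, 1))  # (1, 2)
print(aligned_target_range(pairs, 2))  # None: 4 and 6 are not adjacent
```

Returning None for a gapped alignment mirrors the "only if" wording above; what the actual method returns in that case is not stated here.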

intersect(other)
class pynlpl.formats.giza.IntersectionAlignment(source2target, target2source, encoding=False)
reset()
class pynlpl.formats.giza.MultiWordAlignment(filename, encoding=False)

Source to Target alignment: reads source-target.A3.final files, in which each source word may be aligned to multiple target words (adapted from code by Sander Canisius)

reset()
targetword(index, targetwords, alignment)

Return the aligned target word for a specified index in the source words. Multiple words are concatenated with a space in between.

targetwords(index, targetwords, alignment)

Return the aligned targetwords for a specified index in the source words
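
As a rough illustration of the lookup described above, the following standalone sketch parses one annotated line in the `word ({ i j })` style used by GIZA++ A3 files and then resolves the aligned words. It is not pynlpl's implementation: the helper names are hypothetical, the indices are assumed to be 1-based positions in the plain sentence, and the sample sentences are invented.

```python
import re

# One "word ({ i j ... })" group; the index list may be empty.
GROUP = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_alignment_line(line):
    """Return a list of (word, [0-based indices]) pairs, skipping a leading NULL."""
    groups = GROUP.findall(line)
    if groups and groups[0][0] == "NULL":
        groups = groups[1:]
    return [(word, [int(i) - 1 for i in indices.split()])
            for word, indices in groups]


def aligned_words(index, other_words, alignment):
    """All words aligned to the annotated word at `index`."""
    return [other_words[i] for i in alignment[index][1]]


def aligned_word(index, other_words, alignment):
    """The same words joined with a single space."""
    return " ".join(aligned_words(index, other_words, alignment))


plain = "the small house".split()
annotated = "NULL ({ }) het ({ 1 }) huisje ({ 2 3 })"
alignment = parse_alignment_line(annotated)

print(aligned_words(1, plain, alignment))  # ['small', 'house']
print(aligned_word(1, plain, alignment))   # small house
```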

class pynlpl.formats.giza.WordAlignment(filename, encoding=False)

Target to Source alignment: reads target-source.A3.final files, in which each source word is aligned to one target word

reset()
targetword(index, targetwords, alignment)

Return the aligned targetword for a specified index in the source words

pynlpl.formats.giza.parseAlignment(tokens)

Moses

class pynlpl.formats.moses.PhraseTable(filename, quiet=False, reverse=False, delimiter='|||', score_column=3, max_sourcen=0, sourceencoder=None, targetencoder=None, scorefilter=None)
class pynlpl.formats.moses.PhraseTableClient(host='localhost', port=65432)

SoNaR

class pynlpl.formats.sonar.Corpus(corpusdir, extension='pos', restrict_to_collection='', conditionf=<function Corpus.<lambda>>, ignoreerrors=False)
class pynlpl.formats.sonar.CorpusDocument(filename, encoding='iso-8859-15')

This class represents one document/text of the Corpus (read-only)

paragraphs(with_id=False)

Extracts the paragraphs and returns them as a list of plain-text(!) paragraphs

sentences()

Iterates over all sentences in the document as (sentence_id, sentence) pairs, where sentence is a list of 4-tuples (word, id, pos, lemma)
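
The (sentence_id, sentence) structure can be consumed as in the sketch below. It works on hand-made data of the documented shape rather than on a real corpus file; the identifiers, tags, and the helper name are invented for illustration.

```python
def tokens_and_tags(sentences):
    """Yield (sentence_id, text, pos_tags) for each (sentence_id, sentence) pair."""
    for sentence_id, sentence in sentences:
        # Each token is a 4-tuple: (word, id, pos, lemma).
        words = [word for word, _id, _pos, _lemma in sentence]
        tags = [pos for _word, _id, pos, _lemma in sentence]
        yield sentence_id, " ".join(words), tags


# Invented sample data in the documented shape.
sample = [
    ("doc1.s.1", [
        ("Het", "doc1.s.1.w.1", "LID", "het"),
        ("huis", "doc1.s.1.w.2", "N", "huis"),
    ]),
]

for sentence_id, text, tags in tokens_and_tags(sample):
    print(sentence_id, text, tags)
```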

words()
class pynlpl.formats.sonar.CorpusDocumentX(filename, tree=None, index=True)

This class represents one document/text of the Corpus, loaded into memory at once and retaining the full structure

paragraphs(node=None)

Iterates over paragraphs

save(filename=None, encoding='iso-8859-15')
sentences(node=None)

Iterates over sentences

validate(formats_dir='../formats/')

Checks whether the document is valid

words(node=None)

Iterates over words

xpath(expression)

Executes an xpath expression using the correct namespaces

class pynlpl.formats.sonar.CorpusFiles(corpusdir, extension='pos', restrict_to_collection='', conditionf=<function Corpus.<lambda>>, ignoreerrors=False)
class pynlpl.formats.sonar.CorpusX(corpusdir, extension='pos', restrict_to_collection='', conditionf=<function Corpus.<lambda>>, ignoreerrors=False)
pynlpl.formats.sonar.ns(namespace)

Resolves the namespace identifier to a full URL
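
The general idea of mapping a short namespace identifier to a full URL, and of querying XML with such a mapping, can be sketched with the standard library alone. The prefix `ex`, the URL `http://example.org/ns/1.0`, the element names, and the `resolve` helper are placeholders, not the identifiers or namespaces this module actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical prefix-to-URL table; the real module defines its own.
NAMESPACES = {"ex": "http://example.org/ns/1.0"}


def resolve(prefix):
    """Resolve a short namespace identifier to its full URL."""
    return NAMESPACES[prefix]


doc = (
    '<text xmlns="http://example.org/ns/1.0">'
    "<s><w>Dit</w><w>werkt</w></s>"
    "</text>"
)
root = ET.fromstring(doc)

# Prefixed path queries need the prefix table to match namespaced elements.
words = [w.text for w in root.findall(".//ex:w", NAMESPACES)]
print(words)  # ['Dit', 'werkt']

# ElementTree stores tags in "{url}name" form.
print(root.tag == "{" + resolve("ex") + "}text")  # True
```

Without the mapping, a query for a bare `w` would find nothing, because every element in the sample lives in the default namespace.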

Taggerdata

class pynlpl.formats.taggerdata.Taggerdata(filename, encoding='utf-8', mode='r')
align(referencewords, datatuple)

Aligns the reference sentence with the tagged data

close()
next()
reset()
write(sentence)

TiMBL

class pynlpl.formats.timbl.TimblOutput(stream, delimiter=' ', ignorecolumns=[], ignorevalues=[])

A class for reading Timbl classifier output. It supports the +v+db option and ignores comments starting with #

parseDistribution(instance, start, end=None)
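
For orientation, a class distribution in Timbl output is written between braces, for example `{ N 3.00000, ADJ 1.00000 }`. The standalone sketch below parses such a span from a whitespace-split line. It is not the method above: the function name, the meaning given to `start` and `end` (positions of the opening and closing brace), and the sample line are assumptions.

```python
def parse_distribution(fields, start, end=None):
    """Parse a '{ label score, label score }' span into a dict.

    `fields` is one output line split on whitespace, `start` is the index of
    the opening brace, and `end` the index of the closing brace (searched
    for when omitted). These argument meanings are assumptions.
    """
    if end is None:
        end = fields.index("}", start)
    tokens = fields[start + 1:end]
    distribution = {}
    # Tokens alternate between label and score; scores may end in a comma.
    for label, score in zip(tokens[0::2], tokens[1::2]):
        distribution[label] = float(score.rstrip(","))
    return distribution


line = "small house N N { N 3.00000, ADJ 1.00000 }"  # invented sample line
fields = line.split()
dist = parse_distribution(fields, fields.index("{"))
print(dist)  # {'N': 3.0, 'ADJ': 1.0}
```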