Formats¶

Corpus Gesproken Nederlands¶

exception pynlpl.formats.cgn.InvalidFeatureException¶

exception pynlpl.formats.cgn.InvalidTagException¶

pynlpl.formats.cgn.parse_cgn_postag(rawtag, raisefeatureexceptions=False)¶
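
The signature above does not say what the function returns, so the sketch below does not try to reproduce it. It only illustrates the general shape of a CGN part-of-speech tag, a head followed by a parenthesised, comma-separated feature list. The helper name `split_cgn_tag` and the regular expression are hypothetical and not part of pynlpl.

```python
import re

# Hypothetical helper, NOT pynlpl's parse_cgn_postag: it splits a CGN-style
# tag such as "N(soort,ev,basis,zijd,stan)" into a head and a feature list.
CGN_TAG = re.compile(r"^(?P<head>[A-Z]+)\((?P<features>[^()]*)\)$")


def split_cgn_tag(rawtag):
    """Return (head, features) for a CGN-style tag string."""
    match = CGN_TAG.match(rawtag.strip())
    if match is None:
        raise ValueError("not a CGN-style tag: %r" % rawtag)
    # An empty feature list, as in "LET()", yields no features.
    features = [f for f in match.group("features").split(",") if f]
    return match.group("head"), features


head, features = split_cgn_tag("N(soort,ev,basis,zijd,stan)")
```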

FoLiA¶

See the FoLiA documentation: folia.html

GIZA++¶

class pynlpl.formats.giza.GizaModel(filename, encoding='utf-8')¶

class pynlpl.formats.giza.GizaSentenceAlignment(sourceline, targetline, index)¶

getalignedtarget(index)¶: Returns the aligned target range, but only if the source index aligns to a single consecutive range of target tokens.
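
The "single consecutive range" condition can be sketched as follows. This is a minimal illustration of the check, not the library's implementation; the function name `consecutive_range` and the `(first, last)` return shape are assumptions.

```python
def consecutive_range(target_indices):
    """Return (first, last) if the indices form one unbroken run, else None.

    Hypothetical helper: it only illustrates the consecutiveness check.
    """
    if not target_indices:
        return None
    ordered = sorted(set(target_indices))
    # A run without gaps spans exactly as many positions as it has members.
    if ordered[-1] - ordered[0] + 1 != len(ordered):
        return None
    return ordered[0], ordered[-1]


contiguous = consecutive_range([2, 3, 4])   # one unbroken run
gapped = consecutive_range([2, 4])          # position 3 is missing
```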

intersect(other)¶

class pynlpl.formats.giza.IntersectionAlignment(source2target, target2source, encoding=False)¶

reset()¶

class pynlpl.formats.giza.MultiWordAlignment(filename, encoding=False)¶

Source to Target alignment: reads source-target.A3.final files, in which each source word may be aligned to multiple target words (adapted from code by Sander Canisius)

reset()¶

targetword(index, targetwords, alignment)¶: Return the aligned target word for a specified index in the source words. Multiple words are joined with a single space.

targetwords(index, targetwords, alignment)¶: Return the aligned target words for a specified index in the source words.
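
The relationship between the two lookups can be sketched as below. The representation of `alignment` is an assumption made for illustration (one list of target positions per source position); the function names are hypothetical and the code does not use pynlpl.

```python
def aligned_target_words(index, targetwords, alignment):
    """List the target words aligned to the source word at `index`.

    Assumption: alignment[index] is a list of positions into targetwords.
    """
    return [targetwords[i] for i in alignment[index]]


def aligned_target_phrase(index, targetwords, alignment):
    """Join the aligned target words with a single space."""
    return " ".join(aligned_target_words(index, targetwords, alignment))


targetwords = ["the", "house", "is", "small"]
alignment = [[0, 1], [2], [3]]  # source word 0 aligns to two target words

phrase = aligned_target_phrase(0, targetwords, alignment)
```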

class pynlpl.formats.giza.WordAlignment(filename, encoding=False)¶

Target to Source alignment: reads target-source.A3.final files, in which each source word is aligned to one target word

reset()¶

targetword(index, targetwords, alignment)¶: Return the aligned targetword for a specified index in the source words

pynlpl.formats.giza.parseAlignment(tokens)¶
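
The signature does not say what `tokens` contains, and the sketch below does not reproduce this function. It only illustrates the alignment line found in GIZA++ `A3.final` files, where each word is followed by `({ ... })` holding 1-based positions in the other sentence. The helper name and the choice to convert to 0-based positions are assumptions.

```python
import re

# One entry looks like:  word ({ 1 2 })
ENTRY = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_a3_alignment_line(line):
    """Return [(word, [0-based positions]), ...] for one alignment line.

    Hypothetical helper, not pynlpl's parseAlignment.
    """
    pairs = []
    for word, indices in ENTRY.findall(line):
        # GIZA++ positions are 1-based; shift them for Python indexing.
        pairs.append((word, [int(i) - 1 for i in indices.split()]))
    return pairs


line = "NULL ({ }) das ({ 1 }) Haus ({ 2 3 }) ist ({ 4 })"
pairs = parse_a3_alignment_line(line)
```

The leading `NULL` entry collects positions that are aligned to no word, which is why it is kept in the result rather than dropped.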

Moses¶

class pynlpl.formats.moses.PhraseTable(filename, quiet=False, reverse=False, delimiter='|||', score_column=3, max_sourcen=0, sourceencoder=None, targetencoder=None, scorefilter=None)¶
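
The defaults `delimiter='|||'` and `score_column=3` match the usual layout of a Moses phrase-table line (`source ||| target ||| scores ||| ...`). A minimal sketch of reading one such line follows. It assumes `score_column` counts columns from 1; the helper name is hypothetical and the code does not call pynlpl.

```python
def parse_phrase_table_line(line, delimiter="|||", score_column=3):
    """Split one phrase-table line into (source, target, scores).

    Assumption: score_column is 1-based, so 3 selects the third field.
    """
    fields = [field.strip() for field in line.split(delimiter)]
    source, target = fields[0], fields[1]
    scores = [float(s) for s in fields[score_column - 1].split()]
    return source, target, scores


sample = "das Haus ||| the house ||| 0.8 0.5 0.7 0.4 ||| 0-0 1-1 ||| 10 8 8"
source, target, scores = parse_phrase_table_line(sample)
```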

class pynlpl.formats.moses.PhraseTableClient(host='localhost', port=65432)¶

SoNaR¶

class pynlpl.formats.sonar.Corpus(corpusdir, extension='pos', restrict_to_collection='', conditionf=<function Corpus.<lambda>>, ignoreerrors=False)¶

class pynlpl.formats.sonar.CorpusDocument(filename, encoding='iso-8859-15')¶

This class represents one document/text of the Corpus (read-only).

paragraphs(with_id=False)¶: Extracts paragraphs and returns them as a list. Note that the paragraphs are plain text.

sentences()¶: Iterates over all sentences in the document as (sentence_id, sentence) pairs, where sentence is a list of 4-tuples (word, id, pos, lemma).
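
The nested structure described for sentences() can be consumed as sketched below. The sample data is invented for illustration (the identifiers and tags are hypothetical), and the code works on plain Python tuples rather than on a real CorpusDocument.

```python
# Hand-written stand-in for what iterating over sentences() is described
# to yield: (sentence_id, [(word, id, pos, lemma), ...]).
sample_sentences = [
    ("s.1", [
        ("Dit", "s.1.w.1", "VNW", "dit"),
        ("werkt", "s.1.w.2", "WW", "werken"),
    ]),
    ("s.2", [
        ("Goed", "s.2.w.1", "ADJ", "goed"),
    ]),
]


def lemmas_by_sentence(sentences):
    """Map each sentence id to the list of lemmas in that sentence."""
    return {
        sentence_id: [lemma for (_word, _id, _pos, lemma) in sentence]
        for sentence_id, sentence in sentences
    }


lemmas = lemmas_by_sentence(sample_sentences)
```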

words()¶

class pynlpl.formats.sonar.CorpusDocumentX(filename, tree=None, index=True)¶

This class represents one document/text of the Corpus, loaded into memory at once and retaining the full structure.

paragraphs(node=None)¶: iterate over paragraphs

save(filename=None, encoding='iso-8859-15')¶

sentences(node=None)¶: iterate over sentences

validate(formats_dir='../formats/')¶: Checks whether the document is valid

words(node=None)¶: iterate over words

xpath(expression)¶: Executes an xpath expression using the correct namespaces
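
Why the "correct namespaces" matter can be shown with the standard library alone. The sketch below uses `xml.etree.ElementTree` rather than this class, and the namespace URL and prefix are placeholders, not the ones used by the corpus.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace, chosen for illustration only.
NAMESPACES = {"ex": "http://example.org/ns"}

XML = (
    '<text xmlns="http://example.org/ns">'
    "<s><w>Hallo</w><w>wereld</w></s>"
    "</text>"
)

root = ET.fromstring(XML)

# With the prefix mapped to the document's namespace, the words are found.
words = [w.text for w in root.findall(".//ex:w", NAMESPACES)]

# Without a namespace, the same path matches nothing in this document.
unqualified = root.findall(".//w")
```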

class pynlpl.formats.sonar.CorpusFiles(corpusdir, extension='pos', restrict_to_collection='', conditionf=<function Corpus.<lambda>>, ignoreerrors=False)¶

class pynlpl.formats.sonar.CorpusX(corpusdir, extension='pos', restrict_to_collection='', conditionf=<function Corpus.<lambda>>, ignoreerrors=False)¶

pynlpl.formats.sonar.ns(namespace)¶: Resolves the namespace identifier to a full URL

Taggerdata¶

class pynlpl.formats.taggerdata.Taggerdata(filename, encoding='utf-8', mode='r')¶

align(referencewords, datatuple)¶: Aligns the reference sentence with the tagged data

close()¶

next()¶

reset()¶

write(sentence)¶

TiMBL¶

class pynlpl.formats.timbl.TimblOutput(stream, delimiter=' ', ignorecolumns=[], ignorevalues=[])¶

A class for reading TiMBL classifier output. It supports the +v+db option and ignores comments starting with #.

parseDistribution(instance, start, end=None)¶
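
As an illustration of what reading a class distribution involves, the sketch below parses a `{ label weight, label weight }` block from a space-delimited output line. The line layout is an assumption about `+v+db` output, and the function is a hypothetical stand-in that does not reproduce pynlpl's parseDistribution.

```python
def parse_distribution(tokens, start):
    """Parse '{ A 3.0, B 1.0 }' starting at tokens[start].

    Returns (distribution, index of the token after the closing brace).
    Assumed layout, for illustration only.
    """
    if tokens[start] != "{":
        raise ValueError("distribution must start with '{'")
    distribution = {}
    i = start + 1
    while i < len(tokens) and tokens[i] != "}":
        label = tokens[i]
        # Weights are separated by commas, which stick to the number.
        distribution[label] = float(tokens[i + 1].rstrip(","))
        i += 2
    if i >= len(tokens):
        raise ValueError("distribution is missing its closing '}'")
    return distribution, i + 1


line = "a b c X X { X 3.00000, Y 1.00000 }"
tokens = line.split(" ")
distribution, next_index = parse_distribution(tokens, tokens.index("{"))
```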