pynlpl.formats.folia.Document

class pynlpl.formats.folia.Document(*args, **kwargs)

Bases: object

This is the FoLiA Document and holds all its data in memory.

All FoLiA elements have to be associated with a FoLiA document. Besides holding elements, the document may hold metadata including declarations, and an index of all IDs.

Method Summary

__init__(*args, **kwargs) Start/load a FoLiA document:
add(text) Alias for Document.append()
alias(annotationtype, set[, fallback]) Return the alias for a set if one applies; otherwise return the unaltered set, provided fallback is enabled
append(text) Add a text (or speech) to the document:
count(Class[, set, recursive, ignore]) See AbstractElement.count()
create(Class, *args, **kwargs) Create an element associated with this Document.
date([value]) Get or set the document’s date from/in the metadata.
declare(annotationtype, set, **kwargs) Declare a new annotation type to be used in the document.
declared(annotationtype, set) Checks if the annotation type is present (i.e. declared) in the document.
defaultannotator(annotationtype[, set]) Obtain the default annotator for the specified annotation type and set.
defaultannotatortype(annotationtype[, set]) Obtain the default annotator type for the specified annotation type and set.
defaultdatetime(annotationtype[, set]) Obtain the default datetime for the specified annotation type and set.
defaultset(annotationtype) Obtain the default set for the specified annotation type.
findwords(*args, **kwargs)
items() Returns a depth-first flat list of all items in the document
json() Serialise the document to a dict ready for serialisation to JSON.
jsondeclarations() Return all declarations in a form ready to be serialised to JSON.
language([value]) Get or set the document’s language (ISO-639-3) from/in the metadata.
license([value]) Get or set the document’s license from/in the metadata.
load(filename) Load a FoLiA XML file.
paragraphs([index]) Return a generator of all paragraphs found in the document.
parsemetadata(node) Internal method to parse metadata
parsesubmetadata(node)
parsexml(node[, ParentClass]) Internal method.
parsexmldeclarations(node) Internal method to parse XML declarations
pendingvalidation([warnonly]) Perform any pending validations
publisher([value]) Get or set the document’s publisher from/in the metadata.
save([filename]) Save the document to file.
select(Class[, set, recursive, ignore]) See AbstractElement.select()
sentences([index]) Return a generator of all sentences found in the document.
setimdi(node) OBSOLETE
text([cls, retaintokenisation]) Returns the text of the entire document (returns a unicode instance)
title([value]) Get or set the document’s title from/in the metadata
unalias(annotationtype, alias) Return the set for an alias (if applicable, raises an exception otherwise)
words([index]) Return a generator of all active words found in the document.
xml() Serialise the document to XML.
xmldeclarations() Internal method to generate XML nodes for all declarations
xmlmetadata() Internal method to serialize metadata to XML
xmlstring() Return the XML representation of the document as a string.
xpath(query) Run an XPath expression and parse the resulting elements.

Attributes

IDSEPARATOR

Method Details

__init__(*args, **kwargs)

Start/load a FoLiA document:

There are four sources of input for loading a FoLiA document:

  1. Create a new document by specifying an ID:

    doc = folia.Document(id='test')
    
  2. Load a document from FoLiA or D-Coi XML file:

    doc = folia.Document(file='/path/to/doc.xml')
    
  3. Load a document from an XML string:

    doc = folia.Document(string='<FoLiA>....</FoLiA>')
    
  4. Load a document by passing a parsed XML tree (lxml.etree):

    doc = folia.Document(tree=xmltree)

Additionally, a loading mode can be set with the mode= keyword argument. Two modes are listed here:

  • folia.Mode.MEMORY - The entire FoLiA Document will be loaded into memory. This is the default mode and the only mode in which documents can be manipulated and saved again.
  • folia.Mode.XPATH - The full XML tree will still be loaded into memory, but conversion to FoLiA classes occurs only when queried. This mode can be used when the full power of XPath is required.

Keyword Arguments:

  • setdefinition (dict) – A dictionary of set definitions, the key corresponds to the set name, the value is a SetDefinition instance
  • loadsetdefinitions (bool) – download and load set definitions (default: False)
  • deepvalidation (bool) – Do deep validation of the document (default: False), implies loadsetdefinitions
  • textvalidation (bool) – Do validation of text consistency (default: False)
  • preparsexmlcallback (function) – Callback for a function taking one argument (node, an lxml node). Will be called whenever an XML element is parsed into FoLiA. The function should return an instance inherited from folia.AbstractElement, or None to abort parsing this element (and all its children)
  • parsexmlcallback (function) – Callback for a function taking one argument (element, a FoLiA element). Will be called whenever an XML element is parsed into FoLiA. The function should return an instance inherited from folia.AbstractElement, or None to abort adding this element (and all its children)
  • debug (bool) – Boolean to enable/disable debug
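
The parsexmlcallback hook described above can be sketched as a small filter. This is a minimal sketch, assuming the callback receives each parsed FoLiA element and that returning None drops it, as the argument description states. The names skip_by_classname and load_without, and the choice to match on class names, are illustrative and not part of pynlpl.

```python
def skip_by_classname(skip_names):
    """Build a callback that drops elements whose class name is in skip_names.

    Hypothetical helper for illustration; not part of pynlpl.
    """
    skip_names = frozenset(skip_names)

    def callback(element):
        # Returning None asks the parser not to add this element (or its
        # children); returning the element keeps it unchanged.
        if type(element).__name__ in skip_names:
            return None
        return element

    return callback


def load_without(path, skip_names):
    """Load a FoLiA file, passing every parsed element through the filter."""
    # Imported here so that the filter above can be used without pynlpl.
    from pynlpl.formats import folia

    return folia.Document(file=path, parsexmlcallback=skip_by_classname(skip_names))
```

Matching on the class name keeps the filter independent of the library import; a stricter variant could compare against the FoLiA classes themselves with isinstance().
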
add(text)

Alias for Document.append()

alias(annotationtype, set, fallback=False)

Return the alias for a set if one applies; otherwise return the unaltered set, provided fallback is enabled.

append(text)

Add a text (or speech) to the document:

Example 1:

doc.append(folia.Text)

Example 2:

doc.append(folia.Text(doc, id='example.text'))

Example 3:

doc.append(folia.Speech)

count(Class, set=None, recursive=True, ignore=True)

See AbstractElement.count()

create(Class, *args, **kwargs)

Create an element associated with this Document. This method may be obsolete and removed later.

date(value=None)

Get or set the document’s date from/in the metadata.

No arguments: get the document’s date from the metadata.
Argument: set the document’s date in the metadata.

declare(annotationtype, set, **kwargs)

Declare a new annotation type to be used in the document.

Keyword arguments can be used to set defaults for any annotation of this type and set.

Parameters:
  • annotationtype – The type of annotation. Pass either the corresponding annotation class (such as PosAnnotation) or a member of AnnotationType (such as AnnotationType.POS).
  • set (str) – the set, should formally be a URL pointing to the set definition
Keyword Arguments:
 
  • annotator (str) – Sets a default annotator
  • annotatortype – Should be either AnnotatorType.MANUAL or AnnotatorType.AUTO, indicating whether the annotation was performed manually or by an automated process.
  • datetime (datetime.datetime) – Sets the default datetime
  • alias (str) – Defines alias that may be used in set attribute of elements instead of the full set name

Example:

doc.declare(folia.PosAnnotation, 'http://some/path/brown-tag-set', annotator="mytagger", annotatortype=folia.AnnotatorType.AUTO)
declared(annotationtype, set)

Checks if the annotation type is present (i.e. declared) in the document.

Parameters:
  • annotationtype – The type of annotation. Pass either the corresponding annotation class (such as PosAnnotation) or a member of AnnotationType (such as AnnotationType.POS).
  • set (str) – the set, should formally be a URL pointing to the set definition (aliases are also supported)

Example:

if doc.declared(folia.PosAnnotation, 'http://some/path/brown-tag-set'):
    ..
Returns: bool
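
The declared() check and the declare() call can be combined so that a declaration is only made once. The sketch below assumes nothing beyond the two signatures documented here; ensure_declared is a hypothetical helper name, not a pynlpl function.

```python
def ensure_declared(doc, annotationtype, set_url, **defaults):
    """Declare (annotationtype, set_url) on doc unless it is already declared.

    Returns True if a new declaration was made, False otherwise.
    Hypothetical helper for illustration; not part of pynlpl.
    """
    if doc.declared(annotationtype, set_url):
        return False
    # Keyword arguments (annotator, annotatortype, datetime, alias) are
    # passed through unchanged as defaults for this type and set.
    doc.declare(annotationtype, set_url, **defaults)
    return True


# Usage, assuming pynlpl is installed:
#   from pynlpl.formats import folia
#   doc = folia.Document(id='example')
#   ensure_declared(doc, folia.PosAnnotation, 'http://some/path/brown-tag-set',
#                   annotator='mytagger', annotatortype=folia.AnnotatorType.AUTO)
```
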
defaultannotator(annotationtype, set=None)

Obtain the default annotator for the specified annotation type and set.

Parameters:
  • annotationtype – The type of annotation. Pass either the corresponding annotation class (such as PosAnnotation) or a member of AnnotationType (such as AnnotationType.POS).
  • set (str) – the set, should formally be a URL pointing to the set definition
Returns:

the default annotator (str)

Raises:

NoDefaultError if the annotation type does not exist or if there is ambiguity (multiple sets for the same type)

defaultannotatortype(annotationtype, set=None)

Obtain the default annotator type for the specified annotation type and set.

Parameters:
  • annotationtype – The type of annotation. Pass either the corresponding annotation class (such as PosAnnotation) or a member of AnnotationType (such as AnnotationType.POS).
  • set (str) – the set, should formally be a URL pointing to the set definition
Returns:

AnnotatorType.AUTO or AnnotatorType.MANUAL

Raises:

NoDefaultError if the annotation type does not exist or if there is ambiguity (multiple sets for the same type)

defaultdatetime(annotationtype, set=None)

Obtain the default datetime for the specified annotation type and set.

Parameters:
  • annotationtype – The type of annotation. Pass either the corresponding annotation class (such as PosAnnotation) or a member of AnnotationType (such as AnnotationType.POS).
  • set (str) – the set, should formally be a URL pointing to the set definition
Returns:

the default datetime

Raises:

NoDefaultError if the annotation type does not exist or if there is ambiguity (multiple sets for the same type)

defaultset(annotationtype)

Obtain the default set for the specified annotation type.

Parameters: annotationtype – The type of annotation. Pass either the corresponding annotation class (such as PosAnnotation) or a member of AnnotationType (such as AnnotationType.POS).
Returns: the set (str)
Raises: NoDefaultError if the annotation type does not exist or if there is ambiguity (multiple sets for the same type)

findwords(*args, **kwargs)
items()

Returns a depth-first flat list of all items in the document

json()

Serialise the document to a dict ready for serialisation to JSON.

Example:

import json
jsondoc = json.dumps(doc.json())
jsondeclarations()

Return all declarations in a form ready to be serialised to JSON.

Returns: list of dict
language(value=None)

No arguments: get the document’s language (ISO-639-3) from the metadata.
Argument: set the document’s language (ISO-639-3) in the metadata.

license(value=None)

No arguments: get the document’s license from the metadata.
Argument: set the document’s license in the metadata.

load(filename)

Load a FoLiA XML file.

Argument:
filename (str): The file to load
paragraphs(index=None)

Return a generator of all paragraphs found in the document.

If an index is specified, return the n’th paragraph only (starting at 0)

parsemetadata(node)

Internal method to parse metadata

parsesubmetadata(node)
parsexml(node, ParentClass=None)

Internal method.

This is the main XML parser, will invoke class-specific XML parsers.

parsexmldeclarations(node)

Internal method to parse XML declarations

pendingvalidation(warnonly=None)

Perform any pending validations

Parameters: warnonly (bool) – Warn only (True) or raise exceptions (False). If set to None, the value is determined by the document’s FoLiA version (warn only before FoLiA v1.5).
Returns: bool
publisher(value=None)

No arguments: get the document’s publisher from the metadata.
Argument: set the document’s publisher in the metadata.

save(filename=None)

Save the document to file.

Parameters: filename (*) – The filename to save to. If not set (None, default), saves to the same file as loaded from.
select(Class, set=None, recursive=True, ignore=True)

See AbstractElement.select()

sentences(index=None)

Return a generator of all sentences found in the document, except for sentences in quotes.

If an index is specified, return the n’th sentence only (starting at 0)

setimdi(node)

OBSOLETE

text(cls='current', retaintokenisation=False)

Returns the text of the entire document (returns a unicode instance)

title(value=None)

Get or set the document’s title from/in the metadata

No arguments: get the document’s title from the metadata.
Argument: set the document’s title in the metadata.
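
The get-or-set accessors (date(), language(), license(), publisher() and title()) share one calling convention: no argument reads the value, one argument writes it. A minimal sketch of that convention follows. The helper names metadata_summary and retitle are hypothetical, and the sketch assumes only the signatures documented on this page.

```python
def metadata_summary(doc):
    """Read the common metadata fields by calling each accessor without arguments."""
    return {
        "title": doc.title(),
        "date": doc.date(),
        "language": doc.language(),
        "license": doc.license(),
        "publisher": doc.publisher(),
    }


def retitle(doc, new_title):
    """Set the title by passing an argument, then read it back."""
    # Because value defaults to None, passing None reads rather than writes.
    doc.title(new_title)
    return doc.title()
```
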

unalias(annotationtype, alias)

Return the set for an alias (if applicable, raises an exception otherwise)

words(index=None)

Return a generator of all active words found in the document. Does not descend into annotation layers, alternatives, originals, or suggestions.

If an index is specified, return the n’th word only (starting at 0)
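
Because words() returns a generator, a prefix of the document can be read without visiting every word. The sketch below assumes each yielded word offers a text() method, as FoLiA elements in this module generally do; first_words is a hypothetical helper name.

```python
from itertools import islice


def first_words(doc, n):
    """Return the text of the first n words of doc as a list of strings."""
    # islice stops consuming the generator after n items, so the rest of
    # the document is never iterated.
    return [word.text() for word in islice(doc.words(), n)]
```

For a single word at a known position, the index argument (doc.words(0) for the first word) is the documented alternative.
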

xml()

Serialise the document to XML.

Returns: lxml.etree.Element
xmldeclarations()

Internal method to generate XML nodes for all declarations

xmlmetadata()

Internal method to serialize metadata to XML

xmlstring()

Return the XML representation of the document as a string.

xpath(query)

Run an XPath expression and parse the resulting elements. Don’t forget to use the FoLiA namespace in your expressions, using folia: or the short form f:
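
Why the namespace prefix matters can be shown with the standard library alone. The sketch below does not call Document.xpath(); it uses xml.etree.ElementTree on a tiny hand-written snippet (not a valid FoLiA document) and assumes the FoLiA namespace URI is http://ilk.uvt.nl/folia.

```python
import xml.etree.ElementTree as ET

# Assumed FoLiA namespace URI, bound to both prefixes mentioned above.
FOLIA_NS = "http://ilk.uvt.nl/folia"
NAMESPACES = {"f": FOLIA_NS, "folia": FOLIA_NS}

# Just enough structure for the query; real documents carry far more.
SNIPPET = (
    '<FoLiA xmlns="http://ilk.uvt.nl/folia">'
    "<text><s><w/><w/></s></text>"
    "</FoLiA>"
)

root = ET.fromstring(SNIPPET)

# With the prefix, the query matches the namespaced <w> elements.
prefixed = root.findall(".//f:w", NAMESPACES)

# Without it, the query looks for <w> in no namespace and finds nothing.
unprefixed = root.findall(".//w")

print(len(prefixed), len(unprefixed))
```

By the same reasoning, a query passed to xpath() would be written with the prefix, for example //f:w rather than //w.
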