Discussion of Tool Format Conversion to AG
Background and General Information
Purpose
The purpose of the workshop is to use the AG format as a common format that multiple annotation tools can import from and export to. The end goal is to foster interoperability among the various tools. To accomplish this, participants must agree not only on the structural format but also on the semantics and metadata. Many decisions will have to be made. From our discussions with participants so far, we have compiled a list of issues and of assumptions that were made. We (MITRE) want to define as little as possible up front; we hope to foster discussion about the proper way to accomplish these goals. We hope this page will serve as a repository of consensus (or points of discussion) on how the conversions should be done.
Developers for the Multimodal Annotation Workshop 2007
- EXMARaLDA - Thomas Schmidt (Thomas.schmidt@uni-hamburg.de)
- Transformer Tool - Oliver Ehmer (oliver.ehmer@romanistik.uni-freiburg.de)
- ANVIL - Michael Kipp
- C-BAS - Kevin Moffitt
- MacVisSTA - Travis Rose
- Theme - Magnus Magnusson
- ELAN - Jeffrey Hoyt (initial ELAN conversion, jchoyt@mitre.org)
Tool Comparison Matrix
Thomas Schmidt was good enough to share an excellent resource he's created. From his email of May 4, 2007: "I've made a small matrix to compare the tool formats of our task - maybe others on the list have a use for this too. I've added TASX because it was part of the previous workshop. Transformer somehow does not fit in there, because it does not define its own data format (or does it?). Of course, all of this is based on my own knowledge of the formats which is fairly superficial in many cases (esp. for C-BAS and MacVisSTA) - if you find errors or missing information, please correct!"
Observations from Thomas Schmidt
Thomas Schmidt was also kind enough to break down the interoperability into four levels of increasing difficulty:
Level 1: Conversion from tool format A to the AG format.
Since AGs are a very general data structure, more specific data models are very likely to be a subset of the AG data model. This task is therefore relatively easy. It simply requires some one-to-one mappings (e.g. TIME_SLOTS to Anchors, events to Annotations) and some decisions about how to translate additional structural information to Feature elements in the AG file (e.g. tier assignments).
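As a rough illustration of the one-to-one mappings described above, the following Python sketch converts a tiny in-memory tool document into AG-like records. The dictionary layout, the function name to_ag, and the IDs are assumptions made for this example only; they do not reproduce any tool's actual file format or the AG XML schema.

```python
def to_ag(tool_doc):
    """Map a simplified tool document to AG-like anchors and annotations."""
    # One anchor per time slot (one-to-one mapping).
    anchors = {
        slot_id: {"id": slot_id, "offset": offset_ms}
        for slot_id, offset_ms in tool_doc["time_slots"].items()
    }
    annotations = []
    for tier in tool_doc["tiers"]:
        for event in tier["events"]:
            annotations.append({
                "start": event["start"],
                "end": event["end"],
                # Additional structural information (here, the tier
                # assignment) is carried along as a feature.
                "features": {"tier": tier["name"],
                             "description": event["text"]},
            })
    return {"anchors": anchors, "annotations": annotations}


# Hypothetical input: two time slots and one event on one tier.
example = {
    "time_slots": {"ts1": 0, "ts2": 480},
    "tiers": [{"name": "W-Words",
               "events": [{"start": "ts1", "end": "ts2", "text": "this"}]}],
}
ag = to_ag(example)
```

How the tier assignment is finally encoded is one of the decisions this page is meant to settle; the sketch only shows that the information has to travel with each annotation.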
Level 2: Round-trip from tool format A to AG and back to tool format A
This is what Jeff and I did so far. This task relies on the decisions made for tool-format-to-AG-conversion, i.e. the additional structural information is reconstructed from the information in the Feature elements. If this is done carefully, it should be possible to do the round-tripping without losing any information.
Level 3: Conversion from tool format A to AG, conversion from AG to tool format B
This is much harder because, at this level, incompatibilities between the different tool formats become relevant. For example, the information that a given tier is assigned to a certain speaker in tool A may be present in the AG document, but if tool B has no way of assigning tiers to speakers (as is the case, for instance, with Praat), this information will be lost when going from AG to tool format B. Still, if this works, a certain degree of interoperability is achieved: most importantly, users could create their data using a chain of tools with increasingly complex data models, e.g. go from Praat to EXMARaLDA to ELAN without losing any information in the process.
Level 4: Round-trip from tool format A to AG, to tool format B, and back to tool format A
This would be true interoperability. However, since the conversion from AG to tool format B may lead to information loss, this will usually be a lossy round-trip. For users, this means that they cannot switch freely between different tools.
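The difference between a Level 2 and a Level 4 round-trip can be sketched in a few lines of Python. Everything here is hypothetical: "tool A" stores a speaker per tier, "tool B" stores tier names only, and the four converter functions are stand-ins rather than real converters.

```python
def a_to_ag(doc_a):
    # Tool A tiers carry a speaker; keep it as a feature in the AG record.
    return [{"tier": t["name"], "features": {"speaker": t["speaker"]}}
            for t in doc_a]


def ag_to_a(ag):
    # The speaker can be restored only if the feature survived.
    return [{"name": rec["tier"], "speaker": rec["features"].get("speaker")}
            for rec in ag]


def ag_to_b(ag):
    # Hypothetical tool B has tier names only, so features are dropped here.
    return [rec["tier"] for rec in ag]


def b_to_ag(doc_b):
    return [{"tier": name, "features": {}} for name in doc_b]


original = [{"name": "W-Words", "speaker": "SPK0"}]

# Level 2: A -> AG -> A keeps everything.
level2 = ag_to_a(a_to_ag(original))

# Level 4: A -> AG -> B -> AG -> A loses the speaker assignment.
level4 = ag_to_a(b_to_ag(ag_to_b(a_to_ag(original))))
```

In this sketch level2 equals the original document, while level4 comes back with the speaker set to None, which is the kind of loss described for Levels 3 and 4.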
Issues
Relative file names for media files
Allow them. Actually, prefer them if possible. Tools that don't use relative paths will prompt the user to navigate to where the media file is, so there's no harm in allowing them. Why preferred? When the files are moved, you don't have to modify the annotation file to point to the new location; you just have to keep the same directory structure.
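One possible way for a converter to handle this, sketched in Python: resolve a relative media reference against the directory of the annotation file and leave absolute references untouched. The function name and the example paths are assumptions for illustration.

```python
from pathlib import Path


def resolve_media_path(annotation_file, media_ref):
    """Resolve a media reference relative to the annotation file's directory."""
    media = Path(media_ref)
    if media.is_absolute():
        # Absolute references are used as given.
        return media
    # Relative references are interpreted from where the annotation file lives.
    return Path(annotation_file).parent / media


# Hypothetical layout: the media folder sits next to the annotation file.
resolved = resolve_media_path("corpus/session1/to_ag.xml", "media/session1.wav")
```

If the whole corpus directory is moved, the same call still finds the media file as long as the relative layout is unchanged.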
Time units
Milliseconds in integer format. From limited research, it appears that milliseconds are an appropriate level of granularity for these annotations, and the format is straightforward. The current translation package provided by MITRE for converting ELAN to AG format and back writes milliseconds in float format; this will be fixed.
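A minimal Python sketch of normalizing time values to integer milliseconds is shown below. The function name and the unit parameter are assumptions; a real converter would take the unit from whatever the source format declares.

```python
def to_integer_ms(value, unit="ms"):
    """Return a time value as integer milliseconds.

    Accepts numbers or numeric strings given in milliseconds ("ms")
    or seconds ("s"); fractional milliseconds are rounded.
    """
    number = float(value)
    if unit == "s":
        number *= 1000.0
    elif unit != "ms":
        raise ValueError(f"unsupported unit: {unit}")
    return int(round(number))
```

For instance, a float-formatted value such as "1520.0" would be written back out as the integer 1520.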
Annotation Graph format
The DTD from AG-1.1 should be used. It can be downloaded from http://www.ldc.upenn.edu/AG/doc/xml/ag-1.1.dtd. We are trying to find a validator that will validate a file against a DTD. If you know of one, please contact Jeffrey Hoyt at jchoyt@mitre.org.
Update 1 June 07: Thomas wrote one! However, it has namespace validation turned on, which the default AG format violates. I've been using the xmllint tool, which is part of a set of XML tools for GNOME. You can find it at http://xmlsoft.org. My Mandriva Linux box had it installed already. There are also packages for UNIX and Mac OS X, as well as a Windows port. I was able to make a few changes to my to_ag.xml file and get it to validate. Usage is as follows:
xmllint --valid --noout to_ag.xml
where
--valid tells it to do the validation check
--noout tells it NOT to print the XML file to the console
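For converters that want to run this check automatically, the xmllint call can be wrapped in a small Python helper. This is a sketch that assumes xmllint is installed and on the PATH, and that the XML file declares its DTD in a DOCTYPE; the function names are made up for this example.

```python
import subprocess


def xmllint_command(xml_path):
    """Build the xmllint call: --valid checks the DTD, --noout suppresses output."""
    return ["xmllint", "--valid", "--noout", str(xml_path)]


def validates(xml_path):
    """Return True if xmllint exits with status 0 (requires xmllint on PATH)."""
    result = subprocess.run(xmllint_command(xml_path),
                            capture_output=True, text=True)
    return result.returncode == 0
```

A converter could call validates("to_ag.xml") after writing its output and report a failure to the user instead of silently producing an invalid file.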
Update 6 June 07: To make sure the AG format validates and obeys the DTD, we've changed the default separator in the AG format from a ':' to a '_'. This may cause problems with the rest of the AG tools, but I'll have to take that up with Steven Bird. Examples below have been updated to reflect this change.
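For files written with the old separator, the change amounts to rewriting each ID and each reference to it. A minimal Python sketch follows; the function name and the sample ID are hypothetical and not taken from any converter.

```python
def switch_separator(ag_id, old=":", new="_"):
    """Rewrite an AG identifier to use the new separator."""
    return ag_id.replace(old, new)


# Hypothetical identifier in the old ':'-separated style.
new_id = switch_separator("ELAN:AG1:Anchor3")
```

The same function would have to be applied to every attribute that refers to an ID, and IDs that already contain '_' could become ambiguous after the rewrite, so a real converter would need to check for collisions.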
Tier hierarchy metadata
Tier information should be stored in a Metadata element under the AG element. Update 1 June 2007: Thomas Schmidt proposed a very good change to the way I was storing metadata. Please adhere to the following structure:
ELAN example (XML markup not preserved; surviving values):
W-Words
ELAN, DefaultLocale, en
ELAN, Parent, W-Spch
EXMARaLDA example:
This should be refined to account for the way other tools store tier data.
To assign an Annotation to a specific tier, the Annotation should have a Feature named "tier" whose value is the name of the tier. Note that this tier should exist in the metadata above. See the "Annotation values" section below for an example.
Annotation values
Annotation values should be stored in a Feature within the Annotation with the name "description". For example:
this
Note: The tier an annotation belongs to is stored in the "type" attribute instead of a Feature with the name "tier". This is in accordance with much discussion on the mailing list. Updated 6 June 07.
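The convention described in this section (tier name in the "type" attribute, value in a Feature named "description") could be produced with Python's standard xml.etree.ElementTree module roughly as follows. The id, start, and end attributes and all ID values are assumptions for illustration and should be checked against the AG-1.1 DTD.

```python
import xml.etree.ElementTree as ET


def make_annotation(ann_id, tier, start, end, text):
    """Build an Annotation element with its value in a 'description' Feature."""
    annotation = ET.Element(
        "Annotation",
        # The tier name goes into the "type" attribute.
        {"id": ann_id, "type": tier, "start": start, "end": end},
    )
    feature = ET.SubElement(annotation, "Feature", {"name": "description"})
    feature.text = text
    return annotation


# Hypothetical IDs using the '_' separator.
annotation = make_annotation("ag1_ann1", "W-Words", "ag1_a1", "ag1_a2", "this")
xml_text = ET.tostring(annotation, encoding="unicode")
```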
ELAN referential tiers
ELAN has tiers with annotations that do not have their own start/end timestamps. Instead, they refer to an annotation in the parent tier. When MITRE initially did the conversion to AG, the referenced annotation's endpoints were retrieved and copied. The problem is that when the AG file is converted back, ELAN does not handle the annotations correctly: ELAN 3.0 does not correctly modify the endpoints. The best theory we have at the moment comes from Oliver Ehmer: "I think the problems occur because of the following: if you let annotations in different tiers refer to the same time anchor, you need to specify how the tiers are related. In the *.EAF this is done in the sections "linguistic_type" and "constraint". As far a I saw, these sections are missing in the "back_to_elan.eaf". - I think this is not so nice. Because you need to process a lot of meta information when converting the data. The other solution would be to create new timeslots or anchors for every annotation. Of course this means loss of data"
Update 1 June 07: Hans Sloetjes has confirmed Oliver's theory. I have updated the ELAN converter to account for this issue.
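The endpoint-copying step of the initial conversion can be sketched in Python as follows. The dictionary layout ("parent", "start", "end") and the function name are assumptions for this example, not the EAF schema, and the sketch does not cover the linguistic type and constraint information that the round-trip back to ELAN also needs.

```python
def resolve_endpoints(ann_id, annotations):
    """Follow parent references until an annotation with its own times is found."""
    visited = set()
    current_id = ann_id
    while True:
        if current_id in visited:
            raise ValueError(f"reference cycle at {current_id}")
        visited.add(current_id)
        annotation = annotations[current_id]
        if "parent" not in annotation:
            # Time-aligned annotation: these endpoints get copied.
            return annotation["start"], annotation["end"]
        current_id = annotation["parent"]


# Hypothetical data: a1 is time-aligned, a2 refers to a1, a3 refers to a2.
annotations = {
    "a1": {"start": 0, "end": 480},
    "a2": {"parent": "a1"},
    "a3": {"parent": "a2"},
}
endpoints = resolve_endpoints("a3", annotations)
```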
"Timeless" anchors
All anchors must have a timestamp. Even though the AG file format allows a "timeless" anchor, some other tools do not. If we allow them, those tools will not be able to use the file. Our plan is to drop any "timeless" anchors when we convert from ELAN to AG and warn the user that it happened.
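A minimal Python sketch of the planned behaviour is given below. The anchor layout (a dictionary with "id" and "offset") and the function name are assumptions. What should happen to annotations that reference a dropped anchor is not settled above, so the sketch only returns the dropped IDs for the caller to deal with.

```python
import warnings


def drop_timeless_anchors(anchors):
    """Split anchors into those with a timestamp and the IDs of those without."""
    kept = [a for a in anchors if a.get("offset") is not None]
    dropped = [a["id"] for a in anchors if a.get("offset") is None]
    if dropped:
        # Tell the user that information was removed during conversion.
        warnings.warn(
            f"Dropped {len(dropped)} timeless anchor(s): {', '.join(dropped)}"
        )
    return kept, dropped
```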