Jul 26 2011 Meeting Minutes

Agenda:

Attendees:
Kevin
Helena
Takashi (Teradata)
David F
Uwe
Daniel
Mati A.
Christian (SAP)


- Charter: post on web site (touch base with Uwe by weekend)
- Process: post on web site (touch base with Kevin by weekend)
- Roadmap: Shared roadmap with OpenTM2 team. Tangible deliverable.
- Other topics:
    * Unicode character proposal: segmentation break character ==> Proposal review (Kevin and Arle). Summary:
On Jul 12, 2011, at 3:33 PM, Arle Lommel wrote:

Hi Kevin,

Here is a link to my preliminary document on segmentation marker characters.

https://docs.google.com/document/d/1dr6oYdCmUuKVU8YKOXDOJr6AAH0P8l19Zx4DrGxL0pg/edit?hl=en_US

I've gone ahead and proposed more characters than we are likely to want, but this is just a starting point.

A few issues arose:

If we have a word segmentation character, would ZWSP serve that purpose? It's already used in some languages for that notion. (For space-delimited languages, would we just assume that we don't need a word segmentation character? If we do that, then ZWSP would seem to fit the bill.)

If we have phrase-level segmentation characters, those must actually be paired to unambiguously identify phrases. Even so, they would not be powerful enough to indicate all possible phrasal relationships (e.g., they couldn't really deal with some classes of zeugma). I'd be inclined to drop this and say that it is a markup problem. But since this is an issue you care about, I'd like your opinion.

I've never prepared a Unicode proposal before, so I'm sure that there is a lot that would need tightening up in what I've done, but at least it's a strawman.

-Arle

On review, and looking over each of the existing characters, I think we should limit ourselves to a sentence break for the localization interoperability. I know I suggested that we try to go for each of the break levels -- and you did a great job raising them. It will help define the feature set, and where to place a separator character logically.

There are actually explicit Line Separator (U+2028) and Paragraph Separator (U+2029) characters. However the Line Separator is for text break, as a text layout feature, rather than sentence break. As you note, the ZWSP may be used for languages which do not explicitly mark white space -- and conventionally has been used this way.

Interestingly, if the SRX rules were to operate on Unicode regular expressions that support \b{s} (rather than \b as a synonym for \b{w}), we can include uax29 sentence segmentation in the segmentation regex, something like:

 <rule break="yes">
   <beforebreak>\b{s}</beforebreak>
   <afterbreak></afterbreak>
 </rule>

Putting uax29 sentence segmentation into SRX may be a better route than implementing uax29 in individual segmentation rules, if we can avoid it. I still believe it is possible to make a converter from SRX to the general transform engine, and this addition would not harm that process.

Kevin
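
To make the rule format in Kevin's note concrete, here is a minimal Python sketch of how SRX-style rules (a break flag plus beforebreak and afterbreak patterns) might be applied to text. It is an illustration under stated assumptions, not the ULI or OpenTM2 implementation: Python's standard `re` module has no UAX#29 sentence-boundary assertion, so a simple terminal-punctuation pattern stands in for \b{s}, and the abbreviation list and the names RULES and segment are hypothetical.

```python
import re

# Each rule is (is_break, beforebreak, afterbreak), mirroring an SRX <rule>.
# Rules are tried in order at every position; the first matching rule decides.
RULES = [
    # Hypothetical no-break exceptions for a few abbreviations.
    (False, r"\b(?:Mr|Dr|etc)\.", r"\s"),
    # Stand-in for \b{s}: break after terminal punctuation followed by whitespace.
    (True, r"[.?!]+", r"\s"),
]

def segment(text, rules=RULES):
    """Split text into segments using ordered SRX-like rules."""
    compiled = [
        # \Z anchors beforebreak so it must end exactly at the candidate position.
        (is_break, re.compile(f"(?:{before})\\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # Slicing makes this quadratic; acceptable for a short illustration.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule wins, whether break or no-break
    segments.append(text[start:])
    return segments

print(segment("Dr. Smith arrived. He left."))
```

The no-break rule is listed first so that it takes precedence over the general break rule at the same position. Whitespace stays at the start of the following segment, so joining the segments reproduces the input.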


    * UAX#29 and default segmentation rules ==> Input from Kevin, Rights to use by Arle. IBM input (see below)

Kevin: trying to build a utility that converts between SRX and UAX#29 rules easily, assuming SRX were to support full-blown Unicode regular expressions. Will probably take another couple of weeks. It would have been easier if the sentence boundary character were available.

IBM input on the default segmentation rules provided by Arle:

The basic sentence segmentation rules are acceptable (the <languagerule> section defined as "default").
Abbreviations are adequately defined for these languages: Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Italian, Japanese, Polish, Portuguese, Spanish, Swedish, Thai. 
Abbreviations for additional OpenTM2 supported languages would need to be added.
SRX assumes that the translatable text has already been identified in a file format:
The file format defines the individual blocks of text to which the SRX rules must be applied.

Other areas not supported in SRX that should also be our focus when consolidating with UAX#29, for example:
Word recognition. (UAX29)
Stem form reduction. For example, "test" vs "tests" or "child" vs "children".
Multi-word recognition. For example, "one-of-a-kind" (UAX29)
Need to identify how "customer abbreviation lists" (special abbreviations known only to a translator or used for a specific product) can be supported as an extension.
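
One possible shape for the customer abbreviation list extension, sketched in Python: turn a plain list of abbreviations into SRX-style no-break rules that could be placed ahead of the default rules. This is only an assumption about how such an extension might work. The function names and sample abbreviations are hypothetical, and `re.escape` produces Python-flavoured escaping, which may need adjusting for the regular expression syntax an SRX processor expects.

```python
import re
import xml.etree.ElementTree as ET

def abbreviation_rules(abbreviations):
    """Build one <rule break="no"> element per abbreviation (hypothetical helper)."""
    rules = []
    for abbr in abbreviations:
        rule = ET.Element("rule", {"break": "no"})
        before = ET.SubElement(rule, "beforebreak")
        # Escape the abbreviation so its trailing period is matched literally.
        before.text = r"\b" + re.escape(abbr)
        after = ET.SubElement(rule, "afterbreak")
        after.text = r"\s"
        rules.append(rule)
    return rules

def to_xml(rules):
    """Serialize the rule elements as text, one rule per line."""
    return "\n".join(ET.tostring(rule, encoding="unicode") for rule in rules)

# Sample customer-specific abbreviations (invented for illustration).
print(to_xml(abbreviation_rules(["approx.", "Fig."])))
```

Keeping the customer list as plain data and generating the rules from it would let a translator maintain the list without editing the shared default SRX file.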

    * XLIFF Symposium (Sep) and LocWorld (Oct)

Action:
- Contribute transform engine for SRX rules to CLDR for the Dec 2011 release. --> Kevin
- Find someone in IBM to work with Arle on the UTC proposal for the new character. --> Helena
- Get a formal statement from Rodolfo about rights to use --> Arle. Rodolfo just made the file available under EPL (Eclipse Public License).
- Get everyone to provide input to the SRX file contributed by Rodolfo.
- Provide supplementary IBM input on additional languages for the SRX default file --> Helena
- Follow up on the W3C thread Christian mentioned --> Helena.