- Contribute transform engine for SRX rules to CLDR for the Dec 2011 release. --> Kevin (Open)
* Progress made. Better to do it against the original LDML content. XSLT transform to SRX using the LDML implementation. (within 2 weeks)
* Solicit feedback once the results are generated.
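For illustration only, the kind of LDML-to-SRX mapping discussed in this item could look roughly like the sketch below. It is written in Python rather than XSLT. The input is a hand-written fragment shaped like LDML segmentation rules, not data copied from CLDR, and the function names are hypothetical. Variables such as $CR are passed through unexpanded. A real transform would also have to translate Unicode set syntax into the regular-expression dialect SRX expects.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like LDML <segmentRules> (illustrative only).
LDML_SAMPLE = """
<segmentRules>
  <rule id="3"> $CR × $LF </rule>
  <rule id="4"> $ParaSep ÷ </rule>
</segmentRules>
"""

def ldml_rule_to_srx(rule):
    """Map one LDML rule ('left × right' or 'left ÷ right') to an SRX <rule>."""
    text = (rule.text or "").strip()
    if "×" in text:
        before, after = text.split("×", 1)
        brk = "no"      # × means "do not break here"
    elif "÷" in text:
        before, after = text.split("÷", 1)
        brk = "yes"     # ÷ means "break here"
    else:
        raise ValueError("rule has no break marker: %r" % text)
    srx_rule = ET.Element("rule", {"break": brk})
    ET.SubElement(srx_rule, "beforebreak").text = before.strip()
    ET.SubElement(srx_rule, "afterbreak").text = after.strip()
    return srx_rule

def convert(ldml_xml):
    """Convert every <rule> in the LDML fragment, preserving rule order."""
    root = ET.fromstring(ldml_xml.strip())
    return [ldml_rule_to_srx(r) for r in root.iter("rule")]

rules = convert(LDML_SAMPLE)
for r in rules:
    print(ET.tostring(r, encoding="unicode"))
```

Rule order is preserved because both formats are order-sensitive.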
Kevin: It is the SRX maintainer's responsibility to provide an implementation against SRX. Stepping up to provide a reference implementation.
Christian: SRX as known formers vs SRX rules.
Helena: more leaning to the latter.
Christian: if we have SRX and no one really knows of TR 29, should we do some education? Should we talk to others who build translation memories?
Kevin: want to implement a sentence break using ICU regex. Will have a normative effect on segmentation. Also means there is redundancy in the default rule set. UAX 29 rules state that anything upper/lower case following "." should not constitute a break.
Helena: should discuss Kent's input.
Kevin: agree with him in general terms. Exceptions should be encoded in these rule sets. Next week's ULI meeting is on Oct 11 and coincides with LocWorld, so it will be canceled.
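As a rough illustration of the rule Kevin mentions, the sketch below suppresses a sentence break when a lowercase letter follows the period. It uses Python's re module as a stand-in for ICU regex. It is a simplified reading of one UAX #29 rule, not the full algorithm and not ICU's implementation. The function name and sample text are made up for the example.

```python
import re

# Candidate boundary: sentence-ending punctuation followed by whitespace.
_CANDIDATE = re.compile(r"[.!?]+\s+")

def split_sentences(text):
    """Split after terminator + whitespace, unless a lowercase letter follows."""
    sentences = []
    start = 0
    for match in _CANDIDATE.finditer(text):
        end = match.end()
        # A lowercase letter after the period suggests an abbreviation
        # in mid-sentence, so no break is made here.
        if end < len(text) and text[end].islower():
            continue
        sentences.append(text[start:end])
        start = end
    if start < len(text):
        sentences.append(text[start:])
    return sentences

result = split_sentences("The file is approx. three pages. It ends here.")
print(result)
# ['The file is approx. three pages. ', 'It ends here.']
```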
- Get everyone to provide input to the SRX file contributed by Rodolfo. (Open)
* Feedback below.
- Provide IBM input on additional language supplementary input to the SRX default file --> Helena (Closed)
- David to connect with Helena to solicit input about ULI PR with the Multilingual-Web LT activities. ---> David Filip (Open)
- Arle still owns the separator character proposal. Needs to be fleshed out more. ---> Arle Lommel (Open)
Kevin: trying to solve interoperability of the segmentation process. The proposal may not be needed; using an existing character may achieve the goal. There may be another character missing: the non-break character.
Helena: we need consistent break and non-break pair.
Christian: Hasn't formed an opinion yet. Are SRX rule file instances really being used?
Kevin: SRX rules largely mirror the break algorithm implementation in CLDR and LDML. If we process plain text with ICU, it marks up the plain text, and the breaking mechanism is the same as in CLDR.
Steven: ICU and CLDR have been maintained by the same people.
Christian: people who have a need for an implementation.
Kevin: using ZWJ can already force that behavior in ICU.
- Liaison to CLDR TC --> Kevin (closed)
- Default SRX rules: http://uli.unicode.org/home/uli-documents/merged_srx.zip?attredirects=0&d=1.
Feedback so far:
* Good portion of the data is not useful for "usual content". Vetting needed.
* Standardization should also be on the linguistic construct: let's look at the CLDR current data first:
* Using more than the "rule" element of the SRX standard.
Kevin: look at root.xml, which is basically UAX #29.
Helena: script/language sentence breaks.
Kevin: CJK sentence break with circle period.
Helena: does not work as well for content without punctuation.
Yoshito: apply okay for Japanese
Christian: Have only seen rule-based approaches to segmentation. Has anyone looked into more math-based implementations?
Helena: similar to SMT
Christian: use a different kind of method.
Kevin: need to have a large training corpus, doing that would be hard to ship.
Helena: Perhaps collaborate with TAUS?
Kevin: Most people just use UAX 29 segmentation. If we provide a way to specify break and non-break characters, then we should resolve this. Contributing sentence break with Unicode Regex would be the best contribution to the MT community.
Helena: what next with the SRX input.
Kevin: write exceptions with SRX so one can create custom breaks.
Kevin: interoperability of the plain text results based on UAX 29. Results of either break iterator can be interpreted with unambiguous characters. Using the × and ÷ symbols can also serve as a starting point to bootstrap.
Helena: still need to review and pull out the defaults to see how they compare to root.xml. Also looking for exceptions.
Kevin: come up with a base set of text and see what the proposed rules would do against the exemplary text. Perhaps request a test case for each rule, each with an expected result: an example to which the rule applies and a test case. These can then serve as both documentation and test cases.
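A minimal sketch of what such per-rule test cases might look like, assuming expected results are written with ÷ for "break" and × for "no break" in the style of the UAX #29 test data. The sample sentences, the helper names, and the deliberately naive segmenter are all made up for illustration.

```python
BREAK, NO_BREAK = "÷", "×"

def parse_expected(annotated):
    """Return (plain_text, break_offsets) from a string marked with ÷ and ×."""
    plain, breaks = [], []
    for ch in annotated:
        if ch == BREAK:
            breaks.append(len(plain))   # offset into the unmarked text
        elif ch == NO_BREAK:
            continue                    # documents a suppressed break
        else:
            plain.append(ch)
    return "".join(plain), breaks

def naive_breaks(text):
    """Toy segmenter: break after every '. ' plus start and end of text."""
    offsets = [0]
    i = text.find(". ")
    while i != -1:
        offsets.append(i + 2)
        i = text.find(". ", i + 2)
    if offsets[-1] != len(text):
        offsets.append(len(text))
    return offsets

def check(segmenter, annotated):
    """True if the segmenter reproduces the breaks marked in the test case."""
    text, expected = parse_expected(annotated)
    return segmenter(text) == expected

CASES = {
    "plain period": "÷He left. ÷She stayed.÷",
    "title abbreviation": "÷See Mr. ×Smith.÷",
}
results = {name: check(naive_breaks, case) for name, case in CASES.items()}
print(results)
# {'plain period': True, 'title abbreviation': False}
```

The failing second case is the point: it shows where an exception rule would be needed, and the annotated string doubles as documentation of the expected behavior.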
Helena: believe Rodolfo just copied out of a dictionary
Kevin: just grabbing some content out of a dictionary could be a problem.
Helena: Mr. and Adm. are examples of what we want to build in as default behavior.
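To make the Mr. and Adm. case concrete, here is a small sketch of SRX-style rule evaluation in Python. Each rule has a break flag, a "before break" pattern, and an "after break" pattern, and the first rule that matches at a position wins. The rule list and function name are illustrative assumptions, and the patterns use Python regex syntax rather than anything taken from the merged SRX file.

```python
import re

# (break?, before-break pattern, after-break pattern); order matters.
RULES = [
    (False, r"\b(?:Mr|Adm)\.", r"\s"),   # exception: no break after these
    (True, r"[.!?]", r"\s+"),            # default: break after a terminator
]

def segment(text, rules=RULES):
    """Split text at positions where the first matching rule says 'break'."""
    compiled = [
        (brk, re.compile("(?:%s)\\Z" % before), re.compile(after))
        for brk, before, after in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for brk, before, after in compiled:
            # before must match text ending at pos, after must match from pos
            if before.search(text[:pos]) and after.match(text, pos):
                if brk:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule decides this position
    segments.append(text[start:])
    return segments

parts = segment("Mr. Smith met Adm. Jones. They talked.")
print(parts)
# ['Mr. Smith met Adm. Jones.', ' They talked.']
```

The break falls between the two patterns, so the whitespace stays at the start of the following segment. A consumer would typically trim it.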
Helena: Still need to solicit content contribution from organizations.
Steven: Can this be brought forward to Unicode list?
Actions from this meeting:
- Content contribution from organizations. All.