Terminology Extraction for Semantic Interoperability and Standardization

Start date
Jan. 1, 2011
End date
Dec. 31, 2012
Sponsor
IWT Tetra fund
Partners
IBCN
Research portal
http://research.flw.ugent.be/projects/texsis

About TExSIS

Companies and organisations have a growing need for consistent language use in their external and internal communication, e.g. in manuals, product leaflets, presentations and support documents. Especially in international organisations, where these documents have to be produced in several languages, consistently using the correct and appropriate terminology is far from straightforward.
The TExSIS project aims at the automatic extraction of mono‐ and multilingual company specific terminology on the basis of a company’s document streams. These term lists are crucial in every language based man‐machine communication: in machine translation, computer‐assisted translation and in monolingual and multilingual document management.
The concrete deliverables of the project include: knowledge (reported in technical reports and publications), prototypes of the individual components, and a prototype client‐server architecture for fully automatic monolingual and multilingual terminology extraction. These prototypes will be made available as open source, so that software developers can further customize them and integrate them into company-specific end-user applications.
The practical applicability of the terminology extractor will be evaluated in two use cases with a broad scope, namely machine translation and information retrieval. Both use cases cover the needs of a broad target group of companies and organisations. For both use cases, there will be close cooperation with the companies in the user group for the delivery of documents and for the evaluation of the prototype. This should guarantee the general applicability of both use cases.
1) Use case machine translation
The tools for terminology extraction that will be developed in the course of the project will be integrated in machine translation systems. How can (Flemish) companies benefit from these results?
• Reduction of the implementation cost of machine translation (MT). Implementing MT systems and customizing them to a specific company environment (e.g. the automotive, banking or telecom domain) requires a large amount of manual work, leading to a high implementation cost. Accurate terminology extraction will lead to smoother domain adaptation and a lower implementation cost of MT.
• Reduction of manual post-editing. Companies that use MT for the translation of their user manuals, customer support documents, etc. currently often have to post-edit their texts heavily to obtain acceptable output. Integrating fully automatic terminology extraction in the MT system will improve translation quality and thus reduce the cost of manual correction. This reduction in manual post-editing also leads to a shorter time‐to‐market.
• Automatic production of company-specific monolingual and multilingual dictionaries, which can be used for uniform communication on new and existing products and services. A graphical user interface can assist writers and translators in adapting and correcting these automatically extracted mono‐ and multilingual term lists.
• Automatic method for consistency checks. By running automatic terminology extraction on all new texts in TMX, HTML, XML, etc., the terms in these documents can be checked against the company-specific term banks. A graphical user interface will be developed for replacing incorrect terms in these new documents.
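The consistency check described in the last point can be illustrated with a minimal sketch. It is not the TExSIS implementation, which this page does not describe. It assumes a hypothetical term bank (`TERM_BANK`) that maps discouraged variants to preferred terms, and a hypothetical helper `find_inconsistent_terms` that reports every discouraged variant found in a plain-text document.

```python
import re

# Hypothetical term bank (an assumption for illustration):
# discouraged variant -> preferred company term.
TERM_BANK = {
    "hand brake": "parking brake",
    "boot lid": "tailgate",
}


def find_inconsistent_terms(text, term_bank):
    """Return (found_text, preferred_term, offset) for each discouraged variant in text."""
    findings = []
    for variant, preferred in term_bank.items():
        # Whole-word, case-insensitive match of the discouraged variant.
        pattern = re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            findings.append((match.group(0), preferred, match.start()))
    # Report findings in document order.
    return sorted(findings, key=lambda finding: finding[2])


sample = "Release the hand brake before opening the boot lid."
for found, preferred, offset in find_inconsistent_terms(sample, TERM_BANK):
    print(f"offset {offset}: '{found}' -> suggest '{preferred}'")
```

A real system would first extract the text from the TMX, HTML or XML markup and would need to handle inflected and multilingual variants. The sketch only shows the lookup step that a correction interface could build on.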
2) Use case information retrieval
The tools for terminology extraction developed in the project can also be integrated in intelligent information management applications. These applications contribute to faster, easier and more user-friendly solutions to manage, index, categorise and search large document collections. Companies that manage large archives or databases can be assisted in different ways via the TExSIS project:
• Automatic construction of a thesaurus for a given archive. For an existing archive of textual documents, TExSIS can extract a thesaurus of relevant keywords (names of persons, organisations, locations and other keywords) in a fully or semi‐automatic way. This will allow companies to archive and maintain their documents in a structured way.
• Term suggestions during document construction. When constructing a new document, the writer can be guided in his/her choice of relevant keywords via a term suggestion system. These new keywords are automatically linked to the thesaurus; the user will also be able to add new keywords to the thesaurus.
• Automatic assignment of metadata to documents. When archiving a new document, the most important keywords can be added as metadata to the document in order to simplify or even automate classification. This can be done in a fully automatic or semi‐automatic way.
• Increased speed in searching large archives and databases. When documents are labeled with relevant metadata, it becomes possible to browse documents per category or to search for specific entities, which greatly improves the user-friendliness of search engines. Such search applications can make use of unfolded tree structures with subcategories or graphical sets of relevant suggestion terms.
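The automatic assignment of metadata can be illustrated with a minimal sketch, assuming a thesaurus is already available. The `THESAURUS` contents, the `assign_keywords` helper and the frequency-based ranking are all assumptions made for illustration. The page does not specify how TExSIS selects the most important keywords.

```python
import re
from collections import Counter

# Hypothetical thesaurus (an assumption for illustration):
# lowercase keyword -> category.
THESAURUS = {
    "machine translation": "topic",
    "terminology extraction": "topic",
    "ghent": "location",
}


def assign_keywords(text, thesaurus, top_n=3):
    """Return up to top_n thesaurus keywords found in text, most frequent first."""
    lowered = text.lower()
    counts = Counter()
    for keyword in thesaurus:
        # Count whole-word occurrences of the keyword in the lowercased text.
        hits = re.findall(r"\b" + re.escape(keyword) + r"\b", lowered)
        if hits:
            counts[keyword] = len(hits)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"keyword": keyword, "category": thesaurus[keyword], "count": count}
        for keyword, count in ranked[:top_n]
    ]


document = (
    "Terminology extraction supports machine translation. "
    "Better terminology extraction lowers post-editing cost."
)
metadata = assign_keywords(document, THESAURUS)
print(metadata)
```

The resulting list could be stored alongside the document as metadata, so that an archive can be browsed per category or searched for specific entities.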
The Flemish government funds 92.5% of TExSIS. The remaining 7.5% has been raised by the members of the user group: Cross Language, SD Worx, Telenet, Belga Persbureau, Selor, Yamagata Europe, Telelingua, Mentoring Systems, Xplanation Language Services, TextKernel, Comsof, Actonomy, ITP, Docbyte, PSA Peugeot Citroën SA, Intersystems, Jabbla, MEGA-doc, Mediargus, Eurologos, Wetenschappelijk en Technisch Centrum voor het Bouwbedrijf, Nederlandse Taalunie, Jan De Nul and Oneliner Translations.
TExSIS will benefit from the findings of current and previous related research projects such as The Dutch Parallel Corpus (DPC) and PSA. The following research publications will be of relevance for the TExSIS project:
• Macken, L. (2010). Sub-sentential alignment of translational correspondences. PhD thesis, University Press Antwerp, Antwerp.
• Lefever, E., Macken, L., & Hoste, V. (2009). Language-independent bilingual terminology extraction from a multilingual parallel corpus. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. The Association for Computational Linguistics, Athens, Greece.
• Macken, L., Lefever, E., & Hoste, V. (2008). Linguistically-based sub-sentential alignment for terminology extraction from a bilingual automotive corpus. Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). Manchester, UK.
• Trushkina, J., Macken, L., & Paulussen, H. (2008). Sentence alignment in DPC: maximizing precision, minimizing human effort. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, and D. Tapias (eds.), Proceedings of the Sixth Conference on International Language Resources and Evaluation (LREC'08). European Language Resources Association, Marrakech, Morocco.