Thursday, 12 June 2014 09:09

Panel 11: New Perspectives on Cohesion and Coherence: Implications for Translation

New Perspectives on Cohesion and Coherence: Implications for Translation
Kerstin Kunz, University of Heidelberg, Germany
Ekaterina Lapshinova-Koltunski, Saarland University, Germany
Katrin Menzel, Saarland University, Germany

 

The panel will investigate textual relations of cohesion and coherence in translation and multilingual text production, with a strong focus on innovative methods of empirical analysis as well as on technology and computation. Given the amount of multilingual computation taking place, this topic is important for both human and machine translation, and for further multilingual studies. Cohesion refers to the text-internal relationships between linguistic elements that are overtly linked via lexical and grammatical devices across sentence boundaries so that they are understood as a text. The recognition of coherence in a text is more subjective, as it involves text- and reader-based features and refers to the logical flow of interrelated ideas in a text, thus establishing a mental textual world. The two concepts are connected in that relations of cohesion can be regarded as explicit indicators of meaning relations in a text and hence contribute to its overall coherence.

The aim of this panel is to bring together scholars analyzing cohesion and coherence from different research perspectives that cover translation-relevant topics: language contrast, translationese and machine translation. What these approaches share is that they investigate instantiations of discourse phenomena in a multilingual context and base language comparison on empirical data. The challenges here can be identified with respect to the following methodological questions:

1. How to arrive at a cost-effective operationalization of the annotation process when dealing with a broader range of discourse phenomena?

2. Which statistical techniques are needed and are adequate for the analysis? And which methods can be combined for data interpretation?

3. Which applications of the knowledge acquired are possible in multilingual computation, especially in machine translation?

The contributions of the different research groups involved in our panel reflect these questions. Some contributions will concentrate on procedures to analyse cohesion and coherence from a corpus-linguistic perspective (M. Rysová, K. Rysová). Others will focus on textual cohesion in parallel corpora that include both originals and translated texts (K. Kerremans, K. Kunz/E. Lapshinova-Koltunski/S. Degaetano-Ortlieb, A. Kutuzov/M. Kunilovskaya). Finally, the panel will also include papers discussing the nature of cohesion and coherence with implications for human and machine translation (E. Lapshinova-Koltunski, C. Scarton/L. Specia, K. S. Smith/L. Specia).

By targeting the questions raised above and addressing them together from different research angles, the present panel will contribute to moving empirical translation studies forward.

For informal enquiries: [eDOTlapshinovaATmxDOTuni-saarlandDOTde]


Kerstin Kunz (University of Heidelberg) holds an interim professorship at the Institute of Translation and Interpreting. She finished her PhD on Nominal Coreference in English and German in 2009. Since then, she has been involved in empirical research projects dealing with properties of translations and English-German contrasts on the level of lexicogrammar and discourse. Together with Erich Steiner, she currently leads the GECCo project (http://www.gecco.uni-saarland.de/GECCo/Home.html) at the Department of Applied Linguistics, Translation and Interpreting (Saarland University), in which different types of cohesive relations in English and German are explored, contrasting languages, originals and translations as well as written and spoken registers.

Ekaterina Lapshinova-Koltunski (Saarland University) is a post-doctoral researcher at the Department of Applied Linguistics, Translation and Interpreting. She finished her PhD on semi-automatic extraction and classification of language data at the Institute for Natural Language Processing (Stuttgart) in 2010. Since then, she has been working in corpus-based projects related to language variation, language contrasts and translation, one of which is GECCo (http://www.gecco.uni-saarland.de/GECCo/Home.html). In 2012 she received a start-up research grant from Saarland University to build resources for the analysis of variation in translation caused by different dimensions (register, translation method), resulting in translation varieties (including both human and machine translation).

Foto Menzel

Katrin Menzel (Saarland University) studied Conference Interpreting and Translation Studies at Saarland University (Saarbrücken, Germany). She has been working as a teaching and research staff member at the Department of Applied Linguistics, Translation and Interpreting at Saarland University since 2011. Katrin is involved in the research project "GECCo" on cohesion in English and German and works on the case study of ellipses as cohesive ties for her PhD thesis.

SESSION PLAN

Each paper is allocated a 20-minute time slot plus 10 minutes of discussion.

Discussion takes place at the end of each paper.

Introduction (20 Minutes)

SESSION 1: Contrastive Aspects of Cohesion and Coherence

PAPER 1:

Title: How to Annotate Multiword Discourse Connectives in Large Corpora

Speaker: Magdaléna Rysová, Charles University in Prague (Czech Republic)

PAPER 2:

Title: Interaction of Coreference and Sentence Information Structure in the Prague Dependency Treebank

Speaker: Kateřina Rysová, Charles University in Prague (Czech Republic)

SESSION 2: Textual Cohesion and Translation

PAPER 3:

Title: Terminological variation in multilingual parallel corpora: a semi-automatic method involving co-referential analysis

Speaker: Koen Kerremans, Vrije Universiteit Brussel (Belgium)

PAPER 4:

Title: Cohesive chains in an English-German parallel corpus: Methodologies and challenges

Speaker: Kerstin Kunz, University of Heidelberg (Germany); Ekaterina Lapshinova-Koltunski, Saarland University (Germany) and Stefania Degaetano-Ortlieb, Saarland University (Germany)

PAPER 5:

Title: Testing target text cohesion: An attempt at machine learning model to predict acceptability of sentence boundaries in English-Russian translation

Speaker: Andrey Kutuzov, Linguistic Lab on Corpus Technologies, Higher School of Economics (Moscow, Russia) and Maria Kunilovskaya, Tyumen State University (Russia)

SESSION 3: Aspects of Cohesion and Coherence in Human vs. Machine Translation

PAPER 6:

Title: Cohesion and Translation Variation: Corpus-based Analysis of Translation Varieties

Speaker: Ekaterina Lapshinova-Koltunski, Saarland University (Germany)

PAPER 7:

Title: Exploring Discourse in Machine Translation Quality Estimation

Speaker: Carolina Scarton, University of Sheffield (UK) and Lucia Specia, University of Sheffield (UK)

PAPER 8:

Title: Examining Lexical Coherence in a Multilingual Setting

Speaker: Karin Sim Smith, University of Sheffield (UK) and Lucia Specia, University of Sheffield (UK)

PAPER TITLES, ABSTRACTS AND BIONOTES

PAPER 1

Title: How to Annotate Multiword Discourse Connectives in Large Corpora

Speaker: Magdaléna Rysová, Charles University in Prague (Czech Republic)

Abstract: The aim of the paper is to introduce a new annotation of multiword discourse connectives in the Prague Dependency Treebank (PDT), with a view to how our annotation principles may be applied to other large corpora of other languages such as English or German. The annotation covers a heterogeneous class of multiword connectives (sometimes called alternative lexicalizations of discourse connectives, or AltLexes – cf. Prasad et al., 2010) like "the condition is, that is the reason why, he explained, because of these facts" etc. Due to this great heterogeneity, it was hard to create universal and uniform annotation principles that would fit expressions whose bases are formed by nouns ("reason, condition, difference..."), verbs ("to cause, to explain, to mean...") or secondary prepositions ("because of, due to, with respect to..."). Creating such principles and annotating the PDT corpus took three years and three annotators (trained linguists specialising in discourse). The annotation is partly manual and partly automatic. For the automatic detection of some multiword connectives, we made use of the interconnection of discourse and coreference, as some multiword connectives obligatorily combine with anaphoric expressions like "due to this, because of this fact, this means" etc. Since an annotation of coreference already exists in the PDT (see Nedoluzhko et al., 2011), it can be exploited for the annotation of discourse (see Rysová, Mírovský, 2014). The paper will also present the principles according to which it is possible to draw boundaries within the wide class of multiword connectives. One of them is the principle of universality – some multiword connectives function as universal indicators of certain discourse relations (like "that is the reason why, because of this..."), while others are not universal but contextually dependent (like "because of this increase, this order means...").
Altogether, the new annotation of multiword connectives contains approx. 1,300 discourse relations (the final number may change slightly, as the data are currently being checked and corrected). The analysis of multiword discourse connectives (or AltLexes) and their annotation in large data sets is a topical and still under-investigated theme in discourse analysis. As far as we know, ours is one of the first detailed and elaborated annotations of multiword connectives in today's linguistics, and we hope that it may help in creating similar principles for other languages and corpora containing discourse relations.

Bionote: Magdaléna Rysová works as a Research Assistant at the Institute of Formal and Applied Linguistics, Charles University in Prague. Her main research interests are discourse analysis (with a focus on discourse connectives and their annotation in large corpora) and information structure. She is the PI of two internal university grants – "Discourse Connectives in Czech" (2013–2015) and "Discourse Relations within a Text" (2013–2014). She is a team member of several national and international grants, e.g. COST – "Structuring Discourse in Multilingual Europe (TextLink)". She will soon finish her Ph.D. (thesis topic: "Possibilities of Expressing Discourse Relations in Czech").

PAPER 2:

Title: Interaction of Coreference and Sentence Information Structure in the Prague Dependency Treebank

Speaker: Kateřina Rysová, Charles University in Prague (Czech Republic)

Abstract: The paper examines how coreference and anaphoric relations, in combination with sentence information structure, contribute to text coherence.

Very often, a text is studied and examined through many individual linguistic phenomena and, therefore, viewed through different linguistic optics. Beyond this, it is useful to study a text from several perspectives at once, because the interplay of the individual discourse phenomena helps us to see various textual relations as an interlinked network of relations of different kinds. We therefore try to view coreference and anaphoric relations and information structure in their multilevel connections and to see how these two linguistic phenomena influence and support each other.

The complex view of the text required for our purposes is enabled by the Prague Dependency Treebank (PDT) (Bejček et al., 2013), which offers multilayer annotation of different discourse (and other) phenomena at once.

The aim of the paper is to explore the relationship between coreference (and anaphoric) relations and topic or focus nature of words (in terms of sentence information structure /IS/). IS is seen from the perspective of Functional Generative Description (Hajičová et al., 1998), coreference and bridging anaphora are treated as in Nedoluzhko (2011).

The general question is whether words from coreference (and anaphoric) chains occur rather in the focus or in the topic part of the sentence. In other words, we try to demonstrate how the information structure develops depending on anaphora and coreference (i.e. which part of the sentence – topic or focus – is further elaborated within a text through coreference and anaphoric chains, and in which way).

In the paper, we measure (in coherent newspaper texts) how dense (or sparse) coreference and anaphoric text relations are 1) among contextually bound sentence members (i.e. typically topic members); 2) among contextually non-bound sentence members (i.e. typically focus members); and 3) between contextually bound and non-bound sentence members (topic and focus). We observe that the quantity of coreference and anaphoric relations and chains is not the same in the topic and focus parts of sentences.

The measurements are made on the large corpus data provided by the Prague Dependency Treebank 3.0. In the PDT data, coreference and anaphoric relations as well as sentence information structure are manually annotated – almost 50,000 annotated Czech sentences are available. For the measurements, the PML Tree Query is used. This client-server tool, as well as the measurement methods used, can also be applied to similar research based on data from other corpora and other languages.

Bionote: Kateřina Rysová is a Senior Research Associate at the Institute of Formal and Applied Linguistics, Charles University in Prague. She finished her Ph.D. studies in linguistics at Charles University in Prague in 2013 (Ph.D. thesis: "On Word Order from the Communicative Point of View"). She is a team member of several national and international grants, e.g. COST – "Structuring Discourse in Multilingual Europe (TextLink)". Her main research interests are sentence information structure and discourse studies. She was the PI of an internal university grant (2011–2012) "Valency as a Word Order Factor" and has experience in creating and annotating corpora.

PAPER 3:

Title: Terminological variation in multilingual parallel corpora: a semi-automatic method involving co-referential analysis

Speaker: Koen Kerremans, Vrije Universiteit Brussel (Belgium)

Abstract: The work presented in this article is part of a research study that focused on how terms and equivalents recorded in multilingual terminological databases can be extended with terminological variants and their translations retrieved from English source texts and their corresponding French and Dutch target texts (Kerremans 2014). For this purpose, a novel type of translation resource is proposed, resulting from a method for identifying terminological variants and their translations in texts. In many terminology approaches, terminological variants within and across languages are identified on the basis of semantic and/or linguistic criteria (Carreño Cruz 2008; Fernández-Silva et al. 2008). In contrast to such approaches, three perspectives of analysis were combined in Kerremans (2014) in order to build up the translation resource comprised of terminological variants and their translations. The first perspective is the semantic perspective, which means that units of specialised knowledge – or units of understanding (Temmerman 2000) – form the starting point for the analysis of term variation in the English source texts. The second perspective of analysis is the textual perspective, which implies that terminological variants pointing to a particular unit of understanding in a text are identified on the basis of their 'co-referential ties'. In the third perspective of analysis, the contrastive perspective, the French and Dutch translations of the English terms are extracted from the target texts. This approach is motivated by the fact that translators need to acquire a profound insight into the unit of understanding expressed in a source text before they can decide which equivalent to choose in the target language. In the framework of text linguistics, it has been shown how this can be achieved through the analysis of texts. A translator analyses the unit of understanding based on how it is expressed in the source texts (i.e. the semantic perspective), how its meaning is developed through the use of cohesive ties (i.e. the textual perspective) and how it can be rendered into the target language (i.e. the contrastive perspective). In this article, we shall only focus on how co-referential analysis was applied to the analysis of terminological variants in the source texts, resulting in lexical chains. These are "cohesive ties sharing the same referent, lexically rather than grammatically expressed" (Rogers 2007: 17). The terminological variants in these chains – which in this study were limited to single-word nouns or nominal expressions – become part of a general cluster of variants encountered in a collection of source texts. Several semi-automated modules were created in order to reduce the manual effort in the analysis of co-referential chains while ensuring consistency and completeness in the data. We will explain how the semi-automatic modules work and how they contribute to the development of the envisaged translation resource (cf. supra). We will also discuss what results can be derived from a co-referential analysis of terms and how these results can be used to quantitatively and qualitatively compare term variation between source and target texts.
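The clustering step described above can be illustrated with a small sketch in Python. The function name and data format are our own illustrative assumptions, not the study's actual semi-automatic modules: each lexical chain is represented as a pair of a unit-of-understanding identifier and the variants it contains, and chains from several source texts are merged into one general cluster of variants per unit of understanding.

```python
from collections import defaultdict

def build_variant_clusters(chains):
    """Merge lexical chains into clusters of terminological variants.

    chains: list of (unit_id, variants) pairs, one per lexical chain,
    where unit_id identifies the unit of understanding the chain refers
    to and variants is the list of terminological variants it contains.
    Returns a dict mapping each unit of understanding to the general
    cluster of variants encountered across the source texts.
    """
    clusters = defaultdict(list)
    for unit_id, variants in chains:
        for variant in variants:
            if variant not in clusters[unit_id]:  # keep each variant once
                clusters[unit_id].append(variant)
    return dict(clusters)
```

Such a cluster can then be confronted with the target texts to collect the French and Dutch translations of each variant.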

Bionote: Koen Kerremans obtained his Master's degree in Germanic Philology at Universiteit Antwerpen in 2001, his Master's degree in Language Sciences - with a major in computational linguistics - at Universiteit Gent in 2002 and his PhD degree in Applied Linguistics at Vrije Universiteit Brussel in 2014. His research interests pertain to applied linguistics, language technologies, ontologies, specialised communication, terminology (variation) and translation studies. He is currently appointed as doctor-assistant at the department of Applied Linguistics (Faculty of Arts and Philosophy) of Vrije Universiteit Brussel (VUB) where he teaches courses on applied linguistics, terminology and culture-specific communication.

PAPER 4:

Title: Cohesive chains in an English-German parallel corpus: Methodologies and challenges

Speaker: Kerstin Kunz, University of Heidelberg (Germany); Ekaterina Lapshinova-Koltunski, Saarland University (Germany) and Stefania Degaetano-Ortlieb, Saarland University (Germany)

Abstract: The current paper discusses methodological challenges in analyzing cohesive relations with corpus-based procedures. It is based on research aiming at the comparison of English and German cohesion in written and spoken language and in originals and translations. For this objective, methodologies are developed that enable a fine-grained and precise analysis of different cohesive aspects in a representative corpus and that yield results for data interpretation within the duration of the project. Thus, the methodologies have to be elaborate and cost-effective at the same time.

We use an English-German comparable and parallel corpus which is pre-annotated on various grammatical levels and which has been enriched semi-automatically with information on cohesive devices of reference, conjunction, substitution and ellipsis. Our discussion will revolve around methodological challenges related to the current analysis of (1) co-reference and (2) lexical cohesion. The analysis of both types includes (a) identifying cohesive devices that function as explicit linguistic triggers (b) setting up a relation to the linguistic items with which they tie up (antecedents) and (c) integrating these ties into (longer) cohesive chains.

The methodological steps involved are the following:

1) Designing an annotation scheme. Main challenges revolve around the conceptual distinction of relations between instantiated co-reference and sense relations (lexical cohesion), the definition of categories that fit for a bilingual analysis, the inter-relatedness of chains, the depth of the ontological hierarchy and the distance between chain elements.

2) Designing semi-automatic annotation procedures. The challenge is to combine automatic pre-annotation and manual revision in a cost-effective way. Our annotation of co-reference is based on the automatic extraction of reference devices, their manual revision and the manual annotation of chain relations (the outputs of automatic co-reference tools were too error-prone for pre-annotation of coreference chains). For the annotation of lexical cohesion, we intend to proceed in a similar way: sense relations and chains are pre-annotated using existing resources, e.g. WordNet, and revised by human annotators to obtain maximally precise results.

3) Extracting and analysing information. The challenge here is to extract data relevant to our research objective, i.e. information on chain length and distance between elements in chains in combination with morpho-syntactic preferences of chain elements, as well as on the alignment of translational equivalents of cohesive relations. Moreover, appropriate statistical evaluation techniques have to be applied for interpretation in terms of language contrast and properties of translation. After demonstrating these methodologies on the basis of initial results, the presentation will end with a discussion of open questions. While our main aim is to design methodologies for a contrastive comparison of English and German on the level of text/discourse, we hope to lay the ground for new paths in NLP and, in particular, in machine translation. Furthermore, the available alignments provide insight into shifts in cohesion between source and target texts and the translation strategies applied.
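The chain measures mentioned under the extraction step can be sketched as follows (Python; the function name and input format are illustrative assumptions, not the project's actual extraction tools). Each chain is given as the list of sentence indices at which its elements occur, and the sketch computes chain length and the average distance between neighbouring chain elements:

```python
def chain_statistics(chains):
    """Compute simple chain measures: chain length and the average
    distance (in sentences) between neighbouring chain elements.

    chains: list of chains, each a list of sentence indices at which
    the chain's elements occur in the text.
    """
    stats = []
    for chain in chains:
        positions = sorted(chain)
        # gaps between neighbouring elements of the chain
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        stats.append({
            "length": len(positions),
            "avg_distance": sum(gaps) / len(gaps) if gaps else 0.0,
        })
    return stats
```

Aggregating such figures per register and per language then yields the kind of data to which the statistical evaluation techniques can be applied.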

Bionote: Kerstin Kunz holds an interim professorship at the Institute of Translation and Interpreting at Heidelberg University, where she teaches in several BA and MA programs. She finished her PhD on English-German Nominal Coreference in 2009. She has been involved in various empirical research projects on properties of translations and English-German contrasts on the level of lexicogrammar and discourse. Together with Erich Steiner, she currently leads a corpus-based project at Saarland University: the GECCo project explores different types of cohesive relations in English and German, contrasting languages, originals and translations as well as written and spoken registers.

PAPER 5:

Title: Testing target text cohesion: An attempt at machine learning model to predict acceptability of sentence boundaries in English-Russian translation

Speaker: Andrey Kutuzov, Linguistic Lab on Corpus Technologies, Higher School of Economics (Moscow, Russia) and Maria Kunilovskaya, Tyumen State University (Russia)

Abstract: The results of our previous research based on a parallel learner corpus show that sentence splitting in English-Russian translation is not arbitrary and is most commonly used under particular syntactic conditions. The cases of splitting reflect both interlinguistic typological differences and translational regularities (explicitation, simplification), but it is impossible to say which factor is at play each time without further research. We have also shown that careless splitting can have negative effects on the target semantic and discourse structure. In this research we want to take a closer look at shifts in sentence boundaries in translation, which we interpret as one of the factors reflecting text cohesion, and to complement earlier findings with results from multiple-translation and comparable corpora analysis. By bringing together data from the multiple-translation parallel Russian Learner Translator Corpus and monolingual comparable corpora (purpose-built corpora of English and Russian essays), we hope to establish the syntactic conditions under which splitting in translation is very likely for typological reasons (limited to the chosen genre). We will describe structural parameters of the expected TL discourse in terms of discourse markers and other linguistic phenomena, such as sentence length, types of discourse relations between propositions, anaphora, etc., which motivate the start of a new sentence. One example of a frequent discourse marker is the conjunction 'and', which seems to be unusual in sentence-initial position. However, compared to Russian, it is typical for English to combine clauses with an interclausal ", and" or a semicolon. These structures often undergo splitting in Russian translation, with the new sentence being simply juxtaposed or starting with the Russian conjunction 'И'.
This proposal aims to establish an automated procedure to differentiate between typologically justified cases of sentence boundary change and those in which the shift has either "run away" or has not been used where appropriate, harming the cohesion/coherence of the TT. To this end, we plan to study the relation between various linguistic markers and the author's decision to start a new sentence in the TL (or to split one in the case of translation). We plan to employ both comparable and parallel corpora to guide us in establishing the said typological conditions and, eventually, in building a machine learning model which describes the above-mentioned relations in non-translated texts. The features of this model are sentence splitting markers, both lexical and syntactic, partly described in the literature and partly extracted from the data itself. This model is applied to translated texts of the same genre. Cases where the model fails to predict sentence boundaries in these texts are used to extract typical cohesion markers that correlate with erroneous sentence boundaries. Next, on the basis of these findings, experts annotate a training parallel corpus, which will be used to build another model that is able to predict sentence boundary mistakes as solutions that are not grounded in TL properties but constitute translationese or incoherence/ambiguity. Thus, the outcome of the research is a machine learning model which, when applied to translated English-Russian texts, is able to detect sentence boundaries that are incompatible or unusual compared to non-translated TL texts.
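As a rough illustration of the kind of lexical and syntactic splitting markers discussed above, a feature extractor for an English source sentence might look as follows. This is a minimal sketch with invented feature names; the study's actual feature set is richer and partly derived from the data itself:

```python
def splitting_features(sentence):
    """Extract a few surface features that plausibly signal a
    sentence-splitting candidate in English-Russian translation."""
    tokens = sentence.split()
    return {
        "length": len(tokens),                 # long sentences split more often
        "has_comma_and": ", and" in sentence,  # interclausal ", and"
        "has_semicolon": ";" in sentence,      # semicolon-joined clauses
        "n_commas": sentence.count(","),
    }
```

Feature dictionaries of this kind, computed over a corpus of non-translated TL sentences, would then feed the machine learning model described above.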

Bionote: Andrey Kutuzov is a researcher and associate lecturer at National Research University Higher School of Economics, Russia. This is also where he got his MA in Computational Linguistics after graduating from the Linguistics Department of Tyumen State University. He is a co-developer of the Russian Learner Translator Corpus (http://rus-ltc.org) and the Russian Error-Annotated Learner English Corpus (http://realec.org). Andrey's interests lie in the fields of parallel corpora, translation studies, distributional semantics and applying machine learning to natural language processing. Andrey's papers and CV are available at his homepage: https://hse-ru.academia.edu/AndreyKutuzov.


PAPER 6:

Title: Cohesion and Translation Variation: Corpus-based Analysis of Translation Varieties

Speaker: Ekaterina Lapshinova-Koltunski, Saarland University (Germany)

Abstract: In this study, we analyse cohesion in 'translation varieties' – translation types or classes which differ in the translation methods or knowledge involved, e.g. human vs. machine translation (MT) or professional vs. novice. We expect variation in the distribution of the different cohesive devices which occur in translations. Variation in translation can be caused by different factors, e.g. by systemic contrasts between source and target languages or different register settings, as well as by ambiguities in both source and target languages. For instance, the conjunction 'while' in the original sentence in (1a) is ambiguous between the readings 'during' and 'although'. The ambiguity is resolved in (1b), but not in (1c), as the German 'während' is also ambiguous:

(1a) My father preferred to stay in a bathrobe and be waited on for a change while he read the stacks of newspapers [...]
(1b) Mein Vater ist lieber im Bademantel geblieben und hat sich zur Abwechslung mal bedienen lassen und dabei die Zeitungsstapel durchgelesen [...]
(1c) Mein Vater saß die ganze Zeit im Bademantel da und ließ sich zur Abwechslung bedienen, während er die Zeitungen las [...]

English translations from German are less distinct and less register-dependent than German translations from English. The variation in English-to-German translations strongly depends on the register and the cohesive devices involved, reflecting either shining-through or normalisation phenomena. Therefore, for our analysis, we chose a corpus of English-to-German translation varieties containing five subcorpora: translations 1) by professionals, 2) by students, 3) with a rule-based MT system, 4) with a statistical MT system trained with big data, 5) with a statistical MT system trained with small data.

Our first observations show that translation varieties differ in the distribution of cohesive devices. For example, novice translations contain more personal reference than the other translation varieties, e.g. professional translations or rule-based MT output. Moreover, registers also differ in their preferences for cohesive devices: e.g., popular-science texts and instructions use the conjunctions während and dabei equally often in German original texts, while tourism texts and political essays make more use of während than dabei. In professional translations, we observe the same tendency. In student translations, however, während is overused in most cases. The same tendency is observed for MT, where dabei sometimes does not occur at all.

Thus, we want to examine how cohesive devices reflect translation methods, the evidence of 'experience' (professional vs. novice, or big data vs. small data), as well as the registers involved in the translation varieties under analysis. For this, we extract evidence for cohesive devices from the corpus and analyse the extracted data with statistical techniques, applying unsupervised analysis to discover where the differences lie, and supervised techniques to find the features contributing to these differences. This knowledge is useful for both human translation and MT, e.g. in evaluation and MT improvement.

Bionote: Ekaterina Lapshinova-Koltunski (Saarland University) is a post-doctoral researcher at the Department of Applied Linguistics, Translation and Interpreting. She finished her PhD on semi-automatic extraction and classification of language data at Institute for Natural Language Processing (Stuttgart) in 2010. Since then, she has been working in corpus-based projects related to language variation, language contrasts and translation, one of which is GECCo (http://www.gecco.uni-saarland.de/GECCo/Home.html). In 2012 she received a start-up research grant from the Saarland University to build resources for the analysis on variation in translation caused by different dimensions (register, translation method) resulting in translation varieties (including both human and machine translation).

PAPER 7:

Title: Exploring Discourse in Machine Translation Quality Estimation

Speaker: Carolina Scarton, University of Sheffield (UK) and Lucia Specia, University of Sheffield (UK)

Abstract: Discourse covers linguistic phenomena that can go beyond sentence boundaries and are related to text cohesion and coherence. Suitable elementary discourse units (EDUs) are defined depending on the level of analysis (paragraphs, sentences or clauses). Cohesion can be defined as a phenomenon where EDUs are connected using linguistic markers (e.g. connectives). Coherence is related to the topic of the text and to the logical relationships among EDUs (e.g. causality). A few recent efforts have been made towards including discourse information in machine translation (MT) systems and MT evaluation metrics. In our work, we address quality estimation (QE) of MT. This challenging task focuses on evaluating translation quality without relying on human references. Features extracted from examples of source and translation texts, as well as from the MT system, are used to train machine learning algorithms in order to predict the quality of new, unseen translations.

The motivation for using discourse information for QE is threefold: (i) on the source side: identifying discourse structures (such as connectives) or patterns of structures which are more difficult to translate, and will therefore most likely lead to low-quality translations; (ii) on the target side: identifying broken or incomplete discourse structures, which are more likely to be found in low-quality translations; (iii) comparing discourse structures on the source and target sides to identify not only possible errors, but also language peculiarities which are not appropriately handled by the MT system.

Since discourse phenomena can occur at document level, we moved from traditional sentence-level QE to document-level QE. Document-level QE is useful, for example, for evaluation in gisting scenarios, where the quality of the document as a whole is important so that the end-user can make sense of it. We have explored lexical cohesion for document-level QE for English-Portuguese, Spanish-English and English-Spanish translations in two ways: (i) considering repetitions of words, lemmas and nouns in both source and target texts; (ii) considering Latent Semantic Analysis (LSA) cohesion. LSA is a method that can capture cohesive relations in a text, going beyond simple repetition counts. In our scenario, each sentence is represented by a word vector over all the words that appear in the document. Sentences are then compared based on their word vectors, and sentences showing high similarity with most others are considered cohesive. Since LSA is language independent, it was applied to both source and target texts. LSA cohesion features improved the results over a strong baseline.
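The sentence-vector comparison just described can be shown on a toy document. The mini-document and stopword list are invented for the example, and the SVD dimensionality reduction that gives LSA its name is omitted for brevity: each sentence becomes a bag-of-words vector over the document vocabulary, and its cohesion score is its average cosine similarity to the other sentences.

```python
# Toy sentence-cohesion scoring: bag-of-words vectors + cosine similarity.

doc = [
    "the translation preserves the cohesion of the source text",
    "cohesion links sentences through repeated lexical items",
    "the weather in sheffield was grey that morning",
]

STOP = {"the", "of", "in", "was", "that", "a"}

def tokens(sentence):
    return [w for w in sentence.split() if w not in STOP]

vocab = sorted({w for s in doc for w in tokens(s)})

def vector(sentence):
    ws = tokens(sentence)
    return [ws.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = lambda v: sum(p * p for p in v) ** 0.5
    return dot / (norm(a) * norm(b)) if dot else 0.0

vecs = [vector(s) for s in doc]
scores = [sum(cosine(v, u) for u in vecs if u is not v) / (len(vecs) - 1)
          for v in vecs]

# The off-topic third sentence shares no content words with the rest,
# so it receives the lowest cohesion score.
print(scores.index(min(scores)))
```

In a full LSA treatment the sentence-by-word matrix would be reduced with SVD before comparison, so that near-synonyms also count towards similarity rather than only exact repetitions.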

Our next step is to move to Rhetorical Structure Theory (RST) to capture coherence phenomena. On the source side, RST trees will be extracted and we will correlate the occurrence (or absence) of discourse structures (e.g.: Nucleus, Satellite or relation types, such as Attribution) with the quality labels. The same will be applied on the target side, where incorrect discourse units are expected to correlate with low-quality translations.

Bionote: Carolina Scarton is a PhD candidate and Marie Curie Early Stage Researcher (EXPERT project) at The University of Sheffield, working in the Natural Language Processing group in the Department of Computer Science, under the supervision of Dr. Lucia Specia. Her research focuses on the use of discourse information for quality estimation of machine translation. She received a master's degree from the University of São Paulo, Brazil, in 2013, where she worked at the Interinstitutional Center for Computational Linguistics (NILC).

PAPER 8:

Title: Examining Lexical Coherence in a Multilingual Setting

Speakers: Karin Sim Smith, University of Sheffield (UK) and Lucia Specia, University of Sheffield (UK)

Abstract: Discourse has long been recognised as a crucial part of translation, but when it comes to Statistical Machine Translation (SMT), discourse information has mostly been neglected to date, as SMT decoders tend to work on a sentence-by-sentence basis. Our research concerns lexical coherence, an issue that has not yet been exploited in the context of SMT. We explore an entity-based discourse framework, applying it for the first time in a multilingual context, aiming to: (i) examine whether human-authored texts show different patterns of entities compared to (potentially incorrect) machine-translated texts, and a version of the latter fixed by humans, and (ii) understand how this discourse phenomenon is realised across languages.

Entity distribution patterns are derived from entity grids or entity graphs. Entity grids are constructed by identifying the discourse entities in the documents under consideration and building a 2D grid in which each column corresponds to an entity, i.e. a noun, being tracked, and each row represents a particular sentence in the document. Alternatively, these can be projected onto a bipartite graph in which the sentences and entities form the nodes and the connections between them form the edges.
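A toy grid makes the construction concrete. The three-sentence document and its hand-listed nouns below are invented (a real pipeline would extract entities syntactically, and richer grids record roles such as subject/object rather than bare presence); from the grid one can read off the entity-transition statistics that entity-based coherence models compare across texts.

```python
# Toy entity grid: rows are sentences, columns are entities, and each
# cell records whether the entity occurs in that sentence.
from collections import Counter

# Hypothetical document: nouns per sentence, already extracted.
sentences = [
    {"parliament", "directive"},
    {"directive", "member_states"},
    {"member_states"},
]

entities = sorted(set().union(*sentences))

# 'X' = entity present, '-' = absent (full grids use syntactic roles
# such as S/O/X instead of a simple presence mark).
grid = [["X" if e in s else "-" for e in entities] for s in sentences]

# Transition counts over adjacent sentence pairs, one per entity column.
transitions = Counter(
    (grid[i][j], grid[i + 1][j])
    for i in range(len(grid) - 1)
    for j in range(len(entities))
)
total = sum(transitions.values())
probs = {t: c / total for t, c in transitions.items()}
print(probs)
```

Documents whose entities persist across adjacent sentences yield high probabilities for the ('X', 'X') transition; it is exactly such transition-probability profiles that the experiments below compare across translation versions and languages.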

For the monolingual experiments, we use a corpus comprising three versions of the same documents: the human translation, the raw machine translation output and the post-edited version of that output, establishing whether any differences in lexical coherence are due to the nature of the texts or to potential errors in the machine-translated version. We observed some trends in our monolingual comparative experiments on these translation versions, indicating that certain patterns of difference between human-translated and machine-translated texts can be expected. We also applied the entity-grid framework in a multilingual context, to parallel texts in English, French and German. The goals are to understand differences in lexical coherence across languages and, in the future, to establish whether this can be used as a means of ensuring that the same level of lexical coherence is transferred from the source to the machine-translated documents.

We observed distinct patterns in our comparative multilingual approach: the probabilities of different types of entity transitions varied, indicating a different coherence structure in the different languages. In this instance we are comparing the same texts on a document-by-document basis, and hence the same genre and style, yet there is a clear and consistent difference in the probabilities. This would appear to indicate, among other things, that the manner in which lexical coherence is achieved varies from language to language. Besides establishing the worth of these features independently, we will also do so in the context of MT evaluation; our ultimate goal is then to integrate them into an SMT model, in the hope that they will influence the decoding process and improve overall text coherence.

Bionote: Karin Sim Smith is currently in the second year of her PhD at the Department of Computer Science, University of Sheffield, where she is part of the Modist project (Modelling Discourse in Machine Translation), which aims to improve discourse in machine translation. Specifically, she is researching ways to improve the coherence of SMT output, with the aim of learning coherence patterns that can be transferred from source to target text.

WRAP-UP SESSION (20 Minutes)
