Post-Editing Productivity and Raw Machine Translation Output Quality: Temporal and cognitive effort in discussion
Heloísa Delgado, Débora Pasin and Asafe Cortina
Pontifical Catholic University of Rio Grande do Sul
Organizations with large translation volumes and a broad range of target-language requirements have increasingly implemented Machine Translation (MT) technology; as a result, technical translators are progressively asked to post-edit according to specific guidelines and quality criteria. Organizations implementing MT are searching for post-editing productivity models, which means that professional translators need to engage with this inevitable development so that those models are realistic. Although research and industry reports demonstrate that it is feasible to increase productivity with the help of both MT and post-editing, there is concern about what can realistically be expected of post-editors and their productivity. Studies have shown that post-editing can have a positive effect on both productivity and quality. The speed at which extensive translated material is produced, and its subsequent quality, deserve broad discussion, taking post-editing productivity not only as the ratio of quantity and quality to time, but also in terms of the cognitive effort expended (effort here being inversely related to productivity, i.e., the higher the effort, the lower the productivity). Given that, the topics we would like to see addressed (but not limited to) in this panel are: i) correlations between automatic metrics and post-editing productivity, measured by processing speed and cognitive measures of effort (through the use of eye tracking, think-aloud protocols, and the like); ii) analysis of the quality of post-editing versus productivity; and iii) suggestions of metrics involving score thresholds and confidence estimation.
For informal enquiries: [heloisaDOTdelgadoATpucrsDOTbr]
Heloísa Koch Delgado (PUCRS) is an English language educator and translator and holds a PhD in Language Studies (UFRGS). Her main fields of research are Terminology, Translation and English Language Teaching. She is a research member of GELCORPSUL (Corpus Linguistics Study Group) and GPEOCS (Olympic Studies and Health Sciences Research Group), contributing mainly in the fields of terminology and translation output quality. She coordinates DicTrans (Pedagogical Trilingual Dictionary of Bipolar Disorder), partially supported by CNPq. Her present research focuses on post-editing productivity in specialized languages, taking into consideration factors such as temporal and cognitive effort.
Débora Montenegro Pasin (SD Language Office) is a specialist in Translation Studies (PUCRS) with international experience (seven years, USA and Italy) and extensive knowledge of Linguistics and Terminology, with emphasis on teaching and teacher training. She is a researcher and translator of technical and scientific texts in Portuguese, English, Italian, Spanish and French. She is a member of the DicTrans project, her main contributions being research on the translation of specialized languages and the study of cognitive effort in post-editing in English and Italian.
Asafe Cortina is majoring in Computer Science and in English Teaching. He has worked as an English and Spanish translator and interpreter both in Brazil and in the United States, and has organized and interpreted at events, mainly those related to Medicine and Computer Science. He is a member of the DicTrans Project, and his focal point is the analysis of automatic metrics involving score thresholds, along with issues related to human and machine translation interfaces.
Introduction session – 15 minutes
PAPER TITLES, ABSTRACTS AND BIONOTES
PAPER 1 (20 minutes for presentation + 10 minutes for discussion)
Title: Automatic evaluation of machine translation: correlating post-editing effort and Translation Edit Rate (TER) scores
Speakers: Mercedes García-Martínez, Center for Research and Innovation in Translation and Translation Technology (CBS), Arlene Koglin, Federal University of Minas Gerais (UFMG), Bartolomé Mesa-Lao, CBS & Michael Carl, CBS
The availability of systems capable of producing fairly accurate translations has increased the popularity of machine translation (MT). The translation industry is steadily incorporating MT into its workflows, engaging human translators to post-edit the raw MT output in order to comply with a set of quality criteria in as few edits as possible. The quality of MT systems is generally measured by automatic metrics, producing scores that should correlate with human evaluation.
In this study, we investigate correlations between one such metric, Translation Edit Rate (TER), and actual post-editing effort as reflected in post-editing process data collected under experimental conditions. Using the CasMaCat workbench as a post-editing tool, keystroke and eye-tracking data were collected from five professional translators under two different conditions: i) traditional post-editing and ii) interactive post-editing. In the second condition, as the user types, the MT system suggests alternative target translations which the post-editor can interactively accept or overwrite, whereas in the first condition no aids are provided to the user while editing the raw MT output. Each of the five participants was asked to post-edit 12 different texts using the interactivity provided by the system and 12 additional texts without interactivity (i.e. traditional post-editing) over a period of 6 weeks.
Process research in post-editing is often grounded in three different but related categories of post-editing effort, namely i) temporal (time), ii) cognitive (mental processes) and iii) technical (keyboard activity). For the purposes of this research, TER scores were correlated with two different indicators of post-editing effort as computed in the CRITT Translation Process Database (TPR-DB)*. On the one hand, post-editing temporal effort was measured using FDur values (duration of segment production time excluding keystroke pauses ≥ 200 seconds) and KDur values (duration of coherent keyboard activity excluding keystroke pauses ≥ 5 seconds). On the other hand, post-editing technical effort was measured using Mdel values (number of manually generated deletions) and Mins values (number of manually generated insertions).
Results show that TER scores have a positive correlation with actual post-editing effort as reflected in the form of manual insertions and deletions (Mins/Mdel) as well as time to perform the task (KDur/FDur).
* CRITT Translation Process Database: http://bridge.cbs.dk/platform/?q=CRITT_TPR-db
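As a rough illustration of the kind of correlation reported above, the sketch below computes a Pearson coefficient between segment-level TER scores and total manual edits (Mins + Mdel). All numbers and variable names here are hypothetical illustrations, not data from the study.

```python
# Minimal sketch: correlating TER scores with a post-editing effort
# indicator. The data below are invented for illustration only.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-segment values: TER scores and manual edits (Mins + Mdel).
ter_scores = [0.12, 0.25, 0.33, 0.48, 0.60]
manual_edits = [3, 7, 9, 14, 19]

r = pearson(ter_scores, manual_edits)
print(f"Pearson r = {r:.3f}")  # a value close to 1 indicates positive correlation
```

In a real replication, the per-segment TER and edit counts would of course come from the TPR-DB tables rather than from hand-typed lists.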
Bionote: Mercedes García-Martínez is a computer science engineer and a research assistant at the Center for Research and Innovation in Translation and Translation Technology, CBS (Denmark). Arlene Koglin is a PhD candidate in Translation Studies at the Federal University of Minas Gerais (Brazil). Bartolomé Mesa-Lao is a freelance translator and a research assistant at the Center for Research and Innovation in Translation and Translation Technology, CBS (Denmark). Michael Carl is an associate professor at the Department of International Business Communication, CBS (Denmark).
PAPER 2 (20 minutes for presentation + 10 minutes for discussion)
Title: Cognitive effort in discussion: insights from Portuguese-Chinese translation and post-editing task logs
Speakers: Márcia Schmaltz, Ana Luísa Varani Leal, Lidiao S. Chao & Derek F. Wong, University of Macau (UM), Igor Antonio Lourenço da Silva, Federal University of Uberlândia (UFU), Adriana Pagano & Fábio Alves, Federal University of Minas Gerais (UFMG), & Paulo Quaresma, University of Evora (UE).
This paper reports on an ongoing empirical-experimental project (AuTema-PostEd) which aims at tapping into translation and post-editing processes as a source of insight into the role of translators' understanding in task problem solving. It analyses data gathered from translation and post-editing task logs by subjects working with the language pair Portuguese-Chinese in both directions (L1 into L2 and L2 into L1), Chinese being the subjects' L1 and Portuguese their L2. Sixteen professional translators performed two translation tasks (L1 into L2 and L2 into L1) and two post-editing tasks (one in their L1 and another in their L2) using machine-translated input provided by the software PCT (Portuguese-Chinese Translator). Eye movements and keyboard and mouse activities were logged using the software Translog-II connected to a Tobii T120 eye tracker in order to capture translators' behaviour (user-activity data, UAD) while translating and post-editing. Retrospective protocols were recorded immediately after each task. Source texts were short news reports (80 words long, or the character equivalent) selected on the basis of distinctive cohesive chains running throughout them. The assumption was that identity chains, whereby discourse participants are introduced and tracked throughout the text, would require the translators to retrieve the identity of what is being talked about by referring to another expression either in the co-text or in the context of situation and culture; retrieval movements were thus expected to be captured by eye movements and keyboard activity during reading and writing. Machine-translated input was expected to have an impact on source text understanding, especially in instances of ambiguity, predetermining, whether correctly or wrongly, the final target text rendition. Task logs were analysed to investigate the text production of selected cohesive chains.
To that end, UAD from eye-tracking recordings (look-backs, look-forwards, fixation count and duration) and keyboard logging (text production between pauses, and recursiveness) were collected following the proposed methodology. A linear mixed-effects regression (LMER) model was applied to the data set, and retrospective protocols were analysed for subjects' verbalization of problem-solving decisions regarding the cohesive chains under study. Quantitative results showed that, regardless of task type (i.e., translating from scratch or post-editing), the cohesive chain type had an impact on producing the target text, but not on understanding the source text, while retrospective protocols suggested an impact on both. The results highlight the relevance of a fine-grained analysis of all data sources (i.e., eye tracking, key logging, and retrospective protocols) along with an analysis of the quality of the final renditions. Translation process research has borrowed a number of measures from research in other domains, such as reading and writing research, and only combined analyses may be able to show which measures are really applicable to studies focusing on translation and post-editing.
Bionotes: Márcia Schmaltz, Ana Luísa Varani Leal, Lidiao S. Chao & Derek F. Wong are researchers from the University of Macau (UM) Graduate Program in Translation Studies and the Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory (NLP2CT). Igor Antonio Lourenço da Silva is a researcher at the Federal University of Uberlândia (UFU). Adriana Pagano & Fábio Alves are researchers at the Laboratory for Experimentation in Translation (LETRA) at the Federal University of Minas Gerais (UFMG). Paulo Quaresma is a researcher at the Department of Computer Science at the University of Evora (UE).
PAPER 3 (20 minutes for presentation + 10 minutes for discussion)
Title: Monolingual post-editing: an investigation of temporal, technical and cognitive effort during task execution
Speaker: Norma Fonseca, Federal University of Minas Gerais (UFMG)
This study investigates temporal and technical effort by comparing monolingual post-editing data with bilingual post-editing and human translation data in the language pairs English-Portuguese and French-Portuguese. It also investigates whether cognitive effort is associated with metacognition during monolingual post-editing processes in the same language pairs. To that end, we carried out an exploratory study with six subjects in each language pair. Data were collected using key logging, screen recordings, and guided written protocols. The analysis focused on task execution time, on the number of mouse and keyboard movements in the three different tasks, on pauses lasting more than 5 seconds, and on evidence of metacognition in the guided written protocols of the monolingual post-editing task. Preliminary results indicate that temporal effort is greater in bilingual post-editing in the English-Portuguese language pair, and that technical effort is greater in monolingual post-editing in the same language pair. They also point to evidence of metacognition in the protocols: specifically, metacognitive knowledge of person variables, in which subjects show they are aware of the cognitive effort; knowledge of task variables, by recognizing, for example, the nature of the task they perform; and knowledge of strategy variables, by knowing how to deal with problems and how to adapt strategies to solve them. In line with advances in experimental research in Translation Studies, this study suggests the need for eye tracking to collect more accurate data on cognitive effort in the definitive data collection, which will be carried out starting in August 2014.
It also shows the usefulness of investigating more source languages, namely English, Spanish and Chinese, which have different degrees of similarity with the target language (Portuguese), to see how source and target language proximity can influence temporal, technical and cognitive effort. Furthermore, establishing more criteria for selecting experimental texts proved essential in order to ensure the same degree of textual complexity across the experimental texts in one kind of task: monolingual post-editing.
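The 5-second pause criterion mentioned above can be sketched as a simple scan over inter-keystroke intervals. The timestamps below are invented for illustration; a real study would read them from a key-logging tool's export.

```python
# Minimal sketch: flagging pauses longer than 5 seconds in a key log.
PAUSE_THRESHOLD = 5.0  # seconds; the criterion used in the abstract above

def long_pauses(timestamps, threshold=PAUSE_THRESHOLD):
    """Return (start, end) pairs of inter-keystroke pauses longer than threshold."""
    pauses = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > threshold:
            pauses.append((prev, curr))
    return pauses

# Hypothetical keystroke timestamps (in seconds) with two long pauses.
keystroke_times = [0.0, 0.4, 0.9, 7.2, 7.5, 14.0, 14.3]
print(long_pauses(keystroke_times))  # [(0.9, 7.2), (7.5, 14.0)]
```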
Bionote: Norma Fonseca is currently a PhD candidate in Applied Linguistics at the Graduate Program in Linguistics and Applied Linguistics (POSLIN) at the Federal University of Minas Gerais (UFMG) in Brazil, where she carries out empirical-experimental research in Translation Studies. She holds a Master's degree in Applied Linguistics from the same program and received her Bachelor's degree in English and Portuguese from the Federal University of Viçosa (UFV).
PAPER 4 (20 minutes for presentation + 10 minutes for discussion)
Title: Correlation and assessment of Omega-T translation output, post-editing and the human translated product: linguistic quality in focus
Speakers: Larissa Ramos, Roundtable Studio & Vanessa Fischer, TraduServices
Published research on post-editing is plentiful, but translation professionals may still expect the product of machine translation (MT) combined with post-editing to be inferior in quality to the human translated product. Fiederer and O'Brien conducted a study evaluating the quality of sentences produced by MT and subsequently post-edited against sentences translated by humans. They concluded that MT plus post-editing could be equal or even superior to human translation in quality, but highlighted that more research is needed, especially in terms of linguistic quality and end users' acceptance. Other studies have shown positive results for this method, both in productivity and in quality. Taking quality aspects into consideration, this presentation aims to correlate and assess, especially regarding linguistic quality issues, the potential of machine translation (MT) output, post-editing and human translation of scientific articles through the use of a risk-criteria methodology. Our corpus consists of 200 source-text sentences (around 5,000 tokens) extracted from an article in the Bipolar Disorder Journal (2010); the outputs of the Omega-T software, of a professional translator and of a post-editor were compared and analyzed. Our assessment methodology was based on an adaptation of Pym's model of risk-criteria analysis, in which "translation problems can be described as high-risk, low-risk or anything in between". Pym's adapted model helped us assess the outputs in a more objective way, although clarity and style aspects of the translated texts were also verified, especially concerning their proximity to the discourse inherent to the scientific community of psychiatric disorders. The risks were categorized as follows: a) word non-equivalence: low risk; b) word category: low risk; c) term non-equivalence: medium risk; d) word order: medium risk; and e) term non-equivalence and word order: high risk.
Results so far show that the most recurrent post-edited linguistic feature (50%) is noun/adjective/verb collocations, which fall into the high-risk category (syntactic and pragmatic aspects are affected) and are not seen in professional human translation. Records of the post-editions have been kept in AntConc to help with the recognition of other problems encountered in this phase, such as the cognitive effort caused by the number of linguistic inadequacies produced by the machine. Although these results partially show that linguistic quality may well be an issue in post-editing, they are far from conclusive. More concrete results, based on a larger corpus, will be reached at the beginning of 2015.
Bionotes: Larissa Ramos holds a Bachelor's degree in Letters (Federal University of Rio Grande do Sul, UFRGS) and works as a translator at RoundTable Studio. She is a research member of the DicTrans Project. Vanessa Fischer has a major in Letters from PUCRS. She works as a translator at TraduServices and is a research member of the DicTrans Project.
PAPER 5 (20 minutes for presentation + 10 minutes for discussion)
Title: Direct translation of Architecture terms provided by Omega-T: post-editing cognitive effort in discussion
Speakers: Asafe Cortina, PUCRS & Dirceu de Oliveira Garcia Filho, Núcleo Arquitetura
The academic field of Architecture in Brazil is not as strong as in the United States and Europe, but it has been growing in the last few years, bringing about a burgeoning number of Brazilian academic architects who need to present papers and projects and submit articles abroad. Even though Architecture is an area in which English is frequently used, professionals in the field usually lack English writing and speaking skills, demanding fully technical research from translators and interpreters who thus narrow the gap between supply (low English proficiency) and demand (the need to publicize research overseas). Architecture, like other fields, contains specialized terminology, and a substantial number of terms are commonly used on a daily basis. The Brazilian Portuguese terms "pé-direito," "planta tipo," and "cortes," for example, are often translated as "right foot," "plant type," and "cuts," respectively, by machine translation in general; however, their English equivalents are "high ceiling," "standard plan," and "sections." This qualitative study aims at describing and analyzing the cognitive effort of post-editing in terms of productivity and quality, concerning specifically the issue of direct (literal) translation frequently offered by the MT system (in our study, Omega-T). We selected five articles on Architecture in English (3,000 tokens), which were fed into the software AntConc to generate a frequency list of term candidates. As an example, in one text with 675 tokens, 34 were term candidates repeated throughout the text at least twice each, and around two thirds of them (not counting repetitions) received a direct translation from the MT system and, consequently, presented lexical and pragmatic inadequacies.
Although this research is still incipient and only provides initial results, we tend to believe that MT quality, regarding appropriate equivalents for polysemous words, is low, which consumes time and demands high cognitive effort from the post-editor. Far from conclusive, this research will present statistical data analysis (a larger corpus and a specific methodology based on Controlled Language (CL) rules) at the beginning of 2015. We will also describe and analyze the amount of cognitive effort spent while post-editing lexical inadequacies before and after applying the CL rules. The think-aloud method will be used to evaluate the translation process and the target-text revision. A professional architect will help revise the MT output to verify the adequacy of the translated terms, keeping them close to the discourse inherent to the technical community of the Architecture field.
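The frequency-list step described above (keeping tokens that recur at least twice as term candidates) can be sketched in a few lines. The sample text and the frequency threshold are illustrative assumptions, not the study's corpus or the AntConc tool itself.

```python
# Minimal sketch of a frequency-based term-candidate list, in the spirit
# of the AntConc word list described above.
from collections import Counter

def term_candidates(text, min_freq=2):
    """Return (token, count) pairs occurring at least min_freq times, most frequent first."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [(tok, n) for tok, n in counts.most_common() if n >= min_freq]

# Hypothetical sample text echoing the Architecture terms discussed above.
sample = ("the standard plan shows the sections and the high ceiling "
          "while each standard plan lists its sections")
print(term_candidates(sample))
```

Real term extraction would also filter out function words such as "the"; a stop-word list is the usual next step after raw frequency counting.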
Bionote: Asafe Cortina is majoring in English Language Teaching at PUCRS. He has been an English and Spanish translator/interpreter for 5 years, both in Brazil and in the United States. He has organized international events, mainly those related to Medicine and Computer Science. He is a member of the DicTrans Project. Dirceu Garcia Filho is an architect and urbanist who graduated from PUCRS. He currently develops 3D projects at "Núcleo Arquitetura." He studied graphic design and has worked with book editing and layout.
PAPER 6 (20 minutes for presentation + 10 minutes for discussion)
Title: Post-editing of machine translation output: an analysis of productivity and quality regarding the cognitive effort in decision-making processes
Speaker: Débora Montenegro Pasin, SD Language Office
This paper aims to analyze productivity and quality issues regarding the cognitive effort in decision-making processes in post-editing machine translation (henceforth, MT) output, in order to reach language accuracy and text adequacy in an optimized way. For the purposes of this paper, and to help disseminate the representativeness of Locke's work worldwide, an excerpt of the Wikipedia article on "Some Thoughts Concerning Education", a 1693 treatise on the education of gentlemen written by the English philosopher John Locke, was translated from English into Brazilian Portuguese using free translation software; once the MT output was obtained, post-editing began. It is relevant to mention that this paper is qualitatively based on the premises of risk criteria and, regarding productivity, on observations of post-editing metric correlations. According to a model of risk criteria, translation problems are related to risk levels (low, medium or high), and these are associated with text suitability, which comprises not only grammatical features but also, and more importantly, text meaning and intention. Recent research and industry reports indicate that it is possible to increase productivity by using MT and post-editing; however, it is not yet clear what productivity can realistically be expected from a post-editor: the one concerning the ratio of quantity and quality to time, the one related to the cognitive effort expended, or both. Partial results have shown that the higher the effort, the lower the productivity; on the other hand, high quality in socio-discursive pertinence is expected. The chosen excerpt consisted of 1,193 tokens, and the following percentages have been reached so far: (a) requires complete translation: 20% (high risk); (b) little post-editing needed: 40% (medium risk); (c) fit for purpose: 40% (low risk).
Although these results reinforce the need for post-editing when it comes to language accuracy and text adequacy (text quality per se), time-related productivity results are far from conclusive. Further studies are to be concluded by the end of 2014.
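The risk-level percentages reported above are simple shares of labeled segments, which can be sketched as follows. The per-segment labels below are hypothetical stand-ins chosen to reproduce the 20/40/40 split, not the study's actual annotations.

```python
# Minimal sketch: deriving risk-level percentages from per-segment labels.
from collections import Counter

def risk_shares(labels):
    """Map each risk label to its percentage of the total number of segments."""
    counts = Counter(labels)
    total = len(labels)
    return {level: 100 * n / total for level, n in counts.items()}

# Hypothetical labels for 10 sample segments matching the reported split.
segments = ["high"] * 2 + ["medium"] * 4 + ["low"] * 4
print(risk_shares(segments))  # {'high': 20.0, 'medium': 40.0, 'low': 40.0}
```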
Bionote: Débora Montenegro Pasin is a specialist in Translation Studies (Pontifical Catholic University of Rio Grande do Sul) with international experience (almost 10 years, USA and Italy) and extensive knowledge of Linguistics and Terminology, with emphasis on teaching and teacher training. She is a researcher and translator of technical and scientific texts in Portuguese, English, Italian, Spanish and French. She is a member of the DicTrans project, her main contributions being research on the translation of specialized languages and the study of cognitive effort in post-editing in English and Italian.
WRAP-UP TIME SECTION
Six presentations × 30 minutes = 180 minutes
Fifteen minutes for the introduction session
Twenty minutes for the wrap-up session
Total: 215 minutes