RESEARCH

SPEECH & AUDIO

Unsupervised Translation of Programming Languages

October 19, 2020

Abstract

A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.

Download the Paper

AUTHORS

Written by

Baptiste Rozière

Marie-Anne Lachaux

Lowik Chanussot

Guillaume Lample

Publisher

NeurIPS

Related Publications

May 24, 2024

SPEECH & AUDIO

NLP

DOC-RAG: ASR Language Model Personalization with Domain-Distributed Co-occurrence Retrieval Augmentation

Zhe Liu

May 24, 2024

April 14, 2024

SPEECH & AUDIO

NLP

CoLLD: Contrastive Layer-to-Layer Distillation for Compressing Multilingual Pre-Trained Speech Encoders

Heng-Jui Chang, Ning Dong (AI), Ruslan Mavlyutov, Sravya Popuri, Andy Chung

April 14, 2024

March 05, 2024

SPEECH & AUDIO

Generative Pre-training for Speech with Flow Matching

Alex Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

March 05, 2024

December 11, 2023

SPEECH & AUDIO

Audiobox: Unified Audio Generation with Natural Language Prompts

Wei-Ning Hsu, Akinniyi Akinyemi, Alice Rakotoarison, Andros Tjandra, Apoorv Vyas, Baishan Guo, Bapi Akula, Bowen Shi, Brian Ellis, Ivan Cruz, Jeff Wang, Jiemin Zhang, Mary Williamson, Matt Le, Rashel Moritz, Robbie Adkins, William Ngan, Xinyue Zhang, Yael Yungster, Yi-Chiao Wu

December 11, 2023

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.