DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

November 08, 2021

Abstract

Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 12.2% in unsupervised code translation, and 5.3% in natural language code search. Incidentally, we found that our pre-trained model is able to deobfuscate fully obfuscated source files, and to suggest descriptive variable names.
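To make the objective concrete, the following is a minimal sketch of how one training pair could be built for a Python snippet: identifiers are replaced with uninformative placeholders, and the model's target is the mapping back to the original names. The placeholder scheme (`FUNC_i`, `VAR_i`), the ` | `-separated target format, and the helper name `obfuscate` are assumptions made for illustration and are not specified in the abstract. The sketch renames every user-defined function, argument, and assigned variable, and requires Python 3.9+ for `ast.unparse`.

```python
import ast


def obfuscate(source):
    """Return (obfuscated_source, target) for one Python snippet.

    Hypothetical helper: placeholder names and target format are assumptions.
    """
    tree = ast.parse(source)
    mapping = {}                       # original name -> placeholder
    counts = {"FUNC": 0, "VAR": 0}     # next free index per placeholder kind

    def assign(name, kind):
        # Give each distinct identifier exactly one placeholder.
        if name not in mapping:
            mapping[name] = f"{kind}_{counts[kind]}"
            counts[kind] += 1

    class Collect(ast.NodeVisitor):
        """First pass: record names defined by the snippet itself."""

        def visit_FunctionDef(self, node):
            assign(node.name, "FUNC")
            self.generic_visit(node)

        def visit_arg(self, node):
            assign(node.arg, "VAR")

        def visit_Name(self, node):
            # Only names that are assigned to; builtins and imports are kept.
            if isinstance(node.ctx, ast.Store):
                assign(node.id, "VAR")

    class Rename(ast.NodeTransformer):
        """Second pass: substitute every recorded name with its placeholder."""

        def visit_FunctionDef(self, node):
            node.name = mapping.get(node.name, node.name)
            self.generic_visit(node)
            return node

        def visit_arg(self, node):
            node.arg = mapping.get(node.arg, node.arg)
            return node

        def visit_Name(self, node):
            node.id = mapping.get(node.id, node.id)
            return node

    Collect().visit(tree)
    Rename().visit(tree)

    # The target the model would be trained to generate from the obfuscated code.
    target = " | ".join(f"{ph} {name}" for name, ph in mapping.items())
    return ast.unparse(tree), target


example = """
def add_one(values):
    result = []
    for item in values:
        result.append(item + 1)
    return result
"""

obfuscated, target = obfuscate(example)
print(obfuscated)
print(target)
```

For this example the target string is `FUNC_0 add_one | VAR_0 values | VAR_1 result | VAR_2 item`. A sequence-to-sequence model trained on such pairs has to infer what the code does in order to propose sensible names, which is the intuition behind using deobfuscation as a pre-training signal.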

AUTHORS

Written by

Baptiste Rozière

Marie-Anne Lachaux

Marc Szafraniec

Guillaume Lample

Publisher

NeurIPS

Research Topics

Natural Language Processing (NLP)

Core Machine Learning
