This week, though, Facebook researchers announced a major breakthrough in the area after developing what they call a “neural transcompiler,” a system that converts code from one high-level programming language, such as C++, Java, or Python, into another – something that’s very hard even for an experienced programmer to do. You can think of it almost like a Universal Translator, but for code …
Their new platform is also unsupervised, meaning it looks for previously undetected patterns in unlabeled data sets with minimal human supervision, and it reportedly outperforms rule-based baselines by a “significant margin.”
Migrating an existing codebase to a modern or more efficient language like Java or C++ requires expertise in both the source and target languages, and it’s often costly. For example, the Commonwealth Bank of Australia spent around $750 million over the course of five years to convert its platform from COBOL to Java. Transcompilers could help in theory — they eliminate the need to rewrite code from scratch — but they’re difficult to build in practice because different languages have different syntax and rely on different platform APIs, standard-library functions, and variable types.
Facebook’s system, called TransCoder, which can translate between C++, Java, and Python, tackles the challenge with an unsupervised learning approach. TransCoder is first initialized with cross-lingual language model pretraining, which maps pieces of code expressing the same instructions to identical representations regardless of programming language. Input streams of source code sequences are randomly masked out, and TransCoder is tasked with predicting the masked-out portions based on context. A process called denoising auto-encoding then trains the system to generate valid sequences even when fed noisy input data, and back-translation allows TransCoder to generate parallel data that can be used for training.
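To make the masking step concrete, here is a minimal sketch of how a masked-pretraining objective works, using a toy whitespace tokenizer. The function name, mask rate, and tokenizer are illustrative assumptions, not TransCoder’s actual implementation:

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace a fraction of tokens with a mask symbol.

    The model is then trained to predict the original tokens at the
    masked positions from the surrounding context.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)  # no loss computed at this position
    return masked, targets

# The same objective applies whether the tokens came from C++, Java,
# or Python source code.
source = "for ( int i = 0 ; i < n ; ++ i )".split()
masked, targets = mask_tokens(source, seed=0)
print(masked)
```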
The cross-lingual nature of TransCoder arises from the many common tokens — anchor points — that exist across programming languages, which come from shared keywords like “for,” “while,” “if,” and “try,” as well as digits, mathematical operators, and English strings that appear in the source code. Back-translation improves the system’s translation quality by coupling a source-to-target model with a “backward” target-to-source model trained in parallel. The target-to-source model translates target sequences into the source language, producing noisy source sequences, while the source-to-target model learns to reconstruct the target sequences from those noisy sources, and the two models improve each other until they converge.
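Here is a schematic of one back-translation round in Python. The `target_to_source` callable is a hypothetical stand-in for the backward model; the real system uses neural sequence-to-sequence models, which are omitted here:

```python
def back_translation_step(target_functions, target_to_source):
    """Build synthetic parallel data from monolingual target functions.

    The backward model translates each real target-language function
    into a (noisy) source-language function; the resulting
    (noisy source, clean target) pairs can then train the forward
    source-to-target model, with no human-labeled pairs required.
    """
    pairs = []
    for target_fn in target_functions:
        noisy_source = target_to_source(target_fn)  # e.g. Java -> C++
        pairs.append((noisy_source, target_fn))
    return pairs

# The forward model is trained to reconstruct each clean target from
# its noisy source, and the two directions alternate until convergence.
```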
The Facebook researchers trained TransCoder on a public GitHub corpus containing over 2.8 million open source repositories, targeting translation at the function level – in programming, functions are blocks of reusable code that perform a single, related action. After pretraining TransCoder on all available source code, they trained the denoising auto-encoding and back-translation components on functions only, alternating between the components with batches of around 6,000 tokens.
To evaluate TransCoder’s performance, the researchers extracted 852 parallel functions in C++, Java, and Python from GeeksforGeeks, an online platform that gathers coding problems and presents solutions in several programming languages. Using these, they developed a new metric — computational accuracy — that tests whether hypothesis functions generate the same outputs as a reference when given the same inputs.
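The metric itself is simple to state: run the translated (hypothesis) function and its reference on the same test inputs and check that the outputs match. Below is a minimal Python sketch of that idea; the function name and the use of plain callables are illustrative assumptions, since the actual evaluation runs generated functions against unit tests:

```python
def computational_accuracy(candidates, references, test_inputs):
    """Fraction of translated functions that return the same outputs
    as their reference implementation on every test input.
    """
    passed = 0
    for candidate, reference in zip(candidates, references):
        try:
            if all(candidate(x) == reference(x) for x in test_inputs):
                passed += 1
        except Exception:
            pass  # a crashing translation counts as a failure
    return passed / len(references)

# A translation that renames variables or restructures the code but
# preserves behavior still scores as correct:
ref = lambda n: sum(range(n))
hyp = lambda m: m * (m - 1) // 2
print(computational_accuracy([hyp], [ref], test_inputs=range(10)))  # 1.0
```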
Facebook notes that while the best-performing version of TransCoder didn’t generate many functions strictly identical to the references, its translations had high computational accuracy. The researchers attribute this to beam search, a decoding method that maintains a set of partially decoded sequences, extends each one token by token, and scores the results so the best sequences bubble to the top (a sketch of the idea follows the results below):
When translating from C++ to Java, 74.8% of TransCoder’s generations returned the expected outputs.
When translating from C++ to Python, 67.2% of TransCoder’s generations returned the expected outputs.
When translating from Java to C++, 91.6% of TransCoder’s generations returned the expected outputs.
When translating from Python to Java, 56.1% of TransCoder’s generations returned the expected outputs.
When translating from Python to C++, 57.8% of TransCoder’s generations returned the expected outputs.
When translating from Java to Python, 68.7% of TransCoder’s generations returned the expected outputs.
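For reference, here is a minimal, generic beam search decoder in Python. It illustrates the technique rather than TransCoder’s actual decoder, and `step_fn`, which returns (token, log-probability) continuations for a partial sequence, is a hypothetical stand-in for the trained model:

```python
def beam_search(step_fn, start_token, end_token, beam_size=5, max_len=50):
    """Keep the `beam_size` highest-scoring partial sequences at each
    step instead of committing to a single greedy choice.
    """
    # Each beam entry is (cumulative log-probability, token sequence).
    beams = [(0.0, [start_token])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end_token:   # finished sequences carry over
                candidates.append((score, seq))
                continue
            for token, logp in step_fn(seq):
                candidates.append((score + logp, seq + [token]))
        # Keep only the best partial sequences for the next step.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(seq[-1] == end_token for _, seq in beams):
            break
    return beams
```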
According to the researchers, TransCoder demonstrated an understanding of the syntax specific to each language, as well as the languages’ data structures and their methods, and it correctly aligned libraries across programming languages while adapting to small modifications, such as when a variable in the input was renamed. And while it wasn’t perfect — TransCoder failed to account for certain variable types during generation, for example — it outperformed frameworks built from rewrite rules manually crafted using expert knowledge.
“TransCoder can easily be generalized to any programming language, does not require any expert knowledge, and outperforms commercial solutions by a large margin,” the co-authors wrote. “Our results suggest that a lot of mistakes made by the model could easily be fixed by adding simple constraints to the decoder to ensure that the generated functions are syntactically correct, or by using dedicated architectures.”
Facebook isn’t the only organization developing code-generating AI systems. During Microsoft’s Build conference earlier this year, OpenAI demoed a model trained on GitHub repositories that uses English-language comments to generate entire functions. And two years ago, researchers at Rice University created a system — Bayou — that’s able to write its own software programs by associating “intents” behind publicly available code.
“[Programs like these are] really just trying to eliminate the minutiae of creating software,” said principal scientist and director at Intel Labs Justin Gottschlich. “[They] could help accelerate productivity … [by taking care of] debugging. [And they could] increase the number of jobs [in tech] because people who don’t have a programming background will be able to take their creative intuition and capture that via machine by these intentionality interfaces.”
Matthew Griffin, described as “The Adviser behind the Advisers” and a “Young Kurzweil,” is the founder and CEO of the World Futures Forum and the 311 Institute, a global Futures and Deep Futures consultancy working between the dates of 2020 and 2070, and is an award-winning futurist and author of the “Codex of the Future” series.
Regularly featured in the global media, including AP, BBC, Bloomberg, CNBC, Discovery, RT, Viacom, and WIRED, Matthew’s ability to identify, track, and explain the impacts of hundreds of revolutionary emerging technologies on global culture, industry, and society is unparalleled. Recognised for the past six years as one of the world’s foremost futurists and innovation and strategy experts, Matthew is an international speaker who helps governments, investors, multi-nationals, and regulators around the world envision, build, and lead an inclusive, sustainable future.
A rare talent, Matthew has recently mentored Lunar XPrize teams, re-envisioned global education and training with the G20, and helped the world’s largest organisations envision and ideate the future of their products and services, industries, and countries.
Matthew's clients include three Prime Ministers and several governments, including the G7, Accenture, Aon, Bain & Co, BCG, Credit Suisse, Dell EMC, Dentons, Deloitte, E&Y, GEMS, Huawei, JPMorgan Chase, KPMG, Lego, McKinsey, PWC, Qualcomm, SAP, Samsung, Sopra Steria, T-Mobile, and many more.