Matthew Griffin, described as "The Adviser behind the Advisers" and a "Young Kurzweil," is the founder and CEO of the World Futures Forum and the 311 Institute, a global Futures and Deep Futures consultancy working between 2020 and 2070, and is an award-winning futurist and author of the "Codex of the Future" series. Regularly featured in the global media, including AP, BBC, Bloomberg, CNBC, Discovery, RT, Viacom, and WIRED, Matthew's ability to identify, track, and explain the impacts of hundreds of revolutionary emerging technologies on global culture, industry and society is unparalleled. Recognised for the past six years as one of the world's foremost futurists, innovation and strategy experts, Matthew is an international speaker who helps governments, investors, multi-nationals and regulators around the world envision, build and lead an inclusive, sustainable future. A rare talent, Matthew's recent work includes mentoring Lunar XPrize teams, re-envisioning global education and training with the G20, and helping the world's largest organisations envision and ideate the future of their products and services, industries, and countries. Matthew's clients include three Prime Ministers and several governments, including the G7, as well as Accenture, Aon, Bain & Co, BCG, Credit Suisse, Dell EMC, Dentons, Deloitte, E&Y, GEMS, Huawei, JPMorgan Chase, KPMG, Lego, McKinsey, PWC, Qualcomm, SAP, Samsung, Sopra Steria, T-Mobile, and many more.
WHY THIS MATTERS IN BRIEF
Creating, and then converting, video content is hugely laborious, so companies are building AIs that do the work for you, and those AIs are getting better fast.
Nvidia and MIT have announced that they've open sourced their stunning Video-to-Video Artificial Intelligence (AI) synthesis model. In short, they've just released a highly advanced AI that's frighteningly good at creating synthetic content, in other words converting real video into synthetic video, which could be used not just to create new VR content but also to produce more convincing fake content. And while I'm going to walk you through what it is and why it's so interesting, frankly you might just want to watch the video. Put a cushion on the floor first, though, because you're going to fall off your chair when you see what they've created with it.
Anyway, onto the article… by using a Generative Adversarial Network (GAN) the team were able to "generate high resolution, photorealistic and temporally (time) coherent results with various input formats," including segmentation masks, sketches, and poses – and that's a huge leap forward in a field where big leaps happen almost daily.
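For readers who want a feel for what "adversarial" means here, the sketch below shows the core GAN objective in plain numpy. This is a toy illustration of the general idea, not the team's actual model: a discriminator is trained to score real frames near 1 and synthesised frames near 0, while the generator is trained to push its frames' scores towards 1, i.e. to fool the discriminator. The scores used at the bottom are made-up numbers for demonstration.

```python
import numpy as np

def bce(probs, targets):
    """Binary cross-entropy, the loss at the heart of the original GAN objective."""
    eps = 1e-7
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))

def discriminator_loss(d_real, d_fake):
    # The discriminator wants real frames scored 1 and synthesised frames scored 0.
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    # The generator wants its synthesised frames to fool the discriminator (scored 1).
    return bce(d_fake, np.ones_like(d_fake))

# Toy scores: a discriminator that tells real from fake apart easily vs. one that
# has been fooled and can only guess (0.5 everywhere). The fooled case costs more.
confident = discriminator_loss(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
fooled = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

Training alternates between the two losses until the generator's output is good enough that the discriminator can no longer reliably tell real video from synthetic, which is what makes the results so photorealistic.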
Compared to Image-to-Image (I2I) translation and its close relative Text-to-Video (T2V) translation, which lets people type in text and have an AI auto-generate the corresponding video, something I've discussed before and which is amazing in itself, there's been far less research into making AIs that can perform Video-to-Video (V2V) translation and synthesis.
And why, you might ask, should anyone care about V2V? Well, for starters it would let you capture video of a city and instantly convert it into digital footage that you could then use to create a realistic Virtual Reality (VR) world – with the added perk that you could then use another AI to modify that world on the fly in any way you like – as the video above demonstrates nicely by turning the buildings in a city into trees. And so on…
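The buildings-into-trees trick works because the model's input is a semantic segmentation map, an image where every pixel holds a class ID, so editing the world is as simple as relabelling pixels before synthesis. Here's a minimal sketch of that editing step; the class IDs are hypothetical (real datasets such as Cityscapes define their own), and the actual photorealistic rendering is of course done by the GAN, not by this snippet.

```python
import numpy as np

# Hypothetical class IDs, purely for illustration.
ROAD, BUILDING, TREE = 0, 2, 8

def swap_labels(seg_map, src_class, dst_class):
    """Return a copy of a segmentation map with one semantic class relabelled."""
    out = seg_map.copy()
    out[out == src_class] = dst_class
    return out

# A tiny 2x2 "frame" of class labels: one road pixel, two buildings, one tree.
frame_labels = np.array([[ROAD, BUILDING],
                         [BUILDING, TREE]])

# Relabel every building pixel as tree; feed the edited map to the generator
# and it synthesises trees where the buildings used to be.
edited = swap_labels(frame_labels, BUILDING, TREE)
```

Because the edit happens in label space rather than pixel space, the same one-line change applies consistently across every frame of the video.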
Until now, though, V2V translation has been dogged by two problems: the low visual quality and the temporal incoherency of the video produced by existing image synthesis approaches. The team has been able to solve both, to the point that their new AI can create 2K resolution videos up to 30 seconds long – another set of breakthroughs.
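Temporal incoherency is just flicker: objects changing colour or texture from one frame to the next. A crude way to see the problem is to measure how much consecutive frames differ, as in the hedged sketch below. The real work compares frames after warping them by optical flow to discount genuine motion; this simplified proxy ignores motion entirely and the example frames are synthetic test data.

```python
import numpy as np

def flicker_score(frames):
    """Crude temporal-coherence proxy: mean absolute change between consecutive
    frames. Lower is steadier. (The real evaluation uses flow-warped comparisons
    so that genuine object motion isn't counted as flicker.)"""
    frames = np.asarray(frames, dtype=float)
    return float(np.mean(np.abs(np.diff(frames, axis=0))))

# A smoothly brightening clip vs. one that flashes between dark and bright.
steady = [np.full((4, 4), 0.5), np.full((4, 4), 0.52), np.full((4, 4), 0.54)]
flickery = [np.full((4, 4), 0.1), np.full((4, 4), 0.9), np.full((4, 4), 0.1)]
```

A frame-by-frame image synthesiser tends to score like the flickery clip, because each frame is generated independently; conditioning each new frame on the previous ones is what pushes the output towards the steady case.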
During their research the authors performed "extensive experimental validation on various datasets," and "the model showed better results than existing approaches from both quantitative and qualitative perspectives." In addition, when they extended the method to multimodal video synthesis with identical input data, the model produced new visual properties in the scene with both high resolution and coherency.
The team then went on to suggest that the model could be improved in the future by adding additional 3D cues such as depth maps to better synthesise turning cars; using object tracking to ensure an object maintains its colour and appearance throughout the video; and training with coarser semantic labels to solve issues in semantic manipulation.