Forward-looking: A team of researchers from around the globe working with Nvidia has crafted what’s being described as a Swiss Army knife for sound – an AI model capable of generating or transforming virtually any mix of music using any combination of audio files or text prompts.
The new model is known as Fugatto, which is short for Foundational Generative Audio Transformer Opus 1. According to Nvidia, its capabilities are unparalleled. For example, Fugatto can create a tune based solely on text, change the emotion in a singer’s voice or modify their accent, and even add or remove instruments from an existing song.
Fugatto could revolutionize the music creation process. With it, a producer could quickly prototype an idea for a new song complete with custom voice styles and instruments, or adjust effects in an existing track.
Ido Zmishlany, a multi-platinum producer and songwriter, believes AI and tools like Fugatto will help write the next chapter of music. That said, the model isn’t limited to music production.
Nvidia highlighted several alternate use cases, such as an advertising agency using it to modify voiceovers in a campaign to accommodate different regions, situations, or languages. The model could also help enhance language learning tools by allowing a user to customize the voice of the speaker, like making it sound like a friend or family member.
Video game developers could use the tool to create new assets on the fly based on player inputs, or modify pre-recorded assets to best fit the level of on-screen action at any given time.
Rafael Valle, one of the researchers who worked on the project, said they wanted to create a model that understands and generates sound like humans do.
More than a year of work went into crafting the full version of Fugatto, which uses 2.5 billion parameters. Nvidia said the model was trained on a group of DGX systems powered by 32 Nvidia H100 Tensor Core GPUs. Unfortunately, Nvidia did not share a timeline for when Fugatto might be released to the public.