Whispers of A.I.’s Modular Future

Aug 08, 2023

By James Somers

One day in late December, I downloaded a program called Whisper.cpp onto my laptop, hoping to use it to transcribe an interview I’d done. I fed it an audio file and, every few seconds, it produced one or two lines of eerily accurate transcript, writing down exactly what had been said with a precision I’d never seen before. As the lines piled up, I could feel my computer getting hotter. This was one of the few times in recent memory that my laptop had actually computed something complicated—mostly I just use it to browse the Web, watch TV, and write. Now it was running cutting-edge A.I.

Despite being one of the more sophisticated programs ever to run on my laptop, Whisper.cpp is also one of the simplest. If you showed its source code to A.I. researchers from the early days of speech recognition, they might laugh in disbelief, or cry—it would be like revealing to a nuclear physicist that the process for achieving cold fusion can be written on a napkin. Whisper.cpp is intelligence distilled. It's rare for modern software in that it has virtually no dependencies—in other words, it works without the help of other programs. Instead, it is ten thousand lines of stand-alone code, most of which does little more than fairly complicated arithmetic. It was written in five days by Georgi Gerganov, a Bulgarian programmer who, by his own admission, knows next to nothing about speech recognition. Gerganov adapted it from a program called Whisper, released in September by OpenAI, the same organization behind ChatGPT and DALL-E. Whisper transcribes speech in more than ninety languages. In some of them, the software is capable of superhuman performance—that is, it can actually parse what somebody's saying better than a human can.

What's so unusual about Whisper is that OpenAI open-sourced it, releasing not just the code but a detailed description of its architecture. They also included the all-important "model weights": a giant file of numbers specifying the synaptic strength of every connection in the software's neural network. In so doing, OpenAI made it possible for anyone, including an amateur like Gerganov, to modify the program. Gerganov converted Whisper to C++, a widely supported programming language, to make it easier to download and run on practically any device. This sounds like a logistical detail, but it's actually the mark of a wider sea change. Until recently, world-beating A.I.s like Whisper were the exclusive province of the big tech firms that developed them. They existed behind the scenes, subtly powering search results, recommendations, chat assistants, and the like. If outsiders have been allowed to use them directly, their usage has been metered and controlled.

There have been a few other open-source A.I.s in the past few years, but most of them have been developed by reverse engineering proprietary projects. LeelaZero, a chess engine, is a crowdsourced version of DeepMind's AlphaZero, the world's best computer player; because DeepMind didn't release AlphaZero's model weights, LeelaZero had to be trained from scratch, by individual users—a strategy that was only workable because the program could learn by playing chess against itself. Similarly, Stable Diffusion, which conjures images from descriptions, is a hugely popular clone of OpenAI's DALL-E and Google's Imagen, but trained with publicly available data. Whisper may be the first A.I. in this class that was simply gifted to the public. In an era of cloud-based software, when all of our programs are essentially rented from the companies that make them, I find it somewhat electrifying that, now that I’ve downloaded Whisper.cpp, no one can take it away from me—not even Gerganov. His little program has transformed my laptop from a device that accesses A.I. to something of an intelligent machine in itself.

There was a time when researchers believed that human-level speech recognition might be "A.I.-hard"—their way of describing a problem that was so difficult it might only fall when computers possessed general intelligence. The idea was that there was enough ambiguity in spoken language that the only way to parse it would be by actually understanding what the speakers meant. Last week, I heard something on the radio that might have sounded, to a computer, like "Can you crane a Ford?" But my brain, knowing the context of the conversation, seamlessly resolved it as "Can Ukraine afford." The problems of meaning and context insured that, for decades, speech recognition was considered a measuring stick for the field of A.I. as a whole. The only way to understand speech, the thinking went, was to really understand it.

In an influential 2019 essay, the A.I. researcher Richard Sutton explains that early speech-recognition programs were loaded with specialized linguistics knowledge—not just about syntax, grammar, and phonetics but about how the shape of the human mouth constrained what sounds were possible. Despite their sophistication, these programs didn't work very well. In the nineteen-seventies, there was a turn toward statistical methods, which dropped expert knowledge in favor of patterns learned from data—for instance, about which sounds and words tended to go together. The success of that approach bled out into the rest of A.I., leading the field to center much of its effort on statistics drawn from huge amounts of data. The strategy paid dividends: by 1990, the state of the art for consumer speech recognition was a program called DragonDictate, which worked in real time. But Dragon required users to enunciate clearly and pause between each word, and cost nine thousand dollars. A major improvement came in 1997, when the same company released Dragon NaturallySpeaking. You no longer had to pause when talking to it. Still, accuracy on truly free-flowing or accented or technical speech was relatively poor. I remember my godfather, a perennial early adopter, showing off the speech-recognition system in his car around this time; he used it to call home from the carphone. Dialling would have been easier.

Speech-recognition programs were still too glitchy to be seamless. It was time-consuming to correct their mistakes. And yet, they were still dauntingly complex. A textbook from 1999, which described a then state-of-the-art speech-recognition system similar to Dragon NaturallySpeaking, ran to more than four hundred pages; to understand it, one had to master complicated math that was sometimes specific to sound—hidden Markov models, spectral analysis, and something called "cepstral compensation." The book came with a CD-ROM containing thirty thousand lines of code, much of it devoted to the vagaries of speech and sound. In its embrace of statistics, speech recognition had become a deep, difficult field. It appeared that progress would come now only incrementally, and with increasing pain.

But, in fact, the opposite happened. As Sutton put it in his 2019 essay, seventy years of A.I. research had revealed that "general methods that leverage computation are ultimately the most effective, and by a large margin." Sutton called this "the bitter lesson": it was bitter because there was something upsetting about the fact that packing more cleverness and technical arcana into your A.I. programs was not only inessential to progress but actually an impediment. It was better to have a simpler program that knew how to learn, running on a fast computer, and to task it with solving a complicated problem for itself. The lesson kept having to be relearned, Sutton wrote, because jamming everything you knew into an A.I. often yielded short-term improvements at first. With each new bit of knowledge, your program would get marginally better—but, in the long run, the added complexity would make it harder to find the way to faster progress. Methods that took a step back and stripped expert knowledge in favor of raw computation always won out. Sutton concluded that the goal of A.I. research should be to build "agents that can discover like we can" rather than programs "which contain what we have discovered." In recent years, A.I. researchers seem to have learned the bitter lesson once and for all. The result has been a parade of astonishing new programs.

Ever since I’ve had tape to type up—lectures to transcribe, interviews to write down—I’ve dreamed of a program that would do it for me. The transcription process took so long, requiring so many small rewindings, that my hands and back would cramp. As a journalist, knowing what awaited me probably warped my reporting: instead of meeting someone in person with a tape recorder, it often seemed easier just to talk on the phone, typing up the good parts in the moment. About five years ago, with a mix of shame and relief, I started paying other people to do transcription for me. I used a service called Rev, which farmed out the work and took a cut. It was expensive—around a hundred dollars for just a single interview—but the price testified to the labor involved. Rev had a much cheaper A.I. option, but, like other transcription programs I’d tried, it was just inaccurate enough to be a nuisance. It felt like you’d spend more time correcting the bad transcript than just typing it up yourself.

A year and a half ago, I heard about a service called Otter.AI, which was so much better than anything that had come before as to suggest a difference in kind. It wasn't great at punctuation, and you still had to correct it here and there, but it was the first transcription program that made tedious re-listening unnecessary. I was so impressed that it became a regular part of my workflow. A once impossible problem seemed to be at the almost-there stage.

Late last year, when Whisper appeared out of nowhere, it solved my problem for good. Whisper is basically as proficient as I am at transcription. The program picks up on subtle jargon, handling words whose sounds might easily be confused with other words’; for instance, it correctly hears a mechanical engineer saying, "It's going to take time to CAD this up," even capitalizing "CAD"—an acronym for "computer-aided design"—correctly. It figures out how to punctuate a person's self-interruptions, as in, "We’re almost going to ship. We’re about to—the next one's going to ship." It's free, it runs on my laptop, and it's conceptually simpler, by a long shot, than anything that came before it.

Nearly a decade ago, I wrote an essay wondering what would happen if speech transcription became truly ubiquitous. For one thing, it seems likely that we’ll see a lot more dictation. (Already, even though speaking to my phone feels unnatural, I find myself doing it more and more.) Once the technology reaches a certain level of quality, the task of the court reporter could go away; archivists might rejoice as recordings of speeches, meetings, depositions, and radio broadcasts from long ago became searchable. There could be even larger changes—we talk a lot, and almost all of it goes into the ether. What if people recorded conversations as a matter of course, made transcripts, and referred back to them the way we now look back to old texts or e-mails? There is something appealing to me about hoarding chit-chat; talking is easily my favorite activity, and I love the idea of honoring it by saving it. But then you think of advertisers paying handsomely to examine mentions of their brand names in natural conversation. You imagine losing a friend or a job over a stupid comment. Really, the prospect is terrifying.

Whisper's story reveals a lot about the history of A.I. and where it's going. When a piece of software is open-source, you can adapt it to your own ends—it's a box of Legos instead of a fully formed toy—and software that's flexible is remarkably enduring. In 1976, the programmer Richard Stallman created a text-editing program called Emacs that is still wildly popular among software developers today. I use it not just for programming but for writing: because it's open-source, I’ve been able to modify it to help me manage notes for my articles. I adapted code that someone had adapted from someone else, who had adapted it from someone else—a chain of tinkering going all the way back to Stallman.

Already, we’re seeing something similar happen with Whisper. A friend of mine, a filmmaker and software developer, has written a thin wrapper around the tool that transcribes all of the audio and video files in a documentary project to make it easier for him to find excerpts from interviews. Others have built programs that transcribe Twitch streams and YouTube videos, or that work as private voice assistants on their phones. A group of coders is trying to teach the tool to annotate who's speaking. Gerganov, who developed Whisper.cpp, has recently made a Web-based version, so that users don't have to download anything.

Nearly perfect speech recognition has become not just an application but a building block for applications. As soon as this happens, things move very fast. When OpenAI's text-to-image program, DALL-E, came out, it caused a sensation—but this was nothing compared with the flurry of activity kicked off by its open-source clone, Stable Diffusion. DALL-E used a "freemium" model, in which users could pay for additional images, and no one could modify its code; it generally proved more powerful and accurate than Stable Diffusion, because it was trained on mountains of proprietary data. But it's been forced to compete with a vast number and variety of adaptations, plug-ins, and remixes coming from the open-source community. Within weeks, users had adapted Stable Diffusion to create an "image-to-image" mode, in which they could tell the program to tweak an existing image with a text prompt. By repeatedly invoking this mode, a new method of illustration became possible, in which a user could iteratively compose an image with words, as if bossing around an endlessly patient robot artist.

This opening up, rather than any specific leap forward in capabilities, is defining the current moment in A.I. ChatGPT, OpenAI's conversational chatbot, is exciting not because it is particularly intelligent—it's often a fountain of bullshit or banality—but because whatever intelligence it does have is just there, for anyone to use at any time. The program's availability is perhaps its most important feature, because it allows ordinary people to suss out what it's good for. Even so, ChatGPT is not yet as open as Whisper. Because automated writing is so potentially valuable, OpenAI has an interest in tightly controlling it; the company charges for a premium version, and an ecosystem of for-profit apps that do little more than wrap ChatGPT will doubtless soon appear.

Eventually, though, someone will release a program that's nearly as capable as ChatGPT, and entirely open-source. An enterprising amateur will find a way to make it run for free on your laptop. People will start downloading it, remixing it, connecting it, rethinking and reimagining. The capabilities of A.I. will collide with our collective intelligence. And the world will start changing in ways we can't yet predict. ♦