The emergence of digital audio in its various manifestations, whether through podcasts, microcasts, social audio, voice assistants, embedded audio, smart speakers, or earbuds, has left many marketers perplexed:

"Why are we taking this step backward to humble audio from our march towards extravagant multimodality, and what does this retreat from more-and-more mean?"

My own answer to what I believe is a false paradox has been that we are witnessing no retreat, but rather a simple instance of less is more. People wish to have the ability to do less of one thing so that they can do more of something else. Less seeing and touching of the interface and more seeing and touching of the real world. 

To fully grasp and make sense of the reality of this moment and the rationality of this move, I have found the following five concepts helpful:

1. The Gutenberg Parenthesis

This is the proposition, initially formulated by Prof. L. O. Sauerberg and popularized by Thomas Pettitt, both of the University of Southern Denmark, that the period between the birth of Gutenberg's printing press in the mid-15th century and the rise of the internet at the end of the 20th century marks a parenthesis in the broad arc of human communication, an arc otherwise dominated by orality, the ephemeral, conversation and gossip.

During this period, an increasing number of us read books, novels, novellas, newspapers, magazines, pamphlets, manifestos and pitch letters, and communicated with each other in long-form written text, full sentences and all.

Since the birth of the internet, however, and especially since the emergence of social media and smartphones, our communications have become far less text-centric in the traditional sense. Yes, we still send emails and text messages to one another, but we are also using videos, emojis and GIFs to communicate with each other.

Those who believe that we are indeed living through the closing of the parenthesis see the emergence of audio, with podcasting and social audio as the most striking examples, as the decisive next step. If not a return to how things were, it is at least the restoration to prominence of what was once the predominant way we communicated with each other: orality. And, to be specific, not just any orality, and certainly not secondary orality (that is, voiced articulations of written paragraphs), but primary, spontaneous, authentic spoken communication.

2. Flow: Acting Joyfully

This concept, articulated and popularized by the late Hungarian-American psychologist Mihaly Csikszentmihalyi, touches on the proposition that we are happiest and most fulfilled when we are so absorbed in what we are doing (playing the guitar, writing an essay, shooting hoops, painting) that we are able to successfully and productively act almost effortlessly and as a result joyfully. 

We have all experienced such a state, which we commonly refer to as being “in the zone” or “in a groove,” and our delight at being in it is so powerful that it becomes something we chase. Getting into the zone, many contend, not without justification, is becoming harder and harder, given the culture of distraction in which we live and operate today. In other words, we live in conditions that make that state less likely to arise and more difficult to sustain.

We are in the middle of crafting an essay, we are picking up pace, the full essay is finally firming up, we can see how this thing will end, and whole sentences are raining upon us like buckets, when suddenly the sharp “ding!” of a text message pops our bubble of concentration and scatters our bliss.

Key to being in a state of flow is time. We are flowing in tandem with time, one action follows another, no pauses, “like playing jazz,” as Csikszentmihalyi put it. To be in a state of flow is to be immersed, captured, moving along, entranced.

And that is where audio comes in: audio is a time-linear medium. Unlike text, we can’t easily go back and forth, skip from here to there, and back to here, and unlike the visual, we don’t receive everything at once, in parallel with time. This means that embedded within the very nature of audio is the plumbing that biases a listener’s experience to one conducive to flow. Hold the hand of a song, a podcast, a social audio conversation, and before long, you are moving in step with time, pulled into yourself and given a chance to be submerged and far more likely to tip into a state of flow.

3. The Medium Is the Message

This is the perspective, articulated most forcefully by Marshall McLuhan, that rejects any hard separation between form and content and instead proposes that the very form of the message heavily influences the nature of the message itself.

For instance, the constraints of typing a tweet on a smartphone force us to write short messages, limiting not only what we communicate, but also how we formulate it. When limited to a few characters, if we want to be effective, we must confine our communication to one specific core message (one idea, one emotion, one call to action) and we need to be pithy, packing as much rhetorical punch as possible into the characters that we use (including the emojis that we select).

In contrast, such compactness and pithiness would be not only unnecessary but inappropriate in the context of emailing, where one is afforded ample space to articulate their message. Too short and one is perceived as cryptic, obtuse, even impolite. Instead, one is expected to greet, preface, flesh out, contrast and nuance, and then conclude one’s communication.  

With audio, the ability to hear a person’s voice introduces a dimension that is wholly absent in text- and emoji-based communications: the humanness of what is being communicated. Emotion (anxiety, enthusiasm, irritation, satisfaction, sarcasm), which is integrally communicated along with the words spoken, can easily be missed or even wrongly interpreted in the medium of text. (No wonder AI has a tough challenge parsing sarcasm and irony in social media text.)

4. Communicative Rationality

You know those awkward situations when you are walking in the mall and a young, energetic person tending one of those small kiosks that sell widgets, knick-knacks, and various services (including back massages and facial hair removal), accosts you in the hope of getting you to engage with them (i.e., buy what they are selling), and you awkwardly try to move right along?

Why are those moments so awkward, and why do we feel more than a bit ashamed of ourselves (for those of us who do) and a bit guilty about the way we just treated a fellow human being? The German philosopher Jürgen Habermas has a plausible answer: We human beings don’t like to be treated like objects, like means, mere walking wallets with dollars in them, and we do not feel comfortable conversing with someone who is pretending to tell us how good this thing is, when in fact all they are doing is trying to pull some of those dollars out of our wallets.

In other words, the conversation that we are having with them is not an act of communication: it is a false interaction in which one party pretends to engage you in a genuine exchange of information when in fact they are simply trying to manipulate you. Real communication happens when both of us are earnestly putting forward propositions and positions, and engaging in a back-and-forth whose aim is to arrive at truth through reason.

What does this have to do with audio? Take the example of advertising in a podcast. Studies show that the most effective podcast ads are those that are spoken by the host about a product that they claim they believe in and use. Audio is about authenticity and truth, and those ads that maintain the integrity of the communication pact — no one is pretending, we are all earnest — are the winners.

An ad spoken by a host who has a following that believes in the integrity of that host will be perceived not as a manipulative attempt to get you to buy something, but rather as an honest communication: “I, the host, need to make a living. I have bills to pay. To pay these bills, I need to advertise products. I could advertise anything, but I am advertising these specific products because I use them, I like them, and I think you too, if you need to use something like this, should use them.” Compare this to an annoying pop-up ad, a blaring 30-second TV commercial, or a five-second YouTube advertisement sequence.

5. The Media Equation

The late Clifford Nass, in the book “The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places,” co-authored with Byron Reeves, proposed that we humans have a visceral tendency to assign human characteristics to the media that we use, especially to computers and other interactive interfaces. We treat these inanimate objects as if they were real social actors: we attribute affective intentionality to them (“The computer likes me”) and apply to them social rules that humans usually reserve for other humans (and perhaps, in a truncated fashion, for animals).

The additional effect is that we build expectations and have reactions that make sense only in the context of human-to-human interactions (or at least interactions with animate, sentient beings). The concept of the media equation is especially pertinent in the context of human-to-voicebot interactions, where the human is engaged not with a traditional computer, where one is typing text or clicking on images and visual representations, or receiving text, images, and sound multimodally, but rather with a conversational Artificial Intelligence that listens to human speech and speaks back with human speech, and observes, as best it can, the rules of human-to-human conversations.

The implications are significant for those who design such interfaces, since they need to factor into their designs the human tendency to anthropomorphize interfaces, at times mitigating that tendency, at other times leaning on it.

Audio and conversational voice are an interesting technological phenomenon because their deployment requires many advanced elements of the technological stack, and yet, both are the oldest and the most natural forms and modalities of communication at the disposal of human beings. As such, since they both touch on issues and concepts that have occupied many disciplines in the humanities and the social sciences, engaging with thinkers, theories, and concepts from those disciplines can help us technologists and innovators not only understand the nature of the changes that we are facing, but also devise effective, grounded strategies for delivering innovations that can live up to those challenges.