Lessons While Using Generative Language and Audio For Practical Use Cases

Photo by Belinda Fewings / Unsplash

Generative AI makes developers' lives much easier - but by how much?

I have been learning German for the past year, and I thought it would be personally useful to generate many conversations in German as audio to learn from. The audio I created can be found here. Below is a rundown of what I learned while doing this.

Where I went wrong

  1. Generate only when needed; generated output may not always be parseable.
  2. LLMs can't count in certain text formats. This is expected, because LLMs generate text based on probability, but it can still be jarring to see one say 2 + 2 = 5. A calculator will never return that, but with an LLM there is always that chance.
  3. Parsing is annoying. You will often have to edit the generated text by hand, or generate a lot of text in the hope that some of it parses.
  4. You can never be explicit enough; there will probably always be something you miss.
  5. Check the generated text for edge cases.
  6. Write fault-tolerant code. Don't expect an LLM to always work correctly, especially for massive workloads.
  7. Don't make assumptions about what can and cannot be generated without testing them.
  8. All generated output needs to be tested.

Generating the conversational audio I wanted had three major steps:

  1. Generating conversations between a few people with an LLM, across many different themes.
  2. Converting the generated conversation text into audio with Bark.
  3. Repeating this for 100 different conversations.

Generating conversations with an LLM

This itself had two major steps:

  1. Creating a list of characters in the conversation.
  2. Creating a transcript of a conversation between the characters.

Creating a list of characters

I used the following prompt to have my LLM generate the "speakers" who would be talking to each other:

For a conversation (that you will write later), only give me some characters for the conversation, there should be 
a maximum of 3 female speakers and 4 male speakers in the conversation

The conversation happens in Germany, so try to give German names.

Write down all the speakers in the conversation in the format:
```
---
number of female speakers : <num_female_speakers>
number of male speakers : <num_male_speakers>
<name> : <Male/Female>
<name> : <Male/Female>
<name> : <Male/Female>
....
---
```

Lessons from this

  1. Getting structured output from an LLM is hard. It took me a few tries with multiple prompt styles before the LLM gave me a good, mostly parseable output. Even then, for this use case it would have been easier to just ask it for a list of names and randomly select the speakers from that list.

  2. LLMs sometimes can't count. An earlier iteration of this prompt was:

For a conversation (that you will write later), only give me some characters for the conversation, there should be 
a maximum of 3 female speakers and 4 male speakers in the conversation

write down all the speakers in the conversation in the format
```
---
<name> : <Male/Female>
<name> : <Male/Female>
<name> : <Male/Female>
....
---
```

Unless I explicitly asked it to write down how many speakers of each gender it would generate before listing the names and genders, it often produced 4 female speakers even though I had asked for at most 3.
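Since the model's own count can't be trusted, one option is to parse the speaker list and recount it in code. The following is only a rough sketch of that idea, not the code I actually ran: `parse_speakers`, `SPEAKER_LINE`, and the limit parameters are hypothetical names, and it assumes the output follows the `---` delimited format requested in the prompt.

```python
import re

# Matches lines such as "Anna : Female". The "number of ... speakers" lines
# do not match, so the model's self-reported counts are simply ignored.
SPEAKER_LINE = re.compile(
    r"^\s*(?P<name>[^:]+?)\s*:\s*(?P<gender>Male|Female)\s*$", re.IGNORECASE
)


def parse_speakers(raw, max_female=3, max_male=4):
    """Return a list of (name, gender) tuples, or raise ValueError."""
    # Keep only the text between the first pair of '---' delimiters.
    parts = raw.split("---")
    if len(parts) < 3:
        raise ValueError("missing '---' delimiters")

    speakers = []
    for line in parts[1].splitlines():
        match = SPEAKER_LINE.match(line)
        if match:
            speakers.append((match["name"], match["gender"].capitalize()))

    if not speakers:
        raise ValueError("no speakers found")

    # Count the speakers ourselves instead of relying on the model.
    females = sum(1 for _, gender in speakers if gender == "Female")
    males = len(speakers) - females
    if females > max_female or males > max_male:
        raise ValueError(f"too many speakers: {females} female, {males} male")
    return speakers
```

A `ValueError` here is a signal to regenerate the speaker list rather than to continue with a bad one.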

Creating a chat transcript

I used the following prompt to create a chat transcript from the list of speakers:

with the following speakers
{speakers_raw}

write a conversation in the format
```
---
[DE] <speaker name> : <dialogue> 
[EN] <speaker name> : <dialogue>

[DE] <speaker name> : <dialogue>
[EN] <speaker name> : <dialogue>

[DE] <speaker name> : <dialogue>
[EN] <speaker name> : <dialogue>
...
---
```
Ensure the English translation is always in the directly next line, 
and dialogues between two participants have a empty line between them (as shown in the example) where the conversation is first given in german and then English.

Ensure you start and end the main part of the output with 3 minuses (---), as displayed above, which in this case will be the entire conversation.

The conversation should be about '{conversation_theme}'

Ensure the conversation gets into complex themes and narratives, and include a discussions of the problems people face, and what they like about the industry.

{speakers_raw} was replaced with the characters generated in the previous step, and {conversation_theme} with a theme from a list of conversation themes that I asked the LLM to generate.
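The substitution itself can be done with plain string formatting. This is a minimal sketch under some assumptions: `TRANSCRIPT_PROMPT` is an abbreviated stand-in for the full prompt above, and `build_prompt` is a hypothetical helper name.

```python
# Abbreviated stand-in for the prompt above; the full prompt text would go here.
TRANSCRIPT_PROMPT = """with the following speakers
{speakers_raw}

write a conversation in the format
...

The conversation should be about '{conversation_theme}'
"""


def build_prompt(speakers_raw, conversation_theme):
    """Fill the two placeholders in the transcript prompt."""
    return TRANSCRIPT_PROMPT.format(
        speakers_raw=speakers_raw.strip(),
        conversation_theme=conversation_theme,
    )
```

One thing to watch for: `str.format` treats every `{` and `}` as a placeholder, so a prompt containing literal braces would need them doubled (`{{` and `}}`).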

Lessons from this

  1. Parsing is hard
  2. Parsing is hard
  3. Parsing is hard. If you read the prompt above, it has a lot of explicit instructions that I had to keep spelling out for the tool (keep the spacing, start and end with 3 minuses, and so on). Even so, it would sometimes not respect those instructions, and would often not keep the spacing or not start and end in the expected format. I just generated more until something worked.
     • It would often misspell names it had spelled correctly earlier. Things like Johannas would become Hannes for no reason.
     • It would often spell "Emma" as "mma", which was absurd.
  4. You can probably never be explicit enough. Sometimes it would insert things like "alle" (everyone in German) in the audio, which makes sense when you think in terms of training data, but I didn't want that, and I had to rewrite this in order to make it work.
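Problems like a misspelled name or an unexpected "alle" speaker can be caught before any audio is generated by checking every line against the known speakers. Below is a rough sketch of such a check, not my original code: `parse_transcript` and `TURN_LINE` are hypothetical names, and it assumes the `[DE]`/`[EN]` format from the prompt.

```python
import re

# Matches lines such as "[DE] Emma : Hallo!" (format taken from the prompt).
TURN_LINE = re.compile(r"^\[(?P<lang>DE|EN)\]\s*(?P<speaker>[^:]+?)\s*:\s*(?P<text>.+)$")


def parse_transcript(raw, known_speakers):
    """Return a list of (speaker, german, english) tuples, or raise ValueError."""
    lines = [line.strip() for line in raw.splitlines()]
    try:
        start = lines.index("---")
        end = lines.index("---", start + 1)
    except ValueError:
        raise ValueError("missing '---' delimiters") from None

    turns = []
    pending = None  # the [DE] line still waiting for its [EN] translation
    for line in lines[start + 1:end]:
        if not line:
            continue  # blank lines separate dialogue turns
        match = TURN_LINE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        if match["speaker"] not in known_speakers:
            # Catches cases like "mma" instead of "Emma", or "alle".
            raise ValueError(f"unknown speaker: {match['speaker']!r}")
        if match["lang"] == "DE":
            if pending is not None:
                raise ValueError("German line without an English translation")
            pending = match
        else:
            if pending is None or pending["speaker"] != match["speaker"]:
                raise ValueError("English line without a matching German line")
            turns.append((match["speaker"], pending["text"], match["text"]))
            pending = None
    if pending is not None:
        raise ValueError("German line without an English translation")
    return turns
```

The check is deliberately strict: anything that does not fit the format raises an error, so a bad transcript is rejected instead of being turned into audio.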

Converting to audio

I used Bark's conversational code to generate the audio. You can find the code in the bottom part of this notebook: https://github.com/suno-ai/bark/blob/main/notebooks/long_form_generation.ipynb

Lessons from this

  1. Check the generated content. Many of the issues I ran into were only noticed after I had already generated the content and missed the bugs in it. On rented GPUs this wastes compute time, so being more mindful of this would certainly have made my life easier.

  2. Write fault-tolerant code. I later modified my code to do this, but at first the loop that converted transcripts into audio often broke because of parsing issues. I could have saved that time by having fault-tolerant code in the first place, code that regenerated a new transcript whenever an error occurred, or skipped a conversation when it had trouble generating.

  3. Bark's list of speakers only has two female German speakers, so I took an English speaker and assumed the model would be able to make that speaker speak German. It couldn't, which makes sense when you think about the training data: very few speakers from primarily English-speaking countries who also speak German fluently will be present in it. I should have tested this assumption properly.
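The retry-or-skip behaviour described in the second lesson can be sketched roughly as follows. This is an illustration rather than my actual loop: `generate_all`, `make_transcript`, and `make_audio` are hypothetical names, with the two callables standing in for the LLM step and the Bark step.

```python
def generate_all(themes, make_transcript, make_audio, max_attempts=3):
    """Return ({theme: audio}, [themes that failed every attempt])."""
    results, failed = {}, []
    for theme in themes:
        for attempt in range(1, max_attempts + 1):
            try:
                # A fresh transcript is generated on every attempt, so one
                # unparseable output does not stop the whole run.
                transcript = make_transcript(theme)
                results[theme] = make_audio(transcript)
                break
            except Exception as error:
                print(f"{theme!r}: attempt {attempt} failed: {error}")
        else:
            # No attempt succeeded: skip this theme and carry on.
            failed.append(theme)
    return results, failed
```

Returning the failed themes separately makes it possible to inspect or rerun them later instead of losing a long GPU session to a single bad transcript.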

Final lesson

After generating all the audio, I still found that some of it had major issues, ranging from random screams or "tape scratches" to the speaker saying completely unexpected phrases.

Neither the generated text nor the audio was ever 100% reliable, and I needed a means to separate good audio from bad audio. Keeping this in mind before making any assumptions, and constantly checking the audio, would have saved me a lot of time.

I wasn't able to clean up the audio, but I found it good enough for my learning purposes. You can find all the generated audio here: https://german-audio-stuff.dreamymagic.art