Partnering with Defined AI to bridge the data wealth gap

RunPod is dedicated to democratizing access to AI development and bridging the data wealth gap. Alongside Defined.ai, the world’s largest ethical AI training data marketplace, RunPod launched a pilot program to give startups access to enterprise-grade datasets for training SOTA models.

The Genesis of Collaboration

To build SOTA models, startups need access to high-quality data and compute. But these datasets, and the compute required to train on them, are extremely expensive. To bridge the data wealth gap, RunPod, in collaboration with Defined.ai, launched a pilot program so that developers and startups could use enterprise-grade datasets that would otherwise be inaccessible.

RunPod worked closely with Defined.ai to meticulously review each of their datasets to determine which would best align with the requirements of RunPod’s many startup clients. This careful curation process was crucial in setting the stage for effective AI model training and development.

Pilot Program with Theseus AI

As the pilot program began, Defined.ai and RunPod granted selected developers access to one of twelve high-quality conversational speech datasets.

Theseus AI, a startup specializing in Automatic Speech Recognition (ASR) for the financial sector, was among the participants. Their challenge was formidable: developing ASR technology capable of transcribing complex financial terminology with high precision, a critical requirement for compliance and accuracy in financial services.

In the financial services sector, 83% of note-taking software users encounter transcription errors that require time-consuming corrections, and the sector's reliance on exact figures leaves little room for error. Vocabulary varies significantly across sectors, and generic open-source models are not trained on sector-specific data, so Theseus set out to adapt Whisper, an open-source ASR model, to financial-specific audio.

Theseus was looking to fine-tune Whisper on Defined.ai's database of French and English financial-specific audio, which included over 400 hours of transcribed data. Using traditional cloud computing services for this would have been prohibitively expensive.

Enter RunPod

RunPod's cloud platform, equipped with high-performance GPUs, provided a scalable and cost-effective solution for Theseus AI, enabling them to use Defined.ai's specialized datasets hosted on a RunPod Network Volume.

This setup supported the computational needs required to train advanced AI models while simplifying technical overhead, allowing Theseus AI to focus on improving its Whisper model.

"RunPod was a no-brainer for us. Their cloud setup, equipped with top-tier GPUs, was user-friendly and scalable. It felt like an extension of our own technology, eliminating complexities and helping us focus on innovation."

Remarkable Results

The results of this collaboration were nothing short of remarkable. By fine-tuning the Whisper-large-v3 model on Defined.ai's high-quality financial data, Theseus AI saw its Word Error Rate (WER) drop sharply from 18% to 1.7% on the validation dataset.

This dramatic improvement underscores the value of sector-specific data for achieving state-of-the-art results. General models trained on broad datasets often underperform in specialized contexts, and this result suggests that targeted fine-tuning is essential for SOTA-level accuracy.
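
For readers less familiar with the metric, WER is the word-level edit distance between a reference transcript and the model's output (substitutions, deletions, and insertions), divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not Theseus AI's evaluation code: the function names and the sample sentences are made up for this example, and the only normalization applied is lowercasing and whitespace splitting, which a real evaluation would likely extend.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was eliminated."""
    return (before - after) / before


# Hypothetical transcript pair: the output drops the word "of".
reference = "the fund returned eight percent net of fees"
hypothesis = "the fund returned eight percent net fees"
print(word_error_rate(reference, hypothesis))  # 0.125 (1 error / 8 words)

# The reported drop from 18% to 1.7% WER, as a relative reduction.
print(round(relative_reduction(0.18, 0.017), 3))
```

Applied to the figures above, the relative reduction comes out at roughly 0.906, meaning about 90% of the baseline model's word errors on the validation set were eliminated.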

The Impact of Collaboration

This partnership showcased the transformative power of combining premium datasets with accessible and powerful AI compute. By making Defined.ai's datasets freely available and accessible through RunPod, developer teams like Theseus AI can compete effectively in the AI arena and advance the frontiers of technology.

The success of this pilot program and RunPod’s ongoing partnership with Defined.ai are pivotal steps toward bridging the data wealth gap. By providing access to essential tools and data, RunPod ensures that AI innovation is accelerated and democratized, enabling a future where AI development is limited only by imagination, not resources.

You too can deploy a GPU pod on RunPod and train your model using the readily available PyTorch template. Get started with one click by following this link and selecting your GPU. You can also visit RunPod.io to see how RunPod's compute resources can help you train and scale your AI models efficiently.