Researchers from Carnegie Mellon University, Shanghai Jiao Tong University, and Honda Research Institute have developed the Open Whisper-style Speech Model (OWSM), an open-source effort to make Whisper-style speech recognition training transparent. OWSM reproduces Whisper-style training using publicly available data and open-source toolkits. It aims to improve upon existing models such as Whisper, and the team plans to explore more advanced architectures and to incorporate self-supervised speech representations. They also intend to expand the multitask framework to cover additional speech-processing tasks.
Natural language processing (NLP) has been reshaped by large-scale Transformers trained on massive datasets, which have shown impressive abilities across a wide range of applications, and similar pre-training methods have proven successful in speech processing. Toward universal speech models that can handle multiple speech tasks, OpenAI developed Whisper, a collection of multilingual, multitask models. However, the complete pipeline for building these models is not available to the public, which raises concerns about data leakage, limits understanding of the models' performance, and makes it difficult to address problems related to robustness, fairness, bias, and toxicity.

To promote open science, a research team from Carnegie Mellon University, Shanghai Jiao Tong University, and Honda Research Institute has created the Open Whisper-style Speech Model (OWSM), which replicates Whisper-style training using open-source toolkits and publicly available data. OWSM also introduces technical innovations such as any-to-any speech translation and improved training efficiency. The team plans to provide reproducible recipes, pre-trained models, and training logs so that researchers can inspect the full training procedure and draw insights from it.

While OWSM performs similarly to Whisper, its goal is not to compete but to explore further improvements. The team plans to use more sophisticated architectures, gather more diverse data, and incorporate self-supervised speech representations. They also aim to add other speech-processing tasks on the way to universal speech models.
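For readers who want to try the released pre-trained models, the sketch below shows how an OWSM checkpoint might be loaded for English ASR through ESPnet's Speech2Text inference interface. This is a minimal sketch, not part of the OWSM release itself: the module path, the lang_sym/task_sym keyword arguments, and the placeholder model tag are assumptions that should be checked against the ESPnet documentation and the OWSM model cards.

```python
# Minimal sketch of running English ASR with an OWSM checkpoint.
# Assumptions (not confirmed by the article): ESPnet's espnet2.bin.s2t_inference
# interface, the lang_sym/task_sym keyword names, and the placeholder model tag.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

# Hypothetical placeholder; use an actual OWSM model tag from the release page.
MODEL_TAG = "espnet/owsm-model-tag-placeholder"

speech2text = Speech2Text.from_pretrained(
    MODEL_TAG,
    device="cpu",        # or "cuda" if a GPU is available
    beam_size=5,
    lang_sym="<eng>",    # target language token (format assumed)
    task_sym="<asr>",    # task token selecting ASR rather than translation (format assumed)
)

# Whisper-style models, OWSM included, expect 16 kHz mono audio.
speech, rate = sf.read("sample_16k.wav")
assert rate == 16000, "resample the audio to 16 kHz first"

# The call returns an n-best list; the first element of each hypothesis is
# assumed to be the decoded text, following ESPnet's ASR inference convention.
nbest = speech2text(speech)
print(nbest[0][0])
```

In this sketch, switching the task token from ASR to a translation token is how the same checkpoint would be steered toward speech translation, which reflects the multitask behavior described above.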
Action Items:
1. Research and evaluate the Open Whisper-style Speech Model (OWSM) described in the meeting notes.
2. Identify potential use cases and applications for OWSM in our organization.
3. Assess the feasibility and resource requirements for implementing OWSM in our current speech recognition system.
4. Contact the research team from Carnegie Mellon University, Shanghai Jiao Tong University, and Honda Research Institute to inquire about any available documentation or support for implementing OWSM.
5. Share the information about OWSM with relevant team members and stakeholders for their awareness and input.
6. Monitor the progress of the researchers on OWSM to stay updated on any advancements or improvements.
7. Sign up for the newsletter mentioned in the meeting notes to receive updates on AI research news and projects.