Speech-to-Speech Model Comparison

Welcome to the Speech-to-Speech (S2S) Model Evaluation! 👏 In this evaluation, you will assess the performance of different S2S models, such as ChatGPT-4o, FunAudioLLM, SpeechGPT, Mini-Omni, Cascade, and LLaMA-Omni.
🎯 Goal: Test how well these models handle speech tasks across different domains.

How It Works

Once you select a specific domain and task (e.g., Educational Tutoring and Rhythm Control), you will proceed to the evaluation stage. In each round, you will be presented with an audio input.
🌰 Example:

Audio Sample:

The corresponding text is: "Say the following sentence at my speed first, then say it again very slowly: 'Artificial intelligence is changing the world in many ways.'" 🧠 (Note: the audio plays at 1.5x the normal speed.)

Model Performance

ChatGPT-4o:

🎙️ Speech: Partially followed the instruction on speed.

🧾 Semantics: Accurately followed the instruction, with no semantic deviation or missing information.

FunAudioLLM:

🎙️ Speech: Partially followed the instruction on speed.

🧾 Semantics: Accurately followed the instruction, with no semantic deviation or missing information.

SpeechGPT:

🎙️ Speech: Did not follow the instruction on speed.

🧾 Semantics: Partially followed the instruction, with minor semantic deviation and missing information.

Mini-Omni:

🎙️ Speech: Did not follow the instruction on speed.

🧾 Semantics: Did not follow the instruction, with significant semantic deviation and missing information.

After making your choice, you'll proceed to the next round. 🔄

Click the button below to start the evaluation! 🚀
