ai · speech-recognition · nepali · whisper · machine-learning

Reward-Guided Fine-Tuning for Nepali Speech Recognition

Improving Nepali speech recognition using reward-guided fine-tuning of Whisper. This approach filters noisy training data using human feedback to achieve better transcription accuracy.

February 24, 2026
5 min read

Aadarsh Pandit

AI & Full Stack Developer


If you've ever used voice-to-text in Nepali, you already know the pain. The transcriptions are often wrong in ways that feel almost random: words get mangled, sentences lose their meaning, and you end up correcting more than you would have typed.

We wanted to fix that, at least a little. So we spent a few months experimenting with a method we're calling reward-guided fine-tuning, and honestly, the results surprised us.


The root of the problem: garbage data

Most Nepali speech datasets out there are scraped from YouTube subtitles. It sounds like a great source: tons of content, lots of variety. But the quality is all over the place.

Some subtitles are accurate. Some have typos. Some are completely wrong and don't even match what's being said in the video. The problem is that when you train a model on all of this equally, it doesn't know the difference. It learns the good patterns and the bad ones together.

The model essentially memorizes mistakes and repeats them.


Our idea: just... stop training on bad examples

The fix sounds almost too simple. Instead of feeding the model all the data and hoping for the best, what if we taught it to only learn from the good examples?

That's the core idea. We called the pipeline reward-guided filtering, and it works in a loop:

  1. Fine-tune Whisper on all the available data
  2. Let the model generate transcriptions
  3. Score how good those transcriptions are
  4. Remove the worst-scoring samples from the training set
  5. Retrain on what's left
  6. Repeat

Each iteration, the training data gets a little cleaner. The model gets a little better. Rinse and repeat.
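A minimal sketch of that loop, where `train_fn` and `score_fn` are hypothetical stand-ins for the actual Whisper fine-tuning and reward-scoring steps (nothing here is our real training code):

```python
def reward_guided_filtering(dataset, train_fn, score_fn,
                            iterations=2, drop_fraction=0.2):
    """Iteratively retrain, score every sample, and drop the worst.

    dataset:  list of training samples
    train_fn: samples -> model       (fine-tune on the current data)
    score_fn: (model, sample) -> float, higher = better transcription
    """
    model = None
    for _ in range(iterations):
        model = train_fn(dataset)                 # steps 1-2: fine-tune, transcribe
        ranked = sorted(dataset, key=lambda s: score_fn(model, s))  # step 3: score
        keep = int(len(ranked) * (1 - drop_fraction))
        dataset = ranked[-keep:]                  # step 4: drop the worst-scoring
    return model, dataset                         # steps 5-6 happen via the loop
```

With `drop_fraction=0.2`, each pass discards the bottom 20% of the training set before retraining; the real threshold is a tuning knob.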


Building the "reward model"

The trickiest part was step 3: figuring out what makes a transcription good.

We didn't want to rely purely on automated metrics, so we collected about 2,000 transcription samples and had humans label each one as good, neutral, or bad. Then we trained a small classifier on those labels.

The classifier uses four features: Word Error Rate (WER), Character Error Rate (CER), the ratio of transcription length to reference length, and the raw length difference. Nothing fancy. We deliberately kept it lightweight.
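All four features are cheap to compute from scratch. A self-contained sketch (this is illustrative, not our actual codebase):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def transcription_features(reference, hypothesis):
    """The four features: WER, CER, length ratio, raw length difference."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    wer = edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
    cer = edit_distance(reference, hypothesis) / max(len(reference), 1)
    ratio = len(hyp_words) / max(len(ref_words), 1)
    diff = abs(len(hyp_words) - len(ref_words))
    return [wer, cer, ratio, diff]
```

In practice a library like `jiwer` gives you WER/CER directly, but the hand-rolled version makes the features transparent.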

It ended up being about 81% accurate at predicting transcription quality, which felt like a solid foundation.
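The classifier itself can be tiny. Here's a hand-rolled logistic-regression sketch over four-feature vectors, trained on made-up toy labels purely to show the shape of the idea (our actual classifier and labels differ):

```python
import math

def train_quality_classifier(features, labels, epochs=200, lr=0.1):
    """Tiny logistic regression: learns P(good) from the 4 features via SGD."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y                     # gradient of log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict_good(w, b, x):
    """True if the model thinks the transcription is 'good'."""
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b))) > 0.5

# Toy data: [WER, CER, length ratio, length diff]; 1 = good, 0 = bad
X = [[0.0, 0.0, 1.0, 0], [0.1, 0.05, 1.0, 0],
     [0.9, 0.8, 0.5, 3], [1.0, 0.9, 2.0, 4]]
y = [1, 1, 0, 0]
w, b = train_quality_classifier(X, y)
```

In reality you'd reach for `sklearn` rather than hand-rolling this, but the point stands: four features and a linear model are enough to get a usable reward signal.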


The dataset

We worked with a mix of Nepali audio content: news broadcasts, podcasts, drama serials, and documentaries. In total, 70+ hours of audio, broken into roughly 50,000 segments averaging around 6 seconds each, all sampled at 16 kHz.

Not a huge dataset by modern standards, but representative of the kind of real-world audio you'd encounter.
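For illustration, segment metadata for a corpus like this might be represented as follows; the `Segment` layout is a hypothetical sketch, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    audio_path: str       # path to one 16 kHz audio clip
    transcript: str       # the (possibly noisy) subtitle text
    duration_sec: float

def summarize(segments):
    """Total hours and mean segment length for a list of Segments."""
    total = sum(s.duration_sec for s in segments)
    return {"hours": total / 3600, "avg_sec": total / len(segments)}
```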


Did it actually work?

Yes, more than we expected.

| Model             | WER   | CER   |
|-------------------|-------|-------|
| Base Whisper      | 8.10% | 7.40% |
| Fine-tuned        | 5.55% | 5.04% |
| After iteration 1 | 5.12% | 4.71% |
| After iteration 2 | 4.89% | 4.52% |

Standard fine-tuning already helped quite a bit. But the filtering pushed it further, cutting WER by another 11–12% relative to fine-tuning alone.
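That relative gain is easy to verify from the table:

```python
def relative_improvement(before, after):
    """Relative reduction (e.g. in WER), as a fraction of the starting value."""
    return (before - after) / before

# fine-tuned (5.55% WER) vs. after iteration 2 (4.89% WER)
print(round(relative_improvement(5.55, 4.89) * 100, 1))  # ~11.9% relative WER reduction
```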

Here's a concrete example of what that looks like in practice:

Reference: नेपालको राजधानी काठमाडौं हो।
Before filtering: नेपाल को रजधान काटमाण्डु हो।
After filtering: नेपालको राजधानी काठमाडौं हो।

The before version isn't completely wrong, but the errors are the kind that would frustrate a real user. The after version is a perfect match.


What we liked about this approach

The thing we appreciated most was how accessible it is. You don't need full RLHF, massive compute, or a huge annotation team. A relatively small set of human labels, a simple classifier, and a few retraining iterations got us meaningful gains.

It's also a pattern that could generalize. Any low-resource language suffering from noisy web-scraped data could potentially benefit from the same idea.


Where it falls short

We're not going to oversell this. There are real limitations.

Getting even 2,000 clean human annotations takes time and effort. Our feature set is quite simple; a more sophisticated reward model would probably do better. And each retraining iteration has a computational cost that adds up, which matters if you're working with limited resources.

There's also the question of how well this scales. We only ran two iterations here. It's unclear where the gains plateau.


Where we want to take this

A few directions feel promising from here. Training on more annotated data is the obvious one. We're also curious whether a neural reward model, something that actually reads the text rather than just measuring error rates, would catch mistakes our current classifier misses.

Longer term, it'd be interesting to see how well this approach transfers to other low-resource languages facing the same data quality problems.


The takeaway

Clean data beats more data. That's not a new idea, but this project was a good reminder of how much it matters in practice, especially when you can't just throw more compute at the problem.

For low-resource languages like Nepali, where the data pipeline is messy almost by definition, having a principled way to filter out noise before retraining seems worth the effort. The results here suggest it is.

If you're working on something similar and want to compare notes, we'd love to hear from you.
