Welsh Voice Clips

Hi everyone,

I wanted to share a project I’ve been working on that I think could be really useful for anyone building their own Anki decks or flashcard sets, or just looking to improve their vocabulary and pronunciation.

What is it?

I have generated audio clips for the 17,628 most frequently used words in the Welsh language, based on the CorCenCC (National Corpus of Contemporary Welsh) written corpus.

Important Note: Base Words Only (Lemmas)

Please note that these are all lemmas (base dictionary words) rather than every possible conjugated or mutated form.

Why? Producing every single surface form (tenses, mutations, persons) would have turned 17,000ish clips into hundreds of thousands. That would have been a technical nightmare to generate and impossible to quality-check in any meaningful way. Sticking to lemmas keeps the collection high-quality and manageable.

How it was made:

I used the best Welsh text-to-speech engine I could find to generate the clips. You will notice they are all in a North Wales Male voice. I chose this specific voice because, after a lot of testing, it was the most natural-sounding one available. Due to technical limitations, I stuck to this single high-quality voice rather than mixing different ones.

I have spent many, many hours refining these clips and quality-checking the results to ensure the files are as clean and accurate as possible.

How to get them:

You have two options via this Google Drive link: WelshVoiceProjectMain – Google Drive

  1. Download the whole collection: If you want a complete offline library, grab the “Welsh_Voice_Lemmas_Zipped.tar.gz” archive, which contains the entire folder (~17,600 files).
  2. Pick and choose: Browse the “WelshVoicesProjectFiles” folder and download just the individual words you need.

The Drive also contains 2 additional files and 1 folder with some of the source information used to generate the clips.
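
If you go for the full download and would rather unpack it with a script than a desktop tool, here is a minimal Python sketch. The function name and the destination folder are my own placeholders, and I am assuming nothing about the folder layout inside the archive beyond it containing .wav files.

```python
import tarfile
from pathlib import Path


def extract_clips(archive_path, dest_dir):
    """Unpack a .tar.gz archive and return the .wav files found inside it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Newer Python versions offer a safer "data" extraction filter;
        # fall back to the plain call where it is not available.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    # Search recursively, since the layout inside the archive is not assumed.
    return sorted(dest.rglob("*.wav"))


if __name__ == "__main__":
    clips = extract_clips("Welsh_Voice_Lemmas_Zipped.tar.gz", "welsh_clips")
    print(f"Extracted {len(clips)} clips")
```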

Possible use cases:

  • Flashcards (Anki, Mnemosyne, SuperMemo, Quizlet, etc.): This is the main use case. If you use Anki or any other flashcard or spaced-repetition app that allows you to upload your own media files, you can import these audios to give your cards a voice.
  • Pronunciation checking: If you see a word written down and aren’t sure of the pronunciation, you can search this folder to hear it instantly.
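
For the flashcard use case, here is a rough Python sketch of how you could build a tab-separated file for Anki's text import, pairing each word with a sound tag. The function name and paths are hypothetical, and it assumes each clip is stored as the word plus ".wav", which will not hold for words whose special characters were swapped for underscores. Anki plays a clip when a field contains [sound:filename], and the .wav files themselves need to be copied into your profile's collection.media folder.

```python
import csv
from pathlib import Path


def write_anki_import(words, clips_dir, out_path):
    """Write 'word<TAB>[sound:file]' rows for every word that has a clip.

    Returns the number of rows written. Words without a matching
    clip are skipped rather than given a broken sound tag.
    """
    clips_dir = Path(clips_dir)
    rows = []
    for word in words:
        clip = clips_dir / f"{word}.wav"  # assumed naming scheme
        if clip.is_file():
            rows.append((word, f"[sound:{clip.name}]"))
    with open(out_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    count = write_anki_import(["ŵyr", "cath", "ci"], "WelshVoicesProjectFiles", "welsh_import.txt")
    print(f"Wrote {count} cards")
```

You would still add the meaning (or anything else you want on the card) as extra columns before importing.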

Strengths & Weaknesses:

  • The Good: It’s a massive resource. If you are looking for the pronunciation of a specific word, its base form is almost certainly in here. It covers the vast majority of vocabulary you will encounter daily.
  • The “Beta” Nature: While I have put a lot of time into filtering out bad files, this was still an automated process involving thousands of clips. There may still be the occasional “dud” or robotic pronunciation that slipped through the net.
  • Note on Filenames: To ensure the files work on all computers, some special characters have been replaced with underscores (e.g. you might see i_r.wav instead of i'r.wav). The audio itself is correct!
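
If you are looking clips up from a script, a small helper can guess the stored filename for a word. This is only a sketch: the exact set of replaced characters is my assumption here (apostrophes plus the characters Windows does not allow in filenames), and the helper name is made up for illustration. The only mapping confirmed above is i'r becoming i_r.wav.

```python
# Assumed set of replaced characters: straight and curly apostrophes plus the
# characters Windows forbids in filenames. The real list may differ.
UNSAFE_CHARS = "<>:\"/\\|?*'’"
_REPLACEMENTS = str.maketrans({char: "_" for char in UNSAFE_CHARS})


def clip_filename(word, extension=".wav"):
    """Return the filename a word is probably stored under."""
    return word.strip().translate(_REPLACEMENTS) + extension


if __name__ == "__main__":
    print(clip_filename("i'r"))
    print(clip_filename("ŵyr"))
```

Accented letters such as ŵ are left untouched in this sketch; if a lookup fails, it is worth browsing the folder to see how that word was actually named.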

Request for Feedback:

If you find any clips that are broken, silent, or just sound wrong, please let me know in this thread. I can easily regenerate specific words, so I’m happy to fix them and improve the collection for everyone.

Download Link: see the Google Drive link under “How to get them” above.

Pob lwc with the learning!


Technical Specifications (for the fellow nerds)

For those interested in the underlying data and how this was built, here is a breakdown of the resources and tech stack used:

  1. Source Data (The Word List): The vocabulary list is derived from Yr Amliadur: Frequency Lists for Contemporary Welsh (Version 1.0.0). This dataset is part of the CorCenCC project (National Corpus of Contemporary Welsh), which provides frequency counts based on a massive collection of written Welsh.
  2. Audio Engine (The Voice): The audio was generated using the open-source Welsh Text-to-Speech API provided by Techiaith (Canolfan Bedwyr, Bangor University).
  • Engine: Techiaith TTS API (Orpheus/Macsen)
  • Voice Used: Gwryw Gogleddol (North Wales Male)
  • Source: Techiaith TTS
  3. The Tech Stack (The Script): I wrote a custom Python script to automate the downloading and validation process.
  • Data Processing: pandas was used to clean and iterate through the CorCenCC frequency spreadsheets.
  • API Interaction: requests handled the retrieval of .wav files from the Techiaith server.
  • Quality Control (Audio Validation): To ensure the files weren’t empty or corrupt, the script used the wave and audioop libraries.
    • Sanitization: Filenames were scrubbed of illegal characters.
    • “Zombie” Check: Verified file headers (RIFF) to prevent corrupt downloads.
    • Silence Detection: Analyzed RMS amplitude to reject files that were silent or too quiet.
    • Duration Check: Automatically rejected clips under 0.5 seconds.
    • File Size Check: Checked each file’s size against the word’s letter count.
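
For anyone curious what the header, duration, and silence checks can look like, here is a simplified sketch rather than the actual script. It assumes 16-bit PCM clips, the RMS threshold is a made-up value, and the RMS is computed by hand instead of with audioop, because audioop was removed from the standard library in Python 3.13. The file size check is left out.

```python
import array
import math
import sys
import wave

MIN_SECONDS = 0.5  # duration floor mentioned above
MIN_RMS = 100.0    # hypothetical silence threshold, not the project's value


def check_clip(path):
    """Return a list of problems found in a .wav file (empty if it passes)."""
    # Header check: a WAV file starts with "RIFF", a size field, then "WAVE".
    with open(path, "rb") as handle:
        header = handle.read(12)
    if header[:4] != b"RIFF" or header[8:12] != b"WAVE":
        return ["bad RIFF header"]

    try:
        with wave.open(str(path), "rb") as clip:
            frame_rate = clip.getframerate()
            sample_width = clip.getsampwidth()
            frame_count = clip.getnframes()
            raw = clip.readframes(frame_count)
    except (wave.Error, EOFError):
        return ["unreadable audio data"]

    problems = []
    if frame_rate == 0 or frame_count / frame_rate < MIN_SECONDS:
        problems.append("too short")

    if sample_width != 2:
        problems.append("not 16-bit PCM, loudness not checked")
        return problems

    # Interpret the bytes as signed 16-bit samples (WAV data is little-endian).
    raw = raw[: len(raw) - len(raw) % 2]
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()

    # Root mean square amplitude across all samples.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0
    if rms < MIN_RMS:
        problems.append("silent or too quiet")
    return problems
```

A clip that returns an empty list passed every check; anything else would be a candidate for regenerating.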

That sounds like an incredible lot of work! Amazing!

I just applied the same obsessive insanity I do with all my little projects lol. It was indeed a looooot of work, but if it helps even one person in the Welsh language learning community, I consider it to be a success.

I’m sure it will be of help! How about giving a step-by-step example of how to use it?

One word that I’m often asked how to pronounce is the word for grandson - ŵyr. Could you explain, using that word, how someone can use the files? That would be a useful guide for anyone keen to use them, but needing some help to get started.

Hi Deborah,

A good use case would be creating “Flashcard Sets” and/or “Anki Decks” with the .wav audio files integrated (the flashcards could play the audio).

If anyone is interested, I have created a “Top 1000 Written Welsh Words” Anki Deck using the .wav audio files contained within the collection.

A simpler use case would be to keep the collection as your own personal offline database. The total folder size is less than 1 GB (about 700 MB after unzipping, if I remember correctly).

I have put instructions in the original post on how to navigate and download from the Google Drive folder where the collection is currently stored.

I hope this helps.

P.S. Let me know if you or anyone is interested in the Anki Deck I have created (and use myself).

Incredible work Christopher. Thank you for sharing!

Diolch yn fawr!