Welsh Word Audio Clips

Hi Everyone,

I wanted to share a project I’ve been working on that I think could be really useful for anyone building their own Anki decks or flashcard sets, or just looking to improve their vocabulary and pronunciation.

What is it?

I have generated audio clips for over 17,000 of the most frequently used words in the Welsh language, based on the CorCenCC (National Corpus of Contemporary Welsh) written corpus.

** Please note that these are all lemmas (base dictionary words) rather than every possible conjugated or mutated form. Also, interjections have been removed (“hmm”, “ymm”, and the like). **

Why? Because producing every single surface form (tenses, mutations, persons) would have turned 17,000ish clips into hundreds of thousands. That would have been a technical nightmare to generate and impossible to quality-check in any meaningful way. Sticking to lemmas keeps the collection high-quality and manageable.

How was it made?

I used the best Welsh text-to-speech engine I could find to generate the clips. You will notice they are all in a North Wales Male voice. I chose this specific voice because, after a lot of testing, it was the most natural-sounding one available. Due to technical limitations, I stuck to this single high-quality voice rather than mixing different ones.

I have spent many, many hours refining these clips and quality-checking the results to ensure the files are as clean, authentic and accurate as possible.

How do I get my hands on them?

You can download via this Google Drive link:

Welsh Project Google Drive Link

  • Pick and choose: You can browse the “AudioClips” folder and download the words which you require.
  • The Google Drive also contains a premade “Top 1000 Written Welsh Lemmas” Anki Deck inside the “AnkiDecks” folder. (I plan to create more Anki Decks in the future.)

Possible use cases:

  • Flashcards (Anki, Mnemosyne, SuperMemo, Quizlet, etc.): This is the main use case. If you use Anki or any other flashcard or spaced-repetition app that allows you to upload your own media files, you can import these audio clips to give your cards a voice.
  • Pronunciation checking: If you see a word written down and aren’t sure of the pronunciation, you can search this folder to hear it instantly.
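For the flashcard use case, here is a minimal Python sketch of how an import file for Anki could be put together. It relies on Anki's `[sound:...]` field syntax for attaching audio. Everything else is an assumption made for illustration: the function names `build_anki_rows` and `to_tsv`, the three-column layout, and the idea that each clip is named exactly `<word>.wav`.

```python
import csv
import io


def build_anki_rows(words):
    """Turn (welsh, english) pairs into (welsh, english, audio) rows."""
    rows = []
    for welsh, english in words:
        # Assumption: the clip for a word is named "<word>.wav".
        rows.append((welsh, english, f"[sound:{welsh}.wav]"))
    return rows


def to_tsv(rows):
    """Serialise the rows as tab-separated text, one note per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Two example words; swap in your own vocabulary list.
example = [("ci", "dog"), ("cath", "cat")]
tsv_text = to_tsv(build_anki_rows(example))
print(tsv_text)
```

The resulting text can be saved to a file and imported into Anki as a tab-separated file. Anki only plays a clip if the matching .wav file has also been copied into the profile's `collection.media` folder.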

Strengths & Weaknesses?

  • The Good: It’s a massive resource. If you are looking for the pronunciation of a specific word, it is almost certainly in here. It covers the vast majority of vocabulary you will encounter daily.
  • The automatically generated nature: While I have put a lot of time into filtering out bad files, this was still an automated process involving thousands of clips. There may still be the occasional “dud” or robotic pronunciation that slipped through the net.

** Note on Filenames: To ensure the files work on all computers, some special characters have been replaced with underscores (e.g. you might see i_r.wav instead of i'r.wav). The audio itself is correct! **
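If you want to work out a filename from a word in a script, something like the following Python sketch may help. The only replacement confirmed above is the apostrophe becoming an underscore. The function name `guess_clip_filename` is hypothetical, and treating the curly apostrophe the same way (while leaving every other character untouched) is an assumption, so check the result against the actual folder.

```python
def guess_clip_filename(word, extension=".wav"):
    """Guess the clip filename for a word.

    Assumption: only apostrophes (straight and curly) are replaced with
    underscores. The real collection may replace other characters too.
    """
    replaced = {"'": "_", "\u2019": "_"}
    safe = "".join(replaced.get(ch, ch) for ch in word)
    return safe + extension


print(guess_clip_filename("i'r"))  # prints: i_r.wav
```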

Request for Feedback . . .

If you find any clips that are broken, silent, or just sound wrong, please let me know in this thread. I can easily regenerate specific words, so I’m happy to fix them and improve the collection for everyone. Also, if the Anki Deck has mistakes, let me know.

Download Link:

Welsh Project Google Drive Link

The Premade Anki Deck.

The Top 1000 Written Welsh Lemmas based on the CorCenCC collection.

This Deck has 7 fields:

  1. Rank
  2. Welsh Word
  3. English Meaning
  4. Part of Speech
  5. Audio (automatically pulled from the Anki2 collections folder)
  6. Welsh Sentence (An example sentence, only shown when the Welsh word is shown)
  7. English Sentence (The English translation of the Welsh Sentence, only shown when the English word is shown)

** The deck contains HTML and CSS formatting **

Mwynhewch 🙂

P.S. I may update and improve the Anki Deck(s) and the audio clip collection from time to time, so if you can’t see the files on the Google Drive or the drive isn’t available, I am probably in the process of uploading better versions.

That sounds like an incredible lot of work! Amazing!

I just applied the same obsessive insanity I do with all my little projects lol. It was indeed a looooot of work, but if it helps even one person in the Welsh language learning community, I consider it to be a success.

I’m sure it will be of help! How about giving a step-by-step example of how to use it?

One word that I’m often asked how to pronounce is the word for grandson - ŵyr. Could you explain, using that word, how someone can use the files? That would be a useful guide for anyone keen to use them, but needing some help to get started.

Hi Deborah,

A good use case would be in creating “Flashcard Sets” and/or “Anki Decks” with the .wav audio files integrated (the flashcards could play the audio).

If anyone is interested, I have created a “Top 1000 Written Welsh Words” Anki Deck using the .wav audio files contained within the collection.

A simpler use case would be to keep the collection as your own personal offline database. The total folder size is less than 1 GB (about 700 MB after unzipping, if I remember correctly).
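As a rough illustration of that offline lookup, using ŵyr as the example word, here is a small Python sketch. It is not part of the collection: the function names `normalise` and `find_clip` are made up for this example, and it assumes the clips are .wav files somewhere under one local folder. Both the search word and the filenames are normalised before comparing, so an apostrophe matches an underscore and the two Unicode spellings of "ŵ" match each other.

```python
import tempfile
import unicodedata
from pathlib import Path


def normalise(name):
    """Make words and file stems comparable."""
    # NFC makes "ŵ" typed as w + combining circumflex equal to the single letter.
    text = unicodedata.normalize("NFC", name).casefold()
    # Assumption: apostrophes appear as underscores in the filenames.
    for ch in ("'", "\u2019"):
        text = text.replace(ch, "_")
    return text


def find_clip(folder, word):
    """Return the path of the first matching .wav clip, or None."""
    wanted = normalise(word)
    for path in sorted(Path(folder).rglob("*.wav")):
        if normalise(path.stem) == wanted:
            return path
    return None


# Demo on a throwaway folder holding two empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ŵyr.wav").touch()
    (Path(tmp) / "i_r.wav").touch()
    found_wyr = find_clip(tmp, "w\u0302yr").name
    found_ir = find_clip(tmp, "i'r").name
    missing = find_clip(tmp, "xyz")
```

In real use, `folder` would be wherever you unzipped the download, and the returned path can be opened in any audio player.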

I have put instructions in the original post on how to navigate, use and download from the Google Drive where this collection is currently stored.

I hope this helps.

P.S. Let me know if you or anyone is interested in the Anki Deck I have created (and use myself).

Incredible work Christopher. Thank you for sharing!

Diolch yn fawr!

I can’t get the Google Drive link to work. Is it still active?

Sorry Anders, I don’t use the SSiW forum too much these days. I recently updated the collection(s), the structure of the Google Drive (this is why the link wasn’t working), and the wording of the original post. Please have a read of the post and click the link. It should all be working now.