So.. I’m a nerd.
I’ve seen a number of sources state that ____ words account for __% of spoken language. These figures vary from source to source, but I’m presuming they come from Zipf’s Law. Of course it’s going to be more complex than this, but it’s intriguing nonetheless.
So I decided to do a bit of analysis on Welsh myself using the past 5 weeks of Pobol Y Cwm episodes as a basis for the data.
Here is the process I followed:
- Took the past 5 weeks of Pobol Y Cwm episode scripts.
- Counted how often each unique word was spoken, and how many words are needed to cover each percentage of the dialogue.
- Plotted the data on a graph.
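For anyone who wants to try something similar, here is a minimal Python sketch of the counting step. It is not the exact script used for this post: the names `tokenize` and `words_for_coverage` are made up for illustration, and the tokenising rule (lowercase everything, keep apostrophes attached to the word) is an assumption about how the scripts would be split.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens.

    Assumption: a leading or internal apostrophe stays part of the word,
    so "gyda" and "'da" end up as two different tokens. A word wrapped in
    single quotes would keep its opening quote, which is a known flaw.
    """
    return re.findall(r"['’]?\w+(?:['’]\w+)*", text.lower())


def words_for_coverage(counts, target):
    """Return how many of the most frequent words are needed so that
    together they make up at least `target` (0 to 1) of all tokens."""
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running / total >= target:
            return rank
    return len(counts)


# Tiny made-up sample, not real script data.
sample = "i i i i a a yn 'da"
counts = Counter(tokenize(sample))
print(len(counts), "unique words")
print(words_for_coverage(counts, 0.75), "words cover 75% of the sample")
```

Running `words_for_coverage` for every target from 1% to 100% gives the points for a cumulative graph.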
Here’s a basic overview of my findings:
- In all 15 episodes, there were 3,571 unique words.
- 1,090 words cover 90% of the dialogue.
- And just 468 words cover 80% of the dialogue.
- ‘i’ was the most common word, used 1,573 times (4.76% of all words spoken).
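As a rough way to compare these figures with Zipf’s Law, the sketch below works out how many words an idealised Zipf distribution would need for the same coverage. It assumes the textbook form of the law, where the word at rank r has a frequency proportional to 1/r (an exponent of 1), which real language only follows approximately. The function name `zipf_words_for_coverage` is made up for illustration; the only number taken from the findings is the 3,571 unique words.

```python
def zipf_words_for_coverage(vocab_size, target, exponent=1.0):
    """Number of top-ranked words needed to reach `target` (0 to 1)
    coverage if frequencies followed an ideal Zipf distribution.

    Assumption: the word at rank r has weight 1 / r**exponent.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    running = 0.0
    for rank, weight in enumerate(weights, start=1):
        running += weight
        if running / total >= target:
            return rank
    return vocab_size


VOCAB_SIZE = 3571  # unique words found in the episodes

for target in (0.8, 0.9):
    needed = zipf_words_for_coverage(VOCAB_SIZE, target)
    print(f"Ideal Zipf: {needed} words for {target:.0%} coverage")
```

Any gap between these numbers and the measured ones is expected, since the exponent is a guess and the vocabulary here is limited to one programme.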
I think these figures give a really interesting insight into how few words you really need to know to start getting to grips with conversations, as well as how difficult “fluency” is to define.
Graph: Cumulative Word Usage in ‘Pobol Y Cwm’
Disclaimers and Inaccuracies:
For the sake of simplicity, variations, contractions, conjugations and mutations are all considered separate words, e.g. gyda and 'da are counted separately.
Similarly, words that can have different meanings based on context, for example ceir (plural of car / future of cael), are considered one word. This isn’t ideal, but I unfortunately don’t have the expertise to do much else.
There will therefore be small inaccuracies in this data, but it shouldn’t be too many points off.
Anyway… Not sure if/how this would be useful to anyone, but I think it’s quite interesting.