How 468 Words Cover 80% of Pobol Y Cwm Dialogue

So… I’m a nerd.

I’ve seen a number of sources state that ____ words account for __% of spoken language. The figures vary from source to source, but I’m presuming they come from Zipf’s Law. Of course it’s going to be more complex than this, but it’s intriguing nonetheless.
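For anyone curious what Zipf’s Law on its own would predict, here is a minimal sketch. It assumes an idealised case where a word’s frequency is exactly proportional to 1 / rank (the exponent `s=1.0` is an assumption, and real speech won’t follow it exactly). `zipf_coverage` is a made-up helper name, and 3,571 is the unique-word count reported further down in this post.

```python
def zipf_coverage(top_k, vocab_size, s=1.0):
    """Share of all word uses covered by the top_k most common words,
    if frequency were exactly proportional to 1 / rank**s."""
    weights = [1 / rank**s for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_k]) / sum(weights)

# 3,571 is the unique-word count from this thread; s=1.0 is an assumption.
for k in (100, 468, 1090):
    print(k, round(zipf_coverage(k, 3571), 3))
```

This is only a theoretical yardstick to hold the measured figures against, not a substitute for counting real dialogue.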

So I decided to do a bit of analysis on Welsh myself using the past 5 weeks of Pobol Y Cwm episodes as a basis for the data.

Here is the process I followed:

  1. I took the past 5 weeks of Pobol Y Cwm episode scripts
  2. Analysed the frequency of each unique word spoken and the number of words in each percentile.
  3. Plotted the data on a graph

Here’s a basic overview of my findings:

  • In all 15 episodes, there were 3,571 unique words.
  • 1,090 words cover 90% of the dialogue.
  • And just 468 words cover 80% of the dialogue.
  • ‘i’ was the most common word, being used 1,573 times (4.76% of all words spoken)

I think this offers a really interesting insight into how little you really need to know to start getting to grips with conversations, as well as how difficult “fluency” is to define.

Graph: Cumulative Word Usage in ‘Pobol Y Cwm’

Disclaimers and Inaccuracies:

For the sake of simplicity, variations, contractions, conjugations and mutations are all considered separate words, e.g. gyda and 'da are counted separately.

Similarly, words that can have different meanings based on context, for example ceir (plural of car / future of cael), are considered one word. This isn’t ideal, but I unfortunately don’t have the expertise to do much else.

There will therefore be small inaccuracies in this data, but it shouldn’t be too many percentage points off.

Anyway… Not sure if/how this would be useful to anyone, but I think it’s quite interesting.

14 Likes

I think you have expertise!

I’m not sure I entirely understand, but it makes me feel better. I was wondering about this topic just last week, never dreaming that someone might be similarly intrigued.

I’m going back to read it again.

Diolch

3 Likes

I’m even more impressed :hugs:

2 Likes

Brilliant! You’re in the right place. I was also wondering about this “highest frequency words” stuff just recently - I think I saw a post somewhere about a language learning “hack” :roll_eyes: where someone had worked out the 600 most useful words to learn regardless of the language… and for a very reasonable price I could find out more! LOL.

There are many flavours of nerd here and we’re all welcome.

3 Likes

This is great analysis - thanks for sharing.

Like you say it makes learning a language feel relatively accessible, and mastery a never-ending pursuit!

3 Likes

I’m not sure how many unique words I’ve learned, and personally would find it a bit tedious to list them out and count them… but based on how much of Pobol y Cwm I can understand without subtitles, or with Cymraeg subtitles, I’m thinking it may be in the region of 468! :smiley:

1 Like

Wow that’s crazy! Definitely no need for that aha, I’m sure most of them are probably within SSIW Level 1-3 and Dysgu Cymraeg ‘Mynediad’, although I’ve not looked in detail.

1 Like

Was it Gabriel Wyner and his 625 words? I’ve read his book and although I disagree with some of what he says, I don’t think he’s a charlatan and he has some interesting ideas.

Obviously 625 shouldn’t be taken too seriously as it’s a bit arbitrary where you choose to draw the line and also because different cultures really ought to have different lists; but it’s not a terrible rough guideline IMHO.

2 Likes

Brilliant post, mate. Any chance of you doing Peppa Pig next ha ha?

2 Likes

Already on it :wink: ahah

2 Likes

I’m sure you’re right, I have learned quite a few languages so far and have come across word lists/booklets composed with a view to a basic vocabulary, based on research like you did - i.e., what are the words that occur most frequently. It gives you a very good start, so this is certainly very useful!

2 Likes

I don’t remember. I think a lot of people jump on bandwagons so it could have been someone else selling his basic idea. Anyway, I have all the resources I feel I need for now.

2 Likes

I love this, it’s great!

I think it’s also good to remember that we probably know more words than we think we do. I got 88 o Hoff Ansoddeiriau Cymraeg: 88 Favourite Welsh Adjectives for Christmas, and started looking at it a few days ago. I was surprised that I knew all but three of the adjectives listed!

I wonder if it’d be possible to create a little quiz with a (deduped) version of your list, in popularity order, so people can see how many they know?

2 Likes

This is brilliant (though I still don’t know why I don’t understand half of the conversations in Pobol y Cwm when I probably know four times the words!).
Do you have the list of words available?

3 Likes

To get into the nitty-gritty, I’d say it comes down to the types of words used, conjugations, mutations and pronunciations.

In English, around 6% of all words used are “the”, but without understanding the verbs especially, you probably won’t know what’s being said.

Similarly if, in English, you understand “going to” and someone said “gonna”, you could be completely thrown off.

For example:

“Mae’n mynd i gymryd mwy na jyst ni’n dwy i ymladd nhw, Kath.”

Let’s say you understand the below words:
Mae’n mynd i gymryd mwy na jyst ni’n dwy i ymladd nhw, Kath.
It’s going to ______ more than just (we/us) ___ to ______ them, Kath.

At spoken speed you may not understand this, but taking a second to break it down, you can understand the vast majority of the sentence.

But you’re still missing:
Gymryd = to take
Dwy = two (the feminine form, which may not be recognisable on first hearing if you only know dau)
Ymladd = to fight

So whilst individually, we may understand the vast majority of the words in the sentence:

  • We’re not yet used to hearing those words in a sentence with specific accents/pronunciations, in structures we’ve not heard before.
  • We need to develop speed of listening
  • We’re missing some of the core verbs (to take & to fight).
  • I think it’s also easy to focus on the words that we don’t recognise rather than the ones that we do.
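To put a rough number on the example sentence, here is a small sketch that counts what share of its words fall in a “known” set. The `known` set simply mirrors the words treated as understood in the breakdown above, and `known_share` is a hypothetical helper, so treat it as an illustration rather than a measure of real comprehension.

```python
import re

def known_share(sentence, known_words):
    """Return (fraction of tokens that are known, list of unknown tokens)."""
    tokens = re.findall(r"[\w'’]+", sentence.lower())
    missing = [t for t in tokens if t not in known_words]
    return 1 - len(missing) / len(tokens), missing

sentence = "Mae'n mynd i gymryd mwy na jyst ni'n dwy i ymladd nhw, Kath."
# Assumed known words, taken from the breakdown above.
known = {"mae'n", "mynd", "i", "mwy", "na", "jyst", "ni'n", "nhw", "kath"}

share, missing = known_share(sentence, known)
print(f"{share:.0%} of the words known; missing: {missing}")
```

Even with most of the words recognised, the missing ones include both main verbs, which is why the sentence can still be hard to follow.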

Perhaps it shows the importance of consuming the language, even if we don’t think we understand it yet.

6 Likes

This is fascinating! Thanks for doing the research and for sharing. Very interesting!

It reminds me of a conversation I had many years ago when someone claimed they had learnt 25 basic sentence structures in Japanese and had been able to converse with everyone they came across on a visit to Japan. I’m not sure how accurate that was, but it does give some inspiration.

2 Likes

One thing that trips me up when watching Welsh tv is the accent. Watching something like Y Fets or Cynefin, there are people whose accent I can understand very well, and people whom I can’t understand without the subtitles. I think it’s very valuable to watch programs that feature people with a variety of accents, because even if you don’t understand it all, you start to get a feel for the breadth of possible pronunciations.

6 Likes

That’s a good point. I might know more words than I said! If I take the time to pause and read closely (my reading and comprehension speed is sometimes too slow to take in the longer sentences especially) I often do find I know all or most of the words, or can at least take an intelligent guess at what an abbreviation must be.

As an aside, I was chuckling to myself yesterday at the very cheerful vocabulary I’m building up just now, thanks to PyC. Lladd. Gwenwyn. Rhagfarn. Casáu. Amau. :rofl:

3 Likes

This is indeed very interesting.

Can I suggest a variation which might be worth looking at? This is to make the same type of cumulative plot, but for f * (-log(f)) rather than just f (where f is the frequency of each word as a share of all words spoken). This is a standard way to account for the higher “information content” of less frequent items, known as Shannon entropy after its inventor, Claude Shannon. The numbers will come out higher, but might better reflect the issue you highlight where not knowing a few key words means that you lose the sense of the whole.
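A minimal sketch of that weighting, under a few assumptions: f is each word’s share of all tokens, the log is base 2 (any base gives the same cumulative percentages), and words are still ranked by raw frequency. The function names and the toy counts are made up for illustration.

```python
import math
from collections import Counter

def entropy_weights(counts):
    """Weight each word by f * -log2(f), where f is its share of all tokens."""
    total = sum(counts.values())
    return {w: -(n / total) * math.log2(n / total) for w, n in counts.items()}

def words_needed_weighted(counts, target):
    """Number of top-ranked words (by raw frequency) needed for the
    cumulative entropy weight to reach `target` of the total weight."""
    weights = entropy_weights(counts)
    total_weight = sum(weights.values())
    if total_weight == 0:  # only one distinct word, so nothing to weight
        return len(counts)
    running = 0.0
    for rank, (word, _) in enumerate(counts.most_common(), start=1):
        running += weights[word]
        if running / total_weight >= target:
            return rank
    return len(counts)

# Toy counts, not real data.
toy = Counter({"i": 8, "yn": 4, "mae": 2, "heddlu": 1, "cadw": 1})
print(words_needed_weighted(toy, 0.8))
```

In this toy set the top three words cover 87.5% of the raw count, but it takes four to reach 80% of the weighted total, which matches the expectation that the numbers come out higher.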

Thanks for a thought-provoking analysis!

Robin

1 Like

This is getting a bit data-science heavy, so not sure how many people will know what we’re saying, but I’ll dive in nevertheless.

That’s an interesting thought that I’m not sure how I feel about. I understand where that comes from, especially with my last thought around still not understanding certain sentences due to missing certain key words, although I feel like Shannon’s Entropy may be a bit generalised in this context.

Similarly, since results are more swayed by words that are used less often, I think I’d want a larger dataset, preferably in the hundreds of episodes, rather than just 15. For example, I plotted the data for both 8 episodes and 15 episodes:

8 Episodes:
80% - 749 Words
90% - 1396 Words

15 Episodes:
80% - 823 Words
90% - 1662 Words

Because the dataset isn’t large enough, the 90% figure rises by 266 words (from 1,396 to 1,662) when going from 8 to 15 episodes. These differences would eventually level out with a larger dataset.
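This kind of check can be repeated for any number of episodes. Below is a rough sketch that assumes each episode is already a list of tokens; `growth_curve` and `words_needed` are hypothetical names, and the three tiny “episodes” are invented purely to show the mechanics.

```python
from collections import Counter

def words_needed(counts, target):
    """Smallest number of top-ranked words covering `target` of all tokens."""
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running / total >= target:
            return rank
    return len(counts)

def growth_curve(episodes, target, sizes):
    """Words needed at `target` coverage using only the first k episodes,
    for each k in `sizes`."""
    results = {}
    for k in sizes:
        counts = Counter()
        for tokens in episodes[:k]:
            counts.update(tokens)
        results[k] = words_needed(counts, target)
    return results

# Invented toy episodes, far too small to show any levelling out.
episodes = [
    ["mae", "yn", "mae", "siop"],
    ["mae", "yn", "heddlu", "cadw"],
    ["yn", "twyllo", "lladd", "mae"],
]
print(growth_curve(episodes, 0.9, [1, 2, 3]))
```

If the numbers stop climbing as more episodes are added, that would suggest the dataset is big enough for that percentile.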

Also, whilst some less frequently used words do contain more information than words like “i” and “fi” (“twyllo” (deceive/cheat) would be a great example of this), I think that it’s a bit of a generalisation.

For example, “Heddlu” (Police) is said 21 times and “Cadw” (Keep) is said only 10 times. Based on Shannon’s Entropy, you’d be saying that “Cadw” has a higher information content than “Heddlu” which I would argue is probably not the case in most situations.
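To make the comparison concrete, here is a small sketch using the two counts above. The total word count is an assumption: it is inferred from the earlier figure that ‘i’ was used 1,573 times and made up 4.76% of all words spoken, so it is only approximate. The function names are made up.

```python
import math

# Assumption: total inferred from 'i' = 1,573 uses = 4.76% of all words.
TOTAL_WORDS = round(1573 / 0.0476)

def surprisal_bits(count, total=TOTAL_WORDS):
    """Information per occurrence, -log2(f), where f = count / total."""
    return -math.log2(count / total)

def entropy_term(count, total=TOTAL_WORDS):
    """The word's term in the entropy sum, f * -log2(f)."""
    f = count / total
    return -f * math.log2(f)

for word, count in (("heddlu", 21), ("cadw", 10)):
    print(word, round(surprisal_bits(count), 2), round(entropy_term(count), 5))
```

Per occurrence, “Cadw” comes out with more information than “Heddlu”, and the gap is log2(21/10), just over one bit, whatever the total is. In the f * (-log(f)) sum, though, the more frequent word still carries the larger term, so the weighted plot doesn’t simply rank rarer words above common ones.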

With that being said, I’m sure there’s some value in it, even if just using it as an even more conservative estimate for the number of words needed to understand a certain percentage of speech, so here’s the same data plotted using f * (-log(f)).

3 Likes