Cymraeg-Friendly OCR software

mikeellwood · February 16, 2016, 10:08pm

Is anyone aware of any OCR freeware for Windows that understands Welsh orthography?

Diolch o flaen llaw.

mikeellwood · February 17, 2016, 4:48pm

FWIW, I’ve ended up using this “FreeOCR”, at least for the moment:

http://www.paperfile.net/lang.html

I had used it before for German, so at least it’s familiar. As distributed, it does not support Welsh. In theory, you can add configuration files for other languages. I found one for Welsh on GitHub, put it into the appropriate folder and reloaded the s/w, but it would not work at all.

For now, I am compromising, using the language setting for French, as it understands most of the to bachs that way, although it makes a mess of some other things.

MickDavies · February 19, 2016, 8:30am

Would love to help, but I’m not even sure what an OCR is :S

owainlurch · February 19, 2016, 8:38am

Optical character reader. So a computer can read stuff (printed or handwritten) through an image provided by a photo or scanner.

I didn’t even know they came in different languages (though it makes complete sense now mikeellwood mentions it!) so I’m of absolutely no help either!

mikeellwood · February 19, 2016, 2:43pm

Although I worked in IT for years, I didn’t know much about OCR, having rarely had a need to use it, although I did start using it in semi-earnest a few years ago, without knowing much about the “innards”.

However, even just looking at western European languages, a lot of them have their own unique special characters, accent marks, etc, and at first I thought it was mainly a question of the software being able to recognise those. However, when I had a (quick) look at what came with the language pack I got from GitHub for Welsh, even without understanding exactly how it was used, I could see it included a lot of common words. I assume that these can be used by the software to help recognise words when perhaps the scan is not of the best quality (which can easily happen, in spite of best efforts). It’s a pity I couldn’t actually get the Welsh language pack to work with this software. As it is, I think it’s a bit of a compromise, and, for example, it very often confuses “w” with “vv”, perhaps because it doesn’t expect to find “w”s in places where they tend to occur in Welsh words.
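To illustrate how a word list could help with the “vv” problem, here is a minimal Python sketch of a post-correction pass over OCR output. It is only an assumption about the general idea, not how FreeOCR or any language pack actually works; the tiny `WELSH_WORDS` set and the function names are made up for the example.

```python
# Hypothetical mini word list; a real language pack would hold far more entries.
WELSH_WORDS = {"wedi", "wrth", "gwaith", "hwn", "newydd", "llaw"}


def fix_vv(token, known_words=WELSH_WORDS):
    """Swap 'vv' for 'w' only if the result is a known word."""
    if "vv" not in token.lower():
        return token
    candidate = token.replace("vv", "w").replace("VV", "W")
    # Ignore surrounding punctuation when looking the word up.
    lookup = candidate.lower().strip(".,;:!?")
    return candidate if lookup in known_words else token


def correct_line(line):
    """Apply the 'vv' fix to every whitespace-separated token."""
    return " ".join(fix_vv(tok) for tok in line.split())


print(correct_line("vvedi gvvaith nevvydd"))
```

Because the swap only happens when the corrected form is in the list, a token with a genuine “vv” that is not a known Welsh word is left untouched.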

Well, I can manage with this for now, but it would be nice to find a better solution in the long term.
I was wondering if the people at Bangor Uni had anything, but I’m not finding anything relevant on their website.

craigf · February 19, 2016, 10:37pm

I found out Google Docs can do it. Go to Google Drive. In settings, enable Cymraeg as a language you understand. Upload an image file, then open it in Google Docs.

mikeellwood · February 20, 2016, 12:20am

Wow, I would never have thought of that.
Many thanks!

I’m not very google-docs, etc-oriented, although I do have an account, and can probably get up to speed.

Doing a quick bit of googling suggests it can even recognise the language automatically (well, google translate often can, so this seems reasonable), but if it doesn’t, it won’t be too much effort to set it manually, I’m sure.

craigf · February 20, 2016, 12:33am

It recognized Welsh automatically for me and the results were pretty good. I did not see an option for selecting manually. Not sure if you have to set Cymraeg as a known language or not. The tutorial I saw said to but it was old.

mikeellwood · February 20, 2016, 1:50am

It’s not working exactly as per the tutorial I have found (also probably old), but it is working!

E.g. I ticked a box in the settings to ask it to convert to google docs as it was uploaded, but it’s not doing that. However, if I right-click on the uploaded .jpg, it gives me the option to open it in google docs…it thinks about that for a minute or so, and then voila, it’s opened as text in google docs. There is a language setting in the “settings” and I have both English and Welsh set, but I’m not even sure if that’s relevant to the OCR-ing. Just for fun, I ran the option to translate, and I still had to select a language for that. (It didn’t make a very good job of the translation either!)

Not quite perfect (the OCR-ing) - a few little mistakes, but it’s pretty good - better than the FreeOCR I was using before.

It hasn’t been tested yet with “ŷ” or “ŵ”, but it has picked up the other vowels with to bachs correctly.

So far then, very promising! Diolch!

craigf · February 20, 2016, 2:26pm

That’s how it worked for me too. The example I ran did have some ‘ŵ’ in it and they were recognized. But the image was pretty clean, coming from a screen cap of a news article.

mikeellwood · February 20, 2016, 4:28pm

Do you happen to know if there are Unicode characters for the Welsh “ff”, “ll” and “dd”, which are strictly speaking single letters in Welsh? (Or a way of inputting them as single characters, rather than just doubling up the single characters?)

Using the GD OCR, they are coming through as double letters, which is probably fine for my purposes. However, I think at least sometimes with what I was using before, they were actually coming through as single characters. I’ve had a look around, and haven’t seen a way to create those (the welsh.typeit for example, only covers the vowels).
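If the digraphs stay as two characters in the text, software can still treat them as single letters when it needs to, e.g. for counting or sorting. A minimal Python sketch, assuming a simple greedy match (the function name is made up for the example):

```python
# The eight digraphs of the Welsh alphabet.
DIGRAPHS = ("ch", "dd", "ff", "ng", "ll", "ph", "rh", "th")


def welsh_letters(word):
    """Split a word into Welsh letters, treating digraphs as one letter.

    Greedy and therefore imperfect: in some words a pair such as
    'ng' is really two separate letters, which needs a dictionary to detect.
    """
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair.lower() in DIGRAPHS:
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


print(welsh_letters("ffordd"))
print(len(welsh_letters("llaw")))
```

With this approach the text itself stays as plain doubled letters, and only the code that cares about “letters” has to know about the digraphs.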

…

I’ve been playing around with the Windows “charmap”, and not found them in there so far, either.

If anyone has the Windows Welsh keyboard installed, is there a way of inputting them with that?

…

update…managed to get this one using font “Welsh Cambria” in charmap:

ﬀ

“U+FB00 Latin Small Ligature FF”, apparently.

(But doesn’t seem to work in “normal” fonts, at least in notebook, although it’s working here).
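For anyone who ends up with that ligature in their text and wants plain letters back, Python’s standard `unicodedata` module can identify it and expand it. A minimal sketch; the `expand_ligatures` helper is just a name for this example, and NFKC normalisation also rewrites other “compatibility” characters, so it may change more than ligatures:

```python
import unicodedata

LIGATURE_FF = "\ufb00"  # the single-character "ff" ligature

# Look up the official Unicode name of the character.
print(unicodedata.name(LIGATURE_FF))


def expand_ligatures(text):
    """Expand compatibility characters such as U+FB00 into plain letters."""
    return unicodedata.normalize("NFKC", text)


print(expand_ligatures("\ufb00ordd"))
```

Precomposed accented vowels such as “ŵ” (U+0175) come through NFKC unchanged, so the to bachs are not affected.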

…
Edit2: I suspect this is something not really worth worrying about. If it’s this fiddly getting them in as single characters, I can’t see the average Welsh writer doing anything other than doubling up the single letters! As ever, open to correction / edification though.

craigf · February 20, 2016, 4:52pm

I don’t think these would be typically used. Printers often used ligature characters to improve kerning of certain letter combinations such as ‘fi’. I would imagine many of these are in Unicode but would not be used unless you were printing a book.

mikeellwood · February 20, 2016, 4:58pm

Yes, thinking about it, I’m sure you are right.

And even in English, things like “th” and “ph” are really “single letters” (and at least “th” used to be represented by a single letter in Old English), but they just happen to be represented by double letters in the modern English language (and Welsh, as it happens).

So it’s another case of “paid a phoeni” I think.