Thursday 3 October 2024

Not my first Rodeo No 6: Corpus Schmorpus...


A shameless reposting of 13-year-old content again, but much updated:

One question that people ask when you say you write dictionaries is: "Well, how do you decide which words to put in?" Good question, with lots of different kinds of answers.

In the past you could just copy your competitors (only joking 😉) or, like Murray on the huge Oxford English Dictionary, you could get hundreds of 'spotters' to send in words on cards.

But one of the good answers, and the one I would have given ever since I started in 1988, is "you use a corpus". Corpus is a good old academic term for a usually large body of written evidence that you can search. In biblical terms, a concordance of the corpus of the Bible would show you each word used in it and how it is used.
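For the curious, here is a minimal sketch of what a concordance does, in Python. It is not how any real dictionary tool works: the function name `concordance`, the `width` parameter and the sample sentences are all invented for illustration, and it happily ignores sentence boundaries.

```python
import re

def concordance(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` words either side of each hit."""
    # Crude tokeniser (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Invented sample text, standing in for a corpus of millions of words.
sample = (
    "The dog will bite the postman. "
    "Mosquitoes bite at dusk. "
    "Let the clutch bite before you pull away."
)
kwic = concordance(sample, "bite")
for line in kwic:
    print(line)
```

Each printed line shows one occurrence of the word with its neighbours on either side, which is more or less what those filing cabinets of printout held.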

For writing dictionaries you need a good-sized sample of your target language (millions of words at least) so that you can spot and have evidence for rarer words.

In the past (before Joan Clarke, IBM and Berners-Lee) a corpus could be on paper, printed out as a huge set of concordances. I have worked with one. I think each million words filled an entire filing cabinet. But now we have not just billions, but many trillions of words which are in principle searchable. So the evidence is gobsmackingly large.

And if you search for the word 'gobsmackingly', I promise you that you will never run out of examples.

In fact, back in the early days of the internet, when Netscape would tell you how many examples it had found, we lexicographers would bet on the number and even suggest things that would only come up once, such as rare idioms like 'black as Newgate's knocker' or 'God willing an the crick don't rise'. Try them now.

So what do you do with all this evidence? Well you can see *how* the word is used, which to my mind is the most important thing you need to know. Apart from its meaning, which in fact the corpus can't show you directly but is usually easy to work out, you can see if this word is informal or technical or very vanilla. By looking at the words with it you can see clusters of different meanings. 'Bite' will show you dogs and mosquitoes and even clutches, all biting. With a bit of care you will also find the word 'bug' there, almost always 4 positions after the verb. (Answers on a postcard 🙂). And now I am terribly tempted to search for the word 'vanilla'.
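The trick of looking at the words around a word can be sketched in a few lines of Python. This is only a toy under stated assumptions: `collocates_at`, the list of verb forms and the five sentences are made up for the example, and real corpus tools do far more (lemmatising, statistics, grammar).

```python
import re
from collections import Counter

def collocates_at(sentences, node_forms, offset):
    """Count the words found `offset` positions from any form of the node word.

    A negative offset looks to the left of the node word, a positive one to the right.
    """
    counts = Counter()
    for sentence in sentences:
        # Crude tokeniser (an assumption): lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, tok in enumerate(tokens):
            j = i + offset
            if tok in node_forms and 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts

# Invented toy corpus; a real one would have millions of sentences.
toy = [
    "The dog bit the postman.",
    "A mosquito bit my ankle.",
    "Dogs bite strangers.",
    "The clutch bites too sharply.",
    "That dog bites the postman every week.",
]
forms = {"bite", "bites", "bit", "biting", "bitten"}

before = collocates_at(toy, forms, -1)  # who or what does the biting
after2 = collocates_at(toy, forms, 2)   # two positions to the right of the verb
print(before.most_common(3))
print(after2.most_common(3))
```

Change the offset to 4 and feed it enough real text, and you have the beginnings of a way to go hunting for that 'bug'.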

Since our first online fumblings (and I'm now terribly tempted to search for the word 'fumblings': you can see how lexicographers get distracted) the technology has improved massively, and corpus searching underlies many of the miracles of things such as Google Translate. A late colleague of mine, Adam Kilgarriff, did more than anyone else before his premature death to bring computer technology to the lexicographers' aid. You can find out more about it here:

https://www.sketchengine.eu/

Compared to Murray in his scriptorium (see below) we are truly blessed.



