Corpus analysis


I didn’t get a chance to post this yesterday as I was too busy after the conference having dinner and ‘sampling’ New Zealand’s finest Monteith’s beers¹, but I think the presentation was mostly a success.

I probably should have refined it a little more on Thursday night instead of heading to the pub and, yes, sampling more of New Zealand’s finest Monteith’s beers, because I think it was a little rushed and felt a bit underbaked, but aside from that I got the feeling that the reception was good. Unfortunately I didn’t leave any time for questions, and there were two more talks after mine in the session, meaning people probably let it slip into their subconscious. Nonetheless, there has been some positive feedback.

The four plenary talks were all brilliant. Sarah Ogilvie took a historical look at the impact of James Murray, the first editor of the Oxford English Dictionary, and his understated willingness to be as inclusive of borrowed words as he could, despite some later revisionists’ assertions that he was too stubborn about including foreign words. Bruce Moore, on the other hand, carpeted the Oxford’s more recent publications for sloppy antipodean citations, showing that many of the multiple citations for obscure Australian and New Zealand words, such as Old Thing for a dish of salted beef and unleavened bread, all derived from a single source, a wordlist of Australian words published in 1941 by Sidney Baker, yet the OED has listed them as separate pieces of evidence.

More relevant to my talk, though, were two other talks yesterday on electronic dictionary systems. One was by Dave Moskowitz, who developed the Freelex dictionary creation software for the adult monolingual Māori dictionary², mostly because he didn’t want to do it all himself. Freelex, as its name might suggest, is free (as in both beer and speech) and open source, and it runs on a MySQL backend. The other talk was by Gilles-Maurice de Schryver, who developed TshwaneLex, a commercial product that does a similar job, but which runs on a proprietary XML-based format at its backend.

Both of those are at hugely more advanced stages of development than our humble XML-based multiple format dictionary project. Even so, the demonstrations of the Kirrkirr Kaurna dictionary and the mobile phone dictionary, which I was able to run on the projector screen in an emulator, were absorbed by the audience with a great deal of interest, especially the idea that mobile phones are simply the obvious choice for housing dictionaries in some parts of the world. Such a system, for instance, would be perfect for Southern Africa, which has a similar internet situation to Northern Australia.

Among our many Monteith’s last night, we had a long discussion about some aspects of theoretical lexicography³, such as what purpose dictionaries are meant to serve. Several of the talks referred to dictionary users being put off by things such as labels, parts of speech, scientific names and so on. These talks mentioned ‘training’ the users to get the most out of the dictionary. But another point of view put forward last night, not necessarily my own, was that it may be better to rebuild the dictionary so that it’s what the user wants and needs, rather than to persevere with a non-user-friendly dictionary that tries to shoehorn the audience into it.

For instance, Julie Baillie gave a talk directly after mine, in which she presented Oxford’s new beginner’s wordlist, which uses corpus techniques to find the words most used by younger children who are just beginning to read and write. The inspiration for her research, which culminated in the production of the Oxford Wordlist, was that children in primary school classes were learning to read and write using wordlists created in the 60s and 70s in Europe. These naturally involved concepts foreign to Australian and New Zealand kids and were for the most part useless for the kids to learn to read and write with. She compiled the wordlist by the frequency of words as they appeared in small narratives written by children in the target age groups, and it therefore better reflects those children’s worldviews. So, she has rebuilt the dictionary to suit the needs of the user, rather than forcing the user to conform their needs to the functions of the dictionary.

Brilliant.
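
For the programmers among you, the frequency-counting idea behind a wordlist like that can be sketched in a few lines of Python. This is only my own rough illustration, not the method actually used for the Oxford Wordlist: the function name build_wordlist, the crude tokenising rule and the three sample stories are all made up for the example.

```python
import re
from collections import Counter


def build_wordlist(texts, top_n=10):
    """Return the top_n most frequent words across texts as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and treat runs of letters/apostrophes as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)


# Hypothetical sample narratives, invented purely for illustration.
stories = [
    "I went to the beach with my dog.",
    "My dog ran on the beach and I ran too.",
    "We went to the park and my dog ran.",
]

wordlist = build_wordlist(stories, top_n=5)
for word, count in wordlist:
    print(word, count)
```

A real project would presumably need far more text, some handling of children's spelling variants, and a decision about whether to lump inflected forms together, none of which this sketch attempts.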

Anyway, that’s one conference down, one to go. I’m off to Melbourne next week for the Unimelb postgrad conference, and perhaps also to discuss the possibility of doing a PhD there beginning in 2010.


  1. These Monteith’s Brewery beers are fantastic, mostly. Unless you like cider you can give the Summer Ale a miss, and the Radler is pretty much like a shandy. By far the best is Original Ale, whose closest Australian analogue would have to be Squire’s Amber Ale. Following closely behind is the Pilsener.

    You can tell that I’ve been busy in research this week.

  2. Which reminds me, I really want to find a copy of a good Māori dictionary before I leave.
  3. Far out, I am the King of the Nerds.

Maybe it’s my less-than-prime cognitive state right now, but I’m beginning to notice little grammatical quirks and ambiguities that I’d normally have overlooked completely. (I originally wrote overseen there, which was silly of me – thanks for pointing it out, David.)

This web page popped up when I opted out of a frankly unsolicited email advertising list:

You have been opted out.

Pardon? Is that an applicativised use of the phrasal verb opt out? My understanding of this verb is that you opt out of something; you do not get opted out. Then again, if this use doesn’t strike you as odd, and it’s alright to you to say that someone has opted you out of something, please feel free to digress.

Incidentally, corpus.google.com¹ shows that the strings have opted out and has opted out together generate about 188,400 results, while been and get opted out only generate about 2,000. Be opted out is surprisingly common though, with about 12,500 hits, so maybe it isn’t as ungrammatical as I thought.

The other thing I noticed today was the packaging on a salami from the supermarket, which read:

Ideal for entertaining.
For entertaining recipes, visit our website.

Honestly. Recipes are matter-of-fact, functional things. How entertaining do they have to be?

Seriously though, I was just having a conversation about a very similar thing in the linguistics room on irc.freenode.net. I was previously under the impression that the term operating system is a paraphrase of something like a system that operates, in which case you’d call operating a verb participle, I guess. But since an operating system is actually a system that pertains to operating, it’s accurate enough to call it a gerund.

~

In other news, I just upgraded my WordPress software from 2.3.1 to 2.3.2, because apparently there was a security fault with 2.3.1, and readers were occasionally able to see drafts, which are usually hidden. In fact I noticed a while back that my stats page showed many of my drafts as having been visited, which concerned me slightly. But it should be fixed now, so I can feel free to draft on.

~

¹I’ve mentioned corpus.google.com before, and I’ve been using it now for well over a year. In fact up until an hour ago I thought I had originally coined it. But it has come to my attention that there was a blog post that antedates my first use by over 18 months. Still, I certainly came up with it independently, so it’s much like arguing over whether it was Newton or Leibniz who invented calculus.

Here’s the relevant bit:

I wonder if Google will eventually offer such a service themselves? “corpus.google.com”? (Apologies to those who thought this post was actually announcing such a service.)

Predictably, I’ve also variously had to offer up similar apologies to some of my readers who were misled by my reference, such as David.

Earlier on this afternoon, I heard a cricket commentator, having heard about someone whose name he didn’t immediately recall, promise that he’d google him up. This would not be a natural usage for me, although it’s unequivocally clear what he means; it’s completely synonymous with (in my view) the more natural version to google someone, i.e. to search for them on Google.

Anyway, I started wondering how common the construction google up is, so I went and googled it… up, and here’s a breakdown of the returned hits on all permutations:

X       google X     google X up
him       106,000          7,510
her        92,900          3,490
them      151,000         11,300
it      5,600,000        513,000

Roughly speaking then, the non-phrasal variety google someone is far more common, but the phrasal variety google someone up has some substantial corpus.google.com¹ representation, somewhere between 4% and 9% of the former depending on the pronoun, so about a tenth at best.
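
If you’d rather check the arithmetic than take my word for it, here is a quick sketch in Python that works out the phrasal hits as a percentage of the non-phrasal ones. The hit counts are copied straight from the table above; the names hits and phrasal_share are just ones I made up for the example.

```python
# Hit counts copied from the table above: (google X, google X up).
hits = {
    "him": (106_000, 7_510),
    "her": (92_900, 3_490),
    "them": (151_000, 11_300),
    "it": (5_600_000, 513_000),
}


def phrasal_share(plain, phrasal):
    """Return the phrasal hits as a percentage of the non-phrasal hits."""
    return 100 * phrasal / plain


shares = {word: phrasal_share(*pair) for word, pair in hits.items()}
for word, share in shares.items():
    print(f"{word}: {share:.1f}%")
```

Bear in mind that search engine hit counts are estimates, so the percentages are only as trustworthy as the numbers going in.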

I didn’t really have anything more to say about it, apart from pointing out that the phrasal verb google up is probably to be expected to occur on the basis of analogy from look up, as in I’ll just go and look them up (in the phonebook). Although curiously, a family member thought it meant to contact via Google rather than to merely find their details, as if on analogy from ring up.

In an age where Google has pretty much usurped phonebooks (of all colours), street directories, atlases, library catalogue cards, encyclopædiae and just about any other source of information, it may as well replace the linguistic idioms associated with them as well.

Holy crap! If someone googled up “Google”, do you think this post would be somewhere near the top?

~

¹No, corpus.google.com doesn’t exist as a URL; it’s just what I use to refer to the act of doing a quick Google search on a phrase using wildcards and quotation marks to back up one’s largely made up postulations about trends in modern English. Think of it as a snowclone, of the x.google.com template.

I just thought I’d point it out, since last time I used the term corpus.google.com, someone (let’s just call him Mr Nash – no wait, that’s too obvious; I’ll call him David N) wondered why they couldn’t find the corpus.google.com homepage.
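
For the curious, the sort of query I mean can be put together like this in Python. It is only a sketch: the function name corpus_query_url is my own invention, the code just builds ordinary Google search URLs with the phrase wrapped in quotation marks, and I am assuming that a * inside the quotes is treated as a wildcard. It does not fetch anything or read the hit count for you; that part is still a matter of opening the URL and looking.

```python
from urllib.parse import urlencode


def corpus_query_url(phrase):
    """Build a Google search URL for an exact-phrase query.

    The phrase is wrapped in double quotes to request an exact match.
    A * inside the phrase is assumed to act as a wildcard.
    """
    return "https://www.google.com/search?" + urlencode({"q": f'"{phrase}"'})


# Example phrases taken from the posts above.
for phrase in ["google him up", "google * up", "been opted out"]:
    print(corpus_query_url(phrase))
```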