Lexicography


A couple of months ago, I received a phone call from a journalist from the Herald, who’d seen my appearance on SBS World News, and was interested in writing an article about the mobile phone dictionary project.

A few things have happened between then and now, including conferences, holidays and a didjeridu performance by Nicole Kidman on German TV that seems to have absorbed all local interest in indigenous affairs for a few days1, but on Friday morning, two articles appeared in the front page section of the Herald, based in part on an interview I gave a little while back.

The main article is about Phil Parker, the marketing guru who’s recently delisted his ‘books’ on Australian languages (including dictionaries, thesauruses and crossword puzzle books) after his dubious publications hit the virtual shelves, and after a small but vociferous group of linguists complained. The other article is about this mobile phone dictionary project that James and I are getting more and more involved in, and (very quickly) how this sort of project can prevent the theft of data in the first place.

I feel that the article on Philip Parker makes me look like a bit of a whinger. Here’s the operative quote:

Aidan Wilson, a Sydney University linguist who wrote an honours thesis on the Wagiman language spoken north-west of Katherine, said Professor Parker had used the wrong spelling on the cover of his publication Webster’s English To Wageman Crossword Puzzles: Level 1.

Yes; it’s true that Parker had the wrong spelling, but it’s clearly not the reason I’m annoyed at the publication of these books. I’m more annoyed that the entirety of information within them is publicly available at locations that properly explain the data, the language, and cite sources, while these dictionaries, thesauruses and crossword puzzle books omit all of this information. In short, they are lossy2 versions of dictionaries already freely available.

The article also makes it sound like we, speakers of indigenous communities and linguists working with them, have hindered the publication of useful educational resources due to our collective sensitivities. It doesn’t help the situation that Parker probably had his heart in the right place in wanting to further disseminate information relating to critically endangered languages.

A dyslexic, he collects lists of words and publishes dictionaries, thesauruses and crossword puzzles at a loss, he says, in the interests of education. His work has been heralded as a way to create paper resources for resource-starved Third World students.

That’s all well and good, but perfectly good materials already exist – those that the linguists have produced and made freely available in full consultation with the language community. It surely isn’t helpful to convert these into forms in which the information is distilled and compressed such that it no longer conforms to even the minimum standard required for the most basic dictionary. All information apart from the name of the language, the headword and a single gloss has been omitted. That truly is lossy. To give you an idea of what I mean, here’s an entry from the Online Wagiman Dictionary:

ngal-gawu-mang

nominal

1. grandmother (mother’s mother)

Ga-ngotjje-ji-n ngal-gawu-mang-gu. Ga-ngotjje-ji-n gahan warren yerdeng-nga ya-nggi, ngal-gawu-mang warle-na. ‘He is scared of his grandmother. That kid ran away and hid because his grandmother growled him.’ (LM)

2. grandchild (from a woman to her daughter’s children)

see also gawu, ngal-gawu.

You can see that there are no fewer than six tiers of information here: a headword, part of speech, glosses divided into multiple senses, illustrative sentences, their glosses and, importantly, the speaker responsible for that illustrative sentence, as well as related words. Parker’s dictionary merely has this:

ngal-gawu-mang
grandmother
grandchild

I don’t think anyone could reasonably argue that the latter is more useful than the former, or even that it is good for it to be around in addition to the original. I would even go so far as to say that its existence in this form is potentially harmful, and that the harm outweighs any possible benefit it has as an educational resource.
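For the programmers among you, here is a rough Python sketch of what that loss looks like as a data structure. The class and field names, and the `distil` function, are my own inventions for illustration; they are not the schema of the Online Wagiman Dictionary, and I am only guessing at how Parker's process strips an entry down.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Example:
    text: str         # illustrative sentence(s) in Wagiman
    translation: str  # free translation
    speaker: str      # initials of the speaker who supplied it


@dataclass
class Sense:
    gloss: str
    examples: List[Example] = field(default_factory=list)


@dataclass
class Entry:
    headword: str
    part_of_speech: str
    senses: List[Sense]
    see_also: List[str] = field(default_factory=list)


def distil(entry: Entry) -> List[str]:
    """Hypothetical lossy reduction: keep the headword and one bare gloss per sense."""
    return [entry.headword] + [s.gloss.split(" (")[0] for s in entry.senses]


entry = Entry(
    headword="ngal-gawu-mang",
    part_of_speech="nominal",
    senses=[
        Sense(
            "grandmother (mother's mother)",
            [
                Example(
                    "Ga-ngotjje-ji-n ngal-gawu-mang-gu. Ga-ngotjje-ji-n gahan "
                    "warren yerdeng-nga ya-nggi, ngal-gawu-mang warle-na.",
                    "He is scared of his grandmother. That kid ran away and hid "
                    "because his grandmother growled him.",
                    "LM",
                )
            ],
        ),
        Sense("grandchild (from a woman to her daughter's children)"),
    ],
    see_also=["gawu", "ngal-gawu"],
)

# Everything except the headword and the bare glosses is thrown away.
print(distil(entry))
```

The part of speech, the example sentences, the speaker attribution and the cross-references all exist in the full entry, and none of them can be recovered from the distilled list.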

There is another issue that stems from this that deserves attention. Suppose you found one of these dictionaries for a language you’ve never heard of. Let’s say it has some pretty extraordinary stuff in it and you’d like to know more, or even go to the sources and do some fact checking. How do you go about doing it? There are no citations given anywhere, no examples have made it through the distillation process and no speakers are referenced. We’re in a different situation, as we know the original is a good quality publication due to Stephen Wilson’s work, and can pretty much trust that the ‘distilled’ version will more or less be correct. But if Parker gave the same treatment to a highly dubious dictionary, Urban Dictionary, let’s say, then the output would look just as authoritative as something derived from a reputable source in the first place. This clearly makes it very difficult for readers of dictionaries to make informed decisions about the quality of what they’ve got.

I should reiterate that I think Parker had the best of intentions; to further disseminate information about as many languages as possible, something I naturally admire as a linguist. Yet he fails to recognise that lexicography is not easy work; it can’t be done just with a data-harvester, a spreadsheet and a bunch of automatically generated Amazon.com comments and reviews. It takes linguists and lexicographers years to compile the information and resources necessary to create dictionaries. Producing very low-quality dictionaries, thesauruses and crossword puzzle books of some 600 worldwide languages does nothing but undermine their efforts.

  1. And that’s a whole nother post in its own right. []
  2. To borrow an audio term. []

I’ve been back in Sydney for almost a week now, having been in Melbourne before that to attend the University of Melbourne Linguistics and Applied Linguistics Postgraduates Conference, where I presented the Kaurna Electronic Dictionary1 to a sell-out crowd. It was the final leg of an epic, two-part whirlwind tour that began in Wellington almost two weeks ago.

  1. For some background on the dictionary, see these posts (definitely not automatically generated):
    Mobile Phone Dictionaries
    Ceased to Be
    Conferences, Seminars and Dictionaries
    More Good News
    One down, one to go []

I didn’t get a chance to post this yesterday as I was too busy after the conference having dinner and ‘sampling’ New Zealand’s finest Monteith’s beers1, but I think the presentation was mostly a success.

I probably should have refined it a little more on Thursday night instead of heading to the pub and, yes, sampling more of New Zealand’s finest Monteith’s beers, because I think it was a little rushed and felt a bit underbaked, but aside from that I got the feeling that the reception was good. Unfortunately I didn’t leave any time for questions, and there were two more talks after mine in the session, meaning people probably let it slip into their subconscious. Nonetheless, there has been some positive feedback.

The four plenary talks were all brilliant. Sarah Ogilvie took a historical look at the impact of James Murray, the first editor of the Oxford English Dictionary, and his understated willingness to be as inclusive of borrowed words as he could, despite some later revisionists’ assertions that he was too stubborn about including foreign words. Bruce Moore, on the other hand, carpeted the Oxford’s more recent publications for sloppy antipodean citations, showing that many of the multiple citations for obscure Australian and New Zealand words, such as Old Thing for a dish of salted beef and unleavened bread, all derived from a single source, a wordlist of Australian words published in 1941 by Sidney Baker, yet the OED has listed them as separate pieces of evidence.

More relevant to my talk though, were two other talks yesterday on electronic dictionary systems. One was by Dave Moskowitz who developed the Freelex dictionary creation software for the adult monolingual Māori dictionary2, mostly because he didn’t want to do it all himself. Freelex, as its name might suggest, is free (as in both beer and speech) and open source, and it runs on a MySQL backend. The other talk was by Gilles-Maurice de Schryver who developed TshwaneLex, a commercial product that does a similar job, but which runs on a proprietary format at its backend, based on XML.

Each of those is at a far more advanced stage of development than our humble XML-based multiple-format dictionary project. Even so, the demonstrations of the Kirrkirr Kaurna dictionary and the mobile phone dictionary, which I was able to run on the projector screen in an emulator, were absorbed by the audience with a great deal of interest; they paid particular attention to the idea that mobile phones are the obvious choice for housing dictionaries in some parts of the world. Such a system, for instance, would be perfect for Southern Africa, which has a similar internet situation to Northern Australia.

Among our many Monteith’s last night, we had a long discussion about some aspects of theoretical lexicography3 such as what purpose dictionaries are meant to serve. Several of the talks referred to dictionary users being put off by things such as labels, parts of speech, scientific names and so on. These talks mentioned ‘training’ the users how to get the most out of that dictionary. But another point of view, not necessarily my own, that was put forward last night was that it may be better to instead rebuild the dictionary so that it’s what the user wants and needs, rather than to persevere with a non-user-friendly dictionary that tries to shoehorn the audience into it.

For instance, Julie Baillie gave a talk directly after mine, in which she presented Oxford’s new beginner’s wordlist, which uses corpus techniques to find the words most used by younger children, who are just beginning to read and write. The inspiration for her research, which culminated in the production of the Oxford Wordlist, was that children in primary school classes were learning to read and write using wordlists created in the 60s and 70s in Europe. They naturally involved concepts foreign to Australian and New Zealand kids and were for the most part useless for the kids to learn to read and write with. She compiled the wordlist by the frequency of words as they appeared in small narratives written by children in the target age groups, so it better reflects those children’s worldviews. So, she has rebuilt the dictionary to suit the needs of the user, rather than force the user to conform their needs to the functions of the dictionary.

Brilliant.
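The frequency idea behind a wordlist like that is simple enough to sketch in a few lines of Python. To be clear, this is only my own toy illustration, not the method actually used for the Oxford Wordlist: the `build_wordlist` function is hypothetical, the tokenising is crude, and the three "stories" are made up rather than real children's writing.

```python
import re
from collections import Counter


def build_wordlist(texts, top_n):
    """Return the top_n most frequent words across a collection of short texts."""
    counts = Counter()
    for text in texts:
        # Crude tokenisation: lowercase, keep runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]


# Invented sample narratives, standing in for a real corpus.
stories = [
    "My dog ran to the beach",
    "The dog and I went to the beach",
    "I saw the dog at the park",
]

print(build_wordlist(stories, 2))
```

A real project would need a far larger corpus and careful decisions about spelling variants and inflected forms, but the principle is the same: the words come from what the children themselves write.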

Anyway, that’s one conference down, one to go. I’m off to Melbourne next week for the Unimelb postgrad conference, and perhaps also to discuss the possibility of doing a PhD there beginning in 2010.

  1. These Monteith’s Brewery beers are fantastic, mostly. Unless you like cider you can give the Summer Ale a miss, and the Radler is pretty much like a shandy. By far the best is Original Ale, whose closest Australian analogue would have to be Squire’s Amber Ale. Following closely behind is the Pilsener.

    You can tell that I’ve been busy in research this week. []

  2. Which reminds me, I really want to find a copy of a good Māori dictionary before I leave []
  3. Far out, I am the King of the Nerds []

I’m sneakily writing this during afternoon tea of the first day of Australex on the lectern’s computer, which has an unrestricted internet connection, because I just heard a great New Zealandism that I thought I’d share.

The talk was by Tony Deverson from the University of Canterbury, talking about creating a dictionary of New Zealandisms, and one of those that he brought up was to turn to custard, which is basically equivalent to Australian English to go pear-shaped. That, however, is not the New Zealandism that I want to share. When he was trying to gauge from the audience the wider use of the term, specifically whether it was used in Australia, he referred to Australia as The West Island.

In other news, I present tomorrow, so I’ll post something afterwards about how it unfolds. This will be my first time presenting anything, ever! And now someone needs to set up for their presentation, so I’d better go!

Further to presenting the Kaurna electronic dictionaries at Australex next week, we’ve been invited to give a talk at the University of Melbourne Linguistics & Applied Linguistics Postgraduate Conference 2008, held November 21-22. It’s a great excuse for me to finally visit Melbourne for the first time in… about 13 years.

Then, this morning, we received confirmation that our abstract has been accepted for the 1st International Conference on Language Documentation and Conservation in Honolulu, Hawai’i in March next year! By which time we should both be well and truly stuck into our next phase of the project, being generously supported by a grant from the Hoffman foundation, which you can read about here.

Unfortunately for me, March next year is during the teaching period meaning I won’t be able to attend. But hopefully James will be free then and will present our project to a wider audience.

As I promised last week, I’ve managed to find a copy of the SBS World News report in which I appeared, that mentions and demonstrates the mobile phone dictionary – thanks to Jeremy who recorded it – and so I’ve put it up here.

Just bear in mind that I had no idea that I was going to be interviewed, which is why I’m unshaven and wearing – ahem – a Transformers T-shirt (Decepticons, no less).

I suppose this destroys for good any semblance of internet anonymity that I had feigned.

<UPDATE>
As Michael noticed, I think the large video file was causing some strife for the company that generously hosts this site, Affernet, so I’ve YouTubed it instead.
</UPDATE>

A few weeks ago I mentioned that a bunch of us at Sydney Uni had submitted an abstract for a conference presentation of the Kaurna electronic dictionary.

Just recently, we received the news that our abstract has been accepted. So, if you’re planning on coming along to Australex ’08 at the Victoria University of Wellington in November and you’d like to see the public unveiling of our Kirrkirr and mobile phone dictionaries, then by all means look out for us – by which I mean me.

As it’s been about a month since my last post, it’s probably about time I posted something at least to ensure that this site doesn’t get referred to as a ‘dead blog’. To make matters worse, not only have I not been posting, I’ve also been neglecting my reciprocal blogger duties of reading other people’s work, which I hope is a good indicator of how busy I’ve been. Reading through the myriad of blogs in my feed reader is normally one of my most favoured activities.

So what is my excuse then?

The same old story really: work. But this time the various jobs are a little different. Besides my regular duties as audio engineer at Paradisec and my unrelenting duties as tutor of first-year linguistics, I have been preparing a grant application with a colleague to continue our work developing electronic dictionaries of minority languages, including dictionaries available as Java applications on your mobile phone1.

We have also been preparing several papers, conference talks, seminars and so on to detail our project and our process of producing visually-rich multimedia electronic dictionaries from basic wordlists. There are a couple of conferences later in the year that this sort of thing would be perfect for, but we also plan to get a paper sent off to some prestigious lexicography journal somewhere.

As a teaser, here’s an abstract that we sent off to one such conference earlier this month:

Kaurna is the indigenous Australian language of Adelaide and the Adelaide Plains. It has not been actively used since 1929, when the last native speaker died. More recently, efforts have been undertaken to restore Kaurna to a state of community use. One recent project involved the creation of an electronic Kaurna dictionary carried out by a team at the University of Sydney during the first half of 2008. As this was a community-driven project, it had certain requirements, such as the need to archivally preserve the two main documentary sources of Kaurna: a book published in 1840, and a hand-written manuscript from 1857.

In an effort to maximise flexibility, portability and transparency, the Kaurna dictionary project opted for an XML formatted master dictionary that could then be converted to other formats, such as an HTML web-page, or even a printed dictionary. The current means of presentation is through Kirrkirr, a multimedia-rich dictionary visualisation tool.

In this project we also developed software for presenting the dictionary on mobile phones. Mobile phones are almost ubiquitous today and most modern mobile phones have the memory capacity and features necessary for storing and presenting the dictionary content. They therefore present an excellent opportunity for learners of minority languages to have access to a dictionary. The mobile phone dictionary software is currently in its early stages, but we hope to improve it with further work and make it available to people compiling electronic dictionaries for other languages.

I’ll let you know how it all goes.
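To give a flavour of the 'XML master, many outputs' idea mentioned in the abstract, here is a minimal Python sketch that turns a tiny XML dictionary into an HTML definition list. It is not our project's code: the element names (`dictionary`, `entry`, `headword`, `pos`, `sense`, `gloss`) are assumptions made up for the example rather than the real Kaurna schema, and the sample entry is the Wagiman word ngal-gawu-mang from the Online Wagiman Dictionary, standing in because I don't want to invent Kaurna data.

```python
import xml.etree.ElementTree as ET
from html import escape

# A one-entry stand-in for the master dictionary (hypothetical element names).
MASTER_XML = """
<dictionary>
  <entry>
    <headword>ngal-gawu-mang</headword>
    <pos>nominal</pos>
    <sense><gloss>grandmother (mother's mother)</gloss></sense>
    <sense><gloss>grandchild (from a woman to her daughter's children)</gloss></sense>
  </entry>
</dictionary>
"""


def entry_to_html(entry):
    """Render one <entry> element as a <dt>/<dd> pair."""
    headword = escape(entry.findtext("headword", default=""))
    pos = escape(entry.findtext("pos", default=""))
    glosses = "".join(
        f"<li>{escape(gloss.text or '')}</li>"
        for gloss in entry.iterfind("sense/gloss")
    )
    return f"<dt>{headword} <i>{pos}</i></dt><dd><ol>{glosses}</ol></dd>"


def dictionary_to_html(xml_text):
    """Convert the whole master document into a single HTML definition list."""
    root = ET.fromstring(xml_text.strip())
    body = "".join(entry_to_html(entry) for entry in root.iterfind("entry"))
    return f"<dl>{body}</dl>"


html_output = dictionary_to_html(MASTER_XML)
print(html_output)
```

The point of keeping a single master file is that a second function written in the same style could target print layout or a mobile phone format without anyone touching the dictionary data itself.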

  1. You can read all about this project, which began with Kaurna, at a post of mine here, and at James’ post here. James’ post also includes example software for download, in case you want to try any of this out. []

A few posts back, I wrote about a book that David Nash had found on Amazon.com, which appeared to be a bi-directional crossword-puzzle book between English and Wageman [sic1]. It seemed as though these books, and a few others on Amazon on Wageman, contained the very same wordlist collected by a previous researcher and published under copyright at AIATSIS.

This is by no means an isolated incident. Parker has wordlists for around 600 languages stored online, and could potentially create crossword books, dictionaries and thesauri for each of them. See also Peter Austin’s post at Transient Languages and Cultures regarding a similar thing having happened to the Kamilaroi/Gamilaraay dictionary.

Instead of letting this issue slide into the obscurity of my Mabitjbaran, or Archives, I bought a copy of each, English to Wageman and Wageman to English, and have made contact with the ‘author’, Philip M. Parker, to solicit his explanation of what appears to be a blatant violation of copyright restrictions.

First things first though. The books actually appear to be a pretty good educational resource, assuming that the school in Pine Creek is up to the point of recommencing its Wagiman language programs, of which I’ve only ever seen fleeting bits of evidence of ever having taken place2. The books comprise probably hundreds of automatically generated crosswords with the solution words in alphabetical order at the bottom. In spite of the books’ copyright restrictions by their supposed author, I’ve scanned a page of one of these books, which you can view here.

I’ve also done a little more background research on the author of these books, Philip M. Parker, and as it turns out, he’s not at all involved with dictionary compiling, language work or language education. In actual fact, he’s a professor of marketing and a generic entrepreneur at the Singapore campus of an international private business and marketing college based in France, called INSEAD. He even has a biography page on Wikipedia, which is interesting to this topic, as it goes into detail about his book publishing career. Apparently he’s quite famous in the marketing and entrepreneurial world.

His fame derives from the fact that he has developed a process that automatically produces and prints books on demand, with little or no interactive work. Each book that gets printed costs him an estimated 12 pence Sterling. So good is his software apparently that he has authored 85,764 books on sale at Amazon.com.

Parker estimates that it costs him about 12p to write a book, with, perhaps, not much difference in quality from what a competent wordsmith or an MBA might produce.

Nothing but the title need actually exist until somebody orders a copy. At that point, a computer assembles the book’s content and prints up a single copy.

Not much difference in quality from what a competent3 wordsmith might produce? If you check a random selection of some of these books, you’d be forgiven for not seeing what sort of quality he’s referring to:

The 2007-2012 Outlook for Tufted Washable Scatter Rugs, Bathmats, and Sets That Measure 6-Feet by 9-Feet or Smaller in India

Riveting. And that costs US$495.00, in case you were wondering.

What Parker does is harvest data, irrespective of what sort of data it is, and churn out books with it. It doesn’t matter if no one’s interested in the statistical prognostications for the Indian mid-sized bathmat industry, because each book is printed if and only if someone actually orders it; a copy may never actually exist. But considering there are libraries around the world that will buy a copy of each and every publication under the sun, Parker is probably earning a lot of money.

As I mentioned at the start, I’ve made contact with Parker and courteously attempted to solicit some information, such as which wordlist he used, and whether there were any copyright protections on that data. This is the response I got back:

Thank you for your concern; there are no copyright violations. Please feel free to copy my puzzles for your teaching4.

p.s. translations of words, themselves, cannot hold copyright, only the format in which they are presented (translations of single words are public knowledge; translations of creative works are not). I will later be doing anagrams, poems, rhyming sections, etc.. java-based web games (free to use), etc.

I felt a little confused by this response; I’m not very knowledgeable about copyright law and would have expected that someone’s research and work would be protected under copyright. At the same time though, I’m sure that Parker has done his legal research and knows full well what he can and cannot do. Peter Austin has a legal advantage over me in this respect; his Gamilaraay dictionary included some reconstructions:

It is not possible to copyright common knowledge such as words and meanings. Unfortunately for Parker, some of the quoted forms, like muRumuRu on page 11 are creative works since they are reconstitutions which I have posited on the basis of 19th century published and unpublished amateur recordings (as explained in the preface of my dictionaries — note that the orthographic R is not a Gamilaraay sound but a cover term for where I could not determine whether the source represented a flap rr or a continuant r). Now that is copying of creative work without attribution, in my view.

It may turn out to be a little more difficult to demonstrate some ‘creative work’ with the Wagiman dictionary, and we may just have to accept that legally, this sort of blatant plagiarism will be allowed to continue.

Let my warning be this: If you find a book written by Philip M. Parker that looks interesting, avoid it; you can probably find the content online for free.

  1. We spell it Wagiman these days. Wageman was the spelling adopted by earlier researchers, Ethnologue and AIATSIS. Phonetically speaking, I couldn’t judge either way. For ease of fact-checking, I’ll retain the spelling used in the books. []
  2. Perhaps Wamut could help me out here. []
  3. Notice also that he implies here that he is an incompetent wordsmith. []
  4. I take my blog to be ‘teaching’, thereby indemnifying myself against the apparent copyright violation of my publishing of a scan of one of his crosswords []

Over the weekend, David Nash drew my attention to a book that he found on Amazon, that purported to contain bilingual crossword puzzles in English and Wageman1.

I was a bit perplexed by this, since, well, Wagiman doesn’t have much in the way of practical applications such as second-language learning, that is, of course, beyond the community of Wagiman people. It should be noted at this point though, that this book is not being marketed towards the small community of non-Wagiman speaking Wagiman people, but to a North American audience.

The book is published by a mob called Webster’s Online Dictionary, who I take to have no connection whatsoever to Merriam-Webster, given the look of their respective websites. Theirs appears to contain wordlists of hundreds and hundreds of languages, many of them minority languages, and it seems some of them have been converted to print, albeit in the bizarre form of bidirectional crossword puzzle books.

Here is the product description, as supplied by Amazon, and likely supplied by Philip M. Parker, the person behind Webster’s Online Dictionary:

Webster’s Crossword Puzzles are edited for three audiences. The first audience consists of students who are actively building their vocabularies in either Wageman or English in order to take foreign service, translation certification, Advanced Placement® (AP®) or similar examinations. By enjoying crossword puzzles, the reader can enrich their vocabulary in anticipation of an examination in either Wageman or English.

A translation certificate, Advanced Placement certificate, in Wagiman? Really?

The second includes Wageman-speaking students enrolled in an English Language Program (ELP), an English as a Foreign Language (EFL) program, an English as a Second Language Program (ESL), or in a TOEFL® or TOEIC® preparation program. The third audience includes English-speaking students enrolled in bilingual education programs or Wageman speakers enrolled in English speaking schools.

EFL, ESL, TOEFL or TOEIC programs being run anywhere near Wagiman country? Really?

However, I can see in this book a benefit for some eventual teaching of Wagiman language in the local school, to help increase literacy in Wagiman, but unfortunately, the book uses an outdated orthography and may actually undermine increased Wagiman literacy efforts.

I wouldn’t want to financially support someone who – it appears – has taken a wordlist published in the public domain2 and has created something proprietary, like a book, with the goal of profit in mind, but I think I might still have to have a Wagiman-English crossword puzzle book on my shelf, just for the fun of it.

  1. Wageman was one of the variant spellings. Others include Wakiman (Cook, Austin) and Wogeman (Tryon). []
  2. I find it ironic, furthermore, that while the original wordlist was a public domain web-publication, Webster’s Online Dictionary prohibits automatic harvesting of any of their data. I doubt that they copy-pasted each and every entry from the wordlist. []
