Technology


A few weeks ago I mentioned that a bunch of us at Sydney Uni had submitted an abstract for a conference presentation of the Kaurna electronic dictionary.

Just recently, we received the news that our abstract has been accepted. So, if you’re planning on coming along to Australex ’08 at the Victoria University of Wellington in November and you’d like to see the public unveiling of our Kirrkirr and mobile phone dictionaries, then by all means look out for us – by which I mean me.

As it’s been about a month since my last post, it’s probably about time I posted something at least to ensure that this site doesn’t get referred to as a ‘dead blog’. To make matters worse, not only have I not been posting, I’ve also been neglecting my reciprocal blogger duties of reading other people’s work, which I hope is a good indicator of how busy I’ve been. Reading through the myriad of blogs in my feed reader is normally one of my most favoured activities.

So what is my excuse then?

The same old story really: work. But this time the various jobs are a little different. Besides my regular duties as audio engineer at Paradisec and my unrelenting duties as tutor of first-year linguistics, I have been preparing a grant application with a colleague to continue our work developing electronic dictionaries of minority languages, including dictionaries available as Java applications on your mobile phone¹.

We have also been preparing several papers, conference talks, seminars and so on to detail our project and our process of producing visually-rich multimedia electronic dictionaries from basic wordlists. There are a couple of conferences later in the year that this sort of thing would be perfect for, but we also plan to get a paper sent off to some prestigious lexicography journal somewhere.

As a teaser, here’s an abstract that we sent off to one such conference earlier this month:

Kaurna is the indigenous Australian language of Adelaide and the Adelaide Plains. It has not been actively used since 1929, when the last native speaker died. More recently, efforts have been undertaken to restore Kaurna to a state of community use. One recent project involved the creation of an electronic Kaurna dictionary carried out by a team at the University of Sydney during the first half of 2008. As this was a community-driven project, it had certain requirements, such as the need to archivally preserve the two main documentary sources of Kaurna: a book published in 1840, and a hand-written manuscript from 1857.

In an effort to maximise flexibility, portability and transparency, the Kaurna dictionary project opted for an XML formatted master dictionary that could then be converted to other formats, such as an HTML web-page, or even a printed dictionary. The current means of presentation is through Kirrkirr, a multimedia-rich dictionary visualisation tool.

In this project we also developed software for presenting the dictionary on mobile phones. Mobile phones are almost ubiquitous today and most modern mobile phones have the memory capacity and features necessary for storing and presenting the dictionary content. They therefore present an excellent opportunity for learners of minority languages to have access to a dictionary. The mobile phone dictionary software is currently in its early stages, but we hope to improve it with further work and make it available to people compiling electronic dictionaries for other languages.

I’ll let you know how it all goes.

  1. You can read all about this project, which began with Kaurna, at a post of mine here, and at James’ post here. James’ post also includes example software for download, in case you want to try any of this out.

A few posts back, I wrote about a book that David Nash had found on Amazon.com, which appeared to be a bi-directional crossword-puzzle book between English and Wageman [sic]¹. It seemed as though these books, and a few others on Amazon on Wageman, contained the very same wordlist collected by a previous researcher and published under copyright at AIATSIS.

This is by no means an isolated incident. Parker has wordlists for around 600 languages stored online, and could potentially create crossword books, dictionaries and thesauri for each of them. See also Peter Austin’s post at Transient Languages and Cultures regarding a similar thing having happened to the Kamilaroi/Gamilaraay dictionary.

Instead of letting this issue slide into the obscurity of my Mabitjbaran, or Archives, I bought a copy of each, English to Wageman and Wageman to English, and have made contact with the ‘author’, Philip M. Parker, to solicit his explanation of what appears to be a blatant violation of copyright restrictions.

First things first though. The books actually appear to be a pretty good educational resource, assuming that the school in Pine Creek is up to the point of recommencing its Wagiman language programs, which I’ve only ever seen fleeting evidence of having taken place². The books comprise probably hundreds of automatically generated crosswords with the solution words in alphabetical order at the bottom. In spite of the copyright restrictions asserted by their supposed author, I’ve scanned a page of one of these books, which you can view here.

I’ve also done a little more background research on the author of these books, Philip M. Parker, and as it turns out, he’s not at all involved with dictionary compiling, language work or language education. In actual fact, he’s a professor of marketing and a generic entrepreneur at the Singapore campus of an international private business and marketing college based in France, called INSEAD. He even has a biography page on Wikipedia, which is interesting to this topic, as it goes into detail about his book publishing career. Apparently he’s quite famous in the marketing and entrepreneurial world.

His fame derives from the fact that he has developed a process that automatically produces and prints books on demand, with little or no interactive work. Each book that gets printed costs him an estimated 12 pence Sterling. So good is his software apparently that he has authored 85,764 books on sale at Amazon.com.

Parker estimates that it costs him about 12p to write a book, with, perhaps, not much difference in quality from what a competent wordsmith or an MBA might produce.

Nothing but the title need actually exist until somebody orders a copy. At that point, a computer assembles the book’s content and prints up a single copy.

Not much difference in quality from what a competent³ wordsmith might produce? If you check a random selection of some of these books, you’d be forgiven for not seeing what sort of quality he’s referring to:

The 2007-2012 Outlook for Tufted Washable Scatter Rugs, Bathmats, and Sets That Measure 6-Feet by 9-Feet or Smaller in India

Riveting. And that costs US$495.00, in case you were wondering.

What Parker does is harvest data, irrespective of what sort of data it is, and churn out books with it. It doesn’t matter if no one’s interested in the statistical prognostications for the Indian mid-sized bathmat industry, because each book is printed if and only if someone actually orders it; a copy may never actually exist. But considering there are libraries around the world that will buy a copy of each and every publication under the sun, Parker is probably earning a lot of money.

As I mentioned at the start, I’ve made contact with Parker and courteously attempted to solicit some information, such as which wordlist he used, and whether there were any copyright protections on that data. This is the response I got back:

Thank you for your concern; there are no copyright violations. Please feel free to copy my puzzles for your teaching⁴.

p.s. translations of words, themselves, cannot hold copyright, only the format in which they are presented (translations of single words are public knowledge; translations of creative works are not). I will later be doing anagrams, poems, rhyming sections, etc.. java-based web games (free to use), etc.

I felt a little confused by this response; I’m not very knowledgeable about copyright law and would have expected that someone’s research and work would be protected under copyright. At the same time though, I’m sure that Parker has done his legal research and knows full well what he can and cannot do. Peter Austin has a legal advantage over me in this respect; his Gamilaraay dictionary included some reconstructions:

It is not possible to copyright common knowledge such as words and meanings. Unfortunately for Parker, some of the quoted forms, like muRumuRu on page 11 are creative works since they are reconstitutions which I have posited on the basis of 19th century published and unpublished amateur recordings (as explained in the preface of my dictionaries — note that the orthographic R is not a Gamilaraay sound but a cover term for where I could not determine whether the source represented a flap rr or a continuant r). Now that is copying of creative work without attribution, in my view.

It may turn out to be a little more difficult to demonstrate some ‘creative work’ with the Wagiman dictionary, and we may just have to accept that legally, this sort of blatant plagiarism will be allowed to continue.

Let my warning be this: If you find a book written by Philip M. Parker that looks interesting, avoid it; you can probably find the content online for free.

  1. We spell it Wagiman these days. Wageman was the spelling adopted by earlier researchers, Ethnologue and AIATSIS. Phonetically speaking, I couldn’t judge either way. For ease of fact-checking, I’ll retain the spelling used in the books.
  2. Perhaps Wamut could help me out here.
  3. Notice also that he implies here that he is an incompetent wordsmith.
  4. I take my blog to be ‘teaching’, thereby indemnifying myself against the apparent copyright violation of my publishing of a scan of one of his crosswords.

I occasionally find myself amused to see in my blog stats that someone has translated my blog into another language. Being so inquisitive, I often follow their lead.

Yesterday morning, I noticed that one of the referring pages was a Google translation of this post into Korean. Naturally, I had a look to see what my blog would look like written in Hangul. As you might expect, it looks really cool, except that I kept noticing a telephone number, the same telephone number, all the way through. Here’s what it looks like:

케빈 Rudd 전화 +852 2907 2112, 자신의 게시물 – 사과 연설에서 거듭 사과를하는 이유는 원상 회복 과정에 필요한는 그들을 처음으로보고 이후에 화해와 일반적으로합니다.

Strangely, each and every time this telephone number appears, it is preceded by the characters 전화, which, according to a Korean-reading friend of mine, means phone, and the whole thing is immediately preceded by Rudd. Looking at the corresponding English of each line (it pops up when you scroll over a line of Hangul), it appears that the phone number is purely being inserted and has no corresponding constituent in the English.

To put this another way, the string of letters Rudd in English, becomes Rudd 전화 +852 2907 2112 in Hangul.

In an attempt to track this a little further to its source, I typed “Rudd” into Google’s translation page, and sure enough, the phone number emerges. This tells me that it’s an artefact of Google translator, and not some mysterious subliminal message that I’ve subconsciously coded into my blog for the sole benefit of Korean readers.

I’m a little discombobulated¹ by this, so if you know anything more about this oddity, or could even posit an explanation, I’d love to hear it.

Someone might even like to put their neck on the line and ring the number…

  1. I’ve always wanted to use that word.

I’ve been a bit neglectful of this blog lately, and yes, I know I say that at the beginning of just about every post these days, but unfortunately it’s even more true now than ever.

The main reason I’m so busy is that I’ve been helping out in massaging and sanitising data for an electronic dictionary of Kaurna, the language traditionally associated with Tandanya and much of the surrounding region¹. The language officially became ‘extinct’ almost a hundred years ago, but on the basis of two dictionaries written in the mid 19th century, linguistic revival efforts are having some huge success. Places in and around Tandanya have taken on alternative Kaurna names, you can learn Kaurna through all levels of education and you can even study Kaurna linguistics at a tertiary level. Not bad for a ‘dead’ language.

The dictionary I’m working on is just the latest instance of this revival effort. We’ve taken those two dictionaries from the mid 19th century and, after they’d been meticulously and painstakingly transcribed into text files and converted into toolbox-readable backslash-coded files, massaged out the inconsistencies. Our job has been to convert these into XML files, combine the two dictionaries into a single dictionary file and import it into Kirrkirr, an interactive dictionary application.
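For anyone wondering what the backslash-coded-to-XML step looks like in practice, here is a minimal sketch in Python. It is not the project's actual converter. The markers (`\lx`, `\ps`, `\ge`) follow the common Toolbox convention for headword, part of speech and English gloss, but whether the Kaurna files use them is an assumption, and the XML element names are made up for illustration rather than taken from Kirrkirr's real schema. The sample records are placeholders, not Kaurna words.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker-to-element mapping; the project's real markers
# and Kirrkirr's real schema are not given in this post.
MARKER_TO_TAG = {"lx": "headword", "ps": "pos", "ge": "gloss"}


def toolbox_to_xml(text):
    """Convert backslash-coded records into a <dictionary> element.

    A new <entry> starts at every \\lx line; other known markers become
    child elements of the current entry. Unknown markers are skipped.
    """
    root = ET.Element("dictionary")
    entry = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue  # blank lines and stray text carry no marker
        marker, _, value = line[1:].partition(" ")
        if marker == "lx":
            entry = ET.SubElement(root, "entry")
        tag = MARKER_TO_TAG.get(marker)
        if entry is not None and tag is not None:
            ET.SubElement(entry, tag).text = value.strip()
    return root


# Placeholder records, purely for illustration.
SAMPLE = r"""
\lx exampleword
\ps n
\ge placeholder gloss

\lx otherword
\ps v
\ge another placeholder
"""

if __name__ == "__main__":
    print(ET.tostring(toolbox_to_xml(SAMPLE), encoding="unicode"))
```

Combining two source dictionaries would need more than this (some way of recording which source each field came from, for a start), but the basic record-by-record walk would stay the same.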

The final product won’t just be a cool, usable electronic dictionary, it’ll also be a faithful representation of the original two works, as everything will have been preserved and will be immediately viewable just by switching from one version to another. Even Teichelmann’s original spelling mistakes have been preserved. The user will be able to toggle between the original and a modern version with spelling errors corrected.

We also have a couple of other applications of this dictionary that we think will be useful for similar dictionary projects for endangered languages, especially in remote communities. But since I don’t want to spoil the fun of the announcement, I’m not going to say any more.

Anyway, without getting too distracted, I wanted to share this little bit from the inside cover of the manuscript of the dictionary, written in 1857.

THE ADELAIDE LANGUAGE.

The tribe who used to speak this language has, according to Mr. Teichelmann,* now ceased to be.

*Mr Teichelmann writes thus:–

“Salem on the Bremer, Callington, January 18th, 1858.

“Sir,–According to your wish, I have copied and translated into English, my collection of words and grammatical remarks on the language of the Aborigines who once inhabited the district around Adelaide; for they have disappeared to a very few.
[…]
Also, I do not entirely approve of the orthography of the native language, as we have spelled it, but it is useless now to alter any thing in it after the tribe has ceased to be.”

In retrospect, we’re seriously lucky that Teichelmann didn’t pack it in as soon as he realised that the tribe would soon have ‘ceased to be’, or we wouldn’t have such a detailed historical dictionary of the language upon which to base revival efforts. A lesson perhaps for all those people who question the motives of linguists who work in highly endangered languages.

I also found it interesting that in this passage, the person who wrote the tagline the tribe who used to speak this language has ceased to be, has evidently misunderstood Teichelmann’s intended meaning. He clearly meant when the remaining few people who speak this language (and thereby the language too) cease to be, (then there will be little need for a more useful orthography).

If you’re going to the Australian Languages Workshop, which this year is being held at Kioloa, an outpost of ANU, then you’ll be able to witness a full demonstration of this multi-tiered, quasi-archival dictionary by one of my colleagues.

So that’s an example of what’s been keeping me from regular blogging. There are plenty of other examples, of course, but they involve dropping whatever semblance of anonymity I delude myself into thinking I can hold on to.

  1. The Kaurna Dictionary project is made possible through the support of Kaurna Warra Pintyandi, a community based Kaurna language organisation.

Long term readers of this blog would probably know that I occasionally like to mess around with Google Earth and to try out new things to do with languages and so forth. It began with an exercise in mapping some known and established place names in the Sydney Metropolitan Area, mostly concentrated in and around the Harbour, and then it moved on to a small project of mine to map the region of the Northern Territory with which Wagiman is traditionally associated¹.

Another project I began, and finished, a while ago, was to take the divided segments of the AIATSIS map of Australia’s Indigenous languages, and overlay them as images onto Google’s Earth. When I say ‘finished’, what I mean is, I’d posted it to the Google Earth community as a downloadable file, but I didn’t know that I’d screwed it up and made the images too transparent to see the language boundaries clearly.

Just the other day though, Jungurra expressed some interest in using it for the Australian Languages course that he’ll be teaching from next week, which prompted me to go and fix it up and make all the images fully opaque. So now, the whole thing can be made transparent so that the images don’t necessarily block the satellite images beneath. The new file can be found here.
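For the curious, Google Earth reads image overlays from KML files, where an overlay's opacity is the alpha byte of a colour value written in aabbggrr order. Below is a minimal sketch in Python of how one such overlay could be written out. It is not how the file above was made. The function name, the image filename and the bounding-box numbers are all made up for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def ground_overlay(name, image_href, north, south, east, west, opacity=1.0):
    """Return a KML document (as a string) holding one GroundOverlay.

    opacity runs from 0.0 (invisible) to 1.0 (fully opaque) and is written
    into the alpha byte of KML's aabbggrr colour value.
    """
    alpha = round(max(0.0, min(1.0, opacity)) * 255)
    kml = ET.Element("kml", xmlns=KML_NS)
    overlay = ET.SubElement(kml, "GroundOverlay")
    ET.SubElement(overlay, "name").text = name
    ET.SubElement(overlay, "color").text = f"{alpha:02x}ffffff"
    icon = ET.SubElement(overlay, "Icon")
    ET.SubElement(icon, "href").text = image_href
    box = ET.SubElement(overlay, "LatLonBox")
    for tag, value in (("north", north), ("south", south),
                       ("east", east), ("west", west)):
        ET.SubElement(box, tag).text = str(value)
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical map segment and bounding box, for illustration only.
    print(ground_overlay("Map segment", "map_segment.png",
                         north=-10.0, south=-20.0, east=140.0, west=130.0))
```

Writing the overlay as fully opaque leaves the choice to the viewer, who can fade it with Google Earth's own transparency control.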

Preparing this made me realise just how much of a problem the curvature of the Earth actually is. The further south you get, the more the images have to be contorted into place, and therefore the larger the discrepancy in location at some points. Some of the maps are displaced by anything up to about a hundred kilometres.
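One likely contributor, though probably not the whole story, is that a degree of longitude covers less ground the further you get from the equator. An image stretched over a plain latitude/longitude box therefore fits less and less well towards the south. Here is a rough sketch of that shrinkage, assuming a perfectly spherical Earth with a mean radius of 6371 km; the three latitudes are arbitrary sample points.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical approximation


def km_per_degree_longitude(latitude_deg):
    """Ground distance covered by one degree of longitude at a latitude."""
    return (math.radians(1.0) * EARTH_RADIUS_KM
            * math.cos(math.radians(latitude_deg)))


if __name__ == "__main__":
    # Arbitrary sample latitudes running north to south.
    for label, lat in (("10°S", -10.0), ("25°S", -25.0), ("40°S", -40.0)):
        print(f"{label}: {km_per_degree_longitude(lat):.1f} km per degree")
```

The projection of the original map images matters too, and this sketch says nothing about that.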

I don’t know how receptive AIATSIS are to this sort of new-fangled technology, but I think it’s something that they, even in collaboration with Google, could and should think about, and eventually produce a Google Maps or Google Earth package of files that show languages and language boundaries. I envisage a situation where the language names and boundaries are treated as place names and borders like any others, and not as images that become blurred the further in you zoom.

At the end of the day, this is a bit of fun, but perhaps there are practical applications to such widespread popular things like Google Earth such that linguists, and others, can put them to (more) good educational use.

~

<update>
Here’s a screenshot, which I wasn’t able to do earlier. This is with the opacity of the overlaid AIATSIS map images turned quite far down; otherwise you’d just be looking at the overlay, and it wouldn’t be very interesting. You can also see here how imperfect the fitting together of the original segments is, as there’s quite a lot of overlap, and boundaries that don’t quite match. But you know, I did the best I could. Click on the image for the larger size.

screenshot

You can even see Wagiman in the middle there.
</update>


¹As opposed to ‘where Wagiman is spoken’, for clear sociolinguistic reasons.

Last night on The Cutting Edge, a documentary entitled The Nuclear Comeback investigated the nuclear power option with respect to its costs, its benefits in terms of lowered carbon emissions, its safety, especially with terrorist attacks – infrequent as they are – at the forefront of everyone’s minds, and long-term effects such as waste storage. It was immensely interesting and I hope SBS publishes transcripts, videos possibly, and other information about it.

I wasn’t studiously taking notes unfortunately, but some of the facts and figures in the documentary, most of which came from the nuclear industry itself, are too amazing to forget. Here is a sample:

  • Australia currently derives 80% of its energy from coal, rendering Australians the highest per-capita emitters of carbon.

  • A minimum of 6, but a possibility of 14 nuclear power stations are planned for Australia.

  • 14 such stations together would produce only a projected 20% of Australia’s energy demands (presumably those demands are measured against our current consumption).

  • A power station in the UK (I can’t recall which, nor exactly where) employs more people in decommissioning than it ever did during its active life.

  • This power station produced energy for 47 years, yet it will take an estimated 120 years to decommission, which will cost an estimated one billion pounds.

  • Currently no high-level waste repositories, those needed for storage of spent fuel, exist in the world.

  • Several facilities exist that store low-to-medium-level waste, including workers’ clothing, instruments and tools (Incidentally, this is the sort of facility that seemed as though it was being forced upon the Yapa Yapa people of Muckaty Station).

  • Spent fuel takes 75 odd years to become exhausted of its residual heat energy. It must become exhausted of this heat energy before it can be stored in a high-level waste repository.

  • The fuel then takes an estimated 100,000 years before it’s deemed ‘low-to-medium-level’ and is able to be stored safely, that is, kilometres underground.

We want to build behemoth facilities that produce energy for a mere few decades but require over a hundred thousand years of management after that? How foresighted are we?

Apart from all this, the program looked at Chernobyl. Yes, it was a tragic accident that was probably the indirect result of poor Soviet management, and such an accident is now ‘entirely avoidable’, but it still provides a didactic demonstration of the monumental long-term effects when something does go wrong. Besides, there’s no guarantee that something else won’t go wrong. In 2006, in fact, the Forsmark nuclear power plant in Sweden came perilously close to meltdown when backup diesel generators failed to run as expected. According to some, luck alone prevented a meltdown.

There remains an exclusion zone with a radius of 30 kilometres that surrounds Chernobyl, within which no one is allowed to live. Reactor 4, the one that exploded, is still producing radioactive material and is housed in a gigantic concrete sarcophagus, built mostly by remote controlled robots, to contain this material. Despite interminable repairs to the sarcophagus, it continues to deteriorate. If this sarcophagus happens to collapse, it may cause another cloud of radioactive dust to be released into the atmosphere. The last such cloud spread over much of the European continent, and most fell on Belarus.

The Chernobyl nuclear power station now produces no power and instead consumes huge amounts in maintenance and repairs, and several teams are employed to supervise the entire plant around the clock. This maintenance will necessarily continue for hundreds of thousands of years until the radiation decays to acceptable levels.

Honestly, nuclear power is insanity.

In blogging, if you haven’t posted for a week, there’s a slim chance someone might consider you defunct. If you were a word, the OED might feel inclined to put an innocent looking (Arch.) next to you, or worse, (Obs.).

I feel then, that I should post something to keep the bloggospheric undertaker at bay and, quite fortuitously, there’s lots going on to discuss.

During the week, Australia observed another milestone in the gradual struggle for equality of Aboriginal people, when Marion Scrymgour became the first Aboriginal person (Tiwi) to lead a state or territory. As Deputy Chief Minister, Scrymgour became acting Chief Minister when Paul Henderson, who succeeded Clare Martin, took a holiday.

On the federal side of the political toast however, Rudd and his ‘team’ have been disappointing in just about every respect, right after a strong start last month. Gillard’s warning of legal action to halt industrial action strikes me as odd, given that it comes from a party whose existence is (or should be) based on employees’ rights.

Macklin’s assertion that there won’t be any compensation for those indigenous children who were taken from their families – even though she meant in virtue of the apology alone – clearly sends the wrong signal and has drawn criticism from Aboriginal people.

Also, Swan has gone further down that well-trodden path that Keating macheted back in the 80s, and that Howard, complete with green ‘n gold track-suit, continued to tread up until November 23 last year. I’m talking about the further politicisation of the economy (or the further economisation of politics, whichever you prefer) by criticising independent banks’ decisions to raise interest rates irrespective of the Reserve Bank’s official cash rate.

But not everything has been politics. I’ve been busy lamenting the loss of our Transient Building at the University of Sydney, a temporary structure built after World War II, almost entirely of fibrous asbestos. Its recent renovations (to the interior only; the cost of insurance renders any work to the asbestos exterior fiscally indefensible) mean that the building is well and truly permanent. It has now become the Intransigent Building.

I’ve also been busy applying for grants here and there (well, really only here), and doing some peripheral work for a dictionary project in an Australian language, all of which has kept me too busy to take on silly projects of my own.

Which reminds me, I’ve just taken on a silly project of my own; I’ve recently acquired an antique Terra Cresta table-top arcade game, not at all dissimilar to the one pictured beneath, which I intend to restore with all the bits and pieces from a not-so-old laptop, an arcade-game emulator, and Linux.

Finally, congratulations to Jane, whose post from July last year, Gunboat Lip-Gloss, was announced last week as one of On Line Opinion’s best blog posts of 2007. Readers of On Line Opinion are invited to comment, and several people have taken the opportunity to lambaste Jane with some old chestnuts, the least surprising of them being anti-intellectualism:

Is that what they teach at Universities these days? It would have been a “Fail” on any paper to misrepresent sources like that when I went to uni.

Gobsmacked

I shall return to regular-ish blogging soon!

Earlier on this afternoon, I heard a cricket commentator, having heard about someone whose name he didn’t immediately recall, promise that he’d google him up. This would not be a natural usage for me, although it’s unequivocally clear what he means; it’s completely synonymous with (in my view) the more natural version to google someone, i.e. to search for them on Google.

Anyway, I started wondering how common the construction google up is, so I went and googled it… up, and here’s a breakdown of the returned hits on all permutations:

        google X    google X up
him      106,000          7,510
her       92,900          3,490
them     151,000         11,300
it     5,600,000        513,000

Roughly speaking then, the non-phrasal variety google someone is far more common, but the phrasal variety google someone up has some substantial corpus.google.com¹ representation, somewhere between a twenty-fifth and a tenth as much as the former, depending on the pronoun.
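If you'd rather check the proportions than take my word for it, here is a small Python sketch that divides each phrasal count by its non-phrasal counterpart. The numbers are copied straight from the table above; the names `HITS` and `phrasal_share` are just labels made up for this example.

```python
# Hit counts copied from the table above: (google X, google X up)
HITS = {
    "him": (106_000, 7_510),
    "her": (92_900, 3_490),
    "them": (151_000, 11_300),
    "it": (5_600_000, 513_000),
}


def phrasal_share(plain, phrasal):
    """Phrasal hits as a fraction of non-phrasal hits."""
    return phrasal / plain


if __name__ == "__main__":
    for pronoun, (plain, phrasal) in HITS.items():
        print(f"{pronoun:>4}: {phrasal_share(plain, phrasal):.1%}")
```

Bear in mind that search-engine hit counts are rough estimates, so the ratios are only indicative.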

I didn’t really have anything more to say about it, apart from pointing out that the phrasal verb google up is probably to be expected to occur on the basis of analogy from look up, as in I’ll just go and look them up (in the phonebook). Although curiously, a family member thought it meant to contact via Google rather than to merely find their details, as if on analogy from ring up.

In an age where Google has pretty much usurped phonebooks (of all colours), street directories, atlases, library catalogue cards, encyclopædiae and just about any other source of information, it may as well replace the linguistic idioms associated with them as well.

Holy crap! If someone googled up “Google”, do you think this post would be somewhere near the top?

~

¹No, corpus.google.com doesn’t exist as a URL; it’s just what I use to refer to the act of doing a quick Google search on a phrase using wildcards and quotation marks to back up one’s largely made up postulations about trends in modern English. Think of it as a snowclone, of the x.google.com template.

I just thought I’d point it out, since last time I used the term corpus.google.com, someone (let’s just call him Mr Nash – no wait, that’s too obvious; I’ll call him David N) wondered why they couldn’t find the corpus.google.com homepage.

Bill Poser, of Language Log fame, has recently developed a small piece of software that calculates how much data space you need to fit whatever format, compression or length of audio file you have.

I understand it’s probably not going to be the most highly sought-after program this Saturnalia, but I, for one, have a legitimate need for it in my role as audio engineer for an archive. Having to figure out how much space a file will consume, or calculating the sample rate of a partially corrupted file using long-hand methods, can quickly become tedious, and I’m glad I now have a calculator to do this for me.

The program’s called AudioSpace and has been released under the GNU General Public License, meaning it’s available for free. While you’re there though, check out Bill’s other software. There’s a regular expressions development tool, a possible word generator, an IPA standardiser and plenty more programs that would be highly useful to computational linguists, programmers, and geeks in general.
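To give a flavour of the long-hand arithmetic involved, here is a minimal Python sketch for uncompressed PCM audio. It is not AudioSpace's code and covers far less: it ignores file headers, metadata and any compression. The function names are made up for this example.

```python
def pcm_size_bytes(duration_s, sample_rate_hz, bit_depth, channels):
    """Bytes of raw sample data in uncompressed PCM audio.

    Every second holds sample_rate_hz samples per channel, and every
    sample takes bit_depth / 8 bytes.
    """
    return int(duration_s * sample_rate_hz * (bit_depth / 8) * channels)


def sample_rate_hz(size_bytes, duration_s, bit_depth, channels):
    """Invert the formula above to recover the sample rate."""
    return size_bytes / (duration_s * (bit_depth / 8) * channels)


if __name__ == "__main__":
    # One hour of stereo audio at 96 kHz and 24 bits per sample.
    size = pcm_size_bytes(3600, 96_000, 24, 2)
    print(f"{size:,} bytes, or about {size / 1e9:.2f} GB")
```

The second function is the same formula turned around, which is roughly what you end up doing by hand when a file's header is damaged but its length and size are known.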
