Applications for N-Grams
This page is based uppon research on Language recognition based on
n-Grams. One of the theoretical foundations of this is the paper
"N-GRAM-Based Text Cathegorization" by Willian B.Canvar and John M.Trenkle
Based on this article there is the
text_cat Perl script by Gertjan van Noord
which is able to fairly reliable guess the right out
off 70 natural languages when presented with a file
containing text in that language.
Recently I did a quick and dirty port of this programm to
Java. (Actually this is not really a port since it lacks some
features and includes some others more, but it uses the same
basic principle).
The most important points of the algorithm
- it uses very low level information about a language
- it does not need large dictionaries of words
- it needs about 300 to 400 N-Grams of letters in those languages,
this data can easily be generated autatically from some sample in
that particular language.
- An N-gram is a tuple of letters (or in case of text_cat bytes),
e.g. the name "otto" contains the 2 1-grams "o" and "t", 3
2-grams "ot", "tt", "to" and the 3-grams "ott" and "tto" and of
course is itself a 4-gram.
Based on that algorithm I currently have two applications:
Application 1: Guessing a language
The Java program is capable to do the same job as text_cat, i.e. to
guess text language from a sample.
I'm sorry I haven't prepared that as an online application yet.
But it actually works the same way as it works with above
linked
text_cat stuff.
1
|
2 abc |
3 def |
4 ghi |
5 jkl |
6 mno |
7 pqrs |
8 tuv |
9 wxy |
|
0 _-. |
|
|
|
Application 2: PHONER, Generates Memorizable phonenumbers
Well, what is common in the US, to buy a phone number which can be
typed as a short piece of text. This is also possible in Germany now.
But what can you do with your old numbers? The answer is: Use PHONER,
phoner takes the N-Gram information and your number and tries to
find a word which is easier to memorize. You can test that algorithm
with below form:
Remarks:
- Non numbers are taken to the result verbatim.
- The algorithm also tries to leave the digit in the number.
- The digits 0 and 1 give the algorithm not much usable choice.
- The current implementation is unoptimized and therefore slow! But there
is also a fair amount of work to do, e.g. analysing a 5 digit number usually means over 1000 little texts to be analysed, and every additional digit multiplies the number of combinations with at least 4.
- For reasons of CPU time the maximum length of a string is currenly limited.
- That the language resources are not entirely appropriate
for the job yet, since they discriminate between upper and lower
case letters and interpunctation. The resource "Frank Special (frank2.lm)"
is a special pseudo resource I compiled for a mixture of German and English
and only for lower case letters
Caution: Though the algorithm is perfectly capable to help
you memorize your bank cards PIN or so, I do not encourage you to
type that into above unsafe form. It is internet, there could be someone
listening.
More Applications? More Ideas!
-
Modern wireless phones SMS are featuring automatic word guessing
algorithms, to help you speed up your typing, these alorithms
could easily be based uppon the N-Gram algorithm.
- I'm almost sure if you spend more work on the n-grams used,
you could implement things like automatic document classification
(e.g. automatically discriminate between a invoice and a love letter
or between a Shakespeare sonet and a news story). This would be
a point for further research.
- The method suggested by above refernces
for using the n-gram information bases solely
on ranking is very raw and I'm almost sure I have an idea to do s.th.
better. That would be another point for research.
Download and Documentation
Go to the The NGramJ Sourceforge pages.
Contact Frank Sven Nestel.
Other experiments on this site
Franks search engine.
Back to Franks Playground,
to Spieleck-Entry-Page
or go to Doris&Frank's site.