# Calculate Unique Value for any English Word

We are trying to develop an algorithm to calculate a unique numeric value for any English word. Example: School = 675237652376523 etc...

##### Share on other sites

(IMO) I can't see the number that represents the word being anything other than a code. For example you couldn't add two numbers together to produce a number which represents another meaningful word.

If just producing a code would be sufficient, then it is quite simple.

Let the first 2 digits (with a value of 01 - 52) represent the first character (use 1 - 26 for lowercase letters and 27 - 52 for capitals).

Repeat for subsequent characters using 2 digits per character.

Example:- School would be represented by 450308151512.

I think you may well be looking for something much deeper than this.
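The two-digits-per-character scheme described above can be sketched in Python roughly as follows. The function name `two_digit_code` is made up for illustration, and the result is kept as a string so that a leading zero (such as 01 for "a") is not lost.

```python
import string


def two_digit_code(word: str) -> str:
    """Encode each letter as two decimal digits: a-z -> 01-26, A-Z -> 27-52."""
    parts = []
    for ch in word:
        if ch in string.ascii_lowercase:
            value = string.ascii_lowercase.index(ch) + 1
        elif ch in string.ascii_uppercase:
            value = string.ascii_uppercase.index(ch) + 27
        else:
            # The scheme in the post only covers the 52 plain English letters.
            raise ValueError(f"unsupported character: {ch!r}")
        parts.append(f"{value:02d}")
    return "".join(parts)


print(two_digit_code("School"))  # 450308151512
```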

Edited by Joatmon
##### Share on other sites

We are trying to develop an algorithm to calculate a unique numeric value for any English word. Example: School = 675237652376523 etc...

ASCII, encoding characters since 1960...

##### Share on other sites

ASCII, encoding characters since 1960...

The same thought came to me when eating my dinner! If using decimal numbers and using the suggestion given in #3, then 3 digits would be needed per character, only 2 if hex was used. Of course using ASCII would enable punctuation (and even spaces if required).
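As a rough illustration of the digit counts mentioned above, here is a small Python sketch that writes each ASCII code either as three decimal digits or as two hex digits. The function names are invented for this example.

```python
def ascii_decimal(word: str) -> str:
    """Three decimal digits per character (ASCII codes run from 0 to 127)."""
    # encode("ascii") raises UnicodeEncodeError for non-ASCII input.
    return "".join(f"{byte:03d}" for byte in word.encode("ascii"))


def ascii_hex(word: str) -> str:
    """Two hexadecimal digits per character."""
    return "".join(f"{byte:02x}" for byte in word.encode("ascii"))


print(ascii_decimal("School"))  # 083099104111111108
print(ascii_hex("School"))      # 5363686f6f6c
```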

Edited by Joatmon
##### Share on other sites

Is there some reason why normal character encoding such as ASCII or UTF-8 is not appropriate? Why are you encoding the words in the first place? What are the requirements and all that?

##### Share on other sites

Use base 27 or 54 (or other) for the most efficient use of numbers, with the first letter of the word being the least significant "digit". So, in their order, the letters encode as 27⁰, 27¹, 27², 27³, etc. That is, "a..." = 1, "b..." = 2, etc., "y..." = 25, and "z..." = 26; and ".a.." = 27, ".b.." = 54, etc.; "..a." = 729, "..b." = 1458, etc.
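A minimal Python sketch of the base-27 idea, assuming lowercase letters only (a = 1 through z = 26) and the first letter as the least significant digit. The names `base27_encode` and `base27_decode` are hypothetical.

```python
def base27_encode(word: str) -> int:
    """Treat the word as a base-27 number, first letter least significant."""
    value = 0
    for position, ch in enumerate(word.lower()):
        digit = ord(ch) - ord("a") + 1  # a=1 ... z=26, 0 is never used
        if not 1 <= digit <= 26:
            raise ValueError(f"unsupported character: {ch!r}")
        value += digit * 27 ** position
    return value


def base27_decode(value: int) -> str:
    """Invert base27_encode; only valid for numbers produced by it."""
    letters = []
    while value > 0:
        value, digit = divmod(value, 27)
        letters.append(chr(ord("a") + digit - 1))
    return "".join(letters)


code = base27_encode("school")
print(code, base27_decode(code))
```

Because the digit 0 is never used, the word length is recoverable from the number and decoding is unambiguous. Python integers are arbitrary precision, so long words are not a problem here; in a fixed 32-bit integer, only words of up to six letters are guaranteed to fit, since 27⁷ already exceeds 2³².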

##### Share on other sites

Example:- School would be represented by 450308151512.

That's a bit wasteful. Depending on who's counting and what is being counted, there are somewhere between half a million and tens of millions of words in the English language. In any case, it's a lot, lot less than 4 billion. In other words, the numbering scheme will easily fit in a 32-bit word. Your value for "school" does not, and "school" is not a particularly long word.
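One way to read this suggestion is to simply number the entries of a word list. The sketch below assumes such a list is available; the five-word `WORDS` list and the function name `build_index` are placeholders, not a real dictionary.

```python
def build_index(words):
    """Give every distinct word a small integer: its rank in sorted order."""
    return {word: number for number, word in enumerate(sorted(set(words)))}


# Placeholder word list; a real one would come from a dictionary file.
WORDS = ["school", "skool", "which", "witch", "read"]
index = build_index(WORDS)

print(index["school"])                 # 1
print(max(index.values()) < 2 ** 32)   # True
```

The numbers are unique and compact, but they carry no information about spelling or sound, so two nearby numbers do not imply similar words.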

##### Share on other sites

Ascii is the new kid on the block.

http://en.wikipedia.org/wiki/Morse_code

You need to account for the times so it needs to be a bit more complicated.

For example, 1 for a dot; 2 for a dash; 3 between letters
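For completeness, the digit scheme suggested here (1 for a dot, 2 for a dash, 3 between letters) could be sketched like this. The table holds the standard International Morse patterns for the 26 letters; the function name `morse_digits` is invented for the example.

```python
MORSE = {
    "a": ".-",   "b": "-...", "c": "-.-.", "d": "-..",  "e": ".",
    "f": "..-.", "g": "--.",  "h": "....", "i": "..",   "j": ".---",
    "k": "-.-",  "l": ".-..", "m": "--",   "n": "-.",   "o": "---",
    "p": ".--.", "q": "--.-", "r": ".-.",  "s": "...",  "t": "-",
    "u": "..-",  "v": "...-", "w": ".--",  "x": "-..-", "y": "-.--",
    "z": "--..",
}
SYMBOL_DIGIT = {".": "1", "-": "2"}


def morse_digits(word: str) -> str:
    """Write a word as digits: 1 = dot, 2 = dash, 3 = gap between letters."""
    letters = []
    for ch in word.lower():  # Morse has no notion of upper or lower case
        pattern = MORSE[ch]  # raises KeyError for anything but a-z
        letters.append("".join(SYMBOL_DIGIT[symbol] for symbol in pattern))
    return "3".join(letters)


print(morse_digits("sos"))  # 11132223111
```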

##### Share on other sites

Ascii is the new kid on the block.

http://en.wikipedia....wiki/Morse_code

You need to account for the times so it needs to be a bit more complicated.

For example, 1 for a dot; 2 for a dash; 3 between letters

Lol.

And yeah, if all you're trying to do is hash some strings the real question would be what language are you programming in?

##### Share on other sites

Ascii is the new kid on the block.

http://en.wikipedia.org/wiki/Morse_code

You need to account for the times so it needs to be a bit more complicated.

For example, 1 for a dot; 2 for a dash; 3 between letters

Yes, but Morse does not distinguish between upper and lower case, so the person John and the john we use in the restroom are the same in Morse.

##### Share on other sites

There are lots of ways to do this.

What is the actual goal?

Some methods might be better than others, depending on the purpose.

##### Share on other sites

That's a bit wasteful. Depending on who's counting and what is being counted, there are somewhere between half a million and tens of millions of words in the English language. In any case, it's a lot, lot less than 4 billion. In other words, the numbering scheme will easily fit in a 32-bit word. Your value for "school" does not, and "school" is not a particularly long word.

Completely agree regarding efficiency. I was just looking to suggest a simple algorithm that would work. Because the OP says he is looking for a "value", I'm not even sure whether he is looking for something deeper than a code (though what else the value of the unique number would be used for is beyond me).

Edited by Joatmon
##### Share on other sites

Thank you all for your responses. Our goal is to create a numeric value for any word so that we can compare the calculated values of two words to get their nearness. We are thinking of using something like bitwise operations to do the comparison. We will certainly not go into techniques like phonetic or fuzzy word matching, due to their slowness.
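A bitwise comparison of two codes could look something like the sketch below, which XORs the values and counts the differing bits (the Hamming distance). The function names and the fixed `width` parameter are assumptions made for the example. Note that the result only says something about word nearness if the encoding itself places similar words on similar bit patterns.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")


def bit_similarity(a: int, b: int, width: int) -> float:
    """Fraction of matching bits, assuming both codes fit in `width` bits."""
    if a < 0 or b < 0 or max(a, b) >= 1 << width:
        raise ValueError("codes must be non-negative and fit in the width")
    return 1.0 - hamming_distance(a, b) / width


print(hamming_distance(0b1011, 0b1001))   # 1
print(bit_similarity(0b1011, 0b1001, 4))  # 0.75
```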

##### Share on other sites

Can you define "nearness"? As it stands, that doesn't really mean much.

##### Share on other sites

Nearness means something like:

The word "school" and the word "Skool" might be 70% near or alike.
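To pin down what a percentage like this could mean, here is one possible definition using `difflib.SequenceMatcher` from the Python standard library. This is a string-similarity ratio on the spelling only, which is an assumption about what "near" means, and it is a form of fuzzy matching rather than a comparison of precomputed numeric codes. The function name `nearness` is made up.

```python
from difflib import SequenceMatcher


def nearness(first: str, second: str) -> float:
    """Similarity in [0, 1] based on matching runs of characters, ignoring case."""
    return SequenceMatcher(None, first.lower(), second.lower()).ratio()


print(round(nearness("school", "Skool"), 2))  # 0.73
```

The ratio is twice the number of matched characters divided by the combined length: "s" and "ool" match, giving 2 × 4 / 11.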

##### Share on other sites

Nearness means something like:

The word "school" and the word "Skool" might be 70% near or alike.

I would consider "skool" a misspelling, rather than a comparative word.

For many words, you must also consider usage. Using your example:

Main Entry: school

Part of Speech: noun

Definition: place, system for educating

Synonyms: academy, alma mater, blackboard, college, department, discipline, establishment, faculty, hall, halls of ivy, institute, institution, schoolhouse, seminary, university

Part of Speech: noun

Definition: body of philosophy on subject

Synonyms: belief, creed, faith, outlook, persuasion, school of thought, stamp, way, way of life

Part of Speech: verb

Definition: teach

Synonyms: advance, coach, control, cultivate, direct, discipline, drill, educate, guide, indoctrinate, inform, instruct, lead, manage, prepare, prime, show, train, tutor, verse

- Source:

##### Share on other sites

Then might I suggest coding your alphabet so as to bind phonetic statements by introducing embedded metadata into your numerical representations. Code a letter to not only represent a symbolic character, but also to associate it with its set of phonetic pronunciations, both as an individual element within a statement and in conjunction with the letters that surround it. So:

$\underbrace{ 01010101 }_\text{symbol} \underbrace{ 01010101 }_{\leftarrow \text{left}} \underbrace{ 01010101 }_{\text{right} \rightarrow} \underbrace{ 01010101 }_\text{mouth shape} \underbrace{ 01010101 }_\text{other}$

I'm not sure of a 'good' way to do this, but I'm sure you'll figure it out if you ask yourself the right questions.
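As a sketch of the layout drawn above, the five 8-bit fields could be packed into a single 40-bit integer with shifts and masks. The field names follow the labels in the diagram; what values go into each field is left open, and the function names are invented for the example.

```python
FIELDS = ("symbol", "left", "right", "mouth_shape", "other")
FIELD_BITS = 8
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack(**fields: int) -> int:
    """Pack the five fields into one integer, `symbol` in the highest bits."""
    value = 0
    for name in FIELDS:
        part = fields[name]
        if not 0 <= part <= FIELD_MASK:
            raise ValueError(f"{name} does not fit in {FIELD_BITS} bits")
        value = (value << FIELD_BITS) | part
    return value


def unpack(value: int) -> dict:
    """Split a packed integer back into its named fields."""
    result = {}
    for name in reversed(FIELDS):
        result[name] = value & FIELD_MASK
        value >>= FIELD_BITS
    return result


code = pack(symbol=0b01010101, left=0b01010101, right=0b01010101,
            mouth_shape=0b01010101, other=0b01010101)
print(hex(code))  # 0x5555555555
```

With this layout, a mask such as `0xFF << 8` would isolate just the mouth-shape field of two codes before comparing them.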

Edited by Xittenn
##### Share on other sites

Then might I suggest coding your alphabet so as to bind phonetic statements by introducing embedded metadata into your numerical representations. Code a letter to not only represent a symbolic character, but also to associate it with its set of phonetic pronunciations, both as an individual element within a statement and in conjunction with the letters that surround it. So:

$\underbrace{ 01010101 }_\text{symbol} \underbrace{ 01010101 }_{\leftarrow \text{left}} \underbrace{ 01010101 }_{\text{right} \rightarrow} \underbrace{ 01010101 }_\text{mouth shape} \underbrace{ 01010101 }_\text{other}$

I'm not sure of a 'good' way to do this, but I'm sure you'll figure it out if you ask yourself the right questions.

Thank you very much, Xittenn, for your feedback. In fact, what we are trying to achieve here is exactly what you mentioned. We still need to come up with an algorithm that converts characters into their codes and at the same time groups them into phonetic families (but this would need cultural phonetic classification; for example, CH is phonetically equivalent to K in some cultures, while it is equivalent to SH in others).

We have so far decided to use binary values (base 2) for storage of codes and calculations. Our tests are acceptable so far, but we still think we need more. We would appreciate further brainstorming or ideas.
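One very rough way to express culture-dependent phonetic families is an ordered list of spelling rewrites applied before any numeric encoding. The two rule sets below are invented purely for illustration (they are not a real phonetic classification), as is the function name `phonetic_key`.

```python
# Hypothetical rule sets: each maps a letter group to a phonetic stand-in.
RULES_CH_AS_K = [("sch", "sk"), ("ch", "k"), ("ph", "f")]
RULES_CH_AS_SH = [("ch", "sh"), ("ph", "f")]


def phonetic_key(word: str, rules) -> str:
    """Apply the rewrites in order so equivalent spellings share one key."""
    key = word.lower()
    for pattern, replacement in rules:
        key = key.replace(pattern, replacement)
    return key


print(phonetic_key("School", RULES_CH_AS_K))  # skool
print(phonetic_key("Skool", RULES_CH_AS_K))   # skool
print(phonetic_key("chef", RULES_CH_AS_SH))   # shef
```

Words that end up with the same key could then be given the same or nearby numeric codes. Rule order matters: longer patterns such as "sch" need to come before shorter ones such as "ch".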

##### Share on other sites

I'm not sure this is really possible.

Skool and School are similar to us as humans because we have been trained to know that sk and sch make a similar sound. You would need to encode that into a computer.

There are other problems, because then you'll come across words like which and witch, which are identical in pronunciation but very different in spelling, as opposed to read and read, which are identical in spelling but change sound depending on context.

You either need to encode pretty much EVERYTHING or you need to invent artificial intelligence and teach it English.

##### Share on other sites

If you wish to compare similar-sounding words, you just need to use an online dictionary that has a full pronunciation guide:

which

witch

If you were able to get database access to the pronunciation portion of Wiktionary's definitions, you would notice (as Klaynos showed) that very differently spelled words can have very similar pronunciations. Even these two near homophones have slight differences: the initial sound of the w is slightly different; the "which" sound is slightly more breathy and soft compared to a harder, firmer "witch".

There are multiple methods that lexicographers and etymologists use to denote how a word is spoken aloud. Three different forms are shown for each of the words above.

##### Share on other sites

If you are wanting to match similar sounding words I wonder if the software associated with turning the spoken word into the written word could be adapted? I am thinking of the software that produces subtitles on live programs. It certainly often prints similar sounding words by mistake. (Often with hilarious results!)

Edited by Joatmon
##### Share on other sites

Thank you very much, Xittenn, for your feedback. In fact, what we are trying to achieve here is exactly what you mentioned. We still need to come up with an algorithm that converts characters into their codes and at the same time groups them into phonetic families (but this would need cultural phonetic classification; for example, CH is phonetically equivalent to K in some cultures, while it is equivalent to SH in others).

If you are just looking for homonyms, it sounds like you really don't need to encode the characters of a word but its phonemes. Words with multiple pronunciations will probably need to have multiple entries in your database, one for each pronunciation. You may need to encode both, though, the characters and the phonemes, depending on what you are trying to achieve.
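A small sketch of the "one entry per pronunciation" idea, using an in-memory table in place of a database. The phoneme transcriptions are hand-written approximations for illustration only (they treat which and witch as identical, which not every accent does), and all names are hypothetical.

```python
from collections import defaultdict

# Word -> list of pronunciations, each a tuple of phoneme symbols.
PRONUNCIATIONS = {
    "which": [("W", "IH", "CH")],
    "witch": [("W", "IH", "CH")],
    "read": [("R", "IY", "D"), ("R", "EH", "D")],  # present and past tense
    "reed": [("R", "IY", "D")],
    "red": [("R", "EH", "D")],
}


def build_sound_index(pronunciations):
    """Invert the table: phoneme sequence -> set of words spoken that way."""
    index = defaultdict(set)
    for word, variants in pronunciations.items():
        for phonemes in variants:
            index[phonemes].add(word)
    return index


def homophones(word, pronunciations, index):
    """Other words sharing at least one pronunciation with `word`."""
    matches = set()
    for phonemes in pronunciations[word]:
        matches |= index[phonemes]
    matches.discard(word)
    return matches


SOUND_INDEX = build_sound_index(PRONUNCIATIONS)
print(sorted(homophones("read", PRONUNCIATIONS, SOUND_INDEX)))  # ['red', 'reed']
```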
