Camille Thomas Barkho Posted May 31, 2012
We are trying to develop an algorithm to calculate a unique numeric value for any English word. Example: School = 675237652376523 etc...
Klaynos Posted May 31, 2012
Google hashing.
Joatmon Posted May 31, 2012 (edited)
(IMO) I can't see the number that represents the word being anything other than a code. For example, you couldn't add two numbers together to produce a number that represents another meaningful word. If just producing a code would be sufficient, then it is quite simple. Let the first 2 digits (with value 01 - 52) represent the first character (use 01 - 26 for lowercase letters and 27 - 52 for capitals). Repeat for subsequent characters, using 2 digits per character. Example: School would be represented by 450308151512. I think you may well be looking for something much deeper than this.
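The two-digit scheme described in this post can be sketched in Python as follows. This is only a minimal illustration: the function names `two_digit_code` and `decode_two_digit` are made up here, and the sketch assumes the word contains nothing but the letters a-z and A-Z.

```python
import string


def two_digit_code(word: str) -> str:
    """Encode each letter as two decimal digits: a-z -> 01-26, A-Z -> 27-52."""
    digits = []
    for ch in word:
        if ch in string.ascii_lowercase:
            value = ord(ch) - ord("a") + 1
        elif ch in string.ascii_uppercase:
            value = ord(ch) - ord("A") + 27
        else:
            # Assumption: anything other than a plain letter is rejected.
            raise ValueError(f"unsupported character: {ch!r}")
        digits.append(f"{value:02d}")
    return "".join(digits)


def decode_two_digit(code: str) -> str:
    """Reverse two_digit_code by reading the string two digits at a time."""
    letters = []
    for i in range(0, len(code), 2):
        value = int(code[i:i + 2])
        if 1 <= value <= 26:
            letters.append(chr(ord("a") + value - 1))
        elif 27 <= value <= 52:
            letters.append(chr(ord("A") + value - 27))
        else:
            raise ValueError(f"invalid code group: {code[i:i + 2]!r}")
    return "".join(letters)


print(two_digit_code("School"))          # 450308151512
print(decode_two_digit("450308151512"))  # School
```

Because every letter takes exactly two digits, the code is reversible, but the number grows by two digits per letter.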
doG Posted May 31, 2012
Quote: We are trying to develop an algorithm to calculate a unique numeric value for any English word. Example: School = 675237652376523 etc...
ASCII, encoding characters since 1960...
Joatmon Posted May 31, 2012 (edited)
Quote: ASCII, encoding characters since 1960...
The same thought came to me when eating my dinner! If using decimal numbers with the scheme suggested in #3, then 3 digits would be needed per character, or only 2 if hex were used. Of course, using ASCII would also allow punctuation (and even spaces if required).
the asinine cretin Posted May 31, 2012
Is there some reason why normal character encoding such as ASCII or UTF-8 is not appropriate? Why are you encoding the words in the first place? What are the requirements and all that?
ewmon Posted May 31, 2012
Use base 27 or 54 (or other) for the most efficient use of numbers, with the first letter of the word being the least significant "digit". So, in their order, the letters are weighted by 27⁰, 27¹, 27², 27³, etc. That is, "a..." = 1, "b..." = 2, etc., "y..." = 25, and "z..." = 26; and ".a.." = 27, ".b.." = 54, etc.; "..a." = 729, "..b." = 1458, etc.
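A minimal Python sketch of the base-27 idea, with the first letter as the least significant "digit". The function names `base27_value` and `base27_word` are invented for illustration, and the sketch assumes case is folded to lowercase so that only the digits 1 to 26 are used.

```python
def base27_value(word: str) -> int:
    """Sum letter values (a=1 ... z=26) weighted by 27**position, first letter first."""
    total = 0
    for position, ch in enumerate(word.lower()):
        digit = ord(ch) - ord("a") + 1
        if not 1 <= digit <= 26:
            # Assumption: only the letters a-z are supported.
            raise ValueError(f"unsupported character: {ch!r}")
        total += digit * 27 ** position
    return total


def base27_word(value: int) -> str:
    """Recover the word by peeling off base-27 digits, least significant first."""
    letters = []
    while value > 0:
        value, digit = divmod(value, 27)
        if digit == 0:
            raise ValueError("digit 0 is never produced by base27_value")
        letters.append(chr(ord("a") + digit - 1))
    return "".join(letters)


print(base27_value("a"))    # 1
print(base27_value("ab"))   # 1 + 2 * 27 = 55
print(base27_word(base27_value("school")))
```

Since the digit 0 is never used, no two distinct words map to the same number, and the result is much shorter than a fixed two-digits-per-letter code.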
D H Posted May 31, 2012
Quote: Example: School would be represented by 450308151512.
That's a bit wasteful. There are, depending on who's counting and what is being counted, somewhere between half a million and tens of millions of words in the English language. In any case, that is far fewer than 4 billion. In other words, the numbering scheme will easily fit in a 32-bit word. Your value for "school" does not, and "school" is not a particularly long word. The correct answer is Klaynos'. Google the term "hashing".
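To make the 32-bit point concrete, here is a hedged Python sketch of two options. The first uses CRC-32 from the standard `zlib` module as a stand-in hash; a hash is compact but not guaranteed to be unique, since two words can collide. The second simply numbers the entries of a word list, which is unique for the words in that list. The function names `word_hash32` and `number_words` are hypothetical.

```python
import zlib


def word_hash32(word: str) -> int:
    """Return a 32-bit hash of the word (CRC-32). Collisions are possible."""
    return zlib.crc32(word.encode("utf-8"))


def number_words(words):
    """Assign each distinct word a unique index, in sorted order."""
    return {word: index for index, word in enumerate(sorted(set(words)))}


value = word_hash32("school")
print(0 <= value < 2 ** 32)  # True: the hash fits in a 32-bit word

print(number_words(["school", "academy", "school"]))
# {'academy': 0, 'school': 1}
```

Neither value says anything about how similar two words are; both are only identifiers.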
John Cuthber Posted May 31, 2012
ASCII is the new kid on the block. Morse code (http://en.wikipedia.org/wiki/Morse_code) has been around since about 1840. You need to account for the timing, so it needs to be a bit more complicated. For example: 1 for a dot, 2 for a dash, 3 between letters.
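The dot/dash/gap numbering suggested here could look like the following Python sketch. The table holds the standard International Morse patterns for the 26 letters; the function name `morse_digits` is made up, and the sketch assumes the input contains only letters.

```python
# International Morse code patterns for the 26 letters.
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".",
    "f": "..-.", "g": "--.", "h": "....", "i": "..", "j": ".---",
    "k": "-.-", "l": ".-..", "m": "--", "n": "-.", "o": "---",
    "p": ".--.", "q": "--.-", "r": ".-.", "s": "...", "t": "-",
    "u": "..-", "v": "...-", "w": ".--", "x": "-..-", "y": "-.--",
    "z": "--..",
}


def morse_digits(word: str) -> str:
    """Write dots as 1, dashes as 2, and put a 3 between letters."""
    letters = []
    for ch in word.lower():
        pattern = MORSE[ch]  # raises KeyError for anything outside a-z
        letters.append(pattern.replace(".", "1").replace("-", "2"))
    return "3".join(letters)


print(morse_digits("sos"))  # 11132223111
```

Case is folded before lookup, so capitalised and lowercase spellings of a word produce the same digits.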
the asinine cretin Posted May 31, 2012
Quote: Ascii is the new kid on the block. http://en.wikipedia....wiki/Morse_code Since about 1840 You need to account for the times so it needs to be a bit more complicated. For example, 1 for a dot; 2 for a dash; 3 between letters
Lol. And yeah, if all you're trying to do is hash some strings, the real question would be: what language are you programming in?
doG Posted May 31, 2012
Quote: Ascii is the new kid on the block. http://en.wikipedia.org/wiki/Morse_code Since about 1840 You need to account for the times so it needs to be a bit more complicated. For example, 1 for a dot; 2 for a dash; 3 between letters
Yes, but... Morse does not distinguish between upper and lower case, so the person John and the john we use in the restroom are the same in Morse.
John Cuthber Posted May 31, 2012
There are lots of ways to do this. What is the actual goal? Some methods might be better than others, depending on the purpose.
Joatmon Posted May 31, 2012 (edited)
Quote: That's a bit wasteful. There are, depending on who's counting / what is being counted somewhere between half a million to tens of millions of words in the English language. In any case, it's a lot, lot less than 4 billion. In other words, the numbering scheme will easily fit in a 32 bit word. Your value for "school" does not, and "school" is not a particularly long word. The correct answer is Klaynos'. Google the term "hashing".
Completely agree regarding efficiency. I was just looking to suggest a simple algorithm that would work. Because the OP says he is looking for a "value", I'm not even sure whether he is looking for something deeper than a code (but what else the value of the unique number would be used for is beyond me).
Camille Thomas Barkho (Author) Posted June 1, 2012
Thank you all for your responses. Our goal is to create a numeric value for any word, so that we can compare the calculated values of two words to get the nearness of those words. We are thinking of using something like bitwise operations to do the comparison. We will surely not go into techniques like phonetic or fuzzy word matching, due to their slowness.
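One way a bitwise comparison could work is sketched below in Python. This is an assumption about what such a scheme might look like, not the poster's actual method: each word is reduced to a 26-bit mask with one bit per letter present, and two masks are compared with AND and OR. The names `letter_mask` and `mask_similarity` are hypothetical.

```python
def letter_mask(word: str) -> int:
    """Set bit 0 for 'a', bit 1 for 'b', ... for every letter that occurs in the word."""
    mask = 0
    for ch in word.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask


def mask_similarity(first: str, second: str) -> float:
    """Shared letters divided by all letters used, computed with bitwise AND and OR."""
    mask_a, mask_b = letter_mask(first), letter_mask(second)
    union = mask_a | mask_b
    if union == 0:
        return 1.0  # two words with no letters are treated as identical
    shared = mask_a & mask_b
    return bin(shared).count("1") / bin(union).count("1")


print(mask_similarity("school", "Skool"))   # 0.5
print(mask_similarity("listen", "silent"))  # 1.0
```

The comparison is fast, but the mask ignores letter order and repeated letters, so anagrams such as "listen" and "silent" come out as identical. That limitation is part of why a single number per word struggles to capture "nearness".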
Klaynos Posted June 1, 2012
Can you define "nearness"? That doesn't really mean much on its own.
Camille Thomas Barkho (Author) Posted June 2, 2012
Nearness means something like: The word "school" and the word "Skool" might be 70% near or alike.
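For reference, one existing way to put a percentage on "nearness" is the matching-ratio measure in Python's standard `difflib` module. This is a fuzzy-matching approach, which the poster said they want to avoid for speed reasons, so it is shown only to make the notion concrete. The wrapper name `nearness` is hypothetical.

```python
from difflib import SequenceMatcher


def nearness(first: str, second: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, first.lower(), second.lower()).ratio()


print(f"{nearness('school', 'Skool'):.0%}")
print(f"{nearness('school', 'school'):.0%}")  # 100%
```

The ratio is based purely on matching runs of characters, so it knows nothing about how the words sound.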
Pantaz Posted June 2, 2012
Quote: Nearness means something like: The word "school" and the word "Skool" might be 70% near or alike.
I would consider "skool" a misspelling, rather than a comparative word. For many words, you must also consider usage. Using your example:

Main Entry: school
Part of Speech: noun
Definition: place, system for educating
Synonyms: academy, alma mater, blackboard, college, department, discipline, establishment, faculty, hall, halls of ivy, institute, institution, schoolhouse, seminary, university

Part of Speech: noun
Definition: body of philosophy on subject
Synonyms: belief, creed, faith, outlook, persuasion, school of thought, stamp, way, way of life

Part of Speech: verb
Definition: teach
Synonyms: advance, coach, control, cultivate, direct, discipline, drill, educate, guide, indoctrinate, inform, instruct, lead, manage, prepare, prime, show, train, tutor, verse

- Source: Thesaurus.com
Xittenn Posted June 2, 2012 (edited)
Then might I suggest coding your alphabet so as to bind phonetic statements by embedding metadata in your numerical representations. Code a letter not only to represent a symbolic character, but also to associate it with its set of phonetic pronunciations, both as an individual element within a statement and in conjunction with the letters that surround it. So:
[math] \underbrace{ 01010101 }_\text{symbol} \underbrace{ 01010101 }_{\leftarrow \text{left}} \underbrace{ 01010101 }_{\text{right} \rightarrow} \underbrace{ 01010101 }_\text{mouth shape} \underbrace{ 01010101 }_\text{other} [/math]
I'm not sure of a 'good' way to do this, but I'm sure you'll figure it out if you ask yourself the right questions.
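The five-field layout in the formula above can be packed into a single integer with shifts and masks. The Python sketch below assumes each field is exactly 8 bits wide, as drawn; the function names and the example field values are invented for illustration, since the post does not say what the "left", "right", "mouth shape" or "other" codes would actually be.

```python
FIELDS = ("symbol", "left", "right", "mouth_shape", "other")


def pack_letter(symbol: int, left: int, right: int, mouth_shape: int, other: int) -> int:
    """Pack five 8-bit fields into one 40-bit integer, symbol in the highest byte."""
    packed = 0
    for value in (symbol, left, right, mouth_shape, other):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"field value out of 8-bit range: {value}")
        packed = (packed << 8) | value
    return packed


def unpack_letter(packed: int) -> dict:
    """Split a 40-bit integer back into its five named 8-bit fields."""
    values = {}
    for name in reversed(FIELDS):
        values[name] = packed & 0xFF
        packed >>= 8
    return values


# The placeholder pattern 01010101 from the formula, repeated in every field.
print(hex(pack_letter(0b01010101, 0b01010101, 0b01010101, 0b01010101, 0b01010101)))
# 0x5555555555

# Hypothetical values: the letter "s" followed by "c", with made-up phonetic codes.
print(unpack_letter(pack_letter(ord("s"), 0, ord("c"), 3, 0)))
```

With this layout, individual fields can be compared on their own by masking, for example comparing only the mouth-shape byte of two letters.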
Camille Thomas Barkho (Author) Posted June 2, 2012
Quote: Then might I suggest coding your alphabet so as to bind phonetic statements by introducing embedded meta data into your numerical representations. Code a letter to not only represent a symbolic character, but also to associate it to its set of phonetic pronunciations as both an individual element within a statement and in conjunction with the letters that it is surrounded by. So: [math] \underbrace{ 01010101 }_\text{symbol} \underbrace{ 01010101 }_{\leftarrow \text{left}} \underbrace{ 01010101 }_{\text{right} \rightarrow} \underbrace{ 01010101 }_\text{mouth shape} \underbrace{ 01010101 }_\text{other} [/math] I'm not sure of a 'good' way to do this, but I'm sure you'll figure it out if you ask yourself the right questions.
Thank you very much, Xittenn, for your feedback. In fact, what we are trying to achieve here is exactly what you mentioned. Anyway, we need to come up with an algorithm that perhaps combines characters into their codes and at the same time groups them into phonetic families (but this would need cultural phonetic classification; for example, CH is equivalent to K phonetically in some cultures while it is equivalent to SH in others). We have so far decided to use binary values (base 2) for storage of codes and for calculations. Our tests are acceptable so far, but we still think we need more. We would appreciate further brainstorming or ideas.
Klaynos Posted June 2, 2012
I'm not sure this is really possible. Skool and School are similar to us as humans because we have been trained to know that sk and sch make a similar sound. You would need to encode that into a computer. There are other problems, because then you'll come across words like which and witch, which are identical in pronunciation but very different in spelling, as compared to read and read, which are identical in spelling but change sound depending on context. You either need to encode pretty much EVERYTHING or you need to invent artificial intelligence and teach it English.
imatfaal Posted June 2, 2012
If you wish to compare similar-sounding words, you just need to use an online dictionary that has a full pronunciation guide.
which (UK): enPR: hwĭch, IPA: /ʍɪʧ/, X-SAMPA: /WItS/
witch: enPR: wĭch, IPA: /wɪtʃ/, X-SAMPA: /wItS/
If you were able to get database access to the pronunciation portion of Wiktionary's definitions, you could notice (even as Klaynos showed) that very differently spelled words have very similar pronunciations. Even these two near-homophones have slight differences: the initial sound of the w is slightly different; the "which" sound is slightly more breathy and soft compared to a harder, firmer "witch". There are multiple notations that lexicographers and etymologists use to denote how words are spoken aloud. Three different forms are shown for each of the words above.
Joatmon Posted June 2, 2012 (edited)
If you want to match similar-sounding words, I wonder if the software used to turn the spoken word into the written word could be adapted? I am thinking of the software that produces subtitles on live programs. It certainly often prints similar-sounding words by mistake (often with hilarious results!).
doG Posted June 2, 2012
Quote: Thank you very much Xitten for your feedback. In fact, what we are trying to achieve here is exactly what you mentioned. Anyway, we need to come up with the algorithm that combines maybe characters into their codes and at the time groups them into phonetic families (but this would need cultural phonetic classification - example: CH is equivalent to K phonetically in some cultures while it is equivalent to SH in other cultures).
If you are just looking for homonyms, it sounds like you really don't need to encode the characters of a word but its phonemes. Words with multiple pronunciations will probably need to have multiple entries in your database, one for each pronunciation. You may need to encode both, though, the characters and the phonemes, depending on what you are trying to achieve.
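The idea of storing phonemes, with one entry per pronunciation, can be sketched in Python as below. The table is a tiny hand-written toy, not taken from any real pronunciation dictionary; the phoneme symbols are illustrative, and it treats "which" and "witch" as identical even though some accents distinguish them. The names `PRONUNCIATIONS` and `sounds_alike` are hypothetical.

```python
# Toy data: each word maps to a list of pronunciations,
# and each pronunciation is a tuple of phoneme symbols.
PRONUNCIATIONS = {
    "which": [("W", "IH", "CH")],
    "witch": [("W", "IH", "CH")],
    "read": [("R", "IY", "D"), ("R", "EH", "D")],  # two entries, one per pronunciation
    "reed": [("R", "IY", "D")],
    "red": [("R", "EH", "D")],
}


def sounds_alike(first: str, second: str) -> bool:
    """True if the two words share at least one pronunciation in the table."""
    first_set = set(PRONUNCIATIONS[first.lower()])
    second_set = set(PRONUNCIATIONS[second.lower()])
    return bool(first_set & second_set)


print(sounds_alike("which", "witch"))  # True
print(sounds_alike("read", "red"))     # True
print(sounds_alike("reed", "red"))     # False
```

A real system would need a full pronunciation source in place of the toy table, and a way to score partial matches rather than only exact ones.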