# Metric for similarity/difference between languages, a suggestion

In a previous thread, the idea came up that it might be helpful to quantify how similar or different languages are in the ways they group or refer to things. Here is a suggestion for such a metric. It could be formalized, but that is not the point for now. The point is to see whether this metric represents a 'distance' between languages in this respect.

The metric starts on a case-by-case basis. Take two languages, R and T, and two objects, x and y. If x and y are called the same in R, then R(xy)=0. If they are called differently, then R(xy)=1. The same goes for T: T(xy)=0 or T(xy)=1. Now we take RT(xy)=|R(xy)-T(xy)| as the distance between R and T on x and y. If it is 0, the two languages are 'similar' in how they reference x and y; if it is 1, they are different.
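For anyone who prefers symbols, here is one way the pairwise definition above could be written. The labelling function $\ell_R$ (the name that language R gives to an object) is notation introduced here for illustration only; it does not come from the source study.

$$
R(xy) =
\begin{cases}
0 & \text{if } \ell_R(x) = \ell_R(y) \\
1 & \text{otherwise}
\end{cases}
\qquad
RT(xy) = \lvert R(xy) - T(xy) \rvert
$$

Since R(xy) and T(xy) are each 0 or 1, RT(xy) is 1 exactly when one language groups x and y together and the other does not.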

Now a real example (the raw data are from Kemmerer, Concepts in the Brain). Three common drinking vessels, x, y, and z, were shown to groups of native speakers of Faroese (F), German (G), and Dutch (D).

The three vessels make three pairs: xy, xz, yz. The Faroese speakers called all three the same, 'koppur'. So, for these three pairs, F=(0,0,0). The German speakers called x and y 'Tasse', and they called z 'Becher'. Thus, for these pairs, G=(0,1,1). The Dutch speakers called them respectively, 'kopje', 'mok', and 'beker'. D=(1,1,1).

Now, let's compare. FG=(0,1,1). FD=(1,1,1). GD=(1,0,0). To make a scalar, we can take a sum of individual comparisons. Then we get distances between the languages, as per these three vessels, FG=2, FD=3, GD=1. So, in this little experiment, German and Dutch grouped the things relatively similarly, while Faroese was different from German and even more different from Dutch.
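For anyone who wants to reproduce the arithmetic, here is a minimal Python sketch of the calculation above. The function names `grouping_vector` and `distance` are my own, chosen for illustration; the vessel names are the ones quoted in this post.

```python
from itertools import combinations

# Names given to vessels x, y, z by each group of speakers (as quoted above).
names = {
    "Faroese": {"x": "koppur", "y": "koppur", "z": "koppur"},
    "German": {"x": "Tasse", "y": "Tasse", "z": "Becher"},
    "Dutch": {"x": "kopje", "y": "mok", "z": "beker"},
}


def grouping_vector(labels):
    """Return one 0/1 entry per object pair (xy, xz, yz):
    0 if both objects get the same name, 1 if they get different names."""
    objects = sorted(labels)
    return tuple(int(labels[a] != labels[b]) for a, b in combinations(objects, 2))


def distance(labels_r, labels_t):
    """Sum of |R(pair) - T(pair)| over all object pairs."""
    r = grouping_vector(labels_r)
    t = grouping_vector(labels_t)
    return sum(abs(a - b) for a, b in zip(r, t))


for lang_r, lang_t in combinations(names, 2):
    print(lang_r, lang_t, distance(names[lang_r], names[lang_t]))
# Faroese German 2
# Faroese Dutch 3
# German Dutch 1
```

Because every entry is 0 or 1, this sum is simply the number of pairs on which the two languages disagree (the Hamming distance between the two grouping vectors).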

If such data are collected and summed for many different cases, this metric could give a quantitative comparison between languages more generally, with respect to how they make groupings of things, actions, etc. in the outside world.

Any suggestions and questions are appreciated.

1 hour ago, Genady said:

The Dutch speakers called them respectively, 'kopje', 'mok', and 'beker'. D=(1,1,1).

Do you handle similarities in spelling and structure in the grouping and comparison? Without further research I think the Swedish words would be 'kopp', 'mugg', 'bägare'. None of the words are identical in the two languages but share some kind of similarity?

(Side note: I'll curiously follow this; it may be related to some machine learning stuff I've seen for other purposes, word embedding; the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning. It seems related to the above idea and is used to model similarities within a language AFAIK rather than between different languages. I'll need to read some more to provide comments about translation / language similarities & word embedding)

3 minutes ago, Ghideon said:

Do you handle similarities in spelling and structure in the grouping and comparison? Without further research I think the Swedish words would be 'kopp', 'mugg', 'bägare'. None of the words are identical in the two languages but share some kind of similarity?

Thank you. No, I don't relate to what different languages call these things, but only to how they group them. To me, they could be called F1, F2, F3, and D1, D2, D3, etc. My interest is if F1=F2 and D1=D2? etc.

From the same source, the Swedish words for the same three things were 'kopp', 'mugg', 'glass'. So, like Dutch, it gets (1,1,1). As you see, you were almost right: 2 out of 3.

Frisian was different. It called the x, 'kopke', and both y and z, 'beker'. So it gets (1,1,0).
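To see all five languages side by side, here is a small Python sketch that tabulates the pairwise distances from the grouping vectors given so far. The dictionary and function names are mine, for illustration only.

```python
from itertools import combinations

# Grouping vectors over the pairs (xy, xz, yz), as stated in the posts above.
vectors = {
    "Faroese": (0, 0, 0),
    "German": (0, 1, 1),
    "Dutch": (1, 1, 1),
    "Swedish": (1, 1, 1),
    "Frisian": (1, 1, 0),
}


def distance(r, t):
    """Number of object pairs on which two grouping vectors disagree."""
    return sum(abs(a - b) for a, b in zip(r, t))


# Distance for every unordered pair of languages.
table = {
    (a, b): distance(vectors[a], vectors[b])
    for a, b in combinations(vectors, 2)
}

# Print the closest pairs first.
for (a, b), d in sorted(table.items(), key=lambda item: item[1]):
    print(f"{a:8s} {b:8s} {d}")
```

Dutch and Swedish come out at distance 0 even though none of their words match, which is the point: only the grouping counts. It also means that, strictly speaking, this is a pseudometric, since two different languages can be at distance 0.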

1 hour ago, Genady said:

Take two languages, R and T, and two objects, x and y.

A language consists of much more than object-nouns, though. Furthermore, not all languages cleanly distinguish between word types.

For example, I live in Thailand at the moment, and the Thai language does not, in many cases, make a distinction between noun, verb, and adjective (there are ways of marking a word to be of a specific type, though, if necessary). Furthermore, there are many cultural concepts here that have no equivalent in any other language - e.g. เกรงใจ, which means something like not wanting to cause an unnecessary burden for someone else. There is also a large number of personal pronouns that denote subtle differences in social status between speaker and listener, and again have no equivalent in any other language. Also, you have language registers - meaning you use different vocabulary for some things depending on who you are talking to (monks, people with status, members of the royal family etc).

How would you handle such things?

Basically what I’m trying to say is that languages are not generally a 1-1 mapping into each other, unless they are very closely related already.

4 minutes ago, Genady said:

My interest is if F1=F2 and D1=D2? etc.

Some things will be like this in many languages, but you will always have words that cannot be mapped like this, because language always reflects local culture.

1 minute ago, Markus Hanke said:

Basically what I’m trying to say is that languages are not generally a 1-1 mapping into each other, unless they are very closely related already.

This is exactly what I am saying. I'm trying to evaluate how differently different languages map their linguistic elements into the outside world.

I don't refer to any grammatical structures, including nouns, verbs, etc. What I refer to is that any thing, action, event, relation, feature, aspect, ... of the outside world is expressed somehow in every language. Different languages group these actions, events, ... differently in their expressions.

This specific study I used in the example above was conducted on 12 Germanic languages. I might use other studies in other examples if needed. In addition (it was mentioned in another thread), I speak three languages from three different families and understand two more, including one sign language (ASL, still a beginner). I am very well aware of how different languages can be. In fact, this was my starting point.

I start with simple things, because they are easy, but for other studies video clips were shown, short stories in pictures, etc. This is how other things are handled so far. Clearly, some things are very difficult or even impossible to study in the lab.

2 hours ago, Genady said:

If such data are collected and summed for many different cases, this metric could give a quantitative comparison between languages more generally, with respect to how they make groupings of things, actions, etc. in the outside world.

Don't we already have a pretty thorough understanding of how languages are related? The similarity of German and Dutch is not coincidental, as they're closely related, while Faroese belongs to the Nordic branch of Germanic languages, more like Icelandic and Danish. Here is a striking graphic depiction

I would also expect differences in vocabulary based on geography: while a Sudanese might need no words for snow, a Laplander is likely to have a dozen, and their description of wind and water would also be quite different. But I'm skeptical as to whether difference can be measured strictly by vocabulary, without reference to structure, grammar and inflection.

However, as I don't get what it is you're trying to measure, or why, I'll stay out of this thread.

51 minutes ago, Markus Hanke said:

Some things will be like this in many languages, but you will always have words that cannot be mapped like this, because language always reflects local culture.

One way to proceed is to compare languages by domain rather than 'overall'. Such domains would be, for example, plants and animals, artifacts, body parts, family relations, possession relations, motion events, events of giving and taking, ...

I suspect it is quite complicated. There are attempts to map rough distances between languages using a wide range of metrics, and at least from what I understand, there really is no good agreement on any overall methodology. In the above example, I suspect that depending on what aspect is shown, the differences are likely to be all over the place. Language is often context-driven, and so are the categories created in a given language.

Even within speakers of a language there is inherent vagueness. While this paper from Hancock and Volante focusses on linguistic uncertainties, I think categorizations are not as static and/or discrete as they might appear.

In the example in the OP, depending on what item groupings you provide, they might invoke different contexts for the viewer, which may be more cultural than linguistic. Also, I am not sure whether the proposed measure handles certain ambiguities well. For example, I am not sure why Germans would use fewer words than the Dutch, considering that equivalent words exist in both languages, though there are many local variations (e.g. Pott) or variations using contractions (Kaffee- or Teetasse/-becher) or more formal usages that are less common (Trinkgefaess). I.e. the measures could vary wildly even within a language region.

17 minutes ago, Peterkin said:

The similarity of German and Dutch is not coincidental, as they're closely related, while Faroese belongs to the Nordic branch of Germanic languages, more like Icelandic and Danish.

Exactly! And I was pleased to see that the suggested metric reflected this, albeit on such a small, statistically insignificant sample.

13 minutes ago, CharonY said:

I suspect it is quite complicated. There are attempts to map rough distances between languages using a wide range of metrics, and at least from what I understand, there really is no good agreement on any overall methodology. In the above example, I suspect that depending on what aspect is shown, the differences are likely to be all over the place. Language is often context-driven, and so are the categories created in a given language.

Even within speakers of a language there is inherent vagueness. While this paper from Hancock and Volante focusses on linguistic uncertainties, I think categorizations are not as static and/or discrete as they might appear.

In the example in the OP, depending on what item groupings you provide, they might invoke different contexts for the viewer, which may be more cultural than linguistic. Also, I am not sure whether the proposed measure handles certain ambiguities well. For example, I am not sure why Germans would use fewer words than the Dutch, considering that equivalent words exist in both languages, though there are many local variations (e.g. Pott) or variations using contractions (Kaffee- or Teetasse/-becher) or more formal usages that are less common (Trinkgefaess). I.e. the measures could vary wildly even within a language region.

Regarding the last point, words exist but are applied differently. This is a summary for a subset of 6 vessels from the total of 67 used in the study:

Faroese speakers used just one word to describe all six vessels. Speakers of Belgian Dutch, Frisian, and Danish employed two words but drew the boundary between them at different points along the continuum of containers. Swedish and English speakers made three-way distinctions that differed not only from the common one mentioned earlier, but also from each other. And finally, speakers of Netherlands Dutch carved the referential space into four separate categories.

Kemmerer, David. Concepts in the Brain (pp. 72-73).

Regarding the other points, yes, they need to be considered. I don't think that language is separate from culture, though. Linguistic differences reflect cultural differences. This is important if one asks what makes them different. My focus is not on why but on how they are different.

44 minutes ago, Genady said:

Regarding the other points, yes, they need to be considered. I don't think that language is separate from culture, though. Linguistic differences reflect cultural differences. This is important if one asks what makes them different. My focus is not on why but on how they are different.

I think exactly that makes it difficult, though. If we try to quantify, we would start by e.g. creating categories. But what an experimenter creates might be based on their own experiences. So let's say you have a language that has, say, 5 categories for drinking vessels but only 1 for eating bowls, and conversely one that has only 1 for drinking but 5 for eating, and then you have one that has one or two for each. If you used drinking vessels to build your model, the second and the last would group together, and if you used the eating vessels, it would be the first and the last. If you used both, they might separate differently, but adding yet another concept would change the model entirely again.

Then some cultures or languages might have sophisticated categorization in areas that do not even exist in others, and so on. I.e. whatever you select to look at will influence what your outcome will be. Finding a truly neutral ground where comparisons of divergent languages can be done with this approach seems incredibly difficult to me.
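A toy version of this thought experiment, as a Python sketch. The three languages A, B, C and their category labels are entirely hypothetical, and the distance used is the pair-disagreement count proposed in the OP. The particular way C splits its items (one item versus the other four) is an arbitrary assumption; a different split can change which pair comes out closest.

```python
from itertools import combinations

# Hypothetical category labels for five drinking vessels and five eating bowls.
# A: 5 drinking categories, 1 eating category; B: the reverse; C: 2 of each.
labels = {
    "A": {"drinking": [1, 2, 3, 4, 5], "eating": [1, 1, 1, 1, 1]},
    "B": {"drinking": [1, 1, 1, 1, 1], "eating": [1, 2, 3, 4, 5]},
    "C": {"drinking": [1, 2, 2, 2, 2], "eating": [1, 2, 2, 2, 2]},
}


def grouping_vector(item_labels):
    """0/1 per item pair: 0 if both items share a category, 1 otherwise."""
    return [int(a != b) for a, b in combinations(item_labels, 2)]


def distance(lang_r, lang_t, domains):
    """Pair-disagreement count between two languages, summed over domains."""
    total = 0
    for domain in domains:
        r = grouping_vector(labels[lang_r][domain])
        t = grouping_vector(labels[lang_t][domain])
        total += sum(abs(a - b) for a, b in zip(r, t))
    return total


def closest_pairs(domains):
    """Language pairs with the smallest distance for the chosen domains."""
    dists = {pair: distance(*pair, domains) for pair in combinations(labels, 2)}
    best = min(dists.values())
    return sorted(pair for pair, d in dists.items() if d == best)


print(closest_pairs(["drinking"]))            # [('B', 'C')]
print(closest_pairs(["eating"]))              # [('A', 'C')]
print(closest_pairs(["drinking", "eating"]))  # [('A', 'C'), ('B', 'C')]
```

With these made-up labels, which two languages look "closest" flips depending on the domain chosen, and combining both domains gives a tie.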

That being said, I suspect the matter is sufficiently complex that I would require some serious reading (such as the Kemmerer, which I am not familiar with) to contribute anything meaningfully.

4 minutes ago, CharonY said:

... require some serious reading (such as the Kemmerer, which I am not familiar with) to contribute anything meaningfully.

In the process. 350 dense pages. Unfortunately, it is already clear that available data just scratch the surface.

I said I wouldn't, but.... Okay, I lied.

How are you making sure that the subjects you choose from each linguistic group have the same level and type of linguistic development? You might have a science teacher from one group, who has an extensive specialized vocabulary in his field, but little or no interest in kitchenware, and a bricklayer from another group whose hobby is collecting archaic folk sayings. Different ages, genders, occupations, proclivities, interests, education, literacy and reading habits, facility in language acquisition and familiarity with foreign languages... How do you choose the subject population?

2 hours ago, Genady said:

This is exactly what I am saying. I'm trying to evaluate how differently different languages map their linguistic elements into the outside world.

I think it’s more complicated than even this - because in my opinion language is more than a simple mapping into the outside world. It is strongly contextual, and meaning isn’t inherent (as it would be in a mapping), but given only through its actual use by people. Thus, language is more than an abstract set of rules and maps - it’s a cultural and social convention, and as such it is fluid and permanently evolving. You cannot separate language from the context of its users. I’m pretty firmly with Wittgenstein’s philosophy of language in this regard.

I’m not saying that comparative philology isn’t a worthwhile endeavour (it’s quite interesting!); only that there are inherent limitations to such a project.

2 hours ago, Genady said:

What I refer to is that any thing, action, event, relation, feature, aspect, ... of the outside world is expressed somehow in every language.

I don’t believe this is true. Consider the example I gave earlier of เกรงใจ in Thai - this is a very subtle social concept that is quite specific to Thai culture. It is a real ‘thing’ in the outside world (an aspect of culture), but there exists no adequate translation for this in English or any other European language. Even trying to explain this concept in all its subtleties requires an entire paragraph of text at least, and even then it isn’t guaranteed that the reader will understand. Whole guide books have been written about it! Another example of such a thing is the word fa’alavelave in Samoan, which roughly refers to a social obligation created by something that has happened in the extended family, and for which material resources need to be raised so as not to lose face in the community (it also means simply ‘trouble’ or ‘problem’). You can verbally understand the explanation, but you won’t understand what fa’alavelave truly means to a Samoan person, unless you have lived in Samoa (it took me a long time to fully understand all implications of this when I lived there). The concept simply does not exist outside its cultural context, so no other language has any way to adequately express it in all its subtlety.

2 hours ago, Peterkin said:

I said I wouldn't, but.... Okay, I lied.

How are you making sure that the subjects you choose from each linguistic group have the same level and type of linguistic development? You might have a science teacher from one group, who has an extensive specialized vocabulary in his field, but little or no interest in kitchenware, and a bricklayer from another group whose hobby is collecting archaic folk sayings. Different ages, genders, occupations, proclivities, interests, education, literacy and reading habits, facility in language acquisition and familiarity with foreign languages... How do you choose the subject population?

You're right. Difficult questions. The only consolation is that we are not alone. The guys from fMRI studies have the same problems. Large randomized samples are not feasible. Kemmerer did not specify how the subjects were selected. Perhaps if I go to the original papers, I can find the answers. I know that each language was represented by a group rather than an individual, and that they selected a "preferential" answer from each test.

2 hours ago, Markus Hanke said:

I think it’s more complicated than even this - because in my opinion language is more than a simple mapping into the outside world. It is strongly contextual, and meaning isn’t inherent (as it would be in a mapping), but given only through its actual use by people. Thus, language is more than an abstract set of rules and maps - it’s a cultural and social convention, and as such it is fluid and permanently evolving. You cannot separate language from the context of its users. I’m pretty firmly with Wittgenstein’s philosophy of language in this regard.

I’m not saying that comparative philology isn’t a worthwhile endeavour (it’s quite interesting!); only that there are inherent limitations to such a project.

I don’t believe this is true. Consider the example I gave earlier of เกรงใจ in Thai - this is a very subtle social concept that is quite specific to Thai culture. It is a real ‘thing’ in the outside world (an aspect of culture), but there exists no adequate translation for this in English or any other European language. Even trying to explain this concept in all its subtleties requires an entire paragraph of text at least, and even then it isn’t guaranteed that the reader will understand. Whole guide books have been written about it! Another example of such a thing is the word fa’alavelave in Samoan, which roughly refers to a social obligation created by something that has happened in the extended family, and for which material resources need to be raised so as not to lose face in the community (it also means simply ‘trouble’ or ‘problem’). You can verbally understand the explanation, but you won’t understand what fa’alavelave truly means to a Samoan person, unless you have lived in Samoa (it took me a long time to fully understand all implications of this when I lived there). The concept simply does not exist outside its cultural context, so no other language has any way to adequately express it in all its subtlety.

How much will be missed if, when comparing two languages, we skip idiosyncrasies like the ones you describe and focus on the comparable parts only? There is still a huge common world out there to talk about. These idiosyncrasies are perhaps very important for these people and their society, but how important are they for linguistics? If they don't have any correspondence to anything in another language, why would they be considered for a comparison between the languages at all?

Maybe instead of comparing everywhere, use a Dow Jones / S&P / "inflation basket" approach - fix a diversified set of representative domains that can be used universally. Body parts domain would be a candidate #1.
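A rough Python sketch of what such a basket could look like. Everything here is an assumption for illustration: the two domains, the made-up category labels for two hypothetical languages, the choice to normalize each domain by its number of item pairs, and the choice to weight all domains equally.

```python
from itertools import combinations


def grouping_vector(item_labels):
    """0/1 per item pair: 0 if both items share a label, 1 otherwise."""
    return [int(a != b) for a, b in combinations(item_labels, 2)]


def domain_distance(labels_r, labels_t):
    """Fraction of item pairs on which two languages disagree (0.0 to 1.0)."""
    r = grouping_vector(labels_r)
    t = grouping_vector(labels_t)
    return sum(abs(a - b) for a, b in zip(r, t)) / len(r)


def basket_distance(lang_r, lang_t, basket):
    """Unweighted mean of the per-domain distances over a fixed basket."""
    return sum(domain_distance(lang_r[d], lang_t[d]) for d in basket) / len(basket)


# Hypothetical data: the same items are shown to speakers of both languages.
basket = ["vessels", "body_parts"]
lang_r = {"vessels": ["a", "a", "b"], "body_parts": ["p", "p", "q", "q"]}
lang_t = {"vessels": ["a", "b", "c"], "body_parts": ["p", "q", "r", "s"]}

print(round(basket_distance(lang_r, lang_t, basket), 3))
```

Normalizing per domain keeps a domain with many test items from dominating the total, which a plain sum would not do.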

1 hour ago, Genady said:

I know that each language was represented by a group rather than an individual, and that they selected a "preferential" answer from each test.

I do not like the sound of 'preferential' in a scientific study. Of course, most are done on university students, which might level the field, especially if all the students are a. in an institution in their native country and b. chosen from the same range of disciplines in every country. (Coz, if you were using exchange or foreign students, you have a huge bias from the get-go.)

1 hour ago, Genady said:

These idiosyncrasies are perhaps very important for these people and their society, but how important are they for linguistics?

You can never know, if you ignore them.

1 hour ago, Genady said:

There is still a huge common world out there to talk about.

The commonalities are easy. It's the differences that are hard.

17 hours ago, Peterkin said:

I do not like the sound of 'preferential' in a scientific study.

Me neither.

17 hours ago, Peterkin said:

You can never know, if you ignore them.

This is so. But the situation could simply be like this. Consider some people who live on an island with unique flora and fauna, from which they obtain their food and materials. They have to have a language to talk about these flora and fauna. No other language has anything like that, because no other people ever interacted with this kind of flora and fauna. It seems to me that in such a case, this large but idiosyncratic part of their language has nothing to add to our understanding of how their language compares or relates to other languages. It could even work the other way around, i.e. understanding the other, more generic parts of their language could help us understand how this idiosyncratic part works.

Deleted. I shouldn't pursue this topic: I just don't understand what you're after.


