There has recently been some discussion in OpenStreetMap on names and labeling due to some people expressing the desire to abandon the geographically neutral labeling on the OpenStreetMap standard style. One of the things this discussion once again showed is a basic problem in the way names are recorded in OpenStreetMap, which I want to discuss briefly here.
The OpenStreetMap naming system is based on the idea that features in the database can have a local name, the name predominantly used locally for the feature, as well as an arbitrary number of names in different languages, that is, the names used by non-locals or by locals speaking a different language than most. The first is to be mapped in the name tag, the latter go into name:<language> tags where <language> usually is the two letter code of the language of the name. There are other name tags like alt_name (for an alternative local name) or old_name (for a historic name no longer in active use).
The OpenStreetMap standard style renders the content of the name tag and this way is supposed to display the name locally used. This is one of the most characteristic aspects of the map and a highly visible demonstration of OpenStreetMap being based on local knowledge and valuing geographic and cultural diversity. Of course there are people who think it is more important to have another map (in addition to hundreds of commercial OSM based maps) where they can read the labels than to have at least a single map that can be read by any local mapper all over the world in their local area, but this is not my topic here.
The problem with basing labels on a single name tag for the local name is that local mappers are often torn between tagging the actual local name and tagging whatever they want to see on the map, which might be affected by the desire for uniformity in labeling or the wish to make the map more readable for non-locals. As a result the name tag often contains compound strings containing names in multiple languages, in particular in regions where multiple languages are widely used by locals and there might not even be a single dominant local name.
The solution to this problem would lie in dropping the illusion that there is always a single local name that can be verifiably mapped. Instead you would tag the names in the different languages as it is done currently and add a format string indicating what the common form of displaying the names of this feature locally is. Separating the multilingual name data from the information on local name use is the key here.
The format string would normally not have to be specified for every feature individually since typically all features in an area would use the same format string. Instead you would have the individual features inherit the format strings of the administrative units they are located in.
For example in case of Germany the admin_level 2 boundary relation (51477) would get something like language_format=$de
– and there would be no need for further format strings locally except maybe for a few smaller areas with a local language or individual features with only a foreign language name. Switzerland (51701) would get language_format=$de/$fr/$it/$rm
and the different Cantons would get different format strings depending on the locally used languages.
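To make the inheritance idea a bit more concrete, here is a minimal sketch in Python of how the applicable format string could be looked up. It assumes that the enclosing administrative units of a feature are already known, and that the most specific unit (the one with the highest admin_level) carrying a language_format tag wins. Both the function name and that precedence rule are my assumptions for illustration; the admin_level 4 unit and its value are made up.

```python
def effective_format(feature_tags, enclosing_units):
    """Return the format string that applies to a feature, or None.

    enclosing_units: tag dicts of the administrative units containing the
    feature, in any order. Assumption: a format string on the feature itself
    takes precedence, then the most specific unit (highest admin_level).
    """
    if "language_format" in feature_tags:
        return feature_tags["language_format"]
    by_specificity = sorted(
        enclosing_units, key=lambda unit: int(unit["admin_level"]), reverse=True
    )
    for unit in by_specificity:
        if "language_format" in unit:
            return unit["language_format"]
    return None


# Illustrative values only: a country-level unit and a hypothetical
# French-speaking unit inside it.
units = [
    {"admin_level": "2", "language_format": "$de/$fr/$it/$rm"},
    {"admin_level": "4", "language_format": "$fr"},
]
print(effective_format({}, units))
print(effective_format({"language_format": "$it"}, units))
```

The first call picks up the format string of the more specific unit, the second one uses the string tagged on the feature itself.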
The key and syntax for the format string are just an example of course to illustrate the idea – those could be different.
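With the example syntax used above, building a label from the individual language names could look roughly like the following Python sketch. It assumes that $xx stands for the value of the name:xx tag and that everything else in the format string is literal text; the function name and the restriction to two or three lowercase letters are assumptions of mine, and the tag values are just illustrative.

```python
import re

# Assumed syntax, following the examples above: "$xx" is a placeholder for
# the value of the name:xx tag, everything else is copied literally.
PLACEHOLDER = re.compile(r"\$([a-z]{2,3})")


def expand_format(format_string, tags):
    """Replace each $xx placeholder with the value of tags['name:xx'].

    Raises KeyError if one of the referenced languages is not tagged.
    """
    def substitute(match):
        return tags["name:" + match.group(1)]

    return PLACEHOLDER.sub(substitute, format_string)


tags = {"name:de": "Biel", "name:fr": "Bienne"}
print(expand_format("$de/$fr", tags))  # prints "Biel/Bienne"
```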
I think the advantages of this concept are obvious:
- The rules for the individual language name tags are much clearer and better defined so there is less room for arbitrariness resulting in more reliable data for the data user than from the name tag.
- Any desire of the local mappers to get certain labels in the map would be articulated in the format strings and would not tint the actual name data.
- The format string allows data users a lot more flexibility – it can be ignored, modified or replaced by a custom and globally constant format string or a more complex interpreting function with fallbacks, transliterations etc. Or data users can select if they want to use format strings on a per feature basis or only as inherited from the admin units.
- The problem that different script variants are needed for the same Unicode characters in different languages (a.k.a. the Han unification problem) would be solved as well.
- Using the individual language names as data source for labels instead of the separately tagged name tag allows for quality control of this data through the map, likely resulting in fewer errors and inconsistencies in the name data overall.
- There would be an easy fallback during transition to this tagging system – if there is no valid format string or any of the languages in the format string is not tagged you could fall back to the legacy name tag.
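The flexibility for data users and the fallback to the legacy name tag mentioned in the list above can be sketched as follows. This is only an illustration under assumptions: the placeholder syntax is the example one from this post, a format string without any placeholder is treated as invalid, and the function name and tag values are made up.

```python
import re

# Assumed placeholder syntax: "$xx" refers to the name:xx tag.
PLACEHOLDER = re.compile(r"\$([a-z]{2,3})")


def label(tags, format_string=None):
    """Build a label from a format string, else use the legacy name tag."""
    if format_string:
        languages = PLACEHOLDER.findall(format_string)
        # Use the format string only if it references at least one language
        # and every referenced language is actually tagged.
        if languages and all("name:" + lang in tags for lang in languages):
            return PLACEHOLDER.sub(
                lambda match: tags["name:" + match.group(1)], format_string
            )
    return tags.get("name")


tags = {"name": "Biel/Bienne", "name:de": "Biel", "name:fr": "Bienne"}
print(label(tags, "$fr ($de)"))  # format string replaced by the data user
print(label(tags, "$de/$it"))    # name:it is missing, falls back to name
print(label(tags))               # no format string, falls back to name
```

A data user who prefers a globally constant format simply passes their own string instead of the inherited one.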
But I will also mention the main disadvantages of this idea:
- The data users do not get a hand drawn label string prepared by the mapper and ready to use but have to interpret more structured information in the form of individual names and format strings.
- Allowing features to inherit the format string of administrative units will require spatial relationship tests which are too expensive to be done on the fly so this would need support from the OSM data converters, in particular those that are used for map rendering (like osm2pgsql, Imposm). This is not trivial, especially if you want to take into account that changing the format string of an administrative unit would potentially affect all named features within that unit.
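To illustrate the second point: a converter could do the spatial test once at import time and keep a reverse index from administrative units to the features inside them, so that a changed format string yields the set of features to re-render. The sketch below is purely hypothetical and is not how osm2pgsql or Imposm work; the class, its methods and the feature ids are invented for illustration.

```python
from collections import defaultdict


class FormatIndex:
    """Hypothetical precomputed index; not part of osm2pgsql or Imposm."""

    def __init__(self):
        self.unit_format = {}                     # unit id -> format string
        self.features_in_unit = defaultdict(set)  # unit id -> feature ids

    def add_feature(self, feature_id, enclosing_unit_ids):
        # The expensive spatial test is assumed to have happened already;
        # its result (the enclosing units) is passed in here.
        for unit_id in enclosing_unit_ids:
            self.features_in_unit[unit_id].add(feature_id)

    def set_format(self, unit_id, format_string):
        """Update a unit's format string; return features to re-render."""
        if self.unit_format.get(unit_id) == format_string:
            return set()
        self.unit_format[unit_id] = format_string
        # Conservative: every feature in the unit is reported, even those
        # that take their format string from a more specific unit.
        return set(self.features_in_unit[unit_id])


index = FormatIndex()
index.add_feature("node/1", [51701])
index.add_feature("node/2", [51701])
index.add_feature("node/3", [51477])
dirty = index.set_format(51701, "$de/$fr/$it/$rm")
print(sorted(dirty))  # prints ['node/1', 'node/2']
```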
Another possible point of critique is that the format string is non-verifiable. But obviously if the current name tag is verifiable so is the format string which just describes its structure in an abstract form.
September 14, 2018 at 01:25
Christoph, do you still think something like this is a good idea?
I’d been thinking about offering to add a tagging scheme for local languages. I probably could get permission to use the SIL / Ethnologue database for this, which would provide maps of the local indigenous languages for most of the world. These wouldn’t be based on administrative boundaries. I believe we would also need a way to tag the “official” language(s) used by the government and the primary language used in trade. For example, in Indonesia there is a single official language, but there are about 700 local languages used in areas ranging from two thirds of Java (80 million people) to one small valley (1000 people). In Canada the whole country would have French+English as official languages, Quebec would have French as the primary trade language, the rest of Canada would have English, and areas with Inuit and Native American populations would have a local language.
As you said, the borders between local languages are not something you can verify from a satellite image. But like place names, they can be found by asking the local people in each settlement “What language do you speak here? Do they speak the same language in the next village?”. The official language and primary trade language would be easier to determine from government proclamations and by checking what languages are used most in the market.
September 14, 2018 at 11:45
Yes, I still think this would be a good idea, but it would probably be fairly difficult to establish in OSM: the additional level of abstraction is an additional hurdle for data users, and many mappers like the ability to directly paint the label, so to speak.