dc.contributor.author: Al Rozz, Y
dc.contributor.author: Hamoodat, H
dc.contributor.author: Menezes, R
dc.date.accessioned: 2020-03-27T15:39:47Z
dc.date.issued: 2017-02-23
dc.description.abstract: For more than 5,000 years, we have been communicating using some form of written language. For many scholars, the advent of written language contributed to the development of societies because it enabled knowledge to be passed to future generations without considerable loss of information or ambiguity. Today, it is estimated that we use about 7,000 languages to communicate, but the majority of these do not have a written form; in fact, there are no reliable estimates of how many written languages exist today. There are three main families of written languages: Afro-Asiatic, Indo-European, and Turkic. These families of languages are based on historical family trees. However, with the amount of data available today, one can start looking at language classification using regularities extracted from corpora of text. This paper focuses on regularities of 10 languages from the mentioned families. In order to find features for these languages, we use (1) Heaps’ law, which models the number of distinct words in a corpus as a function of the total number of words in the same corpus, and (2) structural properties of networks created from word co-occurrence in large corpora for different languages. Using clustering approaches, we show that despite differences from years of being used in separate countries, the clustering still seems to respect some historical organization of families. (en_GB)
dc.identifier.citation: In: Gonçalves B., Menezes R., Sinatra R., Zlatic V. (eds) Complex Networks VIII. CompleNet 2017. Springer Proceedings in Complexity, pp. 161-173 (en_GB)
dc.identifier.doi: 10.1007/978-3-319-54241-6_14
dc.identifier.uri: http://hdl.handle.net/10871/120446
dc.language.iso: en (en_GB)
dc.publisher: Springer Nature (en_GB)
dc.rights: © Springer International Publishing AG 2017 (en_GB)
dc.subject: co-occurrence networks (en_GB)
dc.subject: language classification (en_GB)
dc.subject: Heaps’ Law (en_GB)
dc.subject: clustering (en_GB)
dc.title: Characterization of written languages using structural features from common corpora (en_GB)
dc.type: Book chapter (en_GB)
dc.date.available: 2020-03-27T15:39:47Z
dc.identifier.issn: 2213-8684
dc.description: This is the author accepted manuscript. The final version is available from Springer Nature via the DOI in this record. (en_GB)
dc.rights.uri: http://www.rioxx.net/licenses/all-rights-reserved (en_GB)
rioxxterms.version: AM (en_GB)
rioxxterms.licenseref.startdate: 2017-02-23
rioxxterms.type: Book chapter (en_GB)
refterms.dateFCD: 2020-03-27T15:39:04Z
refterms.versionFCD: AM
refterms.dateFOA: 2020-03-27T15:39:54Z

