Characterization of written languages using structural features from common corpora

Al Rozz, Y; Hamoodat, H; Menezes, R

dc.contributor.author	Al Rozz, Y
dc.contributor.author	Hamoodat, H
dc.contributor.author	Menezes, R
dc.date.accessioned	2020-03-27T15:39:47Z
dc.date.issued	2017-02-23
dc.description.abstract	For more than 5,000 years, we have been communicating using some form of written language. For many scholars, the advent of written language contributed to the development of societies because it enabled knowledge to be passed to future generations without considerable loss of information or ambiguity. Today, it is estimated that we use about 7,000 languages to communicate, but the majority of these do not have a written form; in fact, there are no reliable estimates of how many written languages exist today. There are three main families of written languages: Afro-Asiatic, Indo-European, and Turkic. These families of languages are based on historical family-trees. However, with the amount of data available today, one can start looking at language classification using regularities extracted from corpora of text. This paper focuses on regularities of 10 languages from the mentioned families. In order to find features for these languages, we use (1) Heaps’ law, which models the number of distinct words in a corpus as a function of the total number of words in the same corpus, and (2) structural properties of networks created from word co-occurrence in large corpora for different languages. Using clustering approaches, we show that despite differences from years of being used in separate countries, the clustering still seems to respect some historical organization of families.	en_GB
dc.identifier.citation	In: Gonçalves B., Menezes R., Sinatra R., Zlatic V. (eds) Complex Networks VIII. CompleNet 2017. Springer Proceedings in Complexity, pp. 161 - 173	en_GB
dc.identifier.doi	10.1007/978-3-319-54241-6_14
dc.identifier.uri	http://hdl.handle.net/10871/120446
dc.language.iso	en	en_GB
dc.publisher	Springer Nature	en_GB
dc.rights	© Springer International Publishing AG 2017	en_GB
dc.subject	co-occurrence networks	en_GB
dc.subject	language classification	en_GB
dc.subject	Heaps’ Law	en_GB
dc.subject	clustering	en_GB
dc.title	Characterization of written languages using structural features from common corpora	en_GB
dc.type	Book chapter	en_GB
dc.date.available	2020-03-27T15:39:47Z
dc.identifier.issn	2213-8684
dc.description	This is the author accepted manuscript. The final version is available from Springer Nature via the DOI in this record	en_GB
dc.rights.uri	http://www.rioxx.net/licenses/all-rights-reserved	en_GB
rioxxterms.version	AM	en_GB
rioxxterms.licenseref.startdate	2017-02-23
rioxxterms.type	Book chapter	en_GB
refterms.dateFCD	2020-03-27T15:39:04Z
refterms.versionFCD	AM
refterms.dateFOA	2020-03-27T15:39:54Z
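
The abstract describes Heaps’ law as modelling the number of distinct words in a corpus as a function of its total number of words. A minimal sketch of how such a feature might be extracted is shown below, assuming the commonly used form V(n) = K * n^beta and a plain log-log least-squares fit. The function names `vocabulary_growth` and `fit_heaps` and the whitespace tokenization are illustrative assumptions, not the authors' pipeline.

```python
import math


def vocabulary_growth(tokens):
    """Return V(n) for n = 1..len(tokens): distinct words among the first n tokens."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def fit_heaps(growth):
    """Fit log V = log K + beta * log n by ordinary least squares.

    Assumes Heaps' law in the form V(n) = K * n**beta and at least two points.
    Returns the pair (K, beta).
    """
    xs = [math.log(n) for n in range(1, len(growth) + 1)]
    ys = [math.log(v) for v in growth]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


if __name__ == "__main__":
    # Toy input; a real corpus would need language-appropriate tokenization.
    text = "the cat sat on the mat and the dog sat on the log"
    k, beta = fit_heaps(vocabulary_growth(text.split()))
    print(f"K = {k:.3f}, beta = {beta:.3f}")
```

Under these assumptions, the fitted pair (K, beta) for each language could serve as one feature alongside network measures when clustering.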

Files in this item

Name: Paper64.pdf
Size: 1.632Mb
Format: PDF

This item appears in the following Collection(s)

Computer Science
