library(rcldf)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:testthat':
#>
#> matches
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(patchwork)
theme_set(theme_classic())
The rcldf package provides tools to read and interact
with CLDF datasets in R. This vignette demonstrates how to
install the package, load these datasets, and use them.
Cross-Linguistic Data Formats (CLDF, Forkel et al. 2018) is a standardized data format designed to handle cross-linguistic and cross-cultural datasets. CLDF provides a consistent specification and package format for common types of linguistic and cultural data, e.g. word lists, grammatical features, and cultural traits. The aim of the format is to provide a simple, reliable way to share and re-use these data, and so make new analyses possible.
There are currently more than 250 CLDF datasets available containing data from the world’s languages and cultures, including everything from catalogues of linguistic metadata (e.g. Glottolog, EndangeredLanguages.com), to word lists of lexical data (e.g. Lexibank), grammatical features (e.g. WALS and Grambank), phonetic information (e.g. Phoible), geographic information (e.g. Wurm & Hattori 1981/1983), and religious and cultural databases (e.g. D-PLACE and Pulotu).
You can install rcldf directly from GitHub using the
devtools package:
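A minimal sketch of the installation step. The repository path SimonGreenhill/rcldf is an assumption, as it is not stated in the text above; check the project page and substitute the correct path if it differs.

```r
# Install devtools first if it is not already available.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Assumption: the GitHub path "SimonGreenhill/rcldf" is not given in this
# vignette text; replace it if the package lives elsewhere.
devtools::install_github("SimonGreenhill/rcldf", dependencies = TRUE)
```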
Once installed, load a CLDF dataset by creating a cldf
object from a dataset. You can point to the dataset using a path or URL
where the dataset is located:
library(rcldf)
# Load from a local directory:
ds <- cldf('/path/to/dir/wals_1a_cldf')
# or load from a specific metadata file:
ds <- cldf('/path/to/dir/wals_1a_cldf/StructureDataset-metadata.json')
# or load from Zenodo:
ds <- cldf("https://zenodo.org/record/7844558/files/grambank/grambank-v1.0.3.zip?download=1")
# or load from GitHub:
ds <- cldf('https://github.com/lexibank/abvd')
Once loaded, a cldf object has various bits of
information. To show this I’ll use a small dataset that comes packaged
with rcldf. This dataset contains a wordlist from a number
of Huon Peninsula languages from New Guinea, originally detailed in
McElhanon (1967):
library(rcldf)
ds <- cldf(system.file("extdata/huon", package="rcldf"))
# this dataset has 4 tables:
ds
#> A CLDF dataset with 4 tables (CognateTable, FormTable, LanguageTable, ParameterTable)
#>
#> McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45.
# more details:
summary(ds)
#> A Cross-Linguistic Data Format (CLDF) dataset:
#> Name: McElhanon 1967 Huon Peninsula data
#> Path: /Users/simon/projects/rcldf/rcldf/inst/extdata/huon
#> Type: http://cldf.clld.org/v1.0/terms.rdf#Wordlist
#> Tables:
#> 1/4: CognateTable (10 columns, 1960 rows)
#> 2/4: FormTable (11 columns, 1960 rows)
#> 3/4: LanguageTable (9 columns, 14 rows)
#> 4/4: ParameterTable (4 columns, 140 rows)
#> Sources: 0
#> Cite:
#> McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45.
So here we have a dataset with tables of languages, parameters
(=words), forms (=the lexical items), and cognates (=cognacy information
showing how the lexical items are related). There is also other
information here, e.g. the citation for where the dataset
came from, the path where the dataset is stored, and which
Type of CLDF specification this dataset adheres to.
We can get full schema information too:
get_foreign_keys(ds)
#> SourceTable SourceColumn DestinationURL DestinationTable DestinationColumn
#> 1 forms.csv Language_ID languages.csv LanguageTable ID
#> 2 forms.csv Parameter_ID parameters.csv ParameterTable ID
#> 3 cognates.csv Form_ID forms.csv FormTable ID
schema(ds)
#> forms.csv
#> ---------
#> name link property
#> 1 Cognacy <NA>
#> 2 Comment <NA>
#> 3 Form <NA>
#> 4 ID <NA>
#> 5 Language_ID languages.csv:ID <NA>
#> 6 Loan <NA>
#> 7 Local_ID <NA>
#> 8 Parameter_ID parameters.csv:ID <NA>
#> 9 Segments <NA>
#> 10 Source <NA>
#> 11 Value <NA>
#> languages.csv
#> -------------
#> name link property
#> 1 Family <NA>
#> 2 Glottocode <NA>
#> 3 Glottolog_Name <NA>
#> 4 ID <NA>
#> 5 ISO639P3code <NA>
#> 6 Latitude <NA>
#> 7 Longitude <NA>
#> 8 Macroarea <NA>
#> 9 Name <NA>
#> parameters.csv
#> --------------
#> name link property
#> 1 Concepticon_Gloss <NA>
#> 2 Concepticon_ID <NA>
#> 3 ID <NA>
#> 4 Name <NA>
#> cognates.csv
#> ------------
#> name link property
#> 1 Alignment CLDF:alignment
#> 2 Alignment_Method CLDF:alignment
#> 3 Alignment_Source CLDF:alignment
#> 4 Cognate_Detection_Method CLDF:alignment
#> 5 Cognateset_ID CLDF:alignment
#> 6 Doubt CLDF:alignment
#> 7 Form CLDF:alignment
#> 8 Form_ID forms.csv:ID CLDF:alignment
#> 9 ID CLDF:alignment
#> 10 Source CLDF:alignment
Each table is attached to the ds$tables list, so to access
them you need to call ds$tables$<tablename>. These
are simply data frames (or tibbles), so you can then do anything you want
with them:
names(ds$tables)
#> [1] "FormTable" "LanguageTable" "ParameterTable" "CognateTable"
# let's look at the languages --
head(ds$tables$LanguageTable)
#> # A tibble: 6 × 9
#> ID Name Glottocode Glottolog_Name ISO639P3code Macroarea Latitude
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 borong Kosorong boro1279 <NA> ksr <NA> NA
#> 2 burum Burum buru1306 <NA> bmu <NA> NA
#> 3 dedua Dedua dedu1240 <NA> ded <NA> NA
#> 4 kate Kâte kate1253 <NA> kmg <NA> NA
#> 5 komba Komba komb1273 <NA> kpf <NA> NA
#> 6 kube Hube kube1244 <NA> kgf <NA> NA
#> # ℹ 2 more variables: Longitude <dbl>, Family <chr>
# and the parameters - in this case the words in the wordlist
head(ds$tables$ParameterTable)
#> # A tibble: 6 × 4
#> ID Name Concepticon_ID Concepticon_Gloss
#> <chr> <chr> <chr> <chr>
#> 1 i I 1209 I
#> 2 thou thou 1215 THOU
#> 3 we we 1212 WE
#> 4 who who 1235 WHO
#> 5 what what 1236 WHAT
#> 6 that that 78 THAT
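# and finally, the lexical items themselves. This dataset is a Wordlist, so they
# are stored in ds$tables$FormTable; it has no ValueTable, hence the NULL here:
ds$tables$ValueTable
#> NULL
CLDF datasets have sources stored in BibTeX format. We don’t load them by default, as it can take a long time to parse the BibTeX file correctly.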
You can load and access them like this, and the sources are then
available in ds$sources in bib2df format:
ds <- cldf(system.file("extdata/huon", package="rcldf"), load_bib=TRUE)
# or if you loaded the CLDF without sources the first time you can add them now:
ds <- read_bib(ds)
ds$sources
#> # A tibble: 1 × 26
#> CATEGORY BIBTEXKEY ADDRESS ANNOTE AUTHOR BOOKTITLE CHAPTER CROSSREF EDITION
#> <chr> <chr> <chr> <chr> <list> <chr> <chr> <chr> <chr>
#> 1 ARTICLE McElhanon19… <NA> <NA> <chr> <NA> <NA> <NA> <NA>
#> # ℹ 17 more variables: EDITOR <list>, HOWPUBLISHED <chr>, INSTITUTION <chr>,
#> # JOURNAL <chr>, KEY <chr>, MONTH <chr>, NOTE <chr>, NUMBER <chr>,
#> # ORGANIZATION <chr>, PAGES <chr>, PUBLISHER <chr>, SCHOOL <chr>,
#> # SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>, YEAR <dbl>
Sometimes people want to have all the data from a CLDF dataset as one
dataframe. Use as.cldf.wide to do this, passing it the name
of a table to act as the base.
This will take the base table, and resolve all foreign keys (usually
*_ID) into their own columns. Name clashes between the two
tables are resolved by appending the table name (e.g. the column
Name in the original CodeTable will become
Name.CodeTable).
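The key-resolution and renaming behaviour described above can be illustrated with a plain dplyr join. This is only a sketch of the idea on toy tables, not the code that as.cldf.wide actually runs; the two tibbles and their contents are made up for illustration (the ISO codes are borrowed from the language table shown earlier).

```r
library(dplyr)

# Toy stand-ins for a base table and the table its foreign key points to.
forms <- tibble(
  ID = c("f1", "f2"),
  Name = c("form one", "form two"),
  Language_ID = c("kate", "ono")
)
languages <- tibble(
  ID = c("kate", "ono"),
  Name = c("Kâte", "Ono"),
  ISO639P3code = c("kmg", "ons")
)

# Resolve the Language_ID foreign key against languages$ID.
# The clashing Name column from the joined table gets the table name appended.
wide <- forms |>
  left_join(
    languages,
    by = c("Language_ID" = "ID"),
    suffix = c("", ".LanguageTable")
  )

names(wide)
```

The base table keeps its own column names, and only the clashing column from the joined table is renamed, which mirrors the Name.LanguageTable style of column seen in the wide output.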
For example, this dataset has a FormTable which connects
to the ParameterTable via Parameter_ID and the
LanguageTable via Language_ID.
Using as.cldf.wide we can combine all the information
from ParameterTable and LanguageTable into the
FormTable:
as.cldf.wide(ds, 'FormTable')
#> Joining Language_ID -> LanguageTable -> ID
#> Joining Parameter_ID -> ParameterTable -> ID
#> # A tibble: 1,960 × 22
#> ID Local_ID Language_ID Parameter_ID Value Form Segments Comment Source
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 borong… 90452 borong i ni ni n i I McElh…
#> 2 borong… 91684 borong where daaŋ… daaŋ… d a a ŋ… where McElh…
#> 3 borong… 90466 borong thou gi gi g i thou McElh…
#> 4 borong… 91978 borong itsflower dzur… dzur… d z u r… (its) … McElh…
#> 5 borong… 90480 borong we nono nono n o n o we McElh…
#> 6 borong… 91810 borong hethrows mesa… mesa… m e s a… (he) t… McElh…
#> 7 borong… 90578 borong hesays dzo- dzo- d z o - (he) s… McElh…
#> 8 borong… 92076 borong itswing eŋga… eŋga… e ŋ g a… (its) … McElh…
#> 9 borong… 90704 borong hegivesme non- non- n o n - (he) g… McElh…
#> 10 borong… 92202 borong dull dzit… dzit… d z i t… dull McElh…
#> # ℹ 1,950 more rows
#> # ℹ 13 more variables: Cognacy <chr>, Loan <lgl>, Name.LanguageTable <chr>,
#> # Glottocode <chr>, Glottolog_Name <chr>, ISO639P3code <chr>,
#> # Macroarea <chr>, Latitude <dbl>, Longitude <dbl>, Family <chr>,
#> # Name.ParameterTable <chr>, Concepticon_ID <chr>, Concepticon_Gloss <chr>
Sometimes you just want to get one table. To do this call
get_table_from with the table type, and dataset path:
get_table_from('LanguageTable', system.file("extdata/huon", package="rcldf"))
#> # A tibble: 14 × 9
#> ID Name Glottocode Glottolog_Name ISO639P3code Macroarea Latitude
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 borong Kosorong boro1279 <NA> ksr <NA> NA
#> 2 burum Burum buru1306 <NA> bmu <NA> NA
#> 3 dedua Dedua dedu1240 <NA> ded <NA> NA
#> 4 kate Kâte kate1253 <NA> kmg <NA> NA
#> 5 komba Komba komb1273 <NA> kpf <NA> NA
#> 6 kube Hube kube1244 <NA> kgf <NA> NA
#> 7 mape Mape mape1249 <NA> mlh <NA> NA
#> 8 mesem Mese mese1244 <NA> mci <NA> NA
#> 9 mindik Mindik buru1306 <NA> bmu <NA> NA
#> 10 nabak Nabak naba1256 <NA> naf <NA> NA
#> 11 ono Ono onoo1246 <NA> ons <NA> NA
#> 12 selepet Selepet sele1250 <NA> spl <NA> NA
#> 13 timbe Timbe timb1251 <NA> tim <NA> NA
#> 14 tobo Tobo tobo1251 <NA> kgf <NA> NA
#> # ℹ 2 more variables: Longitude <dbl>, Family <chr>
Perhaps you want to know which version of a dataset you have without
loading the whole dataset. Use get_details:
get_details(system.file("extdata/huon", package="rcldf"))
#> Title
#> 1 McElhanon 1967 Huon Peninsula data
#> Path Size
#> 1 /Users/simon/projects/rcldf/rcldf/inst/extdata/huon 299051
#> Citation
#> 1 McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45.
#> ConformsTo
#> 1 http://cldf.clld.org/v1.0/terms.rdf#Wordlist
rcldf has a couple of utility functions to get the
latest versions of the Glottolog, Concepticon, CLTS, and D-PLACE CLDF
reference catalogues:
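The dedicated utility functions are not reproduced in this text, so the sketch below falls back on the generic cldf() loader instead. It assumes the GitHub repositories glottolog/glottolog-cldf and D-PLACE/dplace-cldf, whose names match the cache paths listed in this vignette. Note that a GitHub URL fetches the current development state, which may differ from the latest published release.

```r
# Repository URLs are assumptions inferred from cache paths in this vignette.
catalogue_urls <- c(
  glottolog = "https://github.com/glottolog/glottolog-cldf",
  dplace    = "https://github.com/D-PLACE/dplace-cldf"
)

# Requires rcldf and network access; each call downloads a full repository,
# so it is only run in an interactive session here.
if (requireNamespace("rcldf", quietly = TRUE) && interactive()) {
  glottolog <- rcldf::cldf(catalogue_urls[["glottolog"]])
  dplace    <- rcldf::cldf(catalogue_urls[["dplace"]])
}
```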
rcldf has functions to find other datasets and load them
easily using the CLDF-META project:
datasets()
#> # A tibble: 819 × 21
#> ID Dataset Organisation Version Name Description Contributor Citation
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 18439527 rantane… Glottography v1.0.1 "Glo… "<p>Cite t… <NA> <NA>
#> 2 18439506 asher20… Glottography v1.0.1 "Glo… "<p>Cite t… <NA> <NA>
#> 3 18439497 tarble1… Glottography v1.0.1 "Glo… "<p>Cite t… <NA> <NA>
#> 4 18439437 suttles… Glottography v1.0.1 "Glo… "<p>Cite t… <NA> <NA>
#> 5 18439377 hochste… Glottography v1.0.1 "Glo… "<p>Cite t… <NA> <NA>
#> 6 18439251 haynie2… Glottography v1.0.1 "Glo… "<p>Cite t… <NA> <NA>
#> 7 18439221 goddard… Glottography v1.0.1 "Glo… "<p>Cite t… <NA> <NA>
#> 8 18439153 zucchi2… Glottography v1.0.1 "Glo… "<p>Cite t… <NA> <NA>
#> 9 18439140 wurm198… Glottography v1.0.1 "Glo… "<p>Cite t… <NA> <NA>
#> 10 18439091 wikiped… Glottography v1.0.1 "Glo… "<p>Cite t… <NA> <NA>
#> # ℹ 809 more rows
#> # ℹ 13 more variables: Creators <chr>, Contributors <chr>, DOI <chr>,
#> # Concept_DOI <chr>, Concept_ID <chr>, Date <chr>, Communities <chr>,
#> # License <chr>, Zenodo_Link <chr>, Zenodo_ID <chr>, Zenodo_Keyword <chr>,
#> # Zenodo_Type <chr>, GitHub_Link <chr>
And to easily install one, find the Dataset name:
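One way to look up a Dataset name is to filter the index with dplyr. The helper find_dataset below is hypothetical (it is not part of rcldf); it only assumes the Dataset, Organisation and Version columns visible in the datasets() output above. The load_dataset call in the comments is the one that appears later in this vignette.

```r
library(dplyr)

# Hypothetical helper (not part of rcldf): search an index such as the one
# returned by datasets() for entries whose Dataset name matches a pattern.
find_dataset <- function(index, pattern) {
  index |>
    filter(grepl(pattern, Dataset, ignore.case = TRUE)) |>
    select(Dataset, Organisation, Version)
}

# With rcldf attached and network access, usage would look like:
#   find_dataset(datasets(), "grambank")
#   grambank <- load_dataset("grambank")
```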
When you load a dataset from a URL, rcldf downloads the
dataset and unpacks it to a cache directory. By default this is a
temporary directory which will be deleted when you close R.
However, by specifying a directory or using
tools::R_user_dir("rcldf", which = "cache") you can re-use
the dataset later.
To see where downloads will be saved:
You can set the cache_dir setting for the session by:
If you want to set this permanently, then edit your .Renviron file to
add the line: RCLDF_CACHE_DIR=/path/somewhere.
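The cache settings described above can be sketched with base R alone. rcldf's own helper calls for these steps are not shown in this text, so the example below uses only tools::R_user_dir and the RCLDF_CACHE_DIR variable named above. It is an assumption that rcldf picks up a change to this variable made in the middle of a session.

```r
# A persistent, per-user cache location for rcldf (base R only, R >= 4.0).
cache_dir <- tools::R_user_dir("rcldf", which = "cache")
cache_dir

# Set the environment variable for the current session only.
# Assumption: rcldf consults RCLDF_CACHE_DIR when it needs the cache
# directory; this is not confirmed for changes made after loading the package.
Sys.setenv(RCLDF_CACHE_DIR = cache_dir)
Sys.getenv("RCLDF_CACHE_DIR")
```

Putting the same path in .Renviron makes the setting apply to every new R session without any code.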
To see what datasets you’ve downloaded:
list_cache_files()
#> Title
#> 1 <NA>
#> 2 CLDF dataset derived from Greenhill et al.'s "Austronesian Basic Vocabulary Database" from 2020 focusing on Oceanic languages
#> 3 CLDF dataset derived from McElhanon's "Preliminary Observations on Huon Peninsula Languages" from 1967
#> 4 CLDF dataset derived from Heggarty, Paul & Anderson, Cormac & Scarborough, Matthew’s "Indo-European Cognate Relationships database" ([IE-CoR version 1.0](https://github.com/lexibank/iecor/releases/tag/v1.0)) from 2019
#> 5 D-PLACE aggregated dataset
#> 6 The World Atlas of Language Structures Online
#> 7 glottolog/glottolog: Glottolog database 5.2 as CLDF
#> 8 Vanuatu Voices
#> 9 CLDF dataset derived from Kitchen et al.'s "Bayesian phylogenetic analysis of Semitic languages" from 2009
#> 10 Phlorest phylogeny derived from Greenhill, Haynie et al. 2023 'Uto-Aztecan (Greenhill, Haynie et al.)'
#> 11 Phlorest phylogeny derived from Atkinson 2006 'From Species to Languages: a phylogenetic approach to human prehistory'
#> 12 Phlorest phylogeny derived from Dunn et al. 2011 'Evolved structure of language shows lineage-specific trends in word-order universals'
#> 13 Glottography dataset derived from Tarble de Scaramelli and Zucchi 1984 "Nuevos Datos sobre la Arqueología Tardia del Orinoco: La Serie Valloide"
#> 14 <NA>
#> 15 Atlas of Pidgin and Creole Language Structures Online
#> 16 Grambank v1.0
#> 17 CLDF dataset derived from Greenhill et al.'s "Austronesian Basic Vocabulary Database" from 2022 focusing on Oceanic languages
#> Path
#> 1 /Users/simon/src/cldf/zenodo_api_records_13149095_files_lexibank_abvdoceanic_v0_3__8f9ed6b8b0ad77dae7b903feb3714d78/lexibank-abvdoceanic-2220f31/cldf-structure/StructureDataset-metadata.json
#> 2 /Users/simon/src/cldf/zenodo_api_records_13149095_files_lexibank_abvdoceanic_v0_3__8f9ed6b8b0ad77dae7b903feb3714d78/lexibank-abvdoceanic-2220f31/cldf/cldf-metadata.json
#> 3 /Users/simon/src/cldf/zenodo_api_records_13269405_files_lexibank_mcelhanonhuon_v1__3491d2ff6b2244176ce65a8843103350/lexibank-mcelhanonhuon-8b3f2e0/cldf/cldf-metadata.json
#> 4 /Users/simon/src/cldf/zenodo_api_records_13304537_files_lexibank_iecor_v1_2_zip_co_215dfb3a0afcd44b29231c8ee117202e/lexibank-iecor-efd7442/cldf/cldf-metadata.json
#> 5 /Users/simon/src/cldf/zenodo_api_records_13326769_files_D_PLACE_dplace_cldf_v3_1_1_32d05f55745f24aac282695bf37ed56d/D-PLACE-dplace-cldf-b1e5665/cldf/StructureDataset-metadata.json
#> 6 /Users/simon/src/cldf/zenodo_api_records_13950591_files_cldf_datasets_wals_v2020_4_5479aaf87253f1af7c3fe0aa4fdf3372/cldf-datasets-wals-0f5cd82/cldf/StructureDataset-metadata.json
#> 7 /Users/simon/src/cldf/zenodo_api_records_15640174_files_glottolog_glottolog_cldf_v_479ca3c88bab001b486810c64d7e435a/glottolog-glottolog-cldf-29aa9f0/cldf/cldf-metadata.json
#> 8 /Users/simon/src/cldf/zenodo_api_records_17471548_files_lexibank_vanuatuvoices_v1__f0c806ae82e27d17cfca6431254e54fb/lexibank-vanuatuvoices-16476a2/cldf/cldf-metadata.json
#> 9 /Users/simon/src/cldf/zenodo_api_records_17510140_files_lexibank_kitchensemitic_v2_ea904d4a94b44d98e160f4160edfbd87/lexibank-kitchensemitic-3a7f63b/cldf/cldf-metadata.json
#> 10 /Users/simon/src/cldf/zenodo_api_records_17572679_files_phlorest_greenhill_et_al20_c090ea5624629f8b7d1a212c4b73c8c6/phlorest-greenhill_et_al2023-a2f2430/cldf/Generic-metadata.json
#> 11 /Users/simon/src/cldf/zenodo_api_records_17581426_files_phlorest_atkinson2006_v1_3_8142efa7761471d573a35d4cf58fa619/phlorest-atkinson2006-e224c6a/cldf/Generic-metadata.json
#> 12 /Users/simon/src/cldf/zenodo_api_records_17583660_files_phlorest_dunn_et_al2011_v1_21e60c986cea5c31076f18c3793f5d34/phlorest-dunn_et_al2011-cc21b80/cldf/Generic-metadata.json
#> 13 /Users/simon/src/cldf/zenodo_api_records_18439497_files_Glottography_tarble1984nue_8600d8ef1bec5988577b70cba55322d0/Glottography-tarble1984nuevos-ae7153b/cldf/Generic-metadata.json
#> 14 /Users/simon/src/cldf/zenodo_api_records_2677911_files_cldf_datasets_phoible_v2_0__b70ad37320345e8f0730c59560fcdcc0/cldf-datasets-phoible-f36deac/cldf/StructureDataset-metadata.json
#> 15 /Users/simon/src/cldf/zenodo_api_records_3823888_files_cldf_datasets_apics_v2013_z_abce336a9f28d01e998c652ced58b3db/cldf-datasets-apics-4ed59b5/cldf/StructureDataset-metadata.json
#> 16 /Users/simon/src/cldf/zenodo_api_records_7844558_files_grambank_grambank_v1_0_3_zi_9afd43bdfb78f690e2ef8028d020aaae/grambank-grambank-7ae000c/cldf/StructureDataset-metadata.json
#> 17 /Users/simon/src/cldf/zenodo_api_records_7914774_files_SimonGreenhill_abvdoutliers_e4b6197af51f928f20d26d2f53880232/SimonGreenhill-abvdoutliers-7022d27/cldf/cldf-metadata.json
#> Size
#> 1 623955
#> 2 17663190
#> 3 402525
#> 4 7312488
#> 5 75836395
#> 6 9139849
#> 7 68950835
#> 8 24080436
#> 9 489990
#> 10 1525348
#> 11 23061
#> 12 485453
#> 13 349985
#> 14 11578033
#> 15 11008554
#> 16 56746518
#> 17 2720268
#> Citation
#> 1 <NA>
#> 2 Greenhill, S.J., Blust. R, & Gray, R.D. (2008). The Austronesian Basic Vocabulary Database: From Bioinformatics to Lexomics. Evolutionary Bioinformatics, 4:271-283.
#> 3 McElhanon, K.A. 1967. Preliminary Observations on Huon Peninsula Languages. Oceanic Linguistics. 6, 1-45.
#> 4 Heggarty, Paul & Anderson, Cormac & Scarborough, Matthew 2024. Indo-European Cognate Relationships database (IE-CoR version 1.1). Leipzig: Max Planck Institute for Evolutionary Anthropology
#> 5 Kathryn R. Kirby, Russell D. Gray, Simon J. Greenhill, Fiona M. Jordan, Stephanie Gomes-Ng, Hans-Jörg Bibiko, Damián E. Blasi, Carlos A. Botero, Claire Bowern, Carol R. Ember, Dan Leehr, Bobbi S. Low, Joe McCarter, William Divale, and Michael C. Gavin. (2016). D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity. PLoS ONE, 11(7): e0158391. doi:10.1371/journal.pone.0158391.
#> 6 Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013. The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at https://wals.info)
#> 7 Hammarström, Harald & Forkel, Robert & Haspelmath, Martin & Bank, Sebastian. 2025. Glottolog 5.2. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at https://glottolog.org)
#> 8 Lana Takau, Tom Fitzpatrick, Mary Walworth, Aviva Shimelman, Sandrine Bessis, Tom Ennever, Iveth Rodriguez, Hans-Jörg Bibiko, Daria Dërmaku, Murray Garde, Marie-France Duhamel, Giovanni Abete, Laura Wägerle, Kaitip W. Kami, Tihomir Rangelov, & Russell Gray. (2025). Vanuatu Voices (v1.4.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4309140
#> 9 Bayesian phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic in the Near East. Andrew Kitchen, Christopher Ehret, Shiferaw Assefa, Connie J. Mulligan. Proc. R. Soc. B 2009 -; DOI: 10.1098/rspb.2009.0408. Published 29 April 2009
#> 10 Greenhill, S., Haynie, H., Ross, R., Chira, A., List, J.-M., Campbell, L., Botero, C., & Gray, R. (2023). A recent northern origin for the Uto-Aztecan family. Language. https://doi.org/10.1353/lan.0.0276
#> 11 Atkinson, Quentin D. 2006. From Species to Languages: a phylogenetic approach to human prehistory. PhD Thesis, University of Auckland, New Zealand.
#> 12 Dunn M, Greenhill SJ, Levinson SC & Gray RD. 2011. Evolved structure of language shows lineage-specific trends in word-order universals. Nature, 473(7345), 79-82.
#> 13 Scaramelli, Kay Tarble de & A. Zucchi. 1984. Nuevos Datos sobre la Arqueología Tardia del Orinoco: La Serie Valloide. Acta Científica Venezolana 35. 434-445.
#> 14 <NA>
#> 15 Michaelis, Susanne Maria & Maurer, Philippe & Haspelmath, Martin & Huber, Magnus (eds.) 2013. Atlas of Pidgin and Creole Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology.
#> 16 Skirgård, Hedvig and Haynie, Hannah J. and Blasi, Damián E. and Hammarström, Harald and Collins, Jeremy and Latarche, Jay J. and Lesage, Jakob and Weber, Tobias and Witzlack-Makarevich, Alena and Passmore, Sam and Chira, Angela and Maurits, Luke and Dinnage, Russell and Dunn, Michael and Reesink, Ger and Singer, Ruth and Bowern, Claire and Epps, Patience and Hill, Jane and Vesakoski, Outi and Robbeets, Martine and Abbas, Noor Karolin and Auer, Daniel and Bakker, Nancy A. and Barbos, Giulia and Borges, Robert D. and Danielsen, Swintha and Dorenbusch, Luise and Dorn, Ella and Elliott, John and Falcone, Giada and Fischer, Jana and Ghanggo Ate, Yustinus and Gibson, Hannah and Göbel, Hans-Philipp and Goodall, Jemima A. and Gruner, Victoria and Harvey, Andrew and Hayes, Rebekah and Heer, Leonard and Herrera Miranda, Roberto E. and Hübler, Nataliia and Huntington-Rainey, Biu and Ivani, Jessica K. and Johns, Marilen and Just, Erika and Kashima, Eri and Kipf, Carolina and Klingenberg, Janina V. and König, Nikita and Koti, Aikaterina and Kowalik, Richard G. A. and Krasnoukhova, Olga and Lindvall, Nora L.M. and Lorenzen, Mandy and Lutzenberger, Hannah and Martins, Tônia R.A. and Mata German, Celia and van der Meer, Suzanne and Montoya Samamé, Jaime and Müller, Michael and Muradoglu, Saliha and Neely, Kelsey and Nickel, Johanna and Norvik, Miina and Oluoch, Cheryl Akinyi and Peacock, Jesse and Pearey, India O.C. and Peck, Naomi and Petit, Stephanie and Pieper, Sören and Poblete, Mariana and Prestipino, Daniel and Raabe, Linda and Raja, Amna and Reimringer, Janis and Rey, Sydney C. and Rizaew, Julia and Ruppert, Eloisa and Salmon, Kim K. and Sammet, Jill and Schembri, Rhiannon and Schlabbach, Lars and Schmidt, Frederick W.P. 
and Skilton, Amalia and Smith, Wikaliler Daniel and de Sousa, Hilário and Sverredal, Kristin and Valle, Daniel and Vera, Javier and Voß, Judith and Witte, Tim and Wu, Henry and Yam, Stephanie and Ye 葉婧婷, Jingting and Yong, Maisie and Yuditha, Tessa and Zariquiey, Roberto and Forkel, Robert and Evans, Nicholas and Levinson, Stephen C. and Haspelmath, Martin and Greenhill, Simon J. and Atkinson, Quentin D. and Gray, Russell D. (in press). Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances.
#> 17 Greenhill, S.J., Blust. R, & Gray, R.D. (2008). The Austronesian Basic Vocabulary Database: From Bioinformatics to Lexomics. Evolutionary Bioinformatics, 4:271-283.
#> ConformsTo
#> 1 http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> 2 http://cldf.clld.org/v1.0/terms.rdf#Wordlist
#> 3 http://cldf.clld.org/v1.0/terms.rdf#Wordlist
#> 4 http://cldf.clld.org/v1.0/terms.rdf#Wordlist
#> 5 http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> 6 http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> 7 http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> 8 http://cldf.clld.org/v1.0/terms.rdf#Wordlist
#> 9 http://cldf.clld.org/v1.0/terms.rdf#Wordlist
#> 10 http://cldf.clld.org/v1.0/terms.rdf#Generic
#> 11 http://cldf.clld.org/v1.0/terms.rdf#Generic
#> 12 http://cldf.clld.org/v1.0/terms.rdf#Generic
#> 13 http://cldf.clld.org/v1.0/terms.rdf#Generic
#> 14 http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> 15 http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> 16 http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> 17 http://cldf.clld.org/v1.0/terms.rdf#Wordlist
You can re-use datasets in your cache:
cldf(list_cache_files()[1, 'Path'])
#> A CLDF dataset with 3 tables (LanguageTable, ParameterTable, ValueTable)
#>
#>
Or just save them to a particular directory:
To show you how to use rcldf to analyse datasets, we’re
going to test whether languages that distinguish inclusive pronouns from
exclusive pronouns (i.e. clusivity, see https://en.wikipedia.org/wiki/Clusivity) also tend to
have high rigidity in social structure.
To do this, we will use the Grambank Feature GB028: Is there a distinction between inclusive and exclusive?, and the D-PLACE Trait EA113: Degree of rigidity in social structures.
First, let’s get the published version of Grambank off Zenodo. It’s always a good idea to use the published version as this is a citable and versioned product which makes replicating your analysis easier for other researchers. To do this, go to the Zenodo page for Grambank, and find the download link under the ‘Files’ section and copy it.
It will look like this: https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip?download=1
Give that to rcldf:
grambank <- cldf("https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip?download=1")
grambank
#> A CLDF dataset with 6 tables (CodeTable, contributors.csv, families.csv, LanguageTable, ParameterTable, ValueTable)
#>
#> Skirgård, Hedvig and Haynie, Hannah J. and Blasi, Damián E. and Hammarström, Harald and Collins, Jeremy and Latarche, Jay J. and Lesage, Jakob and Weber, Tobias and Witzlack-Makarevich, Alena and Passmore, Sam and Chira, Angela and Maurits, Luke and Dinnage, Russell and Dunn, Michael and Reesink, Ger and Singer, Ruth and Bowern, Claire and Epps, Patience and Hill, Jane and Vesakoski, Outi and Robbeets, Martine and Abbas, Noor Karolin and Auer, Daniel and Bakker, Nancy A. and Barbos, Giulia and Borges, Robert D. and Danielsen, Swintha and Dorenbusch, Luise and Dorn, Ella and Elliott, John and Falcone, Giada and Fischer, Jana and Ghanggo Ate, Yustinus and Gibson, Hannah and Göbel, Hans-Philipp and Goodall, Jemima A. and Gruner, Victoria and Harvey, Andrew and Hayes, Rebekah and Heer, Leonard and Herrera Miranda, Roberto E. and Hübler, Nataliia and Huntington-Rainey, Biu and Ivani, Jessica K. and Johns, Marilen and Just, Erika and Kashima, Eri and Kipf, Carolina and Klingenberg, Janina V. and König, Nikita and Koti, Aikaterina and Kowalik, Richard G. A. and Krasnoukhova, Olga and Lindvall, Nora L.M. and Lorenzen, Mandy and Lutzenberger, Hannah and Martins, Tônia R.A. and Mata German, Celia and van der Meer, Suzanne and Montoya Samamé, Jaime and Müller, Michael and Muradoglu, Saliha and Neely, Kelsey and Nickel, Johanna and Norvik, Miina and Oluoch, Cheryl Akinyi and Peacock, Jesse and Pearey, India O.C. and Peck, Naomi and Petit, Stephanie and Pieper, Sören and Poblete, Mariana and Prestipino, Daniel and Raabe, Linda and Raja, Amna and Reimringer, Janis and Rey, Sydney C. and Rizaew, Julia and Ruppert, Eloisa and Salmon, Kim K. and Sammet, Jill and Schembri, Rhiannon and Schlabbach, Lars and Schmidt, Frederick W.P. 
and Skilton, Amalia and Smith, Wikaliler Daniel and de Sousa, Hilário and Sverredal, Kristin and Valle, Daniel and Vera, Javier and Voß, Judith and Witte, Tim and Wu, Henry and Yam, Stephanie and Ye 葉婧婷, Jingting and Yong, Maisie and Yuditha, Tessa and Zariquiey, Roberto and Forkel, Robert and Evans, Nicholas and Levinson, Stephen C. and Haspelmath, Martin and Greenhill, Simon J. and Atkinson, Quentin D. and Gray, Russell D. (in press). Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances.
# or:
# grambank <- load_dataset('grambank')
Now, let’s use dplyr to get the variable we want from
grambank. We use summary to see what the dataset looks
like. It is a CLDF ‘Structure’ Dataset with six tables:
summary(grambank)
#> A Cross-Linguistic Data Format (CLDF) dataset:
#> Name: Grambank v1.0
#> Path: /Users/simon/Library/Caches/org.R-project.R/R/rcldf/zenodo_records_7844558_files_grambank_grambank_v1_0_3_zip_5ba67f1e8557c79b5e4fffaab8f58bcc/grambank-grambank-7ae000c/cldf
#> Type: http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> Tables:
#> 1/6: CodeTable (4 columns, 398 rows)
#> 2/6: contributors.csv (5 columns, 139 rows)
#> 3/6: families.csv (2 columns, 215 rows)
#> 4/6: LanguageTable (13 columns, 2467 rows)
#> 5/6: ParameterTable (12 columns, 195 rows)
#> 6/6: ValueTable (9 columns, 441663 rows)
#> Sources: 0
#> Cite:
#> Skirgård, Hedvig and Haynie, Hannah J. and Blasi, Damián E. and Hammarström, Harald and Collins, Jeremy and Latarche, Jay J. and Lesage, Jakob and Weber, Tobias and Witzlack-Makarevich, Alena and Passmore, Sam and Chira, Angela and Maurits, Luke and Dinnage, Russell and Dunn, Michael and Reesink, Ger and Singer, Ruth and Bowern, Claire and Epps, Patience and Hill, Jane and Vesakoski, Outi and Robbeets, Martine and Abbas, Noor Karolin and Auer, Daniel and Bakker, Nancy A. and Barbos, Giulia and Borges, Robert D. and Danielsen, Swintha and Dorenbusch, Luise and Dorn, Ella and Elliott, John and Falcone, Giada and Fischer, Jana and Ghanggo Ate, Yustinus and Gibson, Hannah and Göbel, Hans-Philipp and Goodall, Jemima A. and Gruner, Victoria and Harvey, Andrew and Hayes, Rebekah and Heer, Leonard and Herrera Miranda, Roberto E. and Hübler, Nataliia and Huntington-Rainey, Biu and Ivani, Jessica K. and Johns, Marilen and Just, Erika and Kashima, Eri and Kipf, Carolina and Klingenberg, Janina V. and König, Nikita and Koti, Aikaterina and Kowalik, Richard G. A. and Krasnoukhova, Olga and Lindvall, Nora L.M. and Lorenzen, Mandy and Lutzenberger, Hannah and Martins, Tônia R.A. and Mata German, Celia and van der Meer, Suzanne and Montoya Samamé, Jaime and Müller, Michael and Muradoglu, Saliha and Neely, Kelsey and Nickel, Johanna and Norvik, Miina and Oluoch, Cheryl Akinyi and Peacock, Jesse and Pearey, India O.C. and Peck, Naomi and Petit, Stephanie and Pieper, Sören and Poblete, Mariana and Prestipino, Daniel and Raabe, Linda and Raja, Amna and Reimringer, Janis and Rey, Sydney C. and Rizaew, Julia and Ruppert, Eloisa and Salmon, Kim K. and Sammet, Jill and Schembri, Rhiannon and Schlabbach, Lars and Schmidt, Frederick W.P. 
and Skilton, Amalia and Smith, Wikaliler Daniel and de Sousa, Hilário and Sverredal, Kristin and Valle, Daniel and Vera, Javier and Voß, Judith and Witte, Tim and Wu, Henry and Yam, Stephanie and Ye 葉婧婷, Jingting and Yong, Maisie and Yuditha, Tessa and Zariquiey, Roberto and Forkel, Robert and Evans, Nicholas and Levinson, Stephen C. and Haspelmath, Martin and Greenhill, Simon J. and Atkinson, Quentin D. and Gray, Russell D. (in press). Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances.
To see what languages we have, let’s use rcldf’s
plotting functionality:
Ok, let’s start by extracting a list of languages in this dataset:
languages <- grambank$tables$LanguageTable |>
  select(ID, Name, Macroarea, Latitude, Longitude)
# only selecting some columns above to make it easier to see
languages
#> # A tibble: 2,467 × 5
#> ID Name Macroarea Latitude Longitude
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 abad1241 Abadi Papunesia -9.03 147.
#> 2 abar1238 Mungbam Africa 6.58 10.2
#> 3 abau1245 Abau Papunesia -3.97 141.
#> 4 abee1242 Abé Africa 5.60 -4.38
#> 5 aben1249 Abenlen Ayta Papunesia 15.4 120.
#> 6 abip1241 Abipon South America -29 -61
#> 7 abkh1244 Abkhaz Eurasia 43.1 41.2
#> 8 abua1245 Abu' Arapesh Papunesia -3.46 143.
#> 9 abui1241 Abui Papunesia -8.31 125.
#> 10 abun1252 Abun Papunesia -0.571 132.
#> # ℹ 2,457 more rows
We can also look at the ParameterTable to see
information on our parameter of interest: “GB028: Is there a distinction
between inclusive and exclusive?”. The ID of this feature
is “GB028”, so let’s just see that one:
grambank$tables$ParameterTable |> filter(ID=='GB028')
#> # A tibble: 1 × 12
#> ID Name Description ColumnSpec Patrons Grambank_ID_desc Boundness
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 GB028 Is there a di… "## Is the… <NA> HJH GB028 Clusivity <NA>
#> # ℹ 5 more variables: Flexivity <chr>, Gender_or_Noun_Class <chr>,
#> # Locus_of_Marking <chr>, Word_Order <chr>, Informativity <chr>
Plot it:
plot_parameter(grambank, parameter='GB028')
#> Joining Language_ID -> LanguageTable -> ID
#> Joining Parameter_ID -> ParameterTable -> ID
#> Joining Code_ID -> CodeTable -> ID
#> Joining Coders -> contributors.csv -> ID
Ok, now let’s get all the values for this parameter from
ValueTable. In the CLDF standard, each row of
ValueTable refers to its parameter through the
Parameter_ID column, so let’s select all the rows in
ValueTable whose Parameter_ID is GB028:
values <- grambank$tables$ValueTable |>
  filter(Parameter_ID=='GB028') |>
  select(ID, Language_ID, Parameter_ID, Value, Source)
Now we need to get the data from D-PLACE. We’ll use the GitHub repository here to show you how that works, but for proper work you should probably use the published version on Zenodo. Take the GitHub repository link https://github.com/D-PLACE/dplace-cldf and give that to rcldf too:
dplace <- cldf("https://github.com/D-PLACE/dplace-cldf")
summary(dplace)
#> A Cross-Linguistic Data Format (CLDF) dataset:
#> Name: D-PLACE aggregated dataset
#> Path: /Users/simon/Library/Caches/org.R-project.R/R/rcldf/github_D_PLACE_dplace_cldf_266fef649c0d27ffcdc6502c39bad1fa/D-PLACE-dplace-cldf-c00e58b/cldf
#> Type: http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
#> Tables:
#> 1/7: CodeTable (5 columns, 15450 rows)
#> 2/7: ContributionTable (8 columns, 122 rows)
#> 3/7: LanguageTable (20 columns, 6174 rows)
#> 4/7: MediaTable (6 columns, 109 rows)
#> 5/7: ParameterTable (11 columns, 2987 rows)
#> 6/7: TreeTable (9 columns, 109 rows)
#> 7/7: ValueTable (11 columns, 631668 rows)
#> Sources: 0
#> Cite:
#> Kathryn R. Kirby, Russell D. Gray, Simon J. Greenhill, Fiona M. Jordan, Stephanie Gomes-Ng, Hans-Jörg Bibiko, Damián E. Blasi, Carlos A. Botero, Claire Bowern, Carol R. Ember, Dan Leehr, Bobbi S. Low, Joe McCarter, William Divale, and Michael C. Gavin. (2016). D-PLACE: A Global Database of Cultural, Linguistic and Environmental Diversity. PLoS ONE, 11(7): e0158391. doi:10.1371/journal.pone.0158391.
Great, let’s get the values for the parameter we want. D-PLACE
indexes its variables in a column named Var_ID:
dplace$tables$ValueTable |> filter(Var_ID=='EA113')
#> # A tibble: 1,291 × 11
#> ID Soc_ID Var_ID Value Code_ID Comment Source sub_case year
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ea-117482 Aa1 EA113 Flexible EA113-2 <NA> <NA> Nyai Nyae reg… 1950
#> 2 ea-117483 Aa2 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 3 ea-117484 Aa3 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 4 ea-117485 Aa4 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 5 ea-117486 Aa5 EA113 Flexible EA113-2 <NA> <NA> Epulu net-hun… 1930
#> 6 ea-117487 Aa6 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 7 ea-117488 Aa7 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 8 ea-117489 Aa8 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 9 ea-117490 Aa9 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> 10 ea-117491 Ab1 EA113 <NA> EA113-NA <NA> <NA> <NA> <NA>
#> # ℹ 1,281 more rows
#> # ℹ 2 more variables: source_coded_data <chr>, admin_comment <chr>
We want to merge this with our Grambank data, but we need a shared key to do
that. D-PLACE stores a Glottocode that maps each
society to a language, and the same column is in the LanguageTable in
Grambank. So, let’s get the trait values for EA113 and add the
Glottocodes from the D-PLACE LanguageTable using a join:
# get languages from DPLACE
dplanguages <- dplace$tables$LanguageTable |> select(ID, Glottocode)
# get values for EA113 and merge with language information
ea113 <- dplace$tables$ValueTable |>
  filter(Var_ID=='EA113') |>
  select(Soc_ID, Value) |>
  left_join(dplanguages, join_by(Soc_ID==ID))
# rename `Value` to EA113
ea113 <- ea113 |> mutate(EA113=Value) |> select(Glottocode, EA113)
head(ea113)
#> # A tibble: 6 × 2
#> Glottocode EA113
#> <chr> <chr>
#> 1 juho1239 Flexible
#> 2 okie1245 <NA>
#> 3 nama1265 <NA>
#> 4 dama1270 <NA>
#> 5 bila1255 Flexible
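#> 6 sand1273 <NA>
Now get the Grambank data into the same format. Grambank uses Glottocodes as its language IDs, so Language_ID can be copied directly into a Glottocode column: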
gb028 <- values |>
  mutate(Glottocode=Language_ID, GB028=Value) |>
  select(Glottocode, GB028)
head(gb028)
#> # A tibble: 6 × 2
#> Glottocode GB028
#> <chr> <chr>
#> 1 abad1241 1
#> 2 abar1238 0
#> 3 abau1245 0
#> 4 abee1242 0
#> 5 aben1249 1
#> 6 abip1241 0
…and finally join them up using the shared column ‘Glottocode’. We’ll
use an inner join here to keep only the languages/societies that are in
both datasets, and na.omit to keep only the rows that
have data for both variables:
df <- gb028 |> inner_join(ea113) |> na.omit()
#> Joining with `by = join_by(Glottocode)`
head(df)
#> # A tibble: 6 × 3
#> Glottocode GB028 EA113
#> <chr> <chr> <chr>
#> 1 bila1255 0 Flexible
#> 2 chig1238 0 Flexible
#> 3 fefe1239 0 Flexible
#> 4 gand1255 1 Rigid
#> 5 hehe1240 0 Flexible
#> 6 juku1254 0 Rigid
Ok, it looks like we only have 17 rows for this pairing. That’s a little small, but let’s plot the data:
# geom_bar() counts rows per category by default, so it avoids the
# "Ignoring unknown parameters" warning from geom_histogram(stat='count')
p1 <- ggplot(df, aes(x=GB028)) + geom_bar()
p2 <- ggplot(df, aes(x=EA113)) + geom_bar()
p1 / p2 # patchwork
Hmm. It looks like languages that do not mark the distinction between inclusive and exclusive tend to belong to societies with flexible social structure. Looking good for our hypothesis, but let’s test it formally to make sure we’re not seeing a chance pattern:
tab <- table(df$GB028, df$EA113)
tab
#>    
#>     Flexible Rigid
#>   0        9     4
#>   1        3     1
chisq.test(tab)
#> Warning in chisq.test(tab): Chi-squared approximation may be incorrect
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: tab
#> X-squared = 4.986e-31, df = 1, p-value = 1
…and the p-value here is 1.0, meaning an association at least this strong is entirely expected when the two variables are unrelated. So yes, this is just a chance result and there is no evidence in these data that languages which distinguish inclusive from exclusive also tend to be those that have rigid social structures. There goes our big paper in Nature/Science.
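The warning from chisq.test above ("Chi-squared approximation may be incorrect") appears because some expected cell counts are small. A minimal sketch of one common alternative, Fisher's exact test from base R, is shown below. This is not part of the original analysis: the counts are typed in by hand from the table printed above so that the snippet runs on its own, and the names `expected` and `res` are introduced only for illustration.

```r
# Counts copied by hand from the contingency table printed above
# (rows: GB028 value, columns: EA113 value)
tab <- matrix(
  c(9, 3, 4, 1), nrow = 2,
  dimnames = list(GB028 = c("0", "1"), EA113 = c("Flexible", "Rigid"))
)

# Expected counts under independence; chisq.test() warns when any are below 5
expected <- suppressWarnings(chisq.test(tab))$expected
round(expected, 2)

# Fisher's exact test does not rely on the large-sample approximation
res <- fisher.test(tab)
res$p.value
```

For this table the exact test also returns a p-value of 1, so the conclusion does not change.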