--- title: "`getGADS`: Using a relational eatGADS data base" author: "Benjamin Becker" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{`getGADS`: Using a relational eatGADS data base} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette illustrates how a relational `eatGADS` data base can be accessed and used. Therefore, the vignette is targeted at users who make use of an existing data base. For illustrative purposes we use a small example data base based on the campus files of the German PISA Plus assessment. The complete campus files and the original data set can be accessed [here](https://www.iqb.hu-berlin.de/fdz/Datenzugang/CF-Antrag/AntragsformularCF) and [here](https://www.iqb.hu-berlin.de/fdz/Datenzugang/SUF-Antrag). The data base is installed alongside `eatGADS` and the path can be accessed via the `system.file()` function. ```{r setup} library(eatGADS) db_path <- system.file("extdata", "pisa.db", package = "eatGADS") db_path ``` ## Why? Relational data bases created by `eatGADS` provide an alternative way of storing hierarchically structured data (e.g. from educational large-scale assessments). Compared to conventional approaches (one big or multiple `.sav`/`.Rdata` files) this yields the following advantages: * the data set is often smaller on disk (especially compared to `.sav` files) * meta data can be accessed without loading the actual data * no need for reshaping, as the different hierarchy levels can be accessed independently * saving working memory: often R struggles with large data sets; with `eatGADS` we can choose which variables to load into `R` * flexible application of value labels and missing codes for analyses in `R` ## Inspecting the data base We can inspect the data base structure with the `namesGADS()` function. The function returns a named `list`. Every list element represents a hierarchy level. The corresponding character vector contains all variable names on this hierarchy level. ```{r structure} nam <- namesGADS(db_path) nam ``` The example data base contains two hierarchy levels: A student level (`noImp`) and a plausible value level (`PVs`). On the student level, each row represents an individual student. On the plausible value level, each row represents an imputation number of a specific domain of an individual student. We can access meta information of the variables in the data set using the `extractMeta()` function. ```{r extractMeta} # Meta data for one variable extractMeta(db_path, "age") ``` To supply variables names we can also use the named list `nam` extracted earlier. This way, we can extract all meta information available for a hierarchy level. ```{r extractMeta all} extractMeta(db_path, nam$PVs) ``` Commonly the most informative columns are `varLabel` (containing variable labels), `value` (referencing labeled values), `valLabel` (containing value labels) and `missings` (is a labeled value a missing value (`"miss"`) or not (`"valid"`)). ```{r extractMeta description} # Meta data for manually chosen multiple variables extractMeta(db_path, c("idstud", "schtype")) ``` ## Extract data from data base To extract a data set from the data base, we can use the function `getGADS()`. If the data base is stored on a server drive, `getGADS_fast()` provides identical functionality but substantially increases the performance. With the `vSelect` argument we specify our variable selection. It is important to note that `getGADS()` returns a so called `GADSdat` object. This object type contains complex meta information (that is for example also available in a `SPSS` data set), and is therefore not directly usable for data analysis. We can, however, use the `extractMeta()` function on it to access the meta data. ```{r getGADS} gads1 <- getGADS(filePath = db_path, vSelect = c("idstud", "schtype", "gender")) class(gads1) extractMeta(gads1) ``` ## Extract data from `GADSdat` If we want to use the data for analyses in `R` we have to extract it from the `GADSdat` object via the function `extractData2()`. In doing so, we have to make two important decisions: (a) how should values marked as missing values be treated (`convertMiss`)? And (b) how should labeled values in general be treated (`labels2character`, `labels2factor`, `labels2ordered`, and `dropPartialLabels`)? Per default, all missing tags are applied, meaning all values tagged as missing are recoded to `NA` (`convertMiss == TRUE`). Furthermore, per default, all value labels are dropped (`labels2character = NULL`, `labels2factor = NULL`, `labels2ordered = NULL`). If for specific variables, value labels should be applied and the resulting variable should be a character variable, this can specified via, for example, setting `labels2character = c("var1", "var2")`. ```{r extractdata} ## leave all labeled variables as numeric, convert missings to NA dat1 <- extractData2(gads1) head(dat1) ## convert selected labeled variable(s) to character, convert missings to NA dat2 <- extractData2(gads1, labels2character = c("schtype")) head(dat2) ## convert all labeled variables to character, convert missings to NA dat3 <- extractData2(gads1, labels2character = namesGADS(gads1)) head(dat3) ``` In general, we recommend leaving labeled variables as numeric and converting values with missing codes to `NA`. If required, value labels can always be accessed via using `extractMeta()` on the `GADSdat` object or the data base. ## Selecting different hierarchy levels An important feature of `eatGADS` relational data bases are that data sets are automatically returned on the correct hierarchy level. For an overview of different data structures, see ["Tidy Data"](https://doi.org/10.18637/jss.v059.i10) or [this article explaining long and wide format using repeated measures](https://www.theanalysisfactor.com/wide-and-long-data/). In educational large-scale assessments, data usually contain multiple imputations or plausible values. Packages that enable us analyzing these types of data (like [`eatRep`](https://github.com/weirichs/eatRep)) often require these data in the long format. The function `getGADS()` extracts data automatically in the appropriate structure, depending on our variable selection. If we select only variables from the student level, the data returned is on the student level. Each student is represented in a single row. ```{r noImp hierarchy levels} gads1 <- getGADS(db_path, vSelect = c("schtype", "g8g9")) dat1 <- extractData2(gads1) dim(dat1) head(dat1) ``` If additionally variables from the plausible Value data table are extracted, the returned data structure changes. In the `PVs` data table, data is stored on the "student x dimension x plausible value number" level. The returned data has exactly this structure. ```{r PVs hierarchy levels} gads2 <- getGADS(db_path, vSelect = c("schtype", "value")) dat2 <- extractData2(gads2) dim(dat2) head(dat2) ``` These two examples highlight another feature of `getGADS()`: Only variables of substantial interest have to be selected for extraction. The correct ID variables are added automatically. ## Trend data bases In educational large-scale assessments, a common challenge is reporting longitudinal developments (*trends*). `getTrendGADS` allows extracting data from multiple data bases with identical variables in it. ```{r trendPaths} trend_path1 <- system.file("extdata", "trend_gads_2020.db", package = "eatGADS") trend_path2 <- system.file("extdata", "trend_gads_2015.db", package = "eatGADS") trend_path3 <- system.file("extdata", "trend_gads_2010.db", package = "eatGADS") ``` `eatGADS` comes with three small trend data bases which can be used for illustrative purposes. ```{r getTrendGADS} gads_trend <- getTrendGADS(filePaths = c(trend_path1, trend_path2, trend_path3), vSelect = c("idstud", "dimension", "score"), years = c(2020, 2015, 2010), fast = FALSE) dat_trend <- extractData2(gads_trend) head(dat_trend) ```