Unweighted data

Unweighted data is stored in a data.frame or a similar object, such as a tibble (tbl) from the tibble package. Because these objects do not contain information about survey design variables, surveytable treats them as unweighted data, with each observation having a weight of 1.

The example below illustrates how to use surveytable with unweighted data. We

create a tibble with unweighted data;
tell surveytable to work with this object; and
tabulate the SPECCAT (physician specialty) variable from these data.

library(surveytable)
library(tibble)

mytbl = as_tibble(namcs2019sv_df)

set_survey(mytbl)
#> * mytbl: the survey is unweighted.

Survey info {mytbl (unweighted)}
Variables	Observations	Design
33	8,250	Independent Sampling design (with replacement) survey::svydesign(ids = ~1, probs = rep(1, nrow(design)), data = design)

tab("SPECCAT")

Type of specialty (Primary, Medical, Surgical) {mytbl (unweighted)}
Level	n	Number	SE	LL	UL	Percent	SE	LL	UL
Primary care specialty	2,993	2,993	44	2,909	3,080	36.3	0.5	35.2	37.3
Surgical care specialty	3,050	3,050	44	2,965	3,137	37.0	0.5	35.9	38.0
Medical care specialty	2,207	2,207	40	2,130	2,287	26.8	0.5	25.8	27.7
N = 8250.
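
The figures in the table above follow from giving every observation a weight of 1. The base R sketch below reproduces the Number, SE, and Percent columns from the three counts alone. It assumes the standard with-replacement formulas for an independent sampling design (the SE of a total is sqrt(n * var(x)) and the SE of a proportion is sqrt(var(x) / n)). It is an illustration, not surveytable's internal code, and it does not reproduce the confidence limits (LL, UL).

```r
# Counts per SPECCAT level, taken from the table above.
counts = c(Primary = 2993, Surgical = 3050, Medical = 2207)
n_obs = sum(counts)  # 8250 observations, each with a weight of 1

# For a 0/1 indicator with k ones among n observations, the sample
# variance is p * (1 - p) * n / (n - 1), where p = k / n.
p = counts / n_obs
var_x = p * (1 - p) * n_obs / (n_obs - 1)

# Assumed with-replacement formulas (not surveytable's own code):
se_number = sqrt(n_obs * var_x)         # SE of the estimated total
se_percent = 100 * sqrt(var_x / n_obs)  # SE of the estimated percentage

result = data.frame(
  Number = counts
  , SE = round(se_number)
  , Percent = round(100 * p, 1)
  , SE_pct = round(se_percent, 1)
)
print(result)
```

With these inputs, the rounded SEs come out to 44, 44, and 40, and the percentages to 36.3, 37.0, and 26.8, in agreement with the table.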

Complex survey

A complex survey is defined by its data as well as its survey design variables. In R, a complex survey is stored in a survey object, which contains both the survey data and information about the survey design variables. These variables specify such things as:

cluster IDs, also known as primary sampling units (PSUs);
cluster sampling probabilities;
strata;
finite population correction; and
sampling weights.

You can convert a data.frame or a similar object to a survey object using the survey::svydesign() command. Before using this command, you should consult the documentation for the survey that you are analyzing to find out what the survey design variables are.

The example below illustrates how to use surveytable with a complex survey. We

create a survey object;
tell surveytable to work with this object; and
tabulate the SPECCAT variable from the survey.

library(surveytable)

mysurvey = survey::svydesign(ids = ~ CPSUM
  , strata = ~ CSTRATM
  , weights = ~ PATWT
  , data = namcs2019sv_df)

set_survey(mysurvey)

Survey info {mysurvey}
Variables	Observations	Design
33	8,250	Stratified 1 - level Cluster Sampling design (with replacement) With (398) clusters. survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT, data = namcs2019sv_df)

tab("SPECCAT")

Type of specialty (Primary, Medical, Surgical) {mysurvey}
Level	n	Number	SE	LL	UL	Percent	SE	LL	UL
Primary care specialty	2,993	521,466,378	31,136,212	463,840,192	586,251,877	50.3	2.6	45.1	55.5
Surgical care specialty	3,050	214,831,829	31,110,335	161,661,415	285,489,984	20.7	3.0	15.1	27.3
Medical care specialty	2,207	300,186,150	43,496,739	225,806,019	399,066,973	29.0	3.6	22.1	36.6
N = 8250.
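
In the table above, n is the number of observations, while Number estimates a population total by weighting each observation. The base R sketch below illustrates that idea on a small made-up data.frame. The names `toy`, `grp`, and `wt` are hypothetical and the values are invented for illustration. The point estimate of Number is taken to be the sum of the sampling weights within each level. Standard errors additionally depend on the strata and clusters and are not reproduced here.

```r
# Hypothetical toy data: 6 observations, each with a sampling weight.
toy = data.frame(
  grp = c("A", "A", "B", "B", "B", "C")
  , wt = c(100, 250, 50, 50, 300, 400)
)

# n: unweighted number of observations in each level.
n = table(toy$grp)

# Number: estimated population total, the sum of weights in each level.
number = tapply(toy$wt, toy$grp, sum)

# Percent: each level's share of the total weight.
percent = round(100 * number / sum(number), 1)

print(data.frame(n = as.vector(n), Number = as.vector(number)
  , Percent = as.vector(percent), row.names = names(number)))
```

Here level B has the most observations (3) but only ties with level C (1 observation) on the estimated total (400 each). The same effect makes the weighted percentages in the table above differ from the unweighted ones.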

Spark-based complex survey

Data, especially big data, might be stored in a database, such as Apache Spark. surveytable can work with a survey whose data lives in a database.

The example below illustrates how to use surveytable with a Spark-based complex survey. We

connect to Spark;
copy some data into a Spark DataFrame;
create a Spark-based survey object;
tell surveytable to work with this object;
tabulate the SPECCAT variable from the survey; and finally
disconnect from Spark.

Note that this example uses a "local" Spark connection. How you connect to Spark depends on your setup.

library(surveytable)
library(sparklyr)
#> Warning: package 'sparklyr' was built under R version 4.4.3

library(dplyr)

sc = spark_connect(master = "local")
#> * Using Spark: 3.5.5


mysparkdf = copy_to(sc, namcs2019sv_df)
mysurvey = survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM
                             , weights = ~PATWT, data = mysparkdf)

set_survey(mysurvey)

Survey info {mysurvey}
Variables	Observations	Design
33	8,250	Stratified 1 - level Cluster Sampling design (with replacement) With (398) clusters. survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT, data = mysparkdf)

tab("SPECCAT")

SPECCAT {mysurvey}
Level	n	Number	SE	LL	UL	Percent	SE	LL	UL
Medical care specialty	2,207	300,186,150	43,496,739	225,806,019	399,066,973	29.0	3.6	22.1	36.6
Primary care specialty	2,993	521,466,378	31,136,212	463,840,192	586,251,877	50.3	2.6	45.1	55.5
Surgical care specialty	3,050	214,831,829	31,110,335	161,661,415	285,489,984	20.7	3.0	15.1	27.3
N = 8250.

spark_disconnect_all()
#> [1] 1

Complex survey with replicate weights

Some surveys, instead of specifying survey design variables, specify replicate weights. They might do this, for example, for privacy reasons.

You can convert a data.frame or a similar object to a survey object that uses replicate weights with the survey::svrepdesign() command.
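
In the survey package, the `repweights` argument can be given as a regular expression that is matched against the column names of `data`. Before building the design, it can help to check which columns a pattern picks up. The sketch below does this with base R's grep() on a hypothetical vector of column names. The pattern `fake_repw*` is the one used in the example in this section. The anchored pattern `^fake_repw[0-9]+$` is an illustrative alternative, not something the survey documentation prescribes.

```r
# Hypothetical column names, mimicking a data set with 3 replicate weights.
cols = c("SPECCAT", "fake_w", "fake_repw1", "fake_repw2", "fake_repw3")

# As a regular expression, "fake_repw*" means "fake_rep" followed by zero
# or more "w" characters, matched anywhere in the name.
loose = grep("fake_repw*", cols, value = TRUE)

# An anchored pattern states the intent more strictly: the whole name must
# be "fake_repw" followed by one or more digits.
strict = grep("^fake_repw[0-9]+$", cols, value = TRUE)

print(loose)
print(strict)
```

Both patterns select the same three columns here. They would differ if the data also had a column such as `fake_rep_id`, which only the loose pattern matches.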

The example below illustrates how to use surveytable with a complex survey that uses replicate weights. We

create fake replicate and sampling weights, for use in this example;
create a replicate weights-based survey object;
tell surveytable to work with this object; and
tabulate the SPECCAT variable from the survey.

library(surveytable)

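# Start from the example data and add fake weights, for illustration only.
mydata = namcs2019sv_df
nr = nrow(mydata)
set.seed(42)
# 20 fake replicate weights: fake_repw1, ..., fake_repw20
for (ii in 1:20) {
  mydata[,paste0("fake_repw", ii)] = runif(nr, 10, 1000)
}
# Fake sampling weight
mydata$fake_w = runif(nr, 10, 1000)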

mysurvey = survey::svrepdesign(
  repweights = "fake_repw*"
  , weights = ~fake_w
  , data = mydata
)

set_survey(mysurvey)

Survey info {mysurvey}
Variables	Observations	Design
54	8,250	Call: svrepdesign.default(repweights = "fake_repw*", weights = ~fake_w, data = mydata) Balanced Repeated Replicates with 20 replicates.

tab("SPECCAT")

Type of specialty (Primary, Medical, Surgical) {mysurvey}
Level	n	Number	SE	LL	UL	Percent	SE	LL	UL
Primary care specialty	2,993	1,504,579	16,005	1,473,519	1,536,295	36.3	0.3	35.7	36.8
Surgical care specialty	3,050	1,520,930	13,701	1,494,299	1,548,036	36.7	0.3	36.0	37.3
Medical care specialty	2,207	1,123,957	10,713	1,103,140	1,145,166	27.1	0.2	26.6	27.6
N = 8250.
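
The design above is reported as Balanced Repeated Replicates (BRR). As a rough illustration of how replicate weights yield a standard error, the base R sketch below re-estimates a weighted total once per replicate and measures how much those estimates vary. It uses the textbook BRR formula (the mean squared deviation of the replicate estimates) on made-up data. The names `y`, `w`, and `repw` are hypothetical. Whether deviations are centered on the full-sample estimate or on the mean of the replicates is a convention that varies, so this sketch should not be expected to match survey::svrepdesign() output exactly.

```r
set.seed(1)
n_obs = 200
n_rep = 20

# Hypothetical data: a 0/1 outcome, a sampling weight, and a matrix of
# replicate weights (one column per replicate). As in the example above,
# these are random numbers, not a real balanced design.
y = rbinom(n_obs, 1, 0.4)
w = runif(n_obs, 10, 1000)
repw = matrix(runif(n_obs * n_rep, 10, 1000), nrow = n_obs)

# Full-sample estimate of the weighted total.
theta = sum(w * y)

# One estimate per replicate: same statistic, different weights.
theta_rep = colSums(repw * y)

# Textbook BRR variance: mean squared deviation from the full-sample
# estimate (an assumption; other centering conventions exist).
brr_var = mean((theta_rep - theta)^2)
brr_se = sqrt(brr_var)

print(c(estimate = theta, se = brr_se))
```

The key point is that no strata or cluster variables appear anywhere in the calculation: the replicate weights alone carry the information needed to estimate the variance.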
