The datazoom.social package facilitates access to official Brazilian social data.
This package is under development, and more datasets will be released soon.
This first version focuses solely on the Continuous PNAD (PNADC). It lets you download and read many quarters at once and identifies individuals across time to form a panel.
You can install the development version of datazoom.social from GitHub with:

```r
install.packages("devtools")
devtools::install_github("datazoompuc/datazoom.social")
```
The `load_pnadc` function is a wrapper for `get_pnadc` from the package `PNADcIBGE`, with added identification algorithms for panel construction. For details on the identification algorithms, see `vignette("BUILD_PNADC_PANEL")`.
Panel Structure:
The table below shows the first and last quarter (ANOtrimestre, e.g.
20121 = 2012 Q1) covered by each PNADC rotating panel:
| Panel | Start | End |
|---|---|---|
| 1 | 20121 | 20124 |
| 2 | 20121 | 20141 |
| 3 | 20132 | 20152 |
| 4 | 20143 | 20163 |
| 5 | 20154 | 20174 |
| 6 | 20171 | 20191 |
| 7 | 20182 | 20202 |
| 8 | 20193 | 20213 |
| 9 | 20204 | 20224 |
| 10 | 20221 | 20241 |
| 11 | 20232 | 20252 |
| 12 | 20243 | 20263 |
| 13 | 20254 | 20274 |
| 14 | 20271 | 20291 |
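
As a quick way to check which panels a given quarter belongs to, the table above can be transcribed into a data frame and queried. This is only a sketch in base R: `panel_windows` and `panels_covering` are hypothetical names and are not part of datazoom.social.

```r
# Panel windows transcribed from the table above
# (ANOtrimestre = year * 10 + quarter, e.g. 20121 = 2012 Q1).
panel_windows <- data.frame(
  panel = 1:14,
  start = c(20121, 20121, 20132, 20143, 20154, 20171, 20182,
            20193, 20204, 20221, 20232, 20243, 20254, 20271),
  end   = c(20124, 20141, 20152, 20163, 20174, 20191, 20202,
            20213, 20224, 20241, 20252, 20263, 20274, 20291)
)

# Hypothetical helper: which panels include a given year and quarter?
panels_covering <- function(year, quarter) {
  yq <- year * 10 + quarter
  panel_windows$panel[panel_windows$start <= yq & yq <= panel_windows$end]
}

panels_covering(2022, 1)  # panels 9 and 10, according to the table
```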
Usage:
Default

```r
load_pnadc(
  save_to = getwd(),
  years,
  quarters = 1:4,
  panel = "advanced",
  raw_data = FALSE,
  save_options = c(TRUE, TRUE),
  vars = NULL
)
```

To download PNADC data for all quarters of 2022 and 2023, with advanced identification, simply run:
```r
load_pnadc(
  save_to = "Directory/You/Would/like/to/save/the/files",
  years = 2022:2023
)
```

To download PNADC data for all of 2022, but only the first quarter of 2023, run:
```r
load_pnadc(
  save_to = "Directory/You/Would/like/to/save/the/files",
  years = 2022:2023,
  quarters = list(1:4, 1)
)
```

To download PNADC data without any variable treatment or identification (e.g., for all quarters of 2021), run:
```r
load_pnadc(
  save_to = "Directory/You/Would/like/to/save/the/files",
  years = 2021,
  panel = "none",
  raw_data = TRUE
)
```

To download PNADC data, keep the quarters parquet on disk, and save panels as Parquet, run:
```r
load_pnadc(
  save_to = "Directory/You/Would/like/to/save/the/files",
  years = 2022,
  save_options = c(TRUE, FALSE)
)
```

To download PNADC data and save panels as CSV but discard the intermediate quarters parquet, run:
```r
load_pnadc(
  save_to = "Directory/You/Would/like/to/save/the/files",
  years = 2022,
  save_options = c(FALSE, TRUE)
)
```

To download only a specific subset of variables, for example age (`V2009`) and habitual income (`VD4019`), alongside the structural columns that `PNADcIBGE` always returns, run:
```r
load_pnadc(
  save_to = "Directory/You/Would/like/to/save/the/files",
  years = 2022,
  vars = c("V2009", "VD4019")
)
```

Note:
`PNADcIBGE::get_pnadc()` always downloads a set of ~210 structural columns regardless of the `vars` argument. These include survey design weights (`V1027`, `V1028`, `V1028001` to `V1028200`, `posest`, `posest_sxi`), deflator variables (`Habitual`, `Efetivo`), and identifiers such as `UF`, `Estrato`, `V1029`, `V1033`, and `ID_DOMICILIO`. The `vars` argument adds columns on top of those; it does not restrict them. Use `vars = NULL` (the default) to download all available microdata columns.
If you specify vars and also request panel identification, any columns
required by the identification algorithm that are absent from vars
will be added automatically and a warning will tell you which ones were
added. For example, when using panel = "advanced", the columns
V2007, V20082, V20081, V2008, and V2003 must be present. If
you omit them from vars, the function adds them for you:
```r
# Only V2009 and VD4019 are requested, but panel = "advanced" (the default)
# needs V2007, V20082, V20081, V2008 and V2003; these are added
# automatically with a warning.
load_pnadc(
  save_to = "Directory/You/Would/like/to/save/the/files",
  years = 2022,
  panel = "advanced",
  vars = c("V2009", "VD4019")
)
```

Options:
- `save_to`: The directory in which to save the downloaded files.
- `years`: The years for which the data will be downloaded.
- `quarters`: The quarters within those years to be downloaded. Can be either a vector such as `1:4`, for the same quarters in every year, or a list of vectors if quarters differ by year (e.g. `list(1:4, 1:2)` for four quarters in the first year and two in the second).
- `panel`: Which panel algorithm to apply to the data. There are three options:
  - `none`: No panel is built. If `raw_data = TRUE`, returns the original data. Otherwise, creates some extra treated variables. The intermediate quarters parquet is always kept when `panel = "none"`.
  - `basic`: Performs basic identification steps to create household and individual identifiers for panel construction.
  - `advanced`: Performs advanced identification steps to create household and individual identifiers for panel construction.
- `raw_data`: Whether to download the raw or the treated data. There are two options:
  - `TRUE`: if you want the PNADC variables as they come.
  - `FALSE`: if you want the treated version of the PNADC variables.
- `save_options`: A logical vector of length 2 controlling file-saving behaviour:
  - `c(TRUE, TRUE)` (default): keeps the intermediate quarters parquet after the panel is built; saves panel files as `.csv`.
  - `c(FALSE, TRUE)`: deletes the quarters parquet after use; saves panel files as `.csv`.
  - `c(TRUE, FALSE)`: keeps the quarters parquet; saves panel files as a `.parquet` dataset.
  - `c(FALSE, FALSE)`: deletes the quarters parquet after use; saves panel files as a `.parquet` dataset.
- `vars`: A character vector of additional variable names to download, following the same convention as `vars` in `PNADcIBGE::get_pnadc()`. Use `NULL` (the default) to download all available microdata columns. See the note above regarding the ~210 structural columns that are always returned by `PNADcIBGE::get_pnadc()` regardless of this argument.
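
To make the `quarters` and `vars` conventions concrete, here is a minimal base R sketch. Both helpers (`expand_periods` and `complete_vars`) are hypothetical and only mimic the behaviour described above; they are not the package's internal code.

```r
# Hypothetical helper: pair each year with its quarters, following the
# convention described for the `quarters` argument.
expand_periods <- function(years, quarters = 1:4) {
  if (!is.list(quarters)) {
    # A plain vector applies to every year
    quarters <- rep(list(quarters), length(years))
  }
  stopifnot(length(quarters) == length(years))
  data.frame(
    year    = rep(years, lengths(quarters)),
    quarter = unlist(quarters)
  )
}

# Hypothetical helper: append identification columns missing from `vars`.
# The default `required` set is the one listed above for panel = "advanced".
complete_vars <- function(vars,
                          required = c("V2007", "V20082", "V20081", "V2008", "V2003")) {
  if (is.null(vars)) return(NULL)  # NULL means "all columns": nothing to add
  absent <- setdiff(required, vars)
  if (length(absent) > 0) {
    warning("Adding columns required for identification: ",
            paste(absent, collapse = ", "))
  }
  c(vars, absent)
}

# All of 2022 plus the first quarter of 2023: five (year, quarter) pairs
expand_periods(2022:2023, list(1:4, 1))

# The two requested variables followed by the five required ones
suppressWarnings(complete_vars(c("V2009", "VD4019")))
```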
Details:
The function performs the following steps:
- Loop over years and quarters using `PNADcIBGE::get_pnadc` to download the data. All quarters are collected in memory and saved together into a single `pnadc_quarters.parquet` file in `save_to`.
- Split the data into panels by the panel variable `V1014`. Data from each panel is saved depending on `save_options`.
- Read each panel file and apply the identification algorithms defined in `build_pnadc_panel`.
- If `save_options[1] = FALSE`, the intermediate quarters parquet is deleted after the panels are built.
- The identification algorithms in `build_pnadc_panel` are drawn from Ribas, Rafael Perez, and Sergei Suarez Dillon Soares (2008): “Sobre o painel da Pesquisa Mensal de Emprego (PME) do IBGE”.
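
The splitting step can be illustrated on a toy data frame with base R's `split()`. This is a sketch only: the data are made up, the columns `Ano` and `Trimestre` are illustrative, and the package may split and save the panels differently.

```r
# Toy stand-in for the downloaded quarters; real PNADC data has many more columns.
quarters_data <- data.frame(
  Ano       = c(2022, 2022, 2022, 2022),
  Trimestre = c(1, 1, 2, 2),
  V1014     = c(9, 10, 9, 10),   # panel variable
  V2009     = c(34, 51, 34, 52)  # age
)

# Split rows by the panel variable V1014: one data frame per panel
panels <- split(quarters_data, quarters_data$V1014)

names(panels)        # one list element per panel number
nrow(panels[["9"]])  # rows belonging to panel 9
```

Each element of `panels` could then be written to its own file and passed to the identification step.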
Usage:
Basic Panel

```r
panel_data <- build_pnadc_panel(dat = pnad_sample, panel = "basic")
```

Advanced Panel

```r
panel_data <- build_pnadc_panel(dat = pnad_sample, panel = "advanced")
```

Description
Our `load_pnadc` function uses the internal function `build_pnadc_panel` to identify households and individuals across quarters. The identification method is based on the paper by Ribas, Rafael Perez, and Sergei Suarez Dillon Soares (2008): “Sobre o painel da Pesquisa Mensal de Emprego (PME) do IBGE”.
The household identifier, stored as `id_dom`, combines the following variables to create a unique number for every combination of them:
- `UPA`: Primary Sampling Unit (PSU);
- `V1008`: Household;
- `V1014`: Panel Number.
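
One way to build such a household identifier is sketched below in base R, on made-up values. It assigns one integer to each distinct combination of the three variables; the package may compute `id_dom` differently.

```r
# Toy household records (values are invented for illustration)
households <- data.frame(
  UPA   = c(110000016L, 110000016L, 110000016L, 110000024L),
  V1008 = c(1L, 1L, 2L, 1L),
  V1014 = c(9L, 9L, 9L, 9L)
)

# Build a text key per row, then number the distinct keys in order of appearance
key <- paste(households$UPA, households$V1008, households$V1014, sep = "_")
households$id_dom <- match(key, unique(key))

households$id_dom  # rows 1 and 2 share an id; rows 3 and 4 get their own
```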
The basic individual identifier, stored as `id_ind`, combines the household id with the following variables to create a unique number for every combination of them:
- `V2007`: Sex;
- Date of birth: `V20082` (year), `V20081` (month), `V2008` (day).
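
The same idea extends to individuals. The sketch below uses invented records and assumes `id_dom` has already been computed; it is an illustration of the combination described above, not the package's implementation.

```r
# Toy person records: rows 1 and 3 are the same person seen in two interviews
people <- data.frame(
  id_dom = c(1L, 1L, 1L, 2L),
  V2007  = c(1L, 2L, 1L, 1L),              # sex
  V20082 = c(1980L, 1982L, 1980L, 1980L),  # year of birth
  V20081 = c(5L, 7L, 5L, 5L),              # month of birth
  V2008  = c(12L, 3L, 12L, 12L)            # day of birth
)

# One integer per distinct (household, sex, date of birth) combination
key <- with(people, paste(id_dom, V2007, V20082, V20081, V2008, sep = "_"))
people$id_ind <- match(key, unique(key))

people$id_ind  # rows 1 and 3 receive the same identifier
```

Note that the fourth row has the same sex and date of birth as the first but lives in a different household, so it receives a different identifier.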
The advanced identifier is saved as `id_rs`. For individuals who were not matched in all interviews, we relax some assumptions to increase matching power. Under the assumption that the date of birth is often misreported, we take individuals who are either:
- Head of the household or their partner
- Child of the head of the household, 25 or older

For these observations, we run the basic identification again, but allow the year of birth to be wrong. We also include the order number.
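
A rough sketch of this relaxation is shown below on invented data. Several things here are assumptions: the `relax` flag is a hypothetical column marking individuals who were not matched in every interview and who are a head, partner, or child aged 25 or older; the order number is assumed to be `V2003`, which is listed among the columns required for `panel = "advanced"`; and the real algorithm in `build_pnadc_panel` may proceed differently.

```r
# Rows 1 and 2 are the same person with a misreported year of birth
people <- data.frame(
  id_dom = c(1L, 1L, 1L, 1L),
  V2007  = c(1L, 1L, 2L, 2L),              # sex
  V20082 = c(1980L, 1981L, 1975L, 1975L),  # year of birth
  V20081 = c(5L, 5L, 7L, 7L),              # month of birth
  V2008  = c(12L, 12L, 3L, 3L),            # day of birth
  V2003  = c(1L, 1L, 2L, 2L),              # order number (assumed variable)
  relax  = c(TRUE, TRUE, FALSE, FALSE)     # hypothetical eligibility flag
)

# Basic key uses the full date of birth; relaxed key drops the year
# and adds the order number instead
basic_key   <- with(people, paste(id_dom, V2007, V20082, V20081, V2008, sep = "_"))
relaxed_key <- with(people, paste(id_dom, V2007, V20081, V2008, V2003, sep = "_"))

# Prefixes keep the two key spaces from colliding
key <- ifelse(people$relax, paste0("r_", relaxed_key), paste0("b_", basic_key))
people$id_rs <- match(key, unique(key))

match(basic_key, unique(basic_key))  # basic ids: rows 1 and 2 are not matched
people$id_rs                         # relaxed ids: rows 1 and 2 are matched
```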
The table below shows the attrition levels obtained with the advanced identification algorithm and compares them to those obtained with the Stata datazoom_social package.
| Interview | Percentage found (R) | Percentage found (Stata) |
|---|---|---|
| 1 | 100.0 | 100.0 |
| 2 | 86.2 | 85.7 |
| 3 | 78.5 | 77.5 |
| 4 | 73.2 | 71.6 |
| 5 | 69.1 | 66.8 |
Attrition for Panel 2
Each cell is the percentage of PNADC observations that are identified by the advanced algorithm in each interview.
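
For readers who want to compute a similar measure on their own panel, here is a minimal sketch. It reflects one possible reading of "percentage found": the share of individuals seen in the first interview whose identifier also appears in interview k. The `interview` column is a hypothetical name for the interview number, the data are invented, and this is not necessarily how the figures in the table were produced.

```r
# Toy panel: person 1 is found three times, person 2 twice, person 3 once
obs <- data.frame(
  id_rs     = c(1, 1, 1, 2, 2, 3),
  interview = c(1, 2, 3, 1, 2, 1)  # hypothetical interview-number column
)

# Individuals observed in the first interview
first_wave <- unique(obs$id_rs[obs$interview == 1])

# For each interview k, the share of first-interview individuals found again
pct_found <- sapply(sort(unique(obs$interview)), function(k) {
  found <- unique(obs$id_rs[obs$interview == k])
  100 * mean(first_wave %in% found)
})

round(pct_found, 1)
```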
DataZoom is developed by a team at Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Department of Economics. Our official website is at: https://datazoom.com.br/en/.
To cite package datazoom.social in publications use:
Data Zoom (2023). Data Zoom: Simplifying Access To Brazilian Microdata.
https://datazoom.com.br/en/
A BibTeX entry for LaTeX users is:
```bibtex
@Unpublished{DataZoom2024,
  author = {Data Zoom},
  title = {Data Zoom: Simplifying Access To Brazilian Microdata},
  url = {https://datazoom.com.br/en/},
  year = {2024}
}
```
