
Identifying an Efficient Release in Data Subject to Revisions
Source:vignettes/efficient-release.Rmd
efficient-release.Rmd
Macroeconomic data are often subject to revisions, reflecting the integration of new information, methodological improvements, and statistical adjustments. Understanding the optimal properties of a revision process is essential to ensure that early data releases provide a reliable foundation for policy decisions and economic forecasting. This vignette discusses the characteristics of optimal revisions, formalizes initial estimates as truth measured with noise, and presents an iterative approach to identify the point at which revisions become unpredictable also known as the efficient release. Further, examples are provided to illustrate the application of these concepts to GDP data.
Optimal Properties of Revisions
Data revisions can be classified into two main types: ongoing revisions, which incorporate new information as it becomes available, and benchmark revisions, which reflect changes in definitions, classifications, or methodologies. Optimally, revisions should satisfy the following properties (Aruoba 2008):
Unbiasedness: Revisions should not systematically push estimates in one direction. Mathematically, for a revision series where denotes the release number represents the final release and represents the series released at time ( is the initial release), unbiasedness requires:
Efficiency: An efficient release incorporates all available information optimally, ensuring that subsequent revisions are unpredictable. This can be tested using the Mincer-Zarnowitz regression: where an efficient release satisfies and .
Minimal Variance: Revisions should be as small as possible while still improving accuracy. The variance of revisions, defined as , should decrease over successive releases.
Initial Estimates as Truth Measured with Noise
A fundamental assumption in many revision models is that the first release of a data point is an imperfect measure of the true value due to measurement errors. This can be expressed as: where is the true value and is an error term that diminishes as increases. The properties of determine whether the preliminary release is an efficient estimator of . If is predictable, the revision process is inefficient, indicating room for improvement in the initial estimates. Vice-versa if is unpredictable, the preliminary release is efficient (i.e ).
The Challenge of Defining a Final Release
One major challenge in identifying an efficient release using Mincer-Zarnowitz regressions is that it is often unclear which data release should be considered final. While some revisions occur due to the incorporation of new information, others result from methodological changes that redefine past values. Consequently, statistical agencies may continue revising data for years, making it difficult to pinpoint a definitive final release. For instance in Germany the final release of national accounts data is typically published four years after the initial release. In Switzerland, GDP figures are never finalized. So, defining a final release is a non-trivial task that requires knowledge of the revision process.
Iterative Approach for Identifying
An iterative approach is proposed to determine the optimal number of revisions, , beyond which further revisions are negligible. This approach is based on running Mincer-Zarnowitz-style regressions to assess which release provides an optimal estimate of the final value. The procedure follows these steps (Kishor and Koenig 2012; Strohsal and Wolf 2020):
Regression Analysis: Regress the final release on the initial releases for : where the null hypothesis is that and , indicating that is an efficient estimate of .
Determine the Optimal : Increase iteratively and test whether the efficiency conditions hold. The smallest for which the hypothesis is not rejected is considered the final efficient release.
Importance of an Efficient Release
An efficient release is essential for various economic applications:
- Macroeconomic Policy Decisions: Monetary and fiscal policies depend on timely and reliable data. If initial releases are biased or predictable, policymakers may react inappropriately, leading to suboptimal policy outcomes.
- Nowcasting and Forecasting: Reliable early estimates improve the accuracy of nowcasting models, which are used for real-time assessment of economic conditions.
- Financial Market Stability: Investors base decisions on official economic data. If initial releases are misleading, market volatility may increase due to unexpected revisions.
- International Comparisons: Cross-country analyses require consistent data. If national statistical agencies release inefficient estimates, international comparisons become distorted.
An optimal revision process is one that leads to unbiased, efficient, and minimally variant data revisions. Initial estimates represent the truth measured with noise, and identifying the efficient release allows analysts to determine the earliest point at which data can be used reliably without concern for systematic revisions. The iterative approach based on Mincer-Zarnowitz regressions provides a robust framework for achieving this goal, improving the reliability of macroeconomic data for forecasting and policy analysis.
Example: Identifying an Efficient Release in GDP Data with
reviser
In the following example, we use the reviser
package to
identify the efficient release in quarterly GDP data for Switzerland,
the Euro Area, Japan, and the United States. We test the first 20 data
releases, and we aim to determine the point at which further revisions
become negligible. As the final release, we use the same release for all
countries, namely the data released 5 years after the first release. The
tests shows that the first efficient release occurs at different points
for each country, ranging from the 1st (Japan, US) to the 13th
(Switzerland) release.
library(reviser)
library(dplyr)
gdp <- reviser::gdp %>%
tsbox::ts_pc()
df <- get_nth_release(gdp, n = 0:19)
final_release <- get_nth_release(gdp, n = 20)
efficient <- get_first_efficient_release(
df,
final_release
)
res <- summary(efficient)
#> id: CHE
#> Efficient release: 13
#>
#> Model summary:
#>
#> Call:
#> stats::lm(formula = formula, data = df_wide)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.83040 -0.13561 -0.00657 0.15531 0.76670
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.05979 0.02526 2.367 0.0192 *
#> release_13 0.87766 0.03326 26.384 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2631 on 155 degrees of freedom
#> (21 observations deleted due to missingness)
#> Multiple R-squared: 0.8179, Adjusted R-squared: 0.8167
#> F-statistic: 696.1 on 1 and 155 DF, p-value: < 2.2e-16
#>
#>
#> Test summary:
#>
#> Linear hypothesis test:
#> (Intercept) = 0
#> release_13 = 1
#>
#> Model 1: restricted model
#> Model 2: final ~ release_13
#>
#> Note: Coefficient covariance matrix supplied.
#>
#> Res.Df Df F Pr(>F)
#> 1 157
#> 2 155 2 2.9999 0.05269 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#>
#> id: EA
#> Efficient release: 7
#>
#> Model summary:
#>
#> Call:
#> stats::lm(formula = formula, data = df_wide)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.39144 -0.06507 0.00536 0.07090 0.25743
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.01982 0.01113 1.781 0.0768 .
#> release_7 1.00311 0.01646 60.934 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1111 on 156 degrees of freedom
#> (20 observations deleted due to missingness)
#> Multiple R-squared: 0.9597, Adjusted R-squared: 0.9594
#> F-statistic: 3713 on 1 and 156 DF, p-value: < 2.2e-16
#>
#>
#> Test summary:
#>
#> Linear hypothesis test:
#> (Intercept) = 0
#> release_7 = 1
#>
#> Model 1: restricted model
#> Model 2: final ~ release_7
#>
#> Note: Coefficient covariance matrix supplied.
#>
#> Res.Df Df F Pr(>F)
#> 1 158
#> 2 156 2 3.0071 0.05231 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#>
#> id: JP
#> Efficient release: 0
#>
#> Model summary:
#>
#> Call:
#> stats::lm(formula = formula, data = df_wide)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.76370 -0.36739 -0.01023 0.38499 1.72906
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.05437 0.05116 1.063 0.29
#> release_0 0.83179 0.04878 17.051 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.5765 on 156 degrees of freedom
#> (20 observations deleted due to missingness)
#> Multiple R-squared: 0.6508, Adjusted R-squared: 0.6486
#> F-statistic: 290.7 on 1 and 156 DF, p-value: < 2.2e-16
#>
#>
#> Test summary:
#>
#> Linear hypothesis test:
#> (Intercept) = 0
#> release_0 = 1
#>
#> Model 1: restricted model
#> Model 2: final ~ release_0
#>
#> Note: Coefficient covariance matrix supplied.
#>
#> Res.Df Df F Pr(>F)
#> 1 158
#> 2 156 2 1.8585 0.1593
#>
#>
#> id: US
#> Efficient release: 0
#>
#> Model summary:
#>
#> Call:
#> stats::lm(formula = formula, data = df_wide)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.82880 -0.13760 0.03239 0.12910 0.95887
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.005432 0.028360 0.192 0.848
#> release_0 0.955730 0.029993 31.865 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2564 on 156 degrees of freedom
#> (20 observations deleted due to missingness)
#> Multiple R-squared: 0.8668, Adjusted R-squared: 0.866
#> F-statistic: 1015 on 1 and 156 DF, p-value: < 2.2e-16
#>
#>
#> Test summary:
#>
#> Linear hypothesis test:
#> (Intercept) = 0
#> release_0 = 1
#>
#> Model 1: restricted model
#> Model 2: final ~ release_0
#>
#> Note: Coefficient covariance matrix supplied.
#>
#> Res.Df Df F Pr(>F)
#> 1 158
#> 2 156 2 2.2919 0.1045
head(res)
#> # A tibble: 4 × 6
#> id e alpha beta p_value n_tested
#> <chr> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 CHE 13 0.0598 0.878 0.0527 14
#> 2 EA 7 0.0198 1.00 0.0523 8
#> 3 JP 0 0.0544 0.832 0.159 1
#> 4 US 0 0.00543 0.956 0.104 1
Note: The identification of the first efficient release is related to
the news hypothesis testing. Hence, the same conclusion could be reached
by using the function get_revision_analysis()
(setting
degree=3
, providing results for news and noise tests) as
shown below. An advantage of using the
get_first_efficient_release()
function is that it organizes
the data to be subsequently used in kk_nowcast()
to improve
nowcasts of preliminary releases (See vignette Nowcasting Revisions for more
details).
analysis <- get_revision_analysis(
df,
final_release,
degree=3
)
head(analysis)
#> # A tibble: 6 × 17
#> id release N `News test Intercept` `News test Intercept (std.err)`
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 CHE release_0 158 0.180 0.0540
#> 2 CHE release_1 158 0.163 0.0544
#> 3 CHE release_10 157 0.128 0.0475
#> 4 CHE release_11 158 0.126 0.0469
#> 5 CHE release_12 157 0.0920 0.0406
#> 6 CHE release_13 157 0.0598 0.0292
#> # ℹ 12 more variables: `News test Intercept (p-value)` <dbl>,
#> # `News test Coefficient` <dbl>, `News test Coefficient (std.err)` <dbl>,
#> # `News test Coefficient (p-value)` <dbl>, `News joint test (p-value)` <dbl>,
#> # `Noise test Intercept` <dbl>, `Noise test Intercept (std.err)` <dbl>,
#> # `Noise test Intercept (p-value)` <dbl>, `Noise test Coefficient` <dbl>,
#> # `Noise test Coefficient (std.err)` <dbl>,
#> # `Noise test Coefficient (p-value)` <dbl>, …