Extrapolation of 2009 PUF#

Overview#

OSPC’s current micro simulation model is using CPS-matched data file based on 2009 Public Use File (PUF) to generate tax revenue projections on a 10-year budget window. In order to do the calculations in the projection time period (2015-2024), the model needs to extrapolate the variables in PUF to the future years. OSPC follows a two-stage procedure, originally developed by John O’Hare, to extrapolate the PUF variables. The first stage is to ‘blow-up’ those variables directly using external macro-economic baseline projections, mainly to reflect inflation. Then the second step is to maintain proper distribution of wage and salaries, as well as key macro targets.

The CPS-matched file is a statistically matched file base on 09 PUF. It not only includes filers information from PUF, but also non-filers from Current Population Survey (CPS). The Public Use File of 2009, currently base of the statistical match, is a stratified sample from the IRS full file with all taxpayers’ information. In PUF, there are in total 152,526 records. One records was removed before futhur procedures, due to the fact that this record, with Record ID (RECID = 999999) contains aggregated information that would potential distort later steps. All the information in this aggregated record was re- distributed to all other records using a Stage I and Stage II extrapolation routine described below. All the remaining record has a weight variable indicating approximately how many taxpayers this record represents in the full file. In addition to the weight information, each record has 201 variables, roughly 169 of which are used in the OSPC tax calculator. The aggregate values (weighted sum) of each variable are very close to the statistics for all taxpayers provided by IRS.

Stage 1#

In Stage I, the model blows up PUF variables using per capita adjustment factors. This step will make sure the aggregated values of these variables match certain macroeconomic projections in terms of growth rates. Currently, the external baseline projection used in Stage I is mainly from the CBO economic outlook. In near future, the calculations from OSPC’s dynamic model, LOGUS, will replace these external projections.

The per capita adjustment factors applied to each record in PUF are calculated based on both macroeconomic targets growth rates and population growth rates. Both of these rates are simple- compounded rates, derived from multiple data resources, including CBO, IRS and Census Bureau. For example, the projected GDP in year t is projected at Xt, then the OSPC model calculates growth rate r as:

\[ r = \frac{X_t}{X_0} − 1 \]

Similarly, population P growth rate p is calculated as:

\[ p = \frac{P_t}{P_0} − 1 \]

To make sure the extrapolated variables indeed aggregate at \(X_{t}\), the model calculates the ratio of the two rates as per capital adjustment factor and then uses this value to blow up the target variable for each record in the base year \(x_{i0}\).

Per capita adjustment for each individual record:

\[ 1 + \varrho = \frac{1 + r}{1 + p} \]

After blow-up, the target variable value for each record and the weight variable would be:

\[ x_{it} = (1 + \varrho)x_{i0} \]
\[ w_{it} = (1 + p)w_{i0} \]

Here \(x_{it}\) represents the extrapolated value of targets for one return i at year t, and \(w_{it}\) represents the stage-I extrapolated value of weight for each return i.

This method of blow-up ensures that the aggregates of future individual variables would be in line with the macro projections based on IRS taxpayer information and CBO baseline. As you can see as follows:

Starting with the weighted sum of a target variable x at year t

\[\begin{split} \begin{align} &\sum_{i} x_{it}w_{it} \\ & = \sum_{i} (1+ \varrho)x_{i0}(1+p)w_{i0} \\ & = \sum_{i} \frac{1+r}{1+p}x_{i0}(1+p)w_{i0} \\ & = (1 + r)\sum_{i} x_{i0}w_{i0} \\ & = (1+r)X_0 \\ & = X_t \end{align} \end{split}\]

Currently in Stage I, OSPC’s model has 20 factors, 16 macro factors and 4 population related factors. For macroeconomic economic targets like GDP that do not have corresponding variables in the PUF, they are derived directly from CBO baseline projections. While the other factors, taxable interests for example, are calculated from both CBO projection and IRS statistics. Specifically, the model takes the most recent (currently 2012) statistics of this variable released by IRS as the base of macroeconomic targets projection. Then the model ‘ages’ those statistics with the yearly growth rates of CBO targets to extrapolate those statistics on the budget window. Some of the values extrapolated here are used as target in Stage II as well. More details will be provided later in this description. With these extrapolated values, the model then using the 2008 data as base, calculates simple growth rates of these extrapolated values for each projection year. Please see appendix A for a complete list of all factors.

As you can see, there are roughly 169 variables that need to be blown up in PUF but only 20 factors and less than 16 per-capita adjustments are available. At the moment for simplicity reasons, many variables are sharing one factor. Most variables are aged at personal income growth rate. Variables that don’t follow the trajectory of personal income are applied a factor that fits the best. For example, home mortgage interest deduction is currently using taxable interest income factor for adjustment. The complete table for the correspondence between stage I factors and PUF variables can be found in Appendix B.

Stage 2#

At the end of Stage I, the OSPC model blows up the base year data through the projection years. Even though all targets set up in Stage I can be hit at this point, some other variables, mainly the ones not included in the targets, might behave oddly if we only target the aggregate values. For example, we targeted total wage in Stage I and this would kick Earned Income Credit (EIC) out of place in future years. But the Stage I simple-compounded rate blow-up process cannot get both total wage and EIC fell in reasonable ranges at the same time, as those two variables impose closely related but different requirements on wage and population distribution. John found that if wage distribution were maintained same over years, this problem would be fixed automatically.

In Stage II, the OSPC model applies a linear programming (LP) algorithm on the post-Stage-I records to adjust the weights. In this way, all the targeted variables would sum up to the targets, and other non-target variables stay in reasonable ranges. The stage II targets not only include those stage I targets calculated based on both CBO baseline growth rates and IRS statistics, but also certain targets on population and return groups, and wage distribution. For the wage distribution, the model currently assumes the distribution of wage stays the same over the projection time period. This assumption is subject to change as many other resources, including JCT’s present model, suggest that the growth rate of higher income class is greater than lower income class. A complete list of Stage II targets is attached as Appendix C.

In this LP model, the object function is based on the absolute value of percentage adjustment on weights. Assume $z_{it} is the percentage adjustment on weights w for record i between any projection year t and post-stage-I base year weight. One twist on top of the original Stage I blow up here is that we applied different factor to different age groups for the weight variable. From the population growth rates by age in the past few years, the 65+ group grows at a higher rate than the overall population. (citation needed!!!) As PUF doesn’t contain any age information, we use social security benefits, represented by variable e02400 in PUF, to approximate the age characteristic of PUF. Specifically, if one filer received social security benefits in 2008, we assume he or she is older than 65 and blow up the weight of this return using the senior population growth rate (APOPSNR). Otherwise, the weight is blown up at the total return growth rate (ARETS). In this way, the weight variable at year t is:

\[\begin{split} w_{it} = \begin{cases} w_{i0}*ARETS if e02400 \leq 0\\ wi, 0*APOPSNR if e02400 > 0 \end{cases} \end{split}\]

Then the percentage adjustment on weights can be expressed as:

\[\begin{split} z_{it} = \begin{cases} \frac{w_{it}^*}{w_{i0}*ARETS} − 1 if e02400 ≤ 0\\ \frac{w_{it}^*}{w_{i0}*APOPSNR} − 1 if e02400 > 0 \end{cases} \end{split}\]

Here \(w_{it}^*\) is the final optimized weight after stage II, which is unknown at this point.

In this problem, we want to minimize the overall adjustments on all records, in other words, to minimize the sum of the absolute values of all the percentage changes on weights. Thus, this LP problem aims at minimizing the following function:

\[ \sum_i |z_{it}| \]

To enhance efficiency of the optimization, we decompose the percentage adjustment \(z_{it}\) into two components: r and s. These two components respectively carry the absolute values of positive and negative elements of original adjustments \(z_{it}\). Mathematically, they can be expressed as following:

\[\begin{split} r_{it} = \begin{cases} z_{it}, z_{it} > 0 \\ 0, else \end{cases} \end{split}\]
\[\begin{split} s_{it} = \begin{cases} z_{it}, z_{it} > 0 \\ 0, else \end{cases} \end{split}\]

Then it’s not hard to see that the original adjustment is the difference of the two components, and its absolute value is the sum of the two components.

\[ z_{it} = r_{it} – s_{it} \]
\[ |z_{it}| = r_{it} + s_{it} \]

Therefore, the object function turns into:

\[ \sum_{i} (r_{it} + s_{it}) \]

As this LP problem takes post-stage-I data as input and borrows a large number of targets derived in Stage I, the percentage adjustments should be fairly small. In addition, if the adjustment runs too large, there’s a big chance that the non-target variables get pulled away from their normal ranges. Thus in the LP model, the tolerance of each \(z_{it} = r_{it} – s_{it}\) is also incorporate as a constraint:

\[ r_{it} + s_{it} < \delta \]

For each year, the OSPC model try to find the lowest tolerance that still allows the LP solver generates feasible solutions. This tolerance is different from year to year. Generally speaking, years earlier in the projection period have lower tolerances than years further out in the future. Currently the tolerances for all the years are under 0.45.

In addition to this tolerance constraint, all other constraints of this LP problem can be grouped into three categories:

  • Return targets

  • Aggregate targets

  • Aggregate by Income class targets

The first category of return targets is mainly focused on different filing status of returns, and population of different age groups. Total numbers of returns with different filing status are maintained in the same ratio to total returns through years, and each category grows at the overall return growth rate. Senior filers are extrapolated at the rate for population with age greater than 65, projected by Census. Mathematically, these constraints are

\[ \sum_i w_{it}(1 + r_{it} - s_{it}) = W_t \]

where \(w_{it}\) refers to the twisted post-stage-I weights created for Stage II, and Wt refers to the Stage II return targets.

The second category contains macroeconomic targets other than total wages. These constraints are set up as following:

\[ \sum_i x_{it} w_{it}(1 + r_{it} - s_{it}) = X_t \]

Both target variable for record i, \(x_{it}\), and weight variable \(w_{it}\) are the post-stage-I results.

The third category of aggregate by income class includes 12 targets. These targets are the weighted sum of wages and salaries ranked and grouped by adjusted gross income (AGI) classes. Currently, the distribution of wages and salaries is maintained the same as the base year through the entire projection period.

After setting up the object function and all the constraint, the OSPC model runs the CLP solver to find solutions for this LP problem. To finish the Stage II adjustment, the model applies the solution from CLP solver, \(r_{it} − s_{it}\), to the post-stage-I weights and gets the final adjusted weights.

To sum up, the OSPC tax calculator applies the Stage I factors to blow up most of the PUF variables and uses the adjusted weights generated in Stage II for all projection years. This description outlines the current extrapolation routine, which is subject to change in near future for various reasons. Many factors would change the routine significantly, which includes a new version of PUF released by IRS. This description will be updated accordingly.

Appendix A Stage I factors#

Var_name

Long name

GDPNA

GDP

AWAGE

Wages and Salaries

AINTS

Interest

ADIVS

Dividends

ATXPY

Personal Income

ASCHCI

Business Income

ASCHCL

Business Loss

ACGNS

Capital Gain

ASCHF

Schedule F Income

ASCHEI

Schedule E Income

ASCHEL

Schedule E Loss

AUNCOMP

Unemployed Compensation

ASOCSEC

Social Security

ACPIM

Medical CPI

ABOOK

Book Income

AIPD

Interest Paid

Appendix B PUF variables and blow-up factors#

For each e-variable below, please refer to the spreadsheet for complete definition.

Factors

PUF variables

AGDPN

e03240

AWAGE

e00200

AINTS

e00300

e00400

ADIVS

e00600

e00650

ATXPY

e00700

e00800

e01400

e01500

e01700

e03150

e03210

e03220

e03230

e03300

e03400

e03500

e07230

e07240

e07260

e07300

p08000

e09700

e09800

e09900

e10700

e10900

e59560

e59680

e59700

e59720

e11550

e11070

e11100

e11200

e11300

e11400

e11570

e11580

e11581

e11582

e11583

e10605

e18400

e18500

e19550

e19800

e20100

e19700

e20550

e20600

e20400

e20800

e20500

e21040

e32800

e33000

e53240

e53280

e53410

e53300

e53317

e53458

e58950

e58990

p60100

p61850

e60000

e62100

e62900

e62720

e62730

e62740

p65300

p65400

e68000

e82200

t27800

s27860

p27800

t27860

s27860

e87530

e87550

ASCHCI

e00900

e03260

e30400

e30500

ASCHCL

e00900

ACGNS

e01000

e01100

e01200

p22250

e22320

e22370

p23250

e24515

e24516

e24518

e24535

e24560

e24598

e24615

e24570

ASCHEI

e02000

p25350

p25470

p25700

e25820

e25850

e25860

e25940

e25980

e25920

e25960

e26110

e26170

e26190

e26160

e26180

e26270

e26100

e26390

e26400

e27200

ASCHEL

e02000

ASCHF

e02100

AUCOMP

e02300

ASOCSEC

e02400

e02500

ACPIM

e03270

e03290

e17500

ABOOK

e07300

e07400

AIPD

e19200

Appendix C Stage II targets#

  • US Population

  • Total Returns

  • Single Returns

  • Joint Returns

  • Head of Household Returns

  • Number of Returns w/ Gross Security Income Number of Dependent Exemptions

  • Taxable Interest Income

  • Ordinary Dividends

  • Business Income (Schedule C) Business Loss (Schedule C)

  • Net Capital Gains in AGI

  • Taxable Pensions and Annuities Supplemental Income (Schedule E) Supplemental Loss (Schedule E)

  • Gross Social Security Income

  • Unemployment Compensation

  • Wages and Salaries: Zero or Less

  • Wages and Salaries: \(1 Less Than \)10,000

  • Wages and Salaries: \(10,000 Less Than \)20,000

  • Wages and Salaries: \(20,000 Less Than \)30,000

  • Wages and Salaries: \(30,000 Less Than \)40,000

  • Wages and Salaries: \(40,000 Less Than \)50,000

  • Wages and Salaries: \(50,000 Less Than \)75,000

  • Wages and Salaries: \(75,000 Less Than \)100,000

  • Wages and Salaries: \(100,000 Less Than \)200,000

  • Wages and Salaries: \(200,000 Less Than \)500,000

  • Wages and Salaries: \(500,000 Less Than \)1 Million

  • Wages and Salaries: $1 Million and Over