🛸 The Directed Prediction Index (DPI).
The Directed Prediction Index (DPI) is a quasi-causal inference (causal discovery) method for observational data designed to quantify the relative endogeneity (relative dependence) of outcome (Y) versus predictor (X) variables in regression models.
Citation
- Bao, H. W. S. (2025). DPI: The Directed Prediction Index for causal inference from observational data. https://doi.org/10.32614/CRAN.package.DPI
- Bao, H. W. S. (Manuscript). The Directed Prediction Index (DPI) for causal inference from observational data by quantifying relative endogeneity.
Installation
```r
## Method 1: Install from CRAN
install.packages("DPI")

## Method 2: Install from GitHub
install.packages("devtools")
devtools::install_github("psychbruce/DPI", force=TRUE)
```
Algorithm Details
Define $\mathrm{DPI}_{X \rightarrow Y}$ as the product of a relative direction term and an absolute strength term of the expected relationship $X \rightarrow Y$ (both terms are defined in the Key Steps below).

In econometrics and the broader social sciences, an exogenous variable is assumed to have a directed (causal or quasi-causal) influence on an endogenous variable (exogenous → endogenous). By quantifying the relative endogeneity of outcome versus predictor variables in multiple linear regression models, the DPI can suggest a plausible (admissible) direction of influence (i.e., $X \rightarrow Y$) after controlling for a sufficient number of possible confounders and simulated random covariates.
Key Steps of Conceptualization and Computation
All steps have been compiled into the functions `DPI()` and `DPI_curve()`. See their help pages for usage and illustrative examples. Below are conceptual rationales and mathematical explanations.
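As a quick orientation, here is a hedged usage sketch of the main analysis function. Only the function name `DPI()` and the parameters `k.cov` and `n.sim` come from this README; the simulated data and the other argument names (`y`, `x`) are illustrative assumptions, so check `?DPI` for the actual interface.

```r
## Hedged sketch: argument names y/x and all values are assumptions (see ?DPI).
library(DPI)

set.seed(1)
n <- 500
d <- data.frame(C1 = rnorm(n), C2 = rnorm(n))
d$X <- 0.5 * d$C1 + rnorm(n)              # hypothesized predictor (more exogenous)
d$Y <- 0.4 * d$X + 0.3 * d$C2 + rnorm(n)  # hypothesized outcome (more endogenous)

## Test the directed prediction X -> Y, adding k.cov simulated random
## covariates in each of n.sim simulation samples.
DPI(d, y = "Y", x = "X", k.cov = 3, n.sim = 1000)
```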
Step 1: Relative Direction
Define $\Delta R^2$ as the relative endogeneity (relative dependence) of $Y$ vs. $X$ in a given variable set involving all possible confounders $\mathbf{C}$:

$$\Delta R^2 = R^2_{Y \sim \mathbf{C}} - R^2_{X \sim \mathbf{C}}$$

where $R^2_{Y \sim \mathbf{C}}$ and $R^2_{X \sim \mathbf{C}}$ are the proportions of variance in $Y$ and $X$, respectively, explained by $\mathbf{C}$. This term tests whether $Y$ (outcome), compared to $X$ (predictor), can be more strongly predicted by all observable control variables (included in a given sample) and unobservable random covariates (randomly generated in simulation samples, as specified by `k.cov` in the `DPI()` function). A higher $R^2$ indicates higher dependence (i.e., higher endogeneity) of that variable in a given variable set.
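The idea behind Step 1 can be illustrated with base R (a conceptual sketch, not the package's internal code): regress $Y$ and $X$ on the same set of controls, optionally augmented with simulated random covariates, and compare the two $R^2$ values.

```r
# Conceptual sketch of Step 1 (not the package's internal code).
set.seed(1)
n  <- 500
C1 <- rnorm(n); C2 <- rnorm(n)            # observed control variables
X  <- 0.5 * C1 + rnorm(n)                 # hypothesized predictor
Y  <- 0.4 * X + 0.3 * C2 + rnorm(n)       # hypothesized outcome
RC <- rnorm(n)                            # one simulated random covariate (k.cov = 1)

R2_Y <- summary(lm(Y ~ C1 + C2 + RC))$r.squared  # variance of Y explained by controls
R2_X <- summary(lm(X ~ C1 + C2 + RC))$r.squared  # variance of X explained by controls
R2_Y - R2_X  # > 0 suggests Y is relatively more endogenous than X
```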
Step 2: Absolute Strength
Define the absolute strength term from the $t$ test of the partial correlation $r_{XY \cdot \mathbf{C}}$ between $X$ and $Y$ when controlling for all possible confounders $\mathbf{C}$ (the transformation from the $p$ value to this term is tabulated below).

This term penalizes an insignificant ($p > 0.05$) partial relationship between $X$ and $Y$. The partial correlation always has the equivalent $t$ test and the same $p$ value as the partial regression coefficient of $X$ predicting $Y$ (see the code sketch after the table below). A higher absolute strength indicates a more likely (less spurious) partial relationship when controlling for all possible confounders.
Notes on the transformation among the partial $t$ statistic, its $p$ value, and the absolute strength term:

| $p$ | Absolute strength (penalty) term |
|---|---|
| ~0 | ~1 |
| 0.0001 | 0.999 |
| 0.001 | 0.990 |
| 0.01 | 0.900 |
| 0.02 | 0.803 |
| 0.03 | 0.709 |
| 0.04 | 0.620 |
| 0.05 | 0.538 |
| 0.10 | 0.238 |
| 0.20 | 0.036 |
| 0.50 | 0.00009 |
| 0.80 | 0.0000002 |
| 1 | 0.000000004 |
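The equivalence noted above, namely that the partial correlation and the partial regression coefficient share the same $t$ test and $p$ value, can be verified with base R; this sketch is illustrative and independent of the package.

```r
# Verify: the t test of the partial correlation r(X, Y | C1, C2) equals
# the t test of X's coefficient in lm(Y ~ X + C1 + C2).
set.seed(1)
n  <- 500
C1 <- rnorm(n); C2 <- rnorm(n)
X  <- 0.5 * C1 + rnorm(n)
Y  <- 0.4 * X + 0.3 * C2 + rnorm(n)

t_reg <- summary(lm(Y ~ X + C1 + C2))$coefficients["X", "t value"]

rx  <- resid(lm(X ~ C1 + C2))             # X residualized on the controls
ry  <- resid(lm(Y ~ C1 + C2))             # Y residualized on the controls
r_p <- cor(rx, ry)                        # partial correlation r(X, Y | C1, C2)
df  <- n - 2 - 2                          # n - (number of controls) - 2
t_pc <- r_p * sqrt(df / (1 - r_p^2))

c(t_regression = t_reg, t_partial_cor = t_pc)  # identical up to floating-point error
2 * pt(-abs(t_pc), df)                         # the shared two-tailed p value
```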
Step 3: Data Simulation
(1) Main analysis using `DPI()`: Simulate `n.sim` random samples, with `k.cov` (unobservable) random covariate(s) in each simulated sample, to test the statistical significance of the DPI.

(2) Robustness check using `DPI_curve()`: Run a series of DPI simulation analyses with 1 to `k.covs` (usually 1~10) random covariates, producing a curve of DPIs (estimates, 95% CI, and 99% CI; usually getting closer to 0 as `k.covs` increases) that can indicate how robustly the directed prediction is identified (i.e., how many random covariates the DPI can survive while remaining significant).

(3) Causal discovery using `DPI_dag()`: Build directed acyclic graphs (DAGs) via exploratory DPI analyses of all significant partial correlations (see the usage sketch below).
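A hedged usage sketch of the robustness check and the DAG exploration follows. Only the function names `DPI_curve()` and `DPI_dag()` and the parameter `k.covs` come from this README; the data and the remaining argument names are illustrative assumptions, so see `?DPI_curve` and `?DPI_dag` for the actual interfaces.

```r
## Hedged sketch: argument names other than k.covs are assumptions.
library(DPI)

set.seed(1)
n <- 500
d <- data.frame(C1 = rnorm(n), C2 = rnorm(n))
d$X <- 0.5 * d$C1 + rnorm(n)
d$Y <- 0.4 * d$X + 0.3 * d$C2 + rnorm(n)

## Robustness check: DPI estimates (with 95% and 99% CIs) across
## 1~10 simulated random covariates.
DPI_curve(d, y = "Y", x = "X", k.covs = 10)

## Exploratory causal discovery: a DAG built from DPI analyses of all
## significant partial correlations among the variables in d.
DPI_dag(d)
```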
Other Functions
This package also includes other functions helpful for exploring variable relationships and performing simulation studies.
- Network analysis functions
- Data simulation utility functions
  - `cor_matrix()`: Produce a symmetric correlation matrix from given values.
  - `sim_data()`: Simulate data from a multivariate normal distribution.
  - `sim_data_exp()`: Simulate experiment-like data with independent binary Xs.
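As a rough illustration of the simulation workflow these utilities support, the base-R sketch below builds a symmetric correlation matrix and draws multivariate normal data from it with `MASS::mvrnorm()`; the exact arguments of `cor_matrix()`, `sim_data()`, and `sim_data_exp()` are not documented in this README, so see their help pages for the package's own interface.

```r
## Base-R illustration only (not the DPI package's own API):
## build a symmetric correlation matrix, then simulate data from it.
library(MASS)

vars <- c("X", "Y", "C1")
R <- matrix(c(1.0, 0.3, 0.2,
              0.3, 1.0, 0.4,
              0.2, 0.4, 1.0),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

set.seed(1)
d <- as.data.frame(mvrnorm(n = 500, mu = rep(0, 3), Sigma = R))
round(cor(d), 2)  # sample correlations approximate the target matrix R
```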