# 🛸 The Directed Prediction Index (DPI)
The Directed Prediction Index (DPI) is a quasi-causal inference (causal discovery) method for observational data designed to quantify the relative endogeneity (relative dependence) of outcome (Y) versus predictor (X) variables in regression models.
⚠️ Please use version ≥ 2025.11 for correct functionality (see Changelog).
## Citation
- Bao, H. W. S. (2025). DPI: The Directed Prediction Index for causal inference from observational data. https://doi.org/10.32614/CRAN.package.DPI
- Bao, H. W. S. (Manuscript). The Directed Prediction Index (DPI): Quantifying relative endogeneity for causal inference from observational data.
## Installation
```r
## Method 1: Install from CRAN
install.packages("DPI")

## Method 2: Install from GitHub
install.packages("devtools")
devtools::install_github("psychbruce/DPI", force = TRUE)
```
## Algorithm Details
Define $\text{DPI}_{X \rightarrow Y}$ as the product of $\Delta R^2_{(Y \text{ vs. } X)}$ (relative endogeneity as direction score) and $S_{XY \cdot C}$ (normalized penalty as significance score) of the expected influence (quasi-causal) relationship $X \rightarrow Y$:

$$\text{DPI}_{X \rightarrow Y} = \Delta R^2_{(Y \text{ vs. } X)} \cdot S_{XY \cdot C}$$
In econometrics and broader social sciences, an exogenous variable is assumed to have a directed (causal or quasi-causal) influence on an endogenous variable ($X \rightarrow Y$). By quantifying the relative endogeneity of outcome versus predictor variables in multiple linear regression models, the DPI can suggest a plausible (admissible) direction of influence (i.e., $X \rightarrow Y$) after controlling for a sufficient number of possible confounders and simulated random covariates.
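Under the two scores defined below, a made-up numeric example for intuition: if the confounder set predicts $Y$ with $R^2 = 0.40$ but predicts $X$ with only $R^2 = 0.25$, the direction score is $0.40 - 0.25 = 0.15$; if the partial relationship between $X$ and $Y$ has $p = .001$ at $\alpha = .05$ (penalty $\approx 0.990$, per the table in Step 2), then $\text{DPI}_{X \rightarrow Y} \approx 0.15 \times 0.990 \approx 0.149 > 0$, consistent with the direction $X \rightarrow Y$.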
### Key Steps of Conceptualization and Computation
All steps have been compiled into `DPI()` and `DPI_curve()`. See their help pages for usage and illustrative examples. Below are conceptual rationales and mathematical explanations.
#### Step 1: Relative Endogeneity as Direction Score
Define $\Delta R^2_{(Y \text{ vs. } X)}$ as the relative endogeneity (relative dependence) of $Y$ vs. $X$ in a given variable set involving all possible confounders $C$:

$$\Delta R^2_{(Y \text{ vs. } X)} = R^2_{Y \sim C} - R^2_{X \sim C}$$
The endogeneity score aims to test whether $Y$ (outcome), compared to $X$ (predictor), can be more strongly predicted by all observable control variables (included in a given sample) and unobservable random covariates (randomly generated in simulation samples, as specified by `k.cov` in the `DPI()` function). A higher $R^2$ indicates higher endogeneity in a set of variables.
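A minimal sketch of this comparison on made-up data, assuming the direction score is the difference in $R^2$ when regressing each of $Y$ and $X$ on all control variables (a hand-rolled illustration of the rationale only; `DPI()` handles the simulation internally):

```r
set.seed(1)

# Made-up data: x -> y with one observed confounder c1
n  <- 500
c1 <- rnorm(n)
x  <- 0.5 * c1 + rnorm(n)
y  <- 0.4 * x + 0.5 * c1 + rnorm(n)
d  <- data.frame(y, x, c1)

# Add k.cov simulated (unobservable) random covariates
k.cov <- 3
rc <- matrix(rnorm(n * k.cov), n, k.cov,
             dimnames = list(NULL, paste0("rc", 1:k.cov)))
d <- cbind(d, rc)

# Direction score: is y more predictable than x from all
# control variables (observed + simulated random covariates)?
controls <- c("c1", paste0("rc", 1:k.cov))
r2 <- function(dv) summary(lm(reformulate(controls, dv), data = d))$r.squared
r2("y") - r2("x")  # > 0: y is relatively endogenous, suggesting x -> y
```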
#### Step 2: Normalized Penalty as Significance Score
Define $S_{XY \cdot C}$ as the normalized penalty for insignificance of the partial relationship between $X$ and $Y$ when controlling for all possible confounders $C$:

$$S_{XY \cdot C} = 1 - \tanh\left(\frac{p_{XY \cdot C}}{2\alpha}\right)$$
The penalty score aims to penalize an insignificant ($p > \alpha$) partial relationship between $X$ and $Y$. Partial correlation always has the equivalent $t$ test and the same $p$ value as the partial regression coefficient between $X$ and $Y$. A higher $S_{XY \cdot C}$ indicates a more likely (less spurious) partial relationship when controlling for all possible confounders. Note that it does not reflect the strength or effect size of the relationship; it is used mainly to penalize insignificant partial relationships.
To control for false positive rates, users can set a lower $\alpha$ level (see `alpha` in `DPI()` and related functions) and/or use Bonferroni correction for multiple pairwise tests (see `bonf` in `DPI()` and related functions).
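A small sketch of the penalty score, assuming the $1 - \tanh$ form given above (inferred from the transformation table below, not copied from the package source). Note that dividing $\alpha$ by the number of tests is algebraically equivalent here to multiplying $p$ by it:

```r
# Penalty score; the 1 - tanh(p / (2 * alpha)) form is an assumption
# inferred from the transformation table below, not package source code.
penalty <- function(p, alpha = 0.05, n.tests = 1) {
  1 - tanh(p / (2 * alpha / n.tests))  # Bonferroni: alpha / n.tests
}

penalty(0.01)                 # 0.900 (cf. table row p = .01, alpha = .05)
penalty(0.05)                 # 0.538
penalty(0.05, alpha = 0.01)   # 0.013 (more conservative alpha level)
penalty(0.01, n.tests = 5)    # 0.538 (Bonferroni across 5 pairwise tests)
```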
Notes on the transformation among $p$ values, penalty scores, and pseudo Bayes Factors:
Wagenmakers (2022) also proposed a simple and useful algorithm to compute approximate (pseudo) Bayes Factors from $p$ values and sample sizes: $\mathrm{BF}_{01} \approx 3 p \sqrt{n}$, i.e., $\mathrm{BF}_{10} \approx \frac{1}{3 p \sqrt{n}}$ (see the transformation table below).
Below we show that normalized penalty scores and normalized log pseudo Bayes Factors ($\mathrm{sigmoid}(\log \mathrm{BF}_{10})$) have comparable effects in penalizing insignificant $p$ values. However, $S_{XY \cdot C}$ indeed imposes stronger penalties on $p$ values when $p > \alpha$ by restricting the penalty scores closer to 0, and it also makes straightforward both the specification of a more conservative $\alpha$ level and the Bonferroni correction of $p$ values for multiple pairwise DPI tests.
Table. Transformation from $p$ values to normalized penalty scores and pseudo Bayes Factors.

| $p$ value | Penalty ($\alpha = .05$) | Penalty ($\alpha = .01$) | Pseudo-$\mathrm{BF}_{10}$ ($n = 100$) [sigmoid(logBF)] | Pseudo-$\mathrm{BF}_{10}$ ($n = 1000$) [sigmoid(logBF)] |
|---|---|---|---|---|
| $p \to 0$ | ~1 | ~1 | $\to \infty$ [~1] | $\to \infty$ [~1] |
| 0.0001 | 0.999 | 0.995 | 333.333 [0.997] | 105.409 [0.991] |
| 0.001 | 0.990 | 0.950 | 33.333 [0.971] | 10.541 [0.913] |
| 0.01 | 0.900 | 0.538 | 3.333 [0.769] | 1.054 [0.513] |
| 0.02 | 0.803 | 0.238 | 1.667 [0.625] | 0.527 [0.345] |
| 0.03 | 0.709 | 0.095 | 1.111 [0.526] | 0.351 [0.260] |
| 0.04 | 0.620 | 0.036 | 0.833 [0.455] | 0.264 [0.209] |
| 0.05 | 0.538 | 0.013 | 0.667 [0.400] | 0.211 [0.174] |
| 0.10 | 0.238 | 0.00009 | 0.333 [0.250] | 0.105 [0.095] |
| 0.20 | 0.036 | ~0 | 0.219 [0.180] | 0.069 [0.065] |
| 0.50 | 0.00009 | ~0 | 0.119 [0.106] | 0.038 [0.036] |
| 0.80 | ~0 | ~0 | 0.106 [0.096] | 0.033 [0.032] |
| 1 | ~0 | ~0 | 0.100 [0.091] | 0.032 [0.031] |
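The small-$p$ rows of this table can be reproduced under the stated assumptions (the $1 - \tanh$ penalty form, and the $3p\sqrt{n}$ rule with $n = 100$ and $n = 1000$); for larger $p$ values the raw rule deviates from the tabulated values, so prefer the package's own `p_to_bf()`:

```r
# Reproduce the small-p rows of the table under stated assumptions:
# penalty = 1 - tanh(p / (2 * alpha)); pseudo-BF10 = 1 / (3 * p * sqrt(n));
# sigmoid(logBF) = BF / (1 + BF)
p <- c(0.0001, 0.001, 0.01, 0.05, 0.10)
penalty <- function(p, alpha) 1 - tanh(p / (2 * alpha))
bf10    <- function(p, n) 1 / (3 * p * sqrt(n))
sigmoid <- function(bf) bf / (1 + bf)

round(data.frame(
  p          = p,
  pen.05     = penalty(p, 0.05),
  pen.01     = penalty(p, 0.01),
  sig.bf.100 = sigmoid(bf10(p, 100)),
  sig.bf.1e3 = sigmoid(bf10(p, 1000))
), 3)
```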
#### Step 3: Data Simulation
(1) Main analysis using `DPI()`: Simulate `n.sim` random samples, with `k.cov` (unobservable) random covariate(s) in each simulated sample, to test the statistical significance of the DPI.
(2) Robustness check using `DPI_curve()`: Run a series of DPI simulation analyses with 1~`k.covs` (usually 1~10) random covariates, respectively, producing a curve of DPIs (estimates and 95% CI; usually getting closer to 0 as the number of random covariates increases) that indicates how robustly the directed prediction is identified (i.e., how many random covariates can the DPI survive and remain significant?).
(3) Causal discovery using `DPI_dag()`: Produce directed acyclic graphs (DAGs) via DPI exploratory analyses of all significant partial correlations.
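A minimal usage sketch of these three functions on made-up data. Only `n.sim`, `k.cov`, `k.covs`, `alpha`, and `bonf` are documented above; the `data`/`x`/`y` argument names below are assumptions, so check the help pages for exact signatures:

```r
library(DPI)
set.seed(1)

# Made-up data: x -> y with one observed confounder c1
n  <- 200
c1 <- rnorm(n)
x  <- 0.5 * c1 + rnorm(n)
y  <- 0.4 * x + 0.5 * c1 + rnorm(n)
d  <- data.frame(y, x, c1)

# (1) Main analysis: n.sim simulated samples, k.cov random covariates
#     (data/x/y argument names are assumptions -- see ?DPI)
DPI(data = d, x = "x", y = "y", n.sim = 1000, k.cov = 1, alpha = 0.05)

# (2) Robustness check: DPI curve across 1~10 random covariates
DPI_curve(data = d, x = "x", y = "y", k.covs = 10)

# (3) Causal discovery: DAG from all significant partial correlations
DPI_dag(data = d)
```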
## Other Functions
This package also includes other functions helpful for exploring variable relationships and performing simulation studies.
- Network analysis functions
- Data simulation functions
  - `sim_data()`: Simulate data from a multivariate normal distribution.
  - `sim_data_exp()`: Simulate experiment-like data with independent binary Xs.
- Miscellaneous functions
  - `cor_matrix()`: Produce a symmetric correlation matrix from given $r$ values.
  - `p_to_bf()`: Convert $p$ values to pseudo Bayes Factors ($\mathrm{BF}_{10}$).
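For flavor, concept-level sketches of what two of these helpers do. These are not the DPI package APIs (see the help pages for actual signatures); the code below uses only standard R and MASS:

```r
# Concept sketches only -- not the DPI package APIs (see the help pages)

# What sim_data() provides conceptually: multivariate normal data
# (here via MASS::mvrnorm, a standard alternative)
library(MASS)
R <- matrix(c(1.0, 0.3, 0.3,
              0.3, 1.0, 0.3,
              0.3, 0.3, 1.0), nrow = 3)  # symmetric correlation matrix
d <- as.data.frame(mvrnorm(n = 500, mu = rep(0, 3), Sigma = R))

# What p_to_bf() provides conceptually (Wagenmakers, 2022):
# BF01 ~= 3 * p * sqrt(n), hence BF10 ~= 1 / (3 * p * sqrt(n))
1 / (3 * 0.01 * sqrt(500))  # ~1.49
```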