
🛸 The Directed Prediction Index (DPI).

The Directed Prediction Index (DPI) is a quasi-causal inference (causal discovery) method for observational data designed to quantify the relative endogeneity (relative dependence) of outcome (Y) versus predictor (X) variables in regression models.

Author

Bruce H. W. S. Bao 包寒吴霜

📬 baohws@foxmail.com

📋 psychbruce.github.io

Citation

  • Bao, H. W. S. (2025). DPI: The Directed Prediction Index for causal inference from observational data. https://doi.org/10.32614/CRAN.package.DPI
  • Bao, H. W. S. (Manuscript). The Directed Prediction Index (DPI) for causal inference from observational data by quantifying relative endogeneity.

Installation

## Method 1: Install from CRAN
install.packages("DPI")

## Method 2: Install from GitHub
install.packages("devtools")
devtools::install_github("psychbruce/DPI", force=TRUE)

Algorithm Details

Define $\text{DPI}$ as the product of $\text{Direction}$ (relative direction) and $\text{Strength}$ (absolute strength) of the expected $X \rightarrow Y$ relationship:

$$
\begin{aligned}
\text{DPI}_{X \rightarrow Y} & = \text{Direction}_{X \rightarrow Y} \cdot \text{Strength}_{XY} \\
& = \text{Delta}(R^2) \cdot \text{Sigmoid}(\frac{p}{\alpha}) \\
& = \left( R_{Y \sim X + Covs}^2 - R_{X \sim Y + Covs}^2 \right) \cdot \left( 1 - \tanh \frac{p_{XY|Covs}}{2\alpha} \right) \\
& \in (-1, 1)
\end{aligned}
$$

In econometrics and broader social sciences, an exogenous variable is assumed to have a directed (causal or quasi-causal) influence on an endogenous variable ($ExoVar \rightarrow EndoVar$). By quantifying the relative endogeneity of outcome versus predictor variables in multiple linear regression models, the DPI can suggest a plausible (admissible) direction of influence (i.e., $\text{DPI}_{X \rightarrow Y} > 0 \text{: } X \rightarrow Y$) after controlling for a sufficient number of possible confounders and simulated random covariates.
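As a rough illustration of the formula above, the following base-R sketch computes a single DPI value on simulated data. It is not the package's implementation: the data-generating model, the helper name `dpi_once()`, and the use of one observed covariate `C` are assumptions made only for this example, and no random covariates or repeated simulations are involved here.

```r
# Minimal base-R sketch of the DPI formula (NOT the package's implementation).
set.seed(1)
n <- 500
C <- rnorm(n)                      # one observed covariate (hypothetical)
X <- rnorm(n)                      # X is generated independently of C
Y <- 0.5 * X + 0.5 * C + rnorm(n)  # Y depends on X, C, and noise
dat <- data.frame(X, Y, C)

# dpi_once() is a hypothetical helper name used only in this sketch
dpi_once <- function(data, alpha = 0.05) {
  fit_y <- summary(lm(Y ~ X + C, data = data))  # Y ~ X + Covs
  fit_x <- summary(lm(X ~ Y + C, data = data))  # X ~ Y + Covs
  direction <- fit_y$r.squared - fit_x$r.squared        # Delta(R^2)
  p <- fit_y$coefficients["X", "Pr(>|t|)"]              # p of partial X-Y relationship
  strength <- 1 - tanh(p / (2 * alpha))                 # Sigmoid(p / alpha)
  c(direction = direction, strength = strength, DPI = direction * strength)
}

res <- dpi_once(dat)
print(res)
```

Because the strength term never exceeds 1, the absolute DPI can never exceed the absolute difference in $R^2$; a non-significant partial relationship shrinks it toward 0.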

Key Steps of Conceptualization and Computation

All steps have been compiled into the functions DPI() and DPI_curve(). See their help pages for usage and illustrative examples. Below are conceptual rationales and mathematical explanations.

Step 1: Relative Direction

Define $\text{Direction}_{X \rightarrow Y}$ as relative endogeneity (relative dependence) of $Y$ vs. $X$ in a given variable set involving all possible confounders $Covs$:

$$
\begin{aligned}
\text{Direction}_{X \rightarrow Y} & = \text{Endogeneity}(Y) - \text{Endogeneity}(X) \\
& = R_{Y \sim X + Covs}^2 - R_{X \sim Y + Covs}^2 \\
& = \text{Delta}(R^2) \\
& \in (-1, 1)
\end{aligned}
$$

It uses $\text{Delta}(R^2)$ to test whether $Y$ (outcome), compared to $X$ (predictor), can be more strongly predicted by all $m$ observable control variables (included in a given sample) and $k$ unobservable random covariates (randomly generated in simulation samples, as specified by k.cov in the DPI() function). A higher $R^2$ indicates higher dependence (i.e., higher endogeneity) in a given variable set.
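The direction term can be sketched in base R as follows. The helper `delta_r2()` is hypothetical, and drawing the $k$ random covariates as independent standard-normal variables is an assumption of this sketch; the way DPI() actually generates them may differ. The built-in `mtcars` data and the variable choices are purely illustrative.

```r
# Hypothetical helper: Delta(R^2) with m observed and k random covariates.
delta_r2 <- function(data, y, x, covs = character(0), k = 1) {
  n <- nrow(data)
  # ASSUMPTION: random covariates are independent standard-normal draws
  rand <- as.data.frame(matrix(rnorm(n * k), nrow = n))
  names(rand) <- paste0("Rand", seq_len(k))
  d <- cbind(data, rand)
  rhs <- c(covs, names(rand))
  # R^2 of the model: outcome ~ predictor + observed covs + random covs
  r2 <- function(outcome, predictor) {
    f <- reformulate(c(predictor, rhs), response = outcome)
    summary(lm(f, data = d))$r.squared
  }
  r2(y, x) - r2(x, y)  # Endogeneity(Y) - Endogeneity(X)
}

set.seed(2)
d_xy <- delta_r2(mtcars, y = "mpg", x = "wt", covs = "hp", k = 3)
set.seed(2)
d_yx <- delta_r2(mtcars, y = "wt", x = "mpg", covs = "hp", k = 3)
print(c(d_xy = d_xy, d_yx = d_yx))
```

With the same random covariates, swapping the roles of $X$ and $Y$ only flips the sign, which is why the direction term is read as a relative rather than an absolute quantity.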

Step 2: Absolute Strength

Define $\text{Sigmoid}(\frac{p}{\alpha})$ as absolute strength of the partial relationship between $X$ and $Y$ when controlling for all possible confounders $Covs$:

$$
\begin{aligned}
\text{Sigmoid}(\frac{p}{\alpha}) & = 2 \left[ 1 - \text{sigmoid}(\frac{p_{XY|Covs}}{\alpha}) \right] \\
& = 1 - \tanh \frac{p_{XY|Covs}}{2\alpha} \\
& \in (0, 1)
\end{aligned}
$$

It uses $\text{Sigmoid}(\frac{p}{\alpha})$ to penalize an insignificant ($p > \alpha$) partial relationship between $X$ and $Y$. The partial correlation $r_{partial}$ always has an equivalent $t$ test and the same $p$ value as the partial regression coefficient $\beta_{partial}$ between $Y$ and $X$. A higher $\text{Sigmoid}(\frac{p}{\alpha})$ indicates a more likely (less spurious) partial relationship when controlling for all possible confounders.
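The equivalence between the partial correlation test and the partial regression coefficient test can be checked numerically. The sketch below uses the built-in `mtcars` data with an arbitrary choice of variables (`mpg` as $Y$, `wt` as $X$, `hp` as the only covariate); these choices are illustrative assumptions and not part of the package.

```r
# Illustrative check on built-in data: Y = mpg, X = wt, covariate = hp
d <- mtcars
n <- nrow(d)
n_covs <- 1

# Partial correlation of X and Y given the covariate, via residuals
res_x <- resid(lm(wt ~ hp, data = d))
res_y <- resid(lm(mpg ~ hp, data = d))
r_partial <- cor(res_x, res_y)

# t test of the partial correlation, with n - 2 - (number of covariates) df
dof <- n - 2 - n_covs
t_r <- r_partial * sqrt(dof / (1 - r_partial^2))
p_r <- 2 * pt(-abs(t_r), dof)

# p values of the partial regression coefficients in both directions
p_yx <- summary(lm(mpg ~ wt + hp, data = d))$coefficients["wt", "Pr(>|t|)"]
p_xy <- summary(lm(wt ~ mpg + hp, data = d))$coefficients["mpg", "Pr(>|t|)"]

print(c(p_r = p_r, p_yx = p_yx, p_xy = p_xy))
```

All three $p$ values coincide, so the strength term is the same whichever of the two variables is treated as the outcome.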

Notes on transformation among $\tanh(x)$, $\text{sigmoid}(x)$, and $\text{Sigmoid}(\frac{p}{\alpha})$:

$$
\begin{aligned}
\tanh(x) & = \frac{e^x - e^{-x}}{e^x + e^{-x}} \\
& = 1 - \frac{2}{1 + e^{2x}} \\
& = \frac{2}{1 + e^{-2x}} - 1 \\
& = 2 \cdot \text{sigmoid}(2x) - 1, & \in (-1, 1) \\
\text{sigmoid}(x) & = \frac{1}{1 + e^{-x}} \\
& = \frac{\tanh(\frac{x}{2}) + 1}{2}, & \in (0, 1) \\
\text{Sigmoid}(\frac{p}{\alpha}) & = 2 \left[ 1 - \text{sigmoid}(\frac{p}{\alpha}) \right] \\
& = 1 - \tanh \frac{p}{2\alpha}. & \in (0, 1)
\end{aligned}
$$
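These identities are easy to confirm numerically. In the short base-R check below, `sigmoid()` is defined locally for the purpose of the example (it is not assumed to be a function exported by the package), and the grid of test values is arbitrary.

```r
# Local definition of the logistic function, used only for this check
sigmoid <- function(x) 1 / (1 + exp(-x))

x <- seq(-5, 5, by = 0.5)
id1 <- all.equal(tanh(x), 2 * sigmoid(2 * x) - 1)      # tanh via sigmoid
id2 <- all.equal(sigmoid(x), (tanh(x / 2) + 1) / 2)    # sigmoid via tanh

p <- c(0.001, 0.01, 0.05, 0.5)
alpha <- 0.05
id3 <- all.equal(2 * (1 - sigmoid(p / alpha)),         # Sigmoid(p / alpha)
                 1 - tanh(p / (2 * alpha)))

print(c(id1, id2, id3))
```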

| $p$ | $\text{Sigmoid}(\frac{p}{\alpha})$ with $\alpha = 0.05$ |
|---|---|
| (~0) | (~1) |
| 0.0001 | 0.999 |
| 0.001 | 0.990 |
| 0.01 | 0.900 |
| 0.02 | 0.803 |
| 0.03 | 0.709 |
| 0.04 | 0.620 |
| 0.05 ($\frac{p}{\alpha}$ = 1) | 0.538 |
| 0.10 | 0.238 |
| 0.20 | 0.036 |
| 0.50 | 0.00009 |
| 0.80 | 0.0000002 |
| 1 | 0.000000004 |
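The tabulated values follow directly from $1 - \tanh \frac{p}{2\alpha}$ and can be reproduced in base R. The function name `strength()` below is a hypothetical name chosen for this sketch.

```r
# Hypothetical helper implementing Sigmoid(p / alpha) = 1 - tanh(p / (2 * alpha))
strength <- function(p, alpha = 0.05) 1 - tanh(p / (2 * alpha))

p <- c(0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.20)
tab <- data.frame(p = p, strength = round(strength(p), 3))
print(tab)
```

The penalty is mild for $p$ well below $\alpha$ and steep above it: at $p = \alpha$ roughly half of the direction term is retained, and by $p = 0.20$ less than 4% remains.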

Step 3: Data Simulation

(1) Main analysis using DPI(): Simulate n.sim random samples, with k.cov (unobservable) random covariate(s) in each simulated sample, to test the statistical significance of DPI.

(2) Robustness check using DPI_curve(): Run a series of DPI simulation analyses with 1 to k.covs (usually 1 to 10) random covariates, producing a curve of DPIs (estimates with 95% and 99% CIs, usually approaching 0 as k.covs increases) that indicates how sensitive the directed prediction is to added covariates (i.e., how many random covariates can the DPI survive while remaining significant?).

(3) Causal discovery using DPI_dag(): Build directed acyclic graphs (DAGs) from an exploratory DPI analysis of all significant partial correlations.
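The logic of items (1) and (2) can be sketched in base R as below. This is only a conceptual sketch under stated assumptions, not the procedure implemented in DPI() or DPI_curve(): the helper `sim_dpi()`, the data-generating model, the independent standard-normal random covariates, the small number of simulations, and the percentile interval are all choices made for this example.

```r
set.seed(42)
n <- 300
C <- rnorm(n)
X <- rnorm(n)
Y <- 0.5 * X + 0.5 * C + rnorm(n)   # hypothetical data-generating model
dat <- data.frame(X, Y, C)

# Hypothetical helper: n_sim DPI values, each with k fresh random covariates
sim_dpi <- function(data, k, n_sim = 200, alpha = 0.05) {
  n <- nrow(data)
  replicate(n_sim, {
    d <- data
    # ASSUMPTION: random covariates are independent standard-normal draws
    for (j in seq_len(k)) d[[paste0("R", j)]] <- rnorm(n)
    fy <- summary(lm(Y ~ ., data = d))   # Y ~ X + C + random covariates
    fx <- summary(lm(X ~ ., data = d))   # X ~ Y + C + random covariates
    p <- fy$coefficients["X", "Pr(>|t|)"]
    (fy$r.squared - fx$r.squared) * (1 - tanh(p / (2 * alpha)))
  })
}

# Curve of DPI summaries for 1 to 5 random covariates
dpi_by_k <- t(sapply(1:5, function(k) {
  dpi <- sim_dpi(dat, k)
  c(k = k, estimate = mean(dpi), quantile(dpi, c(0.025, 0.975)))
}))
print(round(dpi_by_k, 3))
```

Each row summarizes the simulated DPI values for one value of k; an interval that excludes 0 across many values of k would suggest that the direction is robust to additional covariates.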

Other Functions

This package also includes other functions helpful for exploring variable relationships and performing simulation studies.

  • Network analysis functions

    • cor_net(): Correlation and partial correlation networks.

    • BNs_dag(): Directed acyclic graphs (DAGs) via Bayesian networks (BNs).

  • Data simulation utility functions

    • cor_matrix(): Produce a symmetric correlation matrix from values.

    • sim_data(): Simulate data from a multivariate normal distribution.

    • sim_data_exp(): Simulate experiment-like data with independent binary Xs.