🛸 The Directed Prediction Index (DPI).

The Directed Prediction Index (DPI) is a causal discovery method for observational data designed to quantify the relative endogeneity of outcome (Y) versus predictor (X) variables in regression models.

⚠️ Please use version ≥ 2025.11 for correct functionality (see Changelog).

Author

Bruce H. W. S. Bao 包寒吴霜

📬 baohws@foxmail.com

📋 psychbruce.github.io

Citation

  • Bao, H. W. S. (2025). DPI: The Directed Prediction Index for causal direction inference from observational data. https://doi.org/10.32614/CRAN.package.DPI
  • Bao, H. W. S. (in preparation). The Directed Prediction Index (DPI): Causal direction inference from relative endogeneity for multivariate observational data. (Manuscript in preparation)

Installation

## Method 1: Install from CRAN
install.packages("DPI")

## Method 2: Install from GitHub
install.packages("devtools")
devtools::install_github("psychbruce/DPI", force=TRUE)

Algorithm Details

Define $\text{DPI} \in (-1, 1)$ as $\text{RelativeEndogeneity} \in (-1, 1)$ restricted by $\text{NormalizedPenalty} \in (0, 1)$ of the $X \rightarrow Y$ relationship:

$$
\begin{aligned}
\text{DPI}_{X \rightarrow Y} & = \text{RelativeEndogeneity}_{X \rightarrow Y} \cdot \text{NormalizedPenalty}_{X \rightarrow Y} \\
& = \text{Delta}(R^2) \cdot \text{Sigmoid}\left(\frac{p}{\alpha}\right) \\
& = \left( R_{Y \sim X + Covs}^2 - R_{X \sim Y + Covs}^2 \right) \cdot \left( 1 - \tanh \frac{p_{XY|Covs}}{2\alpha} \right) \\
& \in (-1, 1)
\end{aligned}
$$

In econometrics and the broader social sciences, an exogenous variable is assumed to have a directed (causal or quasi-causal) influence on an endogenous variable ($ExoVar \rightarrow EndoVar$). By quantifying the relative endogeneity of the outcome versus the predictor in multiple linear regression models, the DPI can suggest a plausible (admissible) causal direction ($\text{DPI}_{X \rightarrow Y} > 0$ is a necessary but insufficient condition for $X \rightarrow Y$) after controlling for possible confounders and simulated random covariates.
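As a minimal numeric illustration of this formula, here is a Python sketch (independent of the R package's implementation; the R² and p values below are made-up inputs, not outputs of DPI()):

```python
import math

def dpi(r2_y, r2_x, p, alpha=0.05):
    """Directed Prediction Index for X -> Y.

    r2_y : R^2 of the regression Y ~ X + Covs
    r2_x : R^2 of the regression X ~ Y + Covs
    p    : p value of the partial correlation between X and Y given Covs
    """
    delta_r2 = r2_y - r2_x                    # relative endogeneity, in (-1, 1)
    penalty = 1 - math.tanh(p / (2 * alpha))  # normalized penalty, in (0, 1)
    return delta_r2 * penalty

# Y is more endogenous (higher R^2) and the partial relationship is significant:
print(round(dpi(r2_y=0.50, r2_x=0.30, p=0.001), 3))  # -> 0.198
```

With an insignificant p value the penalty term shrinks toward 0, so the DPI is attenuated regardless of how large the R² gap is.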

Conceptualization and Computation

All steps have been compiled into DPI() and DPI_curve(). See their help pages for usage and illustrative examples. Below are conceptual rationales and mathematical explanations.

Step 1: Relative Endogeneity for Plausible Causal Direction

Define $\text{Direction}_{X \rightarrow Y}$ as the relative endogeneity of $Y$ vs. $X$ in a given variable set involving all possible confounders $Covs$:

$$
\begin{aligned}
\text{Direction}_{X \rightarrow Y} & = \text{Endogeneity}(Y) - \text{Endogeneity}(X) \\
& = R_{Y \sim X + Covs}^2 - R_{X \sim Y + Covs}^2 \\
& = \text{Delta}(R^2) \\
& \in (-1, 1)
\end{aligned}
$$

The $\text{Delta}(R^2)$ endogeneity score tests whether $Y$ (outcome), compared with $X$ (predictor), can be more strongly predicted by all $m$ observable control variables (included in a given sample) and $k$ unobservable random covariates (randomly generated in simulation samples, as specified by k.cov in the DPI() function). A higher $R^2$ indicates higher endogeneity within a set of variables.

As an ideal property, $\text{Delta}(R^2)$ also ensures that the resulting Directed Acyclic Graph (DAG) structure is both directed and acyclic: each direction (edge) is constrained to go from a lower-$R^2$ variable (node) to a higher-$R^2$ variable (node) within a specific set of variables. Therefore, it is impossible to observe any cyclic relationship in the DPI framework.
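The acyclicity argument can be sketched directly (Python, with hypothetical R² scores rather than package output): if every edge points from a lower-R² node to a higher-R² node, then sorting the nodes by R² yields a topological order, which is exactly the condition for a DAG.

```python
# Hypothetical R^2 (endogeneity) scores for four variables
r2 = {"A": 0.10, "B": 0.25, "C": 0.40, "D": 0.55}

# Direct every edge from the lower-R^2 node to the higher-R^2 node
edges = [(u, v) for u in r2 for v in r2 if r2[u] < r2[v]]

# Sorting nodes by R^2 gives a topological order: every edge goes
# "forward" in this order, so no cycle can exist.
order = {node: i for i, node in enumerate(sorted(r2, key=r2.get))}
assert all(order[u] < order[v] for u, v in edges)
print(edges)
```

Any cycle would require at least one edge from a higher-R² node back to a lower-R² node, which the constraint rules out by construction.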

Step 2: Normalized Penalty for Insignificant Partial Correlation

Define $\text{Sigmoid}(\frac{p}{\alpha})$ as a normalized penalty for an insignificant partial relationship between $X$ and $Y$ when controlling for all confounders $Covs$:

$$
\begin{aligned}
\text{Sigmoid}\left(\frac{p}{\alpha}\right) & = 2 \left[ 1 - \text{sigmoid}\left(\frac{p_{XY|Covs}}{\alpha}\right) \right] \\
& = 1 - \tanh \frac{p_{XY|Covs}}{2\alpha} \\
& \in (0, 1)
\end{aligned}
$$

The $\text{Sigmoid}(\frac{p}{\alpha})$ penalty score penalizes insignificant ($p > \alpha$) partial relationships between $X$ and $Y$. The partial correlation $r_{partial}$ always has the equivalent $t$ test, and thus the same $p$ value, as the partial regression coefficient $\beta_{partial}$ between $Y$ and $X$. A higher $\text{Sigmoid}(\frac{p}{\alpha})$ indicates a more likely (less spurious) partial relationship when controlling for all possible confounders. Note that it does not indicate the strength or effect size of a relationship; it is used mainly to penalize insignificant partial relationships.

To control the false-positive rate, users can set a lower $\alpha$ level (see alpha in DPI() and related functions) and/or use Bonferroni correction for multiple pairwise tests (see bonf in DPI() and related functions).
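For example, lowering α sharpens the penalty for any given p value; a Bonferroni-style correction (sketched here simply as dividing α by the number of pairwise tests, as an illustration rather than the package's exact implementation) has the same effect:

```python
import math

def penalty(p, alpha=0.05):
    """Normalized penalty Sigmoid(p / alpha) = 1 - tanh(p / (2 * alpha))."""
    return 1 - math.tanh(p / (2 * alpha))

p = 0.01
print(round(penalty(p, alpha=0.05), 3))      # -> 0.9
print(round(penalty(p, alpha=0.01), 3))      # -> 0.538
# Bonferroni-style correction for, say, 5 pairwise tests: alpha / 5
print(round(penalty(p, alpha=0.05 / 5), 3))  # -> 0.538
```

A p value that passes comfortably at α = 0.05 is penalized much more heavily once α is tightened, which is how the DPI trades sensitivity for a lower false-positive rate.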

Notes on the transformations among $\tanh(x)$, $\text{sigmoid}(x)$, and $\text{Sigmoid}(\frac{p}{\alpha})$:

$$
\begin{aligned}
\tanh(x) & = \frac{e^x - e^{-x}}{e^x + e^{-x}} \\
& = 1 - \frac{2}{1 + e^{2x}} \\
& = \frac{2}{1 + e^{-2x}} - 1 \\
& = 2 \cdot \text{sigmoid}(2x) - 1, & \in (-1, 1) \\
\text{sigmoid}(x) & = \frac{1}{1 + e^{-x}} \\
& = \frac{1}{2} \left[ \tanh\left(\frac{x}{2}\right) + 1 \right], & \in (0, 1) \\
\text{Sigmoid}\left(\frac{p}{\alpha}\right) & = 2 \left[ 1 - \text{sigmoid}\left(\frac{p}{\alpha}\right) \right] \\
& = 1 - \tanh \frac{p}{2\alpha}. & \in (0, 1)
\end{aligned}
$$
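These identities are easy to verify numerically (a Python sketch):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # tanh(x) = 2 * sigmoid(2x) - 1
    assert abs(math.tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12
    # sigmoid(x) = [tanh(x/2) + 1] / 2
    assert abs(sigmoid(x) - (math.tanh(x / 2) + 1) / 2) < 1e-12

# Sigmoid(p/alpha) = 2 * [1 - sigmoid(p/alpha)] = 1 - tanh(p / (2*alpha))
p, alpha = 0.03, 0.05
assert abs(2 * (1 - sigmoid(p / alpha)) - (1 - math.tanh(p / (2 * alpha)))) < 1e-12
print("identities hold")
```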

Wagenmakers (2022) also proposed a simple and useful algorithm to compute approximate (pseudo) Bayes Factors from p values and sample sizes (see transformation rules below).

$$
\text{PseudoBF}_{10}(p, n) =
\begin{cases}
\dfrac{1}{3 p \sqrt{n}} & \text{if } 0 < p \le 0.10 \\[2ex]
\dfrac{1}{\frac{4}{3} p^{2/3} \sqrt{n}} & \text{if } 0.10 < p \le 0.50 \\[2ex]
\dfrac{1}{p^{1/4} \sqrt{n}} & \text{if } 0.50 < p \le 1
\end{cases}
$$
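A direct Python transcription of this piecewise rule (a sketch, independent of the package's p_to_bf() implementation):

```python
import math

def pseudo_bf10(p, n):
    """Approximate (pseudo) Bayes factor BF10 from a p value and
    sample size n (Wagenmakers, 2022)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if p <= 0.10:
        return 1 / (3 * p * math.sqrt(n))
    if p <= 0.50:
        return 1 / ((4 / 3) * p ** (2 / 3) * math.sqrt(n))
    return 1 / (p ** 0.25 * math.sqrt(n))

print(round(pseudo_bf10(0.01, 100), 3))  # -> 3.333
print(round(pseudo_bf10(0.20, 100), 3))  # -> 0.219
print(round(pseudo_bf10(0.80, 100), 3))  # -> 0.106
```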

Below we show that the normalized penalty scores $\text{Sigmoid}(\frac{p}{\alpha})$ and the normalized log pseudo Bayes Factors $\text{sigmoid}(\log \text{PseudoBF}_{10})$ have comparable effects in penalizing insignificant $p$ values. However, $\text{Sigmoid}(\frac{p}{\alpha})$ imposes stronger penalties when $p > \alpha$ by pushing the penalty scores closer to 0, and it also makes straightforward both the specification of a more conservative $\alpha$ level and the Bonferroni correction of $p$ values for multiple pairwise DPI tests.

Table. Transformation from $p$ values to normalized penalty scores and pseudo Bayes Factors.

| $p$ value | $\text{Sigmoid}(p/\alpha)$ ($\alpha = 0.05$) | $\text{Sigmoid}(p/\alpha)$ ($\alpha = 0.01$) | $\text{PseudoBF}_{10}$ ($n = 100$) [$\text{sigmoid}(\log \text{BF})$] | $\text{PseudoBF}_{10}$ ($n = 1000$) [$\text{sigmoid}(\log \text{BF})$] |
|---|---|---|---|---|
| ~0 | ~1 | ~1 | $+\infty$ [~1] | $+\infty$ [~1] |
| 0.0001 | 0.999 | 0.995 | 333.333 [0.997] | 105.409 [0.991] |
| 0.001 | 0.990 | 0.950 | 33.333 [0.971] | 10.541 [0.913] |
| 0.01 | 0.900 | 0.538 | 3.333 [0.769] | 1.054 [0.513] |
| 0.02 | 0.803 | 0.238 | 1.667 [0.625] | 0.527 [0.345] |
| 0.03 | 0.709 | 0.095 | 1.111 [0.526] | 0.351 [0.260] |
| 0.04 | 0.620 | 0.036 | 0.833 [0.455] | 0.264 [0.209] |
| 0.05 | 0.538 | 0.013 | 0.667 [0.400] | 0.211 [0.174] |
| 0.10 | 0.238 | 0.00009 | 0.333 [0.250] | 0.105 [0.095] |
| 0.20 | 0.036 | 0 | 0.219 [0.180] | 0.069 [0.065] |
| 0.50 | 0.00009 | 0 | 0.119 [0.106] | 0.038 [0.036] |
| 0.80 | 0 | 0 | 0.106 [0.096] | 0.033 [0.032] |
| 1 | 0 | 0 | 0.100 [0.091] | 0.032 [0.031] |
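This comparison can be reproduced in a few lines (a Python sketch): for $p$ values above α, the tanh-based penalty drops toward 0 much faster than the normalized log pseudo Bayes Factor.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def penalty(p, alpha=0.05):
    """Normalized penalty Sigmoid(p / alpha)."""
    return 1 - math.tanh(p / (2 * alpha))

def pseudo_bf10(p, n):
    """Pseudo Bayes factor from p and n (Wagenmakers, 2022)."""
    if p <= 0.10:
        return 1 / (3 * p * math.sqrt(n))
    if p <= 0.50:
        return 1 / ((4 / 3) * p ** (2 / 3) * math.sqrt(n))
    return 1 / (p ** 0.25 * math.sqrt(n))

for p in (0.01, 0.05, 0.20):
    tanh_pen = penalty(p)                            # Sigmoid(p / 0.05)
    bf_pen = sigmoid(math.log(pseudo_bf10(p, 100)))  # normalized log BF
    print(f"p={p}: Sigmoid={tanh_pen:.3f}, sigmoid(logBF)={bf_pen:.3f}")
# For p = 0.20 (> alpha): 0.036 vs. 0.180 -- the tanh penalty is much stronger.
```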

Step 3: Data Simulation

(1) Main analysis using DPI(): Simulate n.sim random samples, with k.cov (unobservable) random covariate(s) in each simulated sample, to test the statistical significance of DPI.

(2) Robustness check using DPI_curve(): Run a series of DPI simulation analyses with 1 to k.covs (usually 1~10) random covariates, producing a curve of DPI estimates and 95% CIs (usually approaching 0 as k.covs increases) that indicates the sensitivity of the directed prediction: how many random covariates can the DPI survive while remaining significant?

(3) Causal discovery using DPI_dag(): Build directed acyclic graphs (DAGs) via exploratory DPI analysis of all significant partial correlations.

Other Functions

This package also includes other functions helpful for exploring variable relationships and performing simulation studies.

  • Network analysis functions

    • cor_net(): Correlation and partial correlation networks.

    • BNs_dag(): Directed acyclic graphs (DAGs) via Bayesian networks (BNs).

  • Data simulation functions

    • sim_data(): Simulate data from a multivariate normal distribution.

    • sim_data_exp(): Simulate experiment-like data with independent binary Xs.

  • Miscellaneous functions

    • cor_matrix(): Produce a symmetric correlation matrix from values.

    • p_to_bf(): Convert p values to pseudo Bayes Factors (PseudoBF10\text{PseudoBF}_{10}).