
Rank-Based Structural Equation Modelling

When I listen to what people describe as patterns in their data, I rarely find that linearity itself is of any core interest. Rather, many tools and educational resources are oriented around linear models. That, I think, has to do with the path-dependence of historical developments and with the computational feasibility of fitting linear models. But when pressed, most people don’t actually give any special preference to linearity as being scientifically plausible.

Now, what will seem like a weird flex to many people is the notion of doing structural equation modelling on ranks. Other people, who were told from on high that “thou shalt not regress on ranks”, will see this approach as invalid. But it is valid. It’s math.

Structural Model

First, I used a naive algorithm to choose a structural model at random. Here it is in R-like (specifically Lavaan-like) equations:

x42 ~ x90+x0
x27 ~ x87+x32
x12 ~ x0+x3+x84
x74 ~ x83+x76+x50
x63 ~ x51+x44+x77+x28
x93 ~ x90
x30 ~ x97
x14 ~ x44+x54
x81 ~ x6
x38 ~ x23+x49
x29 ~ x9
x62 ~ x2
x75 ~ x57
x31 ~ x65
x71 ~ x15
x68 ~ x87+x65
x43 ~ x1
x92 ~ x1
x60 ~ x96+x37
x0 ~ x5
x3 ~ x44+x89+x37
x50 ~ x16
x90 ~ x80
x2 ~ x67+x10
x57 ~ x65+x67+x7+x26
x1 ~ x66+x40
x96 ~ x4+x55
x5 ~ x41
x44 ~ x55
x89 ~ x15+x55
x37 ~ x54
x10 ~ x25+x52+x16
x67 ~ x84
x66 ~ x47+x22
x4 ~ x77
x54 ~ x87+x85
x25 ~ x19
x52 ~ x83
x47 ~ x13+x70
x22 ~ x7+x28
x87 ~ x16+x88+x78+x64
x19 ~ x58
x83 ~ x61
x13 ~ x58
x70 ~ x97+x80
x7 ~ x23
x88 ~ x6
x64 ~ x9
x61 ~ x84
x58 ~ x17
x23 ~ x59+x94
x9 ~ x78+x36
x17 ~ x79
x59 ~ x55
x79 ~ x77+x26
x55 ~ x20+x6+x33
x77 ~ x41
x26 ~ x84
x20 ~ x72
x84 ~ x98+x72
x72 ~ x82
x82 ~ x8

Pretty incomprehensible, right? Well, that’s not the syntax’s fault. And it isn’t entirely the fault of my not sorting the order of the equations, although that could have helped. This is just a pretty complicated model! We’ll look at the results momentarily. Next I chose a sampling distribution and generated data according to the structural equation model under an assumption of normality.

Each variable $X_j$ in structural form was simply of the form

\[X_j = \sum_{i \in \text{parents}(j)} \beta_{ij} X_i + \epsilon_j\]

where $\text{parents}(j)$ represents the set of parent nodes in the directed graph and I took

\[\epsilon_j \sim \mathcal{N}(0,1)\]

and

\[\beta_{ij} \sim \mathcal{N}(0,10^2),\]

i.e. the coefficients were drawn with standard deviation 10 (the scale=10 in the code below).

Note that when $j$ does not have any parents in the graph, that implies

\[X_j = \epsilon_j,\]

meaning that (1) the exogenous variables are already standardized and (2) this is a pretty run-of-the-mill linear SEM that induces (via change of variables) a multivariate normal distribution over the random variables \(\{ X_0, \ldots, X_{99} \}\).
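To spell the change of variables out: collect the coefficients $\beta_{ij}$ into a matrix $B$ and stack the structural equations. This reduced-form identity is standard, and it is where acyclicity earns its keep:

\[X = B X + \epsilon \implies X = (I - B)^{-1} \epsilon \implies X \sim \mathcal{N}\left(0, (I - B)^{-1} (I - B)^{-\top}\right),\]

where $(I - B)$ is guaranteed to be invertible because the graph is acyclic, so a permutation of the variables makes $B$ strictly triangular.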

Train Linear SEM

Ignore the p-values in this example. We don’t care about them in this post, and they’re not the right p-values anyway.

| lval | op | rval | Estimate | Std. Err | z-value | p-value |
|------|----|------|----------|----------|---------|---------|
| x0 | ~ | x5 | -0.0158832 | 0.00651196 | -2.43908 | 0.0147246 |
| x3 | ~ | x44 | -0.0756279 | 0.0259957 | -2.90925 | 0.00362303 |
| x3 | ~ | x89 | 0.0258967 | 0.0155262 | 1.66793 | 0.0953292 |
| x3 | ~ | x37 | -0.0249739 | 0.0184324 | -1.35489 | 0.175452 |
| x50 | ~ | x16 | -14.8368 | 0.0322415 | -460.179 | 0 |
| x90 | ~ | x80 | -2.93151 | 0.0320633 | -91.4287 | 0 |
| x2 | ~ | x67 | -0.0292056 | 0.0143646 | -2.03316 | 0.042036 |
| x2 | ~ | x10 | -0.0202138 | 0.0216339 | -0.934354 | 0.350122 |
| x57 | ~ | x65 | -6.11732 | 0.428089 | -14.2898 | 0 |
| x57 | ~ | x67 | 0.760306 | 0.0233431 | 32.5709 | 0 |
| x57 | ~ | x7 | 0.0803973 | 0.0241183 | 3.33346 | 0.000857734 |
| x57 | ~ | x26 | -1.858 | 0.056646 | -32.8003 | 0 |
| x1 | ~ | x66 | -0.118996 | 0.0200425 | -5.93719 | 2.89951e-09 |
| x1 | ~ | x40 | -7.12147 | 0.121875 | -58.4324 | 0 |
| x96 | ~ | x4 | 3.16283 | 0.229647 | 13.7726 | 0 |
| x96 | ~ | x55 | 0.0274288 | 0.0408267 | 0.671835 | 0.501689 |
| x5 | ~ | x41 | -11.4945 | 0.0321778 | -357.219 | 0 |
| x44 | ~ | x55 | -0.0593373 | 0.0164798 | -3.60061 | 0.000317471 |
| x89 | ~ | x15 | -12.0029 | 0.104779 | -114.554 | 0 |
| x89 | ~ | x55 | 0.0234407 | 0.00738285 | 3.17503 | 0.00149823 |
| x37 | ~ | x54 | -0.0420312 | 0.0203518 | -2.06523 | 0.0389008 |
| x10 | ~ | x25 | 0.359585 | 0.0598 | 6.01313 | 1.81979e-09 |
| x10 | ~ | x52 | -0.0998501 | 0.0590903 | -1.68979 | 0.0910684 |
| x10 | ~ | x16 | 6.78323 | 0.325214 | 20.8577 | 0 |
| x67 | ~ | x84 | 0.159328 | 0.0563247 | 2.82873 | 0.00467326 |
| x66 | ~ | x47 | 0.159663 | 0.031331 | 5.09601 | 3.46878e-07 |
| x66 | ~ | x22 | -0.00835746 | 0.00933564 | -0.895221 | 0.370669 |
| x4 | ~ | x77 | -0.967856 | 0.0405739 | -23.8541 | 0 |
| x54 | ~ | x87 | 0.211509 | 0.0636722 | 3.32184 | 0.000894271 |
| x54 | ~ | x85 | 7.0352 | 0.463738 | 15.1706 | 0 |
| x25 | ~ | x19 | 0.043103 | 0.014441 | 2.98476 | 0.00283799 |
| x52 | ~ | x83 | 2.46168 | 0.0793286 | 31.0314 | 0 |
| x47 | ~ | x13 | 0.176316 | 0.0362993 | 4.85729 | 1.19002e-06 |
| x47 | ~ | x70 | 0.0139711 | 0.00987645 | 1.41459 | 0.157188 |
| x22 | ~ | x7 | 0.0976122 | 0.0346553 | 2.81666 | 0.0048526 |
| x22 | ~ | x28 | 6.09568 | 0.620052 | 9.83092 | 0 |
| x87 | ~ | x16 | -5.90051 | 0.0788379 | -74.8436 | 0 |
| x87 | ~ | x88 | -0.595774 | 0.042375 | -14.0596 | 0 |
| x87 | ~ | x78 | 3.31886 | 0.0739847 | 44.8588 | 0 |
| x87 | ~ | x64 | 0.0195912 | 0.0125143 | 1.56551 | 0.117464 |
| x19 | ~ | x58 | -0.733988 | 0.0804946 | -9.11848 | 0 |
| x83 | ~ | x61 | 0.347471 | 0.0243161 | 14.2897 | 0 |
| x13 | ~ | x58 | -0.320342 | 0.0365891 | -8.75514 | 0 |
| x70 | ~ | x97 | -17.3946 | 0.0294302 | -591.046 | 0 |
| x70 | ~ | x80 | -6.73016 | 0.0318525 | -211.291 | 0 |
| x7 | ~ | x23 | -0.00418637 | 0.0214885 | -0.194819 | 0.845535 |
| x88 | ~ | x6 | 1.46973 | 0.0301277 | 48.7835 | 0 |
| x64 | ~ | x9 | 0.107246 | 0.0243442 | 4.40542 | 1.0558e-05 |
| x61 | ~ | x84 | 0.0115246 | 0.00556868 | 2.06954 | 0.0384957 |
| x58 | ~ | x17 | -0.163472 | 0.0237576 | -6.88082 | 5.9508e-12 |
| x23 | ~ | x59 | -0.236768 | 0.0679229 | -3.48583 | 0.000490618 |
| x23 | ~ | x94 | -13.8165 | 0.703749 | -19.6327 | 0 |
| x9 | ~ | x78 | 1.44856 | 0.0309672 | 46.7773 | 0 |
| x9 | ~ | x36 | -7.74544 | 0.0324563 | -238.642 | 0 |
| x17 | ~ | x79 | -0.0515049 | 0.014378 | -3.58219 | 0.00034072 |
| x59 | ~ | x55 | -0.0778839 | 0.0232169 | -3.35463 | 0.000794725 |
| x79 | ~ | x77 | -0.304188 | 0.254101 | -1.19711 | 0.231262 |
| x79 | ~ | x26 | 0.0846485 | 0.0518044 | 1.634 | 0.102258 |
| x55 | ~ | x20 | 0.698485 | 0.0441722 | 15.8128 | 0 |
| x55 | ~ | x6 | 12.5385 | 0.106001 | 118.286 | 0 |
| x55 | ~ | x33 | 2.56006 | 0.110402 | 23.1886 | 0 |
| x77 | ~ | x41 | -1.16488 | 0.0334789 | -34.7944 | 0 |
| x26 | ~ | x84 | 0.0692953 | 0.0232002 | 2.98685 | 0.00281871 |
| x20 | ~ | x72 | 0.0218056 | 0.00738758 | 2.95166 | 0.00316072 |
| x84 | ~ | x98 | 10.0285 | 0.04488 | 223.452 | 0 |
| x84 | ~ | x72 | -0.00825119 | 0.00430194 | -1.91802 | 0.0551089 |
| x72 | ~ | x82 | 4.26913 | 0.171663 | 24.8693 | 0 |
| x82 | ~ | x8 | -1.177 | 0.0307035 | -38.3345 | 0 |
| x42 | ~ | x90 | 0.174795 | 0.0225525 | 7.75057 | 9.10383e-15 |
| x42 | ~ | x0 | -0.193506 | 0.0288042 | -6.71797 | 1.84273e-11 |
| x27 | ~ | x87 | -0.0558945 | 0.0155562 | -3.59307 | 0.000326803 |
| x27 | ~ | x32 | -6.44944 | 0.11109 | -58.0559 | 0 |
| x12 | ~ | x0 | -1.77993 | 0.19173 | -9.28355 | 0 |
| x12 | ~ | x3 | 0.00869965 | 0.0739928 | 0.117574 | 0.906405 |
| x12 | ~ | x84 | 0.0798634 | 0.0436139 | 1.83115 | 0.0670785 |
| x74 | ~ | x83 | -2.04682 | 0.0658169 | -31.0987 | 0 |
| x74 | ~ | x76 | -12.8538 | 0.0962878 | -133.494 | 0 |
| x74 | ~ | x50 | 0.00363404 | 0.00693041 | 0.524361 | 0.600028 |
| x63 | ~ | x51 | -3.6804 | 0.180348 | -20.4072 | 0 |
| x63 | ~ | x44 | 0.0674865 | 0.0257225 | 2.62364 | 0.00869956 |
| x63 | ~ | x77 | -2.86011 | 0.121978 | -23.4479 | 0 |
| x63 | ~ | x28 | 8.20213 | 0.189614 | 43.2569 | 0 |
| x93 | ~ | x90 | 1.16241 | 0.126722 | 9.1729 | 0 |
| x30 | ~ | x97 | -5.06266 | 0.0305478 | -165.729 | 0 |
| x14 | ~ | x44 | 0.0582553 | 0.0131839 | 4.41866 | 9.93166e-06 |
| x14 | ~ | x54 | 0.00383844 | 0.00602918 | 0.636644 | 0.524357 |
| x81 | ~ | x6 | 9.14242 | 0.0294053 | 310.914 | 0 |
| x38 | ~ | x23 | -0.000191771 | 0.00135337 | -0.141698 | 0.887318 |
| x38 | ~ | x49 | -6.41838 | 0.0346079 | -185.46 | 0 |
| x29 | ~ | x9 | -0.0859488 | 0.0194305 | -4.42339 | 9.71627e-06 |
| x62 | ~ | x2 | -0.130319 | 0.0391703 | -3.32698 | 0.000877925 |
| x75 | ~ | x57 | -0.0106346 | 0.007978 | -1.33299 | 0.182536 |
| x31 | ~ | x65 | 2.29004 | 0.0308554 | 74.2186 | 0 |
| x71 | ~ | x15 | -9.09815 | 0.0328499 | -276.961 | 0 |
| x68 | ~ | x87 | -0.160788 | 0.046685 | -3.44412 | 0.00057293 |
| x68 | ~ | x65 | -6.56613 | 0.332406 | -19.7533 | 0 |
| x43 | ~ | x1 | -0.0262926 | 0.00846914 | -3.10452 | 0.0019059 |
| x92 | ~ | x1 | -0.136973 | 0.0333884 | -4.10242 | 4.08843e-05 |
| x60 | ~ | x96 | -0.0327555 | 0.0081783 | -4.00517 | 6.19727e-05 |
| x60 | ~ | x37 | -0.0417641 | 0.0155725 | -2.68192 | 0.00732005 |
| x1 | ~~ | x1 | 15.0869 | 0.674707 | 22.3607 | 0 |
| x72 | ~~ | x72 | 69.2405 | 3.09653 | 22.3607 | 0 |
| x88 | ~~ | x88 | 0.976494 | 0.0436701 | 22.3607 | 0 |
| x79 | ~~ | x79 | 153.927 | 6.88384 | 22.3607 | 0 |
| x37 | ~~ | x37 | 106.153 | 4.74733 | 22.3607 | 0 |
| x4 | ~~ | x4 | 3.92461 | 0.175514 | 22.3607 | 0 |
| x83 | ~~ | x83 | 1.94489 | 0.0869781 | 22.3607 | 0 |
| x26 | ~~ | x26 | 56.8494 | 2.54238 | 22.3607 | 0 |
| x20 | ~~ | x20 | 6.11607 | 0.273519 | 22.3607 | 0 |
| x10 | ~~ | x10 | 101.017 | 4.51761 | 22.3607 | 0 |
| x57 | ~~ | x57 | 184.031 | 8.2301 | 22.3607 | 0 |
| x54 | ~~ | x54 | 206.392 | 9.23014 | 22.3607 | 0 |
| x84 | ~~ | x84 | 2.07379 | 0.0927426 | 22.3607 | 0 |
| x55 | ~~ | x55 | 12.0374 | 0.53833 | 22.3607 | 0 |
| x77 | ~~ | x77 | 1.0784 | 0.0482277 | 22.3607 | 0 |
| x87 | ~~ | x87 | 5.9201 | 0.264755 | 22.3607 | 0 |
| x0 | ~~ | x0 | 5.43293 | 0.242968 | 22.3607 | 0 |
| x17 | ~~ | x17 | 31.9515 | 1.42891 | 22.3607 | 0 |
| x64 | ~~ | x64 | 37.1091 | 1.65957 | 22.3607 | 0 |
| x47 | ~~ | x47 | 36.3327 | 1.62485 | 22.3607 | 0 |
| x13 | ~~ | x13 | 25.6109 | 1.14536 | 22.3607 | 0 |
| x58 | ~~ | x58 | 18.2656 | 0.816861 | 22.3607 | 0 |
| x70 | ~~ | x70 | 0.940051 | 0.0420404 | 22.3607 | 0 |
| x67 | ~~ | x67 | 335.075 | 14.985 | 22.3607 | 0 |
| x52 | ~~ | x52 | 14.7385 | 0.659124 | 22.3607 | 0 |
| x66 | ~~ | x66 | 36.5782 | 1.63583 | 22.3607 | 0 |
| x9 | ~~ | x9 | 1.03993 | 0.0465069 | 22.3607 | 0 |
| x7 | ~~ | x7 | 316.36 | 14.148 | 22.3607 | 0 |
| x22 | ~~ | x22 | 379.959 | 16.9923 | 22.3607 | 0 |
| x59 | ~~ | x59 | 105.018 | 4.69656 | 22.3607 | 0 |
| x5 | ~~ | x5 | 0.996211 | 0.0445519 | 22.3607 | 0 |
| x50 | ~~ | x50 | 0.992855 | 0.0444018 | 22.3607 | 0 |
| x25 | ~~ | x25 | 27.9987 | 1.25214 | 22.3607 | 0 |
| x44 | ~~ | x44 | 52.9128 | 2.36633 | 22.3607 | 0 |
| x3 | ~~ | x3 | 36.2197 | 1.6198 | 22.3607 | 0 |
| x89 | ~~ | x89 | 10.614 | 0.474673 | 22.3607 | 0 |
| x61 | ~~ | x61 | 3.27529 | 0.146475 | 22.3607 | 0 |
| x23 | ~~ | x23 | 489.941 | 21.9108 | 22.3607 | 0 |
| x90 | ~~ | x90 | 0.952558 | 0.0425997 | 22.3607 | 0 |
| x2 | ~~ | x2 | 69.6932 | 3.11677 | 22.3607 | 0 |
| x82 | ~~ | x82 | 0.951468 | 0.0425509 | 22.3607 | 0 |
| x96 | ~~ | x96 | 324.701 | 14.5211 | 22.3607 | 0 |
| x19 | ~~ | x19 | 123.953 | 5.54333 | 22.3607 | 0 |
| x12 | ~~ | x12 | 200.904 | 8.98471 | 22.3607 | 0 |
| x74 | ~~ | x74 | 10.1453 | 0.453714 | 22.3607 | 0 |
| x27 | ~~ | x27 | 12.306 | 0.55034 | 22.3607 | 0 |
| x31 | ~~ | x31 | 0.956058 | 0.0427562 | 22.3607 | 0 |
| x62 | ~~ | x62 | 107.466 | 4.80601 | 22.3607 | 0 |
| x81 | ~~ | x81 | 0.930211 | 0.0416003 | 22.3607 | 0 |
| x29 | ~~ | x29 | 23.6406 | 1.05724 | 22.3607 | 0 |
| x60 | ~~ | x60 | 25.8522 | 1.15614 | 22.3607 | 0 |
| x92 | ~~ | x92 | 74.8386 | 3.34688 | 22.3607 | 0 |
| x14 | ~~ | x14 | 9.31632 | 0.416639 | 22.3607 | 0 |
| x42 | ~~ | x42 | 4.53438 | 0.202784 | 22.3607 | 0 |
| x38 | ~~ | x38 | 1.25457 | 0.0561059 | 22.3607 | 0 |
| x75 | ~~ | x75 | 39.0543 | 1.74656 | 22.3607 | 0 |
| x93 | ~~ | x93 | 143.164 | 6.40249 | 22.3607 | 0 |
| x68 | ~~ | x68 | 110.951 | 4.96189 | 22.3607 | 0 |
| x71 | ~~ | x71 | 1.04381 | 0.0466808 | 22.3607 | 0 |
| x43 | ~~ | x43 | 4.8152 | 0.215342 | 22.3607 | 0 |
| x63 | ~~ | x63 | 35.463 | 1.58595 | 22.3607 | 0 |
| x30 | ~~ | x30 | 1.01283 | 0.0452951 | 22.3607 | 0 |

Train Rank SEM

Now let’s look at the results on the ranks.
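Concretely, the rank-based fit is the same semopy model refit to the column-wise dense ranks of the data. Here is a minimal sketch, where df and model are the data frame and model constructed in the Python Code section below:

import pandas as pd
from scipy.stats import rankdata

# Replace each column with its dense ranks, then refit the same model
df_rank = pd.DataFrame(
    rankdata(df.to_numpy(), method='dense', axis=0), columns=df.columns
)
rank_results = model.fit(df_rank)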

| lval | op | rval | Estimate | Std. Err | z-value | p-value |
|------|----|------|----------|----------|---------|---------|
| x0 | ~ | x5 | -0.0506852 | 0.319554 | -0.158612 | 0.873974 |
| x3 | ~ | x44 | -0.090607 | 0.035594 | -2.54557 | 0.01091 |
| x3 | ~ | x89 | 0.0528289 | 0.178962 | 0.295197 | 0.767844 |
| x3 | ~ | x37 | -0.0489985 | 0.0467657 | -1.04774 | 0.294757 |
| x50 | ~ | x16 | -0.997288 | 0.642918 | -1.55119 | 0.120856 |
| x90 | ~ | x80 | -0.935912 | 2.32397 | -0.40272 | 0.687154 |
| x2 | ~ | x67 | -0.0623613 | 0.0343197 | -1.81707 | 0.0692058 |
| x2 | ~ | x10 | -0.0366353 | 0.0363691 | -1.00732 | 0.313781 |
| x57 | ~ | x65 | -0.376144 | 9.81514 | -0.0383228 | 0.96943 |
| x57 | ~ | x67 | 0.808757 | 0.028215 | 28.6641 | 0 |
| x57 | ~ | x7 | 0.0879096 | 0.0286485 | 3.06856 | 0.00215096 |
| x57 | ~ | x26 | -0.832015 | 0.0240271 | -34.6282 | 0 |
| x1 | ~ | x66 | -0.084396 | 0.0165049 | -5.11339 | 3.16423e-07 |
| x1 | ~ | x40 | -0.860226 | 7.79065 | -0.110418 | 0.912078 |
| x96 | ~ | x4 | 0.371961 | 0.0457266 | 8.13446 | 4.44089e-16 |
| x96 | ~ | x55 | 0.00506802 | 0.0736767 | 0.0687872 | 0.945159 |
| x5 | ~ | x41 | -0.995247 | 0.838395 | -1.18709 | 0.235194 |
| x44 | ~ | x55 | -0.110851 | 0.0950145 | -1.16668 | 0.24334 |
| x89 | ~ | x15 | -0.961625 | 2.68861 | -0.357666 | 0.720593 |
| x89 | ~ | x55 | 0.0254901 | 0.0188921 | 1.34925 | 0.177257 |
| x37 | ~ | x54 | -0.0709773 | 0.0357109 | -1.98755 | 0.0468613 |
| x10 | ~ | x25 | 0.164821 | 0.021529 | 7.65575 | 1.93179e-14 |
| x10 | ~ | x52 | -0.0448535 | 0.0272714 | -1.64471 | 0.10003 |
| x10 | ~ | x16 | 0.524877 | 10.3321 | 0.0508006 | 0.959484 |
| x67 | ~ | x84 | 0.0815371 | 0.273401 | 0.298233 | 0.765525 |
| x66 | ~ | x47 | 0.157112 | 0.0312812 | 5.02256 | 5.09866e-07 |
| x66 | ~ | x22 | -0.0479852 | 0.0520462 | -0.921973 | 0.356543 |
| x4 | ~ | x77 | -0.590027 | 0.0522867 | -11.2845 | 0 |
| x54 | ~ | x87 | 0.0743098 | 0.044113 | 1.68453 | 0.0920785 |
| x54 | ~ | x85 | 0.415275 | 9.12442 | 0.0455125 | 0.963699 |
| x25 | ~ | x19 | 0.0915282 | 0.0455169 | 2.01086 | 0.0443403 |
| x52 | ~ | x83 | 0.681846 | 0.0406469 | 16.7749 | 0 |
| x47 | ~ | x13 | 0.133902 | 0.0320736 | 4.17484 | 2.98197e-05 |
| x47 | ~ | x70 | 0.041785 | 0.504127 | 0.0828859 | 0.933942 |
| x22 | ~ | x7 | 0.07196 | 0.0261834 | 2.74831 | 0.00599034 |
| x22 | ~ | x28 | 0.282929 | 9.04255 | 0.0312886 | 0.975039 |
| x87 | ~ | x16 | -0.796425 | 6.55258 | -0.121544 | 0.90326 |
| x87 | ~ | x88 | -0.139354 | 0.0835817 | -1.66728 | 0.0954584 |
| x87 | ~ | x78 | 0.465273 | 6.14948 | 0.0756605 | 0.939689 |
| x87 | ~ | x64 | 0.00794462 | 0.0136198 | 0.583315 | 0.559681 |
| x19 | ~ | x58 | -0.269086 | 0.0193834 | -13.8823 | 0 |
| x83 | ~ | x61 | 0.40993 | 0.0318307 | 12.8784 | 0 |
| x13 | ~ | x58 | -0.250342 | 0.0291525 | -8.5873 | 0 |
| x70 | ~ | x97 | -0.937119 | 0.895642 | -1.04631 | 0.295418 |
| x70 | ~ | x80 | -0.315995 | 0.969359 | -0.325984 | 0.744437 |
| x7 | ~ | x23 | -0.00693939 | 0.0356268 | -0.19478 | 0.845565 |
| x88 | ~ | x6 | 0.823842 | 2.33363 | 0.35303 | 0.724066 |
| x64 | ~ | x9 | 0.10536 | 0.360136 | 0.292557 | 0.769861 |
| x61 | ~ | x84 | 0.0647399 | 0.183626 | 0.352563 | 0.724416 |
| x58 | ~ | x17 | -0.190366 | 0.0315347 | -6.03673 | 1.57271e-09 |
| x23 | ~ | x59 | -0.118138 | 0.0298934 | -3.95198 | 7.75065e-05 |
| x23 | ~ | x94 | -0.482748 | 9.61376 | -0.0502143 | 0.959952 |
| x9 | ~ | x78 | 0.184565 | 1.25217 | 0.147397 | 0.882819 |
| x9 | ~ | x36 | -0.969891 | 1.31238 | -0.739033 | 0.459887 |
| x17 | ~ | x79 | -0.108698 | 0.0513214 | -2.11798 | 0.0341769 |
| x59 | ~ | x55 | -0.100981 | 0.0721987 | -1.39866 | 0.161916 |
| x79 | ~ | x77 | -0.0331926 | 0.072199 | -0.459737 | 0.645705 |
| x79 | ~ | x26 | 0.054813 | 0.0226565 | 2.4193 | 0.0155503 |
| x55 | ~ | x20 | 0.122512 | 0.0186636 | 6.56418 | 5.23206e-11 |
| x55 | ~ | x6 | 0.923404 | 4.18694 | 0.220544 | 0.825448 |
| x55 | ~ | x33 | 0.174377 | 4.36078 | 0.0399877 | 0.968103 |
| x77 | ~ | x41 | -0.72968 | 4.14135 | -0.176194 | 0.860142 |
| x26 | ~ | x84 | 0.084089 | 0.321057 | 0.261913 | 0.793389 |
| x20 | ~ | x72 | 0.0890476 | 0.0201729 | 4.41421 | 1.01379e-05 |
| x84 | ~ | x98 | 0.989211 | 1.25365 | 0.789063 | 0.430075 |
| x84 | ~ | x72 | -0.0069965 | 0.00352878 | -1.9827 | 0.0474011 |
| x72 | ~ | x82 | 0.599442 | 0.094153 | 6.36668 | 1.93166e-10 |
| x82 | ~ | x8 | -0.752035 | 3.73609 | -0.201289 | 0.840472 |
| x42 | ~ | x90 | 0.223544 | 0.0999267 | 2.23708 | 0.0252814 |
| x42 | ~ | x0 | -0.187517 | 0.026882 | -6.97555 | 3.04667e-12 |
| x27 | ~ | x87 | -0.054163 | 0.0249817 | -2.1681 | 0.0301507 |
| x27 | ~ | x32 | -0.871473 | 5.06653 | -0.172006 | 0.863433 |
| x12 | ~ | x0 | -0.283938 | 0.0358551 | -7.91903 | 2.44249e-15 |
| x12 | ~ | x3 | -0.0053715 | 0.0198303 | -0.270873 | 0.786488 |
| x12 | ~ | x84 | 0.0369434 | 0.233865 | 0.157969 | 0.874481 |
| x74 | ~ | x83 | -0.203575 | 0.00701163 | -29.034 | 0 |
| x74 | ~ | x76 | -0.932689 | 1.70561 | -0.546837 | 0.584491 |
| x74 | ~ | x50 | 0.00199484 | 0.0896919 | 0.022241 | 0.982256 |
| x63 | ~ | x51 | -0.308836 | 7.64331 | -0.0404061 | 0.967769 |
| x63 | ~ | x44 | 0.042995 | 0.0189662 | 2.26693 | 0.0233946 |
| x63 | ~ | x77 | -0.364178 | 0.062133 | -5.86126 | 4.59371e-09 |
| x63 | ~ | x28 | 0.689741 | 8.03641 | 0.085827 | 0.931604 |
| x93 | ~ | x90 | 0.269947 | 0.130042 | 2.07584 | 0.0379089 |
| x30 | ~ | x97 | -0.980854 | 1.60828 | -0.609879 | 0.541942 |
| x14 | ~ | x44 | 0.141669 | 0.0227724 | 6.22108 | 4.9375e-10 |
| x14 | ~ | x54 | 0.0268343 | 0.0338546 | 0.792634 | 0.427991 |
| x81 | ~ | x6 | 0.99423 | 0.896012 | 1.10962 | 0.267165 |
| x38 | ~ | x23 | 0.00220938 | 0.00520095 | 0.424803 | 0.67098 |
| x38 | ~ | x49 | -0.982967 | 1.54838 | -0.634837 | 0.525535 |
| x29 | ~ | x9 | -0.0979993 | 0.331924 | -0.295246 | 0.767806 |
| x62 | ~ | x2 | -0.0934138 | 0.0264644 | -3.52979 | 0.000415892 |
| x75 | ~ | x57 | -0.0275633 | 0.0272089 | -1.01302 | 0.311049 |
| x31 | ~ | x65 | 0.911065 | 1.97102 | 0.46223 | 0.643916 |
| x71 | ~ | x15 | -0.992771 | 0.986135 | -1.00673 | 0.314065 |
| x68 | ~ | x87 | -0.0876041 | 0.0452074 | -1.93783 | 0.0526445 |
| x68 | ~ | x65 | -0.512031 | 9.14152 | -0.0560116 | 0.955333 |
| x43 | ~ | x1 | -0.0896154 | 0.0247739 | -3.61733 | 0.000297661 |
| x92 | ~ | x1 | -0.127711 | 0.045804 | -2.7882 | 0.00530012 |
| x60 | ~ | x96 | -0.122361 | 0.0423403 | -2.88994 | 0.00385318 |
| x60 | ~ | x37 | -0.0640426 | 0.0445107 | -1.43881 | 0.150203 |
| x1 | ~~ | x1 | 61647.5 | 2756.96 | 22.3607 | 0 |
| x72 | ~~ | x72 | 124894 | 5585.41 | 22.3607 | 0 |
| x88 | ~~ | x88 | 5858.72 | 262.01 | 22.3607 | 0 |
| x79 | ~~ | x79 | 86020 | 3846.93 | 22.3607 | 0 |
| x37 | ~~ | x37 | 102190 | 4570.06 | 22.3607 | 0 |
| x4 | ~~ | x4 | 45114.9 | 2017.6 | 22.3607 | 0 |
| x83 | ~~ | x83 | 55543.9 | 2484 | 22.3607 | 0 |
| x26 | ~~ | x26 | 167565 | 7493.74 | 22.3607 | 0 |
| x20 | ~~ | x20 | 52885.3 | 2365.1 | 22.3607 | 0 |
| x10 | ~~ | x10 | 101961 | 4559.83 | 22.3607 | 0 |
| x57 | ~~ | x57 | 96742.2 | 4326.44 | 22.3607 | 0 |
| x54 | ~~ | x54 | 79904.8 | 3573.45 | 22.3607 | 0 |
| x84 | ~~ | x84 | 1618.25 | 72.3703 | 22.3607 | 0 |
| x55 | ~~ | x55 | 18780.6 | 839.893 | 22.3607 | 0 |
| x77 | ~~ | x77 | 16501.5 | 737.97 | 22.3607 | 0 |
| x87 | ~~ | x87 | 40933.5 | 1830.6 | 22.3607 | 0 |
| x0 | ~~ | x0 | 69157.1 | 3092.8 | 22.3607 | 0 |
| x17 | ~~ | x17 | 227941 | 10193.8 | 22.3607 | 0 |
| x64 | ~~ | x64 | 220649 | 9867.73 | 22.3607 | 0 |
| x47 | ~~ | x47 | 221531 | 9907.18 | 22.3607 | 0 |
| x13 | ~~ | x13 | 200558 | 8969.24 | 22.3607 | 0 |
| x58 | ~~ | x58 | 227689 | 10182.6 | 22.3607 | 0 |
| x70 | ~~ | x70 | 870.628 | 38.9357 | 22.3607 | 0 |
| x67 | ~~ | x67 | 121512 | 5434.17 | 22.3607 | 0 |
| x52 | ~~ | x52 | 106988 | 4784.65 | 22.3607 | 0 |
| x66 | ~~ | x66 | 220552 | 9863.37 | 22.3607 | 0 |
| x9 | ~~ | x9 | 1700.29 | 76.0391 | 22.3607 | 0 |
| x7 | ~~ | x7 | 117868 | 5271.21 | 22.3607 | 0 |
| x22 | ~~ | x22 | 80809.7 | 3613.92 | 22.3607 | 0 |
| x59 | ~~ | x59 | 102120 | 4566.94 | 22.3607 | 0 |
| x5 | ~~ | x5 | 676.295 | 30.2448 | 22.3607 | 0 |
| x50 | ~~ | x50 | 394.791 | 17.6556 | 22.3607 | 0 |
| x25 | ~~ | x25 | 219095 | 9798.21 | 22.3607 | 0 |
| x44 | ~~ | x44 | 176861 | 7909.45 | 22.3607 | 0 |
| x3 | ~~ | x3 | 224375 | 10034.4 | 22.3607 | 0 |
| x89 | ~~ | x89 | 6992.15 | 312.698 | 22.3607 | 0 |
| x61 | ~~ | x61 | 54813.7 | 2451.34 | 22.3607 | 0 |
| x23 | ~~ | x23 | 91434.3 | 4089.07 | 22.3607 | 0 |
| x90 | ~~ | x90 | 5004.22 | 223.796 | 22.3607 | 0 |
| x2 | ~~ | x2 | 143134 | 6401.15 | 22.3607 | 0 |
| x82 | ~~ | x82 | 14088.2 | 630.042 | 22.3607 | 0 |
| x96 | ~~ | x96 | 106344 | 4755.84 | 22.3607 | 0 |
| x19 | ~~ | x19 | 88664.1 | 3965.18 | 22.3607 | 0 |
| x12 | ~~ | x12 | 88909.9 | 3976.17 | 22.3607 | 0 |
| x74 | ~~ | x74 | 3183.6 | 142.375 | 22.3607 | 0 |
| x27 | ~~ | x27 | 25626.2 | 1146.04 | 22.3607 | 0 |
| x31 | ~~ | x31 | 3901.27 | 174.47 | 22.3607 | 0 |
| x62 | ~~ | x62 | 100679 | 4502.49 | 22.3607 | 0 |
| x81 | ~~ | x81 | 863.707 | 38.6262 | 22.3607 | 0 |
| x29 | ~~ | x29 | 187434 | 8382.29 | 22.3607 | 0 |
| x60 | ~~ | x60 | 203258 | 9089.98 | 22.3607 | 0 |
| x92 | ~~ | x92 | 132720 | 5935.43 | 22.3607 | 0 |
| x14 | ~~ | x14 | 91841.5 | 4107.28 | 22.3607 | 0 |
| x42 | ~~ | x42 | 49977 | 2235.04 | 22.3607 | 0 |
| x38 | ~~ | x38 | 2511.92 | 112.337 | 22.3607 | 0 |
| x75 | ~~ | x75 | 217012 | 9705.05 | 22.3607 | 0 |
| x93 | ~~ | x93 | 84640.2 | 3785.22 | 22.3607 | 0 |
| x68 | ~~ | x68 | 83919 | 3752.97 | 22.3607 | 0 |
| x71 | ~~ | x71 | 940.648 | 42.0671 | 22.3607 | 0 |
| x43 | ~~ | x43 | 38825.7 | 1736.34 | 22.3607 | 0 |
| x63 | ~~ | x63 | 63706.2 | 2849.03 | 22.3607 | 0 |
| x30 | ~~ | x30 | 2807.35 | 125.548 | 22.3607 | 0 |

Python Code

If you want to tinker around with the above example, here is the Python code.

from itertools import product
import random

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
from scipy.stats import rankdata
import semopy

# Utils

def decycle(d):
    '''
    Pseudorandomly remove cycles from a directed graph.

    Changes the graph inplace.

    PARAMETERS
    ----------
    d : networkx.DiGraph
        Directed graph.

    RETURNS
    -------
    None
    '''
    while True:
        if nx.is_directed_acyclic_graph(d):
            break
        first_cycle = nx.cycles.find_cycle(d)
        target_edge = random.sample(first_cycle, 1)[0]
        d.remove_edge(*target_edge)

# Column-wise dense ranks: ties share a rank, and ranks are consecutive integers
rank = lambda x: rankdata(x, method='dense', axis=0)

# Config
n = 100
m = 1000
possible_edges = list(product(range(n), repeat=2))
np.random.seed(2018)
random.seed(2018)  # the edge sampling below uses the `random` module

# Generate a DAG
d = nx.DiGraph()
edges = random.sample(possible_edges, n)
d.add_edges_from(edges)
decycle(d)

# Create Lavaan-style string following DAG
model_str = ''
for node in nx.topological_sort(d):
    node_str = ''
    for i, edge in enumerate(d.out_edges(node)):
        if i == 0:
            node_str += f'x{edge[1]}'
        else:
            node_str += f'+x{edge[1]}'
    if node_str:
        model_str += f'x{node} ~ {node_str}\n'

print(model_str)
model = semopy.Model(model_str)


# Generate data that follows the DAG. Visit nodes in reverse topological
# order so that each regressor (here, an out-neighbour of `node`) has its
# final value before it is used on the right-hand side.
data = np.random.normal(size=m*n).reshape(m,n)
betas = {edge:np.random.normal(scale=10) for edge in d.edges()}
for node in reversed(list(nx.topological_sort(d))):
    for edge in d.out_edges(node):
        data[:,node] += betas[edge] * data[:,edge[1]]

df = pd.DataFrame(data, columns=[f'x{i}' for i in range(n)])
df_rank = pd.DataFrame(rank(data), columns=[f'x{i}' for i in range(n)])

# Train model on data
print('Training model on data')
linear_results = model.fit(df)
linear_inspection = model.inspect()
linear_inspection.to_markdown('sem_linear_inspection.md', index=False)
semopy.semplot(model, 'sem_model.png', plot_covs=True, show=True)

# Train model on ranks of data
print('Training model on rank data')
rank_results = model.fit(df_rank)
rank_inspection = model.inspect()
rank_inspection.to_markdown('sem_rank_inspection.md', index=False)
semopy.semplot(model, 'sem_model_ranks.png', plot_covs=True, show=True)

This first example shows that we can in fact train linear or rank-based SEMs on this data generating process.

Comparisons & Discussion

Now that we can readily generate linear and rank-based SEMs, let’s set up a comparative experiment. We’ll generate 100 data generating processes, then train both the linear and the rank-based SEM with the correctly-specified directed acyclic graph (DAG). In a real problem this DAG might come from a domain-driven hypothesis or from automated causal discovery.

In this experiment I focused on estimating the extent to which the rank-based SEM could preserve the signum (i.e. sign) of the original true parameters. The rank-based $\beta_{\text{rank}}$ will be bounded to $[-1,1]$ just like Spearman’s correlation. But unlike Spearman’s correlation alone, SEMs allow us to model conditional independence.
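As a concrete measure, the signum-preservation score used below is just the fraction of path coefficients whose estimated sign matches the true sign. A minimal sketch with made-up toy arrays:

import numpy as np

# Toy stand-ins for the true path coefficients and the rank-SEM estimates
beta_true = np.array([2.3, -0.7, 0.05, -4.1])
beta_hat_rank = np.array([0.4, -0.2, -0.01, -0.9])

# Fraction of coefficients whose estimated sign matches the true sign
signum_score = (np.sign(beta_hat_rank) == np.sign(beta_true)).mean()
print(signum_score)  # 0.75 here: only the near-zero coefficient flips sign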

Here is a plot from one of the example rank-based SEMs. The horizontal axis gives the true effect size, and the vertical axis is the indicator of whether the rank-based estimate matches the true parameter in sign.

What is evident here is that when the rank-based SEM got the signum wrong, the absolute value of the true parameter also tended to be small. This isn’t particularly surprising, but note also that the misclassifications occurred only a small number of times among the many parameters of the model.

I also noted that the correlation between the signum accuracy of the rank-based SEM and that of the linear SEM was substantial, and that generally both approaches did a pretty good (albeit imperfect) job of recovering the signum. Below is a plot of these accuracies, paired by the true data generating process that was used.

In this particular computational experiment I found that

\[\hat P \left[ \hat P \left[ \mathbb{I} \left[ \operatorname{sign} \left( \hat \beta_{\text{rank}} \right) = \operatorname{sign} \left( \beta \right) \right] \right] > \hat P \left[ \mathbb{I} \left[ \operatorname{sign} \left( \hat \beta_{\text{linear}} \right) = \operatorname{sign} \left( \beta \right) \right] \right] \right] = \frac{39}{100}\]

and

\[\hat P \left[ \hat P \left[ \mathbb{I} \left[ \operatorname{sign} \left( \hat \beta_{\text{linear}} \right) = \operatorname{sign} \left( \beta \right) \right] \right] > \hat P \left[ \mathbb{I} \left[ \operatorname{sign} \left( \hat \beta_{\text{rank}} \right) = \operatorname{sign} \left( \beta \right) \right] \right] \right] = \frac{35}{100}\]

which are close enough that, at 100 replicates, the difference could easily be due to sampling variation. This is not strong evidence of stochastic dominance either way. Notably, the rank-based SEM was not obviously dominated by the linear SEM in this experiment sampling from conditionally normal variables.
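As a rough check on the sampling-variation point, a sign test treats each non-tied data generating process as a coin flip. A sketch with toy score arrays standing in for the per-replicate accuracies produced by the experiment code below:

import numpy as np
from scipy.stats import binomtest

# Paired signum-match accuracies, one pair per data generating process
rank_scores = np.array([0.95, 0.88, 0.91, 0.97])
linear_scores = np.array([0.93, 0.90, 0.91, 0.96])

p_rank_wins = (rank_scores > linear_scores).mean()
p_linear_wins = (linear_scores > rank_scores).mean()

# Sign test on the 74 non-tied replicates: 39 wins versus 35 yields a
# large two-sided p-value, i.e. no evidence that either SEM dominates
print(binomtest(39, n=39 + 35, p=0.5).pvalue)  # ~0.73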

As I discussed in Spearman Correlation Quantifies Comonotonicity, the notion of quantifying which variables go up-and-down together (or in reverse) is valuable to scientists and other practitioners of statistics. With rank-based SEM we can hypothesize patterns of increasing/decreasing relationships among the variables according to a structural causal model.
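In the single-predictor case this is familiar territory: Spearman’s correlation is exactly Pearson’s correlation computed on the ranks, so a rank-based slope carries the same up-and-down-together information. A quick illustration:

import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.exp(x) + rng.normal(scale=0.1, size=500)  # monotone but nonlinear

# Spearman on the raw data equals Pearson on the (average) ranks
print(spearmanr(x, y)[0])
print(pearsonr(rankdata(x), rankdata(y))[0])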

Here is the code to replicate the experiment.

from itertools import product
import random
from time import ctime

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
from scipy.stats import rankdata
import semopy

# Utils


def decycle(d):
    """
    Pseudorandomly remove cycles from a directed graph.

    Changes the graph inplace.

    PARAMETERS
    ----------
    d : networkx.DiGraph
        Directed graph.

    RETURNS
    -------
    None
    """
    while True:
        if nx.is_directed_acyclic_graph(d):
            break
        first_cycle = nx.cycles.find_cycle(d)
        target_edge = random.sample(first_cycle, 1)[0]
        d.remove_edge(*target_edge)


rank = lambda x: rankdata(x, method="dense", axis=0)

# Config
k = 100
n = 100
m = 1000
possible_edges = list(product(range(n), repeat=2))
np.random.seed(2018)
random.seed(2018)  # the edge sampling below uses the `random` module
linear_scores = []
rank_scores = []

for _ in range(k):
    # Generate a DAG
    d = nx.DiGraph()
    edges = random.sample(possible_edges, n)
    d.add_edges_from(edges)
    decycle(d)

    # Create Lavaan-style string following DAG
    model_str = ""
    for node in nx.topological_sort(d):
        node_str = f""
        for i, edge in enumerate(d.out_edges(node)):
            if i == 0:
                node_str += f"x{edge[1]}"
            else:
                node_str += f"+x{edge[1]}"
        if node_str:
            model_str += f"x{node} ~ {node_str}\n"

    # print(model_str)
    model = semopy.Model(model_str)

    # Generate data that follows the DAG (reverse topological order so that
    # each regressor is finalized before it is used)
    data = np.random.normal(size=m * n).reshape(m, n)
    betas = {edge: np.random.normal(scale=10) for edge in d.edges()}
    str_betas = {
        frozenset((f"x{k1}", f"x{k2}")): val for (k1, k2), val in betas.items()
    }
    for node in reversed(list(nx.topological_sort(d))):
        for edge in d.out_edges(node):
            data[:, node] += betas[edge] * data[:, edge[1]]

    df = pd.DataFrame(data, columns=[f"x{i}" for i in range(n)])
    df_rank = pd.DataFrame(rank(data), columns=[f"x{i}" for i in range(n)])

    # Train model on data
    linear_results = model.fit(df)
    linear_inspection = model.inspect()

    linear_inspection = linear_inspection[linear_inspection["op"] == "~"]
    linear_inspection["true_beta"] = linear_inspection[
        linear_inspection["op"] == "~"
    ].apply(lambda row: str_betas[frozenset([row.rval, row.lval])], axis=1)
    linear_score = (
        linear_inspection["Estimate"].apply(np.sign)
        == linear_inspection["true_beta"].apply(np.sign)
    ).mean()
    linear_scores.append(linear_score)

    # Train model on ranks of data
    rank_results = model.fit(df_rank)
    rank_inspection = model.inspect()

    rank_inspection = rank_inspection[rank_inspection["op"] == "~"]
    rank_inspection["true_beta"] = rank_inspection[rank_inspection["op"] == "~"].apply(
        lambda row: str_betas[frozenset([row.rval, row.lval])], axis=1
    )
    rank_score = (
        rank_inspection["Estimate"].apply(np.sign)
        == rank_inspection["true_beta"].apply(np.sign)
    ).mean()
    rank_scores.append(rank_score)

    print(_, ctime(), linear_score, rank_score)

# Some plots
plt.scatter(
    rank_inspection["true_beta"],
    rank_inspection["Estimate"].apply(np.sign)
    == rank_inspection["true_beta"].apply(np.sign),
)
plt.xlabel(r"$\beta$")
plt.ylabel(
    r"$\mathbb{I} \left[ \operatorname{sign}(\hat \beta) = \operatorname{sign}(\beta) \right]$"
)
plt.savefig("rank_sem_equal_signum_vs_param.png", dpi=300, transparent=True)
plt.close()

plt.scatter(rank_scores, linear_scores)
plt.xlabel("Prob. of Signum Match With Rank SEM")
plt.ylabel("Prob. of Signum Match With Linear SEM")
plt.savefig("prob_signum_match_sem.png", transparent=True)
plt.close()

Discussion & Conclusions

I was able to show that rank-based SEM is pretty good at recovering the signum of the true parameters from a family of data generating processes. Because the underlying process had only conditionally linear parameters, recovering the signum of the parameters also meant that the rank-based SEM was telling us about monotonicity. However, the coefficients in the rank-based model would tell us about monotonicity in many cases anyway.

This is still quite a limited example because many data generating processes will not be conditionally Gaussian, but it is a good first approximation. Likewise, the true data generating process that I used is linear in its path functions, but that is not necessary. A further example where the true path functions are non-linear in the parameters would be more interesting. What I expect is that the rank-based SEM will tend to recover the conditionally-dependent monotonicity when that tendency is strong, although not necessarily the signum of the parameters in the path functions, since those may not relate to the monotonicity relationship as simply as they do in the linear SEM.

This post is licensed under CC BY 4.0 by the author.
