<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://marcandre259.github.io/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://marcandre259.github.io/blog/" rel="alternate" type="text/html" /><updated>2025-10-13T19:26:37+00:00</updated><id>https://marcandre259.github.io/blog/feed.xml</id><title type="html">Tapestry of flimsy steps</title><subtitle>My clone repository</subtitle><author><name>Marc-André Chénier</name></author><entry><title type="html">Regression tree algorithm from scratch in C</title><link href="https://marcandre259.github.io/blog/2025/10/09/regression-tree.html" rel="alternate" type="text/html" title="Regression tree algorithm from scratch in C" /><published>2025-10-09T00:00:00+00:00</published><updated>2025-10-09T00:00:00+00:00</updated><id>https://marcandre259.github.io/blog/2025/10/09/regression-tree</id><content type="html" xml:base="https://marcandre259.github.io/blog/2025/10/09/regression-tree.html"><![CDATA[<h2 id="motivatie">Motivation</h2>
<p>In this blog post I describe a basic regression tree algorithm in C. Regression and decision trees are found everywhere in today's machine learning landscape. Even in 2025, despite the rise of pre-trained one-shot transformer models, gradient boosting machines (GBMs) are often still the best methods for producing good predictions (see e.g. W. Rizkallah, Journal of Big Data, 2025). In industry I also regularly come across algorithms such as LightGBM, which are variations on the famous GBM.</p>

<p>Of all the components that make up GBMs and their various implementations, the regression or decision tree is the most important. Beneath other innovations and components such as loss functions, feature binning, gradient descent, and one-side sampling, these trees form the foundation of GBMs and of related algorithms such as the Random Forest.</p>

<p>The difference between regression and decision trees lies in the outcome the analyst is trying to predict. If the outcome is discrete (yes or no, with two or more categories), you are dealing with a decision tree. If the outcome is continuous, house prices for example, you are dealing with a regression tree.</p>

<p>In essence, both trees use the same fitting strategy. In a defined number of steps, they partition a data sample into a number of leaves such that the difference between each leaf's value and its associated outcomes is as small as possible. Now, if you only wanted to reduce the difference between outcomes and sample leaves, the best choice would be to define one leaf per observation. But then the tree model would have no capacity to generalize. So concessions must be made in the direction of generalization. This is done by setting parameters such as the minimum number of samples per leaf, the maximum depth of the tree, or the minimum gain that is allowed.</p>

<h2 id="structuur">Structure</h2>
<p>The idea behind this post is that you can learn for yourself how regression trees work by writing the algorithm in C. So I will provide the necessary functions without putting them together. In my view that is a fun way to learn something. I am certainly no C expert myself, so beware: there will be memory leaks in the code. I pay no attention whatsoever to the leaks. This is definitely not production-ready code.</p>

<p>I will try to convey the intuition and the logic behind each code snippet. Up to a point.</p>

<p>The focus stays on the implementation rather than the intuition, so I recommend that those who need more intuition look for it at <a href="https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html#sphx-glr-auto-examples-tree-plot-tree-regression-py">scikit-learn</a>.</p>

<h2 id="nuttige-concepten">Useful concepts</h2>
<p>The implementation uses a good number of concepts drawn from the fields of programming and statistics.</p>

<p>To find the best splits between leaves, the data must be sorted according to the magnitude of each variable. A sorting algorithm is therefore needed.</p>

<p>To decide whether a split is beneficial, we use so-called gradients, hessians, and a gain formula. In this case, the gradients are more or less centered outcomes, and the hessians form a unit vector.</p>

<p>Recursion will be used frequently. The tree itself is built by a recursive algorithm. To use the fitted trees for making predictions, you also have to traverse them recursively. Tree traversal is also used to compute the feature importance. One could say that a regression tree assigns a path to each observation (or instance, in machine learning parlance), and that this path has to be followed through recursion. So recursion will play a very important role below.</p>

<p>To test the algorithm, I also use the simple random number generator (RNG) from the C standard library. Data simulation is therefore another concept that makes an appearance.</p>

<p>Finally, there are a number of things that come with the use of C itself, namely: pointers and memory addresses, the stack and the heap, structs and data types. I only have a practical understanding of these concepts, so I will explain them without much depth. The most important thing to know is that a dynamic array in C is a pointer to the first memory address of the array's data. The analyst must therefore always keep careful track of how long that array is. Otherwise you get garbage.</p>
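
<p>To make that concrete, here is a small standalone sketch (not part of the tree code, and the helper names are my own): the pointer itself carries no length, so the length has to travel alongside it as a separate argument.</p>

<pre><code class="language-C">#include &lt;stdlib.h&gt;

/* A dynamic array in C is just a pointer to its first element;
   the caller has to carry the length n alongside the pointer,
   because sizeof on the pointer only gives the pointer size. */
float *make_filled_array(size_t n, float fill) {
  float *arr = (float *)malloc(sizeof(float) * n);
  for (size_t i = 0; i &lt; n; i++) {
    arr[i] = fill;
  }
  return arr;
}

/* The length has to come back in as an argument as well. */
float sum_array(const float *arr, size_t n) {
  float total = 0.0f;
  for (size_t i = 0; i &lt; n; i++) {
    total += arr[i];
  }
  return total;
}
</code></pre>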

<p>Also important for recursive functions is the difference between data that lives on the stack and data that lives on the heap. Data on the stack is discarded as soon as execution leaves the current scope of the program. Data on the heap is the opposite: it is kept around until the analyst frees it. So heap data can be modified in one function, and that modified data can then be used within the scope of a second function. That is often necessary with recursion.</p>
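
<p>A minimal sketch of that stack-versus-heap point, with invented names: a counter allocated on the heap survives every return, so each recursive call can keep adding to the same memory.</p>

<pre><code class="language-C">#include &lt;stdlib.h&gt;

/* Allocate a counter on the heap; the memory survives the
   return, unlike a local (stack) variable would. */
int *make_counter(void) {
  int *c = (int *)malloc(sizeof(int));
  *c = 0;
  return c;
}

/* Every recursive call writes into the same heap memory,
   and the caller can still read the result afterwards. */
void count_calls(int depth, int *heap_counter) {
  *heap_counter += 1;
  if (depth &gt; 0) {
    count_calls(depth - 1, heap_counter);
  }
}
</code></pre>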

<h2 id="ingrediënten">Ingredients</h2>

<p>It is good to begin with the small set of ingredients that will be used. The first holds information about a candidate split. It is defined like this:</p>

<pre><code class="language-C">typedef struct SplitInfo {
  float gain; 
  float threshold; 
} SplitInfo;
</code></pre>

<p>The SplitInfo struct holds the details of the best split for each feature. Every split has a threshold: observations whose value on that feature is smaller than the threshold are sent to the so-called left child node, and the rest to the right child node.</p>

<p>The gain is kept for two reasons. First, to compare the best splits of different features against each other. Second, each split's gain can be used to compute feature importance.</p>

<p>The tree itself is a collection of related nodes. Each node has a depth. At depth zero we find the root node, which contains the entire dataset. At depth one, this root node has been split into two child nodes, each of which received its own observations. Then each of those two nodes at depth one is split in two again, and so on. At the final depth sit the leaves. Those nodes produce predictions based on the preceding splits. The regression tree is really a binary tree.</p>

<p>Each node is therefore defined as follows:</p>

<pre><code class="language-C">typedef struct Node {
  int feature_id;
  float threshold;
  float value;
  int depth;
  struct Node *left;
  struct Node *right;
  bool is_leaf;
  float gain;
} Node;
</code></pre>

<p>The left and right Nodes are pointers to child nodes. Because each Node contains its own children, you could call it a recursive object. That is why I use pointers for the left and right nodes. Otherwise a Node object would contain an infinite chain of child and grandchild nodes.</p>

<p>A Node also has a feature_id and a threshold, to record on which feature and at which value the data was split. To know whether the tree has grown large enough, I also keep the node's depth.</p>

<p>If a node is a leaf, it gets a value of <em>true</em> for <em>is_leaf</em>. Otherwise the node gets <em>false</em> there. Leaves also get a <em>value</em>, but no feature_id or threshold, since they have no child nodes.</p>

<p>Finally, I define the <em>RegressionTree</em> struct:</p>

<pre><code class="language-C">typedef struct RegressionTree {
  int max_depth;
  int min_leaf_samples;
  float constant;
  Node *root;

} RegressionTree;
</code></pre>

<p>The tree contains the root node, two complexity constraints, and a constant. From the root node you can traverse the whole tree. The complexity constraints are used to concede something to generality; in short, because I want to be able to make predictions on new data with the model.</p>

<p>The constant is the mean of the outcome values. By subtracting the outcomes from the constant, I obtain the gradients. This may look like a detour, but it will actually help with the computation of the gains.</p>

<p>One last struct that I use is <em>MaskIndices</em>:</p>

<pre><code class="language-C">typedef struct MaskIndices {
  int *left_indices;
  size_t left_n;
  int *right_indices;
  size_t right_n;
} MaskIndices;
</code></pre>

<p>Once you find a split threshold, you will want to know which of the observations must go to the left child node and which to the right. That information is kept in the left and right indices. In C it is awkward to determine the number of elements in a dynamic array, so I also carry <em>left_n</em> and <em>right_n</em> along.</p>

<h2 id="functies">Functions</h2>

<h3 id="main-functie">Main function</h3>
<p>Okay, we finally get to the substance. Let me begin with the end point of this project: the <em>main</em> function. It contains all the important steps, from the declaration of a simulated dataset and a regression tree model to the discovery of the feature importances.</p>

<pre><code class="language-C">// Includes needed by the full program
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;math.h&gt;
#include &lt;time.h&gt;
#include &lt;stdbool.h&gt;

int main() {
  srand(time(NULL));

  size_t n = 1000;
  size_t m = 3;
  float **X = (float **)malloc(sizeof(float *) * m);
  X[0] = (float *)malloc(sizeof(float) * n);
  X[1] = (float *)malloc(sizeof(float) * n);
  X[2] = (float *)malloc(sizeof(float) * n);

  for (int i = 0; i &lt; n; i++) {
    for (int j = 0; j &lt; m; j++) {
      X[j][i] = (float)rand() / RAND_MAX;
    }
  }

  // Assign a y array, with a simple relation to the x values
  float *Y = (float *)malloc(sizeof(float) * n);

  for (int i = 0; i &lt; n; i++) {
    Y[i] = X[0][i] * 0.2 + X[1][i] * 0.5;
  }

  float mean_y = mean(Y, n);
</code></pre>

<p>To begin, I create a dataset with three features X and one outcome Y. The features live inside the double pointer X, through which the three column pointers are accessible. We will be playing with pointers a lot, since they are needed to build dynamic arrays such as X and Y in C.</p>

<p>To give X its values, I draw samples from a uniform distribution. <em>rand</em> returns a value between 0 and <em>RAND_MAX</em>, so dividing by RAND_MAX yields a value between 0 and 1. Good enough for testing.</p>
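
<p>Isolated as a tiny helper (the function name is my own), that scaling step looks like this:</p>

<pre><code class="language-C">#include &lt;stdlib.h&gt;

/* rand() returns an integer in [0, RAND_MAX]; dividing by
   RAND_MAX maps it into the unit interval [0, 1]. */
float uniform01(void) {
  return (float)rand() / (float)RAND_MAX;
}
</code></pre>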

<p>Y depends entirely on the first and second columns of X. So when I look at the feature importances, I should see an importance of zero for the third column of X.</p>

<pre><code class="language-C">  RegressionTree *reg_tree = malloc(sizeof(RegressionTree));

  reg_tree-&gt;max_depth = 3;
  reg_tree-&gt;min_leaf_samples = 5;


  fit(reg_tree, X, Y, n, m);

  float *predictions = predict(reg_tree, X, n, m);
</code></pre>

<p>Next I declare a <em>RegressionTree</em> on the heap, so that I can fit it in a moment with <em>fit</em>. <em>fit</em> is the most important part of this program and is responsible for splitting X into a number of leaves.</p>

<p><em>predict</em> then returns a prediction for each observation in X. How good is the average prediction? That can be checked with the mean squared error (MSE).</p>

<pre><code class="language-C">  float mse_model = mse_compute(Y, predictions, n);

  printf("MSE model: %.3f\n", mse_model);

  float *mean_y_vector = (float *)malloc(sizeof(float) * n);

  for (int i = 0; i &lt; n; i++) {
    mean_y_vector[i] = mean_y;
  }

  float mse_null = mse_compute(Y, mean_y_vector, n);

  printf("MSE null: %.3f\n", mse_null);
</code></pre>

<p>I evaluate by comparing the MSE of the tree model with that of a so-called null model. In this case the null model is the mean of all Y values; in other words, a tree with a single node.</p>

<pre><code class="language-C">  float *feature_importances = compute_feature_importance(reg_tree, m);

  for (int i = 0; i &lt; m; i++) {
    printf("Feat importance %d: %.3f\n", i, feature_importances[i]);
  }
</code></pre>

<p>Compared with other machine learning or AI methods, trees are fairly transparent. By summing the gain of each split, you get a readout of which variables are the most important according to the model.</p>

<p>Now that we have taken a bird's-eye view of each step of the program, you can look at how each part of it is built.</p>

<h2 id="sorteren">Sorting</h2>
<p>Sorting the observations is necessary to be able to compute a gain for each threshold. Because I want to use the ordering of one variable to sort several arrays, I use a function that returns the sorting index rather than the sorted input array <em>arr</em>.</p>

<pre><code class="language-C">size_t *arg_sort(float *arr, size_t n) {
  float *arr_copy = (float *)malloc(sizeof(float) * n);
  memcpy(arr_copy, arr, sizeof(float) * n);

  size_t *index_arr = (size_t *)malloc(sizeof(size_t) * n);
  for (int i = 0; i &lt; n; i++) {
    index_arr[i] = i;
  }

  float next_value;
  size_t next_index;

  for (int i = 0; i &lt; n; i++) {
    for (int j = 0; j &lt; (n - i - 1); j++) {
      if (arr_copy[j] &gt; arr_copy[j+1]) {
        next_value = arr_copy[j + 1];
        next_index = index_arr[j + 1];

        arr_copy[j + 1] = arr_copy[j];
        index_arr[j + 1] = index_arr[j];

        arr_copy[j] = next_value;
        index_arr[j] = next_index;
      }
    }
  }

  return index_arr;
}
</code></pre>

<p>For the sorting I use the bubble sort algorithm, because I find it easy to remember. C has its own sorting function in the standard library, but I do not believe it, or any alternative function, returns the sorting indices.</p>

<p>Another detail is the use of <em>memcpy</em>. I do this so that I do not accidentally modify the data behind the <em>arr</em> dynamic array.</p>

<p>Once I have the sorting indices, another function, <em>reorder</em>, has to do the actual sorting.</p>

<pre><code class="language-C">float *reorder(float *arr, size_t *reorder_indices, size_t n) {
  float *reordered_arr = (float *)malloc(sizeof(float) * n);
  for (int i = 0; i &lt; n; i++) {
    reordered_arr[i] = arr[reorder_indices[i]];
  }

  return reordered_arr;
}
</code></pre>
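
<p>To see what these two functions buy us, here is a compact, self-contained variant of the same idea (my own helper names): one feature's sort order can drag any number of companion arrays, such as gradients and hessians, along with it.</p>

<pre><code class="language-C">#include &lt;stdlib.h&gt;

/* Return the permutation that sorts arr ascending, by bubble
   sorting the indices while comparing the underlying values. */
size_t *sorting_indices(const float *arr, size_t n) {
  size_t *idx = (size_t *)malloc(sizeof(size_t) * n);
  for (size_t i = 0; i &lt; n; i++) idx[i] = i;
  for (size_t i = 0; i &lt; n; i++) {
    for (size_t j = 0; j + 1 &lt; n - i; j++) {
      if (arr[idx[j]] &gt; arr[idx[j + 1]]) {
        size_t tmp = idx[j];
        idx[j] = idx[j + 1];
        idx[j + 1] = tmp;
      }
    }
  }
  return idx;
}

/* Apply the permutation to any array of the same length. */
float *apply_indices(const float *arr, const size_t *idx, size_t n) {
  float *out = (float *)malloc(sizeof(float) * n);
  for (size_t i = 0; i &lt; n; i++) out[i] = arr[idx[i]];
  return out;
}
</code></pre>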

<h2 id="een-beetje-rekenkunde">A bit of arithmetic</h2>

<p>Throughout the program there are a number of small functions responsible for mathematical operations. You have already seen one:</p>

<pre><code class="language-C">float mse_compute(float *actuals, float *predictions, size_t n) {
  float sse = 0;
  for (int i = 0; i &lt; n; i++) {
    sse += pow(actuals[i] - predictions[i], 2.0);
  }

  return sse/(float)n;
}
</code></pre>

<p><em>mse_compute</em> is one of the simpler functions in the program, and for a beginner it can be a fun way to play with C a little. Otherwise it is a very ordinary function and needs no explanation.</p>

<p>At the same level sits a useful <em>mean</em> function:</p>

<pre><code class="language-C">float mean(float *arr, size_t n) {
  float sum = 0;
  for (int i = 0; i &lt; n; i++) {
    sum += arr[i];
  }
  return sum/(float)n;
}
</code></pre>

<p>Now I can turn to the two interesting computation functions: <em>compute_gain</em> and <em>compute_leaf_value</em>. Let me begin with <em>compute_leaf_value</em>.</p>

<pre><code class="language-C">float compute_leaf_value(float G_sum, float H_sum) {
  return -G_sum/H_sum;
}
</code></pre>

<p>Here the value of a leaf is computed. <em>G_sum</em> is the sum of the gradients in the leaf. Briefly put, a gradient here is the difference between the mean of Y and a Y sample. So if you have, say, two Y samples in a leaf, the gradients are $\mathrm{mean}(Y) - Y_1$ and $\mathrm{mean}(Y) - Y_2$. Remember that the hessian here always has the value 1. <em>compute_leaf_value</em> then simply returns the negative mean gradient of the leaf.</p>

<p>Why do I take the negative? Suppose $Y_1$ and $Y_2$ are both larger than $\mathrm{mean}(Y)$. Then the gradients will be negative, but if you multiply the number by -1 you get a leaf value that is positive. And if you add the mean of Y to that leaf value, you get a prediction that sits right between $Y_1$ and $Y_2$.</p>
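
<p>That reasoning can be checked with a small worked example (the numbers are invented): with an overall mean of 2 and leaf samples $Y_1 = 3$ and $Y_2 = 5$, the gradients are -1 and -3, the leaf value is 2, and adding the constant back gives a prediction of 4, right between the two samples.</p>

<pre><code class="language-C">/* Compute a leaf prediction from scratch: sum the gradients
   (mean_y - y_i) and the unit hessians, take the negative mean
   gradient as the leaf value, and add back the constant. */
float leaf_prediction(float mean_y, const float *y_in_leaf, int n_leaf) {
  float G_sum = 0.0f;
  float H_sum = 0.0f;
  for (int i = 0; i &lt; n_leaf; i++) {
    G_sum += mean_y - y_in_leaf[i];
    H_sum += 1.0f;
  }
  return -G_sum / H_sum + mean_y;
}
</code></pre>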

<p>The gradients are thus useful not only for finding node splits, but also for making the prediction. In that sense the gradients are reusable.</p>

<p>In <em>compute_gain</em> I can show how they are used to find a split. This happens by comparing the gains of all possible splits per feature or variable.</p>

<pre><code class="language-C">float compute_gain(float gradient_left, float gradient_right, float hessian_left, float hessian_right) {
  float left_side = pow(gradient_left, 2.0) / hessian_left + pow(gradient_right, 2.0) / hessian_right;
  float right_side = pow(gradient_left + gradient_right, 2.0) / (hessian_left + hessian_right);

  return left_side - right_side;
}
</code></pre>

<p>I will start with the <em>right_side</em> of the formula. It gives the gain if you do not split the node but instead keep it as a leaf. In other words, it is the gain if you do not let the tree grow.</p>

<p>The left_side of the formula looks at the information generated by the split. If you think about the formula, you will see that (up to the hessian denominators) it has a fairly familiar form: $x^2 + y^2 - (x + y)^2$, which can only be positive when $x$ and $y$ have opposite signs. So the goal of the regression tree is to explain the variation around the mean of $Y$ as well as possible, by discovering the dependence of $Y$ on $X$.</p>
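
<p>You can verify the sign behaviour numerically. The sketch below reproduces the gain formula in isolation (the same expression as <em>compute_gain</em>, only renamed): with one unit of hessian on each side, opposite-sign gradient sums produce a positive gain, while two identical children produce a gain of exactly zero.</p>

<pre><code class="language-C">/* Gain of a split: information kept by the two children minus
   the information of the unsplit node. */
float gain(float gL, float gR, float hL, float hR) {
  float split_side = gL * gL / hL + gR * gR / hR;
  float no_split_side = (gL + gR) * (gL + gR) / (hL + hR);
  return split_side - no_split_side;
}
</code></pre>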

<h2 id="de-correcte-fit-zoeken">Finding the right fit</h2>

<p>The fit function is relatively simple. It finds the mean of the outcome Y, defines the gradients and hessians dynamic arrays, allocates memory for the <em>root</em> node, and calls the recursive <em>_split_node</em>.</p>

<pre><code class="language-C">void fit(RegressionTree *reg_tree, float **X, float *Y, size_t n, size_t m) {
  float mean_y = mean(Y, n);

  // Assign mean_y as constant of tree (for predictions)
  reg_tree-&gt;constant = mean_y;

  float *gradients = (float *)malloc(sizeof(float) * n);
  float *hessians = (float *)malloc(sizeof(float) * n);

  for (int i = 0; i &lt; n; i++) {
    gradients[i] = mean_y - Y[i];
    hessians[i] = 1.0; 
  }

  reg_tree-&gt;root = (Node *)malloc(sizeof(Node));
  Node *root = reg_tree-&gt;root;

  root-&gt;depth = 0;
  root-&gt;is_leaf = true;

  _split_node(
    root, 
    reg_tree-&gt;max_depth, 
    reg_tree-&gt;min_leaf_samples, 
    X, 
    gradients, 
    hessians, 
    m, 
    n);
}
</code></pre>

<h2 id="the-art-of-the-split">The art of the split</h2>

<p><em>_split_node</em> is a hefty function with a large number of arguments and several conditional branches. I will describe it briefly and then dump the whole function here. The purpose of that summary is to give some structure to a careful reading of <em>_split_node</em>.</p>

<p>I first check whether a split can be made at all. If the tree has reached its <em>max_depth</em>, or if the split nodes would end up with too few observations, the function returns the leaf value via <em>compute_leaf_value</em> (see the final <em>else</em> branch). The fit of that branch of the tree is then done.</p>

<p>When the opposite happens, that is, when a split can be made, the function traverses all the variables in search of the variable whose split yields the best gain, together with its <em>threshold</em>.</p>

<p>A lot of code is then responsible for preparing the split by sending the correct data to the left and right child nodes. To know which values lie above and which below the threshold, I use that MaskIndices struct. If a value lies below the threshold, I have decided to send it to the left node. What would take two lines of Python sometimes takes 30 lines of C.</p>

<p>Before I call <em>_split_node</em> again, I allocate memory for the left and right child nodes. I also make sure that the depth of each child increases by one.</p>

<p>One difference from other functions that are used to teach recursion is that every call to <em>_split_node</em> returns something. The base case of the recursion is governed by <em>should_split</em>.</p>

<pre><code class="language-C">Node *_split_node(
  Node *root,
  int max_depth,
  int min_leaf_samples,
  float **X, 
  float *gradients,
  float *hessians,
  size_t m,
  size_t n) {
  int depth = root-&gt;depth;

  bool split_decision = should_split(depth, n, max_depth, min_leaf_samples);

  if (split_decision == true) {
    root-&gt;is_leaf = false;

    int best_feature_id = 0;
    float best_threshold = 0;
    float best_gain = 0;

    // Need to first find the best split
    for (int j = 0; j &lt; m; j++) {
      SplitInfo split_info = find_best_split(X[j], gradients, hessians, n);
      float current_gain = split_info.gain;
      if (current_gain &gt; best_gain) {
        best_feature_id = j;
        best_threshold = split_info.threshold;
        best_gain = current_gain;
      }
    }

    if (best_gain &lt;= 0) {
      float G_sum = 0;
      float H_sum = 0;

      for (int i = 0; i &lt; n; i++) {
        G_sum += gradients[i];
        H_sum += hessians[i];
      }

      root-&gt;value = compute_leaf_value(G_sum, H_sum);
      root-&gt;is_leaf = true;
    }
    else { 
      // Saving the gains of each split node to compute feature importance later
      root-&gt;gain = best_gain;
      root-&gt;threshold = best_threshold;
      root-&gt;feature_id = best_feature_id;

      MaskIndices mask_indices = split_on_feature_threshold(
        best_threshold, 
        best_feature_id,
        X, 
        n
      );

      // Make left and right arrays to pass to split_node
      float *left_gradients = (float *)malloc(sizeof(float) * mask_indices.left_n);
      float *left_hessians = (float *)malloc(sizeof(float) * mask_indices.left_n);

      float *right_gradients = (float *)malloc(sizeof(float) * mask_indices.right_n);
      float *right_hessians = (float *)malloc(sizeof(float) * mask_indices.right_n);

      float **X_left = (float **)malloc(sizeof(float *) * m);
      float **X_right = (float **)malloc(sizeof(float *) * m);

      for (int j = 0; j &lt; m; j++) {
        X_left[j] = (float *)malloc(sizeof(float) * mask_indices.left_n);
        X_right[j] = (float *)malloc(sizeof(float) * mask_indices.right_n);
      }

      for (int i = 0; i &lt; mask_indices.left_n; i++) {
        int left_index = mask_indices.left_indices[i];
        left_gradients[i] = gradients[left_index];
        left_hessians[i] = hessians[left_index];

        for (int j = 0; j &lt; m; j++) {
          X_left[j][i] = X[j][left_index];
        }
      }

      for (int i = 0; i &lt; mask_indices.right_n; i++) {
        int right_index = mask_indices.right_indices[i];
        right_gradients[i] = gradients[right_index];
        right_hessians[i] = hessians[right_index];

        for (int j = 0; j &lt; m; j++) {
          X_right[j][i] = X[j][right_index];
        }
      }
      printf("Split: left_n=%zu, right_n=%zu\n", mask_indices.left_n, mask_indices.right_n);

      // Define left and right nodes before recursive calls
      Node *left_node = (Node *)malloc(sizeof(Node));
      left_node-&gt;is_leaf = true;
      left_node-&gt;depth = depth + 1;

      Node *right_node = (Node *)malloc(sizeof(Node));
      right_node-&gt;is_leaf = true;
      right_node-&gt;depth = depth + 1;

      root-&gt;left = _split_node(
        left_node,
        max_depth,
        min_leaf_samples,
        X_left,
        left_gradients,
        left_hessians,
        m,
        mask_indices.left_n
      );

      root-&gt;right = _split_node(
        right_node,
        max_depth,
        min_leaf_samples,
        X_right,
        right_gradients,
        right_hessians,
        m,
        mask_indices.right_n
      );
    }
  }

  // If the split is not acceptable
  else {
    float G_sum = 0;
    float H_sum = 0;

    for (int i = 0; i &lt; n; i++) {
      G_sum += gradients[i];
      H_sum += hessians[i];
    }

    root-&gt;value = compute_leaf_value(G_sum, H_sum);
  }
  // Now got to mask the data according to feature and threshold

  return root;
}
</code></pre>

<p>The first function called inside <em>_split_node</em> is <em>should_split</em>. It checks a couple of conditions to determine whether a split is allowed. If so, there is still a further check on the <em>gain</em>: the <em>gain</em> of the best split found must be positive, otherwise the regression tree gives a better fit without the new split.</p>

<p>I find the <em>min_leaf_samples</em> check interesting. $\frac{\text{n\_samples}}{2}$ is the minimum size of the larger of the left and right child nodes. So <em>min_leaf_samples</em> is the smallest number of observations in that larger child that still permits a split.</p>

<pre><code class="language-C">bool should_split(int depth, int n_samples, int max_depth, int min_leaf_samples) {
  if (
    depth &lt; max_depth &amp;&amp; (n_samples / 2) &gt; min_leaf_samples
  ) {
    return true;
  }
  else {
    return false;
  }
}
</code></pre>

<p>The second function called inside <em>_split_node</em> is <em>find_best_split</em>. It searches an array <em>arr</em> for the threshold that returns the largest gain. The gain takes as its arguments the sums of the left and right gradients. The hessians are used here only to normalize the sums.</p>

<p>To cut down on complicated data masking, I use the <em>arg_sort</em> and <em>reorder</em> functions shown earlier, together with cumulative sums. That way I can obtain the left and right sums easily. I would like to claim this is the more efficient way to do things, but since I have to use a bubble sort, I am not sure. In my opinion it is certainly easier to read.</p>

<pre><code class="language-C">SplitInfo find_best_split(float *arr, float *gradients, float *hessians, size_t n) {
  size_t *index_arr = arg_sort(arr, n);
  float *sorted_arr = reorder(arr, index_arr, n);
  float *sorted_gradients = reorder(gradients, index_arr, n);
  float *sorted_hessians = reorder(hessians, index_arr, n);

  float *cumsum_gradients = (float *)malloc(sizeof(float) * n);
  float *cumsum_hessians = (float *)malloc(sizeof(float) * n);

  cumsum_gradients[0] = sorted_gradients[0];
  cumsum_hessians[0] = sorted_hessians[0];
  // Need also the sums to get the left split gains
  float sum_gradients = sorted_gradients[0];
  float sum_hessians = sorted_hessians[0];

  for (int i = 1; i &lt; n; i++) {
    cumsum_gradients[i] = sorted_gradients[i] + cumsum_gradients[i-1];
    cumsum_hessians[i] = sorted_hessians[i] + cumsum_hessians[i-1];

    sum_gradients += sorted_gradients[i];
    sum_hessians += sorted_hessians[i];
  }

  // Calculate gains for each possible split, and find best gain
  float best_gain = 0;
  float best_threshold = 0;

  // Setting i to max n-2 to avoid illegal splits
  for (int i = 0; i &lt; (n-1); i++) {
    float gradient_left = cumsum_gradients[i];
    float gradient_right = sum_gradients - gradient_left; 

    float hessian_left = cumsum_hessians[i];
    float hessian_right = sum_hessians - hessian_left; 

    float gain = compute_gain(gradient_left, gradient_right, hessian_left, hessian_right);

    if (gain &gt; best_gain) {
      best_gain = gain;
      best_threshold = sorted_arr[i];
    }
  }


  SplitInfo split_info;

  split_info.gain = best_gain;
  split_info.threshold = best_threshold;

  return split_info;
}
</code></pre>

<p>After I have found the best split threshold, I do need masking to actually split the data and send it to the relevant children. I begin by assuming the maximum possible size for each <em>index</em> array, and as soon as I know the actual sizes of the index arrays, I call realloc to release the unneeded memory.</p>

<pre><code class="language-C">MaskIndices split_on_feature_threshold(
  float threshold, 
  int feature_id, 
  float **X, 
  size_t n) {
  int *left_indices = (int *)malloc(sizeof(int) * n);
  int *right_indices = (int *)malloc(sizeof(int) * n);

  int left_cnt = 0;
  int right_cnt = 0;

  for (int i = 0; i &lt; n; i++) {
    if (X[feature_id][i] &lt;= threshold) {
      left_indices[left_cnt] = i;
      left_cnt++;
    }
    else {
      right_indices[right_cnt] = i;
      right_cnt++;
    }
  }

  left_indices = realloc(left_indices, sizeof(int) * left_cnt);
  right_indices = realloc(right_indices, sizeof(int) * right_cnt);

  MaskIndices mask_indices;

  mask_indices.left_indices = left_indices;
  mask_indices.left_n = left_cnt;

  mask_indices.right_indices = right_indices;
  mask_indices.right_n = right_cnt;

  return mask_indices;
}
</code></pre>
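
<p>The partition rule can be exercised on its own with a made-up feature column (a standalone sketch with my own names; only a single column and the left side are shown):</p>

<pre><code class="language-C">#include &lt;stdlib.h&gt;

/* Collect the row indices whose value is at or below the
   threshold; these are the rows the left child receives.
   Returns how many indices were written into left_out. */
size_t partition_left(const float *col, size_t n, float threshold,
                      int *left_out) {
  size_t left_cnt = 0;
  for (size_t i = 0; i &lt; n; i++) {
    if (col[i] &lt;= threshold) {
      left_out[left_cnt] = (int)i;
      left_cnt++;
    }
  }
  return left_cnt;
}
</code></pre>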

<p>If you have made it this far, congratulations: you know how to fit a regression tree in C. Still, you may want to keep reading. I am going to cover two exciting tree traversal functions, which are very useful for squeezing some juice out of the regression tree.</p>

<h2 id="plezante-tree-traversals">Fun tree traversals</h2>

<p>I will begin with my favorite of the two: computing the feature importances. These express the importance of each variable in <em>X</em>. This is done by finding on which variable or feature each split was made, and adding the gain of that split to a <em>feature_importances</em> array of size $1 \times m$, where $m$ is the number of variables.</p>

<pre><code class="language-C">void _feature_importance(Node *node, float *feature_importances) {
  if (node-&gt;is_leaf == true) {
    return;
  }

  else {
    feature_importances[node-&gt;feature_id] += node-&gt;gain;

    _feature_importance(node-&gt;left, feature_importances);
    _feature_importance(node-&gt;right, feature_importances);
  }
}
</code></pre>

<p>This recursive function is used inside the <em>compute_feature_importance</em> function, which does a bit of bookkeeping.</p>
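
<p>The post does not show <em>compute_feature_importance</em> itself, so here is a minimal sketch of what that bookkeeping could look like. The zero-initialised array and the absence of any normalisation are my assumptions, not necessarily the original version, and the struct is a reduced stand-in for the Node defined earlier.</p>

<pre><code class="language-C">#include &lt;stdlib.h&gt;
#include &lt;stdbool.h&gt;

/* Reduced stand-in for Node, keeping only the fields that the
   importance traversal reads. */
typedef struct INode {
  int feature_id;
  float gain;
  bool is_leaf;
  struct INode *left;
  struct INode *right;
} INode;

/* Same traversal as _feature_importance: add each split's gain
   to the entry of the feature it split on. */
static void accumulate_gain(const INode *node, float *importances) {
  if (node-&gt;is_leaf) {
    return;
  }
  importances[node-&gt;feature_id] += node-&gt;gain;
  accumulate_gain(node-&gt;left, importances);
  accumulate_gain(node-&gt;right, importances);
}

/* Sketch of the bookkeeping: allocate a zero-filled array of
   length m, fill it by traversal, return it to the caller. */
float *feature_importance_sketch(const INode *root, size_t m) {
  float *importances = (float *)calloc(m, sizeof(float));
  accumulate_gain(root, importances);
  return importances;
}
</code></pre>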

<p>The second tree traversal is done to make predictions. For each observation with a value on every variable of X, we want to be able to produce a prediction. This happens by finding the way through the splits until the leaf holding a prediction for $x$ is reached.</p>

<pre><code class="language-C">float _predict_single(Node *root, float *x, float constant) {
  // x is a 1 by m array

  if (root-&gt;is_leaf == false) {
    if (x[root-&gt;feature_id] &lt;= root-&gt;threshold) {
      return _predict_single(root-&gt;left, x, constant);
    }
    else {
      return _predict_single(root-&gt;right, x, constant);
    }
  }
  else {
    return root-&gt;value + constant;
  }
}
</code></pre>

<p>In short, this is done by checking, for each split on the way to a leaf, whether $x$ lies below or above the threshold of that split’s feature. Accordingly we go left or right, until <code class="language-plaintext highlighter-rouge">root-&gt;is_leaf</code> is true.</p>

<p>Once again, I do a bit of bookkeeping to make and store predictions for each $x$ within a matrix $X$:</p>

<pre><code class="language-C">float *predict(RegressionTree *reg_tree, float **X, size_t n, size_t m) {
  float *predictions = (float *)malloc(sizeof(float) *n);

  float constant = reg_tree-&gt;constant;

  for (int i = 0; i &lt; n; i++)  {
    float *x = (float *)malloc(sizeof(float) * m);
    // Prep the prediction vector
    for (int j = 0; j &lt; m; j++) {
      x[j] = X[j][i];
    }

    float value = _predict_single(reg_tree-&gt;root, x, constant);

    free(x);

    predictions[i] = value;
  }

  return predictions;
}
</code></pre>

<h2 id="mogelijke-verdere-oefeningen">Possible further exercises</h2>

<p>There are a few interesting exercises you can do to learn even more about regression trees.</p>

<ul>
  <li>
<p>Above we used the whole dataset to find a split. As the number of training observations grows, that becomes less attractive. Modern tools like LightGBM and XGBoost can use histogram binning to handle big data better. In Python this isn’t that hard to do with a few numpy functions such as <em>searchsorted</em> and <em>bincount</em>. You could therefore try it in Python first and then translate the solution to C.</p>

<p>The idea is to sum the gradients and hessians per bin, and then find the split by searching over the bins instead of over the whole dataset. If you still want to find so-called exact splits, you can also look for a solution for duplicate feature values. That can bring efficiency gains too, especially when you work with a small number of distinct values.</p>
  </li>
  <li>
<p>Another exercise could be to turn the regression tree into a random forest or a gradient boosting machine. To some extent that is actually simpler than the exercise above.</p>
  </li>
  <li>
<p>Yet another interesting exercise would be to build a decision tree instead of a regression tree. More involved still would be to properly predict more than two discrete outcomes.</p>
  </li>
</ul>]]></content><author><name>Marc-André Chénier</name></author><summary type="html"><![CDATA[Motivatie In deze blogpost beschrijf ik een basaal regression tree algorithm in C. Regression en decision trees zijn vandaag overal te vinden in het machine learning landschap. Nog in 2025 en ondanks het optreden van pre-trained one-shot transformer models, zijn gradient boosting machines (GBM) vaak de beste methoden om goede voorspellingen te maken (zie bvb. W. Rizkallah, Journal of Big Data, 2025). Ook in de industrie kom ik vaak algorithmen tegen zoals LightGBM die variaties zijn op de beroemde GBM.]]></summary></entry><entry><title type="html">Regression tree algorithm from scratch in C (English)</title><link href="https://marcandre259.github.io/blog/2025/10/09/regression-tree-english.html" rel="alternate" type="text/html" title="Regression tree algorithm from scratch in C (English)" /><published>2025-10-09T00:00:00+00:00</published><updated>2025-10-09T00:00:00+00:00</updated><id>https://marcandre259.github.io/blog/2025/10/09/regression-tree-english</id><content type="html" xml:base="https://marcandre259.github.io/blog/2025/10/09/regression-tree-english.html"><![CDATA[<p>This post is translated from Dutch with Claude 4.5 Sonnet. I did reread and revise the translation.</p>

<h2 id="motivation">Motivation</h2>
<p>In this blog post I describe a basic regression tree algorithm in C. Regression and decision trees are found everywhere in the machine learning landscape today. Even in 2025 and despite the emergence of pre-trained one-shot transformer models, gradient boosting machines (GBM) are often the best methods for making good predictions (see e.g. W. Rizkallah, Journal of Big Data, 2025). In industry I also frequently encounter algorithms like LightGBM that are variations on the famous GBM.</p>

<p>Of all the components that make up GBMs and their various implementations, the regression or decision tree is the most important. On top of other innovations and components like loss functions, feature binning, gradient descent, and one-side sampling, these trees form the foundation of GBMs and related algorithms like the Random Forest.</p>

<p>The difference between regression and decision lies in the outcome the analyst tries to predict. If the outcome is discrete: yes or no, with two or more categories, then you’re dealing with a decision tree. If the outcome is continuous, house prices for example, then you’re dealing with a regression tree.</p>

<p>Essentially both trees use the same fitting strategy. In a defined number of steps, they split a data sample into a number of leaves so that the difference between each leaf’s values and its related outcomes is smallest. Now, if you simply want to reduce the difference between outcomes and sample leaves, you’d find it best to define one leaf per observation. But then the tree model has no generalizing power. So concessions must be made toward generalization. This happens by setting parameters like the minimum number of data per leaf, the maximum depth of the tree, or the minimum gain that’s allowed.</p>

<h2 id="structure">Structure</h2>
<p>The intention behind this post is that you can learn how regression trees work by writing the algorithm yourself in C. So I’m going to provide the necessary functions without putting them together. I think that’s an enjoyable way to learn something. I’m certainly no C expert myself, so watch out because there will be memory leaks in the code. I pay no attention whatsoever to the leaks. This is certainly not production-ready code.</p>

<p>I’ll try to convey the intuition and logic behind each code snippet, to a certain extent. The focus remains on the implementation rather than the intuition, so I recommend that those who need more intuition find it on <a href="https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html#sphx-glr-auto-examples-tree-plot-tree-regression-py">scikit-learn</a>.</p>

<h2 id="useful-concepts">Useful concepts</h2>
<p>The implementation uses a good number of concepts that come from programming and statistics domains.</p>

<p>To find the best splits between leaves, the data must be sorted according to the magnitude of each variable, so a sorting algorithm is needed.</p>

<p>To decide whether a split is beneficial, we use so-called gradients, hessians, and a gain formula. In this case, the gradients are more or less centered outcomes, and the hessians are unit vectors.</p>
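<p>Where those gradients and hessians come from can be stated precisely (my reading of the setup, for the squared loss): with loss $L = \tfrac{1}{2}\sum_i (\hat{y} - y_i)^2$ and the constant prediction $\hat{y} = \mathrm{mean}(Y)$, each observation contributes a gradient $g_i = \partial L_i/\partial \hat{y} = \hat{y} - y_i$, a centered outcome, and a hessian $h_i = \partial^2 L_i/\partial \hat{y}^2 = 1$, which is why the hessians form a unit vector.</p>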

<p>Recursion will be used frequently. The tree itself is built by a recursive algorithm. To be able to use the fitted trees to make predictions you also have to traverse them with recursion. Tree traversal is also used to calculate the feature importance. One can say that a regression tree gives a path to each observation (or instance in machine learning language…), and that this path must be followed through recursion. So recursion will play a very important role below.</p>

<p>To be able to test the algorithm, I also use the simple random number generator (RNG) from the C standard library. Data simulation thus also makes an appearance.</p>

<p>Then there are a number of things associated with using C. Namely: pointers and memory addresses, stack and heap, structs and data types. I only have a practical understanding of these concepts and will therefore try to explain them without depth. The most important thing there is to know that a dynamic array in C is a pointer to the first memory address of the array’s data. The analyst must therefore always keep good track of how long that array is. Otherwise you get garbage.</p>

<p>Also important for recursive functions is the difference between data that sits on the stack and data that sits on the heap. Data on the stack is removed as soon as it’s outside the current scope of the program. Data on the heap is the opposite. It’s kept until it’s released by the analyst. So heap data can be changed in a function and that changed data can be used within the scope of a second function. That’s often necessary with recursion.</p>
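<p>As a tiny illustration of that difference (a standalone sketch, not taken from the post’s code): the loop counter below lives on the stack and vanishes at return, while the malloc’d array lives on the heap and survives for the caller.</p>

<pre><code class="language-C">#include &lt;stdlib.h&gt;

/* Heap data survives the scope that created it: the returned array
   can be used (and must later be freed) by the caller. */
float *make_gradients(const float *y, float mean_y, size_t n) {
  float *g = (float *)malloc(sizeof(float) * n);  // heap allocation
  for (size_t i = 0; i &lt; n; i++) {                // i sits on the stack
    g[i] = mean_y - y[i];
  }
  return g;  // i is gone after this return, but g's data is not
}
</code></pre>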

<h2 id="ingredients">Ingredients</h2>

<p>It’s good to start with the small number of ingredients that will be used. The first keeps info about a possible split. It’s defined like so:</p>

<pre><code class="language-C">typedef struct SplitInfo {
  float gain; 
  float threshold; 
} SplitInfo;
</code></pre>
<p>This SplitInfo struct must retain data about the best split of each feature. Each split has a threshold: observations whose value on that feature is less than or equal to the threshold go to the so-called left child node, and the rest to the right child node.</p>

<p>The gain is retained for two reasons. First to be able to compare the best split of different features. Second, each split gain can be used to calculate feature importance.</p>

<p>The tree itself is a collection of linked nodes. Each node has a depth. At depth zero sits the root node, which contains the entire dataset. At depth one this root node is divided into two child nodes, each getting its own subset of observations. Then each of those two nodes at depth one is again divided in two, and so on. At the final depth sit the leaves. These leaf nodes give predictions based on the preceding splits. The regression tree is thus a binary tree.</p>

<p>Each node is defined like so:</p>

<pre><code class="language-C">typedef struct Node {
  int feature_id;
  float threshold;
  float value;
  int depth;
  struct Node *left;
  struct Node *right;
  bool is_leaf;
  float gain;
} Node;
</code></pre>

<p>The left and right Nodes are pointers to child nodes. Because each Node contains its own children, one can say it’s a recursive type. That’s why I use pointers for the left and right nodes: otherwise a Node would have to physically contain its children and grandchildren, and so would need infinite size.</p>

<p>A Node also has a <em>feature_id</em> and a <em>threshold</em>, to record on which feature and at which value the data was split. To know whether the tree is large enough, I also keep the depth of the node.</p>

<p>If a node is a leaf, <em>is_leaf</em> is set to true; otherwise it’s false. Leaves get a <em>value</em>, but no <em>feature_id</em> or <em>threshold</em>, since they have no child nodes.</p>
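<p>To make the pointer wiring concrete, here is a hand-built tree of depth one (a hypothetical example that restates the Node struct so it compiles on its own; the split and leaf values are made up): a root that splits feature 0 at 0.5, with two leaves.</p>

<pre><code class="language-C">#include &lt;stdlib.h&gt;
#include &lt;stdbool.h&gt;

typedef struct Node {
  int feature_id;
  float threshold;
  float value;
  int depth;
  struct Node *left;
  struct Node *right;
  bool is_leaf;
  float gain;
} Node;

Node *make_leaf(float value, int depth) {
  Node *n = (Node *)malloc(sizeof(Node));
  n-&gt;is_leaf = true;
  n-&gt;value = value;
  n-&gt;depth = depth;
  n-&gt;left = NULL;
  n-&gt;right = NULL;
  return n;
}

/* A stump: observations with x[0] at or below 0.5 go left, the rest right. */
Node *make_stump(void) {
  Node *root = (Node *)malloc(sizeof(Node));
  root-&gt;is_leaf = false;
  root-&gt;depth = 0;
  root-&gt;feature_id = 0;
  root-&gt;threshold = 0.5f;
  root-&gt;gain = 0.0f;
  root-&gt;left = make_leaf(-0.2f, 1);
  root-&gt;right = make_leaf(0.3f, 1);
  return root;
}
</code></pre>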

<p>Finally I define the <em>RegressionTree</em> struct:</p>

<pre><code class="language-C">typedef struct RegressionTree {
  int max_depth;
  int min_leaf_samples;
  float constant;
  Node *root;

} RegressionTree;
</code></pre>

<p>The tree contains the root node, two complexity constraints and a constant. From the root node you can traverse the entire tree. The complexity constraints trade some extra bias for better generalization; in short, because I want the model to make good predictions on new data.</p>

<p>The constant is the mean of the outcome values. By subtracting the outcomes from the constant, I get the gradients. This may seem like a detour but will actually help with the calculation of the gains.</p>

<p>A final struct I use is the <em>MaskIndices</em>:</p>

<pre><code class="language-C">typedef struct MaskIndices {
  int *left_indices;
  size_t left_n;
  int *right_indices;
  size_t right_n;
} MaskIndices;
</code></pre>

<p>Once you find a split threshold, you’ll want to know which observations must go to the left and which to the right child node. That information is kept in the left and right index arrays. In C you can’t recover the number of elements of a dynamic array from its pointer, so I also carry along <em>left_n</em> and <em>right_n</em>.</p>

<h2 id="functions">Functions</h2>

<h3 id="main-function">Main function</h3>

<p>Alright, finally we encounter the content. Let me start with the end of this project: the main function. It contains all important steps, from the declaration of a simulated dataset and a regression tree model to the discovery of the feature importances.</p>

<pre><code class="language-C">int main() {
  srand(time(NULL));

  size_t n = 1000;
  size_t m = 3;
  float **X = (float **)malloc(sizeof(float *) * m);
  X[0] = (float *)malloc(sizeof(float) * n);
  X[1] = (float *)malloc(sizeof(float) * n);
  X[2] = (float *)malloc(sizeof(float) * n);

  for (int i = 0; i &lt; n; i++) {
    for (int j = 0; j &lt; m; j++) {
      X[j][i] = (float)rand() / RAND_MAX;
    }
  }

  // Assign a y array, with a simple relation to the x values
  float *Y = (float *)malloc(sizeof(float) * n);

  for (int i = 0; i &lt; n; i++) {
    Y[i] = X[0][i] * 0.2 + X[1][i] * 0.5;
  }

  float mean_y = mean(Y, n);
</code></pre>

<p>To start I create a dataset with three features X and an outcome Y. The features lie within the double pointer X. From that double pointer the three pointer columns are accessible. We’re going to play a lot with the pointers since they’re necessary to build dynamic arrays like X and Y in C.</p>

<p>To give values to X I take samples from a uniform distribution. <em>rand</em> returns a value between 0 and RAND_MAX. So by dividing by RAND_MAX you get a value between 0 and 1. Good enough to test.</p>

<p>Y is completely dependent on the first and second column of X. So when I look at the feature importances, I should see a zero importance value for the third column of X.</p>

<pre><code class="language-C">  RegressionTree *reg_tree = malloc(sizeof(RegressionTree));

  reg_tree-&gt;max_depth = 3;
  reg_tree-&gt;min_leaf_samples = 5;


  fit(reg_tree, X, Y, n, m);

  float *predictions = predict(reg_tree, X, n, m);
</code></pre>

<p>I then declare a <em>RegressionTree</em> on the heap, so that I can fit it later with <em>fit</em>. <em>fit</em> is the most important part of this program and is responsible for splitting X into a number of leaves.</p>

<p><em>predict</em> then gives a prediction for each observation in X. How good is the average prediction? That can be checked through the mean squared error (mse).</p>

<pre><code class="language-C">  float mse_model = mse_compute(Y, predictions, n);

  printf("MSE model: %.3f\n", mse_model);

  float *mean_y_vector = (float *)malloc(sizeof(float) * n);

  for (int i = 0; i &lt; n; i++) {
    mean_y_vector[i] = mean_y;
  }

  float mse_null = mse_compute(Y, mean_y_vector, n);

  printf("MSE null: %.3f\n", mse_null);
</code></pre>

<p>I do an evaluation by comparing the mse of the tree model with that of a so-called null model. In this case the null model is the mean of all Y values. In other words, a tree with a single node.</p>

<pre><code class="language-C">  float *feature_importances = compute_feature_importance(reg_tree, m);

  for (int i = 0; i &lt; m; i++) {
    printf("Feat importance %d: %.3f\n", i, feature_importances[i]);
  }
</code></pre>

<p>Compared to other machine learning or AI methods, trees are fairly transparent. By summing the gain of each split, one gets a picture of which variables are the most important according to the model.</p>

<p>Now that we’ve taken a bird’s-eye view over each step of the program, let’s look at how each part of it is built.</p>

<h2 id="sorting">Sorting</h2>
<p>Sorting the observations is necessary to be able to calculate a gain for each potential threshold. Because I want to use the order of a variable to sort multiple arrays, I use a function that returns the sorting index instead of the sorted input array <em>arr</em>.</p>

<pre><code class="language-C">size_t *arg_sort(float *arr, size_t n) {
  float *arr_copy = (float *)malloc(sizeof(float) * n);
  memcpy(arr_copy, arr, sizeof(float) * n);

  size_t *index_arr = (size_t *)malloc(sizeof(size_t) * n);
  for (int i = 0; i &lt; n; i++) {
    index_arr[i] = i;
  }

  float next_value;
  size_t next_index;

  for (int i = 0; i &lt; n; i++) {
    for (int j = 0; j &lt; (n - i - 1); j++) {
      if (arr_copy[j] &gt; arr_copy[j+1]) {
        next_value = arr_copy[j + 1];
        next_index = index_arr[j + 1];

        arr_copy[j + 1] = arr_copy[j];
        index_arr[j + 1] = index_arr[j];

        arr_copy[j] = next_value;
        index_arr[j] = next_index;
      }
    }
  }

  return index_arr;
}
</code></pre>

<p>To sort I use the <em>bubblesort</em> algorithm because I find it easy to remember. C does have a sorting function, <em>qsort</em>, in its standard library, but it sorts in place and doesn’t return the sorting indices.</p>
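<p>For comparison, <em>qsort</em> can still be coaxed into producing sorting indices by sorting an index array with a comparator that looks up the values. A sketch (my own workaround, not from the post; the file-scope pointer is needed because <em>qsort</em>’s comparator takes no context argument, so this isn’t thread-safe):</p>

<pre><code class="language-C">#include &lt;stdlib.h&gt;

static const float *g_values;  // values the comparator looks into

static int cmp_by_value(const void *a, const void *b) {
  float va = g_values[*(const size_t *)a];
  float vb = g_values[*(const size_t *)b];
  return (va &gt; vb) - (va &lt; vb);  // -1, 0 or 1 without overflow
}

size_t *arg_sort_qsort(const float *arr, size_t n) {
  size_t *index_arr = (size_t *)malloc(sizeof(size_t) * n);
  for (size_t i = 0; i &lt; n; i++) {
    index_arr[i] = i;
  }
  g_values = arr;
  qsort(index_arr, n, sizeof(size_t), cmp_by_value);
  return index_arr;
}
</code></pre>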

<p>Another detail is the use of <em>memcpy</em>. I copy the array so that I don’t accidentally change the data behind the <em>arr</em> dynamic array.</p>

<p>Once I have the sorting indices, another function, <em>reorder</em>, must do the actual sorting.</p>

<pre><code class="language-C">float *reorder(float *arr, size_t *reorder_indices, size_t n) {
  float *reordered_arr = (float *)malloc(sizeof(float) * n);
  for (int i = 0; i &lt; n; i++) {
    reordered_arr[i] = arr[reorder_indices[i]];
  }

  return reordered_arr;
}
</code></pre>

<h2 id="a-bit-of-arithmetic">A bit of arithmetic</h2>

<p>Throughout the program there are a number of small functions that are responsible for mathematical operations. You’ve already seen one:</p>

<pre><code class="language-C">float mse_compute(float *actuals, float *predictions, size_t n) {
  float sse = 0;
  for (int i = 0; i &lt; n; i++) {
    sse += pow(actuals[i] - predictions[i], 2.0);
  }

  return sse/(float)n;
}
</code></pre>

<p><em>mse_compute</em> is one of the simpler functions in the program, and a nice place for a beginner to play a bit with C. Otherwise it’s a very ordinary function and thus requires no explanation.</p>

<p>At the same level of complexity is a useful <em>mean</em> function:</p>

<pre><code class="language-C">float mean(float *arr, size_t n) {
  float sum = 0;
  for (int i = 0; i &lt; n; i++) {
    sum += arr[i];
  }
  return sum/(float)n;
}
</code></pre>

<p>Now I can talk about the two interesting calculation functions: <em>compute_gain</em> and <em>compute_leaf_value</em>. Let me start with <em>compute_leaf_value</em>.</p>

<pre><code class="language-C">float compute_leaf_value(float G_sum, float H_sum) {
  return -G_sum/H_sum;
}
</code></pre>

<p>Here the value of a leaf is calculated. <em>G_sum</em> is the sum of the gradients in the leaf. Briefly put, a gradient here is the difference between the mean of Y and a Y observation. So if you have, say, two Y observations in a leaf, the gradients are $\mathrm{mean}(Y) - Y_1$ and $\mathrm{mean}(Y) - Y_2$. Remember that the hessians always have the value 1 here. <em>compute_leaf_value</em> then simply gives the negative mean gradient of the leaf.</p>

<p>Why do I take the negative? Suppose that $Y_1$ and $Y_2$ are both larger than $\mathrm{mean}(Y)$. Then the gradients will be negative, but if you multiply their sum by -1, you get a leaf value that’s positive. And if you add the mean of Y to the leaf value, you precisely get a prediction that sits between $Y_1$ and $Y_2$.</p>

<p>The gradients are thus usable not only to find node splits, but also to make the predictions. In that sense the gradients are reusable.</p>
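<p>A tiny numeric check of this logic (the numbers are made up for illustration): with $\mathrm{mean}(Y) = 3$ and a leaf holding $Y_1 = 4$ and $Y_2 = 6$, the gradients are $-1$ and $-3$, the leaf value is $-(-4)/2 = 2$, and adding back the constant recovers the leaf mean of 5.</p>

<pre><code class="language-C">float compute_leaf_value(float G_sum, float H_sum) {
  return -G_sum / H_sum;
}

/* mean(Y) = 3, leaf outcomes 4 and 6: the prediction should be their mean, 5. */
float leaf_prediction_demo(void) {
  float mean_y = 3.0f;
  float g1 = mean_y - 4.0f;                        // -1
  float g2 = mean_y - 6.0f;                        // -3
  float value = compute_leaf_value(g1 + g2, 2.0f); // -(-4)/2 = 2
  return mean_y + value;                           // 3 + 2 = 5
}
</code></pre>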

<p>In <em>compute_gain</em>, I can show how these gradients are used to find a split. This happens by comparing the gains of all possible splits per feature or variable.</p>

<pre><code class="language-C">float compute_gain(float gradient_left, float gradient_right, float hessian_left, float hessian_right) {
  float left_side = pow(gradient_left, 2.0) / hessian_left + pow(gradient_right, 2.0) / hessian_right;
  float right_side = pow(gradient_left + gradient_right, 2.0) / (hessian_left + hessian_right);

  return left_side - right_side;
}
</code></pre>

<p>I’ll start with the <em>right_side</em> of the formula. That gives the gain if you don’t split the node, but instead set it as a leaf. In other words it’s the gain if you don’t let the tree grow.</p>

<p>The <em>left_side</em> of the formula looks at the information that’s generated by the split. If you think about the formula, you see that it has a recognizable form: $x^2+y^2−(x+y)^2$, which can only be positive if $x$ has the opposite sign of $y$. So the goal of the regression tree is to explain the variation around the mean of Y as well as possible. This is done by greedily revealing the dependency of Y with X.</p>
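<p>To see the formula reward informative splits, here is a small numeric check (an illustration with made-up numbers, using plain squares instead of <em>pow</em>): outcomes $\{1, 1, 5, 5\}$ have mean 3, so the gradients are $\{2, 2, -2, -2\}$ with unit hessians. Splitting the low outcomes from the high ones yields a large gain; a split that mixes them yields none.</p>

<pre><code class="language-C">float compute_gain(float gradient_left, float gradient_right,
                   float hessian_left, float hessian_right) {
  float left_side = gradient_left * gradient_left / hessian_left +
                    gradient_right * gradient_right / hessian_right;
  float sum_g = gradient_left + gradient_right;
  float right_side = sum_g * sum_g / (hessian_left + hessian_right);
  return left_side - right_side;
}

/* Left gets gradients {2, 2}, right gets {-2, -2}: G_L = 4, G_R = -4. */
float gain_informative(void) { return compute_gain(4.0f, -4.0f, 2.0f, 2.0f); }

/* Each side gets {2, -2}: both gradient sums cancel to zero. */
float gain_uninformative(void) { return compute_gain(0.0f, 0.0f, 2.0f, 2.0f); }
</code></pre>

<p>The informative split scores 16, the uninformative one 0.</p>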

<h2 id="searching-for-the-correct-fit">Searching for the correct fit</h2>

<p>The fit function is relatively simple. It finds the mean of outcome Y, defines the gradients and hessians dynamic arrays, allocates memory to the root node and calls the recursive <em>_split_node</em> function.</p>

<pre><code class="language-C">void fit(RegressionTree *reg_tree, float **X, float *Y, size_t n, size_t m) {
  float mean_y = mean(Y, n);

  // Assign mean_y as constant of tree (for predictions)
  reg_tree-&gt;constant = mean_y;

  float *gradients = (float *)malloc(sizeof(float) * n);
  float *hessians = (float *)malloc(sizeof(float) * n);

  for (int i = 0; i &lt; n; i++) {
    gradients[i] = mean_y - Y[i];
    hessians[i] = 1.0; 
  }

  reg_tree-&gt;root = (Node *)malloc(sizeof(Node));
  Node *root = reg_tree-&gt;root;

  root-&gt;depth = 0;
  root-&gt;is_leaf = true;

  _split_node(
    root, 
    reg_tree-&gt;max_depth, 
    reg_tree-&gt;min_leaf_samples, 
    X, 
    gradients, 
    hessians, 
    m, 
    n);
}
</code></pre>

<h2 id="the-art-of-the-split">The art of the split</h2>

<p><em>_split_node</em> is a thick function with a large number of arguments and multiple conditional statements. I’ll describe it briefly and then dump the entire function below. The goal of that summary is to give structure to an attentive reading of <em>_split_node</em>.</p>

<p>I first look at whether a split can be made. If the tree has reached its <em>max_depth</em>, or if the split nodes would have a too small number of observations, the function returns the leaf value with <em>compute_leaf_value</em> (see the last else instruction). The fit of a branch of the tree is then done.</p>

<p>When the opposite happens, that is, when a split can be made, the function walks through all variables looking for the variable whose split gives the best gain, and that split’s threshold.</p>

<p>Much code is then responsible for preparing the split by sending the correct data to the left and right child nodes. To know which values lie above and which lie at or below the threshold, I use the <em>MaskIndices</em> struct. If a value lies at or below the threshold, I send it to the left node. What would happen in two Python lines sometimes takes 30 lines of C…</p>

<p>Before I call <em>_split_node</em> again, I allocate memory for the left and right child nodes. I also ensure that the depth of each child increases by one.</p>

<p>A difference from the functions typically used to teach recursion is that each call to <em>_split_node</em> returns something, namely the node it was given. The base case of the recursion is decided by <em>should_split</em>.</p>

<pre><code class="language-C">Node *_split_node(
  Node *root,
  int max_depth,
  int min_leaf_samples,
  float **X, 
  float *gradients,
  float *hessians,
  size_t m,
  size_t n) {
  int depth = root-&gt;depth;

  bool split_decision = should_split(depth, n, max_depth, min_leaf_samples);

  if (split_decision == true) {
    root-&gt;is_leaf = false;

    int best_feature_id = 0;
    float best_threshold = 0;
    float best_gain = 0;

    // Need to first find the best split
    for (int j = 0; j &lt; m; j++) {
      SplitInfo split_info = find_best_split(X[j], gradients, hessians, n);
      float current_gain = split_info.gain;
      if (current_gain &gt; best_gain) {
        best_feature_id = j;
        best_threshold = split_info.threshold;
        best_gain = current_gain;
      }
    }

    if (best_gain &lt;= 0) {
      float G_sum = 0;
      float H_sum = 0;

      for (int i = 0; i &lt; n; i++) {
        G_sum += gradients[i];
        H_sum += hessians[i];
      }

      root-&gt;value = compute_leaf_value(G_sum, H_sum);
      root-&gt;is_leaf = true;
    }
    else { 
      // Saving the gains of each split node to compute feature importance later
      root-&gt;gain = best_gain;
      root-&gt;threshold = best_threshold;
      root-&gt;feature_id = best_feature_id;

      MaskIndices mask_indices = split_on_feature_threshold(
        best_threshold, 
        best_feature_id,
        X, 
        n
      );

      // Make left and right arrays to pass to split_node
      float *left_gradients = (float *)malloc(sizeof(float) * mask_indices.left_n);
      float *left_hessians = (float *)malloc(sizeof(float) * mask_indices.left_n);

      float *right_gradients = (float *)malloc(sizeof(float) * mask_indices.right_n);
      float *right_hessians = (float *)malloc(sizeof(float) * mask_indices.right_n);

      float **X_left = (float **)malloc(sizeof(float *) * m);
      float **X_right = (float **)malloc(sizeof(float *) * m);

      for (int j = 0; j &lt; m; j++) {
        X_left[j] = (float *)malloc(sizeof(float) * mask_indices.left_n);
        X_right[j] = (float *)malloc(sizeof(float) * mask_indices.right_n);
      }

      for (int i = 0; i &lt; mask_indices.left_n; i++) {
        int left_index = mask_indices.left_indices[i];
        left_gradients[i] = gradients[left_index];
        left_hessians[i] = hessians[left_index];

        for (int j = 0; j &lt; m; j++) {
          X_left[j][i] = X[j][left_index];
        }
      }

      for (int i = 0; i &lt; mask_indices.right_n; i++) {
        int right_index = mask_indices.right_indices[i];
        right_gradients[i] = gradients[right_index];
        right_hessians[i] = hessians[right_index];

        for (int j = 0; j &lt; m; j++) {
          X_right[j][i] = X[j][right_index];
        }
      }
      printf("Split: left_n=%zu, right_n=%zu\n", mask_indices.left_n, mask_indices.right_n);

      // Define left and right nodes before recursive calls
      Node *left_node = (Node *)malloc(sizeof(Node));
      left_node-&gt;is_leaf = true;
      left_node-&gt;depth = depth + 1;

      Node *right_node = (Node *)malloc(sizeof(Node));
      right_node-&gt;is_leaf = true;
      right_node-&gt;depth = depth + 1;

      root-&gt;left = _split_node(
        left_node,
        max_depth,
        min_leaf_samples,
        X_left,
        left_gradients,
        left_hessians,
        m,
        mask_indices.left_n
      );

      root-&gt;right = _split_node(
        right_node,
        max_depth,
        min_leaf_samples,
        X_right,
        right_gradients,
        right_hessians,
        m,
        mask_indices.right_n
      );
    }
  }

  // If the split is not acceptable
  else {
    float G_sum = 0;
    float H_sum = 0;

    for (int i = 0; i &lt; n; i++) {
      G_sum += gradients[i];
      H_sum += hessians[i];
    }

    root-&gt;value = compute_leaf_value(G_sum, H_sum);
  }
  // Now got to mask the data according to feature and threshold

  return root;
}
</code></pre>

<p>The first function called within <em>_split_node</em> is <em>should_split</em>. It checks a couple of conditions to decide whether a split is allowed. If so, there’s still a further check on the gain: the gain of the best split found must be positive, otherwise the regression tree gives a better fit without the new split.</p>

<p>I find the <em>min_leaf_samples</em> check interesting. $\lfloor \text{n\_samples}/2 \rfloor$ (integer division) is the minimum possible size of the larger of the two child nodes, so <em>min_leaf_samples</em> is the smallest number of observations that a split allows its larger child to have.</p>

<pre><code class="language-C">bool should_split(int depth, int n_samples, int max_depth, int min_leaf_samples) {
  if (
    depth &lt; max_depth &amp;&amp; (n_samples / 2) &gt; min_leaf_samples
  ) {
    return true;
  }
  else {
    return false;
  }
}
</code></pre>

<p>The second function called within <em>_split_node</em> is <em>find_best_split</em>. It searches for the threshold on an array <em>arr</em> that yields the largest gain. The gain takes as arguments the sums of the left and right gradients; the hessians here are only used to normalize those sums.</p>

<p>To minimize the use of data masking, I use the <em>arg_sort</em> and <em>reorder</em> functions shown earlier, together with cumulative sums. I can thus easily get the left and right gradient and hessian sums. I’d like to say this is the most efficient way to do things, but since I have to use a <em>bubblesort</em>, I don’t know. In my opinion, it’s certainly easy to read.</p>

<pre><code class="language-C">SplitInfo find_best_split(float *arr, float *gradients, float *hessians, size_t n) {
  size_t *index_arr = arg_sort(arr, n);
  float *sorted_arr = reorder(arr, index_arr, n);
  float *sorted_gradients = reorder(gradients, index_arr, n);
  float *sorted_hessians = reorder(hessians, index_arr, n);

  float *cumsum_gradients = (float *)malloc(sizeof(float) * n);
  float *cumsum_hessians = (float *)malloc(sizeof(float) * n);

  cumsum_gradients[0] = sorted_gradients[0];
  cumsum_hessians[0] = sorted_hessians[0];
  // Need also the sums to get the left split gains
  float sum_gradients = sorted_gradients[0];
  float sum_hessians = sorted_hessians[0];

  for (int i = 1; i &lt; n; i++) {
    cumsum_gradients[i] = sorted_gradients[i] + cumsum_gradients[i-1];
    cumsum_hessians[i] = sorted_hessians[i] + cumsum_hessians[i-1];

    sum_gradients += sorted_gradients[i];
    sum_hessians += sorted_hessians[i];
  }

  // Calculate gains for each possible split, and find best gain
  float best_gain = 0;
  float best_threshold = 0;

  // Setting i to max n-2 to avoid illegal splits
  for (int i = 0; i &lt; (n-1); i++) {
    float gradient_left = cumsum_gradients[i];
    float gradient_right = sum_gradients - gradient_left; 

    float hessian_left = cumsum_hessians[i];
    float hessian_right = sum_hessians - hessian_left; 

    float gain = compute_gain(gradient_left, gradient_right, hessian_left, hessian_right);

    if (gain &gt; best_gain) {
      best_gain = gain;
      best_threshold = sorted_arr[i];
    }
  }


  SplitInfo split_info;

  split_info.gain = best_gain;
  split_info.threshold = best_threshold;

  return split_info;
}
</code></pre>

<p>After I’ve found the best split threshold, I do have to use masking to split the data and send it to the relevant children. I start by allocating the maximum possible size for each index array, and once I know their actual sizes, I call <em>realloc</em> to release the unneeded memory.</p>

<pre><code class="language-C">MaskIndices split_on_feature_threshold(
  float threshold, 
  int feature_id, 
  float **X, 
  size_t n) {
  int *left_indices = (int *)malloc(sizeof(int) * n);
  int *right_indices = (int *)malloc(sizeof(int) * n);

  int left_cnt = 0;
  int right_cnt = 0;

  for (int i = 0; i &lt; n; i++) {
    if (X[feature_id][i] &lt;= threshold) {
      left_indices[left_cnt] = i;
      left_cnt++;
    }
    else {
      right_indices[right_cnt] = i;
      right_cnt++;
    }
  }

  left_indices = realloc(left_indices, sizeof(int) * left_cnt);
  right_indices = realloc(right_indices, sizeof(int) * right_cnt);

  MaskIndices mask_indices;

  mask_indices.left_indices = left_indices;
  mask_indices.left_n = left_cnt;

  mask_indices.right_indices = right_indices;
  mask_indices.right_n = right_cnt;

  return mask_indices;
}
</code></pre>

<p>If you’ve made it this far, congratulations: you know how to fit a regression tree in C. You’ll probably still want to keep reading, though. I’m going to talk about two exciting tree traversal functions. These are very useful to get some juice out of the regression tree.</p>

<h2 id="two-pleasant-tree-traversals">Two pleasant tree traversals</h2>

<p>I’ll start with my favorite of the two: the calculation of the feature importances. These give the importance of each variable of <em>X</em>. This is done by finding on which variable or feature each split was done, and adding the gain of that split to a <em>feature_importances</em> array of size $1 \times m$ where $m$ is the number of variables.</p>

<pre><code class="language-C">void _feature_importance(Node *node, float *feature_importances) {
  if (node-&gt;is_leaf == true) {
    return;
  }

  else {
    feature_importances[node-&gt;feature_id] += node-&gt;gain;

    _feature_importance(node-&gt;left, feature_importances);
    _feature_importance(node-&gt;right, feature_importances);
  }
}
</code></pre>

<p>That recursive function is used within the <em>compute_feature_importance</em> function, which does a bit of bookkeeping.</p>

<p>The second tree traversal allows making predictions. For each observation $x$ with values on all variables of <em>X</em>, we want to be able to make a prediction. This happens by following the path through the splits until the leaf holding the prediction for $x$ is reached.</p>

<pre><code class="language-C">float _predict_single(Node *root, float *x, float constant) {
  // x is a 1 by m array

  if (root-&gt;is_leaf == false) {
    if (x[root-&gt;feature_id] &lt;= root-&gt;threshold) {
      return _predict_single(root-&gt;left, x, constant);
    }
    else {
      return _predict_single(root-&gt;right, x, constant);
    }
  }
  else {
    return root-&gt;value + constant;
  }
}
</code></pre>

<p>In short, at each split until a leaf is reached, we check whether $x$ is below or above the threshold of the relevant split feature. Accordingly, we follow the left or right path until root-&gt;is_leaf is true.</p>

<p>Again, I do a bit of bookkeeping to make and retain predictions for each $x$ within a matrix X:</p>

<pre><code class="language-C">float *predict(RegressionTree *reg_tree, float **X, size_t n, size_t m) {
  float *predictions = (float *)malloc(sizeof(float) * n);

  float constant = reg_tree-&gt;constant;

  for (int i = 0; i &lt; n; i++)  {
    float *x = (float *)malloc(sizeof(float) * m);
    // Prep the prediction vector
    for (int j = 0; j &lt; m; j++) {
      x[j] = X[j][i];
    }

    float value = _predict_single(reg_tree-&gt;root, x, constant);

    free(x);

    predictions[i] = value;
  }

  return predictions;
}
</code></pre>

<h2 id="possible-further-exercises">Possible further exercises</h2>

<p>There are a few interesting exercises you can do to learn even more from regression trees.</p>

<ul>
  <li>
    <p>Above we’ve made use of the entire dataset to find a split. As the number of training observations grows, that exhaustive search becomes increasingly expensive. Modern tools like LightGBM and XGBoost can make use of histogram binning to handle big data better. In Python it’s not so difficult to do this with a few numpy functions like <em>searchsorted</em> and <em>bincount</em>, so you can try it in Python first and then translate the solution to C.</p>

    <p>The idea is to sum the gradients and hessians per bin, and then find the split by searching over the bins instead of the entire dataset. If you still want to find so-called exact splits, you can instead look for a way to handle duplicate feature values. This can also improve efficiency, especially when features take only a small number of distinct values.</p>
  </li>
  <li>
    <p>Another exercise can be to make a random forest or a gradient boosting machine from the regression tree. To a certain extent that’s actually simpler than the exercise above.</p>
  </li>
  <li>
    <p>Another interesting exercise would be to make a decision tree instead of a regression tree. More challenging still would be to correctly predict more than two discrete outcomes.</p>
  </li>
</ul>]]></content><author><name>Marc-André Chénier</name></author><summary type="html"><![CDATA[This post is translated from Dutch with Claude 4.5 Sonnet. I did reread and revise the translation.]]></summary></entry><entry><title type="html">Integrating web search into Claude Desktop on Linux Mint</title><link href="https://marcandre259.github.io/blog/2025/05/07/claude-mcp.html" rel="alternate" type="text/html" title="Integrating web search into Claude Desktop on Linux Mint" /><published>2025-05-07T00:00:00+00:00</published><updated>2025-05-07T00:00:00+00:00</updated><id>https://marcandre259.github.io/blog/2025/05/07/claude-mcp</id><content type="html" xml:base="https://marcandre259.github.io/blog/2025/05/07/claude-mcp.html"><![CDATA[<h1 id="intégrer-la-recherche-web-à-claude-desktop-sur-linux-mint">Integrating web search into Claude Desktop on Linux Mint</h1>

<p>This article explains how to add web search to Claude Desktop on Linux
using the MCP (Model Context Protocol), which lets AI models use external tools. This is done in three steps:</p>
<ul>
  <li>Installing the Claude Desktop client on Linux</li>
  <li>Configuring a first MCP server that gives the model access to local files, for familiarization purposes.</li>
  <li>Configuring a server that provides the web search functionality.</li>
</ul>

<p>The goal is to enable the model to answer questions using information currently available online.</p>

<ul>
  <li><a href="#intégrer-la-recherche-web-à-claude-desktop-sur-linux-mint">Integrating web search into Claude Desktop on Linux Mint</a>
    <ul>
      <li><a href="#installation-de-claude-desktop-sur-linux">Installing Claude Desktop on Linux</a></li>
      <li><a href="#configurer-un-premier-serveur-mcp">Configuring a first MCP server</a></li>
      <li><a href="#ajouter-le-serveur-brave-search-pour-la-recherche-brave">Adding the brave-search server for Brave search</a></li>
    </ul>
  </li>
</ul>

<h2 id="installation-de-claude-desktop-sur-linux">Installation de Claude Desktop sur Linux</h2>
<p>Dans ce tutoriel, j’installe <a href="https://claude.ai/download">Claude Desktop</a> sur
une instance de Linux Mint 22.1. Linux Mint est une distribution Linux assez
populaire basée sur Debian.</p>

<p>There is currently no official version of Claude Desktop available
for Linux. The alternative I found is this project on <em>github</em>:
<a href="https://github.com/aaddrick/claude-desktop-debian">claude-desktop-debian</a>.
It is an adaptation of the Windows version and is compatible with
the MCP protocol. The MCP protocol lets the Claude client access
tools, including the web search I am going to configure.</p>

<p><img src="/blog/assets/claude_mcp/claude_desktop_linux.png" alt="Alt" /></p>

<p>To install this adaptation of the Claude client, you need access
to the <em>git</em> command-line program. To install <em>git</em>, simply type</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get <span class="nb">install </span>git
</code></pre></div></div>
<p>in your terminal (including <em>sudo</em> administrative access if necessary).</p>

<p>Next, you need to clone the <a href="">claude-desktop-debian</a> git repository, then run
the <em>build</em> script to generate a <em>.deb</em> package. To do this, I run this
command in my <em>~/Documents/</em> directory:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Clone this repository</span>
git clone https://github.com/aaddrick/claude-desktop-debian.git
<span class="nb">cd </span>claude-desktop-debian

<span class="c"># Build the package (Defaults to .deb and cleans build files)</span>
./build.sh
</code></pre></div></div>

<p>The <em>build</em> script is well made and takes care of installing the
necessary dependencies before installing the Claude client itself. This process
takes a few minutes, enough time to make yourself a coffee.</p>

<p>The <em>build</em> generates a file that looks like
<em>claude-desktop_{version_number}_{architecture}.deb</em> in the
<em>claude-desktop-debian</em> directory. In my case, the <em>.deb</em> file is
<em>claude-desktop_0.9.3_amd64.deb</em>. To install the package, I start by running</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">chmod</span> +x claude-desktop_0.9.3_amd64.deb
</code></pre></div></div>
<p>which makes the package executable. Then I install the package with the command</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt <span class="nb">install </span>./claude-desktop_0.9.3_amd64.deb
</code></pre></div></div>

<p>Once the installation is complete, the Claude client should be launchable from the terminal with</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>claude-desktop
</code></pre></div></div>
<p>To test, I sign in to the application with my Google account and send a small request like <em>Hello</em> in the chat menu.</p>

<h2 id="configurer-un-premier-serveur-mcp">Configurer un premier serveur MCP</h2>
<p>Au lancement, le client Claude Desktop analyse le fichier
<em>~/.config/Claude/claude_desktop_config.json</em> pour découvrir d’éventuels outils.
Lorsque découverts, ces outils peuvent être utilisés par le Large Language Model
(LLM) pour répondre aux requêtes de l’utilisateur.</p>

<p>For example, a weather tool lets the LLM look up the weather on the internet.
To make sure these tools are actually used, you have to be sufficiently explicit
when formulating your requests. For example, ask <em>Use the weather
tool to give today’s weather</em> instead of <em>Give today’s
weather</em>. In the second case, the LLM is less likely to consult
the tool and will then give an answer like: <em>my context does not give me
access to daily weather information</em>.</p>

<p>Initially, the <em>claude_desktop_config.json</em> file probably won’t exist,
so you will have to create it with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">touch</span> ~/.config/Claude/claude_desktop_config.json
</code></pre></div></div>

<p>To get familiar with configuring MCP servers, I recommend setting up
the <em>filesystem</em> server. The <em>filesystem</em> server includes tools
to read, create and modify files on your local system. The
Claude client will always ask for permission before using a tool, and it
is necessary to read these permission requests to avoid serious trouble.</p>

<p>This part of the tutorial is taken from
<a href="https://modelcontextprotocol.io/quickstart/user">MCP-Quickstart</a>. Simply
copy and paste the following text into the
<em>claude_desktop_config.json</em> file</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"mcpServers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"filesystem"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npx"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="s2">"-y"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"@modelcontextprotocol/server-filesystem"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"/Users/username/Desktop"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"/Users/username/Downloads"</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The arguments <em>/Users/username/Desktop/</em> and <em>/Users/username/Downloads/</em> provide
the entry points the LLM can use to search and modify our
files. They must therefore be adapted as needed. In my case,
I have the following configuration:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"mcpServers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"filesystem"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npx"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="s2">"-y"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"@modelcontextprotocol/server-filesystem"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"/home/marc/Documents/"</span><span class="p">,</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>With this configuration text included in the <em>claude_desktop_config.json</em> file,
you need to make sure the <em>node js</em> dependency is on your system. It is
required to install and launch MCP servers using the
<em>npx</em> command. For a Debian-type distribution like Linux Mint, the simplest
way is to run this <a href="https://nodejs.org/en/download">script</a> in your terminal:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Download and install nvm:</span>
curl <span class="nt">-o-</span> https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash

<span class="c"># in lieu of restarting the shell</span>
<span class="se">\.</span> <span class="s2">"</span><span class="nv">$HOME</span><span class="s2">/.nvm/nvm.sh"</span>

<span class="c"># Download and install Node.js:</span>
nvm <span class="nb">install </span>22

<span class="c"># Verify the Node.js version:</span>
node <span class="nt">-v</span> <span class="c"># Should print "v22.15.0".</span>
nvm current <span class="c"># Should print "v22.15.0".</span>

<span class="c"># Verify npm version:</span>
npm <span class="nt">-v</span> <span class="c"># Should print "10.9.2".</span>
</code></pre></div></div>

<p>Once that is done, I close and reopen Claude Desktop by running</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>claude-desktop
</code></pre></div></div>

<p>I can then test the filesystem server by asking <em>What folders are in my
Documents</em>, and after granting the requested permissions, I get the answer:
<em>There’s one folder in your Documents directory called “claude-desktop-debian”.</em></p>

<h2 id="ajouter-le-serveur-brave-search-pour-la-recherche-brave">Ajouter le serveur brave-search pour la recherche brave</h2>
<p>Ce test étant réussi. On peut maintenant passer au moteur de recherche. Pour la
recherche j’utilise le serveur brave-search étant donné que son installation et
utilisation est relativement simple. Il utilise aussi la même dépendance <em>node
js</em>.</p>

<p>The first step is to generate an API key on the Brave website. To
do this, create a user account on Brave Search API. Then, the
<em>Add API key</em> button in the API keys menu lets you add an API key to the account.</p>

<p><img src="/blog/assets/claude_mcp/api_key_brave_search.png" alt="Alt" /></p>

<p>The key is then added to the
<em>claude_desktop_config.json</em> configuration file. The expected configuration format is given
in this <a href="https://github.com/modelcontextprotocol/servers/tree/main/src/brave-search">github</a> project.</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"mcpServers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"brave-search"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npx"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="s2">"-y"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"@modelcontextprotocol/server-brave-search"</span><span class="w">
      </span><span class="p">],</span><span class="w">
      </span><span class="nl">"env"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"BRAVE_API_KEY"</span><span class="p">:</span><span class="w"> </span><span class="s2">"YOUR_API_KEY_HERE"</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>To keep the <em>filesystem</em> tools, this configuration must be merged with
the one given above. I give a complete example at the bottom of this tutorial.</p>

<p>After inserting the key and editing the configuration, I use an online JSON
validator to check that the config file has no
syntax errors. While doing this, I make sure not to include a valid API key
in the text being checked.</p>

<p>Once that is done, close and reopen <em>claude-desktop</em> once more. I
now ask the question “Search online what is the weather today”. If the
configuration is correct, the Claude client will ask to use the
brave_web_search tool before giving a more or less valid answer.</p>

<p><img src="/blog/assets/claude_mcp/weather_today.png" alt="Alt" /></p>

<p>If it works, congratulations! Otherwise, here is an example for reference:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"mcpServers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"filesystem"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npx"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="s2">"-y"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"@modelcontextprotocol/server-filesystem"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"/Users/marc/"</span><span class="w">
      </span><span class="p">]</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"brave-search"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">"npx"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"args"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
        </span><span class="s2">"-y"</span><span class="p">,</span><span class="w">
        </span><span class="s2">"@modelcontextprotocol/server-brave-search"</span><span class="w">
      </span><span class="p">],</span><span class="w">
      </span><span class="nl">"env"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"BRAVE_API_KEY"</span><span class="p">:</span><span class="w"> </span><span class="s2">"BSA7Krtfakekey"</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>]]></content><author><name>Marc-André Chénier</name></author><summary type="html"><![CDATA[Integrating web search into Claude Desktop on Linux Mint]]></summary></entry><entry><title type="html">Getting maths to render on this blog</title><link href="https://marcandre259.github.io/blog/2025/04/27/math-syntax.html" rel="alternate" type="text/html" title="Getting maths to render on this blog" /><published>2025-04-27T00:00:00+00:00</published><updated>2025-04-27T00:00:00+00:00</updated><id>https://marcandre259.github.io/blog/2025/04/27/math-syntax</id><content type="html" xml:base="https://marcandre259.github.io/blog/2025/04/27/math-syntax.html"><![CDATA[<h1 id="going-insane-trying-to-get-math-to-render-on-this-blog">Going insane trying to get math to render on this blog</h1>
<p>I realized last week that mathematical equations would not render when writing
on this blog. I had a wild ride trying to get this working already, but now the time
has come to get those equations to render.</p>

<p>By rendering math, I mean getting LaTeX syntax to show up as nice, high
resolution pictures or scalable vector graphics. This blog post will end with
such graphics.</p>

<h2 id="methodology">Methodology</h2>
<p>My approach is to give Google’s Gemini 2.5 Flash large language model (LLM)
as much context as possible about the issue. Namely, I give:</p>
<ul>
  <li>The structure of the blog’s project, so the folders and files within.</li>
  <li>That I am using github pages to deploy the blog.</li>
  <li>What is in the _config.yml file of my blog.</li>
  <li>The issue itself, namely that I’d like to see something like $\frac{x}{2}$
rendered properly (if it does not show up as mangled LaTeX, the mission was successful).</li>
</ul>

<p>As a sidenote, I started using the Pro and Flash iterations of the Gemini 2.5
model last week and really like them. At the moment, the experimental versions
of the model are free as in free beer.</p>

<h2 id="taking-the-llm-to-heart">Taking the LLM to heart</h2>
<p>The first recommendation of the LLM is to create a <code class="language-plaintext highlighter-rouge">default.html</code> layout that
overrides the basic <em>minima</em> layout of the blog.</p>

<p>To do this, I copy and paste the main blog page to default.html and add some
instructions to import <em>mathjax</em>. <em>mathjax</em> is the package that should render
the maths.</p>

<p>Doing this destroys the blog’s styling and does not render the maths.</p>

<h2 id="it-works-on-the-llms-computer">It works on the LLM’s computer</h2>
<p>The approach is to instead extend the default <em>minima</em> style of the blog. So I
get rid of <code class="language-plaintext highlighter-rouge">default.html</code> and create a <code class="language-plaintext highlighter-rouge">math.html</code> file instead. In that file, I
include a kind of markdown layer where I specify that the layout is <em>default</em>.</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
layout: default
---

<span class="c">&lt;!-- Add this MathJax script --&gt;</span>
<span class="nt">&lt;script </span><span class="na">type=</span><span class="s">"text/javascript"</span> <span class="na">async</span>
    <span class="na">src=</span><span class="s">"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;/script&gt;</span>
<span class="c">&lt;!-- End of MathJax script --&gt;</span>
</code></pre></div></div>

<p>The LLM assures me that it tried this solution and that it worked. It does not. The blog’s style is back, but the math is not rendering.</p>

<p>As a check, I put the <em>mathjax</em> import script directly in this blog post.</p>

<p>This also did not work. In a further query, the LLM recommended I include
<em>mathjax</em> explicitly in my <code class="language-plaintext highlighter-rouge">_config.yml</code>.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">title</span><span class="pi">:</span> <span class="s">Tapestry of flimsy steps</span>
<span class="na">author</span><span class="pi">:</span> <span class="s">Marc-André Chénier</span>
<span class="na">theme</span><span class="pi">:</span> <span class="s">minima</span>

<span class="na">kramdown</span><span class="pi">:</span>
  <span class="na">math_engine</span><span class="pi">:</span> <span class="s">mathjax</span>
  <span class="na">syntax_highlighter</span><span class="pi">:</span> <span class="s">rouge</span>
  <span class="na">input</span><span class="pi">:</span> <span class="s">GFM</span> <span class="c1"># Optional: Use GitHub Flavored Markdown</span>
</code></pre></div></div>

<p>I had high hopes, but this approach also failed. The next suggestion was to sandwich LaTeX expressions in a
<em>raw</em> block. Again, without success.</p>

<h2 id="other-attempts">Other attempts</h2>
<p>Here is an inline fraction:
(\frac{x}{2}).</p>

<p>Here is a display equation:
[ a^2 + b^2 = c^2 ]</p>

<p>This should be the one…</p>

\[\int x^2y \delta x\]

<h2 id="at-last">At last</h2>
<p>What got things rolling was finding out about <a href="https://github.com/Csega/csega.github.io">csega’s
blog</a>. He has a post on there
where he makes a similarly heroic attempt to render $\frac{x}{2}$. Since he’s
also using <em>jekyll</em> as his site generator, it gave me confidence this feat is
possible.</p>

<p>I then went ahead and installed <em>jekyll</em> locally so I could serve and quickly debug the website from my laptop. This can be done with</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jekyll serve
</code></pre></div></div>

<p>I had some issues getting <em>jekyll</em> to compile but most were solved with a variation of:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gem <span class="nb">install</span> ...
</code></pre></div></div>

<p><em>Gem</em> is <em>Ruby</em>’s package manager. With this set-up, debugging went a lot
faster. I also went ahead and <em>vibe-coded</em> the rest with <em>cline</em>.</p>

<p><em>Cline</em> is an extension for the <em>VS Code</em> editor. It gives you a chat interface
for LLMs, along with some tools to read and edit your project files. Very much
like <em>Cursor</em>, except that you can set it up with an API of your choice without
having to pay a subscription.</p>

<p>On the one hand, I could have spent this sunny afternoon reading jekyll’s and
github pages documentation at length instead of iterating with an LLM. On the other hand, some blockers are nice to just push aside with <em>vibe-coding</em>.</p>]]></content><author><name>Marc-André Chénier</name></author><summary type="html"><![CDATA[Going insane trying to get math to render on this blog I realized last week that mathematical equations would not render when writing on this blog. I had a wild ride trying to get this working already, but now the time has come to get those equations to render.]]></summary></entry><entry><title type="html">Statistical approach to a 2 by 2 crossover design</title><link href="https://marcandre259.github.io/blog/2025/04/21/crossover-design.html" rel="alternate" type="text/html" title="Statistical approach to a 2 by 2 crossover design" /><published>2025-04-21T00:00:00+00:00</published><updated>2025-04-21T00:00:00+00:00</updated><id>https://marcandre259.github.io/blog/2025/04/21/crossover-design</id><content type="html" xml:base="https://marcandre259.github.io/blog/2025/04/21/crossover-design.html"><![CDATA[<h1 id="statistical-approach-to-a-2-by-2-crossover-design">Statistical approach to a 2 by 2 crossover design</h1>
<ul>
  <li><a href="#statistical-approach-to-a-2-by-2-crossover-design">Statistical approach to a 2 by 2 crossover design</a>
    <ul>
      <li><a href="#question">Question</a></li>
      <li><a href="#two-questions-in-one">Two questions in one</a></li>
      <li><a href="#whats-a-crossover-design">What’s a crossover design?</a></li>
      <li><a href="#multiple-outcomes-of-interest">Multiple outcomes of interest</a></li>
      <li><a href="#tackling-the-first-question">Tackling the first question</a></li>
      <li><a href="#tackling-the-second-question">Tackling the second question</a></li>
    </ul>
  </li>
</ul>

<h2 id="question">Question</h2>
<p>This blog post is taken from a question I answered on Cross Validated over
the weekend. I have had this blog on the backburner for a while, and this is as
good an opportunity to properly start it as I will get. You can find the <em>q &amp;
a</em> thread here:
<a href="https://stats.stackexchange.com/questions/664399/which-statistical-test-would-you-recommend-for-comparing-two-interventions-in-a/664416#664416">cross-validated</a>.
The original question from user <em>AB108</em> is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>I’m conducting a randomized crossover trial with 16 participants, where each
subject receives two interventions (sub-occipital muscle inhibition and deep
breathing). For each intervention, heart rate variability (HRV) metrics (e.g.,
RMSSD and HF) are recorded before and after the intervention.

I’m aiming to determine whether one intervention leads to greater
parasympathetic activation than the other, based on these HRV measures.

The design involves: Repeated measures (pre and post) Two conditions per
participant A small sample size (n = 17) A few potential covariates (e.g.,
stress level, respiratory rate)

 What statistical approach would you recommend for analyzing this kind of data?
 Would you use a method that compares pre/post differences (deltas), or would
you suggest a model that incorporates all measurements directly? I'm
particularly interested in approaches that account for within-subject
variability and repeated measures.
</code></pre></div></div>

<p>Here are my initial thoughts about the question:</p>
<ul>
  <li>It is actually two questions in one.</li>
  <li>I had not heard of crossover designs before.</li>
  <li>There are multiple outcomes of interest, and it is not clear how the inquirer
plans to combine or include them in the analysis.</li>
  <li><em>AB108</em> (the inquirer) is interested in including covariates in the analysis.</li>
</ul>

<h2 id="two-questions-in-one">Two questions in one</h2>
<p>The first question is about the desirability of analysing the pre-post
difference in outcomes. For example, this could be taking the difference in
RMSSD. Typically, a pre-post analysis means taking the difference between the
outcome after and the outcome before receiving a treatment, but here we are
comparing two treatments: sub-occipital muscle inhibition and deep breathing. So
we take the difference in outcome between the two treatments and forget about
the baseline (i.e. no treatment).</p>

<p>In general, a pre-post analysis is a waste of time. You can often
argue that something unrelated to the difference of interest happens between the two
interventions given to a subject. That makes it difficult to defend a causal
statement.</p>

<p>Nevertheless, taking a pre-post difference does control for what is known as
time-invariant subject characteristics. Those are things like your natural hair
color or your neuroticism. To be a bit more precise, a time-invariant
characteristic is something that stays constant during the period of the
experiment. That makes taking the pre-post difference a good tactic to reduce
the variation of the outcome of interest (for ex. the RMSSD here). At equal
sample size, this increases the power of a statistical test on that difference.</p>
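To see why differencing helps, here is a small simulation sketch in R. The data-generating model and all numbers are made up for illustration: a time-invariant subject effect makes the two measurements correlated, so it cancels in the within-subject difference.

```r
# Sketch: a time-invariant subject effect makes pre and post outcomes
# correlated, so it cancels when taking the within-subject difference.
# All numbers are made up for illustration.
set.seed(1)
n <- 1000
subject_effect <- rnorm(n, sd = 2)               # time-invariant characteristic
pre  <- subject_effect + rnorm(n, sd = 1)
post <- subject_effect + 0.5 + rnorm(n, sd = 1)  # 0.5 = treatment difference

var(post)         # inflated by the between-subject variation
var(post - pre)   # subject effect cancels: smaller variance, more power
```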

<p>The second question, <em>would you suggest a model that incorporates all
measurements directly?</em>, is unclear, namely whether the suggestion is to
include <em>all measurements</em> in a model. Assuming they are valid, I would
not exclude measurements from an analysis, whether I am specifying a
statistical model or not.</p>

<p>I chose to interpret it as whether it is worth it to specify a model. A model
lets us control for additional measured covariates such as stress level, so it
can be advantageous. However, including additional covariates is touchy,
especially with a small sample size. The covariates have to be strongly
correlated with the outcomes and at most weakly correlated with the treatment or
they could induce bias and/or variation in the difference estimate between
sub-occipital muscle inhibition and deep breathing. That’s a judgment call that
I leave to the subject-matter expert.</p>

<h2 id="whats-a-crossover-design">What’s a crossover design?</h2>
<p>The crossover design is a clever way to control for the time-invariant
characteristics of subjects while removing possible bias from the time spent
under observation. For example, you can imagine such bias appearing as
subjects get more comfortable with the experimental setting
between the initial and the post-treatment outcome measurements.</p>

<p>To control for bias due to such uncontrolled time-dependent effects, a crossover
design splits the subjects into treatment branches. Each branch receives the
treatments or lack thereof (for ex. placebo) in a different sequence. In the
question, the inquirer is tackling a case with two observations, <em>pre and post</em>,
and two treatments, sub-occipital muscle inhibition and deep breathing. Subjects
in the first branch then get one treatment, say muscle inhibition, at the <em>pre</em>
observation period, while subjects in the second treatment branch receive deep
breathing in the <em>pre</em> period. In the second period of observation, <em>post</em>, each
branch gets the alternative treatment.</p>

<p>This split into sequences or branches of treatment sounds like a lot of trouble,
but it allows control for time-dependent effects across subjects in the
analysis. Concretely, this can be done with a statistical model or by simply
taking the difference of the treatment differences between the treatment
branches. Shared time-dependent effects between treatment branches are removed
by taking this difference.</p>
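With made-up numbers, the difference-of-differences can be sketched in a few lines of R. Under the usual additive model for the $2 \times 2$ design, this contrast cancels the shared period effect and equals twice the treatment difference:

```r
# Toy sketch of the difference-of-differences; all numbers are made up.
# Each value is one subject's period-1 minus period-2 outcome.
branch1_diffs <- c(2.0, 1.5, 2.5)    # branch 1: muscle inhibition then deep breathing
branch2_diffs <- c(-1.0, -0.5, -1.5) # branch 2: deep breathing then muscle inhibition

# An additive period effect shifts both branches equally, so it cancels
# in this contrast; what remains is twice the treatment difference.
dod <- mean(branch1_diffs) - mean(branch2_diffs)
dod / 2   # estimated treatment difference
```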

<p>In theory, a crossover design is great because it isolates the difference in
treatment effects from spurious time-related things better while keeping
statistical power relatively high with its measurement of within-subject
outcomes. I recommend looking at this clear review paper for a better intuition
of the experimental design: <a href="https://epidownload.i-med.ac.at/download/public/LV%20Ulmer/moi/A%20Series%20on%20Evaluation%20of%20Scientific%20Publications%20%20-%20Deutsches%20%C3%84rzteblatt/Part%2018-On%20the%20Proper%20Use%20of%20the%20Crossover%20Design%20in%20Clinical%20Trials.pdf">On the proper use of the crossover design in
clinical
trials</a>.</p>

<p>In practice, you also have to account for carry-over: the effect of one
treatment persisting into the period of the next one given to a subject. This
is why crossover trials typically include a wash-out period between treatments,
long enough for the first treatment’s effect to dissipate.</p>

<h2 id="multiple-outcomes-of-interest">Multiple outcomes of interest</h2>
<p>There are multiple outcomes of interest: RMSSD, HF and fellow user <em>jginestet</em>
points to a paper (<a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC5624990/">An overview of heart rate variability
metrics</a>) listing 26 measures
relevant to the analysis of heart rate variability (HRV).</p>

<p>Multiple outcomes are a hornets’ nest for analysts. Outcomes can be combined in all
sorts of ways during an analysis, with more or less unsavory results. I chose to
ignore this problem and focus on the analysis of a single outcome. User
<em>jginestet</em> tackles this issue directly and gives relevant recommendations with
regard to the multiple comparison problems, multivariate statistical modeling
and small sample analysis.</p>

<h2 id="tackling-the-first-question">Tackling the first question</h2>
<p><strong>Whether to use a method that compares pre/post differences?</strong></p>

<p>The classical two-step approach to crossover design analysis with 2 repeated
measures per participant starts by computing the pre-post differences. This
gives a within-subject effect estimate but doesn’t control for period effects
(e.g., getting used to the experiment). That’s where the second step of the approach comes in.</p>

<p>After the first step, take the means of these differences per treatment branch (two branches in this design). In the second step, compute the difference between these two means (e.g., muscle inhibition → deep breathing average minus deep breathing → muscle inhibition average). This removes any additive period effect.</p>

<p>In practice, I’d handle the first step manually and use software to do an independent samples t-test at the second step. Here’s an example in R:</p>

<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First step: difference within subjects</span><span class="w">
</span><span class="n">crossover_patient_split</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">split</span><span class="p">(</span><span class="n">crossover_data</span><span class="p">,</span><span class="w"> </span><span class="n">crossover_data</span><span class="o">$</span><span class="n">PatientID</span><span class="p">)</span><span class="w">
</span><span class="n">patient_diff_df</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="s2">"rbind"</span><span class="p">,</span><span class="w">
  </span><span class="n">lapply</span><span class="p">(</span><span class="n">crossover_patient_split</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="o">=</span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="n">data.frame</span><span class="p">(</span><span class="w">
      </span><span class="n">period_diff</span><span class="o">=</span><span class="p">(</span><span class="n">x</span><span class="o">$</span><span class="n">X</span><span class="p">[</span><span class="n">x</span><span class="o">$</span><span class="n">Period</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x</span><span class="o">$</span><span class="n">X</span><span class="p">[</span><span class="n">x</span><span class="o">$</span><span class="n">Period</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2</span><span class="p">]),</span><span class="w">  </span><span class="c1"># period-1 outcome minus period-2 outcome</span><span class="w">
      </span><span class="n">PatientID</span><span class="o">=</span><span class="n">x</span><span class="o">$</span><span class="n">PatientID</span><span class="p">[</span><span class="m">1</span><span class="p">],</span><span class="w">
      </span><span class="n">Sequence</span><span class="o">=</span><span class="n">x</span><span class="o">$</span><span class="n">Sequence</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w">  </span><span class="c1"># Seq. 1: A→B, Seq. 2: B→A</span><span class="w">
    </span><span class="p">)</span><span class="w">
  </span><span class="p">})</span><span class="w">
</span><span class="p">)</span><span class="w">

</span><span class="c1"># Second step: t-test on the difference between sequences</span><span class="w">
</span><span class="n">t.test</span><span class="p">(</span><span class="w">
  </span><span class="n">period_diff</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Sequence</span><span class="p">,</span><span class="w">
  </span><span class="n">data</span><span class="o">=</span><span class="n">patient_diff_df</span><span class="p">,</span><span class="w">
  </span><span class="n">var.equal</span><span class="o">=</span><span class="kc">TRUE</span><span class="w">
</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>In <a href="https://epidownload.i-med.ac.at/download/public/LV%20Ulmer/moi/A%20Series%20on%20Evaluation%20of%20Scientific%20Publications%20%20-%20Deutsches%20%C3%84rzteblatt/Part%2018-On%20the%20Proper%20Use%20of%20the%20Crossover%20Design%20in%20Clinical%20Trials.pdf">On the proper use of the crossover design in clinical
trials</a>,
they recommend a Wilcoxon rank-sum test instead of a t-test if non-normality is
suspected in the within-subject differences. With small samples and continuous
outcomes, non-normality often arises due to outliers.</p>
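A sketch of that nonparametric second step, using a data frame shaped like the <code class="language-plaintext highlighter-rouge">patient_diff_df</code> built in the two-step example (the values here are made up):

```r
# Nonparametric second step: Wilcoxon rank-sum test on the per-subject
# period differences, split by treatment sequence. Values are made up;
# the data frame mirrors patient_diff_df from the two-step example.
patient_diff_df <- data.frame(
  period_diff = c(2.1, 1.8, 3.0, 2.5, -0.4, -1.1, 0.2, -0.7),
  Sequence    = rep(c(1, 2), each = 4)
)

wilcox.test(period_diff ~ Sequence, data = patient_diff_df)
```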

<h2 id="tackling-the-second-question">Tackling the second question</h2>
<p><strong>Would you suggest a model that incorporates all measurements directly?</strong></p>

<p>In the $2 \times 2$ crossover design, the main advantage of a model is its ability to
include time-varying covariates like respiratory rate. I also find this approach
more straightforward: you control for subject and period effects while
directly estimating the treatment difference.</p>

<p>Here’s an R linear regression example producing the same t-statistic as the two-step approach:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fit1</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">X</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Treatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">PatientID</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Period</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">=</span><span class="n">crossover_data</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">fit1</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">Treatment</code> variable could represent muscle inhibition or deep breathing,
depending on the preferred interpretation. The model explicitly controls for
subject and period effects. The treatment branch mentioned above isn’t included
as a regressor, but it is what allows the period effect to be identified.</p>
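Since no dataset is attached to the question, here is a self-contained sketch that simulates a $2 \times 2$ crossover dataset (the data-generating model and all numbers are made up) and fits the regression above to recover the treatment difference:

```r
# Simulate a made-up 2x2 crossover dataset and fit the regression above.
set.seed(2)
n <- 16
crossover_data <- expand.grid(PatientID = 1:n, Period = 1:2)
crossover_data$Sequence <- ifelse(crossover_data$PatientID <= n / 2, 1, 2)
# Sequence 1 gets treatment A in period 1; sequence 2 gets it in period 2
crossover_data$Treatment <- ifelse(
  (crossover_data$Sequence == 1) == (crossover_data$Period == 1), "A", "B"
)
subject_effect <- rnorm(n, sd = 2)
crossover_data$X <- subject_effect[crossover_data$PatientID] +
  0.4 * crossover_data$Period +                 # additive period effect
  1.5 * (crossover_data$Treatment == "A") +     # treatment difference = 1.5
  rnorm(nrow(crossover_data), sd = 1)

fit1 <- lm(X ~ Treatment + factor(PatientID) + Period, data = crossover_data)
coef(fit1)["TreatmentB"]   # should land near -1.5
```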

<p>A peek at the data structure:</p>

<p><img src="/blog/assets/crossover_data_example.png" alt="Example crossover data structure with PatientID, Period, Sequence, Treatment, and outcome columns" /></p>]]></content><author><name>Marc-André Chénier</name></author><summary type="html"><![CDATA[Statistical approach to a 2 by 2 crossover design Statistical approach to a 2 by 2 crossover design Question Two questions in one What’s a crossover design? Multiple outcomes of interest Tackling the first question Tackling the second question]]></summary></entry><entry><title type="html">Launching this blog</title><link href="https://marcandre259.github.io/blog/2024/09/29/launch.html" rel="alternate" type="text/html" title="Launching this blog" /><published>2024-09-29T00:00:00+00:00</published><updated>2024-09-29T00:00:00+00:00</updated><id>https://marcandre259.github.io/blog/2024/09/29/launch</id><content type="html" xml:base="https://marcandre259.github.io/blog/2024/09/29/launch.html"><![CDATA[<p>This post will probably be removed once the blog gets some steam. Until then, it will be a reminder of its humble beginning.</p>]]></content><author><name>Marc-André Chénier</name></author><summary type="html"><![CDATA[This post will probably be removed once the blog gets some steam. Until then, it will be a reminder of its humble beginning.]]></summary></entry></feed>