Currently, I am using random padding data injected directly into the database to allow the regression calculations to function correctly when a new organisation starts.
Problem: When using random padding via an array (instead of database injection), the regression output changes on each page refresh because the random values are recalculated every time.
Current Solution: Data is injected directly into the database to maintain consistency across page loads.
Proposed Improvement: There must be a way to fix the random array data in memory or in a file so that it does not regenerate on each refresh.
Possible Methods: Store the padding array in a file, cache, or session so it survives page loads, or seed PHP's random number generator (mt_srand()) at the start of each session to create a predictable sequence.
Additional Consideration: Even when using fixed or seeded padding arrays, care must be taken to ensure the transition from padding to real data is smooth and does not disrupt regression outputs. The padding data should be back-filled from the first available real data point where possible, to maintain stability and avoid step changes in the regression when real data begins to replace the padding.
Action: Explore and test these options to establish a non-database method for consistent padding data that remains fixed across page loads.
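One way the seeded-sequence option could look in practice (a sketch, not the project's existing code): seed PHP's Mersenne Twister with a per-organisation value so the "random" padding comes out identical on every page load, with no database injection needed. The `$orgId` parameter, the 4.9–5.1 range, and the length of 120 are illustrative assumptions.

```php
<?php
// Sketch: deterministic padding without database injection.
// Seeding mt_srand() with a per-organisation value makes mt_rand()
// return the same sequence on every request, so the regression
// inputs no longer change between page refreshes.
// $orgId, the 4.9-5.1 range, and the length 120 are illustrative.
function generatePadding(int $orgId, int $length = 120): array
{
    mt_srand($orgId); // fixed seed => fixed sequence
    $padding = [];
    for ($i = 0; $i < $length; $i++) {
        // Uniform value in [4.9, 5.1], matching the padding range
        // mentioned elsewhere in these notes.
        $padding[] = 4.9 + (mt_rand(0, 1000) / 1000) * 0.2;
    }
    return $padding;
}

// Same seed => same array on every call / page load.
$a = generatePadding(42);
$b = generatePadding(42);
var_dump($a === $b); // bool(true)
```

Because the seed is derived from the organisation rather than the request, this behaves like the database-injected padding but lives entirely in code.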
There is a difficulty when adding new questions and KPIs later because the regression input arrays become out of step, leading to nulls and causing errors in the regressions.
Potential Solution: One approach might be to add random priming data for the newly introduced questions and KPIs, aligning them with the existing time indexes of the current data.
Challenge: It is not yet clear how to correctly generate and align this priming data. Further investigation is required to determine the most robust method.
Concern: It is uncertain what impact this approach would have on existing regressions and whether it would cause significant changes to the historical outputs for parameters that already exist.
Additional Consideration: Any padding data (fixed or random) should be back-filled from the most recent 'real' data points to avoid introducing calculation fluctuations when real data starts to substitute the padding. Care must be taken to ensure that the transition from padding to real data is smooth and does not disrupt the regression outputs.
Action: Investigate whether priming new questions and KPIs with time-aligned data would preserve existing regression integrity or cause system-wide recalculations.
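One possible shape for the priming step (hypothetical function and variable names, assuming each measure is stored as a time-index-to-value map): generate a deterministic value for every time index the existing measures already cover, so the new measure's array is never out of step.

```php
<?php
// Sketch: prime a newly added question/KPI so its array covers the
// same time indexes as the existing measures, preventing nulls from
// entering the regression. Names here are hypothetical.
function primeNewMeasure(array $existingTimeIndexes, int $seed,
                         float $min = 4.9, float $max = 5.1): array
{
    // Deterministic: the same seed yields the same primed values on
    // every request, consistent with the seeded-padding approach.
    mt_srand($seed);
    $primed = [];
    foreach ($existingTimeIndexes as $t) {
        $primed[$t] = $min + (mt_rand(0, 1000) / 1000) * ($max - $min);
    }
    return $primed;
}

// Existing data runs over weeks 1-5; the new KPI gets a value for each,
// so its column aligns with the current time indexes.
$timeIndexes = [1, 2, 3, 4, 5];
$newKpi = primeNewMeasure($timeIndexes, 7);
// count($newKpi) === count($timeIndexes): no gaps, no nulls.
```

Whether this preserves existing regression outputs still needs the testing described above; the sketch only addresses the alignment mechanics.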
Linked to Note 2, if a measure is suspended, the only way for the regression to continue functioning correctly is to remove that measure from the regression input arrays.
Problem: If a measure is left in the regression dataset without current values, it will introduce nulls and break the calculation process.
Solution: Develop a mechanism to dynamically exclude suspended measures from the regression input arrays to maintain the integrity of the regression calculations.
Action: Explore the best method to detect and remove suspended measures in real time to prevent regression errors while preserving historical calculation stability.
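A minimal sketch of the exclusion mechanism (hypothetical helper, assuming the regression takes a `$names` array of measure labels and `$samples` rows of IV values as in the weighted-regression example later in these notes): drop the suspended columns from both arrays before the regression runs.

```php
<?php
// Sketch: dynamically exclude suspended measures from the regression
// inputs so no null-bearing columns reach the calculation.
// Function name and array shapes are hypothetical.
function excludeSuspendedMeasures(array $names, array $samples,
                                  array $suspended): array
{
    // Indexes of measures that are NOT suspended
    $keepIdx = array_keys(array_filter(
        $names,
        fn($n) => !in_array($n, $suspended, true)
    ));
    $keep = array_flip($keepIdx);

    // Filter the label list and every sample row to the kept columns
    $filteredNames = array_values(array_intersect_key($names, $keep));
    $filteredSamples = array_map(
        fn($row) => array_values(array_intersect_key($row, $keep)),
        $samples
    );
    return [$filteredNames, $filteredSamples];
}

[$names2, $samples2] = excludeSuspendedMeasures(
    ['Autonomy', 'Recognition', 'Workload'],
    [[3.5, 2.1, 4.0], [4.0, 1.9, 3.8]],
    ['Recognition'] // suspended measure
);
// $names2 => ['Autonomy', 'Workload']; each sample row now has 2 columns.
```

Historical stability would still need handling separately (e.g. snapshotting past coefficients before a measure is removed); this only covers the real-time exclusion.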
If a regression is initially based on padding data (e.g. a fixed array with values ranging from 4.9 to 5.1), the question arises whether the padding should remain in the calculation as real data arrives.
Considerations if Padding Remains: early outputs stay stable, but the narrow padding band (4.9 to 5.1) continues to bias the regression toward a flat trend long after real data has arrived.
Considerations if Padding is Removed: the regression reflects only real data, but removing the padding in a single step can cause an abrupt jump in the outputs, exactly the kind of step change the back-filling approach is meant to avoid.
Best Practice: Begin with padding fully included, then progressively reduce its influence as real data accumulates. For example, weighting the real data more heavily each week until the padding is phased out.
Example Phase-Out Logic:
```php
$realDataCount = count($realData);
$totalTarget   = 120; // padding length

if ($realDataCount < $totalTarget) {
    // Blend each real point with its padding counterpart: real data
    // gains weight as it accumulates, padding fades out.
    $weightReal    = $realDataCount / $totalTarget;
    $weightPadding = 1 - $weightReal;
    $combinedData  = array_map(
        function ($real, $padding) use ($weightReal, $weightPadding) {
            return ($real * $weightReal) + ($padding * $weightPadding);
        },
        $realData,
        $paddingArray
    );
} else {
    $combinedData = $realData;
}
```
Action: Develop a phasing mechanism to gradually transition from padding to real data without destabilising the regression outputs.
| State Description | Sentiment | Correlation (r) | P-value Pattern | Net Beta Sum | IV Avg | Insight Score (current) |
|---|---|---|---|---|---|---|
| Healthy & Stable | High (≥ 70%) | High or undefined (r ~ 1.0) | Mostly > 0.05 | ~ 0 (flat) | High (≥ 70%) | Low (misleading) |
| Changing & Aligned | – | High (r ~ 0.6 – 1.0) | Many < 0.05 | Strong positive | – | High (expected) |
| Changing & Non-Aligned | – | High, inverted (r ~ -0.6 – -1.0) | Many < 0.05 | Strong negative | – | High (warning) |
| Unclear / No Pattern | – | Low or inconsistent (r ~ 0) | Mostly > 0.05 | Near zero | – | Low (expected) |
| Unhealthy & Stable | Low (≤ 30%) | High or undefined (r ~ 1.0) | Mostly > 0.05 | ~ 0 (flat) | Low (≤ 30%) | Low (misleading) |
```php
// Case-Based Health Score: Category + Description
function classifyHealthCase(
    float $r,
    float $netBetaSum,
    float $avgDV
): array {
    $correlation = $r;
    $beta        = $netBetaSum;
    $sentiment   = $avgDV;

    // Assumed thresholds:
    // - r > 0.7        => strong alignment
    // - beta > 1.5     => high insight amplitude
    // - sentiment > 70 => healthy mood
    if ($sentiment > 70 && $correlation > 0.7) {
        return [
            'category' => 'Healthy & Aligned',
            'score'    => 90,
            'note'     => 'Strong mood and strong insight – ideal situation.'
        ];
    }
    if ($sentiment > 70 && $correlation < 0.3) {
        return [
            'category' => 'Healthy but Non-Aligned',
            'score'    => 50,
            'note'     => 'Organisation is calm, but no clear drivers are detected.'
        ];
    }
    if ($sentiment < 50 && $correlation > 0.7 && $beta > 1.5) {
        return [
            'category' => 'Unhealthy but Insightful',
            'score'    => 65,
            'note'     => 'Issues are present, but clearly diagnosed.'
        ];
    }
    if ($sentiment < 50 && $correlation < 0.3) {
        return [
            'category' => 'Unhealthy & Non-Aligned',
            'score'    => 30,
            'note'     => 'Low happiness and no insight – serious concern.'
        ];
    }
    return [
        'category' => 'Mixed/Unclear',
        'score'    => 45,
        'note'     => 'Ambiguous state – requires deeper interpretation.'
    ];
}
```
```php
// Weighted Health Score (Threshold-Based)
function calculateWeightedHealth(
    float $netBetaSum,
    float $r,
    float $avgIV,
    float $avgDV
): array {
    $weights = [
        'r'    => 0.3,
        'IV'   => 0.2,
        'DV'   => 0.2,
        'beta' => 0.3
    ];

    // r scores 0 below 0.3, 100 above 0.7, linear in between
    $score_r    = ($r >= 0.7) ? 100 : (($r <= 0.3) ? 0 : ($r - 0.3) / 0.4 * 100);
    $score_IV   = min(100, max(0, $avgIV));
    $score_DV   = min(100, max(0, $avgDV));
    $score_beta = min(100, max(0, ($netBetaSum / 3) * 100));

    $organisationalHealthScore = round(
        $weights['r']    * $score_r +
        $weights['IV']   * $score_IV +
        $weights['DV']   * $score_DV +
        $weights['beta'] * $score_beta,
        2
    );

    return [
        'r'          => $r,
        'avgIV'      => $avgIV,
        'avgDV'      => $avgDV,
        'netBetaSum' => $netBetaSum,
        'score_r'    => round($score_r, 2),
        'score_IV'   => round($score_IV, 2),
        'score_DV'   => round($score_DV, 2),
        'score_beta' => round($score_beta, 2),
        'organisationalHealthScore' => $organisationalHealthScore
    ];
}
```
```php
// Sigmoid-Based Health Score (Debug Version)
function sigmoid($x) {
    return 1 / (1 + exp(-$x));
}

function calculateHealthScoreDebug(
    float $netBetaSum,
    float $r,
    float $avgIV,
    float $avgDV
): array {
    // Normalise inputs to comparable ranges
    $normBeta = max(min($netBetaSum / 3, 1), -1);
    $normR    = max(min($r, 1), -1);
    $normIV   = $avgIV / 100;
    $normDV   = $avgDV / 100;

    // Beta passes through a sigmoid so extreme sums cannot dominate
    $betaAdj = round((sigmoid($normBeta * 5) - 0.5) * 10, 2);

    $scoreR    = round($normR * 25, 2);
    $scoreIV   = round($normIV * 25, 2);
    $scoreDV   = round($normDV * 25, 2);
    $scoreBeta = $betaAdj;

    $organisationalHealthScore = round($scoreR + $scoreIV + $scoreDV + $scoreBeta, 2);

    return [
        'r'              => $r,
        'avgIV'          => $avgIV,
        'avgDV'          => $avgDV,
        'netBetaSum'     => $netBetaSum,
        'betaAdjustment' => $betaAdj,
        'score_r'        => $scoreR,
        'score_IV'       => $scoreIV,
        'score_DV'       => $scoreDV,
        'organisationalHealthScore' => $organisationalHealthScore
    ];
}
```
Possible output (for r = 0.6, avgIV = 70, avgDV = 85, netBetaSum = 2.5):

```
Array
(
    [r] => 0.6
    [avgIV] => 70
    [avgDV] => 85
    [netBetaSum] => 2.5
    [betaAdjustment] => 4.85
    [score_r] => 15
    [score_IV] => 17.5
    [score_DV] => 21.25
    [organisationalHealthScore] => 58.6
)
```
📌 Scoring Method Sheet (with β-value usage)

🔹 Case-Based (Threshold) Method
• Uses if/else rules to classify inputs (e.g. low/medium/high correlation, β sum, etc.).
• No mathematical weighting — fixed bins and logic.
• ✅ May use β-values to define thresholds (e.g. "if β sum > 1.0").
• ❌ But β-values are not proportionally scaled — binary logic only.

🔹 Weighted (Linear) Method
• Adds together scaled versions of each metric (e.g. correlation, β sum, significance %).
• ✅ Directly uses the sum of significant β-values — larger betas = higher score.
• ❗ Can overweight large betas — linear scaling assumes more always means better.
• Good for seeing amplitude impact but may lack nuance.

🔹 Sigmoid (Smooth Weighted) Method
• Same inputs as the weighted method, but passes each through a sigmoid curve before combining.
• ✅ Uses β-values, but scales their influence smoothly — avoids score spikes from outliers.
• ✅ Designed to stabilise output as predictors saturate.
• Best overall method for scoring realism and continuity.

Summary:
- ✅ Both the Weighted and Sigmoid methods rely on β-values.
- ❌ The Case-Based method may reference β-values but does not use them quantitatively.
- 🧠 The Sigmoid method gives the best balance of nuance, stability, and fairness in healthy or extreme cases.
```php
// ───────────────────────────────────────────────────────────────
// Option 1 – Average IV-to-DV Correlation
// ---------------------------------------------------------------
// Calculates Pearson's r for each IV (column) against the DV,
// takes the absolute value, then returns the average.
// Use this as a "raw insight strength" metric that's easy to
// explain and avoids self-referential model logic.
// ───────────────────────────────────────────────────────────────

/**
 * Calculate Pearson correlation between two equal-length arrays.
 *
 * @param array $x
 * @param array $y
 * @return float r (–1 … +1). Returns 0 if either series is flat.
 */
function pearsonCorrelation(array $x, array $y): float
{
    $n = count($x);
    if ($n !== count($y) || $n === 0) {
        throw new InvalidArgumentException("Arrays must be same length & non-empty.");
    }
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;
    $num = $denX = $denY = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $x[$i] - $meanX;
        $dy = $y[$i] - $meanY;
        $num  += $dx * $dy;
        $denX += $dx ** 2;
        $denY += $dy ** 2;
    }
    if ($denX == 0 || $denY == 0) return 0.0; // flat series → no correlation
    return $num / sqrt($denX * $denY);
}

/**
 * Average absolute Pearson correlation between each IV and the DV.
 *
 * @param array $ivs Array of IV arrays (each array = one predictor column)
 * @param array $dv  Array of DV (happiness) values
 * @return float Mean |r| across all IVs (0 … 1)
 */
function averageIVtoDVCorrelation(array $ivs, array $dv): float
{
    $nIVs  = count($ivs);
    $total = 0.0;
    foreach ($ivs as $iv) {
        if (count($iv) !== count($dv)) {
            throw new InvalidArgumentException("IV and DV arrays must be equal length.");
        }
        $r = pearsonCorrelation($iv, $dv);
        $total += abs($r); // use absolute r to ignore sign
    }
    return $nIVs > 0 ? $total / $nIVs : 0.0;
}

// ── Example usage ─────────────────────────────────────────────
$ivs = [
    [1, 2, 3, 4, 5],  // IV₁ (positively correlated)
    [2, 3, 4, 5, 6],  // IV₂ (positively correlated)
    [10, 9, 8, 7, 6]  // IV₃ (negatively correlated)
];
$dv = [1, 2, 3, 4, 5]; // Happiness (DV)

echo "Average |r| = " . round(averageIVtoDVCorrelation($ivs, $dv), 3);
// Prints "Average |r| = 1": each example IV is perfectly linear in
// the DV, so every |r| is 1 and the average is 1.
```
Note 6 – Weighting in Regression (based on response volume)

When daily sentiment data is used in regression analysis, some days may have many responses while others have very few. To make the regression reflect data quality, you can apply weighting:
• Each day's data point is weighted by the number of responses it represents.
• Days with more responses are considered more reliable and exert more influence on the regression line.
• This reduces distortion caused by low-response days (which may contain more noise or bias).

In PHP (or similar), most basic linear regression functions do not support weights out of the box, but it is possible to implement weighted regression using custom functions or libraries that support it.

🧠 Example: If Monday has 100 responses and Tuesday has 5, Monday's data point should count more in fitting the regression line.
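Before the full multivariate function, the weighting idea can be shown on a single predictor. Weighted least squares just multiplies each point's contribution by its response count, using weighted means in the closed-form slope. This is an illustrative sketch, not the production code:

```php
<?php
// Sketch: weighted least squares for ONE predictor, where each day's
// point is weighted by its response count.
// slope = Σ w(x - x̄w)(y - ȳw) / Σ w(x - x̄w)², with weighted means x̄w, ȳw.
function weightedSimpleRegression(array $x, array $y, array $w): array
{
    $sumW  = array_sum($w);
    $meanX = 0.0;
    $meanY = 0.0;
    foreach ($x as $i => $xi) {
        $meanX += $w[$i] * $xi;
        $meanY += $w[$i] * $y[$i];
    }
    $meanX /= $sumW; // weighted mean of x
    $meanY /= $sumW; // weighted mean of y

    $num = 0.0;
    $den = 0.0;
    foreach ($x as $i => $xi) {
        $num += $w[$i] * ($xi - $meanX) * ($y[$i] - $meanY);
        $den += $w[$i] * ($xi - $meanX) ** 2;
    }
    $slope = $num / $den;
    return ['slope' => $slope, 'intercept' => $meanY - $slope * $meanX];
}

// Monday (100 responses) pulls the fitted line far more than Tuesday (5).
$fit = weightedSimpleRegression([1, 2, 3], [7.2, 6.5, 5.8], [100, 5, 28]);
```

With all weights equal this reduces to ordinary least squares, which is the same fallback behaviour the multivariate function below uses when no weights are passed.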
The regression function below (MultiRegressionV4) accepts optional weights. Pass a $weights array with one weight per row and it performs weighted least squares regression; if no weights are passed, it defaults to ordinary least squares.
```php
// Each row in $samples corresponds to a day's IVs (independent variables)
$samples = [
    [3.5, 2.1], // Day 1
    [4.0, 1.9], // Day 2
    [2.8, 3.2], // Day 3
    // ...
];

// Each value in $targets corresponds to that day's DV (e.g. happiness)
$targets = [7.2, 6.5, 5.8];

// Number of responses per day — this is the weight for each row
$responseCounts = [12, 3, 28]; // i.e., more trust in Day 3 than Day 2

// Optional: variable names for output
$names = ['Autonomy', 'Recognition'];

// Run the weighted regression
$result = $TPHRegression->MultiRegressionV4($names, $samples, $targets, $responseCounts);
```
```php
public function MultiRegressionV4($names, $samples, $targets, $weights = null)
{
    // Convert samples to matrix and add intercept column
    $X = MatrixFactory::create($samples);
    $intercept = MatrixFactory::create(array_fill(0, count($samples), [1]));
    $X = $X->augmentLeft($intercept);

    // Create target vector
    $y = MatrixFactory::create(array_map(fn($t) => [$t], $targets));

    // Weight matrix: diagonal of per-row weights, or identity for OLS
    if ($weights) {
        $W = MatrixFactory::create(array_map(fn($w) => [$w], $weights))->diagonal();
    } else {
        $W = MatrixFactory::identity(count($samples));
    }

    // Weighted least squares: B = (X'WX)^-1 X'Wy
    $X_transpose = $X->transpose();
    $XTX = $X_transpose->multiply($W)->multiply($X);
    $XTy = $X_transpose->multiply($W)->multiply($y);
    $coefficients = $XTX->inverse()->multiply($XTy);

    // Residuals and weighted SSE
    $residuals = $y->subtract($X->multiply($coefficients));
    $sse = $residuals->transpose()->multiply($W)->multiply($residuals)->get(0, 0);
    $df  = count($y->getMatrix()) - count($coefficients->getMatrix());
    $mse = $sse / $df;

    // Variance-covariance matrix and standard errors
    $varianceCovMatrix = $XTX->inverse()->scalarMultiply($mse);
    $SEs = array_map(fn($i) => sqrt($varianceCovMatrix->get($i, $i)), range(0, count($coefficients->getMatrix()) - 1));

    // t-stats and p-values
    $tStats  = array_map(fn($i) => $coefficients->get($i, 0) / $SEs[$i], array_keys($SEs));
    $tDist   = new StudentT($df);
    $pValues = array_map(fn($t) => 2 * (1 - $tDist->cdf(abs($t))), $tStats);

    // Standardized coefficients
    $numPredictors = count($samples[0]);
    $stdDevsX = [];
    for ($i = 0; $i < $numPredictors; $i++) {
        $mean = array_sum(array_column($samples, $i)) / count($samples);
        $stdDev = sqrt(array_sum(array_map(fn($x) => ($x - $mean) ** 2, array_column($samples, $i))) / (count($samples) - 1));
        $stdDevsX[] = $stdDev;
    }
    $meanY   = array_sum($targets) / count($targets);
    $stdDevY = sqrt(array_sum(array_map(fn($x) => ($x - $meanY) ** 2, $targets)) / (count($targets) - 1));

    $standardizedCoefficients = [];
    foreach ($coefficients->getMatrix() as $index => $value) {
        $standardizedCoefficients[] = $index === 0 ? $value[0] : $value[0] * ($stdDevsX[$index - 1] / $stdDevY);
    }

    // Output table and arrays
    $MRtable = "| Sample | Coefficients | Standardized Coefficients | SE | t | p |
    |---|---|---|---|---|---|
    | {$names[$key]} | " . round($coefficients->get($key, 0), 3) . " | " . round($standardizedCoefficients[$key], 3) . " | " . round($SEs[$key], 3) . " | " . round($tStats[$key], 3) . " | " . round($pValues[$key], 3) . " |
```