How IoTSyn Works

IoTSyn v3.2.0 is a physics-based framework for generating synthetic IoT datasets using explicit mathematical models, a portable MT19937 Mersenne Twister PRNG, and reproducible configuration-driven simulation across six application domains — with a Kolmogorov-Smirnov validation layer emitted as a JSON sidecar on every generated dataset.

Framework v3.2.0 · 6 IoT Domains · KS-Validated Sidecar · PHP 7.4 – 8.3 bit-identical

1. Overview

IoTSyn generates synthetic IoT data without training generative AI models. Instead, it uses a deterministic statistical engine and domain-specific mathematical models grounded in physics, physiology, and network behavior.

The framework currently supports six domains: Smart Home, IoT Network Security, Predictive Maintenance, Medical IoT, IIoT Network Traffic, and Connected Vehicles. Each dataset is produced from explicit equations and configurable parameters rather than opaque latent representations.

Version 3.2 replaces IoTSyn's earlier reliance on PHP's native mt_rand with a portable pure-PHP Mersenne Twister MT19937, introduces closed-form / ODE-based models for several Smart Home and Medical IoT variables, adds Pareto heavy-tail distributions to the IoT-Security generator, and ships a Kolmogorov-Smirnov validation report (*.validation.json) alongside every CSV.

Why this matters

For research, transparency matters as much as realism. IoTSyn is designed so that every distribution, correlation, temporal dependency, and domain-specific pattern can be inspected, justified, and reproduced from the stored configuration and seed — and in v3.2.0, automatically audited by an independent statistical layer that emits a machine-readable validation report.

2. Design Principles

Transparent
Every model is expressed with explicit equations, documented parameters, and interpretable assumptions.
Deterministic
The same domain, parameters, row count, and seed reproduce the same dataset under the validated runtime environment.
Grounded
Models are derived from thermodynamics, ventilation mass balance, reliability theory, circadian physiology, and stochastic state transitions.

3. Framework Architecture

IoTSyn follows a three-layer generation design: input configuration, deterministic statistical engine, and domain-specific generators. This separation allows a shared stochastic backbone to support multiple IoT scenarios while preserving domain realism. As of v3.2.0, a fourth validation layer audits the output of the generators.

Layer | Role | Main Components
Input Configuration | Defines the scenario to generate | Domain, physical parameters, seed value, row count, sampling interval
Deterministic Statistical Engine | Provides all shared stochastic primitives | Portable MT19937 PRNG, Box-Muller Gaussian, Poisson, Exponential, LogNormal, Weibull, Gamma, Pareto, AR(1), Cholesky, Markov transitions
Domain-Specific Generators | Applies physical and behavioral models | Smart Home (RC thermal + analytic CO₂ + Magnus-Tetens), Security (Pareto heavy-tail), Predictive Maintenance, Medical IoT (Bergman Minimal Model), IIoT, Connected Vehicle
Validation Layer (new in v3.2.0) | Independently audits every generated dataset | Per-column descriptive statistics, physical-range checks, Kolmogorov-Smirnov goodness-of-fit, and a machine-readable *.validation.json sidecar

4. Statistical Engine

The statistical engine is the mathematical backbone of IoTSyn. It provides all distributions and stochastic processes used across generators, with no dependency on trained AI models.

4.1 Portable MT19937 Pseudo-Random Number Generator

IoTSyn v3.2.0 replaces PHP's native mt_rand() with a self-contained, pure-PHP re-implementation of the Mersenne Twister MT19937 algorithm of Matsumoto & Nishimura (1998). The internal state is an array of 624 unsigned 32-bit words, seeded through the Knuth recurrence s_i = (1812433253 · (s_{i−1} ⊕ (s_{i−1} ≫ 30)) + i) & 0xFFFFFFFF, with word-by-word tempering applied on extraction.

y = x ⊕ (x ≫ 11)
y = y ⊕ ((y ≪ 7) & 0x9D2C5680)
y = y ⊕ ((y ≪ 15) & 0xEFC60000)
y = y ⊕ (y ≫ 18)
U = y / 2³²

Because the sequence that PHP's native mt_rand() produces for a given seed changed between PHP 7.1 and 7.2, datasets generated by IoTSyn v3.1 could differ across hosts. The v3.2.0 implementation is bit-identical across PHP 7.4 – 8.3 and all host operating systems: the same (domain, parameters, row_count, seed) tuple produces byte-identical CSV output anywhere the code runs.
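
The seeding recurrence and tempering steps above can be sketched in a few lines. This is an illustrative Python version, not IoTSyn's PHP source; the class name MT19937 and its method names are hypothetical. The twist constants (offset 397 and matrix constant 0x9908B0DF) come from the reference MT19937 algorithm rather than from this page.

```python
class MT19937:
    """Illustrative 32-bit Mersenne Twister (hypothetical class name)."""

    def __init__(self, seed):
        # Knuth-style initialisation of the 624-word state.
        self.state = [0] * 624
        self.state[0] = seed & 0xFFFFFFFF
        for i in range(1, 624):
            prev = self.state[i - 1]
            self.state[i] = (1812433253 * (prev ^ (prev >> 30)) + i) & 0xFFFFFFFF
        self.index = 624  # force a twist before the first output

    def _twist(self):
        s = self.state
        for i in range(624):
            # Upper bit of word i combined with lower 31 bits of word i+1.
            y = (s[i] & 0x80000000) | (s[(i + 1) % 624] & 0x7FFFFFFF)
            v = s[(i + 397) % 624] ^ (y >> 1)
            if y & 1:
                v ^= 0x9908B0DF
            s[i] = v
        self.index = 0

    def next_u32(self):
        if self.index >= 624:
            self._twist()
        y = self.state[self.index]
        self.index += 1
        # Tempering, exactly as listed in the text.
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        return y & 0xFFFFFFFF

    def uniform(self):
        # U = y / 2^32, so U lies in [0, 1).
        return self.next_u32() / 4294967296.0
```

Because every operation is masked to 32 bits explicitly, the output does not depend on the host's native integer width, which is the property the portable implementation relies on.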

4.2 Gaussian Distribution (Box-Muller Transform)

Normal variates are generated using the Box-Muller transform:

Z₀ = √(−2 ln U₁) · cos(2πU₂)
Z₁ = √(−2 ln U₁) · sin(2πU₂)
X = μ + σ · Z

The spare value is cached per instance rather than globally, preventing cross-contamination between independent generators.
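
A minimal sketch of the transform with an instance-level spare, assuming Python's random.Random as a stand-in for the portable MT19937 stream. The class name GaussianSource is hypothetical.

```python
import math
import random


class GaussianSource:
    """Box-Muller sampler that keeps its spare value on the instance."""

    def __init__(self, seed):
        self._rng = random.Random(seed)  # stand-in for the MT19937 stream
        self._spare = None               # cached Z1, private to this instance

    def normal(self, mu=0.0, sigma=1.0):
        if self._spare is not None:
            z, self._spare = self._spare, None
            return mu + sigma * z
        u1 = 1.0 - self._rng.random()    # in (0, 1], so log(u1) is finite
        u2 = self._rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        self._spare = r * math.sin(2.0 * math.pi * u2)
        return mu + sigma * r * math.cos(2.0 * math.pi * u2)
```

With the spare stored on the object, drawing from one generator never consumes or overwrites a value that belongs to another, so two instances created with the same seed stay in lockstep regardless of how their calls are interleaved.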

4.3 Core Distributions

Distribution | Generation Method | Typical Use | Notes
Poisson | Knuth for λ < 30; Normal approximation otherwise | Event counts | Discrete arrivals and count processes
Exponential | X = −ln(U) / λ | Connection durations | Memoryless waiting time model
LogNormal | Y = exp(N(μ, σ)) | Packet sizes, latency-like variables | Right-skewed positive quantities
Weibull | CDF-based sampling | Degradation and failure | Useful for wear-out behavior
Gamma | Marsaglia & Tsang | Aggregate traffic, waiting times | Flexible positive-valued process model
Pareto (v3.2.0) | X = x_m · U^(−1/α) | DoS / DDoS flow sizes, heavy-tailed bursts | α = 1.3 (DoS), α = 1.2 (DDoS); matches Crovella & Bestavros (1997) and Antonakakis et al. (2017)
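
Three of the table's methods can be sketched as follows, assuming Python's random.Random as the uniform source. The function names are illustrative, and the Weibull sampler assumes that "CDF-based sampling" means inverting the CDF.

```python
import math
import random


def exponential(rng, lam):
    # Inverse transform: X = -ln(U) / lambda, with U in (0, 1].
    return -math.log(1.0 - rng.random()) / lam


def poisson(rng, lam):
    if lam < 30:
        # Knuth: multiply uniforms until the product drops below exp(-lambda).
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1
    # Normal approximation for large lambda, rounded and clipped at zero.
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))


def weibull(rng, scale, shape):
    # Inverse CDF of F(x) = 1 - exp(-(x / scale)^shape).
    return scale * (-math.log(1.0 - rng.random())) ** (1.0 / shape)
```

Using 1 − U instead of U keeps the argument of the logarithm strictly positive without changing the distribution.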

4.4 Correlations via Cholesky Decomposition

Bivariate dependence is introduced through Cholesky-based construction:

Z_x = (X − μ_x) / σ_x
Z_y = ρ · Z_x + √(1 − ρ²) · Z_ind
Y = μ_y + σ_y · Z_y

This preserves the target correlation structure in expectation while keeping the variables interpretable.
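
As a quick check of this construction, the sketch below draws correlated pairs and measures the sample correlation. The function names and parameter values are illustrative only.

```python
import math
import random


def correlated_pair(rng, mu_x, sigma_x, mu_y, sigma_y, rho):
    z_x = rng.gauss(0.0, 1.0)
    z_ind = rng.gauss(0.0, 1.0)                      # independent of z_x
    z_y = rho * z_x + math.sqrt(1.0 - rho * rho) * z_ind
    return mu_x + sigma_x * z_x, mu_y + sigma_y * z_y


def pearson(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2024)
pairs = [correlated_pair(rng, 21.0, 1.5, 45.0, 8.0, 0.7) for _ in range(20000)]
print(round(pearson(pairs), 2))  # close to the target rho of 0.7
```

The sample correlation matches ρ only in expectation, so a finite dataset shows a small deviation that shrinks as the row count grows.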

4.5 Temporal Dependencies via AR(1)

Temporal smoothness is modeled using a first-order autoregressive process:

ε(t) = φ · ε(t−1) + η(t)
η(t) ~ N(0, σ²(1 − φ²))

AR(1) noise prevents unrealistic jumps between consecutive values. Typical settings include φ = 0.85 for indoor temperature and φ = 0.9 for physiological vital signs.
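
The scaling of the innovation variance by (1 − φ²) keeps the stationary variance of ε at σ². A small sketch, with an illustrative function name and the initial value drawn from the stationary distribution (an assumption, since the text does not say how the series is started):

```python
import math
import random


def ar1_series(rng, n, phi, sigma):
    innovation_sd = sigma * math.sqrt(1.0 - phi * phi)
    eps = rng.gauss(0.0, sigma)          # start in the stationary regime
    out = []
    for _ in range(n):
        eps = phi * eps + rng.gauss(0.0, innovation_sd)
        out.append(eps)
    return out


def lag1_autocorrelation(xs):
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den


noise = ar1_series(random.Random(8), 50000, phi=0.85, sigma=0.3)
print(round(lag1_autocorrelation(noise), 2))  # close to phi
```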

4.6 Discrete State Transitions via Markov Chains

Time-dependent Markov transition matrices are used for occupancy states, attack progression, and driving states. Transition probabilities can vary by context or time of day, allowing sequences to remain temporally coherent rather than independently sampled.
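
A time-dependent chain can be sketched by selecting the transition matrix from the hour of day before each step. The three states follow the occupancy example used later in this document; every probability below is a hypothetical placeholder, not an IoTSyn default.

```python
import random

STATES = ("sleep", "away", "active")


def transition_matrix(hour):
    """Rows are the current state, columns the next state (hypothetical values)."""
    if hour >= 22 or hour < 6:            # night
        return [[0.95, 0.00, 0.05],
                [0.30, 0.60, 0.10],
                [0.40, 0.05, 0.55]]
    if 8 <= hour < 17:                    # working hours
        return [[0.50, 0.10, 0.40],
                [0.00, 0.90, 0.10],
                [0.02, 0.38, 0.60]]
    return [[0.60, 0.05, 0.35],           # morning and evening
            [0.05, 0.55, 0.40],
            [0.10, 0.10, 0.80]]


def sample_next(rng, row):
    u, acc = rng.random(), 0.0
    for j, p in enumerate(row):
        acc += p
        if u < acc:
            return j
    return len(row) - 1                   # guard against rounding in the sum


def simulate_occupancy(rng, hours, start_hour=0, start_state=0):
    state, path = start_state, []
    for step in range(hours):
        hour = (start_hour + step) % 24
        state = sample_next(rng, transition_matrix(hour)[state])
        path.append(STATES[state])
    return path
```

Because each state is drawn conditionally on the previous one, long runs of the same state emerge naturally, which independent per-row sampling would not produce.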

5. Domain Generators

Each IoTSyn generator combines the shared statistical engine with a domain-specific model family. The goal is not generic randomness, but structurally meaningful synthetic behavior.

Domain | Key Models | Primary Uses
Smart Home | RC thermal relaxation (ISO 13790), analytic CO₂ mass-balance ODE, Magnus-Tetens psychrometric humidity, Markov occupancy | HVAC optimization, occupancy detection, indoor air quality
IoT Security | LogNormal packet sizes, Pareto heavy-tail DoS/DDoS flows (α = 1.2 – 1.3), attack Markov phases, burst behavior | Intrusion detection, attack classification, DDoS research
Predictive Maintenance | Weibull degradation, ISO 10816 vibration, thermal loading | RUL prediction, fault modeling
Medical IoT | Bergman Minimal Model (3-state ODE, RK4), Dalla Man gamma-shaped meal absorption, circadian vitals, NEWS2-inspired status | Diabetes research, closed-loop control, patient monitoring
IIoT Network | Protocol-aware traffic, OT attack signatures, role-based devices | SCADA and OT security research
Connected Vehicle | Driving state machine, dead reckoning, 5-gear RPM, fuel model | Fleet analytics, driving behavior studies

5.1 Smart Home

Indoor environmental sensing is modeled using closed-form solutions to the underlying physics rather than ad-hoc harmonic approximations. Indoor temperature follows a first-order RC thermal relaxation model (ISO 13790) that incorporates the outdoor driving temperature, a setpoint, and a time constant τ = RC:

T(t + Δt) = T_set + (T(t) − T_set) · exp(−Δt / τ) + γ · (T_out(t) − T_set) · (1 − exp(−Δt / τ)) + ε_AR(1)
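
The update rule can be sketched as a single function. The parameter values in the usage lines are illustrative, and the AR(1) term is passed in as a plain number so the deterministic part can be inspected on its own.

```python
import math


def rc_thermal_step(t_now, t_set, t_out, dt, tau, gamma, noise=0.0):
    """One step of the RC relaxation update; dt and tau share the same time unit."""
    decay = math.exp(-dt / tau)
    return (t_set
            + (t_now - t_set) * decay
            + gamma * (t_out - t_set) * (1.0 - decay)
            + noise)


# Without noise the temperature relaxes towards t_set + gamma * (t_out - t_set).
temp = 15.0
for _ in range(2000):
    temp = rc_thermal_step(temp, t_set=21.0, t_out=5.0, dt=60.0, tau=3600.0, gamma=0.1)
print(round(temp, 2))  # 19.4
```

The fixed point follows from setting T(t + Δt) = T(t) in the update rule: the (1 − exp(−Δt / τ)) factors cancel, leaving T = T_set + γ · (T_out − T_set).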

Carbon dioxide concentration is solved analytically from the indoor mass-balance ODE instead of being approximated with forward Euler. For constant occupancy n and ventilation rate Q on a step [t, t + Δt]:

dC/dt = (n · G − Q · (C − C_out)) / V
C_∞ = C_out + (n · G) / Q
C(t + Δt) = C_∞ + (C(t) − C_∞) · exp(−Q · Δt / V)
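
A sketch of the analytic step, assuming C in ppm, Q in m³/h, V in m³, Δt in hours, and G expressed in ppm·m³/h per person so that the units balance. The default values are illustrative assumptions, not IoTSyn's configuration.

```python
import math


def co2_step(c_now, n, q, volume, dt_h, c_out=420.0, g_ppm=18000.0):
    """Exact solution of the mass-balance ODE over one step with constant n and Q."""
    c_inf = c_out + n * g_ppm / q              # steady-state concentration
    return c_inf + (c_now - c_inf) * math.exp(-q * dt_h / volume)


# Two occupants, 60 m^3/h ventilation, 50 m^3 room, 15-minute steps.
c = 420.0
for _ in range(500):
    c = co2_step(c, n=2, q=60.0, volume=50.0, dt_h=0.25)
print(round(c, 1))  # 1020.0
```

Because the step is the exact solution for constant inputs, taking one step of Δt gives the same result as two steps of Δt / 2, and the update stays stable even when Q · Δt / V is large, which a forward-Euler step would not guarantee.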

Relative humidity is derived from the Magnus-Tetens psychrometric relationship rather than imposed as a fixed negative correlation with temperature:

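e_sat(T) = 611.2 · exp(17.62 · T / (243.12 + T))
RH(t) = 100 · e(t) / e_sat(T(t))

Here T is in °C, the saturation vapour pressure e_sat and the actual vapour pressure e are in Pa, and RH is in percent.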
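
The two relations translate directly into code. In this sketch the function names are illustrative, temperatures are in °C, and pressures are in Pa.

```python
import math


def saturation_vapour_pressure(temp_c):
    # Magnus form with the coefficients quoted in the text.
    return 611.2 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def relative_humidity(vapour_pressure_pa, temp_c):
    return 100.0 * vapour_pressure_pa / saturation_vapour_pressure(temp_c)


# Same moisture content, warmer air: relative humidity falls.
e = 0.5 * saturation_vapour_pressure(20.0)   # air at 50 % RH and 20 degrees C
print(relative_humidity(e, 20.0) > relative_humidity(e, 24.0))  # True
```

This is why temperature and humidity end up negatively related in the output without any correlation coefficient being imposed: at fixed vapour pressure, a higher temperature raises e_sat and lowers RH.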

Occupancy is simulated through time-of-day-dependent Markov transitions (sleep / away / active), and feeds directly into the CO₂ ODE through the source term n · G.

5.2 IoT Network Security

Normal traffic uses LogNormal packet sizes and Exponential durations. Malicious traffic is generated through phased attack progression — recon → escalation → peak → cooldown — with distinct signatures per attack class (DoS, DDoS, Scan, Brute Force, Botnet).

Version 3.2 replaces truncated-Gaussian flow sizes for volumetric attacks with a Pareto heavy-tail distribution, consistent with measured internet traffic (Crovella & Bestavros, 1997) and with DDoS traces such as Mirai and its successors (Antonakakis et al., 2017):

X = x_m · U^(−1/α)
U ~ Uniform(0, 1]
α = 1.3 (DoS)
α = 1.2 (DDoS)
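
A short sketch of this sampler, using Python's random.Random as the uniform source. The minimum flow size x_m = 1000 is a hypothetical value chosen for the example.

```python
import random

ALPHA_DOS = 1.3
ALPHA_DDOS = 1.2


def pareto(rng, x_m, alpha):
    u = 1.0 - rng.random()            # maps [0, 1) to (0, 1], avoiding u = 0
    return x_m * u ** (-1.0 / alpha)


rng = random.Random(5)
flows = [pareto(rng, 1000.0, ALPHA_DDOS) for _ in range(20000)]
share_above_double = sum(f > 2000.0 for f in flows) / len(flows)
print(round(share_above_double, 2))   # near 2 ** -1.2, about 0.44
```

The survival function of a Pareto variable is P(X > x) = (x_m / x)^α, so the share of flows above twice the minimum should be close to 2^(−α). With α below 2 the variance is infinite, which is what produces the occasional very large flows.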

5.3 Predictive Maintenance

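Equipment health is driven by a Weibull degradation index D(t) ∈ [0, 1), with scale (characteristic life) L and shape β:

D(t) = 1 − exp(−(t / L)^β)

The degradation index then modulates each sensor channel:

Temperature: T = T_amb + T_load · (RPM / RPM_rated)² + 30 · D² + ε_AR(1)
Vibration: V = V_base · (1 + 8 · D²) + imbalance
Current: I = I_rated · (Load / 100) · (1 + 0.3 · D)
Pressure: P = P_design · (1 − 0.3 · D)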
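
The degradation curve and the four channel equations can be sketched as below. All default parameter values (ambient temperature, rated RPM, and so on) are hypothetical, and the AR(1) noise and imbalance terms are passed in as plain numbers.

```python
import math


def degradation(t, life, beta):
    """Weibull degradation index D(t) = 1 - exp(-(t / L)^beta)."""
    return 1.0 - math.exp(-((t / life) ** beta))


def sensor_channels(d, rpm, load_pct, t_amb=25.0, t_load=40.0, rpm_rated=1500.0,
                    v_base=1.0, i_rated=10.0, p_design=6.0,
                    noise=0.0, imbalance=0.0):
    return {
        "temperature": t_amb + t_load * (rpm / rpm_rated) ** 2 + 30.0 * d * d + noise,
        "vibration": v_base * (1.0 + 8.0 * d * d) + imbalance,
        "current": i_rated * (load_pct / 100.0) * (1.0 + 0.3 * d),
        "pressure": p_design * (1.0 - 0.3 * d),
    }


d_mid = degradation(800.0, life=1000.0, beta=2.5)
print(sensor_channels(d_mid, rpm=1500.0, load_pct=80.0))
```

Temperature and vibration grow with D², so they change slowly early in life and accelerate towards failure, while current and pressure respond linearly.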

5.4 Medical IoT

Cardiovascular signals retain the circadian / activity / AR(1) structure of v3.1:

HR(t) = HR_base(age) + HR_circadian(h) + HR_activity + ε_AR(1)

Glucose metabolism, however, is upgraded in v3.2.0 from additive Gaussian "meal spikes" to the Bergman Minimal Model (Bergman et al., 1979) — a three-state ODE system integrated with a fixed-step fourth-order Runge-Kutta solver at a 1-minute sub-step:

dG/dt = −(p₁ + X) · G + p₁ · G_b + R_a(t) / V_G
dX/dt = −p₂ · X + p₃ · (I − I_b)
dI/dt = −n · (I − I_b) + U_I(t)

where G is plasma glucose (mg/dL), X is remote-compartment insulin action (1/min), and I is plasma insulin (μU/mL). Meal absorption R_a(t) follows the Dalla Man (2007) gamma-shaped absorption curve rather than a rectangular pulse:

R_a(t) = (D · f / τ_meal²) · (t − t_meal) · exp(−(t − t_meal) / τ_meal)
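
The three ODEs and the meal-absorption curve can be integrated with a fixed-step RK4 loop at a 1-minute step, as sketched below. Every numeric parameter is a hypothetical placeholder chosen to give plausible dynamics, and the insulin input U_I is modeled here as a simple proportional response to glucose above baseline, which is an assumption of this sketch rather than IoTSyn's formulation.

```python
import math

# Hypothetical parameters: rates per minute, G in mg/dL, I in uU/mL, VG in dL.
DEFAULT_PARAMS = {"p1": 0.03, "p2": 0.02, "p3": 1e-5, "n": 0.2,
                  "Gb": 90.0, "Ib": 10.0, "VG": 120.0,
                  "f": 0.8, "tau": 40.0, "k_sec": 0.01}


def meal_ra(t, meal, p):
    """Gamma-shaped appearance rate (mg/min); meal = (t_meal, dose_mg) or None."""
    if meal is None or t <= meal[0]:
        return 0.0
    dt = t - meal[0]
    return meal[1] * p["f"] / p["tau"] ** 2 * dt * math.exp(-dt / p["tau"])


def derivs(t, y, p, meal):
    g, x, i = y
    u_i = p["k_sec"] * max(g - p["Gb"], 0.0)   # assumed insulin response
    dg = -(p["p1"] + x) * g + p["p1"] * p["Gb"] + meal_ra(t, meal, p) / p["VG"]
    dx = -p["p2"] * x + p["p3"] * (i - p["Ib"])
    di = -p["n"] * (i - p["Ib"]) + u_i
    return (dg, dx, di)


def rk4_step(t, y, h, p, meal):
    def shifted(k, scale):
        return tuple(a + scale * b for a, b in zip(y, k))
    k1 = derivs(t, y, p, meal)
    k2 = derivs(t + h / 2, shifted(k1, h / 2), p, meal)
    k3 = derivs(t + h / 2, shifted(k2, h / 2), p, meal)
    k4 = derivs(t + h, shifted(k3, h), p, meal)
    return tuple(a + h / 6 * (b + 2 * c + 2 * d + e)
                 for a, b, c, d, e in zip(y, k1, k2, k3, k4))


def simulate(minutes, p, meal=None):
    """Return plasma glucose at each minute, starting from the basal state."""
    y = (p["Gb"], 0.0, p["Ib"])
    trace = [y[0]]
    for step in range(minutes):
        y = rk4_step(float(step), y, 1.0, p, meal)
        trace.append(y[0])
    return trace


glucose = simulate(600, DEFAULT_PARAMS, meal=(30.0, 50000.0))  # 50 g at t = 30 min
print(round(max(glucose)), round(glucose[-1]))
```

With no meal the state stays at the basal point (G_b, 0, I_b), since all three derivatives vanish there. With a meal, glucose rises smoothly, peaks, and relaxes back towards G_b instead of jumping as it would under an additive spike.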

The remaining vitals include age-dependent baselines, blood-pressure coupling to heart rate, possible sleep-apnea events in elderly profiles, and a NEWS2-inspired composite health-status layer.

5.5 IIoT Network Traffic

Industrial network traffic is modeled for protocols such as Modbus TCP, OPC UA, DNP3, BACnet, and EtherNet/IP. OT attack classes include Man-in-the-Middle, Replay, False Data Injection, DoS, and Reconnaissance.

5.6 Connected Vehicle

Vehicle telemetry is generated through a driving state machine with traffic-density modulation. Position is estimated through dead reckoning with GPS noise, engine RPM follows a 5-gear transmission model, and fuel consumption depends on speed and acceleration.

6. Validation Layer (new in v3.2.0)

IoTSyn v3.2.0 ships a built-in validation layer that runs on every generated dataset and emits a machine-readable *.validation.json sidecar next to the CSV. The same report is offered as a secondary download on the dataset page, so reviewers can inspect the numerical evidence without re-running the generator.

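6.1 What the Validation Report contains

Each report contains per-column descriptive statistics, physical-range checks, and Kolmogorov-Smirnov goodness-of-fit results.
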
6.2 Kolmogorov-Smirnov one-sample test

For a sample {x₁, …, x_n} with empirical CDF F_n and a hypothesized CDF F₀ (e.g. LogNormal for packet sizes, Weibull for degradation, Pareto for DDoS flows), the test statistic is:

D_n = sup_x | F_n(x) − F₀(x) |

The asymptotic p-value is obtained from the Kolmogorov distribution (Kolmogorov, 1933):

P(√n · D_n > z) = 2 · Σ_{k=1}^{∞} (−1)^{k−1} · exp(−2 k² z²)
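
Both formulas can be sketched in a few lines. The function names are illustrative, and the cutoff that returns 1.0 for z below 0.2 is a numerical convenience of this sketch (the series converges slowly there, and the true value differs from 1 by less than 10⁻¹²), not a documented IoTSyn behavior.

```python
import math


def ks_statistic(sample, cdf):
    """D_n = sup |F_n(x) - F_0(x)|, evaluated on both sides of each jump of F_n."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def kolmogorov_pvalue(d, n):
    z = math.sqrt(n) * d
    if z < 0.2:
        return 1.0
    total = 0.0
    for k in range(1, 101):
        term = math.exp(-2.0 * k * k * z * z)
        total += term if k % 2 == 1 else -term   # sign is (-1)^(k-1)
        if term < 1e-16:
            break
    return min(1.0, max(0.0, 2.0 * total))


# An evenly spaced sample fits Uniform(0, 1) as well as a sample of that size can.
sample = [(i - 0.5) / 100 for i in range(1, 101)]
d_good = ks_statistic(sample, lambda x: x)
d_bad = ks_statistic(sample, lambda x: x * x)    # deliberately wrong hypothesis
print(kolmogorov_pvalue(d_good, 100), kolmogorov_pvalue(d_bad, 100) < 0.001)
```

A small p-value means the sample is unlikely under the declared distribution. Because the formula is asymptotic, it is most reliable for the large row counts typical of generated datasets.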

6.3 Validation dimensions

IoTSyn's validation focuses on internal consistency — whether the generated outputs remain faithful to their governing equations, expected temporal behavior, configured relationships, and declared distributions.

Validation Dimension | What is checked
Statistical relationship preservation | Configured correlations and couplings appear in the generated data
Physical model fidelity | Equations such as Weibull degradation, RC thermal relaxation, analytic CO₂, and Bergman glucose produce expected dynamics
Domain plausibility | Generated values remain within physically or clinically reasonable ranges
State coherence | Markov-based attack phases, occupancy, and driving states evolve coherently over time
Configuration sensitivity | Outputs respond meaningfully to parameter changes rather than remaining fixed-pattern simulations
Distributional goodness-of-fit (new) | One-sample KS tests verify that each column matches its declared theoretical distribution within the asymptotic tolerance

Example internal validation outcomes

  • Smart Home CO₂–occupancy coupling showed strong positive association under the analytic mass-balance solution.
  • Predictive Maintenance produced strong degradation–vibration relationships consistent with the ISO-based formulation, and Weibull KS-fit remained within asymptotic tolerance.
  • Medical IoT preserved age-dependent heart-rate baselines, circadian nadirs around early morning hours, and Bergman Minimal Model glucose excursions inside clinically plausible 70 – 250 mg/dL envelopes.
  • Security and vehicle generators produced temporally coherent phased behaviors through Markov state transitions; DDoS flow sizes matched the Pareto α = 1.2 reference CDF within KS asymptotic tolerance.

7. Reproducibility

Every dataset includes a seed value and generation metadata. The same (domain, parameter set, row count, seed) tuple reproduces the same output — and, as of v3.2.0, the same byte-for-byte CSV across PHP versions.

Implementation: IoTSyn v3.2.0 uses a portable pure-PHP Mersenne Twister MT19937 with instance-level random state management. This prevents interference between generator instances and removes the dependency on PHP's native mt_rand(), whose output for a given seed changed between PHP 7.1 and 7.2.

Runtime note: IoTSyn v3.2.0 has been verified bit-identical on PHP 7.4, 8.0, 8.1, 8.2, and 8.3, on both Linux and Windows hosts. The validation sidecar itself is deterministic in its statistical content (numeric values may render with the host's JSON float precision, but the underlying numbers are reproducible).
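
One way to check byte-level reproducibility is to hash the generated CSV and compare digests across runs or hosts. The sketch below uses a toy generator as a stand-in for an IoTSyn domain generator; the column names, the Gaussian stand-in signal, and the function names are all hypothetical.

```python
import csv
import hashlib
import io
import random


def generate_csv(seed, rows):
    """Toy stand-in generator: returns the CSV as bytes for a given seed."""
    rng = random.Random(seed)
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")   # fixed line endings on every OS
    writer.writerow(["t", "temperature"])
    for t in range(rows):
        # Fixed-width formatting avoids host-dependent float rendering.
        writer.writerow([t, f"{rng.gauss(21.0, 0.5):.6f}"])
    return buf.getvalue().encode("utf-8")


def fingerprint(data):
    return hashlib.sha256(data).hexdigest()


same = fingerprint(generate_csv(42, 1000)) == fingerprint(generate_csv(42, 1000))
other = fingerprint(generate_csv(42, 1000)) == fingerprint(generate_csv(43, 1000))
print(same, other)  # True False
```

Pinning the line terminator and the numeric format matters as much as the PRNG: two hosts can draw identical numbers and still write different bytes if either of those is left to platform defaults.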

8. Scope and Limitations

IoTSyn is designed for transparent and reproducible synthetic data generation, not for perfect replication of every real-world environment. The models are deliberate simplifications intended for benchmarking, prototyping, testing, and educational use.

  • Each new domain requires new domain-specific modeling assumptions.
  • Physics-based realism is structured and interpretable, but not equivalent to full real-world complexity.
  • IoTSyn does not learn unknown latent patterns from real proprietary datasets.
  • The Bergman Minimal Model captures the dominant short-horizon glucose–insulin dynamics but omits second-meal effects, hepatic glucose production variability, and exercise-dependent insulin sensitivity.
  • The RC thermal model represents a single lumped zone; it does not capture multi-zone heat exchange, solar gain through fenestration, or mechanical ventilation coupling.
  • The Pareto heavy-tail approximation for volumetric attacks captures macroscopic flow-size statistics but does not reproduce protocol-level artifacts of specific botnet families.
  • Current validation is internal (KS goodness-of-fit, physical-range checks, temporal coherence); external benchmark studies against real-world datasets remain an active research direction.

9. References

  1. Box, G.E.P. & Muller, M.E. (1958). A Note on the Generation of Random Normal Deviates. Annals of Mathematical Statistics, 29(2), 610–611.
  2. Knuth, D.E. (1997). The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 3rd ed. Addison-Wesley.
  3. Marsaglia, G. & Tsang, W.W. (2000). A simple method for generating gamma variables. ACM Transactions on Mathematical Software, 26(3), 363–372.
  4. Gentle, J.E. (2009). Computational Statistics. Springer.
  5. ASHRAE (2019). ASHRAE Handbook: HVAC Applications.
  6. Persily, A.K. & de Jonge, L. (2017). Carbon dioxide generation rates for building occupants. Indoor Air, 27(5), 868–879.
  7. ISO 10816-1:1995. Mechanical vibration — Evaluation of machine vibration by measurements on non-rotating parts.
  8. Royal College of Physicians (2017). National Early Warning Score (NEWS) 2.
  9. Refinetti, R. & Menaker, M. (1992). The circadian rhythm of body temperature. Physiology & Behavior, 51(3), 613–637.
  10. Venditti, F.J. et al. (2005). Circadian variation of heart rate variability. Journal of Cardiovascular Electrophysiology, 16(1), 27–31.
  11. Matsumoto, M. & Nishimura, T. (1998). Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8(1), 3–30.
  12. Crovella, M.E. & Bestavros, A. (1997). Self-similarity in World Wide Web traffic: evidence and possible causes. IEEE/ACM Transactions on Networking, 5(6), 835–846.
  13. Antonakakis, M. et al. (2017). Understanding the Mirai botnet. 26th USENIX Security Symposium, 1093–1110.
  14. Bergman, R.N., Ider, Y.Z., Bowden, C.R. & Cobelli, C. (1979). Quantitative estimation of insulin sensitivity. American Journal of Physiology — Endocrinology and Metabolism, 236(6), E667–E677.
  15. Dalla Man, C., Rizza, R.A. & Cobelli, C. (2007). Meal simulation model of the glucose-insulin system. IEEE Transactions on Biomedical Engineering, 54(10), 1740–1749.
  16. Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4, 83–91.
  17. Tetens, O. (1930). Über einige meteorologische Begriffe. Zeitschrift für Geophysik, 6, 297–309.
  18. ISO 13790:2008. Energy performance of buildings — Calculation of energy use for space heating and cooling. International Organization for Standardization.
  19. Duhair, A. (2026). IoTSyn: a physics-based framework for reproducible synthetic IoT data generation across six domains (v3.2.0). IoTSyn Technical Report. Available at https://iotsyn.com.
