How IoTSyn Works
IoTSyn v3.2.0 is a physics-based framework for generating synthetic IoT datasets using explicit mathematical models, a portable MT19937 Mersenne Twister PRNG, and reproducible configuration-driven simulation across six application domains — with a Kolmogorov-Smirnov validation layer emitted as a JSON sidecar on every generated dataset.
1. Overview
IoTSyn generates synthetic IoT data without training generative AI models. Instead, it uses a deterministic statistical engine and domain-specific mathematical models grounded in physics, physiology, and network behavior.
The framework currently supports six domains: Smart Home, IoT Network Security, Predictive Maintenance, Medical IoT, IIoT Network Traffic, and Connected Vehicles. Each dataset is produced from explicit equations and configurable parameters rather than opaque latent representations.
Version 3.2 replaces IoTSyn's earlier reliance on PHP's native mt_rand with a portable pure-PHP Mersenne Twister MT19937, introduces closed-form / ODE-based models for several Smart Home and Medical IoT variables, adds Pareto heavy-tail distributions to the IoT-Security generator, and ships a Kolmogorov-Smirnov validation report (*.validation.json) alongside every CSV.
Why this matters
For research, transparency matters as much as realism. IoTSyn is designed so that every distribution, correlation, temporal dependency, and domain-specific pattern can be inspected, justified, and reproduced from the stored configuration and seed — and in v3.2.0, automatically audited by an independent statistical layer that emits a machine-readable validation report.
2. Design Principles
- Every model is expressed with explicit equations, documented parameters, and interpretable assumptions.
- The same domain, parameters, row count, and seed reproduce the same dataset under the validated runtime environment.
- Models are derived from thermodynamics, ventilation mass balance, reliability theory, circadian physiology, and stochastic state transitions.
3. Framework Architecture
IoTSyn follows a three-layer design: input configuration, deterministic statistical engine, and domain-specific generators. This separation allows a shared stochastic backbone to support multiple IoT scenarios while preserving domain realism.
| Layer | Role | Main Components |
|---|---|---|
| Input Configuration | Defines the scenario to generate | Domain, physical parameters, seed value, row count, sampling interval |
| Deterministic Statistical Engine | Provides all shared stochastic primitives | Portable MT19937 PRNG, Box-Muller Gaussian, Poisson, Exponential, LogNormal, Weibull, Gamma, Pareto, AR(1), Cholesky, Markov transitions |
| Domain-Specific Generators | Applies physical and behavioral models | Smart Home (RC thermal + analytic CO₂ + Magnus-Tetens), Security (Pareto heavy-tail), Predictive Maintenance, Medical IoT (Bergman Minimal Model), IIoT, Connected Vehicle |
| Validation Layer (new in v3.2.0) | Independently audits every generated dataset | Per-column descriptive statistics, physical-range checks, Kolmogorov-Smirnov goodness-of-fit, and a machine-readable *.validation.json sidecar |
4. Statistical Engine
The statistical engine is the mathematical backbone of IoTSyn. It provides all distributions and stochastic processes used across generators, with no dependency on trained AI models.
4.1 Portable MT19937 Pseudo-Random Number Generator
IoTSyn v3.2.0 replaces PHP's native mt_rand() with a self-contained, pure-PHP re-implementation of the Mersenne Twister MT19937 algorithm of Matsumoto & Nishimura (1998). The internal state is an array of 624 unsigned 32-bit words, seeded through the Knuth recurrence
s_i = (1812433253 · (s_{i−1} ⊕ (s_{i−1} ≫ 30)) + i) & 0xFFFFFFFF, with word-by-word tempering applied on extraction.
Because the seeded output of PHP's native mt_rand() changed between PHP 7.1 and 7.2 (the 7.2 modulo-bias fix means sequences for a given seed may differ on 64-bit machines), datasets generated by IoTSyn v3.1 could differ across hosts. The v3.2.0 implementation is bit-identical across PHP 7.4 – 8.3 on the tested Linux and Windows hosts: the same (domain, parameters, row_count, seed) tuple produces byte-identical CSV output wherever the code runs.
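The seeding recurrence and tempering step described above can be sketched in a few lines. The Python below is a minimal transcription of the reference MT19937 algorithm, not IoTSyn's PHP source; the class name `MT19937` and the `next_float` helper (including its division by 2³²) are assumptions made for illustration.

```python
class MT19937:
    """Minimal MT19937 sketch: 624-word state, Knuth-style seeding, tempering."""

    N, M = 624, 397

    def __init__(self, seed):
        self.mt = [0] * self.N
        self.mt[0] = seed & 0xFFFFFFFF
        for i in range(1, self.N):
            prev = self.mt[i - 1]
            # s_i = (1812433253 * (s_{i-1} XOR (s_{i-1} >> 30)) + i) mod 2^32
            self.mt[i] = (1812433253 * (prev ^ (prev >> 30)) + i) & 0xFFFFFFFF
        self.index = self.N  # force a twist before the first output

    def _twist(self):
        # Regenerate all 624 words in place.
        for i in range(self.N):
            y = (self.mt[i] & 0x80000000) | (self.mt[(i + 1) % self.N] & 0x7FFFFFFF)
            value = self.mt[(i + self.M) % self.N] ^ (y >> 1)
            if y & 1:
                value ^= 0x9908B0DF
            self.mt[i] = value
        self.index = 0

    def next_u32(self):
        """Return the next tempered 32-bit unsigned integer."""
        if self.index >= self.N:
            self._twist()
        y = self.mt[self.index]
        self.index += 1
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C5680
        y ^= (y << 15) & 0xEFC60000
        y ^= y >> 18
        return y & 0xFFFFFFFF

    def next_float(self):
        """Uniform value in [0, 1); this scaling is an illustrative assumption."""
        return self.next_u32() / 4294967296.0
```

With the reference default seed 5489, the first call to `next_u32()` returns 3499211612, the same value produced by the published reference implementation. Because the state lives on the instance, two generators with different seeds never interfere with each other.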
4.2 Gaussian Distribution (Box-Muller Transform)
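Normal variates are generated using the Box-Muller transform: for independent U₁, U₂ ~ Uniform(0, 1), Z₀ = √(−2 ln U₁) · cos(2πU₂) and Z₁ = √(−2 ln U₁) · sin(2πU₂). Each pair of uniforms therefore yields two independent standard normal values, one returned immediately and one kept as a spare.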
The spare value is cached per instance rather than globally, preventing cross-contamination between independent generators.
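The per-instance caching of the spare value can be illustrated with a short Python sketch. The class name `GaussianSource` is hypothetical, and Python's `random.Random` stands in for the portable MT19937 uniform source, so this shows the pattern rather than IoTSyn's actual code.

```python
import math
import random


class GaussianSource:
    """Box-Muller normal generator with a spare cached on the instance."""

    def __init__(self, seed):
        self._rng = random.Random(seed)  # stand-in uniform source (assumption)
        self._spare = None               # second variate from the last pair

    def normal(self, mu=0.0, sigma=1.0):
        # Reuse the cached spare if one is available.
        if self._spare is not None:
            z, self._spare = self._spare, None
            return mu + sigma * z
        u1 = 1.0 - self._rng.random()    # shift to (0, 1] so log(u1) is finite
        u2 = self._rng.random()
        radius = math.sqrt(-2.0 * math.log(u1))
        self._spare = radius * math.sin(2.0 * math.pi * u2)
        return mu + sigma * (radius * math.cos(2.0 * math.pi * u2))
```

Because `_spare` is an attribute of the instance and not a module-level variable, drawing from one generator never consumes or overwrites the spare of another, which is what keeps independent generators reproducible.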
4.3 Core Distributions
| Distribution | Generation Method | Typical Use | Notes |
|---|---|---|---|
| Poisson | Knuth for λ < 30; Normal approximation otherwise | Event counts | Discrete arrivals and count processes |
| Exponential | X = −ln(U)/λ | Connection durations | Memoryless waiting time model |
| LogNormal | Y = exp(N(μ, σ)) | Packet sizes, latency-like variables | Right-skewed positive quantities |
| Weibull | Inverse-CDF sampling | Degradation and failure | Useful for wear-out behavior |
| Gamma | Marsaglia & Tsang | Aggregate traffic, waiting times | Flexible positive-valued process model |
| Pareto (v3.2.0) | X = x_m · U^(−1/α) | DoS / DDoS flow sizes, heavy-tailed bursts | α = 1.3 (DoS), α = 1.2 (DDoS); matches Crovella & Bestavros (1997) and Antonakakis et al. (2017) |
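Three of the generation methods in the table can be sketched as follows. This is a minimal Python illustration, assuming `random.Random` as the uniform source; the function names are hypothetical, and the rounding used in the Poisson normal approximation is one reasonable choice, not necessarily the one IoTSyn makes.

```python
import math
import random


def exponential(rng, lam):
    """Inverse transform X = -ln(U) / lam, with U in (0, 1]."""
    return -math.log(1.0 - rng.random()) / lam


def weibull(rng, shape, scale):
    """Inverse-CDF sampling: X = scale * (-ln U)^(1 / shape)."""
    return scale * (-math.log(1.0 - rng.random())) ** (1.0 / shape)


def poisson(rng, lam):
    """Knuth's multiplication method for lam < 30, normal approximation otherwise."""
    if lam < 30:
        limit = math.exp(-lam)
        count, product = 0, rng.random()
        while product > limit:
            count += 1
            product *= rng.random()
        return count
    # Assumption: round the normal draw and clamp at zero.
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))
```

A call such as `poisson(random.Random(1), 4.0)` returns a non-negative integer count, while `exponential` and `weibull` return positive floats.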
4.4 Correlations via Cholesky Decomposition
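Bivariate dependence is introduced through Cholesky-based construction: for independent standard normals Z₁, Z₂ and a target correlation ρ, X₁ = Z₁ and X₂ = ρ · Z₁ + √(1 − ρ²) · Z₂.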
This preserves the target correlation structure in expectation while keeping the variables interpretable.
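As a small illustration of the bivariate Cholesky construction, the Python sketch below draws correlated standard normal pairs and estimates their sample correlation. The function names are hypothetical and `random.Random` is used as a stand-in Gaussian source.

```python
import math
import random


def correlated_pair(rng, rho):
    """Return (x1, x2), standard normal with correlation rho in expectation."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    # Second row of the Cholesky factor of [[1, rho], [rho, 1]].
    return z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For any finite sample the estimated correlation fluctuates around ρ, which is why the target is preserved in expectation and not exactly.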
4.5 Temporal Dependencies via AR(1)
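Temporal smoothness is modeled using a first-order autoregressive process: x_t = φ · x_{t−1} + ε_t, where ε_t is zero-mean Gaussian noise and |φ| < 1 controls how strongly each value depends on the previous one.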
AR(1) noise prevents unrealistic jumps between consecutive values. Typical settings include φ = 0.85 for indoor temperature and φ = 0.9 for physiological vital signs.
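A minimal Python sketch of AR(1) noise is shown below. The function name `ar1_series` is hypothetical, and scaling the innovations by √(1 − φ²) so that the series keeps a stationary standard deviation `sigma` is an assumption; the source does not state which scaling IoTSyn uses.

```python
import math
import random


def ar1_series(rng, n, phi, sigma):
    """Generate n values of x_t = phi * x_{t-1} + eps_t with stationary sd sigma."""
    # Assumption: innovation sd chosen so that Var(x_t) stays at sigma^2.
    innovation_sd = sigma * math.sqrt(1.0 - phi * phi)
    x = rng.gauss(0.0, sigma)  # start from the stationary distribution
    series = []
    for _ in range(n):
        series.append(x)
        x = phi * x + rng.gauss(0.0, innovation_sd)
    return series
```

With φ = 0.85, consecutive values share most of their deviation from the mean, so the series drifts smoothly without jumping between independent draws.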
4.6 Discrete State Transitions via Markov Chains
Time-dependent Markov transition matrices are used for occupancy states, attack progression, and driving states. Transition probabilities can vary by context or time of day, allowing sequences to remain temporally coherent rather than independently sampled.
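The idea of a transition matrix that depends on the time of day can be sketched as follows. The state names follow the Smart Home occupancy states used elsewhere in this document; every probability, the night window, and the 15-minute step are invented placeholders, not IoTSyn's calibrated values.

```python
import random

STATES = ("sleep", "away", "active")

# Hypothetical transition rows; each row lists P(next = sleep, away, active).
NIGHT = {
    "sleep": (0.95, 0.00, 0.05),
    "away": (0.10, 0.80, 0.10),
    "active": (0.30, 0.05, 0.65),
}
DAY = {
    "sleep": (0.20, 0.10, 0.70),
    "away": (0.00, 0.90, 0.10),
    "active": (0.02, 0.18, 0.80),
}


def transition_row(state, hour):
    """Pick the row for the current state from the matrix active at this hour."""
    table = NIGHT if (hour >= 22 or hour < 6) else DAY
    return table[state]


def simulate_states(rng, steps, start="sleep", step_minutes=15, start_hour=0):
    """Simulate a state sequence whose transition matrix varies with the hour."""
    state, sequence = start, []
    for i in range(steps):
        sequence.append(state)
        hour = (start_hour + (i * step_minutes) // 60) % 24
        state = rng.choices(STATES, weights=transition_row(state, hour))[0]
    return sequence
```

Because each state is drawn conditionally on the previous one, runs of the same state appear naturally, which is the temporal coherence that independent sampling would lose.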
5. Domain Generators
Each IoTSyn generator combines the shared statistical engine with a domain-specific model family. The goal is not generic randomness, but structurally meaningful synthetic behavior.
| Domain | Key Models | Primary Uses |
|---|---|---|
| Smart Home | RC thermal relaxation (ISO 13790), analytic CO₂ mass-balance ODE, Magnus-Tetens psychrometric humidity, Markov occupancy | HVAC optimization, occupancy detection, indoor air quality |
| IoT Security | LogNormal packet sizes, Pareto heavy-tail DoS/DDoS flows (α = 1.2 – 1.3), attack Markov phases, burst behavior | Intrusion detection, attack classification, DDoS research |
| Predictive Maintenance | Weibull degradation, ISO 10816 vibration, thermal loading | RUL prediction, fault modeling |
| Medical IoT | Bergman Minimal Model (3-state ODE, RK4), Dalla Man gamma-shaped meal absorption, circadian vitals, NEWS2-inspired status | Diabetes research, closed-loop control, patient monitoring |
| IIoT Network | Protocol-aware traffic, OT attack signatures, role-based devices | SCADA and OT security research |
| Connected Vehicle | Driving state machine, dead reckoning, 5-gear RPM, fuel model | Fleet analytics, driving behavior studies |
5.1 Smart Home
Indoor environmental sensing is modeled using closed-form solutions to the underlying physics rather than ad-hoc harmonic approximations. Indoor temperature follows a first-order RC thermal relaxation model (ISO 13790) that incorporates the outdoor driving temperature, a setpoint, and a time constant τ = RC:
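As a rough sketch of a first-order relaxation step of this kind, the Python below uses the exact solution of dT/dt = (T_target − T)/τ over one time step. The function names are hypothetical, and the switching rule in `simulate_zone` (relax toward the setpoint while HVAC is on, toward the outdoor temperature otherwise, with separate time constants) is an assumption made for illustration, not IoTSyn's documented formulation.

```python
import math


def rc_step(temp, target, dt, tau):
    """Exact update of dT/dt = (target - T) / tau over a step of length dt."""
    return target + (temp - target) * math.exp(-dt / tau)


def simulate_zone(temp0, outdoor, setpoint, hvac_on, dt, tau_free, tau_hvac):
    """Step a single lumped zone through paired outdoor / HVAC-state samples."""
    temps = [temp0]
    for t_out, on in zip(outdoor, hvac_on):
        # Assumption: HVAC pulls toward the setpoint, otherwise drift outdoors.
        target, tau = (setpoint, tau_hvac) if on else (t_out, tau_free)
        temps.append(rc_step(temps[-1], target, dt, tau))
    return temps
```

Because the update is the closed-form solution and not a finite-difference approximation, two steps of length Δt/2 give the same result as one step of length Δt, and the value can never overshoot the target.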
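Carbon dioxide concentration is solved analytically from the indoor mass-balance ODE instead of being approximated with forward Euler. For constant occupancy n and ventilation rate Q on a step [t, t + Δt], the balance V · dC/dt = n · G + Q · (C_out − C) has the exact solution C(t + Δt) = C_ss + (C(t) − C_ss) · exp(−Q · Δt / V), where C_ss = C_out + n · G / Q is the steady-state concentration, V is the zone volume, G is the per-occupant CO₂ generation rate, and C_out is the outdoor concentration (all in consistent units).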
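A minimal sketch of one analytic CO₂ step follows, assuming the standard single-zone balance V · dC/dt = n · G + Q · (C_out − C) with Q > 0. The function name, the ppm conversion, and the default outdoor level of 420 ppm are illustrative assumptions.

```python
import math


def co2_step(c_ppm, occupants, q, volume, dt, gen_rate, c_out_ppm=420.0):
    """Advance indoor CO2 (ppm) by dt seconds using the closed-form solution.

    q        ventilation rate in m^3/s (must be > 0)
    volume   zone volume in m^3
    gen_rate CO2 generation per occupant in m^3/s
    """
    # Steady state for the current occupancy; 1e6 converts a volume fraction to ppm.
    c_ss = c_out_ppm + 1e6 * occupants * gen_rate / q
    return c_ss + (c_ppm - c_ss) * math.exp(-q * dt / volume)
```

With two occupants, `gen_rate = 5e-6`, and `q = 0.02`, the steady state sits 500 ppm above the outdoor level, and the concentration relaxes toward it with time constant V/Q regardless of the step size chosen.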
Relative humidity is derived from the Magnus-Tetens psychrometric relationship rather than imposed as a fixed negative correlation with temperature:
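The sketch below shows one common way to obtain relative humidity from a psychrometric relation, using the Tetens (1930) coefficients for saturation vapor pressure over water. Which coefficients IoTSyn uses, and how it tracks the indoor vapor pressure, are not stated in the source, so the constants, the function names, and the clamping to [0, 100] are assumptions.

```python
import math


def saturation_vapor_pressure_kpa(temp_c):
    """Tetens approximation of saturation vapor pressure over water, in kPa."""
    return 0.61078 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def relative_humidity(temp_c, vapor_pressure_kpa):
    """Relative humidity in percent for a given air temperature and vapor pressure."""
    rh = 100.0 * vapor_pressure_kpa / saturation_vapor_pressure_kpa(temp_c)
    return min(100.0, max(0.0, rh))  # clamp to the physical range
```

Holding the vapor pressure fixed and raising the temperature lowers the computed humidity, so the inverse relationship between temperature and RH emerges from the physics and does not need to be imposed as a correlation.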
Occupancy is simulated through time-of-day-dependent Markov transitions (sleep / away / active), and feeds directly into the CO₂ ODE through the source term n · G.
5.2 IoT Network Security
Normal traffic uses LogNormal packet sizes and Exponential durations. Malicious traffic is generated through phased attack progression — recon → escalation → peak → cooldown — with distinct signatures per attack class (DoS, DDoS, Scan, Brute Force, Botnet).
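Version 3.2 replaces truncated-Gaussian flow sizes for volumetric attacks with a Pareto heavy-tail distribution, consistent with measured internet traffic (Crovella & Bestavros, 1997) and with DDoS traces such as Mirai and its successors (Antonakakis et al., 2017). Flow sizes are drawn by inverse-transform sampling, X = x_m · U^(−1/α) with U ~ Uniform(0, 1), using α = 1.3 for DoS and α = 1.2 for DDoS.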
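A Pareto flow-size draw by inverse-transform sampling can be sketched as below. The tail indices come from this document; the function name, the `ATTACK_ALPHA` lookup, and the minimum flow size used in the usage note are hypothetical.

```python
import random

# Tail indices quoted in this document for volumetric attack classes.
ATTACK_ALPHA = {"DoS": 1.3, "DDoS": 1.2}


def pareto_flow_size(rng, x_m, alpha):
    """Inverse transform X = x_m * U^(-1/alpha), with U in (0, 1]."""
    u = 1.0 - rng.random()  # avoids U = 0, which would give an infinite size
    return x_m * u ** (-1.0 / alpha)
```

For example, `pareto_flow_size(random.Random(1), 1000.0, ATTACK_ALPHA["DDoS"])` returns a flow size of at least 1000. With α below 2 the variance is infinite, so occasional extremely large flows are expected behavior and not outliers to be trimmed.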
5.3 Predictive Maintenance
5.4 Medical IoT
Cardiovascular signals retain the circadian / activity / AR(1) structure of v3.1:
Glucose metabolism, however, is upgraded in v3.2.0 from additive Gaussian "meal spikes" to the Bergman Minimal Model (Bergman et al., 1979) — a three-state ODE system integrated with a fixed-step fourth-order Runge-Kutta solver at a 1-minute sub-step:
where G is plasma glucose (mg/dL), X is remote-compartment insulin action (1/min), and I is plasma insulin (μU/mL). Meal absorption R_a(t) follows the gamma-shaped absorption curve of Dalla Man et al. (2007) rather than a rectangular pulse:
The remaining vitals include age-dependent baselines, blood-pressure coupling to heart rate, possible sleep-apnea events in elderly profiles, and a NEWS2-inspired composite health-status layer.
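To make the glucose model more concrete, the sketch below integrates a three-state minimal model with a fixed-step RK4 solver at a 1-minute step. It assumes the commonly cited form dG/dt = −(p1 + X)·G + p1·G_b + R_a/V_G and dX/dt = −p2·X + p3·(I − I_b), together with a simplified insulin equation dI/dt = −n·(I − I_b) + γ·max(G − G_b, 0). The insulin equation, the gamma-shaped R_a(t) ∝ t·exp(−t/τ), every parameter value, and all function names are illustrative assumptions, not IoTSyn's calibrated model.

```python
import math

# Illustrative placeholder parameters (assumptions, not IoTSyn's values).
PARAMS = {
    "p1": 0.028,    # glucose effectiveness, 1/min
    "p2": 0.025,    # decay rate of insulin action, 1/min
    "p3": 1.3e-5,   # gain from plasma insulin to insulin action
    "n": 0.09,      # insulin clearance, 1/min
    "gamma": 0.05,  # simplified pancreatic response gain
    "gb": 90.0,     # basal glucose, mg/dL
    "ib": 10.0,     # basal insulin, uU/mL
    "vg": 1.7,      # glucose distribution volume, dL/kg
    "tau": 40.0,    # meal absorption time scale, min
}


def meal_rate(elapsed, dose, tau):
    """Gamma-shaped appearance rate (mg/kg/min) that integrates to dose."""
    if elapsed < 0.0:
        return 0.0
    return dose * elapsed / (tau * tau) * math.exp(-elapsed / tau)


def derivatives(t, state, meal_time, dose, p):
    """Right-hand side of the three-state system (G, X, I)."""
    g, x, i = state
    ra = meal_rate(t - meal_time, dose, p["tau"])
    dg = -(p["p1"] + x) * g + p["p1"] * p["gb"] + ra / p["vg"]
    dx = -p["p2"] * x + p["p3"] * (i - p["ib"])
    di = -p["n"] * (i - p["ib"]) + p["gamma"] * max(g - p["gb"], 0.0)
    return (dg, dx, di)


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for a tuple-valued state."""
    k1 = f(t, y)
    k2 = f(t + h / 2, tuple(a + h / 2 * b for a, b in zip(y, k1)))
    k3 = f(t + h / 2, tuple(a + h / 2 * b for a, b in zip(y, k2)))
    k4 = f(t + h, tuple(a + h * b for a, b in zip(y, k3)))
    return tuple(
        a + h / 6 * (b + 2 * c + 2 * d + e)
        for a, b, c, d, e in zip(y, k1, k2, k3, k4)
    )


def simulate_glucose(minutes, meal_time, dose, p=PARAMS):
    """Return the glucose trace (mg/dL) at 1-minute resolution from basal state."""
    state = (p["gb"], 0.0, p["ib"])

    def rhs(t, y):
        return derivatives(t, y, meal_time, dose, p)

    trace = [state[0]]
    for minute in range(minutes):
        state = rk4_step(rhs, float(minute), state, 1.0)
        trace.append(state[0])
    return trace
```

Calling `simulate_glucose(360, 30, 700.0)` simulates six hours with a single meal at minute 30. Starting from the basal state with no meal, all derivatives are zero, so the trace stays at the basal glucose level.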
5.5 IIoT Network Traffic
Industrial network traffic is modeled for protocols such as Modbus TCP, OPC UA, DNP3, BACnet, and EtherNet/IP. OT attack classes include Man-in-the-Middle, Replay, False Data Injection, DoS, and Reconnaissance.
5.6 Connected Vehicle
Vehicle telemetry is generated through a driving state machine with traffic-density modulation. Position is estimated through dead reckoning with GPS noise, engine RPM follows a 5-gear transmission model, and fuel consumption depends on speed and acceleration.
6. Validation Layer (new in v3.2.0)
IoTSyn v3.2.0 ships a built-in validation layer that runs on every generated dataset and emits a machine-readable *.validation.json sidecar next to the CSV. The same report is offered as a secondary download on the dataset page, so reviewers can inspect the numerical evidence without re-running the generator.
6.1 What the Validation Report contains
- Per-column descriptive statistics — mean, standard deviation, minimum, maximum, quartiles for every numeric field in the CSV.
- Physical-range checks — indoor temperature ∈ [−10 °C, 50 °C], RH ∈ [0, 100 %], glucose ∈ [30, 600] mg/dL, etc., with a pass/fail flag per column.
- Kolmogorov-Smirnov goodness-of-fit tests — one-sample KS tests against each generator's declared theoretical distribution, with finite-sample critical values and an asymptotic p-value series.
- Configuration echo — the exact domain, parameters, row count and seed that produced the dataset, so a reviewer can regenerate byte-for-byte.
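The report contents listed above could be assembled along the following lines. This is only a sketch: the JSON field names (`config`, `columns`, `range_pass`, and so on) and the function names are hypothetical, since the actual sidecar schema is not shown here, and the KS section is omitted for brevity.

```python
import json
import statistics


def column_report(values, low, high):
    """Descriptive statistics plus a physical-range pass/fail flag for one column."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "quartiles": [q1, q2, q3],
        "range": [low, high],
        "range_pass": low <= min(values) and max(values) <= high,
    }


def build_report(config, columns, ranges):
    """Combine the configuration echo with one report per numeric column."""
    return {
        "config": config,
        "columns": {
            name: column_report(values, *ranges[name])
            for name, values in columns.items()
        },
    }
```

A report built this way can be written next to the CSV with `json.dumps(report, indent=2)`. Echoing the configuration inside the same file keeps the evidence and the recipe for regenerating it together.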
6.2 Kolmogorov-Smirnov one-sample test
For a sample {x₁, …, x_n} with empirical CDF F_n and a hypothesized CDF F₀ (e.g. LogNormal for packet sizes, Weibull for degradation, Pareto for DDoS flows), the test statistic is D_n = sup_x |F_n(x) − F₀(x)|.
The asymptotic p-value is obtained from the Kolmogorov distribution (Kolmogorov, 1933): p ≈ 2 · Σ_{k=1}^{∞} (−1)^{k−1} · exp(−2 k² n D_n²).
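A one-sample KS test of this kind can be sketched in plain Python as follows. The function names, the number of series terms, and the small-argument cutoff (below which the series is replaced by p = 1) are implementation assumptions, not details taken from IoTSyn.

```python
import math


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and the hypothesized cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Asymptotic p-value 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 n d^2)."""
    scaled = n * d * d
    if scaled < 0.04:
        # The alternating series converges too slowly here; p is 1 to many digits.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * scaled)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))
```

For a Pareto column, the hypothesized CDF would be passed as `lambda x: 1.0 - (x_m / x) ** alpha`. A large p-value means the sample is compatible with the declared distribution; it does not prove that the distribution is correct.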
6.3 Validation dimensions
IoTSyn's validation focuses on internal consistency — whether the generated outputs remain faithful to their governing equations, expected temporal behavior, configured relationships, and declared distributions.
| Validation Dimension | What is checked |
|---|---|
| Statistical relationship preservation | Configured correlations and couplings appear in the generated data |
| Physical model fidelity | Equations such as Weibull degradation, RC thermal relaxation, analytic CO₂, and Bergman glucose produce expected dynamics |
| Domain plausibility | Generated values remain within physically or clinically reasonable ranges |
| State coherence | Markov-based attack phases, occupancy, and driving states evolve coherently over time |
| Configuration sensitivity | Outputs respond meaningfully to parameter changes rather than remaining fixed-pattern simulations |
| Distributional goodness-of-fit (new) | One-sample KS tests compare each column with its declared theoretical distribution; a column passes when the KS statistic does not exceed the critical value |
Example internal validation outcomes
- Smart Home CO₂–occupancy coupling showed strong positive association under the analytic mass-balance solution.
- Predictive Maintenance produced strong degradation–vibration relationships consistent with the ISO-based formulation, and Weibull KS-fit remained within asymptotic tolerance.
- Medical IoT preserved age-dependent heart-rate baselines, circadian nadirs around early morning hours, and Bergman Minimal Model glucose excursions inside clinically plausible 70 – 250 mg/dL envelopes.
- Security and vehicle generators produced temporally coherent phased behaviors through Markov state transitions; DDoS flow sizes matched the Pareto α = 1.2 reference CDF within KS asymptotic tolerance.
7. Reproducibility
Every dataset includes a seed value and generation metadata. The same (domain, parameter set, row count, seed) tuple reproduces the same output — and, as of v3.2.0, the same byte-for-byte CSV across PHP versions.
Implementation: IoTSyn v3.2.0 uses a portable pure-PHP Mersenne Twister MT19937 with instance-level random state management. This prevents interference between generator instances and removes the dependency on PHP's native mt_rand(), whose seeded output changed between PHP 7.1 and 7.2.
Runtime note: IoTSyn v3.2.0 has been verified bit-identical on PHP 7.4, 8.0, 8.1, 8.2, and 8.3, on both Linux and Windows hosts. The validation sidecar itself is deterministic in its statistical content (numeric values may render with the host's JSON float precision, but the underlying numbers are reproducible).
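One way to check byte-for-byte reproducibility is to hash the generated CSV and compare digests across runs or hosts. The sketch below is illustrative only: `generate_csv` is a hypothetical stand-in for a real generator, `random.Random` replaces the portable PRNG, and the fixed six-decimal formatting and explicit `\n` line endings are assumptions about what keeps the bytes stable.

```python
import hashlib
import random


def generate_csv(domain, params, row_count, seed):
    """Toy generator: one Gaussian column, serialized with fixed formatting."""
    rng = random.Random(seed)  # stand-in for the portable PRNG (assumption)
    lines = [f"row,{domain}_value"]
    for row in range(row_count):
        value = params["mean"] + params["sd"] * rng.gauss(0.0, 1.0)
        # Fixed precision avoids platform-dependent float rendering.
        lines.append(f"{row},{value:.6f}")
    # Explicit "\n" keeps line endings identical on Linux and Windows.
    return ("\n".join(lines) + "\n").encode("utf-8")


def digest(csv_bytes):
    """SHA-256 hex digest used to compare two generated files."""
    return hashlib.sha256(csv_bytes).hexdigest()
```

Two runs with the same (domain, parameters, row count, seed) tuple should yield equal digests, and storing the digest with the dataset gives reviewers a quick equality check before any statistical comparison.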
8. Scope and Limitations
IoTSyn is designed for transparent and reproducible synthetic data generation, not for perfect replication of every real-world environment. The models are deliberate simplifications intended for benchmarking, prototyping, testing, and educational use.
- Each new domain requires new domain-specific modeling assumptions.
- Physics-based realism is structured and interpretable, but not equivalent to full real-world complexity.
- IoTSyn does not learn unknown latent patterns from real proprietary datasets.
- The Bergman Minimal Model captures the dominant short-horizon glucose–insulin dynamics but omits second-meal effects, hepatic glucose production variability, and exercise-dependent insulin sensitivity.
- The RC thermal model represents a single lumped zone; it does not capture multi-zone heat exchange, solar gain through fenestration, or mechanical ventilation coupling.
- The Pareto heavy-tail approximation for volumetric attacks captures macroscopic flow-size statistics but does not reproduce protocol-level artifacts of specific botnet families.
- Current validation is internal (KS goodness-of-fit, physical-range checks, temporal coherence); external benchmark studies against real-world datasets remain an active research direction.
9. References
- Box, G.E.P. & Muller, M.E. (1958). A Note on the Generation of Random Normal Deviates. Annals of Mathematical Statistics, 29(2), 610–611.
- Knuth, D.E. (1997). The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 3rd ed. Addison-Wesley.
- Marsaglia, G. & Tsang, W.W. (2000). A simple method for generating gamma variables. ACM Transactions on Mathematical Software, 26(3), 363–372.
- Gentle, J.E. (2009). Computational Statistics. Springer.
- ASHRAE (2019). ASHRAE Handbook: HVAC Applications.
- Persily, A.K. & de Jonge, L. (2017). Carbon dioxide generation rates for building occupants. Indoor Air, 27(5), 868–879.
- ISO 10816-1:1995. Mechanical vibration — Evaluation of machine vibration by measurements on non-rotating parts.
- Royal College of Physicians (2017). National Early Warning Score (NEWS) 2.
- Refinetti, R. & Menaker, M. (1992). The circadian rhythm of body temperature. Physiology & Behavior, 51(3), 613–637.
- Venditti, F.J. et al. (2005). Circadian variation of heart rate variability. Journal of Cardiovascular Electrophysiology, 16(1), 27–31.
- Matsumoto, M. & Nishimura, T. (1998). Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8(1), 3–30.
- Crovella, M.E. & Bestavros, A. (1997). Self-similarity in World Wide Web traffic: evidence and possible causes. IEEE/ACM Transactions on Networking, 5(6), 835–846.
- Antonakakis, M. et al. (2017). Understanding the Mirai botnet. 26th USENIX Security Symposium, 1093–1110.
- Bergman, R.N., Ider, Y.Z., Bowden, C.R. & Cobelli, C. (1979). Quantitative estimation of insulin sensitivity. American Journal of Physiology — Endocrinology and Metabolism, 236(6), E667–E677.
- Dalla Man, C., Rizza, R.A. & Cobelli, C. (2007). Meal simulation model of the glucose-insulin system. IEEE Transactions on Biomedical Engineering, 54(10), 1740–1749.
- Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4, 83–91.
- Tetens, O. (1930). Über einige meteorologische Begriffe. Zeitschrift für Geophysik, 6, 297–309.
- ISO 13790:2008. Energy performance of buildings — Calculation of energy use for space heating and cooling. International Organization for Standardization.
- Duhair, A. (2026). IoTSyn: a physics-based framework for reproducible synthetic IoT data generation across six domains (v3.2.0). IoTSyn Technical Report. Available at https://iotsyn.com.