Consider a portfolio that includes several rural and peri-urban solar sites in the Eastern Cape and Limpopo. Nice installations, solid inverter hardware, decent generation profiles. But for the first six months of operation, the data tells a different story. The completeness dashboard shows regular gaps — some sites dipping to 60-70% on certain days — and the ops team keeps opening tickets against the inverters. "Site X underperforming again." "Check the string faults at Site Y."
Except there are no string faults. The inverters are generating fine. The data is just getting lost between the site and the cloud. Spotty cellular connectivity, gateway reboots, upload timeouts — the pipeline assumes a reliable network, and when it doesn't get one, it silently drops records. Every "performance anomaly" in the analytics is actually a communications gap wearing a disguise.
Once the real problem is understood, the fix follows a simple principle: normalize and validate at the edge, queue durably, and sync when you can. Here's what that looks like.
The Pipeline Architecture
collect locally -> durable queue -> ODSE transform -> schema validation -> batch sync -> semantic validation -> warehouse
The key insight is that data quality controls live close to the source — on the gateway itself or on a local edge device. You don't wait for the data to reach the cloud before checking if it's well-formed. By the time it gets there (if it gets there on the first try), it should already be structurally valid. Cloud-side, you run the heavier semantic checks that need portfolio context — things like "is this generation value plausible given the site's capacity and today's irradiance?"
What Changes at the Edge
The biggest shift is making local persistence durable. A typical initial setup writes transformed records to a temporary directory and uploads them immediately. If the upload fails, the records are gone. Classic fire-and-forget architecture that only works when the network works.
The fix is a durable local queue — nothing exotic, just SQLite on the gateway — that holds both the raw payload and the ODSE-transformed record. Every record gets an idempotent key derived from site ID, timestamp, and source hash. That key is what makes replay safe: if the same record arrives twice, the cloud side can deduplicate without worrying about double-counting.
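As a rough sketch of what that queue can look like (the table schema, field names, and key format here are illustrative, not any particular ODSE implementation):

```python
import hashlib
import json
import sqlite3


def record_key(site_id: str, timestamp: str, payload: dict) -> str:
    """Derive a stable, idempotent key from site ID, timestamp, and a source hash."""
    source_hash = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()[:16]
    return f"{site_id}:{timestamp}:{source_hash}"


def open_queue(path: str = "queue.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS queue (
               record_id TEXT PRIMARY KEY,  -- idempotent key: replays are no-ops
               raw       TEXT NOT NULL,     -- original payload, kept for re-transform
               record    TEXT NOT NULL,     -- ODSE-transformed record
               synced    INTEGER DEFAULT 0
           )"""
    )
    return conn


def enqueue(conn, site_id, timestamp, raw, transformed):
    rid = record_key(site_id, timestamp, raw)
    # INSERT OR IGNORE: re-collecting the same interval cannot create duplicates
    conn.execute(
        "INSERT OR IGNORE INTO queue (record_id, raw, record) VALUES (?, ?, ?)",
        (rid, json.dumps(raw), json.dumps(transformed)),
    )
    conn.commit()
    return rid
```

Keeping both the raw payload and the transformed record means you can re-run the transform later if a bug is found, without asking the inverter for data it may no longer have.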
Also move timestamp normalization to the edge. If the cloud pipeline handles timezone conversion, any upload delay introduces ambiguity about which timezone the data is in. Instead, every record should leave the gateway as a timezone-explicit ISO 8601 timestamp. No guesswork downstream.
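In Python this is a few lines with `zoneinfo` (the Africa/Johannesburg zone and the naive input format are assumptions for illustration; your inverters may log differently):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

SITE_TZ = ZoneInfo("Africa/Johannesburg")  # assumed site timezone


def normalize_timestamp(local_naive: str) -> str:
    """Attach the site's timezone at the gateway so records leave timezone-explicit.

    Inverters commonly log naive local times like '2024-03-04 06:15:00';
    an ISO string with an explicit offset removes all downstream ambiguity.
    """
    dt = datetime.strptime(local_naive, "%Y-%m-%d %H:%M:%S").replace(tzinfo=SITE_TZ)
    return dt.isoformat()
```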
The Monday Morning Mystery
Here's an example that illustrates why this matters. Suppose one site in the portfolio shows 40% completeness every Monday. Tuesdays through Sundays are fine — 95%+ — but Mondays are a disaster. The ops team flags it as a recurring inverter fault.
On investigation, the team finds that the cellular gateway at that site is rebooting on a weekly cron job — some leftover from the ISP's default configuration. The reboot happens at midnight Sunday, the gateway takes about 90 minutes to reconnect, and the upload queue isn't durable. Every record collected between midnight and ~1:30am is lost. But because the generation during those hours is near zero (it's nighttime), the energy totals look almost normal. The completeness gap is invisible unless you look at interval-level data.
With a durable queue and idempotent sync, the gateway can reboot whenever it wants. Records queue up locally and sync on reconnection. Monday completeness goes from 40% to 98%.
Validation in Two Stages
At the edge
Run the ODSE transform plus schema validation before a record enters the durable queue. If the record is malformed — null energy fields, missing timestamps, impossible values — it gets rejected right there. This sounds obvious, but a typical pipeline will happily ship garbage to the cloud and then fail during analytics. By catching structural errors at the edge, you stop polluting the backlog with records that can never be valid.
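A minimal edge-side structural check might look like this (field names are illustrative placeholders, not the actual ODSE schema):

```python
def validate_schema(record: dict) -> list[str]:
    """Structural checks run before a record enters the durable queue.
    Returns a list of errors; an empty list means the record may be enqueued."""
    errors = []
    if record.get("timestamp") is None:
        errors.append("missing timestamp")
    energy = record.get("energy_kwh")
    if energy is None:
        errors.append("null energy field")
    elif not isinstance(energy, (int, float)) or energy < 0:
        errors.append(f"impossible energy value: {energy!r}")
    if not record.get("site_id"):
        errors.append("missing site_id")
    return errors
```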
At sync
When batches arrive cloud-side, run semantic validation. This is where you check things that require portfolio context: is the reported generation plausible for this site's capacity? Does the interval pattern match what you expect for this source? Is there a gap in the sequence that suggests missing data rather than zero generation? Schema validation tells you the record is well-formed. Semantic validation tells you it makes sense.
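The capacity-plausibility check, for instance, reduces to a few lines once portfolio metadata is available cloud-side (the 10% margin below is an assumed tolerance for measurement noise, not a standard):

```python
def plausible_generation(energy_kwh: float, interval_minutes: int,
                         capacity_kw: float, margin: float = 1.1) -> bool:
    """Semantic check: generation in one interval cannot exceed what the site's
    nameplate capacity could produce in that time, plus a small margin.
    capacity_kw comes from portfolio metadata the edge doesn't have."""
    max_kwh = capacity_kw * (interval_minutes / 60) * margin
    return 0 <= energy_kwh <= max_kwh
```

A 20 kW site reporting 4 kWh in a 15-minute interval passes; 8 kWh in the same interval fails, because even at full nameplate it could only produce 5 kWh.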
Making Replay Safe
The other common pitfall is duplicate records. When the network is flaky, uploads retry. If your sync path isn't idempotent, retries create duplicates, and duplicates inflate your energy totals. In this example, one site appears to be generating 15% above nameplate capacity for a week. It turns out to be the same afternoon's data uploaded three times.
The fix is straightforward: stable record keys and deduplication at the cloud boundary.
if seen(record_id):
    skip()
else:
    persist(record)
    recompute_completeness(site_id, date)
After every sync batch, recompute completeness for the affected site and date range. This catches both newly filled gaps and any intervals that are still missing.
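Fleshed out slightly, the dedup-and-track step fits in one small function (in-memory stores stand in for the real warehouse here, and the record shapes are assumed):

```python
def sync_batch(records: list, seen_ids: set, store: dict) -> set:
    """Idempotent cloud-boundary sync. Each record carries the stable
    record_id minted at the edge; retries become safe no-ops.
    Returns the (site_id, date) pairs that need completeness recomputed."""
    touched = set()
    for rec in records:
        rid = rec["record_id"]
        if rid in seen_ids:
            continue  # retry of an already-persisted record: skip it
        seen_ids.add(rid)
        store[rid] = rec
        touched.add((rec["site_id"], rec["date"]))
    return touched
```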
The Metrics That Actually Matter
Many teams track uptime and generation totals. Those are fine, but they don't help you distinguish between "the inverter is down" and "the network is down." Here are the four metrics that actually matter:
- Queue depth and age at each gateway — if records are piling up, the network is the problem, not the asset.
- Transform and validation success rate by source — tells you whether the OEM export format has changed or degraded.
- Sync lag (event time vs arrival time) — the gap between when data was generated and when it reached the warehouse.
- Post-reconciliation completeness by site — the only metric that tells you the full truth after replays have landed.
The combination of sync lag and post-reconciliation completeness is what finally lets the ops team stop blaming inverters for network problems. When sync lag spikes but completeness recovers after replay, it's a comms issue. When completeness stays low even after replay, something's actually wrong at the site.
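Both signals are cheap to compute once timestamps are timezone-explicit; a minimal sketch (function and field names are illustrative):

```python
from datetime import datetime


def sync_lag_seconds(event_iso: str, arrival_iso: str) -> float:
    """Event time vs arrival time: how long a record took to reach the warehouse."""
    return (datetime.fromisoformat(arrival_iso)
            - datetime.fromisoformat(event_iso)).total_seconds()


def completeness(intervals_present: int, intervals_expected: int) -> float:
    """Post-reconciliation completeness: recompute after every replay lands."""
    return intervals_present / intervals_expected
```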
Common Mistakes (So You Don't Make Them)
A common initial architecture runs normalization only in the cloud. When the network drops, raw records are lost before they ever get transformed. Moving the ODSE transform to the edge is the single most impactful change in this entire redesign.
Another mistake: treating delayed data as a fault state. If the alerting system fires "missing data" alerts for sites that are simply queuing records locally during a connectivity window, the team burns hours investigating non-incidents. Separate "data not yet arrived" from "data will never arrive" — the first is a sync delay, the second is a real incident. The durable queue and replay mechanism make this distinction possible.
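One way to encode that distinction, using queue depth from the gateway's last heartbeat (the grace window and field names are made up for illustration):

```python
def classify_gap(queue_depth: int, hours_since_contact: float,
                 grace_h: float = 6.0) -> str:
    """Triage a completeness gap. queue_depth comes from the gateway's
    last heartbeat; grace_h is an assumed fleet-specific threshold."""
    if hours_since_contact > grace_h:
        return "incident"    # gateway silent past the grace window: a real fault
    if queue_depth > 0:
        return "sync delay"  # records are queued locally; replay will fill the gap
    return "watch"           # gap with nothing queued yet; wait out the grace period
```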
And timezone drift — don't ignore it. In this example, one of the gateways has its system clock drifting by a few minutes each week because NTP isn't configured. Small drift, but enough to shift intervals across boundaries and break completeness calculations. Pin your timestamps at transform time and validate the clock source.
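A cheap guard is to compare the gateway's reported clock against server time at every sync (the 30-second tolerance is an assumed example, tuned per fleet):

```python
from datetime import datetime

DRIFT_TOLERANCE_S = 30  # assumed threshold; tune for your fleet


def clock_drift_ok(gateway_iso: str, server_iso: str) -> bool:
    """Flag gateways whose reported clock has drifted beyond tolerance,
    which usually means NTP is missing or misconfigured."""
    drift = abs((datetime.fromisoformat(gateway_iso)
                 - datetime.fromisoformat(server_iso)).total_seconds())
    return drift <= DRIFT_TOLERANCE_S
```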
The Result
With these changes, the rural sites go from being the "problem children" of the portfolio to some of the most reliable data sources available. Not because the connectivity improves — it doesn't — but because the pipeline stops pretending the connectivity is something it isn't. Design for the network conditions you actually have, not the ones you wish you had. Durable queues, idempotent sync, edge validation, and honest completeness metrics. That's the entire playbook.