
Docs: 'Ingest from other data sources' missing critical filename/dedup details #2096

Description

@RolandKrummenacher

Summary

The "Ingest from other data sources" section of the deploy tutorial documents the folder path convention but omits several details that the Hub's ingestion pipeline requires to work correctly. Following the doc literally led to silent data loss in our GCP → FOCUS → Hub pipeline.

Gaps we hit while integrating a GCP Cloud Billing FOCUS view into the Hub:

1. Filename convention is not documented (critical)

The doc says nothing about filenames, but the ingestion pipeline splits each filename on __ (double underscore) to derive ingestionId and originalFileName. Source: Analytics/app.bicep:1806-1811.

The pre-ingest cleanup (Analytics/app.bicep:1352) drops all extents with the same `drop-by:` tag that don't share the current run's `ingestionId`. If you upload files without the `__` separator, every file gets its own derived `ingestionId` (the full filename), so each file's ingestion drops all previously ingested files in the same folder, leaving only whatever was ingested last.

Concrete impact: we uploaded 12 Parquet files per month; 10 rows out of ~2100 survived in `Costs_final_v1_2` per run.
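To make the failure mode concrete, here is a toy model of the behavior described above. It is not the Hub's code: `derive_ingestion_id` and `simulate_folder_ingestion` are hypothetical names, and the model only assumes what this issue states (the ID is the text before `__`, or the full filename when there is no separator, and each ingestion drops extents with a different ID).

```python
def derive_ingestion_id(filename: str) -> str:
    """Toy rule: text before the first '__', or the whole filename if absent."""
    return filename.split("__", 1)[0]


def simulate_folder_ingestion(filenames):
    """Return the files whose extents survive after ingesting each file in order."""
    extents = []  # (ingestion_id, filename) pairs for one folder
    for name in filenames:
        current = derive_ingestion_id(name)
        # Pre-ingest cleanup: drop extents that belong to a different ingestionId.
        extents = [e for e in extents if e[0] == current]
        extents.append((current, name))
    return [name for _, name in extents]


without_separator = [f"part-{i}.parquet" for i in range(3)]
with_separator = [f"run42__part-{i}.parquet" for i in range(3)]

print(simulate_folder_ingestion(without_separator))  # only the last file survives
print(simulate_folder_ingestion(with_separator))     # all three files survive
```

In this model the survivor is always the last file processed, which matches the "only whatever got ingested last" symptom we saw.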

Suggested doc fix: Add a subsection:

  • Filename pattern: `<ingestionId>__<originalFileName>.parquet`
  • `ingestionId` must be stable across all files in a single logical run (e.g. a timestamp + GUID)
  • Separator is two underscores
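A minimal sketch of how an integrator might generate such names, assuming the `ingestionId` is simply everything before the first `__`. The helper names (`new_ingestion_id`, `hub_filename`) and the timestamp-plus-GUID format are our own illustration, not something the Hub prescribes.

```python
import uuid
from datetime import datetime, timezone

SEPARATOR = "__"  # two underscores, as the pipeline splits on this


def new_ingestion_id() -> str:
    """Create one ID per logical run: UTC timestamp plus a random GUID."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{ts}-{uuid.uuid4().hex}"


def hub_filename(ingestion_id: str, original_name: str) -> str:
    """Prefix a file name with the run's ingestionId and the separator."""
    if SEPARATOR in ingestion_id:
        raise ValueError("ingestionId must not contain the '__' separator")
    return f"{ingestion_id}{SEPARATOR}{original_name}"


# Generate the ID once, then reuse it for every file in the run.
run_id = new_ingestion_id()
names = [hub_filename(run_id, f"part-{i}.parquet") for i in range(3)]
print(names)
```

The important part is that `new_ingestion_id()` is called once per run, not once per file.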

2. Dedup requires full-month replacement, not deltas

The doc hints at this with "overriding datasets should consistently land in the same folder path so they're overwritten each time," but it's easy to miss. The implication isn't stated: if you write only incremental/delta rows to the month folder, the pre-ingest cleanup wipes the rest of the month's extents, and Kusto ends up with just the delta.

Suggested doc fix: Add an explicit note: "When new rows arrive for an existing month, you must re-write the entire month's data in that folder (with a new ingestionId). Incremental deltas are not supported — they will result in data loss."
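As an illustration of the full-month rule, an exporter could decide what to write like this. It is only a sketch: the `month` key and the `months_to_rewrite` helper are hypothetical, and `all_rows` is assumed to be the exporter's complete source of truth, including the newly arrived rows.

```python
from collections import defaultdict


def months_to_rewrite(all_rows, new_rows):
    """Return {month: every row for that month} for each month touched by new_rows."""
    touched = {row["month"] for row in new_rows}
    full_months = defaultdict(list)
    for row in all_rows:
        if row["month"] in touched:
            full_months[row["month"]].append(row)
    return dict(full_months)


all_rows = [
    {"month": "2025-01", "cost": 10.0},
    {"month": "2025-02", "cost": 20.0},
    {"month": "2025-02", "cost": 5.0},  # late-arriving row
]
new_rows = [all_rows[2]]

# The whole of 2025-02 is re-exported (both rows), not just the late row.
print(months_to_rewrite(all_rows, new_rows))
```

Each returned month would then be written to its folder under a fresh `ingestionId`.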

3. Tables referenced in step 4 are outdated

Step 4 says "update the Costs_raw and Costs_final_v1_0 tables, and Costs_transform_v1_0, Costs_v1_0, and Costs functions". Current Hub deployments use the `v1_2` variants (`Costs_final_v1_2`, `Costs_transform_v1_2`, `Costs_v1_2`). The doc wasn't updated when the schema was bumped.

4. Empty Parquet shards cause 8-minute ingestion stalls

Not a documentation gap per se, but worth calling out: tools that produce partitioned Parquet output (BigQuery `EXPORT DATA`, Spark `write.parquet`, etc.) commonly emit header-only empty shards (~8 KB on a typical FOCUS schema). Kusto rejects these with `BadRequest_NoRecordsOrWrongFormat: 0 bytes`, and each failure is retried 3× at 120s intervals (~8 minutes wasted per empty file). With ~50% empty shards a typical run's Hub pipeline takes 1–2 hours instead of minutes.

Suggested doc fix: Add a tip: "Filter out empty shards (below ~10 KB on a typical FOCUS schema) before uploading to the ingestion container to avoid retry storms in the Hub pipeline."
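A rough sketch of that filter, using only file sizes. The ~10 KB threshold comes from our FOCUS schema and will differ for other schemas; `shards_to_upload` and `local_shard_sizes` are hypothetical helper names.

```python
from pathlib import Path

# Threshold observed on our FOCUS schema (assumption); tune it per schema.
MIN_SHARD_BYTES = 10 * 1024


def shards_to_upload(sizes, min_bytes=MIN_SHARD_BYTES):
    """Keep shard names whose size is at least min_bytes."""
    return sorted(name for name, size in sizes.items() if size >= min_bytes)


def local_shard_sizes(folder):
    """Map each local *.parquet file name to its size in bytes."""
    return {p.name: p.stat().st_size for p in Path(folder).glob("*.parquet")}


sizes = {"part-0.parquet": 8_192, "part-1.parquet": 250_000}
print(shards_to_upload(sizes))  # the ~8 KB header-only shard is skipped
```

A size cutoff is a heuristic. Checking the row count in the Parquet footer before upload would be more robust, at the cost of adding a Parquet reader as a dependency.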

5. Manifest content specification is ambiguous

The doc says "Create an empty `manifest.json` file." The pipeline actually accepts any valid JSON. The `IngestionQueries` pipeline writes a `settings.json`-derived payload, so "empty" is a convention, not a requirement. Worth clarifying so integrators know they can embed audit metadata (e.g. `ingestionId`, exported timestamp).
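For example, a manifest carrying audit metadata could be produced as follows. The field names (`ingestionId`, `exportedAt`, `files`) are our own choice for illustration; as far as we can tell the pipeline only needs the file to be valid JSON and does not read these fields.

```python
import json
from datetime import datetime, timezone


def build_manifest(ingestion_id, file_names):
    """Serialize hypothetical audit metadata for manifest.json."""
    payload = {
        "ingestionId": ingestion_id,
        "exportedAt": datetime.now(timezone.utc).isoformat(),
        "files": sorted(file_names),
    }
    return json.dumps(payload, indent=2)


manifest_text = build_manifest("run42", ["run42__part-1.parquet", "run42__part-0.parquet"])
print(manifest_text)
```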

6. Stale source-blob cleanup is the integrator's responsibility

When integrators use the `<ingestionId>__<originalFileName>.parquet` convention with a fresh `ingestionId` per run (recommended by point 1), each run's Parquet files have different filenames than previous runs. The Hub pipeline's drop-by cleanup operates on Kusto extents but does not delete source blobs in the ingestion container.

Consequence: stale Parquets from previous runs accumulate in the target folder. On the next manifest trigger, the Hub lists all Parquets in the folder (old + new), ingests every one, then uses the drop-by cleanup to keep only one `ingestionId`'s extents. Which `ingestionId` wins depends on blob processing order. Often it is the older (lexicographically earlier) one, which silently reverts the data to the prior run.

Concrete impact we observed: after changing the FOCUS view's `BilledCost` semantics, triggering a full re-export, and confirming the new Parquets were uploaded, `Costs_final_v1_2` still showed the old values. The Hub had picked up both old-ingestionId and new-ingestionId files, and the older-timestamped run's extents won.

Suggested doc fix: Add a step: "Before uploading a new run's Parquets, delete all existing Parquets in the target folder that don't share the current run's ingestionId. The Hub pipeline does not clean up source blobs — leftover files from prior runs can race with new ones and silently revert your data."

Alternatively, integrators could use fixed filenames (`part-0.parquet`, `part-1.parquet`, …) that overwrite in place, matching the Azure Cost Management export pattern. This works only if the shard count is stable across runs; otherwise leftover shards cause the same issue.
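The "delete stale Parquets first" step suggested above could be sketched like this. It only computes which blob names to delete; the actual listing and deletion depend on the storage client in use and are left out. `stale_blobs` is a hypothetical helper, and it assumes blob names end in `<ingestionId>__<originalFileName>.parquet`.

```python
def stale_blobs(blob_names, current_ingestion_id):
    """Return Parquet blobs in the folder that do not belong to the current run."""
    prefix = f"{current_ingestion_id}__"
    stale = []
    for name in blob_names:
        file_name = name.rsplit("/", 1)[-1]  # strip any folder path
        if file_name.endswith(".parquet") and not file_name.startswith(prefix):
            stale.append(name)
    return stale


existing = [
    "2025/01/run41__part-0.parquet",
    "2025/01/run41__part-1.parquet",
    "2025/01/run42__part-0.parquet",
    "2025/01/manifest.json",
]
print(stale_blobs(existing, "run42"))  # the two run41 files
```

Running this before each upload (and deleting the result) keeps exactly one run's files in the folder, so blob processing order no longer matters.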

Repro context

GCP Cloud Billing detailed export → BigQuery FOCUS view → Parquet exports to GCS → streamed to Azure ingestion container → Hub pipeline → Kusto `Costs_final_v1_2`. Verified against `Hub` database on a FinCops-hosted Fabric eventhouse.
