Skip to content

Add JSON to File operator (CMEM-7616)#1050

Open
edufuga wants to merge 30 commits into
developfrom
feature/CMEM-7616-json-to-file
Open

Add JSON to File operator (CMEM-7616)#1050
edufuga wants to merge 30 commits into
developfrom
feature/CMEM-7616-json-to-file

Conversation

@edufuga

@edufuga edufuga commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

This PR is based on develop. It depends on PR [#1041] (CMEM-7427).

CMEM-7616: JSON to File operator

Adds an operator that reads a JSON string from a field on each input entity and writes it to a file, surfacing the result as a FileEntity for downstream file-backed operators or datasets.

Why

Parse JSON consumes a JSON string and produces structured entities; it does not preserve the string. Users who need to pass the JSON value to a file-backed operator downstream — for archival, transfer, or further processing — had no operator for that. JsonToFileOperator fills that gap.

Changes

JsonToFileOperator is a CustomTask registered in JsonPlugins. Its input port mirrors Parse JSON: one entity carrying a JSON string. Its output is a FileEntity. For each input entity the operator reads the value at the configured input path, validates it as JSON, and writes it to a file tagged application/json. When the input contains more than one entity, a zero-based index suffix is appended to the filename before the extension to keep output names unique and predictable. If no input path is configured, the operator defaults to the first field on the input entity, consistent with Parse JSON's behaviour.

An optional zipOutput flag switches to ZIP mode: instead of one file per entity, all entities are packed into a single ZIP archive with one entry per entity. Entry names follow the same suffix scheme as the single-file path. The produced FileEntity carries application/zip in ZIP mode. If the write loop fails mid-archive the partial temp file is deleted before the exception propagates, so callers never receive a broken ZIP.

An optional outputProperty parameter wraps the JSON value in an object under the given key before writing. With outputProperty set to payload, an input value {"name":"Alice"} is written as {"payload":{"name":"Alice"}}. When left empty — the default — the value is written as-is. The wrapping applies in both normal and ZIP modes. Internally, validateJson is replaced by parseJson returning a JsonNode, which validates and supplies the parsed node for wrapping in a single pass.

A third outputMode, jsonArray, merges all input entities' values into a single JSON array file, splicing each element verbatim.

JsonParserTask gains a mutual relatedPlugins entry pointing at JsonToFileOperator, and vice versa, so the workbench surfaces the relationship from either side.

Tests

LocalJsonToFileOperatorExecutorTest covers the three output modes independently. For file mode: single and multi-entity output, the literal-versus-suffixed filename rule, application/json MIME-type propagation, and input-path defaulting. For zip mode: single-entity packing under entry.json, multi-entity packing with suffixed entry names, a named ZIP container, a single literal entry name, an explicitly configured MIME type, empty input, and a 500-entity scale run confirming the write loop and entry naming hold at volume. For jsonArray mode: single and multi-entity merging into one array in input order, a single named output file with no index suffix, the default and explicitly configured MIME types, and the value-fidelity cases below.

Invalid input is skipped per entity rather than aborting the task, asserted in every mode: an empty value and an unparseable value are each skipped and recorded as a warning on the execution report while the valid entities still produce output; a mixed batch writes the valid records and skips the bad one, with contiguous file and ZIP entry names across the gap; a fully-skipped run yields the mode's natural empty output — no files, an empty [], or a ZIP with no entries; and the report's entityCount (plus the skip-warning count) is asserted in file, zip, and jsonArray modes, with an all-valid merged run asserting no warnings.

Byte fidelity in jsonArray mode is pinned directly: trailing decimal zeros, large-magnitude integers, plain and multiple-of-ten integers, scalar / array / null elements, duplicate object keys, and pretty-printed internal formatting are written through verbatim; a non-JSON token after a valid value and a comma-separated trailing value at the root are rejected (skipped and warned); and whitespace padding around an element is preserved. outputProperty wrapping is covered across modes — single and multiple entities, ZIP entries, a JSON-array input, a skipped invalid value, and number- and whitespace-preservation in file and zip when the wrapper is set.

LocalJsonToFileWorkflowTest runs the operator end-to-end in a three-node in-memory workflow: file mode asserting the output resource holds the source JSON byte-for-byte, zip mode asserting entry names and content across two entities, jsonArray mode asserting the merged array, and a skip case asserting an invalid record is dropped while the valid ones flow through to the downstream dataset.


Task: https://jira.eccenca.com/browse/CMEM-7616

Eduard Fugarolas added 14 commits May 15, 2026 15:16
A new Silk operator that takes a JSON string carried as a field value on each incoming entity and writes it to a file, surfacing the result downstream as a `FileEntity`. The operator iterates over every input entity, validates each value as JSON via the in-house Jackson-based parser, and produces one file per entity tagged with `application/json`. An optional output filename parameter governs naming, with an index suffix appended when the input carries more than one entity to keep filenames unique. The plugin is registered alongside Parse JSON in `JsonPlugins.scala`, and Parse JSON gains a mutual `relatedPlugins` entry so the workbench surfaces the two as related from either side. Eight executor-level tests cover the contract, including the missing-value and invalid-JSON failure modes; an integration test confirms the produced file lands verbatim in a downstream `JsonDataset` resource.
The doc covers the operator's input, the file entity produced downstream, and the *Input path* and *Output file name* parameters. The suffix rule for multi-entity input is laid out with a short inline example showing the resulting filenames. A worked single-entity walkthrough at the end mirrors the Parse JSON doc.
Touches the @plugin, @Param, and relatedPlugins descriptions on both operators, the two user-facing docs, and the JSON to File workflow test name.
The JSON to File documentation claimed the operator writes bytes verbatim and tagged files with a hardcoded content type. Both were inaccurate or incomplete: writeString encodes to UTF-8, and the content type was not user-configurable. Adds a mimeType parameter (default application/json) to JsonToFileOperator and threads it through the executor, replacing the hardcoded class-level val. The output-filename section now covers the extension-less case. Duplicate phrasing in the Parse JSON Schema section and class-name leakage in both operators' input-count error messages are also fixed.
When zipOutput is enabled, the executor packs all input entities into a single ZIP file with one entry per entity instead of producing one file per entity. Entry naming follows the same suffix convention as the non-ZIP path. The MIME type defaults to application/zip when zipOutput is true and the parameter is at its default value; an explicit value passes through unchanged. Seven executor-level tests cover entry naming, MIME type handling, empty input, and invalid JSON; a workflow-level test confirms the ZIP lands correctly in a downstream dataset resource. Documentation updates the Input, Output, and Parameters sections to cover both modes.
The write loop in executeZip is now wrapped in a try/catch that calls WritableResource.delete() on the backing temp file before re-throwing, so a failed ZIP operation does not leave a partial file in the temp directory until GC. The ZipInputStream readers in the unit test helper and the workflow integration test are closed via try/finally and read with an explicit UTF-8 charset rather than the platform default. A 500-entity scale test is added to confirm the full write loop and entry naming hold at volume.
…pe, simplify the executor

The createTemp factory gains two trailing optional parameters — name and mimeType — so callers can place a file at a specific path in the temp directory and attach a content type in one call without post-construction copies. All existing callers are unaffected; both parameters default to None and the existing lifecycle lines run unchanged. The named-file logic that allocateNamedFileEntity was duplicating by hand now delegates entirely to createTemp, shrinking it to two lines. The two anonymous-file call sites drop their trailing copy calls in favour of passing mimeType directly, and the non-ZIP write path gets the same partial-failure cleanup already present in ZIP mode. FileResource, FileType, and java.io.File are no longer referenced in the executor and are removed from its imports.
The test for a single input entity with an explicit output file name checked the file name but not the MIME type. Since the executor sets application/json on every non-ZIP entity, that property was covered implicitly in other paths but never asserted here. One assertion added.
The two workflow tests duplicated about 40 lines of workspace setup each. Extracted a shared runWorkflow helper to the companion object (not the class — the method has no instance state dependency), with an explicit return type, a Scaladoc comment, and an outputName parameter that defaults to "output.json". The ZIP test passes "output.zip", which is load-bearing: JsonDataset uses the file extension to determine ZIP mode. Implicit dependencies (UserContext, PluginContext, Prefixes) are threaded as explicit implicit parameters; call sites are unaffected since the implicits are in scope on the class.
LocalJsonToFileWorkflowTest was importing org.scalatest.matchers.should.Matchers while the rest of the feature's test files use org.scalatest.matchers.must.Matchers. Import updated; all four shouldBe assertions replaced with mustBe. 134 files in the codebase use mustBe against 81 for shouldBe, so must is the established convention.
When set, the operator wraps the JSON value in an object under the given key before writing — input {"name":"Alice"} with outputProperty = "payload" produces {"payload":{"name":"Alice"}}. When empty (default), the value is written as-is. Applies in both normal and ZIP modes.

Refactors validateJson into parseJson returning JsonNode, which both validates and supplies the parsed node for wrapping. Five new tests cover single and multi-entity wrapping, ZIP mode, array input, and the invalid-JSON error path. Documentation updated with parameter description and example.
@edufuga edufuga requested a review from robertisele June 2, 2026 14:29
The operator can now merge all input entities into a single JSON array file, alongside the existing one-file-per-entity and ZIP modes. Since the three modes are mutually exclusive, the zipOutput boolean is replaced by an outputMode enum (file, zip, jsonArray) implementing EnumerationParameterType, so the plugin framework validates the value and renders the choice with no manual handling. The executor now dispatches on the mode: the per-entity branch is extracted into executeFile, and a new executeMergedJson parses each entity once, applies the outputProperty wrapping per element, collects the values into a single JsonArray, and writes one file. Two small cleanups ride along in the extraction: the repeated "jsonOutput" temp-file prefix becomes a companion-object constant, and the per-entity path uses an explicit loop instead of a map that existed only for its side effects. The existing ZIP tests move to the enum, new tests cover the merged mode, and the operator documentation is reworked to describe all three modes.
@edufuga edufuga marked this pull request as draft June 10, 2026 07:13
Eduard Fugarolas added 13 commits June 10, 2026 09:43
The merged jsonArray mode only had tests for simple object inputs. These pin its current behavior before we settle the number handling: numbers get re-serialized (multiples of ten and big integers turn into E-notation, trailing zeros drop), plus scalar/array/null elements, the default and a custom MIME type, and the duplicate-key and whitespace collapse that come with re-serializing each element. Also covers the merged mode in the workflow test, and adds a mergedTask fixture and a runMerged helper. The expected values are what it does today; they change when the number handling does.
The merged jsonArray mode parsed each element to a node and re-serialized the array, which rewrote numbers (100 became 1E+2), dropped trailing zeros, collapsed duplicate keys, and compacted whitespace — diverging from file and zip modes, which write the input bytes as-is. It now writes each element's original text: parsing still runs, but only to validate, and the parsed node is discarded. When outputProperty is set, the wrapper key is built with proper escaping and the value is spliced in unchanged. The merged-mode tests are updated to expect the preserved output.
Covers cases the suite missed in jsonArray mode: a non-JSON token after a valid value and a comma-separated trailing value at the root are both rejected as invalid input, and whitespace around an element is preserved verbatim.
Drops the per-test copies of the jsonArray task that just repeated the class-level mergedTask field, routing those cases through the existing runMerged helper instead. The two tests that need a differently-configured task keep their own, now named namedMergedTask and wrappedMergedTask so they no longer shadow the field. Also removes the unused prefixes implicit.
The per-test executor boilerplate is replaced by three parameterized functions in the companion object: inputTable, runJsonToFile, and readZipEntries. runJsonToFile takes the executor, schema and task and returns the produced file entities, so every test collapses to a single line that maps over the result for contents, name, mime or size, and the cases that build their own tables inline are gone. The helpers no longer close over instance state, which keeps them reusable, and the now-unused Prefixes import is dropped.
The executor and entity schema become implicit parameters of runJsonToFile alongside the plugin context, so call sites pass only the task. The single-use operator val is folded into its task.
Each test built its source dataset by hand-splicing JSON through Json.stringify; that moves into a sourceContent helper, and the ZIP test's inline read loop becomes a zipEntries helper, both in the companion object.
The merged JSON map mixed reading-and-validating an entity, discarding the parsed node, and wrapping the raw value under the output property all in one block. That moves into two named helpers, readValidatedJson and wrapRawValue, leaving executeMergedJson as a plain read-validate then wrap then join pipeline.
A single entity with invalid JSON or an empty value used to abort the whole task in every mode, and in merged mode that discarded the entire array. Each mode now validates per entity, skips the bad ones, and writes its output from the rest, recording one warning per skipped entity on the execution report through the context the operator is already handed. Wrong input count and unsupported modes still fail fast. partitionValid does the split into valid contents and warnings; reserializedContent and rawContent name the file/zip and merged content strategies. Tests cover the skip-and-warn behaviour across the three modes with mixed and all-invalid input.
…ile tests

The existing skip tests checked that warnings appear; this adds the two cases they missed. One asserts an all-valid merged run produces no warnings, the other that the report's entityCount equals the number of valid entities when one is skipped. runCapturingReport is pulled out to expose the execution report itself, and runWithReport is now built on top of it.
The Input section still said a missing, empty, or malformed value made the operator raise an error and stop, which is no longer how it behaves. This replaces that with a new Invalid input section: invalid entities are skipped per record and recorded as warnings on the execution report while the valid ones are written, an all-skipped run yields the mode's natural empty output (no files / empty array / empty ZIP), and configuration errors like a wrong input count still fail the task.
Reflects that invalid entities are now skipped rather than written.
…t modes

File and zip modes re-serialized the parsed JSON when an output property was set, normalizing numbers and reformatting the value. They now splice the original bytes like merged mode, so all three modes preserve the input verbatim and share a single rawContent function; the unused applyOutputProperty, reserializedContent, readValidatedJson and wrapRawValue helpers are removed. The execution report is now set on every run rather than only on a skip, and file mode gains the empty-input early return the other modes already had. Tests cover number and whitespace preservation with an output property in file and zip modes, the report's entity count in file and zip modes, and an end-to-end skip-and-continue case in the workflow test.
@edufuga edufuga marked this pull request as ready for review June 11, 2026 14:50
* develop:
  Update other libraries that are vulnerable based on the trivy scan
  Spark does not support Java 25 bytecode, so we need to set the target to Java 21
  Use Java 25 in Github pipeline
  Compile with Java 25 target (if available)
  remove unused code
  forward gui elements
  forward gui elements, include layout fixes
  forward gui elements, include Toaster fix
  fix react UMD problem
  pin blueprint to correct version for all sub libs
relatedPlugins = Array(
new PluginReference(
id = JsonParserTask.pluginId,
description = "JSON to File writes the JSON string on each input entity to a file; Parse JSON parses the same kind of input into structured entities driven by a downstream schema."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change "on each input entity" -> "from each..."

val tempFile = File.createTempFile(prefix, suffix)
def createTemp(prefix: String, suffix: String = ".tmp", name: Option[String] = None, mimeType: Option[String] = None): FileEntity = {
val tempFile = name match {
case Some(n) => new File(System.getProperty("java.io.tmpdir"), n)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new approach of creating temp files inrtoduces a security bug. It does not check that the file is actually inside the temp directory. E.g. "../" will escape it.
Even if this would be prevented, this would allow to access any temp file the process has read or write access to.
Also, regarding the "JSON to file" usage, multiple executions with the same output file parameter will try to write into the exact same file.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the file name should be "preserved", this should probably be done with a different approach.

context: ActivityContext[ExecutionReport])
(implicit pluginContext: PluginContext): Option[LocalEntities] = {
if (entities.isEmpty) {
return Some(FileEntitySchema.create(Seq.empty, task))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently will not write a file with the empty array content [], but no file. If the output is e.g. a JSON dataset, the previous, stale file is kept, which is probably unexpected.

context: ActivityContext[ExecutionReport])
(implicit pluginContext: PluginContext): Option[LocalEntities] = {
if (entities.isEmpty) {
return Some(FileEntitySchema.create(Seq.empty, task))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as with the JSON array output, this will not write an empty zip, which might leave stale files in a JSON output dataset.

case None => 0 // Take the value of the first path
}
val trimmedFileName = task.data.outputFileName.trim
val entities = entityTable.entities.toIndexedSeq

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will materialize all entities into memory. Most tasks if possible output streams in order to keep the memory-footprint low.

val zip = new ZipOutputStream(outputStream)
for ((content, index) <- contents.zipWithIndex) {
val entryName = if (trimmedFileName.nonEmpty) {
if (multiple) suffixedName(trimmedFileName, index) else trimmedFileName

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the temp file name creation, ZIP entries should probably also be sanitized.

context: ActivityContext[ExecutionReport])
(implicit pluginContext: PluginContext): Option[LocalEntities] = {
if (entities.isEmpty) {
return Some(FileEntitySchema.create(Seq.empty, task))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This early return prevents that the execution report is ever updated.

)
)
case class JsonToFileOperator(@Param("The Silk path expression of the input entity that contains the JSON string. " +
"If not set, the value of the first defined property will be taken.")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first defined property might be misleading. "defined" could be misunderstood as the first property that has a non-empty value.

inputPath: String = "",
@Param("Filename for the produced file. If left empty, an auto-generated temporary name is used. " +
"In file output mode, when the input contains more than one entity, an index suffix is appended " +
"before the extension (e.g. `out-0.json`, `out-1.json`) to keep filenames unique. " +

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description and the code assumes that a file name containing colons have a file suffix. E.g. something like ".hiddenfile" would end up getting an indexed name like "-0.hiddenfile". Not sure what a good heuristic would be though, maybe at least not adding the index part to the start.

outputFileName: String = "",
@Param("MIME type of the produced file.")
mimeType: String = "application/json",
@Param("Output mode: file writes one file per entity, zip packs all entities into a single ZIP archive, " +

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user will only see the display names of the enums and not the IDs, so this should mention the display names instead.

val outputMimeType = Some(task.data.mimeType)
val outputProperty = task.data.outputProperty
val (contents, warnings) = partitionValid(entities)(rawContent(_, pathIndex, outputProperty))
val serialized = contents.mkString("[", ",", "]")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entities are currently already loaded into memory. But when this changes, creating the array string in-memory would probably not be a good idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants