Add JSON to File operator (CMEM-7616) #1050
A new Silk operator that takes a JSON string carried as a field value on each incoming entity and writes it to a file, surfacing the result downstream as a `FileEntity`. The operator iterates over every input entity, validates each value as JSON via the in-house Jackson-based parser, and produces one file per entity tagged with `application/json`. An optional output filename parameter governs naming, with an index suffix appended when the input carries more than one entity to keep filenames unique. The plugin is registered alongside Parse JSON in `JsonPlugins.scala`, and Parse JSON gains a mutual `relatedPlugins` entry so the workbench surfaces the two as related from either side. Eight executor-level tests cover the contract, including the missing-value and invalid-JSON failure modes; an integration test confirms the produced file lands verbatim in a downstream `JsonDataset` resource.
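The filename rule described above can be sketched as follows. This is a minimal illustration, not the operator's code: the name `suffixedName` matches the helper visible in the diff, but its body here, the `outputNames` wrapper, and the choice to split at the last dot are assumptions.

```scala
object FileNameSuffix {
  /** Inserts "-<index>" before the extension; appends it when there is no extension (assumed behaviour). */
  def suffixedName(fileName: String, index: Int): String = {
    val dot = fileName.lastIndexOf('.')
    if (dot < 0) s"$fileName-$index"
    else s"${fileName.substring(0, dot)}-$index${fileName.substring(dot)}"
  }

  /** Hypothetical helper: the literal name for a single entity, zero-based suffixed names otherwise. */
  def outputNames(fileName: String, count: Int): Seq[String] =
    if (count <= 1) Seq.fill(count)(fileName)
    else (0 until count).map(suffixedName(fileName, _))
}
```

With `out.json` configured, two input entities would yield `out-0.json` and `out-1.json`, while a single entity keeps the literal name.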
The doc covers the operator's input, the file entity produced downstream, and the *Input path* and *Output file name* parameters. The suffix rule for multi-entity input is laid out with a short inline example showing the resulting filenames. A worked single-entity walkthrough at the end mirrors the Parse JSON doc.
Touches the @plugin, @Param, and relatedPlugins descriptions on both operators, the two user-facing docs, and the JSON to File workflow test name.
…lexibleSchemaPort-CMEM-7427
The JSON to File documentation claimed the operator writes bytes verbatim and tagged files with a hardcoded content type. Both were inaccurate or incomplete: writeString encodes to UTF-8, and the content type was not user-configurable. Adds a mimeType parameter (default application/json) to JsonToFileOperator and threads it through the executor, replacing the hardcoded class-level val. The output-filename section now covers the extension-less case. Duplicate phrasing in the Parse JSON Schema section and class-name leakage in both operators' input-count error messages are also fixed.
When zipOutput is enabled, the executor packs all input entities into a single ZIP file with one entry per entity instead of producing one file per entity. Entry naming follows the same suffix convention as the non-ZIP path. The MIME type defaults to application/zip when zipOutput is true and the parameter is at its default value; an explicit value passes through unchanged. Seven executor-level tests cover entry naming, MIME type handling, empty input, and invalid JSON; a workflow-level test confirms the ZIP lands correctly in a downstream dataset resource. Documentation updates the Input, Output, and Parameters sections to cover both modes.
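A rough sketch of the ZIP path, using only `java.util.zip`. The object and method names (`ZipSketch`, `pack`, `unpack`, `effectiveMimeType`) are hypothetical and the archive is held in memory for brevity; the actual executor writes to a temp-file resource.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

object ZipSketch {
  val DefaultMimeType = "application/json"

  /** Packs each (entryName, content) pair into one ZIP archive, one entry per pair. */
  def pack(entries: Seq[(String, String)]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val zip = new ZipOutputStream(bytes)
    try {
      for ((name, content) <- entries) {
        zip.putNextEntry(new ZipEntry(name))
        zip.write(content.getBytes(UTF_8))
        zip.closeEntry()
      }
    } finally zip.close()
    bytes.toByteArray
  }

  /** Reads the archive back as (entryName, content) pairs, decoding explicitly as UTF-8. */
  def unpack(archive: Array[Byte]): Seq[(String, String)] = {
    val zip = new ZipInputStream(new ByteArrayInputStream(archive))
    try {
      Iterator
        .continually(zip.getNextEntry)
        .takeWhile(_ != null)
        .map(entry => entry.getName -> new String(zip.readAllBytes(), UTF_8))
        .toList
    } finally zip.close()
  }

  /** ZIP mode switches the MIME type only when the parameter is still at its default. */
  def effectiveMimeType(configured: String, zipOutput: Boolean): String =
    if (zipOutput && configured == DefaultMimeType) "application/zip" else configured
}
```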
The write loop in executeZip is now wrapped in a try/catch that calls WritableResource.delete() on the backing temp file before re-throwing, so a failed ZIP operation does not leave a partial file in the temp directory until GC. The ZipInputStream readers in the unit test helper and the workflow integration test are closed via try/finally and read with an explicit UTF-8 charset rather than the platform default. A 500-entity scale test is added to confirm the full write loop and entry naming hold at volume.
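The cleanup-on-failure pattern might look roughly like this. The sketch uses `java.nio.file` directly instead of `WritableResource.delete()`, and `writeOrCleanUp` is a hypothetical name, so treat it as an illustration of the idea rather than the executor's code.

```scala
import java.nio.file.{Files, Path}

object CleanupSketch {
  /** Runs `write` against a fresh temp file; deletes the file before re-throwing if `write` fails. */
  def writeOrCleanUp(prefix: String, suffix: String)(write: Path => Unit): Path = {
    val file = Files.createTempFile(prefix, suffix)
    try {
      write(file)
      file
    } catch {
      case e: Throwable =>
        Files.deleteIfExists(file) // do not leave a partial file behind
        throw e
    }
  }
}
```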
…pe, simplify the executor
The createTemp factory gains two trailing optional parameters, name and mimeType, so callers can place a file at a specific path in the temp directory and attach a content type in one call without post-construction copies. All existing callers are unaffected; both parameters default to None and the existing lifecycle lines run unchanged. The named-file logic that allocateNamedFileEntity was duplicating by hand now delegates entirely to createTemp, shrinking it to two lines. The two anonymous-file call sites drop their trailing copy calls in favour of passing mimeType directly, and the non-ZIP write path gets the same partial-failure cleanup already present in ZIP mode. FileResource, FileType, and java.io.File are no longer referenced in the executor and are removed from its imports.
The test for a single input entity with an explicit output file name checked the file name but not the MIME type. Since the executor sets application/json on every non-ZIP entity, that property was covered implicitly in other paths but never asserted here. One assertion added.
The two workflow tests duplicated about 40 lines of workspace setup each. Extracted a shared runWorkflow helper to the companion object (not the class — the method has no instance state dependency), with an explicit return type, a Scaladoc comment, and an outputName parameter that defaults to "output.json". The ZIP test passes "output.zip", which is load-bearing: JsonDataset uses the file extension to determine ZIP mode. Implicit dependencies (UserContext, PluginContext, Prefixes) are threaded as explicit implicit parameters; call sites are unaffected since the implicits are in scope on the class.
LocalJsonToFileWorkflowTest was importing org.scalatest.matchers.should.Matchers while the rest of the feature's test files use org.scalatest.matchers.must.Matchers. Import updated; all four shouldBe assertions replaced with mustBe. 134 files in the codebase use mustBe against 81 for shouldBe, so must is the established convention.
When set, the operator wraps the JSON value in an object under the given key before writing — input {"name":"Alice"} with outputProperty = "payload" produces {"payload":{"name":"Alice"}}. When empty (default), the value is written as-is. Applies in both normal and ZIP modes.
Refactors validateJson into parseJson returning JsonNode, which both validates and supplies the parsed node for wrapping. Five new tests cover single and multi-entity wrapping, ZIP mode, array input, and the invalid-JSON error path. Documentation updated with parameter description and example.
The operator can now merge all input entities into a single JSON array file, alongside the existing one-file-per-entity and ZIP modes. Since the three modes are mutually exclusive, the zipOutput boolean is replaced by an outputMode enum (file, zip, jsonArray) implementing EnumerationParameterType, so the plugin framework validates the value and renders the choice with no manual handling. The executor now dispatches on the mode: the per-entity branch is extracted into executeFile, and a new executeMergedJson parses each entity once, applies the outputProperty wrapping per element, collects the values into a single JsonArray, and writes one file. Two small cleanups ride along in the extraction: the repeated "jsonOutput" temp-file prefix becomes a companion-object constant, and the per-entity path uses an explicit loop instead of a map that existed only for its side effects. The existing ZIP tests move to the enum, new tests cover the merged mode, and the operator documentation is reworked to describe all three modes.
The merged jsonArray mode only had tests for simple object inputs. These pin its current behavior before we settle the number handling: numbers get re-serialized (multiples of ten and big integers turn into E-notation, trailing zeros drop), plus scalar/array/null elements, the default and a custom MIME type, and the duplicate-key and whitespace collapse that come with re-serializing each element. Also covers the merged mode in the workflow test, and adds a mergedTask fixture and a runMerged helper. The expected values are what it does today; they change when the number handling does.
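The number changes listed here match what plain `java.math.BigDecimal` prints after `stripTrailingZeros`. Whether the parser normalizes numbers this way internally is an assumption; the sketch only shows that standard JDK behaviour reproduces the observed output. `NumberRoundTrip` and its methods are hypothetical names.

```scala
import java.math.BigDecimal

object NumberRoundTrip {
  /** Prints a numeric literal after stripping trailing zeros (may switch to E-notation). */
  def normalized(literal: String): String =
    new BigDecimal(literal).stripTrailingZeros().toString

  /** Prints the literal without an exponent, keeping its scale. */
  def plain(literal: String): String =
    new BigDecimal(literal).toPlainString
}
```

For example, `normalized("100")` gives `1E+2` and `normalized("1.50")` gives `1.5`, while `plain` returns both literals unchanged.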
The merged jsonArray mode parsed each element to a node and re-serialized the array, which rewrote numbers (100 became 1E+2), dropped trailing zeros, collapsed duplicate keys, and compacted whitespace — diverging from file and zip modes, which write the input bytes as-is. It now writes each element's original text: parsing still runs, but only to validate, and the parsed node is discarded. When outputProperty is set, the wrapper key is built with proper escaping and the value is spliced in unchanged. The merged-mode tests are updated to expect the preserved output.
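A minimal sketch of the splice-in approach, assuming the element text has already been validated elsewhere. `RawWrap`, `jsonString`, `wrap`, and `mergeRaw` are hypothetical names; only the `mkString("[", ",", "]")` join is visible in the diff.

```scala
object RawWrap {
  /** Escapes a string for use as a JSON string literal, surrounding quotes included. */
  def jsonString(s: String): String = {
    val sb = new StringBuilder("\"")
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case '\n' => sb.append("\\n")
      case '\r' => sb.append("\\r")
      case '\t' => sb.append("\\t")
      case c if c < ' ' => sb.append("\\u").append("%04x".format(c.toInt))
      case c => sb.append(c)
    }
    sb.append('"').toString
  }

  /** Wraps `raw` under `key` without re-serializing it; an empty key means no wrapper. */
  def wrap(raw: String, key: String): String =
    if (key.isEmpty) raw else "{" + jsonString(key) + ":" + raw + "}"

  /** Joins raw element texts into one JSON array text, leaving each element untouched. */
  def mergeRaw(elements: Seq[String]): String = elements.mkString("[", ",", "]")
}
```

Because the value is never re-serialized, a literal such as `100` or `1.50` and any whitespace inside an element come out exactly as they went in.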
Covers cases the suite missed in jsonArray mode: a non-JSON token after a valid value and a comma-separated trailing value at the root are both rejected as invalid input, and whitespace around an element is preserved verbatim.
Drops the per-test copies of the jsonArray task that just repeated the class-level mergedTask field, routing those cases through the existing runMerged helper instead. The two tests that need a differently-configured task keep their own, now named namedMergedTask and wrappedMergedTask so they no longer shadow the field. Also removes the unused prefixes implicit.
The per-test executor boilerplate is replaced by three parameterized functions in the companion object: inputTable, runJsonToFile, and readZipEntries. runJsonToFile takes the executor, schema and task and returns the produced file entities, so every test collapses to a single line that maps over the result for contents, name, mime or size, and the cases that build their own tables inline are gone. The helpers no longer close over instance state, which keeps them reusable, and the now-unused Prefixes import is dropped.
The executor and entity schema become implicit parameters of runJsonToFile alongside the plugin context, so call sites pass only the task. The single-use operator val is folded into its task.
Each test built its source dataset by hand-splicing JSON through Json.stringify; that moves into a sourceContent helper, and the ZIP test's inline read loop becomes a zipEntries helper, both in the companion object.
The merged JSON map mixed reading-and-validating an entity, discarding the parsed node, and wrapping the raw value under the output property all in one block. That moves into two named helpers, readValidatedJson and wrapRawValue, leaving executeMergedJson as a plain read-validate then wrap then join pipeline.
A single entity with invalid JSON or an empty value used to abort the whole task in every mode, and in merged mode that discarded the entire array. Each mode now validates per entity, skips the bad ones, and writes its output from the rest, recording one warning per skipped entity on the execution report through the context the operator is already handed. Wrong input count and unsupported modes still fail fast. partitionValid does the split into valid contents and warnings; reserializedContent and rawContent name the file/zip and merged content strategies. Tests cover the skip-and-warn behaviour across the three modes with mixed and all-invalid input.
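The skip-and-warn split could be sketched like this. The name `partitionValid` and its `(contents, warnings)` result shape follow the diff, but the body, the `Either`-based reader signature, and the stand-in validator `readNonBlank` are assumptions; real JSON validation is left out. The sketch relies on `partitionMap`, which needs Scala 2.13 or later.

```scala
object SkipAndWarn {
  /** Applies `read` to each item; returns the valid contents and one warning per skipped item. */
  def partitionValid[A](items: Seq[A])(read: A => Either[String, String]): (Seq[String], Seq[String]) = {
    // partitionMap puts Left values first and Right values second
    val (warnings, contents) = items.map(read).partitionMap(identity)
    (contents, warnings)
  }

  /** Stand-in reader that only rejects blank values; it does not parse JSON. */
  def readNonBlank(value: String): Either[String, String] =
    if (value.trim.isEmpty) Left("Skipped entity: empty value") else Right(value)
}
```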
…ile tests
The existing skip tests checked that warnings appear; this adds the two cases they missed. One asserts an all-valid merged run produces no warnings, the other that the report's entityCount equals the number of valid entities when one is skipped. runCapturingReport is pulled out to expose the execution report itself, and runWithReport is now built on top of it.
The Input section still said a missing, empty, or malformed value made the operator raise an error and stop, which is no longer how it behaves. This replaces that with a new Invalid input section: invalid entities are skipped per record and recorded as warnings on the execution report while the valid ones are written, an all-skipped run yields the mode's natural empty output (no files / empty array / empty ZIP), and configuration errors like a wrong input count still fail the task.
Reflects that invalid entities are now skipped rather than written.
…t modes
File and zip modes re-serialized the parsed JSON when an output property was set, normalizing numbers and reformatting the value. They now splice the original bytes like merged mode, so all three modes preserve the input verbatim and share a single rawContent function; the unused applyOutputProperty, reserializedContent, readValidatedJson and wrapRawValue helpers are removed. The execution report is now set on every run rather than only on a skip, and file mode gains the empty-input early return the other modes already had. Tests cover number and whitespace preservation with an output property in file and zip modes, the report's entity count in file and zip modes, and an end-to-end skip-and-continue case in the workflow test.
* develop:
  Update other libraries that are vulnerable based on the trivy scan
  Spark does not support Java 25 bytecode, so we need to set the target to Java 21
  Use Java 25 in Github pipeline
  Compile with Java 25 target (if available)
  remove unused code
  forward gui elements
  forward gui elements, include layout fixes
  forward gui elements, include Toaster fix
  fix react UMD problem
  pin blueprint to correct version for all sub libs
relatedPlugins = Array(
  new PluginReference(
    id = JsonParserTask.pluginId,
    description = "JSON to File writes the JSON string on each input entity to a file; Parse JSON parses the same kind of input into structured entities driven by a downstream schema."
Maybe change "on each input entity" -> "from each..."
val tempFile = File.createTempFile(prefix, suffix)
def createTemp(prefix: String, suffix: String = ".tmp", name: Option[String] = None, mimeType: Option[String] = None): FileEntity = {
  val tempFile = name match {
    case Some(n) => new File(System.getProperty("java.io.tmpdir"), n)
This new approach of creating temp files introduces a security bug. It does not check that the file is actually inside the temp directory; for example, a name containing "../" will escape it.
Even if this were prevented, it would still allow access to any temp file the process has read or write access to.
Also, regarding the "JSON to file" usage, multiple executions with the same output file parameter will try to write to the exact same file.
If the file name should be "preserved", this should probably be done with a different approach.
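One possible shape for such an approach, sketched with `java.nio.file` only: give every call its own private temp directory and accept the requested name only if it resolves to a direct child of that directory. `SafeTempFile` and `createNamed` are hypothetical names, and this is an illustration rather than a proposal for the actual `createTemp` signature.

```scala
import java.nio.file.{Files, Path}

object SafeTempFile {
  /** Creates `name` inside a fresh, private temp directory; rejects names that would leave it. */
  def createNamed(prefix: String, name: String): Path = {
    // A new directory per call, so equal names from different executions never collide
    val dir = Files.createTempDirectory(prefix).toRealPath()
    val target = dir.resolve(name).normalize()
    // Only direct children are accepted: rejects "../x", "a/b", absolute paths, "" and "."
    if (target.getParent != dir) {
      Files.delete(dir)
      throw new IllegalArgumentException(s"Invalid file name: $name")
    }
    Files.createFile(target)
  }
}
```

The file keeps its requested name, while the unique parent directory addresses both the traversal concern and the same-name collision between executions.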
    context: ActivityContext[ExecutionReport])
    (implicit pluginContext: PluginContext): Option[LocalEntities] = {
  if (entities.isEmpty) {
    return Some(FileEntitySchema.create(Seq.empty, task))
This currently writes no file at all rather than a file with the empty array content []. If the output is e.g. a JSON dataset, the previous, stale file is kept, which is probably unexpected.
    context: ActivityContext[ExecutionReport])
    (implicit pluginContext: PluginContext): Option[LocalEntities] = {
  if (entities.isEmpty) {
    return Some(FileEntitySchema.create(Seq.empty, task))
Same as with the JSON array output, this will not write an empty zip, which might leave stale files in a JSON output dataset.
  case None => 0 // Take the value of the first path
}
val trimmedFileName = task.data.outputFileName.trim
val entities = entityTable.entities.toIndexedSeq
This will materialize all entities into memory. Where possible, most tasks stream their output in order to keep the memory footprint low.
val zip = new ZipOutputStream(outputStream)
for ((content, index) <- contents.zipWithIndex) {
  val entryName = if (trimmedFileName.nonEmpty) {
    if (multiple) suffixedName(trimmedFileName, index) else trimmedFileName
Similar to the temp file name creation, ZIP entries should probably also be sanitized.
    context: ActivityContext[ExecutionReport])
    (implicit pluginContext: PluginContext): Option[LocalEntities] = {
  if (entities.isEmpty) {
    return Some(FileEntitySchema.create(Seq.empty, task))
This early return prevents the execution report from ever being updated.
  )
)
case class JsonToFileOperator(@Param("The Silk path expression of the input entity that contains the JSON string. " +
  "If not set, the value of the first defined property will be taken.")
"first defined property" might be misleading: "defined" could be misunderstood as the first property that has a non-empty value.
inputPath: String = "",
@Param("Filename for the produced file. If left empty, an auto-generated temporary name is used. " +
  "In file output mode, when the input contains more than one entity, an index suffix is appended " +
  "before the extension (e.g. `out-0.json`, `out-1.json`) to keep filenames unique. " +
The description and the code assume that a file name containing a dot has a file suffix. E.g. something like ".hiddenfile" would end up getting an indexed name like "-0.hiddenfile". Not sure what a good heuristic would be though; maybe at least avoid adding the index part at the start.
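One possible heuristic for the leading-dot case, sketched under the assumption that a dot at position 0 belongs to the name rather than starting an extension. `HiddenFileSuffix` is a hypothetical name, and the body is an illustration, not the operator's current `suffixedName`.

```scala
object HiddenFileSuffix {
  /** Inserts "-<index>" before the extension, but never at the very start of the name. */
  def suffixedName(fileName: String, index: Int): String = {
    val dot = fileName.lastIndexOf('.')
    // dot <= 0 covers both "no dot" and "leading dot only" (e.g. ".hiddenfile")
    if (dot <= 0) s"$fileName-$index"
    else s"${fileName.substring(0, dot)}-$index${fileName.substring(dot)}"
  }
}
```

With this rule `.hiddenfile` becomes `.hiddenfile-0`, while `out.json` still becomes `out-0.json`.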
outputFileName: String = "",
@Param("MIME type of the produced file.")
mimeType: String = "application/json",
@Param("Output mode: file writes one file per entity, zip packs all entities into a single ZIP archive, " +
The user will only see the display names of the enums and not the IDs, so this should mention the display names instead.
val outputMimeType = Some(task.data.mimeType)
val outputProperty = task.data.outputProperty
val (contents, warnings) = partitionValid(entities)(rawContent(_, pathIndex, outputProperty))
val serialized = contents.mkString("[", ",", "]")
The entities are currently already loaded into memory. But when this changes, creating the array string in-memory would probably not be a good idea.
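If the entities are streamed later, the array could be written element by element instead of being built as one string. A minimal sketch, assuming each element arrives as already-validated raw JSON text; `StreamingArray` and `writeArray` are hypothetical names.

```scala
import java.io.Writer

object StreamingArray {
  /** Writes the elements as one JSON array, pulling them one at a time from the iterator. */
  def writeArray(elements: Iterator[String], out: Writer): Unit = {
    out.write("[")
    var first = true
    for (element <- elements) {
      if (!first) out.write(",")
      out.write(element)
      first = false
    }
    out.write("]")
  }
}
```

Only one element is held at a time, and an empty iterator still produces `[]`.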
This PR is based on develop. It depends on PR [#1041] (CMEM-7427).
CMEM-7616: JSON to File operator
Adds an operator that reads a JSON string from a field on each input entity and writes it to a file, surfacing the result as a `FileEntity` for downstream file-backed operators or datasets.
Why
Parse JSON consumes a JSON string and produces structured entities; it does not preserve the string. Users who need to pass the JSON value to a file-backed operator downstream (for archival, transfer, or further processing) had no operator for that. `JsonToFileOperator` fills that gap.
Changes
`JsonToFileOperator` is a `CustomTask` registered in `JsonPlugins`. Its input port mirrors Parse JSON: one entity carrying a JSON string. Its output is a `FileEntity`. For each input entity the operator reads the value at the configured input path, validates it as JSON, and writes it to a file tagged `application/json`. When the input contains more than one entity, a zero-based index suffix is appended to the filename before the extension to keep output names unique and predictable. If no input path is configured, the operator defaults to the first field on the input entity, consistent with Parse JSON's behaviour.
An optional `zipOutput` flag switches to ZIP mode: instead of one file per entity, all entities are packed into a single ZIP archive with one entry per entity. Entry names follow the same suffix scheme as the single-file path. The produced `FileEntity` carries `application/zip` in ZIP mode. If the write loop fails mid-archive the partial temp file is deleted before the exception propagates, so callers never receive a broken ZIP.
An optional `outputProperty` parameter wraps the JSON value in an object under the given key before writing. With `outputProperty` set to `payload`, an input value `{"name":"Alice"}` is written as `{"payload":{"name":"Alice"}}`. When left empty (the default), the value is written as-is. The wrapping applies in both normal and ZIP modes. Internally, `validateJson` is replaced by `parseJson` returning a `JsonNode`, which validates and supplies the parsed node for wrapping in a single pass.
A third `outputMode`, `jsonArray`, merges all input entities' values into a single JSON array file, splicing each element verbatim.
`JsonParserTask` gains a mutual `relatedPlugins` entry pointing at `JsonToFileOperator`, and vice versa, so the workbench surfaces the relationship from either side.
Tests
`LocalJsonToFileOperatorExecutorTest` covers the three output modes independently. For `file` mode: single and multi-entity output, the literal-versus-suffixed filename rule, `application/json` MIME-type propagation, and input-path defaulting. For `zip` mode: single-entity packing under `entry.json`, multi-entity packing with suffixed entry names, a named ZIP container, a single literal entry name, an explicitly configured MIME type, empty input, and a 500-entity scale run confirming the write loop and entry naming hold at volume. For `jsonArray` mode: single and multi-entity merging into one array in input order, a single named output file with no index suffix, the default and explicitly configured MIME types, and the value-fidelity cases below.
Invalid input is skipped per entity rather than aborting the task, asserted in every mode: an empty value and an unparseable value are each skipped and recorded as a warning on the execution report while the valid entities still produce output; a mixed batch writes the valid records and skips the bad one, with contiguous file and ZIP entry names across the gap; a fully-skipped run yields the mode's natural empty output (no files, an empty `[]`, or a ZIP with no entries); and the report's `entityCount` (plus the skip-warning count) is asserted in `file`, `zip`, and `jsonArray` modes, with an all-valid merged run asserting no warnings.
Byte fidelity in `jsonArray` mode is pinned directly: trailing decimal zeros, large-magnitude integers, plain and multiple-of-ten integers, scalar / array / null elements, duplicate object keys, and pretty-printed internal formatting are written through verbatim; a non-JSON token after a valid value and a comma-separated trailing value at the root are rejected (skipped and warned); and whitespace padding around an element is preserved.
`outputProperty` wrapping is covered across modes: single and multiple entities, ZIP entries, a JSON-array input, a skipped invalid value, and number- and whitespace-preservation in `file` and `zip` when the wrapper is set.
`LocalJsonToFileWorkflowTest` runs the operator end-to-end in a three-node in-memory workflow: `file` mode asserting the output resource holds the source JSON byte-for-byte, `zip` mode asserting entry names and content across two entities, `jsonArray` mode asserting the merged array, and a skip case asserting an invalid record is dropped while the valid ones flow through to the downstream dataset.
Task: https://jira.eccenca.com/browse/CMEM-7616