Make Splunk output sourcetype configurable in CLF (#3251)
marpears wants to merge 4 commits into openshift:master
Conversation
/hold
jcantrill left a comment
There is no functional test to support this addition. Also note we do not backport features to earlier releases
// SourceType can be used to specify a pretrained or custom sourcetype in Splunk.
// If not specified, `sourceType` will be "_json" or be determined automatically when using `payloadKey` based on the type of the final event payload.
... be determined automatically ...
I generally do not like this change, but this statement is inaccurate. There is no 'automatic' determination: the value configured in the sink is the same as the value provided in the API.
This field is a hint to the Splunk receiver on how to parse the incoming message field; why it can't figure this out on the server side is beyond me. By allowing users to configure this value, we are allowing them to misrepresent the actual log format of the workload. Given that the logs arriving at a single sink come from various workloads across the entire cluster, the 'pretrained' model may not apply to every log. How do we guarantee that with this implementation?
The CLF is intentionally opinionated, meaning the current implementation is correct: you either get a JSON blob of ViaQ format or generic text from some other part of the ViaQ payload. IMO the proper way to add this feature is to evaluate the message field and dynamically set the sourcetype based upon a pattern found in the message. The caveat is whether it can be set for individual log messages.
That being said, what does it mean for the user to set this value to, say, "apache http" when the sink additionally handles some well-known java log4j-formatted logs?
- Does the receiver politely process everything?
- Are we going to be on the hook to "fix" the case where the logs don't match the sourceType?
Hi @jcantrill thanks for reviewing this PR. In reply to your comments:
The problem this feature is intended to resolve is a scenario where a user has configured custom source types in Splunk to interpret particular app log message content and does not want to use the default source type of `_json`.
As an example, custom source types can be configured to trim the event payload, extract key/value pairs from `.message`, and transform them into fields for use with dashboards.
The inability to define a source type with the CLO is particularly an issue for a user who is moving away from a self-managed log collector that allows the source type to be defined, and who has a mature suite of Splunk custom source types and dashboards. When the source type is `_json` instead of the required custom source type, dashboards and search queries can break if they rely on fields that have not been extracted, and Splunk costs also increase if the log event is not trimmed. Before the CLO supported native Splunk forwarding, users would often implement a self-managed fluentd instance, but the fluentd Splunk HEC plugin is now archived. This, together with the native support for Splunk forwarding, is a reason why a user would seek to move to the CLO.
If a source type needs to be referenced for a particular application's log events, but is not appropriate for other apps, the ClusterLogForwarder would need multiple splunk outputs so that it does not misrepresent the format of log events for other applications.
I have altered the text that referred to "automatic" source type determination to try to make this part clearer, and added examples of CLF configurations that use the `sourceType` feature in docs/features/logforwarding/outputs/splunk-forwarding.adoc.
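For illustration, a hypothetical sketch of that multi-output arrangement (the output names, input selector, URL, and the `acme:app` sourcetype are all made-up values, and the exact field layout should be checked against the observability API):

```yaml
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
spec:
  inputs:
    # Hypothetical input selecting only the app with a custom sourcetype.
    - name: acme-app
      type: application
      application:
        selector:
          matchLabels:
            app: acme
  outputs:
    # Logs from the acme app use its custom sourcetype.
    - name: splunk-acme
      type: splunk
      splunk:
        url: https://splunk.example.com:8088
        payloadKey: ".message"
        sourceType: "acme:app"
    # Everything else keeps the default _json sourcetype.
    - name: splunk-default
      type: splunk
      splunk:
        url: https://splunk.example.com:8088
  pipelines:
    - name: acme-logs
      inputRefs: [acme-app]
      outputRefs: [splunk-acme]
    - name: other-logs
      inputRefs: [application]
      outputRefs: [splunk-default]
```

Routing the single app with a matching Splunk sourcetype through its own output keeps the format hint accurate for that workload while leaving every other workload on the opinionated default.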
If a source type needs to be referenced for a particular application's log events, but is not appropriate for other apps, the ClusterLogForwarder would need multiple splunk outputs so that it does not misrepresent the format of log events for other applications.
This is exactly the concern I have, since fundamentally it requires users to know the log format of the workloads on the cluster and how to wire them from source container to output. This is precisely the behavior we explicitly wanted to avoid since before the introduction of CRIO; this was the problem to be solved with the initial fluentd and regex of the source. Frankly, IMO, it completely defeats the purpose of an operator being the thing that manages your deployments.
That being said, we discussed this change within our architecture meeting and landed on fundamentally what you describe based on a customer's usage and experience with Splunk. Essentially, we would expect the code to be further changed to:
- defining no messagePayloadKey, the functionality remains as-is
- defining no sourceType with messagePayloadKey, the functionality remains as-is
- defining sourceType with messagePayloadKey, the functionality populates the spec'd sourceType
- add a BIG BOLD WARNING to the API and repo docs that basically says you get what you asked for and that care needs to be taken when crafting pipelines
I will re-review the comment and try to comment these items accordingly. Additionally, I will review the vector config api to confirm that sourceType can use templates. The best use of this feature, if vector supports it, would be for users to label their workloads with the sourceType and then the collector could use that value when forwarding logs.
… of sourceType use to docs
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: marpears
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
jcantrill left a comment
Please add e2e validation tests and at least one functional test
//
// +kubebuilder:validation:Optional
// +operator-sdk:csv:customresourcedefinitions:type=spec,displayName="SourceType",xDescriptors={"urn:alm:descriptor:com.tectonic.ui:text"}
SourceType string `json:"sourceType,omitempty"`
- Add a validation rule similar to https://github.com/marpears/cluster-logging-operator/blob/b41e2067585f143ab6b4ffa656ceae5e9d779a16/api/observability/v1/output_types.go#L80 that allows setting this value only when the payloadKey is modified
- Add comment that identifies this field can be templated along with a validation rule like https://github.com/marpears/cluster-logging-operator/blob/b41e2067585f143ab6b4ffa656ceae5e9d779a16/api/observability/v1/output_types.go#L1177
- Add a note that it is on the administrator to configure the pipeline so the sourceType matches the log entry. The collector makes no effort and performs no validation to ensure they match.
As an aside to the last point, I do not know how Splunk behaves with a mismatch which is a little concerning to me.
Also there are api validation tests here https://github.com/marpears/cluster-logging-operator/tree/b41e2067585f143ab6b4ffa656ceae5e9d779a16/test/e2e/collection/apivalidations
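A minimal sketch of how the first suggestion might look as a cross-field CEL rule on the output struct (the field names match the PR, but the rule text, message, and struct name are assumptions, not the merged API):

```go
// SplunkCustom is a hypothetical excerpt of the Splunk output spec.
//
// The XValidation marker below rejects a spec that sets sourceType
// without also setting payloadKey, mirroring the agreed behavior.
//
// +kubebuilder:validation:XValidation:rule="!has(self.sourceType) || has(self.payloadKey)",message="sourceType is only supported when payloadKey is also defined"
type SplunkCustom struct {
	// PayloadKey specifies the record field to use as the payload.
	// +kubebuilder:validation:Optional
	PayloadKey string `json:"payloadKey,omitempty"`

	// SourceType specifies a pretrained or custom sourcetype in Splunk.
	// +kubebuilder:validation:Optional
	SourceType string `json:"sourceType,omitempty"`
}
```

With a rule of this shape the invalid combination is rejected at admission time by the API server rather than surfacing later as a misconfigured collector.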
5. `indexedFields`: Optional. Fields to be added to Splunk index.
6. `payloadKey`: Optional. Specifies record field to use as payload.
7. `compression`: Optional. Compression configuration, available to are: `none`, `gzip`. Default is `none`.
5. `sourceType`: Optional. Can be used to specify a pretrained or custom sourcetype in Splunk. If not specified it will default to `_json` and be detected by .payloadkey.
There is nothing 'detected' by the collector or operator. I recommend:
- linking to the list of pretrained sourcetypes
`sourceType`: Optional. Can be used to specify a pretrained or custom sourcetype in Splunk when the payloadKey is additionally defined. If not specified, it will default to `_json`.
To ensure consistency and meaningful categorization, the `source` value can be dynamically derived from the `log_type` and `log_source` fields, following Cluster Log Forwarder's conventions.

=== `sourceType`
This would be a good place to add an example where the sourceType is defined as a pod label and the CLF specs the sourceType as a templated field to use from the pod meta. This highlights the value of labels, which is a core concept.
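For instance, a sketch of that label-driven approach, assuming the CLF field-template syntax (`{.path||"fallback"}`) is also honored for `sourceType` (the `logFormat` label name and its values are hypothetical):

```yaml
# Workload pods carry a label describing their log format, e.g.:
#   metadata:
#     labels:
#       logFormat: apache_access
spec:
  outputs:
    - name: splunk-labeled
      type: splunk
      splunk:
        url: https://splunk.example.com:8088
        payloadKey: ".message"
        # Take the sourcetype from the pod label; fall back to _json.
        sourceType: '{.kubernetes.labels.logFormat||"_json"}'
```

This keeps the mapping between a workload and its log format next to the workload itself, rather than hard-wired in the forwarder spec.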
** `_json` — used when `payloadKey` points to an object.
** `generic_single_line` — used when the payload is a primitive value (e.g., string, number, boolean).

NOTE: Use `payloadKey` carefully. Selecting a single field as the payload may cause other important information in the log to be dropped, potentially leading to inconsistent or incomplete log events.
My expectation is the greatest value comes from when this points to the message field
By default, `payloadKey` is not set, which means the complete log record is forwarded as the payload.

If `sourceType` is not defined, payloadKey can override the default `sourcetype` value of `_json`:
Please explain. I don't understand what this means.
Description

This PR allows for configuration of the Splunk output source type in the CLF using a new `sourceType` field. This is so that we can support users who have defined custom source types in Splunk.

If `sourceType` is not defined, then the current behavior is preserved where `_json` is the default and can be overridden if `payloadKey` is set, based on the type of the final event payload.

CLF configuration example:
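The example block itself did not survive extraction; the following is a hypothetical sketch of such a spec (the API version follows the observability API, while the URL, secret names, and sourcetype value are illustrative assumptions):

```yaml
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
spec:
  serviceAccount:
    name: logcollector
  outputs:
    - name: splunk-custom
      type: splunk
      splunk:
        url: https://splunk.example.com:8088
        authentication:
          token:
            from: secret
            secret:
              name: splunk-secret
              key: hecToken
        payloadKey: ".message"
        # Hypothetical custom sourcetype defined on the Splunk side.
        sourceType: "my_custom:sourcetype"
  pipelines:
    - name: app-to-splunk
      inputRefs: [application]
      outputRefs: [splunk-custom]
```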
/cc @Clee2691 @cahartma
/assign @jcantrill
/cherrypick release-6.4
/cherrypick release-6.5
Links