./src/test/resources/tpcds-spark-2.4/ >>> Without Hyperspace: this directory is fully used by TPCDSSparkSuite.
    indexconfigs/ >>> Empty index config. This records how default Spark would perform.
    flags/ >>> Specialized configs for the setup without Hyperspace.
    approvedSimplifiedPlans/ >>> Simplified plans, generated once and stored for validation.
    queries/ >>> Query files, one for each query.
./src/test/resources/tpcdsBasic/ >>> This directory is fully used by TPCDSBasicSuite. Other setups get similar directories.
    indexconfigs/ >>> Index configs. Could be a conf file with index definitions.
    flags/ >>> Specialized configs for each setup (e.g. with/without hybrid scan).
    approvedSimplifiedPlans/ >>> Simplified plans, generated once and stored for validation.
    queries/ >>> Query files, one for each query.
./src/test/resources/tpcdsOther/
    ... >>> Similar setup as above.
Indexcreator.createIndex(sourceTable, indexConfig): Unit => creates <index_storage_location>/<index_name>/_hyperspace_log/0
<index_storage_location>/<index_name>/_hyperspace_log/0:
{
"name" : "filterIndex",  // name
"derivedDataset" : {
"properties" : {
"columns" : {
"indexed" : [ "c3" ],  // indexCols
"included" : [ "c1" ]  // included cols
},
"schemaString" : "{\"type\":\"struct\",\"fields\":[{\"name\":\"c3\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}",  // schema
"numBuckets" : 200,
"properties" : {
"hasParquetAsSourceFormat" : "true"
}
},
"kind" : "CoveringIndex"
},
"content" : {
"root" : {  // index_storage
"name" : "file:/C:/",
"files" : [ ],
"subDirs" : [ {
"name" : "Users",
"files" : [ ],
"subDirs" : [ {
"name" : "apdave",
"files" : [ ],
"subDirs" : [ {
"name" : "github",
"files" : [ ],
"subDirs" : [ {
"name" : "hyperspace-1",
"files" : [ ],
"subDirs" : [ {
"name" : "src",
"files" : [ ],
"subDirs" : [ {
"name" : "test",
"files" : [ ],
"subDirs" : [ {
"name" : "resources",
"files" : [ ],
"subDirs" : [ {
"name" : "indexLocation",
"files" : [ ],
"subDirs" : [ {
"name" : "filterIndex",
"files" : [ ],
"subDirs" : [ {
"name" : "v__=0",
"files" : [ {
"name" : "somefile.parquet",  // arbitrary file info
"size" : 10,
"modifiedTime" : 1612989388690,
"id" : 0
}],
"subDirs" : [ ]
} ]
} ]
} ]
} ]
} ]
} ]
} ]
} ]
} ]
} ]
},
"fingerprint" : {
"kind" : "NoOp",
"properties" : { }
}
},
"source" : {
"plan" : {
"properties" : {
"relations" : [ {
"rootPaths" : [ "file:/C:/Users/apdave/github/hyperspace-1/src/test/resources/e2eTests/lineitem" ],
"data" : {
"properties" : {
"content" : {
"root" : {
"name" : "file:/C:/",
"files" : [ ],
"subDirs" : [ {
"name" : "Users",
"files" : [ ],
"subDirs" : [ {
"name" : "apdave",
"files" : [ ],
"subDirs" : [ {
"name" : "github",
"files" : [ ],
"subDirs" : [ {
"name" : "hyperspace-1",
"files" : [ ],
"subDirs" : [ {
"name" : "src",
"files" : [ ],
"subDirs" : [ {
"name" : "test",
"files" : [ ],
"subDirs" : [ {
"name" : "resources",
"files" : [ ],
"subDirs" : [ {
"name" : "e2eTests",
"files" : [ ],
"subDirs" : [ {
"name" : "lineitem",
"files" : [ ],  // empty Content object
"subDirs" : [ ]
} ]
} ]
} ]
} ]
} ]
} ]
} ]
} ]
} ]
},
"fingerprint" : {
"kind" : "NoOp",
"properties" : { }
}
},
"update" : null
},
"kind" : "HDFS"
},
"dataSchemaJson" : "{\"type\":\"struct\",\"fields\":[{\"name\":\"c1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c2\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c3\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c4\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c5\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}",  // schema from catalog
"fileFormat" : "parquet",
"options" : { }
} ],
"rawPlan" : null,
"sql" : null,
"fingerprint" : {
"properties" : {
"signatures" : [ {
"provider" : "com.microsoft.hyperspace.index.MockSignatureProvider",  // fixed provider
"value" : "lineitem"  // returns source table name
} ]
},
"kind" : "LogicalPlan"
}
},
"kind" : "Spark"
}
},
"properties" : { },
"version" : "0.1",
"id" : 1,
"state" : "ACTIVE",
"timestamp" : 1612998769321,
"enabled" : true
}
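The nested "root" objects in the log entry above encode one directory per level, so the full path of a file is recovered by joining names from the root down. A minimal sketch of that traversal follows; `FileInfo`, `Directory`, and `ContentPaths` are hypothetical stand-ins that mirror the JSON fields, not Hyperspace's actual classes.

```scala
// Hypothetical, simplified mirror of the "content.root" structure in the log entry.
final case class FileInfo(name: String, size: Long, modifiedTime: Long, id: Long)
final case class Directory(name: String, files: Seq[FileInfo], subDirs: Seq[Directory])

object ContentPaths {
  // Recursively joins directory names to produce the full path of every file.
  def allFiles(dir: Directory, prefix: String = ""): Seq[String] = {
    // The root name ("file:/C:/") already ends in a slash, so strip it before joining.
    val here = if (prefix.isEmpty) dir.name.stripSuffix("/") else s"$prefix/${dir.name}"
    dir.files.map(f => s"$here/${f.name}") ++ dir.subDirs.flatMap(allFiles(_, here))
  }

  // Shortened version of the index content tree shown in the log entry.
  val leaf: Directory =
    Directory("v__=0", Seq(FileInfo("somefile.parquet", 10L, 1612989388690L, 0L)), Nil)
  val root: Directory =
    Directory("file:/C:/", Nil,
      Seq(Directory("indexLocation", Nil, Seq(Directory("filterIndex", Nil, Seq(leaf))))))
}
```

With the shortened tree, `ContentPaths.allFiles(ContentPaths.root)` yields the single path `file:/C:/indexLocation/filterIndex/v__=0/somefile.parquet`. A tree with empty "files" at every level, like the source "lineitem" content, yields no paths.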
To regenerate the golden files for the entire suite, run:
{{{
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite"
}}}
To regenerate the golden file for a single test, run:
{{{
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q49)"
}}}
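The commands above switch the suite from comparing plans to regenerating them through an environment variable. A minimal sketch of reading such a flag follows, assuming the Hyperspace suite reuses the variable name from the commands; `GoldenFileMode` and `isEnabled` are hypothetical names.

```scala
object GoldenFileMode {
  // Pure helper so the decision can be exercised without touching the real environment.
  def isEnabled(env: Map[String, String]): Boolean =
    env.get("SPARK_GENERATE_GOLDEN_FILES").contains("1")

  // True when the suite should overwrite approved plans instead of comparing against them.
  lazy val regenerate: Boolean = isEnabled(sys.env)
}
```

Only the exact value "1" enables regeneration in this sketch; an unset variable or any other value keeps the suite in comparison mode.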
Problem Statement
#282 Gold Standard
Design plan validation for TPC-DS queries, with extensibility to more data sets and to various TPC-DS configs.
Proposed solution
Attached are the class diagram and the end-to-end (E2E) workflow of the gold standard test setup for Hyperspace. The plan provides a basic setup for TPC-DS query plan validation as well as extension points for more query/data/config combinations.
File Structure:
/ApprovedSimplifiedPlans/
The approvedSimplifiedPlans directory will contain two files for each TPC-DS query: explain.txt and simplified.txt.
explain.txt: this file contains the df.explain() output of a query. It is only for display and for manual comparison by the user; the tests do not use it for comparison.
simplified.txt: this file contains a simplified plan, with references normalized and locations cleaned up. The tests compare against this plan and fail if the strings do not match.
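To make "normalizes references and cleans up locations" concrete, here is a minimal sketch of such a simplification. It is an illustration only: the regular expressions, the `<location>` placeholder, and the `PlanNormalizer` name are assumptions, not the normalization that Hyperspace or Spark's PlanStabilitySuite actually implements.

```scala
import scala.collection.mutable

object PlanNormalizer {
  // Expression IDs such as "#45" differ between runs (assumed format).
  private val exprId = "#\\d+".r
  // Absolute file locations depend on the machine (assumed to start with "file:/").
  private val location = "file:/[^\\s,\\]]+".r

  // Renumbers expression IDs in order of first appearance, so two runs that
  // differ only in ID allocation produce the same text.
  def normalizeIds(plan: String): String = {
    val seen = mutable.HashMap.empty[String, Int]
    exprId.replaceAllIn(plan, m => {
      val id = seen.getOrElseUpdate(m.matched, seen.size + 1)
      s"#$id"
    })
  }

  // Replaces machine-specific paths with a fixed placeholder.
  def normalizeLocations(plan: String): String =
    location.replaceAllIn(plan, "<location>")

  def simplify(plan: String): String = normalizeLocations(normalizeIds(plan))
}
```

Under these assumptions, `Relation[c1#45,c3#12] parquet, Location: InMemoryFileIndex[file:/C:/Users/a/index]` simplifies to `Relation[c1#1,c3#2] parquet, Location: InMemoryFileIndex[<location>]`, and the same plan produced with different IDs and a different path simplifies to the identical string, which is what makes exact string matching viable.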
Test Class Diagram and File Structure
Workflow of DataGenerator and IndexGenerator
End to End Test Workflow
Complete pdf with above diagrams in high def:
Class Diagrams and Test Workflow.pdf
Implementation
Who/When: @apoorvedave1, 3 weeks from the start date (5-6 weeks to merge), not including interruptions.
PRs
Tasks:
MockTPCDSDataGenerator Tasks:
Index Generator Tasks:
Comparator Tasks:
PlanStabilitySuite Tasks:
Implemented by subclasses. E.g. for TPCDSBasicSuite (extends PlanStabilitySuite), this would be "src/test/resources/tpcdsBasic/".
Use the configs and query id to get the Spark query from the query file at "src/test/resources/tpcdsBasic/queries/".
Run query.explain to generate the simplified plan.
Normalize and return.
For all queries to test:
    generateSimplifiedPlans and save at the test location
    create the normalized plan for the query
    if (regenerateApprovedPlans) copy the plan to the approvedPlans location
    else compare with the plan at the approvedPlans location and return the result
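The regenerate-or-compare step at the end of the list above can be sketched as follows. This is a minimal illustration under stated assumptions: `PlanCheck`, its result types, and the file handling are hypothetical, and the approved file path is assumed to have a parent directory.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

object PlanCheck {
  sealed trait Result
  case object Regenerated extends Result
  case object Matched extends Result
  final case class Mismatch(expected: String, actual: String) extends Result

  // Either overwrites the approved plan or compares against it by exact string match.
  def check(approvedFile: Path, actualPlan: String, regenerate: Boolean): Result =
    if (regenerate) {
      // Assumes approvedFile has a parent directory (e.g. <approvedPlans>/q1/).
      Files.createDirectories(approvedFile.getParent)
      Files.write(approvedFile, actualPlan.getBytes(UTF_8))
      Regenerated
    } else {
      val expected = new String(Files.readAllBytes(approvedFile), UTF_8)
      if (expected == actualPlan) Matched else Mismatch(expected, actualPlan)
    }
}
```

Returning the expected and actual plans in `Mismatch` lets the calling test print both, which helps with the manual resolution step when a plan changes.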
TpcdsBasicSuite Tasks:
Use DataGenerator and IndexGenerator to create the data and indexes. >> E.g. suites for refresh index or hybrid scan would use these differently.
Updating Approved Plans
With the addition of new rules or indexes, query plans may legitimately change. Tests for the affected queries will then fail until their approvedSimplifiedPlans are updated.
Regression: Defining Regression and Test Failure
If a test starts failing, the actual plan for that query differs from the expected plan. For now, resolving this is a manual step.
We have two options at this point:
Adding New Test Suites
Based on the current design, it's pretty easy to add new suites to Gold Standard.
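As an illustration of why adding a suite is cheap in this design, here is a minimal sketch in which a new suite only supplies its resource directory. The class names ending in `Sketch` and the member names are hypothetical, and the real suites would also extend a test framework base class.

```scala
// Hypothetical shape of the base suite: shared logic lives here,
// subclasses only say where their resources are.
abstract class PlanStabilitySuiteSketch {
  def baseDir: String

  def indexConfigsDir: String = s"$baseDir/indexconfigs"
  def flagsDir: String = s"$baseDir/flags"
  def queriesDir: String = s"$baseDir/queries"
  def approvedPlansDir: String = s"$baseDir/approvedSimplifiedPlans"
}

class TPCDSBasicSuiteSketch extends PlanStabilitySuiteSketch {
  override val baseDir: String = "src/test/resources/tpcdsBasic"
}

// A new setup is one more subclass plus its own resource directory.
class TPCDSOtherSuiteSketch extends PlanStabilitySuiteSketch {
  override val baseDir: String = "src/test/resources/tpcdsOther"
}
```

The derived directories are defined as methods rather than values so that they read `baseDir` only after the subclass has initialized it.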
Performance Implications (if applicable)
None
Open issues (if applicable)
Additional context (if applicable)
Similar to Spark's Plan Stability Suite
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/PlanStabilitySuite.scala