Feature requested
In the case of very large datasets (34 Billion records) the generated index is formed out of big files and has a performance degradation.
Given the following details...
Query
val sql = s"""
SELECT ts.timestamp
FROM ts
WHERE ts.timestamp >= to_timestamp('2020-03-17')
AND ts.timestamp < to_timestamp('2020-03-18')
LIMIT 1000
"""
Executed with:
Dataset
- schema size is about 20 top fields and about 17 of these are heavily nested
- about
34 Billion rows
- the
timestamp field is of timestamp type and is up to seconds
- the cardinality of the timestamp values is:
17 145 000 out of 34 155 510 037
- the format is Iceberg
Index
hs.createIndex(
ts,
IndexConfig(
"idx_ts3",
indexedColumns = Seq("timestamp"),
includedColumns = Seq("ns", "id")))
The index has:
434GB total index size
200 files
2.3GB average file size
Explained query
=============================================================
Plan with indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
+- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
<----+- *(1) FileScan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1) [timestamp#207] Batched: true, DataFilters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Format: Parquet, Location: InMemoryFileIndex[dbfs:/u.../spark-warehouse/indexes/idx_ts3/v__=0/part-00000-tid-451174797136..., PartitionFilters: [], PushedFilters: [IsNotNull(timestamp), GreaterThanOrEqual(timestamp,2020-03-17 00:00:00.0), LessThan(timestamp,20..., ReadSchema: struct<timestamp:timestamp>---->
=============================================================
Plan without indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
+- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
<----+- *(1) ScanV2 iceberg[timestamp#207] (Filters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Options: [...)---->
=============================================================
Indexes used:
=============================================================
idx_ts3:/.../spark-warehouse/indexes/idx_ts3/v__=0
=============================================================
Physical operator stats:
=============================================================
+--------------------------------------------------------+-------------------+------------------+----------+
| Physical Operator|Hyperspace Disabled|Hyperspace Enabled|Difference|
+--------------------------------------------------------+-------------------+------------------+----------+
| *DataSourceV2Scan| 1| 0| -1|
|*Scan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1)| 0| 1| 1|
| CollectLimit| 1| 1| 0|
| Filter| 1| 1| 0|
| Project| 1| 1| 0|
| WholeStageCodegen| 1| 1| 0|
+--------------------------------------------------------+-------------------+------------------+----------+
The cluster
I did run the experiment on Databricks cluster with the following details:
- driver:
64 cores 432GB memory
6 workers: 32 cores 256GB memory
- Spark version
2.4.5
Results
Time to get the 1000 rows:
- with Hyperspace is
17.24s
- without Hyperspace is
16.86s
Acceptance criteria
The time to get 1000 rows using Hyperspace should be at least as twice as fast.
Additional context
For some more context, this has been started on #329 PR.
Feature requested
In the case of very large datasets (
34 Billionrecords) the generated index is formed out of big files and has a performance degradation.Given the following details...
Query
Executed with:
Dataset
34 Billionrowstimestampfield is oftimestamptype and is up to seconds17 145 000out of34 155 510 037Index
The index has:
434GBtotal index size200files2.3GBaverage file sizeExplained query
The cluster
I did run the experiment on Databricks cluster with the following details:
64cores432GBmemory6workers:32cores256GBmemory2.4.5Results
Time to get the
1000rows:17.24s16.86sAcceptance criteria
The time to get
1000rows using Hyperspace should be at least as twice as fast.Additional context
For some more context, this has been started on #329 PR.