Fix reconcile schema fetch failure on Foreign Catalogs (Lakehouse Federation) #2367
moomindani wants to merge 3 commits into databrickslabs:main
Conversation
…eration)
Foreign Catalogs created via Lakehouse Federation do not have the Databricks-specific `full_data_type` column in `information_schema.columns`, causing an `UNRESOLVED_COLUMN` error for all report types. This adds a `DESCRIBE TABLE` fallback when the `full_data_type` column is not found.
Co-authored-by: Isaac
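Roughly, the fallback described here amounts to catching the failed `information_schema` lookup and retrying with `DESCRIBE TABLE`. A minimal sketch of that shape, assuming a hypothetical `_fetch_schema` helper on the data source class (only `log_and_throw_exception` and the `full_data_type` check come from the actual diff):

```python
def get_schema(self, catalog: str, schema: str, table: str):
    schema_query = _get_schema_query(catalog, schema, table)
    try:
        return self._fetch_schema(schema_query)  # hypothetical helper that runs the query
    except Exception as e:
        # Foreign Catalogs (Lakehouse Federation) lack full_data_type in
        # information_schema.columns, so the SELECT fails with UNRESOLVED_COLUMN.
        if "full_data_type" not in str(e):
            return self.log_and_throw_exception(e, "schema", schema_query)
        # Fallback to DESCRIBE TABLE for catalogs that lack the column
        describe_query = f"describe table {catalog}.{schema}.{table}"
        return self._fetch_schema(describe_query)
```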
@BesikiML and @bishwajit-db can you take a look at this when you get a chance
bishwajit-db left a comment
Overall LGTM. Please have a look at the comment https://github.com/databrickslabs/lakebridge/pull/2367/changes#r3056652489
| if "full_data_type" not in str(e): | ||
| return self.log_and_throw_exception(e, "schema", schema_query) | ||
|
|
||
| # Fallback to DESCRIBE TABLE for catalogs that lack the full_data_type column |
Can we add an optional parameter to _get_schema_query?
```python
def _get_schema_query(catalog: str, schema: str, table: str, force_describe: bool = False):
    if catalog == "hive_metastore" or force_describe:
        return f"describe table {catalog}.{schema}.{table}"
    ...
```

Then the fallback becomes:

```python
describe_query = _get_schema_query(catalog_str, schema, table, force_describe=True)
```
Thanks for the review! Updated to reuse _get_schema_query with a force_describe parameter instead of a separate function. Please take another look.
Address review feedback from @bishwajit-db — reuse _get_schema_query with a force_describe parameter instead of a separate _get_describe_table_query function. Co-authored-by: Isaac
m-abulazm left a comment
Thanks for raising this. I documented a different approach, since scaling this might require dedicated logic for Hive vs foreign catalog vs native catalog. To stay open for future changes, I would prefer a subclass instead of conditional control.
```diff
-def _get_schema_query(catalog: str, schema: str, table: str):
+def _get_schema_query(catalog: str, schema: str, table: str, force_describe: bool = False):
```
I would prefer to split this into `_get_schema_query_for_source`, which uses DESCRIBE TABLE exclusively, since we know Databricks sources can only be Hive, global views, or Foreign Catalogs, which only support DESCRIBE TABLE;
and `_get_schema_query_for_target`, which uses the SELECT query only, as that is always the needed behavior.
This requires distinguishing source vs target Databricks data sources, so I would implement a subclass instead of nested conditionals and control through exception handling.
Also to note: this rubs a bit against PR #2362. It should not block us on this PR, but keep in mind that there might be conflicts.
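A rough sketch of the proposed split, with illustrative class and method names (the exact information_schema query is also an assumption, not the repo's actual SQL):

```python
class DatabricksDataSource:
    def get_schema_query(self, catalog: str, schema: str, table: str) -> str:
        raise NotImplementedError

class DatabricksSourceDataSource(DatabricksDataSource):
    # Databricks sources can only be hive_metastore, global views, or
    # Foreign Catalogs, all of which support DESCRIBE TABLE.
    def get_schema_query(self, catalog: str, schema: str, table: str) -> str:
        return f"describe table {catalog}.{schema}.{table}"

class DatabricksTargetDataSource(DatabricksDataSource):
    # Targets are native Unity Catalog tables, where selecting
    # full_data_type from information_schema.columns always works.
    def get_schema_query(self, catalog: str, schema: str, table: str) -> str:
        return (
            f"select column_name, full_data_type "
            f"from {catalog}.information_schema.columns "
            f"where table_schema = '{schema}' and table_name = '{table}'"
        )
```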
Thanks for the detailed feedback! Refactored to split `DatabricksDataSource` into two subclasses:

- `DatabricksSourceDataSource`: always uses `DESCRIBE TABLE` (works for hive_metastore, global_temp, and Foreign Catalogs)
- `DatabricksTargetDataSource`: uses `information_schema` with `full_data_type` (for native UC tables)

The `source_adapter` and `utils` now create the appropriate subclass based on source vs target role. This eliminates the need for the exception-based fallback.
Also verified with an integration test — reconcile against a Foreign Catalog (Lakebase PostgreSQL via Lakehouse Federation) succeeded with the Foreign Catalog as the direct source.
Noted the potential conflict with #2362 — will keep an eye on it.
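The role-based wiring described above could look roughly like this (the function name and parameters are assumptions for illustration, not the actual `source_adapter` signature):

```python
def create_adapter(engine: str, spark, ws, secret_scope, role: str = "target"):
    # Pick the Databricks subclass by reconcile role; in the real code the
    # subclass presumably receives spark/ws/secret_scope, which the minimal
    # sketch above omits.
    if engine == "databricks":
        return DatabricksSourceDataSource() if role == "source" else DatabricksTargetDataSource()
    raise ValueError(f"Unsupported engine: {engine}")
```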
Address review feedback from @m-abulazm — instead of exception-based fallback, split DatabricksDataSource into:
- DatabricksSourceDataSource: uses DESCRIBE TABLE (works for hive, global_temp, and Foreign Catalogs via Lakehouse Federation)
- DatabricksTargetDataSource: uses information_schema with full_data_type (optimal for native Unity Catalog tables)

The source_adapter and utils are updated to create the appropriate subclass based on source vs target role.
Co-authored-by: Isaac
Summary
- Foreign Catalogs created via Lakehouse Federation lack the Databricks-specific `full_data_type` column in `information_schema.columns`, causing `UNRESOLVED_COLUMN` errors for all reconcile report types (`schema`, `data`, `row`, `all`)
- Split `DatabricksDataSource` into two subclasses per review feedback:
  - `DatabricksSourceDataSource`: uses `DESCRIBE TABLE` (works for hive_metastore, global_temp, and Foreign Catalogs)
  - `DatabricksTargetDataSource`: uses `information_schema` with `full_data_type` (for native Unity Catalog tables)
- `source_adapter` and `utils` now create the appropriate subclass based on source vs target role

Test plan
- `test_get_schema_target`, `test_get_schema_source`, `test_get_schema_source_foreign_catalog`, `test_get_schema_target_exception_handling`, `test_get_schema_source_exception_handling`
- `test_create_adapter_for_databricks_dialect_source`, `test_create_adapter_for_databricks_dialect_target`
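For illustration, a test like `test_get_schema_source_foreign_catalog` might reduce to asserting that the source subclass emits DESCRIBE TABLE for a federated catalog. A sketch against the minimal classes above (the catalog/schema/table names are made up; the repo's actual test fixtures will differ):

```python
def test_get_schema_source_foreign_catalog():
    # The source subclass should never touch information_schema.columns,
    # so a Foreign Catalog name must yield a DESCRIBE TABLE statement.
    source = DatabricksSourceDataSource()
    query = source.get_schema_query("federated_pg", "public", "orders")
    assert query == "describe table federated_pg.public.orders"
```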