API data ingestion pipeline and robustness improvements #10

Open
antonio-cln wants to merge 51 commits into PRAISELab-PicusLab:main from antonio-cln:main

@antonio-cln

Summary

This PR introduces new functionality and refactors part of the existing code in order to standardize the data ingestion process, fetch documents through API requests, and add guardrails to the analytical functions. The guardrails ensure user-side stability and explain why certain actions cannot be performed.


New modules

  • www/services/api_etl.py
    This is one of the core components of the API ingestion pipeline. The module has two main responsibilities:

    1. It manages the complexities of web-based ingestion: automated retries, pagination handling, and connection throttling to respect API rate limits.
      This is handled in each search_<source>_keywords() function. Since each database source exposes a different interface, a separate function had to be implemented for each one.
    2. It introduces a mapping layer that standardizes the data returned by the different APIs into the requested 24-tag dataframe.
      This is handled in each <source>_mapping_dict(). As with search_<source>_keywords(), each database returns data in a different format, so a separate function had to be implemented for each one.
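The retry, pagination, and throttling loop described in point 1 could look roughly like the sketch below. It is not the code in api_etl.py: `iter_records`, the `fetch_page(cursor)` callable, and every parameter name are hypothetical, and the real search_<source>_keywords() functions may be structured differently.

```python
import time
from typing import Callable, Iterator, Optional

# Hypothetical contract: fetch_page(cursor) returns (records, next_cursor),
# where next_cursor is None on the last page.
PageFetcher = Callable[[Optional[str]], tuple[list[dict], Optional[str]]]


def iter_records(
    fetch_page: PageFetcher,
    max_retries: int = 3,
    backoff: float = 0.5,
    min_interval: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield records page by page, retrying transient failures."""
    cursor: Optional[str] = None
    while True:
        for attempt in range(max_retries + 1):
            try:
                records, next_cursor = fetch_page(cursor)
                break
            except ConnectionError:
                if attempt == max_retries:
                    raise  # Give up after the last allowed retry
                sleep(backoff * 2 ** attempt)  # Exponential backoff
        yield from records
        if next_cursor is None:
            return  # No more pages
        cursor = next_cursor
        sleep(min_interval)  # Throttle between pages to respect rate limits
```

Passing `sleep` as a parameter keeps the loop testable without real delays; each source-specific function would only need to supply its own `fetch_page`.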
  • www/services/data_validation.py
    This is one of the core components of the API ingestion pipeline. This module validates a dataframe, verifying data integrity before it enters any of the analytical workflows. It makes sure that the extracted tags conform to the schema provided by Web of Science. In particular, it verifies that:

    • "DB", "UT", "DI", "PMID", "TI", "SO", "JI", "DT", "LA", "RP", "AB", "VL", "IS", "BP", "EP", "SR" are strings

    • "AU", "AF", "C1", "CR", "DE", "ID" are lists

    • "PY", "TC" are integers

    Furthermore, since Pandas usually stores strings with the object dtype, an explicit conversion to string has been applied to guarantee full conformity with the requested format.
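As a rough illustration of the type checks listed above, the sketch below validates plain dict records instead of a DataFrame so that it stays dependency-free. `find_schema_errors` and `EXPECTED` are hypothetical names; the actual check in data_validation.py (is_df_valid) operates on a pandas DataFrame and may differ.

```python
STRING_TAGS = ("DB", "UT", "DI", "PMID", "TI", "SO", "JI", "DT",
               "LA", "RP", "AB", "VL", "IS", "BP", "EP", "SR")
LIST_TAGS = ("AU", "AF", "C1", "CR", "DE", "ID")
INT_TAGS = ("PY", "TC")

# Expected Python type for each of the 24 tags.
EXPECTED = {
    **{tag: str for tag in STRING_TAGS},
    **{tag: list for tag in LIST_TAGS},
    **{tag: int for tag in INT_TAGS},
}


def find_schema_errors(records: list[dict]) -> list[str]:
    """Return one message per missing tag or wrongly typed value."""
    errors = []
    for row, record in enumerate(records):
        for tag, expected in EXPECTED.items():
            if tag not in record:
                errors.append(f"row {row}: missing tag {tag}")
                continue
            value = record[tag]
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                errors.append(
                    f"row {row}: {tag} should be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
    return errors
```

An empty result means every record carries all 24 tags with the expected types.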


Changes

  • Dispatcher
    A dispatcher pattern has been added to app.py.
    The user can choose which database (OpenAlex, PubMed, Scopus) to fetch documents from. Based on that choice, the corresponding pipeline built on api_etl.py and data_validation.py carries out the API ingestion.
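A dispatcher of this kind is often a dictionary from the selected source to a pipeline callable. The sketch below only shows the shape of the pattern: the three `search_*` stubs, `PIPELINES`, and `dispatch` are hypothetical placeholders, not the functions defined in app.py or api_etl.py.

```python
from typing import Callable, Optional

Pipeline = Callable[[str, Optional[str]], list[dict]]


# Placeholder pipelines; real ones would call the source API,
# map the response to the 24-tag schema, and validate the result.
def search_openalex(query: str, api_key: Optional[str] = None) -> list[dict]:
    return [{"DB": "OPENALEX", "TI": query}]


def search_pubmed(query: str, api_key: Optional[str] = None) -> list[dict]:
    return [{"DB": "PUBMED", "TI": query}]


def search_scopus(query: str, api_key: Optional[str] = None) -> list[dict]:
    if not api_key:
        raise ValueError("Scopus requires an API key")
    return [{"DB": "SCOPUS", "TI": query}]


PIPELINES: dict[str, Pipeline] = {
    "openalex": search_openalex,
    "pubmed": search_pubmed,
    "scopus": search_scopus,
}


def dispatch(source: str, query: str, api_key: Optional[str] = None) -> list[dict]:
    """Route the request to the pipeline registered for the chosen source."""
    try:
        pipeline = PIPELINES[source.lower()]
    except KeyError:
        raise ValueError(f"Unsupported source: {source!r}") from None
    return pipeline(query, api_key)
```

Adding a new source then means registering one more entry in the dictionary rather than extending an if/elif chain.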

  • Single/Multiple Uploaded File processing
    Column formatting in format_functions.py is now wrapped in a try/except block so that corrupted records cannot crash the file processing. Information about the corrupted file involved is printed to the terminal.

    Each format_<tag>_column.py function has also been made more resilient through is_valid_field() and clean_txt_string(), which keep unwanted values, NaN, and None out of the workflow. This guarantees that the produced dataframe columns conform to the required types: strings, integers, or lists.
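The bodies of is_valid_field() and clean_txt_string() are not shown in this description, so the sketch below is only a guess at what such helpers might do. The function bodies and the extra `format_list_field` helper are assumptions made for illustration.

```python
import math
from typing import Any


def is_valid_field(value: Any) -> bool:
    """Assumed rule: reject None, NaN, and blank or placeholder strings."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    if isinstance(value, str) and value.strip().lower() in {"", "nan", "none"}:
        return False
    return True


def clean_txt_string(value: Any) -> str:
    """Assumed rule: invalid values become "", whitespace is collapsed."""
    if not is_valid_field(value):
        return ""
    return " ".join(str(value).split())


def format_list_field(value: Any, sep: str = ";") -> list[str]:
    """Hypothetical formatter that always returns a list of clean strings."""
    if isinstance(value, list):
        items = value
    elif is_valid_field(value):
        items = str(value).split(sep)
    else:
        items = []
    cleaned = (clean_txt_string(item) for item in items)
    return [item for item in cleaned if item]
```

With helpers like these, a string column never receives None or NaN and a list column is always a list, even when the source record is damaged.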

  • API
    The API pipeline described above is accessible through the web interface provided by app.py, via an additional entry in the dropdown menu where the user chooses how to provide the data. This option lets the user choose between OpenAlex, PubMed and Scopus.

    An API Key field is also present so that the user can supply a personal key for the query. A key is mandatory for certain databases such as Scopus, while OpenAlex and PubMed can be queried without a key, subject to some restrictions.

    Several API requests can be executed sequentially, and all successfully completed requests are then merged together. This allows the user to fetch documents from several database sources and run the analytical functions on a broader set of data.
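One way to merge the results of several requests is sketched below. It assumes each request yields a list of dict records and that duplicates are identified by the "DI" (DOI) tag, in line with the DOI handling mentioned in the commit list; `merge_batches` is a hypothetical name and the actual merge in app.py may work differently.

```python
def merge_batches(batches: list[list[dict]]) -> list[dict]:
    """Concatenate batches, keeping the first record seen for each DOI."""
    merged: list[dict] = []
    seen_dois: set[str] = set()
    for batch in batches:
        for record in batch:
            doi = (record.get("DI") or "").strip().lower()
            if doi:
                if doi in seen_dois:
                    continue  # Same document already fetched elsewhere
                seen_dois.add(doi)
            # Records without a DOI cannot be matched, so they are kept.
            merged.append(record)
    return merged
```

DOIs are compared case-insensitively here because they are case-insensitive identifiers.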

  • Analytical Function Guardrails
    Most of the existing analytical functions work correctly with the columns provided by Web of Science. They were largely written around Web of Science data, and since other databases don't provide all the data that Web of Science does, most of the functions would crash on data from those sources.

    Analytical function crashes have been dealt with both internally and externally.

    • External guardrails prevent a function from running at all if the columns required to generate the plot are either missing (which shouldn't happen, since the dataframe has already been validated; this is just a double-check) or completely empty.
      According to the requirements, a column can be empty, with every row holding either an empty string, "", or an empty list, []. The external guardrails prevent a function from running if the data it requires does not conform, and inform the end user that the specific column required by that function is missing from the dataset.

      Two different approaches have been used, since there are two broad categories of analytical functions:

      1. Functions that require one or more specific columns to be populated with values in order to run correctly
      2. Functions that require one or more specific columns to be populated with values in order to run correctly, and that also require the user to select from a dropdown menu which columns the function should use

      For the first scenario, the following guard check has been implemented with the help of an auxiliary validation function, is_column_empty(). The following example is taken from the Most Relevant Sources section.

      try:
          if is_column_empty(raw_df, "SO"):
              ui.modal_remove()  # Kill the loading spinner immediately
              ui.notification_show(
                  "❌ Analysis Cancelled: The required 'SO' (Sources) metadata column is missing or empty in this dataset.",
                  type="error",
                  duration=10
              )
              req(False)  # Raises SilentException, skipping the lines below
          # Analytical function lines of code
      except SilentException:
          pass
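The body of is_column_empty() is not included in this description. The sketch below shows one plausible reading of "missing or empty", using a plain dict of column name to list of values as a stand-in for the DataFrame so that it runs without pandas; the logic is an assumption, not the helper shipped in this PR.

```python
from typing import Any


def _is_blank(value: Any) -> bool:
    """Treat None, NaN, "", whitespace-only strings and [] as empty."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is not equal to itself
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, list):
        return len(value) == 0
    return False


def is_column_empty(table: dict[str, list], column: str) -> bool:
    """Return True if the column is missing or every value in it is blank."""
    if column not in table:
        return True
    return all(_is_blank(value) for value in table[column])
```

A column with at least one usable value is considered populated, which matches the guard's goal of blocking only plots that have nothing to draw.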

      For the second scenario, in addition to the is_column_empty() check, the dropdown menu items are updated dynamically to prevent the user from selecting invalid columns. The following example is taken from the Three-Field Plot section.

      @reactive.effect
      def update_three_field_dropdowns():
          # Unpack data frame safely
          raw_df = df.get() if hasattr(df, 'get') else df

          # Full set of options, filtered below against the available data
          base_options = {"AU": "Authors", "CR": "References", "DE": "Keywords", "SO": "Sources", "CR_SO": "Cited Sources", "AU_UN": "Affiliations", "AU_CO": "Countries", "ID": "Keywords Plus", "TI_TM": "Titles", "AB_TM": "Abstract"}

          if raw_df is None or raw_df.empty:
              return

          # Filter down options to ONLY columns that contain actual valid data
          filtered_choices = {}
          for key, label in base_options.items():
              # Note: TI_TM and AB_TM rely on "TI" and "AB" column data respectively
              test_col = "TI" if key == "TI_TM" else ("AB" if key == "AB_TM" else key)

              if not is_column_empty(raw_df, test_col):
                  filtered_choices[key] = label

          # Push the validated choices into the inputs once, after filtering
          if filtered_choices:
              # Pick defaults that are guaranteed to exist in the filtered subset
              keys_list = list(filtered_choices.keys())
              sel_l = keys_list[1] if len(keys_list) > 1 else keys_list[0]
              sel_m = keys_list[0]
              sel_r = keys_list[2] if len(keys_list) > 2 else keys_list[0]

              ui.update_select("left_field", choices=filtered_choices, selected=sel_l)
              ui.update_select("middle_field", choices=filtered_choices, selected=sel_m)
              ui.update_select("right_field", choices=filtered_choices, selected=sel_r)
    • Internal modifications have been made to most of the functions in order to avoid crashes caused by valid calculations that produce problematic values (mostly zeros). In this scenario, a placeholder plot is generated to inform the end user that the calculated values cannot be plotted. The following example is taken from get_authors_local_impact() and deals with g_index, h_index and m_index.

     if n == 0 or source_counts_visualization[impact_column].max() == 0:
         metric_label = author_local_impact.replace('_', ' ').title()
         fig = go.Figure()
         fig.add_annotation(
             text=f"⚠️ Cannot Generate Plot<br><br>The calculated <b>'{metric_label}'</b> for all identified sources evaluates to <b>0</b>.<br>"
             "There are no non-zero citation metrics available to plot.",
             xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
             font=dict(size=16, color="#D9534F", family="Segoe UI, Arial"), align="center"
         )
         fig.update_layout(
             xaxis={"visible": False}, yaxis={"visible": False},
             plot_bgcolor="rgba(245,245,245,0.5)", paper_bgcolor="white", height=500
         )
         fig = go.FigureWidget(fig)
         fig._config = fig._config | {'displaylogo': False}
         return fig, source_counts
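To see why these metrics can legitimately evaluate to zero, the sketch below computes them from a list of citation counts using the standard textbook definitions. It is not the repository's implementation: the function signatures are hypothetical, and the "+ 1" in the m-index denominator is one common convention taken here as an assumption.

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)


def g_index(citations: list[int]) -> int:
    """Largest g such that the top g papers total at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


def m_index(citations: list[int], first_year: int, current_year: int) -> float:
    """h-index divided by the length of the publishing career in years."""
    years_active = max(1, current_year - first_year + 1)  # Assumed convention
    return h_index(citations) / years_active
```

When every document in the dataset has zero citations, all three metrics are 0, which is exactly the case the placeholder plot above handles.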

antonio-cln and others added 30 commits May 24, 2026 22:11
NEW:
- api_etl.py: ETL pipeline to fetch and transform metadata from OpenAlex documents
MODIFIED:
- www/services/__init__.py: loc 18
- www/services/metatagextraction.py: loc 17-18, 45-46
- functions/get_database.py: loc 37-38
- app.py: loc 66, 654-655, 716-728, 739-740, 770-781
NEW
 - Data fusion between one or more input sources (single files, API queries)
NEW
 - Scopus .csv parsing
NEW
 - Scopus .bib
ROLLBACK
 - Multi-source dataset: the app logic seems too hardcoded with the Ifs and branches a different logic for each database. It would require sort of a complete code refactor to deal with it?
Some extra rollback required
Fixed some functions
NEW
 - metatagextraction.SR employed to generate SR field
…PubMed that have no cited references, but the exception is handled and produces an error in the notification.
NEW
 - Function wrapping to prevent crashes
Guarding checks
NEW
 - Defined Data Validation and Mapping Dictionary modules
viictor-it and others added 21 commits June 1, 2026 17:32
CHECK API STANDARDIZERS.
…copus in the data import section. Updated user interface to reflect new functionality.
…ex, PubMed, and Scopus. Implemented DOI handling to prevent duplicates in the dataset.
…odules

- Updated docstrings in `scopus_mapping_dict` to clarify its purpose and functionality.
- Added type hint for the `is_df_valid` function to specify it accepts a pandas DataFrame.
- Expanded docstring for `is_df_valid` to describe its validation process and return values.
- Improved overall readability and maintainability of the code.

3 participants