API data ingestion pipeline and robustness improvements by antonio-cln · Pull Request #10 · PRAISELab-PicusLab/bibliometrix-python

antonio-cln · 2026-06-02T18:51:34Z

Summary

This PR adds new functionality and refactors part of the existing code in order to standardize the data ingestion process, fetch documents through API requests, add guardrails to the analytical functions for user-side stability, and tell the user why certain actions cannot be performed.

New modules

www/services/api_etl.py
This is one of the core components of the API ingestion pipeline. The module has two main responsibilities:
1. It manages the complexities of web-based ingestion: automated retries, pagination handling, and connection throttling to respect API rate limits.
  This is handled in each search_<source>_keywords() function. Since each database source is queried in a different way, a separate function had to be implemented for each one.
2. It introduces a mapping layer that standardizes the data returned by the different APIs into the required 24-tag dataframe.
  This is handled in each <source>_mapping_dict() function. As with search_<source>_keywords(), each database returns data in a different format, so a separate function had to be implemented for each one.
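The retry, pagination and throttling behaviour described in point 1 can be sketched roughly as follows. This is only an illustration, not the code in api_etl.py: fetch_all_pages, the injected fetch_page callable, the "results" key and the page-number pagination scheme are all assumptions made for the example.

```python
import time
from typing import Callable, Iterator


def fetch_all_pages(
    fetch_page: Callable[[int], dict],          # hypothetical: performs one HTTP request
    max_retries: int = 3,
    delay_s: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield records page by page, retrying failures and pausing between calls."""
    page = 1
    while True:
        for attempt in range(1, max_retries + 1):
            try:
                payload = fetch_page(page)
                break
            except ConnectionError:
                if attempt == max_retries:
                    raise  # give up after the last attempt
                sleep(delay_s * 2 ** attempt)  # exponential backoff before retrying
        records = payload.get("results", [])  # assumed response shape
        if not records:
            return  # an empty page marks the end of the result set
        yield from records
        page += 1
        sleep(delay_s)  # throttle to stay under the API rate limit
```

Passing the request function and the sleep function as parameters keeps the loop independent of any specific database client and makes it easy to exercise without network access.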
www/services/data_validation.py
This is one of the core components of the API ingestion pipeline. This module validates a dataframe, verifying data integrity before it enters any of the analytical workflows. It makes sure that the extracted tags conform to the Web of Science schema; in particular, it verifies that:
"DB", "UT", "DI", "PMID", "TI", "SO", "JI", "DT", "LA", "RP", "AB", "VL", "IS", "BP", "EP", "SR" are strings
"AU", "AF", "C1", "CR", "DE", "ID" are lists
"PY", "TC" are integers
Furthermore, since pandas usually stores strings with the object dtype, an explicit conversion to string has been applied to guarantee full conformity with the required format.
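As a rough illustration of the type rules listed above, the check can be expressed over plain row dictionaries (for example the output of df.to_dict("records")). The function name find_type_errors and the error tuple format are hypothetical; the actual module works on the dataframe itself and its implementation is not reproduced here.

```python
STRING_TAGS = ["DB", "UT", "DI", "PMID", "TI", "SO", "JI", "DT",
               "LA", "RP", "AB", "VL", "IS", "BP", "EP", "SR"]
LIST_TAGS = ["AU", "AF", "C1", "CR", "DE", "ID"]
INT_TAGS = ["PY", "TC"]

# One expected Python type per tag: 16 + 6 + 2 = 24 tags in total.
EXPECTED_TYPES = {
    **{tag: str for tag in STRING_TAGS},
    **{tag: list for tag in LIST_TAGS},
    **{tag: int for tag in INT_TAGS},
}


def find_type_errors(records):
    """Return (row index, tag, problem) tuples for every non-conforming cell."""
    errors = []
    for row_idx, record in enumerate(records):
        for tag, expected in EXPECTED_TYPES.items():
            if tag not in record:
                errors.append((row_idx, tag, "missing"))
                continue
            value = record[tag]
            # bool is a subclass of int in Python, so reject it explicitly.
            if not isinstance(value, expected) or isinstance(value, bool):
                errors.append((row_idx, tag, type(value).__name__))
    return errors
```

An empty result means every row carries all 24 tags with the expected types; empty strings and empty lists pass, since they still have the right type.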

Changes

Dispatcher
A dispatcher pattern has been added in app.py.
The user can choose which database (OpenAlex, PubMed, Scopus) to fetch documents from; based on that choice, the corresponding pipeline is run through the API ingestion steps introduced by api_etl.py and data_validation.py.
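A dispatcher of this kind is typically a lookup table from the selected source to a handler. The sketch below only illustrates the pattern: the handler functions, their signatures and their return values are placeholders, not the functions defined in app.py or api_etl.py.

```python
from typing import Callable, Optional


# Placeholder handlers; the real ones would call the search and mapping functions.
def fetch_openalex(query: str, api_key: Optional[str] = None) -> list:
    return [{"DB": "OPENALEX", "TI": query}]


def fetch_pubmed(query: str, api_key: Optional[str] = None) -> list:
    return [{"DB": "PUBMED", "TI": query}]


def fetch_scopus(query: str, api_key: Optional[str] = None) -> list:
    if not api_key:
        # A key is mandatory for Scopus, as stated in the PR description.
        raise ValueError("Scopus requires an API key")
    return [{"DB": "SCOPUS", "TI": query}]


DISPATCH: dict[str, Callable[[str, Optional[str]], list]] = {
    "OpenAlex": fetch_openalex,
    "PubMed": fetch_pubmed,
    "Scopus": fetch_scopus,
}


def dispatch_fetch(source: str, query: str, api_key: Optional[str] = None) -> list:
    """Route the request to the handler registered for the chosen database."""
    try:
        handler = DISPATCH[source]
    except KeyError:
        raise ValueError(f"Unsupported source: {source!r}") from None
    return handler(query, api_key)
```

Adding a new database then only requires registering one more entry in the table, without touching the calling code.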
Single/Multiple Uploaded File processing
A more resilient approach, based on a try/except block, has been implemented in format_functions.py when formatting the columns, so that corrupted records cannot crash the file processing. Information about the corrupted file involved is printed to the terminal.

For each format_<tag>_column.py function, a more resilient approach has been implemented through is_valid_field() and clean_txt_string() to keep unwanted values, NaN or None from entering the workflow. This guarantees that the produced dataframe columns conform to the required types: strings, integers or lists.
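The combination of per-record try/except and value cleaning might look like the sketch below. The helpers is_usable_value() and clean_text() are hypothetical stand-ins written for this example; they are not the PR's is_valid_field() and clean_txt_string(), whose bodies are not shown here.

```python
import math


def is_usable_value(value) -> bool:
    """Reject None, NaN and placeholder strings (illustrative rules only)."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    if isinstance(value, str) and value.strip().lower() in {"", "nan", "none"}:
        return False
    return True


def clean_text(value) -> str:
    """Return a whitespace-normalized string, or "" for unusable values."""
    if not is_usable_value(value):
        return ""
    return " ".join(str(value).split())


def format_records(raw_records):
    """Format records one by one; a corrupted record is reported and skipped."""
    formatted, skipped = [], []
    for idx, raw in enumerate(raw_records):
        try:
            formatted.append({
                "TI": clean_text(raw.get("TI")),
                "PY": int(str(raw["PY"]).strip()),  # raises on non-numeric years
            })
        except (KeyError, ValueError, TypeError) as exc:
            skipped.append(idx)
            print(f"Skipping corrupted record {idx}: {exc!r}")
    return formatted, skipped
```

Because the try/except wraps a single record rather than the whole loop, one bad row costs only that row instead of the entire upload.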
API
The API pipeline described above is accessible through the web interface provided by app.py, via an additional entry in the dropdown menu where the user chooses how to provide the data. This option lets the user choose between OpenAlex, PubMed and Scopus.

Furthermore, an API Key field lets the user provide their own key for a query. A key is mandatory for some databases, such as Scopus, while OpenAlex and PubMed can be queried without one, subject to some restrictions.

Several API requests can be executed sequentially, and all successfully completed requests are then merged together. This allows the user to fetch documents from several database sources and run the analytical functions on a broader set of data.
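Merging several successful requests raises the question of duplicates, which the commit history says is handled through the DOI. A minimal sketch of that idea, assuming each record is a dict whose "DI" tag holds the DOI and that records without a DOI are simply kept (the function name merge_batches is hypothetical):

```python
def merge_batches(batches):
    """Concatenate record batches, dropping records whose DOI was already seen."""
    merged = []
    seen_dois = set()
    for batch in batches:
        for record in batch:
            # Normalize the DOI so that case and padding differences do not matter.
            doi = (record.get("DI") or "").strip().lower()
            if doi:
                if doi in seen_dois:
                    continue  # same document already fetched from another source
                seen_dois.add(doi)
            merged.append(record)  # records without a DOI cannot be compared
    return merged
```

With this policy the first source to return a document wins, so the order in which requests are executed determines which copy of a duplicated record is kept.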

Analytical Function Guardrails
Most of the existing analytical functions work perfectly with the columns provided by Web of Science. They were mostly written around the data that Web of Science provides, and since other databases do not provide all of it, most of the functions would crash on data from those sources.

Analytical function crashes have been dealt with both internally and externally.

External guardrails prevent a function from running at all if the columns required to generate the plot are either missing (which should not happen, since the dataframe has already been validated; this is just a double-check) or completely empty.
According to the requirements, a column can be empty, with every row holding either an empty string, "", or an empty list, []. The external guardrails stop a function from running if the data it needs does not conform, and inform the end user that the specific column required by that function is missing from the dataset.
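The notion of an "empty" column used by these guardrails can be illustrated with a small check over a column-oriented table (for example the output of df.to_dict("list")). The function column_is_empty() below is a hypothetical stand-in written for this example, not the PR's is_column_empty().

```python
def column_is_empty(table: dict, column: str) -> bool:
    """True if the column is missing or holds only None, NaN, "" or [] values."""
    if column not in table:
        return True
    for value in table[column]:
        if value is None:
            continue
        if isinstance(value, float) and value != value:  # NaN is not equal to itself
            continue
        if isinstance(value, str) and not value.strip():
            continue
        if isinstance(value, (list, tuple)) and len(value) == 0:
            continue
        return False  # found at least one real value
    return True
```

A single non-empty cell is enough for the column to count as usable; only a column that is absent or blank in every row blocks the analysis.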

Two different approaches have been used, since there are two broad categories of analytical functions:

Functions that require one or more specific columns to be populated with values to run correctly
Functions that require one or more specific columns to be populated with values to run correctly, and that also require the user to select from a dropdown menu which columns the function should use

For the first scenario, the following guard check has been implemented using an auxiliary validation function, is_column_empty(). The example is taken from the Most Relevant Sources section.

try:
    if is_column_empty(raw_df, "SO"):
        ui.modal_remove()  # Kill the loading spinner immediately
        ui.notification_show(
            "❌ Analysis Cancelled: The required 'SO' (Sources) metadata column is missing or empty in this dataset.",
            type="error",
            duration=10
        )
        req(False)  # Stop here by raising a silent exception
    # Analytical function lines of code
except SilentException:
    pass

For the second scenario, on top of the is_column_empty() check, the dropdown menu items are updated dynamically to prevent the user from selecting invalid columns. The following example is taken from the Three-Field Plot section.

@reactive.effect
def update_three_field_dropdowns():
    # Unpack the data frame safely
    raw_df = df.get() if hasattr(df, 'get') else df

    # Full set of options, before filtering
    base_options = {"AU": "Authors", "CR": "References", "DE": "Keywords", "SO": "Sources", "CR_SO": "Cited Sources", "AU_UN": "Affiliations", "AU_CO": "Countries", "ID": "Keywords Plus", "TI_TM": "Titles", "AB_TM": "Abstract"}

    if raw_df is None or raw_df.empty:
        return

    # Keep ONLY the options whose underlying column contains actual data
    filtered_choices = {}
    for key, label in base_options.items():
        # Note: TI_TM and AB_TM rely on "TI" and "AB" column data respectively
        test_col = "TI" if key == "TI_TM" else ("AB" if key == "AB_TM" else key)

        if not is_column_empty(raw_df, test_col):
            filtered_choices[key] = label

    # Push the safe, validated choices into the select inputs (once, after filtering)
    if filtered_choices:
        # Pick defaults that are guaranteed to exist within the filtered subset
        keys_list = list(filtered_choices.keys())
        sel_l = keys_list[1] if len(keys_list) > 1 else keys_list[0]
        sel_m = keys_list[0]
        sel_r = keys_list[2] if len(keys_list) > 2 else keys_list[0]

        ui.update_select("left_field", choices=filtered_choices, selected=sel_l)
        ui.update_select("middle_field", choices=filtered_choices, selected=sel_m)
        ui.update_select("right_field", choices=filtered_choices, selected=sel_r)

Internal modifications have been made to most of the functions to avoid crashes when valid calculations produce problematic values (mostly zeros). In this case, a placeholder plot is generated to inform the end user that the values calculated for the plot are not valid. The following example is taken from get_authors_local_impact() and deals with g_index, h_index and m_index.

 if n == 0 or source_counts_visualization[impact_column].max() == 0:
     metric_label = author_local_impact.replace('_', ' ').title()
     fig = go.Figure()
     fig.add_annotation(
         text=f"⚠️ Cannot Generate Plot<br><br>The calculated <b>'{metric_label}'</b> for all identified sources evaluates to <b>0</b>.<br>"
         "There are no non-zero citation metrics available to plot.",
         xref="paper", yref="paper", x=0.5, y=0.5, showarrow=False,
         font=dict(size=16, color="#D9534F", family="Segoe UI, Arial"), align="center"
     )
     fig.update_layout(
         xaxis={"visible": False}, yaxis={"visible": False},
         plot_bgcolor="rgba(245,245,245,0.5)", paper_bgcolor="white", height=500
     )
     fig = go.FigureWidget(fig)
     fig._config = fig._config | {'displaylogo': False}
     return fig, source_counts
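To see how a perfectly valid calculation can end up in the all-zero case guarded above, consider the standard h-index definition: the largest h such that h documents have at least h citations each. The sketch below is a generic implementation of that definition, not the code used by get_authors_local_impact(); the helper has_plottable_values() is hypothetical and only mirrors the guard condition.

```python
def h_index(citations):
    """Largest h such that h documents have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def has_plottable_values(values):
    """Mirror of the guard: there must be at least one value and a non-zero maximum."""
    return len(values) > 0 and max(values) > 0


# A source whose documents were never cited gets an h-index of 0.
per_source = [h_index([0, 0, 0]), h_index([]), h_index([0])]
```

If every source in the dataset has uncited documents, each computed index is 0, the maximum is 0, and there is nothing meaningful to draw, which is the situation the placeholder figure reports.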

NEW:
- api_etl.py: ETL pipeline to fetch and transform metadata from OpenAlex documents
MODIFIED:
- www/services/__init__.py: loc 18
- www/services/metatagextraction.py: loc 17-18, 45-46
- functions/get_database.py: loc 37-38
- app.py: loc 66, 654-655, 716-728, 739-740, 770-781

…odules
- Updated docstrings in `scopus_mapping_dict` to clarify its purpose and functionality.
- Added type hint for the `is_df_valid` function to specify it accepts a pandas DataFrame.
- Expanded docstring for `is_df_valid` to describe its validation process and return values.
- Improved overall readability and maintainability of the code.

antonio-cln and others added 30 commits May 24, 2026 22:11

2026-05-25

3ea0aa7

NEW - Data fusion between one or more input sources (single files, API queries)

2026-05-26

015378f

NEW - Scopus .csv parsing

2026-05-26

f609c4c

NEW - Scopus .bib

2026-05-26

02570f1

2026-05-26

0247699

2026-05-26

6d1d926

2026-05-27

ff13901

ROLLBACK - Multi-source dataset: the app logic seems too hardcoded with the Ifs and branches a different logic for each database. It would require sort of a complete code refactor to deal with it?

2026-05-27

4d991fa

Some extra rollback required

2026-05-28

34dec8c

Fixed some functions

2026-05-28

e528b48

2026-05-29

c59987c

NEW - metatagextraction.SR employed to generate SR field

.

23482ab

.

d895091

The get_referencesspectroscopy function no longer crashes with data from …

acf3a48

…PubMed that have no cited references, but the exception is handled and gives an error in the notification.

.

eb97aec

bug fix in Bradford's law

86054ce

2026-05-31

4b9b023

NEW - Function wrapping to prevent crashes

Merge branch 'main' of https://github.com/antonio-cln/bibliometrix-py…

447e0bc

…thon

Update app.py

8ca39f9

2026-05-31

3da370e

Guarding checks

Update app.py

62f360b

Update app.py

ca32d40

2026-06-01

41dc13b

NEW - Defined Data Validation and Mapping Dictionary modules

.

7e4a111

Revert "."

4b9d09f

.

4b5a3b2

.

e56084a

.

64b808b

.

b1e343c

viictor-it and others added 21 commits June 1, 2026 17:32

.

abe5dcc

.

8ce0918

.

f9752d0

modified error message for Most Global Cited Documents function in ap…

c8ca5bf

…p.py

2026-06-01

275cf93

Merge branch 'main' of https://github.com/antonio-cln/bibliometrix-py…

62d692d

…thon

Enhance error handling for API data fetching and keyword searches

9047e4c

MERGE.

6eaa9e7

CHECK API STANDARDIZERS.

Merge branch 'main' of https://github.com/antonio-cln/bibliometrix-py…

8984435

…thon

Fixed errors in get_co_occurence_network and modified app.py to handl…

2c64302

…e errors in co-citation field selection.

Imported nltk library and added stopwords download for word frequency…

a6b6663

… analysis

.

a364de7

Merge branch 'main' of https://github.com/antonio-cln/bibliometrix-py…

bffdf59

…thon

PPPPPPPPPPPPPPPP

33d03d8

Final integration of API querying support for OpenAlex, PubMed, and S…

ce80122

…copus in the data import section. Updated user interface to reflect new functionality.

Enhance API integration by adding a unified fetch function for OpenAl…

6fc4f07

…ex, PubMed, and Scopus. Implemented DOI handling to prevent duplicates in the dataset.

Added docstrings to api_etl.py

2a47927

Added docstrings to data_validation.py

3a6649b

fixed

d2578c4

2026-06-02

dd186cf

antonio-cln wants to merge 51 commits into PRAISELab-PicusLab:main from antonio-cln:main
