populse_mia.data_manager.data_history_inspect
This module provides utilities for tracking, retrieving, and reconstructing the processing history of data files within the Mia framework. It enables users to trace the lineage of data products, including the bricks (processing steps) and intermediate files involved in their creation.
The module supports:
History Reconstruction: Build pipelines or graphs representing the processing history of a file, including all upstream dependencies.
Brick-Process Conversion: Convert brick database entries into lightweight “fake” processes for pipeline visualization.
Ancestor Tracking: Identify direct and indirect ancestors of a file, including handling of temporary values and ambiguous execution times.
Data Entry Validation: Check if filenames correspond to valid database entries or temporary placeholders.
Functions
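brick_to_process(brick, project) | Convert a brick database entry into a 'fake process'. |
data_history_pipeline(filename, project) | Retrieves the complete processing history of a file in the database, formatted as a "fake pipeline". |
data_in_value(value, filename, project) | Determine if the specified filename is present within the given value. |
find_procs_with_output(procs, filename, project) | Identify processes in the given list that have the specified filename as part of their outputs. |
get_data_history(filename, project) | Retrieves the processing history for a given data file, based on get_data_history_processes(). |
get_data_history_bricks(filename, project) | Retrieves the complete "useful" history of a file in the database as a set of processing bricks. |
get_data_history_processes(filename, project) | Retrieves the complete "useful" processing history of a file in the database. |
get_direct_proc_ancestors(filename, project, procs, before_exec_time=None, only_latest=True, org_proc=None) | Retrieve processing bricks referenced in the direct filename history. |
get_filenames_in_value(value, project, allow_temp=True) | Extract filenames from a nested structure of lists, tuples, and dictionaries. |
get_history_brick_process(brick_id, project, before_exec_time=None) | Retrieve a brick from the database using its UUID and return it as a ProtoProcess instance. |
get_proc_ancestors_via_tmp(proc, project, procs) | Retrieve upstream processes connected via a temporary value ("<temp>"). |
is_data_entry(filename, project, allow_temp=True) | Determine if the given filename is a valid database entry within the specified project. |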
Classes
ProtoProcess(brick=None) | A lightweight convenience class that stores a brick database entry along with additional usage information. |
- class populse_mia.data_manager.data_history_inspect.ProtoProcess(brick=None)
Bases: object
A lightweight convenience class that stores a brick database entry along with additional usage information.
This class encapsulates a brick database entry and tracks whether it has been used, providing a simple interface for managing brick-related data.
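As a rough illustration of the idea, the sketch below pairs a brick entry with a usage flag. It is not the library's class definition: the name ProtoProcessSketch, the attribute name brick, and the sample brick keys are assumptions made for this example; only the used flag is mentioned elsewhere on this page.

```python
class ProtoProcessSketch:
    """Illustrative stand-in: holds a brick entry and a usage flag."""

    def __init__(self, brick=None):
        self.brick = brick  # the brick database entry (assumed attribute name)
        self.used = False   # set to True once the brick is found to be relevant


# Hypothetical brick entry; real entries come from the project database.
proto = ProtoProcessSketch({"id": "1234", "name": "smooth"})
proto.used = True
```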
- populse_mia.data_manager.data_history_inspect.brick_to_process(brick, project)
Convert a brick database entry into a ‘fake process’.
This function transforms a brick database entry into a Process instance that represents its parameters and values. The process gets a name, uuid, and exec_time from the brick. This “fake process” cannot perform actual processing but serves as a representation of the brick’s traits and values.
- Parameters:
brick – (dict or str) The brick database entry to convert. If a string is provided, it is treated as the brick’s unique ID, and the corresponding brick document is retrieved from the project’s database.
project – (object) The project object providing access to the database and its documents.
- Returns:
(Process or None) A Process instance representing the brick’s parameters and values. Returns None if the brick is not found.
- populse_mia.data_manager.data_history_inspect.data_history_pipeline(filename, project)
Retrieves the complete processing history of a file in the database, formatted as a “fake pipeline”.
The generated pipeline consists of unspecialized (fake) processes, each representing a processing step with all parameters of type Any. The pipeline includes connections and traces all upstream ancestors of the file, capturing the entire processing path leading to the latest version of the file.
If the file was modified multiple times, the pipeline reflects only the relevant processing steps that contributed to the final output. Orphaned processing steps from overwritten versions are omitted.
- Parameters:
filename – (str) The name of the file whose processing history is being retrieved.
project – (Project) The project object containing the database and relevant details.
- Returns:
(Pipeline | None) A Pipeline object representing the processing history, or None if no relevant history is found.
- populse_mia.data_manager.data_history_inspect.data_in_value(value, filename, project)
Determine if the specified filename is present within the given value.
This function recursively searches through the value, which can be a string, list, tuple, or dictionary, to check if it contains the specified filename. The filename can be a special placeholder “<temp>” or a “short” filename, which is a relative path within the project’s database data directory.
- Parameters:
value –
(str, list, tuple, or dict) The data structure to search. It can be:
A string representing a file path.
A list or tuple containing multiple file paths.
A dictionary where file paths are stored as values.
filename –
(str) The filename to search for. It can be:
The special placeholder “<temp>” indicating a temporary value.
A relative file path to the project database data directory.
project – (object) The project object containing the project’s folder path as an attribute (project.folder).
- Returns:
(bool) True if the filename is found in the value, False otherwise.
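The recursive search described above can be pictured with the standalone sketch below. It is an approximation rather than the module's code: the helper name contains_filename is hypothetical, it takes a plain folder path instead of a project object, and it assumes that a short filename is resolved against that single base folder.

```python
import os


def contains_filename(value, filename, base_dir):
    """Return True if `filename` occurs anywhere inside `value`."""
    if isinstance(value, str):
        if filename == "<temp>":
            # The placeholder is matched literally.
            return value == "<temp>"
        # Assumption: a short filename is resolved against `base_dir`.
        full_path = os.path.normpath(os.path.join(base_dir, filename))
        return os.path.normpath(value) == full_path
    if isinstance(value, (list, tuple)):
        return any(contains_filename(v, filename, base_dir) for v in value)
    if isinstance(value, dict):
        return any(contains_filename(v, filename, base_dir) for v in value.values())
    return False  # numbers, None, and other non-container values


# Hypothetical project folder and output values.
base = os.path.join(os.sep, "tmp", "project")
short_name = os.path.join("data", "derived", "smoothed.nii")
outputs = {
    "out_file": os.path.join(base, short_name),
    "extras": ["<temp>", 42],
}

found_file = contains_filename(outputs, short_name, base)
found_temp = contains_filename(outputs, "<temp>", base)
```

Strings are compared after path normalization, so equivalent spellings of the same path match; whether the real function normalizes paths in the same way is not stated here.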
- populse_mia.data_manager.data_history_inspect.find_procs_with_output(procs, filename, project)
Identify processes in the given list that have the specified filename as part of their outputs.
This function searches through a list of processes to determine which ones have the specified filename in their output values. The results are organized by execution time.
- Parameters:
procs – (iterable of ProtoProcess) A collection of ProtoProcess instances to search through.
filename – (str) The filename to search for within the processes’ outputs.
project – (Project) An instance of the project, used to access the database folder.
- Returns:
(dict) A dictionary where keys are execution times and values are lists of tuples. Each tuple contains a process and the parameter name associated with the filename. Format: {exec_time: [(process, param_name), …]}.
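To make the return format concrete, the sketch below builds a dictionary of the same shape from a few stand-in objects. FakeProto, its attributes, and group_producers are hypothetical names for this example, and output values are compared by simple equality rather than by the nested search the real function presumably performs.

```python
from collections import defaultdict
from datetime import datetime


class FakeProto:
    """Hypothetical stand-in for a ProtoProcess-like object."""

    def __init__(self, name, exec_time, outputs):
        self.name = name
        self.exec_time = exec_time
        self.outputs = outputs  # {param_name: value}


def group_producers(procs, filename):
    """Group (process, param_name) pairs by execution time."""
    by_time = defaultdict(list)
    for proc in procs:
        for param, value in proc.outputs.items():
            if value == filename:
                by_time[proc.exec_time].append((proc, param))
    return dict(by_time)


t1 = datetime(2024, 1, 1, 10, 0)
t2 = datetime(2024, 1, 2, 10, 0)
first = FakeProto("smooth_v1", t1, {"out_file": "data/s.nii"})
second = FakeProto("smooth_v2", t2, {"out_file": "data/s.nii", "log": "data/log.txt"})
other = FakeProto("normalize", t2, {"out_file": "data/n.nii"})

result = group_producers([first, second, other], "data/s.nii")
```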
- populse_mia.data_manager.data_history_inspect.get_data_history(filename, project)
Retrieves the processing history for a given data file, based on get_data_history_processes().
The returned dictionary contains:
“parent_files”: A set of filenames representing data (direct or indirect) used to produce the given file.
“processes”: A set of UUIDs of processing bricks that contributed to the file’s creation.
- Parameters:
filename – (str) The name of the file whose processing history is being retrieved.
project – (Project) The project object containing the database and relevant details.
- Returns:
(dict) A dictionary with the following keys:
“processes”: (set) A set of UUIDs representing the processing bricks involved.
“parent_files”: (set) A set of filenames that were used to produce the data.
- populse_mia.data_manager.data_history_inspect.get_data_history_bricks(filename, project)
Retrieves the complete “useful” history of a file in the database as a set of processing bricks.
This function is a filtered version of get_data_history_processes(), similar to data_history_pipeline(), but instead of constructing a pipeline, it returns only the set of brick elements that were actually used in the relevant processing history of the file.
- Parameters:
filename – (str) The name of the file whose processing history is being retrieved.
project – (Project) The project object containing the database and relevant details.
- Returns:
(set) A set of brick elements representing the “useful” processing steps that contributed to the final version of the given data file.
- populse_mia.data_manager.data_history_inspect.get_data_history_processes(filename, project)
Retrieves the complete “useful” processing history of a file in the database.
This function returns:
A dictionary of processes (ProtoProcess instances), where keys are process UUIDs.
A set of links between these processes, forming the processing graph.
Unlike data_history_pipeline(), which converts the history into a Pipeline, this function provides a lower-level representation. Some processes retrieved during history traversal may not be used; they are distinguished by their used attribute (set to True for relevant processes).
Processing bricks that are not used (possibly from earlier runs where the data file was overwritten) may either be absent from the history or have used = False.
- Parameters:
filename – (str) The name of the file whose processing history is being retrieved.
project – (Project) The project object containing the database and relevant details.
- Returns:
(tuple)
procs (dict): {uuid: ProtoProcess instance} mapping.
links (set): {(src_protoprocess, src_plug_name, dst_protoprocess, dst_plug_name)}. External connections are represented with None as src_protoprocess or dst_protoprocess.
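The sketch below shows one way the (procs, links) pair might be consumed once returned. The data is hand-built for illustration: FakeProto is a hypothetical stand-in for ProtoProcess, and the UUIDs and plug names are invented.

```python
class FakeProto:
    """Hypothetical stand-in exposing only a name and the `used` flag."""

    def __init__(self, name, used):
        self.name = name
        self.used = used


smooth = FakeProto("smooth", used=True)
normalize = FakeProto("normalize", used=True)
stale = FakeProto("old_smooth", used=False)  # e.g. from an overwritten run

procs = {"uuid-1": smooth, "uuid-2": normalize, "uuid-3": stale}
links = {
    (None, "in_file", smooth, "in_file"),        # external input
    (smooth, "out_file", normalize, "in_file"),  # internal connection
    (normalize, "out_file", None, "out_file"),   # external output
}

# Keep only the processes flagged as relevant.
used_uuids = {uuid for uuid, proc in procs.items() if proc.used}

# External inputs are links whose source process is None.
external_inputs = sorted(dst_plug for src, _, _, dst_plug in links if src is None)

# Internal links connect two actual processes.
internal_links = [
    (src.name, src_plug, dst.name, dst_plug)
    for src, src_plug, dst, dst_plug in links
    if src is not None and dst is not None
]
```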
- populse_mia.data_manager.data_history_inspect.get_direct_proc_ancestors(filename, project, procs, before_exec_time=None, only_latest=True, org_proc=None)
Retrieve processing bricks referenced in the direct filename history.
This function identifies the most recent processing steps that generated the given filename. If multiple processes share the same execution time, they are all retained to account for ambiguity. The function also allows filtering by execution time and excluding a specified originating process.
- Parameters:
filename – (str) The data filename to inspect.
project – (Project) The project instance used to access the database.
procs – (dict) Dictionary mapping process UUIDs to ProtoProcess instances. This dictionary is updated with newly retrieved processes.
before_exec_time – (datetime) If specified, only processing bricks executed before this time are considered.
only_latest – (bool) If True (default), keeps only the latest processes found in the history. If before_exec_time is specified, retains only the latest before that time.
org_proc – (ProtoProcess) The originating process, which is excluded from execution time filtering but included in the ancestor list.
- Returns:
(dict) A dictionary mapping brick UUIDs to ProtoProcess instances.
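The "latest, with ties kept" selection rule can be sketched as follows. This is a simplified model, not the function's implementation: select_latest is a hypothetical helper, processes are represented by plain strings, the org_proc exception is left out, and treating the time bound as inclusive is an assumption.

```python
from datetime import datetime


def select_latest(by_time, before_exec_time=None):
    """Return every entry sharing the latest admissible execution time."""
    candidates = [
        t for t in by_time
        # Assumption: the bound is inclusive.
        if before_exec_time is None or t <= before_exec_time
    ]
    if not candidates:
        return []
    return list(by_time[max(candidates)])


by_time = {
    datetime(2024, 1, 1, 9, 0): ["proc_a"],
    datetime(2024, 1, 2, 9, 0): ["proc_b", "proc_c"],  # same time: keep both
    datetime(2024, 1, 3, 9, 0): ["proc_d"],
}

latest = select_latest(by_time)
latest_before = select_latest(by_time, datetime(2024, 1, 2, 12, 0))
```

Keeping every process that shares the latest timestamp avoids guessing when the execution times alone cannot decide which step produced the file.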
- populse_mia.data_manager.data_history_inspect.get_filenames_in_value(value, project, allow_temp=True)
Extract filenames from a nested structure of lists, tuples, and dictionaries.
This function parses the given value, which can be a nested combination of lists, tuples, and dictionaries, to retrieve all filenames referenced within it. Only filenames that are valid database entries or the special “<temp>” value (if allow_temp is True) are retained. Other filenames are considered read-only static data and are not included in the results.
- Parameters:
value – (object) The value to parse. It can be a single string, a list, tuple, dictionary, or a nested combination of these types.
project – (object) The project object providing access to the database.
allow_temp – (bool, optional) If True, includes the temporary filename “<temp>” in the results. Defaults to True.
- Returns:
(set) A set of filenames that are valid database entries or the temporary filename “<temp>” (if allowed).
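A minimal sketch of this kind of extraction is shown below. It is not the module's code: collect_filenames is a hypothetical helper, and a plain set of known entries replaces the database lookup done through the project object.

```python
def collect_filenames(value, known_entries, allow_temp=True):
    """Collect database filenames (and optionally "<temp>") from `value`."""
    found = set()
    if isinstance(value, str):
        if value in known_entries or (allow_temp and value == "<temp>"):
            found.add(value)
    elif isinstance(value, (list, tuple)):
        for item in value:
            found |= collect_filenames(item, known_entries, allow_temp)
    elif isinstance(value, dict):
        for item in value.values():
            found |= collect_filenames(item, known_entries, allow_temp)
    return found  # other types (numbers, None, ...) contribute nothing


# Hypothetical database entries and parameter values.
known = {"data/a.nii", "data/b.nii"}
value = {
    "inputs": ["data/a.nii", "<temp>"],
    "nested": ("/static/template.nii", {"mask": "data/b.nii"}),
    "threshold": 3,
}

with_temp = collect_filenames(value, known)
without_temp = collect_filenames(value, known, allow_temp=False)
```

The path /static/template.nii is dropped because it is not a known entry, which mirrors the treatment of read-only static data described above.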
- populse_mia.data_manager.data_history_inspect.get_history_brick_process(brick_id, project, before_exec_time=None)
Retrieve a brick from the database using its UUID and return it as a ProtoProcess instance.
This function fetches a brick from the database using its unique identifier (UUID). It returns the brick as a ProtoProcess instance if the brick has been executed (its execution status is “Done”) and, if specified, its execution time is not later than before_exec_time. If the brick does not meet these criteria or is not found in the database, the function returns None.
- Parameters:
brick_id – (str) The unique identifier (UUID) of the brick to retrieve.
project – (object) The project object providing access to the database.
before_exec_time – (str) An execution time filter. If provided, bricks executed after this timestamp are discarded.
- Returns:
(ProtoProcess or None) A ProtoProcess instance representing the brick if it meets the criteria; otherwise, None.
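The acceptance criteria can be summarized by the sketch below. It only models the filtering logic: accept_brick is a hypothetical helper, the keys status and exec_time are invented for this example and are not the database field names, and the brick dictionary is returned as is instead of being wrapped in a ProtoProcess.

```python
from datetime import datetime


def accept_brick(brick, before_exec_time=None):
    """Return `brick` if it passes the filters described above, else None."""
    if brick is None:  # not found in the database
        return None
    if brick.get("status") != "Done":  # hypothetical key
        return None
    if before_exec_time is not None and brick["exec_time"] > before_exec_time:
        return None  # executed later than the requested bound
    return brick


done = {"status": "Done", "exec_time": datetime(2024, 1, 2, 9, 0)}
pending = {"status": "Not Done", "exec_time": datetime(2024, 1, 2, 9, 0)}

kept = accept_brick(done, before_exec_time=datetime(2024, 1, 3))
too_late = accept_brick(done, before_exec_time=datetime(2024, 1, 1))
not_done = accept_brick(pending)
```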
- populse_mia.data_manager.data_history_inspect.get_proc_ancestors_via_tmp(proc, project, procs)
Retrieve upstream processes connected via a temporary value (“<temp>”).
This function is intended for internal use within get_data_history_processes and data_history_pipeline. It attempts to identify upstream processes connected to the given process (proc) through a temporary filename.
The function first searches the direct history of the process’s output files. If no matching process is found, it searches the entire database of bricks, which may be slower for large databases. Matching is based on the temporary filename and processing time, which can be error-prone.
- Parameters:
proc – (ProtoProcess) The process whose ancestors need to be determined.
project – (object) The project object providing access to the session and other necessary functionalities for processing.
procs – (dict) A dictionary of processes, where keys are process IDs and values are ProtoProcess instances.
- Returns:
(tuple)
new_procs (dict): A dictionary mapping process UUIDs to ProtoProcess instances.
links (set): A set of tuples representing pipeline links in the format (src_protoprocess, src_plug_name, dst_protoprocess, dst_plug_name). Links from/to the pipeline main plugs are also included, where src_protoprocess or dst_protoprocess may be None.
Contains:
Inner functions:
_get_tmp_param: Identifies a process parameter associated with a temporary value.
- populse_mia.data_manager.data_history_inspect.is_data_entry(filename, project, allow_temp=True)
Determine if the given filename is a valid database entry within the specified project.
This function checks whether the input filename is either a recognized temporary value (“<temp>”) or a file located within the project’s database data directory. If the filename is valid, it returns either the relative path to the database data directory or “<temp>” (if allowed). If the file is not found in the database, the function returns None.
- Parameters:
filename – (str) The full path or special value “<temp>” to be checked.
project – (object) The project object providing access to the database and folder structure.
allow_temp – (bool, optional) If True, allows the special value “<temp>” to be considered a valid entry. Defaults to True.
- Returns:
(str or None)
The relative path to the project’s database data directory if the filename is a valid database entry.
“<temp>” if the input is “<temp>” and allow_temp is True.
None if the filename is not a valid database entry.
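The three possible outcomes can be illustrated with the standalone sketch below. It is an approximation under stated assumptions: to_data_entry is a hypothetical helper, a folder path and a set of known entries replace the project object, and relative paths are assumed to be computed from that folder and stored with forward slashes.

```python
import os


def to_data_entry(filename, project_folder, known_entries, allow_temp=True):
    """Return the relative entry name, "<temp>", or None."""
    if filename == "<temp>":
        return filename if allow_temp else None
    try:
        rel = os.path.relpath(filename, project_folder)
    except ValueError:  # e.g. paths on different drives on Windows
        return None
    rel = rel.replace(os.sep, "/")  # assumed storage convention
    return rel if rel in known_entries else None


# Hypothetical project folder and database content.
folder = os.path.join(os.sep, "tmp", "project")
known = {"data/a.nii"}

inside = to_data_entry(os.path.join(folder, "data", "a.nii"), folder, known)
temp = to_data_entry("<temp>", folder, known)
temp_refused = to_data_entry("<temp>", folder, known, allow_temp=False)
outside = to_data_entry(os.path.join(os.sep, "elsewhere", "a.nii"), folder, known)
```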